How to Collect Digital Trace Data: Screenscraping

Chris Bail, Duke University
SICSS, Day 2

First, ask yourself whether you need to collect new data


A vast amount of data has already been compiled (e.g. Facebook, Twitter, The New York Times, Reuters, Google, Wikipedia)

Here is a crowd-sourced list of datasets I curate

Second, ask whether you could supplement or expand upon someone else's dataset



Types of Data Collection


  • Pre-packaged data (e.g. Google Trends)
  • Screen-scraping/Browser Automation/Crowd-Sourced Scraping
  • Application Programming Interfaces (APIs)

What is Screen Scraping?


Screen scraping refers to a type of computer program that:

  • loads/reads in a web page
  • finds some information on it
  • grabs the information
  • stores it in a dataset
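
The four steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a recommended production scraper. The names `TableParser`, `load_page`, `find_and_grab`, and `store` are hypothetical, and `SAMPLE_HTML` is made-up page content (not real data) so the sketch runs without a network connection.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def load_page(url):
    # Step 1: load/read in a web page (requires network access).
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def find_and_grab(html):
    # Steps 2 and 3: find the table rows and grab their text.
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def store(rows):
    # Step 4: store the rows as CSV text (could be written to a file instead).
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical page content standing in for a real web page.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Rank</th></tr>
  <tr><td>Exampleland</td><td>1</td></tr>
  <tr><td>Samplestan</td><td>2</td></tr>
</table>
"""

rows = find_and_grab(SAMPLE_HTML)
csv_text = store(rows)
print(csv_text)
```

To run this against a live page, replace `SAMPLE_HTML` with `load_page(url)` for a URL you are permitted to scrape.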



Once upon a time you could collect virtually any piece of information from the internet by screen scraping.



We are no longer in the “Wild, Wild West” of the internet.



Screen-scraping many sites now violates their terms of service and, in some cases, the law.



Most sites have become very difficult to scrape because they are designed to prevent screen-scraping.

Let's Try a Simple Example


Please open this link

Or you can google “Wikipedia” and “World Health Organization Ranking of Health Systems”

What a Website Looks Like to Us