How to Collect Digital Trace Data: Screenscraping

Chris Bail, Duke University
SICSS, Day 2

First, ask yourself whether you need to collect new data

 

There is a vast amount of data out there that has already been compiled (e.g., Facebook, Twitter, The New York Times, Reuters, Google, Wikipedia)

Here is a crowd-sourced list of datasets I curate

Second, ask whether you could supplement or expand upon someone else's dataset

 

METHODS OF COLLECTING DIGITAL TRACE DATA

Types of Data Collection

 

  • Pre-packaged data (e.g., Google Trends)
  • Screen-scraping, browser automation, and crowd-sourced scraping
  • Application Programming Interfaces (APIs)

What is Screen Scraping?

 

Screen scraping refers to a type of computer program that:

  • loads/reads in a web page
  • finds some information on it
  • grabs the information
  • stores it in a dataset
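
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production scraper: it uses only the standard library, and the "page" is a made-up HTML snippet (`SAMPLE_PAGE`) rather than a real site, so the country names and ranks are placeholders. The names `TableCellParser`, `scrape_table`, and `to_csv` are hypothetical helpers introduced for this example.

```python
import csv
import io
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text pieces of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Step 3: grab the information (only text inside a table cell).
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Step 2: find the table rows; return one dict per data row."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows  # first row is treated as the header
    return [dict(zip(header, row)) for row in body]


def to_csv(records):
    """Step 4: store the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Step 1: load the page. A made-up snippet stands in for a downloaded page.
SAMPLE_PAGE = """
<html><body>
<table>
  <tr><th>Country</th><th>Rank</th></tr>
  <tr><td>Country A</td><td>1</td></tr>
  <tr><td>Country B</td><td>2</td></tr>
</table>
</body></html>
"""

records = scrape_table(SAMPLE_PAGE)
csv_text = to_csv(records)
print(csv_text)
```

In a real scraper, step 1 would download the page over the network instead of reading a string, and the "find" step would usually need to target one specific table among many on the page.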

Screen-Scraping

 

Once upon a time you could collect virtually any piece of information from the internet by screen scraping.

We are no longer in the “Wild, Wild West” of the internet.

Screen-scraping many sites now violates their terms of service and may be against the law.

Most sites have become very difficult to scrape because they are now designed to detect and block automated access.

Let's Try a Simple Example

 

Please open this link

Or you can search Google for “Wikipedia” and “World Health Organization Ranking of Health Systems”

What a Website Looks Like to Us