How to Collect Digital Trace Data: Screenscraping

Chris Bail, Duke University
SICSS, Day 2

First, ask yourself whether you need to collect new data

A vast amount of data has already been compiled (e.g. from Facebook, Twitter, The New York Times, Reuters, Google, Wikipedia)

Here is a crowd-sourced list of datasets I curate

Second, ask whether you could supplement or expand upon someone else's dataset

METHODS OF COLLECTING DIGITAL TRACE DATA

Types of Data Collection

-Pre-packaged data (e.g. Google Trends)
-Screen-scraping/Browser Automation/Crowd-Sourced Scraping
-Application Programming Interfaces (APIs)

What is Screen Scraping?

Screen scraping refers to a type of computer program that:

loads/reads in a web page
finds some information on it
grabs the information
stores it in a dataset
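
The four steps above can be sketched in R with the rvest package. This is a minimal illustration, not a full scraper: the HTML is a made-up literal string standing in for a real web page so the code runs offline, and the names page, fact_node, fact_text, and scraped are assumptions chosen for this example.

```r
library(rvest)

# Step 1: load/read in a web page. A literal HTML string stands in for a
# real URL here; the page content is invented for illustration.
page <- read_html('<html><body>
  <h1 id="title">Example page</h1>
  <p class="fact">Screen scraping has four steps.</p>
</body></html>')

# Step 2: find some information on it (the paragraph with class "fact").
fact_node <- html_nodes(page, css = "p.fact")

# Step 3: grab the information (the text inside the node).
fact_text <- html_text(fact_node)

# Step 4: store it in a dataset.
scraped <- data.frame(fact = fact_text, stringsAsFactors = FALSE)
scraped
```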

Screen-Scraping

Once upon a time you could collect virtually any piece of information from the internet by screen scraping.

Screen-Scraping

We are no longer in the “Wild, Wild West” of the internet.

Screen-Scraping

Screen-scraping many sites now violates their terms of service, and in some cases it may be against the law.

Screen-Scraping

Most sites have become very difficult to scrape because they are designed to prevent screen-scraping.

Let's Try a Simple Example

Please open this link:

https://en.wikipedia.org/wiki/World_Health_Organization_ranking_of_health_systems_in_2000

Or you can google “Wikipedia” and “World Health Organization Ranking of Health Systems”

What a Website Looks Like to Us

What a Website Looks Like to a Computer

What if we Want the Info in this Table?

Set Your Working Directory

setwd("/Users/christopherandrewbail/Desktop/Dropbox/Teaching/Computational Soc Fall 2015/Course Dropbox")

rvest: A Package for Screen Scraping

Install it:

install.packages("rvest")

Then load the package:

library(rvest)

Grabbing the HTML

wikipedia_page<-read_html("https://en.wikipedia.org/wiki/World_Health_Organization_ranking_of_health_systems_in_2000")

Let's Take a Look

wikipedia_page

{xml_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...

Grabbing the Nodes

Here, we need the xpath we grabbed using Chrome:

//*[@id="mw-content-text"]/table[1]

section_of_wikipedia_html<-
  html_nodes(wikipedia_page, xpath='//*[@id="mw-content-text"]/table[1]')

Let's Take Another Look

section_of_wikipedia_html

{xml_nodeset (0)}

Let's Grab the Table from the HTML

health_rankings<-html_table(section_of_wikipedia_html)

Check the Object Class

class(health_rankings)

[1] "list"

Let's Convert the List to a Data Frame

test<-as.data.frame(health_rankings)
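
html_table() returns a list because an xpath can match more than one table. A small sketch of pulling out a single table before converting it, assuming a made-up two-row table in place of the Wikipedia page (toy_page, toy_nodes, toy_tables, and toy_rankings are illustrative names only):

```r
library(rvest)

# A made-up table stands in for the Wikipedia page so this runs offline.
toy_page <- read_html('<table>
  <tr><th>Country</th><th>Rank</th></tr>
  <tr><td>France</td><td>1</td></tr>
  <tr><td>Italy</td><td>2</td></tr>
</table>')

toy_nodes  <- html_nodes(toy_page, xpath = "//table[1]")
toy_tables <- html_table(toy_nodes)  # a list with one element per matched table

# [[1]] pulls out the first (here, only) table before converting it.
toy_rankings <- as.data.frame(toy_tables[[1]])
toy_rankings
```

Indexing with [[1]] makes the choice of table explicit, which may matter when an xpath matches several tables on a page.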

If the Xpath does not Work...

Use "SelectorGadget" to find the CSS selector

Another Website

Please navigate to this link:

http://twittercounter.com/pages/100

The CSS Selector

.name-bio

Using the CSS Selector

toptwitter<-read_html("http://twittercounter.com/pages/100")
toptwitternodes<-html_nodes(toptwitter, css=".name-bio")
names<-html_text(toptwitternodes)

Let's Take a Look

names

 [1] "\n                        \n                            KATY PERRY\n                        \n                        @katyperry\n\n                        \n                    "               
 [2] "\n                        \n                            Justin Bieber\n                        \n                        @justinbieber\n\n                        \n                    "         
 [3] "\n                        \n                            Barack Obama\n                        \n                        @BarackObama\n\n                        \n                    "           
 [4] "\n                        \n                            Taylor Swift\n                        \n                        @taylorswift13\n\n                        \n                    "         
 [5] "\n                        \n                            Rihanna\n                        \n                        @rihanna\n\n                        \n                    "                    
 [6] "\n                        \n                            Ellen DeGeneres\n                        \n                        @TheEllenShow\n\n                        \n                    "       
 [7] "\n                        \n                            YouTube\n                        \n                        @YouTube\n\n                        \n                    "                    
 [8] "\n                        \n                            xoxo, Gaga\n                        \n                        @ladygaga\n\n                        \n                    "                
 [9] "\n                        \n                            Justin Timberlake\n                        \n                        @jtimberlake\n\n                        \n                    "      
[10] "\n                        \n                            Twitter\n                        \n                        @Twitter\n\n                        \n                    "                    
[11] "\n                        \n                            Britney Spears\n                        \n                        @britneyspears\n\n                        \n                    "       
[12] "\n                        \n                            Kim Kardashian West\n                        \n                        @KimKardashian\n\n                        \n                    "  
[13] "\n                        \n                            Cristiano Ronaldo\n                        \n                        @Cristiano\n\n                        \n                    "        
[14] "\n                        \n                            CNN Breaking News\n                        \n                        @cnnbrk\n\n                        \n                    "           
[15] "\n                        \n                            Selena Gomez\n                        \n                        @selenagomez\n\n                        \n                    "           
[16] "\n                        \n                            jimmy fallon\n                        \n                        @jimmyfallon\n\n                        \n                    "           
[17] "\n                        \n                            Ariana Grande\n                        \n                        @ArianaGrande\n\n                        \n                    "         
[18] "\n                        \n                            Shakira\n                        \n                        @shakira\n\n                        \n                    "                    
[19] "\n                        \n                            Demi Lovato\n                        \n                        @ddlovato\n\n                        \n                    "               
[20] "\n                        \n                            Jennifer Lopez\n                        \n                        @JLo\n\n                        \n                    "                 
[21] "\n                        \n                            Instagram\n                        \n                        @instagram\n\n                        \n                    "                
[22] "\n                        \n                            The New York Times\n                        \n                        @nytimes\n\n                        \n                    "         
[23] "\n                        \n                            Oprah Winfrey\n                        \n                        @Oprah\n\n                        \n                    "                
[24] "\n                        \n                            LeBron James\n                        \n                        @KingJames\n\n                        \n                    "             
[25] "\n                        \n                            Drizzy\n                        \n                        @Drake\n\n                        \n                    "                       
[26] "\n                        \n                            CNN\n                        \n                        @CNN\n\n                        \n                    "                            
[27] "\n                        \n                            Bill Gates\n                        \n                        @BillGates\n\n                        \n                    "               
[28] "\n                        \n                            Kevin Hart\n                        \n                        @KevinHart4real\n\n                        \n                    "          
[29] "\n                        \n                            SportsCenter\n                        \n                        @SportsCenter\n\n                        \n                    "          
[30] "\n                        \n                            ESPN\n                        \n                        @espn\n\n                        \n                    "                          
[31] "\n                        \n                            Miley Ray Cyrus\n                        \n                        @MileyCyrus\n\n                        \n                    "         
[32] "\n                        \n                            BBC Breaking News\n                        \n                        @BBCBreaking\n\n                        \n                    "      
[33] "\n                        \n                            Donald J. Trump\n                        \n                        @realDonaldTrump\n\n                        \n                    "    
[34] "\n                        \n                            One Direction\n                        \n                        @onedirection\n\n                        \n                    "         
[35] "\n                        \n                            Narendra Modi\n                        \n                        @narendramodi\n\n                        \n                    "         
[36] "\n                        \n                            Harry Styles.\n                        \n                        @Harry_Styles\n\n                        \n                    "         
[37] "\n                        \n                            Bruno Mars\n                        \n                        @BrunoMars\n\n                        \n                    "               
[38] "\n                        \n                            Niall Horan\n                        \n                        @NiallOfficial\n\n                        \n                    "          
[39] "\n                        \n                            Lil Wayne WEEZY F\n                        \n                        @LilTunechi\n\n                        \n                    "       
[40] "\n                        \n                            Wiz Khalifa\n                        \n                        @wizkhalifa\n\n                        \n                    "             
[41] "\n                        \n                            Neymar Jr\n                        \n                        @neymarjr\n\n                        \n                    "                 
[42] "\n                        \n                            P!nk\n                        \n                        @Pink\n\n                        \n                    "                          
[43] "\n                        \n                            Adele\n                        \n                        @Adele\n\n                        \n                    "                        
[44] "\n                        \n                            daniel tosh\n                        \n                        @danieltosh\n\n                        \n                    "             
[45] "\n                        \n                            Amitabh Bachchan\n                        \n                        @SrBachchan\n\n                        \n                    "        
[46] "\n                        \n                            Kaka\n                        \n                        @KAKA\n\n                        \n                    "                          
[47] "\n                        \n                            Neil Patrick Harris\n                        \n                        @ActuallyNPH\n\n                        \n                    "    
[48] "\n                        \n                            Alicia Keys\n                        \n                        @aliciakeys\n\n                        \n                    "             
[49] "\n                        \n                            Shah Rukh Khan\n                        \n                        @iamsrk\n\n                        \n                    "              
[50] "\n                        \n                            NBA\n                        \n                        @NBA\n\n                        \n                    "                            
[51] "\n                        \n                            Emma Watson\n                        \n                        @EmmaWatson\n\n                        \n                    "             
[52] "\n                        \n                            Louis Tomlinson\n                        \n                        @Louis_Tomlinson\n\n                        \n                    "    
[53] "\n                        \n                            Pitbull\n                        \n                        @pitbull\n\n                        \n                    "                    
[54] "\n                        \n                            Liam\n                        \n                        @LiamPayne\n\n                        \n                    "                     
[55] "\n                        \n                            NASA\n                        \n                        @NASA\n\n                        \n                    "                          
[56] "\n                        \n                            Khloé\n                        \n                        @khloekardashian\n\n                        \n                    "              
[57] "\n                        \n                            Real Madrid C.F.\n                        \n                        @realmadrid\n\n                        \n                    "        
[58] "\n                        \n                            Conan O'Brien\n                        \n                        @ConanOBrien\n\n                        \n                    "          
[59] "\n                        \n                            NFL\n                        \n                        @NFL\n\n                        \n                    "                            
[60] "\n                        \n                            Salman Khan\n                        \n                        @BeingSalmanKhan\n\n                        \n                    "        
[61] "\n                        \n                            Kendall\n                        \n                        @KendallJenner\n\n                        \n                    "              
[62] "\n                        \n                            Kourtney Kardashian\n                        \n                        @kourtneykardash\n\n                        \n                    "
[63] "\n                        \n                            zayn\n                        \n                        @zaynmalik\n\n                        \n                    "                     
[64] "\n                        \n                            Kylie Jenner\n                        \n                        @KylieJenner\n\n                        \n                    "           
[65] "\n                        \n                            David Guetta\n                        \n                        @davidguetta\n\n                        \n                    "           
[66] "\n                        \n                            FC Barcelona\n                        \n                        @FCBarcelona\n\n                        \n                    "           
[67] "\n                        \n                            The Economist\n                        \n                        @TheEconomist\n\n                        \n                    "         
[68] "\n                        \n                            Aamir Khan\n                        \n                        @aamir_khan\n\n                        \n                    "              
[69] "\n                        \n                            NICKI MINAJ\n                        \n                        @NICKIMINAJ\n\n                        \n                    "             
[70] "\n                        \n                            Coldplay\n                        \n                        @coldplay\n\n                        \n                    "                  
[71] "\n                        \n                            Avril Lavigne\n                        \n                        @AvrilLavigne\n\n                        \n                    "         
[72] "\n                        \n                            Marshall Mathers\n                        \n                        @Eminem\n\n                        \n                    "            
[73] "\n                        \n                            Chris Brown\n                        \n                        @chrisbrown\n\n                        \n                    "             
[74] "\n                        \n                            BBC News (World)\n                        \n                        @BBCWorld\n\n                        \n                    "          
[75] "\n                        \n                            Blake Shelton\n                        \n                        @blakeshelton\n\n                        \n                    "         
[76] "\n                        \n                            President Trump\n                        \n                        @POTUS\n\n                        \n                    "              
[77] "\n                        \n                            Ed Sheeran\n                        \n                        @edsheeran\n\n                        \n                    "               
[78] "\n                        \n                            Deepika Padukone\n                        \n                        @deepikapadukone\n\n                        \n                    "   
[79] "\n                        \n                            PMO India\n                        \n                        @PMOIndia\n\n                        \n                    "                 
[80] "\n                        \n                            Google\n                        \n                        @Google\n\n                        \n                    "                      
[81] "\n                        \n                            Akshay Kumar\n                        \n                        @akshaykumar\n\n                        \n                    "           
[82] "\n                        \n                            ashton kutcher\n                        \n                        @aplusk\n\n                        \n                    "              
[83] "\n                        \n                            Reuters Top News\n                        \n                        @Reuters\n\n                        \n                    "           
[84] "\n                        \n                            Mariah Carey\n                        \n                        @MariahCarey\n\n                        \n                    "           
[85] "\n                        \n                            National Geographic\n                        \n                        @NatGeo\n\n                        \n                    "         
[86] "\n                        \n                            Ricky Martin\n                        \n                        @ricky_martin\n\n                        \n                    "          
[87] "\n                        \n                            Leonardo DiCaprio\n                        \n                        @LeoDiCaprio\n\n                        \n                    "      
[88] "\n                        \n                            د. محمد #العريفي\n                        \n                        @MohamadAlarefe\n\n                        \n                    "    
[89] "\n                        \n                            PRIYANKA\n                        \n                        @priyankachopra\n\n                        \n                    "            
[90] "\n                        \n                            Hrithik Roshan\n                        \n                        @iHrithik\n\n                        \n                    "            
[91] "\n                        \n                            أحمد الشقيري\n                        \n                        @shugairi\n\n                        \n                    "              
[92] "\n                        \n                            Snoop Dogg\n                        \n                        @SnoopDogg\n\n                        \n                    "               
[93] "\n                        \n                            Vine Creators\n                        \n                        @VineCreators\n\n                        \n                    "         
[94] "\n                        \n                            sachin tendulkar\n                        \n                        @sachin_rt\n\n                        \n                    "         
[95] "\n                        \n                            AGNEZ MO\n                        \n                        @agnezmo\n\n                        \n                    "                   
[96] "\n                        \n                            Andrés Iniesta\n                        \n                        @andresiniesta8\n\n                        \n                    "      
[97] "\n                        \n                            Hillary Clinton\n                        \n                        @HillaryClinton\n\n                        \n                    "     
[98] "\n                        \n                            Alejandro Sanz\n                        \n                        @AlejandroSanz\n\n                        \n                    "       
[99] "\n                        \n                            Christina Aguilera\n                        \n                        @xtina\n\n                        \n                    "

To learn how to clean up these data, see the code on the SICSS GitHub page
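
As a rough idea of what such a cleanup might involve, here is a base R sketch. It is not the SICSS code; it assumes each entry holds a display name and a handle separated by line breaks, as in the output above. The two sample strings are copied from that output with the runs of spaces shortened, and raw_names, parts, and twitter_accounts are illustrative names.

```r
# Two entries from the scraped output (runs of spaces shortened).
raw_names <- c(
  "\n      \n        KATY PERRY\n      \n      @katyperry\n\n      \n    ",
  "\n      \n        Justin Bieber\n      \n      @justinbieber\n\n      \n    "
)

# Trim whitespace from both ends, then split on any run of whitespace that
# contains a line break. This separates the display name from the handle
# while leaving spaces inside a name untouched.
parts <- strsplit(trimws(raw_names), "\\s*\n\\s*")

twitter_accounts <- data.frame(
  name   = sapply(parts, `[`, 1),
  handle = sapply(parts, `[`, 2),
  stringsAsFactors = FALSE
)
twitter_accounts
```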

Now... Repeat

All of these commands can be placed within loops
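
A minimal sketch of such a loop, under the assumption that every page shares the same CSS selector. The entries in pages are hypothetical literal HTML strings so the example runs offline; with real sites they would be URLs. The pause between iterations is a courtesy to the site, and its length here is arbitrary.

```r
library(rvest)

# Hypothetical stand-ins for the pages to scrape.
pages <- c(
  '<html><body><p class="name-bio">Account A</p></body></html>',
  '<html><body><p class="name-bio">Account B</p><p class="name-bio">Account C</p></body></html>'
)

all_names <- character(0)
for (i in seq_along(pages)) {
  page_html  <- read_html(pages[i])
  page_nodes <- html_nodes(page_html, css = ".name-bio")
  # Append this page's results to the running vector.
  all_names  <- c(all_names, html_text(page_nodes))
  Sys.sleep(0.5)  # pause between pages so the loop does not hammer the site
}
all_names
```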

If you get SSL or OAuth errors, the site you are trying to scrape may be blocking you.
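
When scraping in a loop, one failed request will otherwise stop the whole run. A sketch of one way to catch such errors with base R's tryCatch() follows. safe_read() is a hypothetical helper written for this example, not part of rvest, and blocked_reader() simply simulates a failed request so the code runs offline.

```r
library(rvest)

# safe_read() is a hypothetical helper: it returns NULL instead of stopping
# when a page cannot be read, and reports the error message.
safe_read <- function(url, reader = read_html) {
  tryCatch(
    reader(url),
    error = function(e) {
      message("Could not read ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# A reader that always fails, standing in for a blocked request.
blocked_reader <- function(url) stop("SSL connect error")

result <- safe_read("http://example.com/blocked", reader = blocked_reader)
is.null(result)
```

Returning NULL lets a loop skip the failed page and carry on with the next one.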

Browser Automation

Another strategy for automatically extracting data from a website, particularly useful for sites with a lot of JavaScript.

Try out the RSelenium package, which can:

a) load a URL in your browser
b) navigate around the page using keystrokes
c) download different types of data (e.g. .csv)

Scraping with Amazon Mechanical Turk

QUESTIONS?

15 MINUTE BREAK