How to Collect Digital Trace Data: Screenscraping

Chris Bail, Duke University
SICSS, Day 2

First, ask yourself whether you need to collect new data

 

A vast amount of data has already been compiled (e.g. Facebook, Twitter, The New York Times, Reuters, Google, Wikipedia)

Here is a crowd-sourced list of datasets I curate

Second, ask whether you could supplement or expand upon someone else's dataset

 

METHODS OF COLLECTING DIGITAL TRACE DATA

Types of Data Collection

 

-Pre-packaged data (e.g. Google Trends)
-Screen-scraping/Browser Automation/Crowd-Sourced Scraping
-Application Programming Interfaces (APIs)

What is Screen Scraping?

 

Screen scraping refers to a type of computer program that:

  • loads/reads in a web page
  • finds some information on it
  • grabs the information
  • stores it in a dataset
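
The four steps above can be sketched in a few lines of R. This is only an illustration under stated assumptions: it uses the rvest package (introduced later in this deck), and the page is a made-up snippet of literal HTML so the example runs without an internet connection. The `p.fact` selector and the `dataset` object are hypothetical names.

```r
library(rvest)

# Made-up page (an assumption), passed as literal HTML instead of a URL
html_source <- '<html><body><h1>Example page</h1><p class="fact">Population: 100</p></body></html>'

page <- read_html(html_source)             # 1. load/read in the web page
node <- html_nodes(page, css = "p.fact")   # 2. find some information on it
fact <- html_text(node)                    # 3. grab the information
dataset <- data.frame(fact = fact,         # 4. store it in a dataset
                      stringsAsFactors = FALSE)
dataset
```

With a real site, `html_source` would simply be the page's URL.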

Screen-Scraping

 

Once upon a time you could collect virtually any piece of information from the internet by screen scraping.

We are no longer in the “Wild, Wild West” of the internet.

Screen-scraping many sites now violates their terms of service and, in some cases, may be against the law.

Most sites have become very difficult to scrape because they are designed to prevent screen-scraping.

Let's Try a Simple Example

 

Please open this link: https://en.wikipedia.org/wiki/World_Health_Organization_ranking_of_health_systems_in_2000

Or you can google “Wikipedia” and “World Health Organization Ranking of Health Systems”

What a Website Looks Like to Us

What a Website Looks Like to a Computer

What if we Want the Info in this Table?

Set Your Working Directory

 

setwd("/Users/christopherandrewbail/Desktop/Dropbox/Teaching/Computational Soc Fall 2015/Course Dropbox")

Rvest: A Package for Screen Scraping

 

Install it:

install.packages("rvest")

Then load the package:

library(rvest)

Grabbing the HTML

 

# html() is deprecated in rvest; read_html() is its replacement
wikipedia_page<-read_html("https://en.wikipedia.org/wiki/World_Health_Organization_ranking_of_health_systems_in_2000")

Let's Take a Look

wikipedia_page
{xml_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...

Grabbing the Nodes

 

Here, we need the XPath we grabbed using Chrome:

//*[@id="mw-content-text"]/table[1]

section_of_wikipedia_html<-
  html_nodes(wikipedia_page, xpath='//*[@id="mw-content-text"]/table[1]')

Let's Take Another Look

section_of_wikipedia_html
{xml_nodeset (0)}
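
A nodeset of length 0, as in the output above, means the XPath matched nothing on the page. One common cause is that the table is no longer a direct child of the element named in the XPath. The sketch below reproduces that situation on a toy page: the HTML and the `wrapper` class are assumptions made for illustration, not the actual Wikipedia markup.

```r
library(rvest)

# Toy page (an assumption): the table sits one level below the element
# with id "mw-content-text", so it is not a direct child of it.
toy_page <- read_html(paste0(
  '<html><body><div id="mw-content-text"><div class="wrapper">',
  '<table><tr><th>Country</th><th>Rank</th></tr>',
  '<tr><td>France</td><td>1</td></tr></table>',
  '</div></div></body></html>'
))

# "/table[1]" only matches a table that is a direct child: nothing here
direct_child <- html_nodes(toy_page, xpath = '//*[@id="mw-content-text"]/table[1]')
length(direct_child)

# "//table" matches tables at any depth below the element
any_depth <- html_nodes(toy_page, xpath = '//*[@id="mw-content-text"]//table')
length(any_depth)

# Checking the length before moving on catches a silent mismatch early
if (length(direct_child) == 0) {
  message("XPath matched no nodes; the page layout may have changed.")
}
```

Checking `length()` right after `html_nodes()` is a cheap safeguard, because later steps such as `html_table()` do not raise an error on an empty nodeset.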

Let's Grab the Table from the HTML

 

health_rankings<-html_table(section_of_wikipedia_html)

Check the Object Class

 

class(health_rankings)
[1] "list"

Let's Convert the List to a Data Frame

 

test<-as.data.frame(health_rankings)
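
`html_table()` returns a list because a nodeset can contain several tables, one list element per table. A minimal sketch, assuming a toy two-row table in place of the real page, shows the list and an alternative way to pull out a single table with `[[1]]`:

```r
library(rvest)

# Toy table (an assumption) standing in for the Wikipedia table
toy_page <- read_html(paste0(
  '<html><body><table>',
  '<tr><th>Country</th><th>Rank</th></tr>',
  '<tr><td>France</td><td>1</td></tr>',
  '<tr><td>Italy</td><td>2</td></tr>',
  '</table></body></html>'
))

toy_tables <- html_table(html_nodes(toy_page, xpath = "//table"))
class(toy_tables)    # a list with one element per matched table
length(toy_tables)

# Extract the first (here, the only) table and make it a plain data frame
toy_rankings <- as.data.frame(toy_tables[[1]])
toy_rankings
```

When the XPath matches more than one table, indexing with `[[1]]`, `[[2]]`, and so on makes it explicit which table ends up in the data frame.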

If the XPath does not Work...

 

Use "SelectorGadget" to find the CSS selector

Another Website

 

Please navigate to this link:

http://twittercounter.com/pages/100

The CSS Selector

 

.name-bio

Using the CSS Selector

 

# html() is deprecated in rvest; read_html() is its replacement
toptwitter<-read_html("http://twittercounter.com/pages/100")
toptwitternodes<-html_nodes(toptwitter, css=".name-bio")
names<-html_text(toptwitternodes)

Let's Take a Look

 

names
 [1] "\n                        \n                            KATY PERRY\n                        \n                        @katyperry\n\n                        \n                    "               
 [2] "\n                        \n                            Justin Bieber\n                        \n                        @justinbieber\n\n                        \n                    "         
 [3] "\n                        \n                            Barack Obama\n                        \n                        @BarackObama\n\n                        \n                    "           
 [4] "\n                        \n                            Taylor Swift\n                        \n                        @taylorswift13\n\n                        \n                    "         
 [5] "\n                        \n                            Rihanna\n                        \n                        @rihanna\n\n                        \n                    "                    
 [6] "\n                        \n                            Ellen DeGeneres\n                        \n                        @TheEllenShow\n\n                        \n                    "       
 [7] "\n                        \n                            YouTube\n                        \n                        @YouTube\n\n                        \n                    "                    
 [8] "\n                        \n                            xoxo, Gaga\n                        \n                        @ladygaga\n\n                        \n                    "                
 [9] "\n                        \n                            Justin Timberlake\n                        \n                        @jtimberlake\n\n                        \n                    "      
[10] "\n                        \n                            Twitter\n                        \n                        @Twitter\n\n                        \n                    "                    
[11] "\n                        \n                            Britney Spears\n                        \n                        @britneyspears\n\n                        \n                    "       
[12] "\n                        \n                            Kim Kardashian West\n                        \n                        @KimKardashian\n\n                        \n                    "  
[13] "\n                        \n                            Cristiano Ronaldo\n                        \n                        @Cristiano\n\n                        \n                    "        
[14] "\n                        \n                            CNN Breaking News\n                        \n                        @cnnbrk\n\n                        \n                    "           
[15] "\n                        \n                            Selena Gomez\n                        \n                        @selenagomez\n\n                        \n                    "           
[16] "\n                        \n                            jimmy fallon\n                        \n                        @jimmyfallon\n\n                        \n                    "           
[17] "\n                        \n                            Ariana Grande\n                        \n                        @ArianaGrande\n\n                        \n                    "         
[18] "\n                        \n                            Shakira\n                        \n                        @shakira\n\n                        \n                    "                    
[19] "\n                        \n                            Demi Lovato\n                        \n                        @ddlovato\n\n                        \n                    "               
[20] "\n                        \n                            Jennifer Lopez\n                        \n                        @JLo\n\n                        \n                    "                 
[21] "\n                        \n                            Instagram\n                        \n                        @instagram\n\n                        \n                    "                
[22] "\n                        \n                            The New York Times\n                        \n                        @nytimes\n\n                        \n                    "         
[23] "\n                        \n                            Oprah Winfrey\n                        \n                        @Oprah\n\n                        \n                    "                
[24] "\n                        \n                            LeBron James\n                        \n                        @KingJames\n\n                        \n                    "             
[25] "\n                        \n                            Drizzy\n                        \n                        @Drake\n\n                        \n                    "                       
[26] "\n                        \n                            CNN\n                        \n                        @CNN\n\n                        \n                    "                            
[27] "\n                        \n                            Bill Gates\n                        \n                        @BillGates\n\n                        \n                    "               
[28] "\n                        \n                            Kevin Hart\n                        \n                        @KevinHart4real\n\n                        \n                    "          
[29] "\n                        \n                            SportsCenter\n                        \n                        @SportsCenter\n\n                        \n                    "          
[30] "\n                        \n                            ESPN\n                        \n                        @espn\n\n                        \n                    "                          
[31] "\n                        \n                            Miley Ray Cyrus\n                        \n                        @MileyCyrus\n\n                        \n                    "         
[32] "\n                        \n                            BBC Breaking News\n                        \n                        @BBCBreaking\n\n                        \n                    "      
[33] "\n                        \n                            Donald J. Trump\n                        \n                        @realDonaldTrump\n\n                        \n                    "    
[34] "\n                        \n                            One Direction\n                        \n                        @onedirection\n\n                        \n                    "         
[35] "\n                        \n                            Narendra Modi\n                        \n                        @narendramodi\n\n                        \n                    "         
[36] "\n                        \n                            Harry Styles.\n                        \n                        @Harry_Styles\n\n                        \n                    "         
[37] "\n                        \n                            Bruno Mars\n                        \n                        @BrunoMars\n\n                        \n                    "               
[38] "\n                        \n                            Niall Horan\n                        \n                        @NiallOfficial\n\n                        \n                    "          
[39] "\n                        \n                            Lil Wayne WEEZY F\n                        \n                        @LilTunechi\n\n                        \n                    "       
[40] "\n                        \n                            Wiz Khalifa\n                        \n                        @wizkhalifa\n\n                        \n                    "             
[41] "\n                        \n                            Neymar Jr\n                        \n                        @neymarjr\n\n                        \n                    "                 
[42] "\n                        \n                            P!nk\n                        \n                        @Pink\n\n                        \n                    "                          
[43] "\n                        \n                            Adele\n                        \n                        @Adele\n\n                        \n                    "                        
[44] "\n                        \n                            daniel tosh\n                        \n                        @danieltosh\n\n                        \n                    "             
[45] "\n                        \n                            Amitabh Bachchan\n                        \n                        @SrBachchan\n\n                        \n                    "        
[46] "\n                        \n                            Kaka\n                        \n                        @KAKA\n\n                        \n                    "                          
[47] "\n                        \n                            Neil Patrick Harris\n                        \n                        @ActuallyNPH\n\n                        \n                    "    
[48] "\n                        \n                            Alicia Keys\n                        \n                        @aliciakeys\n\n                        \n                    "             
[49] "\n                        \n                            Shah Rukh Khan\n                        \n                        @iamsrk\n\n                        \n                    "              
[50] "\n                        \n                            NBA\n                        \n                        @NBA\n\n                        \n                    "                            
[51] "\n                        \n                            Emma Watson\n                        \n                        @EmmaWatson\n\n                        \n                    "             
[52] "\n                        \n                            Louis Tomlinson\n                        \n                        @Louis_Tomlinson\n\n                        \n                    "    
[53] "\n                        \n                            Pitbull\n                        \n                        @pitbull\n\n                        \n                    "                    
[54] "\n                        \n                            Liam\n                        \n                        @LiamPayne\n\n                        \n                    "                     
[55] "\n                        \n                            NASA\n                        \n                        @NASA\n\n                        \n                    "                          
[56] "\n                        \n                            Khloé\n                        \n                        @khloekardashian\n\n                        \n                    "              
[57] "\n                        \n                            Real Madrid C.F.\n                        \n                        @realmadrid\n\n                        \n                    "        
[58] "\n                        \n                            Conan O'Brien\n                        \n                        @ConanOBrien\n\n                        \n                    "          
[59] "\n                        \n                            NFL\n                        \n                        @NFL\n\n                        \n                    "                            
[60] "\n                        \n                            Salman Khan\n                        \n                        @BeingSalmanKhan\n\n                        \n                    "        
[61] "\n                        \n                            Kendall\n                        \n                        @KendallJenner\n\n                        \n                    "              
[62] "\n                        \n                            Kourtney Kardashian\n                        \n                        @kourtneykardash\n\n                        \n                    "
[63] "\n                        \n                            zayn\n                        \n                        @zaynmalik\n\n                        \n                    "                     
[64] "\n                        \n                            Kylie Jenner\n                        \n                        @KylieJenner\n\n                        \n                    "           
[65] "\n                        \n                            David Guetta\n                        \n                        @davidguetta\n\n                        \n                    "           
[66] "\n                        \n                            FC Barcelona\n                        \n                        @FCBarcelona\n\n                        \n                    "           
[67] "\n                        \n                            The Economist\n                        \n                        @TheEconomist\n\n                        \n                    "         
[68] "\n                        \n                            Aamir Khan\n                        \n                        @aamir_khan\n\n                        \n                    "              
[69] "\n                        \n                            NICKI MINAJ\n                        \n                        @NICKIMINAJ\n\n                        \n                    "             
[70] "\n                        \n                            Coldplay\n                        \n                        @coldplay\n\n                        \n                    "                  
[71] "\n                        \n                            Avril Lavigne\n                        \n                        @AvrilLavigne\n\n                        \n                    "         
[72] "\n                        \n                            Marshall Mathers\n                        \n                        @Eminem\n\n                        \n                    "            
[73] "\n                        \n                            Chris Brown\n                        \n                        @chrisbrown\n\n                        \n                    "             
[74] "\n                        \n                            BBC News (World)\n                        \n                        @BBCWorld\n\n                        \n                    "          
[75] "\n                        \n                            Blake Shelton\n                        \n                        @blakeshelton\n\n                        \n                    "         
[76] "\n                        \n                            President Trump\n                        \n                        @POTUS\n\n                        \n                    "              
[77] "\n                        \n                            Ed Sheeran\n                        \n                        @edsheeran\n\n                        \n                    "               
[78] "\n                        \n                            Deepika Padukone\n                        \n                        @deepikapadukone\n\n                        \n                    "   
[79] "\n                        \n                            PMO India\n                        \n                        @PMOIndia\n\n                        \n                    "                 
[80] "\n                        \n                            Google\n                        \n                        @Google\n\n                        \n                    "                      
[81] "\n                        \n                            Akshay Kumar\n                        \n                        @akshaykumar\n\n                        \n                    "           
[82] "\n                        \n                            ashton kutcher\n                        \n                        @aplusk\n\n                        \n                    "              
[83] "\n                        \n                            Reuters Top News\n                        \n                        @Reuters\n\n                        \n                    "           
[84] "\n                        \n                            Mariah Carey\n                        \n                        @MariahCarey\n\n                        \n                    "           
[85] "\n                        \n                            National Geographic\n                        \n                        @NatGeo\n\n                        \n                    "         
[86] "\n                        \n                            Ricky Martin\n                        \n                        @ricky_martin\n\n                        \n                    "          
[87] "\n                        \n                            Leonardo DiCaprio\n                        \n                        @LeoDiCaprio\n\n                        \n                    "      
[88] "\n                        \n                            د. محمد #العريفي\n                        \n                        @MohamadAlarefe\n\n                        \n                    "    
[89] "\n                        \n                            PRIYANKA\n                        \n                        @priyankachopra\n\n                        \n                    "            
[90] "\n                        \n                            Hrithik Roshan\n                        \n                        @iHrithik\n\n                        \n                    "            
[91] "\n                        \n                            أحمد الشقيري\n                        \n                        @shugairi\n\n                        \n                    "              
[92] "\n                        \n                            Snoop Dogg\n                        \n                        @SnoopDogg\n\n                        \n                    "               
[93] "\n                        \n                            Vine Creators\n                        \n                        @VineCreators\n\n                        \n                    "         
[94] "\n                        \n                            sachin tendulkar\n                        \n                        @sachin_rt\n\n                        \n                    "         
[95] "\n                        \n                            AGNEZ MO\n                        \n                        @agnezmo\n\n                        \n                    "                   
[96] "\n                        \n                            Andrés Iniesta\n                        \n                        @andresiniesta8\n\n                        \n                    "      
[97] "\n                        \n                            Hillary Clinton\n                        \n                        @HillaryClinton\n\n                        \n                    "     
[98] "\n                        \n                            Alejandro Sanz\n                        \n                        @AlejandroSanz\n\n                        \n                    "       
[99] "\n                        \n                            Christina Aguilera\n                        \n                        @xtina\n\n                        \n                    "           

To learn how to clean up these data, see the code on the SICSS GitHub page
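
As a rough illustration of what such cleaning can look like (this is a sketch in base R, not the code from the SICSS repository), the whitespace can be collapsed and the display name separated from the handle. The two input strings imitate the output above with shortened padding; `raw_names`, `accounts`, and the column names are hypothetical.

```r
# Two entries imitating the scraped output above (padding shortened)
raw_names <- c(
  "\n      \n        KATY PERRY\n      \n      @katyperry\n\n      \n    ",
  "\n      \n        xoxo, Gaga\n      \n      @ladygaga\n\n      \n    "
)

# Collapse every run of whitespace (spaces, newlines) into one space
collapsed <- trimws(gsub("\\s+", " ", raw_names))

# The handle is the final "@..." token; the display name is what precedes it
handle <- sub(".*(@\\S+)$", "\\1", collapsed)
display_name <- trimws(sub("@\\S+$", "", collapsed))

accounts <- data.frame(name = display_name, handle = handle,
                       stringsAsFactors = FALSE)
accounts
```

This assumes every entry ends with exactly one handle, which holds for the output shown but should be checked on other pages.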

 

Now... Repeat

 

All of these commands can be placed within loops
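
A minimal sketch of such a loop follows. The inputs are an assumption: two snippets of literal HTML are used so the example runs offline, whereas in practice each element of `pages` would be a URL. The pause and the error handling are suggestions rather than requirements.

```r
library(rvest)

# Hypothetical inputs: literal HTML here; in practice, a vector of URLs
pages <- c(
  '<html><body><p class="name-bio">Account A</p></body></html>',
  '<html><body><p class="name-bio">Account B</p><p class="name-bio">Account C</p></body></html>'
)

results <- vector("list", length(pages))

for (i in seq_along(pages)) {
  results[[i]] <- tryCatch({
    page <- read_html(pages[i])
    html_text(html_nodes(page, css = ".name-bio"))
  }, error = function(e) {
    # Keep going if one page fails, but record what happened
    message("Skipping page ", i, ": ", conditionMessage(e))
    character(0)
  })
  Sys.sleep(1)  # pause between requests to avoid overloading the site
}

all_names <- unlist(results)
all_names
```

Wrapping each request in `tryCatch()` means one failed page does not discard the results already collected.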

If you get SSL or OAuth errors, the site you are trying to scrape may be blocking you.

Browser Automation

 

Browser automation is another strategy for automatically extracting data from a website. It is particularly useful for sites that rely heavily on JavaScript.

Try out the RSelenium package, which can:

a) load a URL in your browser
b) navigate around the page using keystrokes
c) download different types of data (e.g. .csv)
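
The sketch below shows roughly how steps (a) and (b) might look with RSelenium. It is an illustration under several assumptions: RSelenium is installed, a supported browser (Firefox here) and its driver are available on the machine, and pressing the End key is enough to trigger any lazy-loaded content. The helper name `fetch_rendered_html` and its arguments are hypothetical.

```r
# Hypothetical helper: opens a browser, loads `url`, scrolls to the bottom
# with a keystroke, and returns the rendered HTML as a single string.
fetch_rendered_html <- function(url, browser = "firefox", wait_seconds = 2) {
  driver <- RSelenium::rsDriver(browser = browser, verbose = FALSE)
  client <- driver$client

  # Always shut the browser and the Selenium server down, even on error
  on.exit({
    client$close()
    driver$server$stop()
  }, add = TRUE)

  client$navigate(url)                    # a) load a URL in the browser

  body <- client$findElement(using = "css selector", value = "body")
  body$sendKeysToElement(list(key = "end"))  # b) navigate with a keystroke

  Sys.sleep(wait_seconds)                 # give JavaScript time to render
  client$getPageSource()[[1]]
}
```

The returned string can then be parsed in the usual way, for example `page <- rvest::read_html(fetch_rendered_html("https://example.com"))`, and queried with `html_nodes()` as before.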

Scraping with Amazon Mechanical Turk (AmTurk)

QUESTIONS?

15 MINUTE BREAK