Chris Bail, Duke University
SICSS, Day 2
A vast amount of data has already been compiled (e.g. Facebook, Twitter, The New York Times, Reuters, Google, Wikipedia).
Here is a crowd-sourced list of datasets I curate
-Pre-packaged data (e.g. Google Trends)
-Screen-scraping/Browser Automation/Crowd-Sourced Scraping
-Application Programming Interfaces (APIs)
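Screen scraping refers to a type of computer program that loads a web page, locates a specific piece of information on it, and extracts that information into a structured format such as a table.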
Once upon a time you could collect virtually any piece of information from the internet by screen scraping.
We are no longer in the “Wild, Wild West” of the internet.
Screen-scraping many sites now violates their terms of service and may be against the law.
Most sites have become very difficult to scrape because they are designed to prevent screen-scraping.
Please open this link
Or you can search Google for “Wikipedia” and “World Health Organization Ranking of Health Systems”.
setwd("/Users/christopherandrewbail/Desktop/Dropbox/Teaching/Computational Soc Fall 2015/Course Dropbox")
Install the rvest package:
install.packages("rvest")
Then load the package:
library(rvest)
# Download and parse the page's HTML (read_html() is the current rvest function for this)
wikipedia_page<-read_html("https://en.wikipedia.org/wiki/World_Health_Organization_ranking_of_health_systems_in_2000")
wikipedia_page
{xml_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-sub ...
Here, we need the xpath we grabbed using Chrome:
//*[@id="mw-content-text"]/table[1]
section_of_wikipedia_html<-
html_nodes(wikipedia_page, xpath='//*[@id="mw-content-text"]/table[1]')
section_of_wikipedia_html
{xml_nodeset (0)}
health_rankings<-html_table(section_of_wikipedia_html)
class(health_rankings)
[1] "list"
test<-as.data.frame(health_rankings)
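An empty node set such as `{xml_nodeset (0)}` means the XPath matched nothing on the page, so `html_table()` has no table to convert. The following is a minimal sketch of the same steps with a check for that case. It assumes a made-up inline page (`toy_page`, with placeholder countries and ranks) rather than the live Wikipedia article, so it runs offline; the object names are illustrative and not part of the original lesson.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page (placeholder data, not the real rankings)
toy_page <- read_html(paste0(
  '<html><body><div id="mw-content-text">',
  '<table>',
  '<tr><th>Country</th><th>Rank</th></tr>',
  '<tr><td>Country A</td><td>1</td></tr>',
  '<tr><td>Country B</td><td>2</td></tr>',
  '</table>',
  '</div></body></html>'
))

# Same XPath pattern as above: the first table directly under the content div
toy_nodes <- html_nodes(toy_page, xpath = '//*[@id="mw-content-text"]/table[1]')

# Stop early if the XPath matched nothing, instead of building an empty table
if (length(toy_nodes) == 0) {
  stop("The XPath matched no nodes; copy it again from the browser inspector.")
}

# html_table() returns a list with one element per matched table
toy_tables <- html_table(toy_nodes)
toy_rankings <- as.data.frame(toy_tables[[1]])
```

If the check fails on a real page, the page layout has probably changed since the XPath was copied, and the XPath needs to be grabbed again in Chrome.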
Here, we need the CSS selector for the account names:
.name-bio
# Download and parse the page's HTML (read_html() is the current rvest function for this)
toptwitter<-read_html("http://twittercounter.com/pages/100")
toptwitternodes<-html_nodes(toptwitter, css=".name-bio")
names<-html_text(toptwitternodes)
names
[1] "\n \n KATY PERRY\n \n @katyperry\n\n \n "
[2] "\n \n Justin Bieber\n \n @justinbieber\n\n \n "
[3] "\n \n Barack Obama\n \n @BarackObama\n\n \n "
[4] "\n \n Taylor Swift\n \n @taylorswift13\n\n \n "
[5] "\n \n Rihanna\n \n @rihanna\n\n \n "
[6] "\n \n Ellen DeGeneres\n \n @TheEllenShow\n\n \n "
[7] "\n \n YouTube\n \n @YouTube\n\n \n "
[8] "\n \n xoxo, Gaga\n \n @ladygaga\n\n \n "
[9] "\n \n Justin Timberlake\n \n @jtimberlake\n\n \n "
[10] "\n \n Twitter\n \n @Twitter\n\n \n "
[11] "\n \n Britney Spears\n \n @britneyspears\n\n \n "
[12] "\n \n Kim Kardashian West\n \n @KimKardashian\n\n \n "
[13] "\n \n Cristiano Ronaldo\n \n @Cristiano\n\n \n "
[14] "\n \n CNN Breaking News\n \n @cnnbrk\n\n \n "
[15] "\n \n Selena Gomez\n \n @selenagomez\n\n \n "
[16] "\n \n jimmy fallon\n \n @jimmyfallon\n\n \n "
[17] "\n \n Ariana Grande\n \n @ArianaGrande\n\n \n "
[18] "\n \n Shakira\n \n @shakira\n\n \n "
[19] "\n \n Demi Lovato\n \n @ddlovato\n\n \n "
[20] "\n \n Jennifer Lopez\n \n @JLo\n\n \n "
[21] "\n \n Instagram\n \n @instagram\n\n \n "
[22] "\n \n The New York Times\n \n @nytimes\n\n \n "
[23] "\n \n Oprah Winfrey\n \n @Oprah\n\n \n "
[24] "\n \n LeBron James\n \n @KingJames\n\n \n "
[25] "\n \n Drizzy\n \n @Drake\n\n \n "
[26] "\n \n CNN\n \n @CNN\n\n \n "
[27] "\n \n Bill Gates\n \n @BillGates\n\n \n "
[28] "\n \n Kevin Hart\n \n @KevinHart4real\n\n \n "
[29] "\n \n SportsCenter\n \n @SportsCenter\n\n \n "
[30] "\n \n ESPN\n \n @espn\n\n \n "
[31] "\n \n Miley Ray Cyrus\n \n @MileyCyrus\n\n \n "
[32] "\n \n BBC Breaking News\n \n @BBCBreaking\n\n \n "
[33] "\n \n Donald J. Trump\n \n @realDonaldTrump\n\n \n "
[34] "\n \n One Direction\n \n @onedirection\n\n \n "
[35] "\n \n Narendra Modi\n \n @narendramodi\n\n \n "
[36] "\n \n Harry Styles.\n \n @Harry_Styles\n\n \n "
[37] "\n \n Bruno Mars\n \n @BrunoMars\n\n \n "
[38] "\n \n Niall Horan\n \n @NiallOfficial\n\n \n "
[39] "\n \n Lil Wayne WEEZY F\n \n @LilTunechi\n\n \n "
[40] "\n \n Wiz Khalifa\n \n @wizkhalifa\n\n \n "
[41] "\n \n Neymar Jr\n \n @neymarjr\n\n \n "
[42] "\n \n P!nk\n \n @Pink\n\n \n "
[43] "\n \n Adele\n \n @Adele\n\n \n "
[44] "\n \n daniel tosh\n \n @danieltosh\n\n \n "
[45] "\n \n Amitabh Bachchan\n \n @SrBachchan\n\n \n "
[46] "\n \n Kaka\n \n @KAKA\n\n \n "
[47] "\n \n Neil Patrick Harris\n \n @ActuallyNPH\n\n \n "
[48] "\n \n Alicia Keys\n \n @aliciakeys\n\n \n "
[49] "\n \n Shah Rukh Khan\n \n @iamsrk\n\n \n "
[50] "\n \n NBA\n \n @NBA\n\n \n "
[51] "\n \n Emma Watson\n \n @EmmaWatson\n\n \n "
[52] "\n \n Louis Tomlinson\n \n @Louis_Tomlinson\n\n \n "
[53] "\n \n Pitbull\n \n @pitbull\n\n \n "
[54] "\n \n Liam\n \n @LiamPayne\n\n \n "
[55] "\n \n NASA\n \n @NASA\n\n \n "
[56] "\n \n Khloé\n \n @khloekardashian\n\n \n "
[57] "\n \n Real Madrid C.F.\n \n @realmadrid\n\n \n "
[58] "\n \n Conan O'Brien\n \n @ConanOBrien\n\n \n "
[59] "\n \n NFL\n \n @NFL\n\n \n "
[60] "\n \n Salman Khan\n \n @BeingSalmanKhan\n\n \n "
[61] "\n \n Kendall\n \n @KendallJenner\n\n \n "
[62] "\n \n Kourtney Kardashian\n \n @kourtneykardash\n\n \n "
[63] "\n \n zayn\n \n @zaynmalik\n\n \n "
[64] "\n \n Kylie Jenner\n \n @KylieJenner\n\n \n "
[65] "\n \n David Guetta\n \n @davidguetta\n\n \n "
[66] "\n \n FC Barcelona\n \n @FCBarcelona\n\n \n "
[67] "\n \n The Economist\n \n @TheEconomist\n\n \n "
[68] "\n \n Aamir Khan\n \n @aamir_khan\n\n \n "
[69] "\n \n NICKI MINAJ\n \n @NICKIMINAJ\n\n \n "
[70] "\n \n Coldplay\n \n @coldplay\n\n \n "
[71] "\n \n Avril Lavigne\n \n @AvrilLavigne\n\n \n "
[72] "\n \n Marshall Mathers\n \n @Eminem\n\n \n "
[73] "\n \n Chris Brown\n \n @chrisbrown\n\n \n "
[74] "\n \n BBC News (World)\n \n @BBCWorld\n\n \n "
[75] "\n \n Blake Shelton\n \n @blakeshelton\n\n \n "
[76] "\n \n President Trump\n \n @POTUS\n\n \n "
[77] "\n \n Ed Sheeran\n \n @edsheeran\n\n \n "
[78] "\n \n Deepika Padukone\n \n @deepikapadukone\n\n \n "
[79] "\n \n PMO India\n \n @PMOIndia\n\n \n "
[80] "\n \n Google\n \n @Google\n\n \n "
[81] "\n \n Akshay Kumar\n \n @akshaykumar\n\n \n "
[82] "\n \n ashton kutcher\n \n @aplusk\n\n \n "
[83] "\n \n Reuters Top News\n \n @Reuters\n\n \n "
[84] "\n \n Mariah Carey\n \n @MariahCarey\n\n \n "
[85] "\n \n National Geographic\n \n @NatGeo\n\n \n "
[86] "\n \n Ricky Martin\n \n @ricky_martin\n\n \n "
[87] "\n \n Leonardo DiCaprio\n \n @LeoDiCaprio\n\n \n "
[88] "\n \n د. محمد #العريفي\n \n @MohamadAlarefe\n\n \n "
[89] "\n \n PRIYANKA\n \n @priyankachopra\n\n \n "
[90] "\n \n Hrithik Roshan\n \n @iHrithik\n\n \n "
[91] "\n \n أحمد الشقيري\n \n @shugairi\n\n \n "
[92] "\n \n Snoop Dogg\n \n @SnoopDogg\n\n \n "
[93] "\n \n Vine Creators\n \n @VineCreators\n\n \n "
[94] "\n \n sachin tendulkar\n \n @sachin_rt\n\n \n "
[95] "\n \n AGNEZ MO\n \n @agnezmo\n\n \n "
[96] "\n \n Andrés Iniesta\n \n @andresiniesta8\n\n \n "
[97] "\n \n Hillary Clinton\n \n @HillaryClinton\n\n \n "
[98] "\n \n Alejandro Sanz\n \n @AlejandroSanz\n\n \n "
[99] "\n \n Christina Aguilera\n \n @xtina\n\n \n "
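The scraped strings above still carry the page's line breaks and padding. One way to tidy them, sketched here in base R, is to split each entry on newlines, trim the pieces, and keep the non-empty ones. The sketch assumes each entry holds exactly a display name followed by a handle, as in the output above; `raw_names`, `clean_entry`, and `accounts` are illustrative names, not objects from the original lesson.

```r
# Two entries copied from the output above, standing in for the full `names` vector
raw_names <- c(
  "\n \n KATY PERRY\n \n @katyperry\n\n \n ",
  "\n \n Justin Bieber\n \n @justinbieber\n\n \n "
)

# Split one entry on newlines, trim whitespace, and drop the empty pieces
clean_entry <- function(entry) {
  pieces <- trimws(strsplit(entry, "\n", fixed = TRUE)[[1]])
  pieces[nzchar(pieces)]
}

cleaned <- lapply(raw_names, clean_entry)

# Assumes piece 1 is the display name and piece 2 is the handle
accounts <- data.frame(
  name   = vapply(cleaned, function(p) p[1], character(1)),
  handle = vapply(cleaned, function(p) p[2], character(1)),
  stringsAsFactors = FALSE
)
```

Applying the same steps to the full `names` vector would give one row per account, with the name and handle in separate columns.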
All of these commands can be placed within loops to scrape many pages at once.
If you get SSL or OAuth errors, the site you are trying to scrape is probably blocking you.
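A loop over several pages might look like the sketch below. It assumes a hypothetical helper, `scrape_pages()`, that takes the fetching function as an argument, pauses between requests, and records a failed page as `NULL` rather than halting the whole loop. The stand-in `fake_fetch()` and the example.com URLs are placeholders so the sketch runs without a network connection.

```r
# Hypothetical helper: apply `fetch` to each URL, pausing between requests
scrape_pages <- function(urls, fetch, pause_seconds = 2) {
  results <- vector("list", length(urls))
  names(results) <- urls
  for (i in seq_along(urls)) {
    page <- tryCatch(
      fetch(urls[i]),
      error = function(e) {
        # Report the failure and carry on with the next URL
        message("Skipping ", urls[i], ": ", conditionMessage(e))
        NULL
      }
    )
    # Assign via list() so a NULL result keeps its slot instead of deleting it
    results[i] <- list(page)
    Sys.sleep(pause_seconds)
  }
  results
}

# Offline demonstration with a stand-in fetcher that fails on one URL
fake_fetch <- function(url) {
  if (grepl("blocked", url)) stop("SSL error")
  paste("contents of", url)
}

demo_urls <- c(
  "http://example.com/1",
  "http://example.com/blocked",
  "http://example.com/3"
)
demo <- scrape_pages(demo_urls, fake_fetch, pause_seconds = 0)
```

With rvest loaded, passing `fetch = read_html` would download and parse each page in turn. The pause keeps the request rate low, which makes it less likely that the site will block the scraper.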
Browser automation is another strategy for automatically extracting data from a website, and it is particularly useful for sites with a lot of JavaScript.
Try out the RSelenium package, which can:
a) load a URL in your browser
b) navigate around the page using keystrokes
c) download different types of data (e.g. .csv)