Feb 10, 2016

About Me

  • Data Scientist at Digital Roots
  • M.A. Applied Statistics, 2012, UM

Situation

  • The Consumer Electronics Show (CES) is an annual conference that takes place in Las Vegas at the beginning of January. My employer (Digital Roots) was a main social media provider for the event.
  • Many questions of interest:
      • What topics are people discussing?
      • What companies are getting buzz?
      • Which events are people at?
      • Which venues?
      • Customer assistance needs?
      • Public safety concerns?

Gathering industry types

  • Which industries are getting interest?
      • Wearables, 3D printing, Audio, Autonomous Vehicles, …
  • A mention of a company can give us a clue about what industries are discussed
  • To collect these categories for each of the ~3,800 companies at the show, we can scrape the CES website

About rvest

  • Authored by Hadley Wickham (who else…)
  • Wrappers around the 'xml2' and 'httr' packages to make it easy to download, then manipulate, HTML and XML

Extracting text from an HTML node

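# Return the text of every node on the page at `url` that matches the
# CSS selector `node`
extractTextFromNode <- function(node, url)
{
  require(rvest)
  
  text <- 
    url %>% 
    read_html() %>%      # download and parse the page
    html_nodes(node) %>% # keep the nodes matching the selector
    html_text()          # extract their text content
  
  return(text)
}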

What is that argument in html_nodes()?

  • A CSS selector: a pattern that picks out nodes based on how the webpage's HTML is structured (tag, class, id, …)
  • We need to figure out which node has the information we want
  • Using the SelectorGadget extension in Chrome is an easy way to find it
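
To make the selector idea concrete, here is a minimal sketch that runs against an inline HTML string instead of a live page. The markup and the class names `company-desc` and `category` are invented for illustration and are not taken from the CES site; only `read_html()`, `html_nodes()`, and `html_text()` are real rvest/xml2 functions.

```r
library(rvest)

# Hypothetical markup, invented for this illustration
page <- read_html('<html><body>
  <p class="company-desc">A connected wine device.</p>
  <ul>
    <li class="category">E-commerce</li>
    <li class="category">Smart Home/Appliances</li>
  </ul>
</body></html>')

# ".category" selects every node whose class includes "category"
categories <- page %>% html_nodes(".category") %>% html_text()

# "p.company-desc" selects only <p> nodes with class "company-desc"
description <- page %>% html_nodes("p.company-desc") %>% html_text()

# A selector that matches nothing gives a zero-length character vector
noMatch <- page %>% html_nodes(".no-such-class") %>% html_text()
```

A zero-length result is usually the first sign that the selector is wrong or that the page does not contain the text in its downloaded HTML.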

Examples

exhId <- "T0011542"
baseUrl <- "http://ces16.mapyourshow.com/7_0/exhibitor/exhibitor-details.cfm?exhid="
url <- paste0(baseUrl, exhId)

# Company description
descript = extractTextFromNode(".mys-taper-measure", url)
## Loading required package: rvest
## Loading required package: xml2
cat(descript)
##  
##                  D-Vine, the connected sommelier, is a wine tasting device that allows to enjoy wine by the glass in the perfect conditions of temperature and decantation in less than 1 minute. The data for each of our wines are contained in a microchip (RFID) located on our 10cl wine flacons.
##              

Examples

# Company categories
cats = extractTextFromNode(".mys-insideToggle", url)
cat(cats)
## 
##                                          
##                                              
##                                                  E-commerce
##                                              
##                                                  Other Consumer Technology 
##                                              
##                                                  Smart Home/Appliances
##                                              
##                                          
##                                      

End-to-end function

cesCompanyCategoryScraper <- function(exhId)
{
  require(rvest)

  # Build the exhibitor page URL from the exhibitor ID
  baseUrl <- "http://ces16.mapyourshow.com/7_0/exhibitor/exhibitor-details.cfm?exhid="
  url <- paste0(baseUrl, exhId)
  
  categoriesRaw <- extractTextFromNode(".mys-insideToggle", url)

  # Strip tabs, then split the text into one element per line
  categoriesClean <- categoriesRaw %>% 
    gsub(pattern = "\t", replacement = "") %>% 
    strsplit(split = "\r\n") %>%
    unlist()
  
  # Drop the empty strings left over from blank lines
  categoriesClean <- categoriesClean[categoriesClean != ""]
  
  # Return "" when the page lists no categories
  if(length(categoriesClean) == 0)
  {
    return("")
  }
  
  return(categoriesClean)
}

Extracting categories

id = "T0011542"
cesCompanyCategoryScraper(id)
## [1] "E-commerce"                 "Other Consumer Technology "
## [3] "Smart Home/Appliances"
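
Running the scraper over all ~3,800 exhibitor IDs means some requests are likely to fail. Below is a minimal sketch of one way to loop safely and then tally the categories. The names `scrapeAll` and `fakeScraper` and the `pause` argument are hypothetical, the IDs other than "T0011542" are invented, and a stub stands in for `cesCompanyCategoryScraper` so the example runs offline.

```r
# Apply a scraper function to each ID, recording NA when a request fails
scrapeAll <- function(ids, scraper, pause = 0)
{
  results <- lapply(ids, function(id)
  {
    Sys.sleep(pause)  # wait between requests to go easy on the server
    tryCatch(scraper(id), error = function(e) NA_character_)
  })
  names(results) <- ids
  return(results)
}

# Stub standing in for cesCompanyCategoryScraper (assumption, offline only)
fakeScraper <- function(id)
{
  if (id == "BAD") stop("page not found")
  switch(id,
         "T0011542" = c("E-commerce", "Smart Home/Appliances"),
         "X0000001" = c("Wearables", "E-commerce"))
}

categoriesById <- scrapeAll(c("T0011542", "X0000001", "BAD"), fakeScraper)

# Count how many companies list each category; table() skips NA by default
categoryCounts <- table(unlist(categoriesById))
```

With the real scraper, the call would look like `scrapeAll(ids, cesCompanyCategoryScraper, pause = 1)`, where the one-second pause is only a suggested starting point.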

Scraping tables

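# Return the table(s) matching the CSS selector `node` as one data frame
extractTableFromNode <- function(node, url)
{
  require(rvest)
  
  # html_table() returns a list with one data frame per matched node
  table <- url %>% 
      read_html() %>% 
      html_nodes(node) %>% 
      html_table(header = TRUE)
    
  # Bind the list into one data frame, column-wise (with several matches,
  # the tables must have the same number of rows)
  df <- do.call(cbind.data.frame, table)
  
  return(df)
}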
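
To see what `html_table()` hands back before the `do.call()` step, here is a minimal offline sketch. The inline table is a cut-down, hand-written stand-in for a hotel table (two rows, two columns), not the real page source.

```r
library(rvest)

# Hand-written stand-in for a page containing one table
page <- read_html('<table>
  <tr><th>Hotel Name</th><th>Tue Jan 5</th></tr>
  <tr><td>ARIA</td><td>Sold Out</td></tr>
  <tr><td>Bellagio</td><td>Sold Out</td></tr>
</table>')

# One list element per matched <table>; header = TRUE uses the first row
# as the column names
tables <- page %>% html_nodes("table") %>% html_table(header = TRUE)

# With a single match, the first element is the table we want
hotels <- tables[[1]]
```

Because the result is a list, a selector that matches exactly one table can also be unpacked with `tables[[1]]` instead of `do.call()`.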

Table example

url = "https://cesweb.org/hotel"
node = "table"
tableRaw = extractTableFromNode(node, url)

head(tableRaw)
##                 Hotel Name                            Tue Jan 5
## 1              Alexis Park Select hotel to view prevailing rate
## 2                     ARIA                             Sold Out
## 3        Bally's Las Vegas  Select hotel for exclusive CES rate
## 4                 Bellagio                             Sold Out
## 5 Caesars Palace Las Vegas  Select hotel for exclusive CES rate
## 6            Circus Circus                             Sold Out
##                              Wed Jan 6
## 1 Select hotel to view prevailing rate
## 2                             Sold Out
## 3  Select hotel for exclusive CES rate
## 4 Select hotel to view prevailing rate
## 5  Select hotel for exclusive CES rate
## 6                             Sold Out
##                             Thur Jan 7
## 1 Select hotel to view prevailing rate
## 2 Select hotel to view prevailing rate
## 3  Select hotel for exclusive CES rate
## 4 Select hotel to view prevailing rate
## 5  Select hotel for exclusive CES rate
## 6                             Sold Out
##                              Fri Jan 8
## 1 Select hotel to view prevailing rate
## 2 Select hotel to view prevailing rate
## 3  Select hotel for exclusive CES rate
## 4                                $407*
## 5  Select hotel for exclusive CES rate
## 6                                 $99*
##                              Sat Jan 9
## 1 Select hotel to view prevailing rate
## 2 Select hotel to view prevailing rate
## 3  Select hotel for exclusive CES rate
## 4                                $227*
## 5  Select hotel for exclusive CES rate
## 6                                 $53*
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Notes
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## 2                                                                                                                   *Rates shown include resort fee. \r\n        \r\n        \r\n          \r\n            \r\n          \r\n        \r\n        ARIA: $29 resort fee includes: Internet access (in-room and campus wide at City Center), local and toll-free telephone calls, access to the Spa at ARIA Fitness Center, daily newspaper, and airline boarding pass printing.
## 3                                                                                                                                               CES shuttle service not provided. Las Vegas Monorail accessible to/from the LVCC. \r\n        \r\n        \r\n          \r\n            \r\n          \r\n        \r\n        Bally's: $22 resort fee includes daily Fitness Center admission for two guests; daily in-room Internet access for two devices; all local calls. 
## 4                                                                                                                                              *Rates shown include resort fee. \r\n        \r\n        \r\n          \r\n            \r\n          \r\n        \r\n        Bellagio: $29 resort fee includes: access to Hotel Fitness Center, wireless internet in guest rooms, boarding pass printing in Hotel's Business Center, free local calls, and free toll free calls.
## 5                                                                                                                                                                                                 CES shuttle service not provided. \r\n        \r\n        \r\n          \r\n            \r\n          \r\n        \r\n        Caesars: $25 resort fee includes daily Fitness Center admission for two guests; daily in-room Internet access for two devices; all local calls.
## 6 *Rates shown include resort fee. \r\n        \r\n        \r\n          \r\n            \r\n          \r\n        \r\n        Circus Circus: $15 resort fee includes in-room wireless internet service (daily), one (1) free premium ride at Adventuredome, two (2) fitness passes (daily), buy 1, get 1 drink at Slots A Fun or West Bar, two (2) free Midway games, 800/local calls up to 30 minutes maximum per call (daily), resort funbook valued at approximately $100.
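
The Notes column above still carries runs of line breaks and spaces from the page source. Here is a minimal sketch of one way to tidy such text using base R only. The helper name `squishWhitespace` is hypothetical, and the sample string is abbreviated from the output above.

```r
# Collapse any run of whitespace (spaces, tabs, \r, \n) into one space,
# then strip leading and trailing whitespace
squishWhitespace <- function(x)
{
  trimws(gsub("\\s+", " ", x))
}

note <- "*Rates shown include resort fee. \r\n        \r\n        ARIA: $29 resort fee includes: Internet access"
cleaned <- squishWhitespace(note)
```

Applied to the scraped table, this would be `tableRaw$Notes <- squishWhitespace(tableRaw$Notes)`. The same helper would also remove the trailing space seen in "Other Consumer Technology " earlier.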

Summary

  • rvest makes the process of web scraping relatively simple
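  • Caution: sometimes the web text can't be gathered this way, for example when the content is rendered by JavaScript after the page loads and so is missing from the downloaded HTML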