Assignments
Review
Intro to web scraping
Processing strings, including an intro to regular expressions
Data and data set transformations with dplyr
14 March 2016
Proposal for your Collaborative Research Project.
Deadline: 25 March
Submit: A (max) 2,000 word proposal created with R Markdown. The proposal will:
State your research question and justify why it is interesting.
Provide a basic literature review (properly cited with BibTeX).
Identify data sources and appropriate research methodologies for answering your question.
As always, submit the entire GitHub repo.
Purpose: Gather, clean, and analyse data
Deadline: TBD
You will submit a GitHub repo that:
Gathers web-based data from at least two sources. Cleans and merges the data so that it is ready for statistical analyses.
Conducts basic descriptive and inferential statistics with the data to address a relevant research question.
Briefly describes the results including with dynamically generated tables and figures.
Has a write-up of at most 1,500 words that describes the data gathering and analysis. It will also use literate programming.
This is ideally a good first run at the data gathering and analysis parts of your final project.
What is open public data?
What is a data API?
What are the characteristics of tidy data?
Why are unique observation IDs so important for data cleaning?
I don't expect you to master the tools of web scraping in this course.
I just want you to know that these things are possible, so that you know where to look in future work.
Web scraping simply means gathering data from websites.
Last class we learned a particular form of web scraping: downloading explicitly structured data files and accessing data APIs.
You can also download information that is not as well structured for statistical analysis:
HTML tables
Text on websites
Information that requires you to navigate through web forms
To really master web scraping you need a good knowledge of HTML.
The most basic tools for web scraping in R:
Look at the HTML for the webpage you want to scrape (e.g. use Inspect Element in Chrome).
Request a URL with read_html (rvest) or GET (httr).
Extract the specific content nodes from the request with html_nodes.
Convert the nodes to your desired R object type.
Clean the content (there are many tools for this suited to a variety of problems).
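For example, here is a minimal sketch of these steps that extracts paragraph text from a page. The URL and the 'p' selector are placeholders for illustration, not part of the exercise below:

library(rvest)

# Request and parse the page (hypothetical URL)
page <- read_html('http://example.com')

# Extract the <p> nodes and convert them to a character vector
paragraphs <- page %>%
    html_nodes('p') %>%
    html_text()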
Scrape the BBC's MPs' Expenses table.
HTML markup marks tables using <table> tags.
We can use these to extract tabular information and convert it into data frames.
In particular, we want the table tag with the id expenses_table. This will be the node that we want to extract.
library(rvest)
library(dplyr)

URL <- 'http://news.bbc.co.uk/2/hi/uk_news/politics/8044207.stm'

# Get and parse expenses_table from the webpage
ExpensesTable <- URL %>% read_html() %>%
    html_nodes('#expenses_table') %>%
    html_table() %>%
    as.data.frame()
Now we need to clean the ExpensesTable data frame.
head(ExpensesTable)[, 1:3]
## MP Party Seat
## 1 Abbott, Ms Diane LAB Hackney North & Stoke Newington
## 2 Adams, Mr Gerry SF West Belfast
## 3 Afriyie, Adam CON Windsor
## 4 Ainger, Nick LAB Carmarthen West & Pembrokeshire South
## 5 Ainsworth, Mr Peter CON Surrey East
## 6 Ainsworth, Rt Hon Bob LAB Coventry North East
GET is probably the most common RESTful API verb you will use when web scraping.
Another important verb to consider is POST, which allows you to fill in web forms. httr has a POST function.
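For instance, here is a minimal sketch of submitting a web form with POST. The URL and form field names are hypothetical:

library(httr)

# Fill in and submit a (made up) search form with two fields
response <- POST('http://example.com/search',
                 body = list(query = 'mp expenses', page = '1'),
                 encode = 'form')

# Extract the response body as text
content(response, as = 'text')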
A (frustratingly) large proportion of the time spent web scraping, and doing data cleaning generally, is taken up with processing strings.
Key tools for processing strings:
knowing your encoding and the iconv function in base R
grep, gsub, and related functions in base R
Regular expressions
the stringr package
Sometimes when you load text into R you will get weird symbols, like � (the replacement character), or other strange things will happen to the text.
NOTE: remember to always check your data when you import it!
This often happens when R is using the wrong character encoding.
All characters in a computer are encoded using some standardised system.
R can recognise latin1 and UTF-8.
latin1 is fairly limited (mostly to the latin alphabet)
UTF-8 covers a much wider range of characters in many languages
You may need to use the iconv function to convert a text to UTF-8 before trying to process it.
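For example, a minimal sketch (the string here is made up; \xe9 is 'é' in latin1):

# A latin1-encoded string
text_latin1 <- 'Caf\xe9'
Encoding(text_latin1) <- 'latin1'

# Convert it to UTF-8 before further processing
iconv(text_latin1, from = 'latin1', to = 'UTF-8')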
grep, gsub, and related functions
R (and many programming languages) have functions for identifying and manipulating strings.
grep stands for: Globally search a Regular Expression and Print
You can use grep and grepl to find patterns in a vector.
pets <- c('cats', 'dogs', 'a big snake')

grep(pattern = 'cat', x = pets)
## [1] 1
grepl(pattern = 'cat', pets)
## [1] TRUE FALSE FALSE
# Subset vector
pets[grep('cats', pets)]
## [1] "cats"
Use gsub to substitute strings.
gsub(pattern = 'big', replacement = 'small', x = pets)
## [1] "cats" "dogs" "a small snake"
Regular expressions are a powerful tool for finding and manipulating strings.
They are special characters that can be used to search for text.
For example:
find characters at only the beginning or end of a string
find characters that follow or are preceded by a particular character
find only the first or last occurrence of a character in a string
Many more possibilities.
Examples (modified from Robin Lovelace).
base <- c("cat16_24", "25_34cat", "35_44catch", "45_54Cat", "55_4fat$", 'colour', 'color')

# Find all 'cat' regardless of case
grep('cat', base, ignore.case = T)
## [1] 1 2 3 4
# Find only 'cat' at the end of the string with $
grep('cat$', base)
## [1] 2
# Find only 'cat' at the beginning of the string with ^
grep('^cat', base)
## [1] 1
# Find zero or one of the preceding character with ?
grep('colou?r', base)
## [1] 6 7
# Find one or more of the preceding character with +
grep('colou+r', base)
## [1] 6
# Find '$' with the escape character \
grep('\\$', base)
## [1] 5
# Find strings with any single character between 'c' and 'l' with .
grep('c.l', base)
## [1] 6 7
# Find a range of numbers with [ - ]
grep('[1-3]', base)
## [1] 1 2 3
# Find capital letters
grep('[A-Z]', base)
## [1] 4
Character | Use |
---|---|
$ | characters at the end of the string |
^ | characters at the beginning of the string |
? | zero or one of the preceding character |
* | zero or more of the preceding character |
+ | one or more of the preceding character |
\ | escape character, used to find strings containing special characters |
. | any single character |
[ - ] | a range of characters |
You can also find the cheat-sheet at: SyllabusAndLectures/Lecture7/README
The stringr package has many helpful functions that make dealing with strings a bit easier.
Remove leading and trailing whitespace (this can be a real problem when creating consistent variable values):
library(stringr)

str_trim(' hello ')
## [1] "hello"
Split strings (really useful for turning 1 variable into 2):
trees <- c('Jomon Sugi', 'Huon Pine')

str_split_fixed(trees, pattern = ' ', n = 2)
## [,1] [,2]
## [1,] "Jomon" "Sugi"
## [2,] "Huon" "Pine"
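A sketch of how you might use this to turn one variable into two in a data frame (the column names here are made up):

# Split the trees vector into two new data frame columns
tree_df <- data.frame(full_name = trees, stringsAsFactors = FALSE)
name_parts <- str_split_fixed(tree_df$full_name, pattern = ' ', n = 2)
tree_df$first_word <- name_parts[, 1]
tree_df$second_word <- name_parts[, 2]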
The dplyr package has powerful capabilities to manipulate data frames quickly (many of the functions are written in the compiled language C++).
It is also useful for transforming data from grouped observations, e.g. countries, households.
Set up for examples
# Create fake grouped data
library(randomNames)
library(dplyr)
library(tidyr)

people <- randomNames(n = 1000)
people <- sort(rep(people, 4))

year <- rep(2010:2013, 1000)

trend_income <- c(30000, 31000, 32000, 33000)
income <- replicate(trend_income + rnorm(4, sd = 20000), n = 1000) %>%
    data.frame() %>%
    gather(obs, value, X1:X1000)
income$value[income$value < 0] <- 0

data <- data.frame(people, year, income = income$value)
head(data)
## people year income
## 1 Abdalla-Lenox, Daniel 2010 25810.24
## 2 Abdalla-Lenox, Daniel 2011 56520.13
## 3 Abdalla-Lenox, Daniel 2012 54738.62
## 4 Abdalla-Lenox, Daniel 2013 34676.01
## 5 Abdikadir, Juan 2010 53907.63
## 6 Abdikadir, Juan 2011 63780.45
Select rows
higher_income <- filter(data, income > 60000)
head(higher_income)
## people year income
## 1 Abdikadir, Juan 2011 63780.45
## 2 Abdikadir, Juan 2012 62648.12
## 3 Acevedo Soto, Kendra 2012 66147.97
## 4 Acharya, Luis 2012 74540.36
## 5 Acosta-Garcia, Denise 2012 71184.41
## 6 Aguilar, Vivian 2011 64138.65
Select columns
people_income <- select(data, people, income)

# OR
people_income <- select(data, -year)

head(people_income)
## people income
## 1 Abdalla-Lenox, Daniel 25810.24
## 2 Abdalla-Lenox, Daniel 56520.13
## 3 Abdalla-Lenox, Daniel 54738.62
## 4 Abdalla-Lenox, Daniel 34676.01
## 5 Abdikadir, Juan 53907.63
## 6 Abdikadir, Juan 63780.45
Tell dplyr what the groups are in the data with group_by.
group_data <- group_by(data, people)
head(group_data)[1:5, ]
## Source: local data frame [5 x 3]
## Groups: people [2]
##
## people year income
## (fctr) (int) (dbl)
## 1 Abdalla-Lenox, Daniel 2010 25810.24
## 2 Abdalla-Lenox, Daniel 2011 56520.13
## 3 Abdalla-Lenox, Daniel 2012 54738.62
## 4 Abdalla-Lenox, Daniel 2013 34676.01
## 5 Abdikadir, Juan 2010 53907.63
Note: the following functions work on non-grouped data as well.
Now that we have declared the data as grouped, we can do operations on each group.
For example, we can extract the highest and lowest income years for each person:
min_max_income <- summarize(group_data,
                            min_income = min(income),
                            max_income = max(income))
head(min_max_income)[1:3, ]
## Source: local data frame [3 x 3]
##
## people min_income max_income
## (fctr) (dbl) (dbl)
## 1 Abdalla-Lenox, Daniel 25810.240 56520.13
## 2 Abdikadir, Juan 13289.512 63780.45
## 3 Abrams, Draven 1834.621 44883.84
We can sort the data using arrange.
# Sort highest income for each person in ascending order
ascending <- arrange(min_max_income, max_income)
head(ascending)[1:3, ]
## Source: local data frame [3 x 3]
##
## people min_income max_income
## (fctr) (dbl) (dbl)
## 1 Swift Bird, Jd 2990.411 12734.71
## 2 Scott, Jordan 0.000 14984.71
## 3 Lee, Emily 0.000 15502.94
Add desc to sort in descending order.
descending <- arrange(min_max_income, desc(max_income))
head(descending)[1:3, ]
## Source: local data frame [3 x 3]
##
## people min_income max_income
## (fctr) (dbl) (dbl)
## 1 Padilla, Kiyana 11896.271 99264.17
## 2 Rojas Duarte, Rukiya 30194.305 96694.43
## 3 Martinez, Keosomalee 3081.153 96161.78
summarize creates a new data frame with the summarised data.
We can use mutate to add new columns to the original data frame.
data <- mutate(group_data,
               min_income = min(income),
               max_income = max(income))
head(data)[1:3, ]
## Source: local data frame [3 x 5]
## Groups: people [1]
##
## people year income min_income max_income
## (fctr) (int) (dbl) (dbl) (dbl)
## 1 Abdalla-Lenox, Daniel 2010 25810.24 25810.24 56520.13
## 2 Abdalla-Lenox, Daniel 2011 56520.13 25810.24 56520.13
## 3 Abdalla-Lenox, Daniel 2012 54738.62 25810.24 56520.13
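Because each dplyr verb takes a data frame as its first argument, the steps above also chain together with %>%. A sketch combining them (same variables as above):

# Group, summarise, and sort in one pipeline
min_max_income <- data %>%
    group_by(people) %>%
    summarize(min_income = min(income),
              max_income = max(income)) %>%
    arrange(desc(max_income))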
Scrape and clean the Medal Table from http://www.bbc.com/sport/winter-olympics/2014/medals/countries.
Work on gathering data and cleaning for Assignment 3.