Assignments
Review
Intro to web scraping
Processing strings, including an intro to regular expressions
Data and data set transformations with dplyr
16 October 2015
Proposal for your Collaborative Research Project.
Deadline: 23 October
Submit: A proposal of no more than 2,000 words, written in R Markdown. The proposal will:
State your research question and justify why it is interesting.
Provide a basic literature review (properly cited with BibTeX).
Identify data sources and appropriate research methodologies for answering your question.
As always, submit the entire GitHub repo.
Assignment 3. Purpose: Gather, clean, and analyse data
Deadline: 13 November 2015
You will submit a GitHub repo that:
Gathers web-based data from at least two sources.
Cleans and merges the data so that it is ready for statistical analyses.
Conducts basic descriptive and inferential statistics with the data to address a relevant research question.
Briefly describes the results, including dynamically generated tables and figures.
Has a write-up of 1,500 words maximum that describes the data gathering and analysis, using literate programming.
Ideally, this will be a good first run at the data gathering and analysis parts of your final project.
What is open public data?
What is a data API?
What are the characteristics of tidy data?
Why are unique observation IDs so important for data cleaning?
I don't expect you to master the tools of web scraping in this course.
I just want you to know that these things are possible, so that you know where to look in future work.
Web scraping simply means gathering data from websites.
Last class we learned a particular form of web scraping: downloading explicitly structured data files, including from data APIs.
You can also download information that is not as well structured for statistical analysis:
HTML tables
Text on websites
Information that requires you to navigate through web forms
To really master web scraping you need a good knowledge of HTML.
The most basic tools for web scraping in R:
httr: gather data + simple parsing
XML: more advanced parsing
Also take a look at rvest. It is a new package that aims to implement features from Python's popular Beautiful Soup.
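As a preview, here is a minimal sketch of how rvest could extract the same BBC expenses table we scrape with httr + XML below (function names assume a recent version of rvest):
library(rvest)

# Parse the page, select the table by its id, convert to a data frame
bbc <- read_html('http://news.bbc.co.uk/2/hi/uk_news/politics/8044207.stm')
expenses <- bbc %>%
    html_node('#expenses_table') %>%  # CSS selector for the table's id
    html_table()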
Look at the HTML for the webpage you want to scrape (e.g. use Inspect Element in Chrome).
Request a URL with GET.
Extract the content from the request with content, either as plain text (as = 'text') or parsed into an R object (as = 'parsed').
Clean the content (there are many tools for this, suited to a variety of problems).
Example: scrape the BBC's MPs' Expenses table.
HTML markup marks tables using <table> tags.
We can use these tags to extract tabular information and convert it into data frames.
In particular, we want the table tag with the id expenses_table.
library(httr)
library(dplyr)
library(XML)
URL <- 'http://news.bbc.co.uk/2/hi/uk_news/politics/8044207.stm'
# Get and parse all tables on the webpage
tables <- URL %>% GET() %>%
content(as = 'parsed') %>%
readHTMLTable()
names(tables)
## [1] "NULL"           "NULL"           "NULL"           "NULL"
## [5] "expenses_table" "NULL"           "NULL"
Now we just need to subset the tables list for the expenses_table data frame.
ExpensesTable <- tables[[5]]
head(ExpensesTable)[, 1:3]
##                      MP Party                                  Seat
## 1      Abbott, Ms Diane   LAB       Hackney North & Stoke Newington
## 2       Adams, Mr Gerry    SF                          West Belfast
## 3         Afriyie, Adam   CON                               Windsor
## 4          Ainger, Nick   LAB Carmarthen West & Pembrokeshire South
## 5   Ainsworth, Mr Peter   CON                           Surrey East
## 6 Ainsworth, Rt Hon Bob   LAB                   Coventry North East
GET is probably the most common RESTful verb you will use when web scraping.
Another important verb to consider is POST, which allows you to fill in web forms. httr has a POST function.
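A minimal sketch of submitting a form with POST (the URL and field names here are hypothetical placeholders, not a real service):
library(httr)

# Hypothetical search form: URL and field names are placeholders
response <- POST('http://example.com/search',
                 body = list(query = 'mp expenses', year = '2009'),
                 encode = 'form')  # submit as a standard web form

# Then extract the returned page as before
results <- content(response, as = 'text')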
A (frustratingly) large proportion of the time spent web scraping, and doing data cleaning generally, is taken up with processing strings.
Key tools for processing strings:
knowing your encoding and the iconv function in base R
grep, gsub, and related functions in base R
Regular expressions
stringr package
Sometimes when you load text into R you will get weird symbols like � (the replacement character), or other strange things will happen to the text.
NOTE: remember to always check your data when you import it!
This often happens when R is using the wrong character encoding.
All characters in a computer are encoded using some standardised system.
R can recognise latin1 and UTF-8.
latin1 is fairly limited (mostly to the latin alphabet)
UTF-8 covers a much wider range of characters in many languages
You may need to use the iconv function to convert text to UTF-8 before trying to process it.
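For example, a minimal sketch converting a latin1 string to UTF-8 (this assumes you know, or can guess, the source encoding):
# 'café' with the é stored as the latin1 byte 0xe9
raw_text <- 'caf\xe9'

# Convert from latin1 to UTF-8
utf8_text <- iconv(raw_text, from = 'latin1', to = 'UTF-8')
utf8_text

## [1] "café"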
grep, gsub, and related functions
R (and many programming languages) have functions for identifying and manipulating strings.
grep stands for: Globally search a Regular Expression and Print
You can use grep and grepl to find patterns in a vector.
pets <- c('cats', 'dogs', 'a big snake')
grep(pattern = 'cat', x = pets)
## [1] 1
grepl(pattern = 'cat', pets)
## [1] TRUE FALSE FALSE
# Subset vector
pets[grep('cats', pets)]
## [1] "cats"
Use gsub to substitute strings.
gsub(pattern = 'big', replacement = 'small', x = pets)
## [1] "cats" "dogs" "a small snake"
Regular expressions are a powerful tool for finding and manipulating strings.
They are special characters that can be used to search for text.
For example:
find characters at only the beginning or end of a string
find characters that follow or are preceded by a particular character
find only the first or last occurrence of a character in a string
Many more possibilities.
Examples (modified from Robin Lovelace).
base <- c("cat16_24", "25_34cat", "35_44catch",
"45_54Cat", "55_4fat$", 'colour', 'color')
# Find all instances of 'cat', regardless of case
grep('cat', base, ignore.case = TRUE)
## [1] 1 2 3 4
# Find only 'cat' at the end of the string with $
grep('cat$', base)
## [1] 2
# Find only 'cat' at the beginning of the string with ^
grep('^cat', base)
## [1] 1
# Find zero or one of the preceding character with ?
grep('colou?r', base)
## [1] 6 7
# Find one or more of the preceding character with +
grep('colou+r', base)
## [1] 6
# Find '$' with the escape character \
grep('\\$', base)
## [1] 5
# Find strings with any single character between 'c' and 'l' with .
grep('c.l', base)
## [1] 6 7
# Find a range of numbers with [ - ]
grep('[1-3]', base)
## [1] 1 2 3
# Find capital letters
grep('[A-Z]', base)
## [1] 4
| Character | Use |
|---|---|
| $ | characters at the end of the string |
| ^ | characters at the beginning of the string |
| ? | zero or one of the preceding character |
| * | zero or more of the preceding character |
| + | one or more of the preceding character |
| \ | escape character, used to find characters that are themselves expression operators |
| . | any single character |
| [ - ] | a range of characters |
You can also find the cheat-sheet at: SyllabusAndLectures/Lecture7/README
The stringr package has many helpful functions that make dealing with strings a bit easier.
Remove leading and trailing whitespace (this can be a real problem when creating consistent variable values):
library(stringr)
str_trim(' hello ')
## [1] "hello"
Split strings (really useful for turning 1 variable into 2):
trees <- c('Jomon Sugi', 'Huon Pine')
str_split_fixed(trees, pattern = ' ', n = 2)
##      [,1]    [,2]
## [1,] "Jomon" "Sugi"
## [2,] "Huon"  "Pine"
The dplyr package has powerful capabilities to manipulate data frames quickly (many of the functions are written in the compiled language C++).
It is especially useful for transforming data with grouped observations, e.g. repeated observations of countries or households.
Set up for examples
# Create fake grouped data
library(randomNames)
library(dplyr)
library(tidyr)
people <- randomNames(n = 1000)
people <- sort(rep(people, 4))
year <- rep(2010:2013, 1000)
trend_income <- c(30000, 31000, 32000, 33000)
income <- replicate(trend_income + rnorm(4, sd = 20000),
n = 1000) %>%
data.frame() %>%
gather(obs, value, X1:X1000)
income$value[income$value < 0] <- 0
data <- data.frame(people, year, income = income$value)
head(data)
##              people year   income
## 1 Abastta, Shenique 2010     0.00
## 2 Abastta, Shenique 2011 31577.28
## 3 Abastta, Shenique 2012 90623.35
## 4 Abastta, Shenique 2013 23622.05
## 5    Abdella, Vince 2010 29019.99
## 6    Abdella, Vince 2011 31717.75
Select rows
higher_income <- filter(data, income > 60000)
head(higher_income)
##              people year   income
## 1 Abastta, Shenique 2012 90623.35
## 2   Achziger, Grant 2012 88947.39
## 3   Achziger, Grant 2013 69768.27
## 4    Adams, Timothy 2012 76421.76
## 5      Adkins, Asha 2013 69254.91
## 6    Agrawal, Lorne 2013 74282.24
Select columns
people_income <- select(data, people, income)
# OR
people_income <- select(data, -year)
head(people_income)
##              people   income
## 1 Abastta, Shenique     0.00
## 2 Abastta, Shenique 31577.28
## 3 Abastta, Shenique 90623.35
## 4 Abastta, Shenique 23622.05
## 5    Abdella, Vince 29019.99
## 6    Abdella, Vince 31717.75
Tell dplyr what the groups are in the data with group_by.
group_data <- group_by(data, people)
head(group_data)[1:5, ]
## Source: local data frame [5 x 3]
## Groups: people [2]
##
##              people  year   income
##              (fctr) (int)    (dbl)
## 1 Abastta, Shenique  2010     0.00
## 2 Abastta, Shenique  2011 31577.28
## 3 Abastta, Shenique  2012 90623.35
## 4 Abastta, Shenique  2013 23622.05
## 5    Abdella, Vince  2010 29019.99
Note: the following functions work on non-grouped data as well.
Now that we have declared the data as grouped, we can do operations on each group.
For example, we can extract the highest and lowest income years for each person:
min_max_income <- summarize(group_data,
min_income = min(income),
max_income = max(income))
head(min_max_income)[1:3, ]
## Source: local data frame [3 x 3]
##
##              people min_income max_income
##              (fctr)      (dbl)      (dbl)
## 1 Abastta, Shenique      0.000   90623.35
## 2    Abdella, Vince  14377.565   38444.78
## 3   Abrego, Vanessa   3289.616   49202.10
We can sort the data using arrange.
# Sort highest income for each person in ascending order
ascending <- arrange(min_max_income, max_income)
head(ascending)[1:3, ]
## Source: local data frame [3 x 3]
##
##                    people min_income max_income
##                    (fctr)      (dbl)      (dbl)
## 1 Schneeberger, Gabrielle   2592.038   12531.61
## 2        Solarin, Katiery   2596.987   16622.67
## 3    Kellenberger, Destin   8350.077   16827.18
Add desc to sort in descending order.
descending <- arrange(min_max_income, desc(max_income))
head(descending)[1:3, ]
## Source: local data frame [3 x 3]
##
##            people  min_income max_income
##            (fctr)       (dbl)      (dbl)
## 1 Ambaye, Brandon  6040.89577  104822.55
## 2    Reddy, Bruno    67.78868   98055.28
## 3    Bryant, Anna 45377.06137   97155.70
summarize creates a new data frame with the summarised data.
We can use mutate to add new columns to the original data frame.
data <- mutate(group_data,
min_income = min(income),
max_income = max(income))
head(data)[1:3, ]
## Source: local data frame [3 x 5]
## Groups: people [1]
##
##              people  year   income min_income max_income
##              (fctr) (int)    (dbl)      (dbl)      (dbl)
## 1 Abastta, Shenique  2010     0.00          0   90623.35
## 2 Abastta, Shenique  2011 31577.28          0   90623.35
## 3 Abastta, Shenique  2012 90623.35          0   90623.35
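These verbs also chain together with the %>% pipe used earlier. A minimal sketch, using the same fake data, that groups, summarises, and sorts in one pipeline:
# Highest income year per person, sorted in descending order
top_income <- data %>%
    group_by(people) %>%
    summarize(max_income = max(income)) %>%
    arrange(desc(max_income))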
Scrape and clean the Medal Table from http://www.bbc.com/sport/winter-olympics/2014/medals/countries.
Work on gathering data and cleaning for Assignment 3.
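If you get stuck, a possible starting point for the medal table exercise reuses the httr + XML pattern from the lecture (which of the parsed tables is the medal table, and whether the page still serves static HTML tables, is something you will need to inspect yourself):
library(httr)
library(dplyr)
library(XML)

URL <- 'http://www.bbc.com/sport/winter-olympics/2014/medals/countries'

# Get and parse all tables on the page, then inspect names(medal_tables)
medal_tables <- URL %>% GET() %>%
    content(as = 'parsed') %>%
    readHTMLTable()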