Assignment info
Review
Benefits and challenges of open public data
Automatic Data Gathering
Tidying, Cleaning, and Merging data
29 February 2016
Avoid including `?FUNCTION`, `search`, `View`, and similar in your source files.
Avoid straight `install.packages` calls.
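One way to avoid straight `install.packages` calls is to guard them, so that sourcing or knitting a file never reinstalls packages that are already present. A minimal sketch (ggplot2 is just an example package):

```r
# Install ggplot2 only if it is not already installed, then load it
if (!requireNamespace('ggplot2', quietly = TRUE)) {
    install.packages('ggplot2')
}
library(ggplot2)
```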
Remember to set the working directory!
"dofile.R" is redundant. Use file names to indicate contents/position within file hierarchy.
Use human readable labels on your plots.
You can have R automatically switch the working directory using `set_valid_wd` from the repmis package.
```r
# Create list of commonly used working directories
possible_dir <- c('/git_repos/Assignment1', 'C:\\class\\Assignment1')

# Set to first valid directory in the possible_dir vector
repmis::set_valid_wd(possible_dir)
```
You can split a vector by values of another variable using `split`.
```r
library(dplyr)

split(sleep$extra, sleep$group) %>% boxplot()
```
Serve an HTML file hosted on GitHub by putting its raw version through https://rawgit.com/.
Proposal for your Collaborative Research Project.
Deadline: 25 March
Submit: A (max) 2,000-word proposal. The proposal will:
Be written in R Markdown.
State your research question and justify why it is interesting.
Provide a basic literature review (properly cited with BibTeX).
Identify data sources and appropriate research methodologies for answering your question.
As always, submit the entire GitHub repo.
Why is literate programming useful for reproducible research?
What is a markup language?
What is a code chunk?
What is BibTeX?
How is it an example of working hard so you can be lazy?
What is caching?
Janssen, Charalabidis, and Zuiderwijk (2012, 258):
"Non-privacy-restricted and non-confidential data which is produced with public money and is made available without any restrictions on its usage or distribution."
Greater returns on public investment in data gathering.
Better coordination within government.
Provides policy-makers with information needed to address complex problems.
"Mends the traditional separation between public organizations and users".
See Janssen, Charalabidis, and Zuiderwijk (2012, 261) for more.
Assumes that government considers outside views and information (including opposing views) as constructive.
Results in government giving up (some) control and more actively interacting with its environment.
Not only should data be published, but potential data users in society should be actively sought for input on improving government.
Lack of technological competence to implement open data that is useful.
Worry by bureaucrats that open data will be used to criticise them.
No incentives for users to access the data. Lack of skills needed to use and understand the data.
Balancing individual privacy concerns.
See Janssen, Charalabidis, and Zuiderwijk (2012, 262–63) for more.
Social science and public data is becoming increasingly open and accessible.
However, the level of accessibility varies:
use restrictions
format
documentation
version control
We are only going to begin scratching the surface of the data access obstacles you are likely to encounter.
Do as much data gathering and cleaning as possible in R scripts:
Fully document for reproducible research.
Can find (inevitable) mistakes.
Easy to update when the data is updated.
Can apply methods to other data sets.
Plain-text data (e.g. CSV) stored at non-secure (http) URL, not embedded in a larger HTML marked-up website.
Plain-text data (e.g. CSV) stored at secure (https) URL, not embedded in a larger HTML marked-up website.
Data stored in a database with a well structured API (Application Programming Interface), that has a corresponding R package.
Use `read.table` or `read.csv` (`read.csv` is just a wrapper for `read.table` with `sep = ','`).
Include the URL rather than the file path.
```r
read.csv('http://SOMEDATA.csv')
```
You can download and load data files stored in compressed formats.
Download the compressed file into a temporary file.
Uncompress the file and pass it to `read.table`, `import`, etc.
Load data from Pemstein, Meserve, and Melton (2010) in a file called uds_summary.csv.
```r
# For simplicity, store the URL in an object called 'URL'.
URL <- "http://bit.ly/1jXJgDh"

# Create a temporary file called 'temp' to put the zip file into.
temp <- tempfile()

# Download the compressed file into the temporary file.
download.file(URL, temp)

# Extract uds_summary.csv from the zip archive and read it into a data frame.
UDSData <- read.csv(unz(temp, "uds_summary.csv"))

# Delete the temporary file.
unlink(temp)
```
Use `source_data` from the repmis package.
Data on GitHub is stored at secure URLs. Select the RAW URL:
```r
URL <- paste0('https://raw.githubusercontent.com/christophergandrud/',
              'LegislativeViolence/master/Data/LegViolenceDescriptives.csv')

main <- repmis::source_data(URL)
```

```
## Downloading data from: https://raw.githubusercontent.com/christophergandrud/LegislativeViolence/master/Data/LegViolenceDescriptives.csv
## SHA-1 hash of the downloaded data file is:
## 01cff579b689cea9ef9c98e433ce3122745cc5cb
```
Data maintainers (unfortunately) often change data sets with little or no documentation.
`source_data` allows you to notice these changes by assigning each file a unique SHA-1 hash. Each download can be checked against the hash:
```r
main <- repmis::source_data(URL,
                            sha1 = '01cff579b689cea9ef9c98e433ce3122745cc5cb')
```

```
## Downloading data from: https://raw.githubusercontent.com/christophergandrud/LegislativeViolence/master/Data/LegViolenceDescriptives.csv
## Specified SHA-1 hash matches downloaded data file.
```
You can also use the rio package (but no SHA-1 hashing):

```r
main <- rio::import(URL)
```
The `source_XlsxData` function in repmis does the same thing as `source_data`, but for Excel files.
Builds on `read.xlsx` for loading locally stored Excel files.
Can also use `import` from rio (see the sketch below).
Note: Excel data often needs a lot of cleaning before it is useful for statistical/graphical analyses.
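For instance, a minimal sketch with rio (the URL is a placeholder):

```r
# Download and load a remotely stored Excel file with rio (placeholder URL)
excel_data <- rio::import('https://example.com/SOMEDATA.xlsx')
```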
`source_data` allows you to cache data with `cache = TRUE` (see the sketch below).
This is useful if you are downloading a large data set.
You can also cache data when you knit your R Markdown files.
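A minimal caching sketch, reusing the `URL` object from the legislative violence example above:

```r
# Download once and cache a local copy; subsequent calls in the same
# working directory load the cache rather than re-downloading
main <- repmis::source_data(URL, cache = TRUE)
```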
API = Application Programming Interface, a documented way for programs to talk to each other.
Data API = a documented way to access data from one program stored with another.
R can interact with most data APIs using the httr package (see the sketch after this list).
Even easier: users have written API-specific packages to interact with particular data APIs.
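To make the httr idea concrete, here is a minimal sketch. The World Bank request URL and query parameters are illustrative assumptions, not a documented recipe:

```r
library(httr)

# Request an indicator series from a web API and parse the JSON response
response <- GET('http://api.worldbank.org/v2/country/all/indicator/EG.USE.COMM.CL.ZS',
                query = list(format = 'json'))
parsed <- content(response, as = 'parsed')
```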
Access the World Bank's Development Indicators with the WDI package.
Alternative Energy Use example:
```r
# Load WDI package
library(WDI)

# Per country alternative energy use as % of total energy use
AltEnergy <- WDI(indicator = 'EG.USE.COMM.CL.ZS')
```
Note: The indicator ID is at the end of the indicator's URL on the World Bank site.
The quantmod package allows you to access financial data from a variety of sources (e.g. Yahoo Finance, Google Finance, US Federal Reserve's FRED database).
```r
library(quantmod)

# Download Yen/USD exchange rate
YenDollar <- getSymbols(Symbols = 'DEXJPUS', src = 'FRED')
```
There are many more R packages that interact with web data APIs.
For a good beginner list see: http://cran.r-project.org/web/views/WebTechnologies.html
Format | R packages
---|---
Excel | Try to save as CSV, otherwise xlsx, rio
Stata, SPSS, SAS | foreign, rio
JSON | jsonlite
MySQL | RMySQL
couchDB | R4CouchDB
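For example, a minimal jsonlite sketch (the URL is a placeholder):

```r
library(jsonlite)

# Read a JSON file or URL directly into an R object (placeholder URL)
dat <- fromJSON('http://example.com/SOMEDATA.json')
```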
The data you need for your analysis is often not clean.
Perhaps 80% of data analysis time is typically spent cleaning and preparing data (Dasu and Johnson 2003).
To help streamline this process Wickham (2014) laid out principles of data tidying.
Many (not all) statistical data sets are organised into rows and columns.
Rows and columns have no inherent meaning.
Person | treatmentA | treatmentB
---|---|---
John Smith | | 2
Jane Doe | 16 | 11
Mary Johnson | 3 | 1
Treatment | John Smith | Jane Doe | Mary Johnson
---|---|---|---
treatmentA | | 16 | 3
treatmentB | 2 | 11 | 1
Data sets are collections of values.
All values are assigned to a variable and an observation.
Variable: all values measuring the same attribute across units
Observation: all values measured within the same unit across attributes.
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
Person | treatment | result
---|---|---
John Smith | a |
Jane Doe | a | 16
Mary Johnson | a | 3
John Smith | b | 2
Jane Doe | b | 11
Mary Johnson | b | 1
First identify what your observations and variables are.
Then use R tools to convert your data into this format.
tidyr and its predecessor reshape2 are particularly useful.
```r
# Create messy (wide) data
messy <- data.frame(
    person = c("John Smith", "Jane Doe", "Mary Johnson"),
    a = c(NA, 16, 3),
    b = c(2, 11, 1)
)

messy
```

```
##         person  a  b
## 1   John Smith NA  2
## 2     Jane Doe 16 11
## 3 Mary Johnson  3  1
```
```r
library(tidyr)

# Gather the data into tidy long format
tidy <- gather(messy, treatment, result, a:b)

tidy
```

```
##         person treatment result
## 1   John Smith         a     NA
## 2     Jane Doe         a     16
## 3 Mary Johnson         a      3
## 4   John Smith         b      2
## 5     Jane Doe         b     11
## 6 Mary Johnson         b      1
```
Sometimes it is useful to reverse this operation with `spread`.
```r
messyAgain <- spread(data = tidy, key = treatment, value = result)

messyAgain
```

```
##         person  a  b
## 1     Jane Doe 16 11
## 2   John Smith NA  2
## 3 Mary Johnson  3  1
```
Always look at and poke your data.
For example, see if:
Missing values are designated with `NA`.
Variable classes are what you expect them to be.
Distributions are what you expect them to be.
The testdat package can be useful for this.
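Base R's `str` and `summary` also give a quick first look at classes, distributions, and missing values. For example, with the built-in sleep data used earlier:

```r
str(sleep)      # variable classes and dimensions
summary(sleep)  # distributions and NA counts per variable
```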
Once you have tidy data frames, you can merge them for analysis.
In general: each observation must have a unique identifier to merge them on.
These identifiers must match exactly across the data frames.
```r
tail(AltEnergy, n = 3)
```

```
##      iso2c  country EG.USE.COMM.CL.ZS year
## 1741    ZW Zimbabwe          4.858643 2007
## 1742    ZW Zimbabwe          4.666912 2006
## 1743    ZW Zimbabwe          4.396196 2005
```
```r
tail(UDSData, n = 3)
```

```
##            country year cowcode      mean        sd    median     pct025
## 9135 Western Samoa 2006     990 0.2485397 0.2155926 0.2474691 -0.1702781
## 9136 Western Samoa 2007     990 0.2439135 0.2151686 0.2487054 -0.1874372
## 9137 Western Samoa 2008     990 0.2407623 0.2192563 0.2443505 -0.1881970
##         pct975
## 9135 0.6648156
## 9136 0.6571918
## 9137 0.6622615
```
The unique identifier will be the ISO two-letter country code (`iso2c`) and year.
Use the countrycode package to convert the UDS data's Correlates of War country code (`cowcode`) to `iso2c`.
```r
library(countrycode)

# Assign iso2c codes based on Correlates of War codes
UDSData$iso2c <- countrycode(UDSData$cowcode, origin = 'cown',
                             destination = 'iso2c', warn = TRUE)
```
NOTE: Always check the data to make sure the correct codes have been applied!
countrycode, of course, only works for standardising country IDs.
Other packages can be useful for standardising other unit IDs.
For example, `geocode` from ggmap can be used to find latitudes and longitudes for other geographic units:
```r
places <- c('Bavaria', 'Seoul', '6 Pariser Platz, Berlin')

ggmap::geocode(places, source = 'google')
```

```
##         lon      lat
## 1  11.49789 48.79045
## 2 126.97797 37.56654
## 3  13.37854 52.51701
```
Time units may be important components of observation IDs.
Use the lubridate package to standardise dates.
```r
library(lubridate)

# Create time data
times <- c('Sep. 17 1980', 'March 23 2000', 'Nov. 3 2003')

mdy(times)
```

```
## [1] "1980-09-17 UTC" "2000-03-23 UTC" "2003-11-03 UTC"
```
Note: dates should always go from the longest to the shortest unit (year, month, day). This makes them sortable, as the sketch below shows.
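For example, dates in year-month-day order sort correctly even as plain character strings:

```r
sort(c('2003-11-03', '1980-09-17', '2000-03-23'))
```

```
## [1] "1980-09-17" "2000-03-23" "2003-11-03"
```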
```r
# Keep only desired variables
UDSData <- UDSData[, c('iso2c', 'year', 'median')]

names(UDSData)
```

```
## [1] "iso2c"  "year"   "median"
```
```r
Combined <- merge(AltEnergy, UDSData, by = c('iso2c', 'year'))

head(Combined, n = 3)
```

```
##   iso2c year country EG.USE.COMM.CL.ZS    median
## 1    AD 2005 Andorra                NA 0.7785676
## 2    AD 2006 Andorra                NA 0.7860388
## 3    AD 2007 Andorra                NA 0.7853250
```
By default, only observations in both data sets are kept. Use `all`, `all.x`, or `all.y` to keep non-merged observations.
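For example, a left-join-style merge that keeps every AltEnergy observation is a one-argument change (a minimal sketch):

```r
# Keep all AltEnergy rows; unmatched UDSData values become NA
CombinedAll <- merge(AltEnergy, UDSData, by = c('iso2c', 'year'),
                     all.x = TRUE)
```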
Always check your data after a merge to see if you did what you wanted to do!
You may want to do some post-merge cleaning. For example, assign new variable names:
```r
names(Combined) <- c('iso2c', 'year', 'country', 'alt_energy_use',
                     'uds_median')
```
or
```r
Combined <- dplyr::rename(Combined, new_year = year)
```
And reorder variables:
```r
Combined <- DataCombine::MoveFront(Combined, 'country')
```
Thinking of your pair research project, write an R script to download two or more data sets from the web.
Either in the same script or a linked R script, clean and merge the data.
Dasu, Tamraparni, and Theodore Johnson. 2003. Exploratory Data Mining and Data Cleaning. Hoboken, NJ: John Wiley & Sons.
Janssen, Marijn, Yannis Charalabidis, and Anneke Zuiderwijk. 2012. “Benefits, Adoption Barriers and Myths of Open Data and Open Government.” Information Systems Management 29 (4): 258–68.
Pemstein, Daniel, Stephen A. Meserve, and James Melton. 2010. “Democratic Compromise: A Latent Variable Analysis of Ten Measures of Regime Type.” Political Analysis 18 (4): 426–49.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.