Learning Objectives
* Provide an example that demonstrates how to read in a flat file, perform minimal data cleaning and tidying using dplyr and tidyr, introduce dataframes and why their structure matters, generate basic data visualization examples using ggplot2, and using a linear regression for analyzing data.
* The introductory example demonstrates the use of RMarkdown to generate a single reproducible report. _*In no way should the objective of this example be taken seriously, nor do they reflect the views of the instructors or organization sponsors of this workshop._
The feline species are found everywhere in urban areas. Although the vast population of felines do not have “thumbs” to manipulate a steering wheel, increasing research and developement of autonomous vehicles might render their anatomical barrier moot. In preparation of the eventual takeover of autonomous vehicles do we see their infilitration to human households increasing? If so, the objective of this study is to first investigate if there’s any relationship between the number of cat adoptions with volume of traffic.
Monthly traffic volume data were downloaded from the PORTAL website (https://portal.its.pdx.edu) for the Alberta to NB I-5 station for all years available.
Yearly animal services trends are reported by Multnomah County Animal Services (MCAS) (https://multcopets.org/reports-and-agency-statistics). All reporting by MCAS is presented in a non machine readable format and has been converted into a csv file prior to use for this example. Data reported by MCAS is aggregated by fiscal year (July through June).
A fair amount of data cleanup happens prior to any analysis; however, with R and packages like Tidyverse make the process of data wrangling less painful. The process of cleaning and tidying data is recorded in each line of code that is written. This makes it easier and faster to catch and fix mistakes for large and small datasets. Below is an example of working with a fairly tidy dataset.
# Load libraries needed for tidying data.
library(tidyverse)
library(lubridate)
library(DT)
# Read in data files.
alberta_raw <- read.csv("data/I5-NB-Alberta.csv", stringsAsFactors = F, strip.white = T)
cats_raw <- read.csv("data/cats.csv", stringsAsFactors = F, strip.white = T)
# Examine traffic data structure.
str(alberta_raw)
## 'data.frame': 223 obs. of 3 variables:
## $ starttime: chr "1/1/2000 12:00 am" "2/1/2000 12:00 am" "3/1/2000 12:00 am" "4/1/2000 12:00 am" ...
## $ speed : num NA NA NA NA NA NA NA NA NA NA ...
## $ volume : int NA NA NA NA NA NA NA NA NA NA ...
# Delete rows of incomplete data
alberta <- alberta_raw[complete.cases(alberta_raw),]
# Convert "starttime" into class POSIXct
alberta$starttime <- mdy_hm(alberta$starttime, tz = "America/Los_Angeles")
# Check for completeness of data
alberta_year <- alberta %>%
mutate(year = year(starttime))
table(alberta_year$year)
##
## 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2018
## 1 12 12 12 12 12 12 12 12 12 12 6 5
# Subset (filter) data by starttime for complete range of continous counts
alberta_fy <- alberta_year %>%
filter(starttime > "2005-06-30" & starttime < "2015-07-01")
# Creative way of aggregating volume by fiscal year, July through June
alberta_fy$fy.adjust <- seq(ymd("2005-01-01"), ymd("2014-12-31"), by = "month")
alberta_fy <- alberta_fy %>%
group_by(Fiscal.Year = year(fy.adjust)) %>%
summarise(FY.Volume = sum(volume))
alberta_fy
## # A tibble: 10 x 2
## Fiscal.Year FY.Volume
## <dbl> <int>
## 1 2005 8534
## 2 2006 9811
## 3 2007 10419
## 4 2008 10586
## 5 2009 10958
## 6 2010 10644
## 7 2011 9938
## 8 2012 8785
## 9 2013 7317
## 10 2014 7773
# Examine cat data
str(cats_raw)
## 'data.frame': 37 obs. of 3 variables:
## $ Fiscal.Year : int 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 ...
## $ Total.Live.Intake: int 4224 4963 5086 1013 3811 4012 3999 3992 3994 5000 ...
## $ Adopt : int NA NA 212 224 195 200 252 238 266 311 ...
# Filter cat data for available years corresponding to traffic data.
cats <- cats_raw %>%
filter(between(Fiscal.Year, 2005, 2014))
cats
## Fiscal.Year Total.Live.Intake Adopt
## 1 2005 5175 1143
## 2 2006 5208 956
## 3 2007 5567 1012
## 4 2008 5461 1062
## 5 2009 5063 837
## 6 2010 5302 732
## 7 2011 5210 732
## 8 2012 4324 711
## 9 2013 3684 623
## 10 2014 3954 961
There are a number of visualization packages for R that allow the user to creating and customizing simple static figures to creating and customizing interactive maps and applications. Below are some basic examples of generating plots to examine the cat and traffic data.
Prior to visualizing data a few additional steps of tidying data are required. Traffic volume and cat adoptions were initially two separate data files that were cleaned. The next step would be to combine the two datasets into a useable dataframe.
# There are many ways of combining datasets, the follow example uses dplyr::left_join
alberta_cats <- left_join(alberta_fy, cats, by = "Fiscal.Year")
alberta_cats$Fiscal.Year <- as.character(alberta_cats$Fiscal.Year)
alberta_cats
## # A tibble: 10 x 4
## Fiscal.Year FY.Volume Total.Live.Intake Adopt
## <chr> <int> <int> <int>
## 1 2005 8534 5175 1143
## 2 2006 9811 5208 956
## 3 2007 10419 5567 1012
## 4 2008 10586 5461 1062
## 5 2009 10958 5063 837
## 6 2010 10644 5302 732
## 7 2011 9938 5210 732
## 8 2012 8785 4324 711
## 9 2013 7317 3684 623
## 10 2014 7773 3954 961
# transforming data into another useable dataframe structure
trans_cats <- alberta_cats %>%
gather("variable", "count", 2:4)
trans_cats
## # A tibble: 30 x 3
## Fiscal.Year variable count
## <chr> <chr> <int>
## 1 2005 FY.Volume 8534
## 2 2006 FY.Volume 9811
## 3 2007 FY.Volume 10419
## 4 2008 FY.Volume 10586
## 5 2009 FY.Volume 10958
## 6 2010 FY.Volume 10644
## 7 2011 FY.Volume 9938
## 8 2012 FY.Volume 8785
## 9 2013 FY.Volume 7317
## 10 2014 FY.Volume 7773
## # ... with 20 more rows
There are now two dataframes, alberta_cats and trans_cats, that contain the same data but organized in two different ways. The difference of having dataframes arranged differently becomes more apparent when trying to do different types of visualizations and analyses.
# Fiscal year trends of total cat intake, cat adoptions, and car volume over time using trans_cats
fy_trends_fig <- trans_cats %>%
ggplot(aes(x = Fiscal.Year, y = count)) +
geom_bar(stat = "identity") +
facet_grid(variable ~ ., scales = "free") +
ggtitle("Figure 1. Cat adoptions, volume of cars (FY.Volume), and total live intake of cats (Total.Live.Intake)") +
theme(plot.title = element_text(size = 8))
fy_trends_fig
Figure 1 shows the number of total live cat intakes, total cat adoptions, and car volume for the I-5 Alberta highway station per fiscal year. From FY2010 through FY2013, figure 1 shows a gradual decrease across all variables and then an observable increase across all variables in 2014. Figures 2 and 3, below, plot the relationships of adoption and total live intake with car volume.
# Plotting relationship between cat adoptions and volume of cars uses the alberta_cats dataframe
cars_adopt_figure <- catplot(xs = alberta_cats$Adopt, ys = alberta_cats$FY.Volume,
cat = 5, catcolor = "#33FCFF", size = 0.2,
canvas = c(-0.1, 1.8, -0.1, 1.2),
xlab = "Cat adoptions", ylab = "Volume of cars",
main = "Figure 2. Cat adoptions versus volume of cars")
# Plotting relationship between intake of live cats and volume of cars uses the alberta_cats dataframe
cars_adopt_figure <- alberta_cats %>%
ggplot(aes(x = Total.Live.Intake, y = FY.Volume)) +
geom_point() +
labs(title = "Figure 3. Total live intake of cats versus volume of cars",
x = "Total live intake of cats", y = "Volume of cars") +
theme(plot.title = element_text(size = 8))
cars_adopt_figure
Figure 2 shows zero relationship between cat adoptions and volume of cars. Figure 3 suggests that there could be a possible relationship between total live intake of cats and volume of cars.
To test if there’s a significant relationship between total live intake and volume of vehicles, and disregarding any statistical assumptions about normality, heteroscedasity, independent sample, sampling size, and data being representive of the population, we implement a simple linear regression:
# Correlation using alberta_cats dataframe
cor(alberta_cats$Total.Live.Intake, alberta_cats$FY.Volume)
## [1] 0.860653
# Linear regression using alberta_cats dataframe
lm_cats <- lm(FY.Volume ~ Total.Live.Intake, data = alberta_cats)
lm_cats
##
## Call:
## lm(formula = FY.Volume ~ Total.Live.Intake, data = alberta_cats)
##
## Coefficients:
## (Intercept) Total.Live.Intake
## 1253.37 1.68
# Linear regression summary
summary(lm_cats)
##
## Call:
## lm(formula = FY.Volume ~ Total.Live.Intake, data = alberta_cats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1413.2 -171.4 -95.5 240.2 1198.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1253.3713 1734.0798 0.723 0.49040
## Total.Live.Intake 1.6800 0.3514 4.781 0.00139 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 695.7 on 8 degrees of freedom
## Multiple R-squared: 0.7407, Adjusted R-squared: 0.7083
## F-statistic: 22.86 on 1 and 8 DF, p-value: 0.001389
Wow. There is a “significant” positive relationship between total live intake of cats and volume of vehicles recorded at the I-5 at Alberta highway station.
There is no relationship between cat adoptions and volume of cars recorded at the I-5 at Alberta highway station. However, there is a positive relationship between total live intake of cats and vehicle volume.
This example demostrates some of the basic functions and capabilities of using R for reproducible data wrangling and analysis.
Last updated: 2018-07-26
Back to Introduction