Learning Objectives
* Provide an example that demonstrates how to read in a flat file, perform minimal data cleaning and tidying using dplyr and tidyr, introduce dataframes and why their structure matters, generate basic data visualization examples using ggplot2, and using a linear regression for analyzing data.
* The introductory example demonstrates the use of RMarkdown to generate a single reproducible report. _*In no way should the objective of this example be taken seriously, nor do they reflect the views of the instructors or organization sponsors of this workshop._

Objective

The feline species are found everywhere in urban areas. Although the vast population of felines do not have “thumbs” to manipulate a steering wheel, increasing research and developement of autonomous vehicles might render their anatomical barrier moot. In preparation of the eventual takeover of autonomous vehicles do we see their infilitration to human households increasing? If so, the objective of this study is to first investigate if there’s any relationship between the number of cat adoptions with volume of traffic.

Methods & Results

Data sources

Monthly traffic volume data were downloaded from the PORTAL website (https://portal.its.pdx.edu) for the Alberta to NB I-5 station for all years available.

Yearly animal services trends are reported by Multnomah County Animal Services (MCAS) (https://multcopets.org/reports-and-agency-statistics). All reporting by MCAS is presented in a non machine readable format and has been converted into a csv file prior to use for this example. Data reported by MCAS is aggregated by fiscal year (July through June).

Data cleanup

A fair amount of data cleanup happens prior to any analysis; however, with R and packages like Tidyverse make the process of data wrangling less painful. The process of cleaning and tidying data is recorded in each line of code that is written. This makes it easier and faster to catch and fix mistakes for large and small datasets. Below is an example of working with a fairly tidy dataset.

# Load libraries needed for tidying data.
library(tidyverse)
library(lubridate)
library(DT)
# Read in data files.
alberta_raw <- read.csv("data/I5-NB-Alberta.csv", stringsAsFactors = F, strip.white = T)
cats_raw <- read.csv("data/cats.csv", stringsAsFactors = F, strip.white = T)
# Examine traffic data structure.
str(alberta_raw)
## 'data.frame':    223 obs. of  3 variables:
##  $ starttime: chr  "1/1/2000 12:00 am" "2/1/2000 12:00 am" "3/1/2000 12:00 am" "4/1/2000 12:00 am" ...
##  $ speed    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ volume   : int  NA NA NA NA NA NA NA NA NA NA ...
# Delete rows of incomplete data
alberta <- alberta_raw[complete.cases(alberta_raw),]

# Convert "starttime" into class POSIXct
alberta$starttime <- mdy_hm(alberta$starttime, tz = "America/Los_Angeles")

# Check for completeness of data
alberta_year <- alberta %>%
  mutate(year = year(starttime))
table(alberta_year$year)
## 
## 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2018 
##    1   12   12   12   12   12   12   12   12   12   12    6    5
# Subset (filter) data by starttime for complete range of continous counts
alberta_fy <- alberta_year %>%
  filter(starttime > "2005-06-30" & starttime < "2015-07-01")

# Creative way of aggregating volume by fiscal year, July through June
alberta_fy$fy.adjust <- seq(ymd("2005-01-01"), ymd("2014-12-31"), by = "month")
alberta_fy <- alberta_fy %>%
  group_by(Fiscal.Year = year(fy.adjust)) %>%
  summarise(FY.Volume = sum(volume))
alberta_fy
## # A tibble: 10 x 2
##    Fiscal.Year FY.Volume
##          <dbl>     <int>
##  1        2005      8534
##  2        2006      9811
##  3        2007     10419
##  4        2008     10586
##  5        2009     10958
##  6        2010     10644
##  7        2011      9938
##  8        2012      8785
##  9        2013      7317
## 10        2014      7773
# Examine cat data
str(cats_raw)
## 'data.frame':    37 obs. of  3 variables:
##  $ Fiscal.Year      : int  1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 ...
##  $ Total.Live.Intake: int  4224 4963 5086 1013 3811 4012 3999 3992 3994 5000 ...
##  $ Adopt            : int  NA NA 212 224 195 200 252 238 266 311 ...
# Filter cat data for available years corresponding to traffic data.
cats <- cats_raw %>%
  filter(between(Fiscal.Year, 2005, 2014))
cats
##    Fiscal.Year Total.Live.Intake Adopt
## 1         2005              5175  1143
## 2         2006              5208   956
## 3         2007              5567  1012
## 4         2008              5461  1062
## 5         2009              5063   837
## 6         2010              5302   732
## 7         2011              5210   732
## 8         2012              4324   711
## 9         2013              3684   623
## 10        2014              3954   961

Data visualization

There are a number of visualization packages for R that allow the user to creating and customizing simple static figures to creating and customizing interactive maps and applications. Below are some basic examples of generating plots to examine the cat and traffic data.

Prior to visualizing data a few additional steps of tidying data are required. Traffic volume and cat adoptions were initially two separate data files that were cleaned. The next step would be to combine the two datasets into a useable dataframe.

# There are many ways of combining datasets, the follow example uses dplyr::left_join
alberta_cats <- left_join(alberta_fy, cats, by = "Fiscal.Year")
alberta_cats$Fiscal.Year <- as.character(alberta_cats$Fiscal.Year)
alberta_cats
## # A tibble: 10 x 4
##    Fiscal.Year FY.Volume Total.Live.Intake Adopt
##    <chr>           <int>             <int> <int>
##  1 2005             8534              5175  1143
##  2 2006             9811              5208   956
##  3 2007            10419              5567  1012
##  4 2008            10586              5461  1062
##  5 2009            10958              5063   837
##  6 2010            10644              5302   732
##  7 2011             9938              5210   732
##  8 2012             8785              4324   711
##  9 2013             7317              3684   623
## 10 2014             7773              3954   961
# transforming data into another useable dataframe structure
trans_cats <- alberta_cats %>%
  gather("variable", "count", 2:4)
trans_cats
## # A tibble: 30 x 3
##    Fiscal.Year variable  count
##    <chr>       <chr>     <int>
##  1 2005        FY.Volume  8534
##  2 2006        FY.Volume  9811
##  3 2007        FY.Volume 10419
##  4 2008        FY.Volume 10586
##  5 2009        FY.Volume 10958
##  6 2010        FY.Volume 10644
##  7 2011        FY.Volume  9938
##  8 2012        FY.Volume  8785
##  9 2013        FY.Volume  7317
## 10 2014        FY.Volume  7773
## # ... with 20 more rows

There are now two dataframes, alberta_cats and trans_cats, that contain the same data but organized in two different ways. The difference of having dataframes arranged differently becomes more apparent when trying to do different types of visualizations and analyses.

# Fiscal year trends of total cat intake, cat adoptions, and car volume over time using trans_cats
fy_trends_fig <- trans_cats %>%
  ggplot(aes(x = Fiscal.Year, y = count)) +
  geom_bar(stat = "identity") +
  facet_grid(variable ~ ., scales = "free") +
  ggtitle("Figure 1. Cat adoptions, volume of cars (FY.Volume), and total live intake of cats (Total.Live.Intake)") +
  theme(plot.title = element_text(size = 8))
fy_trends_fig

Figure 1 shows the number of total live cat intakes, total cat adoptions, and car volume for the I-5 Alberta highway station per fiscal year. From FY2010 through FY2013, figure 1 shows a gradual decrease across all variables and then an observable increase across all variables in 2014. Figures 2 and 3, below, plot the relationships of adoption and total live intake with car volume.

# Plotting relationship between cat adoptions and volume of cars uses the alberta_cats dataframe
cars_adopt_figure <- catplot(xs = alberta_cats$Adopt, ys = alberta_cats$FY.Volume, 
                             cat = 5, catcolor = "#33FCFF", size = 0.2,
                             canvas = c(-0.1, 1.8, -0.1, 1.2),
                             xlab = "Cat adoptions", ylab = "Volume of cars",
                             main = "Figure 2. Cat adoptions versus volume of cars")

# Plotting relationship between intake of live cats and volume of cars uses the alberta_cats dataframe
cars_adopt_figure <- alberta_cats %>%
  ggplot(aes(x = Total.Live.Intake, y = FY.Volume)) +
  geom_point() +
  labs(title = "Figure 3. Total live intake of cats versus volume of cars",
       x = "Total live intake of cats", y = "Volume of cars") +
  theme(plot.title = element_text(size = 8))
cars_adopt_figure

Figure 2 shows zero relationship between cat adoptions and volume of cars. Figure 3 suggests that there could be a possible relationship between total live intake of cats and volume of cars.

Statistics

To test if there’s a significant relationship between total live intake and volume of vehicles, and disregarding any statistical assumptions about normality, heteroscedasity, independent sample, sampling size, and data being representive of the population, we implement a simple linear regression:

# Correlation using alberta_cats dataframe
cor(alberta_cats$Total.Live.Intake, alberta_cats$FY.Volume)
## [1] 0.860653
# Linear regression using alberta_cats dataframe
lm_cats <- lm(FY.Volume ~ Total.Live.Intake, data = alberta_cats)
lm_cats
## 
## Call:
## lm(formula = FY.Volume ~ Total.Live.Intake, data = alberta_cats)
## 
## Coefficients:
##       (Intercept)  Total.Live.Intake  
##           1253.37               1.68
# Linear regression summary
summary(lm_cats)
## 
## Call:
## lm(formula = FY.Volume ~ Total.Live.Intake, data = alberta_cats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1413.2  -171.4   -95.5   240.2  1198.9 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       1253.3713  1734.0798   0.723  0.49040   
## Total.Live.Intake    1.6800     0.3514   4.781  0.00139 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 695.7 on 8 degrees of freedom
## Multiple R-squared:  0.7407, Adjusted R-squared:  0.7083 
## F-statistic: 22.86 on 1 and 8 DF,  p-value: 0.001389

Wow. There is a “significant” positive relationship between total live intake of cats and volume of vehicles recorded at the I-5 at Alberta highway station.

Conclusions

There is no relationship between cat adoptions and volume of cars recorded at the I-5 at Alberta highway station. However, there is a positive relationship between total live intake of cats and vehicle volume.

This example demostrates some of the basic functions and capabilities of using R for reproducible data wrangling and analysis.

Last updated: 2018-07-26
Back to Introduction