Introduction to statistics with R

One of the really cool things about R, as opposed to other languages like Python, MATLAB etc., is that it has such a wide user base in the statistics community, and many packages have been developed to analyze data with both classical and modern statistical techniques and models. Most of these (but not all) are available on CRAN. Although we won’t cover much beyond the basics today (which could just as easily be done in most other languages), more advanced things like Bayesian techniques are also readily available through R packages.

Preliminary Step: load example data

Pokies in Queensland

Read in (again) the gambling dataset:

gambling.data <- read.csv(file = "http://data.justice.qld.gov.au/JSD/OLGR/20170817_OLGR_LGA-EGM-data.csv",
                 header = TRUE,
                 sep = ",",
                 stringsAsFactors = FALSE)

# rename columns
names(gambling.data)[2] <- "Local.Govt.Area"
names(gambling.data)[7] <- "Player.Money.Lost"
names(gambling.data)[1] <- "Month.Year"

gambling.data <- na.omit(gambling.data)

# add a day of month (the 1st) to each date string
date.string <- paste0("1 ", gambling.data$Month.Year)

# convert to POSIXlt, a date-time format
gambling.data$Date <- strptime(date.string, format = "%d %B %Y")

# subset to Brisbane only
brisbane.only <- gambling.data[gambling.data$Local.Govt.Area == "BRISBANE", ]

# Choose only the Brisbane data from 2010
#row.indices <- (brisbane.only$Date >= "2010-01-01 AEST" &
#                brisbane.only$Date <= "2010-12-31 AEST")
#brisbane.2010.data <- brisbane.only[row.indices,]
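
The commented-out version above compares dates as strings, which can be fragile; a working sketch that instead filters on the year extracted from the Date column:

# keep only the rows whose Date falls in 2010
row.indices <- format(brisbane.only$Date, "%Y") == "2010"
brisbane.2010.data <- brisbane.only[row.indices, ]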

Descriptive Statistics

Measures of Central Tendency

All the standard summary statistics are available in R including:

  • mean
  • sd
  • var
  • min
  • max
  • median
  • range
  • quantile
x <- c(1,3,5,2,9,NA,7,10,15,4,7,7)
x
##  [1]  1  3  5  2  9 NA  7 10 15  4  7  7
mean(x)
## [1] NA

If there are missing values, we can ignore them by setting na.rm=TRUE:

mean(x, na.rm=TRUE)
## [1] 6.363636
median(x,na.rm = T)
## [1] 7
var(x, na.rm = T)     # variance
## [1] 16.25455
sd(x, na.rm = T)      # standard deviation
## [1] 4.031693
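
The remaining functions from the list above (min, max, range) take the same na.rm argument:

min(x, na.rm = TRUE)      # smallest value
## [1] 1
max(x, na.rm = TRUE)      # largest value
## [1] 15
range(x, na.rm = TRUE)    # smallest and largest together
## [1]  1 15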

In R, you can use the function quantile() to get the median and quartiles, or the function summary() to get the quartiles as well as the mean and the number of NA values:

quantile(x, na.rm = T)
##   0%  25%  50%  75% 100% 
##  1.0  3.5  7.0  8.0 15.0
# min, 1st quartile, median, mean, 3rd quartile, max, and NA count
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   3.500   7.000   6.364   8.000  15.000       1

Visual Summaries: Box plots and Histograms

You can see most of the same summary statistics visually in a boxplot (here in base R graphics):

boxplot(x, horizontal = TRUE)

Or alternatively in ggplot:

library(ggplot2)

# pick five distinct local government areas at random
some.LGAs <- sample(unique(gambling.data$Local.Govt.Area), 5)

subset.data <- gambling.data[gambling.data$Local.Govt.Area %in% some.LGAs, ]

ggplot(data=subset.data,
       aes(x=Local.Govt.Area,y=Approved.EGMs))+
  geom_boxplot()+
  scale_y_log10()+
  coord_flip()+
  theme_minimal()

We can also look at a histogram of the data. To obtain a histogram, the range of the variable is divided into consecutive intervals and the number of observations falling in each interval is counted.

hist(x)
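
You can control the intervals yourself with the breaks argument; a quick sketch using bins of width 2:

hist(x, breaks = seq(0, 16, by = 2))   # explicit bin boundaries; the NA is dropped automatically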

Or plot a histogram in ggplot:

ggplot(data=subset.data,
       aes(x=Approved.EGMs,fill=Local.Govt.Area))+
  geom_histogram()+
  scale_x_log10()+
  theme_minimal()

Correlation

The function cor() measures the strength of the linear relationship between two variables. Two identical series are perfectly correlated:

# two perfectly related variables
x1 <- sin(1:100/3)
y1 <- sin(1:100/3)
plot(x1, y1)

cor(x1,y1)
## [1] 1
# two independent random variables: the correlation should be near zero
x2 <- rnorm(100)
y2 <- rnorm(100)
plot(x2, y2)

cor(x2,y2)
## [1] -0.03172506

Calculating correlations for many variables at once

#install.packages("corrplot")
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.2
## corrplot 0.84 loaded
MoneyDateLGA <- gambling.data[, c("Player.Money.Lost", "Date", "Local.Govt.Area")]

# spread to wide format: key = column 3 (LGA), value = column 1 (money lost)
spreaddata <- tidyr::spread(MoneyDateLGA, 3, 1)

# remove the columns that contain NA values (cor() would otherwise return NA for them)
spreaddata$TORRES<-NULL
spreaddata$LONGREACH<-NULL
spreaddata$BALONNE<-NULL

# plot the correlation matrix, ordering rows/columns by hierarchical clustering
corrplot(cor(spreaddata[,-1]),
         order="hclust",
         tl.cex = .5, 
         tl.col = "black")

Pair plots

#install.packages("GGally")
library(GGally)

ggpairs(gambling.data[,-c(1,2)])

There are other functions and packages that provide summary statistics of datasets, for example:

  • stat.desc() in the pastecs package
  • describe() in the Hmisc package
  • describe() in the psych package…
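
A quick sketch with the pastecs package (assuming it is installed):

# install.packages("pastecs")
pastecs::stat.desc(x)   # counts (including NAs), min/max, mean, variance, etc.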

Simple Linear Regression

Simple linear regression is useful for examining or modelling the relationship between two variables.

ggplot(data=gambling.data,
       aes(x=Operational.EGMs,
           y=Player.Money.Lost,
           color=Local.Govt.Area,
           group=1))+
  guides(color=FALSE)+
  geom_point()+
  geom_smooth(method = "lm",se=TRUE)

Fit a linear regression and look at a summary of its results. The model is of the form \(y = mx + c\), where \(y\) is Player.Money.Lost and \(x\) is Operational.EGMs.

dollars_per_machine_model <- lm(Player.Money.Lost ~ Operational.EGMs, 
                                data = gambling.data)
summary(dollars_per_machine_model)
## 
## Call:
## lm(formula = Player.Money.Lost ~ Operational.EGMs, data = gambling.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9146097  -246066      408   153205 11721046 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.724e+05  1.557e+04  -17.49   <2e-16 ***
## Operational.EGMs  4.152e+03  8.386e+00  495.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1083000 on 6673 degrees of freedom
## Multiple R-squared:  0.9735, Adjusted R-squared:  0.9735 
## F-statistic: 2.452e+05 on 1 and 6673 DF,  p-value: < 2.2e-16
attributes(dollars_per_machine_model)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"
dollars_per_machine_model$coefficients 
##      (Intercept) Operational.EGMs 
##      -272416.501         4152.383
confint(dollars_per_machine_model)
##                        2.5 %      97.5 %
## (Intercept)      -302943.761 -241889.241
## Operational.EGMs    4135.944    4168.822
anova(dollars_per_machine_model)
## Analysis of Variance Table
## 
## Response: Player.Money.Lost
##                    Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Operational.EGMs    1 2.8758e+17 2.8758e+17  245187 < 2.2e-16 ***
## Residuals        6673 7.8268e+15 1.1729e+12                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# visualize residuals and fitted values.
plot(dollars_per_machine_model, 
     pch=16, 
     which=1)

hist(dollars_per_machine_model$residuals,
     breaks=1000,
     xlim=c(-2e6,2e6))

sd(dollars_per_machine_model$residuals)
## [1] 1082929
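
Once fitted, the model can also be used for prediction; a quick sketch (1000 machines is just an illustrative input):

# predicted player losses for an LGA-month with 1000 operational EGMs,
# together with a 95% prediction interval
predict(dollars_per_machine_model,
        newdata = data.frame(Operational.EGMs = 1000),
        interval = "prediction")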

Multiple Linear Regression

We can make a much better model by including more variables. The functions we will use are exactly the same though.

dollars_model <- lm(Player.Money.Lost ~ Operational.EGMs + Operational.Sites + as.numeric(Date) + Local.Govt.Area, 
                                data = gambling.data)
summary(dollars_model)
## 
## Call:
## lm(formula = Player.Money.Lost ~ Operational.EGMs + Operational.Sites + 
##     as.numeric(Date) + Local.Govt.Area, data = gambling.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5480600  -179541   -11605   168530  6292338 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                       1.375e+06  1.315e+05  10.462  < 2e-16
## Operational.EGMs                  4.950e+03  1.338e+02  37.005  < 2e-16
## Operational.Sites                -2.200e+05  3.490e+03 -63.027  < 2e-16
## as.numeric(Date)                 -2.966e-04  8.724e-05  -3.400 0.000677
## Local.Govt.AreaBANANA             8.594e+05  8.058e+04  10.665  < 2e-16
## Local.Govt.AreaBRISBANE           3.862e+07  1.211e+06  31.896  < 2e-16
## Local.Govt.AreaBUNDABERG          6.010e+06  1.807e+05  33.263  < 2e-16
## Local.Govt.AreaBURDEKIN           2.324e+05  7.930e+04   2.931 0.003393
## Local.Govt.AreaCAIRNS             6.187e+06  2.298e+05  26.920  < 2e-16
## Local.Govt.AreaCASSOWARY COAST    1.542e+06  8.892e+04  17.341  < 2e-16
## Local.Govt.AreaCENTRAL HIGHLANDS  2.336e+06  9.475e+04  24.659  < 2e-16
## Local.Govt.AreaCHARTERS TOWERS    6.474e+05  7.995e+04   8.097 6.63e-16
## Local.Govt.AreaCLONCURRY          1.948e+04  7.843e+04   0.248 0.803842
## Local.Govt.AreaDOUGLAS            6.988e+05  7.950e+04   8.791  < 2e-16
## Local.Govt.AreaFRASER COAST       5.469e+06  1.806e+05  30.288  < 2e-16
## Local.Govt.AreaGLADSTONE          3.616e+06  1.114e+05  32.456  < 2e-16
## Local.Govt.AreaGOLD COAST         2.261e+07  7.475e+05  30.240  < 2e-16
## Local.Govt.AreaGOONDIWINDI        7.911e+05  8.018e+04   9.866  < 2e-16
## Local.Govt.AreaGYMPIE             2.494e+06  9.476e+04  26.321  < 2e-16
## Local.Govt.AreaHINCHINBROOK       5.139e+05  7.875e+04   6.526 7.25e-11
## Local.Govt.AreaIPSWICH            6.552e+06  2.098e+05  31.231  < 2e-16
## Local.Govt.AreaISAAC              2.219e+06  8.869e+04  25.021  < 2e-16
## Local.Govt.AreaLIVINGSTONE        1.182e+06  8.404e+04  14.060  < 2e-16
## Local.Govt.AreaLOCKYER VALLEY     2.386e+06  9.001e+04  26.510  < 2e-16
## Local.Govt.AreaLOGAN              8.916e+06  2.629e+05  33.917  < 2e-16
## Local.Govt.AreaLONGREACH         -8.728e+04  8.953e+04  -0.975 0.329631
## Local.Govt.AreaMACKAY             6.274e+06  2.213e+05  28.355  < 2e-16
## Local.Govt.AreaMARANOA            5.808e+05  7.916e+04   7.338 2.43e-13
## Local.Govt.AreaMAREEBA            1.569e+05  7.959e+04   1.971 0.048738
## Local.Govt.AreaMORETON BAY        1.127e+07  4.058e+05  27.779  < 2e-16
## Local.Govt.AreaMOUNT ISA          1.394e+05  9.255e+04   1.507 0.131985
## Local.Govt.AreaMURWEH             3.045e+05  7.863e+04   3.872 0.000109
## Local.Govt.AreaNOOSA              2.023e+06  1.014e+05  19.944  < 2e-16
## Local.Govt.AreaNORTH BURNETT      7.991e+05  7.952e+04  10.050  < 2e-16
## Local.Govt.AreaREDLAND            4.060e+06  1.752e+05  23.175  < 2e-16
## Local.Govt.AreaROCKHAMPTON        6.143e+06  1.680e+05  36.556  < 2e-16
## Local.Govt.AreaSCENIC RIM         2.661e+06  9.186e+04  28.969  < 2e-16
## Local.Govt.AreaSOMERSET           1.424e+06  8.175e+04  17.414  < 2e-16
## Local.Govt.AreaSOUTH BURNETT      2.485e+06  9.413e+04  26.401  < 2e-16
## Local.Govt.AreaSOUTHERN DOWNS     2.004e+06  9.396e+04  21.332  < 2e-16
## Local.Govt.AreaSUNSHINE COAST     1.138e+07  3.931e+05  28.945  < 2e-16
## Local.Govt.AreaTABLELANDS        -6.438e+03  7.874e+04  -0.082 0.934835
## Local.Govt.AreaTOOWOOMBA          8.595e+06  2.361e+05  36.411  < 2e-16
## Local.Govt.AreaTORRES             1.678e+05  4.807e+05   0.349 0.727023
## Local.Govt.AreaTOWNSVILLE         7.903e+06  2.260e+05  34.961  < 2e-16
## Local.Govt.AreaWESTERN DOWNS      3.432e+06  1.030e+05  33.304  < 2e-16
## Local.Govt.AreaWHITSUNDAY         2.948e+06  1.011e+05  29.164  < 2e-16
##                                     
## (Intercept)                      ***
## Operational.EGMs                 ***
## Operational.Sites                ***
## as.numeric(Date)                 ***
## Local.Govt.AreaBANANA            ***
## Local.Govt.AreaBRISBANE          ***
## Local.Govt.AreaBUNDABERG         ***
## Local.Govt.AreaBURDEKIN          ** 
## Local.Govt.AreaCAIRNS            ***
## Local.Govt.AreaCASSOWARY COAST   ***
## Local.Govt.AreaCENTRAL HIGHLANDS ***
## Local.Govt.AreaCHARTERS TOWERS   ***
## Local.Govt.AreaCLONCURRY            
## Local.Govt.AreaDOUGLAS           ***
## Local.Govt.AreaFRASER COAST      ***
## Local.Govt.AreaGLADSTONE         ***
## Local.Govt.AreaGOLD COAST        ***
## Local.Govt.AreaGOONDIWINDI       ***
## Local.Govt.AreaGYMPIE            ***
## Local.Govt.AreaHINCHINBROOK      ***
## Local.Govt.AreaIPSWICH           ***
## Local.Govt.AreaISAAC             ***
## Local.Govt.AreaLIVINGSTONE       ***
## Local.Govt.AreaLOCKYER VALLEY    ***
## Local.Govt.AreaLOGAN             ***
## Local.Govt.AreaLONGREACH            
## Local.Govt.AreaMACKAY            ***
## Local.Govt.AreaMARANOA           ***
## Local.Govt.AreaMAREEBA           *  
## Local.Govt.AreaMORETON BAY       ***
## Local.Govt.AreaMOUNT ISA            
## Local.Govt.AreaMURWEH            ***
## Local.Govt.AreaNOOSA             ***
## Local.Govt.AreaNORTH BURNETT     ***
## Local.Govt.AreaREDLAND           ***
## Local.Govt.AreaROCKHAMPTON       ***
## Local.Govt.AreaSCENIC RIM        ***
## Local.Govt.AreaSOMERSET          ***
## Local.Govt.AreaSOUTH BURNETT     ***
## Local.Govt.AreaSOUTHERN DOWNS    ***
## Local.Govt.AreaSUNSHINE COAST    ***
## Local.Govt.AreaTABLELANDS           
## Local.Govt.AreaTOOWOOMBA         ***
## Local.Govt.AreaTORRES               
## Local.Govt.AreaTOWNSVILLE        ***
## Local.Govt.AreaWESTERN DOWNS     ***
## Local.Govt.AreaWHITSUNDAY        ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 674500 on 6628 degrees of freedom
## Multiple R-squared:  0.9898, Adjusted R-squared:  0.9897 
## F-statistic: 1.397e+04 on 46 and 6628 DF,  p-value: < 2.2e-16
plot(dollars_model, 
     pch=16, 
     which=1)

hist(dollars_model$residuals,
     breaks=1000,
     xlim=c(-2e6,2e6))

sd(dollars_model$residuals)
## [1] 672180.9
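
The simple model is nested inside this one, so the two fits can also be compared formally with an F-test; a quick sketch:

# compare the single-predictor and multi-predictor models
anova(dollars_per_machine_model, dollars_model)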

Hypothesis testing

A hypothesis test (or more precisely, the p-value from a hypothesis test) tells you the probability of observing data at least as extreme as yours, assuming the data were generated by the null model (your null hypothesis).

For example, I may have rolled a red die 10 times and got the following results:

redDie<-c(6, 4, 4, 6, 2, 1, 4, 6, 5, 2)

Let’s say that my “null model” is that the mean of many rolls will be 3.5, i.e., the result we would expect, on average, if the die is fair.

mean(redDie)
## [1] 4

Is this data consistent with our null hypothesis?

redDieTTest<-t.test(redDie, alternative="two.sided", mu = 3.5 ,conf.level=0.95)
redDieTTest
## 
##  One Sample t-test
## 
## data:  redDie
## t = 0.86603, df = 9, p-value = 0.409
## alternative hypothesis: true mean is not equal to 3.5
## 95 percent confidence interval:
##  2.693943 5.306057
## sample estimates:
## mean of x 
##         4
names(redDieTTest)
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
## [6] "null.value"  "alternative" "method"      "data.name"
redDieTTest$parameter
## df 
##  9
redDieTTest$p.value
## [1] 0.4089688
redDieTTest$conf.int
## [1] 2.693943 5.306057
## attr(,"conf.level")
## [1] 0.95
redDieTTest$estimate
## mean of x 
##         4
redDieTTest$null.value
## mean 
##  3.5
redDieTTest$alternative
## [1] "two.sided"

I then rolled a blue die 10 times and got the following results from it:

blueDie<-c(5, 1, 2, 2, 3, 1, 6, 2, 4, 1)

Are the mean values over ten rolls different between the red and blue dice (within statistical uncertainty)?

redblueDieTTest<-t.test(redDie,blueDie, alternative="two.sided", mu = 0 ,conf.level=0.95)
redblueDieTTest
## 
##  Welch Two Sample t-test
## 
## data:  redDie and blueDie
## t = 1.618, df = 17.981, p-value = 0.1231
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3881461  2.9881461
## sample estimates:
## mean of x mean of y 
##       4.0       2.7

Testing for homogeneous variances

Let’s use Fisher’s F-test to check that the rolls from the red die and the rolls from the blue die have similar variances (i.e. verify homoskedasticity). In R you can do this as follows:

var.test(redDie,blueDie)
## 
##  F test to compare two variances
## 
## data:  redDie and blueDie
## F = 1.0676, num df = 9, denom df = 9, p-value = 0.924
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.2651806 4.2982144
## sample estimates:
## ratio of variances 
##           1.067616
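
Since the variances look similar, we could re-run the two-sample comparison with a pooled variance estimate (a sketch):

# Student's t-test assuming equal variances, rather than the default Welch test
t.test(redDie, blueDie, var.equal = TRUE)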

Conducting Chi-Squared Test of Independence

The chi-squared test of independence is a standard method for testing independence between two categorical variables.

Smoking and Exercise

We’ll use a different dataset here as it is more illustrative of the concept. We have survey data from some students, who were asked to report how heavy a smoker they are using the following categories:

  • “Heavy”
  • “Regul” (regularly)
  • “Occas” (occasionally)
  • “Never”

as well as getting them to report how often they exercise (from the following options):

  • “Freq” (frequently)
  • “Some”
  • “None”.

First we will tabulate the students’ smoking habits against their exercise levels using the table() function in R. The result is called the contingency table of the two variables.

library(MASS)       # load the MASS package, which contains the data 
smokeExercise <- table(survey$Smoke, survey$Exer) # tabulate the entries in each combination of categories
smokeExercise <- smokeExercise[c(2,3,4,1), c(2,3,1)]  # reorder rows (Never, Occas, Regul, Heavy) and columns (None, Some, Freq)

# the contingency table
smokeExercise
##        
##         None Some Freq
##   Never   18   84   87
##   Occas    3    4   12
##   Regul    1    7    9
##   Heavy    1    3    7

See the relationship using a barplot:

barplot(smokeExercise, 
        beside=T, 
        legend=T,
        xlab = "Exercise",
        ylab = "Number of People")

We want to test the hypothesis that the students’ smoking habit is independent of their exercise level. Independence would mean that each row is just another sample from the same distribution of exercise levels, each column another sample from the same distribution of smoking habits, and the marginal totals completely describe the system.

We test this by applying the chisq.test() function to the contingency table.

chisq.test(smokeExercise) 
## Warning in chisq.test(smokeExercise): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  smokeExercise
## X-squared = 5.4885, df = 6, p-value = 0.4828

The p-value of 0.4828 indicates that the probability of generating data at least as extreme as what we see here, given that the null hypothesis is true (smoking and exercise level are independent random variables), is around 50%. So it seems plausible that they are independent, in the absence of further information.
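
The warning above is triggered by small expected cell counts (the usual rule of thumb wants all of them to be at least 5); we can inspect them directly:

# expected cell counts under the independence hypothesis
chisq.test(smokeExercise)$expected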

To avoid this warning, we can combine the exercise answers for “None” and “Some” into a single category, and save the result in a new table named reducedSmokeExercise. Then we apply the chisq.test() function to reducedSmokeExercise instead.

reducedSmokeExercise <- cbind(smokeExercise[,"Freq"], 
                              smokeExercise[,"None"] + smokeExercise[,"Some"])
reducedSmokeExercise 
##       [,1] [,2]
## Never   87  102
## Occas   12    7
## Regul    9    8
## Heavy    7    4
chisq.test(reducedSmokeExercise) 
## 
##  Pearson's Chi-squared test
## 
## data:  reducedSmokeExercise
## X-squared = 3.2328, df = 3, p-value = 0.3571

If the assumptions of the Chi-squared test are not met, we may consider using Fisher’s Exact test, which is a non-parametric alternative to the Chi-squared test.

fisher.test(smokeExercise, conf.int=T, conf.level=0.99)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  smokeExercise
## p-value = 0.4138
## alternative hypothesis: two.sided

One-Way Analysis of Variance (ANOVA)

ANOVA is a parametric method appropriate for comparing the means of two or more independent populations. We’re going to use a data set called InsectSprays. Six different insect sprays (one independent variable: brand of spray, with six levels/brands) were tested to see if there was a difference in the number of insects found in the field after each spraying.

attach(InsectSprays)

View(InsectSprays)
boxplot(count ~ spray, 
        xlab = "Brand of Insect Spray",
        ylab = "Number of Insects found")

Let’s run the ANalysis Of VAriance (ANOVA) to determine whether the number of insects found varies between the six different brands. Our null hypothesis is that the mean count of insects is the same for all six brands of insect spray.

ANOVA.Sprays <- aov(count ~ spray, data = InsectSprays)
summary(ANOVA.Sprays)
##             Df Sum Sq Mean Sq F value Pr(>F)    
## spray        5   2669   533.8    34.7 <2e-16 ***
## Residuals   66   1015    15.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value here tells us the probability that our data (or more extreme data) would be generated under our null hypothesis, i.e. that all six brands are the same and any difference is just due to random sampling. As the p-value is quite small, we can conclude that not all six means are the same.

Let’s have a look at all possible pairwise comparisons between the six brands of spray:

pairwiseSprayComparison<-TukeyHSD(ANOVA.Sprays)

We can best see the differences in mean values in the following plot:

plot(pairwiseSprayComparison,
     las=1)
title(xlab = "Differences in mean number of insects found")
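
If you prefer the numbers, the comparisons are stored as a matrix (one per factor in the model); a quick sketch of extracting the significant pairs:

# difference in means, confidence bounds, and adjusted p-value for each pair
tukey.matrix <- pairwiseSprayComparison$spray
tukey.matrix[tukey.matrix[, "p adj"] < 0.05, ]   # pairs significant at the 5% level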

The Kruskal-Wallis one-way analysis of variance is a non-parametric equivalent to the one-way ANOVA:

kruskal.test(count ~ spray, data=InsectSprays)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  count by spray
## Kruskal-Wallis chi-squared = 54.691, df = 5, p-value = 1.511e-10