Welcome to the home page for the Brain & Mind Centre R course. It is a three-day course covering six separate topics that come up frequently for anyone doing pragmatic data analysis. The content is structured so that each section is independent of the others, so if you can only make part of the course this won’t impact your learning.
These courses are designed to be informal and free flowing. We have prepared some material, but if you have a question at any time please let us know and we will be more than happy to assist. If there is anything that we can’t answer on the day, we will note it down and make sure that we get back to you with a thorough solution.
R is an open source programming language and software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing. Today, it is one of the most popular languages, being used all across the world in a wide variety of domains and fields.
RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.
Below is a script that installs the packages you will need for the next three days and sets some options that will make the course run smoothly. We will explain what the script does as we get further into the material, but for now please copy the script and run it in your environment.
# Load each package named in x, installing it first if it is not available
LoadPackages <- function(x){
  for( i in x ){
    if( ! require( i , character.only = TRUE ) ){
      install.packages( i , dependencies = TRUE , repos = "http://cran.us.r-project.org")
      require( i , character.only = TRUE )
    }
  }
}
# Then try/install packages...
LoadPackages(c("tidyverse",
"xlsx",
"ggplot2",
"foreign",
"dplyr"))
#options(max.print = 100)
options(scipen = 999)
set.seed(5)
Standard math functions work in R:
2+3
## [1] 5
1e5-5e4
## [1] 50000
1/1000
## [1] 0.001
sqrt(2)
## [1] 1.414214
2*pi
## [1] 6.283185
Let’s use vectors:
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
2*(1:10)
## [1] 2 4 6 8 10 12 14 16 18 20
2^(1:10)
## [1] 2 4 8 16 32 64 128 256 512 1024
We can store values:
x<-3
x=3
3->x
x
## [1] 3
x^2
## [1] 9
We can store vectors:
x<-1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
2^x
## [1] 2 4 8 16 32 64 128 256 512 1024
y<-c(3,4,5)
y^2
## [1] 9 16 25
some.fruits<-c("apple","orange","banana")
some.fruits
## [1] "apple" "orange" "banana"
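Individual elements of a stored vector can be pulled out with square brackets. The short sketch below reuses the some.fruits vector from above; the positions chosen are only for illustration.

```r
some.fruits <- c("apple", "orange", "banana")

# How many elements does the vector hold?
length(some.fruits)

# The second element (R counts from 1)
some.fruits[2]

# The first and third elements
some.fruits[c(1, 3)]

# Everything except the first element
some.fruits[-1]
```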
A lot of the time in R we are working with tables of data, known as “data frames”.
Commonly,
rows may represent instances e.g. data points, people, events, etc. while
columns will represent different types of data associated with each data point or instance e.g. Name, ID, location, time, value…
Here is an example data frame
simple.data<-data.frame(first.name=c("Alice",
"Bob",
"Cathy",
"Daniel"),
gender=as.factor(c("Female",
"Male",
"Female",
"Male")),
favourite.number=c(4,23,pi,exp(1)),
favourite.letter=c("a","c","x","q"),
favourite.weekday=c("Monday","Thursday","Sunday","Friday"),
stringsAsFactors = FALSE)
Use the function View() to visually look at the data
View(simple.data)
How many rows do we have?
nrow(simple.data)
## [1] 4
How many columns do we have?
ncol(simple.data)
## [1] 5
Return only the first 3 rows of the data set
simple.data[1:3,]
## first.name gender favourite.number favourite.letter favourite.weekday
## 1 Alice Female 4.000000 a Monday
## 2 Bob Male 23.000000 c Thursday
## 3 Cathy Female 3.141593 x Sunday
simple.data[c(1,2,3),]
## first.name gender favourite.number favourite.letter favourite.weekday
## 1 Alice Female 4.000000 a Monday
## 2 Bob Male 23.000000 c Thursday
## 3 Cathy Female 3.141593 x Sunday
simple.data[c(TRUE,TRUE,TRUE,FALSE),]
## first.name gender favourite.number favourite.letter favourite.weekday
## 1 Alice Female 4.000000 a Monday
## 2 Bob Male 23.000000 c Thursday
## 3 Cathy Female 3.141593 x Sunday
head(simple.data,3)
## first.name gender favourite.number favourite.letter favourite.weekday
## 1 Alice Female 4.000000 a Monday
## 2 Bob Male 23.000000 c Thursday
## 3 Cathy Female 3.141593 x Sunday
Return the last two rows in a data set
simple.data[nrow(simple.data)+(-1:0),]
## first.name gender favourite.number favourite.letter favourite.weekday
## 3 Cathy Female 3.141593 x Sunday
## 4 Daniel Male 2.718282 q Friday
tail(simple.data,2)
## first.name gender favourite.number favourite.letter favourite.weekday
## 3 Cathy Female 3.141593 x Sunday
## 4 Daniel Male 2.718282 q Friday
simple.data[c(FALSE,FALSE,TRUE,TRUE),]
## first.name gender favourite.number favourite.letter favourite.weekday
## 3 Cathy Female 3.141593 x Sunday
## 4 Daniel Male 2.718282 q Friday
Return a random two rows from the data set
dplyr::sample_n(simple.data,2)
## first.name gender favourite.number favourite.letter favourite.weekday
## 1 Alice Female 4.000000 a Monday
## 3 Cathy Female 3.141593 x Sunday
Return only the “favourite.letter” column in the data set
simple.data[,4]
## [1] "a" "c" "x" "q"
simple.data[,"favourite.letter"]
## [1] "a" "c" "x" "q"
simple.data[,c("favourite.letter")]
## [1] "a" "c" "x" "q"
simple.data[,c(FALSE,FALSE,FALSE,TRUE,FALSE)]
## [1] "a" "c" "x" "q"
simple.data[,names(simple.data) %in% c("favourite.letter")]
## [1] "a" "c" "x" "q"
simple.data$favourite.letter
## [1] "a" "c" "x" "q"
Return only the first 3 rows and columns 2 and 5 of the data set
simple.data[1:3, c(2,5)]
## gender favourite.weekday
## 1 Female Monday
## 2 Male Thursday
## 3 Female Sunday
Return the columns named “first.name” and “favourite.number”
simple.data[,c("first.name","favourite.number")]
## first.name favourite.number
## 1 Alice 4.000000
## 2 Bob 23.000000
## 3 Cathy 3.141593
## 4 Daniel 2.718282
Return only the rows (people) which are Female
simple.data[simple.data$gender=="Female",]
## first.name gender favourite.number favourite.letter favourite.weekday
## 1 Alice Female 4.000000 a Monday
## 3 Cathy Female 3.141593 x Sunday
What exactly happened here?
We made a vector of TRUE/FALSE values, one for each row, saying whether the condition is true for that row:
indexes<-simple.data$gender=="Female"
indexes
## [1] TRUE FALSE TRUE FALSE
Then we kept only the rows for which the index is TRUE:
simple.data[indexes,]
## first.name gender favourite.number favourite.letter favourite.weekday
## 1 Alice Female 4.000000 a Monday
## 3 Cathy Female 3.141593 x Sunday
What if we want all of the people whose favourite number is larger than 10?
simple.data[simple.data$favourite.number>10,]
## first.name gender favourite.number favourite.letter favourite.weekday
## 2 Bob Male 23 c Thursday
What if we want just the names of all the people whose favourite number is larger than 10?
simple.data[simple.data$favourite.number>10,"first.name"]
## [1] "Bob"
We could have also done
simple.data$first.name[simple.data$favourite.number>10]
## [1] "Bob"
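Conditions can also be combined with & (and) and | (or) before subsetting. The sketch below rebuilds a cut-down copy of simple.data under the made-up name people so that it runs on its own; the cut-off value 3.5 is arbitrary.

```r
# A cut-down copy of simple.data so the example is self-contained
people <- data.frame(first.name = c("Alice", "Bob", "Cathy", "Daniel"),
                     gender = c("Female", "Male", "Female", "Male"),
                     favourite.number = c(4, 23, pi, exp(1)),
                     stringsAsFactors = FALSE)

# Rows where BOTH conditions are TRUE
female.and.big <- people$gender == "Female" & people$favourite.number > 3.5
people[female.and.big, ]

# Rows where AT LEAST ONE condition is TRUE
male.or.big <- people$gender == "Male" | people$favourite.number > 3.5
people$first.name[male.or.big]

# which() turns the TRUE/FALSE vector into row numbers
which(female.and.big)
```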
Quiz: 1. How would you extract Alice’s favourite number? 2. How would you extract Daniel’s favourite weekday?
The dataset that we will be using today is taken from the Queensland Government website and is Gaming Machine Data by Local Government Areas. It has the following variables:
Variable | Definition |
---|---|
Month Year | The month and year from which the gambling data is provided. |
LGA Region Name | The name of the Local Government Area. |
Approved Sites | The number of venues approved to operate electronic gaming machines. |
Operational Sites | The number of venues that were operating electronic gaming machines on the last day of the relevant month. |
Approved EGMs | The maximum number of electronic gaming machines the venue is approved to operate. |
Operational EGMs | The number of electronic gaming machines operating at the venue on the last day of the relevant month. |
Metered Win | The amount of money lost by players of electronic gaming machines. |
If you haven’t already, the data can be downloaded from the homepage of this course, or alternatively you can click here.
The first thing to do before we can begin an analysis is to load some data. To do this we can use the command below.
gambling.data <- read.csv(file = "http://data.justice.qld.gov.au/JSD/OLGR/20170817_OLGR_LGA-EGM-data.csv",
header = TRUE,
sep = ",",
stringsAsFactors = FALSE)
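Before going any further, let’s break down what this command does specifically. The file argument says where the data is; it can be a path on your computer or a web address, for example:
"~/Documents/MyFolder/datafile.csv"
"http://www.website.com/subdomain/datafile.csv"
The header, sep and stringsAsFactors arguments say that the first row holds the column names, that values are separated by commas, and that character columns should stay as character strings rather than be converted to factors.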
How would we find this out if we didn’t know already? Look at the help file:
?read.csv
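To see what the header, sep and stringsAsFactors arguments do without downloading anything, read.csv() can also read from a character string through its text argument. The tiny data set below is made up purely for illustration.

```r
# Made-up data: the first line holds the column names
csv.text <- "Month.Year,LGA.Region,Metered.Win
July 2004,EXAMPLE A,1000.5
July 2004,EXAMPLE B,250"

tiny.data <- read.csv(text = csv.text,
                      header = TRUE,             # first row is the column names
                      sep = ",",                 # values are separated by commas
                      stringsAsFactors = FALSE)  # keep text as character strings

names(tiny.data)
class(tiny.data$LGA.Region)
class(tiny.data$Metered.Win)
nrow(tiny.data)
```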
Comma Separated Value (.csv) files are the usual, simplest format and are compatible with all sorts of different software, e.g. R, Python, Excel, MATLAB, …
But if we needed to read in a different format of data, there’s usually a function or a library for doing just that,
e.g. in base R:
read.csv()
read.table()
In the package “xlsx”:
read.xlsx()
In the package “foreign”:
read.dta() for Stata (version 5-12) data files
read.spss() for SPSS files
How can we examine this data set now that we’ve loaded it?
Use the function View() to visually look at the data
View(gambling.data)
Using the background information on the dataset, change the column names so they make sense.
names(gambling.data)
## [1] "Month.Year" "LGA.Region" "Approved.Sites"
## [4] "Operational.Sites" "Approved.EGMs" "Operational.EGMs"
## [7] "Metered.Win"
names(gambling.data)[2] <- "Local.Govt.Area"
names(gambling.data)[7] <- "Player.Money.Lost"
Afterwards, view the data again to check the column names have been changed.
View(gambling.data)
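Renaming by position works, but it renames the wrong column if the column order ever changes. One alternative, sketched here on a made-up data frame called toy, is to look the column up by its current name.

```r
# Made-up data frame with two of the original column names
toy <- data.frame(LGA.Region = c("A", "B"),
                  Metered.Win = c(10, 20),
                  stringsAsFactors = FALSE)

# Find the column by name, then replace that name
names(toy)[names(toy) == "LGA.Region"] <- "Local.Govt.Area"
names(toy)[names(toy) == "Metered.Win"] <- "Player.Money.Lost"

names(toy)
```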
Time to add in a column which has the date as a date-time object instead of as a character string. We will need to use several functions to do this.
The first function we will need is paste0(). This function concatenates strings. For example:
paste0("Wo","rd")
## [1] "Word"
The second function we will need is strptime(). Use the help function to find out what it does: ?strptime
Now let’s convert the character string which we have describing the month and year of each row of data, into a date-time object so that it will be ordered appropriately etc.
#Add a day of month (1st) to each date string
date.string <- paste0( "1 " , gambling.data$Month.Year )
#Convert to POSIXlt, a date-time format
strptime( date.string , format = "%d %B %Y" ) -> gambling.data$Date
Afterwards, view the data again to check the new “Date” column.
View(gambling.data)
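The two steps can be tried out on a couple of hand-typed strings before running them on the whole column. This sketch assumes an English locale, because %B matches month names in the language of the current locale. The conversion to POSIXct with as.POSIXct() is an optional extra, not part of the course script; POSIXct stores each date-time as a single number, which can be simpler to keep in a data frame column.

```r
# Hand-typed examples in the same "Month Year" form as the data
month.year <- c("July 2004", "January 2010")

# Add a day of the month so the string describes a full date
date.string <- paste0("1 ", month.year)
date.string

# %d = day of month, %B = full month name, %Y = four-digit year
dates <- strptime(date.string, format = "%d %B %Y")
class(dates)

# Optional: convert to POSIXct
dates.ct <- as.POSIXct(dates)

# Dates now sort and compare in time order
dates.ct[1] < dates.ct[2]
format(dates.ct, "%Y-%m-%d")
```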
Let’s look at all the records for Brisbane:
brisbane.only<-gambling.data[gambling.data$Local.Govt.Area=="BRISBANE",]
What is the average amount of money lost by players in any given month in Brisbane?
mean(brisbane.only$Player.Money.Lost)
## [1] 35825434
That’s a lot of money!
What is the largest amount of money lost by players in any given month in Brisbane?
max(brisbane.only$Player.Money.Lost)
## [1] 47092687
What month was this?
brisbane.only$Month.Year[which.max(brisbane.only$Player.Money.Lost)]
## [1] "July 2017"
What if we want the rows from Brisbane that fall in 2010?
row.indicies<-(brisbane.only$Date>="2010-01-01 AEST" &
brisbane.only$Date<="2010-12-31 AEST")
(brisbane.2010.data<-brisbane.only[row.indicies,])
## Month.Year Local.Govt.Area Approved.Sites Operational.Sites
## 3635 January 2010 BRISBANE 227 220
## 3690 February 2010 BRISBANE 227 220
## 3745 March 2010 BRISBANE 227 220
## 3800 April 2010 BRISBANE 227 221
## 3855 May 2010 BRISBANE 228 222
## 3910 June 2010 BRISBANE 227 222
## 3965 July 2010 BRISBANE 226 219
## 4020 August 2010 BRISBANE 226 218
## 4075 September 2010 BRISBANE 225 218
## 4130 October 2010 BRISBANE 225 218
## 4185 November 2010 BRISBANE 225 218
## 4240 December 2010 BRISBANE 225 217
## Approved.EGMs Operational.EGMs Player.Money.Lost Date
## 3635 9183 8834 31268720 2010-01-01
## 3690 9175 8854 30025451 2010-02-01
## 3745 9225 8859 32183381 2010-03-01
## 3800 9345 8956 32017037 2010-04-01
## 3855 9230 8815 32244843 2010-05-01
## 3910 9166 8872 31873072 2010-06-01
## 3965 9144 8809 36225638 2010-07-01
## 4020 9119 8791 36861039 2010-08-01
## 4075 9106 8812 34763792 2010-09-01
## 4130 9106 8799 36211785 2010-10-01
## 4185 9126 8830 33534227 2010-11-01
## 4240 9126 8797 35019142 2010-12-01
On the last line we used a shortcut; if you want to assign a variable with <- but also print it, you can put the whole expression in parentheses.
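Another way to pick out one calendar year, sketched here on made-up dates, is to extract the year from each date with format() and compare that. It avoids typing the start and end dates by hand.

```r
# Made-up monthly dates spanning three years
dates <- as.POSIXct(c("2009-12-01", "2010-01-01", "2010-12-01", "2011-01-01"))

# format() with "%Y" returns the year as a character string
years <- format(dates, "%Y")
years

# TRUE for the dates that fall in 2010
in.2010 <- years == "2010"
in.2010

# Keep only the 2010 dates
dates[in.2010]
```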
Use the summary function to return a quick summary
summary(gambling.data)
## Month.Year Local.Govt.Area Approved.Sites Operational.Sites
## Length:8635 Length:8635 Min. : 1.0 Min. : 1.00
## Class :character Class :character 1st Qu.: 5.0 1st Qu.: 5.00
## Mode :character Mode :character Median : 12.0 Median : 11.00
## Mean : 24.1 Mean : 23.36
## 3rd Qu.: 28.0 3rd Qu.: 28.00
## Max. :243.0 Max. :235.00
## NA's :19
## Approved.EGMs Operational.EGMs Player.Money.Lost
## Min. : 5 Min. : 4.0 Min. : 9265
## 1st Qu.: 77 1st Qu.: 76.0 1st Qu.: 438955
## Median : 255 Median : 238.0 Median : 1028498
## Mean : 796 Mean : 762.5 Mean : 3773709
## 3rd Qu.:1027 3rd Qu.: 979.0 3rd Qu.: 4458886
## Max. :9345 Max. :8970.0 Max. :47092687
## NA's :19 NA's :1959
## Date
## Min. :2004-07-01 00:00:00
## 1st Qu.:2007-10-01 00:00:00
## Median :2011-01-01 00:00:00
## Mean :2010-12-31 11:12:59
## 3rd Qu.:2014-04-01 00:00:00
## Max. :2017-07-01 00:00:00
##
summary(brisbane.only)
## Month.Year Local.Govt.Area Approved.Sites Operational.Sites
## Length:157 Length:157 Min. :190.0 Min. :183.0
## Class :character Class :character 1st Qu.:207.0 1st Qu.:204.0
## Mode :character Mode :character Median :225.0 Median :216.0
## Mean :220.4 Mean :213.5
## 3rd Qu.:233.0 3rd Qu.:226.0
## Max. :243.0 Max. :235.0
## Approved.EGMs Operational.EGMs Player.Money.Lost
## Min. :8779 Min. :8347 Min. :27820060
## 1st Qu.:8997 1st Qu.:8642 1st Qu.:32591123
## Median :9118 Median :8747 Median :34936940
## Mean :9076 Mean :8723 Mean :35825434
## 3rd Qu.:9176 3rd Qu.:8818 3rd Qu.:38277230
## Max. :9345 Max. :8970 Max. :47092687
## Date
## Min. :2004-07-01 00:00:00
## 1st Qu.:2007-10-01 00:00:00
## Median :2011-01-01 00:00:00
## Mean :2010-12-31 11:12:59
## 3rd Qu.:2014-04-01 00:00:00
## Max. :2017-07-01 00:00:00
summary(brisbane.2010.data)
## Month.Year Local.Govt.Area Approved.Sites Operational.Sites
## Length:12 Length:12 Min. :225.0 Min. :217.0
## Class :character Class :character 1st Qu.:225.0 1st Qu.:218.0
## Mode :character Mode :character Median :226.5 Median :219.5
## Mean :226.2 Mean :219.4
## 3rd Qu.:227.0 3rd Qu.:220.2
## Max. :228.0 Max. :222.0
## Approved.EGMs Operational.EGMs Player.Money.Lost
## Min. :9106 Min. :8791 Min. :30025451
## 1st Qu.:9124 1st Qu.:8806 1st Qu.:31981046
## Median :9155 Median :8822 Median :32889535
## Mean :9171 Mean :8836 Mean :33519011
## 3rd Qu.:9194 3rd Qu.:8855 3rd Qu.:35317303
## Max. :9345 Max. :8956 Max. :36861039
## Date
## Min. :2010-01-01 00:00:00
## 1st Qu.:2010-03-24 06:00:00
## Median :2010-06-16 00:00:00
## Mean :2010-06-16 11:30:00
## 3rd Qu.:2010-09-08 12:00:00
## Max. :2010-12-01 00:00:00
Use the summary function to return a quick summary of only the money column
summary(brisbane.2010.data$Player.Money.Lost)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30025451 31981046 32889535 33519011 35317303 36861039
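The summary of the full data set above reports NA's (missing values) in some columns. Most summary functions return NA when any value is missing unless told otherwise; the made-up vector below shows the usual na.rm = TRUE workaround.

```r
# Made-up values with one missing entry
money <- c(100, NA, 300)

# A single NA makes the result NA
mean(money)

# na.rm = TRUE drops missing values before calculating
mean(money, na.rm = TRUE)
max(money, na.rm = TRUE)

# Count the missing values
sum(is.na(money))
```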
Get a random selection of rows and look at which Local Government Areas are included
random.fifty.rows<-dplyr::sample_n(gambling.data,50)
#look at each row's Local Government Area (LGA)
random.fifty.rows$Local.Govt.Area
## [1] "SUNSHINE COAST" "FLINDERS" "TORRES"
## [4] "LONGREACH" "WESTERN DOWNS" "WINTON"
## [7] "MURWEH" "BURDEKIN" "COOK"
## [10] "GOLD COAST" "GYMPIE" "WEIPA"
## [13] "BURDEKIN" "BALONNE" "HINCHINBROOK"
## [16] "NORTH BURNETT" "NOOSA" "BUNDABERG"
## [19] "SOMERSET" "IPSWICH" "SOMERSET"
## [22] "IPSWICH" "IPSWICH" "LOCKYER VALLEY"
## [25] "MACKAY" "GLADSTONE" "QUILPIE"
## [28] "TORRES" "TOWNSVILLE" "SUNSHINE COAST"
## [31] "TOWNSVILLE" "REDLAND" "TORRES"
## [34] "BUNDABERG" "LIVINGSTONE" "IPSWICH"
## [37] "CAIRNS" "ROCKHAMPTON" "TOOWOOMBA"
## [40] "GYMPIE" "WHITSUNDAY" "GLADSTONE"
## [43] "CENTRAL HIGHLANDS" "BURDEKIN" "RICHMOND"
## [46] "CAIRNS" "NOOSA" "TORRES"
## [49] "CLONCURRY" "GYMPIE"
#look at the unique local government areas (duplicates removed)
unique(random.fifty.rows$Local.Govt.Area)
## [1] "SUNSHINE COAST" "FLINDERS" "TORRES"
## [4] "LONGREACH" "WESTERN DOWNS" "WINTON"
## [7] "MURWEH" "BURDEKIN" "COOK"
## [10] "GOLD COAST" "GYMPIE" "WEIPA"
## [13] "BALONNE" "HINCHINBROOK" "NORTH BURNETT"
## [16] "NOOSA" "BUNDABERG" "SOMERSET"
## [19] "IPSWICH" "LOCKYER VALLEY" "MACKAY"
## [22] "GLADSTONE" "QUILPIE" "TOWNSVILLE"
## [25] "REDLAND" "LIVINGSTONE" "CAIRNS"
## [28] "ROCKHAMPTON" "TOOWOOMBA" "WHITSUNDAY"
## [31] "CENTRAL HIGHLANDS" "RICHMOND" "CLONCURRY"
#look at how many rows there are for each LGA
table(random.fifty.rows$Local.Govt.Area)
##
## BALONNE BUNDABERG BURDEKIN CAIRNS
## 1 2 3 2
## CENTRAL HIGHLANDS CLONCURRY COOK FLINDERS
## 1 1 1 1
## GLADSTONE GOLD COAST GYMPIE HINCHINBROOK
## 2 1 3 1
## IPSWICH LIVINGSTONE LOCKYER VALLEY LONGREACH
## 4 1 1 1
## MACKAY MURWEH NOOSA NORTH BURNETT
## 1 1 2 1
## QUILPIE REDLAND RICHMOND ROCKHAMPTON
## 1 1 1 1
## SOMERSET SUNSHINE COAST TOOWOOMBA TORRES
## 2 2 1 4
## TOWNSVILLE WEIPA WESTERN DOWNS WHITSUNDAY
## 2 1 1 1
## WINTON
## 1
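With many areas it helps to sort the counts. A small sketch on made-up names, using sort() on the result of table():

```r
# Made-up area names with repeats
areas <- c("TORRES", "IPSWICH", "TORRES", "GYMPIE", "IPSWICH", "TORRES")

counts <- table(areas)

# Most frequent first
sort(counts, decreasing = TRUE)

# Name of the most frequent area
names(counts)[which.max(counts)]
```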
Challenge: find which Local government area had the lowest non-zero amount of money lost in a month.
There are several different types of data you can use in R, and you can even make your own new types (but we won’t touch on that today). Now we will examine a few common ones in a little more detail.
Character strings are known as “character” in R:
gambling.data$Local.Govt.Area[3]
## [1] "BARCALDINE"
class(gambling.data$Local.Govt.Area)
## [1] "character"
Numbers have different classes. They can be integer:
gambling.data$Operational.EGMs[36]
## [1] 531
class(gambling.data$Operational.EGMs)
## [1] "integer"
They can be numeric:
gambling.data$Player.Money.Lost[84]
## [1] 3913829
class(gambling.data$Player.Money.Lost)
## [1] "numeric"
class(0.1)
## [1] "numeric"
class(pi)
## [1] "numeric"
They can be complex:
class(1i+3)
## [1] "complex"
They can be double:
x<-as.double(3.345678987654323456789)
They can have rounding errors
x-(x-0.0000000000001)
## [1] 9.992007e-14
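Because of these rounding errors, testing decimal numbers with == can give surprising answers. A common approach is to compare within a small tolerance, for example with all.equal():

```r
# 0.1 + 0.2 is not stored as exactly 0.3
(0.1 + 0.2) == 0.3

# all.equal() compares within a small tolerance;
# wrap it in isTRUE() to get a plain TRUE/FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))

# Small whole numbers are stored exactly, so this comparison is safe
(1 + 2) == 3
```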
TRUE and FALSE values are known as “logical” in R:
class(TRUE)
## [1] "logical"
class(1==1)
## [1] "logical"
A factor is a string label on top of an integer index
as.factor("I am a label")
## [1] I am a label
## Levels: I am a label
This makes most sense in the context of a survey response, e.g. 1=Good, 2, 3=Average, 4, 5=Bad; or something like that. For example in the simple data we had at the beginning, gender is a factor.
class(simple.data$gender)
## [1] "factor"
simple.data$gender
## [1] Female Male Female Male
## Levels: Female Male
as.numeric(simple.data$gender)
## [1] 1 2 1 2
table(simple.data$gender)
##
## Female Male
## 2 2
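The integer codes underneath a factor follow the order of its levels, which is alphabetical unless you say otherwise. The sketch below uses made-up survey responses to show how to set the order explicitly with the levels argument of factor().

```r
# Made-up survey responses
responses <- c("Good", "Bad", "Average", "Good")

# Default: levels are sorted alphabetically
default.factor <- factor(responses)
levels(default.factor)
as.numeric(default.factor)

# Explicit order: Good = 1, Average = 2, Bad = 3
survey.factor <- factor(responses, levels = c("Good", "Average", "Bad"))
levels(survey.factor)
as.numeric(survey.factor)
```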
Vectors are all of the same class
#a vector
c("a","b","c")
## [1] "a" "b" "c"
#another vector
c(1,4,7)
## [1] 1 4 7
If you try to put different classes of things into a vector it will try to convert them to the same class.
#this will be converted to a vector of character strings
(what.am.I<-c("a",2,"c",FALSE,c(1,2,3)))
## [1] "a" "2" "c" "FALSE" "1" "2" "3"
class(what.am.I)
## [1] "character"
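The conversion follows a fixed order: logical values become numbers, and numbers become character strings, whenever a vector needs them to match. A quick sketch:

```r
# Logical + numeric -> numeric (TRUE becomes 1, FALSE becomes 0)
c(TRUE, FALSE, 5)
class(c(TRUE, FALSE, 5))

# Numeric + character -> character
c(1.5, "a")
class(c(1.5, "a"))

# This is why sum() can count TRUE values
sum(c(TRUE, FALSE, TRUE))
```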
Lists can contain different classes of things in each element, possibly of different lengths
#a list
list("a",2,"c",FALSE,c(1,2,3))
## [[1]]
## [1] "a"
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] "c"
##
## [[4]]
## [1] FALSE
##
## [[5]]
## [1] 1 2 3
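Elements of a list are pulled out with double square brackets, while single square brackets return a smaller list. The element names in this sketch are made up for illustration.

```r
my.list <- list(letter = "a", number = 2, flag = FALSE, values = c(1, 2, 3))

# [[ ]] returns the element itself
my.list[[4]]
class(my.list[[4]])

# [ ] returns a list containing that element
class(my.list[4])

# Named elements can also be reached with $
my.list$values * 2
```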
Data frames are really a list of vectors of equal length, and they function like a table of data. We made one right at the beginning:
simple.data<-data.frame(first.name=c("Alice",
"Bob",
"Cathy",
"Daniel"),
gender=as.factor(c("Female",
"Male",
"Female",
"Male")),
favourite.number=c(4,23,pi,exp(1)),
favourite.letter=c("a","c","x","q"),
favourite.weekday=c("Monday","Thursday","Sunday","Friday"),
stringsAsFactors = FALSE)
We made a column called “first.name”, a vector of character strings (favourite.letter and favourite.weekday are also character strings)
We made a column called “gender”, a factor
We made a column called “favourite.number”, a vector of class numeric
Lastly we told it not to convert the character string columns into factors by adding the optional argument stringsAsFactors = FALSE to the data.frame() function. This conversion was the default behaviour before R 4.0.0; from R 4.0.0 onwards the default is already FALSE, so the argument only changes anything on older versions.
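Because a data frame is a list of columns, list tools work on it too. For instance, sapply() can report the class of every column at once; the data frame here is a shortened, made-up stand-in for simple.data.

```r
people <- data.frame(first.name = c("Alice", "Bob"),
                     gender = as.factor(c("Female", "Male")),
                     favourite.number = c(4, 23),
                     stringsAsFactors = FALSE)

# A data frame is a list underneath
is.list(people)

# Apply class() to each column
sapply(people, class)

# str() gives a compact overview of the structure
str(people)
```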
Add a new row to the data set using the rbind() function:
new.person <- data.frame(first.name="Evelyn",
gender="Female",
favourite.number=12,
favourite.letter="z",
favourite.weekday="Monday",
stringsAsFactors = FALSE)
simple.data<-rbind(simple.data,new.person)
Use the write.table() function to save the data frame as a text file
write.table(x = simple.data,
file = "simpleData.txt",
row.names = FALSE,
col.names = TRUE,
sep = ",")
Use the write.csv() function to save the data frame as a csv (comma separated values) file
write.csv(x = simple.data,
file = "simpleData.csv",
row.names = FALSE)
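To check that a file was written as expected, one option is to read it straight back in. This sketch writes to a temporary file (via tempfile()) rather than your working directory, and uses a made-up two-row data frame.

```r
people <- data.frame(first.name = c("Alice", "Bob"),
                     favourite.number = c(4.5, 23),
                     stringsAsFactors = FALSE)

# A throwaway file path so nothing in your folder is overwritten
temp.path <- tempfile(fileext = ".csv")

write.csv(x = people, file = temp.path, row.names = FALSE)

# Read the file back and compare with the original
read.back <- read.csv(file = temp.path, stringsAsFactors = FALSE)
read.back
all.equal(people, read.back)

# Tidy up
unlink(temp.path)
```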
Use the write.xlsx