Introduction

Welcome to the home page for the Brain & Mind Centre R course. It is a three-day course covering six separate topics that come up frequently for anyone doing pragmatic data analysis. The content is structured so that the material in each section is independent of the other sections, so if you can only make part of the course this won’t impact your learning.

These courses are designed to be informal and free-flowing. We have prepared some material, but if you have a question at any time, please let us know and we will be more than happy to assist. If there is anything that we can’t answer on the day, we will note it down and make sure that we get back to you with a thorough solution.

About R

The R Programming Language

R is an open source programming language and software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing. It is widely used around the world in a wide variety of domains and fields.

RStudio

RStudio is an integrated development environment (IDE) for R. It includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.

Example RStudio Screenshot

Installation & Setup

Below is a script that installs the packages you will need for the next three days and sets some options that will make the course run smoothly. We will explain what the script does as we get further into the material, but in the meantime we ask that you copy the script and run it in your environment.

# Load each package named in x, installing it first if it is missing
LoadPackages <- function(x){
  for( i in x ){
    if( ! require( i , character.only = TRUE ) ){
      install.packages( i , dependencies = TRUE , repos = "http://cran.us.r-project.org")
      require( i , character.only = TRUE )}}}

#  Then try/install packages...
LoadPackages(c("tidyverse",
               "xlsx",
               "ggplot2",
               "foreign",
               "dplyr"))

#options(max.print = 100)

options(scipen = 999)
set.seed(5)

Getting Started

Calculating things in R

Standard math functions work in R:

2+3
## [1] 5
1e5-5e4
## [1] 50000
1/1000
## [1] 0.001
sqrt(2)
## [1] 1.414214
2*pi
## [1] 6.283185

Let’s use vectors:

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
2*(1:10)
##  [1]  2  4  6  8 10 12 14 16 18 20
2^(1:10)
##  [1]    2    4    8   16   32   64  128  256  512 1024

We can store values:

x<-3
x=3
3->x
x
## [1] 3
x^2
## [1] 9

Vectors

We can store vectors:

x<-1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10
2^x
##  [1]    2    4    8   16   32   64  128  256  512 1024
y<-c(3,4,5)
y^2
## [1]  9 16 25
some.fruits<-c("apple","orange","banana")
some.fruits
## [1] "apple"  "orange" "banana"
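
Individual elements of a vector can be picked out with square brackets, which is the same idea the data frame examples further down rely on. The following is a minimal sketch; some.fruits is redefined here only so that the block runs on its own.

```r
some.fruits <- c("apple", "orange", "banana")

# select by position (R counts from 1)
some.fruits[2]
## [1] "orange"

# select several positions at once
some.fruits[c(1, 3)]
## [1] "apple"  "banana"

# select with a TRUE/FALSE vector of the same length
some.fruits[c(TRUE, FALSE, TRUE)]
## [1] "apple"  "banana"

# drop an element with a negative position
some.fruits[-1]
## [1] "orange" "banana"
```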

Working with Simple Data

A lot of the time in R we are working with tables of data, known as “data frames”.

Commonly:

  • rows represent instances, e.g. data points, people, events, etc.
  • columns represent different types of data associated with each instance, e.g. name, ID, location, time, value

Here is an example data frame:

simple.data<-data.frame(first.name=c("Alice",
                                     "Bob",
                                     "Cathy",
                                     "Daniel"),
                        gender=as.factor(c("Female",
                                           "Male",
                                           "Female",
                                           "Male")),
                        favourite.number=c(4,23,pi,exp(1)),
                        favourite.letter=c("a","c","x","q"),
                        favourite.weekday=c("Monday","Thursday","Sunday","Friday"),
                        stringsAsFactors = FALSE)

Viewing The Data

Use the function View() to visually look at the data

View(simple.data)

How many rows do we have?

nrow(simple.data)
## [1] 4

How many columns do we have?

ncol(simple.data)
## [1] 5

Accessing Subsets

Return only the first 3 rows of the data set

simple.data[1:3,]
##   first.name gender favourite.number favourite.letter favourite.weekday
## 1      Alice Female         4.000000                a            Monday
## 2        Bob   Male        23.000000                c          Thursday
## 3      Cathy Female         3.141593                x            Sunday
simple.data[c(1,2,3),]
##   first.name gender favourite.number favourite.letter favourite.weekday
## 1      Alice Female         4.000000                a            Monday
## 2        Bob   Male        23.000000                c          Thursday
## 3      Cathy Female         3.141593                x            Sunday
simple.data[c(TRUE,TRUE,TRUE,FALSE),]
##   first.name gender favourite.number favourite.letter favourite.weekday
## 1      Alice Female         4.000000                a            Monday
## 2        Bob   Male        23.000000                c          Thursday
## 3      Cathy Female         3.141593                x            Sunday
head(simple.data,3)
##   first.name gender favourite.number favourite.letter favourite.weekday
## 1      Alice Female         4.000000                a            Monday
## 2        Bob   Male        23.000000                c          Thursday
## 3      Cathy Female         3.141593                x            Sunday

Return the last two rows in a data set

simple.data[nrow(simple.data)+(-1:0),]
##   first.name gender favourite.number favourite.letter favourite.weekday
## 3      Cathy Female         3.141593                x            Sunday
## 4     Daniel   Male         2.718282                q            Friday
tail(simple.data,2)
##   first.name gender favourite.number favourite.letter favourite.weekday
## 3      Cathy Female         3.141593                x            Sunday
## 4     Daniel   Male         2.718282                q            Friday
simple.data[c(FALSE,FALSE,TRUE,TRUE),]
##   first.name gender favourite.number favourite.letter favourite.weekday
## 3      Cathy Female         3.141593                x            Sunday
## 4     Daniel   Male         2.718282                q            Friday

Return a random two rows from the data set

dplyr::sample_n(simple.data,2)
##   first.name gender favourite.number favourite.letter favourite.weekday
## 1      Alice Female         4.000000                a            Monday
## 3      Cathy Female         3.141593                x            Sunday

Return only the “favourite.letter” column in the data set

simple.data[,4]
## [1] "a" "c" "x" "q"
simple.data[,"favourite.letter"]
## [1] "a" "c" "x" "q"
simple.data[,c("favourite.letter")]
## [1] "a" "c" "x" "q"
simple.data[,c(FALSE,FALSE,FALSE,TRUE,FALSE)]
## [1] "a" "c" "x" "q"
simple.data[,names(simple.data) %in% c("favourite.letter")]
## [1] "a" "c" "x" "q"
simple.data$favourite.letter
## [1] "a" "c" "x" "q"
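
All of the single-column selections above return a plain vector rather than a data frame. If you need the result to stay a data frame, the drop argument of the square brackets controls this. A small sketch, using a hypothetical two-column data frame called people that is not part of the course data:

```r
# hypothetical two-column data frame, used only for this illustration
people <- data.frame(first.name = c("Alice", "Bob"),
                     favourite.letter = c("a", "c"),
                     stringsAsFactors = FALSE)

# selecting a single column returns a plain vector by default
letters.vector <- people[, "favourite.letter"]
class(letters.vector)
## [1] "character"

# drop = FALSE keeps the result as a one-column data frame
letters.frame <- people[, "favourite.letter", drop = FALSE]
class(letters.frame)
## [1] "data.frame"

# single square brackets without a comma also keep the data frame
class(people["favourite.letter"])
## [1] "data.frame"
```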

Return only the first 3 rows and columns 2 and 5 of the data set

simple.data[1:3, c(2,5)]
##   gender favourite.weekday
## 1 Female            Monday
## 2   Male          Thursday
## 3 Female            Sunday

Return the columns named “first.name” and “favourite.number”

simple.data[,c("first.name","favourite.number")]
##   first.name favourite.number
## 1      Alice         4.000000
## 2        Bob        23.000000
## 3      Cathy         3.141593
## 4     Daniel         2.718282

Filtering the data

Return only the rows (people) which are Female

simple.data[simple.data$gender=="Female",]
##   first.name gender favourite.number favourite.letter favourite.weekday
## 1      Alice Female         4.000000                a            Monday
## 3      Cathy Female         3.141593                x            Sunday

What exactly happened here?

We made a vector of TRUE/FALSE values, one per row, recording whether the condition is true for that row:

indexes<-simple.data$gender=="Female"
indexes
## [1]  TRUE FALSE  TRUE FALSE

Then we kept only the rows for which the index is TRUE:

simple.data[indexes,]
##   first.name gender favourite.number favourite.letter favourite.weekday
## 1      Alice Female         4.000000                a            Monday
## 3      Cathy Female         3.141593                x            Sunday
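
Because the index is just a logical vector, several conditions can be combined before subsetting. The sketch below uses a cut-down copy of simple.data, named people here purely for illustration, and an arbitrary cut-off of 3.5.

```r
# a cut-down copy of simple.data so that this block runs on its own
people <- data.frame(first.name = c("Alice", "Bob", "Cathy", "Daniel"),
                     gender = c("Female", "Male", "Female", "Male"),
                     favourite.number = c(4, 23, pi, exp(1)),
                     stringsAsFactors = FALSE)

# & means "and": both conditions must be TRUE for a row to be kept
is.female <- people$gender == "Female"
is.large  <- people$favourite.number > 3.5
is.female & is.large
## [1]  TRUE FALSE FALSE FALSE
people$first.name[is.female & is.large]
## [1] "Alice"

# | means "or": at least one condition must be TRUE
people$first.name[is.female | is.large]
## [1] "Alice" "Bob"   "Cathy"

# ! negates a logical vector
people$first.name[!is.female]
## [1] "Bob"    "Daniel"
```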

What if we want all of the people whose favourite number is larger than 10?

simple.data[simple.data$favourite.number>10,]
##   first.name gender favourite.number favourite.letter favourite.weekday
## 2        Bob   Male               23                c          Thursday

What if we want just the names of all the people whose favourite number is larger than 10?

simple.data[simple.data$favourite.number>10,"first.name"]
## [1] "Bob"

We could have also done

simple.data$first.name[simple.data$favourite.number>10]
## [1] "Bob"

Quiz:

  1. How would you extract Alice’s favourite number?
  2. How would you extract Daniel’s favourite weekday?

More Realistic Data

The dataset that we will be using today is taken from the Queensland Government website and is Gaming Machine Data by Local Government Areas. It has the following variables:

Variable | Definition
Month Year | The month and year for which the gambling data is provided.
LGA Region Name | The name of the Local Government Area.
Approved Sites | The number of venues approved to operate electronic gaming machines.
Operational Sites | The number of venues that were operating electronic gaming machines on the last day of the relevant month.
Approved EGMs | The maximum number of electronic gaming machines the venue is approved to operate.
Operational EGMs | The number of electronic gaming machines operating at the venue on the last day of the relevant month.
Metered Win | The amount of money lost by players of electronic gaming machines.

If you haven’t already, the data can be downloaded from the homepage of this course, or alternatively you can click here.

Reading in Data

The first thing to do before we can begin an analysis is to load some data. To do this we can use the command below.

gambling.data <- read.csv(file = "http://data.justice.qld.gov.au/JSD/OLGR/20170817_OLGR_LGA-EGM-data.csv",
                 header = TRUE,
                 sep = ",",
                 stringsAsFactors = FALSE)

Before going any further, let’s break down what this command does specifically.

  1. It reads the file, which could be
  • a file path e.g. "~/Documents/MyFolder/datafile.csv"
  • a url like we have here e.g. "http://www.website.com/subdomain/datafile.csv"
  2. It specifies that the first row in the csv file contains “header” information i.e. column names
  3. It specifies that neighbouring columns are separated by a comma “,”
  4. It says not to convert character strings (i.e. text) to something called a factor
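
To see the effect of these arguments without downloading anything, the same call can be tried on a few lines of made-up text. This is only a sketch: the text argument stands in for a real file, and the column names name and score are invented for the example.

```r
# a tiny made-up CSV, supplied as text instead of a file path or URL
csv.text <- "name,score\nAlice,4\nBob,23"

example.data <- read.csv(text = csv.text,
                         header = TRUE,
                         sep = ",",
                         stringsAsFactors = FALSE)

names(example.data)        # taken from the first row because header = TRUE
## [1] "name"  "score"
class(example.data$name)   # text stays as character, not factor
## [1] "character"
class(example.data$score)  # whole numbers are read as integer
## [1] "integer"

# with header = FALSE the first row is treated as data and columns are auto-named
names(read.csv(text = csv.text, header = FALSE))
## [1] "V1" "V2"
```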

How would we find this out if we didn’t know already? Look at the help file:

?read.csv

Comma Separated Value (.csv) files are usually the standard, simplest format, compatible with all sorts of different software, e.g. R, Python, Excel, MATLAB, …

But if we needed to read in a different format of data, there’s usually a function or a library for doing just that,

e.g. in base R:

  • read.csv()
  • read.table()

In the package “xlsx”:

  • read.xlsx()

In the package “foreign”:

  • read.dta() for Stata (versions 5-12) data files
  • read.spss() for SPSS files

How can we examine this data set now that we’ve loaded it?

Viewing The Data

Use the function View() to visually look at the data

View(gambling.data)

Changing The Column Names

Using the background information on the dataset, change the column names so they make sense.

names(gambling.data)
## [1] "Month.Year"        "LGA.Region"        "Approved.Sites"   
## [4] "Operational.Sites" "Approved.EGMs"     "Operational.EGMs" 
## [7] "Metered.Win"
names(gambling.data)[2] <- "Local.Govt.Area"

names(gambling.data)[7] <- "Player.Money.Lost"

Afterwards, view the data again to check the column names have been changed.

View(gambling.data)

Add a new Date column

Time to add in a column which has the date as a date-time object instead of as a character string. We will need to use several functions to do this.

The first function we will need is paste0(). This function concatenates strings. For example:

paste0("Wo","rd")
## [1] "Word"

The second function we will need is strptime()

Use the help function to find out what it does: ?strptime

Now let’s convert the character string which we have describing the month and year of each row of data, into a date-time object so that it will be ordered appropriately etc.

#Add a day of month (1st) to each date string
date.string <- paste0( "1 " , gambling.data$Month.Year )

#Convert to POSIXlt, a date-time format (%B expects the full month name, e.g. "July")
gambling.data$Date <- strptime( date.string , format = "%d %B %Y" )
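
The benefit of the conversion shows up as soon as the values are sorted. The sketch below uses three made-up month strings in the same style as the Month.Year column; it assumes an English locale, since %B matches month names in the current locale.

```r
# made-up month/year strings in the same style as the Month.Year column
month.year <- c("July 2017", "January 2010", "March 2004")

# as character strings they sort alphabetically, which is not chronological
sort(month.year)
## [1] "January 2010" "July 2017"    "March 2004"

# add a day of the month, then parse: %d = day, %B = full month name, %Y = 4-digit year
dates <- strptime(paste0("1 ", month.year), format = "%d %B %Y")
class(dates)
## [1] "POSIXlt" "POSIXt"

# as date-times they sort chronologically
format(sort(dates), "%Y-%m-%d")
## [1] "2004-03-01" "2010-01-01" "2017-07-01"
```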

Afterwards, view the data again to check the new “Date” column.

View(gambling.data)

Accessing Subsets

Let’s look at all the records for Brisbane:

brisbane.only<-gambling.data[gambling.data$Local.Govt.Area=="BRISBANE",]

What is the average amount of money lost by players in any given month in Brisbane?

mean(brisbane.only$Player.Money.Lost)
## [1] 35825434

That’s a lot of money!

What is the largest amount of money lost by players in any given month in Brisbane?

max(brisbane.only$Player.Money.Lost)
## [1] 47092687

What month was this?

brisbane.only$Month.Year[which.max(brisbane.only$Player.Money.Lost)]
## [1] "July 2017"
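
The trick in that last command is that which.max() returns the position of the largest value rather than the value itself, and that position can then be used to index another column. A toy sketch with made-up numbers and labels:

```r
# made-up monthly losses with matching month labels
losses <- c(310, 475, 420)
months <- c("May", "June", "July")

max(losses)                # the largest value itself
## [1] 475
which.max(losses)          # the position of the largest value
## [1] 2
months[which.max(losses)]  # use that position to look up the matching label
## [1] "June"
```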

What if we want the rows from Brisbane which fall in 2010?

row.indices<-(brisbane.only$Date>="2010-01-01" &
                brisbane.only$Date<="2010-12-31")

(brisbane.2010.data<-brisbane.only[row.indices,])
##          Month.Year Local.Govt.Area Approved.Sites Operational.Sites
## 3635   January 2010        BRISBANE            227               220
## 3690  February 2010        BRISBANE            227               220
## 3745     March 2010        BRISBANE            227               220
## 3800     April 2010        BRISBANE            227               221
## 3855       May 2010        BRISBANE            228               222
## 3910      June 2010        BRISBANE            227               222
## 3965      July 2010        BRISBANE            226               219
## 4020    August 2010        BRISBANE            226               218
## 4075 September 2010        BRISBANE            225               218
## 4130   October 2010        BRISBANE            225               218
## 4185  November 2010        BRISBANE            225               218
## 4240  December 2010        BRISBANE            225               217
##      Approved.EGMs Operational.EGMs Player.Money.Lost       Date
## 3635          9183             8834          31268720 2010-01-01
## 3690          9175             8854          30025451 2010-02-01
## 3745          9225             8859          32183381 2010-03-01
## 3800          9345             8956          32017037 2010-04-01
## 3855          9230             8815          32244843 2010-05-01
## 3910          9166             8872          31873072 2010-06-01
## 3965          9144             8809          36225638 2010-07-01
## 4020          9119             8791          36861039 2010-08-01
## 4075          9106             8812          34763792 2010-09-01
## 4130          9106             8799          36211785 2010-10-01
## 4185          9126             8830          33534227 2010-11-01
## 4240          9126             8797          35019142 2010-12-01

On the last line we used a shortcut; if you want to assign a variable with <- but also print it, you can put the whole expression in parentheses.

Summarising The Data

Use the summary function to return a quick summary

summary(gambling.data)
##   Month.Year        Local.Govt.Area    Approved.Sites  Operational.Sites
##  Length:8635        Length:8635        Min.   :  1.0   Min.   :  1.00   
##  Class :character   Class :character   1st Qu.:  5.0   1st Qu.:  5.00   
##  Mode  :character   Mode  :character   Median : 12.0   Median : 11.00   
##                                        Mean   : 24.1   Mean   : 23.36   
##                                        3rd Qu.: 28.0   3rd Qu.: 28.00   
##                                        Max.   :243.0   Max.   :235.00   
##                                                        NA's   :19       
##  Approved.EGMs  Operational.EGMs Player.Money.Lost 
##  Min.   :   5   Min.   :   4.0   Min.   :    9265  
##  1st Qu.:  77   1st Qu.:  76.0   1st Qu.:  438955  
##  Median : 255   Median : 238.0   Median : 1028498  
##  Mean   : 796   Mean   : 762.5   Mean   : 3773709  
##  3rd Qu.:1027   3rd Qu.: 979.0   3rd Qu.: 4458886  
##  Max.   :9345   Max.   :8970.0   Max.   :47092687  
##                 NA's   :19       NA's   :1959      
##       Date                    
##  Min.   :2004-07-01 00:00:00  
##  1st Qu.:2007-10-01 00:00:00  
##  Median :2011-01-01 00:00:00  
##  Mean   :2010-12-31 11:12:59  
##  3rd Qu.:2014-04-01 00:00:00  
##  Max.   :2017-07-01 00:00:00  
## 
summary(brisbane.only)
##   Month.Year        Local.Govt.Area    Approved.Sites  Operational.Sites
##  Length:157         Length:157         Min.   :190.0   Min.   :183.0    
##  Class :character   Class :character   1st Qu.:207.0   1st Qu.:204.0    
##  Mode  :character   Mode  :character   Median :225.0   Median :216.0    
##                                        Mean   :220.4   Mean   :213.5    
##                                        3rd Qu.:233.0   3rd Qu.:226.0    
##                                        Max.   :243.0   Max.   :235.0    
##  Approved.EGMs  Operational.EGMs Player.Money.Lost 
##  Min.   :8779   Min.   :8347     Min.   :27820060  
##  1st Qu.:8997   1st Qu.:8642     1st Qu.:32591123  
##  Median :9118   Median :8747     Median :34936940  
##  Mean   :9076   Mean   :8723     Mean   :35825434  
##  3rd Qu.:9176   3rd Qu.:8818     3rd Qu.:38277230  
##  Max.   :9345   Max.   :8970     Max.   :47092687  
##       Date                    
##  Min.   :2004-07-01 00:00:00  
##  1st Qu.:2007-10-01 00:00:00  
##  Median :2011-01-01 00:00:00  
##  Mean   :2010-12-31 11:12:59  
##  3rd Qu.:2014-04-01 00:00:00  
##  Max.   :2017-07-01 00:00:00
summary(brisbane.2010.data)
##   Month.Year        Local.Govt.Area    Approved.Sites  Operational.Sites
##  Length:12          Length:12          Min.   :225.0   Min.   :217.0    
##  Class :character   Class :character   1st Qu.:225.0   1st Qu.:218.0    
##  Mode  :character   Mode  :character   Median :226.5   Median :219.5    
##                                        Mean   :226.2   Mean   :219.4    
##                                        3rd Qu.:227.0   3rd Qu.:220.2    
##                                        Max.   :228.0   Max.   :222.0    
##  Approved.EGMs  Operational.EGMs Player.Money.Lost 
##  Min.   :9106   Min.   :8791     Min.   :30025451  
##  1st Qu.:9124   1st Qu.:8806     1st Qu.:31981046  
##  Median :9155   Median :8822     Median :32889535  
##  Mean   :9171   Mean   :8836     Mean   :33519011  
##  3rd Qu.:9194   3rd Qu.:8855     3rd Qu.:35317303  
##  Max.   :9345   Max.   :8956     Max.   :36861039  
##       Date                    
##  Min.   :2010-01-01 00:00:00  
##  1st Qu.:2010-03-24 06:00:00  
##  Median :2010-06-16 00:00:00  
##  Mean   :2010-06-16 11:30:00  
##  3rd Qu.:2010-09-08 12:00:00  
##  Max.   :2010-12-01 00:00:00

Use the summary function to return a quick summary of only the money column

summary(brisbane.2010.data$Player.Money.Lost)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 30025451 31981046 32889535 33519011 35317303 36861039
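
The summary of the full data set above reports NA's (missing values) in several columns. Functions such as mean() and min() return NA when any value is missing, unless told to ignore missing values. A minimal sketch with made-up numbers:

```r
# made-up values with one missing entry
money <- c(100, NA, 300)

mean(money)                 # any NA makes the result NA
## [1] NA
mean(money, na.rm = TRUE)   # drop missing values before averaging
## [1] 200
sum(is.na(money))           # count the missing values
## [1] 1
min(money[!is.na(money) & money > 0])  # smallest non-missing, positive value
## [1] 100
```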

Get a random selection of data and look at which Local Government Areas are included

random.fifty.rows<-dplyr::sample_n(gambling.data,50)

#look at each row's Local Government Area (LGA)
random.fifty.rows$Local.Govt.Area
##  [1] "SUNSHINE COAST"    "FLINDERS"          "TORRES"           
##  [4] "LONGREACH"         "WESTERN DOWNS"     "WINTON"           
##  [7] "MURWEH"            "BURDEKIN"          "COOK"             
## [10] "GOLD COAST"        "GYMPIE"            "WEIPA"            
## [13] "BURDEKIN"          "BALONNE"           "HINCHINBROOK"     
## [16] "NORTH BURNETT"     "NOOSA"             "BUNDABERG"        
## [19] "SOMERSET"          "IPSWICH"           "SOMERSET"         
## [22] "IPSWICH"           "IPSWICH"           "LOCKYER VALLEY"   
## [25] "MACKAY"            "GLADSTONE"         "QUILPIE"          
## [28] "TORRES"            "TOWNSVILLE"        "SUNSHINE COAST"   
## [31] "TOWNSVILLE"        "REDLAND"           "TORRES"           
## [34] "BUNDABERG"         "LIVINGSTONE"       "IPSWICH"          
## [37] "CAIRNS"            "ROCKHAMPTON"       "TOOWOOMBA"        
## [40] "GYMPIE"            "WHITSUNDAY"        "GLADSTONE"        
## [43] "CENTRAL HIGHLANDS" "BURDEKIN"          "RICHMOND"         
## [46] "CAIRNS"            "NOOSA"             "TORRES"           
## [49] "CLONCURRY"         "GYMPIE"
#look at the unique local government areas (duplicates removed)
unique(random.fifty.rows$Local.Govt.Area)
##  [1] "SUNSHINE COAST"    "FLINDERS"          "TORRES"           
##  [4] "LONGREACH"         "WESTERN DOWNS"     "WINTON"           
##  [7] "MURWEH"            "BURDEKIN"          "COOK"             
## [10] "GOLD COAST"        "GYMPIE"            "WEIPA"            
## [13] "BALONNE"           "HINCHINBROOK"      "NORTH BURNETT"    
## [16] "NOOSA"             "BUNDABERG"         "SOMERSET"         
## [19] "IPSWICH"           "LOCKYER VALLEY"    "MACKAY"           
## [22] "GLADSTONE"         "QUILPIE"           "TOWNSVILLE"       
## [25] "REDLAND"           "LIVINGSTONE"       "CAIRNS"           
## [28] "ROCKHAMPTON"       "TOOWOOMBA"         "WHITSUNDAY"       
## [31] "CENTRAL HIGHLANDS" "RICHMOND"          "CLONCURRY"
#look at how many rows there are for each LGA
table(random.fifty.rows$Local.Govt.Area)
## 
##           BALONNE         BUNDABERG          BURDEKIN            CAIRNS 
##                 1                 2                 3                 2 
## CENTRAL HIGHLANDS         CLONCURRY              COOK          FLINDERS 
##                 1                 1                 1                 1 
##         GLADSTONE        GOLD COAST            GYMPIE      HINCHINBROOK 
##                 2                 1                 3                 1 
##           IPSWICH       LIVINGSTONE    LOCKYER VALLEY         LONGREACH 
##                 4                 1                 1                 1 
##            MACKAY            MURWEH             NOOSA     NORTH BURNETT 
##                 1                 1                 2                 1 
##           QUILPIE           REDLAND          RICHMOND       ROCKHAMPTON 
##                 1                 1                 1                 1 
##          SOMERSET    SUNSHINE COAST         TOOWOOMBA            TORRES 
##                 2                 2                 1                 4 
##        TOWNSVILLE             WEIPA     WESTERN DOWNS        WHITSUNDAY 
##                 2                 1                 1                 1 
##            WINTON 
##                 1

Challenge: find which Local Government Area had the lowest non-zero amount of money lost in a month.

Data Types

There are several different types of data you can use in R, and you can even make your own new types (but we won’t touch on that today). Now we will examine a few common ones in a little more detail.

Strings

Character strings are known as “character” in R:

gambling.data$Local.Govt.Area[3]
## [1] "BARCALDINE"
class(gambling.data$Local.Govt.Area)
## [1] "character"

Numbers

Numbers have different classes. They can be integer:

gambling.data$Operational.EGMs[36]
## [1] 531
class(gambling.data$Operational.EGMs)
## [1] "integer"

They can be numeric:

gambling.data$Player.Money.Lost[84]
## [1] 3913829
class(gambling.data$Player.Money.Lost)
## [1] "numeric"
class(0.1)
## [1] "numeric"
class(pi)
## [1] "numeric"

They can be complex:

class(1i+3)
## [1] "complex"

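Numeric values are stored as doubles (double-precision floating point numbers):

x<-as.double(3.345678987654323456789)

Doubles have limited precision, so calculations can have rounding errors: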

x-(x-0.0000000000001)
## [1] 9.992007e-14
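
One practical consequence is that testing two decimal numbers for exact equality with == can give a surprising answer. A short sketch of the usual workaround, all.equal(), which compares within a small tolerance:

```r
# 0.1, 0.2 and 0.3 cannot be stored exactly in binary floating point
0.1 + 0.2 == 0.3
## [1] FALSE

# all.equal() compares within a small tolerance; wrap it in isTRUE() for a TRUE/FALSE answer
isTRUE(all.equal(0.1 + 0.2, 0.3))
## [1] TRUE

# whole numbers of moderate size are stored exactly
2 + 3 == 5
## [1] TRUE
```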

Logical (True/False)

class(TRUE)
## [1] "logical"
class(1==1)
## [1] "logical"

Factor

A factor is a string label on top of an integer index

as.factor("I am a label")
## [1] I am a label
## Levels: I am a label

This makes most sense in the context of a survey response, e.g. 1=Good, 2, 3=Average, 4, 5=Bad; or something like that. For example in the simple data we had at the beginning, gender is a factor.

class(simple.data$gender)
## [1] "factor"
simple.data$gender
## [1] Female Male   Female Male  
## Levels: Female Male
as.numeric(simple.data$gender)
## [1] 1 2 1 2
table(simple.data$gender)
## 
## Female   Male 
##      2      2
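
Two things are worth trying with factors: setting the order of the levels yourself, and converting a factor back to numbers. The sketch below uses made-up survey responses. The second half shows a common pitfall when the labels look like numbers.

```r
# survey-style responses with an explicit order for the levels
responses <- factor(c("Good", "Bad", "Average", "Good"),
                    levels = c("Bad", "Average", "Good"))
levels(responses)
## [1] "Bad"     "Average" "Good"
as.integer(responses)   # the integer index behind each label
## [1] 3 1 2 3

# pitfall: for labels that look like numbers, as.numeric() returns the index, not the label
number.labels <- factor(c("10", "20", "10"))
as.numeric(number.labels)
## [1] 1 2 1
as.numeric(as.character(number.labels))   # convert via character to recover the values
## [1] 10 20 10
```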

Vectors and Lists

All of the elements of a vector are of the same class

#a vector
c("a","b","c")
## [1] "a" "b" "c"
#another vector
c(1,4,7)
## [1] 1 4 7

If you try to put different classes of things into a vector it will try to convert them to the same class.

#this will be converted to a vector of character strings
(what.am.I<-c("a",2,"c",FALSE,c(1,2,3)))
## [1] "a"     "2"     "c"     "FALSE" "1"     "2"     "3"
class(what.am.I)
## [1] "character"
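
The conversion follows a fixed order: logical values become integers, integers become numeric, and anything combined with text becomes character. A few small cases illustrate this:

```r
class(c(TRUE, 2L))      # logical + integer -> integer
## [1] "integer"
class(c(2L, 3.5))       # integer + numeric -> numeric
## [1] "numeric"
class(c(3.5, "a"))      # anything + character -> character
## [1] "character"

# TRUE becomes 1 and FALSE becomes 0, which makes counting easy
sum(c(TRUE, FALSE, TRUE))
## [1] 2

# converting back only works when the text looks like a number
as.numeric(c("2", "3.5"))
## [1] 2.0 3.5
```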

Lists can contain different classes of things in each element, possibly of different lengths

#a list
list("a",2,"c",FALSE,c(1,2,3))
## [[1]]
## [1] "a"
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] "c"
## 
## [[4]]
## [1] FALSE
## 
## [[5]]
## [1] 1 2 3
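
Elements of a list can be given names and pulled out individually. The sketch below builds a list like the one above with invented element names (letter, number, flag, values) and shows the difference between double and single square brackets.

```r
my.list <- list(letter = "a", number = 2, flag = FALSE, values = c(1, 2, 3))

# double square brackets (or $) return the element itself
my.list[["values"]]
## [1] 1 2 3
my.list$number
## [1] 2

# single square brackets return a shorter list
class(my.list["values"])
## [1] "list"

# each element keeps its own class
unname(sapply(my.list, class))
## [1] "character" "numeric"   "logical"   "numeric"
```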

Data frames

Data frames are really a list of vectors of equal length, and they function like a table of data. We made one right at the beginning:

simple.data<-data.frame(first.name=c("Alice",
                                     "Bob",
                                     "Cathy",
                                     "Daniel"),
                        gender=as.factor(c("Female",
                                           "Male",
                                           "Female",
                                           "Male")),
                        favourite.number=c(4,23,pi,exp(1)),
                        favourite.letter=c("a","c","x","q"),
                        favourite.weekday=c("Monday","Thursday","Sunday","Friday"),
                        stringsAsFactors = FALSE)
  • We made a column called “first.name”, a vector of character strings (favourite.letter and favourite.weekday are also character strings)

  • We made a column called “gender”, a vector of factors

  • We made a column called “favourite.number”, a vector of class numeric

  • Lastly we told it not to convert all the character string columns into factors (which data.frame() did by default before R 4.0.0) by adding the optional argument stringsAsFactors = FALSE to the data.frame() function.
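
Both points, that a data frame is a list of columns and that stringsAsFactors decides how text columns are stored, can be checked directly. A minimal sketch with a made-up data frame called df:

```r
# a small made-up data frame
df <- data.frame(name = c("Alice", "Bob"),
                 number = c(4, 23),
                 stringsAsFactors = FALSE)

is.list(df)        # a data frame is a list underneath
## [1] TRUE
length(df)         # so its length is the number of columns
## [1] 2
df[["number"]]     # and list-style access returns a column vector
## [1]  4 23

# stringsAsFactors controls whether text columns become factors
class(data.frame(x = c("a", "b"), stringsAsFactors = TRUE)$x)
## [1] "factor"
class(data.frame(x = c("a", "b"), stringsAsFactors = FALSE)$x)
## [1] "character"
```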

Adding records

Add a new row to the data set using the rbind() function:

new.person <- data.frame(first.name="Evelyn",
                        gender="Female",
                        favourite.number=12,
                        favourite.letter="z",
                        favourite.weekday="Monday",
                        stringsAsFactors = FALSE)

simple.data<-rbind(simple.data,new.person)
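
A quick way to confirm that the row was added is to compare nrow() before and after. The sketch below uses a smaller made-up data frame called people, and also shows that rbind() matches data frame columns by name rather than by position.

```r
people <- data.frame(first.name = c("Alice", "Bob"),
                     favourite.number = c(4, 23),
                     stringsAsFactors = FALSE)
new.row <- data.frame(first.name = "Evelyn",
                      favourite.number = 12,
                      stringsAsFactors = FALSE)

nrow(people)
## [1] 2
people <- rbind(people, new.row)
nrow(people)
## [1] 3

# columns are matched by name, so the order of columns in the new row does not matter
reordered <- data.frame(favourite.number = 7, first.name = "Frank",
                        stringsAsFactors = FALSE)
tail(rbind(people, reordered), 1)$first.name
## [1] "Frank"
```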

Writing Data

Use the write.table() function to write the data to a text file

write.table(x = simple.data,
            file = "simpleData.txt",
            row.names = FALSE,
            col.names = TRUE,
            sep = ",")

Use the write.csv() function to write the data to a csv (comma separated values) file

write.csv(x = simple.data,
          file = "simpleData.csv",
          row.names = FALSE)
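
To check that a file was written as expected, it can be read straight back in. The sketch below writes a small made-up data frame to a temporary file (via tempfile(), so nothing is left in your working directory) and reads it again.

```r
# a small made-up data frame and a temporary file path
people <- data.frame(first.name = c("Alice", "Bob"),
                     favourite.number = c(4, 23),
                     stringsAsFactors = FALSE)
temp.file <- tempfile(fileext = ".csv")

write.csv(x = people, file = temp.file, row.names = FALSE)
file.exists(temp.file)
## [1] TRUE

# read the file back and compare it with the original
people.again <- read.csv(file = temp.file, stringsAsFactors = FALSE)
names(people.again)
## [1] "first.name"       "favourite.number"
people.again$first.name
## [1] "Alice" "Bob"
class(people.again$favourite.number)   # whole numbers come back as integer
## [1] "integer"
```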

Use the write.xlsx