Basics of object-oriented programming in R
Simple R data structures
Simple descriptive statistics, (some) inferential statistics, and plotting with Base R
(A few) programming best practices
8 February 2016
Open source programming language, with a particular focus on statistical programming.
History: Originally (in 1993) an implementation of the S programming language (Bell Labs), by Ross Ihaka and Robert Gentleman (hence R) at University of Auckland.
Currently the R Foundation for Statistical Computing is based in Vienna. Development is distributed globally.
RStudio:
RStudio is an Integrated Development Environment (IDE) that makes using R and other reproducible research tools easier.
R can be easily expanded by user-created packages hosted on GitHub and/or CRAN.
citation()
##
## To cite R in publications use:
##
##   R Core Team (2015). R: A language and environment for
##   statistical computing. R Foundation for Statistical Computing,
##   Vienna, Austria. URL https://www.R-project.org/.
##
## A BibTeX entry for LaTeX users is
##
##   @Manual{,
##     title = {R: A Language and Environment for Statistical Computing},
##     author = {{R Core Team}},
##     organization = {R Foundation for Statistical Computing},
##     address = {Vienna, Austria},
##     year = {2015},
##     url = {https://www.R-project.org/},
##   }
##
## We have invested a lot of time and effort in creating R, please
## cite it when using it for data analysis. See also
## 'citation("pkgname")' for citing R packages.
Like many other popular programming languages, R is object-oriented.
Objects are R's nouns. They include (not exhaustive):
character strings (e.g. words)
numbers
vectors of numbers or character strings
matrices
data frames
lists
2 + 3
## [1] 5
2 - 3
## [1] -1
2 * 3
## [1] 6
2 / 3
## [1] 0.6666667
You use the assignment operator (<-) to assign character strings, numbers, vectors, etc. to object names.
# Assign the number 10 to an object called number
number <- 10
number
## [1] 10
# Assign Hello World to an object called words
words <- "Hello World"
words
## [1] "Hello World"
You can assign almost anything to an object. For example, the output of a maths operation:
divided <- 2 / 3
divided
## [1] 0.6666667
You can also use the equality sign (=):
number = 10
number
## [1] 10
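Note: = has a slightly different meaning. Inside a function call it matches an argument by name instead of creating an object, so <- is the clearer choice for assignment.
See the StackOverflow discussion.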
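A minimal sketch of one practical difference between the two operators. The object name x is used only for illustration, and the first line clears any existing x so the result does not depend on your workspace.

```r
# Make sure no object called x exists before the illustration
if (exists("x", inherits = FALSE)) rm(x)

# Inside a call, = only matches the argument named x of mean()
mean(x = c(2, 4, 6))
after_equals <- exists("x", inherits = FALSE)   # no object x was created

# <- assigns c(2, 4, 6) to x first, then passes the value to mean()
mean(x <- c(2, 4, 6))
after_arrow <- exists("x", inherits = FALSE)    # now x exists

c(after_equals, after_arrow)                    # FALSE TRUE
```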
NA: not available, missing
NULL: does not exist, is undefined
TRUE, T: logical true. Logical is also an object class.
FALSE, F: logical false
Function | Meaning |
---|---|
is.na | Is the value NA |
is.null | Is the value NULL |
isTRUE | Is the value TRUE |
!isTRUE | Is the value anything other than TRUE (e.g. FALSE or NA) |
absent <- NA
is.na(absent)
## [1] TRUE
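A short sketch of how these checks behave in practice. The scores vector is hypothetical. Note that is.na works element by element, and that NA and NULL are different things.

```r
# A hypothetical vector with one missing value
scores <- c(12, NA, 30)

is.na(scores)                # FALSE TRUE FALSE: each element is checked
mean(scores)                 # NA: the missing value propagates
mean(scores, na.rm = TRUE)   # 21: mean of the observed values

is.null(NULL)                # TRUE
length(NULL)                 # 0: NULL holds nothing at all
length(NA)                   # 1: NA is a value that happens to be missing

isTRUE(NA)                   # FALSE
!isTRUE(NA)                  # TRUE, even though NA is not FALSE
```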
Operator | Meaning |
---|---|
< | less than |
> | greater than |
== | equal to |
<= | less than or equal to |
>= | greater than or equal to |
!= | not equal to |
a \| b | a or b |
a & b | a and b |
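These operators work element by element on vectors and return logical values. A small sketch, using a hypothetical vector of ages:

```r
# Hypothetical ages, used only for illustration
ages <- c(15, 22, 37, 64)

ages >= 18               # FALSE TRUE TRUE TRUE
ages == 37               # FALSE FALSE TRUE FALSE
ages >= 18 & ages < 60   # FALSE TRUE TRUE FALSE
ages < 18 | ages > 60    # TRUE FALSE FALSE TRUE

# Logical values count as 0/1, so sum() counts the matches
sum(ages >= 18)          # 3
```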
Objects have distinct classes.
# Find the class of number
class(number)
## [1] "numeric"
# Find the class of absent
class(absent)
## [1] "logical"
Object names cannot have spaces. Use CamelCase, name_underscore, or name.period instead.
Avoid creating an object with the same name as a function (e.g. c and t) or special value (NA, NULL, TRUE, FALSE).
Use descriptive object names! Avoid names like obj1, obj2.
Each object name must be unique in an environment.
# Find objects in your workspace
ls()
## [1] "absent" "divided" "number" "searches" "words"
Or the Environment tab in RStudio
As with natural language writing, it is a good idea to stick to one style guide with your R code.
A vector is an ordered collection of numbers, characters, etc. of the same type.
Vectors can be created with the c (combine) function.
# Create numeric vector
numeric_vector <- c(1, 2, 3)
# Create character vector
character_vector <- c('Albania', 'Botswana', 'Cambodia')
Categorical variables are called factors in R.
# Create numeric vector
fruits <- c(1, 1, 2)
# Create character vector for factor labels
fruit_names <- c('apples', 'mangos')
# Convert to labelled factor
fruits_factor <- factor(fruits, labels = fruit_names)
summary(fruits_factor)
## apples mangos
##      2      1
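A factor stores integer codes plus a set of labels (levels). A brief sketch that looks at both parts of the factor created above:

```r
# Recreate the factor from the example above
fruits <- c(1, 1, 2)
fruits_factor <- factor(fruits, labels = c('apples', 'mangos'))

levels(fruits_factor)       # the labels: "apples" "mangos"
as.integer(fruits_factor)   # the underlying codes: 1 1 2
table(fruits_factor)        # counts per level, like summary()
```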
Matrices are collections of vectors with the same length and class.
# Combine numeric_vector and character_vector into a matrix
combined <- cbind(numeric_vector, character_vector)
combined
##      numeric_vector character_vector
## [1,] "1"            "Albania"
## [2,] "2"            "Botswana"
## [3,] "3"            "Cambodia"
Note (1): R coerced numeric_vector into a character vector.
Note (2): You can rbind new rows onto a matrix.
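Both notes can be checked directly. This is a sketch only: the extra row ('Denmark') is a made-up value for illustration.

```r
numeric_vector <- c(1, 2, 3)
character_vector <- c('Albania', 'Botswana', 'Cambodia')
combined <- cbind(numeric_vector, character_vector)

# Every cell of the matrix is now a character string
class(combined[, 'numeric_vector'])   # "character"

# Add a fourth row; the values follow the column order
combined <- rbind(combined, c(4, 'Denmark'))
dim(combined)                         # 4 2

# Convert a column back to numbers when you need to calculate with it
as.numeric(combined[, 'numeric_vector'])   # 1 2 3 4
```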
Data frames are collections of vectors with the same length.
Each column (vector) can be of a different class.
# Combine numeric_vector and character_vector into a data frame
combined_df <- data.frame(numeric_vector, character_vector,
                          stringsAsFactors = FALSE)
combined_df
##   numeric_vector character_vector
## 1              1          Albania
## 2              2         Botswana
## 3              3         Cambodia
A list is an object containing other objects that can have different lengths and classes.
# Create a list with three objects of different lengths
test_list <- list(countries = character_vector,
                  not_there = c(NA, NA),
                  more_numbers = 1:10)
test_list
## $countries
## [1] "Albania"  "Botswana" "Cambodia"
##
## $not_there
## [1] NA NA
##
## $more_numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
Functions do things to/with objects. Functions are like R's verbs.
A function name is always followed by parentheses () when you call it. The parentheses contain the arguments. Arguments are separated by commas.
# Summarise combined_df
summary(combined_df, digits = 2)
##  numeric_vector character_vector
##  Min.   :1.0    Length:3
##  1st Qu.:1.5    Class :character
##  Median :2.0    Mode  :character
##  Mean   :2.0
##  3rd Qu.:2.5
##  Max.   :3.0
Use ? to find out what arguments a function can take.
?summary
The help page will also show the function's default argument values.
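As an illustration of default argument values, here is a small sketch with round (chosen only because its default is easy to see; it is not used elsewhere at this point in the lecture):

```r
# Show the arguments of round(); digits has the default value 0
args(round)

round(3.14159)               # digits defaults to 0, giving 3
round(3.14159, digits = 2)   # override the default, giving 3.14

# Arguments can be matched by position instead of by name
round(3.14159, 2)            # also 3.14
```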
The $ is known as the component selector. It selects a component of an object.
combined_df$character_vector
## [1] "Albania" "Botswana" "Cambodia"
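The component selector also works on lists. A sketch using the test_list object from earlier (recreated here so the example runs on its own). The double-bracket form is an equivalent that the lecture does not cover.

```r
test_list <- list(countries = c('Albania', 'Botswana', 'Cambodia'),
                  not_there = c(NA, NA),
                  more_numbers = 1:10)

test_list$countries           # select a component by name
test_list[['more_numbers']]   # the same idea using double brackets
test_list$more_numbers[2:4]   # then subset the selected vector: 2 3 4
names(test_list)              # list the available components
```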
You can also use subscripts [] to select components.
For data frames they have a [row, column] pattern.
# Select the second row and first column of combined_df
combined_df[2, 1]
## [1] 2
# Select the first two rows
combined_df[c(1, 2), ]
##   numeric_vector character_vector
## 1              1          Albania
## 2              2         Botswana
# Select the character_vector column
combined_df[, 'character_vector']
## [1] "Albania"  "Botswana" "Cambodia"
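Subscripts also accept logical conditions, which is a common way to filter rows. A minimal sketch, with combined_df recreated so the example is self-contained:

```r
combined_df <- data.frame(numeric_vector = c(1, 2, 3),
                          character_vector = c('Albania', 'Botswana', 'Cambodia'),
                          stringsAsFactors = FALSE)

# Keep only the rows where numeric_vector is greater than 1
combined_df[combined_df$numeric_vector > 1, ]

# Select by row condition and column name at the same time
big_countries <- combined_df[combined_df$numeric_vector > 1, 'character_vector']
big_countries   # "Botswana" "Cambodia"
```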
You can use assignment with parts of objects. For example:
combined_df$character_vector[3] <- 'China'
combined_df$character_vector
## [1] "Albania" "Botswana" "China"
You can even add new variables:
combined_df$new_var <- 1:3
combined_df
##   numeric_vector character_vector new_var
## 1              1          Albania       1
## 2              2         Botswana       2
## 3              3            China       3
You can greatly expand the number of functions by installing and loading user-created packages.
# Install dplyr package
install.packages('dplyr')
# Load dplyr package
library(dplyr)
You can also call a function directly from a specific package with the double colon operator (::).
Grouped <- dplyr::group_by(combined_df, character_vector)
List internal data sets:
data()
Load swiss data set:
data(swiss)
Find data description:
?swiss
Find variable names:
names(swiss)
## [1] "Fertility"        "Agriculture"      "Examination"
## [4] "Education"        "Catholic"         "Infant.Mortality"
See the first three rows and four columns:
head(swiss[1:3, 1:4])
##              Fertility Agriculture Examination Education
## Courtelary        80.2        17.0          15        12
## Delemont          83.1        45.1           6         9
## Franches-Mnt      92.5        39.7           5         5
Pipe: pass a value forward to a function call.
Why?
Fewer intermediate objects to create and name.
Enhanced code readability.
In R use %>% from the magrittr package (loading dplyr also makes it available).
%>% passes a value to the first argument of the next function call.
Not piped:
values <- rnorm(1000, mean = 10)
value_mean <- mean(values)
round(value_mean, digits = 2)
## [1] 9.98
Piped:
library(dplyr)
rnorm(1000, mean = 10) %>%
    mean() %>%
    round(digits = 2)
## [1] 10.04
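The two results above differ only because each call to rnorm draws new random numbers. A sketch that fixes the random seed (the value 100 is arbitrary) to show that the nested and piped versions are equivalent. It assumes the magrittr package is installed.

```r
library(magrittr)   # provides %>%; loading dplyr makes it available too

set.seed(100)       # arbitrary seed so both versions draw the same numbers
nested <- round(mean(rnorm(1000, mean = 10)), digits = 2)

set.seed(100)
piped <- rnorm(1000, mean = 10) %>%
    mean() %>%
    round(digits = 2)

identical(nested, piped)   # TRUE: the pipe only changes how the code reads
```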
You can create a function to find the sample mean (\(\bar{x} = \frac{\sum x}{n}\)) of a vector.
fun_mean <- function(x){
    sum(x) / length(x)
}
# Find the mean
fun_mean(x = swiss$Examination)
## [1] 16.48936
Functions:
Simplify your code if you do repeated tasks.
Lead to fewer mistakes.
Are easier to understand.
Save time over the long run: a general solution to problems in different contexts.
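To illustrate the "general solution" point, here is a hypothetical extension of fun_mean (the name fun_mean_na and the na.rm argument are assumptions for this sketch, modelled on base R's mean) that copes with missing values:

```r
# Hypothetical extension of fun_mean with an optional na.rm argument
fun_mean_na <- function(x, na.rm = FALSE) {
    if (na.rm) {
        x <- x[!is.na(x)]   # drop missing values before averaging
    }
    sum(x) / length(x)
}

fun_mean_na(c(1, 2, NA))                 # NA
fun_mean_na(c(1, 2, NA), na.rm = TRUE)   # 1.5
fun_mean_na(swiss$Examination)           # same result as mean()
```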
Descriptive Statistics: describe samples
Stats 101: describe sample distributions with appropriate measures of:
central tendency
variability
hist(swiss$Examination)
hist(swiss$Examination, main = 'Swiss Canton Draftee Examination Scores (1888)', xlab = '% receiving highest mark on army exam')
(or use the mean function in base R)
mean(swiss$Examination)
## [1] 16.48936
If you have missing values (NA):
mean(swiss$Examination, na.rm = TRUE)
You can 'loop' through the data set to find the mean for each column
for (i in seq_along(names(swiss))) {
    swiss[, i] %>%
        mean() %>%
        round(digits = 1) %>%
        paste(names(swiss)[i], ., '\n') %>% # the . directs the pipe
        cat()
}
## Fertility 70.1
## Agriculture 50.7
## Examination 16.5
## Education 11
## Catholic 41.1
## Infant.Mortality 19.9
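For this particular task base R also offers vectorised alternatives to an explicit loop. A short sketch (not part of the original lecture code):

```r
# Column means for every variable in one call
column_means <- round(colMeans(swiss), digits = 1)
column_means

# sapply() applies any function to each column, not only mean()
round(sapply(swiss, median), digits = 1)
```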
Median
median(swiss$Examination)
## [1] 16
Mode
mode is not an R function to find the statistical mode (it reports how an object is stored).
Instead use summary for nominal (factor) variables or make a bar chart.
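A minimal sketch of finding the statistical mode from category counts. The regions factor is made-up data for illustration.

```r
# Hypothetical nominal variable
regions <- factor(c('Africa', 'Asia', 'Asia', 'Europe', 'Asia', 'Africa'))

summary(regions)                  # counts per category
region_counts <- table(regions)

# The statistical mode is the category (or categories) with the highest count
stat_mode <- names(region_counts)[region_counts == max(region_counts)]
stat_mode                         # "Asia"

# mode() reports how the object is stored, not the most common value
mode(regions)                     # "numeric"
```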
devtools::source_url('http://bit.ly/OTWEGS')
plot(MortalityGDP$region, xlab = 'Region')
Variation is "perhaps the most important quantity in statistical analysis. The greater the variability in the data, the greater will be our uncertainty in the values of the parameters estimated . . . and the lower our ability to distinguish between competing hypotheses" (Crawley 2005, 33)
Range:
range(swiss$Examination)
## [1] 3 37
Quartiles:
summary(swiss$Examination)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    3.00   12.00   16.00   16.49   22.00   37.00
Interquartile Range (\(IQR = Q_{3} - Q_{1}\)):
IQR(swiss$Examination)
## [1] 10
Boxplots:
boxplot(swiss$Examination, main = '% of Draftees with Highest Mark')
Sum of squares (summing the squared deviations from the mean):
\[\mathrm{Sum\:of\:Squares} = \sum(x - \bar{x})^2\]
But sum of squares always gets bigger with a larger sample size.
Degrees of freedom (number of values that are free to vary):
For the mean:
\[\mathrm{df} = n - 1\]
Why?
For a given mean and sample size, \(n - 1\) values can vary, but the \(n\)th value must always be the same. See Crawley (2005, 36-37).
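A small numerical sketch of this idea, using made-up values: once the mean and n - 1 of the values are known, the last value is fixed.

```r
# Hypothetical sample of n = 4 values with a known mean of 5
n <- 4
known_mean <- 5

# Three values can be anything we like
free_values <- c(2, 7, 3)

# The fourth is then fixed, because all four must sum to n * mean
last_value <- n * known_mean - sum(free_values)
last_value                         # 8

mean(c(free_values, last_value))   # 5
```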
We can use degrees of freedom to create an "unbiased" measure of variability that is not dependent on the sample size.
Variance (\(s^2\)):
\[ s^2 = \frac{\mathrm{Sum\:of\:Squares}}{\mathrm{Degrees\:of\:Freedom}} = \frac{\sum(x - \bar{x})^2}{n - 1} \]
But this is not in the same units as the mean, so it can be confusing to interpret.
Use standard deviation (\(s\)) to put the measure of variability back into the same units as the mean:
\[s = \sqrt{s^2}\]
The standard error of the mean:
If we think of the variation around a central tendency as a measure of the unreliability of an estimate (mean) in a population, then we want the measure to decrease as the sample size goes up.
\[ \mathrm{SE}_{\bar{x}} = \sqrt{\frac{s^{2}}{n}} \]
Note: we take the square root (\(\sqrt{}\)) so that the dimensions of the measure of unreliability and the parameter whose variability is being measured are the same.
Good overview of variance, degrees of freedom, and standard errors in Crawley (2005, Ch. 4).
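The formulas above can be worked through step by step in R. This is a sketch using the Examination variable from the swiss data; the object names are chosen only for this example.

```r
x <- swiss$Examination
n <- length(x)

sum_of_squares <- sum((x - mean(x))^2)
variance <- sum_of_squares / (n - 1)   # s^2
std_dev <- sqrt(variance)              # s
std_error <- sqrt(variance / n)        # SE of the mean

# Compare with the built-in functions
all.equal(variance, var(x))            # TRUE
all.equal(std_dev, sd(x))              # TRUE
c(variance = variance, sd = std_dev, se = std_error)
```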
Variance:
var(swiss$Examination)
## [1] 63.64662
Standard Deviation:
sd(swiss$Examination)
## [1] 7.977883
Standard Error:
sd_error <- function(x) {
    sd(x) / sqrt(length(x))
}
sd_error(swiss$Examination)
## [1] 1.163694
Simulated normally distributed data with SD of 30 and mean 50
Normal30 <- rnorm(1e+6, mean = 50, sd = 30)
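A quick sketch for checking the simulated data against what the normal distribution predicts. The seed value is arbitrary and only makes the draw repeatable; results are approximate because the data are random.

```r
set.seed(2016)   # arbitrary seed, only so the draw can be repeated
Normal30 <- rnorm(1e+6, mean = 50, sd = 30)

mean(Normal30)   # close to 50
sd(Normal30)     # close to 30

# Share of values within one and two standard deviations of the mean
within_one_sd <- mean(abs(Normal30 - 50) <= 30)   # roughly 0.68
within_two_sd <- mean(abs(Normal30 - 50) <= 60)   # roughly 0.95
c(within_one_sd, within_two_sd)
```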
Highly skewed data can often be transformed to be closer to a normal distribution.
Helps correct two violations of key assumptions: (a) non-linearity and (b) heteroskedasticity.
hist(swiss$Education, main = '')
log(swiss$Education) %>% hist(main = "Swiss Education")
The natural log transformation is only defined for positive values, so it cannot be applied directly to data that contain zeros (or negative numbers).
See http://robjhyndman.com/hyndsight/transformations/ for suggestions on other transformations such as Box-Cox and Inverse Hyperbolic Sine.
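A brief sketch of the problem with zeros, using made-up values, and of the inverse hyperbolic sine (asinh in base R) as one of the alternatives mentioned above:

```r
# Hypothetical skewed values, including a zero
x <- c(0, 1, 10, 100)

log(x)     # the first element is -Inf, which breaks later calculations
asinh(x)   # inverse hyperbolic sine: defined at zero, where it equals 0
```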
plot(log(swiss$Education), swiss$Examination)
cor.test(log(swiss$Education), swiss$Examination)
##
##  Pearson's product-moment correlation
##
## data:  log(swiss$Education) and swiss$Examination
## t = 6.4313, df = 45, p-value = 7.133e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5053087 0.8168779
## sample estimates:
##       cor
## 0.6920531
# Load ggplot2 so that aes(), geom_point(), etc. are available
library(ggplot2)
ggplot(swiss, aes(log(Education), Examination)) +
    geom_point() +
    geom_smooth() +
    theme_bw()
Always close!
In R this means closing:
()
[]
{}
''
""
There are usually many ways to achieve the same goal, but ...
make your code as simple as possible.
Easier to read.
Easier to write (ultimately).
Easier to find mistakes.
Often computationally more efficient.
One way to do this is to define things once, e.g. use variables to contain values and custom functions to contain multiple sequential function calls.
Bad
mean(rnorm(1000))
## [1] -0.003707504
sd(rnorm(1000))
## [1] 0.9980258
Good
rand_sample <- rnorm(1000)
mean(rand_sample)
## [1] 0.0003898685
sd(rand_sample)
## [1] 0.9768
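The same idea can be taken one step further with a custom function that holds several sequential calls. The function describe_sample is a hypothetical helper written for this sketch.

```r
# Hypothetical helper: one place that defines how a sample is described
describe_sample <- function(x) {
    c(mean = mean(x), sd = sd(x), n = length(x))
}

rand_sample <- rnorm(1000)
describe_sample(rand_sample)          # mean and sd of the same sample
describe_sample(swiss$Examination)    # reuse on any numeric vector
```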
Access R data sets
Explore the data and find ways to numerically/graphically describe it.
Find and use R functions that were not covered in the lecture for exploring and transforming your data.
Create your own function (what it does is open to you).
Save your work as a .R source file. You will need it for the next seminar.