Less Volume, More Creativity – Introducing R to Beginners

Randall Pruim and Nicholas J Horton

eCOTS – 17 May 2016

How to play along from home

You can use your own R, but make sure mosaic is installed from CRAN (or github)

# install from CRAN
install.packages("mosaic")

You can use a temporary RStudio server account
- get username and password at http://tinyurl.com/ecots-rstudio
Be sure to attach the mosaic package

require(mosaic)

Slides are available at http://tinyurl.com/mosaic-ecots2016

Outline

Introduction to Less Volume, More Creativity Approach
The most important template
A few other things
Simulation-based inference

Introduction

The mosaic team

Danny Kaplan
Macalester

Nick Horton
Amherst

Randy Pruim
Calvin

Less Volume, More Creativity

Aim for an R toolkit that is

small: fewer commands/templates is better
coherent: commands should be as similar as possible
powerful: can do what needs doing

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.

— Antoine de Saint-Exupery (writer, poet, pioneering aviator)

The mosaic package

The mosaic package facilitates a Less Volume, More Creativity approach.

available via CRAN and github (ProjectMOSAIC/mosaic)
builds on things in lattice and other core R packages

Main parts of the package

expanded support for the “formula template” (main topic today)
simplified graphics
support for simulation-based inference with do()
some odds and ends (mostly driven by first and second course)

The Minimal R Exercise

List every R command used throughout a course

Organize by syntactic similarity and by purpose

Scratch everything you could have done without

Replace dissimilar tools with more similar tools

Aim for a set of commands that is

small: fewer is better
coherent: commands should be as similar as possible
powerful: can do what needs doing

Result: Minimal R for Intro Stats

A few R details

R is case sensitive

many students are not case sensitive

Arrows and Tab

up/down arrows scroll through history
TAB completion can simplify typing

If all else fails, try ESC

If you see a + prompt, it means R is waiting for more input
If this is unintentional, you probably have a typo
ESC will get you back to the command prompt

How to play along from home

You can use your own R, but make sure mosaic is installed from CRAN (or github)

# install from CRAN
install.packages("mosaic")

You can use a temporary RStudio server account
- get username and password at http://tinyurl.com/ecots-rstudio
Be sure to attach the mosaic package

require(mosaic)

Slides are available at http://tinyurl.com/mosaic-ecots2016

The most important template

The Most Important Template

goal ( y ~ x , data = mydata )

The Most Important Template

goal ( y ~ x , data = mydata )

The Most Important Template

goal ( y ~ x , data = mydata , …)

Other versions:

# simpler version
goal( ~ x, data = mydata )          
# fancier version
goal( y ~ x | z , data = mydata )   
# unified version
goal( formula , data = mydata )

2 Questions

goal ( y ~ x , data = mydata )

What do you want R to do? (goal)

What must R know to do that?

2 Questions

goal ( y ~ x , data = mydata )

What do you want R to do? (goal)

This determines the function to use

What must R know to do that?

This determines the inputs to the function
Must identify the variables and data frame

How do we make this plot?

What is the Goal?

What does R need to know?

How do we make this plot?

What is the Goal?

a scatter plot

What does R need to know?

which variable goes where
which data set

How do we tell R to make this plot?

What is the Goal?

a scatter plot (xyplot())

What does R need to know?

which variable goes where (births ~ date)
which data set (data = Births78)
- use ?Births78 for documentation

How do we make this plot?

goal ( y ~ x , data = mydata )

xyplot(births ~ date, data = Births78)

What changes for this plot?

Two Questions?

What changes for this plot?

Command: bwplot()
The data: HELPrct

Variables: age, substance
use ?HELPrct for info about data

Comparing our first two plots

bwplot( age ~ substance, data = HELPrct)

xyplot( births ~ date, data = Births78)

How about this one?

Command: bwplot()
The data: HELPrct

Variables: age, substance
use ?HELPrct for info about data

Comparing two boxplots

bwplot(substance ~ age, data = HELPrct)

bwplot(age ~ substance, data = HELPrct)

Graphical Summaries: One Variable

histogram( ~ age, data = HELPrct)

Note: When there is one variable it is on the right side of the formula.

Graphical Summaries: Overview

One Variable

  histogram( ~age, data = HELPrct ) 
densityplot( ~age, data = HELPrct ) 
     bwplot( ~age, data = HELPrct ) 
     qqmath( ~age, data = HELPrct ) 
freqpolygon( ~age, data = HELPrct ) 
   bargraph( ~sex, data = HELPrct )

Two Variables

xyplot(  i1 ~ age,       data = HELPrct ) 
bwplot( age ~ substance, data = HELPrct ) 
bwplot( substance ~ age, data = HELPrct )

i1 average number of drinks (standard units) consumed per day in the past 30 days (measured at baseline)

The Graphics Template

One variable

plotname ( ~ x , data = mydata , …)

histogram(), qqmath(), densityplot(), freqpolygon(), bargraph()

Two Variables

plotname ( y ~ x , data = mydata , …)

xyplot(), bwplot()

groups and panels

Add groups =group to overlay.
Use y ~ x | z to create multipanel plots.

densityplot( ~ age | sex, data = HELPrct,  
               groups = substance,  auto.key = TRUE)

Bells & Whistles

titles
axis labels
colors
sizes
transparency
etc, etc.

My approach:

Let the students ask, or
let the data analysis drive
mplot()

Bells and Whistles

xyplot(births ~ date, data = Births78, groups = wday,
  type = 'l',
  auto.key = list(columns = 4, lines = TRUE, points = FALSE),
  par.settings = list(superpose.line = list( lty = 1 ) ))

Numerical Summaries: One Variable

Big idea: Replace plot name with summary name

Nothing else changes

histogram( ~ age, data = HELPrct )
     mean( ~ age, data = HELPrct )

## [1] 35.65

Other Summaries

The mosaic package includes formula aware versions of mean(), sd(), var(), min(), max(), sum(), IQR(), …

Also provides favstats() to compute our favorites.

favstats( ~ age, data = HELPrct )

##  min Q1 median Q3 max  mean   sd   n missing
##   19 30     35 40  60 35.65 7.71 453       0

Tallying

tally( ~ sex, data = HELPrct)

## 
## female   male 
##    107    346

tally( ~ substance, data = HELPrct)

## 
## alcohol cocaine  heroin 
##     177     152     124

Numerical Summaries: Two Variables

Three ways to create a plot with two variables. All three can be used to get corresponding numerical summaries.

     bwplot(age ~ substance, data = HELPrct)
         sd(age ~ substance, data = HELPrct)
         
  histogram( ~ age | substance, data = HELPrct)
         sd( ~ age | substance, data = HELPrct)
         
densityplot( ~ age , groups = substance, data = HELPrct)
         sd( ~ age , groups = substance, data = HELPrct)

## alcohol cocaine  heroin 
##   7.652   6.693   7.986

Numerical Summaries: Tables

tally( sex ~ substance, data = HELPrct )

##         substance
## sex      alcohol cocaine heroin
##   female      36      41     30
##   male       141     111     94

tally( sex ~ substance, data = HELPrct, 
       format = "prop", margins = TRUE )

##         substance
## sex      alcohol cocaine heroin
##   female  0.2034  0.2697 0.2419
##   male    0.7966  0.7303 0.7581
##   Total   1.0000  1.0000 1.0000

Numerical Summaries

mean( age ~ homeless | sex, data = HELPrct )

## Yes.F  No.F Yes.M  No.M     F     M 
## 35.95 36.43 36.47 34.51 36.25 35.47

mean( age ~ homeless | sex, data = HELPrct, .format = "table" )

##   homeless sex  mean
## 1      Yes   F 35.95
## 2      Yes   M 36.47
## 3       No   F 36.43
## 4       No   M 34.51

Also works for median(), sd(), favstats(), …
I’ve abbreviated the names to make things fit on slide

One Template to Rule a Lot

single and multiple variable graphical summaries
single and multiple variable numerical summaries
hypothesis tests and confidence intervals
linear models and ANOVA

Example Workflow: How does age differ by sex?

  bwplot(age ~ sex, data = HELPrct) 
    mean(age ~ sex, data = HELPrct)
      sd(age ~ sex, data = HELPrct)
favstats(age ~ sex, data = HELPrct)
  t.test(age ~ sex, data = HELPrct)
      lm(age ~ sex, data = HELPrct)

Note: We can do all of this without any square brackets or dollar signs.

But wait, there’s more

Some other things

The mosaic package includes some other things, too

Data sets (you’ve already seen some of them)
Support for SBI via do()
xtras: xchisq.test(), xpnorm(), xqqmath()
mplot()
- mplot(HELPrct) interactive plot creation
- replacements for plot() in some situations
simplified plotting
- additional histogram() controls (e.g., width)
- simplified ways to add onto lattice plots
- simple plots of models and functions

xpnorm()

xpnorm(700, mean = 500, sd = 100)

## 
## If X ~ N(500, 100), then 
## 
##  P(X <= 700) = P(Z <= 2) = 0.9772
##  P(X >  700) = P(Z >  2) = 0.02275

## [1] 0.9772

xpnorm()

xpnorm( c(300, 700), mean = 500, sd = 100)

## 
## If X ~ N(500, 100), then 
## 
##  P(X <= 300) = P(Z <= -2) = 0.02275
##      P(X <= 700) = P(Z <=  2) = 0.97725
##  P(X >  300) = P(Z >  -2) = 0.97725
##      P(X >  700) = P(Z >   2) = 0.02275

## [1] 0.02275 0.97725

xchisq.test()

xchisq.test(phs)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  x
## X-squared = 24, df = 1, p-value = 8e-07
## 
##    104.00   10933.00 
## (  146.52) (10890.48)
## [12.05]  [ 0.16] 
## <-3.51>  < 0.41> 
##    
##    189.00   10845.00 
## (  146.48) (10887.52)
## [12.05]  [ 0.16] 
## < 3.51>  <-0.41> 
##    
## key:
##  observed
##  (expected)
##  [contribution to X-squared]
##  <Pearson residual>

Modeling – makeFun() and plotFun()

Modeling is really the starting point for the mosaic design.

linear models (lm() and glm()) defined the template
lattice graphics use the template (so we chose lattice)
we added functionality so numerical summaries can be done with the same template
additional things added to make modeling easier for beginners.

Models as Functions

model <- lm(width ~ length * sex, data = KidsFeet)
Width <- makeFun(model)
Width(length = 25, sex = "B")

##     1 
## 9.168

Width(length = 25, sex = "G")

##     1 
## 8.939

Models as Functions – Plotting

xyplot( width ~ length, data = KidsFeet, 
        groups = sex, auto.key = TRUE )
plotFun( Width(length, sex = "B") ~ length, 
         col = 1, add = TRUE )
plotFun( Width(length, sex = "G") ~ length, 
         col = 2, add = TRUE )

Autoplotting (simple) models

plotModel(model)

Simulation Based Inference

SBI – You can do() it!

do(r) * mean(~ age,       data = resample(HELPrct))
do(r) * prop(~ homeless,  data = resample(HELPrct))

do(r) * diffmean(age ~ sex, 
                 data = resample(HELPrct))
do(r) * diffmean(age ~ sex, 
                 data = resample(HELPrct, groups = sex))

do(r) * diffprop(homeless ~ sex, 
                 data = resample(HELPrct))
do(r) * diffprop(homeless ~ sex, 
                 data = resample(HELPrct, groups = sex))

# residual resampling
Kids.lm <- lm(length ~ width * sex, data = KidsFeet)
do(r) * lm(length ~ width * sex, data = resample(Kids.lm))

Lady Tasting Tea

Often used on first day of class
Story
woman claims she can tell whether milk has been poured into tea or vice versa.
Question: How do we test this claim?

Computer Simulation

Use rflip() to simulate flipping coins

rflip()

## 
## Flipping 1 coin [ Prob(Heads) = 0.5 ] ...
## 
## H
## 
## Number of Heads: 1 [Proportion Heads: 1]

Computer Simulation

Faster if we flip multiple coins at once:

rflip(10)

## 
## Flipping 10 coins [ Prob(Heads) = 0.5 ] ...
## 
## H H H T T T H H H T
## 
## Number of Heads: 6 [Proportion Heads: 0.6]

easier to consider heads = correct; tails = incorrect than to compare with a given pattern
this switch bothers me more than it bothers my students

Let’s do that a lot of times

rflip(10) simulates 1 guessing lady tasting 10 cups.

We can do that many times to see how guessing ladies do:

do(2) * rflip(10)

##    n heads tails prop
## 1 10     6     4  0.6
## 2 10     5     5  0.5

do() is clever about what it remembers
2 isn’t many – we’ll do many next.

The distribution of guessing ladies

Ladies <- do(5000) * rflip(10)
head(Ladies, 1)

##    n heads tails prop
## 1 10     4     6  0.4

histogram( ~ heads, data = Ladies, width = 1 )

How often does guessing score 9 or 10?

tally( ~(heads >= 9) , data = Ladies)

## 
##  TRUE FALSE 
##    52  4948

How often does guessing score 9 or 10?

tally( ~(heads >= 9) , data = Ladies)

## 
##  TRUE FALSE 
##    52  4948

tally( ~(heads >= 9) , data = Ladies, format = "prop")

## 
##   TRUE  FALSE 
## 0.0104 0.9896

 prop( ~(heads >= 9), data = Ladies)

##   TRUE 
## 0.0104

A general approach to randomization

Do it for your data
Do it for “random” data
Do it lots of times for “random” data

definition of “random” is important, but can often be handled by shuffle() or resample()

Example: Do mean ages differ by sex?

diffmean(age ~ sex, data = HELPrct)

## diffmean 
##  -0.7841

do(2) * diffmean(age ~ shuffle(sex), data = HELPrct)

##   diffmean
## 1   0.1091
## 2   0.3171

Null <- 
  do(5000) * diffmean(age ~ shuffle(sex), data = HELPrct)

Example: Do mean ages differ by sex?

prop( ~(abs(diffmean) > 0.7841), data = Null )

##   TRUE 
## 0.3616

histogram(~ diffmean, data = Null, v = -.7841)

Example: Bootstrap

Bootstrap <- do(5000) * diffmean(age~sex, data= resample(HELPrct))

       sd( ~diffmean, data = Bootstrap)

## [1] 0.8463

histogram( ~diffmean, data = Bootstrap, v = -.7841, glwd = 4 )

Example: CI for difference in mean ages

cdata(~diffmean, data = Bootstrap, p = .95)

##       low        hi central.p 
##   -2.4787    0.8895    0.9500

confint(Bootstrap, method = c("quantile","se"))

##       name  lower  upper level     method estimate margin.of.error  df
## 1 diffmean -2.479 0.8895  0.95 percentile  -0.7841              NA  NA
## 2 diffmean -2.433 0.8931  0.95     stderr  -0.7841           1.663 452

Randomization and linear models

do(1) * lm(width ~ length, data = KidsFeet)

## Source: local data frame [1 x 9]
## 
##   Intercept length  sigma r.squared     F numdf dendf  .row .index
##       (dbl)  (dbl)  (dbl)     (dbl) (dbl) (dbl) (dbl) (int)  (dbl)
## 1     2.862 0.2479 0.3963     0.411 25.82     1    37     1      1

do(3) * lm( width ~ shuffle(length), data = KidsFeet)

## Source: local data frame [3 x 9]
## 
##   Intercept    length  sigma r.squared        F numdf dendf  .row .index
##       (dbl)     (dbl)  (dbl)     (dbl)    (dbl) (dbl) (dbl) (int)  (dbl)
## 1     9.136 -0.005807 0.5164 0.0002254 0.008343     1    37     1      1
## 2     9.762 -0.031122 0.5147 0.0064752 0.241144     1    37     1      2
## 3    11.639 -0.107066 0.4962 0.0766357 3.070856     1    37     1      3

Randomization and linear models

do(1) * lm(width ~ length + sex, data = KidsFeet)

## Source: local data frame [1 x 10]
## 
##   Intercept length    sexG  sigma r.squared     F numdf dendf  .row .index
##       (dbl)  (dbl)   (dbl)  (dbl)     (dbl) (dbl) (dbl) (dbl) (int)  (dbl)
## 1     3.641  0.221 -0.2325 0.3849    0.4595 15.31     2    36     1      1

do(3) * lm( width ~ length + shuffle(sex), data = KidsFeet)

## Source: local data frame [3 x 10]
## 
##   Intercept length    sexG  sigma r.squared     F numdf dendf  .row .index
##       (dbl)  (dbl)   (dbl)  (dbl)     (dbl) (dbl) (dbl) (dbl) (int)  (dbl)
## 1     2.524 0.2546  0.3553 0.3569    0.5353 20.74     2    36     1      1
## 2     2.835 0.2458  0.1677 0.3922    0.4388 14.07     2    36     1      2
## 3     3.078 0.2432 -0.2032 0.3877    0.4516 14.82     2    36     1      3

Randomization and linear models

Null <- 
  do(5000) * lm( width ~ length + shuffle(sex), data = KidsFeet)
histogram( ~ sexG, data = Null, v = -0.2325, glwd = 4)

Randomization and linear models

histogram( ~ sexG, data = Null, v = -0.2325, glwd = 4)

prop(~ (sexG <= -0.2325), data = Null)

##  TRUE 
## 0.037

Example bootstrap distributions

do(r) * mean(~ age,       data = resample(HELPrct))
do(r) * prop(~ homeless,  data = resample(HELPrct))

do(r) * diffmean(age ~ sex, 
                 data = resample(HELPrct))
do(r) * diffmean(age ~ sex, 
                 data = resample(HELPrct, groups = sex))

do(r) * diffprop(homeless ~ sex, 
                 data = resample(HELPrct))
do(r) * diffprop(homeless ~ sex, 
                 data = resample(HELPrct, groups = sex))

Bootstrapping Residuals

Kids.lm <- lm(length ~ width * sex, data = KidsFeet)
head(resample(Kids.lm), 2)

##   length width sex
## 1  24.14   8.4   B
## 2  24.32   8.8   B

do(2) * lm(length ~ width * sex, data = resample(Kids.lm))

## Source: local data frame [2 x 11]
## 
##   Intercept width   sexG width.sexG  sigma r.squared     F numdf dendf  .row .index
##       (dbl) (dbl)  (dbl)      (dbl)  (dbl)     (dbl) (dbl) (dbl) (dbl) (int)  (dbl)
## 1    11.825 1.485 -6.949     0.7384 0.8894    0.5987  17.4     3    35     1      1
## 2     4.018 2.336 11.269    -1.3057 0.8069    0.6292  19.8     3    35     1      2

Transformed Response

For a few common transformation, resample knows how to invert. The tranformation argument can be used to provide the transformation in other cases.

Kids.lm2 <- lm(log(length) ~ width * sex, data = KidsFeet)
head(resample(Kids.lm2), 2)

##   length width sex
## 1  23.40   8.4   B
## 2  25.83   8.8   B

do(2) * lm(log(length) ~ width * sex, data = resample(Kids.lm2))

## Source: local data frame [2 x 11]
## 
##   Intercept   width    sexG width.sexG   sigma r.squared     F numdf dendf  .row .index
##       (dbl)   (dbl)   (dbl)      (dbl)   (dbl)     (dbl) (dbl) (dbl) (dbl) (int)  (dbl)
## 1     2.763 0.04997 -0.2263    0.02484 0.03884    0.4433  9.29     3    35     1      1
## 2     2.355 0.09381  0.5952   -0.06717 0.03969    0.4643 10.11     3    35     1      2

A Little Goes A Long Way

Equipped with a modest subset of R, students can be quite creative:

The template goal(y ~ x, data = mydata)
A list of goals
- xyplot(), bwplot(), histogram(), etc.
- mean(), sd(), tally(), favstats(), etc.
- lm(), glm()
- t.test(), binom.test(), etc.
do()
a few bells as whistles as desired.

Additional R skills can be added as needed later.

If R is the hardest part of your course, then your R is too hard and your questions are too easy.

Additional Resources

mosaic resources

Little Books https://github.com/ProjectMOSAIC/LittleBooks
- Start Teaching with R
- A Student’s Guide to R
- Start R in Calculus
Resources Vignette includes links to companion volumes to several text books.
- Statistics: Unlocking the Power of Data (Lock5)
- Introduction to Statistical Investigations (Tintle et al)
- Statistical Sleuth (Ramsey and Shafer)
- Introduction to the Practice of Statistics (Moore, McCabe, Craig)
- Stats: Data and Models (De Veaux, Velleman, Bock)

Elsewhere at eCOTS

Tue @ 2:15 - 5:00 Breakout Sessions: Teaching with R

Technology lowering barriers: get started with R at the snap of a finger with Mine Cetinkaya-Rundel and Nicholas Horton
Notebooks with R Markdown with JJ Allaire
Lowering the barriers to inclusive, collaborative, reproducible analyses with Chester Ismay and Andrew Bray

Wed @ 2:00 - 2:45 Panel: Teaching with R: Free and Extendable with JJ Allaire, Mine Cetinkaya-Rundel, Chester Ismay, and R Pruim

Fri @ 1:30 - 2:00 BOF: Teaching with R and RStudio with R Pruim and N Horton

Elsewhere at eCOTS

Posters

R-Shiny Apps to Empower a Simulation-Based Curriculum, Allison Theobold & Jim Robison-Cox, Montana State University
Dynamic Data in the Classroom, Jo Hardin, Pomona College
Writing about Simulations in a Theoretical Statistics Course, Amy Wagaman, Amherst College
Interactive Math Stat Visualizations Using R Shiny, Justin Post, North Carolina State University