Less Volume, More Creativity – Introducing R to Beginners

Randall Pruim and Nicholas J Horton

eCOTS – 17 May 2016

How to play along from home

  • You can use your own R, but make sure mosaic is installed from CRAN (or github)
# install from CRAN
install.packages("mosaic")  
require(mosaic)

Outline

  • Introduction to Less Volume, More Creativity Approach
  • The most important template
  • A few other things
  • Simulation-based inference

Introduction

The mosaic team

Danny
Danny Kaplan
Macalester
Nick
Nick Horton
Amherst
Randy
Randy Pruim
Calvin
NSF

Less Volume, More Creativity

Aim for an R toolkit that is

  • small: fewer commands/templates is better
  • coherent: commands should be as similar as possible
  • powerful: can do what needs doing


Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.

— Antoine de Saint-Exupery (writer, poet, pioneering aviator)

The mosaic package

The mosaic package facilitates a Less Volume, More Creativity approach.

  • available via CRAN and github (ProjectMOSAIC/mosaic)
  • builds on things in lattice and other core R packages

Main parts of the package

  • expanded support for the “formula template” (main topic today)
  • simplified graphics
  • support for simulation-based inference with do()
  • some odds and ends (mostly driven by first and second course)

The Minimal R Exercise

List every R command used throughout a course

Organize by syntactic similarity and by purpose

Scratch everything you could have done without

Replace dissimilar tools with more similar tools


Aim for a set of commands that is

  • small: fewer is better
  • coherent: commands should be as similar as possible
  • powerful: can do what needs doing


Result: Minimal R for Intro Stats

A few R details

R is case sensitive

  • many students are not case sensitive

Arrows and Tab

  • up/down arrows scroll through history
  • TAB completion can simplify typing

If all else fails, try ESC

  • If you see a + prompt, it means R is waiting for more input
  • If this is unintentional, you probably have a typo
  • ESC will get you back to the command prompt

How to play along from home

  • You can use your own R, but make sure mosaic is installed from CRAN (or github)
# install from CRAN
install.packages("mosaic")  
require(mosaic)

The most important template

The Most Important Template

 

goal ( y ~ x , data = mydata )

 

The Most Important Template

 

goal (  y  ~  x  , data = mydata )

The Most Important Template

 

goal (  y  ~  x  , data = mydata , …)

 

Other versions:

# simpler version
goal( ~ x, data = mydata )          
# fancier version
goal( y ~ x | z , data = mydata )   
# unified version
goal( formula , data = mydata )     

2 Questions

 

goal (  y  ~  x  , data = mydata )

 

What do you want R to do? (goal)

 

What must R know to do that?

2 Questions

 

goal (  y  ~  x  , data = mydata )

 

What do you want R to do? (goal)

  • This determines the function to use

What must R know to do that?

  • This determines the inputs to the function
  • Must identify the variables and data frame

How do we make this plot?

What is the Goal?

What does R need to know?

How do we make this plot?

What is the Goal?

  • a scatter plot

What does R need to know?

  • which variable goes where
  • which data set

How do we tell R to make this plot?

What is the Goal?

  • a scatter plot (xyplot())

What does R need to know?

  • which variable goes where (births ~ date)
  • which data set (data = Births78)
    • use ?Births78 for documentation

How do we make this plot?

 

goal (  y  ~  x  , data = mydata )

 

xyplot(births ~ date, data = Births78) 

What changes for this plot?

Two Questions?

What changes for this plot?

  1. Command: bwplot()

  2. The data: HELPrct
  • Variables: age, substance
  • use ?HELPrct for info about data

Comparing our first two plots

bwplot( age ~ substance, data = HELPrct)

xyplot( births ~ date, data = Births78)

How about this one?

  1. Command: bwplot()

  2. The data: HELPrct
  • Variables: age, substance
  • use ?HELPrct for info about data

Comparing two boxplots

bwplot(substance ~ age, data = HELPrct)

bwplot(age ~ substance, data = HELPrct)

Graphical Summaries: One Variable

histogram( ~ age, data = HELPrct) 

Note: When there is one variable it is on the right side of the formula.

Graphical Summaries: Overview

One Variable

  histogram( ~age, data = HELPrct ) 
densityplot( ~age, data = HELPrct ) 
     bwplot( ~age, data = HELPrct ) 
     qqmath( ~age, data = HELPrct ) 
freqpolygon( ~age, data = HELPrct ) 
   bargraph( ~sex, data = HELPrct )

Two Variables

xyplot(  i1 ~ age,       data = HELPrct ) 
bwplot( age ~ substance, data = HELPrct ) 
bwplot( substance ~ age, data = HELPrct ) 
  • i1 average number of drinks (standard units) consumed per day in the past 30 days (measured at baseline)

The Graphics Template

One variable

plotname ( ~  x  , data = mydata , …)

  • histogram(), qqmath(), densityplot(), freqpolygon(), bargraph()

 

Two Variables

plotname (  y  ~  x  , data = mydata , …)

  • xyplot(), bwplot()

groups and panels

  • Add groups =group to overlay.
  • Use y ~ x | z to create multipanel plots.
densityplot( ~ age | sex, data = HELPrct,  
               groups = substance,  auto.key = TRUE)   

Bells & Whistles

  • titles
  • axis labels
  • colors
  • sizes
  • transparency
  • etc, etc.

My approach:

  • Let the students ask, or
  • let the data analysis drive
  • mplot()

Bells and Whistles

xyplot(births ~ date, data = Births78, groups = wday,
  type = 'l',
  auto.key = list(columns = 4, lines = TRUE, points = FALSE),
  par.settings = list(superpose.line = list( lty = 1 ) ))

Numerical Summaries: One Variable

Big idea: Replace plot name with summary name

  • Nothing else changes
histogram( ~ age, data = HELPrct )
     mean( ~ age, data = HELPrct )
## [1] 35.65

Other Summaries

The mosaic package includes formula aware versions of mean(), sd(), var(), min(), max(), sum(), IQR(), …

Also provides favstats() to compute our favorites.

favstats( ~ age, data = HELPrct )
##  min Q1 median Q3 max  mean   sd   n missing
##   19 30     35 40  60 35.65 7.71 453       0

Tallying

tally( ~ sex, data = HELPrct)
## 
## female   male 
##    107    346
tally( ~ substance, data = HELPrct)
## 
## alcohol cocaine  heroin 
##     177     152     124

Numerical Summaries: Two Variables

Three ways to create a plot with two variables. All three can be used to get corresponding numerical summaries.

     bwplot(age ~ substance, data = HELPrct)
         sd(age ~ substance, data = HELPrct)
         
  histogram( ~ age | substance, data = HELPrct)
         sd( ~ age | substance, data = HELPrct)
         
densityplot( ~ age , groups = substance, data = HELPrct)
         sd( ~ age , groups = substance, data = HELPrct)
## alcohol cocaine  heroin 
##   7.652   6.693   7.986

Numerical Summaries: Tables

tally( sex ~ substance, data = HELPrct )
##         substance
## sex      alcohol cocaine heroin
##   female      36      41     30
##   male       141     111     94
tally( sex ~ substance, data = HELPrct, 
       format = "prop", margins = TRUE )
##         substance
## sex      alcohol cocaine heroin
##   female  0.2034  0.2697 0.2419
##   male    0.7966  0.7303 0.7581
##   Total   1.0000  1.0000 1.0000

Numerical Summaries

mean( age ~ homeless | sex, data = HELPrct )
## Yes.F  No.F Yes.M  No.M     F     M 
## 35.95 36.43 36.47 34.51 36.25 35.47
mean( age ~ homeless | sex, data = HELPrct, .format = "table" )
##   homeless sex  mean
## 1      Yes   F 35.95
## 2      Yes   M 36.47
## 3       No   F 36.43
## 4       No   M 34.51
  • Also works for median(), sd(), favstats(), …
  • I’ve abbreviated the names to make things fit on slide

One Template to Rule a Lot

  • single and multiple variable graphical summaries
  • single and multiple variable numerical summaries
  • hypothesis tests and confidence intervals
  • linear models and ANOVA

Example Workflow: How does age differ by sex?

  bwplot(age ~ sex, data = HELPrct) 
    mean(age ~ sex, data = HELPrct)
      sd(age ~ sex, data = HELPrct)
favstats(age ~ sex, data = HELPrct)
  t.test(age ~ sex, data = HELPrct)
      lm(age ~ sex, data = HELPrct)
  • Note: We can do all of this without any square brackets or dollar signs.

But wait, there’s more

Some other things

The mosaic package includes some other things, too

  • Data sets (you’ve already seen some of them)
  • Support for SBI via do()
  • xtras: xchisq.test(), xpnorm(), xqqmath()
  • mplot()
    • mplot(HELPrct) interactive plot creation
    • replacements for plot() in some situations
  • simplified plotting
    • additional histogram() controls (e.g., width)
    • simplified ways to add onto lattice plots
    • simple plots of models and functions

xpnorm()

xpnorm(700, mean = 500, sd = 100)
## 
## If X ~ N(500, 100), then 
## 
##  P(X <= 700) = P(Z <= 2) = 0.9772
##  P(X >  700) = P(Z >  2) = 0.02275

## [1] 0.9772

xpnorm()

xpnorm( c(300, 700), mean = 500, sd = 100)
## 
## If X ~ N(500, 100), then 
## 
##  P(X <= 300) = P(Z <= -2) = 0.02275
##      P(X <= 700) = P(Z <=  2) = 0.97725
##  P(X >  300) = P(Z >  -2) = 0.97725
##      P(X >  700) = P(Z >   2) = 0.02275

## [1] 0.02275 0.97725

xchisq.test()

xchisq.test(phs)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  x
## X-squared = 24, df = 1, p-value = 8e-07
## 
##    104.00   10933.00 
## (  146.52) (10890.48)
## [12.05]  [ 0.16] 
## <-3.51>  < 0.41> 
##    
##    189.00   10845.00 
## (  146.48) (10887.52)
## [12.05]  [ 0.16] 
## < 3.51>  <-0.41> 
##    
## key:
##  observed
##  (expected)
##  [contribution to X-squared]
##  <Pearson residual>

Modeling – makeFun() and plotFun()

Modeling is really the starting point for the mosaic design.

  • linear models (lm() and glm()) defined the template
  • lattice graphics use the template (so we chose lattice)
  • we added functionality so numerical summaries can be done with the same template
  • additional things added to make modeling easier for beginners.

Models as Functions

model <- lm(width ~ length * sex, data = KidsFeet)
Width <- makeFun(model)
Width(length = 25, sex = "B")
##     1 
## 9.168
Width(length = 25, sex = "G")
##     1 
## 8.939

Models as Functions – Plotting

xyplot( width ~ length, data = KidsFeet, 
        groups = sex, auto.key = TRUE )
plotFun( Width(length, sex = "B") ~ length, 
         col = 1, add = TRUE )
plotFun( Width(length, sex = "G") ~ length, 
         col = 2, add = TRUE )

Autoplotting (simple) models

plotModel(model)

Simulation Based Inference

SBI – You can do() it!

do(r) * mean(~ age,       data = resample(HELPrct))
do(r) * prop(~ homeless,  data = resample(HELPrct))

do(r) * diffmean(age ~ sex, 
                 data = resample(HELPrct))
do(r) * diffmean(age ~ sex, 
                 data = resample(HELPrct, groups = sex))

do(r) * diffprop(homeless ~ sex, 
                 data = resample(HELPrct))
do(r) * diffprop(homeless ~ sex, 
                 data = resample(HELPrct, groups = sex))
# residual resampling
Kids.lm <- lm(length ~ width * sex, data = KidsFeet)
do(r) * lm(length ~ width * sex, data = resample(Kids.lm))

Lady Tasting Tea

  • Often used on first day of class

  • Story

  • woman claims she can tell whether milk has been poured into tea or vice versa.

  • Question: How do we test this claim?

Computer Simulation

Use rflip() to simulate flipping coins

rflip()
## 
## Flipping 1 coin [ Prob(Heads) = 0.5 ] ...
## 
## H
## 
## Number of Heads: 1 [Proportion Heads: 1]

Computer Simulation

Faster if we flip multiple coins at once:

rflip(10)
## 
## Flipping 10 coins [ Prob(Heads) = 0.5 ] ...
## 
## H H H T T T H H H T
## 
## Number of Heads: 6 [Proportion Heads: 0.6]
  • easier to consider heads = correct; tails = incorrect than to compare with a given pattern
  • this switch bothers me more than it bothers my students

Let’s do that a lot of times

rflip(10) simulates 1 guessing lady tasting 10 cups.

We can do that many times to see how guessing ladies do:

do(2) * rflip(10)
##    n heads tails prop
## 1 10     6     4  0.6
## 2 10     5     5  0.5
  • do() is clever about what it remembers
  • 2 isn’t many – we’ll do many next.

The distribution of guessing ladies

Ladies <- do(5000) * rflip(10)
head(Ladies, 1)
##    n heads tails prop
## 1 10     4     6  0.4
histogram( ~ heads, data = Ladies, width = 1 )

How often does guessing score 9 or 10?

tally( ~(heads >= 9) , data = Ladies)
## 
##  TRUE FALSE 
##    52  4948

How often does guessing score 9 or 10?

tally( ~(heads >= 9) , data = Ladies)
## 
##  TRUE FALSE 
##    52  4948
tally( ~(heads >= 9) , data = Ladies, format = "prop")
## 
##   TRUE  FALSE 
## 0.0104 0.9896
 prop( ~(heads >= 9), data = Ladies)
##   TRUE 
## 0.0104

A general approach to randomization

  1. Do it for your data
  2. Do it for “random” data
  3. Do it lots of times for “random” data
  • definition of “random” is important, but can often be handled by shuffle() or resample()

Example: Do mean ages differ by sex?

diffmean(age ~ sex, data = HELPrct)
## diffmean 
##  -0.7841
do(2) * diffmean(age ~ shuffle(sex), data = HELPrct)
##   diffmean
## 1   0.1091
## 2   0.3171
Null <- 
  do(5000) * diffmean(age ~ shuffle(sex), data = HELPrct)

Example: Do mean ages differ by sex?

prop( ~(abs(diffmean) > 0.7841), data = Null )
##   TRUE 
## 0.3616
histogram(~ diffmean, data = Null, v = -.7841) 

Example: Bootstrap

Bootstrap <- do(5000) * diffmean(age~sex, data= resample(HELPrct))

       sd( ~diffmean, data = Bootstrap)
## [1] 0.8463
histogram( ~diffmean, data = Bootstrap, v = -.7841, glwd = 4 )

Example: CI for difference in mean ages

cdata(~diffmean, data = Bootstrap, p = .95)
##       low        hi central.p 
##   -2.4787    0.8895    0.9500
confint(Bootstrap, method = c("quantile","se"))
##       name  lower  upper level     method estimate margin.of.error  df
## 1 diffmean -2.479 0.8895  0.95 percentile  -0.7841              NA  NA
## 2 diffmean -2.433 0.8931  0.95     stderr  -0.7841           1.663 452

Randomization and linear models

do(1) * lm(width ~ length, data = KidsFeet)
## Source: local data frame [1 x 9]
## 
##   Intercept length  sigma r.squared     F numdf dendf  .row .index
##       (dbl)  (dbl)  (dbl)     (dbl) (dbl) (dbl) (dbl) (int)  (dbl)
## 1     2.862 0.2479 0.3963     0.411 25.82     1    37     1      1
do(3) * lm( width ~ shuffle(length), data = KidsFeet)
## Source: local data frame [3 x 9]
## 
##   Intercept    length  sigma r.squared        F numdf dendf  .row .index
##       (dbl)     (dbl)  (dbl)     (dbl)    (dbl) (dbl) (dbl) (int)  (dbl)
## 1     9.136 -0.005807 0.5164 0.0002254 0.008343     1    37     1      1
## 2     9.762 -0.031122 0.5147 0.0064752 0.241144     1    37     1      2
## 3    11.639 -0.107066 0.4962 0.0766357 3.070856     1    37     1      3

Randomization and linear models

do(1) * lm(width ~ length + sex, data = KidsFeet)
## Source: local data frame [1 x 10]
## 
##   Intercept length    sexG  sigma r.squared     F numdf dendf  .row .index
##       (dbl)  (dbl)   (dbl)  (dbl)     (dbl) (dbl) (dbl) (dbl) (int)  (dbl)
## 1     3.641  0.221 -0.2325 0.3849    0.4595 15.31     2    36     1      1
do(3) * lm( width ~ length + shuffle(sex), data = KidsFeet)
## Source: local data frame [3 x 10]
## 
##   Intercept length    sexG  sigma r.squared     F numdf dendf  .row .index
##       (dbl)  (dbl)   (dbl)  (dbl)     (dbl) (dbl) (dbl) (dbl) (int)  (dbl)
## 1     2.524 0.2546  0.3553 0.3569    0.5353 20.74     2    36     1      1
## 2     2.835 0.2458  0.1677 0.3922    0.4388 14.07     2    36     1      2
## 3     3.078 0.2432 -0.2032 0.3877    0.4516 14.82     2    36     1      3

Randomization and linear models

Null <- 
  do(5000) * lm( width ~ length + shuffle(sex), data = KidsFeet)
histogram( ~ sexG, data = Null, v = -0.2325, glwd = 4)

Randomization and linear models

histogram( ~ sexG, data = Null, v = -0.2325, glwd = 4)

prop(~ (sexG <= -0.2325), data = Null)
##  TRUE 
## 0.037

Example bootstrap distributions

do(r) * mean(~ age,       data = resample(HELPrct))
do(r) * prop(~ homeless,  data = resample(HELPrct))

do(r) * diffmean(age ~ sex, 
                 data = resample(HELPrct))
do(r) * diffmean(age ~ sex, 
                 data = resample(HELPrct, groups = sex))

do(r) * diffprop(homeless ~ sex, 
                 data = resample(HELPrct))
do(r) * diffprop(homeless ~ sex, 
                 data = resample(HELPrct, groups = sex))

Bootstrapping Residuals

Kids.lm <- lm(length ~ width * sex, data = KidsFeet)
head(resample(Kids.lm), 2)
##   length width sex
## 1  24.14   8.4   B
## 2  24.32   8.8   B
do(2) * lm(length ~ width * sex, data = resample(Kids.lm))
## Source: local data frame [2 x 11]
## 
##   Intercept width   sexG width.sexG  sigma r.squared     F numdf dendf  .row .index
##       (dbl) (dbl)  (dbl)      (dbl)  (dbl)     (dbl) (dbl) (dbl) (dbl) (int)  (dbl)
## 1    11.825 1.485 -6.949     0.7384 0.8894    0.5987  17.4     3    35     1      1
## 2     4.018 2.336 11.269    -1.3057 0.8069    0.6292  19.8     3    35     1      2

Transformed Response

For a few common transformation, resample knows how to invert. The tranformation argument can be used to provide the transformation in other cases.

Kids.lm2 <- lm(log(length) ~ width * sex, data = KidsFeet)
head(resample(Kids.lm2), 2)
##   length width sex
## 1  23.40   8.4   B
## 2  25.83   8.8   B
do(2) * lm(log(length) ~ width * sex, data = resample(Kids.lm2))
## Source: local data frame [2 x 11]
## 
##   Intercept   width    sexG width.sexG   sigma r.squared     F numdf dendf  .row .index
##       (dbl)   (dbl)   (dbl)      (dbl)   (dbl)     (dbl) (dbl) (dbl) (dbl) (int)  (dbl)
## 1     2.763 0.04997 -0.2263    0.02484 0.03884    0.4433  9.29     3    35     1      1
## 2     2.355 0.09381  0.5952   -0.06717 0.03969    0.4643 10.11     3    35     1      2

A Little Goes A Long Way

A Little Goes A Long Way

Equipped with a modest subset of R, students can be quite creative:

  • The template goal(y ~ x, data = mydata)
  • A list of goals
    • xyplot(), bwplot(), histogram(), etc.
    • mean(), sd(), tally(), favstats(), etc.
    • lm(), glm()
    • t.test(), binom.test(), etc.
  • do()
  • a few bells as whistles as desired.

Additional R skills can be added as needed later.

If R is the hardest part of your course, then your R is too hard and your questions are too easy.

Additional Resources

mosaic resources

  • Little Books https://github.com/ProjectMOSAIC/LittleBooks
    • Start Teaching with R
    • A Student’s Guide to R
    • Start R in Calculus
  • Resources Vignette includes links to companion volumes to several text books.
    • Statistics: Unlocking the Power of Data (Lock5)
    • Introduction to Statistical Investigations (Tintle et al)
    • Statistical Sleuth (Ramsey and Shafer)
    • Introduction to the Practice of Statistics (Moore, McCabe, Craig)
    • Stats: Data and Models (De Veaux, Velleman, Bock)

Elsewhere at eCOTS

Tue @ 2:15 - 5:00 Breakout Sessions: Teaching with R

  • Technology lowering barriers: get started with R at the snap of a finger with Mine Cetinkaya-Rundel and Nicholas Horton

  • Notebooks with R Markdown with JJ Allaire

  • Lowering the barriers to inclusive, collaborative, reproducible analyses with Chester Ismay and Andrew Bray

Wed @ 2:00 - 2:45 Panel: Teaching with R: Free and Extendable with JJ Allaire, Mine Cetinkaya-Rundel, Chester Ismay, and R Pruim

Fri @ 1:30 - 2:00 BOF: Teaching with R and RStudio with R Pruim and N Horton

Elsewhere at eCOTS

Posters

  • R-Shiny Apps to Empower a Simulation-Based Curriculum, Allison Theobold & Jim Robison-Cox, Montana State University

  • Dynamic Data in the Classroom, Jo Hardin, Pomona College

  • Writing about Simulations in a Theoretical Statistics Course, Amy Wagaman, Amherst College

  • Interactive Math Stat Visualizations Using R Shiny, Justin Post, North Carolina State University