Bayesian Regression Models with RStanARM

TJ Mahr
Sept. 21, 2016

Madison R Users Group

Github repository
@tjmahr

Overview

Building mathematical intuitions

Caveat

  • These slides and these examples are meant to illustrate the pieces of Bayes theorem.
  • This is not a rigorous mathematical description of Bayesian probability or regression.

Classifying emails

I got an email with the word “cialis” in it. Is it spam?

\[ P(\mathrm{spam} \mid \mathrm{cialis}) = \frac{ P(\mathrm{cialis} \mid \mathrm{spam}) \, P(\mathrm{spam})}{P(\mathrm{cialis})} \]

The two unconditional probabilities are base rates that need to be accounted for.

\[ P(\mathrm{spam} \mid \mathrm{cialis}) = \frac{\mathrm{cialis\ freq.\ in\ spam} \, * \mathrm{spam\ rate}}{\mathrm{cialis\ freq.}} \]

This example is just counting events. We can think more broadly about the probabilities.

I saw a creature, and it just quacked at me! Was it a duck?

\[ P(\mathrm{duck} \mid \mathrm{quacks}) = \frac{ P(\mathrm{quacks} \mid \mathrm{duck}) \, P(\mathrm{duck})}{P(\mathrm{quacks})} \]

  • If it quacks like a duck…
  • and after taking into account how common ducks are and how often animals quack…
  • then it probably is a duck.

General structure

How plausible is some hypothesis given the data?

\[ P(\mathrm{hypothesis} \mid \mathrm{data}) = \frac{ P(\mathrm{data} \mid \mathrm{hypothesis}) \, P(\mathrm{hypothesis})}{P(\mathrm{data})} \]

Pieces of the equation:

\[ \mathrm{posterior} = \frac{ \mathrm{likelihood} * \mathrm{prior}}{\mathrm{average\ likelihood}} \]

Likelihood measures fit

We found some IQ scores in an old, questionable dataset.

library(dplyr)
iqs <- car::Burt$IQbio
iqs
#>  [1]  82  80  88 108 116 117 132  71  75  93  95  88 111  63  77  86  83
#> [18]  93  97  87  94  96 112 113 106 107  98

IQs are designed to have a normal distribution with a population mean of 100 and an SD of 15.

How well do these data fit in that kind of bell curve?

Density is a measure of relative likelihood

  • Draw the data on a hypothetical bell curve with a mean of 100 and SD of 15.
  • Height of each point on curve is density around that point.
  • Higher density regions are more likely.
  • Data farther from center is less likely.

Density of IQ scores drawn a bell curve with mean 100.

In comparison, the data are less likely in a bell curve with a mean of 130.

Density of IQ scores drawn a bell curve with mean 130. The fit is terrible.

Density function dnorm(xs, mean = 100, sd = 15) tells us the height of each value in xs when drawn on a normal bell curve.

# likelihood (density) of each point
dnorm(iqs, 100, 15) %>% round(3)
#>  [1] 0.013 0.011 0.019 0.023 0.015 0.014 0.003 0.004 0.007 0.024 0.025
#> [12] 0.019 0.020 0.001 0.008 0.017 0.014 0.024 0.026 0.018 0.025 0.026
#> [23] 0.019 0.018 0.025 0.024 0.026

Likelihood of all points is the product. These quantities get vanishingly small so we sum their logs instead. (Hence, log-likelihoods.)

# vanishingly small!
prod(dnorm(iqs, 100, 15))
#> [1] 2.276823e-50

# log scale
sum(dnorm(iqs, 100, 15, log = TRUE))
#> [1] -114.3065

Log-likelihoods provide a measure of how well the data fit a given normal distribution.

Which mean best fits the data? Below average IQ (85), average IQ (100), or above average IQ (115)?

sum(dnorm(iqs, 85, 15, log = TRUE))
#> [1] -119.0065
sum(dnorm(iqs, 100, 15, log = TRUE))
#> [1] -114.3065
sum(dnorm(iqs, 115, 15, log = TRUE))
#> [1] -136.6065

The data fit best with the “population average” mean (100).

We just used a maximum likelihood criterion to choose among these alternatives!

Likelihood summary

  • We have some model of how the data could be generated. This model has tuneable parameters.
    • “The IQs are drawn from a normal distribution with an SD of 15 and some unknown mean.”
  • Likelihood is how well the observed data fit in a particular data-generating model.
  • Classical regression's “line of best fit” finds model parameters that maximize the likelihood of the data.

Bayesian models

\[ \mathrm{posterior} = \frac{ \mathrm{likelihood} * \mathrm{prior}}{\mathrm{average\ likelihood}} \]

A Bayesian model examines a distribution over model parameters. What are all the plausible ways the data could have been generated?

  • Prior: A probability distribution over model parameters.
  • Update our prior information in proportion to how well the data fits with that information.

Bayesian updating

Let's consider all integer values from 70 to 130 as equally probable means for the IQs. This is a uniform prior.

# Keep track of last probability value
df_iq_model <- data_frame(
  mean = 70:130,
  prob = 1 / length(mean),
  previous = NA_real_)
sum(df_iq_model$prob)
#> [1] 1
df_iq_model
#> # A tibble: 61 × 3
#>     mean       prob previous
#>    <int>      <dbl>    <dbl>
#> 1     70 0.01639344       NA
#> 2     71 0.01639344       NA
#> 3     72 0.01639344       NA
#> 4     73 0.01639344       NA
#> 5     74 0.01639344       NA
#> 6     75 0.01639344       NA
#> 7     76 0.01639344       NA
#> 8     77 0.01639344       NA
#> 9     78 0.01639344       NA
#> 10    79 0.01639344       NA
#> # ... with 51 more rows

plot of chunk iq-00-data

We observe one data-point and update our prior information using the likelihood of the data at each mean.

df_iq_model$previous <- df_iq_model$prob
likelihoods <- dnorm(iqs[1], df_iq_model$mean, 15)
# numerator of bayes theorem
df_iq_model$prob <- likelihoods * df_iq_model$prob
sum(df_iq_model$prob)
#> [1] 0.01306729

That's not right. We need the average likelihood to ensure that the probabilities add up to 1. This is why it's sometimes called a normalizing constant.

# include denominator of bayes theorem
df_iq_model$prob  <- df_iq_model$prob / sum(df_iq_model$prob)
sum(df_iq_model$prob)
#> [1] 1

Bayesian updating after 1 observation. Previous probabilites are drawn for reference.

We observe another data-point and update the probability with the likelihood again.

df_iq_model$previous <- df_iq_model$prob
likelihoods <- dnorm(iqs[2], df_iq_model$mean, 15)
df_iq_model$prob <- likelihoods * df_iq_model$prob
# normalize
df_iq_model$prob <- df_iq_model$prob / sum(df_iq_model$prob)
df_iq_model
#> # A tibble: 61 × 3
#>     mean       prob   previous
#>    <int>      <dbl>      <dbl>
#> 1     70 0.02551519 0.02422865
#> 2     71 0.02801128 0.02549920
#> 3     72 0.03047942 0.02671736
#> 4     73 0.03287154 0.02786958
#> 5     74 0.03513768 0.02894257
#> 6     75 0.03722765 0.02992359
#> 7     76 0.03909289 0.03080065
#> 8     77 0.04068830 0.03156283
#> 9     78 0.04197406 0.03220045
#> 10    79 0.04291726 0.03270526
#> # ... with 51 more rows

Bayesian updating after 2 observations

And one more…

df_iq_model$previous <- df_iq_model$prob
likelihoods <- dnorm(iqs[3], df_iq_model$mean, 15)
df_iq_model$prob <- likelihoods * df_iq_model$prob
# normalize
df_iq_model$prob <- df_iq_model$prob / sum(df_iq_model$prob)
df_iq_model
#> # A tibble: 61 × 3
#>     mean       prob   previous
#>    <int>      <dbl>      <dbl>
#> 1     70 0.01490139 0.02551519
#> 2     71 0.01768232 0.02801128
#> 3     72 0.02070434 0.03047942
#> 4     73 0.02392174 0.03287154
#> 5     74 0.02727304 0.03513768
#> 6     75 0.03068201 0.03722765
#> 7     76 0.03405991 0.03909289
#> 8     77 0.03730890 0.04068830
#> 9     78 0.04032654 0.04197406
#> 10    79 0.04301093 0.04291726
#> # ... with 51 more rows

Bayesian updating after 3 observations

Animation!

Gif of the probabilities updating for the whole dataset

Recap

  • We have some model of how the data could be generated. This model has tuneable parameters.
  • We have some prior distribution over plausible values for the parameters.
  • Likelihood is how well the data fit in a model with a certain set of parameters.
  • We explore all the plausible ways the data could be generated by the model and those parameters, and we update our prior information based on the fit of the data.
  • We update our prior information based on the fit of the data.

Basic regression model

We want to estimate a response \( y \) with 1 predictor \( x_1 \).

\[ \begin{align*} y_i &\sim \mathrm{Normal}(\mathrm{mean} = \mu_i, \mathrm{SD} = \sigma) \\ \mu_i &= \alpha + \beta_1*x_{1i} \end{align*} \]

Data generating model: Observation \( y_i \) is a draw from a normal distribution centered around a mean \( \mu_i \) with a standard deviation of \( \sigma \).

We estimate the mean with a constant “intercept” term \( \alpha \) plus a linear combination of predictor variables (just \( x_1 \) for now).

Basically

For a model of weight predicted by height… but with continuous bell curves. A bell tunnel through the data.

Scatter plot with bell curves overlaid to how the shape of the density remains the same. It's just the center that moves with x.

Bayesian stats

Parameters we need to estimate for regression: \( \alpha, \beta_1, \sigma \)

  • A classical model provides one model of many plausible models of the data. It'll find the parameters that maximize likelihood.
  • A Bayesian model is a model of models. We get a distribution of models that are consistent with the data.

This is where things get difficult

Parameters we need to estimate: \( \alpha, \beta_1, \sigma \)

\[ \mathrm{posterior} = \frac{ \mathrm{likelihood} * \mathrm{prior}}{\mathrm{average\ likelihood}} \]

\[ P(\alpha, \beta, \sigma \mid x) = \frac{ P(x \mid \alpha, \beta, \sigma) \, P(\alpha, \beta, \sigma)}{\iiint \, P(x \mid \alpha, \beta, \sigma) \, P(\alpha, \beta, \sigma) \,d\alpha \,d\beta \,d\sigma} \]

Things get gnarly. This is the black-box step.

We don't perform this integral calculus. Instead, we rely on Markov-chain Monte Carlo simulation to get us samples from the posterior. Those samples will provide a detailed picture of the posterior.

Enter Stan.

Next