TJ Mahr
Sept. 21, 2016
Madison R Users Group
I got an email with the word “cialis” in it. Is it spam?
\[ P(\mathrm{spam} \mid \mathrm{cialis}) = \frac{ P(\mathrm{cialis} \mid \mathrm{spam}) \, P(\mathrm{spam})}{P(\mathrm{cialis})} \]
The two unconditional probabilities are base rates that need to be accounted for.
\[ P(\mathrm{spam} \mid \mathrm{cialis}) = \frac{\mathrm{cialis\ freq.\ in\ spam} \, * \mathrm{spam\ rate}}{\mathrm{cialis\ freq.}} \]
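A toy version of this calculation, with made-up counts (the numbers below are invented for illustration, not from any real corpus):
# Hypothetical counts from an imaginary inbox of 1000 emails
n_emails <- 1000
n_spam <- 300                  # 300 of the emails are spam
n_cialis_in_spam <- 60         # "cialis" appears in 60 of the spam emails
n_cialis_total <- 65           # ... and in 65 emails overall
# P(cialis | spam) * P(spam) / P(cialis)
(n_cialis_in_spam / n_spam) * (n_spam / n_emails) / (n_cialis_total / n_emails)
#> [1] 0.9230769
# which simplifies to just counting events: 60 / 65
n_cialis_in_spam / n_cialis_total
#> [1] 0.9230769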
This example is just counting events. We can think more broadly about the probabilities.
I saw a creature, and it just quacked at me! Was it a duck?
\[ P(\mathrm{duck} \mid \mathrm{quacks}) = \frac{ P(\mathrm{quacks} \mid \mathrm{duck}) \, P(\mathrm{duck})}{P(\mathrm{quacks})} \]
How plausible is some hypothesis given the data?
\[ P(\mathrm{hypothesis} \mid \mathrm{data}) = \frac{ P(\mathrm{data} \mid \mathrm{hypothesis}) \, P(\mathrm{hypothesis})}{P(\mathrm{data})} \]
Pieces of the equation:
\[ \mathrm{posterior} = \frac{ \mathrm{likelihood} * \mathrm{prior}}{\mathrm{average\ likelihood}} \]
We found some IQ scores in an old, questionable dataset.
library(dplyr)
iqs <- car::Burt$IQbio
iqs
#> [1] 82 80 88 108 116 117 132 71 75 93 95 88 111 63 77 86 83
#> [18] 93 97 87 94 96 112 113 106 107 98
IQ scores are designed to follow a normal distribution with a population mean of 100 and an SD of 15.
How well do these data fit that kind of bell curve?
In comparison, the data are less likely in a bell curve with a mean of 130.
The density function dnorm(xs, mean = 100, sd = 15) tells us the height of each value in xs when drawn on a normal bell curve.
# likelihood (density) of each point
dnorm(iqs, 100, 15) %>% round(3)
#> [1] 0.013 0.011 0.019 0.023 0.015 0.014 0.003 0.004 0.007 0.024 0.025
#> [12] 0.019 0.020 0.001 0.008 0.017 0.014 0.024 0.026 0.018 0.025 0.026
#> [23] 0.019 0.018 0.025 0.024 0.026
The likelihood of all the points together is the product of the individual likelihoods. These products get vanishingly small, so we sum their logs instead. (Hence, log-likelihoods.)
# vanishingly small!
prod(dnorm(iqs, 100, 15))
#> [1] 2.276823e-50
# log scale
sum(dnorm(iqs, 100, 15, log = TRUE))
#> [1] -114.3065
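A quick sanity check (not in the original output): the log of the product matches the sum of the logs, up to floating-point noise.
# same quantity computed two ways
all.equal(log(prod(dnorm(iqs, 100, 15))), sum(dnorm(iqs, 100, 15, log = TRUE)))
#> [1] TRUE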
Log-likelihoods provide a measure of how well the data fit a given normal distribution.
Which mean best fits the data: a below-average IQ (85), an average IQ (100), or an above-average IQ (115)?
sum(dnorm(iqs, 85, 15, log = TRUE))
#> [1] -119.0065
sum(dnorm(iqs, 100, 15, log = TRUE))
#> [1] -114.3065
sum(dnorm(iqs, 115, 15, log = TRUE))
#> [1] -136.6065
The data fit best with the “population average” mean (100).
We just used a maximum likelihood criterion to choose among these alternatives!
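We don't have to stop at three candidates. A finer grid search (a sketch extending the three-way comparison above) lands on the sample mean, which is the maximum likelihood estimate of the mean when the SD is held fixed:
# evaluate the log-likelihood at every candidate mean on a fine grid
means <- seq(70, 130, by = 0.1)
lls <- sapply(means, function(m) sum(dnorm(iqs, m, 15, log = TRUE)))
means[which.max(lls)]
#> [1] 95.1
mean(iqs)
#> [1] 95.11111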
\[ \mathrm{posterior} = \frac{ \mathrm{likelihood} * \mathrm{prior}}{\mathrm{average\ likelihood}} \]
A Bayesian model examines a distribution over model parameters. What are all the plausible ways the data could have been generated?
Let's consider all integer values from 70 to 130 as equally probable means for the IQs. This is a uniform prior.
# the `previous` column will keep track of the last round of probabilities
df_iq_model <- data_frame(
  mean = 70:130,
  prob = 1 / length(mean),  # uniform prior: each of the 61 means equally probable
  previous = NA_real_)
sum(df_iq_model$prob)
#> [1] 1
df_iq_model
#> # A tibble: 61 × 3
#> mean prob previous
#> <int> <dbl> <dbl>
#> 1 70 0.01639344 NA
#> 2 71 0.01639344 NA
#> 3 72 0.01639344 NA
#> 4 73 0.01639344 NA
#> 5 74 0.01639344 NA
#> 6 75 0.01639344 NA
#> 7 76 0.01639344 NA
#> 8 77 0.01639344 NA
#> 9 78 0.01639344 NA
#> 10 79 0.01639344 NA
#> # ... with 51 more rows
We observe one data point and update our prior information using the likelihood of that point at each candidate mean.
df_iq_model$previous <- df_iq_model$prob
likelihoods <- dnorm(iqs[1], df_iq_model$mean, 15)
# numerator of Bayes' theorem
df_iq_model$prob <- likelihoods * df_iq_model$prob
sum(df_iq_model$prob)
#> [1] 0.01306729
That's not right: the probabilities no longer sum to 1. We need to divide by the average likelihood so that they do, which is why the denominator is sometimes called a normalizing constant. (In fact, the sum we just computed, 0.013, is the average likelihood: the likelihood of the first data point averaged over the prior.)
# include the denominator of Bayes' theorem
df_iq_model$prob <- df_iq_model$prob / sum(df_iq_model$prob)
sum(df_iq_model$prob)
#> [1] 1
We observe another data point and update the probabilities with the likelihood again.
df_iq_model$previous <- df_iq_model$prob
likelihoods <- dnorm(iqs[2], df_iq_model$mean, 15)
df_iq_model$prob <- likelihoods * df_iq_model$prob
# normalize
df_iq_model$prob <- df_iq_model$prob / sum(df_iq_model$prob)
df_iq_model
#> # A tibble: 61 × 3
#> mean prob previous
#> <int> <dbl> <dbl>
#> 1 70 0.02551519 0.02422865
#> 2 71 0.02801128 0.02549920
#> 3 72 0.03047942 0.02671736
#> 4 73 0.03287154 0.02786958
#> 5 74 0.03513768 0.02894257
#> 6 75 0.03722765 0.02992359
#> 7 76 0.03909289 0.03080065
#> 8 77 0.04068830 0.03156283
#> 9 78 0.04197406 0.03220045
#> 10 79 0.04291726 0.03270526
#> # ... with 51 more rows
And one more…
df_iq_model$previous <- df_iq_model$prob
likelihoods <- dnorm(iqs[3], df_iq_model$mean, 15)
df_iq_model$prob <- likelihoods * df_iq_model$prob
# normalize
df_iq_model$prob <- df_iq_model$prob / sum(df_iq_model$prob)
df_iq_model
#> # A tibble: 61 × 3
#> mean prob previous
#> <int> <dbl> <dbl>
#> 1 70 0.01490139 0.02551519
#> 2 71 0.01768232 0.02801128
#> 3 72 0.02070434 0.03047942
#> 4 73 0.02392174 0.03287154
#> 5 74 0.02727304 0.03513768
#> 6 75 0.03068201 0.03722765
#> 7 76 0.03405991 0.03909289
#> 8 77 0.03730890 0.04068830
#> 9 78 0.04032654 0.04197406
#> 10 79 0.04301093 0.04291726
#> # ... with 51 more rows
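If we kept updating through all 27 points, we would end up in the same place as a single batch update with the whole dataset; sequential and all-at-once updating give the same posterior. A quick sketch of the batch version:
# one batch update with all 27 points at once
batch_ll <- sapply(70:130, function(m) sum(dnorm(iqs, m, 15, log = TRUE)))
batch_prob <- exp(batch_ll - max(batch_ll))  # shift by the max to avoid underflow
batch_prob <- batch_prob / sum(batch_prob)   # normalize (the uniform prior cancels out)
# the posterior peaks at the candidate mean closest to the sample mean
(70:130)[which.max(batch_prob)]
#> [1] 95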
We want to estimate a response \( y \) with one predictor \( x_1 \).
\[ \begin{align*} y_i &\sim \mathrm{Normal}(\mathrm{mean} = \mu_i, \mathrm{SD} = \sigma) \\ \mu_i &= \alpha + \beta_1*x_{1i} \end{align*} \]
Data generating model: Observation \( y_i \) is a draw from a normal distribution centered around a mean \( \mu_i \) with a standard deviation of \( \sigma \).
We estimate the mean with a constant “intercept” term \( \alpha \) plus a linear combination of predictor variables (just \( x_1 \) for now).
Picture a model of weight predicted by height: at each height, the model draws a continuous bell curve over plausible weights. A bell tunnel through the data.
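To make the generative story concrete, here is a sketch that simulates heights and weights from this model. The parameter values are made up for illustration, not estimates.
# simulate from y_i ~ Normal(alpha + beta_1 * x_1i, sigma)
n <- 100
alpha <- -100; beta_1 <- 1; sigma <- 5    # hypothetical "true" parameters
height <- runif(n, min = 140, max = 190)  # x_1: heights in cm
mu <- alpha + beta_1 * height             # each observation gets its own mean
weight <- rnorm(n, mean = mu, sd = sigma) # y: draws from the bell tunnel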
Parameters we need to estimate for regression: \( \alpha, \beta_1, \sigma \)
\[ \mathrm{posterior} = \frac{ \mathrm{likelihood} * \mathrm{prior}}{\mathrm{average\ likelihood}} \]
\[ P(\alpha, \beta, \sigma \mid x) = \frac{ P(x \mid \alpha, \beta, \sigma) \, P(\alpha, \beta, \sigma)}{\iiint \, P(x \mid \alpha, \beta, \sigma) \, P(\alpha, \beta, \sigma) \,d\alpha \,d\beta \,d\sigma} \]
Things get gnarly. This is the black-box step.
We don't compute this integral directly. Instead, we rely on Markov chain Monte Carlo (MCMC) simulation to draw samples from the posterior. Those samples provide a detailed picture of the posterior.
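To demystify what "samples from the posterior" means, here is a minimal Metropolis sampler for the one-parameter IQ model from earlier. This is a pedagogical sketch only; Stan uses a much more efficient algorithm (Hamiltonian Monte Carlo).
# random-walk Metropolis for the posterior of the IQ mean
# (uniform prior on [70, 130], SD fixed at 15)
log_posterior <- function(m) {
  if (m < 70 || m > 130) return(-Inf)  # outside the uniform prior's support
  sum(dnorm(iqs, m, 15, log = TRUE))   # flat prior only adds a constant
}
n_samples <- 5000
samples <- numeric(n_samples)
current <- 100
for (i in seq_len(n_samples)) {
  proposal <- current + rnorm(1, mean = 0, sd = 5)  # propose a nearby mean
  if (log(runif(1)) < log_posterior(proposal) - log_posterior(current)) {
    current <- proposal                             # accept the proposal
  }
  samples[i] <- current   # record the (possibly unchanged) value
}
# the samples trace out the posterior: centered near 95, SD near 15 / sqrt(27)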
Enter Stan.