Slides and R code that produced them are online: https://github.com/tjmahr/Psych710_BayesLecture
I gave a similar, more code-heavy version of this talk to the R Users Group: https://github.com/tjmahr/MadR_RStanARM
I learned stats and R in this course with Markus Brauer and John Curtin.
I still refer to the slides from this course on contrast codes.
But now I’m a “Bayesian”.
The Open Science Collaboration (2015) tried to replicate 100 studies published in 3 different psychology journals in 2008.
Some of the responses were reactionary; some were constructive.
All those unintentional acts and rituals to appease the Statistical Significance gods.
The usual way of doing things is insecure.
I want to avoid these questionable practices.
I want to level up my stats and explore new techniques.
I want something less finicky than statistical significance.
I started reading the Gelman and Hill book. It comes with a companion R package, arm, and it emphasizes estimation, uncertainty, and simulation.
Midway through, the book pivots to Bayesian estimation. (Multilevel models are kinda Bayesian because they borrow information across different clusters.)
I’m down a rabbit hole, writing Stan (Bayesian) models to fit the models from the ARM book, and there is an influx of Bayesian tools for R.
I eat all this up. I become a convert.
The replication crisis sparked my curiosity, and a wave of new tools and resources made it really easy to get started with Bayesian stats.
My goal with this approach has been to make better, more honest scientific summaries of observed data.
# Some toy data
library(dplyr)
davis <- car::Davis %>% filter(100 < height) %>% as_data_frame()
davis
#> # A tibble: 199 × 5
#> sex weight height repwt repht
#> <fctr> <int> <int> <int> <int>
#> 1 M 77 182 77 180
#> 2 F 58 161 51 159
#> 3 F 53 161 54 158
#> 4 M 68 177 70 175
#> 5 F 59 157 59 155
#> 6 M 76 170 76 165
#> 7 M 76 167 77 165
#> 8 M 69 186 73 180
#> 9 M 71 178 71 175
#> 10 M 65 171 64 170
#> # ... with 189 more rows
\[ p(A \mid B) : \text{probability of A given B}\]
Suppose that 95% of emails with the phrase “investment opportunity” are spam.
\[ p(\text{spam email} \mid \text{"investment opportunity"}) = .95 \]
What would this probability express?
\[ p(\text{"investment opportunity"} \mid \text{spam email})\]
That ordering matters. \(p(A \mid B)\) is not the same as \(p(B \mid A)\).
A theorem about conditional probability.
\[ p(B \mid A) = \frac{ p(A \mid B) * p(B)}{p(A)} \]
I can never remember this equation with letters. Here’s how I prefer to write it.
\[ p(\text{hypothesis} \mid \text{data}) = \frac{ p(\text{data} \mid \text{hypothesis}) * p(\text{hypothesis})}{p(\text{data})} \]
The “hypothesis” is typically something unobserved or unknown. It’s what you want to learn about using the data.
For regression models, the “hypothesis” is a parameter (intercept, slopes or error terms).
Bayes theorem tells you the probability of the hypothesis given the data.
How plausible is some hypothesis given the data?
\[ p(\text{hypothesis} \mid \text{data}) = \frac{ p(\text{data} \mid \text{hypothesis}) * p(\text{hypothesis})}{p(\text{data})} \]
Pieces of the equation:
\[ \text{posterior} = \frac{ \text{likelihood} * \text{prior}}{\text{average likelihood}} \]
I got an email with the word “cialis” in it. Is it spam?
\[ P(\text{spam} \mid \text{"cialis"}) = \frac{P(\text{"cialis"} \mid \text{spam}) * P(\text{spam})} {P(\text{"cialis"})} \]
The two unconditional probabilities are base rates that need to be accounted for.
The prior is the frequency of spam in general. The average likelihood is the frequency of the word “cialis” in emails.
\[ P(\text{spam} \mid \text{"cialis"}) = \frac{\text{"cialis" freq. in spam} * \text{spam rate}} {\text{"cialis" freq.}} \]
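To make that concrete, here is the same arithmetic in R with made-up base rates (these numbers are placeholders for illustration, not real spam statistics).

# Hypothetical base rates -- made-up numbers for illustration only
p_cialis_given_spam <- .40  # "cialis" freq. in spam
p_spam              <- .20  # spam rate
p_cialis            <- .10  # "cialis" freq. in all email

# Bayes' theorem: p(spam | "cialis")
p_cialis_given_spam * p_spam / p_cialis
#> [1] 0.8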
Some people would argue that using Bayes theorem is not “Bayesian”. After all, in this example, we’re just counting the frequency of events.
It’s kind of weird, but it is also true.
Simple event-counting is not what people usually mean by the word “Bayesian”.
\[ \text{updated information} = \frac{\text{likelihood of data} * \text{prior information}} {\text{average likelihood of data}} \]
Bayes’ theorem provides a systematic way to update our knowledge as we encounter new data.
\[ \text{updated beliefs} \propto \text{likelihood of data} * \text{prior beliefs} \]
Sidenote: This is nifty. A lot of my stats training made more sense once I had a broader understanding of likelihood.
What is a statistical model?
It’s a description of how the data could have been generated.
IQ scores are normally distributed.
\[ \mathrm{IQ}_i \sim \mathrm{Normal}(\underbrace{\mu}_{\text{mean}}, \underbrace{\sigma}_{\text{SD}}) \]
(The \(\sim\) means “sampled from” or “drawn from”.)
\(\mu\) and \(\sigma\) are parameters for this model that change the center and spread of the normal bell curve.
The normative IQ model has \(\mu = 100\) and \(\sigma = 15\).
How likely are the data in a given model?
I’ve never seen it explained this way, but I think of likelihood as “fit”.
How well the data fit a given model.
We found some IQ scores in an old, questionable dataset.
library(dplyr)
iqs <- car::Burt$IQbio
iqs
#> [1] 82 80 88 108 116 117 132 71 75 93 95 88 111 63
#> [15] 77 86 83 93 97 87 94 96 112 113 106 107 98
IQs are designed to have a normal distribution with a population mean of 100 and an SD of 15.
How well do these data fit in that kind of bell curve?
The density function dnorm(xs, mean = 100, sd = 15) tells us the height of each value in xs when drawn on a normal bell curve.
# likelihood (density) of each point
dnorm(iqs, 100, 15) %>% round(3)
#> [1] 0.013 0.011 0.019 0.023 0.015 0.014 0.003 0.004 0.007
#> [10] 0.024 0.025 0.019 0.020 0.001 0.008 0.017 0.014 0.024
#> [19] 0.026 0.018 0.025 0.026 0.019 0.018 0.025 0.024 0.026
The likelihood of all the points together is their product. These quantities get vanishingly small, so we sum their logs instead. (Hence, log-likelihoods.)
# 2 * 10^-50 is vaaaaaaanishingly small!
prod(dnorm(iqs, 100, 15))
#> [1] 2.276823e-50
# log scale
sum(dnorm(iqs, 100, 15, log = TRUE))
#> [1] -114.3065
Log-likelihoods provide a measure of how well the data fit a given normal distribution.
Which mean best fits the data? Below average IQ (85), average IQ (100), or above average IQ (115)? (Higher is better.)
sum(dnorm(iqs, 85, 15, log = TRUE))
#> [1] -119.0065
sum(dnorm(iqs, 100, 15, log = TRUE))
#> [1] -114.3065
sum(dnorm(iqs, 115, 15, log = TRUE))
#> [1] -136.6065
Of these three, the data fit best with the “population average” mean (100).
We just used a maximum likelihood criterion to choose among these alternatives!
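A finer grid search makes the same point. This is a sketch (not from the original slides) that scans every integer mean from 70 to 130 and keeps the one with the highest log-likelihood; it lands on the integer closest to the sample mean.

# Sketch: scan candidate means and keep the best-fitting one
candidate_means <- 70:130
log_liks <- sapply(
  candidate_means,
  function(m) sum(dnorm(iqs, mean = m, sd = 15, log = TRUE))
)
candidate_means[which.max(log_liks)]
#> [1] 95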
We have some model of how the data could be generated. This model has tuneable parameters.
The IQs are drawn from a normal distribution with an SD of 15 and some unknown mean.
Likelihood is how well the observed data fit in a particular data-generating model.
Classical regression’s “line of best fit” finds model parameters that maximize the likelihood of the data.
\[ \text{posterior} = \frac{ \text{likelihood} * \text{prior}}{\text{average likelihood}} \]
A Bayesian model examines a distribution over model parameters. What are all the plausible ways the data could have been generated?
Let’s consider all integer values from 70 to 130 as equally probable means for the IQs. This is a flat or uniform prior.
Here’s our model.
\[ \begin{align*} \mathrm{IQ}_i &\sim \text{Normal}(\mu, \sigma = 15) &\text{[likelihood]} \\ \mu &\sim \{\text{integers from 70 to 130}\} &\text{[prior for }\mu] \end{align*} \]
We are going to use grid approximation for this example. That means systematically evaluating the model over a grid of candidate parameter values. (It’s mostly useful for illustrating how Bayes’ theorem works.)
df_iq_model <- data_frame(
# Candidate mean value
mean = 70:130,
# Probability of each candidate mean right now
prob = 1 / length(mean),
# Probability of each candidate mean during the last update
previous = NA_real_)
# Probabilities sum to 1
sum(df_iq_model$prob)
#> [1] 1
df_iq_model
#> # A tibble: 61 × 3
#> mean prob previous
#> <int> <dbl> <dbl>
#> 1 70 0.01639344 NA
#> 2 71 0.01639344 NA
#> 3 72 0.01639344 NA
#> 4 73 0.01639344 NA
#> 5 74 0.01639344 NA
#> 6 75 0.01639344 NA
#> 7 76 0.01639344 NA
#> 8 77 0.01639344 NA
#> 9 78 0.01639344 NA
#> 10 79 0.01639344 NA
#> # ... with 51 more rows
We observe one data-point, \(y = 82\), and update our prior information using the likelihood of the data at each possible mean.
df_iq_model$previous <- df_iq_model$prob
likelihoods <- dnorm(iqs[1], df_iq_model$mean, 15)
# numerator of bayes theorem
df_iq_model$prob <- likelihoods * df_iq_model$prob
sum(df_iq_model$prob)
#> [1] 0.01306729
That’s not right! We need the average likelihood to ensure that the probabilities add up to 1. This is why it’s sometimes called a normalizing constant.
# include denominator of bayes theorem
df_iq_model$prob <- df_iq_model$prob / sum(df_iq_model$prob)
sum(df_iq_model$prob)
#> [1] 1
We observe another data-point and update the probability with the likelihood again.
df_iq_model$previous <- df_iq_model$prob
likelihoods <- dnorm(iqs[2], df_iq_model$mean, 15)
df_iq_model$prob <- likelihoods * df_iq_model$prob
# normalize
df_iq_model$prob <- df_iq_model$prob / sum(df_iq_model$prob)
df_iq_model
#> # A tibble: 61 × 3
#> mean prob previous
#> <int> <dbl> <dbl>
#> 1 70 0.02551519 0.02422865
#> 2 71 0.02801128 0.02549920
#> 3 72 0.03047942 0.02671736
#> 4 73 0.03287154 0.02786958
#> 5 74 0.03513768 0.02894257
#> 6 75 0.03722765 0.02992359
#> 7 76 0.03909289 0.03080065
#> 8 77 0.04068830 0.03156283
#> 9 78 0.04197406 0.03220045
#> 10 79 0.04291726 0.03270526
#> # ... with 51 more rows
And one more…
df_iq_model$previous <- df_iq_model$prob
likelihoods <- dnorm(iqs[3], df_iq_model$mean, 15)
df_iq_model$prob <- likelihoods * df_iq_model$prob
# normalize
df_iq_model$prob <- df_iq_model$prob / sum(df_iq_model$prob)
df_iq_model
#> # A tibble: 61 × 3
#> mean prob previous
#> <int> <dbl> <dbl>
#> 1 70 0.01490139 0.02551519
#> 2 71 0.01768232 0.02801128
#> 3 72 0.02070434 0.03047942
#> 4 73 0.02392174 0.03287154
#> 5 74 0.02727304 0.03513768
#> 6 75 0.03068201 0.03722765
#> 7 76 0.03405991 0.03909289
#> 8 77 0.03730890 0.04068830
#> 9 78 0.04032654 0.04197406
#> 10 79 0.04301093 0.04291726
#> # ... with 51 more rows
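Rather than updating one observation at a time by hand, a small loop can fold in every IQ score. This is a sketch that reuses the same grid and the same update rule.

# Sketch: start from a flat prior and update once per observation
means <- 70:130
prob <- rep(1 / length(means), length(means))

for (iq in iqs) {
  prob <- dnorm(iq, means, 15) * prob  # likelihood * prior
  prob <- prob / sum(prob)             # normalize
}

# Most probable candidate mean after all 27 scores
means[which.max(prob)]
#> [1] 95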
https://github.com/tjmahr/MadR_RStanARM/blob/master/assets/simple-updating.gif
I learned stats in this course, so I bet you write regression models as a one-liner like this:
\[ \underbrace{y_i}_{\text{observation}} = \underbrace{\alpha + \beta _{1} x_{1i}}_{\text{predicted mean given }x} + \underbrace{\epsilon _i}_{\text{random error}} \]
Data generating model: Observation \(y_i\) is a draw from a normal distribution centered around a mean.
We estimate the mean with a constant “intercept” term \(\alpha\) plus a linear combination of predictor variables (just \(x_1\) for now).
Let’s re-write the model to make the normal-distribution part clearer. No more one-liner.
\[ \begin{align*} y_i &\sim \text{Normal}(\text{mean} = \mu_i, \text{SD} = \sigma) &\text{[likelihood]} \\ \mu_i &= \alpha + \beta_1*x_{1i} &\text{[linear model]} \end{align*} \]
Observation \(y_i\) is a draw from a normal distribution centered around a mean \(\mu_i\) with a standard deviation of \(\sigma\).
The mean is a constant term \(\alpha\) plus a linear combination of predictor variables (just \(x_1\) for now).
(These equations describe the same models. It’s just a different kind of notation.)
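One payoff of writing the model this way is that it doubles as a recipe for simulating data. Here is a sketch with made-up parameter values; alpha, beta1, and sigma below are placeholders, not estimates from any dataset.

# Sketch: simulate fake observations from the data-generating model
alpha <- 60    # made-up intercept
beta1 <- 0.6   # made-up slope
sigma <- 8     # made-up residual SD

x1 <- runif(100, min = -20, max = 20)   # some predictor values
mu <- alpha + beta1 * x1                # linear model for the mean
y  <- rnorm(100, mean = mu, sd = sigma) # draws centered on the mean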
Consider a model of weight predicted by height…
To make the model Bayesian, we need to give prior distributions to parameters.
The parameters we need to estimate for regression: \(\alpha, \beta_1, \sigma\).
\[ \begin{align*} y_i &\sim \text{Normal}(\mu_i, \sigma) &\text{[likelihood]} \\ \mu_i &= \alpha + \beta_1*x_{1i} &\text{[linear model]} \\ \alpha &\sim \text{Normal}(0, 10) &\text{[prior for }\alpha] \\ \beta_1 &\sim \text{Normal}(0, 5) &\text{[prior for }\beta_1] \\ \sigma &\sim \text{HalfCauchy}(0, 5) &\text{[prior for }\sigma] \\ \end{align*} \]
Parameters we need to estimate: \(\alpha, \beta_1, \sigma\)
\[ \mathrm{posterior} = \frac{ \mathrm{likelihood} * \mathrm{prior}}{\mathrm{average\ likelihood}} \]
\[ P(\alpha, \beta, \sigma \mid x) = \frac{ P(x \mid \alpha, \beta, \sigma) \, P(\alpha, \beta, \sigma)}{\iiint \, P(x \mid \alpha, \beta, \sigma) \, P(\alpha, \beta, \sigma) \,d\alpha \,d\beta \,d\sigma} \]
Things get gnarly. This is the black-box step.
We don’t actually do this calculus. Instead, we rely on Markov chain Monte Carlo (MCMC) simulation to get samples from the posterior.
Those samples will provide a detailed picture of the posterior.
davis
#> # A tibble: 199 × 5
#> sex weight height repwt repht
#> <fctr> <int> <int> <int> <int>
#> 1 M 77 182 77 180
#> 2 F 58 161 51 159
#> 3 F 53 161 54 158
#> 4 M 68 177 70 175
#> 5 F 59 157 59 155
#> 6 M 76 170 76 165
#> 7 M 76 167 77 165
#> 8 M 69 186 73 180
#> 9 M 71 178 71 175
#> 10 M 65 171 64 170
#> # ... with 189 more rows
# Mean-center height
mean(davis$height)
#> [1] 170.5879
davis$heightC <- davis$height - mean(davis$height)
m <- glm(weight ~ heightC * sex, davis, family = gaussian())
m %>% summary() %>% coef() %>% round(3)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 60.558 1.099 55.081 0.000
#> heightC 0.623 0.135 4.626 0.000
#> sexM 7.949 1.710 4.648 0.000
#> heightC:sexM 0.373 0.190 1.964 0.051
Stan: a probabilistic programming language / MCMC sampler
RStanARM: RStan Applied Regression Modeling
glm -> stan_glm, glmer -> stan_glmer. (The ARM here is the same applied-regression-modeling ARM as the Gelman and Hill book and its arm package.)
library(rstanarm)
#> Loading required package: Rcpp
#> Warning: package 'Rcpp' was built under R version 3.3.3
#> rstanarm (Version 2.14.1, packaged: 2017-01-16 18:47:11 UTC)
#> - Do not expect the default priors to remain the same in future rstanarm versions.
#> Thus, R scripts should specify priors explicitly, even if they are just the defaults.
#> - For execution on a local, multicore CPU with excess RAM we recommend calling
#> options(mc.cores = parallel::detectCores())
We have to use stan_glm(); stan_lm() uses a different specification of the prior.
By default, stan_glm() does sampling with 4 MCMC chains. Each “chain” explores the posterior distribution from a random starting location.
stan_model <- stan_glm(
weight ~ heightC * sex,
data = davis,
family = gaussian,
# RStanARM rescales predictor variables and priors use that scaling
prior = normal(0, 5),
prior_intercept = normal(0, 10)
)
#>
#> SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
#>
#> Chain 1, Iteration: 1 / 2000 [ 0%] (Warmup)
#> Chain 1, Iteration: 200 / 2000 [ 10%] (Warmup)
#> Chain 1, Iteration: 400 / 2000 [ 20%] (Warmup)
#> Chain 1, Iteration: 600 / 2000 [ 30%] (Warmup)
#> Chain 1, Iteration: 800 / 2000 [ 40%] (Warmup)
#> Chain 1, Iteration: 1000 / 2000 [ 50%] (Warmup)
#> Chain 1, Iteration: 1001 / 2000 [ 50%] (Sampling)
#> Chain 1, Iteration: 1200 / 2000 [ 60%] (Sampling)
#> Chain 1, Iteration: 1400 / 2000 [ 70%] (Sampling)
#> Chain 1, Iteration: 1600 / 2000 [ 80%] (Sampling)
#> Chain 1, Iteration: 1800 / 2000 [ 90%] (Sampling)
#> Chain 1, Iteration: 2000 / 2000 [100%] (Sampling)
#> Elapsed Time: 0.301 seconds (Warm-up)
#> 0.273 seconds (Sampling)
#> 0.574 seconds (Total)
#>
#>
#> SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
#>
#> Chain 2, Iteration: 1 / 2000 [ 0%] (Warmup)
#> Chain 2, Iteration: 200 / 2000 [ 10%] (Warmup)
#> Chain 2, Iteration: 400 / 2000 [ 20%] (Warmup)
#> Chain 2, Iteration: 600 / 2000 [ 30%] (Warmup)
#> Chain 2, Iteration: 800 / 2000 [ 40%] (Warmup)
#> Chain 2, Iteration: 1000 / 2000 [ 50%] (Warmup)
#> Chain 2, Iteration: 1001 / 2000 [ 50%] (Sampling)
#> Chain 2, Iteration: 1200 / 2000 [ 60%] (Sampling)
#> Chain 2, Iteration: 1400 / 2000 [ 70%] (Sampling)
#> Chain 2, Iteration: 1600 / 2000 [ 80%] (Sampling)
#> Chain 2, Iteration: 1800 / 2000 [ 90%] (Sampling)
#> Chain 2, Iteration: 2000 / 2000 [100%] (Sampling)
#> Elapsed Time: 0.267 seconds (Warm-up)
#> 0.255 seconds (Sampling)
#> 0.522 seconds (Total)
#>
#>
#> SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
#>
#> Chain 3, Iteration: 1 / 2000 [ 0%] (Warmup)
#> Chain 3, Iteration: 200 / 2000 [ 10%] (Warmup)
#> Chain 3, Iteration: 400 / 2000 [ 20%] (Warmup)
#> Chain 3, Iteration: 600 / 2000 [ 30%] (Warmup)
#> Chain 3, Iteration: 800 / 2000 [ 40%] (Warmup)
#> Chain 3, Iteration: 1000 / 2000 [ 50%] (Warmup)
#> Chain 3, Iteration: 1001 / 2000 [ 50%] (Sampling)
#> Chain 3, Iteration: 1200 / 2000 [ 60%] (Sampling)
#> Chain 3, Iteration: 1400 / 2000 [ 70%] (Sampling)
#> Chain 3, Iteration: 1600 / 2000 [ 80%] (Sampling)
#> Chain 3, Iteration: 1800 / 2000 [ 90%] (Sampling)
#> Chain 3, Iteration: 2000 / 2000 [100%] (Sampling)
#> Elapsed Time: 0.296 seconds (Warm-up)
#> 0.259 seconds (Sampling)
#> 0.555 seconds (Total)
#>
#>
#> SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
#>
#> Chain 4, Iteration: 1 / 2000 [ 0%] (Warmup)
#> Chain 4, Iteration: 200 / 2000 [ 10%] (Warmup)
#> Chain 4, Iteration: 400 / 2000 [ 20%] (Warmup)
#> Chain 4, Iteration: 600 / 2000 [ 30%] (Warmup)
#> Chain 4, Iteration: 800 / 2000 [ 40%] (Warmup)
#> Chain 4, Iteration: 1000 / 2000 [ 50%] (Warmup)
#> Chain 4, Iteration: 1001 / 2000 [ 50%] (Sampling)
#> Chain 4, Iteration: 1200 / 2000 [ 60%] (Sampling)
#> Chain 4, Iteration: 1400 / 2000 [ 70%] (Sampling)
#> Chain 4, Iteration: 1600 / 2000 [ 80%] (Sampling)
#> Chain 4, Iteration: 1800 / 2000 [ 90%] (Sampling)
#> Chain 4, Iteration: 2000 / 2000 [100%] (Sampling)
#> Elapsed Time: 0.298 seconds (Warm-up)
#> 0.254 seconds (Sampling)
#> 0.552 seconds (Total)
stan_model
#> stan_glm(formula = weight ~ heightC * sex, family = gaussian,
#> data = davis, prior = normal(0, 5), prior_intercept = normal(0,
#> 10))
#>
#> Estimates:
#> Median MAD_SD
#> (Intercept) 60.5 1.1
#> heightC 0.6 0.1
#> sexM 7.9 1.7
#> heightC:sexM 0.4 0.2
#> sigma 8.0 0.4
#>
#> Sample avg. posterior predictive
#> distribution of y (X = xbar):
#> Median MAD_SD
#> mean_PPD 65.3 0.8
Predictors are centered and rescaled internally by rstanarm, so our priors are on the standardized scale.
normal(0, 5) is a distribution of effect sizes with mean 0 and SD 5. See ?rstanarm::priors, especially the autoscale argument.
prior_summary(stan_model)
#> Priors for model 'stan_model'
#> ------
#> Intercept (after predictors centered)
#> ~ normal(location = 0, scale = 10)
#> **adjusted scale = 266.87
#>
#> Coefficients
#> ~ normal(location = [0,0,0], scale = [5,5,5])
#> **adjusted scale = [ 7.5,133.4, 11.8]
#>
#> Auxiliary (sigma)
#> ~ cauchy(location = 0, scale = 5)
#> ------
#> See help('prior_summary.stanreg') for more details
summary(stan_model)
#> stan_glm(formula = weight ~ heightC * sex, family = gaussian,
#> data = davis, prior = normal(0, 5), prior_intercept = normal(0,
#> 10))
#>
#> Family: gaussian (identity)
#> Algorithm: sampling
#> Posterior sample size: 4000
#> Observations: 199
#>
#> Estimates:
#> mean sd 2.5% 25% 50% 75%
#> (Intercept) 60.5 1.1 58.4 59.8 60.5 61.3
#> heightC 0.6 0.1 0.3 0.5 0.6 0.7
#> sexM 7.9 1.7 4.6 6.8 7.9 9.1
#> heightC:sexM 0.4 0.2 0.0 0.2 0.4 0.5
#> sigma 8.1 0.4 7.3 7.8 8.0 8.3
#> mean_PPD 65.3 0.8 63.7 64.7 65.3 65.8
#> log-posterior -708.7 1.6 -712.6 -709.5 -708.3 -707.5
#> 97.5%
#> (Intercept) 62.8
#> heightC 0.9
#> sexM 11.5
#> heightC:sexM 0.8
#> sigma 8.9
#> mean_PPD 66.9
#> log-posterior -706.5
#>
#> Diagnostics:
#> mcse Rhat n_eff
#> (Intercept) 0.0 1.0 3466
#> heightC 0.0 1.0 3154
#> sexM 0.0 1.0 3800
#> heightC:sexM 0.0 1.0 3378
#> sigma 0.0 1.0 4000
#> mean_PPD 0.0 1.0 4000
#> log-posterior 0.0 1.0 1631
#>
#> For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
mean_PPD is the predicted value for a completely average observation.
Here is what classical linear regression does.
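As a point of comparison, here is a sketch using the classical glm fit m from earlier: single point estimates and Wald confidence intervals, rather than a whole posterior distribution.

# Classical point estimates and 95% Wald confidence intervals
coef(m)
confint.default(m)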
Coerce to a data-frame. Columns are parameters. One row per posterior sample.
samples <- stan_model %>% as.data.frame() %>% tbl_df()
samples
#> # A tibble: 4,000 × 5
#> `(Intercept)` heightC sexM `heightC:sexM`
#> <dbl> <dbl> <dbl> <dbl>
#> 1 59.98907 0.3835406 6.851701 0.6939228
#> 2 58.50823 0.3593057 10.740850 0.5377965
#> 3 58.33645 0.3789028 10.672112 0.6011691
#> 4 59.32021 0.5741882 7.223488 0.6123673
#> 5 60.54385 0.6045105 6.134000 0.4558503
#> 6 61.66087 0.7953321 8.635995 0.1318412
#> 7 59.77242 0.4595197 7.204991 0.6026217
#> 8 59.86933 0.5718482 8.937060 0.4580088
#> 9 60.00049 0.6464913 9.170346 0.2593093
#> 10 61.21063 0.6913863 6.535622 0.4029219
#> # ... with 3,990 more rows, and 1 more variables:
#> # sigma <dbl>
Any stats that can describe a distribution can describe the model’s parameters now. Mean, median, skew, quantiles, etc.
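For example, here is a sketch of a few descriptive summaries of the intercept samples; they should land near the 60.5 reported in the model summary above.

# Summarize the posterior samples for the intercept
mean(samples$`(Intercept)`)
median(samples$`(Intercept)`)
quantile(samples$`(Intercept)`, probs = c(.05, .5, .95))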
If we believe there is a “true” value for a parameter, there is 90% probability that this “true” value is in the 90% interval, given our model, prior information, and the data.
The 90% interval contains the middle 90% of the parameter values.
There is a 5% chance, says the model, that the height parameter that generated the data is below the 5% quantile.
posterior_interval(stan_model)
#> 5% 95%
#> (Intercept) 58.68571182 62.4163500
#> heightC 0.38525725 0.8541364
#> sexM 5.10244125 10.8269287
#> heightC:sexM 0.05005538 0.6942457
#> sigma 7.39143953 8.7501012
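posterior_interval() is essentially just taking quantiles of the posterior samples, so a hand-rolled version (a sketch) should reproduce the table above up to simulation noise.

# Central 90% intervals computed directly from the posterior samples
t(apply(samples, 2, quantile, probs = c(.05, .95)))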
# This is where I toured launch_shinystan(stan_model)
# and did some other stuff.
The models provide intuitive results.
Bayesian models quantify uncertainty.
Bayesian models incorporate prior information.
Bayesian models are flexible.
Bayesian models have computational benefits.
It’s different.
More work before and after modeling.
It takes longer.
It’s not a cure-all. There are still insecurities.
These are some older slides on good resources for learning about Bayesian statistics.
https://cdn.rawgit.com/tjmahr/MadR_RStanARM/master/04-learning-more-rpubs.html