Last updated: 2018-06-18
This vignette shows how flashr can be used to learn something about the covariance structure of a dataset.
Recall that flashr fits the model \[\begin{aligned}
Y &= X + E \\
X &= LF'
\end{aligned}\] where \(Y\) (the “observed” data) and \(X\) (the “true” effects) are \(n \times p\) matrices, \(L\) is an \(n \times k\) matrix of loadings, \(F\) is a \(p \times k\) matrix of factors, and \(E\) is an \(n \times p\) matrix of residuals. Denote the columns of \(L\) as \(\ell_1, \dots, \ell_k\) and the columns of \(F\) as \(f_1, \ldots, f_k\).
Notice that the fitted model not only tells us how the elements of \(L\) and \(F\) are distributed; it also tells us something about how the elements of \(X\) covary.
In particular, if \(L\) is regarded as fixed (by, for example, fixing each \(L_{ij}\) at its posterior mean), then one may write \[ X_{:, j} = F_{j 1} \ell_1 + \ldots + F_{j k} \ell_k \] so that, if the loadings \(\ell_1, \ldots, \ell_k\) are orthogonal (in general, they are not), \[ \begin{aligned} \text{Cov} (X_{:, j}) &= (\mathbb{E} F_{j1}^2) \ell_1 \ell_1' + \ldots + (\mathbb{E} F_{jk}^2) \ell_k \ell_k' \end{aligned}\]
In other words, the covariance of the columns of \(X\) will be a linear combination (or, more precisely, a conical combination) of the rank-one covariance matrices \(\ell_i \ell_i'\).
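As a small illustration (the loadings and prior variances below are arbitrary choices for this sketch, not taken from the analysis later in the vignette), one can construct such a conical combination directly:
# Two hypothetical loadings for n = 5 conditions:
l1 <- c(1, 0, 0, 0, 0)      # a condition-specific loading
l2 <- rep(1, 5) / sqrt(5)   # a loading shared across all conditions
tau1 <- 2                   # prior standard deviations for the two factors
tau2 <- 1
# Conical combination of the rank-one matrices l1 l1' and l2 l2':
Sigma <- tau1^2 * tcrossprod(l1) + tau2^2 * tcrossprod(l2)
Sigma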
One can take this idea a bit further by recalling that the priors on factors \(f_1, \ldots, f_k\) are mixture distributions. For simplicity, assume that the priors are point-normal distributions \[f_m \sim (1 - \pi_m) \delta_0 + \pi_m N(0, \tau_m^2)\] with \(0 \le \pi_m \le 1\).
By augmenting the data with hidden variables \(I_{jm}\) that indicate the mixture components from which the elements \(F_{jm}\) are drawn, one may equivalently write \[\begin{aligned} F_{jm} \mid I_{jm} = 0 &\sim \delta_0 \\ F_{jm} \mid I_{jm} = 1 &\sim N(0, \tau_m^2) \\ I_{jm} &\sim \text{Bernoulli}(\pi_m) \end{aligned}\] so that \(I_{jm} = 1\) if gene \(j\) is “active” for factor \(m\) and \(I_{jm} = 0\) otherwise.
Now, if one regards the variables \(I_{jm}\) (as well as \(L\)) as fixed, then one obtains \[ \begin{aligned} \text{Cov} (X_{:, j}) &= I_{j1} \tau_1^2 \ell_1 \ell_1' + \ldots + I_{jk} \tau_k^2 \ell_k \ell_k' \end{aligned}\]
In other words, depending on the values of \(I_{j1}, \ldots, I_{jk}\), the elements of column \(X_{:, j}\) will covary in one of \(2^k\) possible ways.
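For example, with \(k = 2\) factors there are four possible covariance structures, \[ \text{Cov} (X_{:, j}) \in \left\{ 0,\ \tau_1^2 \ell_1 \ell_1',\ \tau_2^2 \ell_2 \ell_2',\ \tau_1^2 \ell_1 \ell_1' + \tau_2^2 \ell_2 \ell_2' \right\}, \] corresponding to \((I_{j1}, I_{j2}) = (0, 0), (1, 0), (0, 1), (1, 1)\).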
In fact, if the priors on factors \(f_1, \ldots, f_k\) are arbitrary mixtures of normals (including the point-normal priors discussed above), then taking \(L\) as fixed (and marginalizing over the indicator variables) results in the columns of \(X\) being distributed as a mixture of multivariate normals. In the point-normal case, \[ X_{:, j} \sim \sum_{r = 1}^{2^k} w_r N_n(0, \Sigma_r), \] where the weights \(w_r\) are products of the probabilities \(\pi_m\) and \(1 - \pi_m\), and each \(\Sigma_r\) is a conical combination of the rank-one covariance matrices \(\ell_1 \ell_1', \ldots, \ell_k \ell_k'\).
mashr (see here) similarly models \(X\) as a mixture of multivariate normals, so it makes sense to attempt to use flashr (which is, on its face, much simpler than mashr) to attain similar ends.
In addition to a set of covariance matrices that are fit to the data, mashr includes a set of “canonical” covariance matrices. For example, it is reasonable to expect that some effects will be unique to a single condition \(i\). For the corresponding columns of \(X\), the covariance matrix will have a single non-zero entry (namely, the \(i\)th diagonal entry \(\Sigma_{ii}\)).
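To make this concrete, with \(n = 3\) conditions an effect of size \(\sigma\) that appears only in condition 1 has covariance \[ \Sigma = \sigma^2 e_1 e_1' = \begin{pmatrix} \sigma^2 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \] where \(e_1\) is the first standard basis vector. This is itself a rank-one matrix \(\ell \ell'\) built from a one-hot loading \(\ell = e_1\).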
One can accommodate such effects in flashr by adding \(n\) fixed one-hot vectors as loadings (one for each condition). In other words, we can add “canonical” covariance structures corresponding to effects that are unique to a single condition by fitting the model \[ Y = \begin{pmatrix} L & I_n \end{pmatrix} F' + E. \]
The recommended procedure is as follows:
1. Create a flash data object, using the parameter S to specify the standard errors.
2. Greedily add the “data-driven” loadings to the flash fit object, using the option var_type = "zero" to indicate that the standard errors should be regarded as fixed.
3. Add the \(n\) fixed one-hot loading vectors to the flash fit object (one for each condition).
4. Refine the flash fit object via backfitting, again with var_type = "zero". The option nullcheck = FALSE should also be used so that the canonical covariance structures are retained.
The first code chunk simulates a \(5 \times 400\) data matrix in which a quarter of the columns \(X_{:, j}\) are null across all conditions, a quarter are nonnull in the first condition only, a quarter are nonnull in all conditions with effect sizes that are independent across conditions, and a quarter are nonnull in all conditions with an effect size that is identical across conditions.
set.seed(1)
n <- 5
p <- 400
ncond <- n                  # number of conditions
nsamp <- as.integer(p / 4)  # number of columns of each effect type
# Null effects:
X.zero = matrix(0, nrow=ncond, ncol=nsamp)
# Effects that occur only in condition 1:
X.one = X.zero
b2 = 5 * rnorm(nsamp)
X.one[1,] = b2
# Independent effects:
X.ind = matrix(5 * rnorm(ncond*nsamp), nrow=ncond, ncol=nsamp)
b = 5 * rnorm(nsamp)
# Effects that are identical across conditions:
X.ident = matrix(rep(b, ncond), nrow=ncond, ncol=nsamp, byrow=T)
X = cbind(X.zero, X.one, X.ind, X.ident)
E = matrix(rnorm(4*ncond*nsamp), nrow=ncond, ncol=4*nsamp)
Y = X + E
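As a quick sanity check, the simulated data matrix should have one row per condition and one column per sample:
dim(Y)  # 5 conditions, 400 samples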
The next code chunk fits a flash object using the procedure described in the previous section:
library(flashr)
# devtools::load_all("/Users/willwerscheid/GitHub/flashr2/")  # development version used for the original run
# Set up the flash data object, specifying fixed standard errors S = 1:
data <- flash_set_data(Y, S = 1)
# Greedily add up to 10 data-driven loadings; var_type = "zero" treats the standard errors as fixed:
fl_greedy <- flash_add_greedy(data, Kmax = 10, var_type = "zero")
# Add n fixed one-hot loading vectors (the columns of the identity matrix):
I_n <- diag(rep(1, n))
fl_fixedl <- flash_add_fixed_l(data, I_n, fl_greedy)
# Backfit, retaining the fixed loadings (nullcheck = FALSE):
fl_backfit <- flash_backfit(data, fl_fixedl, var_type = "zero", nullcheck = FALSE)
Finally, the following code chunk computes the mean squared error of the flash posterior means, relative to the mean squared error that would be obtained by simply estimating \(X\) with the observed data matrix \(Y\).
baseline_mse <- sum((Y - X)^2)/(n * p)
fl_preds <- flash_get_lf(fl_backfit)
flash_mse <- sum((fl_preds - X)^2)/(n * p)
flash_mse / baseline_mse
[1] 0.5224411
To verify that flashr has in fact learned something about how the data covaries, one can collapse the data into a vector and use ashr to fit a prior that is a univariate mixture of normals.
ash_fit <- ashr::ash(betahat = as.vector(Y), sebetahat = 1)
ash_pm <- ashr::get_pm(ash_fit)
ash_preds <- matrix(ash_pm, nrow=n, ncol=p)
ash_mse <- sum((ash_preds - X)^2)/(n * p)
ash_mse / baseline_mse
[1] 0.7572497
So, the flashr estimates are much better, even though flashr uses point-normal priors instead of the more flexible class of unimodal priors used by ashr.
For this particular example, mashr outperforms flashr in terms of MSE:
library(mashr)
data <- mash_set_data(t(Y))
# Canonical covariance matrices:
U.c = cov_canonical(data)
# Identify strong signals via condition-by-condition analyses:
m.1by1 <- mash_1by1(data)
strong <- get_significant_results(m.1by1, 0.05)
# Data-driven covariance matrices: PCA initializations refined by extreme deconvolution:
U.pca <- cov_pca(data, 5, strong)
U.ed <- cov_ed(data, U.pca, strong)
# Fit mash using both canonical and data-driven covariances:
U <- c(U.c, U.ed)
m <- mash(data, U)
mash_mse <- sum((t(get_pm(m)) - X)^2)/(n * p)
mash_mse / baseline_mse
[1] 0.4161114
However, this is not always the case, and the two methods often perform similarly. See the simulation study here and a comparison using GTEx data here.
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mashr_0.2-7 ashr_2.2-7 flashr_0.5-8
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 pillar_1.2.1
[3] plyr_1.8.4 compiler_3.4.3
[5] git2r_0.21.0 workflowr_1.0.1
[7] R.methodsS3_1.7.1 R.utils_2.6.0
[9] iterators_1.0.9 tools_3.4.3
[11] testthat_2.0.0 digest_0.6.15
[13] tibble_1.4.2 evaluate_0.10.1
[15] memoise_1.1.0 gtable_0.2.0
[17] lattice_0.20-35 rlang_0.2.0
[19] Matrix_1.2-12 foreach_1.4.4
[21] commonmark_1.4 yaml_2.1.17
[23] parallel_3.4.3 mvtnorm_1.0-7
[25] ebnm_0.1-11 withr_2.1.1.9000
[27] stringr_1.3.0 roxygen2_6.0.1.9000
[29] xml2_1.2.0 knitr_1.20
[31] REBayes_1.2 devtools_1.13.4
[33] rprojroot_1.3-2 grid_3.4.3
[35] R6_2.2.2 rmarkdown_1.8
[37] rmeta_3.0 ggplot2_2.2.1
[39] magrittr_1.5 whisker_0.3-2
[41] backports_1.1.2 scales_0.5.0
[43] codetools_0.2-15 htmltools_0.3.6
[45] MASS_7.3-48 assertthat_0.2.0
[47] softImpute_1.4 colorspace_1.3-2
[49] stringi_1.1.6 Rmosek_7.1.3
[51] lazyeval_0.2.1 munsell_0.4.3
[53] doParallel_1.0.11 pscl_1.5.2
[55] truncnorm_1.0-8 SQUAREM_2017.10-1
[57] ExtremeDeconvolution_1.3 R.oo_1.21.0
This reproducible R Markdown analysis was created with workflowr 1.0.1