Last updated: 2018-06-18
This vignette shows how flashr can be used to learn something about the covariance structure of a dataset.
Recall that flashr fits the model \[\begin{aligned}
Y &= X + E \\
X &= LF'
\end{aligned}\] where \(Y\) (the “observed” data) and \(X\) (the “true” effects) are \(n \times p\) matrices, \(L\) is an \(n \times k\) matrix of loadings, \(F\) is a \(p \times k\) matrix of factors, and \(E\) is an \(n \times p\) matrix of residuals. Denote the columns of \(L\) as \(\ell_1, \dots, \ell_k\) and the columns of \(F\) as \(f_1, \ldots, f_k\).
Notice that the fitted model not only tells us how the elements of \(L\) and \(F\) are distributed; it also tells us something about how the elements of \(X\) covary.
In particular, if \(L\) is regarded as fixed (by, for example, fixing each \(L_{ij}\) at its posterior mean), then one may write \[ X_{:, j} = F_{j 1} \ell_1 + \ldots + F_{j k} \ell_k \] so that, if the loadings \(\ell_1, \ldots, \ell_k\) are orthogonal (in general, they are not), \[ \begin{aligned} \text{Cov} (X_{:, j}) &= (\mathbb{E} F_{j1}^2) \ell_1 \ell_1' + \ldots + (\mathbb{E} F_{jk}^2) \ell_k \ell_k' \end{aligned}\]
In other words, the covariance of the columns of \(X\) will be a linear combination (or, more precisely, a conical combination) of the rank-one covariance matrices \(\ell_i \ell_i'\).
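As a small illustration (the loadings and prior variances below are arbitrary choices for this sketch, not taken from the analysis later in the vignette), one can construct such a conical combination directly:
# Two hypothetical loadings for n = 5 conditions:
l1 <- c(1, 0, 0, 0, 0)      # a condition-specific loading
l2 <- rep(1, 5) / sqrt(5)   # a loading shared across all conditions
tau1 <- 2                   # prior standard deviations for the two factors
tau2 <- 1
# Conical combination of the rank-one matrices l1 l1' and l2 l2':
Sigma <- tau1^2 * tcrossprod(l1) + tau2^2 * tcrossprod(l2)
Sigma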
One can take this idea a bit further by recalling that the priors on factors \(f_1, \ldots, f_k\) are mixture distributions. For simplicity, assume that the priors are point-normal distributions \[f_m \sim (1 - \pi_m) \delta_0 + \pi_m N(0, \tau_m^2)\] with \(0 \le \pi_m \le 1\).
By augmenting the data with hidden variables \(I_{jm}\) that indicate the mixture components from which the elements \(F_{jm}\) are drawn, one may equivalently write \[\begin{aligned} F_{jm} \mid I_{jm} = 0 &\sim \delta_0 \\ F_{jm} \mid I_{jm} = 1 &\sim N(0, \tau_m^2) \\ I_{jm} &\sim \text{Bernoulli}(\pi_m) \end{aligned}\] so that \(I_{jm} = 1\) if gene \(j\) is “active” for factor \(m\) and \(I_{jm} = 0\) otherwise.
Now, if one regards the variables \(I_{jm}\) (as well as \(L\)) as fixed, then one obtains \[ \begin{aligned} \text{Cov} (X_{:, j}) &= I_{j1} \tau_1^2 \ell_1 \ell_1' + \ldots + I_{jk} \tau_k^2 \ell_k \ell_k' \end{aligned}\]
In other words, depending on the values of \(I_{j1}, \ldots, I_{jk}\), the elements of column \(X_{:, j}\) will covary in one of \(2^k\) possible ways.
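For example, with \(k = 2\) factors there are four possible covariance structures, \[ \text{Cov} (X_{:, j}) \in \left\{ 0,\ \tau_1^2 \ell_1 \ell_1',\ \tau_2^2 \ell_2 \ell_2',\ \tau_1^2 \ell_1 \ell_1' + \tau_2^2 \ell_2 \ell_2' \right\}, \] corresponding to \((I_{j1}, I_{j2}) = (0, 0), (1, 0), (0, 1), (1, 1)\).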
In fact, if the priors on factors \(f_1, \ldots, f_k\) are arbitrary mixtures of normals (including the point-normal priors discussed above), then taking \(L\) as fixed (and marginalizing over the indicator variables) results in the columns of \(X\) being distributed as a mixture of multivariate normals. In the point-normal case, \[ X_{:, j} \sim \sum_{r = 1}^{2^k} w_r N_n(0, \Sigma_r), \] where the weights \(w_r\) are products of the probabilities \(\pi_m\) and \(1 - \pi_m\), and each \(\Sigma_r\) is a conical combination of the rank-one covariance matrices \(\ell_1 \ell_1', \ldots, \ell_k \ell_k'\).
mashr (see here) similarly models \(X\) as a mixture of multivariate normals, so it makes sense to attempt to use flashr (which is, on its face, much simpler than mashr) to attain similar ends.
In addition to a set of covariance matrices that are fit to the data, mashr includes a set of “canonical” covariance matrices. For example, it is reasonable to expect that some effects will be unique to a single condition \(i\). For the corresponding columns of \(X\), the covariance matrix will have a single non-zero entry (namely, the \(i\)th diagonal entry \(\Sigma_{ii}\)).
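To make this concrete, with \(n = 3\) conditions an effect of size \(\sigma\) that appears only in condition 1 has covariance \[ \Sigma = \sigma^2 e_1 e_1' = \begin{pmatrix} \sigma^2 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \] where \(e_1\) is the first standard basis vector. This is itself a rank-one matrix \(\ell \ell'\) built from a one-hot loading \(\ell = e_1\).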
One can accommodate such effects in flashr by adding \(n\) fixed one-hot vectors as loadings (one for each condition). In other words, we can add “canonical” covariance structures corresponding to effects that are unique to a single condition by fitting the model \[ Y = \begin{pmatrix} L & I_n \end{pmatrix} F' + E. \]
The recommended procedure is as follows:
1. Create a flash data object, using the parameter S to specify the standard errors.
2. Greedily add the “data-driven” loadings to the flash fit object, using the option var_type = "zero" to indicate that the standard errors should be regarded as fixed.
3. Add the \(n\) fixed one-hot loading vectors to the flash fit object (one for each condition).
4. Refine the flash fit object via backfitting, again with var_type = "zero". The option nullcheck = FALSE should also be used so that the canonical covariance structures are retained.
The first code chunk simulates a \(5 \times 400\) data matrix in which a quarter of the columns \(X_{:, j}\) are null across all conditions, a quarter are nonnull in the first condition only, a quarter are nonnull in all conditions with effect sizes that are independent across conditions, and a quarter are nonnull in all conditions with an effect size that is identical across conditions.
set.seed(1)
n <- 5
p <- 400
ncond <- n                  # number of conditions
nsamp <- as.integer(p / 4)  # number of columns of each effect type
# Null effects:
X.zero = matrix(0, nrow=ncond, ncol=nsamp)
# Effects that occur only in condition 1:
X.one = X.zero
b2 = 5 * rnorm(nsamp)
X.one[1,] = b2
# Independent effects:
X.ind = matrix(5 * rnorm(ncond*nsamp), nrow=ncond, ncol=nsamp)
b = 5 * rnorm(nsamp)
# Effects that are identical across conditions:
X.ident = matrix(rep(b, ncond), nrow=ncond, ncol=nsamp, byrow=T)
X = cbind(X.zero, X.one, X.ind, X.ident)
E = matrix(rnorm(4*ncond*nsamp), nrow=ncond, ncol=4*nsamp)
Y = X + E
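As a quick sanity check, the simulated data matrix should have one row per condition and one column per sample:
dim(Y)  # 5 conditions, 400 samples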
The next code chunk fits a flash object using the procedure described in the previous section:
library(flashr)
# devtools::load_all("/Users/willwerscheid/GitHub/flashr2/")  # development version used for the original run
# Set up the flash data object, specifying fixed standard errors S = 1:
data <- flash_set_data(Y, S = 1)
# Greedily add up to 10 data-driven loadings; var_type = "zero" treats the standard errors as fixed:
fl_greedy <- flash_add_greedy(data, Kmax = 10, var_type = "zero")
# Add n fixed one-hot loading vectors (the columns of the identity matrix):
I_n <- diag(rep(1, n))
fl_fixedl <- flash_add_fixed_l(data, I_n, fl_greedy)
# Backfit, retaining the fixed loadings (nullcheck = FALSE):
fl_backfit <- flash_backfit(data, fl_fixedl, var_type = "zero", nullcheck = FALSE)
Finally, the following code chunk computes the mean squared error of the flash posterior means, relative to the mean squared error that would be obtained by simply estimating \(X\) with the observed data matrix \(Y\).
baseline_mse <- sum((Y - X)^2)/(n * p)
fl_preds <- flash_get_lf(fl_backfit)
flash_mse <- sum((fl_preds - X)^2)/(n * p)
flash_mse / baseline_mse
[1] 0.5224411
To verify that flashr has in fact learned something about how the data covaries, one can collapse the data into a vector and use ashr to fit a prior that is a univariate mixture of normals.
ash_fit <- ashr::ash(betahat = as.vector(Y), sebetahat = 1)
ash_pm <- ashr::get_pm(ash_fit)
ash_preds <- matrix(ash_pm, nrow=n, ncol=p)
ash_mse <- sum((ash_preds - X)^2)/(n * p)
ash_mse / baseline_mse
[1] 0.7572497
So, the flashr estimates are much better, even though flashr uses point-normal priors instead of the more flexible class of unimodal priors used by ashr.
For this particular example, mashr outperforms flashr in terms of MSE:
library(mashr)
data <- mash_set_data(t(Y))
# Canonical covariance matrices:
U.c = cov_canonical(data)
# Identify strong signals via condition-by-condition analyses:
m.1by1 <- mash_1by1(data)
strong <- get_significant_results(m.1by1, 0.05)
# Data-driven covariance matrices: PCA initializations refined by extreme deconvolution:
U.pca <- cov_pca(data, 5, strong)
U.ed <- cov_ed(data, U.pca, strong)
# Fit mash using both canonical and data-driven covariances:
U <- c(U.c, U.ed)
m <- mash(data, U)
mash_mse <- sum((t(get_pm(m)) - X)^2)/(n * p)
mash_mse / baseline_mse
[1] 0.4161114
However, this is not always the case, and the two methods often perform similarly. See the simulation study here and a comparison using GTEx data here.
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mashr_0.2-7 ashr_2.2-7 flashr_0.5-8
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 pillar_1.2.1
[3] plyr_1.8.4 compiler_3.4.3
[5] git2r_0.21.0 workflowr_1.0.1
[7] R.methodsS3_1.7.1 R.utils_2.6.0
[9] iterators_1.0.9 tools_3.4.3
[11] testthat_2.0.0 digest_0.6.15
[13] tibble_1.4.2 evaluate_0.10.1
[15] memoise_1.1.0 gtable_0.2.0
[17] lattice_0.20-35 rlang_0.2.0
[19] Matrix_1.2-12 foreach_1.4.4
[21] commonmark_1.4 yaml_2.1.17
[23] parallel_3.4.3 mvtnorm_1.0-7
[25] ebnm_0.1-11 withr_2.1.1.9000
[27] stringr_1.3.0 roxygen2_6.0.1.9000
[29] xml2_1.2.0 knitr_1.20
[31] REBayes_1.2 devtools_1.13.4
[33] rprojroot_1.3-2 grid_3.4.3
[35] R6_2.2.2 rmarkdown_1.8
[37] rmeta_3.0 ggplot2_2.2.1
[39] magrittr_1.5 whisker_0.3-2
[41] backports_1.1.2 scales_0.5.0
[43] codetools_0.2-15 htmltools_0.3.6
[45] MASS_7.3-48 assertthat_0.2.0
[47] softImpute_1.4 colorspace_1.3-2
[49] stringi_1.1.6 Rmosek_7.1.3
[51] lazyeval_0.2.1 munsell_0.4.3
[53] doParallel_1.0.11 pscl_1.5.2
[55] truncnorm_1.0-8 SQUAREM_2017.10-1
[57] ExtremeDeconvolution_1.3 R.oo_1.21.0
This reproducible R Markdown analysis was created with workflowr 1.0.1