Last updated: 2018-09-22
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(20180714)
The command set.seed(20180714)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 5bfea2c
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: docs/.DS_Store
Ignored: docs/figure/.DS_Store
Untracked files:
Untracked: analysis/binary_data.Rmd
Untracked: data/greedy19.rds
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
In a previous analysis, I used FLASH to obtain a nonnegative matrix factorization of the GTEx donation matrix. There, I proceeded as if the observations were normally distributed. Here, I explore a more sophisticated approach to count data.
I assume that the data is generated as \[ Y \sim \text{Poisson}(\mu), \] where \[ \log(\mu) = LF' = \sum_{k = 1}^K l_k f_k', \] with an ASH prior on the entries of each loading \(l_k\) and each factor \(f_k\) \[ l_k \sim g_{l_k},\ f_k \sim g_{f_k}. \]
To account for overdispersion, one can also fit the model \[ \log(\mu) = LF' + E, \] where the entries of \(E\) are i.i.d. \(N(0, \sigma^2)\) and \(\sigma^2\) is to be estimated.
The general idea is outlined by Matthew Stephens here.
To apply the idea to FLASH, I replace the part of the objective function that comes from the residuals with the Poisson log likelihood \[ \ell(\mu) = \sum_{i, j} -\mu_{ij} + Y_{ij} \log \mu_{ij}. \] Set \(\eta_{ij} = \log \mu_{ij}\) so that \[ \ell(\eta) = \sum_{i, j} -e^{\eta_{ij}} + Y_{ij} \eta_{ij}. \]
Now do a second-order Taylor expansion of \(\ell(\eta)\) around some (for now, arbitrary) point \(\eta^\star\): \[ \ell(\eta) = \ell(\eta^\star) + (\eta - \eta^\star) \ell'(\eta^\star) + \frac{(\eta - \eta^\star)^2}{2} \ell''(\eta^\star). \]
This is precisely the log likelihood for the distribution \[ \eta \sim \text{Normal}\left(\eta^\star - \frac{\ell'(\eta^\star)}{\ell''(\eta^\star)}, -\frac{1}{\ell''(\eta^\star)}\right). \]
Thus, given a good enough choice of \(\eta^\star\), we should be able to run FLASH on the “pseudo-data” \[ X = \eta^\star - \frac{\ell'(\eta^\star)}{\ell''(\eta^\star)} = \log(\mu^\star) + \frac{Y - \mu^\star}{\mu^\star}\] with “standard errors” \[S = -\frac{1}{\sqrt{\ell''(\mu^\star)}} = \frac{1}{\sqrt{\mu^\star}}, \] where \(\mu^\star = e^{\eta^\star}\). To fit the model \(\log(\mu) = LF'\), one fixes the standard errors by setting var_type = "zero"
. To fit the overdispersed model \(\log(\mu) = LF' + E\), one sets var_type = "constant"
so that each observation has variance \(S_{ij}^2 + \sigma^2\). For purposes of illustration, I take var_type = "zero"
as the default setting. I do so because I have taken a shortcut in calculating the Poisson log likelihood by simply fixing \(\log(\mu)\) at the posterior mean of \(LF'\); this approximation is much worse when there is an additional “error term” \(E\).
The crux in all of this is to choose an \(\eta^\star\) that is as close as possible to the \(\eta\) that minimizes the objective function. An iterative procedure suggests itself whereby one fits FLASH to the pseudo-data generated by a given choice of \(\eta^\star\), then updates \(\eta^\star\) to be the posterior mean given by the FLASH fit, then fits FLASH to the pseudo-data generated by the new \(\eta^\star\), and so on until convergence.
First I load the data:
devtools::load_all("~/GitHub/flashr")
#> Loading flashr
devtools::load_all("~/GitHub/ebnm")
#> Loading ebnm
raw <- read.csv("https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SampleAttributesDS.txt",
header=TRUE, sep='\t')
data <- raw[, c("SAMPID", "SMTSD")] # sample ID, tissue type
# Extract donor ID:
tmp <- strsplit(as.character(data$SAMPID), "-")
data$SAMPID <- as.factor(sapply(tmp, function(x) {x[[2]]}))
names(data) <- c("DonorID", "TissueType")
data <- suppressMessages(reshape2::acast(data, TissueType ~ DonorID))
missing.tissues <- c(1, 8, 9, 20, 21, 24, 26, 27, 33, 36, 39)
data <- data[-missing.tissues, ]
# Drop columns with no samples:
data = data[, colSums(data) > 0]
gtex.colors <- read.table("https://github.com/stephenslab/gtexresults/blob/master/data/GTExColors.txt?raw=TRUE",
sep = '\t', comment.char = '')
gtex.colors <- gtex.colors[-c(7, 8, 19, 20, 24, 25, 31, 34, 37), 2]
gtex.colors <- as.character(gtex.colors)
I will use the following functions to fit the model:
# Computing Poisson likelihood and FLASH ELBO --------------------------
pois_lik <- function(the_data, fl) {
return(sum(-exp(fl$fitted_values) + the_data * fl$fitted_values))
}
pois_obj <- function(the_data, fl) {
return(pois_lik(the_data, fl) + sum(unlist(fl$fit$KL_l)) +
sum(unlist(fl$fit$KL_f)))
}
# Calculating pseudo-data ----------------------------------------------
calc_X <- function(the_data, mu) {
return(log(mu) + (the_data - mu) / mu)
}
calc_S <- function(the_data, mu) {
return(1 / sqrt(mu))
}
set_pseudodata <- function(the_data, mu) {
return(flash_set_data(calc_X(the_data, mu), S = calc_S(the_data, mu)))
}
# Setting FLASH parameters ---------------------------------------------
# Initialization function for nonnegative loadings
# (but arbitrary factors):
my_init_fn <- function(Y, K = 1) {
ret = udv_svd(Y, K)
sum_pos = sum(ret$u[ret$u > 0]^2)
sum_neg = sum(ret$u[ret$u < 0]^2)
if (sum_neg > sum_pos) {
return(list(u = -ret$u, d = ret$d, v = -ret$v))
} else
return(ret)
}
get_init_fn <- function(nonnegative = FALSE) {
if (nonnegative) {
return("my_init_fn")
} else {
return("udv_svd")
}
}
get_ebnm_fn <- function(nonnegative = FALSE) {
if (nonnegative) {
return(list(l = "ebnm_ash", f = "ebnm_pn"))
} else {
return(list(l = "ebnm_pn", f = "ebnm_pn"))
}
}
get_ebnm_param <- function(nonnegative = FALSE) {
if (nonnegative) {
return(list(l = list(mixcompdist = "+uniform"),
f = list(warmstart = TRUE)))
} else {
return(list(l = list(warmstart = TRUE),
f = list(warmstart = TRUE)))
}
}
# Initializing mu and running FLASH ------------------------------------
init_mu <- function(the_data, f_init) {
if (is.null(f_init)) {
return(matrix(colMeans(the_data),
nrow = nrow(the_data), ncol = ncol(the_data),
byrow = TRUE))
} else {
return(exp(f_init$fitted_values))
}
}
greedy_iter <- function(the_data, mu, f_init, niter,
nonnegative = FALSE, var_type = "zero") {
suppressWarnings(
flash_greedy_workhorse(set_pseudodata(the_data, mu),
Kmax = 1,
f_init = f_init,
var_type = var_type,
ebnm_fn = get_ebnm_fn(nonnegative),
ebnm_param = get_ebnm_param(nonnegative),
init_fn = get_init_fn(nonnegative),
verbose_output = "",
nullcheck = FALSE,
maxiter = niter)
)
}
backfit_iter <- function(the_data, mu, f_init, kset, niter,
nonnegative = FALSE, var_type = "zero") {
suppressWarnings(
flash_backfit_workhorse(set_pseudodata(the_data, mu),
kset = kset,
f_init = f_init,
var_type = var_type,
ebnm_fn = get_ebnm_fn(nonnegative),
ebnm_param = get_ebnm_param(nonnegative),
verbose_output = "",
nullcheck = FALSE,
maxiter = niter)
)
}
run_one_fit <- function(the_data, f_init, greedy, maxiter = 200,
n_subiter = 200, nonnegative = FALSE,
var_type = "zero",
verbose = TRUE, tol = .01) {
mu <- init_mu(the_data, f_init)
if (greedy) {
fl <- greedy_iter(the_data, mu, f_init, n_subiter,
nonnegative, var_type)
kset <- ncol(fl$fit$EL) # Only "backfit" the greedily added factor
mu <- exp(fl$fitted_values)
} else {
fl <- f_init
kset <- 1:ncol(fl$fit$EL) # Backfit all factor/loadings
}
# The objective can get stuck oscillating between two values, so we
# need to track the last two values attained:
old_old_obj <- -Inf
old_obj <- -Inf
diff <- Inf
iter <- 0
while (diff > tol && iter < maxiter) {
iter <- iter + 1
fl <- backfit_iter(the_data, mu, fl, kset, n_subiter,
nonnegative, var_type)
mu <- exp(fl$fitted_values)
new_obj <- pois_obj(the_data, fl)
diff <- min(abs(new_obj - old_obj), abs(new_obj - old_old_obj))
old_old_obj <- old_obj
old_obj <- new_obj
if (verbose) {
message("Iteration ", iter, ": ", old_obj)
}
}
return(fl)
}
I greedily add a factor by initializing \(\mu^\star\) to the column means of the data, then alternating between running FLASH on the “pseudo-data” corresponding to the current \(\mu^\star\) and updating \(\mu^\star\) to be the posterior mean of the current FLASH fit:
fl <- run_one_fit(data, f_init = NULL, greedy = TRUE)
#> Iteration 1: -927714082346.181
#> Iteration 2: -341286945918.445
#> Iteration 3: -125552459516.84
#> Iteration 4: -46188177871.3924
#> Iteration 5: -16991690774.7395
#> Iteration 6: -6250903760.78647
#> Iteration 7: -2299589270.42295
#> Iteration 8: -845982056.385945
#> Iteration 9: -311229940.405073
#> Iteration 10: -114505696.365842
#> Iteration 11: -42134944.2935817
#> Iteration 12: -15511272.9754123
#> Iteration 13: -5717001.94860467
#> Iteration 14: -2113906.45466518
#> Iteration 15: -788404.890125627
#> Iteration 16: -300778.640349495
#> Iteration 17: -121382.535730719
#> Iteration 18: -55374.158506545
#> Iteration 19: -31079.0542736632
#> Iteration 20: -22130.6357421538
#> Iteration 21: -18834.2241948271
#> Iteration 22: -17631.9197964083
#> Iteration 23: -17227.1683461842
#> Iteration 24: -17120.3909758248
#> Iteration 25: -17094.2647075556
#> Iteration 26: -17091.0177410941
#> Iteration 27: -17090.5831181879
#> Iteration 28: -17090.5851822656
So, although the initial estimate of \(\mu^\star\) is terrible, the objective seems to eventually converge.
I continue to greedily add factors in the same way:
fl <- run_one_fit(data, f_init = fl, greedy = TRUE)
#> Iteration 1: -16012.4385475068
#> Iteration 2: -15996.0987462569
#> Iteration 3: -15994.474709406
#> Iteration 4: -15994.5035110124
#> Iteration 5: -15994.522807295
#> Iteration 6: -15994.5304893364
fl <- run_one_fit(data, f_init = fl, greedy = TRUE)
#> Iteration 1: -15994.5335404488
#> Iteration 2: -15994.5347437438
Since the third factor no longer offers any improvement, I stop adding factors and backfit:
fl <- run_one_fit(data, f_init = fl, greedy = FALSE)
#> Iteration 1: -15820.0913699191
#> Iteration 2: -15644.1196467394
#> Iteration 3: -15574.356156741
#> Iteration 4: -15526.9227894188
#> Iteration 5: -15503.4303568297
#> Iteration 6: -15492.9576959489
#> Iteration 7: -15488.9547268523
#> Iteration 8: -15487.6157343197
#> Iteration 9: -15487.1185250066
#> Iteration 10: -15486.8695270274
#> Iteration 11: -15486.7522169866
#> Iteration 12: -15486.7015595336
#> Iteration 13: -15486.6810231773
#> Iteration 14: -15486.6758741976
I now examine the resulting loadings. One loading primarily represents correlation among brain (and pituitary) tissues; the other represents correlations among the remaining tissues.
plot(fl, plot_loadings = TRUE, loading_colors = gtex.colors,
loading_legend_size = 3, plot_scree = FALSE)
Version | Author | Date |
---|---|---|
8b73159 | Jason Willwerscheid | 2018-09-21 |
It is not necessary to run FLASH to convergence before updating \(\mu^\star\). At the extreme, one could do a single FLASH iteration, then update \(\mu^\star\), then run a second FLASH iteration beginning from the updated \(\mu^\star\), and so on.
In the following experiment, I vary the maximum number of iterations between updates of \(\mu^\star\) (n_subiter
) from 1 to 100 (which is typically enough to achieve convergence). In each case, I greedily add as many factors as possible and then backfit (as in the above example). I track the final objective attained and the time required to fit (in seconds).
flash_fit <- function(the_data, n_subiter, nonnegative = FALSE,
var_type = "zero", maxiter = 100, tol = .01) {
fl <- run_one_fit(the_data, f_init = NULL, greedy = TRUE,
maxiter = maxiter, n_subiter = n_subiter,
nonnegative = nonnegative, var_type = var_type,
verbose = FALSE)
old_obj <- pois_obj(the_data, fl)
# Keep greedily adding factors until the objective no longer improves:
diff <- Inf
while (diff > tol) {
fl <- run_one_fit(the_data, fl, greedy = TRUE,
maxiter = maxiter, n_subiter = n_subiter,
nonnegative = nonnegative, var_type = var_type,
verbose = FALSE)
new_obj <- pois_obj(the_data, fl)
diff <- new_obj - old_obj
old_obj <- new_obj
}
# Now backfit the whole thing:
fl <- run_one_fit(the_data, fl, greedy = FALSE,
maxiter = maxiter, n_subiter = n_subiter,
nonnegative = nonnegative, var_type = var_type,
verbose = FALSE)
return(fl)
}
n_subiters <- c(1, 2, 5, 10, 25, 50, 100)
all_t <- rep(0, length(n_subiters))
all_obj <- rep(0, length(n_subiters))
for (i in 1:length(n_subiters)) {
t <- system.time(fl <- flash_fit(data, n_subiters[i]))
all_t[i] <- t[3]
all_obj[i] <- pois_obj(data, fl)
}
df <- data.frame(n_subiter = n_subiters,
"time to fit" = all_t,
"final objective" = all_obj)
knitr::kable(df)
n_subiter | time.to.fit | final.objective |
---|---|---|
1 | 5.342 | -15536.72 |
2 | 6.120 | -15486.69 |
5 | 5.957 | -15486.61 |
10 | 7.511 | -15486.67 |
25 | 8.380 | -15486.67 |
50 | 8.816 | -15486.68 |
100 | 9.276 | -15486.68 |
Clearly, capping the number of iterations per FLASH fit reduces overall fitting time. It is possible that too few iterations produces a suboptimal fit (the objective for n_subiter = 1
is about 50 less than the other objectives), but this could be an isolated event.
Since, as I’ve argued in other analyses, putting nonnegative constraints on the loadings helps produce more interpretable factors, I repeat the above experiment with nonnegative “+uniform” priors on the loadings.
for (i in 1:length(n_subiters)) {
t <- system.time(fl <- flash_fit(data, n_subiters[i],
nonnegative = TRUE))
all_t[i] <- t[3]
all_obj[i] <- pois_obj(data, fl)
}
nonneg_df <- data.frame(n_subiter = n_subiters,
"time to fit (seconds)" = all_t,
"final objective" = all_obj)
knitr::kable(nonneg_df)
n_subiter | time.to.fit..seconds. | final.objective |
---|---|---|
1 | 18.083 | -15267.11 |
2 | 19.358 | -15278.50 |
5 | 22.609 | -15297.36 |
10 | 23.729 | -15297.36 |
25 | 24.079 | -15297.36 |
50 | 24.083 | -15297.36 |
100 | 23.991 | -15297.36 |
Again, the objective for n_subiter = 1
is much different from other settings of n_subiter
, but here it is almost 30 higher!
Finally, I refit with n_subiter = 1
and plot the resulting factors. Notice, in particular, that the fifth factor is almost exclusively loaded on female reproductive tissues. (Further rounds of greedy addition and backfitting do not yield any additional factors.)
fl <- flash_fit(data, 1, TRUE)
plot(fl, plot_loadings = TRUE, loading_colors = gtex.colors,
loading_legend_size = 3, plot_scree = FALSE)
Version | Author | Date |
---|---|---|
8b73159 | Jason Willwerscheid | 2018-09-21 |
Version | Author | Date |
---|---|---|
8b73159 | Jason Willwerscheid | 2018-09-21 |
It might be worth it to try to streamline the approach by getting a good initialization point, running FLASH a single time to update \(\mu^\star\), and then forgetting about the original data entirely.
I obtain an initialization point by running nonnegative FLASH on the data (recall that \(\mu^\star\) must be strictly positive).
t_init <- system.time(
fl <- suppressWarnings(
flash(data, var_type = "by_row", ebnm_fn = "ebnm_ash",
ebnm_param = list(mixcompdist = "+uniform"),
verbose = FALSE)
)
)
mu <- fl$fitted_values
Note that this takes about as long as the entire fitting process with n_subiter = 1
above!
round(t_init[3], digits = 2)
#> elapsed
#> 5.33
Now I run FLASH on the pseudo-data and update \(\mu^\star\):
pseudodata <- set_pseudodata(data, mu)
t_update <- system.time(
fl <- suppressWarnings(
flash(pseudodata, var_type = "zero", backfit = TRUE,
verbose = FALSE)
)
)
mu <- exp(fl$fitted_values)
round(t_update[3], digits = 2)
#> elapsed
#> 3.25
pseudodata <- set_pseudodata(data, mu)
t_final <- system.time(
fl <- suppressWarnings(
flash(pseudodata, var_type = "zero", backfit = TRUE,
verbose = FALSE)
)
)
round(t_final[3], digits = 2)
#> elapsed
#> 1.94
The objective is not much worse than the the objective attained using the more involved approach above (it is about 70 lower):
pois_obj(data, fl)
#> [1] -15588
However, the loadings tell a somewhat different story:
plot(fl, plot_loadings = TRUE, loading_colors = gtex.colors,
loading_legend_size = 3, plot_scree = FALSE)
Version | Author | Date |
---|---|---|
8b73159 | Jason Willwerscheid | 2018-09-21 |
Further, this simpler approach is slower than the above approach for all values of n_subiter
:
round(t_init[3] + t_update[3] + t_final[3], digits = 2)
#> elapsed
#> 10.52
As explained above, I chose var_type = "zero"
merely for purposes of illustration. Often, it will be more appropriate to set var_type = "constant"
. In the example under consideration, a “constant” variance structure yields more structure and more interpretable factors (note, in particular, the sex-specific fourth loading):
fl <- flash_fit(data, 1, var_type = "constant")
plot(fl, plot_loadings = TRUE, loading_colors = gtex.colors,
loading_legend_size = 3, plot_scree = FALSE)
sessionInfo()
#> R version 3.4.3 (2017-11-30)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ebnm_0.1-15 flashr_0.6-2
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_0.12.18 highr_0.6 pillar_1.2.1
#> [4] plyr_1.8.4 compiler_3.4.3 git2r_0.21.0
#> [7] workflowr_1.0.1 R.methodsS3_1.7.1 R.utils_2.6.0
#> [10] iterators_1.0.9 tools_3.4.3 testthat_2.0.0
#> [13] digest_0.6.15 tibble_1.4.2 evaluate_0.10.1
#> [16] memoise_1.1.0 gtable_0.2.0 lattice_0.20-35
#> [19] rlang_0.2.0 Matrix_1.2-12 foreach_1.4.4
#> [22] commonmark_1.4 yaml_2.1.17 parallel_3.4.3
#> [25] withr_2.1.1.9000 stringr_1.3.0 roxygen2_6.0.1.9000
#> [28] xml2_1.2.0 knitr_1.20 REBayes_1.2
#> [31] devtools_1.13.4 rprojroot_1.3-2 grid_3.4.3
#> [34] R6_2.2.2 rmarkdown_1.8 reshape2_1.4.3
#> [37] ggplot2_2.2.1 ashr_2.2-13 magrittr_1.5
#> [40] whisker_0.3-2 backports_1.1.2 scales_0.5.0
#> [43] codetools_0.2-15 htmltools_0.3.6 MASS_7.3-48
#> [46] assertthat_0.2.0 softImpute_1.4 colorspace_1.3-2
#> [49] labeling_0.3 stringi_1.1.6 Rmosek_7.1.3
#> [52] lazyeval_0.2.1 doParallel_1.0.11 pscl_1.5.2
#> [55] munsell_0.4.3 truncnorm_1.0-8 SQUAREM_2017.10-1
#> [58] R.oo_1.21.0
This reproducible R Markdown analysis was created with workflowr 1.0.1