Matrix factorization of count data

Last updated: 2018-09-22

workflowr checks: (Click a bullet for more information)

✔ R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed: set.seed(20180714)

The command set.seed(20180714) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 5bfea2c
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
```
Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    docs/.DS_Store
    Ignored:    docs/figure/.DS_Store

Untracked files:
    Untracked:  analysis/binary_data.Rmd
    Untracked:  data/greedy19.rds
```
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

Expand here to see past versions:

File	Version	Author	Date	Message
Rmd	5bfea2c	Jason Willwerscheid	2018-09-22	wflow_publish(“analysis/count_data.Rmd”)
html	8b73159	Jason Willwerscheid	2018-09-21	Build site.
Rmd	884f8fa	Jason Willwerscheid	2018-09-21	wflow_publish(“analysis/count_data.Rmd”)

Introduction

In a previous analysis, I used FLASH to obtain a nonnegative matrix factorization of the GTEx donation matrix. There, I proceeded as if the observations were normally distributed. Here, I explore a more sophisticated approach to count data.

Model

I assume that the data is generated as \[ Y \sim \text{Poisson}(\mu), \] where \[ \log(\mu) = LF' = \sum_{k = 1}^K l_k f_k', \] with an ASH prior on the entries of each loading \(l_k\) and each factor \(f_k\) \[ l_k \sim g_{l_k},\ f_k \sim g_{f_k}. \]

To account for overdispersion, one can also fit the model \[ \log(\mu) = LF' + E, \] where the entries of \(E\) are i.i.d. \(N(0, \sigma^2)\) and \(\sigma^2\) is to be estimated.

Idea

The general idea is outlined by Matthew Stephens here.

To apply the idea to FLASH, I replace the part of the objective function that comes from the residuals with the Poisson log likelihood \[ \ell(\mu) = \sum_{i, j} -\mu_{ij} + Y_{ij} \log \mu_{ij}. \] Set \(\eta_{ij} = \log \mu_{ij}\) so that \[ \ell(\eta) = \sum_{i, j} -e^{\eta_{ij}} + Y_{ij} \eta_{ij}. \]

Now do a second-order Taylor expansion of \(\ell(\eta)\) around some (for now, arbitrary) point \(\eta^\star\): \[ \ell(\eta) = \ell(\eta^\star) + (\eta - \eta^\star) \ell'(\eta^\star) + \frac{(\eta - \eta^\star)^2}{2} \ell''(\eta^\star). \]

This is precisely the log likelihood for the distribution \[ \eta \sim \text{Normal}\left(\eta^\star - \frac{\ell'(\eta^\star)}{\ell''(\eta^\star)}, -\frac{1}{\ell''(\eta^\star)}\right). \]

Thus, given a good enough choice of \(\eta^\star\), we should be able to run FLASH on the “pseudo-data” \[ X = \eta^\star - \frac{\ell'(\eta^\star)}{\ell''(\eta^\star)} = \log(\mu^\star) + \frac{Y - \mu^\star}{\mu^\star}\] with “standard errors” \[S = -\frac{1}{\sqrt{\ell''(\mu^\star)}} = \frac{1}{\sqrt{\mu^\star}}, \] where \(\mu^\star = e^{\eta^\star}\). To fit the model \(\log(\mu) = LF'\), one fixes the standard errors by setting var_type = "zero". To fit the overdispersed model \(\log(\mu) = LF' + E\), one sets var_type = "constant" so that each observation has variance \(S_{ij}^2 + \sigma^2\). For purposes of illustration, I take var_type = "zero" as the default setting. I do so because I have taken a shortcut in calculating the Poisson log likelihood by simply fixing \(\log(\mu)\) at the posterior mean of \(LF'\); this approximation is much worse when there is an additional “error term” \(E\).

The crux in all of this is to choose an \(\eta^\star\) that is as close as possible to the \(\eta\) that minimizes the objective function. An iterative procedure suggests itself whereby one fits FLASH to the pseudo-data generated by a given choice of \(\eta^\star\), then updates \(\eta^\star\) to be the posterior mean given by the FLASH fit, then fits FLASH to the pseudo-data generated by the new \(\eta^\star\), and so on until convergence.

Code and example

First I load the data:

devtools::load_all("~/GitHub/flashr")
#> Loading flashr
devtools::load_all("~/GitHub/ebnm")
#> Loading ebnm

raw <- read.csv("https://storage.googleapis.com/gtex_analysis_v6/annotations/GTEx_Data_V6_Annotations_SampleAttributesDS.txt",
                header=TRUE, sep='\t')

data <- raw[, c("SAMPID", "SMTSD")] # sample ID, tissue type
# Extract donor ID:
tmp <- strsplit(as.character(data$SAMPID), "-")
data$SAMPID <- as.factor(sapply(tmp, function(x) {x[[2]]})) 
names(data) <- c("DonorID", "TissueType")

data <- suppressMessages(reshape2::acast(data, TissueType ~ DonorID))

missing.tissues <- c(1, 8, 9, 20, 21, 24, 26, 27, 33, 36, 39)
data <- data[-missing.tissues, ]

# Drop columns with no samples:
data = data[, colSums(data) > 0]

gtex.colors <- read.table("https://github.com/stephenslab/gtexresults/blob/master/data/GTExColors.txt?raw=TRUE",
                          sep = '\t', comment.char = '')
gtex.colors <- gtex.colors[-c(7, 8, 19, 20, 24, 25, 31, 34, 37), 2]
gtex.colors <- as.character(gtex.colors)

I will use the following functions to fit the model:

# Computing Poisson likelihood and FLASH ELBO --------------------------

pois_lik <- function(the_data, fl) {
  return(sum(-exp(fl$fitted_values) + the_data * fl$fitted_values))
}

pois_obj <- function(the_data, fl) {
  return(pois_lik(the_data, fl) + sum(unlist(fl$fit$KL_l)) +
           sum(unlist(fl$fit$KL_f)))
}

# Calculating pseudo-data ----------------------------------------------

calc_X <- function(the_data, mu) {
  return(log(mu) + (the_data - mu) / mu)
}

calc_S <- function(the_data, mu) {
  return(1 / sqrt(mu))
}

set_pseudodata <- function(the_data, mu) {
  return(flash_set_data(calc_X(the_data, mu), S = calc_S(the_data, mu)))
}

# Setting FLASH parameters ---------------------------------------------

# Initialization function for nonnegative loadings 
#   (but arbitrary factors):
my_init_fn <- function(Y, K = 1) {
  ret = udv_svd(Y, K)
  sum_pos = sum(ret$u[ret$u > 0]^2)
  sum_neg = sum(ret$u[ret$u < 0]^2)
  if (sum_neg > sum_pos) {
    return(list(u = -ret$u, d = ret$d, v = -ret$v))
  } else
    return(ret)
}

get_init_fn <- function(nonnegative = FALSE) {
  if (nonnegative) {
    return("my_init_fn")
  } else {
    return("udv_svd")
  }
}

get_ebnm_fn <- function(nonnegative = FALSE) {
  if (nonnegative) {
    return(list(l = "ebnm_ash", f = "ebnm_pn"))
  } else {
    return(list(l = "ebnm_pn", f = "ebnm_pn"))
  }
}

get_ebnm_param <- function(nonnegative = FALSE) {
  if (nonnegative) {
    return(list(l = list(mixcompdist = "+uniform"),
                f = list(warmstart = TRUE)))
  } else {
    return(list(l = list(warmstart = TRUE),
                f = list(warmstart = TRUE)))
  }
}

# Initializing mu and running FLASH ------------------------------------

init_mu <- function(the_data, f_init) {
  if (is.null(f_init)) {
    return(matrix(colMeans(the_data),
                  nrow = nrow(the_data), ncol = ncol(the_data),
                  byrow = TRUE))
  } else {
    return(exp(f_init$fitted_values))
  }
}

greedy_iter <- function(the_data, mu, f_init, niter, 
                        nonnegative = FALSE, var_type = "zero") {
  suppressWarnings(
    flash_greedy_workhorse(set_pseudodata(the_data, mu),
                           Kmax = 1,
                           f_init = f_init,
                           var_type = var_type,
                           ebnm_fn = get_ebnm_fn(nonnegative),
                           ebnm_param = get_ebnm_param(nonnegative),
                           init_fn = get_init_fn(nonnegative),
                           verbose_output = "",
                           nullcheck = FALSE,
                           maxiter = niter)
  )
}

backfit_iter <- function(the_data, mu, f_init, kset, niter, 
                         nonnegative = FALSE, var_type = "zero") {
  suppressWarnings(
    flash_backfit_workhorse(set_pseudodata(the_data, mu),
                            kset = kset,
                            f_init = f_init,
                            var_type = var_type,
                            ebnm_fn = get_ebnm_fn(nonnegative),
                            ebnm_param = get_ebnm_param(nonnegative),
                            verbose_output = "",
                            nullcheck = FALSE,
                            maxiter = niter)
  )
}

run_one_fit <- function(the_data, f_init, greedy, maxiter = 200,
                        n_subiter = 200, nonnegative = FALSE, 
                        var_type = "zero", 
                        verbose = TRUE, tol = .01) {
  mu <- init_mu(the_data, f_init)

  if (greedy) {
    fl <- greedy_iter(the_data, mu, f_init, n_subiter, 
                      nonnegative, var_type)
    kset <- ncol(fl$fit$EL) # Only "backfit" the greedily added factor
    mu <- exp(fl$fitted_values)
  } else {
    fl <- f_init
    kset <- 1:ncol(fl$fit$EL) # Backfit all factor/loadings
  }

  # The objective can get stuck oscillating between two values, so we
  #   need to track the last two values attained:
  old_old_obj <- -Inf
  old_obj <- -Inf
  diff <- Inf
  iter <- 0
  while (diff > tol && iter < maxiter) {
    iter <- iter + 1
    fl <- backfit_iter(the_data, mu, fl, kset, n_subiter, 
                       nonnegative, var_type)
    mu <- exp(fl$fitted_values)

    new_obj <- pois_obj(the_data, fl)
    diff <- min(abs(new_obj - old_obj), abs(new_obj - old_old_obj))

    old_old_obj <- old_obj
    old_obj <- new_obj

    if (verbose) {
      message("Iteration ", iter, ": ", old_obj)
    }
  }
  return(fl)
}

I greedily add a factor by initializing \(\mu^\star\) to the column means of the data, then alternating between running FLASH on the “pseudo-data” corresponding to the current \(\mu^\star\) and updating \(\mu^\star\) to be the posterior mean of the current FLASH fit:

fl <- run_one_fit(data, f_init = NULL, greedy = TRUE)
#> Iteration 1: -927714082346.181
#> Iteration 2: -341286945918.445
#> Iteration 3: -125552459516.84
#> Iteration 4: -46188177871.3924
#> Iteration 5: -16991690774.7395
#> Iteration 6: -6250903760.78647
#> Iteration 7: -2299589270.42295
#> Iteration 8: -845982056.385945
#> Iteration 9: -311229940.405073
#> Iteration 10: -114505696.365842
#> Iteration 11: -42134944.2935817
#> Iteration 12: -15511272.9754123
#> Iteration 13: -5717001.94860467
#> Iteration 14: -2113906.45466518
#> Iteration 15: -788404.890125627
#> Iteration 16: -300778.640349495
#> Iteration 17: -121382.535730719
#> Iteration 18: -55374.158506545
#> Iteration 19: -31079.0542736632
#> Iteration 20: -22130.6357421538
#> Iteration 21: -18834.2241948271
#> Iteration 22: -17631.9197964083
#> Iteration 23: -17227.1683461842
#> Iteration 24: -17120.3909758248
#> Iteration 25: -17094.2647075556
#> Iteration 26: -17091.0177410941
#> Iteration 27: -17090.5831181879
#> Iteration 28: -17090.5851822656

So, although the initial estimate of \(\mu^\star\) is terrible, the objective seems to eventually converge.

I continue to greedily add factors in the same way:

fl <- run_one_fit(data, f_init = fl, greedy = TRUE)
#> Iteration 1: -16012.4385475068
#> Iteration 2: -15996.0987462569
#> Iteration 3: -15994.474709406
#> Iteration 4: -15994.5035110124
#> Iteration 5: -15994.522807295
#> Iteration 6: -15994.5304893364
fl <- run_one_fit(data, f_init = fl, greedy = TRUE)
#> Iteration 1: -15994.5335404488
#> Iteration 2: -15994.5347437438

Since the third factor no longer offers any improvement, I stop adding factors and backfit:

fl <- run_one_fit(data, f_init = fl, greedy = FALSE)
#> Iteration 1: -15820.0913699191
#> Iteration 2: -15644.1196467394
#> Iteration 3: -15574.356156741
#> Iteration 4: -15526.9227894188
#> Iteration 5: -15503.4303568297
#> Iteration 6: -15492.9576959489
#> Iteration 7: -15488.9547268523
#> Iteration 8: -15487.6157343197
#> Iteration 9: -15487.1185250066
#> Iteration 10: -15486.8695270274
#> Iteration 11: -15486.7522169866
#> Iteration 12: -15486.7015595336
#> Iteration 13: -15486.6810231773
#> Iteration 14: -15486.6758741976

I now examine the resulting loadings. One loading primarily represents correlation among brain (and pituitary) tissues; the other represents correlations among the remaining tissues.

plot(fl, plot_loadings = TRUE, loading_colors = gtex.colors,
     loading_legend_size = 3, plot_scree = FALSE)

Expand here to see past versions of ex4-1.png:

Version	Author	Date
8b73159	Jason Willwerscheid	2018-09-21

Fitting the model

It is not necessary to run FLASH to convergence before updating \(\mu^\star\). At the extreme, one could do a single FLASH iteration, then update \(\mu^\star\), then run a second FLASH iteration beginning from the updated \(\mu^\star\), and so on.

In the following experiment, I vary the maximum number of iterations between updates of \(\mu^\star\) (n_subiter) from 1 to 100 (which is typically enough to achieve convergence). In each case, I greedily add as many factors as possible and then backfit (as in the above example). I track the final objective attained and the time required to fit (in seconds).

flash_fit <- function(the_data, n_subiter, nonnegative = FALSE,
                      var_type = "zero", maxiter = 100, tol = .01) {
  fl <- run_one_fit(the_data, f_init = NULL, greedy = TRUE,
                    maxiter = maxiter, n_subiter = n_subiter,
                    nonnegative = nonnegative, var_type = var_type,
                    verbose = FALSE)
  old_obj <- pois_obj(the_data, fl)
  
  # Keep greedily adding factors until the objective no longer improves:
  diff <- Inf
  while (diff > tol) {
    fl <- run_one_fit(the_data, fl, greedy = TRUE,
                      maxiter = maxiter, n_subiter = n_subiter,
                      nonnegative = nonnegative, var_type = var_type,
                      verbose = FALSE)
    new_obj <- pois_obj(the_data, fl)
    diff <- new_obj - old_obj
    old_obj <- new_obj
  }
  
  # Now backfit the whole thing:
  fl <- run_one_fit(the_data, fl, greedy = FALSE, 
                    maxiter = maxiter, n_subiter = n_subiter,
                    nonnegative = nonnegative, var_type = var_type,
                    verbose = FALSE)
  
  return(fl)
}

n_subiters <- c(1, 2, 5, 10, 25, 50, 100)
all_t <- rep(0, length(n_subiters))
all_obj <- rep(0, length(n_subiters))
for (i in 1:length(n_subiters)) {
  t <- system.time(fl <- flash_fit(data, n_subiters[i]))
  all_t[i] <- t[3]
  all_obj[i] <- pois_obj(data, fl)
}

df <- data.frame(n_subiter = n_subiters, 
                 "time to fit" = all_t, 
                 "final objective" = all_obj)
knitr::kable(df)

n_subiter	time.to.fit	final.objective
1	5.342	-15536.72
2	6.120	-15486.69
5	5.957	-15486.61
10	7.511	-15486.67
25	8.380	-15486.67
50	8.816	-15486.68
100	9.276	-15486.68

Clearly, capping the number of iterations per FLASH fit reduces overall fitting time. It is possible that too few iterations produces a suboptimal fit (the objective for n_subiter = 1 is about 50 less than the other objectives), but this could be an isolated event.

Nonnegative loadings

Since, as I’ve argued in other analyses, putting nonnegative constraints on the loadings helps produce more interpretable factors, I repeat the above experiment with nonnegative “+uniform” priors on the loadings.

for (i in 1:length(n_subiters)) {
  t <- system.time(fl <- flash_fit(data, n_subiters[i], 
                                   nonnegative = TRUE))
  all_t[i] <- t[3]
  all_obj[i] <- pois_obj(data, fl)
}

nonneg_df <- data.frame(n_subiter = n_subiters, 
                 "time to fit (seconds)" = all_t, 
                 "final objective" = all_obj)
knitr::kable(nonneg_df)

n_subiter	time.to.fit..seconds.	final.objective
1	18.083	-15267.11
2	19.358	-15278.50
5	22.609	-15297.36
10	23.729	-15297.36
25	24.079	-15297.36
50	24.083	-15297.36
100	23.991	-15297.36

Again, the objective for n_subiter = 1 is much different from other settings of n_subiter, but here it is almost 30 higher!

Finally, I refit with n_subiter = 1 and plot the resulting factors. Notice, in particular, that the fifth factor is almost exclusively loaded on female reproductive tissues. (Further rounds of greedy addition and backfitting do not yield any additional factors.)

fl <- flash_fit(data, 1, TRUE)

plot(fl, plot_loadings = TRUE, loading_colors = gtex.colors,
     loading_legend_size = 3, plot_scree = FALSE)

Expand here to see past versions of nonneg2-1.png:

Version	Author	Date
8b73159	Jason Willwerscheid	2018-09-21

Expand here to see past versions of nonneg2-2.png:

Version	Author	Date
8b73159	Jason Willwerscheid	2018-09-21

A streamlined approach

It might be worth it to try to streamline the approach by getting a good initialization point, running FLASH a single time to update \(\mu^\star\), and then forgetting about the original data entirely.

I obtain an initialization point by running nonnegative FLASH on the data (recall that \(\mu^\star\) must be strictly positive).

t_init <- system.time(
  fl <- suppressWarnings(
    flash(data, var_type = "by_row", ebnm_fn = "ebnm_ash", 
          ebnm_param = list(mixcompdist = "+uniform"),
          verbose = FALSE)
  )
)
mu <- fl$fitted_values

Note that this takes about as long as the entire fitting process with n_subiter = 1 above!

round(t_init[3], digits = 2)
#> elapsed 
#>    5.33

Now I run FLASH on the pseudo-data and update \(\mu^\star\):

pseudodata <- set_pseudodata(data, mu)
t_update <- system.time(
  fl <- suppressWarnings(
    flash(pseudodata, var_type = "zero", backfit = TRUE, 
          verbose = FALSE)
  )
)
mu <- exp(fl$fitted_values)
round(t_update[3], digits = 2)
#> elapsed 
#>    3.25

pseudodata <- set_pseudodata(data, mu)
t_final <- system.time(
  fl <- suppressWarnings(
    flash(pseudodata, var_type = "zero", backfit = TRUE, 
          verbose = FALSE)
  )
)
round(t_final[3], digits = 2)
#> elapsed 
#>    1.94

The objective is not much worse than the the objective attained using the more involved approach above (it is about 70 lower):

pois_obj(data, fl)
#> [1] -15588

However, the loadings tell a somewhat different story:

plot(fl, plot_loadings = TRUE, loading_colors = gtex.colors,
     loading_legend_size = 3, plot_scree = FALSE)

Expand here to see past versions of two_iter5b-1.png:

Version	Author	Date
8b73159	Jason Willwerscheid	2018-09-21

Further, this simpler approach is slower than the above approach for all values of n_subiter:

round(t_init[3] + t_update[3] + t_final[3], digits = 2)
#> elapsed 
#>   10.52

Accounting for overdispersion

As explained above, I chose var_type = "zero" merely for purposes of illustration. Often, it will be more appropriate to set var_type = "constant". In the example under consideration, a “constant” variance structure yields more structure and more interpretable factors (note, in particular, the sex-specific fourth loading):

fl <- flash_fit(data, 1, var_type = "constant")
plot(fl, plot_loadings = TRUE, loading_colors = gtex.colors,
     loading_legend_size = 3, plot_scree = FALSE)

Session information

sessionInfo()
#> R version 3.4.3 (2017-11-30)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ebnm_0.1-15  flashr_0.6-2
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.18        highr_0.6           pillar_1.2.1       
#>  [4] plyr_1.8.4          compiler_3.4.3      git2r_0.21.0       
#>  [7] workflowr_1.0.1     R.methodsS3_1.7.1   R.utils_2.6.0      
#> [10] iterators_1.0.9     tools_3.4.3         testthat_2.0.0     
#> [13] digest_0.6.15       tibble_1.4.2        evaluate_0.10.1    
#> [16] memoise_1.1.0       gtable_0.2.0        lattice_0.20-35    
#> [19] rlang_0.2.0         Matrix_1.2-12       foreach_1.4.4      
#> [22] commonmark_1.4      yaml_2.1.17         parallel_3.4.3     
#> [25] withr_2.1.1.9000    stringr_1.3.0       roxygen2_6.0.1.9000
#> [28] xml2_1.2.0          knitr_1.20          REBayes_1.2        
#> [31] devtools_1.13.4     rprojroot_1.3-2     grid_3.4.3         
#> [34] R6_2.2.2            rmarkdown_1.8       reshape2_1.4.3     
#> [37] ggplot2_2.2.1       ashr_2.2-13         magrittr_1.5       
#> [40] whisker_0.3-2       backports_1.1.2     scales_0.5.0       
#> [43] codetools_0.2-15    htmltools_0.3.6     MASS_7.3-48        
#> [46] assertthat_0.2.0    softImpute_1.4      colorspace_1.3-2   
#> [49] labeling_0.3        stringi_1.1.6       Rmosek_7.1.3       
#> [52] lazyeval_0.2.1      doParallel_1.0.11   pscl_1.5.2         
#> [55] munsell_0.4.3       truncnorm_1.0-8     SQUAREM_2017.10-1  
#> [58] R.oo_1.21.0

This reproducible R Markdown analysis was created with workflowr 1.0.1