Last updated: 2017-02-14

Code version: 7693edf

Introduction

Up until now, we have been using simulated correlated null data for exploration, so it would be helpful to examine and document the pipeline for generating these summary statistics, including \(\hat\beta\), \(\hat s\), \(z\) score, \(t\) statistic, and \(p\) value.

Pipeline

An expression matrix \(X_{G \times N}\) goes through the following steps to produce summary statistics of length \(g\):

  1. choose the top \(g\) expressed genes.
  2. randomly sample \(n\) cases vs \(n\) controls.
  3. calculate a normalizing factor for each sample with edgeR::calcNormFactors.
  4. convert counts to \(\log_2\)-cpm and calculate a variance weight for each observation with limma::voom.
  5. fit gene-wise weighted linear models with limma::lmFit and moderate them with limma::eBayes.

We’ll go through this pipeline step by step with a toy data set; an end-to-end sketch follows this list.
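As a preview, here is a minimal end-to-end sketch of the whole pipeline. The wrapper name voom_summary_stats and its arguments are illustrative, not part of the original code, and the last step (lmFit plus eBayes, covered at the end of this page) is included so the sketch actually returns the summary statistics.

voom_summary_stats = function(X, g, n) {
  # top g genes by total log2-cpm (same lcpm formula as used below)
  lcpm = t(log2((t(X) + 0.5) / (colSums(X) + 1) * 1e6))
  counts = X[order(rowSums(lcpm), decreasing = TRUE)[1:g], sample(ncol(X), 2 * n)]
  condition = rep(0:1, each = n)
  design = model.matrix(~condition)
  # TMM normalizing factors, then log2-cpm and observation-level weights
  dge = edgeR::calcNormFactors(edgeR::DGEList(counts = counts, group = condition))
  v = limma::voom(dge, design)
  # gene-wise weighted linear models plus empirical Bayes moderation
  fit = limma::eBayes(limma::lmFit(v))
  data.frame(betahat = fit$coefficients[, 2],
             sebetahat = fit$stdev.unscaled[, 2] * sqrt(fit$s2.post),
             t = fit$t[, 2], df = fit$df.total, p = fit$p.value[, 2])
}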

Sometimes it’s very helpful to read the source code of these different functions.

library(edgeR)
Loading required package: limma
library(limma)

read/generate raw counts

Suppose the raw counts are stored in a \(10 \times 8\) matrix with \(10\) genes and \(8\) samples, with counts simulated from a Poisson distribution.

set.seed(777)
r = matrix(rpois(n = 80, lambda = 5), nrow = 10)
r
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
 [1,]    6    6    4    7    1    6    6    2
 [2,]    5    6    6    6    4    4    3    7
 [3,]    4    5    5    7    5    2    7    9
 [4,]   12    5    3    4    4    3    0    4
 [5,]    6    7    1    4    3    5    8    4
 [6,]    1    2    7    4   11    3    2    8
 [7,]    4    4    5    6    8    2    2    4
 [8,]    3    8    5    4    8    9    3    6
 [9,]    9    4   11    6    8    4    9    5
[10,]    3    4    6    6    3    1    5    4

Select the top expressed genes:

  1. convert each count to its lcpm (\(\log_2\) counts per million).
  2. select the top expressed genes according to their sum of lcpm.

In this example we choose the top \(6\) genes from a total of \(10\) genes.

# lcpm: log2 counts per million, with a 0.5 offset to avoid taking log of zero
lcpm = function(r) {R = colSums(r); t(log2(((t(r) + 0.5) / (R + 1)) * 10^6))}
# indices of the g genes with the largest total lcpm
top_genes_index = function(g, X) {order(rowSums(X), decreasing = TRUE)[1:g]}
Y = lcpm(r)
subset = top_genes_index(6, Y)
r = r[subset,]
r
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    9    4   11    6    8    4    9    5
[2,]    3    8    5    4    8    9    3    6
[3,]    4    5    5    7    5    2    7    9
[4,]    5    6    6    6    4    4    3    7
[5,]    6    7    1    4    3    5    8    4
[6,]    6    6    4    7    1    6    6    2

randomly sample cases and controls

In this example we use 2 cases vs 2 controls, and put labels on them via condition, based on which a design matrix is generated.

counts = r[,sample(1:ncol(r),2*2)]
counts
     [,1] [,2] [,3] [,4]
[1,]    6    5   11    4
[2,]    4    6    5    9
[3,]    7    9    5    2
[4,]    6    7    6    4
[5,]    4    4    1    5
[6,]    7    2    4    6
condition = c(rep(0,2),rep(1,2))
condition
[1] 0 0 1 1
design = model.matrix(~condition)
design
  (Intercept) condition
1           1         0
2           1         0
3           1         1
4           1         1
attr(,"assign")
[1] 0 1

calculate normalizing factors for each sample using edgeR::calcNormFactors

This is potentially a problem, because the normalizing factors are calculated based only on the top \(g\) genes.

  1. A DGEList object is generated from counts and condition, using edgeR::DGEList.
  2. The normalizing factors are calculated by edgeR::calcNormFactors.

DGEList_obj = edgeR::DGEList(counts=counts,group=condition)
DGEList_obj
An object of class "DGEList"
$counts
  Sample1 Sample2 Sample3 Sample4
1       6       5      11       4
2       4       6       5       9
3       7       9       5       2
4       6       7       6       4
5       4       4       1       5
6       7       2       4       6

$samples
        group lib.size norm.factors
Sample1     0       34            1
Sample2     0       33            1
Sample3     1       32            1
Sample4     1       30            1
dgecounts = edgeR::calcNormFactors(DGEList_obj)
dgecounts
An object of class "DGEList"
$counts
  Sample1 Sample2 Sample3 Sample4
1       6       5      11       4
2       4       6       5       9
3       7       9       5       2
4       6       7       6       4
5       4       4       1       5
6       7       2       4       6

$samples
        group lib.size norm.factors
Sample1     0       34    1.1201997
Sample2     0       33    0.9397760
Sample3     1       32    0.9051827
Sample4     1       30    1.0494070
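The counts themselves are unchanged; what limma::voom consumes downstream is the effective library size, i.e. each library size multiplied by its normalizing factor (edgeR::calcNormFactors uses the TMM method by default):

dgecounts$samples$lib.size * dgecounts$samples$norm.factors
[1] 38.08679 31.01261 28.96585 31.48221

These reappear as lib.size in the $targets component of the voom output below.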

Calculate weight for each \(\log_2\) expression using limma::voom

limma::voom takes in the counts and normalizing factors and returns, among other things, effective library sizes, \(\log_2\)-cpm expression values, and a weight for each observation, all for Gaussian-based linear modeling.

The peculiarity of limma::voom lies in the way it calculates the “average” log2-count for each gene: sx <- fit$Amean + mean(log2(lib.size + 1)) - log2(1e+06). Note that for each gene, the count as well as the log2-count can vary wildly from sample to sample due to library size, sequencing depth, and/or experimental design, so the way to find an “average” is not self-evident.

v = voom(dgecounts, design, plot=FALSE)
v
An object of class "EList"
$targets
        group lib.size norm.factors
Sample1     0 38.08679    1.1201997
Sample2     0 31.01261    0.9397760
Sample3     1 28.96585    0.9051827
Sample4     1 31.48221    1.0494070

$E
   Sample1  Sample2  Sample3  Sample4
1 17.34340 17.39043 18.54988 17.07992
2 16.81288 17.63144 17.48575 18.15792
3 17.54985 18.17893 17.48575 16.23192
4 17.34340 17.83789 17.72676 17.07992
5 16.81288 17.10093 15.61128 17.36942
6 17.54985 16.25293 17.19625 17.61043

$weights
         [,1]     [,2]     [,3]     [,4]
[1,] 1.849269 2.060690 1.849269 1.849269
[2,] 4.229762 2.147872 1.849269 1.849269
[3,] 1.849269 1.849269 1.260232 1.424488
[4,] 1.849269 4.502218 2.095619 2.171379
[5,] 2.133848 1.565478 1.260232 1.260232
[6,] 2.111233 1.462248 2.095619 2.171379

$design
  (Intercept) condition
1           1         0
2           1         0
3           1         1
4           1         1
attr(,"assign")
[1] 0 1
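As a sanity check, the \(\log_2\)-cpm matrix in $E can be reconstructed from the counts and the effective library sizes, using the same half-count offset as the lcpm function above:

eff.lib = v$targets$lib.size
t(log2((t(counts) + 0.5) / (eff.lib + 1) * 1e6))  # identical to v$E above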

Here is what limma::voom does, as described in Law et al. (2014):

voom: variance modeling at the observation-level

The limma-trend pipeline models the variance at the gene level. However, in RNA-seq applications, the count sizes may vary considerably from sample to sample for the same gene. Different samples may be sequenced to different depths, so the count sizes may be quite different even if the cpm values are the same. For this reason, we wish to model the mean-variance trend of the log-cpm values at the individual observation level, instead of applying a gene-level variability estimate to all observations from the same gene.

Our strategy is to estimate non-parametrically the mean-variance trend of the logged read counts and to use this mean-variance relationship to predict the variance of each log-cpm value. The predicted variance is then encapsulated as an inverse weight for the log-cpm value. When the weights are incorporated into a linear modeling procedure, the mean-variance relationship in the log-cpm values is effectively eliminated.

A technical difficulty is that we want to predict the variances of individual observations although there is, by definition, no replication at the observational level from which variances could be estimated. We work around this inconvenience by estimating the mean-variance trend at the gene level, then interpolating this trend to predict the variances of individual observations.

The algorithm proceeds as follows. First, gene-wise linear models are fitted to the normalized log-cpm values, taking into account the experimental design, treatment conditions, replicates and so on. This generates a residual standard deviation for each gene. A robust trend is then fitted to the residual standard deviations as a function of the average log-count for each gene.

Also available from the linear models is a fitted value for each log-cpm observation. Taking the library sizes into account, the fitted log-cpm for each observation is converted into a predicted count. The standard deviation trend is then interpolated to predict the standard deviation of each individual observation based on its predicted count size. Finally, the inverse squared predicted standard deviation for each observation becomes the weight for that observation.

The log-cpm values and associated weights are then input into limma’s standard differential expression pipeline. Most limma functions are designed to accept quantitative weights, providing the ability to perform microarray-like analyses while taking account of the mean-variance relationship of the log-cpm values at the observation level.
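To make the quoted description concrete, here is a stripped-down sketch of the algorithm, assuming a fixed lowess span of 0.5 (voom's default) and skipping voom's edge-case handling; effective library sizes are passed in directly. It illustrates the method described above rather than reproducing limma's exact source.

voom_sketch = function(counts, design, lib.size) {
  # log2-cpm with the half-count offset
  y = t(log2((t(counts) + 0.5) / (lib.size + 1) * 1e6))
  fit = lmFit(y, design)
  # average log2-count per gene (the sx formula quoted earlier)
  sx = fit$Amean + mean(log2(lib.size + 1)) - log2(1e6)
  # square-root residual standard deviation per gene
  sy = sqrt(fit$sigma)
  # mean-variance trend, interpolated as a function of log2-count
  trend = approxfun(lowess(sx, sy, f = 0.5), rule = 2)
  # fitted log2-cpm for every observation, converted to a predicted log2-count
  fitted.cpm = 2^(fit$coefficients %*% t(design))
  fitted.logcount = log2(t(t(fitted.cpm) * (lib.size + 1)) / 1e6)
  # inverse squared predicted standard deviation becomes the weight
  w = 1 / trend(fitted.logcount)^4
  dim(w) = dim(y)
  list(E = y, weights = w)
}

Called as voom_sketch(counts, design, dgecounts$samples$lib.size * dgecounts$samples$norm.factors), this should reproduce $E above and closely match the $weights.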

Perform weighted \(t\)-test for each gene and obtain primary \(\hat\beta\), \(\hat s\), \(t\), d.f., \(p\) using limma::lmFit

Taking in the EList from limma::voom, limma::lmFit fits \(g\) gene-wise weighted linear models, giving vectors of \(\hat\beta\), \(\hat s\), and residual degrees of freedom.

Note that $stdev.unscaled \(= \sqrt{\operatorname{diag}\left((X^TWX)^{-1}\right)}\) for each gene, \(W\) being the diagonal matrix of that gene's weights calculated by limma::voom.
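A quick check of that formula for a single gene (gene 1), building \(X^TWX\) from the design matrix and the first row of voom weights; the result matches the first row of $stdev.unscaled in the fit below:

XtWX = t(design) %*% diag(v$weights[1, ]) %*% design
sqrt(diag(solve(XtWX)))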

lim = lmFit(v)
lim
An object of class "MArrayLM"
$coefficients
  (Intercept)  condition
1    17.36819  0.4467125
2    17.08856  0.7332766
3    17.86439 -1.0439088
4    17.69392 -0.2963208
5    16.93478 -0.4444242
6    17.01916  0.3878582

$stdev.unscaled
  (Intercept) condition
1   0.5057244 0.7253511
2   0.3959772 0.6535863
3   0.5199780 0.8017827
4   0.3967914 0.6259395
5   0.5199226 0.8167446
6   0.5289983 0.7170746

$sigma
[1] 1.0000276 0.8283184 0.8417619 0.6191636 1.0056442 0.9044428

$df.residual
[1] 2 2 2 2 2 2

$cov.coefficients
            (Intercept) condition
(Intercept)         0.5      -0.5
condition          -0.5       1.0

$pivot
[1] 1 2

$rank
[1] 2

$Amean
       1        2        3        4        5        6 
17.59091 17.52200 17.36161 17.49699 16.72363 17.15236 

$method
[1] "ls"

$design
  (Intercept) condition
1           1         0
2           1         0
3           1         1
4           1         1
attr(,"assign")
[1] 0 1
betahat.lim = lim$coefficients[,2]
betahat.lim
         1          2          3          4          5          6 
 0.4467125  0.7332766 -1.0439088 -0.2963208 -0.4444242  0.3878582 
sebetahat.lim = lim$stdev.unscaled[,2]*lim$sigma
sebetahat.lim
        1         2         3         4         5         6 
0.7253712 0.5413776 0.6749101 0.3875590 0.8213545 0.6485530 
t.lim = betahat.lim / sebetahat.lim
df.lim = lim$df.residual
df.lim
[1] 2 2 2 2 2 2
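The heading above also promises a primary \(p\) value; for an ordinary two-sided \(t\)-test on the residual degrees of freedom it would be:

p.lim = 2 * pt(-abs(t.lim), df.lim)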

Do variance shrinkage on \(\hat s\) and obtain moderated \(t\), d.f., \(p\) by empirical Bayes using limma::eBayes

Is it a variance shrinkage procedure?

Taking in the summary statistics (\(\hat \beta\), \(\hat s\), d.f.) for each gene, limma::eBayes produces a shrunken standard deviation for each gene and thus moderated \(t\)-statistics and \(p\)-values. (The code below calls the lower-case limma::ebayes, which returns the same empirical Bayes quantities as a plain list.)

lim.ebayes = ebayes(lim)
lim.ebayes
$df.prior
[1] Inf

$s2.prior
[1] 1.304242

$s2.post
[1] 1.304242 1.304242 1.304242 1.304242 1.304242 1.304242

$t
  (Intercept)  condition
1    30.07195  0.5392632
2    37.78820  0.9823942
3    30.08321 -1.1400578
4    39.04655 -0.4145250
5    28.52080 -0.4764665
6    28.17116  0.4736195

$df.total
[1] 12 12 12 12 12 12

$p.value
   (Intercept) condition
1 1.144359e-12 0.5995682
2 7.583604e-14 0.3452983
3 1.139291e-12 0.2765067
4 5.133780e-14 0.6858037
5 2.142997e-12 0.6423004
6 2.479864e-12 0.6442714

$var.prior
[1] 12.267665051  0.007667291

$lods
  (Intercept) condition
1    436.3862 -4.600265
2    698.1822 -4.595506
3    436.2283 -4.593388
4    745.8775 -4.603161
5    391.4357 -4.599544
6    381.4607 -4.600873
sebetahat.ebayes = lim$stdev.unscaled[, 2] * sqrt(lim.ebayes$s2.post)
sebetahat.ebayes
        1         2         3         4         5         6 
0.8283756 0.7464178 0.9156631 0.7148442 0.9327501 0.8189236 
t.ebayes = betahat.lim / sebetahat.ebayes
t.ebayes
         1          2          3          4          5          6 
 0.5392632  0.9823942 -1.1400578 -0.4145250 -0.4764665  0.4736195 
lim.ebayes$t[, 2]
         1          2          3          4          5          6 
 0.5392632  0.9823942 -1.1400578 -0.4145250 -0.4764665  0.4736195 
df.ebayes = lim.ebayes$df.total
df.ebayes
[1] 12 12 12 12 12 12
p.ebayes = (1 - pt(abs(t.ebayes), df.ebayes)) * 2
p.ebayes
        1         2         3         4         5         6 
0.5995682 0.3452983 0.2765067 0.6858037 0.6423004 0.6442714 
lim.ebayes$p.value[, 2]
        1         2         3         4         5         6 
0.5995682 0.3452983 0.2765067 0.6858037 0.6423004 0.6442714 
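As a check on what the shrinkage actually does: the posterior variance $s2.post is a degrees-of-freedom-weighted average of each gene-wise \(\hat s^2\) and the common prior $s2.prior (this is limma::squeezeVar); with an infinite prior d.f., as here, every gene collapses to the prior value. The moderated d.f. is df.residual + df.prior, capped at the pooled total sum(df.residual) = 12, which is why $df.total is 12 rather than infinite.

d0 = lim.ebayes$df.prior
s02 = lim.ebayes$s2.prior
d = lim$df.residual
if (is.finite(d0)) (d0 * s02 + d * lim$sigma^2) / (d0 + d) else rep(s02, length(d))
[1] 1.304242 1.304242 1.304242 1.304242 1.304242 1.304242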

obtain \(z\) scores from moderated \(p\)-values and \(t\)-statistics
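Each moderated two-sided \(p\) value is mapped back to a standard normal quantile, with the sign taken from the corresponding moderated \(t\) statistic: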

z = qnorm(1 - lim.ebayes$p[, 2] / 2) * sign(lim.ebayes$t[, 2])
z
         1          2          3          4          5          6 
 0.5250216  0.9437483 -1.0882004 -0.4045563 -0.4644849  0.4617350 
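For very small \(p\) values the expression 1 - lim.ebayes$p[, 2] / 2 rounds to 1 in floating point and the resulting \(z\) loses precision; an algebraically equivalent form that avoids the subtraction is:

z = -qnorm(lim.ebayes$p[, 2] / 2) * sign(lim.ebayes$t[, 2])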

Session Information

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.3

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.14.0  limma_3.28.14

loaded via a namespace (and not attached):
 [1] backports_1.0.5 magrittr_1.5    rprojroot_1.2   tools_3.3.2    
 [5] htmltools_0.3.5 yaml_2.1.14     Rcpp_0.12.9     stringi_1.1.2  
 [9] rmarkdown_1.3   knitr_1.15.1    git2r_0.18.0    stringr_1.1.0  
[13] digest_0.6.9    workflowr_0.3.0 evaluate_0.10  

This R Markdown site was created with workflowr