Last updated: 2018-06-21

Overview

To apply multivariate adaptive shrinkage (mash) to data from the GTEx study, we created an R data set (serialized R object) containing matrices of SNP-gene association statistics. These association statistics include effect estimates, Z scores and corresponding standard errors.

See here for the scripts used to generate these statistics from the SNP-gene data that were provided by the GTEx Project.

How to download the data

These are the recommended steps for retrieving the GTEx SNP-gene association statistics:

  1. Download or clone the gtexresults git repository.

  2. The association statistics are found in the file MatrixEQTLSumStats.Portable.Z.rds, in the data directory of the repository.

How to load the data into R

Change the working directory in R (or RStudio) to the analysis directory of the gtexresults repository, e.g.,

setwd("gtexresults/analysis")

Next, read the data object into R:

dat <- readRDS("../data/MatrixEQTLSumStats.Portable.Z.rds")

Then get an overview of the data from this file:

names(dat)
#  [1] "strong.b"      "strong.s"      "strong.z"      "random.b"     
#  [5] "random.s"      "random.z"      "random.test.b" "random.test.s"
#  [9] "random.test.z" "vhat"
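
As a quick follow-up to names(dat), the sketch below tabulates the dimensions of every element in the list. It reads the file from the path given above when it is available; otherwise it falls back to a tiny made-up stand-in list (hypothetical values, not the GTEx data) so that the code still runs.

```r
# Path taken from the text above. If the file is absent, use a small
# simulated stand-in (NOT the real data) so the sketch remains runnable.
rds_path <- "../data/MatrixEQTLSumStats.Portable.Z.rds"
if (file.exists(rds_path)) {
  dat <- readRDS(rds_path)
} else {
  set.seed(1)
  toy <- function(n, r) matrix(rnorm(n * r), n, r)
  dat <- list(strong.b = toy(5, 3), strong.s = abs(toy(5, 3)),
              random.b = toy(8, 3), random.s = abs(toy(8, 3)),
              vhat     = diag(3))
}

# One row per element of dat, giving its number of rows and columns.
dims <- t(sapply(dat, dim))
colnames(dims) <- c("rows", "cols")
print(dims)
```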

Description of the data

This file contains SNP-gene association statistics for 16,069 genes and 44 human tissues. These 16,069 genes were selected because they all showed some indication of being expressed in all 44 tissues. Matrices with one row per gene, such as dat$strong.b, therefore have 16,069 rows and 44 columns, e.g.,

dim(dat$strong.b)
# [1] 16069    44

As input to mash, we use a matrix of expression quantitative trait loci (eQTL) effect estimates and a matrix of the corresponding standard errors. (We also provide Z scores.) See the manuscript for details on how these association statistics were obtained.
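
To illustrate how the three kinds of statistics usually relate, here is a minimal sketch on simulated matrices. It assumes the conventional definition of a Z score as the effect estimate divided by its standard error; whether the matrices in this file satisfy that relationship exactly is not stated here, so treat it as an assumption to verify. The names bhat, shat and zhat and the toy dimensions are hypothetical.

```r
set.seed(1)
n_tests   <- 6   # toy number of SNP-gene tests
n_tissues <- 4   # toy number of tissues (the real data have 44)

# Simulated effect estimates and (strictly positive) standard errors.
bhat <- matrix(rnorm(n_tests * n_tissues), n_tests, n_tissues)
shat <- matrix(runif(n_tests * n_tissues, 0.5, 2), n_tests, n_tissues)

# Assumed relationship: Z score = effect estimate / standard error.
zhat <- bhat / shat

# The effect estimates can be recovered from the Z scores and standard errors.
max_diff <- max(abs(zhat * shat - bhat))
print(max_diff)

# On the real data, the analogous (unverified) check would be:
#   max(abs(dat$strong.b / dat$strong.s - dat$strong.z))
```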

These association statistics were subdivided into three subsets:

  1. Results from a subset of “strong” tests. These tests were identified by taking the “top eQTL” in each gene based on univariate SNP-gene association tests. (Here, the “top eQTL” for a given gene is defined as the SNP with the largest (univariate) Z statistic across all 44 tissues.) The estimated effects, Z scores and standard errors for the strong tests are stored in three \(16,069 \times 44\) matrices, dat$strong.b, dat$strong.z and dat$strong.s.

  2. Results from a random subset of 20,000 SNP-gene tests (this includes both “null” and “non-null” tests). The estimated effects, Z scores and standard errors for these random tests are stored in three \(20,000 \times 44\) matrices, dat$random.b, dat$random.z and dat$random.s.

  3. Results from a second random subset of 28,198 SNP-gene tests. This is used for the cross-validation part of the mash analysis. The estimated effects, Z scores and standard errors for these random tests are stored in three \(28,198 \times 44\) matrices, dat$random.test.b, dat$random.test.z and dat$random.test.s.
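
The “top eQTL” selection in item 1 can be sketched as follows for a single gene. This is an illustration on simulated Z scores, not the script used to build the data set. It assumes that “largest” means largest in absolute value, and the SNP and tissue labels are made up.

```r
set.seed(1)
n_snps    <- 10  # toy number of candidate SNPs for one gene
n_tissues <- 4   # toy number of tissues (the real data have 44)

# Simulated univariate Z scores: one row per SNP, one column per tissue.
z <- matrix(rnorm(n_snps * n_tissues), n_snps, n_tissues,
            dimnames = list(paste0("snp", seq_len(n_snps)),
                            paste0("tissue", seq_len(n_tissues))))

# For each SNP, take its largest |Z| across tissues (assumption: absolute value).
max_abs_z <- apply(abs(z), 1, max)

# The top eQTL for this gene is the SNP that attains the overall maximum.
top_snp <- names(which.max(max_abs_z))

# Keep that SNP's statistics in all tissues; one such row per gene
# would make up a matrix like dat$strong.z.
top_row <- z[top_snp, , drop = FALSE]
print(top_row)
```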

Finally, the gene expression measurements in the GTEx study are correlated due to sample overlap (sometimes multiple measurements were obtained from the same individual). Therefore, we have also estimated a correlation matrix, which is stored in dat$vhat:

dim(dat$vhat)
# [1] 44 44
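
Before using an estimated correlation matrix, it can be worth confirming that it has the basic properties of one. The sketch below runs these checks on a simulated stand-in (the variable vhat here is computed from random data, not loaded from the file); it makes no claim about how dat$vhat was estimated. The same checks could be applied to dat$vhat after loading the data.

```r
set.seed(1)
# Stand-in correlation matrix from simulated data (4 toy "tissues").
x    <- matrix(rnorm(200 * 4), 200, 4)
vhat <- cor(x)

# A correlation matrix should be square and symmetric ...
is_symmetric <- isSymmetric(vhat)

# ... have ones on the diagonal ...
unit_diag <- isTRUE(all.equal(unname(diag(vhat)), rep(1, ncol(vhat))))

# ... and be positive semi-definite (no meaningfully negative eigenvalues).
eig         <- eigen(vhat, symmetric = TRUE, only.values = TRUE)$values
pos_semidef <- all(eig > -1e-8)

print(c(symmetric = is_symmetric, unit_diag = unit_diag,
        pos_semidef = pos_semidef))
```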

See the manuscript for additional details on how these data are used in the mash analysis.

Session information

sessionInfo()
# R version 3.4.3 (2017-11-30)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: macOS High Sierra 10.13.5
# 
# Matrix products: default
# BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# loaded via a namespace (and not attached):
#  [1] workflowr_1.0.1.9000 Rcpp_0.12.17         digest_0.6.15       
#  [4] rprojroot_1.3-2      R.methodsS3_1.7.1    backports_1.1.2     
#  [7] git2r_0.21.0         magrittr_1.5         evaluate_0.10.1     
# [10] stringi_1.1.7        whisker_0.3-2        R.oo_1.21.0         
# [13] R.utils_2.6.0        rmarkdown_1.9        tools_3.4.3         
# [16] stringr_1.3.0        yaml_2.1.18          compiler_3.4.3      
# [19] htmltools_0.3.6      knitr_1.20

This reproducible R Markdown analysis was created with workflowr 1.0.1.9000