mash analysis of GTEx data

Last updated: 2018-06-06

workflowr checks:

✔ R Markdown file: up-to-date

✔ Repository version: 88eff09

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
```
Ignored files:
    Ignored:    .sos/
    Ignored:    data/.sos/
    Ignored:    output/MatrixEQTLSumStats.Portable.Z.coved.K3.P3.lite.single.expanded.V1.loglik.rds
    Ignored:    workflows/.ipynb_checkpoints/
    Ignored:    workflows/.sos/

Untracked files:
    Untracked:  analysis/files.txt
    Untracked:  fastqtl_to_mash_output/
    Untracked:  gtex6_workflow_output/
```
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

Past versions:

File	Version	Author	Date	Message
Rmd	88eff09	Peter Carbonetto	2018-06-06	wflow_publish(“gtex.Rmd”)
Rmd	4d67aed	Peter Carbonetto	2018-06-06	wflow_publish(“GTEx.Rmd”)
Rmd	dae0caf	Peter Carbonetto	2018-06-06	Renamed Tspecific analysis.
Rmd	29ef614	Peter Carbonetto	2018-06-05	wflow_publish(c(“gtex.Rmd”, “fastqtl2mash.Rmd”))
Rmd	249eba0	Peter Carbonetto	2018-06-05	Fixed bad formatting with one of the bash code blocks.
Rmd	867052f	Peter Carbonetto	2018-06-05	wflow_publish(“gtex.Rmd”)
Rmd	0a5d3bc	Peter Carbonetto	2018-06-05	Renamed Fig.ExpressionAnalysis.Rmd as ExpressionAnalysis.Rmd.
html	521321d	Peter Carbonetto	2018-06-05	Rebuilt gtex and fastqtl2mash pages after after various improvements.
Rmd	62c92b0	Peter Carbonetto	2018-06-05	wflow_publish(c(“gtex.Rmd”, “fastqtl2mash.Rmd”))
html	0583813	Peter Carbonetto	2018-06-05	Additions to gtex page.
Rmd	cdb0541	Peter Carbonetto	2018-06-05	wflow_publish(“gtex.Rmd”)
Rmd	3345ce3	Peter Carbonetto	2018-06-04	A few minor revisions to instructions for running pipelines.
Rmd	8b09c36	Peter Carbonetto	2018-06-04	wflow_publish(c(“index.Rmd”, “gtex.Rmd”))
html	ec7e83e	Peter Carbonetto	2018-06-04	Re-built main webpages after several updates and improvements.
Rmd	340ad6f	Peter Carbonetto	2018-06-04	wflow_publish(c(“index.Rmd”, “gtex.Rmd”, “fastqtl2mash.Rmd”))
Rmd	ec55029	Peter Carbonetto	2018-06-01	Misc. fixes.
html	00c1dfa	Peter Carbonetto	2018-06-01	Build site.
Rmd	6a456e6	Peter Carbonetto	2018-06-01	Moved some output files to data folder; removed some old files from
Rmd	2945c3d	Peter Carbonetto	2018-05-31	Removed most of the materials from the README.

Overview

To reproduce the results of Urbut, Wang & Stephens (2017), please follow these instructions. You are welcome to adapt these steps to your own study. Please also visit the mashr R package repository, which has a more user-friendly interface and tutorials on how to apply multivariate adaptive shrinkage (mash) to association analysis of gene expression (eQTL analysis).

The complete analyses of the GTEx data require installation of several programs and libraries, as well as large data sets that are specifically prepared for mash. To facilitate reproducing our results, we provide data that were pre-processed using the fastqtl2mash preprocessing pipeline. We have also developed a Docker container that includes all software components necessary to run the analyses. Docker can run on most popular operating systems (Mac, Windows and Linux) and cloud computing services such as Amazon Web Services and Microsoft Azure. If you have not used Docker before, you might want to read an introduction to Docker to learn the basic concepts and understand its main benefits.

For details on how the Docker image was configured, see mash.dockerfile in the workflows directory of the git repository. The Docker image used for our analyses is based on gaow/lab-base, a customized Docker image for development with R and Python.

If you find a bug in any of these steps, please post an issue.

Download and install Docker

Download Docker (note that a free community edition of Docker is available), and install it following the instructions provided on the Docker website. Once you have installed Docker, check that Docker is working correctly by following Part 1 of the “Getting Started” guide. If you are new to Docker, we recommend reading the entire “Getting Started” guide.

Note: Setting up Docker requires that you have administrator access to your computer. Singularity is an alternative that accepts Docker images and does not require administrator access.

Download and test Docker image

Define this alias in the shell; it is used below to run commands inside the Docker container:

alias mash-docker='docker run --security-opt label:disable -t '\
'-P -h MASH -w $PWD -v $HOME:/home/$USER -v /tmp:/tmp -v $PWD:$PWD '\
'-u $UID:${GROUPS[0]} -e HOME=/home/$USER -e USER=$USER gaow/mash-paper'

The -v flags in this command map directories between the standard computing environment and the Docker container. Since the analyses below will write files to these directories, it is important to ensure that:

Environment variables $HOME and $PWD are set to valid and writeable directories (usually your home and current working directories, respectively).
/tmp should also be a valid and writeable directory.

If any of these conditions does not hold, please adjust the alias accordingly. The remaining options only affect operation of the container, and so should function the same regardless of your operating system.
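The conditions above can be checked before the container is run. The following is a minimal sketch and not part of the original workflow; the function name check_writable_dirs is made up for illustration.

```shell
# Hypothetical helper (not part of the gtexresults repository): report whether
# each directory passed as an argument exists and is writable.
check_writable_dirs() {
  local dir status=0
  for dir in "$@"; do
    if [ -d "$dir" ] && [ -w "$dir" ]; then
      echo "ok: $dir"
    else
      echo "not a writable directory: $dir"
      status=1
    fi
  done
  return "$status"
}

# The alias mounts these three locations into the container.
check_writable_dirs "$HOME" "$PWD" /tmp || echo "Adjust the -v options in the alias."
```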

Next, run a simple command in the Docker container to check that it has loaded successfully:

mash-docker uname -sn

This command will download the Docker image if it has not already been downloaded.

If the container ran successfully, you should see this information about the Docker container printed to the screen:

Linux MASH
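If you prefer to check the reply in a script rather than by eye, the comparison might look like the sketch below. The function name is_expected_reply is hypothetical. Stripping carriage returns is a defensive assumption: the alias allocates a terminal with -t, and terminal output can end lines with a carriage return.

```shell
# Hypothetical helper: succeed only if the given text is the expected reply.
# Carriage returns are stripped first, as a defensive measure.
is_expected_reply() {
  local reply
  reply=$(printf '%s' "$1" | tr -d '\r')
  [ "$reply" = "Linux MASH" ]
}

# Usage, in an interactive shell where the mash-docker alias is defined:
#   is_expected_reply "$(mash-docker uname -sn)" && echo "container OK"
```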

You can also run these commands to show the information about the image downloaded to your computer and the container that has run (and exited):

docker image list
docker container list --all

Note: If you get the error “Cannot connect to the Docker daemon. Is the docker daemon running on this host?” on Linux or macOS, see the Docker documentation for your platform for suggestions on how to resolve this issue.
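One quick way to tell whether the daemon is reachable is to ask the Docker client for its status. This is a sketch under the assumption that docker info exits with a nonzero status when it cannot connect; the function name docker_daemon_status is made up for illustration.

```shell
# Hypothetical helper: report whether the docker client can reach the daemon.
# Assumes "docker info" exits with a nonzero status when it cannot connect.
docker_daemon_status() {
  if docker info > /dev/null 2>&1; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

docker_daemon_status
```

If this prints "unreachable", the daemon is probably not running, or your user may lack permission to talk to it.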

Clone or download the gtexresults repository

Clone or download the gtexresults repository to your computer, then change your working directory in the shell to the root of the repository, e.g.,

cd gtexresults

All the commands below will be run from this directory.
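Since every later command depends on the working directory, a small check can save a failed run. The sketch below looks for two files named elsewhere on this page, the workflow notebook and the provided data file, and assumes both are present in your copy of the repository. The function name looks_like_repo_root is hypothetical.

```shell
# Hypothetical helper: succeed only if the given directory (default: the
# current one) contains the files that the later commands refer to.
looks_like_repo_root() {
  local root="${1:-.}"
  [ -f "$root/workflows/gtex6_mash_analysis.ipynb" ] &&
    [ -f "$root/data/MatrixEQTLSumStats.Portable.Z.rds" ]
}

if looks_like_repo_root; then
  echo "In the repository root."
else
  echo "Expected files not found; check your working directory."
fi
```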

Fit mash model and compute posterior statistics

Assuming your working directory is the root of the git repository (you can check this by running pwd), run all the steps of the analysis with this command:

mash-docker sos run workflows/gtex6_mash_analysis.ipynb

This command will take several hours to run; see below for more information on the individual steps. All outputs generated by this command will be saved to the output folder inside the repository.

Note that gtex6_mash_analysis.ipynb is a Jupyter notebook, and you may open it in Jupyter. However, you should not step through the code sequentially as you would in a typical Jupyter notebook, because the code in this notebook is meant to be run using the Script of Scripts (SoS) framework.

This command will execute the following steps of the analysis:

Compute a sparse factorization of the (centered) z-scores using the SFA software, with K = 5 factors, and save the factors in an .rds file. These factors will be used to construct the mixture-of-multivariate-normals prior. This step is labeled sfa, and should take only a few minutes to run.
Compute additional “data-driven” prior matrices by computing a singular value decomposition of the (centered) z-scores and low-rank approximations to the empirical covariance matrices. Most of the work in this step involves running the Extreme Deconvolution method. The outcome of running the Extreme Deconvolution method is saved to a new .rds file. This step is labeled mash-paper_1 and may take several hours to run (in one run on a MacBook Pro with a 3.5 GHz Intel Core i7, it took over 6 hours to complete).
Compute a final collection of “canonical” and single-rank prior matrices based on SFA and the “BMAlite” models of Flutre et al. (2013). These matrices are written to another .rds file. This step is labeled mash-paper_2, and should take at most a minute to run.
The mash-paper_3 step fits the mash model to the GTEx data (the centered z-scores); the model parameters estimated in this fitting step are the weights of the multivariate normal mixture. The outputs from this step are the estimated mixture weights and the conditional likelihood matrix. These two outputs are saved to two separate .rds files. This step is expected to take at most a few hours to complete.
The mash-paper_4 step computes posterior statistics using the fitted mash model from the previous step. These posterior statistics are summarized and visualized in subsequent analysis steps. The posterior statistics are saved to another .rds file. This step is expected to take a few hours to complete.
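Because each step above writes its result to an .rds file, listing those files gives a rough view of how far a long run has progressed. This is a minimal sketch that assumes the files sit directly in the output folder; the function name list_rds_outputs is made up for illustration.

```shell
# Hypothetical helper: print the .rds files in a directory (default: output),
# one per line, newest first; prints nothing if there are none yet.
list_rds_outputs() {
  local dir="${1:-output}"
  ls -t "$dir"/*.rds 2> /dev/null || true
}

list_rds_outputs
```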

Note: All containers that have run and exited will still be retained in the Docker system. Run docker container list --all to list all previously run containers. To clear these containers, run docker container prune. See the Docker documentation for more information.
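An alternative to pruning afterwards is to have Docker discard each container when it exits, using the --rm option of docker run. The sketch below is a variant of the alias defined earlier. The name mash-docker-rm is made up, and this variant is a suggestion, not what was used to produce the results.

```shell
# Hypothetical variant of the alias defined earlier: identical except for
# --rm, which asks Docker to remove the container as soon as it exits.
alias mash-docker-rm='docker run --rm --security-opt label:disable -t '\
'-P -h MASH -w $PWD -v $HOME:/home/$USER -v /tmp:/tmp -v $PWD:$PWD '\
'-u $UID:${GROUPS[0]} -e HOME=/home/$USER -e USER=$USER gaow/mash-paper'
```

With this variant, exited containers no longer appear in the output of docker container list --all.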

Generate figures and tables summarizing results of mash analysis

Once you have completed the mash analysis pipeline, the next step is to examine and interpret the results. We provide R code implementing this step; you can either view the webpages listed below, or view the R Markdown source files in the analysis directory of the gtexresults repository.

If you were unable to complete the mash analysis pipeline, you can still generate the figures and tables below: for convenience, we have saved the results they require in the output folder. (If you do complete the mash analysis, these files will be overwritten by your outputs.)

Before running the R code in the pages listed below, you will first need to install the R packages used in the code:

install.packages(c("colorRamps","rmeta"))

Note that at the bottom of each webpage, we have recorded information about the exact version of R and the R packages that were used. This might be useful if you would like to replicate our computing setup as closely as possible. Also note that the webpages were generated from the R Markdown files using workflowr.

Visit the following links to view the R code we used to generate the summaries of the GTEx results incorporated into the manuscript:

Primary correlation patterns identified by mash in GTEx data.
Correlation patterns from other components with larger weights: second, fourth, fifth and eighth covariance components.
Examples illustrating how mash uses patterns of sharing to inform effect estimates.
Pairwise sharing by magnitude of eQTLs among tissues.
Pairwise sharing by sign of eQTLs among tissues.
Comparison of genes with tissue-specific eQTLs against other genes.
Tissue effective sample sizes.
Number of tissue-specific eQTLs in each tissue.

Figure 5: Histogram of sharing.

Supplementary Figure 3: Illustration of how linkage disequilibrium can impact effect estimates (table and figure).

Table 1: Heterogeneity analysis, simulation and data.

Additional usage notes

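See the Jupyter notebook workflows/gtex6_mash_analysis.ipynb for more details on each step. As noted above, the notebook is meant to be run with SoS and not stepped through cell by cell.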
In the data folder, we have provided a file MatrixEQTLSumStats.Portable.Z.rds containing eQTL summary statistics from the GTEx study, in a format suitable for running mash. This file was generated from the original eQTL summary statistics downloaded from the GTEx Portal website, then converted using the code in fastqtl_to_mash.ipynb. See the fastqtl2mash page for details on this step.
Run the following command to update the Docker image: docker pull gaow/mash-paper

This reproducible R Markdown analysis was created with workflowr 1.0.1.9000