Last updated: 2018-05-14
workflowr checks:
✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
✔ Seed: set.seed(1)
The command set.seed(1) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 3431f02
Make sure that all relevant files for the analysis were committed to Git before the results were generated (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .sos/
Ignored: data/.sos/
Ignored: workflows/.ipynb_checkpoints/
Ignored: workflows/.sos/
Untracked files:
Untracked: gtex6_workflow_output/
Untracked: temp.R
Unstaged changes:
Modified: README.md
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 3431f02 | Peter Carbonetto | 2018-05-14 | wflow_publish(“index.Rmd”) |
Rmd | ea42e8a | Peter Carbonetto | 2018-05-07 | Added step to install packages in instructions (index.Rmd). |
Rmd | 6ebb3f9 | Peter Carbonetto | 2018-05-07 | wflow_publish(“index.Rmd”) |
Rmd | c6c8450 | Peter Carbonetto | 2018-05-07 | Setting up workflowr settings. |
html | c6c8450 | Peter Carbonetto | 2018-05-07 | Setting up workflowr settings. |
Rmd | 33f72d6 | Peter Carbonetto | 2018-05-07 | Added workflowr 1.0.1.9000 files. |
html | b4e38c2 | Gao Wang | 2017-09-20 | Fix one more broken link |
html | 80f285f | Gao Wang | 2017-09-20 | Update figures |
html | afc401f | Peter Carbonetto | 2017-09-20 | Moved doc to docs. |
To reproduce the results of Urbut, Wang & Stephens (2017), please follow these instructions.
For more information, please see the README.
The complete analyses require installation of several programs and libraries, and several large data sets. To facilitate reproducing our results, we provide pre-processed data for use with the core analysis, and a bioinformatics pipeline with a small toy data set to demonstrate the pre-processing step. We have also developed a Docker container that includes all software components necessary to run the analyses. Docker can run on most popular operating systems (Mac, Windows and Linux). It also runs on cloud computing services such as Amazon Web Services and Microsoft Azure. If you have not used Docker before, you might want to read this to learn the basic concepts and understand the benefits of Docker.
For details on how the Docker image was configured, see the Dockerfile. The Docker image used for our analyses is based on gaow/lab-base, a Docker image for development with R and Python.
If you prefer to run the analyses without Docker, you will need to set up the computing environment yourself. The software and libraries used are Python 3.x, R, SFA, ExtremeDeconvolution, MOSEK, OpenMP, MKL, GSL, HDF5 tools, pytables and rhdf5; mashr, an improved implementation of MASH, is also installed. The Dockerfile shows how these components were configured in the Docker image.
Download Docker (note that a free community edition of Docker is available), and install it following the instructions provided on the Docker website. Once you have installed Docker, check that Docker is working correctly by following Part 1 of the Getting Started guide. If you are using Docker for the first time, we recommend reading the entire Getting Started guide. Note that setting up Docker requires that you have administrator access to your computer. (Singularity is an alternative that accepts Docker images and does not require administrator access.)
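Before working through the Getting Started guide, it can help to confirm that the docker command is available in your shell at all. The following is a minimal sketch; the helper name have_command is hypothetical and is not part of the gtexresults repository.

```shell
# Hypothetical helper (not part of the repository): succeed only if the
# named command can be found on the PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Usage: print the Docker client version only if the docker CLI was found.
if have_command docker; then
    docker --version
else
    echo "docker not found on PATH; install Docker first" >&2
fi
```

This only checks that the client is installed; it does not confirm that the Docker daemon is running, which is what Part 1 of the Getting Started guide verifies.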
Run this alias command in the shell, which will be used below to run commands inside the Docker container:
alias mash-docker='docker run --security-opt label:disable -t -P -h MASH '\
'-w $PWD -v $HOME:/home/$USER -v /tmp:/tmp -v $PWD:$PWD '\
'-u $UID:${GROUPS[0]} -e HOME=/home/$USER -e USER=$USER gaow/mash-paper'
The -v flags in this command map directories between the standard computing environment and the Docker container. Since the analyses below will write files to these directories, it is important to ensure that:
Environment variables $HOME and $PWD are set to valid and writeable directories (usually your home and current working directories, respectively).
/tmp is also a valid and writeable directory.
If either of these statements is not true, please adjust the alias accordingly. The remaining options only affect operation of the container, and so should function the same regardless of your operating system.
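The directory requirements above can be checked from the shell before the alias is used. This is a sketch only; check_writable_dirs is a hypothetical helper name, and the three locations passed to it are the ones mounted by the -v flags.

```shell
# Hypothetical helper: succeed only if every argument is an existing,
# writeable directory; report each argument that is not.
check_writable_dirs() {
    status=0
    for dir in "$@"; do
        if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
            echo "not a writeable directory: $dir" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage: check the three locations that the alias mounts into the container.
if check_writable_dirs "$HOME" "$PWD" /tmp; then
    echo "all mounted directories are writeable"
fi
```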
Next, run a simple command in the Docker container to check that the container has loaded successfully:
mash-docker uname -sn
This command will download the Docker image if it has not already been downloaded.
If the container ran successfully, you should see this information about the Docker container printed to the screen:
Linux MASH
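If you want to check this output in a script rather than by eye, a comparison along the following lines may be enough. It is a sketch under two assumptions: the helper name is_expected_uname_output is hypothetical, and carriage returns are stripped because the -t flag in the alias may add them to captured output.

```shell
# Hypothetical helper: succeed only if the text passed as the first
# argument is the "Linux MASH" line, ignoring any carriage returns.
is_expected_uname_output() {
    cleaned=$(printf '%s' "$1" | tr -d '\r')
    [ "$cleaned" = "Linux MASH" ]
}

# Usage, in an interactive shell where the mash-docker alias is defined:
#   is_expected_uname_output "$(mash-docker uname -sn)" && echo "container OK"
```

Aliases are normally expanded only in interactive shells, so the usage line is intended to be typed at the prompt after the image has been downloaded.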
You can also run these commands to show the information about the image downloaded to your computer and the container that has run (and exited):
docker image list
docker container list --all
Note: If you get the error “Cannot connect to the Docker daemon. Is the docker daemon running on this host?” in Linux or macOS, see here for Linux or here for Mac for suggestions on how to resolve this issue.
Clone or download the gtexresults repository to your computer, then change your working directory in the shell to the root of the repository, e.g.,
cd gtexresults
After doing this, running ls -1 should show the top-level contents of this repository:
LICENSE
README.md
TODO.txt
analysis
data
docs
output
workflows
All commands below will be run from this directory.
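Since every later command assumes this working directory, a quick check that the expected top-level entries are present can catch a wrong location early. The sketch below uses a hypothetical helper, is_repo_root, and the entry names are copied from the listing above; they may differ in other versions of the repository.

```shell
# Hypothetical helper: succeed only if the given directory (default: the
# current one) contains every top-level entry from the listing above.
is_repo_root() {
    root=${1:-.}
    for entry in LICENSE README.md TODO.txt analysis data docs output workflows; do
        if [ ! -e "$root/$entry" ]; then
            echo "missing: $root/$entry" >&2
            return 1
        fi
    done
    return 0
}

# Usage: report whether the current directory looks like the repository root.
if is_repo_root; then
    echo "current directory looks like the gtexresults root"
else
    echo "change to the root of the gtexresults repository first" >&2
fi
```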
Assuming your working directory is the root of the git repository (you can check by running pwd), run all the steps of the analysis with this command:
mash-docker sos run workflows/gtex6_mash_analysis.ipynb
This command will take several hours to run; see below for more information on the individual steps. All outputs generated by this command will be saved to folder gtex6_workflow_output inside the repository.
Note that you may recognize this file as a Jupyter notebook. Indeed, you may open this notebook in Jupyter. However, you should not step through the code sequentially as you would in a typical Jupyter notebook; this is because the code in this notebook is meant to be run using the Script of Scripts (SoS) framework.
This command will execute the following steps of the analysis:
Compute a sparse factorization of the (centered) z-scores using the SFA software, with K = 5 factors, and save the factors in an .rds file. This will be used to construct the mixture-of-multivariate normals prior. This step is labeled sfa, and should only take a few minutes to run.
Compute additional “data-driven” prior matrices by computing a singular value decomposition of the (centered) z-scores and low-rank approximations to the empirical covariance matrices. Most of the work in this step involves running the Extreme Deconvolution method. The outcome of running the Extreme Deconvolution method is saved to a new .rds file. This step is labeled mash-paper_1 and may take several hours to run (in one run on a MacBook Pro with a 3.5 GHz Intel Core i7, it took over 6 hours to complete).
Compute a final collection of “canonical” and single-rank prior matrices based on SFA and the “BMAlite” models of Flutre et al (2013). These matrices are written to another .rds file. This step is labeled mash-paper_2, and should take at most a minute to run.
The mash-paper_3 step fits the MASH (“multivariate adaptive shrinkage”) model to the GTEx data (the centered z-scores); the model parameters estimated in this fitting step are the weights of the multivariate normal mixture. The outputs from this step are the estimated mixture weights and the conditional likelihood matrix. These two outputs are saved to two separate .rds files. This step is expected to take at most a few hours to complete.
The mash-paper_4 step computes posterior statistics using the fitted MASH model from the previous step. These posterior statistics are summarized and visualized in subsequent analyses. The posterior statistics are saved to another .rds file. This step is expected to take a few hours to complete.
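Because each step above writes its results to .rds files, listing those files is a simple way to see how far a run has progressed. The sketch below assumes the files end up somewhere under gtex6_workflow_output, as stated earlier; the helper name list_rds_outputs is hypothetical, and the individual file names are not specified here, so none are assumed.

```shell
# Hypothetical helper: print the .rds files found under a directory
# (default: gtex6_workflow_output), one per line, sorted by path.
list_rds_outputs() {
    out_dir=${1:-gtex6_workflow_output}
    if [ ! -d "$out_dir" ]; then
        echo "no such directory: $out_dir" >&2
        return 1
    fi
    find "$out_dir" -type f -name '*.rds' | sort
}

# Usage: list the outputs written so far, if the output folder exists.
if [ -d gtex6_workflow_output ]; then
    list_rds_outputs
fi
```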
Finally, note that all containers that have run and exited will still be retained in the Docker system. Run docker container list --all to list all previously run containers. To clear these previously run containers, run docker container prune. See here for more information.
Install some packages from CRAN:
install.packages(c("colorRamps","fields","ggplot2"))
FIXME: update figure plotting instructions
The input data necessary to run this analysis are all available under inputs. The analysis may take some time to run. We have provided the outputs of running mash in Data_vhat.
This repo is organized so that you can run Mash using the GTEx data contained in Inputs to produce the parameters and posteriors from mashr.
The directory Plots_for_Paper_vmat contains .Rmd files to plot figures from the paper, using our results which are provided in Results_Data.
Here is the link to the index discussing the reproduction of all of the plots.
Figure 3: Summary of primary patterns identified by mash in GTEx data
Figure 5: Histogram of Sharing
Figure 6: Pairwise sharing by magnitude of eQTL among tissues
Supplementary Figure 1: Sample sizes and effective sample sizes from mash analysis across tissues
Supplementary Figure 2: There are 4 figures here:
Summary of covariance matrices Uk with largest estimated weight (> 1%) in GTEx data: Uk2
Summary of covariance matrices Uk with largest estimated weight (> 1%) in GTEx data: Uk4
Summary of covariance matrices Uk with largest estimated weight (> 1%) in GTEx data: Uk5
Summary of covariance matrices Uk with largest estimated weight (> 1%) in GTEx data: Uk8
Supplementary Figure 3: Illustration of how Linkage Disequilibrium can impact effect estimate table and figure
Supplementary Figure 4: Pairwise Sharing By Sign
Supplementary Figure 5: Number of “tissue-specific eQTLs” in each tissue.
Supplementary Figure 6: Expression levels in genes with “tissue-specific eQTLs” are similar to those in other genes
Table 1: Heterogeneity Analysis Simulation and Data.
Above we have given the minimal instructions necessary to reproduce the results of Urbut et al (2017). Here are some additional details about the analyses.
TO DO: Things that will go here:
Explain how to get a summary of the possible analysis steps that can be run.
See the Jupyter notebook to get more details; how the notebook should be interpreted.
Explain how to run the analysis using the improved (faster) implementation of the mashr R package.
MOSEK?
In the data folder, we have provided a file MatrixEQTLSumStats.Portable.Z.rds containing eQTL summary statistics in a format convenient for running MASH. This file was generated by converting the original eQTL summary statistics, downloaded from the GTEx Portal website, using the code in fastqtl_to_mash.ipynb. See below for details on this step.
Here we explain how the MatrixEQTLSumStats.Portable.Z.rds data file was generated from the source data downloaded from the GTEx Portal. The source data are the SNP-gene association statistics from release 6 of the GTEx Project (GTEx_Analysis_V6_all-snp-gene-associations.tar).
In the repository you will find workflows/fastqtl_to_mash.ipynb, which converts eQTL summary statistics (by default in fastqtl format) to MASH format. The computation is configured to run in parallel for eQTL results from multiple studies. An example data set can be found at data/eQTLDataDemo. The workflow is documented within the notebook itself, and has a few options to customize the input and output.
To read what’s available, run:
mash-docker sos run workflows/fastqtl_to_mash.ipynb export
and read the HTML files gtex6_workflow_output/fastqtl_to_mash.lite.html and gtex6_workflow_output/fastqtl_to_mash.full.html.
To run the conversion:
mash-docker sos run workflows/fastqtl_to_mash.ipynb \
--data_list data/eQTLDataDemo/FastQTLSumStats.list \
--gene_list data/eQTLDataDemo/GTEx_genes.txt
In practice, for GTEx data the conversion is computationally intensive and is best done in a cluster environment configured to run the workflow across different nodes.
Run the following command to update the Docker image:
docker pull gaow/mash-paper
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] workflowr_1.0.1.9000 Rcpp_0.12.16 digest_0.6.15
[4] rprojroot_1.3-2 R.methodsS3_1.7.1 backports_1.1.2
[7] git2r_0.21.0 magrittr_1.5 evaluate_0.10.1
[10] stringi_1.1.7 whisker_0.3-2 R.oo_1.21.0
[13] R.utils_2.6.0 rmarkdown_1.9 tools_3.4.3
[16] stringr_1.3.0 yaml_2.1.18 compiler_3.4.3
[19] htmltools_0.3.6 knitr_1.20
This reproducible R Markdown analysis was created with workflowr 1.0.1.9000