<!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta charset="utf-8" /> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <meta name="generator" content="pandoc" /> <meta name="author" content="Lambda Moses" /> <meta name="date" content="2018-12-14" /> <title>Processing kallisto bus Output (10x v2 chemistry)</title> <script src="site_libs/jquery-1.11.3/jquery.min.js"></script> <meta name="viewport" content="width=device-width, initial-scale=1" /> <link href="site_libs/bootstrap-3.3.5/css/cosmo.min.css" rel="stylesheet" /> <script src="site_libs/bootstrap-3.3.5/js/bootstrap.min.js"></script> <script src="site_libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script> <script src="site_libs/bootstrap-3.3.5/shim/respond.min.js"></script> <script src="site_libs/jqueryui-1.11.4/jquery-ui.min.js"></script> <link href="site_libs/tocify-1.9.1/jquery.tocify.css" rel="stylesheet" /> <script src="site_libs/tocify-1.9.1/jquery.tocify.js"></script> <script src="site_libs/navigation-1.1/tabsets.js"></script> <link href="site_libs/highlightjs-9.12.0/textmate.css" rel="stylesheet" /> <script src="site_libs/highlightjs-9.12.0/highlight.js"></script> <link href="site_libs/font-awesome-5.1.0/css/all.css" rel="stylesheet" /> <link href="site_libs/font-awesome-5.1.0/css/v4-shims.css" rel="stylesheet" /> <style type="text/css">code{white-space: pre;}</style> <style type="text/css"> pre:not([class]) { background-color: white; } </style> <script type="text/javascript"> if (window.hljs) { hljs.configure({languages: []}); hljs.initHighlightingOnLoad(); if (document.readyState && document.readyState === "complete") { window.setTimeout(function() { hljs.initHighlighting(); }, 0); } } </script> <style type="text/css"> h1 { font-size: 34px; } h1.title { font-size: 38px; } h2 { font-size: 30px; } h3 { font-size: 24px; } h4 { font-size: 18px; } h5 { font-size: 16px; } h6 { font-size: 12px; } .table th:not([align]) { text-align: left; } </style> </head> <body> <style type = "text/css"> .main-container { max-width: 940px; margin-left: auto; margin-right: auto; } code { color: inherit; background-color: rgba(0, 0, 0, 0.04); } img { max-width:100%; height: auto; } .tabbed-pane { padding-top: 12px; } .html-widget { margin-bottom: 20px; } button.code-folding-btn:focus { outline: none; } summary { display: list-item; } </style> <style type="text/css"> /* padding for bootstrap navbar */ body { padding-top: 51px; padding-bottom: 40px; } /* offset scroll position for anchor links (for fixed navbar) */ .section h1 { padding-top: 56px; margin-top: -56px; } .section h2 { padding-top: 56px; margin-top: -56px; } .section h3 { padding-top: 56px; margin-top: -56px; } .section h4 { padding-top: 56px; margin-top: -56px; } .section h5 { padding-top: 56px; margin-top: -56px; } .section h6 { padding-top: 56px; margin-top: -56px; } .dropdown-submenu { position: relative; } .dropdown-submenu>.dropdown-menu { top: 0; left: 100%; margin-top: -6px; margin-left: -1px; border-radius: 0 6px 6px 6px; } .dropdown-submenu:hover>.dropdown-menu { display: block; } .dropdown-submenu>a:after { display: block; content: " "; float: right; width: 0; height: 0; border-color: transparent; border-style: solid; border-width: 5px 0 5px 5px; border-left-color: #cccccc; margin-top: 5px; margin-right: -10px; } .dropdown-submenu:hover>a:after { border-left-color: #ffffff; } .dropdown-submenu.pull-left { float: none; } .dropdown-submenu.pull-left>.dropdown-menu { left: -100%; margin-left: 10px; border-radius: 6px 0 6px 6px; } </style> <script> // manage active state of menu based on current page $(document).ready(function () { // active menu anchor href = window.location.pathname href = href.substr(href.lastIndexOf('/') + 1) if (href === "") href = "index.html"; var menuAnchor = $('a[href="' + href + '"]'); // mark it active menuAnchor.parent().addClass('active'); // if it's got a parent navbar menu mark it active as well menuAnchor.closest('li.dropdown').addClass('active'); }); </script> <div class="container-fluid main-container"> <!-- tabsets --> <style type="text/css"> .tabset-dropdown > .nav-tabs { display: inline-table; max-height: 500px; min-height: 44px; overflow-y: auto; background: white; border: 1px solid #ddd; border-radius: 4px; } .tabset-dropdown > .nav-tabs > li.active:before { content: ""; font-family: 'Glyphicons Halflings'; display: inline-block; padding: 10px; border-right: 1px solid #ddd; } .tabset-dropdown > .nav-tabs.nav-tabs-open > li.active:before { content: ""; border: none; } .tabset-dropdown > .nav-tabs.nav-tabs-open:before { content: ""; font-family: 'Glyphicons Halflings'; display: inline-block; padding: 10px; border-right: 1px solid #ddd; } .tabset-dropdown > .nav-tabs > li.active { display: block; } .tabset-dropdown > .nav-tabs > li > a, .tabset-dropdown > .nav-tabs > li > a:focus, .tabset-dropdown > .nav-tabs > li > a:hover { border: none; display: inline-block; border-radius: 4px; } .tabset-dropdown > .nav-tabs.nav-tabs-open > li { display: block; float: none; } .tabset-dropdown > .nav-tabs > li { display: none; } </style> <script> $(document).ready(function () { window.buildTabsets("TOC"); }); $(document).ready(function () { $('.tabset-dropdown > .nav-tabs > li').click(function () { $(this).parent().toggleClass('nav-tabs-open') }); }); </script> <!-- code folding --> <script> $(document).ready(function () { // move toc-ignore selectors from section div to header $('div.section.toc-ignore') .removeClass('toc-ignore') .children('h1,h2,h3,h4,h5').addClass('toc-ignore'); // establish options var options = { selectors: "h1,h2,h3", theme: "bootstrap3", context: '.toc-content', hashGenerator: function (text) { return text.replace(/[.\\/?&!#<>]/g, '').replace(/\s/g, '_').toLowerCase(); }, ignoreSelector: ".toc-ignore", scrollTo: 0 }; options.showAndHide = true; options.smoothScroll = true; // tocify var toc = $("#TOC").tocify(options).data("toc-tocify"); }); </script> <style type="text/css"> #TOC { margin: 25px 0px 20px 0px; } @media (max-width: 768px) { #TOC { position: relative; width: 100%; } } .toc-content { padding-left: 30px; padding-right: 40px; } div.main-container { max-width: 1200px; } div.tocify { width: 20%; max-width: 260px; max-height: 85%; } @media (min-width: 768px) and (max-width: 991px) { div.tocify { width: 25%; } } @media (max-width: 767px) { div.tocify { width: 100%; max-width: none; } } .tocify ul, .tocify li { line-height: 20px; } .tocify-subheader .tocify-item { font-size: 0.90em; padding-left: 25px; text-indent: 0; } .tocify .list-group-item { border-radius: 0px; } </style> <!-- setup 3col/9col grid for toc_float and main content --> <div class="row-fluid"> <div class="col-xs-12 col-sm-4 col-md-3"> <div id="TOC" class="tocify"> </div> </div> <div class="toc-content col-xs-12 col-sm-8 col-md-9"> <div class="navbar navbar-default navbar-fixed-top" role="navigation"> <div class="container"> <div class="navbar-header"> <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar"> <span class="icon-bar"></span> <span class="icon-bar"></span> <span class="icon-bar"></span> </button> <a class="navbar-brand" href="index.html">BUStoolsR_notebooks</a> </div> <div id="navbar" class="navbar-collapse collapse"> <ul class="nav navbar-nav"> <li> <a href="index.html">Home</a> </li> <li> <a href="about.html">About</a> </li> <li> <a href="license.html">License</a> </li> </ul> <ul class="nav navbar-nav navbar-right"> <li> <a href="https://github.com/jdblischak/workflowr"> <span class="fa fa-github"></span> </a> </li> </ul> </div><!--/.nav-collapse --> </div><!--/.container --> </div><!--/.navbar --> <!-- Add a small amount of space between sections. --> <style type="text/css"> div.section { padding-top: 12px; } </style> <div class="fluid-row" id="header"> <h1 class="title toc-ignore">Processing kallisto bus Output (10x v2 chemistry)</h1> <h4 class="author"><em>Lambda Moses</em></h4> <h4 class="date"><em>2018-12-14</em></h4> </div> <p><strong>Last updated:</strong> 2018-12-14</p> <strong>workflowr checks:</strong> <small>(Click a bullet for more information)</small> <ul> <li> <p><details> <summary> <strong style="color:blue;">✔</strong> <strong>R Markdown file:</strong> up-to-date </summary></p> <p>Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.</p> </details> </li> <li> <p><details> <summary> <strong style="color:blue;">✔</strong> <strong>Environment:</strong> empty </summary></p> <p>Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.</p> </details> </li> <li> <p><details> <summary> <strong style="color:blue;">✔</strong> <strong>Seed:</strong> <code>set.seed(20181214)</code> </summary></p> <p>The command <code>set.seed(20181214)</code> was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.</p> </details> </li> <li> <p><details> <summary> <strong style="color:blue;">✔</strong> <strong>Session information:</strong> recorded </summary></p> <p>Great job! Recording the operating system, R version, and package versions is critical for reproducibility.</p> </details> </li> <li> <p><details> <summary> <strong style="color:blue;">✔</strong> <strong>Repository version:</strong> <a href="https://github.com/lambdamoses/BUStoolsR_notebooks/tree/d288e19e9b36268288537c66f056598990d32062" target="_blank">d288e19</a> </summary></p> Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated. <br><br> Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use <code>wflow_publish</code> or <code>wflow_git_commit</code>). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated: <pre><code> Ignored files: Ignored: .Rhistory Ignored: .Rproj.user/ Ignored: data/fastqs/ Ignored: data/hgmm_1k_fastqs.tar Ignored: data/hs_cdna.fa.gz Ignored: data/mm_cdna.fa.gz Ignored: data/whitelist_v2.txt Ignored: output/hs_mm_tr_index.idx Ignored: output/out_hgmm1k/ </code></pre> Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes. </details> </li> </ul> <details> <summary> <small><strong>Expand here to see past versions:</strong></small> </summary> <ul> <table style="border-collapse:separate; border-spacing:5px;"> <thead> <tr> <th style="text-align:left;"> File </th> <th style="text-align:left;"> Version </th> <th style="text-align:left;"> Author </th> <th style="text-align:left;"> Date </th> <th style="text-align:left;"> Message </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Rmd </td> <td style="text-align:left;"> <a href="https://github.com/lambdamoses/BUStoolsR_notebooks/blob/d288e19e9b36268288537c66f056598990d32062/analysis/10xv2.Rmd" target="_blank">d288e19</a> </td> <td style="text-align:left;"> Lambda Moses </td> <td style="text-align:left;"> 2018-12-14 </td> <td style="text-align:left;"> Cache rather than skip evaluation </td> </tr> <tr> <td style="text-align:left;"> html </td> <td style="text-align:left;"> <a href="https://cdn.rawgit.com/lambdamoses/BUStoolsR_notebooks/30c9aa31c60fd01ec97934196bdcaad38e1f920e/docs/10xv2.html" target="_blank">30c9aa3</a> </td> <td style="text-align:left;"> Lambda Moses </td> <td style="text-align:left;"> 2018-12-14 </td> <td style="text-align:left;"> Build site. </td> </tr> <tr> <td style="text-align:left;"> Rmd </td> <td style="text-align:left;"> <a href="https://github.com/lambdamoses/BUStoolsR_notebooks/blob/fd3d5ae03b7336d210a1e59af3b2b99f20a6c787/analysis/10xv2.Rmd" target="_blank">fd3d5ae</a> </td> <td style="text-align:left;"> Lambda Moses </td> <td style="text-align:left;"> 2018-12-14 </td> <td style="text-align:left;"> Don’t collapse output </td> </tr> <tr> <td style="text-align:left;"> html </td> <td style="text-align:left;"> <a href="https://cdn.rawgit.com/lambdamoses/BUStoolsR_notebooks/2da3b350a1686dc06961608b49e193f1995d74fa/docs/10xv2.html" target="_blank">2da3b35</a> </td> <td style="text-align:left;"> Lambda Moses </td> <td style="text-align:left;"> 2018-12-14 </td> <td style="text-align:left;"> Build site. </td> </tr> <tr> <td style="text-align:left;"> Rmd </td> <td style="text-align:left;"> <a href="https://github.com/lambdamoses/BUStoolsR_notebooks/blob/1fc3e91f1f693eef956c8117a016e06851a470f8/analysis/10xv2.Rmd" target="_blank">1fc3e91</a> </td> <td style="text-align:left;"> Lambda Moses </td> <td style="text-align:left;"> 2018-12-14 </td> <td style="text-align:left;"> Publish 10xv2 notebook </td> </tr> <tr> <td style="text-align:left;"> Rmd </td> <td style="text-align:left;"> <a href="https://github.com/lambdamoses/BUStoolsR_notebooks/blob/074d55f43942e807d167dab5636ec5bb8a1559fd/analysis/10xv2.Rmd" target="_blank">074d55f</a> </td> <td style="text-align:left;"> Lambda Moses </td> <td style="text-align:left;"> 2018-12-14 </td> <td style="text-align:left;"> Initial commit, already with 10xv2 notebook </td> </tr> </tbody> </table> </ul> <p></details></p> <hr /> <p>In this vignette, we process fastq data from scRNA-seq (10x v2 chemistry) to make a sparse matrix that can be used in downstream analysis. In this vignette, we will start that standard downstream analysis with <code>Seurat</code>.</p> <div id="download-data" class="section level2"> <h2>Download data</h2> <p>The data set we are using here is 1k 1:1 Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells from the 10x website. First, we download the fastq files (6.34 GB).</p> <pre class="r"><code>download.file("http://cf.10xgenomics.com/samples/cell-exp/2.1.0/hgmm_1k/hgmm_1k_fastqs.tar", destfile = "./data/hgmm_1k_fastqs.tar", quiet = TRUE)</code></pre> <p>Then untar this file</p> <pre class="bash"><code>cd ./data tar -xvf ./hgmm_1k_fastqs.tar</code></pre> <pre><code>#> fastqs/ #> fastqs/hgmm_1k_S1_L001_I1_001.fastq.gz #> fastqs/hgmm_1k_S1_L001_R1_001.fastq.gz #> fastqs/hgmm_1k_S1_L001_R2_001.fastq.gz #> fastqs/hgmm_1k_S1_L002_I1_001.fastq.gz #> fastqs/hgmm_1k_S1_L002_R1_001.fastq.gz #> fastqs/hgmm_1k_S1_L002_R2_001.fastq.gz #> fastqs/hgmm_1k_S1_L003_I1_001.fastq.gz #> fastqs/hgmm_1k_S1_L003_R1_001.fastq.gz #> fastqs/hgmm_1k_S1_L003_R2_001.fastq.gz #> fastqs/hgmm_1k_S1_L004_I1_001.fastq.gz #> fastqs/hgmm_1k_S1_L004_R1_001.fastq.gz #> fastqs/hgmm_1k_S1_L004_R2_001.fastq.gz #> fastqs/hgmm_1k_S1_L005_I1_001.fastq.gz #> fastqs/hgmm_1k_S1_L005_R1_001.fastq.gz #> fastqs/hgmm_1k_S1_L005_R2_001.fastq.gz #> fastqs/hgmm_1k_S1_L006_I1_001.fastq.gz #> fastqs/hgmm_1k_S1_L006_R1_001.fastq.gz #> fastqs/hgmm_1k_S1_L006_R2_001.fastq.gz #> fastqs/hgmm_1k_S1_L007_I1_001.fastq.gz #> fastqs/hgmm_1k_S1_L007_R1_001.fastq.gz #> fastqs/hgmm_1k_S1_L007_R2_001.fastq.gz #> fastqs/hgmm_1k_S1_L008_I1_001.fastq.gz #> fastqs/hgmm_1k_S1_L008_R1_001.fastq.gz #> fastqs/hgmm_1k_S1_L008_R2_001.fastq.gz</code></pre> </div> <div id="build-the-kallisto-index" class="section level2"> <h2>Build the kallisto index</h2> <p>Here we use <a href="https://pachterlab.github.io/kallisto/about">kallisto</a> (see this link for install instructions) to pseudoalign the reads to the transcriptome and then to create the <code>bus</code> file to be converted to a sparse matrix. The first step is to build an index of the transcriptome. This data set has both human and mouse cells, so we need both human and mouse transcriptomes.</p> <pre class="r"><code># Human transcriptome download.file("ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz", "./data/hs_cdna.fa.gz", quiet = TRUE) # Mouse transcriptome download.file("ftp://ftp.ensembl.org/pub/release-94/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz", "./data/mm_cdna.fa.gz", quiet = TRUE)</code></pre> <pre class="bash"><code>kallisto version</code></pre> <pre><code>#> kallisto, version 0.45.0</code></pre> <p>Actually, we don’t need to unzip the fasta files</p> <pre class="bash"><code>kallisto index -i ./output/hs_mm_tr_index.idx ./data/hs_cdna.fa.gz ./data/mm_cdna.fa.gz</code></pre> <pre><code>#> #> [build] loading fasta file ./data/hs_cdna.fa.gz #> [build] loading fasta file ./data/mm_cdna.fa.gz #> [build] k-mer length: 31 #> [build] warning: clipped off poly-A tail (longer than 10) #> from 2071 target sequences #> [build] warning: replaced 8 non-ACGUT characters in the input sequence #> with pseudorandom nucleotides #> [build] counting k-mers ... done. #> [build] building target de Bruijn graph ... done #> [build] creating equivalence classes ... done #> [build] target de Bruijn graph has 2138563 contigs and contains 206125466 k-mers</code></pre> </div> <div id="run-kallisto-bus" class="section level2"> <h2>Run kallisto bus</h2> <p>Here we will generate the bus file. These are the technologies supported by <code>kallisto bus</code>:</p> <pre class="r"><code>system("kallisto bus --list", intern = TRUE)</code></pre> <pre><code>#> Warning in system("kallisto bus --list", intern = TRUE): running command #> 'kallisto bus --list' had status 1</code></pre> <pre><code>#> [1] "List of supported single cell technologies" #> [2] "" #> [3] "short name description" #> [4] "---------- -----------" #> [5] "10Xv1 10X chemistry version 1" #> [6] "10Xv2 10X chemistry verison 2" #> [7] "DropSeq DropSeq" #> [8] "inDrop inDrop" #> [9] "CELSeq CEL-Seq" #> [10] "CELSeq2 CEL-Seq version 2" #> [11] "SCRBSeq SCRB-Seq" #> [12] "" #> attr(,"status") #> [1] 1</code></pre> <p>Here we have 8 samples. Each sample has 3 files: <code>I1</code> means sample index, <code>R1</code> means barcode and UMI, and <code>R2</code> means the piece of cDNA. The <code>-i</code> argument specifies the index file we just built. The <code>-o</code> argument specifies the output directory. The <code>-x</code> argument specifies the sequencing technology used to generate this data set. The <code>-t</code> argument specifies the number of threads used.</p> <pre class="bash"><code>cd ./data kallisto bus -i ../output/hs_mm_tr_index.idx -o ../output/out_hgmm1k -x 10xv2 -t8 \ ./fastqs/hgmm_1k_S1_L001_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L001_R2_001.fastq.gz \ ./fastqs/hgmm_1k_S1_L002_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L002_R2_001.fastq.gz \ ./fastqs/hgmm_1k_S1_L003_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L003_R2_001.fastq.gz \ ./fastqs/hgmm_1k_S1_L004_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L004_R2_001.fastq.gz \ ./fastqs/hgmm_1k_S1_L005_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L005_R2_001.fastq.gz \ ./fastqs/hgmm_1k_S1_L006_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L006_R2_001.fastq.gz \ ./fastqs/hgmm_1k_S1_L007_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L007_R2_001.fastq.gz \ ./fastqs/hgmm_1k_S1_L008_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L008_R2_001.fastq.gz</code></pre> <pre><code>#> #> [index] k-mer length: 31 #> [index] number of targets: 302,896 #> [index] number of k-mers: 206,125,466 #> [index] number of equivalence classes: 1,252,306 #> [quant] will process sample 1: ./fastqs/hgmm_1k_S1_L001_R1_001.fastq.gz #> ./fastqs/hgmm_1k_S1_L001_R2_001.fastq.gz #> [quant] will process sample 2: ./fastqs/hgmm_1k_S1_L002_R1_001.fastq.gz #> ./fastqs/hgmm_1k_S1_L002_R2_001.fastq.gz #> [quant] will process sample 3: ./fastqs/hgmm_1k_S1_L003_R1_001.fastq.gz #> ./fastqs/hgmm_1k_S1_L003_R2_001.fastq.gz #> [quant] will process sample 4: ./fastqs/hgmm_1k_S1_L004_R1_001.fastq.gz #> ./fastqs/hgmm_1k_S1_L004_R2_001.fastq.gz #> [quant] will process sample 5: ./fastqs/hgmm_1k_S1_L005_R1_001.fastq.gz #> ./fastqs/hgmm_1k_S1_L005_R2_001.fastq.gz #> [quant] will process sample 6: ./fastqs/hgmm_1k_S1_L006_R1_001.fastq.gz #> ./fastqs/hgmm_1k_S1_L006_R2_001.fastq.gz #> [quant] will process sample 7: ./fastqs/hgmm_1k_S1_L007_R1_001.fastq.gz #> ./fastqs/hgmm_1k_S1_L007_R2_001.fastq.gz #> [quant] will process sample 8: ./fastqs/hgmm_1k_S1_L008_R1_001.fastq.gz #> ./fastqs/hgmm_1k_S1_L008_R2_001.fastq.gz #> [quant] finding pseudoalignments for the reads ... done #> [quant] processed 63,252,296 reads, 52,229,344 reads pseudoaligned</code></pre> <p>See what are the outputs</p> <pre class="r"><code>list.files("./output/out_hgmm1k/")</code></pre> <pre><code>#> [1] "matrix.ec" "output.bus" "output.sorted" #> [4] "output.sorted.txt" "run_info.json" "transcripts.txt"</code></pre> </div> <div id="running-bustools" class="section level2"> <h2>Running <code>BUStools</code></h2> <p>The <code>output.bus</code> file is a binary. In order to make R parse it, we need to convert it into a sorted text file. There’s a command line tool <a href="https://github.com/BUStools/bustools"><code>bustools</code></a> for this.</p> <pre class="bash"><code># Add where I installed bustools to PATH export PATH=$PATH:/home/lambda/mylibs/bin/ # Sort bustools sort -o ./output/out_hgmm1k/output.sorted -t8 ./output/out_hgmm1k/output.bus # Convert sorted file to text bustools text -o ./output/out_hgmm1k/output.sorted.txt ./output/out_hgmm1k/output.sorted</code></pre> <pre><code>#> Read in 52229344 number of busrecords #> All sorted #> Read in 43769910 number of busrecords</code></pre> </div> <div id="mapping-transcripts-to-genes" class="section level2"> <h2>Mapping transcripts to genes</h2> <pre class="r"><code>library(BUStoolsR)</code></pre> <p>For the sparse matrix, we are interested in how many UMIs per gene per cell, rather than per transcript. Remember in the output of <code>kallisto bus</code>, there’s the file <code>transcripts.txt</code>. Those are the transcripts in the transcriptome index. Now we’ll only keep the transcripts present there and make sure that the transcripts in <code>tr2g</code> are in the same order as those in the index. This function might be a bit slow; what’s slow is the biomart query, not processing data frames.</p> <p>Note that the function <code>transcript2gene</code> only works for organisms that have gene and transcript IDs in Ensembl, since behind the scene, it’s using biomart to query Ensembl.</p> <pre class="r"><code>tr2g <- transcript2gene(c("Homo sapiens", "Mus musculus"), kallisto_out_path = "./output/out_hgmm1k")</code></pre> </div> <div id="mapping-ecs-to-genes" class="section level2"> <h2>Mapping ECs to genes</h2> <p>The 3rd column in the <code>output.sorted.txt</code> is the equivalence class index of each UMI for each cell barcode. Equivalence class (EC) means the set of transcripts in the transcriptome that the read is compatible to. While in most cases, an EC only has transcripts for the same gene, there are some ECs that have transcripts for different genes. The file in the <code>kallisto bus</code> output, <code>matrix.ec</code>, maps the EC index in <code>output.sorted.txt</code> to sets of line numbers in the transcriptome assembly. That’s why we ensured that the <code>tr2g</code> data frame has the same order as the transcripts in the index.</p> <pre class="r"><code>genes <- EC2gene(tr2g, "./output/out_hgmm1k", ncores = 10, verbose = FALSE)</code></pre> <p>Now for each EC, we have a set of genes the EC is compatible to.</p> </div> <div id="making-the-sparse-matrix" class="section level2"> <h2>Making the sparse matrix</h2> <pre class="r"><code>library(data.table)</code></pre> <p>For 10x, we do have a file with all valid cell barcodes that comes with CellRanger.</p> <pre class="bash"><code># Copy v2 chemistry whitelist to working directory cp /home/lambda/cellranger-3.0.1/cellranger-cs/3.0.1/lib/python/cellranger/barcodes/737K-august-2016.txt \ ./data/whitelist_v2.txt</code></pre> <pre class="r"><code># Read in the whitelist whitelist_v2 <- fread("./data/whitelist_v2.txt", header = FALSE)$V1</code></pre> <p>Now we have everything we need to make the sparse matrix. This function reads in <code>output.sorted.txt</code> line by line and processes them. It does not do barcode correction for now, so the barcode must exactly match those in the whitelist if one is provided. It took 5 to 6 minutes to construct the sparse matrix in the hgmm6k dataset, which has over 280 million lines in <code>output.sorted.txt</code>, which is over 9GB. Here the data set is smaller, so it’s not taking as long.</p> <pre class="r"><code>res_mat <- make_sparse_matrix("./output/out_hgmm1k/output.sorted.txt", genes = genes, est_ncells = 1500, est_ngenes = nrow(tr2g), whitelist = whitelist_v2)</code></pre> <pre><code>#> Reading data #> Read 5 million lines #> Read 10 million lines #> Read 15 million lines #> Read 20 million lines #> Read 25 million lines #> Read 30 million lines #> Read 35 million lines #> Read 40 million lines #> Constructing sparse matrix</code></pre> </div> <div id="explore-the-data" class="section level2"> <h2>Explore the data</h2> <pre class="r"><code>library(Seurat) library(tidyverse) library(parallel) library(Matrix)</code></pre> <div id="filter-data" class="section level3"> <h3>Filter data</h3> <p>Cool, so now we have the sparse matrix. What does it look like?</p> <pre class="r"><code>dim(res_mat)</code></pre> <pre><code>#> [1] 51306 353957</code></pre> <p>That’s way more cells than we expect, which is about 1000. So what’s going on?</p> <p>How many UMIs per barcode?</p> <pre class="r"><code>tot_counts <- colSums(res_mat) summary(tot_counts)</code></pre> <pre><code>#> Min. 1st Qu. Median Mean 3rd Qu. Max. #> 1.0 1.0 2.0 101.8 11.0 92910.0</code></pre> <p>The vast majority of “cells” have only a few UMI detected. Those are likely to be spurious. In Seurat’s vignettes, a low cutoff is usually set to the total number of UMIs in a cell, and that depends on the sequencing depth.</p> <pre class="r"><code>bcs_use <- tot_counts > 500 tot_counts_filtered <- tot_counts[bcs_use] hist(tot_counts_filtered, breaks = 100, main = "Histogram of nUMI")</code></pre> <p><img src="figure/10xv2.Rmd/unnamed-chunk-20-1.png" width="672" style="display: block; margin: auto;" /></p> <details> <summary><em>Expand here to see past versions of unnamed-chunk-20-1.png:</em></summary> <table style="border-collapse:separate; border-spacing:5px;"> <thead> <tr> <th style="text-align:left;"> Version </th> <th style="text-align:left;"> Author </th> <th style="text-align:left;"> Date </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> <a href="https://github.com/lambdamoses/BUStoolsR_notebooks/blob/2da3b350a1686dc06961608b49e193f1995d74fa/docs/figure/10xv2.Rmd/unnamed-chunk-20-1.png" target="_blank">2da3b35</a> </td> <td style="text-align:left;"> Lambda Moses </td> <td style="text-align:left;"> 2018-12-14 </td> </tr> </tbody> </table> <p></details></p> <pre class="r"><code># Filter the matrix res_mat <- res_mat[,bcs_use] dim(res_mat)</code></pre> <pre><code>#> [1] 51306 1176</code></pre> <p>Now this is a more reasonable number of cells.</p> </div> <div id="cell-species" class="section level3"> <h3>Cell species</h3> <p>How many cells are from humans and how many from mice? The number of cells with mixed species indicates doublet rate.</p> <pre class="r"><code>gene_species <- ifelse(str_detect(rownames(res_mat), "^ENSMUSG"), "mouse", "human") mouse_inds <- gene_species == "mouse" human_inds <- gene_species == "human" # mark cells as mouse or human cell_species <- tibble(n_mouse_umi = colSums(res_mat[mouse_inds,]), n_human_umi = colSums(res_mat[human_inds,]), tot_umi = colSums(res_mat), prop_mouse = n_mouse_umi / tot_umi, prop_human = n_human_umi / tot_umi)</code></pre> <pre class="r"><code># Classify species based on proportion of UMI cell_species <- cell_species %>% mutate(species = case_when( prop_mouse > 0.9 ~ "mouse", prop_human > 0.9 ~ "human", TRUE ~ "mixed" ))</code></pre> <pre class="r"><code>ggplot(cell_species, aes(n_human_umi, n_mouse_umi, color = species)) + geom_point(size = 0.5) + theme_bw()</code></pre> <p><img src="figure/10xv2.Rmd/unnamed-chunk-24-1.png" width="672" style="display: block; margin: auto;" /></p> <details> <summary><em>Expand here to see past versions of unnamed-chunk-24-1.png:</em></summary> <table style="border-collapse:separate; border-spacing:5px;"> <thead> <tr> <th style="text-align:left;"> Version </th> <th style="text-align:left;"> Author </th> <th style="text-align:left;"> Date </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> <a href="https://github.com/lambdamoses/BUStoolsR_notebooks/blob/2da3b350a1686dc06961608b49e193f1995d74fa/docs/figure/10xv2.Rmd/unnamed-chunk-24-1.png" target="_blank">2da3b35</a> </td> <td style="text-align:left;"> Lambda Moses </td> <td style="text-align:left;"> 2018-12-14 </td> </tr> </tbody> </table> <p></details></p> <p>Great, looks like the vast majority of cells are not mixed.</p> <pre class="r"><code>cell_species %>% count(species) %>% mutate(proportion = n / ncol(res_mat))</code></pre> <pre><code>#> # A tibble: 3 x 3 #> species n proportion #> <chr> <int> <dbl> #> 1 human 603 0.513 #> 2 mixed 5 0.00425 #> 3 mouse 568 0.483</code></pre> <p>Great, only about 0.4% of cells here are doublets, which is lower than the ~1% 10x lists. However, doublets can still be formed with cells from the same species.</p> </div> <div id="seurat-exploration" class="section level3"> <h3>Seurat exploration</h3> <pre class="r"><code>seu <- CreateSeuratObject(res_mat, min.cells = 3) %>% NormalizeData(verbose = FALSE) %>% ScaleData(verbose = FALSE) %>% FindVariableFeatures(verbose = FALSE)</code></pre> <pre class="r"><code># Add species to meta data seu <- AddMetaData(seu, metadata = cell_species$species, col.name = "species")</code></pre> <pre class="r"><code>VlnPlot(seu, c("nCount_RNA", "nFeature_RNA"), group.by = "species")</code></pre> <p><img src="figure/10xv2.Rmd/unnamed-chunk-28-1.png" width="672" style="display: block; margin: auto;" /></p> <details> <summary><em>Expand here to see past versions of unnamed-chunk-28-1.png:</em></summary> <table style="border-collapse:separate; border-spacing:5px;"> <thead> <tr> <th style="text-align:left;"> Version </th> <th style="text-align:left;"> Author </th> <th style="text-align:left;"> Date </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> <a href="https://github.com/lambdamoses/BUStoolsR_notebooks/blob/2da3b350a1686dc06961608b49e193f1995d74fa/docs/figure/10xv2.Rmd/unnamed-chunk-28-1.png" target="_blank">2da3b35</a> </td> <td style="text-align:left;"> Lambda Moses </td> <td style="text-align:left;"> 2018-12-14 </td> </tr> </tbody> </table> <p></details></p> <pre class="r"><code>seu <- RunPCA(seu, verbose = FALSE, npcs = 30) ElbowPlot(seu, ndims = 30)</code></pre> <p><img src="figure/10xv2.Rmd/unnamed-chunk-29-1.png" width="672" style="display: block; margin: auto;" /></p> <details> <summary><em>Expand here to see past versions of unnamed-chunk-29-1.png:</em></summary> <table style="border-collapse:separate; border-spacing:5px;"> <thead> <tr> <th style="text-align:left;"> Version </th> <th style="text-align:left;"> Author </th> <th style="text-align:left;"> Date </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> <a href="https://github.com/lambdamoses/BUStoolsR_notebooks/blob/2da3b350a1686dc06961608b49e193f1995d74fa/docs/figure/10xv2.Rmd/unnamed-chunk-29-1.png" target="_blank">2da3b35</a> </td> <td style="text-align:left;"> Lambda Moses </td> <td style="text-align:left;"> 2018-12-14 </td> </tr> </tbody> </table> <p></details></p> <pre class="r"><code>DimPlot(seu, reduction = "pca", pt.size = 0.5, group.by = "species")</code></pre> <p><img src="figure/10xv2.Rmd/unnamed-chunk-30-1.png" width="672" style="display: block; margin: auto;" /></p> <details> <summary><em>Expand here to see past versions of unnamed-chunk-30-1.png:</em></summary> <table style="border-collapse:separate; border-spacing:5px;"> <thead> <tr> <th style="text-align:left;"> Version </th> <th style="text-align:left;"> Author </th> <th style="text-align:left;"> Date </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> <a href="https://github.com/lambdamoses/BUStoolsR_notebooks/blob/2da3b350a1686dc06961608b49e193f1995d74fa/docs/figure/10xv2.Rmd/unnamed-chunk-30-1.png" target="_blank">2da3b35</a> </td> <td style="text-align:left;"> Lambda Moses </td> <td style="text-align:left;"> 2018-12-14 </td> </tr> </tbody> </table> <p></details></p> <p>The first PC separates species, as expected.</p> <pre class="r"><code>seu <- RunTSNE(seu, dims = 1:20, check_duplicates = FALSE) DimPlot(seu, reduction = "tsne", pt.size = 0.5, group.by = "species")</code></pre> <p><img src="figure/10xv2.Rmd/unnamed-chunk-31-1.png" width="672" style="display: block; margin: auto;" /></p> <details> <summary><em>Expand here to see past versions of unnamed-chunk-31-1.png:</em></summary> <table style="border-collapse:separate; border-spacing:5px;"> <thead> <tr> <th style="text-align:left;"> Version </th> <th style="text-align:left;"> Author </th> <th style="text-align:left;"> Date </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> <a href="https://github.com/lambdamoses/BUStoolsR_notebooks/blob/2da3b350a1686dc06961608b49e193f1995d74fa/docs/figure/10xv2.Rmd/unnamed-chunk-31-1.png" target="_blank">2da3b35</a> </td> <td style="text-align:left;"> Lambda Moses </td> <td style="text-align:left;"> 2018-12-14 </td> </tr> </tbody> </table> <p></details></p> <p>The species separate, as expected.</p> </div> </div> <div id="session-information" class="section level2"> <h2>Session information</h2> <pre class="r"><code>sessionInfo()</code></pre> <pre><code>#> R version 3.5.1 (2018-07-02) #> Platform: x86_64-redhat-linux-gnu (64-bit) #> Running under: CentOS Linux 7 (Core) #> #> Matrix products: default #> BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so #> #> locale: #> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C #> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 #> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 #> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C #> [9] LC_ADDRESS=C LC_TELEPHONE=C #> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C #> #> attached base packages: #> [1] parallel stats graphics grDevices utils datasets methods #> [8] base #> #> other attached packages: #> [1] bindrcpp_0.2.2 Matrix_1.2-15 forcats_0.3.0 #> [4] stringr_1.3.1 dplyr_0.7.8 purrr_0.2.5 #> [7] readr_1.3.0 tidyr_0.8.2 tibble_1.4.2 #> [10] ggplot2_3.1.0 tidyverse_1.2.1 Seurat_3.0.0.9000 #> [13] data.table_1.11.8 BUStoolsR_0.99.0 #> #> loaded via a namespace (and not attached): #> [1] Rtsne_0.15 colorspace_1.3-2 ggridges_0.5.1 #> [4] rprojroot_1.3-2 rstudioapi_0.8 listenv_0.7.0 #> [7] npsurv_0.4-0 ggrepel_0.8.0 bit64_0.9-7 #> [10] fansi_0.4.0 AnnotationDbi_1.42.1 lubridate_1.7.4 #> [13] xml2_1.2.0 codetools_0.2-15 splines_3.5.1 #> [16] R.methodsS3_1.7.1 lsei_1.2-0 knitr_1.21 #> [19] zeallot_0.1.0 jsonlite_1.6 workflowr_1.1.1 #> [22] broom_0.5.1 ica_1.0-2 cluster_2.0.7-1 #> [25] png_0.1-7 R.oo_1.22.0 compiler_3.5.1 #> [28] httr_1.4.0 backports_1.1.2 assertthat_0.2.0 #> [31] lazyeval_0.2.1 cli_1.0.1 htmltools_0.3.6 #> [34] prettyunits_1.0.2 tools_3.5.1 rsvd_1.0.0 #> [37] igraph_1.2.2 gtable_0.2.0 glue_1.3.0 #> [40] RANN_2.6 Rcpp_1.0.0 Biobase_2.42.0 #> [43] cellranger_1.1.0 gdata_2.18.0 nlme_3.1-137 #> [46] gbRd_0.4-11 lmtest_0.9-36 xfun_0.4 #> [49] globals_0.12.4 rvest_0.3.2 irlba_2.3.2 #> [52] gtools_3.8.1 XML_3.98-1.16 future_1.10.0 #> [55] MASS_7.3-51.1 zoo_1.8-4 scales_1.0.0 #> [58] hms_0.4.2 RColorBrewer_1.1-2 yaml_2.2.0 #> [61] curl_3.2 memoise_1.1.0 reticulate_1.10 #> [64] pbapply_1.3-4 biomaRt_2.38.0 stringi_1.2.4 #> [67] RSQLite_2.1.1 S4Vectors_0.20.1 caTools_1.17.1.1 #> [70] BiocGenerics_0.28.0 bibtex_0.4.2 Rdpack_0.10-1 #> [73] SDMTools_1.1-221 rlang_0.3.0.1 pkgconfig_2.0.2 #> [76] bitops_1.0-6 evaluate_0.12 lattice_0.20-38 #> [79] ROCR_1.0-7 bindr_0.1.1 labeling_0.3 #> [82] htmlwidgets_1.3 cowplot_0.9.3 bit_1.1-14 #> [85] tidyselect_0.2.5 plyr_1.8.4 magrittr_1.5 #> [88] R6_2.3.0 IRanges_2.16.0 gplots_3.0.1 #> [91] generics_0.0.2 DBI_1.0.0 withr_2.1.2 #> [94] pillar_1.3.0 haven_2.0.0 whisker_0.3-2 #> [97] fitdistrplus_1.0-11 survival_2.43-3 RCurl_1.95-4.11 #> [100] future.apply_1.0.1 tsne_0.1-3 modelr_0.1.2 #> [103] crayon_1.3.4 utf8_1.1.4 KernSmooth_2.23-15 #> [106] plotly_4.8.0 rmarkdown_1.11 progress_1.2.0 #> [109] readxl_1.1.0 grid_3.5.1 blob_1.1.1 #> [112] git2r_0.23.0 metap_1.0 digest_0.6.18 #> [115] R.utils_2.7.0 stats4_3.5.1 munsell_0.5.0 #> [118] viridisLite_0.3.0</code></pre> </div> <!-- Adjust MathJax settings so that all math formulae are shown using TeX fonts only; see http://docs.mathjax.org/en/latest/configuration.html. This will make the presentation more consistent at the cost of the webpage sometimes taking slightly longer to load. Note that this only works because the footer is added to webpages before the MathJax javascript. --> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ "HTML-CSS": { availableFonts: ["TeX"] } }); </script> <hr> <p> This reproducible <a href="http://rmarkdown.rstudio.com">R Markdown</a> analysis was created with <a href="https://github.com/jdblischak/workflowr">workflowr</a> 1.1.1 </p> <hr> </div> </div> </div> <script> // add bootstrap table styles to pandoc tables function bootstrapStylePandocTables() { $('tr.header').parent('thead').parent('table').addClass('table table-condensed'); } $(document).ready(function () { bootstrapStylePandocTables(); }); </script> <!-- dynamically load mathjax for compatibility with self-contained --> <script> (function () { var script = document.createElement("script"); script.type = "text/javascript"; script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"; document.getElementsByTagName("head")[0].appendChild(script); })(); </script> </body> </html>