Motivation

While the most useful analysis of VCF variants comes from using GEMINI, often-times the analyst just wants to quickly check a VCF. SeeGEM can use the vcfR package to import a VCF and produce a SeeGEM report

Warning

This process will load the entire VCF into memory. I strongly recommend you check the file size to make sure that the VCF is a reasonable size, so as not to take down your computer. I am being vague about what reasonable is, as that depends on how much memory you have and whether the VCF is compressed (ends in *gz) or not. As a rough rule of thumb, if the VCF is larger than 100mb, then I suggest you filter it before running through SeeGEM.

Load libraries

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.8.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.5.1
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(SeeGEM)

Import vcf

vcf <- read.vcfR(system.file("extdata/example.vcf.gz", package="SeeGEM"))
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 469
#>   header_line: 470
#>   variant count: 459
#>   column count: 10
#> 
Meta line 469 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 459
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 459
#>   row_num: 0
#> 
Processed variant: 459
#> All variants processed

vcfR contains a handy function to convert the vcf into a tidy dataframe.

vcf_df <- vcfR2tidy(vcf)
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Extracting gt element GT
#> Extracting gt element MIN_DP
#> Extracting gt element PGT
#> Extracting gt element PID
#> Extracting gt element PL
#> Extracting gt element RGQ
#> Extracting gt element SB

vcfR2tidy actually returns three tibbles (data frames):

  1. fix which holds all of the information
  2. gt which contains the genotype information
  3. meta which has the information for the INFO fields contained in the VCF header

Since we want the information and the genotype, we are going to need to glue these together.

The fix and gt tibbles both have ChromKey and POS columns, so I will drop them from one when we column bind them

for_see_gem <- cbind(vcf_df$fix, vcf_df$gt %>% select(-`ChromKey`, -`POS`))

We could just dump for_see_gem into a SeeGEM report, but that would be a bad idea. Why?

Well, let’s look at how many columns and rows this has.

for_see_gem %>% dim()
#> [1] 459 296

459 rows and 284 columns. This will be unpleasant to look it.

I do some light filtering (removing variants which fail the GATK filter and have a gnomAD allele frequency over 0.01) and only select the columns you are interested in before passing the data frame to knit_see_gem_from_table to create the document.

knit_see_gem_from_table(table_data = for_see_gem %>% 
                          filter(FILTER == 'PASS', gno_af_all < 0.01) %>% 
                          select(CHROM, POS, REF, ALT, AC, Indiv, gt_GT, gt_DP, gno_af_all, ccr_pct_v1, REVEL), 
                        sample_name = 'EXAMPLE', 
                        title = 'Vignette')
#> 
#> 
#> processing file: document_template_from_vcf.Rmd
#> 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |.............                                                    |  20%
#>    inline R code fragments
#> 
#> 
  |                                                                       
  |..........................                                       |  40%
#> label: Set parameter loading (with options) 
#> List of 1
#>  $ include: logi FALSE
#> 
#> 
  |                                                                       
  |.......................................                          |  60%
#>    inline R code fragments
#> 
#> 
  |                                                                       
  |....................................................             |  80%
#> label: Print GEMINI DataTable
#> 
  |                                                                       
  |.................................................................| 100%
#>    inline R code fragments
#> output file: document_template_from_vcf.knit.md
#> /usr/local/bin/pandoc +RTS -K512m -RTS document_template_from_vcf.utf8.md --to html4 --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash+smart --output /Users/mcgaugheyd/git/see_gem/SeeGEM_document.html --email-obfuscation none --self-contained --standalone --section-divs --template /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable 'theme:flatly' --include-in-header /var/folders/s4/y5f1tt296dj8088gvczcx11d4lrnr7/T//Rtmp57qchP/rmarkdown-str146713727dc2c.html --mathjax --variable 'mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'
#> 
#> Output created: SeeGEM_document.html

To make this process a touch easier, I have written a function which wraps the vcfR2tidy and cbind steps into one call called vcf_to_tibble

You skip most the steps and above and get your tibble by calling this

ready_for_see_gem <- vcf_to_tibble(system.file("extdata/example.vcf.gz", package="SeeGEM"))
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 469
#>   header_line: 470
#>   variant count: 459
#>   column count: 10
#> 
Meta line 469 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 459
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 459
#>   row_num: 0
#> 
Processed variant: 459
#> All variants processed
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Extracting gt element GT
#> Extracting gt element MIN_DP
#> Extracting gt element PGT
#> Extracting gt element PID
#> Extracting gt element PL
#> Extracting gt element RGQ
#> Extracting gt element SB