While the most useful analysis of VCF variants comes from using GEMINI, often-times the analyst just wants to quickly check a VCF. SeeGEM can use the vcfR package to import a VCF and produce a SeeGEM report
This process will load the entire VCF into memory. I strongly recommend you check the file size to make sure that the VCF is a reasonable size, so as not to take down your computer. I am being vague about what reasonable is, as that depends on how much memory you have and whether the VCF is compressed (ends in *gz
) or not. As a rough rule of thumb, if the VCF is larger than 100mb, then I suggest you filter it before running through SeeGEM.
library(vcfR)
#>
#> ***** *** vcfR *** *****
#> This is vcfR 1.8.0
#> browseVignettes('vcfR') # Documentation
#> citation('vcfR') # Citation
#> ***** ***** ***** *****
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.5.1
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(SeeGEM)
vcf <- read.vcfR(system.file("extdata/example.vcf.gz", package="SeeGEM"))
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 469
#> header_line: 470
#> variant count: 459
#> column count: 10
#>
Meta line 469 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 459
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 459
#> row_num: 0
#>
Processed variant: 459
#> All variants processed
vcfR contains a handy function to convert the vcf into a tidy dataframe.
vcf_df <- vcfR2tidy(vcf)
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Extracting gt element GT
#> Extracting gt element MIN_DP
#> Extracting gt element PGT
#> Extracting gt element PID
#> Extracting gt element PL
#> Extracting gt element RGQ
#> Extracting gt element SB
vcfR2tidy
actually returns three tibbles (data frames):
fix
which holds all of the informationgt
which contains the genotype informationmeta
which has the information for the INFO fields contained in the VCF headerSince we want the information and the genotype, we are going to need to glue these together.
The fix
and gt
tibbles both have ChromKey
and POS
columns, so I will drop them from one when we column bind them
for_see_gem <- cbind(vcf_df$fix, vcf_df$gt %>% select(-`ChromKey`, -`POS`))
We could just dump for_see_gem
into a SeeGEM report, but that would be a bad idea. Why?
Well, let’s look at how many columns and rows this has.
for_see_gem %>% dim()
#> [1] 459 296
459 rows and 284 columns. This will be unpleasant to look it.
I do some light filtering (removing variants which fail the GATK filter and have a gnomAD allele frequency over 0.01) and only select the columns you are interested in before passing the data frame to knit_see_gem_from_table
to create the document.
knit_see_gem_from_table(table_data = for_see_gem %>%
filter(FILTER == 'PASS', gno_af_all < 0.01) %>%
select(CHROM, POS, REF, ALT, AC, Indiv, gt_GT, gt_DP, gno_af_all, ccr_pct_v1, REVEL),
sample_name = 'EXAMPLE',
title = 'Vignette')
#>
#>
#> processing file: document_template_from_vcf.Rmd
#>
|
| | 0%
|
|............. | 20%
#> inline R code fragments
#>
#>
|
|.......................... | 40%
#> label: Set parameter loading (with options)
#> List of 1
#> $ include: logi FALSE
#>
#>
|
|....................................... | 60%
#> inline R code fragments
#>
#>
|
|.................................................... | 80%
#> label: Print GEMINI DataTable
#>
|
|.................................................................| 100%
#> inline R code fragments
#> output file: document_template_from_vcf.knit.md
#> /usr/local/bin/pandoc +RTS -K512m -RTS document_template_from_vcf.utf8.md --to html4 --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash+smart --output /Users/mcgaugheyd/git/see_gem/SeeGEM_document.html --email-obfuscation none --self-contained --standalone --section-divs --template /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable 'theme:flatly' --include-in-header /var/folders/s4/y5f1tt296dj8088gvczcx11d4lrnr7/T//Rtmp57qchP/rmarkdown-str146713727dc2c.html --mathjax --variable 'mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'
#>
#> Output created: SeeGEM_document.html
To make this process a touch easier, I have written a function which wraps the vcfR2tidy
and cbind
steps into one call called vcf_to_tibble
You skip most the steps and above and get your tibble by calling this
ready_for_see_gem <- vcf_to_tibble(system.file("extdata/example.vcf.gz", package="SeeGEM"))
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 469
#> header_line: 470
#> variant count: 459
#> column count: 10
#>
Meta line 469 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 459
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 459
#> row_num: 0
#>
Processed variant: 459
#> All variants processed
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Extracting gt element GT
#> Extracting gt element MIN_DP
#> Extracting gt element PGT
#> Extracting gt element PID
#> Extracting gt element PL
#> Extracting gt element RGQ
#> Extracting gt element SB