RTG product usage - baseline progressions

This chapter provides baseline task progressions that describe best-practice use of the product for data analysis.

Human read mapping and sequence variant detection

Use the following steps to detect all sequence variants between a reference genome and a sequenced DNA sample or set of related samples. This set of tasks steps through the main functionality of the RTG variant detection software pipeline: generating and evaluating gapped alignments, testing coverage depth, and calling sequence variants (SNPs, MNPs, and indels). While this progression is aimed at human variant detection, the process can easily accommodate other mammalian genomes. For non-mammalian species, some features such as sex and pedigree-aware mapping and variant calling may need to be omitted.

The following example supports the steps typical to human whole exome or whole genome analysis in which high-throughput sequencing with Illumina sequencing systems has generated reads of length 100 to 300 base pairs and at 25 to 30× genome coverage.

RTG features the ability to adjust mapping and variant calling according to the sex of the individual that has been sequenced. During mapping, reads will only be mapped against chromosomes that are present in that individuals genome (for example, for a female individual, reads will not be mapped to the Y chromosome). Similarly, sex-aware variant calling will automatically determine when to switch between diploid and haploid models. In addition, when pedigree information is available, Mendelian inheritance patterns are used to inform the variant calling.

This example demonstrates exome variant calling, automatic sex-aware calling capabilities, and joint variant calling with respect to a pedigree.

Data:

This baseline uses actual data downloaded from public databases. The reference sequence is the human GRCh38, downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz This version excludes alternate haplotypes but includes the hs38d1 decoy sequence which improves variant calling accuracy by providing targets for reads with high homology to sites on the primary chromosomes.

The read data is from human genome sequencing of the CEPH pedigree 1463, which comprises 17 members across three generations. Illumina sequencing data is provided as part of the Illumina Platinum Genomes project, available via https://www.illumina.com/platinumgenomes.html and also available in the European Nucleotide Archive. Complete Genomics sequencing data for these samples is available via http://www.completegenomics.com/public-data/69-Genomes/

Table : Overview of basic pipeline stages

Task

Command & Utilities

Purpose

1

Format reference data

rtg format

Convert reference sequence from FASTA files to RTG Sequence Data Format (SDF)

2

Prepare pedigree/sex information

rtg pedstats

Configure per-sample sex and pedigree relationship information in a PED file

3

Format read data

rtg format, rtg cg2sdf

Convert read sequence from FASTA or FASTQ files to RTG Sequence Data Format (SDF)

4

Map reads against a reference genome

rtg map, rtg cgmap

Generate read alignments against a given reference, and report in a BAM file for downstream analysis

5

View alignment results

rtg samstats

Evaluate alignments and determine if the mapping should be repeated with different settings

6

Generate coverage information

rtg coverage

Run the coverage command to generate coverage breadth and depth statistics

7

Call sequence variants (single sample)

rtg snp

Detect SNPs, MNPs, and indels in a sample relative to a reference genome

8

Call sequence variants (single family)

rtg family

Perform sex-aware joint variant calls relative to the reference on a Mendelian family

9

Call sequence variants (population)

rtg population

Perform sex-aware joint variant calls relative to the reference on a population

Task 1 - Format reference data

RTG requires a one-time conversion of the reference genome from FASTA files into the RTG SDF format. This provides the software with fast access to arbitrary genomic coordinates in a pre-parsed format. In addition, metadata detailed autosomes and sex chromosomes is associated with the SDF, enabling automatic handing of sex chromosomes in later tasks. This task will be completed with the format command. The conversion will create an SDF directory containing the reference genome.

For variant calling and CNV analysis, the reference genome should ideally employ fully assembled chromosomes (as opposed to contig-based coordinates where expected ploidy and inheritance characteristics may be unknown).

First, observe a typical genome reference with multiple chromosomes stored in compressed FASTA format.

$ ls -l /data/human/grch38/fasta
874637925 Feb 28 11:00 GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz

Now, use the format command to convert this input file into a single SDF directory for the reference genome. If the reference is comprised of multiple input files, these should be supplied on a single command line.

$ rtg format -o grch38 /data/human/grch38/fasta/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz
Formatting FASTA data
Processing "/data/human/grch38/fasta/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz"
Unexpected symbol "M" in sequence "chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38" replaced with "N".
[...]

Detected: 'Human GRCh38 with UCSC naming', installing reference.txt

Input Data
Files              : GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz
Format             : FASTA
Type               : DNA
Number of sequences: 2580
Total residues     : 3105715063
Minimum length     : 970
Maximum length     : 248956422

Output Data
SDF-ID             : c477b940-e9e7-4068-b7ed-f8968de308e3
Number of sequences: 2580
Total residues     : 3105715063
Minimum length     : 970
Maximum length     : 248956422

This takes the human reference FASTA file and creates a directory called grch38 containing the SDF. If the reference genome contains IUPAC ambiguity codes or other non-DNA characters these are replaced with ‘N’ (unknown) in the formatted SDF. The first few such codes will generate warning messages as above. Every SDF contains a unique SDF-ID identifier that is reported in log files of subsequent commands and to help identify potential pipeline issues that arise from incorrectly using different human reference builds at different pipeline stages.

Note

When formatting a reference genome, the format command will automatically recognize several common human reference genomes and install a reference.txt configuration file. For reference genomes which are not recognized, you should copy or create an appropriate reference.txt file into the SDF directory if you plan on performing sex-aware mapping and variant calling. See RTG reference file format

You can use the sdfstats command to show statistics for your reference SDF, including the individual sequence lengths.

$ rtg sdfstats --lengths grch38
Type               : DNA
Number of sequences: 2580
Maximum length     : 248956422
Minimum length     : 970
Sequence names     : yes
Sex metadata       : yes
Taxonomy metadata  : no
N                  : 165046090
A                  : 868075685
C                  : 599957776
G                  : 602090069
T                  : 870545443
Total residues     : 3105715063
Residue qualities  : no

Sequence lengths:
chr1    248956422
chr2    242193529
chr3    198295559
chr4    190214555
chr5    181538259
chr6    170805979
chr7    159345973
chr8    145138636
chr9    138394717
chr10   133797422
chr11   135086622
chr12   133275309
chr13   114364328
chr14   107043718
chr15   101991189
chr16   90338345
chr17   83257441
chr18   80373285
chr19   58617616
chr20   64444167
chr21   46709983
chr22   50818468
chrX    156040895
chrY    57227415
chrM    16569
chr1_KI270706v1_random  175055
chr1_KI270707v1_random  32032
chr1_KI270708v1_random  127682
[...]

Note the presence of reference chromosome metadata on the line Sex metadata. This indicates that the information for sex-aware processing is present. you can use the sdfstats command to verify the chromosome ploidy information for each sex. For the male sex chromosomes the PAR region mappings are also displayed.

$ rtg sdfstats grch38 --sex male --sex female
[...]
Sequences for sex=MALE:
chr1 DIPLOID linear 248956422
chr2 DIPLOID linear 242193529
chr3 DIPLOID linear 198295559
chr4 DIPLOID linear 190214555
chr5 DIPLOID linear 181538259
chr6 DIPLOID linear 170805979
chr7 DIPLOID linear 159345973
chr8 DIPLOID linear 145138636
chr9 DIPLOID linear 138394717
chr10 DIPLOID linear 133797422
chr11 DIPLOID linear 135086622
chr12 DIPLOID linear 133275309
chr13 DIPLOID linear 114364328
chr14 DIPLOID linear 107043718
chr15 DIPLOID linear 101991189
chr16 DIPLOID linear 90338345
chr17 DIPLOID linear 83257441
chr18 DIPLOID linear 80373285
chr19 DIPLOID linear 58617616
chr20 DIPLOID linear 64444167
chr21 DIPLOID linear 46709983
chr22 DIPLOID linear 50818468
chrX HAPLOID linear 156040895 ~=chrY
    chrX:10001-2781479  chrY:10001-2781479
    chrX:155701383-156030895  chrY:56887903-57217415
chrY HAPLOID linear 57227415 ~=chrX
    chrX:10001-2781479  chrY:10001-2781479
    chrX:155701383-156030895  chrY:56887903-57217415
chrM POLYPLOID circular 16569
chr1_KI270706v1_random DIPLOID linear 175055
[...]

Sequences for sex=FEMALE:
chr1 DIPLOID linear 248956422
chr2 DIPLOID linear 242193529
chr3 DIPLOID linear 198295559
chr4 DIPLOID linear 190214555
chr5 DIPLOID linear 181538259
chr6 DIPLOID linear 170805979
chr7 DIPLOID linear 159345973
chr8 DIPLOID linear 145138636
chr9 DIPLOID linear 138394717
chr10 DIPLOID linear 133797422
chr11 DIPLOID linear 135086622
chr12 DIPLOID linear 133275309
chr13 DIPLOID linear 114364328
chr14 DIPLOID linear 107043718
chr15 DIPLOID linear 101991189
chr16 DIPLOID linear 90338345
chr17 DIPLOID linear 83257441
chr18 DIPLOID linear 80373285
chr19 DIPLOID linear 58617616
chr20 DIPLOID linear 64444167
chr21 DIPLOID linear 46709983
chr22 DIPLOID linear 50818468
chrX DIPLOID linear 156040895
chrY NONE linear 57227415
chrM POLYPLOID circular 16569
chr1_KI270706v1_random DIPLOID linear 175055
[...]

Task 2 - Prepare sex/pedigree information

RTG commands for mapping and variant calling multiple samples make use of a pedigree file specifying sample names, their sex, and any relations between them. This is done by creating a standard PED format file containing information about the individual samples using a text editor. For example a PED file for the CEPH 1463 pedigree looks like the following.

$ cat population.ped
# PED format pedigree for CEPH/Utah 1463
# fam-id        ind-id  pat-id  mat-id  sex     phen
1       NA12889 0       0       1       0
1       NA12890 0       0       2       0
1       NA12877 NA12889 NA12890 1       0
2       NA12891 0       0       1       0
2       NA12892 0       0       2       0
2       NA12878 NA12891 NA12892 2       0
3       NA12879 NA12877 NA12878 2       0
3       NA12880 NA12877 NA12878 2       0
3       NA12881 NA12877 NA12878 2       0
3       NA12882 NA12877 NA12878 1       0
3       NA12883 NA12877 NA12878 1       0
3       NA12884 NA12877 NA12878 1       0
3       NA12885 NA12877 NA12878 2       0
3       NA12886 NA12877 NA12878 1       0
3       NA12887 NA12877 NA12878 2       0
3       NA12888 NA12877 NA12878 1       0
3       NA12893 NA12877 NA12878 1       0

A value of 0 indicates the field is unknown or that the sample is not present. Note that the IDs used in columns 2, 3, and 4 must match the sample IDs used during data formatting and variant calling. The sex field (column 5) may be 0 if the sex of a sample is unknown. The family ID in column 1 is ignored as RTG identifies family units according to the relationship information. You can use the pedstats command to verify the correct format:

$ rtg pedstats ceph-1463.ped
Pedigree file: ceph-1463.ped
Total samples:               17
Primary samples:             17
Male samples:                 9
Female samples:               8
Afflicted samples:            0
Founder samples:              4
Parent-child relationships:  26
Other relationships:          0
Families:                     3

The pedstats command can also output the pedigree structure in a format that can be displayed using the dot command from the Graphviz suite.

$ dot -Tpng <(rtg pedstats --dot "CEPH 1463" ceph-1463.ped) -o ceph-1463.png

This will create a PNG image that can be displayed in any image viewing tool.

$ firefox ceph-1463.png

This will display the pedigree structure as shown below.

_images/ceph-1463.png

For more information about the PED file format see Pedigree PED input file format.

Task 3 - Format read data

RTG mapping tools accept input read sequence data either in the RTG SDF format or other common sequence formats such as FASTA, FASTQ, or SAM/BAM. There are pros and cons as to whether to perform an initial format of the read sequence data to RTG SDF:

  • Pre-formatting requires an extra one-off workflow step (the format command), whereas native input file formats are directly accepted by many RTG commands.

  • Pre-formatting requires extra disk space for the SDF (although these can be deleted after processing if required).

  • With pre-formatting, decompression, parsing and error checking raw files is carried out only once, whereas native formats require this processing each time.

  • Pre-formatting permits random access to individual sequences or blocks of sequences, whereas with native formats, the whole file leading up to the region of interest must also be decompressed, and parsed.

  • Pre-formatting permits loading of sequence data, sequence names, and sequence quality values independently, allowing reduced RAM use during mapping

Thus, pre-formatting read sequence data can result in lower overall resource requirements (and faster throughput) than processing native file formats directly.

In this example we will be converting read sequence data from FASTQ files into the RTG SDF format. This task will be completed with the format command. The conversion will create one or more SDF directories for the reads.

Take a paired set of reads in FASTQ format and convert it into RTG data format (SDF). This example shows one lane of data, taking as input both left and right paired files from the same run.

$ mkdir sample_NA12878
$ rtg format -f fastq -q sanger -o sample_NA12878/NA12878_L001 \
    -l /data/reads/NA12878/NA12878_L001_1.fq.gz \
    -r /data/reads/NA12878/NA12878_L001_2.fq.gz \
    --sam-rg "@RG\\tID:NA12878_L001\\tSM:NA12878\\tPL:ILLUMINA"

This creates a directory named NA12878_L001 with two subdirectories, named left and right. Use the sdfstats command to verify this step.

$ rtg sdfstats sample_NA12878/NA12878_L001

It is good practice to ensure the output BAM files contain tracking information regarding the read set. This is achieved by storing the tracking information in the reads SDF by using the --sam-rg flag to provide a SAM-formatted read group header line. The header should specify at least the ID, SM and PL fields for the read group. The platform field (PL) values currently recognized by the RTG command for this attribute are ILLUMINA, LS454, IONTORRENT, COMPLETE, and COMPLETEGENOMICS. This information will automatically be used during mapping to enable automatic creation of calibration files that are used to perform base-quality recalibration during variant calling. In addition, the sample field (SM) is used to segregate samples when using multi-sample variant calling commands such as family, population, and somatic, and should correspond to the sample ids used in the corresponding pedigree file. The read group id field (ID) is a unique identifier used to group of reads that have the same source DNA and sequencing characteristics (for example, pertain to the same sequencing lane for the sample). For more details see Using SAM/BAM Read Groups in RTG map.

If the read group information is not specified when formatting the reads, it can be set during mapping instead using the --sam-rg flag of map.

Repeat this step for all available read data associated with the sample or samples to be processed. This example shows how this can be done with the format command in a loop.

$ for left_fq in /data/reads/NA12878/NA12878*_1.fq.gz; do
    right_fq=${left_fq/_1.fq.gz/_2.fq.gz}
    lane_id=$(basename ${left_fq/_1.fq.gz})
    rtg format -f fastq -q sanger -o ${lane_id} -l ${left_fq} -r ${right_fq} \
      --sam-rg "@RG\tID:${lane_id}\tSM:NA12878\tPL:ILLUMINA"
  done

RTG supports the custom read structure of Complete Genomics data. In this case, use the cg2sdf command to convert the GGI .tsv files into RTG SDF format. This example shows one run of data, taking as input a CGI read data file. It is important when mapping Complete Genomics reads that your read group information should set the platform field (PL) appropriately. For version 1 reads (35 base-pair reads), the platform should be COMPLETE. For version 2 reads (29 or 30 base-pair reads) the platform should be COMPLETEGENOMICS.

$ rtg cg2sdf -o sample_NA12878/NA12878_GS002290 /data/reads/cg/NA12878/GS002290.tsv \
    --sam-rg "@RG\tID:GS002290\tSM:NA12878\tPL:COMPLETE"

As with the formatting of Illumina data, this creates a directory named GS00290 with two subdirectories, named left and right. You can use the sdfstats command to verify this step.

$ rtg sdfstats sample_NA12878/NA12878_GS002290

Notice that during formatting we used the same sample identifier for both the Illumina and Complete Genomics data, which allows subsequent variant calling to associate both read sets with the NA12878 individual.

Repeat the formatting for other samples to be processed. When formatting data sets corresponding to other members of the pedigree, be sure to use the correct sample identifier for each individual.

Task 4 - Map reads to the reference genome

Map the sequence reads against the human reference genome to generate alignments in the BAM (Binary Sequence Alignment/Map) file format.

The RTG map command provides multiple tuning parameters to adjust sensitivity and selectivity at the read mapping stage. In general, whole genome mapping strategies aim to capture the highest number of reads possessing variant data and which map with a high degree of specificity to the human genome. Complete Genomics data has particular mapping requirements, and for this data use the separate cgmap command.

Depending on the downstream analysis required, the mapping may be adjusted to restrict alignments to unique genomic positions or to allow reporting of ambiguous mappings at equivalent regions throughout the genome. This example will show the recommended steps for human genome analysis.

By default, the mapping will report positions for all mated pairs and unmated reads that map to the reference genome to 5 or fewer positions, those with no good alignments or with more than 5 equally good alignment positions are considered unmapped and are flagged as such in the output.

Note that when mapping each individual we supply the pedigree file with --pedigree to automatically determine the appropriate sex to be used when mapping each sample. For a female sample, this will result in no mappings being made to the Y chromosome. It is also possible to manually specify the sex of each sample via --sex=male or --sex=female as appropriate, but using the pedigree file streamlines the process across all samples.

For the whole-genome Illumina data, a suitable map command is:

$ mkdir map_sample_NA12878
$ rtg map -t grch38 --pedigree ceph-1463.ped \
    -i sample_NA12878/NA12878_L001 -o map_sample_NA12878/NA12878_L001 \

For Complete Genomics reads we use the cgmap command.

$ mkdir cgmap_sample_NA12878
$ rtg cgmap -t grch38 --pedigree ceph-1463.ped --mask cg1 \
    -i sample_NA12878/NA12878_GS002290 -o map_sample_NA12878/NA12878_GS002290 \

The selection of the --mask parameter depends on the length of the Complete Genomics reads to be mapped. The cg1 and cg1-fast masks are appropriate for 35 base-pair reads and the cg2 mask is appropriate for 29 or 30 base-pair reads. See cgmap for more detail.

When the mapping command completes, multiple files will have been written to the output directory of the mapping run. By default the alignments.bam file is produced and a BAM index (alignments.bam.bai) is automatically created to permit efficient extraction of mappings within particular genomic regions. This behavior is necessary for subsequent analysis of the mappings, but can be performed manually using the index command.

During mapping RTG automatically creates a calibration file (alignments.bam.calibration) containing information about base qualities, average coverage levels, etc. This calibration information is utilized during variant calling to improve accuracy and to determine when coverage levels are abnormally high. When processing exome or other targeted data, it is important that this calibration information should only be computed for mappings within the exome capture regions, otherwise the computed typical coverage levels will be much lower than actual. This can result in RTG discarding many variant calls as being over-coverage. The correct workflow for exome processing is to specify --bed-regions to supply a BED file describing the exome regions at the same time as mapping, to ensure appropriate calibration is computed.

$ rtg map -t grch38 --pedigree ceph-1463.ped --bed-regions exome-regions.bed \
    -i sample_NA12878/NA12878_L001 -o map_sample_NA12878/NA12878_L001 \

Note that supplying a BED file does restrict the locations to which reads are mapped, it is only used to ensure calibration information is correctly calculated.

Note

The exome capture BED file must correspond to the correct reference you are mapping and calling against. You may need to run the BED file supplied by your sequencing vendor through a lift-over tool if the reference genome versions differ.

The size of the job should be tuned to the available memory of the server node. You can perform mapping in segments by using the --start-read and --end-read flags during mapping. Currently, a 48 GB RAM system as specified in the technical requirements section can process 100 million reads in about an hour. The following example would work to map a data set containing just over 100 million reads in batches of 10 million.

$ for ((j=0;j<10;j++)); do
    rtg map -t grch38 --pedigree ceph-1463.ped \
      -i sample1-100M -o map_sample1-$j \
      --start-read=$((j*10000000)) --end-read=$(((j+1)*10000000))
  done

Note that each of these runs can be executed independently of the others. This allows parallel processing in a compute cluster that can reduce wall clock time.

In a parallel compute cluster special consideration is needed with respect to where data resides during the mapping process. Reading and writing data from and to a single networked disk partition may result in undesirable I/O performance characteristics depending on the size and structure of the compute cluster. One way to minimize the adverse affects of I/O limitations is to separate input data sets and map output directories, storing them on different network disk partitions.

Using the --tempdir flag allows the map command to use a directory other than the output directory to store temporary files that are output during the mapping process. The size of the temporary files is the same as the total size of files in the map output directory after processing has finished. The following example shows how to modify the above example to place outputs on a separate partition to the inputs.

$ mkdir /partition3/map_temp_sample1
$ for ((j=0;j<10;j++)); do
    rtg map -t /partition1/grch38 --pedigree /partition1/ceph-1463.ped \
      -i /partition1/sample1-100M \
      -o /partition2/map_sample1-$j \
      --start-read=$((j*1000000)) --end-read=$(((j+1)*1000000)) \
      --tempdir /partition3/map_temp_sample1
  done

Processing other samples in the pedigree is a matter of specifying different input SDF and output directories.

Task 5 - View and evaluate mapping performance

An alignments.bam file can be viewed directly using the RTG extract command in conjunction with a shell command such as less to quickly inspect the results.

$ rtg extract map_sample_NA12878/NA12878_L001-0/alignments.bam | less -S

Since the mappings are indexed by default, it is also possible to view mappings corresponding to particular genomic regions. For example:

$ rtg extract map_sample_NA12878/NA12878_L001-0/alignments.bam chr6:1000000-5000000 \
  | less -S

The mapping output directory also contains a simple HTML summary report giving information about mapping counts, alignment score distribution, paired-end insert size distribution, etc. This can be viewed in your web browser. For example:

$ firefox map_sample_NA12878/NA12878_L001-0/index.html

For more detailed summary statistics, use the samstats command. This example will report information such as total records, number unmapped, specific details about the mate pair reads, and distributions for alignment scores, insert sizes and read hits.

$ rtg samstats -t grch38 map_sample_NA12878/NA12878_L001-0/alignments.bam \
   -r sample_NA12878/NA12878_L001 --distributions

Finally, the short summary of mapping produced on the terminal at the end of the map run, is also available as summary.txt in the mapping output directory.

Task 6 - Generate and review coverage information

For human genomic analysis, it is important to have sufficient coverage over the entire genome to detect variants accurately and minimize false negative calls. The coverage command provides detailed statistics for depth of coverage and gap size distribution. If coverage proves to be inadequate or spotty, remapping the data with different sensitivity tuning or rerunning the sample with different sequencing technology may help.

This example shows the coverage command used with all alignments, both mated and unmated, for the entire sample. The -s flag is used to introduce smoothing of the data, by default the data will not be smoothed. While you can supply the coverage command with the names of BAM files individually on the command line, this becomes unwieldy when the mapping has been carried out in many smaller jobs. In this command we will use the -I flag to supply a text file containing the names of all the mapping output BAM files, one per line. One example way to create this file is with the following command (assuming all your mapping runs have used a common root directory):

$ find map_sample_NA12878 -name "*.bam" > NA12878-bam-files.txt
$ rtg coverage -t grch38 -s 20 -o cov_sample_NA12878 -I NA12878-bam-files.txt

For exome or other targeted data, the coverage command can be instructed to focus only within the target regions by supplying the target regions BED file.

$ rtg coverage -t grch38 --bed-regions exome-regions.bed -s 20 \
    -o cov_sample_NA12878 -I NA12878-bam-files.txt

By default the coverage command will generate a BED formatted file containing regions of similar coverage. This BED file can be loaded into a genome browser to visualize the coverage, and may also be examined on the command line. For example:

$ rtg extract cov_sample_NA12878/coverage.bed.gz 'chr6:1000000-5000000' | less

The coverage command also creates a simple HTML summary report containing graphs of the depth of coverage distribution and cumulative depth of coverage. This can be viewed in your web browser. For example:

$ firefox cov_sample_NA12878/index.html

The coverage command has several options that alter the way that coverage levels are reported (for example, determining callability or giving per-region reporting). See coverage for more detail.

Task 7 - Call sequence variants (single sample)

The primary germline variant calling commands in the RTG suite are the snp command for detecting variants in a single sample, the family command for performing simultaneous joint calling of multiple family members within a single family, and the population command for performing simultaneous joint calling of multiple population members with varying degrees of relatedness. Where possible, we recommend using joint calling, as this ensures the maximum use of pedigree information as well as preventing cross-sample variant representation inconsistency that can occur when calling multiple samples individually. All of the RTG variant callers are sex-aware, producing calls of appropriate ploidy on the sex chromosomes and within PAR regions.

In this section, we demonstrate single sample calling with snp. For family calling, see Task 8 - Call sequence variants (single family), and for population calling, see Task 9 - Call population sequence variants.

The snp command detects sequence variants in a single sample, given adequate but not excessively high coverage of reads against the reference genome. As with the coverage command, we can supply a file containing a list of all the needed input files. In this case the mapping calibration files will be automatically detected at the location of the BAM files themselves and will be used in order to enable base quality recalibration during variant calling.

This example takes all available BAM files for the sample and calls SNP, MNP, and indel sequence variants.

$ rtg snp -t grch38 --pedigree ceph-1463.ped \
    -o snp_sample_NA12878 -I sample_NA12878-bam-files

Here we supply the pedigree file in order to inform the variant caller of the sample sex. While multiple samples may be listed in the pedigree file, the snp command will object if mappings for multiple samples are supplied as input.

For exome data it is also recommended to provide the BED file describing the exome regions to the variant caller in order to filter the output produced to be within the exome regions. There are two approaches here. The first is to instruct the variant caller to only perform variant calling at sites within the exome regions by supplying the --bed-regions option:

$ rtg snp -t grch38 --pedigree ceph-1463.ped -–bed-regions exome-regions.bed \
    -o snp_sample_NA12878 -I sample_NA12878-bam-files

This approach is computationally more efficient. However, if calls at off-target sites are potentially of interest, a second approach is to call variants at all sites but automatically filter variants as being off-target in the VCF FILTER field by using the --filter-bed option:

$ rtg snp -t grch38 --pedigree ceph-1463.ped --filter-bed exome-regions.bed \
    -o snp_sample_NA12878 -I sample_NA12878-bam-files

If you do not have a BED file that specifies the exome capture regions for your reference genome, you should supply the --max-coverage flag during variant calling to indicate an appropriate coverage threshold for over-coverage situations. A typical value is 5 times the expected coverage level within exome regions.

The snp command will perform the Bayesian variant calling and will use a default AVR model for scoring variant call quality. If you have a more appropriate model available, you should supply this with the --avr-model flag. RTG supplies some pre-built models which work well in a wide variety of cases, and includes tools for building custom AVR models. See AVR scoring using HAPMAP for model building for more information on building custom models.

The snp command will output variant call summary information upon completion, and this is also available in the output directory in the file summary.txt.

Variants are produced in standard VCF format. Since the variant calls are compressed and indexed by default, it is possible to view calls corresponding to particular genomic regions. For example:

$ rtg extract snp_sample_NA12878/snps.vcf.gz chr6:1000000-5000000 | less -S

Inspecting the output VCF for this run will show that no variants have been called for the Y chromosome. If you carry out similar sex-aware calling for the father of the trio, NA12891, inspection of the output VCF will show haploid variant calls for the X and Y chromosomes.

$ rtg extract snp_sample_NA12891/snps.vcf.gz chrX | head
$ rtg extract snp_sample_NA12891/snps.vcf.gz chrY | head

RTG also supports automatic handling of pseudoautosomal regions (PAR) and will produce haploid or diploid calls within the X PAR regions as appropriate for the sex of the individual.

A simple way to improve throughput is to call variants using a separate job for each reference sequence. Each of these runs could be executed on independent nodes of a cluster in order to reduce wall clock time.

$ for seq in M {1..22} X Y; do
    rtg snp -t grch38 --pedigree ceph-1463.ped -I sample_NA12878-bam-files \
      -o snp_sample_NA12878_chr"$seq" --region chr"$seq"
  done

After the separate variant calling jobs complete, the VCF files for each chromosome can be combined into a single file, if desired, by using a vcfmerge command such as:

$ rtg vcfmerge -o snp_sample_NA12878.vcf.gz snp_sample_NA12878_chr*/snps.vcf.gz

The individual snp commands will have output summary information about the variant calls made in each job. Combined summary information can be output for the merged VCF file with the vcfstats command.

$ rtg vcfstats snp_sample_NA12878.vcf.gz

Simple filtering of variants can be applied using the vcffilter command. For example, filtering calls by genotype quality or AVR score can be accomplished with a command such as:

$ rtg vcffilter --min-genotype-quality 50 -i snp_sample_NA12878.vcf.gz \
  -o filtered_sample_NA12878.vcf.gz

In this case, any variants failing the filter will be removed. Alternatively, failing variants can be kept but marked with a custom VCF FILTER field, such as:

$ rtg vcffilter --fail LOW-GQ --min-genotype-quality 50 \
  -i snp_sample_NA12878.vcf.gz -o filtered_sample_NA12878.vcf.gz

Task 8 - Call sequence variants (single family)

This section demonstrates joint calling multiple samples comprising a single Mendelian family. RTG is unique in supporting family calling beyond a single-child trio, and directly supports jointly calling a family involving multiple offspring. In this example, we call the large family of the CEPH pedigree containing 11 children.

The family command is invoked similarly to the snp command except you need to specify the way the samples relate to each other. This is done by specifying the corresponding sample ID used during mapping to the command line flags --mother, --father, --daughter and --son.

Note

The RTG family command only supports a basic family relationship of a mother, father and one or more children, either daughters or sons. For other pedigrees, use the population command.

To run the family command on the trio of NA12892 (mother), NA12891 (father) and NA12878 (daughter) you need to provide all the mapping files for the samples. The mapping calibration files will be automatically detected at the locations of the mapping files. To specify these in a file list for input you could run:

$ find /partition2/map_trio -name "alignments.bam" > map_trio-bam-files

To run the family command you then specify the sample ID for each member of the trio to the appropriate flag.

$ rtg family --mother NA12892 --father NA12891 --daughter NA12878 -t grch38 \
  -o trio_variants -I map_trio-bam-files

Examining the snps.vcf.gz file in the output directory will show individual columns for the variants of each family member. For more details about the VCF output file see Small-variant VCF output file description

Note

Per-family relationship information can also be specified using a pedigree PED file with the --pedigree flag. In this case, the pedigree file should contain a single family only.

Task 9 - Call population sequence variants

This section demonstrates joint variant calling of multiple potentially related or unrelated individuals in a population.

The population command is invoked similarly to the snp command except you must specify the pedigree file containing information about each sample and the relationships (if any) between them.

To run the population command on the population of NA12892, NA12891, NA19240 and NA12878 you need to provide all the mapping files for the samples. The mapping calibration files will be automatically detected at the locations of the mapping files. To specify these in a file list for input you could run:

$ find /partition2/map_population -name "alignments.bam" > map_population-bam-files

To run the population command you specify the PED file containing the sample ID for each member of the population in the individual ID column.

$ rtg population -t grch38 --pedigree ceph-1463.ped -o pop_variants \
    -I map_population-bam-files

Examining the snps.vcf.gz file in the output directory will show individual columns for the variants of each member of the population. For more details about the VCF output file see Small-variant VCF output file description.

Create and use population priors in variant calling

To improve the accuracy of variant calling on new members of a population, a file containing the allele counts of the population’s known variants may be supplied. This information is used as an extra set of prior probabilities when making calls.

Sources of this allele count data can be external, for instance the 1000 genomes project, or from prior variant calling on other members of the population. An example use case of the latter follows.

Data

For this use case it is assumed that the following data is available:

  • /data/runs/20humans.vcf.gz - output from a previous population command run on 20 humans from a population.

  • /data/reference/human_reference - SDF containing the human reference sequences.

  • /data/mappings/new_human.txt - text file containing a list of BAM files with the sequence alignments for the new member of the population.

Table : Overview of pipeline tasks.

Task

Command & Utilities

Purpose

1

Produce population priors file

rtg vcfannotate, rtg vcfsubset

Produce a reusable set of population priors from an existing VCF file

2

Run variant calling using population priors

rtg snp

Perform variant calling on the new member of the population using the new population priors to improve results

Task 1 - Produce population priors file

Using a full VCF file for a large population as population priors can be slow, as it contains lots of unnecessary information. The AC and AN fields are the standard VCF specification fields representing the allele count in genotypes, and total number of alleles in called genotypes. For more information on these fields, see the VCF specification at https://samtools.github.io/hts-specs/VCFv4.2.pdf. Alternatively, retaining only the GT field for each sample is sufficient, however this is less efficient both computationally and size-wise.

To calculate and annotate the AC and AN fields for a VCF file, use the RTG command vcfannotate with the parameter --fill-an-ac:

$ rtg vcfannotate --fill-an-ac -i /data/runs/20humans.vcf.gz \
  -o 20humans_an_ac.vcf.gz

Then remove all unnecessary data from the file using the RTG command vcfsubset as follows:

$ rtg vcfsubset --keep-info AC,AN --remove-samples \
  -i 20humans_an_ac.vcf.gz -o 20humans_priors.vcf.gz

This output is block compressed and tabix indexed by default, which is necessary for the population priors input. There will be an additional file output called 20humans_priors.vcf.gz.tbi which is the tabix index file.

The resulting population priors can now be stored in a suitable location to be used for any further runs as required.

$ cp 20humans_priors.vcf.gz* /data/population_priors/

Task 2 - Run variant calling using population priors

The population priors can now be used to improve variant calling on new members of the population supplying the --population-priors parameter to any of the variant caller commands.

$ rtg snp -o new_human_snps -t /data/reference/human_reference \
  -I /data/mappings/new_human.txt \
  --population-priors /data/population_priors/20humans_priors.vcf.gz

Somatic variant detection in cancer

RTG includes two variant calling commands for detecting somatic variations in tumor samples: somatic should be used when data from a matched normal sample is available; and tumoronly should be used when there is no matched normal sample.

Follow the steps below to obtain somatic variant calls.

Table : Overview of somatic pipeline tasks.

Task

Command & Utilities

Purpose

1

Format reference data

rtg format

Convert reference sequence from FASTA file to RTG Sequence Data Format (SDF)

2

Format read data

rtg format

Convert read sequence from FASTA and FASTQ files to RTG Sequence Data Format (SDF)

3

Map reads against the reference genome

rtg map

Generate read alignments for the sample(s), and report in a BAM file for downstream analysis

4

Call somatic variants

rtg somatic

Detect somatic variants in the tumor sample

Task 1 - Format reference data (Somatic)

Format the human reference data to RTG SDF using Task 1 of RTG mapping and sequence variant detection. In the following tasks it is assumed the human reference SDF is called grch38.

Task 2 - Format read data (Somatic)

Format the tumor sample read data to RTG SDF using Task 2 of RTG mapping and sequence variant detection. In this example we assume there are 20 lanes of data for the sample.

$ for ((i=1;i<20;i++)); do
$   rtg format -f fastq -q sanger -o tumor_reads_${i} \
    -l /data/reads/tumor/${i}/reads_1.fq \
    -r /data/reads/tumor/${i}/reads_2.fq \
    --sam-rg "@RG\tID:tumor_${lane_id}\tSM:sm_tumor\tPL:ILLUMINA"
  done

When matched normal sample data is available, format this in a similar manner. Note that a different sample name should be supplied for the read group information for the normal sample.

$ for ((i=1;i<20;i++)); do
$   rtg format -f fastq -q sanger -o normal_reads_${i} \
    -l /data/reads/normal/${i}/reads_1.fq \
    -r /data/reads/normal/${i}/reads_2.fq \
    --sam-rg "@RG\tID:normal_${lane_id}\tSM:sm_normal\tPL:ILLUMINA"
  done

Task 3 - Map tumor and normal sample reads against the reference genome

Map the normal and tumor sample reads against the reference genome as in Task 4 - Map reads to the reference genome. The mapping must be done with appropriate read group information for each read set, by using the --sam-rg flag at mapping time unless this has been done during the format command as described above. All mappings on the tumor should have one sample ID, and (when available) all mappings on the matched normal should have another sample ID.

$ for ((i=1;i<20;i++)); do
    rtg map -i tumor_reads_${i} -t grch38 -o tumor_map_${i}
  done
$ for ((i=1;i<20;i++)); do
    rtg map -i normal_reads_${i} -t grch38 -o normal_map_${i}
  done

Task 4 - Call somatic variants

When no matched normal data is available, the tumoronly command is invoked, similarly to the snp comment. You may optionally specify an estimated level of contamination of the tumor sample with normal tissue using the --contamination flag.

$ rtg tumoronly -t grch38 -o tumoronly_out --contamination 0.3 \
  tumor_map_*/alignments.bam

Where a matched normal sample is available, the somatic command is invoked. Command options are similar to tumoronly, except you need to specify the sample IDs corresponding to the normal and cancer samples, with the --original and --derived flags respectively.

$ rtg somatic -t grch38 -o somatic_out –contamination 0.3

–derived sm_tumor –original sm_normal normal_map_*/alignments.bam tumor_map_*/alignments.bam

Examining the snps.vcf.gz file in the output directory will show a column each for the variants of the normal and tumor samples and will contain variants where the tumor sample differs from the normal sample. The somatic command stores information in the VCF INFO fields NCS, and LOH, and FORMAT fields SSC and SS. For more details about the VCF output file see Small-variant VCF output file description.

The somatic command also supports --include-germline to include variants that are predicted to be germline, in addition to those predicted to be of somatic origin, in the output VCF.

Using site-specific somatic priors

The tumoronly and somatic commands have a default prior of 0.000001 (1e-6) for a particular site being somatic. Since the human genome comprises some 3.2 GB, this prior corresponds to an expectation of about 3200 somatic variants in a whole genome sample. Depending on the expected number of variants for a particular sample, it may be appropriate to raise or lower this prior. In general, decreasing the prior increases specificity while increasing the prior increases sensitivity.

Of course, not every site is equally likely to lead to a somatic variant. To support different priors for different sites we provide a facility to set a prior on per site basis via a BED file we call site-specific somatic priors. The --somatic-priors command line option is used to provide the file to the variant caller.

The site-specific somatic priors can cover as many or as few sites as desired. Any site not covered by a specific prior will use the default prior. The format of the file is a BED file where the fourth column of each line gives the explicit prior for the indicated region, for example,

1 14906 14910 1e-8

denotes that the prior for bases 14907, 14908, 14909, and 14910 on chromosome 1 is 1e-8 rather than the default. (Recall that BED files use 0-based indices.) If the BED file contains more than one prior covering a particular site, then the largest prior covering that site is used. When making a complex call, the prior used is the arithmetic average of priors in the region of the complex call.

A typical starting point for making somatic site-specific priors might include a database of known cancer sites (for example, COSMIC) and a database of sites known to be variant in the population (for example, dbSNP). The idea is that the COSMIC sites are more likely to be somatic and should have a higher prior, while those in dbSNP are less likely to be somatic and should have a lower prior.

The following recipe can be used to build the BED file where some sites have a lower prior of 1e-8 and others have a higher prior of 1e-5. The procedure can be easily modified to incorporate additional inputs each with its own prior.

First, collect prerequisites in the form of VCF files (here using the names cosmic.vcf.gz and dbsnp.vcf.gz, but, of course, any other VCFs can also be used).

$ COSMIC=cosmic.vcf.gz
$ DBSNP=dbsnp.vcf.gz

Convert each VCF file into a BED file with the desired priors taking care to convert from 1-based coordinates in VCF to 0-based coordinates in VCF.

$ zcat ${COSMIC} | awk -vOFS='\t' '/^[^#]/{print $1,$2-1,$2+length($4)-1,"1e-5"}'\
  | sort -Vu >p0.bed
$ zcat ${DBSNP} | awk -vOFS='\t' '/^[^#]/{print $1,$2-1,$2+length($4)-1,"1e-8"}'\
  | sort -Vu >p1.bed

[Optional] Collapse adjacent intervals together. One way of doing this is to use the bedtools merge facility. This can result in a smaller final result when the intervals are dense.

$ bedtools merge -c 4 -o distinct -i p0.bed >p0.tmp && mv p0.tmp p0.bed
$ bedtools merge -c 4 -o distinct -i p1.bed >p1.tmp && mv p1.tmp p1.bed

In general, care must be taken to ensure intersecting sites are handled in the desired manner. Since in this case we want to use COSMIC in preference to dbSNP and prior(COSMIC) > prior(dbSNP), we can simply merge the outputs because the somatic caller will choose the larger prior in the case of overlap.

$ sort --merge -V p0.bed p1.bed | rtg bgzip - >somatic-priors.bed.gz

To support somatic calling on restricted regions, construct a tabix index for the priors file.

$ rtg index -f bed somatic-priors.bed.gz

The site-specific somatic BED file is now ready to be used by the tumoronly and somatic commands:

$ rtg somatic -–somatic-priors somatic-priors.bed.gz ...

AVR scoring using HAPMAP for model building

AVR (Adaptive Variant Rescoring) is a machine learning technology for learning and predicting which calls are likely correct. It comprises of a learning algorithm that takes training examples and infers a model about what constitutes a good call and a prediction engine which applies the model to variants and estimates their correctness. It includes attributes of the call that are not considered by the internal Bayesian statistics model to make better predictions as to the correctness of a variant call.

Each of the RTG variant callers (snp, family, population) automatically runs a default AVR model, producing an AVR attribute for each sample. The model can be changed with the --avr-model parameter, and the AVR functionality can be turned off completely by specifying the special ‘none’ model.

Example command line usage. Turn AVR rescoring off:

$ rtg family --mother NA12892 --father NA12891 --daughter NA12878 -t grch38 \
  -o trio_variants -I map_trio-bam-files --avr-model none

Apply default RTG AVR model:

$ rtg family --mother NA12892 --father NA12891 --daughter NA12878 -t grch38 \
  -o trio_variants -I map_trio-bam-files

Apply a custom AVR model:

$ rtg family --mother NA12892 --father NA12891 --daughter NA12878 -t grch38 \
  -o trio_variants -I map_trio-bam-files --avr-model /path/to/my/custom.avr

The effectiveness of AVR is strongly dependent on the quality of the training data. In general, the more training data you have, the better the model will be. Ideally the training data should have the same characteristics as the calls to be predicted; that is, the same platform, the same reference, the same coverage, etc. There also needs to be a balance of positive and negative training examples. In reality, these conditions can only be met to varying degrees, but AVR will try to make the most of what it is given.

A given AVR model is tied to a set of attributes corresponding to fields in VCF records or quantities that are derivable from those fields. The attributes chosen can take into account anomalies associated with different sequencing technologies. Examples of attributes are things like quality of the call, zygosity of the call, strand bias, allele balance, and whether or not the call is complex. Not all attributes are equally predictive and it is the job of the machine learning to determine which combinations of attributes lead to the best predictions. When building a model it is necessary to provide the list of attributes to be used. In general, providing more attributes gives the AVR model a better chance at learning what constitutes a good call. There are two caveats; the attributes used during training need to be present in the calls to be predicted and some attributes like DP are vulnerable to changes in coverage. AVR is able to cope with missing values during both training and prediction.

The training data needs to comprise both positive and negative examples. Ideally we would know exactly the truth status of each training example, but in reality this must be approximated by reference to some combination of baseline information.

In the example that follows, the HAPMAP database will be used to produce and then use an AVR model on a set of output variants, a process that can be used when no appropriate AVR model is already available. The HAPMAP database will be used to determine which of the variants will be considered correct for training purposes. This will introduce two types of error; correct calls which are not in HAPMAP will be marked as negative training examples and a few incorrect calls occurring at HAPMAP sites will be marked as positive training examples.

Data

Reference SDF on which variant calling was performed, in this example assumed to be an existing SDF containing the 1000 genomes build 37 phase 2 reference

(ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz). For this example this will be called /data/reference/1000g_v37_phase2.

The HAPMAP variants file from the Broad Institute data bundle

(ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37/hapmap_3.3.b37.vcf.gz). For this example this will be called /data/hapmap_3.3.b37.vcf.gz.

The example will be performed on the merged results of the RTG family command for the 1000 genomes CEU trio NA12878, NA12891 and NA12892. For this example this will be called /data/runs/NA12878trio/family.vcf.gz.

Table : Overview of basic pipeline tasks.

Task

Command & Utilities

Purpose

1

Create training data

rtg vcffilter

To generate positive and negative examples for the AVR machine learning model to train on

2

Build and check AVR model

rtg avrbuild, rtg avrstats

To create and check an AVR model

3

Use AVR model

rtg avrpredict, rtg snp, rtg family, rtg population

To apply the AVR model to the existing output or to use it directly during variant calling

4

Install AVR model

cp

Install model in standard RTG model location for later reuse

Task 1 - Create training data

The first step is to use the vcffilter command to split the variant calls into positive and negative training examples with respect to HAPMAP. It is also possible to build training sets using the vcfeval command, however this is less appropriate for dealing with sites rather than calls, and experiments indicate using vcffilter leads to better training sets.

$ rtg vcffilter --include-vcf /data/hapmap_3.3.b37.vcf.gz -o pos-NA12878.vcf.gz \
  -i /data/runs/NA12878trio/family.vcf.gz --sample NA12878 --remove-same-as-ref
$ rtg vcffilter --exclude-vcf /data/hapmap_3.3.b37.vcf.gz -o neg-NA12878.vcf.gz \
  -i /data/runs/NA12878trio/family.vcf.gz --sample NA12878 --remove-same-as-ref

Optionally check that the training data looks reasonable. There should be a reasonable amount of both positive and negative examples and all expected chromosomes should be represented.

Task 2 - Build and check AVR model

The next step is to build an AVR model. Select the attributes that will be used with some consideration to portability and the nature of the training set. Here we have excluded XRX and LAL because HAPMAP is primarily SNP locations and does not capture complex calls. By excluding XRX and LAL we prevent the model from learning that complex calls are bad. We have also excluded DP because we want a model somewhat independent of coverage level. Building the model can take a large amount of RAM and several hours. The amount of memory required is proportional to the number of training instances.

A list of derived annotations that can be used are available in the documentation for avrbuild. For VCF INFO and FORMAT annotations check the header of the VCF file for available fields. The VCF fields output by RTG variant callers are described in Small-variant VCF output file description.

$ rtg avrbuild -o NA12878model.avr --info-annotations DPR \
  --format-annotations DPR,AR,ABP,SBP,RPB,GQ \
  --derived-annotations IC,EP,QD,AN,PD,GQD,ZY \
  --sample NA12878 -p pos-NA12878.vcf.gz -n neg-NA12878.vcf.gz

Examine the statistics output to the screen to check things look reasonable. Due to how attributes are computed there can be missing values, but it is a bad sign if any attribute is missing from most samples.

Total number of examples: 5073752
Total number of positive examples: 680677
Total number of negative examples: 4393075
Total weight of positive examples: 2536873.27
Total weight of negative examples: 2536876.42
Number of examples with missing values:
DERIVED-AN 0
DERIVED-EP 0
DERIVED-GQD 0
DERIVED-IC 915424
DERIVED-PD 0
DERIVED-QD 0
DERIVED-ZY 0
FORMAT-ABP 13602
FORMAT-AR 8375
FORMAT-DPR 0
FORMAT-GQ 0
FORMAT-RPB 8375
FORMAT-SBP 37129
INFO-DPR 0
Hold-out score: 69.1821% (96853407/139997752)
Attribute importance estimate for DERIVED-AN: 0.0304% (29443/96853407)
Attribute importance estimate for DERIVED-EP: 1.2691% (1229135/96853407)
Attribute importance estimate for DERIVED-GQD: 4.7844%
(4633873/96853407)
Attribute importance estimate for DERIVED-IC: 4.7748% (4624554/96853407)
Attribute importance estimate for DERIVED-PD: 0.0000% (0/96853407)
Attribute importance estimate for DERIVED-QD: 6.8396% (6624383/96853407)
Attribute importance estimate for DERIVED-ZY: 0.8693% (841925/96853407)
Attribute importance estimate for FORMAT-ABP: 5.4232% (5252533/96853407)
Attribute importance estimate for FORMAT-AR: 2.6008% (2518955/96853407)
Attribute importance estimate for FORMAT-DPR: 3.3186% (3214163/96853407)
Attribute importance estimate for FORMAT-GQ: 3.4268% (3319009/96853407)
Attribute importance estimate for FORMAT-RPB: 0.2176% (210761/96853407)
Attribute importance estimate for FORMAT-SBP: 0.0219% (21196/96853407)
Attribute importance estimate for INFO-DPR: 4.6525% (4506074/96853407)

Optionally check the resulting model file using the avrstats command. This will produce a short summary of the model including the attributes used during the build.

$ rtg avrstats NA12878model.avr
Location : NA12878model.avr
Parameters : avrbuild -o NA12878model.avr --info-annotations DPR --derived-annotations IC,EP,QD,AN,PD,GQD,ZY --format-annotations DPR,AR,ABP,SBP,RPB,GQ --sample NA12878 -p pos-NA12878.vcf.gz -n neg-NA12878.vcf.gz
Date built : 2013-05-17-09-41-17
AVR Version : 1
AVR-ID : 7ded37d7-817f-467b-a7da-73e374719c7f
Type : ML
QUAL used : false
INFO fields : DPR
FORMAT fields : DPR,AR,ABP,SBP,RPB,GQ
Derived fields: IC,EP,QD,AN,PD,GQD,ZY

Task 3 - Use AVR model

The model is now ready to be used. It can be applied to the existing variant calling output by using the avrpredict command.

$ rtg avrpredict --avr-model NA12878model.avr \
  -i /data/runs/NA12878trio/family.vcf.gz -o predict.vcf.gz

This will create or update the AVR FORMAT field in the VCF output file with a score between 0 and 1. The higher the resulting score the more likely it is correct. To select an appropriate cutoff value for further analysis of variants some approaches might include measuring the Ti/Tv ratio or measuring sensitivity against another standard such as OMNI at varying score cutoffs.

The model can also be used directly in any new variant calling runs:

$ rtg snp --avr-model NA12878model.avr -t grch38 -o snp_sample_NA12878 --sex female \
  -I sample_NA12878-map-files
$ rtg population --avr-model NA12878model.avr -t grch38 -o pop_variants \
  -I map_population-bam-files

Task 4 - Install AVR model

The custom AVR model can be installed into a standard location so that it can be referred to by a short name (rather than the full file path name) in the avrpredict and variant caller commands. The default location for AVR models is within a subdirectory of the RTG installation directory called models, and each file in that directory with a .avr extension is a model that can be accessed by its short name. For example if the NA12878model.avr model file is placed in /path/to/rtg/installation/models/NA12878model.avr it can be accessed by any user either using the full path to the model:

$ rtg snp --avr-model /path/to/rtg/installation/models/NA12878model.avr \
  -o snp_sample_NA12878 -t grch38 --sex female -I sample_NA12878-map-files

or, by just the model file name:

$ rtg snp --avr-model NA12878model.avr -o snp_sample_NA12878 -t grch38 --sex female \
  -I sample_NA12878-map-files

The AVR model directory will already contain the models prebuilt by RTG:

  • illumina-exome.avr - model built from Illumina exome sequencing data. If you are running variant calling Illumina exome data you may want to use this model instead of the default, although the default should still be effective.

  • illumina-wgs.avr - model built from Illumina whole genome sequencing data. This model is the default model when running normal variant calling.

  • illumina-somatic.avr - model built from somatic samples using

    Illumina sequencing. It is applicable to somatic variant calling, where a variety of allelic fractions are to be expected in somatic variants. The somatic command defaults to this AVR model. If you want to score germline variants in a somatic run, it is preferable to use illumina-wgs.avr or illumina-exome.wgs instead.

  • alternate.avr - model built using XRX, ZY and GQD attributes. This should be platform independent and may be a better choice if a more specific model for your data is unavailable. In particular, this model may be more appropriate for scoring the results of variant calling in situations where unusual allele-balance is expected (for example somatic calling with contamination, or calling high amplification data where allele drop out is expected)

It is possible to score a sample with more than one AVR model, by running avrpredict with another model and using a different field name specified with --vcf-score-field.

RTG structural variant detection

RTG has developed tools to assist in discovering structural variant regions in whole genome sequencing data. The tools can be used to locate likely structural variant breakpoints and regions that have been duplicated or deleted. These tools are capable of processing whole genome mapping data containing multiple read groups in a streamlined fashion.

In this example, it is assumed that alignment has been carried out as described in Task 4 - Map reads to the reference genome. For structural variant detection it is particularly important to specify the read group information with the --sam-rg flag either during the formatting of the reads or for the map command explicitly. The structural variant tools currently requires the PL (platform) attribute to be either ILLUMINA (for Illumina reads) or COMPLETE (for Complete Genomics reads).

Table : Overview of structural variants analysis pipeline tasks.

Task

Command & Utilities

Purpose

1

Prepare read group statistics file

find

Identify read group statistics files created during mapping

2

Find structural variants with sv

rtg sv

Process prepared mapping results to identify likely structural variants

3

Find structural variants with discord

rtg discord

Process prepared mapping results to identify likely structural variant breakends

4

Find copy number variants

rtg cnv

Detect copy number variants between a pair of samples

Task 1 - Prepare Read-group statistics files

Mapping identifies discordant read matings and inserts the pair information for unique unmated reads into the SAM records. The map command also produces a file within the directory called rgstats.tsv containing read group statistics.

The sv and discord structural variant callers require the read group statistics files to be supplied, so as with multiple BAM files, it is possible to create a text file listing the locations of all the required statistics files:

$ find /partition2/map_sample_NA12878 -name "rgstats.tsv" \
  > sample_NA12878-rgstats-files

Task 2 - Find structural variants with sv

Once mapping is complete one can run the structural variants analysis tool. To run the sv tool you need to supply the mapping BAM files and the read group statistics files. As with the variant caller commands, this can be a large number of files and so input can be specified using list files.

$ rtg sv -t grch38 -o sample_NA12878-sv -I sample_NA12878-bam-files \
  -R sample_NA12878-rgstats-files
$ ls -l sample_NA12878-sv
    30 Oct 1 16:56 done
 15440 Oct 1 16:56 progress
     0 Oct 1 16:56 summary.txt
122762 Oct 1 16:56 sv_bayesian.tsv.gz
  9032 Oct 1 16:56 sv.log

The file sv_bayesian.tsv.gz contains a trace of the strengths of alternative bayesian hypothesis at points along the reference genome. The currently supported hypotheses are shown in the following table.

Table : Structural variant hypotheses.

Hypothesis

Semantics

normal

Normal mappings, no structural variants

duplicate

Above normal mappings, potential duplication region

delete

Below normal mappings, potential deletion region

delete-left

Mapping data suggest the left breakpoint of a deletion

delete-right

Mapping data suggest the right breakpoint of a deletion

duplicate-left

Mapping data suggest the left boundary of a region that has been copied elsewhere

duplicate-right

Mapping data suggest the right boundary of a region that has been copied elsewhere

breakpoint

Mapping data suggest this location has received an insertion of copied genome

novel-insertion

Mapping data suggests this location has received an insertion of material not present in the reference

For convenience, the last column of the output file gives the index of the hypothesis with the maximum strength, to make it easier to identify regions where this changes for further investigation by the researcher.

The sv command also supports calling on individual chromosomes (or regions within a chromosome) with the --region parameter, and this can be used to increase overall throughput.

$ for seq in M {1..22} X Y; do
    rtg sv -o sv_sample_NA12878-chr${seq} -t grch38 -I sample_NA12878-sam-files \
    -R sample_NA12878-rgstats-files --region chr${seq}
  done

Task 3 - Find structural variants with discord

A second tool for finding structural variant break-ends is based on detecting cluster of discordantly mapped reads, those where both ends of a read are mapped but are mapped either in an unexpected orientation or with a TLEN outside the normal range for that read group. As with the sv command, discord requires the read group statistics to be supplied.

$ rtg discord -t grch38 -o discord_sample_NA12878 -I sample_NA12878-bam-files \
  -R sample_NA12878-rgstats-files --bed

As with sv, the discord command also supports using the --region flag:

$ for seq in M {1..22} X Y; do
    rtg discord -o discord_sample_NA12878-chr${seq} -t grch38 \
    -I sample_NA12878-bam-files -R sample_NA12878-rgstats-files \
    --region chr${seq}
  done

The default output is in VCF format, following the VCF 4.2 specification for break-ends. However, as most third-party tools currently don’t support this type of VCF, it is also possible to output each break-end as a separate region in a BED file.

Task 4 - Report copy number variation statistics

With two genome samples, one can compare the relative depth of coverage by region to identify copy number variation ratios that may indicate structural variation. A common use case is where you have two samples from a cancer patient, one from normal tissue and another from a tumor.

Run the cnv command with the default bucket size of 100, this is the number of nucleotides for which to average the coverage.

$ rtg cnv -o cnv_s1_s2 -I sample1-map-files -J sample2-map-files

View the resulting output as a set of records that show cnv ratio at locations across the genome, where the locations are defined by the bucket size.

$ zless cnv_s1_/cnv.txt.gz

For deeper investigation, contact Real Time Genomics technical support for extensible reporting scripts specific to copy number variation reporting.

Ion Torrent bacterial mapping and sequence variant detection

The following example supports the steps typical to bacterial genome analysis in which an Ion Torrent sequencer has generated reads at 10 fold coverage.

Data

The baseline uses actual data downloaded from the Ion community. The reference sequence is “Escherichia coli K-12 sub-strain DH10B”, available from the NCBI RefSeq database, accession NC_010473 (http://www.ncbi.nlm.nih.gov/nuccore/NC_010473). The read data is comprised of Ion Torrent PGM run B14-387, which can be found at the Ion community (http://lifetech-it.hosted.jivesoftware.com, requires registration).

Table : Overview of basic pipeline tasks.

Task

Command & Utilities

Purpose

1

Format reference data

rtg format

Convert reference sequence from FASTA file to RTG Sequence Data Format (SDF)

2

Format read data

rtg format

Convert read sequence from FASTA and FASTQ files to RTG Sequence Data Format (SDF)

3

Map reads against the reference genome

rtg map

Generate read alignments for the normal and cancer samples, and report in a BAM file for downstream analysis

4

Call sequence variants in haploid mode

rtg snp

Detect SNPs, MNPs, and indels in haploid sample relative to the reference genome

Task 1 - Format bacterial reference data (Ion Torrent)

Mapping and variant detection requires a conversion of the reference genome from FASTA files into the RTG SDF format. This task will be completed with the format command. The conversion will create an SDF directory containing the reference genome.

Use the format command to convert the FASTA file into an SDF directory for the reference genome.

$ rtg format -o ecoli-DH10B \
  /data/bacteria/Escherichia_coli_K_12_substr__DH10B_uid58979/NC_010473.fna

Task 2 - Format read data (Ion Torrent)

Mapping and variant detection of Ion Torrent data requires a conversion of the read sequence data from FASTQ files into the RTG SDF format. Additionally, it is recommended that read trimming based on the quality data present within the FASTQ file be performed as part of this conversion.

Use the format command to convert the read FASTQ file into an SDF directory, using the quality threshold option to trim poor quality ends of reads.

$ rtg format -f fastq -q solexa --trim-threshold=15 -o B14-387-reads \
  /data/reads/R_2011_07_19_20_05_38_user_B14-387-r121336-314_pool30-ms_Auto_B14\
-387-r121336-314_pool30-ms_8399.fastq

Task 3 - Map Ion Torrent reads to the reference genome

Map the sequence reads against the reference genome to generate alignments in BAM format.

The RTG map command provides a means for the mapping of Ion Torrent reads to a reference genome. When mapping Ion Torrent reads, a read group with the platform field (PL) set to IONTORRENT should be provided.

$ rtg map -i B14-387-reads -t ecoli-DH10B -o B14-387-map \
  --sam-rg "@RG\tID:B14-387-r121336\tSM:B14-387\tPL:IONTORRENT"

Multiple files are written to the output directory of the mapping run. For further variant calling, the alignments.bam file has the associated required index file alignments.bam.bai. The additional files alignments.bam.calibration contains metadata to provide more accurate variant calling.

Task 4 - Call sequence variants in haploid mode

Call haploid sequence variants in the mapped reads against the reference genome to generate a variants file in the VCF format.

The snp command will automatically set machine error calibration parameters according to the platform (PL attribute) specified in the SAM read group, in this example to the Ion Torrent parameters. The snp command defaults to diploid variant calling, so for this bacterial example haploid mode will be specified. The automatically included .calibration files provide additional information specific to the mapping data for improved variant calling.

$ rtg snp -t ecoli-DH10B -o B14-387-snp --ploidy=haploid B14-387-map/alignments.bam

Examining the snps.vcf.gz file in the output directory will show that variants have been called in haploid mode. For more details about the VCF output file see Small-variant VCF output file description.

RTG contaminant filtering

Use the following set of tasks to remove contaminated reads from a sequenced DNA sample.

The RTG contamination filter, called mapf, removes contaminant reads by mapping against a database of potential contaminant sequences. For example, a bacterial metagenomic sample may have some amount of human sequence contaminating it. The following process removes any human reads leaving only bacteria reads.

Table : Overview of contaminant filtering tasks.

Task

Command & Utilities

Purpose

1

Format reference data

rtg format

Convert reference sequence from FASTA file to RTG Sequence Data Format (SDF)

2

Format read data

rtg format

Convert read sequence from FASTA and FASTQ files to RTG Sequence Data Format (SDF)

3

Run contamination filter

rtg mapf

Produce the SDF file of reads which map to the contaminant and the SDF file of those that do not

4

Manage filtered reads

mv

Set up the results for use in further processing

Task 1 - Format reference data (contaminant filtering)

RTG tools require a conversion of reference sequences from FASTA files into the RTG SDF format. This task will be completed with the format command. The conversion will create an SDF directory containing the reference genome.

First, observe a typical genome reference with multiple chromosome sequences stored in compressed FASTA format.

$ ls -l /data/human/hg19/
43026389 Mar 21 2009 chr10.fa.gz
42966674 Mar 21 2009 chr11.fa.gz
42648875 Mar 21 2009 chr12.fa.gz
31517348 Mar 21 2009 chr13.fa.gz
28970334 Mar 21 2009 chr14.fa.gz
26828094 Mar 21 2009 chr15.fa.gz
25667827 Mar 21 2009 chr16.fa.gz
25139792 Mar 21 2009 chr17.fa.gz
24574571 Mar 21 2009 chr18.fa.gz
17606811 Mar 21 2009 chr19.fa.gz
73773666 Mar 21 2009 chr1.fa.gz
19513342 Mar 21 2009 chr20.fa.gz
11549785 Mar 21 2009 chr21.fa.gz
11327826 Mar 21 2009 chr22.fa.gz
78240395 Mar 21 2009 chr2.fa.gz
64033758 Mar 21 2009 chr3.fa.gz
61700369 Mar 21 2009 chr4.fa.gz
58378199 Mar 21 2009 chr5.fa.gz
54997756 Mar 21 2009 chr6.fa.gz
50667196 Mar 21 2009 chr7.fa.gz
46889258 Mar 21 2009 chr8.fa.gz
39464200 Mar 21 2009 chr9.fa.gz
    5537 Mar 21 2009 chrM.fa.gz
49278128 Mar 21 2009 chrX.fa.gz
 8276338 Mar 21 2009 chrY.fa.gz

Now, use the format command to convert multiple input files into a single SDF directory for the reference.

$ rtg format -o hg19 /data/human/hg19/chrM.fa.gz \
  /data/human/hg19/chr[1-9].fa.gz \
  /data/human/hg19/chr1[0-9].fa.gz \
  /data/human/hg19/chr2[0-9].fa.gz \
  /data/human/hg19/chrX.fa.gz \
  /data/human/hg19/chrY.fa.gz

This takes the human reference FASTA files and creates a directory called hg19 containing the SDF, with chromosomes ordered in the standard UCSC ordering. You can use the sdfstats command to show statistics for your reference SDF, including the individual sequence lengths.

$ rtg sdfstats --lengths hg19
Type : DNA
Number of sequences: 25
Maximum length : 249250621
Minimum length : 16571
Sequence names : yes
N : 234350281
A : 844868045
C : 585017944
G : 585360436
T : 846097277
Total residues : 3095693983
Residue qualities : no
Sequence lengths:
chrM 16571
chr1 249250621
chr2 243199373
chr3 198022430
chr4 191154276
chr5 180915260
chr6 171115067
chr7 159138663
chr8 146364022
chr9 141213431
chr10 135534747
chr11 135006516
chr12 133851895
chr13 115169878
chr14 107349540
chr15 102531392
chr16 90354753
chr17 81195210
chr18 78077248
chr19 59128983
chr20 63025520
chr21 48129895
chr22 51304566
chrX 155270560
chrY 59373566

Task 2 - Format read data (contaminant filtering)

RTG tools require a conversion of read sequence data from FASTA or FASTQ files into the RTG SDF format. This task will be completed with the format command. The conversion will create an SDF directory for the sample reads.

Take a paired set of reads in FASTQ format and convert it into RTG data format (SDF). This example shows one run of data, taking as input both left and right mate pairs from the same run.

$ rtg format -f fastq -q sanger -o bacteria-sample \
  -l /data/reads/bacteria/sample_1.fq \
  -r /data/reads/bacteria/sample_2.fq

This creates a directory named bacteria-sample with two subdirectories, named left and right. Use the sdfstats command to verify this step.

$ rtg sdfstats bacteria-sample

Task 3 - Run contamination filter

The mapf command functions in much the same way as the map command, but instead of producing BAM files of the mappings it produces two SDF directories, one containing reads that map to the reference and the other with reads that do not map. As with the map command there are multiple tuning parameters to adjust sensitivity and selectivity of the mappings. As with the map command, you can use the --start-read and --end-read flags to perform the mapping in smaller sections if required. The default mapf settings are similar to map although the word size and step sizes have been adjusted to yield more sensitive mappings.

$ rtg mapf -t hg19 -i bacteria-sample -o filter-sample

In the filter-sample output directory there are, amongst other files, two directories named alignments.sdf and unmapped.sdf. The alignments.sdf directory is an SDF of the reads that mapped to the reference, and the unmapped.sdf directory is an SDF of the reads that did not map.

$ ls -l filter-sample/
   4096 Sep 30 16:02 alignments.sdf/
     33 Sep 30 16:02 done
2776886 Sep 30 16:02 mapf.log
  12625 Sep 30 16:02 progress
    143 Sep 30 16:02 summary.txt
   4096 Sep 30 16:02 unmapped.sdf/

Task 4 - Manage filtered reads

Depending on the use case, either rename, move or delete the filtered SDF directories as necessary. In this example the reads that did not map to the contamination reference are to be used in further processing, so rename the unmapped.sdf directory.

$ mv filter-sample/unmapped.sdf bacteria-sample-filtered

The filtered read set is now ready for subsequent processing, such as with the mapx or species tools.

RTG translated protein searching

Use the following set of tasks to search DNA reads against a protein data set.

The RTG protein search tool, mapx translates nucleotide reads into protein space and search them against a protein data set. For example, a sample taken from a human gut can be searched against a protein data set to determine which kinds of protein families are present in the sample.

For this example we will search a human gut sample read set against an NCBI non-redundant protein data set. In the following tasks it is assumed non-redundant protein data set is called nr.fasta and the human gut sample is called human-gut.fastq.

Table : Overview of translated protein searching tasks.

Task

Command & Utilities

Purpose

1

Format protein data set

rtg format

Convert protein data set from FASTA to RTG sequence data format (SDF)

2

Format DNA read set

rtg format

Convert read sequence from FASTA and FASTQ files to RTG Sequence Data Format (SDF)

3

Search against protein data set

rtg mapx

Generate search results with alignments in tabular format

Task 1 - Format protein data set

The mapx command requires a conversion of a protein data set from FASTA files into RTG SDF format. This task will be completed with the format command. The conversion will create an SDF directory containing the protein data set.

$ rtg format -p -o nr /data/NCBI-nr/nr.fasta

The above command will take the nr.fasta file and create a directory called nr containing the SDF. Note that the -p option is used to create the SDF with protein data.

Task 2 - Format DNA read set

The mapx command requires a conversion of the DNA read set data from FASTA or FASTQ files into RTG SDF format. This task will be completed with the format command. The following command assumes the sample read data set is in Solexa FASTQ format.

$ rtg format -f fastq -q solexa -o human-gut /data/human-gut-sample.fastq

Task 3 - Search against protein data set

Search the DNA reads against the protein data set and generate alignments in tabular format.

The mapx command provides multiple tuning parameters to adjust sensitivity and selectivity at the search stage. As with the map command, you can use the --start-read and --end-read flags to perform the mapping in smaller sections if required. In general, protein search strategies are based on protein similarity also known as identity.

The search example below uses a sensitivity setting that will guarantee reporting with reads that align with 4 substitutions and 1 indels.

$ rtg mapx -t nr -i human-gut -o mapx_results -a 3 -b 1

The alignments.tsv.gz file in the mapx_results output directory contains tabular output with alignments. For more information about this output format see Mapx and mapp output file description.

RTG species frequency estimation

Use the following set of tasks to estimate the frequency of bacterial species in a metagenomic sample. The RTG species frequency estimator, called species, takes a set of reads mapped against a bacterial database and from this estimates the relative frequency of each species in the database.

Table : Overview of species frequency estimation tasks.

Task

Command & Utilities

Purpose

1

Format reference data

rtg format

Convert reference sequence from FASTA file to RTG Sequence Data Format (SDF)

2

Format read data

rtg format

Convert read sequence from FASTA and FASTQ files to RTG Sequence Data Format (SDF)

3

Run contamination filter (optional)

rtg mapf

Produce the SDF file of reads which map to the contaminant and the SDF file of those that do not

4

Map metagenomic reads against bacterial database

rtg map

Generate read alignments against a given reference, and report in a BAM file for downstream analysis

5

Run species estimator

rtg species

Produce a text file which contains a list of species, one per line, with an estimate of the relative frequency in the sample

Task 1 - Format reference data (species)

RTG tools require a conversion of reference sequences from FASTA files into the RTG SDF format. This task will be completed with the format command. The conversion will create an SDF directory containing the reference sequences.

Use the format command to convert multiple input files into a single SDF directory for the reference database.

$ rtg format -o bacteria-db /data/bacteria/db/*.fa.gz

This takes the reference FASTA files and creates a directory called bacteria-db containing the SDF. You can use the sdfstats command to show statistics for your reference SDF.

$ rtg sdfstats bacteria-db
Type               : DNA
Number of sequences: 311276
Maximum length     : 13033779
Minimum length     : 0
Sequence names     : yes
N                  : 33864547
A                  : 4167856151
C                  : 4080877385
G                  : 4072353906
T                  : 4177108579
Total residues     : 16532060568
Residue qualities  : no

Alternatively a species reference SDF for running the species command can be obtained from our website (http://www.realtimegenomics.com).

Task 2 - Format read data (species)

RTG tools require a conversion of read sequence data from FASTA or FASTQ files into the RTG SDF format. This task will be completed with the format command. The conversion will create an SDF directory for the sample reads.

Take a paired set of reads in FASTQ format and convert it into RTG data format (SDF). This example shows one run of data, taking as input both left and right mate pairs from the same run.

$ rtg format -f fastq -q sanger -o bacteria-sample \
  -l /data/reads/bacteria/sample_1.fq \
  -r /data/reads/bacteria/sample_2.fq

This creates a directory named bacteria-sample with two subdirectories, named ‘left’ and ‘right’. Use the sdfstats command to verify this step.

$ rtg sdfstats bacteria-sample

Task 3 - Run contamination filter (optional)

Optionally filter the metagenomic read sample to remove human contamination using Tasks 1 through 4 of RTG contaminant filtering

Task 4 - Map metagenomic reads against bacterial database

Map the metagenomic reads against the reference database to generate alignments in the BAM format (Binary Sequence Alignment/Map file format). The read set in this example is paired end.

It is recommended that during mapping either the --max-top-results flag be set to a high value, such as 100, or that the --all-hits option be used. This helps ensure that all relevant species in the database are accurately represented in the output. However, note that a very large --max-top-results requires additional memory during mapping.

$ rtg map -i bacteria-sample -t bacteria-db -o map-sample -n 100

Task 5 - Run species estimator

The species estimator, species, takes as input the BAM format files from the mapping performed against the reference database.

$ rtg species -t bacteria-db -o species-result map-sample/alignments.bam

This run generates a new output directory species_result. The main result file in this directory will be called species.tsv. In the output the bacterial species are ordered from most to least abundant. The output file can be directly loaded into a spreadsheet program like Microsoft Excel.

The species.tsv file contains results for both species with associated genomic sequences and internal nodes in the taxonomy. In some scenarios it will only be necessary to examine those rows corresponding to sequences in the database, such rows have a Y in the has-sequence column. Internal taxonomy nodes (i.e. ones that have no associated sequence data) always have a breadth and depth of coverage of zero because no reads directly map to them. For further detail on the species.tsv file format see Species results file description

Also produced is an HTML5 summary file called index.html which contains an interactive pie chart detailing the results.

The best results are obtained when as many relevant records as possible are given to the species estimator. If you have insufficient memory to use all your mapping results then using the filtering options may help. You could filter the results by selecting mappings with good alignment scores or mated only reads.

RTG sample similarity

Use the following set of tasks to produce a similarity matrix from the comparison of a group of read sets. An example use case is in metagenomics where several bacteria samples taken from different sites need to be compared.

The similarity command performs a similarity analysis on multiple read sets independent of any reference genome. It does this by examining k-mer word frequencies and the intersections between sets of reads.

Table : Overview of sample similarity tasks.

Task

Command & Utilities

Purpose

1

Prepare read sets

rtg format

Convert reference sequence from FASTA file to RTG Sequence Data Format (SDF)

2

Generate read set name map

text-editor

Produce the map of names to read set SDF locations

3

Run similarity tool

rtg similarity

Process the read sets for similarity

Task 1 - Prepare read sets

RTG tools require a conversion of read sequence data from FASTA or FASTQ files into the RTG SDF format. This task will be completed with the format command. The conversion will create an SDF directory for the sample reads.

Take a paired set of reads in FASTQ format and convert it into RTG data format (SDF). This example shows one run of data, taking as input both left and right mate pairs from the same run.

$ rtg format -f fastq -q sanger -o /data/reads/read-sample1-sdf \
  -l /data/reads/fastq/read-sample1_1.fq \
  -r /data/reads/fastq/read-sample2_2.fq

This creates a directory named ‘read-sample1-sdf’ with two subdirectories, named ‘left’ and ‘right’. Use the sdfstats command to verify this step.

$ rtg sdfstats /data/reads/read-sample1-sdf

Repeat for all read samples to be compared. This example shows how this can be done with the format command in a loop.

$ for left_fq in /data/reads/fastq/*_1.fq; do
    right_fq=${left_fq/_1.fq/_2.fq}
    sample_id=$(basename ${left_fq/_1.fq})
    rtg format -f fastq -q sanger -o /data/reads/${sample_id}-sdf -l ${left_fq} \
      -r ${right_fq}
  done

Task 2 - Generate read set name map

With a text editor, or other tool, create a text file containing a list of sample name to sample read SDF file locations. If two or more read sets are from the same sample they can be combined by giving them the same sample name in the file list.

$ cat read-set-list.txt
sample1 /data/reads/read-sample1-sdf
sample2 /data/reads/read-sample2-sdf
sample3 /data/reads/read-sample3-sdf
sample4 /data/reads/read-sample4-sdf
sample5 /data/reads/read-sample5-sdf

Task 3 - Run similarity tool

Run the similarity command setting the k-mer word size (-w parameter) and the step size (-s parameter) on the read sets by specifying the file listing the read sets. Some experimentation should be performed with different word and step size parameters to find good trade-offs between memory usage and run time. Should it be necessary to reduce the memory used it is possible to limit the number of reads used from each SDF by specifying the --max-reads parameter.

$ rtg similarity -w 25 -s 25 --max-reads 1000000 -I read-set-list.txt \
  -o similarity-output

The program puts its output in the specified output directory.

$ ls similarity-output/
 4693 Aug 29 20:17 closest.tre
19393 Aug 29 20:17 closest.xml
   33 Aug 29 20:17 done
11363 Aug 29 20:17 similarity.log
48901 Aug 29 20:17 similarity.tsv
  693 Aug 29 20:17 progress

The similarity.tsv file is a tab separated file containing a matrix of counts of the number of k-mers in common between each pair of samples. The closest.tre and closest.xml files are nearest neighbor trees built from the counts from the similarity matrix. The closest.tre is in Newick format and the closest.xml file is phyloXML. The similarity.pca file contains a principal component analysis on the similarity matrix in similarity.tsv.

You may wish to view closest.tre or closest.xml in your preferred tree viewing tool or use the principal component analysis output in similarity.pca to produce a three-dimensional grouping plot showing visually the clustering of samples.