Appendix

RTG gapped alignment technical description

Real Time Genomics utilizes its own DNA sequence alignment tool and scoring system for aligned reads. Most methods for sequence comparison and alignment use a small set of operations derived from the notion of edit distance 1 to discover differences between two DNA sequences. The edit operations introduce insertions, deletions, and substitutions to transform one sequence into another. Alignments are termed global if they extend over all residues of both sequences.

Most programs for finding global alignments are based on the Needleman-Wunsch algorithm 2. Alternatively, alignments may be local, in which case reported alignments may contain subsequences of the input sequences. The Smith-Waterman variation on the Needleman-Wunsch algorithm finds such alignments 3. The proprietary RTG algorithm employs a further variation of this approach, using a dynamic programming edit-distance calculation for alignment of reads to a reference sequence. The alignment is semi-global in that it always covers the entire read but usually only covers a portion of the reference.

Alignment computations

Following the read mapping stage, the RTG aligner is presented with a read, a reference, a putative start position, and the frame. An alignment is produced with a corrected start position, which is subsequently converted by RTG into a SAM record.

If the corrected start position differs from the putative start position, then the alignment may be recomputed starting from the new start position (this is because slightly different alignments can result depending on the start position given to the aligner). Later stages in the RTG pipeline may decide to discard the alignment or to identify alignments together (for the purpose of removing duplicates). But the reference is always presented in the forward-sense and the edit-distance code itself makes the necessary adjustment for reverse complement cases. This avoids having to construct a reverse complement copy of the reference.

The matrix is initialized in a manner such that a small shift in start position incurs no penalty, but as the shift increases, an increasing penalty is applied. If after completing the alignment, such a path is still chosen, then the penalty is removed from the resulting score. This penalty is designed to prevent the algorithm from making extreme decisions like deleting the entire reference.

Alignment scoring

The basic costs used in the alignment are 0 for a match, 9 for a substitution, 19 for initiating an insertion or deletion, and 1 for continuing an insertion or deletion. All of these except for the match score can be overridden using the --mismatch-penalty parameter (for substitutions), the --gap-open-penalty parameter (for initiating an insertion or deletion) and the --gap-extend-penalty parameter (for continuing an insertion or deletion).

By default the penalty for matching an unknown nucleotide (n) in the read or reference is 5, however this can be overridden using the --unknowns-penalty flag. Note that regardless of the penalty for unknown nucleotides the CIGAR will always indicate unknown bases as mismatches. Occasionally, alignments may go outside the limits of the reference (that is, off the left or right ends of the reference). Such virtual bases are considered as unknown nucleotides.

Once the alignment is determined, the sum of these costs for the alignment path is reported as the alignment score and produced in the AS field of the corresponding SAM record. If there is a tie between an indel versus substitution operation for a particular matrix cell, then the tie is broken in favor of the substitution (except in the special case of the last column on the reference).

In the following example, the alignment score is 48, comprising a penalty of 20 for a two-nucleotide insert in the reference, 9 for a mismatch, and 19 for a one-nucleotide insert in the read. Notice there is no penalty for unknown nucleotide.

accg--gactctctgacgctgcncgtacgtgccaaaaataagt (reference)
||||  |||||| ||||||||||||||||| ||||||||||||
accgttgactctgtgacgctgcacgtacgt-ccaaaaataagt (read)

In addition to the alignment score, a CIGAR is also reported in SAM records.

References for this section

1

Levenshtein, V. I.(1966) Binary codes capable of correcting deletions, insertions and reversal. Soviet Physics Doklady, 6:707-710.

2

Needleman, S. B and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443-453

3

Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences. Journal of Molecular Biology. 147:195-197.

Using SAM/BAM Read Groups in RTG map

It is good practice to ensure the output BAM files from the map command contain tracking information regarding the sample and read set. This is accomplished by specifying a read group to assign reads to. See the SAM specification for the full details of read groups, however for RTG tools, it is important to specify at least ID, SM and PL fields in the read group. The ID field should be a unique identifier for the read group, while the SM field should contain an identifier for the sample that was sequenced. Thus, you may have the same sample identifier present in multiple read groups (for example if the sample was sequenced in multiple lanes or by different sequencing technologies). All sample names employed by pedigree or sample-oriented commands should match the values supplied in the SM field, while sequencer calibration information is grouped by the read group ID field, and certain algorithm configuration (for example aligner penalties) may have appropriate defaults selected based on the PL field.

While it is possible to post-process BAM files to add this information, it is more efficient to supply the read group information either during mapping or when formatting read data to SDF. For the RTG software, the read group can either be specified in a string on the command line or by creating a file containing the SAM-formatted read group header entry to be passed to the command.

To specify a read group on the command line directly use a string encapsulated in double quotes using \t to denote a TAB character:

$ rtg map ... --sam-rg "@RG\tID:SRR002978\tSM:NA19240\tPL:ILLUMINA" ...

To specify a read group using a file, create or use a file containing a single SAM-formatted read group header line:

$ echo -e "@RG\tID:SRR002978\tSM:NA19240\tPL:ILLUMINA" > mysamrg.txt
$ cat mysamrg.txt
@RG ID:SRR002978 SM:NA19240 PL:ILLUMINA
$ rtg map ... --sam-rg mysamrg.txt ...

Note that when supplying read group headers in a file literal TAB characters, not \t, are required to separate fields.

The platform tags supported by RTG are ILLUMINA for Illumina reads, COMPLETE for Complete Genomics version 1 reads, COMPLETEGENOMICS for Complete Genomics version 2 reads, LS454 for 454 Life Sciences reads and IONTORRENT for Ion Torrent reads.

When mapping directly from SAM/BAM input with a single read group, this will automatically be set using that read group. The read group will also be automatically set when mapping from an SDF which had the read group information stored in it during formatting.

RTG reference file format

Many RTG commands can make use of additional information about the structure of a reference genome, such as expected ploidy, sex chromosomes, location of PAR regions, etc. When appropriate, this information may be stored inside a reference genome’s SDF directory in a file called reference.txt.

The format command will automatically identify several common reference genomes during formatting and will create a reference.txt in the resulting SDF. However, for non-human reference genomes, or less common human reference genomes, a pre-built reference configuration file may not be available, and will need to be manually provided in order to make use of RTG sex-aware pipeline features.

Several example reference.txt files for different human reference versions are included as part of the RTG distribution in the scripts subdirectory, so for common reference versions it will suffice to copy the appropriate example file into the formatted reference SDF with the name reference.txt, or use one of these example files as the basis for your specific reference genome.

To see how a reference text file will be interpreted by the chromosomes in an SDF for a given sex you can use the sdfstats command with the --sex flag. For example:

$ rtg sdfstats --sex male /data/human/ref/hg19

Location            : /data/human/ref/hg19
Parameters          : format -o /data/human/ref/hg19 -I chromosomes.txt
SDF Version         : 11
Type                : DNA
Source              : UNKNOWN
Paired arm          : UNKNOWN
SDF-ID              : b6318de1-8107-4b11-bdd9-fb8b6b34c5d0
Number of sequences : 25
Maximum length      : 249250621
Minimum length      : 16571
Sequence names      : yes
N                   : 234350281
A                   : 844868045
C                   : 585017944
G                   : 585360436
T                   : 846097277
Total residues      : 3095693983
Residue qualities   : no

Sequences for sex=MALE:
chrM POLYPLOID circular 16571
chr1 DIPLOID linear 249250621
chr2 DIPLOID linear 243199373
chr3 DIPLOID linear 198022430
chr4 DIPLOID linear 191154276
chr5 DIPLOID linear 180915260
chr6 DIPLOID linear 171115067
chr7 DIPLOID linear 159138663
chr8 DIPLOID linear 146364022
chr9 DIPLOID linear 141213431
chr10 DIPLOID linear 135534747
chr11 DIPLOID linear 135006516
chr12 DIPLOID linear 133851895
chr13 DIPLOID linear 115169878
chr14 DIPLOID linear 107349540
chr15 DIPLOID linear 102531392
chr16 DIPLOID linear 90354753
chr17 DIPLOID linear 81195210
chr18 DIPLOID linear 78077248
chr19 DIPLOID linear 59128983
chr20 DIPLOID linear 63025520
chr21 DIPLOID linear 48129895
chr22 DIPLOID linear 51304566
chrX HAPLOID linear 155270560 ~=chrY
    chrX:60001-2699520 chrY:10001-2649520
    chrX:154931044-155260560 chrY:59034050-59363566
chrY HAPLOID linear 59373566 ~=chrX
    chrX:60001-2699520 chrY:10001-2649520
    chrX:154931044-155260560 chrY:59034050-59363566

The reference file is primarily intended for XY sex determination but should be able to handle ZW and X0 sex determination also.

The following describes the reference file text format in more detail. The file contains lines with TAB separated fields describing the properties of the chromosomes. Comments within the reference.txt file are preceded by the character #. The first line of the file that is not a comment or blank must be the version line.

version1

The remaining lines have the following common structure:

<sex> <line-type>     <line-setting>...

The sex field is one of male, female or either. The line-type field is one of def for default sequence settings, seq for specific chromosomal sequence settings and dup for defining pseudo-autosomal regions. The line-setting fields are a variable number of fields based on the line type given.

The default sequence settings line can only be specified with either for the sex field, can only be specified once and must be specified if there are not individual chromosome settings for all chromosomes and other contigs. It is specified with the following structure:

either        def     <ploidy>        <shape>

The ploidy field is one of haploid, diploid, triploid, tetraploid, pentaploid, hexaploid, polyploid or none. The shape field is one of circular or linear.

The specific chromosome settings lines are similar to the default chromosome settings lines. All the sex field options can be used, however for any one chromosome you can only specify a single line for either or two lines for male and female. They are specified with the following structure:

<sex> seq     <chromosome-name>       <ploidy>        <shape> [allosome]

The ploidy and shape fields are the same as for the default chromosome settings line. The chromosome-name field is the name of the chromosome to which the line applies. The allosome field is optional and is used to specify the allosome pair of a haploid chromosome.

The pseudo-autosomal region settings line can be set with any of the sex field options and any number of the lines can be defined as necessary. It has the following format:

<sex> dup     <region>        <region>

The regions must be taken from two haploid chromosomes for a given sex, have the same length and not go past the end of the chromosome. The regions are given in the format <chromosome-name>:<start>-<end> where start and end are positions counting from one and the end is non-inclusive.

An example for the HG19 human reference:

# Reference specification for hg19, see
# http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=184117983&chromInfoPage=
version 1
# Unless otherwise specified, assume diploid linear. Well-formed
# chromosomes should be explicitly listed separately so this
# applies primarily to unplaced contigs and decoy sequences
either        def     diploid linear
# List the autosomal chromosomes explicitly. These are used to help
# determine "normal" coverage levels during mapping and variant calling
either        seq     chr1    diploid linear
either        seq     chr2    diploid linear
either        seq     chr3    diploid linear
either        seq     chr4    diploid linear
either        seq     chr5    diploid linear
either        seq     chr6    diploid linear
either        seq     chr7    diploid linear
either        seq     chr8    diploid linear
either        seq     chr9    diploid linear
either        seq     chr10   diploid linear
either        seq     chr11   diploid linear
either        seq     chr12   diploid linear
either        seq     chr13   diploid linear
either        seq     chr14   diploid linear
either        seq     chr15   diploid linear
either        seq     chr16   diploid linear
either        seq     chr17   diploid linear
either        seq     chr18   diploid linear
either        seq     chr19   diploid linear
either        seq     chr20   diploid linear
either        seq     chr21   diploid linear
either        seq     chr22   diploid linear
# Define how the male and female get the X and Y chromosomes
male  seq     chrX    haploid linear  chrY
male  seq     chrY    haploid linear  chrX
female        seq     chrX    diploid linear
female        seq     chrY    none    linear
#PAR1 pseudoautosomal region
male  dup     chrX:60001-2699520      chrY:10001-2649520
#PAR2 pseudoautosomal region
male  dup     chrX:154931044-155260560        chrY:59034050-59363566
# And the mitochondria
either        seq     chrM    polyploid       circular

As of the current version of the RTG software the following are the effects of various settings in the reference.txt file when processing a sample with the matching sex.

A ploidy setting of none will prevent reads from mapping to that chromosome and any variant calling from being done in that chromosome.

A ploidy setting of diploid, haploid or polyploid does not currently affect the output of mapping.

A ploidy setting of diploid will treat the chromosome as having two distinct copies during variant calling, meaning that both homozygous and heterozygous diploid genotypes may be called for the chromosome.

A ploidy setting of haploid will treat the chromosome as having one copy during variant calling, meaning that only haploid genotypes will be called for the chromosome.

A ploidy setting of polyploid will treat the chromosome as having one copy during variant calling, meaning that only haploid genotypes will be called for the chromosome. For variant calling with a pedigree, maternal inheritance is assumed for polyploid sequences.

The shape of the chromosome does not currently affect the output of mapping or variant calling.

The allosome pairs do not currently affect the output of mapping or variant calling (but are used by simulated data generation commands).

The pseudo-autosomal regions will cause the second half of the region pair to be skipped during mapping. During variant calling the first half of the region pair will be called as diploid and the second half will not have calls made for it. For the example reference.txt provided earlier this means that when mapping a male the X chromosome sections of the pseudo-autosomal regions will be mapped to exclusively and for variant calling the X chromosome sections will be called as diploid while the Y chromosome sections will be skipped. There may be some edge effects up to a read length either side of a pseudo-autosomal region boundary.

RTG taxonomic reference file format

When using a metagenomic reference SDF in the species command, a taxonomy can be applied to impute associations between reference sequences. This is done using two files contained in the SDF directory. The first file (taxonomy.tsv) contains an RTG taxonomy tree and the second file (taxonomy_lookup.tsv) contains a mapping between taxon IDs and reference sequence names. Using a reference SDF containing these files allows the output of certain commands to include results at different taxonomic ranks allowing analysis at differing taxonomic levels.

Pre-constructed metagenomic reference SDFs in this format will be available from our website (http://www.realtimegenomics.com). For custom reference SDF creation, the ncbi2tax and taxfilter commands can assist the creation of a custom taxonomy.tsv file from an NCBI taxonomy dump. The taxstats command can check the validity of a metagenomic reference SDF.

RTG taxonomy file format

The RTG taxonomy file format describes the structure of the taxonomy tree. It contains multiple lines with each line being either a comment or holding data required to describe a node in the taxonomy tree.

Lines starting with a ‘#’ are comments and do not contain data. They may appear anywhere through the file. The first line of the file must be a comment containing the RTG taxonomy file version number.

Each data line in the file represents a node in the taxonomy tree and is comprised of four tab separated values. The values on each line are:

  1. The unique taxon ID of the node in the tree. This must be an integer value greater than or equal to 1.

  2. The taxon ID of the parent of this node. This must be an integer value corresponding to another node in the tree.

  3. The rank of the node in the taxonomy. This is a free format string that can contain any character other than a tab.

  4. The name of the node in the taxonomy. This is a free format string that can contain any character other than a tab.

The root of the tree is special and must have a taxon ID of 1. Since the root has no parent it can have a parent ID of either 1 (itself) or -1. The RTG taxonomy file should contain a complete and fully connected tree that has a single root and no loops.

An example of the first few lines of a taxonomy.tsv file:

#RTG taxonomy version 1.0
#taxID        parID   rankname
1-1   no rank root
129081        no rank unclassified sequences
40816912908   no rank metagenomes
410657408169  no rank ecological metagenomes
527640410657  species microbial mat metagenome
1315671       no rank cellular organisms
2131567       superkingdom    Bacteria
11172 phylum  Cyanobacteria

RTG taxonomy lookup file format

The RTG taxonomy lookup file format associates SDF sequence names with taxon IDs in the taxonomy tree. It contains one line for every sequence in the SDF, with each line containing two tab separated values.

The values on each line are:

  1. The taxonomy node ID that the sequence is associated with. This must be an integer value that corresponds to a node ID from the taxonomy tree.

  2. The name of the sequence as it appears in the SDF. (These can be discovered using the --lengths option of the sdfstats command)

A single taxon ID may be associated with multiple sequence names. This is a way to group the chromosomes and plasmids belonging to a single organism.

An example of some lines from a taxonomy_lookup.tsv file:

1219061gi|407098174|gb|AMQV00000000.1|AMQV01000000
1219061gi|407098172|gb|AMQV01000002.1|
1219061gi|407098170|gb|AMQV01000004.1|
1219061gi|407098168|gb|AMQV01000006.1|

Pedigree PED input file format

The PED file format is a white space (tab or space) delimited ASCII file. Lines starting with # are ignored. It has exactly six required columns in the following order.

Column

Definition

Family ID

Alphanumeric ID of a family group. This field is ignored by RTG commands.

Individual ID

Alphanumeric ID of an individual. This corresponds to the Sample ID specified in the read group of the individual (SM field).

Paternal ID

Alphanumeric ID of the paternal parent for the individual. This corresponds to the Sample ID specified in the read group of the paternal parent (SM field).

Maternal ID

Alphanumeric ID of the maternal parent for the individual. This corresponds to the Sample ID specified in the read group of the maternal parent (SM field).

Sex

The sex of the individual specified as using 1 for male, 2 for female and any other number as unknown.

Phenotype

The phenotype of the individual specified using -9 or 0 for unknown, 1 for unaffected and 2 for affected.

Note

The PED format is based on the PED format defined by the PLINK project: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped

The value ‘0’ can be used as a missing value for Family ID, Paternal ID and Maternal ID.

The following is an example of what a PED file may look like.

# PED format pedigree
# fam-id ind-id pat-id mat-id sex phen
FAM01 NA19238 0 0 2 0
FAM01 NA19239 0 0 1 0
FAM01 NA19240 NA19239 NA19238 2 0
0 NA12878 0 0 2 0

When specifying a pedigree for the lineage command, use either the pat-id or mat-id as appropriate to the gender of the sample cell lineage. The following is an example of what a cell lineage PED file may look like.

# PED format pedigree
# fam-id ind-id pat-id mat-id sex phen
LIN BASE 0 0 2 0
LIN GENA 0 BASE 2 0
LIN GENB 0 BASE 2 0
LIN GENA-A 0 GENA 2 0

RTG includes commands such as pedfilter and pedstats for simple viewing, filtering and conversion of pedigree files.

Genetic map directory

Certain commands such as pedsamplesim are able to make use of genetic linkage information provided in the form of per sequence TSV files in a directory.

Such files must be tab separated with a single header line. The header line must include the columns pos and cM (for centimorgans). A chr column is also recommended, but not essential. Other columns may be present, but are ignored.

For example, the start of such a file for chr1 is:

chr     pos     rate            cM
chr1    6079392 0.0453026       0
chr1    6085392 0.0696987       0.0002718156
chr1    6088671 0.719604        0.0005003576373
chr1    6095199 0.203332        0.0051979325493
chr1    6098266 0.250261        0.0058215517933

The columns in the example are:

  1. Chromosome name

  2. Position in chromosome

  3. Rate of crossovers in region ending at this position (ignored)

  4. Total number of centimorgans from start of sequence

The corresponding map directory should contain a file or files for each sequence of interest, using the naming convention [sex.]sequence.map where sex is an optional sex specification (either male or female) and sequence is the name of the sequence, for example chr1.

When loading these files, an attempt is first made to load the file corresponding to the sex of the sample being processed. If that file does not exist, then an attempt is made to load a nonspecific version of the file. If that also does not exist, then a uniform distribution will be assumed. For example, if processing a female sample for chr1, first female.chr1.map will be tested. If that does not exist, then chr1.map will be tested. If that also does not exist, then the uniform distribution will be used. In this way as many or as few sequences as desired may be covered within a particular directory.

For the human genome, several sources of appropriate linkage information are available. Please note that it will generally be necessary to add headers and possibly make other adjustments to these files before they can be used with RTG.

Specific sources include:

A zip file containing the Bherer genetic map files for human build 37, with the required header manipulations already applied is available for download from: https://realtimegenomics.com/news/pre-formatted-reference-datasets

See pedsamplesim, childsim.

RTG commands using indexed input files

Several RTG commands require coordinate indexed input files to operate and several more require them when the --region or --bed-regions parameter is used. The index files used are standard tabix or BAM index files.

The RTG commands which produce the inputs used by these commands will by default produce them with appropriate index files. To produce indexes for files from third party sources or RTG command output where the --no-index or --no-gzip parameters were set, use the RTG bgzip and index commands.

RTG output results file descriptions

RTG software produces output results in standard formats that may contain additional information for the unique requirements of a particular data analysis function.

Several of the RTG commands that output results to a directory also output a simple summary report of the results of the command in HTML format. The report file for these commands will be called index.html and will be contained in the output directory.

SAM/BAM file extensions (RTG map command output)

The Sequence Alignment/Map (SAM/BAM) format is a well-known standard for listing read alignments against reference sequences. SAM records list the mapping and alignment data for each read, ordered by chromosome (or other DNA reference sequence) location.

Note

For a thorough description of the SAM format please refer to the specification at https://samtools.github.io/hts-specs/SAMv1.pdf

The map command reports alignments in the SAM/BAM format with some minor differences.

A sample RTG SAM file is shown below, describing the relationship between a read and a reference sequence, including gaps and mismatches as determined by the RTG map aligner.

@HD VN:1.0 SO:coordinate
@PG ID:RTG VN:v2.0-EAP2.1 build 25581 (2010-03-11) CL:map -t human_REF_SDF -i human_READS_SDF -o humanMAPPING8 -w 22 -a 2 -b 2 -c 2
@SQ SN:chr1 LN:643292
@SQ SN:chr2 LN:947102
@SQ SN:chr3 LN:1060087
@SQ SN:chr4 LN:1204112
@SQ SN:chr5 LN:1343552
@SQ SN:chr6 LN:1418244
@SQ SN:chr7 LN:1501719
@SQ SN:chr8 LN:1419563
@SQ SN:chr9 LN:1541723
@SQ SN:chr10 LN:1694445
@SQ SN:chr11 LN:2035250
@SQ SN:chr12 LN:2271477
@SQ SN:chr13 LN:2895605
@SQ SN:chr14 LN:3291006
1035263       0       chr1    606     255     2X18=1X11=1X15= *       0       0       AATACTTTTCATTCTTTACTATTACTTACTTATTCTTACTTACTTACT        *       AS:i:4  NM:i:4  IH:i:1  NH:i:1
1572864       16      chr1    2041    255     48=     *       0       0       TACTTACTTTCTTCTTACTTATGTGGTAATAAGCTACTCGGTTGGGCA        *       AS:i:0  NM:i:0  IH:i:1  NH:i:1
2649455       0       chr1    3421    255     2X46=   *       0       0       AAGTACTTCTTAGTTCAATTACTATCATCATCTTACCTAATTACTACT        *       AS:i:2  NM:i:2  IH:i:1  NH:i:1

RTG identifies each query name (or QNAME) with an RTG identifier, which replaced the original identifier associated with the read data. The RTG samrename utility is used to relabel the alignment records with the original sequence names.

The CIGAR format has evolved from the original SAM specification. Formerly, a CIGAR string might appear as 283M1D1I32N8M, where M indicated a match or mismatch, I indicated an insertion, D indicated a deletion, and N indicated a skipped region.

In the sample SAM file above, the RTG CIGAR score characters are modified as represented by the string 2X18=1X11=1X15=, where X indicates a mismatch, = indicates a match, I indicates an insertion, and D indicates a deletion. Obviously, this provides more specificity around the precise location of the mismatch in the alignment.

Notice that optional fields are reported as in SAM for alignment score (AS), number of nucleotide differences (NM), number of reported alignments for a particular read (NH), number of stored alignments (IH) and depending on flag settings, the field containing the string describing mismatching positions (MD) may be included. The alignment score is calculated and reported for RTG as described in RTG gapped alignment technical description.

The following list describes RTG SAM/BAM file characteristics that may depart from or be undescribed in the SAM specification.

  • Paired-end sequencing data

  • FLAG is set to 0x02 in properly paired reads and unset for unmated or unmapped reads.

  • For all non-uniquely mapped reads FLAG 0x100 is set.

  • Unmated and unmapped reads will have the FLAG 0x08 set to reflect whether the mate has been mapped, however RNEXT and PNEXT will always be “*”.

  • For mapped reads, the SAM standard NH attribute is used, even for uniquely mapped reads (NH:i:1).

  • Single-end sequencing data

  • For all non-uniquely mapped reads FLAG 0x100 is set.

  • For mapped reads, the SAM standard NH attribute is used, even for uniquely mapped reads (NH:i:1).

  • Unmapped reads

  • RNAME and CIGAR are set as “*”.

  • POS, and MAPQ are set as 0.

For mated records, the XA attribute contains the sum of the alignment scores for each arm. It is this score that is used to determine the equality for the max top results for mated records. All the mated records for a read should have the same XA score.

In addition, for unmapped read arms, the optional attribute XC may be displayed in SAM/BAM files, using a character code that indicates the internal read mapping stage where the read arm was discarded as unmated or unmapped. If this is reported, it means that the read arm had matching hits at one point during the mapping phase.

Single-end SAM character codes include:

Character

Definition

XC:A:B

Indicates that the number of raw index hits for the read exceeded the internal threshold of 65536.

XC:A:C

Indicates that after initial ranking of hits for the read, too many hits were present (affected by --max-top-results).

XC:A:D

Indicates that after alignment scores are calculated, the \leq N remaining hits were discarded because they exceeded the mismatches threshold (affected by --max-mismatches).

XC:A:E

Indicates that there were good scoring hits, but the arm was discarded because there were too many of these hits (affected by --max-top-results).

Paired-end SAM character codes include:

Character

Definition

XC:A:B

Indicates that the number of raw index hits for the read exceeded the internal threshold of 65536.

XC:A:C

Indicates that there were index matches for the read arm, but no potential mated hits were found (affected by --min-fragment-size and --max-fragment-size), and after ranking candidate unpaired results there were too many hits (affected by --max-top-results).

XC:A:d

Indicates that potential mated hits were found for this read arm with its mate, but were discarded because they exceeded the mismatches threshold (affected by --max-mated-mismatches).

XC:A:D

Indicates that no potential mated hits were found, and after alignment scores are calculated, the (\leq N) remaining hits were discarded because they exceeded the mismatches threshold (affected by --max-unmated-mismatches).

XC:A:e

Indicates that good scoring hits were found for this read arm with its mate, but were discarded because there were too many hits at the best score (affected by --max-top-results).

XC:A:E

Indicates that no potential mated hits were found, there were good scoring unmated hits, but the arm was discarded because there were too many of these hits (affected by --max-top-results).

SAM/BAM file extensions (RTG cgmap command output)

In addition to the file extensions described for the map command in SAM/BAM file extensions (RTG map command output), the cgmap command also outputs several additional fields specific to the nature of Complete Genomics reads.

A sample RTG SAM file is shown below, describing the relationship between some reads and a reference sequence, including gaps, mismatches and overlaps as determined by the RTG cgmap aligner.

@HD VN:1.0 SO:coordinate
@PG ID:RTG PN:RTG VN:v2.3 CL:cgmap -i bac_READS_SDF -t bac_REF_SDF -o bac_MAPPING -e 7 -E 5
@SQ SN:bac LN:100262
1 179 bac 441 55 24=5N10= = 765 324 TGACGCCTCTGCTCTTGCAAGTCNTTCACATTCA 544400/31/1\*2\*858154468!073./66222 AS:i:0 MQ:i:255 XU:Z:5=1B19=1R5N10= XQ:Z:+ XA:i:0 IH:i:1 NH:i:1
1 115 bac 765 55 10=5N8=1I14= = 441 -324 ANAGAACTGGAACCATTCATGGAAGGCCAGTGA 5!5/1,+!431/..,153002076-13435001 AS:i:0 MQ:i:255 XU:Z:1=1R5=1R2=5N8=1I11=2B5= XQ:Z:74 XR:Z:TA XA:i:0 IH:i:1 NH:i:1
83 179 bac 4963 55 3=1X19=5N10= = 5257 294 GGAAGGAGTGCTGCAGGCCGACCCTCATGGAGA 42062-51/4-1,55.010456-27/2711032 AS:i:1 MQ:i:255 XU:Z:3=1X1=2B20=5N10= XQ:Z:-. XR:Z:A XA:i:1 IH:i:1 NH:i:1
83 115 bac 5257 55 10=5N25= = 4963 -294 CCTCCTAGCGGTACATCTCCAGCCCCTTCCTAGNA 55541\*,-/3+1,2,13525167".21806010!2 AS:i:0 MQ:i:255 XU:Z:10=5N23=1R1= XA:i:1 IH:i:1 NH:i:1

The XU field is an extended CIGAR format which has additional characters for encoding extra information about a Complete Genomics read. The extra characters are B for encoding a backstep on the reference (overlap in the read), T for an unknown nucleotide in the reference and R for an unknown nucleotide in the read. In the case where both the reference and the read have an unknown nucleotide at the same place the R character is used.

The XQ field is the quality in the same format as the SAM QUAL field for the nucleotides in the overlap region of the read. It is present when backsteps exist in the extended CIGAR field.

The XR field contains the nucleotides from the read which differ from the reference (read delta). It is present when there are mismatches, inserts, unknowns on the reference or soft clipping represented in the extended CIGAR.

Using these three additional fields with the QUAL field, position that the read mapped to, and the reference, it is possible to reconstruct the original Complete Genomics read.

For example:

Record:
Position 4963 42062-51/4-1,55.010456-27/2711032 XU:Z:3=1X1=2B20=5N10= XQ:Z:-. XR:Z:A

Reference 4960-5010:
CCTGGAGG  GAGTGCTGCAGGCCGACCAGCAACTCATGGAGAAGACCAAGG
   GGAAGGGGAGTGCTGCAGGCCGACC     CTCATGGAGA
   42062-.-51/4-1,55.010456-_____27/2711032

Flattened Read from SAM file:
GGAAGGAGTGCTGCAGGCCGACCCTCATGGAGA
42062-51/4-1,55.010456-27/2711032

Reconstructed Read:
GGAAGGGGAGTGCTGCAGGCCGACCCTCATGGAGA
42062-.-51/4-1,55.010456-27/2711032

This example shows how the mismatch and read delta replace the nucleotide from the reference to form the read. It also shows that the backstep in the extended CIGAR is used to record overlap in the read. Note that the number of additional quality characters in the corresponds to the number of backsteps and that the additional quality characters will always be inserted into the read qualities on the inner-most side of the overlap region.

Record:
Position 65 5!5/1,+!431/..,153002076-134735001 XU:Z:1=1R5=1T2=5N8=1I11=2B5= XR:Z:TA XQ:Z:74

Reference 760-810
AGAGGAGAGAACNGGTTTGGAACCATTC TGGAAGGCCAG  TGAGCTGTGTT
     ANAGAACTGG_____AACCATTCATGGAAGGCCAGAGTGA
     5!5/1,+143_____1/..,153002076-1347435001

Flattened Read from SAM file:
ANAGAACTGGAACCATTCATGGAAGGCCAGTGA
5!5/1,+!431/..,153002076-13435001

Reconstructed Read:
ANAGAACNGGAACCATTCATGGAAGGCCAGAGTGA
5!5/1,+1431/..,153002076-1347435001

This example shows how the R character in the extended CIGAR corresponds to an unknown in the read and how the T character corresponds to an unknown in the reference. Note that when there is an unknown in the reference but not in the read the nucleotide is included in the read delta as are inserted nucleotides. Also note that although the backstep is used in this case to reconstruct part of the outside five nucleotides the overlap quality characters still correspond to the inside nucleotides.

Small-variant VCF output file description

The snp, family, somatic, tumoronly, and population commands call single nucleotide polymorphisms (SNPs), multiple nucleotide polymorphisms (MNPs), and indels for a single individual, a family of individuals or a cancer and normal pair respectively. At each position in the reference, a base pair determination is made based on statistical analysis of the accumulated read alignments. The predictions and accompanying statistics are reported in a text-based output file named snps.vcf.

Note

RTG variant calls are stored in VCF format (version 4.2). For more information about the VCF format, refer to the specification online at: https://samtools.github.io/hts-specs/VCFv4.2.pdf

The snps.vcf output file displays each variant called with confidence. The location and type of the call, the base pairs (reference and called), and a confidence score are standard output in the snps.vcf output file. Additional support statistics in the output describe read alignment evidence that can be used to evaluate confidence in the called variants.

The commands also produce a summary.txt file which has simple counts of the variants detected and some ratios which can be used as a quick indication of SNP calling performance.

The following sample snps.vcf file is an example of the output produced by an RTG SNP call run. Each line in a snps.vcf output has tab-separated fields and represents a SNP variation calculated from the mapped reads against the reference genome.

This file represents the variations per chromosome as a result of the SAM/BAM mapped alignments against a reference genome.

##fileformat=VCFv4.2
##fileDate=20110524
##source=RTGv2.2 build 35188 (2011-05-18)
##CL=snp -o snp-hslo-18-u -t hst1 hslo-18-u/alignments.bam
##RUN-ID=b1f96b37-7f77-4d74-b472-2a36ba21397e
##reference=hst1
##contig=<ID="chr1",length=207900>
##INFO=<ID=XRX,Number=0,Type=Flag,Description="RTG variant was called using complex caller">
##INFO=<ID=CT,Number=1,Type=Integer,Description="Coverage threshold that was applied">
##FILTER=<ID=OC,Description="Coverage threshold exceeded">
##FILTER=<ID=RC,Description="RTG variant is a complex region">
##FILTER=<ID=RX,Description="RTG variant contained within hypercomplex region">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RE,Number=1,Type=Float,Description="RTG Total Error">
##FORMAT=<ID=AR,Number=1,Type=Float,Description="Ambiguity Ratio">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=RS,Number=.,Type=String,Description="RTG Support Statistics">
#CHROM    POS    ID    REF      ALT         QUAL   FILTER  INFO    FORMAT    SAMPLE
chr1      43230  .     A        G           17.1   PASS    .       GT:DP:RE:AR:GQ:RS   1/1:7:0.280:0.000:17.1:G,7,0.280
chr1      43494  .     TTTAAAT  TT:CTTAAAC  181.6  PASS    .       GT:DP:RE:AR:GQ:RS   1/2:10:0.428:0.024:15.3:CTTAAC,5,0.373
chr1      43638  .     T        C           15.3   PASS    .       GT:DP:RE:AR:GQ:RS   1/1:6:0.210:0.000:15.3:C,6,0.210
chr1      43918  .     C        T           11.4   PASS    .       GT:DP:RE:AR:GQ:RS   1/1:5:0.494:0.000:11.4:T,5,0.494
chr1      44038  .     C        T           18.9   PASS    .       GT:DP:RE:AR:GQ:RS   1/1:8:0.672:0.000:18.9:T,8,0.672
chr1      44173  .     A        G           12.0   PASS    .       GT:DP:RE:AR:GQ:RS   1/1:5:0.185:0.000:12.0:G,5,0.185
chr1      44218  .     TCCTCCA  ACCACCT     385.4  PASS    .       GT:DP:RE:AR:GQ:RS   1/1:43:3.925:0.015:80.8:ACCACCT,5,0.003,CCCCCCA,1,0.631,CCCTCCA,1,0.631,   TCCTCCA,30,2.024,TCCTTCA,4,0.635,~A,1,0.000,~TCCA,1,0.001
chr1      44329  .     A        G           12.2   PASS    .       GT:DP:RE:AR:GQ:RS   1/1:5:0.208:0.000:12.2:G,5,0.208
chr1      44502  .     T        C           6.0    PASS    .       GT:DP:RE:AR:GQ:RS   1/1:3:0.160:0.000:6.0:C,3,0.160
chr1      44533  .     G        A           13.7   PASS    .       GT:DP:RE:AR:GQ:RS   1/1:6:0.421:0.000:13.7:A,6,0.421
chr1      202801 .     A        G           15.8   PASS    .       GT:DP:RE:AR:GQ:RS   0/1:66:30.544:0.000:15.8:A,50,24.508,C,1,0.502,G,13,4.529,T,2,1.004

Note

The VCF specification defines the semantics of the QUAL column differently for records that contain any ALT alleles from those which do not, and that the QUAL column is also defined as a score applying across the set of all samples in the VCF. Thus for multi-sample calling commands such as family, population, somatic, etc, the QUAL score is not necessarily an indication of the of quality the call of any particular sample.

RTG adds custom fields to the VCF format to better account for some of its unique variant calling features, described in the tables below. The exact set of fields used depend on the module run, with some fields only appropriate and present for family calling, somatic calling, etc.

Table : RTG VCF file FILTER fields

Value

Description

PASS

Standard VCF PASS, if variant meets all the filtering criteria.

OC

A predicted variation that has exceeded the maximum coverage filter threshold.

a<number>

A predicted variation that had greater than the given percentage of ambiguously mapped reads overlapping it. The number in the value is the percentage specified by the --max-ambiguity flag.

RCEQUIV

A predicted variation that is the same as a previous variant within a homopolymer or repeat region.

RC

The variant caller encountered a complex looking situation. A typical example would be a long insert. Some complex regions may result in simple calls.

RX

This call was made within a long complex region. Note that no attempt is made to generate complex calls in very long complex regions.

IONT

A predicted variation that failed to pass homopolymer constraints specific to IonTorrent reads.

OTHER

The variant was filtered for an unknown reason.

AVR<number>

A predicted variation that had less than the given value for the AVR score. The number in the value is the minimum AVR score specified by the --min-avr-score flag.

BED

The predicted variant falls outside the BED calling regions defined by the --filter-bed flag.

Table : RTG VCF file INFO fields

Value

Description

NCS=<value>

The Phred scaled posterior probability that the variant at this site is present in the cancer, output by the somatic command.

LOH=<value>

The value shows on a scale from -1 to 1 if the evidence of a call would suggest a loss of heterozygosity. A long run of high values is a strong indicator of a loss of heterozygosity event.

DP=<depth>

The combined read depth of multi-sample variant calls.

DPR=<ratio>

The ratio of combined read depth to the expected combined read depth.

XRX

Indicates the variant was called using the RTG complex caller. This means that a realignment of the reads relative to the reference and each other was required to make this call.

RCE

Indicates the variant is the same as one or more other variants within a homopolymer or repeat region.

NREF

Indicates the variant is called at a site where the reference is unknown, and so some other scores that require exact knowledge of the reference may not be produced.

CT=<value>

The maximum coverage threshold that was applied when the given variant has been filtered for being over the coverage threshold.

AC

The standard VCF allele count in genotypes field. For each ALT allele, in the same order as listed, the count of the allele in the genotypes.

AN

The standard VCF total number of alleles in called genotypes field.

STRL

The number of adjacent simple tandem repeats on the reference sequence.

STRU

The length of the repeating unit in a simple tandem repeat.

Table : RTG VCF file FORMAT fields

Value

Description

GT

The standard VCF format genotype field.

DP

The standard VCF format read depth field.

DPR

The ratio of read depth to the expected read depth.

VA

The allele index (using same numbering as the GT field) of the most frequent non-REF allele. This allele may not necessarily be part of the genotype that was actually called.

RE

The total error across bases of the reads at the SNP location. This is a corrective factor calculated from the r and q read mapping quality scores that adjusts the level of confidence higher or lower relative to read depth.

AR

The ratio of reads contributing to the variant that are considered to be ambiguous to uniquely mapped reads.

RQ

The Phred scaled posterior probability that the sample is not identical to the reference.

GQ

The standard VCF format genotype quality field. This is the Phred scaled posterior score of the call. It is not necessarily the same as the QUAL column score.

GQD

The genotype quality divided by the read depth of the sample.

QD

The quality field divided by the sum of the read depth for all samples.

DN

Indicates with a value of Y if the call for this sample is a putative de novo mutation, or N to indicate that the sample is not a de novo mutation. Note that even in cases when the GT of the sample and the parents would otherwise seem to indicate a de novo mutation, this field may be set to N when the variant caller has assigned a sufficiently low score to the likelihood that a de novo event has occurred.

DNP

The Phred scaled probability that the call for this sample is due to a de novo mutation.

OCOC

The count of evidence that is considered contrary to the call made for this sample, observed in the original sample.

For example, in a normal-cancer somatic call of 0/0 -> 1/0, the OCOC value is the count of the somatic (1) allele in the normal sample.

Usually a high OCOC value indicates an unreliable call.

OCOF

The fraction of evidence that is considered contrary to the call made for this sample, observed in the original sample.

For example, in a somatic call of 0/0 -> 1/0, the OCOF value is the fraction of the somatic (1) allele in the normal sample.

The OCOF and OCOC attributes are also applicable to de novo calls, where the evidence in the parents for the de novo allele is considered contrary.

Usually a high OCOF value indicates an unreliable call.

DCOC

The count of evidence that is considered contrary to the call made for this sample, observed in the derived sample.

For example, in a normal-cancer somatic call of 0/1 -> 2/0, the DCOC value is the count of the germline (1) allele in the somatic sample.

In cases of high sample purity, a high DCOC value may indicate an unreliable call.

DCOF

The fraction of evidence that is considered contrary to the call made for this sample, observed in the derived sample.

For example, in a somatic call of 0/1 -> 2/0, the DCOF value is the fraction of the germline (1) allele in the somatic sample.

The DCOF and DCOC attributes are also applicable to pedigree aware calls, where the evidence of non-inherited parental alleles in the child is considered contrary.

In cases of high sample purity, a high DCOF value may indicate an unreliable call.

ABP

The Phred scaled probability that allele imbalance is present in the call.

SBP

The Phred scaled probability that strand bias is present in the call.

RPB

The Phred scaled probability that read position bias is present in the call.

PPB

The Phred scaled probability that bias in the proportion of alignments that are properly paired is present in the call.

PUR

The ratio of placed unmapped reads to mapped reads.

RS

Statistical information about the evidence for the prediction which consists of a variable number of groups of three fields, each separated by commas. The three fields are allele, count, and the sum of probability error (computed from the Phred quality). The sum of counts should equal DP and the sum of the errors should equal RE.

DH

An alternative disagreeing hypothesis in the same format as the genotype field. This can occur when a sample intersects multiple families in a pedigree when doing population calling.

AD

The allelic depths for the reference and alternate alleles in the order listed.

ADE

The allelic depths for the reference and alternate alleles in the order listed, after adjusting for poor base quality and mapping quality.

ADF

The allelic depths for the reference and alternate alleles in the order listed, for reads on the forward strand.

ADR

The allelic depths for the reference and alternate alleles in the order listed, for reads on the reverse strand.

ADF1

The allelic depths for the reference and alternate alleles in the order listed, for arm 1 reads on the forward strand.

ADF2

The allelic depths for the reference and alternate alleles in the order listed, for arm 2 reads on the forward strand.

ADR1

The allelic depths for the reference and alternate alleles in the order listed, for arm 1 reads on the reverse strand.

ADR2

The allelic depths for the reference and alternate alleles in the order listed, for arm 2 reads on the reverse strand.

AQ

The sum of the quality of evidence (including base quality and mapping quality) for the reference and alternate alleles in the order listed.

MEANQAD

The difference in the mean AQ between the two called alleles.

SSC

The score for the somatic mutation specified by the GT field.

SS

The somatic status of the genotype for this sample. A value of 0 indicates none or wild type, a value of 1 indicates a germline variant, and a value of 2 indicates a somatic variant.

GL

The log10 scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields as defined in the VCF specifications.

VAF

The VAF field contains the estimated variant allelic fraction of each alternate allele, in the order listed.

VAF1

The VAF1 field contains the estimated variant allelic fraction of the most abundant non-reference allele. This attribute may be more suitable for AVR model building and filtering than the multi-valued VAF annotation.

IC

The inbreeding coefficient for the site.

EP

The Phred scaled probability that the site is not in Hardy-Weinberg equilibrium.

LAL

The length of the longest allele for the site.

NAA

The number of alternate alleles for the site.

PD

The ploidy of the sample.

ZY

The zygosity of the sample.

RA

Categorizes the call as hom-ref (RR), hom-alt (AA), het-ref (RA), or het-alt (AB), independent of phase or ALT allele indices.

QA

Sum of quality of the alternate observations.

CLUS

The number of variants in this sample within five bases of the current variant.

AVR

The adaptive variant rescoring value. It is a value between 0 and 1 that represents the probability that the variant for the sample is correct.

The following examples of what a variant call may look like only includes the bare minimum sample field information for clarity.

#Examples of calls:
g1  64  .  T     A        74.0  PASS    .      GT  1/1  #Homozygous SNP
g1  9   .  GAC   G        9.0   PASS    .      GT  1/1  #Homozygous deletion
g1  16  .  A     ACGT     27.0  PASS    .      GT  1/1  #Homozygous insertion
g1  54  .  T     TA       14.0  PASS    .      GT  1/0  #Heterozygous insertion
g1  17  .  ACGT  A        71.0  PASS    .      GT  0/1  #Heterozygous deletion
g1  61  .  TTA   GCG,AAT  74.0  PASS    .      GT  1/2  #Heterozygous MNP
g1  3   .  A     .        11.0  PASS    .      GT  0/0  #Equality call
g1  32  .  CGT   .        89.0  PASS    .      GT  0/0  #Equality call
g1  88  .  A     G        249.5 PASS    .      GT  1    #Haploid SNP
g1  33  .  A     T        20.0  PASS    RCE    GT  1/1  #Variant which is equivalent to other variants
g1  42  .  A     T        13.0  RCEQUIV RCE    GT  1/1  #Variant filtered due to equivalence to other variants
g1  45  .  A     C        5.0   OC     CT=100  GT  1/1  #Variant which exceeds the coverage threshold of 100
g1  76  .  G     C        10.0  a10.0   .      GT  1/1  #Variant which had ambiguous mappings overlapping it
g1  74  .  A     .        3.0   RC      .      GT  0/0  #Complex call with no prediction
g1  90  .  G     C        15.0  RX      .      GT  1/1  #Homozygous SNP called within a large complex region
g1  99  .  C     G        15.0  IONT    .      GT  1/1  #Variant that failed IonTorrent homopolymer constraints
g1  33  .  A     G,C      13.0  PASS    .      GT  0/1  0/2  0/2  1/2  #Mendelian family call
g1  90  .  G     A        20.0  PASS    LOH=1  GT:SSC:SS  0/1  0/0:24.0:2  #Somatic mutation

In the outputs from the family, population and somatic commands some additional information about the samples are provided through the PEDIGREE and SAMPLE header lines.

The family or population command output includes additional sample sex and pedigree information within the headers like the following:

##PEDIGREE=<Child=SM_SON,Mother=SM_MOTHER,Father=SM_FATHER>
##SAMPLE=<ID=SM_SON,Sex=MALE>
##SAMPLE=<ID=SM_MOTHER,Sex=FEMALE>
##SAMPLE=<ID=SM_FATHER,Sex=MALE>

The pedigree information contained within VCF header fields is also used by the mendelian, pedfilter, and pedstats commands.

The somatic command output includes information about sample relationships and genome mixtures using headers like the following:

##PEDIGREE=<Derived=SM_TUMOR,Original=SM_BASE>
##SAMPLE=<ID=SM_BASE,Genomes=SM_BASE,Mixture=1.0,Sex=MALE,Description="Original genome">
##SAMPLE=<ID=SM_TUMOR,Genomes=SM_BASE;SM_TUMOR,Mixture=0.2;0.8,Sex=MALE,Description="Original genome;Derived genome">

Regions BED output file description

The snp, family, population and somatic commands all output a BED file containing regions that were considered to be complex.

The following sample regions.bed file is an example of the output produced by an RTG variant calling run. Each line in a regions.bed output has tab-separated fields and represents a region in the reference genome that was considered to be complex.

CFTR.3.70s  7 7 complex-called
CFTR.3.70s  15  15  complex-called
CFTR.3.70s  18  18  complex-called
CFTR.3.70s  21  21  complex-called
CFTR.3.70s  26  30  complex-called
CFTR.3.70s  42  45  complex-over-coverage
CFTR.3.70s  84  93  extreme-coverage
CFTR.3.70s  133 165 hyper-complex
CFTR.3.70s  169 186 complex-called
CFTR.3.70s  189 195 complex-called
CFTR.3.70s  198 198 complex-no-variant
CFTR.3.70s  300 310 complex-no-hypotheses
CFTR.3.70s  435 439 complex-too-many-hypotheses

The columns in order are:

  1. Sequence name

  2. Region start, counting from 0

  3. Region end, counting from 0, not inclusive

  4. Name of the region type

Table : Region type names

Value

Description

complex-called

Complex regions that were called using complex calling.

hyper-complex

Long complex regions for which no call attempt was made.

extreme-coverage

No calls made in this region due to extreme coverage.

complex-over-coverage

Complex region has greater than the maximum coverage allowed.

complex-no-hypotheses

No hypotheses could be created for the complex region.

complex-no-variant

Complex region evaluation resulted in no variants.

complex-too-many-hypotheses

Complex region had too many hypotheses for evaluation.

SV command output file descriptions

The sv command is used to predict the likelihood of various structural variant categories. The outputs produced are sv_interesting.bed.gz which is a BED format file that identifies regions that could indicate a structural variant and sv_bayesian.tsv.gz a tab separated format file containing prediction strengths of event models.

The following is an example of the sv_interesting.bed.gz file output.

#chr    start   end areas   maxscore    average
chr1    10    100 1   584.5470    315.8478
chr1    49760   53270 5   630.3273    380.2483

Table : SV_INTERESTING.BED file output column descriptions

Column

Description

chr

The chromosome name.

start

The start position in the reference chromosome.

end

The end position in the reference chromosome.

areas

The number of distinct model areas contained in the region.

maxscore

The maximum score reached by a model in the given region.

average

The average score for the model areas covered by this region.

The following is an example of the sv_bayesian.tsv.gz file output.

#Version v2.3.2 build 5c2ee18 (2011-10-05), SV bayesian output v0.1
#CL  sv --step 100 --fine-step 10 --readgroup-stats map/rgstats.tsv --template /data/human/hg18.sdf -o sv map/alignments.bam
#RUN-ID  8acc8413-0daf-455d-bf9f-41d195dec4cd
#template-name    position  normal  duplicate  delete  delete-left  delete-right  duplicate-left  duplicate-right breakpoint  novel-insertion  max-index
chr1  11  -584.5470  -1582.1325  -3538.0052  -4168.1398  584.5470  -932.2432  -1219.9770  -630.1436  -664.3196  4
chr1  21  -521.2708  -1508.9226  -3617.4369  -4247.5716  521.2708  -865.7595  -1156.7007  -630.1450  -671.1980  4
chr1  31  -443.5759  -1425.2073  -3626.2318  -4256.3664  443.5759  -788.1160  -1079.0058  -630.1468  -662.5068  4
chr1  41  -372.8984  -1346.1013  -3676.6399  -4306.7745  372.8984  -715.6147  -1008.3284  -630.1490  -662.4542  4
chr1  51  -326.3469  -1288.4120  -3790.0943  -4420.2290  326.3469  -663.2324  -961.7768  -630.1516  -660.6529  4
chr1  61  -269.1376  -1223.0752  -3849.6462  -4479.7808  269.1376  -604.2883  -904.5676  -630.1545  -669.2223  4
chr1  71  -201.0995  -1146.3076  -3907.0182  -4537.1529  201.0995  -534.3735  -836.5295  -630.1578  -673.4514  4
chr1  81  -109.1116  -1046.1921  -3931.7913  -4561.9260  109.1116  -442.1511  -744.5415  -630.1612  -669.0146  4
chr1  91  -14.6429  -941.7897  -3980.0307  -4610.1653  14.6429  -345.7608  -650.0728  -630.1648  -664.5593  4
chr1  101  60.8820  -918.1163  -4095.1224  -4725.2570  -60.8820  -329.0012  -635.4300  -691.0506  -719.2296  0
chr1  111  117.6873  -911.1928  -4194.5855  -4824.7202  -117.6873  -327.5866  -635.4300  -747.8596  -775.8622  0
chr1  121  199.2098  -898.2490  -4380.5384  -5010.6731  -199.2098  -321.7577  -635.4300  -829.3856  -857.1902  0

Table : SV_BAYESIAN.TSV file output column descriptions

Column

Description

template-name

The chromosome name.

position

The position in the reference chromosome.

normal

The prediction strength for the normal model.

duplicate

The prediction strength for the duplicate model.

delete

The prediction strength for the delete model.

delete-left

The prediction strength for the delete-left model.

delete-right

The prediction strength for the delete-right model.

duplicate-left

The prediction strength for the duplicate-left model.

duplicate-right

The prediction strength for the duplicate-right model.

breakpoint

The prediction strength for the breakpoint model.

novel-insertion

The prediction strength for the novel-insertion model.

max-index

The index of the model that has the maximum prediction strength for this line. The index starts from 0 meaning normal and is in the same order as the model columns.

When the --simple-signals parameter is set an additional file called sv_simple.tsv.gz is output which is a tab separated format file containing the raw signals used by the sv command. The following is an example of the sv_simple.tsv.gz output file.

#Version v2.3.2 build 5c2ee18 (2011-10-05), SV simple output v0.1
#CL  sv --step 100 --fine-step 10 --readgroup-stats map/rgstats.tsv --template /data/human/hg18.sdf -o sv map/alignments.bam --simple-signals
#RUN-ID  8acc8413-0daf-455d-bf9f-41d195dec4cd
#template-name  position  proper-left  discordant-left  unmated-left  proper-right  discordant-right unmated-right  not-paired  unique  ambiguous  n-count
chr1  11   17.0000  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  18.0000  0.0000  0.0000
chr1  21   14.0000  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  15.0000  0.0000  0.0000
chr1  31   13.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  13.0000  0.0000  0.0000
chr1  41   10.0000  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  11.0000  0.0000  0.0000
chr1  51   11.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  11.0000  0.0000  0.0000
chr1  61   12.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  12.0000  0.0000  0.0000
chr1  71   11.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  11.0000  0.0000  0.0000
chr1  81   17.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  17.0000  0.0000  0.0000
chr1  91   17.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  17.0000  0.0000  0.0000
chr1  101  9.0000   0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  10.0000  0.0000  0.0000
chr1  111  10.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  10.0000  0.0000  0.0000
chr1  121  12.0000  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  13.0000  0.0000  0.0000

Table : SV_SIMPLE.TSV file output column descriptions

Column

Description

template-name

The chromosome name.

position

The position in the reference chromosome.

proper-left

Count of properly paired left reads mapped in this location.

discordant-left

Count of discordantly paired left reads mapped in this location.

unmated-left

Count of unmated left reads mapped in this location.

proper-right

Count of properly paired right reads mapped in this location.

discordant-right

Count of discordantly paired right reads mapped in this location.

unmated-right

Count of unmated right reads mapped in this location.

not-paired

Count of single end reads mapped in this location.

unique

Count of unique mappings in this location.

ambiguous

Count of ambiguous mappings in this location.

n-count

The number of unknown bases on the reference for this location.

Discord command output file descriptions

The discord command uses clusters of discordant reads to find possible locations for structural variant breakends. The breakends are output in a VCF file called discord_pairs.vcf.gz using the ALT and INFO fields as defined in the VCF specification.

Note

RTG structural variant calls are stored in VCF format (version 4.2). For more information about the VCF format, refer to the specification online at: https://samtools.github.io/hts-specs/VCFv4.2.pdf

The following is an example of the VCF output of the discord command:

##fileformat=VCFv4.2
##fileDate=20120305
##source=RTGv2.5 build 9f7b8a5 (2012-03-05)
##CL=discord --template hst1 -o discord --readgroup-stats map/rgstats.tsv map/alignments.bam
##RUN-ID=4157329b-edb9-419a-9129-44116e7a2195
##TEMPLATE-SDF-ID=4ecc9eb83e0ccec4
##reference=hst1
##contig=<ID="chr1",length=207900>
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FILTER=<ID=INCONSISTENT,Description="Supporting reads are inconsistent as to breakend location">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  SAMPLE
chr1  50005  .  A  A[simulatedSequence1:52999[  .  PASS  IMPRECISE;SVTYPE=BND;DP=506;CIPOS=0,0  GT  1/1
chr1  52999  .  G  ]simulatedSequence1:50005]G  .  PASS  IMPRECISE;SVTYPE=BND;DP=506;CIPOS=0,0  GT  1/1

RTG adds custom fields to the VCF format to better account for some of its unique structural variant calling features, described in the tables below.

Table : RTG VCF file FILTER fields

Value

Description

PASS

Standard VCF “PASS”, if breakend meets all the filtering criteria.

INCONSISTENT

Breakend with discordant read cluster that does not agree on the possible positions.

Table : RTG VCF file INFO fields

Value

Description

DP=<depth>

Indicates the number of discordant reads contributing to the cluster at this breakend.

If the --bed parameter is set the discord command also outputs a BED format file called discord_pairs.bed.gz containing the break-end regions. Any break-ends which do not have a PASS in the filter field of the VCF output will be preceded by the comment character in the BED file output.

The following is an example of the BED output from the discord command:

#Version v2.5 build 9f7b8a5 (2012-03-05), Discordance output 1
#CL  discord --bed --template hst1 -o discord --readgroup-stats map/rgstats.tsv map/alignments.bam
#RUN-ID  4157329b-edb9-419a-9129-44116e7a2195
#chromosome  start  end  remote  count
chr1  50004  50004  remote:chr1:52998-52998  506
chr1  52998  52998  remote:chr1:50004-50004  506

The columns in the example are in BED file order:

  1. Chromosome name

  2. Start position in chromosome

  3. End position in chromosome

  4. Location of remote break-end matching this one

  5. Count of discordant reads contributing to the break-end

Coverage command output file descriptions

The coverage command works out the coverage depth for a set of read alignments for a given reference. With default settings this will produce a BED format file containing the regions with a specific read depth. These regions are calculated by taking the read depth of each position as the average of itself and the read depths of the positions to the left and right of it out to the number specified with the --smoothing flag and then grouping the resulting values which have the same average read depth.

The following is an example of the BED output from the coverage command:

#Version v2.3.2 build 5c2ee18 (2011-10-05), Coverage BED output v1.0
#CL     coverage -o coverage-hslo-18-u -t hst1 hslo-18-u/alignments.bam
#RUN-ID c0561a1d-fb3b-4062-96ca-cad2cc3c476a
#sequence start   end     label           coverage
chr1      0       4       chr1        2
chr1      4       13      chr1        3
chr1      13      22      chr1        4
chr1      22      29      chr1        5
chr1      29      38      chr1        6
chr1      38      45      chr1        7
chr1      45      53      chr1        8
chr1      53      64      chr1        9
chr1      64      80      chr1        10
chr1      80      128     chr1        11
chr1      128     137     chr1        10
chr1      137     157     chr1        11
chr1      157     164     chr1        12
chr1      164     169     chr1        13
chr1      169     174     chr1        14
chr1      174     177     chr1        15
chr1      177     181     chr1        16

The columns in the example are in BED file order:

  1. Chromosome name

  2. Start position in chromosome

  3. End position in chromosome

  4. Name or label of the feature, this is generally the name of the chromosome or name of any BED features overlapping the coverage region

  5. Depth of coverage for the range specified

When the --per-region flag is set, the coverage command will alter the criteria for outputting a BED record. Rather than defining regions having the same level of coverage, a BED record will be produced for each input BED region, containing the average coverage over that region.

When the --bedgraph flag is set, the coverage command will produce a BEDGRAPH format file with the regions calculated in the same way as for BED format output.

The following is an example of the BEDGRAPH output from the coverage command:

#Version v2.5.0 build 79d6626 (2011-10-05), Coverage BEDGRAPH output v1.0
#CL     coverage -o coverage-hslo-18-u -t hst1 hslo-18-u/alignments.bam --bedgraph
#RUN-ID 36c48ec4-52b7-48d5-b68c-71dce5dba129
track type=bedGraph name=coverage
chr1      0       4       2
chr1      4       13      3
chr1      13      22      4
chr1      22      29      5
chr1      29      38      6
chr1      38      45      7
chr1      45      53      8
chr1      53      64      9
chr1      64      80      10
chr1      80      128     11
chr1      128     137     10
chr1      137     157     11
chr1      157     164     12
chr1      164     169     13
chr1      169     174     14
chr1      174     177     15
chr1      177     181     16

The columns in the example are in BEDGRAPH file order:

  1. Chromosome name

  2. Start position in chromosome

  3. End position in chromosome

  4. Depth of coverage for the range specified

When the --per-base flag is set when running the coverage command will produce a tab separated value file with the coverage information for each individual base in the reference.

The following is an example of the TSV output from the coverage command:

#Version v2.3.2 build 5c2ee18 (2011-10-05), Coverage output v1.0
#CL  coverage -o coverage-hslo-18-u-per-base -t hst1 hslo-18-u/alignments.bam --per-base
#RUN-ID 7798b1a5-2159-48e6-976e-86b4f8e98fa6
#sequence       position        unique-count    ambiguous-count   score
chr1            0               0               0                 0.00
chr1            1               0               0                 0.00
chr1            2               1               0                 1.00
chr1            3               1               0                 1.00
chr1            4               1               0                 1.00
chr1            5               1               0                 1.00
chr1            6               1               0                 1.00
chr1            7               2               0                 2.00
chr1            8               2               0                 2.00
chr1            9               2               1                 2.50

Table : COVERAGE.TSV file output column descriptions

Column

Description

sequence

The chromosome name.

position

The position in the reference chromosome.

unique-count

The count of reads covering this position with IH equal to one.

ambiguous-count

The count of reads covering this position with IH greater than one.

score

The sum of one divided by the IH value for all the reads covering this position.

The coverage command produces a stats.tsv file with coverage statistics for each chromosome in the reference and the reference as a whole. The following is an example file:

#depth        breadth covered size    name
28.3238       0.9998  23766   23770   chr1
28.0013       0.9997  41447   41459   chr2
28.3151       0.9994  24930   24946   chr3
28.4955       0.9999  87734   87741   chr4
28.3822       0.9999  32600   32604   chr5
28.7159       0.9999  55042   55047   chr6
28.1109       0.9998  34887   34894   chr7
28.3305       0.9999  50025   50032   chr8
28.2229       0.9998  49627   49639   chr9
28.3817       0.9999  15912   15914   chr10
28.3569       0.9998  415970  416046  all sequences

Table : STATS.TSV column descriptions

Column

Description

depth

The average depth of coverage for the region where each base position is calculated as the sum of one divided by the IH of each read alignment which covers the position.

breadth

The fraction of the non-N region base positions which have a depth of one or greater.

covered

The number of non-N bases in the region which have a depth of one or greater.

size

The number of non-N bases in the region.

name

The name of the region, or “all sequences” for the entire reference.

For whole-genome coverage runs, the region names are each of the chromosomes. For exome data, or other targeted sequencing where a BED file was provided, the region names are obtained from the BED file.

The coverage command produces a levels.tsv file with some statistics about the coverage levels. The following is the start of an example file:

#coverage_level       count   %age    %cumulative
0     83      0.02    100.00
1     84      0.02    99.98
2     86      0.02    99.96
3     87      0.02    99.94
4     88      0.02    99.92
5     154     0.04    99.90

Table : LEVELS.TSV column descriptions

Column

Description

coverage_level

The coverage level.

count

The count of the number of bases at this coverage level.

%age

The percentage of the reference at this coverage level.

%cumulative

The percentage of the reference at this coverage level or higher.

Mapx and mapp output file description

The mapx and mapp commands search protein databases (with query sequences being translated nucleotide and untranslated protein query sequences, respectively). These commands report matches filtered on a combination of percent identity, e-score, bit-score and alignment score. The matches are reported in an ASCII file called alignments.tsv with each match reported on a single line as a tab-separated value record.

The following results file is an example of the output produced by mapx:

#template-name frame read-id template-start template-end template-length read-start read-end read-length template-protein read-protein alignment identical %identical positive %positive mismatches raw-score bit-score e-score
gi|212691090|ref|ZP_03299218.1| +1      0       179     211     429     1       99      100 nirqgsrtfgilcmpkasgnypallrvpgagvr       nirqgsrtfgifcmpkasgnypallrvpgaggr nirqgsrtfgi cmpkasgnypallrvpgag r       31      94      31      94      2       -162    67.0 2.3e-11
gi|255013538|ref|ZP_05285664.1| +1      0       176     208     428     1       99      100 nvrpgsrtygilcmpkkegkypallrvpgagir       nirqgsrtfgifcmpkasgnypallrvpgaggr n+r gsrt+gi cmpk  g ypallrvpgag r       25      76      27      82      6       -136    57.0 2.3e-8
gi|260172804|ref|ZP_05759216.1| +1      0       185     217     435     1       99      100 nicngsrtfgilcipkkpgkypallrvpgagvr       nirqgsrtfgifcmpkasgnypallrvpgaggr ni  gsrtfgi c+pk  g ypallrvpgag r       25      76      26      79      7       -129    54.3 1.5e-7
gi|224537502|ref|ZP_03678041.1| +1      0       162     194     414     1       99      100 tdrwgsrfygvlcvpkkegkypallrvpgagir       nirqgsrtfgifcmpkasgnypallrvpgaggr r gsr +g+ c+pk  g ypallrvpgag r       21      64      24      73      9       -111    47.4 1.8e-5

The following table provides descriptions of each column in this alignment output file format, listed from left to right.

Table : ALIGNMENTS.TSV file output column descriptions

Column

Description

template-name

ID or description of protein (reference) with match.

frame

Denotes translation frame on forward (1,2,3) or reverse strand.

read-id

Numeric ID of query sequence from SDF.

template-start

Start position of alignment on protein subject.

template-end

End position of alignment on protein subject.

template-length

Amino acid length of protein subject.

read-start

Start position of alignment on query sequence.

read-end

End position of alignment on query sequence.

read-length

Total length of query (nucleotides for mapx / amino acids for mapp).

template-protein

Amino acid sequence of aligned protein reference.

read-protein

Amino acid sequence of aligned query sequence.

alignment

Amino acid alignment of match.

identical

Count of identities in alignment between protein subject and query sequences.

%identical

Percent identity of match between protein subject and query sequences, for exact matches only (global across query sequence).

positive

Count of identical and similar amino acids in alignment between protein subject and query sequences.

%positive

Percent similarity between protein subject and query sequences, for exact and similar matches (global across query sequence).

mismatches

Count of mismatches between protein subject and query sequences.

raw-score

RTG alignment score (S); The alignment score is the negated sum of all single protein raw scores plus its penalties for gaps, which is the edit distance using one of the scoring matrices. Note that the RTG alignment score is the negated raw score of BLAST.

bit-score

Bit score is computed from the alignment score using the following formula: \text{bit-score} = ((\lambda \times -S) - \ln (K)) / \ln (2)

where \lambda and K are taken from the matrix defaults [Blast pp.302-304] and S is the RTG alignment score.

e-score

e-score is computed from the alignment score using the following formula: \text{e-score} = K \times m' \times n \times e ^ {(\lambda \times S)}

n is the total length of the database.

m' is the effective length of the query (read):

m' = \max(1, \text{querylength} + \lambda \times S/H)

When the --unmapped-reads flag is set, unmapped query sequences are reported in an ASCII file called unmapped.tsv with each query sequence reported on a single line as a tab-separated value record. Each sequence in the unmapped output has a character code indicating the reason the query was not mapped, with no code indicating that sequence had no matches.

Character codes for unmapped sequences include:

Character

Description

d

Indicates that after alignment scores are calculated, the remaining hits were discarded because they exceeded the alignment score threshold (affected by --max-alignment-score).

e

Indicates that there were good scoring hits, but the results were discarded because there were too many of these hits (affected by --max-top-results).

f

Indicates that there was a good hit which failed the percent identity threshold (affected by --min-identity).

g

Indicates that there was a good hit which failed the e-score threshold (affected by --max-e-score).

h

Indicates that there was a good hit which failed the bit score threshold (affected by --min-bit-score).

Species results file description

The species command estimates the proportion of taxa in a given set of BAM files. It does this by taking a set of BAM files which were mapped against a set of known genome sequences. The proportions are reported in a tab separated ASCII file called species.tsv with each taxon reported on a separate line. The header lines include the command line and reference sets used.

In addition to the raw output, some basic diversity metrics are produced and output to the screen and to a file named summary.txt. The metrics included are:

:clist - Shannon (see http://en.wikipedia.org/wiki/Diversity_index#Shannon_index) - Pielou (see http://en.wikipedia.org/wiki/Species_evenness) - Inverse Simpson (see http://en.wikipedia.org/wiki/Diversity_index#Inverse_Simpson_index)

For an interactive graphical view of the species command output, an HTML5 report named index.html. Opening this shows the taxonomy and data on an interactive pie chart, with wedge sizes defined by either the abundance or DNA fraction (user selectable in the report).

The following results file is an example of the output produced by species:

#abundance abundance-low abundance-high DNA-fraction DNA-fraction-low DNA-fraction-high confidence coverage-depth coverage-breadth reference-length mapped-reads has-reference taxa-count taxon-id parent-id rank taxonomy-name
0.1693 0.1677 0.1710 0.06157 0.06097 0.06217 4.2e+02 0.4919 0.3915 1496992 20456.00 Y 1 3 1 species Acholeplasma_laidlawii
0.1682 0.1671 0.1693 0.1384 0.1375 0.1393 6.6e+02 0.4892 0.3918 3389227 46059.50 Y 1 4 1 species Acidiphilium_cryptum
0.1680 0.1667 0.1692 0.09967 0.09891 0.1004 5.5e+02 0.4890 0.3887 2443540 33189.00 Y 1 6 1 species Acidothermus_cellulolyticus
0.1677 0.1668 0.1685 0.2301 0.2289 0.2313 8.7e+02 0.4877 0.3887 5650368 76543.00 Y 1 5 1 species Acidobacteria_bacterium
0.1646 0.1638 0.1655 0.2140 0.2129 0.2152 8.3e+02 0.4795 0.3886 5352772 71295.50 Y 1 7 1 species Acidovorax_avenae
0.1622 0.1614 0.1630 0.2562 0.2550 0.2574 9.2e+02 0.4721 0.3836 6503724 85288.00 Y 1 2 1 species Acaryochloris_marina

The following table provides descriptions of each column in the species output file format, listed from left to right.

Table : SPECIES.TSV file output column descriptions

Column

Description

abundance

Fraction of the individuals in the sample that belong to this taxon. The output file is sorted on this column.

abundance-low

Lower bound three standard deviations below the abundance.

abundance-high

Upper bound three standard deviations above the abundance.

DNA-fraction

Raw fraction of the DNA that maps to this taxon.

DNA-fraction-low

Lower bound three standard deviations below the DNA-fraction.

DNA-fraction-high

Upper bound three standard deviations above the DNA-fraction.

confidence

Confidence that this taxon is present in the sample. (Computed as the number of standard deviations away from the null hypotheses).

coverage-depth

The coverage depth of reads mapped to the taxon sequences, adjusted for the IH (number of output alignments) of the individual reads. (Zero if no reference sequences for the taxon).

coverage-breadth

The fraction of the taxon sequences covered by the reads. (Zero if no reference sequences for the taxon).

reference-length

The total length of the reference sequences, will be 0 if the taxon does not have associated reference sequences.

mapped-reads

The count of the reads which mapped to this taxon, adjusted for the IH (number of output alignments) of the individual reads.

has-reference

Y if the taxon has associated reference sequences, otherwise N.

taxa-count

The count of the number of taxa that are descendants of this taxon (including itself) and which are above the minimum confidence threshold.

taxon-id

The taxonomic ID for this result.

parent-id

The taxonomic ID of this results parent.

rank

The taxonomic rank associated with this result. This can be used to filter results into meaningful sets at different taxonomic ranks.

taxonomy-name

The taxonomic name for this result.

Similarity results file descriptions

The similarity command estimates how closely related a set of samples are to each other. It produces output in the form of a similarity matrix (similarity.tsv), a principal component analysis (similarity.pca) and two formats for phylogenetic trees showing relationships (closest.tre and closest.xml).

The similarity matrix file is a tab separated format file containing an N\times N matrix of the matching k-mer counts, where N is the number of samples. An example of a similarity.tsv file:

#rtg similarity --unique-words -o similarity-out -I samples.txt -w 20 -s 15
F1_G_S_1M     F1_N_AN_1M      F1_O_BM_1M      F1_O_SP_1M      F1_O_TD_1M      F1_V_PF_1M
F1_G_S_1M     8406725 4873    18425   13203   14746   10201
F1_N_AN_1M    4873    3551445 113310  117696  68118   245516
F1_O_BM_1M    18425   113310  6588017 579048  782011  152603
F1_O_SP_1M    13203   117696  579048  8583126 248654  161950
F1_O_TD_1M    14746   68118   782011  248654  9134232 79949
F1_V_PF_1M    10201   245516  152603  161950  79949   6000623

The principal component analysis file is a tab separated format file containing principal component groupings in three columns of real numbers, followed by the name of the sample. This can be turned into a 3D plot showing relationship groupings, by treating each line as a single point in three dimensional space. An example of a similarity.pca file:

0.0906        -0.1505 0.0471  F1_G_S_1M
-0.1207       -0.0610 -0.0348 F1_N_AN_1M
-0.0479       0.1889  -0.0119 F1_O_BM_1M
-0.0229       0.0893  0.0996  F1_O_SP_1M
0.0162        0.1375  -0.1384 F1_O_TD_1M
-0.1503       -0.0619 -0.0341 F1_V_PF_1M

The files containing the relationships as phylogenetic trees are in the Newick (closest.tre) and phyloXML (closest.xml) formats. For more information about the Newick tree format see http://en.wikipedia.org/wiki/Newick_format. For more information about the phyloXML format see http://www.phyloxml.org.

RTG JavaScript filtering API

The vcffilter command permits filtering VCF records via user-supplied JavaScript expressions or scripts containing JavaScript functions that operate on VCF records. The JavaScript environment has an API provided that enables convenient access to components of a VCF record in order to satisfy common use cases.

VCF record field access

This section describes the supported methods to access components of an individual VCF record. In the following descriptions, assume the input VCF contains the following excerpt (the full header has been omitted):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12877 NA12878
1 11259340 . G C,T . PASS DP=795;DPR=0.581;ABC=4.5 GT:DP 1/2:65 1/0:15

CHROM, POS, ID, REF, QUAL

Within the context of a --keep-expr or record function these variables will provide access to the String representation of the VCF column of the same name.

CHROM; // "1"
POS; // "11259340"
ID; // "."
REF; // "G"
QUAL; // "."

ALT, FILTER

Will retrieve an array of the values in the column.

ALT; // ["C", "T"]
FILTER; // ["PASS"]

INFO.{INFO_FIELD}

The values in the INFO field are accessible through properties on the INFO object indexed by INFO ID. These properties will be the string representation of info values with multiple values delimited with “,”. Missing fields will be represented by “.”. Assigning to these properties will update the VCF record. This will be undefined for fields not declared in the header.

INFO.DP; // "795"
INFO.ABC; // "4,5"

{SAMPLE_NAME}.{FORMAT_FIELD}

The JavaScript String prototype has been extended to allow access to the format fields for each sample. The string representation of values in the sample column are accessible as properties on the string matching the sample name named after the FORMAT field ID.

'NA12877'.GT; // "1/2"
'NA12878'.GT; // "1/0"

Note that these properties are only defined for fields that are declared in the VCF header (any that are not declared in the header will be undefined). See below for how to add new INFO or FORMAT field declarations.

VCF record modification

Most components of VCF records can be written or updated in a fairly natural manner by direct assignment in order to make modifications. For example:

CHROM = "chr1";      // Will change the CHROM value
POS = 42;            // Will change the POS value
ID = "rs23987382";   // Will change the ID value
QUAL = "50";         // Will change the QUAL value
FILTER = "FAIL";     // Will set the FILTER value
INFO.DPR = "0.01";   // Will change the value of the DPR info field
'NA12877'.DP = "10"; // Will change the DP field of the NA12877 sample

Other components of the VCF record (such as REF, and ALT) are considered immutable and can not currently be altered.

Note

Modification of CHROM and/or POS can lead to a VCF file which is incorrectly sorted and this will not necessarily be detected or reported until the resulting VCF file is used with another module or tool. Depending on the new value assigned to CHROM it may also be necessary to modify the sequence dictionary in the VCF header to reflect the change (see VCF header modification).

Direct assignment to ID and FILTER fields accept either a string containing semicolon separated values, or a list of values. For example:

ID = 'rs23987382;COSM3805';
ID = ['rs23987382', 'COSM3805'];
FILTER = 'BAZ;BANG';
FILTER = ['BAZ', 'BANG'];

Note that although the FILTER field returns an array when read, any changes made to this array directly are not reflected back into the VCF record.

Adding a filter to existing filters is a common operation and can be accomplished by the above assignment methods, for example by adding a value to the existing list and then setting the result:

var f = FILTER;
f.push('BOING');
FILTER = f;

However, since this is a little unwieldy, a convenience function called add() can be used (and may be chained):

FILTER.add('BOING');
FILTER.add(['BOING', 'DOING');
FILTER.add('BOING').add('DOING');

VCF header modification

Functions are provided that allow the addition of new FILTER, INFO and FORMAT fields to the header and records. It is recommended that the following functions only be used within the run-once portion of --javascript.

ensureFormatHeader({FORMAT_HEADER_STRING})

Add a new FORMAT field to the VCF if it is not already present. This will add a FORMAT declaration line to the header and define the corresponding accessor methods for use in record processing.

ensureFormatHeader('##FORMAT=<ID=GL,Number=G,Type=Float,' +
  'Description="Log_10 scaled genotype likelihoods.">');

ensureInfoHeader({INFO_HEADER_STRING})

Add a new INFO field to the VCF if it is not already present. This will add an INFO declaration line to the header and define the corresponding accessor methods for use in record processing.

ensureInfoHeader('##INFO=<ID=CT,Number=1,Type=Integer,' +
  'Description="Coverage threshold that was applied">');

ensureFilterHeader({FILTER_HEADER_STRING})

Add a new FILTER field to the VCF header if it is not already present. This will add an FILTER declaration line to the header.

ensureFilterHeader('##INFO=<ID=FAIL_VAL,' +
  'Description="Failed validation">');

Testing for overlap with genomic regions

One common use case is to test whether any given VCF record overlaps or is contained within a set of genomic regions. Regions may be loaded from either an external BED file or VCF file in the run-once portion of the JavaScript by either of the following:

Regions.fromBed({FILENAME})

Load the specified BED file into a regions object. For example:

var myregions = Regions.fromBed('/path/to/regions.bed');

Regions.fromVcf({FILENAME})

Load the specified VCF file into a regions object. For example:

var myregions = Regions.fromVcf('/path/to/regions.vcf');

Having loaded a set of genomic regions, this can be used to test for region overlaps using the following methods:

{REGIONS_OBJECT}.overlaps({CHROM}, {START}, {END})

Return true if the loaded regions overlap the specified interval. This function is typically used within the record function to test the coordinates of the current VCF record, e.g.:

function record() {
    if (myregions.overlaps(CHROM, POS, POS + REF.length)) {
        // do something if the record overlaps any region
    }
}

{REGIONS_OBJECT}.encloses({CHROM}, {START}, {END})

Return true if the loaded regions entirely encloses the supplied interval. This function is typically used within the record function to test the coordinates of the current VCF record, e.g.:

function record() {
    if (myregions.encloses(CHROM, POS, POS + REF.length)) {
        // do something if the record is fully enclosed by any region
    }
}

Additional information and functions

SAMPLES

This variable contains an array of the sample names in the VCF header.

SAMPLES; // ['NA12877', 'NA12878']

error({STRING})

Write the provided string to standard error.

error('The samples are: ' + SAMPLES);

checkMinVersion(RTG_MINIMUM_VERSION)

Check the version of RTG that the script is running under, and exits with an error message if the version of RTG does not meet the minimum required version. This is useful when distributing scripts that make use of features that are not present in earlier versions of RTG.

checkMinVersion('3.9.2');

See also

For javascript filtering usage and examples see vcffilter

Parallel processing approach

The comparison of genomic variation on a large scale in real time demands parallel processing capability. On a single server node, RTG automatically splits up compute-intensive commands like map and snp into multiple threads that run simultaneously (unless explicitly directed otherwise through the -T option).

To get a further reduction in wall clock time and optimize job execution on different size server nodes, a full read mapping and alignment run can be split into smaller jobs running in parallel on multiple servers.

How it works

With read mapping, very large numbers of reads may be aligned to a single reference genome. For example, 35 fold coverage of the human genome with 2×36 base pair data requires processing of 1,500 million reads. Fortunately, multiple alignment files can be easily combined with other alignment files if the reference is held constant. Thus, the problem can easily be made parallel by breaking up the read data into multiple data sets and executing on separate compute nodes.

The steps for clustered read mappings with RTG are as follows:

Create multiple jobs of single end data, and start them on different nodes. By creating files with the individual jobs, these can be saved for repeated runs. This example shows the mapping of single end read data to a reference.

$ echo "rtg map -a 1 -b 0 -w 12 --start-read=0 --end-read=10000000 \
  -o ${RESULTS}/00 -i ${READ_DATA} -t ${REFERENCE}" > split-job-00
$ rsh host0 split-job-00
$ echo "rtg map -a 1 -b 0 -w 12 --start-read=10000000 --end-read=20000000 \
  -o ${RESULTS}/01 -i ${READ_DATA} -t ${REFERENCE}" > split-job-01
$ rsh host1 split-job-01

Repeat for as many segments of read data as required. This step can vary based on your configuration, but the basic idea is to create a command line script and then run it remotely. Store read, reference, and output results data in a central location for easier management. Shell variables that specify data location by URL are recommended to keep it all straight.

Run SNP caller to get variant detection results

$ rtg snp -t ${REFERENCE} -o ${RESULTS} ${RESULTS}/*/alignments.bam

With this pipeline, each job runs to completion on separate nodes mapping the reads against the same reference genome. Each node should have sufficient, dedicated memory. RTG will automatically use the available cores of the processor to run multi-threaded tasks.

Note

The map command does not require specific assignment of input files for left and right mate pair read data sets; this is handled automatically by the format command. The result is an SDF directory structure, with left and right SDF directories underneath a directory named by the -o parameter. Each of the left and right directories share a GUID and are identified as left or right arms respectively. The map command uses this information to verify that left and right arms are correct before processing.

Complete Genomics use case

Complete Genomics data has 70 lanes, typically. For clustered processing, plan one lane of mapping per SDF. If you have 7 machines available, map 10 on each machine resulting in 70 SDFs per machine, and 70 directories of BAM files per machine.

SDF read data is loaded into RAM during execution of the RTG cgmap command. For example, using human genomic data, ~1.28 billion reads (each with 70 bp of information) can be divided into chunks that can fit into memory by the cgmap command.

If you have multi-core server nodes available, the cgmap command will use multiple cores simultaneously. You can use the -T flag to adjust the number of cores used. Clustered processing dramatically reduces the wall clock time of the total job. At the end, the snp and cnv commands will accept multiple alignment files created by mapping runs at one time, where different sets of reads are mapped to the same reference.

Distribution Contents

The contents of the RTG distribution zip file should include:

  • The RTG executable JAR file.

  • RTG executable wrapper script.

  • Example scripts and files.

  • This operations manual.

  • A release notes file and a readme file.

Some distributions also include an appropriate java runtime environment (JRE) for your operating system.

README.txt

For reference purposes, a copy of the distribution README.txt file follows:

=== RTG.VERSION ===

RTG software from Real Time Genomics includes tools for the processing
and analysis of plant, animal and human sequence data from high
throughput sequencing systems.  Product usage and administration is
described in the accompanying RTG Operations Manual.


Quick Start Instructions
========================

RTG software is delivered as a command-line Java application accessed
via a wrapper script that allows a user to customize initial memory
allocation and other configuration options. It is recommended that
these wrapper scripts be used rather than directly accessing the Java
JAR.

For individual use, follow these quick start instructions.

No-JRE:

  The no-JRE distribution does not include a Java Runtime Environment
  and instead uses the system-installed Java.  Ensure that at the
  command line you can enter "java -version" and that this command
  reports a java version of 1.7 or higher before proceeding with the
  steps below. This may require setting your PATH environment variable
  to include the location of an appropriate version of java.

Linux/MacOS X:

  Unzip the RTG distribution to the desired location.

  If your RTG distribution requires a license file (rtg-license.txt),
  copy the license file from Real Time Genomics into the RTG
  distribution directory.

  In a terminal, cd to the installation directory and test for success
  by entering "./rtg version"

  On MacOS X, depending on your operating system version and
  configuration regarding unsigned applications, you may encounter the
  error message:

  -bash: rtg: /usr/bin/env: bad interpreter: Operation not permitted

  If this occurs, you must clear the OS X quarantine attribute with
  the command:

  xattr -d com.apple.quarantine rtg

  The first time rtg is executed you will be prompted with some
  questions to customize your installation. Follow the prompts.

  Enter "./rtg help" for a list of rtg commands.  Help for any individual
  command is available using the --help flag, e.g.: "./rtg format --help"

  By default, RTG software scripts establish a memory space of 90% of
  the available RAM - this is automatically calculated.  One may
  override this limit in the rtg.cfg settings file or on a per-run
  basis by supplying RTG_MEM as an environment variable or as the
  first program argument, e.g.: "./rtg RTG_MEM=48g map"

  [OPTIONAL] If you will be running rtg on multiple machines and would
  like to customize settings on a per-machine basis, copy
  rtg.cfg to /etc/rtg.cfg, editing per-machine settings
  appropriately (requires root privileges).  An alternative that does
  not require root privileges is to copy rtg.example.cfg to
  rtg.HOSTNAME.cfg, editing per-machine settings appropriately, where
  HOSTNAME is the short host name output by the command "hostname -s"

Windows:

  Unzip the RTG distribution to the desired location.

  If your RTG distribution requires a license file (rtg-license.txt),
  copy the license file from Real Time Genomics into the RTG
  distribution directory.

  Test for success by entering "rtg version" at the command line.  The
  first time rtg is executed you will be prompted with some
  questions to customize your installation. Follow the prompts.

  Enter "rtg help" for a list of rtg commands.  Help for any individual
  command is available using the --help flag, e.g.: "rtg format --help"

  By default, RTG software scripts establish a memory space of 90% of
  the available RAM - this is automatically calculated.  One may
  override this limit by setting the RTG_MEM variable in the rtg.bat
  script or as an environment variable.


The scripts subdirectory contains demos, helper scripts, and example
configuration files, and comprehensive documentation is contained in
the RTG Operations Manual.

Using the above quick start installation steps, an individual can
execute RTG software in a remote computing environment without the
need to establish root privileges.  Include the necessary data files
in directories within the workspace and upload the entire workspace to
the remote system (either stand-alone or cluster).

For data center deployment and instructions for editing scripts,
please consult the Administration chapter of the RTG Operations Manual.

A discussion group is now available for general questions, tips, and other
discussions. It may be viewed or joined at:
https://groups.google.com/a/realtimegenomics.com/forum/#!forum/rtg-users

To be informed of new software releases, subscribe to the low-traffic
rtg-announce group at:
https://groups.google.com/a/realtimegenomics.com/forum/#!forum/rtg-announce

Citing RTG
==========

John G. Cleary, Ross Braithwaite, Kurt Gaastra, Brian S. Hilbush,
Stuart Inglis, Sean A. Irvine, Alan Jackson, Richard Littin, Sahar
Nohzadeh-Malakshah, Mehul Rathod, David Ware, Len Trigg, and Francisco
M. De La Vega. "Joint Variant and De Novo Mutation Identification on
Pedigrees from High-Throughput Sequencing Data." Journal of
Computational Biology. June 2014, 21(6):
405-419. doi:10.1089/cmb.2014.0029.

Terms of Use
============

This proprietary software program is the property of Real Time
Genomics.  All use of this software program is subject to the
terms of an applicable end user license agreement.

Patents
=======

US: 7,640,256, 13/129,329, 13/681,046, 13/681,215, 13/848,653,
13/925,704, 14/015,295, 13/971,654, 13/971,630, 14/564,810
UK: 1222923.3, 1222921.7, 1304502.6, 1311209.9, 1314888.7, 1314908.3
New Zealand: 626777, 626783, 615491, 614897, 614560
Australia: 2005255348, Singapore: 128254
Other patents pending


Third Party Software Used
=========================

RTG software uses the open source htsjdk library
(https://github.com/samtools/htsjdk) for reading and writing SAM
files, under the terms of following license:

The MIT License

Copyright (c) 2009 The Broad Institute

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

-------------------------

RTG software uses the bzip2 library included in the open source Ant project
(http://ant.apache.org/) for decompressing bzip2 format files, under the
following license:

Copyright 1999-2010 The Apache Software Foundation

Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed
under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

-------------------------

RTG Software uses a modified version of
java/util/zip/GZIPInputStream.java (available in the accompanying
gzipfix.jar) from OpenJDK 7 under the terms of the following license:

This code is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License version 2 only, as
published by the Free Software Foundation.  Oracle designates this
particular file as subject to the "Classpath" exception as provided
by Oracle in the LICENSE file that accompanied this code.

This code is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
version 2 for more details (a copy is included in the LICENSE file that
accompanied this code).

You should have received a copy of the GNU General Public License version
2 along with this work; if not, write to the Free Software Foundation,
Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.

Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
or visit http://www.oracle.com if you need additional information or have
any questions.

-------------------------

RTG Software uses hierarchical data visualization software from
http://sourceforge.net/projects/krona/ under the terms of the
following license:

Copyright (c) 2011, Battelle National Biodefense Institute (BNBI);
all rights reserved. Authored by: Brian Ondov, Nicholas Bergman, and
Adam Phillippy

This Software was prepared for the Department of Homeland Security
(DHS) by the Battelle National Biodefense Institute, LLC (BNBI) as
part of contract HSHQDC-07-C-00020 to manage and operate the National
Biodefense Analysis and Countermeasures Center (NBACC), a Federally
Funded Research and Development Center.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

* Redistributions of source code must retain the above copyright
  notice, this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in the
  documentation and/or other materials provided with the distribution.

* Neither the name of the Battelle National Biodefense Institute nor
  the names of its contributors may be used to endorse or promote
  products derived from this software without specific prior written
  permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Notice

Real Time Genomics does not assume any liability arising out of the application or use of any software described herein. Further, Real Time Genomics does not convey any license under its patent, trademark, copyright, or common-law rights nor the similar rights of others.

Real Time Genomics reserves the right to make any changes in any processes, products, or parts thereof, described herein, without notice. While every effort has been made to make this guide as complete and accurate as possible as of the publication date, no warranty of fitness is implied.

© 2017 Real Time Genomics All rights reserved.

Illumina, Solexa, Complete Genomics, Ion Torrent, Roche, ABI, Life Technologies, and PacBio are registered trademarks and all other brands referenced in this document are the property of their respective owners.