Overview¶

This chapter introduces the features, operational options, and installation requirements of the data analysis software from Real Time Genomics.

Introduction¶

RTG software enables the development of fast, efficient software pipelines for deep genomic analysis. RTG is built on innovative search technologies and new algorithms designed for processing high volumes of high-throughput sequencing data from different sequencing technology platforms. The RTG sequence search and alignment functions enable read mapping and protein searches with a unique combination of sensitivity and speed.

RTG-based data production pipelines support unprecedented breadth and depth of analysis on genomic data, transforming researcher visibility into DNA sequence analysis and biological investigation. A comprehensive suite of easy-to-integrate data analysis functions increases the productivity of bioinformatics specialists, freeing them to develop analytical solutions that amplify the investigative ability unique to their organization.

RTG software supports a variety of research and medical genomics applications, such as:

Medical Genomic Research – Compare sequence variants and structural variation between normal and disease genomes, or over a disease progression in the same individual, to identify causal loci.
Personalized Medicine – Establish reliable, high-throughput processing pipelines that analyze individual human genomes compared to one or more reference genomes. Use RTG software for detection of sequence variants (SNP and indel calling, intersection scripting), as well as structural variation (coverage depth, and copy number variation).
Model Organisms and Basic Research – Utilize RTG mapping and variant detection commands for focused research applications such as metagenomic species identification and frequency, and metabolic pathway analysis. Map microbial communities to generate gapped alignments of both DNA and protein sequence data.
Plant Genomics – Enable investigations of new crop species and variant detection in genetically diverse strains by leveraging RTG’s highly sensitive sequence search capabilities for strain and cross-species mapping applications. Flexible sensitivity tuning controls allow investigators to accommodate very high error rates associated with unique combinations of sequencing system error, genome-specific mutation, and aggressive cross-species comparisons.

RTG software description¶

RTG software is delivered as a single executable with multiple commands executed through a command line interface (CLI). Commands are delivered in product packages, and the available commands vary per product package.

Usage:

rtg COMMAND [OPTIONS] <REQUIRED>

RTG software delivers features in four areas:

Sequence Search and Alignment – RTG software uses patented sequence search technology for the rapid production of genomic sequence data. The map command implements read mapping and gapped alignment of sequence data against a reference. The mapx command searches translated sequence data against a protein database.
Data Analysis – RTG software supports two pipelines for data analysis: variant detection and metagenomics. Purpose-built variant detection pipeline functions include several commands to identify small sequence variants, a cnv command to report copy number variation statistics for structural variation, and a coverage command to report read depth across a reference.
Reporting Options – Standard result formats and utility commands report results for validation, and ease development of custom scripts for analysis. Scripts that produce publication quality graphics for visualization of data analysis results are available through Real Time Genomics technical support.
Data Center Deployment – RTG software supports typical data center standards for enterprise deployment. RTG provides automated installation and supports industry standard operating environments and data processing systems to help maintain total cost of ownership objectives in enterprise data centers. The RTG software can be run in compute clusters of varying sizes, and commands take advantage of multi-core processors by default.

See also

For detailed information about RTG command syntax and usage of individual commands, refer to RTG Command Reference.

Sequence search and alignment¶

RTG software uses an edit-distance alignment score to determine best fit and alignment accuracy.

RTG software includes optimal sensitivity settings for error and mutation rates, plus command line controls and simulation tools that allow investigators to calibrate sensitivity settings for specific data sets. Extensive filtering and reporting options allow complete control over reported alignments, which leads to greater flexibility for downstream analysis functions.

Key functionality of RTG sequence search and alignment includes:

Read mapping by nucleotide sequence alignment to a reference genome
Protein database searching by translated nucleotide sequence searches against protein databases
Sensitivity tuning using parameter options for substitutions, indels, indel lengths, word or step sizes, and alignment scores
Filtering and reporting ambiguous reads that map to multiple locations
Benchmarking and optimization using simulation and evaluation commands

RTG mapping commands have the following characteristics:

Eliminates need for genome indexing
Aligns sequence reads of any length
Allows high mismatch levels for increased sensitivity in longer reads
Allows detection of short indels with single end (SE) or paired end (PE) data
Can optionally guarantee the mapping of reads with up to a specified number of substitutions and indels
Supports a wide range of alignment scores

See also

For detailed information about sequence search and alignment functionality, refer to Command Reference, map.

For more information about the RTG integrated software pipeline, refer to RTG product usage - baseline progressions.

Data formatting¶

Prior to RTG data production, reference genome and sometimes read data sequence files are typically first converted to the RTG Sequence Data File (SDF) format. This is an efficient storage format optimized for fast retrieval during data processing.

The RTG format and cg2sdf commands convert sequencing system read and reference genome sequence data into the SDF format. The format command accepts source data in standard file formats (such as FASTA / FASTQ / SAM / BAM) and maintains the integrity and consistency of the source data during the conversion to SDF. Similarly, the cg2sdf command accepts data in the custom data format used for read data by Complete Genomics, Inc. Read data may be single-end or paired-end reads of fixed or variable length. Sequence data can be formatted as nucleotide or protein.

An SDF is a directory containing a series of files that delineate sequence and quality score information stored in a binary format, along with metadata that describes the original sequencing system data format type:

03/19/2010  12:31 PM    <DIR>          .
03/19/2010  12:31 PM    <DIR>          ..
03/19/2010  12:31 PM             5,038 log
03/19/2010  12:31 PM            24,223 mainIndex
03/19/2010  12:31 PM                75 namedata0
03/19/2010  12:31 PM                 8 nameIndex0
03/19/2010  12:31 PM                56 namepointer0
03/19/2010  12:31 PM        23,267,177 seqdata0
03/19/2010  12:31 PM                56 seqpointer0
03/19/2010  12:31 PM                 8 sequenceIndex0
             8 File(s)      23,296,641 bytes
             2 Dir(s)  400,984,870,912 bytes free

See also

For detailed information about formatting sequencing system reads to RTG SDF, refer to Data Formatting Commands

Read mapping and alignment¶

The map command implements read mapping and alignment of sequence data against a reference genome, supporting gapped alignments for both single and paired-end reads. The cgmap command performs the same function for the gapped, paired-end read data from Complete Genomics, Inc.

A summary of the mapping results is displayed at the command line following execution of the map command, as shown in the paired-end example below:

ARM MAPPINGS
    left    right     both
 6650124  6650124 13300248  64.2% mated uniquely (NH = 1)
  186812   186812   373624   1.8% mated ambiguously (NH > 1)
 1538777  1539520  3078297  14.9% unmated uniquely (NH = 1)
   70667    70125   140792   0.7% unmated ambiguously (NH > 1)
       0        0        0   0.0% unmapped due to read frequency (XC = B)
   13624    13946    27570   0.1% unmapped with no matings but too many hits (XC = C)
  109720   109765   219485   1.1% unmapped with poor matings (XC = d)
     984     1003     1987   0.0% unmapped with too many matings (XC = e)
  212158   211688   423846   2.0% unmapped with no matings and poor hits (XC = D)
       0        0        0   0.0% unmapped with no matings and too many good hits (XC = E)
 1569609  1569492  3139101  15.2% unmapped with no hits
10352475 10352475 20704950 100.0% total

The following display shows the summary output for single-end data from the map command:

READ MAPPINGS

 875007  87.5% mapped uniquely (NH = 1)
  25174   2.5% mapped ambiguously (NH > 1)
     71   0.0% unmapped due to read frequency (XC = B)
  88729   8.9% unmapped with too many hits (XC = C)
   8940   0.9% unmapped with poor hits (XC = D)
      0   0.0% unmapped with too many good hits (XC = E)
   2079   0.2% unmapped with no hits
1000000 100.0% total

Read mapping commands also produce HTML summary reports containing more information about mapping results.
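Because the command-line summary is plain text, its figures can be checked from a script. The following Python sketch is one possible way to read the counts back from a saved copy of the summary. The function names and the regular expression are illustrative assumptions derived only from the layout shown above; this is not an RTG-provided parser, and other RTG versions may format the summary differently.

```python
import re

# One or more integer counts, then a percentage, then the category label.
# Derived from the sample summaries above; treat the layout as an assumption.
SUMMARY_LINE = re.compile(r"^\s*((?:\d+\s+)+)(\d+(?:\.\d+)?)%\s+(.+?)\s*$")


def parse_mapping_summary(text):
    """Map each category label to (list of counts, percentage)."""
    rows = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            counts = [int(value) for value in match.group(1).split()]
            rows[match.group(3)] = (counts, float(match.group(2)))
    return rows


def categories_sum_to_total(rows):
    """Check that the last count of every category adds up to the 'total' row."""
    total = rows["total"][0][-1]
    parts = sum(
        counts[-1] for label, (counts, _) in rows.items() if label != "total"
    )
    return parts == total
```

For a paired-end summary, each list holds the left, right, and both counts in that order; for a single-end summary it holds one count.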

Read mapping output files¶

The map command creates alignment reports in BAM file format and a summary report file named summary.txt. There is also a file called progress that can be used to monitor overall progress during a run, and a file named map.log containing technical information that may be useful for debugging. Alignment reports may be filtered by alignment score, and/or unmapped, unmated, and ambiguous reads (those that map to multiple locations).

When mapping, the output BAM file is named alignments.bam. The reads that did not align to the reference will include XC attributes in the BAM file that describe why a read did not map.
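The XC codes can be tallied from alignment records to see why reads failed to map. The sketch below is a minimal Python example that works on lines of SAM text (for example, output written in SAM rather than BAM form). It assumes the attribute is written as a standard SAM optional field of the form `XC:<type>:<code>`; the function name is hypothetical.

```python
from collections import Counter


def count_unmapped_reasons(sam_lines):
    """Count XC codes found in the optional fields of SAM records."""
    reasons = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        for tag in fields[11:]:  # optional fields start at column 12
            if tag.startswith("XC:"):
                # Optional fields are TAG:TYPE:VALUE; keep only the value.
                reasons[tag.split(":", 2)[2]] += 1
    return reasons
```

The resulting counts can be compared against the corresponding XC lines of the mapping summary.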

See also

For more information about the RTG map command, refer to Command Reference, map.

For details on RTG extensions to the BAM file format, refer to SAM/BAM file extensions (RTG map command output)

Read mapping sensitivity tuning¶

The RTG map command uses default sensitivity settings that balance mapping percentage and speed requirements. These settings deliver excellent results in most cases, especially in human read sequence data from Illumina runs with error rates of 2% or less.

However, some experiments demand read mapping that accommodates higher machine error, genome mutation, or cross-species comparison. For these situations, the investigator can set various tuning parameters to increase the mapping percentage.

For reads shorter than 64 bp, RTG allows an investigator to select the number of substitutions and indels that the map command is guaranteed to detect. For example, using the -a parameter to set the number of allowed substitutions (i.e., mismatches) to 1 guarantees that the map command finds all alignments with up to 1 substitution; alignments with more substitutions may still be reported, but are not guaranteed.

For reads equal to or longer than 64 base pairs, RTG allows an investigator to modify word and step size parameters related to the index. These parameters are set by default to 18 or half the read length, whichever is smaller. Decreasing the values (using -w for word size and -s for step size) will increase the percentage of mapped reads at the expense of additional processing time, and in the case of step size, increased memory usage.
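The default rule stated above can be written as a one-line calculation. This Python sketch only restates that rule; the function name is hypothetical, and rounding half the read length down to a whole number is an assumption, so confirm the effective values in the map command's log output.

```python
def default_word_size(read_length):
    """Default per the rule above: 18 or half the read length, whichever is smaller."""
    if read_length < 1:
        raise ValueError("read_length must be positive")
    # Integer division is an assumption about how odd lengths are rounded.
    return min(18, read_length // 2)
```

Under this rule, any read of 36 bp or longer gets the value 18, so lowering -w or -s below 18 is what increases sensitivity for longer reads.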

The mismatch threshold can be altered to increase or decrease the number of mapped reads. Using the --max-mated-mismatches parameter, for example, an investigator might limit reported alignments to only those at or below the given threshold value.

See also

For more information about the RTG map command’s sensitivity and tuning parameters, refer to Command Reference, map

Protein search¶

The mapx command implements a search of translated nucleotide sequence data against one or more protein databases, with alignment sensitivity adjusted for gaps and mismatches. The mapx command accepts reads formatted as nucleotide data and a reference database formatted as protein data.

Similarly, the mapp command implements search of untranslated protein sequences against one or more protein databases. The input to mapp is FASTA formatted protein sequences.

With mapx and mapp, an investigator can sort and classify knowns, and identify homologs and novels.

In a two-step process, queries that have one or more exact matches of a k-mer against the database during the matching phase are then aligned to the subject sequence with a full edit-distance calculation using the BLOSUM62 scoring matrix.
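The matching phase of such a two-step search can be illustrated with a small k-mer index. The Python sketch below is a generic illustration of exact k-mer seeding, not RTG's implementation: the function names are hypothetical, and it omits the translation of nucleotide reads and the BLOSUM62 alignment phase entirely.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map every k-mer to the set of subject names that contain it."""
    index = defaultdict(set)
    for name, sequence in database.items():
        for start in range(len(sequence) - k + 1):
            index[sequence[start:start + k]].add(name)
    return index


def candidate_subjects(query, index, k):
    """Return the subjects sharing at least one exact k-mer with the query."""
    hits = set()
    for start in range(len(query) - k + 1):
        hits |= index.get(query[start:start + k], set())
    return hits
```

Only the subjects returned by `candidate_subjects` would go on to the more expensive alignment step, which is why a smaller k finds more candidates at the cost of more alignment work.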

The mapx and mapp commands output the statistical significance of matches based on semi-global alignments (globally across query). Reported search results may be modified by a combination of one or more thresholds on % identity, E value, bit score and alignment score. The output results file is similar in construct to that reported by BLASTX.

See also

For more information about the RTG protein mapping commands, refer to Command Reference, mapx and Command Reference, mapp.

Protein search output files¶

The mapx and mapp commands write search results and a summary file in a directory specified by the -o parameter at the command line. The summary file is named summary.txt. There is also a file called progress that can be used to monitor overall progress during a run, and a log file containing technical information that may be useful for debugging.

The protein search results are written to a file named alignments.tsv.gz. Each record in this results file, representing a valid search result, is written as tab-separated fields on a single line. The output fields are very similar to those reported by BLASTX.
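Because the results file is gzip-compressed, tab-separated text, it can be read with standard library tools. The following Python sketch shows one way to iterate over the records. It assumes that any header or comment lines begin with `#`; the function name is hypothetical, and the meaning of each column should be taken from the output file description rather than from this example.

```python
import csv
import gzip


def read_tsv_records(path):
    """Yield each data line of a gzip-compressed TSV file as a list of fields."""
    with gzip.open(path, "rt", newline="") as handle:
        # Assumption: header/comment lines start with '#'.
        data_lines = (line for line in handle if not line.startswith("#"))
        yield from csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
```

A typical use is to loop over `read_tsv_records("output_dir/alignments.tsv.gz")` and apply further thresholds in the script, where `output_dir` stands for the directory given with -o.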

See also

For detailed information about the RTG mapx and mapp command results file format refer to Mapx and mapp output file description

Protein search sensitivity tuning¶

The RTG mapx command builds a set of indexes from the translated reads and scans each query for matches according to user-specified sensitivity settings. Sensitivity is set with two parameters. The word size (-w or --word) parameter specifies match length. The mismatches (-a or --mismatches) parameter specifies the number and placement of k-mers across each translated query.

The alignment score threshold can be altered to increase or decrease the number of mapped reads. Using the --max-alignment-score parameter, for example, an investigator might limit reported alignments to only those at or below the given threshold value.

See also

For more information about the RTG mapx command’s sensitivity and tuning parameters, refer to Mapx and mapp output file description

Benchmarking and optimization utilities¶

RTG benchmarking and optimization utilities consist of simulators that generate read and reference genome sequence data, and evaluators that verify the accuracy of sequence search and data analysis functions. Investigators will use these utility commands to evaluate the use of RTG software in various read mapping and data analysis scenarios.

RTG provides several simulators:

genomesim – The genomesim command generates a reference genome with one or more segments of varying length and a percentage mix of nucleotide values. Use the command to create simulated genomes for benchmarking and evaluation.
readsim / cgsim – The readsim and cgsim commands generate synthetic read sequence data from an input reference genome, introducing errors at a specified rate. Use the commands to create simulated read sets for benchmarking and evaluation.
popsim, samplesim, childsim, samplereplay, denovosim – These variant simulation commands are used to create mutated genomes from a known reference by adding variants. Use these commands to verify accuracy of variant detection analysis software for a particular experiment using different pipeline settings.

Simulated data that is produced in SDF format can be converted into FASTA and FASTQ format sequence files for use with other tools using the sdf2fasta and sdf2fastq commands respectively.

See also

For more information about the RTG simulation commands, refer to Simulation Commands. Advice is available to ensure best results. Please contact RTG technical support for assistance.

Variant detection functions¶

The RTG variant detection pipeline includes commands for both sequence and structural variation detection: snp, family, population, somatic, tumoronly, cnv and coverage. The types of data available for analysis from the RTG software pipeline include: Bayesian sequence variant calling (snps.vcf), structural variation analysis (cnv.ratio) and alignment coverage depth (coverage.bed).

Sequence variation (SNPs, indels and complex variants)¶

The snp command uses a Bayesian probability model to identify and locate single and multiple nucleotide polymorphisms (SNPs and MNPs), indels, and complex sequence variants. The command uses standard BAM format files as input and reports computed posterior scores, base calls, mapping quality, coverage depth, and supporting statistics for all positions and for all variants. The snp command may be instructed to run in either haploid or diploid calling mode, and can perform sex-aware calling to automatically switch between haploid and diploid calling according to sex chromosomes specified for your reference species.

The snp command calls single nucleotide polymorphisms (SNPs), multiple nucleotide polymorphisms (MNPs), and complex regions from the sorted chromosome-ordered gapped alignment (BAM) files. The snp command makes consensus SNP and MNP calls on a diploid organism at every position (homozygous, heterozygous, and equal) in the reference, and calls indels and complex variants of 1-50 bp (depending on input alignments).

At each position in the reference, a base pair determination is made based on statistical analysis of the accumulated read alignments, in accordance with any priors and quality scores. The resulting predictions and accompanying statistics are reported in industry standard VCF format.

The snps.vcf output file reports all the called variants. The location and type of the call, the base pairs (reference and called), and a confidence score are present in the snps.vcf output file. Additional ancillary statistics in the output describe read alignment evidence that can be used to further evaluate confidence in the variant. Results may be filtered (post variant calling) by posterior scores, coverage depth, or indels, and filtered report results may be integrated with the SNP calls themselves.

See also

For more information about the SNP output data, and for the syntax, parameters, and usage of the map and snp commands, refer to Command Reference, map and Command Reference, snp.

Sequence variation with Mendelian pedigree¶

The family command uses Bayesian analysis and the constraints of Mendelian inheritance to identify single and multiple nucleotide variants in each member of a family group. It will usually yield a better result than running the snp command on each individual because the Mendelian constraints help eliminate erroneous calls.

Family calling is restricted to families comprising a mother, father, and one or more sons and daughters. Family members are identified on the command line by sample names matching those used in the input BAM files. The family caller internally assigns the SAM records to the correct family member based on SAM read group information. If available, it automatically makes use of coverage and quality calibration information computed during mapping. It automatically selects the correct haploid/diploid calling depending on the sex of each individual.

The output is a multi-sample VCF file containing a call for each family member whenever any one of the family differs from the reference genome. Each sample reports a computed posterior, base call, and ancillary statistics as per the snp command. In addition, there is an overall posterior representing the joint likelihood of the call across all the samples. As with the other variant detection commands, the VCF output includes a filter column containing markers for high-coverage, high-ambiguity, and equivalent calls. It is not guaranteed that the resulting calls will always be Mendelian across the entire family, as de novo mutations are also identified and are automatically annotated in the output VCF.
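The Mendelian constraint itself is simple to state for a diploid site: the child must carry one allele from each parent. The Python sketch below checks that condition on already-called genotypes, which is one way to flag candidate de novo sites in a script. It is an illustration only, with a hypothetical function name; it is not the Bayesian model used by the family command, and it ignores haploid calling on sex chromosomes.

```python
def is_mendelian_consistent(mother, father, child):
    """Diploid check: the child has one allele from each parent.

    Each genotype is a pair of alleles, for example ("A", "G").
    """
    first, second = child
    return (first in mother and second in father) or (
        first in father and second in mother
    )
```

A trio for which this returns False is either an erroneous call or a possible de novo mutation, which is why such sites deserve closer inspection.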

The population command extends calling to multiple samples, which may or may not be related according to a supplied pedigree. Mendelian constraints are employed where appropriate, and in cases where many unrelated samples are being called, an iterative expectation-maximization algorithm updates Bayesian priors to give improved accuracy compared to calling samples individually with the snp command.

Somatic sequence variation¶

The somatic command uses Bayesian analysis to identify putative cancer explanations in a tumor sample. As with the snp command, it can identify SNPs, MNPs, indels, and complex sequence variants. It operates on two samples, an original sample (assumed to be non-cancerous) and a derived cancerous sample. The derived sample may be a mixture of non-cancerous and cancerous sequence data. The samples are provided to the somatic command in the form of BAM format files with appropriate sample names selected via the read group mechanism.

The somatic caller produces a VCF file detailing putative cancer explanations consisting of computed posterior scores, base calls, and ancillary statistics for both input samples. The somatic caller handles both haploid and diploid sequences and is sex aware. If available, it automatically makes use of coverage and quality calibration information computed during mapping.

By default the snps.vcf output file gives each variant called where the original and derived sample differ, together with a confidence. The file is sorted by genomic position. The same statistics reported by the snp command per VCF record are listed for both samples. The filter column contains markers for situations of high-coverage, high-ambiguity, and equivalent calls. This column can be used to discard unwanted results in subsequent processing.
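Discarding unwanted results by the filter column can be done with a few lines of scripting. The Python sketch below keeps only records whose FILTER field (the seventh column in the VCF specification) is `PASS` or the missing value `.`. The function name is hypothetical, and the input is assumed to be uncompressed VCF text.

```python
def passing_records(vcf_lines):
    """Yield VCF data lines whose FILTER column is PASS or missing ('.')."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, the header line, and blanks
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 6 and fields[6] in ("PASS", "."):
            yield line
```

Records carrying any filter marker are dropped; to keep records with a particular marker, extend the tuple of accepted values.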

Coverage analysis¶

The coverage command reports read depth across a reference genome with smoothing options, and outputs the results in the industry standard BED format. This can be used to view histograms of mapped coverage data and gap length distributions.

Use the coverage command as a tool to analyze mapping results and determine how much of the genome is covered with mapping alignments, and how many times the same location has been mapped.

Customizable scripts are available for enabling graphical plotting of the coverage results using gnuplot.
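The two questions above (how much of the genome is covered, and how deeply) can also be answered numerically from BED intervals. The Python sketch below is a minimal example under stated assumptions: each line is tab-separated with the start and end in columns 2 and 3, and the depth value sits in the fifth column (`depth_column=4`, zero-based). Check the header of your own coverage output and adjust the column index if it differs; the function name is hypothetical.

```python
def coverage_summary(bed_lines, depth_column=4):
    """Return (covered_bases, total_bases, mean_depth) from BED intervals."""
    covered = total = 0
    depth_sum = 0.0
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blanks, comments, and BED track/browser lines
        fields = line.rstrip("\n").split("\t")
        length = int(fields[2]) - int(fields[1])  # BED end is exclusive
        depth = float(fields[depth_column])
        total += length
        depth_sum += depth * length
        if depth > 0:
            covered += length
    mean_depth = depth_sum / total if total else 0.0
    return covered, total, mean_depth
```

Dividing the covered bases by the total bases gives the breadth of coverage; the mean depth is weighted by interval length.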

See also

For more information about the RTG coverage analysis, refer to Command Reference, coverage.

Copy number variation (CNV) analysis¶

The cnv command identifies and reports copy number statistics that can be used for the investigation of structural variation.

It is used to identify regions of aberrant copy number in mapped read data. The RTG cnv command identifies and reports the copy number variation ratio between two genomes.

The results of CNV detection are output to a BED file format. Customizable scripts are available for enabling graphical plotting of the CNV results using tools such as gnuplot.

See also

For more information about the CNV output data, refer to Command Reference, cnv

Standard input and output file formats¶

RTG software produces alignment and data analysis results in standard formats to allow pipeline validation and downstream analysis.

Table: Result file formats for validation and downstream analysis

File type	Description and Usage
BAM, SAM	The RTG `map` and `cgmap` commands produce alignment results in the Binary Sequence Alignment/Map (BAM) format: `alignments.bam` or optionally the compressed ASCII (SAM format) equivalent `alignments.sam.gz`. This allows use of familiar pileup viewers for quick visual inspection of alignment results.
TXT	Many RTG commands output summary statistics as ASCII text files.
TSV	Many RTG commands output results in tab separated ASCII text files. These files can typically be loaded directly into a spreadsheet viewing program like Microsoft Excel or Open Office.
BED	Some RTG commands output results in standard BED formats for further analysis and reporting.
PED	Some RTG commands utilize standard PED format text files for supplying sample pedigree and sex information.
VCF	The `snp`, `family`, `population` and `somatic` commands output results in Variant Call Format (VCF) version 4.2.

See also

For more information about file format extensions, refer to Appendix RTG output results file descriptions

SAM/BAM files created by the RTG map command¶

The Sequence Alignment/Map (SAM/BAM) format (version 1.3) is a well-known standard for listing read alignments against reference sequences.

SAM records list the mapping and alignment data for each read, ordered by chromosome (or other DNA reference sequence) location.

A sample RTG SAM file is shown in the Appendix. It describes the relationship between a read and a reference sequence, including mismatches, insertions and deletions (indels) as determined by the RTG map aligner.

Note

RTG mapped alignments are stored in BAM format with RTG read IDs by default. This default can be overridden using the --read-names flag or changed after processing using the RTG samrename utility to label the reads with the sequence identifiers from the original source file. For more information, refer to the SAM 1.3 nomenclature and symbols online at: https://samtools.github.io/hts-specs/SAMv1.pdf

RTG has defined several extensions within the standard SAM/BAM format; be sure to review the SAM/BAM format information in SAM/BAM file extensions (RTG map command output) of the Appendix to this guide for a complete list of all extensions and differences.

By default the RTG map command produces output as compressed binary format BAM files but can be set to produce human readable SAM files instead with the --sam flag.

Variant caller output files¶

The Variant Call Format (VCF) is a widely used standard format for storing SNPs, MNPs and indels.

A sample snps.vcf file is provided in the Appendix as an example of the output produced by an RTG variant calling run. Each line in a snps.vcf output has tab-separated fields and represents a SNP variation calculated from the mapped reads against the reference genome.

Note

RTG variant calls are stored in VCF format (version 4.2). For more information about the VCF format, refer to the specification online at: https://samtools.github.io/hts-specs/VCFv4.2.pdf

RTG employs several extensions within the standard VCF format; be sure to review the VCF format information in Small-variant VCF output file description of the Appendix to this guide for a complete list of all extensions and differences.

See also

For more information about file formats, refer to the Appendix, RTG output results file descriptions

Metagenomic analysis functions¶

The RTG metagenomic analysis pipeline includes commands for sample contamination filtering, estimation of taxon abundances in a sample and finding relationships between samples.

Contamination filtering¶

The mapf command is used for filtering contaminant reads from a sample. It does this by performing alignment of the reads against a reference of known contaminants and producing an output of the reads that did not align successfully. A common use for this is to remove human DNA from a bacterial sample taken from a body site.

Taxon abundance breakdown¶

The species command is used to find the abundances of taxa within a given sample. This is accomplished by analyzing reference genome alignment data made with a metagenomic reference database of known organisms. It produces output in which each taxon is given a fraction representing its abundance in the sample with upper and lower bounds and a value indicating the confidence that the taxon is actually present in the sample. An HTML report allows interactive examination of the abundances at different taxonomic levels.

Sample relationships¶

The similarity command is used to find relationships between sample read sets. It does this by examining k-mer word frequencies and the intersections between sets of reads. This results in the output of a similarity matrix, a principal component analysis and nearest neighbor trees in the Newick and phyloXML formats.

Functional protein analysis¶

The mapx command is used to perform a translated nucleotide search of short reads against a reference protein database. This results in an output similar to that reported by BLASTX.

Pipelines¶

Included in the RTG release are some pipeline commands which perform simple end-to-end tasks using other RTG commands. These pipelines use mostly default settings for each of the commands called, and are meant as a guideline to building more complex end-to-end pipelines using our tools. The metagenomic pipeline commands are:

species composition (composition-meta-pipeline)
functional protein analysis (functional-meta-pipeline)
species composition and functional protein analysis (composition-functional-meta-pipeline).

For detailed information about individual pipeline commands see Pipeline Commands

Parallel processing¶

The comparison of genomic variation on a large scale in real time demands parallel processing capability. Parallel processing of gapped alignments and variant detection is recommended by RTG because it significantly reduces wall clock time.

RTG software includes key features that make it easier to prepare a job for parallel processing. First, RTG mapping commands can be performed on a subset of a large file or set of files, either by using the --start-read and --end-read parameters or, for commands that do not support these, by using sdfsplit to break a large SDF into smaller pieces. Second, the data analysis commands accept multiple alignment files as input on the command line. Third, many RTG commands take a --region or --bed-regions parameter to allow breaking up tasks into pieces across the reference genome.
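The first approach, splitting a read set by read range, can be sketched as follows. This Python example only builds the command lines for each chunk and does not run them. The --start-read and --end-read parameters are those named above; the -i and -t flags for the reads and reference SDFs, the `map_N` output directory names, and the treatment of --end-read as an exclusive upper bound are assumptions to confirm with `rtg map --help` before use.

```python
import math


def read_ranges(total_reads, chunks):
    """Split [0, total_reads) into at most `chunks` contiguous ranges."""
    if chunks < 1:
        raise ValueError("chunks must be at least 1")
    if total_reads <= 0:
        return []
    size = math.ceil(total_reads / chunks)
    return [
        (start, min(start + size, total_reads))
        for start in range(0, total_reads, size)
    ]


def map_commands(reads_sdf, reference_sdf, total_reads, chunks):
    """Build one `rtg map` argument list per read range (nothing is executed)."""
    return [
        [
            "rtg", "map",
            "-i", reads_sdf,        # assumed flag for the reads SDF
            "-t", reference_sdf,    # assumed flag for the reference SDF
            "-o", f"map_{index}",   # hypothetical output directory name
            "--start-read", str(start),
            "--end-read", str(end),  # assumed to be an exclusive bound
        ]
        for index, (start, end) in enumerate(read_ranges(total_reads, chunks))
    ]
```

Each argument list can then be handed to a cluster scheduler or to `subprocess.run`, and the resulting alignment files passed together to the data analysis commands.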

See also

See RTG Command Reference for command-specific details, Administration & Capacity Planning for detailed information about estimating the number of multi-core servers needed (capacity planning), and Parallel processing approach for a deeper discussion of compute cluster operations.

Installation and deployment¶

RTG is a self-contained tool that sets minimal expectations on the environment in which it is placed. It comes with the application components it needs to execute completely, yet performance can be enhanced with some simple modifications to the deployment configuration. This section provides guidelines for installing and creating an optimal configuration, starting from a typical recommended system.

The RTG software pipeline runs in a wide range of computing environments, from dual-core processor laptops to compute clusters with racks of dual processor quad core server nodes. However, internal human genome analysis benchmarks suggest the use of six server nodes of the configuration shown below.

Table: Recommended system requirements

Processor	Intel Core i7-2600
Memory	48 GB RAM DDR3
Disk	5 TB, 7200 RPM (prefer SAS disk)

RTG Software can be run as a Java JAR file, but platform specific wrapper scripts are supplied to provide improved pipeline ergonomics. Instructions for a quick start installation are provided here.

For further information about setting up per-machine configuration files, please see the README.txt contained in the distribution zip file (a copy is also included in this manual’s appendix).

Quick start instructions¶

These instructions are intended for an individual to install and operate the RTG software without the need to establish root / administrator privileges.

RTG software is delivered in a compressed zip file, such as: rtg-core-3.3.zip. Unzip this file to begin installation.

Linux and Windows distributions include a Java Virtual Machine (JVM) version 1.8 that has undergone quality assurance testing. RTG may be used on other operating systems for which a JVM version 1.8 or higher is available, such as MacOS X or Solaris, by using the ‘no-jre’ distribution.

RTG for Java is delivered as a Java application accessed via an executable wrapper script (rtg on UNIX systems, rtg.bat on Windows) that allows a user to customize initial memory allocation and other configuration options. It is recommended that these wrapper scripts be used rather than directly executing the Java JAR.

Here are platform-specific instructions for RTG deployment.

Linux/MacOS X:

Unzip the RTG distribution to the desired location.
In a terminal, cd to the installation directory and test for success by entering ./rtg version
On MacOS X, depending on your operating system version and configuration regarding unsigned applications, you may encounter the error message:
```
-bash: rtg: /usr/bin/env: bad interpreter: Operation not permitted
```
If this occurs, you must clear the OS X quarantine attribute with the command:
```
$ xattr -d com.apple.quarantine rtg
```
The first time rtg is executed you will be prompted with some questions to customize your installation. Follow the prompts.
Enter ./rtg help for a list of rtg commands. Help for any individual command is available using the --help flag, e.g.: ./rtg format --help
By default, RTG software scripts establish a memory space of 90% of the available RAM - this is automatically calculated. One may override this limit in the rtg.cfg settings file or on a per-run basis by supplying RTG_MEM as an environment variable or as the first program argument, e.g.: ./rtg RTG_MEM=48g map
[OPTIONAL] If you will be running RTG on multiple machines and would like to customize settings on a per-machine basis, copy rtg.cfg to /etc/rtg.cfg, editing per-machine settings appropriately (requires root privileges). An alternative that does not require root privileges is to copy rtg.cfg to rtg.HOSTNAME.cfg, editing per-machine settings appropriately, where HOSTNAME is the short host name output by the command hostname -s
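The non-root alternative in the optional step above can be sketched as follows. This is a minimal sketch that assumes it is run from the installation directory; cfg_name_for_host is a hypothetical helper introduced here for illustration, and the script only copies the file, leaving the per-machine edits to you.

```shell
#!/bin/sh
# Hypothetical helper: build the per-machine config file name
# for a given short host name.
cfg_name_for_host() {
  echo "rtg.$1.cfg"
}

# Assumes the current directory is the RTG installation directory;
# does nothing if rtg.cfg is not present.
if [ -f rtg.cfg ]; then
  target=$(cfg_name_for_host "$(hostname -s)")
  # Copy only if no per-machine file exists yet, so earlier edits are kept.
  [ -e "$target" ] || cp rtg.cfg "$target"
  echo "Edit per-machine settings in $target"
fi
```

Running the script on each machine that shares the installation directory gives every host its own settings file without requiring root privileges.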

Windows:

Unzip the RTG distribution to the desired location.
Test for success by entering rtg version at the command line. The first time RTG is executed you will be prompted with some questions to customize your installation. Follow the prompts.
Enter rtg help for a list of rtg commands. Help for any individual command is available using the --help flag, e.g.: rtg format --help
By default, RTG software scripts establish a memory space of 90% of the available RAM - this is automatically calculated. One may override this limit by setting the RTG_MEM variable in the rtg.bat script or as an environment variable.

See also

For more on data center deployment and instructions for editing scripts, see Administration & Capacity Planning.

Technical assistance and support¶

For assistance with any technical or conceptual issue that may arise during use of the RTG product, contact Real Time Genomics Technical Support via email at support@realtimegenomics.com

In addition, a discussion group is available at: https://groups.google.com/a/realtimegenomics.com/forum/#!forum/rtg-users

A low-traffic announcements-only group is available at: https://groups.google.com/a/realtimegenomics.com/forum/#!forum/rtg-announce
