Appendix ======== .. only:: core or extra .. include:: appendix_alignment.inc.rst RTG reference file format -------------------------- Many RTG commands can make use of additional information about the structure of a reference genome, such as expected ploidy, sex chromosomes, location of PAR regions, etc. When appropriate, this information may be stored inside a reference genome's SDF directory in a file called ``reference.txt``. The ``format`` command will automatically identify several common reference genomes during formatting and will create a ``reference.txt`` in the resulting SDF. However, for non-human reference genomes, or less common human reference genomes, a pre-built reference configuration file may not be available, and will need to be manually provided in order to make use of RTG sex-aware pipeline features. Several example ``reference.txt`` files for different human reference versions are included as part of the RTG distribution in the ``scripts`` subdirectory, so for common reference versions it will suffice to copy the appropriate example file into the formatted reference SDF with the name ``reference.txt``, or use one of these example files as the basis for your specific reference genome. To see how a reference text file will be interpreted by the chromosomes in an SDF for a given sex you can use the ``sdfstats`` command with the ``--sex`` flag. For example: .. code-block:: text $ rtg sdfstats --sex male /data/human/ref/hg19 Location : /data/human/ref/hg19 Parameters : format -o /data/human/ref/hg19 -I chromosomes.txt SDF Version : 11 Type : DNA Source : UNKNOWN Paired arm : UNKNOWN SDF-ID : b6318de1-8107-4b11-bdd9-fb8b6b34c5d0 Number of sequences : 25 Maximum length : 249250621 Minimum length : 16571 Sequence names : yes N : 234350281 A : 844868045 C : 585017944 G : 585360436 T : 846097277 Total residues : 3095693983 Residue qualities : no Sequences for sex=MALE: chrM POLYPLOID circular 16571 chr1 DIPLOID linear 249250621 chr2 DIPLOID linear 243199373 chr3 DIPLOID linear 198022430 chr4 DIPLOID linear 191154276 chr5 DIPLOID linear 180915260 chr6 DIPLOID linear 171115067 chr7 DIPLOID linear 159138663 chr8 DIPLOID linear 146364022 chr9 DIPLOID linear 141213431 chr10 DIPLOID linear 135534747 chr11 DIPLOID linear 135006516 chr12 DIPLOID linear 133851895 chr13 DIPLOID linear 115169878 chr14 DIPLOID linear 107349540 chr15 DIPLOID linear 102531392 chr16 DIPLOID linear 90354753 chr17 DIPLOID linear 81195210 chr18 DIPLOID linear 78077248 chr19 DIPLOID linear 59128983 chr20 DIPLOID linear 63025520 chr21 DIPLOID linear 48129895 chr22 DIPLOID linear 51304566 chrX HAPLOID linear 155270560 ~=chrY chrX:60001-2699520 chrY:10001-2649520 chrX:154931044-155260560 chrY:59034050-59363566 chrY HAPLOID linear 59373566 ~=chrX chrX:60001-2699520 chrY:10001-2649520 chrX:154931044-155260560 chrY:59034050-59363566 The reference file is primarily intended for XY sex determination but should be able to handle ZW and X0 sex determination also. The following describes the reference file text format in more detail. The file contains lines with TAB separated fields describing the properties of the chromosomes. Comments within the ``reference.txt`` file are preceded by the character ``#``. The first line of the file that is not a comment or blank must be the version line. .. code-block:: text version1 The remaining lines have the following common structure: .. code-block:: text ... The sex field is one of ``male``, ``female`` or ``either``. The line-type field is one of ``def`` for default sequence settings, ``seq`` for specific chromosomal sequence settings and ``dup`` for defining pseudo-autosomal regions. The *line-setting* fields are a variable number of fields based on the line type given. The default sequence settings line can only be specified with ``either`` for the sex field, can only be specified once and must be specified if there are not individual chromosome settings for all chromosomes and other contigs. It is specified with the following structure: .. code-block:: text either def The *ploidy* field is one of ``haploid``, ``diploid``, ``triploid``, ``tetraploid``, ``pentaploid``, ``hexaploid``, ``polyploid`` or ``none``. The *shape* field is one of ``circular`` or ``linear``. The specific chromosome settings lines are similar to the default chromosome settings lines. All the sex field options can be used, however for any one chromosome you can only specify a single line for ``either`` or two lines for ``male`` and ``female``. They are specified with the following structure: .. code-block:: text seq [allosome] The *ploidy* and *shape* fields are the same as for the default chromosome settings line. The *chromosome-name* field is the name of the chromosome to which the line applies. The *allosome* field is optional and is used to specify the allosome pair of a haploid chromosome. The pseudo-autosomal region settings line can be set with any of the *sex* field options and any number of the lines can be defined as necessary. It has the following format: .. code-block:: text dup The regions must be taken from two haploid chromosomes for a given sex, have the same length and not go past the end of the chromosome. The regions are given in the format ``:-`` where start and end are positions counting from one and the end is non-inclusive. An example for the HG19 human reference: .. code-block:: text # Reference specification for hg19, see # http://genome.ucsc.edu/cgi-bin/hgTracks?hgsid=184117983&chromInfoPage= version 1 # Unless otherwise specified, assume diploid linear. Well-formed # chromosomes should be explicitly listed separately so this # applies primarily to unplaced contigs and decoy sequences either def diploid linear # List the autosomal chromosomes explicitly. These are used to help # determine "normal" coverage levels during mapping and variant calling either seq chr1 diploid linear either seq chr2 diploid linear either seq chr3 diploid linear either seq chr4 diploid linear either seq chr5 diploid linear either seq chr6 diploid linear either seq chr7 diploid linear either seq chr8 diploid linear either seq chr9 diploid linear either seq chr10 diploid linear either seq chr11 diploid linear either seq chr12 diploid linear either seq chr13 diploid linear either seq chr14 diploid linear either seq chr15 diploid linear either seq chr16 diploid linear either seq chr17 diploid linear either seq chr18 diploid linear either seq chr19 diploid linear either seq chr20 diploid linear either seq chr21 diploid linear either seq chr22 diploid linear # Define how the male and female get the X and Y chromosomes male seq chrX haploid linear chrY male seq chrY haploid linear chrX female seq chrX diploid linear female seq chrY none linear #PAR1 pseudoautosomal region male dup chrX:60001-2699520 chrY:10001-2649520 #PAR2 pseudoautosomal region male dup chrX:154931044-155260560 chrY:59034050-59363566 # And the mitochondria either seq chrM polyploid circular As of the current version of the RTG software the following are the effects of various settings in the ``reference.txt`` file when processing a sample with the matching sex. A ploidy setting of ``none`` will prevent reads from mapping to that chromosome and any variant calling from being done in that chromosome. A ploidy setting of ``diploid``, ``haploid`` or ``polyploid`` does not currently affect the output of mapping. A ploidy setting of ``diploid`` will treat the chromosome as having two distinct copies during variant calling, meaning that both homozygous and heterozygous diploid genotypes may be called for the chromosome. A ploidy setting of ``haploid`` will treat the chromosome as having one copy during variant calling, meaning that only haploid genotypes will be called for the chromosome. A ploidy setting of ``polyploid`` will treat the chromosome as having one copy during variant calling, meaning that only haploid genotypes will be called for the chromosome. For variant calling with a pedigree, maternal inheritance is assumed for polyploid sequences. The shape of the chromosome does not currently affect the output of mapping or variant calling. The allosome pairs do not currently affect the output of mapping or variant calling (but are used by simulated data generation commands). The pseudo-autosomal regions will cause the second half of the region pair to be skipped during mapping. During variant calling the first half of the region pair will be called as diploid and the second half will not have calls made for it. For the example ``reference.txt`` provided earlier this means that when mapping a male the X chromosome sections of the pseudo-autosomal regions will be mapped to exclusively and for variant calling the X chromosome sections will be called as diploid while the Y chromosome sections will be skipped. There may be some edge effects up to a read length either side of a pseudo-autosomal region boundary. .. only:: core or extra .. include:: appendix_taxonomy.inc.rst Pedigree PED input file format ------------------------------- The PED file format is a white space (tab or space) delimited ASCII file. Lines starting with ``#`` are ignored. It has exactly six required columns in the following order. +-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Column | Definition | +===================+===================================================================================================================================================================+ | *Family ID* | Alphanumeric ID of a family group. This field is ignored by RTG commands. | +-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | *Individual ID* | Alphanumeric ID of an individual. This corresponds to the Sample ID specified in the read group of the individual (``SM`` field). | +-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | *Paternal ID* | Alphanumeric ID of the paternal parent for the individual. This corresponds to the Sample ID specified in the read group of the paternal parent (``SM`` field). | +-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | *Maternal ID* | Alphanumeric ID of the maternal parent for the individual. This corresponds to the Sample ID specified in the read group of the maternal parent (``SM`` field). | +-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | *Sex* | The sex of the individual specified as using 1 for male, 2 for female and any other number as unknown. | +-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | *Phenotype* | The phenotype of the individual specified using -9 or 0 for unknown, 1 for unaffected and 2 for affected. | +-------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+ .. note:: The PED format is based on the PED format defined by the PLINK project: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped The value '0' can be used as a missing value for Family ID, Paternal ID and Maternal ID. The following is an example of what a PED file may look like. .. code-block:: text # PED format pedigree # fam-id ind-id pat-id mat-id sex phen FAM01 NA19238 0 0 2 0 FAM01 NA19239 0 0 1 0 FAM01 NA19240 NA19239 NA19238 2 0 0 NA12878 0 0 2 0 When specifying a pedigree for the ``lineage`` command, use either the pat-id or mat-id as appropriate to the gender of the sample cell lineage. The following is an example of what a cell lineage PED file may look like. .. code-block:: text # PED format pedigree # fam-id ind-id pat-id mat-id sex phen LIN BASE 0 0 2 0 LIN GENA 0 BASE 2 0 LIN GENB 0 BASE 2 0 LIN GENA-A 0 GENA 2 0 RTG includes commands such as ``pedfilter`` and ``pedstats`` for simple viewing, filtering and conversion of pedigree files. .. include:: appendix/genetic_map_dir.inc.rst RTG commands using indexed input files -------------------------------------- Several RTG commands require coordinate indexed input files to operate and several more require them when the ``--region`` or ``--bed-regions`` parameter is used. The index files used are standard tabix or BAM index files. The RTG commands which produce the inputs used by these commands will by default produce them with appropriate index files. To produce indexes for files from third party sources or RTG command output where the ``--no-index`` or ``--no-gzip`` parameters were set, use the RTG ``bgzip`` and ``index`` commands. .. only:: core or extra .. include:: appendix_output_formats.inc.rst RTG JavaScript filtering API ---------------------------- The ``vcffilter`` command permits filtering VCF records via user-supplied JavaScript expressions or scripts containing JavaScript functions that operate on VCF records. The JavaScript environment has an API provided that enables convenient access to components of a VCF record in order to satisfy common use cases. VCF record field access ~~~~~~~~~~~~~~~~~~~~~~~ This section describes the supported methods to access components of an individual VCF record. In the following descriptions, assume the input VCF contains the following excerpt (the full header has been omitted): .. code-block:: text #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12877 NA12878 1 11259340 . G C,T . PASS DP=795;DPR=0.581;ABC=4.5 GT:DP 1/2:65 1/0:15 CHROM, POS, ID, REF, QUAL ````````````````````````` Within the context of a ``--keep-expr`` or ``record`` function these variables will provide access to the String representation of the VCF column of the same name. .. code-block:: javascript CHROM; // "1" POS; // "11259340" ID; // "." REF; // "G" QUAL; // "." ALT, FILTER ``````````` Will retrieve an array of the values in the column. .. code-block:: javascript ALT; // ["C", "T"] FILTER; // ["PASS"] INFO.{INFO\_FIELD} `````````````````` The values in the ``INFO`` field are accessible through properties on the ``INFO`` object indexed by ``INFO ID``. These properties will be the string representation of info values with multiple values delimited with "``,``". Missing fields will be represented by "``.``". Assigning to these properties will update the VCF record. This will be undefined for fields not declared in the header. .. code-block:: javascript INFO.DP; // "795" INFO.ABC; // "4,5" {SAMPLE\_NAME}.{FORMAT\_FIELD} `````````````````````````````` The JavaScript String prototype has been extended to allow access to the format fields for each sample. The string representation of values in the sample column are accessible as properties on the string matching the sample name named after the ``FORMAT`` field ``ID``. .. code-block:: javascript 'NA12877'.GT; // "1/2" 'NA12878'.GT; // "1/0" Note that these properties are only defined for fields that are declared in the VCF header (any that are not declared in the header will be undefined). See below for how to add new ``INFO`` or ``FORMAT`` field declarations. VCF record modification ~~~~~~~~~~~~~~~~~~~~~~~ Most components of VCF records can be written or updated in a fairly natural manner by direct assignment in order to make modifications. For example: .. code-block:: javascript CHROM = "chr1"; // Will change the CHROM value POS = 42; // Will change the POS value ID = "rs23987382"; // Will change the ID value QUAL = "50"; // Will change the QUAL value FILTER = "FAIL"; // Will set the FILTER value INFO.DPR = "0.01"; // Will change the value of the DPR info field 'NA12877'.DP = "10"; // Will change the DP field of the NA12877 sample Other components of the VCF record (such as ``REF``, and ``ALT``) are considered immutable and can not currently be altered. .. note:: Modification of ``CHROM`` and/or ``POS`` can lead to a VCF file which is incorrectly sorted and this will not necessarily be detected or reported until the resulting VCF file is used with another module or tool. Depending on the new value assigned to ``CHROM`` it may also be necessary to modify the sequence dictionary in the VCF header to reflect the change (see :ref:`VCF header modification`). Direct assignment to ``ID`` and ``FILTER`` fields accept either a string containing semicolon separated values, or a list of values. For example: .. code-block:: javascript ID = 'rs23987382;COSM3805'; ID = ['rs23987382', 'COSM3805']; FILTER = 'BAZ;BANG'; FILTER = ['BAZ', 'BANG']; Note that although the ``FILTER`` field returns an array when read, any changes made to this array directly are not reflected back into the VCF record. Adding a filter to existing filters is a common operation and can be accomplished by the above assignment methods, for example by adding a value to the existing list and then setting the result: .. code-block:: javascript var f = FILTER; f.push('BOING'); FILTER = f; However, since this is a little unwieldy, a convenience function called `add()` can be used (and may be chained): .. code-block:: javascript FILTER.add('BOING'); FILTER.add(['BOING', 'DOING'); FILTER.add('BOING').add('DOING'); VCF header modification ~~~~~~~~~~~~~~~~~~~~~~~ Functions are provided that allow the addition of new ``FILTER``, ``INFO`` and ``FORMAT`` fields to the header and records. It is recommended that the following functions only be used within the run-once portion of ``--javascript``. ensureFormatHeader({FORMAT\_HEADER\_STRING}) ```````````````````````````````````````````` Add a new ``FORMAT`` field to the VCF if it is not already present. This will add a ``FORMAT`` declaration line to the header and define the corresponding accessor methods for use in record processing. .. code-block:: javascript ensureFormatHeader('##FORMAT='); ensureInfoHeader({INFO\_HEADER\_STRING}) ```````````````````````````````````````` Add a new ``INFO`` field to the VCF if it is not already present. This will add an ``INFO`` declaration line to the header and define the corresponding accessor methods for use in record processing. .. code-block:: javascript ensureInfoHeader('##INFO='); ensureFilterHeader({FILTER\_HEADER\_STRING}) ```````````````````````````````````````````` Add a new ``FILTER`` field to the VCF header if it is not already present. This will add an ``FILTER`` declaration line to the header. .. code-block:: javascript ensureFilterHeader('##INFO='); Testing for overlap with genomic regions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ One common use case is to test whether any given VCF record overlaps or is contained within a set of genomic regions. Regions may be loaded from either an external BED file or VCF file in the run-once portion of the JavaScript by either of the following: Regions.fromBed({FILENAME}) ``````````````````````````` Load the specified BED file into a `regions` object. For example: .. code-block:: javascript var myregions = Regions.fromBed('/path/to/regions.bed'); Regions.fromVcf({FILENAME}) ``````````````````````````` Load the specified VCF file into a `regions` object. For example: .. code-block:: javascript var myregions = Regions.fromVcf('/path/to/regions.vcf'); Having loaded a set of genomic regions, this can be used to test for region overlaps using the following methods: {REGIONS\_OBJECT}.overlaps({CHROM}, {START}, {END}) ``````````````````````````````````````````````````` Return true if the loaded regions overlap the specified interval. This function is typically used within the ``record`` function to test the coordinates of the current VCF record, e.g.: .. code-block:: javascript function record() { if (myregions.overlaps(CHROM, POS, POS + REF.length)) { // do something if the record overlaps any region } } {REGIONS\_OBJECT}.encloses({CHROM}, {START}, {END}) ``````````````````````````````````````````````````` Return true if the loaded regions entirely encloses the supplied interval. This function is typically used within the ``record`` function to test the coordinates of the current VCF record, e.g.: .. code-block:: javascript function record() { if (myregions.encloses(CHROM, POS, POS + REF.length)) { // do something if the record is fully enclosed by any region } } Additional information and functions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ SAMPLES ``````` This variable contains an array of the sample names in the VCF header. .. code-block:: javascript SAMPLES; // ['NA12877', 'NA12878'] print({STRING}) ``````````````` Write the provided string to standard output. .. code-block:: javascript print('The samples are: ' + SAMPLES); error({STRING}) ``````````````` Write the provided string to standard error. .. code-block:: javascript error('The samples are: ' + SAMPLES); checkMinVersion(RTG_MINIMUM_VERSION) ```````````````````````````````````` Check the version of RTG that the script is running under, and exits with an error message if the version of RTG does not meet the minimum required version. This is useful when distributing scripts that make use of features that are not present in earlier versions of RTG. .. code-block:: javascript checkMinVersion('3.9.2'); .. seealso:: For javascript filtering usage and examples see :ref:`vcffilter` .. only:: core or extra .. include:: appendix_parallel_processing.inc.rst Distribution Contents --------------------- The contents of the RTG distribution zip file should include: - The RTG executable JAR file. - RTG executable wrapper script. - Example scripts and files. - This operations manual. - A release notes file and a readme file. Some distributions also include an appropriate java runtime environment (JRE) for your operating system. README.txt ---------- For reference purposes, a copy of the distribution ``README.txt`` file follows: .. literalinclude:: include-README.txt :language: text Notice ------ Real Time Genomics does not assume any liability arising out of the application or use of any software described herein. Further, Real Time Genomics does not convey any license under its patent, trademark, copyright, or common-law rights nor the similar rights of others. Real Time Genomics reserves the right to make any changes in any processes, products, or parts thereof, described herein, without notice. While every effort has been made to make this guide as complete and accurate as possible as of the publication date, no warranty of fitness is implied. © 2017 Real Time Genomics All rights reserved. Illumina, Solexa, Complete Genomics, Ion Torrent, Roche, ABI, Life Technologies, and PacBio are registered trademarks and all other brands referenced in this document are the property of their respective owners.