EpiCSeg manual

Overview

EpiCSeg (Epigenome Count-based Segmentation) is a software for annotating the genome based on the state of the chromatin. It provides tools for extracting count data from BAM files, typlically corresponding to ChIP-seq experiments for histone marks (but other choices are possible) it learns a statistical model for the read counts based on a HMM, it annotates the genome, and it provides tools for displaying and analyzing the obtained models and segmentations. EpiCSeg can be used as an R package or from the command line via Rscript.

About this manual

There are three parts:

the installation part describes how to setup EpiCSeg
the subprograms part describes in detail how to use each subprogram
the usage examples part shows some basic examples to get started and discusses potential issues

For a more flexible usage try out the R functions.

Setting up EpiCSeg

EpiCSeg

Chromatin segmentation in R. For a full description and for citing us, see the following article on Genome Biology:

A Mammana, HR Chung (2015). Chromatin segmentation based on a probabilistic model for read counts explains a large portion of the epigenome. Genome Biology 2015, 16:151

For using EpiCSeg from the command line, you can find the full manual HERE!

Installation

epicseg needs R 3.2 (or newer) and depends on Bioconductor packages, CRAN packages, and another package from github. For the installation, most of the work is done by the function devtools::install_github. Because lately this function cannot resolve Bioconductor dependencies anymore (see this issue: https://github.com/hadley/devtools/issues/700), we will need to install some Bioconductor packages manually.

The Bioconductor dependencies are IRanges, GenomicRanges, bamsignals and edgeR. At the interactive R terminal, type:

source("http://bioconductor.org/biocLite.R")
biocLite(c("IRanges", "GenomicRanges", "bamsignals", "edgeR"))

Install and load the devtools package to be able to directly install R packages hosted on github :

install.packages("devtools")
library(devtools)

To install epicseg type:

install_github("lamortenera/kfoots")
install_github("lamortenera/epicseg")

Usage from the command line

To use EpiCSeg from the command line, you need to:

create a launcher to be used with Rscript. This is done by typing epicseg:::getLauncher("epicseg.R") at the R interactive terminal, which will create the file epicseg.R in your working directory. You can move and rename this file the way you want.
To use it, type Rscript epicseg.R subprogram arguments. In UNIX you can also simply do ./epicseg.R subprogram arguments provided that you have execution permission on the file .epicseg.R.
To see what the available subprograms are, simply type: Rscript epicseg.R
To see which arguments each subprogram needs, you can type: Rscript epicseg.R subprogram

Subprograms

getcounts

Produce a counts matrix from several bam files

Required arguments

--regions REGIONS, -r REGIONS

Path to the BED file with the genomic regions of interest. These regions will be automatically partitioned into smaller, consecutive bins. Only the first three fields in the file matter. If the region lengths are not multiples of the given binsize a new bed file will be produced where each coordinate is a multiple of binsize. Use this new file together with the count matrix for later analyses.

--mark LABEL:PATH, -m LABEL:PATH (repeatable)

Mark name and path to a bam file where to extract reads from. The bam files must be indexed and the chromosome names must match with those specified in the bed file. Entries with the same mark name will be treated as replicates and collapsed into one experiment. This option must be repeated for each mark, for example: -m H3K4me3:/path1/foo1.bam -m H3K36me3:/path2/foo2.bam

--target TARGET, -t TARGET

Full path to the count matrix that will be produced as output. To save the matrix as an R archive, provide a path that ends in .Rdata or .rda, otherwise it will be saved as a text file. Using R archives, writing the file will be a bit slower, but reading it will be considerably faster.

Optional arguments

--binsize BINSIZE, -b BINSIZE (default: 200)

Size of a bin in base pairs. Each given region will be partitioned into bins of this size.

--mapq MAPQ (repeatable) (default: 0)

Minimum mapping quality for the reads (see the bam format specification for the mapq field). Only reads with the mapq field above or equal to the specified value will be considered. Specify this option once for all marks, or specify it as many times as there are marks.

--pairedend TRUE_OR_FALSE, -p TRUE_OR_FALSE (repeatable) (default: FALSE)

Set this option to TRUE or FALSE to activate or deactivate the paired-end mode. Only read pairs where both ends are mapped will be considered and assigned to the bin where the midpoint of the read pair is located. Specify this option once for all marks, or specify it as many times as there are marks. If this flag is set, the shift option will be ignored.

--shift SHIFT, -s SHIFT (repeatable) (default: 75)

Shift the reads in the 5' direction by a fixed number of base pairs. The read will be assigned to the bin where the shifted 5' end of the read is located. Specify this option once for all marks, or specify it as many times as there are marks. This option will be ignored in paired-end mode.

--nthreads NTHREADS, -n NTHREADS (default: 1)

Number of threads to use.

normalizecounts

Normalize several count matrices

Required arguments

--counts COUNTS, -c COUNTS (repeatable)

Paths to the count matrices such as those produced by getcounts. All count matrices must have the same marks and the same number of bins

Optional arguments

--suffix SUFFIX, -s SUFFIX (default: _norm)

Each normalized count matrix will be written in a filename similar to that of the unnormalized matrix. A suffix will be added before the file extension, For example, input: path/to/counts.Rda, output: path/to/counts_norm.Rda. Set this option to change this behaviour. To overwrite the orignal matrices, give the argument -.

--nthreads NTHREADS, -n NTHREADS

Number of threads to use

segment

Produce a segmentation and a report

Required arguments

--counts COUNTS, -c COUNTS (repeatable)

Path to the count matrix. To train a model on multiple datasets, the count matrices must be compatible, i.e. with the same marks and on the same genomic regions. If this option is repeated the paths must be labelled, example: --counts dataset1:path1/counts.txt --counts dataset2:path2/counts.txt

--regions REGIONS, -r REGIONS

Path to the BED file with the genomic regions associated to the count matrix. Only the first three fields of each line will be considered. Typically each region comprises several bins (for example, one region per chromosome).

--nstates NSTATES, -n NSTATES

Number of states to use for training

Optional arguments

--model MODEL, -m MODEL

Path to the file with the parameters of the HMM. The model contains the initial parameters for learning. If --notrain is set, no learning will take place and all model parameters must be specified.

--notrain

The provided model will be used 'as-is' without training.

--collapseInitP TRUE_OR_FALSE

In case a model with multiple initial probabilities is provided, should those probabilities be averaged? If you are not sure about what this means, the default is fine (the default is FALSE in train mode and TRUE in predict mode)

--nthreads NTHREADS (default: 1)

Number of threads to be used

--split4speed SPLIT4SPEED, -s SPLIT4SPEED (default: FALSE)

Add artificial splits in the input regions to speed-up the algorithm. This makes sense when the number of regions is small compared to the number of threads. The artificial breaks are used only in the training phase, not in the computation of the final state assignments.

--maxiter MAXITER (default: 200)

Maximum number of iterations in train mode

--save_rdata

Save an R data archive with the results of the segmentation. To see what they are, inside R, type library(epicseg) and ?segment

--outdir OUTDIR, -o OUTDIR (default: .)

Path to the folder where all results will be saved. If the folder doesn't exist it will be created.

--prefix PREFIX, -p PREFIX (default: )

Prefix to append to all produced files. The prefix should not be a path (use the -o option for that).

--annot LABEL:PATH, -a LABEL:PATH (repeatable)

Genomic annotation to compare with the segmentation. It must be specified with a short title (no spaces) and a path to the bed file containing the annotation (the strand matters), separated by :. This option can be repeated, for example: --annot genes:/path1/genes.bed --annot cpg:/path2/cpgislands.bed.

report

Produce a report for a given segmentation

Required arguments

--segments SEGMENTS, -s SEGMENTS (repeatable)

Path to the bed file containing the segmentation. The name field in the bed file must be a number from 1 to the maximum number of states. More than one path can be provided if the bed files describe alternative segmentations of the same genomic regions and if the paths are labelled, i.e. of the form LABEL:PATH.

--model MODEL, -m MODEL

Path to the file with the parameters of the HMM. All fields must be present.

Optional arguments

--outdir OUTDIR, -o OUTDIR (default: .)

Path to the folder where all results will be saved. If the folder doesn't exist it will be created.

--prefix PREFIX, -p PREFIX (default: )

Prefix to append to all produced files. The prefix should not be a path (use the -o option for that).

--colors COLORS, -c COLORS

Path to a text file containing one color per chromatin state. Each color must be on a separate line, and they can be specified either as RGB colors R,G,B where 0 <= R,G,B < 256, or as R colors (like red, gold, green4…).

--labels LABELS, -l LABELS

Path to a text file containing one short text label per chromatin state. Each label must be on a separate line.

--annot LABEL:PATH, -a LABEL:PATH (repeatable)

Usage examples

Command line environment

In all examples we will assume that:

A UNIX system is used. Windows can also be used, but these examples need to be slightly adapted.
A launcher called epicseg.R has been created using the epicseg:::getLauncher function.
You have execution permissions for the launcher. If not, type at the command line:
```
chmod 700 epicseg.R
```
The launcher is in a folder listed in your PATH environment variable, if this is not the case you can add it by typing at the command line:
```
export PATH=/path/to/the/launcher/:$PATH
```
You have some input data to work with. You can reproduce these examples by setting the indir environment variable to the value returned by the R code system.file("extdata", package="epicseg")
You have an output folder ${outdir} where to store files created by EpiCSeg. This could be a new empty directory to make sure that you are not overwriting anything.

Typical workflow

Typically you just need to

create a count matrix with the getcounts subprogram
produce a segmentation with the segment subprogram
sometimes you might want to get a nicer output by using the report subprogram.

Creating the count matrix

The main input of EpiCSeg are read counts for a set of histone marks. More precisely, you need to select which experiments to use (in BAM format, already indexed), in which genomic regions to count the reads, and how large the bins should be.

Here we are going to use three BAM files, a bin size of 200 base pairs (the default) and the following BED file:

cat ${indir}/contigs.bed

## chr21    40005600    42955400
## chr21    43005600    48119800

This is done with the getcounts subprogram:

epicseg.R getcounts --mark H3K4me3:${indir}/H3K4me3.bam \
                    --mark H3K36me3:${indir}/H3K36me3.bam \
                    --mark H3K9me3:${indir}/H3K9me3.bam \
                    --regions ${indir}/contigs.bed --binsize 200 \
                    --target ${outdir}/counts.txt

Which creates a count matrix in text format:

head -n 5 ${outdir}/counts.txt

## H3K4me3  H3K36me3    H3K9me3
## 0    0   0
## 0    0   0
## 0    0   1
## 1    2   4

Note that the count matrix and the genomic regions are linked and they should always be used together.

In this case the regions could be easily splitted into bins of size 200 because each region had a width multiple of 200. If this is not the case, the regions are automatically refined, and a new BED file will be created, like in this case:

epicseg.R getcounts -m H3K4me3:${indir}/H3K4me3.bam \
                    -m H3K36me3:${indir}/H3K36me3.bam \
                    -m H3K9me3:${indir}/H3K9me3.bam \
                    -r ${indir}/contigs.bed --binsize 220 \
                    -t ${outdir}/counts.txt

This creates the files ${outdir}/counts.txt (which, in this example, got overwritten by the second call to getcounts), plus the BED file ${outdir}/counts_refined_regions.bed:

cat ${outdir}/counts_refined_regions.bed

## chr21    40005680    42955220
## chr21    43005600    48119720

Note that:

the refined regions are always contained in the original ones
start and end coordinate of each region are multiples of the chosen binsize
the count matrix refers to the refined regions (if it was necessary to refine them).
if you computed the count matrix and your regions by means other than EpiCseg, that's totally fine, as long as the regions are a valid BED file, the count matrix is tab separated and each column is labelled.

Fitting the model and producing the segmentation

Training EpiCSeg's model, inferring the state for each bin and producing an HTML report is done all at once by the segment subprogram. You just need to specify your regions (the refined ones, if it was necessary to refine them), your counts and the number of states that you want.

epicseg.R segment --counts ${outdir}/counts.txt \
                  --regions ${outdir}/counts_refined_regions.bed \
                  --nstates 5 --outdir ${outdir} --prefix myFirstSegmentation_

This creates a bunch of files with filenames starting with ${outdir}/myFirstSegmentation_. Among these the most important are:

the file ${outdir}/myFirstSegmentation_lmeans.png shows the average levels of histone marks for each state.
the file ${outdir}/myFirstSegmentation_model.txt contains the parameter of the obtained model. This is useful if you want to use the learned model on other data or if you want to refine your output files with the report subprogram.
the file ${outdir}/myFirstSegmentation_segmentation.bed is the segmentation in bed format.
the file ${outdir}/myFirstSegmentation_report.html is a web page showing all plots and results.

Producing a better report

Suppose that you are not happy with the colors automatically assigned by EpiCSeg, or you want to assign labels to each state, or you want to see where the states tend to localize with respect to annotated genes. You can use the report subprogram for that.

We need to specify the model and the segmentation (the BED file) produced with segment. Here we specify colors and labels via simple text files and a gene annotation via a BED file.

printf "red\ngold\ngreen4\ngray\nblue\n" > ${outdir}/myColors.txt
printf "promoter\nactive\ntranscribed\ninactive\nrepressed\n" > ${outdir}/myLabels.txt

epicseg.R report  --model ${outdir}/myFirstSegmentation_model.txt \
                  --segments ${outdir}/myFirstSegmentation_segmentation.bed \
                  --annot genes:${indir}/genes.bed \
                  --colors ${outdir}/myColors.txt \
                  --labels ${outdir}/myLabels.txt \
                  --outdir ${outdir} --prefix nicerReport_

Now our heatmap looks better:

And we can see how states relate to genes (this was also possible by setting the --annot option directly in the segment subprogram):

Non-basic usage

Learn states from different datasets

Sometimes you have two or more sets of experiments that you want to use for chromatin segmentation, where each set comes from a different individual or cell type, so that you can look at differences between chromatin states from one individual to another.

To do this in EpiCSeg, you need a slightly different workflow:

use the getcounts subprogram for each dataset/individuals/cell-types separately
make the datasets comparable with the normalizecounts subprogram
use all count matrices with segment

Here we make two datasets by changing the association between labels and file names. Of course, this doesn't make sense and it's only to get some data for our example.

epicseg.R getcounts -m H3K4me3:${indir}/H3K4me3.bam \
                    -m H3K36me3:${indir}/H3K36me3.bam \
                    -m H3K9me3:${indir}/H3K9me3.bam \
                    -r ${indir}/contigs.bed \
                    -t ${outdir}/sample1.txt

epicseg.R getcounts -m H3K4me3:${indir}/H3K36me3.bam \
                    -m H3K36me3:${indir}/H3K9me3.bam \
                    -m H3K9me3:${indir}/H3K4me3.bam \
                    -r ${indir}/contigs.bed \
                    -t ${outdir}/sample2.txt

Here we make the two count matrices sample1.txt and sample2.txt comparable.

epicseg.R normalizecounts -c ${outdir}/sample1.txt -c ${outdir}/sample2.txt

This created the normalized count matrices sample1_norm.txt and sample2_norm.txt. Now we use them for segmentation:

epicseg.R segment -c sample1:${outdir}/sample1_norm.txt \
                  -c sample2:${outdir}/sample2_norm.txt \
                  -r ${indir}/contigs.bed \
                  -n 5 --outdir ${outdir} --prefix comparison_

Now you can, for instance, open the two BED files comparison_segmentation_sample1.bed and comparison_segmentation_sample2.bed in the genome browser and look at differences, or do more advanced analyses in R.

Aggregating replicate experiments

Sometimes you have replicates for a particular ChIP-seq experiment and you want to take some sort of average in order to improve the quality of the experiment. This is handled by the getcounts subprogram by providing the same label more than once.

Here we are treating the H3K36me3 and H3K9me3 samples as replicates, which does not make sense and it's only to get some data for our example.

epicseg.R getcounts -m H3K4me3:${indir}/H3K4me3.bam \
                    -m H3K36me3:${indir}/H3K36me3.bam \
                    -m H3K36me3:${indir}/H3K9me3.bam \
                    -r ${indir}/contigs.bed \
                    -t ${outdir}/aggregatedcounts.txt

By providing the label H3K36me3 twice we averaged two ChIP-seq experiments. The resulting count matrix looks like this:

head -n 5 ${outdir}/aggregatedcounts.txt

## H3K4me3  H3K36me3
## 0    0
## 0    0
## 0    0
## 1    3

Known issues

Outliers

Bins with abnormally high read counts compared to the genome average, especially when they appear at the beginning of a chromosome, are known to cause underflow errors in the forward-backward algorithm. Unfortunately there is no obvious solution to that. These bins should be avoided, by filtering out problematic regions or by correcting the read counts by some other means.

This can happen, for instance, when the mithocondrial genome is included in the set of analyzed regions. Read counts in the mithocondrial genome tend to be orders of magnitude higher than in the rest of the genome, so it is not appropriate to include them in the input data (moreover, mithocondria don't have chromatin…)

Did you encounter any other issue? File an issue in the github repository!!!