EpiCSeg (Epigenome Count-based Segmentation) is a software for annotating the genome based on the state of the chromatin. It provides tools for extracting count data from BAM files, typlically corresponding to ChIP-seq experiments for histone marks (but other choices are possible) it learns a statistical model for the read counts based on a HMM, it annotates the genome, and it provides tools for displaying and analyzing the obtained models and segmentations. EpiCSeg can be used as an R package or from the command line via Rscript.
There are three parts:
For a more flexible usage try out the R functions.
Chromatin segmentation in R. For a full description and for citing us, see the following article on Genome Biology:
A Mammana, HR Chung (2015). Chromatin segmentation based on a probabilistic model for read counts explains a large portion of the epigenome. Genome Biology 2015, 16:151
For using EpiCSeg from the command line, you can find the full manual HERE!
epicseg
needs R 3.2 (or newer) and depends on Bioconductor packages, CRAN packages, and another package from github.
For the installation, most of the work is done by the function devtools::install_github
. Because lately this function cannot resolve Bioconductor dependencies anymore (see this issue: https://github.com/hadley/devtools/issues/700), we will need to install some Bioconductor packages manually.
The Bioconductor dependencies are IRanges
, GenomicRanges
, bamsignals
and edgeR
. At the interactive R terminal, type:
source("http://bioconductor.org/biocLite.R")
biocLite(c("IRanges", "GenomicRanges", "bamsignals", "edgeR"))
Install and load the devtools
package to be able to directly install R packages hosted on github :
install.packages("devtools")
library(devtools)
To install epicseg
type:
install_github("lamortenera/kfoots")
install_github("lamortenera/epicseg")
To use EpiCSeg from the command line, you need to:
epicseg:::getLauncher("epicseg.R")
at the R interactive
terminal, which will create the file epicseg.R
in your working directory.
You can move and rename this file the way you want. Rscript epicseg.R subprogram arguments
.
In UNIX you can also simply do ./epicseg.R subprogram arguments
provided that
you have execution permission on the file .epicseg.R
. Rscript epicseg.R
Rscript epicseg.R subprogram
Path to the BED file with the genomic regions of interest. These regions will be automatically partitioned into smaller, consecutive bins. Only the first three fields in the file matter. If the region lengths are not multiples of the given binsize a new bed file will be produced where each coordinate is a multiple of binsize. Use this new file together with the count matrix for later analyses.
Mark name and path to a bam file where to extract reads from.
The bam files must be indexed and the chromosome names must match with
those specified in the bed file. Entries with the same mark name will
be treated as replicates and collapsed into one experiment.
This option must be repeated for each mark, for example:
-m H3K4me3:/path1/foo1.bam -m H3K36me3:/path2/foo2.bam
Full path to the count matrix that will be produced as output.
To save the matrix as an R archive, provide a path that ends
in .Rdata
or .rda
, otherwise it will be saved as a text file.
Using R archives, writing the file will be a bit slower,
but reading it will be considerably faster.
Size of a bin in base pairs. Each given region will be partitioned into bins of this size.
Minimum mapping quality for the reads (see the bam format specification for the mapq field). Only reads with the mapq field above or equal to the specified value will be considered. Specify this option once for all marks, or specify it as many times as there are marks.
Set this option to TRUE or FALSE to activate or deactivate
the paired-end mode. Only read pairs where both ends
are mapped will be considered and assigned to the bin where the
midpoint of the read pair is located. Specify this option
once for all marks, or specify it as many times as there are marks.
If this flag is set, the shift
option will be ignored.
Shift the reads in the 5' direction by a fixed number of base pairs. The read will be assigned to the bin where the shifted 5' end of the read is located. Specify this option once for all marks, or specify it as many times as there are marks. This option will be ignored in paired-end mode.
Number of threads to use.
Paths to the count matrices such as those produced by getcounts
.
All count matrices must have the same marks and the same number of bins
Each normalized count matrix will be written in a filename
similar to that of the unnormalized matrix. A suffix will be added
before the file extension, For example,
input: path/to/counts.Rda
, output: path/to/counts_norm.Rda
.
Set this option to change this behaviour. To overwrite the orignal matrices,
give the argument -
.
Number of threads to use
Path to the count matrix. To train a model on multiple datasets,
the count matrices must be compatible, i.e. with the same marks and on the
same genomic regions. If this option is repeated the paths must be labelled,
example:
--counts dataset1:path1/counts.txt --counts dataset2:path2/counts.txt
Path to the BED file with the genomic regions associated to the count matrix. Only the first three fields of each line will be considered. Typically each region comprises several bins (for example, one region per chromosome).
Number of states to use for training
Path to the file with the parameters of the HMM.
The model contains the initial parameters for learning.
If --notrain
is set, no learning will take place and all
model parameters must be specified.
The provided model will be used 'as-is' without training.
In case a model with multiple initial probabilities is provided,
should those probabilities be averaged? If you are not sure about what
this means, the default is fine (the default is FALSE
in train mode
and TRUE
in predict mode)
Number of threads to be used
Add artificial splits in the input regions to speed-up the algorithm. This makes sense when the number of regions is small compared to the number of threads. The artificial breaks are used only in the training phase, not in the computation of the final state assignments.
Maximum number of iterations in train mode
Save an R data archive with the results of the segmentation. To
see what they are, inside R, type library(epicseg)
and ?segment
Path to the folder where all results will be saved. If the folder doesn't exist it will be created.
Prefix to append to all produced files. The prefix should not be a path (use the -o option for that).
Genomic annotation to compare with the segmentation. It must
be specified with a short title (no spaces) and a path to the bed
file containing the annotation (the strand matters), separated by :
.
This option can be repeated, for example:
--annot genes:/path1/genes.bed --annot cpg:/path2/cpgislands.bed
.
Path to the bed file containing the segmentation. The name field
in the bed file must be a number from 1 to the maximum number of states.
More than one path can be provided if the bed files describe alternative
segmentations of the same genomic regions and if the paths are labelled,
i.e. of the form LABEL:PATH
.
Path to the file with the parameters of the HMM. All fields must be present.
Path to the folder where all results will be saved. If the folder doesn't exist it will be created.
Prefix to append to all produced files. The prefix should not be a path (use the -o option for that).
Path to a text file containing one color per chromatin state.
Each color must be on a separate line, and they can be specified
either as RGB colors R,G,B
where 0 <= R,G,B < 256, or as R colors
(like red
, gold
, green4
…).
Path to a text file containing one short text label per chromatin state. Each label must be on a separate line.
Genomic annotation to compare with the segmentation. It must
be specified with a short title (no spaces) and a path to the bed
file containing the annotation (the strand matters), separated by :
.
This option can be repeated, for example:
--annot genes:/path1/genes.bed --annot cpg:/path2/cpgislands.bed
.
In all examples we will assume that:
epicseg.R
has been created using the
epicseg:::getLauncher
function.You have execution permissions for the launcher. If not, type at the command line:
chmod 700 epicseg.R
The launcher is in a folder listed in your PATH
environment variable, if
this is not the case you can add it by typing at the command line:
export PATH=/path/to/the/launcher/:$PATH
You have some input data to work with. You can reproduce these examples by
setting the indir
environment variable
to the value returned by the R code
system.file("extdata", package="epicseg")
You have an output folder ${outdir}
where to store files created by
EpiCSeg. This could be a new empty directory to make sure that you are not
overwriting anything.
Typically you just need to
getcounts
subprogramsegment
subprogramreport
subprogram.The main input of EpiCSeg are read counts for a set of histone marks. More precisely, you need to select which experiments to use (in BAM format, already indexed), in which genomic regions to count the reads, and how large the bins should be.
Here we are going to use three BAM files, a bin size of 200 base pairs (the default) and the following BED file:
cat ${indir}/contigs.bed
## chr21 40005600 42955400
## chr21 43005600 48119800
This is done with the getcounts
subprogram:
epicseg.R getcounts --mark H3K4me3:${indir}/H3K4me3.bam \
--mark H3K36me3:${indir}/H3K36me3.bam \
--mark H3K9me3:${indir}/H3K9me3.bam \
--regions ${indir}/contigs.bed --binsize 200 \
--target ${outdir}/counts.txt
Which creates a count matrix in text format:
head -n 5 ${outdir}/counts.txt
## H3K4me3 H3K36me3 H3K9me3
## 0 0 0
## 0 0 0
## 0 0 1
## 1 2 4
Note that the count matrix and the genomic regions are linked and they should always be used together.
In this case the regions could be easily splitted into bins of size 200 because each region had a width multiple of 200. If this is not the case, the regions are automatically refined, and a new BED file will be created, like in this case:
epicseg.R getcounts -m H3K4me3:${indir}/H3K4me3.bam \
-m H3K36me3:${indir}/H3K36me3.bam \
-m H3K9me3:${indir}/H3K9me3.bam \
-r ${indir}/contigs.bed --binsize 220 \
-t ${outdir}/counts.txt
This creates the files ${outdir}/counts.txt
(which, in this example, got
overwritten by the second call to getcounts
), plus the BED file
${outdir}/counts_refined_regions.bed
:
cat ${outdir}/counts_refined_regions.bed
## chr21 40005680 42955220
## chr21 43005600 48119720
Note that:
Training EpiCSeg's model, inferring the state for each bin and producing an
HTML report is done all at once by the segment
subprogram.
You just need to specify your regions (the refined ones, if it was necessary to
refine them), your counts and the number of states that you want.
epicseg.R segment --counts ${outdir}/counts.txt \
--regions ${outdir}/counts_refined_regions.bed \
--nstates 5 --outdir ${outdir} --prefix myFirstSegmentation_
This creates a bunch of files with filenames starting with
${outdir}/myFirstSegmentation_
. Among these the most important are:
the file ${outdir}/myFirstSegmentation_lmeans.png
shows
the average levels of histone marks for each state.
the file ${outdir}/myFirstSegmentation_model.txt
contains the parameter
of the obtained model. This is useful if you want to use the learned model
on other data or if you want to refine your output files with the
report
subprogram.
the file ${outdir}/myFirstSegmentation_segmentation.bed
is the segmentation
in bed format.
the file ${outdir}/myFirstSegmentation_report.html
is a web page showing
all plots and results.
Suppose that you are not happy with the colors automatically assigned by
EpiCSeg, or you want to assign labels to each state, or you want to see where
the states tend to localize with respect to annotated genes. You can use the
report
subprogram for that.
We need to specify the model and the segmentation (the BED file) produced
with segment
.
Here we specify colors and labels via simple text files and a gene annotation
via a BED file.
printf "red\ngold\ngreen4\ngray\nblue\n" > ${outdir}/myColors.txt
printf "promoter\nactive\ntranscribed\ninactive\nrepressed\n" > ${outdir}/myLabels.txt
epicseg.R report --model ${outdir}/myFirstSegmentation_model.txt \
--segments ${outdir}/myFirstSegmentation_segmentation.bed \
--annot genes:${indir}/genes.bed \
--colors ${outdir}/myColors.txt \
--labels ${outdir}/myLabels.txt \
--outdir ${outdir} --prefix nicerReport_
Now our heatmap looks better:
And we can see how states relate to genes (this was also possible
by setting the --annot
option directly in the segment
subprogram):
Sometimes you have two or more sets of experiments that you want to use for chromatin segmentation, where each set comes from a different individual or cell type, so that you can look at differences between chromatin states from one individual to another.
To do this in EpiCSeg, you need a slightly different workflow:
getcounts
subprogram for each dataset/individuals/cell-types separatelynormalizecounts
subprogramsegment
Here we make two datasets by changing the association between labels and file names. Of course, this doesn't make sense and it's only to get some data for our example.
epicseg.R getcounts -m H3K4me3:${indir}/H3K4me3.bam \
-m H3K36me3:${indir}/H3K36me3.bam \
-m H3K9me3:${indir}/H3K9me3.bam \
-r ${indir}/contigs.bed \
-t ${outdir}/sample1.txt
epicseg.R getcounts -m H3K4me3:${indir}/H3K36me3.bam \
-m H3K36me3:${indir}/H3K9me3.bam \
-m H3K9me3:${indir}/H3K4me3.bam \
-r ${indir}/contigs.bed \
-t ${outdir}/sample2.txt
Here we make the two count matrices sample1.txt
and sample2.txt
comparable.
epicseg.R normalizecounts -c ${outdir}/sample1.txt -c ${outdir}/sample2.txt
This created the normalized count matrices sample1_norm.txt
and
sample2_norm.txt
. Now we use them for segmentation:
epicseg.R segment -c sample1:${outdir}/sample1_norm.txt \
-c sample2:${outdir}/sample2_norm.txt \
-r ${indir}/contigs.bed \
-n 5 --outdir ${outdir} --prefix comparison_
Now you can, for instance, open the two BED files comparison_segmentation_sample1.bed
and comparison_segmentation_sample2.bed
in the genome browser and look at
differences, or do more advanced analyses in R.
Sometimes you have replicates for a particular ChIP-seq experiment and you
want to take some sort of average in order to improve the quality of the experiment.
This is handled by the getcounts
subprogram by providing the same label
more than once.
Here we are treating the H3K36me3 and H3K9me3 samples as replicates, which does not make sense and it's only to get some data for our example.
epicseg.R getcounts -m H3K4me3:${indir}/H3K4me3.bam \
-m H3K36me3:${indir}/H3K36me3.bam \
-m H3K36me3:${indir}/H3K9me3.bam \
-r ${indir}/contigs.bed \
-t ${outdir}/aggregatedcounts.txt
By providing the label H3K36me3 twice we averaged two ChIP-seq experiments. The resulting count matrix looks like this:
head -n 5 ${outdir}/aggregatedcounts.txt
## H3K4me3 H3K36me3
## 0 0
## 0 0
## 0 0
## 1 3
Bins with abnormally high read counts compared to the genome average, especially when they appear at the beginning of a chromosome, are known to cause underflow errors in the forward-backward algorithm. Unfortunately there is no obvious solution to that. These bins should be avoided, by filtering out problematic regions or by correcting the read counts by some other means.
This can happen, for instance, when the mithocondrial genome is included in the set of analyzed regions. Read counts in the mithocondrial genome tend to be orders of magnitude higher than in the rest of the genome, so it is not appropriate to include them in the input data (moreover, mithocondria don't have chromatin…)
Did you encounter any other issue? File an issue in the github repository!!!