Last updated: 2018-07-25
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(20180613)
The command set.seed(20180613)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: c8ba1be
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: code_RCC/.DS_Store
Ignored: data/.DS_Store
Untracked files:
Untracked: analysis/fit_centipede_ATACseq.Rmd
Untracked: code_RCC/.fitCentipede_TF_bams.sbatch.swp
Untracked: data/motifs/
Untracked: data/wgEncodeMapability/
Untracked: docs/figure/compare_centipede_predictions.Rmd/
Untracked: workflow_setup.R
Unstaged changes:
Modified: analysis/index.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | c8ba1be | Kevin Luo | 2018-07-25 | add ATAC-seq preprocessing pipeline |
Download JASPAR motif database http://jaspar.genereg.net/download/CORE/JASPAR2018_CORE_non-redundant_pfms_meme.zip
Install required softwares:
bedGraphToBigWig
, bigWigAverageOverBed
cat code_RCC/fimo_jaspar_motif_rcc.sh
#!/bin/bash
#SBATCH --job-name=fimo
#SBATCH --mem=8G
#SBATCH --output=fimo_%J.out
#SBATCH --partition=broadwl
##############################################################################
# get motif matches using FIMO
#
# requires FIMO from MEME suite
# FIMO instructions: http://meme-suite.org/doc/fimo.html?man_type=web
#installation: http://meme-suite.org/doc/install.html?man_type=web
##############################################################################
## parameters
tf_name=$1
pwm_id=$2
thresh_pValue=$3 # 1e-4
###### example ######
# tf_name="CTCF"
# pwm_id="MA0139.1"
# thresh_pValue=1e-4
#####################
flank=100
ver_genome=hg19
max_fimo_sites=1000000
## extract pwm meme file
echo "pwm_id: $pwm_id, tf_name: $tf_name"
pwm_name="${tf_name}_${pwm_id}_${thresh_pValue}"
# directory for Jaspar motifs
dir_pwm="/project/mstephens/ATAC_DNase/motif_database/JASPAR_2018/JASPAR2018_CORE_non-redundant_pfms_meme"
pwm_filename="${dir_pwm}/${pwm_id}.meme"
# reference genome fasta file
genome_filename="/project/mstephens/ATAC_DNase/ref_genome/${ver_genome}.fa"
# output directory for FIMO motif matching results
dir_output="/project/mstephens/ATAC_DNase/motif_sites_JASPAR2018/"
mkdir -p ${dir_output}
## FIMO matching motifs
dir_motif_matches="${dir_output}/FIMO/${pwm_name}"
filename_motif_matches="${dir_motif_matches}/fimo.txt"
mkdir -p ${dir_motif_matches}
echo "Begin FIMO matching motif ..."
fimo --oc ${dir_motif_matches} --verbosity 2 --text \
--bgfile --uniform-- --thresh ${thresh_pValue} --max-stored-scores ${max_fimo_sites} \
${pwm_filename} ${genome_filename} \
> ${filename_motif_matches}
echo "Finish FIMO: ${dir_motif_matches}"
# match motifs using FIMO ( p-value = 1e-4 ) on RCC
# CTCF MA0139.1
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/fimo_jaspar_motif_rcc.sh CTCF MA0139.1 1e-4
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/fimo_jaspar_motif_rcc.sh HIF1A MA1106.1 1e-4
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/fimo_jaspar_motif_rcc.sh MEF2D MA0773.1 1e-4
cat code_RCC/sites_jaspar_motif_rcc.sh
#!/bin/bash
#SBATCH --job-name=sites
#SBATCH --mem=8G
#SBATCH --output=sites_%J.out
#SBATCH --partition=broadwl
################################################################################
# The script process FIMO motif matches to get candidate binding sites
#
# requires process_pwm_sites.R to flank motif matches to get candidate sites
# requires bigWigAverageOverBed tool from UCSC to compute mapablity
# requires bedtools to filter out ENCODE blacklist regions
################################################################################
module load R
## parameters
tf_name=$1
pwm_id=$2
thresh_pValue=$3 # 1e-4
###### example ######
# tf_name="CTCF"
# pwm_id="MA0139.1"
# thresh_pValue=1e-4
#####################
flank=100
ver_genome=hg19
max_fimo_sites=1000000
## extract pwm meme file
echo "pwm_id: $pwm_id, tf_name: $tf_name"
pwm_name="${tf_name}_${pwm_id}_${thresh_pValue}"
dir_code=$HOME/projects/ATAC-seq/ATAC-seq_workflow/code_RCC
# directory for ENCODE mapability blacklist
dir_blacklist="/project/mstephens/ATAC_DNase/ENCODE_blacklist/hg19"
# output directory for FIMO motif matching results
dir_output="/project/mstephens/ATAC_DNase/motif_sites_JASPAR2018/"
mkdir -p ${dir_output}
## FIMO matching motifs
dir_motif_matches="${dir_output}/FIMO/${pwm_name}"
filename_motif_matches="${dir_motif_matches}/fimo.txt"
## get candidate sites after taking flanking windows around motif matches
# output directory for candidate sites
dir_sites="${dir_output}/candidate_sites/${thresh_pValue}"
mkdir -p ${dir_sites}
sites="${dir_sites}/${pwm_name}_flank${flank}_fimo_sites.bed"
## mapability downloaded from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/
echo "Flank motif sites and compute mapability..."
filename_mapability="${dir_blacklist}/wgEncodeDukeMapabilityUniqueness20bp.bigWig"
Rscript ${dir_code}/process_pwm_sites.R ${tf_name} ${pwm_id} ${filename_motif_matches} ${flank} ${thresh_pValue} ${dir_sites} ${filename_mapability}
## filter ENCODE blacklist regions
echo "Filter ENCODE blacklist regions..."
blacklist_filename="${dir_blacklist}/wgEncodeDacMapabilityConsensusExcludable.bed"
bedtools intersect -v -a ${sites} -b ${blacklist_filename} > ${sites}.filtered.tmp
head -1 ${sites} | cat - ${sites}.filtered.tmp > ${sites}.filtered
mv ${sites}.filtered ${sites}
rm ${sites}.filtered.tmp
echo "Complete candidate sites processing."
## use the bigWigAverageOverBed tool from UCSC to compute mapablity
## https://genome.ucsc.edu/goldenpath/help/bigWig.html
## use the bedtools intersect to filter out ENCODE blacklist regions
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/sites_jaspar_motif_rcc.sh CTCF MA0139.1 1e-4
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/sites_jaspar_motif_rcc.sh HIF1A MA1106.1 1e-4
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/sites_jaspar_motif_rcc.sh MEF2D MA0773.1 1e-4
cat code_RCC/genome_coverage_bamToBigwig.sh
#!/bin/bash
#SBATCH --job-name=coverage
#SBATCH --output=coverage_%J.out
#SBATCH --mem=10G
#SBATCH --partition=broadwl
################################################################################
# get 5' end coverage for bam files
# Compute the DNase or ATAC-seq cleavage (5' end) coverge along the genome
#
# requires samtools, bedtools, and bedGraphToBigWig (from UCSC)
################################################################################
bam_file=$1
dir_results="/project/mstephens/ATAC_DNase/ATAC-seq_Olivia_Gray/results/ATAC-seq_tagcounts/"
echo "Compute genome coverage for ${bam_file}"
echo "Output directory: ${dir_results}"
## fetch chrom size from UCSC
# ver_genome="hg19"
# echo "fetch chrom size for ${ver_genome}"
# mkdir -p "${dir_results}/genome_size/"
# genome_chrom_sizes="${dir_results}/genome_size/${ver_genome}.chrom.sizes"
# fetchChromSizes ${ver_genome} > ${genome_chrom_sizes}
## or download from UCSC
# cd ${dir_results}/genome_size/
# wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes
genome_chrom_sizes="/project/compbio/jointLCLs/reference_genomes/hg19/hg19.chrom.sizes"
## sort bam file
bam_dir_name=$(dirname ${bam_file})
bam_basename=$(basename ${bam_file} .bam)
cd ${bam_dir_name}
echo "base bamfile: ${bam_basename}"
echo "sort and index bam file"
samtools sort -o ${bam_basename}.sorted.bam ${bam_basename}.bam
samtools index ${bam_basename}.sorted.bam
samtools idxstats ${bam_basename}.sorted.bam > ${bam_basename}.idxstats.txt
bam_sorted_name="${bam_dir_name}/${bam_basename}.sorted.bam"
## count genome coverage, 5' end, save in bedGraph format, then convert to bigWig format
bw_file="${dir_results}/${bam_basename}.5p.0base.tagcount"
## forward strand
echo "Count genome coverage on + strand"
samtools view -b ${bam_sorted_name} | \
genomeCoverageBed -bg -5 -strand "+" -ibam stdin -g ${genome_chrom_sizes} > ${bw_file}.bedGraph
echo "coverting bedGraph to BigWig..."
sort -k1,1 -k2,2n ${bw_file}.bedGraph > ${bw_file}.sorted.bedGraph
bedGraphToBigWig ${bw_file}.sorted.bedGraph ${genome_chrom_sizes} ${bw_file}.fwd.bw
rm ${bw_file}.bedGraph ${bw_file}.sorted.bedGraph
## reverse strand
echo "Count genome coverage on - strand"
samtools view -b ${bam_sorted_name} | \
genomeCoverageBed -bg -5 -strand "-" -ibam stdin -g ${genome_chrom_sizes} > ${bw_file}.bedGraph
echo "coverting bedGraph to BigWig..."
sort -k1,1 -k2,2n ${bw_file}.bedGraph > ${bw_file}.sorted.bedGraph
bedGraphToBigWig ${bw_file}.sorted.bedGraph ${genome_chrom_sizes} ${bw_file}.rev.bw
rm ${bw_file}.bedGraph ${bw_file}.sorted.bedGraph
echo "Finished computing genome coverage => ${bw_file}.bw"
## bam files
bamfiles=("H1_nomito_rdup.bam" "H2_nomito_rdup.bam" "H3_nomito_rdup.bam" "N1_nomito_rdup.bam" "N2_nomito_rdup.bam" "N3_nomito_rdup.bam")
for bam_name in "${bamfiles[@]}"
do
echo "${bam_name}"
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/genome_coverage_bamToBigwig.sh /project/mstephens/ATAC_DNase/ATAC-seq_Olivia_Gray/ATAC-seq_BAMfiles/"${bam_name}"
done
cat code_RCC/get_motif_count_matrices.sh
#!/bin/bash
#SBATCH --job-name=extract_counts
#SBATCH --output=extract_counts_%J.out
#SBATCH --partition=broadwl
#SBATCH --mem=8G
################################################################################
# The script takes bigwig files of DNase or ATAC-seq genome-wide count coverages
# then, it extracts tagcounts for a TF with 100bp flanks around motif
# and save in a tagcount matrix.
#
# requires bigwig tagcount files (in the earlier step)
# requires bwtool to extract counts from bigwig files
# requires rev_tagcount_bwtool.R to reverse counts on reverse strand
################################################################################
module load R
tf_name=$1
pwm_id=$2
bam_name=$3
thresh_pValue="1e-4"
flank=100
ver_genome="hg19"
pwm_name="${tf_name}_${pwm_id}_${thresh_pValue}"
dir_code=$HOME/projects/ATAC-seq/ATAC-seq_workflow/code_RCC
dir_tagcounts=/project/mstephens/ATAC_DNase/ATAC-seq_Olivia_Gray/results/ATAC-seq_tagcounts/
dir_count_matrix=/project/mstephens/ATAC_DNase/ATAC-seq_Olivia_Gray/results/ATAC-seq_count_matrix/${pwm_name}/
dir_sites=/project/mstephens/ATAC_DNase/motif_sites_JASPAR2018/candidate_sites/${thresh_pValue}/
filename_sites=${dir_sites}/${pwm_name}_flank${flank}_fimo_sites.bed
echo "PWM name: ${pwm_name}"
echo "Candidate sites: ${filename_sites}"
echo "Bam filename: ${bam_name}"
## Extract counts around motif matches into tagcount matrices
## Check the existence of tagcount bigwig files and candidate sites
bam_basename=$(basename ${bam_name} .bam)
bw_file="${dir_tagcounts}/${bam_basename}.5p.0base.tagcount"
file_tagcount_fwd=${bw_file}.fwd.bw
file_tagcount_rev=${bw_file}.rev.bw
echo "Tagcount fwd name: ${file_tagcount_fwd}"
echo "Tagcount rev name: ${file_tagcount_rev}"
if [ ! -s ${file_tagcount_fwd} ] || [ ! -s ${file_tagcount_rev} ]; then
echo "ERROR: Tagcount bigwig files for: ${bam_name} does not exist!"
elif [ ! -s ${filename_sites} ]; then
echo "ERROR: Candidate sites ${filename_sites} does not exist!"
else
echo "Counting cuts around the candidate sites of ${pwm_name} with ${flank}bp flanking windows ..."
mkdir -p ${dir_count_matrix}
file_count_matrix_fwd="${dir_count_matrix}/${pwm_name}_${bam_basename}_fwdcounts.m"
file_count_matrix_rev="${dir_count_matrix}/${pwm_name}_${bam_basename}_revcounts.m"
## count matrix around the candidate sites
echo "match matrix of ${file_count_matrix_fwd} ..."
cut -f 1-4 ${filename_sites} | bwtool extract bed stdin ${file_tagcount_fwd} ${file_count_matrix_fwd} -fill=0 -decimals=0 -tabs
echo "match matrix of ${file_count_matrix_rev} ..."
cut -f 1-4 ${filename_sites} | bwtool extract bed stdin ${file_tagcount_rev} ${file_count_matrix_rev} -fill=0 -decimals=0 -tabs
Rscript ${dir_code}/rev_tagcount_bwtool.R ${file_count_matrix_fwd} ${file_count_matrix_rev} ${filename_sites}
# compress the files to .gz
gzip -f ${file_count_matrix_fwd} ${file_count_matrix_rev}
echo "Finished counting for ${pwm_name} in ${bam_name}."
fi
## bam files
bamfiles=("H1_nomito_rdup.bam" "H2_nomito_rdup.bam" "H3_nomito_rdup.bam" "N1_nomito_rdup.bam" "N2_nomito_rdup.bam" "N3_nomito_rdup.bam")
## CTCF
for bam_name in "${bamfiles[@]}"
do
echo "${bam_name}"
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/get_motif_count_matrices.sh CTCF MA0139.1 "${bam_name}"
done
## HIF1A::ARNT
for bam_name in "${bamfiles[@]}"
do
echo "${bam_name}"
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/get_motif_count_matrices.sh HIF1A::ARNT MA0259.1 "${bam_name}"
done
## HIF1A
for bam_name in "${bamfiles[@]}"
do
echo "${bam_name}"
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/get_motif_count_matrices.sh HIF1A MA1106.1 "${bam_name}"
done
## MEF2D
for bam_name in "${bamfiles[@]}"
do
echo "${bam_name}"
sbatch --mem=15G ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/get_motif_count_matrices.sh MEF2D MA0773.1 "${bam_name}"
done
sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS 10.13.4
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] workflowr_1.0.1 Rcpp_0.12.14 digest_0.6.13
[4] rprojroot_1.3-2 R.methodsS3_1.7.1 backports_1.1.2
[7] git2r_0.21.0 magrittr_1.5 evaluate_0.10.1
[10] stringi_1.1.5 whisker_0.3-2 R.oo_1.21.0
[13] R.utils_2.6.0 rmarkdown_1.9 tools_3.3.3
[16] stringr_1.2.0 yaml_2.1.16 htmltools_0.3.6
[19] knitr_1.18
This reproducible R Markdown analysis was created with workflowr 1.0.1