Last updated: 2018-07-25
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(20180613)
The command set.seed(20180613)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: f1b5827
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: code_RCC/.DS_Store
Ignored: data/.DS_Store
Untracked files:
Untracked: analysis/fit_centipede_ATACseq.Rmd
Untracked: code_RCC/.fitCentipede_TF_bams.sbatch.swp
Untracked: data/motifs/
Untracked: data/wgEncodeMapability/
Untracked: docs/figure/compare_centipede_predictions.Rmd/
Untracked: workflow_setup.R
Unstaged changes:
Modified: analysis/index.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
Download JASPAR motif database http://jaspar.genereg.net/download/CORE/JASPAR2018_CORE_non-redundant_pfms_meme.zip
Install required softwares:
bedGraphToBigWig
, bigWigAverageOverBed
cat code_RCC/fimo_jaspar_motif_rcc.sh
#!/bin/bash
#SBATCH --job-name=fimo
#SBATCH --mem=8G
#SBATCH --output=fimo_%J.out
#SBATCH --partition=broadwl
##############################################################################
# get motif matches using FIMO
#
# requires FIMO from MEME suite
# FIMO instructions: http://meme-suite.org/doc/fimo.html?man_type=web
#installation: http://meme-suite.org/doc/install.html?man_type=web
##############################################################################
## parameters
tf_name=$1
pwm_id=$2
thresh_pValue=$3 # 1e-4
###### example ######
# tf_name="CTCF"
# pwm_id="MA0139.1"
# thresh_pValue=1e-4
#####################
flank=100
ver_genome=hg19
max_fimo_sites=1000000
## extract pwm meme file
echo "pwm_id: $pwm_id, tf_name: $tf_name"
pwm_name="${tf_name}_${pwm_id}_${thresh_pValue}"
# directory for Jaspar motifs
dir_pwm="/project/mstephens/ATAC_DNase/motif_database/JASPAR_2018/JASPAR2018_CORE_non-redundant_pfms_meme"
pwm_filename="${dir_pwm}/${pwm_id}.meme"
# reference genome fasta file
genome_filename="/project/mstephens/ATAC_DNase/ref_genome/${ver_genome}.fa"
# output directory for FIMO motif matching results
dir_output="/project/mstephens/ATAC_DNase/motif_sites_JASPAR2018/"
mkdir -p ${dir_output}
## FIMO matching motifs
dir_motif_matches="${dir_output}/FIMO/${pwm_name}"
filename_motif_matches="${dir_motif_matches}/fimo.txt"
mkdir -p ${dir_motif_matches}
echo "Begin FIMO matching motif ..."
fimo --oc ${dir_motif_matches} --verbosity 2 --text \
--bgfile --uniform-- --thresh ${thresh_pValue} --max-stored-scores ${max_fimo_sites} \
${pwm_filename} ${genome_filename} \
> ${filename_motif_matches}
echo "Finish FIMO: ${dir_motif_matches}"
# match motifs using FIMO ( p-value = 1e-4 ) on RCC
# CTCF MA0139.1
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/fimo_jaspar_motif_rcc.sh CTCF MA0139.1 1e-4
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/fimo_jaspar_motif_rcc.sh HIF1A MA1106.1 1e-4
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/fimo_jaspar_motif_rcc.sh MEF2D MA0773.1 1e-4
cat code_RCC/sites_jaspar_motif_rcc.sh
#!/bin/bash
#SBATCH --job-name=sites
#SBATCH --mem=8G
#SBATCH --output=sites_%J.out
#SBATCH --partition=broadwl
################################################################################
# The script process FIMO motif matches to get candidate binding sites
#
# requires process_pwm_sites.R to flank motif matches to get candidate sites
# requires bigWigAverageOverBed tool from UCSC to compute mapablity
# requires bedtools to filter out ENCODE blacklist regions
################################################################################
module load R
## parameters
tf_name=$1
pwm_id=$2
thresh_pValue=$3 # 1e-4
###### example ######
# tf_name="CTCF"
# pwm_id="MA0139.1"
# thresh_pValue=1e-4
#####################
flank=100
ver_genome=hg19
max_fimo_sites=1000000
## extract pwm meme file
echo "pwm_id: $pwm_id, tf_name: $tf_name"
pwm_name="${tf_name}_${pwm_id}_${thresh_pValue}"
dir_code=$HOME/projects/ATAC-seq/ATAC-seq_workflow/code_RCC
# directory for ENCODE mapability blacklist
dir_blacklist="/project/mstephens/ATAC_DNase/ENCODE_blacklist/hg19"
# output directory for FIMO motif matching results
dir_output="/project/mstephens/ATAC_DNase/motif_sites_JASPAR2018/"
mkdir -p ${dir_output}
## FIMO matching motifs
dir_motif_matches="${dir_output}/FIMO/${pwm_name}"
filename_motif_matches="${dir_motif_matches}/fimo.txt"
## get candidate sites after taking flanking windows around motif matches
# output directory for candidate sites
dir_sites="${dir_output}/candidate_sites/${thresh_pValue}"
mkdir -p ${dir_sites}
sites="${dir_sites}/${pwm_name}_flank${flank}_fimo_sites.bed"
## mapability downloaded from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/
echo "Flank motif sites and compute mapability..."
filename_mapability="${dir_blacklist}/wgEncodeDukeMapabilityUniqueness20bp.bigWig"
Rscript ${dir_code}/process_pwm_sites.R ${tf_name} ${pwm_id} ${filename_motif_matches} ${flank} ${thresh_pValue} ${dir_sites} ${filename_mapability}
## filter ENCODE blacklist regions
echo "Filter ENCODE blacklist regions..."
blacklist_filename="${dir_blacklist}/wgEncodeDacMapabilityConsensusExcludable.bed"
bedtools intersect -v -a ${sites} -b ${blacklist_filename} > ${sites}.filtered.tmp
head -1 ${sites} | cat - ${sites}.filtered.tmp > ${sites}.filtered
mv ${sites}.filtered ${sites}
rm ${sites}.filtered.tmp
echo "Complete candidate sites processing."
## use the bigWigAverageOverBed tool from UCSC to compute mapablity
## https://genome.ucsc.edu/goldenpath/help/bigWig.html
## use the bedtools intersect to filter out ENCODE blacklist regions
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/sites_jaspar_motif_rcc.sh CTCF MA0139.1 1e-4
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/sites_jaspar_motif_rcc.sh HIF1A MA1106.1 1e-4
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/sites_jaspar_motif_rcc.sh MEF2D MA0773.1 1e-4
cat code_RCC/genome_coverage_bamToBigwig.sh
#!/bin/bash
#SBATCH --job-name=coverage
#SBATCH --output=coverage_%J.out
#SBATCH --mem=10G
#SBATCH --partition=broadwl
################################################################################
# get 5' end coverage for bam files
# Compute the DNase or ATAC-seq cleavage (5' end) coverge along the genome
#
# requires samtools, bedtools, and bedGraphToBigWig (from UCSC)
################################################################################
bam_file=$1
dir_results="/project/mstephens/ATAC_DNase/ATAC-seq_Olivia_Gray/results/ATAC-seq_tagcounts/"
echo "Compute genome coverage for ${bam_file}"
echo "Output directory: ${dir_results}"
## fetch chrom size from UCSC
# ver_genome="hg19"
# echo "fetch chrom size for ${ver_genome}"
# mkdir -p "${dir_results}/genome_size/"
# genome_chrom_sizes="${dir_results}/genome_size/${ver_genome}.chrom.sizes"
# fetchChromSizes ${ver_genome} > ${genome_chrom_sizes}
## or download from UCSC
# cd ${dir_results}/genome_size/
# wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes
genome_chrom_sizes="/project/compbio/jointLCLs/reference_genomes/hg19/hg19.chrom.sizes"
## sort bam file
bam_dir_name=$(dirname ${bam_file})
bam_basename=$(basename ${bam_file} .bam)
cd ${bam_dir_name}
echo "base bamfile: ${bam_basename}"
echo "sort and index bam file"
samtools sort -o ${bam_basename}.sorted.bam ${bam_basename}.bam
samtools index ${bam_basename}.sorted.bam
samtools idxstats ${bam_basename}.sorted.bam > ${bam_basename}.idxstats.txt
bam_sorted_name="${bam_dir_name}/${bam_basename}.sorted.bam"
## count genome coverage, 5' end, save in bedGraph format, then convert to bigWig format
bw_file="${dir_results}/${bam_basename}.5p.0base.tagcount"
## forward strand
echo "Count genome coverage on + strand"
samtools view -b ${bam_sorted_name} | \
genomeCoverageBed -bg -5 -strand "+" -ibam stdin -g ${genome_chrom_sizes} > ${bw_file}.bedGraph
echo "coverting bedGraph to BigWig..."
sort -k1,1 -k2,2n ${bw_file}.bedGraph > ${bw_file}.sorted.bedGraph
bedGraphToBigWig ${bw_file}.sorted.bedGraph ${genome_chrom_sizes} ${bw_file}.fwd.bw
rm ${bw_file}.bedGraph ${bw_file}.sorted.bedGraph
## reverse strand
echo "Count genome coverage on - strand"
samtools view -b ${bam_sorted_name} | \
genomeCoverageBed -bg -5 -strand "-" -ibam stdin -g ${genome_chrom_sizes} > ${bw_file}.bedGraph
echo "coverting bedGraph to BigWig..."
sort -k1,1 -k2,2n ${bw_file}.bedGraph > ${bw_file}.sorted.bedGraph
bedGraphToBigWig ${bw_file}.sorted.bedGraph ${genome_chrom_sizes} ${bw_file}.rev.bw
rm ${bw_file}.bedGraph ${bw_file}.sorted.bedGraph
echo "Finished computing genome coverage => ${bw_file}.bw"
## bam files
bamfiles=("H1_nomito_rdup.bam" "H2_nomito_rdup.bam" "H3_nomito_rdup.bam" "N1_nomito_rdup.bam" "N2_nomito_rdup.bam" "N3_nomito_rdup.bam")
for bam_name in "${bamfiles[@]}"
do
echo "${bam_name}"
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/genome_coverage_bamToBigwig.sh /project/mstephens/ATAC_DNase/ATAC-seq_Olivia_Gray/ATAC-seq_BAMfiles/"${bam_name}"
done
cat code_RCC/get_motif_count_matrices.sh
#!/bin/bash
#SBATCH --job-name=extract_counts
#SBATCH --output=extract_counts_%J.out
#SBATCH --partition=broadwl
#SBATCH --mem=8G
################################################################################
# The script takes bigwig files of DNase or ATAC-seq genome-wide count coverages
# then, it extracts tagcounts for a TF with 100bp flanks around motif
# and save in a tagcount matrix.
#
# requires bigwig tagcount files (in the earlier step)
# requires bwtool to extract counts from bigwig files
# requires rev_tagcount_bwtool.R to reverse counts on reverse strand
################################################################################
module load R
tf_name=$1
pwm_id=$2
bam_name=$3
thresh_pValue="1e-4"
flank=100
ver_genome="hg19"
pwm_name="${tf_name}_${pwm_id}_${thresh_pValue}"
dir_code=$HOME/projects/ATAC-seq/ATAC-seq_workflow/code_RCC
dir_tagcounts=/project/mstephens/ATAC_DNase/ATAC-seq_Olivia_Gray/results/ATAC-seq_tagcounts/
dir_count_matrix=/project/mstephens/ATAC_DNase/ATAC-seq_Olivia_Gray/results/ATAC-seq_count_matrix/${pwm_name}/
dir_sites=/project/mstephens/ATAC_DNase/motif_sites_JASPAR2018/candidate_sites/${thresh_pValue}/
filename_sites=${dir_sites}/${pwm_name}_flank${flank}_fimo_sites.bed
echo "PWM name: ${pwm_name}"
echo "Candidate sites: ${filename_sites}"
echo "Bam filename: ${bam_name}"
## Extract counts around motif matches into tagcount matrices
## Check the existence of tagcount bigwig files and candidate sites
bam_basename=$(basename ${bam_name} .bam)
bw_file="${dir_tagcounts}/${bam_basename}.5p.0base.tagcount"
file_tagcount_fwd=${bw_file}.fwd.bw
file_tagcount_rev=${bw_file}.rev.bw
echo "Tagcount fwd name: ${file_tagcount_fwd}"
echo "Tagcount rev name: ${file_tagcount_rev}"
if [ ! -s ${file_tagcount_fwd} ] || [ ! -s ${file_tagcount_rev} ]; then
echo "ERROR: Tagcount bigwig files for: ${bam_name} does not exist!"
elif [ ! -s ${filename_sites} ]; then
echo "ERROR: Candidate sites ${filename_sites} does not exist!"
else
echo "Counting cuts around the candidate sites of ${pwm_name} with ${flank}bp flanking windows ..."
mkdir -p ${dir_count_matrix}
file_count_matrix_fwd="${dir_count_matrix}/${pwm_name}_${bam_basename}_fwdcounts.m"
file_count_matrix_rev="${dir_count_matrix}/${pwm_name}_${bam_basename}_revcounts.m"
## count matrix around the candidate sites
echo "match matrix of ${file_count_matrix_fwd} ..."
cut -f 1-4 ${filename_sites} | bwtool extract bed stdin ${file_tagcount_fwd} ${file_count_matrix_fwd} -fill=0 -decimals=0 -tabs
echo "match matrix of ${file_count_matrix_rev} ..."
cut -f 1-4 ${filename_sites} | bwtool extract bed stdin ${file_tagcount_rev} ${file_count_matrix_rev} -fill=0 -decimals=0 -tabs
Rscript ${dir_code}/rev_tagcount_bwtool.R ${file_count_matrix_fwd} ${file_count_matrix_rev} ${filename_sites}
# compress the files to .gz
gzip -f ${file_count_matrix_fwd} ${file_count_matrix_rev}
echo "Finished counting for ${pwm_name} in ${bam_name}."
fi
## bam files
bamfiles=("H1_nomito_rdup.bam" "H2_nomito_rdup.bam" "H3_nomito_rdup.bam" "N1_nomito_rdup.bam" "N2_nomito_rdup.bam" "N3_nomito_rdup.bam")
## CTCF
for bam_name in "${bamfiles[@]}"
do
echo "${bam_name}"
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/get_motif_count_matrices.sh CTCF MA0139.1 "${bam_name}"
done
## HIF1A::ARNT
for bam_name in "${bamfiles[@]}"
do
echo "${bam_name}"
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/get_motif_count_matrices.sh HIF1A::ARNT MA0259.1 "${bam_name}"
done
## HIF1A
for bam_name in "${bamfiles[@]}"
do
echo "${bam_name}"
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/get_motif_count_matrices.sh HIF1A MA1106.1 "${bam_name}"
done
## MEF2D
for bam_name in "${bamfiles[@]}"
do
echo "${bam_name}"
sbatch --mem=15G ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/get_motif_count_matrices.sh MEF2D MA0773.1 "${bam_name}"
done
sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS 10.13.4
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] workflowr_1.0.1 Rcpp_0.12.14 digest_0.6.13
[4] rprojroot_1.3-2 R.methodsS3_1.7.1 backports_1.1.2
[7] git2r_0.21.0 magrittr_1.5 evaluate_0.10.1
[10] stringi_1.1.5 whisker_0.3-2 R.oo_1.21.0
[13] R.utils_2.6.0 rmarkdown_1.9 tools_3.3.3
[16] stringr_1.2.0 yaml_2.1.16 htmltools_0.3.6
[19] knitr_1.18
This reproducible R Markdown analysis was created with workflowr 1.0.1