ATAC-seq bam files to ATAC-seq data matrices

Last updated: 2018-07-25

workflowr checks: (Click a bullet for more information)

✔ R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed: set.seed(20180613)

The command set.seed(20180613) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: c8ba1be
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
```
Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    code_RCC/.DS_Store
    Ignored:    data/.DS_Store

Untracked files:
    Untracked:  analysis/fit_centipede_ATACseq.Rmd
    Untracked:  code_RCC/.fitCentipede_TF_bams.sbatch.swp
    Untracked:  data/motifs/
    Untracked:  data/wgEncodeMapability/
    Untracked:  docs/figure/compare_centipede_predictions.Rmd/
    Untracked:  workflow_setup.R

Unstaged changes:
    Modified:   analysis/index.Rmd
```
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

Expand here to see past versions:

File	Version	Author	Date	Message
Rmd	c8ba1be	Kevin Luo	2018-07-25	add ATAC-seq preprocessing pipeline

Step 0: Download JASPAR motif database

Download JASPAR motif database http://jaspar.genereg.net/download/CORE/JASPAR2018_CORE_non-redundant_pfms_meme.zip

Install required softwares:

FIMO from MEME suite.
samtools
bedtools
bwtools
UCSC command-line tools: bedGraphToBigWig, bigWigAverageOverBed

Step 1: Find TF motif matches using FIMO

script

cat code_RCC/fimo_jaspar_motif_rcc.sh

#!/bin/bash
#SBATCH --job-name=fimo
#SBATCH --mem=8G
#SBATCH --output=fimo_%J.out
#SBATCH --partition=broadwl

##############################################################################
# get motif matches using FIMO
#
# requires FIMO from MEME suite
# FIMO instructions: http://meme-suite.org/doc/fimo.html?man_type=web
#installation: http://meme-suite.org/doc/install.html?man_type=web
##############################################################################


## parameters
tf_name=$1
pwm_id=$2
thresh_pValue=$3 # 1e-4

###### example ######
# tf_name="CTCF"
# pwm_id="MA0139.1"
# thresh_pValue=1e-4
#####################

flank=100
ver_genome=hg19
max_fimo_sites=1000000

## extract pwm meme file
echo "pwm_id: $pwm_id, tf_name: $tf_name"

pwm_name="${tf_name}_${pwm_id}_${thresh_pValue}"

# directory for Jaspar motifs
dir_pwm="/project/mstephens/ATAC_DNase/motif_database/JASPAR_2018/JASPAR2018_CORE_non-redundant_pfms_meme"
pwm_filename="${dir_pwm}/${pwm_id}.meme"

# reference genome fasta file
genome_filename="/project/mstephens/ATAC_DNase/ref_genome/${ver_genome}.fa"

# output directory for FIMO motif matching results
dir_output="/project/mstephens/ATAC_DNase/motif_sites_JASPAR2018/"
mkdir -p ${dir_output}

## FIMO matching motifs
dir_motif_matches="${dir_output}/FIMO/${pwm_name}"
filename_motif_matches="${dir_motif_matches}/fimo.txt"

mkdir -p ${dir_motif_matches}
echo "Begin FIMO matching motif ..."

fimo --oc ${dir_motif_matches} --verbosity 2 --text \
    --bgfile --uniform-- --thresh ${thresh_pValue} --max-stored-scores ${max_fimo_sites} \
    ${pwm_filename} ${genome_filename}  \
    > ${filename_motif_matches}

echo "Finish FIMO: ${dir_motif_matches}"

run

# match motifs using FIMO ( p-value = 1e-4 ) on RCC

# CTCF MA0139.1
sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/fimo_jaspar_motif_rcc.sh CTCF MA0139.1 1e-4

sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/fimo_jaspar_motif_rcc.sh HIF1A MA1106.1 1e-4

sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/fimo_jaspar_motif_rcc.sh MEF2D MA0773.1 1e-4

Step 2: Get TF candidate binding sites

script

cat code_RCC/sites_jaspar_motif_rcc.sh

#!/bin/bash
#SBATCH --job-name=sites
#SBATCH --mem=8G
#SBATCH --output=sites_%J.out
#SBATCH --partition=broadwl


################################################################################
# The script process FIMO motif matches to get candidate binding sites
#
# requires process_pwm_sites.R to flank motif matches to get candidate sites
# requires bigWigAverageOverBed tool from UCSC to compute mapablity
# requires bedtools to filter out ENCODE blacklist regions
################################################################################

module load R

## parameters
tf_name=$1
pwm_id=$2
thresh_pValue=$3 # 1e-4

###### example ######
# tf_name="CTCF"
# pwm_id="MA0139.1"
# thresh_pValue=1e-4
#####################

flank=100
ver_genome=hg19
max_fimo_sites=1000000

## extract pwm meme file
echo "pwm_id: $pwm_id, tf_name: $tf_name"

pwm_name="${tf_name}_${pwm_id}_${thresh_pValue}"

dir_code=$HOME/projects/ATAC-seq/ATAC-seq_workflow/code_RCC

# directory for ENCODE mapability blacklist
dir_blacklist="/project/mstephens/ATAC_DNase/ENCODE_blacklist/hg19"

# output directory for FIMO motif matching results
dir_output="/project/mstephens/ATAC_DNase/motif_sites_JASPAR2018/"
mkdir -p ${dir_output}

## FIMO matching motifs
dir_motif_matches="${dir_output}/FIMO/${pwm_name}"
filename_motif_matches="${dir_motif_matches}/fimo.txt"

## get candidate sites after taking flanking windows around motif matches

# output directory for candidate sites
dir_sites="${dir_output}/candidate_sites/${thresh_pValue}"
mkdir -p ${dir_sites}

sites="${dir_sites}/${pwm_name}_flank${flank}_fimo_sites.bed"

## mapability downloaded from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/
echo "Flank motif sites and compute mapability..."
filename_mapability="${dir_blacklist}/wgEncodeDukeMapabilityUniqueness20bp.bigWig"
Rscript ${dir_code}/process_pwm_sites.R ${tf_name} ${pwm_id} ${filename_motif_matches} ${flank} ${thresh_pValue} ${dir_sites} ${filename_mapability}

## filter ENCODE blacklist regions
echo "Filter ENCODE blacklist regions..."
blacklist_filename="${dir_blacklist}/wgEncodeDacMapabilityConsensusExcludable.bed"
bedtools intersect -v -a ${sites} -b ${blacklist_filename} > ${sites}.filtered.tmp
head -1 ${sites} | cat - ${sites}.filtered.tmp > ${sites}.filtered
mv ${sites}.filtered ${sites}
rm ${sites}.filtered.tmp

echo "Complete candidate sites processing."

run


## use the bigWigAverageOverBed tool from UCSC to compute mapablity
## https://genome.ucsc.edu/goldenpath/help/bigWig.html

## use the bedtools intersect to filter out ENCODE blacklist regions

sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/sites_jaspar_motif_rcc.sh CTCF MA0139.1 1e-4

sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/sites_jaspar_motif_rcc.sh HIF1A MA1106.1 1e-4

sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/sites_jaspar_motif_rcc.sh MEF2D MA0773.1 1e-4

Step 3: Count ATAC-seq genome-wide cleavage, and build tagcount bigwig file

script

cat code_RCC/genome_coverage_bamToBigwig.sh

#!/bin/bash

#SBATCH --job-name=coverage
#SBATCH --output=coverage_%J.out
#SBATCH --mem=10G
#SBATCH --partition=broadwl

################################################################################
# get 5' end coverage for bam files
# Compute the DNase or ATAC-seq cleavage (5' end) coverge along the genome
#
# requires samtools, bedtools, and bedGraphToBigWig (from UCSC)
################################################################################

bam_file=$1

dir_results="/project/mstephens/ATAC_DNase/ATAC-seq_Olivia_Gray/results/ATAC-seq_tagcounts/"

echo "Compute genome coverage for ${bam_file}"
echo "Output directory: ${dir_results}"

## fetch chrom size from UCSC
# ver_genome="hg19"
# echo "fetch chrom size for ${ver_genome}"
# mkdir -p "${dir_results}/genome_size/"
# genome_chrom_sizes="${dir_results}/genome_size/${ver_genome}.chrom.sizes"
# fetchChromSizes ${ver_genome} > ${genome_chrom_sizes}

## or download from UCSC
# cd ${dir_results}/genome_size/
# wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes

genome_chrom_sizes="/project/compbio/jointLCLs/reference_genomes/hg19/hg19.chrom.sizes"

## sort bam file
bam_dir_name=$(dirname ${bam_file})
bam_basename=$(basename ${bam_file} .bam)

cd ${bam_dir_name}

echo "base bamfile: ${bam_basename}"
echo "sort and index bam file"

samtools sort -o ${bam_basename}.sorted.bam ${bam_basename}.bam

samtools index ${bam_basename}.sorted.bam

samtools idxstats ${bam_basename}.sorted.bam > ${bam_basename}.idxstats.txt

bam_sorted_name="${bam_dir_name}/${bam_basename}.sorted.bam"

## count genome coverage, 5' end, save in bedGraph format, then convert to bigWig format

bw_file="${dir_results}/${bam_basename}.5p.0base.tagcount"

## forward strand
echo "Count genome coverage on + strand"
samtools view -b ${bam_sorted_name} | \
genomeCoverageBed -bg -5 -strand "+" -ibam stdin -g ${genome_chrom_sizes} > ${bw_file}.bedGraph

echo "coverting bedGraph to BigWig..."
sort -k1,1 -k2,2n ${bw_file}.bedGraph > ${bw_file}.sorted.bedGraph
bedGraphToBigWig ${bw_file}.sorted.bedGraph ${genome_chrom_sizes} ${bw_file}.fwd.bw
rm ${bw_file}.bedGraph ${bw_file}.sorted.bedGraph

## reverse strand
echo "Count genome coverage on - strand"
samtools view -b ${bam_sorted_name} | \
genomeCoverageBed -bg -5 -strand "-" -ibam stdin -g ${genome_chrom_sizes} > ${bw_file}.bedGraph

echo "coverting bedGraph to BigWig..."
sort -k1,1 -k2,2n ${bw_file}.bedGraph > ${bw_file}.sorted.bedGraph
bedGraphToBigWig ${bw_file}.sorted.bedGraph ${genome_chrom_sizes} ${bw_file}.rev.bw
rm ${bw_file}.bedGraph ${bw_file}.sorted.bedGraph

echo "Finished computing genome coverage => ${bw_file}.bw"

run


## bam files
bamfiles=("H1_nomito_rdup.bam" "H2_nomito_rdup.bam" "H3_nomito_rdup.bam" "N1_nomito_rdup.bam" "N2_nomito_rdup.bam" "N3_nomito_rdup.bam")

for bam_name in "${bamfiles[@]}"
do
   echo "${bam_name}"
   sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/genome_coverage_bamToBigwig.sh /project/mstephens/ATAC_DNase/ATAC-seq_Olivia_Gray/ATAC-seq_BAMfiles/"${bam_name}"
done

Step 4: match ATAC-seq tagcount matrices for each motif

script

cat code_RCC/get_motif_count_matrices.sh

#!/bin/bash

#SBATCH --job-name=extract_counts
#SBATCH --output=extract_counts_%J.out
#SBATCH --partition=broadwl
#SBATCH --mem=8G

################################################################################
# The script takes bigwig files of DNase or ATAC-seq genome-wide count coverages
# then, it extracts tagcounts for a TF with 100bp flanks around motif
# and save in a tagcount matrix.
#
# requires bigwig tagcount files (in the earlier step)
# requires bwtool to extract counts from bigwig files
# requires rev_tagcount_bwtool.R to reverse counts on reverse strand
################################################################################

module load R

tf_name=$1
pwm_id=$2
bam_name=$3

thresh_pValue="1e-4"
flank=100
ver_genome="hg19"

pwm_name="${tf_name}_${pwm_id}_${thresh_pValue}"

dir_code=$HOME/projects/ATAC-seq/ATAC-seq_workflow/code_RCC
dir_tagcounts=/project/mstephens/ATAC_DNase/ATAC-seq_Olivia_Gray/results/ATAC-seq_tagcounts/
dir_count_matrix=/project/mstephens/ATAC_DNase/ATAC-seq_Olivia_Gray/results/ATAC-seq_count_matrix/${pwm_name}/
dir_sites=/project/mstephens/ATAC_DNase/motif_sites_JASPAR2018/candidate_sites/${thresh_pValue}/
filename_sites=${dir_sites}/${pwm_name}_flank${flank}_fimo_sites.bed

echo "PWM name: ${pwm_name}"
echo "Candidate sites: ${filename_sites}"
echo "Bam filename: ${bam_name}"

## Extract counts around motif matches into tagcount matrices
## Check the existence of tagcount bigwig files and candidate sites
bam_basename=$(basename ${bam_name} .bam)
bw_file="${dir_tagcounts}/${bam_basename}.5p.0base.tagcount"
file_tagcount_fwd=${bw_file}.fwd.bw
file_tagcount_rev=${bw_file}.rev.bw
echo "Tagcount fwd name: ${file_tagcount_fwd}"
echo "Tagcount rev name: ${file_tagcount_rev}"

if [ ! -s ${file_tagcount_fwd} ] || [ ! -s ${file_tagcount_rev} ]; then
  echo "ERROR: Tagcount bigwig files for: ${bam_name} does not exist!"
elif [ ! -s ${filename_sites} ]; then
  echo "ERROR: Candidate sites ${filename_sites} does not exist!"
else

  echo "Counting cuts around the candidate sites of ${pwm_name} with ${flank}bp flanking windows ..."

  mkdir -p ${dir_count_matrix}

  file_count_matrix_fwd="${dir_count_matrix}/${pwm_name}_${bam_basename}_fwdcounts.m"
  file_count_matrix_rev="${dir_count_matrix}/${pwm_name}_${bam_basename}_revcounts.m"

  ## count matrix around the candidate sites
  echo "match matrix of ${file_count_matrix_fwd} ..."
  cut -f 1-4 ${filename_sites} | bwtool extract bed stdin ${file_tagcount_fwd} ${file_count_matrix_fwd} -fill=0 -decimals=0 -tabs

  echo "match matrix of ${file_count_matrix_rev} ..."
  cut -f 1-4 ${filename_sites} | bwtool extract bed stdin ${file_tagcount_rev} ${file_count_matrix_rev} -fill=0 -decimals=0 -tabs

  Rscript ${dir_code}/rev_tagcount_bwtool.R ${file_count_matrix_fwd} ${file_count_matrix_rev} ${filename_sites}

  # compress the files to .gz
  gzip -f ${file_count_matrix_fwd} ${file_count_matrix_rev}
  echo "Finished counting for ${pwm_name} in ${bam_name}."

fi

run


## bam files
bamfiles=("H1_nomito_rdup.bam" "H2_nomito_rdup.bam" "H3_nomito_rdup.bam" "N1_nomito_rdup.bam" "N2_nomito_rdup.bam" "N3_nomito_rdup.bam")

## CTCF
for bam_name in "${bamfiles[@]}"
do
   echo "${bam_name}"
   sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/get_motif_count_matrices.sh CTCF MA0139.1 "${bam_name}"
done

## HIF1A::ARNT
for bam_name in "${bamfiles[@]}"
do
   echo "${bam_name}"
   sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/get_motif_count_matrices.sh HIF1A::ARNT MA0259.1 "${bam_name}"
done


## HIF1A
for bam_name in "${bamfiles[@]}"
do
   echo "${bam_name}"
   sbatch ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/get_motif_count_matrices.sh HIF1A MA1106.1 "${bam_name}"
done


## MEF2D
for bam_name in "${bamfiles[@]}"
do
   echo "${bam_name}"
   sbatch --mem=15G ~/projects/ATAC-seq/ATAC-seq_workflow/code_RCC/get_motif_count_matrices.sh MEF2D MA0773.1 "${bam_name}"
done

Session information

sessionInfo()

R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS  10.13.4

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] workflowr_1.0.1   Rcpp_0.12.14      digest_0.6.13    
 [4] rprojroot_1.3-2   R.methodsS3_1.7.1 backports_1.1.2  
 [7] git2r_0.21.0      magrittr_1.5      evaluate_0.10.1  
[10] stringi_1.1.5     whisker_0.3-2     R.oo_1.21.0      
[13] R.utils_2.6.0     rmarkdown_1.9     tools_3.3.3      
[16] stringr_1.2.0     yaml_2.1.16       htmltools_0.3.6  
[19] knitr_1.18

This reproducible R Markdown analysis was created with workflowr 1.0.1

ATAC-seq bam files to ATAC-seq data matrices

Kaixuan Luo

6/13/2018

Step 0: Download JASPAR motif database

Step 1: Find TF motif matches using FIMO

run

Step 2: Get TF candidate binding sites

run

Step 3: Count ATAC-seq genome-wide cleavage, and build tagcount bigwig file

run

Step 4: match ATAC-seq tagcount matrices for each motif

run

Session information