Genomic analysis by a biologist: ChIP-Seq

analysis walk-thru

IGV Browser depicting read density for all 4 samples, and MACS2 predicted peaks

It's not my purpose to simply repeat what others have published in refereed articles, but to provide a step-by-step description of how I have conducted ChIP-Seq analysis. At the end of this post I have listed various references.

Also, I am not assuming everyone has access to a Linux server by which to conduct the analysis described below, so I want to point out the freely available web resource Galaxy.

Step 1: Download GEO dataset data.
As an example I am on using 2 replicates from an ER-ChIP experiment:

Mohammed H, Russell IA, Stark R, Rueda OM et al. Progesterone receptor modulates ERα action in breast cancer. Nature 2015 Jul 16;523(7560):313-7. PMID: 26153859

I downloaded 4 separate ChIP-seq sample SRA files using "prefetch" of SRA Toolkit:
(1) JC1532_MCF7_Input_Full_Media_3hr r1; Homo sapiens; ChIP-Seq
http://www.ncbi.nlm.nih.gov/sra?term=SRX1012634
prefetch SRR2000750

(2) JC1523_MCF7_ER_Full_Media_3hr r1; Homo sapiens; ChIP-Seq)
http://www.ncbi.nlm.nih.gov/sra?term=SRX1012625
prefetch SRR2000732
prefetch SRR2000733

(3) JC1542_MCF7_Input_Full_Media_3hr r2; Homo sapiens; ChIP-Seq
http://www.ncbi.nlm.nih.gov/sra?term=SRX1012635
prefetch SRR2000751

(4) JC1533_MCF7_ER_Full_Media_3hr r2; Homo sapiens; ChIP-Seq
http://www.ncbi.nlm.nih.gov/sra?term=SRX1012626
prefetch SRR2000734
prefetch SRR2000735

Following download of the SRA files, I then used "fastq-dump" of SRAtools to 'un-pack' them into fastq files. Note, for those samples that had multiple SRA files, I simply concatenated them together once they were unpacked fastq files (e.g., Linux command line: "cat SRR2000734.fastq SRR2000735.fastq > SRR2000734_SRR2000735.fastq")

Step 2: Align fastq files to respective genome.
Because I plan on using MACS to determine significant peaks, I aligned using bowtie and the 'default MACS' parameters specified within the Nature Protocols paper (reference below).

For each of the the 4 samples:

bowtie -m 1 -p 8 -S -q /path/to/bowtie_index/hg19 /path/to/SRR2000750.fastq /path/to/bowtie/SRR2000750.sam
bowtie -m 1 -p 8 -S -q /path/to/bowtie_index/hg19 /path/to/SRR2000732_SRR2000733.fastq /path/to/bowtie/SRR2000732_SRR2000733.sam
bowtie -m 1 -p 8 -S -q /path/to/bowtie_index/hg19 /path/to/SRR2000751.fastq /path/to/bowtie/SRR2000751.sam
bowtie -m 1 -p 8 -S -q /path/to/bowtie_index/hg19 /path/to/SRR2000734_SRR2000735.fastq /path/to/bowtie//SRR2000734_SRR2000735.sam

parameter definitions (see http://bowtie-bio.sourceforge.net/manual.shtml)
-m 1
"-m <int> Suppress all alignments for a particular read or pair if more than <int> reportable alignments exist for it... "Throw away reads that align in multiple locations."
-p 8
"-p/--threads <int> Launch <int> parallel search threads (default: 1)... "Here, using 8 processors for alignment process."
-S
Print alignments in SAM format... "Here, printing output to SAM, which is the human readable format. I will subsequently convert to its binary form, BAM."
-q
The query input files (specified either as <m1> and <m2>, or as <s>) are FASTQ files (usually having extension .fq or .fastq).... That is, the index to which you are aligning your sequence reads.

Step 3: Convert SAM files to sorted BAM
I obtained the commands from the very useful blog http://davetang.org/wiki/tiki-index.php?page=SAMTools (also see http://www.htslib.org/doc/samtools.html). I created sorted BAM files for deriving bedGraph and TDF files for viewing within the IGV browser.

samtools view -bS SRR2000750.sam | samtools sort - sorted_SRR2000750
samtools view -bS SRR2000732_SRR2000733.sam | samtools sort - sorted_SRR2000732_SRR2000733
samtools view -bS SRR2000751.sam | samtools sort - sorted_SRR2000751
samtools view -bS SRR2000734_SRR2000735.sam | samtools sort - sorted_SRR2000734_SRR2000735

Note, ".bam" will be automatically appended to your output file, which is why it is left off of the "sorted_sample" output parameter.

Step 4: ChIP-enrichment assessment.
CHANCE (CHip-seq ANalytics and Confidence Estimation) is a tool that derives 'genome-wide' indication of ChIP-enrichment by analyzing various file formats (see reference below). For this current ChIP-seq walk-thru, I uploaded my sorted BAM files, comparing ChIP-ER verses Input for "r1" and "r2".

CHANCE ANALYSIS 'r1' = sorted_SRR2000732_SRR2000733.bam vs. sorted_SRR2000750

The first replicate, "r1", was determined to be statistically-enriched

Good "Cumulative percentage enrichment" with good 'separation between the ER-IP and Input

CHANCE ANALYSIS 'r2' = sorted_SRR2000734_SRR2000735.bam vs. sorted_SRR2000751.bam

"r2" did not quite get to default FDR (although, a FDR of 0.057 and 0.053 (tfbs normal and cancer, respectively) is quite low?)

Bad 'separation between the ER-IP and Input

Step 5: Derive files viewable in the IGV browser.
Using both bedtools (specifically, "genomecov") to create a bedGraph files:

bedtools genomecov -bg -ibam sorted_SRR2000750.bam -g /path/to/hg19.chrom.sizes > sorted_SRR2000750.bedGraph
bedtools genomecov -bg -ibam sorted_SRR2000732_SRR2000733.bam -g /path/to/hg19.chrom.sizes > sorted_SRR2000732_SRR2000733.bedGraph
bedtools genomecov -bg -ibam sorted_SRR2000751.bam -g /path/to/hg19.chrom.sizes > sorted_SRR2000751.bedGraph
bedtools genomecov -bg -ibam sorted_SRR2000734_SRR2000735.bam -g /path/to/hg19.chrom.sizes > sorted_SRR2000734_SRR2000735.bedGraph

followed by igvtools to derive a TDF file for viewing in the IGV browser:

igvtools toTDF -z 7 sorted_SRR2000750.bedGraph sorted_SRR2000750.tdf /path/to/hg19.chrom.sizes
igvtools toTDF -z 7 sorted_SRR2000732_SRR2000733.bedGraph sorted_SRR2000732_SRR2000733.tdf /path/to/hg19.chrom.sizes
igvtools toTDF -z 7 sorted_SRR2000751.bedGraph sorted_SRR2000751.tdf /path/to/hg19.chrom.sizes
igvtools toTDF -z 7 sorted_SRR2000734_SRR2000735.bedGraph sorted_SRR2000734_SRR2000735.tdf /path/to/hg19.chrom.sizes

Step 6: Deriving significantly-enriched peaks using MACS2
For each ChIP-IP sample and their respective Input control, I used the updated version of MACS, MACS2, with default "regular" (non-broad peak) parameters.

1
2

macs2 callpeak -t /path/to/sorted_SRR2000732_SRR2000733.bam -c /path/to/sorted_SRR2000750.bam -f BAM -g hs -n /path/to/MACS2/SRR2000732_SRR2000733_vs_SRR2000750 -B -q 0.01
macs2 callpeak -t /path/to/sorted_SRR2000734_SRR2000735.bam -c /path/to/sorted_SRR2000751.bam -f BAM -g hs -n /path/to/MACS2/SRR2000734_SRR2000735_vs_SRR2000751 -B -q 0.01

parameter definitions
-t/--treatment FILENAME
your IP sample reads
-c/--control
your "control" reads, e.g., Input or IgG
-f/--format FORMAT
BAM in this example
-g/--gsize
hg for human genome
-n/--name
specify the output folder and output filename
-B/--bdg
store fragment pileup in a bedGraph file, currently is really unnecessary for we did this in a previous step
-q/--qvalue
q-value, FDR, cut-off

Step 7: Comparing replicate samples
There are various ways biological replicates, and even different binding site data (e.g., 2 different transcription factors) can be compared. One can use either a basic "peak overlap" approach (i.e., sample genomic intervals overlap), or a more sophisticated "statistical" approach involving distributional assessments (i.e., read alignment data).

"Peak overlap" approach:
Using the "intersect" module of bedtools, one can easily determine overlaps between 2 BED files using numerous criteria. In the example below, I am comparing the 2 BED files of MACS2 derived peaks for r1 and r2.
parameter definitions
-a
bed file 1
-b
bed file 2
-wo
"Write the original A and B entries plus the number of base pairs of overlap between the two features. Only A features with overlap are reported."
-f 0.5: "Minimum overlap required as a fraction of A," here, a proportion of -a's genomic interval.
-r
"Require that the fraction of overlap be reciprocal for A and B," here, since -f is 0.50 and -r is used, B must overlap at least 50% of A and that A also overlaps at least 50% of B.

Note, the ">" is simply a 'pipe out' of the results (i.e., without the ">", the results would simple go to "standard out" = the monitor).

1	bedtools intersect -a MACS2_peaks_ER_vs_Input_r1.bed -b MACS2_peaks_ER_vs_Input_r2.bed -wo -f 0.50 -r >intersect_Both-50%_r1andr2.txt

Using this command, it was determined that 16,535 of the 48,550 and 22,653 peaks for r1 and r2, respectively, overlap by at least 50%. If r1 and r2 represent biological replicates, these 16,535 overlapping regions represent higher confidence peaks as they are observed in both replicates.

If one would like to associate ChIP-seq peaks with specific genomic locations (e.g., proximal gene promoters), one can use bedtools intersect. That is, derive the genomic coordinates for the proximal promoter of all the genes in the genome, and use bedtools intersect to simply compare the ChIP BED and proximal gene promoter BED files.

To derive a proximal gene promoter BED file, I have obtained gene coordinate data from 2 separate resources, Ensembl's Biomart and UCSC table browser. You can easily process the coordinate data within Excel to obtain proximal gene promoter coordinates. An IMPORTANT note, if you "right-click" import text files with gene names into Excel many of the genes names will be changed. For example, official gene symbols like "MARCH1" will be changed to "1-Mar", because Excel assumes you're importing a "date" not a "text". This issue is easily avoidable by going through the steps of "File Import" as depicted in the screengrab below.

Sophisticated statistical approach:
MAnorm allows the 'statistical' comparison of ChIP-seq peaks between replicates or treatment samples. Essentially, they us a MA plot and regression normalization procedure to adjust for differing signal-to-background noise (S/N) intensities between samples, allowing a "fair" comparison and determination of common and unique peaks. See Shao et a. 2012 Genome Biol. 13(3):R16 for details.

For example, here are the plots I receive for the current analysis:

For MAnorm analysis, one needs peak and 'read' bed files for both samples. The peak file is simply grabbed from the MACS output (peak chromosome, start, end). The read file must be derived from the original SAM file (chromosome, start, end, strand (+/-)). See MAnorm tutorial for further details.

To obtain the read files, I created a very simple Python script. Each read mapping result are located on a single line within the SAM file, specifying various attributes, including mapping location, orientation, quality, etc. Various websites explain the SAM format (e.g., Dave's Wiki, UMich Center for Statistical Genetics), as well as a SAM tag decoder) For our purposes, we want to limit our reads to either tag "0" (read mapped, sense strand) or "16" (read mapped, reverse strand). Note that my SAM output includes a "chr" in front of each 'reference sequence name' (e.g., chr12). My Python script relies on this "chr" attribute (see below). The "chr" is absent within the SAM samples displayed on the two websites I reference, and thus, my Python script would not work against these SAMs.

#! /usr/bin/env python3.3

filename = "file.sam"

for line in open(filename, 'r'):
 if "chr" in line:  
  temp = line.split()
  if temp[1] == "16":
   print(temp[2], temp[3], (int(temp[3]) + len(temp[9])) - 1, temp[1].replace("16", "-"))
  elif temp[1] == "0": 
   print(temp[2], temp[3], (int(temp[3]) + len(temp[9])) - 1, temp[1].replace("0", "+"))

Output snippet:

chr2 58384945 58384984 +

chr11 32143524 32143563 +

chr7 148706751 148706790 +

chr17 39572309 39572348 +

chr16 25131108 25131147 -

chr10 9080363 9080402 +
chr11 41005743 41005782 +

This script simply goes through the SAM file line-by-line, and lines in which "chr" is present, splits the line content into a 'list', from which I print the desired information only if the second element (temp[1]) is either 16 (+ strand) or 0 (- strand).

With my read files in hand, I simply invoke the shell script as outlined in the MAnorm R_tutorial.

1	./MAnorm.sh SRR2000732_SRR2000373_peaks.bed SRR2000734_SRR2000375_peaks.bed reads_SRR2000732_SRR2000733.bed reads_SRR2000734_SRR2000735.bed 226 181

This will produce many files, the primary of which is "MAnorm_result.xls". This file provides descriptions, rescaled M and A values, and associated -log10(p-values). Utilizing the suggested criteria outlined in the MAnorm R_tutorial (from "classfy_by_M_value.sh"):

Unique r1 peaks (non-concordant) = M-values > 1 and -log10(p-value) > 5
Unique r2 peaks (non-concordant) = M-values < -1 and -log10(p-value) > 5
Unbiased peaks (concordant peaks) = M-values between -0.5 and +0.5

I obtain ~45K unbiased/ concordant peaks from this analysis.

blue = unique r1, red = unique r2, green = concordant

This simple plot from R was made as follows:

ma <- read.table("MAnorm_MA_plot.txt", header=TRUE)
with(ma, plot(ma$A_value_rescaled, ma$M_value_rescaled, pch=20, main="MA plot of r1 and r2"))
with(subset(ma, negative_LOG10_pvalue>=5 & M_value_rescaled>=1), points(A_value_rescaled, M_value_rescaled, pch=20, col="blue"))
with(subset(ma, negative_LOG10_pvalue>=5 & M_value_rescaled<=-1), points(A_value_rescaled, M_value_rescaled, pch=20, col="red"))
with(subset(ma, M_value_rescaled>=-0.5 & M_value_rescaled<=0.5), points(A_value_rescaled, M_value_rescaled, pch=20, col="green"))

The concordant peaks represent those I would both report and verify with ChIP-PCR (i.e., based upon the apparent agreement between replicate samples).

References:
(1) Bailey et al. 2013 Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Comput Biol. 9(11):e1003326 http://www.ncbi.nlm.nih.gov/pubmed/24244136

(2) Landt et al. 2012. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 22(9):1813-31. http://www.ncbi.nlm.nih.gov/pubmed/22955991

(3) Feng et. al.2012. Identifying ChIP-seq enrichment using MACS. Nat Protoc. 7(9):1728-40
http://www.ncbi.nlm.nih.gov/pubmed/22936215

(4) Diaz et al. 2012. CHANCE: comprehensive software for quality control and validation of ChIP-seq data. Genome Biol. 13(10):R98.
http://www.ncbi.nlm.nih.gov/pubmed/23068444

(5) Shao et a. 2012. MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets. Genome Biol. 13(3):R16.
http://www.ncbi.nlm.nih.gov/pubmed/22424423

Genomic analysis by a biologist

ChIP-Seq

No comments:

Post a Comment