![Page 1: Computational Pipeline for ChIP-SeqData Analysiscbsu.tc.cornell.edu › lab › doc › CHIPseq_workshop_20150504_lecture… · Computational Pipeline for ChIP-SeqData Analysis Minghui](https://reader033.vdocuments.mx/reader033/viewer/2022060403/5f0e9e767e708231d4401c04/html5/thumbnails/1.jpg)
Computational Pipeline for
ChIP-Seq Data Analysis
Minghui Wang, Qi Sun
Bioinformatics Facility
Institute of Biotechnology
Outline
ChIP-Seq experimental design
Data analysis
• Sequencing data evaluation
• Peak calling & evaluation
• GLM model for multiple replicates
Downstream analysis
• Peak annotation
• Function enrichment
ChIP-seq workflow
Mardis ER (2007) ChIP-seq: welcome to the new frontier. Nature Methods 4:613-614
Controls for ChIP-seq
Most experimental protocols involve a control sample that is processed the same way as the test sample, except that either the immunoprecipitation step is omitted or no specific antibody is used.
• Input DNA does not show a “flat” or random (Poisson) distribution.
• Open chromatin regions tend to be fragmented more easily during shearing.
• Amplification bias.
• Mapping artifacts: increased coverage of more “mappable” regions (which also tend to be promoter regions) and of repetitive regions, due to inaccuracies in the number of copies in the assembled genome.
Input DNA & IgG
ChIP-Seq experimental design
Replicates Single sample
Biological replicate
Technical replicate
Data analysis protocol
Illumina raw sequencing data
Quality metrics of sequencing reads
Reads mapping
Quality metrics of read counts (strand cross-correlation &…)
Peak calling
Assessment of reproducibility for biological replicates (IDR)
Significant and reproducible ChIP-seq peaks & replicate-specific peaks
Downstream analysis (peak annotation, motif analysis)
Quality metrics of sequencing reads
• FastQC can be used for an overview of the data quality
• Phred quality scores are used for trimming low-quality bases
• P = 10^(-Q/10); at Q = 30, a base is called incorrectly 1 time in 1000
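The Phred relation above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the Phred+33 ASCII offset is an assumption about how the FASTQ quality string is encoded (it is the usual encoding for recent Illumina data, but check your files).

```python
import math

def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Inverse relation: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality_string(qual, offset=33):
    """Decode a FASTQ quality string into Phred scores.

    Assumption: the string uses the Phred+33 ASCII encoding.
    """
    return [ord(ch) - offset for ch in qual]

print(phred_to_error_prob(30))        # about 1 error in 1000 base calls
print(decode_quality_string("II?5"))  # [40, 40, 30, 20]
```

A trimming tool would typically cut a read where the decoded scores fall below a chosen threshold such as Q = 20.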
Reads mapping
Most popular software: Bowtie, BWA, MAQ, etc.
bowtie2-build <input> <output name>
bowtie2 -x <output name> {-1 <m1> -2 <m2> | -U <r>} -p 8 -S [<hit>]
<input>: reference genome
<output name>: output base name
-1 <m1> -2 <m2> | -U <r>: paired-end or single-end reads
-p 8: CPU cores
-S [<hit>]: output SAM file
• Reads with multiple mapping hits were discarded
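The slide does not show how multi-mapping reads were discarded. One common approach, sketched below, relies on the optional `XS:i` field that Bowtie2 writes when it finds a second valid alignment for a read. The function names are hypothetical, and using `XS:i` as the criterion is an assumption rather than the workshop's documented method.

```python
def is_unique_alignment(sam_line):
    """Return True for a mapped read that carries no XS:i tag."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 4:  # SAM flag 0x4: read is unmapped
        return False
    # Optional TAG:TYPE:VALUE fields start at column 12
    return not any(f.startswith("XS:i:") for f in fields[11:])

def filter_unique(sam_lines):
    """Yield header lines and uniquely aligned records only."""
    for line in sam_lines:
        if line.startswith("@") or is_unique_alignment(line):
            yield line
```

Filtering on a mapping-quality cutoff is another common choice and may give slightly different results.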
Reference genome; FASTA format: for each sequence, a “>name” header line followed by the sequence
Illumina raw data; FASTQ format: 4 lines per read (“@name”, sequence, “+”, quality string)
SAM output
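The 4-line FASTQ layout described above can be read with a short generator. This is a minimal sketch with a hypothetical function name. It assumes strictly 4 lines per record (no wrapped sequences) and an uncompressed text handle.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) for each 4-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].rstrip("\n"), seq, qual

# Toy input with two reads
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nII?5\n")
records = list(read_fastq(example))
print(records)
```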
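Quality Control
• Nonredundant fraction (NRF): the number of distinct genomic positions covered by uniquely mapped reads, divided by the total number of uniquely mapped reads
ENCODE recommends a target of NRF ≥ 0.8 for 10 million uniquely mapped reads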
sh run_bam2bed.sh
perl CalbedNRF.pl rep5_D12K4.txt_trim_uniq_sorted_bamtobed.bed
https://github.com/mel-astar/mel-ngs/tree/master/mel-chipseq/chipseq-metrics
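The NRF calculation can be illustrated on BED records of uniquely mapped reads. The sketch below is not the CalbedNRF.pl script referenced above, whose internals are not shown here. It assumes 6-column BED input and treats (chromosome, start, strand) as the read position, and the function name is hypothetical.

```python
def nonredundant_fraction(bed_lines):
    """NRF = distinct read positions / total reads.

    Assumption: a position is identified by (chrom, start, strand),
    and the strand is in BED column 6 when present.
    """
    total = 0
    distinct = set()
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and BED header lines
        fields = line.rstrip("\n").split("\t")
        chrom, start = fields[0], int(fields[1])
        strand = fields[5] if len(fields) > 5 else "+"
        distinct.add((chrom, start, strand))
        total += 1
    return len(distinct) / total if total else 0.0

reads = [
    "chr1\t100\t150\tr1\t42\t+",
    "chr1\t100\t150\tr2\t42\t+",  # duplicate of r1
    "chr1\t100\t150\tr3\t42\t-",
    "chr2\t5\t55\tr4\t42\t+",
]
print(nonredundant_fraction(reads))  # 0.75
```

In this toy input, three of four reads occupy distinct positions, so the library would fall just below the 0.8 target.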
Quality Control
Landt SG (2012) ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res 22:1813-1831
Cross-correlation
• DNA fragments from a chromatin immunoprecipitation experiment are sequenced from the 5' end.
• With ChIP-seq, aligning the reads to the genome results in two peaks (one on each strand) that flank the binding site of the protein or nucleosome of interest.
• The distance between the strand-specific peaks (k) represents the average sequenced fragment length.
Wilbanks EG (2010) Evaluation of Algorithm Performance in ChIP-Seq Peak Detection. PLoS ONE 5:e11471
Cross-correlation
Landt SG (2012) ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res 22:1813-1831
Cross-correlation
Bailey, et al (2013). Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data, PLOS Computational Biology
Strand cross-correlation is computed as the Pearson correlation between the positive and the negative strand profiles at different strand shift distances, k.
Plot labels: fragment length, read length
https://sites.google.com/a/brown.edu/bioinformatics-in-biomed/spp-r-from-chip-seq
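The definition above can be made concrete with a small pure-Python sketch. It is not the spp implementation. It assumes each strand profile is a list of per-base counts of read 5' ends along one chromosome, and the function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant profile
    return sxy / sqrt(sxx * syy)

def strand_cross_correlation(plus, minus, max_shift):
    """Correlate plus[i] with minus[i + k] for k = 0..max_shift."""
    n = len(plus)
    return [pearson(plus[: n - k], minus[k:]) for k in range(max_shift + 1)]

# Toy data: two binding sites, strand peaks 60 bp apart
plus = [0] * 400
minus = [0] * 400
for site in (100, 250):
    plus[site - 30] += 5
    minus[site + 30] += 5

cc = strand_cross_correlation(plus, minus, 100)
best_k = max(range(len(cc)), key=cc.__getitem__)
print(best_k)  # 60
```

On real data the profile is noisy and usually shows a second, smaller peak at the read length in addition to the peak at the fragment length.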
Cross-correlation
NSC values < 1.05 and RSC values < 0.8 indicate low signal-to-noise; ENCODE repeats replicates that fall below these values
Landt SG (2012) ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res 22:1813-1831
http://code.google.com/p/phantompeakqualtools/
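As a rough sketch of how the two metrics are derived from a cross-correlation profile: following Landt et al. (2012), NSC is the cross-correlation at the fragment-length shift divided by the minimum cross-correlation, and RSC is the fragment-length value minus the minimum, divided by the read-length ("phantom") value minus the minimum. The function name and the toy profile below are hypothetical. This is not the phantompeakqualtools code.

```python
def nsc_rsc(cc, fragment_shift, read_length_shift):
    """Compute (NSC, RSC) from a cross-correlation profile.

    cc[k] is the cross-correlation at strand shift k.
    """
    cc_min = min(cc)
    cc_frag = cc[fragment_shift]
    cc_read = cc[read_length_shift]
    if cc_min == 0 or cc_read == cc_min:
        raise ValueError("profile too flat to compute NSC/RSC")
    nsc = cc_frag / cc_min
    rsc = (cc_frag - cc_min) / (cc_read - cc_min)
    return nsc, rsc

# Toy profile indexed by shift (real profiles span hundreds of bp)
toy_cc = [0.20, 0.25, 0.22, 0.30, 0.21]
nsc, rsc = nsc_rsc(toy_cc, fragment_shift=3, read_length_shift=1)
flags = (nsc < 1.05, rsc < 0.8)  # thresholds quoted on the slide
print(nsc, rsc, flags)
```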
Peak calling software
• MACS → Yong Zhang et al
• cisGenome → Hongkai Ji et al
• spp → Peter Park et al
• Rbrads → Julie Ahringer et al
• BayesPeak → Simon Tavaré et al
• …
R environment
Step 1 of MACS2: estimating fragment length d
Slide a window of size 2 × BANDWIDTH across the genome; BANDWIDTH is based on the sonication size
Keep top regions with MFOLD enrichment of treatment vs. control
Plot average +/- strand read densities → estimate d
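The last step, reading d off the averaged strand densities, can be illustrated very roughly as the distance between the modes of the two strand distributions. The sketch below is a simplification and not the MACS2 code. It assumes read 5' positions have already been expressed as offsets from the center of each top region, and the function name is hypothetical.

```python
from collections import Counter

def estimate_d(plus_offsets, minus_offsets):
    """Distance between the modes of the + and - strand read offsets.

    Offsets are read 5' ends relative to the center of each enriched
    region, pooled over all top regions (an assumed input layout).
    """
    plus_mode = Counter(plus_offsets).most_common(1)[0][0]
    minus_mode = Counter(minus_offsets).most_common(1)[0][0]
    return minus_mode - plus_mode

# Toy offsets: + strand reads pile up upstream, - strand downstream
plus_offsets = [-52, -50, -50, -50, -47, -45]
minus_offsets = [45, 50, 50, 50, 53]
print(estimate_d(plus_offsets, minus_offsets))  # 100
```

With real data the densities would need smoothing before taking the mode, since raw per-base counts are noisy.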
Step 2 of MACS2: identification of the local noise parameter
Shift all reads by d/2
Slide a window of size 2*d across treatment and input
Estimate parameter λlocal of the Poisson distribution
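To make the local noise parameter concrete: in the MACS paper (Zhang et al. 2008), λlocal is the maximum of the genome-wide background rate and the rates estimated from the control in windows of 1 kb, 5 kb and 10 kb around the candidate peak, and enrichment is then tested against a Poisson distribution with that rate. The sketch below follows that description. The function names, the input layout and the toy numbers are assumptions, and this is not the MACS2 source.

```python
import math

def lambda_local(lambda_bg, control_counts, width, scale=1.0):
    """lambda_local = max(lambda_BG, lambda_1k, lambda_5k, lambda_10k).

    control_counts maps a window size in bp to the number of control
    reads in that window; each count is rescaled to the candidate
    region width. scale adjusts for sequencing depth differences.
    """
    candidates = [lambda_bg]
    for window, count in control_counts.items():
        candidates.append(scale * count * width / window)
    return max(candidates)

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by direct summation.

    Loses precision for very small tail probabilities.
    """
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

lam = lambda_local(2.0, {1000: 30, 5000: 60, 10000: 80}, width=200)
print(lam)                  # 6.0, driven by the 1 kb control window
print(poisson_sf(20, lam))  # p-value for 20 treatment reads in the region
```

Taking the maximum makes the test conservative in regions where the control itself shows local enrichment.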
Step 3 of MACS2: peak identification
Zhang Y (2008) Model-based Analysis of ChIP-Seq (MACS). Genome Biology 9:R137
http://liulab.dfci.harvard.edu/MACS/index.html
Command of MACS2
macs2 callpeak -t ChIP.bam -c Control.bam -f BAM -g hs -n test -B -q 0.01
-t: treatment file
-c: control file
-f: file format
-g: genome size
-n: output name
-q: FDR cutoff
Options:
-s TSIZE, --tsize=TSIZE
-m MFOLD, --mfold=MFOLD
--bw=BW
--nomodel
--shiftsize=SHIFTSIZE
Output of MACS2
Input files and parameter settings
Peaks information
Output of MACS2
Enrichment score (fold-change)
-log10(p-value)
-log10(q-value)
Summit position relative to peak start
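The four labelled columns match the last four fields of the 10-column narrowPeak format that MACS2 writes. Assuming that is the file shown on the slide, a peak record can be parsed as sketched below. The class and function names are hypothetical, and the example line is made up for illustration.

```python
from typing import NamedTuple

class Peak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    fold_enrichment: float  # column 7
    neg_log10_p: float      # column 8
    neg_log10_q: float      # column 9
    summit_offset: int      # column 10, relative to start

def parse_narrowpeak_line(line):
    """Parse one tab-separated narrowPeak record."""
    f = line.rstrip("\n").split("\t")
    return Peak(f[0], int(f[1]), int(f[2]), f[3],
                float(f[6]), float(f[7]), float(f[8]), int(f[9]))

def summit_position(peak):
    """Absolute summit coordinate = peak start + summit offset."""
    return peak.start + peak.summit_offset

def passes_fdr(peak, q_cutoff=0.01):
    """True when the peak's q-value is at or below the cutoff."""
    return 10 ** (-peak.neg_log10_q) <= q_cutoff

# Made-up record for illustration only
line = "chr1\t1000\t1400\ttest_peak_1\t250\t.\t8.5\t30.2\t25.1\t180"
peak = parse_narrowpeak_line(line)
print(summit_position(peak), passes_fdr(peak))  # 1180 True
```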
Consistency of replicates: IDR
• IDR: the irreproducible discovery rate
• Each list of peaks is ranked according to p-value or signal score
• The IDR method uses the bivariate rank distribution over the replicates to separate signal from noise, based on the consistency and reproducibility of identifications
Rscript batch-consistency-analysis.r [peakfile1] [peakfile2] -1 [outfile.prefix] 0 F p.value
Rscript batch-consistency-plot.r [npairs] [output.prefix] [input.file.prefix1] [input.file.prefix2] [input.file.prefix3]
IDR
Landt SG (2012) ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res 22:1813-1831
Peak region merging
Rep1
Rep2
Rep3
Final peak list
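The slide does not state the merging rule. One plausible reading, sketched below, is to take the union of overlapping peak regions across replicates. The function name and the toy coordinates are hypothetical, and the overlap rule is an assumption.

```python
def merge_peaks(*replicates):
    """Merge overlapping peak regions from any number of replicates.

    Each replicate is a list of (chrom, start, end) tuples. Regions on
    the same chromosome that overlap or touch are fused into one.
    """
    intervals = sorted(p for rep in replicates for p in rep)
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged

rep1 = [("chr1", 100, 200), ("chr1", 500, 600)]
rep2 = [("chr1", 150, 250)]
rep3 = [("chr1", 240, 300), ("chr2", 100, 200)]
final_peaks = merge_peaks(rep1, rep2, rep3)
print(final_peaks)
# [('chr1', 100, 300), ('chr1', 500, 600), ('chr2', 100, 200)]
```

A stricter variant would keep only merged regions supported by at least two replicates.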
Multiple replicates
R scripts for replicates
Comparing pairs
Young IP vs. control; old IP vs. control; and young vs. old