macs course

ChIP-seq analysis Luca Cozzuto Bioinformatics Core

Upload: luca-cozzuto

Post on 08-May-2015


TRANSCRIPT

Page 1: Macs course

ChIP-seq analysis

Luca Cozzuto, Bioinformatics Core

Page 2: Macs course

ChIP-seq analysis

• ChIP-seq is the combination of chromatin immunoprecipitation (ChIP) with ultra-high-throughput sequencing.

• It allows the detection of genomic regions bound by proteins such as:

  • Transcription factors
  • Histones
  • Polymerase II
  • …

Page 3: Macs course

ChIP-seq analysis: Typical workflow

Page 4: Macs course

ChIP-seq analysis: Typical workflow

Page 5: Macs course

ChIP-seq analysis: Starting the analysis

• Typically you will receive 10 to 30 million raw reads per sample, corresponding to a compressed file of 0.5-1.5 GB.

FASTQ format

@HWUSI-EAS621:69:64EKPAAXX:3:1:11477:1265 1:N:0:    @ (HEADER)
GAAACTTGAGGACTGCCCAGCTCGACAGACACTGGA    (SEQUENCE)
+    + (HEADER)
GEGGDGG@GGDGGGGGGGBDGGDG8GG@3D6:3:67    (QUALITY)

The quality of each base is encoded as an ASCII character and represents the Phred quality score:

$Q = -10 \log_{10} p$

where p = probability that the base call is incorrect. Q = 20 means a base call accuracy of 99%.
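As a rough illustration of the Phred formula, the sketch below decodes the quality string of the example read into scores and error probabilities. The ASCII offset of 33 is an assumption (the slide does not state the encoding), and the function names are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption about how the file was encoded.
    """
    return [ord(char) - offset for char in quality]


def error_probability(q):
    """Invert Q = -10 * log10(p): probability that the base call is wrong."""
    return 10 ** (-q / 10)


# Quality line of the example read shown on the slide.
quality = "GEGGDGG@GGDGGGGGGGBDGGDG8GG@3D6:3:67"
scores = phred_scores(quality)

# Print the first three characters with their score and error probability.
for char, q in list(zip(quality, scores))[:3]:
    print(char, q, error_probability(q))
```

With this offset the character "G" corresponds to Q = 38, so most bases of the example read are of high quality and only the last few drop to around Q = 20.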

Page 6: Macs course

ChIP-seq analysis: Starting the analysis

• It is strongly recommended to check the quality of the sequences received before starting the analysis!

FastQC analysis

Page 7: Macs course

ChIP-seq analysis: Starting the analysis

Reads are mapped to the reference genome using ultra-fast mappers:

• GEM
• Bowtie
• BWA
• Stampy

The reference genome must be indexed before mapping.

Page 8: Macs course

ChIP-seq analysis: Peak calling – MACS

Model-based Analysis of ChIP-Seq data.

TF

Page 9: Macs course

ChIP-seq analysis: Peak calling – MACS

Sequences from IP

TF

Page 10: Macs course

ChIP-seq analysis: Peak calling – MACS

TF

Sequences from IP

Sequenced tags on the + strand and on the - strand

Page 11: Macs course

ChIP-seq analysis: Peak calling – MACS

Page 12: Macs course

ChIP-seq analysis: Peak calling – MACS

Given a sonication size (bandwidth) and a fold-enrichment range (mfold), MACS slides windows of 2*bandwidth across the genome to find regions whose tag enrichment, relative to a random genome-wide tag distribution, is >= mfold (by default between 10 and 30).
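The window scan described above can be sketched as follows. This is a simplified illustration, not the MACS implementation: it uses non-overlapping windows instead of truly sliding ones, and the function name and toy data are hypothetical. The defaults (bandwidth 300, mfold 10 to 30) are the values reported in the MACS run log later in this course.

```python
def enriched_windows(tag_positions, genome_size, bandwidth=300, mfold=(10, 30)):
    """Return (window_start, fold) for windows whose enrichment lies within mfold.

    Simplification: the genome is cut into non-overlapping windows of
    2 * bandwidth rather than scanned with a sliding window.
    """
    window = 2 * bandwidth
    # Tags expected per window if tags were spread uniformly over the genome.
    expected = len(tag_positions) * window / genome_size

    counts = {}
    for pos in tag_positions:
        start = (pos // window) * window
        counts[start] = counts.get(start, 0) + 1

    low, high = mfold
    hits = []
    for start in sorted(counts):
        fold = counts[start] / expected
        if low <= fold <= high:
            hits.append((start, fold))
    return hits


# Toy genome of 60,000 bp: four tags cluster in one window, 16 are isolated.
clustered = [1210, 1300, 1400, 1500]
isolated = [6005 + 600 * i for i in range(16)]
hits = enriched_windows(clustered + isolated, genome_size=60_000)
print(hits)
```

In the toy data only the window starting at position 1200 passes, because it holds four tags where 0.2 are expected.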

Page 13: Macs course

ChIP-seq analysis: Peak calling – MACS

MACS selects at least 1,000 “model peaks” to calculate the distance “d” between the paired + strand and - strand peaks.
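To make the idea of the distance d concrete, here is a deliberately simplified sketch. It estimates d for each peak as the distance between the centres (medians) of the + strand and - strand tag clusters, then takes the median over all peaks. MACS builds its model differently, and the function names and coordinates below are hypothetical.

```python
from statistics import median


def estimate_d(plus_positions, minus_positions):
    """Distance between the centres of the + and - strand tag clusters of one peak."""
    return median(minus_positions) - median(plus_positions)


def model_d(paired_peaks):
    """Median of the per-peak estimates over a collection of paired peaks."""
    return median(estimate_d(plus, minus) for plus, minus in paired_peaks)


# Two toy "model peaks": (+ strand tag positions, - strand tag positions).
paired_peaks = [
    ([100, 110, 120], [220, 230, 240]),
    ([1000, 1010], [1125, 1135]),
]
print(model_d(paired_peaks))
```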

Page 14: Macs course

ChIP-seq analysis: Peak calling – MACS

How can we determine whether peaks are higher than expected by chance?

The tag distribution along the genome can be modeled by a Poisson distribution.

• x = observed read number
• λ = expected read number

Probability of finding a peak higher than x:

$P(X > x) = 1 - \sum_{k=0}^{x} \frac{e^{-\lambda} \lambda^{k}}{k!}$

Page 15: Macs course

ChIP-seq analysis: Peak calling – MACS

Example:
Tag count = 2
Number of reads = 30,000,000
Read length = 36
Mappable human genome = 2,700,000,000

Page 16: Macs course

ChIP-seq analysis: Peak calling – MACS

Example:
Tag count = 10
Number of reads = 30,000,000
Read length = 36
Mappable human genome = 2,700,000,000
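The two examples can be worked through numerically. The sketch below assumes that the expected read number is λ = number of reads × read length / mappable genome size; the slides give the three numbers but not the formula combining them, so treat this as an assumption. The function names are hypothetical.

```python
import math


def expected_tags(n_reads, read_length, genome_size):
    """Assumed definition of lambda: average tag coverage per position."""
    return n_reads * read_length / genome_size


def poisson_upper_tail(x, lam, max_terms=1000):
    """P(X > x) for X ~ Poisson(lam), summing the tail term by term.

    Summing the tail directly avoids the loss of precision of 1 - CDF
    when the probability is tiny. Intended for small lam.
    """
    k = x + 1
    # First tail term, computed in log space: e^-lam * lam^k / k!
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    for _ in range(max_terms):
        total += term
        k += 1
        term *= lam / k
        if term < total * 1e-17:
            break
    return total


lam = expected_tags(30_000_000, 36, 2_700_000_000)
for tag_count in (2, 10):
    print(tag_count, poisson_upper_tail(tag_count, lam))
```

Under this assumption λ is 0.4. A tag count above 2 then has a probability of just under 1%, and a count above 10 is far below the 1e-5 threshold MACS uses.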

Page 17: Macs course

ChIP-seq analysis: Peak calling – MACS

• Each tag is shifted by d/2 towards its 3’ end.
• Windows of length 2*d slide across the genome to detect the enriched regions (Poisson distribution p-value <= 1e-5).
• Overlapping enriched regions are merged.
• The summit of the peak is considered the putative binding site.
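The tag-shifting step can be sketched as below. It is an illustration under stated assumptions, not MACS code: each tag is represented by its 5' end (the BED start for + strand tags, the BED end for - strand tags), and the function name is hypothetical. The example uses d = 119, the fragment length reported in the MACS run log, and two tags from Input_tags.bed.

```python
def shift_tag(start, end, strand, d):
    """Shift the 5' end of a tag by d/2 towards its 3' end."""
    half = d // 2
    if strand == "+":
        # 5' end is the interval start; 3' lies at higher coordinates.
        return start + half
    if strand == "-":
        # 5' end is the interval end; 3' lies at lower coordinates.
        return end - half
    raise ValueError(f"unknown strand: {strand!r}")


print(shift_tag(233604, 233639, "-", d=119))
print(shift_tag(559767, 559802, "+", d=119))
```

After shifting, tags from both strands pile up around the same position, which is why the summit can be read as the putative binding site.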

TF

Page 18: Macs course

ChIP-seq analysis: Peak calling – MACS

To address local biases in the genome, such as local chromatin structure, sequencing bias and genome copy number variation, MACS evaluates candidate peaks by comparing them against a “local” distribution.

$\lambda_{local} = \max(\lambda_{BG}, \lambda_{1k}, \lambda_{5k}, \lambda_{10k})$

Fold enrichment = enrichment over $\lambda_{local}$
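A minimal sketch of the local lambda and the resulting fold enrichment. The numeric values are invented for illustration, the function names are hypothetical, and the four lambda estimates are assumed to be already scaled to the same window size.

```python
def local_lambda(lambda_bg, lambda_1k, lambda_5k, lambda_10k):
    """Take the most conservative (largest) background estimate."""
    return max(lambda_bg, lambda_1k, lambda_5k, lambda_10k)


def fold_enrichment(observed_tags, lam_local):
    """Observed tags in the candidate peak relative to the local expectation."""
    return observed_tags / lam_local


# Invented values: the 5 kb window around the peak has the highest background.
lam = local_lambda(lambda_bg=0.4, lambda_1k=0.2, lambda_5k=1.0, lambda_10k=0.6)
print(lam, fold_enrichment(12, lam))
```

Taking the maximum means a candidate peak in a locally "noisy" region must clear a higher bar than the genome-wide background alone would set.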

Page 19: Macs course

ChIP-seq analysis: Peak calling – MACS

The False Discovery Rate (FDR) is calculated as (number of control peaks called) / (number of sample peaks). Control peaks are obtained by swapping control and sample.

FDR is calculated only when a control is provided!
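The ratio can be illustrated with the totals from the MACS run shown later in this course (13591 peaks called, 594 negative peaks after swapping). This is only a sketch with a hypothetical function name: using the overall totals gives a single global figure, whereas the FDR(%) column in the MACS output differs from peak to peak.

```python
def fdr_percent(n_control_peaks, n_sample_peaks):
    """FDR as a percentage: control peaks called per sample peak called."""
    if n_sample_peaks == 0:
        raise ValueError("no sample peaks: FDR is undefined")
    return 100.0 * n_control_peaks / n_sample_peaks


print(round(fdr_percent(594, 13591), 2))
```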

Page 20: Macs course

ChIP-seq analysis: Practical part

Page 21: Macs course

ChIP-seq analysis: Practical part

Connect to the Etna machine using ssh.

• Mac or Linux users can connect using this command:

$ ssh -X [email protected]
@xxx.crg.es's password:

Password: xxxxxxx

• Windows users should first download the PuTTY and PSCP programs and then use them to access the machine. http://goo.gl/4BWud

Page 22: Macs course

ChIP-seq analysis

[email protected]

Password: xxxxxx

Page 23: Macs course

ChIP-seq analysis

Different formats can be used as input files: BED, ELAND, SAM, BAM, BOWTIE and, for paired ends, ELAND-MULTIPET.

BED fields: chromosome name, start, end, name, score, strand

$ head ../data/Input_tags.bed
chr1 233604 233639 0 2 -
chr1 559767 559802 0 3 +
chr1 742600 742635 0 2 +
chr1 742600 742635 0 0 +
chr1 744231 744266 0 0 +
chr1 744307 744342 0 2 -
chr1 746885 746920 0 2 +
chr1 746958 746993 0 1 +
chr1 748226 748261 0 2 +
chr1 748357 748392 0 0 -
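The six BED fields can be read with a few lines of Python. This is a minimal sketch with hypothetical names (`Tag`, `parse_bed_line`). It assumes whitespace-separated fields and uses the first three records of Input_tags.bed as inline data.

```python
from collections import Counter, namedtuple

# One record per tag, named after the BED fields listed on the slide.
Tag = namedtuple("Tag", "chrom start end name score strand")


def parse_bed_line(line):
    """Split one BED line into typed fields (first six columns only)."""
    chrom, start, end, name, score, strand = line.split()[:6]
    return Tag(chrom, int(start), int(end), name, float(score), strand)


lines = [
    "chr1 233604 233639 0 2 -",
    "chr1 559767 559802 0 3 +",
    "chr1 742600 742635 0 2 +",
]
tags = [parse_bed_line(line) for line in lines]
strand_counts = Counter(tag.strand for tag in tags)

print(tags[0].end - tags[0].start)  # tag length in bp
print(strand_counts)
```

Each interval spans 35 bp, which agrees with the tag size MACS reports for this dataset.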

Page 24: Macs course

ChIP-seq analysis

Launch MACS, passing the sample, the control, the genome size (hs = Homo sapiens) and the name of the experiment.

$ macs14 -t ../data/Treatment_tags.bed -c ../data/Input_tags.bed -g hs -n FoxA1

Page 25: Macs course

ChIP-seq analysis

Check the output printed to the screen.

$ macs14 -t ../data/Treatment_tags.bed -c ../data/Input_tags.bed -g hs -n FoxA1
INFO @ Thu, 29 Mar 2012 14:58:35:
# ARGUMENTS LIST:
# name = FoxA1
# format = AUTO
# ChIP-seq file = ./Treatment_tags.bed
# control file = ./Input_tags.bed
# effective genome size = 2.70e+09
# band width = 300
# model fold = 10,30
# pvalue cutoff = 1.00e-05
# Small dataset will be scaled towards larger dataset.
# Range for calculating regional lambda is: 1000 bps and 10000 bps
INFO @ Thu, 29 Mar 2012 14:58:35: #1 read tag files...
INFO @ Thu, 29 Mar 2012 14:58:35: #1 read treatment tags...
INFO @ Thu, 29 Mar 2012 14:58:35: Detected format is: BED

Regional lambda has two values in this version: small to consider bias around the summit and large for the surrounding area.

Page 26: Macs course

ChIP-seq analysis

Check the output printed to the screen.

INFO @ Thu, 29 Mar 2012 14:59:41: #1 tag size is determined as 35 bps
INFO @ Thu, 29 Mar 2012 14:59:41: #1 tag size = 35
INFO @ Thu, 29 Mar 2012 14:59:41: #1 total tags in treatment: 3909805
..
INFO @ Thu, 29 Mar 2012 14:59:46: #2 Build Peak Model...
INFO @ Thu, 29 Mar 2012 15:00:00: #2 number of paired peaks: 11861
INFO @ Thu, 29 Mar 2012 15:00:00: #2 finished!
INFO @ Thu, 29 Mar 2012 15:00:00: #2 predicted fragment length is 119 bps
INFO @ Thu, 29 Mar 2012 15:00:00: #2.2 Generate R script for model : FoxA1_model.r
INFO @ Thu, 29 Mar 2012 15:00:00: #3 Call peaks...
INFO @ Thu, 29 Mar 2012 15:00:00: #3 shift treatment data
INFO @ Thu, 29 Mar 2012 15:00:01: #3 merge +/- strand of treatment data
INFO @ Thu, 29 Mar 2012 15:00:01: #3 call peak candidates
INFO @ Thu, 29 Mar 2012 15:00:13: #3 shift control data
INFO @ Thu, 29 Mar 2012 15:00:13: #3 merge +/- strand of control data
INFO @ Thu, 29 Mar 2012 15:00:15: #3 call negative peak candidates
INFO @ Thu, 29 Mar 2012 15:00:25: #3 use control data to filter peak candidates...
INFO @ Thu, 29 Mar 2012 15:00:31: #3 Finally, 13591 peaks are called!
INFO @ Thu, 29 Mar 2012 15:00:31: #3 find negative peaks by swapping treat and control
INFO @ Thu, 29 Mar 2012 15:00:36: #3 Finally, 594 peaks are called!
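When many samples are processed, the key numbers can be pulled out of the screen output automatically. The sketch below is one possible way to do it. The function name is hypothetical, and the regular expressions are written against the message wording of this particular run only; other MACS versions may phrase the messages differently.

```python
import re


def summarize_log(log_text):
    """Extract the predicted fragment length and the peak counts from a MACS log."""
    fragment = re.search(r"predicted fragment length is (\d+) bps", log_text)
    peaks = re.findall(r"Finally, (\d+) peaks are called", log_text)
    return {
        "fragment_length": int(fragment.group(1)) if fragment else None,
        # In this run: first the sample peaks, then the negative (swapped) peaks.
        "peak_counts": [int(n) for n in peaks],
    }


log_text = """\
INFO @ Thu, 29 Mar 2012 15:00:00: #2 predicted fragment length is 119 bps
INFO @ Thu, 29 Mar 2012 15:00:31: #3 Finally, 13591 peaks are called!
INFO @ Thu, 29 Mar 2012 15:00:36: #3 Finally, 594 peaks are called!
"""
summary = summarize_log(log_text)
print(summary)
```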

Page 27: Macs course

ChIP-seq analysis: Output files

• FoxA1_model.r

• FoxA1_negative_peaks.xls

• FoxA1_peaks.bed

• FoxA1_peaks.xls

• FoxA1_summits.bed

Page 28: Macs course

ChIP-seq analysis: MACS peak model

$ R --vanilla < FoxA1_model.r
..
$ evince FoxA1_model.pdf

Page 29: Macs course

ChIP-seq analysis

FoxA1_peaks.xls

chr   start    end      length  summit  tags  -10*LOG10(pvalue)  fold_enrichment  FDR(%)
chr1  858357   858641   285     128     6     51                 13.93            4.09
chr1  998955   999229   275     106     9     74.39              18.28            0.26
chr1  1050021  1050286  266     154     13    152                52.23            0
chr1  1684288  1684577  290     176     9     89.7               32.14            0.01
chr1  1775031  1775371  341     270     6     51.08              16.71            4.06
chr1  1780682  1780965  284     183     6     61.17              19.9             1.45

FoxA1_negative_peaks.xls

chr   start     end       length  summit  tags  -10*LOG10(pvalue)  fold_enrichment
chr1  7155010   7155530   521     311     9     61.64              44.47
chr1  11265816  11266025  210     106     6     59.86              38.12
chr1  18597004  18597307  304     188     8     66.25              31.77
chr1  33412779  33412964  186     94      6     58.68              22.92
chr1  33759125  33759514  390     234     9     62.88              19.77
chr1  37102727  37102952  226     114     6     55.14              31.51
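A common next step is to keep only the peaks below a chosen FDR. The sketch below applies a 1% cutoff to the FoxA1_peaks.xls rows shown above, embedded as inline data. The cutoff and the function names are illustrative choices, and a real MACS output file also contains header and comment lines that this sketch does not handle.

```python
COLUMNS = ["chr", "start", "end", "length", "summit", "tags",
           "neg10log10_pvalue", "fold_enrichment", "fdr_percent"]


def parse_peak_row(row):
    """Turn one whitespace-separated peak row into a dict with typed values."""
    fields = row.split()
    record = {"chr": fields[0]}
    for name, value in zip(COLUMNS[1:6], fields[1:6]):
        record[name] = int(value)
    for name, value in zip(COLUMNS[6:], fields[6:]):
        record[name] = float(value)
    return record


def filter_by_fdr(records, max_fdr_percent):
    """Keep the peaks whose FDR (%) does not exceed the cutoff."""
    return [r for r in records if r["fdr_percent"] <= max_fdr_percent]


rows = [
    "chr1 858357 858641 285 128 6 51 13.93 4.09",
    "chr1 998955 999229 275 106 9 74.39 18.28 0.26",
    "chr1 1050021 1050286 266 154 13 152 52.23 0",
    "chr1 1684288 1684577 290 176 9 89.7 32.14 0.01",
    "chr1 1775031 1775371 341 270 6 51.08 16.71 4.06",
    "chr1 1780682 1780965 284 183 6 61.17 19.9 1.45",
]
peaks = [parse_peak_row(row) for row in rows]
confident = filter_by_fdr(peaks, max_fdr_percent=1.0)
print([p["start"] for p in confident])
```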

Page 30: Macs course

ChIP-seq analysis

FoxA1_peaks.bed: chr, start, end, peak id and score = -10*LOG10(pvalue)

chr1 858356 858641 MACS_peak_1 51
chr1 998954 999229 MACS_peak_2 74.39
chr1 1050020 1050286 MACS_peak_3 152
chr1 1684287 1684577 MACS_peak_4 89.7
chr1 1775030 1775371 MACS_peak_5 51.08
chr1 1780681 1780965 MACS_peak_6 61.17
chr1 1923146 1923449 MACS_peak_7 164.87

FoxA1_summits.bed: chr, start, end, peak id and score = height of the summit

chr1 858483 858484 MACS_peak_1 4
chr1 999059 999060 MACS_peak_2 7
chr1 1050173 1050174 MACS_peak_3 12
chr1 1684462 1684463 MACS_peak_4 8
chr1 1775299 1775300 MACS_peak_5 4
chr1 1780863 1780864 MACS_peak_6 4
chr1 1923347 1923348 MACS_peak_7 14
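The coordinates in the three output files are related. The sketch below encodes a relation inferred from the rows shown in this course rather than taken from the MACS documentation, so treat it as an assumption. The .xls start appears to be 1-based, the BED start is 0-based, and the summit column is an offset from the peak start. The function names are hypothetical.

```python
def peak_to_bed(xls_start, xls_end):
    """Convert a 1-based .xls peak interval to a 0-based, half-open BED interval."""
    return (xls_start - 1, xls_end)


def summit_to_bed(xls_start, summit_offset):
    """Locate the 1 bp summit interval in BED coordinates."""
    bed_start = xls_start - 1
    summit_end = bed_start + summit_offset
    return (summit_end - 1, summit_end)


# MACS_peak_1: start 858357, end 858641, summit 128 in FoxA1_peaks.xls.
print(peak_to_bed(858357, 858641))
print(summit_to_bed(858357, 128))
```

For the first six peaks the computed intervals match the FoxA1_peaks.bed and FoxA1_summits.bed rows exactly.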

Page 31: Macs course

ChIP-seq analysis

$ macs14 -t ../data/Treatment_tags.bed -c ../data/Input_tags.bed -g hs -n FoxA1 -w

The -w option creates “wiggle” files for each chromosome analyzed.

The -B option creates “bedgraph” files. The -S option, together with either -w or -B, creates a single huge file for the whole genome.

--space=NUM can be used to change the resolution of the wiggle file.

Page 32: Macs course

ChIP-seq analysis: Upload files in the UCSC genome browser

http://genome.ucsc.edu/index.html

Page 33: Macs course

ChIP-seq analysis: Upload files in the UCSC genome browser

http://genome.ucsc.edu/index.html

Page 34: Macs course

ChIP-seq analysis: Upload files in the UCSC genome browser

http://genome.ucsc.edu/index.html

Page 35: Macs course

ChIP-seq analysis: Upload files in the UCSC genome browser

http://genome.ucsc.edu/index.html

Page 36: Macs course

ChIP-seq analysis: Upload files in the UCSC genome browser

Page 37: Macs course

ChIP-seq analysis: Upload files in the UCSC genome browser

Peak example: chr22:20141500..20141987

Page 38: Macs course

ChIP-seq analysis: Analyze histone modifications

• Broader peaks
• No clear shape (more summits)
• The peak model is often impossible to create.

$ macs14 -t ../data/ES.H3K27me3.bed -g mm --nomodel --nolambda -n H3K27me3

• It is recommended to skip the model with the --nomodel option.

• Since no control is available, the comparison is done against the sample background. It is recommended to skip the local background (--nolambda option) when you have no control and very broad peaks.

Page 39: Macs course

ChIP-seq analysis: Upload files in the UCSC genome browser

Peak example: chrX:47,922,749-47,926,228

Page 40: Macs course

ChIP-seq analysis: Galaxy platform

Soon a local installation at CRG!

https://main.g2.bx.psu.edu/

Page 41: Macs course

ChIP-seq analysis: Bibliography

• http://en.wikipedia.org/wiki/File:ChIP-sequencing.svg

• http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

• http://liulab.dfci.harvard.edu/MACS/

• http://sourceforge.net/apps/mediawiki/gemlibrary/index.php?title=The_GEM_library

• http://bio-bwa.sourceforge.net/

• http://www.well.ox.ac.uk/project-stampy

• http://bowtie-bio.sourceforge.net/index.shtml

• http://genome.ucsc.edu/

• http://www.r-project.org/

• http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

• https://main.g2.bx.psu.edu/root