TRANSCRIPT
Starting Monday
M Oct 29 – Back to BLAST and Orthology (readings posted): will focus on the BLAST algorithm and the different types and applications of BLAST; in lab we will predict orthologs using reciprocal genome-scale BLAST searches
W Oct 31 – Phylogenetic Profiles (an example of unsupervised machine learning) and supervised machine learning approaches and applications
M Nov 5 – Phylogeny (Phylogeny Lab)
W Nov 7 – Metabolic reconstruction and modeling
***2-3 pg paper on preliminary results due***
Today: ChIP-chip and ChIP-seq analysis
Chromatin immunoprecipitation (ChIP)
1. Chemical or light-based crosslinking added to living cells
2. Shear DNA by sonication or digestion
3. IP by specific Ab or Ab against protein tag
ChIP-on-chip (tiled genomic microarrays)
[Figure: signal intensity (y-axis) across array probes (x-axis)]
Peak resolution is a function of:
- shearing size
- probe resolution
- ChIP enrichment
ChIP-seq
[Figure: read counts (y-axis) along the genome]
1. Map reads to the reference genome
2. Convert to ‘tag’ counts: sequence coverage at each base pair in the genome
3. Find peaks of high tag count (using a fixed/sliding window with count threshold) or based on bimodal peak distribution
4. Convert bimodal peaks into summits (by shifting 3’ tag positions OR by extending the tag signal to estimated size of fragments)
5. Identify summits that represent fragment enrichment relative to control
6. Assign a confidence score (p-value, enrichment score, and/or FDR)
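Steps 2 and 3 above can be sketched in a few lines. This is only an illustrative sketch, not any particular peak caller: the function names, the half-open (start, end) read format, the window size and the threshold are all assumptions chosen for the example.

```python
def coverage(reads, genome_length):
    """Per-base tag counts from mapped reads given as half-open (start, end)."""
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1   # coverage goes up where a read starts
        diff[end] -= 1     # and down just past where it ends
    counts, running = [], 0
    for d in diff[:-1]:
        running += d
        counts.append(running)
    return counts


def call_peaks(counts, window=5, threshold=3.0):
    """Slide a fixed window; merge overlapping windows whose mean count
    meets the threshold (window and threshold are illustrative values)."""
    peaks, current = [], None
    for start in range(len(counts) - window + 1):
        mean = sum(counts[start:start + window]) / window
        if mean >= threshold:
            end = start + window
            if current and start <= current[1]:
                current[1] = end          # extend the open peak
            else:
                current = [start, end]    # open a new peak
                peaks.append(current)
    return [tuple(p) for p in peaks]


# Toy data: four overlapping reads and one isolated read.
reads = [(10, 20), (12, 22), (14, 24), (15, 25), (40, 50)]
counts = coverage(reads, genome_length=60)
peaks = call_peaks(counts)
```

Here the four overlapping reads give one peak and the isolated read is ignored, because its windows never reach the threshold.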
Types of ‘control’ data for ChIP experiments
1. ‘Input’ DNA = sheared but no IP
2. No-antibody mock IP
3. Untagged strain
Almost always some background in mock-IP … hope is to have enrichment of IP material over background.
* Certain artifacts can give the appearance of real peaks in control experiments.
Pepke et al. 2009
Read counts/tag profile is generally smoothed before peak calling (e.g. running average) and then the ‘summit’ is inferred by the dual read peaks
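A minimal sketch of both ideas, assuming separate per-base tag profiles for the forward and reverse strands. The function names and the simple cross-correlation used to estimate the strand offset are assumptions for illustration, not the method of any specific program.

```python
def running_average(counts, width=3):
    """Smooth a tag profile with a centered running average."""
    half = width // 2
    smoothed = []
    for i in range(len(counts)):
        window = counts[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def estimate_shift(forward, reverse, max_shift):
    """Offset d (bp) at which the reverse-strand profile best overlaps
    the forward-strand profile (largest sum of products)."""
    def overlap(d):
        return sum(f * r for f, r in zip(forward, reverse[d:]))
    return max(range(max_shift + 1), key=overlap)


def summit(forward, reverse, max_shift=200):
    """Place the summit halfway between the two strand peaks."""
    d = estimate_shift(forward, reverse, max_shift)
    forward_mode = max(range(len(forward)), key=forward.__getitem__)
    return forward_mode + d // 2


# Toy profiles: forward tags pile up at position 5, reverse tags at 15.
forward = [0] * 20
reverse = [0] * 20
forward[4:7] = [5, 10, 5]
reverse[14:17] = [5, 10, 5]
```

With these toy profiles the estimated offset is 10 bp, so the summit falls at position 10, midway between the two strand peaks.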
* Using a method that incorporates a measured background model is probably very important
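One common way to use a measured background is to treat the control tag count in a region as the expected count and ask how surprising the IP count is under a Poisson model. The sketch below assumes that model; the function names, the library-size scaling and the floor of 1.0 on the expected count are illustrative assumptions, not a method taken from the slides.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def enrichment_p_value(ip_count, control_count, ip_total, control_total):
    """P-value for seeing at least ip_count tags in a region, given the
    control count in the same region scaled to the IP library size."""
    scale = ip_total / control_total
    # Floor of 1.0 is an assumed guard against empty control regions.
    expected = max(control_count * scale, 1.0)
    return poisson_sf(ip_count, expected)
```

For example, `enrichment_p_value(30, 5, 1000, 1000)` is very small, while `enrichment_p_value(5, 5, 1000, 1000)` is above 0.5, so only the first region would be reported as enriched.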
3 Types of peaks
1. Sharp & narrow (100s bp) (e.g. site-specific TF)
2. Broader but defined (kb) (e.g. RNA Polymerase)
3. Very broad (regional, 1000s kb) (e.g. heterochromatin histone marks)
• Methods that identify bimodal peak profiles to identify summits work less well for biologically wider peaks/loci
Hidden Markov Models for Identifying Bound Fragments
HMMs are trained on known data to recognize different states (e.g. bound vs. unbound fragments) and the probability of moving between those states
Example: ChIP-chip data from a tiling microarray identifying regions bound to a transcription complex with a known 50bp binding sequence.
You expect that a bound fragment will have high signal on the array and that the bound fragment will be 2-3 probes long.
Once trained, an HMM can be used to identify the ‘hidden’ states in an unknown dataset, based on the known characteristics of each state (‘emission probabilities’) and the probability of moving between states (‘transition probabilities’)
Example: “A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences” 2005. Li, Meyer, Liu
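[Figure: four-state HMM over consecutive 25mer probes]
I = intensity > 10,000 units; i = intensity < 10,000 units
Emission probabilities, by state:
Unbound 25mer: P(I) = 0.2, P(i) = 0.8
Bound 25mer (first): P(I) = 0.8, P(i) = 0.2
Bound 25mer (second): P(I) = 0.8, P(i) = 0.2
Bound 25mer (third): P(I) = 0.8, P(i) = 0.2
Transition probabilities (arrow labels, in the order extracted): 0.5, 0.5, 1.0, 0, 0.7, 0.3, 1.0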
Given the data, an HMM considers many possible sequences of hidden states and returns the most probable one
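Finding the most probable state sequence is usually done with the Viterbi algorithm. The sketch below applies it to the probabilities in the example above. The assignment of transition values to specific arrows is an assumption, because the figure itself did not survive: Unbound stays Unbound or enters the first Bound state with 0.5 each, the first Bound state always continues to the second, the second continues to the third with 0.7 or returns to Unbound with 0.3, and the third always returns to Unbound. This reading matches the expectation that a bound fragment is 2-3 probes long. Starting in the Unbound state is also an assumption.

```python
import math

STATES = ["U", "B1", "B2", "B3"]  # Unbound, then three Bound 25mer states
P_HIGH = {"U": 0.2, "B1": 0.8, "B2": 0.8, "B3": 0.8}  # P(I) per state
TRANS = {  # assumed mapping of the arrow labels to state pairs
    "U": {"U": 0.5, "B1": 0.5},
    "B1": {"B2": 1.0},            # B1 -> U has P = 0
    "B2": {"B3": 0.7, "U": 0.3},
    "B3": {"U": 1.0},
}
START = {"U": 1.0}  # assumption: the array starts in an unbound region


def log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(intensities, threshold=10_000):
    """Most probable state path for a series of probe intensities."""
    obs = [x > threshold for x in intensities]

    def emit(state, high):
        return log(P_HIGH[state] if high else 1 - P_HIGH[state])

    score = {s: log(START.get(s, 0.0)) + emit(s, obs[0]) for s in STATES}
    back = []
    for high in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + log(TRANS[r].get(s, 0.0)))
            new_score[s] = score[prev] + log(TRANS[prev].get(s, 0.0)) + emit(s, high)
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]


path = viterbi([500, 800, 15000, 20000, 18000, 700, 600])
```

For this made-up intensity series, the three consecutive high-intensity probes are decoded as one bound fragment (B1, B2, B3) flanked by unbound probes.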
Evaluated 11 different peak-calling algorithms using 3 real datasets* & default parameters (mimicking “non-expert users”)
- methods with smaller peak lists often return peaks identified by other methods (more stringent)
“many programs call similar peaks, though default parameters are tuned to different levels of stringency”
Output: list of peak locations (start & stop) and p-values
The challenge is that peaks do not show precisely where the protein binds.
Different programs vary in the width of the identified peaks.
Can apply the same type of motif finding to a set of IP’d regions to identify motifs shared by regions.
Other approaches
ChIP-exo
DNaseI hypersensitive sites
Micrococcal nuclease sensitive sites (nucleosome mapping)
What can you do with the data?
1. Motif finding: look for motif shared in bound regions (e.g. XX)
2. Association of bound loci with neighboring genes, elements
- functional enrichment of neighboring genes
- other non-random association among neighboring genes, e.g. shared expression profiles, expression dependency on factor in question
3. Locus distribution across the genome
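As a sketch of item 2, the simplest way to associate a bound locus with a neighboring gene is to take the gene whose start site is closest to the peak summit. The gene names and coordinates below are hypothetical, and representing each gene by a single start position on one chromosome is a simplifying assumption.

```python
import bisect


def nearest_gene(summit, genes):
    """Return (gene name, distance in bp) for the gene start closest to a
    summit. `genes` is a list of (start_position, name) sorted by position."""
    positions = [pos for pos, _ in genes]
    i = bisect.bisect_left(positions, summit)
    # Only the genes immediately left and right of the summit can be nearest.
    candidates = genes[max(0, i - 1): i + 1]
    pos, name = min(candidates, key=lambda g: abs(g[0] - summit))
    return name, abs(pos - summit)


# Hypothetical annotation on one chromosome.
genes = [(100, "geneA"), (500, "geneB"), (900, "geneC")]
```

The resulting gene list can then be tested for functional enrichment or compared with expression data, as listed above.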