![Page 1: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/1.jpg)
Computational Genomics and Proteomics
Lecture 8
Motif Discovery
Centre for Integrative Bioinformatics VU
![Page 2: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/2.jpg)
Outline
•Gene Regulation
DNA
Transcription factors
•Motifs
What are they?
Binding Sites
•Combinatoric Approaches
Exhaustive searches
Consensus
•Comparative Genomics
Example
•Probabilistic Approaches
Statistics
EM algorithm
Gibbs Sampling
![Page 3: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/3.jpg)
www.accessexcellence.org
![Page 4: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/4.jpg)
www.accessexcellence.org
![Page 5: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/5.jpg)
www.accessexcellence.org
![Page 6: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/6.jpg)
Four DNA nucleotide building blocks
G-C is more strongly hydrogen-bonded than A-T (three hydrogen bonds versus two)
![Page 7: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/7.jpg)
Degenerate code
Four bases: A, C, G, T
Two-fold degenerate IUB codes:
R=[AG] -- Purines
Y=[CT] -- Pyrimidines
K=[GT]
M=[AC]
S=[GC]
W=[AT]
Four-fold degenerate: N=[AGCT]
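A minimal sketch of how the degenerate codes above could be used to search a sequence: each code is expanded into a regular-expression character class. The function names `degenerate_to_regex` and `find_matches` are illustrative assumptions, not part of the lecture, and only the codes listed on the slide are covered.

```python
import re

# Plain bases plus the two-fold and four-fold degenerate codes from the slide.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]",
    "M": "[AC]", "S": "[GC]", "W": "[AT]",
    "N": "[ACGT]",
}


def degenerate_to_regex(pattern):
    """Translate a degenerate DNA pattern into a regular expression (hypothetical helper)."""
    return "".join(IUB[ch] for ch in pattern.upper())


def find_matches(pattern, sequence):
    """Return the 0-based start of every match, including overlapping ones."""
    # A zero-width lookahead lets overlapping occurrences be reported.
    regex = re.compile("(?=" + degenerate_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    print(degenerate_to_regex("TATRNT"))
    print(find_matches("TATRNT", "GGTATAATCCTATGTT"))
```

The lookahead is a design choice: binding sites can overlap, and a plain `finditer` would skip any site that starts inside a previous match.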
![Page 8: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/8.jpg)
Transcription Factors
•Required but not a part of the RNA polymerase complex
•Many different roles in gene regulation
Binding
Interaction
Initiation
Enhancing
Repressing
•Various structural classes (e.g. zinc finger domains)
•Consist of both a DNA-binding domain and an interactive domain
![Page 9: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/9.jpg)
Motifs
•Short sequences of DNA or RNA (or amino acids)
•Often consist of 5-16 nucleotides
•May contain gaps
•Examples include:
Splice sites
Start/stop codons
Transmembrane domains
Centromeres
Phosphorylation sites
Coiled-coil domains
Transcription factor binding sites (TFBS – regulatory motifs)
![Page 10: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/10.jpg)
TFBSs
•Difficult to identify
•Each transcription factor may have more than one binding site
•Degenerate
•Most occur upstream of the transcription start site (TSS) but are known to also occur in:
introns
exons
3’ UTRs
•Usually occur in clusters, i.e. collections of sites within a region (modules)
•Often repeated
•Sites can be experimentally verified
![Page 11: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/11.jpg)
Why are TFBSs important?
Aid in identification of gene networks/pathways
Determine correct network structure
Drug discovery
Switch production of gene product on/off
Gene A Gene B
![Page 12: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/12.jpg)
Consensus sequences
Matches all of the example sequences closely but not exactly
A single site:
TACGAT
A set of sites:
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
Consensus sequence:
TATAAT or TATRNT
Trade-off: allowing more mismatches, or more ambiguity codes in the consensus, raises sensitivity (more true sites are matched) at the cost of precision (more false matches).
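The consensus for the six example sites can be reproduced with a short sketch. It takes the most frequent base per column; the tie-breaking rule (first base in A, C, G, T order) is an assumption made here so that position 4, where A and G each occur three times, yields TATAAT. The helper names are illustrative.

```python
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]


def majority_consensus(sites):
    """Most frequent base per column; ties go to the first base in A, C, G, T order."""
    # max() returns the first maximal element, which gives the stated tie-break.
    return "".join(max("ACGT", key=column.count) for column in zip(*sites))


def mismatches(site, consensus):
    """Number of positions at which a site differs from the consensus."""
    return sum(a != b for a, b in zip(site, consensus))


def sites_within(sites, consensus, max_mismatches):
    """Sites that match the consensus with at most max_mismatches differences."""
    return [s for s in sites if mismatches(s, consensus) <= max_mismatches]


if __name__ == "__main__":
    consensus = majority_consensus(SITES)
    print(consensus)
    print([mismatches(s, consensus) for s in SITES])
    print(len(sites_within(SITES, consensus, 1)), "of", len(SITES), "sites within 1 mismatch")
```

Raising `max_mismatches` recovers more of the known sites, which illustrates the sensitivity side of the trade-off; the same relaxation would also accept more unrelated words in a real promoter.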
![Page 13: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/13.jpg)
Information Content and Entropy
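As a hedged illustration, the information content of one motif column is commonly computed as the maximum entropy (2 bits for DNA) minus the Shannon entropy of the observed base frequencies. The sketch below assumes a uniform background and applies no small-sample correction; `column_information` is an illustrative name.

```python
import math

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]


def column_information(column, alphabet="ACGT"):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    n = len(column)
    entropy = 0.0
    for base in alphabet:
        p = column.count(base) / n
        if p > 0:  # 0 * log(0) is taken as 0
            entropy -= p * math.log2(p)
    return math.log2(len(alphabet)) - entropy


if __name__ == "__main__":
    for position, column in enumerate(zip(*SITES), start=1):
        print(position, round(column_information(column), 3))
```

A fully conserved column (position 2, always A) carries 2 bits, while a column split evenly between two bases (position 4, A or G) carries 1 bit. These per-column values are what set the stack heights in a sequence logo.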
![Page 14: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/14.jpg)
Sequence Logos
![Page 15: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/15.jpg)
Frequency Matrices
Given a collection of motifs,
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
Create the matrix, with one row per base (T, A, C, G) and one column per motif position.
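A minimal sketch of building the matrix from the six sites. Whether the slide showed raw counts or relative frequencies is not recoverable from the transcript, so both are computed; the function names are illustrative.

```python
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]


def count_matrix(sites, alphabet="TACG"):
    """Map each base to a list of its counts at every motif position."""
    width = len(sites[0])
    return {
        base: [sum(site[i] == base for site in sites) for i in range(width)]
        for base in alphabet
    }


def frequency_matrix(sites, alphabet="TACG"):
    """Same layout as count_matrix, with counts divided by the number of sites."""
    n = len(sites)
    return {
        base: [count / n for count in row]
        for base, row in count_matrix(sites, alphabet).items()
    }


if __name__ == "__main__":
    for base, row in count_matrix(SITES).items():
        print(base, row)
```

Every column of the count matrix sums to the number of sites (six here), which is a quick sanity check when building such matrices by hand.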
![Page 16: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/16.jpg)
Position weight matrices
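One common way to turn a frequency matrix into a position weight matrix is to take the log-odds of each base against a background frequency. The sketch below assumes a uniform background and a pseudocount of 1 spread over the four bases; both choices, and all function names, are assumptions for illustration.

```python
import math

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]


def build_pwm(sites, background=None, pseudocount=1.0, alphabet="ACGT"):
    """Log2-odds weight for each base at each position (list of dicts)."""
    if background is None:
        background = {base: 1.0 / len(alphabet) for base in alphabet}
    n = len(sites)
    pwm = []
    for i in range(len(sites[0])):
        column = {}
        for base in alphabet:
            count = sum(site[i] == base for site in sites)
            # The pseudocount keeps unseen bases from scoring minus infinity.
            p = (count + pseudocount * background[base]) / (n + pseudocount)
            column[base] = math.log2(p / background[base])
        pwm.append(column)
    return pwm


def score(pwm, word):
    """Sum of the position-specific weights for a word of the motif width."""
    return sum(column[ch] for column, ch in zip(pwm, word))


def best_hit(pwm, sequence):
    """0-based start of the highest-scoring window in the sequence."""
    width = len(pwm)
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda i: score(pwm, sequence[i:i + width]))


if __name__ == "__main__":
    pwm = build_pwm(SITES)
    print(round(score(pwm, "TATAAT"), 2), round(score(pwm, "GGGCCC"), 2))
    print(best_hit(pwm, "CCGGTATAATGGCC"))
```

Unlike a consensus string with a mismatch limit, the weight matrix penalises a mismatch differently depending on the position and on which base is substituted.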
![Page 17: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/17.jpg)
Finding Motifs
Two problems:
•Given a collection of known motifs, develop a representation of the motifs such that additional occurrences can reliably be identified in new promoter regions
•Given a collection of genes thought to be related somehow, find the location of the motif common to all and a representation for it
Two approaches:
•Combinatorial
•Probabilistic
![Page 18: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/18.jpg)
Combinatorial Approach
![Page 19: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/19.jpg)
Exhaustive Search
![Page 20: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/20.jpg)
Exhaustive Search
Sample-driven here refers to trying all the words as they occur in the sequences, instead of exhaustively trying all 4^W possible words of width W
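A sample-driven search can be sketched as follows: the candidate words are the width-W substrings that actually occur in the input, and each is scored by the number of sequences containing it with at most a given number of mismatches. The scoring rule and the toy sequences are assumptions for illustration, not the lecture's own procedure.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def occurs_in(word, sequence, max_mismatches):
    """True if some window of the sequence is within max_mismatches of word."""
    width = len(word)
    return any(
        hamming(word, sequence[i:i + width]) <= max_mismatches
        for i in range(len(sequence) - width + 1)
    )


def sample_driven_search(sequences, width, max_mismatches=1):
    """Return (best words, support) using only words seen in the sequences."""
    candidates = {
        seq[i:i + width]
        for seq in sequences
        for i in range(len(seq) - width + 1)
    }
    support = {
        word: sum(occurs_in(word, seq, max_mismatches) for seq in sequences)
        for word in candidates
    }
    best = max(support.values())
    return sorted(w for w, c in support.items() if c == best), best


if __name__ == "__main__":
    toy = ["GGCTATAATGC", "CCTATGATGGA", "ACGTTATAATC"]  # hypothetical promoters
    print(sample_driven_search(toy, width=6, max_mismatches=1))
```

The number of candidates is bounded by the total sequence length rather than by 4^W, which is the saving the slide refers to; the price is that a motif whose exact consensus never appears in the data is not tried.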
![Page 21: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/21.jpg)
Greedy Motif Clustering
![Page 22: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/22.jpg)
Greedy Motif Clustering
![Page 23: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/23.jpg)
Greedy Motif Clustering
![Page 24: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/24.jpg)
Comparative Genomics
Main Idea: Conserved non-coding regions are important
•Align the promoters of orthologous co-expressed genes from two (or more) species, e.g. human and mouse
•Search for TFBS only in conserved regions
Problems:
•Not all regulatory regions are conserved
•Which genomes to use?
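The restriction to conserved regions can be sketched on a toy pairwise alignment: slide a window along the aligned promoters and keep the windows whose identity reaches a threshold. The sequences, window size and threshold below are hypothetical, and the alignment is assumed to have been produced beforehand by a separate tool.

```python
def conserved_windows(aligned_a, aligned_b, window=6, min_identity=1.0):
    """Start columns of alignment windows with identity >= min_identity.

    Both inputs are rows of one pairwise alignment, with '-' marking gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    starts = []
    for i in range(len(aligned_a) - window + 1):
        pairs = zip(aligned_a[i:i + window], aligned_b[i:i + window])
        # A gap column never counts as a match.
        matches = sum(x == y and x != "-" for x, y in pairs)
        if matches / window >= min_identity:
            starts.append(i)
    return starts


if __name__ == "__main__":
    human = "ACGTTATAATGGC-A"  # hypothetical aligned promoter fragments
    mouse = "TCGATATAATGACTA"
    print(conserved_windows(human, mouse, window=6, min_identity=1.0))
```

A motif scan would then be run only inside the returned windows, which reduces false positives but, as the slide notes, misses regulatory sites that are not conserved.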
![Page 25: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/25.jpg)
Phylogenetic Footprinting
Phylogenetic footprinting refers to the task of finding conserved motifs across different species. Common ancestry and selection on these motifs have resulted in these “footprints”.
![Page 26: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/26.jpg)
Phylogenetic Footprinting: An Example
Xie et al. 2005, Nature 434, 338-345
•Genome-wide alignments for four species (human, mouse, rat, dog)
•Promoter regions and 3’ UTRs then extracted for 17,700 well-annotated genes
•Promoter region taken to be (-2000, +2000) relative to the transcription start site
•This set of sequences then searched exhaustively for motifs
![Page 27: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/27.jpg)
The Search
Xie et al. 2005
![Page 28: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/28.jpg)
Expected Rate
![Page 29: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/29.jpg)
Probabilistic Approach
![Page 30: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/30.jpg)
Gibbs Sampling (applied to Motif Finding)
![Page 31: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/31.jpg)
Gibbs Sampling Algorithm
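A simplified site sampler in the spirit of Gibbs sampling for motif finding can be sketched as below. It assumes exactly one motif occurrence per sequence, a fixed motif width, a uniform background and DNA sequences containing only A, C, G and T; it is an illustration under those assumptions, not the algorithm used by AlignACE or any other specific tool.

```python
import random


def gibbs_motif_search(sequences, width, iterations=200, pseudocount=1.0, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Start from a random motif position in every sequence.
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        for held_out, seq in enumerate(sequences):
            # Build a profile from the current sites of all other sequences.
            others = [
                s[p:p + width]
                for i, (s, p) in enumerate(zip(sequences, positions))
                if i != held_out
            ]
            total = len(others) + pseudocount * len(alphabet)
            profile = [
                {b: (column.count(b) + pseudocount) / total for b in alphabet}
                for column in zip(*others)
            ]
            # Weight each candidate start by motif probability over background.
            weights = []
            for start in range(len(seq) - width + 1):
                weight = 1.0
                for j in range(width):
                    weight *= profile[j][seq[start + j]] / 0.25
                weights.append(weight)
            # Sample (rather than pick the maximum) in proportion to the weights.
            positions[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions


if __name__ == "__main__":
    toy = ["GGCTATAATGC", "CCTATGATGGA", "ACGTTATAATC", "TTTATAATCGG"]
    print(gibbs_motif_search(toy, width=6))
```

Sampling the new position instead of always taking the best one is what lets the procedure escape poor starting alignments; different seeds can give different answers, so in practice several runs are compared.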
![Page 32: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/32.jpg)
Gibbs Sampling – Motif Positions
![Page 33: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/33.jpg)
AlignACE - Gibbs Sampling
![Page 34: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/34.jpg)
Remainder of the lecture: Maximum likelihood and the EM algorithm
The remaining slides are for your information only and will not be part of the exam
![Page 35: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/35.jpg)
Basic Statistics
![Page 36: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/36.jpg)
Maximum Likelihood Estimates
![Page 37: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/37.jpg)
EM Algorithm
![Page 38: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/38.jpg)
Basic idea (MEME)
http://meme.nbcr.net/meme/meme-intro.html
![Page 39: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/39.jpg)
Basic idea (MEME)
MEME is a tool for discovering motifs in a group of related DNA or protein sequences. A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences.
MEME represents motifs as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs.
MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif.
http://meme.nbcr.net/meme/meme-intro.html
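How expectation-maximisation can fit a letter-probability matrix is sketched below. This is a deliberately reduced model, not MEME's implementation: it assumes exactly one motif occurrence per sequence, a fixed width taken from a user-supplied starting word, and a uniform background, whereas MEME chooses the width and number of occurrences itself. The function name, the starting probabilities (0.7 and 0.1) and the toy sequences are illustrative assumptions.

```python
def em_motif(sequences, start_word, iterations=20, pseudocount=0.1):
    """Return (letter-probability matrix, per-sequence position posteriors)."""
    alphabet = "ACGT"
    background = 0.25
    width = len(start_word)
    # Initialise the matrix so that the starting word is favoured.
    motif = [{b: (0.7 if b == ch else 0.1) for b in alphabet} for ch in start_word]
    posteriors = []
    for _ in range(iterations):
        counts = [{b: pseudocount for b in alphabet} for _ in range(width)]
        posteriors = []
        for seq in sequences:
            # E-step: posterior probability of each start position (hidden variable).
            likelihoods = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for j in range(width):
                    ratio *= motif[j][seq[start + j]] / background
                likelihoods.append(ratio)
            total = sum(likelihoods)
            z = [value / total for value in likelihoods]
            posteriors.append(z)
            # Accumulate expected letter counts weighted by the posteriors.
            for start, weight in enumerate(z):
                for j in range(width):
                    counts[j][seq[start + j]] += weight
        # M-step: re-estimate the letter probabilities from the expected counts.
        motif = [
            {b: column[b] / sum(column.values()) for b in alphabet}
            for column in counts
        ]
    return motif, posteriors


if __name__ == "__main__":
    toy = ["GGCTATAATGC", "CCTATGATGGA", "ACGTTATAATC"]  # hypothetical training set
    matrix, z = em_motif(toy, "TATAAT")
    print([max(range(len(row)), key=row.__getitem__) for row in z])
```

Unlike the sampling approach, each iteration here is deterministic, so the result depends entirely on the starting word; trying several starting points is the usual remedy for convergence to a local optimum.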
![Page 40: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/40.jpg)
Basic MEME Model
![Page 41: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/41.jpg)
MEME Background frequencies
![Page 42: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/42.jpg)
MEME – Hidden Variable
![Page 43: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/43.jpg)
MEME – Conditional Likelihood
![Page 44: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/44.jpg)
EM algorithm
![Page 45: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/45.jpg)
Example
![Page 46: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/46.jpg)
E-step of EM algorithm
![Page 47: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/47.jpg)
Example
![Page 48: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/48.jpg)
M-step of EM Algorithm
![Page 49: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/49.jpg)
Example
![Page 50: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/50.jpg)
Characteristics of EM
![Page 51: Computational Genomics and Proteomics](https://reader036.vdocuments.mx/reader036/viewer/2022062314/56814960550346895db6b410/html5/thumbnails/51.jpg)
Gibbs Sampling (versus EM)