04/19/23 1
Gene Finding
04/19/23 2
Copyright notice
• Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.
• Many slides of this presentation are from slides of Dr. Jonathan Pevsner and other people. The copyright belongs to the original authors. Thanks!
04/19/23 3
Gene Finding: Why do it?
• Find and annotate all the genes within the large volume of DNA sequence data
– Human DNA length = 3.4×10^9 bp
– Number of genes = 30,000 - 100,000
– Gene percentage ~= 1%
• Gain understanding of problems in basic biology
– e.g. gene regulation: what are the mechanisms involved in transcription, splicing, etc.?
• Different emphasis in these goals has some effect on the design of computational approaches for gene finding.
04/19/23 4
Gene Finding
• Cells recognize genes from DNA sequence
– they find genes via their biological processes
• Not so easy for us…
04/19/23 5
CTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGGTGAGAAGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGAGGAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGTTTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGTGGGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAGAATGAACTGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAACTGAAAAGTTAGAATCTCAAGACTCAAGTACGCTACTATGCACTTGTTTTATTTCATTTTTCTAAGAAACTAAAAATACTTGTTAATAAGTACCTANGTATGGTTTATTGGTTTTCCCCCTTCATGCCTTGGACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTGCATATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAACAAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTATTGTTATGAGACTGGATATAT...
Where is the gene?
04/19/23 6
Types of Genes
• Protein coding
– most genes
• RNA genes
– rRNA
– tRNA
– snRNA (small nuclear RNA)
– snoRNA (small nucleolar RNA)
04/19/23 7
3 Major Categories of Information Used in Gene Finding Programs
• Signals/features
– a sequence pattern with functional significance, e.g. splice donor & acceptor sites, start and stop codons, promoter features such as TATA boxes, TF binding sites, CpG islands
• Content/composition
– statistical properties of coding vs. non-coding regions, e.g. codon bias, length of ORFs in prokaryotes, GC content
• Similarity
– compare DNA sequence to known sequences in databases
– not only known proteins but also ESTs, cDNAs
04/19/23 8
Gene Structure
04/19/23 9
Prokaryotic Gene Structure
[Diagram of a prokaryotic gene, read 5’ to 3’: promoter region (maybe), ribosome binding site (maybe), start codon, open reading frame, stop codon, termination sequence (maybe).]
04/19/23 10
In Prokaryotic Genomes
• We usually start by looking for an ORF
– a start codon, followed by (usually) at least 60 amino acid codons before a stop codon occurs
– or by searching for similarity to a known ORF
• Look for basal signals
– transcription (the promoter consensus and the termination consensus)
– translation (ribosome binding site: the Shine-Dalgarno sequence)
• Look for differences in sequence content between coding and non-coding DNA
– GC content and codon bias
04/19/23 11
Gene Finding in Bacterial Genomes
• Advantages
– Simple gene structure
• Small genomes (0.5 to 10 million bp)
• No introns
– Dense genomes
• High coding density (>90%)
• Short intergenic regions
– Conserved signals
– Abundant comparative information
• Complete genomes available for many species
– Uninterrupted ORFs
• Disadvantages
– Some genes overlap (nested)
– Some genes are quite short (<60 bp)
04/19/23 12
Open Reading Frame (ORF)
• Any stretch of DNA that potentially encodes a protein
• The identification of an ORF is the first indication that a segment of DNA may be part of a functional gene
04/19/23 13
Open Reading Frames
Each grouping of the nucleotides into consecutive triplets constitutes a reading frame.
A sequence of triplets that contains no stop codon is an Open Reading Frame (ORF).

Sequence: A C G T A A C T G A C T A G G T G A A T
Frame 1:  ACG TAA CTG ACT AGG TGA
Frame 2:  CGT AAC TGA CTA GGT GAA
Frame 3:  GTA ACT GAC TAG GTG AAT
04/19/23 14
ORFs as gene candidates
• An open reading frame that begins with a start codon (usually ATG, GTG or TTG, but this is species-dependent)
• Most prokaryotic genes code for proteins that are 60 or more amino acids in length
• The probability that a random sequence of n codons has no stop codons (UAA, UAG, UGA) is (61/64)^n
– When n is 50, there is a probability of about 92% that the random sequence contains a stop codon
– When n is 100, this probability exceeds 99%
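The arithmetic behind these numbers, as a minimal sketch assuming uniform base composition (3 stop codons out of 64):

def p_no_stop(n_codons: int) -> float:
    # Probability that n consecutive random codons contain no stop codon.
    return (61 / 64) ** n_codons

for n in (50, 100):
    p_stop = 1 - p_no_stop(n)
    print(f"n={n}: P(at least one stop) = {p_stop:.2%}")
# Long stop-free stretches are therefore unlikely to occur by chance.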
04/19/23 15
Codon Bias
• The genetic code is degenerate
– equivalent (synonymous) triplet codons code for the same amino acid
• Codon usage varies
– organism to organism
– gene to gene
• Biological basis
– avoidance of codons similar to stop codons
– preference for codons that correspond to abundant tRNAs within the organism
04/19/23 16
Codon Bias Gene Differences
Codon usage (fraction per glycine codon):
          GAL4   ADH1
Gly GGG   0.21   0
Gly GGA   0.17   0
Gly GGT   0.38   0.93
Gly GGC   0.24   0.07
04/19/23 17
Codon Bias: Organism Differences
• Arginine : CGT,CGC,CGA,CGG,AGA,AGG
• Yeast Genome: arg specified by AGA 48% of time (other five equivalent codons ~10% each)
• Fruitfly Genome: arg specified by CGC 33% of time (other five ~13% each)
• Complete set of codon usage biases can be found at: http://www.kazusa.or.jp/codon/
04/19/23 18
GC content
• GC content relative to AT is a distinguishing feature of bacterial genomes
• Varies dramatically across species
– serves as a means to identify bacterial species
• Varies for various biological reasons
– mutational bias of particular DNA polymerases
– DNA repair mechanisms
– horizontal gene transfer (transformation, transduction, conjugation)
04/19/23 19
GC Content
• GC content may be different in recently acquired genes than elsewhere
• This can lead to variations in the frequency of codon usage within coding regions
– there may be significant differences in codon bias within different genes of a single bacterium’s genome
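A minimal sketch of the kind of compositional scan this suggests: windowed GC content, which can flag regions (e.g. recently acquired genes) whose composition deviates from the genome background. Window and step sizes are illustrative assumptions.

def gc_content(seq: str) -> float:
    # Fraction of G and C bases in a sequence.
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def gc_windows(seq: str, window: int = 1000, step: int = 500):
    # Yield (start, GC fraction) for sliding windows along the sequence.
    for start in range(0, max(len(seq) - window + 1, 1), step):
        yield start, gc_content(seq[start:start + window])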
04/19/23 20
Ribosome Binding Sites
• The RBS, also known as the Shine-Dalgarno sequence (species-dependent), binds the 3’ end of the 16S rRNA (part of the ribosome)
• Usually found within 4-18 nucleotides of the start codon of a true gene
04/19/23 21
Shine-Dalgarno Sequence
• Shine-Dalgarno sequence is a nucleotide sequence (consensus = AGGAGG) that is present in the 5'-untranslated region of prokaryotic mRNAs.
• This sequence serves as a binding site for ribosomes and is thought to influence the reading frame.
• If a subsequence aligning well with the Shine-Dalgarno sequence is found within 4-18 nucleotides of an ORF’s start codon, that improves the ORF’s candidacy.
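A toy scan for such a subsequence. The identity-count score over the best 6-mer alignment is an assumption of this sketch; real gene finders use weighted or probabilistic models of the site.

CONSENSUS = "AGGAGG"  # Shine-Dalgarno consensus

def best_sd_score(genome: str, start_pos: int) -> int:
    # Score the 4-18 nt region upstream of a candidate start codon by the
    # best number of matches to the consensus 6-mer (6 = perfect match).
    region = genome[max(0, start_pos - 18):max(0, start_pos - 4)]
    best = 0
    for i in range(len(region) - len(CONSENSUS) + 1):
        window = region[i:i + len(CONSENSUS)]
        best = max(best, sum(a == b for a, b in zip(window, CONSENSUS)))
    return best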
04/19/23 22
Bacterial Promoter
-35 region: T(82) T(84) G(78) A(65) C(54) A(45) … (16-18 bp) … -10 region: T(80) A(95) T(45) A(60) A(50) T(96) … then +1 (A or G)
(numbers in parentheses: percent occurrence of each base)
Not so simple: remember, these are consensus sequences
04/19/23 23
Eukaryotic Gene Structure
04/19/23 24
Genes and Signals
04/19/23 25
The Complicating factors in Eukaryotes
• Interrupted genes (split genes)• introns and exons
• Large genomes• Most DNA is non-coding
• introns, regulatory regions, “junk” DNA (unknown function)
• About 3% coding
• Complex regulation of gene expression • Regulatory sequences may be far away from
start codon
04/19/23 26
Some numbers to consider:
• Vertebrate genes average about 30 Kb long
– varies a lot
• Coding region is only about 1-2 Kb
• Exon sizes and numbers vary a lot
– average is 6 exons, each about 150 bp long
• An average 5’ UTR is about 750 bp
• An average 3’ UTR is about 450 bp
– (both can be much longer)
• There are huge deviations from all of these numbers
– e.g. dystrophin is 2.4 Mb long; the factor VIII gene has 26 exons, with introns up to 32 Kb (one intron produces 2 transcripts unrelated to the gene!)
• There are genes without introns: called single-exon or intronless genes
04/19/23 27
Given a long eukaryotic DNA sequence:
• How would you determine if it had a gene?
• How would you determine which substrings of the sequence contained protein-coding regions?
04/19/23 28
So, what’s the problem with looking for ORFs?
“split” genes make it difficult to define ORFs
• Where are the starts and stops?
• What problems do introns introduce?
• What would you predict for the size of ORFs?
04/19/23 29
Most Programs Concentrate on Finding Exons
• Exon: the region of DNA within a gene that codes for a polypeptide chain or domain
• Intron: non-coding sequences found in the structural genes
04/19/23 30
Splice Sites used to Define Exons
• Splice donor (exon-intron boundary) and splice acceptor (intron-exon boundary)
• Common sequence motifs– C(orA)AG/GTA(orG)AGT "donor" splice
site – T(orC)nNC(orT)AG/G "acceptor" splice site
04/19/23 31
Gene finding programs look for different types of exon
• single exon genes: begin with start codon & end with stop codon
• initial exons: begin with start codon & end with donor site
• internal exons: begin with acceptor & end with donor
• terminal exons: begin with acceptor & end with stop codon
04/19/23 32
How are correct splice sites identified?
• There are many occurrences of GT or AG within introns that are not splice sites
• Statistical profiles of splice sites are used
http://www.lclark.edu/~lycan/Bio490/pptpresentations/mutation/sld016.htm
04/19/23 33
Other Biologically Important Signals Used in Gene Finding Programs
• Transcriptional signals
– transcription start: characterized by the cap signal, a single purine (A/G)
– TATA box (promoter) at -25 relative to the start
– polyadenylation signal: AATAAA (3’ end)
• Major Caveat: not all genes have these signals
• Makes it difficult to define the beginning and end of a gene
04/19/23 34
Upstream Promoter Sites
• Transcription Factor (TF) sites
– transcription factors are sequence-specific DNA-binding proteins
– bind to consensus DNA sequences
– e.g. CAAT transcription factor and CAAT box
• Many of these
– vary in sequence, location, interaction with other sites
– further complicates the problem of delineating a “gene”
04/19/23 35
Translation Signals
• Kozak sequence
– the signal for initiation of translation in vertebrates
– consensus is GCCACCatgG (atg = start codon)
• And of course…
– translation stop codons
04/19/23 36
GC Content in Eukaryotes
• Overall GC content does not vary between species as it does in prokaryotes
• GC content is still important in gene finding algorithms – CpG Islands
04/19/23 37
CpG Islands
• CpG stands for cytosine and guanine separated by a phosphate, which links the two nucleosides together in DNA
– CG dinucleotides are often written CpG to avoid confusion with the base pair C-G
04/19/23 38
CpG Islands
• In the eukaryotic genome, CpG occurs at lower frequency than would be expected in purely random sequences (1/16)
– occurrence related to methylation
– methylation of C in CG turns it into 5-methylcytosine; following spontaneous deamination, 5-methylcytosine converts into thymine
– methylation of C thus makes CpG prone to mutation (e.g. to TpG or CpA), so CpG sites tend to be eliminated from the genomes of eukaryotes
04/19/23 39
CpG Islands
• However, the start regions of many genes contain a high concentration of CpG sites: CpG islands
– found at the promoters of eukaryotic genes
– these CpG sites are unmethylated, so any spontaneous deamination of cytosine to uracil is recognized by the repair machinery and the CpG site is restored
– a high occurrence of CpGs in many cases marks the existence of downstream genes and is frequently used in genome annotation as an indicator of gene density
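A sketch of the classic CpG-island statistic, the observed/expected CpG ratio; the GC > 50% and obs/exp > 0.6 thresholds are the commonly cited Gardiner-Garden and Frommer criteria, applied here to a single window for illustration.

def cpg_obs_exp(seq: str) -> float:
    # Observed CpG dinucleotides vs. the count expected from C and G content.
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    expected = c * g / n if n else 0
    return seq.count("CG") / expected if expected else 0.0

def looks_like_cpg_island(seq: str) -> bool:
    gc = (seq.upper().count("G") + seq.upper().count("C")) / len(seq)
    return gc > 0.5 and cpg_obs_exp(seq) > 0.6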
04/19/23 40
Gene Finding by Computational Methods
• Dependent on good experimental data to build reliable predictive models
• Various aspects of gene structure/function provide information used in gene finding programs
04/19/23 41
Computational Gene finding approaches
1) Rule-based (e.g., start & stop codons)
2) Content-based (e.g., codon bias, promoter sites)
3) Similarity-based (e.g., orthologs)
4) Pattern-based (e.g., machine-learning: neural network, HMM)
04/19/23 42
Simple rule-based gene finding in prokaryotes, based on ORFs
• Look for putative start codon (ATG)
• Staying in same frame, scan in groups of three until a stop codon is found
• If # of codons >=50, assume it’s a gene
• If # of codons <50, go back to last start codon, increment by 1 & start again
• At end of chromosome, repeat process for reverse complement
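A direct sketch of this procedure, including the reverse-complement pass; for simplicity, reverse-strand coordinates are reported in reverse-complement space.

STOPS = {"TAA", "TAG", "TGA"}

def revcomp(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq: str, min_codons: int = 50):
    # Scan each strand for ATG, extend in-frame to the next stop codon,
    # and record the ORF if it is at least min_codons long.
    orfs = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        i = 0
        while i < len(s) - 2:
            if s[i:i + 3] == "ATG":
                for j in range(i + 3, len(s) - 2, 3):
                    if s[j:j + 3] in STOPS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((strand, i, j + 3))
                        break
            i += 1  # step by 1: a later ATG may open a different frame
    return orfs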
04/19/23 43
Example ORF
04/19/23 44
Problems with rule-based approaches
• Advantages
– simple and fairly sensitive (>50%)
• Disadvantages
– prokaryotic genes are not always so simple to find
– ATG is not the only possible start site (e.g. CTG, TTG: class I alternates)
– small genes tend to be overlooked and long ones over-predicted
• Solution? Use additional information to increase confidence in predictions
04/19/23 45
Content based approaches
• Key prokaryotic gene features
– RNA polymerase promoter site (the -10 and -35 elements; the TATA-box-like -10 site)
– Shine-Dalgarno sequence (the ribosome binding site, around +10 relative to the transcription start) to initiate protein translation
– codon biases
– high GC content
– stem-loop (rho-independent) terminators
04/19/23 46
Content based approaches
• Key eukaryotic gene features
– CpG islands
• more abundant near the gene start site
• high GC content in 5’ ends of genes
– Codon bias
• some codons are strongly preferred in coding regions, others are not
– Hexamers
• dicodon frequencies are informative: physical constraints prefer certain adjacent amino acids over others
– Positional bias
• the 3rd base tends to be G/C rich in coding regions
04/19/23 47
Content-based recognition
• Advantages:
– increases accuracy over rule-based methods
• Disadvantages:
– features are degenerate
– features are not always present
04/19/23 48
Homology-Based Approaches in Eukaryotic Genomes
• More complicated than in prokaryotes due to split genes
• Genome sequence -> first identify all candidate exons
• Use a spliced alignment algorithm to explore all possible exon assemblies & compare to known genes
– e.g. Procrustes
• Limitations:
– must have a similar sequence in the database with known exon structure
– sensitive to frameshift errors
04/19/23 49
Gene Finding using Comparative Genomics
• Purifying selection: conserved regions between two genomes are likely functional, or else they would have diverged
• If genomes are too close in the phylogenetic tree, there may be too much noise
• If genomes are too far apart, then regions can be missed
04/19/23 50
UCSC Browser
04/19/23 51
Gene Prediction using sequence similarities
• GenomeScan incorporates a similarity-based method by adding a BLASTX component to its prediction algorithm, using the translated sequence to search protein databases.
• http://genes.mit.edu/genomescan/
• “TWINSCAN is a gene prediction system that models both gene structure and evolutionary conservation. The scores of features like splice sites and coding regions are modified using the patterns of divergence between the target genome and a closely related genome.”
• http://genes.cs.wustl.edu/
04/19/23 52
Neural Networks - Grail
• Sensors are trained using a set of known genes in the organism.
• GrailExp incorporates a similarity-based method by adding a BLASTN component to its prediction algorithm. Runs reliably on unmasked sequences.
• Sensors are:
– Frame Bias Matrix: uses the codon bias to determine the correct frame.
– Fickett: named after Fickett, who originally used properties such as 3-periodicity and overall base composition to predict genes.
04/19/23 53
Neural Networks - Grail
– Coding 6-tuple word preference: frequency of 6-tuple words in the coding region.
– Coding 6-tuple in-frame preference: 6-tuple composition is evaluated for the 3 frames and the one with the best score is used.
– Repetitive 6-tuple word preference: 6-tuple statistics in repetitive elements; identifies repetitive regions, where coding is not expected.
04/19/23 54
Neural Network
Definitions:
• Training set: a labeled sequence, e.g. ACGAAGAGGAAGAGCAAGACGAAAAGCAAC
• Sliding window: e.g. ACGAAG, with desired output EEEENN (E = exon, N = non-exon)
• Encodings: A = [0 0 1], C = [0 1 0], G = [1 0 0]; E = [0 1], N = [0 0]
• Each window becomes an input vector (e.g. [0 1 0 1 0 0 0 0 1]) with a desired output vector (e.g. [0 1])
04/19/23 55
Neural Network Training
[Figure: one forward pass. The input vector [0 1 0 1 0 0 0 0 1] is multiplied by weight matrix 1 (9×3) to give the hidden layer [.6 .4 .6]; the hidden layer is multiplied by weight matrix 2 (3×2) to give the output vector [.24 .74]. Each unit applies the logistic activation 1/(1 + e^-x). The output is compared to the desired output [0 1].]
04/19/23 56
Back Propagation
[Figure: the mismatch between the output vector [.24 .74] and the desired output [0 1] is propagated backwards through the network, adjusting the entries of both weight matrices.]
04/19/23 57
Calculate New Output
[Figure: after the weight update, the same input vector produces hidden layer [.7 .4 .7] and output vector [.16 .91], closer to the desired [0 1]. Converged!]
04/19/23 58
Train on Second Input Vector
[Figure: a second training window is encoded as the input vector [1 0 0 0 0 1 0 0 1]; the forward pass gives the output vector [.12 .95], which is compared to the desired output [0 1].]
04/19/23 59
Back Propagation
[Figure: as before, the error is propagated backwards and both weight matrices are adjusted.]
04/19/23 60
After Many Iterations….
Two “Generalized” Weight Matrices
Weight matrix 1 (9×3):
.13 .08 .12
.24 .01 .45
.76 .01 .31
.06 .32 .14
.03 .11 .23
.21 .21 .51
.10 .33 .85
.12 .34 .09
.51 .31 .33

Weight matrix 2 (3×2):
.03 .93
.01 .24
.12 .23
04/19/23 61
Neural Networks
[Figure: a new pattern (e.g. ACGAGG) is fed through the trained network: input layer -> Matrix1 -> hidden layer -> Matrix2 -> output layer, producing a prediction (e.g. EEEENN).]
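Below is a minimal sketch of the training loop the figures describe: a 9-3-2 feed-forward network with logistic units trained by backpropagation. The encoding, learning rate and random initial weights are illustrative assumptions, not the values from the slides.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.uniform(0, 1, (9, 3))     # input -> hidden weights (Matrix1)
W2 = rng.uniform(0, 1, (3, 2))     # hidden -> output weights (Matrix2)
x = np.array([0, 1, 0, 1, 0, 0, 0, 0, 1], dtype=float)  # encoded window
target = np.array([0.0, 1.0])                            # desired output

for step in range(1000):
    h = sigmoid(x @ W1)            # hidden layer
    y = sigmoid(h @ W2)            # output vector
    err = y - target
    # backpropagation: chain rule through the logistic units
    grad_out = err * y * (1 - y)
    grad_hid = (grad_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * np.outer(h, grad_out)
    W1 -= 0.5 * np.outer(x, grad_hid)

print(y)  # converges toward the desired output [0, 1]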
04/19/23 62
Hidden Markov Models
• In general, sequences are not monolithic, but can be made up of discrete segments
• Hidden Markov Models (HMMs) allow us to model complex sequences, in which the character emission probabilities depend upon the state
• Think of an HMM as a probabilistic or stochastic sequence generator; what is hidden is the current state of the model
04/19/23 63
Markov Model (MM)
A Markov process is a process which moves from state to state, depending (only) on the previous n states.

[State diagram: Sunny, Cloudy, Rainy with transition arrows, e.g. Sunny -> Sunny with probability 0.5 and Sunny -> Cloudy, Sunny -> Rainy with probability 0.25 each.]

Initial probabilities: π = (Sunny 0.6, Cloudy 0.3, Rainy 0.1)

Transition matrix A: P(weather today | weather yesterday), e.g. P(Sunny | Sunny) = 0.5, P(Cloudy | Sunny) = 0.25, P(Rainy | Cloudy) = 0.375 (see the example on the next slide).
04/19/23 64
Example: P(Sunny, Sunny, Cloudy, Rainy | Model) =
π(Sunny) * P(Sunny | Sunny) * P(Cloudy | Sunny) * P(Rainy | Cloudy) =
0.6 * 0.5 * 0.25 * 0.375 = 0.0281
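The example as code. Only the values used in the computation above come from the slide (π(Sunny) = 0.6, P(Sunny | Sunny) = 0.5, P(Cloudy | Sunny) = 0.25, P(Rainy | Cloudy) = 0.375); the remaining entries of A are filler assumptions so that each row is a proper distribution.

pi = {"Sunny": 0.6, "Cloudy": 0.3, "Rainy": 0.1}
A = {  # A[yesterday][today] = P(today | yesterday)
    "Sunny":  {"Sunny": 0.5,   "Cloudy": 0.25, "Rainy": 0.25},
    "Cloudy": {"Sunny": 0.375, "Cloudy": 0.25, "Rainy": 0.375},
    "Rainy":  {"Sunny": 0.25,  "Cloudy": 0.5,  "Rainy": 0.25},
}

def sequence_prob(seq):
    # First-order Markov chain: initial probability times one transition
    # probability per step.
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

print(sequence_prob(["Sunny", "Sunny", "Cloudy", "Rainy"]))  # 0.0281...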
04/19/23 65
HMM
[Figure: an urn-and-ball HMM. Three urns (#1, #2, #3), each containing Yellow, Red, Green and Blue balls in different proportions. Each urn k has its own emission distribution B_k over the four colors (e.g. B1 = 0.25 for every color); a 3×3 state-transition matrix A (entries per the slide: 0.5/0.3/0.2, 0.4/0.2/0.4, 0.2/0.7/0.1) gives the probability of moving from the urn used at the ith turn to the urn used at turn i+1; and the initial state probabilities are π = (0.6, 0.3, 0.1). An observer sees only the sequence of ball colors; which urn was used (the state) is hidden.]
66
Elements of an HMM
• An HMM is characterized by the following:
1. N, the number of states in the model
2. M, the number of distinct observation symbols per state
3. The state transition probability distribution A = {a_ij}, where a_ij = P[q_{t+1} = j | q_t = i], 1 ≤ i, j ≤ N
4. The observation symbol probability distribution in state j, B = {b_j(v_k)}, where b_j(v_k) = P[o_t = v_k | q_t = j], 1 ≤ j ≤ N, 1 ≤ k ≤ M
5. The initial state distribution π = {π_i}, where π_i = P[q_1 = i], 1 ≤ i ≤ N
• For convenience, we usually use the compact notation λ = (A, B, π) to indicate the complete parameter set of an HMM
– this also requires specification of the two model sizes N and M
67
Two Major Assumptions for HMM
• First-order Markov assumption
– the state transition depends only on the origin and destination states:
a_ij = P(q_{t+1} = j | q_t = i), 1 ≤ i, j ≤ N
P(Q | λ) = P(q_1) Π_{t=2..T} P(q_t | q_{t-1})
– the state transition probability is time invariant
• Output-independence assumption
– the observation depends only on the state that generates it, not on neighboring observations:
P(O | Q, λ) = Π_{t=1..T} P(o_t | q_t, λ) = Π_{t=1..T} b_{q_t}(o_t)
04/19/23 68
[Urn figure repeated: urns #1, #2, #3 with emission distributions B1, B2, B3 over ball colors (Yellow, Red, Green, Blue), transition matrix A and initial probabilities π.]

The three basic problems of HMMs
Problem 1: Given an observation sequence O = O1 O2 … OT and a model M = (Π, A, B), compute P(O | M).
For example: P(an observed sequence of ball colors | M)
04/19/23 69
Example: P(an observed color sequence | M).
Problem 1: Given observation sequence O = O1 O2 … OT and model M = (Π, A, B), compute P(O | M).
We define a sequence of states Q = q1 q2 … qT. Then
P(O | Q, M) = Π_{t=1..T} P(O_t | q_t, M)
P(Q | M) = π_{q1} a_{q1 q2} a_{q2 q3} … a_{q_{T-1} q_T}
P(O | M) = Σ_{all Q} P(O | Q, M) P(Q | M)
04/19/23 70
Example: P(an observed color sequence | M), continued.
Summing over all possible state sequences Q costs O(N^T · T) operations!
N = number of states, T = number of observations.
04/19/23 71
Problem 1: Given observation sequence O = O1 O2 … OT and model M = (Π, A, B), compute P(O | M).
Solution: the forward algorithm.
Much better: O(N²T), where N = number of states and T = number of observations.
For N = 5 and T = 100, the naive solution takes on the order of 10^72 operations; the forward algorithm takes about 3000.
04/19/23 72
[Urn figure repeated: urns #1, #2, #3 with emission distributions B1, B2, B3, transition matrix A and initial probabilities π.]

The three basic problems of HMMs
Problem 2: Given observation sequence O = O1 O2 … OT and model M = (Π, A, B), how do we choose a corresponding state sequence q = q1 q2 … qT which best “explains” the observations?
For example: what are the most probable q1 q2 q3 q4 (#? #? #? #?) given an observed sequence of four ball colors?
04/19/23 73
[Urn figure repeated: urns #1, #2, #3 with emission distributions B1, B2, B3, transition matrix A and initial probabilities π.]

The three basic problems of HMMs
Problem 3: How do we adjust the model parameters Π, A, B to maximize P(O | Π, A, B)?
04/19/23 74
Solutions to the three problems:
• Given an observation sequence O = (o1, o2, …, oT) and an HMM λ = (A, B, π)
– Problem 1: how to efficiently compute P(O | λ)? The evaluation problem.
• Solution: forward algorithm, O(N²T)
– Problem 2: how to choose an optimal state sequence Q = (q1, q2, …, qT) which best explains the observations? The decoding problem.
• Solution: Viterbi algorithm, O(N²T), finding Q* = argmax_Q P(Q | O, λ)
– Problem 3: how to adjust the model parameters λ = (A, B, π) to maximize P(O | λ)? The learning/training problem.
• Solution: Baum-Welch re-estimation formulas
04/19/23 75
Solution to Problem 1 - The Forward Procedure
• Based on the HMM assumptions, the calculation of P(q_t | q_{t-1}) and P(o_t | q_t) involves only q_{t-1}, q_t and o_t, so it is possible to compute the likelihood with recursion on t
• Forward variable: α_t(i) = P(o1, o2, …, ot, q_t = i | λ)
– the probability of the joint event that o1, o2, …, ot are observed and the state at time t is i, given the model λ
• Induction: α_{t+1}(j) = P(o1, o2, …, o_{t+1}, q_{t+1} = j | λ) = [Σ_{i=1..N} α_t(i) a_ij] b_j(o_{t+1})
04/19/23 76
Solution to Problem 1 - The Forward Procedure (cont.)

Derivation of the induction step:
α_{t+1}(j) = P(o1, …, o_t, o_{t+1}, q_{t+1} = j | λ)
= P(o_{t+1} | o1, …, o_t, q_{t+1} = j, λ) · P(o1, …, o_t, q_{t+1} = j | λ)
= P(o_{t+1} | q_{t+1} = j, λ) · P(o1, …, o_t, q_{t+1} = j | λ)    [output-independence assumption]
= b_j(o_{t+1}) · Σ_{i=1..N} P(o1, …, o_t, q_t = i, q_{t+1} = j | λ)
= b_j(o_{t+1}) · Σ_{i=1..N} P(q_{t+1} = j | o1, …, o_t, q_t = i, λ) · P(o1, …, o_t, q_t = i | λ)
= b_j(o_{t+1}) · Σ_{i=1..N} P(q_{t+1} = j | q_t = i, λ) · α_t(i)    [first-order Markov assumption]
= [Σ_{i=1..N} α_t(i) a_ij] · b_j(o_{t+1})

(Both steps use the factorization P(A, B | λ) = P(A | B, λ) · P(B | λ): the output-independence assumption reduces P(o_{t+1} | o1, …, o_t, q_{t+1} = j, λ) to b_j(o_{t+1}), and the first-order Markov assumption reduces P(q_{t+1} = j | o1, …, o_t, q_t = i, λ) to a_ij.)
04/19/23 77
Solution to Problem 1 - The Forward Procedure (cont.)

Example on the trellis:
α3(2) = P(o1, o2, o3, q3 = 2 | λ)
= [α2(1)·a12 + α2(2)·a22 + α2(3)·a32] · b2(o3)

[Trellis figure: states S1-S3 (vertical) against time 1, 2, 3, …, T-1, T (horizontal) with observations o1, o2, o3, …, oT. A node S_j at time t means b_j(o_t) has been computed; an arc means the corresponding a_ij has been computed.]
04/19/23 78
Solution to Problem 1 - The Forward Procedure (cont.)
• Algorithm
1. Initialization: α1(i) = π_i b_i(o1), 1 ≤ i ≤ N
2. Induction: α_{t+1}(j) = [Σ_{i=1..N} α_t(i) a_ij] b_j(o_{t+1}), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
3. Termination: P(O | λ) = Σ_{i=1..N} αT(i)
Complexity: O(N²T) (MUL: N(N+1)(T-1) + N; ADD: N(N-1)(T-1))
• Based on the lattice (trellis) structure
– computed in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1
• All state sequences, regardless of how long previously, merge to N nodes (states) at each time instance t
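A minimal numpy sketch of the three steps above; the array shapes and the symbol-index encoding of the observations are assumptions of this sketch.

import numpy as np

def forward(pi, A, B, obs):
    # pi: initial distribution (N,); A: row-stochastic transitions (N, N)
    # with A[i, j] = P(q_{t+1}=j | q_t=i); B: emissions (N, M) with
    # B[j, k] = P(o_t = symbol k | q_t = j); obs: list of symbol indices.
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # 1. initialization
    for t in range(1, T):                         # 2. induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                 # 3. termination: P(O | lambda)

# For the Dow Jones example on the next slide, pi = (0.5, 0.2, 0.3) and
# b(up) = (0.7, 0.1, 0.3) give alpha_1 = (0.35, 0.02, 0.09) as shown there.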
04/19/23 79
Solution to Problem 1 - The Forward Procedure (cont.)
• A three-state Hidden Markov Model for the Dow Jones Industrial Average (Huang et al., 2001)
Emissions: b1(up) = 0.7, b2(up) = 0.1, b3(up) = 0.3
Initial probabilities: π1 = 0.5, π2 = 0.2, π3 = 0.3
Transitions into state 1: a11 = 0.6, a21 = 0.5, a31 = 0.4

α1(1) = 0.5·0.7 = 0.35, α1(2) = 0.2·0.1 = 0.02, α1(3) = 0.3·0.3 = 0.09
α2(1) = (0.35·0.6 + 0.02·0.5 + 0.09·0.4)·0.7
04/19/23 80
Solution to Problem 2 - The Viterbi Algorithm
• The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm– Instead of summing up probabilities from different paths coming
to the same destination state, the Viterbi algorithm picks and remembers the best path
• Find a single optimal state sequence Q=(q1,q2,……, qT)
– The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm
04/19/23 81
Solution to Problem 2 - The Viterbi Algorithm (cont.)

[Trellis figure: states S1-S3 against time 1, 2, 3, …, T-1, T with observations o1, o2, o3, …, oT; the Viterbi algorithm keeps only the best path into each node.]
04/19/23 82
Solution to Problem 2 - The Viterbi Algorithm (cont.)
1. Initialization: δ1(i) = π_i b_i(o1), ψ1(i) = 0, 1 ≤ i ≤ N
2. Induction: δ_t(j) = max_{1≤i≤N} [δ_{t-1}(i) a_ij] · b_j(o_t); ψ_t(j) = argmax_{1≤i≤N} [δ_{t-1}(i) a_ij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
3. Termination: P* = max_{1≤i≤N} δT(i); q_T* = argmax_{1≤i≤N} δT(i)
4. Backtracking: q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, T-2, …, 1
Q* = (q1*, q2*, …, qT*) is the best state sequence
Complexity: O(N²T)
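A compact sketch of steps 1-4; pi, A, B and obs follow the same conventions assumed in the forward() sketch above.

import numpy as np

def viterbi(pi, A, B, obs):
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                 # initialization
    for t in range(1, T):                        # induction
        scores = delta[t - 1][:, None] * A       # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]             # termination
    for t in range(T - 1, 0, -1):                # backtracking
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()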
04/19/23 83
Solution to Problem 2 - The Viterbi Algorithm (cont.)
• A three-state Hidden Markov Model for the Dow Jones Industrial Average (Huang et al., 2001)
b1(up) = 0.7, b2(up) = 0.1, b3(up) = 0.3; π1 = 0.5, π2 = 0.2, π3 = 0.3; a11 = 0.6, a21 = 0.5, a31 = 0.4

δ1(1) = 0.5·0.7 = 0.35, δ1(2) = 0.2·0.1 = 0.02, δ1(3) = 0.3·0.3 = 0.09
δ2(1) = max(0.35·0.6, 0.02·0.5, 0.09·0.4)·0.7 = 0.35·0.6·0.7 = 0.147
Ψ2(1) = 1
04/19/23 84
Solution to Problem 3 – The Baum-Welch Algorithm
• How to adjust (re-estimate) the model parameters λ = (A, B, π) to maximize P(O | λ)?
– the most difficult of the three problems: there is no known analytical method that maximizes the joint probability of the training data in closed form
• The data is incomplete because of the hidden state sequence
– the problem can be solved by the iterative Baum-Welch algorithm, also known as the forward-backward algorithm
• The EM (Expectation-Maximization) algorithm is perfectly suited to this problem

Baum-Welch local maximization
• 1st step: you determine
– the number of hidden states, N
– the emission (observation) alphabet
• 2nd step: randomly assign values to
A: the transition probabilities
B: the observation (emission) probabilities
π: the starting state probabilities
• 3rd step: let the machine re-estimate A, B, π
04/19/23 85
04/19/23 86
Solution to Problem 3 – The Backward Procedure
• Backward variable: β_t(i) = P(o_{t+1}, o_{t+2}, …, oT | q_t = i, λ)
– the probability of the partial observation sequence o_{t+1}, o_{t+2}, …, oT, given state i at time t and the model λ

Example on the trellis:
β2(3) = P(o3, o4, …, oT | q2 = 3, λ)
= a31·b1(o3)·β3(1) + a32·b2(o3)·β3(2) + a33·b3(o3)·β3(3)

[Trellis figure: states S1-S3 against time 1, 2, 3, …, T-1, T with observations o1, …, oT.]
04/19/23 87
Solution to Problem 3 – The Backward Procedure (cont.)
• Algorithm
1. Initialization: βT(i) = 1, 1 ≤ i ≤ N
2. Induction: β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j), t = T-1, …, 1, 1 ≤ i ≤ N
Complexity: O(N²T) (MUL ≈ 2N²(T-1); ADD ≈ N(N-1)(T-1))
• Termination in terms of β:
P(O | λ) = Σ_{i=1..N} π_i b_i(o1) β1(i)
(cf. P(O | λ) = Σ_{i=1..N} αT(i) from the forward procedure)
04/19/23 88
Solution to Problem 3 – The Forward-Backward Algorithm
• Relation between the forward and backward variables:
α_t(i) = P(o1, …, ot, q_t = i | λ) = [Σ_{j=1..N} α_{t-1}(j) a_ji] b_i(o_t)
β_t(i) = P(o_{t+1}, …, oT | q_t = i, λ) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j)
α_t(i) β_t(i) = P(O, q_t = i | λ)
P(O | λ) = Σ_{i=1..N} α_t(i) β_t(i), for any t
(Huang et al., 2001)
04/19/23 89
Solution to Problem 3 – The Forward-Backward Algorithm (cont.)
γ_t(i) = P(q_t = i | O, λ) = P(O, q_t = i | λ) / P(O | λ)
where
P(O, q_t = i | λ) = P(o1, …, ot, q_t = i | λ) · P(o_{t+1}, …, oT | q_t = i, λ) = α_t(i) β_t(i)
and
P(O | λ) = Σ_{i=1..N} P(O, q_t = i | λ) = Σ_{i=1..N} α_t(i) β_t(i)
so
γ_t(i) = α_t(i) β_t(i) / Σ_{i=1..N} α_t(i) β_t(i)
04/19/23 90
Solution to Problem 3 – The Intuitive View
• Define two new variables:
γ_t(i) = P(q_t = i | O, λ): the probability of being in state i at time t, given O and λ
γ_t(i) = α_t(i) β_t(i) / P(O | λ) = α_t(i) β_t(i) / Σ_{i=1..N} α_t(i) β_t(i)
ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ): the probability of being in state i at time t and state j at time t+1, given O and λ
ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O | λ)
= α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{m=1..N} Σ_{n=1..N} α_t(m) a_mn b_n(o_{t+1}) β_{t+1}(n)
Note that γ_t(i) = Σ_{j=1..N} ξ_t(i, j)
04/19/23 91
Solution to Problem 3 – The Intuitive View (cont.)
• P(q3 = 3, O | λ) = α3(3) · β3(3)

[Trellis figure: the forward paths into state 3 at time 3 (α3(3)) joined with the backward paths out of it (β3(3)).]
04/19/23 92
Solution to Problem 3 – The Intuitive View (cont.)
• P(q3 = 3, q4 = 1, O | λ) = α3(3) · a31 · b1(o4) · β4(1)

[Trellis figure: forward paths into state 3 at time 3, the transition a31 with emission b1(o4), and backward paths out of state 1 at time 4.]
04/19/23 93
Solution to Problem 3 – The Intuitive View (cont.)
γ_t(i) = P(q_t = i | O, λ)
ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ)
Σ_{t=1..T-1} ξ_t(i, j) = expected number of transitions from state i to state j in O
Σ_{t=1..T-1} γ_t(i) = expected number of transitions from state i in O
04/19/23 94
Solution to Problem 3 – The Intuitive View (cont.)
• Re-estimation formulae for π, A and B:
π̄_i = expected frequency (number of times) in state i at time t = 1 = γ1(i)
ā_ij = expected number of transitions from state i to state j / expected number of transitions from state i = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i)
b̄_j(v_k) = expected number of times in state j observing symbol v_k / expected number of times in state j = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1..T} γ_t(j)
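A sketch of one re-estimation iteration, computing γ and ξ from the forward and backward variables and applying the three formulas above. It assumes the forward() sketch given earlier; obs is an array of symbol indices.

import numpy as np

def backward(A, B, obs):
    # beta_t(i), computed right to left; beta_T(i) = 1.
    N, T = A.shape[0], len(obs)
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(pi, A, B, obs):
    obs = np.asarray(obs)
    alpha, p_obs = forward(pi, A, B, obs)
    beta = backward(A, B, obs)
    gamma = alpha * beta / p_obs                          # gamma_t(i)
    # xi[t, i, j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O|lambda)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / p_obs
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B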
04/19/23 95
How is it connected to Gene prediction?
[Urn figure repeated: hidden states #1, #2, #3 with emission distributions B1, B2, B3 over ball colors, the transition matrix A for moving between urns from turn i to turn i+1, and initial probabilities π.]
04/19/23 96
How is it connected to Gene prediction?
[Figure: the same HMM with the urns relabeled for genes. The hidden states are now Exon, Intron and UTR; the emitted symbols are the nucleotides A, C, G and T. Each state has its own emission distribution over nucleotides, and the transition matrix A models moving between exon, intron and UTR along the sequence.]
04/19/23 97
GENSCAN
Chris Burge, 1997

[State diagram: intergenic region, promoter, 5’ UTR, initial/internal/terminal exon states (Einit, E0, E1, E2, Eterm), intron states (I0, I1, I2), a single-exon-gene state, 3’ UTR and poly-A signal. Below it, an example parse of a sequence: 5’ UTR, Ex1, In1, Ex2, In2, Ex3, In3, Ex4, In4, Ex5, 3’ UTR, with GT…AG marking intron boundaries.]
04/19/23 98
GENSCAN components

[The GENSCAN state diagram again: intergenic region, promoter, 5’ UTR, exon states E0, E1, E2, Einit, Eterm, intron states I0, I1, I2, single exon gene, 3’ UTR and poly-A signal, annotated with two kinds of parameters:]
• Sequence-generating models P1, P2, P3, P4: each state generates sequence under its own model (e.g. position-specific probabilities over A, C, G, T)
• A set of length distributions f1, f2, f3, f4: each state has a duration model, e.g. f_intron(10) = 0, f_intron(350) = .03
04/19/23 99
How do we use all that for gene prediction?

Definitions: for a fixed sequence length L we define
ФL: the set of all possible parses of length L
SL: the set of all possible DNA sequences of length L
ΩL = ФL × SL
Our model M is a probability measure on this space: it assigns a probability density to each parse/sequence pair.
04/19/23 100
Or in other words…
Given a sequence S and a parse Фi, e.g.
A C G C G A C T A G G C G C A G G T C T A … G A T
[parse labels across the sequence: Exon0, Intron0, Exon1, Intron1, …, 3’UTR]
we can calculate P(S, Фi):
04/19/23 101
A C G C G A C T A G G C G C A G G T C T A … G A T
[parse labels across the sequence: Exon0, Intron0, Exon1, Intron1, …, 3’UTR]

[The model components from before: sequence-generating models P1…P4 and length distributions f1…f4 attached to the GENSCAN state diagram.]

P(S, Фi) = π_q1 f_q1(d1) P_q1(s1) · A_{q1->q2} f_q2(d2) P_q2(s2) · … · A_{qk-1->qk} f_qk(dk) P_qk(sk)
04/19/23 102
In order to parse a given sequence S (i.e. predict genes in S), we compute the conditional probability of a parse Фi given the sequence S:

P(Фi | S) = P(S, Фi) / P(S) = P(S, Фi) / Σ_j P(S, Фj)

where P(S, Фi) = π_q1 f_q1(d1) P_q1(s1) · A_{q1->q2} f_q2(d2) P_q2(s2) · … · A_{qk-1->qk} f_qk(dk) P_qk(sk)

Prediction: find the parse with maximum likelihood, i.e. the Фi maximizing P(Фi | S).
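A sketch of the product formula for a single parse. The containers (dicts of initial/transition probabilities, per-state length distributions f, and precomputed sequence-model terms P_qk(sk)) are hypothetical scaffolding for illustration, not GENSCAN's actual data structures.

def parse_prob(states, durations, seq_probs, pi, A, f):
    # states: state name per segment; durations: segment lengths d_k;
    # seq_probs: P_qk(sk), each segment's probability under its state's
    # sequence-generating model (precomputed elsewhere).
    p = pi[states[0]] * f[states[0]](durations[0]) * seq_probs[0]
    for k in range(1, len(states)):
        p *= (A[states[k - 1]][states[k]]
              * f[states[k]](durations[k]) * seq_probs[k])
    return p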
04/19/23 103
Splice site sequence generator
• What about adjacent nucleotide dependencies?
• What about non-adjacent nucleotide dependencies?
• What is the probability of generating the signal O-5 O-4 … O6?

[Figure: aligned donor-site sequences over positions -5 … +6 (exon | GT at positions 0-1), e.g. CACCG GTAGATA, CACCT GTGGATA, CACAG GTAGATA, with a table of the percentage of A, C, G and T observed at each position.]

WMM (Weight Matrix Method): position-specific nucleotide probabilities, with positions treated as independent.
WAM (Weight Array Model): the conditional probability of generating nucleotide Xk at position i given nucleotide Xj at position i-1 (captures adjacent dependencies).
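A sketch of a WMM built from aligned true sites and used as a score; the pseudocount and the uniform background are assumptions of this sketch.

import numpy as np

def build_wmm(sites, pseudo=1.0):
    # sites: aligned, equal-length signal sequences; returns per-position
    # nucleotide frequencies (L x 4) with a small pseudocount.
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    counts = np.full((len(sites[0]), 4), pseudo)
    for s in sites:
        for p, nt in enumerate(s):
            counts[p][idx[nt]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def wmm_score(site, freqs, background=0.25):
    # Log-likelihood ratio of the candidate site vs. a uniform background.
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    return sum(np.log(freqs[p][idx[nt]] / background)
               for p, nt in enumerate(site))

# A WAM would instead index freqs[p][prev_nt][nt], conditioning each
# position on the previous nucleotide.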
04/19/23 104
What about non-adjacent nucleotide dependencies?
Procedure: MDD (Maximal Dependency Decomposition)
04/19/23 105
What about non-adjacent nucleotide dependencies?
MDD: Maximal Dependency Decomposition

Given a data set D consisting of N aligned sequences of length k:
1. Align the sequences.
2. Find Ci, the consensus nucleotide at position i.
3. For each pair of positions (i, j) with i != j, calculate the χ² statistic for the consensus indicator Ci vs. the nucleotide indicator Xj:
χ² = Σ (O - E)² / E
where, for each nucleotide at position j, O is the observed count among the NCi sequences containing Ci and E is the expected count, e.g. (%A in D) · NCi.
4. Calculate Si, the sum of the row for position i: a measure of the dependency between Ci and the nucleotides at the remaining positions.
5. If no stop condition holds, choose the Ci with maximal Si, partition D on it, and repeat. Stop conditions:
1. the k-1 level of the tree is reached
2. no significant dependencies are found
3. the number of remaining sequences is too small
04/19/23 106
Not mentioned
• Reverse strand states
• C+G%
• Coding / non-coding detection
• Branch point detection
• Expected vs. observed AG composition
• And more…

[GENSCAN state diagram repeated: intergenic region, promoter, 5’ UTR, exon states E0, E1, E2, Einit, Eterm, intron states I0, I1, I2, single exon gene, 3’ UTR, poly-A signal.]

[Plot: expected and observed percentage of AG near the acceptor site in coding regions, positions -100 to +20, frequencies up to about 0.2.]
04/19/23 107
Evaluating prediction programs
[Figure: the actual gene annotation vs. the predicted one along the sequence, with positions partitioned into TP, FP, TN and FN.]

Sensitivity / Recall: how many of the known genes were found?
Specificity / Precision: how many of the predicted genes were real?
Correlation / F-measure: how good is it overall?
04/19/23 108
Evaluating prediction programs

[Per-nucleotide comparison of the actual and predicted annotations, partitioned into TP, FP, TN and FN.]

Sensitivity: Sn = TP / (TP + FN)
Specificity: Sp = TP / (TP + FP)
F-measure: F = (Sn + Sp) / 2
Correlation coefficient: CC = (TP·TN - FP·FN) / [(TP + FP)(TN + FN)(TP + FN)(TN + FP)]^0.5
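The formulas translate directly into code; note that "specificity" as used here is what is elsewhere called precision.

def evaluate(tp: int, fp: int, tn: int, fn: int):
    sn = tp / (tp + fn)          # sensitivity / recall
    sp = tp / (tp + fp)          # specificity as defined above (= precision)
    f = (sn + sp) / 2            # F-measure as defined above
    cc = ((tp * tn - fp * fn) /
          ((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp)) ** 0.5)
    return {"Sn": sn, "Sp": sp, "F": f, "CC": cc}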
04/19/23 109
Gene Prediction Accuracy at the Exon Level
[Figure: actual vs. predicted exons along the sequence, with correct, wrong and missing exons marked.]

Sensitivity: Sn = number of correct exons / number of actual exons
Specificity: Sp = number of correct exons / number of predicted exons
04/19/23 110
Gene finders - a comparison
             Accuracy per nucleotide    Accuracy per exon
Method       Sn    Sp    AC       Sn    Sp    (Sn+Sp)/2   ME    WE
GENSCAN      0.93  0.93  0.91     0.78  0.81  0.80        0.09  0.05
FGENEH       0.77  0.85  0.78     0.61  0.61  0.61        0.15  0.11
GeneID       0.63  0.81  0.67     0.44  0.45  0.45        0.28  0.24
GeneParser2  0.66  0.79  0.66     0.35  0.39  0.37        0.29  0.17
GenLang      0.72  0.75  0.69     0.50  0.49  0.50        0.21  0.21
GRAILII      0.72  0.84  0.75     0.36  0.41  0.38        0.25  0.10
SORFIND      0.71  0.85  0.73     0.42  0.47  0.45        0.24  0.14
Xpound       0.61  0.82  0.68     0.15  0.17  0.16        0.32  0.13

Sn = Sensitivity; Sp = Specificity; AC = Approximate Correlation; ME = Missing Exons; WE = Wrong Exons
GENSCAN Performance Data, http://genes.mit.edu/Accuracy.html
04/19/23 111
Gene finder comparison (cont.)
"Evaluation of gene finding programs" S. Rogic, A. K. Mackworth and B. F. F. Ouellette. Genome Research, 11: 817-832 (2001).
04/19/23 112
After putative genes are found, they’re annotated
Annotation categories:
1. Matches known protein sequence
2. Strong similarity to protein sequence
3. Similar to known protein
4. Similar to unknown protein
5. Similar to EST (i.e., putative protein)
6. No EST or protein matches (i.e., hypothetical protein)
04/19/23 113
Pitfalls and Issues
Several issues make the problem of eukaryotic gene finding extremely difficult.
1) Very long genes: for example, the largest human gene, the dystrophin gene, is composed of 79 exons spanning nearly 2.3 Mb.
2) Very long introns: again, in the human dystrophin gene, some introns are >100 kb long and >99% of the gene is composed of introns.
04/19/23 114
Pitfalls and Issues
3) Very conserved introns (conserved non-coding sequences): this is particularly a problem when gene prediction is bolstered by similarity searches.
04/19/23 115
Pitfalls and Issues
4) Very short exons: Some exons are only 3 bp long in Arabidopsis genes. Such small exons are easily missed by all content sensors, especially if bordered by large introns. The more difficult cases are those where the length of a coding exon is a multiple of three (typically 3, 6 or 9 bp long), because missing such exons will not cause a problem in the exon assembly as they do not introduce any change in the frame.
04/19/23 116
Pitfalls and Issues
5) Overlapping genes: Though very rare in eukaryotic genomes, there are some documented cases in animals as well as in plants
6) Polycistronic gene arrangement: Also rare. One gene and one mRNA, but two or more proteins.
04/19/23 117
Pitfalls and Issues
7) Frameshifts: Some sequences stored in databases may contain errors (either sequencing errors or simply errors made when editing the sequence) resulting in the introduction of artificial frameshifts (deletion or insertion of one base). Such frameshifts greatly increase the difficulty of the computational gene finding problem by producing erroneous statistics and masking true solutions.
04/19/23 118
Pitfalls and Issues
8) Introns in UTRs: there are genes for which the genomic region corresponding to the 5’- and/or 3’-UTR in the mature mRNA is interrupted by one or more introns.
9) Alternative transcription start: e.g. three alternative promoters regulate the transcription of the 14 kb full-length dystrophin mRNAs, and four ‘intragenic’ promoters control that of smaller isoforms.
04/19/23 119
Pitfalls and Issues
10) Alternative splicing.
11) Alternative polyadenylation: about 20% of human transcripts show evidence of alternative polyadenylation, which affects where the 3’ end is cleaved.
04/19/23 120
Pitfalls and Issues
12)Alternative initiation of translation: finding the right AUG initiator is still a major concern for gene prediction methods. the rule stating that the firrst AUG in the mRNA is the initiator codon can be escaped through three mechanisms: context-dependent leaky scanning, re-initiation and direct internal initiation. Non-AUG triplet can sometimes act as the functional codon for translation initiation, as ACG in Arabidopsis or CUG in human sequences