04/19/23 1
Gene Finding
04/19/23 2
Copyright notice
• Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.
• Many slides of this presentation are from slides of Dr. Jonathan Pevsner and other people. The copyright belongs to the original authors. Thanks!
04/19/23 3
Gene Finding: Why do it?
• Find and annotate all the genes within the large volume of DNA sequence data
– Human DNA length = 3.4×10^9 bp
– Number of genes = 30,000 - 100,000
– Gene percentage ~= 1%
• Gain understanding of problems in basic biology
– e.g. gene regulation: what are the mechanisms involved in transcription, splicing, etc.?
• Different emphasis in these goals has some effect on the design of computational approaches for gene finding.
04/19/23 4
Gene Finding
• Cells recognize genes from DNA sequence
– they find genes via their biological processes
• Not so easy for us…
04/19/23 5
CTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAGGTCGCCTCTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGGTGAGAAGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGAGGAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGTTTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGTGGGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAGAATGAACTGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAACTGAAAAGTTAGAATCTCAAGACTCAAGTACGCTACTATGCACTTGTTTTATTTCATTTTTCTAAGAAACTAAAAATACTTGTTAATAAGTACCTANGTATGGTTTATTGGTTTTCCCCCTTCATGCCTTGGACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTGCATATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAACAAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTATTGTTATGAGACTGGATATAT...
Where is the gene?
04/19/23 6
Types of Genes
• Protein coding
– most genes
• RNA genes
– rRNA
– tRNA
– snRNA (small nuclear RNA)
– snoRNA (small nucleolar RNA)
04/19/23 7
3 Major Categories of Information Used in Gene Finding Programs
• Signals/features
– a sequence pattern with functional significance, e.g. splice donor & acceptor sites, start and stop codons, promoter features such as TATA boxes, TF binding sites, CpG islands
• Content/composition
– statistical properties of coding vs. non-coding regions, e.g. codon bias, length of ORFs in prokaryotes, GC content
• Similarity
– compare DNA sequence to known sequences in databases
– not only known proteins but also ESTs, cDNAs
04/19/23 8
Gene Structure
04/19/23 9
Prokaryotic Gene Structure
[Diagram of a prokaryotic gene, read 5’ to 3’: promoter region (maybe), ribosome binding site (maybe), start codon, open reading frame, stop codon, termination sequence (maybe).]
04/19/23 10
In Prokaryotic Genomes
• We usually start by looking for an ORF
– a start codon, followed by (usually) at least 60 amino acid codons before a stop codon occurs
– or by searching for similarity to a known ORF
• Look for basal signals
– transcription (the promoter consensus and the termination consensus)
– translation (ribosome binding site: the Shine-Dalgarno sequence)
• Look for differences in sequence content between coding and non-coding DNA
– GC content and codon bias
04/19/23 11
Gene Finding in Bacterial Genomes
• Advantages
– Simple gene structure
• Small genomes (0.5 to 10 million bp)
• No introns
– Dense genomes
• High coding density (>90%)
• Short intergenic regions
– Conserved signals
– Abundant comparative information
• Complete genomes available for many species
– Uninterrupted ORFs
• Disadvantages
– Some genes overlap (nested)
– Some genes are quite short (<60 bp)
04/19/23 12
Open Reading Frame (ORF)
• Any stretch of DNA that potentially encodes a protein
• The identification of an ORF is the first indication that a segment of DNA may be part of a functional gene
04/19/23 13
Open Reading Frames
Each grouping of the nucleotides into consecutive triplets constitutes a reading frame.
A sequence of triplets that contains no stop codon is an Open Reading Frame (ORF).

Sequence: A C G T A A C T G A C T A G G T G A A T
Frame 1:  ACG TAA CTG ACT AGG TGA
Frame 2:  CGT AAC TGA CTA GGT GAA
Frame 3:  GTA ACT GAC TAG GTG AAT
04/19/23 14
ORFs as gene candidates
• An open reading frame that begins with a start codon (usually ATG, GTG or TTG, but this is species-dependent)
• Most prokaryotic genes code for proteins that are 60 or more amino acids in length
• The probability that a random sequence of n codons has no stop codons (UAA, UAG, UGA) is (61/64)^n
– When n is 50, there is a probability of about 92% that the random sequence contains a stop codon
– When n is 100, this probability exceeds 99%
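The arithmetic behind these numbers, as a minimal sketch assuming uniform base composition (3 stop codons out of 64):

def p_no_stop(n_codons: int) -> float:
    # Probability that n consecutive random codons contain no stop codon.
    return (61 / 64) ** n_codons

for n in (50, 100):
    p_stop = 1 - p_no_stop(n)
    print(f"n={n}: P(at least one stop) = {p_stop:.2%}")
# Long stop-free stretches are therefore unlikely to occur by chance.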
04/19/23 15
Codon Bias
• The genetic code is degenerate
– equivalent (synonymous) triplet codons code for the same amino acid
• Codon usage varies
– organism to organism
– gene to gene
• Biological basis
– avoidance of codons similar to stop codons
– preference for codons that correspond to abundant tRNAs within the organism
04/19/23 16
Codon Bias Gene Differences
Codon usage (fraction per glycine codon):
          GAL4   ADH1
Gly GGG   0.21   0
Gly GGA   0.17   0
Gly GGT   0.38   0.93
Gly GGC   0.24   0.07
04/19/23 17
Codon Bias: Organism Differences
• Arginine : CGT,CGC,CGA,CGG,AGA,AGG
• Yeast Genome: arg specified by AGA 48% of time (other five equivalent codons ~10% each)
• Fruitfly Genome: arg specified by CGC 33% of time (other five ~13% each)
• Complete set of codon usage biases can be found at: http://www.kazusa.or.jp/codon/
04/19/23 18
GC content
• GC content relative to AT is a distinguishing feature of bacterial genomes
• Varies dramatically across species
– serves as a means to identify bacterial species
• Varies for various biological reasons
– mutational bias of particular DNA polymerases
– DNA repair mechanisms
– horizontal gene transfer (transformation, transduction, conjugation)
04/19/23 19
GC Content
• GC content may be different in recently acquired genes than elsewhere
• This can lead to variations in the frequency of codon usage within coding regions
– there may be significant differences in codon bias within different genes of a single bacterium’s genome
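A minimal sketch of the kind of compositional scan this suggests: windowed GC content, which can flag regions (e.g. recently acquired genes) whose composition deviates from the genome background. Window and step sizes are illustrative assumptions.

def gc_content(seq: str) -> float:
    # Fraction of G and C bases in a sequence.
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def gc_windows(seq: str, window: int = 1000, step: int = 500):
    # Yield (start, GC fraction) for sliding windows along the sequence.
    for start in range(0, max(len(seq) - window + 1, 1), step):
        yield start, gc_content(seq[start:start + window])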
04/19/23 20
Ribosome Binding Sites
• The RBS, also known as the Shine-Dalgarno sequence (species-dependent), binds the 3’ end of the 16S rRNA (part of the ribosome)
• Usually found within 4-18 nucleotides of the start codon of a true gene
04/19/23 21
Shine-Dalgarno Sequence
• Shine-Dalgarno sequence is a nucleotide sequence (consensus = AGGAGG) that is present in the 5'-untranslated region of prokaryotic mRNAs.
• This sequence serves as a binding site for ribosomes and is thought to influence the reading frame.
• If a subsequence aligning well with the Shine-Dalgarno sequence is found within 4-18 nucleotides of an ORF’s start codon, that improves the ORF’s candidacy.
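A toy scan for such a subsequence. The identity-count score over the best 6-mer alignment is an assumption of this sketch; real gene finders use weighted or probabilistic models of the site.

CONSENSUS = "AGGAGG"  # Shine-Dalgarno consensus

def best_sd_score(genome: str, start_pos: int) -> int:
    # Score the 4-18 nt region upstream of a candidate start codon by the
    # best number of matches to the consensus 6-mer (6 = perfect match).
    region = genome[max(0, start_pos - 18):max(0, start_pos - 4)]
    best = 0
    for i in range(len(region) - len(CONSENSUS) + 1):
        window = region[i:i + len(CONSENSUS)]
        best = max(best, sum(a == b for a, b in zip(window, CONSENSUS)))
    return best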
04/19/23 22
Bacterial Promoter
-35 region: T(82) T(84) G(78) A(65) C(54) A(45) … (16-18 bp) … -10 region: T(80) A(95) T(45) A(60) A(50) T(96) … then +1 (A or G)
(numbers in parentheses: percent occurrence of each base)
Not so simple: remember, these are consensus sequences
04/19/23 23
Eukaryotic Gene Structure
04/19/23 24
Genes and Signals
04/19/23 25
The Complicating factors in Eukaryotes
• Interrupted genes (split genes)• introns and exons
• Large genomes• Most DNA is non-coding
• introns, regulatory regions, “junk” DNA (unknown function)
• About 3% coding
• Complex regulation of gene expression • Regulatory sequences may be far away from
start codon
04/19/23 26
Some numbers to consider:
• Vertebrate genes average about 30 Kb long
– varies a lot
• Coding region is only about 1-2 Kb
• Exon sizes and numbers vary a lot
– average is 6 exons, each about 150 bp long
• An average 5’ UTR is about 750 bp
• An average 3’ UTR is about 450 bp
– (both can be much longer)
• There are huge deviations from all of these numbers
– e.g. dystrophin is 2.4 Mb long; the factor VIII gene has 26 exons, with introns up to 32 Kb (one intron produces 2 transcripts unrelated to the gene!)
• There are genes without introns: called single-exon or intronless genes
04/19/23 27
Given a long eukaryotic DNA sequence:
• How would you determine if it had a gene?
• How would you determine which substrings of the sequence contained protein-coding regions?
04/19/23 28
So, what’s the problem with looking for ORFs?
“split” genes make it difficult to define ORFs
• Where are the starts and stops?
• What problems do introns introduce?
• What would you predict for the size of ORFs?
04/19/23 29
Most Programs Concentrate on Finding Exons
• Exon: the region of DNA within a gene that codes for a polypeptide chain or domain
• Intron: non-coding sequences found in the structural genes
04/19/23 30
Splice Sites used to Define Exons
• Splice donor (exon-intron boundary) and splice acceptor (intron-exon boundary)
• Common sequence motifs– C(orA)AG/GTA(orG)AGT "donor" splice
site – T(orC)nNC(orT)AG/G "acceptor" splice site
04/19/23 31
Gene finding programs look for different types of exon
• single exon genes: begin with start codon & end with stop codon
• initial exons: begin with start codon & end with donor site
• internal exons: begin with acceptor & end with donor
• terminal exons: begin with acceptor & end with stop codon
04/19/23 32
How are correct splice sites identified?
• There are many occurrences of GT or AG within introns that are not splice sites
• Statistical profiles of splice sites are used
http://www.lclark.edu/~lycan/Bio490/pptpresentations/mutation/sld016.htm
04/19/23 33
Other Biologically Important Signals Used in Gene Finding Programs
• Transcriptional signals
– transcription start: characterized by the cap signal, a single purine (A/G)
– TATA box (promoter) at -25 relative to the start
– polyadenylation signal: AATAAA (3’ end)
• Major Caveat: not all genes have these signals
• Makes it difficult to define the beginning and end of a gene
04/19/23 34
Upstream Promoter Sites
• Transcription Factor (TF) sites
– transcription factors are sequence-specific DNA-binding proteins
– bind to consensus DNA sequences
– e.g. CAAT transcription factor and CAAT box
• Many of these
– vary in sequence, location, interaction with other sites
– further complicates the problem of delineating a “gene”
04/19/23 35
Translation Signals
• Kozak sequence
– the signal for initiation of translation in vertebrates
– consensus is GCCACCatgG (atg = start codon)
• And of course…
– translation stop codons
04/19/23 36
GC Content in Eukaryotes
• Overall GC content does not vary between species as it does in prokaryotes
• GC content is still important in gene finding algorithms – CpG Islands
04/19/23 37
CpG Islands
• CpG stands for cytosine and guanine separated by a phosphate, which links the two nucleosides together in DNA
– CG dinucleotides are often written CpG to avoid confusion with the base pair C-G
04/19/23 38
CpG Islands
• In the eukaryotic genome, CpG occurs at lower frequency than would be expected in purely random sequences (1/16)
– occurrence related to methylation
– methylation of C in CG turns it into 5-methylcytosine; following spontaneous deamination, 5-methylcytosine converts into thymine
– methylation of C thus makes CpG prone to mutation (e.g. to TpG or CpA), so CpG sites tend to be eliminated from the genomes of eukaryotes
04/19/23 39
CpG Islands
• However, the start regions of many genes contain a high concentration of CpG sites: CpG islands
– found at the promoters of eukaryotic genes
– these CpG sites are unmethylated, so any spontaneous deamination of cytosine to uracil is recognized by the repair machinery and the CpG site is restored
– a high occurrence of CpGs in many cases marks the existence of downstream genes and is frequently used in genome annotation as an indicator of gene density
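A sketch of the classic CpG-island statistic, the observed/expected CpG ratio; the GC > 50% and obs/exp > 0.6 thresholds are the commonly cited Gardiner-Garden and Frommer criteria, applied here to a single window for illustration.

def cpg_obs_exp(seq: str) -> float:
    # Observed CpG dinucleotides vs. the count expected from C and G content.
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    expected = c * g / n if n else 0
    return seq.count("CG") / expected if expected else 0.0

def looks_like_cpg_island(seq: str) -> bool:
    gc = (seq.upper().count("G") + seq.upper().count("C")) / len(seq)
    return gc > 0.5 and cpg_obs_exp(seq) > 0.6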
04/19/23 40
Gene Finding by Computational Methods
• Dependent on good experimental data to build reliable predictive models
• Various aspects of gene structure/function provide information used in gene finding programs
04/19/23 41
Computational Gene finding approaches
1) Rule-based (e.g., start & stop codons)
2) Content-based (e.g., codon bias, promoter sites)
3) Similarity-based (e.g., orthologs)
4) Pattern-based (e.g., machine-learning: neural network, HMM)
04/19/23 42
Simple rule-based gene finding in prokaryotes, based on ORFs
• Look for putative start codon (ATG)
• Staying in same frame, scan in groups of three until a stop codon is found
• If # of codons >=50, assume it’s a gene
• If # of codons <50, go back to last start codon, increment by 1 & start again
• At end of chromosome, repeat process for reverse complement
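A direct sketch of this procedure, including the reverse-complement pass; for simplicity, reverse-strand coordinates are reported in reverse-complement space.

STOPS = {"TAA", "TAG", "TGA"}

def revcomp(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq: str, min_codons: int = 50):
    # Scan each strand for ATG, extend in-frame to the next stop codon,
    # and record the ORF if it is at least min_codons long.
    orfs = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        i = 0
        while i < len(s) - 2:
            if s[i:i + 3] == "ATG":
                for j in range(i + 3, len(s) - 2, 3):
                    if s[j:j + 3] in STOPS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((strand, i, j + 3))
                        break
            i += 1  # step by 1: a later ATG may open a different frame
    return orfs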
04/19/23 43
Example ORF
04/19/23 44
Problems with rule-based approaches
• Advantages
– simple and fairly sensitive (>50%)
• Disadvantages
– prokaryotic genes are not always so simple to find
– ATG is not the only possible start site (e.g. CTG, TTG: class I alternates)
– small genes tend to be overlooked and long ones over-predicted
• Solution? Use additional information to increase confidence in predictions
04/19/23 45
Content based approaches
• Key prokaryotic gene features
– RNA polymerase promoter site (the -10 and -35 elements; the TATA-box-like -10 site)
– Shine-Dalgarno sequence (the ribosome binding site, around +10 relative to the transcription start) to initiate protein translation
– codon biases
– high GC content
– stem-loop (rho-independent) terminators
04/19/23 46
Content based approaches
• Key eukaryotic gene features
– CpG islands
• more abundant near the gene start site
• high GC content in 5’ ends of genes
– Codon bias
• some codons are strongly preferred in coding regions, others are not
– Hexamers
• dicodon frequencies are informative: physical constraints prefer certain adjacent amino acids over others
– Positional bias
• the 3rd base tends to be G/C rich in coding regions
04/19/23 47
Content-based recognition
• Advantages:
– increases accuracy over rule-based methods
• Disadvantages:
– features are degenerate
– features are not always present
04/19/23 48
Homology-Based Approaches in Eukaryotic Genomes
• More complicated than in prokaryotes due to split genes
• Genome sequence -> first identify all candidate exons
• Use a spliced alignment algorithm to explore all possible exon assemblies & compare to known genes
– e.g. Procrustes
• Limitations:
– must have a similar sequence in the database with known exon structure
– sensitive to frameshift errors
04/19/23 49
Gene Finding using Comparative Genomics
• Purifying selection: conserved regions between two genomes are likely functional, or else they would have diverged
• If genomes are too close in the phylogenetic tree, there may be too much noise
• If genomes are too far apart, then regions can be missed
04/19/23 50
UCSC Browser
04/19/23 51
Gene Prediction using sequence similarities
• GenomeScan incorporates a similarity-based method by adding a BLASTX component to its prediction algorithm, using the translated sequence to search protein databases.
• http://genes.mit.edu/genomescan/
• “TWINSCAN is a gene prediction system that models both gene structure and evolutionary conservation. The scores of features like splice sites and coding regions are modified using the patterns of divergence between the target genome and a closely related genome.”
• http://genes.cs.wustl.edu/
04/19/23 52
Neural Networks - Grail
• Sensors are trained using a set of known genes in the organism.
• GrailExp incorporates a similarity-based method by adding a BLASTN component to its prediction algorithm. Runs reliably on unmasked sequences.
• Sensors are:
– Frame Bias Matrix: uses the codon bias to determine the correct frame.
– Fickett: named after Fickett, who originally used properties such as 3-periodicity and overall base composition to predict genes.
04/19/23 53
Neural Networks - Grail
– Coding 6-tuple word preference: frequency of 6-tuple words in the coding region.
– Coding 6-tuple in-frame preference: 6-tuple composition is evaluated for the 3 frames and the one with the best score is used.
– Repetitive 6-tuple word preference: 6-tuple statistics in repetitive elements; identifies repetitive regions, where coding is not expected.
04/19/23 54
Neural Network
Definitions:
• Training set: a labeled sequence, e.g. ACGAAGAGGAAGAGCAAGACGAAAAGCAAC
• Sliding window: e.g. ACGAAG, with desired output EEEENN (E = exon, N = non-exon)
• Encodings: A = [0 0 1], C = [0 1 0], G = [1 0 0]; E = [0 1], N = [0 0]
• Each window becomes an input vector (e.g. [0 1 0 1 0 0 0 0 1]) with a desired output vector (e.g. [0 1])
04/19/23 55
Neural Network Training
[Figure: one forward pass. The input vector [0 1 0 1 0 0 0 0 1] is multiplied by weight matrix 1 (9×3) to give the hidden layer [.6 .4 .6]; the hidden layer is multiplied by weight matrix 2 (3×2) to give the output vector [.24 .74]. Each unit applies the logistic activation 1/(1 + e^-x). The output is compared to the desired output [0 1].]
04/19/23 56
Back Propagation
[Figure: the mismatch between the output vector [.24 .74] and the desired output [0 1] is propagated backwards through the network, adjusting the entries of both weight matrices.]
04/19/23 57
Calculate New Output
[Figure: after the weight update, the same input vector produces hidden layer [.7 .4 .7] and output vector [.16 .91], closer to the desired [0 1]. Converged!]
04/19/23 58
Train on Second Input Vector
[Figure: a second training window is encoded as the input vector [1 0 0 0 0 1 0 0 1]; the forward pass gives the output vector [.12 .95], which is compared to the desired output [0 1].]
04/19/23 59
Back Propagation
[Figure: as before, the error is propagated backwards and both weight matrices are adjusted.]
04/19/23 60
After Many Iterations….
Two “Generalized” Weight Matrices
Weight matrix 1 (9×3):
.13 .08 .12
.24 .01 .45
.76 .01 .31
.06 .32 .14
.03 .11 .23
.21 .21 .51
.10 .33 .85
.12 .34 .09
.51 .31 .33

Weight matrix 2 (3×2):
.03 .93
.01 .24
.12 .23
04/19/23 61
Neural Networks
[Figure: a new pattern (e.g. ACGAGG) is fed through the trained network: input layer -> Matrix1 -> hidden layer -> Matrix2 -> output layer, producing a prediction (e.g. EEEENN).]
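Below is a minimal sketch of the training loop the figures describe: a 9-3-2 feed-forward network with logistic units trained by backpropagation. The encoding, learning rate and random initial weights are illustrative assumptions, not the values from the slides.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.uniform(0, 1, (9, 3))     # input -> hidden weights (Matrix1)
W2 = rng.uniform(0, 1, (3, 2))     # hidden -> output weights (Matrix2)
x = np.array([0, 1, 0, 1, 0, 0, 0, 0, 1], dtype=float)  # encoded window
target = np.array([0.0, 1.0])                            # desired output

for step in range(1000):
    h = sigmoid(x @ W1)            # hidden layer
    y = sigmoid(h @ W2)            # output vector
    err = y - target
    # backpropagation: chain rule through the logistic units
    grad_out = err * y * (1 - y)
    grad_hid = (grad_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * np.outer(h, grad_out)
    W1 -= 0.5 * np.outer(x, grad_hid)

print(y)  # converges toward the desired output [0, 1]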
04/19/23 62
Hidden Markov Models
• In general, sequences are not monolithic, but can be made up of discrete segments
• Hidden Markov Models (HMMs) allow us to model complex sequences, in which the character emission probabilities depend upon the state
• Think of an HMM as a probabilistic or stochastic sequence generator; what is hidden is the current state of the model
04/19/23 63
Markov Model (MM)
A Markov process is a process which moves from state to state, depending (only) on the previous n states.

[State diagram: Sunny, Cloudy, Rainy with transition arrows, e.g. Sunny -> Sunny with probability 0.5 and Sunny -> Cloudy, Sunny -> Rainy with probability 0.25 each.]

Initial probabilities: π = (Sunny 0.6, Cloudy 0.3, Rainy 0.1)

Transition matrix A: P(weather today | weather yesterday), e.g. P(Sunny | Sunny) = 0.5, P(Cloudy | Sunny) = 0.25, P(Rainy | Cloudy) = 0.375 (see the example on the next slide).
04/19/23 64
Example: P(Sunny, Sunny, Cloudy, Rainy | Model) =
π(Sunny) * P(Sunny | Sunny) * P(Cloudy | Sunny) * P(Rainy | Cloudy) =
0.6 * 0.5 * 0.25 * 0.375 = 0.0281
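The example as code. Only the values used in the computation above come from the slide (π(Sunny) = 0.6, P(Sunny | Sunny) = 0.5, P(Cloudy | Sunny) = 0.25, P(Rainy | Cloudy) = 0.375); the remaining entries of A are filler assumptions so that each row is a proper distribution.

pi = {"Sunny": 0.6, "Cloudy": 0.3, "Rainy": 0.1}
A = {  # A[yesterday][today] = P(today | yesterday)
    "Sunny":  {"Sunny": 0.5,   "Cloudy": 0.25, "Rainy": 0.25},
    "Cloudy": {"Sunny": 0.375, "Cloudy": 0.25, "Rainy": 0.375},
    "Rainy":  {"Sunny": 0.25,  "Cloudy": 0.5,  "Rainy": 0.25},
}

def sequence_prob(seq):
    # First-order Markov chain: initial probability times one transition
    # probability per step.
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

print(sequence_prob(["Sunny", "Sunny", "Cloudy", "Rainy"]))  # 0.0281...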
04/19/23 65
HMM
[Figure: an urn-and-ball HMM. Three urns (#1, #2, #3), each containing Yellow, Red, Green and Blue balls in different proportions. Each urn k has its own emission distribution B_k over the four colors (e.g. B1 = 0.25 for every color); a 3×3 state-transition matrix A (entries per the slide: 0.5/0.3/0.2, 0.4/0.2/0.4, 0.2/0.7/0.1) gives the probability of moving from the urn used at the ith turn to the urn used at turn i+1; and the initial state probabilities are π = (0.6, 0.3, 0.1). An observer sees only the sequence of ball colors; which urn was used (the state) is hidden.]
66
Elements of an HMM
• An HMM is characterized by the following:
1. N, the number of states in the model
2. M, the number of distinct observation symbols per state
3. The state transition probability distribution A = {a_ij}, where a_ij = P[q_{t+1} = j | q_t = i], 1 ≤ i, j ≤ N
4. The observation symbol probability distribution in state j, B = {b_j(v_k)}, where b_j(v_k) = P[o_t = v_k | q_t = j], 1 ≤ j ≤ N, 1 ≤ k ≤ M
5. The initial state distribution π = {π_i}, where π_i = P[q_1 = i], 1 ≤ i ≤ N
• For convenience, we usually use the compact notation λ = (A, B, π) to indicate the complete parameter set of an HMM
– this also requires specification of the two model sizes N and M
67
Two Major Assumptions for HMM
• First-order Markov assumption
– the state transition depends only on the origin and destination states:
a_ij = P(q_{t+1} = j | q_t = i), 1 ≤ i, j ≤ N
P(Q | λ) = P(q_1) Π_{t=2..T} P(q_t | q_{t-1})
– the state transition probability is time invariant
• Output-independence assumption
– the observation depends only on the state that generates it, not on neighboring observations:
P(O | Q, λ) = Π_{t=1..T} P(o_t | q_t, λ) = Π_{t=1..T} b_{q_t}(o_t)
04/19/23 68
[Urn figure repeated: urns #1, #2, #3 with emission distributions B1, B2, B3 over ball colors (Yellow, Red, Green, Blue), transition matrix A and initial probabilities π.]

The three basic problems of HMMs
Problem 1: Given an observation sequence O = O1 O2 … OT and a model M = (Π, A, B), compute P(O | M).
For example: P(an observed sequence of ball colors | M)
04/19/23 69
Example: P(an observed color sequence | M).
Problem 1: Given observation sequence O = O1 O2 … OT and model M = (Π, A, B), compute P(O | M).
We define a sequence of states Q = q1 q2 … qT. Then
P(O | Q, M) = Π_{t=1..T} P(O_t | q_t, M)
P(Q | M) = π_{q1} a_{q1 q2} a_{q2 q3} … a_{q_{T-1} q_T}
P(O | M) = Σ_{all Q} P(O | Q, M) P(Q | M)
04/19/23 70
Example: P(an observed color sequence | M), continued.
Summing over all possible state sequences Q costs O(N^T · T) operations!
N = number of states, T = number of observations.
04/19/23 71
Problem 1: Given observation sequence O = O1 O2 … OT and model M = (Π, A, B), compute P(O | M).
Solution: the forward algorithm.
Much better: O(N²T), where N = number of states and T = number of observations.
For N = 5 and T = 100, the naive solution takes on the order of 10^72 operations; the forward algorithm takes about 3000.
04/19/23 72
[Urn figure repeated: urns #1, #2, #3 with emission distributions B1, B2, B3, transition matrix A and initial probabilities π.]

The three basic problems of HMMs
Problem 2: Given observation sequence O = O1 O2 … OT and model M = (Π, A, B), how do we choose a corresponding state sequence q = q1 q2 … qT which best “explains” the observations?
For example: what are the most probable q1 q2 q3 q4 (#? #? #? #?) given an observed sequence of four ball colors?
04/19/23 73
[Urn figure repeated: urns #1, #2, #3 with emission distributions B1, B2, B3, transition matrix A and initial probabilities π.]

The three basic problems of HMMs
Problem 3: How do we adjust the model parameters Π, A, B to maximize P(O | Π, A, B)?
04/19/23 74
Solutions to the three problems:
• Given an observation sequence O = (o1, o2, …, oT) and an HMM λ = (A, B, π)
– Problem 1: how to efficiently compute P(O | λ)? The evaluation problem.
• Solution: forward algorithm, O(N²T)
– Problem 2: how to choose an optimal state sequence Q = (q1, q2, …, qT) which best explains the observations? The decoding problem.
• Solution: Viterbi algorithm, O(N²T), finding Q* = argmax_Q P(Q | O, λ)
– Problem 3: how to adjust the model parameters λ = (A, B, π) to maximize P(O | λ)? The learning/training problem.
• Solution: Baum-Welch re-estimation formulas
04/19/23 75
Solution to Problem 1 - The Forward Procedure
• Based on the HMM assumptions, the calculation of P(q_t | q_{t-1}) and P(o_t | q_t) involves only q_{t-1}, q_t and o_t, so it is possible to compute the likelihood with recursion on t
• Forward variable: α_t(i) = P(o1, o2, …, ot, q_t = i | λ)
– the probability of the joint event that o1, o2, …, ot are observed and the state at time t is i, given the model λ
• Induction: α_{t+1}(j) = P(o1, o2, …, o_{t+1}, q_{t+1} = j | λ) = [Σ_{i=1..N} α_t(i) a_ij] b_j(o_{t+1})
04/19/23 76
Solution to Problem 1 - The Forward Procedure (cont.)

Derivation of the induction step:
α_{t+1}(j) = P(o1, …, o_t, o_{t+1}, q_{t+1} = j | λ)
= P(o_{t+1} | o1, …, o_t, q_{t+1} = j, λ) · P(o1, …, o_t, q_{t+1} = j | λ)
= P(o_{t+1} | q_{t+1} = j, λ) · P(o1, …, o_t, q_{t+1} = j | λ)    [output-independence assumption]
= b_j(o_{t+1}) · Σ_{i=1..N} P(o1, …, o_t, q_t = i, q_{t+1} = j | λ)
= b_j(o_{t+1}) · Σ_{i=1..N} P(q_{t+1} = j | o1, …, o_t, q_t = i, λ) · P(o1, …, o_t, q_t = i | λ)
= b_j(o_{t+1}) · Σ_{i=1..N} P(q_{t+1} = j | q_t = i, λ) · α_t(i)    [first-order Markov assumption]
= [Σ_{i=1..N} α_t(i) a_ij] · b_j(o_{t+1})

(Both steps use the factorization P(A, B | λ) = P(A | B, λ) · P(B | λ): the output-independence assumption reduces P(o_{t+1} | o1, …, o_t, q_{t+1} = j, λ) to b_j(o_{t+1}), and the first-order Markov assumption reduces P(q_{t+1} = j | o1, …, o_t, q_t = i, λ) to a_ij.)
04/19/23 77
Solution to Problem 1 - The Forward Procedure (cont.)

Example on the trellis:
α3(2) = P(o1, o2, o3, q3 = 2 | λ)
= [α2(1)·a12 + α2(2)·a22 + α2(3)·a32] · b2(o3)

[Trellis figure: states S1-S3 (vertical) against time 1, 2, 3, …, T-1, T (horizontal) with observations o1, o2, o3, …, oT. A node S_j at time t means b_j(o_t) has been computed; an arc means the corresponding a_ij has been computed.]
04/19/23 78
Solution to Problem 1 - The Forward Procedure (cont.)
• Algorithm
1. Initialization: α1(i) = π_i b_i(o1), 1 ≤ i ≤ N
2. Induction: α_{t+1}(j) = [Σ_{i=1..N} α_t(i) a_ij] b_j(o_{t+1}), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
3. Termination: P(O | λ) = Σ_{i=1..N} αT(i)
Complexity: O(N²T) (MUL: N(N+1)(T-1) + N; ADD: N(N-1)(T-1))
• Based on the lattice (trellis) structure
– computed in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1
• All state sequences, regardless of how long previously, merge to N nodes (states) at each time instance t
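A minimal numpy sketch of the three steps above; the array shapes and the symbol-index encoding of the observations are assumptions of this sketch.

import numpy as np

def forward(pi, A, B, obs):
    # pi: initial distribution (N,); A: row-stochastic transitions (N, N)
    # with A[i, j] = P(q_{t+1}=j | q_t=i); B: emissions (N, M) with
    # B[j, k] = P(o_t = symbol k | q_t = j); obs: list of symbol indices.
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # 1. initialization
    for t in range(1, T):                         # 2. induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                 # 3. termination: P(O | lambda)

# For the Dow Jones example on the next slide, pi = (0.5, 0.2, 0.3) and
# b(up) = (0.7, 0.1, 0.3) give alpha_1 = (0.35, 0.02, 0.09) as shown there.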
04/19/23 79
Solution to Problem 1 - The Forward Procedure (cont.)
• A three-state Hidden Markov Model for the Dow Jones Industrial Average (Huang et al., 2001)
Emissions: b1(up) = 0.7, b2(up) = 0.1, b3(up) = 0.3
Initial probabilities: π1 = 0.5, π2 = 0.2, π3 = 0.3
Transitions into state 1: a11 = 0.6, a21 = 0.5, a31 = 0.4

α1(1) = 0.5·0.7 = 0.35, α1(2) = 0.2·0.1 = 0.02, α1(3) = 0.3·0.3 = 0.09
α2(1) = (0.35·0.6 + 0.02·0.5 + 0.09·0.4)·0.7
04/19/23 80
Solution to Problem 2 - The Viterbi Algorithm
• The Viterbi algorithm can be regarded as the dynamic programming algorithm applied to the HMM or as a modified forward algorithm– Instead of summing up probabilities from different paths coming
to the same destination state, the Viterbi algorithm picks and remembers the best path
• Find a single optimal state sequence Q=(q1,q2,……, qT)
– The Viterbi algorithm also can be illustrated in a trellis framework similar to the one for the forward algorithm
04/19/23 81
Solution to Problem 2 - The Viterbi Algorithm (cont.)

[Trellis figure: states S1-S3 against time 1, 2, 3, …, T-1, T with observations o1, o2, o3, …, oT; the Viterbi algorithm keeps only the best path into each node.]
04/19/23 82
Solution to Problem 2 - The Viterbi Algorithm (cont.)
1. Initialization: δ1(i) = π_i b_i(o1), ψ1(i) = 0, 1 ≤ i ≤ N
2. Induction: δ_t(j) = max_{1≤i≤N} [δ_{t-1}(i) a_ij] · b_j(o_t); ψ_t(j) = argmax_{1≤i≤N} [δ_{t-1}(i) a_ij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
3. Termination: P* = max_{1≤i≤N} δT(i); q_T* = argmax_{1≤i≤N} δT(i)
4. Backtracking: q_t* = ψ_{t+1}(q_{t+1}*), t = T-1, T-2, …, 1
Q* = (q1*, q2*, …, qT*) is the best state sequence
Complexity: O(N²T)
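A compact sketch of steps 1-4; pi, A, B and obs follow the same conventions assumed in the forward() sketch above.

import numpy as np

def viterbi(pi, A, B, obs):
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                 # initialization
    for t in range(1, T):                        # induction
        scores = delta[t - 1][:, None] * A       # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]             # termination
    for t in range(T - 1, 0, -1):                # backtracking
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()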
04/19/23 83
Solution to Problem 2 - The Viterbi Algorithm (cont.)
• A three-state Hidden Markov Model for the Dow Jones Industrial Average (Huang et al., 2001)
b1(up) = 0.7, b2(up) = 0.1, b3(up) = 0.3; π1 = 0.5, π2 = 0.2, π3 = 0.3; a11 = 0.6, a21 = 0.5, a31 = 0.4

δ1(1) = 0.5·0.7 = 0.35, δ1(2) = 0.2·0.1 = 0.02, δ1(3) = 0.3·0.3 = 0.09
δ2(1) = max(0.35·0.6, 0.02·0.5, 0.09·0.4)·0.7 = 0.35·0.6·0.7 = 0.147
Ψ2(1) = 1
04/19/23 84
Solution to Problem 3 – The Baum-Welch Algorithm
• How to adjust (re-estimate) the model parameters λ = (A, B, π) to maximize P(O | λ)?
– the most difficult of the three problems: there is no known analytical method that maximizes the joint probability of the training data in closed form
• The data is incomplete because of the hidden state sequence
– the problem can be solved by the iterative Baum-Welch algorithm, also known as the forward-backward algorithm
• The EM (Expectation-Maximization) algorithm is perfectly suited to this problem

Baum-Welch local maximization
• 1st step: you determine
– the number of hidden states, N
– the emission (observation) alphabet
• 2nd step: randomly assign values to
A: the transition probabilities
B: the observation (emission) probabilities
π: the starting state probabilities
• 3rd step: let the machine re-estimate A, B, π
04/19/23 85
04/19/23 86
Solution to Problem 3 – The Backward Procedure
• Backward variable: β_t(i) = P(o_{t+1}, o_{t+2}, …, oT | q_t = i, λ)
– the probability of the partial observation sequence o_{t+1}, o_{t+2}, …, oT, given state i at time t and the model λ

Example on the trellis:
β2(3) = P(o3, o4, …, oT | q2 = 3, λ)
= a31·b1(o3)·β3(1) + a32·b2(o3)·β3(2) + a33·b3(o3)·β3(3)

[Trellis figure: states S1-S3 against time 1, 2, 3, …, T-1, T with observations o1, …, oT.]
04/19/23 87
Solution to Problem 3 – The Backward Procedure (cont.)
• Algorithm
1. Initialization: βT(i) = 1, 1 ≤ i ≤ N
2. Induction: β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j), t = T-1, …, 1, 1 ≤ i ≤ N
Complexity: O(N²T) (MUL ≈ 2N²(T-1); ADD ≈ N(N-1)(T-1))
• Termination in terms of β:
P(O | λ) = Σ_{i=1..N} π_i b_i(o1) β1(i)
(cf. P(O | λ) = Σ_{i=1..N} αT(i) from the forward procedure)
04/19/23 88
Solution to Problem 3 – The Forward-Backward Algorithm
• Relation between the forward and backward variables:
α_t(i) = P(o1, …, ot, q_t = i | λ) = [Σ_{j=1..N} α_{t-1}(j) a_ji] b_i(o_t)
β_t(i) = P(o_{t+1}, …, oT | q_t = i, λ) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j)
α_t(i) β_t(i) = P(O, q_t = i | λ)
P(O | λ) = Σ_{i=1..N} α_t(i) β_t(i), for any t
(Huang et al., 2001)
04/19/23 89
Solution to Problem 3 – The Forward-Backward Algorithm (cont.)
γ_t(i) = P(q_t = i | O, λ) = P(O, q_t = i | λ) / P(O | λ)
where
P(O, q_t = i | λ) = P(o1, …, ot, q_t = i | λ) · P(o_{t+1}, …, oT | q_t = i, λ) = α_t(i) β_t(i)
and
P(O | λ) = Σ_{i=1..N} P(O, q_t = i | λ) = Σ_{i=1..N} α_t(i) β_t(i)
so
γ_t(i) = α_t(i) β_t(i) / Σ_{i=1..N} α_t(i) β_t(i)
04/19/23 90
Solution to Problem 3 – The Intuitive View
• Define two new variables:
γ_t(i) = P(q_t = i | O, λ): the probability of being in state i at time t, given O and λ
γ_t(i) = α_t(i) β_t(i) / P(O | λ) = α_t(i) β_t(i) / Σ_{i=1..N} α_t(i) β_t(i)
ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ): the probability of being in state i at time t and state j at time t+1, given O and λ
ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O | λ)
= α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{m=1..N} Σ_{n=1..N} α_t(m) a_mn b_n(o_{t+1}) β_{t+1}(n)
Note that γ_t(i) = Σ_{j=1..N} ξ_t(i, j)
04/19/23 91
Solution to Problem 3 – The Intuitive View (cont.)
• P(q3 = 3, O | λ) = α3(3) · β3(3)

[Trellis figure: the forward paths into state 3 at time 3 (α3(3)) joined with the backward paths out of it (β3(3)).]
04/19/23 92
Solution to Problem 3 – The Intuitive View (cont.)
• P(q3 = 3, q4 = 1, O | λ) = α3(3) · a31 · b1(o4) · β4(1)

[Trellis figure: forward paths into state 3 at time 3, the transition a31 with emission b1(o4), and backward paths out of state 1 at time 4.]
04/19/23 93
Solution to Problem 3 – The Intuitive View (cont.)
γ_t(i) = P(q_t = i | O, λ)
ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ)
Σ_{t=1..T-1} ξ_t(i, j) = expected number of transitions from state i to state j in O
Σ_{t=1..T-1} γ_t(i) = expected number of transitions from state i in O
04/19/23 94
Solution to Problem 3 – The Intuitive View (cont.)
• Re-estimation formulae for π, A and B:
π̄_i = expected frequency (number of times) in state i at time t = 1 = γ1(i)
ā_ij = expected number of transitions from state i to state j / expected number of transitions from state i = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i)
b̄_j(v_k) = expected number of times in state j observing symbol v_k / expected number of times in state j = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1..T} γ_t(j)
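A sketch of one re-estimation iteration, computing γ and ξ from the forward and backward variables and applying the three formulas above. It assumes the forward() sketch given earlier; obs is an array of symbol indices.

import numpy as np

def backward(A, B, obs):
    # beta_t(i), computed right to left; beta_T(i) = 1.
    N, T = A.shape[0], len(obs)
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(pi, A, B, obs):
    obs = np.asarray(obs)
    alpha, p_obs = forward(pi, A, B, obs)
    beta = backward(A, B, obs)
    gamma = alpha * beta / p_obs                          # gamma_t(i)
    # xi[t, i, j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O|lambda)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / p_obs
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B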
04/19/23 95
How is it connected to Gene prediction?
[Urn figure repeated: hidden states #1, #2, #3 with emission distributions B1, B2, B3 over ball colors, the transition matrix A for moving between urns from turn i to turn i+1, and initial probabilities π.]
04/19/23 96
How is it connected to Gene prediction?
[Figure: the same HMM with the urns relabeled for genes. The hidden states are now Exon, Intron and UTR; the emitted symbols are the nucleotides A, C, G and T. Each state has its own emission distribution over nucleotides, and the transition matrix A models moving between exon, intron and UTR along the sequence.]
04/19/23 97
GENSCAN
Chris Burge, 1997

[State diagram: intergenic region, promoter, 5’ UTR, initial/internal/terminal exon states (Einit, E0, E1, E2, Eterm), intron states (I0, I1, I2), a single-exon-gene state, 3’ UTR and poly-A signal. Below it, an example parse of a sequence: 5’ UTR, Ex1, In1, Ex2, In2, Ex3, In3, Ex4, In4, Ex5, 3’ UTR, with GT…AG marking intron boundaries.]
04/19/23 98
GENSCAN components

[The GENSCAN state diagram again: intergenic region, promoter, 5’ UTR, exon states E0, E1, E2, Einit, Eterm, intron states I0, I1, I2, single exon gene, 3’ UTR and poly-A signal, annotated with two kinds of parameters:]
• Sequence-generating models P1, P2, P3, P4: each state generates sequence under its own model (e.g. position-specific probabilities over A, C, G, T)
• A set of length distributions f1, f2, f3, f4: each state has a duration model, e.g. f_intron(10) = 0, f_intron(350) = .03
04/19/23 99
How do we use all that for gene prediction?

Definitions: for a fixed sequence length L we define
ФL: the set of all possible parses of length L
SL: the set of all possible DNA sequences of length L
ΩL = ФL × SL
Our model M is a probability measure on this space: it assigns a probability density to each parse/sequence pair.
04/19/23 100
Or in other words…
Given a sequence S and a parse Фi, e.g.
A C G C G A C T A G G C G C A G G T C T A … G A T
[parse labels across the sequence: Exon0, Intron0, Exon1, Intron1, …, 3’UTR]
we can calculate P(S, Фi):
04/19/23 101
A C G C G A C T A G G C G C A G G T C T A … G A T
[parse labels across the sequence: Exon0, Intron0, Exon1, Intron1, …, 3’UTR]

[The model components from before: sequence-generating models P1…P4 and length distributions f1…f4 attached to the GENSCAN state diagram.]

P(S, Фi) = π_q1 f_q1(d1) P_q1(s1) · A_{q1->q2} f_q2(d2) P_q2(s2) · … · A_{qk-1->qk} f_qk(dk) P_qk(sk)
04/19/23 102
In order to parse a given sequence S (i.e. predict genes in S), we compute the conditional probability of a parse Фi given the sequence S:

P(Фi | S) = P(S, Фi) / P(S) = P(S, Фi) / Σ_j P(S, Фj)

where P(S, Фi) = π_q1 f_q1(d1) P_q1(s1) · A_{q1->q2} f_q2(d2) P_q2(s2) · … · A_{qk-1->qk} f_qk(dk) P_qk(sk)

Prediction: find the parse with maximum likelihood, i.e. the Фi maximizing P(Фi | S).
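A sketch of the product formula for a single parse. The containers (dicts of initial/transition probabilities, per-state length distributions f, and precomputed sequence-model terms P_qk(sk)) are hypothetical scaffolding for illustration, not GENSCAN's actual data structures.

def parse_prob(states, durations, seq_probs, pi, A, f):
    # states: state name per segment; durations: segment lengths d_k;
    # seq_probs: P_qk(sk), each segment's probability under its state's
    # sequence-generating model (precomputed elsewhere).
    p = pi[states[0]] * f[states[0]](durations[0]) * seq_probs[0]
    for k in range(1, len(states)):
        p *= (A[states[k - 1]][states[k]]
              * f[states[k]](durations[k]) * seq_probs[k])
    return p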
04/19/23 103
Splice site sequence generator
• What about adjacent nucleotide dependencies?
• What about non-adjacent nucleotide dependencies?
• What is the probability of generating the signal O-5 O-4 … O6?

[Figure: aligned donor-site sequences over positions -5 … +6 (exon | GT at positions 0-1), e.g. CACCG GTAGATA, CACCT GTGGATA, CACAG GTAGATA, with a table of the percentage of A, C, G and T observed at each position.]

WMM (Weight Matrix Method): position-specific nucleotide probabilities, with positions treated as independent.
WAM (Weight Array Model): the conditional probability of generating nucleotide Xk at position i given nucleotide Xj at position i-1 (captures adjacent dependencies).
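A sketch of a WMM built from aligned true sites and used as a score; the pseudocount and the uniform background are assumptions of this sketch.

import numpy as np

def build_wmm(sites, pseudo=1.0):
    # sites: aligned, equal-length signal sequences; returns per-position
    # nucleotide frequencies (L x 4) with a small pseudocount.
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    counts = np.full((len(sites[0]), 4), pseudo)
    for s in sites:
        for p, nt in enumerate(s):
            counts[p][idx[nt]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def wmm_score(site, freqs, background=0.25):
    # Log-likelihood ratio of the candidate site vs. a uniform background.
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    return sum(np.log(freqs[p][idx[nt]] / background)
               for p, nt in enumerate(site))

# A WAM would instead index freqs[p][prev_nt][nt], conditioning each
# position on the previous nucleotide.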
04/19/23 104
What about non-adjacent nucleotide dependencies?
Procedure: MDD (Maximal Dependency Decomposition)
04/19/23 105
What about non-adjacent nucleotide dependencies?
MDD: Maximal Dependency Decomposition

Given a data set D consisting of N aligned sequences of length k:
1. Align the sequences.
2. Find Ci, the consensus nucleotide at position i.
3. For each pair of positions (i, j) with i != j, calculate the χ² statistic for the consensus indicator Ci vs. the nucleotide indicator Xj:
χ² = Σ (O - E)² / E
where, for each nucleotide at position j, O is the observed count among the NCi sequences containing Ci and E is the expected count, e.g. (%A in D) · NCi.
4. Calculate Si, the sum of the row for position i: a measure of the dependency between Ci and the nucleotides at the remaining positions.
5. If no stop condition holds, choose the Ci with maximal Si, partition D on it, and repeat. Stop conditions:
1. the k-1 level of the tree is reached
2. no significant dependencies are found
3. the number of remaining sequences is too small
04/19/23 106
Not mentioned
• Reverse strand states
• C+G%
• Coding / non-coding detection
• Branch point detection
• Expected vs. observed AG composition
• And more…

[GENSCAN state diagram repeated: intergenic region, promoter, 5’ UTR, exon states E0, E1, E2, Einit, Eterm, intron states I0, I1, I2, single exon gene, 3’ UTR, poly-A signal.]

[Plot: expected and observed percentage of AG near the acceptor site in coding regions, positions -100 to +20, frequencies up to about 0.2.]
04/19/23 107
Evaluating prediction programs
[Figure: the actual gene annotation vs. the predicted one along the sequence, with positions partitioned into TP, FP, TN and FN.]

Sensitivity / Recall: how many of the known genes were found?
Specificity / Precision: how many of the predicted genes were real?
Correlation / F-measure: how good is it overall?
04/19/23 108
Evaluating prediction programs

[Per-nucleotide comparison of the actual and predicted annotations, partitioned into TP, FP, TN and FN.]

Sensitivity: Sn = TP / (TP + FN)
Specificity: Sp = TP / (TP + FP)
F-measure: F = (Sn + Sp) / 2
Correlation coefficient: CC = (TP·TN - FP·FN) / [(TP + FP)(TN + FN)(TP + FN)(TN + FP)]^0.5
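The formulas translate directly into code; note that "specificity" as used here is what is elsewhere called precision.

def evaluate(tp: int, fp: int, tn: int, fn: int):
    sn = tp / (tp + fn)          # sensitivity / recall
    sp = tp / (tp + fp)          # specificity as defined above (= precision)
    f = (sn + sp) / 2            # F-measure as defined above
    cc = ((tp * tn - fp * fn) /
          ((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp)) ** 0.5)
    return {"Sn": sn, "Sp": sp, "F": f, "CC": cc}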
04/19/23 109
Gene Prediction Accuracy at the Exon Level
[Figure: actual vs. predicted exons along the sequence, with correct, wrong and missing exons marked.]

Sensitivity: Sn = number of correct exons / number of actual exons
Specificity: Sp = number of correct exons / number of predicted exons
04/19/23 110
Gene finders - a comparison
             Accuracy per nucleotide    Accuracy per exon
Method       Sn    Sp    AC       Sn    Sp    (Sn+Sp)/2   ME    WE
GENSCAN      0.93  0.93  0.91     0.78  0.81  0.80        0.09  0.05
FGENEH       0.77  0.85  0.78     0.61  0.61  0.61        0.15  0.11
GeneID       0.63  0.81  0.67     0.44  0.45  0.45        0.28  0.24
GeneParser2  0.66  0.79  0.66     0.35  0.39  0.37        0.29  0.17
GenLang      0.72  0.75  0.69     0.50  0.49  0.50        0.21  0.21
GRAILII      0.72  0.84  0.75     0.36  0.41  0.38        0.25  0.10
SORFIND      0.71  0.85  0.73     0.42  0.47  0.45        0.24  0.14
Xpound       0.61  0.82  0.68     0.15  0.17  0.16        0.32  0.13

Sn = Sensitivity; Sp = Specificity; AC = Approximate Correlation; ME = Missing Exons; WE = Wrong Exons
GENSCAN Performance Data, http://genes.mit.edu/Accuracy.html
04/19/23 111
Gene finder comparison (cont.)
"Evaluation of gene finding programs" S. Rogic, A. K. Mackworth and B. F. F. Ouellette. Genome Research, 11: 817-832 (2001).
04/19/23 112
After putative genes are found, they’re annotated
Annotation categories:
1. Matches known protein sequence
2. Strong similarity to protein sequence
3. Similar to known protein
4. Similar to unknown protein
5. Similar to EST (i.e., putative protein)
6. No EST or protein matches (i.e., hypothetical protein)
04/19/23 113
Pitfalls and Issues
Several issues make the problem of eukaryotic gene finding extremely difficult.
1) Very long genes: for example, the largest human gene, the dystrophin gene, is composed of 79 exons spanning nearly 2.3 Mb.
2) Very long introns: again, in the human dystrophin gene, some introns are >100 kb long and >99% of the gene is composed of introns.
04/19/23 114
Pitfalls and Issues
3) Very conserved introns (conserved non-coding sequences): this is particularly a problem when gene prediction is bolstered by similarity searches.
04/19/23 115
Pitfalls and Issues
4) Very short exons: Some exons are only 3 bp long in Arabidopsis genes. Such small exons are easily missed by all content sensors, especially if bordered by large introns. The more difficult cases are those where the length of a coding exon is a multiple of three (typically 3, 6 or 9 bp long), because missing such exons will not cause a problem in the exon assembly as they do not introduce any change in the frame.
04/19/23 116
Pitfalls and Issues
5) Overlapping genes: Though very rare in eukaryotic genomes, there are some documented cases in animals as well as in plants
6) Polycistronic gene arrangement: Also rare. One gene and one mRNA, but two or more proteins.
04/19/23 117
Pitfalls and Issues
7) Frameshifts: Some sequences stored in databases may contain errors (either sequencing errors or simply errors made when editing the sequence) resulting in the introduction of artificial frameshifts (deletion or insertion of one base). Such frameshifts greatly increase the difficulty of the computational gene finding problem by producing erroneous statistics and masking true solutions.
04/19/23 118
Pitfalls and Issues
8) Introns in UTRs: there are genes for which the genomic region corresponding to the 5’- and/or 3’-UTR in the mature mRNA is interrupted by one or more introns.
9) Alternative transcription start: e.g. three alternative promoters regulate the transcription of the 14 kb full-length dystrophin mRNAs, and four ‘intragenic’ promoters control that of smaller isoforms.
04/19/23 119
Pitfalls and Issues
10) Alternative splicing.
11) Alternative polyadenylation: about 20% of human transcripts show evidence of alternative polyadenylation, which affects where the 3’ end is cleaved.
04/19/23 120
Pitfalls and Issues
12)Alternative initiation of translation: finding the right AUG initiator is still a major concern for gene prediction methods. the rule stating that the firrst AUG in the mRNA is the initiator codon can be escaped through three mechanisms: context-dependent leaky scanning, re-initiation and direct internal initiation. Non-AUG triplet can sometimes act as the functional codon for translation initiation, as ACG in Arabidopsis or CUG in human sequences