genome 559, winter 2014 · 2014-02-20 · genome 559, winter 2014 . ab initio gene prediction...

25
Ab initio gene prediction Genome 559, Winter 2014

Upload: others

Post on 28-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Ab initio gene prediction

Genome 559, Winter 2014

Page 2: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use those parameters to obtain a best interpretation of genes from any region from genome sequence alone.

1) Splice donor sequence model 2) Splice acceptor sequence model 3) Intron and exon length distribution 4) Open reading frame requirement in coding exons 5) Requirement that introns maintain reading frame 6) Transcription start and stop models (difficult to predict,

often omitted).

ab initio = "from the beginning" (i.e. without experimental evidence)

Page 3: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Sites we might want to predict

Splice donor site

Splice acceptor site

Translation start

Translation stop

(some predictors only deal with coding exons; the 5' and 3' ends are harder to predict.)

Page 4: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Open reading frames (random sequence)

• 61 of 64 codons are not stop codons (0.953 assuming equal nucleotide frequencies). • Probability of not having a stop codon in a particular reading frame along a length L of DNA is a geometric distribution that decays rapidly with L. • There are 3 reading frames on each DNA strand.

Page 5: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

(distance in codons)

long open reading frames are rare in random sequence

Geometric distribution in random sequence of distance to first stop codon (p=3/64)

( ) (1 )kP X k p p= = −

k amino acid codons followed by one

stop codon

Page 6: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Splice donor and acceptor information

donor, C. elegans (sums to ~8 bits)

acceptor, C. elegans (sums to ~9 bits)

Note – these show a log-odds measure of information content compared to background nucleotide frequencies. Similar to BLOSUM matrix log-odds.

exon

exon intron

intron

Page 7: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Position Specific Score Matrix (PSSM)

1 2 3 4 5 6 7 8 9 10 11 12 13 14

1 1 2 1 0 0 2 2 0 1 1 1 1 1

0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 2 8 0 1 0 3 0 0 0 0 0

0 0 0 0 0 8 1 1 1 2 1 1 1 1

A C G T

Slide PSSM along DNA, computing a score at every position.

splice donor

(this is a conceptual example, the real thing would be computed as log-odds values, similar to BLOSUM matrices)

Page 8: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Intron length distribution (C. elegans)

Note: intron length distributions in Drosophila melanogaster and Homo sapiens (and most other animals) are longer and broader.

Page 9: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Other information that can be used

• Splice donor and acceptor must be paired and donor must be upstream of acceptor (duh). • Introns in coding regions must maintain open reading frame of the flanking exons. • Nucleotide content analysis (e.g. introns tend to be AT rich).

Page 10: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Simple conceptual example

• Sites scored on basis of PSSM matches to known splice donor sequence model. • Arrow length reflects quality of match (worse matches not shown).

(plus strand only)

Page 11: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

(example cont.)

(one reasonable interpretation)

Add splice acceptor information

Where would you (intuitively) infer introns?

Page 12: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

(example cont.)

stop codon before highest scoring

splice donor!

reinterpreted (avoids stop codon by using lower scoring splice donor):

Page 13: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Real example (end result)

Note that this gene has no mRNA sequences (EST and ORFeome tracks empty). This is a pure ab initio prediction (made by Phil Green's genefinder).

1 2 3 4

Page 14: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Hidden Markov Model (HMM)

Markov chain - a linear series of states in which each state is dependent only on the previous state. HMM - a model that uses a Markov chain to infer the most likely states in data with unknown states ("hidden" states). A Markov chain has states and transition probabilities:

A B pAB

pBA

(since there are only two states, the probability of staying in state A is 1-pAB and the probability of staying in state B is 1-pBA )

states A and B

Page 15: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

A B

A 0.98 0.02

B 0.4 0.6

What will the series of states look like (qualitatively) for this Markov chain?

It will have long stretches of A states, interspersed with short stretches of B states.

A -> B

B -> A

Page 16: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Hidden Markov Model

We have a Markov chain with appropriate states and known transition probabilities (e.g. best fit to experimentally known genes). We have a DNA sequence with unknown states. Find the series of Markov chain states with the maximum likelihood for the DNA sequence. Solved with the Viterbi algorithm (we won't cover this, but it is another dynamic programming algorithm). See http://en.wikipedia.org/wiki/Viterbi_algorithm

Page 17: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

coding exon states (three frames)

intron states (three frames of codon they insert into)

special first (init) and last (term) coding exon states

(splice acceptor)

(splice donor)

Gene Prediction HMM States

taken from Stormo lab paper

Page 18: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

A way to connect the HMM formalism to specifics

probability of being in an intron “state” (based solely on donor sites)

Note – these probabilities are qualitative and are intended only to portray the local trends.

Page 19: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Long open reading frames favor exon state

123

probability of being in an exon “state” (based only on frame 1 ORF)

Page 20: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Ab initio gene prediction

• Requires a Markov chain with transition probabilities (typically "trained" on known genes or genes in a closely-related species).

• Produces a "parse" of a DNA sequence into each of the Markov states (intron, exon, intergenic, etc).

• Accuracy varies and always imperfect.

Page 21: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use
Page 22: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Intron positions and reading frame

• The intron can be any length and still produce the same exons • This particular splice is between two codons (0-shifting) • The splice position can move and maintain coding frame as long as both positions move coordinately. • If one splice endpoint moves it may change reading frame

ATGATCCTGGAGTCGgttggtgaacttgaaatttagGACGCTGTTATTTCCintron exon exon

ATGATCCTGGAGTCGGACGCTGTTATTTCC

M I L E S D A V I S

Page 23: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

Using evolutionary conservation as a guide to improve ab initio predictions - an example using dot plots is shown in the next two slides. The principle is that introns usually diverge much faster than coding exons. (though not always true)

Page 24: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

good exon

dubious exon 1

Gene A (ab initio

model)

Gene B (ab initio model)

DNA dot matrix comparison of two ab initio gene predictions in related genomes

other possible corrections

add intron?

remove intron?

exon 2 too long

Page 25: Genome 559, Winter 2014 · 2014-02-20 · Genome 559, Winter 2014 . Ab initio gene prediction method • Define parameters of real genes (based on experimental evidence): • Use

After correction of exons 1 and 2 in the red gene