genes recognition - École polytechnique fédérale de ...lsir recognition.pdf · 2. score of...

Genes Recognition Julien Favre


Post on 26-Mar-2020


Page 1: Genes Recognition - École Polytechnique Fédérale de ...lsir recognition.pdf · 2. Score of weight matrix of the GT el. 3. Distance between Poly-A and GT 4. Nucleotide composition

Genes Recognition

Julien Favre


Agenda

• PART 1 : Gene Structure
  – Gene Definition
  – Transcription Process
  – Gene Details

• PART 2 : Problem Definition
  – Gene Recognition
  – Why?
  – Complexity

• PART 3 : Problem Approach
  – Approaches
  – Solutions description
  – Method improvements
  – Conclusion

PART 1


Gene Definition

What’s a Gene?


DNA


Transcription Process I


Transcription Process II


STEP 1

STEP 2

STEP 3


STEP 1 Transcription


ANIMATION


STEP 2 Processing


Capping and Poly-A

Splicing


STEP 3 Translation


ANIMATION


More details on Genes


[Figure: structure of a gene from 5’ to 3’, with labels: coding region, non-coding region, TATA box, beginning of the gene, start codon, end codon, splice sites]

It differs from gene to gene!


Agenda

• PART 1 : Gene Structure
  – Gene Definition
  – Transcription Process
  – Gene Details

• PART 2 : Problem Definition
  – Gene Recognition
  – Why?
  – Complexity

• PART 3 : Problem Approach
  – Approaches
  – Solutions description
  – Method improvements
  – Conclusion

PART 2


Situation

• Over 3’500 million nucleotides
• 35’000 to 50’000 genes

2 Important Questions:

1) Where are the genes?
2) What are the coding parts?


Why?

• Annotate and correct the DNA databases
• Link genes with the known proteins
• Understand gene functions
• Understand gene expression mechanisms

We can read the DNA alphabet, but we do not know where the meaningful words are or what they mean.


Complexity I

ATTCGATGCATTGGCATGATGACGTAGCGTATTAGACGACTAGCAGAGTGCACGACGATAGGCACGATAGACGCATTAGCAGCGCGACGCGCGAATGCGTGCGTCGTGTTGCGTCGCGCATTGGCATGATGACGTAGCGTATTAGACGACTAGCAGAGTGCACGACGATAGGCACGATAGACGCATTAGCAGCGCGACGCGCGA

[The same block of bases is repeated to fill the whole slide]

3’500 Million bases


Complexity II

[Figure: a DNA sequence with candidate acceptor sites and donor sites delimiting the possible exons]

Number of parses = Fibonacci(n+m+1)
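The growth of Fibonacci(n+m+1) can be made concrete with a few values. The following is a minimal sketch: the slide does not define n and m, so reading them as the numbers of candidate acceptor and donor sites is an assumption, as is the indexing convention F(0) = 0, F(1) = 1. The helper names are hypothetical.

```python
def fibonacci(i):
    """Return the i-th Fibonacci number, with F(0) = 0 and F(1) = 1 (assumed convention)."""
    a, b = 0, 1
    for _ in range(i):
        a, b = b, a + b
    return a


def number_of_parses(n, m):
    """Number of parses as stated on the slide: Fibonacci(n + m + 1)."""
    return fibonacci(n + m + 1)


# n = m is an arbitrary illustrative choice, not a value from the slides.
for sites in (5, 10, 20, 40):
    print(sites, number_of_parses(sites, sites))
```

Even a few dozen candidate sites give far too many parses to enumerate one by one, which is the point of the slide.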


Agenda

• PART 1 : Gene Structure
  – Gene Definition
  – Transcription Process
  – Gene Details

• PART 2 : Problem Definition
  – Gene Recognition
  – Why?
  – Complexity

• PART 3 : Problem Approach
  – Approaches
  – Solutions description
  – Method improvements
  – Conclusion

PART 3


Approaches

3 Types of Approaches:

1. Single Gene Recognition
   – Functional Signals detection
     – Splice sites
     – Promoter, Poly-A, …
2. Multiple Genes Recognition
3. Similarities


Single Gene Recognition: Principle

Functional Signals Detection

The main goal is to detect the beginning and the end of the exons or genes.

[Figure: a gene from 5’ to 3’ with the TATA box, start codon, end codon and splice sites marked]


Splicing Mechanism


• Consensus over the donor-acceptor site GU-AG (98%)

• Extremely reliable technique to detect exons


Single Gene Recognition: Methods

1. Combinatorial methods
   – Single block
2. Probabilistic methods
   – Simple
   – Markov based
3. Linear Discriminant methods


Consensus Sequence


Obtained by choosing the most frequent base at each position of the multiple alignment of subsequences of interest

Aligned subsequences:
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT

Consensus Sequence: TATAAT

The same idea with words:
MELON
MANGO
HONEY
SWEET
COOKY

Consensus: MONEY

A consensus sequence leads to loss of information and can produce many false positive or false negative predictions.
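The column-by-column vote described above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and ties are broken alphabetically, which is an assumption (position 4 of the DNA example is a tie between A and G, and the slide does not say how it is resolved).

```python
from collections import Counter


def consensus(aligned):
    """Return the most frequent symbol in each column of equal-length sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must all have the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(column)
        # Assumption: ties are broken alphabetically (max returns the first
        # maximal element of the sorted symbols).
        result.append(max(sorted(counts), key=counts.get))
    return "".join(result)


sites = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
words = ["MELON", "MANGO", "HONEY", "SWEET", "COOKY"]

print(consensus(sites))  # TATAAT
print(consensus(words))  # MONEY
```

MONEY is not one of the five input words, which illustrates the loss of information: the consensus can be a sequence that never occurs in the alignment.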


Combinatorial Methods

Consensus Sequence (ex: TATA box)

For a consensus sequence of size L and for a position in the considered sequence, we compute:

1. P(L,k) = P(Detect the consensus seq. with k mismatches)
2. The probability P(T), where F_l = # possible positions in the considered sequence and T is the number of patterns detected in the given sequence:

$$P(T) = \sum_{z=0}^{k} C_{F_l}^{T} \, P^{T}(L,z) \, \bigl(1 - P(L,z)\bigr)^{F_l - T}$$

3. For a given T_0, define a threshold value for the detection.
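Detecting a consensus sequence with at most k mismatches amounts to sliding a window over the sequence and counting differences. The sketch below shows that scan, together with one possible closed form for P(L,k). The closed form assumes independent, equiprobable bases; that is an assumption of this example and is not stated on the slide. All function names are hypothetical.

```python
from math import comb


def mismatches(window, consensus):
    """Count the positions where the window differs from the consensus."""
    return sum(a != b for a, b in zip(window, consensus))


def find_matches(seq, consensus, k):
    """Return the start positions where the consensus occurs with at most k mismatches."""
    size = len(consensus)
    return [i for i in range(len(seq) - size + 1)
            if mismatches(seq[i:i + size], consensus) <= k]


def p_exactly_k_mismatches(size, k, p_match=0.25):
    """P(L,k) under the ASSUMED model of independent, equiprobable bases."""
    return comb(size, k) * (1 - p_match) ** k * p_match ** (size - k)


print(find_matches("GGTATAATCC", "TATAAT", 0))  # [2]
print(find_matches("GGTATAATCC", "TATAAT", 4))  # [0, 2, 4]
print(p_exactly_k_mismatches(6, 0))             # 0.25 ** 6
```

Allowing more mismatches finds more candidate sites but also more chance hits, which is why a detection threshold is needed.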


Probabilistic Methods I

For a given consensus sequence, a Weight Matrix is computed by measuring the frequency of every base at each position of the site in a training set.

[Figure: weight matrix example; labels: GU, Acceptor site]

Matrix entries can be considered as probabilities.

Disadvantages:

– assumes independence between adjacent bases
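A weight matrix of this kind can be sketched as one frequency table per position. The example below is a minimal illustration, not the matrix shown on the slide: it reuses the six aligned sequences from the Consensus Sequence slide as a tiny training set, and the function name is hypothetical.

```python
def weight_matrix(aligned, alphabet="ACGT"):
    """Return, for each position, the frequency of each base in the training set."""
    n = len(aligned)
    return [{base: column.count(base) / n for base in alphabet}
            for column in zip(*aligned)]


# Training set borrowed from the Consensus Sequence slide (illustration only).
training = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]

for position, frequencies in enumerate(weight_matrix(training), start=1):
    print(position, frequencies)
```

Unlike the consensus sequence, the matrix keeps the information that position 4 is split evenly between A and G.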


Probabilistic Methods II

• Under the weight matrix model, the probability that a sequence X = (x1, x2, …, xk) matches a site is:

$$P(X/S) = \prod_{i=1}^{k} p^{i}_{x_i}$$

If we introduce a measure of the form:

$$LLR(X) = \log\left(\frac{P(X/S)}{P(X/N)}\right)$$

then the more LLR exceeds 0, the better the chances that this sequence is a functional signal. Here S denotes the site model and N the non-site model.
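The log-likelihood ratio can be sketched as a sum of per-position log ratios. This is an illustration under stated assumptions: the non-site model N is taken to be a uniform background (each base with probability 0.25), a pseudocount of 1 is added so that unseen bases do not give log(0), the natural logarithm is used (the slide does not give the base), and the training set is the six aligned sequences from the Consensus Sequence slide. None of these choices comes from the slides.

```python
import math


def weight_matrix(aligned, alphabet="ACGT", pseudocount=1.0):
    """Per-position base frequencies, smoothed with a pseudocount (an assumption)."""
    total = len(aligned) + pseudocount * len(alphabet)
    return [{base: (column.count(base) + pseudocount) / total for base in alphabet}
            for column in zip(*aligned)]


def llr(seq, matrix, background=0.25):
    """LLR(X) = log(P(X/S) / P(X/N)), with an assumed uniform non-site model N."""
    return sum(math.log(matrix[i][base] / background)
               for i, base in enumerate(seq))


training = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
matrix = weight_matrix(training)

print(llr("TATAAT", matrix))  # positive: looks like a site
print(llr("CCCCCC", matrix))  # negative: looks like background
```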


Method improvements

• 2-block approach: P(L1,k1,L2,k2) and distance D1
• Multiple nucleotides probabilities
• Neural network approach
• Reading frame
• Markov Models


Markov Models

• Probabilistic methods are 0-order Markov models
• Markov models introduce dependencies between the bases
• The probability of observing a sequence now becomes:

$$P(X/S) = p_0 \prod_{i=1}^{k} p^{i-1,i}_{x_i}$$
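A first-order, position-specific Markov model can be sketched as follows. The reading of the formula used here is an interpretation: p_0 is taken as the probability of the first base, and each later factor as the probability of the base at a position given the base at the previous position. The training set is again the six aligned sequences from the Consensus Sequence slide, no smoothing is applied, and the function names are hypothetical.

```python
from collections import Counter


def train_markov(aligned):
    """Estimate the first-base distribution and position-specific transition probabilities."""
    n = len(aligned)
    p0 = {base: count / n for base, count in Counter(s[0] for s in aligned).items()}
    transitions = []
    for i in range(1, len(aligned[0])):
        pair_counts = Counter((s[i - 1], s[i]) for s in aligned)
        prev_counts = Counter(s[i - 1] for s in aligned)
        transitions.append({pair: count / prev_counts[pair[0]]
                            for pair, count in pair_counts.items()})
    return p0, transitions


def sequence_probability(seq, p0, transitions):
    """P(X/S) under the model; events never seen in training get probability 0."""
    prob = p0.get(seq[0], 0.0)
    for i in range(1, len(seq)):
        prob *= transitions[i - 1].get((seq[i - 1], seq[i]), 0.0)
    return prob


training = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
p0, transitions = train_markov(training)
print(sequence_probability("TATAAT", p0, transitions))  # 5/18, about 0.278
```

With such a small training set many transitions are never observed, so a real model would need far more data or some smoothing.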


Linear Discriminant methods I

Many functional signals are very short => Exploit related characteristics

1. We build a sequence characteristics vector (x1, …, xp)
2. We define

$$Z = \sum_{i=1}^{p} a_i x_i$$

and if Z > c then the sequence corresponds to a site
3. We use a training set to define {a_i} and c
4. The training set of « site sequences » defines a mean vector m1 and the « non-site sequences » a mean vector m2, with s the pooled covariance matrix of the characteristics:

$$a = s^{-1}(m_1 - m_2), \qquad c = a \cdot (m_1 + m_2)/2$$
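The four steps above can be sketched for the case of two characteristics, where the 2x2 matrix s can be inverted by hand. This is a minimal illustration, not the author's implementation: the feature values are made up for the example, s is estimated as the pooled covariance of the two classes, and all function names are hypothetical.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]


def pooled_covariance(group1, group2, m1, m2):
    """Pooled covariance matrix s of the two classes."""
    p = len(m1)
    s = [[0.0] * p for _ in range(p)]
    for rows, m in ((group1, m1), (group2, m2)):
        for r in rows:
            d = [r[j] - m[j] for j in range(p)]
            for i in range(p):
                for j in range(p):
                    s[i][j] += d[i] * d[j]
    dof = len(group1) + len(group2) - 2
    return [[v / dof for v in row] for row in s]


def invert_2x2(m):
    """Inverse of a 2x2 matrix (this sketch only handles two characteristics)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def train_lda(sites, non_sites):
    """Return the coefficients a = s^-1 (m1 - m2) and the threshold c = a.(m1 + m2)/2."""
    m1, m2 = mean_vector(sites), mean_vector(non_sites)
    s_inv = invert_2x2(pooled_covariance(sites, non_sites, m1, m2))
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    a = [s_inv[i][0] * diff[0] + s_inv[i][1] * diff[1] for i in range(2)]
    c = sum(a[i] * (m1[i] + m2[i]) / 2 for i in range(2))
    return a, c


def is_site(x, a, c):
    """Classify a characteristics vector: site if Z = sum(a_i * x_i) > c."""
    return sum(ai * xi for ai, xi in zip(a, x)) > c


# Made-up training data: (feature 1, feature 2) for each sequence.
sites = [(5.0, 1.0), (6.0, 2.0), (7.0, 1.5), (6.0, 1.5)]
non_sites = [(1.0, 3.0), (2.0, 4.0), (3.0, 3.5), (2.0, 3.5)]

a, c = train_lda(sites, non_sites)
print(a, c)
print(is_site((6.0, 1.5), a, c), is_site((2.0, 3.5), a, c))
```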


Linear Discriminant methods II

1. Choose a set of p characteristics
   – Score of the weight matrix
   – Distance to a predicted site
   – Base composition in distant sequence
   – …
2. Test the characteristics with the Mahalanobis distance:

$$D^2 = (m_1 - m_2)^{T} s^{-1} (m_1 - m_2)$$

3. Choose the set of q characteristics that maximizes D^2
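For a single characteristic the Mahalanobis distance reduces to the squared difference of the class means divided by the pooled variance. The sketch below uses that special case to rank characteristics individually. It does not search for the best subset of q characteristics; the feature names and values are made up for the example.

```python
def mean(values):
    return sum(values) / len(values)


def individual_d2(site_values, non_site_values):
    """D^2 for one characteristic: (m1 - m2)^2 divided by the pooled variance."""
    m1, m2 = mean(site_values), mean(non_site_values)
    squares = (sum((v - m1) ** 2 for v in site_values)
               + sum((v - m2) ** 2 for v in non_site_values))
    pooled_variance = squares / (len(site_values) + len(non_site_values) - 2)
    return (m1 - m2) ** 2 / pooled_variance


# Made-up values: (values on site sequences, values on non-site sequences).
features = {
    "weight_matrix_score": ([5.0, 6.0, 7.0, 6.0], [1.0, 2.0, 3.0, 2.0]),
    "base_composition": ([0.5, 0.6, 0.4, 0.5], [0.4, 0.5, 0.6, 0.5]),
}

ranking = sorted(features, key=lambda name: individual_d2(*features[name]), reverse=True)
for name in ranking:
    print(name, individual_d2(*features[name]))
```

A characteristic whose class means coincide gets a D^2 near 0 and contributes little on its own, although it may still help in combination with others.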


Linear Discriminant methods III: Example

Poly-A site

Chosen characteristics:
1. Score of the weight matrix of Poly-A
2. Score of the weight matrix of the GT element
3. Distance between Poly-A and GT
4. Nucleotide composition of the Downstream Region (6,100)
5. Nucleotide composition of the Upstream Region (-100,-1)

[Figure: the 3’ end of a gene, showing the last exon, stop codon, T-rich region, Poly-A site with pattern CAATAAA(T/C), and GT-rich element, annotated with the score of Poly-A, the score of GT, and the distance between Poly-A and GT]


Linear Discriminant methods IV: Example

Poly-A site

Chosen characteristics:
1. Score of the weight matrix of Poly-A
2. Score of the weight matrix of the GT element
3. Distance between Poly-A and GT
4. Nucleotide composition of the Downstream Region (6,100)
5. Nucleotide composition of the Upstream Region (-100,-1)

Mahalanobis Distance

Characteristic    1      4      2      5      3
Individual D2     7.61   3.46   0.01   2.27   0.44
Composed D2       7.61   10.78  11.67  12.36  12.68


Multiple Genes approach

2 Approaches:

1. Discriminant Analysis, Pattern based
   – FGENES
2. HMM, Probabilistic approach
   – FGENEH


Discriminant Analysis


Goal: Detect first and last Exons in a big sequence

1. Find internal exons

2. Find last exons based on 3’ sites

3. Find first exons based on 5’ sites

4. Combine results

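[Figure: the three exon types searched for. Internal exon: Intron, A, Internal exon, D, Intron. Last exon: Intron, A, Last exon, Stop, 3’ site. First exon: 5’ site, ATG, First exon, D, Intron.]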


HMM method I

We want to use a Markov model to represent and recognize genes.


HMM method II

Real model:

[Figure: HMM state diagram with the states N, P, 5’ UTR, Einit, E0, E1, E2, I0, I1, I2, Eterm, Esngl, 3’ UTR and polyA; most state labels appear twice in the diagram]


HMM method III

1. The model must be trained to compute:
   • State transition probabilities
   • Initial distribution

2. For a given sequence, we look for the best path using the Viterbi algorithm

3. We analyze the best path to determine if it could be a gene.
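The search for the best path can be sketched with the Viterbi algorithm on a toy two-state model. This is only an illustration and not the gene model of the previous slide: the states "E" (exon-like, GC-rich) and "I" (intron-like, AT-rich) and all probabilities below are made up rather than trained, and the emission probabilities are an ingredient the slide does not list.

```python
import math

# Toy model: every number here is an illustrative assumption.
STATES = ("E", "I")
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq):
    """Return the most probable state path for a non-empty sequence, as a string."""
    # best[s] is the log-probability of the best path ending in state s.
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] is the best predecessor of state s at position t + 1
    for base in seq[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            p = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            pointers[s] = p
            best[s] = prev[p] + math.log(TRANS[p][s]) + math.log(EMIT[s][base])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=best.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("GGCCAATT"))  # EEEEIIII
```

Working in log space avoids numerical underflow on long sequences, where products of many small probabilities would otherwise round to zero.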


Similarity methods

2 Goals:
1. Find out the gene functions
2. Improve algorithms

2 Main Methods:
1. EST based
2. BLAST with other species


Remarks


• Real challenge is gene recognition in long and complex sequences

• It’s very difficult to measure the accuracy of the methods

• The databases are full of errors


Conclusion

• Best results are obtained by combining methods
  – HMM + EST + Dynamic programming
• This problem will be solved within a few years
• But huge challenges remain
  – Gene regulation
  – Alternative splicing
  – Gene expression


Questions And Remarks


Thanks for your attention