Post on 18-Jan-2016

ORF Calling

Why?

Need to know protein sequence

Protein sequence is usually what does the work

Functional studies

Crystallography

Proteomics

Similarity studies

Proteins are better for remote similarities than DNA sequences

Protein sequences change slower than DNA sequences

ORF Calling

Intrinsic gene calling

Only use information in your DNA sequences. Does not use other information.

Extrinsic gene calling

Compare your DNA sequences to known sequences. Needs other sequences that are known!

Extrinsic gene calling

Start with DNA sequence

Translate in all 6 reading frames

Why are there 6 reading frames?

Frame 3:   AG TAA AAC TTT AAT TGT TGG TTA A
Frame 2:   A GTA AAA CTT TAA TTG TTG GTT AA
Frame 1:   AGT AAA ACT TTA ATT GTT GGT TAA
DNA:       AGT AAA ACT TTA ATT GTT GGT TAA
           ||| ||| ||| ||| ||| ||| ||| |||
           TCA TTT TGA AAT TAA CAA CCA ATT
Frame -1:  TCA TTT TGA AAT TAA CAA CCA ATT
Frame -2:  TC ATT TTG AAA TTA ACA ACC AAT T
Frame -3:  T CAT TTT GAA ATT AAC AAC CAA TT
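A minimal Python sketch of six-frame translation, using the example sequence from the slide. The function names are illustrative, and the sketch assumes the standard genetic code and an uppercase sequence containing only A, C, G and T. The three forward frames start at offsets 0, 1 and 2; the three reverse frames are the same offsets on the reverse complement.

```python
# Standard genetic code, written compactly with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; '*' marks a stop, a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return {frame: protein} for frames 1, 2, 3 and -1, -2, -3."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```

For the slide's sequence, `six_frame_translation("AGTAAAACTTTAATTGTTGGTTAA")[1]` gives `"SKTLIVG*"`, where `*` is the TAA stop at the end of frame 1.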

Extrinsic gene calling

Start with DNA sequence

Translate in all 6 reading frames

Compare your sequence to known protein sequences

Find the ends of each, and call those genes!

For example

[Figure: a DNA sequence, similar protein sequences (e.g. from BLAST), and the resulting protein-encoding gene]

Uses of extrinsic calling

This is how (most) metagenome ORF calling is done

Eukaryotic ORF calling – especially using EST sequences

Problems with extrinsic calling

Very slow (depending on search algorithm)

Dependent on your database

Only finds known genes

Alternatives to extrinsic gene calling

Intrinsic gene calling

Ab initio gene calling

What are the start codons? ATG

What are the stop codons? TAA, TAG, TGA

How frequently do stop codons appear?

Approximately once every 20 amino acids at random (3 of the 64 codons are stop codons)!

A random stretch of 100 amino acids is very likely to have a stop codon!
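The arithmetic behind these numbers can be sketched in a few lines of Python. This assumes all four bases are equally frequent and independent, which real genomes only approximate; the function names are illustrative.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_probability():
    """Chance that one random codon is a stop: 3 of the 4**3 = 64 codons."""
    return len(STOP_CODONS) / 4 ** 3


def expected_codons_between_stops():
    """On average one stop every 64/3 codons, a little over 21."""
    return 1 / stop_probability()


def probability_no_stop(n_codons):
    """Chance that n random codons in a row contain no stop codon."""
    return (1 - stop_probability()) ** n_codons
```

Under these assumptions `probability_no_stop(100)` is a little under 1%, which is why a random open stretch of 100 codons is rare.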

How to call ORFs (the easy way)

[Figure, repeated at each step: the DNA with its six reading frames (3, 2, 1, -1, -2, -3)]

Find all the stop codons

Find all the ORFs > x amino acids (x is often 100 amino acids)

Trim to those ORFs that have a start

Remove “shadow” ORFs (short ORFs that overlap others)

Trim the start sites to first ATG

These are the ORFs
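The steps above can be sketched in Python roughly as follows. This is a simplified illustration, not a production gene caller: it assumes ATG is the only start codon, applies the length cutoff after trimming to the first ATG rather than before, and treats an ORF as a "shadow" when it overlaps a longer ORF by more than `max_overlap` bases. `max_overlap` and the function names are assumptions made for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (start, end) for ATG...stop ORFs in the three frames of one strand.

    Coordinates are 0-based, end is exclusive and includes the stop codon.
    """
    for offset in range(3):
        first_atg = None
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and first_atg is None:
                first_atg = i                      # trim the start to the first ATG
            elif codon in STOPS:
                if first_atg is not None and (i - first_atg) // 3 >= min_codons:
                    yield first_atg, i + 3
                first_atg = None                   # a stop closes the open stretch


def find_orfs(dna, min_codons=100, max_overlap=30):
    """Return sorted (start, end, strand) tuples in forward-strand coordinates."""
    n = len(dna)
    candidates = [(s, e, "+") for s, e in orfs_in_strand(dna, min_codons)]
    for s, e in orfs_in_strand(reverse_complement(dna), min_codons):
        candidates.append((n - e, n - s, "-"))     # map back to the forward strand
    # Longest first, so shorter overlapping ("shadow") ORFs are the ones dropped.
    candidates.sort(key=lambda orf: orf[1] - orf[0], reverse=True)
    kept = []
    for start, end, strand in candidates:
        shadowed = any(
            min(end, k_end) - max(start, k_start) > max_overlap
            for k_start, k_end, _ in kept
        )
        if not shadowed:
            kept.append((start, end, strand))
    return sorted(kept)
```

With a toy sequence such as `"CC" + "ATG" + "GCT" * 5 + "TAA" + "CC"` and `min_codons=3`, the sketch reports a single forward-strand ORF from the ATG through the TAA.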

Intrinsic ORF calling using Markov Models

Markov Models

Based on language processing

Common for gene and protein finding, alignments, and so on

What is the most common word?

English: the

Spanish: el (la)

Portuguese: que

Scrabble

In Scrabble, how do they score the letters?

The most abundant letters (easiest to place on the board) are given the lowest score

1 point: E, A, I, O, N, R, T, L, S, U

2 points: D, G

3 points: B, C, M, P

4 points: F, H, V, W, Y

5 points: K

8 points: J, X

10 points: Q, Z

Making up sentences

Frequency of letters

If I want to make up a sentence, I could choose some letters at random, based on how often they occur in the language (i.e. their Scrabble score)

rla bsht es stsfa ohhofsd

Let's get clever!

What follows a period (“.”)? Usually a space “ ”

What follows a t? Usually an “i” (-tion, -tize, ...)

Frequency of two letters

When the first letter is “t” (from 3,269 words):

ti 51%

te 20%

ta 15%

th 8%

Level 1 analysis

Choose a letter based on the probability that it follows the letter before:

s h a n d t u c ht i n e y m e l e o l l d
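A small Python sketch of the two ways of making up text described so far: picking letters by their overall frequency (zero order) and picking each letter by how often it follows the previous one (first order). The training text, the `seed` parameter and the function names are assumptions for illustration; any real use would train on a much larger corpus.

```python
import random
from collections import Counter, defaultdict


def train_zero_order(text):
    """Count how often each letter occurs."""
    return Counter(text)


def train_first_order(text):
    """For each letter, count which letters follow it."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        followers[prev][nxt] += 1
    return followers


def generate(text, length, order, seed=0):
    """Make up `length` characters using a zero- or first-order model of `text`."""
    rng = random.Random(seed)
    letters = train_zero_order(text)
    followers = train_first_order(text)

    def draw(counts):
        return rng.choices(list(counts), weights=list(counts.values()))[0]

    out = [draw(letters)]
    while len(out) < length:
        counts = followers.get(out[-1]) if order == 1 else None
        # Fall back to plain letter frequencies when there is nothing to follow.
        out.append(draw(counts or letters))
    return "".join(out)
```

Every pair of neighbouring letters in first-order output has been seen in the training text, which is why it starts to look more word-like than zero-order output.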

Levels of analysis

1 letter (a, e, o …): zero order model

2 letters (th, ti, sh …): first order model

3 letters (the, and, …): second order model

4 letters (that, …): third order model

Markov models

With about 10th order Markov models of English you get complete words and sentences!

Scoring words with Markov Models

If I choose random letters how can I tell if they are real words?

Sum the scores of 10th order Markov models across the words … if it is high it is likely to be a real word!

In reality, maybe use 1st, 2nd, 3rd, 4th, 5th, 6th … order models and compare to some known words

Markov Models and ORF calling

Codons have three letters (ATG, CAC, GGG, ...)

Use a 2nd order Markov model for ORF calling

The probability of a letter is predicted from the two letters before it
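As a rough Python illustration of a 2nd order model on DNA: count which base follows each pair of bases in some training sequences, turn the counts into probabilities, and score a new sequence by how much better the model predicts it than a uniform background. The pseudocount, the uniform background of 0.25 and the function names are assumptions for this sketch, and it assumes sequences contain only A, C, G and T.

```python
import math
from collections import Counter, defaultdict

BASES = "ACGT"


def train_second_order(sequences, pseudocount=1.0):
    """Estimate P(next base | previous two bases) from training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(2, len(seq)):
            counts[seq[i - 2:i]][seq[i]] += 1
    model = {}
    for a in BASES:
        for b in BASES:
            context = a + b
            total = sum(counts[context][n] for n in BASES) + 4 * pseudocount
            model[context] = {
                n: (counts[context][n] + pseudocount) / total for n in BASES
            }
    return model


def log_odds_score(seq, model, background=0.25):
    """Sum of log2(model probability / background) over every base after the first two."""
    score = 0.0
    for i in range(2, len(seq)):
        score += math.log2(model[seq[i - 2:i]][seq[i]] / background)
    return score
```

A sequence that resembles the training data gets a positive score; one the model has never seen scores near zero or below.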

Scrabble

Do English and Spanish use the same letters?

Scrabble (México)

Scrabble (US)

Based on the front page of the NY Times!

1 point: E, A, I, O, N, R, T, L, S, U

2 points: D, G

3 points: B, C, M, P

4 points: F, H, V, W, Y

5 points: K

8 points: J, X

10 points: Q, Z

Scrabble (Spanish)

1 point: A, E, O, I, S, N, L, R, U, T

2 points: D, G

3 points: C, B, M, P

4 points: H, F, V, Y

5 points: CH, Q

8 points: J, LL, Ñ, RR, X

10 points: Z

What about Scrabble scores for DNA?

Will vary with the composition of the organism!

Remember, some organisms have high G+C compared to A+T
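The DNA equivalent of counting letters is base composition. A minimal Python sketch, with illustrative function names and assuming a non-empty sequence; characters other than A, C, G and T are ignored:

```python
from collections import Counter


def base_frequencies(dna):
    """Fraction of each of A, C, G and T among the unambiguous bases."""
    counts = Counter(dna)
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}


def gc_content(dna):
    """Fraction of bases that are G or C."""
    freqs = base_frequencies(dna)
    return freqs["G"] + freqs["C"]
```

Two organisms with very different `gc_content` values will have very different "letter scores", which is one reason a model trained on one genome may not transfer to another.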

Problems!

Need to train the Markov model – not all organisms are the same

Can use phylogenetically close organisms

Can use “long ORFs” – likely to be correct, because a long stretch without a stop codon is unlikely to arise at random!

Interpolated Markov Model (the IMM in GLIMMER)

Markov Models order 1-8 (word size 2-9)

Discard (or ↓ weight) for rare words

Promote (or ↑ weight) for common words

Probability is the sum of all probabilities from orders 1-8 (word sizes 2-9)
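The idea of combining several orders can be sketched in Python as below. This is a deliberately simplified interpolation and not GLIMMER's actual weighting scheme: each order's estimate is mixed with the estimate from the shorter context, with a weight `n / (n + pseudo)` that grows with the number of times `n` the context was seen, so rare words count for little and common words dominate. The `pseudo` constant and the function names are assumptions for this example.

```python
from collections import Counter, defaultdict


def train_counts(sequences, max_order=8):
    """Count the next base after every context of length 1..max_order."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq)):
            for k in range(1, max_order + 1):
                if i - k >= 0:
                    counts[seq[i - k:i]][seq[i]] += 1
    return counts


def interpolated_probability(context, nxt, counts, max_order=8, pseudo=10.0):
    """Blend the order 1..max_order estimates of P(nxt | context)."""
    prob = 0.25  # starting point: all four bases equally likely
    for k in range(1, min(max_order, len(context)) + 1):
        seen = counts.get(context[-k:])
        n = sum(seen.values()) if seen else 0
        if n == 0:
            break  # if this context was never seen, no longer one was either
        weight = n / (n + pseudo)          # rare context: small weight
        prob = weight * (seen[nxt] / n) + (1 - weight) * prob
    return prob
```

Because every step is a weighted average of two probability distributions, the four values for A, C, G and T still sum to 1 when the training data contain only those bases.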

RNA genes

As with proteins, two main methods:

Ab initio

• Intrinsic

Homology based

• Extrinsic

Ribosomes

Ribosomes are made of proteins and RNA

[Figure: 30S subunit from Thermus aquaticus. Blue: protein. Orange: rRNA]

[Figure: E. coli 16S rRNA secondary structure, marking variable and conserved regions]

Variable regions in the 16S rRNA

[Figure: the 9 variable regions (Vn), with their variable loop(s) (n) and forward/reverse primers: V1 (6), V2 (8-11), V3 (18), V4 (P23-1, 24), V5 (28, 29), V6 (37), V7 (43), V8 (45, 46), V9 (49)]

Van de Peer Y, Chapelle S, De Wachter R. (1996) A quantitative map of nucleotide substitution rates in bacterial rRNA. Nucl. Acids Res. 24:3381-3391

Ribosomes are made of proteins and RNA

Prokaryotic ribosome:

Large subunit: 50S (5S and 23S rRNA genes)

Small subunit: 30S (16S rRNA gene)

Finding 16S genes

Easiest way is iterative: BLAST, ALIGN, TRIM

Problem: secondary structure makes identification of the ends difficult

Finding tRNA genes

Not as easy as rRNA

Much shorter

Varied sequence

Only conservation is 2° structure

tRNAscan-SE

Sean Eddy

Use it!

How does this relate to tRNA?

tRNA-Phe by Yikrazuul - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons https://commons.wikimedia.org/wiki/File:TRNA-Phe_yeast_en.svg

tRNA structure

Start of acceptor stem (7-9 bp)

D-loop (4-6 bp) stem plus loop

Anticodon arm (6 bp) stem plus loop with anticodon

T-loop (4-5 bp) stem plus loop

End of acceptor stem (7-9 bp)

CCA to attach amino acid (may not be in sequence ... added during processing)
