computational biology, part 4 protein coding regions robert f. murphy copyright 1996-2001. all...

51
Computational Biology, Part 4 Protein Coding Regions Robert F. Murphy Robert F. Murphy Copyright Copyright 1996-2001. 1996-2001. All rights reserved. All rights reserved.

Post on 21-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Computational Biology, Part 4Protein Coding Regions

Computational Biology, Part 4Protein Coding Regions

Robert F. MurphyRobert F. Murphy

Copyright Copyright 1996-2001. 1996-2001.

All rights reserved.All rights reserved.

Sequence Analysis TasksSequence Analysis Tasks

Calculating the probability of finding a Calculating the probability of finding a sequence patternsequence pattern

Calculating the probability of finding a Calculating the probability of finding a region with a particular base compositionregion with a particular base composition

Representing and finding sequence Representing and finding sequence features/motifs using frequency matricesfeatures/motifs using frequency matrices

Sequence Analysis TasksSequence Analysis Tasks

Finding protein coding regionsFinding protein coding regions

GoalGoal

Given a DNA or RNA sequence, find those Given a DNA or RNA sequence, find those regions that code for protein(s)regions that code for protein(s) Direct approach: Look for stretches that can be Direct approach: Look for stretches that can be

interpreted as protein using the genetic codeinterpreted as protein using the genetic code Statistical approaches: Use other knowledge Statistical approaches: Use other knowledge

about likely coding regionsabout likely coding regions

Direct ApproachDirect Approach

Genetic codesGenetic codes

The set of tRNAs that an organism The set of tRNAs that an organism possesses defines its genetic code(s)possesses defines its genetic code(s)

The The universal genetic codeuniversal genetic code is common to all is common to all organismsorganisms

Prokaryotes, mitochondria and chloroplasts Prokaryotes, mitochondria and chloroplasts often use slightly different genetic codesoften use slightly different genetic codes

More than one tRNA may be present for a More than one tRNA may be present for a given codon, allowing more than one given codon, allowing more than one possible translation productpossible translation product

Genetic codesGenetic codes

Differences in genetic codes occur in start Differences in genetic codes occur in start and stop codons onlyand stop codons only

Alternate initiation codonsAlternate initiation codons: codons that : codons that encode amino acids but can also be used to encode amino acids but can also be used to start translation (GUG, UUG, AUA, UUA, start translation (GUG, UUG, AUA, UUA, CUG)CUG)

Suppressor tRNA codonsSuppressor tRNA codons: codons that : codons that normally stop translation but are translated normally stop translation but are translated as amino acids (UAG, UGA, UAA)as amino acids (UAG, UGA, UAA)

Genetic codesGenetic codes

Genetic codesGenetic codes

Genetic codesGenetic codes

Note additional start codons: UUA, UUG, CUGNote additional start codons: UUA, UUG, CUG Note conversion of stop codon UGA (opal) to TrpNote conversion of stop codon UGA (opal) to Trp

Modifying genetic codes in MacVectorModifying genetic codes in MacVector Under Under OptionsOptions select select Modify Genetic Modify Genetic

Codes...Codes... Enter a name for new code in boxEnter a name for new code in box Make changes by clicking on individual Make changes by clicking on individual

codons in table and selecting new valuescodons in table and selecting new values Click Click OKOK

Reading FramesReading Frames

Since nucleotide sequences are “read” three Since nucleotide sequences are “read” three bases at a time, there are three possible bases at a time, there are three possible “frames” in which a given nucleotide “frames” in which a given nucleotide sequence can be “read” (in the forward sequence can be “read” (in the forward direction)direction)

Taking the complement of the sequence and Taking the complement of the sequence and reading in the reverse direction gives three reading in the reverse direction gives three more more reading framesreading frames

Reading framesReading frames

TTC TCA TGT TTG ACA GCTTTC TCA TGT TTG ACA GCT

RF1 RF1 Phe Ser Cys Leu Thr Ala>Phe Ser Cys Leu Thr Ala>

RF2 RF2 Ser His Val *** Gln Leu>Ser His Val *** Gln Leu>

RF3 RF3 Leu Met Phe Asp Ser>Leu Met Phe Asp Ser>

AAG AGT ACA AAC TGT CGAAAG AGT ACA AAC TGT CGA

RF4 RF4 <Glu *** Thr Gln Cys Ser<Glu *** Thr Gln Cys Ser

RF5 RF5 <Glu His Lys Val Ala<Glu His Lys Val Ala

RF6 RF6 <Arg Met Asn Ser Leu<Arg Met Asn Ser Leu

Reading framesReading frames

To find which reading frame a region is in, To find which reading frame a region is in, take nucleotide number of lower bound of take nucleotide number of lower bound of region, divide by 3 and take remainder region, divide by 3 and take remainder (modulus 3)(modulus 3)

1=1=RF1RF1, 2=, 2=RF2RF2, 0=, 0=RF3RF3 This is the convention used by MacVectorThis is the convention used by MacVector Assumes first nucleotide is 1 (not 0)Assumes first nucleotide is 1 (not 0)

Reading framesReading frames

For For reversereverse reading frames, take nucleotide reading frames, take nucleotide number of number of upperupper bound of region, subtract bound of region, subtract from total number of nucleotides, divide by from total number of nucleotides, divide by 3 and take remainder (modulus 3)3 and take remainder (modulus 3)

0=0=RF4RF4, 1=, 1=RF5RF5, 2=, 2=RF6RF6 This is because the convention MacVector This is because the convention MacVector

uses is that uses is that RF4RF4 starts with the last starts with the last nucleotide and reads backwardsnucleotide and reads backwards

Open Reading Frames (ORF)Open Reading Frames (ORF)

Concept: Region of DNA or RNA sequence Concept: Region of DNA or RNA sequence that that could could be translated into a peptide be translated into a peptide sequence (sequence (openopen refers to absence of stop refers to absence of stop codons)codons)

Prerequisite: A specific genetic codePrerequisite: A specific genetic code Definition:Definition:

(start codon) (amino acid coding codon)(start codon) (amino acid coding codon)nn (stop codon) (stop codon)

Note: Not all ORFs are Note: Not all ORFs are actuallyactually used used

Open Reading FramesOpen Reading Frames

Open file Open file YSPTUBBYSPTUBB in in Sample Files Sample Files folderfolder

Under Under AnalyzeAnalyze select select Open Reading Frames Open Reading Frames Click box next to Click box next to start/stop codons...start/stop codons... ClickClick OK OK

Open Reading FramesOpen Reading Frames

Click boxes for Click boxes for List ORFS List ORFS and and ORF mapORF map

Check reading Check reading frame: frame: mod(696,3)=0 mod(696,3)=0 -> RF3-> RF3

Splicing ORFsSplicing ORFs

For eukaryotes, which have interrupted For eukaryotes, which have interrupted genes, ORFs in different reading frames genes, ORFs in different reading frames may be spliced together to generate final may be spliced together to generate final productproduct

ORFs from forward and reverse directions ORFs from forward and reverse directions cannot be combinedcannot be combined

ORFs and ExonsORFs and Exons

MacVector displays “annotations” to the MacVector displays “annotations” to the sequence in a sequence in a features tablefeatures table

Open the feature table for YSPTUBB by Open the feature table for YSPTUBB by clicking on the iconclicking on the icon

Note the six exons for the tubulin geneNote the six exons for the tubulin gene Does the large exon (exon 5) correspond to Does the large exon (exon 5) correspond to

the large ORF in reading frame 3? the large ORF in reading frame 3?

Yes, Yes, mod(639,3)=0 mod(639,3)=0 -> RF3 which -> RF3 which matches matches reading frame reading frame of large ORF of large ORF at 696at 696

Block Diagram for Search for ORFsBlock Diagram for Search for ORFs

Search Engine

Sequence to be searched

Genetic code

List of ORF positions

Both strands?

Ends start/stop?

Statistical ApproachesStatistical Approaches

Calculation WindowsCalculation Windows

Many sequence analyses require calculating Many sequence analyses require calculating some statistic over a long sequence looking some statistic over a long sequence looking for regions where the statistic is unusually for regions where the statistic is unusually high or lowhigh or low

To do this, we define a To do this, we define a window sizewindow size to be to be the width of the region over which each the width of the region over which each calculation is to be donecalculation is to be done

Example: %ATExample: %AT

Base Composition BiasBase Composition Bias

For a protein with a roughly “normal” amino acid For a protein with a roughly “normal” amino acid composition, the first 2 positions of all codons composition, the first 2 positions of all codons will be about 50% GCwill be about 50% GC

If an organism has a high GC content overall, the If an organism has a high GC content overall, the third position of all codons must be mostly GCthird position of all codons must be mostly GC

Useful for prokaryotesUseful for prokaryotes Not useful for eukaryotes due to large amount of Not useful for eukaryotes due to large amount of

noncoding DNAnoncoding DNA

Fickett’s statisticFickett’s statistic

Also called Also called TestCodeTestCode analysis analysis Looks for Looks for asymmetryasymmetry of base composition of base composition Strong statistical basis for calculationsStrong statistical basis for calculations Method:Method:

For each For each windowwindow on the sequence, calculate on the sequence, calculate the base composition of nucleotides 1, 4, 7..., the base composition of nucleotides 1, 4, 7..., then of 2, 5, 8..., and then of 3, 6, 9...then of 2, 5, 8..., and then of 3, 6, 9...

Calculate statistic from resulting three numbersCalculate statistic from resulting three numbers

Codon Bias (Codon Preference)Codon Bias (Codon Preference)

PrinciplePrinciple Different levels of expression of different Different levels of expression of different

tRNAs for a given amino acid lead to pressure tRNAs for a given amino acid lead to pressure on coding regions to “conform” to the preferred on coding regions to “conform” to the preferred codon usagecodon usage

Non-coding regions, on the other hand, feel no Non-coding regions, on the other hand, feel no selective pressure and can driftselective pressure and can drift

Codon Bias (Codon Preference)Codon Bias (Codon Preference)

Starting point: Table of observed codon Starting point: Table of observed codon frequencies in known genes from a given frequencies in known genes from a given organismorganism best to use highly expressed genesbest to use highly expressed genes

MethodMethod Calculate “coding potential” within a moving Calculate “coding potential” within a moving

windowwindow for all three for all three reading framesreading frames Look for ORFs with high scoresLook for ORFs with high scores

Codon Bias (Codon Preference)Codon Bias (Codon Preference)

Works best for prokaryotes or unicellular Works best for prokaryotes or unicellular eukaryotes because for multicellular eukaryotes because for multicellular eukaryotes, different pools of tRNA may be eukaryotes, different pools of tRNA may be expressed at different stages of development expressed at different stages of development in different tissuesin different tissues may have to group genes into setsmay have to group genes into sets

Codon bias can also be used to estimate Codon bias can also be used to estimate protein expression levelprotein expression level

Portion of D. melanogaster codon frequency tablePortion of D. melanogaster codon frequency tableAmino Acid Codon Number Freq/1000 Fraction

Gly GGG 11 2.60 0.03

Gly GGA 92 21.74 0.28

Gly GGT 86 20.33 0.26

Gly GGC 142 33.56 0.43

Glu GAG 212 50.11 0.75

Glu GAA 69 16.31 0.25G ly G

Comparison of Glycine codon frequenciesComparison of Glycine codon frequencies

Codon E. coli D. melanogaster

GGG 0.02 0.03

GGA 0.00 0.28

GGT 0.59 0.26

GGC 0.38 0.43G ly G

Illustration of Fickett’s statistic and Codon Preference PlotsIllustration of Fickett’s statistic and Codon Preference Plots Goal: Reproduce Figure 6 of Chapter 4 of Goal: Reproduce Figure 6 of Chapter 4 of

Sequence Analysis PrimerSequence Analysis Primer Use Entrez via MacVector to get 5 files Use Entrez via MacVector to get 5 files

containing pieces of DNA sequence for the containing pieces of DNA sequence for the Drosophila Notch locusDrosophila Notch locus

Combine 9 exons from 5 filesCombine 9 exons from 5 files Create Codon Preference PlotCreate Codon Preference Plot

Creation of file containing Notch exons 1-9Creation of file containing Notch exons 1-9 Open New sequence file and paste selected Open New sequence file and paste selected

exon 1 into itexon 1 into it Continue for exons 2 through 9Continue for exons 2 through 9 The result is in file The result is in file

DrosophilaNotchExons1to9.gcg DrosophilaNotchExons1to9.gcg on on Lecture Notes web pageLecture Notes web page

Now generate Codon Preference Plot (for Now generate Codon Preference Plot (for file containing just exons)file containing just exons)

Codon Window Size =20 Codon Bias File:D. melanogaster codon biasGenetic Code: universal

2.00

1.50

1.00

0.50

0.001000 2000 3000 4000 5000 6000 7000 8000

2.00

1.50

1.00

0.50

0.001000 2000 3000 4000 5000 6000 7000 8000

2.00

1.50

1.00

0.50

0.001000 2000 3000 4000 5000 6000 7000 8000

Analysis of Notch locusAnalysis of Notch locus

Now look at genomic sequence (exons and Now look at genomic sequence (exons and introns)introns)

Cut and paste entire sequences from 5 files Cut and paste entire sequences from 5 files into new file (not shown)into new file (not shown)

The result is in file The result is in file DrosophilaNotchLocus.gcg DrosophilaNotchLocus.gcg on Lecture on Lecture Notes web pageNotes web page

Generate Codon Preference PlotGenerate Codon Preference Plot

Codon Window Size =20 Codon Bias File:D. melanogaster codon biasGenetic Code: universal

2.00

1.50

1.00

0.50

0.005000 10000 15000

2.00

1.50

1.00

0.50

0.005000 10000 15000

2.00

1.50

1.00

0.50

0.005000 10000 15000

Note large region scoring above 1

Summary, Part 4Summary, Part 4

Translation of nucleic acid sequences into Translation of nucleic acid sequences into hypothetical protein sequences requires a hypothetical protein sequences requires a genetic codegenetic code

Translation can occur in three forward and Translation can occur in three forward and three reverse reading framesthree reverse reading frames

Open reading frames are regions that can be Open reading frames are regions that can be translated without encountering a stop translated without encountering a stop codoncodon

Summary, Part 4Summary, Part 4

The likelihood that a particular open The likelihood that a particular open reading frames is in fact a coding region reading frames is in fact a coding region (actually made into protein) can be (actually made into protein) can be estimated using third-codon base estimated using third-codon base composition or codon preference tablescomposition or codon preference tables

This can be used to scan long sequences for This can be used to scan long sequences for possible coding regionspossible coding regions

Assigned ReadingsAssigned Readings

Baxevanis & Ouellette, Chapter 2Baxevanis & Ouellette, Chapter 2

Baxevanis & Ouellette, Chapter 5Baxevanis & Ouellette, Chapter 5