computational biology, part 4 protein coding regions robert f. murphy copyright 1996-2001. all...
Post on 21-Dec-2015
215 views
TRANSCRIPT
Computational Biology, Part 4Protein Coding Regions
Computational Biology, Part 4Protein Coding Regions
Robert F. MurphyRobert F. Murphy
Copyright Copyright 1996-2001. 1996-2001.
All rights reserved.All rights reserved.
Sequence Analysis TasksSequence Analysis Tasks
Calculating the probability of finding a Calculating the probability of finding a sequence patternsequence pattern
Calculating the probability of finding a Calculating the probability of finding a region with a particular base compositionregion with a particular base composition
Representing and finding sequence Representing and finding sequence features/motifs using frequency matricesfeatures/motifs using frequency matrices
Sequence Analysis TasksSequence Analysis Tasks
Finding protein coding regionsFinding protein coding regions
GoalGoal
Given a DNA or RNA sequence, find those Given a DNA or RNA sequence, find those regions that code for protein(s)regions that code for protein(s) Direct approach: Look for stretches that can be Direct approach: Look for stretches that can be
interpreted as protein using the genetic codeinterpreted as protein using the genetic code Statistical approaches: Use other knowledge Statistical approaches: Use other knowledge
about likely coding regionsabout likely coding regions
Genetic codesGenetic codes
The set of tRNAs that an organism The set of tRNAs that an organism possesses defines its genetic code(s)possesses defines its genetic code(s)
The The universal genetic codeuniversal genetic code is common to all is common to all organismsorganisms
Prokaryotes, mitochondria and chloroplasts Prokaryotes, mitochondria and chloroplasts often use slightly different genetic codesoften use slightly different genetic codes
More than one tRNA may be present for a More than one tRNA may be present for a given codon, allowing more than one given codon, allowing more than one possible translation productpossible translation product
Genetic codesGenetic codes
Differences in genetic codes occur in start Differences in genetic codes occur in start and stop codons onlyand stop codons only
Alternate initiation codonsAlternate initiation codons: codons that : codons that encode amino acids but can also be used to encode amino acids but can also be used to start translation (GUG, UUG, AUA, UUA, start translation (GUG, UUG, AUA, UUA, CUG)CUG)
Suppressor tRNA codonsSuppressor tRNA codons: codons that : codons that normally stop translation but are translated normally stop translation but are translated as amino acids (UAG, UGA, UAA)as amino acids (UAG, UGA, UAA)
Genetic codesGenetic codes
Note additional start codons: UUA, UUG, CUGNote additional start codons: UUA, UUG, CUG Note conversion of stop codon UGA (opal) to TrpNote conversion of stop codon UGA (opal) to Trp
Modifying genetic codes in MacVectorModifying genetic codes in MacVector Under Under OptionsOptions select select Modify Genetic Modify Genetic
Codes...Codes... Enter a name for new code in boxEnter a name for new code in box Make changes by clicking on individual Make changes by clicking on individual
codons in table and selecting new valuescodons in table and selecting new values Click Click OKOK
Reading FramesReading Frames
Since nucleotide sequences are “read” three Since nucleotide sequences are “read” three bases at a time, there are three possible bases at a time, there are three possible “frames” in which a given nucleotide “frames” in which a given nucleotide sequence can be “read” (in the forward sequence can be “read” (in the forward direction)direction)
Taking the complement of the sequence and Taking the complement of the sequence and reading in the reverse direction gives three reading in the reverse direction gives three more more reading framesreading frames
Reading framesReading frames
TTC TCA TGT TTG ACA GCTTTC TCA TGT TTG ACA GCT
RF1 RF1 Phe Ser Cys Leu Thr Ala>Phe Ser Cys Leu Thr Ala>
RF2 RF2 Ser His Val *** Gln Leu>Ser His Val *** Gln Leu>
RF3 RF3 Leu Met Phe Asp Ser>Leu Met Phe Asp Ser>
AAG AGT ACA AAC TGT CGAAAG AGT ACA AAC TGT CGA
RF4 RF4 <Glu *** Thr Gln Cys Ser<Glu *** Thr Gln Cys Ser
RF5 RF5 <Glu His Lys Val Ala<Glu His Lys Val Ala
RF6 RF6 <Arg Met Asn Ser Leu<Arg Met Asn Ser Leu
Reading framesReading frames
To find which reading frame a region is in, To find which reading frame a region is in, take nucleotide number of lower bound of take nucleotide number of lower bound of region, divide by 3 and take remainder region, divide by 3 and take remainder (modulus 3)(modulus 3)
1=1=RF1RF1, 2=, 2=RF2RF2, 0=, 0=RF3RF3 This is the convention used by MacVectorThis is the convention used by MacVector Assumes first nucleotide is 1 (not 0)Assumes first nucleotide is 1 (not 0)
Reading framesReading frames
For For reversereverse reading frames, take nucleotide reading frames, take nucleotide number of number of upperupper bound of region, subtract bound of region, subtract from total number of nucleotides, divide by from total number of nucleotides, divide by 3 and take remainder (modulus 3)3 and take remainder (modulus 3)
0=0=RF4RF4, 1=, 1=RF5RF5, 2=, 2=RF6RF6 This is because the convention MacVector This is because the convention MacVector
uses is that uses is that RF4RF4 starts with the last starts with the last nucleotide and reads backwardsnucleotide and reads backwards
Open Reading Frames (ORF)Open Reading Frames (ORF)
Concept: Region of DNA or RNA sequence Concept: Region of DNA or RNA sequence that that could could be translated into a peptide be translated into a peptide sequence (sequence (openopen refers to absence of stop refers to absence of stop codons)codons)
Prerequisite: A specific genetic codePrerequisite: A specific genetic code Definition:Definition:
(start codon) (amino acid coding codon)(start codon) (amino acid coding codon)nn (stop codon) (stop codon)
Note: Not all ORFs are Note: Not all ORFs are actuallyactually used used
Open Reading FramesOpen Reading Frames
Open file Open file YSPTUBBYSPTUBB in in Sample Files Sample Files folderfolder
Under Under AnalyzeAnalyze select select Open Reading Frames Open Reading Frames Click box next to Click box next to start/stop codons...start/stop codons... ClickClick OK OK
Open Reading FramesOpen Reading Frames
Click boxes for Click boxes for List ORFS List ORFS and and ORF mapORF map
Splicing ORFsSplicing ORFs
For eukaryotes, which have interrupted For eukaryotes, which have interrupted genes, ORFs in different reading frames genes, ORFs in different reading frames may be spliced together to generate final may be spliced together to generate final productproduct
ORFs from forward and reverse directions ORFs from forward and reverse directions cannot be combinedcannot be combined
ORFs and ExonsORFs and Exons
MacVector displays “annotations” to the MacVector displays “annotations” to the sequence in a sequence in a features tablefeatures table
Open the feature table for YSPTUBB by Open the feature table for YSPTUBB by clicking on the iconclicking on the icon
Note the six exons for the tubulin geneNote the six exons for the tubulin gene Does the large exon (exon 5) correspond to Does the large exon (exon 5) correspond to
the large ORF in reading frame 3? the large ORF in reading frame 3?
Yes, Yes, mod(639,3)=0 mod(639,3)=0 -> RF3 which -> RF3 which matches matches reading frame reading frame of large ORF of large ORF at 696at 696
Block Diagram for Search for ORFsBlock Diagram for Search for ORFs
Search Engine
Sequence to be searched
Genetic code
List of ORF positions
Both strands?
Ends start/stop?
Calculation WindowsCalculation Windows
Many sequence analyses require calculating Many sequence analyses require calculating some statistic over a long sequence looking some statistic over a long sequence looking for regions where the statistic is unusually for regions where the statistic is unusually high or lowhigh or low
To do this, we define a To do this, we define a window sizewindow size to be to be the width of the region over which each the width of the region over which each calculation is to be donecalculation is to be done
Example: %ATExample: %AT
Base Composition BiasBase Composition Bias
For a protein with a roughly “normal” amino acid For a protein with a roughly “normal” amino acid composition, the first 2 positions of all codons composition, the first 2 positions of all codons will be about 50% GCwill be about 50% GC
If an organism has a high GC content overall, the If an organism has a high GC content overall, the third position of all codons must be mostly GCthird position of all codons must be mostly GC
Useful for prokaryotesUseful for prokaryotes Not useful for eukaryotes due to large amount of Not useful for eukaryotes due to large amount of
noncoding DNAnoncoding DNA
Fickett’s statisticFickett’s statistic
Also called Also called TestCodeTestCode analysis analysis Looks for Looks for asymmetryasymmetry of base composition of base composition Strong statistical basis for calculationsStrong statistical basis for calculations Method:Method:
For each For each windowwindow on the sequence, calculate on the sequence, calculate the base composition of nucleotides 1, 4, 7..., the base composition of nucleotides 1, 4, 7..., then of 2, 5, 8..., and then of 3, 6, 9...then of 2, 5, 8..., and then of 3, 6, 9...
Calculate statistic from resulting three numbersCalculate statistic from resulting three numbers
Codon Bias (Codon Preference)Codon Bias (Codon Preference)
PrinciplePrinciple Different levels of expression of different Different levels of expression of different
tRNAs for a given amino acid lead to pressure tRNAs for a given amino acid lead to pressure on coding regions to “conform” to the preferred on coding regions to “conform” to the preferred codon usagecodon usage
Non-coding regions, on the other hand, feel no Non-coding regions, on the other hand, feel no selective pressure and can driftselective pressure and can drift
Codon Bias (Codon Preference)Codon Bias (Codon Preference)
Starting point: Table of observed codon Starting point: Table of observed codon frequencies in known genes from a given frequencies in known genes from a given organismorganism best to use highly expressed genesbest to use highly expressed genes
MethodMethod Calculate “coding potential” within a moving Calculate “coding potential” within a moving
windowwindow for all three for all three reading framesreading frames Look for ORFs with high scoresLook for ORFs with high scores
Codon Bias (Codon Preference)Codon Bias (Codon Preference)
Works best for prokaryotes or unicellular Works best for prokaryotes or unicellular eukaryotes because for multicellular eukaryotes because for multicellular eukaryotes, different pools of tRNA may be eukaryotes, different pools of tRNA may be expressed at different stages of development expressed at different stages of development in different tissuesin different tissues may have to group genes into setsmay have to group genes into sets
Codon bias can also be used to estimate Codon bias can also be used to estimate protein expression levelprotein expression level
Portion of D. melanogaster codon frequency tablePortion of D. melanogaster codon frequency tableAmino Acid Codon Number Freq/1000 Fraction
Gly GGG 11 2.60 0.03
Gly GGA 92 21.74 0.28
Gly GGT 86 20.33 0.26
Gly GGC 142 33.56 0.43
Glu GAG 212 50.11 0.75
Glu GAA 69 16.31 0.25G ly G
Comparison of Glycine codon frequenciesComparison of Glycine codon frequencies
Codon E. coli D. melanogaster
GGG 0.02 0.03
GGA 0.00 0.28
GGT 0.59 0.26
GGC 0.38 0.43G ly G
Illustration of Fickett’s statistic and Codon Preference PlotsIllustration of Fickett’s statistic and Codon Preference Plots Goal: Reproduce Figure 6 of Chapter 4 of Goal: Reproduce Figure 6 of Chapter 4 of
Sequence Analysis PrimerSequence Analysis Primer Use Entrez via MacVector to get 5 files Use Entrez via MacVector to get 5 files
containing pieces of DNA sequence for the containing pieces of DNA sequence for the Drosophila Notch locusDrosophila Notch locus
Combine 9 exons from 5 filesCombine 9 exons from 5 files Create Codon Preference PlotCreate Codon Preference Plot
Creation of file containing Notch exons 1-9Creation of file containing Notch exons 1-9 Open New sequence file and paste selected Open New sequence file and paste selected
exon 1 into itexon 1 into it Continue for exons 2 through 9Continue for exons 2 through 9 The result is in file The result is in file
DrosophilaNotchExons1to9.gcg DrosophilaNotchExons1to9.gcg on on Lecture Notes web pageLecture Notes web page
Now generate Codon Preference Plot (for Now generate Codon Preference Plot (for file containing just exons)file containing just exons)
Codon Window Size =20 Codon Bias File:D. melanogaster codon biasGenetic Code: universal
2.00
1.50
1.00
0.50
0.001000 2000 3000 4000 5000 6000 7000 8000
2.00
1.50
1.00
0.50
0.001000 2000 3000 4000 5000 6000 7000 8000
2.00
1.50
1.00
0.50
0.001000 2000 3000 4000 5000 6000 7000 8000
Analysis of Notch locusAnalysis of Notch locus
Now look at genomic sequence (exons and Now look at genomic sequence (exons and introns)introns)
Cut and paste entire sequences from 5 files Cut and paste entire sequences from 5 files into new file (not shown)into new file (not shown)
The result is in file The result is in file DrosophilaNotchLocus.gcg DrosophilaNotchLocus.gcg on Lecture on Lecture Notes web pageNotes web page
Generate Codon Preference PlotGenerate Codon Preference Plot
Codon Window Size =20 Codon Bias File:D. melanogaster codon biasGenetic Code: universal
2.00
1.50
1.00
0.50
0.005000 10000 15000
2.00
1.50
1.00
0.50
0.005000 10000 15000
2.00
1.50
1.00
0.50
0.005000 10000 15000
Note large region scoring above 1
Summary, Part 4Summary, Part 4
Translation of nucleic acid sequences into Translation of nucleic acid sequences into hypothetical protein sequences requires a hypothetical protein sequences requires a genetic codegenetic code
Translation can occur in three forward and Translation can occur in three forward and three reverse reading framesthree reverse reading frames
Open reading frames are regions that can be Open reading frames are regions that can be translated without encountering a stop translated without encountering a stop codoncodon
Summary, Part 4Summary, Part 4
The likelihood that a particular open The likelihood that a particular open reading frames is in fact a coding region reading frames is in fact a coding region (actually made into protein) can be (actually made into protein) can be estimated using third-codon base estimated using third-codon base composition or codon preference tablescomposition or codon preference tables
This can be used to scan long sequences for This can be used to scan long sequences for possible coding regionspossible coding regions