advanced bioinformatics (mb480/580) >sulfolobus virus 1 complete genome 15465 bp....
Post on 15-Jan-2016
![Page 1: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA](https://reader033.vdocuments.mx/reader033/viewer/2022051621/56649d625503460f94a4540a/html5/thumbnails/1.jpg)
Advanced Bioinformatics (MB480/580)
>Sulfolobus virus 1 complete genome 15465 bp.
TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAGTACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAAGATATACTGAGAGTCCTACGCGTTAGTTCAGGTCAGACAAGAGAGAACGAAATCAATTCTGAAACAATTATTTGACCATGGTAAGGAACATGAAGATGGAGTAATGAATGGTTATGGTTAGGGACTAAAATTATAAACGCCCATAAG
Learn How to:
● Assemble a genome and predict its:
  - ORFs
  - Promoters
● Annotate genome:
  - Predict protein functions
  - Model them if possible
  - Re-design them if possible
● Predict functions by inference from a large amount of unrelated data
● Predict ncRNAs
● High-throughput methods and data interpretation
● Prepare the data for presentations & publications
What is Bioinformatics?
• Choices:
  – The analysis of biological molecules using computers and statistical techniques
    • TRUE
  – The science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research
    • also TRUE, but suits Computational Biology better
More definitions
• The collection, organization and analysis of large amounts of biological data, using networks of computers and databases.
• The process of developing tools and processes to quantify and collect data to study biological systems logically.
• The science of informatics as applied to biological research.
Yet more definitions
• Mark Gerstein’s definition:
  – Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied maths, computer science, and statistics) to understand and organize the information associated with these molecules, on a large-scale.
  – The manuscript breaking down each part of the above statement will be e-mailed.
  – http://wiki.bioinformatics.org/Bioinformatics_FAQ
The important stuff
• Bioinformatics brings together biological data from genome research with the theory and tools of mathematics, computer science and artificial intelligence.
• Bioinformatics includes any application of computer technology and information science to:
  – Gather, organize, store and handle data.
  – Analyze, interpret and spread data.
  – Predict biological structure and function.
What is the information in Molecular Biology?
• Central Dogma of Molecular Biology
  DNA -> RNA -> Protein -> Phenotype
• Molecules
  – Sequence, Structure, Function
• Processes
  – Mechanism, Specificity, Regulation
• Central Paradigm for Bioinformatics
  Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype
• Large Amounts of Information
  – Standardized
  – Statistical
• Genetic material
• Information transfer (mRNA)
• Protein synthesis (tRNA/mRNA)
• Some catalytic activity
• Most cellular functions are performed or facilitated by proteins.
• Primary biocatalyst
• Cofactor transport/storage
• Mechanical motion/support
• Immune protection
• Control of growth/differentiation
This slide is courtesy of Mark Gerstein
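The DNA -> RNA -> Protein steps of the central dogma can be sketched in a few lines. This is a minimal illustration, assuming the standard genetic code and a coding-strand input; the names `CODON_TABLE`, `transcribe` and `translate` are made up for this example and are not part of the course material.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order
# (first base varies slowest, third base fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA -> mRNA: same sequence with T replaced by U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein, one letter per codon; a trailing partial codon is ignored."""
    dna = mrna.upper().replace("U", "T")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


if __name__ == "__main__":
    gene_start = "atggcaattaaa"      # four codons: atg gca att aaa
    mrna = transcribe(gene_start)
    print(mrna)                      # AUGGCAAUUAAA
    print(translate(mrna))           # MAIK
```

The sketch raises `KeyError` on ambiguity codes such as `r` or `n`; real sequence data would need explicit handling for those.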
Language of biology is not easy to understand
• Just as in spoken language, some words look very different but have the same meaning (car and automobile are synonyms; sequences of distantly related proteins are synonyms)
• Some words look or sound very similar yet have different meanings (complement and compliment; eminent and imminent; allude and elude; decent and descent are homophones; GAG and TAG codons are homophones)
• In spoken language, we came up with the rules ourselves, which is why we can usually trace words back to their origins
• How do we trace the origins of Nature’s language?
Why is Bioinformatics important?
• Supports experimental work
  – In some cases, it provides complementary data
• More importantly, guides experimental work
  – Predictions based on data
  – Extension of experiments in new directions
• To be believable, Bioinformatics predictions have to be verifiable
  – Statistical significance, or some other kind of significance score
When did Bioinformatics begin?
• 10-15 years ago?
  – This is a common assumption
• Bioinformatics existed even back in the 70s
  – It went by a different name
  – It was underused because the amount of biological sequence data was small
Bioinformatics and Genome Biology
• The revolution driving enormous development in Bioinformatics and experimental sciences came from whole genome sequencing
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd." Science 269: 496-512.
(Picture adapted from TIGR website, http://www.tigr.org)
• Integrative Data
  1995, HI (bacteria): 1.6 Mb & 1600 genes
  1997, yeast: 13 Mb & ~6000 genes
  1998, worm: ~100 Mb with 19 K genes
  1999: >30 completed genomes!
  2003, human: 3 Gb & 50 K genes...
Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading.
-- G A Petsko, Nature 401: 115-116 (1999)
What can we infer from sequence using Bioinformatics?
[Diagram labels: DNA, ORF, Protein, Active protein, 3D structure, Function; "Expressed?"]
Domains = smallest functional/structural subunits
• cellular function
• physiological function
• substrate binding sites
• protein-protein interfaces
• activity
• specificity
• docking
• localisation
Make sense of subtle differences
[Waterston et al. Nature 2002]
- About 90% of the mouse and human genomes are in syntenic blocks.
What’s in the genome?
• If we are so much alike in terms of genome, why are we so different?
  – Large variation in human population
  – Similar genes and similar genome organization between human and chimp (or even human and mouse), yet large phenotypic difference
• The importance of non-coding parts of our genome became more obvious
  – Non-coding, regulatory RNAs
  – Binding sites for regulatory proteins
  – Other possibilities that are not obvious right now
Complexity of biological information
1. Finding regulatory motifs in DNA
2. Increasing the speed and reliability of functional annotation from sequence
The more we know, the better?
So we have a genome sequence …>Sulfolobus virus 1 complete genome 15465 bp.TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAGCGGAATACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAAGGAGGGATATACTGAGAGTCCTACGCGTTAGTTCAGGTCAGACAAGAGAGAACGTAAACAAATCAATTCTGAAACAATTATTTGACCATGGTAAGGAACATGAAGATGAAGAAGAGTAATGAATGGTTATGGTTAGGGACTAAAATTATAAACGCCCATAAGACTAACGGCTTTGAAAGTGCGATTATTTTCGGGAAACAAGGTACGGGAAAGACTACTTACGCCCTTAAGGTGGCAAAAGAAGTTTACCAGAGATTAGGACATGAACCGGACAAGGCATGGGAACTGGCCCTTGACTCTTTATTCTTTGAGCTTAAAGATGCATTGAGGATAATGAAAATATTCAGGCAAAATGATAGGACAATACCAATAATAATTTTCGACGATGCTGGGATATGGCTTCAAAAATATTTATGGTATAAGGAAGAGATGATAAAGTTTTACCGTATATATAACATTATTAGGAATATAGTAAGCGGGGTGATCTTCACTACCCCTTCCCCTAACGATATAGCGTTTTATGTGAGGGAAAAGGGGTGGAAGCTGATAATGATAACGAGAAACGGAAGACAACCTGACGGTACGCCAAAGGCAGTAGCTAAAATAGCGGTGAATAAGATAACGATTATAAAAGGAAAAATAACAAATAAGATGAAATGGAGGACAGTAGACGATTATACGGTCAAGCTTCCGGATTGGGTATATAAAGAATATGTGGAAAGAAGAAAGGTTTATGAGGAAAAATTGTTGGAGGAGTTGGATGAGGTTTTAGATAGTGATAACAAAACGGAAAACCCGTCAAACCCATCACTACTAACGAAAATTGACGACGTAACAAGATAGTGATACGGGTAATGTCAGACCCCTTTTAGCCATTCCGCATACTTTTTATATTGCTCTTTCGCTATGCCGAAGAGCGATACGTAATGTTGCGTTAAAACGCGTGTCGGTTTACGCCCTTGAATAAAATCGATAATATCTAACGGTACGCTTAGCTCAGCCATCTTAGACGCTACGAATTTGCGGAAGTACTTTATCGCTATAGCGTCCTTATGACGTCGTTCAAAGTCCGCTATTGCCCACTTCGTCACCTCTACTCTCTTCAGAGGCGTTATGTGGAATACATAGAAGACGCCCTTATATCCCCTAGTCCAACTAAGCGGATAATAACAGACGTCGTTACCGCAAATGTCCCTTTCGGGTTCCTTCAGCACTTTCAGTATTTCGCTCAGCCTAACGCCCGACTCGAGAGCGATACGGTAGATGAAGTAGACGTTTTCGCTATAGTCTTTTGCTAATTGTAACGTCCTTTTTATCTCTTCCAACGTTGGAATGTAGATATCAGCGTTCGCCTTCTTCACCTTTACCGCTTTCAATATTTTATCCGCAAATTCATCATGTATGATATTGCGTGACGCTAAGAAACGTGCAAAGAGTCGGTAAGCCTTCTGTGCGTCTCTCGTCTCTTTATACGGCTTTGATATAGCATTGATGTAGTCCTTTGCAGTTTTTTCGCTTATCCCCCTTTCGTTCATGAGATAGTCGTAGAACGCCTTTATGTTGCCGTCCGTCGCGTATTGGCGCAAATTGGCAACCAACGCTATTTTACGTCGTTCAGTTCCCTCTTTTCCGCCTCCGGAGCCGGAGGTCCCGGGTTCAAATCCCGGCGGGTCCGCTTGTAGGGGAGTATCCCCTACGACCCCTAATTTCATTTTTAGATATGATTCAACGACGTCAGCTAAAGGACCCACGTAACGCTCTTTTACCTCACCGTTTTCATACTCTAGCTTGTAAACATAATACCGCCCTTTCCTCTCGCGTA
AAATATAATCCCCGTATTTATAACGCGTCTTATCTTTCGTCATTTCGCCTCACAGTATTATGGTTGCCAAAACGGGCTTATAAGCATTGGCAACCCGTTAATTTTTGCCGTTAAAACACGTTGAATTGAAAGAAGACGGCAAAGAATCCACACAGGTAATACTAAAAAAGTAGTATTACTTACATTAGAAGGACTCATTTGTCCACCTTGTATTCTAGCCATGCTATCTCTGCCTTCAGCTCATCTAGCTTCCCCTTTATGTCTGTCAGGTCAAGGGGAACTCCTCTCATTAACCTGAGTTCGTTTTCGATTTTTTCAAGCTCCTTTTCCAACTCCTCTAGTTTCTCTAATTCCTTTAGTCGTTCTTCCAATTTCTTTTCCAATTTCCCCTTTGCGTCATTTATAATTATGCTTACTACCCAAACAATTCCTAAATCAGAAATAATTATTAACTCCTCTGAGTTGAATATCATTTTCCGCCCCTCGCTAAATACTCCTTAAAGCTCTGATAGAACCCCTTCAGACTAACCCGTAAGTCTGTTAGGTTCTTCCAGTATTGTAATGGGATTAAGTAATAGTAGCTTACTGCATCTCTCTCAAATTTGTCCTTCTTAATCTTTCCTTGCTTTTCTAAGTTGAGTATTTGCAGTGCTGAGATACATTTTAACTTGTCCTCAGCATCTGAATAGTGTATAAACCAAACCCTCCCCATAACCTCATTCTGCTTTGCAACTTCTACTTTAGTGCTTAATATTGCGTAAACGCTTTCGCCGTATCTTTCTTTGCTCTGTTCTTCAGTCCATGAACTTCCCGTAATATCTATCCAAATTAAAGGATAATATTCTGTCTTAGCCTTAACGTATAAAGTCAAATCGTATTTATCTTGCAGACCGCTATAGTATTGCTCATTTATTACATTAGTTAAAGTCCCCACGCCAGTTGGGCGGATATAAACATCAAAGTCTAACAAACCCTTAGCCCGCCACTTTGATAAAGAGATTAAGAGCTTTCCAAAAACTAGGTATTCTCGCCCTAAATAAGTTGAAGGGAGGATATAATCCTCAGCTTGATTACCCCAATACTTTAGCTTAAAATTAGTTTCAGCCATCTCACTCACCATATTGAAACGTGGGCTAGTATGTGAATCAGTACTGATGCTATTGCAAATAACACACTTGCAGTAGCAATTCCTATTACAATCCATTTACCATAATCCACCTTAGTTTGTTGGTCAATATACTCGTTGATGATCTTTAGTATTTCTGGCTTTAGTTCTGATAATGAAAGGAAGACAGAGGCATAAAGTACTAAGGAGGATGTGAACAGATTATCCGCCTTTTCTGAAAGTTTATAAAGCTCATATCTTGCTCTCTCATAATCTTCATAATTAATAATTTCATCAAACTTTTCTACTTGCTCTTCATATTCTTTCTTCAGAGAGTAAGGAGTTGTCTTTTCAATTACTCCTAATTTTATTAACTTCTTAACAGCTTCCTTAAATCCTTGTTTATTGCTAGCATACGCTAAAGGGTCTTTTCCTTCTTGAGAAGCTCTATAGATAACTATAGCACCATAAACAATATTTACAATATCGTATGGTAAGGAATACGCACCGATTTGGGCAATATCTTCAACTCTTCTTTGATCCATCTAGTTCACCTCTTTTTGATTTGTTTGTAGGTTTCTATCGCAGTTTTCAGCGATATCGCAAATAGCTTCCCCTTTTCCGTTAGGTATAGCCTCTTTTCGCCTCTTTCTTGACGCTCTTTCACGAAGCCCTCTTGTATTAGGAACTTTTTTGCATCATAAAAGGTGGCAGTGGACATGGGAAATTCTGCGTTTACTTTCTTGTATAGGTCATATGTTGCTATTCCTTCATTATCATATAGATAAGCCAATACTATGGCTTCGGGGTAGAAGAATGGTGTACTTTTCATATCCTCCTCACTCCTCAGCCTCTAATAGCTTAACTGCCTCCTCTATCAACTGTCCCATTGTCT
TTCCAGTCTTTGCCTTAAGCCTCTGCAGAGTCTCATATGTTTCCTCACTTATTGAAATGTTAAGCCTTTTGACTATCCTATCTTTCCTCTTCTCTATCATTTAGGTCACCTTGTTTATTGTTATTTGAAATACGTATCCGTCTTCGTCACATCGAAGTATAATTTTGTATCCATTATTAGCATATTCTACGTCAAAGTTCCCACAACAATAATTCGGGTCTTCGGACTCGTTATAGACTTTGCTCCAACCATCTTTTTGTAGTGCCTCTTCTAAGTAGTCTACTCTGATGAAGCCTTCATCATATTCGTTCAGTACCCTAAAGCTTATACTATCAATGCCTAATACGTCTAATAGCTTCAACAGATCGAATATAGGAACTTGCACCATCATTTCAGCTCACCTTAATGAGCTGATATAATTCCGCTTCTATCTTTTGAACTTGGAAGTATGCCTTGCCTAGCTTTTGCTTATCCATATTGCCCGTTATTCTATCAATCTTAATCTCGTGGATTAATGATAATAGCTCTCTGACATCCTCATCAAGCATTTCAAATAATTCTTTCTCTAAGACTTCTTTACTCATTGTTTTTCACCTTAGCAAACTCATCTAACGTTGTTTGTCTCAGTTCTCTTTTCTTTATCAAATAAAATTCCGAATGTCCCTTCTTATTGTTATTACTGTACTTCATGTCAGTTCACTGCTTTGCCTTTATAAATCCTTGATCCGTTTGCTCAAAATTTGCGGGCTGGGCAT
Gene finding through learning
atgccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgtaa
gaggatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagatg
Gene
Non-gene
gcgatgcggctgctgagagcgtaggcccgagaggagagatgtaggaggaaggtttgatggtagttgtagatgattgtgtagttgtagctgatagtgatgatcgtag
Gene?
[Figure: gene model annotated with sequence signals: atg, tga, ggtgag (three times), caggtg, cagatg, cagttg, caggccggtgag]
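The signals on this slide (an atg start, a tga stop) are the raw material for the simplest possible gene finder. The sketch below is a naive baseline, not the learning-based method the previous slide refers to: it scans the three forward reading frames for an `atg` followed by the first in-frame stop codon. The function name `find_orfs` and the `min_codons` threshold are hypothetical, and introns and the reverse strand are ignored.

```python
STOP_CODONS = {"taa", "tag", "tga"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for forward-strand ORFs.

    An ORF here runs from an 'atg' to the first in-frame stop codon.
    'end' is exclusive and includes the stop codon. min_codons counts
    the codons before the stop (the start codon included).
    """
    seq = seq.lower()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "atg":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs


if __name__ == "__main__":
    print(find_orfs("ccatgaaatga"))   # one ORF in frame 2: atg aaa tga
```

A scan like this finds every open reading frame, including many that are not genes, which is why the slides move on to methods that learn what real genes look like.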
Map looks better. Is this all?
OK, so we’ll predict protein functions …
… maybe do a few experiments …
… and then enjoy glory (maybe money, too).
Trevor Douglas and Mark Young (2006) Science 312, 873 - 875.
What can we do with Molecular Biology information?
• Different levels of Molecular Biology information
• DNA
  – Coding or non-coding
  – Meaningful or junk DNA?
• RNA
  – Information transfer (mRNA, tRNA, rRNA)
  – Regulatory roles
• Protein
  – Structure and function
  – Modifications
Molecular Biology Information in DNA and RNA
• Raw DNA Sequence
  – 4 bases: AGCT
  – Coding or not?
  – How do we parse the sequence into genes?
  – Because of introns, ~1 kb of gene sequence could correspond to ~2 Mb in the genome
• Raw RNA Sequence
  – 4 bases: AGCU
  – mRNA, tRNA, rRNA
  – Regulatory RNAs
  – Secondary structure
atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactgcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgcatcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacctgcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgttgttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatcaaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacactgaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgcagacgctggtatcgcattaactgattctttcgttaaattggtatc . . .
. . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
Molecular Biology information in protein sequences
• 20-letter alphabet, more combinatorial variability than DNA (20^n possible sequences, where n is the number of amino acids)
– ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
• Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain
• More than 2 million unique protein sequences (more than 5.6 M of total sequences in the database)
• We must be able to “transfer” the function from characterized proteins to uncharacterized ones based on some measure of similarity
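To put numbers on the combinatorial variability mentioned above, the following sketch compares the size of sequence space for a 20-letter and a 4-letter alphabet. The function names are made up for this illustration, and the 300-residue length is the average bacterial protein size quoted on the slide.

```python
import math


def sequence_space(alphabet_size, length):
    """Number of distinct sequences of a given length (alphabet_size ** length)."""
    return alphabet_size ** length


def log10_sequence_space(alphabet_size, length):
    """Same quantity as a power of ten, which stays readable for long chains."""
    return length * math.log10(alphabet_size)


if __name__ == "__main__":
    # Dipeptides versus dinucleotides
    print(sequence_space(20, 2), sequence_space(4, 2))
    # A 300-residue protein versus 300 bases of DNA, as powers of ten
    print(round(log10_sequence_space(20, 300)), round(log10_sequence_space(4, 300)))
```

Only a vanishing fraction of that space is occupied by known proteins, which is part of why transferring function by similarity is feasible at all.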
Molecular Biology information in macromolecular structures
• DNA/RNA/Protein
  – The majority of all structures are of proteins
  – Proteins are easier to crystallize and were thought to be more important
Organizing information: Redundancy and multiplicity help …
• Fairly different sequences may have the same structure and function
  – Bad news: If they are very different, how do we find this?
  – Good news: Once they are found, we learn something more about structure and function
• An organism has many similar genes and non-coding RNAs
  – Redundancy is present for essential genes and/or RNAs (rRNA)
• A single gene may have multiple functions
  – Combining domains in eukaryotes produces large proteins
• Genes are grouped into pathways; this is good
… though sometimes the path is difficult
• Evolutionary distances do not help establish the initial relationship
  – Large differences (large evolutionary distances) between proteins are hard to identify and defend on statistical grounds without experiment
• Evolutionary distances do help once the relationship is established
  – If the relationship between distant proteins is established, their conserved parts provide information about what is vital for function
  – Less conserved parts of proteins are less important for function and mainly act as a scaffold
• Given all these difficulties, how do we find hidden similarities?
Some things we can do using just sequence
• Sequence (text string) comparisons
  – Sequence (text string) search
  – Sequence alignment
  – Finding short sequences in biological sequences
  – Significance statistics
• Databases
  – Building, Querying
• Learning patterns
  – Artificial Intelligence and Machine Learning
  – Mining for patterns and clustering them
• Secondary structure prediction
  – Where are helices, strands and loops in proteins?
  – Finding trans-membrane helices
• Tertiary structure prediction
  – Fold recognition and structure prediction
  – Active site identification
How are optimal alignments found? (Should we all pick the one we like?)
Aligning text strings … Which alignment is the best?

Raw Data ???
T C A T G
C A T T G

2 matches, 0 gaps
T C A T G
      | |
C A T T G

3 matches (2 end gaps)
T C A T G .
  | | |
. C A T T G

4 matches, 1 insertion
T C A - T G
  | |   | |
. C A T T G

4 matches, 1 insertion
T C A T - G
  | | |   |
. C A T T G
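Counting matches by eye works for five letters but not much beyond that. A small helper makes the tallies above reproducible. This is an illustrative sketch: the function name `tally_alignment` is made up here, both end gaps and internal gaps are written as `-`, and the two rows must have equal length.

```python
def tally_alignment(top, bottom, gap="-"):
    """Count (matches, mismatches, gap columns) in a pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            gaps += 1            # column where one row has no residue
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


if __name__ == "__main__":
    for top, bottom in [("TCATG", "CATTG"),
                        ("TCATG-", "-CATTG"),
                        ("TCA-TG", "-CATTG"),
                        ("TCAT-G", "-CATTG")]:
        print(top, bottom, tally_alignment(top, bottom))
```

The last two alignments tie on matches, which is exactly why a scoring scheme and a systematic search are needed to choose between them.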
Dynamic Programming to the rescue
• What to do for a bigger string?
SSDSEREEHVKRFRQALDDTGMKVPMATTNLFTHPVFKDGGFTANDRDVRRYALRKTIRNIDLAVELGAETYVAWGG
REGAESGGAKDVRDALDRMKEAFDLLGEYVTSQGYDIRFAIEPKPNEPRGDILLPTVGHALAFIERLERPELYGVNP
EVGHEQMAGLNFPHGIAQALWAGKLFHIDLNGQNGIKYDQDLRFGAGDLRAAFWLVDLLESAGYSGPRHFDFKPPRT
EDFDGVWAS
• Needleman-Wunsch (1970) provided the first automatic method
  – Dynamic Programming to find Global Alignment
  – Local Alignment is sometimes better than Global
• Needleman-Wunsch Test Data
  – ABCNYRQCLCRPM
  – AYCYNRCKCRBP
Make a dot plot (Similarity matrix)
Put 1's where characters are identical.
  A B C N Y R Q C L C R P M
A 1
Y         1
C     1         1   1
Y         1
N       1
R           1         1
C     1         1   1
K
C     1         1   1
R           1         1
B   1
P                       1
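The dot plot above can be generated mechanically from the two test strings. This is a minimal sketch; the function names `dot_plot` and `render` are made up for this example.

```python
def dot_plot(rows, cols):
    """Similarity matrix: 1 where the row and column characters are identical."""
    return [[1 if r == c else 0 for c in cols] for r in rows]


def render(rows, cols, matrix):
    """Plain-text rendering with the column sequence across the top."""
    lines = ["  " + " ".join(cols)]
    for r, row in zip(rows, matrix):
        cells = " ".join("1" if v else " " for v in row)
        lines.append((r + " " + cells).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    seq_cols = "ABCNYRQCLCRPM"   # written across the top
    seq_rows = "AYCYNRCKCRBP"    # written down the side
    print(render(seq_rows, seq_cols, dot_plot(seq_rows, seq_cols)))
```

Runs of 1's along a diagonal mark stretches where the two strings agree without gaps; an alignment is a path that strings such diagonals together.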
Scoring the alignment
• The idea is to go through the matrix and find a shortest path to the bottom (it is actually done from the bottom backwards)
• Caveat 1: This path also needs to have the highest score
• Caveat 2: We have to score the gaps (insertions and deletions), since gaps are introduced by the alignment and do not exist in the proteins themselves
Global alignment by dynamic programming
Sequence X: MONTANA
Sequence Y: MONTANA
Scoring system: 5 for match; -2 for mismatch; -6 for gap

Dynamic programming matrix:
        M   O   N   T   A   N   A
    0  -6 -12 -18 -24 -30 -36 -42
M  -6   5  -1  -7 -13 -19 -25 -31
O -12  -1  10   4  -2  -8 -14 -20
N -18  -7   4  15   9   3  -3  -9
T -24 -13  -2   9  20  14   8   2
A -30 -19  -8   3  14  25  19  13
N -36 -25 -14  -3   8  19  30  24
A -42 -31 -20  -9   2  13  24  35

Optimum alignment score: 35
X: MONTANA
Y: MONTANA
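The matrix above can be filled in with a few lines of code. The sketch below is one straightforward way to implement global alignment by dynamic programming with the slide's scoring (5 for match, -2 for mismatch, -6 per gap); the function names and the tie-breaking order in the traceback (diagonal first, then a gap in Y, then a gap in X) are choices made for this illustration.

```python
def nw_matrix(x, y, match=5, mismatch=-2, gap=-6):
    """Fill the global-alignment matrix F, with x down the side and y across the top."""
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap                      # x prefix aligned to gaps only
    for j in range(1, cols):
        F[0][j] = j * gap                      # y prefix aligned to gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align x[i-1] with y[j-1]
                          F[i - 1][j] + gap,    # x[i-1] against a gap
                          F[i][j - 1] + gap)    # y[j-1] against a gap
    return F


def traceback(F, x, y, match=5, mismatch=-2, gap=-6):
    """Recover one optimal alignment by walking back from the bottom-right cell."""
    i, j = len(x), len(y)
    ax, ay = [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    F = nw_matrix("MONTANA", "MONTANA")
    for row in F:
        print(" ".join(f"{v:4d}" for v in row))
    print("score:", F[-1][-1])
    print(*traceback(F, "MONTANA", "MONTANA"), sep="\n")
```

Other optimal alignments with the same score may exist when cells tie; the traceback returns only the one favoured by its tie-breaking order.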
What about gaps?
Sequence X: MONTTANA
Sequence Y: MONTANA
Scoring system: 5 for match; -2 for mismatch; -6 for gap

Dynamic programming matrix:
        M   O   N   T   A   N   A
    0  -6 -12 -18 -24 -30 -36 -42
M  -6   5  -1  -7 -13 -19 -25 -31
O -12  -1  10   4  -2  -8 -14 -20
N -18  -7   4  15   9   3  -3  -9
T -24 -13  -2   9  20  14   8   2
T -30 -19  -8   3  14  18  12   6
A -36 -25 -14  -3   8  19  16  17
N -42 -31 -20  -9   2  13  24  18
A -48 -37 -26 -15  -4   7  18  29

Optimum alignment score: 29
X: MONTTANA
Y: MON-TANA
Scoring “real-life” alignments
Sequence X: MONTANABOBCATS
Sequence Y: MONTANAGRIZZLIES
Scoring system: 5 for match; -2 for mismatch; -6 for gap

Dynamic programming matrix:
        M   O   N   T   A   N   A   G   R   I   Z   Z   L   I   E   S
    0  -6 -12 -18 -24 -30 -36 -42 -48 -54 -60 -66 -72 -78 -84 -90 -96
M  -6   5  -1  -7 -13 -19 -25 -31 -37 -43 -49 -55 -61 -67 -73 -79 -85
O -12  -1  10   4  -2  -8 -14 -20 -26 -32 -38 -44 -50 -56 -62 -68 -74
N -18  -7   4  15   9   3  -3  -9 -15 -21 -27 -33 -39 -45 -51 -57 -63
T -24 -13  -2   9  20  14   8   2  -4 -10 -16 -22 -28 -34 -40 -46 -52
A -30 -19  -8   3  14  25  19  13   7   1  -5 -11 -17 -23 -29 -35 -41
N -36 -25 -14  -3   8  19  30  24  18  12   6   0  -6 -12 -18 -24 -30
A -42 -31 -20  -9   2  13  24  35  29  23  17  11   5  -1  -7 -13 -19
B -48 -37 -26 -15  -4   7  18  29  33  27  21  15   9   3  -3  -9 -15
O -54 -43 -32 -21 -10   1  12  23  27  31  25  19  13   7   1  -5 -11
B -60 -49 -38 -27 -16  -5   6  17  21  25  29  23  17  11   5  -1  -7
C -66 -55 -44 -33 -22 -11   0  11  15  19  23  27  21  15   9   3  -3
A -72 -61 -50 -39 -28 -17  -6   5   9  13  17  21  25  19  13   7   1
T -78 -67 -56 -45 -34 -23 -12  -1   3   7  11  15  19  23  17  11   5
S -84 -73 -62 -51 -40 -29 -18  -7  -3   1   5   9  13  17  21  15  16

Optimum alignment score: 16
X: MONTANA--BOBCATS
Y: MONTANAGRIZZLIES
The scoring depends on our choice of parameters
Sequence X: MONTANABOBCATS
Sequence Y: MONTANAGRIZZLIES
Scoring system: 5 for match (B=G); -2 for mismatch; -6 for gap
Optimum alignment score: 23
X: MONTANAB--OBCATS
Y: MONTANAGRIZZLIES

Sequence X: MONTANABOBCATS
Sequence Y: MONTANAGRIZZLIES
Scoring system: 5 for match; -2 for mismatch; -1 for gap
Optimum alignment score: 26
X: MONTANA--------BOBCATS
Y: MONTANAGRIZZLIE------S
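
To see the parameter dependence directly, the same pair of sequences can be scored under each of the three scoring systems. This is a sketch only: `global_score`, `identity` and `identity_b_equals_g` are hypothetical helper names, and "B=G" is read here as "treat B paired with G as a match", which is an assumption about what the slide intends.

```python
def global_score(x, y, pair_score, gap):
    """Best global alignment score, keeping only one matrix row in memory."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i, a in enumerate(x, start=1):
        curr = [i * gap]
        for j, b in enumerate(y, start=1):
            curr.append(max(prev[j - 1] + pair_score(a, b),  # pair a with b
                            prev[j] + gap,                   # a against a gap
                            curr[j - 1] + gap))              # b against a gap
        prev = curr
    return prev[-1]


def identity(a, b):
    # 5 for a match, -2 for a mismatch.
    return 5 if a == b else -2


def identity_b_equals_g(a, b):
    # Same as identity, but B paired with G also counts as a match (assumed reading of "B=G").
    return 5 if a == b or {a, b} == {"B", "G"} else -2


X, Y = "MONTANABOBCATS", "MONTANAGRIZZLIES"
scores = {
    "match 5, mismatch -2, gap -6": global_score(X, Y, identity, -6),
    "same, with B=G as a match": global_score(X, Y, identity_b_equals_g, -6),
    "match 5, mismatch -2, gap -1": global_score(X, Y, identity, -1),
}
for label, value in scores.items():
    print(f"{label}: {value}")
```

With a gap penalty of -1, two gaps cost the same as one mismatch, so the optimum can avoid mismatched pairs altogether; this is why the last alignment on the slide is so gap-rich.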
![Page 39: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA](https://reader033.vdocuments.mx/reader033/viewer/2022051621/56649d625503460f94a4540a/html5/thumbnails/39.jpg)
How do we choose good scoring parameters?
• A simple scoring scheme considers only sequence identity
• More realistic scoring schemes consider sequence similarity, which is taken from substitution matrices
• We measure how often each residue substitution is observed and normalize it by the residue frequencies in the database, which gives a log-odds score (LOG2 an/ad)
• A score of zero in the substitution matrix means that the substitution occurs about as often as expected by chance
• A score below zero means that the substitution occurs less often than expected by chance
• There is no universally good matrix
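
The log-odds idea in the bullets above can be sketched in a few lines. The code assumes the score is log2(observed pair frequency / frequency expected by chance), with the expected frequency taken as the product of the two residue frequencies; the function name `log_odds_score` and all the frequencies are made-up toy values, not real database statistics. The `scale=2` default expresses the score in half-bit units, as BLOSUM62 does.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b, scale=2):
    """Rounded log-odds score for one residue pair.

    observed_pair_freq: how often the pair is seen in aligned sequences
    freq_a, freq_b:     background frequencies of the two residues
    """
    expected_by_chance = freq_a * freq_b
    return round(scale * math.log2(observed_pair_freq / expected_by_chance))


# Toy numbers (hypothetical): both residues have a background frequency of 0.25,
# so a random pairing is expected with frequency 0.0625.
as_expected = log_odds_score(0.0625, 0.25, 0.25)   # seen exactly as often as chance
enriched = log_odds_score(0.125, 0.25, 0.25)       # seen twice as often as chance
depleted = log_odds_score(0.03125, 0.25, 0.25)     # seen half as often as chance
print(as_expected, enriched, depleted)
```

The three toy cases correspond to a zero, a positive and a negative entry in a substitution matrix.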
![Page 40: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA](https://reader033.vdocuments.mx/reader033/viewer/2022051621/56649d625503460f94a4540a/html5/thumbnails/40.jpg)
BLOSUM62 substitution matrix
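|   | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 4 | -1 | -2 | -2 | 0 | -1 | -1 | 0 | -2 | -1 | -1 | -1 | -1 | -2 | -1 | 1 | 0 | -3 | -2 | 0 |
| R | -1 | 5 | 0 | -2 | -3 | 1 | 0 | -2 | 0 | -3 | -2 | 2 | -1 | -3 | -2 | -1 | -1 | -3 | -2 | -3 |
| N | -2 | 0 | 6 | 1 | -3 | 0 | 0 | 0 | 1 | -3 | -3 | 0 | -2 | -3 | -2 | 1 | 0 | -4 | -2 | -3 |
| D | -2 | -2 | 1 | 6 | -3 | 0 | 2 | -1 | -1 | -3 | -4 | -1 | -3 | -3 | -1 | 0 | -1 | -4 | -3 | -3 |
| C | 0 | -3 | -3 | -3 | 8 | -3 | -4 | -3 | -3 | -1 | -1 | -3 | -1 | -2 | -3 | -1 | -1 | -2 | -2 | -1 |
| Q | -1 | 1 | 0 | 0 | -3 | 5 | 2 | -2 | 0 | -3 | -2 | 1 | 0 | -3 | -1 | 0 | -1 | -2 | -1 | -2 |
| E | -1 | 0 | 0 | 2 | -4 | 2 | 5 | -2 | 0 | -3 | -3 | 1 | -2 | -3 | -1 | 0 | -1 | -3 | -2 | -2 |
| G | 0 | -2 | 0 | -1 | -3 | -2 | -2 | 6 | -2 | -4 | -4 | -2 | -3 | -3 | -2 | 0 | -2 | -2 | -3 | -3 |
| H | -2 | 0 | 1 | -1 | -3 | 0 | 0 | -2 | 7 | -3 | -3 | -1 | -2 | -1 | -2 | -1 | -2 | -2 | 2 | -3 |
| I | -1 | -3 | -3 | -3 | -1 | -3 | -3 | -4 | -3 | 4 | 2 | -3 | 1 | 0 | -3 | -2 | -1 | -3 | -1 | 3 |
| L | -1 | -2 | -3 | -4 | -1 | -2 | -3 | -4 | -3 | 2 | 4 | -2 | 2 | 0 | -3 | -2 | -1 | -2 | -1 | 1 |
| K | -1 | 2 | 0 | -1 | -3 | 1 | 1 | -2 | -1 | -3 | -2 | 5 | -1 | -3 | -1 | 0 | -1 | -3 | -2 | -2 |
| M | -1 | -1 | -2 | -3 | -1 | 0 | -2 | -3 | -2 | 1 | 2 | -1 | 5 | 0 | -2 | -1 | -1 | -1 | -1 | 1 |
| F | -2 | -3 | -3 | -3 | -2 | -3 | -3 | -3 | -1 | 0 | 0 | -3 | 0 | 6 | -4 | -2 | -2 | 1 | 3 | -1 |
| P | -1 | -2 | -2 | -1 | -3 | -1 | -1 | -2 | -2 | -3 | -3 | -1 | -2 | -4 | 6 | -1 | -1 | -4 | -3 | -2 |
| S | 1 | -1 | 1 | 0 | -1 | 0 | 0 | 0 | -1 | -2 | -2 | 0 | -1 | -2 | -1 | 4 | 1 | -3 | -2 | -2 |
| T | 0 | -1 | 0 | -1 | -1 | -1 | -1 | -2 | -2 | -1 | -1 | -1 | -1 | -2 | -1 | 1 | 5 | -2 | -2 | 0 |
| W | -3 | -3 | -4 | -4 | -2 | -2 | -3 | -2 | -2 | -3 | -2 | -3 | -1 | 1 | -4 | -3 | -2 | 10 | 2 | -3 |
| Y | -2 | -2 | -2 | -3 | -2 | -1 | -2 | -3 | 2 | -1 | -1 | -2 | -1 | 3 | -3 | -2 | -2 | 2 | 6 | -1 |
| V | 0 | -3 | -3 | -3 | -1 | -2 | -2 | -3 | -3 | 3 | 1 | -2 | 1 | -1 | -2 | -2 | 0 | -3 | -1 | 4 |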
![Page 41: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA](https://reader033.vdocuments.mx/reader033/viewer/2022051621/56649d625503460f94a4540a/html5/thumbnails/41.jpg)
What is the limit of substitution matrices?
![Page 42: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA](https://reader033.vdocuments.mx/reader033/viewer/2022051621/56649d625503460f94a4540a/html5/thumbnails/42.jpg)
Why did substitution matrices fail?
• The proteins in question are very distantly related and their substitution patterns are not properly rewarded by general matrices
• Substitution matrices do not capture all families equally well because they are meant to be general
• Can we build protein family-specific substitution matrices?
• Yes, these are known as protein family profiles
![Page 43: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA](https://reader033.vdocuments.mx/reader033/viewer/2022051621/56649d625503460f94a4540a/html5/thumbnails/43.jpg)
How do we build a protein family-specific matrix?
Search protein database using BLOSUM62 matrix
Build protein-family specific matrix (profile) and search protein database again
???
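
One way to picture the second step is to turn a set of aligned family members into per-column log-odds scores. The sketch below is a deliberately simplified illustration and not PSI-BLAST's actual procedure, which also weights sequences and uses data-dependent pseudocounts. The function name `build_pssm`, the uniform background frequencies, the flat pseudocount and the toy alignment are all assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(aligned_seqs, background=None, pseudocount=1.0, scale=2):
    """Return one {residue: score} dict per alignment column."""
    if background is None:
        # Assumption: every residue is equally likely in the background.
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    pssm = []
    for col in range(len(aligned_seqs[0])):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in aligned_seqs if seq[col] in background)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total  # pseudocount avoids log(0)
            row[aa] = round(scale * math.log2(freq / background[aa]))
        pssm.append(row)
    return pssm


# Hypothetical four-sequence alignment of a three-residue motif.
toy_alignment = ["MKV", "MRV", "MKI", "MKL"]
pssm = build_pssm(toy_alignment)
for position, row in enumerate(pssm, start=1):
    best = max(row, key=row.get)
    print(position, best, row[best])
```

Unlike a general substitution matrix, each column gets its own scores, so a residue that is conserved at one position can be rewarded there and penalized elsewhere.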
![Page 44: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA](https://reader033.vdocuments.mx/reader033/viewer/2022051621/56649d625503460f94a4540a/html5/thumbnails/44.jpg)
Detecting distant relationships using profiles (PSI-BLAST)
![Page 45: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA](https://reader033.vdocuments.mx/reader033/viewer/2022051621/56649d625503460f94a4540a/html5/thumbnails/45.jpg)
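Position-specific scoring matrix (PSSM)

|   | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T | -1 | 0 | 1 | 2 | -2 | 0 | 1 | -1 | 0 | -2 | -2 | 0 | -1 | -2 | -1 | 0 | 3 | -1 | -1 | -1 |
| M | -1 | -2 | -3 | -3 | -1 | -2 | -3 | -3 | -2 | 1 | 1 | -2 | 7 | 0 | -3 | -3 | -2 | -1 | -1 | 0 |
| D | -1 | 2 | 0 | 2 | -3 | 3 | 1 | -1 | 0 | -3 | -3 | 1 | -2 | -3 | -1 | 0 | 0 | -2 | -2 | -2 |
| V | -1 | -4 | -4 | -5 | -1 | -4 | -4 | -5 | -3 | 4 | 1 | -4 | 0 | -1 | -4 | -4 | -1 | -2 | -2 | 5 |
| I | -1 | -4 | -5 | -6 | -1 | -4 | -5 | -6 | -4 | 5 | 1 | -5 | 0 | -1 | -5 | -5 | -2 | -2 | -2 | 4 |
| S | 0 | -1 | 0 | -1 | -1 | -1 | -1 | -1 | -1 | -3 | -3 | -1 | -2 | -3 | -2 | 4 | 3 | -3 | -2 | -2 |
| F | -2 | -4 | -4 | -5 | -2 | -4 | -5 | -5 | -2 | 1 | 2 | -4 | 1 | 6 | -4 | -4 | -3 | 2 | 2 | 0 |
| K | -1 | 3 | -1 | -1 | -4 | 1 | 0 | -2 | 0 | -3 | -3 | 5 | -2 | -4 | -2 | -1 | -1 | -3 | -3 | -3 |
| L | -2 | -4 | -5 | -6 | -1 | -4 | -5 | -6 | -3 | 3 | 4 | -4 | 2 | 1 | -4 | -5 | -3 | -1 | -1 | 1 |
| P | -1 | -3 | -3 | -2 | -4 | -3 | -2 | -3 | -3 | -4 | -4 | -2 | -4 | -4 | 7 | -2 | -2 | -4 | -4 | -3 |
| P | -1 | -1 | -1 | 2 | -4 | -1 | 0 | -1 | -1 | -4 | -4 | -1 | -4 | -4 | 6 | -1 | -1 | -3 | -3 | -3 |
| E | 0 | 0 | 0 | 2 | -4 | 1 | 4 | -2 | -1 | -3 | -3 | 1 | -2 | -4 | -1 | -1 | -1 | -3 | -3 | -2 |
| L | -2 | -3 | -4 | -5 | -2 | -3 | -4 | -5 | -2 | 2 | 4 | -4 | 6 | 1 | -4 | -4 | -2 | -1 | -1 | 1 |
| N | -1 | 0 | 3 | 0 | -2 | 0 | 2 | -1 | 0 | -2 | 0 | 1 | -1 | -2 | -1 | 0 | 0 | -1 | -1 | -2 |
| A | 2 | 1 | -1 | -1 | -1 | 0 | 0 | -2 | 0 | 0 | 0 | 0 | 0 | -1 | -1 | -1 | 0 | 0 | 0 | 0 |
| K | -1 | 1 | -1 | -1 | -3 | 1 | 0 | -2 | 0 | -2 | 0 | 4 | -1 | -2 | -2 | -1 | -1 | -2 | -1 | -1 |
| L | -2 | -4 | -5 | -5 | -2 | -4 | -5 | -5 | -3 | 1 | 5 | -4 | 1 | 1 | -4 | -5 | -3 | -1 | -1 | 1 |
| E | -1 | 0 | 2 | 4 | -4 | 0 | 3 | -1 | 0 | -5 | -4 | 0 | -4 | -4 | -1 | 0 | -1 | -3 | -3 | -4 |
| S | 1 | 1 | 0 | 0 | -3 | 3 | 1 | -1 | 0 | -2 | -2 | 1 | -2 | -2 | -1 | 1 | 0 | -2 | -1 | -2 |
| V | -1 | -1 | -2 | -3 | -1 | -2 | -3 | -3 | 0 | 2 | 1 | -2 | 1 | 2 | -2 | -2 | -1 | 1 | 3 | 3 |
| A | 5 | -3 | -3 | -2 | 0 | -2 | -2 | 0 | -2 | -2 | -2 | -2 | -2 | -3 | -2 | 0 | -1 | -3 | -3 | -1 |
| L | -1 | 0 | -1 | -2 | -1 | -1 | -1 | -2 | 0 | 2 | 1 | 0 | 1 | 0 | -2 | 0 | 0 | 1 | 2 | 0 |
| K | -1 | 2 | 0 | 0 | -3 | 1 | 2 | -1 | 0 | -3 | -3 | 3 | -2 | -4 | -1 | 1 | 0 | -3 | -2 | -2 |
| E | -1 | 1 | 0 | 0 | -2 | 2 | 2 | -2 | 3 | -2 | 0 | 1 | -1 | -2 | -1 | -1 | 0 | -1 | -1 | -1 |
| K | -1 | 1 | 1 | 0 | -3 | 0 | 0 | 2 | 0 | -4 | -3 | 3 | -3 | -3 | -1 | 0 | -1 | -2 | -2 | -3 |
| K | -1 | 0 | -1 | -1 | -2 | 0 | 0 | -3 | 0 | 0 | -1 | 3 | 3 | -1 | -2 | -1 | 0 | -1 | 0 | 1 |
| S | -1 | -1 | 2 | 0 | -2 | 0 | 0 | -1 | 0 | -3 | -3 | 0 | -2 | -3 | -1 | 3 | 2 | -2 | -2 | -3 |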
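
Scoring a sequence against a PSSM is a row-by-row lookup: position 1 of the window is scored with row 1, position 2 with row 2, and so on. The sketch below uses only the first three rows of the matrix above (T, M, D) to keep it short, and it ignores gaps; `score_window` and `best_window` are hypothetical helper names, and the query string is made up.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# First three rows of the PSSM shown above (profile positions T, M, D).
PSSM_ROWS = [
    [-1, 0, 1, 2, -2, 0, 1, -1, 0, -2, -2, 0, -1, -2, -1, 0, 3, -1, -1, -1],
    [-1, -2, -3, -3, -1, -2, -3, -3, -2, 1, 1, -2, 7, 0, -3, -3, -2, -1, -1, 0],
    [-1, 2, 0, 2, -3, 3, 1, -1, 0, -3, -3, 1, -2, -3, -1, 0, 0, -2, -2, -2],
]


def score_window(window):
    """Sum the position-specific scores for a window as long as the profile."""
    return sum(row[COLUMNS.index(residue)] for row, residue in zip(PSSM_ROWS, window))


def best_window(query):
    """Slide the profile along the query; return (offset, score) of the best window."""
    width = len(PSSM_ROWS)
    hits = [(score_window(query[i:i + width]), i) for i in range(len(query) - width + 1)]
    best_score, best_offset = max(hits)
    return best_offset, best_score


print(score_window("TMD"))
print(score_window("TMQ"))
print(best_window("GGTMDAA"))
```

Note that Q scores higher than D at the third position even though the query residue there is D: the scores reflect what the whole family tolerates at that position, not just the query sequence.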
![Page 46: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA](https://reader033.vdocuments.mx/reader033/viewer/2022051621/56649d625503460f94a4540a/html5/thumbnails/46.jpg)
Combining sequence similarity with SS information
![Page 47: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA](https://reader033.vdocuments.mx/reader033/viewer/2022051621/56649d625503460f94a4540a/html5/thumbnails/47.jpg)
Can we quickly scan for common protein families?
• YES, many databases available
• Instead of comparing our query to other sequences, we compare it to the database of profiles (also called Hidden Markov Models or HMMs)
• Profiles (and HMMs) capture the average preference for residues at all positions; they are probabilistic representations of protein families
• Try these databases:
– PFAM (http://pfam.wustl.edu)
– SMART (http://smart.embl-heidelberg.de)
![Page 48: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA](https://reader033.vdocuments.mx/reader033/viewer/2022051621/56649d625503460f94a4540a/html5/thumbnails/48.jpg)
Profile HMMs have other uses
• Profile HMMs represent a phylogenetic footprint of a given protein family
– Also used for secondary structure prediction
– Prediction of trans-membrane proteins
– Prediction of protein disorder
• Most of these predictors are based on machine-learning algorithms that are trained on known data and can extract subtle patterns
![Page 49: Advanced Bioinformatics (MB480/580) >Sulfolobus virus 1 complete genome 15465 bp. TTCGCCCGCTTACCGACGTACTTCGGTGAGGAACCGGTAACGGAGTTAG TACGCCCATAAGTTGAAACATTATCTCGTTTCGAAAGGAGGAAGAGGAA](https://reader033.vdocuments.mx/reader033/viewer/2022051621/56649d625503460f94a4540a/html5/thumbnails/49.jpg)
Questions?
Mensur Dlakic
Department of Microbiology
111 Lewis Hall
Tel: 994-6576
[email protected]
office: 109 Lewis Hall