Bioinformatics
B90901099 劉兆昕
2
Outline
• What is Bioinformatics?
• Applications of Bioinformatics
• DNA Databases
• Protein Databases
• FASTA
• BLAST
3
What is Bioinformatics?
• Computational biology.
• The term has no clear definition.
• It involves the analysis and interpretation of data and the development of algorithms and statistics.
• The term was coined to encompass computer applications in biological sciences but is now used to mean rather different things, from artificial intelligence and robotics to genome analysis.
• The term was originally applied to the computational manipulation and analysis of biological sequence data (DNA and/or protein), but now tends also to be used to embrace the manipulation and analysis of 3D structural data.
4
What is Bioinformatics?
• Bioinformatics is the application of computational techniques to the management and analysis of biological information.
5
Applications of Bioinformatics
• Sequencing of the Human Genome
• Annotation of the Genome
• Protein Interactions
• Finding Proteins that are Better Suited as Drug Targets
• Development of New Drugs
• Optimization of Known Therapies on Individuals
6
Sequencing of the Human Genome
7
DNA Databases
• GenBank (USA)
• EMBL (European Molecular Biology Laboratory)
• DDBJ (DNA Data Bank of Japan)
8
GenBank
9
EMBL
10
DDBJ
11
A Sample DNA Database Entry
ID TRBG361 standard; mRNA; PLN; 1859 BP.
XX
AC X56734; S46826;
XX
SV X56734.1
XX
DT 12-SEP-1991 (Rel. 29, Created)
DT 15-MAR-1999 (Rel. 59, Last updated, Version 9)
XX
DE Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW beta-glucosidase.
XX
OS Trifolium repens (white clover)
OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
OC eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN [5]
RP 1-1859
RX MEDLINE; 91322517.
RX PUBMED; 1907511.
RA Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT "Nucleotide and derived amino acid sequence of the cyanogenic
RT beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
RL Plant Mol. Biol. 17(2):209-219(1991).
XX
RN [6]
RP 1-1859
RA Hughes M.A.;
RT ;
RL Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW CASTLE
RL UPON TYNE, NE2 4HH, UK
XX
FH Key Location/Qualifiers
FH
FT source 1..1859
FT /db_xref="taxon:3899"
FT /mol_type="mRNA"
FT /organism="Trifolium repens"
FT /tissue_type="leaves"
FT /clone_lib="lambda gt10"
FT /clone="TRE361"
FT CDS 14..1495
FT /db_xref="GOA:P26204"
FT /db_xref="HSSP:P26205"
FT /db_xref="UniProt/Swiss-Prot:P26204"
FT /note="non-cyanogenic"
FT /EC_number="3.2.1.21"
FT /product="beta-glucosidase"
FT /protein_id="CAA40058.1"
FT /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT mRNA 1..1859
FT /evidence=EXPERIMENTAL
XX
SQ Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt 60
cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag 120
tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga 180
aggtgcagta aacgaaggcg gtagaggacc aagtatttgg gataccttca cccataaata 240
tccagaaaaa ataagggatg gaagcaatgc agacatcacg gttgaccaat atcaccgcta 300
caaggaagat gttgggatta tgaaggatca aaatatggat tcgtatagat tctcaatctc 360
ttggccaaga atactcccaa agggaaagtt gagcggaggc ataaatcacg aaggaatcaa 420
atattacaac aaccttatca acgaactatt ggctaacggt atacaaccat ttgtaactct 480
ttttcattgg gatcttcccc aagtcttaga agatgagtat ggtggtttct taaactccgg 540
tgtaataaat gattttcgag actatacgga tctttgcttc aaggaatttg gagatagagt 600
gaggtattgg agtactctaa atgagccatg ggtgtttagc aattctggat atgcactagg 660
aacaaatgca ccaggtcgat gttcggcctc caacgtggcc aagcctggtg attctggaac 720
aggaccttat atagttacac acaatcaaat tcttgctcat gcagaagctg tacatgtgta 780
taagactaaa taccaggcat atcaaaaggg aaagataggc ataacgttgg tatctaactg 840
gttaatgcca cttgatgata atagcatacc agatataaag gctgccgaga gatcacttga 900
cttccaattt ggattgttta tggaacaatt aacaacagga gattattcta agagcatgcg 960
gcgtatagtt aaaaaccgat tacctaagtt ctcaaaattc gaatcaagcc tagtgaatgg 1020
ttcatttgat tttattggta taaactatta ctcttctagt tatattagca atgccccttc 1080
acatggcaat gccaaaccca gttactcaac aaatcctatg accaatattt catttgaaaa 1140
acatgggata cccttaggtc caagggctgc ttcaatttgg atatatgttt atccatatat 1200
gtttatccaa gaggacttcg agatcttttg ttacatatta aaaataaata taacaatcct 1260
gcaattttca atcactgaaa atggtatgaa tgaattcaac gatgcaacac ttccagtaga 1320
agaagctctt ttgaatactt acagaattga ttactattac cgtcacttat actacattcg 1380
ttctgcaatc agggctggct caaatgtgaa gggtttttac gcatggtcat ttttggactg 1440
taatgaatgg tttgcaggct ttactgttcg ttttggatta aactttgtag attagaaaga 1500
tggattaaaa aggtacccta agctttctgc ccaatggtac aagaactttc tcaaaagaaa 1560
ctagctagta ttattaaaag aactttgtag tagattacag tacatcgttt gaagttgagt 1620
tggtgcacct aattaaataa aagaggttac tcttaacata tttttaggcc attcgttgtg 1680
aagttgttag gctgttattt ctattatact atgttgtagt aataagtgca ttgttgtacc 1740
agaagctatg atcataacta taggttgatc cttcatgtat cagtttgatg ttgagaatac 1800
tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa 1859
//
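The entry above is organized by two-letter line codes (ID, AC, DE, OS, ...), with XX as a spacer and // as the terminator. As a minimal sketch of how such a record could be read, assuming the single-space layout shown in this transcript, the following groups line bodies by their code. The function name `parse_flatfile` and the shortened `excerpt` are illustrative choices, not part of any database toolkit.

```python
from collections import defaultdict


def parse_flatfile(text):
    """Group line bodies by their two-letter line code; stop at the // terminator."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            break
        code, _, body = line.partition(" ")
        if code == "XX" or len(code) != 2:
            continue  # skip spacer lines and sequence data lines
        fields[code].append(body.strip())
    return dict(fields)


# Shortened excerpt of the sample entry (hypothetical variable name).
excerpt = """ID TRBG361 standard; mRNA; PLN; 1859 BP.
XX
AC X56734; S46826;
XX
DE Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
OS Trifolium repens (white clover)
//"""

entry = parse_flatfile(excerpt)
accessions = [a.strip() for a in entry["AC"][0].split(";") if a.strip()]
print(entry["OS"][0])
print(accessions)
```

Keeping a list per line code lets multi-line fields such as OC or RT accumulate in order without special handling.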
12
The Human Genome Project
• Begun in 1990
• Coordinated by the U.S. Department of Energy and the National Institutes of Health
• Originally expected to be completed in 2005 (15 years); completed in 2003 (13 years)
13
The Human Genome Project
14
The Human Genome Project
• Project goals were to
– identify all the approximately 20,000-25,000 genes in human DNA,
– determine the sequences of the 3 billion chemical base pairs that make up human DNA,
– store this information in databases,
– improve tools for data analysis,
– transfer related technologies to the private sector, and
– address the ethical, legal, and social issues (ELSI) that may arise from the project.
15
Protein Interactions
16
Protein Databases
• PIR (Protein Information Resource)
• Swiss-Prot
• UniProt (Universal Protein Resource)
17
PIR
18
Swiss-Prot
19
UniProt
20
A Typical SWISS-PROT Entry
ID ADH_DROME STANDARD; PRT; 255 AA.
AC P00334;
DT 21-JUL-1986 (Rel. 01, Created)
DT 21-JUL-1986 (Rel. 01, Last sequence update)
DT 15-JUL-1998 (Rel. 36, Last annotation update)
DE ALCOHOL DEHYDROGENASE (EC 1.1.1.1).
GN ADH.
OS Drosophila melanogaster (Fruit fly).
OC Eukaryota; Metazoa; Arthropoda; Tracheata; Hexapoda; Insecta;
OC Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
OC Ephydroidea; Drosophilidae; Drosophila.
RN [1]
RP SEQUENCE FROM N.A. (ADH-S AND ADH-F ALLELES).
RX MEDLINE; 83271489. [NCBI, ExPASy, Israel, Japan]
RA Kreitman M.;
RT "Nucleotide polymorphism at the alcohol dehydrogenase locus of
RT Drosophila melanogaster.";
RL Nature 304:412-417(1983).
RN [2]
RP SEQUENCE FROM N.A. (ADH-S ALLELE).
RX MEDLINE; 81247357. [NCBI, ExPASy, Israel, Japan]
RA Benyajati C., Place A.R., Powers D.A., Sofer W.;
RT "Alcohol dehydrogenase gene of Drosophila melanogaster: relationship
RT of intervening sequences to functional domains in the protein.";
RL Proc. Natl. Acad. Sci. U.S.A. 78:2717-2721(1981).
RN [3]
RP SEQUENCE FROM N.A. (ADH-S ALLELE).
RX MEDLINE; 91200630. [NCBI, ExPASy, Israel, Japan]
RA Kreitman M., Hudson R.R.;
RT "Inferring the evolutionary histories of the Adh and Adh-dup loci in
RT Drosophila melanogaster from patterns of polymorphism and
RT divergence.";
RL Genetics 127:565-582(1991).
RN [4]
RP SEQUENCE FROM N.A. (ADH-S ALLELE).
RC STRAIN=CANTON-S;
RA Brogna S., Ashburner M.;
RL Submitted (DEC-1996) to the EMBL/GenBank/DDBJ databases.
RN [5]
RP SEQUENCE OF 140-255 FROM N.A. (ADH-F ALLELE).
RX MEDLINE; 81124290. [NCBI, ExPASy, Israel, Japan]
RA Benyajati C., Wang N., Reddy A., Weinberg E., Sofer W.;
RT "Alcohol dehydrogenase in Drosophila: isolation and characterization
RT of messenger RNA and cDNA clone.";
RL Nucleic Acids Res. 8:5649-5667(1980).
RN [6]
RP SEQUENCE (ADH-F ALLELE).
RX MEDLINE; 84256516. [NCBI, ExPASy, Israel, Japan]
RA Thatcher D.R.;
RT "The complete amino acid sequence of three alcohol dehydrogenase
RT alleloenzymes (AdhN-11, AdhS and AdhUF) from the fruitfly Drosophila
RT melanogaster.";
RL Biochem. J. 187:875-886(1980).
RN [7]
RP ERRATUM.
RA Thatcher D.R.;
RL Biochem. J. 191:895-895(1980).
RN [8]
RP SEQUENCE OF 207-224 (ADH-FCH.D. ALLELE).
RX MEDLINE; 81247428. [NCBI, ExPASy, Israel, Japan]
RA Chambers G.K., Laver W.G., Campbell S., Gibson J.B.;
RT "Structural analysis of an electrophoretically cryptic alcohol
RT dehydrogenase variant from an Australian population of Drosophila
RT melanogaster.";
RL Proc. Natl. Acad. Sci. U.S.A. 78:3103-3107(1981).
RN [9]
RP SEQUENCE FROM N.A. (ADH-F ALLELE).
RC STRAIN=KA12, AND RI32;
RX MEDLINE; 92077396. [NCBI, ExPASy, Israel, Japan]
RA Laurie C.C., Bridgham J.T., Choudhary M.;
RT "Associations between DNA sequence variation and variation in
RT expression of the Adh gene in natural populations of Drosophila
RT melanogaster.";
RL Genetics 129:489-499(1991).
RN [10]
RP MUTAGENESIS.
RX MEDLINE; 90212596. [NCBI, ExPASy, Israel, Japan]
RA Chen Z., Lu L., Shirley M., Lee W.R., Chang S.H.;
RT "Site-directed mutagenesis of glycine-14 and two 'critical' cysteinyl
RT residues in Drosophila alcohol dehydrogenase.";
RL Biochemistry 29:1112-1118(1990).
RN [11]
RP MUTAGENESIS OF TYR-152 AND LYS-156.
RX MEDLINE; 93213802. [NCBI, ExPASy, Israel, Japan]
RA Chen Z., Jiang J.C., Lin Z.-G., Lee W.R., Baker M.E., Chang S.H.;
RT "Site-specific mutagenesis of Drosophila alcohol dehydrogenase:
RT evidence for involvement of tyrosine-152 and lysine-156 in
RT catalysis.";
RL Biochemistry 32:3342-3346(1993).
RN [12]
RP MUTAGENESIS OF GLY-129; GLY-132; TYR-152; LYS-156; AND GLY-183.
RX MEDLINE; 93202283. [NCBI, ExPASy, Israel, Japan]
RA Cols N., Marfany G., Atrian S., Gonzalez-Duarte R.;
RT "Effect of site-directed mutagenesis on conserved positions of
RT Drosophila alcohol dehydrogenase.";
RL FEBS Lett. 319:90-94(1993).
RN [13]
RP MUTAGENESIS OF TYR-152.
RX MEDLINE; 92371633. [NCBI, ExPASy, Israel, Japan]
RA Albalat R., Gonzalez-Duarte R., Atrian S.;
RT "Protein engineering of Drosophila alcohol dehydrogenase. The
RT hydroxyl group of Tyr152 is involved in the active site of the
RT enzyme.";
RL FEBS Lett. 308:235-239(1992).
CC -!- CATALYTIC ACTIVITY: ALCOHOL + NAD(+) = ALDEHYDE OR KETONE + NADH.
CC -!- ENZYME REGULATION: INHIBITED BY 2,2,2-TRIFLUOROETHANOL AND
CC PYRAZOLE.
CC -!- SUBUNIT: HOMODIMER.
CC -!- POLYMORPHISM: VIRTUALLY ALL NATURAL POPULATIONS OF THIS SPECIES
CC ARE POLYMORPHIC FOR 2 ELECTROPHORETICALLY DISTINGUISHABLE ALLELES,
CC ADH-S AND ADH-F. THE SEQUENCE OF THE ADH-S ALLELE IS SHOWN.
CC -!- SIMILARITY: BELONGS TO THE SHORT-CHAIN DEHYDROGENASES/REDUCTASES
CC (SDR) FAMILY.
CC --------------------------------------------------------------------------
CC This SWISS-PROT entry is copyright. It is produced through a collaboration
CC between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC the European Bioinformatics Institute. There are no restrictions on its
CC use by non-profit institutions as long as its content is in no way
CC modified and this statement is not removed. Usage by and for commercial
CC entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC or send an email to [email protected]).
CC --------------------------------------------------------------------------
DR EMBL; M36580; AAA28331.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; Z00030; CAA77330.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; J01066; AAB59183.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; M17827; AAA28341.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; M17828; AAA28342.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; M19547; AAA70210.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; M17830; AAA28343.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; M17831; AAA28344.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; M17832; AAA28345.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; M17833; AAA28346.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; M17834; AAA28347.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; M17835; AAA28348.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; M17836; AAA28349.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; M17837; AAA70212.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; X60791; CAA43204.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; X60792; CAA43205.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; U20765; AAA88817.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; X78384; CAA55151.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR EMBL; X98338; CAA66981.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR PIR; A00343; DEFFA.
DR PIR; S17083; S17083.
DR PIR; S17085; S17085.
DR FLYBASE; FBgn0000055; Adh.
DR INTERPRO; IPR002198; -.
DR INTERPRO; IPR002424; -.
DR INTERPRO; IPR002425; -.
DR INTERPRO; IPR003030; -.
DR PFAM; PF00106; adh_short; 1.
DR PFAM; PF00663; adh_short_C; 1.
DR PRINTS; PR00080; SDRFAMILY.
DR PRINTS; PR01167; INSADHFAMILY.
DR PRINTS; PR01168; ALCDHDRGNASE.
DR PROSITE; PS00061; ADH_SHORT; 1.
DR PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR BLOCKS; P00334.
DR DOMO; P00334.
DR PROTOMAP; P00334.
DR PRESAGE; P00334.
DR DIP; P00334.
DR SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW Oxidoreductase; NAD; Acetylation.
FT INIT_MET 0 0
FT MOD_RES 1 1 ACETYLATION.
FT NP_BIND 11 34 NAD (BY SIMILARITY).
FT ACT_SITE 152 152
FT VARIANT 192 192 K -> T (IN ADH-F ALLELE).
FT VARIANT 214 214 P -> S (IN ADH-FCH.D ALLELE).
FT MUTAGEN 14 14 G->V: COMPLETE LOSS OF ACTIVITY.
FT MUTAGEN 14 14 G->A: 31% DECREASE IN ACTIVITY.
FT MUTAGEN 129 129 G->C: COMPLETE LOSS OF ACTIVITY.
FT MUTAGEN 132 132 G->I: COMPLETE LOSS OF ACTIVITY.
FT MUTAGEN 135 135 C->A: NO DECREASE IN ACTIVITY.
FT MUTAGEN 152 152 Y->F,H,E,Q: COMPLETE LOSS OF ACTIVITY.
FT MUTAGEN 152 152 Y->C: RETAINS ONLY 0.25% ACTIVITY.
FT MUTAGEN 156 156 K->I: COMPLETE LOSS OF ACTIVITY.
FT MUTAGEN 156 156 K->R: RETAINS ONLY 2.2% ACTIVITY.
FT MUTAGEN 183 183 G->L: COMPLETE LOSS OF ACTIVITY.
FT MUTAGEN 218 218 C->A: NO DECREASE IN ACTIVITY.
SQ SEQUENCE 255 AA; 27630 MW; CA7DF929E0007296 CRC64;
SFTLTNKNVI FVAGLGGIGL DTSKELLKRD LKNLVILDRI ENPAAIAELK AINPKVTVTF
YPYDVTVPIA ETTKLLKTIF AQLKTVDVLI NGAGILDDHQ IERTIAVNYT GLVNTTTAIL
DFWDKRKGGP GGIICNIGSV TGFNAIYQVP VYSGTKAAVV NFTSSLAKLA PITGVTAYTV
NPGITRTTLV HKFNSWLDVE PQVAEKLLAH PTQPSLACAE NFVKAIELNQ NGAIWKLDLG
TLEAIQWTKH WDSGI
//
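The FT lines of the entry give 1-based positions into the SQ sequence; for instance, VARIANT 192 records K -> T in the ADH-F allele. As a small sketch of how a feature line maps onto the sequence, the following applies that substitution to the ADH-S sequence shown in the entry. The helper `apply_variant` and the variable names are hypothetical and written only for this illustration.

```python
# Sequence lines copied from the SQ block of the entry (ADH-S allele).
SQ_LINES = """
SFTLTNKNVI FVAGLGGIGL DTSKELLKRD LKNLVILDRI ENPAAIAELK AINPKVTVTF
YPYDVTVPIA ETTKLLKTIF AQLKTVDVLI NGAGILDDHQ IERTIAVNYT GLVNTTTAIL
DFWDKRKGGP GGIICNIGSV TGFNAIYQVP VYSGTKAAVV NFTSSLAKLA PITGVTAYTV
NPGITRTTLV HKFNSWLDVE PQVAEKLLAH PTQPSLACAE NFVKAIELNQ NGAIWKLDLG
TLEAIQWTKH WDSGI
"""

# Remove the blanks and line breaks used for display.
sequence = "".join(SQ_LINES.split())


def apply_variant(seq, position, original, replacement):
    """Return seq with a single-residue substitution at a 1-based position."""
    if seq[position - 1] != original:
        raise ValueError(f"expected {original} at position {position}")
    return seq[:position - 1] + replacement + seq[position:]


# FT VARIANT 192 192 K -> T (IN ADH-F ALLELE).
adh_f = apply_variant(sequence, 192, "K", "T")
print(len(sequence), sequence[191], adh_f[191])
```

Checking the expected residue before substituting catches off-by-one mistakes between the 1-based feature table and 0-based string indexing.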
21
Computer Science in Bioinformatics
• FASTA
• BLAST
22
FASTA
23
FASTA
Algorithm:
BASEBALLANDCRICKET
WATSNANDCRICK
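The slide gives only the two example strings. As a hedged sketch of the word-lookup step commonly described for FASTA (not the full algorithm), the following indexes every k-letter word of the first string and counts, for each diagonal, how many words the second string shares with it. The function name `ktup_diagonals` and the word length `ktup=2` are assumptions chosen for this illustration.

```python
from collections import defaultdict


def ktup_diagonals(query, subject, ktup=2):
    """Count shared k-letter words per diagonal (subject offset minus query offset)."""
    # Index every k-letter word of the query by its start position.
    positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        positions[query[i:i + ktup]].append(i)

    # Look up each word of the subject and record the diagonal of every hit.
    diagonals = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], ()):
            diagonals[j - i] += 1
    return dict(diagonals)


hits = ktup_diagonals("BASEBALLANDCRICKET", "WATSNANDCRICK")
best = max(hits, key=hits.get)
print(hits, best)
```

Word hits that fall on the same diagonal belong to one ungapped region, here the shared stretch ANDCRICK, which is why counting per diagonal is enough to find the candidate region to score.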
24
FASTA
Algorithm: (continued)
FASTA uses the BLOSUM50 scoring matrix.
25
BLAST
26
BLAST
Algorithm:
For the protein A S T N C F:
• A matched with itself gives a score of 2
• S matched with itself gives a score of 2
• T matched with itself gives a score of 3
• N matched with itself gives a score of 2
The word A S T N can therefore score at most 9 (2 + 2 + 3 + 2), which is below the minimum required score of 17.
27
BLAST
Algorithm: (continued)
For the protein A S T N C F:
• S matched with itself gives a score of 2
• T matched with itself gives a score of 3
• N matched with itself gives a score of 2
• C matched with itself gives a score of 12
The word formed by these four amino acids (S T N C) has a total score of 19, which exceeds 17.
A substitution of one amino acid for another can score as little as 1. A high-scoring word can therefore absorb a substitution and still reach the required score, so some query words generate many additional words to search for.
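The word-score arithmetic on these two slides can be checked with a short sketch. It uses only the self-match scores and the threshold of 17 quoted on the slides; F has no score listed there, so words containing F are left out. The names `SELF_SCORE`, `THRESHOLD`, `self_score`, and `passes` are illustrative.

```python
# Self-match scores quoted on the slides; F is not listed, so it is omitted.
SELF_SCORE = {"A": 2, "S": 2, "T": 3, "N": 2, "C": 12}
THRESHOLD = 17  # minimum required word score quoted on the slides


def self_score(word):
    """Score of a word aligned against itself: the sum of its self-match scores."""
    return sum(SELF_SCORE[residue] for residue in word)


def passes(word):
    """True when the word's best possible score reaches the threshold."""
    return self_score(word) >= THRESHOLD


for word in ("ASTN", "STNC"):
    print(word, self_score(word), passes(word))
```

Because a word's self-match score is the highest score it can reach, a word such as ASTN that falls short of the threshold cannot seed any hit, while STNC can.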
28
BLAST
Algorithm: (continued)
BLAST uses the BLOSUM62 scoring matrix.
For example, consider the sequences:
A C D E E F G H
A C D E F G H
Every time an exact match occurs, BLAST attempts to extend the seed by looking for matches in either direction. However, the algorithm can miss related sequences, because the extension does not take gaps into account.
A C D E can be located, but it will not be extended to F G H, because the next residue, E, is not a good match. Similarly, E F G H can be identified, but it will not be extended back to A C D.
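The extension behaviour described on this slide can be reproduced with a deliberately simplified sketch: it grows a seed only while residues match exactly, whereas BLAST itself extends by substitution score. The function `extend_exact` and the three-letter seeds are assumptions made for the illustration.

```python
def extend_exact(a, b, i, j, w):
    """Grow an exact seed a[i:i+w] == b[j:j+w] in both directions, without gaps."""
    assert a[i:i + w] == b[j:j + w], "seed must be an exact match"

    # Extend to the left while both sequences have a matching residue.
    left = 0
    while i - left > 0 and j - left > 0 and a[i - left - 1] == b[j - left - 1]:
        left += 1

    # Extend to the right in the same way.
    right = w
    while i + right < len(a) and j + right < len(b) and a[i + right] == b[j + right]:
        right += 1

    return a[i - left:i + right]


first = "ACDEEFGH"
second = "ACDEFGH"

from_left = extend_exact(first, second, 0, 0, 3)   # seed ACD
from_right = extend_exact(first, second, 5, 4, 3)  # seed FGH
print(from_left, from_right)
```

Neither extension crosses the extra E, so the two halves are reported separately; joining them would require an alignment step that allows a gap.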
29
The Future of Bioinformatics
• Data collection and organization: new concepts for databases.
• Prediction of biological functions: computational tools are needed to analyze the collected data in the most efficient manner.
• Data labyrinth: computational tools that integrate the scattered information are needed.
30
The Future of Bioinformatics