
Bioinformatics

B90901099 劉兆昕

2

Outline

• What is Bioinformatics?

• Applications of Bioinformatics

• DNA Databases

• Protein Databases

• FASTA

• BLAST

3

What is Bioinformatics?

• Also called computational biology.

• The term has no clear definition.

• It involves the analysis and interpretation of data and the development of algorithms and statistics.

• The term was coined to encompass computer applications in biological sciences but is now used to mean rather different things, from artificial intelligence and robotics to genome analysis.

• The term was originally applied to the computational manipulation and analysis of biological sequence data (DNA and/or protein), but now also tends to embrace the manipulation and analysis of 3D structural data.

4

What is Bioinformatics?

• Bioinformatics is the application of computational techniques to the management and analysis of biological information.

5

Applications of Bioinformatics

• Sequencing of the Human Genome

• Annotation of the Genome

• Protein Interactions

• Finding Proteins that are Better Suited as Drug Targets

• Development of New Drugs

• Optimization of Known Therapies on Individuals

6

Sequencing of the Human Genome

7

DNA Databases

• GenBank (USA)

• EMBL (European Molecular Biology Laboratory)

• DDBJ (DNA Data Bank of Japan)

8

GenBank

9

EMBL

10

DDBJ

11

A Sample DNA Database Entry

ID TRBG361 standard; mRNA; PLN; 1859 BP.

XX

AC X56734; S46826;

XX

SV X56734.1

XX

DT 12-SEP-1991 (Rel. 29, Created)

DT 15-MAR-1999 (Rel. 59, Last updated, Version 9)

XX

DE Trifolium repens mRNA for non-cyanogenic beta-glucosidase

XX

KW beta-glucosidase.

XX

OS Trifolium repens (white clover)

OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;

OC Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;

OC eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.

XX

RN [5]

RP 1-1859

RX MEDLINE; 91322517.

RX PUBMED; 1907511.

RA Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;

RT "Nucleotide and derived amino acid sequence of the cyanogenic

RT beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";

RL Plant Mol. Biol. 17(2):209-219(1991).

XX

RN [6]

RP 1-1859

RA Hughes M.A.;

RT ;

RL Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.

RL M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW CASTLE

RL UPON TYNE, NE2 4HH, UK

XX

FH Key Location/Qualifiers

FH

FT source 1..1859

FT /db_xref="taxon:3899"

FT /mol_type="mRNA"

FT /organism="Trifolium repens"

FT /tissue_type="leaves"

FT /clone_lib="lambda gt10"

FT /clone="TRE361"

FT CDS 14..1495

FT /db_xref="GOA:P26204"

FT /db_xref="HSSP:P26205"

FT /db_xref="UniProt/Swiss-Prot:P26204"

FT /note="non-cyanogenic"

FT /EC_number="3.2.1.21"

FT /product="beta-glucosidase"

FT /protein_id="CAA40058.1"

FT /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI

FT FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK

FT DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ

FT VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR

FT CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD

FT DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF

FT IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ

FT EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA

FT IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"

FT mRNA 1..1859

FT /evidence=EXPERIMENTAL

XX

SQ Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;

aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt 60

cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag 120

tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga 180

aggtgcagta aacgaaggcg gtagaggacc aagtatttgg gataccttca cccataaata 240

tccagaaaaa ataagggatg gaagcaatgc agacatcacg gttgaccaat atcaccgcta 300

caaggaagat gttgggatta tgaaggatca aaatatggat tcgtatagat tctcaatctc 360

ttggccaaga atactcccaa agggaaagtt gagcggaggc ataaatcacg aaggaatcaa 420

atattacaac aaccttatca acgaactatt ggctaacggt atacaaccat ttgtaactct 480

ttttcattgg gatcttcccc aagtcttaga agatgagtat ggtggtttct taaactccgg 540

tgtaataaat gattttcgag actatacgga tctttgcttc aaggaatttg gagatagagt 600

gaggtattgg agtactctaa atgagccatg ggtgtttagc aattctggat atgcactagg 660

aacaaatgca ccaggtcgat gttcggcctc caacgtggcc aagcctggtg attctggaac 720

aggaccttat atagttacac acaatcaaat tcttgctcat gcagaagctg tacatgtgta 780

taagactaaa taccaggcat atcaaaaggg aaagataggc ataacgttgg tatctaactg 840

gttaatgcca cttgatgata atagcatacc agatataaag gctgccgaga gatcacttga 900

cttccaattt ggattgttta tggaacaatt aacaacagga gattattcta agagcatgcg 960

gcgtatagtt aaaaaccgat tacctaagtt ctcaaaattc gaatcaagcc tagtgaatgg 1020

ttcatttgat tttattggta taaactatta ctcttctagt tatattagca atgccccttc 1080

acatggcaat gccaaaccca gttactcaac aaatcctatg accaatattt catttgaaaa 1140

acatgggata cccttaggtc caagggctgc ttcaatttgg atatatgttt atccatatat 1200

gtttatccaa gaggacttcg agatcttttg ttacatatta aaaataaata taacaatcct 1260

gcaattttca atcactgaaa atggtatgaa tgaattcaac gatgcaacac ttccagtaga 1320

agaagctctt ttgaatactt acagaattga ttactattac cgtcacttat actacattcg 1380

ttctgcaatc agggctggct caaatgtgaa gggtttttac gcatggtcat ttttggactg 1440

taatgaatgg tttgcaggct ttactgttcg ttttggatta aactttgtag attagaaaga 1500

tggattaaaa aggtacccta agctttctgc ccaatggtac aagaactttc tcaaaagaaa 1560

ctagctagta ttattaaaag aactttgtag tagattacag tacatcgttt gaagttgagt 1620

tggtgcacct aattaaataa aagaggttac tcttaacata tttttaggcc attcgttgtg 1680

aagttgttag gctgttattt ctattatact atgttgtagt aataagtgca ttgttgtacc 1740

agaagctatg atcataacta taggttgatc cttcatgtat cagtttgatg ttgagaatac 1800

tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa 1859

//
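The feature table and the sequence block of the entry above are linked by 1-based coordinates: `CDS 14..1495` says that translation starts at base 14. The following is a minimal sketch, not part of the original slides, that cleans a sequence line and translates a region with the standard genetic code. The function names `clean_sequence` and `translate` are hypothetical, and only the first sequence line of the entry is used to keep the example short.

```python
import re

# Standard genetic code, with codons enumerated in T, C, A, G order.
CODON_BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(CODON_BASES)
    for j, b in enumerate(CODON_BASES)
    for k, c in enumerate(CODON_BASES)
}


def clean_sequence(lines):
    """Join EMBL sequence lines, dropping spaces and the trailing position counter."""
    return "".join(re.sub(r"[^a-z]", "", line.lower()) for line in lines)


def translate(dna, start, end):
    """Translate the 1-based, inclusive region start..end (hypothetical helper)."""
    region = dna[start - 1:end].upper()
    usable = len(region) - len(region) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[region[i:i + 3]] for i in range(0, usable, 3))


# First sequence line of the sample entry (bases 1..60).
first_line = ["aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt 60"]
dna = clean_sequence(first_line)

print(len(dna))                # 60
print(translate(dna, 14, 58))  # MDFIVAIFALFVISS
```

The 15 residues printed agree with the start of the `/translation` qualifier in the entry, which is a quick consistency check between the CDS coordinates and the sequence.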

12

The Human Genome Project

• Begun in 1990

• Coordinated by the U.S. Department of Energy and the National Institutes of Health

• Originally expected to be completed in 2005 (15 years); actually completed in 2003 (13 years)

13


14


• Project goals were to

– identify all the approximately 20,000-25,000 genes in human DNA,

– determine the sequences of the 3 billion chemical base pairs that make up human DNA,

– store this information in databases,

– improve tools for data analysis,

– transfer related technologies to the private sector, and

– address the ethical, legal, and social issues (ELSI) that may arise from the project


15

Protein Interactions

16

Protein Databases

• PIR (Protein Information Resource)

• Swiss-Prot

• UniProt (Universal Protein Resource)

17

PIR

18

Swiss-Prot

19

UniProt

20

A Typical SWISS-PROT Entry

ID ADH_DROME STANDARD; PRT; 255 AA.

AC P00334;

DT 21-JUL-1986 (Rel. 01, Created)

DT 21-JUL-1986 (Rel. 01, Last sequence update)

DT 15-JUL-1998 (Rel. 36, Last annotation update)

DE ALCOHOL DEHYDROGENASE (EC 1.1.1.1).

GN ADH.

OS Drosophila melanogaster (Fruit fly).

OC Eukaryota; Metazoa; Arthropoda; Tracheata; Hexapoda; Insecta;

OC Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;

OC Ephydroidea; Drosophilidae; Drosophila.

RN [1]

RP SEQUENCE FROM N.A. (ADH-S AND ADH-F ALLELES).

RX MEDLINE; 83271489.

RA Kreitman M.;

RT "Nucleotide polymorphism at the alcohol dehydrogenase locus of

RT Drosophila melanogaster.";

RL Nature 304:412-417(1983).

RN [2]

RP SEQUENCE FROM N.A. (ADH-S ALLELE).


RA Benyajati C., Place A.R., Powers D.A., Sofer W.;

RT "Alcohol dehydrogenase gene of Drosophila melanogaster: relationship

RT of intervening sequences to functional domains in the protein.";

RL Proc. Natl. Acad. Sci. U.S.A. 78:2717-2721(1981).

RN [3]



RA Kreitman M., Hudson R.R.;

RT "Inferring the evolutionary histories of the Adh and Adh-dup loci in

RT Drosophila melanogaster from patterns of polymorphism and

RT divergence.";

RL Genetics 127:565-582(1991).

RN [4]


RC STRAIN=CANTON-S;

RA Brogna S., Ashburner M.;

RL Submitted (DEC-1996) to the EMBL/GenBank/DDBJ databases.

RN [5]

RP SEQUENCE OF 140-255 FROM N.A. (ADH-F ALLELE).


RA Benyajati C., Wang N., Reddy A., Weinberg E., Sofer W.;

RT "Alcohol dehydrogenase in Drosophila: isolation and characterization

RT of messenger RNA and cDNA clone.";

RL Nucleic Acids Res. 8:5649-5667(1980).

RN [6]

RP SEQUENCE (ADH-F ALLELE).


RA Thatcher D.R.;

RT "The complete amino acid sequence of three alcohol dehydrogenase

RT alleloenzymes (AdhN-11, AdhS and AdhUF) from the fruitfly Drosophila

RT melanogaster.";

RL Biochem. J. 187:875-886(1980).

RN [7]

RP ERRATUM.

RA Thatcher D.R.;

RL Biochem. J. 191:895-895(1980).

RN [8]

RP SEQUENCE OF 207-224 (ADH-FCH.D. ALLELE).


RA Chambers G.K., Laver W.G., Campbell S., Gibson J.B.;

RT "Structural analysis of an electrophoretically cryptic alcohol

RT dehydrogenase variant from an Australian population of Drosophila

RT melanogaster.";

RL Proc. Natl. Acad. Sci. U.S.A. 78:3103-3107(1981).

RN [9]

RP SEQUENCE FROM N.A. (ADH-F ALLELE).

RC STRAIN=KA12, AND RI32;

RX MEDLINE; 92077396.

RA Laurie C.C., Bridgham J.T., Choudhary M.;

RT "Associations between DNA sequence variation and variation in

RT expression of the Adh gene in natural populations of Drosophila

RT melanogaster.";

RL Genetics 129:489-499(1991).

RN [10]

RP MUTAGENESIS.

RX MEDLINE; 90212596.

RA Chen Z., Lu L., Shirley M., Lee W.R., Chang S.H.;

RT "Site-directed mutagenesis of glycine-14 and two 'critical' cysteinyl

RT residues in Drosophila alcohol dehydrogenase.";

RL Biochemistry 29:1112-1118(1990).

RN [11]

RP MUTAGENESIS OF TYR-152 AND LYS-156.

RX MEDLINE; 93213802.

RA Chen Z., Jiang J.C., Lin Z.-G., Lee W.R., Baker M.E., Chang S.H.;

RT "Site-specific mutagenesis of Drosophila alcohol dehydrogenase:

RT evidence for involvement of tyrosine-152 and lysine-156 in

RT catalysis.";

RL Biochemistry 32:3342-3346(1993).

RN [12]

RP MUTAGENESIS OF GLY-129; GLY-132; TYR-152; LYS-156; AND GLY-183.

RX MEDLINE; 93202283.

RA Cols N., Marfany G., Atrian S., Gonzalez-Duarte R.;

RT "Effect of site-directed mutagenesis on conserved positions of

RT Drosophila alcohol dehydrogenase.";

RL FEBS Lett. 319:90-94(1993).

RN [13]

RP MUTAGENESIS OF TYR-152.

RX MEDLINE; 92371633.

RA Albalat R., Gonzalez-Duarte R., Atrian S.;

RT "Protein engineering of Drosophila alcohol dehydrogenase. The

RT hydroxyl group of Tyr152 is involved in the active site of the

RT enzyme.";

RL FEBS Lett. 308:235-239(1992).

CC -!- CATALYTIC ACTIVITY: ALCOHOL + NAD(+) = ALDEHYDE OR KETONE + NADH.

CC -!- ENZYME REGULATION: INHIBITED BY 2,2,2-TRIFLUOROETHANOL AND

CC PYRAZOLE.

CC -!- SUBUNIT: HOMODIMER.

CC -!- POLYMORPHISM: VIRTUALLY ALL NATURAL POPULATIONS OF THIS SPECIES

CC ARE POLYMORPHIC FOR 2 ELECTROPHORETICALLY DISTINGUISHABLE ALLELES,

CC ADH-S AND ADH-F. THE SEQUENCE OF THE ADH-S ALLELE IS SHOWN.

CC -!- SIMILARITY: BELONGS TO THE SHORT-CHAIN DEHYDROGENASES/REDUCTASES

CC (SDR) FAMILY.

CC --------------------------------------------------------------------------

CC This SWISS-PROT entry is copyright. It is produced through a collaboration

CC between the Swiss Institute of Bioinformatics and the EMBL outstation -

CC the European Bioinformatics Institute. There are no restrictions on its

CC use by non-profit institutions as long as its content is in no way

CC modified and this statement is not removed. Usage by and for commercial

CC entities requires a license agreement (See http://www.isb-sib.ch/announce/

CC or send an email to [email protected]).

CC --------------------------------------------------------------------------

DR EMBL; M36580; AAA28331.1; -.

DR EMBL; Z00030; CAA77330.1; -.

DR EMBL; J01066; AAB59183.1; -.

DR EMBL; M17827; AAA28341.1; -.

DR EMBL; M17828; AAA28342.1; -.

DR EMBL; X60791; CAA43204.1; -.

DR EMBL; U20765; AAA88817.1; -.

DR PIR; A00343; DEFFA.

DR PIR; S17083; S17083.

DR PIR; S17085; S17085.

DR FLYBASE; FBgn0000055; Adh.

DR INTERPRO; IPR002198; -.

DR PFAM; PF00106; adh_short; 1.

DR PFAM; PF00663; adh_short_C; 1.

DR PRINTS; PR00080; SDRFAMILY.

DR PRINTS; PR01167; INSADHFAMILY.

DR PRINTS; PR01168; ALCDHDRGNASE.

DR PROSITE; PS00061; ADH_SHORT; 1.

DR PRODOM [Domain structure / List of seq. sharing at least 1 domain]

DR BLOCKS; P00334.

DR DOMO; P00334.

DR PROTOMAP; P00334.

DR PRESAGE; P00334.

DR DIP; P00334.

DR SWISS-2DPAGE; GET REGION ON 2D PAGE.

KW Oxidoreductase; NAD; Acetylation.

FT INIT_MET 0 0

FT MOD_RES 1 1 ACETYLATION.

FT NP_BIND 11 34 NAD (BY SIMILARITY).

FT ACT_SITE 152 152

FT VARIANT 192 192 K -> T (IN ADH-F ALLELE).

FT VARIANT 214 214 P -> S (IN ADH-FCH.D ALLELE).

FT MUTAGEN 14 14 G->V: COMPLETE LOSS OF ACTIVITY.

FT MUTAGEN 14 14 G->A: 31% DECREASE IN ACTIVITY.

FT MUTAGEN 129 129 G->C: COMPLETE LOSS OF ACTIVITY.

FT MUTAGEN 132 132 G->I: COMPLETE LOSS OF ACTIVITY.

FT MUTAGEN 135 135 C->A: NO DECREASE IN ACTIVITY.

FT MUTAGEN 152 152 Y->F,H,E,Q: COMPLETE LOSS OF ACTIVITY.

FT MUTAGEN 152 152 Y->C: RETAINS ONLY 0.25% ACTIVITY.

FT MUTAGEN 156 156 K->I: COMPLETE LOSS OF ACTIVITY.

FT MUTAGEN 156 156 K->R: RETAINS ONLY 2.2% ACTIVITY.

FT MUTAGEN 183 183 G->L: COMPLETE LOSS OF ACTIVITY.

FT MUTAGEN 218 218 C->A: NO DECREASE IN ACTIVITY.

SQ SEQUENCE 255 AA; 27630 MW; CA7DF929E0007296 CRC64;

SFTLTNKNVI FVAGLGGIGL DTSKELLKRD LKNLVILDRI ENPAAIAELK AINPKVTVTF

YPYDVTVPIA ETTKLLKTIF AQLKTVDVLI NGAGILDDHQ IERTIAVNYT GLVNTTTAIL

DFWDKRKGGP GGIICNIGSV TGFNAIYQVP VYSGTKAAVV NFTSSLAKLA PITGVTAYTV

NPGITRTTLV HKFNSWLDVE PQVAEKLLAH PTQPSLACAE NFVKAIELNQ NGAIWKLDLG

TLEAIQWTKH WDSGI

//
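The FT lines of the ADH_DROME entry refer to 1-based positions in the SQ block, so they can be checked directly against the sequence. The sketch below is an illustration rather than part of the slides: it joins the sequence lines, looks up a few annotated residues, and applies the `VARIANT 192 K -> T` feature to derive the ADH-F allele from the ADH-S sequence shown. The helper names `join_sequence`, `residue_at`, and `apply_variant` are hypothetical.

```python
# Sequence lines copied from the SQ block of the ADH_DROME entry.
SQ_LINES = [
    "SFTLTNKNVI FVAGLGGIGL DTSKELLKRD LKNLVILDRI ENPAAIAELK AINPKVTVTF",
    "YPYDVTVPIA ETTKLLKTIF AQLKTVDVLI NGAGILDDHQ IERTIAVNYT GLVNTTTAIL",
    "DFWDKRKGGP GGIICNIGSV TGFNAIYQVP VYSGTKAAVV NFTSSLAKLA PITGVTAYTV",
    "NPGITRTTLV HKFNSWLDVE PQVAEKLLAH PTQPSLACAE NFVKAIELNQ NGAIWKLDLG",
    "TLEAIQWTKH WDSGI",
]


def join_sequence(lines):
    """Remove the blocks-of-ten spacing and join all sequence lines."""
    return "".join("".join(line.split()) for line in lines)


def residue_at(seq, pos):
    """Return the residue at a 1-based position, as used in FT lines."""
    return seq[pos - 1]


def apply_variant(seq, pos, ref, alt):
    """Apply a single-residue VARIANT feature after checking the reference residue."""
    if residue_at(seq, pos) != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {residue_at(seq, pos)}")
    return seq[:pos - 1] + alt + seq[pos:]


adh_s = join_sequence(SQ_LINES)
adh_f = apply_variant(adh_s, 192, "K", "T")  # VARIANT 192 192 K -> T

print(len(adh_s))              # 255, as stated on the SQ line
print(residue_at(adh_s, 152))  # Y, the ACT_SITE residue
print(residue_at(adh_f, 192))  # T
```

Checking the reference residue before substituting is a cheap guard against off-by-one errors when switching between 0-based string indices and 1-based feature coordinates.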

21

Computer Science in Bioinformatics

• FASTA

• BLAST

22

FASTA

23

FASTA

Algorithm:

BASEBALLANDCRICKET

WATSNANDCRICK
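The slide gives only the two example strings, so the following is a hedged sketch of the step usually described as the first stage of FASTA: index every k-tuple (word) of one sequence, look up the words of the other, and count hits per diagonal. The word length `k=2` and the function names `ktuple_index` and `diagonal_hits` are assumptions made for illustration; later stages such as rescoring with a substitution matrix are not shown.

```python
from collections import Counter, defaultdict


def ktuple_index(seq, k):
    """Map every k-letter word of seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k=2):
    """Count shared k-tuples per diagonal, where diagonal = query_pos - subject_pos."""
    index = ktuple_index(query, k)
    hits = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            hits[i - j] += 1
    return hits


# The two strings from the slide, used exactly as written there.
hits = diagonal_hits("BASEBALLANDCRICKET", "WATSNANDCRICK", k=2)
print(hits.most_common(1))  # [(3, 7)]
```

All seven shared words (AN, ND, DC, CR, RI, IC, CK) fall on the same diagonal, which corresponds to the common substring ANDCRICK. A diagonal with many hits is a candidate region for a more careful alignment.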

24

FASTA

Algorithm: (continued)

FASTA uses the BLOSUM50 scoring matrix.

25

BLAST

26

BLAST

Algorithm: For protein A S T N C F:

• A matched with itself gives a score of 2

• S matched with itself gives a score of 2

• T matched with itself gives a score of 3

• N matched with itself gives a score of 2

The word ASTN can therefore score at most 9 (2 + 2 + 3 + 2), which is below the minimum required score of 17.

27

BLAST

Algorithm: (continued)

For protein A S T N C F:

• S matched with itself gives a score of 2

• T matched with itself gives a score of 3

• N matched with itself gives a score of 2

• C matched with itself gives a score of 12

The word STNC formed by these four amino acids has a total score of 19 (2 + 3 + 2 + 12), which exceeds 17.

A substitution of one amino acid for another scores 1. A single query word can therefore generate many additional words that still reach the required score.
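The word-scoring arithmetic on these two slides can be sketched as follows. This is a simplified model built only from the numbers given in the slides (self-match scores for A, S, T, N, and C, a flat score of 1 for any substitution, and a required score of 17), not BLAST's actual implementation. Treating 17 as an inclusive minimum is an assumption, and the function names are hypothetical.

```python
# Self-match scores quoted in the slides; F is not listed there, so it is omitted.
SELF_SCORES = {"A": 2, "S": 2, "T": 3, "N": 2, "C": 12}
SUBSTITUTION_SCORE = 1  # flat score for replacing one amino acid with another
THRESHOLD = 17          # assumed to be an inclusive minimum


def word_score(word):
    """Best possible score of a word: every residue matched with itself."""
    return sum(SELF_SCORES[residue] for residue in word)


def passes_threshold(word):
    """True if the word can be used as a query word under the threshold."""
    return word_score(word) >= THRESHOLD


def substitutable_positions(word):
    """Positions where a single substitution still reaches the threshold."""
    positions = []
    for pos, residue in enumerate(word):
        score = word_score(word) - SELF_SCORES[residue] + SUBSTITUTION_SCORE
        if score >= THRESHOLD:
            positions.append(pos)
    return positions


print(word_score("ASTN"), passes_threshold("ASTN"))  # 9 False
print(word_score("STNC"), passes_threshold("STNC"))  # 19 True
print(substitutable_positions("STNC"))               # [0, 1, 2]
```

In this model the high-scoring C must be conserved, while S, T, or N can each be substituted, which is why one query word expands into many similar words.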

28

BLAST

Algorithm: (continued)

BLAST uses the BLOSUM62 scoring matrix.

For example, consider the sequences:

A C D E E F G H

A C D E F G H

Every time an exact match occurs, BLAST attempts to extend the seed by looking for matches in both directions. However, the algorithm can miss alignments, because it does not take gaps into account.

A C D E can be located, but it will not be extended to F G H, because the second E in the first sequence does not match the F that follows in the second. Similarly, E F G H can be identified, but it will not be extended back to A C D.
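The seed-and-extend behavior described above can be reproduced with a small sketch. It uses exact-match seeds of an assumed word length `w=3` and extends them only along the diagonal, without gaps. Real BLAST scores extensions with a substitution matrix and stops on a score drop-off, which is omitted here, and the function names are hypothetical.

```python
def find_seeds(a, b, w):
    """Yield (i, j) wherever the w-letter word at a[i:] equals the word at b[j:]."""
    for i in range(len(a) - w + 1):
        for j in range(len(b) - w + 1):
            if a[i:i + w] == b[j:j + w]:
                yield i, j


def extend_ungapped(a, b, i, j, w):
    """Extend a seed left and right along its diagonal while residues match exactly."""
    start_a, start_b = i, j
    while start_a > 0 and start_b > 0 and a[start_a - 1] == b[start_b - 1]:
        start_a -= 1
        start_b -= 1
    end_a, end_b = i + w, j + w
    while end_a < len(a) and end_b < len(b) and a[end_a] == b[end_b]:
        end_a += 1
        end_b += 1
    return start_a, start_b, a[start_a:end_a]


def ungapped_segments(a, b, w=3):
    """Collect the distinct extended segments as (start_in_a, start_in_b, text)."""
    return sorted({extend_ungapped(a, b, i, j, w) for i, j in find_seeds(a, b, w)})


segments = ungapped_segments("ACDEEFGH", "ACDEFGH")
print(segments)  # [(0, 0, 'ACDE'), (4, 3, 'EFGH')]
```

The two segments start at different offsets (0 versus 0, and 4 versus 3), so they lie on different diagonals. Joining them would require a one-residue gap, which an ungapped extension cannot introduce.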

29

The Future of Bioinformatics

• Data collection and organization: new concepts for databases.

• Prediction of biological functions: computational tools are needed to analyze the collected data in the most efficient manner.

• Data labyrinth: computational tools that integrate the scattered information are needed.

30

The Future of Bioinformatics
