7/31/2019 IInd Sem Class1

Introduction to Bioinformatics

Bioinformatics is a modern discipline integrating different branches of science, i.e. Biology, Chemistry and Information Technology.

Informatics related to Biological and Medical sciences:

Bioinformatics

Structural Bioinformatics

Medical Informatics

Chemoinformatics

Pharmacy Informatics

Clinical Informatics


Bioinformatics has a strong interdisciplinary character. It can be considered a confluence of Biology, Computer Science, Information Technology, Mathematics, Chemistry, Physics and Medicine, with the objectives of developing tools to analyze biological, biochemical and biophysical data and of generating new knowledge in these areas. It is a fact that people trained and skilled in all these disciplines do not yet exist, and if this area is to develop in our country, such people will have to be trained.


In other words, bioinformatics is:

The combination of biology and information technology.

It is a branch of science that deals with the computer-based analysis of large biological data sets.

It incorporates the development of databases to store and search data, and of statistical tools and algorithms to analyze and determine relationships between biological data sets, such as macromolecular sequences, structures, expression profiles and biochemical pathways.



DNA RNA Protein synthesis


COMPUTERS IN BIOLOGY

Development of:

New scientific methods

Algorithms for managing large amounts of sequence and structural data

As the full genome sequences of many species and data from structural genomics, microarrays and proteomics became available, integrating these data on a common platform requires sophisticated bioinformatics tools (Sequence - Structure - Function).

Organizing these data into knowledge databases and developing appropriate software tools for analyzing them are going to be major challenges.

India, as a major player in the IT industry, has the potential to develop such resources at an affordable cost.


Structural Bioinformatics in Drug Discovery

[Figure: flowchart. Box labels: Target protein sequence; Homology modeling of target protein; Crystal structure of target protein; Large-scale docking; Virtual library of compounds or QSAR analysis; Lead identification & lead optimization; Confirm using crystallography, kinetic analysis; Compound development (drug).]

Fig: Schematic outline of the application of SB (homology modeling) and X-ray crystallography (structural molecular biology) in the drug discovery process.


Table: Some important structural bioinformatics databases/resources/tools

S.No. Database and its importance, followed by its URL

1. National Center for Biotechnology Information (NCBI): Provides a general search for nucleotide sequences, protein sequences, biomolecule 3D structures, genomes, taxonomy or literature.
   http://www.ncbi.nlm.nih.gov/Entrez/

2. Structural Genomics Target Database (sgtdb): 3-D models of all sequences under investigation by structural genomics centers.
   http://spam.sdsc.edu/

3. Structure Comparison Database (CE): Pair-wise structure comparisons based on the Combinatorial Extension (CE) algorithm for both a representative set and the complete set of protein structures; includes alignments.
   http://cl.sdsc.edu/ce.html

4. CKAAP DB: Database of structures with Conserved Key Amino Acid Positions.
   http://ckaaps.sdsc.edu/perl/browser.pl

5. Protein Data Bank (PDB): The single worldwide source of primary structural data on biological macromolecules determined experimentally.
   http://www.rcsb.org/pdb

6. Extended GO Annotation of PDB Chains: Use of structure comparison to extend the coverage of GO terms in the PDB.
   http://spdc.sdsc.edu/

7. PDBbind: Designed to provide a collection of experimentally measured binding affinity data (Kd, Ki and IC50) exclusively for the protein-ligand complexes available in the PDB.
   http://www.pdbbind.org/



Bioinformatics Information Resources And Networks


Outline

Bioinformatics Information Resources And Networks

EMBnet (European Molecular Biology Network): DBs and Tools

NCBI (National Center for Biotechnology Information): DBs and Tools

Nucleic Acid Sequence Databases

Protein Information Resources

Metabolic Databases

Mapping Databases

Databases concerning Mutations

Literature Databases


EMBnet - European Molecular Biology Network

Founded in 1988

A network that links European laboratories that use biocomputing and bioinformatics in molecular biology research

A science-based group of collaborating nodes throughout Europe, plus nodes outside Europe

Provides information, services and training to users

Works to increase the availability and accessibility of data resources and computing tools

Aims to increase knowledge and proficiency in bioinformatics through education and training

http://www.embnet.org/


EMBnet - Nodes

EMBnet (41 nodes) comprises three kinds of node:

National Nodes (18): governmental

Specialist Nodes (9): academic, industrial research centers

Associate Nodes (11): biocomputing centers from non-European countries


EMBnet - Nodes

National Nodes

Appointed by the governments

Provide on-line services, user support and training

Vienna Biocenter - Austria
BEN - Belgium
CSC - Finland
INFOBIOGEN - France
DKFZ - Germany
HEN - Hungary
INCBI - Ireland
INN - Israel
IEN-AdR - Italy
CMBI - Netherlands
Bio - Norway
IBB - Poland
PEN - Portugal
GeneBee - Russia
CNB-CSIC - Spain
BMC - Sweden
SIB - Switzerland
SEQNET - UK


EMBnet - Nodes

Specialist Nodes

Academic, industrial or research centers in specific areas of bioinformatics

Largely responsible for maintenance of biological databases and software

MIPS (Munich Information Center for Protein Sequences)
ICGEB
Pharmacia
F. Hoffmann-La Roche
EBI, Hinxton Hall (Cambridge, UK): an important key specialist node and home of the EMBL, SWISS-PROT and TrEMBL databases
HGMP-RC
Sanger
UCL


EMBnet - Nodes

Associate Nodes

Centers from non-European countries

IBBM - Argentina
ANGIS - Australia
CBI - China
CIGB - Cuba
CDFD - India
SANBI - South Africa
EMBnet - Brazil
CBR - Canada
EMBnet - Chile
EMBnet - Colombia
CIFN - Mexico


EMBnet's Mission

Assist in biotechnological and bioinformatics-related research

Provide training and education

Exploit network infrastructures

Investigate and develop new technologies

Bridge between commercial and academic sectors


Who are EMBnet's Users?

More than 40,000 registered users from all over the world, as well as a larger number of Internet users

All scientists working in the Life Sciences, from undergraduate students to top-level scientists, in academia as well as industry, can get support from EMBnet


EMBnet's SRS

Sequence Retrieval System (SRS)

The result of a research project with EMBnet to interrogate all the gathered resources together

SRS is a network browser for DBs in molecular biology

SRS allows any flat-file DB to be indexed to any other

Supports queries across a range of different DB types via a single interface

Independent of underlying data structures or query languages

http://srs.embl-heidelberg.de:8000/srs5/
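The idea of indexing one flat-file database against another can be illustrated with a small sketch. This is not how SRS itself is implemented; it only assumes a Swiss-Prot-like flat-file layout (two-letter line codes, `DR` cross-reference lines, `//` as entry terminator) and uses made-up entries, accession numbers and helper names.

```python
from collections import defaultdict

# Hypothetical two-entry flat file; all identifiers are invented for illustration.
FLAT_FILE = """\
ID   DEMO1_HUMAN
AC   DEMO001;
DR   EMBL; EMBLX01; -.
DR   PDB; 0XXX; -.
//
ID   DEMO2_MOUSE
AC   DEMO002;
DR   EMBL; EMBLX02; -.
//
"""

def parse_entries(text):
    """Yield one dict per entry: {'ID': str, 'AC': str, 'DR': [(db, key), ...]}."""
    entry = {"DR": []}
    for line in text.splitlines():
        if line.startswith("//"):              # entry terminator
            yield entry
            entry = {"DR": []}
            continue
        code, _, value = line.partition("   ")  # line code, then the data
        if code == "ID":
            entry["ID"] = value.split()[0]
        elif code == "AC":
            entry["AC"] = value.split(";")[0].strip()
        elif code == "DR":
            db, key = [part.strip() for part in value.split(";")[:2]]
            entry["DR"].append((db, key))

def build_cross_index(entries):
    """Map (other database, identifier) -> accession numbers that reference it."""
    index = defaultdict(list)
    for entry in entries:
        for db, key in entry["DR"]:
            index[(db, key)].append(entry["AC"])
    return index

index = build_cross_index(parse_entries(FLAT_FILE))
print(index[("EMBL", "EMBLX02")])
```

With such an index, a query phrased in terms of one database (here an EMBL identifier) can be answered with entries from another, which is the kind of linking a single query interface relies on.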


Sequence Retrieval System: Network Browser for Databanks in Molecular Biology

http://srs.embl-heidelberg.de:8000/srs5/

Data Bank        No. Entries  Indexing Date  Group     Availability
SWISSPROT        163235       10-Jun-2005    Sequence  ok
SWISSNEW         81134        22-Mar-2006    Sequence  ok
NRDB             2269647      29-Mar-2006    Sequence  ok
SWALL            3022528      22-Mar-2006    Sequence  ok
UNIPROT_SPROT    212425       22-Mar-2006    Sequence  ok
UNIPROT_TREMBL   2666963      23-Mar-2006    Sequence  ok
TREMBLNEW        624819       12-Dec-2005    Sequence  ok
TREMBL           2576118      04-Oct-2005    Sequence  ok


Data Bank        No. Entries  Indexing Date  Group     Availability
SPTREMBL         1449374      16-Jun-2005    Sequence  ok
SPTREMBLNEW      143140       17-Jun-2005    Sequence  ok
REMTREMBL        92182        20-Jun-2005    Sequence  ok
PIR              283416       16-Jun-2005    Sequence  ok
WORMPEP          19538        16-Jun-2005    Sequence  ok
DROSOPHILA       14100        16-Jun-2005    Sequence  ok
EMBLNEW          4035816      21-Nov-2005    Sequence  ok
EMBL             20343598     30-Dec-2005    Sequence  ok
EMBLEST          31990232     06-Jan-2006    Sequence  ok
EMBLWGS          11106060     24-Sep-2005    Sequence  ok
GENBANK          19233264     18-Nov-2005    Sequence  ok
GENBANKEST       31008556     23-Feb-2006    Sequence  ok
REFSEQP          8006         16-Jun-2005    Sequence  ok
SUBTILIST        1            16-Jun-2005    Sequence  ok


Data Bank        No. Entries  Indexing Date  Group            Availability
PROSITE          1935         22-Mar-2006    SeqRelated       ok
PROSITEDOC       1407         22-Mar-2006    SeqRelated       ok
BLOCKS           4034         16-Jun-2005    SeqRelated       ok
EPD              1375         16-Jun-2005    SeqRelated       ok
ENZYME           4173         16-Jun-2005    SeqRelated       ok
PRINTS           865          16-Jun-2005    SeqRelated       ok
TFSITE           4342         07-Apr-2003    TransFac         ok
TFFACTOR         1799         07-Apr-2003    TransFac         ok
TFCELL           816          07-Apr-2003    TransFac         ok
TFCLASS          27           07-Apr-2003    TransFac         ok
TFMATRIX         246          07-Apr-2003    TransFac         ok
TFGENE           1035         07-Apr-2003    TransFac         ok
PDB              34927        08-Feb-2006    Protein3DStruct  ok
DSSP             30832        22-Nov-2005    Protein3DStruct  ok
HSSP             30369        08-Feb-2006    Protein3DStruct  ok
PDBFINDER        35701        28-Mar-2006    Protein3DStruct  ok
NRL3D            6063         16-Jun-2005    Protein3DStruct  ok
FLYGENES         7556         16-Jun-2005    Genome           ok
FLYREFS          0            07-Apr-2003    Genome           ok
OMIM             17004        18-Oct-2005    Mutations        ok
REPTILIA         8364         18-Jan-2006    Others           ok


NCBI - National Center for Biotechnology Information

Leading American information provider

Established in 1988 as a division of the National Library of Medicine (NLM)

Located on the campus of the National Institutes of Health (NIH) in Bethesda, Maryland

Mission:

Development of new information technologies to aid our understanding of the molecular and genetic processes that underlie health and disease

Creation of systems for storing and analysing biological information

Development of advanced methods of computer-based information processing

Facilitation of user access to DBs and software

Co-ordination of efforts to gather biotechnology information worldwide


NCBI

Since 1992, maintains GenBank and collaborates with the international nucleotide DBs EMBL (Europe) and DDBJ (Japan)

Provides Entrez, which facilitates access to biological DBs (similar to SRS, which is provided by EMBnet)


NCBI - Responsibilities

administers research on biomedical problems at the molecular level using mathematical and computational methods

maintains collaborations with several NIH (National Institutes of Health) institutes, academia, industry, and other governmental agencies

promotes scientific communication by sponsoring meetings, workshops, and lecture series

supports training on basic and applied research in computational biology for postdoctoral fellows through the NIH Intramural Research Program

engages members of the international scientific community in informatics research and training through the Scientific Visitors Program

develops, distributes, supports, and coordinates access to a variety of databases and software for the scientific and medical communities

develops and promotes standards for databases, data deposition and exchange, and biological nomenclature


Nucleic Acid Sequence Databases

EMBL (Europe)
GenBank (USA)
DDBJ (Japan)
ENSEMBL (project between EMBL-EBI and the Sanger Institute), http://www.ensembl.org/
dbEST (division of GenBank)
GSDB (division of GenBank)

The principal nucleic acid sequence databases are GenBank, EMBL and DDBJ, which each collect a portion of the total sequence data reported worldwide and exchange new and updated entries on a daily basis.

Source: http://www3.ebi.ac.uk/Services/DBStats/


Nucleic Acid Sequence Databases - EMBL

At the time of writing, the EMBL Database contained 127,450,085,130 nucleotides in 69,666,551 entries.

Breakdown by entry type:

Entry Type                     Entries      Nucleotides
Standard                       56,843,150   61,498,109,356
Constructed (CON)              497,187      n/a
Third Party Annotation (TPA)   4,884        334,827,880
Whole Genome Shotgun (WGS)     12,318,618   64,837,183,592

The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. The main sources of DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications. The database is produced in an international collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.

Statistics: http://www3.ebi.ac.uk/Services/DBStats/
Submissions: http://www.ebi.ac.uk/embl/Submission/index.html
Collaboration: http://www.ebi.ac.uk/embl/Contact/collaboration.html


Nucleic Acid Sequence Databases - EMBL

[Figure: two plots. Panel titles: Total nucleotides (current 127,450,085,130); Number of entries (current 69,666,551).]

Ref: EMBL Nucleotide Sequence Database: developments in 2005, Nucleic Acids Research, 2006, Vol. 34, D10-D15


Nucleic Acid Sequence Databases - EMBL

[Figure: breakdown by nucleotide count. Labels: Homo sapiens; Mus musculus; Rattus norvegicus; Pan troglodytes; Bos taurus; Canis familiaris; Monodelphis domestica; Danio rerio; Macaca mulatta; Loxodonta africana; Other.]


Nucleic Acid Sequence Databases - GenBank

GenBank, which is produced at NCBI, is split into smaller, discrete divisions.

This facilitates fast, specific searches by restricting queries to particular database subsets.

During 1992-1997, the level of EST and STS data within GenBank grew 10-fold.

Even so, the overall sequence information contributed by such partial data was still less than that of the higher quality sequences in the other major divisions.


Specialised Genomic Resources

In addition to the comprehensive DNA sequence DBs, there is a variety of more specialised genomic resources.

These so-called boutique DBs bring focus to species-specific genomics and to particular sequencing techniques.

SGD - Saccharomyces Genome Database
UniGene - gene-oriented clusters from GenBank
TIGR - Databases of The Institute for Genomic Research
ACeDB - A C. elegans DataBase


Specialised Genomic Databases

SGD (Saccharomyces Genome Database): a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae.
http://genome-www.stanford.edu/Saccharomyces

AceDB (A C. elegans DataBase)
http://www.acedb.org (C. elegans)

FlyBase (A Database of Drosophila Genes & Genomes)
http://flybase.bio.indiana.edu (fruit fly)

MGD (Mouse Genome Database)
http://www.informatics.jax.org (mouse)


Levels of protein sequence and structural organisation:

The primary structure of a protein is its amino acid sequence.

The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands).

The tertiary structure of a protein arises from the packing of its secondary structure elements, which may form discrete domains within a fold.

[Figure labels: primary; secondary; tertiary.]



ACDEFGHIKLMNPQRSTVWY

primary structure

Principles of Protein Structure


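Levels of protein sequence and structural organisation:

[Figure: primary, secondary and tertiary levels, with the labels sequence, motif, domain and module. Annotations on the figure: the sequence AVILDRYFH, the pattern [AS]-[IL]2-X[DE]-R-[FYW]2-H, and the string @.*,#a,b,c. Database types shown: primary database, secondary database, structure database.]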


Primary Protein Databases

SWISS-PROT - Protein knowledgebase (http://www.expasy.org/sprot/)
TrEMBL - Computer-annotated supplement to Swiss-Prot (http://www.expasy.org/sprot/)
PIR - Protein Information Resource (http://pir.georgetown.edu/home.shtml)
MIPS - Munich Information Centre for Protein Sequences (http://mips.gsf.de/)
NRL-3D - produced by PIR (http://www-nbrf.georgetown.edu/pirwww/search/textnrl3d.html)

The primary structure of a protein is its amino acid sequence. These sequences are stored in primary databases as linear alphabets that denote the constituent residues.


Protein Sequence Databases

Swiss-Prot contains 197,228 sequence entries, comprising 71,501,181 amino acids abstracted from 135,257 references

Total number of species represented in Swiss-Prot: 9,520

The average sequence length in Swiss-Prot is 362 amino acids.

Swiss-Prot is the most highly annotated protein sequence DB

Table of the most represented species

No.  Frequency  Species
1    13049      Homo sapiens (Human)
2    10132      Mus musculus (Mouse)
3    5189       Saccharomyces cerevisiae (Baker's yeast)
4    4847       Escherichia coli
5    4669       Rattus norvegicus (Rat)
6    3665       Arabidopsis thaliana (Mouse-ear cress)
7    2863       Schizosaccharomyces pombe (Fission yeast)
8    2814       Bacillus subtilis
9    2750       Caenorhabditis elegans
10   2286       Drosophila melanogaster (Fruit fly)
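The Swiss-Prot figures quoted on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given above (entries, amino acids and the human entry count); the variable names are illustrative.

```python
# Figures quoted on the slide.
total_entries = 197_228
total_residues = 71_501_181
human_entries = 13_049

# Average sequence length: total amino acids divided by number of entries.
average_length = total_residues / total_entries

# Share of all entries that come from Homo sapiens, in percent.
human_share = 100 * human_entries / total_entries

print(f"average length: {average_length:.1f} residues")
print(f"human entries:  {human_share:.1f} % of Swiss-Prot")
```

The computed average is consistent with the stated value of 362 amino acids once it is truncated to a whole number.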



Composite Protein Sequence Databases

Composite databases amalgamate a variety of different primary databases

They render sequence searching much more efficient, because they obviate the need to interrogate multiple resources

Different composite databases use different primary sources and different redundancy criteria in their amalgamation procedures
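A minimal sketch of the amalgamation step, assuming the simplest possible redundancy criterion: two entries are redundant only if their sequences are identical. Real composite databases apply their own, differing criteria; the database names, entry identifiers and function name here are hypothetical.

```python
def build_composite(sources):
    """Merge primary databases, keeping the first copy of each distinct sequence.

    sources: list of (database_name, {entry_id: sequence}) pairs,
             listed in order of priority.
    Returns a list of (database_name, entry_id, sequence) tuples.
    """
    seen = set()
    composite = []
    for db_name, records in sources:
        for entry_id, sequence in records.items():
            key = sequence.upper()      # compare sequences case-insensitively
            if key in seen:             # identical sequence already included
                continue
            seen.add(key)
            composite.append((db_name, entry_id, key))
    return composite

# Made-up example data: entry b1 duplicates entry a1.
sources = [
    ("DB_A", {"a1": "ACDEFGHIK", "a2": "MNPQRSTVWY"}),
    ("DB_B", {"b1": "acdefghik", "b2": "AVILDRYFH"}),
]
composite = build_composite(sources)
for db_name, entry_id, sequence in composite:
    print(db_name, entry_id, sequence)
```

Because the source order decides which copy survives, choosing a different priority order or a looser criterion (for example, near-identical sequences) would give a different composite, which is one reason the composite databases differ from each other.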



Composite Protein Sequence Databases

Composite databases and their primary sources:

NRDB (Non-Redundant DataBase), http://www.nrdb.co.uk/
Sources: PDB, SWISS-PROT, PIR, GenPept, SWISS-PROT update, GenPept update

OWL, http://www.hgmp.mrc.ac.uk/Bioinformatics/Databases/owl-help.html
Sources: SWISS-PROT, PIR, GenBank, NRL-3D

MIPSX, http://mips.gsf.de/
Sources: PIR1-4, MIPSOwn, MIPSTrn, MIPSH, PIRMOD, NRL-3D, SWISS-PROT, EMTrans, GBTrans, Kabat, PseqIP

SP+TrEMBL (SwissProt + TrEMBL), http://www.hgmp.mrc.ac.uk/Bioinformatics/Databases/trembl-help.html
Sources: SWISS-PROT, TrEMBL


Secondary databases

Secondary databases contain pattern data, i.e., diagnostic signatures for protein families. These signatures encode the most highly conserved features of multiply aligned sequences, which are often crucial to the structure or function of the protein.

The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands), which, in sequence alignments, are often apparent as well-conserved motifs.

Patterns are stored as regular expressions, fingerprints, blocks, profiles, etc.
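To make the "regular expression" form of a signature concrete, here is a small sketch that converts a pattern written in PROSITE-style syntax into a Python regular expression and scans a sequence with it. The pattern is an illustrative one written for this example, not an actual PROSITE entry, and the converter handles only the basic elements (residue classes `[..]`, exclusions `{..}`, the wildcard `x`, and repeat counts `(n)` or `(n,m)`).

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        repeat = ""
        counted = re.fullmatch(r"(.*?)\((\d+(?:,\d+)?)\)", element)
        if counted:                                   # e.g. [IL](2) or x(2,4)
            element = counted.group(1)
            repeat = "{" + counted.group(2) + "}"
        if element in ("x", "X"):                     # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"         # any residue except these
        else:
            core = element                            # single residue or [..] class
        parts.append(core + repeat)
    return "".join(parts)

# Illustrative pattern (an assumption, not taken from a database release).
pattern = "[AS]-x-[IL](2)-[DE]-R-[FYW](2)-H"
regex = prosite_to_regex(pattern)
print(regex)

match = re.search(regex, "MKAVILDRYFHGG")
print(match.group(0) if match else "no match")
```

A single substitution outside the allowed residue classes (for example replacing the second `I`/`L` position with `V`) is enough to stop the pattern from matching, which is why such signatures are strict but can miss divergent family members.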


Secondary databases

Secondary DB   Primary source    Stored information
PROSITE        SWISS-PROT        Regular expressions (patterns)
Profiles       SWISS-PROT        Weighted matrices (profiles)
PRINTS         OWL               Aligned motifs (fingerprints)
BLOCKS         PROSITE/PRINTS    Aligned motifs (blocks)
IDENTIFY       BLOCKS/PRINTS     Fuzzy regular expressions (patterns)

http://www.expasy.org/prosite/


Secondary databases

TRANSFAC    http://transfac.gbf.de
EPD         http://www.epd.isb-sib.ch
InterPro    http://www.ebi.ac.uk/interpro/
PROSITE     http://www.expasy.ch/prosite
BLOCKS      http://blocks.fhcrc.org
PRINTS      ftp://ftp.seqnet.dl.ac.uk/pub/database/prints
PFAM        http://www.sanger.ac.uk/Software/Pfam/index.shtml
ProDom      http://www.toulouse.inra.fr/prodom.html
GeneCards   http://bioinformatics.weizmann.ac.il/cards
ENSEMBL     http://www.ensembl.org
EcoCyc      http://ecocyc.panbio.com/ecocyc/ecocyc.html


Secondary databases

There is some overlap in content between the secondary databases

PDBsum alone has 35,291 entries

Pattern DB growth is slow because the addition of detailed family annotation is very time-consuming.

PROSITE and PRINTS are the only comprehensively, manually annotated secondary DBs

To address the annotation bottleneck, the secondary database curators have together created a unified database of protein families known as InterPro


Structure Classification DBs

Contain 3D structures available from crystallographic and spectroscopic studies

PDBsum - Protein Data Bank
CATH - Class, Architecture, Topology, Homology
SCOP - Structural Classification of Proteins


Structure Classification DBs

PDB    http://www.rcsb.org
SCOP   http://scop.mrc-lmb.cam.ac.uk/scop
CATH   http://www.biochem.ucl.ac.uk/bsm/cath
DSSP   http://www.sander.ebi.ac.uk/dssp
FSSP   http://www.ebi.ac.uk/dali/fssp
HSSP   http://www.sander.ebi.ac.uk/hssp


Metabolic Databases

KEGG (Kyoto Encyclopedia of Genes and Genomes)
http://www.genome.ad.jp/kegg

ENZYME (Enzyme nomenclature database)
http://www.expasy.ch/enzyme

BRENDA (Enzyme Information System)
http://www.brenda.uni-koeln.de

EMP (Enzymes and Metabolic Pathways database)
http://www.empproject.com

A number of metabolic databases are available electronically, some with features for querying and visualizing metabolic pathways and regulatory networks.


Mapping Databases

OMIM (Online Mendelian Inheritance in Man)
http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim

GDB (The GDB Human Genome Database)
http://www.gdb.org

RHDB
http://corba.ebi.ac.uk/RHdb


Databases concerning Mutations

dbSNP
http://www.ncbi.nlm.nih.gov/SNP

HGBASE
http://hgbase.cgr.ki.se

The SNP Consortium (TSC)
http://snp.cshl.org

HAEMA
http://europium.csc.mrc.ac.uk/usr/WWW/WebPages/database.dir/quiz.dir/intrquiz.htm


Literature Databases

PubMed
http://www.ncbi.nlm.nih.gov/entrez/query

Bioinformatics Online
http://www.bioinformatics.oupjournals.org

Nature
http://www.nature.com

Science
http://www.sciencemag.org


Database tools for displaying and annotating genomic sequence data

Viewer format    URL
Artemis          www.sanger.ac.uk/Software/Artemis
ACeDB            www.acedb.org/Tutorial/brief-tutorial/shtml
Apollo           www.ensembl.org/apollo
EnsEMBL          www.ensembl.org
NCBI mapviewer   www.ncbi.nlm.nih.gov
GoldenPath       genome.ucsc.edu



Common formats

There are several conventions for representing nucleic acid and protein sequences, of which the following are widely used:

NBRF/PIR
FASTA
GDE

These formats have limited facilities for comments, which must include a unique identifier code and sequence accession number.
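Of the three formats, FASTA is the simplest to read programmatically: a header line starting with `>` (identifier first, optional comment after it) followed by one or more lines of sequence. The sketch below is a minimal reader under that assumption; the record names and sequences are made up for illustration.

```python
def parse_fasta(text):
    """Return {identifier: sequence} for FASTA-formatted text."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                        # skip blank lines
        if line.startswith(">"):
            current = line[1:].split()[0]   # identifier = first word of the header
            records[current] = []
        elif current is not None:
            records[current].append(line)   # sequence may span several lines
    return {name: "".join(chunks) for name, chunks in records.items()}

# Hypothetical two-record example.
example = """\
>demo1 hypothetical protein, for illustration only
ACDEFGHIKL
MNPQRSTVWY
>demo2
AVILDRYFH
"""

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq), seq)
```

Everything after the identifier on the header line is free text, which is the "limited facility for comments" mentioned above.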

Formats for multiple sequence alignment

There are separate formats for multiple sequence alignment representation, of which the following are popular:

MSF

PHYLIP

ALN



Files of structural data

Structural data are maintained as flat files using the PDB format

Such files contain orthogonal atomic co-ordinates together with annotations, comments and experimental details

http://www.pdb.org
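Because PDB files are fixed-column flat files, the atomic co-ordinates can be read by slicing each `ATOM` record at known column positions. The sketch below assumes the standard column layout of the PDB format (atom name in columns 13-16, residue name in 18-20, chain in 22, residue number in 23-26, and x, y, z in 31-38, 39-46 and 47-54); the sample record is assembled field by field with invented values so that the column widths are explicit.

```python
def parse_atom_record(line):
    """Extract identity and co-ordinates from one ATOM/HETATM line."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not a co-ordinate record")
    return {
        "atom_name": line[12:16].strip(),     # columns 13-16
        "residue_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21],                 # column 22
        "residue_number": int(line[22:26]),   # columns 23-26
        "x": float(line[30:38]),              # columns 31-38
        "y": float(line[38:46]),              # columns 39-46
        "z": float(line[46:54]),              # columns 47-54
    }

# Illustrative record built from its fixed-width fields (values are made up).
SAMPLE = "".join([
    "ATOM  ",    # columns 1-6: record name
    "    1",     # columns 7-11: atom serial number
    " ",         # column 12: unused
    " N  ",      # columns 13-16: atom name
    " ",         # column 17: alternate location
    "MET",       # columns 18-20: residue name
    " ",         # column 21: unused
    "A",         # column 22: chain identifier
    "   1",      # columns 23-26: residue sequence number
    "    ",      # columns 27-30: insertion code and padding
    "  27.340",  # columns 31-38: x
    "  24.430",  # columns 39-46: y
    "   2.614",  # columns 47-54: z
])

atom = parse_atom_record(SAMPLE)
print(atom)
```

Splitting on whitespace instead of slicing by column is tempting but unsafe, since adjacent fields in a PDB record are allowed to touch.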


Submission of sequences

Sequences may be submitted to any of the three primary databases using the tools provided by the database curators

Such tools include WebIn and BankIt, which can be used over the Internet, and Sequin, a stand-alone application

WebIn: http://www.ebi.ac.uk/embl/Submission/webin.html

BankIt: http://www.ncbi.nlm.nih.gov/BankIt/


Database interrogation

All the databases discussed above can be searched by sequence similarity

However, detailed text-based searches of the annotations are also possible using tools such as Entrez

The simplest way to cross-reference between the primary nucleotide sequence databases and SWISS-PROT is to search by accession number, as this provides an unambiguous identifier of genes and their products
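A text-based Entrez search can also be issued programmatically through NCBI's E-utilities web service. The sketch below only builds the request URL for the `esearch` endpoint and does not contact the server; the helper name and the example query are illustrative assumptions, and real use should follow NCBI's current usage guidelines.

```python
from urllib.parse import urlencode

# Base address of the E-utilities esearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(db, term, retmax=20):
    """Return an esearch URL for a text query against one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return ESEARCH + "?" + query

# Example: search the protein database by annotation text and organism.
url = build_esearch_url("protein", "lysozyme AND human[Organism]", retmax=5)
print(url)
```

The same helper can be used for an accession-number lookup by passing the accession as the search term, which matches the cross-referencing approach described above.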
