7/31/2019 IInd Sem Class1
Introduction to Bioinformatics
Bioinformatics is a modern discipline integrating different branches of science, i.e. Biology, Chemistry and Information Technology.
Informatics related to Biological and Medical sciences:
Bioinformatics
Structural Bioinformatics
Medical Informatics
Chemoinformatics
Pharmacy Informatics
Clinical Informatics
-
Bioinformatics has a strong interdisciplinary character. It can be considered a confluence of Biology, Computer Science, Information Technology, Mathematics, Chemistry, Physics and Medicine, with the objectives of developing tools to analyze biological, biochemical and biophysical data and of generating new knowledge in these areas. It is a fact that people trained and skilled in all of these fields do not yet exist; if this area is to develop in our country, such people will have to be trained.
-
In other words, Bioinformatics is:
The combination of biology and information technology.
A branch of science that deals with the computer-based analysis of large biological data sets.
It incorporates the development of databases to store and search data, and of statistical tools and algorithms to analyze and determine relationships between biological data sets, such as macromolecular sequences, structures, expression profiles and biochemical pathways.
-
DNA → RNA → Protein synthesis
-
Development of:
New scientific methods
Algorithms for managing large amounts of sequence and structural data
As the full genome sequences of many species and data from structural genomics, microarrays and proteomics have become available, integrating these data on a common platform requires sophisticated bioinformatics tools (Sequence-Structure-Function).
Organizing these data into knowledge databases and developing appropriate software tools to analyze them are going to be major challenges.
India, as a major player in the IT industry, has the potential to develop such resources at an affordable cost.
COMPUTERS IN BIOLOGY
-
Structural Bioinformatics in Drug Discovery
Fig: Schematic outline of the application of SB (homology modeling) and X-ray crystallography (structural molecular biology) in the drug discovery process. Flowchart boxes: target protein sequence; homology modeling of target protein; crystal structure of target protein; large-scale docking; virtual library of compounds or QSAR analysis; lead identification and lead optimization; confirmation using crystallography and kinetic analysis; compound development (drug).
-
Table: Some important structural bioinformatics databases/resources/tools (S.No., database and its importance, URL)
1. National Center for Biotechnology Information (NCBI): Provides a general search for nucleotide sequences, protein sequences, biomolecule 3D structures, genomes, taxonomy or literature.
http://www.ncbi.nlm.nih.gov/Entrez/
2. Structural Genomics Target Database (sgtdb): 3-D models of all sequences under investigation by structural genomics centers.
http://spam.sdsc.edu/
3. Structure Comparison Database (CE): Pair-wise structure comparisons based on the Combinatorial Extension (CE) algorithm for both a representative set and the complete set of protein structures; includes alignments.
http://cl.sdsc.edu/ce.html
4. CKAAP DB: Database of structures with Conserved Key Amino Acid Positions.
http://ckaaps.sdsc.edu/perl/browser.pl
5. Protein Data Bank (PDB): The single worldwide source of primary structural data on biological macromolecules determined experimentally.
http://www.rcsb.org/pdb
6. Extended GO Annotation of PDB Chains: Use of structure comparison to extend the coverage of GO terms in the PDB.
http://spdc.sdsc.edu/
7. PDBbind: Designed to provide a collection of experimentally measured binding affinity data (Kd, Ki and IC50) exclusively for the protein-ligand complexes available in the PDB.
http://www.pdbbind.org/
-
Bioinformatics Information Resources And Networks
-
Outline
Bioinformatics Information Resources And Networks
EMBnet - European Molecular Biology Network: DBs and Tools
NCBI - National Center For Biotechnology Information: DBs and Tools
Nucleic Acid Sequence Databases
Protein Information Resources
Metabolic Databases
Mapping Databases
Databases concerning Mutations
Literature Databases
-
EMBnet - European Molecular Biology Network
Founded in 1988
Network that links European laboratories that use biocomputing and bioinformatics in molecular biology research
A science-based group of collaborating nodes throughout Europe, plus nodes outside Europe
Provides information, services and training to users
Works to increase the availability and accessibility of data resources and computing tools
Aims to increase knowledge and proficiency in bioinformatics through education and training
http://www.embnet.org/
-
EMBnet - Nodes
EMBnet (41 nodes)
National Nodes (18): governmental
Specialist Nodes (9): academic, industrial and research centers
Associate Nodes (11): biocomputing centers from non-European countries
-
EMBnet - Nodes
National Nodes
Appointed by the governments
Provide on-line services, user support and training
Vienna Biocenter - Austria BEN - Belgium
CSC - Finland INFOBIOGEN - France
DKFZ - Germany HEN - Hungary
INCBI - Ireland INN - Israel
IEN-AdR - Italy CMBI - Netherlands
Bio - Norway IBB - Poland
PEN - Portugal GeneBee - Russia
CNB-CSIC - Spain BMC - Sweden
SIB - Switzerland SEQNET - UK
-
EMBnet - Nodes
Specialist Nodes
Academic, industrial or research centers in specific areas of bioinformatics
Largely responsible for maintenance of biological databases and software
MIPS (Munich Information Center for Protein Sequences)
ICGEB
Pharmacia
F. Hoffmann-La Roche
EBI (Hinxton Hall, Cambridge, UK): important key specialist node and home of the EMBL, SWISS-PROT and TrEMBL databases
HGMP-RC
Sanger
UCL
-
EMBnet - Nodes
Associate Nodes
Centers from non-European countries
IBBM - Argentina ANGIS - Australia
CBI - China CIGB - Cuba
CDFD - India SANBI - South Africa
EMBnet - Brazil CBR - Canada
EMBnet - Chile EMBnet - Colombia
CIFN - Mexico
-
EMBnet's Mission
Assist in biotechnological and bioinformatics-related research
Provide training and education
Exploit network infrastructures
Investigate and develop new technologies
Bridge between commercial and academic sectors
-
Who are EMBnet's Users?
More than 40,000 registered users from all over the world, as well as a larger number of Internet users
All scientists working in the Life Sciences, from undergraduate students to top-level scientists, in academia as well as industry, can get support from EMBnet
-
EMBnet's SRS
Sequence Retrieval System - SRS
Result of a research project within EMBnet to interrogate all the gathered resources together
SRS is a network browser for DBs in molecular biology
SRS allows any flat-file DB to be indexed to any other
Supports queries across a range of different DB types via a single interface
Independent of underlying data structures or query languages
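The idea of indexing one flat-file DB against another can be illustrated with a minimal Python sketch. This is not SRS itself: the sample databank, the entry names and the helpers `parse_flat_file` and `build_link_index` are assumptions made up for illustration, and the `ID`/`DR`/`//` line codes are only loosely modelled on EMBL-style flat files.

```python
from collections import defaultdict

# A tiny, made-up flat-file databank. "ID" starts an entry, "DR" holds a
# cross-reference to an entry in another databank, "//" ends the entry.
PROTEIN_DB = """\
ID   PROT_A
DR   NUCDB; NUC_1
//
ID   PROT_B
DR   NUCDB; NUC_2
DR   NUCDB; NUC_3
//
"""

def parse_flat_file(text):
    """Return {entry_id: [(target_db, target_id), ...]} for one databank."""
    entries = {}
    current = None
    for line in text.splitlines():
        code, _, value = line.partition("   ")
        if code == "ID":
            current = value.strip()
            entries[current] = []
        elif code == "DR" and current is not None:
            target_db, target_id = [part.strip() for part in value.split(";")]
            entries[current].append((target_db, target_id))
        elif code == "//":
            current = None
    return entries

def build_link_index(entries):
    """Invert the cross-references: (target_db, target_id) -> set of source IDs."""
    index = defaultdict(set)
    for entry_id, links in entries.items():
        for link in links:
            index[link].add(entry_id)
    return index

proteins = parse_flat_file(PROTEIN_DB)
links = build_link_index(proteins)
print(sorted(links[("NUCDB", "NUC_2")]))
```

Once the inverted index exists, a query that starts from a nucleotide entry can reach the linked protein entries without knowing how the protein databank is laid out, which is the sense in which one interface can serve several DB types.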
http://srs.embl-heidelberg.de:8000/srs5/
-
http://srs.embl-heidelberg.de:8000/srs5/
Sequence Retrieval System: Network Browser for Databanks in Molecular Biology
Data Bank, No. Entries, Indexing Date, Group, Availability
SWISSPROT 163235 10-Jun-2005 Sequence ok
SWISSNEW 81134 22-Mar-2006 Sequence ok
NRDB 2269647 29-Mar-2006 Sequence ok
SWALL 3022528 22-Mar-2006 Sequence ok
UNIPROT_SPROT 212425 22-Mar-2006 Sequence ok
UNIPROT_TREMBL 2666963 23-Mar-2006 Sequence ok
TREMBLNEW 624819 12-Dec-2005 Sequence ok
TREMBL 2576118 04-Oct-2005 Sequence ok
SPTREMBL 1449374 16-Jun-2005 Sequence ok
SPTREMBLNEW 143140 17-Jun-2005 Sequence ok
REMTREMBL 92182 20-Jun-2005 Sequence ok
PIR 283416 16-Jun-2005 Sequence ok
WORMPEP 19538 16-Jun-2005 Sequence ok
DROSOPHILA 14100 16-Jun-2005 Sequence ok
EMBLNEW 4035816 21-Nov-2005 Sequence ok
EMBL 20343598 30-Dec-2005 Sequence ok
EMBLEST 31990232 06-Jan-2006 Sequence ok
EMBLWGS 11106060 24-Sep-2005 Sequence ok
GENBANK 19233264 18-Nov-2005 Sequence ok
GENBANKEST 31008556 23-Feb-2006 Sequence ok
REFSEQP 8006 16-Jun-2005 Sequence ok
SUBTILIST 1 16-Jun-2005 Sequence ok
PROSITE 1935 22-Mar-2006 SeqRelated ok
PROSITEDOC 1407 22-Mar-2006 SeqRelated ok
BLOCKS 4034 16-Jun-2005 SeqRelated ok
EPD 1375 16-Jun-2005 SeqRelated ok
ENZYME 4173 16-Jun-2005 SeqRelated ok
PRINTS 865 16-Jun-2005 SeqRelated ok
TFSITE 4342 07-Apr-2003 TransFac ok
TFFACTOR 1799 07-Apr-2003 TransFac ok
TFCELL 816 07-Apr-2003 TransFac ok
TFCLASS 27 07-Apr-2003 TransFac ok
TFMATRIX 246 07-Apr-2003 TransFac ok
TFGENE 1035 07-Apr-2003 TransFac ok
PDB 34927 08-Feb-2006 Protein3DStruct ok
DSSP 30832 22-Nov-2005 Protein3DStruct ok
HSSP 30369 08-Feb-2006 Protein3DStruct ok
PDBFINDER 35701 28-Mar-2006 Protein3DStruct ok
NRL3D 6063 16-Jun-2005 Protein3DStruct ok
FLYGENES 7556 16-Jun-2005 Genome ok
FLYREFS 0 07-Apr-2003 Genome ok
OMIM 17004 18-Oct-2005 Mutations ok
REPTILIA 8364 18-Jan-2006 Others ok
NCBI - National Center For Biotechnology Information
Leading American information provider
Established in 1988 as a division of the National Library of Medicine (NLM)
Located on the campus of the National Institutes of Health (NIH), Bethesda, Maryland
Mission:
Development of new information technologies to aid our understanding of the molecular and genetic processes that underlie health and disease
Creation of systems for storing and analysing biological information
Development of advanced methods of computer-based information processing
Facilitation of user access to DBs and software
Co-ordination of efforts to gather biotechnology information worldwide
-
NCBI
Since 1992, NCBI has maintained GenBank and collaborated with the international nucleotide DBs EMBL and DDBJ (Japan)
Provides Entrez, which facilitates access to biological DBs (similar to SRS, which is provided by EMBnet)
-
NCBI - Responsibilities
administers research on biomedical problems at the molecular level using mathematical and computational methods
maintains collaborations with several NIH (National Institutes of Health) institutes, academia, industry, and other governmental agencies
promotes scientific communication by sponsoring meetings, workshops, and lecture series
supports training on basic and applied research in computational biology for postdoctoral fellows through the NIH Intramural Research Program
engages members of the international scientific community in informatics research and training through the Scientific Visitors Program
develops, distributes, supports, and coordinates access to a variety of databases and software for the scientific and medical communities
develops and promotes standards for databases, data deposition and exchange, and biological nomenclature
-
Nucleic Acid Sequence Databases
EMBL (Europe)
GenBank (USA)
DDBJ (Japan)
ENSEMBL (project between EMBL-EBI and the Sanger Institute) http://www.ensembl.org/
dbEST (division of GenBank)
GSDB (division of GenBank)
The principal nucleic acid sequence databases are GenBank, EMBL and DDBJ, which each collect a portion of the total sequence data reported worldwide and exchange new and updated entries on a daily basis.
Source: http://www3.ebi.ac.uk/Services/DBStats/
-
Nucleic Acid Sequence Databases - EMBL
This morning the EMBL Database contained 127,450,085,130 nucleotides in 69,666,551 entries.
Breakdown by entry type:
Entry Type                     Entries      Nucleotides
Standard                       56,843,150   61,498,109,356
Constructed (CON)              497,187      n/a
Third Party Annotation (TPA)   4,884        334,827,880
Whole Genome Shotgun (WGS)     12,318,618   64,837,183,592
The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes Europe's primary nucleotide sequence resource. The main sources of DNA and RNA sequences are direct submissions from individual researchers, genome sequencing projects and patent applications. The database is produced in an international collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ). Each of the three groups collects a portion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.
-
Nucleic Acid Sequence Databases - EMBL
Fig: Total nucleotides (current 127,450,085,130) and number of entries (current 69,666,551).
Ref: EMBL Nucleotide Sequence Database: developments in 2005, Nucleic Acids Research, 2006, Vol. 34, D10-D15
-
Nucleic Acid Sequence Databases - EMBL
Fig: Database content by nucleotide count, per species: Homo sapiens, Mus musculus, Rattus norvegicus, Pan troglodytes, Bos taurus, Canis familiaris, Monodelphis domestica, Danio rerio, Macaca mulatta, Loxodonta africana, Other.
-
Nucleic Acid Sequence Databases - GenBank
GenBank, which is produced at NCBI, is split into smaller, discrete divisions.
This facilitates fast, specific searches by restricting queries to particular database subsets.
During 1992-1997, the level of EST and STS data within GenBank grew 10-fold.
The overall sequence information contributed by such partial data was still less than that of the higher-quality sequences in the other major divisions.
-
Specialised Genomic Resources
In addition to the comprehensive DNA sequence DBs, there is a variety of more specialised genomic resources.
These so-called boutique DBs bring focus to species-specific genomics and to particular sequencing techniques.
SGD - Saccharomyces Genome Database
UniGene - gene-oriented clusters from GenBank
TIGR - Databases of The Institute for Genomic Research
ACeDB - A C. elegans DataBase
-
Specialised Genomic Databases
SGD (Saccharomyces Genome Database): a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae. http://genome-www.stanford.edu/Saccharomyces
AceDB (A C. elegans DataBase) http://www.acedb.org (C. elegans)
FlyBase (A Database of Drosophila Genes & Genomes) http://flybase.bio.indiana.edu (fruit fly)
MGD (Mouse Genome Database) http://www.informatics.jax.org (mouse)
-
Protein Information Resources
Levels of protein sequence and structural organisation:
The primary structure of a protein is its amino acid sequence.
The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands).
The tertiary structure of a protein arises from the packing of its secondary structure elements, which may form discrete domains within a fold.
-
Principles of Protein Structure
Primary structure: ACDEFGHIKLMNPQRSTVWY
-
Protein Information Resources
Fig: Levels of protein sequence and structural organisation (primary, secondary, tertiary; sequence, motif, module, domain) and the corresponding database types (primary database, secondary database, structure database). Example labels in the figure: sequence AVILDRYFH; pattern [AS]-[IL]2-X[DE]-R-[FYW]2-H.
-
Primary Protein Databases
Protein sequence databases:
SWISS-PROT - Protein knowledgebase http://www.expasy.org/sprot/
TrEMBL - Computer-annotated supplement to Swiss-Prot http://www.expasy.org/sprot/
PIR - Protein Information Resource http://pir.georgetown.edu/home.shtml
MIPS - Munich Information Centre for Protein Sequences http://mips.gsf.de/
NRL-3D - produced by PIR http://www-nbrf.georgetown.edu/pirwww/search/textnrl3d.html
The primary structure of a protein is its amino acid sequence; sequences are stored in primary databases as linear alphabets that denote the constituent residues.
-
Protein Sequence Databases
Swiss-Prot contains 197,228 sequence entries, comprising 71,501,181 amino acids abstracted from 135,257 references
Total number of species represented in Swiss-Prot: 9,520
The average sequence length in Swiss-Prot is 362 amino acids.
Swiss-Prot is the most highly annotated protein sequence DB
Table of the most represented species
No. Frequ. Species
1 13049 Homo sapiens (Human)
2 10132 Mus musculus (Mouse)
3 5189 Saccharomyces cerevisiae (Baker's yeast)
4 4847 Escherichia coli
5 4669 Rattus norvegicus (Rat)
6 3665 Arabidopsis thaliana (Mouse-ear cress)
7 2863 Schizosaccharomyces pombe (Fission yeast)
8 2814 Bacillus subtilis
9 2750 Caenorhabditis elegans
10 2286 Drosophila melanogaster (Fruit fly)
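The quoted average length follows directly from the entry and residue totals. A short Python check, using only the three figures given on the slide (the variable names are illustrative), might look like this:

```python
# Figures quoted on the slide for this Swiss-Prot release.
entries = 197_228
amino_acids = 71_501_181
references = 135_257

# Mean number of residues per entry; the slide's value of 362 is this
# quotient with the fractional part dropped.
average_length = amino_acids / entries

# Rough ratio of sequence entries to literature references.
entries_per_reference = entries / references

print(f"average sequence length: {average_length:.1f} residues")
print(f"entries per reference: {entries_per_reference:.2f}")
```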
-
Composite Protein Sequence Databases
Composite databases amalgamate a variety of different primary databases
They render sequence searching much more efficient, because they obviate the need to interrogate multiple resources
Different composite databases use different primary sources and different redundancy criteria in their amalgamation procedures
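A minimal Python sketch of such an amalgamation is shown below. It assumes the simplest possible redundancy criterion, exact sequence identity, and a fixed source priority; real composite databases apply their own, more elaborate rules. The database names, entry IDs, sequences and the function `merge_databases` are all hypothetical.

```python
def merge_databases(sources, priority):
    """Amalgamate primary databases, keeping one entry per identical sequence.

    sources:  {db_name: {entry_id: sequence}}
    priority: list of db names; earlier names win when sequences collide.
    """
    composite = {}  # sequence -> (db_name, entry_id)
    for db_name in priority:
        for entry_id, sequence in sources[db_name].items():
            key = sequence.upper()
            if key not in composite:  # redundancy criterion: exact identity
                composite[key] = (db_name, entry_id)
    return composite

# Hypothetical primary sources; P1 and Q1 hold the same sequence.
sources = {
    "DB_ONE": {"P1": "MKTAYIAK", "P2": "GSHMLEDP"},
    "DB_TWO": {"Q1": "mktayiak", "Q2": "AVILDRYFH"},
}
composite = merge_databases(sources, priority=["DB_ONE", "DB_TWO"])
for sequence, (db_name, entry_id) in composite.items():
    print(db_name, entry_id, sequence)
```

Changing the priority order or the identity test (for example, tolerating a few mismatches) changes which entries survive, which is why two composite databases built from overlapping sources can differ in content.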
-
Composite Protein Sequence Databases
Composite DB and its primary sources:
NRDB (Non-Redundant DataBase): PDB, SWISS-PROT, PIR, GenPept, SWISS-PROT update, GenPept update
OWL: SWISS-PROT, PIR, GenBank, NRL-3D
http://www.hgmp.mrc.ac.uk/Bioinformatics/Databases/owl-help.html
MIPSX: PIR1-4, MIPSOwn, MIPSTrn, MIPSH, PIRMOD, NRL-3D, SWISS-PROT, EMTrans, GBTrans, Kabat, PseqIP
http://mips.gsf.de/
SP+TrEMBL (SwissProt + TrEMBL): SWISS-PROT, TrEMBL
http://www.hgmp.mrc.ac.uk/Bioinformatics/Databases/trembl-help.html
-
Secondary databases
Secondary databases contain pattern data, i.e., diagnostic signatures for protein families. These signatures encode the most highly conserved features of multiply aligned sequences, which are often crucial to the structure or function of the protein.
The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands), which, in sequence alignments, are often apparent as well-conserved motifs.
Patterns are stored as regular expressions, fingerprints, blocks, profiles, etc.
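To make the regular-expression form of a signature concrete, the following Python sketch translates a pattern written in PROSITE-style syntax into a Python regular expression and scans a sequence with it. The pattern and the test sequence are hypothetical examples, not real PROSITE entries, and the converter `prosite_to_regex` is a simplified illustration that handles only the basic syntax (`x`, `[...]`, `{...}`, repeat counts, and `<`/`>` anchors outside brackets).

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        # Optional anchors: '<' = N-terminus, '>' = C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off a repetition such as (2) or (2,4).
        core, _, repeat = element.partition("(")
        repeat = "{" + repeat.rstrip(")") + "}" if repeat else ""
        if core.lower() == "x":       # x = any residue
            core = "."
        elif core.startswith("{"):    # {AB} = any residue except A or B
            core = "[^" + core[1:-1] + "]"
        regex.append(prefix + core + repeat + suffix)
    return "".join(regex)

pattern = "[AS]-x-[IL](2)-[DE]-R-[FYW](2)-H"  # hypothetical pattern
regex = prosite_to_regex(pattern)
match = re.search(regex, "MKAVILDRYFHQ")     # hypothetical sequence
print(regex)
print(match.group(0) if match else None)
```

A pattern of this kind either matches or does not; the fingerprint, block and profile representations listed above are more tolerant because they score several aligned motifs or positions rather than requiring an exact match.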
-
Secondary databases
Secondary DB   Primary source    Stored information
PROSITE        SWISS-PROT        Regular expressions (patterns)
Profiles       SWISS-PROT        Weighted matrices (profiles)
PRINTS         OWL               Aligned motifs (fingerprints)
BLOCKS         PROSITE/PRINTS    Aligned motifs (blocks)
IDENTIFY       BLOCKS/PRINTS     Fuzzy regular expressions (patterns)
http://www.expasy.org/prosite/
-
Secondary databases
TRANSFAC http://transfac.gbf.de
EPD http://www.epd.isb-sib.ch
InterPro http://www.ebi.ac.uk/interpro/
PROSITE http://www.expasy.ch/prosite
BLOCKS http://blocks.fhcrc.org
PRINTS ftp://ftp.seqnet.dl.ac.uk/pub/database/prints
PFAM http://www.sanger.ac.uk/Software/Pfam/index.shtml
ProDom http://www.toulouse.inra.fr/prodom.html
GeneCards http://bioinformatics.weizmann.ac.il/cards
ENSEMBL http://www.ensembl.org
EcoCyc http://ecocyc.panbio.com/ecocyc/ecocyc.html
-
Secondary databases
There is some overlap in content between the secondary databases
PDBsum alone has 35,291 entries
Pattern DB growth is slow because the addition of detailed family annotation is very time-consuming.
PROSITE and PRINTS are the only comprehensively, manually annotated secondary DBs
To address the annotation bottleneck, the secondary database curators have together created a unified database of protein families known as InterPro
-
Structure Classification DBs
Contain 3D structures available from crystallographic and spectroscopic studies
PDBsum - Protein Data Bank
CATH - Class, Architecture, Topology, Homology
SCOP - Structural Classification of Proteins
-
Structure Classification DBs
PDB http://www.rcsb.org
SCOP http://scop.mrc-lmb.cam.ac.uk/scop
CATH http://www.biochem.ucl.ac.uk/bsm/cath
DSSP http://www.sander.ebi.ac.uk/dssp
FSSP http://www.ebi.ac.uk/dali/fssp
HSSP http://www.sander.ebi.ac.uk/hssp
-
Metabolic Databases
A number of metabolic databases are available electronically, some with features for querying and visualizing metabolic pathways and regulatory networks.
KEGG (Kyoto Encyclopedia of Genes and Genomes) http://www.genome.ad.jp/kegg
ENZYME (Enzyme nomenclature database) http://www.expasy.ch/enzyme
BRENDA (Enzyme Information System) http://www.brenda.uni-koeln.de
EMP (Enzymes and Metabolic Pathways database) http://www.empproject.com
-
Mapping Databases
OMIM (Online Mendelian Inheritance in Man) http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim
GDB (The GDB Human Genome Database) http://www.gdb.org
RHDB http://corba.ebi.ac.uk/RHdb
-
Databases concerning Mutations
dbSNP http://www.ncbi.nlm.nih.gov/SNP
HGBASE http://hgbase.cgr.ki.se
The SNP Consortium (TSC) http://snp.cshl.org
HAEMA http://europium.csc.mrc.ac.uk/usr/WWW/WebPages/database.dir/quiz.dir/intrquiz.htm
-
Literature Databases
PubMed http://www.ncbi.nlm.nih.gov/entrez/query
Bioinformatics Online http://www.bioinformatics.oupjournals.org
Nature http://www.nature.com
Science http://www.sciencemag.org
-
Database tools for displaying and annotating genomic sequence data
Viewer format    URL
Artemis          www.sanger.ac.uk/Software/Artemis
ACeDB            www.acedb.org/Tutorial/brief-tutorial/shtml
Apollo           www.ensembl.org/apollo
EnsEMBL          www.ensembl.org
NCBI map viewer  www.ncbi.nlm.nih.gov
GoldenPath       genome.ucsc.edu
-
Common formats
There are several conventions for representing nucleic acid and protein sequences, of which the following are widely used:
NBRF/PIR
FASTA
GDE
These formats have limited facilities for comments, which must include a unique identifier code and sequence accession number
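As an illustration of how little structure these formats carry, here is a minimal Python sketch of a FASTA reader. It assumes the common convention that each record starts with a `>` header line whose first word is the identifier and that the sequence may be wrapped over several lines; the function names and the two example records are hypothetical.

```python
def parse_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)

def _make_record(header, chunks):
    # The first word of the header is the identifier; the rest is free text.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)

EXAMPLE = """\
>seq1 hypothetical example protein
ACDEFGHIKL
MNPQRSTVWY
>seq2
AVILDRYFH
"""
for identifier, description, sequence in parse_fasta(EXAMPLE):
    print(identifier, len(sequence), description)
```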
-
Formats for multiple sequence alignment
There are separate formats for multiple sequence alignment representation, of which the following are popular:
MSF
PHYLIP
ALN
-
Files of structural data
Structural data are maintained as flat files using the PDB format
Such files contain orthogonal atomic co-ordinates together with annotations, comments and experimental details
http://www.pdb.org
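The co-ordinates in a PDB flat file sit in fixed-width columns, so they can be read by slicing each record. The Python sketch below extracts a few fields from an ATOM record using the column positions of the PDB format; the sample record is built in the code purely for illustration, and the function name `parse_atom_line` and the chosen subset of fields are assumptions of this example.

```python
def parse_atom_line(line):
    """Extract a few fields from a fixed-width PDB ATOM/HETATM record."""
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }

# Build an illustrative record field by field so the column widths are explicit.
sample = (
    f"{'ATOM':<6}{1:>5} {' N  '}{' '}{'THR'} {'A'}{1:>4}{' '}   "
    f"{17.047:>8.3f}{14.099:>8.3f}{3.625:>8.3f}{1.00:>6.2f}{13.79:>6.2f}"
)
atom = parse_atom_line(sample)
print(atom)
```

Because the layout is positional, splitting a record on whitespace is unreliable: adjacent fields can touch when values are wide, which is why the sketch slices by column instead.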
-
Submission of sequences
Sequences may be submitted to any of the three primary databases using the tools provided by the database curators
Such tools include WebIn and BankIt, which can be used over the Internet, and Sequin, a stand-alone application
WebIn: http://www.ebi.ac.uk/embl/Submission/webin.html
BankIt: http://www.ncbi.nlm.nih.gov/BankIt/
-
Database interrogation
All the databases discussed above can be searched by sequence similarity
However, detailed text-based searches of the annotations are also possible using tools such as Entrez
The simplest way to cross-reference between the primary nucleotide sequence databases and SWISS-PROT is to search by accession number, as this provides an unambiguous identifier of genes and their products
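A text-based Entrez search can also be issued programmatically through NCBI's E-utilities web interface. The Python sketch below only builds the query URL for the ESearch endpoint and does not send any request; the helper name `build_esearch_url`, the example query and the choice of the `protein` database are assumptions made for illustration.

```python
from urllib.parse import urlencode

# ESearch endpoint of the NCBI E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(database, term, max_results=20):
    """Return an E-utilities ESearch URL for a text query (no request is sent)."""
    query = urlencode({"db": database, "term": term, "retmax": max_results})
    return f"{EUTILS_ESEARCH}?{query}"

# Hypothetical query: field tags in square brackets restrict the search
# to particular parts of the annotation.
url = build_esearch_url("protein", "insulin[Title] AND Homo sapiens[Organism]")
print(url)
```

The same function can be pointed at an accession number instead of a free-text term, which matches the cross-referencing strategy described above.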