UNIT-V: Databases in Bioinformatics
R. KAVITHA, M.PHARM
LECTURER, DEPARTMENT OF PHARMACEUTICS
SRM COLLEGE OF PHARMACY
SRM UNIVERSITY
• Why?
• The different types of databases
• Database language: identifiers
• Nucleotide sequence databases
• Protein sequence databases
• 3D structure databases
• Ontologies
Databases in Bioinformatics
• Make biological data available to scientists
– Consolidation of data (gather data from different sources)
– Provide access to large datasets that cannot be published explicitly (genome, …)
• Make biological data available in computer-readable format
– Make data accessible for automated analysis
Bioinformatics: “a collective term for data compilation, organisation, analysis and dissemination”
Biological databases: Why?
The different types of Databases in Bioinformatics
1) Data:
Type of data:
• nucleotide sequences
• protein sequences
• 3D structures
• gene expression data
• metabolic pathways
• …
Data entry and quality control:
• data deposited directly
• curators add and update data
• treatment of erroneous data: removed, or marked
• error checking
• consistency, updates
• …
Primary, or derived data:
• Primary databases: direct experimental results
• Secondary databases: result of analysis on primary databases
• Consolidation of many databases
• …
The different types of Databases in Bioinformatics
2) Database:
Organisation:
• flat files
• Relational databases
• Object-oriented databases
• …
Curators:
• Large, public institution (EMBL, NCBI)
• Quasi-academic institute (Swiss Institute of Bioinformatics, TIGR, …)
• Academic group or scientist
• Commercial company
Availability:
• Publicly available, no restriction
• Available, but with copyright
• Accessible, but not downloadable
• Academic, but not freely available
• Commercial
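To make the difference between a flat-file and a relational organisation concrete, here is a minimal Python sketch. It assumes a toy flat file with two-letter line codes (ID, AC, OS) and a "//" record terminator, loosely modelled on sequence-database flat files. The first entry reuses the TPIS_CHICK / P00940 example from these slides; the second entry, the table name `entry` and the helper names are hypothetical and only there for illustration.

```python
import sqlite3

# Toy flat file: one field per line, "//" ends a record.
# The DEMO1_HUMAN / DEMO01 entry is a made-up placeholder, not a real record.
FLAT_FILE = """\
ID   TPIS_CHICK
AC   P00940
OS   Gallus gallus
//
ID   DEMO1_HUMAN
AC   DEMO01
OS   Homo sapiens
//
"""


def parse_flat_file(text):
    """Split flat-file text into records mapping line code -> value."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("//"):  # record terminator
            if current:
                records.append(current)
            current = {}
        elif line.strip():
            code, value = line[:2], line[2:].strip()
            current[code] = value
    return records


def load_into_sqlite(records):
    """Store the parsed records in an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry ("
        "accession TEXT PRIMARY KEY, identifier TEXT, organism TEXT)"
    )
    conn.executemany(
        "INSERT INTO entry VALUES (?, ?, ?)",
        [(r["AC"], r["ID"], r["OS"]) for r in records],
    )
    return conn


records = parse_flat_file(FLAT_FILE)
conn = load_into_sqlite(records)

# A relational query replaces a linear scan over the flat file.
row = conn.execute(
    "SELECT identifier FROM entry WHERE organism = ?", ("Gallus gallus",)
).fetchone()
print(row[0])
```

The flat file is easy to read and exchange, while the relational table makes lookups by any column a one-line query; real databases typically distribute the former and serve queries from something closer to the latter.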
• Identifier: a string of letters and digits that is generally “understandable”
– Example: TPIS_CHICK (Triose Phosphate Isomerase from chicken, Gallus gallus) in SwissProt
– The identifier can change (based on the curator)
• Accession code: a string of letters and digits that uniquely identifies an entry in its database
– The accession number for TPIS_CHICK in SwissProt is P00940
– The accession number should not change!
Identifiers and Accession numbers
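The distinction between an identifier and an accession number can be sketched in a few lines of Python. The accession pattern below follows the format UniProt documents for its accession numbers; the identifier pattern (a protein mnemonic, an underscore, a species mnemonic) is a simplifying assumption based on the TPIS_CHICK example, and the function name `classify` is hypothetical.

```python
import re

# Accession format as documented by UniProt (e.g. P00940).
ACCESSION_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

# Assumed shape of a SwissProt-style identifier: PROTEIN_SPECIES.
IDENTIFIER_RE = re.compile(r"^[A-Z0-9]{1,5}_[A-Z0-9]{1,5}$")


def classify(token):
    """Return 'accession', 'identifier' or 'unknown' for a database key."""
    if ACCESSION_RE.match(token):
        return "accession"
    if IDENTIFIER_RE.match(token):
        return "identifier"
    return "unknown"


for key in ("TPIS_CHICK", "P00940"):
    print(key, "->", classify(key))

# The identifier is human-readable: it splits into protein and species parts.
protein, species = "TPIS_CHICK".split("_")
print(protein, species)
```

Because identifiers may be renamed by curators, scripts and cross-references are better keyed on the accession number.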
• 3 main databases
– EMBL: www.ebi.ac.uk/embl
– GenBank: www.ncbi.nlm.nih.gov/GenBank
– DDBJ: www.ddbj.nig.ac.jp
The 3 databases are synchronized on a daily basis, and the accession numbers are consistent.
There is no legal restriction on the use of these databases. However, some of the sequences in the databases are patented.
Nucleotide Sequence Databases
Protein Sequence Databases
One of the first biological sequence databases was probably the book "Atlas of Protein Sequence and Structure" by Margaret Dayhoff and colleagues, first published in 1965. It contained the protein sequences determined at the time, and new editions of the book were published until 1978. It became the foundation of the PIR database.
http://pir.georgetown.edu/
Protein Information Resource
Protein Sequence Databases
http://www.expasy.ch/sprot/
The SWISS-PROT database has some legal restrictions: the entries are copyrighted, but freely accessible to academic researchers. Commercial companies must buy a license from SIB.
Amino Acid Composition
Size of SwissProt
SwissProt: Statistics
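Amino acid composition, one of the statistics reported for SwissProt, is simply the percentage of each residue type. A minimal Python sketch for a single sequence follows; the sequence `toy` is made up for illustration and is not a real database entry, and the function name is hypothetical.

```python
from collections import Counter


def amino_acid_composition(sequence):
    """Return the percentage of each residue in a one-letter-code sequence."""
    seq = sequence.upper()
    total = len(seq)
    if total == 0:
        return {}
    counts = Counter(seq)
    return {aa: 100.0 * n / total for aa, n in sorted(counts.items())}


toy = "MAGLLKAAGL"  # made-up 10-residue sequence
composition = amino_acid_composition(toy)
for aa, percent in composition.items():
    print(f"{aa}: {percent:.1f}%")
```

Database-wide statistics apply the same count over all entries at once rather than over a single sequence.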
• PDB: http://www.rcsb.org
• SCOP: http://scop.berkeley.edu
• CATH: http://biochem.ucl.ac.uk/bsm/CATH
• ASTRAL: http://astral.berkeley.edu
• HOMSTRAD: http://www-cryst.bioc.cam.ac.uk/data/align/
• Interfaces to PDB:
– PDB at a glance: http://cmm.info.nih.gov/modeling/pdb_at_a_glance.html
– Molecules to go: http://molbio.info.nih.gov/cgi-bin/pdb/
– EBI interface: http://www.ebi.ac.uk/msd/
– PDBSum: http://www.ebi.ac.uk/thornton-srv/databases/pdbsum
Biomolecule Structure Database
• GO paper: Creating the Gene Ontology Resource: Design and Implementation. Genome Research (2001) 11:1425-1433
• The GO Website: http://www.geneontology.org
• Application of GO: The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res. 2003 Apr;13(4):662-72.
The Gene Ontology (GO)
GO Goals
From Genome Res 2001 Aug;11(8):1425-33
• Three levels of annotation:
– Molecular function - what a gene product does at the biochemical level
– Biological process - a broad biological perspective – not currently a pathway (no dynamics or dependencies)
– Cellular component - location within cellular structures (e.g. Golgi apparatus) and macromolecular complexes (e.g. ribosome)
Gene Ontology (GO)
Structure of GO
Example from molecular function:
• Parent: Transmembrane receptor
• Parent: Protein tyrosine kinase
• Child: Transmembrane receptor tyrosine protein kinase
– Is_a Transmembrane receptor
– Is_a Protein tyrosine kinase
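The example above, where one child term has two parents through Is_a links, means GO is a directed acyclic graph rather than a simple tree. A minimal Python sketch of that structure follows; the dictionary `IS_A` holds only the three terms from this slide, and the function name `ancestors` is hypothetical (it is not part of any GO software).

```python
# Each term maps to the list of its direct Is_a parents.
IS_A = {
    "Transmembrane receptor tyrosine protein kinase": [
        "Transmembrane receptor",
        "Protein tyrosine kinase",
    ],
    "Transmembrane receptor": [],
    "Protein tyrosine kinase": [],
}


def ancestors(term, is_a):
    """Collect every term reachable from `term` by following Is_a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


child = "Transmembrane receptor tyrosine protein kinase"
for parent in sorted(ancestors(child, IS_A)):
    print(child, "is_a", parent)
```

A gene product annotated with the child term is therefore implicitly annotated with both parents, which is what makes queries at a more general level possible.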
Searching for papers…
http://scholar.google.com