Databases, archives, search tools.
Bioinformatics:
”convergence of two historical trends in biological research
- storage of molecular sequences in computer databases
- application of computational algorithms to the analysis of DNA and protein sequences.”
(Brown 2003 Biotechniques).
After the database lecture the student should
* Understand the differences between primary and secondary databases.
* Understand the differences between sequence similarity search and structured data search.
* Understand the background for maintaining different versions of databases with nearly the same content.
* Understand the difference between curated and raw databases.
* Understand the difference between databases (GenBank non-redundant protein, Swiss-Prot), servers (NCBI, ExPASy) and search programs (BLAST, FASTA).
Why? Most information developed in bioinformatics is stored in databases. Often the same information exists in different formats in different databases, and different servers present the same data in different, more or less user-friendly, ways. The choice of database depends on the problem and personal taste.
The choice of server may even depend on the time of day and the loads (number of users) at the time.
Databases (DB).
1. Primary databases = archives = repositories.
2. Secondary databases = specialized databases
3. Parallel information: mainly American (U.S.A.) versus European (EU) databases.
All databases are listed in the first issue of Nucleic Acids Research (NAR) each year.
Main bioinformatic institutes hosting databases and servers
World map:
NCBI: Bethesda, Maryland, USA
EBI: Hinxton, England
DDBJ: Japan
International Nucleotide Sequence Database Collaboration
Primary (repository) (archives) DB.
Data derived from direct experimental characterization of DNA or protein. Authors submit their own material which is curated by the database.
International public databases: all known nucleotide and protein sequences.
GenBank (founded 1982), hosted at NCBI since 1992
EMBL (founded ?), hosted at EBI since 1994
DDBJ (founded 1986) (DNA Data Bank of Japan)
Local databases at institutions doing sequencing: TIGR (founded 1992), Sanger (founded 1992)
Other local databases of sequencing projects linked to the 3-5 large primary databases
Commercial DBs not accessible to the public.
Secondary DB (specialized DB, derived information resources)
Information curated by the DB. No direct submission from scientists. Further analysis by the database.
Swiss-Prot (founded by Dr. Amos Bairoch in 1985; hosted at SIB since 1998)
Annotation, minimum redundancy, integration with other DB, documentation.
PDB (Protein Data Bank) (founded in 1971)
DB of experimentally determined three-dimensional structural information.
PIR (Protein Information Resource) (founded in 1965 by Margaret O. Dayhoff)
Receives directly sequenced proteins.
Many, many more (see NAR)
Domain and motif specialized DB.
Domains: compact units of proteins that behave independently
Motifs: conserved regions of proteins which might be part of a domain
BLOCKS (USA) (founded by the Henikoffs)
Multiple alignments of conserved regions
PRINTS (UK) (parallel to BLOCKS, based on the OWL DB)
Hierarchical gene family fingerprints
PROSITE (associated with Swiss-Prot)
Biologically significant protein patterns and profiles
ProDom (automatically created blocks, France)
Pfam (manually defined domains)
Multiple sequence alignments and hidden Markov models of common protein domains
CDD (Conserved Domain Database)
Alignment models for conserved protein domains
Search tools for the domain DBs
DART (Domain Architecture Retrieval Tool)
SMART (Simple Modular Architecture Research Tool)
InterPro: links information in PRINTS, PROSITE, ProDom and Pfam
Database Category: Proteome Resources
AAindex - Physicochemical and biological properties of amino acids
GELBANK - 2D gel electrophoresis patterns from completed genomes
REBASE - Restriction enzymes and associated methylases
SWISS-2DPAGE - Annotated two-dimensional polyacrylamide gel electrophoresis database
Database Category: Varied Biomedical Content
DBcat - Catalog of databases
DrugDB - Pharmacologically active compounds; generic and trade names
GlycoSuiteDB - N- and O-linked glycan structures and biological source information
NCBI Taxonomy Browser - Names of all organisms that are represented in the genetic databases with at least one nucleotide or protein sequence
probeBase - rRNA-targeted oligonucleotide probe sequences, DNA microarray layouts, and associated information
PubMed - MEDLINE and Pre-MEDLINE citations
RefSeq - Reference sequence standards for genomes, genes, transcripts, and proteins
Tree of Life - Information on phylogeny and biodiversity
VirOligo - Virus-specific oligonucleotides for PCR and hybridization
International bioinformatic resources (integrated databases, programs and servers)
NCBI (National Center for Biotechnology Information)
Division of NLM on the NIH campus.
Website: www.ncbi.nlm.nih.gov
Repository: GenBank
Data retrieval: Entrez, PubMed, LocusLink
Entrez is an integrated database retrieval system giving access to all types of data
Data analysis: BLAST, Electronic PCR, ORF Finder, and more
EBI (European Bioinformatics Institute)
EMBL - repository; Europe’s primary collection of nucleotide sequences
UniProt Knowledgebase - a complete annotated protein sequence database
Macromolecular Structure Database - European project for the management and distribution of data on macromolecular structures
ArrayExpress - for gene expression data
Ensembl - provides up-to-date completed metazoan genomes and the best possible automatic annotation
Tools: ClustalW and many more
Example of repository:
GenBank
Submission: 35 % individual submissions through BankIt; the rest are bulk submissions from sequencing centres.
GI (GenInfo Identifier) number: changes with each update.
Accession number: constant, but extended by a version number.
DNA sequences: two letters, six digits (old format: one letter, five digits).
Protein sequences: three letters, five digits (old format: one letter, five digits).
Non-redundant (nr) ?
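The identifier formats listed for GenBank can be illustrated with a short Python sketch. It only covers the three patterns named on this slide; real accession numbers have further variants, and the function names `split_version` and `classify` are hypothetical helpers introduced for illustration.

```python
import re

# Patterns taken from the formats listed on the slide; other real-world
# accession formats are deliberately not covered by this sketch.
NUCLEOTIDE = re.compile(r"^[A-Z]{2}\d{6}$")   # two letters, six digits
PROTEIN = re.compile(r"^[A-Z]{3}\d{5}$")      # three letters, five digits
OLD_FORMAT = re.compile(r"^[A-Z]\d{5}$")      # one letter, five digits


def split_version(identifier):
    """Split 'M20730.1' into ('M20730', 1); the version is None if absent."""
    accession, _, version = identifier.partition(".")
    return accession, int(version) if version.isdigit() else None


def classify(identifier):
    """Return 'nucleotide', 'protein', 'old format' or 'unknown'."""
    accession, _ = split_version(identifier)
    if NUCLEOTIDE.match(accession):
        return "nucleotide"
    if PROTEIN.match(accession):
        return "protein"
    if OLD_FORMAT.match(accession):
        return "old format"
    return "unknown"


for identifier in ("M20730.1", "AAA25528.1"):
    print(identifier, split_version(identifier), classify(identifier))
```

The accession part stays constant across updates, so it is the part to store and cite; the version suffix tells which revision of the sequence was used.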
Example of protein search in DB: Leukotoxin
Frey and Kuhnert 2002
GenBank DNA-sequence format
LOCUS       PASA1LKT     7801 bp    DNA     linear   BCT 26-APR-1993
DEFINITION  Pasteurella haemolytica A1 leukotoxin gene, encoding LktA, LktB, LktC and
            LktD proteins, complete cds.
ACCESSION   M20730
VERSION     M20730.1  GI:150492
KEYWORDS    LktA protein; LktB protein; LktC protein; LktD protein.
SOURCE      Mannheimia haemolytica
  ORGANISM  Mannheimia haemolytica
            Bacteria; Proteobacteria; Gammaproteobacteria;
            Pasteurellales; Pasteurellaceae; Mannheimia.
REFERENCE   1  (bases 1 to 7801)
  AUTHORS   Lo,R.Y., Strathdee,C.A. and Shewen,P.E.
  TITLE     Nucleotide sequence of the leukotoxin genes of Pasteurella haemolytica A1
  JOURNAL   Infect. Immun. 55 (9), 1987-1996 (1987)
  MEDLINE   87306837
   PUBMED   3040588
COMMENT     Original source text: P.haemolytica (serotype 1, biotype A) DNA. Submitted
            in computer readable form by C.Strathdee 21-SEP-1988.
FEATURES             Location/Qualifiers
     source          1..7801
                     /organism="Mannheimia haemolytica"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:75985"
     CDS             470..973
                     /note="LktC protein"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="AAA25528.1"
                     /db_xref="GI:150493"
                     /translation="MNQSYFNLMNSSLHK…..
     CDS             989..3850
                     /note="LktA protein"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="AAA25529.1"
                     /db_xref="GI:150494"
                     /translation="MGTRLTTLSNGLKNTLTATKS…..
ORIGIN      3 bp upstream of EcoRV site.
        1 gatatcttgt gcctgcgcag taaccacaca cccgaataaa agggtcaaaa gtgttttttt
       61 cataaaaagt ccctgtgttt tcattataag gattaccact ttaacgcagt tactttctta
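To show how the flat-file layout of record M20730 can be read by a program, here is a minimal Python sketch. It is not a full GenBank parser: it assumes the standard column layout (top-level keywords at the start of the line, feature keys indented by five spaces), handles only simple start..end locations, and works on an abbreviated copy of the record. The helper names `header_field` and `cds_features` are hypothetical. For real work an established parser, for example the one in Biopython, is likely the better choice.

```python
import re

# Abbreviated copy of record M20730; only the lines this sketch reads.
RECORD = """\
LOCUS       PASA1LKT     7801 bp    DNA     linear   BCT 26-APR-1993
ACCESSION   M20730
VERSION     M20730.1  GI:150492
FEATURES             Location/Qualifiers
     source          1..7801
                     /organism="Mannheimia haemolytica"
     CDS             470..973
                     /note="LktC protein"
                     /protein_id="AAA25528.1"
     CDS             989..3850
                     /note="LktA protein"
                     /protein_id="AAA25529.1"
"""


def header_field(record, keyword):
    """Return the text after a top-level keyword such as ACCESSION."""
    match = re.search(rf"^{keyword}[ \t]+(.*)$", record, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


def cds_features(record):
    """Collect CDS features (simple start..end only) with their qualifiers."""
    features = []
    current = None
    for line in record.splitlines():
        key = re.match(r"^ {5}(\S+)\s+(\d+)\.\.(\d+)\s*$", line)
        if key:
            # A new feature starts; keep it only if it is a CDS.
            current = None
            if key.group(1) == "CDS":
                current = {"start": int(key.group(2)), "end": int(key.group(3))}
                features.append(current)
            continue
        qualifier = re.match(r'^\s+/(\w+)="?([^"]*)"?\s*$', line)
        if qualifier and current is not None:
            current[qualifier.group(1)] = qualifier.group(2)
    return features


print(header_field(RECORD, "ACCESSION"))
for cds in cds_features(RECORD):
    length = cds["end"] - cds["start"] + 1
    print(cds["note"], cds["protein_id"], f"{length} bp")
```

Each CDS carries its own protein_id, which is the link from the nucleotide record to the corresponding protein entries.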
Genome level comparison
COG (Clusters of Orthologous Groups)
A RNA processing and modification
B Chromatin structure and dynamics
C Energy production and conversion
D Cell cycle control and mitosis
E Amino acid metabolism and transport
F Nucleotide metabolism and transport
…
S Function unknown
Example of search in specialized database.
Selection of DB.
More versions of the same acc. no.
Different types of identifiers.
Links (or lack of these) to other specialized databases
Example of protein search in DB: Leukotoxin
NCBI http://www.ncbi.nlm.nih.gov/
Protein keywords, Mannheimia haemolytica leukotoxin
over 100 hits
Swiss-Prot http://expasy.org/sprot/
Listed under the old genus name (Pasteurella); only one sequence (P16535)
Swiss-Prot entry name: LKA1_PASHA
NCBI searched with P16535
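The keyword search above was done through the web pages. As an alternative route, the same kind of query can be sent to NCBI's E-utilities ESearch service; the Python sketch below only builds the request URL and does not contact the server. The helper name `esearch_url` and the `retmax` default are assumptions made for illustration, and the number of hits returned today will differ from the figure on the slide.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(database, term, retmax=20):
    """Build an ESearch URL; fetching it returns matching record IDs as XML."""
    query = urlencode({"db": database, "term": term, "retmax": retmax})
    return f"{ESEARCH}?{query}"


# Keyword search in the protein database, then a search by accession.
print(esearch_url("protein", "Mannheimia haemolytica leukotoxin"))
print(esearch_url("protein", "P16535"))
```

Searching by accession (P16535) instead of by keywords avoids the naming problem, because the accession does not change when an organism is renamed.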
Example with 16S rRNA based identification of bacteria. Relevant for food, veterinary and environmental microbiology.
16S rRNA sequence comparison preferred for classification/identification:
16S rRNA genes are universally distributed
There is only one type of ribosome.
No selection and no recombination (in theory)
16S rRNA gene sequence derived phylogeny reflects the natural relationship of bacteria
Current framework for bacterial taxonomy
Huge databases.
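Comparing 16S rRNA sequences usually ends in a similarity figure between two aligned sequences. The Python sketch below shows one simple way to compute percent identity, assuming the two sequences are already aligned to equal length with "-" as the gap character. Gap columns are skipped, which is a design choice of this sketch and not the scoring used by BLAST or any particular database. The two short sequences in the usage lines are made-up toy data.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical positions over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy example: 9 ungapped columns, 8 of them identical.
print(round(percent_identity("ACGT-ACGTA", "ACGTTACCTA"), 1))
```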
Example of sequence submission to a primary database.
Isolate P876, 16S rRNA gene sequence. Length: 1449 bp
TGCAAGTCGA ACGGTAGCAG GAAGAAAGCT TGCTTTCTTT GCTGACGAGT GGCGGACGGG TGAGTAATGC TTGGGAATCT GGCTTATGGA GGGGGATAACTGTGGGAAAC TGCAGCTAAT ACCGCGTAAT CTCTGAGGAG TAAAGGGTGG GACyTTAGGG CCACCTGCCA TAAGATGAGC CCAAGTGGGA TTAGGTAGTTGGTGGGGTAA AGGCCTACCA AGCCTGCGAT CTCTAGCTGG TCTGAGAGGA TGACCAGCCA CACTGGAACT GAGACACGGT CCAGACTCCT ACGGGAGGCAGCAGTGGGGA ATATTGCGCA ATGGGGGGAA CCCTGACGCA GCCATGCCGC GTGAATGAAG AAGGCCTTCG GGTTGTAAAG TTCTTTCGGT AATGAGGAAGGGGTGTTrTT kAATAGATAG CATCATTGAC GTTAATTACA GAAGAAGCAC CGGCTAACTC CGTGCCAGCA GCCGCGGTAA TACGGAGGGT GCGAGCGTTAATCGGAATAA CTGGGCGTAA AGGGCACGCA GGCGGACTTT TAAGTGAGAT GTGAAATCCC CGAGCTTAAC TTGGGAATTG CATTTCAGAC TGGGAGTCTAGAGTACTTTA GGGAGGGGTA GAATTCCACG TGTAGCGGTG AAATGCGTAG AGATGTGGAG GAATACCGAA GGCGAAGGCA GCCCCTTGGG AATGTACTGACGCTCATGTG CGAAAGCGTG GGGAGCAAAC AGGATTAGAT ACCCTGGTAG TCCACGCTGT AAACGCTGTC GATTTGGGGA TTGGGCTTTA AGCTTGGTGCCCGAAGCTAA CGTGATAAAT CGACCGCCTG GGGAGTACGG CCGCAAGGTT AAAACTCAAA TGAATTGACG GGGGCCCGCA CAAGCGGTGG AGCATGTGGTTTAATTCGAT GCAACGCGAA GAACCTTACC TACTCTTGAC ATCCTAAGAA GAGCTCAGAG ATGAGCTTGT GCCTTCGGGA ACTTAGAGAC AGGTGCTGCATGGCTGTCGT CAGCTCGTGT TGTGAAATGT TGGGTTAAGT CCCGCAACGA GCGCAACCCT TATCCTTTGT TGCCAGCGAT TTGGTCGGGA ACTCAAAGGAGACTGCCAGT GACAAACTGG AGGAAGGTGG GGATGACGTC AAGTCATCAT GGCCCTTACG AGTAGGGCTA CACACGTGCT ACAATGGTGC ATACAGAGGGCAGCGAGAGT GCGAGCTTAA GCGAATCTCA GAAAGTGCAT CTAAGTCCGG ATTGGAGTCT GCAACTCGAC TCCATGAAGT CGGAATCGCT AGTAATCGCAAATCAGAATG TTGCGGTGAA TACGTTCCCG GGCCTTGTAC ACACCGCCCG TCACACCATG GGAGTGGGTT GTACCAGAAG TAGATAGCTT AACCTTCGGG
AGGGCGTTTA CCACGGTATG ATTCATGACT GGGGTGAAGT CGTAACAGA
Submission to GenBank with BankIt
Errors are detected during automatic translation of DNA to protein, when the sequence is curated at the database.
BLAST 16S rRNA
TGCAAGTCGA ACGGTAGCAG GAAGAAAGCT TGCTTTCTTT GCTGACGAGT GGCGGACGGG TGAGTAATGC TTGGGAATCT GGCTTATGGA GGGGGATAACTGTGGGAAAC TGCAGCTAAT ACCGCGTAAT CTCTGAGGAG TAAAGGGTGG GACyTTAGGG CCACCTGCCA TAAGATGAGC CCAAGTGGGA TTAGGTAGTTGGTGGGGTAA AGGCCTACCA AGCCTGCGAT CTCTAGCTGG TCTGAGAGGA TGACCAGCCA CACTGGAACT GAGACACGGT CCAGACTCCT ACGGGAGGCAGCAGTGGGGA ATATTGCGCA ATGGGGGGAA CCCTGACGCA GCCATGCCGC GTGAATGAAG AAGGCCTTCG GGTTGTAAAG TTCTTTCGGT AATGAGGAAGGGGTGTTrTT kAATAGATAG CATCATTGAC GTTAATTACA GAAGAAGCAC CGGCTAACTC CGTGCCAGCA GCCGCGGTAA TACGGAGGGT GCGAGCGTTAATCGGAATAA CTGGGCGTAA AGGGCACGCA GGCGGACTTT TAAGTGAGAT GTGAAATCCC CGAGCTTAAC TTGGGAATTG CATTTCAGAC TGGGAGTCTAGAGTACTTTA GGGAGGGGTA GAATTCCACG TGTAGCGGTG AAATGCGTAG AGATGTGGAG GAATACCGAA GGCGAAGGCA GCCCCTTGGG AATGTACTGACGCTCATGTG CGAAAGCGTG GGGAGCAAAC AGGATTAGAT ACCCTGGTAG TCCACGCTGT AAACGCTGTC GATTTGGGGA TTGGGCTTTA AGCTTGGTGCCCGAAGCTAA CGTGATAAAT CGACCGCCTG GGGAGTACGG CCGCAAGGTT AAAACTCAAA TGAATTGACG GGGGCCCGCA CAAGCGGTGG AGCATGTGGTTTAATTCGAT GCAACGCGAA GAACCTTACC TACTCTTGAC ATCCTAAGAA GAGCTCAGAG ATGAGCTTGT GCCTTCGGGA ACTTAGAGAC AGGTGCTGCATGGCTGTCGT CAGCTCGTGT TGTGAAATGT TGGGTTAAGT CCCGCAACGA GCGCAACCCT TATCCTTTGT TGCCAGCGAT TTGGTCGGGA ACTCAAAGGAGACTGCCAGT GACAAACTGG AGGAAGGTGG GGATGACGTC AAGTCATCAT GGCCCTTACG AGTAGGGCTA CACACGTGCT ACAATGGTGC ATACAGAGGGCAGCGAGAGT GCGAGCTTAA GCGAATCTCA GAAAGTGCAT CTAAGTCCGG ATTGGAGTCT GCAACTCGAC TCCATGAAGT CGGAATCGCT AGTAATCGCAAATCAGAATG TTGCGGTGAA TACGTTCCCG GGCCTTGTAC ACACCGCCCG TCACACCATG GGAGTGGGTT GTACCAGAAG TAGATAGCTT AACCTTCGGG
AGGGCGTTTA CCACGGTATG ATTCATGACT GGGGTGAAGT CGTAACAGA
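Before pasting a sequence such as the one above into BLAST or BankIt, it helps to remove the spacing and to know where the ambiguous bases are. The lowercase letters in the sequence (y, r, k) are IUPAC ambiguity codes. The Python sketch below cleans a sequence, lists the ambiguous positions and writes a FASTA record. It uses a 30-base excerpt of the isolate P876 sequence; the record name `P876_excerpt` and the helper names are hypothetical.

```python
# IUPAC codes that stand for more than one base (e.g. R = A or G, Y = C or T).
IUPAC_AMBIGUOUS = set("RYKMSWBDHVN")


def clean_sequence(raw):
    """Remove all whitespace and upper-case the bases."""
    return "".join(raw.split()).upper()


def ambiguous_positions(sequence):
    """Return (1-based position, code) for every ambiguity code."""
    return [(i, base) for i, base in enumerate(sequence, start=1)
            if base in IUPAC_AMBIGUOUS]


def to_fasta(name, raw, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    sequence = clean_sequence(raw)
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


# Excerpt from the isolate P876 sequence, spacing as on the slide.
excerpt = "AATGAGGAAG GGGTGTTrTT kAATAGATAG"
print(ambiguous_positions(clean_sequence(excerpt)))
print(to_fasta("P876_excerpt", excerpt, width=20), end="")
```

Listing the ambiguous positions first makes it easier to decide whether the chromatogram should be rechecked before the sequence is submitted.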
Four pieces of DB advice.
Start with: NCBI and/or Swiss-Prot
Remember the differences between:
1. Repository, archive
2. Specialized
Parallel resources often exist in Europe and the USA.
Find help in the scientific literature.
Be aware of errors in the DB.
Cite the databases correctly (see first issue of NAR each year)