Databases, archives, search tools.

Bioinformatics:

“convergence of two historical trends in biological research

- storage of molecular sequences in computer databases

- application of computational algorithms to the analysis of DNA and protein sequences.”

(Brown 2003, BioTechniques).

After the database lecture the student should:

* Understand the differences between primary and secondary databases.
* Understand the differences between sequence similarity search and structured data search.
* Understand the background for maintaining different versions of databases with nearly the same content.
* Understand the difference between curated and raw databases.
* Understand the difference between databases (GenBank non-redundant protein, Swiss-Prot), servers (NCBI, ExPASy) and search programs (BLAST, FASTA).

Why? Most information generated in bioinformatics is stored in databases. The same information often exists in different formats in different databases, and different servers present the same data in ways that are more or less user-friendly. The choice of database depends on the problem and on personal taste.

The choice of server may even depend on the time of day and the load (number of users) at that time.

Databases (DB).

1. Primary databases = archives = repositories.

2. Secondary databases = specialized databases.

3. Parallel information, mainly American (U.S.A.) versus European (EU) databases.

All databases are listed in the first issue of Nucleic Acids Research (NAR) each year.

Main bioinformatic institutes hosting databases and servers

World map:

NCBI: Bethesda, Maryland, USA

EBI: Hinxton, England

DDBJ: Japan

International Nucleotide Sequence Database Collaboration

Primary (repository, archive) DB.

Data derived from direct experimental characterization of DNA or protein. Authors submit their own material, which is curated by the database.

International public databases: all known nucleotide and protein sequences.

- GenBank (founded 1982), hosted at NCBI since 1992
- EMBL (founded ?), hosted at EBI since 1994
- DDBJ (DNA Data Bank of Japan) (founded 1986)

Local databases at institutions doing sequencing: TIGR (founded 1992), Sanger (founded 1992).

Other local databases of sequencing projects are linked to the 3-5 large primary databases.

Commercial DB are not accessible to the public.

Secondary DB (specialized DB, derived information resources)

Information curated by the DB. No direct submission from scientists. Further analysis by the database.

Swiss-Prot (founded by Dr. Amos Bairoch in 1985; hosted at SIB since 1998): annotation, minimum redundancy, integration with other DB, documentation.

PDB (Protein Data Bank) (founded in 1971): DB of experimentally determined three-dimensional structural information.

PIR (Protein Information Resource) (founded in 1965 by Margaret O. Dayhoff): receives directly sequenced proteins.

Many, many more (see NAR)

Domain and motif specialized DB.

Domains: compact units of proteins that behave independently.

Motifs: conserved regions of proteins, which may be part of a domain.

BLOCKS (USA; founded by the Henikoffs): multiple alignments of conserved regions.

PRINTS (UK; parallel to BLOCKS, based on the OWL DB): hierarchical gene family fingerprints.

PROSITE (associated with Swiss-Prot): biologically significant protein patterns and profiles.

ProDom (France): automatically created blocks.

Pfam (manually defined domains): multiple sequence alignments and hidden Markov models of common protein domains.

CDD (Conserved Domain Database): alignment models for conserved protein domains.

Search tools for the domain DBs:

DART (Domain Architecture Retrieval Tool)

SMART (Simple Modular Architecture Research Tool)

InterPro: links information in PRINTS, PROSITE, ProDom and Pfam.

Database Category: Proteome Resources

AAindex Physicochemical and biological properties of amino acids

GELBANK 2D gel electrophoresis patterns from completed genomes

REBASE Restriction enzymes and associated methylases

SWISS-2DPAGE Annotated two-dimensional polyacrylamide gel electrophoresis database

Database Category: Varied Biomedical Content

DBcat Catalog of databases

DrugDB Pharmacologically-active compounds; generic and trade names

GlycoSuiteDB N- and O-linked glycan structures and biological source information

NCBI Taxonomy Browser Names of all organisms that are represented in the genetic databases with at least one nucleotide or protein sequence

probeBase rRNA-targeted oligonucleotide probe sequences, DNA microarray layouts, and associated information

PubMed MEDLINE and Pre-MEDLINE citations

RefSeq Reference sequence standards for genomes, genes, transcripts, and proteins

Tree of Life Information on phylogeny and biodiversity

VirOligo Virus-specific oligonucleotides for PCR and hybridization

International bioinformatic resources (integrated databases, programs and servers)

NCBI (National Center for Biotechnology Information): division of NLM on the NIH campus.

Web site: www.ncbi.nlm.nih.gov

Repository: GenBank

Data retrieval: Entrez, PubMed, LocusLink. Entrez is an integrated database retrieval system that gives access to all types of data.

Data analysis: BLAST, Electronic PCR, ORFfinder, and more
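Entrez can also be queried from a program through the NCBI E-utilities. The following is a minimal sketch, not part of the lecture: the helper names `efetch_url` and `fetch_record` are illustrative, and the `nuccore` database name and `rettype=gb` setting are the values assumed here for retrieving a GenBank flat file.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the E-utilities efetch service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="gb"):
    """Build an efetch URL for one record (no network access needed)."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    )
    return f"{EUTILS_EFETCH}?{query}"


def fetch_record(accession, db="nuccore", rettype="gb"):
    """Download the record as text; requires a network connection."""
    with urlopen(efetch_url(accession, db, rettype), timeout=30) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; call fetch_record("M20730") to download.
print(efetch_url("M20730"))
```

Separating URL construction from the download keeps the first part usable offline and makes it easy to inspect exactly what is being requested.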

EBI (European Bioinformatics Institute)

EMBL: repository; Europe's primary collection of nucleotide sequences.

UniProt Knowledgebase: a complete annotated protein sequence database (http://www.ebi.ac.uk/uniprot/index.html).

Macromolecular Structure Database: European project for the management and distribution of data on macromolecular structures (http://www.ebi.ac.uk/msd/index.html).

ArrayExpress: gene expression data (http://www.ebi.ac.uk/arrayexpress/).

Ensembl: provides up-to-date completed metazoan genomes and the best possible automatic annotation (http://www.ensembl.org/).

Tools: ClustalW and many more.

Example of repository:

GenBank

Submission: 35 % through BankIt (individual submissions); the rest are bulk submissions from sequencing centres.

GI (GenInfo Identifier) number: changes with each update of the sequence.

Accession number: constant, but extended by a version number (e.g. M20730.1).

DNA sequences: two letters and six digits (old format: one letter and five digits).

Protein sequences: three letters and five digits (old format: one letter and five digits).

Non-redundant (nr) ?
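The accession formats listed above can be checked mechanically. The sketch below is illustrative only: the function name `parse_accession` is hypothetical, the patterns cover just the formats named on this slide, and the old one-letter format is treated as nucleotide because the format alone cannot tell the two apart.

```python
import re

# Patterns follow the formats listed on the slide; databases have added
# further formats, so this is illustrative rather than exhaustive.
NUCLEOTIDE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")
PROTEIN = re.compile(r"^[A-Z]{3}\d{5}$")


def parse_accession(identifier):
    """Split 'ACCESSION.VERSION' and guess the sequence type from the format."""
    accession, _, version = identifier.strip().upper().partition(".")
    if NUCLEOTIDE.match(accession):
        kind = "nucleotide"
    elif PROTEIN.match(accession):
        kind = "protein"
    else:
        kind = "unknown"
    # The version is None when no '.N' suffix was given.
    return kind, accession, int(version) if version.isdigit() else None


print(parse_accession("M20730.1"))
print(parse_accession("AAA25528.1"))
```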

Example of protein search in DB: Leukotoxin

Frey and Kuhnert 2002

GenBank DNA-sequence format

LOCUS       PASA1LKT     7801 bp    DNA     linear   BCT 26-APR-1993
DEFINITION  Pasteurella haemolytica A1 leukotoxin gene, encoding LktA, LktB,
            LktC and LktD proteins, complete cds.
ACCESSION   M20730
VERSION     M20730.1  GI:150492
KEYWORDS    LktA protein; LktB protein; LktC protein; LktD protein.
SOURCE      Mannheimia haemolytica
  ORGANISM  Mannheimia haemolytica
            Bacteria; Proteobacteria; Gammaproteobacteria; Pasteurellales;
            Pasteurellaceae; Mannheimia.
REFERENCE   1  (bases 1 to 7801)
  AUTHORS   Lo,R.Y., Strathdee,C.A. and Shewen,P.E.
  TITLE     Nucleotide sequence of the leukotoxin genes of Pasteurella
            haemolytica A1
  JOURNAL   Infect. Immun. 55 (9), 1987-1996 (1987)
  MEDLINE   87306837
   PUBMED   3040588
COMMENT     Original source text: P.haemolytica (serotype 1, biotype A) DNA.
            Submitted in computer readable form by C.Strathdee 21-SEP-1988.
FEATURES             Location/Qualifiers
     source          1..7801
                     /organism="Mannheimia haemolytica"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:75985"
     CDS             470..973
                     /note="LktC protein"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="AAA25528.1"
                     /db_xref="GI:150493"
                     /translation="MNQSYFNLMNSSLHK…..
     CDS             989..3850
                     /note="LktA protein"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="AAA25529.1"
                     /db_xref="GI:150494"
                     /translation="MGTRLTTLSNGLKNTLTATKS…..
ORIGIN      3 bp upstream of EcoRV site.
        1 gatatcttgt gcctgcgcag taaccacaca cccgaataaa agggtcaaaa gtgttttttt
       61 cataaaaagt ccctgtgttt tcattataag gattaccact ttaacgcagt tactttctta
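Because the flat-file format is column based, a record like the one above can be read with a few lines of code. This is a rough sketch under stated assumptions: `parse_genbank` is a hypothetical helper, it reads only top-level keywords and the ORIGIN sequence, and it skips sub-keywords and the feature table. Established parsers (for example in Biopython) handle the full format.

```python
def parse_genbank(text):
    """Return top-level fields and the sequence of one flat-file record."""
    fields = {}
    sequence = []
    keyword = None
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines: a position number, then blocks of 10 bases.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        elif line[:1].isalpha():
            # A top-level keyword starts in the first column.
            keyword, _, value = line.partition(" ")
            in_origin = keyword == "ORIGIN"
            if not in_origin:
                fields[keyword] = value.strip()
        elif line[:12].strip():
            keyword = None  # sub-keyword or feature line: skipped here
        elif keyword in fields and line.strip():
            # Continuation line of the current top-level field.
            fields[keyword] += " " + line.strip()
    return fields, "".join(sequence).upper()


# Shortened copy of the record shown above, used only as test input.
EXAMPLE = """\
LOCUS       PASA1LKT     7801 bp    DNA     linear   BCT 26-APR-1993
DEFINITION  Pasteurella haemolytica A1 leukotoxin gene, encoding LktA, LktB,
            LktC and LktD proteins, complete cds.
ACCESSION   M20730
VERSION     M20730.1  GI:150492
SOURCE      Mannheimia haemolytica
  ORGANISM  Mannheimia haemolytica
            Bacteria; Proteobacteria.
ORIGIN      3 bp upstream of EcoRV site.
        1 gatatcttgt gcctgcgcag
//
"""

fields, seq = parse_genbank(EXAMPLE)
print(fields["ACCESSION"], fields["VERSION"], len(seq))
```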

Genome level comparison

COG (Clusters of Orthologous Groups)

A RNA processing and modification
B Chromatin structure and dynamics
C Energy production and conversion
D Cell cycle control and mitosis
E Amino acid metabolism and transport
F Nucleotide metabolism and transport
…
S Function unknown

Example of search in specialized database.

Selection of DB.

More versions of the same acc. no.

Different types of identifiers.

Links (or lack of these) to other specialized databases


Frey and Kuhnert 2002


NCBI http://www.ncbi.nlm.nih.gov/

Protein search with the keywords "Mannheimia haemolytica leukotoxin": over 100 hits.

Swiss-Prot http://expasy.org/sprot/

Filed under the outdated genus name (Pasteurella); only one sequence (P16535).

Swiss-Prot entry name: LKA1_PASHA

Searching NCBI with P16535.
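The example mixes several identifier types (a GI number, a Swiss-Prot accession, a Swiss-Prot entry name). A small sketch of how they could be told apart by format is given below. The function name `identifier_type` is hypothetical, the accession pattern is the one UniProt publishes, and the length limits for entry names are an assumption made for this illustration.

```python
import re

# Accession pattern as published by UniProt (6 or 10 characters).
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)
# Entry names look like PROTEIN_SPECIES, e.g. LKA1_PASHA.
# The length limits used here are assumptions for this sketch.
ENTRY_NAME = re.compile(r"^[A-Z0-9]{1,10}_[A-Z0-9]{1,5}$")


def identifier_type(identifier):
    """Guess the identifier type from its format alone."""
    token = identifier.strip().upper()
    if token.isdigit():
        return "GI number"
    if UNIPROT_ACCESSION.match(token):
        return "UniProt accession"
    if ENTRY_NAME.match(token):
        return "entry name"
    return "unknown"


for example in ("P16535", "LKA1_PASHA", "150492"):
    print(example, "->", identifier_type(example))
```

Format alone is not conclusive: an old one-letter, five-digit GenBank accession can have the same shape as a UniProt accession, so the database of origin still has to be known.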

Example with 16S rRNA based identification of bacteria. Relevant for food, veterinary and environmental microbiology.

16S rRNA sequence comparison is preferred for classification/identification:

- 16S rRNA genes are universally distributed.
- There is only one type of ribosome.
- No selection and no recombination (in theory).
- 16S rRNA gene sequence derived phylogeny reflects the natural relationship of bacteria.
- Current framework for bacterial taxonomy.
- Huge databases.

Example of sequence submission to a primary database.

Isolate P876, 16S rRNA gene sequence. Length: 1449 bp

TGCAAGTCGA ACGGTAGCAG GAAGAAAGCT TGCTTTCTTT GCTGACGAGT GGCGGACGGG TGAGTAATGC TTGGGAATCT GGCTTATGGA GGGGGATAACTGTGGGAAAC TGCAGCTAAT ACCGCGTAAT CTCTGAGGAG TAAAGGGTGG GACyTTAGGG CCACCTGCCA TAAGATGAGC CCAAGTGGGA TTAGGTAGTTGGTGGGGTAA AGGCCTACCA AGCCTGCGAT CTCTAGCTGG TCTGAGAGGA TGACCAGCCA CACTGGAACT GAGACACGGT CCAGACTCCT ACGGGAGGCAGCAGTGGGGA ATATTGCGCA ATGGGGGGAA CCCTGACGCA GCCATGCCGC GTGAATGAAG AAGGCCTTCG GGTTGTAAAG TTCTTTCGGT AATGAGGAAGGGGTGTTrTT kAATAGATAG CATCATTGAC GTTAATTACA GAAGAAGCAC CGGCTAACTC CGTGCCAGCA GCCGCGGTAA TACGGAGGGT GCGAGCGTTAATCGGAATAA CTGGGCGTAA AGGGCACGCA GGCGGACTTT TAAGTGAGAT GTGAAATCCC CGAGCTTAAC TTGGGAATTG CATTTCAGAC TGGGAGTCTAGAGTACTTTA GGGAGGGGTA GAATTCCACG TGTAGCGGTG AAATGCGTAG AGATGTGGAG GAATACCGAA GGCGAAGGCA GCCCCTTGGG AATGTACTGACGCTCATGTG CGAAAGCGTG GGGAGCAAAC AGGATTAGAT ACCCTGGTAG TCCACGCTGT AAACGCTGTC GATTTGGGGA TTGGGCTTTA AGCTTGGTGCCCGAAGCTAA CGTGATAAAT CGACCGCCTG GGGAGTACGG CCGCAAGGTT AAAACTCAAA TGAATTGACG GGGGCCCGCA CAAGCGGTGG AGCATGTGGTTTAATTCGAT GCAACGCGAA GAACCTTACC TACTCTTGAC ATCCTAAGAA GAGCTCAGAG ATGAGCTTGT GCCTTCGGGA ACTTAGAGAC AGGTGCTGCATGGCTGTCGT CAGCTCGTGT TGTGAAATGT TGGGTTAAGT CCCGCAACGA GCGCAACCCT TATCCTTTGT TGCCAGCGAT TTGGTCGGGA ACTCAAAGGAGACTGCCAGT GACAAACTGG AGGAAGGTGG GGATGACGTC AAGTCATCAT GGCCCTTACG AGTAGGGCTA CACACGTGCT ACAATGGTGC ATACAGAGGGCAGCGAGAGT GCGAGCTTAA GCGAATCTCA GAAAGTGCAT CTAAGTCCGG ATTGGAGTCT GCAACTCGAC TCCATGAAGT CGGAATCGCT AGTAATCGCAAATCAGAATG TTGCGGTGAA TACGTTCCCG GGCCTTGTAC ACACCGCCCG TCACACCATG GGAGTGGGTT GTACCAGAAG TAGATAGCTT AACCTTCGGG

AGGGCGTTTA CCACGGTATG ATTCATGACT GGGGTGAAGT CGTAACAGA

Submission to GenBank with BankIt
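Before a sequence like this is submitted, it can help to check its length and to list any ambiguity codes (the sequence above contains lower-case y, r and k). The helper below is a hypothetical sketch, not a BankIt feature: it strips spaces and digits, rejects characters outside the IUPAC nucleotide alphabet, and counts the ambiguous positions.

```python
from collections import Counter

# IUPAC nucleotide codes other than the four unambiguous bases.
IUPAC_AMBIGUITY = set("RYKMSWBDHVN")


def summarize_sequence(raw):
    """Strip whitespace and digits, then report length and ambiguity codes."""
    seq = "".join(ch for ch in raw.upper() if ch.isalpha())
    invalid = sorted(set(seq) - set("ACGT") - IUPAC_AMBIGUITY)
    if invalid:
        raise ValueError(f"Unexpected characters: {invalid}")
    ambiguous = Counter(ch for ch in seq if ch in IUPAC_AMBIGUITY)
    return len(seq), dict(ambiguous)


# Three 10-base blocks in the same layout as the sequence above.
print(summarize_sequence("TGCAAGTCGA ACGGTAGCAG GACyTTAGGG"))
```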

Errors are detected during automatic translation of DNA to protein, when the sequence is curated at the database.
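One kind of error that a translation check can reveal in a protein-coding sequence is a premature stop codon. The sketch below illustrates the idea only; it is not the database's actual procedure. The function names are hypothetical, and it uses the standard codon assignments, which table 11 (used in the GenBank record earlier in this lecture) shares apart from its alternative start codons.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate in frame 1; codons with ambiguity codes become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def internal_stops(cds):
    """Codon indices (0-based) of stop codons before the final codon."""
    protein = translate(cds)
    return [i for i, aa in enumerate(protein[:-1]) if aa == "*"]


print(translate("ATGAAATAA"))
print(internal_stops("ATGTAAAAATAA"))
```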

BLAST 16S rRNA

TGCAAGTCGA ACGGTAGCAG GAAGAAAGCT TGCTTTCTTT GCTGACGAGT GGCGGACGGG TGAGTAATGC TTGGGAATCT GGCTTATGGA GGGGGATAACTGTGGGAAAC TGCAGCTAAT ACCGCGTAAT CTCTGAGGAG TAAAGGGTGG GACyTTAGGG CCACCTGCCA TAAGATGAGC CCAAGTGGGA TTAGGTAGTTGGTGGGGTAA AGGCCTACCA AGCCTGCGAT CTCTAGCTGG TCTGAGAGGA TGACCAGCCA CACTGGAACT GAGACACGGT CCAGACTCCT ACGGGAGGCAGCAGTGGGGA ATATTGCGCA ATGGGGGGAA CCCTGACGCA GCCATGCCGC GTGAATGAAG AAGGCCTTCG GGTTGTAAAG TTCTTTCGGT AATGAGGAAGGGGTGTTrTT kAATAGATAG CATCATTGAC GTTAATTACA GAAGAAGCAC CGGCTAACTC CGTGCCAGCA GCCGCGGTAA TACGGAGGGT GCGAGCGTTAATCGGAATAA CTGGGCGTAA AGGGCACGCA GGCGGACTTT TAAGTGAGAT GTGAAATCCC CGAGCTTAAC TTGGGAATTG CATTTCAGAC TGGGAGTCTAGAGTACTTTA GGGAGGGGTA GAATTCCACG TGTAGCGGTG AAATGCGTAG AGATGTGGAG GAATACCGAA GGCGAAGGCA GCCCCTTGGG AATGTACTGACGCTCATGTG CGAAAGCGTG GGGAGCAAAC AGGATTAGAT ACCCTGGTAG TCCACGCTGT AAACGCTGTC GATTTGGGGA TTGGGCTTTA AGCTTGGTGCCCGAAGCTAA CGTGATAAAT CGACCGCCTG GGGAGTACGG CCGCAAGGTT AAAACTCAAA TGAATTGACG GGGGCCCGCA CAAGCGGTGG AGCATGTGGTTTAATTCGAT GCAACGCGAA GAACCTTACC TACTCTTGAC ATCCTAAGAA GAGCTCAGAG ATGAGCTTGT GCCTTCGGGA ACTTAGAGAC AGGTGCTGCATGGCTGTCGT CAGCTCGTGT TGTGAAATGT TGGGTTAAGT CCCGCAACGA GCGCAACCCT TATCCTTTGT TGCCAGCGAT TTGGTCGGGA ACTCAAAGGAGACTGCCAGT GACAAACTGG AGGAAGGTGG GGATGACGTC AAGTCATCAT GGCCCTTACG AGTAGGGCTA CACACGTGCT ACAATGGTGC ATACAGAGGGCAGCGAGAGT GCGAGCTTAA GCGAATCTCA GAAAGTGCAT CTAAGTCCGG ATTGGAGTCT GCAACTCGAC TCCATGAAGT CGGAATCGCT AGTAATCGCAAATCAGAATG TTGCGGTGAA TACGTTCCCG GGCCTTGTAC ACACCGCCCG TCACACCATG GGAGTGGGTT GTACCAGAAG TAGATAGCTT AACCTTCGGG

AGGGCGTTTA CCACGGTATG ATTCATGACT GGGGTGAAGT CGTAACAGA
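BLAST reports, among other things, the percent identity between the query and each hit. The sketch below is not BLAST: it skips the search and alignment steps and only shows how identity could be computed for two sequences that are already aligned, treating IUPAC ambiguity codes such as the y, r and k in the query as compatible with the bases they stand for. The function name and the choice to ignore gap columns are assumptions made for this illustration.

```python
# Bases represented by each IUPAC nucleotide code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def percent_identity(a, b):
    """Percent of gap-free aligned columns whose bases are compatible."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # columns with a gap are ignored in this sketch
        compared += 1
        if set(IUPAC[x]) & set(IUPAC[y]):
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("ACGT", "ACGA"))
```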

Four pieces of DB advice.

Start with: NCBI and/or Swiss-Prot.

Remember the differences between:
1. Repository, archive
2. Specialized

Parallel resources often exist in Europe and the USA.

Find help in the scientific literature.

Be aware of errors in the DB.

Cite the databases correctly (see the first issue of NAR each year).
