Databases (“knowledge bases”) used in genome analysis

Michael Y. Galperin
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA



Growth in genome sequencing

Working Draft Sequence

gaps

J. Smith - a very common name

Structure - a very common term

Glutamine amidotransferase - less common term but not a very good descriptor

A different professor Janet Smith

Another Janet Smith in the news

Glutamine for sale

Tools of the trade for the “armchair scientist”

• Databases
– PubMed and other NCBI databases
– Biochemical databases
– Protein domain databases
– Structural databases
– Genome comparison databases
• Tools
– CDD / COGs
– VAST / FSSP

Types of databases

• Archival or Primary Data
– Text: PubMed
– DNA Sequence: GenBank
– Protein Sequence: Entrez Proteins, TREMBL
– Protein Structures: PDB
• Curated or Processed Data
– DNA Sequences: RefSeq, LocusLink, OMIM
– Protein Sequences: SWISS-PROT, PIR
– Protein Structures: SCOP, CATH, MMDB
– Genomes: Entrez Genomes, COGs

Nucleic Acids Research: Database Issue each January 1
Articles on ~100 different databases

http://www.ncbi.nlm.nih.gov

The National Center for Biotechnology Information (NCBI)

• Created as a part of the National Library of Medicine, National Institutes of Health, in 1988
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
• Tools: BLAST (1990), Entrez (1992)
• GenBank (1992)
• Free MEDLINE (PubMed, 1997)
• Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, Taxonomy, GeneMap, SAGE, LocusLink, RefSeq

What is GenBank?

• Archival nucleotide sequence database
• Sample slogans: “Easy deposits, unlimited withdrawals, high interest”, “All bases covered”, “Billions and billions served”
• Data are shared nightly among three collaborating databases:
– GenBank at NCBI - Bethesda, Maryland, USA
– DNA Database of Japan (DDBJ) at NIG - Mishima, Japan
– European Molecular Biology Laboratory Database (EMBL) at EBI - Hinxton, UK

Some guiding principles of working with GenBank

• GenBank is a nucleotide-centric view of the information space

• GenBank is a repository of all publicly available sequences

• In GenBank, records are grouped for various reasons

• Data in GenBank is only as good as what you put in

NCBI databases and their links

[Diagram labels: Word Weight, VAST, BLAST, Phylogeny, Genomes, Taxonomy, Nucleotide Sequences, Protein Sequences, Article Abstracts (Medline), 3-D Structure (MMDB)]

Entrez: An integrated search and retrieval system

PubMed book links

GenBank Record

[Annotated record; labeled fields: Accession Number, gi Number, Protein Sequence, Nucleotide Sequence, Locus Name, Medline ID, GenPept ID. Rest of protein and nucleotide sequences deleted for brevity]

Archival databases are unreliable

• Misinterpreted experimental results
• Annotations based on low similarity
– gi|1968785 - cDNA 5' end similar to similar to arrest-defective protein isolog (H. sapiens)
– gi|6522905 - very hypothetical protein (S. pombe)
• Biologically senseless annotations
– Deinococcus: head morphogenesis protein
– Arabidopsis: separation anxiety protein-like
– Yersinia: automembrane protein H
– H. pylori: brute force protein
– S. cerevisiae: inside intron 7
• Propagated mistakes of sequence comparison (e.g. ABC1/ABC)

Advanced Neighbors: BLink

BLink

Protein sequence motif is a descriptor of a protein family

• Glutamine amidotransferase class I
[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]
[C is the active site residue]

• Glutamine amidotransferase class II
<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]
[C is the active site residue]
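The two patterns above use PROSITE-style syntax: brackets list the residues allowed at a position, x is any residue, x(0,11) is a run of 0 to 11 residues, and < anchors the match at the N-terminus. As a minimal sketch of how such a pattern could be applied to a sequence, the Python below translates it into a regular expression. The helper name prosite_to_regex and both test peptides are made up for illustration; this is not the scanner used by any particular database.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().split("-"):
        element = element.strip()
        prefix = suffix = ""
        if element.startswith("<"):   # '<' anchors the pattern at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # '>' anchors the pattern at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate an optional repeat count such as (3) or (0,11) from the residue spec.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":               # 'x' means any residue
            core = "."
        elif core.startswith("{"):    # {AB} means any residue except A or B
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

CLASS_I = "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]"
CLASS_II = "<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]"

# Made-up peptides for illustration only, not real protein sequences.
toy_class_i = "MKSVLGICLGHQLL"
toy_class_ii = "MCGIVGA"

hit = re.search(prosite_to_regex(CLASS_I), toy_class_i)
print(prosite_to_regex(CLASS_II))
print(hit.start(), hit.group())
```

A match only says that the sequence fits the descriptor. Whether the cysteine really is an active site residue still has to come from experimental or structural evidence.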

purF gene neighbors

Searching MMDB

Principles of structural alignment

• Dali: http://www.ebi.ac.uk/dali/
Looks for minimal RMSD between Cα atoms. Calculates Cα-Cα distance matrices, then identifies the longest alignable segments

• VAST (Vector Alignment Search Tool): http://www.ncbi.nlm.nih.gov/Structure/
Looks for pairs of secondary structure elements (α-helices, β-strands) that have similar orientation and connectivity
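One reason to compare distance matrices is that they do not change when a structure is rotated or translated, so two structures can be compared without superimposing them first. The sketch below illustrates only that property, using invented coordinates for four C-alpha atoms. It is not the Dali algorithm itself, and the function names are hypothetical.

```python
import math

def distance_matrix(coords):
    """Return the matrix of pairwise distances between 3-D points."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def rotate_z_and_shift(coords, angle, shift):
    """Rotate points about the z axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]

def max_matrix_difference(m1, m2):
    """Largest absolute element-wise difference between two equally sized matrices."""
    return max(abs(a - b) for row1, row2 in zip(m1, m2) for a, b in zip(row1, row2))

# Invented coordinates (in angstroms) standing in for four consecutive C-alpha atoms.
toy_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (8.5, 4.2, 1.5)]

# The same "structure" in a different orientation and position.
moved_ca = rotate_z_and_shift(toy_ca, math.radians(40), (10.0, -5.0, 2.0))

d_original = distance_matrix(toy_ca)
d_moved = distance_matrix(moved_ca)
print(max_matrix_difference(d_original, d_moved))  # close to zero: the matrices agree
```

The raw coordinates of the two copies differ by many angstroms, yet their distance matrices agree to within rounding error.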

Dali alignment of Tyr phosphatase

VAST Structure Neighbors

Structure Summary

Cn3D viewer

VAST neighbors

BLAST neighbors

Cn3D : Displaying Structures

Chloroquine

Structure Neighbors

Use of structural alignments

Chloroquine

NADH

Online Mendelian Inheritance in Man

A catalog of human genes and genetic disorders

OMIM record for Presenilin 1 (PSEN1)

Associated LocusLink record

External resources

Additional info in OMIM

Contents

Each record provides a state of the art summary of current knowledge

Extensive references to literature

OMIM Search Results by Titles

alzheimer AND presenilin 1
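The query above combines two terms with the Boolean operator AND. The slides show it typed into the web interface. As a hedged sketch, the same query could also be sent programmatically through NCBI's E-utilities esearch endpoint. The endpoint and the db, term and retmax parameters are part of E-utilities, but their use here is an assumption added for illustration, and build_esearch_url is a hypothetical helper. The code only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; not mentioned in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(database, query, max_results=20):
    """Build an esearch URL for a Boolean text query against one Entrez database."""
    params = {"db": database, "term": query, "retmax": max_results}
    return ESEARCH_URL + "?" + urlencode(params)

url = build_esearch_url("omim", "alzheimer AND presenilin 1")
print(url)
```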

Entrez Genome: Gene Location

View of chromosome 14

Gene Name

Multiple Maps: STSs, ESTs, etc.

Entrez Genomes Map Viewer

Chromosome 7

GenBank Map, Contig Map, STS Map

Integrated View of Chromosome 7

Entrez Genome: Gene Location

Entrez Genomes Map Viewer

Chromosome 14 Cytogenetic map

Location of PSEN1 and surrounding genes

LocusLink

Text querying

Multiple Organisms

Alphabetical listings

Stable Locus ID

Approved symbol

Description

Genome Position

External Links

Curated Resource

Central hub of information for human, mouse, rat, zebrafish, and fruit fly loci

alzheimer

OMIM

RefSeq

GenBank

UniGene

dbSNP

LocusLink: LocusID 5663 PSEN1

Directed by Dr. David J. Lipman

National Center for Biotechnology Information

http://www.ncbi.nlm.nih.gov