Using Entrez
The Life Sciences Search Engine
Searching NCBI Databases Efficiently
• Knowing how to retrieve the exact information you need in an efficient way is the fundamental and most important skill in Bioinformatics.
• Every NCBI database is designed and built for a specific purpose.
• A common mistake Bioinformatics novices make is searching for information in an inappropriate database.
• Entrez links among and within databases, making it easier to search for information.
What is Entrez?
• Entrez is an NCBI retrieval system designed for searching several linked databases.
• Entrez is a search tool for integrated access to the biological literature and sequence data.
• Entrez is extremely powerful, enabling the user to quickly move between the different specialized databases.
Entrez
• Entrez is divided into sites for nucleotide, protein, structure, genomes, OMIM, and more. You can use limits (such as RefSeq) to focus your Entrez search.
• When you conduct a search via Entrez, your query generates this screen, telling you the number of hits to your query.
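The hit count shown on the results screen can also be read by a program. The sketch below goes beyond the slides: it assumes the query was sent to NCBI's E-utilities ESearch service, and it parses a small made-up reply shaped like an ESearch XML result. The count and IDs are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical reply, shaped like an ESearch XML result.
# The count and the IDs are invented for illustration only.
SAMPLE_REPLY = """<eSearchResult>
  <Count>3</Count>
  <RetMax>3</RetMax>
  <IdList>
    <Id>111</Id>
    <Id>222</Id>
    <Id>333</Id>
  </IdList>
</eSearchResult>"""


def summarize_hits(xml_text):
    """Return (number of hits, list of record IDs) from an ESearch-style reply."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    ids = [node.text for node in root.findall("./IdList/Id")]
    return count, ids


hit_count, hit_ids = summarize_hits(SAMPLE_REPLY)
print(f"{hit_count} hits; IDs: {hit_ids}")
```

A large count is usually a sign that the query needs limits or field restrictions before the results are worth browsing.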
The Entrez System
The Big Picture
[Diagram of Entrez and related resources: LocusLink, Nucleotide, Protein, OMIM, PubMed, SNP, MGC, UCSC, GDB, e! (Ensembl), HGMD, UniGene, HomoloGene, MapViewer, Structure, 3D Domains, CDD, Books, PopSet, Genome, Taxonomy, ProbeSet, UniSTS]
Entrez and LocusLink
• Entrez doesn’t link to all the databases that contain sequences, however!
• LocusLink has its own groups of links to specialty databases, since it doesn’t cover all the genomes yet.
Genomes
Taxonomy
Entrez: Database Integration
[Diagram labels: PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structure, Word weight, VAST, BLAST, Phylogeny]
The (ever) Expanding Entrez System
[Diagram labels: Entrez, Journals, UniGene, PubMed, Nucleotide, Protein, SNP, Genome, Books, ProbeSet, OMIM, CDD, Taxonomy, 3D Domains, UniSTS, PopSet, Structure]
Entrez Databases
• PubMed: Biomedical literature
• Books: Online textbooks
• Nucleotide: GenBank, EMBL, DDBJ, RefSeq, PDB
• Protein: [GenBank, EMBL, DDBJ], RefSeq, SWISS-PROT, PIR, PRF, PDB
• Genome: Complete genomes
• Taxonomy: Organisms in NCBI sequence databases
• Structure: MMDB, experimental 3D structures
• Domains: CDD, conserved protein domains
• 3D Domains: Compact 3D protein domains in MMDB
• OMIM: Online Mendelian Inheritance in Man
• SNP: Single nucleotide polymorphisms
• UniSTS: Sequence Tagged Site markers
• ProbeSet: Gene expression and microarray datasets
• PopSet: Population study datasets
• UniGene: Gene-based expressed sequence clusters
Nucleotide Database
• The Nucleotide database contains sequence data from GenBank, EMBL, and DDBJ, the members of the tripartite, international collaboration of sequence databases.
• EMBL is the European Molecular Biology Laboratory at Hinxton Hall, UK;
• DDBJ is the DNA Database of Japan in Mishima, Japan.
• Sequence data are also incorporated from the Genome Sequence Data Base (GSDB), Santa Fe, NM.
• Patent sequences are incorporated through arrangements with the U.S. Patent and Trademark Office (USPTO) and via the collaborating international databases from other international patent offices.
Entrez Nucleotides
Primary
• GenBank / EMBL / DDBJ 35,116,960
Derivative
• RefSeq 259,219
• Third Party Annotation 3,182
• PDB 4,703
Total 35,384,248
Database Searching with Entrez
• Using limits and field restriction to find plant g6pdh
• Linking and neighboring with g6pdh
Entrez Nucleotides
glucose 6 phosphate dehydrogenase
The G6PD enzyme catalyzes the oxidation of glucose-6-phosphate to 6-phosphogluconolactone (which is then hydrolyzed to 6-phosphogluconate), while reducing nicotinamide adenine dinucleotide phosphate (NADP+ to NADPH). In terms of electron transfer, glucose-6-phosphate loses two electrons and NADP+ gains two electrons to become NADPH. This is the first step in the pentose phosphate pathway. This pathway, or shunt, as it is sometimes called, produces the 5-carbon sugar ribose (as ribose 5-phosphate), which supplies the sugar component of both RNA and, after conversion to deoxyribose, DNA.
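Written out as a reaction, the electron transfer looks like this. Strictly speaking, the immediate product of G6PD is the lactone, and a separate hydrolysis step then gives 6-phosphogluconate; the two electrons move as a hydride from the sugar to NADP+.

```latex
\[
\text{glucose-6-phosphate} + \mathrm{NADP^{+}}
\;\xrightarrow{\ \text{G6PD}\ }\;
\text{6-phosphoglucono-}\delta\text{-lactone} + \mathrm{NADPH} + \mathrm{H^{+}}
\]
\[
\text{6-phosphoglucono-}\delta\text{-lactone} + \mathrm{H_2O}
\;\longrightarrow\;
\text{6-phosphogluconate} + \mathrm{H^{+}}
\]
```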
Limits Are Helpful
• Limits allow restriction of a search to a defined subset of the database.
• Limits can be set to restrict a search to a particular database field (e.g., the Author field).
• Limits can be set to search everything but a particular type of data (e.g., “exclude patent records”).
• Alternatively, limits can be set to search only a particular type of data (e.g., Genomic RNA/DNA) or to search only data from a particular source database (e.g., EMBL). Date limits and sequence length limits are also possible.
• The contents of each Entrez database differ, and therefore the Limits available for each database differ.
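The same limits can be typed directly into the search box as field-restricted terms. The sketch below shows one way to assemble such a query string. The helper names (`field`, `value_range`, `build_query`) are hypothetical, the length window is arbitrary, and the patent filter `gbdiv_pat` in the Properties field is an assumption that should be checked against the current Entrez help before use.

```python
def field(term, tag):
    """Restrict a term to one Entrez field, e.g. field("Smith J", "Author")."""
    return f"{term}[{tag}]"


def value_range(low, high, tag):
    """Build a range limit, such as a sequence-length or date window."""
    return f"{low}:{high}[{tag}]"


def build_query(required, excluded=()):
    """AND together the required clauses, then NOT each excluded clause."""
    query = " AND ".join(required)
    for clause in excluded:
        query += f" NOT {clause}"
    return query


query = build_query(
    required=[
        field("glucose 6 phosphate dehydrogenase", "Title"),
        field("green plants", "Organism"),
        value_range(1000, 3000, "Sequence Length"),  # arbitrary example window
    ],
    # Assumed property value for patent records; verify before relying on it.
    excluded=[field("gbdiv_pat", "Properties")],
)
print(query)
```

Keeping each limit as its own clause makes it easy to add or drop one restriction at a time and watch how the hit count changes.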
Entrez Nucleotides: Limits & Preview/Index
Try using the Limits and Preview/Index functions to hone your search and find the plant G6PD genes.
glucose 6 phosphate dehydrogenase
Entrez Nucleotides: Limits
Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume
Field Restriction
Exclude bulk sequences
Entrez Nucleotides: Limits
Title == Definition
Exclude Bulk Sequences
mRNA molecule type
Nuclear gene
Document Summaries: Limits
green plants
Adding Terms: Preview/Index
Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume
green plants
Plant cytosolic g6pdh mRNAs
Database Neighbors and Interlinking
• What makes Entrez more powerful than many services is that most of its records are linked to other records, both within a given database (such as Nucleotide) and between databases.
• Links within a database are called “neighbors” (e.g., Nucleotide neighbors).
Links Between Databases
• Protein and Nucleotide neighbors are determined by performing similarity searches using the BLAST algorithm to compare the entry amino acid or DNA sequence to all other amino acid or DNA sequences in the database. We will discuss more about BLAST later.
• Nucleotide sequence records in the Nucleotide database are linked to the PubMed citation of the article in which the sequences were published.
• Protein sequence records are linked to the nucleotide sequence from which the protein was translated.
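The cross-database links described above can also be requested by a program. This is a minimal sketch that assumes NCBI's E-utilities ELink service, which the slides do not mention. The record ID is a placeholder, and `link_url` is a hypothetical helper that only builds the request URL without sending it.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ELink (not part of the original slides).
ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

PLACEHOLDER_ID = "12345"  # hypothetical record ID, for illustration only


def link_url(source_db, target_db, record_id):
    """Build a URL asking which target_db records are linked to a source_db record."""
    params = {"dbfrom": source_db, "db": target_db, "id": record_id}
    return f"{ELINK}?{urlencode(params)}"


# Nucleotide record -> the PubMed citation where the sequence was published.
print(link_url("nucleotide", "pubmed", PLACEHOLDER_ID))
# Protein record -> the nucleotide sequence it was translated from.
print(link_url("protein", "nucleotide", PLACEHOLDER_ID))
```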
Plant cytosolic g6pdh mRNAs
Formats: Summary, Brief, GenBank, ASN.1, FASTA, GI list
Links and neighbors (related records): LinkOut, PubMed Links, Protein Links, Nucleotide Neighbors, PopSet Links, Structure Links, Genome Links, Taxonomy Links, OMIM Links
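The display formats in the menu have rough equivalents when records are retrieved by a program. The sketch below assumes NCBI's E-utilities EFetch service, which the slides do not cover, and maps only the two formats whose `rettype` values are well established. The record ID is a placeholder and `fetch_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities EFetch (not part of the original slides).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Display format named on the slide -> assumed EFetch rettype value.
RETTYPE = {"GenBank": "gb", "FASTA": "fasta"}


def fetch_url(record_id, display_format, db="nucleotide"):
    """Build a URL that retrieves one record as plain text in the chosen format."""
    params = {
        "db": db,
        "id": record_id,
        "rettype": RETTYPE[display_format],
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


print(fetch_url("12345", "FASTA"))    # "12345" is a placeholder ID
print(fetch_url("12345", "GenBank"))
```

FASTA is the usual choice when the sequence is going straight into another tool, while the GenBank flat file keeps the annotation and references.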
LinkOut
• LinkOut is a feature of Entrez that is designed to provide users with links from PubMed and other Entrez databases to a wide variety of relevant web-accessible online resources:
– Full-text publications
– Other biological databases
– Consumer health information
– Research tools
• The goal is to facilitate access to relevant online resources beyond the Entrez system to extend, clarify, or supplement information found in the Entrez databases.
Protein Database
• The Protein database includes proteins from translated regions of DNA in GenBank as well as sequences from PIR.
• The entry includes:
– The name of the protein
– How the protein sequence was derived
– An accession and a PID number
– The number of amino acids
Protein Entry
The entry also includes:
• Structural information for the protein (if known)
– α-helices and β-sheets
– Domains
– Etc.
• The sequence of amino acids comprising the protein
Setting Protein Database search limits
• Choose Protein from the drop-down menu
– Can do a Boolean search
– Or can set LIMITS
• Fields (e.g., Author, Journal, etc.)
• Gene Location (genomic, mitochondrial, etc.)
• Segmented Sequence
• Only from (Database to check)
• Modification date
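Choosing Protein from the drop-down menu and typing a Boolean query has a scripted counterpart. The following sketch assumes NCBI's E-utilities ESearch service, which is not mentioned in the slides. The query text is only an example, and `protein_search_url` is a hypothetical helper that builds the URL without sending a request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint: NCBI E-utilities ESearch (not part of the original slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def protein_search_url(term, max_records=20):
    """Build an ESearch URL that runs a Boolean query against the Protein database."""
    params = {"db": "protein", "term": term, "retmax": max_records}
    return f"{ESEARCH}?{urlencode(params)}"


# Example Boolean query: a title phrase AND an organism restriction.
term = "glucose 6 phosphate dehydrogenase[Title] AND Medicago sativa[Organism]"
url = protein_search_url(term)
print(url)

# Decoding the URL again confirms the query survived the encoding unchanged.
print(parse_qs(urlsplit(url).query)["term"][0])
```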
Linking Between Databases
• Sometimes you will pull up a record and you have no idea what organism the gene you are looking at is from.
• For example, in the following record, what is Medicago sativa?
Entrez GenBank / GenPept
Taxonomy to the Rescue
• Entrez lets you click a live link from the record and determine what organism Medicago sativa is.
• It is alfalfa.
• You can also tell what it is related to taxonomically, because sometimes the common name isn’t very useful either!
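The organism and its taxonomic lineage are also printed in the record itself, so they can be pulled out without following the link. The sketch below parses a shortened, illustrative excerpt laid out like the SOURCE/ORGANISM block of a GenBank flat file. The excerpt is not copied from a real record and the lineage is abbreviated.

```python
# Abbreviated, illustrative excerpt in GenBank flat-file layout (not a real record).
RECORD_EXCERPT = """\
SOURCE      Medicago sativa (alfalfa)
  ORGANISM  Medicago sativa
            Eukaryota; Viridiplantae; Streptophyta; Fabales; Fabaceae;
            Medicago.
REFERENCE   1
"""


def organism_and_lineage(flatfile_text):
    """Return (organism name, list of lineage ranks) from a flat-file excerpt."""
    organism = None
    lineage_parts = []
    in_organism = False
    for line in flatfile_text.splitlines():
        if line.startswith("  ORGANISM"):
            organism = line[len("  ORGANISM"):].strip()
            in_organism = True
        elif in_organism and line.startswith(" " * 12):
            # Continuation lines of the ORGANISM block hold the lineage.
            lineage_parts.append(line.strip())
        elif in_organism:
            break  # next keyword reached, lineage is complete
    lineage = " ".join(lineage_parts).rstrip(".")
    ranks = [rank.strip() for rank in lineage.split(";") if rank.strip()]
    return organism, ranks


name, ranks = organism_and_lineage(RECORD_EXCERPT)
print(name)
print(" > ".join(ranks))
```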
Taxonomy Link
Advanced Neighbors: BLink
What is BLink
• BLink = BLAST Link
• Someone has already done a BLAST search, and you can just retrieve it!
• BLink displays the graphical output of pre-computed blastp results against the protein non-redundant (nr) database.
This graphical output includes:
• Alignment of up to 200 BLAST hits on the query sequence
• Best hits to each organism
• List of known protein domains in the query sequence
• Filter hits by selecting the BLAST cutoff score
• Distribution of hits by taxonomic grouping
• Display of similar sequences with known 3D structure
• Filter hits by database and/or by taxonomic grouping
• Display a taxonomic tree of all organisms with similar sequences
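Two of the BLink views, the score cutoff and the best hit per organism, are simple operations on a hit list. The sketch below illustrates the idea on invented data. The accessions, organisms, and scores are placeholders, and this is not how BLink itself is implemented.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "accession organism score")

# Invented hits for illustration; accessions and scores are placeholders.
HITS = [
    Hit("P_0001", "Arabidopsis thaliana", 820),
    Hit("P_0002", "Arabidopsis thaliana", 610),
    Hit("P_0003", "Homo sapiens", 450),
    Hit("P_0004", "Saccharomyces cerevisiae", 95),
]


def filter_by_score(hits, cutoff):
    """Keep only hits whose score is at or above the cutoff."""
    return [hit for hit in hits if hit.score >= cutoff]


def best_hit_per_organism(hits):
    """Return a dict mapping each organism to its highest-scoring hit."""
    best = {}
    for hit in hits:
        current = best.get(hit.organism)
        if current is None or hit.score > current.score:
            best[hit.organism] = hit
    return best


strong = filter_by_score(HITS, cutoff=100)
for organism, hit in best_hit_per_organism(strong).items():
    print(f"{organism}: {hit.accession} (score {hit.score})")
```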
PopSet Links
• The PopSet database contains aligned sequences submitted as a set resulting from a population, phylogenetic, or mutation study.
• These alignments describe such events as evolution and population variation.
• The PopSet database contains both nucleotide and protein sequence data.
Protein Neighbors->PopSet Links
Protein Neighbors->Genome Links
PopSet search results
• The results of a PopSet search
• The PopSet database includes alignments of genes from multiple organisms OR different gene families OR mutational analyses
PopSet Entry
• The PopSet entry includes:
– The title of the paper/study
– The length of the sequence(s) aligned
– The number of aligned sequences
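The two numbers reported in a PopSet entry, how many sequences are aligned and how long the alignment is, can be recomputed from any aligned FASTA file. The sketch below uses a tiny invented alignment. The sequence names and bases are hypothetical, and gaps are written as "-".

```python
# Tiny invented alignment in FASTA format; gaps are written as "-".
ALIGNED_FASTA = """\
>seq1 hypothetical
ATG-CCGTA
>seq2 hypothetical
ATGACCGTA
>seq3 hypothetical
ATG-CC-TA
"""


def read_fasta(text):
    """Parse FASTA text into a dict of {sequence name: sequence string}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def alignment_summary(records):
    """Return (number of aligned sequences, alignment length in columns)."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("sequences in an alignment must share one length")
    return len(records), lengths.pop()


count, length = alignment_summary(read_fasta(ALIGNED_FASTA))
print(f"{count} aligned sequences, {length} columns")
```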
PopSet Entry without alignment
• The PopSet entry without an alignment includes:
– Title of the study
– The number of sequences included
– Links to the sequences
Entrez Structures
Protein structures are also stored in databases.
http://bmbiris.bmb.uga.edu/wampler/tutorial/prot0.html is a useful review tutorial.
Entrez links to structure databases
• The Structure database or Molecular Modeling Database (MMDB) contains experimental data from crystallographic and NMR structure determinations.
• The data for MMDB are obtained from the Protein Data Bank (PDB).
• The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy.
• Use Cn3D, the NCBI 3D structure viewer, for easy interactive visualization of molecular structures from Entrez.
Structure Search results
• Protein structures are also stored in a database
• Search as before
• Your search results are similar
Structure Entry
• The structure entry has links to the other databases
• It also lets you download a file to open with a structure viewer program
• Proteins with similar structures and functions have been identified in the databases
BLink: Advanced Protein Neighbors
BLink: Related Structures
Viewing Structure in Cn3D
• You can download Cn3D (a structural viewer program) from NCBI
• This will allow you to view the structures from the structure database
Cn3D Text Window
• The Text window of Cn3D will align two or more proteins so you can compare the structure of multiple proteins
BLink: Human Homologue
Human RefSeqs: Genome Reagents
MMDB: Molecular Modeling Database
• Derived from experimentally determined PDB records
• Value added to PDB records including:
– Addition of explicit chemical graph information
– Validation
– Inclusion of Taxonomy, Citation, and other information
– Conversion to ASN.1 data description language
• Structure neighbors determined by Vector Alignment Search Tool (VAST)
Structure Summary
Cn3D viewer
Conserved Domains
3D Domain Neighbors
Structure Neighbors
Cn3D 4.1
Cn3D 4.1: Structural Alignment
Casein kinase S. pombe
Src Kinase H. sapiens
Conserved ATP binding site
Cn3D: Simple Homology Modeling
human
swordtail
Using Cn3D to model domains
Other services and databases from the NCBI
• LocusLink links to all available information from NCBI and beyond for a few well-characterized model organisms.
• LocusLink is a great starting point: it collects key information on each gene/protein from major databases. It now covers 8 organisms.
• RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_007635)
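RefSeq accession prefixes tell you what kind of record you are looking at before you open it. The sketch below classifies an accession by its prefix. NM_ and NP_ come from the examples above, XP_ is the genome-based protein prefix used for RefSeq, and XM_ is added here by analogy as an assumption. The table is deliberately incomplete, and `classify_accession` is a hypothetical helper.

```python
# Partial prefix table; RefSeq defines more prefixes than are listed here.
REFSEQ_PREFIXES = {
    "NM_": "RefSeq mRNA",
    "NP_": "RefSeq protein",
    "XM_": "RefSeq model mRNA (assumed, by analogy with XP_)",
    "XP_": "RefSeq model protein (genome based)",
}


def classify_accession(accession):
    """Return a short description of a RefSeq accession based on its prefix."""
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "prefix not covered by this sketch")


for accession in ("NM_006744", "NP_007635", "U12345"):
    print(accession, "->", classify_accession(accession))
```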
Locus Links
• Results of a LocusLink search include:
– Locus ID
– Species
– Locus symbol
– Locus name
– Locus location
– Links
• Protein Database
• OMIM
• Reference Sequence
• Related GenBank Sequences
• Homologene Data
• UniGene
• Variation Data
LocusLink: Selected Higher Genomes
[Screenshot labels: OMIM, RefSeq, GenBank, dbSNP, UniGene, Full report, PubMed, HomoloGene, Map Viewer, Protein]
Protein Database
• The Protein database contains sequence data from the translated coding regions of DNA sequences in GenBank, EMBL, and DDBJ, as well as protein sequences submitted to:
– Protein Information Resource (PIR)
– SWISS-PROT
– Protein Research Foundation (PRF)
– Protein Data Bank (PDB) (sequences from solved structures)
NCBI Protein Databases
• GenPept: GenBank, EMBL, DDBJ CDS translations
• RefSeq: mRNA based (NP_) and genome based (XP_)
• Swiss-Prot: curated, high-quality protein reviews
• PIR: Protein Information Resource, Georgetown University
• PRF: Protein Research Foundation
• PDB: Protein Data Bank, sequences from structures
Entrez Protein
• GenPept (GenBank, EMBL, DDBJ) 3,442,298
• RefSeq 856,191
• Third Party Annotation 3,834
• Swiss-Prot 144,508
• PIR 282,821
• PRF 12,079
Total 3,442,298
BLAST nr 1,642,191
Protein Link
BLAST Link
Conserved Domains
Related Proteins: Redundancy
Redundant Sequences
Sequence from MutL structure
Related Proteins: Links
BLink: non-redundant relatives
Arabidopsis homolog
Conserved Domain
MLH1 Domain Structure: CDD
ATPase Domain Mismatch Repair Domain
MLH1: ATPase Domain
1BGQ: ATPase Domain in Cn3D
Yeast HSP90
ATP Binding site helix
Variations Human MLH1
BLink
Finding structural models
Mapping Variation Onto Structure
Bacterial DNA mismatch repair proteins
Loads sequence alignment and structure in Cn3D
Mapping Variation Onto Structure
Conserved Asn
AsnIle
Ile – Val
NCBI Genome Databases
• The Genome database provides views for a variety of genomes, complete chromosomes, sequence maps with contigs, and integrated genetic and physical maps.
Microbial Genomes
ZWF
Genome search results
• Genome Search Results
• The Genome database includes full (and some partial) genomes from viruses to complex organisms
Genome Entry
• Genome entries include:
– Maps of the genome
– Links to the sequence
– The organism for the genome
Genes Database: All Genomes
Coming soon!
But wait! There’s more!
• There is even more at NCBI than I have covered here.
• This site map is also a guide to NCBI resources. Each link leads to a brief description of the resource on this page, then to the resource itself. http://www.ncbi.nlm.nih.gov/Sitemap/
There are many bioinformatics servers outside NCBI.
• Try ExPASy’s sequence retrieval system at http://www.expasy.ch/
• (ExPASy = Expert Protein Analysis System)
• Or try Ensembl at www.ensembl.org, a leading web browser for the human genome.