Lecture 2.2 Retrieving Information: Using Entrez

Upload: lisa-cummings

Post on 13-Dec-2015

Lecture 2.2 1

Retrieving Information: Using Entrez

Lecture 2.2 2

Retrieving information: how it works

• Servers have the records you want.
• You need to understand the data they have, and how it is organized.
• There are often many ways to get to an answer.
• Route to get there is not always obvious, but you need to think of alternatives and traps.
• Use some query language – each system has its own.
• Retrieve data in a specified format.
• Save it in a way that will be useful to you.
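The "query language" and "specified format" steps above can be sketched with NCBI's E-utilities, the programmatic interface to Entrez. The slides themselves only show the web interface, so treat this as an illustration rather than the lecture's method. The helper names `esearch_url` and `efetch_url` are hypothetical. The code only builds request URLs and makes no network call.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an assumption: not shown on the slides).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build a search URL: which database, and the query in Entrez syntax."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?" + urlencode(params)


def efetch_url(db, ids, rettype, retmode="text"):
    """Build a retrieval URL: which records, and the format to return them in."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    return f"{EUTILS}/efetch.fcgi?" + urlencode(params)


# Query step: an author search in PubMed.
print(esearch_url("pubmed", '"ouellette bf"[au] AND yeast'))
# Format step: ask for a nucleotide record as FASTA text.
print(efetch_url("nucleotide", ["L12345"], "fasta"))
```

Fetching the URL and writing the response to a file is left out here. That last step is where you choose a file name and format that will still make sense to you later.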

Lecture 2.2 3

What you may be looking for:

• Did a BLAST search, and you need more info about some of the proteins it found similarities to.
• Heard about a disease gene that was recently discovered, and you want to know more about it.
• Want to build a dataset for local BLAST searches.
• A colleague wants you to do an alignment of all sequences from a given protein family.

Lecture 2.2 4

What you are looking for:

• PubMed paper from author X
• Sequence from gene X in organism Y
• All information about organelle W in model organism Y
• All information about disease X in humans
• Orthologs of those disease genes in other model organisms

Lecture 2.2 5

Central Dogma: NCBI version

DNA → RNA → protein → write a paper about it

Lecture 2.2 6

Entrez: Pathway to Discovery (1993)

[Figure: three databases and the links between them]

• Databases: MEDLINE abstracts, nucleotide sequences, protein sequences
• MEDLINE to MEDLINE: term frequency statistics
• Nucleotide to nucleotide: nucleotide sequence similarity
• Protein to protein: amino acid sequence similarity
• Nucleotide to protein: coding region features
• Nucleotide to MEDLINE: literature citations in sequence databases
• Protein to MEDLINE: literature citations in sequence databases

Lecture 2.2 7

Related Articles

Type in your last name and find a paper from one of your teammates.

Lecture 2.2 8

Hard link: DNA to protein

L12345

Lecture 2.2 9

From Fig. 1 of "Entrez search and retrieval system", Jim Ostell, Chapter 14, the NCBI Handbook.

2003

Lecture 2.2 13

Ctrl-F

Lecture 2.2 15

Getting started in Entrez

Lecture 2.2 16

"ouellette bf" [au] AND yeast

Lecture 2.2 20

MeSH: Medical Subject Headings

Lecture 2.2 21

A query

• Word <free text>: too many hits
  – More words (the Boolean ‘AND’ is the default)
  – Limit query to specified field
  – Limit query in time
  – Do Boolean on queries
    • #1 AND #2
    • #3 NOT #5
    • #7 OR #8
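The narrowing steps above (add a field tag, then combine terms or earlier queries with Boolean operators) amount to building a query string. A minimal sketch follows, assuming the `[au]` tag and `#N` history references work as the slides show. The helper names `tagged` and `combine` are hypothetical and belong to no NCBI library.

```python
def tagged(term, field=None):
    """Quote multi-word terms and append an Entrez-style [field] tag."""
    text = f'"{term}"' if " " in term else term
    return f"{text}[{field}]" if field else text


def combine(op, *parts):
    """Join terms, or #N references to earlier queries, with one operator."""
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported Boolean operator: {op}")
    return "(" + f" {op} ".join(parts) + ")"


# A field-limited author search narrowed by a second word.
print(combine("AND", tagged("hieter p", "au"), tagged("cdc16p")))
# Boolean on earlier queries, as in the '#3 NOT #5' bullet.
print(combine("NOT", "#3", "#5"))
```

Writing the operator out explicitly, instead of relying on the default AND, makes it easier to see later what a saved query actually asked for.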

Lecture 2.2 22

hieter p [au]

Lecture 2.2 23

Limit in Time: 1993-01-01 1993-12-31
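The same one-year limit can be expressed as request parameters outside the web form. The sketch below assumes the E-utilities `esearch` parameters `mindate`, `maxdate` and `datetype`, which the slides do not show. The helper name `date_limited_params` is hypothetical. Nothing is sent over the network.

```python
from datetime import date
from urllib.parse import urlencode


def date_limited_params(term, start, end, db="pubmed"):
    """Return search parameters restricted to a publication-date window."""
    return {
        "db": db,
        "term": term,
        "datetype": "pdat",  # limit by publication date
        "mindate": start.strftime("%Y/%m/%d"),
        "maxdate": end.strftime("%Y/%m/%d"),
    }


# The slide's limit: everything published during 1993.
params = date_limited_params("hieter p[au]", date(1993, 1, 1), date(1993, 12, 31))
query_string = urlencode(params)
print(query_string)
```

Using `date` objects rather than hand-typed strings means an impossible date such as 31 November fails immediately instead of silently returning no hits.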

Lecture 2.2 25

No abstract

With abstract

Full Text on-line

Full Text in PubMed Central

Lecture 2.2 26

boguski m [au]: 99 results

boguski ms [au]: 80 results

Lecture 2.2 27

#24 NOT #23: 19 results
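The counts fit together as a set difference: 99 records for `boguski m [au]`, 80 for `boguski ms [au]`, and 19 left after the NOT. A small sketch with Python sets follows. It assumes that #24 is the broader `boguski m` search, that #23 is `boguski ms`, and that every `ms` record is also an `m` record, which the arithmetic suggests but the slides do not state. The integers are placeholders, not real PubMed IDs.

```python
def history_not(keep, drop):
    """Entrez-style 'A NOT B': records in A that are not in B."""
    return keep - drop


# Placeholder record IDs (NOT real PMIDs), sized to match the slide's counts.
set_23 = set(range(1, 81))   # 80 records: boguski ms [au]
set_24 = set(range(1, 100))  # 99 records: boguski m [au]

only_other_initials = history_not(set_24, set_23)
print(len(only_other_initials))  # 19
```

The 19 leftover records are the ones worth inspecting by hand: they may be the same author indexed without a middle initial, or a different person altogether.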

Lecture 2.2 29

Other types of links in Entrez

• The next slides explore other kinds of things linked into Entrez records.

Lecture 2.2 30

"hieter p" [au] cdc16p

Lecture 2.2 39

“Books”

Lecture 2.2 40

(2)

Lecture 2.2 46

Link to Genome View of Chromosome I

Lecture 2.2 49

RefSeq

• RefSeq represents the NCBI curated “reference sequences” for all ‘worked’ genomes.
• Historically, these used to be referred to as “GenBank-Gold”.
• RefSeq records are either genomic, mRNA or protein sequences.
• Not all sequences are in RefSeq.
• All RefSeq sequences are assembled/taken from things in GenBank.

Lecture 2.2 50

Some of the features of RefSeq:

• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current knowledge of sequence data and biology
• data validation and format consistency
• distinct accession series
• ongoing curation by NCBI staff and collaborators, with review status indicated on each record

Lecture 2.2 51

Accession number space

• GenBank:
  – 1+5 (L12345, U00001)
  – 2+6 (AF000001, AC000003)
  – 4+2+6 (WGS)
• All have accession.version
• Protein:
  – 1+5 (SwissProt/UniProt)
  – 3+5 (GenPept)
• All have accession.version
• RefSeq:
  – N*_12345
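The "letters + digits" shapes above can be checked mechanically. The sketch below is one reading of the slide's notation (for example, 1+5 means one letter followed by five digits) and not an official NCBI validator. The names `SHAPES`, `split_version` and `shape_of` are hypothetical. A 1+5 identifier can be either a GenBank nucleotide record or a SwissProt/UniProt protein, so the shape alone does not settle which database a record comes from.

```python
import re

# Patterns follow the slide's notation; RefSeq is recognised by its underscore.
SHAPES = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_[A-Z]{0,4}\d+$")),
    ("1+5", re.compile(r"^[A-Z]\d{5}$")),
    ("2+6", re.compile(r"^[A-Z]{2}\d{6}$")),
    ("3+5", re.compile(r"^[A-Z]{3}\d{5}$")),
    ("4+2+6", re.compile(r"^[A-Z]{4}\d{2}\d{6}$")),
]


def split_version(identifier):
    """Split 'accession.version' into (accession, version or None)."""
    accession, _, version = identifier.partition(".")
    return accession, int(version) if version else None


def shape_of(identifier):
    """Name the accession shape, ignoring any trailing .version."""
    accession, _ = split_version(identifier.strip().upper())
    for name, pattern in SHAPES:
        if pattern.match(accession):
            return name
    return "unknown"


for example in ["L12345", "AF000001.1", "AAA12345", "NM_123456"]:
    print(example, shape_of(example), split_version(example))
```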

Lecture 2.2 52

RefSeq Accession Number Space

Accession     Molecule  Description
NC_123456     Genomic   Complete genomic molecules including genomes, chromosomes, organelles, plasmids.
NG_123456     Genomic   Incomplete genomic region; supplied to support the NCBI Genome Annotation pipeline.
NM_123456     mRNA
NR_123456     RNA       Non-coding transcripts including structural RNAs, transcribed pseudogenes, and others.
NP_123456     Protein
NP_12345678   Protein   Planned expansion of accession series.

Lecture 2.2 53

Automated Assemblies

Accession     Molecule  Description
NT_123456     Genomic   Intermediate genomic assemblies of BAC sequence data.
NW_123456     Genomic   Intermediate genomic assemblies of Whole Genome Shotgun sequence data.

Lecture 2.2 54

Model RefSeq records

Accession     Molecule  Description
XM_123456     mRNA      Model mRNA provided by the Genome Annotation process; sequence corresponds to the genomic contig.
XR_123456     RNA       Model non-coding transcripts provided by the Genome Annotation process; sequence corresponds to the genomic contig.
XP_123456     Protein   Model proteins provided by the Genome Annotation process; sequence corresponds to the genomic contig.

Lecture 2.2 55

WGS special case

Accession         Molecule  Description
NZ_ABCD12345678   Genomic   A collection of whole genome shotgun sequence data for a project. Accessions are not tracked between releases. The first four characters following the underscore (e.g. 'ABCD') identify a genome project.
ZP_12345678       Protein   Proteins annotated on NZ_ accessions (often via computational methods).
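The RefSeq prefixes from the last four slides can be gathered into one lookup table. The sketch below copies the prefixes and molecule types from the slides and uses the slide titles as group labels. The names `REFSEQ_PREFIXES` and `describe_refseq` are hypothetical, and the table covers only the prefixes listed in this lecture.

```python
# prefix -> (molecule type, slide group), transcribed from the tables above.
REFSEQ_PREFIXES = {
    "NC": ("Genomic", "RefSeq"),
    "NG": ("Genomic", "RefSeq"),
    "NM": ("mRNA", "RefSeq"),
    "NR": ("RNA", "RefSeq"),
    "NP": ("Protein", "RefSeq"),
    "NT": ("Genomic", "Automated Assemblies"),
    "NW": ("Genomic", "Automated Assemblies"),
    "XM": ("mRNA", "Model RefSeq records"),
    "XR": ("RNA", "Model RefSeq records"),
    "XP": ("Protein", "Model RefSeq records"),
    "NZ": ("Genomic", "WGS special case"),
    "ZP": ("Protein", "WGS special case"),
}


def describe_refseq(accession):
    """Return (molecule, group) for a RefSeq-style accession, else None."""
    prefix, underscore, _ = accession.strip().upper().partition("_")
    if not underscore:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)


for example in ["NM_123456", "XP_123456", "NZ_ABCD12345678", "L12345"]:
    print(example, describe_refseq(example))
```

A quick check like this helps when filtering search results, for instance to keep the curated NM_ and NP_ records and set aside the XM_ and XP_ models.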

Lecture 2.2 56

Download all the data

Entrez and RefSeq

Lecture 2.2 60

LocusLink

Lecture 2.2 61

Things to watch out for:
