genomics and personalized health care databases bailee ludwig quality management

Post on 12-Jan-2016

Genomics and Personalized Health Care

Databases

Bailee Ludwig

Quality Management

Molecular Biology Databases

• Excellent means of storing a vast amount of information in a central, shareable location

• Biological databases are designed specifically for storing, searching, and retrieving biological data
  – Keyword searches
  – Cross-referencing
  – 3D capabilities

Database Categories

• Categories
  – Nucleotide Sequence Databases
    • Gene Databases
    • Genome Databases
  – Protein Sequence Databases
  – Structure Databases
  – Metabolic and Signaling Pathways
  – Human Genes and Diseases
  – Microarray Data and other Expression Databases
  – …

• Each contains specific information
• All are interrelated

Nucleotide & Protein Sequence Databases

National Center for Biotechnology Information (NCBI)

• Created as part of the National Library of Medicine in 1988 to:
  – Establish public databases
  – Perform research in computational biology
  – Develop software tools for sequence analysis
  – Disseminate biomedical information

• Databases
  – Sequence, such as GenBank, RefSeq, dbSNP
  – Literature, such as PubMed, OMIM

• Tools
  – Entrez, BLAST, Cn3D, etc.

NCBI Homepage

NCBI Site Map

All Databases at NCBI:

Let’s Check out NCBI

• http://www.ncbi.nlm.nih.gov/sites/gquery?itool=toolbar

Multiple ways to find Genes…

Let’s Look at BRCA1
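The same kind of gene lookup can also be written as a query URL. The following is a minimal sketch, not part of the original slides: it assumes the NCBI E-utilities ESearch endpoint and the Entrez `gene` database, and the helper name `build_esearch_url` is hypothetical. It only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities ESearch endpoint (not mentioned in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Return an ESearch URL for `term` in Entrez database `db` (hypothetical helper)."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{ESEARCH}?{query}"


if __name__ == "__main__":
    # Field tags in square brackets restrict the search to one Entrez field.
    print(build_esearch_url("gene", "BRCA1[Gene Name] AND Homo sapiens[Organism]"))
```

Pasting the printed URL into a browser should return a list of matching record IDs; the field-tagged search term is one possible way to narrow the result to the human gene.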

GenBank

http://www.ncbi.nlm.nih.gov/Genbank/

GenBank

• Nucleotide-only sequence database

• GenBank data
  – Direct submissions of individual records (BankIt, Sequin)
  – Batch submissions via email (EST, GSS, STS)
  – FTP accounts established for sequencing centers

• Data shared nightly among three collaborating databases:
  – GenBank
  – DNA Data Bank of Japan (DDBJ)
  – European Molecular Biology Laboratory Database (EMBL)

[Figure: Growth of GenBank (1982-2009). Series: Base Pairs and Sequences. Axes: Year; Sequences (millions); Base Pairs (billions).]

GenBank Release 175.0

• ftp://ftp.ncbi.nih.gov/genbank/
• Full release every two months
• Incremental and cumulative updates daily

• Release 175.0 (12/15/2009)
  – 112,910,950 Sequences
  – 110,118,557,163 Bases
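Dividing the two release figures gives a feel for the typical record size. A small sketch using only the numbers quoted for Release 175.0 (the function name is illustrative):

```python
# Figures quoted above for GenBank Release 175.0.
SEQUENCES = 112_910_950
BASES = 110_118_557_163


def mean_sequence_length(bases: int, sequences: int) -> float:
    """Average number of bases per sequence record."""
    return bases / sequences


if __name__ == "__main__":
    print(f"{mean_sequence_length(BASES, SEQUENCES):.1f} bases per record")
```

The average comes out at roughly 975 bases per record, which is consistent with a database dominated by short entries such as ESTs alongside a smaller number of very long genomic records.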

NCBI Reference Sequences

GenBank Record (Header)

LOCUS NM_001963 4913 bp mRNA linear PRI 20-SEP-2009

DEFINITION Homo sapiens epidermal growth factor (beta-urogastrone) (EGF), mRNA.

ACCESSION NM_001963

VERSION NM_001963.3 GI:166362727

KEYWORDS .

SOURCE Homo sapiens (human)

ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo.

REFERENCE 1 (bases 1 to 4913)

AUTHORS Hosgood,H.D. III, Menashe,I., He,X., Chanock,S. and Lan,Q.

TITLE PTEN identified as important risk factor of chronic obstructive pulmonary disease

JOURNAL Respir Med (2009) In press

PUBMED 19625176

REMARK GeneRIF: Observational study of gene-disease association.
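The header is a keyword/value layout, so it is easy to read programmatically. The sketch below is an illustration only: it assumes the usual flat-file convention that a keyword sits in the first 12 columns and that continuation lines start with blanks, and the column spacing in `SAMPLE` is reconstructed by hand from the record above. The function name `parse_genbank_header` is hypothetical; for real work a maintained parser is a safer choice.

```python
# Assumption: keywords occupy the first 12 columns; continuation lines are indented.
SAMPLE = """\
LOCUS       NM_001963               4913 bp    mRNA    linear   PRI 20-SEP-2009
DEFINITION  Homo sapiens epidermal growth factor (beta-urogastrone) (EGF),
            mRNA.
ACCESSION   NM_001963
VERSION     NM_001963.3  GI:166362727
KEYWORDS    .
SOURCE      Homo sapiens (human)
"""


def parse_genbank_header(text: str) -> dict[str, str]:
    """Collect header keywords and their values, stopping at ORIGIN."""
    fields: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line[:12].strip(), line[12:].strip()
        if key == "ORIGIN":
            break  # the sequence block starts here
        if key:
            if key in fields:
                current = None  # later repeats (e.g. a second REFERENCE) are skipped
            else:
                fields[key] = value
                current = key
        elif current is not None:
            # Continuation line: append to the keyword currently being read.
            fields[current] = f"{fields[current]} {value}".strip()
    return fields


if __name__ == "__main__":
    header = parse_genbank_header(SAMPLE)
    print(header["ACCESSION"], "|", header["DEFINITION"])
```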

Summary

GenBank Record (Sequence)

ORIGIN
        1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
       61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
      121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt
      181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc
      241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga
      301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag
      361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg
      421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc
      481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg
      541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt
      601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg
      661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt
      721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga
      781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag
      841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt
      901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa
      961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg

FASTA: Sequence Format
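FASTA is a simpler layout than the GenBank flat file: one header line starting with ">" followed by the bare sequence. A minimal sketch of the conversion, assuming the input is the text that follows the ORIGIN keyword; the function name, the header wording, the 70-character line width, and the upper-casing are all illustrative choices rather than requirements.

```python
import re


def origin_to_fasta(origin_block: str, header: str, width: int = 70) -> str:
    """Turn the text after ORIGIN into a FASTA record (hypothetical helper)."""
    # Drop position numbers, spaces, and line breaks; keep only the letters.
    residues = re.sub(r"[^A-Za-z]", "", origin_block).upper()
    # Re-wrap the bare sequence at a fixed width.
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    block = (
        "1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc\n"
        "61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt\n"
    )
    print(origin_to_fasta(block, "NM_001963.3 Homo sapiens EGF, mRNA"), end="")
```

Because every non-letter is discarded, the function also copes with position numbers that have run into the preceding bases.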

Sequence Viewer Graphics

RefSeq

RefSeq

• Database of reference sequences
  – http://www.ncbi.nlm.nih.gov/RefSeq/

• Curated
  – Many experimentally validated
  – Some partially validated via ESTs
  – Some computationally predicted

• Non-redundant: one record for each gene, or each splice variant, from each organism represented

Accession Numbers

• DNA sequences and other molecular data records are tagged with accession numbers, which uniquely identify a sequence or other record

• RefSeq provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.

• RefSeq identifiers include the following formats:
  – Complete chromosome: NC_######
  – Genomic contig: NT_######
  – mRNA (DNA format): NM_######
  – Protein: NP_######

Accession Numbers: More Examples

Accession        Molecule  Description
AC_123456        Genomic   Alternate complete genomic
AP_123456        Protein   Protein products; alternate
NG_123456        Genomic   Incomplete genomic regions
NR_123456        RNA       Non-coding transcripts
NW_123456        Genomic   Genomic assemblies
NZ_ABCD12345678  Genomic   Whole genome shotgun data
XM_123456        mRNA      Transcript products
XP_123456        Protein   Protein products
XR_123456        RNA       Transcript products
YP_123456        Protein   Protein products
ZP_12345678      Protein   Protein products
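Since the two-letter prefix carries the record type, an accession can be classified without looking anything up. The sketch below encodes only the prefixes listed on these slides; the function name `describe_refseq` and the accepted pattern (two letters, underscore, optional letters, digits, optional version) are assumptions, and the "Genomic" label for NC_ and NT_ is inferred from their descriptions.

```python
import re

# Prefix -> (molecule type, description), taken from the slide tables above.
REFSEQ_PREFIXES = {
    "NC": ("Genomic", "Complete chromosome"),
    "NT": ("Genomic", "Genomic contig"),
    "NM": ("mRNA", "mRNA (DNA format)"),
    "NP": ("Protein", "Protein"),
    "AC": ("Genomic", "Alternate complete genomic"),
    "AP": ("Protein", "Protein products; alternate"),
    "NG": ("Genomic", "Incomplete genomic regions"),
    "NR": ("RNA", "Non-coding transcripts"),
    "NW": ("Genomic", "Genomic assemblies"),
    "NZ": ("Genomic", "Whole genome shotgun data"),
    "XM": ("mRNA", "Transcript products"),
    "XP": ("Protein", "Protein products"),
    "XR": ("RNA", "Transcript products"),
    "YP": ("Protein", "Protein products"),
    "ZP": ("Protein", "Protein products"),
}

# Assumed shape: PREFIX_[letters]digits[.version], e.g. NM_001963.3
ACCESSION_RE = re.compile(r"^([A-Z]{2})_[A-Z]{0,4}\d+(?:\.\d+)?$")


def describe_refseq(accession: str):
    """Return (molecule, description) for a RefSeq-style accession, else None."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


if __name__ == "__main__":
    for acc in ("NM_001963.3", "NZ_ABCD12345678", "U12345"):
        print(acc, "->", describe_refseq(acc))
```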

EST

EST

• Expressed Sequence Tags database (dbEST) is a division of GenBank that contains sequence data and other information on "single-pass" cDNA sequences, or "Expressed Sequence Tags", from a number of organisms

• http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucest&cmd=search&term=

EST

• mRNA: genomic regions actively transcribed in the cell

• cDNA (complementary DNA)
  – Copy of mRNA, made using the mRNA as a template
  – Sequence is complementary to the mRNA

• EST: Expressed Sequence Tag (a short sub-sequence of a transcribed cDNA sequence)
  – Partial cDNA sequence
  – Can be 5’ or 3’
  – Typical size: 200-500 bp
  – Represents mRNA actively transcribed in the cell
  – Used to identify:
    • Genes; alternative splicing; etc.
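"Complementary" can be made concrete with a few lines of code. This sketch assumes a plain A/C/G/U string for the mRNA and returns the first-strand cDNA written 5'→3', i.e. the reverse complement in the DNA alphabet; the function name is illustrative.

```python
# Pairing rules for copying RNA into DNA: A->T, U->A, G->C, C->G.
RNA_TO_CDNA = str.maketrans("AUGC", "TACG")


def first_strand_cdna(mrna: str) -> str:
    """Return the first-strand cDNA (5'->3') for an mRNA given 5'->3'."""
    complement = mrna.upper().translate(RNA_TO_CDNA)
    # The copied strand runs antiparallel, so reverse it to read 5'->3'.
    return complement[::-1]


if __name__ == "__main__":
    print(first_strand_cdna("AUGGCC"))
```

Note that database records such as NM_001963 store the transcript in the DNA alphabet and in the same orientation as the mRNA, so this conversion matters mainly when reasoning about the laboratory cDNA strand.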

Access to dbEST Data

• EST sequences are included in the EST division of GenBank, available from NCBI by anonymous ftp and through Entrez

• The nucleotide sequences may be searched using the BLAST server
  – The TBLASTN program, which takes an amino acid query sequence and compares it with six-frame translations of dbEST DNA sequences, is particularly useful.

• EST sequences are also available as a flat file in the FASTA format by anonymous ftp in the /repository/dbEST directory at ftp.ncbi.nih.gov
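The "six-frame translation" that TBLASTN relies on can be sketched directly: three reading frames on the given strand and three on its reverse complement. The code below is an illustration of the idea, not TBLASTN's implementation; it assumes the standard genetic code, marks stop codons with "*" and unrecognized codons with "X", and all names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # X for codons with ambiguous bases
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict[str, str]:
    """Return translations for frames +1..+3 and -1..-3."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Translating all six frames is what lets a protein query find matches in ESTs, whose strand and reading frame are generally unknown.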

UniGene

UniGene

• www.ncbi.nlm.nih.gov/UniGene

• Each UniGene entry is a set of transcript sequences that appear to come from the same transcription locus (gene or expressed pseudogene)

• In addition to sequences of well-characterized genes, hundreds of thousands of novel expressed sequence tag (EST) sequences have been included.

• UniGene may be of use as a resource for gene discovery.

• UniGene has also been used by experimentalists to select reagents for gene mapping projects and large-scale expression analysis.

Numbers of UniGene Entries

• Bos taurus (cow) 42,843
• Canis lupus familiaris (dog) 27,853
• Equus caballus (horse) 8,348
• Homo sapiens (human) 123,396
• Mus musculus (mouse) 78,289
• Ovis aries (sheep) 18,814
• Rattus norvegicus (Norway rat) 63,434
• Sus scrofa (pig) 51,576
• Danio rerio (zebrafish) 51,481

UniGene

• UniGene is a useful tool to look up information about expressed genes

• UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression

Protein Structure

• Now…

Let’s Give these databases a closer look with a Lab
