ppt - biostatistics - johns hopkins bloomberg school of public health

56
Carlo Colantuoni & Rafael Irizarry April 19, 2006 [email protected] Gene Annotation in Genomics Experiments With a Focus on Tools in BioConductor

Upload: medresearch

Post on 19-Jun-2015

1.794 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Carlo Colantuoni&

Rafael Irizarry

April 19, 2006

[email protected]

Gene Annotation in Genomics Experiments With a Focus on Tools in BioConductor

Page 2: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Biological Setup

Every cell in the human body contains the entire human genome: 3.3 Gb or ~30K genes.

The investigation of gene expression is meaningful because different cells, in different environments, doing different jobs express different genes.

Tasks necessary for gene expression analysis:

Define what a gene is.

Identify genes in a sea of genomic DNA where <3% of DNA is contained in genes.

Design and implement probes that will effectively assay expression of ALL (most? many?) genes simultaneously. Cross-reference these probes.

Page 3: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Cellular Biology, Gene Expression, and Microarray Analysis

DNA

RNA

Protein

Page 4: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

AAAAA

Gene: Protein coding unit of genomic DNA with an mRNA intermediate.

START STOPprotein coding

5’ UTR 3’ UTR

mRNA

GenomicDNA 3.3 Gb

DNAProbe

~30K genes

Sequence is a Necessity

Page 5: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

From Genomic DNA to mRNA Transcripts

EXONS INTRONS

RNA editing & SNPs

Alternative splicingAlternative start & stop sites in same RNA molecule

~30K

>30K

Transcript coverage Homology to other transcripts

Hybridization dynamics 3’ bias

Protein-coding genes are not easy to find - gene density is low, and exons are interrupted by introns.

Page 6: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Sequence Quality!

Redundancy!

Completeness?

Unsurpassed as source of expressed sequence

Chaos?!?

Page 7: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

From Genomic DNA to mRNA Transcripts

~30K

>30K

>>30K

Page 8: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Transcript-BasedGene-Centered Information

Page 9: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Possible mis-referencing:Genomic GenBank Acc.#’sReferenced ID has more NT’s than probeOld DB buildsDB or table errors – copying and pasting 30K rows in excel …

Using RefSeq’s can help.

Design of Gene Expression Probes

Content: UniGene, Incyte, Celera Expressed vs. Genomic

Source: cDNA libraries, clone collections, oligos

Cross-referencing of array probes (across platforms):

Sequence <> GenBank <> UniGene <> HomoloGene

Page 10: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

From Genomic DNA to mRNA Transcripts

Page 11: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health
Page 12: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health
Page 13: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

From Genomic DNA to mRNA Transcripts

Page 14: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health
Page 15: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

http://www.ncbi.nlm.nih.gov/Entrez/

Page 16: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Functional Annotation of Lists of Genes

KEGGPFAM

SWISS-PROTGO

DRAGONDAVID

BioConductor

Page 17: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Analysis of Functional Gene Groups

Page 18: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health
Page 19: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health
Page 20: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health
Page 21: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health
Page 22: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health
Page 23: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health
Page 24: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

•One of the largest challenges in analyzing genomic data is associating the experimental data with the available metadata, e.g. sequence, gene annotation, chromosomal maps, literature.

•The annotate and AnnBuilder packages provides some tools for carrying this out.

•Using AnnBuilder. It is possible to build associations with specific gene lists, eg. hgu95a package for Affymetrix HGU95A GeneChips.

•The annotate package maps to GenBank accession number, LocusLink LocusID, gene symbol, gene name, UniGene cluster, chromosome, cytoband, physical distance (bp), orientation, Gene Ontology Consortium (GO), PubMed PMID.

Page 25: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Analysis of Functional Gene Groups

Page 26: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Functional Gene/Protein Networks

DIPBINDMINTHPRD

PubGenePredicted Protein Interactions

Page 27: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Analysis of Gene Networks

Page 28: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health
Page 29: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

9606 is the Taxonomy ID for Homo Sapiens

Page 30: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health
Page 31: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health
Page 32: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health
Page 33: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Predicted Human Protein Interactions

Page 34: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Predicted Human Protein Interactions

Used high-throughput protein interaction experiments from fly, worm, and yeast to predict human protein interactions.

Human protein interaction is predicted if both proteins in an interaction pair from other organism have high sequence homology to human proteins.

>70K Hs interactions predicted>6K Hs genes

Page 35: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Analysis of Gene Networks

Page 36: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Carlo ColantuoniClinical Brain Disorders Branch, NIMH, NIH

Dept. Biostatistics, [email protected]

[email protected]

Thanks to …

Rafael Irizarry

Scott Zeger

Jonathan Pevsner

Page 37: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health
Page 38: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

http://www.ncbi.nlm.nih.govhttp://www.ncbi.nlm.nih.gov/Entrez/http://www.ncbi.nih.gov/Genbank/http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotidehttp://www.ncbi.nlm.nih.gov/dbEST/http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Proteinhttp://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genehttp://www.ncbi.nlm.nih.gov/LocusLink/http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigenehttp://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologenehttp://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIMhttp://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMedhttp://www.ncbi.nlm.nih.gov/PubMed/http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cddhttp://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtmlhttp://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snphttp://www.ncbi.nlm.nih.gov/SNP/http://eutils.ncbi.nlm.nih.gov/entrez/query/static/advancedentrez.html http://www.ncbi.nlm.nih.gov/geo/http://www.ncbi.nlm.nih.gov/RefSeq/

FTP:ftp://ftp.ncbi.nlm.nih.gov/ftp://ftp.ncbi.nlm.nih.gov/repository/UniGeneftp://ftp.ncbi.nih.gov/pub/HomoloGene/

NCBI Web Links

Page 39: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

NUCLEOTIDE:

http://genome.ucsc.edu/

http://www.embl-heidelberg.de/

http://www.ensembl.org/

http://www.ebi.ac.uk/

http://www.gdb.org/

http://bioinfo.weizmann.ac.il/cards/index.htmlhttp://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl

PATHWAYS and NETWORKS:

http://www.genome.ad.jp/kegg/

ftp://ftp.genome.ad.jp/pub/kegg/ (http://www.genome.ad.jp/anonftp/)

http://dip.doe-mbi.ucla.edu

http://dip.doe-mbi.ucla.edu/dip/Download.cgi

http://www.blueprint.org/bind/

http://www.blueprint.org/bind/bind_downloads.html

http://160.80.34.4/mint/index.php

http://160.80.34.4/mint/release/main.php

http://www.hprd.org/

http://www.hprd.org/FAQ?selectedtab=DOWNLOAD+REQUESTS

http://www.pubgene.org/ (also .com)

PROTEIN:

http://us.expasy.org/

ftp://us.expasy.org/

http://www.sanger.ac.uk/Software/Pfam/

http://www.sanger.ac.uk/Software/Pfam/ftp.shtml

http://smart.embl-heidelberg.de/

http://www.ebi.ac.uk/interpro/

http://us.expasy.org/prosite/

ftp://us.expasy.org/databases/prosite/

More Web Links

http://www.bioconductor.org/http://apps1.niaid.nih.gov/david/http://www.geneontology.org/http://discover.nci.nih.gov/gominer/index.jsphttp://pubmatrix.grc.nia.nih.gov/http://pevsnerlab.kennedykrieger.org/dragon.htm

Page 40: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

SAVAGE:

Detection of More Subtle Functionally Related Groups

of Gene Expression Changes

Page 41: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

EXP#1

Swiss-Prot

30KPFAM

KEGG

~3K

10K

~40K annotations

DRAGON SAVAGE

Differential Expression of FunctionalGene Groups within One Experiment

Page 42: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

EXP#4EXP#3EXP#2EXP#1BioDB

Differential Expression of a Single FunctionalGene Group Across Multiple Experiments

DRAGON

SAVAGE

Page 43: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Similar Differential Expression Patterns Across Multiple Experiments

p value

0.0

<0.1

ALL

CN

CN

CN

CN

The distribution of gene expression values for each gene group in each sample is plotted as a single point in low dimensional space. This is achieved using Principal Components Analysis along with Non-Metric Multi-Dimensional Scaling.

1

1

EX

P#1

EX

P#1

2

2

EX

P#2

EX

P#2

5

4

3

5

4

3

X

CN

X

Page 44: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

PING:

Detection of Differential Expression in Functional

Networks of Proteins

Page 45: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Interaction Networks in Gene Expression Data

Page 46: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Large Protein Interaction Network

Network Regulated in Sample #1

Page 47: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Network Regulated in Sample #1

Network Regulated in Sample #2

Large Protein Interaction Network

Page 48: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Network Regulated in Sample #1

Network Regulated in Sample #2

Network Regulated in Sample #3

Large Protein Interaction Network

Page 49: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Networkof Interest

Network Regulated in Sample #1

Network Regulated in Sample #2

Network Regulated in Sample #3

Large Protein Interaction Network

PING

Page 50: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

1

10

100

1000

10000

100000N

T's

in G

en

Ban

k (m

illio

ns)

1984 1994 2004

Page 51: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Genomic DNA Content

1. Interspersed repeats (~1/2 Hs. genome)2. (Processed) pseudogenes3. Simple sequence repeats4. Segmental duplications (~5% Hs. genome)5. Blocks of tandem repeats (can be very large)6. Genes: Promoters - Exons – Introns <3%

defining what a gene is - protein coding unit of genomic DNA with an mRNA intermediateidentifying genes within genomic DNA

protein-coding genes (mRNA)functional RNA genes - tRNA, rRNA, snoRNA, snRNA, miRNA

prokaryotes eukaryotes

Page 52: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

AAAAA

Gene: Protein coding unit of genomic DNA with an mRNA intermediate.

START STOPprotein coding

5’ UTR 3’ UTR

mRNA

GenomicDNA 3.3 Gb

Protein

Page 53: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

AAAAA

Gene: Protein coding unit of genomic DNA with an mRNA intermediate.

START STOPprotein coding

5’ UTR 3’ UTR

mRNA

GenomicDNA 3.3 Gb

Protein

~30K genes

Sequence is a Necessity

Page 54: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

How is a gene definedin “wet” biology and in silico?

Seq. from mRNA sample

Seq. on array

Array probe design:

Source – cDNA libraries, oligos, clone collections

Content – UniGene, Celera, Incyte

Transcript coverage

Homology to other transcripts

Hybridization dynamics – hyper-multiplex hyb rxn

Empirical validation

3’ bias

Alt. splicing - known and not

Alt. start / stop site in same RNA molecule

Less important: RNA editing, SNPs

Cross-referencing of array probes:GenBank <> UniGene <> HomoloGene

Possible mis-referencing:Genomic GenBank Acc.#’sReferenced ID has more NT’s than probeOld DB buildsDB or table errors

Page 55: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

Finding genes in eukaryotic DNA

ORF identification – Three Letter Genetic Code (codons) 4*4*4. It is possible to translate any stretch of genomic DNA into protein, but that doesn’t mean we have identified a protein coding gene!

There are several kinds of exons:-- non-coding-- initial coding exons-- internal exons-- terminal exons-- some single-exon genes are intronless

Page 56: PPT - Biostatistics - Johns Hopkins Bloomberg School of Public Health

What We Are Going To Cover

Cells, Genes, Transcripts –> Genomics Experiments

Sequence Knowledge Behind Genomics Experiments

Annotation of Genes in Genomics Experiments