Pathway analysis using BioConductor
The global test revisited
R User group 6 dec 2005
Overview
• Introduction
• Annotation
• Pathway analysis
• Demonstration
Introduction
• Pathway
• Set of related genes
• Functional
• Structural
• Described as lists of gene identifiers
• Microarray
• 1000s of tests
• Description
• Location on chip/slide
• Sequence ID
• On chip replication
Feature description
• Proprietary ID (Affymetrix/Agilent)
• GenBank, RefSeq, Ensembl ID
• Symbol, LocusLink / Entrez Gene, UniGene
• SwissProt
• Chromosomal location
• EC number, GO, KEGG
Annotation sources
• Batch Gene Finder: http://cgap.nci.nih.gov/Genes
• BioMart: http://www.ebi.ac.uk/BioMart/martview
• Resourcerer: http://www.tigr.org/tigr-scripts/magic/r1.pl
• Bioconductor metadata: http://www.bioconductor.org
• NetAffx: http://www.affymetrix.com/analysis/index.affx
Create Annotation for Array
• Select / create unique identifier for probes on array
• e.g. use positional information: b01r03c14
• Use this identifier as rownames of data and annotation
• Use annotation sources to connect sequence ids to gene ids
               UniGene    LocusLink  Symbol
A28102_at
AB000114_at    Hs.94070   4958       OMD
AB000115_at    Hs.389724  10964      IFI44L
AB000220_at    Hs.269109  10512      SEMA3C
AB000381_s_at
AB000409_at    Hs.371594  8569       MKNK1

               Smp1     Smp2      Smp3     Smp4
A28102_at      140.7    164.73    137.8    53.15
AB000114_at    115.88   617.08    97.3     393.94
AB000115_at    259.19   393.79    193.66   32.09
AB000220_at    130.83   258.38    213.3    31.83
AB000381_s_at  7505.4   18152.14  4990.21  25.35
AB000409_at    166.29   418.2     125.12   78.88
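The identifier step can be sketched in a few lines of R. This is only an illustration under assumptions: the spot coordinates, the expression values and the object names (spots, probe_id, expr, annot) are made up for the example and do not come from a real array.

```r
# Hypothetical spot coordinates: block, row and column of each feature
spots <- data.frame(block = c(1L, 1L, 2L),
                    row   = c(3L, 3L, 1L),
                    col   = c(14L, 15L, 1L))

# Build a positional identifier such as "b01r03c14" and check it is unique
probe_id <- sprintf("b%02dr%02dc%02d", spots$block, spots$row, spots$col)
stopifnot(!anyDuplicated(probe_id))

# Toy expression data and an empty annotation, both keyed by the same row names
expr <- matrix(c(140.7, 115.88, 259.19,
                 164.73, 617.08, 393.79),
               nrow = 3, dimnames = list(probe_id, c("Smp1", "Smp2")))
annot <- matrix("", nrow = 3, ncol = 2,
                dimnames = list(probe_id, c("UniGene", "Symbol")))

# Data and annotation can now be subset with the same identifiers
stopifnot(identical(rownames(expr), rownames(annot)))
```

Keeping identical row names on both objects means that any probe selection made on the annotation can be applied directly to the data matrix.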
Connecting sequence ids to gene ids
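# Two character matrices: myAnnot (one row per probe) and GBFAnnot (from Batch Gene Finder)
# Create an empty temporary annotation with one row per probe
tmpAnnot <- matrix("", nrow=nrow(myAnnot), ncol=ncol(GBFAnnot),
                   dimnames=list(rownames(myAnnot), colnames(GBFAnnot)))
# Look up each probe's GenBank ID (column 1 of myAnnot) among the row names of GBFAnnot
ind <- match(myAnnot[,1], rownames(GBFAnnot))
# Copy the annotation for probes that were found; the others stay empty
tmpAnnot[!is.na(ind),] <- GBFAnnot[ind[!is.na(ind)],]
myAnnot <- cbind(myAnnot, tmpAnnot)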
      Genbank
r1c1  NM_002658
r1c2  NM_000476
r1c3  NM_002581
r1c4  NM_000466
r1c5  NM_002343

           Symbol  Description                       Genbank                 Unigene    LocusLink
NM_000476  AK1     Adenylate kinase 1                NM_000476               Hs.175473  203
NM_002343  LTF     Lactotransferrin                  NM_002343               Hs.529517  4057
NM_000466  PEX1    Peroxisome biogenesis factor 1    NM_000466               Hs.164682  5189
NM_002658  PLAU    Plasminogen activator, urokinase  NM_002658 NM_001001791  Hs.77274   5328
NM_002581

ProbeId  Genbank    Symbol  Name                              Genbank                 Unigene    LocusLink
r1c1     NM_002658  PLAU    Plasminogen activator, urokinase  NM_002658 NM_001001791  Hs.77274   5328
r1c2     NM_000476  AK1     Adenylate kinase 1                NM_000476               Hs.175473  203
r1c3     NM_002581
r1c4     NM_000466  PEX1    Peroxisome biogenesis factor 1    NM_000466               Hs.164682  5189
r1c5     NM_002343  LTF     Lactotransferrin                  NM_002343               Hs.529517  4057
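As a self-contained sketch of the same lookup, the snippet below rebuilds a cut-down version of the two matrices and merges them with match(). The data are abridged from the tables on this slide (only the Symbol and LocusLink columns are kept, and NM_002581 is left out of the lookup table on purpose to show how an unmatched probe is handled); the names found and merged are introduced here for illustration.

```r
# Probe table: row names are probe IDs, column 1 holds the GenBank accession
myAnnot <- matrix(c("NM_002658", "NM_000476", "NM_002581", "NM_000466", "NM_002343"),
                  ncol = 1, dimnames = list(paste0("r1c", 1:5), "Genbank"))

# Lookup table keyed by GenBank accession (abridged)
GBFAnnot <- matrix(c("AK1", "LTF", "PEX1", "PLAU",
                     "203", "4057", "5189", "5328"),
                   ncol = 2,
                   dimnames = list(c("NM_000476", "NM_002343", "NM_000466", "NM_002658"),
                                   c("Symbol", "LocusLink")))

# Position of each probe's accession in the lookup table (NA if absent)
ind <- match(myAnnot[, "Genbank"], rownames(GBFAnnot))
found <- !is.na(ind)

# Empty annotation with one row per probe, filled only where a match exists
tmpAnnot <- matrix("", nrow = nrow(myAnnot), ncol = ncol(GBFAnnot),
                   dimnames = list(rownames(myAnnot), colnames(GBFAnnot)))
tmpAnnot[found, ] <- GBFAnnot[ind[found], , drop = FALSE]

merged <- cbind(myAnnot, tmpAnnot)
```

Probe r1c3 has no entry in the lookup table, so its Symbol and LocusLink fields remain empty strings instead of producing an error or shifting rows.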
Selecting probes by pathway
• Using BioConductor metadata package
• Using BioConductor GO and Mapping
> library(hgu95av2)
> get("GO:0005868", envir=hgu95av2GO2PROBE)
       NAS       <NA>        TAS       <NA>        TAS        ISS
"37300_at" "40318_at" "40319_at" "40949_at" "40950_at"   "946_at"

> library(GO)
> ll <- get("GO:0005868", envir=GOLOCUSID)
> rownames(myAnnot)[myAnnot[,"LocusLink"] %in% ll]
[1] "Contig51966_RC" "NM_004411"      "Contig47291_RC" "NM_006141"
[5] "NM_014183"      "AB002323"       "NM_006519"
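The second lookup can be wrapped in a small helper that also copes with terms that are missing from the mapping. The sketch below uses a plain R environment as a stand-in for the metadata environments, so it runs without any Bioconductor package. Everything in it is an assumption made for illustration: the helper name probes_for_term, the placeholder key "GO:XXXXXXX", and the toy mapping, which is not real GO annotation.

```r
# Stand-in for a GO-to-LocusLink environment; the key and its members are toy values
go2ll <- new.env()
assign("GO:XXXXXXX", c("203", "5189"), envir = go2ll)

# Toy annotation with a LocusLink column, row names are probe IDs
myAnnot <- matrix(c("5328", "203", "", "5189", "4057"), ncol = 1,
                  dimnames = list(paste0("r1c", 1:5), "LocusLink"))

# Return the probe IDs whose gene ID is listed under `term`
probes_for_term <- function(term, map_env, annot, id_col = "LocusLink") {
  if (!exists(term, envir = map_env, inherits = FALSE)) {
    return(character(0))   # unknown term: empty selection instead of an error
  }
  ids <- get(term, envir = map_env)
  rownames(annot)[annot[, id_col] %in% ids]
}

sel <- probes_for_term("GO:XXXXXXX", go2ll, myAnnot)
```

With a real metadata package loaded, the same helper could be pointed at the package's environment in place of go2ll.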
Pathway analysis
• List based methods
• Order based methods
• Statistical combination of results
List based Pathway analysis
• Compare the proportion of differentially expressed genes in a
pathway to the proportion on the array
• R: phyper(), GOHyperG() in GOstats package (BioConductor)
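• phyper(PWde, ARde, ARall-ARde, PWall, lower.tail=FALSE), where PWde and PWall are the numbers of differentially expressed and of all genes in the pathway, and ARde and ARall are the same counts for the whole array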
             Diff Expressed  All
In Pathway   15              100
On Array     1000            10000

> phyper(15, 1000, 9000, 100, lower.tail=FALSE)
[1] 0.03910265
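One detail worth checking when using phyper() this way: with lower.tail=FALSE it returns P(X > q), so the call above gives the probability of more than 15 differentially expressed genes in the pathway. If "15 or more" is the intended event, q has to be lowered by one. The sketch below shows both variants and cross-checks the inclusive one against a one-sided Fisher exact test on the corresponding 2x2 table; the helper name enrich_p and its argument names are introduced here for illustration.

```r
# Inclusive over-representation p-value: P(X >= pw_de)
enrich_p <- function(pw_de, pw_all, ar_de, ar_all) {
  # phyper(q, ..., lower.tail = FALSE) is P(X > q), hence the "- 1"
  phyper(pw_de - 1, ar_de, ar_all - ar_de, pw_all, lower.tail = FALSE)
}

p_strict <- phyper(15, 1000, 9000, 100, lower.tail = FALSE)  # P(X > 15)
p_incl   <- enrich_p(15, 100, 1000, 10000)                   # P(X >= 15)

# Same counts as a 2x2 table: rows = DE / not DE, columns = in / not in pathway
tab <- matrix(c(15, 85, 985, 8915), nrow = 2,
              dimnames = list(c("DE", "notDE"), c("inPathway", "notInPathway")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

The one-sided Fisher test and the inclusive hypergeometric tail describe the same event, so p_fisher and p_incl should agree up to numerical precision, while p_strict is slightly smaller.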
Order based analysis
• Genes are ordered by differential expression, from upregulated through
unchanged to downregulated. Interesting pathways form clusters along this
ordering
• In R: Gene Set Enrichment Analysis (GSEA) package http://www.broad.mit.edu/gsea/software/software_index.html
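The idea of pathway genes clustering along the ordering can be illustrated with a simple running-sum score. The sketch below is a simplified, unweighted Kolmogorov-Smirnov-style statistic written for this example; it is not the GSEA software, and the function name running_sum_score and the toy gene names are assumptions.

```r
# Walk down the ranked gene list: step up at pathway genes, down otherwise,
# and report the largest deviation of the running sum from zero.
running_sum_score <- function(ranked_genes, gene_set) {
  in_set <- ranked_genes %in% gene_set
  n_hit  <- sum(in_set)
  n_miss <- length(ranked_genes) - n_hit
  stopifnot(n_hit > 0, n_miss > 0)
  step    <- ifelse(in_set, 1 / n_hit, -1 / n_miss)
  running <- cumsum(step)
  running[which.max(abs(running))]
}

ranked <- paste0("g", 1:10)   # toy ranking, most upregulated first

es_top    <- running_sum_score(ranked, c("g1", "g2", "g3"))    # set at the top
es_bottom <- running_sum_score(ranked, c("g8", "g9", "g10"))   # set at the bottom
es_spread <- running_sum_score(ranked, c("g1", "g5", "g10"))   # set spread out
```

A set concentrated at the top of the list scores close to +1, a set at the bottom close to -1, and a set scattered over the list stays nearer zero. In practice, significance would be assessed by permutation.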
Statistical combination of results
• All genes in a pathway contribute their statistical influence
• In R: globaltest package (BioConductor)
[Figure: gene plot from globaltest. Y-axis: influence (0 to 40). Probe sets on the x-axis: 204290_s_at, 221588_x_at, 221589_s_at, 221590_s_at, 200822_x_at, 210050_at, 213011_s_at. Legend: higher expression in NA samples / higher expression in 0 samples]
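To make "all genes contribute their statistical influence" concrete, the sketch below computes a sum-of-squares statistic in which every gene in the set adds one term, and assesses it by permuting the sample labels. It is loosely modelled on the idea behind the global test and is not the globaltest package's API or its exact implementation; the function names global_stat and global_perm_test and the simulated data are assumptions for illustration.

```r
# X: genes in rows, samples in columns; y: numeric response per sample
global_stat <- function(X, y) {
  r   <- y - mean(y)                       # centred response
  mu2 <- mean(r^2)                         # second central moment of y
  per_gene <- as.vector(X %*% r)^2 / mu2   # one contribution per gene
  names(per_gene) <- rownames(X)
  list(Q = mean(per_gene), per_gene = per_gene)
}

# Permutation p-value: how often does a relabelled response score as high?
global_perm_test <- function(X, y, n_perm = 999) {
  obs    <- global_stat(X, y)
  perm_Q <- replicate(n_perm, global_stat(X, sample(y))$Q)
  p      <- (1 + sum(perm_Q >= obs$Q)) / (n_perm + 1)
  c(obs, list(p.value = p))
}

# Simulated pathway of 5 probes on 20 samples; probes 1 to 3 differ between groups
set.seed(1)
y <- rep(0:1, each = 10)
X <- matrix(rnorm(5 * 20), nrow = 5,
            dimnames = list(paste0("probe", 1:5), NULL))
X[1:3, y == 1] <- X[1:3, y == 1] + 3

res <- global_perm_test(X, y)
```

Here res$per_gene plays the role of the per-probe bars in a gene plot: probes that track the response contribute large terms, while unrelated probes add little to the overall statistic res$Q.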
Demonstration