large scale genomic data mining curtis huttenhower 11-14-09 harvard school of public health...
TRANSCRIPT
Large scalegenomic data mining
Curtis Huttenhower
11-14-09Harvard School of Public HealthDepartment of Biostatistics
3
Are We There Yet?
• How much biology is out
there?
• How much have we found?
• How fast are we finding it?
Human Proteins withAnnotated Biological Roles
Age-Adjusted Citation Rates forMajor Sequencing Projects
Species Diversity ofEnvironmental Samples
Schloss and Handelsman, 2006
#DistinctRoles
Matt Hibbs
4
#DistinctRoles
Matt Hibbs
Are We There Yet?
• How much biology is out
there?
• How much have we found?
• How fast are we finding it?
Human Proteins withAnnotated Biological Roles
Age-Adjusted Cost per Citation forMajor Sequencing Projects
Species Diversity ofEnvironmental Samples
Schloss and Handelsman, 2006
Lots!
Not nearly all
Not fast enough
Our job is to create computational microscopes:
To ask and answer specific biomedical questions using
millions of experimental results
5
Outline
1. Methodology:Algorithms for mining
genome-scale datasets
2. Microscopic:Microbial communities and functional metagenomics
3. Macroscopic:Functional genomic data in a
large prospective cohort
6
A Framework for Functional Genomics
HighSimilarity
LowSimilarity
HighCorrelation
LowCorrelation
G1G2
+
G4G9
+
…
G3G6
-
G7G8
-
…
G2G5
?
0.9 0.7 … 0.1 0.2 … 0.8
+ - … - - … +
0.8 0.5 … 0.05 0.1 … 0.6
HighCorrelation
LowCorrelation
Fre
quen
cy
Coloc.Not coloc.
Fre
quen
cy
SimilarDissim.
Fre
quen
cy
P(G2-G5|Data) = 0.85
100Ms gene pairs →
← 1
Ks
data
sets
7
A Framework for Functional Genomics
Golub 1999
Butte 2000
Whitfield 2002
Hansen 1998
Functional Relationship
8
Predicted Functional Interaction Networks
Global interaction network
Metabolism network Fibroblast network Colon cancer network
Currently have data from30,000 human experimental results,
15,000 expression conditions +15,000 diverse others, analyzed for
200 biological functions and150 diseases
9
Functional Mapping:Mining Integrated Networks
Predicted relationships between genes
HighConfidence
LowConfidence
The average strength of these relationships
indicates how cohesive a process is.
Cell cycle genes
10
Functional Mapping:Mining Integrated Networks
Predicted relationships between genes
HighConfidence
LowConfidence
Cell cycle genes
11
Functional Mapping:Mining Integrated Networks
DNA replication genes
The average strength of these relationships indicates how
associated two processes are.
Predicted relationships between genes
HighConfidence
LowConfidence
Cell cycle genes
12
Functional Mapping:Scoring Functional Associations
How can we formalizethese relationships?
Any sets of genes G1 and G2 in a network can be compared
using four measures:
• Edges between their genes
• Edges within each set• The background edges
incident to each set• The baseline of all edges
in the network
),(),(
),(
2121
21, 21 GGwithin
baseline
GGbackground
GGbetweenFA GG
Stronger connections between the sets increase association.
Stronger within self-connections or nonspecific background connections decrease association.
13
Functional Mapping:Bootstrap p-values
• Scoring functional associations is great……how do you interpret an association score?– For gene sets of arbitrary sizes?– In arbitrary graphs?– Each with its own bizarre distribution of edges?
Empirically!# Genes 1 5 10 50
1
5
10
50
Histograms of FAs for random sets
For any graph, compute FA scores for many randomly chosen gene sets of different sizes. Null distribution is
approximately normal with mean 1.
Standard deviation is asymptotic in the sizes
of both gene sets.
Maps FA scores to p-values for any gene sets and
underlying graph.
100
102
104
100
101
102
103
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
|G1|
|G2|
Null distribution σs for one graph
|)(|||
|||)(|),(ˆ
1),(ˆ
ji
jijiFA
jiFA
GCG
BGGAGG
GG
)(1)( ),(ˆ),,(ˆ, 212121xxFAP GGGGGG
14
Functional Mapping:Functional Associations Between Processes
EdgesAssociations between processes
VeryStrong
ModeratelyStrong
Hydrogen Transport
Electron Transport
Cellular Respiration
Protein ProcessingPeptide
Metabolism
Cell Redox Homeostasis
Aldehyde Metabolism
Energy Reserve
Metabolism
Vacuolar Protein
Catabolism
Negative Regulation of Protein Metabolism
Organelle Fusion
Protein Depolymerization
Organelle Inheritance
15
Functional Mapping:Functional Associations Between Processes
EdgesAssociations between processes
VeryStrong
ModeratelyStrong
BordersData coverage of processes
WellCovered
SparselyCovered
Hydrogen Transport
Electron Transport
Cellular Respiration
Protein ProcessingPeptide
Metabolism
Cell Redox Homeostasis
Aldehyde Metabolism
Energy Reserve
Metabolism
Vacuolar Protein
Catabolism
Negative Regulation of Protein Metabolism
Organelle Fusion
Protein Depolymerization
Organelle Inheritance
16
Functional Mapping:Functional Associations Between Processes
EdgesAssociations between processes
VeryStrong
ModeratelyStrong
NodesCohesiveness of processes
BelowBaseline
Baseline(genomic
background)
VeryCohesive
BordersData coverage of processes
WellCovered
SparselyCovered
Hydrogen Transport
Electron Transport
Cellular Respiration
Protein ProcessingPeptide
Metabolism
Cell Redox Homeostasis
Aldehyde Metabolism
Energy Reserve
Metabolism
Vacuolar Protein
Catabolism
Negative Regulation of Protein Metabolism
Organelle Fusion
Protein Depolymerization
Organelle Inheritance
17
Functional Maps:Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
Data integration summarizes an impossibly huge amount of experimental data into an
impossibly huge number of predictions; what next?
18
Functional Maps:Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
How can a biologist take advantage of all this data to study
his/her favorite gene/pathway/disease without
losing information?
Functional mapping• Very large collections of genomic data• Specific predicted molecular interactions• Pathway, process, or disease
associations• Underlying experimental results and
functional activities in data
23
Outline
1. Methodology:Algorithms for mining
genome-scale datasets
2. Microscopic:Microbial communities and functional metagenomics
3. Macroscopic:Functional genomic data in a
large prospective cohort
24
Microbial Communities andFunctional Metagenomics
• Metagenomics: data analysis from environmental samples– Microflora: environment includes us!
• Pathogen collections of “single” organisms form similar communities
• Another data integration problem– Must include datasets from multiple organisms
• What questions can we answer?– What pathways/processes are present/over/under-
enriched in a newly sequences microbe/community?– What’s shared within community X?
What’s different? What’s unique?– How do human microflora interact with diabetes,
obesity, oral health, antibiotics, aging, …– Current functional methods annotate
~50% of synthetic data, <5% of environmental data
PKH1
PKH3
PKH2LPD1
CAR1
W04B5.5
pdk-1
R04B3.2
LLC1.3
T21F4.1
PDPK1
ARG1DLD
ARG2
AGA
With Jacques Izard, Wendy Garrett
25
Data Integration for Microbial Communities
PKH1
PKH3
PKH2LPD1
CAR1
W04B5.5
pdk-1
R04B3.2
LLC1.3
T21F4.1
PDPK1
ARG1DLD
ARG2
AGA
~350 available expression datasets
~25 species
PKH1
PKH3
PKH2LPD1
CAR1
W04B5.5
pdk-1
R04B3.2
LLC1.3
T21F4.1
PDPK1
ARG1DLD
ARG2
AGA
Weskamp et al 2004
Flannick et al 2006
Kanehisa et al 2008
Tatusov et al 1997
• Data integration should work just as well in microbes as it does in yeast and humans• We know an awful lot about some microorganisms and almost nothing about others• Sequence-based and network-based tools for function transfer both work in isolation• We can use data integration to leverage both and mine out additional biology
26
Functional Maps forFunctional Metagenomics
YG17
YG16YG15
YG10
YG6
YG9
YG8
YG5
YG11
YG7
YG12
YG13
YG14
YG2
YG1
YG4
YG3
KO8
KO4
KO5
KO7
KO9
KO6
KO2
KO3
KO1
KO1: YG1, YG2, YG3KO2: YG4KO3: YG6…
ECG1, ECG2PAG1ECG3, PAG2…
28
Validating Orthology-BasedFunctional Mapping
Does unweighted data integration predict functional relationships?
What is the effect of “projecting” through an orthologous space?
Recall
log(
Pre
cisi
on/R
ando
m)
KEGG
GO
Recall
log(
Pre
cisi
on/R
ando
m)
Recall
log(
Pre
cisi
on/R
ando
m)
GO
Unsupervised integration
Individual datasets
Recall
log(
Pre
cisi
on/R
ando
m) Individual
datasets
KEGG
Unsupervised integration
29
Validating Orthology-BasedFunctional Mapping
YG17
YG16YG15
YG10
YG6
YG9
YG8
YG5
YG11
YG7
YG12
YG13
YG14
YG2
YG1
YG4
YG3Holdout set,
uncharacterized “genome”
Random subsets,characterized “genomes”
31KEGG KEGG
GO GO
Validating Orthology-BasedFunctional Mapping
Can subsets of the yeast genome predict a heldout subset’s
functional maps?
Can subsets of the yeast genome predict a heldout subset’s
interactome?
0.68 0.48
0.39 0.25
0.30 0.37
0.27 0.39
0.43
0.40
What have we learned?• Yeast is incredibly well-curated
• KEGG tends to be more specific than GO
• Predicting interactomes by projecting through
functional maps
works decently in the absolute best case
32
Functional Maps forFunctional Metagenomics
Now, what happens if you do this forcharacterized microbes?
• ~10 (somewhat) well-characterized species
• 1-35 datasets each
• Integrate within species
• Evaluate using KEGG
• Then cross-validate by holding out species
Recall
log(
Pre
cisi
on/R
ando
m)
KEGGUnsupervised integrations
Check back soon for more results, preliminary data on metagenomes
33
Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities• Microarray data, interaction data, genes and gene sets,
functional catalogs, etc. etc.• Network communication, parallelization
• Efficient machine learning algorithms• Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!It’s also speedy: improves on Bayes Net Toolbox by
~22x in memory usage and up to >100x in runtime.
34
Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities• Microarray data, interaction data, genes and gene sets,
functional catalogs, etc. etc.• Network communication, parallelization
• Efficient machine learning algorithms• Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
8 hours
1 minute
30 years
2 months
18 hours
Original processing time
Current processing time
2.5 hours
35
Outline
1. Methodology:Algorithms for mining
genome-scale datasets
2. Microscopic:Microbial communities and functional metagenomics
3. Macroscopic:Functional genomic data in a
large prospective cohort
36
Current Work: Molecular Mechanismsin a Colorectal Cancer Cohort
With Shuji Ogino, Charlie Fuchs
~3,100gastrointestinal
subjects
~3,800tissue samples
~1,450colon cancer
samples~1,150
CpG island methylation
~1,200LINE-1
methylation
~700TMA immuno-histochemistry
~2,100cancer
mutation tests
Health Professionals Follow-Up
StudyNurse’s HealthStudy
LINE-1 Methylation• Repetitive element making up ~20% of
mammalian genomes• Very easy to assay methylation level (%)• Good proxy for whole-genome methylation
level
DASL Gene Expression• Gene expression analysis from
paraffin blocks• Thanks to Todd Golub, Yujin
Hoshida
~775gene
expression
37
Molecular Subtypes of Colorectal Cancer:Stem Cell Programs and Proliferation
Chr. 19 rearrangement,membrane receptors/channels
HSC signature
Neural/ESC signature
Angiogenesis, proliferation
BRCA interactors,chrom. stability factors
Cell cycle regulation
C1 C2 C3 C4Nonnegative matrix factorizationTumors →
← G
enes
38
Molecular Subtypes of Colorectal Cancer:Stem Cell Programs and Proliferation
Subramanian et al, 2005
195
146678
166945
325
799
NeuralStem Cell Signature
HematopoeiticStem Cell Signature
EmbryonicStem Cell Signature
Chr. 19q
18
8
7
BAX
CD133 + Bcl-X(L)
CD44 + CD166
Hypotheses?• Two main pathways to
proliferation:• HSC program + BAX• ESC/NSC program
• Two main pathways to deregulation:
• Angiogenesis + chrom. instability• Cell cycle disruption (MSI?)
Note that these regulatory programsdo not appear to correspond
with demographics or commonpathologic markers…
Testing now for correlation with outcome.
39
Epigenetics of Colorectal Cancer:LINE-1 methylation levels
30 35 40 45 50 55 60 65 70 75 8030
40
50
60
70
80
LINE-1 Methylation in Mul-tiple Tumors from the Same
Subject
Methylation %, Tumor #1M
eth
ylat
ion
%,
Tu
mo
r #2
ρ = 0.718, p < 0.01
Ogino et al, 2008
Lower LINE-1 methylation associates with poor colon cancer prognosis.
LINE-1 methylation varies remarkably between individuals…
…but it is highly correlated within individuals.
What does it all mean??What is the biological
mechanism linking LINE-1 methylation to colon cancer?
40
Epigenetics of Colorectal Cancer:LINE-1 methylation levels
30 35 40 45 50 55 60 65 70 75 8030
40
50
60
70
80
LINE-1 Methylation in Mul-tiple Tumors from the Same
Subject
Methylation %, Tumor #1M
eth
ylat
ion
%,
Tu
mo
r #2
ρ = 0.718, p < 0.01
Ogino et al, 2008
Lower LINE-1 methylation associates with poor colon cancer prognosis.
LINE-1 methylation varies remarkably between individuals…
…but it is highly correlated within individuals.
This suggests a genetic effect.
This suggests a copy number variation.
This suggests linkage to a cancer-related pathway.
Is anything different about these outliers?
What is the biological mechanism linking LINE-1
methylation to colon cancer?
41
Epigenetics of Colorectal Cancer:LINE-1 methylation levels
What is the biological mechanism linking LINE-1
methylation to colon cancer?
Preliminary Data• 10 genes differentially expressed even using simple methods• 1/3 are from the same family with known GI tumor prognostic value• 1/3 are X-chromosome testis/cancer-specific antigens• 1/2 fall in same cytogenic band, which is also a known CNV hotspot• HEFalMp links to a cascade of antigens/membrane receptors/TFs
Cell adhesion p-value ≈ 0, moderate correlation in many cancer arrays• GSEA pulls out a wide range of proliferation up (E2F),
immune response down; need to regress out prognosis confounds
Check back in acouple of months!
42
Outline
1. Methodology:Algorithms for mining
genome-scale datasets
2. Microscopic:Microbial communities and functional metagenomics
3. Macroscopic:Functional genomic data in a
large prospective cohort
• Bayesian system for genomic
data integration• HEFalMp system for human data
analysis and integration• Functional mapping to statistically
summarize large data collections
• Integration for microbial
communities and metagenomics
• Network alignment and mapping
for microbial community analysis
• Sleipnir software for efficient
large scale data mining
• Demographic/molecular/
genomic data for ~1,000
colorectal cancers• Ongoing analysis of
geneactivity and LINE-1
methylation
43
Thanks!
NIGMShttp://function.princeton.edu/sleipnir
http://function.princeton.edu/hefalmp
Interested? We’re recruiting students and postdocs!Biostatistics Department
http://huttenhower.sph.harvard.edu
Hilary CollerErin HaleyTsheko Mutungu
Olga TroyanskayaMatt HibbsChad MyersDavid HessEdo AiroldiFlorian Markowetz
Shuji OginoCharlie Fuchs
Jacques Izard
Wendy Garrett