large scale genomic data mining curtis huttenhower 11-14-09 harvard school of public health...

Large scalegenomic data mining

Curtis Huttenhower

11-14-09Harvard School of Public HealthDepartment of Biostatistics

2

Greatest Biological Discoveries?

3

Are We There Yet?

• How much biology is out

there?

• How much have we found?

• How fast are we finding it?

Human Proteins withAnnotated Biological Roles

Age-Adjusted Citation Rates forMajor Sequencing Projects

Species Diversity ofEnvironmental Samples

Schloss and Handelsman, 2006

#DistinctRoles

Matt Hibbs

4

#DistinctRoles

Matt Hibbs

Are We There Yet?

• How much biology is out

there?

• How much have we found?

• How fast are we finding it?

Human Proteins withAnnotated Biological Roles

Age-Adjusted Cost per Citation forMajor Sequencing Projects

Species Diversity ofEnvironmental Samples

Schloss and Handelsman, 2006

Lots!

Not nearly all

Not fast enough

Our job is to create computational microscopes:

To ask and answer specific biomedical questions using

millions of experimental results

5

Outline

1. Methodology:Algorithms for mining

genome-scale datasets

2. Microscopic:Microbial communities and functional metagenomics

3. Macroscopic:Functional genomic data in a

large prospective cohort

6

A Framework for Functional Genomics

HighSimilarity

LowSimilarity

HighCorrelation

LowCorrelation

G1G2

+

G4G9

+

…

G3G6

-

G7G8

-

…

G2G5

?

0.9 0.7 … 0.1 0.2 … 0.8

+ - … - - … +

0.8 0.5 … 0.05 0.1 … 0.6

HighCorrelation

LowCorrelation

Fre

quen

cy

Coloc.Not coloc.

Fre

quen

cy

SimilarDissim.

Fre

quen

cy

P(G2-G5|Data) = 0.85

100Ms gene pairs →

← 1

Ks

data

sets

7

A Framework for Functional Genomics

Golub 1999

Butte 2000

Whitfield 2002

Hansen 1998

Functional Relationship

8

Predicted Functional Interaction Networks

Global interaction network

Metabolism network Fibroblast network Colon cancer network

Currently have data from30,000 human experimental results,

15,000 expression conditions +15,000 diverse others, analyzed for

200 biological functions and150 diseases

9

Functional Mapping:Mining Integrated Networks

Predicted relationships between genes

HighConfidence

LowConfidence

The average strength of these relationships

indicates how cohesive a process is.

Cell cycle genes

10



HighConfidence

LowConfidence

Cell cycle genes

11


DNA replication genes

The average strength of these relationships indicates how

associated two processes are.


HighConfidence

LowConfidence

Cell cycle genes

12

Functional Mapping:Scoring Functional Associations

How can we formalizethese relationships?

Any sets of genes G1 and G2 in a network can be compared

using four measures:

• Edges between their genes

• Edges within each set• The background edges

incident to each set• The baseline of all edges

in the network

),(),(

),(

2121

21, 21 GGwithin

baseline

GGbackground

GGbetweenFA GG

Stronger connections between the sets increase association.

Stronger within self-connections or nonspecific background connections decrease association.

13

Functional Mapping:Bootstrap p-values

• Scoring functional associations is great……how do you interpret an association score?– For gene sets of arbitrary sizes?– In arbitrary graphs?– Each with its own bizarre distribution of edges?

Empirically!# Genes 1 5 10 50

1

5

10

50

Histograms of FAs for random sets

For any graph, compute FA scores for many randomly chosen gene sets of different sizes. Null distribution is

approximately normal with mean 1.

Standard deviation is asymptotic in the sizes

of both gene sets.

Maps FA scores to p-values for any gene sets and

underlying graph.

100

102

104

100

101

102

103

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

|G1|

|G2|

Null distribution σs for one graph

|)(|||

|||)(|),(ˆ

1),(ˆ

ji

jijiFA

jiFA

GCG

BGGAGG

GG

)(1)( ),(ˆ),,(ˆ, 212121xxFAP GGGGGG

14

Functional Mapping:Functional Associations Between Processes

EdgesAssociations between processes

VeryStrong

ModeratelyStrong

Hydrogen Transport

Electron Transport

Cellular Respiration

Protein ProcessingPeptide

Metabolism

Cell Redox Homeostasis

Aldehyde Metabolism

Energy Reserve

Metabolism

Vacuolar Protein

Catabolism

Negative Regulation of Protein Metabolism

Organelle Fusion

Protein Depolymerization

Organelle Inheritance

15



VeryStrong

ModeratelyStrong

BordersData coverage of processes

WellCovered

SparselyCovered

Hydrogen Transport

Electron Transport



Metabolism


Aldehyde Metabolism

Energy Reserve

Metabolism

Vacuolar Protein

Catabolism


Organelle Fusion



16



VeryStrong

ModeratelyStrong

NodesCohesiveness of processes

BelowBaseline

Baseline(genomic

background)

VeryCohesive

BordersData coverage of processes

WellCovered

SparselyCovered

Hydrogen Transport

Electron Transport



Metabolism


Aldehyde Metabolism

Energy Reserve

Metabolism

Vacuolar Protein

Catabolism


Organelle Fusion



17

Functional Maps:Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

Data integration summarizes an impossibly huge amount of experimental data into an

impossibly huge number of predictions; what next?

18

Functional Maps:Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

How can a biologist take advantage of all this data to study

his/her favorite gene/pathway/disease without

losing information?

Functional mapping• Very large collections of genomic data• Specific predicted molecular interactions• Pathway, process, or disease

associations• Underlying experimental results and

functional activities in data

19

HEFalMp: Predicting HumanGene Function

HEFalMp

20

HEFalMp: Predicting HumanGenetic Interactions

HEFalMp

21

HEFalMp: Analyzing HumanGenomic Data

HEFalMp

22

HEFalMp: UnderstandingHuman Disease

HEFalMp

23

Outline






24

Microbial Communities andFunctional Metagenomics

• Metagenomics: data analysis from environmental samples– Microflora: environment includes us!

• Pathogen collections of “single” organisms form similar communities

• Another data integration problem– Must include datasets from multiple organisms

• What questions can we answer?– What pathways/processes are present/over/under-

enriched in a newly sequences microbe/community?– What’s shared within community X?

What’s different? What’s unique?– How do human microflora interact with diabetes,

obesity, oral health, antibiotics, aging, …– Current functional methods annotate

~50% of synthetic data, <5% of environmental data

PKH1

PKH3

PKH2LPD1

CAR1

W04B5.5

pdk-1

R04B3.2

LLC1.3

T21F4.1

PDPK1

ARG1DLD

ARG2

AGA

With Jacques Izard, Wendy Garrett

25

Data Integration for Microbial Communities

PKH1

PKH3

PKH2LPD1

CAR1

W04B5.5

pdk-1

R04B3.2

LLC1.3

T21F4.1

PDPK1

ARG1DLD

ARG2

AGA

~350 available expression datasets

~25 species

PKH1

PKH3

PKH2LPD1

CAR1

W04B5.5

pdk-1

R04B3.2

LLC1.3

T21F4.1

PDPK1

ARG1DLD

ARG2

AGA

Weskamp et al 2004

Flannick et al 2006

Kanehisa et al 2008

Tatusov et al 1997

• Data integration should work just as well in microbes as it does in yeast and humans• We know an awful lot about some microorganisms and almost nothing about others• Sequence-based and network-based tools for function transfer both work in isolation• We can use data integration to leverage both and mine out additional biology

26

Functional Maps forFunctional Metagenomics

YG17

YG16YG15

YG10

YG6

YG9

YG8

YG5

YG11

YG7

YG12

YG13

YG14

YG2

YG1

YG4

YG3

KO8

KO4

KO5

KO7

KO9

KO6

KO2

KO3

KO1

KO1: YG1, YG2, YG3KO2: YG4KO3: YG6…

ECG1, ECG2PAG1ECG3, PAG2…

27


28

Validating Orthology-BasedFunctional Mapping

Does unweighted data integration predict functional relationships?

What is the effect of “projecting” through an orthologous space?

Recall

log(

Pre

cisi

on/R

ando

m)

KEGG

GO

Recall

log(

Pre

cisi

on/R

ando

m)

Recall

log(

Pre

cisi

on/R

ando

m)

GO

Unsupervised integration

Individual datasets

Recall

log(

Pre

cisi

on/R

ando

m) Individual

datasets

KEGG

Unsupervised integration

29


YG17

YG16YG15

YG10

YG6

YG9

YG8

YG5

YG11

YG7

YG12

YG13

YG14

YG2

YG1

YG4

YG3Holdout set,

uncharacterized “genome”

Random subsets,characterized “genomes”

30


31KEGG KEGG

GO GO


Can subsets of the yeast genome predict a heldout subset’s

functional maps?

Can subsets of the yeast genome predict a heldout subset’s

interactome?

0.68 0.48

0.39 0.25

0.30 0.37

0.27 0.39

0.43

0.40

What have we learned?• Yeast is incredibly well-curated

• KEGG tends to be more specific than GO

• Predicting interactomes by projecting through

functional maps

works decently in the absolute best case

32


Now, what happens if you do this forcharacterized microbes?

• ~10 (somewhat) well-characterized species

• 1-35 datasets each

• Integrate within species

• Evaluate using KEGG

• Then cross-validate by holding out species

Recall

log(

Pre

cisi

on/R

ando

m)

KEGGUnsupervised integrations

Check back soon for more results, preliminary data on metagenomes

33

Efficient Computation For Biological Discovery

Massive datasets and genomes require efficient algorithms and implementations.

• Sleipnir C++ library for computational functional genomics

• Data types for biological entities• Microarray data, interaction data, genes and gene sets,

functional catalogs, etc. etc.• Network communication, parallelization

• Efficient machine learning algorithms• Generative (Bayesian) and discriminative (SVM)

• And it’s fully documented!It’s also speedy: improves on Bayes Net Toolbox by

~22x in memory usage and up to >100x in runtime.

34

Efficient Computation For Biological Discovery

Massive datasets and genomes require efficient algorithms and implementations.

• Sleipnir C++ library for computational functional genomics

• Data types for biological entities• Microarray data, interaction data, genes and gene sets,

functional catalogs, etc. etc.• Network communication, parallelization

• Efficient machine learning algorithms• Generative (Bayesian) and discriminative (SVM)

• And it’s fully documented!

8 hours

1 minute

30 years

2 months

18 hours

Original processing time

Current processing time

2.5 hours

35

Outline






36

Current Work: Molecular Mechanismsin a Colorectal Cancer Cohort

With Shuji Ogino, Charlie Fuchs

~3,100gastrointestinal

subjects

~3,800tissue samples

~1,450colon cancer

samples~1,150

CpG island methylation

~1,200LINE-1

methylation

~700TMA immuno-histochemistry

~2,100cancer

mutation tests

Health Professionals Follow-Up

StudyNurse’s HealthStudy

LINE-1 Methylation• Repetitive element making up ~20% of

mammalian genomes• Very easy to assay methylation level (%)• Good proxy for whole-genome methylation

level

DASL Gene Expression• Gene expression analysis from

paraffin blocks• Thanks to Todd Golub, Yujin

Hoshida

~775gene

expression

37

Molecular Subtypes of Colorectal Cancer:Stem Cell Programs and Proliferation

Chr. 19 rearrangement,membrane receptors/channels

HSC signature

Neural/ESC signature

Angiogenesis, proliferation

BRCA interactors,chrom. stability factors

Cell cycle regulation

C1 C2 C3 C4Nonnegative matrix factorizationTumors →

← G

enes

38

Molecular Subtypes of Colorectal Cancer:Stem Cell Programs and Proliferation

Subramanian et al, 2005

195

146678

166945

325

799

NeuralStem Cell Signature

HematopoeiticStem Cell Signature

EmbryonicStem Cell Signature

Chr. 19q

18

8

7

BAX

CD133 + Bcl-X(L)

CD44 + CD166

Hypotheses?• Two main pathways to

proliferation:• HSC program + BAX• ESC/NSC program

• Two main pathways to deregulation:

• Angiogenesis + chrom. instability• Cell cycle disruption (MSI?)

Note that these regulatory programsdo not appear to correspond

with demographics or commonpathologic markers…

Testing now for correlation with outcome.

39

Epigenetics of Colorectal Cancer:LINE-1 methylation levels

30 35 40 45 50 55 60 65 70 75 8030

40

50

60

70

80

LINE-1 Methylation in Mul-tiple Tumors from the Same

Subject

Methylation %, Tumor #1M

eth

ylat

ion

%,

Tu

mo

r #2

ρ = 0.718, p < 0.01

Ogino et al, 2008

Lower LINE-1 methylation associates with poor colon cancer prognosis.

LINE-1 methylation varies remarkably between individuals…

…but it is highly correlated within individuals.

What does it all mean??What is the biological

mechanism linking LINE-1 methylation to colon cancer?

40


30 35 40 45 50 55 60 65 70 75 8030

40

50

60

70

80

LINE-1 Methylation in Mul-tiple Tumors from the Same

Subject

Methylation %, Tumor #1M

eth

ylat

ion

%,

Tu

mo

r #2

ρ = 0.718, p < 0.01

Ogino et al, 2008

Lower LINE-1 methylation associates with poor colon cancer prognosis.

LINE-1 methylation varies remarkably between individuals…

…but it is highly correlated within individuals.

This suggests a genetic effect.

This suggests a copy number variation.

This suggests linkage to a cancer-related pathway.

Is anything different about these outliers?

What is the biological mechanism linking LINE-1

methylation to colon cancer?

41


What is the biological mechanism linking LINE-1

methylation to colon cancer?

Preliminary Data• 10 genes differentially expressed even using simple methods• 1/3 are from the same family with known GI tumor prognostic value• 1/3 are X-chromosome testis/cancer-specific antigens• 1/2 fall in same cytogenic band, which is also a known CNV hotspot• HEFalMp links to a cascade of antigens/membrane receptors/TFs

Cell adhesion p-value ≈ 0, moderate correlation in many cancer arrays• GSEA pulls out a wide range of proliferation up (E2F),

immune response down; need to regress out prognosis confounds

Check back in acouple of months!

42

Outline






• Bayesian system for genomic

data integration• HEFalMp system for human data

analysis and integration• Functional mapping to statistically

summarize large data collections

• Integration for microbial

communities and metagenomics

• Network alignment and mapping

for microbial community analysis

• Sleipnir software for efficient

large scale data mining

• Demographic/molecular/

genomic data for ~1,000

colorectal cancers• Ongoing analysis of

geneactivity and LINE-1

methylation

43

Thanks!

NIGMShttp://function.princeton.edu/sleipnir

http://function.princeton.edu/hefalmp

Interested? We’re recruiting students and postdocs!Biostatistics Department

http://huttenhower.sph.harvard.edu

Hilary CollerErin HaleyTsheko Mutungu

Olga TroyanskayaMatt HibbsChad MyersDavid HessEdo AiroldiFlorian Markowetz

Shuji OginoCharlie Fuchs

Jacques Izard

Wendy Garrett

large scale genomic data mining curtis huttenhower 11-14-09 harvard school of public health...

Documents

functional relationship

functional associations12how

functional metagenomics

functional genomic data

functional genomics7

network stronger connections

formalizethese relationships

gene sets of arbitrary