Integrative Analysis of High Dimensional Gene Expression, Metabolite and Blood Chemistry Data

NISS Metabolomics Workshop, 2005

Kwan R. Lee, Ph.D. and Lei A. Zhu, Amit Bhattacharyya, J. Alan Menius

Biomedical Data Sciences, GlaxoSmithKline


Overview

• Systems Biology
• Challenges for Statisticians
• Possible solutions
• Example of integrative data analysis
• Summary and discussion


Of mice and men



Integrate knowledge and technologies
Reduce attrition by running coordinated studies in animal and man


Focusing on one platform may miss an obvious signal!!!


How can efficacy failures be attacked?

• Classic Phenotypic Approach: Animal Phenotype compared with Human Phenotype. Few data to support analogy.
• Integrative Biology: Animal Phenotype and Animal Biomarker Fingerprint compared with Human Phenotype and Human Biomarker Fingerprint. Many data to support analogy.


‘Systems Biology’ approach to drug discovery


Experimental Platforms Non-omics and Omics, what are they?

Figure: example readouts from each platform: 1H NMR metabolites (spectrum, 1 to 9 ppm), Affy Transcriptome, LC-MS Lipid, LC-MS metabolites, and “Non-omic” markers such as adiponectin (ug/ml, change from baseline at day 7 and day 14) in db/+ and db/db animals across treatment groups (Veh, Met30, Gly1, Gly3, Met75).


Experimental Platforms Non-omics and Omics, what are they? (cont.)

• Traditional Blood Chemistry (non-omics)
• Gene Expression (transcriptomics)
• Metabolite (metabonomics)
• Lipid (lipomics)
• Protein (proteomics)


Five Challenges

1. Data Pre-processing
2. High Dimensionality
3. Multiple Testing for Marker Selection
4. Data Integration
5. Validation of the Prediction Model


Challenge #1: Data Pre-processing

• Peak Alignment (NMR, LC/MS)
• Normalization (Gene Chip, NMR, LC/MS data)
– Why? Remove systematic bias in the data
– Normalization within the platform makes data comparable across samples


Challenge #2: High Dimensionality
# of subjects << # of variables

• Blood Chemistry: 9 markers
• Gene Expression: 22,000 probe sets
• Lipid LC/MS: 2,000 peaks
• Metabolite LC/MS: 3,000 peaks
• NMR: 500 buckets

Data matrix: one row per animal (Animal 1, Animal 2, ..., Animal 100); columns are blood chemistry (Choles, Trig, ...), probe set 1 ... 22,000, Lipid 1 ... 2,000, Metabolite 1 ... 3,000, and NMR 1 ... 500.


Challenge #3: Multiple Testing in Variable Selection

Figure: Noise + Signal = Signal+Noise, with panels comparing No Adjustment for Multiple Testing, FWER Adjustment, and FDR.
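The contrast between no adjustment, FWER adjustment, and FDR can be sketched in a few lines. This is a minimal illustration, not the analysis from the talk: it assumes Bonferroni as the FWER procedure and Benjamini-Hochberg as the FDR procedure, and the simulated z-scores (1,900 noise markers, 100 shifted markers) are hypothetical.

```python
import numpy as np
from math import erfc, sqrt


def bonferroni(pvals, alpha=0.05):
    """FWER control: reject only p-values at or below alpha / m."""
    pvals = np.asarray(pvals, dtype=float)
    return pvals <= alpha / pvals.size


def benjamini_hochberg(pvals, alpha=0.05):
    """FDR control: step-up rule on the sorted p-values."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        last = np.nonzero(below)[0].max()  # largest rank meeting its threshold
        reject[order[: last + 1]] = True
    return reject


# Hypothetical data: 1,900 pure-noise markers and 100 markers with a real shift.
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(0.0, 1.0, 1900), rng.normal(4.0, 1.0, 100)])
pvals = np.array([erfc(abs(v) / sqrt(2.0)) for v in z])  # two-sided normal p-values

n_raw = int((pvals <= 0.05).sum())                # no adjustment
n_fwer = int(bonferroni(pvals).sum())             # FWER adjustment
n_fdr = int(benjamini_hochberg(pvals).sum())      # FDR
print(n_raw, n_fwer, n_fdr)
```

The unadjusted count is never smaller than the FDR count, which in turn is never smaller than the Bonferroni count, so FDR sits between the two extremes shown on the slide.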


Challenge #4: Data integration

Figure: readouts to be integrated: 1H NMR metabolites (spectrum, 1 to 9 ppm), Affy Transcriptome, LC-MS Lipid, LC-MS metabolites, and “Non-omic” markers such as adiponectin (ug/ml, change from baseline at day 7 and day 14) in db/+ and db/db animals.


Challenge #4: Data integration (cont.)

Integration Approach 1: Platform A (20000s var.) + Platform B (1000s var.) -> Combined Data

Integration Approach 2: Platform B (1000s var.) -> Dimension Reduction (e.g. variable selection) -> Platform B (100s var.) -> Combined Data


Challenge #4: Data integration, Example 1

Integration approach 1: Simple data integration

– When the platform data are simply combined, the platform with the most variables and the most variability will dominate the other platforms
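A small simulation can illustrate why the larger platform dominates a simple concatenation. The sketch below is an assumption-laden toy, not the study data: a 20-variable block and a 5,000-variable block each carry their own hypothetical latent factor (`g_small`, `g_large`), every column is autoscaled, and PCA is run on the combined matrix.

```python
import numpy as np


def autoscale(x):
    """Center each column and scale it to unit variance."""
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)


def first_pc_scores(x):
    """Scores on the first principal component, via SVD of the centered data."""
    u, s, _ = np.linalg.svd(x - x.mean(axis=0), full_matrices=False)
    return u[:, 0] * s[0]


rng = np.random.default_rng(1)
n = 60
g_small = rng.normal(size=n)  # hypothetical factor seen only by the small platform
g_large = rng.normal(size=n)  # hypothetical factor seen only by the large platform

small = g_small[:, None] + rng.normal(size=(n, 20))   # stand-in for non-omics
large = rng.normal(size=(n, 5000))                    # stand-in for transcriptomics
large[:, :1000] += g_large[:, None]

combined = np.hstack([autoscale(small), autoscale(large)])
t1 = first_pc_scores(combined)

# Absolute correlation of the combined first component with each factor.
r_large = abs(np.corrcoef(t1, g_large)[0, 1])
r_small = abs(np.corrcoef(t1, g_small)[0, 1])
print(round(r_large, 2), round(r_small, 2))
```

Even after autoscaling, the first component of the combined data tracks the large block's factor, because that block contributes 5,000 of the 5,020 units of total variance.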


PCA on Non-omics, Transcriptomics, and Combined.

Figure: three PCA score plots (t[1] vs. t[2], points labeled Class 1 and Class 2): Non-omics (20), Transcriptomics (12,488), and Combined (12,508).

Mirror image!!!

Transcriptomics data dominate Non-omics data!!!

PCA on Non-omics, Transcriptomics, and Combined.

Figure: three PCA score plots (t[1] vs. t[2], points labeled Class 1 and Class 2): Non-omics (20), Transcriptomics (20 PCs), and Combined (40).

Like a mirror image!!!


Challenge #4: Data integration, Example 2

Integration approach 2: Integrate on selected markers
– 9 blood chemistry + 2000 probe sets + 150 metabolites
– There are still platforms with more selected markers
– How to weight different platforms appropriately? E.g. 9 blood chemistry markers are known to relate to disease or drug
– Identify relationships among the probe sets, metabolites, and blood chemistry markers in terms of biological pathways
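One common answer to the weighting question is block scaling: autoscale every marker, then divide each platform by the square root of its number of markers so that every platform contributes the same total variance. The sketch below is an assumption, not necessarily what was done in this study; the block sizes follow the slide (9, 2000, 150), while the data and the function name `block_scale` are placeholders.

```python
import numpy as np


def block_scale(blocks):
    """Autoscale every column, then divide each block by sqrt(its column count)."""
    scaled = []
    for x in blocks:
        z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
        scaled.append(z / np.sqrt(x.shape[1]))
    return scaled


rng = np.random.default_rng(2)
n = 67  # animals
blocks = [rng.normal(size=(n, p)) for p in (9, 2000, 150)]  # placeholder data only

scaled = block_scale(blocks)
# Total variance per platform after scaling; each should be 1.
block_variance = [float(z.var(axis=0, ddof=1).sum()) for z in scaled]
combined = np.hstack(scaled)
print(block_variance, combined.shape)
```

With this weighting the 9 blood chemistry markers carry as much total variance as the 2000 probe sets. Whether equal weight is appropriate remains a judgment call, for example when one platform is known in advance to relate to disease or drug.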


Principal Component Analysis (PCA)
Projection of 67 animals: 28 normal (black), 39 disease (red)
All markers used for projection (9 NO, 1991 TA, 115 MT)

Figure: PCA score plot (t[1] vs. t[2]) for model fat integrated.M11, with Hotelling T2 (0.95) ellipse; animals labeled by treatment (Veh, D10, D30, D60). SIMCA-P+ 10.5.


Loading Plot

Figure: PCA loading plot (p[1] vs. p[2]) for model fat integrated.M11, colored according to Var ID (Platform).


Partial Least Squares Discriminant Analysis (PLS-DA)
Disease group only: Vehicle vs. Drug

Figure: PLS-DA score plot (t[1] vs. t[2]) for model fat integrated.M6, all platforms, HFD; animals labeled by treatment (Veh, D10, D30, D60) and colored by class.


PLS-DA: Corresponding projection of all markers (9 NO, 1991 TA, 115 MT). Which are important drug markers?

Figure: PLS-DA weight plot (w*c[1] vs. w*c[2]) for model fat integrated.M6, all platforms, HFD, colored according to Var ID (Platform), with the Drug and Veh class positions marked.


Ranked drug markers by importance or by coefficients.

• Marker importance by variable importance on projection (VIP)
• Up or down regulation by coefficients

Figure: two bar plots over roughly 2,100 ranked markers for model fat integrated.M6 (PLS-DA), all platforms, HFD: VIP[Comp. 4] (0 to 1.6) and CoeffCS[Comp. 4] for YVar $M6.DA1 (-0.01 to 0.01).


Validation of the model: R2, Q2 and permutation tests 100 times (P < 0.01)

Figure: permutation validation plot of R2 and Q2 for model fat integrated.M6 (PLS-DA), 100 permutations, 4 components. Intercepts: R2 = (0.0, 0.918), Q2 = (0.0, 0.35).

R2Q2


Variation explained by each platform
PLS-DA for prediction of 2 experimental groups

Platform    Q2(cum)
1  NO       49%
2  TA       86%
3  MT       45%
4  All      86%

Two Groups: HFD, vehicle and HFD, Drug treated

Q2(Y) = amount of variation among the 2 groups explained by the model (cross-validated)

The above table is based on 2-component models. If the 4th model (All) uses more components, 91% of the variation in the data can be explained by 4 components.


Challenge #5: Validation of the Prediction Model

• Correct way of doing cross-validation
– Especially when the variables are selected
• Is your prediction accuracy significant?


Random Noise Data

• Simulate 20,000 marker columns of random noise for 100 patients and one additional column containing arbitrary labels of class indicators.

• Select 5 marker columns showing most correlation with class label.

• Make a prediction model for class indicators based on these 5 selected markers.
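The three steps above can be reproduced in a few lines. This is a minimal sketch under stated assumptions: the talk fits PLS-DA in SIMCA, whereas the stand-in here is an ordinary least-squares discriminant on the 5 selected columns, and the class labels are an arbitrary 50/50 split.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, top = 100, 20000, 5

x = rng.normal(size=(n, p))        # 20,000 marker columns of pure noise
y = np.repeat([0, 1], n // 2)      # arbitrary class labels

# Correlation of every marker column with the class label.
xc = x - x.mean(axis=0)
yc = y - y.mean()
corr = xc.T @ yc / (np.linalg.norm(xc, axis=0) * np.linalg.norm(yc))

# Select the 5 markers most correlated with the label.
selected = np.argsort(-np.abs(corr))[:top]

# Least-squares discriminant on the selected markers (stand-in for PLS-DA),
# evaluated on the same data that were used for selection.
design = np.column_stack([np.ones(n), x[:, selected]])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
pred = (design @ coef > 0.5).astype(int)
resub_accuracy = float((pred == y).mean())

print(np.round(np.abs(corr[selected]), 2), resub_accuracy)
```

Although the data contain no signal, the selected markers show sizeable correlations with the labels and the re-substitution accuracy is well above 50%, which is the effect the following slides examine.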


PCA of Full Markers

Figure: PCA score plot (t[1] vs. t[2]) for model random_noise.M12, colored by Class 1 and Class 2.


PLS-DA on Random Noise Data

• Running a full model in SIMCA does not yield a model: no significant Q2.
– Multivariate approach is conservative.
– Q2 measures prediction performance.
• But we forced the software to fit a 6-component model by PLS-DA (R2 = 1.0, Q2 = 0.225)


Full marker model PLS-DA

Figure: PLS-DA score plot (t[1] vs. t[2]) for model random_noise.M1, all data, colored by Class 1 and Class 2.


Was it real or by chance?

Figure: permutation validation plot of R2 and Q2 for model random_noise.M1 (PLS-DA). Intercepts: R2 = (0.0, 1), Q2 = (0.0, 0.245).


Select 5 Markers

• Selected top 5 markers using VIP from the over-fitted model and fit PLS-DA again on the same data.

• Now we have (R2 = 0.459, Q2 = 0.348)


Good prediction from PLS-DA? Q2 = 0.35

Figure: PLS-DA score plot (t[1] vs. t[2]) for model random_noise.M11, colored by Class 1 and Class 2. R2 = 0.447, Q2 = 0.348.


Validated by permutation test?
Significance of Q2

Figure: permutation validation plot of R2 and Q2 for model random_noise.M11 (PLS-DA). Intercepts: R2 = (0.0, 0.0155), Q2 = (0.0, -0.17).


Selection Bias

• When a prediction model is tested on the same data that were used in the first instance to select the markers, selection bias makes the test error overly optimistic.
– Many publications claimed a small set of selected “genes” is highly predictive.
– IBI practice is to use a data set to select markers and use the same data set to fit a prediction model based on selected markers.


How to correct for selection bias?

• External validation should be undertaken subsequent to the feature selection process.

1. Independent test data set (hold-out data set) that is never used for feature selection.

2. External cross-validation (ECV).
• Cross validation of the prediction model is external to the selection process.
• In other words, make a new selection for each cross validation round.
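The difference between selecting markers once on all data and re-selecting inside every cross-validation round can be checked on pure noise. The sketch below is illustrative only: it assumes a nearest-centroid classifier and a simple standardized mean-difference score for selection, neither of which is specified in the talk, and all function names are hypothetical.

```python
import numpy as np


def top_features(x, y, k):
    """Indices of the k columns with the largest standardized class mean difference."""
    diff = x[y == 1].mean(axis=0) - x[y == 0].mean(axis=0)
    score = np.abs(diff) / (x.std(axis=0, ddof=1) + 1e-12)
    return np.argsort(-score)[:k]


def centroid_predict(x_train, y_train, x_test):
    """Assign each test row to the class with the nearer training centroid."""
    c0 = x_train[y_train == 0].mean(axis=0)
    c1 = x_train[y_train == 1].mean(axis=0)
    d0 = ((x_test - c0) ** 2).sum(axis=1)
    d1 = ((x_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)


def loo_error(x, y, k, external):
    """Leave-one-out error; with external=True, selection is redone in every round."""
    n = len(y)
    fixed = None if external else top_features(x, y, k)  # biased: uses all samples
    errors = 0
    for i in range(n):
        train = np.arange(n) != i
        feats = top_features(x[train], y[train], k) if external else fixed
        pred = centroid_predict(x[train][:, feats], y[train], x[i:i + 1, feats])
        errors += int(pred[0] != y[i])
    return errors / n


rng = np.random.default_rng(3)
x = rng.normal(size=(20, 6000))   # pure noise, no real class difference
y = np.repeat([0, 1], 10)

err_internal = loo_error(x, y, k=10, external=False)
err_external = loo_error(x, y, k=10, external=True)
print(err_internal, err_external)
```

On noise data the externally validated error should sit near chance, while the error computed after a single up-front selection looks far better than it has any right to.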


Externally Validated PLS.
Model and variable selection

• Divide the data set randomly into d parts.
• Set ecv = 1; (this means hold-out one part and use d-1 parts for modeling)
• Set a = 1; (the number of components, do until 10)
• Set k = total number of variables;
• Loop:
• Fit PLS model with given a and k, PLS(a, k);
• Predict hold-out set, compute PRESS(ecv, a, k) and save;
• Choose top half of the variables by appropriate statistics (coeff, vip, t-ratio etc);
• Set k = k/2;
• Go back to Loop until k = 2;
• Set a = a + 1;
• Go back to Loop until a = 10;
• Set ecv = ecv + 1;
• Go back to Loop until ecv = d;
• Compute PRESS(a, k) = Sum over ecv {PRESS(ecv, a, k)};
• Compute Q2(a, k) = 1 – PRESS(a, k)/TSS;
• Plot Q2(a, k) vs. log2(k);
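The loop above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes a single response fitted with a basic NIPALS PLS1 routine, ranks variables by absolute regression coefficient, takes TSS about the overall mean of y, and limits the demonstration to 3 components instead of 10. The names `pls1_fit` and `external_cv_q2` are hypothetical.

```python
import numpy as np


def pls1_fit(x, y, a):
    """NIPALS PLS1 with up to a components; returns x mean, y mean, coefficients."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    e, f = x - x_mean, y - y_mean
    a = min(a, x.shape[1], x.shape[0] - 1)  # cannot exceed the rank of centered x
    ws, ps, qs = [], [], []
    for _ in range(a):
        w = e.T @ f
        norm = np.linalg.norm(w)
        if norm < 1e-12:
            break
        w = w / norm
        t = e @ w
        tt = t @ t
        p = e.T @ t / tt
        q = (f @ t) / tt
        e = e - np.outer(t, p)   # deflate x
        f = f - q * t            # deflate y
        ws.append(w)
        ps.append(p)
        qs.append(q)
    if not ws:
        return x_mean, y_mean, np.zeros(x.shape[1])
    w_mat, p_mat = np.array(ws).T, np.array(ps).T
    b = w_mat @ np.linalg.solve(p_mat.T @ w_mat, np.array(qs))
    return x_mean, y_mean, b


def external_cv_q2(x, y, d=5, a_max=3, seed=0):
    """Q2(a, k) with variable halving repeated inside every hold-out round."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    folds = np.array_split(rng.permutation(n), d)
    press = {}
    for hold in folds:
        train = np.setdiff1d(np.arange(n), hold)
        for a in range(1, a_max + 1):
            keep = np.arange(p)
            while True:
                x_mean, y_mean, b = pls1_fit(x[np.ix_(train, keep)], y[train], a)
                pred = y_mean + (x[np.ix_(hold, keep)] - x_mean) @ b
                key = (a, len(keep))
                press[key] = press.get(key, 0.0) + float(((y[hold] - pred) ** 2).sum())
                if len(keep) <= 2:
                    break
                half = max(2, len(keep) // 2)
                keep = keep[np.argsort(-np.abs(b))[:half]]  # top half by |coefficient|
    tss = float(((y - y.mean()) ** 2).sum())
    return {key: 1.0 - val / tss for key, val in press.items()}


# Hypothetical data: only the first 4 of 64 variables drive the response.
rng = np.random.default_rng(4)
x = rng.normal(size=(60, 64))
y = x[:, :4].sum(axis=1) + 0.5 * rng.normal(size=60)

q2 = external_cv_q2(x, y)
best = max(q2, key=q2.get)
print(best, round(q2[best], 2))
```

Plotting the returned Q2(a, k) values against log2(k) gives the curve described in the last step of the slide.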


Simulation of 2000 Random Data
R. Simon 2003

• 20 x 6000 and 10/10 for class labels
• Repeat 2000 times
• Compute 3 different error rates
– Re-substitution (wrong)
– Cross validation after selection (wrong)
– Cross validation before selection (correct)


Results of 2000 Random Data


Permutation testing

• Because of the high dimensionality of gene expression data, it may be possible to achieve relatively small error rates even for random data.

• To assess the significance of the classification results, a permutation test is suggested.
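A permutation test of this kind can be sketched as below. The classifier (nearest centroid), the leave-one-out scheme, the number of permutations (99), and the simulated data are all assumptions made for illustration; the talk itself uses the permutation facility in SIMCA.

```python
import numpy as np


def loo_accuracy(x, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    n = len(y)
    hits = 0
    for i in range(n):
        train = np.arange(n) != i
        c0 = x[train & (y == 0)].mean(axis=0)
        c1 = x[train & (y == 1)].mean(axis=0)
        d0 = ((x[i] - c0) ** 2).sum()
        d1 = ((x[i] - c1) ** 2).sum()
        hits += int(bool(d1 < d0) == bool(y[i]))
    return hits / n


def permutation_p_value(x, y, n_perm=99, seed=0):
    """Share of label permutations that match or beat the observed accuracy."""
    rng = np.random.default_rng(seed)
    observed = loo_accuracy(x, y)
    null = np.array([loo_accuracy(x, rng.permutation(y)) for _ in range(n_perm)])
    p_value = (1 + int((null >= observed).sum())) / (1 + n_perm)
    return observed, p_value


rng = np.random.default_rng(5)
y = np.repeat([0, 1], 20)
noise = rng.normal(size=(40, 50))   # no class difference
signal = noise.copy()
signal[y == 1, :5] += 2.0           # hypothetical real shift in 5 markers

acc_signal, p_signal = permutation_p_value(signal, y)
acc_noise, p_noise = permutation_p_value(noise, y)
print(acc_signal, p_signal, acc_noise, p_noise)
```

If the modeling procedure includes marker selection, the selection has to be repeated inside every permutation as well; otherwise the permutation distribution inherits the same selection bias as the original estimate.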


Challenge #5: Validation of the Prediction Model - summary

• Correct way of doing cross-validation
– All the steps of the prediction modeling should be cross-validated.
– Each cross validation step should start from scratch
• Is your prediction accuracy significant?
– Random data can give you low prediction error
– Permutation tests, bootstrap aggregation


Summary and Discussion

• Recent technological advances present challenging and interesting biological data at the molecular level.
• Statistics and multivariate analysis play an important role in understanding and extracting knowledge from these types of data.
• Integrative analysis is even more challenging and we presented some solutions to these challenges. There is plenty of room for improvement.


Acknowledgement

GlaxoSmithKline
– High Throughput Biology
– Biomedical Data Sciences
– Genomics and Proteomics Science
– Pathology, Cellular & Biochemical Toxicology
– Discovery IT


Data exploration: Present Challenges

Data is an extremely valuable asset, but like a cash crop, unless harvested, it is wasted.

- Sid Adelman
