representation of metabolomic data with wavelets


DESCRIPTION

June 5th, 2009: Representation of metabolomic data with wavelets, Workgroup BioPuces, INRA d’Auzeville.

TRANSCRIPT

Page 1: Representation of metabolomic data with wavelets

Representation of metabolomic data with wavelets

Nathalie Villa-Vialaneix
http://www.nathalievilla.org

Toulouse School of Economics

Workgroup BioPuces, INRA de Castanet
June 5th, 2009

BioPuces (05/06/09) Nathalie Villa Metabolomic data 1 / 16

Page 2: Representation of metabolomic data with wavelets

Outline

1 Database presentation

2 Wavelet representation

3 Perspective of work


Page 4: Representation of metabolomic data with wavelets

Database presentation

Basics about the database

The database was provided by Alain Paris (INRA) and consists of metabolomic recordings (¹H NMR) from the urine of mice.

950 variables from 0.505 ppm to 9.995 ppm.

Baseline has been removed and peaks have been aligned.


Page 7: Representation of metabolomic data with wavelets

Database presentation

Purpose of the work

Study the effects of the ingestion of Hypochoeris radicata (HR) on the metabolism: the inflorescences of this plant are known to be responsible for a horse disease, Australian stringhalt.

As it is hard to obtain several dozen horses to sacrifice, the experiments were conducted on 72 mice.


Page 9: Representation of metabolomic data with wavelets

Database presentation

Description of the experiment

72 mice from:
• 2 sexes: 36 males, 36 females
• 3 HR doses: 0 (control): 24 mice; 3%: 24 mice; 9%: 24 mice
• 3 sacrifice dates: 8th day: 24 mice; 15th day: 24 mice; 21st day: 24 mice

⇒ 18 groups.
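The count of 18 groups follows from crossing the three factors. A minimal sketch of that enumeration, assuming a fully balanced design (the slide gives only the marginal counts, so the even split per group is an assumption):

```python
from itertools import product

# Factor levels as listed on the slide.
sexes = ["male", "female"]
doses = ["0 (control)", "3%", "9%"]
sacrifice_days = [8, 15, 21]

# Every combination of the three factors defines one group.
groups = list(product(sexes, doses, sacrifice_days))
n_mice = 72

# Assumption: the design is balanced, so the mice are split evenly across groups.
mice_per_group = n_mice // len(groups)

print(len(groups), "groups,", mice_per_group, "mice per group")
```

Under this assumption each marginal count on the slide is recovered: for instance, one dose level covers 2 sexes × 3 dates × 4 mice = 24 mice.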


Page 13: Representation of metabolomic data with wavelets

Database presentation

Measurements days

The urine was collected:

Days                 0   1   4   8   11  15  18  21
Nb of observations   68  68  68  66  46  44  19  18

For each mouse, from 2 to 22 measurements were made.

In total, 397 observations for 950 variables.
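The two dimensions of the data matrix can be checked from the figures quoted above. This is only a sanity-check sketch; the bin width of 0.01 ppm is an assumption inferred from the range 0.505 to 9.995 ppm and is not stated on the slides.

```python
# Number of urine samples available on each collection day (counts listed above).
observations_per_day = {0: 68, 1: 68, 4: 68, 8: 66, 11: 46, 15: 44, 18: 19, 21: 18}

total_observations = sum(observations_per_day.values())

# Assumption: the spectrum is binned every 0.01 ppm between 0.505 and 9.995 ppm.
bin_width = 0.01
n_variables = round((9.995 - 0.505) / bin_width) + 1

print(total_observations, "observations x", n_variables, "variables")
```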



Page 19: Representation of metabolomic data with wavelets

Wavelet representation

Basic principle of wavelets

For a given integer J, the spectra can be expressed at level J as:

$$
f(x) = \underbrace{\sum_{k} \alpha_k \, 2^{-J/2} \, \Psi(2^{-J}x - k)}_{\text{Trend: based on the father wavelet } \Psi} + \underbrace{\sum_{j=1}^{J} \sum_{k} \beta_{jk} \, 2^{-j/2} \, \Phi(2^{-j}x - k)}_{\text{Details at levels } 1,\dots,J\text{: based on the mother wavelet } \Phi}
$$
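To make the split into a trend and J levels of details concrete, here is a minimal sketch using the Haar wavelet. The choice of Haar and the function names are assumptions made for illustration; the slides do not say which wavelet family was used at this point.

```python
import math

S = 1 / math.sqrt(2)  # normalisation of the orthonormal Haar filters

def haar_step(signal):
    """One level of the Haar transform: returns (trend, details)."""
    trend = [(signal[i] + signal[i + 1]) * S for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) * S for i in range(0, len(signal), 2)]
    return trend, details

def haar_inverse_step(trend, details):
    """Undo one level: interleave the values rebuilt from trend and details."""
    signal = []
    for a, d in zip(trend, details):
        signal.extend([(a + d) * S, (a - d) * S])
    return signal

def decompose(signal, J):
    """Return (trend at level J, [details at level 1, ..., details at level J])."""
    trend, all_details = list(signal), []
    for _ in range(J):
        trend, details = haar_step(trend)
        all_details.append(details)
    return trend, all_details

def reconstruct(trend, all_details):
    """Sum trend and details back into the sampled function."""
    signal = list(trend)
    for details in reversed(all_details):
        signal = haar_inverse_step(signal, details)
    return signal

f = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
alpha, beta = decompose(f, J=2)   # alpha: trend coefficients, beta: detail coefficients
f_back = reconstruct(alpha, beta)
```

Here `alpha` plays the role of the trend coefficients and `beta[j - 1]` the detail coefficients at level j; rebuilding from both recovers `f` up to rounding.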


Page 24: Representation of metabolomic data with wavelets

Wavelet representation

Hierarchical decomposition

We add 74 zero values at the end of each spectrum to obtain a dyadic discrete sampling.

Original data: f observed at t1, ..., t1024 (equally spaced)
            ↓          ↘
Level 1   Trend       Details
            ↓          ↘
Level 2   Trend       Details
          . . .
            ↓          ↘
Level 9   Trend       Details

⇒ At level 9 (maximum level with a discrete sampling of length 1024), we obtain 1025 coefficients.
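The padding and the cascade can be sketched as follows. The spectrum is synthetic and the Haar wavelet is an assumption for illustration. With Haar and no boundary extension, the cascade yields exactly as many coefficients as samples; the total obtained in practice depends on the wavelet filter and on how the software handles boundaries, so it need not match the count quoted above.

```python
import math

S = 1 / math.sqrt(2)

def haar_step(signal):
    """One level of the Haar transform: returns (trend, details)."""
    trend = [(signal[i] + signal[i + 1]) * S for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) * S for i in range(0, len(signal), 2)]
    return trend, details

# Synthetic stand-in for one spectrum of 950 values (a single Gaussian peak).
spectrum = [math.exp(-((i - 400) / 15.0) ** 2) for i in range(950)]

# Pad with zeros up to the next power of two to get a dyadic sampling.
target = 1 << (len(spectrum) - 1).bit_length()
n_zeros = target - len(spectrum)
padded = spectrum + [0.0] * n_zeros

# Cascade: at each level the current trend is split into a coarser trend and details.
trend = padded
detail_sizes = []
for level in range(1, 10):
    trend, details = haar_step(trend)
    detail_sizes.append(len(details))

n_coefficients = len(trend) + sum(detail_sizes)
print(n_zeros, detail_sizes, len(trend), n_coefficients)
```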


Page 25: Representation of metabolomic data with wavelets

Wavelet representation

Examples

[Figure: two panels, "Trend" and "Details"]


Page 26: Representation of metabolomic data with wavelets

Wavelet representation

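Denoising

For coefficients corresponding to details at levels greater than J (with J large enough), a filtering is applied:

$$
c^{*} = \begin{cases} 0 & \text{if } |c| < 2\sqrt{\log 10}\,\hat{\sigma} \\ c & \text{if } |c| \geq 2\sqrt{\log 10}\,\hat{\sigma} \end{cases}
$$

(Donoho and Johnstone)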
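A minimal sketch of this hard-thresholding rule. Two points are assumptions and not taken from the slides: the logarithm is read as the natural logarithm, and the noise level is estimated from the median absolute value of the finest detail coefficients divided by 0.6745, a common robust choice. The function names are hypothetical.

```python
import math
from statistics import median

def estimate_sigma(finest_details):
    """Robust noise estimate (assumed here): median(|c|) / 0.6745."""
    return median(abs(c) for c in finest_details) / 0.6745

def hard_threshold(coefficients, sigma_hat, factor=2 * math.sqrt(math.log(10))):
    """Keep a coefficient unchanged if |c| >= factor * sigma_hat, else set it to zero."""
    threshold = factor * sigma_hat
    return [c if abs(c) >= threshold else 0.0 for c in coefficients]

details = [0.1, -0.2, 3.0, 0.05, -2.5, 0.15]

# With a fixed noise level of 0.5 the threshold is about 1.52.
filtered = hard_threshold(details, sigma_hat=0.5)

# The same rule with the noise level estimated from the data.
sigma_hat = estimate_sigma(details)
filtered_auto = hard_threshold(details, sigma_hat)
```

Only the two large coefficients survive in both cases; the small ones, assumed to carry mostly noise, are set to zero.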

Two parameters are to be tuned:
• Which wavelet has to be used?
• Which J has to be used?

The aim is a trade-off between the quality of the reconstruction of the function (what are the values of the functions rebuilt from the filtered coefficients?) and the number of nonzero coefficients.

Minimization of an empirical (self-created) quality criterion:

$$
\sqrt{\frac{1}{n} \sum_{i} \frac{1}{D} \sum_{j} \left( f_i(t_j) - \hat{f}_i(t_j) \right)^2} + \frac{\text{Nb of nonzero coefficients}}{\text{Nb of coefficients}}
$$
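The criterion can be sketched as a small function. This assumes the square root covers only the reconstruction-error term and that the second term is the fraction of coefficients left nonzero after filtering; the function name and the toy data are hypothetical.

```python
import math

def quality_criterion(originals, reconstructions, n_nonzero, n_coefficients):
    """Root mean squared reconstruction error plus the fraction of nonzero coefficients.

    originals, reconstructions: lists of n spectra, each sampled at D points.
    """
    n = len(originals)
    total = 0.0
    for f, f_hat in zip(originals, reconstructions):
        D = len(f)
        total += sum((a - b) ** 2 for a, b in zip(f, f_hat)) / D
    rmse = math.sqrt(total / n)
    return rmse + n_nonzero / n_coefficients

# Toy data: two "spectra" of four points, one of them imperfectly reconstructed.
originals = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]]
reconstructions = [[1.0, 2.0, 3.0, 2.0], [0.0, 0.0, 2.0, 2.0]]

score = quality_criterion(originals, reconstructions, n_nonzero=256, n_coefficients=1024)
```

Evaluating this score for each candidate wavelet and each J, then keeping the pair with the smallest value, would be one way to carry out the tuning described above.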


Page 29: Representation of metabolomic data with wavelets

Wavelet representation

Final reconstruction of the data

274 nonzero coefficients


Page 30: Representation of metabolomic data with wavelets

Wavelet representation

Boxplots

Original coefficients

Scaled coefficients (centered by the mean and divided by the standard deviation)
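A minimal sketch of this scaling, applied column by column to a table with one row per observation and one column per coefficient. Using the sample standard deviation and mapping constant columns to zero are assumptions; the slides do not specify either.

```python
from statistics import mean, stdev

def scale_columns(rows):
    """Center each column by its mean and divide by its sample standard deviation."""
    scaled_columns = []
    for column in zip(*rows):
        m, s = mean(column), stdev(column)
        # Assumption: a constant column (s == 0) is mapped to zeros.
        scaled_columns.append([(v - m) / s if s > 0 else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]

# Three observations, two coefficients (the second one is constant).
coefficients = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
scaled = scale_columns(coefficients)
```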


Page 33: Representation of metabolomic data with wavelets

Perspective of work

Using random forests

The idea is to use random forests to make predictions and also to extract the main coefficients responsible for explaining the target variables.

Proposed regression: the scaled coefficients will be the explanatory variables. The variable of interest could be:

• the dose (either as a number or as a class, leading to a classification problem);

• the total dose ingested (i.e., the dose multiplied by the number of days of ingestion);

• any other interesting idea?

The idea is to rebuild the individuals from the main coefficients (setting the others to zero) to see which peaks differ from one group to the others.
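The last step, rebuilding an individual from a few selected coefficients, can be sketched as below. The Haar wavelet, the full-depth decomposition and the set `main_indices` are assumptions for illustration; in the proposed workflow the selected indices would come from the random forest's ranking of the coefficients, which is not reproduced here.

```python
import math

S = 1 / math.sqrt(2)

def haar_forward(signal):
    """Full Haar decomposition, flattened as [trend, coarsest details, ..., finest details]."""
    coeffs, trend = [], list(signal)
    while len(trend) > 1:
        pairs = [(trend[i], trend[i + 1]) for i in range(0, len(trend), 2)]
        details = [(a - b) * S for a, b in pairs]
        trend = [(a + b) * S for a, b in pairs]
        coeffs = details + coeffs
    return trend + coeffs

def haar_inverse(coeffs):
    """Inverse of haar_forward."""
    signal, pos = coeffs[:1], 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(signal)]
        pos += len(details)
        signal = [v for a, d in zip(signal, details) for v in ((a + d) * S, (a - d) * S)]
    return signal

def rebuild_from_main(signal, main_indices):
    """Keep only the coefficients listed in main_indices, zero the others, and invert."""
    coeffs = haar_forward(signal)
    kept = [c if i in main_indices else 0.0 for i, c in enumerate(coeffs)]
    return haar_inverse(kept)

spectrum = [0.0, 0.0, 5.0, 5.0, 0.0, 0.0, 0.0, 0.0]

# Hypothetical selection: only the trend coefficient is kept.
coarse = rebuild_from_main(spectrum, main_indices={0})

# Keeping every coefficient gives back the original spectrum.
full = rebuild_from_main(spectrum, main_indices=set(range(len(spectrum))))
```

Comparing such rebuilt spectra between groups would show where the selected coefficients place their peaks along the ppm axis.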
