
Representation of metabolomic data with wavelets

Nathalie Villa-Vialaneix
http://www.nathalievilla.org

Toulouse School of Economics

Workgroup BioPuces, INRA de Castanet
June 5th, 2009

BioPuces (05/06/09) Nathalie Villa Metabolomic data

Outline

1 Database presentation

2 Wavelet representation

3 Perspective of work


Database presentation







Basics about the database

The database was provided by Alain Paris (INRA) and consists of metabolomic recordings (¹H NMR spectra) of mouse urine.

950 variables, from 0.505 ppm to 9.995 ppm.

Baseline has been removed and peaks have been aligned.
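As a quick check of the variable grid, a minimal sketch assuming the 950 variables are equally spaced chemical shifts (the names `ppm_axis` and `step` are illustrative only):

```python
# Assumption: the 950 variables form a regular grid of chemical shifts.
N_VARIABLES = 950
PPM_MIN, PPM_MAX = 0.505, 9.995

# Spacing between two consecutive variables
step = (PPM_MAX - PPM_MIN) / (N_VARIABLES - 1)

# Chemical shift attached to each variable, rounded to 3 decimals
ppm_axis = [round(PPM_MIN + i * step, 3) for i in range(N_VARIABLES)]

print(len(ppm_axis), ppm_axis[0], ppm_axis[-1])
```

Under this assumption the spacing is 0.01 ppm.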



Purpose of the work

Study the effects of the ingestion of Hypochoeris radicata (HR) on the metabolism: the inflorescences of this plant are known to be responsible for a horse disease, Australian stringhalt.

As it is hard to obtain several dozen horses for sacrifice, the experiments were conducted on 72 mice.






Description of the experiment

72 mice from:

• 2 sexes: 36 males, 36 females
• 3 HR doses: 0 (control): 24 mice; 3%: 24 mice; 9%: 24 mice
• 3 sacrifice dates: 8th day: 24 mice; 15th day: 24 mice; 21st day: 24 mice

⇒ 18 groups.
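The 18 groups come from crossing the three factors. A minimal sketch of the enumeration; the labels are illustrative and the number of mice per group assumes a fully balanced design, which the slides suggest but do not state:

```python
from itertools import product

SEXES = ("male", "female")
DOSES = ("0%", "3%", "9%")        # HR doses, 0% being the control
SACRIFICE_DAYS = (8, 15, 21)
N_MICE = 72

# One group per (sex, dose, sacrifice day) combination
groups = list(product(SEXES, DOSES, SACRIFICE_DAYS))

# Assumption: mice are spread evenly over the groups
mice_per_group = N_MICE // len(groups)

print(len(groups), mice_per_group)
```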







Measurements days

The urine was collected on the following days:

Day                  0   1   4   8   11  15  18  21
Nb of observations   68  68  68  66  46  44  19  18

For each mouse, from 2 to 22 measurements are made.

In conclusion, 397 observations for 950 variables.
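The total number of observations can be checked from the per-day counts. A small sketch (variable names are illustrative):

```python
# Number of urine samples available on each collection day (from the table)
observations_per_day = {0: 68, 1: 68, 4: 68, 8: 66, 11: 46, 15: 44, 18: 19, 21: 18}

n_observations = sum(observations_per_day.values())
n_variables = 950

# Shape of the resulting data matrix: one row per spectrum
data_shape = (n_observations, n_variables)
print(data_shape)
```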





Wavelet representation







Basic principle of wavelets

For a given integer J, the spectra can be expressed at level J as:

$$
f(x) = \underbrace{\sum_{k} \alpha_k \, 2^{-J/2} \, \Psi(2^{-J}x - k)}_{\text{Trend: based on the father wavelet } \Psi} + \underbrace{\sum_{j=1}^{J} \sum_{k} \beta_{jk} \, 2^{-j/2} \, \Phi(2^{-j}x - k)}_{\text{Details at levels } 1,\dots,J\text{: based on the mother wavelet } \Phi}
$$
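To make the trend/details split concrete, here is a minimal sketch using the Haar wavelet. The choice of Haar is an assumption made for simplicity (the slides do not say which wavelet is used), and the function names are illustrative:

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform: returns (trend, details)."""
    s = 1 / math.sqrt(2)
    trend = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return trend, details

def haar_decompose(signal, level):
    """Return (trend at `level`, [details at level 1, ..., details at `level`])."""
    trend, all_details = list(signal), []
    for _ in range(level):
        trend, details = haar_step(trend)
        all_details.append(details)
    return trend, all_details

def haar_reconstruct(trend, all_details):
    """Invert haar_decompose: add the details back, coarsest level first."""
    s = 1 / math.sqrt(2)
    signal = list(trend)
    for details in reversed(all_details):
        rebuilt = []
        for a, d in zip(signal, details):
            rebuilt.extend([(a + d) * s, (a - d) * s])
        signal = rebuilt
    return signal

f = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]   # toy "spectrum" of length 2**3
trend, details = haar_decompose(f, level=2)       # J = 2
f_rebuilt = haar_reconstruct(trend, details)
```

The trend plays the role of the α coefficients and each list in `details` holds the β coefficients of one level; summing both contributions gives back f exactly.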








Hierarchical decomposition

We add 74 zero values at the end of each spectrum to obtain a dyadic discrete sampling (950 + 74 = 1024 = 2^10).

Original data: f observed at t1, ..., t1024, equally spaced
        ↓          ↘
Level 1 Trend      Level 1 Details
        ↓          ↘
Level 2 Trend      Level 2 Details
        ...
        ↓          ↘
Level 9 Trend      Level 9 Details

⇒ At level 9 (maximum level with a discrete sampling of length 1024), we obtain 1025 coefficients.
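The zero padding and the size of each level can be sketched as follows. The helper `pad_to_dyadic` is hypothetical, and the per-level sizes are those of a plain decimated transform, which gives 1024 coefficients in total; the exact count reported by a given software may differ with its wavelet and boundary conventions:

```python
def pad_to_dyadic(spectrum):
    """Append zeros until the length is the next power of two."""
    target = 1
    while target < len(spectrum):
        target *= 2
    return list(spectrum) + [0.0] * (target - len(spectrum))

spectrum = [1.0] * 950            # placeholder values; only the length matters here
padded = pad_to_dyadic(spectrum)
n_padding = len(padded) - len(spectrum)

# Number of detail coefficients at levels 1, ..., 9 (halved at each level)
level_sizes = [len(padded) // 2 ** j for j in range(1, 10)]

print(n_padding, len(padded), level_sizes)
```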








Examples

[Figure: examples of trend and details]



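Denoising

For coefficients corresponding to details at levels greater than J (with J large enough), a filtering is applied:

$$
c^{*} =
\begin{cases}
0 & \text{if } |c| < 2\sqrt{\log 10}\,\hat{\sigma} \\
c & \text{if } |c| \geq 2\sqrt{\log 10}\,\hat{\sigma}
\end{cases}
\qquad \text{(Donoho and Johnstone)}
$$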
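The filtering rule is a hard thresholding. A minimal sketch, with two labeled assumptions: the logarithm is taken as the natural logarithm, and the noise level σ̂ is estimated by the median of the absolute detail coefficients divided by 0.6745, a common robust estimate that the slides do not specify:

```python
import math
import statistics

def estimate_sigma(finest_details):
    """Robust noise estimate: median absolute value / 0.6745 (an assumption)."""
    return statistics.median(abs(c) for c in finest_details) / 0.6745

def hard_threshold(coefficients, sigma_hat, factor=2 * math.sqrt(math.log(10))):
    """Set to zero every coefficient whose magnitude is below factor * sigma_hat."""
    threshold = factor * sigma_hat
    return [c if abs(c) >= threshold else 0.0 for c in coefficients]

details = [0.1, -0.2, 5.0, 0.05, -4.0, 0.15, 0.3, -0.1]   # toy detail coefficients
sigma_hat = estimate_sigma(details)
filtered = hard_threshold(details, sigma_hat)
```

On this toy input only the two large coefficients survive; the small ones, treated as noise, are set to zero.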

Two parameters are to be tuned:
• Which wavelet has to be used?
• Which J has to be used?

to make a trade-off between the quality of the reconstruction of the function (how well do the functions rebuilt from the filtered coefficients match the original ones?) and the number of non-zero coefficients.

Minimization of an empirical (self-created) quality criterion:

$$
\sqrt{\frac{1}{n} \sum_{i} \frac{1}{D} \sum_{j} \left( f_i(t_j) - \hat{f}_i(t_j) \right)^2} + \frac{\text{Nb of non-zero coefficients}}{\text{Nb of coefficients}}
$$
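The criterion adds a reconstruction error to a sparsity ratio. A possible sketch, assuming the coefficient counts are pooled over all spectra (the slides do not say whether they are pooled or averaged) and using toy data:

```python
import math

def quality_criterion(originals, reconstructions, coefficients):
    """Root mean squared reconstruction error + fraction of non-zero coefficients."""
    n = len(originals)
    mean_sq = 0.0
    for f, f_hat in zip(originals, reconstructions):
        D = len(f)                                   # number of sampling points
        mean_sq += sum((a - b) ** 2 for a, b in zip(f, f_hat)) / D
    rmse = math.sqrt(mean_sq / n)
    n_total = sum(len(c) for c in coefficients)
    n_nonzero = sum(1 for c in coefficients for v in c if v != 0.0)
    return rmse + n_nonzero / n_total

# Toy example: 2 spectra sampled at 4 points, with their filtered coefficients
originals = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0]]
reconstructions = [[1.0, 2.0, 3.0, 2.0], [2.0, 2.0, 2.0, 2.0]]
coefficients = [[5.0, 0.0, 0.0, -1.0], [4.0, 0.0, 0.0, 0.0]]

score = quality_criterion(originals, reconstructions, coefficients)
```

In practice the score would be computed for each candidate wavelet and each candidate J, and the pair with the smallest score retained.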






Final reconstruction of the data

274 non-zero coefficients



Boxplots

Original coefficients

Scaled coefficients (centered by the mean and divided by the standard deviation)
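The scaling step can be sketched as a column-wise standardization. The use of the sample standard deviation (denominator n - 1) is an assumption, and the data are toy values:

```python
import statistics

def scale_columns(rows):
    """Center each column on its mean and divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]   # sample sd, an assumption
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

# Toy matrix: 3 observations (rows) x 2 wavelet coefficients (columns)
coefficients = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
scaled = scale_columns(coefficients)
```

A coefficient that is constant over all observations (for example, always set to zero by the filtering) has a zero standard deviation and would need to be dropped before scaling.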





Perspective of work






Perspective of work

Using random forests

The idea is to use random forests to make predictions and also to extract the main coefficients responsible for the explanation of the target variables.

Proposed regression: the scaled coefficients will be the explanatory variables. The variable of interest could be:

• the dose (either as a number, or as a class, leading to a classification problem);

• the total dose ingested (i.e., the dose multiplied by the number of days of ingestion);

• any other interesting idea?
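The candidate target variables could be built as follows. This is only a sketch: the function `targets` is hypothetical, and it assumes the number of days of ingestion equals the day on which the urine sample was collected:

```python
def targets(dose_percent, days_of_ingestion):
    """Candidate variables of interest for one observation."""
    return {
        "dose": float(dose_percent),                     # regression target
        "dose_class": f"{dose_percent}%",                # classification target
        "total_dose": dose_percent * days_of_ingestion,  # cumulative dose
    }

# A mouse receiving the 3% dose, observed on day 15
example = targets(dose_percent=3, days_of_ingestion=15)
print(example)
```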

The idea is to rebuild the individuals from the main coefficients (setting the others to zero) to see which peaks differ from one group to the others.
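Keeping only the main coefficients amounts to masking the coefficient vector before inverting the wavelet transform. A minimal sketch, where the importance scores are hypothetical values standing in for the variable importances a random forest would provide:

```python
def keep_main_coefficients(coefficients, importances, n_keep):
    """Zero every coefficient except the n_keep with the highest importance."""
    ranked = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    kept = set(ranked[:n_keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coefficients)]

coefficients = [0.8, -1.5, 0.1, 2.0, -0.3]       # wavelet coefficients of one spectrum
importances = [0.05, 0.40, 0.02, 0.50, 0.03]     # hypothetical importance scores
masked = keep_main_coefficients(coefficients, importances, n_keep=2)
```

Applying the inverse wavelet transform to `masked` would then give a spectrum showing only the peaks carried by the selected coefficients.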



