June 5th, 2009: Representation of metabolomic data with wavelets, BioPuces working group (Groupe de travail BioPuces), INRA d’Auzeville.
Representation of metabolomic data with wavelets
Nathalie Villa-Vialaneix
http://www.nathalievilla.org
Toulouse School of Economics
BioPuces working group, INRA de Castanet
June 5th, 2009
BioPuces (05/06/09) Nathalie Villa Metabolomic data 1 / 16
Outline
1 Database presentation
2 Wavelet representation
3 Perspective of work
Database presentation
Basics about the database
The database was provided by Alain Paris (INRA) and consists of metabolomic recordings (¹H NMR spectra) of mouse urine.
950 variables from 0.505 ppm to 9.995 ppm, i.e. one variable every 0.01 ppm.
Baseline has been removed and peaks have been aligned.
Purpose of the work
Study the effects of the ingestion of Hypochoeris radicata (HR) on the metabolism: the inflorescences of this plant are known to be responsible for a horse disease, Australian stringhalt.
As it is hard to obtain several dozen horses for sacrifice, the experiments were conducted on 72 mice.
Description of the experiment
72 mice from:
• 2 sexes: 36 males, 36 females
• 3 kinds of HR doses: 0 (control): 24 mice, 3%: 24 mice, 9%: 24 mice
• 3 sacrifice dates: 8th day: 24 mice, 15th day: 24 mice, 21st day: 24 mice
⇒ 18 groups.
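The count of 18 groups follows from crossing the three factors. A minimal sketch of that arithmetic is given here; the variable names are illustrative, and the figure of 4 mice per group assumes a fully balanced design, which the marginal counts suggest but the slides do not state explicitly.

```python
from itertools import product

# Factor levels as listed on the slide.
sexes = ["male", "female"]
doses = ["0% (control)", "3%", "9%"]
sacrifice_days = [8, 15, 21]

# Every combination of sex, dose and sacrifice date is one experimental group.
groups = list(product(sexes, doses, sacrifice_days))

n_mice = 72
# Assumption: the design is balanced, so mice are spread evenly over the groups.
mice_per_group = n_mice // len(groups)

print(len(groups), "groups,", mice_per_group, "mice per group")
```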
Measurement days
The urine was collected on the following days:
Day                  0    1    4    8    11   15   18   21
Nb of observations   68   68   68   66   46   44   19   18
For each mouse, from 2 to 22 measurements are made.
In conclusion, 397 observations for 950 variables.
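The total of 397 observations is the sum of the daily counts. The short check below reproduces it; the variable names are illustrative only.

```python
# Collection days and number of observations per day, as given on the slide.
days = [0, 1, 4, 8, 11, 15, 18, 21]
n_obs = [68, 68, 68, 66, 46, 44, 19, 18]

obs_per_day = dict(zip(days, n_obs))
total_obs = sum(n_obs)

# One row per observation (urine sample), one column per spectral variable.
n_variables = 950
data_shape = (total_obs, n_variables)

print(total_obs, data_shape)
```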
Wavelet representation
Basic principle of wavelets
For a given integer J, the spectra can be expressed at level J as:
$$f(x) = \underbrace{\sum_k \alpha_k \, 2^{-J/2} \, \Psi(2^{-J}x - k)}_{\text{Trend: based on the father wavelet } \Psi} + \underbrace{\sum_{j=1}^{J} \sum_k \beta_{jk} \, 2^{-j/2} \, \Phi(2^{-j}x - k)}_{\text{Details at levels } 1,\dots,J \text{: based on the mother wavelet } \Phi}$$
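The split into a trend at level J and details at levels 1 to J can be sketched with the Haar wavelet, the simplest orthonormal case. The choice of Haar, the function names and the toy signal are assumptions made for illustration; the slides do not say which wavelet family was used.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_step(x):
    """One level: split an even-length signal into trend and detail coefficients."""
    trend = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return trend, detail

def haar_decompose(x, J):
    """Return (alpha, [beta_1, ..., beta_J]): trend at level J, details at levels 1..J.

    The length of x must be divisible by 2**J.
    """
    trend, details = list(x), []
    for _ in range(J):
        trend, d = haar_step(trend)
        details.append(d)
    return trend, details

def haar_reconstruct(trend, details):
    """Invert haar_decompose: add the details back, coarsest level first."""
    x = list(trend)
    for d in reversed(details):
        x = [v for a, b in zip(x, d) for v in ((a + b) / SQRT2, (a - b) / SQRT2)]
    return x

# Toy signal of length 8 (hypothetical values, not taken from the spectra).
signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
alpha, betas = haar_decompose(signal, J=2)
rebuilt = haar_reconstruct(alpha, betas)
print(alpha, betas)
```

With all coefficients kept, the trend plus the details rebuild the signal up to rounding error, which is the discrete counterpart of the expansion of f.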
Hierarchical decomposition
We add 74 zero values at the end of the spectra to have a dyadic discrete sampling.
Original data: f observed at t1, ..., t1024, equally spaced
  ↓ ↘
Level 1: Trend   Details
  ↓ ↘
Level 2: Trend   Details
  . . .
  ↓ ↘
Level 9: Trend   Details
⇒ At level 9 (maximum level with a discrete sampling of length 1024), we obtain 1025 coefficients.
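The padding and the halving of the trend at each level can be sketched as follows. The helper names are hypothetical, and the count assumes an idealized decimated orthogonal transform in which each level halves the trend. Under that assumption the total at level 9 equals the number of samples (1024); implementations with other boundary handling or storage conventions can report a slightly different number.

```python
n_points = 950          # spectral variables per observation
padded_length = 1024    # 2**10, the next power of two
n_zeros = padded_length - n_points

def pad_with_zeros(spectrum, length=padded_length):
    """Append zeros at the end of a spectrum to reach a dyadic length."""
    return list(spectrum) + [0.0] * (length - len(spectrum))

def coefficient_counts(n, J):
    """Number of trend and detail coefficients at level J for n = 2**m samples."""
    details = {j: n // 2 ** j for j in range(1, J + 1)}
    trend = n // 2 ** J
    return trend, details

trend_9, details_9 = coefficient_counts(padded_length, 9)
total_9 = trend_9 + sum(details_9.values())
print(n_zeros, trend_9, details_9, total_9)
```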
Examples
[Figure: examples of the trend and of the details]
Denoising
For coefficients corresponding to detail levels greater than J (with J large enough), a filtering is applied:
$$c^* = \begin{cases} 0 & \text{if } |c| < 2\sqrt{\log 10}\,\hat{\sigma} \\ c & \text{if } |c| \geq 2\sqrt{\log 10}\,\hat{\sigma} \end{cases}$$
(Donoho and Johnstone)
Two parameters are to be tuned:
• Which wavelet has to be used?
• Which J has to be used?
to make a trade-off between the quality of the reconstruction of the function (how close to the original values are the functions rebuilt from the filtered coefficients?) and the number of non-zero coefficients.
Minimization of an empirical (self-created) quality criterion:
$$\sqrt{\frac{1}{n}\sum_i \frac{1}{D}\sum_j \left(f_i(t_j) - \hat{f}_i(t_j)\right)^2} + \frac{\text{Nb of non-zero coefficients}}{\text{Nb of coefficients}}$$
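The thresholding rule and the quality criterion can be sketched as below. Several points are assumptions: the logarithm is taken as the natural logarithm, the noise level is passed in as `sigma_hat` because the slides do not say how it is estimated, the square root is read as covering only the reconstruction-error term, and the count in the second term is taken to be the coefficients left non-zero by the thresholding. Function names are illustrative.

```python
import math

def hard_threshold(coeffs, sigma_hat):
    """Set to zero every coefficient whose magnitude is below the threshold."""
    # Threshold as written on the slide; natural log is an assumption.
    t = 2.0 * math.sqrt(math.log(10.0)) * sigma_hat
    return [c if abs(c) >= t else 0.0 for c in coeffs]

def quality_criterion(originals, reconstructions, n_nonzero, n_coeffs):
    """Root mean squared reconstruction error plus the fraction of kept coefficients.

    originals, reconstructions: lists of n spectra, each sampled at D points.
    """
    n = len(originals)
    mse = 0.0
    for f, f_hat in zip(originals, reconstructions):
        D = len(f)
        mse += sum((a - b) ** 2 for a, b in zip(f, f_hat)) / D
    return math.sqrt(mse / n) + n_nonzero / n_coeffs

# Toy values (hypothetical): with sigma_hat = 1 the threshold is about 3.03.
filtered = hard_threshold([0.5, -4.0, 3.2, -0.1], sigma_hat=1.0)
crit = quality_criterion([[1.0, 2.0, 3.0, 4.0]], [[1.0, 2.0, 3.0, 2.0]],
                         n_nonzero=2, n_coeffs=4)
print(filtered, crit)
```

In a tuning loop, this criterion would be evaluated for each candidate wavelet and each candidate J, and the pair with the smallest value retained.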
Final reconstruction of the data
274 non-zero coefficients
Boxplots
[Figure: boxplots of the original coefficients]
[Figure: boxplots of the scaled coefficients (centered by the mean and divided by the standard deviation)]
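The scaling of each coefficient by its mean and standard deviation can be sketched as follows. The use of the sample standard deviation and the handling of constant columns (left at zero) are assumptions; the slides do not specify either.

```python
import statistics

def scale_columns(rows):
    """Center each column by its mean and divide by its standard deviation.

    rows: list of observations, each a list of coefficients.
    Constant columns (zero standard deviation) are mapped to 0.0.
    """
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]  # sample standard deviation
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds)]
        for row in rows
    ]

# Toy coefficient matrix (hypothetical): 3 observations, 2 coefficients.
coeffs = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
scaled = scale_columns(coeffs)
print(scaled)
```

The guard on constant columns matters here because a coefficient that is thresholded to zero for every observation has zero variance.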
Perspective of work
Using random forests
The idea is to use random forests to make predictions and also to extract the main coefficients responsible for the explanation of the target variables.
Proposed regression: the scaled coefficients will be the explanatory variables. The variable of interest could be:
• the dose (either as a number or as a class, leading to a classification problem);
• the total dose ingested (i.e., the dose multiplied by the number of days of ingestion);
• any other interesting idea?
The idea is to rebuild the individuals from the main coefficients (setting the others to zero) to see which peaks differ from one group to the others.
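Rebuilding an individual from a subset of coefficients can be sketched as follows. The Haar wavelet, the function names, the toy signal and the hand-picked indices are all assumptions for illustration; in the proposed approach the kept indices would come from the random forest's ranking of the coefficients.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_forward(x):
    """Full Haar transform of a length-2**m signal.

    Returns a flat list: [trend, coarsest details, ..., finest details].
    """
    trend, details = list(x), []
    while len(trend) > 1:
        pairs = [(trend[i], trend[i + 1]) for i in range(0, len(trend), 2)]
        details = [(a - b) / SQRT2 for a, b in pairs] + details
        trend = [(a + b) / SQRT2 for a, b in pairs]
    return trend + details

def haar_inverse(coeffs):
    """Invert haar_forward, adding detail levels from coarsest to finest."""
    x, pos = coeffs[:1], 1
    while pos < len(coeffs):
        d = coeffs[pos:pos + len(x)]
        pos += len(x)
        x = [v for a, b in zip(x, d) for v in ((a + b) / SQRT2, (a - b) / SQRT2)]
    return x

def rebuild_from(coeffs, kept_indices):
    """Keep only the selected coefficients, set the others to zero, and invert."""
    kept = set(kept_indices)
    masked = [c if i in kept else 0.0 for i, c in enumerate(coeffs)]
    return haar_inverse(masked)

# Toy "spectrum" with one peak (hypothetical values).
signal = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 1.0, 1.0]
coeffs = haar_forward(signal)
exact = rebuild_from(coeffs, [0, 1, 3])   # all non-zero coefficients kept
coarse = rebuild_from(coeffs, [0, 1])     # the coefficient locating the peak is dropped
print(exact, coarse)
```

Comparing such reconstructions between groups shows which regions of the spectrum are carried by the retained coefficients.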