Metabolomic data: combining wavelet representation with learning approaches
Groupe de travail Biopuces, INRA d'Auzeville, May 19th, 2010

Nathalie Villa-Vialaneix
http://www.nathalievilla.org
In collaboration with Noslen Hernández (CENATAV, Havana, Cuba) & Philippe Besse
IUT de Carcassonne (UPVD)
& Institut de Mathématiques de Toulouse
Groupe de travail BioPuces, INRA de Castanet
May 19th, 2010
Outline
1 Presentation of the data
2 Wavelet preprocessing and normalization
3 Learning methods
4 Identification of relevant metabolites
Presentation of the data
The data were provided by Alain Paris (INRA): they are metabolomic spectra (1H NMR) of mouse urine, each consisting of 950 variables (from 0.50 ppm to 9.99 ppm).
Peaks have been aligned and the baseline has been removed.
Biological question
Study the effects of Hypochoeris radicata (HR) ingestion on metabolism: HR flowers are responsible for a fatal disease in horses, "Australian stringhalt" (attack on the nervous system, trembling...).
Experiments were carried out on 72 mice.
Description of the experiments
Mice are divided into several groups according to:
gender: 36 males; 36 females
daily HR dose ingested: 0 (control): 24 mice; 3%: 24 mice; 9%: 24 mice
sacrifice date (3 dates): 8th day: 24 mice; 15th day: 24 mice; 21st day: 24 mice
⇒ 18 groups (but the groups defined by the sacrifice dates are irrelevant to the biological question).
Days of measurement
Urine was collected on the following days:
Day         0   1   4   8   11  15  18  21
Nb of obs.  68  68  68  66  46  44  19  18
For each mouse, between 1 and 8 measurements were taken.
In total, there are 397 observations of 950 variables.
Wavelet preprocessing and normalization
Basics about wavelets
For a given integer J, a spectrum f can be expressed at level J by:
$$
f(x) = \underbrace{\sum_{k} \alpha_k\, 2^{-J/2}\, \Psi(2^{-J}x - k)}_{\text{trend, based on the father wavelet } \Psi} + \underbrace{\sum_{j=1}^{J} \sum_{k} \beta_{jk}\, 2^{-j/2}\, \Phi(2^{-j}x - k)}_{\text{details of levels } 1, \dots, J, \text{ based on the mother wavelet } \Phi}
$$
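As an illustration of the trend/details decomposition above, here is a minimal sketch of a multi-level discrete Haar transform in plain Python. It assumes the orthonormal convention (division by √2) and a signal length divisible by 2^J; the function names and the toy spectrum are illustrative and do not come from the talk.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform: returns (trend, details)."""
    s = 1.0 / math.sqrt(2.0)
    trend = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return trend, details

def haar_decompose(signal, levels):
    """Return (trend at the last level, [details of level 1, ..., details of level `levels`])."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    trend, all_details = list(signal), []
    for _ in range(levels):
        trend, details = haar_step(trend)
        all_details.append(details)
    return trend, all_details

def haar_reconstruct(trend, all_details):
    """Invert haar_decompose, going from the coarsest level back to the signal."""
    s = 1.0 / math.sqrt(2.0)
    current = list(trend)
    for details in reversed(all_details):
        rebuilt = []
        for a, d in zip(current, details):
            rebuilt.extend([(a + d) * s, (a - d) * s])
        current = rebuilt
    return current

# Toy "spectrum" with 8 points, decomposed at level J = 2.
spectrum = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
trend, details = haar_decompose(spectrum, levels=2)
rebuilt = haar_reconstruct(trend, details)
```

Each Haar detail coefficient is a scaled difference between neighbouring values (or neighbouring local averages), which is why these coefficients behave like discrete derivatives.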
Example of a hierarchical decomposition for a metabolomic spectrum
[Figure: successive decomposition levels of a spectrum, yielding details of levels 1 to 8.]
Several strategies
Several wavelet bases:
Haar wavelets (easily interpretable because they are close to discrete derivatives);
D4 Daubechies wavelets (smoother representation, but not directly interpretable).
Several preprocessings:
Use all wavelet coefficients as input data;
Use thresholded wavelet coefficients as input data (i.e., remove the smallest coefficients with an automatic method called "soft thresholding");
Use only the detail coefficients (and the detail coefficients of the shifted spectra) as input data.
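The soft thresholding mentioned above can be sketched as follows. The talk does not say which automatic rule selects the threshold; the universal threshold σ√(2 log n), with σ estimated from the median absolute coefficient, is used here purely as an assumption, and both function names are illustrative.

```python
import math
import statistics

def soft_threshold(coefficients, threshold):
    """Shrink every coefficient towards zero by `threshold`; small ones become 0."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coefficients]

def universal_threshold(detail_coefficients):
    """Assumed rule: sigma * sqrt(2 log n), with sigma = median(|d|) / 0.6745."""
    n = len(detail_coefficients)
    sigma = statistics.median([abs(c) for c in detail_coefficients]) / 0.6745
    return sigma * math.sqrt(2.0 * math.log(n))

# Toy detail coefficients: mostly small values and one large coefficient.
details = [0.1, -0.2, 0.05, 3.0, -0.1, 0.15, -0.05, 0.1]
threshold = universal_threshold(details)
shrunk = soft_threshold(details, threshold)
```

On this toy vector only the large coefficient survives, and it is shrunk by the threshold; this is what distinguishes soft from hard thresholding, which would keep it unchanged.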
Scaling of wavelet coefficients (example: Haar detail coefficients)
[Figure: values of selected Haar detail coefficients (D.1, D.57, D.125, D.297, D.370, D.443, D2.41, D2.120, D2.304, D2.389, D2.474) before scaling (vertical axis from −40 to 40) and after scaling (vertical axis from −15 to 15).]
Normalization issue
[Figure: PCA of the coefficients, pairwise scatter plots of the first four principal components (PC1 vs. PC2, PC1 vs. PC3, PC1 vs. PC4, PC2 vs. PC3, PC2 vs. PC4, PC3 vs. PC4), with points labelled by day of measurement (days 0, 1, 4, 8, 11, 15, 18, 21).]
PCA of the coefficients: for the control group, the day of measurement stands out on axes 2 and 4.
Normalization
Find the median and variance of the coefficients for each day of measurement, based on the control group.
Use these values to normalize all the observations (according to their day of measurement).
[Figure: wavelet coefficients D2.444, D.78, D.332 and D2.289 plotted against the day of measurement, before and after normalization.]
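The two normalization steps above can be sketched for a single wavelet coefficient as follows. The slide only says that the control group's median and variance per day are used; centring by the median and dividing by the square root of the (population) variance is an assumption, as are the function names and the toy data.

```python
import math
import statistics

def control_stats_by_day(values, days, is_control):
    """Per day: (median, standard deviation) of the coefficient over control mice only."""
    stats = {}
    for day in set(days):
        control = [v for v, d, c in zip(values, days, is_control) if d == day and c]
        stats[day] = (statistics.median(control), math.sqrt(statistics.pvariance(control)))
    return stats

def normalize(values, days, stats):
    """Normalize every observation with the control statistics of its own day."""
    return [(v - stats[d][0]) / stats[d][1] for v, d in zip(values, days)]

# Toy data: one coefficient, two days, two control mice and one treated mouse per day.
values = [1.0, 3.0, 5.0, 10.0, 12.0, 20.0]
days = [0, 0, 0, 1, 1, 1]
is_control = [True, True, False, True, True, False]
stats = control_stats_by_day(values, days, is_control)
normalized = normalize(values, days, stats)
```

After this step the control observations are centred on 0 for every day, so remaining differences are less likely to reflect the day of measurement alone. A day whose control values are all identical would give a zero variance and would need special handling.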
PCA after normalization
[Figure: PCA of the normalized coefficients, pairwise scatter plots of the first four principal components (PC1 vs. PC2, PC1 vs. PC3, PC1 vs. PC4, PC2 vs. PC3, PC2 vs. PC4, PC3 vs. PC4), with points labelled by day of measurement (days 0, 1, 4, 8, 11, 15, 18, 21).]
Learning methods
Motivations
Purpose: validate the impact of HR ingestion on metabolism by predicting the total HR dose ingested from the spectra. If the prediction is accurate, the impact is not an artefact of the data and the biological dependency is validated.
Compared methods:
random forest, RF (R package randomForest)
ridge regression (R package glmnet)
LASSO (R package glmnet)
elastic net, ELN (R package glmnet)
Partial Least Squares, PLS (R package mixOmics)
sparse PLS, SPLS (R package mixOmics)
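For reference, ridge regression, the LASSO and the elastic net as fitted by glmnet (Gaussian case) all minimise the same penalised least-squares criterion, shown below. This is the standard glmnet formulation and is not stated on the slides; reading a label such as "ELN 0.5" in the results as the mixing parameter α = 0.5 is an assumption.

$$
\min_{\beta_0,\,\beta}\ \frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^2+\lambda\Bigl[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2+\alpha\,\lVert\beta\rVert_1\Bigr]
$$

With α = 1 the penalty reduces to the LASSO and with α = 0 to ridge regression; λ ≥ 0 sets the overall strength of the penalty and is the kind of parameter usually tuned by cross-validation.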
Methodology
Split the data into training and test sets that are balanced with respect to the groups;
Preprocess the data with wavelets (or not), then scale and normalize them;
Train each of the 6 methods (for each of the 7 kinds of preprocessing) on the training set, using a cross-validation strategy to tune the parameters;
Compute the mean squared error on the test set.
Repeat the previous scheme 250 times.
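The evaluation scheme above can be sketched as a loop over group-balanced random splits. This is only a skeleton under stated assumptions: the test fraction (25%) is not given in the talk, preprocessing and cross-validated tuning are left out, and `fit_mean_model` is a hypothetical placeholder standing in for any of the six methods.

```python
import random
from collections import defaultdict

def balanced_split(groups, test_fraction, rng):
    """Return (train, test) index lists with about `test_fraction` of each group in test."""
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)
    train, test = [], []
    for members in by_group.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        n_test = max(1, round(test_fraction * len(shuffled)))
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return sorted(train), sorted(test)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_mean_model(X_train, y_train):
    """Placeholder learner: always predicts the mean of the training responses."""
    mean_y = sum(y_train) / len(y_train)
    return lambda row: mean_y

def repeated_test_errors(X, y, groups, fit, n_repeats=250, test_fraction=0.25, seed=0):
    """Repeat split / fit / test-MSE `n_repeats` times and return all test errors."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        train, test = balanced_split(groups, test_fraction, rng)
        predict = fit([X[i] for i in train], [y[i] for i in train])
        errors.append(mse([y[i] for i in test], [predict(X[i]) for i in test]))
    return errors

# Toy data: 12 observations in 3 dose groups of 4.
X = [[float(i)] for i in range(12)]
y = [0.0] * 4 + [3.0] * 4 + [9.0] * 4
groups = ["control"] * 4 + ["3%"] * 4 + ["9%"] * 4
errors = repeated_test_errors(X, y, groups, fit_mean_model)
mean_error = sum(errors) / len(errors)
```

Averaging `errors` over the repetitions gives one mean test error per method and preprocessing, together with a measure of its variability across splits.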
Mean performance on the test sets
Method    Original         Daubechies       Daubechies       Daubechies       Haar             Haar             Haar
                           Details          Full             Threshold        Details          Full             Threshold
ELN 0.5   16.29 (1.03)     15.38 (0.9)      14.33 (1.07)     42.94 (52.25)    15.39 (1.04)     14.49 (1.03)     30.98 (16.43)
ELN 0.25  16.12 (1.03)     15.28 (0.9)      14.35 (0.94)     44.62 (61.3)     15.2 (1)         14.47 (0.98)     32.54 (17.31)
ELN 0.1   15.81 (0.98)     15.14 (0.77)     14.38 (0.84)     42.58 (53.83)    15.15 (0.87)     14.58 (0.92)     35.41 (19.43)
ELN 0.75  16.31 (1.1)      15.48 (0.9)      14.43 (1.1)      42.62 (51.59)    15.44 (1.06)     14.5 (1.01)      30.31 (15.92)
Lasso     16.37 (1.27)     15.56 (1.01)     14.45 (1.14)     41.82 (50.86)    15.56 (1.1)      14.49 (1.01)     30.8 (17.01)
Ridge     16.82 (0.83)     16.22 (0.67)     15.56 (0.74)     41.75 (25.09)    16.16 (0.7)      15.66 (0.8)      37.58 (16.07)
PLS       16.83 (1.1)      16.25 (0.79)     15.61 (0.87)     81.56 (116.21)   16.09 (0.87)     15.87 (0.91)     42.6 (25.14)
RF        16.69 (0.91)     16.33 (1.36)     16.2 (1.16)      18.91 (1.66)     16.24 (1.06)     16.11 (1.09)     18.8 (1.32)
SPLS 5    19.71 (1.63)     19.25 (1.25)     16.55 (1.18)     36.54 (31.88)    19.1 (1.63)      17.24 (1.4)      34.25 (24.99)
SPLS 10   19.25 (1.65)     19.22 (1.23)     16.74 (1.15)     79.35 (110.56)   18.66 (1.36)     17.14 (1.25)     42.46 (23.76)
SPLS 20   18.41 (1.5)      18.81 (1.18)     17.55 (1.2)      76.05 (104.74)   18.55 (1.2)      17.11 (1.13)     42.38 (23.74)
Boxplot for the full Daubechies representation
[Figure: boxplots of the test errors obtained with the full Daubechies wavelet representation, one box per method (Lasso, Ridge, ELN 0.1, ELN 0.25, ELN 0.5, ELN 0.75, PLS, SPLS 5, SPLS 10, SPLS 20, RF); vertical axis from 12 to 20.]
Full Daubechies representation and ELN: accuracy (on test sets)
[Figure: predicted values plotted against true values; both axes range from 0 to 150.]
The mean R2 on the test sets is 89.0% (minimum 83.1%, maximum 92.8%).
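As a reminder of what such an R2 value measures, here is a minimal sketch of the usual coefficient of determination on a test set. Whether the talk used exactly this definition (rather than, say, a squared correlation) is an assumption, and the numbers below are made-up toy values, not results from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy true and predicted total doses for four test observations.
y_true = [0.0, 50.0, 100.0, 150.0]
y_pred = [10.0, 45.0, 95.0, 160.0]
score = r_squared(y_true, y_pred)
```

Computing this score on each of the repeated test sets and then taking the mean, minimum and maximum would give a summary of the same kind as the one reported above.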
Identification of relevant metabolites
Identification issue
The full learning process is the following:
Spectra → Wavelet preprocessing → Learning → HR dose prediction
Hence, because of the preprocessing step, the coefficients selected by ELN are not directly related to metabolites (or to locations on the spectra).
Adaptation of the importance measure
for each of the 950 variables, v, of the original data set do
    Randomize the observations of the variable v
    Compute the full Daubechies wavelet representation with the randomized observations for v
    Scale and normalize according to the mean, median or variance of the true values
    for each test set, i, do
        Compute new predictions with the false values of v and the corresponding MSE: mse_{v,i}
        Compute the decrease in accuracy for the test set: DA_i = 1 − mse_i / mse_{v,i}
    end for
    Average DA_i over i to obtain the importance of v
end for
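The procedure above can be sketched in plain Python as follows. In this simplified version, `predict` stands for the whole fitted pipeline (wavelet transform, scaling, normalization and model) applied to a row of original variables; the toy predictor, the data and the function names are hypothetical and only illustrate the computation of DA_i = 1 − mse_i / mse_{v,i}.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(X, y, predict, test_sets, variable, rng):
    """Average over test sets of DA_i = 1 - mse_i / mse_{v,i} for one original variable."""
    # Randomize the observations of the chosen variable across all rows.
    shuffled = [row[variable] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:variable] + [value] + row[variable + 1:]
              for row, value in zip(X, shuffled)]
    decreases = []
    for test in test_sets:
        y_test = [y[i] for i in test]
        mse_i = mse(y_test, [predict(X[i]) for i in test])        # true values of v
        mse_vi = mse(y_test, [predict(X_perm[i]) for i in test])  # false values of v
        decreases.append(1.0 - mse_i / mse_vi)
    return sum(decreases) / len(decreases)

# Toy data: the response depends on variable 0 only; variable 1 is irrelevant.
X = [[float(i), float(i % 3)] for i in range(20)]
y = [2.0 * row[0] + (0.5 if i % 2 else -0.5) for i, row in enumerate(X)]
predict = lambda row: 2.0 * row[0]  # stands in for preprocessing + fitted model
test_sets = [list(range(0, 10)), list(range(10, 20))]
rng = random.Random(0)
importances = [permutation_importance(X, y, predict, test_sets, v, rng) for v in range(2)]
```

A variable that the pipeline does not use gets an importance of 0, because shuffling it leaves the predictions unchanged; an influential variable gets a value close to 1, because the error with shuffled values is much larger than the original error.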
Values of importance
[Figure: importance of the variables (vertical axis, from 0.0 to 0.8) plotted against their rank (horizontal axis).]
Identification of important metabolites
[Figure: plot against ppm (horizontal axis from 2 to 10, vertical axis from 0 to 20), with the important variables highlighted in colour.]
Some have already been identified: the most important is scyllo-inositol; one of the orange ones is probably valine; one of the light yellow ones is probably trimethylamine. The others are new.
What next?
Identification of the metabolites, and study of the correlation between the ones found and the ones previously emphasized.
Questions? Suggestions?