Extensions of Common Component and Specific Weight Analysis for applications in Chemometrics
Douglas N. Rutledge, Delphine Jouan-Rimbaud--Bouveresse
AgroParisTech/INRA, UMR1145 Ingénierie Procédés Aliments, F-75005 PARIS
Introduction : Multiblock Data sets
Different types of multivariate data are measured on the same individuals.
• Examples :
- Sensory analysis: fixed or free choice profiling.
- Process technology: multivariate measurements are performed at different stages of the process.
- Functional genomics: genetic data, molecular data and phenotypic data are collected.
[Figure: m data blocks X1, X2, …, Xm, each with n rows (the same individuals) and p1, p2, …, pm columns]
Introduction : Multiblock Data sets
Objectives :
- To investigate the relationships between data blocks.
- To highlight and interpret the within-block patterns : relationships among samples.
How ?
- Define the dimensions underlying (common to) the data blocks and assess how relevant each dimension is to each data block.
How to interpret ?
- Express the underlying dimensions in terms of the variables in the various data blocks;
- Or express these underlying dimensions in terms of a reference data set.
A plethora of methods
PLS Path modelling
ACCPS
Tucker3
Parafac
Tucker1
Tucker2
Common Principal Components
…..
AFM
GPA
Statis
PLS2
Indscal
With so many methods around, what do you do?
Depending on your aim, pick one that you believe in, and work with it honestly.
PLS1
Analyse en Composantes Communes et Poids Spécifiques
Common Component and Specific Weights Analysis
Outline
The ComDim procedure
ComDim – Simultaneous analysis of several data tables
ComDim_VarSelec - ComDim for Variable selection
AComDim - ComDim for significant Factor detection
AComDim_VarSelec - ComDim for Variable selection AND Factor detection
O-PLS_ComDim - Separate orthogonal and non-orthogonal contributions
ComDim_DA - Discriminant Analysis based on ComDim on barycentres
ComDim - Simultaneous analysis of several data tables
[1] E. Qannari, I. Wakeling, P. Courcoux, H.J.H MacFie, Food Quality and Preference (2000) 11, 151
"Common Components and Specific Weights Analysis" - CCSWA [1]
Simultaneously study several matrices
- with different variables describing the same samples
Describe p data tables observed for the same n samples :
- a set of p data matrices (X) each with n rows,
- but not necessarily the same number of columns
Determine a common space for all p data tables,
- each matrix has a specific contribution ("salience") to the definition of each dimension of this common space
Start with p matrices Xi of size n × ki (i = 1 to p)
Each Xi is column-centred and scaled by dividing by its matrix norm, giving Xsi
For each Xsi, an n × n scalar product matrix Wi can be computed as :
Wi = Xsi • Xsi^T
Each Wi reflects the dispersion of the samples in the space of that table
The common dimensions of all the tables are computed iteratively
At each iteration, a weighted sum of the p Wi matrices is computed, resulting in a global matrix WG
$$W_k = \sum_{\mathrm{dim}=1}^{r} \lambda_k^{(\mathrm{dim})} \, q_{\mathrm{dim}} \, q_{\mathrm{dim}}' + E_k$$
The ComDim or CCSWA model
[Diagram: one iteration of the ComDim loop, shown for two tables X1 and X2]
W1 = X1.X1^T, W2 = X2.X2^T
Saliences : λ1, λ2
Dif = (W1 - λ1.q.q^T) + (W2 - λ2.q.q^T)
Aux = I - q.q^T
X1 = Aux.X1
X2 = Aux.X2
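The loop summarised above (weighted sum of the Wi, first eigenvector q, salience update, deflation with Aux = I - q.q^T) can be sketched in NumPy as follows. This is a minimal illustration, not the SAISIR implementation: the function name `comdim`, the convergence test on the summed squared residuals, and the toy data are all assumptions made for the example.

```python
import numpy as np

def comdim(blocks, n_components, n_iter=100, tol=1e-10):
    """Illustrative ComDim/CCSWA loop (hypothetical helper, not the SAISIR code)."""
    # Column-centre each block and scale it to unit Frobenius norm
    Xs = []
    for X in blocks:
        Xc = X - X.mean(axis=0)
        Xs.append(Xc / np.linalg.norm(Xc))
    n = Xs[0].shape[0]
    Q = np.zeros((n, n_components))                 # global scores
    saliences = np.zeros((len(Xs), n_components))   # one row per table
    for d in range(n_components):
        W = [X @ X.T for X in Xs]                   # n x n scalar-product matrices
        lam = np.ones(len(Xs))                      # saliences start at 1
        previous_fit = np.inf
        for _ in range(n_iter):
            W_G = sum(l * Wi for l, Wi in zip(lam, W))
            _, eigvecs = np.linalg.eigh(W_G)        # eigenvalues in ascending order
            q = eigvecs[:, -1]                      # first eigenvector of W_G
            lam = np.array([q @ Wi @ q for Wi in W])
            # Assumed stopping rule: change in the summed squared residuals
            fit = sum(np.linalg.norm(Wi - l * np.outer(q, q)) ** 2
                      for l, Wi in zip(lam, W))
            if abs(previous_fit - fit) < tol:
                break
            previous_fit = fit
        Q[:, d] = q
        saliences[:, d] = lam
        aux = np.eye(n) - np.outer(q, q)            # Aux = I - q.q^T
        Xs = [aux @ X for X in Xs]                  # deflate every table
    return Q, saliences

# Toy example: three tables describing the same 10 samples
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(10, k)) for k in (5, 30, 200)]
Q, S = comdim(blocks, n_components=3)
```

Because every table is deflated by the same q, the successive global scores come out mutually orthogonal, and each salience lies between 0 and 1 since each scaled Wi has unit trace.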
"ComDim" the implementation of CCSWA used here is part of the SAISIR toolbox SAISIR (2008): Statistics Applied to the Interpretation of Spectra in the InfraRed, D. Bertrand
(http://www.chimiometrie.fr/saisir_webpage.html)
The outputs of ComDim
- Saliences of tables : λ_k^(dim)
- Global scores of individuals : q_dim
- Loadings of variables : p_k^(dim)
- Local scores of individuals : t_k^(dim)
The data
• Samples
- Fifteen members of the laboratory or their relatives were recruited as volunteers
- Urine samples collected following standardised methodologies
- Urine samples spiked with a mixture of metabolites
• Technique
- Samples analysed by 5 NMR and 9 MS platforms
- Participants used their in-house protocol for instrument tuning and data processing
- Between 88 and 9699 variables in each table
• Pretreatment
- NMR data were pareto-scaled
- MS data were first Log10 transformed then pareto-scaled
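The two pretreatments named above can be illustrated with a short NumPy sketch. Pareto scaling is taken here in its usual sense (centre each column, divide by the square root of its standard deviation). The helper names, the absence of any offset before the Log10 transform, and the toy tables are assumptions; the slides do not say how zero intensities were handled.

```python
import numpy as np

def pareto_scale(X):
    """Centre each column and divide it by the square root of its standard deviation."""
    Xc = X - X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    return Xc / np.sqrt(sd)

def pretreat_ms(X):
    """Log10-transform, then Pareto-scale (assumes strictly positive intensities)."""
    return pareto_scale(np.log10(X))

# Toy tables standing in for one NMR and one MS platform
rng = np.random.default_rng(1)
nmr = rng.normal(loc=5.0, scale=1.0, size=(15, 20))
ms = rng.lognormal(mean=3.0, sigma=1.0, size=(15, 50))   # all values > 0
nmr_p = pareto_scale(nmr)
ms_p = pretreat_ms(ms)
```

After Pareto scaling, each column keeps a standard deviation equal to the square root of its original one, so large peaks are damped without being forced to unit variance.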
ComDim procedure to estimate the proportion of common spectral information extracted from different MS and NMR platforms in a metabonomic study
Application to metabonomic data
[Figure: bar chart, "Number of features retained per platform"; platforms labelled RMN1, RMN3 (NMR), ORBI1, ORBI3, ORBI5, QTOF1, QTOF2, QTOF4, QTOF5, TOF; vertical axis from 0 to 8000]
Global scores and Saliences
[Figures: score plots with samples labelled Spiked / Non spiked. Open symbols: NMR / Black symbols: MS]
Global and Local Scores
[Figure: score plot with samples labelled Spiked / Non spiked]
Can We Trust Untargeted Metabolomics: Results of the Metabo-Ring initiative, a large-scale multi-instruments inter-laboratory study
AfroData, Stellenbosch, 2012
Jean-Charles Martin, Matthieu Maillot, Gérard Mazerolles, Alexandre Verdu, Bernard Lyan, Carole Migne, Catherine Defoort, Cecile Canlet, Christophe Junot, Claude Guillou, Claudine Manach, Christophe Cordella, Daniel Jabob, Delphine Jouan-Rimbaud Bouveresse, Estelle Paris, Estelle Pujos, Fabien Jourdan, Franck Giacomoni, Fréderique Courant, Gaëlle Favé, Gwenaëlle Le Gall, Hubert Chassaigne, Jean-Claude Tabet, Jean-Francois Martin, Jean-Philippe Antignac, Laetitia Shintu, Marianne Defernez, Mark Philo, Marie-Cécile Alexandre-Gouaubau, Marie-Jo Amiot-Carlin, Mathilde Bossis, Mohamed N. Triba, Natali Stojilkovic, Nathalie Banzet, Roland Molinié, Romain Bott, Sophie Goulitquer, Stefano Caldarelli, Douglas N. Rutledge
MS platforms distinguish inter-individual differences
Both NMR & MS detect 2 unusual individuals
No effect of the technological architecture for MS (Orbitrap = TOF = QTOF)
No influence of the number of variables extracted (from 88 to 7000)
An application of ACCPS / ComDim
Three types of sunflower oils are used :
- oil “A” (production date: April 2009),
- oil “O” (production date: March 2010),
- oil “L” (production date: January 2011).
Packaged in :
- PET bottles,
- Glass bottles.
Accelerated aging at 40 °C
- GC-MS analysis at 10, 20 & 30 days
Chromatogram for each ion in a separate table
ComDim with 9 Common Components
ACCPS : Analyse en Composantes Communes et Poids Spécifiques
Application to Lignin data
8 types of Time Domain-NMR signals concatenated
20 samples in duplicate, with different characteristics :
- 2 Moisture levels : 33% (MgCl2 solution) / 75% (NaCl solution)
- 2 Shapes : Film / Cane
- 5 Lignin concentration levels : 0%, 5%, 10%, 15%, 30%
ComDim
X (40, 240) segmented into : 8 matrices (1 block per signal type)
ComDim with 6 Common Components applied to 8 matrices
Global scores of each Common Component: CC1 = Moisture, CC2 = Shape, CC3 = Lignin
Saliences of each table: CC1 = Moisture, CC2 = Shape, CC3 = Lignin
Global loadings of each Common Component: CC1 = Moisture, CC2 = Shape, CC3 = Lignin
ComDim_VarSelec - ComDim for variable selection
ComDim procedure applied to a single data matrix X segmented into p sub-matrices Xi :
X = [X1 | X2 | … | Xp]
Each sub-matrix Xi (i = 1 to p) is considered as an individual data table in ComDim
The objective is to use ComDim to detect which segments contain information, thus revealing the range(s) of interesting variables
Since all segments contribute to the global matrix WG, all variables are analysed simultaneously.
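The segmentation step can be illustrated with `numpy.array_split`. The helper name `segment_columns`, the use of equal-width contiguous segments, and the placeholder 40 × 240 table are assumptions for this sketch; segments could just as well follow the natural boundaries of the signals.

```python
import numpy as np

def segment_columns(X, n_segments):
    """Split X (n x m) into n_segments contiguous column blocks."""
    return np.array_split(X, n_segments, axis=1)

# Placeholder for a table with 40 samples and 240 variables
X = np.arange(40 * 240, dtype=float).reshape(40, 240)
blocks_8 = segment_columns(X, 8)       # 8 blocks of 30 variables each
blocks_240 = segment_columns(X, 240)   # 240 single-variable blocks
```

Each element of the returned list can then be handed to ComDim as a separate table, so the saliences indicate which segments contribute to each Common Component.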
ComDim_VarSelec
X (40, 240) segmented into 240 matrices with 1 variable each
ComDim with 6 Common Components applied simultaneously to all 240 matrices Xi (40, 1) :
- CC1 == Moisture (contributes to many blocks)
- CC2 == Shape (contributes to several blocks)
- CC3 == Lignin (contributes to only a few blocks)
- …
Correlation between ComDim Scores and Moisture levels
ComDim_VarSelec
Correlation between ComDim Scores and Shape levels
ComDim_VarSelec
AComDim - ComDim for influential Factor detection
ComDim procedure applied to the data matrices obtained in the first step of ANOVA-PCA :
The Factor n matrices include the Interactions
In AComDim [2], all matrices are analysed simultaneously with ComDim.
In ANOVA-PCA, all Factor k matrices are analysed one by one by PCA, to evaluate each factor's significance
[2] D. Jouan-Rimbaud Bouveresse, D.N. Rutledge et al., Chemom. Intell. Lab. Syst., 2010
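The first step of ANOVA-PCA, which produces the tables analysed by AComDim, can be sketched as below for a balanced design. To keep the example short it uses only two factors and their interaction; the function names, the sequential way the effects are removed, and the toy data are assumptions for illustration, not the authors' code.

```python
import numpy as np

def level_means(X, labels):
    """Replace each row by the mean of the rows that share its label."""
    labels = np.asarray(labels)
    M = np.zeros_like(X)
    for lab in np.unique(labels):
        rows = labels == lab
        M[rows] = X[rows].mean(axis=0)
    return M

def anova_blocks(X, factor_a, factor_b):
    """Decompose X into factor-mean matrices, then add the residuals back to each."""
    Xc = X - X.mean(axis=0)                       # remove the grand mean
    M_a = level_means(Xc, factor_a)               # main effect of factor A
    M_b = level_means(Xc - M_a, factor_b)         # main effect of factor B
    cells = np.array([f"{a}|{b}" for a, b in zip(factor_a, factor_b)])
    M_ab = level_means(Xc - M_a - M_b, cells)     # A x B interaction
    res = Xc - M_a - M_b - M_ab                   # residuals
    return {"A+Res": M_a + res, "B+Res": M_b + res,
            "AB+Res": M_ab + res, "Res": res}

# Toy balanced design: 2 x 2 levels, 2 replicates per cell, 6 variables
rng = np.random.default_rng(2)
X = rng.normal(size=(8, 6))
factor_a = np.repeat([0, 1], 4)
factor_b = np.tile([0, 0, 1, 1], 2)
tables = anova_blocks(X, factor_a, factor_b)
```

The values of `tables` are the blocks that would be passed together to ComDim; with three factors the same pattern yields the seven Factor+Res tables plus the Residuals table.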
Application to Lignin data
8 types of Time Domain-NMR signals concatenated
20 samples in duplicate, with different characteristics :
- 2 Moisture levels : 33% (MgCl2 solution) / 75% (NaCl solution)
- 2 Shapes : Film / Cane
- 5 Lignin concentration levels : 0%, 5%, 10%, 15%, 30%
AComDim
X (40, 240) decomposed into 7 (Factor mean + residual) matrices & 1 Residuals matrix :
- F1+Res, F2+Res, F3+Res, F12+Res, F13+Res, F23+Res, F123+Res, Res
ComDim with 8 Common Components applied to all matrices :
- CC1 == Noise (expected, as Residuals are common to all matrices)
- CC3 == Moisture
- CC4 == Shape
- CC5 == Lignin
- …
Saliences of each table (7 Factor mean + residual matrices & 1 Residuals matrix)
- CC1 & CC2 = noise
- CC3 = F1, CC4 = F2, CC5 = F3
- CC6 = F23, CC7 = F13, CC8 = F123
$$\lambda_i^{(1)} = q_1^{T} \, W_i \, q_1$$
AComDim
Separation of the samples for Factor 1 on CC3
AComDim
Separation of the samples for Factor 2 on CC4
AComDim
Separation of the samples for Factor 3 on CC5
AComDim
AComDim
Overlap of the samples for Interactions 2x3 & 1x3 on CC6 & CC7
To compare the variance of each Factor data table to the Residuals data table, calculate an F-value :
$$F_i = \frac{\lambda_{res}^{(1)}}{\lambda_i^{(1)}} = \frac{q_1^{T} \, W_{res} \, q_1}{q_1^{T} \, W_i \, q_1}, \qquad \lambda_i^{(1)} = q_1^{T} \, W_i \, q_1$$
AComDim
Lignin data set: F-values for each Factor
[Figure: F-values for Moisture, Shape and Lignin; P-values shown on the plot: P = 0.0166, P = 0.0004, P < 0.0001]
AComDim
Global Scores and Loadings
[Figure: global scores (samples 1 to 40) and loadings (variables 1 to 240) for Moisture : CC3, Shape : CC4, Lignin : CC5]
Correlation between ComDim Scores and Lignin concentrations
ComDim_VarSelec
AComDim_VarSelec - AComDim for Factor detection and for Variable selection
This method combines both previous methods :
1) X is segmented into p segments
2) AComDim is performed on each segment
AComDim_VarSelec
1) X (40, 240) segmented into 240 matrices Xi (40, 1) :
Each Xi (40, 1) decomposed into 8 matrices :
- F1+Res, F2+Res, F3+Res, F12+Res, F13+Res, F23+Res, F123+Res, Res
2) AComDim with 9 Common Components applied to each of the 240 sets of 8 matrices
240 segments of 1 variable, 8 Factor matrices (Blocks), 9 Common Components
Saliences on CC1 : for each Factor and Segment
AComDim_VarSelec
[Figures: F-values for the Moisture, Shape and Lignin factors]
ComDim-OPLS - ComDim based OPLS
ComDim-OPLS algorithm pseudo-code:
(1) Centre, optionally scale, and normalise each block
(2) Compute the association matrices Wi = Xi.Xi^T
(3) Compute WG as the weighted sum of the association matrices according to the saliences λ_i^d of the blocks, using initial values of 1
(4) OPLS modelling to relate the independent WG matrix to the dependent Y block
(5) Deflate the estimated predictive variation and extract the score vector t_o^d corresponding to the largest eigenvalue
(6) Update the saliences λ_i^d according to t_o^d
(7) Evaluate the model fit increase
- If fit increase > threshold, go to (3)
- If fit increase < threshold, go to (8)
(8) Deflate the X blocks separately and go to (2)
(9) When the orthogonal variation is exhausted, compute the predictive component(s) t_p and the associated saliences
ComDim-OPLS
Saliences between 8 blocks and Lignin concentration for regression component and orthogonal components
Scores on regression and orthogonal components
ComDim-OPLS
Concatenated loadings on regression component
Concatenated loadings on orthogonal component
ComDim_DA - ComDim Discriminant Analysis
1) Calculate barycentre vectors of groups for each table Xi
2) Create prototype matrices Gi with repetitions of the barycentre vectors
3) Perform ComDim on prototype matrices
4) Project individuals of each Xi onto ComDim space of Gi
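Steps 1) and 2) above can be sketched as follows: each row of a table is replaced by the barycentre (mean vector) of its group, which gives a prototype matrix of the same size as the table. The function name `prototype_matrix` and the toy data are assumptions; steps 3) and 4) are not shown.

```python
import numpy as np

def prototype_matrix(X, groups):
    """Build G by repeating each group's barycentre on the rows of that group."""
    groups = np.asarray(groups)
    G = np.empty_like(X, dtype=float)
    for g in np.unique(groups):
        rows = groups == g
        G[rows] = X[rows].mean(axis=0)   # barycentre vector of group g
    return G

# Toy table: 6 samples in 2 groups, 3 variables
rng = np.random.default_rng(3)
X = rng.normal(size=(6, 3))
groups = [0, 0, 0, 1, 1, 1]
G = prototype_matrix(X, groups)
```

One such G would be built for each table Xi, and ComDim would then be run on the set of prototype matrices rather than on the raw tables.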
ComDim_DA
Data matrix X (40, 240) segmented into 8 matrices Xi (40, pi)
Calculate prototype Gi (40, pi) for each characteristic
ComDim with 4 Common Components applied to the 8 Gi (40, pi) matrices for each characteristic
Project each Xi onto ComDim space of corresponding Gi
Application to Lignin data
8 types of Time Domain-NMR signals
20 samples in duplicate, with different characteristics :
- 2 Moisture levels
- 2 Shape levels
- 5 Lignin concentration levels
Moisture
ComDim_DA
Scores of G & X on All Blocks; Scores of G & X on Block 1
Shape
ComDim_DA
Scores of G & X on All Blocks; Scores of G & X on Block 5
Lignin
ComDim_DA
Scores of G & X on All Blocks; Scores of G & X on Block 7
Conclusions
ComDim can be adapted for :
Selection of blocks of variables
Detection of significant factors
Regression modelling
Discriminant Analysis
Path modelling
Other possibilities
Generate several matrices from a single initial matrix:
- Wavelet decomposition
- Different data pretreatment methods
- …
Use the different planes of multi-way data sets (3D-fluorescence, LC-MS, …)
Use the different colour planes of hyper-spectral images
…
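As an illustration of the multi-way idea, the planes of a three-way array can be turned into a list of two-dimensional blocks that share the same rows. The helper name, the choice of the third mode as the block index, and the toy array dimensions are assumptions for this sketch.

```python
import numpy as np

def slabs_as_blocks(T, axis=2):
    """Return the 2-D slices of a 3-way array taken along `axis` as a list of blocks."""
    return [np.take(T, k, axis=axis) for k in range(T.shape[axis])]

# Toy array: 12 samples x 50 emission wavelengths x 6 excitation wavelengths
rng = np.random.default_rng(4)
T = rng.normal(size=(12, 50, 6))
blocks = slabs_as_blocks(T)   # 6 blocks, each 12 x 50
```

Every block keeps the samples in the same order, which is the only requirement for analysing them jointly as a multiblock data set.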
Acknowledgements
El Mostafa Qannari
Gérard Mazerolles
Jean-Charles Martin
Matthieu Maillot,
Julien Boccard
Rui Pinto
Thank you