
Extensions of Common Component and Specific Weight Analysis for applications in Chemometrics

Douglas N. Rutledge, Delphine Jouan-Rimbaud--Bouveresse

AgroParisTech/INRA, UMR1145 Ingénierie Procédés Aliments, F-75005 PARIS

rutledge@agroparistech.fr

Introduction : Multiblock Data sets

Different types of multivariate data are measured on the same individuals.

• Examples :
– Sensory analysis: fixed or free choice profiling.
– Process technology: multivariate measurements are performed at different stages of the process.
– Functional Genomics: genetic data, molecular data and phenotypic data are collected.

[Figure: m data blocks X1, X2, …, Xm measured on the same n individuals, with p1, p2, …, pm variables respectively]

Objectives :
– To investigate the relationships between data blocks.
– To highlight and interpret the within-block patterns : relationships among samples.

How ?
- Define the underlying (common) dimensions of the data blocks and assess how relevant each dimension is to each data block.

How to interpret ?
- Express the underlying dimensions in terms of the variables in the various data blocks;
- Or express these underlying dimensions in terms of a reference data set.

A plethora of methods

PLS Path modelling

Tucker3

Parafac

Tucker1

Tucker2

Common Principal Components

Statis

Indscal

With so many methods around, what do you do?

Depending on your aim, pick one that you believe in, and work with it honestly.

Analyse en Composantes Communes et Poids Spécifiques

Common Component and Specific Weights Analysis

Outline

The ComDim procedure

ComDim – Simultaneous analysis of several data tables

ComDim_VarSelec - ComDim for Variable selection

AComDim - ComDim for significant Factor detection

AComDim_VarSelec - ComDim for Variable selection AND Factor detection

O-PLS_ComDim - Separate orthogonal and non-orthogonal contributions

ComDim_DA - Discriminant Analysis based on ComDim on barycentres

ComDim - Simultaneous analysis of several data tables

"Common Components and Specific Weights Analysis" - CCSWA [1]

[1] E. Qannari, I. Wakeling, P. Courcoux, H.J.H. MacFie, Food Quality and Preference (2000) 11, 151

Simultaneously study several matrices
- with different variables describing the same samples

Describe p data tables observed for the same n samples :
- a set of p data matrices (X), each with n rows,
- but not necessarily the same number of columns

Determine a common space for all p data tables,
- each matrix has a specific contribution ("salience") to the definition of each dimension of this common space

Start with p matrices $X_i$ of size $n \times k_i$ ($i = 1$ to $p$)

Each $X_i$ is column-centred and scaled by dividing by its matrix norm, giving $X_{si}$

For each $X_{si}$, an $n \times n$ scalar product matrix $W_i$ can be computed as :

$$W_i = X_{si} X_{si}^T$$

$W_i$ reflects the dispersion of the samples in the space of that table
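The scaling and scalar-product step can be sketched in a few lines of NumPy. This is a minimal illustration, not the SAISIR code: the function name `scalar_product_matrix` is hypothetical, and the "matrix norm" is assumed to be the Frobenius norm.

```python
import numpy as np

def scalar_product_matrix(X):
    """Column-centre X, scale it to unit norm, return (Xs, W = Xs Xs^T).

    Assumption: the 'matrix norm' is the Frobenius norm.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)          # column-centring
    norm = np.linalg.norm(Xc)        # Frobenius norm for a 2-D array
    if norm == 0:
        raise ValueError("block has no variance after centring")
    Xs = Xc / norm
    return Xs, Xs @ Xs.T             # n x n scalar product matrix

# Illustrative random block: 6 samples, 4 variables
rng = np.random.default_rng(0)
X1 = rng.normal(size=(6, 4))
Xs1, W1 = scalar_product_matrix(X1)
```

With this scaling every table has the same total variance (the trace of each W is 1), so no table dominates the weighted sum simply because it has more variables.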

The common dimensions of all the tables are computed iteratively

At each iteration, a weighted sum of the p Wi matrices is computed, resulting in a global WG matrix

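The ComDim or CCSWA model

$$W_k = \sum_{dim} \lambda_k^{(dim)} \, q_{dim} q_{dim}^T + E_k$$

where k indexes the tables (k = 1 to p), $q_{dim}$ is the global score vector of common dimension dim, $\lambda_k^{(dim)}$ is the salience of table k on that dimension, and $E_k$ is the residual matrix.

Quantities used in the iterative algorithm (shown for two tables) :
- $Dif = (W_1 - \lambda_1 q q^T) + (W_2 - \lambda_2 q q^T)$
- $Aux = I - q q^T$
- $X_1 = Aux \cdot X_1$
- $X_2 = Aux \cdot X_2$
- $W = X X^T$

"ComDim", the implementation of CCSWA used here, is part of the SAISIR toolbox : SAISIR (2008), Statistics Applied to the Interpretation of Spectra in the InfraRed, D. Bertrand

(http://www.chimiometrie.fr/saisir_webpage.html)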
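The iterative procedure can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the SAISIR implementation: the function name `comdim`, the convergence test on the summed squared residuals, and the simulated data are all hypothetical. Saliences start at 1, the global score q is taken as the first eigenvector of the weighted sum of the W matrices, saliences are updated as q^T W q, and each table is deflated with I - q q^T before the next dimension.

```python
import numpy as np

def comdim(blocks, n_components=2, tol=1e-10, max_iter=500):
    """Sketch of CCSWA/ComDim: returns global scores Q and saliences."""
    Xs = []
    for X in blocks:
        Xc = np.asarray(X, dtype=float)
        Xc = Xc - Xc.mean(axis=0)            # column-centre
        Xs.append(Xc / np.linalg.norm(Xc))   # scale to unit Frobenius norm
    n = Xs[0].shape[0]
    Q = np.zeros((n, n_components))
    saliences = np.zeros((len(Xs), n_components))
    for d in range(n_components):
        W = [X @ X.T for X in Xs]            # scalar product matrices
        lam = np.ones(len(W))                # initial saliences
        prev_fit = np.inf
        for _ in range(max_iter):
            WG = sum(l * Wi for l, Wi in zip(lam, W))
            q = np.linalg.eigh(WG)[1][:, -1]  # eigenvector of largest eigenvalue
            lam = np.array([q @ Wi @ q for Wi in W])
            # summed squared residuals of W_i - lambda_i q q^T
            fit = sum(np.linalg.norm(Wi - l * np.outer(q, q)) ** 2
                      for l, Wi in zip(lam, W))
            if abs(prev_fit - fit) < tol:
                break
            prev_fit = fit
        Q[:, d] = q
        saliences[:, d] = lam
        aux = np.eye(n) - np.outer(q, q)     # deflation projector
        Xs = [aux @ X for X in Xs]
    return Q, saliences

# Two simulated tables sharing one latent direction t (illustrative data)
rng = np.random.default_rng(0)
t = rng.normal(size=20)
X1 = np.outer(t, rng.normal(size=5)) + 0.05 * rng.normal(size=(20, 5))
X2 = np.outer(t, rng.normal(size=8)) + 0.05 * rng.normal(size=(20, 8))
Q, saliences = comdim([X1, X2], n_components=2)
```

Here the first column of `Q` should follow the shared direction `t` up to sign and scale, and both tables should have a high salience on the first common component.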

The outputs of ComDim

- Saliences of tables : $\lambda_k^{(dim)}$
- Global scores of individuals : $q_{dim}$
- Loadings of variables : $p_k^{(dim)}$
- Local scores of individuals : $t_k^{(dim)}$

The data

• Samples
  - Fifteen members of the laboratory or their relatives were recruited as volunteers
  - urine samples collected following standardised methodologies
  - urine samples spiked with a mixture of metabolites

• Technique
  - Samples analysed by 5 NMR and 9 MS platforms
  - Participants used their in-house protocol for instrument tuning and data processing
  - Between 88 and 9699 variables in each table

• Pretreatment
  - NMR data were Pareto-scaled
  - MS data were first log10-transformed, then Pareto-scaled
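The pretreatment can be illustrated with a short sketch. It assumes the usual definition of Pareto scaling (mean-centring, then division by the square root of each variable's standard deviation) and strictly positive MS intensities for the logarithm; the function names `pareto_scale` and `pretreat_ms` are hypothetical and the data are simulated.

```python
import numpy as np

def pareto_scale(X):
    """Mean-centre each column and divide by the square root of its std."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0                # leave constant columns unscaled
    return Xc / np.sqrt(sd)

def pretreat_ms(X):
    """log10 transform followed by Pareto scaling (assumes X > 0)."""
    return pareto_scale(np.log10(X))

# Illustrative data: 15 samples, 4 strictly positive variables
rng = np.random.default_rng(0)
X_raw = rng.uniform(1.0, 100.0, size=(15, 4))
P_nmr = pareto_scale(X_raw)
P_ms = pretreat_ms(X_raw)
```

Pareto scaling reduces the dominance of high-variance variables less drastically than unit-variance scaling: after scaling, each column's standard deviation equals the square root of its original value.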

ComDim procedure to estimate the proportion of common spectral information extracted from different MS and NMR platforms in a metabonomic study

Application to metabonomic data

[Figure: Number of features retained per platform]

[Figure: Global scores and Saliences. Samples labelled Spiked / Non spiked. Open symbols: NMR / Black symbols: MS]

[Figure: Global and Local Scores]

Can We Trust Untargeted Metabolomics: Results of the Metabo-Ring initiative, a large-scale multi-instruments inter-laboratory study
AfroData, Stellenbosch, 2012
Jean-Charles Martin, Matthieu Maillot, Gérard Mazerolles, Alexandre Verdu, Bernard Lyan, Carole Migne, Catherine Defoort, Cecile Canlet, Christophe Junot, Claude Guillou, Claudine Manach, Christophe Cordella, Daniel Jabob, Delphine Jouan-Rimbaud Bouveresse, Estelle Paris, Estelle Pujos, Fabien Jourdan, Franck Giacomoni, Fréderique Courant, Gaëlle Favé, Gwenaëlle Le Gall, Hubert Chassaigne, Jean-Claude Tabet, Jean-Francois Martin, Jean-Philippe Antignac, Laetitia Shintu, Marianne Defernez, Mark Philo, Marie-Cécile Alexandre-Gouaubau, Marie-Jo Amiot-Carlin, Mathilde Bossis, Mohamed N. Triba, Natali Stojilkovic, Nathalie Banzet, Roland Molinié, Romain Bott, Sophie Goulitquer, Stefano Caldarelli, Douglas N. Rutledge

MS platforms distinguish inter-individual differences

Both NMR & MS detect 2 unusual individuals

No effect of the technological architecture for MS (Orbitrap = TOF = QTOF)

No influence of the number of variables extracted (from 88 to 7000)

An application of ACCPS / ComDim

Three types of sunflower oils are used :
- oil “A” (production date: April 2009),
- oil “O” (production date: March 2010),
- oil “L” (production date: January 2011).

Packaged in :
- PET bottles,
- Glass bottles.

Accelerated aging at 40 °C,
- GC-MS analysis at 10, 20 & 30 days

Chromatogram for each ion in a separate table

ComDim with 9 Common Components

ACCPS : Analyse en Composantes Communes et Poids Spécifiques (the French name of Common Components and Specific Weights Analysis)

Application to Lignin data

8 types of Time Domain-NMR signals concatenated

20 samples in duplicate, with different characteristics :

- 2 Moisture levels : 33% (MgCl2 solution) / 75% (NaCl solution)

- 2 Shapes : Film / Cane

- 5 Lignin concentration levels : 0%, 5%, 10%, 15%, 30%

ComDim

X (40, 240) segmented into : 8 matrices (1 block per signal type)

ComDim with 6 Common Components applied to 8 matrices

[Figure: Global scores of each Common Component. CC1 = Moisture, CC2 = Shape, CC3 = Lignin]

[Figure: Saliences of each table. CC1 = Moisture, CC2 = Shape, CC3 = Lignin]

[Figure: Global loadings of each Common Component. CC1 = Moisture, CC2 = Shape, CC3 = Lignin]

ComDim_VarSelec - ComDim for variable selection

The ComDim procedure is applied to a single data matrix X segmented into p sub-matrices $X_i$ :

$$X = [\, X_1 \;\; X_2 \;\; \cdots \;\; X_p \,]$$

Each sub-matrix $X_i$ (i = 1 to p) is considered as an individual data table in ComDim

The objective is to use ComDim to detect which segments contain information, thus revealing the range(s) of interesting variables

Since all segments contribute to the global matrix $W_G$, all variables are analysed simultaneously.
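The segmentation idea can be illustrated on simulated data. The sketch below is an assumption-laden toy example, not the authors' code: the helper `first_common_component` is hypothetical and extracts only the first common component, and the data are random noise with one informative range of columns. The segment with the highest salience points to that range.

```python
import numpy as np

def first_common_component(blocks, n_iter=200):
    """First common component and saliences for a list of blocks (sketch)."""
    W = []
    for X in blocks:
        Xc = X - X.mean(axis=0)
        Xs = Xc / np.linalg.norm(Xc)      # unit Frobenius norm
        W.append(Xs @ Xs.T)
    lam = np.ones(len(W))                 # initial saliences
    for _ in range(n_iter):
        WG = sum(l * Wi for l, Wi in zip(lam, W))
        q = np.linalg.eigh(WG)[1][:, -1]  # first eigenvector of W_G
        lam = np.array([q @ Wi @ q for Wi in W])
    return q, lam

# Simulated matrix: 30 samples, 20 variables, only columns 5-9 carry a signal
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 20))
signal = rng.normal(size=30)
X[:, 5:10] += 5 * signal[:, None]

# Segment X into 4 sub-matrices of 5 consecutive variables each
segments = np.array_split(X, 4, axis=1)
q, saliences = first_common_component(segments)
best = int(np.argmax(saliences))          # index of the most salient segment
```

In this toy example the informative columns all fall in the second segment (index 1), which is the one expected to receive the largest salience.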

ComDim_VarSelec

X (40, 240) segmented into 240 matrices with 1 variable each :

ComDim with 6 Common Components applied simultaneously to all 240 matrices Xi (40, 1) :

- CC1 == Moisture (contributes to many blocks)
- CC2 == Shape (contributes to several blocks)
- CC3 == Lignin (contributes to only a few blocks)
- …

[Figure: Correlation between ComDim Scores and Moisture levels]

[Figure: Correlation between ComDim Scores and Shape levels]

AComDim - ComDim for influential Factor detection

The ComDim procedure is applied to the data matrices obtained in the first step of ANOVA-PCA : one (Factor mean + Residuals) matrix per Factor, plus the Residuals matrix.

The Factor matrices include the Interactions.

In ANOVA-PCA, the Factor matrices are analysed one by one by PCA, to evaluate each factor's significance.

In AComDim [2], all matrices are analysed simultaneously with ComDim.

[2] D. Jouan-Rimbaud Bouveresse, D.N. Rutledge et al., Chemom. Intell. Lab. Syst., 2010
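The first step of ANOVA-PCA, the decomposition of X into factor-mean matrices and residuals, can be sketched for two factors as follows. The sketch assumes a balanced design, for which sequential subtraction of level means gives the usual ANOVA decomposition. The function names `level_means` and `anova_tables` and the simulated design are hypothetical.

```python
import numpy as np

def level_means(X, labels):
    """Replace each row of X by the mean of all rows sharing its label."""
    labels = np.asarray(labels)
    M = np.zeros_like(X, dtype=float)
    for lev in np.unique(labels):
        mask = labels == lev
        M[mask] = X[mask].mean(axis=0)
    return M

def anova_tables(X, f1, f2):
    """Two-factor decomposition (balanced design assumed).

    Returns the (Factor mean + Residuals) tables plus the Residuals table,
    and the individual effect matrices.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)               # remove grand mean
    M1 = level_means(Xc, f1)              # main effect of factor 1
    R = Xc - M1
    M2 = level_means(R, f2)               # main effect of factor 2
    R = R - M2
    cells = np.array([f"{a}|{b}" for a, b in zip(f1, f2)])
    M12 = level_means(R, cells)           # interaction 1x2
    Res = R - M12                         # residuals
    tables = {"F1+Res": M1 + Res, "F2+Res": M2 + Res,
              "F12+Res": M12 + Res, "Res": Res}
    return tables, (M1, M2, M12, Res)

# Illustrative 2 x 2 design with 3 replicates per cell, 5 variables
rng = np.random.default_rng(2)
f1 = np.repeat(["low", "high"], 6)
f2 = np.tile(np.repeat(["film", "cane"], 3), 2)
X = rng.normal(size=(12, 5))
tables, (M1, M2, M12, Res) = anova_tables(X, f1, f2)
```

The resulting tables can then be passed together to ComDim as separate data tables. Because the Residuals are added to every factor table, a noise-related common component shared by all tables is expected.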


AComDim

X (40, 240) decomposed into : 7 (Factor mean+residual) matrices & 1 Residuals matrix :

- F1+Res, F2+Res, F3+Res, F12+Res, F13+Res, F23+Res, F123+Res, Res

ComDim with 8 Common Components applied to all matrices :

- CC1 == Noise (expected as Residuals are common to all matrices)
- CC3 == Moisture
- CC4 == Shape
- CC5 == Lignin
- …

[Figure: Saliences of each table (7 Factor mean+residual matrices & 1 Residuals matrix)]
- CC1 & CC2 = noise
- CC3 = F1, CC4 = F2, CC5 = F3
- CC6 = F23, CC7 = F13, CC8 = F123

Salience of table i on CC1 :

$$\lambda_i^{(1)} = q_1^T W_i q_1$$

AComDim

Separation of the samples for Factor 1 on CC3

AComDim

Overlap of the samples for Interactions 2x3 & 1x3 on CC6 & CC7

To compare the variance of each Factor data table to the Residuals data table, calculate an F-value :

$$F_i = \frac{q_1^T W_{res} \, q_1}{q_1^T W_i \, q_1}$$
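A small numerical sketch of this ratio is given below. It assumes that the F-value is the salience of the Residuals table on CC1 divided by the salience of the Factor table on CC1, with each salience computed as q1^T W q1. The function name `acomdim_f_values` and the toy matrices are hypothetical, and no p-value is computed because the degrees of freedom are not given here.

```python
import numpy as np

def acomdim_f_values(W_factors, W_res, q1):
    """Ratio of residual-table to factor-table saliences on CC1 (sketch)."""
    q1 = np.asarray(q1, dtype=float)
    q1 = q1 / np.linalg.norm(q1)          # ensure unit-norm global scores
    lam_res = q1 @ W_res @ q1             # salience of the Residuals table
    return np.array([lam_res / (q1 @ W @ q1) for W in W_factors])

# Toy scalar product matrices (each with trace 1) and a toy score vector
W_res = np.diag([0.5, 0.3, 0.2])
W_f1 = np.diag([0.25, 0.5, 0.25])
q1 = np.array([1.0, 0.0, 0.0])
F = acomdim_f_values([W_f1], W_res, q1)
```

In this toy case the Residuals table has salience 0.5 and the Factor table 0.25 on CC1, so the ratio is 2: the factor table carries relatively less of the noise dimension than the Residuals table does.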

[Figure: Lignin data set, F-values for each Factor. Annotated p-values: P = 0.0166, P = 0.0004, P < 0.0001]

[Figure: AComDim Global Scores and Loadings. Moisture : CC3, Shape : CC4, Lignin : CC5]

[Figure: ComDim_VarSelec, Correlation between ComDim Scores and Lignin concentrations]

AComDim_VarSelec - AComDim for Factor detection and for Variable selection

This method combines both previous methods :

1) X is segmented into p segments

2) AComDim is performed on each segment

AComDim_VarSelec

1) X (40, 240) segmented into 240 matrices Xi (40, 1) :

Each Xi (40, 1) decomposed into 8 matrices :

- F1+Res, F2+Res, F3+Res, F12+Res, F13+Res, F23+Res, F123+Res, Res

2) AComDim with 9 Common Components applied to each of the 240 sets of 8 matrices

- 240 segments of 1 variable
- 8 Factor matrices (Blocks)
- 9 Common Components

[Figure: Saliences on CC1 for each Factor and Segment]

[Figures: AComDim_VarSelec results for Moisture and for Lignin]

ComDim-OPLS – ComDim-based OPLS

ComDim-OPLS algorithm pseudo-code:

(1) Centre, optionally scale, and normalise each block
(2) Compute the association matrices $W_i = X_i X_i^T$
(3) Compute $W_G$ as the weighted sum of the association matrices according to the saliences $\lambda_i^d$ of the blocks, using initial values of 1
(4) OPLS modelling to relate the independent $W_G$ matrix to the dependent Y block
(5) Deflate the estimated predictive variation and extract the score vector $t_o^d$ corresponding to the largest eigenvalue
(6) Update the saliences $\lambda_i^d$ according to $t_o^d$
(7) Evaluate the model fit increase
    - If fit increase > threshold, go to (3)
    - If fit increase < threshold, go to (8)
(8) Deflate the X blocks separately and go to (2)
(9) When the orthogonal variation is exhausted, compute the predictive component(s) $t_p$ and the associated saliences

ComDim-OPLS

Saliences between 8 blocks and Lignin concentration for regression component and orthogonal components

Scores on regression and orthogonal components

ComDim-OPLS

Concatenated loadings on regression component

Concatenated loadings on orthogonal component

ComDim_DA - ComDim Discriminant Analysis

1) Calculate barycentre vectors of groups for each table Xi

2) Create prototype matrices Gi with repetitions of the barycentre vectors

3) Perform ComDim on prototype matrices

4) Project individuals of each Xi onto ComDim space of Gi
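Steps 1 and 2 can be sketched as follows: each row of a table is replaced by the barycentre (mean vector) of its group, which gives a prototype matrix with the same size as the original table. The function name `prototype_matrix` and the small data set are hypothetical. Steps 3 and 4 (ComDim on the prototype matrices and projection of the individuals) are not shown.

```python
import numpy as np

def prototype_matrix(X, groups):
    """Replace each row of X by the barycentre of its group (sketch)."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    G = np.empty_like(X)
    for g in np.unique(groups):
        mask = groups == g
        G[mask] = X[mask].mean(axis=0)    # repeat the group barycentre
    return G

# Illustrative table: 4 individuals, 2 variables, 2 groups
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [10.0, 10.0],
              [12.0, 14.0]])
groups = ["a", "a", "b", "b"]
G = prototype_matrix(X, groups)
```

Because the prototype matrices contain only between-group variation, the common components extracted from them are oriented towards group separation.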

ComDim_DA

Data matrix X (40, 240) segmented into 8 matrices Xi (40, pi)

Calculate prototype Gi (40, pi) for each characteristic

ComDim with 4 Common Components applied to the 8 Gi (40, pi) matrices for each characteristic

Project each Xi onto ComDim space of corresponding Gi

8 types of Time Domain-NMR signals

- 2 Moisture levels

- 2 Shape levels

- 5 Lignin concentration levels

Moisture

ComDim_DA

Scores of G & X on All Blocks / Scores of G & X on Block 1

ComDim_DA

Scores of G & X on Block 5 / Scores of G & X on All Blocks

Lignin

ComDim_DA

Scores of G & X on Block 7 / Scores of G & X on All Blocks

Conclusions

ComDim can be adapted for :

Selection of blocks of variables

Detection of significant factors

Regression modelling

Discriminant Analysis

Path modelling

Other possibilities

Generate several matrices from a single initial matrix:
- Wavelet decomposition
- Different data pretreatment methods
- …

Use the different planes of multi-way data sets (3D-fluorescence, LC-MS, …)

Use the different colour planes of hyper-spectral images

Acknowledgements

El Mostafa Qannari

Gérard Mazerolles

Jean-Charles Martin

Matthieu Maillot

Julien Boccard

Rui Pinto

Thank you
