
Structured data analysis with RGCCA

Arthur Tenenhaus 2015/02/13

Overview of the presentation

Part I. Multi-block analysis

Part II. Multi-block and multi-way analysis

Part III. Multi-group analysis

Part I. Multi-block analysis

M. Tenenhaus (HEC, Paris)

V. Guillemot (ICM)

T. Löfstedt (CEA, Neurospin)

C. Philippe (IGR)

V. Frouin (CEA, Neurospin)

P. Groenen (Erasmus University, Rotterdam)

Economic inequality and political instability (data from Russett, 1964)

Economic inequality

Agricultural inequality
GINI : Inequality of land distributions
FARM : % farmers that own half of the land (> 50)
RENT : % farmers that rent all their land

Industrial development
GNPR : Gross national product per capita ($ 1955)
LABO : % of labor force employed in agriculture

Political instability

INST : Instability of executive (45-61)
ECKS : Nb of violent internal war incidents (46-61)
DEAT : Nb of people killed as a result of civic group violence (50-62)
D-STAB : Stable democracy
D-UNST : Unstable democracy
DICT : Dictatorship

Country      Gini  Farm  Rent  Gnpr  Labo  Inst  Ecks  Deat  Demo
Argentina    86.3  98.2  32.9   374    25  13.6    57   217     2
Australia    92.9  99.6     *  1215    14  11.3     0     0     1
Austria      74.0  97.4  10.7   532    32  12.8     4     0     2
France       58.3  86.1  26.0  1046    26  16.3    46     1     2
Yugoslavia   43.7  79.8   0.0   297    67   0.0     9     0     3

Blocks: X1 = (Gini, Farm, Rent), X2 = (Gnpr, Labo), X3 = (Inst, Ecks, Deat, Demo)

Demo: 1 = Stable democracy, 2 = Unstable democracy, 3 = Dictatorship

Three data blocks

Agricultural inequality (X1): GINI, FARM, RENT

Industrial development (X2): GNPR, LABO

Political instability (X3): INST, ECKS, DEAT, D-STAB, D-UNST, DICT

Path diagram: Agr. ineq. and Pol. inst. are connected (C13 = 1), Ind. dev. and Pol. inst. are connected (C23 = 1), Agr. ineq. and Ind. dev. are not connected (C12 = 0)

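Block components

$$\mathbf{y}_1 = \mathbf{X}_1\mathbf{a}_1 = a_{11}\,\mathbf{GINI} + a_{12}\,\mathbf{FARM} + a_{13}\,\mathbf{RENT}$$

$$\mathbf{y}_2 = \mathbf{X}_2\mathbf{a}_2 = a_{21}\,\mathbf{GNPR} + a_{22}\,\mathbf{LABO}$$

$$\mathbf{y}_3 = \mathbf{X}_3\mathbf{a}_3 = a_{31}\,\mathbf{INST} + a_{32}\,\mathbf{ECKS} + a_{33}\,\mathbf{DEAT} + a_{34}\,\mathbf{D\text{-}STAB} + a_{35}\,\mathbf{D\text{-}UNST} + a_{36}\,\mathbf{DICT}$$

Block components should verify two properties at the same time:

(i) Block components explain their own block well.

(ii) Block components are as correlated as possible for connected blocks.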

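Some modified multi-block methods

c_jk = 1 if blocks are linked, 0 otherwise, and c_jj = 0

GENERALIZED CANONICAL CORRELATION ANALYSIS

SUMCOR (Horst, 1961):
$$\text{maximize} \sum_{j,k} c_{jk}\,\mathrm{cor}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$$

SSQCOR (Mathes, 1993 ; Hanafi, 2004):
$$\text{maximize} \sum_{j,k} c_{jk}\,\mathrm{cor}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$$

SABSCOR (Mathes, 1993 ; Hanafi, 2004):
$$\text{maximize} \sum_{j,k} c_{jk}\,|\mathrm{cor}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)|$$

GENERALIZED CANONICAL COVARIANCE ANALYSIS (covariance-based criteria)

SUMCOV (Van de Geer, 1984):
$$\underset{\text{all } \|\mathbf{a}_j\| = 1}{\text{maximize}} \sum_{j,k} c_{jk}\,\mathrm{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$$

SSQCOV (Hanafi & Kiers, 2006):
$$\underset{\text{all } \|\mathbf{a}_j\| = 1}{\text{maximize}} \sum_{j,k} c_{jk}\,\mathrm{cov}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$$

SABSCOV (Krämer, 2006):
$$\underset{\text{all } \|\mathbf{a}_j\| = 1}{\text{maximize}} \sum_{j,k} c_{jk}\,|\mathrm{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)|$$

$$\mathrm{cov}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k) = \mathrm{var}(\mathbf{X}_j\mathbf{a}_j)\,\mathrm{cor}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\,\mathrm{var}(\mathbf{X}_k\mathbf{a}_k)$$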

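Some multi-block methods

Covariance-based criteria

c_jk = 1 if blocks are linked, 0 otherwise, and c_jj = 0

SUMCOR:
$$\underset{\text{all } \mathrm{var}(\mathbf{X}_j\mathbf{a}_j) = 1}{\text{maximize}} \sum_{j,k} c_{jk}\,\mathrm{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$$

SSQCOR:
$$\underset{\text{all } \mathrm{var}(\mathbf{X}_j\mathbf{a}_j) = 1}{\text{maximize}} \sum_{j,k} c_{jk}\,\mathrm{cov}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$$

SABSCOR:
$$\underset{\text{all } \mathrm{var}(\mathbf{X}_j\mathbf{a}_j) = 1}{\text{maximize}} \sum_{j,k} c_{jk}\,|\mathrm{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)|$$

SUMCOV:
$$\underset{\text{all } \|\mathbf{a}_j\| = 1}{\text{maximize}} \sum_{j,k} c_{jk}\,\mathrm{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$$

SSQCOV:
$$\underset{\text{all } \|\mathbf{a}_j\| = 1}{\text{maximize}} \sum_{j,k} c_{jk}\,\mathrm{cov}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$$

SABSCOV:
$$\underset{\text{all } \|\mathbf{a}_j\| = 1}{\text{maximize}} \sum_{j,k} c_{jk}\,|\mathrm{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)|$$

$$\mathrm{cov}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k) = \mathrm{var}(\mathbf{X}_j\mathbf{a}_j)\,\mathrm{cor}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\,\mathrm{var}(\mathbf{X}_k\mathbf{a}_k)$$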
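The identity linking squared covariance, squared correlation and the two variances can be checked numerically. The snippet below is only an illustrative sketch on random data; the block sizes, the seed and the helper `cov` (using the 1/n convention) are assumptions made for the example, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X_j = rng.normal(size=(n, 3))   # hypothetical block j
X_k = rng.normal(size=(n, 2))   # hypothetical block k
a_j = rng.normal(size=3)
a_k = rng.normal(size=2)

# Block components y_j = X_j a_j and y_k = X_k a_k
y_j, y_k = X_j @ a_j, X_k @ a_k

def cov(u, v):
    """Covariance with the 1/n convention."""
    return float(np.mean((u - u.mean()) * (v - v.mean())))

var_j, var_k = cov(y_j, y_j), cov(y_k, y_k)
cor = cov(y_j, y_k) / np.sqrt(var_j * var_k)

lhs = cov(y_j, y_k) ** 2          # cov^2
rhs = var_j * cor ** 2 * var_k    # var * cor^2 * var
print(np.isclose(lhs, rhs))
```

Read this way, a squared-covariance criterion rewards both a strong correlation between connected components and a large variance of each component.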

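RGCCA optimization problem

$$\underset{\mathbf{a}_1,\mathbf{a}_2,\ldots,\mathbf{a}_J}{\mathrm{argmax}} \sum_{j \neq k}^{J} c_{jk}\, g\big(\mathrm{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\big)$$

Subject to the constraints

$$(1-\tau_j)\,\mathrm{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1, \quad j = 1,\ldots,J$$

where:

c_jk = 1 if X_j and X_k are connected, 0 otherwise

τ_j = shrinkage constant between 0 and 1

g = identity (Horst scheme), square (factorial scheme) or absolute value (centroid scheme)

and:

A monotone convergent algorithm

Schäfer and Strimmer formula can be used for an optimal determination of the shrinkage constants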

• Tenenhaus A. and Tenenhaus M., Regularized Generalized Canonical Correlation Analysis, Psychometrika, vol. 76, Issue 2, pp. 257-284, 2011

• Tenenhaus A., Philippe C., Frouin V., Kernel Generalized Canonical Correlation Analysis, Computational Statistics and Data Analysis, in revision.

• Tenenhaus A. and Guillemot V. (2013): RGCCA Package. http://cran.project.org/web/packages/RGCCA/index.html
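As an illustration of the optimization problem above, the following sketch evaluates the criterion and the left-hand side of the constraint for given weight vectors. It is not the RGCCA package API: the function names `rgcca_criterion` and `constraint_value`, the random blocks and the 1/n covariance convention are assumptions made for the example.

```python
import numpy as np

def cov(u, v):
    """Covariance with the 1/n convention."""
    return float(np.mean((u - u.mean()) * (v - v.mean())))

# The three choices of g named on the slide
G_FUNCTIONS = {"horst": lambda x: x, "factorial": lambda x: x ** 2, "centroid": abs}

def rgcca_criterion(blocks, weights, C, scheme="factorial"):
    """Sum over j != k of c_jk * g(cov(X_j a_j, X_k a_k))."""
    g = G_FUNCTIONS[scheme]
    y = [X @ a for X, a in zip(blocks, weights)]
    J = len(y)
    return sum(C[j][k] * g(cov(y[j], y[k]))
               for j in range(J) for k in range(J) if j != k)

def constraint_value(X, a, tau):
    """(1 - tau) var(X a) + tau ||a||^2, which the constraint sets to 1."""
    y = X @ a
    return (1 - tau) * cov(y, y) + tau * float(a @ a)

# Hypothetical data: three random blocks with 3, 2 and 6 variables
rng = np.random.default_rng(1)
blocks = [rng.normal(size=(30, p)) for p in (3, 2, 6)]
weights = [np.ones(p) / np.sqrt(p) for p in (3, 2, 6)]   # unit-norm weights
C = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]                     # C13 = C23 = 1, C12 = 0

value = rgcca_criterion(blocks, weights, C, scheme="factorial")
norm_constraint = constraint_value(blocks[0], weights[0], tau=1.0)
print(value, norm_constraint)
```

With τ_j = 1 the constraint reduces to a unit-norm condition on a_j, which the unit-norm weights above satisfy by construction.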

Choice of the shrinkage constant τ_j (part 1)

$$\underset{\mathbf{a}_1,\mathbf{a}_2}{\mathrm{argmax}}\ \mathrm{cov}(\mathbf{X}_1\mathbf{a}_1, \mathbf{X}_2\mathbf{a}_2)$$

Subject to the constraints

$$(1-\tau_j)\,\mathrm{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1, \quad j = 1,2$$

Special cases

Method                                         Criterion                                   Constraints
PLS regression                                 Maximize Cov(X1a1, X2a2)                    ‖a1‖ = ‖a2‖ = 1
Canonical Correlation Analysis                 Maximize Cor(X1a1, X2a2)                    Var(X1a1) = Var(X2a2) = 1
Redundancy analysis of X1 with respect to X2   Maximize Cor(X1a1, X2a2) Var(X1a1)^(1/2)    ‖a1‖ = 1, Var(X2a2) = 1

Redundancy analysis: the 1st component is stable; there is no stability condition for the 2nd component; components X1a1 and X2a2 are well correlated.

Choice of the shrinkage constant τ_j (part 2)

τ_j ranges from 0 (favoring correlation) to 1 (favoring stability).

Schäfer and Strimmer formula can be used for an optimal determination of the shrinkage constants

$$\underset{\mathbf{a}_1,\mathbf{a}_2,\ldots,\mathbf{a}_J}{\mathrm{argmax}} \sum_{j \neq k}^{J} c_{jk}\, g\big(\mathrm{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\big)$$

Subject to the constraints

$$(1-\tau_j)\,\mathrm{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1, \quad j = 1,\ldots,J$$
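The link between the shrinkage constants and the two-block special cases can be read directly off the constraint. The short derivation below is a sketch added for illustration; it uses only the constraint and the relation between covariance and correlation stated in these slides.

```latex
\begin{align*}
\tau_j = 1 &:\quad (1-\tau_j)\,\mathrm{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2
            = \|\mathbf{a}_j\|^2 = 1, \\
\tau_j = 0 &:\quad (1-\tau_j)\,\mathrm{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2
            = \mathrm{var}(\mathbf{X}_j\mathbf{a}_j) = 1. \\[4pt]
\text{In general} &:\quad \mathrm{cov}(\mathbf{X}_1\mathbf{a}_1,\mathbf{X}_2\mathbf{a}_2)
            = \mathrm{cor}(\mathbf{X}_1\mathbf{a}_1,\mathbf{X}_2\mathbf{a}_2)\,
              \mathrm{var}(\mathbf{X}_1\mathbf{a}_1)^{1/2}\,\mathrm{var}(\mathbf{X}_2\mathbf{a}_2)^{1/2}. \\[4pt]
\tau_1 = \tau_2 = 1 &:\quad \max\ \mathrm{cov}(\mathbf{X}_1\mathbf{a}_1,\mathbf{X}_2\mathbf{a}_2)
            \ \text{s.t.}\ \|\mathbf{a}_1\| = \|\mathbf{a}_2\| = 1
            \quad\text{(PLS regression)}, \\
\tau_1 = \tau_2 = 0 &:\quad \max\ \mathrm{cor}(\mathbf{X}_1\mathbf{a}_1,\mathbf{X}_2\mathbf{a}_2)
            \ \text{s.t.}\ \mathrm{var}(\mathbf{X}_1\mathbf{a}_1) = \mathrm{var}(\mathbf{X}_2\mathbf{a}_2) = 1
            \quad\text{(CCA)}, \\
\tau_1 = 1,\ \tau_2 = 0 &:\quad \max\ \mathrm{cor}(\mathbf{X}_1\mathbf{a}_1,\mathbf{X}_2\mathbf{a}_2)\,
              \mathrm{var}(\mathbf{X}_1\mathbf{a}_1)^{1/2}
            \ \text{s.t.}\ \|\mathbf{a}_1\| = 1,\ \mathrm{var}(\mathbf{X}_2\mathbf{a}_2) = 1
            \quad\text{(redundancy analysis)}.
\end{align*}
```

Intermediate values of τ_j give a compromise between the unit-norm and the unit-variance constraint.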

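Choice of the design matrix C

Hierarchical models

(a) One second order block

$$\max_{\mathbf{a}_1,\ldots,\mathbf{a}_{J+1}} \sum_{j=1}^{J} g\big(\mathrm{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_{J+1}\mathbf{a}_{J+1})\big)$$

$$(1-\tau_j)\,\mathrm{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1, \quad j = 1,\ldots,J+1$$

(b) Several second order blocks

$$\max_{\mathbf{a}_1,\ldots,\mathbf{a}_J} \sum_{j=1}^{J_1} \sum_{k=J_1+1}^{J} c_{jk}\, g\big(\mathrm{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\big)$$

$$(1-\tau_j)\,\mathrm{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1, \quad j = 1,\ldots,J$$

Very often:

X_1, …, X_{J1} = Predictor blocks

X_{J1+1}, …, X_J = Response blocks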

Special cases of RGCCA (among others)

Two-block case

PLS Regression: Wold S., Martens & Wold H. (1983): The multivariate calibration problem in chemistry solved by the PLS method. In Proc. Conf. Matrix Pencils, Ruhe A. & Kåstrøm B. (Eds), March 1982, Lecture Notes in Mathematics, Springer Verlag, Heidelberg, p. 286-293.

Redundancy analysis: Barker M. & Rayens W. (2003): Partial least squares for discrimination, Journal of Chemometrics, 17, 166-173.

Regularized CCA: Vinod H. D. (1976): Canonical ridge and econometrics of joint production. Journal of Econometrics, 4, 147–166.

Inter-battery factor analysis: Tucker L.R. (1958): An inter-battery method of factor analysis, Psychometrika, vol. 23, n°2, pp. 111-136.

Multi-block case

MCOA: Chessel D. and Hanafi M. (1996): Analyse de la co-inertie de K nuages de points. Revue de Statistique Appliquée, 44, 35-60.

SSQCOV: Hanafi M. & Kiers H.A.L. (2006): Analysis of K sets of data, with differential emphasis on agreement between and within sets, Computational Statistics & Data Analysis, 51, 1491-1508.

SUMCOR: Horst P. (1961): Relations among m sets of variables, Psychometrika, vol. 26, pp. 126-149.

SSQCOR: Kettenring J.R. (1971): Canonical analysis of several sets of variables, Biometrika, 58, 433-451.

MAXDIFF: Van de Geer J. P. (1984): Linear relations among k sets of variables. Psychometrika, 49, 70-94.

PLS path modeling (mode B): Tenenhaus M., Esposito Vinzi V., Chatelin Y.-M., Lauro C. (2005): PLS path modeling. Computational Statistics and Data Analysis, 48, 159-205.

Generalized Orthogonal MCOA: Vivien M. & Sabatier R. (2003): Generalized orthogonal multiple co-inertia analysis (-PLS): new multiblock component and regression methods, Journal of Chemometrics, 17, 287-301.

Carroll's GCCA: Carroll, J.D. (1968): A generalization of canonical correlation analysis to three or more sets of variables, Proc. 76th Conv. Am. Psych. Assoc., pp. 227-228.

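Monotone convergent algorithms for these criteria

$$\underset{\mathbf{a}_1,\mathbf{a}_2,\ldots,\mathbf{a}_J}{\mathrm{argmax}} \sum_{j \neq k}^{J} c_{jk}\, g\big(\mathrm{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\big)$$

Subject to

$$(1-\tau_j)\,\mathrm{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1, \quad j = 1,\ldots,J$$

Two key ingredients:

(i) Block relaxation

(ii) Majorization by Minorization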

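The RGCCA algorithm (primal version)

c_jk = 1 if blocks are linked, 0 otherwise, and c_jj = 0

Initial step: choose a_j

Outer estimation (explains the block):

$$\mathbf{y}_j = \mathbf{X}_j\mathbf{a}_j, \qquad (1-\tau_j)\,\mathrm{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1$$

Inner estimation (explains relation between blocks):

$$\mathbf{z}_j = \sum_{k \neq j} e_{jk}\,\mathbf{y}_k$$

Choice of weights e_jk:

- Horst : $e_{jk} = c_{jk}$

- Centroid : $e_{jk} = c_{jk}\,\mathrm{sign}\big(\mathrm{cor}(\mathbf{y}_j, \mathbf{y}_k)\big)$

- Factorial : $e_{jk} = c_{jk}\,\mathrm{cov}(\mathbf{y}_j, \mathbf{y}_k)$

Update of the weight vector:

$$\mathbf{a}_j = \Big[\mathbf{z}_j^t\mathbf{X}_j\big((1-\tau_j)\tfrac{1}{n}\mathbf{X}_j^t\mathbf{X}_j + \tau_j\mathbf{I}\big)^{-1}\mathbf{X}_j^t\mathbf{z}_j\Big]^{-1/2}\big((1-\tau_j)\tfrac{1}{n}\mathbf{X}_j^t\mathbf{X}_j + \tau_j\mathbf{I}\big)^{-1}\mathbf{X}_j^t\mathbf{z}_j$$

Iterate until convergence of the criterion

Dimension = $p_j \times p_j$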
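The loop described above (outer estimation, inner estimation, update, repeat) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation of the RGCCA package: the function name `rgcca_primal`, the all-ones starting vector, the stopping rule and the synthetic data are choices made for the example, and only the first component of each block is computed.

```python
import numpy as np

def cov(u, v):
    """Covariance with the 1/n convention."""
    return float(np.mean((u - u.mean()) * (v - v.mean())))

def rgcca_primal(blocks, C, tau, scheme="centroid", tol=1e-10, max_iter=500):
    """First RGCCA weight vector of each block (illustrative sketch)."""
    g = {"horst": lambda x: x, "factorial": lambda x: x ** 2, "centroid": abs}[scheme]
    blocks = [X - X.mean(axis=0) for X in blocks]      # column-centre each block
    n, J = blocks[0].shape[0], len(blocks)
    # M_j = (1 - tau_j) X_j^t X_j / n + tau_j I, a p_j x p_j matrix
    M = [(1 - t) * X.T @ X / n + t * np.eye(X.shape[1]) for X, t in zip(blocks, tau)]

    # Initial step: an arbitrary vector rescaled so that a_j^t M_j a_j = 1
    a = [np.ones(X.shape[1]) for X in blocks]
    a = [v / np.sqrt(v @ Mj @ v) for v, Mj in zip(a, M)]
    y = [X @ v for X, v in zip(blocks, a)]

    def criterion():
        return sum(C[j][k] * g(cov(y[j], y[k]))
                   for j in range(J) for k in range(J) if j != k)

    history = [criterion()]
    for _ in range(max_iter):
        for j in range(J):
            # Inner estimation: z_j = sum over k != j of e_jk y_k
            z = np.zeros(n)
            for k in range(J):
                if k == j or C[j][k] == 0:
                    continue
                c = cov(y[j], y[k])
                e = {"horst": 1.0, "centroid": np.sign(c), "factorial": c}[scheme]
                z += C[j][k] * e * y[k]
            # Update: a_j proportional to M_j^{-1} X_j^t z_j, rescaled to meet the constraint
            v = np.linalg.solve(M[j], blocks[j].T @ z)
            a[j] = v / np.sqrt(z @ blocks[j] @ v)
            y[j] = blocks[j] @ a[j]                     # outer estimation
        history.append(criterion())
        if abs(history[-1] - history[-2]) < tol:
            break
    return a, y, history

# Synthetic example (not the Russett data): three blocks sharing one latent variable
rng = np.random.default_rng(0)
n = 40
t = rng.normal(size=n)
blocks = [np.outer(t, rng.normal(size=p)) + rng.normal(size=(n, p)) for p in (3, 2, 4)]
C = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]])   # C13 = C23 = 1, C12 = 0
tau = [1.0, 0.5, 0.0]
a, y, history = rgcca_primal(blocks, C, tau)
print(len(history), history[-1])
```

The list `history` records the criterion after each sweep over the blocks, which makes it easy to inspect the monotone behaviour mentioned in the slides. The sketch solves a p_j × p_j linear system per update, which is why this primal form suits blocks with few variables.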

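The RGCCA algorithm (dual version)

c_jk = 1 if blocks are linked, 0 otherwise, and c_jj = 0

Initial step: choose α_j

Outer estimation (explains the block):

$$\mathbf{y}_j = \mathbf{X}_j\mathbf{X}_j^t\boldsymbol{\alpha}_j, \qquad \boldsymbol{\alpha}_j^t\mathbf{X}_j\mathbf{X}_j^t\big(\tau_j\mathbf{I} + (1-\tau_j)\tfrac{1}{n}\mathbf{X}_j\mathbf{X}_j^t\big)\boldsymbol{\alpha}_j = 1$$

Inner estimation (explains relation between blocks):

$$\mathbf{z}_j = \sum_{k \neq j} e_{jk}\,\mathbf{y}_k$$

Choice of weights e_jk:

- Horst : $e_{jk} = c_{jk}$

- Centroid : $e_{jk} = c_{jk}\,\mathrm{sign}\big(\mathrm{cor}(\mathbf{y}_j, \mathbf{y}_k)\big)$

- Factorial : $e_{jk} = c_{jk}\,\mathrm{cov}(\mathbf{y}_j, \mathbf{y}_k)$

Update of the dual vector:

$$\boldsymbol{\alpha}_j = \Big[\mathbf{z}_j^t\mathbf{X}_j\mathbf{X}_j^t\big(\tau_j\mathbf{I} + (1-\tau_j)\tfrac{1}{n}\mathbf{X}_j\mathbf{X}_j^t\big)^{-1}\mathbf{z}_j\Big]^{-1/2}\big(\tau_j\mathbf{I} + (1-\tau_j)\tfrac{1}{n}\mathbf{X}_j\mathbf{X}_j^t\big)^{-1}\mathbf{z}_j$$

Iterate until convergence of the criterion

Dimension = $n \times n$

$$\mathbf{a}_j = \mathbf{X}_j^t\boldsymbol{\alpha}_j$$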
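The primal and dual updates give the same weight vector; only the size of the linear system changes (p_j × p_j versus n × n). The check below is an illustrative sketch on random data, assuming a centred block, a given inner component z and τ > 0; the sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, tau = 10, 200, 0.5          # few individuals, many variables
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)               # centred block
z = rng.normal(size=n)            # stands in for the inner component z_j

# Primal update: solve a p x p system
M = (1 - tau) * X.T @ X / n + tau * np.eye(p)
v = np.linalg.solve(M, X.T @ z)
a_primal = v / np.sqrt(z @ X @ v)

# Dual update: solve an n x n system, then map back with a = X^t alpha
N = tau * np.eye(n) + (1 - tau) * X @ X.T / n
w = np.linalg.solve(N, z)
alpha = w / np.sqrt(z @ X @ X.T @ w)
a_dual = X.T @ alpha

print(np.allclose(a_primal, a_dual))
```

When p is much larger than n, as in the gene expression example later in the talk, the n × n system is far cheaper to solve.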

Weight vectors

[Figure: path diagram with the estimated weight vectors; labels and values listed in the order they appear in the figure]

Agricultural inequality (X1): GINI, FARM, RENT; weights 0.66, 0.74, 0.10

Industrial development (X2): GNPR, LABO; weights 0.69, -0.72

Political instability (X3): ECKS, DEAT, D-STB, INST, DICT; weights 0.17, 0.44, 0.48, -0.56, 0.49

Corr = 0.428

Corr = -0.767

small dimensional block settings ⟹ primal algorithm for RGCCA

Bootstrap confidence intervals


Data visualization

[Figure: countries plotted along the Agricultural inequality and Industrial development components]

These countries experienced a period of dictatorship after 1964:

Greece : colonels' dictatorship 1967-1974

Chile : Pinochet's military regime 1973-1990

Argentina : military dictatorship 1976-1983

Brazil : Branco's military dictatorship 1964-1985

Glioma Cancer Data (Department of Pediatric Oncology of the Gustave Roussy Institute)

             Gene 1  Gene 2  …  Gene 15201   CGH 1  …  CGH 1909   Localization
Patient 1      0.18   -0.21  …       -0.73    0.00  …     -0.55   Hemisphere
Patient 2      1.15   -0.45  …        0.27   -0.30  …      0.00   Midline
Patient 3      1.35    0.17  …        0.22    0.33  …      0.64   DIPG
…
Patient 53     1.39    0.18  …       -0.17    0.00  …      0.43   Hemisphere

Transcriptomic data (X1): Gene 1 to Gene 15201

CGH data (X2): CGH 1 to CGH 1909

Outcome (X3): Localization

Glioma Cancer Data: from an RGCCA viewpoint (Department of Pediatric Oncology of the Gustave Roussy Institute)

High dimensional block settings ⟹ dual algorithm for RGCCA

Block 1:
             Gene 1  …  Gene 15201
Patient 1      0.18  …       -0.73
Patient 2      1.15  …        0.27
Patient 3      1.35  …        0.22
Patient 53     1.39  …       -0.17

Block 2:
             CGH 1  …  CGH 1909
Patient 1     0.00  …     -0.55
Patient 2    -0.30  …      0.00
Patient 3     0.33  …      0.64
Patient 53    0.00  …      0.43

Block 3:
             Hemisphere  DIPG
Patient 1             1     0
Patient 2             0     0
Patient 3             0     1
Patient 53            1     0

RGCCA with factorial scheme - τ1 = 1, τ2 = 1 and τ3 = 0

Path diagram: C13 = 1, C23 = 1, and C12 = 1 or C12 = 0
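The outcome block above is a dummy coding of the Localization variable in which Midline has no column of its own. The sketch below reproduces the four rows shown on the slide; treating Midline as the reference level is an interpretation of that table, and the variable names are hypothetical.

```python
import numpy as np

# Localization of patients 1, 2, 3 and 53, as listed in the data table
localization = ["Hemisphere", "Midline", "DIPG", "Hemisphere"]

# Columns kept in the outcome block; the remaining level (Midline) is coded (0, 0)
levels = ["Hemisphere", "DIPG"]

X3 = np.array([[int(loc == lev) for lev in levels] for loc in localization])
print(X3)
```

Dropping one level avoids an exactly collinear set of indicator columns once the block is centred.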

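Block components

$$\mathbf{y}_1 = \mathbf{X}_1\mathbf{a}_1 = a_{11}\,\mathbf{Gene}_1 + \cdots + a_{1,15201}\,\mathbf{Gene}_{15201}$$

$$\mathbf{y}_2 = \mathbf{X}_2\mathbf{a}_2 = a_{21}\,\mathbf{CGH}_1 + \cdots + a_{2,1909}\,\mathbf{CGH}_{1909}$$

$$\mathbf{y}_3 = \mathbf{X}_3\mathbf{a}_3 = a_{31}\,\mathbf{Hemisphere} + a_{32}\,\mathbf{DIPG}$$

Block components should verify three properties at the same time:

(i) Block components explain their own block well.

(ii) Block components are as correlated as possible for connected blocks.

(iii) Block components are built from sparse a_j.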

Structured variable selection for RGCCA

[Diagram labels: Genotype; Gene Expression; Functional MRI; Behavioral data (clinical, psychometric); Intermediate phenotype; Final phenotype]

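(Structured) variable selection for RGCCA

$$\underset{\mathbf{a}_1,\mathbf{a}_2,\ldots,\mathbf{a}_J}{\mathrm{argmax}} \sum_{j \neq k}^{J} c_{jk}\, g\big(\mathrm{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\big)$$

$$\text{subject to } \mathbf{a}_j^t\mathbf{M}_j\mathbf{a}_j = 1, \quad j = 1,\ldots,J$$

$$\Omega(\mathbf{a}_j) \leq c_j, \quad j = 1,\ldots,J$$

• LASSO: $\Omega(\mathbf{a}_j) = \|\mathbf{a}_j\|_1$

• Group LASSO: $\Omega(\mathbf{a}_j) = \sum_{g \in \mathcal{G}} \|\mathbf{a}_g\|_2$

• Fused LASSO: $\Omega(\mathbf{a}_j) = \sum_{k=1}^{p_j} |a_{jk}| + \lambda \sum_{k=2}^{p_j} |a_{jk} - a_{j,k-1}|$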
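The three penalties listed above are simple to evaluate for a given weight vector. The functions below are a small sketch for illustration; the function names, the example vector and the group structure are assumptions, and the fused LASSO difference term starts at the second coefficient.

```python
import numpy as np

def lasso_penalty(a):
    """L1 norm of the weight vector."""
    return float(np.sum(np.abs(a)))

def group_lasso_penalty(a, groups):
    """Sum of the Euclidean norms of the sub-vectors a_g, one per group of indices."""
    return float(sum(np.linalg.norm(a[list(g)]) for g in groups))

def fused_lasso_penalty(a, lam):
    """L1 norm plus lam times the sum of absolute differences of neighbouring weights."""
    return float(np.sum(np.abs(a)) + lam * np.sum(np.abs(np.diff(a))))

a = np.array([3.0, -4.0, 0.0, 0.0])   # hypothetical weight vector
groups = [(0, 1), (2, 3)]             # hypothetical partition of the variables

print(lasso_penalty(a))                 # 7.0
print(group_lasso_penalty(a, groups))   # 5.0
print(fused_lasso_penalty(a, lam=0.5))  # 12.5
```

The group penalty is zero for a group whose weights are all zero, which is how whole groups of variables get dropped, while the fused term penalizes jumps between neighbouring weights.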

• Tenenhaus A., Philippe C., Guillemot V., Lê Cao K.-A., Grill J., Frouin V., Variable Selection for Generalized Canonical Correlation Analysis, Biostatistics, 15 (3),

569-583, 2014.

• Löfstedt T., Hadj-Salem F., Guillemot V., Philippe C., Duchesnay E., Frouin V., and Tenenhaus A., (2014). Structured variable selection for generalized

canonical correlation analysis. In: Proceedings of the 8th International Conference on Partial Least Squares and Related Methods (PLS14), Paris, France.

• Tenenhaus A. and Guillemot V. (2013): RGCCA Package. http://cran.project.org/web/packages/RGCCA/index.html

Signature stability

Predictive performances

Visualization

[Figure: GE1 versus CGH1]

Part II. Multi-block and multi-way analysis

G. Lechuga (CentraleSupélec, L2S)

L. Le Brusquet (CentraleSupélec, L2S)

L. Puybasset, V. Perlbarg & D. Galanaud (Hôpital La Pitié-Salpêtrière)

Multimodal neuroimaging from a multiblock viewpoint (Brain and Spine Institute, La Pitié-Salpêtrière Hospital)

Anatomical MRI (X1), p1 ~ 10^4:
             Anat 1  …  Anat p1
Patient 1      0.18  …    -0.73
Patient 2      1.15  …     0.27
Patient 3      1.35  …     0.22
Patient n      1.39  …    -0.17

Functional MRI (X2), p2 ~ 10^4:
             fMRI 1  …  fMRI p2
Patient 1      0.00  …    -0.55
Patient 2     -0.30  …     0.00
Patient 3      0.33  …     0.64
Patient n      0.00  …     0.43

Behavior (X3), p3 ~ 10:
             BEH 1  …  BEH p3
Patient 1     0.18  …   -0.73
Patient 2     1.15  …    0.27
Patient 3     1.35  …    0.22
Patient n     1.39  …   -0.17

n ~ 100

Path diagram labels: C13 = 1, C23 = 1, C12 = 1, C13 = 0

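Multiway RGCCA (MGCCA)

From Multiblock data to …

X1: Anatomical MRI (p1 variables)
X2: Diffusion MRI (p1 variables)
X3: Functional MRI (p1 variables)
X4: PET (p1 variables)
X5: Final Phenotype (p2 variables)

All blocks are measured on the same n individuals.

RGCCA algorithm: weight vectors a1, a2, a3, a4, a5

4 × p1 parameters to estimate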

• Tenenhaus A., Le Brusquet L. Regularized Generalized Canonical Correlation Analysis extended to three way data, International Conference of the ERCIM WG on

Computational and Methodological Statistics, 2014

• Tenenhaus A., Le Brusquet L. Three-way Regularized Generalized Canonical Correlation Analysis, ThRee-way methods In Chemistry And Psychology, (TRICAP) ,2015

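… to Multiblock / Multiway data

P1 = [X1 X2 X3 X4] and P2 = X5, with C12 = 1

$$\max_{\mathbf{b}_1,\mathbf{b}_2} \mathrm{cov}(\mathbf{P}_1\mathbf{b}_1, \mathbf{P}_2\mathbf{b}_2)$$

$$\text{s.t. } \mathbf{b}_j^t\mathbf{b}_j = 1,\ j = 1,2 \quad\text{and}\quad \mathbf{b}_1 = \mathbf{b}_1^K \otimes \mathbf{b}_1^J$$

MGCCA: weight vectors $\mathbf{b}_1^K$, $\mathbf{b}_1^J$ and $\mathbf{b}_2$

4 + p1 parameters to estimate instead of 4 × p1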

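MGCCA optimization problem

$$\max_{\mathbf{b}_1,\ldots,\mathbf{b}_J} \sum_{j \neq k}^{J} c_{jk}\, g\big(\mathrm{cov}(\mathbf{P}_j\mathbf{b}_j, \mathbf{P}_k\mathbf{b}_k)\big)$$

$$\text{s.t. } \mathbf{b}_j^t\mathbf{b}_j = 1 \quad\text{and}\quad \mathbf{b}_j = \mathbf{b}_j^K \otimes \mathbf{b}_j^J, \quad j = 1,\ldots,J$$

[Figure: three three-way blocks (1, 2, 3), each made of slices $\mathbf{P}^j_{..1}, \ldots, \mathbf{P}^j_{..k}, \ldots, \mathbf{P}^j_{..K}$, with weight vectors $\mathbf{b}_j^J$ and $\mathbf{b}_j^K$; path diagram labels C13 = 1, C23 = 1, C12 = 1, C13 = 0]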

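MGCCA algorithm

c_jk = 1 if blocks/groups are connected and 0 otherwise

Initial step: choose b_j

Outer component (summarizes the block):

$$\mathbf{y}_j = \mathbf{P}_j\mathbf{b}_j$$

Inner component (takes into account relationships between blocks):

$$\mathbf{z}_j = \sum_{k \neq j} e_{jk}\,\mathbf{y}_k$$

Choice of weights e_jk:

- Horst : $e_{jk} = c_{jk}$

- Centroid : $e_{jk} = c_{jk}\,\mathrm{sign}\big(\mathrm{cov}(\mathbf{y}_j, \mathbf{y}_k)\big)$

- Factorial : $e_{jk} = c_{jk}\,\mathrm{cov}(\mathbf{y}_j, \mathbf{y}_k)$

Update of the weight vectors:

$$(\mathbf{b}_j^K, \mathbf{b}_j^J) = \underset{\mathbf{b}^K,\,\mathbf{b}^J}{\mathrm{arg\,max}}\ \mathrm{cov}(\mathbf{P}_j\mathbf{b}, \mathbf{z}_j), \qquad \|\mathbf{b}\| = 1 \ \text{and}\ \mathbf{b} = \mathbf{b}^K \otimes \mathbf{b}^J$$

$$\mathbf{P}_j = \big[\mathbf{P}^j_{..1}, \ldots, \mathbf{P}^j_{..K}\big]$$

$\mathbf{b}_j^K$ and $\mathbf{b}_j^J$ are obtained by SVD

Iterate until convergence of the criterion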
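One way to see why an SVD appears in the update: with b = b^K ⊗ b^J, the covariance cov(P b, z) becomes a bilinear form (b^K)^t Q b^J, where row k of Q holds the covariances between the columns of slice k and z. Its maximum over unit vectors is the largest singular value of Q. The sketch below illustrates this on random data; the sizes, the slice ordering of the columns of P and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, J, K = 30, 6, 4                 # hypothetical: J positions, K modalities
P = rng.normal(size=(n, J * K))    # unfolded block [P_..1, ..., P_..K], each slice n x J
P -= P.mean(axis=0)
z = rng.normal(size=n)             # stands in for the inner component z_j
z -= z.mean()

# Row k of Q = covariances between the J columns of slice k and z
Q = (P.T @ z).reshape(K, J) / n

# Leading singular pair gives the two unit-norm weight vectors
U, s, Vt = np.linalg.svd(Q)
b_K, b_J = U[:, 0], Vt[0]
b = np.kron(b_K, b_J)              # Kronecker-structured weight vector of length K * J

cov_val = float(np.mean((P @ b) * z))
print(np.isclose(cov_val, s[0]), b.size, b_K.size + b_J.size)
```

The structured vector has K × J entries but is described by only K + J numbers, which is the parameter saving quoted in the slides (4 + p1 instead of 4 × p1).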

MGCCA results

Predict the long-term recovery of patients after traumatic brain injury

[Figure: P1 connected to P2, with weight vectors $\mathbf{b}_1^J$, $\mathbf{b}_1^K$ and $\mathbf{b}_2$]

Influence of spatial positions ($\mathbf{b}_1^J$): discriminating voxels within the white matter bundles

Influence of the modalities ($\mathbf{b}_1^K$)

Groups: bad prognosis, good prognosis, control

Part III. Multi-group analysis

M. Tenenhaus (HEC, Paris)

Part III: multigroup data analysis

• SETTINGS: The same set of variables is measured on individuals structured in several groups. Groups are centered and normalized (unit norm).

• OBJECTIVE: Investigate the relationships between variables within the various groups.

[Figure: groups of n1, n2, …, nI individuals measured on the same p variables]

Tenenhaus, A. and Tenenhaus, M. (2014). Regularized Generalized Canonical Correlation Analysis for multiblock or

multigroup data analysis. European Journal of Operational Research, 238 :391–403.


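RGCCA for multiblock data analysis

$$\underset{\mathbf{a}_1,\mathbf{a}_2,\ldots,\mathbf{a}_J}{\mathrm{argmax}} \sum_{j \neq k}^{J} c_{jk}\, g\big(\mathrm{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\big)$$

Subject to the constraints

$$(1-\tau_j)\,\mathrm{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1, \quad j = 1,\ldots,J$$

Block component: $\mathbf{X}_j\mathbf{a}_j$

$$\mathrm{cov}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k) = \mathrm{var}(\mathbf{X}_j\mathbf{a}_j)\,\mathrm{cor}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\,\mathrm{var}(\mathbf{X}_k\mathbf{a}_k)$$

Block components should verify two properties at the same time:

(i) Block components explain their own block well.

(ii) Block components are as correlated as possible for connected blocks.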

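RGCCA for multigroup data analysis

$$\underset{\mathbf{a}_1,\ldots,\mathbf{a}_I}{\mathrm{argmax}} \sum_{i \neq k}^{I} c_{ik}\, g\big(\langle \mathbf{X}_i^t\mathbf{X}_i\mathbf{a}_i, \mathbf{X}_k^t\mathbf{X}_k\mathbf{a}_k \rangle\big)$$

$$\text{s.t. } (1-\tau_i)\|\mathbf{X}_i\mathbf{a}_i\|^2 + \tau_i\|\mathbf{a}_i\|^2 = 1, \quad i = 1,\ldots,I$$

Group loadings: $\mathbf{X}_i^t\mathbf{X}_i\mathbf{a}_i$. Group components: $\mathbf{X}_i\mathbf{a}_i$.

$$\langle \mathbf{X}_i^t\mathbf{X}_i\mathbf{a}_i, \mathbf{X}_k^t\mathbf{X}_k\mathbf{a}_k \rangle = \cos(\mathbf{X}_i^t\mathbf{X}_i\mathbf{a}_i, \mathbf{X}_k^t\mathbf{X}_k\mathbf{a}_k) \times \|\mathbf{X}_i^t\mathbf{X}_i\mathbf{a}_i\| \times \|\mathbf{X}_k^t\mathbf{X}_k\mathbf{a}_k\|$$

Group loadings and group components should verify the following properties at the same time:

• Each group component X_i a_i explains its own group well.

• Small angle between loadings if groups are connected (similar loadings).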
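The multigroup criterion and the decomposition of the scalar product into a cosine and two norms can be illustrated as follows. This is only a sketch on random data: the function names, the group sizes, the column-wise unit-norm scaling and the equal weight vectors are assumptions made for the example.

```python
import numpy as np

def multigroup_criterion(groups, weights, C, g=np.abs):
    """Sum over i != k of c_ik * g(<X_i^t X_i a_i, X_k^t X_k a_k>)."""
    loadings = [X.T @ X @ a for X, a in zip(groups, weights)]
    n_groups = len(groups)
    return float(sum(C[i][k] * g(loadings[i] @ loadings[k])
                     for i in range(n_groups) for k in range(n_groups) if i != k))

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical data: three groups of different sizes measured on the same p variables
rng = np.random.default_rng(0)
p = 4
groups = []
for n_i in (15, 20, 25):
    X = rng.normal(size=(n_i, p))
    X -= X.mean(axis=0)                 # centre each group
    X /= np.linalg.norm(X, axis=0)      # unit-norm columns (assumed normalization)
    groups.append(X)

weights = [np.ones(p) / 2.0 for _ in groups]   # unit-norm weight vectors since p = 4
C = np.ones((3, 3)) - np.eye(3)                # all groups connected

value = multigroup_criterion(groups, weights, C)

# Scalar product = cosine x norm x norm for the first two group loadings
u, v = (X.T @ X @ a for X, a in zip(groups[:2], weights[:2]))
print(np.isclose(u @ v, cosine(u, v) * np.linalg.norm(u) * np.linalg.norm(v)))
```

A large scalar product therefore requires both loadings of large norm, meaning components that explain their group, and a small angle between the loadings of connected groups.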

Conclusion

Multiblock analysis

Multiblock/Multiway analysis

Multigroup analysis

RGCCA for multiblock, multigroup or multiway data allows analyzing the data in their natural (but complex) structure.