Structured data analysis with RGCCA (lss.supelec.fr)
Overview of the presentation
Part I. Multi-block analysis (blocks X1, X2, ..., XJ with p1, p2, p3, ... variables, measured on the same n individuals)
Part II. Multi-block and multi-way analysis
Part III. Multi-group analysis (groups X1, X2, ..., XI with n1, n2, ..., nI individuals and the same p variables)
Part I. Multi-block analysis
M. Tenenhaus (HEC, Paris)
V. Guillemot (ICM)
T. Löfstedt (CEA, Neurospin)
C. Philippe (IGR)
V. Frouin (CEA, Neurospin)
P. Groenen (Erasmus University, Rotterdam)
Economic inequality and political instability (data from Russett, 1964)

Economic inequality
Agricultural inequality
  GINI : Inequality of land distribution
  FARM : % of farmers that own half of the land (> 50)
  RENT : % of farmers that rent all their land
Industrial development
  GNPR : Gross national product per capita ($ 1955)
  LABO : % of labor force employed in agriculture

Political instability
  INST : Instability of executive (45-61)
  ECKS : Number of violent internal war incidents (46-61)
  DEAT : Number of people killed as a result of civic group violence (50-62)
  D-STAB : Stable democracy
  D-UNST : Unstable democracy
  DICT : Dictatorship
Economic inequality and political instability (data from Russett, 1964)

Country      Gini  Farm  Rent  Gnpr  Labo  Inst  Ecks  Deat  Demo
Argentina    86.3  98.2  32.9   374    25  13.6    57   217     2
Australia    92.9  99.6     *  1215    14  11.3     0     0     1
Austria      74.0  97.4  10.7   532    32  12.8     4     0     2
France       58.3  86.1  26.0  1046    26  16.3    46     1     2
Yugoslavia   43.7  79.8   0.0   297    67   0.0     9     0     3

X1 = (Gini, Farm, Rent), X2 = (Gnpr, Labo), X3 = (Inst, Ecks, Deat, Demo)
Demo: 1 = Stable democracy, 2 = Unstable democracy, 3 = Dictatorship
Three data blocks
Agricultural inequality (X1): GINI, FARM, RENT
Industrial development (X2): GNPR, LABO
Political instability (X3): INST, ECKS, DEAT, D-STB, D-INS, DICT

Path diagram
Agr. ineq. (X1) and Ind. dev. (X2) are each connected to Pol. inst. (X3), but not to each other:
C13 = 1, C23 = 1, C12 = 0
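Block components
$$\mathbf{y}_1 = \mathbf{X}_1\mathbf{a}_1 = a_{11}\,\mathrm{GINI} + a_{12}\,\mathrm{FARM} + a_{13}\,\mathrm{RENT}$$
$$\mathbf{y}_2 = \mathbf{X}_2\mathbf{a}_2 = a_{21}\,\mathrm{GNPR} + a_{22}\,\mathrm{LABO}$$
$$\mathbf{y}_3 = \mathbf{X}_3\mathbf{a}_3 = a_{31}\,\mathrm{INST} + a_{32}\,\mathrm{ECKS} + a_{33}\,\mathrm{DEAT} + a_{34}\,\mathrm{D\text{-}STAB} + a_{35}\,\mathrm{D\text{-}UNST} + a_{36}\,\mathrm{DICT}$$
Block components should satisfy two properties at the same time:
(i) Block components explain their own block well.
(ii) Block components are as correlated as possible for connected blocks.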
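Some modified multi-block methods
Covariance-based criteria
$c_{jk} = 1$ if blocks are linked, 0 otherwise, and $c_{jj} = 0$

GENERALIZED CANONICAL CORRELATION ANALYSIS
SUMCOR (Horst, 1961): $\text{maximize } \sum_{j,k} c_{jk}\,\operatorname{cor}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$
SSQCOR (Mathes, 1993; Hanafi, 2004): $\text{maximize } \sum_{j,k} c_{jk}\,\operatorname{cor}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$
SABSCOR (Mathes, 1993; Hanafi, 2004): $\text{maximize } \sum_{j,k} c_{jk}\,|\operatorname{cor}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)|$

GENERALIZED CANONICAL COVARIANCE ANALYSIS
SUMCOV (Van de Geer, 1984): $\underset{\text{all } \|\mathbf{a}_j\|=1}{\text{maximize}} \sum_{j,k} c_{jk}\,\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$
SSQCOV (Hanafi & Kiers, 2006): $\underset{\text{all } \|\mathbf{a}_j\|=1}{\text{maximize}} \sum_{j,k} c_{jk}\,\operatorname{cov}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$
SABSCOV (Krämer, 2006): $\underset{\text{all } \|\mathbf{a}_j\|=1}{\text{maximize}} \sum_{j,k} c_{jk}\,|\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)|$

$$\operatorname{cov}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k) = \operatorname{var}(\mathbf{X}_j\mathbf{a}_j)\,\operatorname{cor}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\,\operatorname{var}(\mathbf{X}_k\mathbf{a}_k)$$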
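The identity above is what lets a covariance-based criterion reward both a high component variance and a high correlation between components. A minimal numerical check, assuming NumPy and random toy data (the blocks, sizes and weights below are illustrative, not the Russett data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X_j = rng.normal(size=(n, 3))   # toy block j
X_k = rng.normal(size=(n, 2))   # toy block k
a_j = rng.normal(size=3)        # arbitrary weight vectors
a_k = rng.normal(size=2)

y_j = X_j @ a_j                 # block components
y_k = X_k @ a_k

# Use the 1/n normalization everywhere so the three quantities agree.
cov_jk = np.cov(y_j, y_k, bias=True)[0, 1]
cor_jk = np.corrcoef(y_j, y_k)[0, 1]
var_j, var_k = y_j.var(), y_k.var()

lhs = cov_jk ** 2
rhs = var_j * cor_jk ** 2 * var_k
print(np.isclose(lhs, rhs))
```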
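Some multi-block methods
Covariance-based criteria
$c_{jk} = 1$ if blocks are linked, 0 otherwise, and $c_{jj} = 0$

SUMCOR: $\underset{\text{all } \operatorname{var}(\mathbf{X}_j\mathbf{a}_j)=1}{\text{maximize}} \sum_{j,k} c_{jk}\,\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$
SSQCOR: $\underset{\text{all } \operatorname{var}(\mathbf{X}_j\mathbf{a}_j)=1}{\text{maximize}} \sum_{j,k} c_{jk}\,\operatorname{cov}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$
SABSCOR: $\underset{\text{all } \operatorname{var}(\mathbf{X}_j\mathbf{a}_j)=1}{\text{maximize}} \sum_{j,k} c_{jk}\,|\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)|$
SUMCOV: $\underset{\text{all } \|\mathbf{a}_j\|=1}{\text{maximize}} \sum_{j,k} c_{jk}\,\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$
SSQCOV: $\underset{\text{all } \|\mathbf{a}_j\|=1}{\text{maximize}} \sum_{j,k} c_{jk}\,\operatorname{cov}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)$
SABSCOV: $\underset{\text{all } \|\mathbf{a}_j\|=1}{\text{maximize}} \sum_{j,k} c_{jk}\,|\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)|$

$$\operatorname{cov}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k) = \operatorname{var}(\mathbf{X}_j\mathbf{a}_j)\,\operatorname{cor}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\,\operatorname{var}(\mathbf{X}_k\mathbf{a}_k)$$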
RGCCA optimization problem
$$\underset{\mathbf{a}_1,\mathbf{a}_2,\ldots,\mathbf{a}_J}{\operatorname{argmax}} \sum_{j\neq k}^{J} c_{jk}\, g\big(\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\big)$$
subject to the constraints
$$(1-\tau_j)\operatorname{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1,\quad j = 1,\ldots,J$$
where:
$c_{jk} = 1$ if $\mathbf{X}_j$ and $\mathbf{X}_k$ are connected, 0 otherwise
$\tau_j$ = shrinkage constant between 0 and 1
$g$ = identity (Horst scheme), square (Factorial scheme) or absolute value (Centroid scheme)
A monotone convergent algorithm
Schäfer and Strimmer formula can be used for an optimal determination of the shrinkage constants
• Tenenhaus A. and Tenenhaus M., Regularized Generalized Canonical Correlation Analysis, Psychometrika, vol. 76, Issue 2, pp. 257-284, 2011
• Tenenhaus A., Philippe C., Frouin V., Kernel Generalized Canonical Correlation Analysis, Computational Statistics and Data Analysis, in revision.
• Tenenhaus A. and Guillemot V. (2013): RGCCA Package. http://cran.project.org/web/packages/RGCCA/index.html
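To make the optimization problem concrete, the sketch below evaluates the criterion for given weight vectors and rescales a weight vector so that it satisfies the constraint. It is a NumPy illustration on random toy blocks, not the RGCCA package API; the helper names `cov`, `normalize` and `rgcca_criterion` are hypothetical.

```python
import numpy as np

# g for each scheme named on the slide
SCHEMES = {"horst": lambda x: x, "factorial": lambda x: x ** 2, "centroid": abs}

def cov(u, v):
    """Empirical covariance with 1/n normalization."""
    u = u - u.mean()
    v = v - v.mean()
    return float(u @ v) / len(u)

def normalize(X, a, tau):
    """Rescale a so that (1 - tau) * var(X a) + tau * ||a||^2 = 1."""
    y = X @ a
    return a / np.sqrt((1 - tau) * cov(y, y) + tau * float(a @ a))

def rgcca_criterion(blocks, weights, C, scheme="factorial"):
    """Sum over j != k of c_jk * g(cov(X_j a_j, X_k a_k))."""
    g = SCHEMES[scheme]
    ys = [X @ a for X, a in zip(blocks, weights)]
    J = len(blocks)
    return sum(C[j][k] * g(cov(ys[j], ys[k]))
               for j in range(J) for k in range(J) if j != k)

# Toy data (random, not the Russett data); design C13 = C23 = 1, C12 = 0
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(30, p)) for p in (3, 2, 6)]
C = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
tau = [1.0, 1.0, 1.0]
weights = [normalize(X, rng.normal(size=X.shape[1]), t)
           for X, t in zip(blocks, tau)]
value = rgcca_criterion(blocks, weights, C, scheme="factorial")
print(value)
```

With tau = 1 the constraint reduces to a unit-norm weight vector, and with tau = 0 to a unit-variance component; intermediate values interpolate between the two.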
Special cases

Method                                        Criterion                                       Constraints
PLS regression                                Maximize Cov(X1 a1, X2 a2)                      ||a1|| = ||a2|| = 1
Canonical Correlation Analysis                Maximize Cor(X1 a1, X2 a2)                      Var(X1 a1) = Var(X2 a2) = 1
Redundancy analysis of X1 with respect to X2  Maximize Cor(X1 a1, X2 a2) Var(X1 a1)^(1/2)     ||a1|| = 1, Var(X2 a2) = 1

Components X1 a1 and X2 a2 are well correlated.
1st component is stable. No stability condition for 2nd component.
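For the PLS regression row, the maximum of Cov(X1 a1, X2 a2) under unit-norm weights is attained by the leading singular vectors of the cross-covariance matrix, a standard linear-algebra fact. The following sketch checks this on random toy data, assuming NumPy; the data and sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X1 = rng.normal(size=(n, 4))
X2 = X1[:, :2] @ rng.normal(size=(2, 3)) + 0.5 * rng.normal(size=(n, 3))
X1 = X1 - X1.mean(axis=0)   # center both blocks
X2 = X2 - X2.mean(axis=0)

# Cross-covariance matrix (1/n normalization) and its SVD
S12 = X1.T @ X2 / n
U, s, Vt = np.linalg.svd(S12)
a1, a2 = U[:, 0], Vt[0]     # unit-norm weight vectors

best_cov = float((X1 @ a1) @ (X2 @ a2)) / n   # equals the largest singular value

# Any other pair of unit-norm weights gives a covariance that is no larger.
r1 = rng.normal(size=4); r1 /= np.linalg.norm(r1)
r2 = rng.normal(size=3); r2 /= np.linalg.norm(r2)
other_cov = float((X1 @ r1) @ (X2 @ r2)) / n
print(best_cov, other_cov)
```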
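Choice of the shrinkage constant $\tau_j$ (part 1)
$$\underset{\mathbf{a}_1,\mathbf{a}_2}{\operatorname{argmax}}\ \operatorname{cov}(\mathbf{X}_1\mathbf{a}_1, \mathbf{X}_2\mathbf{a}_2)$$
subject to the constraints
$$(1-\tau_j)\operatorname{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1,\quad j = 1,2$$
Choice of the shrinkage constant $\tau_j$ (part 2)
$\tau_j$ ranges from 0 (favoring correlation) to 1 (favoring stability).
Schäfer and Strimmer formula can be used for an optimal determination of the shrinkage constants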
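Choice of the design matrix C
$$\underset{\mathbf{a}_1,\mathbf{a}_2,\ldots,\mathbf{a}_J}{\operatorname{argmax}} \sum_{j\neq k}^{J} c_{jk}\, g\big(\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\big)$$
subject to the constraints
$$(1-\tau_j)\operatorname{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1,\quad j = 1,\ldots,J$$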
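Hierarchical models
(a) One second order block
$$\max_{\mathbf{a}_1,\ldots,\mathbf{a}_{J+1}} \sum_{j=1}^{J} g\big(\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_{J+1}\mathbf{a}_{J+1})\big)$$
subject to $(1-\tau_j)\operatorname{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1,\quad j = 1,\ldots,J+1$
(b) Several second order blocks
$$\max_{\mathbf{a}_1,\ldots,\mathbf{a}_J} \sum_{j=1}^{J_1}\sum_{k=J_1+1}^{J} c_{jk}\, g\big(\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\big)$$
subject to $(1-\tau_j)\operatorname{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1,\quad j = 1,\ldots,J$
Very often:
$\mathbf{X}_1, \ldots, \mathbf{X}_{J_1}$ = predictor blocks
$\mathbf{X}_{J_1+1}, \ldots, \mathbf{X}_J$ = response blocks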
Special cases of RGCCA (among others)
RGCCA provides monotone convergent algorithms for these criteria.

Two-block case
• PLS regression: Wold S., Martens & Wold H. (1983): The multivariate calibration problem in chemistry solved by the PLS method. In Proc. Conf. Matrix Pencils, Ruhe A. & Kågström B. (Eds), March 1982, Lecture Notes in Mathematics, Springer Verlag, Heidelberg, p. 286-293.
• Redundancy analysis: Barker M. & Rayens W. (2003): Partial least squares for discrimination, Journal of Chemometrics, 17, 166-173.
• Regularized CCA: Vinod H. D. (1976): Canonical ridge and econometrics of joint production. Journal of Econometrics, 4, 147–166.
• Inter-battery factor analysis: Tucker L.R. (1958): An inter-battery method of factor analysis, Psychometrika, vol. 23, n°2, pp. 111-136.

Multi-block case
• MCOA: Chessel D. and Hanafi M. (1996): Analyse de la co-inertie de K nuages de points. Revue de Statistique Appliquée, 44, 35-60.
• SSQCOV: Hanafi M. & Kiers H.A.L. (2006): Analysis of K sets of data, with differential emphasis on agreement between and within sets, Computational Statistics & Data Analysis, 51, 1491-1508.
• SUMCOR: Horst P. (1961): Relations among m sets of variables, Psychometrika, vol. 26, pp. 126-149.
• SSQCOR: Kettenring J.R. (1971): Canonical analysis of several sets of variables, Biometrika, 58, 433-451.
• MAXDIFF: Van de Geer J. P. (1984): Linear relations among k sets of variables. Psychometrika, 49, 70-94.
• PLS path modeling (mode B): Tenenhaus M., Esposito Vinzi V., Chatelin Y.-M., Lauro C. (2005): PLS path modeling. Computational Statistics and Data Analysis, 48, 159-205.
• Generalized Orthogonal MCOA: Vivien M. & Sabatier R. (2003): Generalized orthogonal multiple co-inertia analysis (-PLS): new multiblock component and regression methods, Journal of Chemometrics, 17, 287-301.
• Carroll's GCCA: Carroll, J.D. (1968): A generalization of canonical correlation analysis to three or more sets of variables, Proc. 76th Conv. Am. Psych. Assoc., pp. 227-228.
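$$\underset{\mathbf{a}_1,\mathbf{a}_2,\ldots,\mathbf{a}_J}{\operatorname{argmax}} \sum_{j\neq k}^{J} c_{jk}\, g\big(\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\big)$$
subject to
$$(1-\tau_j)\operatorname{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1,\quad j = 1,\ldots,J$$
Two key ingredients:
(i) Block relaxation
(ii) Majorization by Minorization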
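The RGCCA algorithm (primal version)
$c_{jk} = 1$ if blocks are linked, 0 otherwise, and $c_{jj} = 0$

Initial step: choose $\mathbf{a}_j$.

Outer estimation (explains the block):
$$\mathbf{y}_j = \mathbf{X}_j\mathbf{a}_j,\qquad (1-\tau_j)\operatorname{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1$$

Inner estimation (explains the relation between blocks):
$$\mathbf{z}_j = \sum_{k\neq j} e_{jk}\,\mathbf{y}_k$$
Choice of weights $e_{jk}$:
- Horst: $e_{jk} = c_{jk}$
- Centroid: $e_{jk} = c_{jk}\,\operatorname{sign}\big(\operatorname{cor}(\mathbf{y}_j, \mathbf{y}_k)\big)$
- Factorial: $e_{jk} = c_{jk}\,\operatorname{cov}(\mathbf{y}_j, \mathbf{y}_k)$

Update of the weight vector:
$$\mathbf{a}_j = \frac{\left[\tau_j\mathbf{I} + (1-\tau_j)\tfrac{1}{n}\mathbf{X}_j^t\mathbf{X}_j\right]^{-1}\mathbf{X}_j^t\mathbf{z}_j}{\sqrt{\mathbf{z}_j^t\mathbf{X}_j\left[\tau_j\mathbf{I} + (1-\tau_j)\tfrac{1}{n}\mathbf{X}_j^t\mathbf{X}_j\right]^{-1}\mathbf{X}_j^t\mathbf{z}_j}}$$
Dimension of the matrix to invert = $p_j \times p_j$

Iterate until convergence of the criterion.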
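The primal iteration (inner estimation, then the closed-form update of each weight vector) can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the RGCCA package implementation: the function name `rgcca_primal`, the random initialization and the stopping rule are choices made here, blocks are updated one at a time, and the data are random toy blocks.

```python
import numpy as np

def rgcca_primal(blocks, C, tau, scheme="factorial", n_iter=100, tol=1e-12, seed=0):
    """Sketch of the primal RGCCA iteration; returns weights and criterion history."""
    rng = np.random.default_rng(seed)
    blocks = [X - X.mean(axis=0) for X in blocks]          # center each block
    n, J = blocks[0].shape[0], len(blocks)
    g = {"horst": lambda x: x, "factorial": np.square, "centroid": np.abs}[scheme]
    # Inverse of M_j = tau_j I + (1 - tau_j) X_j^t X_j / n  (p_j x p_j)
    M_inv = [np.linalg.inv(t * np.eye(X.shape[1]) + (1 - t) * (X.T @ X) / n)
             for X, t in zip(blocks, tau)]

    def constrained(j, v):
        """Return M_j^{-1} v scaled so that a^t M_j a = 1."""
        a = M_inv[j] @ v
        return a / np.sqrt(v @ a)

    A = [constrained(j, rng.normal(size=X.shape[1])) for j, X in enumerate(blocks)]
    Y = [X @ a for X, a in zip(blocks, A)]

    def criterion():
        covs = np.array([[Y[j] @ Y[k] / n for k in range(J)] for j in range(J)])
        return float(np.sum(C * g(covs)))                  # c_jj = 0 drops j == k

    history = [criterion()]
    for _ in range(n_iter):
        for j in range(J):
            covs = np.array([Y[j] @ Y[k] / n for k in range(J)])
            if scheme == "horst":
                e = C[j].astype(float)
            elif scheme == "centroid":
                e = C[j] * np.sign(covs)
            else:                                          # factorial
                e = C[j] * covs
            z = sum(e[k] * Y[k] for k in range(J))         # inner estimation
            A[j] = constrained(j, blocks[j].T @ z)         # weight update
            Y[j] = blocks[j] @ A[j]                        # outer estimation
        history.append(criterion())
        if history[-1] - history[-2] < tol:
            break
    return A, history

# Toy example: three random blocks sharing one latent variable
rng = np.random.default_rng(42)
latent = rng.normal(size=50)
blocks = [np.outer(latent, rng.normal(size=p)) + rng.normal(size=(50, p))
          for p in (3, 2, 5)]
C = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]])
weights, history = rgcca_primal(blocks, C, tau=[1.0, 1.0, 1.0])
print(history[0], history[-1])
```

The criterion history should be non-decreasing, which is the monotone convergence property the slides refer to.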
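The RGCCA algorithm (dual version)
$c_{jk} = 1$ if blocks are linked, 0 otherwise, and $c_{jj} = 0$

Initial step: choose $\boldsymbol{\alpha}_j$.

Outer estimation (explains the block):
$$\mathbf{y}_j = \mathbf{X}_j\mathbf{X}_j^t\boldsymbol{\alpha}_j,\qquad \boldsymbol{\alpha}_j^t\mathbf{X}_j\mathbf{X}_j^t\left[\tau_j\mathbf{I}_n + (1-\tau_j)\tfrac{1}{n}\mathbf{X}_j\mathbf{X}_j^t\right]\boldsymbol{\alpha}_j = 1$$

Inner estimation (explains the relation between blocks):
$$\mathbf{z}_j = \sum_{k\neq j} e_{jk}\,\mathbf{y}_k$$
Choice of weights $e_{jk}$:
- Horst: $e_{jk} = c_{jk}$
- Centroid: $e_{jk} = c_{jk}\,\operatorname{sign}\big(\operatorname{cor}(\mathbf{y}_j, \mathbf{y}_k)\big)$
- Factorial: $e_{jk} = c_{jk}\,\operatorname{cov}(\mathbf{y}_j, \mathbf{y}_k)$

Update of the dual vector:
$$\boldsymbol{\alpha}_j = \frac{\left[\tau_j\mathbf{I}_n + (1-\tau_j)\tfrac{1}{n}\mathbf{X}_j\mathbf{X}_j^t\right]^{-1}\mathbf{z}_j}{\sqrt{\mathbf{z}_j^t\mathbf{X}_j\mathbf{X}_j^t\left[\tau_j\mathbf{I}_n + (1-\tau_j)\tfrac{1}{n}\mathbf{X}_j\mathbf{X}_j^t\right]^{-1}\mathbf{z}_j}}$$
Dimension of the matrix to invert = $n \times n$

Iterate until convergence of the criterion. The primal weight vector is recovered as $\mathbf{a}_j = \mathbf{X}_j^t\boldsymbol{\alpha}_j$.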
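The primal and dual updates give the same weight vector for a given inner component z_j; the dual form only replaces a p_j x p_j system by an n x n one, which matters when p_j is much larger than n. A small NumPy check on random toy data (sizes and tau chosen here for illustration, with tau > 0 so that the p x p matrix is invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, tau = 20, 200, 0.5           # more variables than individuals
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
z = rng.normal(size=n)             # stands in for the inner component z_j

# Primal update: solve a p x p system
M_primal = tau * np.eye(p) + (1 - tau) * (X.T @ X) / n
v = X.T @ z
a_primal = np.linalg.solve(M_primal, v)
a_primal = a_primal / np.sqrt(v @ a_primal)

# Dual update: solve an n x n system, then map back with a = X^t alpha
K = X @ X.T
M_dual = tau * np.eye(n) + (1 - tau) * K / n
w = np.linalg.solve(M_dual, z)
alpha = w / np.sqrt(z @ K @ w)
a_dual = X.T @ alpha

print(np.allclose(a_primal, a_dual))
```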
Weight vectors
Small dimensional block settings ⟹ primal algorithm for RGCCA
[Figure: path diagram of Agricultural inequality (X1: GINI, FARM, RENT), Industrial development (X2: GNPR, LABO) and Political instability (X3: INST, ECKS, DEAT, D-STB, DICT), annotated with estimated weights. Weights in extraction order: 0.66, 0.74, 0.10, 0.69, -0.72, 0.17, 0.44, 0.48, -0.56, 0.49. Correlations between connected components: Corr = 0.428 and Corr = -0.767.]
Bootstrap confidence intervals
[Figure: the same path diagram (GINI, FARM, RENT; GNPR, LABO; INST, ECKS, DEAT, D-STB, DICT) with weights 0.66, 0.74, 0.10, 0.69, -0.72, 0.17, 0.44, 0.48, -0.56, 0.49 and Corr = 0.428, Corr = -0.767.]
Data visualization
[Figure: countries plotted by their block components; axes: Agricultural inequality and Industrial development.]
These countries have known a period of dictatorship after 1964:
Greece : colonels' dictatorship 1967-1974
Chile : Pinochet's military regime 1973-1990
Argentina : military dictatorship 1976-1983
Brazil : Branco's military dictatorship 1964-1985
Glioma Cancer Data (Department of Pediatric Oncology of the Gustave Roussy Institute)

             Gene 1  Gene 2  …  Gene 15201  CGH 1  …  CGH 1909  Localization
Patient 1      0.18   -0.21  …       -0.73   0.00  …     -0.55  Hemisphere
Patient 2      1.15   -0.45  …        0.27  -0.30  …      0.00  Midline
Patient 3      1.35    0.17  …        0.22   0.33  …      0.64  DIPG
Patient 53     1.39    0.18  …       -0.17   0.00  …      0.43  Hemisphere

Transcriptomic data (X1): Gene 1 to Gene 15201
CGH data (X2): CGH 1 to CGH 1909
Outcome (X3): Localization
Glioma Cancer Data: from an RGCCA viewpoint (Department of Pediatric Oncology of the Gustave Roussy Institute)
High dimensional block settings ⟹ dual algorithm for RGCCA

Block 1 (transcriptomic data)
             Gene 1  …  Gene 15201
Patient 1      0.18  …       -0.73
Patient 2      1.15  …        0.27
Patient 3      1.35  …        0.22
Patient 53     1.39  …       -0.17

Block 2 (CGH data)
             CGH 1  …  CGH 1909
Patient 1     0.00  …     -0.55
Patient 2    -0.30  …      0.00
Patient 3     0.33  …      0.64
Patient 53    0.00  …      0.43

Block 3 (localization, dummy coded)
             Hemisphere  DIPG
Patient 1             1     0
Patient 2             0     0
Patient 3             0     1
Patient 53            1     0

RGCCA with factorial scheme: τ1 = 1, τ2 = 1 and τ3 = 0
Design: C13 = 1, C23 = 1, and C12 = 1 or C12 = 0 (both labels appear on the slide)
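Block components
$$\mathbf{y}_1 = \mathbf{X}_1\mathbf{a}_1 = a_{11}\,\mathrm{Gene}_1 + \cdots + a_{1,15201}\,\mathrm{Gene}_{15201}$$
$$\mathbf{y}_2 = \mathbf{X}_2\mathbf{a}_2 = a_{21}\,\mathrm{CGH}_1 + \cdots + a_{2,1909}\,\mathrm{CGH}_{1909}$$
$$\mathbf{y}_3 = \mathbf{X}_3\mathbf{a}_3 = a_{31}\,\mathrm{Hemisphere} + a_{32}\,\mathrm{DIPG}$$
Block components should satisfy three properties at the same time:
(i) Block components explain their own block well.
(ii) Block components are as correlated as possible for connected blocks.
(iii) Block components are built from sparse $\mathbf{a}_j$.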
Structured variable selection for RGCCA
[Figure labels: Genotype; Gene Expression; Functional MRI; Intermediate phenotype; Behavioral data (clinic, psychometric); Final phenotype]
(Structured) variable selection for RGCCA
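$$\underset{\mathbf{a}_1,\mathbf{a}_2,\ldots,\mathbf{a}_J}{\operatorname{argmax}} \sum_{j\neq k}^{J} c_{jk}\, g\big(\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\big)$$
subject to
$$\mathbf{a}_j^t\mathbf{M}_j\mathbf{a}_j = 1,\quad j = 1,\ldots,J$$
$$\Omega(\mathbf{a}_j) \le c_j,\quad j = 1,\ldots,J$$
• LASSO: $\Omega(\mathbf{a}_j) = \|\mathbf{a}_j\|_1$
• Group LASSO: $\Omega(\mathbf{a}_j) = \sum_{g\in\mathcal{G}} \|\mathbf{a}_g\|_2$
• Fused LASSO: $\Omega(\mathbf{a}_j) = \sum_{k=1}^{p_j} |a_{jk}| + \lambda\sum_{k=2}^{p_j} |a_{jk} - a_{j,k-1}|$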
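The three penalties are easy to compare on a small weight vector. The sketch below assumes NumPy; the function names, the example vector, the grouping and the value of lambda are illustrative choices, not taken from the slides.

```python
import numpy as np

def lasso(a):
    """Omega(a) = ||a||_1."""
    return float(np.sum(np.abs(a)))

def group_lasso(a, groups):
    """Omega(a) = sum over groups g of ||a_g||_2 (groups are lists of indices)."""
    return float(sum(np.linalg.norm(a[g]) for g in groups))

def fused_lasso(a, lam):
    """Omega(a) = sum |a_k| + lam * sum |a_k - a_{k-1}|."""
    return float(np.sum(np.abs(a)) + lam * np.sum(np.abs(np.diff(a))))

a = np.array([0.0, 0.0, 3.0, 4.0, 4.0])
groups = [[0, 1], [2, 3], [4]]

print(lasso(a))                # 11.0
print(group_lasso(a, groups))  # 0 + 5 + 4 = 9.0
print(fused_lasso(a, 0.5))     # 11 + 0.5 * (0 + 3 + 1 + 0) = 13.0
```

The group penalty is zero for a whole group only when every weight in it is zero, and the fused penalty is small when neighbouring weights are equal, which is what encourages grouped or piecewise-constant weight vectors.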
• Tenenhaus A., Philippe C., Guillemot V., Lê Cao K.-A., Grill J., Frouin V., Variable Selection for Generalized Canonical Correlation Analysis, Biostatistics, 15 (3),
569-583, 2014.
• Löfstedt T., Hadj-Salem F., Guillemot V., Philippe C., Duchesnay E., Frouin V., and Tenenhaus A., (2014). Structured variable selection for generalized
canonical correlation analysis. In: Proceedings of the 8th International Conference on Partial Least Squares and Related Methods (PLS14), Paris, France.
• Tenenhaus A. and Guillemot V. (2013): RGCCA Package. http://cran.project.org/web/packages/RGCCA/index.html
Part II. Multi-block and multi-way analysis
G. Lechuga (CentraleSupélec, L2S)
L. Le Brusquet (CentraleSupélec, L2S)
L. Puybasset, V. Perlbarg & D. Galanaud (Hôpital La Pitié-Salpêtrière)
Multimodal neuroimaging from a multiblock viewpoint (Brain and Spine Institute, La Pitié-Salpêtrière Hospital)

Anatomical MRI (X1), p1 ~ 10^4
             Anat 1  …  Anat p1
Patient 1      0.18  …    -0.73
Patient 2      1.15  …     0.27
Patient 3      1.35  …     0.22
Patient n      1.39  …    -0.17

Functional MRI (X2), p2 ~ 10^4
             fMRI 1  …  fMRI p2
Patient 1      0.00  …    -0.55
Patient 2     -0.30  …     0.00
Patient 3      0.33  …     0.64
Patient n      0.00  …     0.43

Behavior (X3), p3 ~ 10
             BEH 1  …  BEH p3
Patient 1     0.18  …   -0.73
Patient 2     1.15  …    0.27
Patient 3     1.35  …    0.22
Patient n     1.39  …   -0.17

n ~ 100
Design labels shown on the slide: C13 = 1, C23 = 1, C12 = 1, C13 = 0
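Multiway RGCCA (MGCCA)
From multiblock data to …
Blocks: X1 = Anatomical MRI, X2 = Diffusion MRI, X3 = Functional MRI, X4 = PET (p1 variables each), and X5 = Final phenotype (p2 variables), all measured on the same n individuals.
RGCCA algorithm: weight vectors a1, a2, a3, a4, a5, so 4 × p1 parameters to estimate for X1, …, X4.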
• Tenenhaus A., Le Brusquet L. Regularized Generalized Canonical Correlation Analysis extended to three way data, International Conference of the ERCIM WG on
Computational and Methodological Statistics, 2014
• Tenenhaus A., Le Brusquet L. Three-way Regularized Generalized Canonical Correlation Analysis, ThRee-way methods In Chemistry And Psychology, (TRICAP) ,2015
… to multiblock / multiway data
P1 = [X1, X2, X3, X4] is treated as a three-way array and P2 = X5, with C12 = 1.
$$\max_{\mathbf{b}_1,\mathbf{b}_2}\ \operatorname{cov}(\mathbf{P}_1\mathbf{b}_1, \mathbf{P}_2\mathbf{b}_2)$$
subject to
$$\mathbf{b}_j^t\mathbf{b}_j = 1,\ j = 1,2,\quad\text{and}\quad \mathbf{b}_1 = \mathbf{b}_1^K \otimes \mathbf{b}_1^J$$
MGCCA: 4 + p1 parameters to estimate instead of 4 × p1.
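The Kronecker constraint is what reduces the parameter count: a weight vector of length J × K is built from one vector of length J and one of length K. A minimal NumPy sketch on a random three-way array (the sizes, and the reading of J as spatial positions and K as modalities, are assumptions made here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, J, K = 10, 6, 4                       # individuals, positions, modalities
P = rng.normal(size=(n, J, K))           # three-way array

# Unfold the array as [P_..1, ..., P_..K], an n x (J*K) matrix
P_unfolded = np.concatenate([P[:, :, k] for k in range(K)], axis=1)

b_J = rng.normal(size=J); b_J /= np.linalg.norm(b_J)
b_K = rng.normal(size=K); b_K /= np.linalg.norm(b_K)
b = np.kron(b_K, b_J)                    # b = b^K (x) b^J, length J*K

y_kron = P_unfolded @ b                  # component from the unfolded block
y_multi = np.einsum("ijk,j,k->i", P, b_J, b_K)   # same component, multiway form

print(np.allclose(y_kron, y_multi))
print(J + K, "parameters instead of", J * K)
```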
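MGCCA optimization problem
$$\max_{\mathbf{b}_1,\ldots,\mathbf{b}_J} \sum_{j\neq k}^{J} c_{jk}\, g\big(\operatorname{cov}(\mathbf{P}_j\mathbf{b}_j, \mathbf{P}_k\mathbf{b}_k)\big)$$
subject to
$$\mathbf{b}_j^t\mathbf{b}_j = 1\quad\text{and}\quad \mathbf{b}_j = \mathbf{b}_j^K \otimes \mathbf{b}_j^J,\quad j = 1,\ldots,J$$
[Figure: three three-way blocks, each made of slices $\mathbf{P}^j_{..1}, \ldots, \mathbf{P}^j_{..k}, \ldots, \mathbf{P}^j_{..K}$ for j = 1, 2, 3, with weight vectors $\mathbf{b}_j^J$ and $\mathbf{b}_j^K$. Design labels shown on the slide: C13 = 1, C23 = 1, C12 = 1, C13 = 0.]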
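MGCCA algorithm
$c_{jk} = 1$ if blocks/groups are connected and 0 otherwise

Initial step: choose $\mathbf{b}_j$.

Outer component (summarizes the block):
$$\mathbf{y}_j = \mathbf{P}_j\mathbf{b}_j,\qquad \mathbf{P}_j = \left[\mathbf{P}^j_{..1}, \ldots, \mathbf{P}^j_{..K}\right]$$

Inner component (takes into account relationships between blocks):
$$\mathbf{z}_j = \sum_{k\neq j} e_{jk}\,\mathbf{y}_k$$
Choice of weights $e_{jk}$:
- Horst: $e_{jk} = c_{jk}$
- Centroid: $e_{jk} = c_{jk}\,\operatorname{sign}\big(\operatorname{cov}(\mathbf{y}_j, \mathbf{y}_k)\big)$
- Factorial: $e_{jk} = c_{jk}\,\operatorname{cov}(\mathbf{y}_j, \mathbf{y}_k)$

Update of the weight vectors:
$$\left(\mathbf{b}_j^K, \mathbf{b}_j^J\right) = \underset{\|\mathbf{b}\| = 1,\ \mathbf{b} = \mathbf{b}^K \otimes \mathbf{b}^J}{\operatorname{argmax}}\ \operatorname{cov}(\mathbf{P}_j\mathbf{b}, \mathbf{z}_j)$$
$\mathbf{b}_j^K$ and $\mathbf{b}_j^J$ are obtained by SVD.

Iterate until convergence of the criterion.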
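The slides state only that the two factors are obtained by SVD. One way consistent with this, sketched here as an assumption rather than as the authors' implementation: cov(P_j (b^K ⊗ b^J), z_j) is a bilinear form in (b^J, b^K) through the J × K matrix Q with entries cov(P[:, j, k], z), so its maximum over unit vectors is reached at the leading singular vectors of Q. The NumPy code below checks this on random toy data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, J, K = 30, 6, 4
P = rng.normal(size=(n, J, K))
P = P - P.mean(axis=0)                   # center each variable of the array
z = rng.normal(size=n)
z = z - z.mean()                         # stands in for the inner component z_j

# Q[j, k] = cov(P[:, j, k], z) with 1/n normalization
Q = np.einsum("i,ijk->jk", z, P) / n

# Leading singular vectors maximize b_J^t Q b_K over unit vectors
U, s, Vt = np.linalg.svd(Q, full_matrices=False)
b_J, b_K = U[:, 0], Vt[0]

# Covariance reached by the Kronecker-structured weight vector
P_unfolded = np.concatenate([P[:, :, k] for k in range(K)], axis=1)
b = np.kron(b_K, b_J)
reached = float((P_unfolded @ b) @ z) / n

# A random Kronecker-structured unit vector does no better.
r_J = rng.normal(size=J); r_J /= np.linalg.norm(r_J)
r_K = rng.normal(size=K); r_K /= np.linalg.norm(r_K)
other = float((P_unfolded @ np.kron(r_K, r_J)) @ z) / n
print(reached, s[0], other)
```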
MGCCA results
Predict the long-term recovery of patients after traumatic brain injury.
Groups: bad prognosis, good prognosis, control.
$\mathbf{b}_1^J$: influence of spatial positions (discriminating voxels within the white matter bundles).
$\mathbf{b}_1^K$: influence of the modalities.
Part III: multigroup data analysis
• SETTINGS: The same set of variables is measured on individuals structured in several groups. Groups are centered and normalized (unit norm).
• OBJECTIVE: Investigate the relationships between variables within the various groups.
[Figure: groups X1, X2, …, XI with n1, n2, …, nI individuals and the same p variables]
Tenenhaus, A. and Tenenhaus, M. (2014). Regularized Generalized Canonical Correlation Analysis for multiblock or multigroup data analysis. European Journal of Operational Research, 238: 391–403.
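RGCCA for multiblock data analysis
$$\underset{\mathbf{a}_1,\mathbf{a}_2,\ldots,\mathbf{a}_J}{\operatorname{argmax}} \sum_{j\neq k}^{J} c_{jk}\, g\big(\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\big)$$
subject to the constraints
$$(1-\tau_j)\operatorname{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j\|\mathbf{a}_j\|^2 = 1,\quad j = 1,\ldots,J$$
Block component: $\mathbf{X}_j\mathbf{a}_j$
$$\operatorname{cov}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k) = \operatorname{var}(\mathbf{X}_j\mathbf{a}_j)\,\operatorname{cor}^2(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\,\operatorname{var}(\mathbf{X}_k\mathbf{a}_k)$$
Block components should satisfy two properties at the same time:
(i) Block components explain their own block well.
(ii) Block components are as correlated as possible for connected blocks.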
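RGCCA for multigroup data analysis
$$\underset{\mathbf{a}_1,\ldots,\mathbf{a}_I}{\operatorname{argmax}} \sum_{i\neq k}^{I} c_{ik}\, g\big(\langle \mathbf{X}_i^t\mathbf{X}_i\mathbf{a}_i,\ \mathbf{X}_k^t\mathbf{X}_k\mathbf{a}_k\rangle\big)$$
subject to
$$(1-\tau_i)\|\mathbf{X}_i\mathbf{a}_i\|^2 + \tau_i\|\mathbf{a}_i\|^2 = 1,\quad i = 1,\ldots,I$$
Group loadings: $\mathbf{X}_i^t\mathbf{X}_i\mathbf{a}_i$. Group components: $\mathbf{X}_i\mathbf{a}_i$.
$$\langle \mathbf{X}_i^t\mathbf{X}_i\mathbf{a}_i,\ \mathbf{X}_k^t\mathbf{X}_k\mathbf{a}_k\rangle = \cos\big(\mathbf{X}_i^t\mathbf{X}_i\mathbf{a}_i,\ \mathbf{X}_k^t\mathbf{X}_k\mathbf{a}_k\big) \times \|\mathbf{X}_i^t\mathbf{X}_i\mathbf{a}_i\| \times \|\mathbf{X}_k^t\mathbf{X}_k\mathbf{a}_k\|$$
Group loadings and group components should satisfy the following properties at the same time:
• Each group component $\mathbf{X}_i\mathbf{a}_i$ explains its own group well.
• The angle between loadings is small if groups are connected (similar loadings).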
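The multigroup quantities can be computed directly from their definitions. The sketch below, assuming NumPy and random toy groups (group sizes, the number of variables and tau are illustrative), standardizes each group, builds group components and group loadings, and checks the decomposition of the scalar product into a cosine times two norms.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit norm, as in the settings above."""
    Xc = X - X.mean(axis=0)
    return Xc / np.linalg.norm(Xc, axis=0)

rng = np.random.default_rng(0)
p, tau = 4, 1.0
groups = [standardize(rng.normal(size=(n_i, p))) for n_i in (15, 20)]

weights = []
for X in groups:
    v = rng.normal(size=p)
    # Rescale so that (1 - tau) ||X a||^2 + tau ||a||^2 = 1
    scale = (1 - tau) * np.sum((X @ v) ** 2) + tau * np.sum(v ** 2)
    weights.append(v / np.sqrt(scale))

components = [X @ a for X, a in zip(groups, weights)]       # X_i a_i
loadings = [X.T @ X @ a for X, a in zip(groups, weights)]   # X_i^t X_i a_i

inner = float(loadings[0] @ loadings[1])
norms = [float(np.linalg.norm(l)) for l in loadings]
cosine = inner / (norms[0] * norms[1])
print(inner, cosine)
```

A cosine close to 1 means the two groups have similar loadings, which is what the criterion rewards for connected groups.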