Structured data analysis

Arthur Tenenhaus 2014/09/12

Neuroimaging-Genetic multiblock datasets

X1: DNA arrays (SNP), p1 ~ 10^6

X2: Functional MRI, p2 ~ 10^4

X3: Developmental disorders (reading difficulties, basic numerical knowledge, …, visuo-spatial abilities, visuo-motor abilities), p3 ~ 10

n ~ 100 individuals

Connections between blocks: c12 = 1, c23 = 1, c13 = 0

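Block components

$\mathbf{y}_1 = \mathbf{X}_1\mathbf{a}_1 = a_{11}\,\mathbf{SNP}_1 + \dots + a_{1,j_1}\,\mathbf{SNP}_{j_1}$

$\mathbf{y}_2 = \mathbf{X}_2\mathbf{a}_2 = a_{21}\,\mathbf{CGH}_1 + \dots + a_{2,j_2}\,\mathbf{CGH}_{j_2}$

$\mathbf{y}_3 = \mathbf{X}_3\mathbf{a}_3 = a_{31}\,\mathbf{BEHAV}_1 + \dots + a_{3,j_3}\,\mathbf{BEHAV}_{j_3}$

Block components should satisfy two properties at the same time:

(i) Block components explain their own block well.

(ii) Block components are as correlated as possible for connected blocks.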

RGCCA optimization problem

$$\underset{\mathbf{a}_1, \mathbf{a}_2, \dots, \mathbf{a}_J}{\operatorname{argmax}} \; \sum_{j \neq k}^{J} c_{jk}\, g\big(\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\big)$$

subject to the constraints

$$(1 - \tau_j)\operatorname{var}(\mathbf{X}_j\mathbf{a}_j) + \tau_j \|\mathbf{a}_j\|^2 = 1, \quad j = 1, \dots, J$$

where:

$g$ = identity (Horst scheme), square (Factorial scheme), or absolute value (Centroid scheme)

$\tau_j$ = shrinkage constant between 0 and 1

$c_{jk}$ = 1 if $\mathbf{X}_j$ and $\mathbf{X}_k$ are connected, 0 otherwise
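The criterion and the normalization constraint above can be evaluated numerically. The following is a minimal NumPy sketch, not the RGCCA package's implementation: the function names `normalize_weights` and `rgcca_criterion` are hypothetical, and it assumes the covariance and variance are the biased (1/n) estimators.

```python
import numpy as np

# Scheme functions g named on the slide.
SCHEMES = {
    "horst": lambda x: x,           # identity
    "factorial": lambda x: x ** 2,  # square
    "centroid": abs,                # absolute value
}

def normalize_weights(X, a, tau):
    """Rescale a so that (1 - tau) * var(X a) + tau * ||a||^2 = 1.

    Assumption: var is the biased (1/n) estimator.
    """
    y = X @ a
    scale = np.sqrt((1.0 - tau) * np.var(y) + tau * (a @ a))
    return a / scale

def rgcca_criterion(blocks, weights, C, scheme="factorial"):
    """Sum over j != k of c_jk * g(cov(X_j a_j, X_k a_k))."""
    g = SCHEMES[scheme]
    comps = [X @ a for X, a in zip(blocks, weights)]
    n = comps[0].shape[0]
    total = 0.0
    for j, yj in enumerate(comps):
        for k, yk in enumerate(comps):
            if j != k:
                # Assumption: biased (1/n) covariance between block components.
                cov_jk = (yj - yj.mean()) @ (yk - yk.mean()) / n
                total += C[j, k] * g(cov_jk)
    return total

# Usage with the design of the three-block example: c12 = c23 = 1, c13 = 0.
rng = np.random.default_rng(0)
blocks = [rng.standard_normal((100, p)) for p in (50, 20, 10)]
C = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
weights = [normalize_weights(X, rng.standard_normal(X.shape[1]), tau=0.5)
           for X in blocks]
value = rgcca_criterion(blocks, weights, C, scheme="factorial")
```

This only evaluates the objective for given weight vectors; the iterative algorithm that maximizes it is described in the references below.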

• Tenenhaus A. and Tenenhaus M., Regularized Generalized Canonical Correlation Analysis, Psychometrika, vol. 76, Issue 2, pp. 257-284, 2011

• Tenenhaus A., Philippe C., Frouin V., Kernel Generalized Canonical Correlation Analysis, Computational Statistics and Data Analysis, submitted.

• Tenenhaus A. and Guillemot V. (2013): RGCCA Package. http://cran.project.org/web/packages/RGCCA/index.html

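Block components

Block components should satisfy three properties at the same time:

(i) Block components explain their own block well.

(ii) Block components are as correlated as possible for connected blocks.

(iii) Block components are built from sparse $\mathbf{a}_j$.

$\mathbf{y}_1 = \mathbf{X}_1\mathbf{a}_1 = a_{11}\,\mathbf{SNP}_1 + \dots + a_{1,j_1}\,\mathbf{SNP}_{j_1}$

$\mathbf{y}_2 = \mathbf{X}_2\mathbf{a}_2 = a_{21}\,\mathbf{CGH}_1 + \dots + a_{2,j_2}\,\mathbf{CGH}_{j_2}$

$\mathbf{y}_3 = \mathbf{X}_3\mathbf{a}_3 = a_{31}\,\mathbf{BEHAV}_1 + \dots + a_{3,j_3}\,\mathbf{BEHAV}_{j_3}$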

[Figure: diagram with the labels Genotype, Gene Expression, Functional MRI, Behavioral data (clinical, psychometric), Intermediate phenotype, Final phenotype]

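(Structured) variable selection for RGCCA

$$\underset{\mathbf{a}_1, \mathbf{a}_2, \dots, \mathbf{a}_J}{\operatorname{argmax}} \; \sum_{j \neq k}^{J} c_{jk}\, g\big(\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\big)$$

subject to

$$\mathbf{a}_j^t \mathbf{M}_j \mathbf{a}_j = 1, \quad j = 1, \dots, J$$

$$\Omega(\mathbf{a}_j) \le c_j, \quad j = 1, \dots, J$$

• LASSO: $\Omega(\mathbf{a}_j) = \|\mathbf{a}_j\|_1$

• Group LASSO: $\Omega(\mathbf{a}_j) = \sum_{g \in \mathcal{G}} \|\mathbf{a}_g\|_2$

• Fused LASSO: $\Omega(\mathbf{a}_j) = \sum_{k=1}^{p_j} |a_{jk}| + \lambda \sum_{k=2}^{p_j} |a_{jk} - a_{j,k-1}|$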
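The three penalties Ω can be written in a few lines of NumPy. This is an illustrative sketch only: the function names and the representation of groups as lists of indices are assumptions, and the fused term sums absolute differences over consecutive coefficients.

```python
import numpy as np

def lasso_penalty(a):
    """LASSO: L1 norm of the weight vector."""
    return np.sum(np.abs(a))

def group_lasso_penalty(a, groups):
    """Group LASSO: sum of the L2 norms of the sub-vectors a_g.

    `groups` is a list of index lists (hypothetical representation).
    """
    return sum(np.linalg.norm(a[idx]) for idx in groups)

def fused_lasso_penalty(a, lam):
    """Fused LASSO: L1 norm plus lam times the sum of |a_k - a_{k-1}|."""
    return np.sum(np.abs(a)) + lam * np.sum(np.abs(np.diff(a)))

a = np.array([3.0, -4.0, 0.0, 1.0])
l1 = lasso_penalty(a)                              # |3| + |-4| + |0| + |1|
grp = group_lasso_penalty(a, [[0, 1], [2, 3]])     # ||(3, -4)|| + ||(0, 1)||
fused = fused_lasso_penalty(a, lam=0.5)            # L1 + 0.5 * (7 + 4 + 1)
```

The LASSO treats coefficients independently, the group LASSO selects or discards predefined groups of variables together, and the fused LASSO additionally favors neighboring coefficients with similar values, which suits ordered variables.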

• Tenenhaus A., Philippe C., Guillemot V., Lê Cao K.-A., Grill J., Frouin V., Variable Selection for Generalized Canonical Correlation Analysis, Biostatistics, doi: 10.1093/biostatistics/kxu001, 2014.

• Löfstedt T., Hadj-Salem F., Guillemot V., Philippe C., Duchesnay E., Frouin V., and Tenenhaus A. (2014). Structured variable selection for generalized canonical correlation analysis. In: Proceedings of the 8th International Conference on Partial Least Squares and Related Methods (PLS14), Paris, France.

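Multigroup data analysis

• SETTINGS: The same set of variables is measured on individuals structured in several groups.

• OBJECTIVE: Investigate the relationships between variables within the various groups.

[Figure: group data matrices stacked row-wise, each with the same p variables (columns) and its own number of individuals n1, n2, …, nI (rows)]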

• Tenenhaus, A. and Tenenhaus, M. (2014). Regularized Generalized Canonical Correlation Analysis for multiblock or multigroup data analysis. European Journal of Operational Research, 238: 391–403.

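$$\underset{\mathbf{a}_1, \mathbf{a}_2, \dots, \mathbf{a}_I}{\operatorname{argmax}} \; \sum_{i, l,\; i \neq l}^{I} c_{il}\, g\big(\langle \mathbf{X}_i^t \mathbf{X}_i \mathbf{a}_i, \mathbf{X}_l^t \mathbf{X}_l \mathbf{a}_l \rangle\big)$$

subject to

$$(1 - \tau_i)\|\mathbf{X}_i\mathbf{a}_i\|^2 + \tau_i \|\mathbf{a}_i\|^2 = 1, \quad i = 1, \dots, I$$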
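In the multigroup setting all groups share the same p variables, so the vectors $\mathbf{X}_i^t\mathbf{X}_i\mathbf{a}_i$ all live in the same p-dimensional space and can be compared across groups. The sketch below evaluates the criterion under the assumption that the two arguments of g are combined through their inner product; the function names are hypothetical and this is not the authors' code.

```python
import numpy as np

def normalize_group_weights(X, a, tau):
    """Rescale a so that (1 - tau) * ||X a||^2 + tau * ||a||^2 = 1."""
    scale = np.sqrt((1.0 - tau) * np.sum((X @ a) ** 2) + tau * (a @ a))
    return a / scale

def multigroup_criterion(groups, weights, C, g=np.square):
    """Sum over i != l of c_il * g(<X_i^t X_i a_i, X_l^t X_l a_l>).

    Assumption: the pair of p-vectors is reduced to a scalar by an inner product.
    """
    # One p-vector per group: X_i^t (X_i a_i).
    loadings = [X.T @ (X @ a) for X, a in zip(groups, weights)]
    total = 0.0
    for i, ui in enumerate(loadings):
        for l, ul in enumerate(loadings):
            if i != l:
                total += C[i, l] * g(ui @ ul)
    return total

# Usage: three groups with different sample sizes but the same p = 8 variables.
rng = np.random.default_rng(1)
groups = [rng.standard_normal((n_i, 8)) for n_i in (30, 40, 25)]
C = np.ones((3, 3)) - np.eye(3)  # every pair of groups connected
weights = [normalize_group_weights(X, rng.standard_normal(8), tau=1.0)
           for X in groups]
value = multigroup_criterion(groups, weights, C)
```

With `tau=1.0` the constraint reduces to unit-norm weight vectors, which is the simplest case to check by hand.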

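From multiblock data …

[Figure: six blocks measured on the same n individuals: X1 SNP array (p1 variables); X2 Anatomical MRI, X3 Diffusion MRI, X4 Functional MRI, X5 PET (p2 variables each); X6 Final Phenotype (p3 variables)]

… to multiblock / multiway data

[Figure: three blocks measured on the same n individuals: X1 SNP array (p1 variables); X2 NeuroImaging, a multiway array stacking the imaging modalities (p2 variables per modality); X3 Final Phenotype (p3 variables)]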

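$$\underset{\mathbf{a}_1, \mathbf{a}_2, \dots, \mathbf{a}_J}{\operatorname{argmax}} \; \sum_{j \neq k}^{J} c_{jk}\, g\big(\operatorname{cov}(\mathbf{X}_j\mathbf{a}_j, \mathbf{X}_k\mathbf{a}_k)\big)$$

subject to the constraints

$$\mathbf{a}_j^t \mathbf{M}_j \mathbf{a}_j = 1 \quad \text{and} \quad \mathbf{a}_j = \mathbf{a}_j^K \otimes \mathbf{a}_j^J, \quad j = 1, \dots, J$$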
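The Kronecker constraint means that the weight vector of a three-way block factorizes into one weight vector per mode. The NumPy sketch below illustrates this under stated assumptions: the array layout (individuals × J variables × K modalities), the unfolding order, and the function names are illustrative choices, not taken from the slides.

```python
import numpy as np

def unfold(X):
    """Unfold an (n, J, K) array into an (n, K*J) matrix.

    Column k*J + j of the result holds X[:, j, k], which matches the
    ordering of np.kron(a_K, a_J).
    """
    n, J, K = X.shape
    return X.transpose(0, 2, 1).reshape(n, K * J)

def multiway_component(X, a_J, a_K):
    """Block component y = X_unfolded @ (a_K kron a_J)."""
    a = np.kron(a_K, a_J)  # structured weight vector of length K*J
    return unfold(X) @ a

# Usage: n = 5 individuals, J = 3 variables, K = 4 modalities.
rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3, 4))
a_J = rng.standard_normal(3)
a_K = rng.standard_normal(4)
y = multiway_component(X, a_J, a_K)

# The same component as a bilinear form: sum over j, k of X[i, j, k] * a_J[j] * a_K[k].
y_bilinear = np.einsum("ijk,j,k->i", X, a_J, a_K)
```

The structured vector has J + K free parameters instead of J × K, and each factor can be read separately as weights over variables and weights over modalities.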

• Tenenhaus A., Le Brusquet L., Regularized Generalized Canonical Correlation Analysis extended to three way data, International Conference of the ERCIM WG on Computational and Methodological Statistics, 2014.

• Tenenhaus A., Le Brusquet L., Three-way Regularized Generalized Canonical Correlation Analysis, ThRee-way methods In Chemistry And Psychology (TRICAP), 2015.

Conclusions

RGCCA for multiblock, multigroup, or multiway data is a general framework for analyzing data in their natural (but complex) structure.