Interpretable Sparse Sliced Inverse Regression for Digitized Functional Data

Victor Picheny, Rémi Servien & Nathalie Villa-Vialaneix
[email protected] — http://www.nathalievilla.org
Séminaire Institut de Mathématiques de Bordeaux, 8 April 2016
Nathalie Villa-Vialaneix | IS-SIR



Outline

1 Background and motivation

2 Presentation of SIR

3 Our proposal

4 Simulations


A typical case study: meta-model in agronomy

[Diagram: climate (daily time series: rain, temperature, ...) and plant phenotypes → Agronomic model → predictions (yield, N leaching, ...)]

Agronomic model:

based on biological and chemical knowledge;

computationally expensive to use;

useful for realistic predictions but not to understand the link between the inputs and the outputs.

Metamodeling: train a simplified, fast and interpretable model which can be used as a proxy for the agronomic model.


A first case study: SUNFLO [Casadebaig et al., 2011]

Inputs: 5 daily time series (length: one year) and 8 phenotypes for different sunflower types
Output: sunflower yield

Data: 1000 sunflower types × 190 climatic series (different places and years), i.e. n = 190 000 observations of variables in R^(5×183) × R^8

Main facts obtained from a preliminary study (R. Kpekou internship)

The study focused on the influence of the climate on the yield: 5 functional variables digitized at 183 points.

Main result: using summaries of the variables (mean, sd, ...) over several weeks, together with an automatic aggregating procedure in a random forest method, led to good predictive accuracy.


Question and mathematical framework

A functional regression problem: X: random variable (functional) & Y: random real variable

E(Y|X)?

Data: n i.i.d. observations (x_i, y_i)_{i=1,...,n}. x_i is not perfectly known but sampled at (fixed) points: x_i = (x_i(t_1), ..., x_i(t_p))ᵀ ∈ R^p. We denote by X the (n × p) matrix with rows x_1ᵀ, ..., x_nᵀ.

Question: find a model which is easily interpretable and points out relevant intervals for the prediction within the range of X.


Related works (variable selection in FDA)

LASSO / L1 regularization in linear models: [Ferraty et al., 2010, Aneiros and Vieu, 2014] (isolated evaluation points), [Matsui and Konishi, 2011] (selects elements of an expansion basis), [James et al., 2009] (sparsity on derivatives: piecewise constant predictors)

[Fraiman et al., 2015] (blinding approach usable for various problems: PCA, regression, ...)

[Gregorutti et al., 2015]: adaptation of the variable importance in random forests to groups of variables

Our proposal: a semi-parametric (not entirely linear) model which selects relevant intervals, combined with an automatic procedure to define the intervals.


Outline

1 Background and motivation

2 Presentation of SIR

3 Our proposal

4 Simulations

SIR in the multidimensional framework

SIR: a semi-parametric regression model for X ∈ R^p:

Y = F(a_1ᵀX, ..., a_dᵀX, ε)

for a_1, ..., a_d ∈ R^p (to be estimated), F: R^{d+1} → R unknown, and ε an error term, independent of X.

Standard assumption for SIR:

Y ⊥ X | P_A(X)

in which A is the so-called EDR space, spanned by (a_k)_{k=1,...,d}.

Estimation

Equivalence between SIR and eigendecomposition

A is included in the space spanned by the first d Σ-orthogonal eigenvectors of the generalized eigendecomposition problem Γa = λΣa, with Σ = Cov(X) and Γ = Cov(E(X|Y)).

Estimation (when n > p):

compute X̄ = (1/n) ∑_{i=1}^n x_i and Σ̂ = (1/n) Xᵀ(X − 1_n X̄ᵀ)

split the range of Y into H different slices τ_1, ..., τ_H and estimate Ê(X|Y) = ((1/n_h) ∑_{i: y_i ∈ τ_h} x_i)_{h=1,...,H}, with n_h = |{i : y_i ∈ τ_h}|

Γ̂ = Ê(X|Y)ᵀ D Ê(X|Y) with D = Diag(n_1/n, ..., n_H/n)

solving the eigendecomposition problem Γ̂a = λΣ̂a gives the eigenvectors a_1, ..., a_d ⇒ A = (a_1, ..., a_d), a (p × d)-matrix

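For concreteness, the estimation steps above can be sketched in a few lines of NumPy/SciPy. This is my own illustration, not code from the authors: slices are built from (roughly) equal counts of the sorted responses, and the generalized eigenproblem Γ̂a = λΣ̂a is solved directly.

```python
import numpy as np
from scipy.linalg import eigh

def sir(X, y, H=10, d=1):
    """Basic SIR estimate of the EDR directions (sketch of the slide's steps; needs n > p)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                     # center the design matrix
    Sigma = Xc.T @ Xc / n                       # empirical covariance Σ̂
    # Split the range of Y into H slices of roughly equal counts
    slices = np.array_split(np.argsort(y), H)
    means = np.stack([Xc[idx].mean(axis=0) for idx in slices])  # slice means Ê(X|Y)
    D = np.diag([len(idx) / n for idx in slices])               # slice proportions n_h/n
    Gamma = means.T @ D @ means                                 # between-slice covariance Γ̂
    # Generalized eigenproblem Γ̂ a = λ Σ̂ a; keep the d leading eigenvectors
    _, eigvec = eigh(Gamma, Sigma)              # eigenvalues in ascending order
    return eigvec[:, ::-1][:, :d]               # columns span the estimated EDR space
```

On data generated from a single-index model, the leading eigenvector lines up with the true direction up to scaling.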

Equivalent formulations

SIR as a regression problem: [Li and Yin, 2008] shows that SIR is equivalent to the (double) minimization of

E(A, C) = ∑_{h=1}^H p_h ‖(X̄_h − X̄) − Σ̂ A C_h‖²

for X̄_h = (1/n_h) ∑_{i: y_i ∈ τ_h} x_i, A a (p × d)-matrix and C_h vectors in R^d.

Rk: given A, C is obtained as the solution of an ordinary least squares problem.

SIR as a canonical correlation problem: [Li and Nachtsheim, 2008] shows that SIR rewrites as the double optimization problem max_{a_j, φ} Cor(φ(Y), a_jᵀX), where φ is any function R → R and the (a_j)_j are Σ-orthonormal.

Rk: the solution is shown to satisfy φ(y) = a_jᵀ E(X|Y = y), and a_j is also obtained as the solution of the mean square error problem:

min_{a_j} E(φ(Y) − a_jᵀX)²


SIR in large dimensions: problem

In large dimensions (or in Functional Data Analysis), n < p, so Σ̂ is ill-conditioned and does not have an inverse ⇒ Z = (X − 1_n X̄ᵀ)Σ̂^{−1/2} cannot be computed.

Different solutions have been proposed in the literature, based on:

prior dimension reduction (e.g., PCA) [Ferré and Yao, 2003] (in the framework of FDA)

regularization (ridge, ...) [Li and Yin, 2008, Bernard-Michel et al., 2008]

sparse SIR [Li and Yin, 2008, Li and Nachtsheim, 2008, Ni et al., 2005]


SIR in large dimensions: ridge penalty / L2-regularization of Σ

Following [Li and Yin, 2008], which shows that SIR is equivalent to the minimization of

E_2(A, C) = ∑_{h=1}^H p_h ‖(X̄_h − X̄) − Σ̂ A C_h‖² + µ_2 ∑_{h=1}^H p_h ‖A C_h‖²,

[Bernard-Michel et al., 2008] propose to penalize with a ridge penalty in a high-dimensional setting.

They also show that this problem is equivalent to finding the eigenvectors of the generalized eigenvalue problem

Γ̂a = λ(Σ̂ + µ_2 I_p)a.

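A hedged sketch of this regularized eigenproblem (again my own illustration, not the authors' implementation): the only change with respect to plain SIR is the µ_2 I_p term on the right-hand side, which makes the problem solvable even when n < p.

```python
import numpy as np
from scipy.linalg import eigh

def ridge_sir(X, y, H=10, d=1, mu2=1.0):
    """Ridge-regularized SIR: solve Γ̂ a = λ (Σ̂ + µ2 I_p) a (sketch, not the authors' code)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / n
    slices = np.array_split(np.argsort(y), H)
    means = np.stack([Xc[idx].mean(axis=0) for idx in slices])
    D = np.diag([len(idx) / n for idx in slices])
    Gamma = means.T @ D @ means
    # The ridge term keeps the right-hand-side matrix positive definite even when n < p
    _, eigvec = eigh(Gamma, Sigma + mu2 * np.eye(p))
    return eigvec[:, ::-1][:, :d]
```

Unlike the plain version, this runs in the n < p regime that motivates the functional setting.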

SIR in large dimensions: sparse versions

Specific issue in introducing sparsity in SIR: sparsity on a multiple-index model. Most authors use shrinkage approaches.

First version: sparse penalization of the ridge solution. If (Â, Ĉ) are the solutions of the ridge SIR described on the previous slide, [Ni et al., 2005, Li and Yin, 2008] propose to shrink this solution by minimizing

E_{s,1}(α) = ∑_{h=1}^H p_h ‖(X̄_h − X̄) − Σ̂ Diag(α) Â Ĉ_h‖² + µ_1 ‖α‖_{L1}

(regression formulation of SIR).

Second version: [Li and Nachtsheim, 2008] derive the sparse optimization problem from the correlation formulation of SIR:

min_{a_j^s} ∑_{i=1}^n [P_{a_j}(X̄|y_i) − (a_j^s)ᵀ x_i]² + µ_{1,j} ‖a_j^s‖_{L1},

in which P_{a_j} is the projection of Ê(X|Y = y_i) = X̄_h onto the space spanned by the solution of the ridge problem.
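To make this shrinkage step concrete, here is a rough sketch (my own, with assumed equal-count slices): for one direction a obtained from a ridge SIR fit, the projections of the slice means onto a are regressed on X with an L1 penalty, which zeroes out coordinates of the direction.

```python
import numpy as np
from sklearn.linear_model import Lasso

def shrink_direction(X, y, a, H=10, alpha=0.01):
    """Sparse shrinkage of one SIR direction (sketch of the correlation-formulation step)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    slices = np.array_split(np.argsort(y), H)
    # Response: for each observation i, the projection aᵀ X̄_h of its slice mean onto a
    response = np.empty(n)
    for idx in slices:
        response[idx] = Xc[idx].mean(axis=0) @ a
    # L1-penalized regression of the projections on the predictors
    lasso = Lasso(alpha=alpha, fit_intercept=False).fit(Xc, response)
    return lasso.coef_          # sparse estimate a^s of the direction
```

With a single-index linear model, the shrunk direction concentrates its mass on the truly active coordinate.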

Characteristics of the different approaches and possible extensions

                      [Li and Yin, 2008]       [Li and Nachtsheim, 2008]
sparsity on           shrinkage coefficients   estimates
nb optimization pbs   1                        d
sparsity              common to all dims       specific to each dim

Extension to block-sparse SIR (like in PCA)?


Outline

1 Background and motivation

2 Presentation of SIR

3 Our proposal

4 Simulations

IS-SIR: a two-step approach

Background: back to the functional setting, we suppose that t_1, ..., t_p are split into D intervals I_1, ..., I_D.

First step: solve the ridge problem on the digitized functions (viewed as high-dimensional vectors) to obtain Â and Ĉ:

min_{A,C} ∑_{h=1}^H p_h ‖(X̄_h − X̄) − Σ̂ A C_h‖² + µ_2 ∑_{h=1}^H p_h ‖A C_h‖²

Second step: sparse shrinkage using the intervals. If P_A(Ê(X|Y = y_i)) = (X̄_h − X̄)ᵀ Â for the h such that y_i ∈ τ_h, and if P_i = (P_i^1, ..., P_i^d)ᵀ and P^j = (P_1^j, ..., P_n^j)ᵀ, we solve:

argmin_{α ∈ R^D} ∑_{j=1}^d ‖P^j − (X ∆(â_j)) α‖² + µ_1 ‖α‖_{L1}

with ∆(â_j) the (p × D)-matrix such that ∆_{kl}(â_j) = â_{jl} if t_l ∈ I_k and 0 otherwise.

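The second step can be sketched as follows (my own illustration, not the authors' code; `intervals` is an assumed list of index arrays, one per interval I_k, and a single direction is shown): the matrix X∆(a) collapses the p digitization points into D interval-level columns, on which a LASSO is fitted.

```python
import numpy as np
from sklearn.linear_model import Lasso

def interval_design(X, a, intervals):
    """Build X @ ∆(a): column k sums a_l * X[:, l] over the points t_l in interval I_k."""
    cols = [X[:, idx] @ a[idx] for idx in intervals]   # one column per interval
    return np.stack(cols, axis=1)                      # shape (n, D)

def interval_shrinkage(X, P, a, intervals, alpha=0.01):
    """Sparse interval coefficients α: min ‖P − (X∆(a))α‖² + µ1‖α‖_{L1} (sketch)."""
    Z = interval_design(X, a, intervals)
    return Lasso(alpha=alpha, fit_intercept=False).fit(Z, P).coef_
```

Zero entries of α then flag entire intervals as irrelevant, rather than isolated points.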

IS-SIR: characteristics

uses the approach based on the correlation formulation (because the dimensionality of the optimization problem is smaller);

uses a shrinkage approach and optimizes the shrinkage coefficients in a single optimization problem;

handles the functional setting by penalizing entire intervals and not just isolated points.

Parameter estimation

H (number of slices): usually, SIR is known to be not very sensitive to the number of slices (> d + 1). We took H = 10 (i.e., 10/30 observations per slice);

µ_2 and d (ridge estimate Â):

L-fold CV for µ_2 (for a d_0 large enough). Note that GCV as described in [Li and Yin, 2008] cannot be used, since the current version of the L2 penalty involves an estimate of Σ^{−1}.

using again L-fold CV, ∀ d = 1, ..., d_0, an estimate of

R(d) = d − E[Tr(Π_d Π̂_d)],

in which Π_d and Π̂_d are the projectors onto the first d dimensions of the EDR space and of its estimate, is derived similarly as in [Liquet and Saracco, 2012]. The evolution of R̂(d) versus d is studied to select a relevant d.

µ_1 (LASSO): glmnet is used, in which µ_1 is selected by CV along the regularization path.

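The trace term in R(d) can be computed from two bases of the subspaces; a minimal sketch, assuming plain orthonormal projectors (the Σ-orthonormality of the slides is ignored for simplicity):

```python
import numpy as np

def projector(B):
    """Orthogonal projector onto the column space of B."""
    Q, _ = np.linalg.qr(B)
    return Q @ Q.T

def proximity(B_true, B_hat):
    """Tr(Π_d Π̂_d): equals d when the two d-dimensional subspaces coincide, 0 when orthogonal."""
    return np.trace(projector(B_true) @ projector(B_hat))
```

R(d) is then d minus the (estimated) expectation of this quantity, so it vanishes when the estimated subspace matches the true one.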

An automatic approach to define intervals

1 Initial state: ∀ k = 1, ..., p, τ_k = {t_k}

2 Iterate:

along the regularization path, select three values for µ_1: P% of the coefficients are zero, P% of the coefficients are non-zero, best GCV;

define D− ("strong zeros") and D+ ("strong non-zeros");

merge consecutive "strong zeros" (resp. "strong non-zeros"), or "strong zeros" (resp. "strong non-zeros") separated by a small number of intervals of undetermined type.

Until no more iterations can be performed.

3 Output: a collection of models (the first with p intervals, the last with 1), M*_D (optimal for GCV) and the corresponding GCV_D versus D (number of intervals).

Final solution: minimize GCV_D over D.

Outline

1 Background and motivation

2 Presentation of SIR

3 Our proposal

4 Simulations

Simulation framework

Data generated with:

Y = ∑_{j=1}^d log|⟨X, a_j⟩|, with X(t) = Z(t) + ε, in which Z is a Gaussian process with mean µ(t) = −5 + 4t − 4t² and the Matérn 3/2 covariance function with parameters σ = 0.1 and θ = 0.2/√3, and ε is a centered Gaussian variable independent of Z, with standard deviation 0.1;

a_j(t) = sin(t(2 + j)π/2 − (j − 1)π/3) 𝟙_{I_j}(t);

two models: for (M1), d = 1 and I_1 = [0.2, 0.4]; for (M2), d = 3 and I_1 = [0, 0.1], I_2 = [0.5, 0.65], I_3 = [0.65, 0.78].
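A hedged sketch of this data-generating process for model (M1), using scikit-learn's Matern kernel; the grid size, seed, and the small guard inside the log are my own choices, and the inner product ⟨X, a_1⟩ is approximated by a plain dot product on the grid:

```python
import numpy as np
from sklearn.gaussian_process.kernels import Matern

def simulate(n=100, p=200, seed=0):
    """Simulate (x_i, y_i) roughly as on the slide, for model (M1) with I_1 = [0.2, 0.4]."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, p)
    mu = -5 + 4 * t - 4 * t ** 2                                  # mean function µ(t)
    # Matérn 3/2 covariance with σ = 0.1 and length scale θ = 0.2/√3
    K = 0.1 ** 2 * Matern(length_scale=0.2 / np.sqrt(3), nu=1.5)(t[:, None])
    Z = rng.multivariate_normal(mu, K, size=n)                    # GP sample paths
    X = Z + 0.1 * rng.normal(size=(n, p))                         # digitization noise ε
    # Direction a_1(t) = sin(3πt/2) supported on I_1 = [0.2, 0.4]
    a = np.sin(t * 3 * np.pi / 2) * ((t >= 0.2) & (t <= 0.4))
    y = np.log(np.abs(X @ a) + 1e-12)                             # guard avoids log(0)
    return X, y, a
```

This is only meant to reproduce the shape of the experiment, not the exact settings used for the figures.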


Ridge step (model M1)

Selection of µ_2: µ̂_2 = 1

Selection of d: d̂ = 1

Definition of the intervals

[Figures: estimated direction a_1 over t ∈ [0, 1], and CV error versus the number of intervals, for D = 200 (initial state), D = 147 (retained solution), D = 43 and D = 5.]

Conclusion

IS-SIR:

sparse dimension-reduction model adapted to the functional framework;

fully automated definition of relevant intervals in the range of the predictors.

Perspectives:

application to real data;

block-wise sparse SIR?


Aneiros, G. and Vieu, P. (2014). Variable selection in infinite-dimensional problems. Statistics and Probability Letters, 94:12–20.

Bernard-Michel, C., Gardes, L., and Girard, S. (2008). A note on sliced inverse regression with regularizations. Biometrics, 64(3):982–986.

Casadebaig, P., Guilioni, L., Lecoeur, J., Christophe, A., Champolivier, L., and Debaeke, P. (2011). SUNFLO, a model to simulate genotype-specific performance of the sunflower crop in contrasting environments. Agricultural and Forest Meteorology, 151(2):163–178.

Ferraty, F., Hall, P., and Vieu, P. (2010). Most-predictive design points for functional data predictors. Biometrika, 97(4):807–824.

Ferré, L. and Yao, A. (2003). Functional sliced inverse regression analysis. Statistics, 37(6):475–488.

Fraiman, R., Gimenez, Y., and Svarc, M. (2015). Feature selection for functional data. Journal of Multivariate Analysis. In press.

Gregorutti, B., Michel, B., and Saint-Pierre, P. (2015). Grouped variable importance with random forests and application to multiple functional data analysis. Computational Statistics and Data Analysis, 90:15–35.

James, G., Wang, J., and Zhu, J. (2009). Functional linear regression that's interpretable. Annals of Statistics, 37(5A):2083–2108.

Li, L. and Nachtsheim, C. (2008). Sparse sliced inverse regression. Technometrics, 48(4):503–510.

Li, L. and Yin, X. (2008). Sliced inverse regression with regularizations. Biometrics, 64:124–131.

Liquet, B. and Saracco, J. (2012). A graphical tool for selecting the number of slices and the dimension of the model in SIR and SAVE approaches. Computational Statistics, 27(1):103–125.

Matsui, H. and Konishi, S. (2011). Variable selection for functional regression models via the L1 regularization. Computational Statistics and Data Analysis, 55(12):3304–3310.

Ni, L., Cook, D., and Tsai, C. (2005). A note on shrinkage sliced inverse regression. Biometrika, 92(1):242–247.