Stat Comput (2014) 24:725–738. DOI 10.1007/s11222-013-9398-0

Random effects selection in generalized linear mixed models via shrinkage penalty function

Jianxin Pan · Chao Huang

Received: 5 September 2012 / Accepted: 6 April 2013 / Published online: 8 May 2013. © Springer Science+Business Media New York 2013

Abstract In this paper, we discuss the selection of random effects within the framework of generalized linear mixed models (GLMMs). Based on a reparametrization of the covariance matrix of random effects in terms of the modified Cholesky decomposition, we propose to add a shrinkage penalty term to the penalized quasi-likelihood (PQL) function of the variance components for selecting effective random effects. The shrinkage penalty term is taken as a function of the variances of the random effects, motivated by the fact that if a variance is zero then the corresponding variable is no longer random (with probability one). The proposed method takes advantage of the convenient computation of the PQL estimation and the appealing properties of certain shrinkage penalty functions such as LASSO and SCAD. We propose a backfitting algorithm to estimate the fixed effects and variance components in GLMMs, which selects effective random effects simultaneously. Simulation studies show that the proposed approach performs well in selecting effective random effects in GLMMs, and a real data analysis using the proposed approach is also presented.

Keywords Generalized linear mixed models · Modified Cholesky decomposition · Penalized quasi-likelihood · Penalty function · Random effects

J. Pan (✉) · C. Huang
School of Mathematics, The University of Manchester, Oxford Road, Manchester M13 9PL, United Kingdom
e-mail: [email protected]

1 Introduction

In statistical modeling, many potential covariates may be included in statistical models at an initial stage, creating a high-dimensional problem when making statistical inferences based on the models. However, some of the covariates may not necessarily contribute to the models under consideration. Hence, selecting significant covariates from a number of candidates becomes extremely important. Early work on variable selection includes full screening searches such as Mallows' $C_p$, AIC and BIC, and optimal subset searches such as backward, forward and stepwise selection. It is well known that those methods are either computationally intensive or statistically unreliable (Breiman 1996).

Variable selection via penalty functions was developed in the last decade. In those methods, certain penalty functions of the regression coefficients are added to the residual sum of squares or subtracted from the log-likelihood function; minimization or maximization of the penalized objective function with respect to the regression coefficients then leads to penalized least squares or penalized likelihood estimates, which helps to remove unnecessary or insignificant covariates from the models. Ridge regression with the $L_2$ penalty (Hoerl and Kennard 1970a, 1970b) and bridge regression with the $L_q$ ($q \ge 0$) penalty (Frank and Friedman 1993) were discussed as such examples. The least absolute shrinkage and selection operator (LASSO), which shrinks some coefficients and sets others to 0 with the $L_1$ penalty, was proposed by Tibshirani (1996); this penalty function retains good features of both subset selection and ridge regression. Zou (2006) proposed the adaptive LASSO, in which adaptive weights are used in the LASSO penalty in order to improve the consistency of LASSO. Another penalty function, the smoothly clipped absolute deviation (SCAD) penalty, was proposed by Fan and Li (2001), providing a continuous, differentiable penalty function; the SCAD can be viewed as a modification of LASSO and enjoys the properties of continuity, sparsity and unbiasedness. With such


penalty functions, non-informative or insignificant variables are removed from the model by shrinking the associated regression coefficients towards zero, while significant variables are retained in the model with little or no shrinkage. Compared to other variable selection methods, shrinkage penalty methods have the advantage of selecting important covariates and estimating the associated parameters simultaneously, saving considerable computing time.

Generalized linear mixed models (GLMMs) were proposed for continuous and discrete longitudinal/clustered data analysis and have become a standard tool for handling heterogeneous and clustered data. GLMMs incorporate random effects into the linear predictor of generalized linear models, so that heterogeneity across subjects can be taken into account. In the literature, many authors, including Stiratelli et al. (1984), Zeger et al. (1988), Schall (1991), Breslow and Clayton (1993) and Pan and Thompson (2007) among others, have contributed to the modeling development. The most commonly used estimation method is the penalized quasi-likelihood (PQL) estimation proposed by Breslow and Clayton (1993). The key idea of the PQL method is to use Laplace's approximation to approximate the integrated likelihood. The main advantage of the PQL method lies in that it can be easily implemented by iteratively fitting a linear mixed model to a modified working response (Breslow and Clayton 1993).

Variable selection in GLMMs has been studied by many authors in recent years. Among those, Schelldorfer and Bühlmann (2011) and Groll and Tutz (2013) considered the selection of fixed effects in GLMMs using $L_1$-penalization, and Groll and Tutz (2012) applied boosting techniques to the selection of fixed effects in GLMMs. Ibrahim et al. (2010) proposed a variable selection method within the framework of a general class of mixed-effects models using a penalized likelihood function with SCAD and adaptive LASSO penalties, where a model selection criterion called the IC$_Q$ criterion is used to select the tuning parameter. In contrast, the selection of random effects in GLMMs has received relatively little attention. Note that the selection of effective random effects is very important, because redundant random effects may lead to a singular covariance matrix of the random effects and cause unstable computation in parameter estimation. Chen and Dunson (2003) and Kinney and Dunson (2007) proposed random effects selection methods based on hierarchical Bayesian models, where random effects are selected using a Markov chain Monte Carlo (MCMC) algorithm. Fahrmeir et al. (2010) gave a detailed review of Bayesian approaches for the selection of random effects. When GLMMs reduce to linear mixed models (LMMs), Bondell et al. (2010) proposed a variable selection method for both fixed and random effects using the EM algorithm. For LMMs, very recently Ahn et al. (2012) and Lin et al. (2012) proposed random effects selection approaches using a moment-based method and a two-stage method, respectively. Fan and Li (2012) proposed a group variable selection strategy to simultaneously select and estimate random effects, using a proxy matrix to replace the unknown covariance matrix of the random effects. Note that their method is restricted to LMMs and the penalties are taken as a function of the random effects.

In this paper, we focus on the selection of random effects within the framework of GLMMs. Based on a reparametrization of the covariance matrix of the random effects in terms of a modified Cholesky decomposition, we propose to add a shrinkage penalty term to the PQL function of the variance components for selecting effective random effects. The shrinkage penalty term is taken as a function of the variances of the random effects, motivated by the fact that if a variance equals zero then the corresponding variable is no longer random, in the sense of probability one, and can thus be treated as a constant. The proposed method takes advantage of the convenient computation of the PQL estimation and the appealing properties of certain shrinkage penalty functions such as LASSO and SCAD. We then propose a backfitting algorithm to estimate the fixed effects and variance components in GLMMs, which also selects effective random effects simultaneously. The organization of this paper is as follows. In Sect. 2, the methodology and the algorithm for selecting effective random effects are discussed. In Sect. 3, simulation studies are carried out to assess the performance of the proposed method. A real data analysis is provided in Sect. 4. Further discussion and concluding remarks are given in Sect. 5.

2 Methodology and algorithm

Let $y_{ij}$ be the $j$th of $n_i$ response measurements on the $i$th of $n$ subjects. Assume $\mu_{ij} = E(y_{ij}|b_i)$ is the conditional expectation of the measurement $y_{ij}$, given the $q$-dimensional random effects $b_i$. GLMMs can be defined by
$$g(\mu_{ij}) = \eta_{ij} = x_{ij}^T\beta + z_{ij}^T b_i, \qquad \mathrm{var}(y_{ij}|b_i) = a(\phi)v(\mu_{ij}), \eqno(2.1)$$
where $\beta$ is the $p$-dimensional vector of fixed effects, $x_{ij}$ and $z_{ij}$ are covariates corresponding to $\beta$ and $b_i$, respectively, $g(\cdot)$ is a known monotone and differentiable link function, $v(\cdot)$ and $a(\cdot)$ are known functions specifying the variance, and $\phi$ is a nuisance dispersion parameter.

The random effects $b_i$ are assumed to follow a normal distribution $b_i \sim N(0, D)$, where $D = (d_{ij})_{1\le i,j\le q}$ is a $(q \times q)$ covariance matrix. Based on the modified Cholesky decomposition (Chen and Dunson 2003), the covariance matrix $D$ can be uniquely decomposed into $D = \Lambda\Gamma\Gamma^T\Lambda$, where $\Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_q)$ is a diagonal matrix with positive elements $\lambda_l$ proportional to the standard deviations of the random effects $b_i$, and $\Gamma = (\gamma_{lr})_{1\le l,r\le q}$ is a lower triangular matrix with 1's as its diagonal elements, that is, $\gamma_{ll} = 1$ and $\gamma_{lr} = 0$ for $l = 1,\ldots,q$; $r = l+1,\ldots,q$. The matrix $\Gamma$ determines the correlation among the components of the random effects $b_i$. Denote $\theta_1 = (\lambda_1,\ldots,\lambda_q)^T$, $\theta_2 = (\gamma_{21};\gamma_{31},\gamma_{32};\ldots;\gamma_{q1},\ldots,\gamma_{q(q-1)})^T$ and $\theta = (\theta_1^T, \theta_2^T, \phi)^T$.
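The decomposition above is straightforward to compute. The following minimal NumPy sketch (our illustration, not code from the paper) recovers $\Lambda$ and $\Gamma$ from a positive definite $D$ via the standard Cholesky factor and rebuilds $D = \Lambda\Gamma\Gamma^T\Lambda$; it assumes a strictly positive definite $D$, since $\lambda_l = 0$ makes $D$ singular, which is exactly the degenerate case the penalty below is designed to detect.

```python
import numpy as np

def modified_cholesky(D):
    """Decompose D = Lam @ Gam @ Gam.T @ Lam, with Lam = diag(lam)
    positive and Gam unit lower triangular, as in Sect. 2."""
    L = np.linalg.cholesky(D)   # D = L L^T with L lower triangular
    lam = np.diag(L).copy()     # lambda_l = L[l, l], since Gam has unit diagonal
    Gam = L / lam[:, None]      # Gam = Lam^{-1} L, rows rescaled to unit diagonal
    return lam, Gam

D = np.array([[3.0, 0.5, 0.0],
              [0.5, 2.0, 0.3],
              [0.0, 0.3, 1.0]])
lam, Gam = modified_cholesky(D)
Lam = np.diag(lam)
assert np.allclose(Lam @ Gam @ Gam.T @ Lam, D)  # reconstruction check
```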

Conditionally on the random effects $b_i$, we denote the density function of the response $y_{ij}$ by $f(y_{ij}|b_i;\beta,\phi)$. Let $y_i = (y_{i1}, y_{i2}, \ldots, y_{in_i})^T$ for $i = 1,\ldots,n$. The marginal likelihood function $L(\beta,\theta)$, which integrates the unobserved random effects out of the joint likelihood function of $y_i$ and $b_i$, can be expressed as
$$L(\beta,\theta) = \prod_{i=1}^n f(y_i;\beta,\theta) = \prod_{i=1}^n \int_{b_i}\Big\{\prod_{j=1}^{n_i} f(y_{ij}|b_i;\beta,\phi)\Big\} f(b_i;\theta_1,\theta_2)\,db_i,$$
where $f(b_i;\theta_1,\theta_2) = (2\pi)^{-q/2}|D|^{-1/2}\exp\{-\tfrac12 b_i^T D^{-1} b_i\}$. Accordingly, the log-likelihood $\ell(\beta,\theta)$ has the form
$$\ell(\beta,\theta) = \sum_{i=1}^n \log f(y_i;\beta,\theta) = \sum_{i=1}^n \log\int_{b_i}\Big\{\prod_{j=1}^{n_i} f(y_{ij}|b_i;\beta,\phi)\Big\} f(b_i;\theta_1,\theta_2)\,db_i.$$

The integral above may have a closed form when the conditional distribution of $y_{ij}$ given $b_i$ is normal. In general, however, this likelihood function is analytically intractable due to the complex form of $\prod_{j=1}^{n_i} f(y_{ij}|b_i;\beta,\phi)$. As an approximation to $\ell(\beta,\theta)$, the PQL approximates the log-likelihood through
$$\ell(\beta,\theta) \approx \sum_{i=1}^n\sum_{j=1}^{n_i}\log f(y_{ij}|b_i;\beta,\phi) - \frac12\sum_{i=1}^n b_i^T D^{-1} b_i. \eqno(2.2)$$

Then the PQL estimation can be made by iterating the two steps below. First, for given variance components $\theta$, $\ell(\beta,\theta)$ is maximized with respect to $\beta$ and $b_i$, so that the estimates $(\hat\beta, \hat b_i) = (\hat\beta(\theta), \hat b_i(\theta))$ can be obtained in explicit form. In fact, let $\eta_i = (\eta_{i1},\ldots,\eta_{in_i})^T$, $\mu_i = (\mu_{i1},\ldots,\mu_{in_i})^T$ and $g'(\mu_i) = (g'(\mu_{i1}),\ldots,g'(\mu_{in_i}))^T$. Then the estimating equations for $\beta$ and $b_i$ are given by
$$\sum_{i=1}^n X_i^T W_i (Y_i - \eta_i) = 0 \quad\text{and}\quad Z_i^T W_i (Y_i - \eta_i) = D^{-1} b_i,$$
where $X_i = (x_{i1},\ldots,x_{in_i})^T$, $Z_i = (z_{i1},\ldots,z_{in_i})^T$, $Y_i = \eta_i + (y_i - \mu_i)g'(\mu_i)$ is the working response, and $W_i = \mathrm{diag}(w_{i1},\ldots,w_{in_i})$ with $w_{ij} = \{a(\phi)v(\mu_{ij})g'(\mu_{ij})^2\}^{-1}$ is the weight matrix. The solutions to the estimating equations are then given by
$$\hat\beta = \Big(\sum_{i=1}^n X_i^T V_i^{-1} X_i\Big)^{-1}\Big(\sum_{i=1}^n X_i^T V_i^{-1} Y_i\Big), \eqno(2.3)$$
$$\hat b_i = D Z_i^T V_i^{-1}\big(Y_i - X_i\hat\beta\big), \eqno(2.4)$$
where $V_i = W_i^{-1} + Z_i D Z_i^T$. Therefore, given a starting value of $(\beta, \theta)$, the above formulae can be used iteratively to update the estimate of the fixed effects $\beta$ and the prediction of the random effects $b_i$.
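For fixed variance components, one inner-loop update of $\hat\beta$ and $\hat b_i$ via (2.3) and (2.4) can be sketched as follows. This is our own illustrative NumPy code, with `Xs`, `Zs`, `Ys` and `Ws` assumed to hold the per-subject $X_i$, $Z_i$, working responses $Y_i$ and weight matrices $W_i$.

```python
import numpy as np

def pql_update(Xs, Zs, Ys, Ws, D):
    """One inner PQL update: beta from (2.3), then b_i from (2.4),
    with V_i = W_i^{-1} + Z_i D Z_i^T."""
    Vinvs = [np.linalg.inv(np.linalg.inv(W) + Z @ D @ Z.T)
             for W, Z in zip(Ws, Zs)]
    A = sum(X.T @ Vi @ X for X, Vi in zip(Xs, Vinvs))   # sum_i X_i' V_i^{-1} X_i
    c = sum(X.T @ Vi @ Y for X, Vi, Y in zip(Xs, Vinvs, Ys))
    beta = np.linalg.solve(A, c)                        # Eq. (2.3)
    bs = [D @ Z.T @ Vi @ (Y - X @ beta)                 # Eq. (2.4)
          for X, Z, Y, Vi in zip(Xs, Zs, Ys, Vinvs)]
    return beta, bs
```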

Second, once the estimate of $\beta$ and the prediction of $b_i$ are obtained, the log-likelihood function can be further approximated by
$$\ell_2(\beta,\theta) = -\frac12\sum_{i=1}^n\log|V_i| - \frac12\sum_{i=1}^n(Y_i - X_i\beta)^T V_i^{-1}(Y_i - X_i\beta)$$
(Breslow and Clayton 1993). An alternative approximation, which corresponds to the restricted maximum likelihood (REML), is given by
$$\ell_{2R}(\beta,\theta) = -\frac12\sum_{i=1}^n\log|V_i| - \frac12\sum_{i=1}^n\log\big|X_i^T V_i^{-1} X_i\big| - \frac12\sum_{i=1}^n(Y_i - X_i\beta)^T V_i^{-1}(Y_i - X_i\beta),$$
aiming to reduce the bias of the estimates of the variance components $\theta$ (Patterson and Thompson 1971). Maximization of $\ell_2(\beta,\theta)$ or $\ell_{2R}(\beta,\theta)$ with respect to $\theta$ leads to the ML and REML estimate of $\theta$, respectively. The estimates of $\beta$ and $\theta$ are then iterated until convergence, yielding the PQL estimates of $\beta$ and $\theta$. See Breslow and Clayton (1993) for more details.

Consider the $l$th diagonal entry of $\Lambda$ in $D = \Lambda\Gamma\Gamma^T\Lambda$, say $\lambda_l$. It is obvious that if $\lambda_l = 0$ then $d_{ll} = 0$, so that all the entries of the $l$th row and the $l$th column of $D$ are equal to zero. In other words, the $l$th component of the random effects $b_i$ becomes ineffective as long as $d_{ll} = 0$, and it can then be removed from $b_i$. In the spirit of penalty-based variable selection methods, we propose to impose penalties on the $\lambda_l$ ($l = 1,\ldots,q$) in either $\ell_2(\beta,\theta)$ or $\ell_{2R}(\beta,\theta)$ when estimating the variance components $\theta$, so as to shrink the variances of ineffective random effects towards zero. In other words, we maximize the penalized log-likelihood function
$$p\ell_2(\beta,\theta) = \ell_2(\beta,\theta) - n\,p_\psi(|\theta_1|) \eqno(2.5)$$
or its REML version
$$p\ell_{2R}(\beta,\theta) = \ell_{2R}(\beta,\theta) - n\,p_\psi(|\theta_1|) \eqno(2.6)$$
with respect to $\theta$, so that ineffective random effects can be automatically removed from $b_i$. In (2.5) and (2.6), $p_\psi(|\theta_1|) = \sum_{l=1}^q p_\psi(|\lambda_l|)$ is the penalty function for $\lambda_1,\ldots,\lambda_q$, with tuning parameter $\psi$; it can be chosen as either the LASSO or the SCAD penalty. The LASSO penalty is given by $p_\psi(|\lambda_l|) = \psi|\lambda_l|$. The SCAD penalty is defined through its derivative,
$$p'_\psi(|\lambda_l|) = \psi\Big\{I\big(|\lambda_l| \le \psi\big) + \frac{(a\psi - |\lambda_l|)_+}{(a-1)\psi}\,I\big(|\lambda_l| > \psi\big)\Big\}$$
for some $a > 2$ and $|\lambda_l| > 0$, where $I(\cdot)$ is the indicator function and $(x)_+ = x$ if $x > 0$ and $(x)_+ = 0$ otherwise. Note that in (2.5) and (2.6) a common tuning parameter $\psi$ is applied to all of $\lambda_1,\ldots,\lambda_q$, implying that the random effects are all on a comparable scale; otherwise, the random effects may need to be standardized first. We also point out that the penalty function in (2.5) and (2.6) is multiplied by the factor $n$ because there are $n$ independent subjects in total.
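What the estimation below actually uses are the derivatives $p'_\psi(|\lambda_l|)$. A small sketch of the LASSO and SCAD derivatives (our code, with $a = 3.7$ as used later in Sect. 3):

```python
import numpy as np

def dp_lasso(lam_abs, psi):
    """Derivative of the LASSO penalty p_psi(|lam|) = psi * |lam|."""
    return psi * np.ones_like(lam_abs)

def dp_scad(lam_abs, psi, a=3.7):
    """SCAD derivative: constant psi on [0, psi], linearly decaying on
    (psi, a*psi], zero beyond a*psi, matching the formula above."""
    return psi * np.where(lam_abs <= psi, 1.0,
                          np.maximum(a * psi - lam_abs, 0.0) / ((a - 1.0) * psi))
```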

In what follows, we discuss how to estimate the variance components $\theta = (\theta_1^T, \theta_2^T, \phi)^T$ based on the penalized log-likelihood function $p\ell_2(\beta,\theta)$ or $p\ell_{2R}(\beta,\theta)$. We first assume that $\phi$ is known. Maximizing $p\ell_2(\beta,\theta)$ or $p\ell_{2R}(\beta,\theta)$ with respect to $\theta_1$ and $\theta_2$ then leads to the PQL estimates $\hat\theta_1$ and $\hat\theta_2$ of $\theta_1$ and $\theta_2$, respectively. To obtain the estimate of $\theta_1$, we first fix $\theta_2$; in this case $p\ell_2(\beta,\theta)$ in (2.5) or $p\ell_{2R}(\beta,\theta)$ in (2.6) is simply a function of $\theta_1$ only. We then use the Newton-Raphson algorithm to calculate the estimate $\hat\theta_1$. Note that the penalty function $p_\psi(\cdot)$ may not always have continuous second-order derivatives; we therefore propose to replace the penalty function with its local quadratic approximation (Fan and Li 2001). The details are given below. Let $\theta_1^{(k)}$ and $\theta_1^{(k-1)}$ be the $k$th and $(k-1)$th iteration estimates of $\theta_1$. Given $\beta$ and $\theta_2$, the estimate of $\theta_1$ can be iteratively updated using
$$\theta_1^{(k)} = \theta_1^{(k-1)} - \Big(\nabla^2\ell_2\big(\theta_1^{(k-1)}\big) + n\,\Sigma_\psi\big(\theta_1^{(k-1)}\big)\Big)^{-1}\Big(\nabla\ell_2\big(\theta_1^{(k-1)}\big) + n\,U_\psi\big(\theta_1^{(k-1)}\big)\Big), \eqno(2.7)$$

where $\nabla\ell_2(\theta_1) = \partial\ell_2/\partial\theta_1$ and $\nabla^2\ell_2(\theta_1) = \partial^2\ell_2/\partial\theta_1\partial\theta_1^T$ are the first- and second-order derivatives of $\ell_2(\beta,\theta)$ with respect to $\theta_1$, and
$$U_\psi\big(\theta_1^{(k-1)}\big) = \Sigma_\psi\big(\theta_1^{(k-1)}\big)\,\theta_1^{(k-1)} \quad\text{and}\quad \Sigma_\psi\big(\theta_1^{(k-1)}\big) = \mathrm{diag}\bigg(\frac{p'_\psi(|\lambda_1^{(k-1)}|)}{|\lambda_1^{(k-1)}|},\;\ldots,\;\frac{p'_\psi(|\lambda_q^{(k-1)}|)}{|\lambda_q^{(k-1)}|}\bigg)$$
are the approximated first- and second-order derivatives of $p_\psi(|\theta_1^{(k-1)}|)$ with respect to $\theta_1$. Alternatively, the REML estimate of $\theta_1$ can be obtained simply by replacing $\ell_2$ with $\ell_{2R}$.

We now turn to the estimate of $\theta_2$. Suppose that $\hat\theta_1$ is already obtained using the iteration in (2.7), and that certain components of $\hat\theta_1$ are equal to zero, for example $\hat\lambda_l = 0$. Then the $l$th component of the random effects $b_i$ is not effective and should be removed from $b_i$; as a result, $\gamma_{l1},\ldots,\gamma_{l(l-1)}$ are removed from $\theta_2$. The remaining components of $\theta_2$ can then be updated using
$$\theta_2^{(k)} = \theta_2^{(k-1)} - \Big(\nabla^2\ell_2\big(\theta_2^{(k-1)}\big)\Big)^{-1}\nabla\ell_2\big(\theta_2^{(k-1)}\big), \eqno(2.8)$$

where $\nabla\ell_2(\theta_2) = \partial\ell_2/\partial\theta_2$ and $\nabla^2\ell_2(\theta_2) = \partial^2\ell_2/\partial\theta_2\partial\theta_2^T$. The derivation of the first- and second-order derivatives in (2.7)-(2.8) is provided in the Appendix: the details of $\Sigma_\psi(\theta_1^{(k-1)})$ and $U_\psi(\theta_1^{(k-1)})$ are presented in Part 1, and the technical details of the derivatives $\nabla\ell_2(\theta_s^{(k-1)})$ and $\nabla^2\ell_2(\theta_s^{(k-1)})$ ($s = 1, 2$) are given in Part 2.

At convergence, the covariance matrices of $\hat\theta_1$ and $\hat\theta_2$ can be calculated using the following sandwich formulae (Fan and Li 2001):
$$\mathrm{Cov}(\hat\theta_1) = \Big(\nabla^2\ell_2(\hat\theta_1) + n\Sigma_\psi(\hat\theta_1)\Big)^{-1}\mathrm{Cov}\big(\nabla\ell_2(\hat\theta_1)\big)\Big(\nabla^2\ell_2(\hat\theta_1) + n\Sigma_\psi(\hat\theta_1)\Big)^{-1}$$
and
$$\mathrm{Cov}(\hat\theta_2) = \big(\nabla^2\ell_2(\hat\theta_2)\big)^{-1}\mathrm{Cov}\big(\nabla\ell_2(\hat\theta_2)\big)\big(\nabla^2\ell_2(\hat\theta_2)\big)^{-1}.$$

If $\phi$ is unknown, it can be estimated iteratively using the residuals. For example, if $y_{ij}$ follow a normal distribution and the predictions of the random effects $b_i$ are already obtained, then $\phi$ can be updated using $\phi^{(k)} = (1/N)\sum_{i=1}^n (e_i^{(k)})^T e_i^{(k)}$, where $N = \sum_{i=1}^n n_i$ and $e_i^{(k)} = y_i - X_i\beta^{(k)} - Z_i b_i^{(k)}$ are the conditional residuals in the $k$th iteration. If $y_{ij}$ follow a binomial or a Poisson distribution conditional on the random effects $b_i$, we may simply take $\phi = 1$ as long as no over- or under-dispersion occurs. In general, the dispersion parameter $\phi$ can be estimated with the help of the Pearson residuals and the trace of the corresponding hat matrix as the effective degrees of freedom; see Groll and Tutz (2013) for the details.

In summary, we propose the algorithm below to calculate the estimates of the fixed effects $\beta$ and the variance components $\theta$, selecting effective random effects simultaneously.

Algorithm

Step 1. Given initial values of the fixed effects $\beta^{(0)}$ and the variance components $\theta^{(0)} = (\theta_1^{(0)}, \theta_2^{(0)}, \phi^{(0)})$, compute the estimate of the fixed effects $\beta^{(1)}$ and the prediction of the random effects $b_i^{(1)}$ using (2.3) and (2.4), respectively.

Step 2. Given $\beta^{(1)}$, $\theta_2^{(0)}$ and $\phi^{(0)}$, use (2.7) to update the variance component estimate $\theta_1^{(1)}$. Then insert $\beta^{(1)}$, $\theta_1^{(1)}$ and $\phi^{(0)}$ into (2.8) and compute the variance component estimate $\theta_2^{(1)}$. Update $\phi^{(0)}$ by $\phi^{(1)}$ if necessary.

Step 3. Repeat Steps 1 and 2 above until convergence.
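The algorithm can be organized as the following skeleton (an illustrative outline rather than the authors' implementation; the four callables `update_beta_b`, `update_theta1`, `update_theta2` and `update_phi` are hypothetical stand-ins for Eqs. (2.3)-(2.4), (2.7), (2.8) and the residual-based update of $\phi$):

```python
import numpy as np

def backfit(update_beta_b, update_theta1, update_theta2, update_phi,
            theta1, theta2, phi, tol=1e-6, max_iter=200):
    """Backfitting skeleton for Steps 1-3 above."""
    beta, bs = None, None
    for _ in range(max_iter):
        beta, bs = update_beta_b(theta1, theta2, phi)     # Step 1
        t1 = update_theta1(beta, theta1, theta2, phi)     # Step 2: Eq. (2.7)
        # components of t1 shrunk to zero drop the matching rows of theta2
        theta2 = update_theta2(beta, t1, theta2, phi)     #         Eq. (2.8)
        new_phi = update_phi(beta, bs)                    # if phi is unknown
        # Step 3: stop once theta1 and phi stabilize
        change = np.max(np.abs(t1 - theta1)) + abs(new_phi - phi)
        theta1, phi = t1, new_phi
        if change < tol:
            break
    return beta, bs, theta1, theta2, phi
```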

Note that when $y_i$ are normally distributed, $\phi$ needs to be estimated. In this case, we first calculate the residuals $e_i^{(1)} = y_i - X_i\beta^{(1)} - Z_i b_i^{(1)}$ in Step 1 and then update the estimate $\phi^{(1)}$ using $e_i^{(1)}$. In addition, unless there is prior information about the covariance matrix $D$ of the random effects $b_i$, the identity structure can in general be chosen as the initial value of $D$; equivalently, $\theta_1^{(0)} = \mathbf{1}$ (a vector of ones) and $\theta_2^{(0)} = \mathbf{0}$

can be chosen as the initial values of $\theta_1$ and $\theta_2$.

Note that the tuning parameter $\psi$ in the penalty function $p_\psi(|\theta_1|)$ is generally unknown and needs to be estimated together with the other unknown parameters. We propose to use the leave-one-subject-out cross-validation (SCV) method to obtain the optimal tuning parameter $\psi$; that is, we minimize the SCV criterion
$$\mathrm{SCV}(\psi) = \sum_{i=1}^n\big(Y_i^{(-i)} - X_i\hat\beta^{(-i)}\big)^T\big[V_i^{(-i)}\big]^{-1}\big(Y_i^{(-i)} - X_i\hat\beta^{(-i)}\big)$$
with respect to $\psi$, where $Y_i^{(-i)} = \eta_i^{(-i)} + (y_i - \mu_i^{(-i)})g'(\mu_i^{(-i)})$, $V_i^{(-i)} = (W_i^{(-i)})^{-1} + Z_i\hat D^{(-i)}Z_i^T$, $\hat D^{(-i)}$ is computed using the data without the $i$th subject, and $\hat\beta^{(-i)}$, $\eta_i^{(-i)}$, $\mu_i^{(-i)}$, $W_i^{(-i)}$ are the corresponding vector or matrix estimates after deleting the $i$th subject.
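Choosing $\psi$ then amounts to a one-dimensional search. A minimal sketch (our illustration), assuming a user-supplied function `scv(psi)` that refits the model leaving out each subject in turn and returns the criterion value:

```python
import numpy as np

def select_psi(scv, grid=np.logspace(-2, 1, 25)):
    """Return the tuning parameter on the grid minimizing SCV(psi)."""
    scores = [scv(psi) for psi in grid]
    return grid[int(np.argmin(scores))]
```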

3 Simulation studies

In this section, simulation studies are carried out to assess the performance of the proposed method for selecting random effects in GLMMs. The studies cover four cases: normal, Bernoulli, binomial and Poisson distributions. In each case, we consider two different model scenarios for longitudinal data: the first scenario assumes that the covariates are time-independent, and the second considers time-dependent covariates. We use the LASSO and SCAD penalty functions when selecting effective random effects. Note that for the SCAD penalty we choose the parameter $a = 3.7$, as suggested by Fan and Li (2001).

3.1 Normal distribution

3.1.1 Time-independent covariates

For the normal distribution with time-independent covariates, we generate 100 data sets, each containing 100 subjects ($n = 100$) with 10 repeated measurements per subject ($n_i \equiv 10$). The model considered is $\mu_{ij} = \eta_{ij} = x_{ij}^T\beta + z_{ij}^T b_i$, where $x_{ij} \sim N_p(0, \sigma^2 V)$ with $V = (\rho^{|j-k|})_{1\le j,k\le p}$ and $p = 10$, and the true values of $\sigma^2$ and $\rho$ are $\sigma^2 = 1$ and $\rho = 0.2, 0.5, 0.8$, respectively. We also assume $z_{ij} = x_{ij}$ and $\mathrm{var}(y_{ij}|b_i) \equiv 1$, meaning $a(\phi) = 1$. The fixed effects $\beta$ form a 10-dimensional vector with true value $\beta = (10, 9, 8, 7, 6, 5, 4, 3, 2, 1)$, and the random effects $b_i = (b_{i1}, b_{i2}, \ldots, b_{i,10}) \sim N_{10}(0, D)$ with $D = \mathrm{diag}(3, 2, 1, 0, \ldots, 0)$; in other words, only the first three random effects are effective. The LASSO and SCAD penalties are implemented in the proposed method for selecting random effects, and the SCV criterion is used to choose the optimal tuning parameter $\psi$.

Table 1 summarizes the estimation results for the mean parameter $\beta$, reporting both the average (Mean) and the standard deviation (SD) of $\hat\beta$ over the 100 simulations. It is clear that all the estimates of $\beta$ are very close to the true parameter values. It is also interesting to see that the correlation parameter $\rho$ has little effect on the estimation of $\beta$, and there is little difference in the estimates of $\beta$ between the LASSO and SCAD methods when the focus is on selecting effective random effects. This is obvious from (2.5) or (2.6), because the penalty function is only imposed on the variance components $\theta_1$. For this reason, in all the other simulation studies below we omit the report of the estimation of the fixed effects and focus on the selection of random effects.

To measure the effectiveness of the random effects selection, motivated by the "Median Ratio of Model Error" (MRME, Fan and Li 2001), we propose the following criterion, called "Average Error" (AE):
$$\mathrm{AE} = \sum_{l=1}^q \frac{1}{q}\big(\hat\lambda_l - \lambda_l\big)^2 + \sum_{l=2}^q\sum_{r=1}^{l-1}\frac{1}{q(q-1)/2}\big(\hat\gamma_{lr} - \gamma_{lr}\big)^2,$$
where $\lambda_l$ and $\gamma_{lr}$ are the covariance components of the random effects and $\hat\lambda_l$ and $\hat\gamma_{lr}$ are the associated estimates. The ratio of the AE of the model with either the LASSO or SCAD penalty function to that of the model without any penalty function is calculated for each simulation, and the median of these ratios over the 100 simulations, called the "Median Ratio of Average Error" (MRAE), is used to measure the performance of the proposed random effects selection method.
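The AE criterion translates directly into code; a NumPy transcription (our illustration), with `lam_hat` and `Gam_hat` the estimated $\hat\lambda_l$ and $\hat\gamma_{lr}$ and `lam`, `Gam` the true values:

```python
import numpy as np

def average_error(lam_hat, Gam_hat, lam, Gam):
    """AE: mean squared error of the lambda_l plus mean squared error
    of the strictly lower-triangular gamma_lr, as defined above."""
    q = len(lam)
    ae = np.sum((lam_hat - lam) ** 2) / q
    low = np.tril_indices(q, k=-1)   # pairs (l, r) with r < l
    ae += np.sum((Gam_hat[low] - Gam[low]) ** 2) / (q * (q - 1) / 2)
    return ae
```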

Table 2 gives the selection results in terms of the MRAE, the correct and incorrect numbers of zeros (C and IC), and the proportions of under-fit, correct-fit and over-fit. Here the correct number of zeros is the average number, over the 100 simulations, of zero components correctly identified as zero, whilst the incorrect number of zeros is the average number of nonzero components incorrectly judged to be zero. For this simulation study, we would expect the correct number of zeros to be close to 7, as there are 7 zero components, and the incorrect number of zeros to be close to 0. Under-fit in Table 2 represents the case in which some non-zero components are mistakenly judged as zero, whilst over-fit means that some zero components are wrongly identified as nonzero. From Table 2, it is clear that the MRAE value is very small whatever the correlation coefficient $\rho$ of the covariates is, implying that the proposed method is able to reduce the estimation error of the variance components of the random effects. The proposed method also behaves very well in terms of selecting effective random effects and dropping false ones. Moreover, despite occasional under-fit cases, the proposed method has a very high correct-fit probability for both the LASSO and SCAD methods. It appears that the incorrect number of zeros and the percentage of under-fit increase with $\rho$, which reflects the degree of collinearity among the covariates.

Table 1 Mean parameter estimation

                β1     β2     β3     β4     β5     β6     β7     β8     β9     β10
  True value    10     9      8      7      6      5      4      3      2      1
  SCAD
   ρ=0.2 Mean   10.01  9.02   7.99   6.99   6.00   5.00   4.00   3.00   2.00   1.00
         SD     0.21   0.16   0.11   0.05   0.04   0.05   0.04   0.04   0.05   0.04
   ρ=0.5 Mean   10.01  8.97   7.99   7.00   6.00   5.00   3.99   3.01   1.99   1.00
         SD     0.17   0.16   0.10   0.05   0.05   0.05   0.05   0.06   0.05   0.04
   ρ=0.8 Mean   9.99   8.98   8.00   7.02   5.97   5.01   3.99   3.01   2.00   1.00
         SD     0.17   0.17   0.14   0.08   0.08   0.09   0.09   0.09   0.09   0.08
  LASSO
   ρ=0.2 Mean   10.01  9.03   7.99   6.99   6.00   5.00   4.00   3.00   2.00   1.00
         SD     0.21   0.16   0.11   0.05   0.04   0.05   0.04   0.04   0.05   0.04
   ρ=0.5 Mean   10.01  8.97   7.99   7.00   6.00   5.00   3.99   3.01   1.99   1.00
         SD     0.17   0.16   0.11   0.05   0.05   0.05   0.05   0.06   0.05   0.04
   ρ=0.8 Mean   9.99   8.98   8.00   7.02   5.97   5.01   3.99   3.01   2.00   1.00
         SD     0.17   0.17   0.14   0.08   0.08   0.09   0.09   0.09   0.09   0.08

Table 2 Random effects selection

  Method  ρ    MRAE     No. of zeros    Proportion of
                        C     IC        Under-fit  Correct-fit  Over-fit
  SCAD    0.2  1.31e-2  7.00  0.02      0.02       0.98         0.00
          0.5  6.38e-3  7.00  0.00      0.00       1.00         0.00
          0.8  1.38e-4  7.00  0.05      0.05       0.95         0.00
  LASSO   0.2  1.26e-2  7.00  0.01      0.01       0.99         0.00
          0.5  6.07e-3  7.00  0.00      0.00       1.00         0.00
          0.8  1.37e-4  7.00  0.05      0.05       0.95         0.00

In Fig. 1, box plots are provided to demonstrate how the proposed random effects selection method works for the case of $\rho = 0.2$ with the SCAD penalty. For example, the first three box plots in Fig. 1 show the performance of the variance estimates for the random effects $b_{i1}$, $b_{i2}$ and $b_{i3}$ over the 100 simulations. As their true variance values are $d_{11} = 3$, $d_{22} = 2$ and $d_{33} = 1$, they are clearly effective random effects. In contrast, the random effects $b_{i4},\ldots,b_{i,10}$ are ineffective because their variances are zero. Figure 1 illustrates this point very well.

Fig. 1 Box plots of the variance estimates of $b_{i1}, b_{i2}, \ldots, b_{i,10}$


We also estimate the dispersion parameter $\phi$ using the residuals. Note that the true value is $\phi = 1$. The estimates are 1.04, 1.05 and 1.08 for $\rho = 0.2, 0.5, 0.8$ using the SCAD penalty, and 1.04, 1.06 and 1.06 for the same values of $\rho$ using the LASSO penalty. These results show that the residual-based method for estimating $\phi$ is fairly acceptable.

3.1.2 Time-dependent covariates

In the time-dependent covariates normal case, we use the same model as in the time-independent case, except that the covariates for the fixed effects and random effects are assumed to be $(p-1)$th and $(q-1)$th order polynomials in time, respectively. The measurement times $t_{ij}$ are generated by $t_{ij} = j - 0.5 + e_{ij}$, where $e_{ij} \sim N(0, 0.1)$ are random perturbations. After the standardization $t_{ij} = (t_{ij} - \bar t_{i\cdot})/s(t_{i\cdot})$, where $\bar t_{i\cdot}$ and $s(t_{i\cdot})$ denote the sample mean and standard deviation of the time points $t_{i1}, t_{i2}, \ldots, t_{in_i}$, the time-dependent covariates involving the $(p-1)$th and $(q-1)$th order polynomials of $t_{ij}$ are set to be $x_{ij} = (1, t_{ij}, \ldots, t_{ij}^{p-1})$ and $z_{ij} = (1, t_{ij}, \ldots, t_{ij}^{q-1})$. For simplicity, we assume $z_{ij} = x_{ij}$ with $p = q = 6$, meaning that 5th-order polynomials in time are used to model the data $y_{ij}$. We also assume $\mathrm{var}(y_{ij}|b_i) \equiv 1$, meaning $a(\phi) = 1$. The random effects are $b_i = (b_{i1}, b_{i2}, \ldots, b_{i6}) \sim N(0, D)$ with $D = \mathrm{diag}(2, 1, 0, 0, 0, 0)$, implying that only the intercept and the linear trend in measurement time involve effective random effects in the true model. The true value of the fixed effects is $\beta = (6, -5, 4, -3, 2, -1)$. For each combination of $n = 100, 200$ and $n_i \equiv 10$, we carried out 100 simulations; in each simulation, the LASSO and SCAD penalty methods are used to select effective random effects.
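The covariate construction for this scenario can be sketched as follows (our illustration of the stated generating mechanism):

```python
import numpy as np

rng = np.random.default_rng(1)
n, ni, q = 100, 10, 6

def polynomial_covariates(ni, q, rng):
    """Generate t_ij = j - 0.5 + e_ij with e_ij ~ N(0, 0.1), standardize
    within subject, and return the ni x q matrix with rows (1, t, ..., t^{q-1})."""
    t = np.arange(1, ni + 1) - 0.5 + rng.normal(0.0, np.sqrt(0.1), ni)
    t = (t - t.mean()) / t.std(ddof=1)          # per-subject standardization
    return np.vander(t, q, increasing=True)     # columns 1, t, ..., t^{q-1}

Zs = [polynomial_covariates(ni, q, rng) for _ in range(n)]  # here z_ij = x_ij
```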

Similar to Table 2, the MRAE, the numbers of zeros, and the proportions of under-, over- and correct-fit are reported in Table 3 to assess the performance of the proposed method for random effects selection. From Table 3, it is clear that there is only a small number of under-fit cases, leading to a small incorrect number of zeros. In fact, effective random effects are correctly selected in 97% and 98% of the 100 simulations, with only 3% and 2% under-fit, respectively.

Table 3 Random effects selection

  Method  n    MRAE    No. of zeros    Proportion of
                       C     IC        Under-fit  Correct-fit  Over-fit
  SCAD    100  5.6e-5  4.00  0.03      0.03       0.97         0.00
          200  5.8e-5  4.00  0.02      0.02       0.98         0.00
  LASSO   100  5.7e-5  4.00  0.03      0.03       0.97         0.00
          200  6.1e-5  4.00  0.02      0.02       0.98         0.00

Figure 2 provides the box plots for the performance of the proposed random effects selection method in the case with $n = 100$ using the SCAD penalty, which again demonstrates the good performance of the proposed method.

Fig. 2 Box plots of the variance estimates of $b_{i1}, b_{i2}, \ldots, b_{i6}$

Finally, we also report the estimates of the nuisance parameter $\phi$. When $n = 100$, $\hat\phi = 1.05$ for both the SCAD and LASSO methods. When $n$ increases to 200, $\hat\phi = 1.05$ for the SCAD method and $\hat\phi = 1.04$ for the LASSO method. These estimates are quite close to the true value $\phi = 1$, with very small bias.

3.2 Bernoulli distribution

In this subsection, we consider binary data as an example of discrete longitudinal data. As in the normal distribution case, we again look at two settings: time-independent covariates and time-dependent covariates.

3.2.1 Time-independent covariates

The model settings for Bernoulli data with time-independent covariates are similar to those for the normal distribution with time-independent covariates. The only difference lies in that the responses $y_{ij}$ are generated according to Bernoulli distributions $\mathrm{Bin}(p_{ij})$, where the success probability $p_{ij}$ is modeled via $\mathrm{logit}(p_{ij}) = \mathrm{logit}(\mu_{ij}) = x_{ij}^T\beta + z_{ij}^T b_i$. As before, the $x_{ij}$ are simulated from multivariate normal distributions with $\rho = 0.5$ for moderate correlation and $\beta = (10, 9, 8, 7, 6, 5, 4, 3, 2, 1)$. The covariance matrix of the random effects $b_i = (b_{i1}, b_{i2}, \ldots, b_{i,10})$ is set to $D = \mathrm{diag}(3, 2, 1, 0, \ldots, 0)$.

Table 4 presents the performance of the random effects selection. In Table 4, there is 12% over-fit using either the SCAD or the LASSO penalty, implying that the proposed method occasionally selects some ineffective random effects into the model for longitudinal binary data, but the error is relatively small.

Table 4 Random effects selection

  Method  MRAE      No. of zeros    Proportion of
                    C     IC        Under-fit  Correct-fit  Over-fit
  SCAD    0.027195  6.87  0.00      0.00       0.88         0.12
  LASSO   0.016400  6.87  0.00      0.00       0.88         0.12

3.2.2 Time-dependent covariates

Using similar model settings as in the case of the normal distribution with time-dependent covariates, we conduct a simulation study with $n = 100$ and $n_i = 10$ for the time-dependent covariates case, where the responses $y_{ij}$ are binary with success probability given by $\mathrm{logit}(p_{ij}) = \mathrm{logit}(\mu_{ij}) = x_{ij}^T\beta + z_{ij}^T b_i$.

Table 5 gives a summary of the random effects selection for the model considered. It is clear that both the LASSO and SCAD methods provide a high rate of correctly selected zero variance components of the random effects, though not as high as in the normal distribution case. In contrast to the normal distribution case, the main difference lies in the rates of under- and over-fit: the correct-fit rate is not as high as for the normal distribution, due to the increased rate of over-fit.

Table 5 Random effects selection

  Method  MRAE    No. of zeros    Proportion of
                  C     IC        Under-fit  Correct-fit  Over-fit
  SCAD    4.0e-5  3.91  0.00      0.00       0.91         0.09
  LASSO   4.2e-5  3.87  0.00      0.00       0.87         0.13

3.3 Binomial distribution

In this subsection, binomial simulation studies are considered as further examples of longitudinal discrete data. The model settings are similar to the Bernoulli cases, except that this time the $y_{ij}$ come from binomial distributions $B(m, p_{ij})$, where $m$ is the number of trials and $p_{ij}$ is the success probability of each trial. The time-independent and time-dependent covariates cases are considered in turn.

3.3.1 Time-independent covariates

For the case of time-independent covariates, the responses $y_{ij}$ are generated according to the binomial distribution $B(m, p_{ij})$ with success probability given by $\mathrm{logit}(p_{ij}) = x_{ij}^T\beta + z_{ij}^T b_i$ and trial number $m = 4$, where the covariates $x_{ij}$ are generated from a multivariate normal distribution with $\rho = 0.5$. The true values of the mean parameters $\beta$ and the covariance matrix of the random effects are the same as in the two previous cases with time-independent covariates.

Table 6 gives the performance of the proposed method for the random effects selection. We note a significant improvement over the Bernoulli case in terms of the correct-fit rate. The SCAD method outperforms the LASSO in terms of random effects selection for the binomial case, having both a higher rate of correct fit and a lower rate of over-fit.

Table 6 Random effects selection

  Method  MRAE      No. of zeros    Proportion of
                    C     IC        Under-fit  Correct-fit  Over-fit
  SCAD    0.016264  6.96  0.00      0.00       0.97         0.03
  LASSO   0.073532  6.93  0.00      0.00       0.94         0.06

3.3.2 Time-dependent covariates

The parameter settings and data generating mechanism are the same as for the case of Bernoulli data with time-dependent covariates. The only difference is that the responses $y_{ij}$ are generated from a binomial distribution with trial number $m = 4$. Table 7 gives the performance of the random effects selection. As in the other cases, the proposed method works very well in the case of time-dependent covariates for binomial data. Comparing the two discrete data cases, the binomial cases outperform the Bernoulli cases, which is natural as binomial data provide more information.

Table 7 Random effects selection

  Method  MRAE      No. of zeros    Proportion of
                    C     IC        Under-fit  Correct-fit  Over-fit
  SCAD    0.017954  3.96  0.00      0.00       0.96         0.04
  LASSO   0.017325  3.97  0.00      0.00       0.97         0.03

3.4 Poisson distribution

In this subsection, we conduct simulation studies for Poisson data, i.e., the responses $y_{ij}$ are Poisson distributed with mean $\mu_{ij}$. As in the previous subsections, we consider the two cases of time-independent and time-dependent covariates.

3.4.1 Time-independent covariates

The general parameter settings and data generating mechanism are similar to the cases of normal, Bernoulli and binomial data with time-independent covariates, but this time the responses $y_{ij}$ are generated according to Poisson distributions $\mathrm{Pois}(\mu_{ij})$, where $\log(\mu_{ij}) = x_{ij}^T\beta + z_{ij}^T b_i$ and the covariates $x_{ij}$ and $z_{ij}$ are generated from a multivariate normal distribution with $\rho = 0.5$. We set $\beta = (1.0, 0.9, \ldots, 0.1)$. Again, the covariance matrix of the random effects $b_i = (b_{i1}, b_{i2}, \ldots, b_{i,10})$ is set to $D = \mathrm{diag}(3, 2, 1, 0, \ldots, 0)$. The simulation results are summarized in Table 8.

Table 8 indicates that for Poisson data both the SCAD and LASSO methods do a good job of selecting effective random effects, as both have very high rates of correct-fit and of correct identification of ineffective random effects. In comparison, the SCAD method performs slightly better than the LASSO approach in selecting effective random effects for Poisson data. In addition, for the Poisson case the proposed random effects selection method improves in correct-fit rate compared to its performance in the Bernoulli/binomial cases.

Table 8 Random effects selection

  Method  MRAE      No. of zeros    Proportion of
                    C     IC        Under-fit  Correct-fit  Over-fit
  SCAD    0.000647  6.98  0.00      0.00       0.98         0.02
  LASSO   0.001998  6.95  0.00      0.00       0.96         0.04

3.4.2 Time-dependent covariates

Similar to the previous cases with time-dependent covariates, the responses $y_{ij}$ are generated from Poisson distributions with mean $\mu_{ij}$ given by $\log(\mu_{ij}) = x_{ij}^T\beta + z_{ij}^T b_i$, where $x_{ij}$ and $z_{ij}$ are polynomials in time. The covariance matrix of the random effects $b_i$ is taken as $D = \mathrm{diag}(2, 1, 0, 0, 0, 0)$ and the fixed effects are set to $\beta = (0.6, -0.5, 0.4, -0.3, 0.2, -0.1)$. The simulation results are summarized in Table 9, from which it is clear that the proposed random effects selection method performs very well for Poisson data with time-dependent covariates.

Table 9 Random effects selection

  Method  MRAE     No. of zeros    Proportion of
                   C     IC        Under-fit  Correct-fit  Over-fit
  SCAD    2.43e-5  3.99  0.00      0.00       0.99         0.01
  LASSO   2.47e-5  3.98  0.00      0.00       0.98         0.02

4 Real data analysis for CD4+ cell data

In this section, we illustrate the use of the proposed method through an analysis of CD4+ cell data from an HIV study (Kaslow et al. 1987). In total, 369 patients were enrolled in the study, and repeated measurements were taken on each patient. The study period lasted about eight and a half years, and 2376 measurements were taken in total. The measurement times and the number of measurements varied from patient to patient, so the data set is highly unbalanced. The CD4+ data show very strong individual variability (Kaslow et al. 1987), leading to the use of a GLMM to accommodate individual heterogeneity.

The CD4+ data have been analyzed by many authors, including Groll and Tutz (2012) and Ye and Pan (2006) among others. In Ye and Pan (2006), the mean and variance of the CD4+ data were modeled in terms of polynomials in time, but random effects were not considered. Below we use the same polynomial in time as Ye and Pan (2006) to model the mean structure, but add potential random effects, here up to a random cubic term, into the model. In other words,
$$y_{ij} = \beta_1 + \beta_2 t_{ij} + \beta_3 t_{ij}^2 + \beta_4 t_{ij}^3 + \beta_5 t_{ij}^4 + \beta_6 t_{ij}^5 + \beta_7 t_{ij}^6 + b_{i1} + b_{i2} t_{ij} + b_{i3} t_{ij}^2 + b_{i4} t_{ij}^3 + \varepsilon_{ij},$$
where $y_{ij}$ is the CD4+ cell count of the $i$th patient measured at the $j$th time point, $b_i = (b_{i1}, b_{i2}, b_{i3}, b_{i4})^T$ are the potential random effects following a normal distribution with mean 0 and covariance matrix $D$, and the $\varepsilon_{ij}$ are random errors following a normal distribution.

Table 10 presents the estimates of $\beta$ using the SCAD and LASSO methods, together with Ye and Pan (2006)'s estimation results. In Table 10, 'YP06' denotes the estimates of the fixed effects $\beta$ by Ye and Pan (2006), and the other rows are the results of our proposed method using the LASSO and SCAD penalties. From Table 10, we can see that the mean parameter estimates of the proposed method are quite close to those of Ye and Pan (2006).

Table 10 Mean parameter estimation with LASSO and SCAD

         β1      β2       β3      β4     β5     β6     β7
  YP06   875.22  −207.27  −22.48  32.21  −1.31  −1.92  0.25
  SCAD   874.74  −203.16  −19.33  31.65  −1.37  −1.84  0.24
  LASSO  874.75  −203.17  −19.35  31.65  −1.37  −1.84  0.24

For the selection of random effects, we compare the proposed method with a full screening search based on the AIC or BIC selection criterion. At the initial stage, a polynomial in time with possible random effects up to a random cubic term is taken as the starting model, as shown above. The random effects $b_{i1}$, $b_{i2}$, $b_{i3}$, $b_{i4}$ may or may not all be needed, so there are $2^4 - 1 = 15$ possible sub-models when incorporating random effects. Based on either the AIC or the BIC criterion, the best three sub-models selected from all the sub-models, ranked as 1st, 2nd and 3rd, are presented in Table 11. Although the AIC or BIC-based selection method is computationally intensive, it may select a reasonably good model from all the sub-models, so the AIC or BIC-selected model can be used as a benchmark for comparison purposes. Table 11 reports the selected random effects using the AIC or BIC model selection method and our proposed penalty-based method, respectively. Table 11 shows that our proposed SCAD- or LASSO-penalty selection method selects $b_{i1}$, $b_{i4}$ as the effective random effects, which is consistent with the selections made by the AIC or BIC-based screening search.

Table 11 Random effects selection

  Methods  Rank  Selected random effects
  AIC      1st   bi1, bi4
           2nd   bi1, bi2, bi3
           3rd   bi1, bi2, bi4
  BIC      1st   bi1, bi4
           2nd   bi1, bi2, bi3
           3rd   bi1, bi2, bi4
  SCAD           bi1, bi4
  LASSO          bi1, bi4

Note that the AIC or BIC-based screening search requires estimating the parameters in each sub-model and comparing all possible sub-models. Inevitably, this is very computationally intensive, in particular when the dimension $q$ of the random effects is large, as there are $2^q - 1$ possible random effects sub-models to consider. In contrast, our proposed approach only needs to work with one model, simultaneously selecting the effective random effects and estimating the parameters in the model.

Finally, we report that in our analysis the estimated variances of the effective random effects $b_{i1}$ and $b_{i4}$ are, respectively, 2.61 and 0.04 by the LASSO method, and 3.51 and 0.04 by the SCAD method. Similar to Groll and Tutz (2013), in Fig. 3 we provide a graphical illustration of the variance build-ups of the random effects for the CD4+ cell count data using the SCAD penalty. From Fig. 3, it is clear that the variance estimates of $b_{i1},\ldots,b_{i4}$ are all nonzero at $\psi = 0.0$, but at the optimal tuning parameter $\psi = 1.0$ the variance estimates of $b_{i2}$ and $b_{i3}$ become zero. Note that for $b_{i4}$ the estimate at $\psi = 1.0$ is 0.04, which is close to but not equal to zero. For illustration, in Fig. 4 we display the estimated population mean (solid curve) and the estimated individual trajectory for Subject 329 (dashed curve), together with his CD4+ cell count data (circles), based on the model including random effects $b_{i1}$ and $b_{i4}$, using the LASSO and SCAD penalties. It shows that the proposed random effects models using both the LASSO and SCAD penalties fit the data reasonably well.

Fig. 3 Random effects variance build-ups for the CD4+ cell count data; the optimal value of the tuning parameter $\psi$ is shown by the vertical line

Fig. 4 Population and individual mean curves with the original data of a certain subject

5 Discussions

The PQL estimation method has been widely used in GLMMs due to its easy computation, although some bias may occur in the estimation of the variance components, in particular when the data are binary or the sample size is small. For binary and binomial correlated data, Breslow and Lin (1995) and Lin and Breslow (1996) showed that the PQL estimates of the parameters in GLMMs are acceptable if the data are not too sparse. For Poisson correlated data, the PQL estimation has a smaller bias than in the binomial case (Lin 2007). Under a normal distribution assumption, GLMMs reduce to LMMs, in which case the PQL approach works reasonably well; accordingly, the proposed SCAD or LASSO penalty method performs very well in selecting effective random effects and estimating variance components, as seen in our simulation studies. For GLMMs, the method proposed in this paper also works well in selecting effective random effects, although the estimates of the variance components using the ML version (2.5) may have some bias. Note that such bias can be reduced by using the REML version (2.6) when simultaneously selecting effective random effects and estimating the fixed effects and variance components.

As we mentioned in Sect. 1, there is already some literature on variable selection for fixed effects in GLMMs. In this paper, we focus on selecting effective random effects by imposing either SCAD or LASSO penalty functions of the variance components of the random effects on the PQL function. An extension of the proposed approach to the simultaneous selection of both fixed and random effects is rather straightforward: for example, a second penalty function of the fixed effects $\beta$ can be added to the penalized log-likelihood function (2.5) or (2.6), making the profile log-likelihood function $\ell_2(\beta,\theta)$ or $\ell_{2R}(\beta,\theta)$ doubly penalized. We will report such studies in a follow-up paper.

Acknowledgements Pan's research was supported by a grant from the Royal Society of the UK, and Huang's research was funded by a scholarship from the University of Manchester. We would like to thank two anonymous referees and the Editor/AE for their constructive comments and helpful suggestions.


Appendix: Derivation of the first- and second-order derivatives

Part 1. First we show how to derive $\Sigma_\psi(\theta_1^{(k-1)})$ and $U_\psi(\theta_1^{(k-1)})$. The main idea is to approximate $p_\psi(|\theta_1|)$ locally by a quadratic function, a direct result of Fan and Li (2001). Suppose that we already have $\theta_1^{(k-1)} = (\lambda_1^{(k-1)},\ldots,\lambda_q^{(k-1)})$ close to the maximizer of (2.5). Then $p_\psi(|\lambda_l|)$, $l = 1,\ldots,q$, can be locally approximated by a quadratic function via
$$\big[p_\psi(|\lambda_l|)\big]' = p'_\psi(|\lambda_l|)\,\mathrm{sgn}(\lambda_l) \approx \frac{p'_\psi(|\lambda_l^{(k-1)}|)}{|\lambda_l^{(k-1)}|}\,\lambda_l$$
for $\lambda_l \approx \lambda_l^{(k-1)}$, where $\mathrm{sgn}(\lambda_l)$ is the sign of $\lambda_l$. In other words,
$$p_\psi(|\lambda_l|) \approx p_\psi\big(|\lambda_l^{(k-1)}|\big) + \frac12\,\frac{p'_\psi(|\lambda_l^{(k-1)}|)}{|\lambda_l^{(k-1)}|}\Big(\lambda_l^2 - \big(\lambda_l^{(k-1)}\big)^2\Big).$$
With this local quadratic approximation of $p_\psi(|\theta_1|) = (p_\psi(|\lambda_1|),\ldots,p_\psi(|\lambda_q|))$, we can derive its first and second derivatives, i.e., $U_\psi(\theta_1^{(k-1)})$ and $\Sigma_\psi(\theta_1^{(k-1)})$, at the given value $\theta_1^{(k-1)}$.

Part 2. Without loss of generality, we suppress the superscript $(k-1)$ in $\theta_i^{(k-1)}$. We show how to obtain the elements of $\nabla\ell_2(\theta_i)$ and $\nabla^2\ell_2(\theta_i)$, and similarly of $\nabla\ell_{2R}(\theta_i)$ and $\nabla^2\ell_{2R}(\theta_i)$, $i = 1, 2$, given $\theta_1 = (\lambda_1,\ldots,\lambda_q)$ and $\theta_2 = (\gamma_{21};\gamma_{31},\gamma_{32};\ldots;\gamma_{q1},\ldots,\gamma_{q(q-1)})$.

Writing $e_i = Y_i - X_i\beta$, we have
$$\ell_2(\beta,\theta) = -\frac12\sum_{i=1}^n\log|V_i| - \frac12\sum_{i=1}^n e_i^T V_i^{-1} e_i. \eqno(A.1)$$

Similarly, we have
$$\ell_{2R}(\beta,\theta) = -\frac12\sum_{i=1}^n\log|V_i| - \frac12\sum_{i=1}^n e_i^T V_i^{-1} e_i - \frac12\sum_{i=1}^n\log\big|X_i^T V_i^{-1} X_i\big|. \eqno(A.2)$$
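Expressions (A.1) and (A.2) translate directly into code. A minimal NumPy version (our illustration), useful for cross-checking the analytic derivatives below numerically:

```python
import numpy as np

def ell2(es, Vs, Xs=None, reml=False):
    """Profile log-likelihood (A.1); with reml=True (Xs then required),
    add the REML adjustment term of (A.2)."""
    val = 0.0
    for i, (e, V) in enumerate(zip(es, Vs)):
        Vinv = np.linalg.inv(V)
        val -= 0.5 * np.linalg.slogdet(V)[1] + 0.5 * e @ Vinv @ e
        if reml:
            X = Xs[i]
            val -= 0.5 * np.linalg.slogdet(X.T @ Vinv @ X)[1]
    return val
```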

First we deal with the elements of $\nabla\ell_2(\theta_1)$ and $\nabla^2\ell_2(\theta_1)$, i.e., $\partial\ell_2(\beta,\theta)/\partial\lambda_l$ and $\partial^2\ell_2(\beta,\theta)/\partial\lambda_l\partial\lambda_m$, where $l, m = 1,\ldots,q$. From (A.1),
$$\frac{\partial\ell_2(\beta,\theta)}{\partial\lambda_l} = \sum_{k=1}^n\bigg\{-\frac12\mathrm{tr}\Big(V_k^{-1}\frac{\partial V_k}{\partial\lambda_l}\Big) + \frac12 e_k^T V_k^{-1}\frac{\partial V_k}{\partial\lambda_l}V_k^{-1}e_k\bigg\},$$
$$\frac{\partial^2\ell_2(\beta,\theta)}{\partial\lambda_l\partial\lambda_m} = \sum_{k=1}^n\bigg\{\frac12\mathrm{tr}\Big(V_k^{-1}\frac{\partial V_k}{\partial\lambda_l}V_k^{-1}\frac{\partial V_k}{\partial\lambda_m}\Big) - \frac12\mathrm{tr}\Big(V_k^{-1}\frac{\partial^2 V_k}{\partial\lambda_l\partial\lambda_m}\Big) - e_k^T V_k^{-1}\frac{\partial V_k}{\partial\lambda_l}V_k^{-1}\frac{\partial V_k}{\partial\lambda_m}V_k^{-1}e_k + \frac12 e_k^T V_k^{-1}\frac{\partial^2 V_k}{\partial\lambda_l\partial\lambda_m}V_k^{-1}e_k\bigg\}.$$

Similarly we can calculate the elements of $\nabla\ell_{2R}(\theta_1)$ and $\nabla^2\ell_{2R}(\theta_1)$, i.e., $\partial\ell_{2R}(\beta,\theta)/\partial\lambda_l$ and $\partial^2\ell_{2R}(\beta,\theta)/\partial\lambda_l\partial\lambda_m$, where $l, m = 1,\ldots,q$. From (A.2),
$$\frac{\partial\ell_{2R}(\beta,\theta)}{\partial\lambda_l} = \sum_{k=1}^n\bigg\{-\frac12\mathrm{tr}\Big(V_k^{-1}\frac{\partial V_k}{\partial\lambda_l}\Big) + \frac12 e_k^T V_k^{-1}\frac{\partial V_k}{\partial\lambda_l}V_k^{-1}e_k + \frac12\mathrm{tr}\Big(\big(X_k^T V_k^{-1}X_k\big)^{-1}X_k^T V_k^{-1}\frac{\partial V_k}{\partial\lambda_l}V_k^{-1}X_k\Big)\bigg\},$$
$$\begin{aligned}
\frac{\partial^2\ell_{2R}(\beta,\theta)}{\partial\lambda_l\partial\lambda_m} = \sum_{k=1}^n\bigg\{&\frac12\mathrm{tr}\Big(V_k^{-1}\frac{\partial V_k}{\partial\lambda_l}V_k^{-1}\frac{\partial V_k}{\partial\lambda_m}\Big) - \frac12\mathrm{tr}\Big(V_k^{-1}\frac{\partial^2 V_k}{\partial\lambda_l\partial\lambda_m}\Big) \\
&- e_k^T V_k^{-1}\frac{\partial V_k}{\partial\lambda_l}V_k^{-1}\frac{\partial V_k}{\partial\lambda_m}V_k^{-1}e_k + \frac12 e_k^T V_k^{-1}\frac{\partial^2 V_k}{\partial\lambda_l\partial\lambda_m}V_k^{-1}e_k \\
&- \frac12\mathrm{tr}\Big(\big(X_k^T V_k^{-1}X_k\big)^{-1}X_k^T V_k^{-1}\frac{\partial V_k}{\partial\lambda_l}V_k^{-1}X_k\big(X_k^T V_k^{-1}X_k\big)^{-1}X_k^T V_k^{-1}\frac{\partial V_k}{\partial\lambda_m}V_k^{-1}X_k\Big) \\
&- \frac12\mathrm{tr}\Big(\big(X_k^T V_k^{-1}X_k\big)^{-1}X_k^T V_k^{-1}\frac{\partial V_k}{\partial\lambda_l}V_k^{-1}\frac{\partial V_k}{\partial\lambda_m}V_k^{-1}X_k\Big) \\
&- \frac12\mathrm{tr}\Big(\big(X_k^T V_k^{-1}X_k\big)^{-1}X_k^T V_k^{-1}\frac{\partial V_k}{\partial\lambda_m}V_k^{-1}\frac{\partial V_k}{\partial\lambda_l}V_k^{-1}X_k\Big) \\
&+ \frac12\mathrm{tr}\Big(\big(X_k^T V_k^{-1}X_k\big)^{-1}X_k^T V_k^{-1}\frac{\partial^2 V_k}{\partial\lambda_l\partial\lambda_m}V_k^{-1}X_k\Big)\bigg\}.
\end{aligned}$$

For the derivatives with respect to $\theta_2$, we have similar expressions. Denote the elements of $\nabla\ell_2(\theta_2)$ by $\partial\ell_2(\beta,\theta)/\partial\gamma_{s_1r_1}$, where $s_1 = 2,\ldots,q$; $r_1 = 1,\ldots,s_1-1$, and the elements of $\nabla^2\ell_2(\theta_2)$ by $\partial^2\ell_2(\beta,\theta)/\partial\gamma_{s_1r_1}\partial\gamma_{s_2r_2}$, where $s_2 = 2,\ldots,q$; $r_2 = 1,\ldots,s_2-1$. From (A.1),
$$\frac{\partial\ell_2(\beta,\theta)}{\partial\gamma_{s_1r_1}} = \sum_{k=1}^n\bigg\{-\frac12\mathrm{tr}\Big(V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_1r_1}}\Big) + \frac12 e_k^T V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_1r_1}}V_k^{-1}e_k\bigg\},$$
$$\begin{aligned}
\frac{\partial^2\ell_2(\beta,\theta)}{\partial\gamma_{s_1r_1}\partial\gamma_{s_2r_2}} = \sum_{k=1}^n\bigg\{&\frac12\mathrm{tr}\Big(V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_1r_1}}V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_2r_2}}\Big) - \frac12\mathrm{tr}\Big(V_k^{-1}\frac{\partial^2 V_k}{\partial\gamma_{s_1r_1}\partial\gamma_{s_2r_2}}\Big) \\
&- e_k^T V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_1r_1}}V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_2r_2}}V_k^{-1}e_k + \frac12 e_k^T V_k^{-1}\frac{\partial^2 V_k}{\partial\gamma_{s_1r_1}\partial\gamma_{s_2r_2}}V_k^{-1}e_k\bigg\}.
\end{aligned}$$

Similarly, from (A.2) we have
$$\frac{\partial\ell_{2R}(\beta,\theta)}{\partial\gamma_{s_1r_1}} = \sum_{k=1}^n\bigg\{-\frac12\mathrm{tr}\Big(V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_1r_1}}\Big) + \frac12 e_k^T V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_1r_1}}V_k^{-1}e_k + \frac12\mathrm{tr}\Big(\big(X_k^T V_k^{-1}X_k\big)^{-1}X_k^T V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_1r_1}}V_k^{-1}X_k\Big)\bigg\},$$
$$\begin{aligned}
\frac{\partial^2\ell_{2R}(\beta,\theta)}{\partial\gamma_{s_1r_1}\partial\gamma_{s_2r_2}} = \sum_{k=1}^n\bigg\{&\frac12\mathrm{tr}\Big(V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_1r_1}}V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_2r_2}}\Big) - \frac12\mathrm{tr}\Big(V_k^{-1}\frac{\partial^2 V_k}{\partial\gamma_{s_1r_1}\partial\gamma_{s_2r_2}}\Big) \\
&- e_k^T V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_1r_1}}V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_2r_2}}V_k^{-1}e_k + \frac12 e_k^T V_k^{-1}\frac{\partial^2 V_k}{\partial\gamma_{s_1r_1}\partial\gamma_{s_2r_2}}V_k^{-1}e_k \\
&- \frac12\mathrm{tr}\Big(\big(X_k^T V_k^{-1}X_k\big)^{-1}X_k^T V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_1r_1}}V_k^{-1}X_k\big(X_k^T V_k^{-1}X_k\big)^{-1}X_k^T V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_2r_2}}V_k^{-1}X_k\Big) \\
&- \frac12\mathrm{tr}\Big(\big(X_k^T V_k^{-1}X_k\big)^{-1}X_k^T V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_1r_1}}V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_2r_2}}V_k^{-1}X_k\Big) \\
&- \frac12\mathrm{tr}\Big(\big(X_k^T V_k^{-1}X_k\big)^{-1}X_k^T V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_2r_2}}V_k^{-1}\frac{\partial V_k}{\partial\gamma_{s_1r_1}}V_k^{-1}X_k\Big) \\
&+ \frac12\mathrm{tr}\Big(\big(X_k^T V_k^{-1}X_k\big)^{-1}X_k^T V_k^{-1}\frac{\partial^2 V_k}{\partial\gamma_{s_1r_1}\partial\gamma_{s_2r_2}}V_k^{-1}X_k\Big)\bigg\}.
\end{aligned}$$

Finally, we give the expressions for $\partial V_k/\partial\lambda_l$ ($\partial V_k/\partial\lambda_m$), $\partial^2 V_k/\partial\lambda_l\partial\lambda_m$ and $\partial V_k/\partial\gamma_{s_1r_1}$ ($\partial V_k/\partial\gamma_{s_2r_2}$), $\partial^2 V_k/\partial\gamma_{s_1r_1}\partial\gamma_{s_2r_2}$ appearing in the above equations. First,
$$\frac{\partial V_k}{\partial\lambda_l} = \frac{\partial\big(Z_k\Lambda\Gamma\Gamma^T\Lambda Z_k^T + W_k^{-1}\big)}{\partial\lambda_l} = Z_k\frac{\partial\Lambda}{\partial\lambda_l}\Gamma\Gamma^T\Lambda Z_k^T + Z_k\Lambda\Gamma\Gamma^T\frac{\partial\Lambda}{\partial\lambda_l}Z_k^T,$$
where $\partial\Lambda/\partial\lambda_l$ is the $q\times q$ matrix with 1 at the $l$th diagonal entry and 0 at all other entries. Also,
$$\frac{\partial^2 V_k}{\partial\lambda_l\partial\lambda_m} = Z_k\frac{\partial\Lambda}{\partial\lambda_l}\Gamma\Gamma^T\frac{\partial\Lambda}{\partial\lambda_m}Z_k^T + Z_k\frac{\partial\Lambda}{\partial\lambda_m}\Gamma\Gamma^T\frac{\partial\Lambda}{\partial\lambda_l}Z_k^T,$$
where $\partial\Lambda/\partial\lambda_m$ is the $q\times q$ matrix with 1 at the $m$th diagonal entry and 0 at all other entries. If $l \neq m$, it is easy to see that $\partial^2 V_k/\partial\lambda_l\partial\lambda_m$ becomes a $q\times q$ matrix with 0 at all entries; if $l = m$, $\partial^2 V_k/\partial\lambda_l\partial\lambda_m = 2 Z_k U_l Z_k^T$, where $U_l$ is the $q\times q$ matrix with $1 + \sum_{i=1}^{l-1}\gamma_{li}^2$ at the $l$th diagonal entry and 0 at all other entries.
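These closed forms are easy to verify numerically. A finite-difference check of $\partial V_k/\partial\lambda_l$ (our sketch, with randomly generated $Z_k$, $\Lambda$, $\Gamma$ and $W_k^{-1} = I$):

```python
import numpy as np

rng = np.random.default_rng(0)
q, ni = 3, 5
Z = rng.normal(size=(ni, q))
lam = np.array([1.5, 1.0, 0.5])
Gam = np.tril(rng.normal(size=(q, q)), -1) + np.eye(q)  # unit lower triangular
Winv = np.eye(ni)

def V_of(lam):
    Lam = np.diag(lam)
    return Z @ Lam @ Gam @ Gam.T @ Lam @ Z.T + Winv

l, h = 1, 1e-6
E = np.zeros((q, q))
E[l, l] = 1.0                                # dLambda/dlambda_l
Lam = np.diag(lam)
analytic = Z @ E @ Gam @ Gam.T @ Lam @ Z.T + Z @ Lam @ Gam @ Gam.T @ E @ Z.T
lam_p = lam.copy()
lam_p[l] += h
numeric = (V_of(lam_p) - V_of(lam)) / h      # forward difference
assert np.allclose(analytic, numeric, atol=1e-4)
```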

The expression for $\partial V_k/\partial\gamma_{s_1r_1}$ is
$$\frac{\partial V_k}{\partial\gamma_{s_1r_1}} = \frac{\partial\big(Z_k\Lambda\Gamma\Gamma^T\Lambda Z_k^T + W_k^{-1}\big)}{\partial\gamma_{s_1r_1}} = Z_k\Lambda\frac{\partial(\Gamma\Gamma^T)}{\partial\gamma_{s_1r_1}}\Lambda Z_k^T = Z_k\Lambda\frac{\partial\Gamma}{\partial\gamma_{s_1r_1}}\Gamma^T\Lambda Z_k^T + Z_k\Lambda\Gamma\Big(\frac{\partial\Gamma}{\partial\gamma_{s_1r_1}}\Big)^T\Lambda Z_k^T,$$
where $\partial\Gamma/\partial\gamma_{s_1r_1}$ is the $q\times q$ matrix with 1 at the $(s_1, r_1)$ entry and 0 at all other entries. Furthermore, we have
$$\frac{\partial^2 V_k}{\partial\gamma_{s_1r_1}\partial\gamma_{s_2r_2}} = Z_k\Lambda\frac{\partial\Gamma}{\partial\gamma_{s_1r_1}}\Big(\frac{\partial\Gamma}{\partial\gamma_{s_2r_2}}\Big)^T\Lambda Z_k^T + Z_k\Lambda\frac{\partial\Gamma}{\partial\gamma_{s_2r_2}}\Big(\frac{\partial\Gamma}{\partial\gamma_{s_1r_1}}\Big)^T\Lambda Z_k^T.$$
If $r_1 \neq r_2$, $\frac{\partial\Gamma}{\partial\gamma_{s_1r_1}}\big(\frac{\partial\Gamma}{\partial\gamma_{s_2r_2}}\big)^T$ is a zero matrix, and hence $\partial^2 V_k/\partial\gamma_{s_1r_1}\partial\gamma_{s_2r_2}$ is also a zero matrix. If $r_1 = r_2$, $\frac{\partial\Gamma}{\partial\gamma_{s_1r_1}}\big(\frac{\partial\Gamma}{\partial\gamma_{s_2r_2}}\big)^T$ is the matrix with 1 at the $(s_1, s_2)$ entry and 0 at all other entries. In this case, if $s_1 \neq s_2$, $\partial^2 V_k/\partial\gamma_{s_1r_1}\partial\gamma_{s_2r_2}$ becomes a zero matrix since $\Lambda$ is a diagonal matrix; if $s_1 = s_2$, $\partial^2 V_k/\partial\gamma_{s_1r_1}\partial\gamma_{s_2r_2} = 2 Z_k S_{s_1} Z_k^T$, where $S_{s_1}$ is the matrix with $\lambda_{s_1}^2$ at the $s_1$th diagonal entry and 0 at all other entries.


References

Ahn, M., Zhang, H.H., Lu, W.: Moment-based method for random effects selection in linear mixed models. Stat. Sin. 22, 1539–1562 (2012)

Bondell, H.D., Krishna, A., Ghosh, S.K.: Joint variable selection for fixed and random effects in linear mixed models. Biometrics 66, 1069–1077 (2010)

Breiman, L.: Heuristics of instability and stabilization in model selection. Ann. Stat. 24, 2350–2383 (1996)

Breslow, N.E., Clayton, D.G.: Approximate inference in generalized linear mixed models. J. Am. Stat. Assoc. 88, 9–25 (1993)

Breslow, N.E., Lin, X.H.: Bias correction in generalised linear mixed models with a single component of dispersion. Biometrika 82, 81–91 (1995)

Chen, Z., Dunson, D.B.: Random effects selection in linear mixed models. Biometrics 59, 762–769 (2003)

Fahrmeir, L., Kneib, T., Konrath, S.: Bayesian regularisation in structured additive regression: a unifying perspective on shrinkage, smoothing and predictor selection. Stat. Comput. 20, 203–219 (2010)

Fan, J.Q., Li, R.Z.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001)

Fan, Y., Li, R.Z.: Variable selection in linear mixed effects models. Ann. Stat. 40, 2043–2068 (2012)

Frank, I.E., Friedman, J.H.: A statistical view of some chemometrics regression tools (with discussion). Technometrics 35, 109–148 (1993)

Groll, A., Tutz, G.: Variable selection for generalized linear mixed models by L1-penalized estimation. Stat. Comput. (2013). doi:10.1007/s11222-012-9359-z

Groll, A., Tutz, G.: Regularization for generalized additive mixed models by likelihood-based boosting. Methods Inf. Med. 51, 168–177 (2012)

Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55–67 (1970a)

Hoerl, A.E., Kennard, R.W.: Ridge regression: applications to nonorthogonal problems. Technometrics 12, 69–82 (1970b)

Ibrahim, J.G., Zhu, H., Garcia, R.I., Guo, R.: Fixed and random effects selection in mixed effects models. Biometrics 67, 495–503 (2010)

Kaslow, R.A., Ostrow, D.G., Detels, R., et al.: The Multicenter AIDS Cohort Study: rationale, organization and selected characteristics of the participants. Am. J. Epidemiol. 126, 310–318 (1987)

Kinney, S.K., Dunson, D.B.: Fixed and random effects selection in linear and logistic models. Biometrics 63, 690–698 (2007)

Lin, B., Pang, Z., Jiang, J.: Fixed and random effects selection by REML and pathwise coordinate optimization. J. Comput. Graph. Stat. (2012). doi:10.1080/10618600.2012.681219

Lin, X.H.: Estimation using penalized quasilikelihood and quasi-pseudo-likelihood in Poisson mixed models. Lifetime Data Anal. 13, 533–544 (2007)

Lin, X.H., Breslow, N.E.: Bias correction in generalized linear mixed models with multiple components of dispersion. J. Am. Stat. Assoc. 91, 1007–1016 (1996)

Pan, J., Thompson, R.: Quasi-Monte Carlo estimation in generalized linear mixed models. Comput. Stat. Data Anal. 51, 5765–5775 (2007)

Patterson, H.D., Thompson, R.: Recovery of inter-block information when block sizes are unequal. Biometrika 58, 545–554 (1971)

Schall, R.: Estimation in generalized linear models with random effects. Biometrika 78, 719–727 (1991)

Schelldorfer, J., Bühlmann, P.: GLMMLasso: an algorithm for high-dimensional generalized linear mixed models using L1-penalization. Preprint, ETH Zurich (2011). http://stat.ethz.ch/people/schell

Stiratelli, R., Laird, N.M., Ware, J.H.: Random effects models for serial observations with binary response. Biometrics 40, 961–971 (1984)

Tibshirani, R.J.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267–288 (1996)

Ye, H.J., Pan, J.X.: Modelling covariance structures in generalized estimating equations for longitudinal data. Biometrika 93, 927–941 (2006)

Zeger, S.L., Liang, K., Albert, P.S.: Models for longitudinal data: a generalized estimating equation approach. Biometrics 44, 1049–1060 (1988)

Zou, H.: The adaptive LASSO and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429 (2006)