
Bayesian Model Choice and Information Criteria in Sparse Generalized Linear Models

Mathias Drton

Department of Statistics, University of Chicago

(Paper with this title: Rina Foygel & M.D., arXiv:1112.5635)

Outline

1 BIC and extensions

2 Asymptotics for marginal likelihood of GLMs

3 Consistency for GLMs

4 Ising models


BIC and extensions

Bayesian information criterion (BIC)

Sample $Y_1, \dots, Y_n$

Parametric model $\mathcal{M}$ with maximized log-likelihood $\hat\ell(\mathcal{M})$

Bayesian information criterion (Schwarz, 1978):

$$\mathrm{BIC}(\mathcal{M}) := \hat\ell(\mathcal{M}) - \frac{\dim(\mathcal{M})}{2}\log n$$

‘Generic’ model selection approach:

Maximize BIC(M) over set of considered models

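As a concrete illustration of this generic approach (a minimal sketch, not from the slides; the data, the subset search, and the `gaussian_bic` helper are illustrative):

```python
import numpy as np
from itertools import combinations

def gaussian_bic(X, y, cols):
    """BIC of a Gaussian linear model on the covariates in `cols` (plus intercept);
    the error variance is profiled out of the maximized log-likelihood."""
    n = len(y)
    Xj = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xj, y, rcond=None)
    rss = float(np.sum((y - Xj @ beta) ** 2))
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    dim = Xj.shape[1] + 1                      # regression coefficients + error variance
    return loglik - 0.5 * dim * np.log(n)

rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n)

# maximize BIC over all covariate subsets of size <= 3
candidates = [list(c) for k in range(4) for c in combinations(range(p), k)]
print("BIC pick:", max(candidates, key=lambda c: gaussian_bic(X, y, c)))
```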

Motivation: 1) Bayesian model choice

Posterior model probability in fully Bayesian treatment:

$$P(\mathcal{M} \mid Y_1,\dots,Y_n) \;\propto\; \underbrace{P(\mathcal{M})}_{\text{prior}} \; P(Y_1,\dots,Y_n \mid \mathcal{M}).$$

Marginal likelihood:

$$L_n(\mathcal{M}) := P(Y_1,\dots,Y_n \mid \mathcal{M}) = \int \underbrace{P(Y_1,\dots,Y_n \mid \theta_{\mathcal{M}},\mathcal{M})}_{\text{likelihood}} \, d\underbrace{P(\theta_{\mathcal{M}} \mid \mathcal{M})}_{\text{prior}}$$


Motivation: 2) Asymptotics

$Y_1, \dots, Y_n$ i.i.d. sample from $P_0 \in \mathcal{M}$

Theorem (Schwarz, 1978; Haughton, 1988; and others)

Assume $P(\theta_{\mathcal{M}} \mid \mathcal{M})$ is a 'nice' prior on $\mathbb{R}^d$. Then in 'nice' models,

$$\log L_n(\mathcal{M}) = \hat\ell_n(\mathcal{M}) - \frac{d}{2}\log n + O_p(1),$$

and a better (Laplace) approximation is possible:

$$\log L_n(\mathcal{M}) = \hat\ell_n(\mathcal{M}) - \frac{d}{2}\log\Big(\frac{n}{2\pi}\Big) + \log P(\hat\theta_{\mathcal{M}} \mid \mathcal{M}) - \frac{1}{2}\log\det\Big[\frac{1}{n}\,\mathrm{Hessian}(\hat\theta_{\mathcal{M}})\Big] + O_p\big(n^{-1/2}\big)$$

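To see the approximation numerically, here is a minimal Python sketch (not from the slides) comparing the exact log marginal likelihood with the BIC approximation in a conjugate Gaussian location model; the prior variance `tau2` and the data-generating values are illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(1)
n = 500
sigma2, tau2 = 1.0, 4.0        # known error variance; prior variance of the mean
y = rng.normal(loc=0.3, scale=np.sqrt(sigma2), size=n)

# Model M: Y_i ~ N(theta, sigma2) with prior theta ~ N(0, tau2),
# so marginally Y ~ N(0, sigma2 * I + tau2 * 11^T) and log L_n(M) is exact.
cov = sigma2 * np.eye(n) + tau2 * np.ones((n, n))
log_marginal = multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(y)

# BIC approximation: maximized log-likelihood minus (d/2) log n, with d = 1.
loglik_hat = norm(loc=y.mean(), scale=np.sqrt(sigma2)).logpdf(y).sum()
bic = loglik_hat - 0.5 * np.log(n)

print(f"exact log marginal likelihood: {log_marginal:.2f}")
print(f"BIC approximation:             {bic:.2f}")   # differs by an O_p(1) remainder
```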

Consistency

Theorem

Fix a finite set of 'nice' models. Then BIC selects a true model of smallest dimension with probability tending to one as $n \to \infty$.

Proof.

Finite set of models $\Longrightarrow$ pairwise comparisons suffice.

If $P_0 \in \mathcal{M}_1 \subsetneq \mathcal{M}_2$ and $d_1 < d_2$, then

$$\hat\ell_n(\mathcal{M}_2) - \hat\ell_n(\mathcal{M}_1) = O_p(1), \quad\text{while}\quad (d_2 - d_1)\log n \to \infty.$$

If $P_0 \in \mathcal{M}_1 \setminus \mathrm{clos}(\mathcal{M}_2)$, then with probability tending to one,

$$\frac{1}{n}\big[\hat\ell_n(\mathcal{M}_1) - \hat\ell_n(\mathcal{M}_2)\big] > \varepsilon > 0, \quad\text{while}\quad \log(n)/n \to 0.$$


Linear regression (covariates i.i.d. N(0, 1), φ1 = 1, σ = 2)


BIC in higher-dimensional linear regression

Exhaustive search up to 6 covariates

[Figure: probability of correct selection versus $p$, with $n = p$, $\sigma = 1$, $k = 2$, $\phi_1 = \phi_2 = 1$.]


Higher-dimensional linear regression: BIC selects too-large models

[Figure: simulation with $\sigma = 1$, $k = 2$, $\phi_1 = \phi_2 = 1$.]

Broman & Speed (2002)


Informative prior on models in higher-dimensional regression

[Figure: simulation with $\sigma = 1$, $k = 2$, $\phi_1 = \phi_2 = 1$.]

Informative prior on models in higher-dimensional regression

Exhaustive search up to 6 covariates

[Figure: probability of correct selection versus $p$ for BIC and EBIC, with $n = p$, $\sigma = 1$, $k = 2$, $\phi_1 = \phi_2 = 1$.]

Extended Bayesian information criterion

Linear regression

Models given by subsets of covariates $J \subset [p] := \{1, \dots, p\}$

Prior on models

$$P(J) = \frac{1}{p+1} \cdot \frac{1}{\binom{p}{|J|}}$$

under which $k = \#\text{covariates}$ and $J$ given $k$ are uniformly distributed.

Extended BIC defined as

$$\mathrm{EBIC}(J) = \mathrm{BIC}(J) - |J|\log p;$$

we have $|J| \ll p$ in mind.

Bogdan et al. (2004), Chen & Chen (2008), Scott and Berger (2010), . . .

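A minimal sketch of this penalty in code (not from the slides; the Gaussian-model BIC mirrors the helper in the earlier sketch, and `gamma` generalizes the $|J|\log p$ term above, which corresponds to $\gamma = 1$):

```python
import numpy as np

def ebic_gaussian(X, y, cols, gamma=1.0):
    """Extended BIC for a Gaussian linear model on the covariates in `cols`:
    ordinary BIC minus an additional gamma * |J| * log(p) penalty."""
    n, p = X.shape
    Xj = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xj, y, rcond=None)
    rss = float(np.sum((y - Xj @ beta) ** 2))
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    bic = loglik - 0.5 * (Xj.shape[1] + 1) * np.log(n)
    return bic - gamma * len(cols) * np.log(p)
```

Maximizing `ebic_gaussian` instead of the plain BIC over candidate subsets corresponds to the informative model prior $P(J)$ above, which down-weights each additional covariate by roughly a factor of $p$.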

Existing theory: consistency of EBIC

Chen & Chen '08 — high-dimensional sparse linear regression (fixed design, number of active covariates bounded).

Chen & Chen '11 — generalized linear models (fixed design, canonical link).

Chen et al. '11 — generalizations for fixed-design regression.

Gao et al. '10 — Gaussian graphical models.

Foygel & D. '10 — Gaussian graphical models (adjust penalty for number of graphs).


Questions

Bayesian connection under high-dimensional asymptotics:

- Laplace approximation to marginal likelihood accurate uniformly over a growing number of models?

- EBIC captures growth of marginal likelihood?

Consistency for random designs?

Consistency for pseudo-likelihood approaches to graphical model selection?

Consistency of fully Bayesian model choice as corollaries?

Shang & Clayton (2011)


Asymptotics for marginal likelihood of GLMs

Generalized linear model: Setup

Independent (response) observations $Y_1, \dots, Y_n$

Distribution of $Y_i \sim p_{\theta_i}$ from a univariate exponential family:

$$p_\theta(y) \propto \exp\{y \cdot \theta - b(\theta)\}, \qquad \theta \in \Theta = \mathbb{R}.$$

Linearity:

$$\theta = (\theta_1, \dots, \theta_n)^\top = X\phi, \qquad \phi \in \mathbb{R}^p,$$

for design matrix $X = (X_{ij}) \in \mathbb{R}^{n \times p}$ (rows = experiments, columns = covariates).

Random design with $X_{1\bullet}, \dots, X_{n\bullet}$ i.i.d.

Variable selection: find the support $J^* \subset [p]$ of the true parameter $\phi^*$.

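For example, logistic regression fits this setup with $b(\theta) = \log(1 + e^\theta)$. A minimal sketch of the resulting log-likelihood under a random design (the data and the helper name are illustrative):

```python
import numpy as np

def glm_loglik_logistic(phi, X, y):
    """Log-likelihood sum_i [ y_i * theta_i - b(theta_i) ] of a logistic GLM,
    with natural parameter theta = X @ phi and b(theta) = log(1 + exp(theta))."""
    theta = X @ phi
    return float(np.sum(y * theta - np.logaddexp(0.0, theta)))

# toy random design with true support J* = {0, 1}
rng = np.random.default_rng(0)
n, p = 300, 10
X = rng.standard_normal((n, p))
phi_star = np.zeros(p)
phi_star[[0, 1]] = [1.0, -1.0]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ phi_star)))

print(glm_loglik_logistic(phi_star, X, y))
```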

Assumptions

(A) Bounded covariates (or a moment condition).

(B1) Subexponential growth of dimension: $\log(p_n) = o(n)$.

(B2) Dimension of the smallest true model bounded by a fixed $q \in \mathbb{N}$.

(B3) Small sets of covariates have second-moment matrices with minimal eigenvalue bounded away from zero:

$$\lambda_{\min}\big(E[X_{1J} X_{1J}^\top]\big) > a > 0 \quad \text{for all } |J| \le 2q.$$

(B4) Norm of the signal $\|\phi^*\|_2$ bounded.


Theorem (Laplace approximation)

Assume (A), (B1)-(B4) and 'nice priors' $(f_J : J \subset [p],\ |J| \le q)$. Then there is a constant $C$ such that the marginal likelihood sequence $L_n(J)$ satisfies

$$\log L_n(J) = \ell_n(\hat\phi_J) - \frac{|J|}{2}\log(n) + \log f_J(\hat\phi_J) + \frac{|J|}{2}\log(2\pi) - \frac{1}{2}\log\det\Big(\frac{1}{n}\,\mathrm{Hessian}_J(\hat\phi_J)\Big) \pm C\,\sqrt{\frac{\log(np)}{n}} \quad \text{for all } |J| \le q,$$

with probability tending to 1 as $n \to \infty$.

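A numerical sketch of the right-hand side for a logistic GLM (not the authors' code): compute the MLE $\hat\phi_J$ and the observed information, then plug into the formula, here with an illustrative standard normal prior $f_J$. The helper name and data are hypothetical, and $\mathrm{Hessian}_J$ is taken to be the observed information, i.e. the negative Hessian of $\ell_n$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def log_laplace_marginal(XJ, y):
    """Laplace approximation to log L_n(J) for a logistic GLM with an
    illustrative N(0, I) prior on phi_J."""
    n, d = XJ.shape

    def negloglik(phi):
        theta = XJ @ phi
        return -np.sum(y * theta - np.logaddexp(0.0, theta))

    opt = minimize(negloglik, np.zeros(d), method="BFGS")
    phi_hat, loglik_hat = opt.x, -opt.fun

    # observed information: negative Hessian of the log-likelihood at the MLE
    prob = 1.0 / (1.0 + np.exp(-XJ @ phi_hat))
    info = XJ.T @ (XJ * (prob * (1.0 - prob))[:, None])

    log_prior = multivariate_normal(mean=np.zeros(d), cov=np.eye(d)).logpdf(phi_hat)
    _, logdet = np.linalg.slogdet(info / n)
    return (loglik_hat - 0.5 * d * np.log(n) + log_prior
            + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet)

# toy data with true support J* = {0, 1} among p = 20 covariates
rng = np.random.default_rng(2)
n, p = 400, 20
X = rng.standard_normal((n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X[:, 0] - X[:, 1]))))
for J in ([0], [0, 1], [0, 1, 2]):
    print(J, round(log_laplace_marginal(X[:, J], y), 2))
```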

EBIC approximation

EBIC (with parameter $\gamma \ge 0$):

$$\mathrm{EBIC}_\gamma(J) = \ell_n(\hat\phi_J) - \frac{|J|}{2}\log(n) - \gamma\,|J|\log(p).$$

Corollary

Assume (A), (B1)-(B4) and 'nice priors' $(f_J : J \subset [p],\ |J| \le q)$. Adopt the unnormalized model prior

$$P_\gamma(J) = \binom{p}{|J|}^{-\gamma} \cdot 1\{|J| \le q\}.$$

Then there is a constant $C'$ such that with probability tending to 1 as $n \to \infty$, we have

$$\Big|\log\big[P_\gamma(J, Y)\big] - \mathrm{EBIC}_\gamma(J)\Big| \le C' \quad \text{for all } |J| \le q.$$

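The model prior is exactly what turns the Bayesian quantity into the extra EBIC penalty: for $|J| \ll p$, $\log\binom{p}{|J|} \approx |J|\log p$, so $\log P_\gamma(J) \approx -\gamma|J|\log p$. A small numerical check of this step (the values of $p$, $|J|$ and $\gamma$ are illustrative):

```python
import numpy as np
from scipy.special import gammaln

def log_model_prior(p, k, gamma):
    """log of the unnormalized model prior P_gamma(J) = binom(p, |J|)^(-gamma) for |J| = k."""
    log_binom = gammaln(p + 1) - gammaln(k + 1) - gammaln(p - k + 1)
    return -gamma * log_binom

p, k, gamma = 1000, 3, 0.5
print(log_model_prior(p, k, gamma))   # exact log prior contribution
print(-gamma * k * np.log(p))         # the EBIC-style approximation -gamma * |J| * log(p)
```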

Laplace approximation to marginal likelihood

$$\int_{\mathbb{R}^J} \exp\big(\ell_n(\hat\phi_J + \gamma)\big) \cdot f_J(\hat\phi_J + \gamma)\, d\gamma$$

Taylor series:

$$\ell_n(\hat\phi_J + \gamma) = \ell_n(\hat\phi_J) - \frac{1}{2}\,\gamma^\top \mathrm{Hessian}_J(\hat\phi_J + t_\gamma \gamma)\,\gamma$$

Approximation by Gaussian integral:

$$f_J(\hat\phi_J) \cdot \int_{\mathbb{R}^J} \exp\big(\ell_n(\hat\phi_J)\big) \cdot \exp\Big(-\frac{1}{2}\,\gamma^\top \mathrm{Hessian}_J(\hat\phi_J)\,\gamma\Big)\, d\gamma = f_J(\hat\phi_J) \cdot \exp\big(\ell_n(\hat\phi_J)\big) \cdot \sqrt{\Big(\frac{2\pi}{n}\Big)^{|J|} \det\Big(\frac{1}{n}\,\mathrm{Hessian}_J(\hat\phi_J)\Big)^{-1}}$$

Laplace approximation to marginal likelihood

$$\int_{\mathbb{R}^J} \exp\Big(\underbrace{\ell_n(\hat\phi_J) - \tfrac{1}{2}\,\gamma^\top \mathrm{Hessian}_J(\hat\phi_J + t_\gamma \gamma)\,\gamma}_{\approx\; \ell_n(\hat\phi_J) - \frac{1}{2}\gamma^\top \mathrm{Hessian}_J(\hat\phi_J)\,\gamma}\Big)\, d\gamma$$

[Figure (shown over several animation steps): sketch of the log-integrand, maximized at $\gamma = 0$, i.e. $\phi = \hat\phi_J$, with value $\ell_n(\hat\phi_J)$; the regions $\|\gamma\|_2 \le \sqrt{\log(p)/n}$ and $\|\gamma\|_2 \le 1$ are marked.]

Assumptions on priors

Family of priors $(f_J : J \subset [p],\ |J| \le q)$ is 'nice' if for constants $0 < F_1, F_2, F_3 < \infty$ we have, uniformly for all $|J| \le q$:

(i) an upper bound: $\sup_{\phi_J} f_J(\phi_J) \le F_1 < \infty$,

(ii) a lower bound over a compact set: $\inf_{\|\phi_J\|_2 \le R+1} f_J(\phi_J) \ge F_2 > 0$, where $R$ is a function of the constants in (A) & (B1)-(B4),

(iii) a Lipschitz property on the same compact set: $\sup_{\|\phi_J\|_2 \le R+1} \|\nabla f_J(\phi_J)\|_2 \le F_3 < \infty$.


Consistency for GLMs

(B5) Small true coefficients don't decay too fast:

$$\sqrt{\frac{\log(n p_n)}{n}} = o\Big(\min\big\{|\phi^*_j| : j \in J^*\big\}\Big).$$

Theorem (EBIC consistency in GLM)

Assume (A), (B1)-(B5). Let

$$\kappa = \lim_{n\to\infty} \frac{\log p_n}{\log n} \in [0, \infty],$$

and take $\gamma > 1 - \frac{1}{2\kappa}$. Then with probability tending to 1 as $n \to \infty$, we have

$$\mathrm{EBIC}_\gamma(J^*) - \max_{J \ne J^*,\ |J| \le q} \mathrm{EBIC}_\gamma(J) \;\ge\; \log(p) \cdot C_{\mathrm{high}} + \log(n) \cdot C_{\mathrm{low}}$$

for constants $C_{\mathrm{high}}, C_{\mathrm{low}} > 0$.


EBIC approximates Bayesian model choice

Corollary (Consistency of Bayesian model choice)

Assume (A), (B1)-(B5) and 'nice priors'. Then with probability tending to 1 as $n \to \infty$, we have

$$P_\gamma(J^* \mid Y) \;>\; \max_{J \ne J^*,\ |J| \le q} P_\gamma(J \mid Y).$$


Experiment for sparse logistic regression (with lasso)

Spambase data from UCI Machine Learning Data Repository

$n_0 = 4601$ emails, $p_0 = 57$ covariates

Downsample to $n < n_0$ experiments.

Create $p - p_0$ noise covariates by random permutation.

Total number of covariates $p$ satisfies $p/n = p_0/25 \approx 2.28$.

Select a model from the lasso path using EBIC, cross-validation and stability selection (Meinshausen & Bühlmann, 2010).

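A minimal sketch of selecting a model from the lasso path with EBIC (not the authors' code; the synthetic data, the regularization grid, and $\gamma = 0.5$ are illustrative, and the score is computed from the penalized fit rather than an unpenalized refit of each support):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, p, gamma = 300, 120, 0.5
X = rng.standard_normal((n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 1.5 * X[:, 1]))))

best_support, best_score = None, -np.inf
for C in np.logspace(-2, 1, 20):                  # grid over the l1-regularization path
    fit = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X, y)
    prob = np.clip(fit.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
    k = np.count_nonzero(fit.coef_[0])            # size of the selected support
    score = loglik - 0.5 * k * np.log(n) - gamma * k * np.log(p)   # EBIC_gamma
    if score > best_score:
        best_support, best_score = np.flatnonzero(fit.coef_[0]), score

print("EBIC-selected covariates:", best_support)
```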

Positive selection and false discovery rate

[Figure: positive selection rate (PSR) and false discovery rate (FDR) versus number of samples (100 to 600), for BIC$_0$, BIC$_{0.25}$, BIC$_{0.5}$, BIC$_1$, cross-validation, and stability selection.]

Comparison to full data

[Figure; x-axis: p-value of the feature in the full regression (sample size 4601); y-axis: smoothed probability of selection (subsample size 600); curves for BIC$_0$, BIC$_{0.25}$, BIC$_{0.5}$, BIC$_1$, cross-validation, and stability selection.]

Figure: Smoothed probability of selecting a true feature, as a function of the p-value of that feature in the full regression.


Ising models

Ising model

Observe i.i.d. $X^{(1)}, \dots, X^{(n)} \in \{0,1\}^p$

Likelihood function:

$$\frac{1}{Z(\Theta)} \cdot \exp\Big\{\sum_j \Theta_{j0} x_j + \sum_{j<k} \Theta_{jk} x_j x_k\Big\},$$

with normalizing constant $Z(\Theta)$ and (sparse) potential matrix $\Theta$.

Full conditional for $X_j$ is proportional to

$$\exp\Big\{x_j \cdot \Big(\Theta_{j0} + \sum_{k \ne j} \Theta_{jk} x_k\Big)\Big\}$$

Model selection problem: find the support $E^*$ (the 'graph') of the true potential matrix $\Theta^*$.

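The full conditional shows that, given the other coordinates, $X_j$ follows a logistic regression with linear predictor $\Theta_{j0} + \sum_{k\ne j}\Theta_{jk}x_k$, which is what the node-wise (pseudo-likelihood) approach on the next slide exploits. A minimal sketch (storing the field terms $\Theta_{j0}$ on the diagonal is an illustrative convention):

```python
import numpy as np

def conditional_prob(j, x, Theta):
    """P(X_j = 1 | X_{-j} = x_{-j}) in the Ising model with potential matrix Theta;
    Theta is symmetric, with Theta[j, j] holding the field term Theta_{j0}."""
    eta = Theta[j, j] + np.dot(np.delete(Theta[j], j), np.delete(x, j))
    return 1.0 / (1.0 + np.exp(-eta))

# small example: a 3-node chain 1 - 2 - 3 with positive interactions
Theta = np.array([[0.0, 1.0, 0.0],
                  [1.0, -0.5, 1.0],
                  [0.0, 1.0, 0.0]])
x = np.array([1, 0, 1])
print(conditional_prob(1, x, Theta))   # probability that X_2 = 1 given X_1 = 1, X_3 = 1
```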

Neighborhood selection for sparse Ising models

For each $X_j$, select its neighborhood via the lasso:

$$\hat\Theta^{(\lambda)}_{j\bullet} = \arg\max \Big[\ell_{X_j \mid X_{-j}}\big(\Theta_{j\bullet}\big) - \lambda \cdot \sum_{k \ne j} |\Theta_{jk}|\Big]$$

(Meinshausen & Bühlmann, 2006; Ravikumar et al., 2010)

How to choose $\lambda$, i.e., the neighborhood from each path?

Cross-validation tends to select too-large neighborhoods.

Apply EBIC:

- Let $E_{j,\lambda}$ be the edges incident to $j$ in the support of $\hat\Theta^{(\lambda)}_{j\bullet}$.

- Maximize

$$\ell_{X_j \mid X_{-j}}\big(\hat\Theta^{(\lambda)}_{j\bullet}\big) - \frac{|E_{j,\lambda}|}{2}\log(n) - |E_{j,\lambda}| \cdot \gamma \log(p)$$

with respect to $\lambda$.


Consistency of EBIC for Ising model selection

Theorem

Consider subexponential growth of $p = p_n$ with

$$\kappa = \lim_{n\to\infty} \frac{\log p_n}{\log n} \in [0, \infty].$$

Assume

- all neighborhood sizes are bounded by a constant,

- $\sqrt{\frac{\log(np)}{n}} \ll |\Theta^*_{jk}| \le$ a constant, for all edges $(j, k)$.

Take $\gamma > 2 - \frac{1}{2\kappa}$. Then with probability tending to 1 as $n \to \infty$:

EBIC$_\gamma$ selects the right neighborhood for every $X_j$.

Follows from the consistency of EBIC for GLMs with random covariates.


Precipitation data (U.S. Historical Climatology Network)

89 weather stations

measure precipitation (1 or 0) on 278 (nonconsecutive) dates

discard the locations of the weather stations; can we recover the geographical layout?

[Figure: map of the 89 weather stations, longitude −96 to −86, latitude 36 to 42.]


[Figure: estimated graphs drawn over the station locations; one panel each for BIC, Extended BIC, Cross-validation, and Stability selection; $\gamma = 0.25$.]

Edge selection vs distance

[Figure: smoothed probability of selecting an edge versus distance between weather stations (miles), for BIC, extended BIC, cross-validation, and stability selection.]

Conclusion

The Laplace approximation can be accurate uniformly over a large number of sparse GLMs.

Chen & Chen's extended Bayesian information criterion (EBIC):

- is connected to Bayesian model choice;

- its consistency yields consistency of 'generic' Bayesian procedures;

- is a computationally inexpensive alternative to stability selection and other resampling methods;

- seems useful for tuning regularization methods.

For details including references, see:

Bayesian model choice and information criteria in sparse generalized linear models (with Rina Foygel). arXiv:1112.5635

