
Bayesian Model Choice and Information Criteria in Sparse Generalized Linear Models

Mathias Drton

Department of Statistics, University of Chicago

(Paper with this title: Rina Foygel & M.D., arXiv:1112.5635)

Outline

1 BIC and extensions

2 Asymptotics for marginal likelihood of GLMs

3 Consistency for GLMs

4 Ising models


BIC and extensions

Bayesian information criterion (BIC)

Sample $Y_1, \dots, Y_n$

Parametric model $\mathcal{M}$ with maximized log-likelihood $\hat\ell(\mathcal{M})$

Bayesian information criterion (Schwarz, 1978):

$$\mathrm{BIC}(\mathcal{M}) := \hat\ell(\mathcal{M}) - \frac{\dim(\mathcal{M})}{2}\log n$$

‘Generic’ model selection approach:

Maximize BIC(M) over set of considered models

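As a concrete illustration of this generic approach (a minimal sketch, not from the slides; the data, the subset search, and the `gaussian_bic` helper are illustrative):

```python
import numpy as np
from itertools import combinations

def gaussian_bic(X, y, cols):
    """BIC of a Gaussian linear model on the covariates in `cols` (plus intercept);
    the error variance is profiled out of the maximized log-likelihood."""
    n = len(y)
    Xj = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xj, y, rcond=None)
    rss = float(np.sum((y - Xj @ beta) ** 2))
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    dim = Xj.shape[1] + 1                      # regression coefficients + error variance
    return loglik - 0.5 * dim * np.log(n)

rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n)

# maximize BIC over all covariate subsets of size <= 3
candidates = [list(c) for k in range(4) for c in combinations(range(p), k)]
print("BIC pick:", max(candidates, key=lambda c: gaussian_bic(X, y, c)))
```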

Motivation: 1) Bayesian model choice

Posterior model probability in fully Bayesian treatment:

$$P(\mathcal{M} \mid Y_1,\dots,Y_n) \;\propto\; \underbrace{P(\mathcal{M})}_{\text{prior}} \; P(Y_1,\dots,Y_n \mid \mathcal{M}).$$

Marginal likelihood:

$$L_n(\mathcal{M}) := P(Y_1,\dots,Y_n \mid \mathcal{M}) = \int \underbrace{P(Y_1,\dots,Y_n \mid \theta_{\mathcal{M}},\mathcal{M})}_{\text{likelihood}} \, d\underbrace{P(\theta_{\mathcal{M}} \mid \mathcal{M})}_{\text{prior}}$$


Motivation: 2) Asymptotics

$Y_1, \dots, Y_n$ i.i.d. sample from $P_0 \in \mathcal{M}$

Theorem (Schwarz, 1978; Haughton, 1988; and others)

Assume $P(\theta_{\mathcal{M}} \mid \mathcal{M})$ is a 'nice' prior on $\mathbb{R}^d$. Then in 'nice' models,

$$\log L_n(\mathcal{M}) = \hat\ell_n(\mathcal{M}) - \frac{d}{2}\log n + O_p(1),$$

and a better (Laplace) approximation is possible:

$$\log L_n(\mathcal{M}) = \hat\ell_n(\mathcal{M}) - \frac{d}{2}\log\Big(\frac{n}{2\pi}\Big) + \log P(\hat\theta_{\mathcal{M}} \mid \mathcal{M}) - \frac{1}{2}\log\det\Big[\frac{1}{n}\,\mathrm{Hessian}(\hat\theta_{\mathcal{M}})\Big] + O_p\big(n^{-1/2}\big)$$

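To see the approximation numerically, here is a minimal Python sketch (not from the slides) comparing the exact log marginal likelihood with the BIC approximation in a conjugate Gaussian location model; the prior variance `tau2` and the data-generating values are illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(1)
n = 500
sigma2, tau2 = 1.0, 4.0        # known error variance; prior variance of the mean
y = rng.normal(loc=0.3, scale=np.sqrt(sigma2), size=n)

# Model M: Y_i ~ N(theta, sigma2) with prior theta ~ N(0, tau2),
# so marginally Y ~ N(0, sigma2 * I + tau2 * 11^T) and log L_n(M) is exact.
cov = sigma2 * np.eye(n) + tau2 * np.ones((n, n))
log_marginal = multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(y)

# BIC approximation: maximized log-likelihood minus (d/2) log n, with d = 1.
loglik_hat = norm(loc=y.mean(), scale=np.sqrt(sigma2)).logpdf(y).sum()
bic = loglik_hat - 0.5 * np.log(n)

print(f"exact log marginal likelihood: {log_marginal:.2f}")
print(f"BIC approximation:             {bic:.2f}")   # differs by an O_p(1) remainder
```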

Consistency

Theorem

Fix a finite set of 'nice' models. Then BIC selects a true model of smallest dimension with probability tending to one as $n \to \infty$.

Proof.

Finite set of models $\Longrightarrow$ pairwise comparisons suffice.

If $P_0 \in \mathcal{M}_1 \subsetneq \mathcal{M}_2$ and $d_1 < d_2$, then

$$\hat\ell_n(\mathcal{M}_2) - \hat\ell_n(\mathcal{M}_1) = O_p(1), \quad\text{while}\quad (d_2 - d_1)\log n \to \infty.$$

If $P_0 \in \mathcal{M}_1 \setminus \mathrm{clos}(\mathcal{M}_2)$, then with probability tending to one,

$$\frac{1}{n}\big[\hat\ell_n(\mathcal{M}_1) - \hat\ell_n(\mathcal{M}_2)\big] > \varepsilon > 0, \quad\text{while}\quad \log(n)/n \to 0.$$


Linear regression (covariates i.i.d. N(0, 1), φ1 = 1, σ = 2)


BIC in higher-dimensional linear regression

Exhaustive search up to 6 covariates

[Figure: probability of correct selection versus $p$, with $n = p$, $\sigma = 1$, $k = 2$, $\phi_1 = \phi_2 = 1$.]


Higher-dimensional linear regression: BIC selects too-large models

[Figure: simulation with $\sigma = 1$, $k = 2$, $\phi_1 = \phi_2 = 1$.]

Broman & Speed (2002)


Informative prior on models in higher-dimensional regression

[Figure: simulation with $\sigma = 1$, $k = 2$, $\phi_1 = \phi_2 = 1$.]

Informative prior on models in higher-dimensional regression

Exhaustive search up to 6 covariates

[Figure: probability of correct selection versus $p$ for BIC and EBIC, with $n = p$, $\sigma = 1$, $k = 2$, $\phi_1 = \phi_2 = 1$.]

Extended Bayesian information criterion

Linear regression

Models given by subsets of covariates $J \subset [p] := \{1, \dots, p\}$

Prior on models

$$P(J) = \frac{1}{p+1} \cdot \frac{1}{\binom{p}{|J|}}$$

under which $k = \#\text{covariates}$ and $J$ given $k$ are uniformly distributed.

Extended BIC defined as

$$\mathrm{EBIC}(J) = \mathrm{BIC}(J) - |J|\log p;$$

we have $|J| \ll p$ in mind.

Bogdan et al. (2004), Chen & Chen (2008), Scott and Berger (2010), . . .

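A minimal sketch of this penalty in code (not from the slides; the Gaussian-model BIC mirrors the helper in the earlier sketch, and `gamma` generalizes the $|J|\log p$ term above, which corresponds to $\gamma = 1$):

```python
import numpy as np

def ebic_gaussian(X, y, cols, gamma=1.0):
    """Extended BIC for a Gaussian linear model on the covariates in `cols`:
    ordinary BIC minus an additional gamma * |J| * log(p) penalty."""
    n, p = X.shape
    Xj = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xj, y, rcond=None)
    rss = float(np.sum((y - Xj @ beta) ** 2))
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    bic = loglik - 0.5 * (Xj.shape[1] + 1) * np.log(n)
    return bic - gamma * len(cols) * np.log(p)
```

Maximizing `ebic_gaussian` instead of the plain BIC over candidate subsets corresponds to the informative model prior $P(J)$ above, which down-weights each additional covariate by roughly a factor of $p$.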

Existing theory: consistency of EBIC

Chen & Chen '08 — high-dimensional sparse linear regression (fixed design, number of active covariates bounded).

Chen & Chen '11 — generalized linear models (fixed design, canonical link).

Chen et al. '11 — generalizations for fixed-design regression.

Gao et al. '10 — Gaussian graphical models.

Foygel & D. '10 — Gaussian graphical models (adjust penalty for number of graphs).


Questions

Bayesian connection under high-dimensional asymptotics:

- Laplace approximation to marginal likelihood accurate uniformly over a growing number of models?

- EBIC captures growth of marginal likelihood?

Consistency for random designs?

Consistency for pseudo-likelihood approaches to graphical model selection?

Consistency of fully Bayesian model choice as corollaries?

Shang & Clayton (2011)


Asymptotics for marginal likelihood of GLMs

Generalized linear model: Setup

Independent (response) observations $Y_1, \dots, Y_n$

Distribution of $Y_i \sim p_{\theta_i}$ from a univariate exponential family:

$$p_\theta(y) \propto \exp\{y \cdot \theta - b(\theta)\}, \qquad \theta \in \Theta = \mathbb{R}.$$

Linearity:

$$\theta = (\theta_1, \dots, \theta_n)^\top = X\phi, \qquad \phi \in \mathbb{R}^p,$$

for design matrix $X = (X_{ij}) \in \mathbb{R}^{n \times p}$ (rows = experiments, columns = covariates).

Random design with $X_{1\bullet}, \dots, X_{n\bullet}$ i.i.d.

Variable selection: find the support $J^* \subset [p]$ of the true parameter $\phi^*$.

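For example, logistic regression fits this setup with $b(\theta) = \log(1 + e^\theta)$. A minimal sketch of the resulting log-likelihood under a random design (the data and the helper name are illustrative):

```python
import numpy as np

def glm_loglik_logistic(phi, X, y):
    """Log-likelihood sum_i [ y_i * theta_i - b(theta_i) ] of a logistic GLM,
    with natural parameter theta = X @ phi and b(theta) = log(1 + exp(theta))."""
    theta = X @ phi
    return float(np.sum(y * theta - np.logaddexp(0.0, theta)))

# toy random design with true support J* = {0, 1}
rng = np.random.default_rng(0)
n, p = 300, 10
X = rng.standard_normal((n, p))
phi_star = np.zeros(p)
phi_star[[0, 1]] = [1.0, -1.0]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ phi_star)))

print(glm_loglik_logistic(phi_star, X, y))
```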

Assumptions

(A) Bounded covariates (or a moment condition).

(B1) Subexponential growth of dimension: $\log(p_n) = o(n)$.

(B2) Dimension of the smallest true model bounded by a fixed $q \in \mathbb{N}$.

(B3) Small sets of covariates have second-moment matrices with minimal eigenvalue bounded away from zero:

$$\lambda_{\min}\big(E[X_{1J} X_{1J}^\top]\big) > a > 0 \quad \text{for all } |J| \le 2q.$$

(B4) Norm of the signal $\|\phi^*\|_2$ bounded.


Theorem (Laplace approximation)

Assume (A), (B1)-(B4) and 'nice priors' $(f_J : J \subset [p],\ |J| \le q)$. Then there is a constant $C$ such that the marginal likelihood sequence $L_n(J)$ satisfies

$$\log L_n(J) = \ell_n(\hat\phi_J) - \frac{|J|}{2}\log(n) + \log f_J(\hat\phi_J) + \frac{|J|}{2}\log(2\pi) - \frac{1}{2}\log\det\Big(\frac{1}{n}\,\mathrm{Hessian}_J(\hat\phi_J)\Big) \pm C\,\sqrt{\frac{\log(np)}{n}} \quad \text{for all } |J| \le q,$$

with probability tending to 1 as $n \to \infty$.

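A numerical sketch of the right-hand side for a logistic GLM (not the authors' code): compute the MLE $\hat\phi_J$ and the observed information, then plug into the formula, here with an illustrative standard normal prior $f_J$. The helper name and data are hypothetical, and $\mathrm{Hessian}_J$ is taken to be the observed information, i.e. the negative Hessian of $\ell_n$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def log_laplace_marginal(XJ, y):
    """Laplace approximation to log L_n(J) for a logistic GLM with an
    illustrative N(0, I) prior on phi_J."""
    n, d = XJ.shape

    def negloglik(phi):
        theta = XJ @ phi
        return -np.sum(y * theta - np.logaddexp(0.0, theta))

    opt = minimize(negloglik, np.zeros(d), method="BFGS")
    phi_hat, loglik_hat = opt.x, -opt.fun

    # observed information: negative Hessian of the log-likelihood at the MLE
    prob = 1.0 / (1.0 + np.exp(-XJ @ phi_hat))
    info = XJ.T @ (XJ * (prob * (1.0 - prob))[:, None])

    log_prior = multivariate_normal(mean=np.zeros(d), cov=np.eye(d)).logpdf(phi_hat)
    _, logdet = np.linalg.slogdet(info / n)
    return (loglik_hat - 0.5 * d * np.log(n) + log_prior
            + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet)

# toy data with true support J* = {0, 1} among p = 20 covariates
rng = np.random.default_rng(2)
n, p = 400, 20
X = rng.standard_normal((n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X[:, 0] - X[:, 1]))))
for J in ([0], [0, 1], [0, 1, 2]):
    print(J, round(log_laplace_marginal(X[:, J], y), 2))
```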

EBIC approximation

EBIC (with parameter $\gamma \ge 0$):

$$\mathrm{EBIC}_\gamma(J) = \ell_n(\hat\phi_J) - \frac{|J|}{2}\log(n) - \gamma\,|J|\log(p).$$

Corollary

Assume (A), (B1)-(B4) and 'nice priors' $(f_J : J \subset [p],\ |J| \le q)$. Adopt the unnormalized model prior

$$P_\gamma(J) = \binom{p}{|J|}^{-\gamma} \cdot 1\{|J| \le q\}.$$

Then there is a constant $C'$ such that with probability tending to 1 as $n \to \infty$, we have

$$\Big|\log\big[P_\gamma(J, Y)\big] - \mathrm{EBIC}_\gamma(J)\Big| \le C' \quad \text{for all } |J| \le q.$$

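The model prior is exactly what turns the Bayesian quantity into the extra EBIC penalty: for $|J| \ll p$, $\log\binom{p}{|J|} \approx |J|\log p$, so $\log P_\gamma(J) \approx -\gamma|J|\log p$. A small numerical check of this step (the values of $p$, $|J|$ and $\gamma$ are illustrative):

```python
import numpy as np
from scipy.special import gammaln

def log_model_prior(p, k, gamma):
    """log of the unnormalized model prior P_gamma(J) = binom(p, |J|)^(-gamma) for |J| = k."""
    log_binom = gammaln(p + 1) - gammaln(k + 1) - gammaln(p - k + 1)
    return -gamma * log_binom

p, k, gamma = 1000, 3, 0.5
print(log_model_prior(p, k, gamma))   # exact log prior contribution
print(-gamma * k * np.log(p))         # the EBIC-style approximation -gamma * |J| * log(p)
```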

Laplace approximation to marginal likelihood

$$\int_{\mathbb{R}^J} \exp\big(\ell_n(\hat\phi_J + \gamma)\big) \cdot f_J(\hat\phi_J + \gamma)\, d\gamma$$

Taylor series:

$$\ell_n(\hat\phi_J + \gamma) = \ell_n(\hat\phi_J) - \frac{1}{2}\,\gamma^\top \mathrm{Hessian}_J(\hat\phi_J + t_\gamma \gamma)\,\gamma$$

Approximation by Gaussian integral:

$$f_J(\hat\phi_J) \cdot \int_{\mathbb{R}^J} \exp\big(\ell_n(\hat\phi_J)\big) \cdot \exp\Big(-\frac{1}{2}\,\gamma^\top \mathrm{Hessian}_J(\hat\phi_J)\,\gamma\Big)\, d\gamma = f_J(\hat\phi_J) \cdot \exp\big(\ell_n(\hat\phi_J)\big) \cdot \sqrt{\Big(\frac{2\pi}{n}\Big)^{|J|} \det\Big(\frac{1}{n}\,\mathrm{Hessian}_J(\hat\phi_J)\Big)^{-1}}$$

Laplace approximation to marginal likelihood

$$\int_{\mathbb{R}^J} \exp\Big(\underbrace{\ell_n(\hat\phi_J) - \tfrac{1}{2}\,\gamma^\top \mathrm{Hessian}_J(\hat\phi_J + t_\gamma \gamma)\,\gamma}_{\approx\; \ell_n(\hat\phi_J) - \frac{1}{2}\gamma^\top \mathrm{Hessian}_J(\hat\phi_J)\,\gamma}\Big)\, d\gamma$$

[Figure (shown over several animation steps): sketch of the log-integrand, maximized at $\gamma = 0$, i.e. $\phi = \hat\phi_J$, with value $\ell_n(\hat\phi_J)$; the regions $\|\gamma\|_2 \le \sqrt{\log(p)/n}$ and $\|\gamma\|_2 \le 1$ are marked.]

Assumptions on priors

Family of priors $(f_J : J \subset [p],\ |J| \le q)$ is 'nice' if for constants $0 < F_1, F_2, F_3 < \infty$ we have, uniformly for all $|J| \le q$:

(i) an upper bound: $\sup_{\phi_J} f_J(\phi_J) \le F_1 < \infty$,

(ii) a lower bound over a compact set: $\inf_{\|\phi_J\|_2 \le R+1} f_J(\phi_J) \ge F_2 > 0$, where $R$ is a function of the constants in (A) & (B1)-(B4),

(iii) a Lipschitz property on the same compact set: $\sup_{\|\phi_J\|_2 \le R+1} \|\nabla f_J(\phi_J)\|_2 \le F_3 < \infty$.


Consistency for GLMs

(B5) Small true coefficients don't decay too fast:

$$\sqrt{\frac{\log(n p_n)}{n}} = o\Big(\min\big\{|\phi^*_j| : j \in J^*\big\}\Big).$$

Theorem (EBIC consistency in GLM)

Assume (A), (B1)-(B5). Let

$$\kappa = \lim_{n\to\infty} \frac{\log p_n}{\log n} \in [0, \infty],$$

and take $\gamma > 1 - \frac{1}{2\kappa}$. Then with probability tending to 1 as $n \to \infty$, we have

$$\mathrm{EBIC}_\gamma(J^*) - \max_{J \ne J^*,\ |J| \le q} \mathrm{EBIC}_\gamma(J) \;\ge\; \log(p) \cdot C_{\mathrm{high}} + \log(n) \cdot C_{\mathrm{low}}$$

for constants $C_{\mathrm{high}}, C_{\mathrm{low}} > 0$.


EBIC approximates Bayesian model choice

Corollary (Consistency of Bayesian model choice)

Assume (A), (B1)-(B5) and 'nice priors'. Then with probability tending to 1 as $n \to \infty$, we have

$$P_\gamma(J^* \mid Y) \;>\; \max_{J \ne J^*,\ |J| \le q} P_\gamma(J \mid Y).$$


Experiment for sparse logistic regression (with lasso)

Spambase data from UCI Machine Learning Data Repository

$n_0 = 4601$ emails, $p_0 = 57$ covariates

Downsample to $n < n_0$ experiments.

Create $p - p_0$ noise covariates by random permutation.

Total number of covariates $p$ satisfies $p/n = p_0/25 \approx 2.28$.

Select a model from the lasso path using EBIC, cross-validation and stability selection (Meinshausen & Bühlmann, 2010).

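A minimal sketch of selecting a model from the lasso path with EBIC (not the authors' code; the synthetic data, the regularization grid, and $\gamma = 0.5$ are illustrative, and the score is computed from the penalized fit rather than an unpenalized refit of each support):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, p, gamma = 300, 120, 0.5
X = rng.standard_normal((n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 1.5 * X[:, 1]))))

best_support, best_score = None, -np.inf
for C in np.logspace(-2, 1, 20):                  # grid over the l1-regularization path
    fit = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X, y)
    prob = np.clip(fit.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
    k = np.count_nonzero(fit.coef_[0])            # size of the selected support
    score = loglik - 0.5 * k * np.log(n) - gamma * k * np.log(p)   # EBIC_gamma
    if score > best_score:
        best_support, best_score = np.flatnonzero(fit.coef_[0]), score

print("EBIC-selected covariates:", best_support)
```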

Positive selection and false discovery rate

[Figure: positive selection rate (PSR) and false discovery rate (FDR) versus number of samples (100 to 600), for BIC$_0$, BIC$_{0.25}$, BIC$_{0.5}$, BIC$_1$, cross-validation, and stability selection.]

Comparison to full data

[Figure; x-axis: p-value of the feature in the full regression (sample size 4601); y-axis: smoothed probability of selection (subsample size 600); curves for BIC$_0$, BIC$_{0.25}$, BIC$_{0.5}$, BIC$_1$, cross-validation, and stability selection.]

Figure: Smoothed probability of selecting a true feature, as a function of the p-value of that feature in the full regression.


Ising models

Ising model

Observe i.i.d. $X^{(1)}, \dots, X^{(n)} \in \{0,1\}^p$

Likelihood function:

$$\frac{1}{Z(\Theta)} \cdot \exp\Big\{\sum_j \Theta_{j0} x_j + \sum_{j<k} \Theta_{jk} x_j x_k\Big\},$$

with normalizing constant $Z(\Theta)$ and (sparse) potential matrix $\Theta$.

Full conditional for $X_j$ is proportional to

$$\exp\Big\{x_j \cdot \Big(\Theta_{j0} + \sum_{k \ne j} \Theta_{jk} x_k\Big)\Big\}$$

Model selection problem: find the support $E^*$ (the 'graph') of the true potential matrix $\Theta^*$.

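The full conditional shows that, given the other coordinates, $X_j$ follows a logistic regression with linear predictor $\Theta_{j0} + \sum_{k\ne j}\Theta_{jk}x_k$, which is what the node-wise (pseudo-likelihood) approach on the next slide exploits. A minimal sketch (storing the field terms $\Theta_{j0}$ on the diagonal is an illustrative convention):

```python
import numpy as np

def conditional_prob(j, x, Theta):
    """P(X_j = 1 | X_{-j} = x_{-j}) in the Ising model with potential matrix Theta;
    Theta is symmetric, with Theta[j, j] holding the field term Theta_{j0}."""
    eta = Theta[j, j] + np.dot(np.delete(Theta[j], j), np.delete(x, j))
    return 1.0 / (1.0 + np.exp(-eta))

# small example: a 3-node chain 1 - 2 - 3 with positive interactions
Theta = np.array([[0.0, 1.0, 0.0],
                  [1.0, -0.5, 1.0],
                  [0.0, 1.0, 0.0]])
x = np.array([1, 0, 1])
print(conditional_prob(1, x, Theta))   # probability that X_2 = 1 given X_1 = 1, X_3 = 1
```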

Neighborhood selection for sparse Ising models

For each $X_j$, select its neighborhood via the lasso:

$$\hat\Theta^{(\lambda)}_{j\bullet} = \arg\max \Big[\ell_{X_j \mid X_{-j}}\big(\Theta_{j\bullet}\big) - \lambda \cdot \sum_{k \ne j} |\Theta_{jk}|\Big]$$

(Meinshausen & Bühlmann, 2006; Ravikumar et al., 2010)

How to choose $\lambda$, i.e., the neighborhood from each path?

Cross-validation tends to select too-large neighborhoods.

Apply EBIC:

- Let $E_{j,\lambda}$ be the edges incident to $j$ in the support of $\hat\Theta^{(\lambda)}_{j\bullet}$.

- Maximize

$$\ell_{X_j \mid X_{-j}}\big(\hat\Theta^{(\lambda)}_{j\bullet}\big) - \frac{|E_{j,\lambda}|}{2}\log(n) - |E_{j,\lambda}| \cdot \gamma \log(p)$$

with respect to $\lambda$.


Consistency of EBIC for Ising model selection

Theorem

Consider subexponential growth of $p = p_n$ with

$$\kappa = \lim_{n\to\infty} \frac{\log p_n}{\log n} \in [0, \infty].$$

Assume

- all neighborhood sizes are bounded by a constant,

- $\sqrt{\frac{\log(np)}{n}} \ll |\Theta^*_{jk}| \le$ a constant, for all edges $(j, k)$.

Take $\gamma > 2 - \frac{1}{2\kappa}$. Then with probability tending to 1 as $n \to \infty$:

EBIC$_\gamma$ selects the right neighborhood for every $X_j$.

Follows from the consistency of EBIC for GLMs with random covariates.


Precipitation data (U.S. Historical Climatology Network)

89 weather stations

measure precipitation (1 or 0) on 278 (nonconsecutive) dates

discard the locations of the weather stations; can we recover the geographical layout?

[Figure: map of the 89 weather stations, longitude −96 to −86, latitude 36 to 42.]


[Figure: estimated graphs drawn over the station locations; one panel each for BIC, Extended BIC, Cross-validation, and Stability selection; $\gamma = 0.25$.]

Edge selection vs distance

[Figure: smoothed probability of selecting an edge versus distance between weather stations (miles), for BIC, extended BIC, cross-validation, and stability selection.]

Conclusion

The Laplace approximation can be accurate uniformly over a large number of sparse GLMs.

Chen & Chen's extended Bayesian information criterion (EBIC):

- is connected to Bayesian model choice;

- its consistency yields consistency of 'generic' Bayesian procedures;

- is a computationally inexpensive alternative to stability selection and other resampling methods;

- seems useful for tuning regularization methods.

For details including references, see:

Bayesian model choice and information criteria in sparse generalized linear models (with Rina Foygel). arXiv:1112.5635

