
REGULARIZED HORSESHOE

Aki Vehtari

Aalto University, Finland
aki.vehtari@aalto.fi

@avehtari

Joint work with Juho Piironen


Outline

- Large p, small n regression
- Prior on weights vs. prior on shrinkage
- Horseshoe prior
- Regularized horseshoe prior

Large p, small n regression

Linear or generalized linear regression
- number of covariates p
- number of observations n

Large p, small n is common, e.g.
- in modern medical/bioinformatics studies (e.g. microarrays, GWAS)
- brain imaging
- in our examples p is around 1e2–1e5, and usually n < 100

Large p, small n regression

- If noiseless observations, we can fit a uniquely identified regression in n − 1 dimensions
- If noisy observations, more complicated
- If correlating covariates, more complicated

Large p, small n regression

Priors!
- Non-sparse priors assume most covariates are relevant, but may have strong correlations → factor models
- Sparse priors assume only a small number of covariates are effectively non-zero, m_eff ≪ n

Example

[Figure: posterior intervals for (Intercept), x1–x20, and sigma under a Gaussian prior (left, x-axis −2 to 2) and a horseshoe prior (right, x-axis −2 to 3); made with rstanarm + bayesplot. A fitting sketch follows below.]
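The slides do not include the fitting code; the following is a minimal sketch of how such a comparison could be set up with rstanarm and bayesplot, assuming a data frame d with response y and 20 covariates (d, fitg, and fiths are hypothetical names, not from the slides):

```r
library(rstanarm)   # Bayesian regression via Stan
library(bayesplot)  # posterior visualization

# Gaussian prior: rstanarm's default weakly informative normal prior
fitg <- stan_glm(y ~ ., data = d, family = gaussian())

# Horseshoe prior via rstanarm's hierarchical shrinkage family
fiths <- stan_glm(y ~ ., data = d, family = gaussian(), prior = hs())

# Interval plots like those on the slide
mcmc_intervals(as.matrix(fitg))
mcmc_intervals(as.matrix(fiths))
```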


Example

Gaussian vs. horseshoe predictive performance using cross-validation (loo package; more in Friday's model selection tutorial):

```r
> compare(loog, loohs)
elpd_diff        se
      7.9       2.8
```
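A sketch (my addition) of how these objects could be produced, assuming the fitg and fiths fits from the earlier sketch; compare() is the older loo interface shown on the slide, and a positive elpd_diff favors the second model, here the horseshoe:

```r
library(loo)

loog  <- loo(fitg)    # PSIS-LOO for the Gaussian-prior model
loohs <- loo(fiths)   # PSIS-LOO for the horseshoe model

# Difference in expected log predictive density, with standard error
compare(loog, loohs)
```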

Large p, small n regression

Sparse priors assume only a small number of covariates are effectively non-zero, m_eff ≪ p

- Laplace prior ("Bayesian lasso"): computationally convenient (continuous and log-concave), but not really sparse
- Spike-and-slab (with point-mass at zero): prior on the number of non-zero covariates, discrete
- Horseshoe and hierarchical shrinkage priors: prior on the amount of shrinkage, continuous

Carvalho et al. (2009)

Prior on shrinkage

The slope of the prior at a specific value determines the amount of shrinkage

Carvalho et al. (2009)

Spike-and-slab vs horseshoe prior

A spike-and-slab prior (with a point-mass at zero) is a mix of a continuous prior and a probability mass at zero
- the parameter space is a mixture of continuous and discrete

Hierarchical shrinkage and horseshoe priors are continuous

Piironen and Vehtari (2017a)

Horseshoe prior

Linear regression model with covariates x = (x₁, …, x_D):

$$y_i = \beta^\top x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n$$

The horseshoe prior:

$$\beta_j \mid \lambda_j, \tau \sim N(0, \lambda_j^2 \tau^2), \qquad \lambda_j \sim C^+(0, 1), \quad j = 1, \ldots, D$$

- The global parameter τ shrinks all β_j towards zero
- The local parameters λ_j allow some β_j to escape the shrinkage (see the sampling sketch below)

[Figure: marginal prior density p(β₁), sharply peaked at zero with heavy tails (left), and joint prior draws of (β₁, β₂) concentrated along the axes (right)]
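A minimal base-R sketch (my addition, not from the slides) of drawing β₁ from this prior, which reproduces the sharp peak at zero with heavy tails:

```r
set.seed(1)
tau <- 1
lambda <- abs(rcauchy(1e5))            # local scales, half-Cauchy C+(0, 1)
beta1 <- rnorm(1e5, 0, tau * lambda)   # beta_j | lambda_j, tau ~ N(0, lambda_j^2 tau^2)

# Histogram of the prior draws, truncated for display
hist(beta1[abs(beta1) < 3], breaks = 100, freq = FALSE,
     main = "Horseshoe prior on beta", xlab = expression(beta[1]))
```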


Horseshoe prior

Given the hyperparameters, the posterior mean approximately satisfies

$$\bar{\beta}_j = (1 - \kappa_j)\,\beta_j^{\mathrm{ML}}, \qquad \kappa_j = \frac{1}{1 + n\sigma^{-2}\tau^2\lambda_j^2},$$

where κ_j is the shrinkage factor.

With λ_j ∼ C⁺(0, 1), the prior for κ_j looks like:

[Figure: prior density of κ on (0, 1); the slide animation varies nσ⁻²τ² from 1.0 down to 0.1, piling more prior mass near κ = 1 as τ decreases. A simulation sketch follows below.]

We expect both
- relevant (β̄_j ≈ β_j^ML) features
- irrelevant (β̄_j ≈ 0) features

Small τ ⇒ more coefficients ≈ 0
How to specify the prior for τ?

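A simulation sketch (my addition) of this prior on κ; with n = 100, σ = 1, τ = 0.1 we get nσ⁻²τ² = 1, the classic horseshoe case in which the prior on κ is the U-shaped Beta(1/2, 1/2):

```r
set.seed(1)
n <- 100; sigma <- 1; tau <- 0.1
lambda <- abs(rcauchy(1e5))                         # C+(0, 1) local scales
kappa <- 1 / (1 + n * sigma^-2 * tau^2 * lambda^2)  # shrinkage factors

# U-shaped: mass near 0 (no shrinkage) and near 1 (full shrinkage)
hist(kappa, breaks = 50, freq = FALSE,
     main = "Prior of the shrinkage factor", xlab = expression(kappa))
```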

The global shrinkage parameter τ

Effective number of nonzero coefficients:

$$m_{\mathrm{eff}} = \sum_{j=1}^{D} (1 - \kappa_j)$$

The prior mean can be shown to be

$$\mathrm{E}[m_{\mathrm{eff}} \mid \tau, \sigma] = \frac{\tau\sigma^{-1}\sqrt{n}}{1 + \tau\sigma^{-1}\sqrt{n}}\, D$$

Setting E[m_eff | τ, σ] = p₀ (the prior guess for the number of nonzero coefficients) yields

$$\tau_0 = \frac{p_0}{D - p_0} \frac{\sigma}{\sqrt{n}}$$

⇒ a prior guess for τ based on our beliefs about the sparsity (see the helper sketched below)

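A one-line helper (my addition) implementing the reference value:

```r
# tau0 = p0 / (D - p0) * sigma / sqrt(n)
tau0 <- function(p0, D, sigma, n) p0 / (D - p0) * sigma / sqrt(n)

tau0(p0 = 5, D = 1000, sigma = 1, n = 100)  # about 5e-4
```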

Illustration p(τ) vs. p(meff)

Let n = 100, σ = 1, p₀ = 5, τ₀ = p₀/(D − p₀) · σ/√n, and D the dimensionality.

p(m_eff) with different choices of p(τ):

[Figure: prior p(m_eff) for D = 10 (top row) and D = 1000 (bottom row) under τ = τ₀, τ ∼ N⁺(0, τ₀²), τ ∼ C⁺(0, τ₀²), and τ ∼ C⁺(0, 1); the x-axis ranges widen from 0–10 up to 0–1000 across the panels, i.e. with D = 1000 the C⁺(0, 1) prior puts substantial mass on hundreds of effectively nonzero coefficients. A simulation sketch follows below.]

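A simulation sketch (my addition) of p(m_eff) for one of these choices, reading C⁺(0, τ₀²) as a half-Cauchy with scale τ₀:

```r
set.seed(1)
n <- 100; sigma <- 1; D <- 1000; p0 <- 5
tau0 <- p0 / (D - p0) * sigma / sqrt(n)

meff <- replicate(2000, {
  tau <- abs(rcauchy(1, scale = tau0))   # tau ~ C+(0, tau0^2)
  lambda <- abs(rcauchy(D))              # lambda_j ~ C+(0, 1)
  kappa <- 1 / (1 + n * sigma^-2 * tau^2 * lambda^2)
  sum(1 - kappa)                         # m_eff = sum_j (1 - kappa_j)
})
hist(meff, breaks = 50, main = "p(m_eff)", xlab = "m_eff")
```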

Non-Gaussian observation models

The reference value:

$$\tau_0 = \frac{p_0}{D - p_0} \frac{\sigma}{\sqrt{n}}$$

The framework can also be applied to non-Gaussian observation models by deriving appropriate plug-in values for σ:
- Gaussian approximation to the likelihood
- e.g. σ = 2 for logistic regression (see the sketch below)
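For instance, using the tau0() helper sketched earlier (my addition):

```r
# Plug-in sigma = 2 for logistic regression
tau0(p0 = 5, D = 1000, sigma = 2, n = 100)  # about 1e-3
```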

Regularized horseshoe

- The horseshoe allows some coefficients to be completely unregularized; this allows complete separation in a logistic model with n ≪ p
- The regularized horseshoe adds an additional wide slab, and maintains the division into relevant and non-relevant variables (formula sketched below)
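For reference, a math sketch of the regularized horseshoe as defined in Piironen and Vehtari (2017a), with slab scale c (the formula is not spelled out on this slide):

$$\beta_j \mid \lambda_j, \tau, c \sim N(0, \tau^2 \tilde{\lambda}_j^2), \qquad \tilde{\lambda}_j^2 = \frac{c^2 \lambda_j^2}{c^2 + \tau^2 \lambda_j^2}, \qquad \lambda_j \sim C^+(0, 1)$$

When τ²λ_j² ≪ c² this behaves like the horseshoe, and when τ²λ_j² ≫ c² the prior on β_j approaches N(0, c²), i.e. even the largest coefficients are regularized by the slab.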


Horseshoe vs regularized horseshoe

The regularized horseshoe helps to regularize the relevant variables

Regularized horseshoe in rstanarm

Easy in rstanarm (thanks to Ben Goodrich); here d stands for the data frame and sigmaguess for a guess of the expected value of σ:

```r
p0 <- 5
tau0 <- p0 / (D - p0) * sigmaguess / sqrt(n)
fit <- stan_glm(y ~ x, gaussian(), data = d,
                prior = hs(global_scale = tau0, slab_scale = 2.5,
                           slab_df = 4))
```

Note: rstanarm does not condition on σ, and thus we need to scale tau0 with a guess of the expected value of σ
- luckily the result is not sensitive to the exact value

Note 2: the hs() prior is called a "hierarchical shrinkage" prior, as it is an extension of the horseshoe (the horseshoe has local_df = 1)
- luckily the result is not sensitive to the exact value
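A variant sketch (my addition) tying this to the logistic-regression plug-in σ = 2 from earlier; y is assumed binary here:

```r
p0 <- 5
tau0 <- p0 / (D - p0) * 2 / sqrt(n)   # plug-in sigma = 2
fitb <- stan_glm(y ~ x, binomial(), data = d,
                 prior = hs(global_scale = tau0, slab_scale = 2.5,
                            slab_df = 4))
```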


Regularized horseshoe in rstanarm

Simulated regression example (data sketch below)
- n = 100, p = 200, true p₀ = 7
- Gaussian vs. "Bayesian LASSO" vs. regularized horseshoe
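A sketch (my addition) of generating data matching these dimensions:

```r
set.seed(1)
n <- 100; p <- 200; p0 <- 7
x <- matrix(rnorm(n * p), n, p)             # p covariates
beta <- c(rnorm(p0, 0, 2), rep(0, p - p0))  # only p0 truly nonzero
y <- as.vector(x %*% beta + rnorm(n))
d <- data.frame(y = y, x = x)
```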


Experiments

Table: Summary of the real-world datasets; D denotes the number of predictors and n the dataset size.

Dataset    Type            D     n
Ovarian    Classification  1536  54
Colon      Classification  2000  62
Prostate   Classification  5966  102
ALLAML     Classification  7129  72

Horseshoe vs regularized horseshoe

Regularized horseshoe helps to reduce the number of divergences, too


Example

[Figure: posterior intervals for (Intercept), x1–x20, and sigma under the horseshoe prior, x-axis −2 to 3]

Even if the horseshoe shrinks a lot, the coefficient posterior has uncertainty and is not exactly zero.

Tomorrow in the model selection tutorial:
- how to select the most relevant variables
- how to do the inference after the selection while taking into account the uncertainties in the full model

Summary of regularized horseshoe prior

- As sparse as the horseshoe, but more robust inference and computation
- Better performance than LASSO and Bayesian LASSO

References (with code examples for Stan included):

- Juho Piironen and Aki Vehtari (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2):5018–5051. https://projecteuclid.org/euclid.ejs/1513306866
- Juho Piironen and Aki Vehtari (2017). On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR 54:905–913. http://proceedings.mlr.press/v54/piironen17a.html
- Juho Piironen and Aki Vehtari (2018). Iterative supervised principal components. Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, accepted for publication. https://arxiv.org/abs/1710.06229
- See also the model selection tutorial with some notebooks using the regularized horseshoe: https://github.com/avehtari/modelselection_tutorial
