
REGULARIZED HORSESHOE

Aki Vehtari

Aalto University, Finland
aki.vehtari@aalto.fi
@avehtari

Joint work with Juho Piironen

1 / 24

Outline

- Large p, small n regression
- Prior on weights vs. prior on shrinkage
- Horseshoe prior
- Regularized horseshoe prior

2 / 24

Large p, small n regression

- Linear or generalized linear regression
  - number of covariates p
  - number of observations n
- Large p, small n is common e.g.
  - in modern medical/bioinformatics studies (e.g. microarrays, GWAS)
  - in brain imaging
- In our examples p is around 1e2–1e5, and usually n < 100

3 / 24

Large p, small n regression

- If noiseless observations, we can fit a uniquely identified regression in n − 1 dimensions
- If noisy observations, more complicated
- If correlating covariates, more complicated

4 / 24


Large p, small n regression

Priors!

- Non-sparse priors assume most covariates are relevant, but may have strong correlations → factor models
- Sparse priors assume only a small number of covariates are effectively non-zero, meff ≪ n

5 / 24


Example

[Figure: posterior intervals for (Intercept), x1–x20, and sigma under a Gaussian prior and under a horseshoe prior; plotted with rstanarm + bayesplot]

6 / 24


Example

Gaussian vs. Horseshoe predictive performance using cross-validation (loo package, more in Friday's Model selection tutorial):

> compare(loog, loohs)
elpd_diff        se
      7.9       2.8

7 / 24
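As context for the output above, a minimal sketch of how such a comparison can be produced; the data frame df and the prior guesses are hypothetical, and recent loo versions use loo_compare() rather than the older compare():

library(rstanarm)
library(loo)

# Hypothetical data frame df with outcome y and predictors x1, ..., xD
n  <- nrow(df)
D  <- ncol(df) - 1
p0 <- 5                                 # prior guess: number of relevant covariates
tau0 <- p0 / (D - p0) * 1 / sqrt(n)     # sigma guessed as 1 here

fitg  <- stan_glm(y ~ ., data = df, family = gaussian(),
                  prior = normal(0, 1))
fiths <- stan_glm(y ~ ., data = df, family = gaussian(),
                  prior = hs(global_scale = tau0))

loog  <- loo(fitg)                      # PSIS-LOO for the Gaussian-prior model
loohs <- loo(fiths)                     # PSIS-LOO for the horseshoe model
loo_compare(loog, loohs)                # elpd difference and its standard error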

Large p, small n regression

- Sparse priors assume only a small number of covariates are effectively non-zero, meff ≪ p
- Laplace prior ("Bayesian lasso")
  - computationally convenient (continuous and log-concave), but not really sparse
- Spike-and-slab (with point mass at zero)
  - prior on the number of non-zero covariates, discrete
- Horseshoe and hierarchical shrinkage priors
  - prior on the amount of shrinkage, continuous

Carvalho et al (2009)

8 / 24

Prior on shrinkage

The slope of the prior at a specific value determines the amount of shrinkage

Carvalho et al (2009)

9 / 24

Spike-and-slab vs horseshoe prior

- A spike-and-slab prior (with point mass at zero) is a mix of a continuous prior and a probability mass at zero
  - the parameter space is a mixture of continuous and discrete
- Hierarchical shrinkage and horseshoe priors are continuous

Piironen and Vehtari (2017a)

10 / 24

Horseshoe prior

Linear regression model with covariates x = (x_1, \dots, x_D):

\[ y_i = \beta^\top x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathrm{N}(0, \sigma^2), \quad i = 1, \dots, n \]

The horseshoe prior:

\[ \beta_j \mid \lambda_j, \tau \sim \mathrm{N}(0, \lambda_j^2 \tau^2), \qquad \lambda_j \sim \mathrm{C}^+(0, 1), \quad j = 1, \dots, D \]

- The global parameter τ shrinks all β_j towards zero
- The local parameters λ_j allow some β_j to escape the shrinkage

[Figure: marginal prior density p(β_1) and joint prior of (β_1, β_2) under the horseshoe]

11 / 24
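A quick way to see this shape (my addition, base R only) is to sample from the prior directly; the half-Cauchy is obtained as the absolute value of a Cauchy draw:

set.seed(1)
n_draws <- 1e5
tau <- 1                                  # global scale fixed at 1 for illustration
lambda <- abs(rcauchy(n_draws))           # lambda_j ~ C+(0, 1)
beta <- rnorm(n_draws, mean = 0, sd = lambda * tau)  # beta_j ~ N(0, lambda_j^2 tau^2)
# Tall spike at zero plus very heavy tails, as in the p(beta_1) panel
hist(beta[abs(beta) < 3], breaks = 200, freq = FALSE,
     xlab = "beta", main = "Horseshoe prior draws")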


Horseshoe prior

Given the hyperparameters, the posterior mean satisfies approximately

\[ \bar{\beta}_j = (1 - \kappa_j)\, \beta_j^{\mathrm{ML}}, \qquad \kappa_j = \frac{1}{1 + n \sigma^{-2} \tau^2 \lambda_j^2}, \]

where κ_j is the shrinkage factor

With λ_j ∼ C+(0, 1), the prior for κ_j looks like:

[Figure: prior density of the shrinkage factor κ on (0, 1), shown in the animation for nσ⁻²τ² from 1.0 down to 0.1, with mass near both κ = 0 and κ = 1]

We expect both
- relevant (β̄_j ≈ β_j^ML) features
- irrelevant (β̄_j ≈ 0) features

Small τ ⇒ more coefficients ≈ 0

How to specify a prior for τ?

12 / 24
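A short simulation (my addition) reproduces this picture: for nσ⁻²τ² = 1 the implied prior on κ is Beta(1/2, 1/2), which is where the name "horseshoe" comes from:

set.seed(2)
a <- 1.0                                  # a = n * sigma^-2 * tau^2; the slides vary a from 1.0 down to 0.1
lambda <- abs(rcauchy(1e5))               # lambda_j ~ C+(0, 1)
kappa <- 1 / (1 + a * lambda^2)           # shrinkage factor kappa_j
# For a = 1 this is Beta(1/2, 1/2): mass piles up near both 0 (no shrinkage)
# and 1 (complete shrinkage); smaller a moves mass towards kappa = 1
hist(kappa, breaks = 100, freq = FALSE,
     xlab = "kappa", main = "Prior on the shrinkage factor")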


The global shrinkage parameter τ

Effective number of non-zero coefficients:

\[ m_{\mathrm{eff}} = \sum_{j=1}^{D} (1 - \kappa_j) \]

The prior mean can be shown to be

\[ \mathrm{E}[m_{\mathrm{eff}} \mid \tau, \sigma] = \frac{\tau \sigma^{-1} \sqrt{n}}{1 + \tau \sigma^{-1} \sqrt{n}}\, D \]

Setting E[m_eff | τ, σ] = p_0 (prior guess for the number of non-zero coefficients) yields for τ

\[ \tau_0 = \frac{p_0}{D - p_0} \frac{\sigma}{\sqrt{n}} \]

⇒ Prior guess for τ based on our beliefs about the sparsity

13 / 24
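These two formulas are easy to turn into helper functions (my addition, hypothetical names):

meff_prior_mean <- function(tau, sigma, n, D) {
  t <- tau * sigma^-1 * sqrt(n)
  t / (1 + t) * D                          # E[m_eff | tau, sigma]
}
tau_ref <- function(p0, D, sigma, n) {
  p0 / (D - p0) * sigma / sqrt(n)          # reference value tau0 solving E[m_eff] = p0
}
# Sanity check: plugging tau0 back in recovers the prior guess p0
t0 <- tau_ref(p0 = 5, D = 1000, sigma = 1, n = 100)
meff_prior_mean(t0, sigma = 1, n = 100, D = 1000)   # 5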


Illustration of p(τ) vs. p(m_eff)

Let n = 100, σ = 1, p_0 = 5, τ_0 = p_0/(D − p_0) · σ/√n, and D = dimensionality.

p(m_eff) with different choices of p(τ):

[Figure: prior densities of m_eff for D = 10 (top row, axes 0–10) and D = 1000 (bottom row), under τ = τ_0, τ ∼ N⁺(0, τ_0²), τ ∼ C⁺(0, τ_0²), and τ ∼ C⁺(0, 1); for D = 1000 the axes extend to 20, 20, 100, and 1000 respectively]

14 / 24
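The same comparison can be simulated (my addition): draw τ from each prior, draw the local scales, and accumulate m_eff = Σ(1 − κ_j):

set.seed(3)
n <- 100; sigma <- 1; p0 <- 5; D <- 1000
t0 <- p0 / (D - p0) * sigma / sqrt(n)
draw_meff <- function(tau) {
  lambda <- abs(rcauchy(D))                          # lambda_j ~ C+(0, 1)
  kappa <- 1 / (1 + n * sigma^-2 * tau^2 * lambda^2)
  sum(1 - kappa)
}
meff_fixed <- replicate(4000, draw_meff(t0))                  # tau = tau0
meff_c01 <- replicate(4000, draw_meff(abs(rcauchy(1))))       # tau ~ C+(0, 1)
# tau = tau0 concentrates p(m_eff) near p0 = 5; tau ~ C+(0, 1) spreads it
# over hundreds of coefficients when D = 1000, as in the rightmost panel
quantile(meff_fixed, c(0.05, 0.5, 0.95))
quantile(meff_c01, c(0.05, 0.5, 0.95))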


Non-Gaussian observation models

The reference value:

\[ \tau_0 = \frac{p_0}{D - p_0} \frac{\sigma}{\sqrt{n}} \]

- The framework can also be applied to non-Gaussian observation models by deriving appropriate plug-in values for σ
- Gaussian approximation to the likelihood, e.g. σ = 2 for logistic regression

15 / 24
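For instance (my addition), with the slide's plug-in σ = 2 for logistic regression and the Colon dimensions from the dataset table later in the deck:

p0 <- 5; D <- 2000; n <- 62               # Colon dataset dimensions
tau0 <- p0 / (D - p0) * 2 / sqrt(n)       # sigma = 2 plug-in for logistic regression
tau0                                      # about 0.00064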

Regularized horseshoe

- HS allows some coefficients to be completely unregularized
  - allows complete separation in a logistic model with n ≪ p
- Regularized horseshoe adds an additional wide slab
  - maintains the division into relevant and non-relevant variables

16 / 24
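The slides leave the formula implicit; as defined in Piironen and Vehtari (2017), the regularized horseshoe replaces the local scale λ_j with a soft-truncated version:

\[ \beta_j \mid \lambda_j, \tau, c \sim \mathrm{N}(0, \tau^2 \tilde{\lambda}_j^2), \qquad \tilde{\lambda}_j^2 = \frac{c^2 \lambda_j^2}{c^2 + \tau^2 \lambda_j^2} \]

For τ²λ_j² ≪ c² this behaves like the original horseshoe, while coefficients that escape the shrinkage feel a Gaussian slab N(0, c²); c² is given an inverse-Gamma prior, corresponding to slab_scale and slab_df in the rstanarm call below.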


Horseshoe vs regularized horseshoe

Regularized horseshoe helps regularize the relevant variables

17 / 24

Regularized horseshoe in rstanarm

Easy in rstanarm (thanks to Ben Goodrich):

p0 <- 5
tau0 <- p0 / (D - p0) * sigmaguess / sqrt(n)
fit <- stan_glm(y ~ x, gaussian(),
                prior = hs(global_scale = tau0, slab_scale = 2.5, slab_df = 4))

- Note: rstanarm does not condition on σ, and thus tau0 needs to be scaled with a guess of the expected value of σ
  - luckily the result is not sensitive to the exact value
- Note 2: the hs() prior is called a "hierarchical shrinkage" prior, as it is an extension of the horseshoe (the horseshoe has local_df = 1)
  - luckily the result is not sensitive to the exact value

18 / 24
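One small follow-up (my addition): prior_summary() prints the priors rstanarm actually applied, which is an easy way to verify the hs() scales after any internal autoscaling:

prior_summary(fit)                        # shows the global/slab scales in use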


Regularized horseshoe in rstanarm

Simulated regression example:
- n = 100, p = 200, true p_0 = 7
- Gaussian vs. "Bayesian LASSO" vs. Reg. Horseshoe

19 / 24


Regularized horseshoe in rstanarm

Simulated regression example: n = 100, p = 200, true p_0 = 7

20 / 24

Experiments

Table: Summary of the real-world datasets; D denotes the number of predictors and n the dataset size.

Dataset    Type            D     n
Ovarian    Classification  1536  54
Colon      Classification  2000  62
Prostate   Classification  5966  102
ALLAML     Classification  7129  72

21 / 24

Horseshoe vs regularized horseshoe

Regularized horseshoe helps to reduce the number of divergences, too

22 / 24


Example

[Figure: posterior intervals for (Intercept), x1–x20, and sigma under the horseshoe prior]

- Even if the horseshoe shrinks a lot, the coefficient posterior has uncertainty and is not exactly zero
- Tomorrow in the Model selection tutorial:
  - how to select the most relevant variables
  - how to do the inference after the selection while taking into account the uncertainties in the full model

23 / 24

Summary of regularized horseshoe prior

- Sparse as the horseshoe, but more robust inference and computation
- Better performance than LASSO and Bayesian LASSO

References (with code examples for Stan included):

Juho Piironen and Aki Vehtari (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2):5018-5051. https://projecteuclid.org/euclid.ejs/1513306866

Juho Piironen and Aki Vehtari (2017). On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR 54:905-913. http://proceedings.mlr.press/v54/piironen17a.html

Juho Piironen and Aki Vehtari (2018). Iterative supervised principal components. Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, accepted for publication. https://arxiv.org/abs/1710.06229

See also the model selection tutorial with some notebooks using the regularized horseshoe: https://github.com/avehtari/modelselection_tutorial

24 / 24