REGULARIZED HORSESHOE
Aki Vehtari
Aalto University, Finland
aki.vehtari@aalto.fi
@avehtari
Joint work with Juho Piironen
Outline
- Large p, small n regression
- Prior on weights vs. prior on shrinkage
- Horseshoe prior
- Regularized horseshoe prior
Large p, small n regression
Linear or generalized linear regression:
- number of covariates p
- number of observations n

Large p, small n is common, e.g.,
- in modern medical/bioinformatics studies (e.g., microarrays, GWAS)
- in brain imaging
In our examples p is around 1e2–1e5, and usually n < 100.
Large p, small n regression
If observations are noiseless, we can fit a uniquely identified regression in n − 1 dimensions.
- If observations are noisy, things are more complicated.
- If covariates are correlated, things are more complicated.
Large p, small n regression
Priors!
- Non-sparse priors assume most covariates are relevant, but may have strong correlations → factor models.
- Sparse priors assume only a small number of covariates are effectively non-zero: m_eff ≪ n.
Example
[Figure: posterior intervals for (Intercept), x1–x20, and sigma; Gaussian prior (left) vs. horseshoe prior (right)]

rstanarm + bayesplot
Example
Gaussian vs. horseshoe predictive performance using cross-validation (loo package; more in Friday's model selection tutorial):

> compare(loog, loohs)
elpd_diff        se
      7.9       2.8
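For reference, a minimal sketch of how such a comparison is produced with the loo package; fitg and fiths are hypothetical names for the Gaussian-prior and horseshoe-prior rstanarm fits, and compare() is the older loo API shown on the slide (newer versions use loo_compare()):

library(rstanarm)
library(loo)
loog  <- loo(fitg)    # PSIS-LOO for the Gaussian-prior model
loohs <- loo(fiths)   # PSIS-LOO for the horseshoe-prior model
compare(loog, loohs)  # positive elpd_diff favors the second model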
Large p, small n regression
Sparse priors assume only a small number of covariates are effectively non-zero: m_eff ≪ p.
- Laplace prior ("Bayesian lasso"): computationally convenient (continuous and log-concave), but not really sparse.
- Spike-and-slab (with a point mass at zero): prior on the number of non-zero covariates; discrete.
- Horseshoe and hierarchical shrinkage priors: prior on the amount of shrinkage; continuous.

Carvalho et al. (2009)
Prior on shrinkage
The slope of the prior at a specific value determines the amount of shrinkage.

Carvalho et al. (2009)
Spike-and-slab vs horseshoe prior
A spike-and-slab prior (with a point mass at zero) mixes a continuous prior with a probability mass at zero:
- the parameter space is a mixture of continuous and discrete.
Hierarchical shrinkage and horseshoe priors are continuous.

Piironen and Vehtari (2017a)
Horseshoe prior
Linear regression model with covariates x = (x_1, . . . , x_D):

y_i = β^T x_i + ε_i,   ε_i ~ N(0, σ²),   i = 1, . . . , n.

The horseshoe prior:

β_j | λ_j, τ ~ N(0, λ_j² τ²),
λ_j ~ C⁺(0, 1),   j = 1, . . . , D.

- The global parameter τ shrinks all β_j towards zero.
- The local parameters λ_j allow some β_j to escape the shrinkage.

[Figure: marginal prior density p(β_1) and joint prior of (β_1, β_2)]
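A minimal sketch (not from the slides) of drawing coefficients from this prior in R; it shows the characteristic behaviour of a tall spike at zero with a few large draws:

set.seed(1)
D <- 1000; tau <- 0.1               # illustrative values
lambda <- abs(rcauchy(D))           # lambda_j ~ C+(0, 1), half-Cauchy
beta <- rnorm(D, 0, tau * lambda)   # beta_j | lambda_j, tau ~ N(0, lambda_j^2 tau^2)
hist(beta, breaks = 100)            # spike at zero, heavy tails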
Horseshoe prior
Given the hyperparameters, the posterior mean satisfies approximately

β̄_j = (1 − κ_j) β_j^ML,   κ_j = 1 / (1 + n σ⁻² τ² λ_j²),

where κ_j is the shrinkage factor.

With λ_j ~ C⁺(0, 1), the prior for κ_j looks like:

[Figure: horseshoe-shaped prior density of κ on (0, 1), shown for n σ⁻² τ² decreasing from 1.0 to 0.1]

We expect both
- relevant (β̄_j ≈ β_j^ML) features, and
- irrelevant (β̄_j ≈ 0) features.

Small τ ⇒ more coefficients ≈ 0.
How to specify the prior for τ?
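A short simulation sketch (assumed code, matching the formula above): with n σ⁻² τ² = 1 the implied prior on κ is Beta(1/2, 1/2), the horseshoe shape that gives the prior its name:

set.seed(1)
n <- 100; sigma <- 1
tau <- sigma / sqrt(n)                          # so that n * sigma^-2 * tau^2 = 1
lambda <- abs(rcauchy(1e5))                     # lambda ~ C+(0, 1)
kappa <- 1 / (1 + n * sigma^-2 * tau^2 * lambda^2)
hist(kappa, breaks = 50)                        # horseshoe shape: mass near 0 and near 1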
The global shrinkage parameter τ
Effective number of non-zero coefficients:

m_eff = Σ_{j=1}^{D} (1 − κ_j)

The prior mean can be shown to be

E[m_eff | τ, σ] = (τ σ⁻¹ √n) / (1 + τ σ⁻¹ √n) · D

Setting E[m_eff | τ, σ] = p_0 (a prior guess for the number of non-zero coefficients) yields for τ

τ_0 = p_0 / (D − p_0) · σ / √n

⇒ a prior guess for τ based on our beliefs about the sparsity.
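As a quick sanity check (a sketch, not from the slides), plugging τ_0 back into the prior mean recovers p_0:

n <- 100; sigma <- 1; D <- 1000; p0 <- 5
tau0 <- p0 / (D - p0) * sigma / sqrt(n)
t <- tau0 / sigma * sqrt(n)
meff_mean <- t / (1 + t) * D   # E[m_eff | tau0, sigma]
meff_mean                      # = 5, i.e., p0 by construction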
Illustration: p(τ) vs. p(m_eff)

Let n = 100, σ = 1, p_0 = 5, τ_0 = p_0/(D − p_0) · σ/√n, and D the dimensionality.

p(m_eff) with different choices of p(τ), for D = 10 and D = 1000:

[Figure: histograms of p(m_eff) under τ = τ_0, τ ~ N⁺(0, τ_0²), τ ~ C⁺(0, τ_0²), and τ ~ C⁺(0, 1); with D = 1000 the C⁺(0, 1) hyperprior spreads p(m_eff) over hundreds of coefficients]
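These panels can be approximated by simulation; a sketch under the stated assumptions (n = 100, σ = 1, p_0 = 5), comparing τ fixed at τ_0 with a half-Cauchy hyperprior of scale τ_0:

set.seed(1)
n <- 100; sigma <- 1; D <- 1000; p0 <- 5
tau0 <- p0 / (D - p0) * sigma / sqrt(n)
meff_draw <- function(tau) {
  lambda <- abs(rcauchy(D))                            # lambda_j ~ C+(0, 1)
  kappa <- 1 / (1 + n * sigma^-2 * tau^2 * lambda^2)
  sum(1 - kappa)                                       # one draw of m_eff
}
m_fixed  <- replicate(4000, meff_draw(tau0))           # tau = tau0
m_cauchy <- replicate(4000, meff_draw(abs(rcauchy(1, scale = tau0))))  # tau ~ C+(0, tau0)
hist(m_cauchy, breaks = 50)   # much heavier right tail than with fixed tau0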
Non-Gaussian observation models
The reference value:

τ_0 = p_0 / (D − p_0) · σ / √n

The framework can also be applied to non-Gaussian observation models by deriving appropriate plug-in values for σ:
- Gaussian approximation to the likelihood
- e.g., σ = 2 for logistic regression
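A sketch of the plug-in for logistic regression, using for illustration the dimensions of the Ovarian dataset from the experiments below:

p0 <- 5; D <- 1536; n <- 54       # e.g., the Ovarian classification data
sigma_guess <- 2                  # pseudo-sigma from the Gaussian approximation
tau0 <- p0 / (D - p0) * sigma_guess / sqrt(n)
tau0                              # reference value for the global scale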
Regularized horseshoe
The horseshoe allows some coefficients to be completely unregularized:
- this allows complete separation in a logistic model when n ≪ p.
The regularized horseshoe adds an additional wide slab:
- it maintains the division into relevant and non-relevant variables.
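The transcript omits the formula; as defined in the cited Piironen and Vehtari (2017) paper, the regularized horseshoe replaces each local scale with a soft-truncated version:

λ̃_j² = c² λ_j² / (c² + τ² λ_j²),
β_j | λ_j, τ, c ~ N(0, τ² λ̃_j²),

so that when τ² λ_j² ≪ c² the prior behaves like the horseshoe, while the largest coefficients are shrunk as under a N(0, c²) slab; c² is given an inverse-Gamma prior (the slab_scale and slab_df arguments in rstanarm below).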
Horseshoe vs regularized horseshoe
The regularized horseshoe helps regularize the relevant variables.
Regularized horseshoe in rstanarm
Easy in rstanarm (thanks to Ben Goodrich):

p0 <- 5
tau0 <- p0/(D-p0) * sigmaguess/sqrt(n)   # D, n, sigmaguess defined earlier
fit <- stan_glm(y ~ x, gaussian(),
                prior = hs(global_scale = tau0, slab_scale = 2.5, slab_df = 4))

Note: rstanarm does not condition on σ, and thus tau0 needs to be scaled with a guess of the expected value of σ (sigmaguess above)
- luckily the result is not sensitive to the exact value
Note 2: the hs() prior is called a "hierarchical shrinkage" prior, as it is an extension of the horseshoe (the horseshoe has local df = 1)
- luckily the result is not sensitive to the exact value
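A hedged usage sketch (assuming the fit above): interval plots like those in the earlier example can be drawn directly from the stanreg object:

plot(fit, "areas", prob = 0.9)   # bayesplot-backed posterior interval/area plot
summary(fit, digits = 2)         # posterior summaries and diagnostics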
Regularized horseshoe in rstanarm
Simulated regression example: n = 100, p = 200, true p_0 = 7.
Gaussian vs. "Bayesian lasso" vs. regularized horseshoe.
Regularized horseshoe in rstanarm
Simulated regression example: n = 100, p = 200, true p_0 = 7.
Experiments
Table: Summary of the real-world datasets; D denotes the number of predictors and n the dataset size.

Dataset     Type            D     n
Ovarian     Classification  1536  54
Colon       Classification  2000  62
Prostate    Classification  5966  102
ALLAML      Classification  7129  72
Horseshoe vs regularized horseshoe
The regularized horseshoe also helps reduce the number of divergent transitions during sampling.
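If divergences remain even with the regularized horseshoe, a common mitigation (a workflow suggestion, not from the slides) is to raise the target acceptance rate, which rstanarm exposes directly:

fit <- stan_glm(y ~ x, gaussian(),
                prior = hs(global_scale = tau0, slab_scale = 2.5, slab_df = 4),
                adapt_delta = 0.999)   # smaller steps: slower, but fewer divergences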
Example
[Figure: posterior intervals for (Intercept), x1–x20, and sigma under the horseshoe prior]

Even if the horseshoe shrinks a lot, the coefficient posteriors have uncertainty and are not exactly zero.
Tomorrow in the model selection tutorial:
- how to select the most relevant variables
- how to do inference after the selection while taking into account the uncertainties in the full model
Summary of regularized horseshoe prior
Sparse as the horseshoe, but more robust inference and computation.
Better performance than the lasso and the Bayesian lasso.

References (with code examples for Stan included):

- Juho Piironen and Aki Vehtari (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2):5018–5051. https://projecteuclid.org/euclid.ejs/1513306866
- Juho Piironen and Aki Vehtari (2017). On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, PMLR 54:905–913. http://proceedings.mlr.press/v54/piironen17a.html
- Juho Piironen and Aki Vehtari (2018). Iterative supervised principal components. Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, accepted for publication. https://arxiv.org/abs/1710.06229
- See also the model selection tutorial with notebooks using the regularized horseshoe: https://github.com/avehtari/modelselection_tutorial