
Supplementary Information for
Bayesian Analysis of Multiple Parametric Changes for High-dimensional Longitudinal Data Analysis

Contents

1 Over-regularization of Time-varying Signals in Regularization Methods
2 Fused Lasso Test
3 HMBB Algorithm and Software Implementation
   3.1 Algorithm
   3.2 Estimation using BridgeChange
   3.3 Decoupled Shrinkage and Selection of HMBB
   3.4 The Support of the Concavity Parameter
   3.5 Model Diagnostics
4 Extensions of HMBB
   4.1 Panel Data
   4.2 Binary and Count Data
   4.3 Nonparametric Regression
5 Simulation
   5.1 Simulation Design
   5.2 Estimating Procedure for Benchmark Estimates
   5.3 High Dimensional Data with No Change-point
   5.4 High Dimensional Correlated Data
6 Application Details
   6.1 Additional Results on Nunn and Qian (2014)'s Example
   6.2 Additional Results on Alvarez et al. (1991)'s Example
7 Computation Time


[Figure 1 about here. Eight panels plot estimated coefficients against the ground truth: OLS (RMSE = 1.71), Lasso (RMSE = 1.22), Elastic Net (RMSE = 1.21), Ridge (RMSE = 1.22), Adaptive (RMSE = 1.29), Bayesian Lasso (RMSE = 1.22), Horseshoe (RMSE = 1.33), and Normal-Gamma (RMSE = 1.23); each panel overlays the true Regime 1 and Regime 2 coefficients.]

Figure 1: Change-point Problem in Regularized Regression Analysis of Longitudinal Data: Ground truth is displayed by transparent dots (•) and vertical lines. Estimates are marked by red asterisks (∗). $\mathrm{RMSE} = \sqrt{p^{-1}\sum_{j=1}^{p}(\hat{\beta}_j - \beta_j^{\mathrm{true}})^2}$. Implementation details are available in the supplementary information.

1 Over-regularization of Time-varying Signals in Regularization Methods

In this section, we illustrate the over-regularization of time-varying signals using a synthetic data set with 50 predictors, 100 observations, and a single parameter break at t = 50. Among the 50 predictors, only 10 have non-zero coefficients, randomly drawn from U(−3, 3) in each regime. Red translucent dots indicate non-sparse signals in Regime 1 and blue translucent dots indicate non-sparse signals in Regime 2. Seven regularization methods (lasso (Tibshirani, 1996), elastic net (Zou and Hastie, 2005), ridge (Hoerl and Kennard, 1970), adaptive lasso (Zou, 2006), Bayesian lasso (Park and Casella, 2008), horseshoe (Carvalho et al., 2010), and normal gamma (Griffin and Brown, 2010)) and the ordinary least squares method are used to estimate parameters. Not surprisingly, all the regularization methods pool time-varying signals into weak signals and then force them toward zero.

In the following, we report R code that replicates the above simulation. First, we generate a synthetic time series data set with n = 100 (the number of observations) and p = 50 (the number of predictors). The number of non-sparse predictors in each regime is 10 (p * signal.ratio = 10), and their values are drawn independently in each regime from a uniform distribution U(−3, 3). A single break is planted at the mid-point.

set.seed(1973)
require(monomvn)
require(glmnet)
require(BridgeChange)
require(MASS)
require(genlasso)

dataGen <- function(n, p, signal.ratio = 0.7, sigma = NULL) {
  if (is.null(sigma)) sigma <- diag(p)
  n_signal <- round(signal.ratio * p)
  n_noise <- p - n_signal
  X <- MASS::mvrnorm(n, mu = runif(p, -1, 1), Sigma = diag(p))
  beta1 <- c(runif(n_signal, -3, 3), rep(0, n_noise))
  beta2 <- c(runif(n_signal, -3, 3), rep(0, n_noise))
  permuted <- sample(1:p, p, replace = FALSE)
  beta1 <- beta1[permuted]
  beta2 <- beta2[permuted]
  Y <- rep(NA, n)
  cut <- n/2
  Y[1:cut] <- X[1:cut, ] %*% beta1 + rnorm(n/2)
  Y[(cut+1):n] <- X[(cut+1):n, ] %*% beta2 + rnorm(n/2)
  return(list("X" = X, "Y" = Y,
              "beta1" = beta1, "beta2" = beta2,
              "signal" = permuted))
}

set.seed(18394)
n <- 100
p <- 50
addTrans <- NetworkChange:::addTrans
rain.8 <- rainbow(8)

sim <- dataGen(n = n, p = p, signal.ratio = 0.4)
y <- sim$Y
X <- sim$X

true.beta <- matrix(NA, 2, p)
true.beta[1, ] <- sim$beta1
true.beta[2, ] <- sim$beta2

Next, we fit a list of regularization methods and extract the parameters for the RMSE computation. We used the glmnet package (Friedman et al., 2010) for the lasso, ridge, elastic net, and adaptive lasso estimation. For the Bayesian lasso, horseshoe, and normal gamma models, we used the monomvn package (Gramacy, 2018).

## OLS regression: benchmark
ols <- lm(y ~ X - 1)
beta.ols <- coef(ols)

## Model 1: Lasso
fit.lasso <- cv.glmnet(y = y, x = X, type.measure = "mse",
                       alpha = 1, standardize = TRUE, family = "gaussian")
beta.lasso <- coef(fit.lasso)[-1]

## Model 2: Ridge
fit.ridge <- cv.glmnet(y = y, x = X, type.measure = "mse",
                       alpha = 0, standardize = TRUE, family = "gaussian")
beta.ridge <- coef(fit.ridge)[-1]

## Model 3: Elastic net
fit.elastic <- cv.glmnet(y = y, x = X, type.measure = "mse",
                         alpha = 0.5, standardize = TRUE, family = "gaussian")
beta.elastic <- coef(fit.elastic)[-1]

## Model 4: Adaptive lasso
w3 <- 1 / abs(matrix(coef(fit.ridge,
                          s = fit.ridge$lambda.min)[, 1][2:(ncol(X) + 1)]))^1
w3[w3[, 1] == Inf] <- 999999999
cv.adaptive <- cv.glmnet(x = X, y = y,
                         family = "gaussian", alpha = 1, penalty.factor = w3)
beta.adaptive <- coef(cv.adaptive)[-1]

## Model 5: Horseshoe prior
blas.hs <- blasso(y = y, X = X, T = 1000, case = "hs")
beta.hs <- apply(blas.hs$beta, 2, mean)

## Model 6: Bayesian lasso
blas <- blasso(y = y, X = X, T = 1000)
beta.blas <- apply(blas$beta, 2, mean)

## Model 7: Bayesian normal gamma
blas.ng <- blasso(y = y, X = X, T = 1000, case = "ng")
beta.ng <- apply(blas.ng$beta, 2, mean)
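The RMSE values reported in the Figure 1 caption follow the formula $\sqrt{p^{-1}\sum_{j=1}^{p}(\hat{\beta}_j - \beta_j^{\mathrm{true}})^2}$. A minimal sketch of this computation, averaging a single-regime estimate against both regimes' ground truth, is below; the helper name rmse is ours, not part of the original replication code.

## Minimal sketch (not in the original replication code): RMSE of a
## single-regime estimate against the regime-specific ground truth,
## averaged over the two regimes as in the Figure 1 caption.
rmse <- function(est, truth) sqrt(mean((est - truth)^2))
mean(c(rmse(beta.lasso, true.beta[1, ]),
       rmse(beta.lasso, true.beta[2, ])))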

2 Fused Lasso Test

In this section, we explain how the fused lasso method can be used to identify parameter jumps in the above data set with a single break. First, we write the fused lasso model as follows:

$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^{pT}} \sum_{t=1}^{T} (Y_t - X_t^\top \beta_t)^2 + \lambda \sum_{j=1}^{p} \sum_{t=2}^{T} \big|\beta_t^{(j)} - \beta_{t-1}^{(j)}\big|. \tag{1}$$

Then, we transform the design matrix $X \in \mathbb{R}^{T \times p}$ as follows:

1. Let $L$ be the lower triangular matrix of ones with dimension $T \times T$.

2. Concatenate $p$ copies of $L$ as $\mathbb{L} = [L \mid L \mid \cdots \mid L]$ so that $\mathbb{L} \in \mathbb{R}^{T \times pT}$. Note that this is given by
$$\mathbb{L} = \mathbf{1}_p^\top \otimes L. \tag{2}$$

3. Create $Z$ as follows:
$$Z = \underbrace{(X \otimes \mathbf{1}_T^\top)}_{T \times pT} \odot \underbrace{\mathbb{L}}_{T \times pT} \tag{3}$$
where $\odot$ is the element-wise (Hadamard) product and $\otimes$ is the Kronecker product.

Then, we apply the standard lasso on $Z$, which is

$$\min_{\beta \in \mathbb{R}^{pT}} \|Y - Z\beta\|_2^2 + \lambda \|\beta\|_1. \tag{4}$$

After fitting the lasso, the time-specific coefficients for variable $j$ can be recovered by

$$\hat{\beta}_j = \mathbb{L}^{(j)} \hat{\beta} \tag{5}$$

where $\mathbb{L}^{(j)} = [0 \mid 0 \mid \cdots \mid L \mid \cdots \mid 0]$.

The following R code implements the above discussion.

## Model 8: Fused lasso estimation
p <- dim(X)[2]
n <- dim(X)[1]

## Make the Z matrix (n by n*p) using X (n by p)
save_mat <- list()
for (j in 1:p) {
  Ltmp <- matrix(X[, j], nrow = n, ncol = n)
  Ltmp[upper.tri(Ltmp)] <- 0
  save_mat[[j]] <- Ltmp
}
Z <- do.call("cbind", save_mat)

fit <- cv.glmnet(Z, y, type.measure = "mse", alpha = 1,
                 standardize = TRUE, family = "gaussian")
est <- coef(fit)[-1]

beta_est <- matrix(NA, nrow = p, ncol = n)
for (j in 1:p) {
  st <- 1 + n * (j - 1)
  ed <- n * j
  beta_est[j, ] <- cumsum(est[st:ed])
}

beta_true <- cbind(matrix(rep(true.beta[1, ], 50), nrow = p),
                   matrix(rep(true.beta[2, ], 50), nrow = p))


3 HMBB Algorithm and Software Implementation

In this section, we explain the details of the HMBB: modeling, estimation, and model diagnostics.

3.1 Algorithm

We explain the algorithm of the HMBB for a linear regression model with a single break (two regimes). As explained in the paper, the posterior density of the HMBB for a linear regression model can be written as follows:

$$
\begin{aligned}
& p(\lambda_j)\, p(\sigma^2)\, p(\alpha)\, p(\nu) \prod_{t=1}^{n} \prod_{m=1}^{2} \left[ \exp\!\left(-\frac{1}{2\sigma_m^2}(y_t - \mathbf{x}_t^\top \beta_m)^2\right) \prod_{j=1}^{p} \exp\!\left(-\frac{\beta_{m,j}^2}{2\tau_m^2 \lambda_{m,j}}\right) \right]^{\mathbf{1}\{s_t = m\}} \\
&= p(\lambda_j)\, p(\sigma^2)\, p(\alpha)\, p(\nu) \prod_{1 \le t \le t^\star} \exp\!\left(-\frac{1}{2\sigma_1^2}(y_t - \mathbf{x}_t^\top \beta_1)^2\right) \prod_{j=1}^{p} \exp\!\left(-\frac{\beta_{1,j}^2}{2\tau_1^2 \lambda_{1,j}}\right) \\
&\quad \times \prod_{t^\star < t' \le n} \exp\!\left(-\frac{1}{2\sigma_2^2}(y_{t'} - \mathbf{x}_{t'}^\top \beta_2)^2\right) \prod_{j=1}^{p} \exp\!\left(-\frac{\beta_{2,j}^2}{2\tau_2^2 \lambda_{2,j}}\right)
\end{aligned}
$$

where $t^\star = \max\{t : s_t = 1\}$ is the last period in the first regime.

After centering the data (group-centering in the case of panel data), sampling proceeds in the following scheme.

1. Sampling $p(\beta \mid \alpha, \Lambda, \sigma^2, \tau, \mathbf{P}, \mathbf{S}, \mathbf{y})$: The conditional posterior of $\beta$ follows a multivariate normal distribution, which is given by

$$\beta_m \mid \sigma^2, \lambda_m, \alpha_m, \tau, \mathbf{P}, \mathbf{S}, \mathbf{y}_m \sim \mathcal{N}_p\!\left(V\,\frac{\mathbf{X}_m^\top \mathbf{y}_m}{\sigma_m^2},\; V\right), \qquad V = \left(\frac{\mathbf{X}_m^\top \mathbf{X}_m}{\sigma_m^2} + \frac{1}{\tau_m^2}\Lambda_m^{-1}\right)^{-1} \tag{6}$$

where $\Lambda_m = \mathrm{diag}(\lambda_{m,1}, \ldots, \lambda_{m,p})$.

2. Sampling $p(\beta_0 \mid \Lambda, \beta, \alpha, \tau, \sigma^2, \mathbf{P}, \mathbf{S}, \mathbf{y})$: We separately estimate the intercept for each regime to remove any discrepancy in regression slopes in each simulation:

$$\beta_{0m} \leftarrow \bar{y}_m - \bar{\mathbf{x}}_m^\top \beta_m$$

where

$$\bar{y}_m = \frac{\sum_{t=1}^{n} \mathbf{1}\{s_t = m\}\, y_t}{\sum_{t=1}^{n} \mathbf{1}\{s_t = m\}}, \quad \text{and} \quad \bar{x}_{m,j} = \frac{\sum_{t=1}^{n} \mathbf{1}\{s_t = m\}\, x_{m,tj}}{\sum_{t=1}^{n} \mathbf{1}\{s_t = m\}}. \tag{7}$$

3. Sampling $p(\alpha \mid \Lambda, \beta, \sigma^2, \tau, \mathbf{P}, \mathbf{S}, \mathbf{y})$: We use a Griddy Gibbs sampler (Tanner, 1996) for the sampling of $\alpha$ because $\alpha$ is univariate and its support is bounded on $(0, 2]$.

4. Sampling $p(\tau \mid \Lambda, \beta, \alpha, \sigma^2, \mathbf{P}, \mathbf{S}, \mathbf{y})$: Sample $\nu$ first and then transform $\nu$ to $\tau$:

$$\nu_m \sim \text{Gamma}(c, d), \qquad \tau_m = \nu_m^{-1/\alpha_m}$$

where $c = c_0 + p/\alpha_m$ and $d = d_0 + \sum_{j=1}^{p} |\beta_{j,m}|^{\alpha_m}$.

5. Sampling $p(\mathbf{S} \mid \Lambda, \beta, \alpha, \tau, \sigma^2, \mathbf{P}, \mathbf{y})$: Sample $\mathbf{S}$ recursively using Chib (1998)'s algorithm.

6. Sampling $p(\mathbf{P} \mid \Lambda, \beta, \alpha, \tau, \sigma^2, \mathbf{S}, \mathbf{y})$:

$$p_{kk} \sim \text{Beta}(a_0 + j_{k,k} - 1,\; b_0 + j_{k,k+1})$$

where $p_{kk}$ is the probability of staying in state $k$, $j_{k,k}$ is the number of transitions from state $k$ to $k$, and $j_{k,k+1}$ is the number of transitions from state $k$ to $k + 1$.

3.2 Estimation using BridgeChange

Now, we explain how to implement the HMBB using BridgeChange. We use the same data set generated for the above example. First, download BridgeChange from a public repository.

require(devtools)
install_github("soichiroy/BridgeChange")
require(BridgeChange)

The HMBB analysis of y on X is executed by the following single line of code.

fit.bc <- BridgeChangeReg(y = y, X = X, n.break = 1, mcmc = 1000)

Users must specify the dependent variable (y), a model matrix (X), and the number of breaks (n.break) to fit BridgeChangeReg.

After fitting the HMBB, users can conduct various posterior analyses. For example, to see the estimated break point and the regime-specific parameters:

beta.bc <- coef(fit.bc)
beta.bc1 <- beta.bc[1:p]
beta.bc2 <- beta.bc[(p+1):(2*p)]

require(MCMCpack)  ## for plotState()
par(mfrow = c(1, 3))
par(mar = c(3, 3, 2, 1), mgp = c(2, .7, 0), tck = -.01)
plotState(fit.bc)
plot(true.beta[1, ], beta.bc1, col = addTrans("brown", 100),
     xlab = "True", ylab = "HMBB Estimates", main = "Regime 1",
     pch = 19, cex = 2); abline(a = 0, b = 1, col = "blue")
plot(true.beta[2, ], beta.bc2, col = addTrans("brown", 100),
     xlab = "True", ylab = "HMBB Estimates", main = "Regime 2",
     pch = 19, cex = 2); abline(a = 0, b = 1, col = "blue")

Figure 2 shows the results. The left panel shows the transitions of the estimated hidden states.


[Figure 2 about here. Left panel: posterior regime probability, Pr(S_t = k | Y_t), over time. Center and right panels: HMBB estimates plotted against true values in Regime 1 and Regime 2.]

Figure 2: Change-point Analysis using HMBB: n = 100 (the number of observations) and p = 50 (the number of predictors). The number of non-sparse predictors is 10. A single break is planted at t = 50. Regime-specific non-sparse parameters are generated from a uniform distribution U(−3, 3).

Probabilities of hidden states change rapidly in the middle, indicating that the middle of the sample period is the estimated break point. Regime-specific parameter estimates are reported in the center and right panels of Figure 2. The HMBB shrinks parameters toward zero while producing many non-zero estimates for sparse parameters. This is a general feature of shrinkage models, as discussed in Polson and Scott (2010). Overall, the HMBB successfully identifies the location of the break point and the regime-specific parameter changes.

3.3 Decoupled Shrinkage and Selection of HMBB

Hahn and Carvalho (2015) suggest an interesting compromise between variable selection and shrinkage (Polson and Scott, 2010). Bayesian shrinkage priors are superior to classical regularization methods and Bayesian variable selection methods in that they produce estimates with the smallest estimation error (measured by the mean squared error) and can incorporate estimation uncertainty. Also, tuning parameters can be estimated as hyperparameters in the Bayesian framework. However, one important drawback is that Bayesian shrinkage priors do not have a variable selection property, unlike penalized likelihood methods such as the lasso (Tibshirani, 1996). In other words, Bayesian shrinkage priors do not produce parameter estimates that are exactly zero. The lack of the variable selection feature in Bayesian shrinkage priors reflects a trade-off between estimation error and inferential parsimony: the hierarchical structure of a shrinkage prior distribution provides the smallest estimation error without thresholding parameter values at a certain level.

Hahn and Carvalho (2015)'s approach is to consider the variable selection problem as "a problem of posterior summarization." Their approach consists of two steps. First, their method fits a Bayesian shrinkage prior model to regularize parameters. Second, the posterior distributions are used to select the optimal number of covariates by minimizing an $\ell_0$-type loss function. We adopt this hybrid approach for high-dimensional change-point analysis.


First, we identify multiple change-points in high-dimensional time series (cross-sectional) data using the HMBB. Second, using the posterior distributions of the HMBB, we select the optimal number of regime-specific covariates by minimizing a weighted lasso function. By doing so, we optimize both prediction accuracy and inferential parsimony at the same time. The resulting estimates have the property of decoupled shrinkage and selection in the sense of Hahn and Carvalho (2015) for time series data with multiple change-points.

We define the decoupled shrinkage and selection (DSS) loss function of the HMBB for regime $m$ as follows:

$$L(\gamma_m) = \arg\min_{\gamma_m} \underbrace{\|\mathbf{X}_m \beta_m^* - \mathbf{X}_m \gamma_m\|_2^2}_{\text{squared prediction loss}} + \underbrace{\lambda \|\gamma_m\|_0}_{\text{parsimony penalty}} \tag{8}$$

where $\mathbf{X}_m \beta_m^*$ is the fitted value of the fixed-effects HMBB at regime $m$.

To find the optimum of the DSS, we take the popular approach of the $\ell_1$ surrogation of the $\ell_0$ problem. Also, to better target the $\ell_0$ solution, we use the weight vector $w_{j,m} = 1/|\hat{\gamma}_{j,m}|^{\delta}$, where $\delta > 0$ and $\hat{\gamma}_{j,m}$ is a root-$n$ consistent estimate of $\gamma_{j,m}$, as suggested by Zou (2006):

$$\beta_m^{\mathrm{DSS}} = \arg\min_{\gamma_m} \|\mathbf{X}_m \beta_m^* - \mathbf{X}_m \gamma_m\|_2^2 + \lambda \sum_{j=1}^{p} w_{j,m} |\gamma_{j,m}|. \tag{9}$$
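A minimal sketch of this DSS step for one regime is below, solving the weighted lasso surrogate (9) with glmnet on the HMBB fitted values. The names dss_select, X.m (regime-m rows of X), beta.star (the HMBB posterior mean for that regime), and delta are ours, introduced for illustration; the small constant added to the weights is a numerical guard, not part of Equation (9).

## Minimal sketch of the DSS step for one regime (helper names are ours).
library(glmnet)
dss_select <- function(X.m, beta.star, delta = 1) {
  yhat <- drop(X.m %*% beta.star)              # fitted values X_m beta*_m
  w <- 1 / (abs(beta.star)^delta + 1e-10)      # adaptive weights w_{j,m}
  fit <- cv.glmnet(X.m, yhat, alpha = 1, penalty.factor = w)
  as.vector(coef(fit, s = "lambda.min"))[-1]   # sparse gamma_m
}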

3.4 The Support of the Concavity Parameter

The Bridge estimator can be written as follows:

$$\hat{\beta}_{\text{bridge}} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \sum_{t=1}^{n} (y_t - \mathbf{x}_t^\top \beta)^2 + \nu \sum_{j=1}^{p} |\beta_j|^{\alpha}. \tag{10}$$

In this model, it is well known that when $0 < \alpha \le 1$, the classical bridge estimator has the variable selection feature (Murphy, 2012). We believe that Polson et al. (2014) use the support $0 < \alpha \le 1$ to place the bridge model between two polar cases in classical statistics: the lasso ($\alpha = 1$) and the subset selection method ($\alpha = 0$). However, the support $0 < \alpha \le 1$ does not guarantee the variable selection feature in the Bayesian framework because the posterior of the regression parameters depends on complicated order statistics (Polson and Scott, 2010). Monte Carlo experiments also show that the constraint $0 < \alpha \le 1$ often produces biased estimates. Generally speaking, restricting the concavity parameter between 0 and 1 (or fixing it to reflect a particular desired shape of the penalty function) overly penalizes regression parameters without guaranteeing sparsity in the Bayesian framework.

Figure 3 compares the RMSEs of the HMBB under different supports for the concavity parameter. The simulated data set is a time series with T = 100 and 50 predictors, of which 10 are non-sparse. A single break is planted at the mid-point (t = 50). Regime-specific non-sparse parameters are generated from a uniform distribution U(−3, 3). RMSE is the root mean squared error of the estimated coefficients: $\mathrm{RMSE} = \sqrt{p^{-1}\sum_{j=1}^{p}(\hat{\beta}_j - \beta_j^{\mathrm{true}})^2}$. The bright dots are true values and the dark dots are posterior expected values.


[Figure 3 about here. Left panel: HMBB, RMSE = 0.7. Right panel: HMBB (alpha constrained), RMSE = 0.73. Each panel overlays the true and estimated coefficients for Regimes 1 and 2.]

Figure 3: The Choice of the Concavity Parameter and RMSEs: The left panel shows estimates of the HMBB using the concavity parameter between 0 and 2 and the right panel shows estimates of the HMBB using the concavity parameter between 0 and 1.

Red and blue indicate the first regime and the second regime, respectively.

The left panel of Figure 3 shows estimates of the HMBB using the concavity parameter between 0 and 2 and the right panel shows estimates using the concavity parameter between 0 and 1. The right panel shows a larger RMSE than the left panel. On close inspection, estimates in the right panel show significant bias for large time-varying signals. However, there is no clear sign that the (0, 1] constraint produces sparse estimates.

3.5 Model Diagnostics

The goal of model checking in this section is two-fold. First, we seek to check the general fit of the proposed models against observed (in-sample) and independent (out-of-sample) data. Second, in the context of change-point analysis, model checking is an essential way to identify the number of breaks given the data and a model. There are many options for Bayesian model comparison (or checking). In this paper, we consider three canonical methods: WAIC, k-fold cross-validation, and the marginal likelihood method (or Bayes factor). WAIC and k-fold cross-validation compare models based on their (expected) predictive accuracy, while the marginal likelihood method compares models based on the posterior probability of a chosen model $\mathcal{M}$, $p(\mathcal{M} \mid \mathbf{y}, \mathbf{X})$, which is proportional to the marginal likelihood of the data given the chosen model, $p(\mathbf{y}, \mathbf{X} \mid \mathcal{M})$, under a uniform prior over competing models.

First, we assess the expected predictive accuracy of the HMBB by the Watanabe-Akaike Information Criterion (WAIC) (Watanabe, 2010). WAIC approximates the expected log pointwise predictive density by subtracting a bias term for the effective number of parameters from the sum of the log pointwise predictive densities. WAIC is a fully Bayesian estimate of model uncertainty


with a low computational cost. We follow the formula suggested by Gelman et al. (2014). Dropping regime-specific parameter notation for simplicity, the WAIC for an HMBB with $M$ latent states ($\mathcal{M}_M$) is

$$
\mathrm{WAIC}_{\mathcal{M}_M} = -2\Bigg[\underbrace{\sum_{t=1}^{n} \log \frac{1}{G} \sum_{g=1}^{G} p\big(y_t \mid \beta^{(g)}, \sigma^{2,(g)}, \Lambda^{(g)}, \alpha^{(g)}, \tau^{(g)}, \mathbf{P}^{(g)}, \mathcal{M}_M\big)}_{\text{the expected log pointwise predictive density}} - \underbrace{\sum_{t=1}^{n} \mathbb{V}_{g=1}^{G}\Big[\log p\big(y_t \mid \beta^{(g)}, \sigma^{2,(g)}, \Lambda^{(g)}, \alpha^{(g)}, \tau^{(g)}, \mathbf{P}^{(g)}, \mathcal{M}_M\big)\Big]}_{\text{bias for the effective number of parameters}}\Bigg]
$$

where $G$ is the MCMC simulation size, $\mathbb{V}[\cdot]$ indicates a variance, and $\theta^{(g)}$ is the $g$th simulated draw of $\theta$.
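A minimal sketch of this computation is below, assuming a $G \times n$ matrix of pointwise log-likelihood values (one row per MCMC draw) as input; the matrix loglik is an assumed input, not a BridgeChange object.

## Minimal sketch: WAIC from a G x n matrix 'loglik' of pointwise
## log-likelihood values, one row per MCMC draw (an assumed input).
waic <- function(loglik) {
  lppd   <- sum(log(colMeans(exp(loglik))))  # log pointwise predictive density
  p.waic <- sum(apply(loglik, 2, var))       # effective number of parameters
  -2 * (lppd - p.waic)
}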

Second, we can check the predictive accuracy of the HMBB against independent data using the k-fold cross-validation method. Since we deal with sequentially observed data, we divide training data from test data without permuting the sequential order of the data. Let the index set of the original data be denoted by $\mathcal{I} = \{1, \ldots, n\}$. We partition this set into $k$ disjoint sets such that $\mathcal{I} = \bigcup_{h=1}^{k} \mathcal{I}_h$ and $\mathcal{I}_h \cap \mathcal{I}_{h'} = \emptyset$ for $h \neq h'$. Model parameters are estimated using all $t \in \mathcal{I} \setminus \mathcal{I}_h$ and then we compute the predicted loss by

$$L_{\mathrm{CV},h} = \frac{1}{|\mathcal{I}_h|} \sum_{t \in \mathcal{I}_h} \big(y_t - \mathbf{X}_t^\top \hat{\beta}^{(h)}\big)^2$$

where $\hat{\beta}^{(h)}$ is a vector of estimates obtained from the samples in $\mathcal{I} \setminus \mathcal{I}_h$. We repeat this procedure for all $h = 1, \ldots, k$ and compute the total loss as a simple average of the individual losses, $L_{\mathrm{CV}} = \sum_{h=1}^{k} L_{\mathrm{CV},h}/k$.
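A minimal sketch of this sequential (non-permuted) fold construction is below. The generic fit_fun, which returns a coefficient vector from training data, is a stand-in of ours for the HMBB fit.

## Minimal sketch: sequential k-fold CV loss; 'fit_fun' is a stand-in
## that returns a coefficient vector from training data (an assumption).
cv_loss <- function(y, X, k = 5, fit_fun) {
  ## contiguous folds preserve the sequential order of the data
  folds <- split(seq_along(y), cut(seq_along(y), k, labels = FALSE))
  losses <- sapply(folds, function(idx) {
    beta.h <- fit_fun(y[-idx], X[-idx, , drop = FALSE])
    mean((y[idx] - X[idx, , drop = FALSE] %*% beta.h)^2)
  })
  mean(losses)
}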

Third, we check the model uncertainty of the HMBB using Chib (1995)'s candidate estimator of the log marginal likelihood, which is applicable when all parameters are sampled by Gibbs sampling. Although the sampling algorithm of the HMBB is based on Gibbs sampling, it is difficult to evaluate the posterior densities of $\beta$, $\Lambda$, and $\alpha$. We therefore approximate the log marginal likelihood of the HMBB by treating them as latent variables; these parameters are averaged out in the log marginal likelihood computation. The resulting candidate estimator is obtained from the following formula:

$$\log p(\mathbf{y} \mid \mathcal{M}_M) \approx \underbrace{\log p(\mathbf{y} \mid \sigma^{2*}, \tau^*, \mathbf{P}^*, \mathcal{M}_M)}_{\text{the log likelihood ordinate}} + \underbrace{\log p(\sigma^{2*}, \tau^*, \mathbf{P}^*)}_{\text{the log prior ordinate}} - \underbrace{\log p(\sigma^{2*}, \tau^*, \mathbf{P}^* \mid \mathbf{y})}_{\text{the log posterior ordinate}}$$

where $*$ indicates the posterior mean. The log posterior ordinate is evaluated from the following reduced Gibbs updates:

$$
\begin{aligned}
p(\sigma^{2*}, \tau^*, \mathbf{P}^* \mid \mathbf{y}) &= p(\tau^* \mid \mathbf{y})\, p(\sigma^{2*} \mid \mathbf{y}, \tau^*)\, p(\mathbf{P}^* \mid \mathbf{y}, \sigma^{2*}, \tau^*) \\
p(\tau^* \mid \mathbf{y}) &\approx \int p(\tau^* \mid \mathbf{y}, \beta, \Lambda, \alpha, \sigma^2, \mathbf{P}, \mathbf{S})\, dp(\beta, \Lambda, \alpha, \sigma^2, \mathbf{P}, \mathbf{S} \mid \mathbf{y}) \\
p(\sigma^{2*} \mid \mathbf{y}, \tau^*) &\approx \int p(\sigma^{2*} \mid \mathbf{y}, \tau^*, \beta, \Lambda, \alpha, \mathbf{P}, \mathbf{S})\, dp(\beta, \Lambda, \alpha, \mathbf{P}, \mathbf{S} \mid \mathbf{y}) \\
p(\mathbf{P}^* \mid \mathbf{y}, \sigma^{2*}, \tau^*) &\approx \int p(\mathbf{P}^* \mid \mathbf{y}, \sigma^{2*}, \tau^*, \beta, \Lambda, \alpha, \mathbf{S})\, dp(\beta, \Lambda, \alpha, \mathbf{S} \mid \mathbf{y}).
\end{aligned}
$$

Last, a more heuristic method for break number detection is to examine latent state transitions across multiple models with varying numbers of breaks. The inclusion of redundant break points (e.g. imposing two breaks on a single-break process) produces instability in the MCMC draws of the hidden state variables. One quick way to check for redundant breaks is to see whether the estimated latent states have singleton states (i.e. latent states with only one observation). Another way is to estimate the average variance of the simulated break points. This measure is equivalent to the average loss of break points, treating the simulation means of the break points ($\bar{\tau}_m$) as the true break points:

$$\text{Average Loss of Break Point Estimates} = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{G} \sum_{g=1}^{G} \big(\bar{\tau}_m - \tau_m^{(g)}\big)^2$$

where $G$ is the MCMC simulation size and $M$ is the total number of breaks. The average loss is close to 0 if the simulated break points are highly stable, and becomes larger if at least one break point swings widely across simulations.
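A minimal sketch of this measure is below, assuming the sampled break points have been collected into a $G \times M$ matrix (tau.draws is an assumed input, e.g. recovered from the sampled hidden-state paths).

## Minimal sketch: average loss of break-point estimates, where
## 'tau.draws' is an assumed G x M matrix of sampled break points
## (column m holds the G MCMC draws of the m-th break point).
avg_break_loss <- function(tau.draws) {
  mean(apply(tau.draws, 2, function(tau) mean((mean(tau) - tau)^2)))
}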

4 Extensions of HMBB

In this section, we discuss various extensions of HMBB to panel data, binary data, countdata, and a nonparametric regression model.

4.1 Panel Data

The applications in the paper use a panel HMBB, which we explain in this section. With some abuse of notation, let $i \in [n]$ denote the subject and $t \in [T]$ denote time. Let $y_{it}$ denote a scalar observation for subject $i$ at time $t$, $\mathbf{x}_{it}$ be the $p \times 1$ vector of regressors, $\mathbf{w}_{it}$ be the $q \times 1$ vector of random-effects regressors, which is a subset of $\mathbf{x}_{it}$, and $\mathbf{b}_i$ be the $q \times 1$ random-effects coefficient vector with variance-covariance matrix $\mathbf{D}$. Following Laird and Ware (1982), we write a general model for panel (time series cross-sectional or grouped time series) data as follows:

$$y_{it} = \mathbf{x}_{it}^\top \beta + \mathbf{w}_{it}^\top \mathbf{b}_i + \varepsilon_{it}, \quad \varepsilon_{it} \sim \mathcal{N}(0, \sigma^2), \quad \mathbf{b}_i \sim \mathcal{N}(0, \mathbf{D}).$$


The above model can be modified to the fixed-effects model by setting $\mathbf{x}_{it}^\top \beta = \alpha_i + \tilde{\mathbf{x}}_{it}^\top \tilde{\beta}$ and $\mathbf{w}_{it} = 0$, where $\alpha_i$ is the unobserved time-constant individual effect for subject $i$, $\tilde{\mathbf{x}}_{it}$ is the model matrix without the constant, and $\tilde{\beta}$ is the parameter vector without the intercept.

A fixed-effects HMBB can be written using the group-demeaned data $(\ddot{y}, \ddot{\mathbf{x}})$. The main quantity of interest is the state-dependent parameters of the fixed-effects model $(\beta, \sigma^2)$:

$$\ddot{y}_{it} = \begin{cases} \ddot{\mathbf{x}}_{it}^\top \beta_1 + \varepsilon_{it}, & \varepsilon_{it} \sim \mathcal{N}(0, \sigma_1^2) & \text{for } t_0 \le t < \tau_1 \\ \quad\vdots & \quad\vdots & \quad\vdots \\ \ddot{\mathbf{x}}_{it}^\top \beta_M + \varepsilon_{it}, & \varepsilon_{it} \sim \mathcal{N}(0, \sigma_M^2) & \text{for } \tau_{M-1} \le t < T \end{cases}$$

where $\tau_m$ is the break point between regime $m$ and regime $m + 1$. In order to allow many predictors, we use the Bayesian bridge prior for $\beta$ as shown in Equation (2.3) of the paper. The posterior distribution of the resulting model takes the following form:

$$
p(\beta, \sigma^2, \Lambda, \alpha, \tau, \mathbf{P} \mid \ddot{\mathbf{y}}, \ddot{\mathbf{x}}) = \int p(\ddot{y}_1 \mid \ddot{\mathbf{x}}_1, \beta, \sigma_1^2, \Lambda_1, \alpha_1, \tau_1) \prod_{t=2}^{T} \left[ \sum_{m=1}^{M} p(\ddot{y}_t \mid \ddot{\mathbf{Y}}_{t-1}, \ddot{\mathbf{X}}_{t-1}, \beta_m, \sigma_m^2, \Lambda_m, \alpha_m, \tau_m, \mathbf{P})\, p(s_t = m \mid s_{t-1}, \beta, \sigma^2, \Lambda, \alpha, \tau, \mathbf{P}) \right] p(\mathbf{P})\, p(\beta, \Lambda)\, p(\sigma^2)\, p(\alpha)\, p(\tau)\, d\mathbf{S}. \tag{11}
$$

Similarly, a random-effects HMBB can be written by first letting the parameters of the random-effects model vary across regimes:

$$y_{it} = \begin{cases} \mathbf{x}_{it}^\top \beta_1 + \mathbf{w}_{it}^\top \mathbf{b}_i + \varepsilon_{it}, & \mathbf{b}_i \sim \mathcal{N}(0, \mathbf{D}_1),\; \varepsilon_{it} \sim \mathcal{N}(0, \sigma_1^2) & \text{for } t_0 \le t < \tau_1 \\ \quad\vdots & \quad\vdots & \quad\vdots \\ \mathbf{x}_{it}^\top \beta_M + \mathbf{w}_{it}^\top \mathbf{b}_i + \varepsilon_{it}, & \mathbf{b}_i \sim \mathcal{N}(0, \mathbf{D}_M),\; \varepsilon_{it} \sim \mathcal{N}(0, \sigma_M^2) & \text{for } \tau_{M-1} \le t < T. \end{cases}$$

Then, the Bayesian bridge prior is used as the prior distribution of $\beta$. The posterior distribution of the resulting model takes the following form:

$$
p(\beta, \mathbf{D}, \sigma^2, \Lambda, \alpha, \tau, \mathbf{P} \mid \mathbf{y}, \mathbf{X}, \mathbf{W}) = \int p(y_1 \mid \mathbf{X}_1, \mathbf{W}_1, \beta_1, \mathbf{b}_i, \mathbf{D}_1, \sigma_1^2, \Lambda_1, \alpha_1, \tau_1) \prod_{t=2}^{T} \left[ \sum_{m=1}^{M} p(y_t \mid \mathbf{Y}_{t-1}, \mathbf{X}_{t-1}, \mathbf{W}_{t-1}, \beta_m, \mathbf{b}_i, \mathbf{D}_m, \sigma_m^2, \Lambda_m, \alpha_m, \tau_m, \mathbf{P})\, p(s_t = m \mid s_{t-1}, \beta, \mathbf{b}_i, \mathbf{D}, \sigma^2, \Lambda, \alpha, \tau, \mathbf{P}) \right] p(\mathbf{P})\, p(\beta, \Lambda)\, p(\mathbf{D})\, p(\sigma^2)\, p(\alpha)\, p(\tau)\, d\mathbf{b}_i\, d\mathbf{S}. \tag{12}
$$

4.2 Binary and Count Data

It is straightforward to extend the HMBB to binary response data using Albert and Chib (1993)'s data augmentation method, which uses truncated normal distributions with a latent variable $z$.


The posterior distribution of the Bayesian bridge binary probit change-point model is

$$
\begin{aligned}
p(\beta, \Lambda, \alpha, \tau, \mathbf{P} \mid \mathbf{y}, \mathbf{X}) &= \int p(\mathbf{z}, \beta, \Lambda, \alpha, \tau, \mathbf{P}, \mathbf{S} \mid \mathbf{y}, \mathbf{X})\, d\mathbf{S}\, d\mathbf{z} \\
&\propto \int p(\mathbf{S} \mid \mathbf{P})\, p(\mathbf{P})\, p(\beta, \Lambda)\, p(\alpha)\, p(\tau) \prod_{t=1}^{n} p(z_t \mid \beta_{s_t}, \Lambda_{s_t}, \alpha_{s_t}, \tau_{s_t}, \mathbf{P}, \mathbf{x}_t) \\
&\qquad \times \big(\mathbf{1}(y_t = 1)\mathbf{1}(z_t > 0) + \mathbf{1}(y_t = 0)\mathbf{1}(z_t \le 0)\big)\, d\mathbf{S}\, d\mathbf{z}
\end{aligned}
$$

where $p(z_t \mid \beta_{s_t}, \Lambda_{s_t}, \alpha_{s_t}, \tau_{s_t}, \mathbf{P}, \mathbf{x}_t)$ is a simplified notation for a Markov mixture distribution given hidden state $s_t$.¹ See Park (2011) for a detailed discussion of the sampling algorithm of the Bayesian probit change-point model.

¹ The full notation is
$$p(z_t \mid \beta_{s_t}, \Lambda_{s_t}, \alpha_{s_t}, \tau_{s_t}, \mathbf{P}, \mathbf{x}_t) = p(z_1 \mid \mathbf{x}_1, \beta_1, \Lambda_1, \alpha_1, \tau_1) \prod_{t=2}^{n} \sum_{m=1}^{M} p(z_t \mid \mathbf{Z}_{t-1}, \mathbf{X}_{t-1}, \beta_m, \Lambda_m, \alpha_m, \tau_m, \mathbf{P})\, p(s_t = m \mid s_{t-1}, \beta, \Lambda, \alpha, \tau, \mathbf{P}).$$

Alternatively, we can use a logit link function based on Polson et al. (2013)'s data augmentation method to extend the HMBB to binary response data. Polson et al. (2013) use the Pólya-Gamma family of distributions to augment latent variables ($\omega_t$) that induce a scale mixture of normal distributions. Let $n_t$ be the number of trials, which is 1 in the case of binary data.

$$
\begin{aligned}
(\omega_t \mid \beta) &\sim \text{Pólya-Gamma}(n_t, \mathbf{x}_t^\top \beta_m) \\
\beta_m \mid \mathbf{y}, \omega &\sim \mathcal{N}(m_\omega, V_\omega) \\
V_\omega &= \big(\mathbf{X}_m^\top \Omega_m \mathbf{X}_m + \mathbf{B}_0^{-1}\big)^{-1} \\
m_\omega &= V_\omega \big(\mathbf{X}_m^\top \kappa + \mathbf{B}_0^{-1} \mathbf{b}_0\big)
\end{aligned}
$$

$\Omega_m$ is a diagonal matrix with $(\omega_{m_1}, \ldots, \omega_{m_n})$ as diagonal elements and $\kappa = (y_{m_1} - n_{m_1}/2, \ldots, y_{m_n} - n_{m_n}/2)$, where $m_1$ and $m_n$ are the starting and ending points of state $m$, respectively.
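A minimal sketch of one such Pólya-Gamma augmented Gibbs step for a single regime is below, using the rpg sampler from the BayesLogit package. The inputs y, X, and the prior $(\mathbf{b}_0, \mathbf{B}_0)$ are assumed, and the helper name pg_step is ours.

## Minimal sketch of one Polya-Gamma Gibbs step for a single regime,
## following Polson et al. (2013); 'y', 'X', and the prior (b0, B0)
## are assumed inputs. Uses BayesLogit::rpg.
library(BayesLogit)
pg_step <- function(y, X, beta, b0, B0) {
  n <- length(y)
  omega <- rpg(n, h = 1, z = drop(X %*% beta))   # omega_t | beta
  kappa <- y - 1/2                               # binary case: n_t = 1
  V <- solve(t(X) %*% (omega * X) + solve(B0))   # V_omega
  m <- V %*% (t(X) %*% kappa + solve(B0) %*% b0) # m_omega
  drop(m + t(chol(V)) %*% rnorm(ncol(X)))        # draw beta | y, omega
}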

For ordered response data, we can use Park (2011)'s Bayesian ordered probit change-point model with the following posterior distribution:

$$
p(\beta, \xi, \Lambda, \alpha, \tau, \mathbf{P} \mid \mathbf{y}, \mathbf{X}) \propto \int p(\mathbf{S} \mid \mathbf{P})\, p(\mathbf{P})\, p(\beta, \Lambda)\, p(\xi)\, p(\alpha)\, p(\tau) \prod_{\ell=1}^{J} \prod_{t=1}^{n} \big[\Phi(\xi_{\ell, s_t} - \mathbf{x}_t^\top \beta_{s_t}) - \Phi(\xi_{\ell-1, s_t} - \mathbf{x}_t^\top \beta_{s_t})\big]\, d\mathbf{S}
$$

where $\xi_{\ell,m}$ is the $\ell$th cutpoint at state $m$. The sampling of $\beta$ is done by augmenting a latent variable $z_t$, which is sampled from a truncated normal distribution $\mathcal{TN}_{[\xi_{\ell-1,s_t},\, \xi_{\ell,s_t}]}(\mathbf{x}_t^\top \beta_{s_t}, 1)$.

For small count data, we can use Frühwirth-Schnatter and Wagner (2006)'s data augmentation


method:

$$
\begin{aligned}
p(y_t \mid \lambda_t) &= \frac{e^{-\lambda_t} \lambda_t^{y_t}}{y_t!} \\
\eta_{tj} &\sim \text{Exp}(\lambda_t) = \frac{\text{Exp}(1)}{\lambda_t} \\
\log \eta_{tj} &= \mathbf{x}_t^\top \beta_{s_t} + \varepsilon_{tj}, \quad \varepsilon_{tj} \sim \log(\text{Exp}(1))
\end{aligned}
$$

where $\eta_{tj}$ is a hidden inter-arrival time, indicating the length of time between the $(j-1)$th and the $j$th event within time interval $t$, and $\text{Exp}(\cdot)$ is an exponential distribution.

4.3 Nonparametric Regression

A general form of a nonparametric regression model with a change-point can be written as follows:

$$y_t = f_{s_t}(x_t) + \varepsilon_t \tag{13}$$

where $s_t \in \mathcal{S} = \{1, \ldots, S\}$ indexes the regime that time $t$ belongs to. If we consider the additive model $f(\mathbf{x}_t) = \sum_{j=1}^{p} f_j(x_{tj})$ to reduce the curse of dimensionality and use B-spline basis functions, we can write a nonparametric regression model with a change-point as a high-dimensional regression model with a change-point:

$$y_t = \sum_{j=1}^{p} \sum_{h=1}^{H_j} \beta_{s_t, jh}\, b_{jh}(x_{tj}) + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma_{s_t}^2). \tag{14}$$

To check the performance of the HMBB in this nonparametric regression setting, we generate $y_t$ from the following model:

$$y_t = f_t(x_t) + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, 1)$$

$$f_t(x_t) = \begin{cases} \sin(2x_t - 1) & \text{if } t < T_0 \\ 0.2 x_t + 0.04 x_t^2 - 0.2 x_t^3 & \text{if } t \ge T_0 \end{cases}$$

for $t = 1, \ldots, 150$, $x_t \sim \text{Uniform}(-3.1, 3.1)$, and $T_0 = \lfloor T/2 \rfloor$.

Figure 4 summarizes the results of the HMBB analysis using the B-spline basis matrix for a polynomial spline of degree 3. The upper left panel shows $y_t$, and the dotted vertical line marks the planted break point. The upper middle and upper right panels show $f_t(x_t)$ before and after the break point. The lower panels show the estimated results. The lower left panel shows the estimated break point, which is highly precise. The lower middle and lower right panels show posterior means (the dotted line) with 95% credible intervals (bright colored area) for the pre-break and post-break periods. The HMBB closely recovers the regime-changing non-linear effects of $x_t$ on $y_t$.
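A minimal sketch of this experiment is below: we generate the data above and build a degree-3 B-spline design with splines::bs, then pass the basis matrix to BridgeChangeReg following the interface shown in Section 3.2. The basis dimension df = 10 is an assumption of ours, standing in for $H_j$.

## Minimal sketch: generate the nonparametric test data and build the
## degree-3 B-spline design (df = 10 is our assumption for H_j).
library(splines)
set.seed(1)
Tt <- 150; T0 <- floor(Tt/2)
x <- runif(Tt, -3.1, 3.1)
f <- ifelse(seq_len(Tt) < T0,
            sin(2*x - 1),
            0.2*x + 0.04*x^2 - 0.2*x^3)
y <- f + rnorm(Tt)
B <- bs(x, df = 10, degree = 3)    # B-spline basis matrix
fit.np <- BridgeChangeReg(y = y, X = B, n.break = 1, mcmc = 1000)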


[Figure 4 about here. Upper panels: the simulated time series $y_t$ and the true $f(x)$ in Regime 1 and Regime 2. Lower panels: the posterior state probability, Pr(S = k | Y), and the Regime 1 and Regime 2 estimates (true curve and posterior mean).]

Figure 4: Nonparametric Regression with a Change-point: A single change-point is planted at t = 75.


5 Simulation

In this section we report details of simulation tests not reported in the main text. First, wetest our method using high dimensional uncorrelated data with no change-point. Second,we test our method using high dimensional correlated data with no change-point. Then, wetest our method using high dimensional correlated data with a change-point. The level ofcorrelation is set at 0.7 and 0.3.

5.1 Simulation Design

Following Donoho (2005) and Donoho and Stodden (2006), the simulated data vary along two dimensions: the level of underdeterminedness ($\delta = n/p$) and the level of sparsity ($\rho = k/n$), where $n$ is the number of observations and $k$ is the number of non-sparse predictors. To make interpretation simple, we fix the number of predictors ($p$) at 200 and vary $n$ from 10 to 200 and $k$ from 1 to 200, so that both the level of underdeterminedness ($\delta = n/p$) and the sparsity level ($\rho = k/n$) take 50 equidistant points on the interval [0.1, 1].

Then, we use the underlying model $y = X\beta + \varepsilon$, with $x_{ij} \sim \mathcal{N}(0, 1)$ and $\varepsilon \sim \mathcal{N}(0, 4^2 \mathbf{I}_n)$, varying $\delta$ and $\rho$. The change point is set at the mid-point, $\lfloor n/2 \rfloor$, and coefficients are drawn independently for each regime. Based on the value of $k$, regression coefficients are set as $\beta_{1:k} \sim \text{Uniform}(0, 50)$ and $\beta_{k+1:p} = 0$.² We create $50 \times 50 = 2{,}500$ unique pairs of $(\delta, \rho)$ and simulate 20 datasets from the same underlying model for each pair. In total, the number of simulated data sets is $50^2 \times 20 = 50{,}000$. The entire test results are reported both in a numerical summary table and in the format of the "phase diagrams" used by Donoho (2005) and Donoho and Stodden (2006).
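A minimal sketch of the data-generating step for one $(\delta, \rho)$ cell of the phase diagram is below; the helper name gen_cell is ours, introduced for illustration of the break-case design.

## Minimal sketch: one (delta, rho) cell of the phase-diagram design
## with p fixed at 200 (the helper name 'gen_cell' is ours).
gen_cell <- function(delta, rho, p = 200) {
  n <- round(delta * p); k <- round(rho * n)
  X <- matrix(rnorm(n * p), n, p)
  beta1 <- c(runif(k, 0, 50), rep(0, p - k))  # regime 1 coefficients
  beta2 <- c(runif(k, 0, 50), rep(0, p - k))  # regime 2 coefficients
  cut <- floor(n / 2)                         # break at the mid-point
  y <- c(X[1:cut, ] %*% beta1, X[(cut+1):n, ] %*% beta2) + rnorm(n, 0, 4)
  list(y = y, X = X, beta1 = beta1, beta2 = beta2)
}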

Table 1: Simulation Performance Criteria

Metric                       Formula                                                                                          Property
Prediction Loss              $L_{\mathrm{pred}}(\hat{\beta}; \beta^\star) = \frac{1}{n}\|X\hat{\beta} - X\beta^\star\|^2$      in-sample model fit
Normalized Estimation Loss   $L_2(\hat{\beta}; \beta^\star) = \frac{\|\hat{\beta} - \beta^\star\|_2}{\|\beta^\star\|_2}$       parameter consistency
Cross-validation Loss        $L_{\mathrm{CV}}(\hat{\mathbf{y}}; \mathbf{y}^\star) = \frac{1}{|\mathcal{I}_c|}\sum_{t \in \mathcal{I}_c}(y_t - X_t^\top \hat{\beta})^2$   out-of-sample predictive accuracy

We evaluate the performance of the different regularization methods using the criteria summarized in Table 1. First, Prediction Loss is related to persistency or risk consistency (e.g., see Greenshtein and Ritov, 2004), one of the oracle properties that a high-dimensional regression estimator should satisfy. Second, Normalized Estimation Loss captures parameter consistency; achieving good performance on this criterion usually requires stronger assumptions than those for the prediction loss. Last, Cross-validation Loss checks out-of-sample predictive accuracy. We conduct a 2-fold cross-validation prediction to compute the cross-validation loss.

² We also consider cases of a correlated design matrix; results are reported in Section 5.4.


Table 2: Hybrid Lasso Methods and HMBB for the Change-point Simulation Test

Method             Break Point   Algorithm
Lasso (Estimate)   Unknown       Two-step estimation:
                                 1. Identify the break point by the HMM analysis of the lasso residuals
                                    ($\hat{\varepsilon}_t = y_t - \sum_{j=1}^{p} x_{t,j}\hat{\beta}^{\mathrm{Lasso}}_j$).
                                 2. Apply the lasso method to the subset data for regime-specific regularization.
Lasso (Oracle)     Known         Two-step estimation:
                                 1. Subset the data based on the true break point.
                                 2. Apply the lasso method to the subset data for regime-specific regularization.
HMBB               Unknown       One-step estimation

5.2 Estimating Procedure for Benchmark Estimates

• Lasso (Estimate)

1. Fit the lasso to the entire data and obtain residuals.

2. Fit the hidden Markov model (HMM) on the residuals using MCMCresidualBreakAnalysis in MCMCpack (Martin et al., 2011) and subset the data based on the detected regimes.³

3. Fit the lasso separately on each subset of the data.

• Lasso (Oracle)

1. Subset the data using the true break points.

2. Fit the lasso separately on each subset of the data.

For the BridgeChange estimates, we set the correct number of breaks, but the location of the break point is determined by the HMBB. The point of the comparison is to see (1) whether the HMBB with an unknown break point outperforms the two-step approach of Lasso-Estimate and (2) how closely the HMBB performs against Lasso-Oracle, which uses ground-truth knowledge about the break point. A minimal sketch of the two-step Lasso-Estimate procedure follows.
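The sketch below follows the two steps listed above; the argument defaults of MCMCresidualBreakAnalysis and the regime-extraction step via the "prob.state" attribute are assumptions of ours about the MCMCpack interface.

## Minimal sketch of the two-step Lasso (Estimate) benchmark.
library(glmnet); library(MCMCpack)
## Step 1: lasso on the full data, then an HMM on the residuals.
fit0  <- cv.glmnet(X, y, alpha = 1)
resid <- y - drop(predict(fit0, newx = X, s = "lambda.min"))
hmm   <- MCMCresidualBreakAnalysis(resid, m = 1)  # m = number of breaks
## Most probable hidden state at each t gives the detected regimes
## (the "prob.state" attribute is also used by plotState).
state <- apply(attr(hmm, "prob.state"), 1, which.max)
## Step 2: regime-specific lasso fits on each detected subset.
fits <- lapply(unique(state), function(s)
  cv.glmnet(X[state == s, , drop = FALSE], y[state == s], alpha = 1))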

5.3 High Dimensional Data with No Change-point

We first report simulation results from no change-point cases. We compare the performanceof the HMBB with that of three popular regularization methods: lasso, elastic net, and ridge.We use the 3-fold cross-validation to obtain estimates from these popular methods usingcv.glmnet in glmnet (Friedman et al., 2010).

Table 3 reports the numerical summary of the test. Overall, HMBB with no break (HMBB0)shows the lowest prediction loss and the lowest cross-validation loss, followed by the elasticnet. In normalized estimation loss, the elastic net and the lasso perform better than HMBB.

³ Note that MCMCresidualBreakAnalysis uses the same HMM algorithm as HMBB, and hence we can hold the change-point detection algorithm constant in the test.


Table 3: Summary of the No Change-point Case: The reported numbers are averaged over 50,000 simulated data sets. The data have no break. The MCMC simulation size for HMBB is 100 with a burn-in of 100. HMBB0 indicates an HMBB with no break.

             Prediction Loss    Normalized Estimation Loss    Cross-validation Loss
Method       Mean     SD        Mean     SD                   Mean     SD
HMBB0        0.01     0.00      0.66     0.20                 0.23     0.13
ElasticNet   0.03     0.04      0.55     0.28                 0.26     0.23
Lasso        0.04     0.05      0.55     0.30                 0.28     0.25
Ridge        0.22     0.16      0.89     0.04                 0.38     0.26

Figure 5 displays the entire results. Blue indicates a smaller loss and red a larger loss in each graph. Panel (A) compares the prediction loss. Overall, HMBB0, lasso, and elastic net show similar performance. In contrast, the prediction loss of the ridge estimates increases as we move to the upper-right corner, where $p \approx n$ and $k \approx n$. When $k \approx n$ and $n \ll p$ (the upper-left region), HMBB0 outperforms both lasso and elastic net.

Panel (B) in Figure 5 compares the normalized estimation loss. HMBB0 shows slightly worse performance than lasso and elastic net on this criterion. This is due to the difference between selection (sparsity) and shrinkage: HMBB0 shrinks small values toward zero without forcing them to be exactly zero, whereas lasso and elastic net produce sparse outcomes. Because the simulation test uses exactly zero for weak signals, HMBB0 produces many small non-zero values for true zero coefficients, unlike lasso and elastic net.

Note that the performance of elastic net and lasso is clearly divided by the diagonal line, which corresponds to what Donoho and Stodden (2006) call the theoretical threshold of $\ell_1$-based methods.⁴ That is, only when $n/p > k/n$ (below the diagonal line) do elastic net and lasso recover the true coefficient values successfully.

Panel (C) compares the cross-validation loss. We use one half of the data for training and the other half for testing. Overall, HMBB0 shows better performance throughout the entire set of test cases. Ridge performs poorly in the upper-right corner. Lasso and elastic net also show poor performance as $k$ approaches $n$ and $n$ approaches $p$.

⁴ Donoho and Stodden (2006) wrote that "there is a breakdown point for standard model selection schemes, such that model selection only works well below a certain critical complexity level" (Donoho and Stodden, 2006, 1).


[Figure 5 about here. Three rows of phase diagrams (HMBB, ElasticNet, Lasso, Ridge) over $\delta = n/p$ and $\rho = k/n$: Panel (A) prediction loss, Panel (B) normalized estimation loss, Panel (C) cross-validation loss.]

Figure 5: Panel (A): Prediction Loss, $L_{\mathrm{pred}}(\hat{\beta}; \beta^{\mathrm{true}})$. Panel (B): Normalized Estimation Loss, $L_2(\hat{\beta}; \beta^{\mathrm{true}}) = \|\hat{\beta} - \beta^{\mathrm{true}}\|_2 / \|\beta^{\mathrm{true}}\|_2$. Panel (C): Cross-validation Loss, $L_{\mathrm{CV}}(\hat{\mathbf{y}}_{\mathrm{test}}; \mathbf{y}_{\mathrm{test}})$. We fix p = 200 and vary δ and ρ between 0.1 and 1. Thus, each cell in the graph represents a data set with (n, p, k). We simulate 20 data sets for each (n, p, k) and take the median error.


5.4 High Dimensional Correlated Data

In this section, we provide additional simulation results under a correlated design matrix setup (correlation 0.7), with and without a change-point.

Figure 6 shows the performance of the HMBB under a correlated design matrix with no change-point, and Figure 7 summarizes the results of the simulation test for correlated data with a single change-point.

In the case of no change-point (Figure 6), the HMBB outperforms the other regularization methods in out-of-sample predictive accuracy (cross-validation loss), as we found in the no-correlation case. The HMBB performs as well as elastic net and lasso in prediction loss and in normalized estimation loss.

In the case of a change-point (Figure 7), the HMBB outperforms Lasso-Estimate and Lasso-Oracle in normalized estimation loss and CV loss. Interestingly, Lasso-Estimate shows poor performance on all criteria when the design matrix is correlated. Also, to our surprise, Lasso-Oracle shows highly unstable performance on the three measures, particularly in out-of-sample predictive accuracy (cross-validation loss), where it is the worst among the three methods. This suggests that selecting variables using subsets of data (training data) can be highly misleading when the design matrix is highly correlated.

To summarize, our simulation studies have shown that (1) the Bayesian bridge model is a highly robust and reliable tool for estimating parameter values from high-dimensional data with or without a break, (2) the HMBB shows outstanding performance in predicting observed (in-sample) and independent (out-of-sample) data, (3) the HMBB performs better than hybrid lasso methods when the design matrix is highly correlated, and (4) the HMBB is highly useful for estimating non-linear effects of variables in regression analysis. The superior performance of the HMBB over the hybrid lasso methods rests on three factors. The first is the joint estimation of the break point and the parameters in the HMBB; in contrast, Lasso-Estimate is based on a two-step approach of break detection and parameter estimation, which underestimates the uncertainty in each step. The second is the HMBB's use of the full data information, while Lasso-Estimate uses only the information in the residuals for break detection. Third, the HMBB's endogenous estimation of break points provides important protection against overfitting.

6 Application Details

6.1 Additional Results on Nunn and Qian (2014)’s Example

We replicated Nunn and Qian (2014)'s analysis using the double machine learning method. The estimation is done using the hdm package in R. The estimated effect of US wheat aid (1,000 MT) on intrastate conflicts changes from 0.00263 (0.00088, 0.00788) to 0.00411 (0.00118, 0.00699). Regularizing nuisance parameters thus increases the size of the causal estimate to more than 150% of the original estimate.


[Figure 6 about here. Phase diagrams (HMBB, ElasticNet, Lasso, Ridge): Panel (A) prediction loss, Panel (B) normalized estimation loss, Panel (C) cross-validation loss.]

Figure 6: Results of Simulation Studies using Correlated (correlation 0.7) Univariate Time Series Data with No Change-point: Panel (A): Prediction Loss, $L_{\mathrm{pred}}(\hat{\beta}; \beta^{\mathrm{true}})$. Panel (B): Normalized Estimation Loss, $L_2(\hat{\beta}; \beta^{\mathrm{true}}) = \|\hat{\beta} - \beta^{\mathrm{true}}\|_2 / \|\beta^{\mathrm{true}}\|_2$. Panel (C): Cross-validation Loss, $L_{\mathrm{CV}}(\hat{\mathbf{y}}_{\mathrm{test}}; \mathbf{y}_{\mathrm{test}})$. We fix p = 200 and vary δ and ρ between 0.1 and 1. Thus, each cell in the graph represents a data set with (n, p, k). We simulate 20 data sets for each (δ, ρ) and take the median error.


[Figure 7 about here. Phase diagrams (HMBB, Lasso (Estimate), Lasso (Oracle)): Panel (A) prediction loss, Panel (B) normalized estimation loss, Panel (C) cross-validation loss.]

Figure 7: Results of Simulation Studies using Univariate Time Series Data with One Change-point (correlation 0.7): Panel (A): Prediction Loss, $L_{\mathrm{pred}}(\hat{\beta}; \beta^{\mathrm{true}})$. Panel (B): Normalized Estimation Loss, $L_2(\hat{\beta}; \beta^{\mathrm{true}}) = \|\hat{\beta} - \beta^{\mathrm{true}}\|_2 / \|\beta^{\mathrm{true}}\|_2$. Panel (C): Cross-validation Loss, $L_{\mathrm{CV}}(\hat{\mathbf{y}}_{\mathrm{test}}; \mathbf{y}_{\mathrm{test}})$. We fix p = 200 and vary δ and ρ between 0.05 and 1. Thus, each cell in the graph represents a data set with (n, p, k). We simulate 25 data sets for each (δ, ρ) and take the median error.



Table 4 reports estimates of α and β in the first- and second-stage equations of Nunn and Qian (2014) in the manuscript. "Pooled" indicates the original pooled 2SLS estimates, and "Regime 1" to "Regime 4" indicate regime-specific estimates (DML after HMBB).

Table 4: Time-varying Estimates of α and β in the first- and second-stage equations. Est. refers to point estimates; 95% CI Low (High) refers to the lower (upper) bound of the 95% confidence interval.

Data       Parameter   Est.     95% CI Low   95% CI High
Pooled     α           0.104    0.103        0.105
Regime 1   α           0.106    0.103        0.109
Regime 2   α           0.350    0.296        0.407
Regime 3   α           0.058    0.054        0.065
Regime 4   α           0.026    0.023        0.031
Pooled     β           0.004    0.001        0.007
Regime 1   β           0.004    0.001        0.006
Regime 2   β           0.000    -0.002       0.002
Regime 3   β           0.003    -0.002       0.007
Regime 4   β           0.008    -0.011       0.026

6.2 Additional Results on Alvarez et al. (1991)’s Example

We revisit Alvarez et al. (1991)'s study on the effect of government partisanship and labor union centralization on economic growth. This study has produced a large body of literature on the analysis of time series cross-sectional data in political science (Alvarez et al., 1991; Beck et al., 1993; Beck and Katz, 1995; Western, 1998).

For example, Beck et al. (1993) and Beck and Katz (1995) questioned the validity of the standard errors reported in Alvarez et al. (1991), who found that leftist governments promote economic growth when they are accompanied by centrally organized labor unions. This debate led to the panel-corrected standard error of Beck and Katz (1995), which has since been the most frequently used panel data technique in political science. Western (1998) raised the issue of how to pool observations across groups (or different levels of analysis). This debate drew attention to partial pooling methods, Bayesian hierarchical models, and multilevel models, which have been widely used in the analysis of TSCS data since then (Steenbergen and Jones, 2002; Shor et al., 2007; Park and Jensen, 2007).

The original Alvarez et al. (1991) study includes six covariates and one interaction term. Among the six covariates, five are time-varying. Western (1998) extends the original model by interacting all time-varying covariates with one time-constant variable (the level


of centralization in labor organizations) and found that the original finding remains significant.

Although there is no theoretical reason to interact only the institutional variable, scholars have not dared to explore a full (two-way) interaction model with $6 + \binom{6}{2} = 21$ predictors (excluding the intercept) due to the short time span (15 years) of the data. Our focus in this section is to examine change-points of a full interaction model using the HMBB.

For replication, we use the agl data set included in the pcse package in R. The agl data set covers 16 OECD countries for the period 1970-1984. The dependent variable is the annual growth rate and the independent variables are as follows:

1. lagg1: the lagged growth rate

2. opengdp: weighted OECD demand

3. openex: weighted OECD export

4. openimp: weighted OECD import

5. leftc: the cabinet composition of left-leaning parties

6. central: the labor organization index

Since central is time-constant within countries, we could not use country fixed effects. Instead, we use a one-way fixed-effects model at the year level; a minimal sketch of this fit appears below. The fixed-effects regression results are reported in Table 5. As in previous studies, the interaction term (leftc:central) shows a statistically significant positive sign and central shows a large negative sign. Figure 8 shows the over-time variation of country-specific residuals. Residuals show strong time trends, and heterogeneity seems to increase over time. Also, the adjusted R-squared of the model is close to 0. In other words, the regression model seems to be highly misspecified.
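The sketch below fits the one-way (year) fixed-effects regression of Table 5 on the agl data; we use year dummies via factor(year) as an equivalent alternative to group demeaning, and the agl column names are assumptions based on the pcse package documentation.

## Minimal sketch: one-way (year) fixed-effects fit on the agl data,
## using year dummies as an equivalent alternative to group demeaning.
library(pcse)
data(agl)
fe.fit <- lm(growth ~ lagg1 + opengdp + openex + openimp +
               leftc * central + factor(year), data = agl)
summary(fe.fit)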

Table 5: One-way Fixed-effects Estimation of Alvarez et al. (1991)'s Data. The dependent variable is the annual growth rate and the fixed effects are at the year level. The estimation uses the group-demeaned data.

                Estimate   SE
lagg1           0.05       0.14
opengdp         -0.00      0.00
openex          0.00       0.00
openimp         -0.00      0.00
leftc           -0.02      0.01
central         -0.76      0.22
leftc:central   0.01       0.00
N               240 (16 countries for 15 years)
Adj R-squared   0.026

Table 6 shows the break number test using the Watanabe-Akaike information criterion (WAIC) (Watanabe, 2010) and the approximate log marginal likelihood (Chib, 1995).⁵


[Figure 8 about here. Panel residuals plotted over time (1970-1982) for each of the 16 countries: AUL, AUS, BEL, CAN, DEN, FIN, FRA, GER, IRE, ITA, JAP, NET, NOR, SWE, UK, USA.]

Figure 8: Panel Residuals of the One-way Fixed-effects Estimation of Alvarez et al. (1991)'s Data. The plot is drawn using the plm package in R.


[Figure 9 about here. Posterior regime probabilities, Pr(S_t = k | Y_t), over 1970-1984 for the full interaction model with one break (left) and two breaks (right).]

Figure 9: Detected Change-points from the HMBB with a single break (left) and two breaks (right). The detected break point in the left panel is 1979 and the two break points in the right panel are 1975 and 1979. The fitted model is a full interaction model and the data is Alvarez et al. (1991)'s agl data.

We fit multiple HMBBs with different model specifications and compare the model fit of these models to diagnose the break number. Single Interaction indicates the original one-way fixed-effects model with one interaction term (p = 7). Full Interaction indicates an expanded model with all pairwise interactions (p = 21). Both WAIC and the log marginal likelihood favor the full interaction model with 2 breaks. Figure 9 displays the detected break points of the one-break model (left) and the two-break model (right), respectively. When we set one break, the HMBB detects 1979 as the single break of the full interaction model. When we allow one more break, 1979 is still detected as a break point and 1975 is chosen as an additional break point. The identification of the same break point across different break numbers indicates strong evidence for the existence of a change-point in the data given the model.

Table 6: HMBB with One-way Fixed-effects on Alvarez et al. (1991)'s OECD Data. The dependent variable is the annual growth rate and the fixed effects are at the country level and the year level. When the break number is larger than 2, signs of nonconvergence are dominant.

               Single Interaction               Full Interaction
Diagnostic     break 0   break 1   break 2     break 0   break 1   break 2
WAIC           654.01    621.08    611.84      633.61    594.27    579.42
Log Marginal   -330.69   -318.03   -316.90     -323.72   -308.78   -304.12

⁵ Computation details of the model diagnostic measures are discussed in Section 3.5.


Figure 10 visualizes the parameter changes of the full interaction model with two breaks. "Pooled" indicates time-averaged estimates from the one-way fixed-effects model, 1970-1984. "Regime 1" indicates estimates from state 1 (1970-1975), "Regime 2" indicates estimates from state 2 (1976-1979), and "Regime 3" indicates estimates from state 3 (1980-1984). Overall, the fixed-effects estimates are located around zero, pooling regime-changing estimates toward zero. In contrast, the regime-changing estimates of the HMBB show dramatic shifts across regimes. One of the most important findings of Alvarez et al. (1991) was a positive effect of leftc:central on the annual growth rate, while central alone has a negative effect on the annual growth rate. This conditional effect is too small to be meaningful in the pooled full interaction model, as shown in Table 7. In contrast, when we take the hidden regime changes into account, the interaction effect of leftc:central becomes much larger before 1979 and then decreases after 1980. Interestingly, the marginal effect of the left-party cabinet size is largest during 1976-1979, and it becomes negative after 1980.

Table 7: Conditional Effects of Left-party Cabinet Size and Centralized Labor Union Organizationson the Growth Rate: The sample range of central is (0.4, 3.6) with the sample mean of 2.0. Thesample range of leftc is (0.0, 100.0) with the sample mean of 34.8.

Parameter      Regime    Est.    Low    High
leftc          Regime 1   0.09  -0.01   0.18
leftc          Regime 2   0.32   0.23   0.42
leftc          Regime 3  -0.08  -0.14  -0.03
leftc          Pooled     0.00  -0.00   0.01
central        Regime 1  -0.45  -0.56  -0.34
central        Regime 2  -0.64  -0.75  -0.53
central        Regime 3   0.31   0.21   0.39
central        Pooled    -0.31  -0.69   0.07
leftc:central  Regime 1   0.31   0.23   0.38
leftc:central  Regime 2   0.42   0.27   0.64
leftc:central  Regime 3   0.10   0.06   0.15
leftc:central  Pooled     0.01   0.00   0.02
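The regime-specific entries in Table 7 can be read as conditional effects: for instance, the effect of leftc on the growth rate is the posterior distribution of the leftc coefficient plus the leftc:central coefficient times central, evaluated at a fixed value of central such as its sample mean of 2.0. Below is a minimal R sketch of this summary; the fake posterior draws and the 95% interval level are assumptions for illustration.

    ## Minimal sketch: conditional effect of leftc at a given value of
    ## central, summarized over (hypothetical) posterior draws.
    conditional_effect <- function(beta_leftc, beta_interaction,
                                   central = 2.0, level = 0.95) {
      draws <- beta_leftc + beta_interaction * central  # d E[growth] / d leftc
      a <- (1 - level) / 2
      c(Est = mean(draws), quantile(draws, c(a, 1 - a)))
    }
    ## illustration with fake draws (not the HMBB posterior)
    set.seed(1)
    conditional_effect(rnorm(2000, 0.10, 0.04), rnorm(2000, 0.11, 0.02))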

To summarize, the HMBB analysis of Alvarez et al. (1991)'s study shows that (1) a full interaction model with two breaks should be preferred over a constant-parameter model with one interaction term, and (2) the growth-promoting effect of the left-party cabinet size combined with centralized labor union organizations is strong during 1976-1979 and becomes weaker after 1980.

7 Computation Time

BridgeChange is written as a high-performance R package using the Rcpp package. Table 8 shows the running time of BridgeChange on a high-dimensional time series data set with n = 100 and p = 200. The testing machine is a MacBook Pro with a 2.2 GHz Intel Core i7.


[Figure: "Parameter Changes". Pooled and regime-specific coefficient estimates (horizontal axis from -0.50 to 0.50) for the 21 predictors: central, lagg1, leftc, openex, opengdp, openimp, and their 15 pairwise interactions (e.g., leftc_central, opengdp_openex). Legend: Pooled, Regime1, Regime2, Regime3.]

Figure 10: State-dependent Parameter Estimates. A fully interacted model of Alvarez et al. (1991)'s regression analysis is used. The estimation is done by the HMBB. The dependent variable is the annual growth rate. "Pooled" indicates time-averaged estimates from the one-way fixed-effects model, 1970-1984. "Regime 1" indicates estimates from state 1 (1970-1975), "Regime 2" indicates estimates from state 2 (1976-1979), and "Regime 3" indicates estimates from state 3 (1980-1984).


5,000 MCMC simulations of the HMBB with 200 predictors take about one minute. One full suite of testing and estimation from 0 breaks to 3 breaks takes about 12 minutes.

Table 8: Running Time of BridgeChange, Including Computation of the WAIC and the Log Marginal Likelihood. The MCMC simulation for parameter estimation uses 1,000 iterations after 1,000 burn-in iterations, and the log marginal likelihood computation requires an additional 3,000 MCMC runs. Thus, the total MCMC run is 5,000 iterations.

                      CPU time (in seconds)
  n     p    0 break   1 break   2 breaks   3 breaks
 100   200    63.92    160.65     229.21     292.71
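Below is a minimal sketch of how such a timing exercise can be set up. The BridgeChangeReg call and its arguments are assumptions based on the workflow sketched in Section 3.2; consult the BridgeChange documentation for the exact interface.

    ## Minimal sketch of the timing exercise (assumed interface).
    ## The function name and arguments below are assumptions; see the
    ## BridgeChange documentation for the actual call signature.
    library(BridgeChange)
    set.seed(1)
    n <- 100; p <- 200
    X <- matrix(rnorm(n * p), n, p)
    y <- c(X[, 1:5] %*% rep(1, 5)) + rnorm(n)   # 5 true signals, fake data
    for (m in 0:3) {
      elapsed <- system.time(
        BridgeChangeReg(y = y, X = X, n.break = m,   # hypothetical arguments
                        mcmc = 1000, burn = 1000)
      )["elapsed"]
      cat("breaks =", m, ":", round(elapsed, 2), "seconds\n")
    }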


References

Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679.

Alvarez, R. M., Garrett, G., and Lange, P. (1991). Government partisanship, labor organization, and macroeconomic performance. American Political Science Review, 85(2):539–556.

Beck, N. and Katz, J. (1995). What to do (and not to do) with time-series cross-sectional data. American Political Science Review, 89:634–647.

Beck, N., Katz, J. N., and Alvarez, R. M. (1993). Government partisanship, labor organization, and macroeconomic performance: A corrigendum. American Political Science Review, 87:945–948.

Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313–1321.

Chib, S. (1998). Estimation and comparison of multiple change-point models. Journal of Econometrics, 86(2):221–241.

Donoho, D. (2005). High-dimensional centrally symmetric polytopes with neighborliness proportional to dimension. Discrete and Computational Geometry, 35(4):617–652.

Donoho, D. and Stodden, V. (2006). Breakdown point of model selection when the number of variables exceeds the number of observations. In The 2006 IEEE International Joint Conference on Neural Network Proceedings, pages 1916–1921.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.

Frühwirth-Schnatter, S. and Wagner, H. (2006). Auxiliary mixture sampling for parameter-driven models of time series of small counts with applications to state space modelling. Biometrika, 93:827–841.

Gelman, A., Hwang, J., and Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6):997–1016.

Gramacy, R. B. (2018). monomvn: Estimation for Multivariate Normal and Student-t Data with Monotone Missingness. R package version 1.9-8.

Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10(6):971–988.

Hahn, P. R. and Carvalho, C. M. (2015). Decoupling shrinkage and selection in Bayesian linear models: A posterior summary perspective. Journal of the American Statistical Association, 110(509):435–448.

Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics, 38(4):963–974.

Martin, A. D., Quinn, K. M., and Park, J. H. (2011). MCMCpack: Markov chain Monte Carlo in R. Journal of Statistical Software, 42(9):22.

Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, MA.

Nunn, N. and Qian, N. (2014). U.S. food aid and civil conflict. American Economic Review, 104(6):1630–1666.

Park, J. H. (2011). Changepoint analysis of binary and ordinal probit models: An application to bank rate policy under the interwar gold standard. Political Analysis, 19(2):188–204.

Park, J. H. and Jensen, N. (2007). Electoral competition and agricultural support in OECD countries. American Journal of Political Science, 51(2):314–329.

Polson, N. G. and Scott, J. G. (2010). Shrink globally, act locally: Sparse Bayesian regularization and prediction. Bayesian Statistics, 9:501–538.

Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349.

Polson, N. G., Scott, J. G., and Windle, J. (2014). The Bayesian bridge. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):713–733.

Shor, B., Bafumi, J., Keele, L., and Park, D. K. (2007). A Bayesian multilevel modeling approach to time series cross-sectional data. Political Analysis, 15(2):165–181.

Steenbergen, M. R. and Jones, B. S. (2002). Modeling multilevel data structures. American Journal of Political Science, 46(1):218–237.

Tanner, M. A. (1996). Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. Springer.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1):267–288.

Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11:3571–3594.

Western, B. (1998). Causal heterogeneity in comparative research: A Bayesian hierarchical modelling approach. American Journal of Political Science, 42(4):1233–1259.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429.
