Structural break in linear regression models: Bayesian asymptotic analysis
Kenichi Shimizu^a
Department of Economics, Brown University, Robinson Hall, 64 Waterman Street, Providence, RI, United States
July 14, 2020
Abstract
In this paper, we revisit threshold regression models, an important class of models in economic analysis. For example, multiple equilibria can give rise to threshold effects. The issue of conducting valid inference on the regression parameters γ when the threshold parameter τ needs to be estimated remains an open question in the frequentist literature. The non-standard aspect of the estimation problem motivates the use of Bayesian methods, which can correctly reflect the finite-sample uncertainty of estimating τ upon inference of γ. Our theoretical contribution is to establish a Bernstein-von Mises type theorem (Bayesian asymptotic normality) for γ under a wide class of priors for the parameters, which essentially indicates an asymptotic equivalence between the conventional frequentist and the Bayesian inference. Our result is beneficial to both Bayesians and frequentists. A Bayesian user can invoke our theorem to convey his or her statistical result to frequentist researchers. For a frequentist researcher, looking at the credible interval can serve as a robustness check for the finite-sample uncertainty coming from the threshold estimation, and such sensitivity analysis is natural as our result guarantees that the credible interval converges to the frequentist confidence interval. The simulation studies show that the conventional confidence intervals tend to under-cover while credible intervals offer reasonable coverage in general. As the sample size increases, both methods coincide, as predicted by our theoretical conclusion. Using the data from Durlauf and Johnson (1995) on economic growth and Paye and Timmermann (2006) on stock return prediction, we illustrate that the traditional confidence intervals on γ might under-represent the true sampling uncertainty.
Keywords— Threshold regression, Structural break, Bernstein-von Mises theorem, Sensitivity check,Model selection
^a E-mail address: kenichi [email protected]
Contents

1 Introduction
2 The model
3 Data generating process
4 Bayesian approach
   4.1 Prior
   4.2 Posterior under the improper uninformative prior
   4.3 Posterior under the conjugate prior
   4.4 Posterior sampling
5 Asymptotic theory
   5.1 n-consistency of marginal posterior of τ
   5.2 Bernstein-von Mises Theorem for the regression coefficients
6 Simulation
   6.1 Comparison with the conventional frequentist method
   6.2 Comparison to the sample-splitting method [In progress]
7 Application
   7.1 Growth and multiple equilibria: Durlauf and Johnson (1995)
   7.2 Stock return prediction: Paye and Timmermann (2006) [In progress]
8 Conclusion and future direction
A Proof of Propositions
   A.1 Proof of Proposition 1
   A.2 Proof of Proposition 2
   A.3 Proof of Proposition 3
B Derivation of Posterior distributions
   B.1 Improper uninformative prior
   B.2 Conjugate prior
1 Introduction
In this paper, we consider the class of linear regression models with a structural break, following the
notations of Bai (1997):
yi = wi′α + zi′δ1 + εi,    for i = 1, ..., ⌊nτ⌋
yi = wi′α + zi′δ2 + εi,    for i = ⌊nτ⌋ + 1, ..., n
In such models, the threshold parameter τ divides the samples into two groups or regimes. The relation-
ship between the outcome yi and the covariate zi is determined by the regime to which the particular
observation i belongs. There could be some covariates whose relationship to yi, measured by α, might stay
constant between the regimes. The unknown parameters include the threshold τ as well as the regression
coefficients γ = (α, δ1, δ2).
Such models have been important in empirical economics. For instance, Durlauf and Johnson (1995)
suggested that wealthy countries and poor countries can have different growth paths. In macroeconomics
and empirical finance, it is common to observe events that induce a change in the underlying model, such as the decrease in output growth volatility in the 1980s known as "the Great Moderation," oil price shocks, and labor productivity changes (e.g., Paye and Timmermann (2006)). An example where threshold
regression is used in applied microeconomics is the tipping-point model; for example, see Card et al. (2008).
Refer to Hansen (2011) for an overview of the extensive uses of threshold regression models in economic
applications.
In the literature, the conventional least-squares estimators τ̂ and γ̂ are computed as follows: for each candidate τ, compute the sum of squared residuals of the regression, and denote the minimizing choice by τ̂. Plug the value τ = τ̂ into the model and let γ̂ = γ̂(τ̂) be the resulting OLS coefficient estimator, where γ̂(τ) is the usual OLS estimator of γ assuming the break location is τ.
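This grid-search procedure can be sketched in a few lines. The code below is an illustrative implementation only; the function name, the 15% trimming of candidate break points, and the default choices are ours, not taken from the paper.

```python
import numpy as np

def ls_break_estimator(y, X, Z, trim=0.15):
    """Grid-search least-squares estimation of the break location.

    For each candidate break point k = floor(n*tau), regress y on
    (X, Z * 1{i > k}) and record the sum of squared residuals S_n(tau).
    tau_hat minimizes S_n over the trimmed grid, and gamma_hat is the
    OLS coefficient vector at the minimizing break.
    """
    n = len(y)
    best_ssr, best_k, best_coef = np.inf, None, None
    for k in range(int(n * trim), int(n * (1 - trim))):
        Z2 = Z.copy()
        Z2[:k] = 0.0                      # regime-1 rows get no jump regressor
        W = np.hstack([X, Z2])            # design matrix chi_tau = (X, Z_2tau)
        coef, *_ = np.linalg.lstsq(W, y, rcond=None)
        ssr = float(((y - W @ coef) ** 2).sum())
        if ssr < best_ssr:
            best_ssr, best_k, best_coef = ssr, k, coef
    return best_k / n, best_coef, best_ssr
```

On data simulated with a break at τ0 = 0.5 and a large jump, the returned break fraction lands close to 0.5, with the jump coefficient recovered by the last entries of the coefficient vector.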
Roughly speaking, the frequentist literature on this model can be divided into two groups in terms
of the assumption made about the true jump size δ0 = δ2 − δ1. In the first group of the literature that
includes Chan (1993) and Liu et al. (1997), the jump size δ0 is assumed to be constant with respect to the sample size n. The authors show that the convergence rate of τ̂ is n^{-1} and that the asymptotic distribution of γ̂ is the same as that of γ̂(τ0). They derived the asymptotic distribution of τ̂, but it depends on nuisance parameters, and hence the construction of confidence intervals for τ is not feasible under this assumption. In order to derive confidence intervals for τ, the second group of authors, such as Bai (1997) and Hansen (2000), assume that δ0 shrinks to zero as n → ∞, but more slowly than n^{-1/2}. In other words, δ0 = d n^{α} where α ∈ (−1/2, 0) and d ∈ R^k. This reduces the convergence rate of τ̂ but enables them to find an asymptotic distribution that is free of nuisance parameters, and hence allows them to construct confidence intervals for τ. For the regression coefficients, Bai (1997) and Hansen (2000) obtain the same asymptotic result as the first group of the literature: the asymptotic distribution of γ̂ is the same as that of γ̂(τ0). Intuitively speaking, this is because δ0 shrinks to zero but is asymptotically still large enough for τ to be correctly estimated with high certainty. In the literature, this shrinking jump size framework is sometimes interpreted as representing a moderately small jump parameter.
For both groups mentioned above, √n(γ̂(τ̂) − γ0) →d N(0, Vγ), where Vγ is the standard asymptotic covariance matrix one obtains when τ = τ0. This means that the econometrician can approximate the distribution of γ̂ by the conventional normal approximation as if τ were known with certainty. Such an asymptotic result implies that P(γ ∈ Θ(τ̂)) → 1 − α as n → ∞, where Θ(τ) is the usual asymptotic (1 − α)-level confidence region for γ under the assumption that the break location τ is known.
As Hansen (2000) himself points out, with finite samples, this procedure is likely to under-represent
the true sampling uncertainty. See Section 6 of this paper for a simulation illustration of this point. The
classic approaches such as the delta method cannot be used because γ̂(τ) is not differentiable with respect to τ and τ̂ is not normally distributed. In order to account for such finite-sample uncertainty, Hansen (2000) proposes a Bonferroni-type correction. For any ρ ∈ (0, 1), let Γ(ρ) be the confidence interval for τ with asymptotic coverage ρ. The proposed confidence region is Θρ = ∪_{τ∈Γ(ρ)} Θ(τ). Because Θρ ⊃ Θ(τ̂), P(γ ∈ Θρ) ≥ P(γ ∈ Θ(τ̂)) → 1 − α as n → ∞. Hence Θρ is more conservative. The drawback of this procedure is that the researcher needs to select ρ, and the inference could be sensitive to the choice of ρ in a finite sample, as the author illustrates in his simulations.
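For a single coefficient, the union region Θρ can be sketched as follows. The inputs here are hypothetical: dictionaries of the conditional point estimates γ̂(τ) and their standard errors per candidate τ, together with a precomputed confidence set Γ(ρ) for τ. We return the hull of the union of the conditional Wald intervals, which is conservative whenever the union is not itself an interval.

```python
def bonferroni_union_interval(taus_in_Gamma, point_est, std_err):
    """Hull of the union of 95% conditional Wald intervals Theta(tau).

    taus_in_Gamma : candidate break fractions in the confidence set Gamma(rho)
    point_est     : dict tau -> conditional estimate gamma_hat(tau) (hypothetical)
    std_err       : dict tau -> conditional standard error (hypothetical)
    """
    z = 1.96  # standard-normal 97.5% quantile, rounded
    lowers = [point_est[t] - z * std_err[t] for t in taus_in_Gamma]
    uppers = [point_est[t] + z * std_err[t] for t in taus_in_Gamma]
    return min(lowers), max(uppers)
```

With a single candidate τ the function reduces to the usual Wald interval; with several candidates the interval widens, which is exactly the conservativeness of Θρ discussed above.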
Sample-splitting methods provide an alternative solution to the failures of conventional estimators and
confidence sets. In a sample-splitting approach, one selects the threshold parameter based on one subset
of the data, and then conducts inference using the remaining data. In Section XX, we compare...
It is worth mentioning that there is another type of assumption on δ0 that has recently attracted attention. The third group considers the small break framework: δ0 = d n^{-1/2}. In other words, the jump size shrinks at the same rate as the sampling uncertainty diminishes. The motivation in the frequentist literature for such a framework is to better reflect the finite-sample uncertainty of the estimation of τ in the asymptotic analysis. In this framework, τ̂ is not consistent. Elliott and Muller (2014) suggest a way to construct a confidence interval for the regression coefficients. Andrews et al. (2019) argue that ignoring the sampling uncertainty of estimating τ potentially leads to invalid inference on the parameter of interest γ and propose inference that reflects the data-dependent choice of the break location. Although the type of the
assumption on δ0 is different in our paper, conceptually, our paper contributes to this literature from a
Bayesian perspective. In addition, compared to the approaches in Elliott and Muller (2014) and Andrews
et al. (2019), our proposed Bayesian method is computationally easier to implement.
For a Bayesian, this non-standard aspect of the estimation problem can be dealt with quite naturally
by placing, for example, a uniform prior on a reasonable range of τ and an uninformative improper prior
(or a conjugate prior with a large prior variance) on γ and then by computing the marginal posterior
probabilities. Any finite sample uncertainty of estimating τ is automatically reflected in the marginal
posterior probability of γ. Note that, unlike the conventional frequentist methods, Bayesian inference has a valid interpretation even in finite samples, as it does not rely on asymptotics. Indeed, Bayesian estimation of change-point models has been very popular in the statistics literature (for example, refer to Khodadadi and Asgharian (2008)).
In this paper, we study asymptotic behavior of Bayesian estimation of the considered model under the
fixed jump size framework. Our theoretical contribution is to establish asymptotic equivalence between
the frequentist method and the Bayesian approach. Specifically, we prove a Bernstein-von Mises type
theorem for the slope parameters γ which validates the frequentist interpretation of the Bayesian credible
regions. Our result is beneficial to both Bayesian and frequentist researchers. A Bayesian user can
invoke our theorem to convey his or her statistical result to the frequentist researchers. For a frequentist
researcher, looking at the credible interval can serve as a robustness check for the finite sample uncertainty
coming from the threshold estimation, and such sensitivity analysis is natural as our result guarantees
the credible interval to converge to the frequentist confidence interval.
Our theoretical results hold under a wide range of prior specifications. First, the prior on the regression
coefficients can be either improper uninformative or conjugate informative. Second, our assumption on
the prior for the threshold location is very mild. For example, a uniform prior satisfies the requirement.
Recently, Baek (2019) investigated the same model (1). As the distribution of the least-squares estimator for τ might exhibit tri-modality for small jumps, she proposed a new point estimator of τ based on a modified objective function. Such a modification is equivalent to specifying a certain type of prior for τ, which indeed
satisfies the requirement for our asymptotic result.
In this paper, we do not consider the shrinking jump size frameworks which were described above
because it is likely that the Bayesian method would deliver different results compared to the frequentist
counterpart. Nevertheless, in contrast to Bai (1997) and Hansen (2000), Bayesian credible intervals
for τ are available without the assumption of the shrinking jump size. Furthermore, any finite sample
uncertainty when the true jump size is small will be automatically reflected in the posterior distribution
of γ.
Estimation of change-point models such as structural break models is considered non-standard in the sense that there is a non-regular parameter (e.g., the threshold parameter) whose point estimator converges faster than n^{-1/2}, the rate at which the regular parameters (e.g., the regression coefficients) converge. Despite its popularity in applications, the Bayesian theoretical literature on non-regular estimation is very scarce. To our knowledge, frequentist evaluation of the Bayesian approach for structural break models has not been studied in the literature. Ghosal and Samanta (1995) consider the general non-regular estimation problem from a Bayesian perspective and establish conditions under which a Bernstein-von Mises theorem holds for the regular part of the parameter. However, their assumptions are difficult to verify for the model we consider.
The paper is organized as follows. Section 2 introduces the model. Section 3 lists the assumptions made about the data-generating process. Section 4 outlines the proposed Bayesian estimation. Section 5 presents the asymptotic theory of our Bayesian method. Section 6 presents simulation evidence to assess the adequacy of the asymptotic theory. Section 7 reports empirical applications to the multiple equilibria growth model of Durlauf and Johnson (1995) and the stock return prediction model of Paye and Timmermann (2006) (in progress). Section 8 concludes. The mathematical proofs of the propositions are given in the Appendix. Proofs of some intermediate lemmas can be found in the Supplementary Material.
2 The model
Using notation similar to that in Bai (1997), the model we consider is
yi = wi′α + zi′δ1 + εi,    for i = 1, ..., ⌊nτ⌋
yi = wi′α + zi′δ2 + εi,    for i = ⌊nτ⌋ + 1, ..., n        (1)
where wi and zi are dw × 1 and dz × 1 vectors of covariates and the random variable εi is a regression error. ⌊a⌋ denotes the largest integer that is strictly smaller than a. Note that α stays unchanged across the regimes defined by the threshold parameter τ ∈ (0, 1). The parameters α, δ1, δ2, and τ are unknown. We assume δ1 ≠ δ2, so that a change has taken place. Using the reparametrization xi = (wi′, zi′)′, β = (α′, δ1′)′, and δ = δ2 − δ1, the equations (1) can be rewritten as
yi = xi′β + εi,            for i = 1, ..., ⌊nτ⌋
yi = xi′β + zi′δ + εi,     for i = ⌊nτ⌋ + 1, ..., n        (2)
Note that zi is a subvector of xi. More generally, let zi = R′xi, where R is a dx × dz known matrix with full column rank, so that zi is a linear transformation of xi. For R = (0_{dw×dz}, I_{dz×dz})′, we obtain the model (2). For R = I_{dx}, the pure change model is obtained.
To rewrite the model in matrix form, let us introduce further notation. Define Y = (y1, ..., yn)′, ε = (ε1, ..., εn)′, X = (x1, ..., xn)′, X1τ = (x1, ..., x_{⌊nτ⌋}, 0, ..., 0)′, and X2τ = (0, ..., 0, x_{⌊nτ⌋+1}, ..., xn)′. Define Z, Z1τ, and Z2τ similarly. Then Z = XR, Z1τ = X1τR, and Z2τ = X2τR, and the equations (2) can be written as

Y = Xβ + Z2τδ + ε        (3)
  = χτγ + ε        (4)

where χτ = (X, Z2τ) and γ = (β, δ)′. Denote by Sn(τ) the sum of squared residuals of the regression (3).
The least-squares (LS) estimator of the threshold τ, as in Bai (1997), is defined as

τ̂ = argmin_{τ ∈ (0,1)} Sn(τ)

and the LS estimator of the regression coefficients γ = (β, δ)′ is

γ̂ = γ̂(τ̂)

where γ̂(τ) denotes the usual OLS estimator assuming τ is known.
3 Data generating process
The data are assumed to include n observations on a response and a vector of covariates: Dn = (Y^n, X^n) = (y1, ..., yn, x1, ..., xn). Conditional on X^n, the response is generated according to the model (1) with the true parameters θ0 = (β0, δ0, σ0^2, τ0). We make the following assumptions about the true DGP:
A1. δ0 ≠ 0.

A2. εi is i.i.d. with E(εi | xi) = 0 and E(εi^2 | xi) = σ0^2.

A3. ΣX = E[xi xi′] = plim (1/n) ∑_{i=1}^n xi xi′ exists and is positive definite.

A4. For all τ1, τ2 ∈ (0, 1) with τ1 < τ2, (1/n) ∑_{i=⌊nτ1⌋+1}^{⌊nτ2⌋} xi εi = Op(n^{-1/2}) and (1/n) ∑_{i=⌊nτ1⌋+1}^{⌊nτ2⌋} xi xi′ = (τ2 − τ1) ΣX + Op(n^{-1/2}).
If the assumptions above hold, the theoretical results in Bai (1997) apply. The author showed that the convergence rate of τ̂ is n^{-1} if δ0 is fixed with respect to the sample size:

τ̂ = τ0 + Op(n^{-1})

and that the LS estimator of γ = (β, δ)′ is asymptotically normal with the asymptotic covariance matrix being the same as if τ0 were known:

√n( γ̂ − γ0 ) = ( √n(β̂ − β0)′, √n(δ̂ − δ0)′ )′ →d N_{dx+dz}(0, σ0^2 V^{-1})        (5)

where

V = plim (1/n) ( ∑_{i=1}^n xi xi′           ∑_{i=⌊nτ0⌋+1}^n xi zi′
                 ∑_{i=⌊nτ0⌋+1}^n zi xi′     ∑_{i=⌊nτ0⌋+1}^n zi zi′ ) = plim (1/n) χτ0′χτ0
4 Bayesian approach
4.1 Prior
To develop a Bayesian estimation framework, we model the distribution of the regression error term as normal: εi | σ^2 ~ N(0, σ^2). Note that the normality assumption is not imposed on the true DGP, so the model can be mis-specified. The prior distribution for τ admits a density π(τ) whose support is Θ, and the ratio π(τ)/π(τ′) is assumed to be bounded for any τ, τ′ ∈ Θ. For example, the uniform distribution on Θ satisfies the requirement. Other priors that reflect the researcher's prior knowledge about τ, or impose some penalty on values near the boundaries as in Baek (2019), can be incorporated as well.
For the regression coefficients (γ, σ^2)′ = (β, δ, σ^2)′, the prior can be either the improper uninformative prior or the Normal-Inverse-Gamma conjugate prior. That is, we specify either

π(γ, σ^2) ∝ 1/σ^2        (6)

or

σ^2 ~ InvGamma(a, b)
γ | σ^2 ~ N(µ, σ^2 H^{-1})        (7)

The uninformative prior implies that the reference point is the frequentist point estimator, which might be of interest to a frequentist researcher who wants to compute posterior intervals as a robustness check. On the other hand, the benefits of the conjugate prior include the facts that the researcher can incorporate prior beliefs on the regression coefficients or impose some regularization, and that the marginal likelihood is available conditional on τ. Intuitively, the two types of priors are equivalent when a = b = 0 and H = 0.
4.2 Posterior under the improper uninformative prior
With the improper uninformative prior (6), we can show that the marginal posterior of the jump location τ is

π(τ | Dn) ∝ [ det(χτ′χτ) ]^{-1/2} [ Sn(τ) ]^{-(n − (dx+dz))/2} × π(τ)        (8)

The marginal posterior of γ = (β, δ)′ is a mixture with the weights given by the marginal posterior density πn(τ) of τ:

p(γ | Dn) = ∫_0^1 p(γ | τ, Dn) πn(τ) dτ

Note that in the conventional frequentist approach, τ is fixed at the point estimate τ̂, and hence the information p(γ | τ, Dn) πn(τ) for τ ≠ τ̂ is essentially neglected. In contrast, such information or uncertainty is explicitly reflected in the Bayesian framework. Conditional on τ, the posterior distribution of γ = (β, δ)′ is a (dx + dz)-dimensional t-distribution centered at the OLS estimate given τ:

γ | τ, Dn ~ t_{dx+dz}( n − (dx+dz), γ̂(τ), [ Sn(τ)/(n − (dx+dz)) ] (χτ′χτ)^{-1} )        (9)

where t_k(v, µ, Σ) is the k-dimensional t-distribution with v degrees of freedom, location vector µ ∈ R^k, and k × k shape matrix Σ. It can be shown that the posterior for σ^2 conditional on τ is Inverse-Gamma with shape parameter (n − (dx+dz))/2 and scale parameter Sn(τ)/2:

σ^2 | τ, Dn ~ InvGamma( (n − (dx+dz))/2, Sn(τ)/2 )        (10)
4.3 Posterior under the conjugate prior
With the conjugate prior (7), we can derive the posterior of each parameter:

π(τ | Dn) ∝ [ det(H̄τ) ]^{-1/2} b̄τ^{-ā} × π(τ)        (11)

γ | τ, Dn ~ t_{dx+dz}( 2ā, µ̄τ, (b̄τ/ā) H̄τ^{-1} )        (12)

σ^2 | τ, Dn ~ InvGamma( ā, b̄τ )        (13)

where

H̄τ = H + χτ′χτ
µ̄τ = H̄τ^{-1} [ Hµ + χτ′Y ]
b̄τ = b + (1/2) [ µ′Hµ + Y′Y − µ̄τ′ H̄τ µ̄τ ]
ā = a + n/2        (14)
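The updates in (14) translate directly into code. The sketch below is our own illustration of the Normal-Inverse-Gamma update conditional on τ, with the design matrix χτ passed in; it is not taken from the paper's implementation.

```python
import numpy as np

def conjugate_updates(Y, chi_tau, mu, H, a, b):
    """Normal-Inverse-Gamma posterior updates of eq. (14), conditional on tau.

    chi_tau : n x (dx+dz) design matrix (X, Z_2tau) for a given tau
    mu, H   : prior mean and precision-scale matrix of gamma given sigma^2
    a, b    : Inverse-Gamma prior shape and scale for sigma^2
    Returns (H_bar, mu_bar, a_bar, b_bar).
    """
    H_bar = H + chi_tau.T @ chi_tau
    mu_bar = np.linalg.solve(H_bar, H @ mu + chi_tau.T @ Y)
    a_bar = a + len(Y) / 2.0
    b_bar = b + 0.5 * (mu @ H @ mu + Y @ Y - mu_bar @ H_bar @ mu_bar)
    return H_bar, mu_bar, a_bar, b_bar
```

As a check of the prior-equivalence remark above, sending H → 0 and a = b = 0 makes µ̄τ collapse to the OLS estimate and b̄τ to half the sum of squared residuals.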
4.4 Posterior sampling
Due to the availability of closed-form conditional posteriors for β, δ, and σ^2 given τ, posterior sampling is simple and fast. One can first draw τ(1), ..., τ(S) from the marginal posterior of τ as in (8) (or (11)) via, for example, a Metropolis-Hastings algorithm. For each τ(s), one can then sample posterior draws γ(s) and σ^2(s) from the posteriors conditional on τ = τ(s), namely (9) (or (12)) and (10) (or (13)). On a laptop with a 2.2GHz processor and 8GB RAM, it takes about 4.1 seconds to generate 10,000 posterior draws for the regression problem in Section 7, which has 10 slope coefficients in total.
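A minimal sketch of this sampling scheme under the improper prior: instead of a Metropolis-Hastings step, we evaluate (8) on the discrete grid of admissible break points and sample τ from the normalized weights; given τ, we draw σ^2 from (10) and then γ from the conditional normal given (τ, σ^2), which integrates to the t-distribution (9). The function name, the 15% trimming, and the simulated example are our own choices.

```python
import numpy as np

def posterior_draws(y, X, Z, n_draws=1000, trim=0.15, seed=0):
    """Draw (tau, sigma^2, gamma) from the posterior under the improper prior."""
    rng = np.random.default_rng(seed)
    n, k = len(y), X.shape[1] + Z.shape[1]
    grid = np.arange(int(n * trim), int(n * (1 - trim)))
    logp, cache = [], {}
    for m in grid:                           # m = floor(n*tau)
        Z2 = Z.copy()
        Z2[:m] = 0.0
        W = np.hstack([X, Z2])               # design chi_tau
        WtW = W.T @ W
        coef = np.linalg.solve(WtW, W.T @ y)
        ssr = float(((y - W @ coef) ** 2).sum())
        _, logdet = np.linalg.slogdet(WtW)
        # log of eq. (8) with a uniform prior on the grid
        logp.append(-0.5 * logdet - 0.5 * (n - k) * np.log(ssr))
        cache[int(m)] = (WtW, coef, ssr)
    logp = np.array(logp)
    w = np.exp(logp - logp.max())
    w /= w.sum()
    draws = []
    for m in rng.choice(grid, size=n_draws, p=w):
        WtW, coef, ssr = cache[int(m)]
        sigma2 = 1.0 / rng.gamma((n - k) / 2.0, 2.0 / ssr)   # InvGamma((n-k)/2, S_n/2)
        gamma = rng.multivariate_normal(coef, sigma2 * np.linalg.inv(WtW))
        draws.append((m / n, sigma2, gamma))
    return draws
```

Because the conditional steps are exact, the only approximation relative to the description above is the discretization of τ to the observed break points, which is also where the marginal posterior (8) actually lives.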
5 Asymptotic theory
In this section, we investigate the asymptotic behavior of the Bayesian method. Section 5.1 shows that the marginal posterior of the threshold parameter τ contracts to the true value τ0 at rate n^{-1}, the same rate at which the LS estimator τ̂ converges. The proof is based on studying the behavior of the log ratio of the marginal posterior densities of τ. Section 5.2 establishes a Bernstein-von Mises type theorem for the regression coefficients γ = (β, δ)′. The proof exploits the fact that the conditional posterior of √n(γ − γ0) given τ is asymptotically normal and, when τ is close to τ0, close to the asymptotic distribution of the OLS estimator γ̂(τ0). A bound on the KL divergence between two normal densities, together with the n-consistency, is used to make the argument precise.
5.1 n-consistency of marginal posterior of τ
An intermediate step for proving the Bernstein-von Mises theorem is marginal posterior consistency of τ at rate n^{-1}. Marginal posteriors have not been studied extensively or systematically in the literature. Here, we directly analyze the form of the marginal posterior of τ. The marginal posterior density of τ is defined as

πn(τ) = π(τ | Y^(n)) / ∫ π(τ | Y^(n)) dτ = Ln(τ) / ∫ Ln(τ) dτ

where Ln(τ) is the right-hand side of equation (8) or (11), depending on the choice of the prior. The following theorem states the main result of this subsection.
Theorem 1 (Marginal posterior consistency of τ at rate n^{-1}). Suppose Assumptions A1-A4 hold. Then, under both the uninformative improper prior and the conjugate prior, for all η > 0 and ε > 0 there exist M > 0 and N > 0 such that n ≥ N implies

Pθ0( ∫_{B^c_{M/n}(τ0)} πn(τ) dτ < η ) > 1 − ε

where Bδ(τ0) = (τ0 − δ, τ0 + δ) and B^c_δ(τ0) denotes its complement.
Note that

πn(τ) = Ln(τ) / ∫ Ln(τ′) dτ′ = [ Ln(τ0) / ∫ Ln(τ′) dτ′ ] [ Ln(τ)/Ln(τ0) ] = πn(τ0) Ln(τ)/Ln(τ0)

and that

πn(τ0) = Ln(τ0) / ∫ Ln(τ′) dτ′ ≤ Ln(τ0) / ∫_{B^c_{M0/n}(τ0)} Ln(τ′) dτ′ = [ ∫_{B^c_{M0/n}(τ0)} Ln(τ′)/Ln(τ0) dτ′ ]^{-1}

for any M0 > 0. Hence, for each n and for any M0 > 0,

∫_{B^c_{M/n}(τ0)} πn(τ) dτ = πn(τ0) ∫_{B^c_{M/n}(τ0)} Ln(τ)/Ln(τ0) dτ ≤ [ ∫_{B^c_{M0/n}(τ0)} Ln(τ′)/Ln(τ0) dτ′ ]^{-1} ∫_{B^c_{M/n}(τ0)} Ln(τ)/Ln(τ0) dτ        (15)

Therefore, we want to find

1. an upper bound for ∫_{B^c_{M/n}(τ0)} Ln(τ)/Ln(τ0) dτ, and

2. a lower bound for ∫_{B^c_{M0/n}(τ0)} Ln(τ′)/Ln(τ0) dτ′ for some M0 > 0.
Note that

Ln(τ)/Ln(τ0) = exp[ n { (1/n) log Ln(τ) − (1/n) log Ln(τ0) } ]

Case 1: Improper uninformative prior
For the case of the improper uninformative prior, from (8), we have

(1/n) log Ln(τ) − (1/n) log Ln(τ0)
  = (1/(2n)) log[ det(χτ0′χτ0) / det(χτ′χτ) ] + [(n − (dx+dz))/(2n)] [ log Sn(τ0) − log Sn(τ) ] + (1/n) log[ π(τ)/π(τ0) ]

Note that, for example, (1/n) ∑_{i=1}^n xi xi′, (1/n) ∑_{i=⌊nτ0⌋+1}^n xi zi′, and (1/n) ∑_{i=⌊nτ⌋+1}^n xi zi′ converge in probability, and by continuity of the determinant each determinant converges to the determinant of the corresponding probability limit. Hence, the quantity inside the log in the first term is Op(1), and therefore the first term is Op(n^{-1}). Since the ratio π(τ)/π(τ0) is bounded, the last term is O(n^{-1}). Hence, we have

(1/n) log Ln(τ) − (1/n) log Ln(τ0) = log Sn(τ0) − log Sn(τ) + Op(n^{-1}) = log[ (1/n) Sn(τ0) ] − log[ (1/n) Sn(τ) ] + Op(n^{-1})
Case 2: Conjugate prior

For the case of the conjugate prior, from (11), we have

(1/n) log Ln(τ) − (1/n) log Ln(τ0)
  = (1/(2n)) log[ det(H̄τ0) / det(H̄τ) ] + (ā/n) [ log b̄τ0 − log b̄τ ] + (1/n) log[ π(τ)/π(τ0) ]
  = log[ (1/n) b̄τ0 ] − log[ (1/n) b̄τ ] + Op(n^{-1})

We have

(1/n) b̄τ = b/n + (1/2) [ (1/n) µ′Hµ + (1/n) Y′Y − (1/n) µ̄τ′ H̄τ µ̄τ ]
  = (1/(2n)) [ Y′Y − (χτ′Y)′ (χτ′χτ)^{-1} (χτ′Y) ] + Op(n^{-1})
  = (1/(2n)) Sn(τ) + Op(n^{-1})        (16)

Hence, as in the case of the improper prior, even with the conjugate prior we have

(1/n) log Ln(τ) − (1/n) log Ln(τ0) = log[ (1/n) Sn(τ0) ] − log[ (1/n) Sn(τ) ] + Op(n^{-1})

Define Qn(τ) = (1/n) Sn(τ) and An(τ) = g(Qn(τ)), where g(x) = −log(x). Then we can write

(1/n) log Ln(τ) − (1/n) log Ln(τ0) = An(τ) − An(τ0) + Op(n^{-1})        (17)
Definition 1. For all τ ∈ Θ,

Q(τ) = σ0^2 + ∆(τ),   where   ∆(τ) = (τ0 − τ) [(1 − τ0)/(1 − τ)] δ0′R′ΣXRδ0   if τ ≤ τ0,
                              ∆(τ) = (τ − τ0) (τ0/τ) δ0′R′ΣXRδ0               if τ > τ0.

Figure 1 below shows an example of Qn(τ) and Q(τ).
Figure 1: Example of Qn(τ) for n = 100 (+), n = 1,000 (circle), and n = 10,000 (cross), and Q(τ) (dashed).
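The objects in Figure 1 are easy to reproduce numerically for the mean-shift special case xi = zi = 1 (so ΣX = 1 and δ0′R′ΣXRδ0 = δ0^2). The sketch below, with our own function names and defaults, computes Qn(τ) and the limit Q(τ) of Definition 1 so that Proposition 1 can be checked numerically.

```python
import numpy as np

def Q_limit(tau, tau0=0.5, delta0=1.0, sigma2=1.0):
    """Limit Q(tau) of Definition 1, specialized to the mean-shift model.

    With x_i = z_i = 1 we have R = 1 and Sigma_X = 1, so the jump term
    delta0' R' Sigma_X R delta0 reduces to delta0**2.
    """
    jump = delta0 ** 2
    if tau <= tau0:
        return sigma2 + (tau0 - tau) * (1.0 - tau0) / (1.0 - tau) * jump
    return sigma2 + (tau - tau0) * tau0 / tau * jump

def Q_n(y, tau):
    """Q_n(tau) = S_n(tau)/n: scaled SSR when a separate mean is fit per regime."""
    n = len(y)
    m = int(np.floor(n * tau))          # last observation of regime 1
    ssr = ((y[:m] - y[:m].mean()) ** 2).sum() + ((y[m:] - y[m:].mean()) ** 2).sum()
    return ssr / n
```

For draws from the mean-shift DGP with n = 10,000, Qn tracks Q over interior values of τ up to the Op(n^{-1/2}) error of Proposition 1, matching the pattern in Figure 1.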
The following propositions are used to prove Theorem 1.
Proposition 1. Under Pθ0, for all τ,

Qn(τ) = Q(τ) + Op(n^{-1/2})

Define A(τ) = g(Q(τ)).

Proposition 2. A(τ) attains its unique maximum at τ0.

Proposition 3. For all η > 0 and ε > 0, there exist M > 0 and N > 0 such that n ≥ N implies

Pθ0( inf_{τ ∈ B^c_{M/n}(τ0)} | { An(τ) − An(τ0) } − { A(τ) − A(τ0) } | / |τ − τ0| < η ) > 1 − ε
Proof of Theorem 1. By Proposition 2, A(·) attains its unique maximum at τ0, and A(τ) is not differentiable at τ0. Hence we have

A(τ) − A(τ0) < |τ − τ0| B1
A(τ) − A(τ0) > |τ − τ0| B2

for some B1, B2 < 0. By Proposition 3, given η1 > 0, there exists M > 0 such that, with Pθ0-probability tending to 1,

An(τ) − An(τ0) < η1 |τ − τ0| + A(τ) − A(τ0) < |τ − τ0| { η1 + B1 }        (18)

for all τ ∈ B^c_{M/n}(τ0). Similarly, given η2 > 0, there exists M0 > 0 such that, with Pθ0-probability tending to 1,

An(τ) − An(τ0) > −η2 |τ − τ0| + A(τ) − A(τ0) > |τ − τ0| { −η2 + B2 }        (19)

for all τ ∈ B^c_{M0/n}(τ0). Recall that, by Eq. (17), we have

Ln(τ)/Ln(τ0) = exp[ n ( An(τ) − An(τ0) ) + Op(1) ]

Hence, from Eq. (18), given η1 > 0 small compared to −B1, there is B1′ < 0, independent of M, such that, with Pθ0-probability tending to 1,

Ln(τ)/Ln(τ0) ≤ exp[ n|τ − τ0| B1′ + Op(1) ] = exp[ n|τ − τ0| B1′ ] Op(1)        (20)
for all τ ∈ B^c_{M/n}(τ0). Note that the statement above still holds with a larger value of M > 0, as the region outside the larger ball is contained in the region outside the original one. Similarly, from Eq. (19), there are B2′ < 0 and M0 > 0 such that, with Pθ0-probability tending to 1,

Ln(τ)/Ln(τ0) ≥ exp[ n|τ − τ0| B2′ + Op(1) ] = exp[ n|τ − τ0| B2′ ] Op(1)        (21)
for all τ ∈ B^c_{M0/n}(τ0). Now, by Inequality (20) and the fundamental theorem of calculus,

∫_{B^c_{M/n}(τ0)} Ln(τ)/Ln(τ0) dτ ≤ ∫_{B^c_{M/n}(τ0)} exp[ n|τ − τ0| B1′ ] dτ Op(1) = [1/(nB1′)] ( e^{nB1′} − e^{B1′M} ) Op(1)

Similarly, by Inequality (21),

∫_{B^c_{M0/n}(τ0)} Ln(τ)/Ln(τ0) dτ ≥ ∫_{B^c_{M0/n}(τ0)} exp[ n|τ − τ0| B2′ ] dτ Op(1) = [1/(nB2′)] ( e^{nB2′} − e^{B2′M0} ) Op(1)
This means, together with the bound (15),

∫_{B^c_{M/n}(τ0)} πn(τ) dτ ≤ [ ∫_{B^c_{M0/n}(τ0)} Ln(τ′)/Ln(τ0) dτ′ ]^{-1} ∫_{B^c_{M/n}(τ0)} Ln(τ)/Ln(τ0) dτ ≤ (B2′/B1′) [ ( e^{B1′n} − e^{B1′M} ) / ( e^{B2′n} − e^{B2′M0} ) ] Op(1)

which can be made arbitrarily small by choosing M > 0 and n sufficiently large.
5.2 Bernstein-von Mises Theorem for the regression coefficients
The marginal posterior of (β, δ)′ is a mixture weighted by πn(τ). Furthermore, due to Theorem 1, we can focus our attention on the values of τ in an n^{-1} neighborhood of τ0:

∫_0^1 p(β, δ | τ, Dn) πn(τ) dτ = ∫_{B_{M/n}(τ0)} p(β, δ | τ, Dn) πn(τ) dτ + op(1)

Conditional on τ = τ0, the standard result for OLS applies:

√n( γ̂(τ0) − γ0 ) = ( √n(β̂(τ0) − β0)′, √n(δ̂(τ0) − δ0)′ )′ →d N_{dx+dz}(0, σ0^2 V^{-1})        (22)

where V is defined below equation (5). We are now ready to establish the following Bernstein-von Mises type result.

Theorem 2 (Bernstein-von Mises theorem for the slope coefficients). Suppose that Assumptions A1-A4 hold. Then

dTV( π[ ( √n(β − β̂(τ0))′, √n(δ − δ̂(τ0))′ )′ | Dn ], N_{dx+dz}(0, σ0^2 V^{-1}) ) → 0

in Pθ0-probability.
Proof of Theorem 2. Define z = √n( γ − γ̂(τ0) ) = ( √n(β − β̂(τ0))′, √n(δ − δ̂(τ0))′ )′. Then

dTV( π[z | Dn], N_{dx+dz}(0, σ0^2 V^{-1}) ) = ∫ | π(z | Dn) − φ(z; 0, σ0^2 V^{-1}) | dz
  ≤ ∫∫ | π(z | τ, Dn) − φ(z; 0, σ0^2 V^{-1}) | dz dπ(τ | Dn) = ∫ dTV( π(z | τ, Dn), N_{dx+dz}(0, σ0^2 V^{-1}) ) dπ(τ | Dn)
  = ∫_{B_{M/n}(τ0)} dTV( π(z | τ, Dn), N_{dx+dz}(0, σ0^2 V^{-1}) ) dπ(τ | Dn) + op(1)

where the last equality is due to Theorem 1.
Case 1: Improper uninformative prior

Let us first consider the case with the improper uninformative prior (6). From (9), asymptotically, the posterior of γ = (β, δ)′ conditional on τ is normal:

γ | τ, Dn ≈ N_{dx+dz}( γ̂(τ), [ Sn(τ)/(n − (dx+dz)) ] (χτ′χτ)^{-1} )

which implies

z | τ, Dn ≈ N_{dx+dz}( √n( γ̂(τ) − γ̂(τ0) ), [ n Sn(τ)/(n − (dx+dz)) ] (χτ′χτ)^{-1} )

The total variation distance is bounded above by 2 times the square root of the KL divergence. Note that for
any p-dimensional normal distributions N(µ1, Σ1) and N(µ2, Σ2), we have

KL( N(µ1, Σ1), N(µ2, Σ2) ) ≤ | det(Σ2^{-1}) − det(Σ1^{-1}) | / min( det(Σ1^{-1}), det(Σ2^{-1}) )      [term I]
    + p ‖Σ2^{-1} − Σ1^{-1}‖_∞ ‖Σ1‖_∞      [term II]
    + ‖µ1 − µ2‖_2^2 ‖Σ2^{-1}‖_2      [term III]

where ‖Σ‖_∞ = max_{ij} |Σ_{ij}| is the largest element of Σ in absolute value and ‖Σ‖_2 = sup_µ ‖Σµ‖_2 / ‖µ‖_2 is the matrix norm induced by the standard norm ‖µ‖_2 = ( ∑_{i=1}^p µ_i^2 )^{1/2} on R^p.
In our case, µ1 = √n( γ̂(τ) − γ̂(τ0) ), µ2 = 0, Σ2 = σ0^2 V^{-1}, and

Σ1 = [ n Sn(τ)/(n − (dx+dz)) ] (χτ′χτ)^{-1} = [ n Qn(τ)/(n − (dx+dz)) ] Vn(τ)^{-1}

where we define Vn(τ) = (1/n) χτ′χτ.
Note that, since |τ − τ0| < M/n,

Σ1 − Σ2 = [ n/(n − (dx+dz)) ] ( Qn(τ) − Qn(τ0) ) Vn(τ)^{-1}                    [= op(1)]
    + [ n/(n − (dx+dz)) ] Qn(τ0) ( Vn(τ)^{-1} − Vn(τ0)^{-1} )                  [= op(1)]
    + [ n/(n − (dx+dz)) ] ( Qn(τ0) − σ0^2 ) Vn(τ0)^{-1}                        [= Op(n^{-1/2})]
    + [ n/(n − (dx+dz)) ] σ0^2 ( Vn(τ0)^{-1} − V^{-1} )                        [= op(1)]
    + ( n/(n − (dx+dz)) − 1 ) σ0^2 V^{-1}                                      [= o(1)]
  = op(1)

which implies Σ2^{-1} − Σ1^{-1} = op(1).
Hence term II is op(1). By continuity of the determinant and minimum functions, we also have term I = op(1) for τ ∈ B_{M/n}(τ0). Lastly, to show that term III is op(1), note that Y = Xβ0 + Z2τ0δ0 + ε = Xβ0 + Z2τδ0 + ε*, where ε* = (Z2τ0 − Z2τ)δ0 + ε. This implies

√n( γ̂(τ) − γ0 ) = ( √n(β̂(τ) − β0)′, √n(δ̂(τ) − δ0)′ )′
  = [ (1/n) ( X′X      X′Z2τ
              Z2τ′X    Z2τ′Z2τ ) ]^{-1} (1/√n) ( X′ε*
                                                 Z2τ′ε* )
  = [ (1/n) ( X′X      X′Z2τ
              Z2τ′X    Z2τ′Z2τ ) ]^{-1} (1/√n) ( X′ε + X′(Z2τ0 − Z2τ)δ0
                                                 Z2τ′ε + Z2τ0′(Z2τ0 − Z2τ)δ0 )
It can be shown that, for |τ − τ0| < M/n,

(1/√n) X′(Z2τ0 − Z2τ) = op(1),      (1/√n) Z2τ0′(Z2τ0 − Z2τ) = op(1),
(1/n) X′Z2τ − (1/n) X′Z2τ0 = op(1),      (1/n) Z2τ′Z2τ − (1/n) Z2τ0′Z2τ0 = op(1)
which implies

√n( γ̂(τ) − γ0 ) = [ (1/n) ( X′X      X′Z2τ0
                             Z2τ0′X   Z2τ0′Z2τ0 ) ]^{-1} (1/√n) ( X′ε
                                                                  Z2τ′ε ) + op(1)

By the central limit theorem, it can be shown that the left-hand side above has the same limiting distribution as √n( γ̂(τ0) − γ0 ) for |τ − τ0| < M/n. This means that term III is op(1). Finally,

dTV( π(z | τ, Y^(n)), N_{dx+dz}(0, σ0^2 V^{-1}) ) ≤ 2 ( op(1) )^{1/2} = op(1)

which implies

dTV( π[z | Y^(n)], N_{dx+dz}(0, σ0^2 V^{-1}) ) ≤ ∫_{B_{M/n}(τ0)} dTV( π(z | τ, Y^(n)), N_{dx+dz}(0, σ0^2 V^{-1}) ) dπ(τ | Y^(n)) + op(1) = op(1)
Case 2: Conjugate prior

Next, let us consider the case with the conjugate prior (7). From (12), asymptotically, the posterior of γ = (β, δ)′ conditional on τ is normal:

γ | τ, Dn ≈ N_{dx+dz}( µ̄τ, (b̄τ/ā) H̄τ^{-1} )

which implies

z | τ, Dn ≈ N_{dx+dz}( √n( µ̄τ − γ̂(τ0) ), (n b̄τ/ā) H̄τ^{-1} )

Note that for each τ ∈ B_{M/n}(τ0),

√n( µ̄τ − γ̂(τ0) ) = √n( µ̄τ − γ̂(τ) ) + √n( γ̂(τ) − γ̂(τ0) )        (23)
where we know that the last term is op(1) from the proof for the case of the improper prior. By definition,

µ̄τ = [ (1/n) H + (1/n) χτ′χτ ]^{-1} [ (1/n) Hµ + (1/n) χτ′Y ] = γ̂(τ) + Op(n^{-1})

so the first term in (23) is also op(1). Also note that for each τ ∈ B_{M/n}(τ0),

(n b̄τ/ā) H̄τ^{-1} − σ0^2 V^{-1} = [ (n b̄τ/ā) H̄τ^{-1} − [ n Sn(τ)/(n − (dx+dz)) ] (χτ′χτ)^{-1} ] + [ [ n Sn(τ)/(n − (dx+dz)) ] (χτ′χτ)^{-1} − σ0^2 V^{-1} ]        (24)
where we know that the second term is op(1) from the proof for the case of the improper prior. For the first term, we have

(n b̄τ/ā) H̄τ^{-1} = [ b̄τ/(a + n/2) ] [ (1/n) H + (1/n) χτ′χτ ]^{-1} = [ (1/n) b̄τ / (a/n + 1/2) ] [ (1/n) H + (1/n) χτ′χτ ]^{-1}

From (16), we know that (1/n) b̄τ = (1/(2n)) Sn(τ) + Op(n^{-1}), so we have

(n b̄τ/ā) H̄τ^{-1} = Sn(τ) [ χτ′χτ ]^{-1} + Op(n^{-1})

Therefore, the first term in (24) is also op(1). The same argument as in the proof for the improper prior implies the desired result.
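The KL-divergence bound between multivariate normals used in the proofs above can be sanity-checked numerically against the exact normal-normal KL divergence. The sketch below is only a spot check on particular inputs, not a proof of the bound; the function names and test values are ours.

```python
import numpy as np

def kl_normals(mu1, S1, mu2, S2):
    """Exact KL divergence KL(N(mu1, S1) || N(mu2, S2)) for p-dimensional normals."""
    p = len(mu1)
    S2inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.trace(S2inv @ S1) + d @ S2inv @ d - p
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

def kl_bound(mu1, S1, mu2, S2):
    """Terms I + II + III of the bound stated in the text."""
    p = len(mu1)
    S1inv, S2inv = np.linalg.inv(S1), np.linalg.inv(S2)
    term_I = abs(np.linalg.det(S2inv) - np.linalg.det(S1inv)) / min(
        np.linalg.det(S1inv), np.linalg.det(S2inv))
    term_II = p * np.abs(S2inv - S1inv).max() * np.abs(S1).max()
    term_III = ((mu1 - mu2) ** 2).sum() * np.linalg.norm(S2inv, 2)
    return term_I + term_II + term_III
```

For example, with µ1 = (0.1, 0), µ2 = 0, Σ1 = I, and Σ2 = diag(1.2, 0.8), the exact KL divergence is about 0.025 while the right-hand side evaluates to about 0.55, consistent with the inequality.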
6 Simulation
6.1 Comparison with the conventional frequentist method
Consider the following simple model: yi = δ0 1(i > ⌊nτ0⌋) + εi with εi ~ i.i.d. N(0, 1) and τ0 = 0.5. We consider sample sizes n = 50, 100, 250, and 500 and true jump sizes δ0 = 0.25, 0.5, and 1.0.
We study the behavior of the Bayesian approach in repeated experiments. For each pair of \(n\) and \(\delta_0\), we generate 1,000 data sets. We report the coverage of the 95% credible interval for the jump size \(\delta\) (Table 1), the mean interval length (Table 3), and the mean absolute difference between the posterior mode of \(\delta\) and \(\delta_0\) (Table 2).
We also present the performance of the conventional frequentist estimator. Specifically, we computed the least-squares estimator \(\hat\tau\), the jump-size estimator \(\hat\delta(\hat\tau)\) conditional on \(\tau = \hat\tau\), and its 95% confidence interval based on the conventional asymptotic theory. This is what researchers would do based on the method in Hansen (2000). As mentioned in the introduction, Hansen proposes a Bonferroni-type correction of the confidence intervals, but its tuning parameter has to be selected, so we do not present Bonferroni-corrected CIs.
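As an illustration of this comparison (a minimal sketch, not the authors' code; the helper names are hypothetical), the following simulates the DGP above, profiles the sum of squared residuals over candidate breaks to obtain the least-squares \(\hat\tau\), and forms the conventional plug-in confidence interval for \(\delta\) that conditions on \(\hat\tau\):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, delta0, tau0=0.5):
    # DGP of Section 6.1: y_i = delta0 * 1(i > floor(n*tau0)) + eps_i, eps_i ~ N(0,1)
    y = rng.standard_normal(n)
    y[int(np.floor(n * tau0)):] += delta0
    return y

def ls_break(y, trim=0.1):
    # Profile least squares: for each trimmed candidate break k, fit the
    # pre/post-break means and keep the SSR-minimizing k.
    n = len(y)
    candidates = range(int(n * trim), int(n * (1 - trim)))
    ssr = {k: ((y[:k] - y[:k].mean()) ** 2).sum()
              + ((y[k:] - y[k:].mean()) ** 2).sum() for k in candidates}
    k_hat = min(ssr, key=ssr.get)
    delta_hat = y[k_hat:].mean() - y[:k_hat].mean()
    # Conventional plug-in CI: asymptotic normality of delta_hat with tau
    # fixed at tau_hat, ignoring the uncertainty from estimating the break
    s2 = ssr[k_hat] / (n - 2)
    se = np.sqrt(s2 * (1.0 / k_hat + 1.0 / (n - k_hat)))
    return k_hat / n, delta_hat, (delta_hat - 1.96 * se, delta_hat + 1.96 * se)

tau_hat, delta_hat, ci = ls_break(simulate(500, 1.0))
```

The plug-in interval is exactly the object whose under-coverage Table 1 documents: it conditions on \(\hat\tau\) and ignores the sampling variability of the break location.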
1. Table 1 shows that for small \(n\) and/or small \(\delta_0\), the conventional CI significantly under-covers. The Bayesian credible interval, on the other hand, has relatively reasonable coverage.
2. From Table 2, we see that, except in a few cases, the conventional and Bayesian methods have similar bias.
            Least-squares            Bayesian
 δ0 =      0.25   0.50   1.00      0.25   0.50   1.00
 n = 50    0.63   0.83   0.95      0.97   0.97   0.97
 n = 100   0.71   0.91   0.95      0.96   0.96   0.96
 n = 250   0.82   0.94   0.95      0.96   0.96   0.95
 n = 500   0.91   0.95   0.95      0.95   0.96   0.95

Table 1: Coverage of the conventional 95% confidence interval (left) and the 95% credible interval (right).
            Least-squares            Bayesian
 δ0 =      0.25   0.50   1.00      0.25   0.50   1.00
 n = 50    0.56   0.37   0.23      0.30   0.28   0.24
 n = 100   0.35   0.20   0.16      0.21   0.19   0.16
 n = 250   0.18   0.10   0.10      0.13   0.11   0.10
 n = 500   0.09   0.07   0.07      0.08   0.07   0.07

Table 2: |Bias| of \(\hat\delta(\hat\tau)\) and of the posterior mode of \(\delta\).
3. From Table 3, the primary reason for the under-coverage of the confidence intervals appears to be that they tend to be too short. The credible intervals, on the other hand, provide better coverage because they are wider, which would be the result of correctly reflecting the sampling uncertainty of the threshold estimation.
4. Finally, Tables 1, 2, and 3 show that as \(n\) increases, the discrepancy between the Bayesian and frequentist results shrinks, as expected from the Bernstein-von Mises theorem proven in the previous section.
            Least-squares            Bayesian
 δ0 =      0.25   0.50   1.00      0.25   0.50   1.00
 n = 50    1.31   1.25   1.12      1.64   1.57   1.35
 n = 100   0.92   0.84   0.79      1.14   1.02   0.85
 n = 250   0.56   0.51   0.49      0.68   0.56   0.50
 n = 500   0.38   0.35   0.35      0.44   0.36   0.35

Table 3: Length of the conventional 95% confidence interval (left) and the 95% credible interval (right).
6.2 Comparison to the sample-splitting method [In progress]
7 Application
In this section, we consider two empirical applications: the economic growth model in Durlauf and Johnson (1995) and the stock return prediction in Paye and Timmermann (2006). In both papers, the estimation is done using the conventional frequentist approach: the authors first computed \(\hat\tau\) and then estimated the regression coefficients with \(\tau\) fixed at \(\hat\tau\).
7.1 Growth and multiple equilibria: Durlauf and Johnson (1995)
In this section, we apply our method to the data from Durlauf and Johnson (1995), which is also considered in Hansen (2000). The authors suggest that cross-section growth behavior may be determined by initial conditions. They explore this hypothesis using the Summers-Heston data set, reporting results obtained from a regression-tree approach, which is an extension of threshold regression.
In one of the specifications in their paper, the authors divide the countries into two groups by the literacy rate in the base year:
\[
\ln(Y/L)_{i,1985} - \ln(Y/L)_{i,1960} =
\begin{cases}
\delta_1^{(1)} + \delta_1^{(2)}\ln(Y/L)_{i,1960} + \delta_1^{(3)}\ln(I/Y)_i + \delta_1^{(4)}\ln(n_i + g + \delta) + \delta_1^{(5)}\ln(school)_i + \varepsilon_i, & \text{if } Lit_{i,1960} \le \tau \\
\delta_2^{(1)} + \delta_2^{(2)}\ln(Y/L)_{i,1960} + \delta_2^{(3)}\ln(I/Y)_i + \delta_2^{(4)}\ln(n_i + g + \delta) + \delta_2^{(5)}\ln(school)_i + \varepsilon_i, & \text{if } Lit_{i,1960} > \tau
\end{cases}
\]
where for each country \(i\), \((Y/L)_{i,t}\) is real GDP per member of the population aged 15-64 in year \(t\), \((I/Y)_i\) is the investment-to-GDP ratio, \(n_i\) is the growth rate of the working-age population, \(school_i\) is the fraction of the working-age population enrolled in secondary school, and \(Lit_{i,1960}\) is the literacy rate in the base year 1960. The variables not indexed by \(t\) are annual averages over the period 1960-1985. Following the authors, we set \(g + \delta = 0.05\), where \(g\) is the growth rate of technology and \(\delta\) is the depreciation rate of both human and physical capital. We standardize the covariates to make the comparison of the estimated coefficients easier.
The prior on \(\tau\) that we use is Uniform(0.05, 0.95). The posterior mean of \(\tau\) is 53.7%, with the 95% credible interval being [31%, 68%]. See Figure 2 for the trace plot and the posterior density. The least-squares method produces a similar point estimate, \(\hat\tau = 53\%\).
In Figure 3, we display the posterior densities of the slope coefficients as well as 95% credible intervals. The intervals in red are the confidence intervals computed with the point estimate \(\hat\tau\) held fixed, which is what the conventional frequentist method would produce. We see that, as expected, the Bayesian credible intervals are generally wider than their frequentist counterparts, which would be a result of correctly reflecting the finite-sample uncertainty about the unknown \(\tau\). Importantly, this can have a qualitative consequence for the statistical significance of some parameters. For example, for the slope coefficients \(\delta_1^{(2)}\) on \(\ln(Y/L)_{i,1960}\) and \(\delta_1^{(3)}\) on \(\ln(I/Y)_i\), the confidence intervals do not include 0 while the Bayesian credible intervals do. Hence, ignoring the finite-sample uncertainty of the threshold-estimation stage could lead to an invalid inference.
Figure 2: Trace plots (left) and posterior density (right) for τ
Figure 3: Posterior density of β’s. 95% credible interval (blue) and 95% confidence interval assuming τ(red).
Figure 4: Posterior density of σ
7.2 Stock return prediction: Paye and Timmermann (2006) [In progress]
Paye and Timmermann (2006) investigate the instability of models of ex-post predictable components in stock returns. The model that they consider is:
\[
Ret_t =
\begin{cases}
\delta_1^{(1)} + \delta_1^{(2)}Div_{t-1} + \delta_1^{(3)}Tbill_{t-1} + \delta_1^{(4)}Spread_{t-1} + \delta_1^{(5)}Def_{t-1} + \varepsilon_t, & \text{if } t \le \lfloor T\tau \rfloor \\
\delta_2^{(1)} + \delta_2^{(2)}Div_{t-1} + \delta_2^{(3)}Tbill_{t-1} + \delta_2^{(4)}Spread_{t-1} + \delta_2^{(5)}Def_{t-1} + \varepsilon_t, & \text{if } t > \lfloor T\tau \rfloor
\end{cases}
\]
where \(Ret_t\) denotes the excess stock return during month \(t\), \(Div_{t-1}\) is the lagged dividend yield, \(Tbill_{t-1}\) is the lagged short-term interest rate, \(Spread_{t-1}\) is the lagged spread between the short-term and long-term interest rates, and \(Def_{t-1}\) is the lagged U.S. default premium.
8 Conclusion and future direction
In this paper, we established a Bernstein-von Mises type theorem for the regression coefficients in linear regression models with a structural break. A frequentist researcher can look at Bayesian credible intervals for the regression coefficients as a robustness check on whether the finite-sample uncertainty coming from the break-location estimation affects the inference on the regression coefficients. Such a sensitivity analysis is natural, as our theoretical result guarantees that the credible intervals converge to the conventional confidence intervals the frequentist researcher would use otherwise.
Extending the result to multiple breaks should be straightforward. When the number of breaks is unknown, the researcher can either conduct model comparisons based on marginal likelihoods, which are available under the conjugate prior, or compute the posterior probabilities of the number of breaks using methods such as reversible jump MCMC or dynamic programming techniques.
I am currently working on several extensions of this paper. First, the homoskedasticity assumption could be too strong in some applications, so extending the results to the heteroskedastic case would be of interest. Second, researchers might want to relax the linearity assumption for the regression function. I am currently working on Bayesian nonparametric regressions with structural breaks.
A Proof of Propositions
A.1 Proof of Proposition 1
Proposition 1. Under \(P_{\theta_0}\), for all \(\tau\),
\[
Q_n(\tau) = Q(\tau) + O_p(n^{-1/2})
\]
Proof of Proposition 1. Let \(\tau \in (0, 1)\) be given. Let \(M = I - X(X'X)^{-1}X'\). We have the following identity: \(S_n(\tau) = S_n - V_n(\tau)\) (Amemiya (1985), Bai (1997)), where \(S_n\) is the sum of squared residuals from regressing \(Y\) on \(X\) alone and \(V_n(\tau) = \hat\delta(\tau)'(Z_{2\tau}'MZ_{2\tau})\hat\delta(\tau)\). By the Frisch-Waugh theorem, the OLS estimate of \(\delta\) in Eq. (3) is equivalent to that in the model \(MY = MZ_{2\tau}\delta + M\varepsilon\). Note the true model is \(MY = MZ_{2\tau_0}\delta_0 + M\varepsilon\). Hence,
\begin{align*}
\hat\delta(\tau) &= (Z_{2\tau}'MZ_{2\tau})^{-1}Z_{2\tau}'MY \\
&= (Z_{2\tau}'MZ_{2\tau})^{-1}Z_{2\tau}'\{MZ_{2\tau_0}\delta_0 + M\varepsilon\} \\
&= (Z_{2\tau}'MZ_{2\tau})^{-1}Z_{2\tau}'MZ_{2\tau_0}\delta_0 + (Z_{2\tau}'MZ_{2\tau})^{-1}Z_{2\tau}'M\varepsilon
\end{align*}
Hence,
\begin{align*}
V_n(\tau) &= \delta_0'(Z_{2\tau}'MZ_{2\tau_0})'(Z_{2\tau}'MZ_{2\tau})^{-1}(Z_{2\tau}'MZ_{2\tau_0})\delta_0 \\
&\quad + 2\delta_0'(Z_{2\tau}'MZ_{2\tau_0})'(Z_{2\tau}'MZ_{2\tau})^{-1}(Z_{2\tau}'M\varepsilon) \\
&\quad + (Z_{2\tau}'M\varepsilon)'(Z_{2\tau}'MZ_{2\tau})^{-1}(Z_{2\tau}'M\varepsilon)
\end{align*}
This means that
\[
\frac{1}{n}V_n(\tau) = \frac{1}{n}\delta_0'(Z_{2\tau}'MZ_{2\tau_0})'(Z_{2\tau}'MZ_{2\tau})^{-1}(Z_{2\tau}'MZ_{2\tau_0})\delta_0 + O_p(n^{-1/2})
\]
Also, we have
\[
S_n = Y'MY = \delta_0'Z_{2\tau_0}'MZ_{2\tau_0}\delta_0 + 2\delta_0'Z_{2\tau_0}'M\varepsilon + \varepsilon'M\varepsilon
\]
which implies
\[
\frac{1}{n}S_n = \frac{1}{n}\delta_0'Z_{2\tau_0}'MZ_{2\tau_0}\delta_0 + \sigma_0^2 + O_p(n^{-1/2})
\]
By the above identity,
\begin{align*}
Q_n(\tau) &= \frac{1}{n}S_n - \frac{1}{n}V_n(\tau) \\
&= \sigma_0^2 + \frac{1}{n}\delta_0'\left\{(Z_{2\tau_0}'MZ_{2\tau_0}) - (Z_{2\tau}'MZ_{2\tau_0})'(Z_{2\tau}'MZ_{2\tau})^{-1}(Z_{2\tau}'MZ_{2\tau_0})\right\}\delta_0 + O_p(n^{-1/2})
\end{align*}
Note that
\begin{align*}
(Z_{2\tau_0}'MZ_{2\tau_0}) &= Z_{2\tau_0}'Z_{2\tau_0} - Z_{2\tau_0}'X(X'X)^{-1}X'Z_{2\tau_0} \\
&= R'X_{2\tau_0}'X_{2\tau_0}R - R'(X_{2\tau_0}'X)(X'X)^{-1}(X'X_{2\tau_0})R \\
&= R'X_{2\tau_0}'X_{2\tau_0}R - R'(X_{2\tau_0}'X_{2\tau_0})(X'X)^{-1}(X_{2\tau_0}'X_{2\tau_0})R
\end{align*}
Hence,
\[
\frac{1}{n}(Z_{2\tau_0}'MZ_{2\tau_0}) = (1-\tau_0)R'\Sigma_X R - (1-\tau_0)^2 R'\Sigma_X R + O_p(n^{-1/2}) = \tau_0(1-\tau_0)R'\Sigma_X R + O_p(n^{-1/2})
\]
Similarly,
\[
\frac{1}{n}(Z_{2\tau}'MZ_{2\tau}) = \tau(1-\tau)R'\Sigma_X R + O_p(n^{-1/2})
\]
Without loss of generality, suppose \(\tau < \tau_0\). Then
\begin{align*}
(Z_{2\tau}'MZ_{2\tau_0}) &= Z_{2\tau}'Z_{2\tau_0} - Z_{2\tau}'X(X'X)^{-1}X'Z_{2\tau_0} \\
&= R'X_{2\tau}'X_{2\tau_0}R - R'(X_{2\tau}'X)(X'X)^{-1}(X'X_{2\tau_0})R \\
&= R'X_{2\tau_0}'X_{2\tau_0}R - R'(X_{2\tau}'X_{2\tau})(X'X)^{-1}(X_{2\tau_0}'X_{2\tau_0})R
\end{align*}
which implies that
\[
\frac{1}{n}(Z_{2\tau}'MZ_{2\tau_0}) = (1-\tau_0)R'\Sigma_X R - (1-\tau)(1-\tau_0)R'\Sigma_X R + O_p(n^{-1/2}) = \tau(1-\tau_0)R'\Sigma_X R + O_p(n^{-1/2})
\]
Therefore,
\[
\frac{1}{n}(Z_{2\tau}'MZ_{2\tau_0})'(Z_{2\tau}'MZ_{2\tau})^{-1}(Z_{2\tau}'MZ_{2\tau_0}) = \underbrace{\left[\tau(1-\tau_0)\,\frac{1}{\tau(1-\tau)}\,\tau(1-\tau_0)\right]}_{=\frac{\tau(1-\tau_0)^2}{1-\tau}} R'\Sigma_X R + O_p(n^{-1/2})
\]
Finally,
\begin{align*}
\frac{1}{n}\left\{(Z_{2\tau_0}'MZ_{2\tau_0}) - (Z_{2\tau}'MZ_{2\tau_0})'(Z_{2\tau}'MZ_{2\tau})^{-1}(Z_{2\tau}'MZ_{2\tau_0})\right\} &= \left[\tau_0(1-\tau_0) - \frac{\tau(1-\tau_0)^2}{1-\tau}\right]R'\Sigma_X R + O_p(n^{-1/2}) \\
&= (\tau_0-\tau)\frac{1-\tau_0}{1-\tau}\,R'\Sigma_X R + O_p(n^{-1/2})
\end{align*}
A.2 Proof of Proposition 2
Proposition 2. \(A(\tau)\) attains its unique maximum at \(\tau_0\).
Proof of Proposition 2. By definition, \(Q(\tau_0) = \sigma_0^2\). Note that \(\delta_0'R'\Sigma_X R\delta_0 > 0\). This is because (1) \(R\) has full column rank, (2) \(\delta_0 \neq 0\), and (3) \(\Sigma_X\) is assumed to be positive definite. Hence, \(Q(\tau) > \sigma_0^2\) for all \(\tau \neq \tau_0\). Recall that \(A(\tau) = g(Q(\tau))\) where \(g(x) = -\log(x)\). Hence \(A(\tau) = -\log(\sigma_0^2)\) if \(\tau = \tau_0\) and \(A(\tau) < -\log(\sigma_0^2)\) otherwise.
A.3 Proof of Proposition 3
Proposition 3. For all \(\eta > 0\) and \(\varepsilon > 0\), there exist \(M > 0\) and \(N > 0\) such that \(n \ge N\) implies
\[
P_{\theta_0}\left(\inf_{\tau \in B_{M/n}^c(\tau_0)} \frac{\left|\{A_n(\tau)-A_n(\tau_0)\} - \{A(\tau)-A(\tau_0)\}\right|}{|\tau-\tau_0|} < \eta\right) > 1 - \varepsilon
\]
Proof of Proposition 3. Recall that \(A_n(\tau) = g(Q_n(\tau))\) and \(A(\tau) = g(Q(\tau))\) where \(g(x) = -\log(x)\). By a Taylor expansion, there is \(c\) between \(x\) and \(a\) such that
\[
g(x) - g(a) = g'(a)(x-a) + \frac{1}{2}g''(c)(x-a)^2
\]
Hence, for each \(\tau \in B_{M/n}^c(\tau_0)\), there is \(c_n\) between \(Q_n(\tau)\) and \(Q(\tau)\) such that
\begin{align*}
g(Q_n(\tau)) - g(Q(\tau)) &= g'(Q(\tau))\left(Q_n(\tau)-Q(\tau)\right) + \frac{1}{2}g''(c_n)\left(Q_n(\tau)-Q(\tau)\right)^2 \\
&= g'(Q(\tau))\,O_p(n^{-1/2}) + O_p(n^{-1})
\end{align*}
where we used Proposition 1. Similarly, there is \(c_{0n}\) between \(Q_n(\tau_0)\) and \(Q(\tau_0)\) such that
\[
g(Q_n(\tau_0)) - g(Q(\tau_0)) = g'(Q(\tau_0))\,O_p(n^{-1/2}) + O_p(n^{-1})
\]
Note that \(g''(c_n) = 1/c_n^2\) and \(g''(c_{0n}) = 1/c_{0n}^2\) are bounded with probability tending to one because for each \(\tau\), \(Q_n(\tau) \xrightarrow{p} Q(\tau)\), and \(Q(\tau)\) is bounded away from zero.
Now,
\begin{align*}
\{A_n(\tau)-A_n(\tau_0)\} - \{A(\tau)-A(\tau_0)\} &= \{A_n(\tau)-A(\tau)\} - \{A_n(\tau_0)-A(\tau_0)\} \\
&= \left\{g(Q_n(\tau))-g(Q(\tau))\right\} - \left\{g(Q_n(\tau_0))-g(Q(\tau_0))\right\} \\
&= \left(g'(Q(\tau)) - g'(Q(\tau_0))\right)O_p(n^{-1/2}) + O_p(n^{-1}) \\
&= \left[-\frac{1}{\sigma_0^2+\Delta(\tau)} - \left(-\frac{1}{\sigma_0^2+\Delta(\tau_0)}\right)\right]O_p(n^{-1/2}) + O_p(n^{-1}) \\
&= \left[\frac{1}{\sigma_0^2} - \frac{1}{\sigma_0^2+\Delta(\tau)}\right]O_p(n^{-1/2}) + O_p(n^{-1})
\end{align*}
In general, there is \(B > 0\) such that \(\frac{1}{b} - \frac{1}{b+x} \le Bx\) for \(b, x > 0\). Hence, \(\frac{1}{\sigma_0^2} - \frac{1}{\sigma_0^2+\Delta(\tau)} \le B\Delta(\tau) \le B'|\tau - \tau_0|\), where the last inequality holds for some \(B' > 0\) due to the shape of \(Q(\tau)\).
Finally,
\[
\frac{\left|\{A_n(\tau)-A_n(\tau_0)\} - \{A(\tau)-A(\tau_0)\}\right|}{|\tau-\tau_0|} \le B'O_p(n^{-1/2}) + \frac{1}{|\tau-\tau_0|}O_p(n^{-1}) \underbrace{\le}_{|\tau-\tau_0| > M/n} O_p(n^{-1/2}) + \frac{1}{M}O_p(1)
\]
The desired result is established by taking M large enough.
B Derivation of Posterior distributions
B.1 Improper uninformative prior
In this section, we derive the posterior in Section 4.2 under the improper uninformative prior (6). The
posterior is
\begin{align*}
p(\theta\mid D_n) &\propto \left(\frac{1}{\sigma}\right)^n \exp\left[-\frac{1}{2\sigma^2}\left\{\sum_{i=1}^n \left(y_i - \chi_{\tau,i}'\gamma\right)^2\right\}\right]\frac{1}{\sigma^2}\,\pi(\tau) \\
&\propto \left(\frac{1}{\sigma}\right)^n \exp\left[-\frac{1}{2\sigma^2}\left\{S_n(\tau) + (\gamma - \hat\gamma(\tau))'\chi_\tau\chi_\tau'(\gamma - \hat\gamma(\tau))\right\}\right]\frac{1}{\sigma^2}\,\pi(\tau) \tag{25}
\end{align*}
where Sn(τ) is the sum of squared residuals given τ . Integrating the right hand side with respect to γ,
we obtain
\[
\left(\frac{1}{\sigma^2}\right)^{\frac{n-(d_x+d_z)}{2}+1} \exp\left[-\frac{1}{2\sigma^2}S_n(\tau)\right]\left[\det\left(\chi_\tau\chi_\tau'\right)\right]^{-0.5}\pi(\tau)
\]
by the property of the multivariate normal density. Integrating the above with respect to \(\sigma^2\) over the positive part of the real line and using the change of variable \(\phi = 1/\sigma^2\), we get the marginal posterior for \(\tau\):
\[
\pi(\tau\mid D_n) \propto \left[\det\left(\chi_\tau\chi_\tau'\right)\right]^{-0.5}\left[S_n(\tau)\right]^{-\frac{n-(d_x+d_z)}{2}} \times \pi(\tau)
\]
To obtain the conditional posterior for \(\gamma\) given \(\tau\), we integrate (25) with respect to \(\sigma^2\). With the change of variable \(\phi = 1/\sigma^2\), we can show that
\[
\pi(\gamma\mid\tau,D_n) \propto \left[1 + \frac{1}{n-(d_x+d_z)}\,(\gamma - \hat\gamma(\tau))'\,\frac{\chi_\tau\chi_\tau'}{S_n(\tau)/(n-(d_x+d_z))}\,(\gamma - \hat\gamma(\tau))\right]^{-n/2}
\]
which means that
\[
\gamma\mid\tau,D_n \sim t_{(d_x+d_z)}\left(n-(d_x+d_z),\ \hat\gamma(\tau),\ \frac{S_n(\tau)}{n-(d_x+d_z)}\left(\chi_\tau\chi_\tau'\right)^{-1}\right)
\]
Finally, to obtain the conditional posterior for \(\sigma^2\) given \(\tau\), we integrate (25) with respect to \(\gamma\) and can show that
\[
\pi(\phi\mid\tau,D_n) \propto \phi^{\frac{n-(d_x+d_z)}{2}-1}\exp\left[-\frac{S_n(\tau)}{2}\phi\right]
\]
which means that \(\phi\mid\tau,D_n \sim \text{Gamma}\left((n-(d_x+d_z))/2,\ S_n(\tau)/2\right)\) or, equivalently,
\[
\sigma^2\mid\tau,D_n \sim \text{InvGamma}\left((n-(d_x+d_z))/2,\ S_n(\tau)/2\right)
\]
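To make these formulas concrete, here is a small numerical sketch (simulated data, not the paper's code) that evaluates \(\log\pi(\tau\mid D_n)\) on a grid under a flat prior on \(\tau\); both \(\det(\chi_\tau\chi_\tau')\) and \(S_n(\tau)\) come from one least-squares fit per candidate \(\tau\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data with an intercept shift at tau0 = 0.5 (hypothetical DGP)
n = 200
y = 0.5 + rng.standard_normal(n)
y[n // 2:] += 1.0
X = np.ones((n, 1))  # regressors whose coefficients do not shift

def log_post_tau(y, X, tau):
    # log pi(tau | Dn) up to a constant, flat prior on tau:
    # -0.5*log det(chi chi') - ((n - (dx+dz))/2) * log Sn(tau)
    n = len(y)
    ind = (np.arange(1, n + 1) > int(n * tau)).astype(float)
    Z2 = X * ind[:, None]            # post-break part of the regressors
    chi = np.hstack([X, Z2])
    gamma_hat, *_ = np.linalg.lstsq(chi, y, rcond=None)
    Sn = ((y - chi @ gamma_hat) ** 2).sum()
    _, logdet = np.linalg.slogdet(chi.T @ chi)
    return -0.5 * logdet - 0.5 * (n - chi.shape[1]) * np.log(Sn)

grid = np.linspace(0.1, 0.9, 161)
logp = np.array([log_post_tau(y, X, t) for t in grid])
post = np.exp(logp - logp.max())     # normalize in log space for stability
post /= post.sum()
tau_mean = float((grid * post).sum())
```

Working in logs and subtracting the maximum before exponentiating avoids the underflow that \(S_n(\tau)^{-(n-(d_x+d_z))/2}\) would cause for moderate \(n\).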
B.2 Conjugate prior
In this section, we derive the posterior in Section 4.3 under the conjugate prior (7). It can be shown that
the posterior is
\begin{align*}
p(\theta\mid D_n) &\propto p(D_n\mid\theta)\,\pi(\theta) \\
&\propto \left(\frac{1}{\sigma}\right)^n \exp\left[-\frac{1}{2\sigma^2}\left\{\sum_{i=1}^n \left(y_i - \chi_{\tau,i}'\gamma\right)^2\right\}\right]\pi(\tau) \times \left(\frac{1}{\sigma^2}\right)^{a+p/2+1}\exp\left[-\frac{1}{\sigma^2}\left\{b + \frac{1}{2}\left(\gamma - \mu\right)'H\left(\gamma - \mu\right)\right\}\right] \\
&\propto \left(\frac{1}{\sigma^2}\right)^{a+p/2+1}\exp\left[-\frac{1}{\sigma^2}\left\{b_\tau + \frac{1}{2}\left(\gamma - \mu_\tau\right)'H_\tau\left(\gamma - \mu_\tau\right)\right\}\right]\pi(\tau)
\end{align*}
where \(a\), \(b_\tau\), \(\mu_\tau\), and \(H_\tau\) are defined in Eq. (14). Note that the distribution of \(\gamma, \sigma^2\mid\tau,D_n\) is also a normal-inverse-gamma distribution. By the same approach we used under the improper prior, we can show
\[
\pi(\tau\mid D_n) \propto \left[\det\left(H_\tau\right)\right]^{-0.5} b_\tau^{-a} \times \pi(\tau)
\]
and
\[
\sigma^2\mid\tau,D_n \sim \text{InvGamma}\left(a,\ b_\tau\right)
\]
Finally, to derive the posterior of \(\gamma\) conditional on \(\tau\), we use the well-known property that the integral of a normal-inverse-gamma density with respect to \(\sigma^2\) is a \(t\)-density to conclude that
\[
\gamma\mid\tau,D_n \sim t_{(d_x+d_z)}\left(2a,\ \mu_\tau,\ (b_\tau/a)H_\tau^{-1}\right)
\]
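As a numerical sketch of these conjugate updates (generic normal-inverse-gamma algebra on simulated data; the exact definitions in the paper's Eq. (14) should be checked against this version), the posterior parameters given \(\tau\) can be computed in closed form:

```python
import numpy as np

def nig_posterior(y, chi, mu, H, a0, b0):
    # Standard conjugate NIG update given tau: H_tau = H + chi'chi,
    # mu_tau = H_tau^{-1}(H mu + chi'y), a = a0 + n/2,
    # b_tau = b0 + (y'y + mu'H mu - mu_tau' H_tau mu_tau)/2
    Ht = H + chi.T @ chi
    mut = np.linalg.solve(Ht, H @ mu + chi.T @ y)
    a = a0 + len(y) / 2
    bt = b0 + 0.5 * (y @ y + mu @ H @ mu - mut @ Ht @ mut)
    return mut, Ht, a, bt

# Check on simulated data: with a small prior precision H, the posterior
# mean mu_tau should be close to the least-squares fit gamma_hat(tau)
rng = np.random.default_rng(2)
n = 500
chi = np.column_stack([np.ones(n), rng.standard_normal(n)])
gamma0 = np.array([1.0, -2.0])
y = chi @ gamma0 + rng.standard_normal(n)
mut, Ht, a, bt = nig_posterior(y, chi, np.zeros(2), 0.01 * np.eye(2), 1.0, 1.0)
# gamma | tau, Dn ~ t(2a, mu_tau, (b_tau/a) H_tau^{-1}); sigma^2 posterior mean:
sigma2_mean = bt / (a - 1)
```

With a nearly flat prior, \(\mu_\tau\) recovers the OLS estimate and \(b_\tau/(a-1)\) is close to the true error variance, which is what drives the asymptotic equivalence used in the proof of Case 2.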
Acknowledgements
I am grateful to Andriy Norets, my dissertation advisor, for guidance and encouragement throughout the work on this project. I thank Eric Renault, Susanne Schennach, Jesse Shapiro, Kenneth Chay, Toru Kitagawa, Adam McCloskey, Dimitris Korobilis, and Florian Gunsilius, as well as participants in the Brown University Econometrics Seminar, for valuable comments and suggestions. This work was supported by the Economics department dissertation fellowship at Brown University. All remaining errors are mine.
References
I. Andrews, T. Kitagawa, and A. McCloskey. Inference after estimation of breaks. Technical report,
Centre for Microdata Methods and Practice, Institute for Fiscal Studies, 2019.
Y. Baek. Estimation of structural break point in linear regression models. Working paper, 2019.
J. Bai. Estimation of a change point in multiple regression models. Review of Economics and Statistics,
79(4):551–563, 1997.
D. Card, A. Mas, and J. Rothstein. Tipping and the dynamics of segregation. The Quarterly Journal of
Economics, 123(1):177–218, 2008.
K. S. Chan. Consistency and limiting distribution of the least squares estimator of a threshold autoregressive model. The Annals of Statistics, 21(1):520–533, 1993.
S. N. Durlauf and P. A. Johnson. Multiple regimes and cross-country growth behaviour. Journal of
Applied Econometrics, 10(4):365–384, 1995.
G. Elliott and U. K. Müller. Pre and post break parameter inference. Journal of Econometrics, 180(2):141–157, 2014.
S. Ghosal and T. Samanta. Asymptotic behaviour of Bayes estimates and posterior distributions in multiparameter nonregular cases. Mathematical Methods of Statistics, 4(4):361–388, 1995.
B. E. Hansen. Sample splitting and threshold estimation. Econometrica, 68(3):575–603, 2000.
B. E. Hansen. Threshold autoregression in economics. Statistics and its Interface, 4(2):123–127, 2011.
A. Khodadadi and M. Asgharian. Change-point problems and regression: an annotated bibliography. Collection of Biostatistics Research Archive, Preprint 44, 2008.
J. Liu, S. Wu, and J. V. Zidek. On segmented multivariate regression. Statistica Sinica, pages 497–525,
1997.
B. S. Paye and A. Timmermann. Instability of return prediction models. Journal of Empirical Finance,
13(3):274–315, 2006.