Structural break in linear regression models: Bayesian asymptotic analysis
Kenichi Shimizu^a
Department of Economics, Brown University, Robinson Hall, 64 Waterman Street, Providence, RI, United States
July 14, 2020
Abstract
In this paper, we revisit threshold regression models, an important class of models in economic analysis. For example, multiple equilibria can give rise to threshold effects. The issue of conducting valid inference on the regression parameters γ when the threshold parameter τ needs to be estimated remains an open question in the frequentist literature. The non-standard aspect of the estimation problem motivates the use of Bayesian methods, which can correctly reflect the finite-sample uncertainty of estimating τ upon inference of γ. Our theoretical contribution is to establish a Bernstein-von Mises type theorem (Bayesian asymptotic normality) for γ under a wide class of priors for the parameters, which essentially indicates an asymptotic equivalence between the conventional frequentist and the Bayesian inference. Our result is beneficial to both Bayesians and frequentists. A Bayesian user can invoke our theorem to convey his or her statistical result to frequentist researchers. For a frequentist researcher, looking at the credible interval can serve as a robustness check for the finite-sample uncertainty coming from the threshold estimation, and such sensitivity analysis is natural as our result guarantees that the credible interval converges to the frequentist confidence interval. The simulation studies show that the conventional confidence intervals tend to under-cover while credible intervals offer reasonable coverage in general. As the sample size increases, both methods coincide, as predicted by our theoretical conclusion. Using the data from Durlauf and Johnson (1995) on economic growth and Paye and Timmermann (2006) on stock return prediction, we illustrate that the traditional confidence intervals on γ might under-represent the true sampling uncertainty.
Keywords— Threshold regression, Structural break, Bernstein-von Mises theorem, Sensitivity check,Model selection
^a E-mail address: kenichi [email protected]
Contents

1 Introduction
2 The model
3 Data generating process
4 Bayesian approach
   4.1 Prior
   4.2 Posterior under the improper uninformative prior
   4.3 Posterior under the conjugate prior
   4.4 Posterior sampling
5 Asymptotic theory
   5.1 n-consistency of marginal posterior of τ
   5.2 Bernstein-von Mises Theorem for the regression coefficients
6 Simulation
   6.1 Comparison with the conventional frequentist method
   6.2 Comparison to the sample-splitting method [In progress]
7 Application
   7.1 Growth and multiple equilibria: Durlauf and Johnson (1995)
   7.2 Stock return prediction: Paye and Timmermann (2006) [In progress]
8 Conclusion and future direction
A Proof of Propositions
   A.1 Proof of Proposition 1
   A.2 Proof of Proposition 2
   A.3 Proof of Proposition 3
B Derivation of Posterior distributions
   B.1 Improper uninformative prior
   B.2 Conjugate prior
1 Introduction
In this paper, we consider the class of linear regression models with a structural break, following the
notations of Bai (1997):
yi = wi′α + zi′δ1 + εi,    for i = 1, ..., ⌊nτ⌋
yi = wi′α + zi′δ2 + εi,    for i = ⌊nτ⌋ + 1, ..., n
In such models, the threshold parameter τ divides the samples into two groups or regimes. The relation-
ship between the outcome yi and the covariate zi is determined by the regime to which the particular
observation i belongs. There could be some covariates whose relationship to yi, measured by α, might stay
constant between the regimes. The unknown parameters include the threshold τ as well as the regression
coefficients γ = (α, δ1, δ2).
Such models have been important in empirical economics. For instance, Durlauf and Johnson (1995)
suggested that wealthy countries and poor countries can have different growth paths. In macroeconomics
and empirical finance, it is common to observe events that induce a change in the underlying model, such as the decrease in output growth volatility in the 1980s known as "the Great Moderation," oil price shocks, and labor productivity changes (e.g., Paye and Timmermann (2006)). An example where threshold
regression is used in applied microeconomics is the tipping-point model; for example, see Card et al. (2008).
Refer to Hansen (2011) for an overview of the extensive uses of threshold regression models in economic
applications.
In the literature, the conventional least-squares estimators τ̂ and γ̂ are computed as follows: for each candidate τ, compute the sum of squared residuals of the regression, and denote the minimizing choice by τ̂. Plug the value τ = τ̂ into the model and let γ̂ = γ̂(τ̂) be the resulting OLS coefficient estimator, where γ̂(τ) is the usual OLS estimator of γ assuming the break location is τ.
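This grid-search procedure can be sketched in a few lines. The code below is an illustrative implementation only; the function name, the 15% trimming of candidate break points, and the default choices are ours, not taken from the paper.

```python
import numpy as np

def ls_break_estimator(y, X, Z, trim=0.15):
    """Grid-search least-squares estimation of the break location.

    For each candidate break point k = floor(n*tau), regress y on
    (X, Z * 1{i > k}) and record the sum of squared residuals S_n(tau).
    tau_hat minimizes S_n over the trimmed grid, and gamma_hat is the
    OLS coefficient vector at the minimizing break.
    """
    n = len(y)
    best_ssr, best_k, best_coef = np.inf, None, None
    for k in range(int(n * trim), int(n * (1 - trim))):
        Z2 = Z.copy()
        Z2[:k] = 0.0                      # regime-1 rows get no jump regressor
        W = np.hstack([X, Z2])            # design matrix chi_tau = (X, Z_2tau)
        coef, *_ = np.linalg.lstsq(W, y, rcond=None)
        ssr = float(((y - W @ coef) ** 2).sum())
        if ssr < best_ssr:
            best_ssr, best_k, best_coef = ssr, k, coef
    return best_k / n, best_coef, best_ssr
```

On data simulated with a break at τ0 = 0.5 and a large jump, the returned break fraction lands close to 0.5, with the jump coefficient recovered by the last entries of the coefficient vector.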
Roughly speaking, the frequentist literature on this model can be divided into two groups in terms
of the assumption made about the true jump size δ0 = δ2 − δ1. In the first group of the literature that
includes Chan (1993) and Liu et al. (1997), the jump size δ0 is assumed to be constant with respect to the sample size n. The authors show that the convergence rate of τ̂ is n^{-1} and that the asymptotic distribution of γ̂ is the same as that of γ̂(τ0). They derived the asymptotic distribution of τ̂, but it depends on nuisance parameters, and hence the construction of confidence intervals for τ is not feasible under this assumption. In order to derive confidence intervals for τ, the second group of authors, such as Bai (1997) and Hansen (2000), assume that δ0 shrinks to zero as n → ∞, but more slowly than n^{-1/2}. In other words, δ0 = d n^{α} where α ∈ (−1/2, 0) and d ∈ R^k. This reduces the convergence rate of τ̂ but enables them to find an asymptotic distribution that is free of nuisance parameters, and hence allows them to construct confidence intervals for τ. For the regression coefficients, Bai (1997) and Hansen (2000) obtain the same asymptotic result as the first group of the literature: the asymptotic distribution of γ̂ is the same as that of γ̂(τ0). Intuitively speaking, this is because δ0 shrinks to zero but is asymptotically still large enough for τ to be correctly estimated with high certainty. In the literature, this shrinking jump size framework is sometimes interpreted as representing a moderately small jump parameter.
For both groups mentioned above, √n(γ̂(τ̂) − γ0) →d N(0, Vγ), where Vγ is the standard asymptotic covariance matrix one obtains when τ = τ0. This means that the econometrician can approximate the distribution of γ̂ by the conventional normal approximation as if τ were known with certainty. Such an asymptotic result implies that P(γ ∈ Θ(τ̂)) → 1 − α as n → ∞, where Θ(τ) is the usual asymptotic (1 − α)-level confidence region for γ under the assumption that the break location τ is known.
As Hansen (2000) himself points out, with finite samples, this procedure is likely to under-represent
the true sampling uncertainty. See Section 6 of this paper for a simulation illustration of this point. The
classic approaches such as the delta method cannot be used because γ̂(τ) is not differentiable with respect to τ and τ̂ is not normally distributed. In order to account for such finite-sample uncertainty, Hansen (2000) proposes a Bonferroni-type correction. For any ρ ∈ (0, 1), let Γ(ρ) be the confidence interval for τ with asymptotic coverage ρ. The proposed confidence region is Θρ = ∪_{τ∈Γ(ρ)} Θ(τ). Because Θρ ⊃ Θ(τ̂), P(γ ∈ Θρ) ≥ P(γ ∈ Θ(τ̂)) → 1 − α as n → ∞. Hence Θρ is more conservative. The drawback of this procedure is that the researcher needs to select ρ, and the inference could be sensitive to the choice of ρ in a finite sample, as the author illustrates in his simulations.
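For a single coefficient, the union region Θρ can be sketched as follows. The inputs here are hypothetical: dictionaries of the conditional point estimates γ̂(τ) and their standard errors per candidate τ, together with a precomputed confidence set Γ(ρ) for τ. We return the hull of the union of the conditional Wald intervals, which is conservative whenever the union is not itself an interval.

```python
def bonferroni_union_interval(taus_in_Gamma, point_est, std_err):
    """Hull of the union of 95% conditional Wald intervals Theta(tau).

    taus_in_Gamma : candidate break fractions in the confidence set Gamma(rho)
    point_est     : dict tau -> conditional estimate gamma_hat(tau) (hypothetical)
    std_err       : dict tau -> conditional standard error (hypothetical)
    """
    z = 1.96  # standard-normal 97.5% quantile, rounded
    lowers = [point_est[t] - z * std_err[t] for t in taus_in_Gamma]
    uppers = [point_est[t] + z * std_err[t] for t in taus_in_Gamma]
    return min(lowers), max(uppers)
```

With a single candidate τ the function reduces to the usual Wald interval; with several candidates the interval widens, which is exactly the conservativeness of Θρ discussed above.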
Sample-splitting methods provide an alternative solution to the failures of conventional estimators and
confidence sets. In a sample-splitting approach, one selects the threshold parameter based on one subset
of the data, and then conducts inference using the remaining data. In Section XX, we compare...
It is worth mentioning that there is another type of assumption on δ0 that has recently attracted attention. The third group considers the small break framework: δ0 = d n^{-1/2}. In other words, the jump size shrinks at the same rate as the sampling uncertainty diminishes. The motivation in the frequentist literature for such a framework is to better reflect the finite-sample uncertainty of the estimation of τ in the asymptotic analysis. In this framework, τ̂ is not consistent. Elliott and Muller (2014) suggest a way to construct a confidence interval for the regression coefficients. Andrews et al. (2019) argue that ignoring the sampling uncertainty of estimating τ potentially leads to invalid inference on the parameter of interest γ and propose inference that reflects the data-dependent choice of the break location. Although the type of the
assumption on δ0 is different in our paper, conceptually, our paper contributes to this literature from a
Bayesian perspective. In addition, compared to the approaches in Elliott and Muller (2014) and Andrews
et al. (2019), our proposed Bayesian method is computationally easier to implement.
For a Bayesian, this non-standard aspect of the estimation problem can be dealt with quite naturally
by placing, for example, a uniform prior on a reasonable range of τ and an uninformative improper prior
(or a conjugate prior with a large prior variance) on γ and then by computing the marginal posterior
probabilities. Any finite sample uncertainty of estimating τ is automatically reflected in the marginal
posterior probability of γ. Note that, unlike the conventional frequentist methods, Bayesian inference has a valid interpretation even in finite samples, as it does not rely on asymptotics. Indeed, Bayesian estimation of change-point models has been very popular in the statistics literature (for example, refer to Khodadadi and Asgharian (2008)).
In this paper, we study asymptotic behavior of Bayesian estimation of the considered model under the
fixed jump size framework. Our theoretical contribution is to establish asymptotic equivalence between
the frequentist method and the Bayesian approach. Specifically, we prove a Bernstein-von Mises type
theorem for the slope parameters γ which validates the frequentist interpretation of the Bayesian credible
regions. Our result is beneficial to both Bayesian and frequentist researchers. A Bayesian user can
invoke our theorem to convey his or her statistical result to the frequentist researchers. For a frequentist
researcher, looking at the credible interval can serve as a robustness check for the finite sample uncertainty
coming from the threshold estimation, and such sensitivity analysis is natural as our result guarantees
the credible interval to converge to the frequentist confidence interval.
Our theoretical results hold under a wide range of prior specifications. First, the prior on the regression
coefficients can be either improper uninformative or conjugate informative. Second, our assumption on
the prior for the threshold location is very mild. For example, a uniform prior satisfies the requirement.
Recently, Baek (2019) investigated the same model (1). As the distribution of the least-squares estimator for τ might exhibit tri-modality for small jumps, she proposed a new point estimator of τ based on a modified objective function. Such a modification is equivalent to specifying a certain type of prior for τ, which indeed
satisfies the requirement for our asymptotic result.
In this paper, we do not consider the shrinking jump size frameworks which were described above
because it is likely that the Bayesian method would deliver different results compared to the frequentist
counterpart. Nevertheless, in contrast to Bai (1997) and Hansen (2000), Bayesian credible intervals
for τ are available without the assumption of the shrinking jump size. Furthermore, any finite sample
uncertainty when the true jump size is small will be automatically reflected in the posterior distribution
of γ.
Estimation of change-point models such as structural break models is considered non-standard in the sense that there is a non-regular parameter (e.g., the threshold parameter) whose point estimator converges faster than n^{-1/2}, the rate at which the regular parameters (e.g., the regression coefficients) converge. Despite its popularity in applications, the Bayesian theoretical literature on non-regular estimation is very scarce. To our knowledge, frequentist evaluation of the Bayesian approach for structural break models has not been studied in the literature. Ghosal and Samanta (1995) consider the general non-regular estimation problem from a Bayesian perspective and establish conditions under which a Bernstein-von Mises theorem holds for the regular part of the parameter. However, their assumptions are difficult to verify for the model we consider.
The paper is organized as follows. Section 2 introduces the model. Section 3 lists the assumptions made about the data-generating process. Section 4 outlines the proposed Bayesian estimation. Section 5 presents the asymptotic theory of our Bayesian method. Section 6 presents simulation evidence to assess the adequacy of the asymptotic theory. Section 7 reports empirical applications to the multiple equilibria growth model of Durlauf and Johnson (1995) and the stock return prediction model of Paye and Timmermann (2006) (in progress). Section 8 concludes. The mathematical proofs of the propositions are given in the Appendix. Proofs of some intermediate lemmas can be found in the Supplementary Material.
2 The model
Using notation similar to that in Bai (1997), the model we consider is
yi = wi′α + zi′δ1 + εi,    for i = 1, ..., ⌊nτ⌋
yi = wi′α + zi′δ2 + εi,    for i = ⌊nτ⌋ + 1, ..., n        (1)
where wi and zi are dw × 1 and dz × 1 vectors of covariates and the random variable εi is a regression error. ⌊a⌋ denotes the largest integer that is strictly smaller than a. Note that α stays unchanged across the regimes defined by the threshold parameter τ ∈ (0, 1). The parameters α, δ1, δ2, and τ are unknown. We assume δ1 ≠ δ2, so that a change has taken place. Using the reparametrization xi = (wi′, zi′)′, β = (α′, δ1′)′, and δ = δ2 − δ1, the equations (1) can be rewritten as
yi = xi′β + εi,            for i = 1, ..., ⌊nτ⌋
yi = xi′β + zi′δ + εi,     for i = ⌊nτ⌋ + 1, ..., n        (2)
Note that zi is a subvector of xi. More generally, let zi = R′xi, where R is a dx × dz known matrix with full column rank, so that zi is a linear transformation of xi. For R = (0_{dw×dz}, I_{dz×dz})′, we obtain the model (2). For R = I_{dx}, the pure change model is obtained.
To rewrite the model in matrix form, let us introduce further notation. Define Y = (y1, ..., yn)′, ε = (ε1, ..., εn)′, X = (x1, ..., xn)′, X1τ = (x1, ..., x_{⌊nτ⌋}, 0, ..., 0)′, and X2τ = (0, ..., 0, x_{⌊nτ⌋+1}, ..., xn)′. Define Z, Z1τ, and Z2τ similarly. Then Z = XR, Z1τ = X1τR, and Z2τ = X2τR, and the equations (2) can be written as

Y = Xβ + Z2τδ + ε        (3)
  = χτγ + ε        (4)

where χτ = (X, Z2τ) and γ = (β, δ)′. Denote by Sn(τ) the sum of squared residuals of the regression (3).
The least-squares (LS) estimator of the threshold τ, as in Bai (1997), is defined as

τ̂ = argmin_{τ ∈ (0,1)} Sn(τ)

and the LS estimator of the regression coefficients γ = (β, δ)′ is

γ̂ = γ̂(τ̂)

where γ̂(τ) denotes the usual OLS estimator assuming τ is known.
3 Data generating process
The data are assumed to include n observations on a response and a vector of covariates: Dn = (Y^n, X^n) = (y1, ..., yn, x1, ..., xn). Conditional on X^n, the response is generated according to the model (1) with the true parameters θ0 = (β0, δ0, σ0^2, τ0). We make the following assumptions about the true DGP:
A1. δ0 ≠ 0.

A2. εi is i.i.d. with E(εi | xi) = 0 and E(εi^2 | xi) = σ0^2.

A3. ΣX = E[xi xi′] = plim (1/n) ∑_{i=1}^n xi xi′ exists and is positive definite.

A4. For all τ1, τ2 ∈ (0, 1) with τ1 < τ2, (1/n) ∑_{i=⌊nτ1⌋+1}^{⌊nτ2⌋} xi εi = Op(n^{-1/2}) and (1/n) ∑_{i=⌊nτ1⌋+1}^{⌊nτ2⌋} xi xi′ = (τ2 − τ1) ΣX + Op(n^{-1/2}).
If the assumptions above hold, the theoretical results in Bai (1997) apply. The author showed that the convergence rate of τ̂ is n^{-1} if δ0 is fixed with respect to the sample size:

τ̂ = τ0 + Op(n^{-1})

and that the LS estimator of γ = (β, δ)′ is asymptotically normal with the asymptotic covariance matrix being the same as if τ0 were known:

√n( γ̂ − γ0 ) = ( √n(β̂ − β0)′, √n(δ̂ − δ0)′ )′ →d N_{dx+dz}(0, σ0^2 V^{-1})        (5)

where

V = plim (1/n) ( ∑_{i=1}^n xi xi′           ∑_{i=⌊nτ0⌋+1}^n xi zi′
                 ∑_{i=⌊nτ0⌋+1}^n zi xi′     ∑_{i=⌊nτ0⌋+1}^n zi zi′ ) = plim (1/n) χτ0′χτ0
4 Bayesian approach
4.1 Prior
To develop a Bayesian estimation framework, we model the distribution of the regression error term as normal: εi | σ^2 ~ N(0, σ^2). Note that the normality assumption is not imposed on the true DGP, so the model can be mis-specified. The prior distribution for τ admits a density π(τ) whose support is Θ, and the ratio π(τ)/π(τ′) is assumed to be bounded for any τ, τ′ ∈ Θ. For example, the uniform distribution on Θ satisfies the requirement. Other priors that reflect the researcher's prior knowledge about τ, or impose some penalty on values near the boundaries as in Baek (2019), can be incorporated as well.
For the regression coefficients (γ, σ^2)′ = (β, δ, σ^2)′, the prior can be either the improper uninformative prior or the Normal-Inverse-Gamma conjugate prior. That is, we specify either

π(γ, σ^2) ∝ 1/σ^2        (6)

or

σ^2 ~ InvGamma(a, b)
γ | σ^2 ~ N(µ, σ^2 H^{-1})        (7)

The uninformative prior implies that the reference point is the frequentist point estimator, which might be of interest to a frequentist researcher who wants to compute posterior intervals as a robustness check. On the other hand, the benefits of the conjugate prior include the facts that the researcher can incorporate prior beliefs on the regression coefficients or impose some regularization, and that the marginal likelihood is available conditional on τ. Intuitively, the two types of priors are equivalent when a = b = 0 and H = 0.
4.2 Posterior under the improper uninformative prior
With the improper uninformative prior (6), we can show that the marginal posterior of the jump location τ is

π(τ | Dn) ∝ [ det(χτ′χτ) ]^{-1/2} [ Sn(τ) ]^{-(n − (dx+dz))/2} × π(τ)        (8)

The marginal posterior of γ = (β, δ)′ is a mixture with the weights given by the marginal posterior density πn(τ) of τ:

p(γ | Dn) = ∫_0^1 p(γ | τ, Dn) πn(τ) dτ

Note that in the conventional frequentist approach, τ is fixed at the point estimate τ̂, and hence the information p(γ | τ, Dn) πn(τ) for τ ≠ τ̂ is essentially neglected. In contrast, such information or uncertainty is explicitly reflected in the Bayesian framework. Conditional on τ, the posterior distribution of γ = (β, δ)′ is a (dx + dz)-dimensional t-distribution centered at the OLS estimate given τ:

γ | τ, Dn ~ t_{dx+dz}( n − (dx+dz), γ̂(τ), [ Sn(τ)/(n − (dx+dz)) ] (χτ′χτ)^{-1} )        (9)

where t_k(v, µ, Σ) is the k-dimensional t-distribution with v degrees of freedom, location vector µ ∈ R^k, and k × k shape matrix Σ. It can be shown that the posterior for σ^2 conditional on τ is Inverse-Gamma with shape parameter (n − (dx+dz))/2 and scale parameter Sn(τ)/2:

σ^2 | τ, Dn ~ InvGamma( (n − (dx+dz))/2, Sn(τ)/2 )        (10)
4.3 Posterior under the conjugate prior
With the conjugate prior (7), we can derive the posterior of each parameter:

π(τ | Dn) ∝ [ det(H̄τ) ]^{-1/2} b̄τ^{-ā} × π(τ)        (11)

γ | τ, Dn ~ t_{dx+dz}( 2ā, µ̄τ, (b̄τ/ā) H̄τ^{-1} )        (12)

σ^2 | τ, Dn ~ InvGamma( ā, b̄τ )        (13)

where

H̄τ = H + χτ′χτ
µ̄τ = H̄τ^{-1} [ Hµ + χτ′Y ]
b̄τ = b + (1/2) [ µ′Hµ + Y′Y − µ̄τ′ H̄τ µ̄τ ]
ā = a + n/2        (14)
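The updates in (14) translate directly into code. The sketch below is our own illustration of the Normal-Inverse-Gamma update conditional on τ, with the design matrix χτ passed in; it is not taken from the paper's implementation.

```python
import numpy as np

def conjugate_updates(Y, chi_tau, mu, H, a, b):
    """Normal-Inverse-Gamma posterior updates of eq. (14), conditional on tau.

    chi_tau : n x (dx+dz) design matrix (X, Z_2tau) for a given tau
    mu, H   : prior mean and precision-scale matrix of gamma given sigma^2
    a, b    : Inverse-Gamma prior shape and scale for sigma^2
    Returns (H_bar, mu_bar, a_bar, b_bar).
    """
    H_bar = H + chi_tau.T @ chi_tau
    mu_bar = np.linalg.solve(H_bar, H @ mu + chi_tau.T @ Y)
    a_bar = a + len(Y) / 2.0
    b_bar = b + 0.5 * (mu @ H @ mu + Y @ Y - mu_bar @ H_bar @ mu_bar)
    return H_bar, mu_bar, a_bar, b_bar
```

As a check of the prior-equivalence remark above, sending H → 0 and a = b = 0 makes µ̄τ collapse to the OLS estimate and b̄τ to half the sum of squared residuals.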
4.4 Posterior sampling
Due to the availability of closed-form conditional posteriors for β, δ, and σ^2 given τ, posterior sampling is simple and fast. One can first draw τ(1), ..., τ(S) from the marginal posterior of τ as in (8) (or (11)) via, for example, a Metropolis-Hastings algorithm. For each τ(s), one can then sample posterior draws γ(s) and σ^2(s) from the posteriors conditional on τ = τ(s), namely (9) (or (12)) and (10) (or (13)). On a laptop with a 2.2GHz processor and 8GB RAM, it takes about 4.1 seconds to generate 10,000 posterior draws for the regression problem in Section 7, which has 10 slope coefficients in total.
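A minimal sketch of this sampling scheme under the improper prior: instead of a Metropolis-Hastings step, we evaluate (8) on the discrete grid of admissible break points and sample τ from the normalized weights; given τ, we draw σ^2 from (10) and then γ from the conditional normal given (τ, σ^2), which integrates to the t-distribution (9). The function name, the 15% trimming, and the simulated example are our own choices.

```python
import numpy as np

def posterior_draws(y, X, Z, n_draws=1000, trim=0.15, seed=0):
    """Draw (tau, sigma^2, gamma) from the posterior under the improper prior."""
    rng = np.random.default_rng(seed)
    n, k = len(y), X.shape[1] + Z.shape[1]
    grid = np.arange(int(n * trim), int(n * (1 - trim)))
    logp, cache = [], {}
    for m in grid:                           # m = floor(n*tau)
        Z2 = Z.copy()
        Z2[:m] = 0.0
        W = np.hstack([X, Z2])               # design chi_tau
        WtW = W.T @ W
        coef = np.linalg.solve(WtW, W.T @ y)
        ssr = float(((y - W @ coef) ** 2).sum())
        _, logdet = np.linalg.slogdet(WtW)
        # log of eq. (8) with a uniform prior on the grid
        logp.append(-0.5 * logdet - 0.5 * (n - k) * np.log(ssr))
        cache[int(m)] = (WtW, coef, ssr)
    logp = np.array(logp)
    w = np.exp(logp - logp.max())
    w /= w.sum()
    draws = []
    for m in rng.choice(grid, size=n_draws, p=w):
        WtW, coef, ssr = cache[int(m)]
        sigma2 = 1.0 / rng.gamma((n - k) / 2.0, 2.0 / ssr)   # InvGamma((n-k)/2, S_n/2)
        gamma = rng.multivariate_normal(coef, sigma2 * np.linalg.inv(WtW))
        draws.append((m / n, sigma2, gamma))
    return draws
```

Because the conditional steps are exact, the only approximation relative to the description above is the discretization of τ to the observed break points, which is also where the marginal posterior (8) actually lives.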
5 Asymptotic theory
In this section, we investigate the asymptotic behavior of the Bayesian method. Section 5.1 shows that the marginal posterior of the threshold parameter τ contracts to the true value τ0 at rate n^{-1}, the same rate at which the LS estimator τ̂ converges. The proof is based on studying the behavior of the log ratio of the marginal posterior densities of τ. Section 5.2 establishes a Bernstein-von Mises type theorem for the regression coefficients γ = (β, δ)′. The proof exploits the fact that the conditional posterior of √n(γ − γ0) given τ is asymptotically normal and, when τ is close to τ0, close to the asymptotic distribution of the OLS estimator γ̂(τ0). A bound on the KL divergence between two normal densities, together with the n-consistency, is used to make the argument precise.
5.1 n-consistency of marginal posterior of τ
An intermediate step for proving the Bernstein-von Mises theorem is marginal posterior consistency of τ at rate n^{-1}. Marginal posteriors have not been studied extensively or systematically in the literature. Here, we directly analyze the form of the marginal posterior of τ. The marginal posterior density of τ is defined as

πn(τ) = π(τ | Y^(n)) / ∫ π(τ | Y^(n)) dτ = Ln(τ) / ∫ Ln(τ) dτ

where Ln(τ) is the right-hand side of equation (8) or (11), depending on the choice of the prior. The following theorem states the main result of this subsection.
Theorem 1 (Marginal posterior consistency of τ at rate n^{-1}). Suppose Assumptions A1-A4 hold. Then, under both the uninformative improper prior and the conjugate prior, for all η > 0 and ε > 0 there exist M > 0 and N > 0 such that n ≥ N implies

Pθ0( ∫_{B^c_{M/n}(τ0)} πn(τ) dτ < η ) > 1 − ε

where Bδ(τ0) = (τ0 − δ, τ0 + δ) and B^c_δ(τ0) denotes its complement.
Note that

πn(τ) = Ln(τ) / ∫ Ln(τ′) dτ′ = [ Ln(τ0) / ∫ Ln(τ′) dτ′ ] [ Ln(τ)/Ln(τ0) ] = πn(τ0) Ln(τ)/Ln(τ0)

and that

πn(τ0) = Ln(τ0) / ∫ Ln(τ′) dτ′ ≤ Ln(τ0) / ∫_{B^c_{M0/n}(τ0)} Ln(τ′) dτ′ = [ ∫_{B^c_{M0/n}(τ0)} Ln(τ′)/Ln(τ0) dτ′ ]^{-1}

for any M0 > 0. Hence, for each n and for any M0 > 0,

∫_{B^c_{M/n}(τ0)} πn(τ) dτ = πn(τ0) ∫_{B^c_{M/n}(τ0)} Ln(τ)/Ln(τ0) dτ ≤ [ ∫_{B^c_{M0/n}(τ0)} Ln(τ′)/Ln(τ0) dτ′ ]^{-1} ∫_{B^c_{M/n}(τ0)} Ln(τ)/Ln(τ0) dτ        (15)

Therefore, we want to find

1. an upper bound for ∫_{B^c_{M/n}(τ0)} Ln(τ)/Ln(τ0) dτ, and

2. a lower bound for ∫_{B^c_{M0/n}(τ0)} Ln(τ′)/Ln(τ0) dτ′ for some M0 > 0.
Note that

Ln(τ)/Ln(τ0) = exp[ n { (1/n) log Ln(τ) − (1/n) log Ln(τ0) } ]

Case 1: Improper uninformative prior
For the case of the improper uninformative prior, from (8), we have

(1/n) log Ln(τ) − (1/n) log Ln(τ0)
  = (1/(2n)) log[ det(χτ0′χτ0) / det(χτ′χτ) ] + [(n − (dx+dz))/(2n)] [ log Sn(τ0) − log Sn(τ) ] + (1/n) log[ π(τ)/π(τ0) ]

Note that, for example, (1/n) ∑_{i=1}^n xi xi′, (1/n) ∑_{i=⌊nτ0⌋+1}^n xi zi′, and (1/n) ∑_{i=⌊nτ⌋+1}^n xi zi′ converge in probability, and by continuity of the determinant each determinant converges to the determinant of the corresponding probability limit. Hence, the quantity inside the log in the first term is Op(1), and therefore the first term is Op(n^{-1}). Since the ratio π(τ)/π(τ0) is bounded, the last term is O(n^{-1}). Hence, we have

(1/n) log Ln(τ) − (1/n) log Ln(τ0) = log Sn(τ0) − log Sn(τ) + Op(n^{-1}) = log[ (1/n) Sn(τ0) ] − log[ (1/n) Sn(τ) ] + Op(n^{-1})
Case 2: Conjugate prior

For the case of the conjugate prior, from (11), we have

(1/n) log Ln(τ) − (1/n) log Ln(τ0)
  = (1/(2n)) log[ det(H̄τ0) / det(H̄τ) ] + (ā/n) [ log b̄τ0 − log b̄τ ] + (1/n) log[ π(τ)/π(τ0) ]
  = log[ (1/n) b̄τ0 ] − log[ (1/n) b̄τ ] + Op(n^{-1})

We have

(1/n) b̄τ = b/n + (1/2) [ (1/n) µ′Hµ + (1/n) Y′Y − (1/n) µ̄τ′ H̄τ µ̄τ ]
  = (1/(2n)) [ Y′Y − (χτ′Y)′ (χτ′χτ)^{-1} (χτ′Y) ] + Op(n^{-1})
  = (1/(2n)) Sn(τ) + Op(n^{-1})        (16)

Hence, as in the case of the improper prior, even with the conjugate prior we have

(1/n) log Ln(τ) − (1/n) log Ln(τ0) = log[ (1/n) Sn(τ0) ] − log[ (1/n) Sn(τ) ] + Op(n^{-1})

Define Qn(τ) = (1/n) Sn(τ) and An(τ) = g(Qn(τ)), where g(x) = −log(x). Then we can write

(1/n) log Ln(τ) − (1/n) log Ln(τ0) = An(τ) − An(τ0) + Op(n^{-1})        (17)
Definition 1. For all τ ∈ Θ,

Q(τ) = σ0^2 + ∆(τ),   where   ∆(τ) = (τ0 − τ) [(1 − τ0)/(1 − τ)] δ0′R′ΣXRδ0   if τ ≤ τ0,
                              ∆(τ) = (τ − τ0) (τ0/τ) δ0′R′ΣXRδ0               if τ > τ0.

Figure 1 below shows an example of Qn(τ) and Q(τ).
Figure 1: Example of Qn(τ) for n = 100 (+), n = 1,000 (circle), and n = 10,000 (cross), and Q(τ) (dashed).
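The objects in Figure 1 are easy to reproduce numerically for the mean-shift special case xi = zi = 1 (so ΣX = 1 and δ0′R′ΣXRδ0 = δ0^2). The sketch below, with our own function names and defaults, computes Qn(τ) and the limit Q(τ) of Definition 1 so that Proposition 1 can be checked numerically.

```python
import numpy as np

def Q_limit(tau, tau0=0.5, delta0=1.0, sigma2=1.0):
    """Limit Q(tau) of Definition 1, specialized to the mean-shift model.

    With x_i = z_i = 1 we have R = 1 and Sigma_X = 1, so the jump term
    delta0' R' Sigma_X R delta0 reduces to delta0**2.
    """
    jump = delta0 ** 2
    if tau <= tau0:
        return sigma2 + (tau0 - tau) * (1.0 - tau0) / (1.0 - tau) * jump
    return sigma2 + (tau - tau0) * tau0 / tau * jump

def Q_n(y, tau):
    """Q_n(tau) = S_n(tau)/n: scaled SSR when a separate mean is fit per regime."""
    n = len(y)
    m = int(np.floor(n * tau))          # last observation of regime 1
    ssr = ((y[:m] - y[:m].mean()) ** 2).sum() + ((y[m:] - y[m:].mean()) ** 2).sum()
    return ssr / n
```

For draws from the mean-shift DGP with n = 10,000, Qn tracks Q over interior values of τ up to the Op(n^{-1/2}) error of Proposition 1, matching the pattern in Figure 1.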
The following propositions are used to prove Theorem 1.
Proposition 1. Under Pθ0, for all τ,

Qn(τ) = Q(τ) + Op(n^{-1/2})

Define A(τ) = g(Q(τ)).

Proposition 2. A(τ) attains its unique maximum at τ0.

Proposition 3. For all η > 0 and ε > 0, there exist M > 0 and N > 0 such that n ≥ N implies

Pθ0( inf_{τ ∈ B^c_{M/n}(τ0)} | { An(τ) − An(τ0) } − { A(τ) − A(τ0) } | / |τ − τ0| < η ) > 1 − ε
Proof of Theorem 1. By Proposition 2, A(·) attains its unique maximum at τ0, and A(τ) is not differentiable at τ0. Hence we have

A(τ) − A(τ0) < |τ − τ0| B1
A(τ) − A(τ0) > |τ − τ0| B2

for some B1, B2 < 0. By Proposition 3, given η1 > 0, there exists M > 0 such that, with Pθ0-probability tending to 1,

An(τ) − An(τ0) < η1 |τ − τ0| + A(τ) − A(τ0) < |τ − τ0| { η1 + B1 }        (18)

for all τ ∈ B^c_{M/n}(τ0). Similarly, given η2 > 0, there exists M0 > 0 such that, with Pθ0-probability tending to 1,

An(τ) − An(τ0) > −η2 |τ − τ0| + A(τ) − A(τ0) > |τ − τ0| { −η2 + B2 }        (19)

for all τ ∈ B^c_{M0/n}(τ0). Recall that, by Eq. (17), we have

Ln(τ)/Ln(τ0) = exp[ n ( An(τ) − An(τ0) ) + Op(1) ]

Hence, from Eq. (18), given η1 > 0 small compared to −B1, there is B1′ < 0, independent of M, such that, with Pθ0-probability tending to 1,

Ln(τ)/Ln(τ0) ≤ exp[ n|τ − τ0| B1′ + Op(1) ] = exp[ n|τ − τ0| B1′ ] Op(1)        (20)
for all τ ∈ B^c_{M/n}(τ0). Note that the statement above still holds with a larger value of M > 0, as the region outside the larger ball is contained in the region outside the original one. Similarly, from Eq. (19), there are B2′ < 0 and M0 > 0 such that, with Pθ0-probability tending to 1,

Ln(τ)/Ln(τ0) ≥ exp[ n|τ − τ0| B2′ + Op(1) ] = exp[ n|τ − τ0| B2′ ] Op(1)        (21)
for all τ ∈ B^c_{M0/n}(τ0). Now, by Inequality (20) and the fundamental theorem of calculus,

∫_{B^c_{M/n}(τ0)} Ln(τ)/Ln(τ0) dτ ≤ ∫_{B^c_{M/n}(τ0)} exp[ n|τ − τ0| B1′ ] dτ Op(1) = [1/(nB1′)] ( e^{nB1′} − e^{B1′M} ) Op(1)

Similarly, by Inequality (21),

∫_{B^c_{M0/n}(τ0)} Ln(τ)/Ln(τ0) dτ ≥ ∫_{B^c_{M0/n}(τ0)} exp[ n|τ − τ0| B2′ ] dτ Op(1) = [1/(nB2′)] ( e^{nB2′} − e^{B2′M0} ) Op(1)
This means, together with the bound (15),

∫_{B^c_{M/n}(τ0)} πn(τ) dτ ≤ [ ∫_{B^c_{M0/n}(τ0)} Ln(τ′)/Ln(τ0) dτ′ ]^{-1} ∫_{B^c_{M/n}(τ0)} Ln(τ)/Ln(τ0) dτ ≤ (B2′/B1′) [ ( e^{B1′n} − e^{B1′M} ) / ( e^{B2′n} − e^{B2′M0} ) ] Op(1)

which can be made arbitrarily small by choosing M > 0 and n sufficiently large.
5.2 Bernstein-von Mises Theorem for the regression coefficients
The marginal posterior of (β, δ)′ is a mixture weighted by πn(τ). Furthermore, due to Theorem 1, we can focus our attention on the values of τ in an n^{-1} neighborhood of τ0:

∫_0^1 p(β, δ | τ, Dn) πn(τ) dτ = ∫_{B_{M/n}(τ0)} p(β, δ | τ, Dn) πn(τ) dτ + op(1)

Conditional on τ = τ0, the standard result for OLS applies:

√n( γ̂(τ0) − γ0 ) = ( √n(β̂(τ0) − β0)′, √n(δ̂(τ0) − δ0)′ )′ →d N_{dx+dz}(0, σ0^2 V^{-1})        (22)

where V is defined below equation (5). We are now ready to establish the following Bernstein-von Mises type result.

Theorem 2 (Bernstein-von Mises theorem for the slope coefficients). Suppose that Assumptions A1-A4 hold. Then

dTV( π[ ( √n(β − β̂(τ0))′, √n(δ − δ̂(τ0))′ )′ | Dn ], N_{dx+dz}(0, σ0^2 V^{-1}) ) → 0

in Pθ0-probability.
Proof of Theorem 2. Define z = √n( γ − γ̂(τ0) ) = ( √n(β − β̂(τ0))′, √n(δ − δ̂(τ0))′ )′. Then

dTV( π[z | Dn], N_{dx+dz}(0, σ0^2 V^{-1}) ) = ∫ | π(z | Dn) − φ(z; 0, σ0^2 V^{-1}) | dz
  ≤ ∫∫ | π(z | τ, Dn) − φ(z; 0, σ0^2 V^{-1}) | dz dπ(τ | Dn) = ∫ dTV( π(z | τ, Dn), N_{dx+dz}(0, σ0^2 V^{-1}) ) dπ(τ | Dn)
  = ∫_{B_{M/n}(τ0)} dTV( π(z | τ, Dn), N_{dx+dz}(0, σ0^2 V^{-1}) ) dπ(τ | Dn) + op(1)

where the last equality is due to Theorem 1.
Case 1: Improper uninformative prior

Let us first consider the case with the improper uninformative prior (6). From (9), asymptotically, the posterior of γ = (β, δ)′ conditional on τ is normal:

γ | τ, Dn ≈ N_{dx+dz}( γ̂(τ), [ Sn(τ)/(n − (dx+dz)) ] (χτ′χτ)^{-1} )

which implies

z | τ, Dn ≈ N_{dx+dz}( √n( γ̂(τ) − γ̂(τ0) ), [ n Sn(τ)/(n − (dx+dz)) ] (χτ′χτ)^{-1} )

The total variation distance is bounded above by 2 times the square root of the KL divergence. Note that for
any p-dimensional normal distributions N(µ1, Σ1) and N(µ2, Σ2), we have

KL( N(µ1, Σ1), N(µ2, Σ2) ) ≤ | det(Σ2^{-1}) − det(Σ1^{-1}) | / min( det(Σ1^{-1}), det(Σ2^{-1}) )      [term I]
    + p ‖Σ2^{-1} − Σ1^{-1}‖_∞ ‖Σ1‖_∞      [term II]
    + ‖µ1 − µ2‖_2^2 ‖Σ2^{-1}‖_2      [term III]

where ‖Σ‖_∞ = max_{ij} |Σ_{ij}| is the largest element of Σ in absolute value and ‖Σ‖_2 = sup_µ ‖Σµ‖_2 / ‖µ‖_2 is the matrix norm induced by the standard norm ‖µ‖_2 = ( ∑_{i=1}^p µ_i^2 )^{1/2} on R^p.
In our case, µ1 = √n( γ̂(τ) − γ̂(τ0) ), µ2 = 0, Σ2 = σ0^2 V^{-1}, and

Σ1 = [ n Sn(τ)/(n − (dx+dz)) ] (χτ′χτ)^{-1} = [ n Qn(τ)/(n − (dx+dz)) ] Vn(τ)^{-1}

where we define Vn(τ) = (1/n) χτ′χτ.
Note that, since |τ − τ0| < M/n,

Σ1 − Σ2 = [ n/(n − (dx+dz)) ] ( Qn(τ) − Qn(τ0) ) Vn(τ)^{-1}                    [= op(1)]
    + [ n/(n − (dx+dz)) ] Qn(τ0) ( Vn(τ)^{-1} − Vn(τ0)^{-1} )                  [= op(1)]
    + [ n/(n − (dx+dz)) ] ( Qn(τ0) − σ0^2 ) Vn(τ0)^{-1}                        [= Op(n^{-1/2})]
    + [ n/(n − (dx+dz)) ] σ0^2 ( Vn(τ0)^{-1} − V^{-1} )                        [= op(1)]
    + ( n/(n − (dx+dz)) − 1 ) σ0^2 V^{-1}                                      [= o(1)]
  = op(1)

which implies Σ2^{-1} − Σ1^{-1} = op(1).
Hence term II is op(1). By continuity of the determinant and minimum functions, we also have term I = op(1) for τ ∈ B_{M/n}(τ0). Lastly, to show that term III is op(1), note that Y = Xβ0 + Z2τ0δ0 + ε = Xβ0 + Z2τδ0 + ε*, where ε* = (Z2τ0 − Z2τ)δ0 + ε. This implies

√n( γ̂(τ) − γ0 ) = ( √n(β̂(τ) − β0)′, √n(δ̂(τ) − δ0)′ )′
  = [ (1/n) ( X′X      X′Z2τ
              Z2τ′X    Z2τ′Z2τ ) ]^{-1} (1/√n) ( X′ε*
                                                 Z2τ′ε* )
  = [ (1/n) ( X′X      X′Z2τ
              Z2τ′X    Z2τ′Z2τ ) ]^{-1} (1/√n) ( X′ε + X′(Z2τ0 − Z2τ)δ0
                                                 Z2τ′ε + Z2τ0′(Z2τ0 − Z2τ)δ0 )
It can be shown that, for |τ − τ0| < M/n,

(1/√n) X′(Z2τ0 − Z2τ) = op(1),      (1/√n) Z2τ0′(Z2τ0 − Z2τ) = op(1),
(1/n) X′Z2τ − (1/n) X′Z2τ0 = op(1),      (1/n) Z2τ′Z2τ − (1/n) Z2τ0′Z2τ0 = op(1)
which implies

√n( γ̂(τ) − γ0 ) = [ (1/n) ( X′X      X′Z2τ0
                             Z2τ0′X   Z2τ0′Z2τ0 ) ]^{-1} (1/√n) ( X′ε
                                                                  Z2τ′ε ) + op(1)

By the central limit theorem, it can be shown that the left-hand side above has the same limiting distribution as √n( γ̂(τ0) − γ0 ) for |τ − τ0| < M/n. This means that term III is op(1). Finally,

dTV( π(z | τ, Y^(n)), N_{dx+dz}(0, σ0^2 V^{-1}) ) ≤ 2 ( op(1) )^{1/2} = op(1)

which implies

dTV( π[z | Y^(n)], N_{dx+dz}(0, σ0^2 V^{-1}) ) ≤ ∫_{B_{M/n}(τ0)} dTV( π(z | τ, Y^(n)), N_{dx+dz}(0, σ0^2 V^{-1}) ) dπ(τ | Y^(n)) + op(1) = op(1)
Case 2: Conjugate prior

Next, let us consider the case with the conjugate prior (7). From (12), asymptotically, the posterior of γ = (β, δ)′ conditional on τ is normal:

γ | τ, Dn ≈ N_{dx+dz}( µ̄τ, (b̄τ/ā) H̄τ^{-1} )

which implies

z | τ, Dn ≈ N_{dx+dz}( √n( µ̄τ − γ̂(τ0) ), (n b̄τ/ā) H̄τ^{-1} )

Note that for each τ ∈ B_{M/n}(τ0),

√n( µ̄τ − γ̂(τ0) ) = √n( µ̄τ − γ̂(τ) ) + √n( γ̂(τ) − γ̂(τ0) )        (23)
where we know that the last term is op(1) from the proof for the case of the improper prior. By definition,

µ̄τ = [ (1/n) H + (1/n) χτ′χτ ]^{-1} [ (1/n) Hµ + (1/n) χτ′Y ] = γ̂(τ) + Op(n^{-1})

so the first term in (23) is also op(1). Also note that for each τ ∈ B_{M/n}(τ0),

(n b̄τ/ā) H̄τ^{-1} − σ0^2 V^{-1} = [ (n b̄τ/ā) H̄τ^{-1} − [ n Sn(τ)/(n − (dx+dz)) ] (χτ′χτ)^{-1} ] + [ [ n Sn(τ)/(n − (dx+dz)) ] (χτ′χτ)^{-1} − σ0^2 V^{-1} ]        (24)
where we know that the second term is op(1) from the proof for the case of the improper prior. For the first term, we have

(n b̄τ/ā) H̄τ^{-1} = [ b̄τ/(a + n/2) ] [ (1/n) H + (1/n) χτ′χτ ]^{-1} = [ (1/n) b̄τ / (a/n + 1/2) ] [ (1/n) H + (1/n) χτ′χτ ]^{-1}

From (16), we know that (1/n) b̄τ = (1/(2n)) Sn(τ) + Op(n^{-1}), so we have

(n b̄τ/ā) H̄τ^{-1} = Sn(τ) [ χτ′χτ ]^{-1} + Op(n^{-1})

Therefore, the first term in (24) is also op(1). The same argument as in the proof for the improper prior implies the desired result.
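The KL-divergence bound between multivariate normals used in the proofs above can be sanity-checked numerically against the exact normal-normal KL divergence. The sketch below is only a spot check on particular inputs, not a proof of the bound; the function names and test values are ours.

```python
import numpy as np

def kl_normals(mu1, S1, mu2, S2):
    """Exact KL divergence KL(N(mu1, S1) || N(mu2, S2)) for p-dimensional normals."""
    p = len(mu1)
    S2inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.trace(S2inv @ S1) + d @ S2inv @ d - p
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

def kl_bound(mu1, S1, mu2, S2):
    """Terms I + II + III of the bound stated in the text."""
    p = len(mu1)
    S1inv, S2inv = np.linalg.inv(S1), np.linalg.inv(S2)
    term_I = abs(np.linalg.det(S2inv) - np.linalg.det(S1inv)) / min(
        np.linalg.det(S1inv), np.linalg.det(S2inv))
    term_II = p * np.abs(S2inv - S1inv).max() * np.abs(S1).max()
    term_III = ((mu1 - mu2) ** 2).sum() * np.linalg.norm(S2inv, 2)
    return term_I + term_II + term_III
```

For example, with µ1 = (0.1, 0), µ2 = 0, Σ1 = I, and Σ2 = diag(1.2, 0.8), the exact KL divergence is about 0.025 while the right-hand side evaluates to about 0.55, consistent with the inequality.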
6 Simulation
6.1 Comparison with the conventional frequentist method
Consider the following simple model: yi = δ0 1(i > ⌊nτ0⌋) + εi with εi ~ i.i.d. N(0, 1) and τ0 = 0.5. We consider sample sizes n = 50, 100, 250, and 500 and true jump sizes δ0 = 0.25, 0.5, and 1.0.
We study the behavior of the Bayesian approach in repeated experiments. For each pair of \(n\) and \(\delta_0\), we generate 1,000 data sets. We report the coverage of the 95% credible interval for the jump size \(\delta\) (Table 1), the mean interval length (Table 3), and the mean absolute difference between the posterior mode of \(\delta\) and \(\delta_0\) (Table 2).
We also present the performance of the conventional frequentist estimator. Specifically, we computed the least-squares estimator \(\hat\tau\), the jump-size estimator \(\hat\delta(\hat\tau)\) conditional on \(\tau = \hat\tau\), and its 95% confidence interval based on the conventional asymptotic theory. This is what researchers would do based on the method in Hansen (2000). As mentioned in the introduction, Hansen proposes a Bonferroni-type correction of the confidence intervals, but its tuning parameter has to be selected, so we do not present Bonferroni-corrected CIs.
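As an illustration of this comparison (a minimal sketch, not the authors' code; the helper names are hypothetical), the following simulates the DGP above, profiles the sum of squared residuals over candidate breaks to obtain the least-squares \(\hat\tau\), and forms the conventional plug-in confidence interval for \(\delta\) that conditions on \(\hat\tau\):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, delta0, tau0=0.5):
    # DGP of Section 6.1: y_i = delta0 * 1(i > floor(n*tau0)) + eps_i, eps_i ~ N(0,1)
    y = rng.standard_normal(n)
    y[int(np.floor(n * tau0)):] += delta0
    return y

def ls_break(y, trim=0.1):
    # Profile least squares: for each trimmed candidate break k, fit the
    # pre/post-break means and keep the SSR-minimizing k.
    n = len(y)
    candidates = range(int(n * trim), int(n * (1 - trim)))
    ssr = {k: ((y[:k] - y[:k].mean()) ** 2).sum()
              + ((y[k:] - y[k:].mean()) ** 2).sum() for k in candidates}
    k_hat = min(ssr, key=ssr.get)
    delta_hat = y[k_hat:].mean() - y[:k_hat].mean()
    # Conventional plug-in CI: asymptotic normality of delta_hat with tau
    # fixed at tau_hat, ignoring the uncertainty from estimating the break
    s2 = ssr[k_hat] / (n - 2)
    se = np.sqrt(s2 * (1.0 / k_hat + 1.0 / (n - k_hat)))
    return k_hat / n, delta_hat, (delta_hat - 1.96 * se, delta_hat + 1.96 * se)

tau_hat, delta_hat, ci = ls_break(simulate(500, 1.0))
```

The plug-in interval is exactly the object whose under-coverage Table 1 documents: it conditions on \(\hat\tau\) and ignores the sampling variability of the break location.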
1. Table 1 shows that for small \(n\) and/or small \(\delta_0\), the conventional CI significantly under-covers. The Bayesian credible interval, on the other hand, has relatively reasonable coverage.
2. From Table 2, we see that, except in a few cases, the conventional and Bayesian methods have similar bias.
            Least-squares            Bayesian
 δ0 =      0.25   0.50   1.00      0.25   0.50   1.00
 n = 50    0.63   0.83   0.95      0.97   0.97   0.97
 n = 100   0.71   0.91   0.95      0.96   0.96   0.96
 n = 250   0.82   0.94   0.95      0.96   0.96   0.95
 n = 500   0.91   0.95   0.95      0.95   0.96   0.95

Table 1: Coverage of the conventional 95% confidence interval (left) and the 95% credible interval (right).
            Least-squares            Bayesian
 δ0 =      0.25   0.50   1.00      0.25   0.50   1.00
 n = 50    0.56   0.37   0.23      0.30   0.28   0.24
 n = 100   0.35   0.20   0.16      0.21   0.19   0.16
 n = 250   0.18   0.10   0.10      0.13   0.11   0.10
 n = 500   0.09   0.07   0.07      0.08   0.07   0.07

Table 2: |Bias| of \(\hat\delta(\hat\tau)\) and of the posterior mode of \(\delta\).
3. From Table 3, the primary reason for the under-coverage of the confidence intervals appears to be that they tend to be too short. The credible intervals, on the other hand, provide better coverage because they are wider, which would be the result of correctly reflecting the sampling uncertainty of the threshold estimation.
4. Finally, Tables 1, 2, and 3 show that as \(n\) increases, the discrepancy between the Bayesian and frequentist results shrinks, as expected from the Bernstein-von Mises theorem proven in the previous section.
            Least-squares            Bayesian
 δ0 =      0.25   0.50   1.00      0.25   0.50   1.00
 n = 50    1.31   1.25   1.12      1.64   1.57   1.35
 n = 100   0.92   0.84   0.79      1.14   1.02   0.85
 n = 250   0.56   0.51   0.49      0.68   0.56   0.50
 n = 500   0.38   0.35   0.35      0.44   0.36   0.35

Table 3: Length of the conventional 95% confidence interval (left) and the 95% credible interval (right).
6.2 Comparison to the sample-splitting method [In progress]
7 Application
In this section, we consider two empirical applications: the economic growth model in Durlauf and Johnson (1995) and the stock return prediction in Paye and Timmermann (2006). In both papers, the estimation is done using the conventional frequentist approach: the authors first computed \(\hat\tau\) and then estimated the regression coefficients with \(\tau\) fixed at \(\hat\tau\).
7.1 Growth and multiple equilibria: Durlauf and Johnson (1995)
In this section, we apply our method to the data from Durlauf and Johnson (1995), which is also considered in Hansen (2000). The authors suggest that cross-section growth behavior may be determined by initial conditions. They explore this hypothesis using the Summers-Heston data set, reporting results obtained from a regression-tree approach, which is an extension of threshold regression.
In one of the specifications in their paper, the authors divide the countries into two groups by the literacy rate in the base year:
\[
\ln(Y/L)_{i,1985} - \ln(Y/L)_{i,1960} =
\begin{cases}
\delta_1^{(1)} + \delta_1^{(2)}\ln(Y/L)_{i,1960} + \delta_1^{(3)}\ln(I/Y)_i + \delta_1^{(4)}\ln(n_i + g + \delta) + \delta_1^{(5)}\ln(school)_i + \varepsilon_i, & \text{if } Lit_{i,1960} \le \tau \\
\delta_2^{(1)} + \delta_2^{(2)}\ln(Y/L)_{i,1960} + \delta_2^{(3)}\ln(I/Y)_i + \delta_2^{(4)}\ln(n_i + g + \delta) + \delta_2^{(5)}\ln(school)_i + \varepsilon_i, & \text{if } Lit_{i,1960} > \tau
\end{cases}
\]
where for each country \(i\), \((Y/L)_{i,t}\) is real GDP per member of the population aged 15-64 in year \(t\), \((I/Y)_i\) is the investment-to-GDP ratio, \(n_i\) is the growth rate of the working-age population, \(school_i\) is the fraction of the working-age population enrolled in secondary school, and \(Lit_{i,1960}\) is the literacy rate in the base year 1960. The variables not indexed by \(t\) are annual averages over the period 1960-1985. Following the authors, we set \(g + \delta = 0.05\), where \(g\) is the growth rate of technology and \(\delta\) is the depreciation rate of both human and physical capital. We standardize the covariates to make the comparison of the estimated coefficients easier.
The prior on \(\tau\) that we use is Uniform(0.05, 0.95). The posterior mean of \(\tau\) is 53.7%, with the 95% credible interval being [31%, 68%]. See Figure 2 for the trace plot and the posterior density. The least-squares method produces a similar point estimate, \(\hat\tau = 53\%\).
In Figure 3, we display the posterior densities of the slope coefficients as well as 95% credible intervals. The intervals in red are the confidence intervals computed with the point estimate \(\hat\tau\) held fixed, which is what the conventional frequentist method would produce. We see that, as expected, the Bayesian credible intervals are generally wider than their frequentist counterparts, which would be a result of correctly reflecting the finite-sample uncertainty about the unknown \(\tau\). Importantly, this can have a qualitative consequence for the statistical significance of some parameters. For example, for the slope coefficients \(\delta_1^{(2)}\) on \(\ln(Y/L)_{i,1960}\) and \(\delta_1^{(3)}\) on \(\ln(I/Y)_i\), the confidence intervals do not include 0 while the Bayesian credible intervals do. Hence, ignoring the finite-sample uncertainty of the threshold-estimation stage could lead to an invalid inference.
Figure 2: Trace plots (left) and posterior density (right) for τ
Figure 3: Posterior density of β’s. 95% credible interval (blue) and 95% confidence interval assuming τ(red).
Figure 4: Posterior density of σ
7.2 Stock return prediction: Paye and Timmermann (2006) [In progress]
Paye and Timmermann (2006) investigate the instability of models of ex-post predictable components in stock returns. The model that they consider is:
\[
Ret_t =
\begin{cases}
\delta_1^{(1)} + \delta_1^{(2)}Div_{t-1} + \delta_1^{(3)}Tbill_{t-1} + \delta_1^{(4)}Spread_{t-1} + \delta_1^{(5)}Def_{t-1} + \varepsilon_t, & \text{if } t \le \lfloor T\tau \rfloor \\
\delta_2^{(1)} + \delta_2^{(2)}Div_{t-1} + \delta_2^{(3)}Tbill_{t-1} + \delta_2^{(4)}Spread_{t-1} + \delta_2^{(5)}Def_{t-1} + \varepsilon_t, & \text{if } t > \lfloor T\tau \rfloor
\end{cases}
\]
where \(Ret_t\) denotes the excess stock return during month \(t\), \(Div_{t-1}\) is the lagged dividend yield, \(Tbill_{t-1}\) is the lagged short-term interest rate, \(Spread_{t-1}\) is the lagged spread between the short-term and long-term interest rates, and \(Def_{t-1}\) is the lagged U.S. default premium.
8 Conclusion and future direction
In this paper, we established a Bernstein-von Mises type theorem for the regression coefficients in linear regression models with a structural break. A frequentist researcher can look at Bayesian credible intervals for the regression coefficients as a robustness check on whether the finite-sample uncertainty coming from the break-location estimation affects the inference on the regression coefficients. Such a sensitivity analysis is natural, as our theoretical result guarantees that the credible intervals converge to the conventional confidence intervals the frequentist researcher would use otherwise.
Extending the result to multiple breaks should be straightforward. When the number of breaks is unknown, the researcher can either conduct model comparisons based on marginal likelihoods, which are available under the conjugate prior, or compute the posterior probabilities of the number of breaks using methods such as reversible jump MCMC or dynamic programming techniques.
I am currently working on several extensions of this paper. First, the homoskedasticity assumption could be too strong in some applications, so extending the results to the heteroskedastic case would be of interest. Second, researchers might want to relax the linearity assumption for the regression function. I am currently working on Bayesian nonparametric regressions with structural breaks.
A Proof of Propositions
A.1 Proof of Proposition 1
Proposition 1. Under \(P_{\theta_0}\), for all \(\tau\),
\[
Q_n(\tau) = Q(\tau) + O_p(n^{-1/2})
\]
Proof of Proposition 1. Let \(\tau \in (0, 1)\) be given. Let \(M = I - X(X'X)^{-1}X'\). We have the following identity: \(S_n(\tau) = S_n - V_n(\tau)\) (Amemiya (1985), Bai (1997)), where \(S_n\) is the sum of squared residuals from regressing \(Y\) on \(X\) alone and \(V_n(\tau) = \hat\delta(\tau)'(Z_{2\tau}'MZ_{2\tau})\hat\delta(\tau)\). By the Frisch-Waugh theorem, the OLS estimate of \(\delta\) in Eq. (3) is equivalent to that in the model \(MY = MZ_{2\tau}\delta + M\varepsilon\). Note the true model is \(MY = MZ_{2\tau_0}\delta_0 + M\varepsilon\). Hence,
\begin{align*}
\hat\delta(\tau) &= (Z_{2\tau}'MZ_{2\tau})^{-1}Z_{2\tau}'MY \\
&= (Z_{2\tau}'MZ_{2\tau})^{-1}Z_{2\tau}'\{MZ_{2\tau_0}\delta_0 + M\varepsilon\} \\
&= (Z_{2\tau}'MZ_{2\tau})^{-1}Z_{2\tau}'MZ_{2\tau_0}\delta_0 + (Z_{2\tau}'MZ_{2\tau})^{-1}Z_{2\tau}'M\varepsilon
\end{align*}
Hence,
\begin{align*}
V_n(\tau) &= \delta_0'(Z_{2\tau}'MZ_{2\tau_0})'(Z_{2\tau}'MZ_{2\tau})^{-1}(Z_{2\tau}'MZ_{2\tau_0})\delta_0 \\
&\quad + 2\delta_0'(Z_{2\tau}'MZ_{2\tau_0})'(Z_{2\tau}'MZ_{2\tau})^{-1}(Z_{2\tau}'M\varepsilon) \\
&\quad + (Z_{2\tau}'M\varepsilon)'(Z_{2\tau}'MZ_{2\tau})^{-1}(Z_{2\tau}'M\varepsilon)
\end{align*}
This means that
\[
\frac{1}{n}V_n(\tau) = \frac{1}{n}\delta_0'(Z_{2\tau}'MZ_{2\tau_0})'(Z_{2\tau}'MZ_{2\tau})^{-1}(Z_{2\tau}'MZ_{2\tau_0})\delta_0 + O_p(n^{-1/2})
\]
Also, we have
\[
S_n = Y'MY = \delta_0'Z_{2\tau_0}'MZ_{2\tau_0}\delta_0 + 2\delta_0'Z_{2\tau_0}'M\varepsilon + \varepsilon'M\varepsilon
\]
which implies
\[
\frac{1}{n}S_n = \frac{1}{n}\delta_0'Z_{2\tau_0}'MZ_{2\tau_0}\delta_0 + \sigma_0^2 + O_p(n^{-1/2})
\]
By the above identity,
\begin{align*}
Q_n(\tau) &= \frac{1}{n}S_n - \frac{1}{n}V_n(\tau) \\
&= \sigma_0^2 + \frac{1}{n}\delta_0'\left\{(Z_{2\tau_0}'MZ_{2\tau_0}) - (Z_{2\tau}'MZ_{2\tau_0})'(Z_{2\tau}'MZ_{2\tau})^{-1}(Z_{2\tau}'MZ_{2\tau_0})\right\}\delta_0 + O_p(n^{-1/2})
\end{align*}
Note that
\begin{align*}
(Z_{2\tau_0}'MZ_{2\tau_0}) &= Z_{2\tau_0}'Z_{2\tau_0} - Z_{2\tau_0}'X(X'X)^{-1}X'Z_{2\tau_0} \\
&= R'X_{2\tau_0}'X_{2\tau_0}R - R'(X_{2\tau_0}'X)(X'X)^{-1}(X'X_{2\tau_0})R \\
&= R'X_{2\tau_0}'X_{2\tau_0}R - R'(X_{2\tau_0}'X_{2\tau_0})(X'X)^{-1}(X_{2\tau_0}'X_{2\tau_0})R
\end{align*}
Hence,
\[
\frac{1}{n}(Z_{2\tau_0}'MZ_{2\tau_0}) = (1-\tau_0)R'\Sigma_X R - (1-\tau_0)^2 R'\Sigma_X R + O_p(n^{-1/2}) = \tau_0(1-\tau_0)R'\Sigma_X R + O_p(n^{-1/2})
\]
Similarly,
\[
\frac{1}{n}(Z_{2\tau}'MZ_{2\tau}) = \tau(1-\tau)R'\Sigma_X R + O_p(n^{-1/2})
\]
Without loss of generality, suppose \(\tau < \tau_0\). Then
\begin{align*}
(Z_{2\tau}'MZ_{2\tau_0}) &= Z_{2\tau}'Z_{2\tau_0} - Z_{2\tau}'X(X'X)^{-1}X'Z_{2\tau_0} \\
&= R'X_{2\tau}'X_{2\tau_0}R - R'(X_{2\tau}'X)(X'X)^{-1}(X'X_{2\tau_0})R \\
&= R'X_{2\tau_0}'X_{2\tau_0}R - R'(X_{2\tau}'X_{2\tau})(X'X)^{-1}(X_{2\tau_0}'X_{2\tau_0})R
\end{align*}
which implies that
\[
\frac{1}{n}(Z_{2\tau}'MZ_{2\tau_0}) = (1-\tau_0)R'\Sigma_X R - (1-\tau)(1-\tau_0)R'\Sigma_X R + O_p(n^{-1/2}) = \tau(1-\tau_0)R'\Sigma_X R + O_p(n^{-1/2})
\]
Therefore,
\[
\frac{1}{n}(Z_{2\tau}'MZ_{2\tau_0})'(Z_{2\tau}'MZ_{2\tau})^{-1}(Z_{2\tau}'MZ_{2\tau_0}) = \underbrace{\left[\tau(1-\tau_0)\,\frac{1}{\tau(1-\tau)}\,\tau(1-\tau_0)\right]}_{=\frac{\tau(1-\tau_0)^2}{1-\tau}} R'\Sigma_X R + O_p(n^{-1/2})
\]
Finally,
\begin{align*}
\frac{1}{n}\left\{(Z_{2\tau_0}'MZ_{2\tau_0}) - (Z_{2\tau}'MZ_{2\tau_0})'(Z_{2\tau}'MZ_{2\tau})^{-1}(Z_{2\tau}'MZ_{2\tau_0})\right\} &= \left[\tau_0(1-\tau_0) - \frac{\tau(1-\tau_0)^2}{1-\tau}\right]R'\Sigma_X R + O_p(n^{-1/2}) \\
&= (\tau_0-\tau)\frac{1-\tau_0}{1-\tau}\,R'\Sigma_X R + O_p(n^{-1/2})
\end{align*}
A.2 Proof of Proposition 2
Proposition 2. \(A(\tau)\) attains its unique maximum at \(\tau_0\).
Proof of Proposition 2. By definition, \(Q(\tau_0) = \sigma_0^2\). Note that \(\delta_0'R'\Sigma_X R\delta_0 > 0\). This is because (1) \(R\) has full column rank, (2) \(\delta_0 \neq 0\), and (3) \(\Sigma_X\) is assumed to be positive definite. Hence, \(Q(\tau) > \sigma_0^2\) for all \(\tau \neq \tau_0\). Recall that \(A(\tau) = g(Q(\tau))\) where \(g(x) = -\log(x)\). Hence \(A(\tau) = -\log(\sigma_0^2)\) if \(\tau = \tau_0\) and \(A(\tau) < -\log(\sigma_0^2)\) otherwise.
A.3 Proof of Proposition 3
Proposition 3. For all \(\eta > 0\) and \(\varepsilon > 0\), there exist \(M > 0\) and \(N > 0\) such that \(n \ge N\) implies
\[
P_{\theta_0}\left(\inf_{\tau \in B_{M/n}^c(\tau_0)} \frac{\left|\{A_n(\tau)-A_n(\tau_0)\} - \{A(\tau)-A(\tau_0)\}\right|}{|\tau-\tau_0|} < \eta\right) > 1 - \varepsilon
\]
Proof of Proposition 3. Recall that \(A_n(\tau) = g(Q_n(\tau))\) and \(A(\tau) = g(Q(\tau))\) where \(g(x) = -\log(x)\). By a Taylor expansion, there is \(c\) between \(x\) and \(a\) such that
\[
g(x) - g(a) = g'(a)(x-a) + \frac{1}{2}g''(c)(x-a)^2
\]
Hence, for each \(\tau \in B_{M/n}^c(\tau_0)\), there is \(c_n\) between \(Q_n(\tau)\) and \(Q(\tau)\) such that
\begin{align*}
g(Q_n(\tau)) - g(Q(\tau)) &= g'(Q(\tau))\left(Q_n(\tau)-Q(\tau)\right) + \frac{1}{2}g''(c_n)\left(Q_n(\tau)-Q(\tau)\right)^2 \\
&= g'(Q(\tau))\,O_p(n^{-1/2}) + O_p(n^{-1})
\end{align*}
where we used Proposition 1. Similarly, there is \(c_{0n}\) between \(Q_n(\tau_0)\) and \(Q(\tau_0)\) such that
\[
g(Q_n(\tau_0)) - g(Q(\tau_0)) = g'(Q(\tau_0))\,O_p(n^{-1/2}) + O_p(n^{-1})
\]
Note that \(g''(c_n) = 1/c_n^2\) and \(g''(c_{0n}) = 1/c_{0n}^2\) are bounded with probability tending to one because for each \(\tau\), \(Q_n(\tau) \xrightarrow{p} Q(\tau)\), and \(Q(\tau)\) is bounded away from zero.
Now,
\begin{align*}
\{A_n(\tau)-A_n(\tau_0)\} - \{A(\tau)-A(\tau_0)\} &= \{A_n(\tau)-A(\tau)\} - \{A_n(\tau_0)-A(\tau_0)\} \\
&= \left\{g(Q_n(\tau))-g(Q(\tau))\right\} - \left\{g(Q_n(\tau_0))-g(Q(\tau_0))\right\} \\
&= \left(g'(Q(\tau)) - g'(Q(\tau_0))\right)O_p(n^{-1/2}) + O_p(n^{-1}) \\
&= \left[-\frac{1}{\sigma_0^2+\Delta(\tau)} - \left(-\frac{1}{\sigma_0^2+\Delta(\tau_0)}\right)\right]O_p(n^{-1/2}) + O_p(n^{-1}) \\
&= \left[\frac{1}{\sigma_0^2} - \frac{1}{\sigma_0^2+\Delta(\tau)}\right]O_p(n^{-1/2}) + O_p(n^{-1})
\end{align*}
In general, there is \(B > 0\) such that \(\frac{1}{b} - \frac{1}{b+x} \le Bx\) for \(b, x > 0\). Hence, \(\frac{1}{\sigma_0^2} - \frac{1}{\sigma_0^2+\Delta(\tau)} \le B\Delta(\tau) \le B'|\tau - \tau_0|\), where the last inequality holds for some \(B' > 0\) due to the shape of \(Q(\tau)\).
Finally,
\[
\frac{\left|\{A_n(\tau)-A_n(\tau_0)\} - \{A(\tau)-A(\tau_0)\}\right|}{|\tau-\tau_0|} \le B'O_p(n^{-1/2}) + \frac{1}{|\tau-\tau_0|}O_p(n^{-1}) \underbrace{\le}_{|\tau-\tau_0| > M/n} O_p(n^{-1/2}) + \frac{1}{M}O_p(1)
\]
The desired result is established by taking M large enough.
B Derivation of Posterior distributions
B.1 Improper uninformative prior
In this section, we derive the posterior in Section 4.2 under the improper uninformative prior (6). The
posterior is
\begin{align*}
p(\theta\mid D_n) &\propto \left(\frac{1}{\sigma}\right)^n \exp\left[-\frac{1}{2\sigma^2}\left\{\sum_{i=1}^n \left(y_i - \chi_{\tau,i}'\gamma\right)^2\right\}\right]\frac{1}{\sigma^2}\,\pi(\tau) \\
&\propto \left(\frac{1}{\sigma}\right)^n \exp\left[-\frac{1}{2\sigma^2}\left\{S_n(\tau) + (\gamma - \hat\gamma(\tau))'\chi_\tau\chi_\tau'(\gamma - \hat\gamma(\tau))\right\}\right]\frac{1}{\sigma^2}\,\pi(\tau) \tag{25}
\end{align*}
where Sn(τ) is the sum of squared residuals given τ . Integrating the right hand side with respect to γ,
we obtain
\[
\left(\frac{1}{\sigma^2}\right)^{\frac{n-(d_x+d_z)}{2}+1} \exp\left[-\frac{1}{2\sigma^2}S_n(\tau)\right]\left[\det\left(\chi_\tau\chi_\tau'\right)\right]^{-0.5}\pi(\tau)
\]
by the property of the multivariate normal density. Integrating the above with respect to \(\sigma^2\) over the positive part of the real line and using the change of variable \(\phi = 1/\sigma^2\), we get the marginal posterior for \(\tau\):
\[
\pi(\tau\mid D_n) \propto \left[\det\left(\chi_\tau\chi_\tau'\right)\right]^{-0.5}\left[S_n(\tau)\right]^{-\frac{n-(d_x+d_z)}{2}} \times \pi(\tau)
\]
To obtain the conditional posterior for \(\gamma\) given \(\tau\), we integrate (25) with respect to \(\sigma^2\). With the change of variable \(\phi = 1/\sigma^2\), we can show that
\[
\pi(\gamma\mid\tau,D_n) \propto \left[1 + \frac{1}{n-(d_x+d_z)}\,(\gamma - \hat\gamma(\tau))'\,\frac{\chi_\tau\chi_\tau'}{S_n(\tau)/(n-(d_x+d_z))}\,(\gamma - \hat\gamma(\tau))\right]^{-n/2}
\]
which means that
\[
\gamma\mid\tau,D_n \sim t_{(d_x+d_z)}\left(n-(d_x+d_z),\ \hat\gamma(\tau),\ \frac{S_n(\tau)}{n-(d_x+d_z)}\left(\chi_\tau\chi_\tau'\right)^{-1}\right)
\]
Finally, to obtain the conditional posterior for \(\sigma^2\) given \(\tau\), we integrate (25) with respect to \(\gamma\) and can show that
\[
\pi(\phi\mid\tau,D_n) \propto \phi^{\frac{n-(d_x+d_z)}{2}-1}\exp\left[-\frac{S_n(\tau)}{2}\phi\right]
\]
which means that \(\phi\mid\tau,D_n \sim \text{Gamma}\left((n-(d_x+d_z))/2,\ S_n(\tau)/2\right)\) or, equivalently,
\[
\sigma^2\mid\tau,D_n \sim \text{InvGamma}\left((n-(d_x+d_z))/2,\ S_n(\tau)/2\right)
\]
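To make these formulas concrete, here is a small numerical sketch (simulated data, not the paper's code) that evaluates \(\log\pi(\tau\mid D_n)\) on a grid under a flat prior on \(\tau\); both \(\det(\chi_\tau\chi_\tau')\) and \(S_n(\tau)\) come from one least-squares fit per candidate \(\tau\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data with an intercept shift at tau0 = 0.5 (hypothetical DGP)
n = 200
y = 0.5 + rng.standard_normal(n)
y[n // 2:] += 1.0
X = np.ones((n, 1))  # regressors whose coefficients do not shift

def log_post_tau(y, X, tau):
    # log pi(tau | Dn) up to a constant, flat prior on tau:
    # -0.5*log det(chi chi') - ((n - (dx+dz))/2) * log Sn(tau)
    n = len(y)
    ind = (np.arange(1, n + 1) > int(n * tau)).astype(float)
    Z2 = X * ind[:, None]            # post-break part of the regressors
    chi = np.hstack([X, Z2])
    gamma_hat, *_ = np.linalg.lstsq(chi, y, rcond=None)
    Sn = ((y - chi @ gamma_hat) ** 2).sum()
    _, logdet = np.linalg.slogdet(chi.T @ chi)
    return -0.5 * logdet - 0.5 * (n - chi.shape[1]) * np.log(Sn)

grid = np.linspace(0.1, 0.9, 161)
logp = np.array([log_post_tau(y, X, t) for t in grid])
post = np.exp(logp - logp.max())     # normalize in log space for stability
post /= post.sum()
tau_mean = float((grid * post).sum())
```

Working in logs and subtracting the maximum before exponentiating avoids the underflow that \(S_n(\tau)^{-(n-(d_x+d_z))/2}\) would cause for moderate \(n\).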
B.2 Conjugate prior
In this section, we derive the posterior in Section 4.3 under the conjugate prior (7). It can be shown that
the posterior is
\begin{align*}
p(\theta\mid D_n) &\propto p(D_n\mid\theta)\,\pi(\theta) \\
&\propto \left(\frac{1}{\sigma}\right)^n \exp\left[-\frac{1}{2\sigma^2}\left\{\sum_{i=1}^n \left(y_i - \chi_{\tau,i}'\gamma\right)^2\right\}\right]\pi(\tau) \times \left(\frac{1}{\sigma^2}\right)^{a+p/2+1}\exp\left[-\frac{1}{\sigma^2}\left\{b + \frac{1}{2}\left(\gamma - \mu\right)'H\left(\gamma - \mu\right)\right\}\right] \\
&\propto \left(\frac{1}{\sigma^2}\right)^{a+p/2+1}\exp\left[-\frac{1}{\sigma^2}\left\{b_\tau + \frac{1}{2}\left(\gamma - \mu_\tau\right)'H_\tau\left(\gamma - \mu_\tau\right)\right\}\right]\pi(\tau)
\end{align*}
where \(a\), \(b_\tau\), \(\mu_\tau\), and \(H_\tau\) are defined in Eq. (14). Note that the distribution of \(\gamma, \sigma^2\mid\tau,D_n\) is also a normal-inverse-gamma distribution. By the same approach we used under the improper prior, we can show
\[
\pi(\tau\mid D_n) \propto \left[\det\left(H_\tau\right)\right]^{-0.5} b_\tau^{-a} \times \pi(\tau)
\]
and
\[
\sigma^2\mid\tau,D_n \sim \text{InvGamma}\left(a,\ b_\tau\right)
\]
Finally, to derive the posterior of \(\gamma\) conditional on \(\tau\), we use the well-known property that the integral of a normal-inverse-gamma density with respect to \(\sigma^2\) is a \(t\)-density to conclude that
\[
\gamma\mid\tau,D_n \sim t_{(d_x+d_z)}\left(2a,\ \mu_\tau,\ (b_\tau/a)H_\tau^{-1}\right)
\]
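As a numerical sketch of these conjugate updates (generic normal-inverse-gamma algebra on simulated data; the exact definitions in the paper's Eq. (14) should be checked against this version), the posterior parameters given \(\tau\) can be computed in closed form:

```python
import numpy as np

def nig_posterior(y, chi, mu, H, a0, b0):
    # Standard conjugate NIG update given tau: H_tau = H + chi'chi,
    # mu_tau = H_tau^{-1}(H mu + chi'y), a = a0 + n/2,
    # b_tau = b0 + (y'y + mu'H mu - mu_tau' H_tau mu_tau)/2
    Ht = H + chi.T @ chi
    mut = np.linalg.solve(Ht, H @ mu + chi.T @ y)
    a = a0 + len(y) / 2
    bt = b0 + 0.5 * (y @ y + mu @ H @ mu - mut @ Ht @ mut)
    return mut, Ht, a, bt

# Check on simulated data: with a small prior precision H, the posterior
# mean mu_tau should be close to the least-squares fit gamma_hat(tau)
rng = np.random.default_rng(2)
n = 500
chi = np.column_stack([np.ones(n), rng.standard_normal(n)])
gamma0 = np.array([1.0, -2.0])
y = chi @ gamma0 + rng.standard_normal(n)
mut, Ht, a, bt = nig_posterior(y, chi, np.zeros(2), 0.01 * np.eye(2), 1.0, 1.0)
# gamma | tau, Dn ~ t(2a, mu_tau, (b_tau/a) H_tau^{-1}); sigma^2 posterior mean:
sigma2_mean = bt / (a - 1)
```

With a nearly flat prior, \(\mu_\tau\) recovers the OLS estimate and \(b_\tau/(a-1)\) is close to the true error variance, which is what drives the asymptotic equivalence used in the proof of Case 2.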
Acknowledgements
I am grateful to Andriy Norets, my dissertation advisor, for guidance and encouragement throughout the work on this project. I thank Eric Renault, Susanne Schennach, Jesse Shapiro, Kenneth Chay, Toru Kitagawa, Adam McCloskey, Dimitris Korobilis, and Florian Gunsilius, as well as participants in the Brown University Econometrics Seminar, for valuable comments and suggestions. This work was supported by the Economics department dissertation fellowship at Brown University. All remaining errors are mine.
References
I. Andrews, T. Kitagawa, and A. McCloskey. Inference after estimation of breaks. Technical report,
Centre for Microdata Methods and Practice, Institute for Fiscal Studies, 2019.
Y. Baek. Estimation of structural break point in linear regression models. Working paper, 2019.
J. Bai. Estimation of a change point in multiple regression models. Review of Economics and Statistics,
79(4):551–563, 1997.
D. Card, A. Mas, and J. Rothstein. Tipping and the dynamics of segregation. The Quarterly Journal of
Economics, 123(1):177–218, 2008.
K. S. Chan. Consistency and limiting distribution of the least squares estimator of a threshold autoregressive model. The Annals of Statistics, 21(1):520–533, 1993.
S. N. Durlauf and P. A. Johnson. Multiple regimes and cross-country growth behaviour. Journal of
Applied Econometrics, 10(4):365–384, 1995.
G. Elliott and U. K. Müller. Pre and post break parameter inference. Journal of Econometrics, 180(2):141–157, 2014.
S. Ghosal and T. Samanta. Asymptotic behaviour of Bayes estimates and posterior distributions in multiparameter nonregular cases. Mathematical Methods of Statistics, 4(4):361–388, 1995.
B. E. Hansen. Sample splitting and threshold estimation. Econometrica, 68(3):575–603, 2000.
B. E. Hansen. Threshold autoregression in economics. Statistics and its Interface, 4(2):123–127, 2011.
A. Khodadadi and M. Asgharian. Change-point problems and regression: an annotated bibliography. Collection of Biostatistics Research Archive, Preprint 44, 2008.
J. Liu, S. Wu, and J. V. Zidek. On segmented multivariate regression. Statistica Sinica, pages 497–525,
1997.
B. S. Paye and A. Timmermann. Instability of return prediction models. Journal of Empirical Finance,
13(3):274–315, 2006.