optimal forecast under structural breaks with application

59
Optimal Forecast under Structural Breaks with Application to Equity Premium Shahnaz Parsaeian * October 28, 2019 Abstract This paper develops an optimal combined estimator to forecast out-of-sample under structural breaks. When it comes to forecasting, using only the post-break observations after the most recent break point may not be optimal, especially when there are not enough observations in the post-break sample. In this paper we propose different estimation methods that exploit the pre- break information. In particular, we show how to combine the estimator using the full-sample (i.e., both the pre-break and post-break data) and the estimator using only the post-break sample. The full-sample estimator is inconsistent when there is a break while it is efficient when there is no break. The post-break estimator is consistent when there is a break while it is less efficient when there is no break. Hence, depending on the severity of the breaks, the full-sample estimator and the post-break estimator can be combined to balance the consistency and efficiency. We derive several Stein-like combined estimators of the full-sample estimator and the post-break estimator, to balance the bias-variance trade-off. The combination weight depends on the break severity, which we measure by the Hausman statistic. We derive the optimal combining weight by minimizing the mean square forecast error. We also introduce a semi-parametric estimator which facilitates the idea without having to compute the Hausman statistic in combining full-sample information and post-break sample information. We examine the properties of the proposed methods, analytically in theory, numerically by simulation, and also empirically in predicting the equity premium. Keywords : Forecasting, Structural breaks, Stein-like combined estimator, Predicting equity premium * Department of Economics, University of California, Riverside, CA 92521. E-mail: [email protected]

Upload: others

Post on 18-Dec-2021

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Optimal Forecast under Structural Breaks with Application

Optimal Forecast under Structural Breaks with Application to

Equity Premium

Shahnaz Parsaeian∗

October 28, 2019

Abstract

This paper develops an optimal combined estimator to forecast out-of-sample under structuralbreaks. When it comes to forecasting, using only the post-break observations after the mostrecent break point may not be optimal, especially when there are not enough observations in thepost-break sample. In this paper we propose different estimation methods that exploit the pre-break information. In particular, we show how to combine the estimator using the full-sample(i.e., both the pre-break and post-break data) and the estimator using only the post-breaksample. The full-sample estimator is inconsistent when there is a break while it is efficientwhen there is no break. The post-break estimator is consistent when there is a break while itis less efficient when there is no break. Hence, depending on the severity of the breaks, thefull-sample estimator and the post-break estimator can be combined to balance the consistencyand efficiency. We derive several Stein-like combined estimators of the full-sample estimatorand the post-break estimator, to balance the bias-variance trade-off. The combination weightdepends on the break severity, which we measure by the Hausman statistic. We derive theoptimal combining weight by minimizing the mean square forecast error. We also introduce asemi-parametric estimator which facilitates the idea without having to compute the Hausmanstatistic in combining full-sample information and post-break sample information. We examinethe properties of the proposed methods, analytically in theory, numerically by simulation, andalso empirically in predicting the equity premium.

Keywords: Forecasting, Structural breaks, Stein-like combined estimator, Predicting equitypremium

∗Department of Economics, University of California, Riverside, CA 92521. E-mail: [email protected]

Page 2: Optimal Forecast under Structural Breaks with Application

1 Introduction

In a regression framework, structural breaks can be considered as a regression equation with a

shift in some/all of the regression coefficients and/or error term. One of the biggest concerns of

economists is how to have an accurate forecast in the case of the structural break(s). Since the

seminal work by Bates and Granger (1969), forecast combination is an effective way to improve

forecasting performance. Especially under model instability, the performance of the forecast can

be boosted by forecast combinations method, see Diebold and Pauly (1987), Clements and Hendry

(1998, 1999, 2006), Stock and Watson (2004), Pesaran and Timmermann (2005, 2007), Timmermann

(2006), Pesaran and Pick (2011), Rossi (2013), and Pesaran et al. (2013) inter alia. Normally

in the forecast combinations, we need to closely pay attention to two points: first the selected

forecast models, and secondly the combination weights. The common method for forecasting under

structural break is to use a post-break estimator. But this estimator by itself cannot be accurate

in the case that the break point is close to the end of the sample, and thus there is not enough

observations. If the distance to break is short, then the post-break parameters are likely to be

poorly estimated relative to those obtained using pre-break data. Therefore, as pointed out by

Pesaran and Timmermann (2007), and Pesaran et al. (2013), it is not always optimal to base the

forecast only on post-break observations. Actually, using pre-break data biases the forecast but

also reduces the forecast error variance which is helpful to lower the MSFE.

A key question that comes to mind under the presence of structural break is how to include the

pre-break data to estimate the parameters of the forecasting model such that minimizes the loss

function like MSFE. In this paper we propose the combined estimator of the post-break estimator

and the full-sample estimator which uses all observations in the sample, t = 1, . . . , T. The

combination weight takes the form of James-Stein weight, see Stein (1956) and James and Stein

(1961). Massoumi (1978) introduced the Stein-like estimator for the simultaneous equation. Also,

Hansen (2016, 2017) has used this Stein-type weight in different context. See Saleh (2006) for a

comprehensive explanation for the Stien-type estimators.

The current paper does not focus on classical problem of identifying the break points. Instead,

the goal of this paper is to propose an estimator with a minimum MSFE under the assumption that

structural break has in fact occurred. For estimation of the break points, we follow Bai and Perron

1

Page 3: Optimal Forecast under Structural Breaks with Application

(1998, 2003) which is a consistent global minimizer method of the error term. Bai and Perron

(2003) propose an efficient dynamic programming algorithm that even with a large sample size, has

a small computing cost and is super fast. We also assume that the break points are bounded away

from the beginning or end of the sample which is crucial for consistency of the estimated break

points, Andrew (1993), Bai and Perron (1998, 2003).

This paper has two main contributions. First, we introduce the Stein-like combined estimator

of the full-sample estimator and the post-break estimator which is a trade-off between bias and

forecast error variance. The full-sample estimator (which uses the pre-break and post-break data) is

bias under structural break, but at the same time it decreases the forecast error variance because it

uses more observations. Post-break estimator (which is the standard solution under the structural

break model) is an unbiased estimator under structural breaks but less efficient. As discussed in

papers by Pesaran, and Timmermann (2007), Pesaran and Pick (2011), Pesaran et al. (2013), we

are able to improve the performance of forecast measured by MSFE by the trade-off between bias

and forecast error variance. Our proposed Stein-like combined estimator is the trade-off between

bias and variance of the two estimators. The main difference between our paper and Hansen (2017)

paper which uses the Stein-type weigh in different context is that, here we minimized the weighted

risk under the general positive semi-definite form of weight and did not restrict our theoretical

proof for a specific choice of weight. We show that if we set this weight to the inverse of the

difference between variances of estimators, then we get similar result as Hansen (2017). Besides,

we investigate the performance of our combined estimator with an alternative weight κ ∈ [0 1] and

compare the properties of this estimator with the Stein-like combined estimator.

The second contribution of this paper is that we introduce the semi-parametric method for the

combined estimator which facilitate implementation of our theoretical point that using pre-break

data can improve the forecasting performance. This method is a kernel weight of the observations.

Motivation for the semi-parametric method comes from the paper by Li, Ouyang, Racine (2013),

in which they focus on the categorical variables. Examples of the categorical variables are race,

sex, age group and educational level. Inspired by the categorical data, we introduce the semi-

parametric method for the time series analysis, and specifically apply that for the structural break

models. The point of using this method is that we can find the combination weights numerically

by Cross-Validation (CV) without being involved to the estimation errors of parameters. We prove

2

Page 4: Optimal Forecast under Structural Breaks with Application

theoretically that the weight which is derived by CV is optimal in the sense of optimality which

was introduced by K.-C Li (1987), the weight derives by CV is asymptotically equivalent to the

infeasible optimal weight.

We undertake an empirical analysis for the equity premium that compares the forecasting

performance of our proposed Stein-like combined estimator, and some of the alternative methods,

including post-break estimator, by using monthly, quarterly and annual data frequencies starting

from 1927 up to 2017. This analysis which allows for multiple breaks at unknown times confirms

the out-performance of using pre-break data in estimation of the forecasting model relative to

forecasting with only the post-break data. Besides, Welch and Goyal (2008) find that it is hard

to outperform a simple historical average with more complex models for predicting excess equity

returns. We also compare our results with historical average, and the results confirms the out

performance of our combined estimator over the simple historical average.

The outline of the paper is as follows. Section 2 sets up the model under the structural break and

introduces the Stein-like combined estimator and its risk. For simplicity, we discuss the problem

under a single break, but generalization of the method to the multiple breaks is straightforward.

Section 3 proposes combined estimator with weight κ ∈ [0 1] and compares the asymptotic risk of

this estimator with the Stein-like combined estimator. Section 4 introduces the semi-parametric

combined estimator. Section 5 discusses an alternative combined estimator proposed by Pesaran

et al. (2013). Section 6 extends the model with multiple structural break. Section 7 reports Monte

Carlo simulation and empirical example is in Section 8. Finally Section 9 concludes.

2 The structural break model

Consider the linear structural break model as yt = x′tβt + σtεt, in which the k × 1 vector of

coefficients, βt, and the error variance, σt, are subject to one break (m = 1) at time T1:

βt =

β(1) if 1 < t ≤ T1

β(2) if T1 < t < T(1)

and

σt =

σ(1) if 1 < t ≤ T1

σ(2) if T1 < t < T.(2)

3

Page 5: Optimal Forecast under Structural Breaks with Application

So we can rewrite the model as:

yt =

x′tβ(1) + σ(1)εt if 1 < t ≤ T1

x′tβ(2) + σ(2)εt if T1 < t < T,(3)

where xt is k × 1, εt ∼ i.i.d.(0, 1), t ∈ 1, . . . , T, and T1 is the break point with 1 < T1 < T . In

this set up we have only one break (two regimes). Assume that we know the break point.

2.1 Stein-like combined estimator

Our proposed combined estimator of β is

βα = αβFull + (1− α)β(2), (4)

where βα is the combined estimator which is k×1, βFull is the estimator under the assumption that

there is no break in the model, so we estimate the coefficient using all observations in the sample,

t ∈ 1, . . . , T. β(2) is for the case of having a break, and estimate the coefficient only by using the

post-break observations. We call this estimator, post-break estimator. Following Hansen (2017),

we define the combination weight as:

α =

τHT

if HT ≥ τ1 if HT < τ,

(5)

where τ controls the degree of shrinkage, HT is the Hausman statistics test which is equal to

HT = T (β(2) − βFull)′(V(2) − VF )−1(β(2) − βFull), (6)

where VF and V(2) are the variances of the full-sample estimator and the post-break estimator,

respectively. Originally this type of shrinkage weight introduced by James-Stein (1961). The

degree of shrinkage depends on the ratio of τHT

. When HT < τ then α = 1 and βα = βFull. When

HT > τ then βα is a weighted average of the full-sample estimator and the post-break estimator.

Note that it can be shown that cov(β(2), βFull) = VF , where VF < V(2) and it means that

the covariance between the estimators is equal to the variance of the efficient estimator. The

interesting idea behind the Hausman test is that the efficient estimator, βFull, must have zero

asymptotic covariance with β(2) − βFull under the null. If this condition does not hold, it means

that we can find another linear combination which would have smaller asymptotic variance than

βFull which is assumed to be asymptotically efficient. Since this condition holds in our combined

4

Page 6: Optimal Forecast under Structural Breaks with Application

estimator, we apply Hausman test. See Hausman (1978) for more discussion.

2.2 Asymptotic distribution

In this part, we develop the asymptotic distributions of the estimators under a local asymptotic

framework. Assume that the break size is local to zero, specifically, β(1) = β(2) + δ1√T

, in which

δ1 shows the break size between the coefficients. Actually, to obtain a meaningful asymptotic

distribution for the full-sample estimator, we consider such a local to zero assumption otherwise

the risk explodes.

2.2.1 Full-sample estimator: There is no break in the model

The full-sample estimator is constructed under the null hypothesis that there is no break in the

model, β(1) = β(2), so it uses all of the observations to estimate β. This assumption lines up with

the fact that for the small break sizes, usually ignoring the break and estimating the coefficient by

all of the observations results in better forecast (lower MSFE), see Boot and Pick (2017). Under

the alternative hypothesis the break size is β1 = β2 + δ1√T

. We denote this full-sample estimator as

βFull, and estimate the coefficient by the Generalized Least Square (GLS) as:

βFull =(X ′Ω−1X

)−1X ′Ω−1Y

=

(T∑t=1

xtx′t

1

σ2t

)−1 T∑t=1

xtyt1

σ2t

=

(T∑t=1

xtx′t

1

σ2t

)−1 T1∑t=1

xtx′t

β(1)

σ2(1)

+

T∑t=T1+1

xtx′t

β(2)

σ2(2)

+

T∑t=1

xtσtεtσ2t

=

(T∑t=1

xtx′t

1

σ2t

)−1 T1∑t=1

xtx′t

β(1)

σ2(1)

−T1∑t=1

xtx′t

β(2)

σ2(1)

+

T1∑t=1

xtx′t

β(2)

σ2(1)

+T∑

t=T1+1

xtx′t

β(2)

σ2(2)

+T∑t=1

xtσtεtσ2t

=

(T∑t=1

xtx′t

1

σ2t

)−1 [ T1∑t=1

xtx′t

β(1) − β(2)

σ2(1)

+T∑t=1

xtx′t

β(2)

σ2t

+T∑t=1

xtσtεtσ2t

],

(7)

where Ω = diag(σ2(1), . . . , σ

2(1), σ

2(2), . . . , σ

2(2)) is a T ×T matrix and X = (x1, x2, . . . , xT )′ = (X ′1 X

′2)′

is a T × k matrix of regressors. So X1 is T1 × k matrix of observations before the break point,

and X2 is (T − T1)× k matrix of observations after the break point. Assume that T − T1 ≥ k+ 1,

so we will have enough observations in the post-break sample. This is the assumption to have

5

Page 7: Optimal Forecast under Structural Breaks with Application

well defined estimator otherwise the number of unknown parameters would be larger than number

of observations. Also, assume thatX′iΩ

−1i Xi

Ti−Ti−1, where i = 0, . . . ,m + 1 and T0 = 0, Tm+1 = T ,

converges in probability to some nonrandom positive definite matrix not necessarily the same for

all i. Thus,(X′Ω−1X

T

)p−→ Q, and

(X′i Ω−1

i Xi∆Ti

)p−→ Qi, where ∆Ti = Ti − Ti−1. Throughout this

Section, m = 1 since we have only one break in the model. By simplification:

√T(βFull − β(2)

)=

(T∑t=1

xtx′t

1

Tσ2t

)−1 T1∑t=1

xtx′t

b1√T(β(1) − β(2)

)Tb1 σ2

(1)

+T∑t=1

xtσtεt√Tσ2

t

=

(X ′Ω−1X

T

)−1(X ′1Ω−1

1 X1

Tb1

)b1δ1 +

(X ′Ω−1X

T

)−1( T∑t=1

xtσtεt√Tσ2

t

)d−→ N

(Q−1Q1b1δ1, Q

−1),

(8)

where VF =(X′Ω−1X

T

)−1 p−→ Q−1,(X′1Ω−1

1 X1

T1

)p−→ Q1, b1 = T1

T shows the proportion of pre-break

observations and Ω1 = diag(σ2(1), . . . , σ

2(1)) is a T1 × T1 matrix. Note that in practice, we need to

estimate the value of the unknown parameters, σ2(1) and σ2

(2). For solving this problem, we can use

the two-step GLS estimator method. The two-step estimator is computed by first obtaining the

estimates of σ2(1) and σ2

(2) by using ordinary least square residuals for each regime, and then plug

it back into equation (8) to find the estimation of the coefficient.

Remark 1: The full-sample estimator is under the null hypothesis that there is no break in the

coefficient β(1) = β(2), but we allow break in variance, σ(1) 6= σ(2). Because we have variance

heteroskedasticity in the full-sample estimator, we use GLS which is more efficient than ordinary

least square (OLS).

2.2.2 Post-break estimator

As we introduced post-break estimator earlier, it is for the case that we only focus on the observations

after the most recent break point. The post-break estimator is:

β(2) =(X ′2 Ω−1

2 X2

)−1X ′2 Ω−1

2 Y2, (9)

where Ω2 = diag(σ2(2), . . . , σ

2(2)) is a (T − T1)× (T − T1) matrix. Basically, post-break estimator is

the simple OLS estimator, since Ω−12 will be cancel out in this equation. But we are following the

GLS format to be consistent with the full-sample estimator. By simplification:

6

Page 8: Optimal Forecast under Structural Breaks with Application

√T(β(2) − β(2)

)=

(X ′2 Ω−1

2 X2

T − Tb1

)−1(X ′2 Ω−1

2 σ(2)ε2√1− b1

√T − Tb1

)

d−→ N

(0,

1

1− b1Q−1

2

),

(10)

where V(2) = 11−b1

(X′2 Ω−1

2 X2

T−T1

)−1 p−→ 11−b1Q

−12 .

2.2.3 Stein estimator

For writing the distribution of the Stein-like combined estimator, at first we write the joint

asymptotic distribution of the full-sample estimator and the post-break estimator. Theorem 1

shows this joint distribution, distribution of the Hausman statistics and finally distribution of the

Stein-like combined estimator.

Theorem 1: The joint distribution of the full-sample estimator and the post-break estimator is:

√T

[βFull − β(2)

β(2) − β(2)

]d−→ V 1/2Z, (11)

where Z ∼ N(θ, I2k), θ = V −1/2

[Q−1Q1b1δ1

0

], V =

[VF VFVF V(2)

], VF = plim

T→∞

(X′Ω−1X

T

)−1= Q−1

and V(2) = 11−b1 plim

T→∞

(X′2Ω−1

2 X2

T−T1

)−1= 1

1−b1Q−12 .

Besides, the distribution of the Hausman statistic is:

HT = T (β(2) − βFull)′(V(2) − VF )−1(β(2) − βFull)

d−→ Z ′V 1/2 G (V(2) − VF )−1 G′ V 1/2Z

≡ Z ′MZ,

(12)

where G = (−I I)′ and M ≡ V 1/2 G (V(2) − VF )−1 G′ V 1/2 is an idempotent matrix with rank k.

Finally, the distribution of the Stein-like combined estimator is:

βα − β(2) =(β(2) − β(2)

)− α

(β(2) − βFull

)d−→ G′2 V

1/2Z −( τ

Z ′MZ

)1G′ V 1/2Z,

(13)

where G2 = (0 I)′ and (a)1 = min[1, a].

7

Page 9: Optimal Forecast under Structural Breaks with Application

Based on Theorem 1, the joint asymptotic distribution of the full-sample and post-break

estimators is normal. The Hausman statistics has asymptotic noncentral chi-square distribution,

which later we deal with this non-centrality when we want to calculate the asymptotic risk. Finally,

the asymptotic distribution of the Stein-like combined estimator is a function of the normal random

vector, Z, and a function of the non-centrality parameter which appears because of having the

Hausman statistics in the combination weight.

2.3 Asymptotic risk for the Stein-like combined estimator

In this section we want to find the asymptotic risk for the Stein-like combined estimator for the

case that we have only one break (m = 1). Since our focus is on forecasting, it seems reasonable to

consider β(2) as the true parameter vector in the definition of the risk.

Lemma 1: When an estimator has asymptotic distribution,√T (β − β)

d−→ $, then we can derive

risk for this estimator as ρ(β,W) = E($′W$). See Lehmann and Casella (1998).

Based on Lemma 1, we can write the asymptotic risk for our estimator. Note that if we set

up the risk with W = X ′Ω−1X, we will have the definition of MSFE. Having that, we can derive

the shrinkage weight, α. In this case, α is the optimal weight in which it minimizes the MSFE.

Then, by having the combination weight, α, we can derive the h step ahead out of sample forecast.

Throughout the calculation of the risk, we did not plug W = X ′Ω−1X, and solve for general W.

The risk is:

ρ(βα,W) = E[T (βα − β(2))

′W (βα − β(2))]

= T E[(β(2) − β(2)

)− α

(β(2) − βFull

)]′W[(β(2) − β(2)

)− α

(β(2) − βFull

)]= ρ(β(2),W) + τ2 E

[Z ′ V 1/2 GWG′ V 1/2 Z

(Z ′MZ)2

]− 2τ E

[Z ′ V 1/2 GWG′2 V

1/2 Z

Z ′MZ

]

= ρ(β(2),W) + τ2 tr[WG′ V 1/2 E

((Z ′MZ)−2ZZ ′

)V 1/2 G

]− 2τ tr

[WG′2 V

1/2 E(

(Z ′MZ)−1ZZ ′)V 1/2 G

].

(14)

The challenging part for calculation of this risk function is that we have to take expectation from

the noncentral chi-square distribution, with noncentrality parameter equal to θ′Mθ2 . For calculating

these expectations, we need to use the following Lemmas.

8

Page 10: Optimal Forecast under Structural Breaks with Application

Lemma 2: Let χ2p(µ) denote a noncentral chi-square random variable with noncentral parameter

µ and p degree of freedom. Besides, let p denote a positive integer such that p > 2r. Then we have:

E[(χ2p(µ)

)−r]= 2−re−µ

Γ(p2 − r)Γ(p2)

1F1

(p2− r; p

2;µ),

where 1F1(.; .; .) is the confluent hypergeometric function which is defined as 1F1(a; b;µ) =∑∞

n=0(a)n µn

(b)n n! ,

where (a)n = a(a+ 1) . . . (a+ n− 1), (a)0 = 1. See Ullah (1974) for the proof and details.

Lemma 3: The definition of the confluent hypergeometric function implies the following relations:

1. 1F1(a; b;µ) = 1F1(a+ 1; b;µ)− µb 1F1(a+ 1; b+ 1;µ),

2. 1F1(a; b;µ) = b−ab 1F1(a; b+ 1;µ) + a

b 1F1(a+ 1; b+ 1;µ), and

3. (b− a− 1) 1F1(a; b;µ) = (b− 1) 1F1(a; b− 1;µ)− a 1F1(a+ 1; b+ 1;µ)

Proof: See Lebedev (1972), pp. 262.

Lemma 4: Let the T×1 vector Z is distributed normally with mean vector θ and covariance matrix

IT , and M is any T × T idempotent matrix with rank r. Also assume φ(.) is a Borel measurable

function. Then:

E[φ(Z ′MZ)ZZ ′

]= E

[φ(χ2r+2(µ)

)]M + E

[φ(χ2r(µ)

)](IT −M)

+ E[φ(χ2r+4(µ)

)]Mθθ′M + E

[φ(χ2r(µ)

)](IT −M)θθ′(IT −M)

+ E[φ(χ2r+2(µ)

)](θθ′M +Mθθ′ − 2Mθθ′M

),

where µ ≡ θ′Mθ2 is the non-centrality parameter. Proof is based on Judge and Bock (1978).

Now by using Lemmas 2 - 4, we can simplify the risk in equation (14). Appendix A.1 has the

complete proof for this part. After simplifying the risk we have:

ρ(βα,W) = ρ(β(2),W)− 1

k − 2

(tr(W(V(2) − VF )

)− 2 θ′V 1/2GWG′V 1/2θ

θ′Mθ

)

− τ2

(θ′V 1/2GWG′V 1/2θ

θ′Mθ

)e−µ 1F1

(k2− 1;

k

2;µ)

− 1

k − 2

(τ2 + 4τ)

(θ′V 1/2GWG′V 1/2θ

θ′Mθ−

tr(W(V(2) − VF

))k

)e−µ 1F1

(k2− 1;

k

2+ 1;µ

).

(15)

9

Page 11: Optimal Forecast under Structural Breaks with Application

Thus the risk of combined estimator is lower than the risk of the post-break estimator, if the

terms inside the curly brackets be positive. We use Lemma 5 to derive the required conditions.

Lemma 5: Let C be an m×m symmetric matrix, with λmin(C) and λmax(C) denoting its minimum

and maximum eigenvalues. Then,

SupX

X ′CX

X ′X= λmax(C), and Inf

X

X ′CX

X ′X= λmin(C).

See Rao (1973) for the proof.

Based on Lemma 5, we can derive the necessary conditions for which the risk of the combined

estimator is lower than the post-break estimator. Given equation (15), the following three conditions

are necessary for the results:

1.

(tr(W(V(2) − VF

))− 2 θ′V 1/2GWG′V 1/2θ

θ′Mθ

)− τ2

(θ′V 1/2GWG′V 1/2θ

θ′Mθ

)≥ 0, (16)

which is satisfied if 0 ≤ τ ≤ 2

(tr(W(V(2)−VF )

) (θ′Mθ

)θ′ V 1/2 GW G′ V 1/2θ

− 2

).

2.

tr(W(V(2) − VF )

)− 2 θ′V 1/2GWG′V 1/2θ

θ′Mθ> 0, (17)

which is satisfied if k > 2.

3.

θ′V 1/2GWG′V 1/2θ

θ′Mθ−

tr(W(V(2) − VF )

)k

> 0, (18)

which is satisfied if k >tr(W(V(2)−VF )

)λmin

(W(V(2)−VF )

) .

See Appendix A.2 for details and proof.

Therefore, the necessary condition for the Stein-like combined estimator to have lower risk than

the post-break estimator is that the number of regressors to be k > max

(2,

tr(W(V(2)−VF )

)λmin

(W(V(2)−VF )

)).

Having that, we can find the optimal τ∗ in which it minimizes the ρ(βα,W). For 0 ≤ τ ≤

10

Page 12: Optimal Forecast under Structural Breaks with Application

2(

tr(W(V(2)−VF )

) (θ′Mθ

)θ′ V 1/2 GW G′ V 1/2θ

− 2)

, the optimal τ which depends on W is:

τ∗(W) =tr(W(V(2) − VF )

) (θ′Mθ

)θ′ V 1/2 GW G′ V 1/2θ

− 2. (19)

Remark 2: The difference between this paper and Hansen (2017) paper is that, here we calculate

the risk for the general form of W , and derive the optimal τ∗ which depends on W. For the special

case that W =(V(2) − VF

)−1, then τ∗ = k − 2 which is the well-known results of James and Stein

(1961), and τ∗ is positive when the number of regressors is larger than two, k > 2. This choice of

W is the special case of Hansen (2017) paper. Actually, the risk in our paper is for any positive

definite W, and we theoretically derive the risk for any given form of W.

By plugging back the optimal τ∗(W) into the risk function in equation (15), we can derive the

optimal risk. Theorem 2 summarizes the result for risk with general form for W > 0.

Theorem 2: If 0 ≤ τ ≤ 2(

tr(W(V(2)−VF )

) (θ′Mθ

)θ′ V 1/2 GW G′ V 1/2θ

− 2)and k > max

(2,

tr(W(V(2)−VF )

)λmin

(W(V(2)−VF )

)), then

ρ(βα,W) = ρ(β(2),W)−

[tr(W(V(2) − VF )

)(θ′Mθ

)− 2

(θ′V 1/2 GWG′ V 1/2θ

) ]2

(k − 2) (θ′Mθ) (θ′V 1/2 GWG′ V 1/2θ)

×

e−µ 1F1

(k2− 1;

k

2;µ)− 1

k − 2

[tr(W(V(2) − VF )

)]2(θ′Mθ

)2

(θ′V 1/2 GWG′ V 1/2θ

)2 − 4

×

(θ′V 1/2 GWG′ V 1/2θ

)θ′Mθ

e−µ 1F1

(k2− 1;

k

2+ 1;µ

),

(20)

where ρ(β(2),W) = tr(WV(2)), and µ = θ′Mθ2 is the non-centrality parameter.

Note that in Theorem 2, the terms inside the curly brackets are positive, so we could prove that

the Stein-like combined estimator has a smaller risk than the post-break estimator.

Corollary 2.1: For the special case that W = (V(2) − VF )−1, the risk of the Stein-like combined

estimator is equal to:

ρ(βα,W) = ρ(β(2),W)− (k − 2)

e−µ 1F1

(k2− 1;

k

2;µ)

, (21)

where the risk of the combined estimator is less than the post-break estimator if k > 2.

11

Page 13: Optimal Forecast under Structural Breaks with Application

Remark 3: As b1 increases, V(2) increases. Thus τ∗ increases. Besides, as b1 increases, V(2)

increases, and because HT inversely depends on V(2), so HT decreases which results in having

bigger τ∗

HT. Thus the full-sample estimator gets a higher weight when b1 is large. Remember that

βα = αβFull + (1− α)β(2).

Remark 4: Based on Theorem 2, the gain obtained by using combined estimator can be derived

by taking difference between ρ(β(2),W) and ρ(βα,W).

Figure 1 shows the relationship between breaksize that appears in θ in the above equation

and the difference between risks which we draw it by Matlab program. Based on the figure, as k

increases, the difference between risks increases in favor of the Stein-like combined estimator. Also,

when the variance of the pre-break data is lower than variance of the post-break data, q < 1, then

it is more gain to use pre-break observations. Besides, for the cases that the break point is near the

end of the sample, b = 0.9, the gain obtained from the Stein-like combined estimator is higher than

the case that the break point is near the beginning of the sample, b = 0.1. The reason is that when

there is not enough observations in the post-break sample, post-break estimator performs poorly

due to the lack of observations.

3 Combined estimator with weight κ ∈ [0 1]

In this section we want to answer to the possible question that why do we need to consider the Stein-

type weight as a combination weight? So, we consider our combined estimator with an alternative

combination weight κ ∈ [0 1]. Here we propose to combine the post-break and full-sample estimators

with a weight κ ∈ [0 1] as:

βκ = κβFull + (1− κ)β(2). (22)

By using Theorem 1 results, we can derive the asymptotic risk for this combined estimator. Here,

calculation of risk is simple, because we do not have random part in the combination weight. Thus

we do not need to deal with taking expectation from the inverse of the chi-square distribution.

Actually we do not need to use Lemmas 2 - 4. The asymptotic risk for this combined estimator

12

Page 14: Optimal Forecast under Structural Breaks with Application

can be written as:

ρ(βκ,W) = E[T (βκ − β(2))

′W (βκ − β(2))]

= T E[(β(2) − β(2)

)− κ(β(2) − βFull

)]′W[(β(2) − β(2)

)− κ(β(2) − βFull

)]= ρ(β(2),W) + κ2 E

[Z ′ V 1/2 GWG′ V 1/2 Z

]− 2κE

[Z ′ V 1/2 GWG′2 V

1/2 Z]

= ρ(β(2),W) + κ2 tr[V 1/2 GWG′ V 1/2 E(ZZ ′)

]− 2κ tr

[V 1/2 GWG′2 V

1/2 E(ZZ ′)]

= ρ(β(2),W) + κ2 tr[V 1/2 GWG′ V 1/2 (I2k + θθ′)

]− 2κ tr

[V 1/2 GWG′2 V

1/2 (I2k + θθ′)]

= ρ(β(2),W)− κ(

2 tr(W(V(2) − VF

))− κ

[tr(W(V(2) − VF

))+ θ′V 1/2GWG′V 1/2θ

] ).

(23)

If 0 ≤ κ ≤2 tr

(W(V(2)−VF

))tr

(W(V(2)−VF

))+θ′V 1/2GWG′V 1/2θ

, the optimal κ∗ is:

κ∗(W) =tr(W(V(2) − VF

))tr(W(V(2) − VF

))+ θ′V 1/2GWG′V 1/2θ

, (24)

which is derived by minimizing the risk function. Finally by plugging the optimal κ∗ into the risk

function, we have the following theorem.

Theorem 3: The asymptotic risk for the combined estimator with weight κ after plugging the

optimal κ∗(W) into the risk function is:

ρ(βκ,W) = ρ(β(2),W)−

[tr(W(V(2) − VF )

)]2

tr(W(V(2) − VF )

)+ θ′V 1/2GWG′V 1/2θ

. (25)

Theorem 3 shows that under the stated condition for κ, the risk of the combined estimator is

less than the post-break estimator.

Remark 5: For the special case that W = (V(2) − VF )−1, κ∗ = kk+θ′Mθ .

Remark 6: We can compare the combination weights α and κ for the special case that W = (V(2)−

VF )−1. Remember that α = τHT

, where HT = T (β(2) − βFull)′(V(2) − VF )−1(β(2) − βFull) = Z ′MZ.

Also, τ∗ = k − 2 when W = (V(2) − VF )−1. Thus, α = k−2HT

. Besides, κ = kk+HT

under this

13

Page 15: Optimal Forecast under Structural Breaks with Application

specification for W. So:

α− κ =(k − 1)2 − 2HT − 1

HT (k +HT ),

which means that for large k and small break size, α is larger than κ. Since α is the weight for

the full-sample estimator, the intuition is that for the small break size and large k, the Stein-like

combined estimator gives a larger weight to the full-sample estimator and it gives the lower weight

to the post-break estimator. Thus, we expect that the Stein-type weight to have a lower risk than

the κ type weight. Because when the break size is small, the full-sample estimator performs better

than the post-break estimator, as sometimes ignoring the very small break sizes is better than

considering them. For the large break size (when HT is large), these weights are almost equal to

each other, α ≈ κ.

4 Semi-parametric shrinkage estimator for forecast averaging

Li, Ouyang, Racine (2013) introduce the semi-parametric method for estimation of the categorical

data. Inspired by their approach, we want to propose an estimator for our time series structural

break models. Using the same model as equation (3), define Y1 = Ω−1/21 Y1, Y2 = Ω

−1/22 Y2, X1 =

Ω−1/21 X1 and X2 = Ω

−1/22 X2. Based on this semi-parametric method, we can estimate the post-

break coefficients by GLS and using the following discrete kernel estimator:

βγ = argminβ(2)

T∑t=1

(yt − x′tβ(2)

)2L(t, γ), (26)

where

L(t, γ) = γ 1(t ≤ T1) + 1(t > T1) (27)

and t goes over all of the observations. βγ is the estimator for the post-break observations. The idea

is that for calculating the post-break estimator, it is worth to use information from the first regime

as well. This kernel-weight estimator gives the constant weight 1 to the post-break observations

and down-weight pre-break observations by weight γ ∈ H = [0 1]. In other words, the post-break

coefficient is estimated by weighting observations over the whole sample. For the case that γ = 0,

the estimator is only using the post-break observations, and when γ = 1, all of the observations

are weighted equally. For other cases that γ ∈ (0 1), we have the combination of pre-break and

14

Page 16: Optimal Forecast under Structural Breaks with Application

post-break observations. The first order condition of equation (26) is

T∑t=1

L(t, γ) xt

(yt − x′tβ(2)

)= 0, (28)

leading to

βγ =

(T∑t=1

L(t, γ) xtx′t

)−1 T∑t=1

L(t, γ) xtyt

=(γX ′1X1 + X ′2X2

)−1 (γX ′1Y1 + X ′2Y2

)=

σ2(1)

X ′1X1 +1

σ2(2)

X ′2X2

)−1(γ

σ2(1)

X ′1Y1 +1

σ2(2)

X ′2Y2

)

=

q2X ′1X1 +X ′2X2

)−1(γ

q2X ′1X1 β(1) +X ′2X2 β(2)

)= ∆ β(1) + (I −∆) β(2),

(29)

where ∆ =(γq2X ′1X1+X ′2X2

)−1 (γq2X ′1X1

), q =

σ(1)σ(2)

, β(1) = (X ′1X1)−1X ′1Y1 and β(2) = (X ′2X2)−1X ′2Y2.

We can find γ by minimizing the following Cross Validation (CV) criterion function:

CV (γ) =1

T − T1

T∑τ=T1+1

(yτ − x′τ β(−τ)

γ

)2, (30)

where β(−τ)γ denotes the leave-one-out Cross Validation (CV) estimate of β in which τ goes over the

post-break observations, τ = T1 + 1, . . . , T and at each time deletes τ ′th row of the observation.

So:

β(−τ)γ =

T∑t6=τ

L(t, γ)xtx′t

−1T∑t6=τ

L(t, γ)xtyt, for t = T1 + 1, . . . , T. (31)

Note that CV for finding γ is based on post-break observations only. The reason is that from

t = 1, . . . T1 the coefficient is β(1), and we are trying to find the estimator for the post-break

observations (t > T1) in order to use it for forecasting. Once we found γ, we can calculate βγ and

MSFE accordingly. The good point about this semi-parametric estimator, βγ , is that γ does not

depend on the estimation of the break size or the ratio of break in the error term. Actually the

estimation error in this estimator is lower than the previous introduced combined estimator. Since

this method finds the weight by CV, it can fit to data better than other methods which involve the

estimation of some other parameters as well. We just need to prove that the γ weight that we derive

by CV is optimal. Define L(γ) =[(βγ − β(2)

)′X ′2 Ω−1

2 X2

(βγ − β(2)

)]=(µ(γ) − µ

)′(µ(γ) − µ

)

15

Page 17: Optimal Forecast under Structural Breaks with Application

to be the loss function, where µ = Ω−1/22 X2 β(2) and µ(γ) = Ω

−1/22 X2 βγ , and the expected loss is

R(γ) = E[L(γ)

]. Theorem 4 is about optimality of the γ weight.

Theorem 4: As T →∞, then

L(γ)

infγ∈H

L(γ)

p−→ 1, (32)

R(γ)

infγ∈H

R(γ)

p−→ 1. (33)

Theorem 4 shows that the optimal weight obtained by CV is asymptotically equivalent to the

infeasible optimal weight. For proving this Theorem, at first we need to derive L(γ) and R(γ). See

Zhang et al. (2013) and Ullah et al. (2017).

4.1 Deriving MSFE and the optimality for the cross-validation weight

This part, shows the step by step procedure that is needed for the proof of optimality. Rewrite

equation (29) as:

βγ − β(2) =( γq2

X ′1X1 +X ′2X2

)−1( γq2

X ′1X1λ+γ

q2X ′1σ(1)ε1 +X ′2σ(2)ε2

), (34)

where λ = β(1) − β(2), ε1 = (ε1, . . . , εT1)′, ε2 = (εT1+1, . . . , εT )′ and we define γq2≡ γ

q2for ease of

notation. Based on the loss function:

R(γ) = E[(βγ − β(2)

)′W(βγ − β(2)

)]= E

[( γq2

λ′X ′1X1 +γ

q2σ(1)ε

′1X1 + σ(2)ε

′2X2

)( γq2

X ′1X1 +X ′2X2

)−1W( γ

q2X ′1X1 +X ′2X2

)−1( γq2

X ′1X1λ+γ

q2X ′1σ(1)ε1 +X ′2σ(2)ε2

)]=( γq2

)2λ′X ′1X1

( γq2

X ′1X1 +X ′2X2

)−1W( γq2

X ′1X1 +X ′2X2

)−1X ′1X1λ

+( γq2

)2E[σ2

(1) ε′1X1

( γq2

X ′1X1 +X ′2X2

)−1W( γq2

X ′1X1 +X ′2X2

)−1X ′1ε1

]+ E

[σ2

(2) ε′2X2

( γq2

X ′1X1 +X ′2X2

)−1W( γq2

X ′1X1 +X ′2X2

)−1X ′2ε2

]= λ′A1λ+ γ2σ2

(2) tr(A2) + σ2(2) tr(A3),

(35)

16

Page 18: Optimal Forecast under Structural Breaks with Application

where A1 =(γq2

)2X ′1X1

(γq2X ′1X1 +X ′2X2

)−1W(γq2X ′1X1 +X ′2X2

)−1X ′1X1,

A2 = X ′1X1

(γq2X ′1X1 +X ′2X2

)−1W(γq2X ′1X1 +X ′2X2

)−1and

A3 = X ′2X2

(γq2X ′1X1 +X ′2X2

)−1W(γq2X ′1X1 +X ′2X2

)−1. Since λ is the break size which is not

known, we can plug the unbiased estimator for this term instead. The unbiased estimator for λ′A1λ

is equal to:

λ′A1λ− σ2(1) tr

(A1(X ′1X1)−1

)− σ2

(2) tr

(A1(X ′2X2)−1

)(36)

and by plugging it back into equation (35) we have:

R(γ) = λ′A1λ+ σ2(1) tr

(( γq2

)2A2 −A1(X ′1X1)−1

)+ σ2

(2) tr

(A3 −A1(X ′2X2)−1

)= λ′A1λ+ 0 + 2σ2

(2) tr

(W( γq2

X ′1X1 +X ′2X2

)−1)− σ2

(2) tr

((X ′2X2

)−1W

),

(37)

where(γq2

)2A2−A1(X ′1X1)−1 = 0. Also, A3−A1(X ′2X2)−1 = 2σ2

(2) tr

(W(γq2X ′1X1+X ′2X2

)−1)−

σ2(2) tr

((X ′2X2

)−1W

), and σ2

(2) =(Y2−X2β(2))

′(Y2−X2β(2))

T−T1−k is an unbiased estimator for the variance

of the post-break observations. Besides, we can rewrite λ′A1λ as:

λ′A1λ = β′(2) W β(2) − 2 β′(2) W βγ + β′γ W βγ

= β′(2) X′2 Ω−1

2 X2 β(2) − 2 Y ′2 Ω−1/22 Ω

−1/22 X2 βγ +

(µ(γ)− Y2 + Y2

)′(µ(γ)− Y2 + Y2

)= β′(2) X

′2 Ω−1

2 X2 β(2) − 2 Y ′2 µ(γ) +(µ(γ)− Y2

)′(µ(γ)− Y2

)+ Y ′2 Y2 + 2

(µ(γ)− Y2

)′Y2

= β′(2) X′2 Ω−1

2 X2 β(2) +(µ(γ)− Y2

)′(µ(γ)− Y2

)− Y ′2 Y2,

(38)

where we plug W = X ′2 Ω−12 X2 to derive the MSFE.

Thus, based on equation (37), the MSFE for this semi-parametric estimator up to the relevant

term to γ can be written as:

R(γ) =(µ(γ)− Y2

)′(µ(γ)− Y2

)+ 2 σ2

(2) tr

(X ′2 Ω

−1/22 X2

( γq2

X ′1X1 +X ′2X2

)−1). (39)

17

Page 19: Optimal Forecast under Structural Breaks with Application

Define P (γ) ≡ X2

(γq2X ′1X1 +X ′2X2

)−1X ′2. Thus we can rewrite µ(γ) as:

µ(γ) = Ω−1/22 X2 βγ = Ω

−1/22 X2

( γq2

X ′1X1 +X ′2X2

)−1( γq2

X ′1Y1 +X ′2Y2

)= Ω

−1/22 X2

( γq2

X ′1X1 +X ′2X2

)−1( γq2

X ′1Y1

)+ Ω

−1/22 X2

( γq2

X ′1X1 +X ′2X2

)−1X ′2Y2

= Ω−1/22

(A+P (γ)Y2

),

(40)

where A ≡ X2

(γq2X ′1X1 + X ′2X2

)−1(γq2X ′1Y1

). Based on this definition, we can rewrite the risk

in equation (39) as:

R(γ) =(µ(γ)− Y2

)′(µ(γ)− Y2

)+ 2 σ2

(2) tr

(X ′2 Ω

−1/22 X2

( γq2

X ′1X1 +X ′2X2

)−1)

=(µ(γ)− Ω

−1/22 Y2

)′(µ(γ)− Ω

−1/22 Y2

)+ 2 tr

(P (γ)

)=(µ(γ)− µ− Ω

−1/22 σ(2)ε2

)′(µ(γ)− µ− Ω

−1/22 σ(2)ε2

)+ 2 tr

(P (γ)

)=(µ(γ)− µ− ε2

)′(µ(γ)− µ− ε2

)+ 2 tr

(P (γ)

)= L(γ) + ε′2ε2 − 2 ε′2

(µ(γ)− µ

)+ 2 tr

(P (γ)

)= L(γ) + ε′2ε2 − 2 ε′2 µ(γ) + 2ε′2µ+ 2 tr

(P (γ)

)= L(γ) + ε′2ε2 − 2 ε′2 Ω

−1/22

(A+P (γ)Y2

)+ 2 ε′2µ+ 2 tr

(P (γ)

)= L(γ) + ε′2ε2 − 2 ε′2 Ω

−1/22 A−2 ε′2 Ω

−1/22 P (γ) Ω1/2µ− 2 ε′2Ω

−1/22 P (γ) σ(2)ε2

+ 2 ε′2 µ+ 2 tr(P (γ)

)= L(γ) + ε′2ε2 − 2 ε′2 Ω

−1/22 A−2 µ′ P (γ) ε2 − 2 ε′2P (γ) ε2 + 2 ε′2 µ+ 2 tr

(P (γ)

)

(41)

We want to show that the estimate of the weight, γ, that we get from CV is optimal in the sense of

optimality in Theorem 4 which was introduced by K.-C Li (1987). The concept of this optimality

is that the average squared error of the CV is asymptotically as small as the average squared error

of the infeasible best possible estimator. Also notice that:

R(γ) = E[L(γ)

]= E

[(µ(γ)− µ

)′(µ(γ)− µ

)]= L(γ)− ε′2 P 2(γ) ε2 − 4 A′Ω−1/2

2 P (γ) ε2 − 2(P (γ)µ− µ

)′(Ω−1/22 A+P (γ) ε2

)+ γ tr

(P (γ)

)+(

1− γ)

tr(P 2(γ)

)− C,

(42)

18

Page 20: Optimal Forecast under Structural Breaks with Application

where C = γ2 σ2(2)ε′1X1

(γq2X ′1X1 +X ′2X2

)−1X ′2Ω−1

2 X2

(γq2X ′1X1 +X ′2X2

)−1X ′1ε1. See Appendix

A.3 for details. Note that we can rewrite optimality condition as:

R(γ)

R(γ)=R(γ)−R(γ) +R(γ)

R(γ)= 1 +

R(γ)−R(γ)

R(γ). (43)

Thus, to prove the optimality condition in Theorem 4, it suffices to prove that the terms in R(γ)−R(γ)R(γ)

are negligible. In other words, we need to prove that the following conditions hold.

Supγ∈H

|ε′2 Ω−1/22 A |R(γ)

= op(1), (44)

Supγ∈H

|µ′P (γ)ε2|R(γ)

= op(1), (45)

Supγ∈H

|ε′2P (γ)ε2|R(γ)

= op(1), (46)

Supγ∈H

|tr(P (γ))|R(γ)

= op(1), (47)

Supγ∈H

|C|R(γ)

= op(1), (48)

Supγ∈H

|ε′2P 2(γ)ε2|R(γ)

= op(1), (49)

Supγ∈H

|A′Ω−1/22 P (γ)ε2|R(γ)

= op(1), (50)

Supγ∈H

|(P (γ)µ− µ)′ Ω−1/22 A |

R(γ)= op(1), (51)

Supγ∈H

|(P (γ)µ− µ)′P (γ)ε2|R(γ)

= op(1), (52)

Supγ∈H

|tr(P (γ))|R(γ)

= op(1), (53)

19

Page 21: Optimal Forecast under Structural Breaks with Application

Supγ∈H

|tr(P 2(γ))|R(γ)

= op(1). (54)

Based on the above conditions, the proof of Theorem 4 thus follows. Therefore, the weight derives

by CV is optimal. See Appendix A.4 for proof and details for each of the above conditions.

5 Alternative combined estimator proposed by Pesaran et al. (2013)

In a particularly insightful paper, Pesaran et al (2013), hereafter PPP, propose that we can decrease

the MSFE under the structural break by using the whole observations in the sample instead of only

post-break observations. Their proposed estimator is :

βPPP = (X ′WX)−1(X ′WY ). (55)

They derive the optimal weight, W , such that MSFE of the one-step-ahead forecast is minimized,

and found that the optimal weight takes one value for observations before the break point and one

value for the post-break observations, W = diag(w(1), . . . , w(1), w(2), . . . , w(2)

), where w(1) is a fixed

weight for the pre-break observations and w(2) is a fixed weight for the post-break observations and

these are defined as: w(1) = 1

T1

b1+(1−b1)(q2+Tb1φ2),

w(2) = 1T

q2+Tb1φ2

b1+(1−b1)(q2+Tb1φ2),

(56)

where b1 = T1T is the proportion of observation before the break, q =

σ(1)σ(2)

is the ratio of break in the

error term, φ =x′T+1λ

σ2(2)

(x′T+1Q−1xT+1)1/2

, Q = E(xtx′t) and λ = β(1)−β(2) is the break size. By knowing

this form of W , we can rewrite their estimator as:

βPPP =(w(1) X

′1X1 + w(2) X

′2X2

)−1(w(1) X

′1X1β1 + w(2) X

′2X2β(2)

)= Λβ(1) + (I − Λ)β(2),

(57)

which is the combined estimator of the pre-break and post-break estimators with combination

weight Λ =(w(1) X

′1X1 + w(2) X

′2X2

)−1(w(1) X

′1X1

). See Pesaran et al. (2013) for details.

Remark 7: By looking carefully at this estimator and comparing this with the semi-parametric

estimator introduced in Section 4, we can see that these two estimators are very close to each other.

Actually if we write the normalized weights for the semi-parametric estimator βγ , we can see that

20

Page 22: Optimal Forecast under Structural Breaks with Application

this method gives one value as a weight to pre-break observations, and one value for post-break

observations. Thus:w(1),γ =

q2

)T1(γ

q2

)+T−T1

= 1T

1

b1+(1−b1)(q2

γ

) for 1 < t ≤ T1,

w(2),γ = 1

T1(γ

q2

)+T−T1

= 1T

(q2

γ

)b1+(1−b1)

(q2

γ

) for T1 < t < T.

(58)

By comparing these weights with those in equation (56), we can see that PPP’s estimator is the

especial case of this semi-parametric estimator under the condition that:

γ =q2

q2 + T1φ2. (59)

6 Extension to multiple breaks

So far, we have talked about the case of having a single break in the model, but in practice the time

series model may be subject to multiple breaks. The case of multiple breaks is the straightforward

extension of the aforementioned sections for the Stein-like combined estimator. The combined

estimator always can be defined as the combination of βFull and the estimator after the most

recent break point. Here we write the model with two breaks in details, and then we extend that

for the case of having more than two breaks. Suppose we have two breaks (three regime) in the

model. The break dates are T1, T2, and we have breaks in both the coefficients and the error

terms, such that

yt =

x′tβ(1) + σ(1)εt if 1 < t ≤ T1

x′tβ(2) + σ(2)εt if T1 < t ≤ T2

x′tβ(3) + σ(3)εt if T2 < t < T.

(60)

Thus the combined estimator in the case of having two breaks is:

βα = αβFull + (1− α)β(3), (61)

where α is defined as in equation (5), in which HT = T (β(3)− βFull)′(V(3)− VF )−1(β(3)− βFull) and

V(3) is the variance of the post-break estimator. It is clear that βα− β(3) =(β(3)− β(3)

)−α

(β(3)−

βFull). So, ρ(βα,W) = E

[T (βα − β(3))

′W (βα − β(3))]. In order to derive the risk, at first we need

to find the distribution of(β(3) − β(3)

)and

(β(3) − βFull

). Let b1 = T1

T , b2 = T2T , and b3 = T3

T where

21

Page 23: Optimal Forecast under Structural Breaks with Application

b1 < b2 < b3.

Suppose under the local alternative, the break sizes are: (β(1) − β(2), β(2) − β(3)) = ( δ1√T, δ2√

T).

So, the full-sample estimator is:

√T(βFull − β(3)

)=( T∑t=1

xtx′t

1

Tσ2t

)−1[

T1∑t=1

xtx′t

√T(β(1) − β(3)

)Tσ2

(1)

+

T2∑t=T1+1

xtx′t

√T(β(2) − β(3)

)Tσ2

(2)

+T∑t=1

xtσtεt√Tσ2

t

]

=(X ′Ω−1X

T

)−1(X ′1Ω−11 X1

Tb1

)b1(δ1 + δ2) +

(X ′Ω−1X

T

)−1(X ′2Ω−12 X2

T (b2 − b1)

)(b2 − b1)δ2

+(X ′Ω−1X

T

)−1T∑t=1

xtσtεt√Tσ2

t

d−→ N(Q−1Q1b1(δ1 + δ2) +Q−1Q2(b2 − b1)δ2, VF

).

(62)

Remember that X = (X ′1 X′2 X

′3)′ is T ×k, and Ω = diag

(σ2

(1), . . . , σ2(1), σ

2(2), . . . , σ

2(2), σ

2(3), . . . , σ

2(3)

)is a T × T matrix.

When we have two breaks in the model, the post-break estimator is:

√T(β(3) − β(3)

)=(X ′3 Ω−1

3 X3

T − T2

)−1( X ′3 Ω−13 σ(3)ε3√

1− b2√T − T2

)d−→ N

(0, V(3)

),

(63)

where V(3) = 11−b2 plim

T→∞

(X′3Ω−1

3 X3

T−T2

)−1= 1

1−b2Q−13 . Like what we did in Theorem 1, we can write

the joint asymptotic distribution of the estimators as:

√T

[βFull − β(3)

β(3) − β(3)

]d−→ V 1/2Z, (64)

where Z ∼ N(θ, I2k), θ = V −1/2

[Q−1Q1b1(δ1 + δ2) +Q−1Q2(b2 − b1)δ2

0

]and V =

[VF VFVF V(3)

].

Actually the main difference between multiple breaks in compare to the single break is the bias

term, θ, but still it has the same dimension as the single case. We can extend it to the case of

22

Page 24: Optimal Forecast under Structural Breaks with Application

having m breaks which breaks happen at times T1, T2, . . . , Tm:

yt =

x′tβ(1) + σ(1)εt if 1 < t ≤ T1,

x′tβ(2) + σ(2)εt if T1 < t ≤ T2,

...

x′tβ(m+1) + σ(m+1)εt if Tm < t < T.

(65)

In the case of having m breaks in model, the matrix θ which is 2k × 1 can be written as:

θ = V −1/2

[Q−1

(Q1b1(δ1 + · · ·+ δm) + · · ·+Qm(bm −Bm−1)δm

)0

]. (66)

Regarding the extension of the semi-parametric estimator introduced in Section 4, if we have

m breaks in the model, we need to estimate m unknown weight parameter by CV, subject to the

condition that weights are in the range of [0 1]. Once we found these weights, we are able to

estimate the coefficient, βγ .

7 Monte Carlo evidence on forecasting performance

This section provides some Monto Carlo results based on the Stein-like combined estimator, combined

estimator with weight κ ∈ [0 1], semi-parametric estimator with weight γ, PPP’s estimator, and

the post-break estimator. The goal is to compare the MSFEs for these estimators. To do this, Let

t = 1, ..., T with T = 100, 200, q1 =σ(1)σ(2)∈ 0.5, 1, 2 and k = 3, 5, 8. We try different values

for T1 which are proportional to the pre-break sample observations, b1 = T1T ∈ 0.2, 0.4, 0.6, 0.8.

Suppose xt ∼ N(0, 1) and εt ∼ i.i.d. N(0, 1). The data generating process for the single break case

is:

yt =

x′tβ(1) + q1εt if 1 < t ≤ T1

x′tβ(2) + εt if T1 < t ≤ T. (67)

Let β(2) be a vector of ones, and β(1) = β(2) + δ1√T

under local alternative. Note that the magnitude

of distance between parameters is determined by the localizing parameter δ1 and sample size T .

Assume that λ ≡ δ1√T∈ 0, 0.5, 1 represents different break sizes. The number of replications is

1000.

We report the results based on the relative MSFE with respect to the post-break estimator

23

Page 25: Optimal Forecast under Structural Breaks with Application

which is the benchmark estimator, i.e.,

RMSFEi =MSFE(yi)

MSFE(y2), (68)

where MSFE(yi) is MSFE(yα), MSFE(yκ), MSFE(yγ) and MSFE(yPPP ). Tables 1 through

6 show the results of the Monte Carlo for the single break case.

As we may have more than one break in practice, we consider multiple breaks in our simulation

design. Accordingly, we extend our simulation experiments to allow for two breaks that happen at

T1 = T3 , and T2 = 2T

3 respectively. The data generating process for the two breaks model follows:

yt =

x′tβ(1) + q1εt if 1 < t ≤ T1

x′tβ(2) + q2εt if T1 < t ≤ T2

x′tβ(3) + εt if T2 < t ≤ T,(69)

where q2 =σ(2)σ(3)

. In order to consider different possibilities that may happen under multiple breaks,

we do the Monte Carlo under different experiments. Table 7 shows break points specifications with

the assigned experiment numbers. The first experiment is under no structural breaks. We allow

for both moderate break (0.5) and large break (1) in coefficients in either direction (experiment

numbers 2 to 5). Also, we consider the cases that the direction of break in coefficients change from

decreasing to increasing or vice versa (experiment numbers 6 to 9). Experiment 10 represents the

case that we have partial change in the coefficients, i.e β(1) = β(2) 6= β(3). Experiment 11 also shows

the partial break case which β(1) 6= β(2) = β(3). Finally, experiments 12 and 13 represent the higher

and lower post-break volatility respectively.

7.1 Simulation results

Tables 1 to 6 show the results of Monte Carlo for the single break case for different sample size and

different q. q = 0.5 means the variance of the pre-break data is less than the post-break data. In

such case, considering the pre-break data in forecasting will improve the forecast accuracy. q = 1

happens when there is no break in the variance. Finally q = 2 means that the pre-break data is

more volatile than the post-break data, so one cannot gain a lot by considering the pre-break data.

Based on the results in Tables, the performance of the Stein-like combined estimator is better than

(in some cases equivalent) the post-break estimator in the sense of having lower MSFE. As we

increase the number of regressors, the MSFE of the Stein-like combined estimator decreases more.

24

Page 26: Optimal Forecast under Structural Breaks with Application

When the break size is large (λ = 1), all estimators perform almost the same as the post-break

estimator. As we expected, when we increase the number of regressors, semi-parametric estimator

performs well especially for the small and moderate break sizes (λ ≤ 0.5) and this is because of

having a lower estimation error. PPP estimator has smaller (almost equal) MSFE than post-break

estimator when k = 3 and break size is of moderate size (large size), but as we increase the k, for the

large break sizes it has under-performance relative to the post-break estimator. We have the similar

pattern for the results when q = 1 and q = 2, but MSFE of estimators are lower with q = 0.5.

The similar results hold as we increase the sample size from 100 to 200, but when the number of

observations increase, the MSFE of estimators are closer to each other relative to T = 100.

Tables 8 and 9 show the results for the two breaks case under the specified set up in Table 7.

Based on the results Stein-like combined estimator has a lower MSFE than post-break estimator.

Also the combined estimator with weight γ performs very well under this set up and even has a

lower MSFE than Stein estimator.

8 Empirical analysis

This section presents an empirical illustration of our method. We consider the application of the

forecasting procedure for equity premium. The data are from Welch and Goyal (2008) which we

examine part of it (1927-2018) where data for all variables are available based on relevant frequency.

We consider all data frequencies (monthly, quarterly and yearly) to assess the performance of our

model, and report the results based on quarterly data in this section. The results with monthly

data and annual data can be found in Appendix B.1 and B.2, respectively. We refer to Welch and

Goyal (2008) for detailed description of the data and sources.

The equity premium (or market premium) is the return on the stock market minus the return

on a short-term risk-free treasury bill. Based on the economic variables used to predict the equity

premium, we consider all 14 variables from Welch and Goyal (2008) which data are available from

1927:Q1-2018:Q4, giving a time-series dimension T = 368. The 14 variables are: D/P (dividend

price ratio), D/Y (dividend yield), E/P (earning price ratio), D/E (dividend payout ratio), SV AR

(stock variance), B/M(book-to-market ratio), NTIS(net equity expansion), TBL(treasury bill

rate), LTY (long term yield), LTR(long term return), TMS(term spread), DFY (default yield

spread), DFR(default return spread), INFL (inflation).

25

Page 27: Optimal Forecast under Structural Breaks with Application

We recursively compute one-step-ahead forecasts using different forecasting methods for the

models described in this paper. Each time that we expand the window, initially we apply the

Schwarz’s Bayesian Information Criterio (BIC) to choose the effective variables out of k = 14

available ones which are critical in predicting the equity premium. In other words, we select the

forecasting model using this criterion from all 2k possible specifications which k = 14. As we did

not put any restriction on the number of predictors, this criterion may choose all 14 predictors or

choose nothing as the extreme cases. Since BIC has a larger penalty term, it is a good choice to

use it against in-sample over fitting, Pesaran and Timmermann (1995). Besides, we identify break

points by the sequential procedure introduced by Bai and Perron (1998, 2003), hereafter BP, where

we search for up to eight breaks and set the trimming parameter to 0.1 and the significance level

to 5%. Using an initial estimation period of n1 = 92 quarters (23 years) forecasts are recursively

generated at each point in the out-of-sample period using only the information available at the

time the forecast is made. As the selection of the forecast evaluation period is always somewhat

arbitrary, we also report the results with larger estimation window sizes of 132, 172. The results

are qualitatively similar when a larger number of estimation period is used. The baseline forecast

uses the observations after the last break identified by the sequential procedure of BP. We compare

our proposed Stein-like combined estimator forecast with the BP post-break forecast, κ weight

combined estimator forecast and forecast from the semi-parametric estimator which we derive the

weight (γ) by CV. Also we compare our results with the one proposed by PPP (2013) and historical

average (HA) forecast.

Table 10 displays the summary statistics for the equity premium and its predictors. The min,

max and high standard deviation shows the higher volatility which can be attributed to D/P ,

D/Y , E/P and D/E in compare to other predictors.

8.1 Forecast evaluation with quarterly data

In order to evaluate the performance of our proposed estimator, we compute its out of sample

MSFE and compare it with MSFE from other competing estimators. For this purpose, we divide

the sample of T observations into two parts. The first n1 observation is used as an in-sample

estimation period, and the remaining n2 = T − n1 observations is the out-of-sample period which

we recursively do one step ahead forecast. We consider different estimation window sizes to check

26

Page 28: Optimal Forecast under Structural Breaks with Application

the sensitivity of the estimators to this choice. The MSFE for the predictive regression model (65)

over the forecast evaluation period is given by:

MSFEi =1

n2

n2∑t=1

(yn1+t − yi,n1+t), (70)

whereMSFEi stands for different forecasting method such as the BP post-break estimator (MSFEBP ),

our proposed Stein-like combined estimator (MSFEα), combined estimator with weight κ (MSFEκ),

semi-parametric estimator (MSFEγ) and PPP estimator (MSFEPPP ). We also report the forecast

from historical average (MSFEHA) where the historical average of the equity premium corresponds

to a constant expected equity premium.

Table 11 shows the recursive out-of-sample forecasting results under different in-sample estimation periods (n1), so the beginning of the various forecast evaluation periods runs from 1950:Q1 (n1 = 92) through 1970:Q1 (n1 = 172). We evaluate the forecasts for horizons h = 1, 2, 3, 4 quarters. Based on the results in Table 11, the Stein-like combined estimator delivers substantially improved forecasts (lower MSFE) compared to the BP post-break estimator for all horizons. The semi-parametric estimator and the combined estimator with weight κ also have a lower MSFE than the post-break estimator, but the combined estimator with the Stein-type weight still outperforms both of them. Our proposed combined estimator performs almost the same as the PPP estimator when h = 1, but for h > 1 the Stein-like combined estimator outperforms the PPP estimator. Moreover, at the shorter horizons the Stein-like combined estimator also has a lower MSFE than the HA benchmark, which is known to be hard to beat. The good performance of the Stein-type weight can be attributed to the form of the weight, which captures the break size and adjusts the balance between the full-sample estimator and the post-break estimator accordingly.

We also use the out-of-sample R2 statistic suggested by Campbell and Thompson (2008) to compare the forecasts based on our proposed Stein-like combined estimator with the alternative models introduced above, namely “κ”, “γ”, “PPP” and “HA”. We compute the out-of-sample R2 as

$$R^2_i = 1 - \frac{\mathrm{MSFE}_\alpha}{\mathrm{MSFE}_i}, \tag{71}$$

where MSFEα is the MSFE of the Stein-like combined estimator and MSFEi is the MSFE of each of the alternative models “κ”, “γ”, “PPP” and “HA”. A positive R2 indicates out-performance of our proposed Stein-like combined estimator relative to the benchmark model, and a negative R2 indicates under-performance. Finally, we evaluate whether any of the out- or under-performance of our model is statistically significant, which corresponds to testing H0: R2 = 0 (MSFEi = MSFEα) against the alternative H1: R2 > 0 (MSFEi > MSFEα). We report the well-known Diebold and Mariano (1995) and West (1996) statistic for testing the null of equal MSFE (equal predictive ability). The results are displayed in Table 12. Based on these results, the Stein-like combined estimator has a larger positive R2 compared to the κ estimator, the semi-parametric estimator and the post-break estimator, and the R2 values are significant at the 5% level across most horizons and estimation window sizes. The semi-parametric estimator also performs well in terms of delivering a lower MSFE, which illustrates our theoretical point that using pre-break observations improves forecast accuracy; it performs particularly well at the longer forecast horizons.
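A minimal sketch of how the out-of-sample R2 in equation (71) and the associated equal-predictive-ability test can be computed from the two forecast-error series is given below. The simple Diebold-Mariano variant shown here uses a plain (non-HAC) variance of the loss differential, which is an assumption of this sketch; for horizons h > 1 an autocorrelation-robust variance would normally be used.

```python
import numpy as np
from math import erf, sqrt

def oos_r2(errors_alpha, errors_i):
    """Out-of-sample R^2 of equation (71): positive values favour the
    Stein-like combined estimator over benchmark i."""
    msfe_alpha = np.mean(errors_alpha ** 2)
    msfe_i = np.mean(errors_i ** 2)
    return 1.0 - msfe_alpha / msfe_i

def dm_test(errors_alpha, errors_i):
    """One-sided Diebold-Mariano-type test of H0: MSFE_i = MSFE_alpha against
    H1: MSFE_i > MSFE_alpha, using a simple (non-HAC) variance estimate."""
    d = errors_i ** 2 - errors_alpha ** 2            # loss differential
    n = len(d)
    dm_stat = np.mean(d) / np.sqrt(np.var(d, ddof=1) / n)
    p_value = 0.5 * (1.0 - erf(dm_stat / sqrt(2.0)))  # one-sided normal p-value
    return dm_stat, p_value
```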

8.2 Cumulative difference of squared forecast errors

In order to check whether our proposed Stein-like combined estimator out-performs the other estimators consistently, we graph the cumulative difference of squared forecast errors (CDSFE) of our proposed estimator versus the other models for different forecast horizons. This informative graph is recommended by Welch and Goyal (2008), see also Rapach and Zhou (2013), to assess the consistency of any out-performance of a model. To interpret the graph, we check whether the CDSFE is positive over any given period of time. A strong forecasting model should perform well across the majority of the out-of-sample period, not only around one or two short episodes. Thus, positive (negative) values indicate consistent out-performance (under-performance) of our proposed estimator relative to the benchmark model. Rising values of the CDSFE represent a higher MSFE for the benchmark model relative to our model, and flat segments reflect equal performance of the two estimators. The CDSFE is calculated as

$$\mathrm{CDSFE}_{i,t} = \sum_{\tau=1}^{t}\big(e^2_{i,\,n_1+\tau} - e^2_{\alpha,\,n_1+\tau}\big), \tag{72}$$

where t = 1, . . . , n2 and we set n1 ∈ {92, 132, 172} as the size of the initial in-sample estimation period, so the out-of-sample forecast evaluation period starts in the early 1950s, 1960s and 1970s, respectively. Here e2_{i, n1+τ} is the squared forecast error of a given benchmark estimator at time τ, and e2_{α, n1+τ} is the squared forecast error of our proposed Stein-like combined estimator at time τ.
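For completeness, a small sketch of the CDSFE computation in equation (72) is given below; it assumes the per-period forecast errors of the benchmark and of the Stein-like combined estimator have already been collected, as in the earlier MSFE sketch, and the function name is illustrative.

```python
import numpy as np

def cdsfe(errors_benchmark, errors_alpha):
    """Cumulative difference of squared forecast errors, equation (72).
    Positive and rising values indicate consistent out-performance of the
    Stein-like combined estimator over the benchmark."""
    d = errors_benchmark ** 2 - errors_alpha ** 2
    return np.cumsum(d)
```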

Figure 2 displays the CDSFE against the different benchmark models for quarterly data. The solid dark blue line shows the CDSFE relative to the BP post-break estimator. As is clear from the graph, the values are positive and increasing over 1950-2018 for all forecast horizons h, which shows the consistent out-performance of the Stein-like combined estimator over the BP post-break estimator.

The next best forecast is the one based on the semi-parametric method (green dashed line); the Stein estimator out-performs this semi-parametric estimator consistently. The semi-parametric estimator is not sensitive to the choice of the initial in-sample estimation period and performs consistently regardless of n1, but for the longer forecast horizons, h = 3, 4, the CDSFE becomes smaller, which highlights the strong performance of this estimator at longer horizons.

The out-performance of the Stein-like combined estimator relative to the combined estimator with κ weight (pink dashed line) can be seen in the figures for all h. The κ-weight estimator is sensitive to the choice of the initial in-sample estimation period: as n1 increases, the CDSFE becomes smaller, but the Stein-like estimator still out-performs the κ estimator. The Stein-like combined estimator forecast also outperforms the forecast based on the PPP estimator (dotted red line) for almost all h, except for h = 4 with n1 = 92, 132, where we observe an insignificant under-performance.

Finally, the Stein-like combined estimator out-performs the historical average except for h = 4, for which there is a period of under-performance around the 2000s. Overall, the Stein-like combined estimator is the best estimator among those introduced in this paper.

9 Conclusion

In this paper we introduce a Stein-like combined estimator of the full-sample estimator (which uses all observations in the sample) and the post-break estimator (which uses only the observations after the most recent break point). The standard solution for forecasting under structural breaks is to use the post-break estimator, but it has been shown that using pre-break observations can decrease the MSFE. Because our combined estimator uses the pre-break observations, we are able to reduce the variance of the forecast error at the cost of adding some bias. We also introduce a semi-parametric estimator which applies a discrete kernel weight to all observations in the sample: it gives weight one to the post-break observations and down-weights the pre-break observations by a weight γ ∈ [0, 1], which we obtain numerically by cross-validation. Both the Stein-like combined estimator and the semi-parametric estimator have a lower MSFE than the post-break estimator. The semi-parametric estimator performs especially well when the number of regressors is large, because it does not involve the estimation error of many parameters; we only need to estimate the weight γ. We also compare the performance of our proposed Stein-like combined estimator with two alternative estimators: the combined estimator with weight κ ∈ [0, 1] and the estimator proposed by Pesaran et al. (2013). Based on the simulation results, all of the estimators perform well, but when the number of regressors is large and the break size is big, the MSFE of the PPP estimator exceeds that of the post-break estimator, whereas this does not happen for our proposed estimators. Moreover, the results from the empirical example on the equity premium show that the Stein-like combined estimator performs better than the other estimators introduced in this paper (including the historical average) and is more robust to the data frequency and the selection of the initial in-sample estimation window.
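The construction of the semi-parametric estimator summarized above can be illustrated with the following simplified sketch: pre-break observations are down-weighted by γ in a weighted least-squares fit, and γ is chosen on a grid by a simple leave-one-out cross-validation over the post-break sample. The variance-ratio adjustment used in the paper is omitted, the cross-validation scheme shown here is an assumption, and the names (`weighted_beta`, `cv_gamma`) are hypothetical.

```python
import numpy as np

def weighted_beta(X1, y1, X2, y2, gamma):
    """Down-weight pre-break observations (X1, y1) by gamma in [0, 1] and give
    post-break observations (X2, y2) weight one (simplified sketch)."""
    XtX = gamma * X1.T @ X1 + X2.T @ X2
    Xty = gamma * X1.T @ y1 + X2.T @ y2
    return np.linalg.solve(XtX, Xty)

def cv_gamma(X1, y1, X2, y2, grid=np.linspace(0.0, 1.0, 21)):
    """Pick gamma on a grid by leave-one-out CV over the post-break sample."""
    n2 = len(y2)
    best_gamma, best_cv = 0.0, np.inf
    for g in grid:
        cv = 0.0
        for j in range(n2):
            keep = np.arange(n2) != j
            beta = weighted_beta(X1, y1, X2[keep], y2[keep], g)
            cv += (y2[j] - X2[j] @ beta) ** 2
        if cv < best_cv:
            best_gamma, best_cv = g, cv
    return best_gamma
```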


References

Andrews, D.W.K. (1993). “Tests for parameter instability and structural change with unknown

change point.” Econometrica 61, 821-856.

Bai, J. and Perron, P. (1998). “Estimating and testing linear models with multiple structural

changes.” Econometrica 66, 47-78.

Bai, J. and Perron, P. (2003). “Computation and analysis of multiple structural change models.”

Journal of Applied Econometrics 18, 1-22.

Bates, J.M. and C.W.J. Granger (1969). “The Combination of Forecasts.” Operations Research

Quarterly 20, 451-468.

Boot, T. and Andreas Pick (2017). “A near optimal test for structural breaks when forecasting

under square error loss.” Tinbergen Institute Discussion Paper No. 17-039/III, Tinbergen

Institute, Amsterdam and Rotterdam.

Campbell, J. Y., and S. B. Thompson (2008). “Predicting the Equity Premium Out of Sample:

Can Anything Beat the Historical Average?” Review of Financial Studies 21, 1509-31.

Clements, M.P. and D.F. Hendry (1998). “Forecasting Economic Time Series.” Cambridge

University Press.

Clements, M.P. and D.F. Hendry (1999). “Forecasting Non-stationary Economic Time Series.”

The MIT Press.

Clements, M.P. and D.F. Hendry (2006). “Forecasting with Breaks.” In Elliott, G., C.W.J.

Granger and A. Timmermann (Eds.), Handbook of Economic Forecasting vol. 1. North-

Holland, 605-658

Diebold, F.X. and P. Pauly (1987). “Structural Change and the Combination of Forecasts.”

Journal of Forecasting 6, 21-40.

Diebold, F. X. and R. S. Mariano (1995). “Comparing predictive accuracy.”Journal of Business

and Economic Statistics 13, 253-263.

Hansen, Bruce (2016). “Efficient shrinkage in parametric models.” Journal of Econometrics 190,

115-132.

Hansen, Bruce (2017). “Stein-like 2SLS estimator.” Econometric Reviews 36, 840-852.

Hausman, J. A. (1978). “Specification tests in econometrics.” Econometrica 46, 1251-1271.


James, W., Stein, Charles M. (1961). “Estimation with quadratic loss.” In Proceedings of the

Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 361–380.

Judge, G.G. and M.E. Bock (1978). “The statistical implications of pretest and Stein-rule

estimators in econometrics.” North-Holland, Amsterdam.

Lebedev, N. N. (1972). “Special Functions and Their Applications.” Dover Publications, Inc.;

New York.

Li, K.-C. (1987). “Asymptotic optimality for Cp , CL , cross-validation and generalized cross-

validation: discrete index set.” The Annals of Statistics 15, 958-975.

Li, Q., Ouyang, D., Racine, J. (2013). “Categorical semiparametric varying coefficient models.”

Journal of Applied Econometrics 28,551–579.

Lehmann, E. L., Casella, G. (1998). “Theory of Point Estimation.” 2nd ed. New York: Springer.

Maasoumi, E. (1978). “A modified Stein-like estimator for the reduced form coefficients of

simultaneous equations.” Econometrica 46,695–703.

Pesaran, M.H. and A. Timmermann (2005). “Small Sample Properties of Forecasts from

Autoregressive Models under Structural Breaks.” Journal of Econometrics 129, 183-217.

Pesaran, M.H., and A. Timmermann (1995). “Predictability of Stock Returns: Robustness and

Economic Significance.” Journal of Finance 50, 1201-28.

Pesaran, M.H., Andreas Pick and Mikhail Pranovich (2013). “Optimal forecasts in the presence

of structural breaks.” Journal of Econometrics, 177(2), 134-152.

Pesaran, M.H., Timmermann, Allan (2007). “Selection of estimation window in the presence of

breaks.” Journal of Econometrics 137, 134-161.

Pesaran, M.H., and Andreas Pick (2011). “Forecast Combination Across Estimation Windows.”

Journal of Business and Economic Statistics, 29:2, 307-318, DOI: 10.1198/ jbes.2010.09018

Rao, C. R. (1973). “Linear Statistical Inference and Its Applications.” New York: John Wiley

and Sons, Inc.

Rapach, D. E., and Zhou, G. (2013). “Forecasting stock returns.” in Graham Elliott and Allan

Timmermann, eds: Handbook of Economic Forecasting, Volume 2 (Elsevier, Amsterdam).

Rossi, B. (2013). “Advances in forecasting under instability.” In G. Elliott, & A. Timmermann

(Eds.), Handbook of Economic Forecasting, Chapter 21, 1203 - 1324. Elsevier volume 2, Part

B.


Saleh, A.K.Md. Ehsanes (2006). “Theory of Preliminary Test and Stein-Type Estimation with

Applications.” Wiley, Hoboken.

Stein, Charles M. (1956). “Inadmissibility of the usual estimator for the mean of a multivariate

normal distribution.” In: Proc. Third Berkeley Symp. Math. Statist. Probab., vol. 1 pp.

197–206.

Stock, J. H., and Watson, M. W. (2004). “Combination Forecasts of Output Growth in a Seven-

Country Data Set.” Journal of Forecasting 23, 405–430.

Timmermann, Allan (2006). “Forecast Combinations.” In G. Elliott, C. W. Granger, and A.

Timmermann (Eds.), Handbook of Economic Forecasting (pp. 135-196). Elsevier.

Ullah, A. (1974). “On the sampling distribution of improved estimators for coefficients in linear

regression.” Journal of Econometrics 2, 143-50.

Ullah, A., Wan, A. T. K., Wang, H., Zhang, X., Zou, G. (2017). “A semiparametric generalized

ridge estimator and link with model averaging.” Econometric Reviews 36, 370-384.

Welch, I., and A. Goyal (2008). “A Comprehensive Look at the Empirical Performance of Equity

Premium Prediction.” Review of Financial Studies 21, 1455-508.

West, K.D., (1996). “Asymptotic inference about predictive ability.” Econometrica 64, 1067-1084.

Zhang, X., Wan, A. T. K., Zou, G. (2013). “Model averaging by Jackknife criterion in models

with dependent data.” Journal of Econometrics 174(2), 82–94.


Table 1: Simulation results for a single break, T = 100, q = 0.5

k = 3 k = 5 k = 8

λ RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP

b = 0.2

0 0.988 0.984 0.986 0.996 0.968 0.970 0.975 0.992 0.952 0.962 0.967 0.989

0.5 1.000 1.000 1.001 1.002 0.998 0.997 0.996 1.002 0.996 0.995 0.995 1.008

1 1.000 1.000 1.002 1.006 1.000 1.000 0.999 1.005 0.999 0.999 0.999 1.026

b = 0.4

0 0.985 0.980 0.976 0.988 0.959 0.959 0.957 0.982 0.926 0.935 0.926 0.971

0.5 0.998 0.999 1.000 1.005 0.996 0.994 0.996 1.003 0.993 0.992 0.995 1.000

1 0.999 1.000 1.001 1.003 1.000 1.000 1.000 1.012 0.999 0.999 1.000 1.019

b = 0.6

0 0.975 0.972 0.951 0.977 0.925 0.928 0.917 0.959 0.871 0.884 0.857 0.922

0.5 1.000 1.000 1.006 0.997 0.993 0.991 0.991 0.995 0.986 0.982 0.980 0.993

1 0.999 0.999 1.001 0.999 0.999 1.000 0.996 1.012 0.997 0.997 0.992 1.012

b = 0.8

0 0.926 0.915 0.867 0.923 0.833 0.837 0.797 0.880 0.705 0.732 0.641 0.785

0.5 0.998 0.999 1.008 0.992 0.977 0.974 0.976 0.974 0.953 0.956 0.921 0.929

1 1.000 1.000 1.003 1.007 1.000 1.000 1.002 1.010 0.997 0.994 0.984 1.014

Note: This table reports the relative MSFE. The benchmark forecast is the post-break estimator. The row blocks labelled b report the ratio of pre-break observations, and the column λ reports the break size. RMSEα ≡ MSFE(yα)/MSFE(y2), RMSEκ ≡ MSFE(yκ)/MSFE(y2), RMSEγ ≡ MSFE(yγ)/MSFE(y2), and RMSEPPP ≡ MSFE(yPPP)/MSFE(y2).


Table 2: Simulation results for a single break, T = 100, q = 1

k = 3 k = 5 k = 8

λ RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP

b = 0.2

0 0.994 0.993 0.992 0.999 0.986 0.987 0.982 0.997 0.979 0.983 0.979 0.995

0.5 1.000 0.999 1.000 1.001 0.999 0.997 0.996 1.004 0.996 0.995 0.995 1.003

1 1.000 1.001 1.002 1.005 1.000 0.999 0.999 1.004 0.999 0.999 0.999 1.010

b = 0.4

0 0.992 0.988 0.987 0.993 0.977 0.977 0.974 0.990 0.957 0.962 0.952 0.983

0.5 0.999 0.998 0.999 1.004 0.996 0.994 0.995 1.001 0.994 0.992 0.994 0.997

1 0.999 1.000 1.001 1.002 1.000 1.000 1.000 1.014 0.998 0.999 1.000 1.011

b = 0.6

0 0.982 0.980 0.963 0.984 0.947 0.949 0.940 0.971 0.902 0.911 0.885 0.939

0.5 1.000 0.999 1.004 0.996 0.991 0.988 0.989 0.994 0.986 0.981 0.979 0.988

1 0.999 0.999 1.000 1.000 0.999 0.998 0.996 1.013 0.996 0.996 0.991 1.009

b = 0.8

0 0.940 0.931 0.882 0.930 0.860 0.859 0.822 0.893 0.737 0.759 0.672 0.803

0.5 0.999 0.985 1.003 0.989 0.973 0.978 0.972 0.973 0.940 0.938 0.918 0.931

1 1.000 1.001 1.003 1.005 0.999 0.999 1.001 1.011 0.994 0.999 0.982 1.013

Note: See the footnote of Table 1.


Table 3: Simulation results for a single break, T = 100, q = 2

k = 3 k = 5 k = 8

λ RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP

b = 0.2

0 0.997 0.998 0.998 1.000 0.995 0.995 0.995 1.001 0.993 0.994 0.993 0.998

0.5 1.000 0.999 1.000 1.000 0.998 0.997 0.996 1.001 0.996 0.996 0.994 0.998

1 1.000 1.000 1.002 1.002 1.000 0.999 0.999 1.002 0.999 0.999 0.999 1.002

b = 0.4

0 0.996 0.995 0.997 0.997 0.991 0.991 0.990 0.996 0.983 0.985 0.986 0.993

0.5 0.999 0.997 0.998 1.000 0.996 0.993 0.994 0.997 0.995 0.993 0.994 0.995

1 0.999 0.999 1.001 1.000 1.000 1.000 1.000 1.010 0.999 0.999 1.000 1.003

b = 0.6

0 0.993 0.986 0.985 0.991 0.976 0.976 0.974 0.986 0.949 0.954 0.946 0.970

0.5 0.999 0.999 1.000 0.995 0.990 0.992 0.986 0.988 0.985 0.980 0.978 0.984

1 0.999 0.999 1.000 0.999 0.998 0.997 0.995 1.005 0.994 0.993 0.991 1.001

b = 0.8

0 0.964 0.957 0.911 0.951 0.910 0.907 0.890 0.925 0.807 0.819 0.772 0.843

0.5 0.994 0.990 0.989 0.982 0.968 0.970 0.969 0.968 0.924 0.926 0.914 0.934

1 1.000 0.999 1.003 1.002 0.998 0.999 1.000 1.009 0.986 0.989 0.979 0.997

Note: See the footnote of Table 1.


Table 4: Simulation results for a single break, T = 200, q = 0.5

k = 3 k = 5 k = 8

λ RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP

b = 0.2

0 0.998 0.997 0.997 0.999 0.991 0.991 0.992 0.996 0.979 0.982 0.984 0.995

0.5 0.999 0.998 0.998 1.000 0.999 0.999 0.998 0.999 0.999 0.999 1.000 1.009

1 1.000 1.000 1.000 1.003 1.000 1.000 1.000 1.008 1.000 1.000 1.000 1.007

b = 0.4

0 0.992 0.986 0.984 0.994 0.972 0.976 0.974 0.988 0.952 0.959 0.953 0.981

0.5 0.998 0.997 0.996 0.996 0.998 0.998 0.997 0.997 0.999 0.999 0.999 1.010

1 1.000 0.999 1.000 1.002 1.000 0.999 1.000 1.005 1.000 1.000 1.000 1.004

b = 0.6

0 0.983 0.975 0.970 0.984 0.955 0.958 0.951 0.976 0.914 0.928 0.916 0.960

0.5 0.997 0.995 0.994 0.995 0.997 0.997 0.996 0.999 0.996 0.996 0.998 1.013

1 1.000 0.999 1.000 1.003 0.999 0.999 1.000 1.007 0.999 0.999 0.999 1.005

b = 0.8

0 0.968 0.958 0.931 0.963 0.903 0.907 0.882 0.931 0.819 0.841 0.784 0.884

0.5 0.996 0.993 0.994 0.996 0.994 0.994 0.995 0.997 0.987 0.984 0.984 1.005

1 1.000 1.000 1.000 1.008 0.998 0.998 1.000 1.006 0.997 0.997 0.995 1.009

Note: See the footnote of Table 1.


Table 5: Simulation results for a single break, T = 200, q = 1

k = 3 k = 5 k = 8

λ RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP

b = 0.2

0 0.999 1.000 1.002 1.000 0.997 0.998 0.999 0.999 0.994 0.994 0.995 0.998

0.5 0.999 0.998 0.997 0.999 0.999 0.999 0.998 0.997 1.000 1.000 1.000 1.002

1 1.000 1.000 1.000 1.002 1.000 1.000 1.000 1.004 1.000 1.000 1.000 1.009

b = 0.4

0 0.998 0.995 0.994 0.997 0.986 0.988 0.986 0.994 0.974 0.978 0.974 0.988

0.5 0.998 0.997 0.996 0.996 0.999 0.998 0.997 0.998 0.999 0.999 1.000 1.005

1 1.000 1.000 1.000 1.001 1.000 1.000 1.000 1.005 1.000 1.000 1.000 1.006

b = 0.6

0 0.992 0.986 0.982 0.991 0.972 0.973 0.967 0.984 0.944 0.953 0.942 0.975

0.5 0.997 0.999 0.994 0.996 0.998 0.997 0.996 0.999 0.997 0.997 0.999 1.014

1 1.000 0.999 1.000 1.002 0.999 0.999 1.000 1.007 0.999 0.999 0.999 1.005

b = 0.8

0 0.978 0.969 0.946 0.973 0.922 0.923 0.900 0.942 0.848 0.863 0.811 0.900

0.5 0.996 0.998 0.993 0.995 0.994 0.995 0.995 0.996 0.987 0.984 0.985 1.005

1 1.000 1.000 1.000 1.008 0.998 0.998 1.001 1.009 0.997 0.997 0.995 1.010

Note: See the footnote of Table 1.


Table 6: Simulation results for a single break, T = 200, q = 2

k = 3 k = 5 k = 8

λ RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP

b = 0.2

0 1.000 1.001 1.003 1.000 1.001 1.000 1.004 1.000 1.000 0.999 1.001 0.999

0.5 0.999 0.999 0.998 0.999 0.999 0.999 0.999 1.000 1.000 1.000 1.000 0.999

1 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.002 1.000 1.000 1.000 0.999

b = 0.4

0 1.001 1.000 1.007 1.000 0.998 0.997 1.000 0.998 0.992 0.993 0.994 0.996

0.5 0.998 0.997 0.996 0.996 0.999 0.998 0.998 0.998 1.000 0.999 1.000 1.000

1 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.004 1.000 1.000 1.001 1.001

b = 0.6

0 0.999 0.996 1.002 0.997 0.990 0.989 0.990 0.994 0.978 0.981 0.982 0.991

0.5 0.997 0.994 0.994 0.996 0.998 0.997 0.997 0.998 0.998 0.997 0.999 1.007

1 1.000 0.999 1.000 1.000 0.999 0.999 1.000 1.004 0.999 0.998 0.999 1.003

b = 0.8

0 0.990 0.988 0.979 0.986 0.957 0.955 0.944 0.964 0.901 0.909 0.880 0.935

0.5 0.996 0.999 0.993 0.993 0.995 0.995 0.996 0.997 0.988 0.989 0.986 1.005

1 1.000 1.000 1.000 1.007 0.998 0.998 1.001 1.009 0.997 0.996 0.996 1.011

Note: See the footnote of Table 1.


Table 7: Break point specifications by experiments

Experiment No. β(1) β(2) β(3) σ(1) σ(2) σ(3)

#1 : No break 1 1 1 1 1 1

#2 : Moderate break in coefficients (decline) 1.5 1 0.5 1 1 1

#3 : Large break in coefficients (decline) 2 1 0 1 1 1

#4 : Moderate break in coefficients (increase) 0.5 1 1.5 1 1 1

#5 : Large break in coefficients (increase) 0 1 2 1 1 1

#6 : Moderate decreasing and increasing break 1.5 1 1.5 1 1 1

#7 : Large decreasing and increasing break 2 1 2 1 1 1

#8 : Moderate increasing and decreasing break 0.5 1 0.5 1 1 1

#9 : Large increasing and decreasing break 0 1 0 1 1 1

#10 : No break, then increasing break 1 1 2 1 1 1

#11 : Decreasing, then no break 2 1 1 1 1 1

#12 : Higher post-break volatility 1 1 1 1 1 2

#13 : Lower post-break volatility 1 1 1 1 1 0.5


Table 8: Simulation results for multiple breaks, T = 100

k = 3 k = 5 k = 8

Experiment RMSFEα RMSFEκ RMSFEγ RMSFEPPP RMSFEα RMSFEκ RMSFEγ RMSFEPPP RMSFEα RMSFEκ RMSFEγ RMSFEPPP

#1 0.951 0.942 0.957 0.998 0.887 0.894 0.906 0.996 0.820 0.840 0.830 0.982

#2 0.999 1.000 0.994 1.039 0.995 0.999 0.983 1.068 0.991 0.997 0.979 1.089

#3 0.999 1.000 1.000 1.010 0.998 0.998 0.995 1.041 0.997 0.998 0.993 1.079

#4 0.999 0.999 0.997 1.008 0.996 0.997 0.992 1.042 0.995 0.996 0.978 1.083

#5 1.000 1.000 1.004 1.055 1.000 1.000 1.004 1.064 1.000 1.000 1.003 1.088

#6 0.983 0.983 0.964 1.004 0.967 0.958 0.921 0.992 0.911 0.909 0.860 0.990

#7 0.998 0.999 0.968 1.011 0.995 0.998 0.942 1.009 0.990 0.992 0.876 1.010

#8 0.991 1.001 0.975 1.001 0.960 0.965 0.929 0.996 0.955 0.960 0.868 0.995

#9 0.994 0.999 0.962 1.009 1.000 1.001 0.935 1.023 1.000 1.001 0.878 1.089

#10 0.999 1.000 1.002 1.009 0.999 0.999 0.997 1.039 0.995 0.996 0.994 1.045

#11 0.984 1.000 0.966 1.011 1.000 1.000 0.934 1.024 1.000 0.999 0.876 1.095

#12 0.934 0.921 0.949 1.011 0.859 0.869 0.892 1.012 0.782 0.808 0.816 1.011

#13 0.965 0.958 0.968 1.000 0.909 0.914 0.921 0.997 0.849 0.863 0.854 0.975

Note: This table reports the relative MSFE with respect to the post-break estimator for multiple breaks when T = 100. The first column shows the experiment number, which corresponds to the break specifications in Table 7. RMSEα ≡ MSFE(yα)/MSFE(y2), RMSEκ ≡ MSFE(yκ)/MSFE(y2), RMSEγ ≡ MSFE(yγ)/MSFE(y2), and RMSEPPP ≡ MSFE(yPPP)/MSFE(y2).


Table 9: Simulation results for multiple breaks, T = 200

k = 3 k = 5 k = 8

Experiment RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP RMSEα RMSEκ RMSEγ RMSEPPP

#1 0.982 0.977 0.981 0.999 0.953 0.952 0.959 0.999 0.933 0.934 0.938 0.995

#2 0.999 0.989 0.987 1.002 0.999 1.000 0.972 1.007 0.997 1.000 0.950 1.021

#3 0.999 0.999 0.983 1.007 0.998 1.000 0.974 1.016 0.996 1.000 0.939 1.011

#4 0.999 1.000 1.001 1.008 1.000 1.000 0.994 1.040 1.000 1.000 0.989 1.057

#5 0.999 1.000 1.001 1.038 0.999 1.000 1.000 1.040 0.999 1.001 0.999 1.066

#6 0.999 1.000 0.991 1.002 0.990 0.999 0.969 1.009 0.970 0.979 0.949 1.015

#7 0.999 1.001 0.986 1.001 1.000 1.001 0.973 1.015 1.000 1.000 0.950 1.039

#8 0.995 0.997 0.991 1.007 0.983 0.980 0.976 1.012 0.980 0.980 0.967 1.011

#9 0.998 0.988 0.988 1.007 0.993 0.991 0.976 1.023 0.991 0.989 0.956 1.055

#10 0.994 0.982 1.003 1.015 0.988 0.990 1.003 1.013 0.984 0.988 1.007 1.040

#11 1.000 1.000 0.991 1.007 0.993 0.999 0.965 1.010 0.989 0.990 0.939 1.029

#12 0.979 0.974 0.978 1.000 0.943 0.942 0.950 1.001 0.917 0.920 0.924 0.996

#13 0.983 0.979 0.985 0.998 0.959 0.960 0.969 1.000 0.941 0.944 0.950 0.998

Note: This table reports the relative MSFE with respect to the post-break estimator for multiple breaks when T = 200. For further details, see the footnote of Table 8.


Table 10: Summary statistics for quarterly data

Variables Mean St. dev. Min. Max. Median

Equity Premium 0.0208 0.1117 -0.3937 0.8929 0.0308

D/P -3.3785 0.4660 -4.4932 -1.9039 -3.3567

D/Y -3.3643 0.4600 -4.4966 -2.0331 -3.3291

E/P -2.7414 0.4223 -4.8074 -1.7750 -2.7926

D/E -0.6372 0.3341 -1.2442 1.3795 -0.6304

SVAR 0.0086 0.0147 0.0004 0.1144 0.0039

BM 0.5695 0.2678 0.1252 2.0285 0.5462

NTIS 0.0167 0.0257 -0.0529 0.1635 0.0168

TBL 0.0339 0.0309 0.0001 0.1549 0.0293

LTY 0.0509 0.0279 0.0179 0.1482 0.0420

LTR 0.0144 0.0466 -0.1451 0.2437 0.0094

TMS 0.0170 0.0130 -0.0350 0.0453 0.0175

DFY 0.0113 0.0070 0.0034 0.0559 0.0090

DFR 0.0010 0.0219 -0.1184 0.1629 0.0018

INFL 0.0074 0.0130 -0.0411 0.0909 0.0072

Note: D/P is the dividend-price ratio, D/Y is the dividend yield, E/P is the earnings-price ratio, D/E is the dividend-payout ratio, SVAR is the stock variance, BM is the book-to-market ratio, NTIS is net equity expansion, TBL is the Treasury bill rate, LTY is the long-term yield, LTR is the long-term return, TMS is the term spread, DFY is the default yield spread, DFR is the default return spread, and INFL is inflation.


Table 11: Out-of-sample forecasting performance with quarterly data

h n2 MSFEα MSFEκ MSFEγ MSFEBP MSFEPPP MSFEHA

1 1950:Q1-2018:Q4 0.0067 0.0078 0.0076 0.0088 0.0068 0.0072

1960:Q1-2018:Q4 0.0073 0.0086 0.0081 0.0097 0.0074 0.0078

1970:Q1-2018:Q4 0.0078 0.0086 0.0089 0.0099 0.0080 0.0081

2 1950:Q1-2018:Q4 0.0068 0.0091 0.0083 0.0105 0.0087 0.0073

1960:Q1-2018:Q4 0.0073 0.0100 0.0089 0.0116 0.0096 0.0078

1970:Q1-2018:Q4 0.0077 0.0087 0.0097 0.0105 0.0086 0.0081

3 1950:Q1-2018:Q4 0.0073 0.0113 0.0086 0.0128 0.0239 0.0073

1960:Q1-2018:Q4 0.0079 0.0121 0.0089 0.0138 0.0272 0.0078

1970:Q1-2018:Q4 0.0077 0.0089 0.0093 0.0109 0.0082 0.0077

4 1950:Q1-2018:Q4 0.0081 0.0139 0.0079 0.0155 0.0078 0.0073

1960:Q1-2018:Q4 0.0087 0.0155 0.0085 0.0173 0.0084 0.0078

1970:Q1-2018:Q4 0.0080 0.0097 0.0089 0.0119 0.0087 0.0077

Note: This table reports the MSFE of the different estimators. h in the first column is the forecast horizon. The second column (n2) shows the start date of the out-of-sample period, which in all cases ends at 2018:Q4. Using the initial in-sample estimation period n1, forecasts are recursively generated at each point in the out-of-sample period using only the information available at the time the forecast is made. In the table heading, MSFEBP refers to the case in which only post-break observations are used. MSFEα reports the results for our proposed Stein-like combined estimator with weight α = 1 − τ/H_T. MSFEκ reports the results for the combined estimator with weight κ ∈ [0, 1]. MSFEγ reports the results for the semi-parametric estimator, for which the weight γ is estimated by cross-validation. MSFEPPP reports the results based on the Pesaran et al. (2013) estimator. Finally, MSFEHA reports the results for the historical average.


Table 12: Statistical significance of improved predictive accuracy with quarterly data

h n2 R2κ R2γ R2BP R2PPP R2HA

1 1950:Q1-2018:Q4 0.1410 0.1184 0.2299 0.0147 0.0694(0.0067) (0.0103) (0.0002) (0.3723) (0.1173)

1960:Q1-2018:Q4 0.1512 0.0988 0.2474 0.0135 0.0641(0.0053) (0.0266) (0.0001) (0.4017) (0.1734)

1970:Q1-2018:Q4 0.0930 0.1236 0.2121 0.0250 0.0370(0.0527) (0.0167) (0.0016) (0.3647) (0.3194)

2 1950:Q1-2018:Q4 0.2527 0.1807 0.3524 0.2184 0.0685(0.0002) (0.0040) (0.0000) (0.0943) (0.1935)

1960:Q1-2018:Q4 0.2700 0.1798 0.3707 0.2396 0.0641(0.0002) (0.0075) (0.0000) (0.0969) (0.2347)

1970:Q1-2018:Q4 0.1149 0.2062 0.2667 0.1047 0.0494(0.0103) (0.0028) (0.0001) (0.0734) (0.2795)

3 1950:Q1-2018:Q4 0.3540 0.1512 0.4297 0.6946 0.0000(0.0008) (0.0163) (0.0000) (0.1575) (0.5111)

1960:Q1-2018:Q4 0.3471 0.1124 0.4275 0.7096 -0.0128(0.0016) (0.0494) (0.0000) (0.1581) (0.5392)

1970:Q1-2018:Q4 0.1348 0.1720 0.2936 0.0610 0.0000(0.0232) (0.0029) (0.0001) (0.1908) (0.4626)

4 1950:Q1-2018:Q4 0.4173 -0.0253 0.4774 -0.0385 -0.1096(0.0019) (0.5842) (0.0002) (0.6239) (0.8253)

1960:Q1-2018:Q4 0.4387 -0.0235 0.4971 -0.0357 -0.1154(0.0019) (0.5735) (0.0002) (0.6047) (0.8161)

1970:Q1-2018:Q4 0.1753 0.1011 0.3277 0.0805 -0.0390(0.0136) (0.0649) (0.0003) (0.0995) (0.6279)

Note: This table reports the out-of-sample R2. R2i = 1 − MSFEα/MSFEi, where MSFEi is the MSFE of the benchmark estimator i ∈ {“κ”, “γ”, “BP”, “PPP”, “HA”}, so R2i measures the proportional reduction in MSFE achieved by the Stein-like combined estimator relative to benchmark i. To show whether the forecast improvement is significant, we report in brackets the p-value of the Diebold and Mariano (1995) and West (1996) statistic for testing the null hypothesis MSFEi = MSFEα against the alternative MSFEi > MSFEα (corresponding to H0: R2i = 0 against HA: R2i > 0). The first column shows the forecast horizon, and n2 in the second column shows the start date of the out-of-sample period, which in all cases ends at 2018:Q4.


Figure 1: Risk gain between the Stein-like combined estimator and the post-break estimator. Panels (a)-(i) correspond to the combinations b1 ∈ {0.1, 0.5, 0.9} (rows) and q ∈ {0.5, 1, 2} (columns).


Figure 2: Cumulative difference of squared forecast errors (CDSFE) for quarterly data, for all forecast horizons h = 1, 2, 3, 4 and n1 = 92, 132, 172. Panels (a)-(l) correspond to the combinations of h (rows) and n1 (columns).


Figure 3: Cumulative difference of squared forecast errors (CDSFE) for monthly data, for all forecast horizons h = 1, 2, 3, 4 and n1 = 276, 396, 516. Panels (a)-(l) correspond to the combinations of h (rows) and n1 (columns).


A Appendix: Mathematical details

A.1 Asymptotic risk of the Stein-like combined estimator

For the proof of equation (23) we use Lemmas 2-4. For simplicity, define A ≡ G′V^{1/2} and Ψ ≡ G_2′V^{1/2}, and let the non-centrality parameter of the chi-square distribution in Lemma 4 be µ ≡ θ′Mθ/2. For clarity, we treat the second and third terms of equation (23) one by one. The second term can be simplified as:

$$
\begin{aligned}
\operatorname{tr}&\Big[W A\, E\big((Z'MZ)^{-2} Z Z'\big) A'\Big]\\
&= E\big[\chi^2_{k+2}(\mu)\big]^{-2}\operatorname{tr}\big(WAMA'\big)
 + E\big[\chi^2_{k}(\mu)\big]^{-2}\operatorname{tr}\big(WA(I_{2K}-M)A'\big)\\
&\quad + E\big[\chi^2_{k+4}(\mu)\big]^{-2}\operatorname{tr}\big(WAM\theta\theta'MA'\big)
 + E\big[\chi^2_{k}(\mu)\big]^{-2}\operatorname{tr}\big(WA(I_{2K}-M)\theta\theta'(I_{2K}-M)A'\big)\\
&\quad + E\big[\chi^2_{k+2}(\mu)\big]^{-2}\operatorname{tr}\big(WA(\theta\theta'M + M\theta\theta' - 2M\theta\theta'M)A'\big)\\
&= E\big[\chi^2_{k+2}(\mu)\big]^{-2}\operatorname{tr}\big(WAA'\big) + 0
 + E\big[\chi^2_{k+4}(\mu)\big]^{-2}\operatorname{tr}\big(WA\theta\theta'A'\big) + 0 + 0\\
&= \Big[\tfrac{1}{4}e^{-\mu}\,\tfrac{\Gamma(\frac{k}{2}-1)}{\Gamma(\frac{k}{2}+1)}\,
   {}_1F_1\big(\tfrac{k}{2}-1;\tfrac{k}{2}+1;\mu\big)\Big]\operatorname{tr}\big(WAA'\big)
 + \Big[\tfrac{1}{4}e^{-\mu}\,\tfrac{\Gamma(\frac{k}{2})}{\Gamma(\frac{k}{2}+2)}\,
   {}_1F_1\big(\tfrac{k}{2};\tfrac{k}{2}+2;\mu\big)\Big]\big(\theta'A'WA\theta\big)\\
&= \Big[\frac{\theta'A'WA\theta}{(k-2)\,\theta'M\theta}\Big]
   \Big[e^{-\mu}\,{}_1F_1\big(\tfrac{k}{2}-1;\tfrac{k}{2};\mu\big)\Big]
 - \Big[\frac{\theta'A'WA\theta}{(k-2)\,\theta'M\theta}
   - \frac{\operatorname{tr}\big(WAA'\big)}{k(k-2)}\Big]
   \Big[e^{-\mu}\,{}_1F_1\big(\tfrac{k}{2}-1;\tfrac{k}{2}+1;\mu\big)\Big],
\end{aligned}
\tag{a.1}
$$

where the equalities follow from applying Lemma 3 several times together with AM = A, MA′ = A′, A(I_{2K} − M)A′ = 0, A(I_{2K} − M)θθ′(I_{2K} − M)A′ = 0, and A(θθ′M + Mθθ′ − 2Mθθ′M)A′ = 0. Recall that M is an idempotent matrix. The third term in equation (23) can be simplified as:

$$
\begin{aligned}
\operatorname{tr}&\Big[W \Psi\, E\big((Z'MZ)^{-1} Z Z'\big) A'\Big]\\
&= E\big[\chi^2_{k+2}(\mu)\big]^{-1}\operatorname{tr}\big(W\Psi MA'\big)
 + E\big[\chi^2_{k}(\mu)\big]^{-1}\operatorname{tr}\big(W\Psi(I_{2K}-M)A'\big)\\
&\quad + E\big[\chi^2_{k+4}(\mu)\big]^{-1}\operatorname{tr}\big(W\Psi M\theta\theta'MA'\big)
 + E\big[\chi^2_{k}(\mu)\big]^{-1}\operatorname{tr}\big(W\Psi(I_{2K}-M)\theta\theta'(I_{2K}-M)A'\big)\\
&\quad + E\big[\chi^2_{k+2}(\mu)\big]^{-1}\operatorname{tr}\big(W\Psi(\theta\theta'M + M\theta\theta' - 2M\theta\theta'M)A'\big)\\
&= E\big[\chi^2_{k+2}(\mu)\big]^{-1}\operatorname{tr}\big(WAA'\big) + 0
 + E\big[\chi^2_{k+4}(\mu)\big]^{-1}\operatorname{tr}\big(WA\theta\theta'A'\big) + 0
 - E\big[\chi^2_{k+2}(\mu)\big]^{-1}\operatorname{tr}\big(WA\theta\theta'A'\big)\\
&= \Big[\tfrac{1}{2}e^{-\mu}\,\tfrac{\Gamma(\frac{k}{2})}{\Gamma(\frac{k}{2}+1)}\,
   {}_1F_1\big(\tfrac{k}{2};\tfrac{k}{2}+1;\mu\big)\Big]\operatorname{tr}\big(WAA' - WA\theta\theta'A'\big)
 + \Big[\tfrac{1}{2}e^{-\mu}\,\tfrac{\Gamma(\frac{k}{2}+1)}{\Gamma(\frac{k}{2}+2)}\,
   {}_1F_1\big(\tfrac{k}{2}+1;\tfrac{k}{2}+2;\mu\big)\Big]\operatorname{tr}\big(WA\theta\theta'A'\big)\\
&= 2\,e^{-\mu}\Big[\frac{2\,\theta'A'WA\theta}{(\theta'M\theta)(k-2)}
   - \frac{\operatorname{tr}\big(WAA'\big)}{k-2}\Big]
   \times\Big[{}_1F_1\big(\tfrac{k}{2}-1;\tfrac{k}{2};\mu\big)
   - {}_1F_1\big(\tfrac{k}{2}-1;\tfrac{k}{2}+1;\mu\big)\Big],
\end{aligned}
\tag{a.2}
$$

where the last equality follows from applying Lemma 3 several times together with ΨM = A, Ψθ = 0, Ψ(I_{2K} − M)A′ = 0, Ψ(I_{2K} − M)θθ′(I_{2K} − M)A′ = 0, and AA′ = V_{(2)} − V_F. Finally, plugging equations (a.1) and (a.2) into equation (23) produces equation (15).

A.2 Proof of equation (17)

We want to show that
$$\operatorname{tr}\big(W(V_{(2)} - V_F)\big) - \frac{2\,\theta'V^{1/2}GWG'V^{1/2}\theta}{\theta'M\theta} > 0.$$
It is clear that
$$\operatorname{tr}\big(W(V_{(2)} - V_F)\big) = \operatorname{tr}\big(W(V_{(2)} - V_F)\,I_k\big) \le \operatorname{tr}(I_k)\,\lambda_{\max}\big(W(V_{(2)} - V_F)\big).$$
Thus,
$$\operatorname{tr}\big(W(V_{(2)} - V_F)\big) > \sup \frac{2\,\theta'V^{1/2}GWG'V^{1/2}\theta}{\theta'M\theta}
\;\Rightarrow\;
k\,\lambda_{\max}\big(W(V_{(2)} - V_F)\big) > \operatorname{tr}\big(W(V_{(2)} - V_F)\big) > 2\,\lambda_{\min}\big(W(V_{(2)} - V_F)\big)
\;\Rightarrow\;
k > \frac{2\,\lambda_{\min}\big(W(V_{(2)} - V_F)\big)}{\lambda_{\max}\big(W(V_{(2)} - V_F)\big)}
\;\Rightarrow\; k > 2.$$

Proof of equation (18):

We want to show that
$$\frac{\theta'V^{1/2}GWG'V^{1/2}\theta}{\theta'M\theta} - \frac{\operatorname{tr}\big(W(V_{(2)} - V_F)\big)}{k} > 0.$$
We have
$$\inf\, k\,\frac{\theta'V^{1/2}GWG'V^{1/2}\theta}{\theta'M\theta} > \operatorname{tr}\big(W(V_{(2)} - V_F)\big),
\quad\text{therefore}\quad
k > \frac{\operatorname{tr}\big(W(V_{(2)} - V_F)\big)}{\lambda_{\min}\big(W(V_{(2)} - V_F)\big)}.$$

A.3 Risk of the semi-parametric estimator

$$
\begin{aligned}
R(\gamma) &= E\big[L(\gamma)\big]
= E\Big[\big(\mu(\gamma)-\mu\big)'\big(\mu(\gamma)-\mu\big)\Big]
= E\Big[\big(\Omega_2^{-1/2}\big(A + P(\gamma)Y_2\big)-\mu\big)'\big(\Omega_2^{-1/2}\big(A + P(\gamma)Y_2\big)-\mu\big)\Big]\\
&= E\Big[\big(\Omega_2^{-1/2}A + \Omega_2^{-1/2}P(\gamma)\,\Omega_2^{1/2}\mu + \Omega_2^{-1/2}p(\gamma)\,\sigma_{(2)}\varepsilon_2 - \mu\big)'
        \big(\Omega_2^{-1/2}A + \Omega_2^{-1/2}P(\gamma)\,\Omega_2^{1/2}\mu + \Omega_2^{-1/2}p(\gamma)\,\sigma_{(2)}\varepsilon_2 - \mu\big)\Big]\\
&= E\Big[\big(\Omega_2^{-1/2}A + P(\gamma)\varepsilon_2 + \big(P(\gamma)\mu - \mu\big)\big)'
        \big(\Omega_2^{-1/2}A + P(\gamma)\varepsilon_2 + \big(P(\gamma)\mu - \mu\big)\big)\Big]\\
&= E\Big[A'\Omega_2^{-1}A + \varepsilon_2'P^2(\gamma)\varepsilon_2
        + \big(P(\gamma)\mu - \mu\big)'\big(P(\gamma)\mu - \mu\big)
        + 2A'\Omega_2^{-1/2}\big(P(\gamma)\mu - \mu\big)\Big]\\
&= E\big[A'\Omega_2^{-1}A\big] + \operatorname{tr}\big(P^2(\gamma)\big)
        + \big(P(\gamma)\mu - \mu\big)'\big(P(\gamma)\mu - \mu\big)\\
&= \Upsilon + \gamma\operatorname{tr}\big(P(\gamma) - P^2(\gamma)\big) + \operatorname{tr}\big(P^2(\gamma)\big)
        + \big(P(\gamma)\mu - \mu\big)'\big(P(\gamma)\mu - \mu\big)\\
&= \Upsilon + \gamma\operatorname{tr}\big(P(\gamma)\big) + (1-\gamma)\operatorname{tr}\big(P^2(\gamma)\big)
        + \big(p(\gamma)\big(\mu + \Omega_2^{-1/2}\sigma_{(2)}\varepsilon_2 - \Omega_2^{-1/2}\sigma_{(2)}\varepsilon_2\big) - \mu\big)'
          \big(p(\gamma)\big(\mu + \Omega_2^{-1/2}\sigma_{(2)}\varepsilon_2 - \Omega_2^{-1/2}\sigma_{(2)}\varepsilon_2\big) - \mu\big)\\
&= \Upsilon + \gamma\operatorname{tr}\big(P(\gamma)\big) + (1-\gamma)\operatorname{tr}\big(P^2(\gamma)\big)
        + \big(\Omega_2^{-1/2}P(\gamma)Y_2 + \Omega_2^{-1/2}A - \Omega_2^{-1/2}A - P(\gamma)\varepsilon_2 - \mu\big)'
          \big(\Omega_2^{-1/2}P(\gamma)Y_2 + \Omega_2^{-1/2}A - \Omega_2^{-1/2}A - P(\gamma)\varepsilon_2 - \mu\big)\\
&= \Upsilon + \gamma\operatorname{tr}\big(P(\gamma)\big) + (1-\gamma)\operatorname{tr}\big(P^2(\gamma)\big)
        + \big(\mu(\gamma) - \mu - \Omega_2^{-1/2}A - P(\gamma)\varepsilon_2\big)'
          \big(\mu(\gamma) - \mu - \Omega_2^{-1/2}A - P(\gamma)\varepsilon_2\big)\\
&= L(\gamma) + A'\Omega_2^{-1}A + \varepsilon_2'P^2(\gamma)\varepsilon_2
        - 2\big(\mu(\gamma) - \mu\big)'\big(\Omega_2^{-1/2}A + P(\gamma)\varepsilon_2\big)
        + \Upsilon + \gamma\operatorname{tr}\big(P(\gamma)\big) + (1-\gamma)\operatorname{tr}\big(P^2(\gamma)\big)\\
&= L(\gamma) + A'\Omega_2^{-1}A + \varepsilon_2'P^2(\gamma)\varepsilon_2 - 2A'\Omega_2^{-1}A
        - 2\big(P(\gamma)\mu - \mu\big)'\big(\Omega_2^{-1/2}A + P(\gamma)\varepsilon_2\big)
        - 2\,\varepsilon_2'P^2(\gamma)\varepsilon_2
        - 4A'\Omega_2^{-1/2}P(\gamma)\varepsilon_2
        + \Upsilon + \gamma\operatorname{tr}\big(P(\gamma)\big) + (1-\gamma)\operatorname{tr}\big(P^2(\gamma)\big)\\
&= L(\gamma) - C - \varepsilon_2'P^2(\gamma)\varepsilon_2
        - 2\big(P(\gamma)\mu - \mu\big)'\big(\Omega_2^{-1/2}A + P(\gamma)\varepsilon_2\big)
        + \gamma\operatorname{tr}\big(P(\gamma)\big) + (1-\gamma)\operatorname{tr}\big(P^2(\gamma)\big)
        - 4A'\Omega_2^{-1/2}P(\gamma)\varepsilon_2,
\end{aligned}
\tag{a.3}
$$

where $E\big[A'\Omega_2^{-1}A\big] = \Upsilon + \gamma\operatorname{tr}\big(P(\gamma) - P^2(\gamma)\big)$ and $A'\Omega_2^{-1}A = \Upsilon + C$, with
$$\Upsilon = \Big(\frac{\gamma^2}{q^4}\Big)\beta_{(1)}'X_1'X_1\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_2'\Omega_2^{-1}X_2\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_1'X_1\beta_{(1)},$$
and
$$C = \gamma^2\sigma^2_{(2)}\,\varepsilon_1'X_1\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_2'\Omega_2^{-1}X_2\Big(\frac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_1'\varepsilon_1.$$

A.4 Details for the optimality of the cross-validated weight

In order to prove the conditions in equations (44)-(54), define $\lambda_{\max}(B)$ to be the largest eigenvalue of the matrix B. Thus:

$$\sup_{\gamma\in H}\lambda_{\max}\big(p(\gamma)\big)
= \sup_{\gamma\in H}\lambda_{\max}\Big(X_2\Big(\tfrac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_2'\Big)
\le \lambda_{\max}\big(X_2(X_2'X_2)^{-1}X_2'\big) = 1. \tag{a.4}$$

Also, assume $X_2'\Omega_2^{-1}X_2/\big(T(1-b)\big) \to Q$. Besides, it can be shown that $R(\gamma) = O_p(T)$, $\mu'\mu = O(T)$, and $X_2'\Omega_2^{-1/2}\varepsilon_2 = O_p(\sqrt{T})$.

Proof of equation (44):
$$
\begin{aligned}
\varepsilon_2'\Omega_2^{-1/2}A
&= \frac{\gamma}{q^2}\,\varepsilon_2'\Omega_2^{-1/2}X_2\Big(\tfrac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_1'Y_1\\
&\le \frac{\gamma}{q^2}\,\varepsilon_2'\Omega_2^{-1/2}X_2\Big(\tfrac{\gamma}{q^2}X_1'X_1\Big)^{-1}X_1'Y_1
= \varepsilon_2'\Omega_2^{-1/2}X_2\big(X_1'X_1\big)^{-1}X_1'Y_1
= \varepsilon_2'\Omega_2^{-1/2}X_2\,\hat{\beta}_{(1)}\\
&= O_p(\sqrt{T})\cdot O_p(1) = O_p(\sqrt{T}).
\end{aligned}
\tag{a.5}
$$

Proof of equation (45):
$$\big(\mu'P(\gamma)\varepsilon_2\big)^2 \le \mu'\mu\;\varepsilon_2'P^2(\gamma)\varepsilon_2
\le \mu'\mu\;\lambda_{\max}\big(P(\gamma)\big)\,\varepsilon_2'P(\gamma)\varepsilon_2 = O_p(T), \tag{a.6}$$
where the first inequality follows from the Cauchy-Schwarz inequality, $(X'Y)^2 \le (X'X)(Y'Y)$, and the second inequality holds because for any matrices X and A we have $X'AX \le X'X\,\lambda_{\max}(A)$, and therefore $X'A^2X \le (X'AX)\,\lambda_{\max}(A)$. Thus $\mu'P(\gamma)\varepsilon_2 = O_p(\sqrt{T})$.

Proof of equation (46):
$$\sup_{\gamma\in H}\varepsilon_2'P(\gamma)\varepsilon_2 \le \varepsilon_2'\big(X_2(X_2'X_2)^{-1}X_2'\big)\varepsilon_2 = O_p(1). \tag{a.7}$$

Proof of equation (47):
$$\sup_{\gamma\in H}\operatorname{tr}\big(P(\gamma)\big) \le \operatorname{tr}\big(X_2(X_2'X_2)^{-1}X_2'\big) = k. \tag{a.8}$$

Proof of equation (48):
$$
\begin{aligned}
C &= \gamma^2\sigma^2_{(2)}\,\varepsilon_1'X_1\Big(\tfrac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_2'\Omega_2^{-1}X_2\Big(\tfrac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_1'\varepsilon_1\\
&\le \gamma^2\sigma^2_{(2)}\,\varepsilon_1'X_1\Big(\tfrac{\gamma}{q^2}X_1'X_1\Big)^{-1}X_2'\Omega_2^{-1}X_2\Big(\tfrac{\gamma}{q^2}X_1'X_1\Big)^{-1}X_1'\varepsilon_1\\
&= q^2\,\varepsilon_1'\sigma_{(1)}X_1\big(X_1'X_1\big)^{-1}X_2'\Omega_2^{-1}X_2\big(X_1'X_1\big)^{-1}X_1'\sigma_{(1)}\varepsilon_1\\
&= q^2\big(\hat{\beta}_{(1)} - \beta_{(1)}\big)'X_2'\Omega_2^{-1}X_2\big(\hat{\beta}_{(1)} - \beta_{(1)}\big) = O_p(1).
\end{aligned}
\tag{a.9}
$$

Proof of equation (49):
$$\sup_{\gamma\in H}\varepsilon_2'P^2(\gamma)\varepsilon_2
\le \lambda_{\max}\big(P(\gamma)\big)\,\varepsilon_2'P(\gamma)\varepsilon_2
\le \varepsilon_2'X_2(X_2'X_2)^{-1}X_2'\varepsilon_2 = O_p(1). \tag{a.10}$$

Proof of equation (50):
$$
\begin{aligned}
A'\Omega_2^{-1/2}P(\gamma)\varepsilon_2
&= \frac{\gamma}{q^2}\,Y_1'X_1\Big(\tfrac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_2'\Omega_2^{-1}X_2\Big(\tfrac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_2'\varepsilon_2\\
&\le \frac{\gamma}{q^2}\,Y_1'X_1\big(X_2'X_2\big)^{-1}X_2'\Omega_2^{-1}X_2\Big(\tfrac{\gamma}{q^2}X_1'X_1\Big)^{-1}X_2'\varepsilon_2\\
&= \frac{1}{\sigma_{(2)}}\,Y_1'X_1\big(X_1'X_1\big)^{-1}X_2'\varepsilon_2
= Y_1'X_1\big(X_1'X_1\big)^{-1}X_2'\Omega_2^{-1}\varepsilon_2
= \hat{\beta}_{(1)}'X_2'\Omega_2^{-1}\varepsilon_2\\
&= O_p(1)\cdot O_p(\sqrt{T}) = O_p(\sqrt{T}).
\end{aligned}
\tag{a.11}
$$

Proof of equation (51):

Using the Cauchy-Schwarz inequality we have
$$
\begin{aligned}
\Big[\big(P(\gamma)\mu - \mu\big)'\Omega_2^{-1/2}A\Big]^2
&\le \big(P(\gamma)\mu - \mu\big)'\big(P(\gamma)\mu - \mu\big)\,A'\Omega_2^{-1}A
\le R(\gamma)\,A'\Omega_2^{-1}A\\
&= R(\gamma)\,\frac{\gamma^2}{q^4}\,Y_1'X_1\Big(\tfrac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_2'\Omega_2^{-1}X_2\Big(\tfrac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_1'Y_1\\
&\le R(\gamma)\,\frac{\gamma^2}{q^4}\,Y_1'X_1\big(X_2'X_2\big)^{-1}X_2'\Omega_2^{-1}X_2\big(X_2'X_2\big)^{-1}X_1'Y_1\\
&= R(\gamma)\,\frac{\gamma^2}{\sigma^2_{(1)}}\,Y_1'X_1\big(X_2'X_2\big)^{-1}X_1'Y_1\\
&= O_p(T)\cdot O_p\Big(\frac{1}{T^2}\Big)\cdot O_p(T)\cdot O_p\Big(\frac{1}{T}\Big)\cdot O_p(T) = O_p(1),
\end{aligned}
\tag{a.12}
$$
where $Y_1'X_1(X_2'X_2)^{-1}X_1'Y_1 = O_p(T)$. Theorem 1 in Li et al. (2013) proves that the CV parameter converges to zero at the fast rate of 1/T. Because we keep $\gamma^2$ in the above simplification, we need to plug in the estimate of the weight derived from CV, $\hat{\gamma} = O_p(1/T)$. Therefore, $\big(P(\gamma)\mu - \mu\big)'\Omega_2^{-1/2}A = O_p(1)$.

Proof of equation (52):
$$
\begin{aligned}
\Big[\big(P(\gamma)\mu - \mu\big)'P(\gamma)\varepsilon_2\Big]^2
&\le \big(P(\gamma)\mu - \mu\big)'\big(P(\gamma)\mu - \mu\big)\,\varepsilon_2'P^2(\gamma)\varepsilon_2
\le R(\gamma)\,\varepsilon_2'P^2(\gamma)\varepsilon_2\\
&\le R(\gamma)\,\lambda_{\max}\big(P(\gamma)\big)\,\varepsilon_2'P(\gamma)\varepsilon_2
\le R(\gamma)\,\varepsilon_2'P(\gamma)\varepsilon_2 = O_p(T)\cdot O_p(1) = O_p(T).
\end{aligned}
\tag{a.13}
$$
So $\big(P(\gamma)\mu - \mu\big)'P(\gamma)\varepsilon_2 = O_p(\sqrt{T})$.

Proof of equation (53):
$$\sup_{\gamma\in H}\operatorname{tr}\big(P(\gamma)\big) \le \operatorname{tr}\big(X_2(X_2'X_2)^{-1}X_2'\big) = k = O_p(1). \tag{a.14}$$

Proof of equation (54):
$$
\operatorname{tr}\big(P^2(\gamma)\big)
= \operatorname{tr}\Big[X_2\Big(\tfrac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_2'X_2\Big(\tfrac{\gamma}{q^2}X_1'X_1 + X_2'X_2\Big)^{-1}X_2'\Big]
\le \operatorname{tr}\Big[X_2(X_2'X_2)^{-1}X_2'X_2(X_2'X_2)^{-1}X_2'\Big] = k = O_p(1). \tag{a.15}
$$

B Appendix: Additional empirical results

B.1 Empirical results based on monthly data

The monthly data run from January 1927 to December 2018, which gives a time-series dimension of T = 1104. Table 13 reports the MSFEs, which are generated recursively using different initial in-sample estimation window sizes of 276, 396 and 516 months, so that the beginning of the various forecast evaluation periods runs from January 1950 to January 1970. The statistical significance of the improvements in predictive accuracy is reported in Table 14. Based on these results, the MSFE of the Stein-like combined estimator is no higher than that of any other estimator in the table, and the R2 values are significant at the 5% level in most cases. As mentioned by Campbell and Thompson (2008), for monthly data an R2 value as small as 0.5% reflects strong out-performance. The out-performance of the Stein-like estimator over the PPP estimator is not significant only for h = 1 and h = 4. Moreover, the Stein-like estimator out-performs the historical average for all forecast horizons and for any choice of the initial in-sample estimation period. Figure 3 displays the CDSFE against the different benchmark models for the monthly data.

Table 13: Out-of-sample forecasting performance with monthly data

h n2 MSFEα MSFEκ MSFEγ MSFEBP MSFEPPP MSFEHA

1 1950:01-2018:12 0.0019 0.0019 0.0021 0.0020 0.0329 0.0031

1960:01-2018:12 0.0020 0.0020 0.0022 0.0021 0.0382 0.0034

1970:01-2018:12 0.0022 0.0022 0.0023 0.0022 0.0458 0.0034

2 1950:01-2018:12 0.0019 0.0020 0.0022 0.0021 0.0027 0.0031

1960:01-2018:12 0.0020 0.0021 0.0023 0.0021 0.0030 0.0034

1970:01-2018:12 0.0022 0.0022 0.0024 0.0023 0.0033 0.0034

3 1950:01-2018:12 0.0019 0.0020 0.0022 0.0021 0.0021 0.0031

1960:01-2018:12 0.0020 0.0021 0.0023 0.0022 0.0023 0.0034

1970:01-2018:12 0.0022 0.0022 0.0025 0.0023 0.0025 0.0034

4 1950:01-2018:12 0.0019 0.0021 0.0022 0.0021 0.0029 0.0031

1960:01-2018:12 0.0021 0.0022 0.0023 0.0023 0.0032 0.0034

1970:01-2018:12 0.0022 0.0023 0.0024 0.0023 0.0036 0.0034

Note: See the footnote for Table 11.


Table 14: Statistical significance of improved predictive accuracy with monthly data

h n2 R2κ R2γ R2BP R2PPP R2HA

1 1950:01-2018:12 0.0000 0.0952 0.0500 0.9422 0.3871(0.3434) (0.0021) (0.0087) (0.1531) (0.0000)

1960:01-2018:12 0.0000 0.0909 0.0476 0.9476 0.4118(0.4209) (0.0065) (0.0429) (0.1531) (0.0000)

1970:01-2018:12 0.0000 0.0435 0.0000 0.9520 0.3529(0.7506) (0.0544) (0.2134) (0.1531) (0.0000)

2 1950:01-2018:12 0.0500 0.1364 0.0952 0.2963 0.3871(0.0482) (0.0000) (0.0007) (0.0463) (0.0000)

1960:01-2018:12 0.0476 0.1304 0.0476 0.3333 0.4118(0.1043) (0.0000) (0.0068) (0.0470) (0.0000)

1970:01-2018:12 0.0000 0.0833 0.0435 0.3333 0.3529(0.7360) (0.0003) (0.1432) (0.0472) (0.0000)

3 1950:01-2018:12 0.0500 0.1364 0.0952 0.0952 0.3871(0.0311) (0.0000) (0.0006) (0.0150) (0.0000)

1960:01-2018:12 0.0476 0.1304 0.0909 0.1304 0.4118(0.0464) (0.0000) (0.0019) (0.0170) (0.0000)

1970:01-2018:12 0.0000 0.1200 0.0435 0.1200 0.3529(0.5624) (0.0000) (0.0543) (0.0181) (0.0000)

4 1950:01-2018:12 0.0952 0.1364 0.0952 0.3448 0.3871(0.0017) (0.0000) (0.0001) (0.1356) (0.0000)

1960:01-2018:12 0.0455 0.0870 0.0870 0.3438 0.3824(0.0034) (0.0000) (0.0004) (0.1373) (0.0000)

1970:01-2018:12 0.0435 0.0833 0.0435 0.3889 0.3529(0.2647) (0.0002) (0.0176) (0.1399) (0.0000)

Note: See the footnote for Table 12.

B.2 Empirical results based on annual data

The annual data run from 1927 to the end of 2018, which gives a time-series dimension of T = 92. We report the results for different in-sample estimation periods, corresponding to the early 1970s (n1 = 43) through the early 1990s (n1 = 63). As for the quarterly and monthly data, Table 15 reports the MSFE for the different forecast horizons h. We also report the R2-based comparison of MSFEs and, to show whether the improvement is significant, the p-value of the Diebold and Mariano (1995) and West (1996) statistic in brackets in Table 16. The results are similar to those obtained with the quarterly and monthly data: the Stein-like combined estimator has a lower MSFE than the other estimators for all h, with the exception of the historical average, but the under-performance of the Stein-like combined estimator relative to the historical average is not significant.


Table 15: Out-of-sample forecasting performance with annual data

h n2 MSFEα MSFEκ MSFEγ MSFEBP MSFEPPP MSFEHA

1 1970-2018 0.0300 0.0381 0.0323 0.0323 0.0337 0.0281

1980-2018 0.0265 0.0278 0.0288 0.0288 0.0291 0.0235

1990-2018 0.0277 0.0279 0.0279 0.0279 0.028 0.0278

2 1970-2018 0.0360 0.0516 0.0520 0.0520 0.0356 0.0268

1980-2018 0.0313 0.0506 0.0508 0.0507 0.0259 0.0240

1990-2018 0.0282 0.0290 0.0289 0.0288 0.0285 0.0279

3 1970-2018 0.0288 0.0368 0.0364 0.0363 0.0315 0.0231

1980-2018 0.0270 0.0365 0.0360 0.0359 0.0260 0.0234

1990-2018 0.0268 0.0270 0.0278 0.0278 0.0271 0.0279

4 1970-2018 0.0320 0.0412 0.0407 0.0407 0.3013 0.0226

1980-2018 0.0277 0.0351 0.0356 0.0356 0.3378 0.0240

1990-2018 0.0275 0.0276 0.0284 0.0284 0.0276 0.0290

Note: See the footnote for Table 11.


Table 16: Statistical significance of improved predictive accuracy with annual data

h n2 R2κ R2γ R2BP R2PPP R2HA

1 1970-2018 0.2126 0.0712 0.0712 0.1098 -0.0676(0.1051) (0.2105) (0.2093) (0.1289) (0.6752)

1980-2018 0.0468 0.0799 0.0799 0.0893 -0.1277(0.3129) (0.2162) (0.2189) (0.1810) (0.8534)

1990-2018 0.0072 0.0072 0.0072 0.0107 0.0036(0.4096) (0.3589) (0.3744) (0.2276) (0.4698)

2 1970-2018 0.3023 0.3077 0.3077 -0.0112 -0.3433(0.1354) (0.1293) (0.1304) (0.5332) (0.9643)

1980-2018 0.3814 0.3839 0.3826 -0.2085 -0.3042(0.1242) (0.1222) (0.1232) (0.8772) (0.9437)

1990-2018 0.0276 0.0242 0.0208 0.0105 -0.0108(0.1445) (0.2247) (0.2411) (0.2345) (0.6067)

3 1970-2018 0.2174 0.2088 0.2066 0.0857 -0.2468(0.1074) (0.1184) (0.1215) (0.1845) (0.9468)

1980-2018 0.2603 0.2500 0.2479 -0.0385 -0.1538(0.1050) (0.1148) (0.1178) (0.8068) (0.9334)

1990-2018 0.0074 0.0360 0.0360 0.0111 0.0394(0.2958) (0.0132) (0.0157) (0.1414) (0.1843)

4 1970-2018 0.2233 0.2138 0.2138 0.8938 -0.4159(0.0152) (0.0317) (0.0311) (0.1475) (0.9778)

1980-2018 0.2108 0.2219 0.2219 0.9180 -0.1542(0.0664) (0.0738) (0.0738) (0.1558) (0.8865)

1990-2018 0.0036 0.0317 0.0317 0.0036 0.0517(0.1616) (0.2073) (0.2079) (0.3950) (0.1253)

Note: See the footnote for Table 12.
