lecture notes - universiteit leiden

TIME SERIES

A.W. van der Vaart

Universiteit Leiden

version 242013

PREFACE

These are lecture notes for the courses “Tijdreeksen”, “Time Series” and “FinancialTimeSeries”. The material is more than can be treated in a one-semester course. Seenext section for the exam requirements.

Parts marked by an asterisk “*” do not belong to the exam requirements.Exercises marked by a single asterisk “*” are either hard or to be considered of

secondary importance. Exercises marked by a double asterisk “**” are questions to whichI do not know the solution.

Amsterdam, 1995–2010 (revisions, extensions),

A.W. van der Vaart

LITERATURE

The following list is a small selection of books on time series analysis. Azencott/Dacunha-Castelle and Brockwell/Davis are close to the core material treated in these notes. Thefirst book by Brockwell/Davis is a standard book for graduate courses for statisticians.Their second book is prettier, because it lacks the overload of formulas and computationsof the first, but is of a lower level.

Chatfield is less mathematical, but perhaps of interest from a data-analysis pointof view. Hannan and Deistler is tough reading, and on systems, which overlaps withtime series analysis, but is not focused on statistics. Hamilton is a standard work usedby econometricians; be aware, it has the existence results for ARMA processes wrong.Brillinger’s book is old, but contains some material that is not covered in the later works.Rosenblatt’s book is new, and also original in its choice of subjects. Harvey is a proponentof using system theory and the Kalman filter for a statistical time series analysis. Hisbook is not very mathematical, and a good background to state space modelling.

Most books lack a treatment of developments of the last 10–15 years, such asGARCH models, stochastic volatility models, or cointegration. Mills and Gourierouxfill this gap to some extent. The first contains a lot of material, including examples fit-ting models to economic time series, but little mathematics. The second appears to bewritten for a more mathematical audience, but is not completely satisfying. For instance,its discussion of existence and stationarity of GARCH processes is incomplete, and thepresentation is mathematically imprecise at many places. An alternative to these booksare several review papers on volatility models, such as Bollerslev et al., Ghysels et al.,and Shepard. Besides introductory discussion, also inclusing empirical evidence, thesehave extensive lists of references for further reading.

The book by Taniguchi and Kakizawa is unique in its emphasis on asymptotic theory,including some results on local asymptotic normality. It is valuable as a resource.

[1] Azencott, R. and Dacunha-Castelle, D., (1984). Series d’Observations Irregulieres.Masson, Paris.

[2] Brillinger, D.R., (1981). Time Series Analysis: Data Analysis and Theory. Holt,Rinehart & Winston.

[3] Bollerslev, T., Chou, Y.C. and Kroner, K., (1992). ARCH modelling in finance:a selective review of the theory and empirical evidence. J. Econometrics 52,201–224.

[4] Bollerslev, R., Engle, R. and Nelson, D., (ARCH models). Handbook of Econo-metrics IV (eds: RF Engle and D McFadden). North Holland, Amsterdam.

[5] Brockwell, P.J. and Davis, R.A., (1991). Time Series: Theory and Methods.Springer.

[6] Chatfield, C., (1984). The Analysis of Time Series: An Introduction. Chapman andHall.

[7] Gourieroux, C., (1997). ARCH Models and Financial Applications. Springer.

[8] Dedecker, J., Doukhan, P., Lang, G., Leon R., J.R., Louhichi, S. and Prieur, C.,(2007). Weak Dependence. Springer.

[9] Gnedenko, B.V., (1962). Theory of Probability. Chelsea Publishing Company.

[10] Hall, P. and Heyde, C.C., (1980). Martingale Limit Theory and Its Applications.Academic Press, New York.

[11] Hamilton, J.D., (1994). Time Series Analysis. Princeton.

[12] Hannan, E.J. and Deistler, M., (1988). The Statistical Theory of Linear Systems.John Wiley, New York.

[13] Harvey, A.C., (1989). Forecasting, Structural Time Series Models and the KalmanFilter. Cambridge University Press.

[14] Mills, T.C., (1999). The Econometric Modelling of Financial Times Series. Cam-bridge University Press.

[15] Rio, A., (2000). Theorie Asymptotique des Processus Aleatoires FaiblementDependants. Springer Verlag, New York.

[16] Rosenblatt, M., (2000). Gaussian and Non-Gaussian Linear Time Series and Ran-dom Fields. Springer, New York.

[17] Rudin, W., (1974). Real and Complex Analysis. McGraw-Hill.

[18] Taniguchi, M. and Kakizawa, Y., (2000). Asymptotic Theory of Statistical Infer-ence for Time Series. Springer.

[19] van der Vaart, A.W., (1998). Asymptotic Statistics. Cambrdige University Press.

CONTENTS

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.1. Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2. Transformations and Filters . . . . . . . . . . . . . . . . . . . . 121.3. Complex Random Variables . . . . . . . . . . . . . . . . . . . . 181.4. Multivariate Time Series . . . . . . . . . . . . . . . . . . . . . . 21

2. Hilbert Spaces and Prediction . . . . . . . . . . . . . . . . . . . . . 222.1. Hilbert Spaces and Projections . . . . . . . . . . . . . . . . . . . 222.2. Square-integrable Random Variables . . . . . . . . . . . . . . . . . 272.3. Linear Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 292.4. Innovations Algorithm . . . . . . . . . . . . . . . . . . . . . . . 322.5. Nonlinear Prediction . . . . . . . . . . . . . . . . . . . . . . . 332.6. Partial Auto-Correlation . . . . . . . . . . . . . . . . . . . . . . 34

3. Stochastic Convergence . . . . . . . . . . . . . . . . . . . . . . . . 363.1. Basic theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 363.2. Convergence of Moments . . . . . . . . . . . . . . . . . . . . . . 393.3. Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.4. Stochastic o and O symbols . . . . . . . . . . . . . . . . . . . . 413.5. Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423.6. Cramer-Wold Device . . . . . . . . . . . . . . . . . . . . . . . 433.7. Delta-method . . . . . . . . . . . . . . . . . . . . . . . . . . 443.8. Lindeberg Central Limit Theorem . . . . . . . . . . . . . . . . . . 453.9. Minimum Contrast Estimators . . . . . . . . . . . . . . . . . . . 46

4. Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . 494.1. Finite Dependence . . . . . . . . . . . . . . . . . . . . . . . . 504.2. Linear Processes . . . . . . . . . . . . . . . . . . . . . . . . . 514.3. Strong Mixing . . . . . . . . . . . . . . . . . . . . . . . . . . 524.4. Uniform Mixing . . . . . . . . . . . . . . . . . . . . . . . . . 574.5. Martingale Differences . . . . . . . . . . . . . . . . . . . . . . . 594.6. Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624.7. Proof of Theorem 4.7 . . . . . . . . . . . . . . . . . . . . . . . 63

5. Nonparametric Estimation of Mean and Covariance . . . . . . . . . . . . 675.1. Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685.2. Auto Covariances . . . . . . . . . . . . . . . . . . . . . . . . . 725.3. Auto Correlations . . . . . . . . . . . . . . . . . . . . . . . . . 765.4. Partial Auto Correlations . . . . . . . . . . . . . . . . . . . . . 79

6. Spectral Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 826.1. Spectral Measures . . . . . . . . . . . . . . . . . . . . . . . . . 836.2. Aliasing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 916.3. Nonsummable filters . . . . . . . . . . . . . . . . . . . . . . . . 936.4. Spectral Decomposition . . . . . . . . . . . . . . . . . . . . . . 946.5. Multivariate Spectra . . . . . . . . . . . . . . . . . . . . . . 1016.6. Prediction in the Frequency Domain . . . . . . . . . . . . . . . . 104

7. Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . 105

7.1. Stationary Sequences . . . . . . . . . . . . . . . . . . . . . . 1057.2. Ergodic Theorem . . . . . . . . . . . . . . . . . . . . . . . . 1067.3. Mixing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1117.4. Subadditive Ergodic Theorem . . . . . . . . . . . . . . . . . . 113

8. ARIMA Processes . . . . . . . . . . . . . . . . . . . . . . . . . . 1158.1. Backshift Calculus . . . . . . . . . . . . . . . . . . . . . . . 1158.2. ARMA Processes . . . . . . . . . . . . . . . . . . . . . . . . 1178.3. Invertibility . . . . . . . . . . . . . . . . . . . . . . . . . . 1218.4. Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 1238.5. Auto Correlation and Spectrum . . . . . . . . . . . . . . . . . . 1258.6. Existence of Causal and Invertible Solutions . . . . . . . . . . . . 1288.7. Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1308.8. ARIMA Processes . . . . . . . . . . . . . . . . . . . . . . . . 1318.9. VARMA Processes . . . . . . . . . . . . . . . . . . . . . . . 133

8.10. ARMA-X Processes . . . . . . . . . . . . . . . . . . . . . . . 1349. GARCH Processes . . . . . . . . . . . . . . . . . . . . . . . . . 135

9.1. Linear GARCH . . . . . . . . . . . . . . . . . . . . . . . . . 1379.2. Linear GARCH with Leverage and Power GARCH . . . . . . . . . 1509.3. Exponential GARCH . . . . . . . . . . . . . . . . . . . . . . 1519.4. GARCH in Mean . . . . . . . . . . . . . . . . . . . . . . . . 152

10. State Space Models . . . . . . . . . . . . . . . . . . . . . . . . . 15310.1. Hidden Markov Models and State Spaces . . . . . . . . . . . . . 15410.2. Kalman Filtering . . . . . . . . . . . . . . . . . . . . . . . . 15910.3. Nonlinear Filtering . . . . . . . . . . . . . . . . . . . . . . . 16310.4. Stochastic Volatility Models . . . . . . . . . . . . . . . . . . . 166

11. Moment and Least Squares Estimators . . . . . . . . . . . . . . . . 17211.1. Yule-Walker Estimators . . . . . . . . . . . . . . . . . . . . . 17211.2. Moment Estimators . . . . . . . . . . . . . . . . . . . . . . . 17911.3. Least Squares Estimators . . . . . . . . . . . . . . . . . . . . 190

12. Spectral Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 19512.1. Finite Fourier Transform . . . . . . . . . . . . . . . . . . . . . 19612.2. Periodogram . . . . . . . . . . . . . . . . . . . . . . . . . . 19712.3. Estimating a Spectral Density . . . . . . . . . . . . . . . . . . 20212.4. Estimating a Spectral Distribution . . . . . . . . . . . . . . . . 207

13. Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . 21213.1. General Likelihood . . . . . . . . . . . . . . . . . . . . . . . 21313.2. Asymptotics under a Correct Model . . . . . . . . . . . . . . . . 21513.3. Asymptotics under Misspecification . . . . . . . . . . . . . . . . 21913.4. Gaussian Likelihood . . . . . . . . . . . . . . . . . . . . . . . 22113.5. Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . 230

1Introduction

Oddly enough, a statistical time series is a mathematical sequence, not a series. In thisbook we understand a time series to be a doubly infinite sequence

. . . , X−2, X−1, X0, X1, X2, . . .

of random variables or random vectors. We refer to the index t of Xt as time and thinkof Xt as the state or output of a stochastic system at time t. The interpretation of theindex as “time” is unimportant for the mathematical theory, which is concerned withthe joint distribution of the variables only, but the implied ordering of the variables isusually essential. Unless stated otherwise, the variable Xt is assumed to be real valued,but we shall also consider sequences of random vectors Xt, or variables with values inthe complex numbers. We often write “the time series Xt” rather than use the correct(Xt: t ∈ Z), and instead of “time series” we also speak of “process”, “stochastic process”,or “signal”.

We are interested in the joint distribution of the variables Xt. The easiest way toensure that these variables, and other variables that we may introduce, possess joint laws,is to assume that they are defined as measurable maps on a single underlying probabilityspace. This is what we meant, but did not say in the preceding definition. Also in generalwe make the underlying probability space explicit only if otherwise confusion might arise,and then denote it by (Ω,U ,P), with ω denoting a typical element of Ω and Xt(ω) therealization of the variable Xt.

1.1 Stationarity

Time series theory is a mixture of probabilistic and statistical concepts. The probabilisticpart is to study and characterize probability distributions of sets of variables Xt that willtypically be dependent. The statistical problem is to determine the probability distribu-

1.1: Stationarity 3

tion of the time series given observations X1, . . . , Xn at times 1, 2, . . . , n. The resultingstochastic model can be used in two ways:

- understanding the stochastic system;- predicting the “future”, i.e. Xn+1, Xn+2, . . . ,.In order to have any chance of success at these tasks it is necessary to assume some a-

priori structure of the time series. Indeed, if the Xt could be completely arbitrary randomvariables, then (X1, . . . , Xn) would constitute a single observation from a completelyunknown distribution on R

n. Conclusions about this distribution would be impossible,let alone about the distribution of the future values Xn+1, Xn+2, . . ..

A basic type of structure is stationarity. This comes in two forms.

1.1 Definition. The time series Xt is strictly stationary if the distribution (on Rh+1)

of the vector (Xt, Xt+1, . . . , Xt+h) is independent of t, for every h ∈ N.

1.2 Definition. The time series Xt is stationary (or more precisely second order sta-tionary) if EXt and EXt+hXt exist and are finite and do not depend on t, for everyh ∈ N.

It is clear that a strictly stationary time series with finite second moments is alsostationary. For a stationary time series the auto-covariance and auto-correlation at lagh ∈ Z are defined by

γX(h) = cov(Xt+h, Xt),

ρX(h) = ρ(Xt+h, Xt) =γX(h)

γX(0).

The auto-covariance and auto-correlation are functions γX :Z → R and ρX :Z → [−1, 1].Both functions are symmetric about 0. Together with the mean µ = EXt, they determinethe first and second moments of the stationary time series. Much of time series (toomuch?) concerns this “second order structure”.

The autocovariance and autocorrelations functions are measures of (linear) de-pendence between the variables at different time instants, except at lag 0, whereγX(0) = varXt gives the variance of (any) Xt, and ρX(0) = 1.

1.3 Example (White noise). A doubly infinite sequence of independent, identicallydistributed random variables Xt is a strictly stationary time series. Its auto-covariancefunction is, with σ2 = varXt,

γX(h) =

σ2, if h = 0,0, if h 6= 0.

Any stationary time series Xt with mean zero and covariance function of this type iscalled a white noise series. Thus any mean-zero i.i.d. sequence with finite variances is awhite noise series. The converse is not true: there exist white noise series that are notstrictly stationary.

The name “noise” should be intuitively clear. We shall see why it is called “white”when discussing spectral theory of time series in Chapter 6.

4 1: Introduction

0 50 100 150 200 250

-2-1

01

2

Figure 1.1. Realization of a Gaussian white noise series of length 250.

1.4 EXERCISE. Construct a white noise sequence that is not strictly stationary.

White noise series as in the preceding example are important building blocks toconstruct other series, but from the point of view of time series analysis they are notso interesting. More interesting are series of dependent random variables, so that, to acertain extent, the future can be predicted from the past.

1.5 Example (Deterministic trigonometric series). Let A and B be given, uncorre-lated random variables with mean zero and variance σ2, and let λ be a given number.Then

Xt = A cos(tλ) +B sin(tλ)

defines a stationary time series. Indeed, EXt = 0 and

γX(h) = cov(Xt+h, Xt)

= cos(

(t+ h)λ)

cos(tλ) varA+ sin(

(t+ h)λ)

sin(tλ) varB

= σ2 cos(hλ).

Even though A and B are random variables, this type of time series is called deterministicin time series theory. is because once A and B have been determined (at time −∞ say),

1.1: Stationarity 5

the process behaves as a deterministic trigonometric function. This type of time series isan important building block to model cyclic events in a system, but it is not the typicalexample of a statistical time series that we study in this course. Predicting the future istoo easy in this case.

1.6 Example (Moving average). Given a white noise series Zt with variance σ2 anda number θ set

Xt = Zt + θZt−1.

This is called a moving average of order 1. The series is stationary with EXt = 0 and

γX(h) = cov(Zt+h + θZt+h−1, Zt + θZt−1) =

(1 + θ2)σ2, if h = 0,θσ2, if h = ±1,0, otherwise.

Thus Xs and Xt are uncorrelated whenever s and t are two or more time instants apart.We speak of short range dependence and say that the time series has short memory.Figure 1.2 shows a realization of a moving average series; at first glance it does not lookthat different from the realization of an i.i.d. sequence.

If the Zt are an i.i.d. sequence, then the moving average is strictly stationary.A natural generalization are higher order moving averages of the form Xt = Zt +

θ1Zt−1 + · · ·+ θqZt−q.

1.7 EXERCISE. Prove that the series Xt in Example 1.6 is strictly stationary if Zt is astrictly stationary sequence.

1.8 Example (Autoregression). Given a white noise series Zt with variance σ2 and anumber θ consider the equations

Xt = θXt−1 + Zt, t ∈ Z.

To give this equation a clear meaning, we take the white noise series Zt as defined onsome probability space (Ω,U ,P), and look for a solution Xt to the equation: a time seriesdefined on the same probability space that solves the equation “pointwise in ω” (at leastfor almost every ω).

The autoregressive equation does not define Xt, but in general has many solutions.Indeed, we can define the sequence Zt and the variable X0 in some arbitrary way on thegiven probability space and next define the remaining variables Xt for t ∈ Z \ 0 bythe equation. However, suppose that we are only interested in stationary solutions. Thenthere is either no solution or a unique solution, depending on the value of θ, as we shallnow prove.

Suppose first that |θ| < 1. By iteration we find that

Xt = θ(θXt−2 + Zt−1) + Zt = · · ·= θkXt−k + θk−1Zt−k+1 + · · ·+ θZt−1 + Zt.

6 1: Introduction

0 50 100 150 200 250

-3-2

-10

12

3

Figure 1.2. Realization of length 250 of the moving average series Xt = Zt − 0.5Zt−1 for Gaussian whitenoise Zt.

For a stationary sequence Xt we have that E(θkXt−k)2 = θ2kEX20 → 0 as k → ∞. This

suggests that a solution of the equation is given by the infinite series

Xt = Zt + θZt−1 + θ2Zt−2 + · · · =∞∑

j=0

θjZt−j .

We show below in Lemma 1.29 that the series on the right side converges almost surely,so that the preceding display indeed defines some random variable Xt. This is a “movingaverage of infinite order”. We can check directly, by substitution in the equation, that Xt

satisfies the auto-regressive relation. (For every ω for which the series converges; hencein general only almost surely. We consider this to be good enough.)

If we are allowed to change expectations and infinite sums, then we see that

EXt =

∞∑

j=0

θjEZt−j = 0,

γX(h) =∞∑

i=0

∞∑

j=0

θiθjEZt+h−iZt−j =∞∑

j=0

θh+jθjσ21h+j≥0 =θ|h|

1− θ2σ2.

1.1: Stationarity 7

We justify these computations in Lemma 1.29. It follows that Xt is indeed a stationarytime series. In this example γX(h) 6= 0 for every h, so that every pair of variables Xs andXt are dependent. However, because γX(h) → 0 at exponential speed as h → ∞, thisseries is still considered to be short-range dependent. Note that γX(h) oscillates if θ < 0and decreases monotonely if θ > 0. The realization in Figure 1.4 shows more dependencethan in a moving average series.

For θ = 1 the situation is very different: no stationary solution exists. To see thisnote that the equation obtained before by iteration now takes the form, for k = t,

Xt = X0 + Z1 + · · ·+ Zt.

Thus Xt is a random walk. It satisfies var(Xt −X0) = tσ2 → ∞ as t→ ∞. However, bythe triangle inequality we have that

sd(Xt −X0) ≤ sdXt + sdX0 = 2 sdX0,

for a stationary sequence Xt. Hence no stationary solution exists. The situation for θ = 1is explosive: the randomness increases significantly as t→ ∞ due to the introduction ofa new Zt for every t. Figure 1.3 illustrates the nonstationarity of a random walk.

The cases θ = −1 and |θ| > 1 are left as exercises.The auto-regressive time series of order one generalizes naturally to auto-regressive

series of the form Xt = φ1Xt−1 + · · ·φpXt−p + Zt. The existence of stationary solutionsXt to this equation depends on the locations of the roots of the polynomial z 7→ 1 −φ1z − φ2z

2 − · · · − φpzp, as is discussed in Chapter 8.

1.9 EXERCISE. Consider the cases θ = −1 and |θ| > 1. Show that in the first case thereis no stationary solution and in the second case there is a unique stationary solution.[For |θ| > 1 mimic the argument for |θ| < 1, but with time reversed: iterate Xt−1 =(1/θ)Xt − Zt/θ.]

1.10 Example (GARCH). A time series Xt is called a GARCH(1, 1) process if, forgiven nonnegative constants α, θ and φ, and a given i.i.d. sequence Zt with mean zeroand unit variance, it satisfies the system of equations

(1.1)Xt = σtZt,

σ2t = α+ φσ2

t−1 + θX2t−1.

The variable σt is interpreted as the scale or volatility of the time series Xt at time t. Bythe second equation this is modelled as dependent on the “past”. To make this structureexplicit it is often included (or thought to be automatic) in the definition of a GARCHprocess that, for every t,(i) σt is a measurable function of Xt−1, Xt−2, . . .;(ii) Zt is independent of Xt−1, Xt−2, . . ..In view of the first GARCH equation, properties (i)-(ii) imply that

EXt = EσtEZt = 0,

EXsXt = E(Xsσt)EZt = 0, (s < t).

8 1: Introduction

0 50 100 150 200 250

-10

-50

5

Figure 1.3. Realization of a random walk Xt = Zt + · · ·+ Z0 of length 250 for Zt Gaussian white noise.

Therefore, a GARCH process with the extra properties (i)-(ii) is a white noise process.Furthermore,

E(Xt|Xt−1, Xt−2, . . .) = σtEZt = 0,

E(X2t |Xt−1, Xt−2, . . .) = σ2

tEZ2t = σ2

t .

The first equation shows that Xt is a “martingale difference series”: the past does nothelp to predict future values of the time series. The second equation exhibits σ2

t as theconditional variance of Xt given the past. Because σ2

t is a nontrivial time series by thesecond GARCH equation (if φ+ θ 6= 0), the time series Xt is not an i.i.d. sequence.

Because the conditional mean of Xt given the past is zero, a GARCH process willfluctuate around the value 0. A large deviation |Xt−1| from 0 at time t−1 will cause a largeconditional variance σ2

t = α+ θX2t−1 + φσ2

t−1 at time t, and then the deviation of Xt =σtZt from 0 will tend to be large as well. Similarly, small deviations from 0 will tend to befollowed by other small deviations. Thus a GARCH process will alternate between periodsof big fluctuations and periods of small fluctuations. This is also expressed by sayingthat a GARCH process exhibits volatility clustering. Volatility clustering is commonlyobserved in time series of stock returns. The GARCH(1, 1) process has become a popularmodel for such time series. Figure 1.5 shows a realization. The signs of the Xt are equalto the signs of the Zt and hence will be independent over time.

1.1: Stationarity 9

0 50 100 150 200 250

-4-2

02

4

Figure 1.4. Realization of length 250 of the stationary solution to the equation Xt = 0.5Xt−1+0.2Xt−2+Zt

for Zt Gaussian white noise.

The abbreviation GARCH is for “generalized auto-regressive conditional het-eroscedasticity”: the conditional variances are not constant (not homoscedastic), anddepend on the past through an auto-regressive scheme. Typically, the conditional vari-ances σ2

t are not directly observed, but must be inferred from the observed sequenceXt.

Being a white noise process, a GARCH process can itself be used as input in an-other scheme, such as an auto-regressive or a moving average series, leading to e.g.AR-GARCH processes. There are also many generalizations of the GARCH structure.In a GARCH(p, q) process, the conditional variances σ2

t are allowed to depend onσ2t−1, . . . , σ

2t−p and X2

t−1, . . . , X2t−q. A GARCH (0, q) process is also called an ARCH

process. The rationale of using the squares X2t is that these are nonnegative and simple;

there are many variations using other functions.As the auto-regressive relationship, the defining GARCH equations are recursive,

and it is not obvious that solutions to the pair of defining equations, and, if so, whetherthe solution is unique and satisfies (i)-(ii). If we are only interested in the process fort ≥ 0, then we might complement the equations with an initial value for σ2

0 , and solvethe equations by iteration, taking a suitable sequence Zt as given. Alternatively, we maystudy the existence of stationary solutions Xt to the equations.

10 1: Introduction

We now show that there exists a stationary solution Xt to the GARCH equationsif θ+ φ < 1, which then is unique and automatically possesses the properties (i)-(ii). Byiterating the GARCH relation we find that, for every n ≥ 0,

σ2t = α+ (φ+ θZ2

t−1)σ2t−1 = α+ α

n∑

j=1

(φ+ θZ2t−1)× · · · × (φ+ θZ2

t−j)

+ (φ+ θZ2t−1)× · · · × (φ+ θZ2

t−n−1)σ2t−n−1.

The sequence(

(φ+θZ2t−1) · · · (φ+θZ2

t−n−1))∞n=1

, which consists of nonnegative variables

with means (φ + θ)n+1, converges in probability to zero if θ + φ < 1. If the time seriesσ2t is bounded in probability as t→ ∞, then the term on the far right converges to zero

in probability as n→ ∞. Thus for a stationary solution (Xt, σt) we must have

(1.2) σ2t = α+ α

∞∑

j=1

(φ+ θZ2t−1)× · · · × (φ+ θZ2

t−j).

This series has positive terms, and hence is always well defined. It is easily checked thatthe series converges in mean if and only if θ + φ < 1 (cf. Lemma 1.27). Given an i.i.d.series Zt, we can then define a process Xt by first defining the conditional variances σ2

t

by (1.2), and next setting Xt = σtZt. It can be verified by substitution that this processXt solves the GARCH relationship (1.1). Hence a stationary solution to the GARCHequations exists if φ+ θ < 1.

Because Zt is independent of Zt−1, Zt−2, . . ., by (1.2) it is also independent ofσ2t−1, σ

2t−2, . . ., and hence also of Xt−1 = σt−1Zt−1, Xt−2 = σt−2Zt−2, . . .. This verifies

property (ii). In addition it follows that the time series Xt = σtZt is strictly stationary,being a fixed measurable transformation of (Zt, Zt−1, . . .) for every t.

By iterating the auto-regressive relation σ2t = φσ2

t−1 +Wt, with Wt = α + θX2t−1,

in the same way as in Example 1.8, we also find that for the stationary solution σ2t =

∑∞j=0 φ

jWt−j . Hence σt is σ(Xt−1, Xt−2, . . .)-measurable: property (i) is valid.An inspection of the preceding argument shows that a strictly stationary solution

exists under a weaker condition than φ+ θ < 1. This is because the infinite series (1.2)may converge almost surely, without converging in mean (see Exercise 1.14). We shallstudy this further in Chapter 9.

1.11 EXERCISE. Let θ + φ ∈ [0, 1) and 1− κθ2 − φ2 − 2θφ > 0, where κ = EZ4t . Show

that the second and fourth (marginal) moments of a stationary GARCH process aregiven by α/(1− θ − φ) and κα2

(

1− (θ + φ)2)

/(1− κθ2 − φ2 − 2θφ). From this computethe kurtosis of the GARCH process with standard normal Zt. [You can use (1.2), but itis easier to use the GARCH relations.]

1.12 EXERCISE. Show that EX4t = ∞ if 1− κθ2 − φ2 − 2θφ = 0.

1.13 EXERCISE. Suppose that the process Xt is square-integrable and satisfies theGARCH relation for an i.i.d. sequence Zt such that Zt is independent of Xt−1, Xt−2, . . .and such that σ2

t = E(X2t |Xt−1, Xt−2, . . .), for every t, and some α, φ, θ > 0. Show that

φ+ θ < 1. [Derive that EX2t = α+ α

∑nj=1(φ+ θ)j + (φ+ θ)n+1EX2

t−n−1.]

1.1: Stationarity 11

1.14 EXERCISE. Let Zt be an i.i.d. sequence with E log(Z2t ) < 0. Show that

∑∞j=0 Z

2t Z

2t−1 · · ·Z2

t−j <∞ almost surely. [Use the Law of Large Numbers.]

0 100 200 300 400 500

-50

5

Figure 1.5. Realization of the GARCH(1, 1) process with α = 0.1, φ = 0 and θ = 0.8 of length 500 for Zt

Gaussian white noise.

1.15 Example (Stochastic volatility). A general approach to obtain a time serieswith volatility clustering is to define Xt = σtZt for an i.i.d. sequence Zt and a process σtthat depends “positively on its past”. A GARCH model fits this scheme, but a simplerconstruction is to let σt depend only on its own past and independent noise. Because σtis to have an interpretation as a scale parameter, we restrain it to be positive. One wayto combine these requirements is to set

ht = θht−1 +Wt,

σ2t = eht ,

Xt = σtZt.

Here Wt is a white noise sequence, ht is a (stationary) solution to the auto-regressiveequation, and the process Zt is i.i.d. and independent of the process Wt. If θ > 0 andσt−1 = eht−1/2 is large, then σt = eht/2 will tend to be large as well. Hence the processXt will exhibit volatility clustering.

12 1: Introduction

If 0 < θ, 1, then there exists a strictly stationary solution ht, by Example 1.8, andthe process Xt is also strictly stationary. Because there is no recursion between σt andXt, existence of stationary versions is thus easily resolved, unlike for a GARC series.

The process ht will typically not be observed and for that reason is sometimes calledlatent. A “stochastic volatility process” of this type is an example of a (nonlinear) statespace model, discussed in Chapter 10. Rather than defining σt by an auto-regression inthe exponent, we may choose a different scheme. For instance, an EGARCH(p, 0) modelpostulates the relationship

log σt = α+

p∑

j=1

φj log σt−j .

This is not a stochastic volatility model, because it does not include a random distur-bance. The symmetric EGARCH (p, q) model repairs this by adding terms depending onthe past of the observed series Xt = σtZt, giving

log σt = α+

q∑

j=1

θj |Zt−j |+p∑

j=1

φj log σt−j .

This looks a bit GARCH like, and hence these processes their variants are much relatedto stochastic volatility models. In view of the recursive nature of the definitions of σtand Xt, they are perhaps more complicated.

1.2 Transformations and Filters

Many time series in real life are not stationary. Rather than modelling a nonstationarysequence, such a sequence is often transformed in a time series that is (assumed to be)stationary. The statistical analysis next focuses on the transformed series.

Two important deviations from stationarity are trend and seasonality. A trend is along-term, steady increase or decrease in the general level of the time series. A seasonalcomponent is a cyclic change in the level, the cycle length being for instance a year or aweek. Even though Example 1.5 shows that a perfectly cyclic series can be modelled asa stationary series, it is often considered wise to remove such perfect cycles from a givenseries before applying statistical techniques.

There are many ways in which a given time series can be transformed into a seriesthat is easier to analyse. Transforming individual variables Xt into variables f(Xt) by afixed function f (such as the logarithm) is a common technique to render the variablesmore stable. It does not transform a nonstationary series into a strictly stationary one,but it does change the autocovariance function, which may be a more useful summary forthe transformed series. Another simple technique is detrending by substracting a “bestfitting polynomial in t” of some fixed degree. This is commonly found by the method of

1.2: Transformations and Filters 13

0 50 100 150 200 250

-10

-50

5

Figure 1.6. Realization of length 250 of the stochastic volatility model Xt = eht/2Zt for a standardGaussian i.i.d. process Zt and a stationary auto-regressive process ht = 0.8ht−1 +Wt for a standard Gaussiani.i.d. process Wt.

least squares: given a nonstationary time series Xt we determine constants β0, . . . , βp byminimizing

(β0, . . . , βp) 7→n∑

t=1

(

Xt − β0 − β1t− · · · − βptp)2.

Next the time series Xt − β0 − β1t− · · · − βptp is assumed to be stationary.

A standard transformation for financial time series is to (log) returns, given by

logXt

Xt−1, or

Xt

Xt−1− 1.

If Xt/Xt−1 is close to unity for all t, then these transformations are similar, as log x ≈x− 1 for x ≈ 1. Because log(ect/ec(t−1)) = c, a log return can be intuitively interpretedas the exponent of exponential growth. Many financial time series exhibit an exponentialtrend, not always in the right direction for the owners of the corresponding assets.

A general method to transform a nonstationary sequence in a stationary one, advo-cated with much success in a famous book by Box and Jenkins, is filtering.

14 1: Introduction

1984 1986 1988 1990 1992

1.0

1.5

2.0

2.5

1984 1986 1988 1990 1992

-0.2

-0.1

0.0

0.1

Figure 1.7. Prices of Hewlett Packard on New York Stock Exchange and corresponding log returns.

1.16 Definition. The (linear) filter with filter coefficients (ψj : j ∈ Z) is the operationthat transforms a given time series Xt into the time series Yt =

∑

j∈Z ψjXt−j .

A linear filter is a moving average of infinite order. In Lemma 1.29 we give conditionsfor the infinite series to be well defined. All filters used in practice are finite filters: onlyfinitely many coefficients ψj are nonzero. Important examples are the difference filter

∇Xt = Xt −Xt−1,

and its repetitions ∇kXt = ∇∇k−1Xt defined recursely for k = 2, 3, . . .. Differencingwith a bigger lag gives the the seasonal difference filter ∇kXt = Xt − Xt−k, for k the“length of the season” or “period”. It is intuitively clear that a time series may havestationary (seasonal) differences (or increments), but may itself be nonstationary.

1.17 Example (Polynomial trend). A linear trend model could take the form Xt =at + Zt for a strictly stationary time series Zt. If a 6= 0, then the time series Xt is notstationary in the mean. However, the differenced series ∇Xt = a+Zt−Zt−1 is stationary.

Thus differencing can be used to remove a linear trend. Similarly, a polynomial trendcan be removed by repeated differencing: a polynomial trend of degree k is removed byapplying ∇k.


1.18 EXERCISE. Check this for a series of the form Xt = at+ bt2 + Zt.

1.19 EXERCISE. Does a (repeated) seasonal filter also remove polynomial trend?

0 20 40 60 80 100

020

040

060

0

0 20 40 60 80 100

02

46

810

0 20 40 60 80 100

-4-2

02

4

Figure 1.8. Realization of the time series t + 0.05t2 + Xt for the stationary auto-regressive process Xt

satisfying Xt−0.8Xt−1 = Zt for Gaussian white noise Zt, and the same series after once and twice differencing.

1.20 Example (Random walk). A random walk is defined as the sequence of partialsums Xt = Z1 + Z2 + · · ·+ Zt of an i.i.d. sequence Zt (with X0 = 0). A random walk isnot stationary, but the differenced series ∇Xt = Zt certainly is.

1.21 Example (Monthly cycle). If Xt is the value of a system in month t, then ∇12Xt

is the change in the system during the past year. For seasonable variables without trendthis series might be modelled as stationary. For series that contain both yearly seasonalityand polynomial trend, the series ∇k∇12Xt might be stationary.

1.22 Example (Weekly averages). If Xt is the value of a system at day t, then Yt =(1/7)

∑6j=0Xt−j is the average value over the last week. This series might show trend,

but should not show seasonality due to day-of-the-week. We could study seasonalityby considering the time series Xt − Yt, which results from filtering the series Xt withcoefficients (ψ0, . . . , ψ6) = (6/7,−1/7, . . . ,−1/7).

16 1: Introduction

1.23 Example (Exponential smoothing). An ad-hoc method for predicting the futureis to equate the future to the present or, more generally, to the average of the lastk observed values of a time series. When averaging past values it is natural to givemore weight to the most recent values. Exponentially decreasing weights appear to havesome popularity. This corresponds to predicting a future value of a time series Xt bythe weighted average

∑∞j=0 θ

j(1 − θ)−1Xt−j for some θ ∈ (0, 1). The coefficients ψj =

θj/(1− θ) of this filter decrease exponentially and satisfy∑

ψj = 1.

1.24 EXERCISE (Convolution). Show that the result of two filters with coefficientsαj and βj applied in turn (if well defined) is the filter with coefficients γj given byγk =

∑

j αjβk−j . This is called the convolution of the two filters. Infer that filtering iscommutative.

1.25 Definition. A filter with coefficients ψj is causal if ψj = 0 for every j < 0.

For a causal filter the variable Yt =∑

j ψjXt−j depends only on the valuesXt, Xt−1, . . . of the original time series in the present and past, not the future. This isimportant for prediction. Given Xt up to some time t, we can calculate Yt up to time t.If Yt is stationary, we can use results for stationary time series to predict the future valueYt+1. Next we predict the future value Xt+1 by Xt+1 = ψ−10 (Yt+1 −

∑

j>0 ψjXt+1−j).

1.2.1 Convergence of random series

In order to derive conditions that guarantee that an infinite filter is well defined, westart with a lemma concerning series of random variables. Recall that a series

∑

t xt ofnonnegative numbers is always well defined (although possibly ∞), where the order ofsummation is irrelevant. Furthermore, for general numbers xt the absolute convergence∑

t |xt| <∞ implies that∑

t xt exists as a finite number, where the order of summationis again irrelevant. We shall be concerned with series indexed by t belonging to somecountable set T , such as N, Z, or Z2. It follows from the preceding that

∑

t∈T xt is welldefined as a limit as n → ∞ of partial sums

∑

t∈Tnxt, for any increasing sequence of

finite subsets Tn ⊂ T with union T , if either every xt is nonnegative or∑

t |xt| < ∞.For instance, in the case that the index set T is equal to Z, we can choose the setsTn = t ∈ Z: |t| ≤ n.

1.26 Definition. A series∑

tXt converges in pth mean if there is a random variableY such that Yn: =

∑

t∈TnXt → Y in pth mean: E|Yn − Y |p → 0. The limit Y is then

denoted∑

tXt. For p = 1 and p = 2 we say convergence in mean and convergence insecond mean, or, alternatively, “convergence in quadratic mean” or “convergence in L1

or L2”.

1.27 Lemma. Let (Xt: t ∈ T ) be an arbitrary countable set of random variables.(i) If Xt ≥ 0 for every t, then E

∑

tXt =∑

t EXt (possibly +∞);(ii) If

∑

t E|Xt| <∞, then the series∑

tXt converges absolutely almost surely and andin mean, and E

∑

tXt =∑

t EXt.


Proof. Suppose T = ∪jTj for an increasing sequence T1 ⊂ T2 ⊂ · · · of finite subsets ofT . Assertion (i) follows from the monotone convergence theorem applied to the variablesYj =

∑

t∈TjXt. For the proof of (ii) we first note that Z =

∑

t |Xt| is well defined, withfinite mean EZ =

∑

t E|Xt|, by (i). Thus∑

t |Xt| converges almost surely to a finite limit,and hence Y : =

∑

tXt converges almost surely as well. The variables Yj =∑

t∈TjXt are

dominated by Z and converge to Y as j → ∞. Hence EYj → EY by the dominatedconvergence theorem.

1.28 EXERCISE. Suppose that E|Xn −X|p → 0 and E|X|p <∞ for some p ≥ 1. Showthat EXk

n → EXk for every 0 < k ≤ p.

1.29 Lemma (Filtering). Let (Zt: t ∈ Z) be an arbitrary time series and let∑

j |ψj | <∞.(i) If supt E|Zt| <∞, then

∑

j ψjZt−j converges absolutely, almost surely and in mean.

(ii) If supt E|Zt|2 <∞, then∑

j ψjZt−j converges in second mean as well.(iii) If the series Zt is stationary, then so is the series Xt =

∑

j ψjZt−j and γX(h) =∑

l

∑

j ψjψj+l−hγZ(l).

Proof. (i). Because∑

t E|ψjZt−j | ≤ supt E|Zt|∑

j |ψj | < ∞, it follows by (ii) of thepreceding lemma that the series

∑

j ψjZt−j is absolutely convergent, almost surely. Theconvergence in mean follows as in the remark following the lemma.

(ii). By (i) the series converges almost surely, and∑

j ψjZt−j −∑

|j|≤k ψjZt−j =∑

|j|>k ψjZt−j . By the triangle inequality we have

∣

∣

∣

∑

|j|>k

ψjZt−j∣

∣

∣

2

≤(

∑

|j|>k

|ψjZt−j |)2

=∑

|j|>k

∑

|i|>k

|ψj ||ψi||Zt−j ||Zt−i|.

By the Cauchy-Schwarz inequality E|Zt−j ||Zt−i| ≤(

E|Zt−j |2|EZt−i|2)1/2

, which isbounded by supt E|Zt|2. Therefore, in view of (i) of the preceding lemma the expec-tation of the right side (and hence the left side) of the preceding display is boundedabove by

∑

|j|>k

∑

|i|>k

|ψj ||ψi| supt

E|Zt|2 =(

∑

|j|>k

|ψj |)2

supt

E|Zt|2.

This converges to zero as k → ∞.(iii). By (i) the series

∑

j ψjZt−j converges in mean. Therefore, E∑

j ψjZt−j =∑

j ψjEZt, which is independent of t. Using arguments as under (ii), we see that we canalso justify the interchange of the order of expectations (hidden in the covariance) anddouble sums in

γX(h) = cov(

∑

j

ψjZt+h−j ,∑

i

ψiZt−i)

=∑

j

∑

i

ψjψi cov(Zt+h−j , Zt−i) =∑

j

∑

i

ψjψiγZ(h− j + i).

This can be written in the form given by the lemma by the change of variables (j, i) 7→(j, j + l − h).

18 1: Introduction

1.30 EXERCISE. Suppose that the series Zt in (iii) is strictly stationary. Show that theseries Xt is strictly stationary whenever it is well defined. [You need measure theory tomake the argument mathematically rigorous(?).]

* 1.31 EXERCISE. For a white noise series Zt, part (ii) of the preceding lemma can be im-proved: Suppose that Zt is a white noise sequence and

∑

j ψ2j <∞. Show that

∑

j ψjZt−jconverges in second mean. [For this exercise you may want to use material from Chap-ter 2.]

1.3 Complex Random Variables

Even though no real-life time series is complex valued, the use of complex numbers isnotationally convenient to develop the mathematical theory. In this section we discusscomplex-valued random variables.

A complex random variable Z is a map from some probability space into the fieldof complex numbers whose real and imaginary parts are random variables. For complexrandom variables Z = X + iY , Z1 and Z2, we define

EZ = EX + iEY,

varZ = E|Z − EZ|2,cov(Z1, Z2) = E(Z1 − EZ1)(Z2 − EZ2).

Here the overbar means complex conjugation. Some simple properties are, for α, β ∈ C,

EαZ = αEZ, EZ = EZ,

varZ = E|Z|2 − |EZ|2 = varX + varY = cov(Z,Z),

var(αZ) = |α|2 varZ,cov(αZ1, βZ2) = αβ cov(Z1, Z2),

cov(Z1, Z2) = cov(Z2, Z1) = EZ1Z2 − EZ1EZ2.

A complex random vector is of course a vector Z = (Z1, . . . , Zn)T of complex random

variables Zi defined on the same probability space. Its covariance matrix Cov(Z) is the(n× n) matrix of covariances cov(Zi, Zj).

1.32 EXERCISE. Prove the preceding identities.

1.33 EXERCISE. Prove that a covariance matrix Σ = Cov(Z) of a complex randomvector Z is Hermitian (i.e. Σ = ΣT ) and nonnegative-definite (i.e. αΣαT ≥ 0 for everyα ∈ C

n). Here ΣT is the reflection of a matrix, defined to have (i, j)-element Σj,i.

A complex time series is a (doubly infinite) sequence of complex-valued randomvariables Xt. The definitions of mean, covariance function, and (strict) stationarity given

1.3: Complex Random Variables 19

for real time series apply equally well to complex time series. In particular, the auto-covariance function of a stationary complex time series is the function γX :Z → C definedby γX(h) = cov(Xt+h, Xt). Lemma 1.29 also extends to complex time series Zt, wherein (iii) we must read γX(h) =

∑

l

∑

j ψjψj+l−hγZ(l).

1.34 EXERCISE. Show that the auto-covariance function of a complex stationary timeseries Xt is conjugate symmetric: γX(−h) = γX(h), for every h ∈ Z. (The positions ofXt+h and Xt in the definition γX(h) = cov(Xt+h, Xt) matter!)

Useful as these definitions are, it should be noted that the second-order structure of acomplex-valued time series is only partly described by its covariance function. A complexrandom variable is geometrically equivalent to the two-dimensional real vector of its realand imaginary parts. More generally, a complex random vector Z = (Z1, . . . , Zn)

T ofdimension n can be identified with the 2n-dimensional real vector that combines all realand imaginary parts. The second order structure of the latter vector is determined bya 2n-dimensional real covariance matrix, and this is not completely determined by the(n × n) complex covariance matrix of Z. This is clear from comparing the dimensionsof these objects. For instance, for n = 1 the complex covariance matrix of Z is simplythe single positive number varZ, whereas the covariance matrix of its pair of real andimaginary parts is a symmetric (2 × 2)-matrix. This discrepancy increases with n. Thecovariance matrix of a complex vector of dimension n contains n reals on the diagonaland 1

2n(n− 1) complex numbers off the diagonal, giving n+n(n− 1) = n2 reals in total,whereas a real covariance matrix of dimension 2n varies over an open set of 1

22n(2n+1)reals.

* 1.3.1 Complex Gaussian vectors

It is particularly important to keep this in mind when working with complex Gaussianvectors. A complex random vector Z = (Z1, . . . , Zn) is normally distributed or Gaussianif the 2n-vector of all its real and imaginary parts is multivariate-normally distributed.The latter distribution is determined by the mean vector and the covariance matrixof the latter 2n-vector. The mean vector is one-to-one related to the (complex) meanvector of Z, but the complex covariances cov(Zi, Zj) do not completely determine thecovariance matrix of the combined real and imaginary parts. This is sometimes resolvedby requiring that the covariance matrix of the 2n-vector has the special form

(1.3) 12

(

ReΣ − ImΣImΣ ReΣ

)

.

Here ReΣ and ImΣ are the real and imaginary parts of the complex covariance matrixΣ = ReΣ + i ImΣ of Z. A Gaussian distribution with covariance matrix of this typeis sometimes called circular complex normal. This distribution is completely describedby the mean vector µ and covariance matrix Σ, and can be referred to by a code suchas Nn(µ,Σ). However, the additional requirement (1.3) imposes a relationship betweenthe vectors of real and imaginary parts of Z, which seems not natural for a generaldefinition of a Gaussian complex vector. For instance, under the requirement (1.3) a

20 1: Introduction

complex Gaussian variable Z (for n = 1) would be equivalent to a pair of independentreal Gaussian variables each with variance equal to 1

2 varZ.On the other hand, the covariance matrix of any complex vector that can be con-

structed by linearly transforming total complex white noise is of the form (1.3). Hereby “total complex white noise” we mean a complex vector of the form U + iV , with(UT , V T )T a 2n-real vector with covariance matrix 1

2 times the identity. The covariancematrix of the real and imaginary parts of the complex vector A(U + iV ) is then of theform (1.3), for any complex matrix A (see Problem 1.36). Furthermore, the “canoni-cal construction” of a complex vector Z with prescribed covariance matrix Σ, using thespectral decomposition, also yields real and imaginary parts with covariance matrix (1.3)(see Problem 1.37), and the coordinates of a complex N(µ,Σ)-vector are independent ifand only if Σ is diagonal.

Warning. If X is a real Gaussian n-vector and A a complex matrix, then AX is notnecessarily a complex Gaussian vector, if the definition of the latter concept includes therequirement that the covariance matrix of the vector of real and imaginary parts takesthe form (1.3).

1.35 Example (Complex trigonometric series). The time series Xt = Aeitλ, for acomplex-valued random variable A and a fixed real number λ, defines a complex versionof the trigonometric time series considered in Example 1.5. If EA = 0 and varA = σ2

is finite, then this time series is stationary with mean 0 and autocovariances γX(h) =σ2eihλ.

The real part ReXt of this time series takes the form of the series in Example 1.5,with the variables A and B in that example taken equal to ReA and ImA. It follows thatthe real part is stationary if ReA and ImA have equal variance and are uncorrelated,i.e. if the covariance of A has the form (1.3). If this is not the case, then the real part willtypically not be stationary. For instance, in the case that the variances of ReA and ImAare equal, the extra term EReA ImA cos

(

(t+h)λ) sin(tλ) will appear in the formula forthe covariance of ReXt+h and ReXt, which depends on h unless EReA ImA = 0.

Thus a stationary complex time series may have nonstationary real and imaginaryparts!

1.36 EXERCISE. Let U and V be real n-vectors with EUUT = EV V T = I and EUV T =0. Show that for any complex (n×n)-matrix A the complex vector A(U + iV ) possessescovariance matrix of the form (1.3).

1.37 EXERCISE. A nonnegative, Hermitian complex matrix Σ can be decomposed as

Σ = ODOTfor a unitary matric O and a real diagonal matrix D. Given uncorrelated real

vectors U and V each with mean zero and covariance matrix 12I set Z = O

√D(U + iV ).

(i) Show that CovZ = Σ.(ii) Show that (ReZ, ImZ) possesses covariance matrix (1.3).(If O

√D = A + iB, then Σ = O

√D(O

√D)T = (A + iB)(AT − iBT ), whence ReΣ =

AAT +BBT and ImΣ = BAT −ABT . Also Z = AU −BV + i(AV +BU).)

1.38 EXERCISE. Suppose that the 2n-vector (XT , Y T )T possesses covariance matrix

1.4: Multivariate Time Series 21

as (1.3). Show that the n-vector Z = X + iY possesses covariance matrix Σ. Also showthat E(Z − EZ)(Z − EZ)T = 0.

1.39 EXERCISE. Let Z and Z be independent complex random vectors with mean zeroand covariance matrix Σ.(i) Show that E(ReZ)(ReZ)T + E(ImZ)(ImZ)T = ReΣ and E(ImZ)(ReZ)T −

E(ReZ)(ImZ)T = ImΣ.(ii) Show that the 2n-vector (XT , Y T )T defined by X = (ReZ + Im Z)/

√2 and Y =

(−Re Z + ImZ)/√2 possesses covariance matrix (1.3).

(iii) Show that the 2n-vector (ReZT , ImZT )T does not necessarily have this property.

1.40 EXERCISE. Say that a complex n-vector Z = X+iY is Nn(µ,Σ) distributed if the2n real vector (XT , Y T )T is multivariate normally distributed with mean (Reµ, Imµ)and covariance matrix (1.3). Show that:(i) If Z is N(µ,Σ)-distributed, then AZ + b is N(Aµ+ b, AΣAT )-distributed.(ii) If Z1 and Z2 are independent and N(µi,Σi)-distributed, then Z1 + Z2 is N(µ1 +

µ2,Σ1 +Σ2)-distributed.(iii) If X is a real vector and N(µ,Σ)-dstributed and A is a complex matrix, then AX

is typically not complex normally distributed.

1.4 Multivariate Time Series

In many applications we are interested in the time evolution of several variables jointly.This can be modelled through vector-valued time series. The definition of a station-ary time series applies without changes to vector-valued series Xt = (Xt,1, . . . , Xt,d)

T .Here the mean EXt is understood to be the vector (EXt,1, . . . ,EXt,d)

T of means of thecoordinates and the auto-covariance function is defined to be the matrix

γX(h) =(

cov(Xt+h,i, Xt,j))

i,j=1,...,d= E(Xt+h − EXt+h)(Xt − EXt)

T .

The auto-correlation at lag h is defined as

ρX(h) =(

ρ(Xt+h,i, Xt,j))

i,j=1,...,d= diag

(√

γX(0))

γX(h) diag(√

γX(0))

.

The study of properties of multivariate time series can often be reduced to the study ofunivariate time series by taking linear combinations αTXt of the coordinates. The firstand second moments satisfy, for every α ∈ C

n,

EαTXt = αTEXt, γαTX(h) = αT γX(h)α.

1.41 EXERCISE. Show that the auto-covariance function of a complex, multivariate,stationary time series X satisfies γX(h)T = γX(−h), for every h ∈ Z. (The order of thetwo random variables in the definition γX(h) = cov(Xt+h, Xt) matters!)

2Hilbert Spacesand Prediction

In this chapter we first recall definitions and basic facts concerning Hilbert spaces. Inparticular, we consider the Hilbert space of square-integrable random variables, with thecovariance as the inner product. Next we apply the Hilbert space theory to solve theprediction problem: finding the “best” predictor of Xn+1, or other future variables, basedon observations X1, . . . , Xn.

2.1 Hilbert Spaces and Projections

Given a measure space (Ω,U , µ), we define L2(Ω,U , µ) (or L2(µ) for short) as the set of allmeasurable functions f : Ω → C such that

∫

|f |2 dµ < ∞. (Alternatively, all measurablefunctions with values in R with this property.) Here a complex-valued function is saidto be measurable if both its real and imaginary parts Re f and Im f are measurablefunctions. Its integral is by definition

∫

f dµ =

∫

Re f dµ+ i

∫

Im f dµ,

provided the two integrals on the right are defined and finite. We set

(2.1)

〈f1, f2〉 =∫

f1f2 dµ,

‖f‖ =

√

∫

|f |2 dµ,

d(f1, f2) = ‖f1 − f2‖ =

√

∫

|f1 − f2|2 dµ.

2.1: Hilbert Spaces and Projections 23

These define a semi-inner product, a semi-norm, and a semi-metric, respectively. Thefirst is a semi-inner product in view of the properties:

〈f1 + f2, f3〉 = 〈f1, f3〉+ 〈f2, f3〉,〈αf1, βf2〉 = αβ〈f1, f2〉,

〈f2, f1〉 = 〈f1, f2〉,〈f, f〉 ≥ 0, with equality iff f = 0, a.e..

These equalities are immediate consequences of the definitions, and the linearity of theintegral.

The second entity in (2.1) is a semi-norm because it has the properties:

‖f1 + f2‖ ≤ ‖f1‖+ ‖f2‖,‖αf‖ = |α|‖f‖,‖f‖ = 0 iff f = 0, a.e..

Here the first line, the triangle inequality is not immediate, but it can be proved with thehelp of the Cauchy-Schwarz inequality, given below. The other properties are obvious.

The third object in (2.1) is a semi-distance, in view of the relations:

d(f1, f3) ≤ d(f1, f2) + d(f2, f3),

d(f1, f2) = d(f2, f1),

d(f1, f2) = 0 iff f1 = f2, a.e..

The first property is again a triangle inequality, and follows from the correspondinginequality for the norm.

Immediate consequences of the definitions and the properties of the inner productare

(2.2) ‖f + g‖2 = 〈f + g, f + g〉 = ‖f‖2 + 2Re〈f, g〉+ ‖g‖2,‖f + g‖2 = ‖f‖2 + ‖g‖2, if 〈f, g〉 = 0.

The last equality is Pythagoras’ rule. In the complex case this is true, more generally,for functions f, g with Re〈f, g〉 = 0, as is immediate from the first equality.

2.1 Lemma (Cauchy-Schwarz). Any pair f, g in L2(Ω,U , µ) satisfies∣

∣〈f, g〉∣

∣ ≤ ‖f‖‖g‖.

Proof. This follows upon working out the inequality ‖f − λg‖2 ≥ 0 for λ = 〈f, g〉/‖g‖2using the decomposition (2.2).

Now the triangle inequality for the norm follows from the decomposition (2.2) andthe Cauchy-Schwarz inequality, which, when combined, yield

‖f + g‖2 ≤ ‖f‖2 + 2‖f‖‖g‖+ ‖g‖2 =(

‖f‖+ ‖g‖)2.

Another consequence of the Cauchy-Schwarz inequality is the continuity of the innerproduct:

(2.3) fn → f, gn → g implies that 〈fn, gn〉 → 〈f, g〉.

24 2: Hilbert Spaces and Prediction

2.2 EXERCISE. Prove this.

2.3 EXERCISE. Prove that∣

∣‖f‖ − ‖g‖∣

∣ ≤ ‖f − g‖.

2.4 EXERCISE. Derive the parallellogram rule: ‖f + g‖2 + ‖f − g‖2 = 2‖f‖2 + 2‖g‖2.

2.5 EXERCISE. Prove that ‖f + ig‖2 = ‖f‖2 + ‖g‖2 for every pair f, g of real functionsin L2(Ω,U , µ).

2.6 EXERCISE. Let Ω = 1, 2, . . . , k, U = 2Ω the power set of Ω and µ the countingmeasure on Ω. Show that L2(Ω,U , µ) is exactly C

k (or Rk in the real case).

We attached the qualifier “semi” to the inner product, norm and distance definedpreviously, because in every of the three cases, the third of the three properties involvesa null set. For instance ‖f‖ = 0 does not imply that f = 0, but only that f = 0almost everywhere. If we think of two functions that are equal almost everywere as thesame “function”, then we obtain a true inner product, norm and distance. We defineL2(Ω,U , µ) as the set of all equivalence classes in L2(Ω,U , µ) under the equivalencerelation “f ≡ g if and only if f = g almost everywhere”. It is a common abuse ofterminology, which we adopt as well, to refer to the equivalence classes as “functions”.

2.7 Proposition. The metric space L2(Ω,U , µ) is complete under the metric d.

We shall need this proposition only occasionally, and do not provide a proof. (See e.g.Rudin, Theorem 3.11.) The proposition asserts that for every sequence fn of functions inL2(Ω,U , µ) such that

∫

|fn − fm|2 dµ→ as m,n→ ∞ (a Cauchy sequence), there existsa function f ∈ L2(Ω,U , µ) such that

∫

|fn − f |2 dµ→ 0 as n→ ∞.

2.8 Definition. A Hilbert space is a set equipped with an inner product that is metricallycomplete under the corresponding metric.

The space L2(Ω,U , µ) is an example of an Hilbert space, and the only example weneed. (In fact, this is not a great loss of generality, because it can be proved that anyHilbert space is (isometrically) isomorphic to a space L2(Ω,U , µ) for some (Ω,U , µ).)

2.9 Definition. Two elements f, g of L2(Ω,U , µ) are orthogonal if 〈f, g〉 = 0. This isdenoted f ⊥ g. Two subsets F ,G of L2(Ω,U , µ) are orthogonal if f ⊥ g for every f ∈ Fand g ∈ G. This is denoted F ⊥ G.

2.10 EXERCISE. If f ⊥ G for some subset G ⊂ L2(Ω,U ,P), show that f ⊥ linG, wherelinG is the closure of the linear span lin G of G. (The linear span of a set is the set of allfinite linear combinations of elements of the set.)

2.11 Theorem (Projection theorem). Let L ⊂ L2(Ω,U , µ) be a closed linear subspace.For every f ∈ L2(Ω,U , µ) there exists a unique element Πf ∈ L that minimizes l 7→‖f − l‖2 over l ∈ L. This element is uniquely determined by the requirements Πf ∈ Land f −Πf ⊥ L.

2.1: Hilbert Spaces and Projections 25

Proof. Let d = inf l∈L ‖f − l‖ be the “minimal” distance of f to L. This is finite,because 0 ∈ L and ‖f‖ <∞. Let ln be a sequence in L such that ‖f − ln‖2 → d. By theparallellogram law

∥

∥(lm − f) + (f − ln)∥

∥

2= 2‖lm − f‖2 + 2‖f − ln‖2 −

∥

∥(lm − f)− (f − ln)∥

∥

2

= 2‖lm − f‖2 + 2‖f − ln‖2 − 4∥

∥

12 (lm + ln)− f

∥

∥

2.

The two first terms on the far right both converge to 2d2 as m,n → ∞. Because (lm +ln)/2 ∈ L, the last term on the far right is bounded above by −4d2. We conclude thatthe left side, which is ‖lm − ln‖2, is bounded above by 2d2 + 2d2 + o(1) − 4d2 = o(1)and hence, being nonnegative, converges to zero. Thus ln is a Cauchy sequence, and hasa limit l, by the completeness of L2(Ω,U , µ). The limit is in L, because L is closed. Bythe continuity of the norm ‖f − l‖ = lim ‖f − ln‖ = d. Thus the limit l qualifies as theminimizing element Πf .

If both Π1f and Π2f are candidates for Πf , then we can take the sequencel1, l2, l3, . . . in the preceding argument equal to the sequence Π1f,Π2f,Π1f, . . .. It thenfollows that this sequence is a Cauchy-sequence and hence converges to a limit. Thelatter is possible only if Π1f = Π2f .

Finally, we consider the orthogonality relation. For every real number a and l ∈ L,we have

∥

∥f − (Πf + al)∥

∥

2= ‖f −Πf‖2 − 2aRe〈f −Πf, l〉+ a2‖l‖2.

By definition of Πf this is minimal as a function of a at the value a = 0, whence thegiven parabola (in a) must have its bottom at zero, which is the case if and only ifRe〈f − Πf, l〉 = 0. In the complex case we see by a similar argument with ia instead ofa, that Im〈f −Πf, l〉 = 0 as well. Thus f −Πf ⊥ L.

Conversely, if 〈f −Πf, l〉 = 0 for every l ∈ L and Πf ∈ L, then Πf − l ∈ L for everyl ∈ L and by Pythagoras’ rule

‖f − l‖2 =∥

∥(f −Πf) + (Πf − l)∥

∥

2= ‖f −Πf‖2 + ‖Πf − l‖2 ≥ ‖f −Πf‖2.

This proves that Πf minimizes l 7→ ‖f − l‖2 over l ∈ L.

The function Πf given in the preceding theorem is called the (orthogonal) projectionof f onto L. A geometric representation of a projection in R

3 is given in Figure 2.1.From the orthogonality characterization of Πf , we can see that the map f 7→ Πf is

linear and decreases norm:Π(f + g) = Πf +Πg,

Π(αf) = αΠf,

‖Πf‖ ≤ ‖f‖.Together linearity and decreasing norm show that Π is Lipshitz continuous: ‖Πf−Πg‖ ≤‖f − g‖.

A further important property relates to repeated projections. If ΠLf denotes theprojection of f onto L and L1 and L2 are two closed linear subspaces, then

ΠL1ΠL2

f = ΠL1f, iff L1 ⊂ L2.


f f −Πf

Πf

L

1

Figure 2.1. Projection of f onto the linear space L. The remainder f − Πf is orthogonal to L.

Thus we can find a projection in steps, by projecting a projection (ΠL2f) onto a bigger

space (L2) a second time onto the smaller space (L1). This, again, is best proved usingthe orthogonality relation.

2.12 EXERCISE. Prove the relations in the two preceding displays.

The projection ΠL1+L2onto the sum L1+L2 = l1+ l2: li ∈ Li of two closed linear

spaces is not necessarily the sum ΠL1+ΠL2

of the projections. (It is also not true thatthe sum of two closed linear subspaces is necessarily closed, so that ΠL1+L2

may noteven be well defined.) However, this is true if the spaces L1 and L2 are orthogonal:

ΠL1+L2f = ΠL1

f +ΠL2f, if L1 ⊥ L2.

2.13 EXERCISE.(i) Show by counterexample that the condition L1 ⊥ L2 cannot be omitted.(ii) Show that L1 + L2 is closed if L1 ⊥ L2 and both L1 and L2 are closed subspaces.(iii) Show that L1 ⊥ L2 implies that ΠL1+L2

= ΠL1+ΠL2

.

[Hint for (ii): It must be shown that if zn = xn + yn with xn ∈ L1, yn ∈ L2 for every nand zn → z, then z = x + y for some x ∈ L1 and y ∈ L2. How can you find xn and ynfrom zn? Also remember that a projection is continuous.]

2.14 EXERCISE. Find the projection ΠLf of an element f onto a one-dimensional spaceL = λl0:λ ∈ C.

* 2.15 EXERCISE. Suppose that the set L has the form L = L1 + iL2 for two closed,linear spaces L1, L2 of real functions. Show that the minimizer of l 7→ ‖f − l‖ over l ∈ Lfor a real function f is the same as the minimizer of l 7→ ‖f − l‖ over L1. Does this implythat f −Πf ⊥ L2? Why does this not follow from the projection theorem?

2.2: Square-integrable Random Variables 27

2.2 Square-integrable Random Variables

For (Ω,U ,P) a probability space the Hilbert space L2(Ω,U ,P) is exactly the set ofall complex (or real) random variables X with finite second moment E|X|2. The innerproduct is the product expectation 〈X,Y 〉 = EXY , and the inner product betweencentered variables is the covariance:

〈X − EX,Y − EY 〉 = cov(X,Y ).

The Cauchy-Schwarz inequality takes the form

|EXY |2 ≤ E|X|2E|Y |2.

When combined the preceding displays imply that∣

∣cov(X,Y )∣

∣

2 ≤ varX varY . Conver-gence Xn → X relative to the norm means that E|Xn − X|2 → 0, and is referred toas convergence in second mean. This implies the convergence in mean E|Xn −X| → 0,because E|X| ≤

√

E|X|2 by the Cauchy-Schwarz inequality. The continuity of the innerproduct (2.3) gives that:

E|Xn −X|2 → 0,E|Yn − Y |2 → 0 implies cov(Xn, Yn) → cov(X,Y ).

2.16 EXERCISE. How can you apply this rule to prove equalities of the typecov(

∑

αjXt−j ,∑

βjYt−j) =∑

i

∑

j αiβj cov(Xt−i, Yt−j), such as in Lemma 1.29?

2.17 EXERCISE. Show that sd(X+Y ) ≤ sd(X)+sd(Y ) for any pair of random variablesX and Y .

2.2.1 Conditional Expectation

Let U0 ⊂ U be a sub σ-field of the σ-field U . The collection L of all U0-measurablevariables Y ∈ L2(Ω,U ,P) is a closed, linear subspace of L2(Ω,U ,P). (It can be identifiedwith L2(Ω,U0,P))). By the projection theorem every square-integrable random variableX possesses a projection onto L. This particular projection is important enough to giveit a name and study it in more detail.

2.18 Definition. The projection of X ∈ L2(Ω,U ,P) onto the the set of all U0-measurable square-integrable random variables is called the conditional expectation ofX given U0. It is denoted by E(X| U0).

Many times the σ-field U0 will be generated by a measurable map Y : Ω → D withvalues in some measurable space (D,D): then U0 gives the information contained inobserving Y . Formally the σ-field generated by Y is defined as σ(Y ) = Y −1(D), thecollection of all events of the form Y ∈ D, for D ∈ D. The notation E(X|Y ) is anabbreviation of E

(

X|σ(Y ))

, and called the conditional expectation of X given Y .The name “conditional expectation” suggests that there exists another, more intu-

itive interpretation of this projection. An alternative definition of a conditional expecta-tion is as follows.


2.19 Definition. The conditional expectation given U0 of a random variable X whichis either nonnegative or integrable is defined as a U0-measurable variable X ′ such thatEX1A = EX ′1A for every A ∈ U0.

It is clear from the definition that any other U0-measurable map X ′′ such thatX ′′ = X ′ almost surely is also a conditional expectation. Apart from this indeterminacyon null sets, a conditional expectation as in the second definition can be shown to beunique; its existence can be proved using the Radon-Nikodym theorem. We do not giveproofs of these facts here.

Because a variable X ∈ L2(Ω,U ,P) is automatically integrable, Definition 2.19 de-fines a conditional expectation for a larger class of variables than Definition 2.18. IfE|X|2 < ∞, so that both definitions apply, then the two definitions agree. To see thisit suffices to show that a projection E(X| U0) as in the first definition is the conditionalexpectation X ′ of the second definition. Now E(X| U0) is U0-measurable by definitionand satisfies the equality E

(

X − E(X| U0))

1A = 0 for every A ∈ U0, by the orthog-onality relationship of a projection. Thus X ′ = E(X| U0) satisfies the requirements ofDefinition 2.19.

The required measurability in the smaller σ-field U0 says that the conditional ex-pectation X ′ = E(X| U0) is a “coarsening” of the original variable X: it is based on lessinformation. Definition 2.19 shows that the two variables have the same average valuesEX1A/P(A) and EX1A/P(A) over every measurable set A ∈ U0. Some examples helpto gain more insight in conditional expectations.

2.20 Example (Ordinary expectation). The expectation EX of a random variable Xis a number, and can considered a degenerate random variable. It is also the conditionalexpectation relative to the trivial σ-field : E

(

X| ∅,Ω)

= EX. More generally, we havethat E(X| U0) = EX if X and U0 are independent. This is intuitively clear: “an indepen-dent σ-field U0 gives no information about X” and hence the expectation given U0 is theunconditional expectation.

To derive this from the definition, note that E(EX)1A = EXE1A = EX1A for everymeasurable set A such that X and A are independent.

2.21 Example (Full information). At the other extreme we have that E(X| U0) = Xif X itself is U0-measurable. This is immediate from the definition. “Given U0 we knowX exactly.”

2.22 EXERCISE. Let U0 = σ(A1, . . . , Ak) for a measurable partition Ω = ∪iAi intofinitely many sets of positive probability. Show that the random variable E(X| U0) takesthe value EX1Ai

/P(Ai) on the event Ai.

2.23 Example (Conditional density). Let (X,Y ): Ω → R × Rk be measurable and

possess a density f(x, y) relative to a σ-finite product measure µ × ν on R × Rk (for

instance, the Lebesgue measure on Rk+1). Then it is customary to define a conditional

density of X given Y = y by

f(x| y) = f(x, y)∫

f(x, y) dµ(x).

2.3: Linear Prediction 29

This is well defined for every y for which the denominator is positive. As the denominatoris precisely the marginal density fY of Y evaluated at y, this is for all y in a set of measureone under the distribution of Y .

We now have that the conditional expection is given by the “usual formula”

E(X|Y ) =

∫

xf(x|Y ) dµ(x).

Here we may define the right hand arbitrarily if the denominator of f(x|Y ) is zero.That this formula is the conditional expectation according to Definition 2.19 follows

by some applications of Fubini’s theorem. To begin with, note that it is a part of thestatement of this theorem that the right side of the preceding display is a measurablefunction of Y . Next we write EE(X|Y )1Y ∈B , for an arbitrary measurable set B, in theform

∫

B

∫

xf(x| y) dµ(x) fY (y) dν(y). Because f(x| y)fY (y) = f(x, y) for almost every(x, y), the latter expression is equal to EX1Y ∈B , by Fubini’s theorem.

2.24 Lemma (Properties).(i) EE(X| U0) = EX.(ii) If Z is U0-measurable, then E(ZX| U0) = ZE(X| U0) a.s.. (Here require that X ∈

Lp(Ω,U ,P) and Z ∈ Lq(Ω,U ,P) for 1 ≤ p ≤ ∞ and p−1 + q−1 = 1.)(iii) (linearity) E(αX + βY | U0) = αE(X| U0) + βE(Y | U0) a.s..(iv) (positivity) If X ≥ 0 a.s., then E(X| U0) ≥ 0 a.s..(v) (towering property) If U0 ⊂ U1 ⊂ U , then E

(

E(X| U1)| U0) = E(X| U0) a.s..

The conditional expectation E(X|Y ) given a random vector Y is by definition aσ(Y )-measurable function. The following lemma shows that, for most Y , this means thatit is a measurable function g(Y ) of Y . The value g(y) is often denoted by E(X|Y = y).

Warning. Unless P(Y = y) > 0 it is not right to give a meaning to E(X|Y = y) fora fixed, single y, even though the interpretation as an expectation given “that we knowthat Y = y” often makes this tempting (and often leads to a correct result). We mayonly think of a conditional expectation as a function y 7→ E(X|Y = y) and this is onlydetermined up to null sets.

2.25 Lemma. Let Yα:α ∈ A be random variables on Ω and let X be a σ(Yα:α ∈ A)-measurable random variable.(i) If A = 1, 2, . . . , k, then there exists a measurable map g:Rk → R such that

X = g(Y1, . . . , Yk).(ii) If |A| = ∞, then there exists a countable subset αn∞n=1 ⊂ A and a measurable

map g:R∞ → R such that X = g(Yα1, Yα2

, . . .).

Proof. For the proof of (i), see e.g. Dudley Theorem 4.28.


2.3 Linear Prediction

Suppose that we observe the values X1, . . . , Xn from a stationary, mean zero time seriesXt. The linear prediction problem is to find the linear combination of these variables thatbest predicts future variables.

2.26 Definition. Given a mean zero time series Xt, the best linear predictor of Xn+1 isthe linear combination φ1Xn+φ2Xn−1+· · ·+φnX1 that minimizes E|Xn+1−Y |2 over alllinear combinations Y of X1, . . . , Xn. The minimal value E|Xn+1 −φ1Xn − · · ·−φnX1|2is called the square prediction error.

In the terminology of the preceding section, the best linear predictor of Xn+1 is theprojection of Xn+1 onto the linear subspace lin (X1, . . . , Xn) spanned by X1, . . . , Xn. Acommon notation is ΠnXn+1, for Πn the projection onto lin (X1, . . . , Xn). Best linearpredictors of other random variables are defined similarly.

Warning. The coefficients φ1, . . . , φn in the formula ΠnXn+1 = φ1Xn + · · ·+φnX1

depend on n, even though we often suppress this dependence in the notation. The reversedordering of the labels on the coefficients φi is for convenience.

By Theorem 2.11 the best linear predictor can be found from the prediction equations

〈Xn+1 − φ1Xn − · · · − φnX1, Xt〉 = 0, t = 1, . . . , n,

where 〈·, ·〉 is the inner product in L2(Ω,U ,P). For a stationary (real-valued!) time seriesXt this system can be written in the form

(2.4)

γX(0) γX(1) · · · γX(n− 1)γX(1) γX(0) · · · γX(n− 2)

......

. . ....

γX(n− 1) γX(n− 2) · · · γX(0)

φ1...φn

=

γX(1)...

γX(n)

.

If the (n× n)-matrix on the left is nonsingular, then φ1, . . . , φn can be solved uniquely.Otherwise there are multiple solutions for the vector (φ1, . . . , φn), but any solution willgive the best linear predictor ΠnXn+1 = φ1Xn+· · ·+φnX1, as this is uniquely determinedby the projection theorem. The equations express φ1, . . . , φn in the auto-covariance func-tion γX . In practice, we do not know this function, but estimate it from the data, anduse the corresponding estimates for φ1, . . . , φn to calculate the predictor. (We considerthe estimation problem in later chapters.)

The square prediction error can be expressed in the coefficients by Pythagoras’ rule,which gives, for a stationary time series Xt,

(2.5)E|Xn+1 −ΠnXn+1|2 = E|Xn+1|2 − E|ΠnXn+1|2

= γX(0)− (φ1, . . . , φn)Γn(φ1, . . . , φn)T ,

for Γn the covariance matrix of the vector (X1, . . . , Xn), i.e. the matrix on the left leftside of (2.4).

Similar arguments apply to predicting Xn+h for h > 1. If we wish to predict the fu-ture values at many time lags h = 1, 2, . . ., then solving a n-dimensional linear system for

2.3: Linear Prediction 31

every h separately can be computer-intensive, as n may be large. Several more efficient,recursive algorithms use the predictions at earlier times to calculate the next prediction,and are computationally more efficient. We discuss one of these in Secton 2.4.

2.27 Example (Autoregression). Prediction is extremely simple for the stationaryauto-regressive time series satisfying Xt = φXt−1+Zt for a white noise sequence Zt and|φ| < 1: the best linear predictor of Xn+1 given X1, . . . , Xn is simply φXn (for n ≥ 1).Thus we predict Xn+1 = φXn + Zn+1 by simply setting the unknown Zn+1 equal to itsmean, zero. The interpretation is that the Zt are external noise factors that are completelyunpredictable based on the past. The square prediction error E|Xn+1 − φXn|2 = EZ2

n+1

is equal to the variance of this “innovation”.The claim is not obvious, as is proved by the fact that it is wrong in the case that

|φ| > 1. To prove the claim we recall from Example 1.8 that the unique stationary solutionto the auto-regressive equation in the case that |φ| < 1 is given by Xt =

∑∞j=0 φ

jZt−j .Thus Xt depends only on Zs from the past and the present. Because Zt is a white noisesequence, it follows that Zt+1 is uncorrelated with the variables Xt, Xt−1, . . .. Therefore〈Xn+1 − φXn, Xt〉 = 〈Zn+1, Xt〉 = 0 for t = 1, 2, . . . , n. This verifies the orthogonalityrelationship; it is obvious that φXn is contained in the linear span of X1, . . . , Xn.

2.28 EXERCISE. There is a hidden use of the continuity of the inner product in thepreceding example. Can you see where?

* 2.29 EXERCISE. Find the best linear predictor of Xn+1 given Xn, Xn−1, . . . , and givenXn, . . . , X1 in the stationary autoregressive process satisfying Xt = φXt−1 + Zt for awhite noise sequence Zt and |φ| > 1.

2.30 Example (Deterministic trigonometric series). For the processXt = A cos(λt)+B sin(λt), considered in Example 1.5, the best linear predictor of Xn+1 given X1, . . . , Xn

is given by 2(cosλ)Xn − Xn−1, for n ≥ 2. The prediction error is equal to 0! Thisunderscores that this type of time series is deterministic in character: if we know it attwo time instants, then we know the time series at all other time instants. The intuitiveexplanation is that the values A and B can be recovered from the values of Xt at twotime instants.

These assertions follow by explicit calculations, solving the prediction equations. Itsuffices to do this for n = 2: if X3 can be predicted without error by 2(cosλ)X2 −X1,then, by stationarity, Xn+1 can be predicted without error by 2(cosλ)Xn −Xn−1.

2.31 EXERCISE.(i) Prove the assertions in the preceding example.(ii) Are the coefficients 2 cosλ,−1, 0, . . . , 0 in this example unique?

If a given time series Xt is not centered at 0, then it is natural to allow a constantterm in the predictor. Write 1 for the random variable that is equal to 1 almost surely.


2.32 Definition. The best linear predictor of Xn+1 based on X1, . . . , Xn is the projec-tion of Xn+1 onto the linear space spanned by 1, X1, . . . , Xn.

If the time series Xt does have mean zero, then the introduction of the constantterm 1 does not help. Indeed, the relation EXt = 0 is equivalent to Xt ⊥ 1, whichimplies both that 1 ⊥ lin (X1, . . . , Xn) and that the projection of Xn+1 onto lin 1 iszero. By the orthogonality the projection of Xn+1 onto lin (1, X1, . . . , Xn) is the sumof its projections onto lin 1 and lin (X1, . . . , Xn).As the first projection is 0, this is theprojection onto lin (X1, . . . , Xn),

If the mean of the time series is nonzero, then adding a constant to the predictordoes cut the prediction error. By a similar argument as in the preceding paragraph wesee that for a time series with mean µ = EXt possibly nonzero,

(2.6) Πlin (1,X1,...,Xn)Xn+1 = µ+Πlin (X1−µ,...,Xn−µ)(Xn+1 − µ).

Thus the recipe for prediction with uncentered time series is: substract the mean fromevery Xt, calculate the projection for the centered time series Xt−µ, and finally add themean. Because the auto-covariance function γX gives the inner produts of the centeredprocess, the coefficients φ1, . . . , φn of Xn−µ, . . . ,X1−µ are still given by the predictionequations (2.4).

2.33 EXERCISE. Prove formula (2.6), noting that EXt = µ is equivalent to Xt−µ ⊥ 1.

* 2.4 Innovations Algorithm

The prediction error Xn − Πn−1Xn at time n is called the innovation at time n. (SetΠ0X1 = EX1, so that at time 1 we predict using constants.) The linear span of theinnovations X1 − Π0X1, . . . , Xn − Πn−1Xn is the same as the linear span X1, . . . , Xn,and hence the linear predictor of Xn+1 can be expressed in the innovations, as

ΠnXn+1 = ψn,1(Xn −Πn−1Xn) + ψn,2(Xn−1 −Πn−2Xn−1) + · · ·+ ψn,n(X1 −Π0X1).

The innovations algorithm shows how the triangular array of coefficients ψn,1, . . . , ψn,n

can be efficiently computed. Once these coefficients are known, predictions further intothe future can also be computed easily. Indeed by the towering property of projectionsthe projection ΠkXn+1 for k ≤ n can be obtained by applying Πk to the left sideof the preceding display. Next the linearity of projection and the orthogonality of theinnovations to the “past” yields that, for k ≤ n,

ΠkXn+1 = ψn,n−k+1(Xk −Πk−1Xk) + · · ·+ ψn,n(X1 −Π0X1).

We just drop the innovations that are future to time k.

2.5: Nonlinear Prediction 33

The innovations algorithm computes the coeffients ψn,t for n = 1, 2, . . ., and forevery n backwards for t = n, n− 1, . . . , 1. It also uses and computes the square norms ofthe innovations

vn−1: = E(Xn −Πn−1Xn)2.

The algorithm is initialized at n = 1 by setting Π0X1 = EX1 and v0 = varX1 = γX(0).For each n the last of the coefficients ψn,t is computed as (cf. Exercise 2.14)

ψn,n =cov(Xn+1, X1)

varX1=γX(n)

v0.

In particular, this yields the single coefficient ψ1,1 at level n = 1. Once all coefficients atlevel n and v0, . . . , vn−1 are known, we compute

vn = varXn+1 − varΠnXn+1 = γX(0)−n∑

t=1

ψ2n,tvn−t.

Next if the coefficients (ψj,t) are known for j = 1, . . . , n− 1, then for t = n− 1, . . . , 1,

ψn,t =cov(Xn+1, Xn+1−t −Πn−tXn+1−t)

var(Xn+1−t −Πn−tXn+1−t)

=γX(t)− ψn−t,1ψn,t+1vn−t−1 − ψn−t,2ψn,t+2vn−t−2 − · · · − ψn−t,n−tψn,nv0

vn−t.

In the last step we have replaced Πn−tXn+1−t by

n−t∑

j=1

ψn−t,j(Xn+1−t−j −Πn−t−jXn+1−t−j),

and next used that

cov(Xn+1, Xn+1−t−j −Πn−t−jXn+1−t−j)

= cov(ΠnXn+1, Xn+1−t−j −Πn−t−jXn+1−t−j) = ψn,t+jvn−t−j .

2.5 Nonlinear Prediction

The method of linear prediction is commonly used in time series analysis. Its mainadvantage is simplicity: the linear predictor depends on the mean and auto-covariancefunction only, and in a simple fashion. On the other hand, utilization of general functionsf(X1, . . . , Xn) of the observations as predictors may decrease the prediction error.


2.34 Definition. The best predictor of Xn+1 based on X1, . . . , Xn is the function

fn(X1, . . . , Xn) that minimizes E∣

∣Xn+1 − f(X1, . . . , Xn)∣

∣

2over all measurable functions

f :Rn → R.

In view of the discussion in Section 2.2.1 the best predictor is the conditional ex-pectation E(Xn+1|X1, . . . , Xn) of Xn+1 given the variables X1, . . . , Xn. Best predictorsof other variables are similarly defined as conditional expectations.

The difference between linear and nonlinear predictors can be substantial. In “clas-sical” time series theory linear models with Gaussian errors were predominant and forthose models the two predictors coincide. However, for nonlinear models, or non-Gaussiandistributions, nonlinear prediction should be the method of choice, if feasible.

2.35 Example (GARCH). In the GARCH model of Example 1.10 the variable Xn+1 isgiven as σn+1Zn+1, where σn+1 is a function of Xn, Xn−1, . . . and Zn+1 is independentof these variables. It follows that the best predictor of Xn+1 given the infinite pastXn, Xn−1, . . . is given by σn+1E(Zn+1|Xn, Xn−1, . . .) = 0. We can find the best predictorgiven Xn, . . . , X1 by projecting this predictor further onto the space of all measurablefunctions of Xn, . . . , X1. By the linearity of the projection we again find 0.

We conclude that a GARCH model does not allow a “true prediction” of the future,if “true” refers to predicting the values of the time series itself.

On the other hand, we can predict other quantities of interest. For instance, theuncertainty of the value of Xn+1 is determined by the size of the “volatility” σn+1. Ifσn+1 is close to zero, then we may expect Xn+1 to be close to zero, and conversely. Giventhe infinite past Xn, Xn−1, . . . the variable σn+1 is known completely, but in the morerealistic situation that we know only Xn, . . . , X1 some chance component will be left.

For large n the difference between these two situations is small. The dependenceof σn+1 on Xn, Xn−1, . . . is given in Example 1.10 as σ2

n+1 =∑∞

j=0 φj(α + θX2

n−j) and

is nonlinear. For large n this is close to∑n−1

j=0 φj(α + θX2

n−j), which is a function of

X1, . . . , Xn. By definition the best predictor σ2n+1 based on X1, . . . , Xn is the closest

function and hence it satisfies

E∣

∣σ2n+1 − σ2

n+1

∣

∣

2 ≤ E∣

∣

∣

n−1∑

j=0

φj(α+ θX2n−j)− σ2

n+1

∣

∣

∣

2

= E∣

∣

∣

∞∑

j=n

φj(α+ θX2n−j)

∣

∣

∣

2

.

For small φ and large n this will be small if the sequence Xn is sufficiently integrable.Thus accurate nonlinear prediction of σ2

n+1 is feasible.

2.6 Partial Auto-Correlation

For a mean-zero stationary time series Xt the partial auto-correlation at lag h ≥ 1 isdefined as the correlation between Xh − Πh−1Xh and X0 − Πh−1X0, where Πh is the

2.6: Partial Auto-Correlation 35

projection onto lin (X1, . . . , Xh) (with the convention that this space is the point 0 ifh = 0). This is the “correlation between Xh and X0 with the correlation due to theintermediate variables X1, . . . , Xh−1 removed”. We shall denote it by

αX(h) = ρ(

Xh −Πh−1Xh, X0 −Πh−1X0

)

.

For an uncentered stationary time series we set the partial auto-correlation by definitionequal to the partial auto-correlation of the centered series Xt − EXt. We also defineαX(0) = 1. A convenient method to compute αX is given by the prediction equationscombined with the following lemma, which shows that αX(h) is the coefficient of X1 inthe best linear predictor of Xh+1 based on X1, . . . , Xh.

2.36 Lemma. Suppose thatXt is a mean-zero stationary time series. If φ1Xh+φ2Xh−1+· · ·+ φhX1 is the best linear predictor of Xh+1 based on X1, . . . , Xh, then αX(h) = φh.

Proof. Let ψ1Xh + · · ·+ψh−1X2 =:Π2,hX1 be the best linear predictor of X1 based onX2, . . . , Xh. The best linear predictor of Xh+1 based on X1, . . . , Xh can be decomposedas

ΠhXh+1 = φ1Xh + · · ·+ φhX1

=[

(φ1 + φhψ1)Xh + · · ·+ (φh−1 + φhψh−1)X2

]

+[

φh(X1 −Π2,hX1)]

.

The two random variables in square brackets are orthogonal, because X1 − Π2,hX1 ⊥lin (X2, . . . , Xh) by the projection theorem. Therefore, the second variable in squarebrackets is the projection of ΠhXh+1 onto the one-dimensional subspace lin (X1 −Π2,hX1). It is also the projection of Xh+1 onto this one-dimensional subspace, becauselin (X1 −Π2,hX1) ⊂ lin (X1, . . . , Xh) and we can compute projections by first projectingonto a bigger subspace.

The projection of Xh+1 onto the one-dimensional subspace lin (X1 − Π2,hX1) iseasily computed directly as α(X1 −Π2,hX1), for

α =〈Xh+1, X1 −Π2,hX1〉

‖X1 −Π2,hX1‖2=

〈Xh+1 −Π2,hXh+1, X1 −Π2,hX1〉‖X1 −Π2,hX1‖2

.

As it depends on the auto-covariance function only, the linear prediction problem issymmetric in time, and hence ‖X1 − Π2,hX1‖ = ‖Xh+1 − Π2,hXh+1‖. Therefore, theright side is exactly αX(h). In view of the preceding paragraph, we have α = φh and thelemma is proved.

2.37 Example (Autoregression). According to Example 2.27, for the stationary auto-regressive process Xt = φXt−1+Zt with |φ| < 1, the best linear predictor of Xn+1 basedon X1, . . . , Xn is φXn, for n ≥ 1. Thus αX(1) = φ and the partial auto-correlationsαX(h) of lags h > 1 are zero. This is often viewed as the dual of the property that forthe moving average sequence of order 1, considered in Example 1.6, the auto-correlationsof lags h > 1 vanish.

In Chapter 8 we shall see that for higher order stationary auto-regressive processesXt = φ1Xt−1 + · · · + φpXt−p + Zt the partial auto-correlations of lags h > p are zerounder the (standard) assumption that the time series is “causal”.

3Stochastic Convergence

This chapter provides a review of modes of convergence of sequences of stochastic vectors.In particular, convergence in distribution and in probability. Many proofs are omitted,but can be found in most standard probability books, and certainly in the book Asymp-totic Statistics (A.W. van der Vaart, 1998).

3.1 Basic theory

A random vector in Rk is a vector X = (X1, . . . , Xk) of real random variables. More

formally it is a Borel measurable map from some probability space in Rk. The distribution

function of X is the map x 7→ P(X ≤ x).A sequence of random vectors Xn is said to converge in distribution to X if

P(Xn ≤ x) → P(X ≤ x),

for every x at which the distribution function x 7→ P(X ≤ x) is continuous. Alternativenames are weak convergence and convergence in law. As the last name suggests, theconvergence only depends on the induced laws of the vectors and not on the probabilityspaces on which they are defined. We denote weak convergence by Xn X; if X hasdistribution L or a distribution with a standard code such as N(0, 1), then also byXn L or Xn N(0, 1).

Let d(x, y) be any distance function on Rk that generates the usual topology. For

instance

d(x, y) = ‖x− y‖ =(

k∑

i=1

(xi − yi)2)1/2

.

A sequence of random variables Xn is said to converge in probability to X if for all ε > 0

P(

d(Xn, X) > ε)

→ 0.

3.1: Basic theory 37

This is denoted by XnP→ X. In this notation convergence in probability is the same as

d(Xn, X) P→ 0.As we shall see convergence in probability is stronger than convergence in distribu-

tion. Even stronger modes of convergence are almost sure convergence and convergencein pth mean. The sequence Xn is said to converge almost surely to X if d(Xn, X) → 0with probability one:

P(

lim d(Xn, X) = 0)

= 1.

This is denoted by Xnas→ X. The sequence Xn is said to converge in pth mean to X if

Ed(Xn, X)p → 0.

This is denoted XnLp→ X. We already encountered the special cases p = 1 or p = 2,

which are referred to as “convergence in mean” and “convergence in quadratic mean”.Convergence in probability, almost surely, or in mean only make sense if each Xn

and X are defined on the same probability space. For convergence in distribution this isnot necessary.

The portmanteau lemma gives a number of equivalent descriptions of weak con-vergence. Most of the characterizations are only useful in proofs. The last one also hasintuitive value.

3.1 Lemma (Portmanteau). For any random vectors Xn and X the following state-ments are equivalent.(i) P(Xn ≤ x) → P(X ≤ x) for all continuity points of x→ P(X ≤ x);(ii) Ef(Xn) → Ef(X) for all bounded, continuous functions f ;(iii) Ef(Xn) → Ef(X) for all bounded, Lipschitz† functions f ;(iv) lim inf P(Xn ∈ G) ≥ P(X ∈ G) for every open set G;(v) lim supP(Xn ∈ F ) ≤ P(X ∈ F ) for every closed set F ;(vi) P(Xn ∈ B) → P(X ∈ B) for all Borel sets B with P(X ∈ δB) = 0 where δB = B−B

is the boundary of B.

The continuous mapping theorem is a simple result, but is extremely useful. If thesequence of random vector Xn converges to X and g is continuous, then g(Xn) convergesto g(X). This is true without further conditions for three of our four modes of stochasticconvergence.

3.2 Theorem (Continuous mapping). Let g:Rk → Rm be measurable and continuous

at every point of a set C such that P(X ∈ C) = 1.(i) If Xn X, then g(Xn) g(X);(ii) If Xn

P→ X, then g(Xn)P→ g(X);

(iii) If Xnas→ X, then g(Xn)

as→ g(X).

Any random vector X is tight: for every ε > 0 there exists a constant M such thatP(

‖X‖ > M)

< ε. A set of random vectors Xα:α ∈ A is called uniformly tight if M

† A function is called Lipschitz if there exists a number L such that |f(x) − f(y)| ≤ Ld(x, y) for every xand y. The least such number L is denoted ‖f‖Lip.

38 3: Stochastic Convergence

can be chosen the same for every Xα: for every ε > 0 there exists a constant M suchthat

supα

P(

‖Xα‖ > M)

< ε.

Thus there exists a compact set to which all Xα give probability almost one. Anothername for uniformly tight is bounded in probability. It is not hard to see that every weaklyconverging sequence Xn is uniformly tight. More surprisingly, the converse of this state-ment is almost true: according to Prohorov’s theorem every uniformly tight sequencecontains a weakly converging subsequence.

3.3 Theorem (Prohorov’s theorem). Let Xn be random vectors in Rk.

(i) If Xn X for some X, then Xn:n ∈ N is uniformly tight;(ii) If Xn is uniformly tight, then there is a subsequence with Xnj

X as j → ∞ forsome X.

3.4 Example. A sequence Xn of random variables with E|Xn| = O(1) is uniformlytight. This follows since by Markov’s inequality: P

(

|Xn| > M)

≤ E|Xn|/M . This can bemade arbitrarily small uniformly in n by choosing sufficiently large M.

The first absolute moment could of course be replaced by any other absolute mo-ment. Since the second moment is the sum of the variance and the square of the mean analternative sufficient condition for uniform tightness is: EXn = O(1) and varXn = O(1).

Consider some of the relationships between the three modes of convergence. Conver-gence in distribution is weaker than convergence in probability, which is in turn weakerthan almost sure convergence and convergence in pth mean.

3.5 Theorem. Let Xn, X and Yn be random vectors. Then(i) Xn

as→ X implies XnP→ X;

(ii) XnLp→ X implies Xn

P→ X;(iii) Xn

P→ X implies Xn X;(iv) Xn

P→ c for a constant c if and only if Xn c;(v) if Xn X and d(Xn, Yn)

P→ 0, then Yn X;(vi) if Xn X and Yn

P→ c for a constant c, then (Xn, Yn) (X, c);(vii) if Xn

P→ X and YnP→ Y , then (Xn, Yn)

P→ (X,Y ).

According to the last assertion of the lemma convergence in probability of a se-quence of vectors Xn = (Xn,1, . . . , Xn,k) is equivalent to convergence of every one ofthe sequences of components Xn,i separately. The analogous statement for convergencein distribution is false: convergence in distribution of the sequence Xn is stronger thanconvergence of every one of the sequences of components Xn,i. The point is that the dis-tribution of the components Xn,i separately does not determine their joint distribution:they might be independent or dependent in many ways. One speaks of joint convergencein distribution versus marginal convergence.

3.2: Convergence of Moments 39

The one before last assertion of the lemma has some useful consequences. If Xn Xand Yn c, then (Xn, Yn) (X, c). Consequently, by the continuous mapping theoremg(Xn, Yn) g(X, c) for every map g that is continuous at the set R

k × c where thevector (X, c) takes its values. Thus for every g such that

limx→x0,y→c

g(x, y) = g(x0, c), every x0.

Some particular applications of this principle are known as Slutsky’s lemma.

3.6 Lemma (Slutsky). Let Xn, X and Yn be random vectors or variables. If Xn Xand Yn c for a constant c, then(i) Xn + Yn X + c;(ii) YnXn cX;(iii) Xn/Yn X/c provided c 6= 0.

In (i) the “constant” c must be a vector of the same dimension as X, and in (ii) c isprobably initially understood to be a scalar. However, (ii) is also true if every Yn and care matrices (which can be identified with vectors, for instance by aligning rows, to givea meaning to the convergence Yn c), simply because matrix multiplication (y, x) → yxis a continuous operation. Another true result in this case is that XnYn Xc, if thisstatement is well defined. Even (iii) is valid for matrices Yn and c and vectorsXn providedc 6= 0 is understood as c being invertible and division is interpreted as (pre)multiplicationby the inverse, because taking an inverse is also continuous.

3.7 Example. Let Tn and Sn be statistical estimators satisfying

√n(Tn − θ) N(0, σ2), S2

nP→ σ2,

for certain parameters θ and σ2 depending on the underlying distribution, for everydistribution in the model. Then θ = Tn ± Sn/

√n ξα is a confidence interval for θ of

asymptotic level 1− 2α.This is a consequence of the fact that the sequence

√n(Tn−θ)/Sn is asymptotically

standard normal distributed.

* 3.2 Convergence of Moments

By the portmanteau lemma, weak convergence Xn X implies that Ef(Xn) → Ef(X)for every continuous, bounded function f . The condition that f be bounded is not su-perfluous: it is not difficult to find examples of a sequence Xn X and an unbounded,continuous function f for which the convergence fails. In particular, in general conver-gence in distribution does not imply convergence EXp

n → EXp of moments. However, inmany situations such convergence occurs, but it requires more effort to prove it.


A sequence of random variables Yn is called asymptotically uniformly integrable if

limM→∞

lim supn→∞

E|Yn|1|Yn| > M = 0.

A simple sufficient condition for this is that for some p > 1 the sequence E|Yn|p isbounded in n.

Uniform integrability is the missing link between convergence in distribution andconvergence of moments.

3.8 Theorem. Let f :Rk → R be measurable and continuous at every point in a set C.Let Xn X where X takes its values in C. Then Ef(Xn) → Ef(X) if and only if thesequence of random variables f(Xn) is asymptotically uniformly integrable.

3.9 Example. Suppose Xn is a sequence of random variables such that Xn X andlim supE|Xn|p <∞ for some p. Then all moments of order strictly less than p convergealso: EXk

n → EXk for every k < p.By the preceding theorem, it suffices to prove that the sequenceXk

n is asymptoticallyuniformly integrable. By Markov’s inequality

E|Xn|k1

|Xn|k ≥M

≤M1−p/k E|Xn|p.

The limsup, as n→ ∞ followed by M → ∞, of the right side is zero if k < p.

3.3 Arrays

Consider an infinite array xn,l of numbers, indexed by (n, l) ∈ N × N, such that everycolumn has a limit, and the limits xl themselves converge to a limit along the columns.

x1,1 x1,2 x1,3 x1,4 . . .x2,1 x2,2 x2,3 x2,4 . . .x3,1 x3,2 x3,3 x3,4 . . ....

......

... . . .↓ ↓ ↓ ↓ . . .x1 x2 x3 x4 . . . → x

Then we can find a “path” xn,ln , indexed by n ∈ N through the array along whichxn,ln → x as n→ ∞. (The point is to move to the right slowly in the array while goingdown, i.e. ln → ∞.) A similar property is valid for sequences of random vectors, wherethe convergence is taken as convergence in distribution.

3.4: Stochastic o and O symbols 41

3.10 Lemma. For n, l ∈ N let Xn,l be random vectors such that Xn,l Xl as n → ∞for every fixed l for random vectors such that Xl X as l → ∞. Then there exists asequence ln → ∞ such Xn,ln X as n→ ∞.

Proof. Let D = d1, d2, . . . be a countable set that is dense in Rk and that only contains

points at which the distribution functions of the limits X,X1, X2, . . . are continuous.Then an arbitrary sequence of random variables Yn converges in distribution to one ofthe variables Y ∈ X,X1, X2, . . . if and only if P(Yn ≤ di) → P(Y ≤ di) for everydi ∈ D. We can prove this using the monotonicity and right-continuity of distributionfunctions. In turn P(Yn ≤ di) → P(Y ≤ di) as n→ ∞ for every di ∈ D if and only if

∞∑

i=1

∣

∣P(Yn ≤ di)− P(Y ≤ di)∣

∣2−i → 0.

Now define

pn,l =∞∑

i=1

∣

∣P(Xn,l ≤ di)− P(Xl ≤ di)∣

∣2−i,

pl =

∞∑

i=1

∣

∣P(Xl ≤ di)− P(X ≤ di)∣

∣2−i.

The assumptions entail that pn,l → 0 as n → ∞ for every fixed l, and that pl → 0 asl → ∞. This implies that there exists a sequence ln → ∞ such that pn,ln → 0. By thetriangle inequality

∞∑

i=1

∣

∣P(Xn,ln ≤ di)− P(X ≤ di)∣

∣2−i ≤ pn,ln + pln → 0.

This implies that Xn,ln X as n→ ∞.

3.4 Stochastic o and O symbols

It is convenient to have short expressions for terms that converge in probability to zeroor are uniformly tight. The notation oP (1) (‘small “oh-P-one”’) is short for a sequenceof random vectors that converges to zero in probability. The expression OP (1) (‘big “oh-P-one”’) denotes a sequence that is bounded in probability. More generally, for a givensequence of random variables Rn

Xn = oP (Rn) means Xn = YnRn and YnP→ 0;

Xn = OP (Rn) means Xn = YnRn and Yn = OP (1).

This expresses that the sequence Xn converges in probability to zero or is boundedin probability at ‘rate’ Rn. For deterministic sequences Xn and Rn the stochastic oh-symbols reduce to the usual o and O from calculus.


There are many rules of calculus with o and O symbols, which will be appliedwithout comment. For instance,

oP (1) + oP (1) = oP (1)

oP (1) +OP (1) = OP (1)

OP (1)oP (1) = oP (1)(

1 + oP (1))−1

= OP (1)

oP (Rn) = RnoP (1)

OP (Rn) = RnOP (1)

oP(

OP (1))

= oP (1).

To see the validity of these “rules” it suffices to restate them in terms of explicitlynamed vectors, where each oP (1) and OP (1) should be replaced by a different sequenceof vectors that converges to zero or is bounded in probability. In this manner the firstrule says: if Xn

P→ 0 and YnP→ 0, then Zn = Xn + Yn

P→ 0; this is an example of thecontinuous mapping theorem. The third rule is short for: if Xn is bounded in probabilityand Yn

P→ 0, then XnYnP→ 0. If Xn would also converge in distribution, then this

would be statement (ii) of Slutsky’s lemma (with c = 0). But by Prohorov’s theorem Xn

converges in distribution “along subsequences” if it is bounded in probability, so that thethird rule can still be deduced from Slutsky’s lemma by “arguing along subsequences”.

Note that both rules are in fact implications and should be read from left to right,even though they are stated with the help of the equality “=” sign. Similarly, while it istrue that oP (1) + oP (1) = 2oP (1), writing down this rule does not reflect understandingof the oP -symbol.

Two more complicated rules are given by the following lemma.

3.11 Lemma. Let R be a function defined on a neighbourhood of 0 ∈ Rk such that

R(0) = 0. Let Xn be a sequence of random vectors that converges in probability to zero.(i) if R(h) = o(‖h‖) as h→ 0 , then R(Xn) = oP (‖Xn‖);(ii) if R(h) = O(‖h‖) as h→ 0, then R(Xn) = OP (‖Xn‖).

Proof. Define g(h) as g(h) = R(h)/‖h‖ for h 6= 0 and g(0) = 0. Then R(Xn) =g(Xn)‖Xn‖.

(i). Since the function g is continuous at zero by assumption, g(Xn)P→ g(0) = 0 by

the continuous mapping theorem.(ii). By assumption there existM and δ > 0 such that

∣

∣g(h)∣

∣ ≤M whenever ‖h‖ ≤ δ.

Thus P(∣

∣g(Xn)∣

∣ > M)

≤ P(

‖Xn‖ > δ)

→ 0, and the sequence g(Xn) is tight.

It should be noted that the rule expressed by the lemma is not a simple plug-in rule.For instance it is not true that R(h) = o(‖h‖) implies that R(Xn) = oP (‖Xn‖) for everysequence of random vectors Xn.

3.6: Cramer-Wold Device 43

3.5 Transforms

It is sometimes possible to show convergence in distribution of a sequence of random vec-tors directly from the definition. In other cases ‘transforms’ of probability measures mayhelp. The basic idea is that it suffices to show characterization (ii) of the portmanteaulemma for a small subset of functions f only.

The most important transform is the characteristic function

t 7→ EeitTX , t ∈ R

k.

Each of the functions x 7→ eitT x is continuous and bounded. Thus by the portmanteau

lemma EeitTXn → Eeit

TX for every t if Xn X. By Levy’s continuity theorem theconverse is also true: pointwise convergence of characteristic functions is equivalent toweak convergence.

3.12 Theorem (Levy’s continuity theorem). Let Xn and X be random vectors in

Rk. Then Xn X if and only if Eeit

TXn → EeitTX for every t ∈ R

k. Moreover, if

EeitTXn converges pointwise to a function φ(t) that is continuous at zero, then φ is the

characteristic function of a random vector X and Xn X.

The following lemma, which gives a variation on Levy’s theorem, is less well known,but will be useful in Chapter 4.

3.13 Lemma. Let Xn be a sequence of random variables such that E|Xn|2 = O(1) andsuch that E(iXn + vt)eitXn → 0 as n → ∞, for every t ∈ R and some v > 0. ThenXn N(0, v).

Proof. By Markov’s inequality and the bound on the second moments, the sequence Xn

is uniformly tight. In view of Prohorov’s theorem it suffices to show that N(0, v) is theonly weak limit point.

If Xn X along some sequence of n, then by the boundedness of the secondmoments and the continuity of the function x 7→ (ix+vt)eitx, we have E(iXn+vt)e

itXn →E(iX+ vt)eitX for every t ∈ R. (Cf. Theorem 3.8.) Combining this with the assumption,we see that E(iX + vt)eitX = 0. By Fatou’s lemma EX2 ≤ lim inf EX2

n < ∞ and hencewe can differentiate the characteristic function φ(t) = EeitX under the expectation tofind that φ′(t) = EiXeitX . We conclude that φ′(t) = −vtφ(t). This differential equationpossesses φ(t) = e−vt

2/2 as the only solution within the class of characteristic functions.Thus X is normally distributed with mean zero and variance v.

3.6 Cramer-Wold Device

The characteristic function t 7→ EeitTX of a vector X is determined by the set of all

characteristic functions u 7→ Eeiu(tTX) of all linear combinations tTX of the components


of X. Therefore the continuity theorem implies that weak convergence of vectors isequivalent to weak convergence of linear combinations:

Xn X if and only if tTXn tTX for all t ∈ Rk.

This is known as the Cramer-Wold device. It allows to reduce all higher dimensionalweak convergence problems to the one-dimensional case.

3.14 Example (Multivariate central limit theorem). Let Y, Y1, Y2, . . . be i.i.d. randomvectors in R

k with mean vector µ = EY and covariance matrix Σ = E(Y − µ)(Y − µ)T .Then

1√n

n∑

i=1

(Yi − µ) =√n(Y n − µ) Nk(0,Σ).

(The sum is taken coordinatewise.) By the Cramer-Wold device the problem can bereduced to finding the limit distribution of the sequences of real-variables

tT( 1√

n

n∑

i=1

(Yi − µ))

=1√n

n∑

i=1

(tTYi − tTµ).

Since the random variables tTY1 − tTµ, tTY2 − tTµ, . . . are i.i.d. with zero mean andvariance tTΣt this sequence is asymptotically N1(0, t

TΣt) distributed by the univariatecentral limit theorem. This is exactly the distribution of tTX if X possesses a Nk(0,Σ)distribution.

3.7 Delta-method

Let Tn be a sequence of random vectors with values in Rk and let φ:Rk → R

m be agiven function defined at least on the range of Tn and a neighbourhood of a vector θ.We shall assume that, for given constants rn → ∞, the sequence rn(Tn− θ) converges indistribution, and wish to derive a similar result concerning the sequence rn

(

φ(Tn)−φ(θ))

.Recall that φ is differentiable at θ if there exists a linear map (matrix) φ′θ:R

k → Rm

such that

φ(θ + h)− φ(θ) = φ′θ(h) + o(‖h‖), h→ 0.

All the expressions in this equation are vectors of length m and ‖h‖ is the Euclideannorm. The linear map h 7→ φ′θ(h) is sometimes called a total derivative, as opposed topartial derivatives. A sufficient condition for φ to be (totally) differentiable is that allpartial derivatives ∂φj(x)/∂xi exist for x in a neighbourhood of θ and are continuousat θ. (Just existence of the partial derivatives is not enough.) In any case the total

3.8: Lindeberg Central Limit Theorem 45

derivative is found from the partial derivatives. If φ is differentiable, then it is partiallydifferentiable and the derivative map h 7→ φ′θ(h) is matrix multiplication by the matrix

φ′θ =

∂φ1

∂x1(θ) · · · ∂φ1

∂xk(θ)

......

∂φm

∂x1(θ) · · · ∂φm

∂xk(θ)

.

If the dependence of the derivative φ′θ on θ is continuous, then φ is called continuouslydifferentiable.

3.15 Theorem. Let φ:Rk → Rm be a measurable map defined on a subset of Rk and

differentiable at θ. Let Tn be random vectors taking their values in the domain of φ. Ifrn(Tn − θ) T for numbers rn → ∞, then rn

(

φ(Tn) − φ(θ))

φ′θ(T ). Moreover, the

difference between rn(

φ(Tn)−φ(θ))

and φ′θ(

rn(Tn− θ))

converges to zero in probability.

Proof. Because rn → ∞, we have by Slutsky’s lemma Tn − θ = (1/rn)rn(Tn − θ) 0T = 0 and hence Tn − θ converges to zero in probability. Define a function g by

g(0) = 0, g(h) =φ(θ + h)− φ(θ)− φ′θ(h)

‖h‖ , if h 6= 0.

Then g is continuous at 0 by the differentiability of φ. Therefore, by the continuousmapping theorem, g(Tn−θ) P→ 0 and hence, by Slutsky’s lemma and again the continuousmapping theorem rn‖Tn − θ‖g(Tn − θ) P→ ‖T‖0 = 0. Consequently,

rn(

φ(Tn)− φ(θ)− φ′θ(Tn − θ))

= rn‖Tn − θ‖g(Tn − θ) P→ 0.

This yields the last statement of the theorem. Since matrix multiplication is continuous,φ′θ(

rn(Tn − θ))

φ′θ(T ) by the continuous-mapping theorem. Finally, apply Slutsky’s

lemma to conclude that the sequence rn(

φ(Tn)− φ(θ))

has the same weak limit.

A common situation is that√n(Tn−θ) converges to a multivariate normal distribu-

tion Nk(µ,Σ). Then the conclusion of the theorem is that the sequence√n(

φ(Tn)−φ(θ))

converges in law to the Nm

(

φ′θµ, φ′θΣ(φ

′θ)

T)

distribution.

3.8 Lindeberg Central Limit Theorem

In this section we state, for later reference, a central limit theorem for independent, butnot necessarily identically distributed random vectors.


3.16 Theorem (Lindeberg). For each n ∈ N let Yn,1, . . . , Yn,n be independent randomvectors with finite covariance matrices such that

1

n

n∑

i=1

Cov Yn,i → Σ,

1

n

n∑

i=1

E‖Yn,i‖21

‖Yn,i‖ > ε√n

→ 0, for every ε > 0.

Then the sequence n−1/2∑n

i=1(Yn,i − EYn,i) converges in distribution to the normaldistribution with mean zero and covariance matrix Σ.

3.9 Minimum Contrast Estimators

Many estimators θn of a parameter θ are defined as the point of minimum (or maximum)of a given stochastic process θ 7→ Mn(θ). In this section we state basic theorems thatgive the asymptotic behaviour of such minimum contrast estimators or M -estimatorsθn in the case that the contrast function Mn fluctuates around a deterministic, smoothfunction.

Let Mn be a sequence of stochastic processes indexed by a subset Θ of Rd, defined ongiven probability spaces, and let θn be random vectors defined on the same probabilityspaces with values in Θ such that Mn(θn) ≤ Mn(θ) for every θ ∈ Θ. Typically it will betrue that Mn(θ)

P→ M(θ) for each θ and a given deterministic function M . Then we may

expect that θnP→ θ0 for θ0 a point of minimum of the map θ → M(θ). The following

theorem gives a sufficient condition for this. It applies to the more general situation thatthe “limit” function M is actually a random process.

For a sequence of random variables Xn we write XnP≫ 0 if Xn > 0 for every n and

1/Xn = OP (1).

3.17 Theorem. Let Mn andMn be stochastic processes indexed by a semi-metric spaceΘ such that, for some θ0 ∈ Θ,

supθ∈Θ

∣

∣Mn(θ)−Mn(θ)∣

∣

P→ 0,

infθ∈Θ:d(θ,θ0)>δ

Mn(θ)−Mn(θ0)P≫ 0.

If θn are random elements with values in Θ with Mn(θn) ≥ Mn(θ0) − oP (1), then

d(θn, θ0)P→ 0.

Proof. By the uniform convergence to zero of Mn−Mn and the minimizing property ofθn, we haveMn(θn) = Mn(θn)+oP (1) ≤ Mn(θ0)+oP (1) =Mn(θ0)+oP (1). Write Zn(δ)

3.9: Minimum Contrast Estimators 47

for the left side of the second equation in the display of the theorem. Then d(θn, θ0) > δ

implies that Mn(θn)−Mn(θ0) ≥ Zn(δ). Combined with the preceding this implies thatZn(δ) ≤ oP (1). By assumption the probability of this event tends to zero.

If the limit criterion function θ → M(θ) is smooth and takes its minimum atthe point θ0, then its first derivative must vanish at θ0, and the second derivativeV must be positive definite. Thus it possesses a parabolic approximation M(θ) =M(θ0) +

12 (θ − θ0)

TV (θ − θ0) around θ0. The random criterion function Mn can bethought of as the limiting criterion function plus the random perturbation Mn −M andpossesses approximation

Mn(θ)−Mn(θ0) ≈ 12 (θ − θ0)

TV (θ − θ0) +[

(Mn −M)(θ)− (Mn −M)(θ0)]

.

We shall assume that the term in square brackets possesses a linear approximation of theform (θ − θ0)

TZn/√n. If we ignore all the remainder terms and minimize the quadratic

formθ − θ0 7→ 1

2 (θ − θ0)TV (θ − θ0) + (θ − θ0)

TZn/√n

over θ− θ0, then we find that the minimum is taken for θ− θ0 = −V −1Zn/√n. Thus we

expect that theM -estimator θn satisfies√n(θn−θ0) = −V −1Zn+oP (1). This derivation

is made rigorous in the following theorem.

3.18 Theorem. Let Mn be stochastic processes indexed by an open subset Θ of Eu-clidean space and let M : Θ → R be a deterministic function. Assume that θ → M(θ)is twice continuously differentiable at a point of minimum θ0 with nonsingular second-derivative matrix V .‡ Suppose that

rn(Mn −M)(θn)− rn(Mn −M)(θ0)

= (θn − θ0)′Zn + o∗P

(

‖θn − θ0‖+ rn‖θn − θ0‖2 + r−1n

)

,

for every random sequence θn = θ0 + o∗P (1) and a uniformly tight sequence of random

vectors Zn. If the sequence θn converges in outer probability to θ0 and satisfies Mn(θn) ≤infθ Mn(θ) + oP (r

−2n ) for every n, then

rn(θn − θ0) = −V −1Zn + o∗P (1).

If it is known that the sequence rn(θn−θ0) is uniformly tight, then the displayed conditionneeds to be verified for sequences θn = θ0 +O∗P (r

−1n ) only.

Proof. The stochastic differentiability condition of the theorem together with the two-times differentiability of the map θ →M(θ) yields for every sequence hn = o∗P (1)

(3.1)Mn(θ0 + hn)−Mn(θ0) =

12 h′nV hn + r−1n h′nZn

+ o∗P(

‖hn‖2 + r−1n ‖hn‖+ r−2n

)

.

‡ It suffices that a two-term Taylor expansion is valid at θ0.


For hn chosen equal to hn = θn − θ0, the left side (and hence the right side) is at

most oP (r−2n ) by the definition of θn. In the right side the term h′nV hn can be bounded

below by c‖hn‖2 for a positive constant c, since the matrix V is strictly positive definite.Conclude that

c‖hn‖2 + r−1n ‖hn‖OP (1) + oP(

‖hn‖2 + r−2n

)

≤ oP (r−2n ).

Complete the square to see that this implies that

(

c+ oP (1))

(

‖hn‖ −OP (r−1n ))2

≤ OP (r−2n ).

This can be true only if ‖hn‖ = O∗P (r−1n ).

For any sequence hn of the order O∗P (r−1n ), the three parts of the remainder term

in (3.1) are of the order oP (r−2n ). Apply this with the choices hn and −r−1n V −1Zn to

conclude that

Mn(θ0 + hn)−Mn(θ0) =12 h′nV hn + r−1n h′nZn + o∗P (r

−2n ),

Mn(θ0 − r−1n V −1Zn)−Mn(θ0) = − 12r−2n Z ′nV

−1Zn + o∗P (r−2n ).

The left-hand side of the first equation is smaller than the second, up to an o∗P (r−2n )-term.

Subtract the second equation from the first to find that

12 (hn + r−1n V −1Zn)

′V (hn + r−1n V −1Zn) ≤ oP (r−2n ).

Since V is strictly positive definite, this yields the first assertion of the theorem.If it is known that the sequence θn is rn-consistent, then the middle part of the

preceding proof is unnecessary and we can proceed to inserting hn and −r−1n V −1Zn

in (3.1) immediately. The latter equation is then needed for sequences hn = O∗P (r−1n )

only.

4Central Limit Theorem

The classical central limit theorem asserts that the mean of independent, identicallydistributed random variables with finite variance is asymptotically normally distributed.In this chapter we extend this to dependent variables.

Given a stationary time series Xt let Xn = n−1∑n

t=1Xt be the average of thevariables X1, . . . , Xn. If µ and γX are the mean and auto-covariance function of the timeseries, then, by the usual rules for expectation and variance,

EXn = µ,

var(√nXn) =

1

n

n∑

s=1

n∑

t=1

cov(Xs, Xt) =

n∑

h=−n

(n− |h|n

)

γX(h).(4.1)

In the expression for the variance every of the terms(

n−|h|)

/n is bounded by 1 and con-

verges to 1 as n→ ∞. If∑∣

∣γX(h)∣

∣ <∞, then we can apply the dominated convergence

theorem and obtain that var(√nXn) →

∑

h γX(h). In any case

(4.2) var√nXn ≤

∑

h

∣

∣γX(h)∣

∣.

Hence absolute convergence of the series of auto-covariances implies that the sequence√n(Xn − µ) is uniformly tight. The purpose of the chapter is to give conditions for

this sequence to be asymptotically normally distributed with mean zero and variance∑

h γX(h).Such conditions are of two types: martingale and mixing.The martingale central limit theorem is concerned with time series such that the

sums∑n

t=1Xt from a martingale. It thus makes a structural assumption on the expectedvalue of the increments Xt of these sums given the past variables.

Mixing conditions, in a general sense, require that elements Xt and Xt+h at largetime lags h be approximately independent. Positive and negative values of deviationsXt−µ at large time lags will then occur independently and partially cancel each other, whichis the intutive reason for normality of sums. Absolute convergence of the series

∑

h γX(h),

50 4: Central Limit Theorem

which is often called short-range dependence ‘, can be viewed as a mixing condition, asit implies that covariances at large lags are small. However, it is not strong enough toimply a central limit theorem. Finitely dependent time series and linear processes arespecial examples of mixing time series that do satisfy the central limit theorem. We alsodiscuss a variety of general “mixing” conditions.

* 4.1 EXERCISE. Suppose that the series v: =∑

h γX(h) converges (not necessarily ab-solutely). Show that var

√nXn → v. [Write var

√nXn as vn for vh =

∑

|j|<h γX(j) and

apply Cesaro’s lemma: if vn → v, then vn → v.]

4.1 Finite Dependence

A time series Xt is called m-dependent if the random vectors (. . . , Xt−1, Xt) and(Xt+m+1, Xt+m+2, . . .) are independent for every t ∈ Z. In other words, “past” and“future” are independent if m “present” variables are left out.

4.2 EXERCISE. Show that the moving average Xt = Zt + θZt−1 considered in Exam-ple 1.6, with Zt an i.i.d. sequence, is 1-dependent.

4.3 EXERCISE. Show that “0-dependent” is equivalent to “independent”.

4.4 Theorem. Let Xt be a strictly stationary, m-dependent time series with mean zeroand finite variance. Then the sequence

√nXn converges in distribution to a normal

distribution with mean 0 and variance∑

h γX(h).

Proof. Choose a (large) integer l and divide X1, . . . , Xn into r = ⌊n/l⌋ groups of size land a remainder group of size n−rl < l. Let A1,l, . . . , Ar,l and B1,l, . . . , Br,l be the sumsof the first l −m and last m of the variables Xi in the r groups. (Cf. Figure 4.1.) Thenboth A1,l, . . . , Ar,l and B1,l, . . . , Br,l are sequences of independent identically distributedrandom variables (but the two sequences may be dependent) and

(4.3)

n∑

i=1

Xi =

r∑

j=1

Aj,l +

r∑

j=1

Bj,l +

n∑

i=rl+1

Xi.

We shall show that for suitable l = ln → ∞, the second and third terms tend to zero,whereas the first tends to a normal distribution.

For fixed l and n → ∞ (hence r → ∞) the classical central limit theorem appliedto the variables Aj,l yields

1√n

r∑

j=1

Aj,l =

√

r

n

1√r

r∑

j=1

Aj,l 1√lN(0, varA1,l).

4.2: Linear Processes 51

Next the variables n−1/2∑n

i=rl+1Xi have mean zero, and by the triangle inequality, forfixed l as n→ ∞,

sd( 1√

n

n∑

i=rl+1

Xi

)

≤ l√nsd(X1) → 0.

By Chebyshev’s inequality it follows that the third term in (4.3) tends to zero in proba-bility. We conclude by Slutsky’s lemma that, as n→ ∞,

Sn,l: =1√n

r∑

j=1

Aj,l +1√n

n∑

i=rl+1

Xi N(

0,1

lvarA1,l

)

.

This is true for every fixed l. If l → ∞, then

1

lvarA1,l =

1

lvar

l−m∑

i=1

Xi =l−m∑

h=m−l

l −m− |h|l

γX(h) →m∑

h=−mγX(h) =: v.

Therefore, by Lemma 3.10 there exists a sequence ln → ∞ such that Sn,ln N(0, v).Let rn = ⌊n/ln⌋ be the corresponding sequence of values of rn, so that rn/n→ 0. By

the strict stationarity of the series Xt each Bj,l is equal in distribution to X1+ · · ·+Xm.Hence varBj,l is independent of j and l and, by the independence of B1,l, . . . , Br,l,

E( 1√

n

rn∑

j=1

Bj,ln

)2

=rnn

varB1,ln → 0.

Thus the sequence of random variables in the left side converges to zero in probability,by Chebyshev’s inequality.

In view of (4.3) another application of Slutsky’s lemma gives the result.

1 l + 1 2l + 1 (r − 1)l + 1 rl + 1 n. . . . . . . . . .

← l−m →← m →← l−m →← m → ← l−m →← m →A1,l B1,l A2,l B2,l Ar,l Br,l

Figure 4.1. Blocking of observations in the proof of Theorem 4.4.

4.2 Linear Processes

A linear process is a time series that can be represented as an infinite moving average

(4.4) Xt = µ+∞∑

j=−∞ψjZt−j ,


for a sequence . . . , Z−1, Z0, Z1, Z2, . . . of independent and identically distributed variableswith EZt = 0, a constant µ, and constants ψj with

∑

j |ψj | <∞. This may seem special,but we shall see later that this includes, for instance, the rich class of all ARMA-processes.

By (iii) of Lemma 1.29 the covariance function of a linear process is given by γX(h) =σ2∑

j ψjψj+h, where σ2 = varZt, and hence the asymptotic variance of

√nXn is given

by

(4.5) v: =∑

h

γX(h) = σ2∑

h

∑

j

ψjψj+h = σ2(

∑

j

ψj

)2

.

4.5 Theorem. Suppose that (4.4) holds for an i.i.d. sequence Zt with mean zero andfinite variance and numbers ψj with

∑

j |ψj | < ∞. Then the sequence√n(Xn − µ)

converges in distribution to a normal distribution with mean zero and variance v.

Proof. We can assume without loss of generality that µ = 0. For a fixed (large) integerm define the time series

Xmt =

∑

|j|≤mψjZt−j =

∑

j

ψmj Zt−j ,

where ψmj = ψj if |j| ≤ m and 0 otherwise. Then Xm

t is 2m-dependent and strictly

stationary. By Theorem 4.4, the sequence√nXm

n converges as n→ ∞ in distribution toa normal distribution with mean zero and variance (see (4.5))

vm: =∑

h

γXm(h) = σ2(

∑

j

ψmj

)2

= σ2(

∑

|j|≤mψj

)2

.

As m → ∞ this converges to v. Because N(0, vm) N(0, v), Lemma 3.10 guaranteesthe existence of a sequence mn → ∞ such that

√nXmn

n N(0, v).In view of Slutsky’s lemma the proof will be complete once we have shown that√

n(Xn−Xmnn ) P→ 0. This concerns the average Y mn

n of the differences Y mt = Xt−Xm

t =∑

|j|>m ψjZt−j . These satisfy

E(√nXn −√

nXmnn

)2

= var√nY mn

n ≤∑

h

∣

∣γY mn (h)∣

∣ ≤ σ2(

∑

|j|>mn

|ψj |)2

.

The inequalities follow by (4.1) and Lemma 1.29(iii). The right side converges to zero asmn → ∞.

4.3: Strong Mixing 53

4.3 Strong Mixing

The α-mixing coefficients (or strong mixing coefficients) of a time series Xt are definedby α(0) = 1

2 and for h ∈ N

α(h) = 2 supt

supA∈σ(...,Xt−1,Xt)

B∈σ(Xt+h,Xt+h+1,...)

∣

∣P(A ∩B)− P(A)P(B)∣

∣.

The events A and B in this display depend on elements Xt of the “past” and “future”,respectively, that are h time lags apart. Thus the α-mixing coefficients measure the extentby which events A and B that are separated by h time instants fail to satisfy the equalityP(A ∩B) = P(A)P(B), which is valid for independent events. If the series Xt is strictlystationary, then the supremum over t is unnecessary, and the mixing coefficient α(h) canbe defined using the σ-fields σ(. . . , X−1, X0) and σ(Xh, Xh+1, . . .) only.

It is immediate from their definition that the coefficients α(1), α(2), . . . are decreas-ing and nonnegative. Furthermore, if the time series is m-dependent, then α(h) = 0 forh > m.

4.6 EXERCISE. Show that α(1) ≤ 12 ≡ α(0). [Apply the inequality of Cauchy-Schwarz

to P(A ∩B)− P(A)P(B) = cov(1A, 1B).]

Warning. The mixing numbers α(h) are denoted by the same symbol as the partialauto-correlations αX(h).

If α(h) → 0 as h→ ∞, then the time series Xt is called α-mixing or strong mixing.Then events connected to time sets that are far apart are “approximately independent”.For a central limit theorem to hold, the convergence to 0 must take place at a sufficientspeed, dependent on the “sizes” of the variables Xt.

A precise formulation can best be given in terms of the inverse function of the mixingcoefficients. We can extend α to a function α: [0,∞) → [0, 1] by defining it to be constanton the intervals [h, h+1) for integers h. This yields a monotone, right-continuous functionthat decreases in steps from α(0) = 1

2 to 0 at infinity in the case that the time series ismixing. The generalized inverse α−1: [0, 1] → [0,∞) is defined by

α−1(u) = inf

x ≥ 0:α(x) ≤ u

=∞∑

h=0

1u<α(h).

Alternatively, we can define α−1(u) = h iff α(h) ≤ u < α(h− 1). Similarly, the quantilefunction F−1X of a random variable X is the (usual) generalized inverse of the distributionfunction FX of X, and is given by

F−1X (1− u) = inf

x ∈ R: 1− FX(x) ≤ u

.

Both functions u 7→ α−1(u) and u 7→ F−1X (1− u) are decreasing.

We denote by σ(Xt: t ∈ I) the σ-field generated by the set of random variables Xt: t ∈ I.


0 1 2 3 4 5

0.0

0.2

0.4

0.6

0.8

1.0

o

o

oo o

0.0 0.2 0.4 0.6 0.8 1.0

01

23

45

o

o

o

o

o

Figure 4.2. Typical functions α (top panel) and α−1 (bottom panel).

4.7 Theorem. If Xt is a strictly stationary time series with mean zero such that∫ 1

0α−1(u)F−1|X0|(1 − u)2 du < ∞, then the series v =

∑

h γX(h) converges absolutely

and√nXn N(0, v).

The condition of the theorem is a little complicated, because it combines the re-quirements that the mixing coefficients converge to zero fast enough and that the tails ofthe variables Xt are not too heavy. Loosening one of the two more may be compensatedby making the other requirement more stringent. To make this concrete we can derivefiniteness of the integral under a combination of a mixing and a moment condition, asexpressed in the following corollary.

4.8 Corollary (Ibragimov). If Xt is a strictly stationary time series with mean zerosuch that E|X0|r < ∞ and

∑

h α(h)1−2/r < ∞ for some r > 2, then the series v =

∑

h γX(h) converges absolutely and√nXn N(0, v).

Proof. If cr: = E|X0|r < ∞ for some r > 2, then 1 − F|X0|(x) ≤ cr/xr by Markov’s

4.3: Strong Mixing 55

inequality and hence F−1|X0|(1− u) ≤ c/u1/r. Then we obtain the bound

∫ 1

0

α−1(u)F−1|X0|(1− u)2 du ≤∞∑

h=0

∫ 1

0

1u<α(h)c2

u2/rdu =

c2r

r − 2

∞∑

h=0

α(h)1−2/r.

The right side is finite by assumption, and hence the integral condition of Theorem 4.7is satisfied.

The corollary assumes finiteness of the rth moment E|X0|r and convergence of theseries of the (1 − 2/r)th powers of the mixing coefficients. For larger values of r themoment condition is more restrictive, but the mixing condition is weaker. The intuitionis that heavy tailed distributions occasionally produce big values |Xt|. Combined withstrong dependence, this may lead to clusters of big variables, which will influence theaverage Xn relatively strongly, and prevent its approximate normality.

4.9 EXERCISE (Case r = ∞). Show that∫ 1

0α−1(u)F−1|X0|(1− u)2 du is bounded above

by ‖X0‖2∞∑∞

h=0 α(h). [Note that F−1|X0|(1 − U) is distributed as |X0| if U is uniformly

distributed.]

4.10 EXERCISE. Show that∫ 1

0α−1(u)F−1|X0|(1−u)2 du ≤ (m+1)EX2

0 if the time series

Xt is m-dependent. Recover Theorem 4.4 from Theorem 4.7.

4.11 EXERCISE. Show that∫ 1

0α−1(u)F−1|X0|(1− u)2 du <∞ implies that E|X0|2 <∞.

The key to the proof of Theorem 4.7 is a lemma that bounds covariances in termsof mixing coefficients. Let ‖X‖p denote the Lp-norm of a random variable X, i.e.

‖X‖p =(

E|X|p)1/p

, 1 ≤ p <∞, ‖X‖∞ = infM : P(|X| ≤M) = 1.

Recall Holder’s inequality: for any pair of numbers p, q > 0 (possibly infinite) withp−1 + q−1 = 1 and random variables X and Y

E|XY | ≤ ‖X‖p‖Y ‖q.

For p = q = 2 this is precisely the inequality of Cauchy-Schwarz. The other case ofinterest to us is the combination p = 1, q = ∞, for which the inequality is immediate. Byrepeated application the inequality can be extended to more than two random variables.For instance, for any numbers p, q, r > 0 with p−1+ q−1+ r−1 = 1 and random variablesX, Y , and Z

E|XY Z| ≤ ‖X‖p‖Y ‖q‖Z‖r.


4.12 Lemma (Covariance bound). Let Xt be a time series with α-mixing coefficientsα(h) and let Y and Z be random variables that are measurable relative to σ(. . . , X−1, X0)and σ(Xh, Xh+1, . . .), respectively, for a given h ≥ 0. Then, for any p, q, r > 0 such thatp−1 + q−1 + r−1 = 1,

∣

∣cov(Y,Z)∣

∣ ≤ 2

∫ α(h)

0

F−1|Y |(1− u)F−1|Z| (1− u) du ≤ 2α(h)1/p‖Y ‖q‖Z‖r.

Proof. By the definition of the mixing coefficients, we have, for every y, z > 0,∣

∣cov(1Y +>y, 1Z+>z)∣

∣ ≤ 12α(h).

The same inequality is valid with Y + and/or Z+ replaced by Y − and/or Z−. By bilin-earity of the covariance and the triangle inequality,

∣

∣cov(1Y +>y − 1Y −>y, 1Z+>z − 1Z−>z)∣

∣ ≤ 2α(h).

Because∣

∣cov(U, V )∣

∣ ≤ 2E|U | ‖V ‖∞ for any pair of random variables U, V (by the simplestHolder inequality), we obtain that the covariance on the left side of the preceding displayis also bounded by 2

(

P(Y + > y) + P(Y − > y))

= 2(1 − F|Y |)(y). Yet another boundfor the covariance is obtained by interchanging the roles of Y and Z. Combining thethree inequalities, we see that, for any y, z > 0, the left side of the preceding display isbounded above by

2[

α(h) ∧ (1− F|Y |)(y) ∧ (1− F|Z|)(z)]

= 2

∫ α(h)

0

11−F|Y |(y)>u 11−F|Z|(z)>u du.

Next we write Y = Y + − Y − =∫∞0

(1Y +>y − 1Y −>y) dy and similarly for Z, to obtain,by Fubini’s theorem,

∣

∣cov(Y,Z)∣

∣ =∣

∣

∣

∫ ∞

0

∫ ∞

0

cov(1Y +>y − 1Y −>y, 1Z+>z − 1Z−>z) dy dz∣

∣

∣

≤ 2

∫ ∞

0

∫ ∞

0

∫ α(h)

0

1F|Y |(y)<1−u 1F|Z|(z)<1−u du dy dz.

Any pair of a distribution function F and a quantile function F−1 satisfies F (x) < u ifand only x < F−1(u), for every x and u. Hence

∫∞0

1F (x)<1−u dx = F−1(1− u), and weconclude the proof of the first inequality of the lemma by another application of Fubini’stheorem.

The second inequality follows upon noting that F−1|Y |(1− U) is distributed as |Y | ifU is uniformly distributed on [0, 1], and next applying Holder’s inequality.

Taking Y and Z in the lemma to X0 and Xh, respectively, for a stationary timeseries Xt, and (p, q, r) equal to (r/(r − 2), r, r) for some r ≥ 4, we see that

∣

∣γX(h)∣

∣ ≤ 2

∫ α(h)

0

F−1|X0|(1− u)2 du ≤ 2α(h)1−2/r‖X0‖2r.

4.4: Uniform Mixing 57

Consequently, we find that

∑

h≥0

∣

∣γX(h)∣

∣ ≤ 2∑

h≥0

∫ α(h)

0

F−1|X0|(1− u)2 du = 2

∫ 1

0

α−1(u)F−1|X0|(1− u)2 du.

This proves the first assertion of Theorem 4.7. Furthermore, in view of (4.2) and thesymmetry of the auto-covariance function,

(4.6) var√nXn ≤ 4

∫ 1

0

α−1(u)F−1|X0|(1− u)2 du.

We defer the remainder of the proof of Theorem 4.7 to the end of the chapter.

* 4.4 Uniform Mixing

There are several other types of mixing coefficients. The φ-mixing coefficients or uniformmixing coefficients of a strictly stationary time series Xt are defined by

φ(h) = supA∈σ(...,X−1,X0),P(A) 6=0

B∈σ(Xh,Xh+1,...)

∣

∣P(B|A)− P(B)∣

∣,

φ(h) = supA∈σ(...,X−1,X0)

B∈σ(Xh,Xh+1,...),P(B) 6=0

∣

∣P(A|B)− P(A)∣

∣.

It is immediate from the definitions that α(h) ≤ 2(

φ(h) ∧ φ(h))

. Thus a φ-mixing timeseries is always α-mixing. It appears that conditions in terms of φ-mixing are often muchmore restrictive, even though there is no complete overlap.

4.13 Lemma (Covariance bound). Let Xt be a strictly stationary time series withφ-mixing coefficients φ(h) and φ(h) and let Y and Z be random variables that aremeasurable relative to σ(. . . , X−1, X0) and σ(Xh, Xh+1, . . .), respectively, for a givenh ≥ 0. Then, for any p, q > 0 with p−1 + q−1 = 1,

∣

∣cov(Y,Z)∣

∣ ≤ 2φ(h)1/pφ(h)1/q‖Y ‖p‖Z‖q.

Proof. Let Q be the measure PY,Z −PY ⊗PZ on R2, and let |Q| be its absolute value.

Then

∣

∣cov(Y,Z)∣

∣ =∣

∣

∣

∫ ∫

yz dQ(y, z)∣

∣

∣≤(

∫ ∫

|y|p dQ(y, z))1/p(

∫ ∫

|z|q dQ(y, z))1/q

,

by Holder’s inequality. It suffices to show that the first and second marginals of |Q| arebounded above by the measures 2φ(h)PY and 2φ(h)PZ , respectively. By symmetry itsuffices to consider the first marginal.


By definition we have that

|Q|(C) = supD

(

∣

∣Q(C ∩D)∣

∣+∣

∣Q(C ∩Dc)∣

∣

)

for the supremum taken over all Borel sets D in R2. Equivalently, we can compute the

supremum over any algebra that generates the Borel sets. In particular, we can use thealgebra consisting of all finite unions of rectangles A×B. Conclude from this that

|Q|(C) = sup∑

i

∑

j

∣

∣Q(C ∩ (Ai ×Bj)∣

∣,

for the supremum taken over all pairs of partitions R = ∪iAi and R = ∪jBj . It followsthat

|Q|(A× R) = sup∑

i

∑

j

∣

∣Q(

(A ∩Ai)×Bj

)∣

∣

= sup∑

i

∑

j

∣

∣PZ|Y (Bj |A ∩Ai)− PZ(Bj)∣

∣PY (A ∩Ai).

If, for fixed i, B+i consists of the union of all Bj such that PZ|Y (Bj |A∩Ai)−PZ(Bj) > 0

and B−i is the union of the remaining Bj , then the double sum can be rewritten

∑

i

(

∣

∣PZ|Y (B+i |A ∩Ai)− PZ(B+

i )∣

∣+∣

∣PZ|Y (B−i |A ∩Ai)− PZ(B−i )∣

∣

)

PY (A ∩Ai).

The sum between round brackets is bounded above by 2φ(h), by the definition of φ. Thusthe display is bounded above by 2φ(h)PY (A).

4.14 Theorem. If Xt is a strictly stationary time series with mean zero such thatE|Xt|p∨q < ∞ and

∑

h φ(h)1/pφ(h)1/q < ∞ for some p, q > 0 with p−1 + q−1 = 1,

then the series v =∑

h γX(h) converges absolutely and√nXn N(0, v).

Proof. For a given M > 0 let XMt = Xt1|Xt| ≤ M and let YM

t = Xt − XMt .

Because XMt is a measurable transformation of Xt, it is immediate from the definition

of the mixing coefficients that YMt is mixing with smaller mixing coefficients than Xt.

Therefore, by (4.2) and Lemma 4.13,

var√n(Xn −XM

n ) ≤ 2∑

h

φ(h)1/pφ(h)1/q‖YM0 ‖p‖YM

0 ‖q.

As M → ∞, the right side converges to zero, and hence the left side converges to zero,uniformly in n. This means that we can reduce the problem to the case of uniformlybounded time series Xt, as in the proof of Theorem 4.7.

Because the α-mixing coefficients are bounded above by the φ-mixing coefficients,we have that

∑

h α(h) < ∞. Therefore, the second part of the proof of Theorem 4.7applies without changes.

4.5: Martingale Differences 59

4.5 Martingale Differences

The martingale central limit theorem applies to the special time series for which thepartial sums

∑nt=1Xt are a martingale (as a process in n), or equivalently the increments

Xt are “martingale differences”. In Score processes (derivatives of log likelihoods) willbe an important example of application in Chapter 13.

The martingale central limit theorem can be seen as another type of generalizationof the ordinary central limit theorem. The partial sums of an i.i.d. sequence grow byincrements Xt that are independent from the “past”. The classical central limit theoremshows that this induces asymptotic normality, provided that the increments are centeredand not too big (finite variance suffices). The mixing central limit theorems relax theindependence to near independence of variables at large time lags, a condition thatinvolves the whole distribution. In contrast, the martingale central limit theorem imposesconditions on the conditional first and second moments of the increments given the past,without directly involving other aspects of the distribution. In this sense it is closer tothe ordinary central limit theorem. The first moments given the past are assumed zero;the second moments given the past must not be too big.

The “past” can be given by an arbitrary filtration. A filtration Ft is a nondecreasingcollection of σ-fields · · · ⊂ F−1 ⊂ F0 ⊂ F1 ⊂ · · ·. The σ-field Ft can be thought of as the“events that are known” at time t. Often it will be the σ-field generated by the variablesXt, Xt−1, Xt−2, . . .; the corresponding filtration is called the natural filtration of the timeseries Xt, or the filtration generated by this series. A martingale difference series relativeto a given filtration is a time series Xt such that, for every t:(i) Xt is Ft-measurable;(ii) E(Xt| Ft−1) = 0.The second requirement implicitly includes the assumption that E|Xt| <∞, so that theconditional expectation is well defined; the identity is understood to be in the almost-suresense.

4.15 EXERCISE. Show that a martingale difference series with finite variances is a whitenoise series.

4.16 Theorem. If Xt is a martingale difference series relative to the filtration Ft

such that n−1∑n

t=1 E(X2t | Ft−1) P→ v for a positive constant v, and such that

n−1∑n

t=1 E(

X2t 1|Xt|>ε

√n| Ft−1

)

P→ 0 for every ε > 0, then√nXn N(0, v).

Proof. For simplicity of notation, let Et denote conditional expectation given Ft−1.Because the events At = n−1∑t

j=1 Ej−1X2j ≤ 2v are Ft−1-measurable, the variables


Xn,t = n−1/2Xt1Atare martingale differences relative to the filtration Ft. They satisfy

(4.7)

n∑

t=1

Et−1X2n,t ≤ 2v,

n∑

t=1

Et−1X2n,t

P→ v,

n∑

t=1

Et−1X2n,t1|Xn,t|>ε

P→ 0, every ε > 0.

To see this note first that Et−1X2n,t = 1At

n−1Et−1X2t , and that the events At are de-

creasing: A1 ⊃ A2 ⊃ · · · ⊃ An. The first relation in the display follows from the definitionof the events At, which make that the cumulative sums stop increasing before they crossthe level 2v. The second relation follows, because the left side of this relation is equalto n−1

∑nt=1 Et−1X2

t on An, which tends to v by assumption, and the probability of theevent An tends to 1 by assumption. The third relation is immediate from the conditionalLindeberg assumption on the Xt and the fact that |Xn,t| ≤ Xt/

√n, for every t.

On the event An we also have that Xn,t = Xt/√n for every t = 1, . . . , n and hence

the theorem is proved once it has been established that∑n

t=1Xn,t N(0, v).We have that |eix − 1 − ix| ≤ x2/2 for every x ∈ R. Furthermore, the function

R:R → C defined by eix − 1− ix+ x2/2 = x2R(x) satisfies |R(x)| ≤ 1 and R(x) → 0 asx→ 0. If δ, ε > 0 are chosen such that |R(x)| < ε if |x| ≤ δ, then |eix − 1− ix+ x2/2| ≤x21|x|>δ + εx2. For fixed u ∈ R define

Rn,t = Et−1(eiuXn,t − 1− iuXn,t).

It follows that |Rn,t| ≤ 12u

2Et−1X2n,t and

n∑

t=1

|Rn,t| ≤ 12u

2n∑

t=1

Et−1X2n,t ≤ u2v,

max1≤t≤n

|Rn,t| ≤ 12u

2(

n∑

t=1

Et−1X2n,t1|Xn,t|>δ + δ2

)

,

n∑

t=1

∣

∣Rn,t +12u

2Et−1X2n,t

∣

∣ ≤ u2n∑

t=1

(

Et−1X2n,t1|Xn,t|>δ + εEt−1X

2n,t

)

.

The second and third inequalities together with (4.7) imply that the sequencemax1≤t≤n |Rn,t| tends to zero in probability and that the sequence

∑nt=1Rn,t tends

in probability to − 12u

2v.The function S:R → R defined by log(1 − x) = −x + xS(x) satisfies S(x) → 0 as

x→ 0. It follows that max1≤t≤n |S(Rn,t)| P→ 0, and

n∏

t=1

(1−Rn,t) = e∑n

t=1log(1−Rn,t) = e−

∑n

t=1Rn,t+

∑n

t=1Rn,tS(Rn,t) P→ eu

2v/2.

4.5: Martingale Differences 61

We also have that∏k

t=1 |1−Rn,t| ≤ exp∑n

t=1 |Rn,t| ≤ eu2v, for every k ≤ n. Therefore,

by the dominated convergence theorem, the convergence in the preceding display is alsoin absolute mean.

For every t,

En−1eiuXn,n(1−Rn,n) = (1−Rn,n)En−1(e

iuXn,n − 1− iuXn,n + 1)

= (1−Rn,n)(Rn,n + 1) = 1−R2n,n.

Therefore, by conditioning on Fn−1,

En∏

t=1

eiuXn,t(1−Rn,t) = En−1∏

t=1

eiuXn,t(1−Rn,t)− En−1∏

t=1

eiuXn,t(1−Rn,t)R2n,n.

By repeating this argument, we find that

∣

∣

∣E

n∏

t=1

eiuXn,t(1−Rn,t)− 1∣

∣

∣ =∣

∣

∣−n∑

k=1

E

k−1∏

t=1

eiuXn,t(1−Rn,t)R2n,k

∣

∣

∣ ≤ eu2vE

n∑

t=1

R2n,t.

This tends to zero, because∑n

t=1 |Rn,t| is bounded above by a constant andmax1≤t≤n |Rn,t| tends to zero in probability.

We combine the results of the last two paragraphs to conclude that

∣

∣

∣E

n∏

t=1

eiuXn,teu2v/2 − 1

∣

∣

∣ =∣

∣

∣E

n∏

t=1

eiuXn,teu2v/2 −

n∏

t=1

eiuXn,t(1−Rn,t)∣

∣

∣+ o(1) → 0.

The theorem follows from the continuity theorem for characteristic functions.

Apart from the structural condition that the sums∑n

t=1Xt form a martingale, themartingale central limit theorem requires that the sequence of variables Yt = E(X2

t | Ft−1)satisfies a law of large numbers and that the variables Yt,ε,n = E(X2

t 1|Xt|>ε√n| Ft−1)

satisfy a (conditional) Lindeberg-type condition. These conditions are immediate for“ergodic” sequences, which by definition are (strictly stationary) sequences for whichany integrable “running transformation” of the type Yt = g(Xt, Xt−1, . . .) satisfies theLaw of Large Numbers. The concept of ergodicity is discussed at length in Section 7.2.

4.17 Corollary. If Xt is a strictly stationary, ergodic martingale difference series rela-tive to its natural filtration with mean zero and v = EX2

t <∞, then√nXn N(0, v).

Proof. By strict stationarity there exists a fixed measurable function g:R∞ → R∞ such

that E(X2t |Xt−1, Xt−2, . . .) = g(Xt−1, Xt−2, . . .) almost surely, for every t. The ergodicity

of the series Xt is inherited by the series Yt = g(Xt−1, Xt−2, . . .) and hence Y n → EY1 =EX2

1 almost surely. By a similar argument the averages n−1∑n

t=1 E(

X2t 1|Xt|>M | Ft−1

)

converge almost surely to their expectation, for every fixed M . This expectation canbe made arbitrarily small by choosing M large. For any fixed ε > 0 the sequencen−1

∑nt=1 E

(

X2t 1|Xt| > ε

√n| Ft−1

)

is bounded by this sequence eventually, for anygiven M , and hence converges almost surely to zero.


* 4.6 Projections

Let Xt be a centered time series and F0 = σ(X0, X−1, . . .). For a suitably mixing timeseries the covariance E

(

XnE(Xj | F0))

between Xn and the best prediction of Xj at time0 should be small as n → ∞. The following theorem gives a precise and remarkablysimple sufficient condition for the central limit theorem in terms of these quantities.

4.18 Theorem. let Xt be a strictly stationary, mean zero, ergodic time series with∑

h

∣

∣γX(h)∣

∣ <∞ and, as n→ ∞,

∞∑

j=0

∣

∣E(

XnE(Xj | F0))∣

∣→ 0.

Then√nXn N(0, v), for v =

∑

h γX(h).

Proof. For a fixed integer m define a time series

Yt,m =t+m∑

j=t

(

E(Xj | Ft)− E(Xj | Ft−1))

.

Then Yt,m is a strictly stationary martingale difference series. By the ergodicity of theseries Xt, for fixed m as n→ ∞,

1

n

n∑

t=1

E(Y 2t,m| Ft−1) → EY 2

0,m =: vm,

almost surely and in mean. The number vm is finite, because the series Xt is square-integrable by assumption. By the martingale central limit theorem, Theorem 4.16, weconclude that

√nY n,m N(0, vm) as n→ ∞, for every fixed m.

Because Xt = E(Xt| Ft) we can write

n∑

t=1

(Yt,m −Xt) =

n∑

t=1

t+m∑

j=t+1

E(Xj | Ft)−n∑

t=1

t+m∑

j=t

E(Xj | Ft−1)

=

n+m∑

j=n+1

E(Xj | Fn)−m∑

j=1

E(Xj | F0)−n∑

t=1

E(Xt+m| Ft−1).

Write the right side as Zn,m−Z0,m−Rn,m. Then the time series Zt,m is stationary with

EZ20,m =

m∑

i=1

m∑

j=1

E(

E(Xi| F0)E(Xj | F0))

≤ m2EX20 .

4.7: Proof of Theorem 4.7 63

The right side divided by n converges to zero as n→ ∞, for every fixed m. Furthermore,

ER2n,m =

n∑

s=1

n∑

t=1

E(

E(Xs+m| Fs−1)E(Xt+m| Ft−1))

≤ 2∑∑

1≤s≤t≤nE(

E(Xs+m| Fs−1)Xt+m

)

≤ 2n∞∑

h=1

∣

∣EE(Xm+1| F0)Xh+m

∣

∣ = 2n∞∑

h=m+1

∣

∣EXm+1E(Xh| F0)∣

∣.

The right side divided by n converges to zero as m→ ∞. Combining the three precedingdisplays we see that the sequence

√n(Y n,m−Xn) = (Zn,m−Z0,m−Rn,m)/

√n converges

to zero in second mean as n→ ∞ followed by m→ ∞.Because Yt,m is a martingale difference series, the variables Yt,m are uncorrelated

and hencevar

√nY n,m = EY 2

0,m = vm.

Because, as usual, var√nXn → v as n→ ∞, combination with the preceding paragraph

shows that vm → v as m → ∞. Consequently, by Lemma 3.10 there exists mn → ∞such that

√nYn,mn

N(0, v) and√n(Yn,mn

− Xn) 0. This implies the theorem inview of Slutsky’s lemma.

4.19 EXERCISE. Derive the martingale central limit theorem, Theorem 4.16, from The-orem 4.18. [The infinite sum is really equal to |EXnX0|!]

* 4.7 Proof of Theorem 4.7

In this section we present two proofs of Theorem4.7, the first based on characteristicfunctions and Lemma 3.13, and the second based on Theorem 4.18.

Proof of Theorem 4.7 (first version). For a given M > 0 let XMt = Xt1|Xt| ≤ M

and let YMt = Xt − XM

t . Because XMt is a measurable transformation of Xt, it is

immediate from the definition of the mixing coefficients that the series YMt is mixing

with smaller mixing coefficients than the series Xt. Therefore, in view of (4.6)

var√n(Xn −XM

n ) = var√nYM

n ≤ 4

∫ 1

0

α−1(u)F−1|Y M0 |

(1− u)2 du.

Because YM0 = 0 whenever |X0| ≤ M , it follows that YM

0 0 as M → ∞ and henceF−1|Y M

0 |(u) → 0 for every u ∈ (0, 1). Furthermore, because |YM

0 | ≤ |X0|, its quantile

function is bounded above by the quantile function of |X0|. By the dominated convergencetheorem the integral in the preceding display converges to zero as M → ∞, and hence


the variance in the left side converges to zero as M → ∞, uniformly in n. If we can showthat

√n(XM

n − EXM0 ) N(0, vM ) as n → ∞ for vM = limvar

√nXM

n and every fixedM , then it follows that

√n(Xn − EX0) N(0, v) for v = lim vM = limvar

√nXn, by

Lemma 3.10, and the proof is complete.Thus it suffices to prove the theorem for uniformly bounded variables Xt. Let M be

the uniform bound.Fix some sequence mn → ∞ such that

√nα(mn) → 0 and mn/

√n → 0. Such

a sequence exists. To see this, first note that√nα(⌊√n/k⌋) → 0 as n → ∞, for ev-

ery fixed k. (See Problem 4.20). Thus by Lemma 3.10 there exists kn → ∞ such that√nα(⌊√n/kn⌋) → 0 as kn → ∞. Now set mn = ⌊√n/kn⌋. For simplicity write m for

mn. Also let it be silently understood that all summation indices are restricted to theintegers 1, 2, . . . , n, unless indicated otherwise.

Let Sn = n−1/2∑n

t=1Xt and, for every given t, set Sn(t) = n−1/2∑

|j−t|<mXj .

Because |eiλ − 1− iλ| ≤ 12λ

2 for every λ ∈ R, we have

∣

∣

∣E[ 1√

n

n∑

t=1

XteiλSn

(

e−iλSn(t) − 1 + iλSn(t))]∣

∣

∣≤ λ2nM

2√n

n∑

t=1

ES2n(t)

=λ2M

2√n

n∑

t=1

∑

|i−t|<m

∑

|j−t|<m

γX(i− j)

≤ λ2M

2√nm∑

h

∣

∣γX(h)∣

∣→ 0.

Furthermore, with An(t) and Bn(t) defined as n−1/2 times the sum of the Xj with1 ≤ j ≤ t−m and t+m ≤ j ≤ n, respectively, we have Sn − Sn(t) = An(t) +Bn(t) and

∣

∣

∣E( 1√

n

n∑

t=1

XteiλSne−iλSn(t)

)∣

∣

∣

≤ 1√n

n∑

t=1

∣

∣

∣cov(

XteiλAn(t), eiλBn(t)

)

+ cov(

Xt, eiλAn(t)

)

EeiλBn(t)∣

∣

∣

≤ 4√nMα(m) → 0,

by the second inequality of Lemma 4.12, with p = 1 and q = r = ∞. Combining thepreceding pair of displays we see that

ESneiλSn = E

1√n

n∑

t=1

XteiλSniλSn(t) + o(1) = iλE

(

eiλSn1

n

∑∑

|s−t|<m

XsXt

)

+ o(1).

If we can show that n−1∑∑

|s−t|<mXsXt converges in mean to v, then the right side

of the last display is asymptotically equivalent to iλEeiλSnv, and the theorem is provedin view of Lemma 3.13.

4.7: Proof of Theorem 4.7 65

In fact, we show that n−1∑∑

|s−t|<mXsXt → v in second mean. First,

E1

n

∑∑

|s−t|<m

XsXt =∑

|h|<m

(n− |h|n

)

γX(h) → v,

By the dominated convergence theorem, in view of (4.2). Second,

var( 1

n

∑∑

|s−t|<m

XsXt

)

≤ 1

n2

∑∑

|s−t|<m

∑∑

|i−j|<m

∣

∣cov(XsXt, XiXj)∣

∣.

The first double sum on the right can be split in the sums over the pairs (s, t) with s < tand s ≥ t, respectively, and similarly for the second double sum relative to (i, j). Bysymmetry the right side is bounded by

4

n2

∑∑

|s−t|<ms≤t

∑∑

|i−j|<mi≤j

∣

∣cov(XsXt, XiXj)∣

∣

≤ 4

n2

n∑

s=1

m∑

t=0

n∑

i=1

m∑

j=0

∣

∣cov(XsXs+t, XiXi+j)∣

∣

≤ 8

n2

n∑

s=1

m∑

t=0

n∑

i=1

m∑

j=0

∣

∣cov(XsXs+t, Xs+iXs+i+j)∣

∣,

by the same argument, this time splitting the sums over s ≤ i and s > i and usingsymmetry between s and i. If i ≥ t, then the covariance in the sum is bounded above by2α(i− t)M4, by Lemma 4.12, because there are i− t time instants between XsXs+t andXs+iXs+i+j . If i < t, then we rewrite the absolute covariance as

∣

∣

∣cov(Xs, Xs+tXs+iXs+i+j)− cov(Xs, Xs+t)EXs+iXs+i+j

∣

∣

∣ ≤ 4α(i)M4.

Thus the four-fold sum is bounded above by

32

n2

n∑

s=1

m∑

t=0

n∑

i=1

m∑

j=0

(

α(i− t)M41i≥t + α(i)M41i<t

)

≤ 64M4m2

n

∑

i≥0α(i).

Because F−1|X0| is bounded away from zero in a neighbourhood of 0, finiteness of the

integral∫ 1

0α−1(u)F−1|X0|(1 − u)2 du implies that the series on the right converges. This

conclude the proof.

* 4.20 EXERCISE. Suppose that α(h) is a decreasing sequence of nonnegative numbers(h = 1, 2, . . .) with

∑

h α(h) < ∞. Show that hα(h) → 0 as h → ∞. [First derive, usingthe monotonicity, that

∑

h 2hα(2h) <∞ and conclude from this that 2hα(2h) → 0. Next

use the monotonicity again “to fill the gaps”.]

Proof of Theorem 4.7 (second version). Let ∆F denote the jump sizes of a cumulativedistribution function F . For Yn = E(Xn| F0) and V a uniform variable independent of


the other variables, set

1− Un = F|Yn|(

|Yn|−)

+ V∆F|Yn|(

|Yn|)

,

This definition is an extended form of the probability integral transformation, allowingfor jumps in the distribution function. The variable Un is uniformly distributed andF−1|Yn|(1−Un) = |Yn| almost surely. Because Yn is F0-measurable the covariance inequality,Lemma 4.12, gives

∣

∣E(

E(Xn| F0)Xj

)∣

∣ ≤ 2

∫ αj

0

F−1|Yn|(1− u)F−1|Xj |(1− u) du

= 2EYn sign(Yn)F−1|Xj |(1− Un)1Un<αj

= 2EXn sign(Yn)F−1|Xj |(1− Un)1Un<αj

≤ 2E|Xn|F−1|Xj |(1− Un)1Un<αj

≤ 4

∫ 1

0

F−1|Xn|(1− u)G−1(1− u) du

by a second application of Lemma 4.12, with α = 1 and G the distribution function of therandom variable F−1|Xj |(1− Un)1Un<αj

. The corresponding quantile function G−1(1− u)

vanishes off [0, αj ] and is bounded above by the quantile function of |Xj |. Therefore, theexpression is further bounded by

∫ αj

0F−1|X0|(1 − u)2 du. We finish by summing up over

j.

5Nonparametric Estimationof Mean and Covariance

Suppose we observe the values X1, . . . , Xn from the stationary time series Xt with meanµX = EXt, covariance function h 7→ γX(h), and correlation function h 7→ ρX(h). Ifnothing is known about the distribution of the time series, besides that it is stationary,then “obvious” estimators for these parameters are

µn = Xn =1

n

n∑

t=1

Xt,

γn(h) =1

n

n−h∑

t=1

(Xt+h −Xn)(Xt −Xn), (0 ≤ h < n),

ρn(h) =γn(h)

γn(0).

These estimators are called nonparametric, because they are not motivated by a statisti-cal model that restricts the distribution of the time series. Their advantage is that theywork for (almost) every stationary time series. Better estimators, in particular for γXand ρX , might exist given a specific (and valid) statistical model for the time series. Weshall see examples when discussing ARMA-processes in Chapter 8.

* 5.1 EXERCISE. The factor 1/n in the definition of γn(h) is sometimes replaced by1/(n − h), because there are n − h terms in the sum. Show that with the present defi-nition of γn the corresponding estimator

(

γn(s − t))

s,t=1,...,hfor the covariance matrix

of (X1, . . . , Xh) is nonnegative-definite. Show by example that this is not true if we use1/(n− h). [Write the matrix as QQT for a suitable (n× (2n)) matrix Q.]

The time series Xt is called Gaussian if the joint distribution of any finite numberof the variables Xt is multivariate-normal. In that case Xn is normally distributed. Thedistributions of γn(h) and ρn(h) are complicated, even under normality. Distributionalstatements considering these estimators are therefore usually asymptotic in nature, as

68 5: Nonparametric Estimation of Mean and Covariance

0 5 10 15 20

0.0

0.2

0.4

0.6

0.8

1.0

Lag

AC

F

Figure 5.1. Realization of the sample auto-correlation function (n = 250) of the stationary time seriessatisfying Xt+1 = 0.5Xt + Zt for standard normal white noise Zt.

n → ∞. In this chapter we discuss conditions under which the three estimators areasymptotically normally distributed. This knowledge can be used to set approximateconfidence intervals.

5.1 Mean

The asymptotic normality of the sample mean is the subject of Chapter 4. The statisticalsignificance of the central limit theorem is that the sample mean is an asymptoticallyconsistent estimator of the true mean µX , with precision of the order 1/

√n. The central

limit theorem can be used in a preciser way to derive an asymptotic confidence intervalfor µX . This requires an estimator of the (asymptotic) variance of the sample mean,which we discuss in this section.

An approximate confidence interval for µX based on the sample mean typically takesthe form

(

Xn − σn√n1.96, Xn +

σn√n1.96

)

.

5.1: Mean 69

If√n(Xn − µX)/σn N(0, 1) as n → ∞, then the confidence level of this interval

converges to 95%. The problem is to find suitable estimators σn.If the sequence

√n(Xn−µX) is asymptotically normal, as it is under the conditions

of the preceding chapter, the procedure works if the σn are consistent estimators of the(asymptotic) variance of Xn. However, constructing such estimators is not a trivial mat-ter. Unlike in the case of independent, identically distributed variables, the variance of thesample mean depends on characteristics of the joint distribution of (X1, . . . , Xn), ratherthan only on the marginal distributions (see (4.1)). The limiting variance

∑

h γX(h)depends even on the joint distribution of the infinite sequence (X1, X2, . . .). With a suf-ficient number of observations it is possible to estimate the auto-covariances γX(h) atsmaller lags h, but, without further information, this is not true for larger lags h ≈ n(let alone h ≥ n), unless we make special assumptions. Setting a confidence interval istherefore much harder than in the case of independent, identically distributed variables.

If a reliable model is available, expressed in a vector of parameters, then the problemcan be solved by a model-based estimator: the variance of the sample mean is expressedin the parameters, which are next replaced by estimators. If there are not too manyparameters in the model this should be feasible. (Methods to estimate parameters arediscussed in later chapters.)

5.2 EXERCISE.(i) Calculate the asymptotic variance of the sample mean for the moving average Xt =

Zt + θZt−1.(ii) Same question for the stationary solution of Xt = φXt−1 + Zt, where |φ| < 1.

However, the use of a model-based estimator is at odds with the theme of thischapter: nonparametric estimation. We now discuss several methods of nonparametricestimation that work as soon as the time series is sufficiently mixing.

In common use is the method of batched means. The total set of observations is splitinto r blocks

[X1, . . . , Xl], [Xl+1, . . . , X2l], . . . , [X(r−1)l+1, . . . , Xrl]

of l observations each. (Assume that n = rl for simplicity; drop a last batch of fewer thanl observations, for simplicity.) If Y1, . . . , Yr are the sample means of the r blocks, thenY r = Xn and hence varY r = varXn. The hope is that we can ignore the dependencebetween Y1, . . . , Yr and can simply estimate the variance var(

√rY r) = (r/n) var(

√nXn)

by the sample variance S2r,Y of Y1, . . . , Yr. If l is “large enough” and the orginal series

Xt is sufficiently mixing, then this actually works, to some extent.Presumably, the method of batched means uses disjoint blocks of Xt in order to

achieve the approximate independence of the block means Y1, . . . , Yr. In general theseare still dependent. This does not cause much (additional) bias in the estimator of thevariance, but it may have an effect on the precision. It turns out that it is better to useall blocks of l consecutive Xt, even though these may be more dependent. Thus in oursecond method we consider all blocks

[X1, . . . , Xl], [X2, . . . , Xl+1], . . . , [Xn−l+1, . . . , Xn]


of l consecutive observations. We let Zn,1, Zn,2, . . . , Zn,n−l+1 be the sample means of the

n− l+1 blocks, so that l varZn,i = var(√lX l) ≈ var(

√nXn), if l is large. This suggests

to estimate the variance of√nXn by lS2

n−l+1,Z . The following theorem shows that thismethod works under some conditions, provided that l is chosen dependent on n withln → ∞ at a not too fast rate. The theorem considers both the sample variance of the(recentered) block means,

S2n,Z =

1

n− ln + 1

n−ln+1∑

i=1

(Zn,i −Xn)2,

and the empirical distribution function of the centered and scaled block means,

Fn(x) =1

n− ln + 1

n−ln+1∑

i=1

1√

ln(Zn,i −Xn) ≤ x

.

The function Fn is the cumulative distribution function of the discrete distribution thatputs mass 1/(n − ln + 1) at each centered and scaled block mean

√ln(Zn,i −Xn). The

second moment of this distribution is lnS2n,Z . We have centered the block means at the

average Xn of the full sample, which is not unnatural, but slightly different from theaverage Zl of the block means, making S2

n,Z deviate slightly from the sample variance.Exercise 5.4 indicates the difference is negligible.

The following theorem shows that the sequence Fn tends weakly to the same limitas the sequence

√n(Xn − µX), and its second moments lnS

2n,Z converge as well.

5.3 Theorem. Suppose that the time series Xt is strictly stationary and α-mixing withmixing coefficients satisfying

∑

h α(h) < ∞. Let ln → ∞ such that ln/n → 0. Further-more, suppose that

√n(Xn − µX) N(0, v), for some number v. Then, for every x,

the sequence Fn(x) converges in probability to Φ(x/√v). Furthermore, if v =

∑

h γX(h),

then the second moments lnS2n,Z of Fn converge in probability to v.

Proof. The distribution function Gn obtained by replacing the meanXn in the definitionof Fn by µX relates to Fn through

Fn(x) = Gn

(

x+√

ln(Xn − µX))

.

The sequence√ln(Xn − µX) converges in probabity to zero, by the assumptions that

the sequence√n(Xn − µX) converges weakly and that ln/n → 0. In view of the mono-

tonicity of the functions Fn and Gn, the claimed convergence of Fn(x) then follows ifGn(x)

P→ Φ(x/√v) for every x. Furthermore, by the triangle inequality the difference

between the roots of the second moments of Fn and Gn is bounded by√ln|Xn − µX |

and hence tends to zero in probability. Therefore it suffices to prove the claims for Gn

instead of Fn.Fix some x and define Yt = 1

√ln(Zn,t − µX) ≤ x

. Then the time series Yt is

strictly stationary and Gn(x) = Y n−ln+1. By assumption

EY n−ln+1 = P(√

ln(X ln − µX) ≤ x)

→ Φ(x/√v).

5.1: Mean 71

Because the variable Yt depends only on the variables Xs with t ≤ s < t+ ln, the seriesYt is α-mixing with mixing coefficients bounded above by α(h− ln) for h > ln. Therefore,by (4.2) followed by Lemma 4.12 (with q = r = ∞),

varY n−ln+1 ≤ 1

n− ln + 1

∑

h

∣

∣γY (h)∣

∣ ≤ 4

n− ln + 1

(

∑

h≥lnα(h− ln) + ln

12

)

.

This converges to zero as n→ ∞. Thus Gn(x) = Y n−ln+1 → Φ(x/√v) in probability by

Chebyshev’s inequality, and the first assertion of the theorem is proved.By Theorem 3.8 the second moments of a sequence of distributions Gn with Gn

N(0, v) tends to v if∫

|x|≥M x2 dGn(x) → 0 as n → ∞ followed by M → ∞. We can

apply this to the present random distributions Gn, for instance by arguing along almostsurely converging subsequences, to conclude that the second moments of the present Gn

converge to v if and only if∫

|x|≥M x2 dGn(x)P→ 0, as n→ ∞ followed by M → ∞. Now

E

∫

|x|≥Mx2 dGn(x) = E

1

n− ln + 1

n−ln+1∑

i=1

∣

∣

√

ln(Zn,i − µX)∣

∣

21√

ln|Zn,i − µX | ≥M

= E∣

∣

√

ln(X ln − µX)∣

∣

21√

ln|X ln − µX | ≥M

.

By assumption√ln(X ln − µX) N(0, v), while E

∣

∣

√ln(X ln − µX)

∣

∣

2 → v by (4.1) (andExercise 4.1). Thus we can apply Theorem 3.8 in the other direction to conclude thatthe right side of the display converges to zero as n→ ∞ followed by M → ∞.

* 5.4 EXERCISE. Suppose that we center the block means Zn,i at Zn−ln+1 instead of

Xn, and consider the empirical distribution of the variables√ln(Zn,i − Zn−ln+1) and

its variance lnS2n,Z instead of Fn and S2

n,Z . Show that the preceding theorem is valid for

these quantities provided ln/n2/3 → 0. [The mean Zn−ln+1 of the block means is almost

the same as Xn. Each Xt with t ∈ (ln, n− ln +1) occurs in exactly ln block means Zn,i,and theXt with “extreme” indexes occur in between 1 and ln−1 block means. This showsthat (n − ln + 1)Zn−λn+1 = nXn − An, where An is a weighted sum of the “extreme”Xt, the weights ranging from (ln − 1)/ln to 1/ln. It follows that E|An| ≤ 2lnE|X1|.]

The usefulness of the estimator Fn goes beyond its variance. Because it tends tothe same limit distribution as the sequence

√n(Xn −µX), the distribution Fn can serve

as an estimator of the distribution of the latter variable. In particular, we could use thequantiles of Fn as estimators of the quantiles of

√n(Xn − µX) and use these to replace

the normal quantiles and σn in the construction of a confidence interval. This gives theinterval

[

Xn − F−1n (0.975)√n

,Xn − F−1n (0.025)√n

]

.

The preceding theorem shows that this interval has asymptotic confidence level 95% forcovering µX . Indeed, the theorem shows that P

(√n(Xn − µX) ≤ F−1n (α)

)

→ α, for anyα ∈ (0, 1), which implies that µX is contained in the confidence interval with probabilitytending to 95 %.


The proper choice of the block length l of the method of batched means is crucial:the preceding theorem shows that the estimators will be consistent provided ln → ∞such that ln/n → 0. Additional calculations show that, under general conditions, thestandard errors of the variance estimators are minimal if ln is proportional to n1/3.♯

Unfortunately, in practice, it is rarely known if these “general conditions” are satisfied,as an optimal block length must depend on the amount of mixing of the time series.Long blocks necessarily have much overlap and hence are positively dependent, whichhas a negative impact on the variance of the estimators. On the other hand, short blockscannot propertly reflect the dependence structure of the time series, with longer-rangedependence needing longer blocks. An optimal block size must balance these conflictingdemands.

5.5 EXERCISE. Extend the preceding theorem to the method of batched means.

* 5.1.1 Blockwise Bootstrap

A related method is the blockwise bootstrap. Assume that n = lr for simplicity. Giventhe same blocks [X1, . . . , Xl], [X2, . . . , Xl+1], . . . , [Xn−l+1, . . . , Xn], we choose r = n/lblocks at random with replacement and put the r blocks in a row, in random order,but preserving the order of the Xt within the r blocks. We denote the row of n = rlvariables obtained in this way by X∗1 , X

∗2 , . . . , X

∗n and let X∗n be their average. The

bootstrap estimator of the distribution of√n(Xn − µX) is by definition the conditional

distribution of√n(X∗n − Xn) given X1, . . . , Xn. The corresponding estimator of the

variance of√n(Xn − µX) is the variance of this conditional distribution.

An equivalent description of the bootstrap procedure is to choose a random samplewith replacement from the block averages Zn,1, . . . , Zn,n−ln+1. If this sample is denotedby Z∗1 , . . . , Z

∗r , then the average X∗n is also the average Z∗r . It follows that the bootstrap

estimator of the variance of Xn is the conditional variance of the mean of a randomsample of size r from the block averages given the values Zn,1, . . . , Zn,n−ln+1 of theseaverages. This is simply (n/r)S2

n−ln+1,Z , as before.Other aspects of the bootstrap estimators of the distribution, for instance quantiles,

are hard to calculate explicitly. In practice we perform computer simulation to obtainan approximation of the bootstrap estimator. By repeating the sampling procedure alarge number of times (with the same values of X1, . . . , Xn), and taking the empiricaldistribution over the realizations, we can, in principle obtain arbitrary precision.

5.2 Auto Covariances

Replacing a given time series Xt by the centered time series Xt − µX does not changethe auto-covariance function. Therefore, for the study of the asymptotic properties of

♯ See Kunsch (1989), Annals of Statistics 17, p1217–1241.

5.2: Auto Covariances 73

the sample auto covariance function γn(h), it is not a loss of generality to assume thatµX = 0. The sample auto-covariance function can be written as

γn(h) =1

n

n−h∑

t=1

Xt+hXt −Xn

( 1

n

n−h∑

t=1

Xt

)

−( 1

n

n−h∑

t=1

Xt+h

)

Xn +n− h

n(Xn)

2.

Under the conditions of Chapter 4 and the assumption µX = 0, the sample mean Xn

is of the order OP (1/√n) and hence the last term on the right is of the order OP (1/n).

For fixed h the second and third term are almost equivalent to (Xn)2 and are also of the

order OP (1/n). Thus, under the assumption that µX = 0,

γn(h) =1

n

n−h∑

t=1

Xt+hXt +OP

( 1

n

)

.

It follows from this and Slutsky’s lemma that the asymptotic behaviour of the sequence√n(

γn(h) − γX(h))

depends only on n−1∑n−h

t=1 Xt+hXt. Here a change of n by n − h(or n − h by n) is asymptotically negligible, so that, for simplicity of notation, we canequivalently study the averages

γ∗n(h) =1

n

n∑

t=1

Xt+hXt.

These are unbiased estimators of EXt+hXt = γX(h), under the condition that µX = 0.Their asymptotic distribution can be derived by applying a central limit theorem to theaverages Y n of the variables Yt = Xt+hXt.

If the time series Xt is mixing with mixing coefficients α(k), then the time series Ytis mixing with mixing coefficients bounded above by α(k − h) for k > h ≥ 0. Becausethe conditions for a central limit theorem depend only on the speed at which the mixingcoefficients converge to zero, and a shift of the argument by a fixed amount does notchange the speed, this means that in most cases the mixing coefficients of the time seriesXt and Yt are equivalent. By the Cauchy-Schwarz inequality the series Yt has finitemoments of order k if the series Xt has finite moments of order 2k. This means that themixing central limit theorems for the sample mean apply without further difficulties toproving the asymptotic normality of the sample auto-covariance function. The asymptoticvariance takes the form

∑

g γY (g) and in general depends on fourth order moments ofthe type EXt+g+hXt+gXt+hXt as well as on the auto-covariance function of the seriesXt. In its generality, its precise form is not of much interest.

5.6 Theorem. IfXt is a strictly stationary, mixing time series with α-mixing coefficients

such that∫ 1

0α−1(u)F−1|XhX0|(1 − u)2 du < ∞, then the sequence

√n(

γn(h) − γX(h))

converges in distribution to a normal distribution.

Proof. It can be seen that the mixing coefficients of Yt = Xt+hXt satisfy αY (g) ≤α(g − h), for g ≥ h. This implies that (αY )−1(u) =

∑

g 1u<αY (g) ≤ h/2 + α−1(u).

Therefore the theorem follows from Theorem 4.7 applied to Y n.


Another approach to central limit theorems is special to linear processes, of the form(4.4), for a sequence . . . , Z−1, Z0, Z1, Z2, . . . of independent and identically distributedvariables with EZt = 0, and constants ψj satisfying

∑

j |ψj | < ∞. The sample auto-covariance function of a linear process is also asymptotically normal, but the proof ofthis requires additional work. This work is worth while mainly because the limit variancetakes a simple form in this case.

Under (4.4) with µ = 0, the auto-covariance function of the series Yt = Xt+hXt canbe calculated as

γY (g) = cov(

Xt+g+hXt+g, Xt+hXt

)

=∑

i

∑

j

∑

k

∑

l

ψt−iψt+h−jψt+g−kψt+g+h−l cov(ZiZj , ZkZl).

Here cov(ZiZj , ZkZl) is zero whenever one of the indices i, j, k, l occurs only once. Forinstance EZ1Z2Z10Z2 = EZ1EZ

22EZ10 = 0. It also vanishes if i = j 6= k = l, by

independence of ZiZj and ZkZl. The covariance is nonzero only if all four indices arethe same, or if the indices occur in the pairs i = k 6= j = l or i = l 6= j = k. Thus thepreceding display can be rewritten as

cov(Z21 , Z

21 )∑

i

ψt−iψt+h−iψt+g−iψt+g+h−i

+ cov(Z1Z2, Z1Z2)∑∑

i6=j

ψt−iψt+h−jψt+g−iψt+g+h−j

+ cov(Z1Z2, Z2Z1)∑∑

i6=j

ψt−iψt+h−jψt+g−jψt+g+h−i

=(

EZ41 − 3(EZ2

1 )2)

∑

i

ψiψi+hψi+gψi+g+h + γX(g)2 + γX(g + h)γX(g − h).

In the last step we use Lemma 1.29(iii) twice, after first adding the diagonal terms i = jto the double sums. Since cov(Z1Z2, Z1Z2) = (EZ2

1 )2, these diagonal terms account

for −2 of the −3 times the sum in the first term. The variance of√nγ∗n(h) =

√nY n

converges to the sum over g of this expression. With κ4(Z) = EZ41/(EZ

21 )

2−3, the fourthcumulant (equal to the kurtosis minus 3 divided by the square variance) of Zt, this sumcan be written as

Vh,h = κ4(Z)γX(h)2 +∑

g

γX(g)2 +∑

g

γX(g + h)γX(g − h).

5.7 Theorem. If (4.4) holds for an i.i.d. sequence Zt with mean zero and EZ4t <∞ and

numbers ψj with∑

j |ψj | <∞, then√n(

γn(h)− γX(h))

N(0, Vh,h).

Proof. As explained in the discussion preceding the statement of the theorem, it sufficesto show that the sequence

√n(

γ∗n(h)− γX(h))

has the given asymptotic distribution inthe case that µ = 0. Define Yt = Xt+hXt and, for fixed m ∈ N,

Y mt =

∑

|i|≤mψiZt+h−i

∑

|j|≤mψjZt−j =:Xm

t+hXmt .

5.2: Auto Covariances 75

The time series Y mt is (2m+ h)-dependent and strictly stationary. By Theorem 4.4 the

sequence√n(Y m

n − EY mn ) is asymptotically normal with mean zero and variance

V mh,h: =

∑

g

γYm(g) = κ4(Z)γXm(h)2 +

∑

g

γXm(g)2 +∑

g

γXm(g + h)γXm(g − h),

where the second equality follows from the calculations preceding the theorem. For everyg, as m→ ∞,

γXm(g) = EZ21

∑

j:|j|≤m,|j+g|≤mψjψj+g → EZ2

1

∑

j

ψjψj+g = γX(g).

Furthermore, the numbers on the left are bounded above by EZ21

∑

j |ψjψj+g|, and∑

g

(

∑

j

|ψjψj+g|)2

=∑

g

∑

i

∑

j

|ψiψjψi+gψj+g| ≤ supj

|ψj |(

∑

j

|ψj |)3

<∞.

Therefore, by the dominated convergence theorem∑

g γXm(g)2 →∑

g γX(g)2 asm→ ∞.By a similar argument, we obtain the corresponding property for the third term in theexpression defining V m

h,h, whence Vmh,h → Vh,h as m→ ∞.

We conclude by Lemma 3.10 that there exists a sequence mn → ∞ such that√n(Y mn

n −EY mnn ) N(0, Vh,h). The proof of the theorem is complete once we also have

shown that the difference between the sequences√n(Yn − EYn) and

√n(Y mn

n − EY mnn )

converges to zero in probability.Both sequences are centered at mean zero. In view of Chebyshev’s inequality it

suffices to show that n var(Yn − Y mnn ) → 0. We can write

Yt − Y mt = Xt+hXt −Xm

t+hXmt =

∑

i

∑

j

ψmt−i,t+h−jZiZj ,

where ψmi,j = 0 if |i| ≤ m and |j| ≤ m and ψiψj otherwise. The variables Yn − Y m

n arethe averages of these double sums and hence

√n times their variance can be found as

n∑

g=−n

(n− |g|n

)

γY−Y m(g)

=

n∑

g=−n

(n− |g|n

)

∑

i

∑

j

∑

k

∑

l

ψmt−i,t+h−jψ

mt+g−k,t+g+h−l cov(ZiZj , ZkZl).

Most terms in this five-fold sum are zero and by similar arguments as before the wholeexpression can be bounded in absolute value by

cov(Z21 , Z

21 )∑

g

∑

i

|ψmi,i+hψ

mg,g+h|+(EZ2

1 )2∑

g

∑

i

∑

j

|ψmi,jψ

mi+g,j+g|

+ (EZ21 )

2∑

g

∑

i

∑

j

|ψmi,j+hψ

mj+g,i+g+h|.


We have that ψmi,j → 0 as m → ∞ for every fixed (i, j), |ψm

i,j | ≤ |ψiψj |, and supi |ψi| ≤∑

i |ψi| <∞. By the dominated convergence theorem the double and triple sums convergeto zero as well.

By similar arguments we can also prove the joint asymptotic normality of the sampleauto-covariances for a number of lags h simultaneously. By the Cramer-Wold device asequence of k-dimensional random vectors Xn converges in distribution to a randomvector X if and only if aTXn aTX for every a ∈ R

k. A linear combination of sampleauto-covariances can be written as an average, as before. These averages can be shown tobe asymptotically normal by the same methods, with only the notation becoming morecomplex.

5.8 Theorem. Under the conditions of either Theorem 5.6 or 5.7, for every h ∈ N andsome (h+ 1)× (h+ 1)-matrix V ,

√n

γn(0)...

γn(h)

−

γX(0)...

γX(h)

Nh+1(0, V ).

For a linear process Xt (as in (4.4)) the matrix V has (g, h)-element

Vg,h = κ4(Z)γX(g)γX(h) +∑

k

γX(k + g)γX(k + h) +∑

k

γX(k − g)γX(k + h).

5.3 Auto Correlations

The asymptotic distribution of the auto-correlations ρn(h) can be obtained from theasymptotic distribution of the auto-covariance function by the Delta-method (Theo-rem 3.15). We can write

ρn(h) =γn(h)

γn(0)= φ

(

γn(0), γn(h))

,

for φ the function given by φ(u, v) = v/u. This function has gradient (−v/u2, 1/u). Bythe Delta-method,

√n(

ρn(h)− ρX(h))

= − γX(h)

γX(0)2√n(

γn(0)− γX(0))

+1

γX(0)

√n(

γn(h)− γX(h))

+ oP (1).

The limit distribution of the right side is the distribution of the random variable−γX(h)/γX(0)2Y0 + 1/γX(0)Yh, for Y0 and Yh the first and last coordinate of a ran-dom vector Y with the Nh+1(0, V )-distribution given in Theorem 5.8. The joint limit

5.3: Auto Correlations 77

distribution of a vector of auto-correlations is the joint distribution of the correspondinglinear combinations of the Yh. By linearity this is a Gaussian distribution; its mean iszero and its covariance matrix can be expressed in the matrix V by linear algebra.

5.9 Theorem. Under the conditions of either Theorem 5.6 or 5.7, for every h ∈ N andsome h× h-matrix W ,

√n

ρn(1)...

ρn(h)

−

ρX(1)...

ρX(h)

Nh(0,W ),

For a linear process Xt the matrix W has (g, h)-element

Wg,h =∑

k

[

ρX(k + g)ρX(k + h) + ρX(k − g)ρX(k + h) + 2ρX(g)ρX(h)ρX(k)2

− 2ρX(g)ρX(k)ρX(k + h)− 2ρX(h)ρX(k)ρX(k + g)]

.

The expression for the asymptotic covariance matrix W of the auto-correlation co-efficients in the case of a linear process is known as Bartlett’s formula. An interestingfact is that W depends on the auto-correlation function ρX only, although the asymp-totic covariance matrix V of the sample auto-covariance coefficients depends also on thesecond and fourth moments of Z1. We discuss two interesting examples of this formula.

5.10 Example (Iid sequence). For ψ0 = 1 and ψj = 0 for j 6= 0, the linear processXt given by (4.4) is equal to the i.i.d. sequence µ+ Zt. Then ρX(h) = 0 for every h 6= 0and the matrix W given by Bartlett’s formula can be seen to reduce to the identitymatrix. This means that for large n the sample auto-correlations ρn(1), . . . , ρn(h) areapproximately independent normal variables with mean zero and variance 1/n.

This can be used to test whether a given sequence of random variables is indepen-dent. If the variables are independent and identically distributed, then approximately95 % of the computed auto-correlations should be in the interval [−1.96/

√n, 1.96/

√n].

This is often verified graphically, from a plot of the auto-correlation function, on whichthe given interval is indicated by two horizontal lines. Note that, just as we should expectthat 95 % of the sample auto-correlations are inside the two bands in the plot, we shouldalso expect that 5 % of them are not! A more formal test would be to compare the sum ofthe squared sample auto-correlations to the appropriate chisquare table. The Ljung-Boxstatistic is defined by

k∑

h=1

n(n+ 2)

n− hρn(h)

2.

By the preceding theorem, for fixed k, this sequence of statistics tends to the χ2-distribution with k degrees of freedom, as n → ∞. (The coefficients n(n + 2)/(n − h)are motivated by a calculation of moments for finite n and are thought to improve thechisquare approximation, but are asymptotically equivalent to (

√n)2.)


The more auto-correlations we use in a procedure of this type, the more informationwe extract from the data and hence the better the result. However, the tests are basedon the asymptotic distribution of the sample auto-correlations and this was derivedunder the assumption that the lag h is fixed and n → ∞. We should expect that theconvergence to normality is slower for sample auto-correlations ρn(h) of larger lags h,since there are fewer terms in the sums defining them. Thus in practice we should notuse sample auto-correlations of lags that are large relative to n.

Clearly the Ljung-Box test does not really test independence, but only zero auto-correlation. If theXt are independent, then so are any measurable transformations ψ(Xt).In practice one might therefore apply the test also to squares X2

t , logarithms log |Xt|,etc.

0 5 10 15 20

0.0

0.2

0.4

0.6

0.8

1.0

Lag

AC

F

Figure 5.2. Realization of the sample auto-correlation function of a Gaussian white noise series of length250. The horizontal bands are drawn at heights ±1/

√250.

* 5.11 Example (Moving average). For a moving averageXt = Zt+θ1Zt−1+· · ·+θqZt−qof order q, the auto-correlations ρX(h) of lags h > q vanish. By the preceding theoremthe sequence

√nρn(h) converges for h > q in distribution to a normal distribution with

varianceWh,h =

∑

k

ρX(k)2 = 1 + 2ρX(1)2 + · · ·+ 2ρX(q)2, h > q.

This can be used to test whether a moving average of a given order q is an appropriatemodel for a given observed time series. A plot of the auto-correlation function shows

5.4: Partial Auto Correlations 79

nonzero auto-correlations for lags 1, . . . , q, and zero values for lags h > q. In practice weplot the sample auto-correlation function. Just as in the preceding example, we shouldexpect that some sample auto-correlations of lags h > q are significantly different fromzero, due to estimation errors. The asymptotic variancesWh,h are bigger than 1 and hencewe should take the confidence bands a bit wider than the intervals [−1.96/

√n, 1.96/

√n]

as in the preceding example. A proper interpretation is more complicated, because thesample auto-correlations are not asymptotically independent.

0 5 10 15 20

-0.2

-0.0

0.2

0.4

0.6

0.8

1.0

Lag

AC

F

Figure 5.3. Realization (n = 250) of the sample auto-correlation function of the moving average processXt = 0.5Zt + 0.2Zt−1 + 0.5Zt−2 for a Gaussian white noise series Zt. The horizontal bands are drawn atheights ±1/

√250.

5.12 EXERCISE. Verify the formula for Wh,h in the preceding example.

5.13 EXERCISE. Find W1,1 as a function of θ for the process Xt = Zt + θZt−1.

5.14 EXERCISE. Verify Bartlett’s formula.


5.4 Partial Auto Correlations

By Lemma 2.36 and the prediction equations the partial auto-correlation αX(h) is thesolution φh of the system of equations

γX(0) γX(1) · · · γX(h− 1)...

......

γX(h− 1) γX(h− 2) · · · γX(0)

φ1...φh

=

γX(1)...

γX(h)

.

A nonparametric estimator αn(h) of αX(h) is obtained by replacing the auto-covariancefunction in this linear system by the sample auto-covariance function γn. This yieldsestimators φ1, . . . , φh of the prediction coefficients satisfying

γn(0) γn(1) · · · γn(h− 1)...

......

γn(h− 1) γn(h− 2) · · · γn(0)

φ1...φh

=

γn(1)...

γn(h)

.

We define a nonparametric estimator for αX(h) by αn(h) = φh. The function h 7→ αn(h)is called the sample partial auto-covariance function.

If we write these two systems of equations as Γφ = γ and Γφ = γ, respectively, thenwe obtain that

φ− φ = Γ−1γ − Γ−1γ = Γ−1(γ − γ)− Γ−1(Γ− Γ)Γ−1γ.

The sequences√n(γ − γ) and

√n(Γ − Γ) are jointly asymptotically normal by The-

orem 5.8. With the help of Slutsky’s lemma and the continuous mapping theorem wereadily obtain the asymptotic normality of the sequence

√n(φ − φ) and hence of the

sequence√n(

αn(h) − αX(h))

. The asymptotic covariance matrix appears to be com-plicated, in general; we shall not derive it, except in the special case of the followingexample.

5.15 Example (Autoregression). For the stationary solution to Xt = φXt−1 + Zt

and |φ| < 1, the partial auto-correlations of lags h ≥ 2 vanish, by Example 2.37. InSection 11.1.2 it is shown that the sequence

√nαn(h) is asymptotically standard normally

distributed, for every h ≥ 2.This result extends to the “causal” solution of the pth order auto-regressive scheme

Xt = φ1Xt−1 + · · ·+φpXt−p +Zt and the auto-correlations of lags h > p. (The meaningof “causal” is explained in Chapter 8.) This property can be used to find an appropriateorder p when fitting an auto-regressive model to a given time series. The order is chosensuch that “most” of the sample auto-correlations of lags bigger than p are within the band[−1.96/

√n, 1.96/

√n]. A proper interpretation of “most” requires that the dependence

of the αn(h) is taken into consideration.

5.4: Partial Auto Correlations 81

0 5 10 15 20

0.0

0.2

0.4

Lag

Par

tial A

CF

Figure 5.4. Realization (n = 250) of the partial auto-correlation function of the stationary solution toXt = 0.5Xt−1 + 0.2Xt−1 + Zt for a Gaussian white noise series.

6Spectral Theory

Let Xt be a stationary, possibly complex, time series with auto-covariance function γX .If the series

∑

h |γX(h)| is convergent, then the series

(6.1) fX(λ) =1

2π

∑

h∈ZγX(h)e−ihλ,

is absolutely convergent, uniformly in λ ∈ R. This function is called the spectral densityof the time series Xt. Because it is periodic with period 2π it suffices to consider it onan interval of length 2π, which we shall take to be (−π, π]. In the present context thevalues λ in this interval are often referred to as frequencies, for reasons that will becomeclear. By the uniform convergence, we can exchange the order of integral and sum whencomputing

∫ π

−π eihλ fX(λ) dλ and we find that, for every h ∈ Z,

(6.2) γX(h) =

∫ π

−πeihλ fX(λ) dλ.

Thus the spectral density fX determines the auto-covariance function, just as the auto-covariance function determines the spectral density.

6.1 EXERCISE. Verify the orthonormality of the trigonometric basis:∫ π

−π eihλ dλ = 0

for integers h 6= 0 and∫ π

−π eihλ dλ = 2π for h = 0. Use this to prove the inversion formula

(6.2), under the condition that∑

h |γX(h)| <∞.

In mathematical analysis the series fX is called a Fourier series and the numbersγX(h) are called the Fourier coefficients of fX . (The factor 1/(2π) is sometimes omitted orreplaced by another number, and the Fourier series is often defined as fX(−λ) rather thanfX(λ), but this is inessential.) A central result in Fourier analysis is that the series in (6.1)converges in L2

(

(−π, π],B, λ)

, and (6.2) is true, if and only if∑

h |γX(h)|2 < ∞. (SeeLemma 6.3 below; the latter condition is weaker than the absolute summability of the au-tocovariances.) Furthermore, the converse of this statement is also true: if complex num-bers γX(h) are defined by (6.2) from a given function fX ∈ L2

(

(−π, π],B, λ)

, then the

6.1: Spectral Measures 83

latter function can be recovered from (6.1), where the convergence is in L2

(

(−π, π],B, λ)

.This involves the completeness of the trigonometric basis, consisting of the functionsλ 7→ (2π)−1/2e−ihλ, for h ∈ Z.

6.1 Spectral Measures

The requirement that the series∑

h

∣

∣γX(h)∣

∣ or∑

h

∣

∣γX(h)∣

∣2 are convergent means roughly

that γX(h) → 0 as h → ±∞ at a “sufficiently fast” rate. In statistical terms it meansthat variables Xt that are widely separated in time must be approximately uncorrelated.This is not true for every time series, and consequently not every time series possesses aspectral density. However, every stationary time series does have a “spectral measure”,by the following theorem.

6.2 Theorem (Herglotz). For every stationary time series Xt there exists a uniquefinite measure FX on (−π, π] such that

γX(h) =

∫

(−π,π]eihλ dFX(λ), h ∈ Z.

Proof. Define Fn as the measure on [−π, π] with Lebesgue density equal to

fn(λ) =1

2π

n∑

h=−nγX(h)

(

1− |h|n

)

e−ihλ.

It is not immediately clear that this is a real-valued, nonnegative function, but thisfollows from the fact that

0 ≤ 1

2πnvar(

n∑

t=1

Xte−itλ

)

=1

2πn

n∑

s=1

n∑

t=1

cov(Xs, Xt)ei(t−s)λ = fn(λ).

It is clear from the definition of fn that the numbers γX(h)(

1 − |h|/n)

are the Fouriercoefficients of fn for |h| ≤ n (and the remaining Fourier coefficients of fn are zero). Thus,by the inversion formula,

γX(h)(

1− |h|n

)

=

∫ π

−πeihλ fn(λ) dλ =

∫ π

−πeihλ dFn(λ), |h| ≤ n.

Setting h = 0 in this equation, we see that Fn[−π, π] = γX(0) for every n. Thus, apartfrom multiplication by the constant γX(0), the Fn are probability distributions. Becausethe interval [−π, π] is compact, the sequence Fn is uniformly tight. By Prohorov’s theoremthere exists a subsequence Fn′ that converges weakly to a distribution F on [−π, π] of thesame total mass. Because λ 7→ eihλ is a continuous function, it follows by the portmanteaulemma that ∫

[−π,π]eihλ dF (λ) = lim

n′→∞

∫

[−π,π]eihλ dFn′(λ) = γX(h),

84 6: Spectral Theory

by the preceding display. If F puts a positive mass at −π, we can move this to the pointπ without affecting this identity, since e−ihπ = eihπ for every h ∈ Z. The resulting Fsatisfies the requirements for FX .

That this F is unique can be proved using the fact that the linear span of thefunctions λ 7→ eihλ is uniformly dense in the set of continuous, periodic functions.† Thisallows to extend equality

∫

k dF =∫

k dG, for two measures F and G and every functionk of the form λ 7→ eiλh to equality for any continuous, periodic function k. Using themonotone and/or dominated convergence theorems, we can further extend this to anybounded measurable function k. This implies that F = G.

The measure FX is called the spectral measure of the time series Xt. If the spectralmeasure FX admits a density fX relative to the Lebesgue measure, then the latter iscalled the spectral density. A sufficient condition for this is that the square summabilityof the square autocovariances. Then the spectral density is the Fourier series (6.1), withFourier coefficients γX(h).

6.3 Lemma. If∑

h

∣

∣γX(h)∣

∣

2< ∞, then the series (6.1) converges in L2

(

(−π, π],B, λ)

,and defines a density of FX relative to Lebesgue measure.

Proof. The orthonormality of the trigonometric basis shows that, for any m ≤ n,∫ π

−π

∣

∣

∣

∑

m≤|h|≤nγX(h)e−ihλ

∣

∣

∣

2

dλ =∑

m≤|g|≤n

∑

m≤|h|≤nγX(h)γX(g)

∫ π

−πei(h−g)λ dλ

= 2π∑

m≤|h|≤n

∣

∣γX(h)∣

∣

2.

Thus the convergence of the series∑∣

∣γX(h)∣

∣

2is equivalent to the convergence of the

series (6.1) in L2

(

(−π, π],B, λ)

.

Next the continuity of the inner product in L2

(

(−π, π],B, λ)

allows exchanging the

order of integration and summation in∫ π

−π eigλ∑

h γX(h)e−ihλ dλ. Then (6.2) followsfrom the orthonormality of the trigonometric basis (as in Exercise 6.1).

6.4 EXERCISE. Show that the spectral density of a real-valued time series with∑

h

∣

∣γX(h)∣

∣ <∞ is symmetric about zero.

* 6.5 EXERCISE. Show that the spectral measure of a real-valued time series is symmetricabout zero, apart from a possible point mass at π. [Hint: Use the uniqueness of a spectralmeasure.]

6.6 Example (White noise). The covariance function of a white noise sequence Xt is0 for h 6= 0. Thus the Fourier series defining the spectral density has only one term andreduces to

fX(λ) =1

2πγX(0).

† See Rudin, Theorem 4.25. In fact, by Fejer’s theorem, the Cesaro sums of the Fourier series of a contin-uous, periodic function converge uniformly to the function.


The spectral measure is the uniform measure with total mass γX(0). Hence “a whitenoise series contains all possible frequencies in an equal amount”.

6.7 Example (Deterministic trigonometric series). Let Xt = A cos(λt)+B sin(λt) formean-zero, uncorrelated variables A and B of variance σ2, and λ ∈ (0, π). By Example 1.5the covariance function is given by

γX(h) = σ2 cos(hλ) = σ2 12 (e

iλh + e−iλh).

It follows that the spectral measure FX is the discrete 2-point measure with FXλ =FX−λ = σ2/2.

Because the time series is real, the point mass at −λ does not really count: becausethe spectral measure of a real time series is symmetric, the point −λ must be therebecause λ is there. The form of the spectral measure and the fact that the time series inthis example is a trigonometric series of frequency λ, are good motivation for referringto the values λ as “frequencies”.

6.8 EXERCISE.(i) Show that the spectral measure of the sum Xt + Yt of two uncorrelated time series

is the sum of the spectral measures of Xt and Yt.(ii) Construct a time series with spectral measure equal to a symmetric discrete measure

on the points ±λ1,±λ2, . . . ,±λk with 0 < λ1 < · · · < λk < π.(iii) Construct a time series with spectral measure the 1-point measure with FX0 = σ2.(iv) Same question, but now with FXπ = σ2.

* 6.9 EXERCISE. Show that every finite measure on (−π, π] is the spectral measure ofsome stationary time series.

The spectrum of a time series is an important theoretical concept, but it is alsoan important practical tool to gain insight in periodicities in the data. Inference usingthe spectrum is called spectral analysis or analysis in the frequency domain as opposedto “ordinary” analysis, which is in the time domain. It depends on the subject fieldwhether a spectral analysis is useful. If there are periodicities present in a time series,then a spectral analysis will reveal these. However, in general we should not have toogreat expectations of the insight offered by the spectrum. For instance, the interpretationof the spectrum of economic time series may be complicated, or even unclear, due to thefact that all possible frequencies are present to some nontrivial extent.

The idea of a spectral analysis is to view the consecutive values

. . . , X−1, X0, X1, X2, . . .

of a time series as a random function X:Z ⊂ R → R, and to write this as a weightedsum (or integral) of trigonometric functions t 7→ cosλt or t 7→ sinλt of different frequen-cies λ. In simple cases finitely many frequencies suffice, whereas in other situations allfrequencies λ ∈ (−π, π] are needed to give a full description, and the “weighted sum”becomes an integral. Examples TrigonometricSeries6.7 and 6.6 provide two extremes:


a deterministic trigonometric series incorporates a single frequency, and a white noiseseries has all frequencies in equal amounts. In general the spectral measure gives theweights of the different frequencies in the sum.

We shall derive the spectral decomposition, the theoretical basis for this interpreta-tion, in Section 6.4. Another method to gain insight in the interpretation of a spectrumis to consider its transformation under filtering. The term “filtering” stems from the fieldof signal processing, where a filter takes the form of an physical device that filters outcertain frequencies from a given electric current. For us, a (linear) filter will remain aninfinite moving average as defined in Chapter 1.

6.10 Definition (Transfer function). The transfer function (or frequency responsefuncion for a filter with filter coefficients ψj is the function ψ: (−π, π] → C given by

ψ(λ) =

∞∑

j=−∞ψje−ijλ.

The gain is the modulus of the transfer function.

6.11 Theorem. Let Xt be a stationary time series with spectral measure FX and let∑

j |ψj | < ∞. Then Yt =∑

j ψjXt−j has spectral measure FY given by, with ψ thetransfer function,

dFY (λ) =∣

∣ψ(λ)∣

∣

2dFX(λ).

Proof. According to Lemma 1.29(iii) (if necessary extended to complex-valued filters),the series Yt is stationary with auto-covariance function

γY (h) =∑

k

∑

l

ψkψlγX(h− k + l) =∑

k

∑

l

ψkψl

∫

ei(h−k+l)λ dFX(λ).

By the dominated convergence theorem we are allowed to change the order of (double)

summation and integration. Next we can rewrite the right side as∫ ∣

∣ψ(λ)∣

∣

2eihλ dFX(λ).

This proves the theorem, in view of Theorem 6.2 and the uniqueness of the spectralmeasure.

6.12 Example (Moving average). A white noise process Zt has a constant spectraldensity σ2/(2π). By the preceding theorem the moving average Xt = Zt + θZt−1 hasspectral density

fX(λ) = |1 + θe−iλ|2 σ2

2π= (1 + 2θ cosλ+ θ2)

σ2

2π.

If θ > 0, then the small frequencies dominate, whereas the bigger frequencies are moreimportant if θ < 0. This suggests that the sample paths of this time series will be morewiggly if θ < 0. However, in both cases all frequencies are present in the signal.


Figure 6.1. Spectral density of the moving average Xt = Zt + .5Zt−1, for σ2 = 1.

6.13 Example. The complex-valued process Xt = Aeiλt for a mean zero variable A andλ ∈ (−π, π] has covariance function

γX(h) = cov(Aeiλ(t+h), Aeiλt) = eihλE|A|2.

The corresponding spectral measure is the 1-point measure FX with FXλ = E|A|2.Therefore, the filtered series Yt =

∑

j ψjXt−j has spectral measure the 1-point measure

with FY λ =∣

∣ψ(λ)∣

∣2E|A|2. By direct calculation we find that

Yt =∑

j

ψjAeiλ(t−j) = Aeiλtψ(λ) = ψ(λ)Xt.

This suggests an interpretation for the term “transfer function”. Filtering a “pure signal”Aeitλ of a single frequency apparently yields another signal of the same single frequency,but the amplitude of the signal changes by multiplication with the factor ψ(λ). If ψ(λ) =0, then the frequency is “not transmitted”, whereas values of

∣

∣ψ(λ)∣

∣ bigger or smallerthan 1 mean that the frequency λ is amplified or weakened.

6.14 EXERCISE. Find the spectral measure of Xt = Aeiλt for λ not necessarily belong-ing to (−π, π].


To give a further interpretation to the spectral measure consider a band pass filter.This is a filter with transfer function of the form

ψ(λ) =

0, if |λ− λ0| > δ,1, if |λ− λ0| ≤ δ,

for a fixed frequency λ0 and fixed band width 2δ. According to Example 6.13 this filter“kills” all the signals Aeiλt of frequencies λ outside the interval [λ0 − δ, λ0 + δ] andtransmits all signals Aeitλ for λ inside this range unchanged. The spectral density of thefiltered signal Yt =

∑

j ψjXt−j relates to the spectral density of the original signal Xt

(if there exists one) as

fY (λ) =∣

∣ψ(λ)∣

∣

2fX(λ) =

0, if |λ− λ0| > δ,fX(λ), if |λ− λ0| ≤ δ.

Now think of Xt as a signal composed of many frequencies. The band pass filter transmitsonly the subsignals of frequencies in the interval [λ0 − δ, λ0 + δ]. This explains that thespectral density of the filtered sequence Yt vanishes outside this interval. For small δ > 0,

varYt = γY (0) =

∫ π

−πfY (λ) dλ =

∫ λ0+δ

λ0−δfX(λ) dλ ≈ 2δfX(λ0).

We interprete this as saying that fX(λ0) is proportional to the variance of the subsignalsin Xt of frequency λ0. The total variance varXt = γX(0) =

∫ π

−π fX(λ) dλ in the signalXt is the total area under the spectral density. This can be viewed as the sum of thevariances of the subsignals of frequencies λ, the area under fX between λ0− δ and λ0+ δbeing the variance of the subsignals of frequencies in this interval.

A band pass filter is a theoretical filter: in practice it is not possible to filter outan exact range of frequencies. Electronic devices will not filter frequencies close to theborders of the band correctly. There are several popular designs that balance a sharptransition of the transfer function to zero to constancy on the “pass band” (cf. Exam-ple 6.20). On a computer only finite filters can be implemented (the ones with onlyfinitely many nonzero filter coefficients ψj). These have transfer functions that are afinite linear combination of sines and cosines, and hence are smooth and cannot jump,in contrast to the indicator function corresponding to a band pass filter.

The filter coefficients ψj relate to the transfer function ψ(λ) in the same way asthe auto-covariances γX(h) relate to the spectral density fX(h), apart from a factor 2π.Thus, to find the filter coefficients of a given transfer function ψ, it suffices to apply theFourier inversion formula

ψj =1

2π

∫ π

−πeijλψ(λ) dλ.

6.15 EXERCISE. Show that the filter coefficients of a band pass filter with (λ0− δ, λ0+δ) ⊂ (0, π) are given by ψj = eiλ0j sin(δj)/(πj).

6.16 Example (Low frequency and trend). An apparent trend in observed dataX1, . . . , Xn could be modelled as a real trend in a nonstationary time series, but could


alternatively be viewed as the beginning of a long cycle. In practice, where we get tosee only a finite stretch of a time series, low frequency cycles and slowly moving trendscannot be discriminated. It was seen in Chapter 1 that differencing Yt = Xt −Xt−1 ofa time series Xt removes a linear trend, and repeated differencing removes higher orderpolynomial trends. In view of the preceding observation the differencing filter shouldremove, to a certain extent, low frequencies.

The differencing filter has transfer function

ψ(λ) = 1− e−iλ = 2ie−iλ/2 sinλ

2.

The absolute value∣

∣ψ(λ)∣

∣ of this transfer function increases from 0 at 0 to its maximumvalue at π (see Figure 6.2). Thus, indeed, it filters away low frequencies, albeit only withpartial success.

0.0 0.5 1.0 1.5 2.0 2.5 3.0

0.0

0.5

1.0

1.5

2.0

Figure 6.2. Absolute value of the transfer function of the difference filter.

6.17 Example (Averaging). The averaging filter Yt = (2M + 1)−1∑M

j=−M Xt−j hastransfer function

ψ(λ) =1

2M + 1

M∑

j=−Me−ijλ =

sin(

(M + 12 )λ)

(2M + 1) sin( 12λ).

(The expression on the right is defined by continuity, as 1, at λ = 0.) This functionis proportional to the Dirichlet kernel, which is the function obtained by replacing thefactor 2M + 1 by 2π. From a picture of this kernel we conclude that averaging removes


0.0 0.5 1.0 1.5 2.0 2.5 3.0

01

23

sin(

u *

10.5

)/si

n(u/

2)/(

2 *

pi)

Figure 6.3. Dirichlet kernel of order M = 10.

high frequencies to a certain extent (and in an uneven manner depending on M), butretains low frequencies.

6.18 EXERCISE. Express the variance of Yt in the preceding example in ψ and thespectral density of the time series Xt (assuming that there is one). What happens ifM → ∞? Which conclusion can you draw? Does this remain true if the series Xt doesnot have a spectral density? [Verify that |ψ(λ)| ≤ 1 for every λ andM , and that ψ(λ) → 0as M → ∞ for every 0.]

6.19 EXERCISE. Find the transfer function of the filter Yt = Xt − Xt−12. Interpretethe result.

6.20 Example (Butterworth filter). The Butterworth filter of order n with cut-off ωhas transfer function satisfying

ψ(λ) =(

n∏

i=1

(iλ− ωe(2k+n−1)πi/(2n)))−1

,∣

∣ψ(λ)∣

∣

2=(

1 + (λ/ω)2n)−1

.

6.2: Aliasing 91

It is a low-pass filter: the function∣

∣ψ(λ)∣

∣ is approximately constant equal to 1 on [0, ω)and approximately 0 on (ω, π]. The transition from 1 to 0 is sharper if n is larger. The

derivatives of∣

∣ψ(λ)∣

∣

2up to order 2n − 1 vanish, making the function “maximally flat”

on the intervals where it takes unit or zero values. It pays for this desirable propertyby a transition from 1 ot 0 that is slower than some other popular filters such as theChebyshev.

The Butterworth filter is popular in signal analysis. It was proposed by Butterworthin 1930 as an electrical circuit, illustrated in Figure 6.4 for the order n = 3.

Figure 6.4. Electrical circuit implementing a Butterworth filter of order 3, if the capacitator C is set of4/3 farac, the resistor R to 1 Ohm and the inductors L1 and L2 to 3/2 and 1 Henry, respectively.

Instead of in terms of frequencies we can also think in periods. A series of the formt 7→ eiλt repeats itself after 2π/λ instants of time. Therefore, the period is defined as

period =2π

frequency.

Most monthly time series (one observation per month) have a period effect of 12 months.If so, this will be visible as a peak in the spectrum at the frequency 2π/12 = π/6.‡ Oftenthe 12-month cycle is not completely regular. This may produce additional (but smaller)peaks at the harmonic frequencies 2π/6, 3π/6, . . ., or π/12, π/18, . . ..

It is surprising at first that the highest possible frequency is π, the so-called Nyquistfrequency. This is caused by the fact that the series is measured only at discrete timepoints. Very rapid fluctuations fall completely between the measurement times and hencecannot be observed. The Nyquist frequency π corresponds to a period of 2π/π = 2 timeinstants and this is clearly the smallest period that is observable. The next section shedsmore light on this.

‡ That this is a complicated number is an inconvenient consequence of our convention to define the spectrumon the interval (−π, π]. This can be repaired. For instance, the Splus package produces spectral plots with thefrequencies rescaled to the interval (− 1

2 ,12 ]. Then a 12-month period gives a peak at 1/12.


* 6.2 Aliasing

In practice a (discrete-time) time series often arises by measuring an underlyingcontinuous-time process at discrete time instants, an operation called sampling. For reg-ularly spaced time instants the sampling frequency is defined as the number of measure-ments per time unit. The more we sample the better, of course, but practical considera-tions, such as computer storage of the data, will impose restrictions. In this section westudy the effect of sampling on the spectrum of a time series.

Consider a time series (Xt: t ∈ Z) obtained by sampling a continuous time stochasticprocess (Yt: t ∈ R) at times −δ, 0, δ, 2δ, . . . for a sampling interval δ > 0: Xt = Ytδ. Thecontinuous time process Yt is said to be second order stationary if its expectation isconstant and the autovariances cov(Yt+h, Yt) do not depend on t, so that its (continuoustime) autocovariance function can be defined by, for h ∈ R,

γY (h) = cov(Yt+h, Yt).

By Bochner’s theorem, provided that it is continuous, such a covariance function canalways be represented by a finite Borel measure FY on R in the form

γY (h) =

∫ ∞

−∞eihλ dFY (λ), h ∈ R.

This is similar to Herglotz’ theorem, the important difference being that the represen-tation uses all frequencies in R, not just an interval of length 2π, and is valid for thecontinuum of lags h ∈ R.

6.21 Lemma. The skeleton (Xt: t ∈ Z) = (Ytδ: t ∈ Z) of a continuous time stationaryprocess (Yt: t ∈ Z) with spectral measure FY possesses spectral measure FX given by

FX(B) =∑

k∈ZFY

(B + k 2π

δ

)

.

Proof. The covariance function of the time series Xt can be written γX(h) = γY (hδ) =∫

eihδλ dFY (λ). The function λ 7→ eihλδ is periodic with period 2π/δ, for every h ∈ Z. Itis therefore identical on the intervals . . . , (−3π/δ,−π/δ], (−π/δ, π/δ], (π/δ, 3π/δ], . . . andthe integral over R of the function relative to FY can be written as the integral of thefunction over the interval (−π/δ, π/δ] relative to the infinite sum of the translates of therestrictions of the measure FY to the intervals. Finally we rescale δλ→ λ.

The preceding lemma shows that the spectral weights of the continuous time processYt at a frequency not in the interval (−π/δ, π/δ] contribute to the spectral weights of thesampled process at the frequency modulo 2π/δ. In particular, if δ = 1 the frequenciesbigger than the Nyquist frequency, which may be “present” in the signal Yt, contributeto the spectrum of the sampled process at frequencies modulo 2π. This effect, which

For a proof of Bochner’s theorem, along the lines of the proof of Herglotz theorem, see pages 275–277 ofGnedenko (1962).

6.3: Nonsummable filters 93

is referred to as aliasing, is often considered undesirable, because it is impossible toseparate the contributions of the frequencies that are really present (below the Nyquistfrequency) from the “aliased” frequencies. Aliasing typically leads to an overall higherspectral density, which is equivalent to a bigger noise content of the signal.

For this reason, if it is possible, a signal should be low-pass filtered before samplingit. With continuous time electronic signals this filtering step is often achieved at the mea-suring time by passing the signal through a physical, electronic device, before samplingor digitizing it. If for some reason an already discretized signal must be “down-sampled”,then it can also be useful to low-pass filter the signal first, which can be achieved by justapplying a linear filter with the appropriate transfer function.

* 6.3 Nonsummable filters

For coefficients ψj with∑

j |ψj | < ∞, the filtered series∑

j ψjXt−j is well defined forany stationary time series Xt. Unfortunately, not all filters have summable coefficients.An embarassing example is the band pass filter considered previously, for which the filtercoefficients tend to zero at the speed |ψj | ∼ 1/j.

In fact, if a sequence of filter coefficients is summable, then the series∑

j ψjeijλ

converges uniformly, and hence the corresponding transfer function must be continuous.The transfer function λ 7→ ψ(λ) = 1[λ0−δ,λ0+δ](λ) of the band pass filter has two jumps.

In such a case, the series∑

j ψje−ijλ may still be well defined, for instance as a limit

in L2

(

(−π, π],B, λ)

, and define a valid transfer function. To handle such examples it isworthwhile to generalize Theorem 6.11 (and Lemma 1.29).

6.22 Theorem. Let Xt be a stationary zero-mean time series with spectral measure FX ,defined on the probability space (Ω,U ,P). Then the series ψ(λ) =

∑

j ψje−iλj converges

in L2(FX) if and only if Yt =∑

j ψjXt−j converges in L2(Ω,U ,P) for some t (and thenfor every t ∈ Z) and in that case

dFY (λ) =∣

∣ψ(λ)∣

∣

2dFX(λ).

Proof. For 0 ≤ m ≤ n let ψm,nj be equal to ψj for m ≤ |j| ≤ n and be 0 otherwise, and

define Y m,nt as the series Xt filtered by the coefficients ψm,n

j . Then certainly∑

j |ψm,nj | <

∞ for every fixed pair (m,n) and hence we can apply Lemma 1.29 and Theorem 6.11 tothe series Y m,n

t . This yields

E∣

∣

∣

∑

m≤|j|≤nψjXt−j

∣

∣

∣

2

= E|Y m,nt |2 = γY m,n(0)

=

∫

(−π,π]dFY m,n =

∫

(−π,π]

∣

∣

∣

∑

m≤|j|≤nψje−iλj

∣

∣

∣

2

dFX(λ).

The left side converges to zero for m,n→ ∞ if and only if the partial sums of the seriesYt =

∑

j ψjXt−j form a Cauchy sequence in L2(Ω,U ,P). The right side converges to zero


if and only if the partial sums of the sequence∑

j ψje−iλj form a Cauchy sequence in

L2(FX). The first assertion of the theorem now follows, because both spaces are complete.To prove the second assertion, we first note that, by Theorem 6.11,

cov(

∑

|j|≤nψjXt+h−j ,

∑

|j|≤nψjXt−j

)

= γY 0,n(h) =

∫

(−π,π]

∣

∣

∣

∑

|j|≤nψje−iλj

∣

∣

∣

2

eihλ dFX(λ).

We now take limits of the left and right sides as n → ∞ to find that γY (h) =∫

(−π,π]∣

∣ψ(λ)∣

∣2 eihλ dFX(λ), for every h.

6.23 Example. If the filter coefficients satisfy∑

j |ψj |2 < ∞ (which is weaker than

absolute convergence), then the series∑

j ψje−ijλ converges in L2

(

(−π, π],B, λ)

. Conse-quently, the series also converges in L2(FX) for every spectral measure FX that possessesa bounded density.

Thus, in many cases a sequence of square-summable coefficients defines a valid filter.A particular example is a band pass filter, for which |ψj | = O(1/|j|) as j → ±∞.

* 6.4 Spectral Decomposition

In the preceding section we interpreted the mass FX(I) that the spectral distributiongives to an interval I as the size of the contribution of the components of frequenciesλ ∈ I to the signal t 7→ Xt. In this section we give a precise mathematical meaning tothis idea. We show that a given stationary time series Xt can be written as a randomlyweighted sum of single frequency signals eiλt.

This decomposition is simple in the case of a discrete spectral measure. For givenuncorrelated, zero-mean random variables Z1, . . . , Zk and numbers λ1, . . . , λk ∈ (−π, π]the process

Xt =

k∑

j=1

Zjeiλjt

possesses as spectral measure FX the discrete measure with point masses of sizesFXλj = E|Zj |2 at the frequencies λ1, . . . , λk (and no other mass). The series Xt isthe sum of uncorrelated, single-frequency signals of stochastic amplitudes |Zj |. This iscalled the spectral decomposition of the series Xt. We prove below that this construc-tion can be reversed: given a mean zero, stationary time series Xt with discrete spectralmeasure as given, there exist mean zero uncorrelated (complex-valued) random variablesZ1, . . . , Zk with variances FXλj such that the decomposition is valid.

This justifies the interpretation of the spectrum given in the preceding section. Thepossibility of the decomposition is surprising in that the spectral measure only involvesthe auto-covariance function of a time series, whereas the spectral decomposition is adecomposition of the sample paths of the time series: if the series Xt is defined on a

6.4: Spectral Decomposition 95

given probability space (Ω,U ,P), then so are the random variables Zj and the precedingspectral decomposition may be understood as being valid for (almost) every ω ∈ Ω. Thiscan be true, of course, only if the variables Z1, . . . , Zk also have other properties besidesthe ones described. The spectral theorem below does not give any information aboutthese further properties. For instance, even though uncorrelated, the Zj need not beindependent. This restricts the usefulness of the spectral decomposition, but we couldnot expect more. Because the spectrum only involves second moment properties, it leavesmost of the distribution of a series undescribed. An important exception are Gaussiantimes series, for which the first and second moments, and hence the mean and the spectraldistribution, describe the complete distribution.

The spectral decomposition is not restricted to time series with discrete spectralmeasures. However, in general, the spectral decomposition involves a continuum of fre-quencies and the sum becomes an integral

Xt =

∫

(−π,π]eiλt dZ(λ).

A technical complication is that such an integral, relative to a “random measure” Z, isnot defined in ordinary measure theory. We must first give it a meaning.

6.24 Definition. A random measure with orthogonal increments Z is a collection(

Z(B):B ∈ B)

of mean zero, complex random variables Z(B) indexed by the Borelsets B in (−π, π] defined on some probability space (Ω,U ,P) such that, for some finiteBorel measure µ on (−π, π],

cov(

Z(B1), Z(B2))

= µ(B1 ∩B2), every B1, B2 ∈ B.

This definition does not appear to include a basic requirement of a measure: thatthe measure of a countable union of disjoint sets is the sum of the measures of theindividual sets. However, this is implied by the covariance property, which says that thelinear extension of the map 1B 7→ Z(B) is an isometry from L2(µ) into L2(Ω,U ,P).

6.25 EXERCISE. Let Z be a random measure with orthogonal increments. Show thatZ(∪jBj) =

∑

j Z(Bj) in mean square, whenever B1, B2, . . . is a sequence of pairwisedisjoint Borel sets.

6.26 EXERCISE. Let Z be a random measure with orthogonal increments and defineZλ = Z(−π, λ]. Show that (Zλ:λ ∈ (−π, π]) is a stochastic process with uncorrelatedincrements: for λ1 < λ2 ≤ λ3 < λ4 the variables Zλ4

−Zλ3and Zλ2

−Zλ1are uncorrelated.

This explains the phrase “with orthogonal increments”.

* 6.27 EXERCISE. Suppose that Zλ is a mean zero stochastic process with finite secondmoments and uncorrelated increments. Show that this process corresponds to a randommeasure with orthogonal increments as in the preceding exercise. [This asks you toreconstruct the random measure Z from the weights Zλ = Z(−π, λ] it gives to cells,similarly as an ordinary measure can be reconstructed from its distribution function.


The construction below will immediately give us the complete random measure as inDefinition 6.24. So it is not really helpful to go through the somewhat tedious details ofthis construction, although many authors start from distribution functions rather thanmeasures.]

Next we define an “integral”∫

f dZ for given functions f : (−π, π] → C. For anindicator function f = 1B of a Borel set B, we define, in analogy with an ordinaryintegral,

∫

1B dZ = Z(B). Because we wish the integral to be linear, we are lead to thedefinition

∫

∑

j

αj1BjdZ =

∑

j

αjZ(Bj),

for every finite collections of complex numbers αj and Borel sets Bj . This determinesthe integral for many, but not all functions f . We extend its domain by continuity:we require that

∫

fn dZ →∫

f dZ whenever fn → f in L2(µ). The following lemmashows that these definitions and requirements can be consistently made, and serves as adefinition of

∫

f dZ.

6.28 Lemma. For every random measure with orthogonal increments Z there exists aunique map, denoted f 7→

∫

f dZ, from L2(µ) into L2(Ω,U ,P) with the properties(i)∫

1B dZ = Z(B);(ii)

∫

(αf + βg) dZ = α∫

f dZ + β∫

g dZ;

(iii) E∣

∣

∫

f dZ∣

∣

2=∫

|f |2 dµ.In other words, the map f 7→

∫

f dZ is a linear isometry such that 1B 7→ Z(B).

Proof. By the defining property of Z, for any complex numbers αi and Borel sets Bi,

E∣

∣

∣

k∑

i=1

αiZ(Bi)∣

∣

∣

2

=∑

i

∑

j

αiαj cov(

Z(Bi), Z(Bj))

=

∫

∣

∣

∣

k∑

j=1

αi1Bi

∣

∣

∣

2

dµ.

For f a simple function of the form f =∑

i αi1Bi, we define

∫

f dZ as∑

i αiZ(Bi).This is well defined, for, if f also has the representation f =

∑

j βj1Dj, then

∑

i αiZ(Bi) =∑

j βjZ(Dj) almost surely. This follows by applying the preceding identityto∑

i αiZ(Bi)−∑

j βjZ(Dj).

The “integral”∫

f dZ that is now defined on the domain of all simple functions ftrivially satisfies (i) and (ii), while (iii) is exactly the identity in the preceding display.The proof is complete upon showing that the map f 7→

∫

f dZ can be extended fromthe domain of simple functions to the domain L2(µ), meanwhile retaining the properties(i)–(iii).

We extend the map by continuity. For every f ∈ L2(µ) there exists a sequenceof simple functions fn such that

∫

|fn − f |2 dµ → 0. We define∫

f dZ as the limit inL2(Ω,U ,P) of the sequence

∫

fn dZ. This is well defined. First, the limit exists, because,by the linearity of the integral and the identity,

E∣

∣

∣

∫

fn dZ −∫

fm dZ|2 =

∫

|fn − fm|2 dµ,


since fn − fm is a simple function. Because fn is a Cauchy sequence in L2(µ), the rightside converges to zero as m,n → ∞. We conclude that

∫

fn dZ is a Cauchy sequencein L2(Ω,U ,P) and hence it has a limit by the completeness of this space. Second, thedefinition of

∫

f dZ does not depend on the particular sequence fn → f we use. Thisfollows, because given another sequence of simple functions gn → f , we have

∫

|fn −gn|2 dµ→ 0 and hence E

∣

∣

∫

fn dZ −∫

gn dZ∣

∣

2 → 0.We conclude the proof by noting that the properties (i)–(iii) are retained under

taking limits.

6.29 Definition (Stochastic integral). The stochastic integral∫

f dZ of a function f ∈L2(µ) relative to a random measure with orthogonal increments Z as in Definition 6.24is given by the map defined in Lemma 6.28. An alternative notation is Z(f).

Warning. This integral is constructed in a similar way as the Ito integral in stochas-tic analysis, but it is not the same. The process Z in the decomposition Z(0, λ] = Zλ−Z0

of a random measure with orthogonal increments is not necessarily a semimartingale. Onthe other hand, the integral

∫

f dZ is defined here only for deterministic integrands.

6.30 EXERCISE. Show that a linear isometry Φ:H1 → H2 between two Hilbert spacesH1 and H2 retains inner products, i.e. 〈Φ(f1),Φ(f2)〉2 = 〈f1, f2〉1. Conclude thatcov(∫

f dZ,∫

g dZ)

=∫

fg dµ.

We are now ready to derive the spectral decomposition for a general stationary timeseries Xt. Let L2(Xt: t ∈ Z) be the closed, linear span of the elements of the time seriesin L2(Ω,U ,P) (i.e. the closure of the linear span of the set Xt: t ∈ Z).

6.31 Theorem (Spectral representation). For any mean zero stationary time seriesXt with spectral distribution FX there exists a random measure Z with orthogonalincrements relative to the measure FX such that

Z(B):B ∈ B

⊂ L2(Xt: t ∈ Z) andsuch that Xt =

∫

eiλt dZ(λ) almost surely for every t ∈ Z.

Proof. By the definition of the spectral measure FX we have, for every finite collectionsof complex numbers αj and integers tj ,

E∣

∣

∣

∑

αjXtj

∣

∣

∣

2

=∑

i

∑

j

αiαjγX(ti − tj) =

∫

∣

∣

∣

∑

αjeitjλ∣

∣

∣

2

dFX(λ).

Now define a map Φ:L2(FX) → L2(Xt: t ∈ Z) as follows. For f of the form f =∑

j αjeitjλ define Φ(f) =

∑

αjXtj . By the preceding identity this is well defined.(Check!) Furthermore, Φ is a linear isometry. By the same arguments as in the pre-ceding lemma, it can be extended to a linear isometry on the closure of the space ofall functions

∑

j αjeitjλ. By Fejer’s theorem from Fourier theory, this closure contains

at least all Lipschitz periodic functions. By measure theory this collection is dense in


L2(FX). Thus the closure is all of L2(FX). In particular, it contains all indicator func-tions 1B of Borel sets B. Define Z(B) = Φ(1B). Because Φ is a linear isometry, it retainsinner products and hence

cov(

Z(B1), Z(B2))

=⟨

Φ(1B1),Φ(1B2

)⟩

=

∫

1B11B2

dFX .

This shows that Z is a random measure with orthogonal increments. By definition

∫

∑

j

αj1BjdZ =

∑

j

αjZ(Bj) =∑

αjΦ(1Bj) = Φ

(

∑

j

αj1Bj

)

.

Thus∫

f dZ = Φ(f) for every simple function f . Both sides of this identity are linearisometries when seen as functions of f ∈ L2(FX). Hence the identity extends to all f ∈L2(FX). In particular, we obtain

∫

eitλ dZ(λ) = Φ(eitλ) = Xt on choosing f(λ) = eitλ.

Thus we have managed to give a precise mathematical formulation to the spectraldecomposition

Xt =

∫

(−π,π]eitλ dZ(λ)

of a mean zero stationary time series Xt. The definition is a bit involved. It may evenseem that giving a meaning to the representation was harder than proving its validity.An insightful interpretation is obtained by approximation through Riemann sums. Givena partition −π = λ0,k < λ1,k < · · · < λk,k = π and a fixed time t ∈ Z, consider thefunction λ 7→ fk(λ) that is piecewise constant, and takes the value eitλj,k on the interval(λj−1,k, λj,k]. If the partitions are chosen such that the mesh width of the partitionsconverges to zero as k → ∞, then

∣

∣fk(λ) − eitλ∣

∣ converges to zero, uniformly in λ ∈(−π, π], by the uniform continuity of the function λ 7→ eitλ, and hence fk(λ) → eitλ

in L2(FX). Because the stochastic integral f 7→∫

f dZ is linear, we have∫

fk dZ =∑

j eitλj,kZ(λj−1,kλj,k] and because it is an isometry, we find

E∣

∣

∣Xt −k∑

j=1

eitλj,kZ(λj−1,k, λj,k]∣

∣

∣

2

=

∫

∣

∣eitλ − fk(λ)∣

∣

2dFX(λ) → 0.

Because the intervals (λj−1,k, λj,k] are pairwise disjoint, the random variables Zj : =Z(λj−1,k, λj,k] are uncorrelated, by the defining property of an orthogonal random mea-sure. Thus the time seriesXt can be approximated by a time series of the form

∑

j Zjeitλj ,

as in the introduction of this section. The spectral measure FX(λj−1,k, λj,k] of the interval(λj−1,k, λj,k] is the variance of the random weight Zj in this decomposition.

6.32 Example (Discrete spectrum). If the spectral measure FX is discrete with sup-port points λ1, . . . , λk, then the integral on the right in the preceding display (withλj,k = λj) is identically zero. In that case Xt =

∑

j Zjeiλjt almost surely for every t,

where Zj = Zλj. (Note that Z(λj−1, λj) = 0).


6.33 Example (Gaussian series). If the time series Xt is Gaussian, then all variablesin the linear span of the Xt are normally distributed (possibly degenerate) and henceall variables in L2(Xt: t ∈ Z) are normally distributed. In that case the variables Z(B)obtained from the random measure Z of the spectral decomposition of Xt are jointlynormally distributed. The zero correlation of two variables Z(B1) and Z(B2) for disjointsets B1 and B2 now implies independence of these variables.

Theorem 6.22 shows how a spectral measure changes under filtering. There is acorresponding result for the spectral decomposition.

6.34 Theorem. Let Xt be a mean zero, stationary time series with spectral measureFX and associated random measure ZX , defined on some probability space (Ω,U ,P). Ifψ(λ) =

∑

j ψje−iλj converges in L2(FX), then Yt =

∑

j ψjXt−j converges in L2(Ω,U ,P)and has spectral measure FY and associated random measure ZY such that, for everyf ∈ L2(FY ), ∫

f dZY =

∫

fψ dZX .

Proof. The series Yt converges by and has spectral measure FY with density∣

∣ψ(λ)∣

∣

2

relative to FX , by Theorem 6.22. By definition,∫

eitλ dZY (λ) = Yt =∑

j

ψj

∫

ei(t−j)λ dZX(λ) =

∫

eitλψ(λ) dZX(λ),

where in the last step changing the order of integration and summation is justified by theconvergence of the series

∑

j ψjei(t−j)λ in L2(FX) and the continuity of the stochastic

integral f 7→∫

f dZX . We conclude that the identity of the theorem is satisfied for everyf of the form f(λ) = eitλ. Both sides of the identity are linear in f and isometries onthe domain f ∈ L2(FY ). Because the linear span of the functions λ 7→ eitλ for t ∈ Z isdense in L2(FY ), the identity extends to all of L2(FY ).

The spectral decompositionXt =∫

eiλt dZ(λ) is valid for both real and complex timeseries. Because the representation is in terms of the complex functions eitλ, the spectralrandom measure Z of a real time series is typically complex valued. The following lemmashows that it is complex-conjugate. Abbreviate

∫

f dZ(λ) to Z(f), and given a functionf : (−π, π] → C let f− be the reflection (of its periodic extension): f−(λ) = f(−λ) forλ ∈ (−π, π) and f−(π) = f(π).

6.35 Lemma. The spectral random measure of a real time series Xt satisfies Z(f) =Z(f−) almost surely, for every f ∈ L(FX). In particular Z(B) = Z(−B) almost surely,for every Borel set B ⊂ (−π, π). Furthermore, for any f, g ∈ L2(FX),

EReZ(f)ReZ(g) = 14

∫

(f + f−)(g + g−) dFX ,

EReZ(f) ImZ(g) = 14

∫

(f + f−)(g − g−) dFX ,

E ImZ(f) ImZ(g) = 14

∫

(f − f−)(g − g−) dFX .


In particular, if f and g are real and vanish outside (0, π), then EReZ(f) ImZ(g) = 0and EReZ(f)ReZ(g) = E ImZ(f) ImZ(g).

Proof. If the Fourier series∑

j fjeitλ, where fj = (2π)−1

∫

e−ijλf(λ) dλ, converges inL2(FX) to the function f , then by the continuity of the random spectral measure

Z(f) =∑

j

fjZ(eijλ) =

∑

j

fjXj ,

where the convergence of the infinite series is in L2(Xt: t ∈ Z). Because the Fouriercoefficients of f− are equal to f j , it then follows that

Z(f) =∑

f jXj = Z(f−).

This proves the conjugate symmetry of Z for functions f of this type. The convergence(and hence the conjugate symmetry) is certainly true for every periodic, Lipschitz func-tion f , for which the convergence of the Fourier series is uniform. A general f can beapproximated in L2(FX) by a sequence fn of Lipschitz, periodic functions, and then thesequence fn− tends in L2(FX) to f−, in view of the symmetry of FX . Thus the conjugatesymmetry extends to any such function.

The expressions for the second moments of the real and imaginary parts followupon writing ReZ(f) = 1

2

(

Z(f) + Z(f))

= 12Z(f + f−) and ImZ(f) = 1

2

(

Z(f) −Z(f)

)

= 12Z(f − f−). In the particular case that f and g are real and vanish outside

(0, π), the function f + f− is symmetric, the function g − g− is anti-symmetric, and

(f + f−)(g + g− = (f − f−)(g − g−). Together with the symmetry of FX this yields thelast assertion.

The last assertion of the preceding lemma shows that the random vectors(

ReZ(f1), . . . ,ReZ(fk))

and(

ImZ(f1), . . . , ImZ(fk))

are uncorrelated and have equalcovariance matrix, for any real functions f1, . . . , fk in L2(FX) that vanish outside (0, π).This implies that the vector of combined real and imaginary parts of the complex randomvector

(

Z(f1), . . . , Z(fk))

possesses a covariance matrix of the form (1.3), for functionsf1, . . . , fk of the special type considered, and then also for general functions that vanishat 0 and π by the linearity of Z in f . In particular, if the time series Xt is real and Gaus-sian, then the complex random vector

(

Z(f1), . . . , Z(fk))

possesses a complex normal

distribution N(0,Σ) for Σ the matrix with (i, j)th element∫

fif j dFX .

6.36 EXERCISE. Show that for a real time series Z0 and Zπ are real. Concludethat these variables are complex normally distributed only if they are 0. [One proof canbe based on the law of large numbers in Section 7.1, which exhibits 10 as a limit of real

linear combinations of functions λ 7→ eitλ; the same linear combinations of the functionsλ 7→ eit(λ−π) = (−1)teitλ tend to 1π.]

6.37 EXERCISE (Inversion formula). Show that, for any continuity points 0 ≤ a <b ≤ π of FX ,

1

2π

n∑

t=−nXt

∫ b

a

e−itλ dλ→ Z(a, b], in L2.

6.5: Multivariate Spectra 101

[Note that (2π)−1∫ b

ae−itλ dλ are the Fourier coefficients of the function 1(a,b].]

* 6.5 Multivariate Spectra

If spectral analysis of univariate time series is hard, spectral analysis of multivariate timeseries is an art. It concerns not only “frequencies present in a single signal”, but also“dependencies between signals at given frequencies”.

This difficulty concerns the interpretation only: the mathematical theory does notpose new challenges. The covariance function γX of a vector-valued times series Xt ismatrix-valued. If the series

∑

h∈Z∥

∥γX(h)∥

∥ is convergent, then the spectral density of theseries Xt can be defined by exactly the same formula as before:

fX(λ) =1

2π

∑

h∈ZγX(h)e−ihλ.

The summation is now understood to be entry-wise, and hence λ 7→ fX(λ) maps theinterval (−π, π] into the set of (d × d)-matrices, for d the dimension of the series Xt.Because the covariance function of the univariate series αTXt is given by γαTX = αT γXα,it follows that, for every α ∈ C

d,

αT fX(λ)α = fαTX(λ).

In particular, the matrix fX(λ) is nonnegative-definite, for every λ. From the identityγX(−h)T = γX(h) it can also be ascertained that it is Hermitian: fX(λ)T = fX(λ). Thediagonal elements are nonnegative, but the off-diagonal elements of fX(λ) are complexvalued, in general. If the time series Xt is real, then the spectral density is also conjugatesymmetric: fX(−λ) = fX(λ), for every λ.

Warning. The spectral density is Hermitian. The autocovariance function is not.As in the case of univariate time series, not every vector-valued time series possesses

a spectral density, but every such series does possess a spectral distribution. This “dis-tribution” is a matrix-valued, complex measure. A complex Borel measure on (−π, π] isa map B 7→ F (B) on the Borel sets that can be written as F = F1 −F2 + i(F3 −F4) forfinite Borel measures F1, F2, F3, F4. If the complex part F3 − F4 is identically zero, thenF is a signed measure. The spectral measure FX of a d-dimensional time series Xt is a(d × d) matrix whose d2 entries are complex Borel measures on (−π, π]. The diagonalelements are precisely the spectral measures of the coordinate time series and hence areordinary measures, but the off-diagonal measures are typically signed or complex mea-sures. The measure FX is Hermitian in the sense that FX(B)T = FX(B) for every Borelset B.


6.38 Theorem (Herglotz). For every stationary vector-valued time series Xt thereexists a unique Hermitian-matrix-valued complex measure FX on (−π, π] such that

γX(h) =

∫

(−π,π]eihλ dFX(λ), h ∈ Z.

Proof. For every α ∈ Cd the time series αTXt is univariate and possesses a spectral

measure FαTX . By Theorem 6.2, for every h ∈ Z,

αT γX(h)α = γαTX(h) =

∫

(−π,π]eihλ dFαTX(λ).

We can express any entry of the matrix γX(h) as a linear combination of the quadraticform on the left side, evaluated for different vectors α. One possibility is to write, withek the kth unit vector in C

d,

2γX(h)k,l = (ek + el)T γX(h)(ek + el) + i(ek + iel)

T γX(h)(ek − iel)+

− (1 + i)(

eTk γX(h)ek + eTl γX(h)el)

.

By expressing the right-hand side in the spectral matrices FαTX , by using the firstdisplay, we obtain a representation γX(h)k,l =

∫

eihλ dFk,l(λ), for a certain complex-valued measure Fk,l, for every (k, l). Then the matrix-valued complex measure F = (Fk,l)satisfies γX(h) =

∫

(−π,π] eihλ dF (λ) and hence, , for every h ∈ Z,

∫

(−π,π]eihλ dF

T(λ) = γX(−h)T = γX(h).

Thus the matrix-valued complex measure F + FT, which is Hermitian, also represents

the autocovariance function.If F is any Hermitian-matrix-valued complex measure with the representing prop-

erty, then αTFα must be the spectral measure of the time series αTXt and hence isuniquely determined. This determines αTF (B)α for every α ∈ C

d, and therefore theHermitian matrix F (B), for every Borel set B.

Consider in particular a bivariate time series, written as (Xt, Yt)T for univariate time

series Xt and Yt. The spectral density of (Xt, Yt), if it exists, is a (2× 2)-matrix valuedfunction. The diagonal elements are the spectral densities fX and fY of the univariateseries Xt and Yt. The off-diagonal elements are complex conjugates and thus define onefunction, say fXY for the (1, 2)-element of the matrix, giving

f(X,Y )T (λ) =

(

fX(λ) fXY (λ)fXY (λ) fY (λ)

)

.

6.5: Multivariate Spectra 103

The following derived functions are often plotted:

Re fXY , co-spectrum,

Im fXY , quadrature,

|fXY |2fXfY

, coherency,

|fXY |, amplitude,

arg fXY , phase.

It requires some experience to read the plots of these functions appropriately. The co-herency is perhaps the easiest to interprete: evaluated at λ ∈ (−π, π] it gives the “corre-lation between the series X and Y at the frequency λ”.

6.39 EXERCISE. Show that the stationary time seriesXt and Yt are totally uncorrelated(i.e. cov(Xs, Yt) = 0 for every s and t) if and only if the spectral measure of the vector-valued process (Xt, Yt)

T is the product of the spectral measures FX and FY of X andY . What does this imply for the measure FXY , the coherency, amplitude and phase?

A given sequence of (d × d) matrices ψj with∑

j ‖ψj‖ < ∞ defines a multivariatelinear filter that transforms a given d-dimensional time series Xt into the time seriesYt =

∑

j ψjXt−j . The corresponding multivariate transfer function is the matrix-valued

function λ 7→ ψ(λ): =∑

j ψje−ijλ. It plays the same role as the one-dimensional transfer

function given in Theorem 6.11, whose proof goes through almost without changes.

6.40 Theorem. Let Xt be a multivariate stationary time series with spectral measureFX and let

∑

j ‖ψj‖ <∞. Then Yt =∑

j ψjXt−j has spectral measure FY given by

dFY (λ) = ψ(λ) dFX(λ)ψ(λ)T.

The right side of the display is to be understood as a product of three matrices. Ifthe series Xt possesses a spectral density fX , then the right side is understood to be the

matrix-valued measure with spectral density ψ fX ψT. In the general case the display

means to express a relationship between the spectral measures FY and FX , with thenotations dFY and dFX being only formal identifiers. One way of making the relationprecise is to use some density fX of FX relative to some scalar measure λ (not necessarily

Lebesgue measure). Then FY possesses density ψ fX ψTrelative to the same dominating

measure.The spectral decomposition also extends to vector-valued time series. In particular,

for a bivariate time series (Xt, Yt)T there exist versions ZX and ZY of their spectral

measures (i.e. Xt =∫

eitλ dZX(λ) eand Yt =∫

eitλ dZY (λ)) such that, for pair of everymeasurable sets B1, B2 ⊂ (−π, π],

cov(

ZX(B1), ZY (B2))

= FX,Y (B1 ∩B2).


* 6.6 Prediction in the Frequency Domain

As the spectrum encodes the complete second order structure of a stationary time series,it can be used to solve the (linear) prediction problem. The observed variablesX1, . . . , Xn

are represented by the functions e1, . . . , en, given by eh(λ) = eihλ, in the spectral domain.A random variable Y in the closed linear span L2(Xt: t ∈ Z) of the time series can berepresented as Y =

∫

f dZX for some function f ∈ L2(FX). By the spectral isometry,

E∣

∣

∣Y −

n∑

t=1

αtXt

∣

∣

∣

2

=

∫

∣

∣

∣f −

n∑

t=1

αtet

∣

∣

∣

2

dFX .

Thus finding the best linear predictor of Y is the same as projecting f onto the linearspan of e1, . . . , en in L2(FX).

If Y is not contained in L2(Xt: t ∈ Z), then we apply this argument to its projectiononto this space, the orthogonal part of it being completely unpredictable.

The projection can be computed by solving the orthogonality relationship of Theo-rem 2.11. It is specific to the spectral measure FX . Only in the special case that this isproportional to the uniform measure, this problem reduces to the usual one in Fourieranalysis (and the truncated Fourier series gives an upper bound on the prediction error).

7Law of Large Numbers

The law of large numbers is concerned with the convergence of the sequence Xn ofaverages to its mean value. For statistical applications, the central limit theorem is ofmore interest, as it gives also expression to the size of the error when estimating the meanby the sample average. In fact, the central limit theorem implies the law of large numbers,as, by Slutsky’s lemma Xn → µ in probability if the sequence

√n(Xn − µ) is uniformly

tight. However, the law of large numbers is valid under much weaker conditions. Theweakening not only concerns moments, but also the dependence between the Xt.

We present separate laws of large numbers for second order stationary and for strictlystationary sequences, respectively. The weak law for second order stationary time seriescan be fully characterized using the spectrum of the time series. The strong law oflarge numbers for a strictly stationary time series is a central result of ergodic theory.Actually both results show that the sequence Xn always converges to a limit. However,in general this limit may be random. For nonstationary time series mixing conditionsgive alternative criteria.

* 7.1 Stationary Sequences

A mean zero, stationary time series Xt can be expressed in its spectral random measureZX through Xt =

∫

eiλt dZX(λ). (See Section 6.4.) A simple calculation then gives thefollowing law of large numbers.

7.1 Theorem. If Xt is a mean zero, stationary time series, then Xn → ZX0 in secondmean, as n→ ∞. In particular, if FX0 = 0, then Xn

P→ 0.

Proof. For ZX the spectral measure of X, we write

Xn =1

n

n∑

t=1

∫

eitλ dZX(λ) =

∫

eiλ(1− eiλn)

n(1− eiλ)dZX(λ).

106 7: Law of Large Numbers

Here the integrand must be read as 1 if λ = 0. For all other λ ∈ (−π, π] the integrandconverges to zero as n→ ∞. It is bounded by 1 for every λ. Hence the integrand convergesin second mean to 10 in L2(FX). By the continuity of the integral f 7→

∫

f dZX , we

find that Xn converges in L2(Ω,U ,P) to∫

10 dZX = ZX0.The mean of the variable ZX0 is zero, and its variance is equal to FX0. If the

latter is zero, then ZX0 is degenerate at zero.

7.2 Ergodic Theorem

Given a strictly stationary sequence Xt, defined on some probability space (Ω,U ,P) andtaking values in some measurable space (X ,A), the invariant σ-field, denoted Uinv, isthe σ-field consisting of all sets A such that A = (. . . , Xt−1, Xt, Xt+1, . . .)

−1(B) for all tand some measurable set B ⊂ X∞. Here throughout this section the product space X∞is equipped with the product σ-field A∞.

Our notation in the definition of the invariant σ-field is awkward, if not unclear,because we look at two-sided infinite series. The triple Xt−1, Xt, Xt+1 in the definition ofA is meant to be centered at a fixed position in Z. We can write this down more preciselyusing the forward shift function S:X∞ → X∞ defined by S(x)i = xi+1. The two-sidedsequence (. . . , Xt−1, Xt, Xt+1, . . .) defines a map X: Ω → X∞. The invariant sets A arethe sets such that A = StX ∈ B for all t and some measurable set B ⊂ X∞. Thestrict stationarity of the sequence X is identical to the invariance of its induced law PX

on X∞ under the shift S.The inverse images X−1(B) of measurable sets B ⊂ X∞ with B = SB are clearly in-

variant. Conversely, it can be shown that, up to null sets, all invariant sets take this form.(See Exercise 7.3.) The symmetric events are special examples of invariant sets. Theyare the events that depend symmetrically on the variables Xt. For instance, ∩tX

−1t (B)

for some measurable set B ⊂ X .

* 7.2 EXERCISE. Call a set B ⊂ X∞ invariant under the shift S:X∞ → X∞ if B = SB.Call it almost invariant relative to a measure PX if PX(BSB) = 0. Show that a set Bis almost invariant if and only if there exists an invariant set B such that PX(BB) = 0.[Try B = ∩tS

tB.]

* 7.3 EXERCISE. Define the invariant σ-field Binv on X∞ as the collection of measurablesets that are invariant under the shift operation, and let Binv be its completion under

the measure PX . Show that X−1(Binv) ⊂ Uinv ⊂ X−1(Binv), where the long bar on theright denotes completion relative to P. [Note that X ∈ B = X ∈ SB implies thatPX(B SB) = 0. Use the preceding exercise to replace B by an invariant set B.]

7.4 Theorem (Birkhoff). If Xt is a strictly stationary time series with E|Xt| < ∞,then Xn → E(X0| Uinv) almost surely and in mean.

7.2: Ergodic Theorem 107

Proof. For a given α ∈ R define a set B =

x ∈ R∞: lim supn→∞ xn > α

. Because

xn+1 =x1n+ 1

+n

n+ 1

1

n

n+1∑

t=2

xt,

a point x is contained in B if and only if lim supn−1∑n+1

t=2 xt > α. Equivalently, x ∈ B ifand only if Sx ∈ B. Thus the set B is invariant under the shift operation S:R∞ → R

∞.We conclude from this that the variable lim supn→∞Xn is measurable relative to theinvariant σ-field.

Fix some measurable set B ⊂ R∞. For every invariant set A ∈ Uinv there exists a

measurable set C ⊂ R∞ such that A = StX ∈ C for every t. By the strict stationarity

of X,

P(

StX ∈ B ∩A)

= P(StX ∈ B,StX ∈ C) = P(X ∈ B,X ∈ C) = P(

X ∈ B ∩A)

.

This shows that P(StX ∈ B| Uinv) = P(X ∈ B| Uinv) almost surely. We conclude thatthe conditional laws of StX and X given the invariant σ-field are identical.

In particular, the conditional means E(Xt| Uinv) = E(X1| Uinv) are identical forevery t, almost surely. It also follows that a time series Zt of the type Zt = (Xt, R)for R: Ω → R a fixed Uinv-measurable variable (for instance with values in R = R

2) isstrictly stationary, the conditional law B 7→ P(X ∈ B|R) = E

(

P(X ∈ B| Uinv)|R)

ofits first marginal (on X∞) being strictly stationary by the preceding paragraph, and thesecond marginal (on R∞) being independent of t.

For the almost sure convergence of the sequence Xn it suffices to show that, forevery ε > 0, the event

A =

lim supn→∞

Xn > E(X1| Uinv) + ε

and a corresponding event for the lower tail have probably zero. By the preced-ing the event A is contained in the invariant σ-field. Furthermore, the time se-ries Yt =

(

Xt − E(X1| Uinv) − ε)

1A, being a fixed transformation of the time se-

ries Zt =(

Xt,E(X1| Uinv), 1A)

, is strictly stationary. We can write A = ∪nAn for

An = ∪nt=1Y t > 0. Then EY11An

→ EY11A by the dominated convergence theo-rem, in view of the assumption that Xt is integrable. If we can show that EY11An

≥ 0for every n, then we can conclude that

0 ≤ EY11A = E(

X1 − E(X1| Uinv))

1A − εP(A) = −εP(A),

because A ∈ Uinv. This implies that P(A) = 0, concluding the proof of almost sureconvergence.

The L1-convergence can next be proved by a truncation argument. We can first show,more generally, but by an identical argument, that n−1

∑nt=1 f(Xt) → E

(

f(X0)| Uinv

)

almost surely, for every measurable function f :X → R with E|f(Xt)| < ∞. We canapply this to the functions f(x) = x1|x|≤M for given M .


We complete the proof by showing that EY11An≥ 0 for every strictly stationary

time series Yt and every fixed n, and An = ∪nt=1Y t > 0. For every 2 ≤ j ≤ n,

Y1 + · · ·+ Yj ≤ Y1 +max(Y2, Y2 + Y3, · · · , Y2 + · · ·+ Yn+1).

If we add the number 0 in the maximum on the right, then this is also true for j = 1.We can rewrite the resulting n inequalities as the single inequality

Y1 ≥ max(Y1, Y1 + Y2, . . . , Y1 + · · ·+ Yn)−max(0, Y2, Y2 + Y3, · · · , Y2 + · · ·+ Yn+1).

The event An is precisely the event that the first of the two maxima on the right ispositive. Thus on this event the inequality remains true if we add also a zero to the firstmaximum. It follows that EY11An

is bounded below by

E(

max(0, Y1, Y1 + Y2, . . . , Y1 + · · ·+ Yn)−max(0, Y2, Y2 + Y3, · · · , Y2 + · · ·+ Yn+1))

1An.

Off the event An the first maximum is zero, whereas the second maximum is alwaysnonnegative. Thus the expression does not increase if we cancel the indicator 1An

. Theresulting expression is identically zero by the strict stationarity of the series Yt.

Thus a strong law is valid for every integrable strictly stationary sequence, with-out any further conditions on possible dependence of the variables. However, the limitE(X0| Uinv) in the preceding theorem will often be a true random variable. Only if theinvariant σ-field is trivial, we can be sure that the limit is degenerate. Here “trivial” maybe taken to mean that the invariant σ-field consists of sets of probability 0 or 1 only. Ifthis is the case, then the time series Xt is called ergodic.

* 7.5 EXERCISE. Suppose that Xt is strictly stationary. Show that Xt is ergodic if andonly if every sequence Yt = f(. . . , Xt−1, Xt, Xt+1, . . .) for a measurable map f that isintegrable satisfies the law of large numbers Y n → EY1 almost surely. [Given an invariantset A = (. . . , X−1, X0, X1, . . .)

−1(B) consider Yt = 1B(. . . , Xt−1, Xt, Xt+1, . . .). ThenY n = 1A.]

Checking that the invariant σ-field is trivial may be a nontrivial operation. Thereare other concepts that imply ergodicity and may be easier to verify. A time series Xt iscalled mixing if, for any measurable sets A and B, as h→ ∞,

P(

(. . . , Xh−1, Xh, Xh+1, . . .) ∈ A, (. . . , X−1, X0, X1, . . .) ∈ B)

− P(

(. . . , Xh−1, Xh, Xh+1, . . .) ∈ A)

P(

(. . . , X−1, X0, X1, . . .) ∈ B)

→ 0.

Every mixing time series is ergodic. This follows because if we take A = B equal to aninvariant set, the preceding display reads PX(A) − PX(A)PX(A) → 0, for PX the lawof the infinite series Xt, and hence PX(A) is 0 or 1.

The present type of mixing is related to the mixing concepts used to obtain centrallimit theorems, and is weaker.

7.2: Ergodic Theorem 109

7.6 Theorem. Any strictly stationary α-mixing time series is mixing.

Proof. For t-dimensional cylinder sets A and B in X∞ (i.e. sets that depend on finitelymany coordinates only) the mixing condition becomes

P(

(Xh, . . . Xt+h) ∈ A, (X0, . . . , Xt) ∈ B)

→ P(

(Xh, . . . Xt+h) ∈ A)

P(

(X0, . . . , Xt) ∈ B)

.

For h > t the absolute value of the difference of the two sides of the display is boundedby α(h− t) and hence converges to zero as h→ ∞, for each fixed t.

Thus the mixing condition is satisfied by the collection of all cylinder sets. Thiscollection is intersection-stable, i.e. a π-system, and generates the product σ-field onX∞. The proof is complete if we can show that the collections of sets A and B for whichthe mixing condition holds, for a given set B or A, is a σ-field. By the π-λ theorem itsuffices to show that these collections of sets are a λ-system.

The mixing property can be written as PX(S−hA ∩ B) − PX(A)PX(B) → 0, ash→ ∞. Because S is a bijection we have S−h(A2 −A1) = S−hA2 −S−hA1. If A1 ⊂ A2,then

PX(

S−h(A2 −A1) ∩B)

= PX(

S−hA2 ∩B)

− PX(

S−hA1 ∩B)

,

PX(A2 −A1)PX(B) = PX(A2)P

X(B)− PX(A1)PX(B).

If, for a given set B, the sets A1 and A2 satisfy the mixing condition, then the righthand sides are asymptotically the same, as h→ ∞, and hence so are the left sides. ThusA2 − A1 satisfies the mixing condition. If An ↑ A, then S−hAn ↑ S−hA as n → ∞ andhence

PX(S−hAn ∩B)− PX(An)PX(B) → PX(S−hA ∩B)− PX(A)PX(B).

The absolute difference of left and right sides is bounded above by 2|PX(An)−PX(A)|.Hence the convergence in the display is uniform in h. If every of the sets An satisfies themixing condition, for a given set B, then so does A. Thus the collection of all sets A thatsatisfies the condition, for a given B, is a λ-system.

We can prove similarly, but more easily, that the collection of all sets B is also aλ-system.

7.7 Theorem. Any strictly stationary time series Xt with trivial tail σ-field is mixing.

Proof. The tail σ-field is defined as ∩h∈Zσ(Xh, Xh+1, . . .).As in the proof of the preceding theorem we need to verify the mixing condition

only for finite cylinder sets A and B. We can write∣

∣

∣E1Xh,...,Xt+h∈A

(

1X0,...,Xt∈B − P(X0, . . . , Xt ∈ B))

∣

∣

∣

=∣

∣

∣E1Xh,...,Xt+h∈A(

P(X0, . . . , Xt ∈ B|Xh, Xh+1, . . .)− P(X0, . . . , Xt ∈ B))

∣

∣

∣

≤ E∣

∣

∣P(X0, . . . , Xt ∈ B|Xh, Xh+1, . . .)− P(X0, . . . , Xt ∈ B))

∣

∣

∣.

For every integrable variable Y the sequence E(Y |Xh, Xh+1, . . .) converges in L1 to theconditional expectation of Y given the tail σ-field, as h → ∞. Because the tail σ-field


is trivial, in the present case this is EY . Thus the right side of the preceding displayconverges to zero as h→ ∞.

* 7.8 EXERCISE. Show that a strictly stationary time series Xt is ergodic if and only ifn−1

∑nh=1 P

X(S−hA ∩B) → PX(A)PX(B), as n→ ∞, for every measurable subsets Aand B of X∞. [Use the ergodic theorem on the stationary time series Yt = 1StX∈A tosee that n−1

∑

1X∈S−tA1B → PX(A)1B for the proof in one direction.]

* 7.9 EXERCISE. Show that a strictly stationary time series Xt is ergodic if and onlyif the one-sided time series X0, X1, X2, . . . is ergodic, in the sense that the “one-sidedinvariant σ-field”, consisting of all sets A such that A = (Xt, Xt+1, . . .)

−1(B) for somemeasurable set B and every t ≥ 0, is trivial. [Use the preceding exercise.]

The preceding theorems can be used as starting points to construct ergodic se-quences. For instance, every i.i.d. sequence is ergodic by the preceding theorems, be-cause its tail σ-field is trivial by Kolmogorov’s 0-1 law, or because it is α-mixing. Toconstruct more examples we can combine the theorems with the following stability prop-erty. From a given ergodic sequence Xt we construct a process Yt by transforming thevector (. . . , Xt−1, Xt, Xt+1, . . .) with a given map f from the product space X∞ intosome measurable space (Y,B). As before, the Xt in (. . . , Xt−1, Xt, Xt+1, . . .) is meant tobe at a fixed 0th position in Z, so that the different variables Yt are obtained by slidingthe function f along the sequence (. . . , Xt−1, Xt, Xt+1, . . .).

7.10 Theorem. The sequence Yt = f(. . . , Xt−1, Xt, Xt+1, . . .) obtained by applicationof a measurable map f :X∞ → Y to an ergodic sequence Xt is ergodic.

Proof. Define f :X∞ → Y∞ by f(x) =(

· · · , f(S−1x), f(x), f(Sx), · · ·)

, for S the forward

shift on X∞. Then Y = f(X) and S′f(x) = f(Sx), if S′ is the forward shift on Y∞.Consequently (S′)tY = f(StX). If A = (S′)tY ∈ B is invariant for the series Yt, then

A = f(StX) ∈ B = StX ∈ f−1

(B) for every t, and hence A is also invariant for theseries Xt.

7.11 EXERCISE. Let Zt be an i.i.d. sequence of integrable variables and let Xt =∑

j ψjZt−j for a sequence ψj such that∑

j |ψj | < ∞. Show that Xt satisfies the lawof large numbers (with degenerate limit).

7.12 EXERCISE. Show that the GARCH(1, 1) process defined in Example 1.10 is er-godic.

7.13 Example (Discrete Markov chains). Every stationary irreducible Markov chainon a countable state space is ergodic. Conversely, a stationary reducible Markov chainon a countable state space whose initial (or marginal) law is positive everywhere isnonergodic.

7.3: Mixing 111

To prove the ergodicity note that a stationary irreducible Markov chain is (pos-itively) recurrent (e.g. Durrett, p266). If A is an invariant set of the form A =(X0, X1, . . .)

−1(B), then A ∈ σ(Xh, Xh−1, . . .) for all h and hence

1A = P(A|Xh, Xh−1, . . .) = P(

(Xh+1, Xh+2, . . .) ∈ B|Xh, Xh−1, . . .)

= P(

(Xh+1, Xh+2, . . .) ∈ B|Xh

)

.

We can write the right side as g(Xh) for the function g(x) = P(A|X−1 = x)

. Byrecurrence, for almost every ω in the underlying probability space, the right side runsinfinitely often through every of the numbers g(x) with x in the state space. Becausethe left side is 0 or 1 for a fixed ω, the function g and hence 1A must be constant. Thusevery invariant set of this type is trivial, showing the ergodicity of the one-sided sequenceX0, X1, . . .. It can be shown that one-sided and two-sided ergodicity are the same. (Cf.Exercise 7.9.)

Conversely, if the Markov chain is reducible, then the state space can be split intotwo sets X1 and X2 such that the chain will remain in X1 or X2 once it enters there. If theinitial distribution puts positive mass everywhere, then each of the two possibilities occurswith positive probability. The sets Ai = X0 ∈ Xi are then invariant and nontrivial andhence the chain is not ergodic.

It can also be shown that a stationary irreducible Markov chain is mixing if and onlyif it is aperiodic. (See e.g. Durrett, p310.) Furthermore, the tail σ-field of any irreduciblestationary aperiodic Markov chain is trivial. (See e.g. Durrett, p279.)

7.3 Mixing

In the preceding section it was seen that an α-mixing, strictly stationary time series isergodic and hence satisfies the law of large numbers if it is integrable. In this section weextend the law of large numbers to possibly nonstationary α-mixing time series.

The key is the bound on the tails of the distribution of the sample mean given inthe following lemma.

7.14 Lemma. For any mean zero time series Xt with α-mixing numbers α(h), everyx > 0 and every h ∈ N, with Qt = F−1|Xt|,

P(Xn ≥ 2x) ≤ 2

nx2

∫ 1

0

(

α−1(u) ∧ h) 1

n

n∑

t=1

Q2t (1− u) du+

2

x

∫ α(h)

0

1

n

n∑

t=1

Qt(1− u) du.

Proof. The quantile function of the variable |Xt|/(xn) is equal to u 7→ F−1|Xt|(u)/(nx).Therefore, by a rescaling argument we can see that it suffices to bound the probabilityP(∑n

t=1Xt ≥ 2)

by the right side of the lemma, but with the factors 2/(nx2) and 2/xreplaced by 2 and the factor n−1 in front of

∑

Q2t and

∑

Qt dropped. For ease of notationset S0 = 0 and Sn =

∑nt=1Xt.


Define the function g:R → R to be 0 on the interval (−∞, 0], to be x 7→ 12x

2 on[0, 1], to be x 7→ 1 − 1

2 (x − 2)2 on [1, 2], and to be 1 on [2,∞). Then g is continuouslydifferentiable with uniformly Lipschitz derivative. By Taylor’s theorem it follows that∣

∣g(x) − g(y) − g′(x)(x − y)∣

∣ ≤ 12 |x − y|2 for every x, y ∈ R. Because 1[2,∞) ≤ g and

St − St−1 = Xt,

P(Sn ≥ 2) ≤ Eg(Sn) =

n∑

t=1

E(

g(St)− g(St−1))

≤n∑

t=1

E∣

∣g′(St−1)Xt

∣

∣+ 12

n∑

t=1

EX2t .

The last term on the right can be written 12

∑nt=1

∫ 1

0Q2

t (1− u) du, which is bounded by∑n

t=1

∫ α(0)

0Q2

t (1− u) du, because α(0) = 12 and u 7→ Qt(1− u) is decreasing.

For i ≥ 1 the variable g′(St−i)−g′(St−i−1) is measurable relative to σ(Xs: s ≤ t− i)and is bounded in absolute value by |Xt−i|. Therefore, Lemma 4.12 yields the inequality

∣

∣E(

g′(St−i)− g′(St−i−1))

Xt

∣

∣ ≤ 2

∫ α(i)

0

Qt−i(1− u)Qt(1− u) du.

For t ≤ h we can write g′(St−1) =∑t−1

i=1

(

g′(St−i)− g(St−i−1))

. Substituting this in theleft side of the following display and applying the preceding display, we find that

h∑

t=1

E∣

∣g′(St−1)Xt

∣

∣ ≤ 2

h∑

t=1

t−1∑

i=1

∫ α(i)

0


For t > h we can write g′(St−1) = g′(St−h) +∑h−1

i=1

(

g′(St−i)− g(St−i−1))

. By a similarargument, this time also using that the function |g′| is uniformly bounded by 1, we find

n∑

t=h+1

E∣

∣g′(St−1)Xt

∣

∣ ≤ 2

∫ α(h)

0

Qt(1− u) du+2n∑

t=h+1

h−1∑

i=1

∫ α(i)

0


Combining the preceding displays we obtain that P(Sn ≥ 2) is bounded above by

2

∫ α(h)

0

Qt(1− u) du+ 2

n∑

t=1

t∧h−1∑

i=1

∫ α(i)

0

Qt−i(1− u)Qt(1− u) du+ 12

n∑

t=1

EX2t .

In the second term we can bound 2Qt−i(1−u)Qt(1−u) by Q2t−i(1−u)+Q2

t (1−u) and

next change the order of summation to∑h−1

i=1

∑nt=i+1. Because

∑nt=i+1(Q

2t−i + Q2

t ) ≤2∑n

t=1Q2t this term is bounded by 2

∑h−1i=1

∫ α(i)

0

∑nt=1Q

2t (1 − u) du. Together with the

third term on the right this gives rise to by the first sum on the right of the lemma, as∑h−1

i=0 1u≤α(i) = α−1(u) ∧ h.

7.15 Theorem. For each n let the time series (Xn,t: t ∈ Z) be mixing with mixingcoefficients αn(h). If supn αn(h) → 0 as h → ∞ and (Xn,t: t ∈ Z, n ∈ N) is uniformlyintegrable, then the sequence Xn − EXn converges to zero in probability.

Proof. By the assumption of uniform integrability n−1∑n

t=1 E|Xn,t|1|Xn,t|>M → 0 asM → ∞ uniformly in n. Therefore we can assume without loss of generality that Xn,t is

7.4: Subadditive Ergodic Theorem 113

bounded in absolute value by a constant M . We can also assume that it is centered atmean zero.

Then the quantile function of |Xn,t| is bounded by M and the preceding lemmayields the bound

P(|Xn| ≥ 2ε) ≤ 4hM2

nε2+

4M

εsupnαn(h).

This converges to zero as n→ ∞ followed by h→ ∞.

* 7.16 EXERCISE. Relax the conditions in the preceding theorem to, for every ε > 0:

n−1n∑

t=1

E|Xn,t|1|Xn,t|>εn∧F−1|Xn,t|

(1−αn(h))→ 0.

[Hint: truncate at the level nεn and note that EX1X>M =∫ 1

0Q(1−u)1Q(1−u)>M du for

Q(u) = F−1X (u).]

7.4 Subadditive Ergodic Theorem

Kingman’s subadditive theorem can be considered an extension of the ergodic theoremthat gives the almost sure convergence of more general functions of a strictly stationarysequence than the consecutive means. Given a strictly stationary time series Xt withvalues in some measurable space (X ,A) and defined on some probability space (Ω,U ,P),write X for the induced map (. . . , X−1, X0, X1, . . .): Ω → X∞, and let S:X∞ → X∞ bethe forward shift function (all as before). A family (Tn:n ∈ N) of maps Tn:X∞ → R iscalled subadditive if, for every m,n ∈ N,

Tm+n(X) ≤ Tm(X) + Tn(SmX).

7.17 Theorem (Kingman). If X is strictly stationary with invariant σ-field Uinv andthe maps (Tn:n ∈ N) are subadditive with finite means ETn(X), then Tn(X)/n → γ: =infn n

−1E(

Tn(X)| Uinv

)

almost surely. Furthermore, the limit γ satisfies Eγ > −∞ if andonly if infn ETn(X)/n > −∞ and in that case the convergence Tn(X) → γ takes alsoplace in mean.

Proof. See e.g. Dudley (1987), Theorem 10.7.1.

Because the maps Tn(X) =∑n

t=1Xt are subadditive (in fact additive, with equalityrather than inequality), the “ordinary” ergodic theorem by Birkhoff is a special caseof Kingman’s theorem. If the time series Xt is ergodic, then the limit γ in Kingman’stheorem is equal to the number γ = infn n

−1ETn(X) in [−∞,∞).


7.18 EXERCISE. Show that the sequence of normalized means n−1ETn(X) of a sub-additive map tend to infn n

−1ETn(X). [Hint: along the subsequence n = 2j they aredecreasing.]

7.19 EXERCISE. Let Xt be a time series with values in the collection of (d×d) matrices.Show that the maps defined by Tn(X) = log ‖X−1 · · ·X−n‖ are subadditive.

7.20 EXERCISE. Show that Kingman’s theorem remains true if the forward shift oper-ator in the definition of subadditivity is replaced by the backward shift operator. [Hint:If R:X∞ → X∞ is the reflection (RX)t = X−t, then RSmX = BmRX. Reverse timeby considering Tn(RX).]

8ARIMA Processes

For many years ARIMA processes were the work horses of time series analysis, the “sta-tistical analysis of time series” being almost identical to fitting an appropriate ARIMAprocess. This important class of time series models are defined through linear relationsbetween the observations and noise factors.

8.1 Backshift Calculus

To simplify notation we define the backshift operator B through

BXt = Xt−1, BkXt = Xt−k.

This is viewed as operating on a complete time series Xt, transforming this into a newseries by a time shift. Even though we use the word “operator”, we shall use B only asa notational device. In particular, BYt = Yt−1 for any other time series Yt.

♯

For a given polynomial ψ(z) =∑

j ψjzj we also abbreviate

ψ(B)Xt =∑

j

ψjXt−j .

If the series on the right is well defined, then we even use this notation for infinite Laurentseries

∑∞j=−∞ ψjz

j . Then ψ(B)Xt is simply a short-hand notation for the (infinite)linear filters that we encountered before. By Lemma 1.29 the filtered time series ψ(B)Xt

♯ Be aware of the dangers of this notation. For instance, if Yt = X−t, then BYt = Yt−1 = X−(t−1). This isprobably the intended meaning. We could also argue that BYt = BX−t = X−t−1. This is something else. Suchinconsistencies can be avoided by viewing B as acting on a full time series rather than individual variables;or alternatively by defining B as a true operator, for instance a linear operator acting on the linear span of agiven time series. In the second case B may be different for different time series; e.g. the two given examplesare resolved by writing BY Yt = Yt−1 and BXYt = BXX−t = X−t−1.

116 8: ARIMA Processes

is certainly well defined if∑

j |ψj | < ∞ and supt E|Xt| < ∞, in which case the seriesconverges both almost surely and in mean.

If∑

j |ψj | < ∞, then the Laurent series∑

j ψjzj converges absolutely on the unit

circle S1 =

z ∈ C: |z| = 1

in the complex plane and hence defines a function ψ:S1 → C.In turn, the values of this function determine the coefficients. (See the next exercise.)Given two of such series or functions ψ1(z) =

∑

j ψ1,jzj and ψ2(z) =

∑

j ψ2,jzj , the

product ψ(z) = ψ1(z)ψ2(z) is a well-defined function on (at least) the unit circle. Bychanging the summation indices (which is permitted in view of absolute convergence)this can be written as

ψ(z) = ψ1(z)ψ2(z) =∑

j

ψjzj , ψk =

∑

j

ψ1,jψ2,k−j .

The coefficients ψj are called the convolutions of the coefficients ψ1,j and ψ2,j . Underthe condition that

∑

j |ψi,j | < ∞ for i = 1, 2, the Laurent series∑

k ψkzk converges

absolutely at least on the unit circle:∑

k |ψk| <∞.

8.1 EXERCISE. Show that(i) If ψ(z) =

∑

j ψjzj , for an absolutely summable sequence (ψj), then ψj =

(2πi)−1∫

S1 ψ(z)z−1−j dz.

(ii) The convolution (ψj) of two sequences (ψi,j) of filter coefficients (i = 1, 2) satisfies∑

k |ψk| ≤∑

j |ψ1,j |∑

j |ψ2,j |.

Having defined the function ψ(z) and verified that it has an absolutely convergentLaurent series representation on the unit circle, we can now also define the time seriesψ(B)Xt. The following lemma shows that the convolution formula remains valid if z isreplaced by B, at least when applied to time series that are bounded in L1.

8.2 Lemma. If both∑

j |ψ1,j | < ∞ and∑

j |ψ2,j | < ∞, then, for every time series Xt

with supt E|Xt| <∞,

ψ(B)Xt = ψ1(B)[

ψ2(B)Xt

]

, a.s..

Proof. The right side is to be read as ψ1(B)Yt for Yt = ψ2(B)Xt. The variable Yt iswell defined almost surely by Lemma 1.29, because

∑

j |ψ2,j | <∞ and supt E|Xt| <∞.Furthermore,

supt

E|Yt| = supt

E∣

∣

∣

∑

j

ψ2,jXt−j∣

∣

∣ ≤∑

j

|ψ2,j | supt

E|Xt| <∞.

Thus the time series ψ1(B)Yt is also well defined by Lemma 1.29. Now

E∑

i

∑

j

|ψ1,i||ψ2,j ||Xt−i−j | ≤ supt

E|Xt|∑

i

|ψ1,i|∑

j

|ψ2,j | <∞.

8.2: ARMA Processes 117

This implies that the double series∑

i

∑

j ψ1,iψ2,jXt−i−j converges absolutely, and henceunconditionally, almost surely. The latter means that we may sum the terms in an arbi-trary order. In particular, by the change of variables (i, j) 7→ (i = l, i+ j = k),

∑

i

ψ1,i

(

∑

j

ψ2,jXt−i−j)

=∑

k

(

∑

l

ψ1,lψ2,k−l)

Xt−k, a.s..

This is the assertion of the lemma, with ψ1(B)[

ψ2(B)Xt

]

on the left side, and ψ(B)Xt

on the right.

The lemma implies that the “operators” ψ1(B) and ψ2(B) “commute”. From nowon we omit the square brackets in ψ1(B)

[

ψ2(B)Xt

]

.

8.3 EXERCISE. Verify that the lemma remains valid for any sequences ψ1 and ψ2 with∑

j |ψi,j | < ∞ and every process Xt such that∑

i

∑

j |ψ1,i||ψ2,j ||Xt−i−j | < ∞ almostsurely. In particular, conclude that ψ1(B)ψ2(B)Xt = (ψ1ψ2)(B)Xt for any polynomialsψ1 and ψ2 and every time series Xt.

8.2 ARMA Processes

Linear regression models attempt to explain a variable by the sum of a linear function ofexplanatory variables and a noise variable. ARMA processes are a time series version oflinear regression, where the explanatory variables are the past values of the time seriesitself and the added noise is a moving average process.

8.4 Definition. A time series Xt is an ARMA(p, q)-process if there exist polynomials φand θ of degrees p and q, respectively, and a white noise series Zt such that φ(B)Xt =θ(B)Zt.

The equation φ(B)Xt = θ(B)Zt is to be understood as “pointwise almost surely”on the underlying probability space: the random variables Xt and Zt are defined on aprobability space (Ω,U ,P) and satisfy φ(B)Xt(ω) = θ(B)Zt(ω) for almost every ω ∈ Ω.

Warning. Some authors require by definition that an ARMA process be stationary.This is one way of making the solution to the ARMA equation unique. Some authors failto recognize that the ARMA equation by itself does not define a unique process, withoutinitialization or a side constraint. Many authors occasionally forget to say explicitly thatthey are concerned with a stationary ARMA process.

The polynomials are often† written in the forms φ(z) = 1− φ1z− φ2z2 − · · · − φpz

p

and θ(z) = 1 + θ1z + · · ·+ θqzq. Then the equation φ(B)Xt = θ(B)Zt takes the form

Xt = φ1Xt−1 + φ2Xt−2 + · · ·+ φpXt−p + Zt + θ1Zt−1 + · · ·+ θqZt−q.

† A notable exception is the Splus package. Its makers appear to have overdone the cleverness of includingminus-signs in the coefficients of φ and have included them in the coefficients of θ also.


In other words: the value of the time series Xt at time t is the sum of a linear regressionon its own past and of a moving average. An ARMA(p, 0)-process is also called an auto-regressive process and denoted AR(p); an ARMA(0, q)-process is also called a movingaverage process and denoted MA(q). Thus an auto-regressive process is a solution Xt tothe equation φ(B)Xt = Zt, and a moving average process is explicitly given by Xt =θ(B)Zt.

8.5 EXERCISE. Why is it not a loss of generality to assume φ0 = θ0 = 1?

We next investigate for which pairs of polynomials φ and θ there exists a corre-sponding stationary ARMA-process. For given polynomials φ and θ and a white noiseseries Zt, there are always many time series Xt that satisfy the ARMA equation, butnone of these may be stationary. If there exists a stationary solution, then we are alsointerested in knowing whether this is uniquely determined by the pair (φ, θ) and/or thewhite noise series Zt, and in what way it depends on the series Zt.

8.6 Example. The polynomial φ(z) = 1 − φz leads to the auto-regressive equationXt = φXt−1 + Zt. In Example 1.8 we have seen that a stationary solution exists if andonly if |φ| 6= 1.

8.7 EXERCISE. Let arbitrary polynomials φ and θ, a white noise sequence Zt and vari-ables X1, . . . , Xp be given. Show that there exists a time series Xt that satisfies theequation φ(B)Xt = θ(B)Zt and coincides with the given X1, . . . , Xp at times 1, . . . , p.What does this imply about existence of solutions if only the Zt and the polynomials φand θ are given?

In the following theorem we shall see that a stationary solution to the ARMA-equation exists if the polynomial z 7→ φ(z) has no roots on the unit circle S1 =

z ∈C: |z| = 1

. To prove this, we need some facts from complex analysis. The function

ψ(z) =θ(z)

φ(z)

is well defined and analytic on the region

z ∈ C:φ(z) 6= 0

. If φ has no roots on the unit

circle S1, then, since it has at most p different roots, there is an annulus

z: r < |z| < R

with r < 1 < R on which it has no roots. On this annulus ψ is an analytic function, andit has a Laurent series representation

ψ(z) =∞∑

j=−∞ψjz

j .

This series is uniformly and absolutely convergent on every compact subset of the annu-lus, and the coefficients ψj are uniquely determined by the values of ψ on the annulus.In particular, because the unit circle is inside the annulus, we obtain that

∑

j |ψj | <∞.

8.2: ARMA Processes 119

Then we know from Lemma 1.29 that ψ(B)Zt is a well defined, stationary timeseries. By the following theorem it is the unique stationary solution to the ARMA-equation. Here the white noise series Zt and the probability space on which it is definedare considered given. (In analogy with the terminology of stochastic analysis, the latterexpresses that there exists a unique strong solution.)

8.8 Theorem. Let φ and θ be polynomials such that φ has no roots on the complexunit circle, and let Zt be a white noise process. Define ψ = θ/φ. Then Xt = ψ(B)Zt

is the unique stationary solution to the equation φ(B)Xt = θ(B)Zt. It is also the onlysolution that is bounded in L1.

Proof. By the rules of calculus justified by Lemma 8.2, φ(B)ψ(B)Zt = θ(B)Zt, becauseφ(z)ψ(z) = θ(z) on an annulus around the unit circle, the series

∑

j |φj | and∑

j |ψj |converge and the time series Zt is bounded in absolute mean. This proves that ψ(B)Zt

is a solution to the ARMA-equation. It is stationary by Lemma 1.29.Let Xt be an arbitrary solution to the ARMA equation that is bounded in L1,

for instance a stationary solution. The function φ(z) = 1/φ(z) is analytic on an an-nulus around the unit circle and hence possesses a unique Laurent series representa-tion φ(z) =

∑

j φjzj . Because

∑

j |φj | < ∞, the infinite series φ(B)Yt is well definedfor every stationary time series Yt by Lemma 1.29. By the calculus of Lemma 8.2φ(B)φ(B)Xt = Xt almost surely, because φ(z)φ(z) = 1, the filter coefficients aresummable and the time series Xt is bounded in absolute mean. Therefore, the equationφ(B)Xt = θ(B)Zt implies, after multiplying by φ(B), that Xt = φ(B)θ(B)Zt = ψ(B)Zt,again by the calculus of Lemma 8.2, because φ(z)θ(z) = ψ(z). This proves that ψ(B)Zt

is the unique stationary solution to the ARMA-equation.

8.9 EXERCISE. It is certainly not true that ψ(B)Zt is the only solution to the ARMA-equation. Can you trace where in the preceding proof we use the required stationarity ofthe solution? Would you agree that the “calculus” of Lemma 8.2 is perhaps more subtlethan it appeared to be at first?

Thus the condition that φ has no roots on the unit circle is sufficient for the existenceof a stationary solution. It is almost necessary. The only point is that it is really thequotient θ/φ that counts, not the function φ on its own. If φ has a zero on the unit circleof the same or smaller multiplicity as θ, then this quotient is still a nice function. Oncethis possibility is excluded, there can be no stationary solution if φ(z) = 0 for some zwith |z| = 1.

8.10 Theorem. Let φ and θ be polynomials such that φ has a root on the unit circlethat is not a root of θ, and let Zt be a white noise process. Then there exists no stationarysolution Xt to the equation φ(B)Xt = θ(B)Zt.

Proof. Suppose that the contrary is true and let Xt be a stationary solution. Then Xt

has a spectral distribution FX , and hence so does the time series φ(B)Xt = θ(B)Zt. By


Theorem 6.11 and Example 6.6 we must have

∣

∣φ(e−iλ)∣

∣

2dFX(λ) =

∣

∣θ(e−iλ)∣

∣

2 σ2

2πdλ.

Now suppose that φ(e−iλ0) = 0 and θ(e−iλ0) 6= 0 for some λ0 ∈ (−π, π]. The precedingdisplay is just an equation between densities of measures and should not be interpreted asbeing valid for every λ, so we cannot immediately conclude that there is a contradiction.By differentiability of φ and continuity of θ there exist positive numbers A and B and aneighbourhood of λ0 on which both

∣

∣φ(e−iλ)∣

∣ ≤ A|λ−λ0| and∣

∣θ(e−iλ)∣

∣ ≥ B. Combiningthis with the preceding display, we see that, for all sufficiently small ε > 0,

∫ λ0+ε

λ0−εA2|λ− λ0|2 dFX(λ) ≥

∫ (λ0+ε)∨(−π)

(λ0−ε)∧πB2 σ

2

2πdλ.

The left side is bounded above by A2ε2FX [λ0 − ε, λ0 + ε], whereas the right side isbounded below by B2σ2ε/(2π), for small ε. This shows that FX [λ0 − ε, λ0 + ε] → ∞ asε→ 0 and contradicts the fact that FX is a finite measure.

8.11 Example. The AR(1)-equation Xt = φXt−1 + Zt corresponds to the polynomialφ(z) = 1 − φz. This has root φ−1. Therefore a stationary solution exists if and only if|φ−1| 6= 1. In the latter case, the Laurent series expansion of ψ(z) = 1/(1 − φz) aroundthe unit circle is given by ψ(z) =

∑∞j=0 φ

jzj for |φ| < 1 and is given by −∑∞j=1 φ−jz−j

for |φ| > 1. Consequently, the unique stationary solutions in these cases are given by

Xt =

∑∞j=0 φ

jZt−j , if |φ| < 1,

−∑∞j=11φjZt+j , if |φ| > 1.

This is in agreement, of course, with Example 1.8.

8.12 EXERCISE. Investigate the existence of stationary solutions to:(i) Xt =

12Xt−1 +

12Xt−2 + Zt;

(ii) Xt =12Xt−1 +

14Xt−2 + Zt +

12Zt−1 +

14Zt−2.

Warning. Some authors mistakenly believe that stationarity requires that φ has noroots inside the unit circle.

If given time series Xt and Zt satisfy the ARMA-equation φ(B)Xt = θ(B)Zt, thenthey also satisfy r(B)φ(B)Xt = r(B)θ(B)Zt, for any polynomial r. From observed dataXt it is impossible to determine whether (φ, θ) or (rφ, rθ) are the “right” polynomials.To avoid this problem of indeterminacy, we assume from now on that the ARMA-modelis always written in its simplest form. This is when φ and θ do not have common factors(are relatively prime in the algebraic sense), or equivalently, when φ and θ do not havecommon (complex) roots. Then, in view of the preceding theorems, a stationary solutionXt to the ARMA-equation exists if and only if φ has no roots on the unit circle, and thisis uniquely given by

Xt = ψ(B)Zt =∑

j

ψjZt−j , ψ =θ

φ.

8.3: Invertibility 121

8.13 Definition. An ARMA-process Xt is called causal if, in the preceding representa-tion, the filter is causal: i.e. ψj = 0 for every j < 0.

Thus a causal ARMA-processXt depends on the present and past values Zt, Zt−1, . . .of the noise sequence only. Intuitively, this is a desirable situation, if time is really timeand Zt is really attached to time t. We come back to this in Section 8.6.

A mathematically equivalent definition of causality is that the function ψ(z) is an-alytic in a neighbourhood of the unit disc

z ∈ C: |z| ≤ 1

. This follows, because theLaurent series

∑∞j=−∞ ψjz

j is analytic inside the unit disc if and only if the negativepowers of z do not occur. Still another description of causality is that all roots of φ areoutside the unit circle, because only then is the function ψ = θ/φ analytic on the unitdisc.

The proof of Theorem 8.8 does not use that Zt is a white noise process, but onlythat the series Zt is bounded in L1. Therefore, the same arguments can be used to invertthe ARMA-equation in the other direction. If θ has no roots on the unit circle and Xt isstationary, then φ(B)Xt = θ(B)Zt implies that

Zt = π(B)Xt =∑

j

πjXt−j , π =φ

θ.

8.14 Definition. An ARMA-process Xt is called invertible if, in the preceding repre-sentation, the filter is causal: i.e. πj = 0 for every j < 0.

Equivalent mathematical definitions are that π(z) is an analytic function on the unitdisc or that θ has all its roots outside the unit circle. In the definition of invertibility weimplicitly assume that θ has no roots on the unit circle. The general situation is moretechnical and is discussed in the next section.

* 8.3 Invertibility

In this section we discuss the proper definition of invertibility in the case that θ hasroots on the unit circle. The intended meaning of “invertibility” is that every Zt can bewritten as a linear function of the Xs that are prior or simultaneous to t (i.e. s ≤ t).Two reasonable ways to make this precise are:(i) Zt =

∑∞j=0 πjXt−j for a sequence πj such that

∑∞j=0 |πj | <∞.

(ii) Zt is contained in the closed linear span of Xt, Xt−1, Xt−2, . . . in L2(Ω,U ,P).In both cases we require that Zt depends linearly on the prior Xs, but the secondrequirement is weaker. It turns out that if Xt is a stationary ARMA process relative toZt and (i) holds, then the polynomial θ cannot have roots on the unit circle. In that casethe definition of invertibility given in the preceding section is appropriate (and equivalentto (i)). However, the requirement (ii) does not exclude the possibility that θ has zeroson the unit circle. An ARMA process is invertible in the sense of (ii) as soon as θ doesnot have roots inside the unit circle.


8.15 Lemma. Let Xt be a stationary ARMA process satisfying φ(B)Xt = θ(B)Zt forpolynomials φ and θ that are relatively prime.(i) Then Zt =

∑∞j=0 πjXt−j for a sequence πj such that

∑∞j=0 |πj | < ∞ if and only if

θ has no roots on or inside the unit circle.(ii) If θ has no roots inside the unit circle, then Zt is contained in the closed linear span

of Xt, Xt−1, Xt−2, . . ..

Proof. (i). If θ has no roots on or inside the unit circle, then the ARMA process isinvertible by the arguments given previously. We must argue the other direction. If Zt

has the given given reprentation, then consideration of the spectral measures gives

σ2

2πdλ = dFZ(λ) =

∣

∣π(e−iλ)∣

∣

2dFX(λ) =

∣

∣π(e−iλ)∣

∣

2

∣

∣θ(e−iλ)∣

∣

2

∣

∣φ(e−iλ)∣

∣

2

σ2

2πdλ.

Hence∣

∣π(e−iλ)θ(e−iλ)∣

∣ =∣

∣φ(e−iλ)∣

∣ Lebesgue almost everywhere. If∑

j |πj | < ∞, then

the function λ 7→ π(e−iλ) is continuous, as are the functions φ and θ, and hence thisequality must hold for every λ. Since φ(z) has no roots on the unit circle, nor can θ(z).

(ii). Suppose that ζ−1 is a zero of θ, so that |ζ| ≤ 1 and θ(z) = (1 − ζz)θ1(z)for a polynomial θ1 of degree q − 1. Define Yt = φ(B)Xt and Vt = θ1(B)Zt, whenceYt = Vt − ζVt−1. It follows that

k−1∑

j=0

ζjYt−j =k−1∑

j=0

ζj(Vt−j − ζVt−j−1) = Vt − ζkVt−k.

If |ζ| < 1, then the right side converges to Vt in quadratic mean as k → ∞ and henceit follows that Vt is contained in the closed linear span of Yt, Yt−1, . . ., which is clearlycontained in the closed linear span of Xt, Xt−1, . . ., because Yt = φ(B)Xt. If q = 1, thenVt and Zt are equal up to a constant and the proof is complete. If q > 1, then we repeatthe argument with θ1 instead of θ and Vt in the place of Yt and we shall be finished afterfinitely many recursions.

If |ζ| = 1, then the right side of the preceding display still converges to Vt as k → ∞,but only in the weak sense that E(Vt − ζkVt−k)W → EVtW for every square integrablevariable W . This implies that Vt is in the weak closure of lin (Yt, Yt−1, . . .), but this isequal to the strong closure by an application of the Hahn-Banach theorem. Thus wearrive at the same conclusion.

To see the weak convergence, note first that the projection of W onto the closed lin-ear span of (Zt: t ∈ Z) is given by

∑

j ψjZj for some sequence ψj with∑

j |ψj |2 <∞. Because Vt−k ∈ lin (Zs: s ≤ t − k), we have |EVt−kW | = |∑j ψjEVt−kZj | ≤∑

j≤t−k |ψj | sdV0 sdZ0 → 0 as k → ∞.

8.16 Example. The moving average Xt = Zt − Zt−1 is invertible in the sense of (ii),but not in the sense of (i). The moving average Xt = Zt − 1.01Zt−1 is not invertible.

Thus Xt = Zt − Zt−1 implies that Zt ∈ lin (Xt, Xt−1, . . .). An unexpected phe-nomenon is that it is also true that Zt is contained in lin (Xt+1, Xt+2, . . .). This followsby time reversal: define Ut = X−t+1 and Wt = −Z−t and apply the preceding to the

8.4: Prediction 123

processes Ut =Wt−Wt−1. Thus it appears that the “opposite” of invertibility is true aswell!

8.17 EXERCISE. Suppose that Xt = θ(B)Zt for a polynomial θ of degree q that hasall its roots on the unit circle. Show that Zt ∈ lin (Xt+q, Xt+q+1, . . .). [As in (ii) of the

preceding proof, it follows that Vt = ζ−k(Vt+k −∑k−1

j=0 ζjXt+k+j). Here the first term on

the right side converges weakly to zero as k → ∞.]

8.4 Prediction

As to be expected from their definitions, causality and invertibility are important for cal-culating predictions for ARMA processes. For a causal and invertible stationary ARMAprocess Xt satisfying φ(B)Xt = θ(B)Zt, we have

Xt ∈ lin (Zt, Zt−1, . . .), (by causality),

Zt ∈ lin (Xt, Xt−1, . . .), (by invertibility).

Here lin , the closed linear span, is the operation of first forming all (finite) linear com-binations and next taking the metric closure in L2(Ω,U ,P) of this linear span. Thedisplay shows that the two closed linear spans in its right side are identical. Since Zt isa white noise process, the variable Zt+1 is orthogonal to the linear span of Zt, Zt−1, . . ..By the continuity of the inner product it is then also orthogonal to the closed linearspan of Zt, Zt−1, . . . and hence, under causality, it is orthogonal to Xs for every s ≤ t.This shows that the variable Zt+1 is totally (linearly) unpredictable at time t given theobservations X1, . . . , Xt. This is often interpreted in the sense that in the causal, invert-ible situation the variable Zt is an “external noise variable” that is generated at time t“independently” of the history of the system before time t.

Warning. For a general white noise process Zt the argument delivers orthogonalityonly, not “independence”.

8.18 EXERCISE. The preceding argument gives that Zt+1 is uncorrelated with the sys-tem variables Xt, Xt−1, . . . of the past. Show that if the variables Zt are independent,then Zt+1 is independent of the system up to time t, not just uncorrelated.

This general discussion readily gives the structure of the best linear predictor forstationary, causal, auto-regressive processes. Suppose that

Xt+1 = φ1Xt + · · ·+ φpXt+1−p + Zt+1.

If t ≥ p, then Xt, . . . , Xt−p+1 are perfectly predictable based on the past variablesX1, . . . , Xt: by themselves. If the series is causal, then Zt+1 is totally unpredictable (itsbest prediction is zero), in view of the preceding discussion. Since a best linear predictor


is a projection and projections are linear maps, the best linear predictor of Xt+1 basedon X1, . . . , Xt is given by

ΠtXt+1 = φtX1 + · · ·+ φpXt+1−p, (t ≥ p).

We should be able to obtain this result also from the prediction equations (2.4) andthe explicit form of the auto-covariance function, but that calculation would be morecomplicated.

8.19 EXERCISE. Find a formula for the best linear predictor of Xt+2 based onX1, . . . , Xt, if t− p ≥ 1.

For moving average and general ARMA processes the situation is more complicated.A similar argument works only for computing the best linear predictor Π−∞,tXt+1 basedon the infinite past Xt, Xt−1, . . . down to time −∞. Assume that Xt is a causal andinvertible stationary ARMA process satisfying

Xt+1 = φ1Xt + · · ·+ φpXt+1−p + Zt+1 + θ1Zt + · · ·+ θqZt+1−q.

By causality the variable Zt+1 is completely unpredictable. By invertibility the variableZs is perfectly predictable based onXs, Xs−1, . . . and hence is perfectly predictable basedon Xt, Xt−1, . . . for every s ≤ t. Therefore,

(8.1) Π−∞,tXt+1 = φ1Xt + · · ·+ φpXt+1−p + θ1Zt + · · ·+ θqZt+1−q.

The practical importance of this formula is small, because we never observe the com-plete past. However, if we observe a long series X1, . . . , Xt, then the “distant past”X0, X−1, . . . will not give much additional information over the “recent past” Xt, . . . , X1,and Π−∞,tXt+1 and the best linear predictor ΠtXt+1 based on X1, . . . , Xt will be close.

8.20 Lemma. For a stationary causal, invertible ARMA process Xt there exists con-stants C and c < 1 such that E|Π−∞,tXt+1 −ΠtXt+1|2 ≤ Cct for every t.

Proof. By invertibility we can express Zs as Zs =∑∞

j=0 πjXs−j , for (πj) the coefficientsof the Taylor expansion of π = φ/θ around 0. Because this power series has convergenceradius bigger than 1, it follows that

∑

j |πj |Rj < ∞ for some R > 1, and hence |πj | ≤C1c

j , for some constants C1 > 0, c < 1 and every j. By Lemma 1.29 and continuity ofa projection ΠtZs =

∑

j πjΠtXt−s. The variables Xs−j with 1 ≤ s− j ≤ t are perfectlypredictable using X1, . . . , Xt. We conclude that for s ≤ t

Zs −ΠtZs =∑

j≥sπj(Xs−j −ΠtXs−j).

It follows that ‖Zs −ΠtZs‖2 ≤∑j≥s C1cj2‖X1‖2 ≤ C2c

s, for every s ≤ t.In view of formula (8.1) for Π−∞,tXt+1 and the towering property of projections,

the difference between Π−∞,tXt+1 and ΠtXt+1 is equal to∑q

j=1 θj(Zt+1−j −ΠtZt+1−j),

for t ≥ p. Thus the L2-norm of this difference is bounded above by∑q

j=1 |θj |‖Zt+1−j −ΠtZt+1−j‖2. The lemma follows by inserting the bound obtained in the preceding para-graph.

8.5: Auto Correlation and Spectrum 125

Thus for causal stationary auto-regressive processes the error when predicting Xt+1

at time t > p using Xt, . . . , X1 is equal to Zt+1. For general stationary ARMA-processes this is approximately true for large t, and it is exactly true if we includethe variables X0, X−1, . . . in the prediction. It follows that the square prediction errorE|Xt+1 −ΠtXt+1|2 is equal to the variance σ2 = EZ2

1 of the innovations, exactly in theauto-regressive case, and approximately for general stationary ARMA-processes.

If the innovations are large, the quality of the predictions will be small. Irrespec-tive of the size of the innovations, the possibility of prediction of ARMA-processes intothe distant future is limited. The following lemma shows that the predictors convergeexponentially fast to the trivial predictor, the constant mean, 0.

8.21 Lemma. For a stationary causal, invertible ARMA process Xt there exists con-stants C and c < 1 such that E|ΠtXt+s|2 ≤ Ccs for every s and t.

Proof. By causality, for s > q the variable θ(B)Zt+s is orthogonal to the variablesX1, . . . , Xt, and hence have best predictor 0. Thus the linearity of prediction and theARMA equation Xt+s = φ1Xt+s−1+ · · ·+φpXt+s−p+ θ(B)Zt+s yield that the variablesYs = ΠtXt+s satisfy Ys = φ1Ys−1 + · · · + φpYs−p, for s > q. Writing this equation invector-form and iterating, we obtain, for s ≥ p,

YsYs−1...

Ys−p+1

=

φ1 φ2 · · · φp−1 φp1 0 · · · 0 0...

......

...0 0 · · · 1 0

Ys−1Ys−2...

Ys−p

= Φs−p

YpYp−1...Y1

,

where Φ is the matrix in the middle expression. By causality the spectral radius of the ma-trix Φ is strictly smaller than 1, and this implies that ‖Φs‖ ≤ Ccs for some constant c < 1.(See the proof of Theorem 8.32.) In particular, it follows that |Ys|2 ≤ Cc2(s−p)

∑pi=1 Y

2i ,

for s > p ∨ q. The inequality of the lemma follows by taking expecations. For s ≤ p ∨ qthe inequality is trivially valid for sufficiently large C, because the left side is bounded.

8.5 Auto Correlation and Spectrum

The spectral density of an ARMA representation is immediate from the representationXt = ψ(B)Zt and Theorem 6.11.

8.22 Theorem. The stationary ARMA process satisfying φ(B)Xt = θ(B)Zt possessesa spectral density given by

fX(λ) =

∣

∣

∣

∣

θ(e−iλ)

φ(e−iλ)

∣

∣

∣

∣

2σ2

2π.


0.0 0.5 1.0 1.5 2.0 2.5 3.0

-10

-50

510

15

Figure 8.1. Spectral density of the AR series satisfying Xt−1.5Xt−1+0.9Xt−2−0.2Xt−3+0.1Xt−9 = Zt.(Vertical axis in decibels, i.e. it gives the logarithm of the spectral density.)

8.23 EXERCISE. Plot the spectral densities of the following time series:(i) Xt = Zt + 0.9Zt−1;(ii) Xt = Zt − 0.9Zt−1;(iii) Xt − 0.7Xt−1 = Zt;(iv) Xt + 0.7Xt−1 = Zt;(v) Xt − 1.5Xt−1 + 0.9Xt−2 − 0.2Xt−3 + 0.1Xt−9 = Zt.

Finding a simple expression for the auto-covariance function is harder, except for thespecial case of moving average processes, for which the auto-covariances can be expressedin the parameters θ1, . . . , θq by a direct computation (cf. Example 1.6 and Lemma 1.29).The auto-covariances of a general stationary ARMA process can be solved from a systemof equations. In view of Lemma 1.29(iii), the equation φ(B)Xt = θ(B)Zt leads to theidentities, with φ(z) =

∑

j φjzj and θ(z) =

∑

j θjzj ,

∑

l

(

∑

j

φj φj+l−h)

γX(l) = σ2∑

j

θjθj+h, h ∈ Z.

In principle this system of equations can be solved for the values γX(l).

8.5: Auto Correlation and Spectrum 127

An alternative method to compute the auto-covariance function is to write Xt =ψ(B)Zt for ψ = θ/φ, whence, by Lemma 1.29(iii),

γX(h) = σ2∑

j

ψjψj+h.

This requires the computation of the coefficients ψj , which can be expressed in thecoefficients of φ and θ by comparing coefficients in the power series equation φ(z)ψ(z) =θ(z). Even in simple situations many ψj must be taken into account.

8.24 Example. For the AR(1) series Xt = φXt−1 + Zt with |φ| < 1 we obtain ψ(z) =(1−φz)−1 =

∑∞j=0 φ

jzj . Therefore, γX(h) = σ2∑∞

j=0 φjφj+h = σ2φh/(1−φ2) for h ≥ 0.

8.25 EXERCISE. Find γX(h) for the stationary ARMA(1, 1) series Xt = φXt−1 +Zt +θZt−1 with |φ| < 1.

* 8.26 EXERCISE. Show that the auto-covariance function of a stationary ARMA processdecreases exponentially. Give an estimator of the constant in the exponent in terms ofthe distance of the zeros of φ to the unit circle.

A third method to express the auto-covariance function in the coefficients of thepolynomials φ and θ uses the spectral representation

γX(h) =

∫ π

−πeihλfX(λ) dλ =

σ2

2πi

∫

|z|=1

zh−1θ(z)θ(z−1)

φ(z)φ(z−1)dz.

The second integral is a contour integral along the positively oriented unit circle in thecomplex plane. We have assumed that the coefficients of the polynomials φ and θ arereal, so that φ(z)φ(z−1) = φ(z)φ(z) = |φ(z)|2 for every z in the unit circle, and similarlyfor θ. The next step is to evaluate the contour integral with the help of the residuetheorem from complex function theory.‡ The poles of the integrand are contained in theset consisting of the zeros vi and their inverses v−1i of φ and possibly the point 0. Theauto-covariance function can be written as a function of the residues at these points.

8.27 Example (ARMA(1, 1)). Consider the stationary ARMA(1, 1) series Xt =φXt−1+Zt+θZt−1 with 0 < |φ| < 1. The corresponding function φ(z)φ(z−1) has zeros ofmultiplicity 1 at the points φ−1 and φ. Both points yield a pole of first order for the inte-grand in the contour integral. The number φ−1 is outside the unit circle, so we only needto compute the residue at the second point. The function θ(z−1)/φ(z−1) = (z+θ)/(z−φ)is analytic on the unit disk except at the point φ and hence does not contribute other

‡ Remember that the residue at a point z0 of a meromorf function f is the coefficient a−1 in its Laurentseries representation f(z) = Σjaj(z − z0)

j around z0. If f can be written as h(z)/(z − z0)k for a function h

that is analytic on a neighbourhood of z0, then a−1 can be computed as h(k−1)(z0)/(k − 1)!. For k = 1 it isalso limz→z0

(z − z0)f(z).


poles, but the term zh−1 may contribute a pole at 0. For h ≥ 1 the integrand has polesat φ and φ−1 only and hence

γX(h) = σ2 resz=φ

zh−1(1 + θz)(1 + θz−1)

(1− φz)(1− φz−1)= σ2φh

(1 + θφ)(1 + θ/φ)

1− φ2.

For h = 0 the integrand has an additional pole at z = 0 and the integral evaluates tothe sum of the residues at the two poles at z = 0 and z = φ. The first residue is equal to−θ/φ. Thus

γX(0) = σ2( (1 + θφ)(1 + θ/φ)

1− φ2− θ

φ

)

.

The values γX(h) for h < 0 follow by symmetry.

8.28 EXERCISE. Find the auto-covariance function for a MA(q) process by using theresidue theorem. (This is not easier than the direct derivation, but perhaps instructive.)

We do not present an additional method to compute the partial auto-correlationfunction of an ARMA process. However, we make the important observation that for acausal AR(p) process the partial auto-correlations αX(h) of lags h > p vanish.

8.29 Theorem. For a causal AR(p) process, the partial auto-correlations αX(h) of lagsh > p are zero.

Proof. This follows by combining Lemma 2.36 and the expression for the best linearpredictor found in the preceding section.

8.6 Existence of Causal and Invertible Solutions

In practice we never observe the white noise process Zt in the definition of an ARMAprocess. The Zt are “hidden variables”, whose existence is hypothesized to explain theobserved seriesXt. From this point of view our earlier question of existence of a stationarysolution to the ARMA equation is perhaps not the right question, as it took the sequenceZt as given. In this section we turn this question around and consider an ARMA(p, q)processXt as given. We shall see that there are at least 2p+q white noise processes Zt suchthat φ(B)Xt = θ(B)Zt for certain polynomials φ and θ of degrees p and q, respectively.(These polynomials depend on the choice of Zt and hence are not necessarily the ones thatare initially given.) Thus the white noise process Zt is far from being uniquely determinedby the observed series Xt. On the other hand, among the multitude of solutions, only onechoice yields a representation of Xt as a stationary ARMA process that is both causaland invertible.

8.6: Existence of Causal and Invertible Solutions 129

8.30 Theorem. For every stationary ARMA processXt satisfying φ(B)Xt = θ(B)Zt forpolynomials φ and θ such that θ has no roots on the unit circle, there exist polynomialsφ∗ and θ∗ of the same or smaller degrees as φ and θ that have all roots outside the unitdisc and a white noise process Z∗t such that φ∗(B)Xt = θ∗(B)Z∗t , almost surely, for everyt ∈ Z.

Proof. The existence of the stationary ARMA process Xt and our implicit assumptionthat φ and θ are relatively prime imply that φ has no roots on the unit circle. Thus allroots of φ and θ are either inside or outside the unit circle. We shall move the rootsinside the unit circle to roots outside the unit circle by a filtering procedure. Supposethat

φ(z) = −φp(z − v1) · · · (z − vp), θ(z) = θq(z − w1) · · · (z − wq).

Consider any zero zi of φ or θ. If |zi| < 1, then we replace the term (z− zi) in the aboveproducts by the term (1− ziz); otherwise we keep (z − zi). For zi = 0, this means thatwe drop the term z − zi = z and the degree of the polynomial decreases; otherwise, thedegree remains the same. We apply this procedure to all zeros vi and wi and denote theresulting polynomials by φ∗ and θ∗. Because 0 < |zi| < 1 implies that |z−1i | > 1, thepolynomials φ∗ and θ∗ have all zeros outside the unit circle. For z in a neighbourhoodof the unit circle,

θ(z)

φ(z)=θ∗(z)

φ∗(z)κ(z), κ(z) =

∏

i:|vi|<1

1− viz

z − vi

∏

i:|wi|<1

z − wi

1− wiz.

Because Xt = (θ/φ)(B)Zt and we want that Xt = (θ∗/φ∗)(B)Z∗t , we define the processZ∗t by Z∗t = κ(B)Zt. This is to be understood in the sense that we expand κ(z) in itsLaurent series κ(z) =

∑

j κjzj and apply the corresponding linear filter to Zt.

By construction we now have that φ∗(B)Xt = θ∗(B)Z∗t . If |z| = 1, then |1− ziz| =|z− zi|. In view of the definition of κ this implies that

∣

∣κ(z)∣

∣ = 1 for every z on the unitcircle and hence Z∗t has spectral density

fZ∗(λ) =∣

∣κ(e−iλ)∣

∣

2fZ(λ) = 1 · σ

2

2π.

This shows that Z∗t is a white noise process, as desired.

As are many results in time series analysis, the preceding theorem is a result onsecond moments only. For instance, if Zt is an i.i.d. sequence, then the theorem doesnot guarantee that Z∗t is an i.i.d. sequence as well. Only first and second moments arepreserved by the filtering procedure in the proof, in general. Nevertheless, the theoremis often interpreted as implying that not much is lost by assuming a-priori that φ and θhave all their roots outside the unit circle.


8.31 EXERCISE. Suppose that the time series Zt is Gaussian. Show that the series Z∗tconstructed in the preceding proof is Gaussian and hence i.i.d..

8.7 Stability

Let φ and θ be polynomials of degrees p ≥ 1 and q ≥ 0, with φ having no roots on theunit circle. Given initial values X1, . . . , Xp and a process Zt, we can recursively define asolution to the ARMA equation φ(B)Xt = θ(B)Zt by

(8.2)Xt = φ1Xt−1 + · · ·+ φpXt−p + θ(B)Zt, t > p,

Xt−p = φ−1p

(

Xt − φ1Xt−1 − · · · − φp−1Xt−p+1 − θ(B)Zt

)

, t− p < 1.

Theorem 8.8 shows that the only solution Xt that is bounded in L1 is the (unique)stationary solution Xt = ψ(B)Zt, for ψ = θ/φ. Hence the solution (8.2) can only bebounded if the initial values X1, . . . , Xp are chosen randomly according to the stationarydistribution. In particular, the process Xt obtained from deterministic initial values mustnecessarily be unbounded. The “unboundedness” refers to the process (Xt: t ∈ Z) on thefull time scale Z.

In this section we show that in the causal situation, when φ has no zeros on the unitdisc, the process Xt tends to stationarity as t→ ∞, given arbitrary initial values. Hencein this case the unboundedness occurs as t → −∞. This is another reason to prefer thecase that φ has no roots on the unit disc: in this case the effect of initializing the processwears off as time goes by.

Let Zt be a given white noise process and let (X1, . . . , Xp) and (X1, . . . , Xp) betwo possible sets of initial values, consisting of random variables defined on the sameprobability space.

8.32 Theorem. Let φ and θ be polynomials such that φ has no roots on the unit disc.Let Xt and Xt be the ARMA processes as in defined (8.2) for a given white noise seriesZt and initial values (X1, . . . , Xp) and (X1, . . . , Xp), respectively. Then Xt − Xt → 0almost surely as t→ ∞.

8.33 Corollary. Let φ and θ be polynomials such that φ has no roots on the unit disc.If for a given white noise series Xt is an ARMA process with arbitrary initial values, thenthe vector (Xt, . . . , Xt+k) converges in distribution to the distribution of the stationarysolution to the ARMA equation, as t→ ∞, for every fixed k.

Proofs. For the corollary we take (X1, . . . , Xp) equal to the values of the stationarysolution. Then we can conclude that the difference betweenXt and the stationary solutionconverges to zero almost surely and hence also in distribution.

8.8: ARIMA Processes 131

For the proof of the theorem we write the ARMA relationship in the “state spaceform”, for t > p,

Xt

Xt−1...

Xt−p+1

=

φ1 φ2 · · · φp−1 φp1 0 · · · 0 0...

......

...0 0 · · · 1 0

Xt−1Xt−2...

Xt−p

+

θ(B)Zt

0...0

.

Denote this system by Yt = ΦYt−1 +Bt. By some algebra it can be shown that

det(Φ− zI) = (−1)pzpφ(z−1), z 6= 0.

Thus the assumption that φ has no roots on the unit disc implies that the eigenvalues of Φare all inside the unit circle. In other words, the spectral radius of Φ, the maximum of themoduli of the eigenvalues, is strictly less than 1. Because the sequence ‖Φn‖1/n convergesto the spectral radius as n → ∞, we can conclude that ‖Φn‖1/n is strictly less than 1for all sufficiently large n, and hence ‖Φn‖ → 0 as n → ∞. (In fact, if ‖Φn0‖: = c < 1,then ‖Φkn0‖ ≤ ck for every natural number k and hence ‖Φn‖ ≤ Cc⌊n/n0⌋ for every n,for C = max0≤j≤n0

‖Φj‖, and the convergence to zero is exponentially fast.)

If Yt relates to Xt as Yt relates to Xt, then Yt − Yt = Φt−p(Yp − Yp) → 0 almostsurely as t→ ∞.

8.34 EXERCISE. Suppose that φ(z) has no zeros on the unit circle and at least onezero inside the unit circle. Show that there exist initial values (X1, . . . , Xp) such that the

resulting process Xt is not bounded in probability as t → ∞. [Let Xt be the stationarysolution and let Xt be the solution given initial values (X1, . . . , Xp). Then, with notation

as in the preceding proof, Yt − Yt = Φt−p(Yp − Yp). Choose an appropriate deterministic

vector for Yp − Yp.]

8.35 EXERCISE. Simulate a series of length 200 of the ARMA process satisfying Xt −1.3Xt−1+0.7Xt−2 = Zt+0.7Zt−1. Plot the sample path, and the sample auto-correlationand sample partial auto-correlation functions. Vary the distribution of the innovationsand simulate both stationary and nonstationary versions of the process.

8.8 ARIMA Processes

In Chapter 1 differencing was introduced as a method to transform a nonstationary timeseries into a stationary one. This method is particularly attractive in combination withARMA modelling: in the notation of the present chapter the differencing filters can bewritten as

∇Xt = (1−B)Xt, ∇dXt = (1−B)dXt, ∇kXt = (1−Bk)Xt.


Thus the differencing filters ∇, ∇d and ∇k correspond to applying the operator η(B) forthe polynomials η(z) = 1 − z, η(z) = (1 − z)d and η(z) = (1 − zk), respectively. Thesepolynomials have in common that all their roots are on the complex unit circle. Thusthey were “forbidden” polynomials in our preceding discussion of ARMA processes.

By Theorem 8.10, for the three given polynomials η the series Yt = η(B)Xt cannotbe a stationary ARMA process relative to polynomials without zeros on the unit circle ifXt is a stationary process. Indeed, if also φ(B)Yt = θ(B)Zt, then (φη)(B)Xt = θ(B)Zt,and stationarity of Xt would imply that φη has no roots on the unit circle. On theother hand, the series Yt = η(B)Xt can well be a stationary ARMA process if Xt is anon-stationary time series. Thus we can use polynomials with roots on the unit circle toextend the domain of ARMA modelling to nonstationary time series.

8.36 Definition. A time series Xt is an ARIMA(p, d, q) process if ∇dXt is a stationaryARMA(p, q) process.

In other words, the time series Xt is an ARIMA(p, d, q) process if there exist poly-nomials φ and θ of degrees p and q and a white noise series Zt such that the timeseries ∇dXt is stationary and φ(B)∇dXt = θ(B)Zt almost surely. The additional “I” inARIMA is for “integrated”. If we view taking differences ∇d as differentiating, then thedefinition requires that the dth derivative of Xt is a stationary ARMA process, whenceXt itself is an “integrated ARMA process”.

The following definition goes a step further.

8.37 Definition. A time series Xt is a SARIMA(p, d, q)(P,D,Q, per) process if thereexist polynomials φ, θ, Φ and Θ of degrees p, q, P and Q and a white noise seriesZt such that the time series ∇D

per∇dXt is stationary and Φ(Bper)φ(B)∇Dper∇dXt =

Θ(Bper)θ(B)Zt almost surely.

The “S” in SARIMA is short for “seasonal”. The idea of a seasonal model is thatwe might only want to use certain powers Bper of the backshift operator in our model,because the series is thought to have a certain period. Including the terms Φ(Bper) andΘ(Bper) does not make the model more general (as these terms could be subsumed inφ(B) and θ(B)), but reflects our a-priori idea that certain coefficients in the polynomialsare zero. This a-priori knowledge will be important when estimating the coefficients froman observed time series.

Modelling an observed time series by an ARIMA, or SARIMA, model has becomepopular through an influential book by Box and Jenkins. A given time series Xt maybe differenced, repeatedly if necessary, until it is perceived as being stationary, and nextthe differenced series modelled as a stationary ARMA process. This unified filteringparadigm of a Box-Jenkins analysis is indeed attractive. The popularity is probably alsodue to the compelling manner in which Box and Jenkins explain the reader how he orshe must set up the analysis, going through a fixed number of steps. They thus providethe data-analyst with a clear algorithm to carry out an analysis that is intrinsicallydifficult. It is obvious that the results of such an analysis will not always be good, butan alternative is less obvious.

8.9: VARMA Processes 133

* 8.9 VARMA Processes

A VARMA process is a vector-valued ARMA process. Given matrices Φj and Θj and awhite noise sequence Zt of dimension d, a VARMA(p, q) process satisfies the relationship

Xt = Φ1Xt−1 +Φ2Xt−2 + · · ·+ΦpXt−p + Zt +Θ1Zt−1 + · · ·+ΘqZt−q.

The theory for VARMA process closely resembles the theory for ARMA processes. Therole of the polynomials φ and θ is taken over by the matrix-valued polynomials

Φ(z) = 1− Φ1z − Φ2z2 − · · · − Φpz

p,

Θ(z) = 1 + Θ1z +Θ2z2 + · · ·+Θqz

q.

These identities and sums are to be interpreted entry-wise and hence Φ and Θ are (d×d)-matrices with entries that are polynomials in z ∈ C.

Instead of looking at zeros of polynomials we must now look at the values of z forwhich the matrices Φ(z) and Θ(z) are singular. Equivalently, we must look at the zerosof the complex functions z 7→ detΦ(z) and z 7→ detΘ(z). Apart from this difference, theconditions for existence of a stationary solution, causality and invertibility are the same.

8.38 Theorem. If the matrix-valued polynomial Φ(z) is invertible for every z in theunit circle, then there exists a unique stationary solution Xt to the VARMA equations.If the matrix-valued polynomial Φ(z) is invertible for every z on the unit disc, then thiscan be written in the form Xt =

∑∞j=0 ΨjZt−j for matrices Ψj with

∑∞j=0 ‖Ψj‖ < ∞.

If, moreover, the polynomial Θ(z) is invertible for every z on the unit disc, then we alsohave that Zt =

∑∞j=0 ΠjXt−j for matrices Πj with

∑∞j=0 ‖Πj‖ <∞.

The norm ‖ · ‖ in the preceding may be any matrix norm. The proof of this theoremis the same as the proofs of the corresponding results in the one-dimensional case, inview of the following observations.

A series of the type∑∞

j=−∞ΨjZt−j for matrices Ψj with∑∞

j=0 ‖Ψj‖ < ∞ and avector-valued process Zt with supt E‖Zt‖ <∞ converges almost surely and in mean. Wecan define a vector-valued function Ψ with domain at least the complex unit circle byΨ(z) =

∑∞j=−∞Ψjz

j , and write the series as Ψ(B)Zt, as usual.Next, the analogue of Lemma 8.2 is true. The product Ψ = Ψ1Ψ2 of two matrix-

valued functions z 7→ Ψi(z) is understood to be the function z 7→ Ψ(z) with Ψ(z) equalto the matrix-product Ψ1(z)Ψ2(z).

8.39 Lemma. If (Ψj,1) and (Ψj,2) are sequences of (d×d)-matrices with∑

j ‖Ψj,i‖ <∞for i = 1, 2, then the function Ψ = Ψ1Ψ2 can be expanded as Ψ(z) =

∑∞j=−∞Ψjz

j , at

least for z ∈ S1, for matrices Ψj with∑∞

j=−∞ ‖Ψj‖ < ∞. Furthermore Ψ(B)Xt =

Ψ1(B)[

Ψ2(B)Xt

]

, almost surely, for every time series Xt with supt E‖Xt‖ <∞.

The functions z 7→ detΦ(z) and z 7→ detΘ(z) are polynomials. Hence if they arenonzero on the unit circle, then they are nonzero on an open annulus containing the unitcircle, and the matrices Φ(z) and Θ(z) are invertible for every z in this annulus. Cramer’s


rule, which expresses the solution of a system of linear equations in determinants, showsthat the entries of the inverse matrices Φ(z)−1 and Θ(z)−1 are quotients of polynomials.The denominators are the determinants detΦ(z) and detΘ(z) and hence are nonzero ina neighbourhood of the unit circle. These matrices may thus be expanded (entrywise) inLaurent series

Φ(z)−1 =(

∞∑

j=−∞(Φj)k,lz

j)

k,l=1,...,d=

∞∑

j=−∞Φjz

j ,

where the Φj are matrices such that∑∞

j=−∞ ‖Φj‖ <∞, and similarly for Θ(z)−1. Thisallows the back-shift calculus that underlies the proof of Theorem 8.8. For instance, thematrix-valued function Ψ = Φ−1Θ is well-defined on the unit circle, and can be expandedas Ψ(z) =

∑∞j=−∞Ψjz

j for matrices Ψj with∑∞

j=−∞ ‖Ψj‖ < ∞. The process Ψ(B)Zt

is the unique solution to the VARMA-equation.

* 8.10 ARMA-X Processes

The (V)ARMA-relation models a time series as an autonomous process, whose evolutionis partially explained by its own past (or the past of the other coordinates in the vector-valued situation). It is often of interest to include additional explanatory variables intothe model. The natural choice of including time-dependent, vector-valued variables Wt

linearly yields the ARMA-X model

Xt = Φ1Xt−1 +Φ2Xt−2 + · · ·+ΦpXt−p + Zt +Θ1Zt−1 + · · ·+ΘqZt−q + βTWt.

9GARCH Processes

White noise processes are basic building blocks for time series models, but can also be ofinterest on their own. In particular, GARCH processes are white noises that have beenfound useful for modelling financial time series. Until some decades ago the “random walkhypothesis”, according to which financial returns are independent random variables, wasthe accepted model in finance. The famous Black-Scholes model that underlies option-pricing theory falls into this category. Then in the 1980s it was realized that althoughreturns appear to have zero autocorrelation, they are not independent. In 2003 Engleand Granger received a Nobel prize for pointing this out and introducing models fromwhite noise series that are not i.i.d.

Figure 9.1 shows a realization of a GARCH process. The striking feature are the“bursts of activity”, which alternate with “quiet” periods of the series. The frequencyof the movements of the series appears to be constant over time, but their amplitudechanges, alternating between “volatile” periods (large amplitude) and quiet periods. Thisphenomenon is referred to as volatility clustering. A look at the auto-correlation functionof the realization, Figure 9.2, shows that the alternations are not reflected in the secondmoments of the series: the series can be modelled as white noise.

Recall that a white noise series is any mean-zero stationary time series whose auto-covariances at nonzero lags vanish. We shall speak of a heteroscedastic white noise if theauto-covariances at nonzero lags vanish, but the variances are possibly time-dependent.A related concept is that of a martingale difference series. Recall that a filtration Ft is anondecreasing collection of σ-fields · · · ⊂ F−1 ⊂ F0 ⊂ F1 ⊂ · · ·. A martingale differenceseries relative to the filtration Ft is a time series Xt such that Xt is Ft-measurable andE(Xt| Ft−1) = 0, almost surely, for every t. (The latter implicitly includes the assumptionthat E|Xt| <∞, so that the conditional expectation is well defined.)

Any martingale difference series Xt with finite second moments is a (possibly het-eroscedastic) white noise series. Indeed, the equality E(Xt| Ft−1) = 0 is equivalent toorthogonality of Xt to all random variables Y ∈ Ft−1, which includes the variablesXs ∈ Fs ⊂ Ft−1 for every s < t; consequently EXtXs = 0 for every s < t. The converseis false: not every white noise is a martingale difference series (relative to a natural filtra-

136 9: GARCH Processes

0 100 200 300 400 500

-1.0

-0.5

0.0

0.5

1.0

Figure 9.1. Realization of length 500 of the stationary Garch(1, 1) process with α = 0.15, φ1 = 0.4, θ1 = 0.4and standard normal variables Zt.

tion). This is because E(X|Y ) = 0 implies that X is orthogonal to all square-integrable,measurable functions of Y , not just to linear functions.

9.1 EXERCISE. If Xt is a martingale difference series, show that E(Xt+kXt+l| Ft) =0 almost surely for every k 6= l > 0. Thus “future variables are uncorrelated giventhe present”. Find a white noise series which lacks this property relative to its naturalfiltration.

By definition a martingale difference sequence has zero conditional first momentgiven the past. A natural step for further modelling is to postulate a specific form ofthe conditional second moment. This is exactly what GARCH models are about. Thusthey are also concerned only with first and second moments of the time series, albeitconditional moments. The attraction of GARCH processes is that they can capture manyfeatures of observed time series, in particular those in finance, that ARMA processescannot. Besides volatility clustering these stylized facts include leptokurtic (i.e. heavy)tailed marginal distributions and nonzero auto-correlations for the process X2

t of squares.(The phrase “stylized fact” is used for salient properties that seem to be shared by manyfinancial time series, but “fact” is not to be taken literally.)

9.1: Linear GARCH 137

0 5 10 15 20 25

0.0

0.2

0.4

0.6

0.8

1.0

Lag

AC

F

Figure 9.2. Sample auto-covariance function of the time series in Figure 9.1.

There are many types of GARCH processes. We discuss a selection in the followingsections, paying most attention to linear GARCH processes.

9.1 Linear GARCH

Linear GARCH processes were the earliest GARCH processes to be studied, and may beviewed as the GARCH processes.

9.2 Definition. A GARCH (p, q) process Xt is a martingale difference sequence relativeto a given filtration Ft, whose conditional variances σ2

t = E(X2t | Ft−1) exist and satisfy,

for every t ∈ Z and given nonnegative constants α > 0, φ1, . . . , φp, θ1, . . . , θq,

(9.1) σ2t = α+ φ1σ

2t−1 + · · ·+ φpσ

2t−p + θ1X

2t−1 + · · ·+ θqX

2t−q, a.s..

With the notations φ(z) = 1 − φ1z − · · · − φpzp and θ(z) = θ1z + · · · + θqz

q, theequation for the conditional variance σ2

t = var(Xt| Ft−1) can be abbreviated to

φ(B)σ2t = α+ θ(B)X2

t .

This notation is as in Chapter 8 except that the polynomial θ has zero intercept: θ0 =0. If the coefficients φ1, . . . , φp all vanish, then σ2

t is modelled as a linear function ofX2

t−1, . . . , X2t−q. This is called an ARCH (q) model, from “auto-regressive conditional

heteroscedastic”. The additional G of GARCH is for the nondescript “generalized”.


Since σt > 0 (as α > 0), we can define auxiliary variables by Zt = Xt/σt. Themartingale difference property of Xt = σtZt and the definition of σ2

t as the conditionalvariance (which shows that σ2

t ∈ Ft−1) imply

(9.2) E(Zt| Ft−1) = 0, E(Z2t | Ft−1) = 1.

Thus the time series Zt is a scaled martingale difference series. By reversing this con-struction we can easily define GARCH processes on the time set Z+. Here we could startfor instance with an i.i.d. sequence Zt of variables with mean zero and variance 1.

9.3 Lemma. Let (Zt: t ∈ Z) be a martingale difference sequence that satisfies (9.2)relative to an arbitrary filtration Ft and let σ0, σ−1, . . . , σ−r+1 for be random variableswith σt ∈ Ft−1 defined on the same probability space, where r = p ∨ q. Then thedefinitions Xt = σtZt for t = 0,−1, . . . ,−r + 1 followed by the recursions, for t ≥ 1,

σ2t = α+ φ1σ

2t−1 + · · ·+ φpσ

2t−p + θ1X

2t−1 + · · ·+ θqX

2t−q,

Xt = σtZt

,

define a GARCH process (Xt: t ≥ 1) with conditional variance process σ2t = E(X2

t | Ft−1).Furthermore σt ∈ Ft−1 and F0 ∨ σ(Xs: 1 ≤ s ≤ t) = F0 ∨ σ(Zs: 1 ≤ s ≤ t), for every t.

Proof. The set of four assertions σt ∈ Ft−1,Xt ∈ Ft, E(Xt| Ft−1) = 0, E(X2t | Ft−1) = σ2

t

follows for t ≤ 0 from the assumptions and definitions. In view of the recursive definitionsthe set of assertions extends to t ≥ 1 by induction. Thus all requirements for a GARCHprocess are satisfied for t ≥ 1.

The recursion formula for σ2t shows that σt ∈ F0 ∨ σ(Xs, Zs: 1 ≤ s ≤ t − 1). The

equality F0 ∨ σ(Zs: 1 ≤ s ≤ t) = F0 ∨ σ(Xs: 1 ≤ s ≤ t) is clearly valid for t = 0, andnext follows by induction for every t ≥ 1, since Xt = σtZt and Zt = Xt/σt, where σt isa function of σs and Xs with s < t. This yields the final assertion of the lemma.

Time in the definition of GARCH processes flows in one direction (the natural one),by the presence of the filtration Ft. For this reason the GARCH relation cannot be usedbackwards to construct the conditional variances σ2

t and process Xt at past times, unlikefor ARMA processes. As a consequence the lemma constructs the time series Xt only forpositive times. The same construction works starting from any time, but, depending onthe initialization, it cannot necessarily be extended to the full time set Z while keepingthe GARCH relation and correct time flow.

* 9.4 EXERCISE. Lemma 9.3 constructs a GARCH (1,1) process for times t ≥ 0 startingfrom a filtration Ft, an MDS Zt, and an initial value σ0 ∈ F0. Show that it extends tothe full time scale Z if and only if, for every j ≥ 0,

σ20 − α− αW−1W−2 − · · · − αW−1 · · ·W−j

W−1W−2 · · ·W−j−1∈ F−j−1,


where Wt = φ1 + θ1Z2t . In particular, if θ1 = 0 this is equivalent to σ0 ∈ Ft for every

t < 0. If in the latter case we enlarge the filtration to Ft ∨ σ(σ0), is Zt then necessarilystill an MDS? [The GARCH relation and Xt = σtZt imply that σ2

t = α+Wt−1σ2t−1. Use

this in reverse time to express σ2t for t < 0 in σ0 and Wt, . . . ,W1.]

Lemma 9.3 uses p ∨ q initial values σ0, . . . , σ−r+1. As these can be chosen in manyways, the process is not unique. The construction of stationary GARCH processes is moreinvolved and taken up in Section 9.1.2. Next in Section 9.1.3 the effect of initializationis shown to wear off as t → ∞: any process as constructed in the preceding lemma willtend to the stationary solution, if such a solution exists.

The final assertion of the lemma shows that at each time instance the randomnessadded “into the system” is generated by Zt. This encourages to interprete this standard-ized process as the innovations of a GARCH process, although this term should now beinderstood in a multiplicative sense.

As any martingale differences series, a GARCH process is “constant in the mean”.Not only is its mean function zero, but also at each time its expected change given its pastis zero. The (conditional) variance structure of a GARCH process is more interesting. Bythe GARCH relation (9.1) large absolute values Xt−1, . . . , Xt−q of the process at timest−1, . . . , t−q lead to a large conditional variance σ2

t at time t. Then the value Xt = σtZt

of the time series at time t tends to deviate from far from zero as well (depending on thesize of the innovation Zt). Large conditional variances σ2

t−1, . . . , σ2t−p similarly promote

large values of |Xt|. In financial applications (conditional) variance is often referred toas volatility, and strongly fluctuating time series are said to be volatile. Thus GARCHprocesses tend to alternate periods of high and low volatility. This volatility clustering isone of the stylized facts of financial time series.

A second stylized fact are the leptokurtic tails of the marginal distribution of atypical financial time series. A distribution on R is called leptokurtic if it has fat tails,for instance fatter than normal tails. A quantitative measure of “fatness” of the tailsof the distribution of a random variable X is the kurtosis defined as κ4(X) = E(X −EX)4/(varX)2. If Xt = σtZt, where σt is Ft−1-measurable and Zt is independent ofFt−1 with mean zero and variance 1, then EX2

t = Eσ2t , EX

4t = Eσ4

tEZ4t = Eσ4

t κ4(Zt)and

κ4(Xt)

κ4(Zt)=

EX4t /(EX

2t )

2

κ4(Zt)=

Eσ4t

(EX2t )

2=

varσ2t + (Eσ2

t )2

(Eσ2t )

2= 1 +

varσ2t

(Eσ2t )

2.

Thus κ4(Xt) ≥ κ4(Zt), and the difference can be substantial if the variance of thevolatility σ2

t is large. It follows that the GARCH structure is also able to capture someof the observed leptokurtosis of financial time series.

As the kurtosis of a normal variable is equal to 3, the kurtosis of a GARCH seriesXt driven by Gaussian innovations Zt is always bigger than 3. It has been observedthat this usually does not go far enough in explaining “excess kurtosis” over the normaldistribution. The use of a Student t-distribution can improve the fit of a GARCH processsubstantially.

If we define Wt = Xt − σ2t and replace σ2

t by X2t −Wt in (9.1), then we find after


rearranging the terms,

(9.3)X2

t = α+ (φ1 + θ1)X2t−1 + · · ·+ (φr + θr)X

2t−r

+Wt − φ1Wt−1 − · · · − φpWt−p,

where r = p ∨ q and the sequences φ1, . . . , φp or θ1, . . . , θq are padded with zeros toincrease their lengths to r, if necessary. We can abbreviate this to

(φ− θ)(B)X2t = α+ φ(B)Wt, Wt = X2

t − E(X2t | Ft−1).

This is the characterizing equation for an ARMA(r, r) process X2t relative to the noise

processWt. The variable Wt = X2t −σ2

t is the prediction error when predicting X2t by its

conditional expectation σ2t = E(X2

t | Ft−1) and hence Wt is orthogonal to Ft−1. Thus Wt

is a martingale difference series and hence a white noise sequence if its second momentsexist and are independent of t. Under these conditions the time series of squares X2

t isan ARMA process in the sense of Definition 8.4. This observation is useful to computecertain characteristics of the process.

Warning. The variables Wt in equation (9.3) are defined in terms of of the processXt (their squares and conditional variances) and therefore do not have an interpretationas an exogenous noise process that drives the process X2

t . This circularity makes thatone should not apply results on ARMA processes unthinkingly to the process X2

t . Forinstance, equation (9.3) seems to have little use for proving existence of solutions to theGARCH equation, although some authors seem to believe otherwise.

* 9.5 EXERCISE. Suppose that Xt and Wt are martingale difference series relative to agiven filtration such that φ(B)X2

t = θ(B)Wt for polynomials φ and θ of degrees p andq. Show that Xt is a GARCH process. Does strict stationarity of the time series X2

t orWt imply strict stationarity of the time series Xt?

9.6 EXERCISE. Write σ2t as the solution to an ARMA(p∨ q, q − 1) equation by substi-

tuting X2t = σ2

t +Wt in (9.3).

9.1.1 Autocovariances of Squares

Even though the autocorrelation function of a GARCH process vanishes (at every nonzerolag), the variables are not independent. One way of seeing this is to consider the auto-correlations of the squares X2

t of the time series. Independence of the Xt would implyindependence of the X2

t , and hence a zero autocorrelation function. However, in viewof (9.3) the squares form an ARMA process, whence their autocorrelations are nonzero.The auto-correlations of the squares of financial time series are typically observed to bepositive at all lags, another stylized fact of such series.

The auto-correlation function of the squares of a GARCH series will exist underappropriate additional conditions on the coefficients and the driving noise process Zt.The ARMA relation (9.3) for the square process X2

t may be used to derive this functionfrom the formulas for ARMA processes. Here we must not forget that the process Wt


in (9.3) is defined through Xt and hence its variance depends on the parameters in theGARCH relation.

Actually, equation (9.3) is an ARMA equation “with intercept α”. Provided thatβ: = (φ− θ)(1) = 1−∑j(φj + θj) 6= 0, we can rewrite the equation as (φ− θ)(B)

(

X2t −

α/β))

= φ(B)Wt, which shows that the time series X2t is an ARMA proces plus α/γ.

The autocorrelations are of course independent of this shift.

9.7 Example (GARCH(1, 1)). The conditional variances of a GARCH(1,1) processsatisfy σ2

t = α + φσ2t−1 + θX2

t−1. If we assume the process Xt to be stationary, thenEσ2

t = EX2t is independent of t. Taking the expectation across the GARCH equation

and rearranging then immediately gives

Eσ2t = EX2

t =α

1− φ− θ.

To compute the auto-correlation function of the time series of squares X2t , we employ

(9.3), which reveals this process as an ARMA(1,1) process with the auto-regressive andmoving average polynomials given as 1−(φ+θ)z and 1−φz, respectively. The calculationsin Example 8.27 yield that

γX2(h) = τ2(φ+ θ)h(1− φ(φ+ θ))(1− φ/(φ+ θ))

1− (φ+ θ)2, h > 0,

γX2(0) = τ2( (1− φ(φ+ θ))(1− φ/(φ+ θ))

1− (φ+ θ)2+

φ

φ+ θ

)

.

Here τ2 is the variance of the process Wt = X2t −E(X2

t | Ft−1), and is also dependent onthe parameters θ and φ. By squaring the GARCH equation we find

σ4t = α2 + φ2σ4

t−1 + θ2X4t−1 + 2αφσ2

t−1 + 2αθX2t−1 + 2φθσ2

t−1X2t−1.

If Zt is independent of Ft−1, then Eσ2tX

2t = Eσ4

t and EX4t = κ4(Zt)Eσ

4t . If we assume,

moreover, that the moments exists and are independent of t, then we can take theexpectation across the preceding display and rearrange to find that

Eσ4t (1− φ2 − 2φθ − κ4(Z)θ

2) = α2 + (2αφ+ 2αθ)Eσ2t .

Together with the formulas obtained previously, this gives the variance of Wt, sinceEWt = 0 and EW 2

t = EX4t − Eσ4

t , by Pythagoras’ identity for projections.

9.8 EXERCISE. Find the auto-covariance function of the process σ2t for a GARCH(1, 1)

process.

9.9 EXERCISE. Find an expression for the kurtosis of the (marginal) distribution ofXt in a stationary GARCH(1, 1) process as in the preceding example. Let Zt be Gaus-sian. For which parameter values is the formula valid? What happens if the parametersapproach the boundary of this set?


9.1.2 Stationary Solutions

Because the GARCH relation is recursive and the variables Xt enter it both explicitlyand implicitly (through their conditional variances), existence of stationary GARCHprocesses is far from obvious. In this section we show that a (strictly) stationary solutionexists only for certain parameter values φ1, . . . , φp, θ1 . . . , θq, and if it exists, then it isunique. Surprisingly, there are certain parameter values for which a strictly stationarysolution exists, but not a (second order) stationary solution.

The key to understanding these results is a state space representation, which ‘re-pairs’ the circularity of the GARCH definition. By substituting Xt−1 = σt−1Zt−1 in theGARCH equation (9.1), and leaving Xt−2, . . . , Xt−q untouched, we see that

(9.4) σ2t = α+ (φ1 + θ1Z

2t−1)σ

2t−1 + φ2σ

2t−1 + · · ·+ φpσ

2t−p + θ2X

2t−2 + · · ·+ θqX

2t−q.

This equation gives rise to the recursive equations for the ‘state vectors’ Yt =(σ2

t , . . . , σ2t−p+1, X

2t−1, . . . , X

2t−q+1)

T given by

(9.5)

σ2t

σ2t−1σ2t−2...

σ2t−p+1

X2t−1

X2t−2...

X2t−q+1

=

φ1 + θ1Z2t−1 φ2 · · · φp−1 φp θ2 · · · θq−1 θq

1 0 · · · 0 0 0 · · · 0 00 1 · · · 0 0 0 · · · 0 0...

.... . .

......

......

...0 0 · · · 1 0 0 · · · 0 0

Z2t−1 0 · · · 0 0 0 · · · 0 00 0 · · · 0 0 1 · · · 0 0...

......

......

. . ....

...0 0 · · · 0 0 0 · · · 1 0

σ2t−1σ2t−2σ2t−3...

σ2t−p

X2t−2

X2t−3...

X2t−q

+αe1.

For At the (random!) matrix in this display and b = αe1 for e1 the first unit vector,this recursion can be written Yt = AtYt−1 + b. We can construct a GARCH process byfirst constructing a vector-valued process Yt satisfying this recursive equation, and nextdefining σ2

t as the first coordinate of this process and finally setting Xt = σtZt.We start with investigating second order stationary solutions. In the following the-

orem we consider given a martingale difference sequence Zt satisfying (9.2), defined on afixed probability space equipped with a given filtration (Ft: t ∈ Z). The first part of thetheorem shows that a stationary GARCH process exists if and only if

∑

j(φj + θj) < 1.The last part shows essentially that the filtration may be reduced to the natural filtrationof the GARCH process, without changing anything.

9.10 Theorem. Let α > 0, let φ1, . . . , φp, θ1, . . . , θq be nonnegative numbers, and let Zt

be a martingale difference sequence satisfying (9.2) relative to an arbitrary filtration Ft.(i) There exists a stationary GARCH process (Xt: t ∈ Z) such that Xt = σtZt, where

σ2t = E(X2

t | Ft−1), if and only if∑

j(φj + θj) < 1.(ii) This process is unique among the GARCH processes Xt with Xt = σtZt that are

bounded in L2.


(iii) This process satisfies σ(Xt, Xt−1, . . .) = σ(Zt, Zt−1, . . .) for every t, and σ2t =

E(X2t | Ft−1) is σ(Xt−1, Xt−2, . . .)-measurable.

Proof. By iterating (9.5) we find, for every n ≥ 1,

(9.6) Yt = b+Atb+AtAt−1b+ · · ·+AtAt−1 · · ·At−n+1b+AtAt−1 · · ·At−nYt−n−1.

If the last term on the right tends to zero as n→ ∞, then this gives

(9.7) Yt = b+

∞∑

j=1

AtAt−1 · · ·At−j+1b.

We shall prove that this representation is indeed valid (with convergence of the series inprobability) for the process Yt corresponding to a GARCH process that is bounded inL2. Because the right side of the display is a given function of the Zt, the uniqueness (ii)of such a GARCH process for a given process Zt is then clear. To prove existence, as in(i), we shall reverse the argument: we use equation (9.7) to define a process Yt, and nextdefine processes σt and Xt as functions of Yt and Zt. We start by studying the randommatrix products AtAt−1 · · ·At−n.

The expectation A = EAt of a single matrix is obtained by replacing the variableZ2t−1 in the definition of At by its expectation 1. Since Zt = Xt/σt is Ft-measurable and

E(Z2t | Ft−1) = 1 for every t, we can use the towering property of conditional expecta-

tions to see that the expectation of the product of the matrices is the product of theexpectations: EAtAt−1 · · ·At−n = An+1.

The characteristic polynomial of the matrix A can (with some effort) be seen to beequal to

det(A− zI) = (−z)p+q−1(1−r∑

j=1

(φj + θj)z−j).

If∑

j(φj + θj) < 1, then the polynomial on the right has all its roots inside the unitcircle, by Lemma 9.11. In other words, the spectral radius (the maximum of the moduliof the eigenvalues) of the operator A is then strictly smaller than 1. This implies that‖An‖1/n < c < 1 for some constant c for all sufficiently large n and hence

∑∞n=0 ‖An‖ <

∞.Combining the results of the last two paragraphs we see that EAtAt−1 · · ·At−n =

An+1 → 0 as n → ∞ if∑

j(φj + θj) < 1. Because the matrices At possess nonnegativeentries, this implies that the sequence AtAt−1 · · ·At−n converges to zero in probability.If Xt is a GARCH process that is bounded in L2, then, in view of its definition, the cor-responding vector-valued process Yt is bounded in L1, and hence bounded in probability.In that case AtAt−1 · · ·At−nYt−n−1 → 0 in probability as n → ∞, whence Yt satisfies(9.7). This concludes the proof of the uniqueness (ii).

Under the assumption that∑

j(φj + θj) < 1, the series on the right side of (9.7)converges in L1 and almost surely. Thus we can use this equation to define a process Ytfrom the given martingale difference series Zt. Simple algebra on the infinite series shows


that Yt = AtYt−1 + b for every t. Clearly the coordinates of Yt are nonnegative, whencewe can define processes σt and Xt by

σt =√

Yt,1, Xt = σtZt.

Because σt is σ(Zt−1, Zt−2, . . .) ⊂ Ft−1-measurable, we have that E(Xt| Ft−1) =σtE(Zt| Ft−1) = 0 and E(X2

t | Ft−1) = σ2tE(Z

2t | Ft−1) = σ2

t . Furthermore, the processσ2t satisfies the GARCH relation (9.1). Indeed, the first row of the matrix equationYt = AtYt−1 + b expresses σ2

t into σ2t−1 and Yt−1,2, . . . , Yt−1,p+q−1, and the other rows

permit to reexpress Yt−1,2, . . . , Yt−1,p+q−1 into variables σ2s and X2

s so as to recover theequation (9.4). For instance, the second row gives that Yt,2 = Yt−1,1 = σ2

t−1, and nextthe third row gives that Yt,3 = Yt−1,2 = σ2

t−2. Similarly, the p + 1th row gives thatYt,p+1 = Z2

t−1Yt−1,1 = Z2t−1σ

2t−1 = X2

t−1 by the definition of Xt−1. Finally the p+2th top+q−1th rows give that Yt,p+2 = Yt−1,p+1 = X2

t−2, Yt,p+3 = Yt−1,p+2 = X2t−3, etc. This

concludes the proof that there exists a stationary solution as soon as∑

j(φj + θj) < 1.To see that the latter condition is necessary, we return to equation (9.6). If Xt is a

stationary solution, then its conditional variance process σ2t is integrable, and hence so is

the process Yt in (9.6). Taking the expectation of the left and right sides of this equationfor t = 0 and remembering that all terms are nonnegative, we see that

∑nj=0A

jb ≤ EY0,for every n. This implies that Anb → 0 as n → ∞, or, equivalently Ane1 → 0, where eiis the ith unit vector. In view of the definition of A we see, recursively, that

Anep+q−1 = An−1θqe1 → 0,

Anep+q−2 = An−1(θq−1e1 + ep+q−1) → 0,

...

Anep+1 = An−1(θ2e1 + ep+2) → 0,

Anep = An−1φpe1 → 0,

Anep−1 = An−1(φp−1e1 + ep),

...

Ane2 = An−1(φ2e1 + e3) → 0.

Therefore, the sequence An converges to zero. This can only happen if none of its eigen-values is on or outside the unit disc. (If z is an eigenvalue of A with |z| ≥ 1 and c 6= 0 acorresponding eigenvector, then Anc = znc does not converge to zero.) This was seen tobe the case only if

∑rj=1(φj + θj) < 1.

This concludes the proof of (i) and (ii). For the proof of (iii) we first note that thematrices At, and hence the variables Yt, are measurable functions of (Zt−1, Zt−2, . . .).In particular, the variable σ2

t , which is the first coordinate of Yt, is σ(Zt−1, Zt−2, . . .)-measurable, so that the second assertion of (iii) follows from the first.

Because the variable Xt = σtZt is a measurable transformation of (Zt, Zt−1, . . .),it follows that σ(Xt, Xt−1, . . .) ⊂ σ(Zt, Zt−1, . . .). To prove the converse, we note thatthe process Wt = X2

t − σ2t is bounded in L1 and satisfies the ARMA relation (φ −

θ)(B)X2t = α+φ(B)Wt, as in (9.3). Because φ has no roots on the unit disc, this relation


is invertible, whence Wt = (1/φ)(B)(

(φ− θ)(B)X2t − α

)

is a measurable transformationof X2

t , X2t−1, . . .. Therefore σ

2t = Wt + X2

t and hence Zt = Xt/σt is σ(Xt, Xt−1, . . .)-measurable.

9.11 Lemma. If p1, . . . , pr are nonnegative real numbers, then the polynomial p(z) =1−∑r

j=1 pjzj possesses no roots on the complex unit disc if and only if p(1) > 0.

Proof. Clearly p(0) = 1 > 0. If p(1) ≤ 0, then by continuity the polynomial p must havea real root in the interval (0, 1]. On the other hand, if p(1) > 0, then by the triangleinequality

∣

∣p(z)∣

∣ ≥ 1 −∑j pj |z|j ≥ 1 −∑j pj = p(1) > 0, for all z in the complex unitdisc, whence p has no roots.

9.12 EXERCISE. Show that any GARCH process Xt = σtZt satisfies

(9.8) σ2t = α+ (φ1 + θ1Z

2t−1)σ

2t−1 + · · ·+ (φr + θrZ

2t−r)σ

2t−r.

This exhibits the process σ2t as an auto-regressive process “with random coefficients”

(the variables φi + θiZ2t−i) and “deterministic innovations” (the constant α). Write this

equation in state space form with state vectors Yt−1 = (σ2t−1, . . . , σ

2t−r). Show that

stationarity of Xt implies that Yt satisfies (9.7) for appropriate random matrices At. [Themean A = EAt of the matrices is obtained by replacing the Z2

t by their expectations,and det(A− zI) = (−1)r

(

zr −∑rj=1(φj + θj)z

r−j).]

The proof of Theorem 9.10 hinges on the infinite series (9.7). The necessary andsufficient condition

∑

j(φj + θj) < 1 for stationarity arises because this is necessary forits convergence in L1. If the series converges in another (weaker) sense, then a GARCHprocess may still exist, albeit that it cannot have bounded second moments.

The series (9.7) involves products of random matrices At. Its convergence dependson the value of their top Lyapounov exponent, defined by

γ = infn∈N

1

nE log ‖A−1A−2 · · ·A−n‖.

Here ‖ · ‖ may be any matrix norm (all matrix norms being equivalent). If the processZt is ergodic, for instance i.i.d., then we can apply Kingman’s subergodic theorem (seeTheorem 7.17) to the process log ‖A−1A−2 · · ·A−n‖ to see that

1

nlog ‖A−1A−2 · · ·A−n‖ → γ, a.s..

This implies that the sequence of matrices A−1A−2 · · ·A−n converges to zero almostsurely as soon as γ < 0. The convergence is then exponentially fast and the series in(9.7) will converge.

Thus sufficient conditions for the existence of strictly stationary solutions to theGARCH equations can be given in terms of the top Lyapounov exponent of the randommatrices At. This exponent is in general difficult to compute analytically, but it caneasily be estimated numerically for a given sequence Zt.

The setting is the same as in Theorem 9.10, except that we now also assume ergod-icity of the ‘innovations’ Zt. For instance, Zt can be a given i.i.d. sequence of randomvariables with mean zero and unit variance and Ft its natural filtration.


9.13 Theorem. Let α > 0, let φ1, . . . , φp, θ1, . . . , θq be nonnegative numbers, and let Zt

be a strictly stationary, ergodic martingale difference series satisfying (9.2).(i) If the top Lyapounov coefficient of the random matrices At given by (9.5) is strictly

negative, then there exists a strictly stationary GARCH process Xt such that Xt =σtZt, where σ

2t = E(X2

t | Ft−1).(ii) If the sequence Zt is i.i.d., then this condition is necessary.(iii) This process satisfies σ(Xt, Xt−1, . . .) = σ(Zt, Zt−1, . . .) for every t, and σ2

t =E(X2

t | Ft−1) is σ(Xt−1, Xt−2, . . .)-measurable.

Proof. Let b = αe1, where ei is the ith unit vector in Rp+q−1. If γ′ is strictly larger than

the top Lyapounov exponent γ, then ‖AtAt−1 · · ·At−n+1‖ < eγ′n, eventually as n→ ∞,

almost surely, and hence, eventually,∥

∥AtAt−1 · · ·At−n+1b∥

∥ < eγ′n‖b‖.

If γ < 0, then we may choose γ′ < 0, and hence∑

n ‖AtAt−1 · · ·At−n+1b‖ < ∞ almostsurely. Then the series on the right side of (9.7) converges almost surely and defines aprocess Yt. We can then define processes σt and Xt by setting σt =

√

Yt,1 and Xt =σtZt. That these processes satisfy the GARCH relation follows from the relations Yt =AtYt−1 + b, as in the proof of Theorem 9.10. Being a fixed measurable transformation of(Zt, Zt−1, . . .) for each t, the process (σt, Xt) is strictly stationary.

By construction the variable Xt is σ(Zt, Zt−1, . . .)-measurable for every t. To seethat, conversely, Zt is σ(Xt, Xt−1, . . .)-measurable, we apply a similar argument as inthe proof of Theorem 9.10, based on inverting the relation (φ− θ)(B)X2

t = α+φ(B)Wt,for Wt = X2

t − σ2t . Presently, the series X2

t and Wt are not necessarily integrable,but Lemma 9.14 below still allows to conclude that Wt is σ(X2

t , X2t−1, . . .)-measurable,

provided that the polynomial φ has no zeros on the unit disc.The matrix B obtained by replacing the variables Zt−1 and the numbers θj in the

matrix At by zero is bounded above by At in a coordinatewise sense. By the nonnegativityof the entries this implies that Bn ≤ A0A−1 · · ·A−n+1 and hence Bn → 0. This canhappen only if all eigenvalues of B are inside the unit circle. Now

det(B − zI) = (−z)p+q−1φ(1/z).

Thus z is a zero of φ if and only if z−1 is an eigenvalue of B. We conclude that φ has nozeros on the unit disc.

Finally, we show the necessity of the top Lyapounov exponent being negative if Zt

is i.i.d.. If there exists a strictly stationary solution to the GARCH equations, then, by(9.6) and the nonnegativity of the coefficients,

∑nj=1A0A−1 · · ·A−n+1b ≤ Y0 for every

n, and hence A0A−1 · · ·A−n+1b → 0 as n → ∞, almost surely. By the form of b this isequivalent to A0A−1 · · ·A−n+1e1 → 0. Using the structure of the matrices At we nextsee that A0A−1 · · ·A−n+1 → 0 in probability as n → ∞, by an argument similar asin the proof of Theorem 9.10. Because the matrices At are independent and the eventwhere A0A−1 · · ·A−n+1 → 0 is a tail event, this event must have probability one. It canbe shown that this is possible only if the top Lyapounov exponent of the matrices At isnegative.

See Bougerol (), Lemma ?.


9.14 Lemma. Let φ be a polynomial without roots on the unit disc and let Xt bea time series that is bounded in probability. If Zt = φ(B)Xt for every t, then Xt isσ(Zt, Zt−1, . . .)-measurable.

Proof. Because φ(0) 6= 0 by assumption, we can assume without loss of generality thatφ possesses intercept 1. If φ is of degree 0, then Xt = Zt for every t and the assertion iscertainly true. We next proceed by induction on the degree of φ. If φ is of degree p ≥ 1,then we can write it as φ(z) = (1−φz)φp−1(z) for a polynomial φp−1 of degree p−1 and acomplex number φ with |φ| < 1. The series Yt = (1−φB)Xt is bounded in probability andφp−1(B)Yt = Zt, whence Yt is σ(Zt, Zt−1, . . .)-measurable, by the induction hypothesis.

By iterating the relation Xt = φXt−1 + Yt, we find that Xt = φnXt−n +∑n−1

j=0 φjYt−j .

Because the sequence Xt is uniformly tight and φn → 0, the sequence φnXt−n convergesto zero in probability as n→ ∞. Hence Xt is the limit in probability of a sequence thatis σ(Yt, Yt−1, . . .)-measurable and hence is σ(Zt, Zt−1, . . .)-measurable. This implies theresult.

** 9.15 EXERCISE. In the preceding lemma the function ψ(z) = 1/φ(z) possesses a powerseries representation ψ(z) =

∑∞j=0 ψjz

j on a neighbourhood of the unit disc. Is it true

under the conditions of the lemma that Xt =∑∞

j=0 ψjZt−j , where the series converges(at least) in probability?

9.16 Example. For the GARCH(1, 1) process the random matrices At given by (9.5)reduce to the random variables φ1 + θ1Z

2t−1. The top Lyapounov exponent of these ran-

dom (1 × 1) matrices is equal to E log(φ1 + θ1Z2t ). This number can be written as an

integral relative to the distribution of Zt, but in general is not easy to compute analyt-ically. Figure 9.3 shows the points (φ1, θ1) for which a strictly stationary GARCH(1, 1)process exists in the case of Gaussian innovations Zt. It also shows the line φ1 + θ1 = 1,which is the boundary line for existence of a stationary solution.

9.1.3 Stability and Persistence

Lemma 9.3 gives a simple recipe for constructing a GARCH process, at least on thepositive time set. In the preceding section it was seen that a (strictly) stationary versionexists only for certain parameter values, and then is unique. Clearly this stationarysolution is obtained by the recipe of Lemma 9.3 only if the initial values in this lemmaare chosen according to the stationary distribution.

In the following theorem we show that the effect of a “nonstationary” initializationwears off as t→ ∞ and the process will approach stationarity, provided that a stationarysolution exists. This is true both for L2-stationarity and strict stationarity, under theappropriate conditions on the coefficients.

9.17 Theorem. For a number α > 0 and nonnegative numbers φ1, . . . , φp, θ1, . . . , θqand a martingale difference series Zt satisfying (9.2), let Xt = σtZt and Xt = σtZt betwo solutions to the of the GARCH equation.


0.0 0.5 1.0 1.5 2.0 2.5 3.0

0.0

0.5

1.0

1.5

2.0

2.5

3.0

**********************************************************************************************************************************************************************************************************************************************************************************************************************

***************************************************************************************************************************************************************************************************************************************************************************************************

*************************************************************************************************************************************************************************************************************************************************************************************

*************************************************************************************************************************************************************************************************************************************************************************

***************************************************************************************************************************************************************************************************************************************************************

******************************************************************************************************************************************************************************************************************************************************

**********************************************************************************************************************************************************************************************************************************************

**************************************************************************************************************************************************************************************************************************************

*******************************************************************************************************************************************************************************************************************************

************************************************************************************************************************************************************************************************************************

******************************************************************************************************************************************************************************************************************

************************************************************************************************************************************************************************************************************

*******************************************************************************************************************************************************************************************************

*************************************************************************************************************************************************************************************************

********************************************************************************************************************************************************************************************

***************************************************************************************************************************************************************************************

**********************************************************************************************************************************************************************************

******************************************************************************************************************************************************************************

*************************************************************************************************************************************************************************

*********************************************************************************************************************************************************************

*****************************************************************************************************************************************************************

*************************************************************************************************************************************************************

*********************************************************************************************************************************************************

*****************************************************************************************************************************************************

**************************************************************************************************************************************************

**********************************************************************************************************************************************

*******************************************************************************************************************************************

***************************************************************************************************************************************

************************************************************************************************************************************

*********************************************************************************************************************************

*****************************************************************************************************************************

**************************************************************************************************************************

***********************************************************************************************************************

********************************************************************************************************************

*****************************************************************************************************************

***************************************************************************************************************

************************************************************************************************************

*********************************************************************************************************

******************************************************************************************************

****************************************************************************************************

*************************************************************************************************

***********************************************************************************************

********************************************************************************************

******************************************************************************************

***************************************************************************************

*************************************************************************************

***********************************************************************************

********************************************************************************

******************************************************************************

****************************************************************************

**************************************************************************

************************************************************************

*********************************************************************

*******************************************************************

*****************************************************************

***************************************************************

*************************************************************

***********************************************************

**********************************************************

********************************************************

******************************************************

****************************************************

**************************************************

************************************************

***********************************************

*********************************************

*******************************************

******************************************

****************************************

**************************************

*************************************

***********************************

**********************************

********************************

*******************************

*****************************

****************************

**************************

*************************

***********************

**********************

*********************

*******************

******************

*****************

*************************************************************************************************************

Figure 9.3. The shaded area gives all points (φ, θ) where E log(φ+θZ2) < 0 for a standard normal variableZ. For each point in this area a strictly stationary GARCH process with Gaussian innovations exists. Thisprocess is stationary if and only if (φ, θ) falls under the line φ + θ = 1, which is also indicated.

(i) If∑

j(φj + θj) < 1 and Xt and Xt are square-integrable, then Xt − Xt converges tozero in L2 as t→ ∞.

(ii) If the top Lyapounov exponent of the matrices At in (9.5) is negative, then Xt− Xt

converges to zero in probability as t→ ∞.

Proof. From the two given GARCH processes Xt and Xt define processes Yt and Yt asindicated preceding the statement of Theorem 9.13. These processes satisfy (9.6) for thematrices At given in (9.5). Choosing n = t− 1 and taking differences we see that

Yt − Yt = AtAt−1 · · ·A1(Y0 − Y0).

If the top Lyapounov exponent of the matrices At is negative, then the norm of theright side can be bounded, almost surely for sufficiently large t, by by eγ

′t‖Y0 − Y0‖ forsome number γ′ < 0. This follows from the subergodic theorem, as before (even thoughthis time the matrix product grows on its left side). This converges to zero as t → ∞,implying that σ2

t − σ2t = Yt,1 − Yt,1 → 0 almost surely as t → ∞. Hence σt − σt → 0

almost surely, and Xt − Xt = (σt − σt)Zt → 0 in probability, because Zt is bounded inprobability. This concludes the proof of (ii).

Under the condition of (i), the spectral radius of the matrix A = EAt is strictlysmaller than 1 and hence ‖An‖ → 0. By the nonnegativity of the entries of the matricesAt the absolute values of the coordinates of the vectors Yt − Yt are bounded above bythe coordinates of the vector AtAt−1 · · ·A1W0, for W0 the vector obtained by replacingthe coordinates of Y0 − Y0 by their absolute values. By the independence of the matrices


At and vector W0, the expectation of AtAt−1 · · ·A1W0 is bounded by AtEW0, whichconverges to zero. Because σ2

t = Yt,1 and Xt = σtZt, this implies that, as t→ ∞,

E|Xt − Xt|2 = E|σt − σt|2Z2t ≤ E|σ2

t − σ2t | → 0.

In the last step we use nonnegativity of the volatility. and the inequality |x−y|2 ≤ |x2−y2|for nonnegative numbers x, y. This concludes the proof.

The preceding theorem exhibits two different notions of “persistence” of the initialvalues to a GARCH series. If the condition for strict stationarity are satisfied, then anyGARCH series will tend to the stationary solution and hence the influence of the initialvalues wears off as time goes to infinity. The initial values are not “persistent” in thiscase. If the stronger condition

∑

j(φj+θj) < 1 for L2-stationarity holds, then the processwill also approach stationarity in the stronger L2-sense.

It is a little counter-intuitive that for parameter values for which the top Lyapounovexponent is negative, but

∑

j(φj+θj) ≥ 1, the initial values are persistent in an L2-sense.There is no stationary GARCH process in this situation, and hence the strictly stationaryGARCH process must have infinite variance. An arbitrary GARCH process will approachthe strictly stationary process and therefore its variance must tend to infinity (see theexercise below). We shall see in the next section that the initialization then may influencethe way that the second moments explode. (Depending on the initialization the variancesmay be finite or infinite. The conditional variances are finite by assumption.)

The case that∑

j(φi + θj) = 1 is often viewed as having particular interest and isreferred to as integrated GARCH or IGARCH. Many financial time series yield GARCHfits that are close to IGARCH.

9.18 EXERCISE. Suppose that the time series Xt is strictly stationary with infinitesecond moments and Xt − Xt → 0 in probability as t→ ∞. Show that EX2

t → ∞.

9.1.4 Prediction

As it is a martingale difference process, a GARCH process does not allow nontriv-ial predictions of its mean values. However, it is of interest to predict the condi-tional variances σ2

t , or equivalently the process of squares X2t . Predictions based on

the infinite past Ft can be obtained using the auto-regressive representation Yt =AtYt−1 + b given in (9.5) and used in the proofs of Theorems 9.13 and 9.10, forYt = (σ2

t , . . . , σ2t−p+1, X

2t−1, . . . , X

2t−q+1)

T . The vector Yt−1 is Ft−2-measurable, and thematrix At depends on Zt−1 only, with A = E(At| Ft−2) independent of t. It follows that

E(Yt| Ft−2) = E(At| Ft−2)Yt−1 + b = AYt−1 + b.

By iterating this equation we find that, for h > 1,

(9.9) E(Yt| Ft−h) = Ah−1Yt−h+1 +h−2∑

j=0

Ajb.


As σ2t is the first coordinate of Yt, this equation gives in particular the predictions for

the volatility, based on the infinite past. If∑

j(φj + θj) < 1, then the spectral radius ofthe matrix A is strictly smaller than 1, and both terms on the right converge to zero atan exponential rate, as h → ∞. In this case the potential of predicting the conditionalvariance process is limited to the very near future. The case that

∑

j(φj + θj) ≥ 1 ismore interesting, as is illustrated by the following example.

9.19 Example (GARCH(1,1)). For a GARCH(1, 1) process the vector Yt is equal toσ2t and the matrix A reduces to the number φ1 + θ1. The general equation (9.9) can be

rewritten in the form

(9.10) E(X2t+h| Ft) = E(σ2

t+h| Ft) = (φ1 + θ1)h−1σ2

t+1 + α

h−2∑

j=0

(φ1 + θ1)j .

For φ1 + θ1 < 1 the first term on the far right converges to zero as h → ∞, indicatingthat information at the present time t does not help to predict the conditional varianceprocess in the “infinite future”. On the other hand, if φ1 + θ1 ≥ 1 and α > 0 thenboth terms on the far right side contribute positively as h→ ∞. If φ1 + θ1 = 1, then thecontribution of the term (φ1+θ1)

h−1σ2t = σ2

t is constant, and hence tends to zero relativeto the nonrandom term (which becomes α(h− 1) → ∞), as h→ ∞. If φ1 + θ1 > 1, thenthe contributions of the two terms are of the same order. In both cases the volatility σ2

t

persists into the future, although in the first case its relative influence on the predictiondwindles. If φ1+θ1 > 1, then the value σ2

t is particularly “persistent”; this happens evenif the time series tends to (strict) stationarity in distribution.

9.20 EXERCISE. Suppose that∑

j(φj + θj) < 1 and let Xt be a stationary GARCH

process. Show that E(X2t+h| Ft) → EX2

t as h→ ∞. [Hint: use (9.9).]

* 9.2 Linear GARCH with Leverage and Power GARCH

Fluctuations of foreign exchange rates tend to be symmetric, in view of the two-sidednature of the foreign exchange market. However, it is an empirical finding that for assetprices the current returns and future volatility are negatively correlated. For instance, acrash (small returns) in the stock market is often followed by large volatility.

A linear GARCH model is not able to capture this type of asymmetric relationship,because it models the volatility as a function of the squares of the past returns. Oneattempt to allow for asymmetry is to replace the GARCH equation (9.1) by

σ2t = α+ φ1σ

2t−1 + · · ·+ φpσ

2t−p + θ1(|Xt−1|+ γ1Xt−1)

2 + · · ·+ θq(|Xt−q|+ γqXt−q)2.

This reduces to the ordinary GARCH equation if the leverage coefficients γi are setequal to zero. If these coefficients are negative, then a positive deviation of the process

9.3: Exponential GARCH 151

Xt contributes to lower volatility in the near future, and conversely (see Figure 9.4). Atime series with conditional variances of this type is called GARCH with leverage.

A power GARCHmodel is obtained by replacing the squares in the preceding displayby other powers.

-1.5 -1.0 -0.5 0.0 0.5 1.0 1.5

0.0

0.5

1.0

1.5

2.0

2.5

3.0

Figure 9.4. The function x 7→ (|x|+ γx)2 for γ = −0.2.

* 9.3 Exponential GARCH

The exponential GARCH, or EGARCH, model differs significantly from the GARCHmodels described so far. It retains the basic set-up of a process of the form Xt = σtZt

for a martingale difference sequence Zt satisfying (9.2) and an Ft−1-adapted process σt,but replaces the GARCH equation by

log σ2t = α+φ1 log σ

2t−1+· · ·+φp log σ2

t−p+θ1(|Zt−1|+γ1Zt−1)+· · ·+θq(|Zt−q|+γqZt−q).

The inclusion of both the variables Zt and their absolute values and the transformation tothe logarithmic scale attempts to capture the leverage effect. An advantage of modellingthe logarithm of the volatility is that the parameters of the model need not be restrictedto be positive.

Because the EGARCH model specifies the log volatility directly in terms of the noiseprocess Zt and its own past, its definition is less recursive than the ordinary GARCHdefinition, and easier to handle. In particular, for fixed and identical leverage coefficients


γi = γ the EGARCH equation describes the log volatility process log σ2t as a regular

ARMA process driven by the noise process |Zt| + γZt, and we may use the theory forARMA processes to study its properties. In particular, if the roots of the polynomialφ(z) = 1 − φ1z − · · · − φpz

p are outside the unit circle, then there exists a stationarysolution log σ2

t that is measurable relative to the σ-field generated by the process Zt−1.If the process Zt is strictly stationary, then so is the stationary solution log σ2

t and so isthe EGARCH process Xt = σtZt.

* 9.4 GARCH in Mean

By its definition a GARCH process is a white noise process, and thus can be a usefulcandidate to drive another process. For instance, an observed process Yt could be assumedto satisfy the ARMA equation

φ(B)Yt = θ(B)Xt,

for Xt a GARCH process, relative to (other) polynomials φ and θ (unrelated to φ and θ).One then says that Yt is “ARMA in the mean” and “GARCH in the variance”, or that Ytis an ARMA-GARCH series. Results on ARMA processes that hold for any driving whitenoise process will clearly also hold in the present case, where the white noise process isa GARCH process.

9.21 EXERCISE. Let Xt be a stationary GARCH process relative to polynomials φ andθ and let the time series Yt be the unique stationary solution to the equation φ(B)Yt =θ(B)Xt, for φ and θ polynomials that have all their roots outside the unit disc. Let Ft bethe filtration generated by Yt. Show that var(Yt| Ft−1) = var(Xt|Xt−1, Xt−2, . . .) almostsurely.

It has been found useful to go a step further and let also the conditional variance ofthe driving GARCH process appear in the mean model for the process Yt. Thus given aGARCH process Xt with conditional variance process σ2

t = var(Xt| Ft−1) it is assumedthat Yt = f(σt, Xt) for a fixed function f . The function f is assumed known up to anumber of parameters. For instance,

φ(B)Yt = ψσt + θ(B)Xt,

φ(B)Yt = ψσ2t + θ(B)Xt,

φ(B)Yt = ψ log σ2t + θ(B)Xt.

These models are known as GARCH-in-mean, or GARCH-M models.

10State Space Models

A time series or discrete time stochastic process Xt is Markov if the conditional distri-bution of Xt+1 given the “past” values Xt, Xt−1, . . . depends on the “present” value Xt

only, for every t. The evolution of a Markov process can be pictured as a sequence ofmoves in a “state space”, where at each time t the process chooses a new state Xt+1

according to a random mechanism based only on the current state Xt, and not on thepast evolution. The current state thus contains all necessary information to determinethe future.

Markov structure obviously facilitates prediction, since only the current state isneeded to predict the future. It also permits a simple factorization of the likelihood,which aids in defining and analuzing statistical procedures.

Markov structure can be created in any time series by redefining the states of theseries to incorporate enough past information. The “present state” should contain allinformation that is relevant for the next state. In fact, given an arbitrary time series Xt,the process ~Xt = (Xt, Xt−1, . . .) is Markov. However, the high complexity of the state

space of the process ~Xt offsets the advantages of the Markov property.General state space models take the idea of states that contain the relevant informa-

tion as a basis, but go beyond Markov models. They typically consist of a specificationof a Markov process together with a process of ’observable outputs’. Although they comefrom a different background, hidden Markov models are almost identical structures.

The showpiece of state space modelling is the Kalman filter. This is an algorithm tocompute linear predictions for certain linear state space models, under the assumptionthat the parameters of the system are known. Because the formulas for the predictors,which are functions of the parameters and the outputs, can in turn be used to set upestimating equations for the parameters, the Kalman filter is also important for statisticalanalysis. We start discussing parameter estimation in Chapter 11.

10.1 Example (AR(1)). A causal, stationary AR(1) process with i.i.d. innovations Zt

is a Markov process: the “future value” Xt+1 = φXt + Zt+1 given the “past values”X1, . . . , Xt depends on the “present value” Xt only. Specifically, the conditional density

154 10: State Space Models

of Xt+1 is given by

pXt+1|X1,...,Xt(x) = pZ(x− φXt).

The assumption of causality ensures that Zt+1 is independent of X1, . . . , Xt. The Markovstructure allows to factorize the density of (X1, . . . , Xn), as

pX1,...,Xn(X1, . . . , Xn) =

n∏

t=2

pZ(Xt − φXt−1)pX1(X1).

This expression as a function of the unknown parameter φ is the likelihood for the model.

10.2 EXERCISE. How can a causal AR(p) process be forced into Markov form? Give aprecise argument. Is “causal” needed?

10.1 Hidden Markov Models and State Spaces

A hidden Markov model consists of a Markov chain, but rather than the state at time t,we observe an “output” variable, which is generated by a mechanism which depends onthe state. This is illustrated in Figure 10.1, where the sequence . . . , X1, X2, . . . depictsa Markov chain, and . . . , Y1, Y2, . . . are the outputs at times 1, 2, . . .. It is assumed thatgiven the state sequence . . . , X1, X2, . . ., the outputs . . . , Y1, Y2, . . . are independent, andthe conditional distribution of Yt given the states depends on Xt only. Thus the arrowsin the picture indicate dependence relationships.

X1 X2 X3

. . .

Xn−1 Xn

Y1 Y2 Y3

. . .

Yn−1 Yn

Figure 10.1. Hidden Markov model. The variables X1, . . . , Xn form a Markov chain. The variablesY1, . . . , Yn are conditionally independent given this chain, with the conditional marginal distribution of Yi

depending on Xi only. Only the variables Y1, . . . , Yn are observed.

10.1: Hidden Markov Models and State Spaces 155

The probabilistic properties of a hidden Markov process are simple. For instance,the joint sequence (Xt, Yt) can be seen to be Markovian, and we can easily write downthe joint density of the variables (X1, Y1, X2, Y2, . . . , Xn, Yn) as

p(x1)p(x2|x1)× · · · × p(xn|xn−1) p(y1|x1)× · · · × p(yn|xn).

(Here, as later on in the chapter, we abuse notation by writing the conditional density ofa variable Y given a variable X as p(y|x) and a marginal density of X as p(x), thus usingthe symbol p for any density function, leaving it to the arguments to reveal which densityit is.) The difficulty is to do statistics in the case where only the outputs Y1, . . . , Yn areobserved. Typical aims are to estimate parameters of the underlying distribution, and tocompute “best guesses” of the unobserved states given the observed outputs.

Many state space models are hidden Markov models, although they are often de-scribed in a different manner. Given an “initial state” X0, “disturbances” V1,W1, V2, . . .and functions ft and gt, processes Xt and Yt are defined recursively by, for t = 1, 2, . . .,

(10.1)Xt = ft(Xt−1, Vt),

Yt = gt(Xt,Wt).

We refer to Xt as the “state” at time t and to Yt as the “output”. The state process Xt

can be viewed as primary and evolving in the background, describing the consecutivestates of a system in time. At each time t the system is “measured”, resulting in theobserved output Yt.

If the sequence X0, V1, V2, . . . consists of independent variables, then the state pro-cess Xt is a Markov chain. If, moreover, the variables X0, V1,W1, V2,W2, V3, . . . are inde-pendent, then for every t given the state Xt the output Yt is conditionally independentof the states X0, X1, . . . and outputs (Ys: s 6= t). Then the state space model becomes ahidden Markov model. Conversely, any hidden Markov model arises in this form.

10.3 Lemma. If X0, V1,W1, V2,W2, V3, . . . are independent random variables, then thepair of sequences Xt and Yt defined in (10.1) forms a hidden Markov model: the sequenceX1, X2, . . . forms a Markov chain, the variables Y1, Y2, . . . are conditionally independentgiven X0, X1, X2, . . ., and Yt is conditionally independent of (Xs: s 6= t) given Xt, forevery t. Conversely, every hidden Markov model with state and output variables Xt andYt in Polish spaces can be realized in the form (10.1) for some sequence of independentrandom variables X0, V1,W1, V2,W2, V3, . . . and measurable functions ft and gt.

Proof. Any conditional distribution R of variables with values in Polish spaces (i.e.Markov kernel (x,B) 7→ R(B|x)) from a Polish space D1 into another Polish space D2

can be represented as the distribution of h(x, U) for a measurable map h:D1×[0, 1] → D2,and a uniform random variable U (i.e. R(x,B) = P

(

h(x, U) ∈ B)

). We apply this tothe conditional laws of Xt given Xt−1 and of Yt given Xt to find functions ft such thatXt = ft(Xt−1, Vt) and gt such that Yt = gt(Xt,Wt), where the variables Vt and Wt canbe chosen independent and uniformly distributed.


The variables Vt and Wt are never observed, and in the state space model (10.1)are never observed, and typically the state process Xt is not either. At time t we onlyobserve the output Yt. For this reason the process Yt is also referred to as themeasurementprocess. The second equation in the display (10.1) is called the measurement equation,while the first is the state equation. Inference might be directed at estimating parametersattached to the functions ft or gt, to the distribution of the errors or to the initialstate, and/or on predicting or reconstructing the states Xt from the observed outputsY1, . . . , Yn. Predicting or reconstructing the state sequence is referred to as “filtering” or“smoothing”.

For linear functions ft and gt and vector-valued states and outputs the state spacemodel can without loss of generality be written in the form

(10.2)Xt = FtXt−1 + Vt,

Yt = GtXt +Wt.

The matrices Ft and Gt are often postulated to be independent of t. In this linearstate space model the analysis usually concerns linear predictions, and then a commonassumption is that the vectors X0, V1,W1, V2, . . . are uncorrelated. If Ft is independent oft and the vectors Vt form a white noise process, then the series Xt is a VAR(1) process.

Because state space models are easy to handle, it is of interest to represent a givenobservable time series Yt as the output of a state space model. This entails finding astate space, a state process Xt, and a corresponding state space model with the givenseries Yt as output. It is particularly attractive to find a linear state space model. Such astate space representation is definitely not unique. An important issue in systems theoryis to find a (linear) state space representation of minimal dimension.

10.4 Example (State space representation ARMA). Let Yt be a stationary, causalARMA(p, q) process satisfying φ(B)Yt = θ(B)Zt for an i.i.d. process Zt. Then the AR(p)process Xt = (1/φ)(B)Zt is related to Yt through Yt = θ(B)Xt. Thus if r ≥ (p − 1) ∨ qand we pad the set of coefficients of the polynomials φ or θ with zeros to increase theirnumbers to r and r + 1, if necessary, then

Yt = ( θ0, . . . , θr )

Xt...

Xt−r

,

Xt

Xt−1...

Xt−r

=

φ1 φ2 · · · φr φr+1

1 0 · · · 0 0...

......

...0 0 · · · 1 0

Xt−1Xt−2...

Xt−r−1

+

Zt

0...0

.

This is a linear state space representation (10.2) with state vector (Xt, . . . , Xt−r), andmatrices Ft and Gt, that are independent of t. Under causality the innovations Vt =(Zt, 0, . . . , 0) are orthogonal to the past Xt and Yt; the innovations Wt as in (10.2) areidentically zero. The state vectors are typically unobserved, except when θ is of degree

10.1: Hidden Markov Models and State Spaces 157

zero. (If the ARMA process is invertible and the coefficients of θ are known, then thestates can be reconstructed from the infinite past through the relation Xt = (1/θ)(B)Yt.)

In the present representation the state-dimension of the ARMA(p, q) process is r+1 = max(p, q + 1). By using a more complicated noise process it is possible to representan ARMA(p, q) process in dimension max(p, q), but this difference appears not to bevery important.♯

10.5 EXERCISE. Find a state space representation of dimension p ∨ q.

10.6 Example (State space representation ARIMA). Consider a time series Zt whosedifferences Yt = ∇Zt satisfy the linear state space model (10.2) for a state sequence Xt.Writing Zt = Yt + Zt−1 = GtXt +Wt + Zt−1, we obtain that

(

Xt

Zt−1

)

=

(

Ft 0Gt−1 1

)(

Xt−1Zt−2

)

+

(

VtWt−1

)

,

Zt = (Gt 1 )

(

Xt

Zt−1

)

+Wt.

We conclude that the time series Zt possesses a linear state space representation, withstates of one dimension higher than the states of the original series.

A drawback of the preceding representation is that the error vectors (Vt,Wt−1,Wt)are not necessarily uncorrelated if the error vectors (Vt,Wt) in the system with outputsYt have this property. In the case that Zt is an ARIMA(p, 1, q) process, we may use thestate representation of the preceding example for the ARMA(p, q) process Yt, which haserrors Wt = 0, and this disadvantage does not arise. Alternatively, we can avoid thisproblem by using another state space representation. For instance, we can write

(

Xt

Zt

)

=

(

Ft 0GtFt 1

)(

Xt−1Zt−1

)

+

(

VtGtVt +Wt

)

,

Zt = ( 0 1 )

(

Xt

Zt

)

.

This illustrates that there may be multiple possibilities to represent a time series as theoutput of a (linear) state space model.

The preceding can be extended to general ARIMA(p, d, q) models. If Yt = (1−B)dZt,

then Zt = Yt −∑d

j=1

(

dj

)

(−1)jZt−j . If the process Yt can be represented as the outputof a state space model with state vectors Xt, then Zt can be represented as the outputof a state space model with the extended states (Xt, Zt−1, . . . , Zt−d), or, alternatively,(Xt, Zt, . . . , Zt−d+1).

10.7 Example (Stochastic linear trend). A time series with a linear trend could bemodelled as Yt = α + βt +Wt for constants α and β, and a stationary process Wt (forinstance an ARMA process). This restricts the nonstationary part of the time series to

♯ See e.g. Brockwell and Davis, p469–471.


a deterministic component, which may be unrealistic. An alternative is the stochasticlinear trend model described by

(

At

Bt

)

=

(

1 10 1

)(

At−1Bt−1

)

+ Vt,

Yt = At +Wt.

The stochastic processes (At, Bt) and noise processes (Vt,Wt) are unobserved. This statespace model contains the deterministic linear trend model as the degenerate case whereVt ≡ 0, so that Bt ≡ B0 and At ≡ A0 +B0t.

The state equations imply that ∇At = Bt−1 + Vt,1 and ∇Bt = Vt,2, for Vt =(Vt,1, Vt,2)

T . Taking differences on the output equation Yt = At +Wt twice, we find that

∇2Yt = ∇Bt−1 +∇Vt,1 +∇2Wt = Vt,2 +∇Vt,1 +∇2Wt.

If the process (Vt,Wt) is a white noise process, then the auto-correlation function ofthe process on the right vanishes for lags bigger than 2 (the polynomial ∇2 = (1− B)2

being of degree 2). Thus the right side is an MA(2) process, whence the process Yt is anARIMA(0,2,2) process.

* 10.8 Example (Structural models). Besides a trend we may suspect that a given timeseries shows a seasonal effect. One possible parametrization of a deterministic seasonaleffect with S seasons is the function

(10.3) t 7→ γ0 +

⌊S/2⌋∑

s=1

(

γs cos(λst) + δs sin(λst))

, λs =2πs

S, s = 1, . . . , ⌊S/2⌋.

By appropriate choice of the parameters γs and δs this function is able to adapt to anyperiodic function on the integers with period S. (See Exercise 10.10. If S is even, then the“last” sinus function t 7→ sin(λS/2t) vanishes, and the coefficient δs is irrelevant. Thereare always S nontrivial terms in the sum.) We could add this deterministic function to agiven time series model in order to account for seasonality. Again it may not be realisticto require the seasonality a-priori to be deterministic. An alternative is to replace thefixed function s 7→ (γs, δs) by the time series defined by

(

γs,tδs,t

)

=

(

cosλs sinλssinλs − cosλs

)(

γs,t−1δs,t−1

)

+ Vs,t.

An observed time series may next have the form

Yt = ( 1 1 . . . 1 )

γ1,tγ2,t...γs,t

+ Zt.

Together these equations again constitute a linear state space model. If Vt = 0, then thisreduces to the deterministic trend model. (Cf. Exercise 10.9.)

A model with both a stochastic linear trend and a stochastic seasonal component isknown as a structural model.

10.2: Kalman Filtering 159

10.9 EXERCISE. Consider the state space model with state equations γt = γt−1 cosλ+δt−1 sinλ+Vt,1 and δt = γt−1 sinλ− δt−1 cosλ+Vt,2 and output equation Yt = γt +Wt.What does this model reduce to if Vt ≡ 0?

10.10 EXERCISE.(i) Show that the function (of t ∈ Z) in (10.3) is periodic with period S.(ii) Show that any periodic function f :Z → R with period S can be written in the form

(10.3).[For (ii) it suffices to show that any vector

(

f(1), . . . , f(S))

can be represented as a linear

combination of the vectors as =(

cosλs, cos(2λs), . . . , cos(Sλs))

for s = 0, 1, . . . , ⌊S/2⌋and the vectors bs =

(

sinλs, sin(2λs), . . . , sin(Sλs))

, for s = 1, 2, . . . , ⌊S/2⌋. To provethis note that the vectors es = (eiλs , e2iλs . . . , einλs)T for λs running through the naturalfrequencies is an orthogonal basis of Cn. The vectors as = Re es and bs = Im es can beseen to be orthogonal, apart from the trivial cases b0 = 0 and bS/2 = 0 if S is even.]

10.1.1 Notation

The variables Xt and Yt in a state space model will typically be random vectors. For tworandom vectors X and Y of dimensions m and n the covariance or “cross-covariance” isthe (m× n) matrix Cov(X,Y ) = E(X −EX)(Y −EY )T . The random vectors X and Yare called “uncorrelated” if Cov(X,Y ) = 0, or equivalently if cov(Xi, Yj) = 0 for everypair (i, j). The linear span of a set of vectors is defined as the linear span of all theircoordinates. Thus this is a space of (univariate) random variables, rather than randomvectors! We shall also understand a projection operator Π, which is a map on the spaceof random variables, to act coordinatewise on vectors: if X is a vector, then ΠX is thevector consisting of the projections of the coordinates of X. As a vector-valued operatora projection Π is still linear, in that Π(FX + Y ) = FΠX + ΠY , for any matrix F andrandom vectors X and Y .

10.2 Kalman Filtering

The Kalman filter is a recursive algorithm to compute best linear predictions of the statesX1, X2, . . . given observations Y1, Y2, . . . in the linear state space model (10.2). The corealgorithm allows to compute predictions ΠtXt of the states Xt given observed outputsY1, . . . , Yt. Here by “predictions” we mean Hilbert space projections. Given the time val-ues involved, “reconstructions” would perhaps be more appropriate. “Filtering” is thepreferred term in systems theory. (In general this term refers to characterizing (the con-ditional distribution of) a state, or some other quantity at time t, based on observationsuntil time t). Given the reconstructions ΠtXt, it is easy to compute predictions of fu-ture states and future outputs. “Kalman smoothing”, which is the reconstruction (againthrough projections) of the full state sequence X1, . . . , Xn given the outputs Y1, . . . , Yn,requires additional steps. (In general “smoothing” refers to characterizing states, or other


quantities, at time t, by observations until time n > t. The name appears to result fromthe fact that reconstructions of a state sequence using observations that include futuretimes tend to have a smoother appearance.)

In the simplest situation the vectors X0, V1,W1, V2,W2, . . . are assumed uncorre-lated. We shall first derive the filter under the more general assumption that the vectorsX0, (V1,W1), (V2,W2), . . . are uncorrelated, and in Section 10.2.4 we further relax thiscondition. The matrices Ft and Gt as well as the covariance matrices of the noise variables(Vt,Wt) are assumed known.

By applying (10.2) recursively, we see that the vector Xt is contained in the linearspan of the variables X0, V1, . . . , Vt. It is immediate from (10.2) that the vector Yt iscontained in the linear span of Xt andWt. These facts are true for every t ∈ N. It followsthat under our conditions the noise variables Vt and Wt are uncorrelated with all vectorsXs and Ys with s < t.

Let H0 be a given closed linear subspace of L2(Ω,U ,P) that contains the constants,and for t ≥ 0 let Πt be the orthogonal projection onto the spaceHt = H0+lin (Y1, . . . , Yt).The space H0 may be viewed as our “knowledge” at time 0; it may be H0 = lin 1. Weassume that the noise vectors V1,W1, V2, . . . are orthogonal to H0. Combined with thepreceding this shows that the vector (Vt,Wt) is orthogonal to the space Ht−1, for everyt ≥ 1.

The Kalman filter consists of the recursions

· · · →

Πt−1Xt−1Cov(Πt−1Xt−1)

Cov(Xt−1)

[1]

→

Πt−1Xt

Cov(Πt−1Xt)Cov(Xt)

[2]

→

ΠtXt

Cov(ΠtXt)Cov(Xt)

→ · · ·

Thus the Kalman filter alternates between “updating the current state”, step [1], and“updating the prediction space”, step [2].

10.11 Theorem (Kalman filter). The projections ΠtXt in the linear state space modelwith uncorrelated vectors X0, (V1,W1), (V2,W2), . . . can be recursively computed for t =1, 2, . . . by alternating the two steps

step [1]

Πt−1Xt = Ft(Πt−1Xt−1),

Cov(Πt−1Xt) = Ft Cov(Πt−1Xt−1)FTt ,

Cov(Xt) = Ft Cov(Xt−1)FTt +Cov(Vt).

(10.4) step [2]ΠtXt = Πt−1Xt + ΛtWt,

Cov(ΠtXt) = Cov(Πt−1Xt) + Λt Cov(Wt)ΛTt ,

where the matrix Λt = Cov(Xt, Wt) Cov(Wt)−1 can be computed as in (10.6).

Proof. For step [1] we note that Πt−1Vt = 0, because Vt ⊥ H0, Y1, . . . , Yt−1 by assump-tion. We next apply Πt to the state equation Xt = FtXt−1 + Vt, and use the linearity ofa projection.

10.2: Kalman Filtering 161

Step [2] is based on the vector Wt = Yt − Πt−1Yt. This is called the innovationat time t, because it is the part of Yt that is not explainable at time t − 1. BecauseΠt−1Wt = 0, we have by the measurement equation,

(10.5) Wt = Yt −Πt−1Yt = Yt −GtΠt−1Xt = Gt(Xt −Πt−1Xt) +Wt.

The innovation is orthogonal to Ht−1, and together with this space spans Ht. It followsthat Ht can be orthogonally decomposed as Ht = Ht−1+lin Wt and hence the projectiononto Ht is the sum of the projections onto the spaces Ht−1 and lin Wt. This gives the firstequation of step [2], where the finite-dimensional projection on lin Wt is indeed givenby the multiplication with the matrix Λt as given. The second equation is an immediateconsequence of the first and the orthogonality. It remains to derive equation (10.6) forΛt.

Because Wt ⊥ Xt−1 the state equation equation yields Cov(Xt,Wt) = Cov(Vt,Wt).By the orthogonality property of projections Cov(Xt, Xt−Πt−1Xt) = Cov(Xt−Πt−1Xt).Combining this and the identity Wt = Gt(Xt −Πt−1Xt) +Wt from (10.5), we compute

(10.6)

Cov(Xt, Wt) = Cov(Xt −Πt−1Xt)GTt +Cov(Vt,Wt),

Cov(Wt) = Gt Cov(Xt −Πt−1Xt)GTt +Gt Cov(Vt,Wt)

+ Cov(Wt, Vt)GTt +Cov(Wt),

Cov(Xt −Πt−1Xt) = Cov(Xt)− Cov(Πt−1Xt).

The matrix Cov(Xt − Πt−1Xt) is the prediction error matrix at time t− 1 and the lastequation follows by Pythagoras’ rule.

The Kalman algorithm must be initialized in one of its two steps, for instance byproviding Π0X1 and its covariance matrix, so that the recursion can start with a stepof type [2]. It is here where the choice of H0 plays a role. Choosing H0 = lin (1) givespredictions using Y1, . . . , Yt as well as an intercept and requires that we know Π0X1 =EX1. It may also be desired that Πt−1Xt is the projection onto lin (1, Yt−1, Yt−2, . . .) fora stationary extension of Yt into the past. Then we set Π0X1 equal to the projection ofX1 onto H0 = lin (1, Y0, Y−1, . . .).

10.2.1 Future States and Outputs

Predictions of future values of the state variable follow easily from ΠtXt, becauseΠtXt+h = Ft+hΠtXt+h−1 for any h ≥ 1. Given the predicted states, future outputscan be predicted from the measurement equation by ΠtYt+h = Gt+hΠtXt+h.

* 10.2.2 Missing Observations

A considerable attraction of the Kalman filter algorithm is the ease by which missingobservations can be accomodated. This can be achieved by simply filling in the missingdata points by “external” variables that are independent of the system. Suppose that(Xt, Yt) follows the linear state space model (10.2) and that we observe a subset (Yt)t∈T


of the variables Y1, . . . , Yn. We define a new set of matrices G∗t and noise variables W ∗tby

G∗t = Gt, W ∗t =Wt, t ∈ T,

G∗t = 0, W ∗t =W t, t /∈ T,

for random vectorsW t that are independent of the vectors that are already in the system.The choice W t = 0 is permitted. Next we set

Xt = FtXt−1 + Vt,

Y ∗t = G∗tXt +W ∗t .

The variables (Xt, Y∗t ) follow a state space model with the same state vectors Xt. For

t ∈ T the outputs Y ∗t = Yt are identical to the outputs in the original system, whilefor t /∈ T the output is Y ∗t = W t, which is pure noise by assumption. Because the noisevariablesW t cannot contribute to the prediction of the hidden states Xt, best predictionsof states based on the observed outputs (Yt)t∈T or based on Y ∗1 , . . . , Y

∗n are identical.

We can compute the best predictions based on Y ∗1 , . . . , Y∗n by the Kalman recursions,

but with the matrices G∗t and Cov(W ∗t ) substituted for Gt and Cov(Wt). Because theY ∗t with t /∈ T will not appear in the projection formula, we can just as well set their“observed values” equal to zero in the computations.

* 10.2.3 Kalman Smoothing

Besides in predicting future states or outputs we may be interested in reconstructing thecomplete state sequence X0, X1, . . . , Xn from the outputs Y1, . . . , Yn. The computationof ΠnXn is known as the filtering problem, and is computed in step [2] of our descriptionof the Kalman filter. The computation of ΠnXt for t = 0, 1, . . . , n − 1 is known as thesmoothing problem. For a given t it can be achieved through the recursions, with Wn asgiven in (10.5),

ΠnXt

Cov(Xt, Wn)Cov(Xt, Xn −Πn−1Xn)

→

Πn+1Xt

Cov(Xt, Wn+1)Cov(Xt, Xn+1 −ΠnXn+1)

, n = t, t+ 1, . . . .

The initial value at n = t of the recursions and the covariance matrices Cov(Wn) of theinnovations Wn are given by (10.6), and hence can be assumed known.

Because Hn+1 is the sum of the orthogonal spaces Hn and lin Wn+1, we have, as in(10.4),

Πn+1Xt = ΠnXt + Λt,n+1Wn+1, Λt,n+1 = Cov(Xt, Wn+1) Cov(Wn+1)−1.

The recursion for the first coordinate ΠnXt follows from this and the recursionsfor the second and third coordinates, the covariance matrices Cov(Xt, Wn+1) andCov(Xt, Xn+1 −ΠnXn+1).

Using in turn the state equation and equation (10.4), we find

Cov(Xt, Xn+1 −ΠnXn+1) = Cov(

Xt, Fn+1(Xn −ΠnXn) + Vn+1

)

= Cov(

Xt, Fn+1(Xn −Πn−1Xn + ΛnWn))

.

10.3: Nonlinear Filtering 163

This readily gives the recursion for the third component, the matrix Λn being knownfrom (10.4)–(10.6). Next using equation (10.5), we find

Cov(Xt, Wn+1) = Cov(Xt, Xn+1 −ΠnXn+1)GTn+1.

* 10.2.4 Lagged Correlations

In the preceding we have assumed that the vectors X0, (V1,W1), (V2,W2), . . . are uncor-related. An alternative assumption is that the vectors X0, V1, (W1, V2), (W2, V3), . . . areuncorrelated. (The awkward pairing of Wt and Vt+1 can be avoided by writing the stateequation as Xt = FtXt−1 + Vt−1 and next making the assumption as before.) Underthis condition the Kalman filter takes a slightly different form, where for economy ofcomputation it can be useful to combine the steps [1] and [2].

Both possibilities are covered by the assumptions that- the vectors X0, V1, V2, . . . are orthogonal.- the vectors W1,W2, . . . are orthogonal.- the vectors Vs and Wt are orthogonal for all (s, t) except possibly s = t or s = t+1.- all vectors are orthogonal to H0.

Under these assumptions step [2] of the Kalman filter remains valid as described. Step[1] must be adapted, because it is no longer true that Πt−1Vt = 0.

Because Vt ⊥ Ht−2, we can compute Πt−1Vt from the innovation decompositionHt−1 = Ht−2 + lin Wt−1, as Πt−1Vt = Kt−1Wt−1 for the matrix

Kt−1 = Cov(Vt,Wt−1) Cov(Wt−1)−1.

Note here that Cov(Vt, Wt−1) = Cov(Vt,Wt−1), in view of (10.5). We replace the calcu-lations for step [1] by

Πt−1Xt = Ft(Πt−1Xt−1) +KtWt−1,

Cov(Πt−1Xt) = Ft Cov(Πt−1Xt−1)FTt +Kt Cov(Wt−1)K

Tt ,

Cov(Xt) = Ft Cov(Xt−1)FTt +Cov(Vt).

This gives a complete description of step [1] of the algorithm, under the assumption thatthe vector Wt−1, and its covariance matrix are kept in memory after the preceding step[2].

The smoothing algorithm goes through as stated except for the recursion for thematrices Cov(Xt, Xn −Πn−1Xn). Because ΠnVn+1 may be nonzero, this becomes

Cov(Xt, Xn+1 −ΠnXn+1) = Cov(

Xt, Xn −Πn−1Xn)FTn+1 +Cov(Xt, Wn)Λ

TnF

Tn+1

+Cov(Xt, Wn)KTn .


* 10.3 Nonlinear Filtering

The simplicity of the Kalman filter results from the combined linearity of the state spacemodel and the predictions. These lead to update formulas expressed in the terms ofmatrix algebra. The principle of recursive predictions can be applied more generally tocompute nonlinear predictions in nonlinear state space models, provided the conditionaldensities of the variables in the system are available and certain integrals involving thesedensities can be evaluated, analytically, numerically, or by stochastic simulation.

Abusing notation we write a conditional density of a variable X given another vari-able Y as p(x| y), and a marginal density of X as p(x). Consider the nonlinear statespace model (10.1), where we assume that the vectors X0, V1,W1, V2, . . . are indepen-dent. Then the outputs Y1, . . . , Yn are conditionally independent given the state sequenceX0, X1, . . . , Xn, and the conditional law of a single output Yt given the state sequencedepends on Xt only. In principle the (conditional) densities p(x0), p(x1|x0), p(x2|x1), . . .and the conditional densities p(yt|xt) of the outputs are available from the form of thefunctions ft and gt and the distributions of the noise variables (Vt,Wt). The joint densityof states up till time n+ 1 and outputs up till time n in this hidden Markov model canbe expressed in these densities as

(10.7) p(x0)p(x1|x0) · · · p(xn+1|xn)p(y1|x1)p(y2|x2) · · · p(yn|xn).

The marginal density of the outputs (Y1, . . . , Yn) is obtained by integrating this functionrelative to (x0, . . . , xn+1). The conditional density of the state sequence (X0, . . . , Xn+1)given the outputs is proportional to the function in the display, the norming constantbeing the marginal density of the outputs. In principle, this allows the computationof all conditional expectations E(Xt|Y1, . . . , Yn), the (nonlinear) “predictions” of thestate. However, because this approach expresses these predictions as a quotient of n+1-dimensional integrals, and n may be large, this is unattractive unless the integrals canbe evaluated easily.

An alternative for finding predictions is a recursive scheme for calculating conditionaldensities, of the form

· · · → p(xt−1| yt−1, . . . , y1)[1]→p(xt| yt−1, . . . , y1)

[2]→p(xt| yt, . . . , y1) → · · · .

This is completely analogous to the updates of the linear Kalman filter: the recursionsalternate between “updating the state”, [1], and “updating the prediction space”, [2].

Step [1] can be summarized by the formula, for µt the dominating measures,

p(xt| yt−1, . . . , y1) =∫

p(xt|xt−1, yt−1, . . . , y1)p(xt−1| yt−1, . . . , y1) dµt−1(xt−1)

=

∫

p(xt|xt−1)p(xt−1| yt−1, . . . , y1) dµt−1(xt−1).

The second equality follows from the conditional independence of the vectors Xt andYt−1, . . . , Y1 given Xt−1. This is a consequence of the form of Xt = ft(Xt−1, Vt)and the independence of Vt and the vectors Xt−1, Yt−1, . . . , Y1 (which are functions ofX0, V1, . . . , Vt−1,W1, . . . ,Wt−1).

10.3: Nonlinear Filtering 165

To obtain a recursion for step [2] we apply Bayes formula to the conditional densityof the pair (Xt, Yt) given Yt−1, . . . , Y1 to obtain

p(xt| yt, . . . , y1) =p(yt|xt, yt−1, . . . , y1)p(xt| yt−1, . . . , y1)

∫

p(yt|xt, yt−1, . . . , y1)p(xt| yt−1, . . . , y1) dµt(xt)

=p(yt|xt)p(xt| yt−1, . . . , y1)

p(yt| yt−1, . . . , y1).

The second equation is a consequence of the fact that Yt = gt(Xt,Wt) is conditionallyindependent of Yt−1, . . . , Y1 given Xt. The conditional density p(yt| yt−1, . . . , y1) in thedenominator is a nuisance, because it will rarely be available explicitly, but acts only asa norming constant.

The preceding formulas are useful only if the integrals can be evaluated. If analyticalevaluation is impossible, then perhaps numerical methods or stochastic simulation couldbe of help.

If stochastic simulation is the method of choice, then it may be attractive to ap-ply Markov Chain Monte Carlo for direct evaluation of the joint law, without recur-sions. The idea is to simulate a sample from the conditional density (x0, . . . , xn+1) 7→p(x0, . . . , xn+1| y1, . . . , yn) of the states given the outputs. The biggest challenge isthe dimensionality of this conditional density. The Gibbs sampler overcomes this bysimulating recursively from the marginal conditional densities p(xt|x−t, y1, . . . , yn)of the single variables Xt given the outputs Y1, . . . , Yn and the vectors X−t =(X0, . . . , Xt−1, Xt+1, . . . , Xn+1) of remaining states. These iterations yield a Markovchain, with the target density (x0, . . . , xn+1) 7→ p(x0, . . . , xn+1| y1, . . . , yn) as a station-ary density, and under some conditions the iterates of the chain can eventually be thoughtof as a (dependent) sample from (approximately) this target density. We refer to the lit-erature for general discussion of the Gibbs sampler, but shall show that these marginaldistributions are relatively easy to obtain for the general state space model (10.1).

Under independence of the vectors X0, V1,W1, V2, . . . the joint density of states andoutputs takes the hidden Markov form (10.7). The conditional density of Xt given theother vectors is proportional to this expression viewed as function of xt only. Only threeterms of the product depend on xt and hence we find

p(xt|x−t, y1, . . . , yn) ≍ p(xt|xt−1)p(xt+1|xt)p(yt|xt).

The norming constant is a function of the conditioning variables x−t, y1, . . . , yn only andcan be recovered from the fact that the left side is a probability density as a functionof xt. A closer look will reveal that it is equal to p(yt|xt−1, xt+1)p(xt+1|xt−1). However,many simulation methods, in particular the popular Metropolis-Hastings algorithm, canbe implemented without an explicit expression for the proportionality constant. Theforms of the three densities on the right side should follow from the specification of thesystem.

The assumption that the variables X0, V1,W2, V2, . . . are independent may be toorestrictive, although it is natural to try and construct the state variables so that it issatisfied. Somewhat more complicated formulas can be obtained under more general


assumptions. Assumptions that are in the spirit of the preceding derivations in thischapter are:(i) the vectors X0, X1, X2, . . . form a Markov chain.(ii) the vectors Y1, . . . , Yn are conditionally independent given X0, X1, . . . , Xn+1.(iii) for each t ∈ 1, . . . , n the vector Yt is conditionally independent of the vector

(X0, . . . , Xt−2, Xt+2, . . . , Xn+1) given (Xt−1, Xt, Xt+1).The first assumption is true if the vectors X0, V1, V2, . . . are independent. The secondand third assumptions are certainly satisfied if all noise vectors X0, V1,W1, V2,W2, V3, . . .are independent. The exercises below give more general sufficient conditions for (i)–(iii)in terms of the noise variables.

In comparison to the hidden Markov situation considered previously not muchchanges. The joint density of states and outputs can be written in a product form similarto (10.7), the difference being that each conditional density p(yt|xt) must be replaced byp(yt|xt−1, xt, xt+1). The variable xt then occurs in five terms of the product and hencewe obtain

p(xt|x−t, y1, . . . ,yn) ≍ p(xt+1|xt)p(xt|xt−1)×× p(yt−1|xt−2, xt−1, xt)p(yt|xt−1, xt, xt+1)p(yt+1|xt, xt+1, xt+2).

This formula is general enough to cover the case of the ARV model discussed in the nextsection.

10.12 EXERCISE. Suppose that X0, V1,W1, V2,W2, V3, . . . are independent, and definestates Xt and outputs Yt by (10.1). Show that (i)–(iii) hold, where in (iii) the vector Ytis even conditionally independent of (Xs: s 6= t) given Xt.

10.13 EXERCISE. Suppose that X0, V1, V2, . . . , Z1, Z2, . . . are independent, and definestates Xt and outputs Yt through (10.2) with Wt = ht(Vt, Vt+1, Zt) for measurablefunctions ht. Show that (i)–(iii) hold. [Under (10.2) there exists a measurable bijectionbetween the vectors (X0, V1, . . . , Vt) and (X0, X1, . . . , Xn), and also between the vectors(Xt, Xt−1, Xt+1) and (Xt, Vt, Vt+1). Thus conditioning on (X0, X1, . . . , Xn+1) is the sameas conditioning on (X0, V1, . . . , Vn+1) or on (X0, V1, . . . , Vn, Xt−1, Xt, Xt+1).]

* 10.14 EXERCISE. Show that the condition in the preceding exercise that Wt =ht(Vt, Vt+1, Zt) for Zt independent of the other variables is equivalent to the conditionalindependence of Wt and X0, V1, . . . , Vn,Ws: s 6= t given Vt, Vt+1.

10.4 Stochastic Volatility Models

The term “volatility”, which we have used at multiple occasions to describe the “mov-ability” of a time series, appears to have its origins in the theory of option pricing. The

10.4: Stochastic Volatility Models 167

Black-Scholes model for pricing an option on a given asset with price St is based on adiffusion equation of the type

dSt = µtSt dt+ σtSt dBt.

Here Bt is a Brownian motion process and µt and σt are stochastic processes, whichare usually assumed to be adapted to the filtration generated by the process St. In theoriginal Black-Scholes model the process σt is assumed constant, and the constant isknown as the “volatility” of the process St.

The Black-Scholes diffusion equation can also be written in the form

logSt

S0=

∫ t

0

(µs − 12σ

2s) ds+

∫ t

0

σs dBs.

If µ and σ are deterministic processes this shows that the log returns logSt/St−1 overthe intervals (t− 1, t] are independent, normally distributed variables (t = 1, 2, . . .) with

means∫ t

t−1(µs − 12σ

2s) ds and variances

∫ t

t−1 σ2s ds. In other words, if these means and

variances are denoted by µt and σ2t , then the variables

Zt =logSt/St−1 − µt

σt

are an i.i.d. sample from the standard normal distribution. The standard deviation σt

can be viewed as an “average volatility” over the interval (t − 1, t]. If the processes µt

and σt are not deterministic, then the process Zt is not necessarily Gaussian. However,if the unit of time is small, so that the intervals (t − 1, t] correspond to short timeintervals in real time, then it is still believable that the variables Zt are approximatelynormally distributed. In that case it is also believable that the processes µt and σt areapproximately constant and hence these processes can replace the averages µt and σt.Usually, one even assumes that the process µt is constant in time. For simplicity ofnotation we shall take µt to be zero in the following, leading to a model of the form

logSt/St−1 = σtZt,

for standard normal variables Zt and a “volatility” process σt. The choice µt = µt− 12σ

2t =

0 corresponds to modelling under the “risk-free” martingale measure, but is made hereonly for convenience.

There is ample empirical evidence that models with constant volatility do not fitobserved financial time series. In particular, this has been documented through a compar-ison of the option prices predicted by the Black-Scholes formula to the observed prices onthe option market. Because the Black-Scholes price of an option on a given asset dependsonly on the volatility parameter of the asset price process, a single parameter volatilitymodel would allow to calculate this parameter from the observed price of an option onthis asset, by inversion of the Black-Scholes formula. Given a range of options written ona given asset, but with different maturities and/or different strike prices, this inversionprocess usually leads to a range of “implied volatilities”, all connected to the same assetprice process. These implied volatilities usually vary with the maturity and strike price.


This discrepancy could be taken as proof of the failure of the reasoning behind theBlack-Scholes formula, but the more common explanation is that “volatility” is a randomprocess itself. One possible model for this process is a diffusion equation of the type

dσt = λtσt dt+ γtσt dWt,

where Wt is another Brownian motion process. This leads to a “stochastic volatilitymodel in continuous time”. Many different parametric forms for the processes λt andγt are suggested in the literature. One particular choice is to assume that log σt is anOrnstein-Uhlenbeck process, i.e. it satisfies

d log σt = λ(ξ − log σt) dt+ γ dWt.

(An application of Ito’s formula show that this corresponds to the choices λt =12γ

2 +λ(ξ − log σt) and γt = γ.) The Brownian motions Bt and Wt are often assumed to bedependent, with quadratic variation 〈B,W 〉t = δt for some parameter δ ≤ 0.

A diffusion equation is a stochastic differential equation in continuous time, and doesnot fit well into our basic set-up, which considers the time variable t to be integer-valued.One approach would be to use continuous time models, but assume that the continuoustime processes are observed only at a grid of time points. In view of the importanceof the option-pricing paradigm in finance it has been also useful to give a definition of“volatility” directly through discrete time models. These models are usually motivatedby an analogy with the continuous time set-up. “Stochastic volatility models” in discretetime are specifically meant to parallel continuous time diffusion models.

The most popular stochastic volatility model in discrete time is the auto-regressiverandom variance model or ARV model. A discrete time analogue of the Ornstein-Uhlenbeck type volatility process σt is the specification

(10.8) log σt = α+ φ log σt−1 + Vt−1.

For |φ| < 1 and a white noise process Vt this auto-regressive equation possesses a causalstationary solution log σt. We select this solution in the following. The observed logreturn process Xt is modelled as

(10.9) Xt = σtZt,

where it is assumed that the time series (Vt, Zt) is i.i.d.. The latter implies that Zt is inde-pendent of Vt−1, Zt−1, Vt−2, Zt−2, . . . and hence of Xt−1, Xt−2, . . ., but allows dependencebetween Vt and Zt. The volatility process σt is not observed.

A dependence between Vt and Zt allows for a leverage effect, one of the “stylizedfacts” of financial time series. In particular, if Vt and Zt are negatively correlated, thena small return Xt, which is indicative of a small value of Zt, suggests a large value of Vt,and hence a large value of the log volatility log σt+1 at the next time instant. (Note thatthe time index t − 1 of Vt−1 in the auto-regressive equation (10.8) is unusual, becausein other situations we would have written Vt. It is meant to support the idea that σt isdetermined at time t− 1.)


An ARV stochastic volatility process is a nonlinear state space model. It induces alinear state space model for the log volatilities and log absolute log returns of the form

log σt = (α φ )

(

1log σt−1

)

+ Vt−1

log |Xt| = log σt + log |Zt|.

In order to take the logarithm of the observed series Xt it was necessary to take theabsolute value |Xt| first. Usually this is not a serious loss of information, because thesign of Xt is equal to the sign of Zt, and this is a Bernoulli 1

2 series if Zt is symmetricallydistributed.

The linear state space form allows the application of the Kalman filter to computebest linear projections of the unobserved log volatilities log σt based on the observed logabsolute log returns log |Xt|. Although this approach is computationally attractive, adisadvantage is that the best predictions of the volatilities σt based on the log returnsXt may be much better than the exponentials of the best linear predictions of the logvolatilities log σt based on the log returns. Forcing the model in linear form is not entirelynatural here. However, the computation of best nonlinear predictions is involved. MarkovChain Monte Carlo methods are perhaps the most promising technique, but are highlycomputer-intensive.

An ARV process Xt is a martingale difference series relative to its natural filtrationFt = σ(Xt, Xt−1, . . .). To see this we first note that by causality σt ∈ σ(Vt−1, Vt−2, . . .),whence Ft is contained in the filtration Gt = σ(Vs, Zs: s ≤ t). The process Xt is actuallyalready a martingale difference relative to this bigger filtration, because by the assumedindependence of Zt from Gt−1

E(Xt| Gt−1) = σtE(Zt| Gt−1) = 0.

A fortiori the process Xt is a martingale difference series relative to the filtration Ft.There is no correspondingly simple expression for the conditional variance process

E(X2t | Ft−1) of an ARV series. By the same argument

E(X2t | Gt−1) = σ2

tEZ2t .

If EZ2t = 1 it follows that E(X2

t | Ft−1) = E(σ2t | Ft−1), but this is intractable for further

evaluation. In particular, the process σ2t is not the conditional variance process, unlike

in the situation of a GARCH process. Correspondingly, in the present context, in whichσt is considered the “volatility”, the volatility and conditional variance processes do notcoincide.

10.15 EXERCISE. One definition of a volatility process σt of a time series Xt is aprocess σt such that Xt/σt is an i.i.d. standard normal series. Suppose that Xt = σtZt

is a GARCH process with conditional variance process σ2t and driven by an i.i.d. process

Zt. If Zt is standard normal, show that σt qualifies as a volatility process. [Trivial.] If Zt

is a tp-process show that there exists a process S2t with a chisquare distribution with p

degrees of freedom such that√p σt/St qualifies as a volatility process.


10.16 EXERCISE. In the ARV model is σt measurable relative to the σ-field generatedby Xt−1, Xt−2, . . .? Compare with GARCH models.

In view of the analogy with continuous time diffusion processes the assumption thatthe variables (Vt, Zt) in (10.8)–(10.9) are normally distributed could be natural. Thisassumption certainly helps to compute moments of the series. The stationary solutionlog σt of the auto-regressive equation (10.8) is given by (for |φ| < 1)

log σt =

∞∑

j=0

φj(Vt−1−j + α) =

∞∑

j=0

φjVt−1−j +α

1− φ.

If the time series Vt is i.i.d. Gaussian with mean zero and variance σ2, then it follows thatthe variable log σt is normally distributed with mean α/(1−φ) and variance σ2/(1−φ2).The Laplace transform E exp(aZ) of a standard normal variable Z is given by exp( 12a

2).Therefore, under the normality assumption on the process Vt it is straightforward tocompute that, for p > 0,

E|Xt|p = Eep log σtE|Zt|p = exp(

12

σ2p2

1− φ2+

αp

1− φ

)

E|Zt|p.

Consequently, the kurtosis of the variables Xt can be computed to be

κ4(X) = e4σ2/(1−φ2)κ4(Z).

If follows that the time series Xt possesses a larger kurtosis than the series Zt. This istrue even for φ = 0, but the effect is more pronounced for values of φ that are closeto 1, which are commonly found in practice. Thus the ARV model is able to explainleptokurtic tails of an observed time series.

Under the assumption that the variables (Vt, Zt) are i.i.d. and bivariate normallydistributed, it is also possible to compute the auto-correlation function of the squaredseries X2

t explicitly. If δ = ρ(Vt, Zt) is the correlation between the variables Vt andZt, then the vectors (log σt, log σt+h, Zt) possess a three-dimensional normal distributionwith covariance matrix

β2 β2φh 0β2φh β2 φh−1δσ0 φh−1δσ 1

, β2 =σ2

1− φ2.

Some calculations show that the auto-correlation function of the square process is givenby

ρX2(h) =(1 + 4δ2σ2φ2h−2)e4σ

2φh/(1−φ2) − 1

3e4σ2/(1−φ2) − 1, h > 0.

The auto-correlation is positive at positive lags and decreases exponentially fast to zero,with a rate depending on the proximity of φ to 1. For values of φ close to 1, the decreaseis relatively slow.


10.17 EXERCISE. Derive the formula for the auto-correlation function.

10.18 EXERCISE. Suppose that the variables Vt and Zt are independent for every t, inaddition to independence of the vectors (Vt, Zt), and assume that the variables Vt (butnot necessarily the variables Zt) are normally distributed. Show that

ρX2(h) =e4σ

2φh/(1−φ2) − 1

κ4(Z)e4σ2/(1−φ2) − 1

, h > 0.

[Factorize Eσ2t+hσ

2tZ

2t+hZ

2t as Eσ2

t+hσ2tEZ

2t+hZ

2t .]

The choice of the logarithmic function in the auto-regressive equation (10.8) hassome arbitrariness, and other possibilities, such as a power function, have been explored.

11Moment andLeast Squares Estimators

Suppose that we observe realizations X1, . . . , Xn from a time series Xt whose distri-bution is (partly) described by a parameter θ ∈ R

d. For instance, an ARMA processwith the parameter (φ1, . . . , φp, θ1, . . . , θq, σ

2), or a GARCH process with parameter(α, φ1, . . . , φp, θ1, . . . , θq), both ranging over a subset of Rp+q+1. In this chapter we dis-cuss two methods of estimation of the parameters, based on the observations X1, . . . , Xn:the “method of moments” and the “least squares method”.

When applied in the standard form to auto-regressive processes the two methodsare essentially the same, but for other models the two methods may yield quite differentestimators. Depending on the moments used and the underlying model, least squaresestimators can be more efficient, although sometimes they are not usable at all. The“generalized method of moments” tries to bridge the efficiency gap, by increasing thenumber of moments employed.

Moment and least squares estimators are popular in time series analysis, but in gen-eral they are less accurate than maximum likelihood and Bayes estimators. The differencein efficiency depends on the model and the true distribution of the time series. Maxi-mum likelihood estimation using a Gaussian model can be viewed as an extension of themethod of least squares. We discuss the method of maximum likelihood in Chapter 13.

11.1 Yule-Walker Estimators

Suppose that the time series Xt − µ is a stationary auto-regressive process of knownorder p and with unknown parameters φ1, . . . , φp and σ2. The mean µ = EXt of theseries may also be unknown, but we assume that it is estimated by Xn and concentrateattention on estimating the remaining parameters.

From Chapter 8 we know that the parameters of an auto-regressive process are notuniquely determined by the series Xt, but can be replaced by others if the white noiseprocess is changed appropriately as well. We shall aim at estimating the parameters

11.1: Yule-Walker Estimators 173

under the assumption that the series is causal. This is equivalent to requiring that allroots of the polynomial φ(z) = 1− φ1z − · · · − φpz

p are outside the unit circle.Under causality the best linear predictor of Xp+1 based on 1, Xp, . . . , X1 is given

by ΠpXp+1 = µ + φ1(Xp − µ) + · · · + φp(X1 − µ). (See Section 8.4.) Alternatively, thebest linear predictor can be obtained by solving the general prediction equations (2.4).Combined this shows that the parameters φ1, . . . , φp satisfy

γX(0) γX(1) · · · γX(p− 1)γX(1) γX(0) · · · γX(p− 2)

......

...γX(p− 1) γX(p− 2) · · · γX(0)

φ1φ2...φp

=

γX(1)γX(2)]

...γX(p)

.

We abbreviate this system of equations by Γp~φp = ~γp. These equations, known as the

Yule-Walker equations, express the parameters into second moments of the observations.The Yule-Walker estimators are defined by replacing the true auto-covariances γX(h) bytheir sample versions γn(h) and next solving for φ1, . . . , φp. This leads to the estimators

~φp: =

φ1φ2...φp

=

γn(0) γn(1) · · · γn(p− 1)γn(1) γn(0) · · · γn(p− 2)

......

...γn(p− 1) γn(p− 2) · · · γn(0)

−1

γn(1)γn(2)

...γn(p)

=: Γ−1p γp.

The parameter σ2 is by definition the variance of Zp+1, which is the prediction errorXp+1 − ΠpXp+1 when predicting Xp+1 by the preceding observations (under the as-sumption that the time series is causal). By the orthogonality of the prediction error andthe predictor ΠpXp+1, and Pythagoras’ rule,

(11.1) σ2 = varXp+1 − var(ΠpXp+1) = γX(0)− ~φTp Γp~φp.

We define an estimator σ2 by replacing all unknowns by their moment estimators, i.e.

σ2 = γn(0)− ~φT

p Γp~φp.

11.1 EXERCISE. An alternative method to derive the Yule-Walker equations is to workout the equations cov

(

φ(B)(Xt − µ), Xt−k − µ)

= cov(

Zt,∑

j≥0 ψjZt−j−k)

for k =0, . . . , p. Check this. Do you need causality? What if the time series would not be causal?

11.2 EXERCISE. Show that the matrix Γp is invertible for every p. [Suggestion: writeαTΓpα in terms of the spectral density.]

Another reasonable method to find estimators is to start from the fact that the truevalues of φ1, . . . , φp minimize the expectation

(β1, . . . , βp) 7→ E(

Xt − µ− β1(Xt−1 − µ)− · · · − βp(Xt−p − µ))2.

174 11: Moment and Least Squares Estimators

The least squares estimators are defined by replacing this criterion function by an “em-pirical” (i.e. observable) version of it and next minimizing this. Let φ1, . . . , φp minimizethe function

(11.2) (β1, . . . , βp) 7→1

n

n∑

t=p+1

(

Xt −Xn − β1(Xt−1 −Xn)− · · · − βp(Xt−p −Xn))2.

The minimum value itself is a reasonable estimator of the minimum value of the ex-pectation of this criterion function, which is EZ2

t = σ2. The least squares estimators φjobtained in this way are not identical to the Yule-Walker estimators, but the difference issmall. We can see this by deriving the least squares estimators as the solution of a systemof equations. The right side of (11.2) is n times the square of the norm ‖Yn −Dn

~βp‖ for~βp = (β1, . . . , βp)

T the vector of parameters and Yn and Dn the vector and matrix givenby

Yn =

Xn −Xn

Xn−1 −Xn

...Xp+1 −Xn

, Dn =

Xn−1 −Xn Xn−2 −Xn · · · Xn−p −Xn

Xn−2 −Xn Xn−3 −Xn · · · Xn−p−1 −Xn

......

...Xp −Xn Xp−1 −Xn · · · X1 −Xn

.

The norm β 7→ ‖Yn − Dn~βp‖ is minimized by the vector ~βp such that Dn

~βp is theprojection of the vector Yn onto the range of the matrix Dn. By the projection theorem,Theorem 2.11, this is characterized by the relationship that the residual Yn − Dn

~βp isorthogonal to the range of Dn. Since this range is spanned by the rows of DT

t , it follows

that DTn (Yn − Dn

~βp) = 0. This normal equation can be solved for βp to yield that theminimizing vector is given by

~φp =( 1

nDT

nDn

)−1 1

nDT

n ( ~Xn −Xn).

At closer inspection this vector is nearly identical to the Yule-Walker estimators. Indeed,for every s, t ∈ 1, . . . , p,

( 1

nDT

nDn

)

s,t=

1

n

n∑

j=p+1

(Xj−s −Xn)(Xj−t −Xn) ≈ γn(s− t) = (Γp)s,t,

( 1

nDT

n ( ~Xn −Xn))

t=

1

n

n∑

j=p+1

(Xj−t −Xn)(Xj −Xn) ≈ γn(t).

Asymptotically the difference between the Yule-Walker and least squares estimators isnegligible. They possess the same (normal) limit distribution.


11.3 Theorem. Let (Xt−µ) be a causal AR(p) process relative to an i.i.d. sequence Zt

with finite fourth moments. Then both the Yule-Walker and the least squares estimatorssatisfy, with Γp the covariance matrix of (X1, . . . , Xp),

√n(~φp − ~φp) N(0, σ2Γ−1p ).

Proof. We can assume without loss of generality that µ = 0. The AR equationsφ(B)Xt = Zt for t = n, n− 1, . . . , p+ 1 can be written in the matrix form

Xn

Xn−1...

Xp+1

=

Xn−1 Xn−2 · · · Xn−pXn−2 Xn−3 · · · Xn−p−1

......

...Xp Xp−1 · · · X1

φ1φ2...φp

+

Zn

Zn−1...

Zp+1

= Dn~φp +

~Zn,

for ~Zn the vector with coordinates Zt+Xn

∑

φi, and Dn the “design matrix” as before.

We can solve ~φp from this as, writing ~Xn for the vector on the left,

~φp = (DTnDn)

−1DTn ( ~Xn − ~Zn).

Combining this with the analogous representation of the least squares estimators φj wefind

√n(~φp − ~φp) =

( 1

nDT

nDn

)−1 1√nDT

n

(

~Zn −Xn(1−∑

i

φi)~1)

.

Because Xt is an auto-regressive process, it possesses a representation Xt =∑

j ψjZt−jfor a sequence ψj with

∑

j |ψj | <∞. Therefore, the results of Chapter 5 apply and show

that n−1DTnDn

P→ Γp. (In view of Problem 11.2 this also shows that the matrix DTnDn

is invertible, as was assumed implicitly in the preceding.)In view of Slutsky’s lemma it now suffices to show that

1√nDT

n~Zn N(0, σ2Γp),

1√nDT

n 1XnP→ 0.

A typical coordinate of the last vector is (for h = 1, . . . , p)

1√n

n∑

t=p+1

(Xt−h −Xn)Xn =1√n

n∑

t=p+1

Xt−hXn − n− p√nX

2

n.

In view of Theorem 4.5 and the assumption that µ = 0, the sequence√nXn converges

in distribution. Hence both terms on the right side are of the order OP (1/√n).

A typical coordinate of the first vector is (for h = 1, . . . , p)

1√n

n∑

t=p+1

(Xt−h −Xn)Zt =1√n

n−p∑

t=1

Yt +OP

( 1√n

)

,


for Yt = Xp+t−hZp+t. By causality of the series Xt we have Zp+t ⊥ Xp+t−s for s > 0and hence EYt = EXp+t−sEZp+t = 0 for every t. The same type of arguments as inChapter 5 will give the asymptotic normality of the sequence

√nY n, with asymptotic

variance ∞∑

g=−∞γY (g) =

∞∑

g=−∞EXp+g−hZp+gXp−hZp.

In this series all terms with g > 0 vanish because Zp+g is independent of the vector(Xp+g−h, Xp−h, Zp), by the assumption of causality and the fact that Zt is an i.i.d.sequence. All terms with g < 0 vanish by symmetry. Thus the series is equal to γY (0) =EX2

p−hZ2p = γX(0)σ2, which is the diagonal element of σ2Γp. This concludes the proof

of the convergence in distribution of all marginals of n−1/2DTn~Zn. The joint convergence

is proved in similarly, with the help of the Cramer-Wold device.This concludes the proof of the asymptotic normality of the least squares estima-

tors. The Yule-Walker estimators can be proved to be asymptotically equivalent to theleast squares estimators, in that the difference is of the order oP (1/

√n). Next we apply

Slutsky’s lemma.

11.4 EXERCISE. Show that the time series Yt in the preceding proof is strictly station-ary.

* 11.5 EXERCISE. Give a complete proof of the asymptotic normality of√nY n as defined

in the preceding proof, along the lines sketched, or using the fact nY n is a martingale.

11.1.1 Order Selection

In the preceding derivation of the properties of the least squares and Yule-Walker esti-mators the order p of the AR process is assumed known a-priori. Theorem 11.3 is falseif Xt − µ were in reality an AR (p0) process of order p0 > p. In that case φ1, . . . φp areestimators of the coefficients of the best linear predictor based on p observations, butneed not converge to the p0 coefficients φ1, . . . , φp0

. On the other hand, Theorem 11.3remains valid if the series Xt is an auto-regressive process of “true” order p0 strictlysmaller than the order p used to define the estimators. This follows, since for p0 ≤ p anAR(p0) process is also an AR(p) process, albeit that φp0+1, . . . , φp are zero. In view of

the Theorem 11.3, if φ(p)1 , . . . , φ

(p)j are the Yule-Walker estimators when fitting an AR(p)

model and the observations are an AR(p0) process with p0 ≤ p, then√nφ

(p)j N

(

0, σ2(Γ−1p )j,j)

, j = p0 + 1, . . . , p.

Thus “overfitting” (choosing too big an order) does not cause great harm: the estimatorsof the “unnecessary” coefficients φp0+1, . . . , φp converge to zero at rate 1/

√n. This is

reconforting. However, there is also a price to be paid by overfitting. By Theorem 11.3,if fitting an AR(p)-model, then the estimators of the first p0 coefficients satisfy

√n

φ(p)1...

φ(p)p0

−

φ1...φp0

N

(

0, σ2(Γ−1p )s,t=1,...,p0

)

.


The covariance matrix in the right side, the (p0 × p0) upper principal submatrix of the(p×p) matrix Γ−1p , is not equal to Γ−1p0

, which would be the asymptotic covariance matrixif we fitted an AR model of the “correct” order p0. In fact, it is bigger in that

(Γ−1p )s,t=1,...,p0− Γ−1p0

≥ 0.

(Here A ≥ 0 means that the matrix A is nonnegative definite.) In particular, the diagonalelements of these matrices, which are the differences of the asymptotic variances of the

estimators φ(p)j and the estimators φ

(p0)j , are nonnegative. Thus overfitting leads to more

uncertainty in the estimators of both φ1, . . . , φp0and φp0+1, . . . , φp. Fitting an auto-

regressive process of very high order p increases the chance of having the model fit thedata, but generally will result in poor estimates of the coefficients, which render the finaloutcome less useful.

* 11.6 EXERCISE. Prove the assertion that the given matrix is nonnegative definite.

In practice we do not know the correct order to use. A suitable order is oftendetermined by a preliminary data-analysis, such as an inspection of the plot of thesample partial auto-correlation function. More formal methods are discussed within thegeneral context of maximum likelihood estimation in Chapter 13.

11.7 Example. If we fit an AR(1) process to observations of an AR(1) series, then the

asymptotic covariance of√n(φ1 − φ1) is equal to σ2Γ−11 = σ2/γX(0). If to this same

process we fit an AR(2) process, then we obtain estimators (φ(2)1 , φ

(2)2 ) (not related to

the earlier φ1) such that√n(φ

(2)1 − φ1, φ

(2)2 ) has asymptotic covariance matrix

σ2Γ−12 = σ2

(

γX(0) γX(1)γX(1) γX(0)

)−1=

σ2

γ2X(0)− γ2X(1)

(

γX(0) −γX(1)−γX(1) γX(0)

)

.

Thus the asymptotic variance of the sequence√n(φ

(2)1 − φ1) is equal to

σ2γX(0)

γ2X(0)− γ2X(1)=

σ2

γX(0)

1

1− φ21.

(Note that φ1 = γX(1)/γX(0).) Thus overfitting by one degree leads to a loss in efficiencyof 1− φ21. This is particularly harmful if the true value of |φ1| is close to 1, i.e. the timeseries is close to being a (nonstationary) random walk.

11.1.2 Partial Auto-Correlations

Recall that the partial auto-correlation coefficient αX(h) of a centered time series Xt isthe coefficient of X1 in the formula β1Xh + · · · + βhX1 for the best linear predictor ofXh+1 based on X1, . . . , Xh. In particular, for the causal AR(p) process satisfying Xt =φ1Xt−1 + · · ·+ φpXt−p + Zt we have αX(p) = φp and αX(h) = 0 for h > p. The samplepartial auto-correlation coefficient is defined in Section 5.4 as the Yule-Walker estimatorφh when fitting an AR(h) model. This connection provides an alternative method toderive the limit distribution in the special situation of auto-regressive processes. Thesimplicity of the result makes it worth the effort.


11.8 Corollary. Let Xt − µ be a causal stationary AR(p) process relative to an i.i.d.sequence Zt with finite fourth moments. Then, for every h > p,

√n αn(h) N(0, 1).

Proof. For h > p the time series Xt − µ is also an AR(h) process and hence we can

apply Theorem 11.3 to find that the Yule-Walker estimators φ(h)1 , . . . , φ

(h)h when fitting

an AR(h) model satisfy√n(φ

(h)h − φ

(h)h ) N

(

0, σ2(Γ−1h )h,h)

.

The left side is exactly√n αn(h). We show that the variance of the normal distribution

on the right side is unity. By Cramer’s rule the (h, h)-element of the matrix Γ−1h can befound as det Γh−1/det Γh. By the prediction equations we have for h ≥ p

γX(0) γX(1) · · · γX(h− 1)γX(1) γX(0) · · · γX(h− 2)

......

...γX(h− 1) γX(h− 2) · · · γX(0)

φ1...φp0...0

=

γX(1)γX(2)

...γX(h)

.

This expresses the vector on the right as a linear combination of the first p columns of thematrix Γh on the left. We can use this to rewrite det Γh+1 (by a “sweeping” operation)in the form

∣

∣

∣

∣

∣

∣

∣

∣

γX(0) γX(1) · · · γX(h)γX(1) γX(0) · · · γX(h− 1)

......

...γX(h) γX(h− 1) · · · γX(0)

∣

∣

∣

∣

∣

∣

∣

∣

=

∣

∣

∣

∣

∣

∣

∣

∣

γX(0)− φ1γX(1)− · · · − φpγX(p) 0 · · · 0γX(1) γX(0) · · · γX(h− 1)

......

...γX(h) γX(h− 1) · · · γX(0)

∣

∣

∣

∣

∣

∣

∣

∣

.

The (1, 1)-element in the last determinant is equal to σ2 by (11.1). Thus this determinantis equal to σ2 det Γh and the theorem follows.

This corollary can be used to determine a suitable order p if fitting an auto-regressivemodel to a given observed time series. The true partial auto-correlation coefficients oflags higher than the true order p are all zero, and we should expect that the samplepartial auto-correlation coefficients are inside a band of the type (−1.96/

√n, 1.96

√n).

Thus we should not choose the order equal to p if αn(p+ k) is outside this band for toomany k ≥ 1. Here we should expect a fraction of 5 % of the αn(p + k) for which weperform this “test” to be outside the band in any case.

To turn this procedure in a more formal statistical test we must also take the depen-dence between the different αn(p+ k) into account, but this appears to be complicated.

11.2: Moment Estimators 179

* 11.9 EXERCISE. Find the asymptotic limit distribution of the sequence(

αn(h), αn(h+

1))

for h > p, e.g. in the case that p = 0 and h = 1.

* 11.1.3 Indirect Estimation

The parameters φ1, . . . , φp of a causal auto-regressive process are exactly the coefficientsof the one-step ahead linear predictor using p variables from the past. This makes appli-cation of the least squares method to obtain estimators for these parameters particularlystraightforward. For an arbitrary stationary time series the best linear predictor of Xp+1

given 1, X1, . . . , Xp is the linear combination µ+ φ1(Xp − µ) + · · · + φ1(X1 − µ) whosecoefficients satisfy the prediction equations (2.4). The Yule-Walker estimators are thesolutions to these equations after replacing the true auto-covariances by the sampleauto-covariances. It follows that the Yule-Walker estimators can be considered estima-tors for the prediction coefficients (using p variables from the past) for any stationarytime series. The case of auto-regressive processes is special only in that these predictioncoefficients are exactly the parameters of the model.

Furthermore, it remains true that the Yule-Walker estimators are√n-consistent

and asymptotically normal. This does not follow from Theorem 11.3, because this usesthe auto-regressive structure explicitly, but it can be inferred from the asymptotic nor-mality of the auto-covariances, given in Theorem 5.8. (The argument is the same asused in Section 5.4. The asymptotic covariance matrix will be different from the one inTheorem 11.3, and more complicated.)

If the prediction coefficients (using a fixed number of past variables) are not theparameters of main interest, then these remarks may seem little useful. However, if theparameter of interest θ is of dimension d, then we may hope that there exists a one-to-one relationship between θ and the prediction coefficients φ1, . . . , φp if we choose p = d.(More generally, we can apply this to a subvector of θ and a matching number of φj ’s.)Then we can first estimate φ1, . . . , φd by the Yule-Walker estimators and next employthe relationshiop between φ1, . . . , φp to infer an estimate of θ. If the inverse map givingθ as a function of φ1, . . . , φd is differentiable, then it follows by the Delta-method thatthe resulting estimator for θ is

√n-consistent and asymptotically normal, and hence we

obtain good estimators.If the relationship between θ and (φ1, . . . , φd) is complicated, then this idea may be

hard to implement. One way out of this problem is to determine the prediction coefficientsφ1, . . . , φd for a grid of values of θ, possibly through simulation. The value on the gridthat yields the Yule-Walker estimators is the estimator for θ we are looking for.

11.10 EXERCISE. Indicate how you could obtain (approximate) values for φ1, . . . , φpgiven θ using computer simulation, for instance for a stochastic volatility model.


11.2 Moment Estimators

The Yule-Walker estimators can be viewed as arising from a comparison of sample auto-covariances to true auto-covariances and therefore are examples of moment estimators.Moment estimators are defined in general by matching sample moments and populationmoments. Population moments of a time series Xt are true expectations of functions ofthe variables Xt, for instance,

EθXt, EθX2t , EθXt+hXt, EθX

2t+hX

2t .

In every case, the subscript θ indicates the dependence on the unknown parameter θ:in principle, every of these moments is a function of θ. The principle of the methodof moments is to estimate θ by that value θn for which the corresponding populationmoments coincide with a corresponding sample moment, for instance,

1

n

n∑

t=1

Xt,1

n

n∑

t=1

X2t ,

1

n

n∑

t=1

Xt+hXt,1

n

n∑

t=1

X2t+hX

2t .

From Chapter 5 we know that these sample moments converge, as n → ∞, to the truemoments, and hence it is believable that the sequence of moment estimators θn alsoconverges to the true parameter, under some conditions.

Rather than true moments it is often convenient to define moment estimatorsthrough derived moments such as an auto-covariance at a fixed lag, or an auto-correlation,which are both functions of moments of degree smaller than 2. These derived momentsare then matched to the corresponding sample quantities.

The choice of moments to be used is crucial for the existence and consistency of themoment estimators, and also for their efficiency.

For existence we generally need to match as many moments as there are parametersin the model. With fewer moments we should expect a moment estimator to be notuniquely defined, whereas with more moments no solution to the moment equations mayexist. Because in general the moments are highly nonlinear functions of the parameters,it is hard to make this statement precise, as it is hard to characterize solutions of systemsof nonlinear equations in general. This is illustrated already in the case of moving averageprocesses, where a characterization of the existence of solutions requires effort, and whereconditions and restrictions are needed to ensure their uniqueness. (Cf. Section 11.2.3.)

To ensure consistency and improve efficiency it is necessary to use moments that canbe estimated well from the data. Auto-covariances at high lags, or moments of high degreeshould generally be avoided. Besides on the quality of the initial estimates, the efficiencyof the moment estimators also depends on the inverse map giving the parameter as afunction of the moments. To see this we may formalize the method of moments throughthe scheme

φ(θ) = Eθf(Xt, . . . , Xt+h),

φ(θn) =1

n

n∑

t=1

f(Xt, . . . , Xt+h).

Here f :Rh+1 → Rd is a given map, which defines the moments used. (For definiteness we

allow it to depend on the joint distribution of at most h + 1 consecutive observations.)


We shall assume that he time series t 7→ f(Xt, . . . , Xt+h) is strictly stationary, so thatthe mean values φ(θ) in the first line do not depend on t, and for simplicity of notationwe assume that we observe X1, . . . , Xn+h, so that the right side of the second line isindeed an observable quantity. If the map φ: Θ → R

d is one-to-one, then the second lineuniquely defines the estimator θn as the inverse

θn = φ−1(

fn)

, fn =1

n

n∑

t=1

f(Xt, . . . , Xt+h).

We shall generally construct fn such that it converges in probability to its mean φ(θ)

as n → ∞. If this is the case and φ−1 is continuous at φ(θ), then we have that θn →φ−1φ(θ) = θ, in probability as n→ ∞, and hence the moment estimator is asymptoticallyconsistent.

Many sample moments converge at√n-rate, with a normal limit distribution. This

allows to refine the consistency result, in view of the Delta-method, given by Theo-rem 3.15. If φ−1 is differentiable at φ(θ) with derivative of full rank, and

√n(

fn − φ(θ))

converges in distribution to a normal distribution with mean zero and covariance matrixΣθ, then √

n(θn − θ) N(

0, φ′θ−1Σθ(φ

′θ−1)T

)

.

Here φ′θ−1 is the derivative of φ−1 at φ(θ), which is the inverse of the derivative of φ at

θ. Under these conditions, the moment estimators are√n-consistent with a normal limit

distribution, a desirable property.The size of the asymptotic covariance matrix φ′θ

−1Σθ(φ′θ−1)T depends both on the

accuracy by which the chosen moments can be estimated from the data (through thematrix Σθ) and the “smoothness” of the inverse φ−1. If the inverse map has a “large”

derivative, then extracting the moment estimator θn from the sample moments fn mag-nifies the error of fn as an estimate of φ(θ), and the moment estimator will be rela-tively inefficient. Unfortunately, it is hard to see how a particular implementation ofthe method of moments works out without doing (part of) the algebra leading to theasymptotic covariance matrix. Furthermore, the outcome may depend on the true valueof the parameter, a given moment estimator being relatively efficient for some parametervalues, but (very) inefficient for others.

11.2.1 Generalized Moment Estimators

Moment estimators are measurable functions of the sample moments fn and hence can-not be better than the best estimator based on fn. In most cases summarizing the datathrough the sample moments fn incurs a loss of information. Only if the sample momentsare sufficient (in the statistical sense), moment estimators can be fully efficient. Suffi-ciency is an exceptional situation. The loss of information can be diminished by workingwith the right type of moments, but is usually unavoidable through the restriction ofusing only as many moments as there are parameters. This is because the reduction ofa sample of size n to a “sample” of empirical moments of size d usually entails a loss ofinformation.


This observation motivates the generalized method of moments. The idea is to reducethe sample to more “empirical moments” than there are parameters. Given a functionf :Rh+1 → R

e, for e > d, with corresponding mean function φ(θ) = Eθf(Xt, . . . , Xt+h),

there is no hope, in general, to solve an estimator θn from the system of equationsφ(θ) = fn, because these are e > d equations in d unknowns. The generalized method

of moments overcomes this by defining θn as the minimizer of the quadratic form, for agiven (possibly random) matrix Vn,

(11.3) θ 7→(

φ(θ)− fn)TVn(

φ(θ)− fn)

.

Thus a generalized moment estimator tries to solve the system of equations φ(θ) = fnas well as possible, where the discrepancy is measured through a certain quadratic form.The matrix Vn weighs the influence of the different components of fn on the estimator θn,and is typically chosen dependent on the data to increase the efficiency of the estimator.We assume that Vn is symmetric and positive-definite.

As n → ∞ the estimator fn typically converges to its expectation φ(θ0) = Eθ0fnunder the true parameter, which we shall denote by θ0 for clarity. The quadratic formobtained by replacing fn in the criterion function (11.3) by its limit φ(θ0) is reducedto zero for θ equal to θ0. This is clearly the minimal value of the quadratic form, andthe choice θ = θ0 will be unique as soon as the map φ is one-to-one. This suggeststhat the generalized moment estimator θn is asymptotically consistent. As for ordinarymoment estimators, a rigorous justification of the consistency must take into account theproperties of the function φ.

The distributional limit properties of a generalized moment estimator can be un-derstood by linearizing the function φ around the true parameter. Insertion of the firstorder Taylor expansion φ(θ) = φ(θ0) + φ′θ0(θ − θ0) into the quadratic form yields theapproximate criterion

θ 7→(

fn − φ(θ0)− φ′θ0(θ − θ0))TVn(

fn − φ(θ0)− φ′θ0(θ − θ0))

=1

n

(

Zn − φ′θ0√n(θ − θ0)

)TVn(

Zn − φ′θ0√n(θ − θ0)

)

,

for Zn =√n(

fn − φ(θ0))

. The sequence Zn is typically asymptotically normally dis-tributed, with mean zero. Minimization of this approximate criterion over h =

√n(θ−θ0)

is equivalent to minimizing the quadratic form h 7→ (Zn−φ′θ0h)Vn(Zn−φ′θ0h), or equiv-alently minimizing the norm of the vector Zn − φ′θ0h over h in the Hilbert space R

d

with inner product defined by 〈x, y〉 = xT Vny. This comes down to projecting the vectorZn onto the range of the linear map φ′θ0 and hence by the projection theorem, Theo-

rem 2.11, the minimizer h =√n(θ − θ0) is characterized by the orthogonality of the

vector Zn − φ′θ0 h to the range of φ′θ0 . The algebraic expression of this orthogonality

(φ′θ0)T Vn(Zn − φ′θ0 h) = 0 can be written in the form

√n(θn − θ0) =

(

(φ′θ0)T Vnφ

′θ0

)−1(φ′θ0)

T VnZn.

This readily gives the asymptotic normality of the sequence√n(θn−θ0), with mean zero

and a somewhat complicated covariance matrix depending on φ′θ0 , Vn and the asymptotic


covariance matrix of Zn. The minimizer θn of the true criterion function (11.3) is not thesame as θn, but Theorem 11.12 below give sufficient conditions for its having the samelimiting distribution.

The best nonrandom weight matrices Vn, in terms of minimizing the asymptoticcovariance matrix of

√n(θn − θ), is the inverse of the covariance matrix of Zn. (Cf.

Problem 11.11.) For our present situation this suggests to choose the matrix Vn to beconsistent for the inverse of the asymptotic covariance matrix of the sequence Zn =√n(

fn−φ(θ0))

. With this choice and the asymptotic covariance matrix denoted by Σθ0 ,we may expect that

√n(θn − θ0) N

(

0,(

(φ′θ0)TΣ−1θ0

φ′θ0)−1)

.

11.11 EXERCISE. Let Σ be a symmetric, positive-definite matrix and A a given matrix.Show that the matrix (ATV A)−1ATV ΣV TA(ATV A)−1 is minimized over symmetric,nonnegative-definite matrices V (where we say that V ≤W if W −V is nonnegative def-inite) for V = Σ−1. [The given matrix is the covariance matrix of βA = (ATV A)−1ATV Zfor Z a random vector with the normal distribution with covariance matrix Σ. Show thatCov(βA − βΣ−1 , βΣ−1) = 0.]

The argument shows that the generalized moment estimator can be viewed as aweighted least squares estimators for regressing Z − n =

√n(

fn − φ(θ0))

onto φ′θ0 . If we

use more initial moments to define fn and hence φ(θ), then we add “observations” andcorresponding rows to the design matrix φ′θ0 , but keep the same parameter

√n(θ − θ0).

This suggests that the asymptotic efficiency of the optimally weighted generalized mo-ment estimator increases if we use a longer vector of initial moments fn. In particular,the optimally weighted generalized moment estimator is more efficient than an ordinarymoment estimator based on a subset of d of the initial moments. Thus, the general-ized method of moments achieves the aim of using more information contained in theobservations.

These arguments are based on asymptotic approximations. They are reasonablyaccurate for values of n that are large relative to the values of d and e, but should notbe applied if d or e are large. In particular, it is illegal to push the preceding argumentto its extreme and infer that it is fruitful to use as many initial moments as possible.Increasing the dimension of the vector fn indefinitely may contribute more “variability”to the criterion (and hence to the estimator), without increasing the information much(also depending on the accuracy of the estimator Vn).

In the following theorem we make the preceding heuristic derivation of the asymp-totics of generalized moment estimators rigorous. The theorem is a corollary of Theo-rems 3.17 and 3.18 on the asymptotics of general minimum contrast estimators. Con-sider generalized moment estimators as previously, defined as the point of minimum ofa quadratic form of the type (11.3). In most cases the function φ(θ) will be the expected

value of the random vectors fn under the parameter θ, but this is not necessary. Thefollowing theorem is applicable as soon as φ(θ) gives a correct “centering” to ensure that

the sequence√n(

fn − φ(θ))

converges to a limit distribution, and hence may also applyto nonstationary time series.


11.12 Theorem. Let Vn be random matrices such that VnP→ V0 for some matrix V0.

Assume that φ: Θ ⊂ Rd → R

e is differentiable at an inner point θ0 of Θ with derivativeφ′θ0 such that the matrix (φ′θ0)

TV0φ′θ0

is nonsingular and satisfies, for every δ > 0,

(11.4) infθ:‖θ−θ0‖>δ

(

φ(θ)− φ(θ0))TV0(

φ(θ)− φ(θ0))

> 0.

Assume either that V0 is invertible or that the set φ(θ): θ ∈ Θ is bounded. Fi-

nally, suppose that the sequence of random vectors Zn =√n(

fn − φ(θ0))

is uniformly

tight. If θn are random vectors that minimize the criterion (11.3), then√n(θn − θ0) =

−((φ′θ0)TV0φ

′θ0)−1V0Zn + oP (1).

Proof. We first prove that θnP→ θ0 using Theorem 3.17, with the criterion functions

Mn(θ) =∥

∥V 1/2n

(

fn − φ(θ))∥

∥,

Mn(θ) =∥

∥V 1/2n

(

φ(θ0)− φ(θ))∥

∥.

The squares of these functions are the criterion in (11.3) and the quadratic form (11.4),but with V0 replaced by Vn, respectively. By the triangle inequality |Mn(θ)−Mn(θ)| ≤∥

∥V1/2n

(

fn −φ(θ0))∥

∥→ 0 in probability, uniformly in θ. Thus the first condition of Theo-rem 3.17 is satisfied. The second condition, that infMn(θ): ‖θ−θ0‖ > δ is stochasticallybounded away from zero for every δ > 0, is satisfied by assumption (11.4) in the case thatVn = V0 is fixed. Because Vn

P→ V0, where V0 is invertible or the set φ(θ): ‖θ−θ0‖ > δ isbounded, it is also satisfied in the general case, in view of Exercise 11.13. This concludesthe proof of consistency of θn.

For the proof of asymptotic normality we use Theorem 3.18 with the criterion func-tions Mn and Mn redefined as the squares of the functions Mn and Mn as used inthe consistency proof (so that Mn(θ) is the criterion function in (11.3)) and with thecentering function M defined by

M(θ) =(

φ(θ)− φ(θ0))TV0(

φ(θ)− φ(θ0))

.

It follows that, for any random sequence θnP→ θ0,

n(Mn −Mn)(θn)− n(Mn −Mn)(θ0)

= −2√n(

φ(θn)− φ(θ0))TVnZn,

= −2√n(θn − θ0)

T (φ′θ0)T VnZn +

√noP (θn − θ0),

by the differentiability of φ at θ0. Together with the convergence of Vn to V0, thedifferentiability of φ also gives that Mn(θn) − M(θn) = oP

(

‖θn − θ0‖2)

for any se-

quence θnP→ θ0. Therefore, we may replace Mn by M in the left side of the preced-

ing display, if we add an oP(

n‖θn − θ0‖2)

-term on the right. By a third applicationof the differentiability of φ, the function M permits the two-term Taylor expansionM(θ) = (θ − θ0)

TW (θ − θ0) + o(‖θ − θ0‖)2, for W = (φ′θ0)TV0φ

′θ0. Thus the conditions

of Theorem 3.18 are satisfied and the proof of asymptotic normality is complete.


11.13 EXERCISE. Let Vn be a sequence of nonnegative-definite matrices such that Vn →V for a matrix V such that infxTV x:x ∈ C > 0 for some set C. Show that:(i) If V is invertible, then lim inf infxTVnx:x ∈ C > 0.(ii) If C is bounded, then lim inf infxTVnx:x ∈ C > 0.(iii) The assertion of (i)-(ii) may fail without some additional assumption.[Suppose that xTnVnxn → 0. If V is invertible, then it follows that xn → 0. If thesequence xn is bounded, then xTnV xn − xTnVnxn → 0. As counterexample let Vn be thematrices with eigenvectors propertional to (n, 1) and (−1, n) and eigenvalues 1 and 0,let C = x: |x1| > δ and let xn = δ(−1, n).]

11.2.2 Simulated Method of Moments

The implementation of the (generalized) method of moments requires that the expecta-tions φ(θ) = Eθf(Xt, . . . , Xt+h) are available as functions of θ. In some models, such asAR or MA models, this causes no difficulty, but already in ARMA models the requiredanalytical computations become complicated. Sometimes it is easy to simulate realiza-tions of a time series, but hard to compute explicit formulas for moments. In this casethe values φ(θ) may be estimated stochastically at a grid of values of θ by simulatingrealizations of the given time series, taking in turn each of the grid points as the “true”parameter, and next computing the empirical moment for the simulated time series. Ifthe grid is sufficiently dense and the simulations are sufficiently long, then the grid pointfor which the simulated empirical moment matches the empirical moment of the datais close to the moment estimator. Taking it to be the moment estimator is called thesimulated method of moments.

11.2.3 Moving Average Processes

Suppose that Xt−µ =∑q

j=0 θjZt−j is a moving average process of order q. For simplicityof notation assume that 1 = θ0 and define θj = 0 for j < 0 or j > q. Then the auto-covariance function of Xt can be written in the form

γX(h) = σ2∑

j

θjθj+h.

Given observations X1, . . . , Xn we can estimate γX(h) by the sample auto-covariancefunction and next obtain estimators for σ2, θ1, . . . , θq by solving the system of equations

γn(h) = σ2∑

j

θj θj+h, h = 0, 1, . . . , q.

A solution of this system, which has q + 1 equations with q + 1 unknowns, does notnecessarily exist, or may be nonunique. It cannot be derived in closed form, but must bedetermined numerically by an iterative method. Thus applying the method of momentsfor moving average processes is considerably more involved than for auto-regressive pro-cesses. The real drawback of this method is, however, that the moment estimators are lessefficient than the least squares estimators that we discuss later in this chapter. Momentestimators are therefore at best only used as starting points for numerical procedures tocompute other estimators.


11.14 Example (MA(1)). For the moving average processXt = Zt+θZt−1 the momentequations are

γX(0) = σ2(1 + θ2), γX(1) = θσ2.

Replacing γX by γn and solving for σ2 and θ yields the moment estimators

θn =1±

√

1− 4ρ2n(1)

2ρn(1), σ2 =

γn(1)

θn.

We obtain a real solution for θn only if |ρn(1)| ≤ 1/2. Because the true auto-correlationρX(1) is contained in the interval [−1/2, 1/2], it is reasonable to truncate the sample auto-correlation ρn(1) to this interval and then we always have some solution. If |ρn(1)| < 1/2,

then there are two solutions for θn, corresponding to the ± sign. This situation willhappen with probability tending to one if the true auto-correlation ρX(1) is strictly

contained in the interval (−1/2, 1/2). From the two solutions, one solution has |θn| < 1

and corresponds to an invertible moving average process; the other solution has |θn| > 1.The existence of multiple solutions was to be expected in view of Theorem 8.30.

Assume that the true value |θ| < 1, so that ρX(1) ∈ (−1/2, 1/2) and

θ =1−

√

1− 4ρ2X(1)

2ρX(1).

Of course, we use the estimator θn defined by the minus sign. Then θn−θ can be writtenas φ

(

ρn(1))

− φ(

ρX(1))

for the function φ given by

φ(ρ) =1−

√

1− 4ρ2

2ρ.

This function is differentiable on the interval (−1/2, 1/2). By the Delta-method the

limit distribution of the sequence√n(θn − θ) is the same as the limit distribution of

the sequence φ′(

ρX(1))√n(

ρn(1) − ρX(1))

. Using Theorem 5.9 we obtain, after a longcalculation, that

√n(θn − θ) N

(

0,1 + θ2 + 4θ4 + θ6 + θ8

(1− θ2)2

)

.

Thus, to a certain extent, the method of moments works: the moment estimator θnconverges at a rate of 1/

√n to the true parameter. However, the asymptotic variance

is large, in particular for θ ≈ 1. We shall see later that there exist estimators withasymptotic variance 1 − θ2, which is smaller for every θ, and is particularly small forθ ≈ 1.


−0.4 −0.2 0.0 0.2 0.4

−1.

0−

0.5

0.0

0.5

1.0

Figure 11.1. The function ρ 7→ (1−√

1− 4ρ2)/(2ρ).

11.15 EXERCISE. Derive the formula for the asymptotic variance, or at least convinceyourself that you know how to get it.

The asymptotic behaviour of the moment estimators for moving averages of orderhigher than 1 can be analysed, as in the preceding example, by the Delta-method as well.Define φ:Rq+1 → R

q+1 by

(11.5) φ

σ2

θ1...θq

= σ2

∑

j θ2j

∑

j θjθj+1

...∑

j θjθj+q

.

Then the moment estimators and true parameters satisfy

σ2

θ1...θq

= φ−1

γn(0)γX(1)

...γn(q)

,

σ2

θ1...θq

= φ−1

γX(0)γX(1)

...γX(q)

.

The joint limit distribution of the sequences√n(

γn(h) − γX(h))

is known from Theo-

rem 5.8. Therefore, the limit distribution of the moment estimators σ2, θ1, . . . , θq followsby the Delta-method, provided the map φ−1 is differentiable at

(

γX(0), . . . , γX(q))

.Practical and theoretical complications arise from the fact that the moment equa-

tions may have zero or multiple solutions, as illustrated in the preceding example. Thisdifficulty disappears if we insist on an invertible representation of the moving averageprocess, i.e. require that the polynomial 1+ θ1z+ · · ·+ θqz

q has no roots in the complexunit disc. This follows by the following lemma, whose proof also contains an algorithmto compute the moment estimators numerically.


11.16 Lemma. Let Θ ⊂ Rq be the set of all vectors (θ1, . . . , θq) such that all roots of

1+θ1z+ · · ·+θqzq are outside the unit circle. Then the map φ:R+×Θ → Rq+1 in (11.5)

is one-to-one and continuously differentiable. Furthermore, the map φ−1 is differentiableat every point φ(σ2, θ1, . . . , θq) for which the roots of 1 + θ1z + · · ·+ θqz

q are distinct.

* Proof. Abbreviate γh = γX(h). The system of equations σ2∑

j θjθj+h = γh for h =0, . . . , q implies that

q∑

h=−qγhz

h = σ2∑

h

∑

j

θjθj+hzh = σ2θ(z−1)θ(z).

For any h ≥ 0 the function zh + z−h can be expressed as a polynomial of degree h inw = z + z−1. For instance, z2 + z−2 = w2 − 2 and z3 + z−3 = w3 − 3w. The caseof general h can be treated by induction, upon noting that by rearranging Newton’sbinomial formula

zh+1 + z−h−1 − wh+1 = −(

h+ 1

(h+ 1)/2

)

−∑

j 6=0

(

h+ 1

(h+ 1− j)/2

)

(zj + z−j).

Thus the left side of the preceding display can be written in the form

γ0 +∑

h=1

γj(zj + z−j) = a0 + a1w + · · ·+ aqw

q,

for certain coefficients (a0, . . . , aq). Let w1, . . . , wq be the zeros of the polynomial onthe right, and for each j let ηj and η−1j be the solutions of the quadratic equation

z + z−1 = wj . Choose |ηj | ≥ 1. Then we can rewrite the right side of the precedingdisplay as

aq

q∏

j=1

(z + z−1 − wj) = aq(z − ηj)(ηj − z−1)η−1j .

On comparing this to the first display of the proof, we see that η1, . . . , ηq are the zerosof the polynomial θ(z). This allows us to construct a map

(γ0, . . . , γq) 7→ (a0, . . . , aq) 7→ (w1, . . . , wq, aq) 7→ (η1, . . . , ηq, aq) 7→ (θ1, . . . , θq, σ2).

If restricted to the range of φ this is exactly the map φ−1. It is not hard to see that thefirst and last step in this decomposition of φ−1 are analytic functions. The two middlesteps concern mapping coefficients of polynomials into their zeros.

For α = (α0, . . . , αq) ∈ Cq+1 let pα(w) = α0 + α1w + · · · + αqw

q. By the implicitfunction theorem for functions of several complex variables we can show the following.If for some α the polynomial pα has a root of order 1 at a point wα, then there existsneighbourhoods Uα and Vα of α and wα such that for every β ∈ Uα the polynomial pβhas exactly one zero wβ ∈ Vα and the map β 7→ wβ is analytic on Uα. Thus, under theassumption that all roots are or multiplicity one, the roots can be viewed as analyticfunctions of the coefficients. If θ has distinct roots, then η1, . . . , ηq are of multiplicity oneand hence so are w1, . . . , wq. In that case the map is analytic.


* 11.2.4 Moment Estimators for ARMA Processes

If Xt − µ is a stationary ARMA process satisfying φ(B)(Xt − µ) = θ(B)Zt, then

cov(

φ(B)(Xt − µ), Xt−k)

= E(

θ(B)Zt

)

Xt−k.

If Xt − µ is a causal, stationary ARMA process, then the right side vanishes for k > q.Working out the left side, we obtain the eqations

γX(k)− φ1γX(k − 1)− · · · − φpγX(k − p) = 0, k > q.

For k = q + 1, . . . , q + p this leads to the system

γX(q) γX(q − 1) · · · γX(q − p+ 1)γX(q + 1) γX(q) · · · γX(q − p+ 2)

......

...γX(q + p− 1) γX(q + p− 2) · · · γX(q)

φ1φ2...φp

=

γX(q + 1)γX(q + 2)

...γX(q + p)

.

These are the Yule-Walker equations for general stationary ARMA processes and maybe used to obtain estimators φ1, . . . , φp of the auto-regressive parameters in the sameway as for auto-regressive processes: we replace γX by γn and solve for φ1, . . . , φp.

Next we apply the method of moments for moving averages to the time se-ries Yt = θ(B)Zt to obtain estimators for the parameters σ2, θ1, . . . , θq. Because alsoYt = φ(B)(Xt − µ) we can write the covariance function γY in the form

γY (h) =∑

i

∑

j

φiφjγX(h+ i− j), if φ(z) =∑

j

φjzj .

Let γY (h) be the estimators obtained by replacing the unknown parameters φj = −φjand γX(h) by their moment estimators and sample moments, respectively. Next we solve

σ2, θ1, . . . , θq from the system of equations

γY (h) = σ2∑

j

θj θj+h, h = 0, 1, . . . , q.

As is explained in the preceding section, if Xt − µ is invertible, then the solution isunique, with probability tending to one, if the coefficients θ1, . . . , θq are restricted to givean invertible stationary ARMA process.

The resulting estimators (σ2, θ1, . . . , θq, φ1, . . . , φp) can be written as a function of(

γn(0), . . . , γn(q + p))

. The true values of the parameters can be written as the same

function of the vector(

γX(0), . . . , γX(q + p))

. In principle, under some conditions, thelimit distribution of the estimators can be obtained by the Delta-method.


* 11.2.5 Stochastic Volatility Models

In the stochastic volatility model discussed in Section 10.4 an observation Xt is de-fined as Xt = σtZt for log σt a stationary auto-regressive process satisfying log σt =α + φ log σt−1 + σVt−1, and (Vt, Zt) an i.i.d. sequence of bivariate normal vectors withmean zero, unit variances and correlation δ. Thus the model is parameterized by fourparameters α, φ, σ, δ.

The series Xt is a white noise series and hence we cannot use the auto-covariancesγX(h) at lags h 6= 0 to construct moment estimators. Instead, we might use highermarginal moments or auto-covariances of powers of the series. In particular, it is com-puted in Section 10.4 that

E|Xt| = exp(

12

σ2

1− φ2+

α

1− φ

)

√

2

π,

E|Xt|2 = exp(

12

4σ2

1− φ2+

2α

1− φ

)

,

E|Xt|3 = exp(

12

9σ2

1− φ2+

3α

1− φ

)

2

√

2

π,

EX4t = exp

( 8σ2

1− φ2+

4α

1− φ

)

3,

ρX2(1) =(1 + 4δ2σ2)e4σ

2φ/(1−φ2) − 1

3e4σ2/(1−φ2) − 1,

ρX2(2) =(1 + 4δ2σ2φ2)e4σ

2φ2/(1−φ2) − 1

3e4σ2/(1−φ2) − 1,

ρX2(3) =(1 + 4δ2σ2φ4)e4σ

2φ3/(1−φ2) − 1

3e4σ2/(1−φ2) − 1.

We can use a selection of these moments to define moment estimators, or use some orall of them to define generalized moments estimators. Because the functions on the rightside are complicated, this requires some effort, but it is feasible.†

11.3 Least Squares Estimators

For auto-regressive processes the method of least squares is directly suggested by thestructural equation defining the model, but it can also be derived from the predictionproblem. The second point of view is deeper and can be applied to general time series.

A least squares estimator is based on comparing the predicted value of an observa-tion Xt based on the preceding observations to the actually observed value Xt. Such aprediction Πt−1Xt will generally depend on the underlying parameter θ of the model,

† See Taylor (1986).

11.3: Least Squares Estimators 191

which we shall make visible in the notation by writing it as Πt−1Xt(θ). The index t−1 ofΠt−1 indicates that Πt−1Xt(θ) is a function of X1, . . . , Xt−1 (and the parameter) only.By convention we define Π0X1 = 0. A weighted least squares estimator, with inverseweights wt(θ), is defined as the minimizer, if it exists, of the function

(11.6) θ 7→n∑

t=1

(

Xt −Πt−1Xt(θ))2

wt(θ).

This expression depends only on the observations X1, . . . , Xn and the unknown param-eter θ and hence is an “observable criterion function”. The idea is that using the “true”parameter should yield the “best” predictions. The weights wt(θ) could be chosen equalto one, but are generally chosen to increase the efficiency of the resulting estimator.

This least squares principle is intuitively reasonable for any sense of prediction,in particular both for linear and nonlinear prediction. For nonlinear prediction we setΠt−1Xt(θ) equal to the conditional expectation Eθ(Xt|X1, . . . , Xt−1), an expression thatmay or may not be easy to derive analytically.

For linear prediction, if we assume that the the time series Xt is centered at meanzero, we set Πt−1Xt(θ) equal to the linear combination β1Xt−1 + · · · + βt−1X1 thatminimizes

(β1, . . . , βt−1) 7→ Eθ

(

Xt − (β1Xt−1 + · · ·+ βt−1X1))2, β1, . . . , βt ∈ R.

In Chapter 2 the coefficients of the best linear predictor are expressed in the auto-covariance function γX by the prediction equations (2.4). Thus the coefficients βt de-pend on the parameter θ of the underlying model through the auto-covariance function.Hence the least squares estimators using linear predictors can also be viewed as momentestimators.

The difference Xt − Πt−1Xt(θ) between the true value and its prediction is calledinnovation. Its second moment

vt(θ) = Eθ

(


is called the (square) prediction error at time t− 1. The weights wt(θ) are often chosenequal to the prediction errors vt(θ) in order to ensure that the terms of the sum of squarescontribute “equal” amounts of information.

For both linear and nonlinear predictors the innovations X1 − Π0X1(θ), X2 −Π1X2(θ), . . . , Xn − Πn−1Xn(θ) are uncorrelated random variables. This orthogonalitysuggests that the terms of the sum contribute “additive information” to the criterion,which should be good. It also shows that there is usually no need to replace the sumof squares by a more general quadratic form, which would be the standard approach inordinary least squares estimation.

Whether the sum of squares indeed possesses a (unique) point of minimum θ andwhether this constitutes a good estimator of the parameter θ depends on the statisticalmodel for the time series. Moreover, this model determines the feasibility of computingthe point of minimum given the data. Auto-regressive and GARCH processes provide apositive and a negative example.


11.17 Example (Autoregression). A mean-zero, causal, stationary, auto-regressiveprocess of order p is modelled through the parameter θ = (σ2, φ1, . . . , φp). For t ≥ p thebest linear predictor is given by Πt−1Xt = φ1Xt−1+ · · ·φpXt−p and the prediction erroris vt = EZ2

t = σ2. For t < p the formulas are more complicated, but could be obtainedin principle.

The weighted sum of squares with weights wt = vt reduces to

p∑

t=1

(

Xt −Πt−1Xt(φ1, . . . , φp))2

vt(σ2, φ1, . . . , φp)+

n∑

t=p+1

(

Xt − φ1Xt−1 − · · · − φpXt−p)2

σ2.

Because the first term, consisting of p of the n terms of the sum of squares, possesses acomplicated form, it is often dropped from the sum of squares. Then we obtain exactlythe sum of squares considered in Section 11.1, but with Xn replaced by 0 and divided byσ2. For large n the difference between the sums of squares and hence between the twotypes of least squares estimators should be negligible.

Another popular strategy to simplify the sum of squares is to act as if the “ob-servations” X0, X−1, . . . , X−p+1 are available and to redefine Πt−1Xt for t = 1, . . . , paccordingly. This is equivalent to dropping the first term and letting the sum in thesecond term start at t = 1 rather than at t = p+1. To implement the estimator we mustnow choose numerical values for the missing observations X0, X−1, . . . , X−p+1; zero is acommon choice.

The least squares estimators for φ1, . . . , φp, being (almost) identical to the Yule-Walker estimators, are

√n-consistent and asymptotically normal. However, the least

squares criterion does not lead to a useful estimator for σ2: minimization over σ2 leadsto σ2 = ∞ and this is obviously not a good estimator. A more honest conclusion is thatthe least squares criterion as posed originally fails for auto-regressive processes, sinceminimization over the full parameter θ = (σ2, φ1, . . . , φp) leads to a zero sum of squaresfor σ2 = ∞ and arbitrary (finite) values of the remaining parameters. The method ofleast squares works only for the subparameter (φ1, . . . , φp) if we first drop σ2 from thesum of squares.

11.18 Example (GARCH). A GARCH process is a martingale difference series andhence the one-step predictions Πt−1Xt(θ) are identically zero. Consequently, the weightedleast squares sum, with weights equal to the prediction errors, reduces to

n∑

t=1

X2t

vt(θ).

Minimizing this criterion over θ is equivalent to maximizing the prediction errors vt(θ).It is intuitively clear that this does not lead to reasonable estimators.

One alternative is to apply the least squares method to the squared series X2t . This

satisfies an ARMA equation in view of (9.3). (Note however that the innovations in thatequation are also dependent on the parameter.)

The best fix of the least squares method is to augment the least squares criterion tothe Gaussian likelihood, as discussed in Chapter 13.

11.3: Least Squares Estimators 193

So far the discussion in this section has assumed implicitly that the mean valueµ = EXt of the time series is zero. If this is not the case, then we apply the precedingto the time series Xt − µ instead of to Xt. Then the parameter µ will show up in theleast squares criterion. To define estimators we can either replace the unknown value µby the sample mean Xn and minimize the sum of squares with respect to the remainingparameters, or perform a joint minimization over all parameters.

Least squares estimators can rarely be written in closed form, the case of stationaryauto-regressive processes being an exception, but iterative algorithms for an approxi-mate calculation are implemented in many computer packages. Newton-type algorithmsprovide one possibility. The best linear predictions Πt−1Xt(θ) are often computed recur-sively in t (for a grid of values θ), for instance with the help of a state space representationof the time series and the Kalman filter.

The method of least squares is closely related to Gaussian likelihood, as discussed inChapter 13. Gaussian likelihood is perhaps more fundamental than the method of leastsquares. For this reason we restrict further discussion of the method of least squares toARMA processes.

11.3.1 ARMA Processes

The method of least squares works well for estimating the regression and moving averageparameters (φ1, . . . , φp, θ1, . . . , θq) of ARMA processes, provided that we perform theminimization for a fixed value of the parameter σ2.

In general, if some parameter, such as σ2 for ARMA processes, enters the covariancefunction as a multiplicative factor, then the best linear predictor ΠtXt+1 is free from thisparameter, by the prediction equations (2.4). On the other hand, the prediction errorvt+1 = γX(0) − (β1, . . . , βt)Γt(β1, . . . , βt)

T (where β1, . . . , βt are the coefficients of thebest linear predictor) contains such a parameter as a multiplicative factor. It follows thatthe inverse of the parameter will enter the least squares criterion (11.6) as a multiplicativefactor. Thus the least squares methods does not yield an estimator for this parameter,but we can just omit it and minimize the criterion over the remaining parameters.

In particular, in the case of ARMA processes least squares estimators for(φ1, . . . , φp, θ1, . . . , θq) can be defined as the minimizers of, for vt = σ−2vt,

n∑

t=1

(

Xt −Πt−1Xt(φ1, . . . , φp, θ1, . . . , θq))2

vt(φ1, . . . , φp, θ1, . . . , θq).

This is a complicated function of the parameters. However, for a fixed value of(φ1, . . . , φp, θ1, . . . , θq) it can be computed using the state space representation of anARMA process and the Kalman filter. A grid search or iteration method can do the rest.

11.19 Theorem. Let Xt be a causal and invertible stationary ARMA(p, q) process rela-tive to an i.i.d. sequence Zt with finite fourth moments. Then the least squares estimatorssatisfy

√n

((

~φp~θq

)

−(

~φp~θq

)

)

N(0, σ2J−1~φp,~θq),


where J~φp,~θqis the covariance matrix of (U−1, . . . , U−p, V−1, . . . , V−q) for stationary auto-

regressive processes Ut and Vt satisfying φ(B)Ut = θ(B)Vt = Zt.

Proof. The proof of this theorem is long and technical. See e.g. Brockwell andDavis (1991), pages 375–396, Theorem 10.8.2.

11.20 Example (MA(1)). The least squares estimator θn for θ in the moving averageprocess Xt = Zt+θZt−1 with |θ| < 1 possesses asymptotic variance equal to σ2/ varV−1,where Vt is the stationary solution to the equation θ(B)Vt = Zt. Note that Vt is an auto-regressive process of order 1, not a moving average!

As we have seen before (see Example 1.8) the process Vt possesses the representationVt =

∑∞j=0(−θ)jZt−j and hence varVt = σ2/(1− θ2) for every t.

Thus the sequence√n(θn − θ) is asymptotically normally distributed with mean

zero and variance equal to 1 − θ2. This is always smaller than the asymptotic varianceof the moment estimator, obtained in Example 11.14.

11.21 EXERCISE. Find the asymptotic covariance matrix of the sequence√n(φn −

φ, θn − θ) for (φn, θn) the least squares estimators for the stationary, causal, invertibleARMA process satisfying Xt = φXt−1 + Zt + θZt−1.

12Spectral Estimation

In this chapter we study nonparametric estimators of the spectral density and spectraldistribution of a stationary time series. As in Chapter 5 “nonparametric” means that noa-priori structure of the series is assumed, apart from stationarity.

If a well-fitting model is available, then spectral estimators suited to this model arean alternative to the methods of this chapter. For instance, the spectral density of astationary ARMA process can be expressed in the parameters σ2, φ1, . . . φp, θ1, . . . , θq ofthe model (see Section 8.5) and hence can be estimated by plugging in estimators forthe parameters. If the ARMA model is appropriate, this should lead to better estimatorsthan the nonparametric estimators discussed in this chapter. We do not further discussthis type of estimator.

Let the observations X1, . . . , Xn be the values at times 1, . . . , n of a stationary timeseries Xt, and let γn be their sample auto-covariance function. In view of the definitionof the spectral density fX(λ), a natural estimator is

(12.1) fn,r(λ) =1

2π

∑

|h|<r

γn(h)e−ihλ.

Whereas fX(λ) is defined as an infinite series, the estimator fn,r is truncated at its rthterm. Because the estimators γn(h) are defined only for |h| < n and there is no hope ofestimating the auto-covariances γX(h) for lags |h| ≥ n, we must choose r ≤ n. Becausethe estimators γn(h) are unreliable for |h| ≈ n, it may be wise to choose r much smallerthan n. We shall see that a good choice of r depends on the smoothness of the spectraldensity and also on which aspect of the spectrum is of interest. For estimating fX(λ)at a point, values of r such as nα for some α ∈ (0, 1) may be appropriate, whereas forestimating the spectral distribution function (i.e. areas under fX) the choice r = n workswell.

In any case, since the covariances of lags |h| ≥ n can never be estimated fromthe data, nonparametric estimation of the spectrum is hopeless, unless one is willing toassume that expressions such as

∑

|h|≥n∣

∣γX(h)∣

∣ are small. In Section 12.3 ahead we relate

this tail series to the smoothness of the spectral density λ 7→ fX(λ). If this function is

196 12: Spectral Estimation

several times differentiable, then the auto-covariance function decreases fast to zero, andnonparametric estimation is feasible.

12.1 Finite Fourier Transform

The finite Fourier transform is a useful tool in spectral analysis, both for theory and prac-tice. The practical use comes from the fact that it can be computed efficiently by a cleveralgorithm, the Fast Fourier Transform (FFT). While a naive computation would requireO(n2) multiplications and summations, the FFT needs only O(n log n) operations.‡

The finite Fourier transform of an arbitrary sequence x1, . . . , xn of complex numbersis defined as the function λ 7→ dx(λ) given by

dx(λ) =1√n

n∑

t=1

xte−iλt, λ ∈ (−π, π].

In other words, the function√

n/2π dx(λ) is the Fourier series corresponding to thecoefficients . . . , 0, 0, x1, x2, . . . , xn, 0, 0, . . .. The inversion formula (or a short calculation)shows that

xt =

√n

2π

∫ π

−πeitλdx(λ) dλ, t = 1, 2, . . . , n.

Thus there is a one-to-one relationship between the numbers x1, . . . , xn and the functiondx. We may view the function dx as “encoding” the numbers x1, . . . , xn.

Encoding n numbers by a function on the interval (−π, π] is rather inefficient. Atcloser inspection the numbers x1, . . . , xn can also be recovered from the values of dx onthe grid

. . . ,−4π

n,−2π

n, 0,

2π

n,4π

n, . . . ⊂ (−π, π].

These n points are called the natural frequencies at time n. (The value 0 is always anatural frequency; for odd n there are (n−1)/2 positive and negative natural frequencies,situated symmetrically about 0; for even n the value π is a natural frequency, and theremaining natural frequencies are n/2− 1 points in (0, π) and their reflections.)

12.1 Lemma. If dx is the finite Fourier transform of x1, . . . , xn ∈ C, then

xt =1√n

∑

j

dx(λj)eitλj , t = 1, 2, . . . , n,

where the sum is computed over the natural frequencies λj ∈ (−π, π] at time n.

Proof. For every of the natural frequencies λj define a vector

ej =1√n(eiλj , ei2λj , . . . , einλj )T .

‡ See e.g. Brockwell and Davis, Chapter 10 for a discussion.

12.2: Periodogram 197

It is straightforward to check that the n vectors ej form an orthonormal set in Cn and

hence a basis. Thus the vector x = (x1, . . . , xn)T can be written as x =

∑

j〈x, ej〉ej .Now 〈x, ej〉 = dx(λj) and the lemma follows.

The proof of the preceding lemma shows how the numbers dx(λj) can be interpreted.View the coordinates of the vector x = (x1, . . . , xn) as the values of a signal at the timeinstants 1, 2, . . . , n. Similarly, view the coordinates of the vector ej as the values ofthe trigonometric signal t 7→ n−1/2eitλj of frequency λj at these time instants. By thepreceding lemma the signal x can be written as a linear combination of the signals ej .The value

∣

∣dx(λj)∣

∣ is the weight of signal ej , and hence of frequency λj , in x.

12.2 EXERCISE. How is the weight of frequency 0 expressed in (x1, . . . , xn)?

12.3 EXERCISE. Show that d(µ,µ,...,µ)(λj) = 0 for every nonzero natural frequency λjand every µ ∈ C. Conclude that dx−xn

(λj) = dx(λj) for every λj 6= 0, where x − xdenotes the numbers x1 − x, . . . , xn − x. [Hint: you can compute this explicity, or usethat dx(λj) = 〈x, ej〉 as in the proof of the preceding lemma, for any x.]

12.2 Periodogram

The periodogram of a sequence of observations X1, . . . , Xn is defined as the functionλ 7→ In(λ) given by

In(λ) =1

2π

∣

∣dX(λ)∣

∣

2=

1

2πn

∣

∣

∣

n∑

t=1

Xte−itλ

∣

∣

∣

2

.

We write In,X if the dependence on X1, . . . , Xn needs to be stressed.In view of the interpretation of the finite Fourier transform in the preceding section

In(λ) is the square of the weight of frequency λ in the signal X1, . . . , Xn. Because thespectral density fX(λ) at λ can be interpreted as the variance of the component offrequency λ in the time seriesXt, the function In may appear to be a reasonable estimatorof the function fX . We shall see that the expected value of In(λ) is indeed close to fX(λ),but In(λ) is not a good estimator of fX(λ). Because better estimator can be derived fromthe periodogram, it is still of interest to study its properties.

By evaluating the square in its definition and rearranging the resulting double sum,the periodogram can be rewritten in the form (if x1, . . . , xn are real)

(12.2) In(λ) =1

2π

∑

|h|<n

( 1

n

n−|h|∑

t=1

Xt+|h|Xt

)

e−ihλ.


For natural frequencies λj 6= 0 we have that d−µ,...,µ(λj) = 0 for every µ, in particularfor µ = Xn. This implies that In,X(λj) = In,X−Xn

(λj) and hence

(12.3) In(λj) =1

2π

∑

|h|<n

γn(h)e−ihλj , λj ∈

2π

nZ− 0.

This is exactly the estimator fn,n(λ) given in (12.1), with r = n. As noted before, weshould expect this estimator to be unreliable as an estimator of fX(λ), because of theimprecision of the estimators γn(h) for lags |h| close to n.

Under the assumption that the time series Xt is stationary, we can compute themean of the periodogram, for λ 6= 0, as

EIn(λ) =1

2πE∣

∣dX(λ)∣

∣

2=

1

2πvar dX(λ) +

1

2π

∣

∣EdX(λ)∣

∣

2

=1

2πn

n∑

s=1

n∑

t=1

cov(Xs, Xt)eiλ(s−t) +

1

2πn

∣

∣

∣

n∑

t=1

EXte−iλt

∣

∣

∣

2

=1

2π

∑

|h|<n

(

1− |h|n

)

γX(h)e−iλh +µ2

2πn

∣

∣

∣

∣

1− e−iλn

1− e−iλ

∣

∣

∣

∣

2

.

The second term on the far right is of the order O(1/n) for every λ 6= 0 (and λ ∈(−π, π)) and even vanishes for every natural frequency λj 6= 0. Under the conditionthat

∑

h

∣

∣γX(h)∣

∣ < ∞, the first term converges to fX(λ) as n → ∞, by the dominatedconvergence theorem. We conclude that the periodogram is asymptotically unbiased forestimating the spectral density: EIn(λ) → fX(λ). This is a good property.

However, the periodogram is not a consistent estimator for fX(λ): the followingtheorem shows that In(λ) is asymptotically exponentially distributed with mean fX(λ),whence we do not have that In(λ)

P→ fX(λ). Using the periodogram as an estimator offX(λ) is asymptotically equivalent to estimating fX(λ) using a single observation froman exponential distribution with mean fX(λ). This is disappointing, because we shouldhope that observing the time series Xt long enough will enable us to estimate its spectraldensity with arbitrary precision. The periodogram does not fulfill this hope, as it keepsfluctuating around the target value fX(λ). Apparently, it does not effectively use theinformation available in the observations X1, . . . , Xn.

12.4 Theorem. Let Xt =∑∞

j=−∞ ψjZt−j for an i.i.d. sequence Zt with mean zeroand finite second moment and coefficients ψj with

∑

j |ψj | < ∞. Then for any values0 < µ1 < · · · < µk < π the variables In(µ1), . . . , In(µk) are asymptotically distributedas independent exponential variables with means fX(µ1), . . . , fX(µk), respectively.

Proof. First consider the case that Xt = Zt for every t. Then the spectral density fX)is the function given by fZ(λ) = σ2/2π, for σ2 the variance of the white noise sequence.We can write

dZ(λ) =1√n

n∑

t=1

Zt cos(λt)− i1√n

n∑

t=1

Zt sin(λt) =:An(λ)− iBn(λ).


By straightforward calculus we find that, for any λ, µ ∈ (0, π),

cov(

An(λ), An(µ))

=σ2

n

n∑

t=1

cos(λt) cos(µt) →

σ2/2 if λ = µ,0 if λ 6= µ,

cov(

Bn(λ), Bn(µ))

=σ2

n

n∑

t=1

sin(λt) sin(µt) →

σ2/2 if λ = µ,0 if λ 6= µ,

cov(

An(λ), Bn(µ))

=σ2

n

n∑

t=1

cos(λt) sin(µt) → 0.

By the Lindeberg central limit theorem, Theorem 3.16, we now find that the se-quence of vectors

(

An(λ), Bn(λ), An(µ), Bn(µ))

converges in distribution to a vector

(G1, G2, G3, G4) with the N4

(

0, (σ2/2)I)

distribution. Consequently, by the continuousmapping theorem,

(

In(λ), In(µ))

=1

2π

(

A2n(λ) +B2

n(λ), A2n(µ) +B2

n(µ))

1

2π(G2

1 +G22, G

23 +G2

4).

The vector on the right is distributed as σ2/(4π) times a vector of two independent χ22

variables. Because the chisquare distribution with two degrees of freedom is identical tothe standard exponential distribution with parameter 1/2, this is the same as a vectorof two independent exponential variables with means σ2/(2π).

This concludes the proof in the special case that Xt = Zt and for two frequencies λand µ. The case of k different frequencies µ1, . . . , µk can be treated in exactly the sameway, but is notationally more involved.

The spectral density of a general time series of the form Xt =∑

j ψjZt−j satisfies

fX(λ) =∣

∣ψ(λ)∣

∣2fZ(λ), for ψ(λ) =

∑

ψje−ijλ the transfer function of the linear filter. We

shall prove the theorem by showing that the periodograms In,X and In,Z approximatelysatisfy a similar relation. Indeed, rearranging sums we find

dX(λ) =1√n

n∑

t=1

(

∑

j

ψjZt−j)

e−itλ =∑

j

ψje−ijλ

( 1√n

n−j∑

s=1−jZse−isλ

)

.

If we replace the sum∑n−j

s=1−j in the right side by the sum∑n

s=1, then the right sideof the display becomes ψ(λ)dZ(λ). These two sums differ by 2(|j| ∧ n) terms, every ofthe terms Zse

−iλs having mean zero and variance bounded by σ2, and the terms beingindependent. Thus

E∣

∣

∣

1√n

n−j∑

s=1−jZse−isλ − dZ(λ)

∣

∣

∣

2

≤ 2|j| ∧ nn

σ2.

In view of the inequality E|X| ≤(

EX2)1/2

, we can drop the square on the left side if wetake a root on the right side. Next combining the two preceding displays and applyingthe triangle inequality, we find

E∣

∣dX(λ)− ψ(λ)dZ(λ)∣

∣ ≤∑

j

|ψj |(

2|j| ∧ nn

)1/2

σ.


The jth term of the series is bounded by |ψj |(

2|j|/n)1/2

σ and hence converges to zero

as n → ∞, for every fixed j; it is also dominated by |ψj |√2σ. Therefore, the right side

of preceding display converges to zero as n→ ∞.By Markov’s and Slutsky’s lemmas it follows that dX(λ) has the same limit dis-

tribution as ψ(λ)dZ(λ). Hence, by the continuous mapping theorem, the sequenceIn,X(λ) has the same limit distribution as |ψ(λ)|2In,Z(λ). This is true for every fixedλ, but also for finite sets of λ jointly. The proof is finished, because the variables|ψ(λ)|2In,Z(λ) are asymptotically distributed as independent exponential variables withmeans |ψ(λ)|2fZ(λ), by the first part of the proof.

A remarkable aspect of the preceding theorem is that the periodogram values In(λ)at different frequencies are asymptotically independent. This is well visible already forfinite values of n in plots of the periodogram, which typically have a wild and peakyappearance. The theorem says that for large n such a plot should be similar to a plotof independent exponentially distributed variables Eλ with means fX(λ) (on the y-axis)versus λ (on the x-axis).

0.0 0.1 0.2 0.3 0.4 0.5

-30

-20

-10

010

frequency

spec

trum

Figure 12.1. Periodogram of a realization of the moving average Xt = 0.5Zt + 0.2Zt−1 + 0.5Zt−2 for aGaussian white noise series. (Vertical scale in decibel, i.e. 10 log.)

The following theorem shows that we even have independence of the periodogramvalues at natural frequencies that converge to the same value.


12.5 Theorem. LetXt =∑

ψjZt−j for an i.i.d. sequence Zt with finite second momentsand coefficients ψj with

∑

j |ψj | < ∞. Let λn = (2π/n)jn for jn ∈ Z be a sequenceof natural frequencies such that λn → λ ∈ (0, π). Then for any k ∈ Z the variablesIn(λn−k2π/n), In

(

λn−(k−1)2π/n)

, . . . , In(λn+k2π/n) are asymptotically distributedas independent exponential variables with mean fX(λ).

Proof. The second part of the proof of Theorem 12.4 is valid uniformly in λ and henceapplies to sequences of frequencies λn. For instance, the continuity of ψ(λ) and the proofshows that

∣

∣dX(µn) − ψ(µn)dZ(µn)∣

∣P→ 0 for any sequence µn. It suffices to extend the

first part of the proof, which concerns the special case that Xt = Zt.Here we apply the same method as in the proof of Theorem 12.4. The limits of

the covariances are as before, where in the present case we use the fact that we areconsidering natural frequencies only. For instance,

cov(

An

(

k2π

n

)

, Bn

(

l2π

n

))

=σ2

n

n∑

t=1

cos(

kt2π

n

)

sin(

lt2π

n

)

= 0,

for every integers k, l such that (k + l)/n and (k − l)/n are not contained in Z. Anapplication of the Lindeberg central limit theorem concludes the proof.

The sequences of frequencies λn + j(2π/n) considered in the preceding theorem allconverge to the same value λ. That Theorem 12.4 remains valid (it does) if we replacethe fixed frequencies µj in this theorem by sequences µj,n → µj is not very surprising.More surprising is the asymptotic independence of the periodograms In(µn,j) even ifevery sequence µn,j converges to the same frequency λ. As the proof of the precedingtheorem shows, this depends crucially on using natural frequencies µn,j .

The remarkable independence of the periodogram at frequencies that are very closetogether is a further explanation of the peaky appearance of the periodogram In(λ) asa function of λ. It is clear that this function is not a good estimator of the spectraldensity. However, the independence suggests ways of improving our estimator for fX(λ):the values In(λn − k2π/n), In

(

λn − (k − 1)2π/n)

, . . . , In(λn + k2π/n) can be viewed asa sample of independent estimators of fX(λ), for any k. Rather than one exponentiallydistributed veriable, we therefore have many exponentially distributed variables, all withthe same (asymptotic) mean. We exploit this in the next section.

In practice the periodogram may have one or a few extremely high peaks thatcompletely dominate its graph. This indicates an important cyclic component in the timeseries at those frequencies. Cyclic components of smaller amplitude at other frequenciesmay be hidden. It is practical wisdom that in such a case a fruitful spectral analysis atother frequencies requires that the peak frequencies are first removed from the signal(by a filter with the appropriate transfer function). We next estimate the spectrum ofthe new time series and, if desired, transform this back to obtain the spectrum of theoriginal series, using the formula given in Theorem 6.11. Because a spectrum without

Actually for all k, l, as a more elaborate argument may show.


high peaks is similar to the uniform spectrum of a white noise series, this procedure isknown as prewhitening of the data.

12.3 Estimating a Spectral Density

Given λ ∈ (0, π) and n, let λn be the natural frequency closest to λ. Then λn → λ asn → ∞ and Theorem 12.5 shows that for any k ∈ Z the variables In(λn + j2π/n) forj = −k, . . . , k are asymptotically distributed as independent exponential variables withmean fX(λ). This suggests to estimate fX(λ) by the average

(12.4) fk(λ) =1

2k + 1

∑

|j|≤kIn

(

λn +2π

nj)

.

As a consequence of Theorem 12.5, the variables (2k + 1)fk(λ) are asymptotically (asn→ ∞, for fixed k) distributed according the Gamma distribution with shape parameter2k+1 and mean (2k+1)fX(λ). This suggests a confidence interval for fX(λ) of the form,with χ2

k,α the α-quantile of the chisquare distribution with k degrees of freedom,

(

(4k + 2)fk(λ)

χ24k+2,1−α

,(4k + 2)fk(λ)

χ24k+2,α

)

.

12.6 EXERCISE. Show that, for every fixed k, this interval is asymptotically of level1− 2α. [Hint: a Γ(m/2, 1/2)-distribution is identical to a χ2

m-distribution.]

Instead of a simple average we may prefer a weighted average. For given weights Wj

such that∑

j Wj = 1, we use

(12.5) fk(λ) =∑

j

WjIn

(

λn +2π

nj)

.

This allows to give greater weight to frequencies λn + (2π/n)j that are closer to λ. Adisadvantage is that the asymptotic distribution is relatively complicated: it is a weightedsum of independent exponential variables. Because tabulating these types of distributionsis complicated, one often approximates them by a scaled chisquare distribution, wherethe scaling and the degrees of freedom are chosen to match the first two moments: theestimator c−1fk(λ) is approximately χ2

ν distributed for c and ν solving the equations

asymptotic mean of fk(λ) = fX(λ) = cν,

asymptotic variance of fk(λ) =∑

j

W 2j f

2X(λ) = c22ν.

This yields c proportional to fX(λ) and ν dependent on∑

j W2j only, fX(λ), and thus

confidence intervals based on this approximation can be derived as before. Rather than

12.3: Estimating a Spectral Density 203

0.0 0.1 0.2 0.3 0.4 0.5

-15

-10

-50

5

frequency

spec

trum

Figure 12.2. Smoothed periodogram of a realization of the moving average Xt = 0.5Zt+0.2Zt−1+0.5Zt−2

for a Gaussian white noise series. (Vertical scale in decibel, i.e. 10 log.)

using the approximation, we could of course determine the desired quantiles by computersimulation.

Because the periodogram is continuous as a function of λ, a discrete average (overnatural frequencies) can be closely approximated by a continuous average of the form

(12.6) fW (λ) =

∫

W (ω)In,X−X(λ− ω) dω.

Here the weight function W is to satisfy∫

W (ω) dω = 1 and would typically concentrateits mass near zero, so that the average is computed over In,X−X(ω) for ω ≈ λ. We use

the periodogram of the centered series X −X, because the average involves nonnaturalfrequencies. In view of (12.2) this estimator can be written in the form

fW (λ) =

∫

W (ω)1

2π

∑

|h|<n

γn(h)e−i(λ−ω)h dω =

1

2π

∑

|h|<n

w(h)γn(h)e−iλh,

where w(h) =∫

eiωhW (ω) dω are the Fourier coefficients of the weight function. Thuswe have arrived at a generalization of the estimator (12.1). If we choose w(h) = 1 for|h| < r and w(h) = 0 otherwise, then the preceding display exactly gives (12.1). The more


general form can be motivated by the same reasoning: the role of the coefficients w(h) isto diminish the influence of the relatively unreliable estimators γn(h) (for h ≈ n) whenplugging in these sample estimators for the true auto-covariances in the expression forthe spectral density. Thus, the weights w(h) are typically chosen to decrease in absolutevalue from

∣

∣w(0)∣

∣ = 1 to∣

∣w(n)∣

∣ = 0 if h increases from 0 to n.The function W is known as the spectral window; its Fourier coefficients w(h) are

known as the lag window, tapering function or convergence factors. The last name comesfrom Fourier analysis, where convergence factors were introduced to improve the ap-proximation properties of a Fourier series: it was noted that for suitably chosen weightsw(h) the partial sums

∑

|h|<n w(h)γX(h)e−ihλ could be much closer to the full series∑

h γX(h)e−ihλ than the same partial sums with w ≡ 1. In our statistical context this iseven more so the case, because we introduce additional approximation error by replacingthe coefficients γX(h) by the estimators γn(h).

12.7 Example (Dirichlet kernel, uniform tapering). The tapering function

w(h) =

1 if |h| ≤ r,0 if |h| > r,

corresponds to the Dirichlet kernel

W (λ) =1

2π

∑

|h|≤re−ihλ =

1

2π

sin(r + 12 )λ

sin 12λ

.

Therefore, the estimator (12.1) should be compared to the estimators (12.5) and (12.6)with weights Wj chosen according to the Dirichlet kernel.

12.8 Example (Uniform kernel). The uniform kernel

W (λ) =

r/(2π) if |λ| ≤ π/r,0 if |λ| > π/r,

corresponds to the weight function w(h) = r sin(πh/r)/(πh). These choices of spectraland lag windows correspond to the estimator (12.4).

All estimators for the spectral density considered so far can be viewed as smoothedperiodograms: the value f(λ) of the estimator at λ is an average or weighted average ofvalues In(µ) for µ in a neighbourhood of λ. Thus “irregularities” in the periodogramare “smoothed out”. The amount of smoothing is crucial for the accuracy of the estima-tors. This amount, called the bandwidth, is determined by the parameter k in (12.4), theweights Wj in (12.5), the kernel W in (12.6), and, more hidden, by the parameter r in(12.1). For instance, a large value of k in (12.4) or a kernel W with a large variance in(12.6) result in a large amount of smoothing (large bandwidth). Oversmoothing, choos-ing a bandwidth that is too large, results in spectral estimators that are too flat andtherefore inaccurate, whereas undersmoothing, choosing too small a bandwidth, yields

12.3: Estimating a Spectral Density 205

spectral estimators that share the bad properties of the periodogram. In practice an “op-timal” bandwidth is often determined by plotting the spectral estimators for a number ofdifferent bandwidths and next choosing the one that looks “reasonable”. An alternativeis to use one of several methods of “data-driven” choices of bandwidths, such as crossvalidation or penalization. We omit a discussion.

Theoretical analysis of the choice of the bandwidth is almost exclusively asymp-totical in nature. Given a number of observations tending to infinity, the “optimal”bandwidth decreases to zero. A main concern of an asymptotic analysis is to determinethe rate of decrease. The key concept is the bias-variance trade-off. Because the peri-odogram is more or less unbiased, little smoothing gives an estimator with small bias.However, as we have seen, the estimator will have a large variance. Oversmoothing hasthe opposite effects: it cuts variance by including more (independent) information, butcreates bias unless the spectral density is constant in an interval. Because accurate esti-mation requires that both bias and variance are small, we need an intermediate value ofthe bandwidth.

We shall quantify this bias-variance trade-off for estimators of the type (12.1), wherewe consider r as the bandwidth parameter. As our objective we take to minimize themean integrated square error

(12.7) 2πE

∫ π

−π

∣

∣fn,r(λ)− fX(λ)∣

∣

2dλ.

The integrated square error is a global measure of the discrepancy between fn,r andfX . Because we are interested in fX as a function, it is more relevant than the distance∣

∣fn,r(λ)− fX(λ)∣

∣ for any fixed λ.We shall use Parseval’s identity, which says that the space L2(−π, π] is isometric to

the space ℓ2.

12.9 Lemma (Parseval’s identity). Let f : (−π, π] → C be a measurable function suchthat

∫

|f |2(λ) dλ <∞. Then its Fourier coefficients fj =∫ π

−π eijλf(λ) dλ satisfy

∫ π

−π

∣

∣f(λ)∣

∣

2dλ =

1

2π

∞∑

j=−∞|fj |2.

12.10 EXERCISE. Prove this identity. Also show that for a pair of square-integrable,measurable functions f, g: (−π, π] → C we have

∫

f(λ)g(λ) dλ =∑

j fjgj .

The function fn,r − fX possesses the Fourier coefficients γn(h) − γX(h) for |h| < rand −γX(h) for |h| ≥ r. Thus, Parseval’s identity yields that the mean integrated squareerror (12.7) is equal to

E∑

|h|<r

∣

∣γn(h)− γX(h)∣

∣

2+∑

|h|≥r

∣

∣γX(h)∣

∣

2

In a rough sense the two terms in this formula are the “variance” and the square “bias”term. A large value of r clearly decreases the second, square bias term, but increases the


first, variance term. This variance term can itself be split into a bias and variance termand we can reexpress the mean integrated square error as

∑

|h|<r

var γn(h) +∑

|h|<r

∣

∣Eγn(h)− γX(h)∣

∣

2+∑

|h|≥r

∣

∣γX(h)∣

∣

2.

Assume for simplicity that EXt = 0 and that Xt =∑

j ψjZt−j for a summable sequence(ψj) and i.i.d. sequence Zt with finite fourth moments. Furthermore, assume that we use

the estimator γn(h) = n−1∑n−h

t=1 Xt+hXt rather than the true sample auto-covariancefunction. (The results for the general case are similar, but the calculations will be moreinvolved. Note that the difference between the present estimator and the usual one is

approximately X2and

∑

|h|<r E(X)4 = O(r/n2), under appropriate moment conditions.

This is negligible in the following.) Then the calculations in Chapter 5 show that thepreceding display is equal to

∑

|h|<r

1

n2

∑

|g|<n−|h|

(

n− |h| − |g|)

[

κ4σ4∑

i

ψiψi+hψi+gψi+g+h + γ2X(g)

+ γX(g + h)γX(g − h)

]

+∑

|h|<r

( |h|nγX(h)

)2

+∑

|h|≥rγ2X(h)

≤ |κ4|σ4

n

∑

h

∑

g

∑

i

|ψiψi+hψi+gψi+g+h|+2r + 1

n

∑

g

γ2X(g)

+1

n

∑

h

∑

g

∣

∣γX(g + h)γX(g − h)∣

∣+r2

n2

∑

h

γ2X(h) +∑

|h|≥rγ2X(h).

To ensure that the second and last terms on the right converge to zero as n → ∞ wemust choose r = rn such that rn/n → 0 and rn → ∞, respectively. The first and thirdterm are of the order O(1/n), and the fourth term is of the order O(r2n/n

2). Under therequirements rn → ∞ and rn/n→ 0 these terms are dominated by the other terms, andthe whole expression is of the order

rnn

+∑

|h|≥rn

γ2X(h).

A first conclusion is that the sequence of estimators fn,rn is asymptotically consistentfor estimating fX relative to the L2-distance whenever rn → ∞ and rn/n → 0. A widerange of sequences rn satisfy these constraints. For an optimal choice we must makeassumptions regarding the rate at which the bias term

∑

|h|≥r γ2X(h) converges to zero

as r → ∞. For any constant m we have that

rnn

+∑

|h|≥rn

γX(h)2 ≤ rnn

+1

r2mn

∑

h

γ2X(h)h2m.

Suppose that the series on the far right converges; this means roughly that the auto-covariances γX(h) decrease faster than |h|−m−1/2 as |h| → ∞. Then we can make a

12.4: Estimating a Spectral Distribution 207

bias-variance trade-off by balancing the terms rn/n and 1/r2mn . These terms are of equalorder for rn ≍ n1/(2m+1); for this choice of rn we find that

E

∫ π

−π

∣

∣fn,rn(λ)− fX(λ)∣

∣

2dλ = O

(( 1

n

)2m/(2m+1))

.

Large values of m yield the fastest rates of convergence. The rate n−m/(2m+1) is alwaysslower than n−1/2, the rate obtained when using parametric spectral estimators, butapproaches this rate as m→ ∞.

Unfortunately, in practice we do not know γX(h) and therefore cannot check whetherthe preceding derivation is valid. So-called cross-validation techniques may be used todetermine a suitable constant m from the data.

The condition that∑

h γ2X(h)h2m < ∞ can be interpreted in terms of the smooth-

ness of the spectral density. It can be shown (see Exercise 12.11) that the Fourier coef-

ficients of the mth derivative f(m)X of fX are equal to γX(h)(−ih)m. Consequently, by

Parseval’s identity∑

h

γ2X(h)h2m =

∫ π

−πf(m)X (λ)2 dλ.

Thus the left side is finite if and only if the mth derivative of fX exists and is square-integrable. Thus for time series with such an m-smooth spectral density, one can estimatethe spectral density with an integrated square error of order O(n−2m/(2m+1)). This rate

is uniform over the set of all time series with spectral densities such that∫

f(m)X (λ)2 dλ

is uniformly bounded, i.e. balls in the Sobolev space of order m.

12.11 EXERCISE. Show that the Fourier coefficients∫ π

−π eijλ f ′(λ) dλ of the derivative

of a continuously differentiable function f : [−π, π] → R with f(−π) = f(π), are equal to−ijfj , for fj the Fourier coefficients of f . [Hint: apply partial integration.]

This conclusion is similar to the conclusion in the problem of estimating a densitygiven a random sample from this density, where alsom-smooth densities can be estimatedwith an integrated square error of order O(n−2m/(2m+1)). The smoothing methods dis-cussed previously (the estimator (12.6) in particular) are also related to the method ofkernel smoothing for density estimation. It is interesting that historically the method ofsmoothing was first applied to the problem of estimating a spectral density. Here kernelsmoothing of the periodogram was a natural extension of taking simple averages as in(12.4), which itself is motivated by the independence property of the periodogram. Themethod of kernel smoothing for the problem of density estimation based on a randomsample from this density was invented later, even though this problem by itself appearsto be simpler.


* 12.4 Estimating a Spectral Distribution

In the preceding section it is seen that nonparametric estimation of a spectral densityrequires smoothing and yields rates of convergence n−α for values of α < 1/2. In contrast,a spectral distribution function can be estimated at the “usual” rate of convergencen−1/2 and natural estimators are asymptotically normally distributed. We assume Xt isa stationary time series with spectral density fX .

The spectral distribution function FX(λ0) =∫ λ0

−π fX(λ) can be written in the form

∫ π

−πa(λ)fX(λ) dλ

for a the indicator function of the interval (−π, λ0]. We shall consider estimation of ageneral functional of this type by the estimator

In(a): =

∫ π

−πa(λ)In(λ) dλ.

12.12 Theorem. Suppose that Xt =∑

j ψjZt−j for an i.i.d. sequence Zt with finitefourth cumulant κ4 and constants ψj such that

∑

j |ψj | < ∞. Moreover, assume that∑

h |h|γ2X(h) <∞. Then, for any symmetric function a such that∫ π

−π a2(λ) dλ <∞,

√n(

In(a)−∫

afX dλ)

N(

0, κ4

(

∫

afX dλ)2

+ 4π

∫

a2f2X dλ)

.

Proof. We can expand a(λ) in its Fourier series a(λ) =∑

j aje−ijλ (say). By Parseval’s

identity∫

afX dλ =∑

h

γX(h)ah.

Similarly, by (12.2) and Parseval’s identity

In(a) =

∫

aIn dλ =∑

|h|<n

γ∗n(h)ah.

First suppose that ah = 0 for |h| > m and some m. Then∫

aIn dλ is a linear combinationof(

γn(0), . . . , γn(m))

. By Theorem 5.8, as n→ ∞,

√n(

∑

h

γn(h)ah −∑

h

γX(h)ah

)

∑

h

ahZh,

where (Z−m, . . . , Z0, Z1, . . . , Zm) is a mean zero normally distributed random vector suchthat (Z0, . . . , Zm) has covariance matrix V as in Theorem 5.8 and Z−h = Zh for every


h. Thus∑

h ahZh is normally distributed with mean zero and variance

(12.8)

∑

g

∑

h

Vg,hagah = κ4

(

∑

g

agγX(g))2

+∑

g

∑

h

(

∑

k

γX(k + h)γX(k + g)

+∑

k

γX(k + h)γX(k − g))

agah

= κ4

(

∫

afX dλ)2

+ 4π

∫

a2f2X dλ.

The last equality follows after a short calculation, using that ah = a−h. (Note that wehave used the expression for Vg,h given in Theorem 5.8 also for negative g or h, which iscorrect, because both cov(Zg, Zh) and the expression in Theorem 5.8 remain the same ifg or h is replaced by −g or −h.)

This concludes the proof in the case that ah = 0 for |h| > m, for some m. Thegeneral case is treated with the help of Lemma 3.10. Set am =

∑

|j|≤m aje−iλj and apply

the preceding argument to Xn,m: =√n∫

am(In − fX) dλ) to see that Xn,m N(0, σ2m)

as n→ ∞, for every fixed m. The asymptotic variance σ2m is the expression given in the

theorem with am instead of a. If m → ∞, then σ2m converges to the expression in the

theorem, by the dominated convergence theorem, because a is squared-integrable andfX is uniformly bounded. Therefore, by Lemma 3.10 it suffices to show that for everymn → ∞

(12.9)√n

∫

(a− amn)(In − fX) dλ P→ 0.

Set b = a−amn. The variance of the random variable in (12.9) is the same as the variance

of∫

(a− amn)In dλ, and can be computed as, in view of Parseval’s identity,

var(

√n

2π

∑

|h|<n

γn(h)bh

)

=n

4π2

∑

|g|<n

∑

|h|<n

cov(

γn(g), γn(h))

bgbh

=n

4π2

∑

|g|<n

∑

|h|<n

1

n2

n−|g|∑

s=1

n−|h|∑

t=1

cov(Xs+gXs, Xt+hXt)bgbh.

Using the same approach as in Section 5.2, we can rewrite this as

1

4π2n

∑

|g|<n

∑

|h|<n

n−|g|∑

s=1

n−|h|∑

t=1

(

κ4σ4∑

i

ψi+gψiψt−s+h+iψt−s+i

+ γX(t− s+ h− g)γX(t− s) + γX(t− s− g)γX(t− s+ h))

bgbh.


The absolute value of this expression can be bounded above by

1

4π2

∑

g

∑

h

∑

k

(

∑

i

|ψi+gψiψk+h+iψk+i||κ4|σ4 +∣

∣γX(k + h− g)γX(k)∣

∣

+∣

∣γX(k − g)γX(k + h)∣

∣

)

|bgbh|

=1

4π2

(

|k4|(

∫

bfXdλ)2

+ 4πb2∫

f2Xdλ

)

,

by the same calculation as in (12.8), where we define

b =∑

h

|bh|e−iλh, fX(λ) =

∑

h

∑

i

|ψiψi+h|σ2e−iλh, fX(λ) =

∑

h

∣

∣γX(h)∣

∣e−iλh.

Under our assumptions fXand f

Xare bounded functions. It follows that var

∫

bnIn dλ→0 if

∫

b2n dλ→ 0. This is true in particular for bn = a− amn.

Next the mean of the left side of (12.9) can be computed as

√n(

∑

|h|<n

Eγn(h)bh −∫

b fX dλ)

=√n(

∑

|h|<n

n− |h|n

γX(h)bh −∑

h

γX(h)bh

)

= −∑

h

n ∧ |h|√n

γX(h)bh.

By the Cauchy-Schwarz inequality this is bounded in absolute value by the square rootof

∑

h

|bh|2∑

h

(n ∧ |h|)2n

γ2X(h) ≤∑

h

|bh|2∑

h

|h|γ2X(h).

Under our assumptions this converges to zero as∫

b2 dλ→ 0.

The preceding theorem is restricted to symmetric functions a, but can easily beextended to general functions, because by the symmetry of the spectral density

∫

a(λ)fX(λ) dλ =

∫

a(λ) + a(−λ)2

fX(λ) dλ.

12.13 EXERCISE. Show that for a possibly nonsymmetric function a the theorem isvalid, but with asymptotic variance

κ4

(

∫

afX dλ)2

+ 2π

∫ π

−πa2f2X dλ+ 2π

∫ π

−πa(λ)a(−λ)f2X(λ) dλ.

12.14 Example. To obtain the limit distribution of the estimator for the spectral dis-tribution function at the point λ0 ∈ [0, π], we apply the theorem with the symmetricfunction a =

(

1(−π,λ0] + 1(−λ0,π]

)

. The asymptotic variance is equal to κ4FX(λ0)2 +

4π∫ λ0

λ0f2X dλ+ 2π

∫ π

λ0f2X dλ for λ0 ∈ [0, π].


12.15 Example. The choice a(λ) = cos(hλ) yields the estimator (for 0 ≤ h < n)

∫

cos(hλ)In(λ) dλ = Re

∫

eihλIn(λ) dλ =1

n

n−h∑

t=1

Xt+hXt

of the auto-covariance function γX(h) in the case that EXt = 0. Thus the precedingtheorem contains Theorem 5.8 as a special case. The present theorem shows how theasymptotic covariance of the sample auto-covariance function can be expressed in thespectral density.

12.16 EXERCISE. Show that the sequence of bivariate random vectors√n(∫

a(In −fX) dλ,

∫

b(In−fX) dλ)

converges in distribution to a bivariate Gaussian vector (Ga, Gb)with mean zero and EGaGb = κ4

∫

afX dλ∫

bfX dλ+ 4π∫

abf2X dλ.

12.17 EXERCISE. Plot the periodogram of a white noise series of length 200. Does thislook like a plot of 200 independent exponential variables?

12.18 EXERCISE. Estimate the spectral density of the simulated time series given inthe file ~sda/Cursusdata/sim2 by a smoothed periodogram. Compare this to the estimateobtained assuming that sim2 is an AR(3) series.

12.19 EXERCISE. Estimate the spectral density of the Wolfer sunspot numbers (theobject sunspots in Splus) by(i) a smoothed periodogram;(ii) the spectral density of an appropriate AR-model.Note: the mean is nonzero.

13Maximum Likelihood

The method of maximum likelihood is one of the unifying principles of statistics, andapplies equally well to replicated experiments as to time series. Given observationsX1, . . . , Xn with a joint probability density (x1, . . . , xn) 7→ pn,θ(x1, . . . , xn) that dependson a parameter θ, the likelihood function is the stochastic process

θ 7→ pn,θ(X1, . . . , Xn).

The maximum likelihood estimator for θ, if it exists, is the value of θ that maximizes thelikelihood function.

The likelihood function corresponding to i.i.d. observations X1, . . . , Xn is the prod-uct of the likelihood functions of the individual observations, which makes likelihoodinference relatively easy in this case. For time series models the likelihood function maybe a more complicated function of the observations and the parameter. This compli-cates the practical implementation of likelihood inference. It also makes its theoreticalanalysis more involved, although in “most” situations the final results are not that dif-ferent from the more familiar i.i.d. case. In particular, maximum likelihood estimators of“smooth” finite-dimensional parameters are typically

√n-consistent and, after centering

at the true parameter and scaling, possess a normal limit distribution, with mean zeroand covariance the inverse of the Fisher information matrix.

In this chapter we study the maximum likelihood estimator, and some approxima-tions. We also consider the effect of model misspecification: using the likelihood for amodel that does not contain the true distribution of the data. Such misspecification ofthe model may be unintended, but also be deliberate. For instance, the likelihood un-der the assumption that X1, . . . , Xn are part of a stationary Gaussian time series Xt ispopular for inference, even if one may not believe that the time series is Gaussian. Thecorresponding maximum likelihood estimator is closely related to the least squares esti-mators and turns out to perform well also for a wide range of non-Gaussian time series.Another example is to postulate that the innovations in a GARCH model are Gaussian.The resulting estimators again work well also for non-Gaussian innovations.

A misspecified likelihood is also referred to as a quasi likelihood and the resulting

13.1: General Likelihood 213

estimators as quasi likelihood estimators. Alternatively, a misspecified maximum likeli-hood estimator falls in the class of M-estimators, which are defined as the maximizersof a given stochastic process (a contrast function). We prefer M -estimator over quasi-likelihood estimator. Yet another term, which sometimes refers to the same, is pseudolikelihood. In our heuristic discussion of the asymptotic properties of maximum likeli-hood estimators we use this for a certain process that is close to the likelihood, but hasa simpler structure.

13.1 General Likelihood

A convenient representation of a likelihood is obtained by repeated conditioning. Toalleviate the notation we abuse notation by writing p(y|x) for a conditional density ofa variable Y given that another variable X takes the value x, and denote the marginaldensity ofX by p(x). Thus the argument(s) of the function must reveal its identity. In thisnotation the likelihood corresponding to the observations X1, . . . , Xn is pθ(X1, . . . , Xn),and it can be decomposed as

(13.1) pθ(X1, . . . , Xn) = pθ(X1)pθ(X2|X1) · · · pθ(Xn|Xn−1, . . . , X1).

Clearly we must select appropriate versions of the (conditional) densities, but we shallnot worry about technical details in this section.

The decomposition resembles the factorization of the likelihood of an i.i.d. sample ofobservations, but an important difference is that the n terms on the right may all be of adifferent form. Even if the time series Xt is strictly stationary, each further term entailsconditioning on a bigger past and hence is potentially of a different character than theearlier terms. However, in many examples the “present” Xt is nearly independent of the“distant past” (Xs: s ≪ t) given the “near past” (Xs: s < t, s ≈ t). Then the likelihooddoes not change much if the conditioning in each term is limited to a fixed number ofvariables in the past, and most of the terms of the product will take almost a commonform. Alternatively, we may augment the conditioning in each term to include the full“infinite past”, yielding the pseudo likelihood

(13.2) θ 7→ pθ(X1|X0, X−1, . . .)pθ(X2|X1, X0, . . .)× · · · × pθ(Xn|Xn−1, Xn−2, . . .).

If the time series Xt is strictly stationary, then the tth term pθ(Xt|Xt−1, Xt−2, . . .) inthis product is a fixed measurable function, independent of t, applied to the vector(Xt, Xt−1, . . .). In particular, the terms of the product form a strictly stationary timeseries, which will be ergodic if the original time series Xt is ergodic. This is almost asgood as the i.i.d. terms obtained in the case of an i.i.d. time series.

The pseudo likelihood (13.2) cannot be used in practice, because the “negative”variables X0, X−1, . . . are not observed. However, the preceding discussion suggests thatthe maximum pseudo likelihood estimator, defined as the maximizer of the pseudo like-lihood, may behave the same as the true maximum likelihood estimator. Moreover, if itis true that the past observations X0, X−1, . . . do not play an important role in defining

214 13: Maximum Likelihood

the pseudo likelihood, then we could also replace them by arbitrary values, for instancezero, and hence obtain an observable criterion function.

13.1 Example (Markov time series). If the time series Xt is Markov, then the con-ditioning in each term pθ(Xt|Xt−1, . . . , X1) or pθ(Xt|Xt−1, Xt−2, . . .) can be restrictedto the single variable Xt−1. In this case the likelihood and the pseudo likelihood differonly in their first terms, which are pθ(X1) and pθ(X1|X0), respectively. This differenceshould be negligible if n is large.

Similarly, if the time series is Markov of order p, i.e. p(Xt|Xt−1, Xt−2, . . .) dependsonly on Xt, Xt−1, . . . , Xt−p, then the two likelihoods differ only in p terms. This shouldbe negligible if n is large relative to p.

13.2 Example (Autoregression). A causal auto-regressive time series defined relativeto an i.i.d. white noise series is an example of a pth order Markov series. Maximumlikelihood estimators for auto-regressive processes are commonly defined by using thepseudo likelihood with X0, . . . , X−p+1 set equal to zero. Alternatively, we simply dropthe first p terms of the likelihood, and work with the approximate likelihood

(σ, φ1, . . . , φp) 7→n∏

t=p+1

pZ,σ

(

Xt − φ1Xt−1 − · · · − φpXt−p)

,

for pZ,σ the density of the innovations. This can also be considered a conditional likeli-hood given the observations X1, . . . , Xp. The difference of this likelihood with the truelikelihood is precisely the marginal density of the vector (X1, . . . , Xp), which can becomplicated in general, but should have a noticeable effect on the maximum likelihoodestimator only if p is large relative to n.

13.3 Example (GARCH). A strictly stationary GARCH process Xt relative to ani.i.d. series Zt can be written as Xt = σtZt, for σ

2t = E(X2

t | Ft−1) and Ft the filtrationgenerated by Xt, Xt−1, . . .. From Theorem 9.13 it is known that the filtration Ft is alsothe natural filtration of the process Zt and hence the variable Zt is independent of σt,as this is measurable relative to Ft−1. Thus given Xt−1, Xt−2, . . . the variable Xt = σtZt

is obtained by first calculating σt from Xt−1, Xt−2, . . . and next multiplying σt by anindependent variable Zt. The conditional distribution of Xt given Xt−1, Xt−2, . . . is thedistribution of Zt scaled by σt. If pZ is the marginal density of the variables Zt, then thepseudo likelihood (13.2) takes the form

(13.3)n∏

t=1

1

σtpZ

(Xt

σt

)

.

The parameters α, φ1, . . . φp, θ1, . . . , θq are hidden in the process σt, through the GARCHrelation (9.1). Formula (13.3) is not the true likelihood, because it depends on the un-observable variables X0, X−1, . . . through the σt.

For an ARCH(q) process the conditional variances σ2t depend only on the variables

Xt−1, . . . , Xt−q, in the simple form

σ2t = α+ θ1X

2t−1 + · · ·+ θqX

2t−q.

13.2: Asymptotics under a Correct Model 215

In this case the true likelihood and the pseudolikelihood differ only in the first q of the nterms. This difference should be negligible. For practical purposes, if n is large relativeto q, we could either drop those first q terms, giving a conditional likelihood, or act as ifthe unobserved variables X2

0 , . . . , X2−q+1 are zero.

For general GARCH processes the difference between the likelihoods is more sub-stantial, because the conditional variance σ2

t depends on X2t−1, . . . , X

2t−q as well as on

the previous σ2s , causing a dependence on the full past X2

s with s < t of the process ofsquares. However, the dependence of σ2

t on lagged variables X2s decreases exponentially

fast as s → ∞. This suggests that we might again use the pseudo likelihood with theunobserved variables X2

s replaced by zero.That the variables X2

s with s ≤ 0 do not play an important role in the definition ofthe (pseudo) likelihood is also suggested by Theorem 9.17, which shows that a GARCHseries defined from arbitrary starting values converges to (strict) stationarity as t growsto infinity (provided that a strictly stationary GARCH process exists). (It is suggestedonly, as the stability assertion of Theorem 9.17 is about the time series itself and notabout its likelihood function.)

A practical implementation is to define σ20 , . . . , σ

2−p+1 and X2

0 , . . . , X2−q+1 to be

zero, and next compute σ21 , σ

22 , . . . recursively, using the GARCH relation (9.1) and the

observed values X1, . . . , Xn. By Theorem 9.17 these zero starting values cannot be thetrue values of the series if the series is strictly stationary, but any other initializationshould yield approximately the same likelihood. Given σ2

1 , σ22 , . . . , σ

2n and X1, . . . , Xn we

can use (13.3) as a contrast function.

13.2 Asymptotics under a Correct Model

In this section we consider the asymptotic behaviour of the maximum likelihood esti-mator under the assumption that the observations are sampled according to one of thedistributions used to form the likelihood. We denote the parameter of latter distributionby θ0.

We adopt the working hypothesis that the maximum likelihood estimators have thesame asymptotic properties as the corresponding maximum pseudo likelihood estimators.Furthermore, we assume that the time series Xt is strictly stationary and ergodic. Theseconditions are certainly too stringent, but they simplify the arguments. The conclusionstypically apply to any time series that “approaches stationarity” as t→ ∞ and for whichaverages converge to constants.

13.2.1 Consistency

Abbreviate Xt, Xt−1, . . . to ~Xt. The maximum pseudo likelihood estimator maximizesthe function

(13.4) θ 7→Mn(θ) =1

n

n∑

t=1

log pθ(Xt| ~Xt−1).


If the variables log pθ(Xt| ~Xt−1) are integrable, as we assume, then, by the ergodic the-orem, Theorem 7.4, the averages Mn(θ) converge to their expectation

(13.5) M(θ) = Eθ0 log pθ(X1| ~X0).

The expectation is taken under the “true” parameter θ0 governing the distribution ofthe time series Xt. The difference of the expected valuesM(θ0) andM(θ) can be written

as, for µ the dominating measure of the conditional densities pθ(·| ~X0),

M(θ0)−M(θ) = Eθ0

∫

(

logpθ0(x1| ~X0)

pθ(x1| ~X0)

)

pθ0(x1| ~X0) dµ(x1).

The integral inside the expectation is the Kullback-Leibler divergence between the (condi-

tional) measures with densities pθ(·| ~X0) and pθ0(·| ~X0). The Kullback-Leibler divergence∫

log(p/q) dP between two probability measures P and Q (with densities p and q relativeto some measure) is always nonnegative, and it is zero if and only if the two measuresare the same. This yields the following lemma.

13.4 Lemma. The function M : Θ → R defined in (13.5) satisfies M(θ) ≤ M(θ0) for

every θ ∈ Θ, with equality if and only if the conditional distribution of X1 given ~X0 = ~x0is identical when computed for θ and θ0, for almost every ~x0 under the distribution of~X0 under θ0.

Proof. The inequality log x ≤ 2(√x− 1), valid for x > 0, yields, for any two probability

measures p and q relative to some measure µ,

∫

logq

pp dµ ≤

∫

2(

√

q

p− 1)

p dµ = −∫

(√q −√

p)2 dµ.

It follows that∫

log(p/q) p dµ is nonnegative, and equal to zero if and only if p = q almost

everywhere under µ. We apply this with p and q equal to the densities x1 7→ pθ0(x1| ~X0)

and x1 7→ pθ(x1| ~X0), respectively, to the inner integral in the right side of the expressionforM(θ)−M(θ0) preceding the lemma, and conclude that this is nonnegative. Moreover,it is zero if and only if the two conditional laws are identical. The whole expression,including the outer expectation Eθ0 , is nonegative, and vanishes only if the inner integralis zero for almost every ~x0.

Under the reasonable assumption that each value of θ indexes a different underlying(conditional) distribution of the time series Xt (“identifiability of θ”), we conclude thatthe map θ 7→M(θ) possesses a unique absolute maximum at θ = θ0.

The convergence of the criterion functionsMn toM , and the definitions of θn and θ0as their points of maximum suggest that θn converges to θ0. In other words, we expect themaximum likelihood estimators to be consistent for the “true” value θ0. This argumentcan be made mathematically rigorous, for instance by imposing additional conditionsthat guarantee the uniform convergence of Mn to M . See e.g. Theorem 3.17.

13.2: Asymptotics under a Correct Model 217

13.2.2 Asymptotic Normality

If it is true that θn → θ0, then the question poses itself to characterize the rate atwhich the difference θn − θ0 converges to zero and to find a possible limit distributionfor the rescaled difference. We shall assume that the parameter set Θ is a subset of Rd,and establish the asymptotic normality of the sequence

√n(θn − θ0), under smoothness

conditions on the contrast function θ 7→Mn(θ) we shall.We assume that the gradient Mn(θ) and second derivative matrix Mn(θ) of the

map θ 7→ Mn(θ) exist and are continuous. Because θn is a point of maximum of Mn, it

satisfies the stationary equation Mn(θn) = 0 if θn is an inner point of the parameter set.

As we assume consistency of θn, this is the case with probability tending to 1 if θ0 is aninner point of Θ, as we shall assume. In the case that the parameter is one-dimensional,Taylor’s theorem shows that there exists a point θn on the line segment between θ0 andθn such that

0 = Mn(θn) = Mn(θ0) + Mn(θn)(θn − θ).

This can be rewritten as

√n(θn − θ0) = −

(

Mn(θn))−1√

nMn(θ0).

Taylor’s theorem does not apply in the same way to higher-dimensional parameters, forwhich the function Mn is vector-valued. In that case we can apply the theorem to everycoordinate function of Mn, giving the second last display (which is now vector-valued)for every coordinate, but possibly with a different point θn for every coordinate. Abusingnotation we shall omit the dependence on the coordinate and write the resulting matrixof gradients again as Mn(θn). We have

√nMn(θ) =

1√n

n∑

i=1

∂/∂θ log pθ(Xt| ~Xt−1), Mn(θ) =1

n

n∑

i=1

∂2/∂θ2 log pθ(Xt| ~Xt−1).

The matrices Mn(θ) are averages and hence the ergodic theorem guarantees their con-vergence in probability to a fixed matrix under reasonable conditions. Because θn

P→ θ0,it is a reasonable assumption that the matrices Mn(θn) and Mn(θ0) possess the samelimit. If we can also show that the sequence

√nMn(θ0) converges in distribution, then

we can conclude that the sequence√n(θn − θ0) converges in distribution, by Slutsky’s

lemma.The convergence of the sequence

√nMn(θ0) can be established by the martin-

gale central limit theorem, Theorem 4.16. To see this, first differentiate the identify∫

pθ(x1| ~x0) dµ(x1) = 1 twice to verify (under regularity conditions) that

∫

pθ(x1| ~x0) dµ(x1) =∫

pθ(x1| ~x0) dµ(x1) = 0.

The function θ 7→ ℓθ = log pθ possesses derivatives ℓθ = pθ/pθ and ℓθ = pθ/pθ − ℓθ ℓTθ .

Combination with the preceding display yields conditional versions of the Bartlett iden-tities “expectation of score function is zero” and “expectation of observed information is


minus the Fisher information”:

(13.6)Eθ

(

ℓθ(X1| ~X0)| ~X0

)

= 0,

−Eθ

(

ℓθ(X1| ~X0)| ~X0

)

= Eθ(

ℓθ ℓTθ (X1| ~X0)| ~X0

)

= Covθ(

ℓθ(X1| ~X0)| ~X0

)

.

The first identity shows that the sequence nMn(θ) =∑n

t=1 ℓθ(Xt| ~Xt−1) is a martingaleunder the true measure specified by the parameter θ. Under reasonable conditions themartingale central limit theorem yields that the sequence

√nMn(θ) is asymptotically

normal with mean zero and covariance matrix

Iθ = Covθ(

ℓθ(X1| ~X0))

.

By the second identity EθMn(θ) = −Iθ and hence we may expect that Mn(θ)P→ − Iθ

under θ, by the ergodic theorem. Combining this with Slutsky’s lemma as indicatedbefore, we find that, under the true parameter θ0,

√n(θn − θ0) N(0, I−1θ0

).

The matrix Iθ is known as the Fisher information matrix (average per observation).Typically, it can also be found as the limits

In,θ: =1

nEθ

∂

∂θlog pθ(X1, . . . , Xn)

∂

∂θlog pθ(X1, . . . , Xn)

T → Iθ,

− 1

n

∂2

∂θ2log pθ(X1, . . . , Xn)|θ=θn

→ Iθ.

The expression on the left in the second line is minus the second derivative matrix ofthe likelihood surface (times 1/n) at the maximum likelihood estimator, and is knownas the observed information. As it gives an estimate for the inverse of the asymptoticcovariance matrix of the sequence

√n(θn − θ) under θ, a “large” observed information

(strongly curved likelihood surface) indicates that the maximum likelihood estimator hassmall asymptotic covariance.

The first line of the preceding display connects the matrix Iθ to the definition of theFisher information for arbitrary observations, as it appears in the Cramer-Rao boundfor the variance of unbiased estimators. According to the Cramer-Rao theorem, thecovariance matrix of any unbiased estimator Tn of θ satisfies

Covθ Tn ≥ 1

nI−1n,θ.

The preceding informal derivation suggests that the asymptotic covariance matrix ofthe sequence

√n(θn − θ) under θ is equal to I−1θ . We interprete this as saying that the

maximum likelihood estimator is asymptotically of minimal variance, or asymptoticallyefficient.

For a rigorous proof of the asymptotic normality of the maximum likelihood esti-mator, and a precise formulation of its asymptotic efficiency (based on local asymptoticnormality), see ??

13.3: Asymptotics under Misspecification 219

13.5 EXERCISE. Compute the conditional maximum likelihood estimator for (θ, σ2) ina stationary, causal AR(1) model Xt = θXt−1 +Zt with Gaussian innovations Zt. Whatis its limit distribution? Calculate the Fisher information matrix Iθ.

13.6 EXERCISE. Find the pair of (conditional) likelihood equations Mn(α, θ) = 0 forestimating the parameters (α, θ) in an ARCH(1) model. Verify the martingale propertyof nMn(α, θ).

13.3 Asymptotics under Misspecification

Specification of a correct statistical model for a given time series is generally difficult, andit is typically hard to decide which of two given reasonable models is the better one. Thisobservation is often taken as motivation for modelling a time series as a Gaussian series,Gaussianity being considered as good as any other specification and Gaussian likelihoodsbeing relatively easy to handle. Meanwhile the validity of the Gaussian assumption maynot really be accepted. It is therefore important, in time series analysis even more than instatistics for replicated experiments, to consider the behaviour of estimation proceduresunder misspecification of a model.

Consider an estimator θn defined as the point of maximum of a likelihood functionof a model that possibly does not contain the true distribution of the observations. Itis again easier to consider the pseudo likelihood (13.2) than the true likelihood. Themisspecified maximum pseudo likelihood estimator is still the point of maximum of themap θ 7→Mn(θ) defined in (13.4). For the asymptotic analysis of θn we again apply theergodic theorem to see that Mn(θ) → M(θ) almost surely, for M(θ) the expectation ofMn(θ), defined by

(13.7) M(θ) = E log pθ(X1| ~X0) = E

∫

log pθ(x1| ~X0) p(x1| ~X0) dµ(x1).

The difference with the foregoing is that presently the expectation is taken under thetrue model for the series Xt, which may or may not be representable through one of theparameters θ. In the display this is indicated by denoting a general expectation by thesymbol E (and not Eθ0) and a generic density by p (and not pθ0). However, the same

reasoning suggests that θn converges in probability to a value θ0 that maximizes the mapθ 7→ M(θ). Without further specification of the model and the true distribution of thetime series, there is little more we can say about this maximizing value than that it givesconditional densities pθ0(·| ~x0) that are, on the average, closest to the true conditionaldensities p(·| ~x0) of the time series in terms of the Kullback-Leibler divergence.

Having ascertained that the sequence θn ought to converge to a limit, most of thesubsequent arguments to establish asymptotic normality of the sequence

√n(θn− θ0) go

through, also under misspecification, provided that

(13.8) E(

ℓθ0(X1| ~X0)| ~X0

)

= 0, a.s..


Under this condition the sequence nMn(θ0) is still a martingale, and√nMn(θ0) may

be expected to be asymptotically normal by the martingale central limit theorem. Bythe assumed ergodicity of the series Xt the sequence Mn(θ0) will still converge to afixed matrix, and the same may be expected to be true for the sequence of secondderivatives Mn(θn) evaluated at a point between θn and θ0. A difference is that theasymptotic covariance matrix Σθ0 of the sequence

√nMn(θ0) and the limit Rθ0 of the

sequence Mn(θ0) may no longer be each other’s negatives (as it the case under correctspecification by the second Bartlett identity (13.6)). The conclusion will therefore takethe more complicated form

√n(θn − θ0) N

(

0, R−1θ0Σθ0(R

−1θ0

)T)

.

The covariance of this normal limit distribution is referred to as the sandwich formula.Thus under (13.8) we may expect that the sequence θn will converge rapidly to a

limit θ0. Then “fitting the wrong model” will be useful as long as the density pθ0 (or anaspect of interest of it) is sufficiently close to the true density of the time series.

Condition (13.8) is odd, and not automatically satisfied. It is certainly satisfied ifthe point of maximum θ0 of the map θ 7→ M(θ) is such that for every ~x0 it is also apoint of maximum of the map

(13.9) θ 7→∫

log pθ(x1| ~x0) p(x1| ~x0) dµ(x1).

This is not necessarily the case, as the points of maxima of these functions may dependon the different different values of ~x0. The point θ0 is by definition the point of maximumof the average of these functions over ~x0, weighted by the true distribution of ~X0. Failureof (13.8) does not necessarily mean that the sequence

√n(θn − θ0) is not asymptotically

normally distributed, but it does mean that we cannot apply the martingale central limittheorem, as in the preceding argument.

In the next section we discuss a major example of possible misspecification: esti-mating a parameter by Gaussian maximum likelihood. The following example concernsGARCH processes, and illustrates that some misspecifications are harmless, whereasothers may cause trouble.

13.7 Example (GARCH). As found in Example 13.3, the pseudo likelihood for aGARCH(p, q) process takes the form

n∏

t=1

1

σt(θ)pZ

( Xt

σt(θ)

)

,

where pZ is the density of the innovations and θ = (α, φ1, . . . , φp, θ1, . . . , θq). In Chapter 9it was noted that a t-density pZ may be appropriate to explain the observed leptokurtictails of financial time series. However, the Gaussian density pZ(z) = exp(− 1

2z2)/

√2π is

more popular for likelihood based inference for GARCH processes. The correspondingGaussian log pseudo likelihood is up to additive and multiplicative constants equal to

θ 7→ − 1

n

n∑

t=1

log σ2t (θ)−

1

n

n∑

t=1

X2t

σ2t (θ)

.

13.4: Gaussian Likelihood 221

The expectation of this criterion function can be written as

M(θ) = −E(

log σ21(θ) +

E(X21 | F0)

σ21(θ)

)

.

Both the expectation and the conditional expectation on the right side are taken relativeto the true distribution of the time series. (The outermost expectation is relative to the

vector ~X0 that is hidden in σ21(θ) and in E(X2

1 | F0).) The sequence θn may be expectedto converge to the point of maximum of the map θ 7→M(θ).

Suppose that the GARCH equation (9.1) for the conditional variances is correctlyspecified, but allow the true density of the innovations to be not standard normal. ThusE(X2

1 | F0) = σ21(θ0) for the true parameter θ0 and hence

(13.10) M(θ) = −E(

log σ21(θ) +

σ21(θ0)

σ21(θ)

)

.

For every fixed σ20 , the map σ2 7→ log σ2 + σ2

0/σ2 assumes its minimal value on the

domain (0,∞) at σ2 = σ20 . It follows that the map θ 7→M(θ) is maximized at θ = θ0 no

matter the distribution of ~X0 that determines the expectation E that defines M(θ).We conclude that the use of the Gaussian density for pZ will lead (under regularity

conditions) to consistent estimators θn for the coefficients of the GARCH equation as longas the conditional variance model is correctly specified. In particular, the true densitypZ of the innovations need not be Gaussian. As shown in the preceding arguments, thispleasant fact is the result of the fact that the likelihood based on choosing the normaldensity for pZ depends on the observations Xt only through a linear function of thesquares X2

t . For another choice of density pZ , such as a t-density, this is not true, andwe cannot hope to be guarded against misspecification in that case.

The maximizing parameter θ0 maximizes the expression θ 7→ log σ21(θ)+σ

21(θ0)/σ

21(θ)

inside the brackets in (13.10) over θ, for any fixed value of ~X0. In the general notation this

corresponds to maximizing the inner integral in (13.7), for every value of ~X0. Therefore

(13.8) is satisfied, and we expect the sequence√n(θn− θ0) to be asymptotically normal,

with asymptotic variance given by the sandwich formula.These conclusions are all conditional on the assumption that the conditional variance

model stipulated by the GARCHmodel is correct. Because the parameter θ has a meaningonly relative to this model, this does not seem an important restriction.

13.4 Gaussian Likelihood

A Gaussian time series is a time series Xt such that the joint distribution of every finitesubvector (Xt1 , . . . , Xtn) of the series possesses a multivariate normal distribution. Inparticular, the vector (X1, . . . , Xn) is multivariate normally distributed, and hence itsdistribution is completely specified by a mean vector µn ∈ R

n and an (n×n) covariance


matrix Γn. For a stationary time series Xt this matrix has entries (Γn)s,t = γX(s − t),for γX the auto-covariance function of Xt.

We assume that both the mean µn and the covariance function γX can be expressedin a parameter θ of fixed dimension, so that we can write µn = µn(θ) and Γn = Γn(θ).Then the likelihood function under the assumption that Xt is a Gaussian time series isthe multivariate normal density viewed as function of the parameter and takes the form

(13.11) θ 7→ 1

(2π)n/21

√

det Γn(θ)e−

12

(

~Xn−µn(θ))T

Γn(θ)−1(

~Xn−µn(θ))

.

We refer to this function as the Gaussian likelihood, and to its point of maximum θn,if it exists, as the maximum Gaussian likelihood estimator. The Gaussian likelihood andthe corresponding estimator are commonly used, also in the case that the time series Xt

is non-Gaussian.Maximum Gaussian likelihood is closely related to the method of least squares, de-

scribed in Section 11.3. We can see this using the likelihood factorization (13.1). Fora Gaussian process the conditional densities pθ(xt|Xt−1, . . . , X1) are univariate normaldensities with means Eθ(Xt|Xt−1, . . . , X1) and variances vt(θ) equal to the prediction er-rors. (Cf. Exercise 13.8.) Furthermore, the best nonlinear predictor Eθ(Xt|Xt−1, . . . , X1)is automatically a linear combination of the predicting variables and hence coincides withthe best linear predictor Πt−1Xt(θ). This shows that the factorization (13.1) reduces to

n∏

t=1

1√

vt(θ)φ(Xt −Πt−1Xt(θ)

√

vt(θ)

)

.

Maximizing this relatively to θ is equivalent to maximizing its logarithm, which can bewritten in the form

(13.12) θ 7→ −n2log(2π)− 1

2

n∑

t=1

log vt(θ)−1

2

n∑

t=1

(


vt(θ).

This function differs in form from the least squares criterion function (11.6) only inthe presence of the function θ 7→ − 1

2

∑nt=1 log vt(θ). In situations where this function is

almost constant least squares and Gaussian maximum likelihood estimators are almostthe same.

13.8 EXERCISE. Suppose that the vector (X1, . . . , Xt) possesses a multivariate normaldistribution. Show that the conditional distribution of Xt given (X1, . . . , Xt−1) is normalwith mean the best linear predictor Πt−1Xt of Xt based on X1, . . . , Xt−1 and variancethe prediction error vt = var(Xt − Πt−1Xt). [Reduce to the case that the unconditionalmean is 0. Let Y be the best linear predictor of Xt based on X1, . . . , Xt−1. Show that

(Xt − Y, ~Xt−1) is multivariate normal with zero covariance between Xt − Y and ~Xt−1.Conclude that the variables are independent, and use this to deduce the conditionaldistribution of Xt = (Xt − Y ) + Y .]

13.9 Example (Autoregression). For causal stationary auto-regressive processes oforder p and t > p the best linear predictor of Xt is equal to φ1Xt−1+ · · ·+φpXt−p. Thus


the innovations Xt −Πt−1Xt are equal to the noise inputs Zt, and the prediction errorsvt are equal to σ2 = EZ2

t for t > p. Thus the function θ 7→ − 12

∑nt=1 log vt(θ) in the

formula for the Gaussian likelihood is approximately equal to − 12n log σ

2, if n is muchbigger than p. The log Gaussian likelihood is approximately equal to

−n2log(2π)− 1

2n log σ2 − 1

2

n∑

t=p+1

(

Xt − φ1Xt−1 − · · · − φpXt−p)2

σ2.

For a fixed σ2 maximization relative to φ1, . . . , φp is equivalent to minimization of thesum of squares and hence gives identical results as the method of least squares discussedin Sections 11.1 and 11.3. Maximization relative to σ2 gives (almost) the Yule-Walkerestimator discussed in Section 11.1.

13.10 Example (ARMA). In ARMA models the parameter σ2 enters as a multiplica-tive factor in the covariance function (cf. Section 11.3). This implies that the log Gaussianlikelihood function can be written in the form, with θ = (φ1, . . . , φp, θ1, . . . , θq),

−n2log(2π)− n

2log σ2 − 1

2

n∑

t=1

log vt(θ)−1

2

n∑

t=1

(


σ2vt(θ).

Differentiating this with respect to σ2 we see that for every fixed θ, the Gaussian likeli-hood is maximized relative to σ2 by

σ2(θ) =1

n

n∑

t=1

(


vt(θ).

Substituting this expression in the log Gaussian likelihood, we see that the maximumGaussian likelihood estimator of θ maximizes the function

θ 7→ −n2log(2π)− n

2log σ2(θ)− 1

2

n∑

t=1

log vt(θ)−n

2.

The latter function is called the profile log likelihood for θ, and the elimination of theparameter σ2 is referred to as concentrating out this parameter. We can drop the constantterms in the profile log likelihood and conclude that the maximum Gaussian likelihoodestimator θ for θ minimizes

(13.13) θ 7→ log1

n

n∑

t=1

(


vt(θ)+

1

n

n∑

t=1

log vt(θ).

The maximum Gaussian likelihood estimator for σ2 is σ2(θ).For causal, invertible stationary ARMA processes the innovations Xt − Πt−1Xt(θ)

are for large t approximately equal to Zt, whence vt(θ) ≈ EZ2t /σ

2 = 1. (Cf. the discus-sion in Section 8.4. In fact, it can be shown that |vt − 1| ≤ ct for some 0 < c < 1 andsufficiently large t.) This suggests that the criterion function (13.13) does not changemuch if we drop its second term and retain only the sum of squares. The correspond-ing approximate maximum Gaussian likelihood estimator is precisely the least squaresestimator, discussed in Section 11.3.


13.11 Example (GARCH). The distribution of a GARCH process Xt = σtZt dependson the distribution of the innovations Zt, but is rarely (or never?) Gaussian. Neverthelesswe may try and apply the method of Gaussian likelihood.

Because a GARCH series is a white noise series, the mean vector of (X1, . . . , Xn)is zero and its covariance matrix is diagonal. For a stationary GARCH process thediagonal elements are identical and can be expressed in the parameters of the GARCHprocess. (For instance EX2

t = α/(1 − φ − θ) for the GARCH(1, 1).) As the Gaussianlikelihood depends on the parameters of the model only through the variances v2t = EX2

t ,this likelihood can at best yield good estimators for functions of this variance. TheGARCH parameters cannot be recovered from this. For instance, we cannot estimate theparameter (α, φ, θ) of a GARCH (1, 1) process from a criterion function that depends onthese parameters only through α/(1− φ− θ).

We conclude that the method of Gaussian likelihood is useless for GARCH processes.We note that the “Gaussian likelihood” as discussed in this section is not the same asthe GARCH likelihood under the assumption that the innovations Zt are Gaussian (cf.Example 13.3). The Gaussian likelihood is similar in form, but with the conditionalvariances σ2

t in the latter replaced by their expectations.

In the preceding examples we have seen that for AR and ARMA processes theGaussian maximum likelihood estimators are, asymptotically as n → ∞, close to theleast squares estimators. The following theorem shows that the asymptotic behaviourof these estimators is identical to that of the least squares estimators, which is given inTheorem 11.19.

13.12 Theorem. Let Xt be a causal, invertible stationary ARMA(p, q) process relativeto an i.i.d. sequence Zt. Then the Gaussian maximum likelihood estimator satisfies

√n

((

~φp~θq

)

−(

~φp~θq

)

)

N(0, σ2J−1~φp,~θq),

where J~φp,~θqis the covariance matrix of (U−1, . . . , U−p, V−1, . . . , V−q) for stationary auto-

regressive processes Ut and Vt satisfying φ(B)Ut = θ(B)Vt = Zt.

Proof. The proof is long and technical. See Brockwell and Davis (1991), pages 375–396,Theorem 10.8.2.

The theorem does not assume that the time series Xt itself is Gaussian; it usesthe Gaussianity only as a working hypothesis to define maximum likelihood estimators.Apparently, using “the wrong likelihood” still leads to reasonable estimators. This isplausible, because Gaussian maximum likelihood estimators are asymptotically equiv-alent to least squares estimators and the method of least squares can be motivatedwithout reference to Gaussianity. Alternatively, it can be explained from considerationof the Kullback-Leibler divergence, as in Section 13.3.

On the other hand, in the case that the series Xt is not Gaussian the true maximumlikelihood estimators (if the true model, i.e. the true distribution of the noise factors Zt is


known) are likely to perform better than the least squares estimators. In this respect timeseries analysis is not different from the situation for replicated experiments. An importantdifference is that in practice non-Gaussianity may be difficult to detect, other plausibledistributions difficult to motivate, and other likelihoods be computationally complex.The Gaussian distribution is therefore frequently adopted as a working hypothesis.

* 13.4.1 Whittle Estimators

Because the Gaussian likelihood function of a mean zero time series depends on theautocovariance function only, it can be helpful to write it in terms of the spectral density.The covariance matrix of a vector (X1, . . . , Xn) belonging to a stationary time series Xt

with spectral density fX can be written as Γn(fX), for

Γn(f) =(

∫ π

−πei(s−t)λf(λ) dλ

)

s,t=1,...,n.

Thus if the time series Xt has spectral density fθ under the parameter θ and mean zero,then the log Gaussian likelihood can be written in the form

−n log(2π)− 1

2log det Γn(fθ)−

1

2~XTn Γn(fθ)

−1 ~Xn.

Maximizing this expression over θ is equivalent to maximizing the Gaussian likelihoodas discussed previously, but gives a different perspective. For instance, to fit an ARMAprocess we would maximize this expression over all “rational spectral densities” of the

form σ2∣

∣θ(e−iλ)∣

∣

2/∣

∣φ(e−iλ)∣

∣

2.

The true advantage of writing the likelihood in spectral notation is that it suggestsa convenient approximation. The Whittle approximation is defined as

−n2log(2π)− n

4π

∫ π

−πlog fθ(λ) dλ− n

4π

∫ π

−π

In(λ)

fθ(λ)dλ,

where In(λ) is the periodogram of the time series Xt, as defined in Section 12.2. Thisapproximation results from the following approximations, valid for sufficiently regularfunctions f ,

Γn(f)−1 ≈ Γn

( 1

f

) 1

4π2,

1

nlog det Γn(f) ≈ log(2π) +

1

2π

∫ π

−πlog f(λ) dλ,

combined with the identity

1

n~XTn Γn(f) ~Xn = 2π

∫

In(λ)f(λ) dλ.

The approximations are made precise in Lemma ?? (Kolmogorov-Szego), whereas theidentity follows by some algebra.


13.13 EXERCISE. Verify the identity in the preceding display.

The Whittle approximation is both more convenient for numerical manipulation andmore readily amenable to theoretical analysis. The point of maximum θn of the Whittleapproximation, if it exists, is known as the Whittle estimator. Conceptually, this comesdown to a search in the class of spectral densities fθ defined through the model.

13.14 Example (Autoregression). For an auto-regressive time series of fixed order theWhittle estimators are identical to the Yule-Walker estimators, which are also (almost)identical to the maximum Gaussian likelihood estimators. This can be seen as follows.

The Whittle estimators are defined by maximizing the Whittle approximation over

all spectral densities of the form fθ(λ) = σ2/∣

∣φ(e−iλ)∣

∣

2, for φ the auto-regressive poly-

nomial φ(z) = 1−φ1z−· · ·−φpzp. By the Kolmogorov-Szego formula (See ??), or directcomputation,

∫

log fθ(λ) dλ = 2π log(σ2/2π) is independent of the parameters φ1, . . . , φp.Thus the stationary equations for maximizing the Whittle approximation with respectto the parameters take the form

0 =∂

∂φk

∫

In(λ)

fθ(λ)/σ2dλ =

∂

∂φk

∫

φ(eiλ)φ(e−iλ)In(λ) dλ

=

∫

[

−eiλkφ(e−iλ)− φ(eiλ)e−iλk]

In(λ) dλ

= −2Re

∫

[

eiλk − φ1eiλ(k−1) − · · · − φpe

iλ(k−p)]

In(λ) dλ

= −2Re[

γ∗n(k)− φ1γ∗n(k − 1)− · · · − φpγ

∗n(k − p)

]

,

because γ∗n(h) = n−1∑n−|h|

t=1 Xt+|h|Xt are the Fourier coefficients of the function In for|h| < n, by (12.2). Thus the stationary equations are the Yule-Walker equations, apartfrom the fact that the observations have been centered at mean zero, rather than at Xn.

13.15 EXERCISE. Derive the Whittle estimator for σ2 for an autoregressive process.

If we write In(f) for∫

In(λ)f(λ) dλ, then a Whittle estimator is a point of minimumof the map

θ 7→Mn(θ) =

∫ π

−πlog fθ(λ) dλ+ In

( 1

fθ

)

.

In Section 12.4 it is shown that the sequence√n(

In(f) −∫

ffX dλ)

is asymptoticallynormally distributed with mean zero and covariance matrix Σ2(f), under some condi-tions. This implies that the sequence Mn(θ) converges for every fixed θ in probabilityto

M(θ) =

∫

log fθ(λ) dλ+

∫

fXfθ

(λ) dλ.

By reasoning as in Section 13.1 we expect that the Whittle estimators θn will be asymp-totically consistent for the parameter θ0 that minimizes the function θ 7→M(θ).


If the true spectral density fX takes the form fX = fθ0 for some parameter θ0, thenthis parameter is the minimizing value. Indeed, by the inequality − log x + (x − 1) ≥(√x− 1

)2, valid for every x ≥ 0,

M(θ)−M(θ0) =

∫

(

logfθfθ0

(λ) +fθ0fθ

(λ)− 1)

dλ ≥∫

(

√

fθ0fθ

(λ)− 1)2

dλ.

This shows that the function θ 7→ M(θ) possesses a minimum value at θ = θ0, and thispoint of minimum is unique as soon as the parameter θ is identifiable from the spectraldensity.

To derive the form of the limit distribution of the Whittle estimators we replaceM(θ) by its linear approximation, as in Section 13.1, and obtain that

√n(θn − θ0) = −

(

Mn(θn))−1√

nMn(θ0).

Denoting the gradient and second order derivative matrix of the function θ 7→ log fθ(λ)by ℓθ(λ) and ℓθ(λ), we can write

√nMn(θ) =

∫

ℓθ(λ) dλ− In

( ℓθfθ

)

,

Mn(θ) =

∫

ℓθ(λ) dλ+ In

( ℓθ ℓTθ − ℓθfθ

)

.

By the results of Section 12.4 the sequence√nMn(θ0) converges in distribution to a

normal distribution with mean zero and covariance matrix Σ(ℓθ0/fθ0), under some condi-tions. Furthermore, the sequence Mn(θ0) converges in probability to

∫

ℓθ0 ℓTθ0(λ) dλ =: Jθ0 .

If both are satisfied, then we obtain that

√n(θn − θ0) N

(

0, J−1θ0Σ(ℓθ0/fθ0)J

−1θ0

)

.

The asymptotic covariance is of the “sandwich form”.If the time series Xt is Gaussian, then the Whittle likelihood is an approximation

for the correctly specified likelihood, and the asyptotic covariance reduces to a simplerexpression. In this case, since the kurtosis of the normal distribution is 0,

Σ(f) = 4π

∫

ffT (λ)f2X(λ) dλ.

It follows that in the case, and with fX = fθ0 , the asymptotic covariance of the sequence√nMn(θ0) reduces to 4πJθ0 , and the sandwich covariance reduces to 4πJ−1θ0

.


13.16 Example (ARMA). The log spectral density of a stationary, causal, invertibleARMA(p, q) process with parameter vector θ = (σ2, φ1, . . . , φp, θ1, . . . , θq) can be writtenin the form

log fθ(λ) = log σ2 − log(2π) + log θ(eiλ) + log θ(e−iλ)− log φ(eiλ)− log φ(e−iλ).

Straightforward differentiation shows that the gradient of this function is equal to

ℓθ(λ) =

σ−2eiλk

φ(eiλ)+ e−iλk

φ(e−iλ)

eiλl

θ(eiλ)+ e−iλl

θ(e−iλ)

k=1,...,p,l=1,...,q

Here the second and third lines of the vector on the right are abbreviations of vectors oflength p and q, obtained by letting k and l range over the values 1, . . . , p and 1, . . . , q,respectively. The matrix Jθ =

∫

ℓθ ℓTθ (λ) dλ takes the form

Jθ =

2πσ4 0 00 AR MAAR0 MAART MA

,

where AR, MA, and MAAR are matrices of dimensions (p × p), (q × q) and (p × q),respectively, which are described in more detail in the following. The zeros must bereplicated to fulfill the dimension requirements, and result from calculations of the type,for k ≥ 1,

∫ π

−π

eiλk

φ(eiλ)dλ =

1

i

∫

|z|=1

zk−1

φ(z)dz = 0,

by Cauchy’s theorem, because the function z 7→ zk−1/φ(z) is analytic on a neighbourhoodof the unit disc, by the assumption of causility of the ARMA process.

Using the identity (f +f)(g+ g) = 2Re(fg+fg) we can compute the (l, l′)-elementof the matrix MA as

2Re

∫

[ eiλl

θ(eiλ)

eiλl′

θ(eiλ)+

eiλl

θ(eiλ)

e−iλl′

θ(e−iλ)

]

dλ

= 2Re(

0 +

∫

eiλ(l−l′)

|θ(eiλ)|2 dλ)

= 2γV (l − l′)2π,

where Vt is a stationary auto-regressive process satisfying θ(B)Vt = Zt for a white noiseprocess Zt of unit variance. The matrix AR can be expressed similarly as the covariancematrix of p consecutive elements of an auto-regressive process Ut satisfying φ(B)Ut = Zt.The (k, l)-element of the matrix MAAR can be written in the form

2Re

∫

[ eiλk

φ(eiλ)

eiλl

θ(eiλ)+

eiλk

φ(eiλ)

e−iλl

θ(e−iλ)

]

dλ = 2Re(

0 + 2π

∫

fUV (λ)eiλ(k−l) dλ

)

.

Here fUV (λ) = 1/(

2πφ(eiλ)θ(e−iλ))

is the cross spectral density of the auto-regressiveprocesses Ut and Vt defined previously (using the same white noise process Zt)(?). Hence


the integral on the far left is equal to 2π times the complex conjugate of the crosscovariance γUV (k − l).

Taking this all together we see that the matrix resulting from deleting the first rowand first column from the matrix Jθ/(4π) results in the matrix J~φp,~θq

that occurs in The-

orem 13.12. Thus the Whittle estimators and maximum Gaussian likelihood estimatorshave asymptotically identical behaviour.

In view of the block structure in Jθ, the Whittle estimator for σ2 is asymptoticallyindependent of the estimators of the remaining parameters.

* 13.4.2 Gaussian Time Series

In this section we study the behaviour of the maximum likelihood estimators for generalGaussian time series in more detail. Thus θn is the point of maximum of (13.11) (or

equivalently (13.12)), and we study the properties of the sequence√n(θn − θ) under the

assumption that the true density of (X1, . . . , Xn) possesses the form (13.11), for someθ. For simplicity we assume that the time series is centered at mean zero, so that themodel is completely parametrized by the covariance matrix Γn(θ). Equivalently, it isdetermined by the spectral density fθ, which is related to the covariance matrix by

(

Γn(θ))

s,t=

∫ π

−πei(s−t)λ fθ(λ) dλ.

It is easier to express conditions and results in terms of the spectral density fθ, which isfixed, than in terms of the sequence of matrices Γn(θ). The asymptotic Fisher informationfor θ is defined as

Iθ =1

4π

∫ π

−π

∂ log fθ∂θ

(λ)(∂ log fθ

∂θ(λ))T

dλ.

13.17 Theorem. Suppose that Xt is a Gaussian time series with zero mean and spectraldensity fθ such that the map θ 7→ fθ is one-to-one and the map (θ, λ) 7→ fθ(λ) isthree times continuously differentiable and strictly positive. Suppose that θ ranges overa bounded, open subset of R

d. Then the maximum likelihood estimator θn based onX1, . . . , Xn satisfies

√n(θn − θ) N(0, I−1θ ).

Proof. See Azencott and Dacunha-Castelle (1984), Chapitre XIII.

The theorem is similar in form to the theorem for maximum likelihood estimatorsbased on replicated experiments. If pn,θ is the density of ~Xn = (X1, . . . , Xn) (given in(13.11)), then it can be shown under the conditions of the theorem that

In,θ: =1

nEθ

∂

∂θlog pn,θ( ~Xn)

( ∂

∂θlog pn,θ( ~Xn)

)T

→ Iθ.

The left side of this display is the true (average) Fisher information (per observation) for

θ based on ~Xn, and this explains the name asymptotic Fisher information for Iθ. Withthis in mind the analogy with the situation for replicated experiments is perfect.


13.5 Model Selection

In the preceding sections and chapters we have studied estimators for the parameters ofARMA or GARCH processes assuming that the orders p and q are known a-priori. Inpractice reasonable values of p and q can be chosen from plots of the auto-correlationand the partial auto-correlation functions, followed by diagnostic checking after fitting aparticular model. Alternatively (or in addition) we can estimate appropriate values of pand q and the corresponding parameters simultaneously from the data. The maximumlikelihood method must then be augmented by penalization.

The value of the likelihood (13.1) depends on the dimension the parameter θ. Ifmodels of different dimension are available, then we can make the dependence explicitby denoting the log likelihood as, with d the dimension of the model,

Λn(θ, d) =

n∑

t=1

log pθ,d(Xt|Xt−1, . . . , X1).

A first idea to select a reasonable dimension is to maximize the function Λn jointly over(θ, d). This rarely works. The models of interest are typically nested in that a model ofdimension d is a submodel of a model of dimension d+1. The maximum over (θ, d) is thentaken for the largest possible dimension. To counter this preference for large dimensionwe can introduce a penalty function. Instead of Λn we maximize

(θ, d) 7→ Λn(θ, d)− φn(d),

where φn is a fixed function that takes large values for large values of its argument.Maximizing this function jointly over (θ, d) must strike a balance between maximizingΛn, which leads to big values of d, and minimizing φn, which leads to small values of d.The choice of penalty function is crucial for this balance to yield good results.

Several penalty functions are in use, each of them motivated by certain consider-ations. There is no general agreement as to which penalty function works best, partlybecause there are several reasonable criteria for “best”. Three examples for models ofdimension d are

AIC(d) = d,

AICC(d) =nd

n− d− 1,

BIC(d) = 12d log n.

The abbreviations are for Akaike’s Information Criterion, Akaike’s information correctedcriterion, and Bayesian Information Criterion respectively.

It seems reasonable to choose a penalty function such that as n → ∞ the value dnthat maximizes the penalized likelihood converges to the true value (in probability oralmost surely). By the following theorem penalties such that φn(d) → ∞ faster thanloglog n achieve this aim in the case of ARMA processes. Here an ARMA(p, q) process isunderstood to be exactly of orders p and q, i.e. the leading coefficients of the polynomialsφ and θ of degrees p and q are nonzero.

13.5: Model Selection 231

13.18 Theorem. Let Xt be a Gaussian causal, invertible stationary ARMA(p0, q0) pro-

cess and let (θ, p, q) maximize the penalized likelihood over ∪p+q≤d0(Θp,q, p, q), where

for each (p, q) the set Θp,q is a compact subset of Rp+q+1 consisting of parameters of acausal, invertible stationary ARMA(p, q) process and d0 ≥ p0+q0 is fixed. If φn(d)/n→ 0and lim inf φn(d)/ loglog n is sufficiently large for every d ≤ d0, then p→ p0 and q → q0almost surely.

Proof. See Azencott and Dacunha-Castelle (1984), Chapitre XIV.

The condition on the penalty is met by the BIC penalty, but not by Akaike’s penaltyfunction. It is observed in practice that the use of Akaike’s criterion overestimates theorder of the model. The AICC criterion, which puts slightly bigger penalty on big models,is an attempt to correct this.

However, choosing a model of the correct order is perhaps not the most relevantcriterion for “good” estimation. A different criterion is the distance of the estimatedmodel, specified by a pair of a dimension d and a corresponding parameter θ, to thetrue law of the observations. Depending on the distance used, an “incorrect” estimate dtogether with a good estimate θ of that dimension may well yield a model that is closerthan the estimated model of the correct (higher) dimension. For ARMA processes theAIC criterion performs well in this respect.

The AIC criterion is based on the Kullback-Leibler distance.

13.19 EXERCISE. Repeatedly simulate a MA(1) process with θ = .8 for n = 50 orn = 100.(i) Compare the quality of the moment estimator and the maximum likleihood estima-

tor.(ii) Are the sampling distributions of the estimators approximately normal?

13.20 EXERCISE. Find best fitting AR and ARMA models for the Wolfer sunspotnumbers (object sunspots in Splus).


PICTURES

1984 1986 1988 1990 1992

1.0

1.5

2.0

2.5

1984 1986 1988 1990 1992

-0.2

-0.1

0.0

0.1

Figure 13.1. Stock price (top) and log returns (bottom) of Hewlett Packard.

Time

sunspots

1750 1800 1850 1900 1950

050

100

150

200

250

Figure 13.2. Sunspots.


t (s)

curr

ent

(pA

)

1.0 1.1 1.2 1.3 1.4 1.5 1.6

-6-5

-4-3

-2-1

0

t (s)

curr

ent

(pA

)

1.0 1.1 1.2 1.3 1.4 1.5 1.6

-6-5

-4-3

-2-1

0

Figure 13.3. Ion channel in barley cell.

13.5: Model Selection 3

Figure 13.4. Ion channel in barley cell.

lecture notes - universiteit leiden

Documents