
Brooklyn College, CUNY

Math 4506 – Time Series

Lecture Notes

Fall 2019

Christian Benes

[email protected]

http://userhome.brooklyn.cuny.edu/cbenes/timeseries.html


Math 4506 (Fall 2019), August 28, 2019, Prof. Christian Benes

Lecture #1: Introduction; Probability Review

1.1 What this course is about: time series and time series models

Essentially, all models are wrong, but some are useful.

George E. P. Box

Probabilists study time series models. These are abstract random objects which are completely well defined and can generate sets of data (using random number generators).

Statisticians study time series (which are data sets) and try to find the right model for them, that is, the time series model from which the data could have been generated.

In that sense, probabilists and statisticians do opposite jobs, the first being (arguably) more elegant, the second being (definitely) more practical.

Below are some examples of time series. The first three are "real-world" data. The following 6 are computer-generated. Our goal in this course will be to find ways to construct models from which these data could have arisen.

[Figure: Baltimore city annual water use, liters per capita per day, 1890-1968.]


[Figure: Daily value of one $US in Euros, May 6, 2010 - May 6, 2011.]


[Figure: Closing value of NASDAQ 100 index, July 25, 2008 - January 23, 2009.]


[Figure: Ten random data points. What can we say about the underlying distribution?]

[Figure: What about these 10 data points?]

[Figure: A third set of 10 data points. Same question.]


Scale is important when visualizing data. Here are the same data sets as on the previous page, shown all three at the same scale:

[Figure: the same three data sets, each plotted on a common vertical scale from -2 to 2.]


It turns out that these data are drawn from the (multivariate) normal distributions N(0, Σ_1), N(0, Σ_2), N(0, Σ_3), respectively, where

Σ_1 = I_{10} (the 10 × 10 identity matrix),

\Sigma_2 = \begin{pmatrix} 1 & 4/5 & \cdots & 4/5 \\ 4/5 & 1 & \cdots & 4/5 \\ \vdots & \vdots & \ddots & \vdots \\ 4/5 & 4/5 & \cdots & 1 \end{pmatrix}, \qquad \Sigma_3 = \begin{pmatrix} 1 & 24/25 & \cdots & 24/25 \\ 24/25 & 1 & \cdots & 24/25 \\ \vdots & \vdots & \ddots & \vdots \\ 24/25 & 24/25 & \cdots & 1 \end{pmatrix},

both 10 × 10, with ones on the diagonal and the constant value 4/5 (respectively 24/25) everywhere off the diagonal.

If you're not sure what this means, don't worry. Details are coming up. In a nutshell, the samples of the first data set are drawn from independent normal random variables, while those from the other two sets are drawn from a family of pairwise positively correlated random variables (with covariances 4/5 in the first case and 24/25 in the second).
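For readers who want to generate data of this kind themselves, here is a minimal sketch in R (assuming the MASS package, whose mvrnorm function is used again later in these notes); the particular values drawn will of course differ from the plots above.

library(MASS)                                        # provides mvrnorm
Sigma1 <- diag(10)                                   # identity: independent N(0,1) components
Sigma2 <- matrix(4/5, 10, 10); diag(Sigma2) <- 1     # covariance 4/5 off the diagonal
Sigma3 <- matrix(24/25, 10, 10); diag(Sigma3) <- 1   # covariance 24/25 off the diagonal
x1 <- mvrnorm(1, mu = rep(0, 10), Sigma = Sigma1)    # one sample of length 10 from N(0, Sigma1)
x2 <- mvrnorm(1, mu = rep(0, 10), Sigma = Sigma2)
x3 <- mvrnorm(1, mu = rep(0, 10), Sigma = Sigma3)
plot(x1, type = "b", ylim = c(-2, 2))                # compare with the plots above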

The main purpose of time series modeling is to come up (as one would expect) with the stochastic process (time series model) from which the observed data (time series) is a realization. This is an impossible task, as suggested by the quote at the beginning of this lecture.

Randomness in the real world is simply too complex to grasp completely. However, there are ways to determine, according to some (sometimes subjective) criteria, which models work better and which models don't work as well in a given setting.


Here's where finding a model for data is tricky: there are many choices for a model which at first (and even second) glance seem reasonable for a given data set. I am sure none of you would have been shocked if I had told you that the second to last data set above was drawn from independent normal random variables with mean 0 and standard deviation 1/2. Nor would you have been very troubled if I'd suggested that they were generated using independent exponential random variables with mean 1.

This illustrates the fact that in time series modeling, one often has a choice between a number of models (in the case I just mentioned, types of random variables) and, within these, a number of parameters (means, variances, covariances, etc.).

In this course, you will be exposed to a number of models which all depend on a number of parameters. There usually isn't a systematic way to choose a model (and the corresponding parameters), so modeling usually requires a fair dose of theoretical understanding (to determine if a model is even acceptable in a given setting) and flair (since all models are wrong, experience comes in handy when trying to find one that is better than others).

Since the title of this course is Time Series, it might be useful if we know what a time series is!

Definition 1.1. A time series is simply a set of observations x_t, with each data point being observed at a specific time t.

A time series model is a set of random variables X_t, each of which corresponds to a specific time t.

Notation

The symbol A := B means A is defined to equal B, whereas C = D by itself means simply that C and D are equal. This is an important distinction because if you write A := B, then there is no need to verify the equality of A and B. They are equal by definition. However, if C = D, then there IS something that needs to be proved, namely the equality of C and D (which might not be obvious).

For example, you may recall that for a random variable X,

Var(X) := E[(X - E[X])^2]

and

Var(X) = E[X^2] - E[X]^2.
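As a quick numerical illustration (a small sketch in R, the software used throughout these notes), both expressions give the same value for the roll of a fair die:

x <- 1:6                      # possible values of a fair die
p <- rep(1/6, 6)              # their probabilities
EX <- sum(x * p)              # E[X] = 3.5
EX2 <- sum(x^2 * p)           # E[X^2]
sum((x - EX)^2 * p)           # Var(X) from the definition: 2.916667
EX2 - EX^2                    # Var(X) from the formula: same value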

1.2 Introduction to Random Variables

While writing my book [Stochastic Processes] I had an argument with Feller. He asserted that everyone said "random variable" and I asserted that everyone said "chance variable." We obviously had to use the same name in our books, so we decided the issue by a stochastic procedure. That is, we tossed for it and he won.

Joe Doob


In probability, Ω is used to denote the sample space of outcomes of an experiment.

Example 1.1. Toss a die once: Ω = {1, 2, 3, 4, 5, 6}.

Example 1.2. Toss two dice: Ω = {(i, j) : 1 ≤ i ≤ 6, 1 ≤ j ≤ 6}.

Note that in each case Ω is a finite set. (That is, the cardinality of Ω, written |Ω|, is finite.)

Example 1.3. Consider a needle attached to a spinning wheel centred at the origin. When the wheel is spun, the angle ω made by the tip of the needle with the positive x-axis is measured. The possible values of ω are Ω = [0, 2π).

In this case, Ω is an uncountably infinite set. (That is, Ω is uncountable with |Ω| =∞.)

Definition 1.2. A random variable X is a function from the sample space Ω to the real numbers R = (-∞, ∞). Symbolically,

X : Ω → R, ω ↦ X(ω).

Example 1.4. (1.1 continued). Let X denote the upmost face when a die is tossed. Then X(i) = i, i = 1, . . . , 6.

Example 1.5. (1.2 continued). Let X denote the sum of the upmost faces when two dice are tossed. Then X((i, j)) = i + j, i = 1, . . . , 6, j = 1, . . . , 6. Note that the elements of Ω are ordered pairs, so that the function X(·) acts on (i, j), giving X((i, j)). We will often omit the inner parentheses and simply write X(i, j).

Example 1.6. (1.3 continued). Let X denote the cosine of the angle made by the needle on the spinning wheel with the positive x-axis. Then X(ω) = cos(ω), so that X(ω) ∈ [-1, 1].

Remark. As mentioned in the definition, a random variable is really a function whose input variable is random, that is, determined by chance (or God, or destiny, or karma, or whatever you think decides how our world works). The use of the notation X and X(ω) is EXACTLY the same as the use of f and f(x) in elementary calculus. For example, f(x) = x^2, f(t) = t^2, f(ω) = ω^2, and X(ω) = ω^2 all describe EXACTLY the same function (at least if we assume the domains are the same), namely, the function which takes a number and squares it.

What makes random variables slightly more complicated than functions is that, unlike the variable x from calculus, the variable ω is random and therefore comes from a distribution.

1.3 Discrete and Continuous Random Variables

Definition 1.3. Suppose that X is a random variable. Suppose that there exists a function f : R → R with the properties that f(x) ≥ 0 for all x, \int_{-\infty}^{\infty} f(x)\,dx = 1, and

P(\{\omega \in \Omega : X(\omega) \le a\}) =: P(X \le a) = \int_{-\infty}^{a} f(x)\,dx.

We call f the (probability) density (function) of X and say that X is a continuous random variable. Furthermore, the function F defined by F(a) := P(X ≤ a) is called the (probability) distribution (function) of X.


Note 1.1. By the Fundamental Theorem of Calculus, F ′(x) = f(x).

Remark. There exist continuous random variables which do not have densities. Although it's good to know that the definition of continuous random variables is slightly more general than what is suggested above, you won't need to worry about it in this course.

Example 1.7. A random variable X is said to be normally distributed with parameters µ, σ^2, if the density of X is

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right), \qquad -\infty < \mu < \infty, \; 0 < \sigma < \infty.

This is sometimes written X ∼ N(µ, σ^2). In Exercise 1.2, you will show that the mean of X is µ and the variance of X is σ^2.

Definition 1.4. Suppose that X is a random variable. Suppose that there exists a function p : Z → R with the properties that p(k) ≥ 0 for all k, \sum_{k=-\infty}^{\infty} p(k) = 1, and

P(\{\omega \in \Omega : X(\omega) \le N\}) =: P(X \le N) = \sum_{k=-\infty}^{N} p(k).

We call p the (probability mass function or) density of X and say that X is a discrete random variable. Furthermore, the function F defined by F(N) := P(X ≤ N) is called the (probability) distribution (function) of X.

Example 1.8. (1.2 continued). If X is defined to be the sum of the upmost faces when two dice are tossed, then the density of X, written p(k) := P(X = k), is given by

k    :   2     3     4     5     6     7     8     9    10    11    12
p(k) : 1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

and p(k) = 0 for any other k ∈ Z.
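A quick way to recover this table in R (a sketch, not something you need for the course) is to enumerate the 36 equally likely outcomes:

sums <- outer(1:6, 1:6, "+")   # the sums of all 36 outcomes of two dice
table(sums) / 36               # p(k) for k = 2, ..., 12: 1/36, 2/36, ..., 6/36, ..., 1/36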

Remark. There do exist random variables which are neither discrete nor continuous; however, such random variables will not concern us.

1.4 Expectation and Variance

Suppose that X : Ω → R is a random variable (either discrete or continuous), and that g : R → R is a (piecewise) continuous function. Then Y := g ∘ X : Ω → R defined by Y(ω) = g(X(ω)) is also a random variable. We usually write Y = g(X).

We now define the expectation of the random variable Y, distinguishing the discrete and continuous cases.


Definition 1.5. If X is a discrete random variable and g is as above, then the expectation of g ∘ X is given by

E[g(X)] := \sum_{k} g(k)\,p(k),

where p is the probability mass function of X.

Definition 1.6. If X is a continuous random variable and g is as above, then the expectation of g ∘ X is given by

E[g(X)] := \int_{-\infty}^{\infty} g(x)\,f(x)\,dx,

where f is the probability density function of X.

Notice that if g is the identity function (that is, g(x) = x for all x), we get the expectation of X itself:

• E[X] := \sum_{k} k\,p(k), if X is discrete, and

• E[X] := \int_{-\infty}^{\infty} x\,f(x)\,dx, if X is continuous.

µ := E[X] is also called the mean of X. Note that -∞ ≤ µ ≤ ∞. If -∞ < µ < ∞, then we say that X has a finite mean, or that X is an integrable random variable, and we write X ∈ L1.

Exercise 1.1. Suppose that X is a Cauchy random variable. That is, X is a continuous random variable with density function

f(x) = \frac{1}{\pi} \cdot \frac{1}{x^2 + 1}.

Carefully show that X ∉ L1 (that is, X doesn't have a finite mean).

Theorem 1.1 (Linearity of Expectation). Suppose that X : Ω → R and Y : Ω → R are (discrete or continuous) random variables with X ∈ L1 and Y ∈ L1. Suppose also that f : R → R and g : R → R are both (piecewise) continuous and such that f(X) ∈ L1 and g(Y) ∈ L1. Then, for any a, b ∈ R, af(X) + bg(Y) ∈ L1 and, furthermore,

E[af(X) + bg(Y)] = aE[f(X)] + bE[g(Y)].

Using Definitions 1.5 and 1.6, we can compute the kth moments E[X^k] of a random variable X. One frequent assumption about a random variable is that it has a finite second moment. This is to ensure that the Central Limit Theorem can be used.

Definition 1.7. If X is a random variable with E[X^2] < ∞, then we say that X has a finite second moment and write X ∈ L2. If X ∈ L2, then we define the variance of X to be the number σ^2 := E[(X - µ)^2]. The standard deviation of X is the number σ := \sqrt{σ^2}. (As usual, this is the positive square root.)


Remark. It is an important fact that if X ∈ L2, then it must be the case that X ∈ L1.

The following is a useful formula when computing variances (people sometimes confuse it with the definition of variance, which it's not; for the definition, see above).

Theorem 1.2. Suppose X ∈ L2. Then

Var(X) = E[X^2] - E[X]^2.

Proof. By linearity of expectation,

Var(X) = E[(X - µ)^2] = E[X^2 - 2µX + µ^2] = E[X^2] - E[2µX] + E[µ^2]
       = E[X^2] - 2µE[X] + µ^2 = E[X^2] - 2µ^2 + µ^2 = E[X^2] - µ^2 = E[X^2] - E[X]^2.

The following exercise is a little bit tedious, but you should make sure you know how to do it. If you remember doing it and remember well how it works, feel free to skip it. Since this lecture and the next are mostly review, I am including several exercises which are meant to refresh your memory on some basic ideas from probability but which you may know very well how to do already. That's why I'm including the comment "(optional)" next to them. I will not include these problems on the homework assignments.

Exercise 1.2. (optional) The purpose of this exercise is to make sure you can compute some straightforward (but messy) integrals [Hint: A change of variables will make them easier to handle.]. Suppose that X ∼ N(µ, σ^2); that is, X is a normally distributed random variable with parameters µ, σ^2. (See Example 1.7 for the density of X.) Show directly (without using any unstated properties of expectations or distributions) that

• E[X] = µ,

• E[X^2] = σ^2 + µ^2,

• E[e^{-θX}] = \exp\left( -\left( \theta\mu - \frac{\sigma^2\theta^2}{2} \right) \right) for 0 ≤ θ < ∞, and

• Var(X) = σ^2 [Note that this follows from the first two parts and Theorem 1.2.]

This is the reason that if X ∼ N(µ, σ^2), we say that X is normally distributed with mean µ and variance σ^2 (not just with parameters µ and σ^2).

1.5 Bivariate Random Variables

Theorem 1.3. If X and Y are random variables with X ∈ L2 and Y ∈ L2, then the product XY is a random variable with XY ∈ L1.


Definition 1.8. If X and Y are both random variables in L2, then the covariance of X and Y, written Cov(X, Y), is defined to be

Cov(X, Y) := E[(X - µ_X)(Y - µ_Y)],

where µ_X := E[X], µ_Y := E[Y]. Whenever the covariance of X and Y exists, we define the correlation of X and Y to be

Corr(X, Y) := \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}, (†)

where σ_X is the standard deviation of X, and σ_Y is the standard deviation of Y.

Remark. By convention, 0/0 := 0 in the definition of correlation. This arbitrary choice is designed to simplify some formulas and means that if Var(X) = 0 or Var(Y) = 0, then Corr(X, Y) = 0 (this follows from the fact that if Var(X) = 0 or Var(Y) = 0, then Cov(X, Y) = 0). Since if Var(X) = 0, X is constant (in which case we call X degenerate, which in this context just means non-random), this means that the correlation of two random variables is always 0 if one of them is degenerate.

Definition 1.9. We say that X and Y are uncorrelated if Cov(X, Y) = 0 (or, equivalently, if Corr(X, Y) = 0).

Fact 1.1. If X ∈ L2 and Y ∈ L2, then the following computational formulas hold:

• Cov(X, Y ) = E[XY ]− E[X]E[Y ];

• Var(X) = Cov(X,X);

Exercise 1.3. Verify the two computational formulas above. [Note that the formulas don't necessarily hold without the assumption that X ∈ L2 and Y ∈ L2, so make sure you explain why these assumptions are needed in general.]

The following result tells us how to deal with the covariance of linear combinations of random variables.

Theorem 1.4. If X, Y, Z ∈ L2 and a, b, c ∈ R, then

Cov(aX + bY + c, Z) = aCov(X,Z) + bCov(Y, Z).

Exercise 1.4. (optional) Prove Theorem 1.4.

Note 1.2. From this theorem follows another result which you already know:

Var(aX) = a^2 Var(X).

Definition 1.10. Two random variables X and Y are said to be independent if f(x, y), the joint density of (X, Y), can be expressed as

f(x, y) = fX(x) · fY (y)

where fX is the (marginal) density of X and fY is the (marginal) density of Y .


Remark. Notice that we have combined the cases of a discrete and a continuous random variable into one definition. You can substitute the phrases probability mass function or probability density function as appropriate.

The following result is often needed and at a first glance not completely obvious.

Theorem 1.5. If X and Y are independent random variables with X ∈ L1 and Y ∈ L1, then

• the product XY is a random variable with XY ∈ L1, and

• E[XY ] = E[X]E[Y ].

Exercise 1.5. (optional) Using this theorem, quickly prove that if X and Y are independent random variables, then they are necessarily uncorrelated. (As the next exercise shows, the converse, however, is not true: there do exist uncorrelated, dependent random variables.)

Exercise 1.6. (optional) Consider the random variable X defined by P(X = -1) = 1/4, P(X = 0) = 1/2, P(X = 1) = 1/4. Let the random variable Y be defined as Y := X^2. Hence, P(Y = 0 | X = 0) = 1, P(Y = 1 | X = -1) = 1, P(Y = 1 | X = 1) = 1.

• Show that the density of Y is P (Y = 0) = 1/2, P (Y = 1) = 1/2.

• Find the joint density of (X, Y ), and show that X and Y are not independent.

• Find the density of XY , compute E[XY ], and show that X and Y are uncorrelated.

The following result allows us to get a grip on the variance in algebraic manipulations when the random variables involved are independent:

Theorem 1.6 (Linearity of Variance in the Case of Independence). Suppose that X : Ω → R and Y : Ω → R are (discrete or continuous) random variables with X ∈ L2 and Y ∈ L2. If X and Y are independent, then X + Y ∈ L2 and

Var(X + Y) = Var(X) + Var(Y).


Math 4506 (Fall 2019), September 4, 2019, Prof. Christian Benes

Lecture #2: Multivariate Random Variables

2.1 Multivariate Random Variables

We just saw that pairs of random variables can be more complicated than what one might like to think. It is not enough to know the distributions of the random variables X and Y to know how they behave together.

Think of the following example: You may know the distribution of the heights (X) and weights (Y) of people in a certain population. However, this by itself will not tell you how height affects weight and vice-versa. The information on how the random variables are related is not contained in the distributions of X and Y (that is, the marginals). To have an idea of the relative behavior of random variables, one needs the correlation coefficient.

Recall:

• If we want to describe a single random variable (also called a univariate random variable), we need a density f(x), which graphically can be described as a curve (or a set of points in the discrete case) in the plane.

• If we want to describe a pair of random variables (also called bivariate random variables), we need a joint density f(x, y), which graphically can be described as a surface (or a set of points in the discrete case) in space.

This extends easily to higher dimensions:

• If we want to describe a family of n random variables, we need a joint density f(x_1, . . . , x_n), which graphically can be described as a hyper-surface (or a set of points in the discrete case) in (n + 1)-dimensional space.

We are usually comfortable with drawing or imagining objects in 1, 2, or 3 dimensions. In higher dimensions, we tend to get a headache before we can make sense of what we are trying to represent, so we will limit ourselves to depicting densities of univariate and bivariate random variables and will deal with the rest algebraically (and refer to pictures in dimensions ≤ 3 when we get confused and need a picture to help us out).

We will write

\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = (x_1, \ldots, x_n)'


and will think of random vectors as being column vectors. Therefore, the random vector X = (X_1, . . . , X_n)' has joint distribution (we will often just say distribution)

F(x_1, . . . , x_n) = P(X_1 ≤ x_1, . . . , X_n ≤ x_n).

An equivalent way of writing this is

F (x) = P (X ≤ x).

Recall that if F (x, y) is a bivariate distribution (say for jointly continuous r.v.’s), then

F(x) = P(X ≤ x) = P(X ≤ x, Y ≤ ∞) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(a, b)\,db\,da = F(x, \infty).

The distributions of subsets of random variables are obtained in the same way as in 2 dimensions: If F(x_1, . . . , x_n) is a multivariate distribution, then, for instance,

F(x_1, x_2, x_n) = P(X_1 ≤ x_1, X_2 ≤ x_2, X_3 ≤ ∞, . . . , X_{n-1} ≤ ∞, X_n ≤ x_n) = F(x_1, x_2, ∞, . . . , ∞, x_n).

For univariate random variables, you know that the p.d.f. is the derivative of the distribution function. In higher dimensions, this is true as well, but since we are dealing with functions of several variables, we have to talk about partial derivatives:

f(x_1, \ldots, x_n) = \frac{\partial^n F(x_1, \ldots, x_n)}{\partial x_1 \cdots \partial x_n}.

The random variables X1, . . . , Xn are independent if

F (x1, . . . , xn) = FX1(x1) · · ·FXn(xn)

or, alternatively, if the joint p.d.f. (p.m.f.) is the product of the marginal p.d.f’s (p.m.f’s).

Since the random vector X = (X_1, . . . , X_n)' is a vector, so is its mean E[X] = (E[X_1], . . . , E[X_n])'. Since there is a covariance between any two of the X_i, there is a total of n^2 covariances, which compose the covariance matrix

\Sigma_X = \begin{pmatrix} \mathrm{Cov}(X_1, X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_n) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Cov}(X_2, X_2) & \cdots & \mathrm{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n, X_1) & \mathrm{Cov}(X_n, X_2) & \cdots & \mathrm{Cov}(X_n, X_n) \end{pmatrix}.

Note that

\Sigma_X = \begin{pmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_n) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n, X_1) & \mathrm{Cov}(X_n, X_2) & \cdots & \mathrm{Var}(X_n) \end{pmatrix}.

• Since Cov(X_i, X_j) = Cov(X_j, X_i) for any i, j ∈ {1, . . . , n}, the covariance matrix is symmetric.
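A sample analogue of Σ_X can be computed in R with cov(), which returns a symmetric matrix whose diagonal holds the sample variances; a small sketch with simulated data (the numbers are only illustrative):

set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)   # 200 observations of a 3-dimensional random vector
S <- cov(X)                             # 3-by-3 sample covariance matrix
all.equal(S, t(S))                      # TRUE: the matrix is symmetric
diag(S)                                 # sample variances on the diagonal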


2.2 Some Basic Linear Algebra

Caveat 2.1. I may not be entirely consistent with notation in what follows. Sometimes, vectors will be represented by boldfaced symbols (x) and sometimes like this: ~x. On rare occasions, I may use the same notation as for scalars, since that notation is common as well. If that's the case, you should be able to figure out from context whether you're dealing with a vector or not.

For matrices

A = [a_{ij}]_{1 \le i \le k,\, 1 \le j \le \ell} = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,\ell} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,\ell} \\ \vdots & \vdots & \ddots & \vdots \\ a_{k,1} & a_{k,2} & \cdots & a_{k,\ell} \end{pmatrix}, \qquad B = [b_{ij}]_{1 \le i \le \ell,\, 1 \le j \le n} = \begin{pmatrix} b_{1,1} & b_{1,2} & \cdots & b_{1,n} \\ b_{2,1} & b_{2,2} & \cdots & b_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{\ell,1} & b_{\ell,2} & \cdots & b_{\ell,n} \end{pmatrix},

and a vector

\vec{v} = (v_1, v_2, \ldots, v_\ell)',

we have the following definitions:

• The product of two matrices is AB = [c_{i,j}]_{1 \le i \le k,\, 1 \le j \le n}, where

c_{i,j} = \sum_{m=1}^{\ell} a_{i,m} b_{m,j}.

• In particular, the product of a matrix and a vector is

A\vec{v} = \left( \sum_{i=1}^{\ell} a_{1,i} v_i,\ \sum_{i=1}^{\ell} a_{2,i} v_i,\ \ldots,\ \sum_{i=1}^{\ell} a_{k,i} v_i \right)'.


• The transpose of matrix A is A' = [c_{i,j}]_{1 \le i \le \ell,\, 1 \le j \le k}, where c_{i,j} = a_{j,i}.

• The determinant of a matrix A, written det(A), is something fairly easy to compute, but its definition isn't exactly short, so those who can't remember it should look it up in a book on linear algebra. Wikipedia also has a definition and some examples. Note that the determinant is defined only for square matrices (with the same number of rows and columns). We say that A is singular if det(A) = 0. Otherwise, A is nonsingular.

• The following definitions are for the case k = ℓ (that is, A is a square matrix):

– If A is nonsingular, the inverse of A, denoted by A^{-1}, is the unique matrix such that

A A^{-1} = A^{-1} A = 1_k := \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}.

If it is clear from context what the dimensions of the matrix are, we write 1 = 1_k.

– A is called orthogonal if A′ = A−1. In that case,

AA′ = A′A = 1.

– A is symmetric if for all 1 ≤ i, j ≤ k,

ai,j = aj,i.

– A is positive semi-definite if for all vectors ~v = [v1, . . . , vk]′,

~v′A~v ≥ 0.

Theorem 2.1. If an n × n matrix A is symmetric, it can be written as

A = P Λ P',

where

\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}

and P is orthogonal. Here, λ_1, . . . , λ_n are the eigenvalues of A.

Theorem 2.2. The covariance matrix of a random vector ~X is symmetric and positive semi-definite.


Proof. Symmetry is obvious. If ~v = (v_1, . . . , v_n)' and Σ is a covariance matrix, then

\vec{v}\,' \Sigma \vec{v} = \sum_{i,j=1}^{n} v_i v_j \,\mathrm{Cov}(X_i, X_j) = \mathrm{Var}\!\left( \sum_{i=1}^{n} v_i X_i \right) \ge 0.

Corollary 2.1. The covariance matrix Σ of a random vector ~X can be written in the form

Σ = P Λ P',

where

\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}

and P is orthogonal.

Proof. This follows from the symmetry of Σ.

Note 2.1. Since Σ is positive semi-definite, its eigenvalues λ_1, . . . , λ_n are nonnegative, so we can define

\Lambda^{1/2} := \begin{pmatrix} \lambda_1^{1/2} & 0 & \cdots & 0 \\ 0 & \lambda_2^{1/2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n^{1/2} \end{pmatrix}

and

B = P \Lambda^{1/2} P'.

Then, since P P' = P' P = 1,

B^2 = B B = P \Lambda^{1/2} P' P \Lambda^{1/2} P' = P \Lambda P' = \Sigma.

Since B^2 = Σ, it makes perfect sense to define

\Sigma^{1/2} := P \Lambda^{1/2} P' = B. (1)
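The construction of Σ^{1/2} can be carried out numerically with eigen() in R; a minimal sketch (the matrix Sigma below is just an example):

Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)                   # an example covariance matrix
e <- eigen(Sigma)                                          # e$vectors is P, e$values are the lambda_i
B <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)   # B = P Lambda^(1/2) P'
all.equal(B %*% B, Sigma)                                  # TRUE: B^2 = Sigma, so B is Sigma^(1/2)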

Since we will often deal with linear transformations of random variables, the following proposition will be useful:

Proposition 2.1. If X is a random vector, a is a (nonrandom) vector, B is a matrix, and Y = BX + a, then

E[Y] = a + B E[X],

Σ_Y = B Σ_X B'.

Proof. See first homework assignment.


2.3 Multivariate Normal Random Variables

You already know that the normal distribution is the most important of them all, since the central limit theorem tells us that as soon as we start adding up random variables, a normal pops up. Recall from Lecture 1 that a normal random variable X with parameters µ, σ^2 has density

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right), \qquad -\infty < \mu < \infty, \; 0 < \sigma < \infty.

You should verify that this is the one-dimensional particular case of the multivariate normal density with mean µ and nonsingular covariance matrix Σ (written X ∼ N(µ, Σ)):

f_X(x) = \frac{1}{\left( (2\pi)^n \det(\Sigma) \right)^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu) \right).

Note 2.2. Make sure you understand why one needs Σ to be nonsingular in order for the definition of the multivariate normal density to make sense.

Exercise 2.7. Suppose X ∼ N(0, 1), Y ∼ N(0, 2) are bivariate normal with correlation coefficient ρ(X, Y) = 1/2.

• Find the joint density of X and Y.

• Let S_1 be the square with vertices (0,0), (1,0), (0,1), and (1,1), and let S_2 be the square with vertices (0,0), (1,0), (0,-1), and (1,-1). Without doing any computations, explain which of P((X, Y) ∈ S_1) and P((X, Y) ∈ S_2) should be greater.

You probably recall that if X ∼ N(µ, σ^2), you can apply a linear transformation to change X into a standard normal:

Z = \frac{X - \mu}{\sigma} \sim N(0, 1).

The same works for the multivariate normal:

Exercise 2.8. Prove that if X ∼ N(~µ,Σ), then

Z := Σ−1/2(X− ~µ) ∼ N(0,1).

In particular (prove this only in the bivariate case), the components of Z are independent.

Hint: Use proposition 2.1.

Note 2.3. This last exercise shows how to obtain a standard normal vector from any multivariate normal distribution. On the homework, you will also show how to do the converse, that is, obtain any multivariate normal distribution from the standard multivariate normal.
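A numerical check of Exercise 2.8 (a sketch using MASS::mvrnorm, which is introduced in the R session below): transform simulated N(µ, Σ) data by Σ^{-1/2} and verify that the sample mean is close to 0 and the sample covariance close to the identity.

library(MASS)
Sigma <- matrix(c(1, 0.5, 0.5, 2), 2, 2)                 # an example covariance matrix
mu <- c(1, -1)
X <- mvrnorm(10000, mu = mu, Sigma = Sigma)              # 10000 samples from N(mu, Sigma)
e <- eigen(Sigma)
SigmaInvHalf <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)   # Sigma^(-1/2)
Z <- t(SigmaInvHalf %*% (t(X) - mu))                     # Z = Sigma^(-1/2)(X - mu) for each sample
round(colMeans(Z), 2)                                    # approximately (0, 0)
round(cov(Z), 2)                                         # approximately the 2-by-2 identity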


You can generate multivariate normal random variables in R using the following commands (note that comments about what a line does follow the symbol %; these comments are not part of what you should include in your input line):

> library(MASS) % this loads the library in which the multivariate normal generator is found

> S=c(1,0,0,1) % this generates the vector (1, 0, 0, 1)

> dim(S)=c(2,2) % this transforms the vector into a 2-by-2 matrix

> S % this allows you to check what S is.

[,1] [,2]

[1,] 1 0

[2,] 0 1

> mu=c(0,0) % this is the mean (row) vector

> mu

[1] 0 0

> dim(mu)=c(2,1) %this makes the mean vector into a column vector

> mu

[,1]

[1,] 0

[2,] 0

> N=mvrnorm(100,mu,S) % this generates 100 samples from the multivariate normal distribution with mean mu and covariance matrix S

> plot(N)

[Figure: scatter plot of the 100 samples, N[,2] against N[,1].]


> S2=c(1,1,1,1)

> dim(S2)=c(2,2)

> N2=mvrnorm(100,mu,S2)

> plot(N2)

[Figure: scatter plot of the 100 samples N2, N2[,2] against N2[,1].]

> S3=c(1,-0.8,-0.8,1)

> dim(S3)=c(2,2)

> N3=mvrnorm(100,mu,S3)

> plot(N3)

[Figure: scatter plot of the 100 samples N3, N3[,2] against N3[,1].]


The following are the graphs of 3 multivariate normal densities (any two pictures on a same line are of the same pdf, but seen under different angles). Try to say as much as you can about their means and covariance matrices.

[Figures: the three bivariate normal density surfaces, each shown from two viewing angles; they are identified in the captions below.]

[Figure: The joint pdf of two independent standard normal random variables.]

[Figure: The joint pdf of two normal random variables with mean 0 and covariance matrix

\Sigma = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}.]

[Figure: The joint pdf of two normal random variables with mean 0 and covariance matrix

\Sigma = \begin{pmatrix} 1 & -1/2 \\ -1/2 & 1 \end{pmatrix}.]


When pictures of surfaces don't make as much sense as we'd like, we can always look at level curves. Here are the same graphs as above with level curves:

[Figure: surface and level curves of the joint pdf of two independent standard normal random variables.]

[Figure: surface and level curves of the joint pdf of two normal random variables with mean 0 and covariance matrix \Sigma = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}.]

[Figure: surface and level curves of the joint pdf of two normal random variables with mean 0 and covariance matrix \Sigma = \begin{pmatrix} 1 & -1/2 \\ -1/2 & 1 \end{pmatrix}.]


When you draw samples from a distribution, you should see most of your data points accumulate in areas of high probability. The shapes of these areas are precisely given by the level curves:

[Figure: 50 samples from a bivariate normal random variable with mean 0 and covariance matrix \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, shown next to the level curves of the density.]

[Figure: 50 samples from a bivariate normal random variable with mean 0 and covariance matrix \Sigma = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}, shown next to the level curves of the density.]

[Figure: 50 samples from a bivariate normal random variable with mean 0 and covariance matrix \Sigma = \begin{pmatrix} 1 & -1/2 \\ -1/2 & 1 \end{pmatrix}, shown next to the level curves of the density.]


The connection between the data and the distribution becomes more obvious as the data set increases in size:

[Figure: 500 samples from a bivariate normal random variable with mean 0 and covariance matrix \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, shown next to the level curves of the density.]

[Figure: 500 samples from a bivariate normal random variable with mean 0 and covariance matrix \Sigma = \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1 \end{pmatrix}, shown next to the level curves of the density.]

[Figure: 500 samples from a bivariate normal random variable with mean 0 and covariance matrix \Sigma = \begin{pmatrix} 1 & -1/2 \\ -1/2 & 1 \end{pmatrix}, shown next to the level curves of the density.]


Math 4506 (Fall 2019), September 5, 2019, Prof. Christian Benes

Lecture #3: Decomposing Time Series; Stationarity

Reference. The material in this section is an introduction to time series and is meant to complement Chapter 1 in the textbook. Make sure you read that chapter in its entirety and work in parallel with R to reproduce what is being done in the textbook. This lecture also covers most of the topics from Chapter 2, which we will revisit in more detail in the next lecture.

3.1 Basic decomposition

The following graph represents the number of monthly aircraft miles (in millions) flown by U.S. airlines between 1963 and 1970:

[Figure: time series plot of Air.ts, monthly U.S. aircraft miles (in millions), 1963-1970.]

Given a data set such as the one above, how can we construct a model for it? The idea will be to decompose random data into three distinct components:

• A trend component m_t (increase of populations, increase in global temperature, etc.)

• A seasonal component s_t (describing cyclical phenomena such as annual temperature patterns, etc.)


• A random noise component Y_t describing the non-deterministic aspect of the time series. Note that the book uses z_t for this component. In the notes, I'll write Y_t, as the letter z usually suggests a normal distribution, which may not be the actual underlying distribution of the random noise component.

A common model is the so-called additive model, that is, one where we try to find m_t, s_t, Y_t such that a given time series can be expressed as

X_t = m_t + s_t + Y_t.

We will never know what m_t, s_t, and Y_t actually are, but we can estimate them. The estimates will be called \hat{m}_t, \hat{s}_t, and \hat{y}_t. Note that we'll use the same notation for estimates and estimators in this case. Once we see the data, our estimates have to satisfy

x_t = \hat{m}_t + \hat{s}_t + \hat{y}_t,

where \hat{m}_t is an estimate for m_t, \hat{s}_t is an estimate for s_t, and \hat{y}_t is an estimate for Y_t.

The corresponding data set can be found at

http://robjhyndman.com/tsdldata/data/kendall3.dat

and looks like this:

      Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
1963  6827  6178  7084  8162  8462  9644 10466 10748  9963  8194  6848  7027
1964  7269  6775  7819  8371  9069 10248 11030 10882 10333  9109  7685  7602
1965  8350  7829  8829  9948 10638 11253 11424 11391 10665  9396  7775  7933
1966  8186  7444  8484  9864 10252 12282 11637 11577 12417  9637  8094  9280
1967  8334  7899  9994 10078 10801 12950 12222 12246 13281 10366  8730  9614
1968  8639  8772 10894 10455 11179 10588 10794 12770 13812 10857  9290 10925
1969  9491  8919 11607  8852 12537 14759 13667 13731 15110 12185 10645 12161
1970 10840 10436 13589 13402 13103 14933 14147 14057 16234 12389 11595 12772

In fact, this is not exactly the form in which the data set is found on that website. There, it doesn't have any labels. As it turns out, it is quite straightforward to include those labels with R, as the sketch below shows.
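For example, one might read the data directly from that address and attach the year and month labels as follows (a sketch; it assumes the file is still available at the URL above and contains only the 96 monthly values):

Air <- scan("http://robjhyndman.com/tsdldata/data/kendall3.dat")   # 96 numbers, Jan 1963 - Dec 1970
Air.ts <- ts(Air, start = c(1963, 1), frequency = 12)              # monthly time series with labels
Air.ts                                                             # printed with Jan, ..., Dec columns
plot(Air.ts)                                                       # reproduces the graph above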

Let’s look at the graph above. Two patterns are striking. There appears to be

• an increasing pattern

• a clear cyclical pattern with some apparently fixed period

Some questions we'll try to answer throughout the course are: "How can we extract these patterns?" "Once we've extracted the patterns, are we left with pure randomness or does the randomness have a structure?" "Can we use these patterns to make predictions for future values of this time series?"


3.2 Stationary Time Series

We will eventually return to a more careful analysis of the trend and seasonal component of a time series, but focus for now on Y_t, the random component of a time series after extraction of a trend and cyclical component.

Multidimensional distributions are very complicated objects and involve more parameters than we would like to deal with. We will focus on two essential quantities giving information about a time series: the means and the covariances.

Definition 3.1. If X_t is a time series with X_t ∈ L1 for each t, then the mean function (or trend) of X_t is the non-random function µ(t) := E[X_t].

Definition 3.2. If X_t is a time series with X_t ∈ L2 for each t, then the autocovariance function of X_t is the non-random function

γ(t, s) := Cov(X_t, X_s) = E[(X_t - µ(t))(X_s - µ(s))].

The autocorrelation function of X_t is

ρ(t, s) = \frac{\gamma(t, s)}{\sqrt{\mathrm{Var}(X_t)\,\mathrm{Var}(X_s)}} = \mathrm{Corr}(X_t, X_s).

Definition 3.3. We call the time series X_t second-order (or weakly) stationary if

• there is a constant µ such that µ(t) = µ for all t, and

• γ(t + h, t) only depends on h; that is, if γ(t + h, t) = γ(h, 0) =: γ(h) for all t and for all h.

Exercise 3.9. For a second-order stationary process, show that Var(Xt) = γ(0) for each t.

Via the last exercise, the second condition for second-order stationarity allows us to rephrase the definition above:

Definition 3.4. Suppose that X_t is a second-order stationary process. The autocovariance function (ACVF) at lag h of X_t is

γ(h) := Cov(Xt+h, Xt).

The autocorrelation function (ACF) at lag h of Xt is

ρ(h) := Corr(Xt+h, Xt).

Note 3.1. By Exercise 3.9,

ρ(h) = \frac{\mathrm{Cov}(X_{t+h}, X_t)}{\sqrt{\mathrm{Var}(X_{t+h})\,\mathrm{Var}(X_t)}} = \frac{\gamma(h)}{\gamma(0)}.


3.3 Some simple time series models

All the time series below are discrete-time, that is, the time set is a subset of the integers.

Example 3.1. (White Noise.)

Often when taking measurements, little imprecisions (in the measuring device and on the part of the measurer) will yield measurements that are a little off. It is often assumed that these errors are uncorrelated and that they all come from a same distribution with zero mean. A sequence of random variables {X_n}_{n≥1} with E[X_n] = 0 and E[X_k X_m] = σ^2 δ(k - m) is called white noise. (The name comes from the spectrum of a stationary process, which we may discuss at the end of the semester. There are also noises that are pink, red, blue, purple, etc.) Here δ(k - m) is the Dirac delta function, defined by

δ(x) = \begin{cases} 1 & x = 0 \\ 0 & x \in \mathbb{R} \setminus \{0\} \end{cases}

Two important particular cases of white noise are:

• The distribution of Xi is binary: P (Xi = a) = 1−P (Xi = −a) = 1/2 for some a ∈ R.

• Xi ∼ N(0, σ2). In this case, we talk about Gaussian white noise.
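Both kinds of white noise are easy to simulate in R; a small sketch (acf() computes the sample autocorrelation function, a quantity we return to below):

n <- 500
gauss_wn <- rnorm(n)                               # Gaussian white noise with sigma^2 = 1
binary_wn <- sample(c(-1, 1), n, replace = TRUE)   # binary white noise with a = 1
par(mfrow = c(2, 2))
plot(gauss_wn, type = "l"); acf(gauss_wn)          # no significant correlation at lags h > 0
plot(binary_wn, type = "l"); acf(binary_wn)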

Example 3.2. (IID Noise.)

A sequence of independent, identically distributed random variables {X_n}_{n≥1} with E[X_n] = 0 is called i.i.d. noise.

Example 3.3. (Random walk.) If {X_i}_{i≥1} is i.i.d. noise,

S_n = \sum_{i=1}^{n} X_i

is a random walk. In particular, if P(X_i = 1) = 1 - P(X_i = -1) = 1/2, we have a symmetric simple random walk.

Random walks have been a (very crude) choice of model for the stock market for a long time.

[Figure: Two independent realizations of a simple random walk of 100 time steps.]
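A realization like the ones above can be produced in R in a couple of lines, by cumulatively summing i.i.d. ±1 steps (a sketch):

steps <- sample(c(-1, 1), 100, replace = TRUE)   # 100 i.i.d. steps with P(+1) = P(-1) = 1/2
S <- cumsum(steps)                               # S_n = X_1 + ... + X_n
plot(S, type = "l", xlab = "Number of Steps", ylab = "Position of Walker")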


Example 3.4. (Gaussian time series.)

{X_n}_{n≥1} is a Gaussian time series if for every collection of integers {i_k}_{1≤k≤n}, the vector

(X_{i_1}, . . . , X_{i_n})

is multivariate Gaussian.

Since many natural quantities have a normal distribution, this is a natural model in many settings. It also has the advantage of allowing many kinds of dependence between the data.

3.4 Autocovariance function: some examples

We saw that for stationary time series, covariance depends only on one parameter (the time between two given random variables), allowing us to define an autocovariance function at lag h. In the examples below, we compute the autocovariance function of the simple time series which we defined during the last lecture and use it to determine which of them are stationary and which are not.

Example 3.5 (White Noise). Suppose that X_t is White Noise. We now verify that X_t is second-order stationary. First, it is obvious that µ(t) = 0 for all t. Second, if s ≠ t, then the assumption that the collection is uncorrelated implies that γ(t, s) = 0, s ≠ t. On the other hand, if s = t, then γ(t, t) = Var(X_t) = σ^2. Thus, µ(t) = 0 for all t, and

γ(h) = γ(t + h, t) = \begin{cases} \sigma^2, & h = 0, \\ 0, & h \neq 0. \end{cases}

This shows that X_t is indeed second-order stationary since γ depends only on h. We write X_t ∼ WN(0, σ^2) to indicate that X_t is white noise with Var(X_t) = σ^2 for each t.

Example 3.6 (IID Noise). Suppose instead that X_t is a collection of independent random variables, each with mean 0 and variance σ^2. We say that X_t is iid noise. As with white noise, we easily see that iid noise is stationary with trend µ(t) = 0 and

γ(h) = γ(t + h, t) = \begin{cases} \sigma^2, & h = 0, \\ 0, & h \neq 0. \end{cases}

We write X_t ∼ IID(0, σ^2) to indicate that X_t is iid noise with Var(X_t) = σ^2 for each t.

Remark. With these two examples, we see that two different processes may both have the same trend and autocovariance function. Thus, µ(t) and γ(t + h, t) are NOT always enough to distinguish stationary processes. (However, for stationary Gaussian processes they are enough.)

Example 3.7. If S_t = \sum_{i=1}^{t} X_i (where X_i is a sequence of independent random variables with P(X_i = 1) = 1 - P(X_i = -1) = 1/2, and therefore Var(X_i) = 1) is symmetric simple random walk, we find that if s > t,

γ(s, t) = Cov(S_s, S_t) = Cov(S_t + X_{t+1} + . . . + X_s, S_t) = Cov(S_t, S_t) = Var(S_t) = \sum_{i=1}^{t} Var(X_i) = t.

In particular, γ(t + h, t) = t, which implies that simple random walk is not a stationary time series (since stationary time series have a constant variance).
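The linear growth Var(S_t) = t is easy to see in simulation; the following sketch compares the sample variance of S_t across many independent walks with the line v = t:

set.seed(1)
walks <- replicate(2000, cumsum(sample(c(-1, 1), 100, replace = TRUE)))   # 100-by-2000 matrix
emp_var <- apply(walks, 1, var)   # sample variance of S_t over the 2000 walks, t = 1, ..., 100
plot(1:100, emp_var, xlab = "t", ylab = "sample Var(S_t)")
abline(0, 1)                      # the theoretical line Var(S_t) = t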


Math 4506 (Fall 2019), September 9, 2019, Prof. Christian Benes

Lecture #4: Linear Processes; MA processes

Reference. Chapter 2 and Sections 4.1 and 4.2 from the textbook.

4.1 Inequalities

Many probabilists are enthralled by inequalities (upper/lower bounds). One of the many purposes for finding upper bounds is to check that quantities are finite, by checking it for a more tractable but larger quantity. (This is something you've seen in the comparison test for integrals: Though it's not straightforward to check that \int_{1000}^{\infty} e^{-x^2} \log\log\log|x+1|\,dx < \infty, the fact that for x ≥ 1000, 0 ≤ e^{-x^2} \log\log\log|x+1| ≤ e^{-x} implies that 0 ≤ \int_{1000}^{\infty} e^{-x^2} \log\log\log|x+1|\,dx ≤ \int_{1000}^{\infty} e^{-x}\,dx < \infty.)

A very common inequality in analysis and probability is Jensen’s inequality.

Definition 4.1. A function φ : R→ R is called convex if for x, y ∈ R, 0 ≤ p ≤ 1,

φ(px+ (1− p)y) ≤ pφ(x) + (1− p)φ(y).

Theorem 4.1 (Jensen's inequality). Suppose φ : R → R is convex. Suppose X is a random variable satisfying E[|X|] < ∞ and E[|φ(X)|] < ∞. Then

φ(E[X]) ≤ E[φ(X)].

Proof. If φ is convex, then for every x_0 ∈ R there is a constant c(x_0) such that φ(x) ≥ φ(x_0) + c(x_0)(x - x_0) for all x (a supporting line at x_0).

Choosing x0 = E[X] and letting x = X, we get

φ(X) ≥ c(E[X])(X − E[X]) + φ(E[X]).

Taking expectations on both sides concludes the proof.

Example 4.1. Two straightforward consequences of Jensen's inequality are

|E[X]| ≤ E[|X|] and E[X]^2 ≤ E[X^2].

In particular, applying the second inequality to the random variable |X|, we get

E[|X|]^2 ≤ E[|X|^2] ≤ E[X^2],

so that if E[X] = 0,

E[|X|] ≤ σ. (2)

Two other very commonly used inequalities are the following.


Theorem 4.2. (Cauchy-Schwarz inequality) If X, Y ∈ L2,

E[|XY |]2 ≤ E[X2]E[Y 2].

Note 4.1. This last inequality is the probabilistic version of the C-S inequality and should be compared with the C-S inequality in its most standard form:

\left( \sum_{i=1}^{n} x_i y_i \right)^2 \le \sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i^2. (3)

Theorem 4.3. (Triangle inequality) If x, y ∈ R,

|x + y| ≤ |x| + |y|.

By induction, if x_1, . . . , x_n ∈ R,

\left| \sum_{i=1}^{n} x_i \right| \le \sum_{i=1}^{n} |x_i|.

4.2 Linear Processes

Definition 4.2. We define the backwards shift operator B by

BXt = Xt−1.

For j ≥ 2, we define B^j X_t = B B^{j-1} X_t. In other words, B^j X_t = X_{t-j}.

Definition 4.3. A time series {X_t}_{t∈Z} is a linear process if for every t ∈ Z, we can write

X_t = \sum_{i=-\infty}^{\infty} \psi_i Z_{t-i}, (4)

where Z_t ∼ WN(0, σ^2) and the scalar sequence {ψ_i}_{i∈Z} satisfies \sum_{i∈Z} |ψ_i| < ∞. Using the shortcut \psi(B) = \sum_{i=-\infty}^{\infty} \psi_i B^i, we can write

X_t = \psi(B) Z_t.

If ψ_i = 0 for all i < 0, we call X a moving average or MA(∞) process.


Note 4.2. Infinite sums of random variables are somewhat delicate. You know what it means for an infinite sum of real numbers to converge, but for random variables, it isn't clear at first what the corresponding meaning would be. In fact, there are a number of different ways to give a meaning to the notion of convergence of random variables.

For technical reasons, convergence of a sum of random variables is often taken in the mean square sense: the sum of {Y_k}_{k≥1} converges in the mean square sense if there exists a random variable Y such that

E\left[ \left( \sum_{k=1}^{n} Y_k - Y \right)^2 \right] \xrightarrow{n\to\infty} 0.

In any case, it should be intuitively clear that some requirement on the ψ_i is necessary, since if all the ψ_i were equal to 1, X_t would be an infinite sum of i.i.d. random variables, which does not converge (since we're always adding more random variables that don't shrink, the sum would not stabilize).

The requirement \sum_{i∈Z} |ψ_i| < ∞ ensures that the random series \sum_{i=-\infty}^{\infty} \psi_i Z_{t-i} has a limit. I won't expect you to completely understand what this means, but if you care about it, here's the argument:

\sum_{i\ge 0} |\psi_i| < \infty \;\Rightarrow\; \sum_{i\ge 0} \psi_i^2 < \infty \;\Rightarrow\; \sum_{i\ge 0} \psi_i^2 E[Z_{t-i}^2] < \infty \;\Rightarrow\; \sum_{i=m}^{n} \psi_i^2 E[Z_{t-i}^2] \xrightarrow{m,n\to\infty} 0.

(The last implication is the Cauchy criterion for convergence of series.) Now, since the Z_{t-i} are uncorrelated with mean zero, the cross terms in the expansion of the square below vanish, so

\sum_{i=m}^{n} \psi_i^2 E[Z_{t-i}^2] = E\left[ \sum_{i=m}^{n} \psi_i^2 Z_{t-i}^2 \right] = E\left[ \left( \sum_{i=m}^{n} \psi_i Z_{t-i} \right)^2 \right].

Therefore,

\sum_{i=m}^{n} \psi_i^2 E[Z_{t-i}^2] \xrightarrow{m,n\to\infty} 0 \;\Rightarrow\; E\left[ \left( \sum_{i=m}^{n} \psi_i Z_{t-i} \right)^2 \right] \xrightarrow{m,n\to\infty} 0 \;\Rightarrow\; \sum_{i=m}^{n} \psi_i Z_{t-i} \text{ converges as } n, m \to \infty \;\Rightarrow\; \sum_{i\ge 0} \psi_i Z_{t-i} \text{ converges.}

The last implication is the Cauchy criterion for convergence of sequences of random variables.

Now that we know that the process defined in (4) exists, let's also show that for any t ∈ Z, X_t ∈ L1:

If \sum_{i∈Z} |ψ_i| < ∞, using the triangle inequality (for the first inequality; note that since it's an infinite sum, we have to take limits) and Jensen's inequality (for the last), we get

E[|X_t|] \le \sum_{i∈Z} E|\psi_i Z_{t-i}| = \sum_{i∈Z} |\psi_i|\, E|Z_{t-i}| \le \sigma \sum_{i∈Z} |\psi_i|.


4.3 Moving Average Processes

We will now construct stationary time series that have a non-zero autocovariance up to a certain lag q but have zero autocovariance at all later lags. One simple and natural way is to start with white noise Z_t (denoted Z_t ∼ WN(0, σ^2)) and to construct a new sequence of random variables which depend on an overlapping subset of the Z_t.

Definition 4.4. A moving-average process of order q is defined for t ∈ Z by the equation

X_t = Z_t + \theta_1 Z_{t-1} + \ldots + \theta_q Z_{t-q} = Z_t + \sum_{i=1}^{q} \theta_i Z_{t-i} = \sum_{i=0}^{q} \theta_i Z_{t-i} = \Theta(B) Z_t,

where Z_t ∼ WN(0, σ^2), θ_0 = 1, θ_1, . . . , θ_q are constants, and \Theta(z) = 1 + \sum_{i=1}^{q} \theta_i z^i.

We now check that X_t is a stationary sequence:

E[X_t] = E[Z_t] + \sum_{i=1}^{q} \theta_i E[Z_{t-i}] = 0.

If h > q,

\mathrm{Cov}(X_t, X_{t+h}) = \mathrm{Cov}\left( \sum_{i=0}^{q} \theta_i Z_{t-i}, \sum_{j=0}^{q} \theta_j Z_{t+h-j} \right) = \sum_{i,j=0}^{q} \theta_i \theta_j \,\mathrm{Cov}(Z_{t-i}, Z_{t+h-j}) = 0,

since if h > q and j ≤ q, then t + h - j > t, so that t + h - j > t - i, so that Z_{t-i} and Z_{t+h-j} are uncorrelated.

If 0 ≤ h ≤ q, the random variables X_t and X_{t+h} contain some of the same Z_i. Only the terms in which the same Z appears in both sums contribute to the covariance, so

\mathrm{Cov}(X_t, X_{t+h}) = \mathrm{Cov}\left( \theta_q Z_{t-q} + \ldots + \theta_0 Z_t,\ \theta_q Z_{t+h-q} + \ldots + \theta_0 Z_{t+h} \right) = \sigma^2 \sum_{i=h}^{q} \theta_{q-i}\,\theta_{q-i+h} = \sigma^2 \sum_{i=0}^{q-h} \theta_{q-i-h}\,\theta_{q-i}.

Since this covariance does not depend on t, we see that the moving-average process of order q is weakly stationary.

To find the autocorrelation function, we just need to compute

E[X_t^2] = \mathrm{Cov}(X_t, X_t) = \mathrm{Cov}\left( \sum_{i=0}^{q} \theta_i Z_{t-i}, \sum_{i=0}^{q} \theta_i Z_{t-i} \right) = \sigma^2 \sum_{i=0}^{q} \theta_i^2.


Combining all our computations above, we get

\gamma_X(h) = \begin{cases} \sigma^2 \sum_{i=0}^{q-|h|} \theta_{q-i-|h|}\,\theta_{q-i}, & 0 \le |h| \le q, \\ 0, & |h| > q, \end{cases} \qquad (5)

and

\rho_X(h) = \begin{cases} \dfrac{\sum_{i=0}^{q-|h|} \theta_{q-i-|h|}\,\theta_{q-i}}{\sum_{i=0}^{q} \theta_i^2}, & 0 \le |h| \le q, \\ 0, & |h| > q. \end{cases} \qquad (6)
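Formula (6) can be checked numerically in R against the built-in function ARMAacf and against the sample ACF of a simulated path; a sketch for the MA(2) process with θ_1 = 1, θ_2 = -1 (the same process simulated in Example 5.3 in the next lecture), for which (6) gives ρ(1) = 0 and ρ(2) = -1/3:

theta <- c(1, -1)                  # theta_1 = 1, theta_2 = -1 (theta_0 = 1 implicitly)
ARMAacf(ma = theta, lag.max = 3)   # theoretical ACF: 1, 0, -1/3, 0
set.seed(1)
Z <- rnorm(10000)
X <- Z
for (t in 3:10000) X[t] <- Z[t] + theta[1] * Z[t - 1] + theta[2] * Z[t - 2]
acf(X, lag.max = 5, plot = FALSE)  # sample ACF is close to the theoretical values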


Math 4506 (Fall 2019), September 11, 2019, Prof. Christian Benes

Lecture #5: MA processes - Autocovariance; AR processes

Reference. Section 4.2 from the textbook.

5.1 ACF of MA Processes

Example 5.1. (MA(1) process) Let's examine the ACF of an MA(1) process: If

X_t = Z_t + \theta_1 Z_{t-1},

we have θ_0 = 1, θ_1 ≠ 0, and θ_i = 0 for all i > 1. Therefore, using (5) and (6), we get

\gamma_X(0) = \sigma^2 \sum_{i=0}^{1} \theta_{1-i}\theta_{1-i} = \sigma^2 (1 + \theta_1^2),

\gamma_X(1) = \sigma^2 \sum_{i=0}^{0} \theta_{-i}\theta_{1-i} = \sigma^2 \theta_0 \theta_1 = \sigma^2 \theta_1,

\gamma_X(h) = 0, \quad |h| \ge 2,

and

\rho_X(0) = 1,

\rho_X(1) = \frac{\sigma^2 \theta_1}{\sigma^2 (1 + \theta_1^2)} = \frac{\theta_1}{1 + \theta_1^2},

\rho_X(h) = 0, \quad |h| \ge 2.

Example 5.2. (MA(2) process) We'll now compute the ACF of an MA(2) process. Again, this is straightforward with the help of (5) and (6):

\gamma_X(0) = \sigma^2 \sum_{i=0}^{2} \theta_{2-i}\theta_{2-i} = \sigma^2 (1 + \theta_1^2 + \theta_2^2),

\gamma_X(1) = \sigma^2 \sum_{i=0}^{1} \theta_{2-i-1}\theta_{2-i} = \sigma^2 (\theta_1\theta_2 + \theta_1),

\gamma_X(2) = \sigma^2 \sum_{i=0}^{0} \theta_{2-i-2}\theta_{2-i} = \sigma^2 \theta_2,

\gamma_X(h) = 0, \quad |h| \ge 3.


Therefore, the ACF is

\rho_X(0) = 1,

\rho_X(1) = \frac{\theta_1\theta_2 + \theta_1}{1 + \theta_1^2 + \theta_2^2},

\rho_X(2) = \frac{\theta_2}{1 + \theta_1^2 + \theta_2^2},

\rho_X(h) = 0, \quad |h| \ge 3.

Example 5.3. Let us now simulate two MA(2) processes. First, consider the process

Xt = Zt + Zt−1 − Zt−2.

We can simulate it as follows:

> Z=rnorm(500)

> X=Z

> for (i in 3:500) X[i]=Z[i]+Z[i-1]-Z[i-2]

> plot(X,type="l")

[Figure: time series plot of the simulated MA(2) process X_t = Z_t + Z_{t-1} - Z_{t-2}.]

Let’s now change the signs of the coefficients in the time series above to see what the process

Xt = Zt − Zt−1 + Zt−2

looks like.

> Z=rnorm(500)

> X=Z

> for (i in 3:500) X[i]=Z[i]-Z[i-1]+Z[i-2]

> plot(X,type="l")


[Figure: time series plot of the simulated MA(2) process X_t = Z_t - Z_{t-1} + Z_{t-2}.]


Math 4506 (Fall 2019) September 16, 2019 Prof. Christian Benes

Lecture #6: AR processes

Reference. Section 4.3 from the textbook.

6.1 AR processes

Recall the following definition:

Definition 6.1. We define the backwards shift operator $B$ by
$$BX_t = X_{t-1}.$$
For $j \ge 2$, we define $B^j X_t = B B^{j-1} X_t$. In other words, $B^j X_t = X_{t-j}$.

Example 6.1. Recall that for $n \ge 1$, we defined random walks $S_n$ as follows: if $\{X_i\}_{i\ge 1}$ is i.i.d. noise,
$$S_n = \sum_{i=1}^{n} X_i.$$
Another way of defining random walk is by defining $S_1 = X_1$ and, for $n \ge 2$,
$$S_n = S_{n-1} + X_n,$$
or, with the backward shift notation,
$$S_n - BS_n = X_n.$$
We can use the factorization that we use for real numbers in this case as well, but have to be careful and realize that the symbolic factorization is for operators (in particular, 1 represents the identity operator, not the number one). This gives
$$(1-B)S_n = X_n.$$

One natural way of introducing correlation into a time series model is by defining the time series recursively.

Definition 6.2. We define an autoregressive process of order $p$ to be a process $X$ satisfying, for all $t \in \mathbb{Z}$,
$$X_t - \phi_1 X_{t-1} - \dots - \phi_p X_{t-p} = Z_t \qquad (7)$$
$$\iff (1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p) X_t = Z_t \iff \Phi_p(B) X_t = Z_t,$$
where $Z_t \sim WN(0,\sigma^2)$, $Z_t$ is independent of $X_s$, $s < t$, and $\Phi_p(z) = 1 - \sum_{i=1}^{p} \phi_i z^i$.


Note that random walk $S_n$ is defined by the equation
$$(1-B)S_n = X_n,$$
so random walk is a particular case of an AR(1) process. We already saw that random walk is not stationary, so we see that there are processes satisfying the AR equation that aren't stationary. Note that this is different from MA processes, which are always stationary.

6.2 Stationarity of AR processes

It turns out that for any set of parameters $\{\phi_i\}_{1\le i\le p}$, this process exists. However, it isn't always stationary. The criterion for stationarity is quite simple: an AR($p$) process is stationary if and only if all roots of the characteristic equation $\Phi_p(z) = 0$ have modulus greater than 1. In that case, the process is uniquely defined by equation (7). In other words, if $z_1, \dots, z_p$ are the roots of the characteristic equation, we need $|z_i| > 1$ for all $i \in \{1, \dots, p\}$. Note that the $z_i$ have to be thought of as complex numbers.

We already saw, with random walk (where $\phi = 1$), what can go wrong when a root has modulus equal to 1. Let's now apply the criterion to a higher-order example:

Example 6.2. Is the AR(3) process defined by
$$X_t = X_{t-2} + X_{t-3} + Z_t$$
stationary?

We can rewrite the equation above as $\Phi_3(B)X_t = Z_t$, where $\Phi_3(z) = 1 - z^2 - z^3$. Therefore, we need to find the roots of the characteristic polynomial $\Phi_3(z) = 1 - z^2 - z^3$. This is best done with the help of R: first define the vector of coefficients of the polynomial

> a=c(1,0,-1,-1)

Then compute the roots:

> polyroot(a)

which gives

[1] 0.7548777+0.0000000i -0.8774388+0.7448618i -0.8774388-0.7448618i

To be able to tell right away what the modulus of these roots is, type

> roots=polyroot(a)
> abs(roots)

which gives

[1] 0.7548777 1.1509639 1.1509639

Since one of these roots has modulus less than 1, the process is not stationary.
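The same check can be packaged as a small helper function (the function name below is hypothetical, not from the textbook): it returns TRUE exactly when all roots of the characteristic polynomial lie outside the unit disk.

# Hypothetical helper: stationarity check for an AR(p) with coefficients phi_1,...,phi_p.
ar_is_stationary <- function(phi) {
  roots <- polyroot(c(1, -phi))   # roots of Phi(z) = 1 - phi_1 z - ... - phi_p z^p
  all(Mod(roots) > 1)             # stationary iff every root lies outside the unit disk
}
ar_is_stationary(c(0, 1, 1))      # the AR(3) above, X_t = X_{t-2} + X_{t-3} + Z_t: FALSE
ar_is_stationary(c(0.7, -0.2))    # one of the AR(2) processes simulated below: TRUE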

6.3 Simulations of AR(2) Processes

Note that these simulations are for different AR processes than the one we looked at in class, but the principle is exactly the same.


Example 6.3. Consider the AR(2) process defined by
$$X_t = 0.7X_{t-1} + 0.2X_{t-2} + Z_t,$$
where $Z_t \sim WN(0,1)$. We can produce a realization of the time series:

> Z=rnorm(200)
> X1=Z
> for (i in 3:200) X1[i]=0.7*X1[i-1]+0.2*X1[i-2]+Z[i]
> plot(X1, type="l")

[Figure: A plot of X1]

We can now do the same thing for the AR(2) process defined by
$$X_t = 0.7X_{t-1} - 0.2X_{t-2} + Z_t,$$
where $Z_t \sim WN(0,1)$:

> X2=Z
> for (i in 3:200) X2[i]=0.7*X2[i-1]-0.2*X2[i-2]+Z[i]
> plot(X2, type="l")

[Figure: A plot of X2]

Note that to put both pictures in the same window on your screen, you can use the following commands:


> par(mfrow = c(2, 1))
> plot(X1, type="l")
> plot(X2, type="l")

[Figure: A plot of X1 and X2]

We can also see what the ACFs of these processes look like:

> Y1=ARMAacf(ar=c(0.7,0.2),lag.max=15)

> Y2=ARMAacf(ar=c(0.7,-0.2),lag.max=15)

> plot(Y1)

> plot(Y2)


[Figure: A plot of the ACFs of X1 and X2]

Example 6.4. What does the realization of a non-stationary AR(2) process look like? Let's simulate a realization of the process defined by
$$X_t = X_{t-1} + 0.2X_{t-2} + Z_t,$$
where $Z_t \sim WN(0,1)$:

> Z=rnorm(200)
> X=Z
> for (i in 3:200) X[i]=1*X[i-1]+0.2*X[i-2]+Z[i]
> plot(X,type="l")

[Figure: A plot of X]


As it's not very clear what is happening other than that the process is blowing up, let's reduce the domain a bit:

> plot(X[1:50],type="l")

[Figure: A plot of X for times 1 to 50]

> plot(X[1:20],type="l")

[Figure: A plot of X for times 1 to 20]

We see that the process looks potentially stationary for a short while, but eventually appears to be going off to infinity. One would certainly not claim based on this picture that the variance or covariances at a given lag are constant over time.


6.4 Autocorrelation for Stationary AR(1) Processes

We can write the AR(1) process as follows:
$$X_t = \phi X_{t-1} + Z_t. \qquad (8)$$
If $X_t$ is stationary, the constancy of $E[X_t]$ implies that $E[X_t] = 0$. Indeed, if $E[X_t]$ doesn't depend on $t$, we can take expected values on both sides of (8) and get
$$E[X_t] = \phi E[X_{t-1}] + E[Z_t] \iff E[X_t] = \phi E[X_t] \iff E[X_t](1-\phi) = 0 \iff \phi = 1 \text{ or } E[X_t] = 0.$$
If $\phi = 1$, the root of the characteristic equation is 1, so $X_t$ is not stationary. Therefore, $E[X_t] = 0$ for all $t$. If $h > 0$,
$$\gamma_X(h) = E[X_t X_{t-h}] = \phi E[X_{t-1} X_{t-h}] + E[Z_t X_{t-h}] = \phi \gamma_X(h-1).$$
We can repeat this procedure $h-1$ times to obtain
$$\gamma_X(h) = \phi^h \gamma_X(0).$$
Here is a case where finding the autocorrelation function is easier than finding the autocovariance function. Indeed, the last equation automatically yields, for all $h \in \mathbb{Z}$ (using the fact that $\gamma_X(h) = \gamma_X(-h)$),
$$\rho_X(h) = \frac{\gamma_X(h)}{\gamma_X(0)} = \phi^{|h|}.$$
Since $\mathrm{Cov}(Z_t, X_{t-1}) = 0$, we find that
$$\gamma_X(0) = \mathrm{Cov}(X_t, X_t) = \mathrm{Cov}(\phi X_{t-1} + Z_t, \phi X_{t-1} + Z_t) = \phi^2\gamma_X(0) + \sigma^2,$$
implying that
$$\gamma_X(0) = \frac{\sigma^2}{1-\phi^2}.$$
This gives, for all $h \in \mathbb{Z}$,
$$\gamma_X(h) = \rho_X(h)\gamma_X(0) = \phi^{|h|}\frac{\sigma^2}{1-\phi^2}.$$

Note 6.1. When looking at sample correlograms we will often need to determine which of our models has a correlogram resembling that provided by the data. For that purpose it is important to note that the correlogram above is an exponentially decaying function of $h$ (which alternates between positive and negative values if $\phi < 0$).
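A short R sketch illustrating Note 6.1, with the arbitrary choice $\phi = -0.8$: the theoretical ACF $\phi^{|h|}$ alternates in sign and decays exponentially, and agrees with R's ARMAacf.

# AR(1) ACF: rho(h) = phi^|h|; phi = -0.8 is an arbitrary illustrative value.
phi <- -0.8
h <- 0:10
rbind(formula = phi^abs(h),
      ARMAacf = ARMAacf(ar = phi, lag.max = 10))   # the two rows should agree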


Math 4506 (Fall 2019) September 18, 2019 Prof. Christian Benes

Lecture #7: Autocovariance of linear processes; stationarity of AR processes

7.1 Linear Processes

Proposition 7.2. If $Y_t$ is stationary with mean 0 and autocovariance function $\gamma_Y$ and $\sum_{i\in\mathbb{Z}} |\psi_i| < \infty$, then
$$X_t = \sum_{i=-\infty}^{\infty} \psi_i Y_{t-i}$$
is stationary with mean 0 and autocovariance function
$$\gamma_X(h) = \sum_{j,k=-\infty}^{\infty} \psi_j \psi_k \,\gamma_Y(h+k-j).$$
In particular, if $Y \sim WN(0,\sigma^2)$ (which by definition means that $X$ is linear), $X$ is stationary with mean 0 and autocovariance function
$$\gamma_X(h) = \sigma^2\sum_{k=-\infty}^{\infty} \psi_k \psi_{k+h}.$$

Proof. This is a straightforward computation, once one knows that for a convergent random series (which we have here thanks to the requirements on the $\psi_i$) the expected value of each of the infinite sums below is the sum of the expected values:
$$E[X_t] = E\Big[\sum_{i=-\infty}^{\infty} \psi_i Y_{t-i}\Big] = \sum_{i=-\infty}^{\infty} \psi_i E[Y_{t-i}] = 0,$$
since $Y_t$ is a mean zero time series. Also,
$$E[X_{t+h}X_t] = \sum_{j,k=-\infty}^{\infty} \psi_j\psi_k E[Y_{t+h-j}Y_{t-k}] = \sum_{j,k=-\infty}^{\infty} \psi_j\psi_k E[Y_{k+h-j}Y_0] = \sum_{j,k=-\infty}^{\infty} \psi_j\psi_k \,\gamma_Y(k+h-j).$$

Example 7.1. Suppose $X_t$ is an MA($\infty$) process, that is,
$$X_t = \sum_{i\ge 0} \theta_i Z_{t-i},$$


with $\sum_{i\ge 0} |\theta_i| < \infty$. Then since $Z_t$ is stationary, we can apply Proposition 7.2 to $X_t$ to obtain
$$\gamma_X(h) = \sum_{j,k=-\infty}^{\infty} \theta_j\theta_k \,\gamma_Z(h+k-j) = \sigma^2\sum_{k=0}^{\infty} \theta_k\theta_{k+h},$$
so that
$$\rho_X(h) = \frac{\sum_{k=0}^{\infty}\theta_k\theta_{k+h}}{\sum_{k=0}^{\infty}\theta_k^2}.$$
In particular, if $X_t$ is an MA($q$) process (meaning that $\theta_i = 0$ if $i > q$),
$$\gamma_X(h) = \sigma^2\sum_{k=0}^{q-h}\theta_k\theta_{k+h} \quad \text{and} \quad \rho_X(h) = \frac{\sum_{k=0}^{q-h}\theta_k\theta_{k+h}}{\sum_{k=0}^{q}\theta_k^2}.$$

7.2 Stationarity of AR(1) Process - Take Two

Recall Definition 6.2: an autoregressive process of order $p$ satisfies, for all $t \in \mathbb{Z}$,
$$X_t - \phi_1 X_{t-1} - \dots - \phi_p X_{t-p} = Z_t \qquad (9)$$
$$\iff (1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p) X_t = Z_t \iff \Phi_p(B) X_t = Z_t,$$
where $Z_t \sim WN(0,\sigma^2)$, $Z_t$ is independent of $X_s$, $s < t$, and $\Phi_p(z) = 1 - \sum_{i=1}^{p}\phi_i z^i$. In particular, $X$ is an AR(1) process if it satisfies
$$X_t - \phi X_{t-1} = Z_t,$$
where $Z_t \sim WN(0,\sigma^2)$ and $Z_t$ is independent of $X_s$ for $s < t$.

We've already seen that an AR(1) process is stationary if and only if $|\phi| < 1$. Here is a way of seeing why:
$$X_t = \phi X_{t-1} + Z_t \Rightarrow X_t = \phi(\phi X_{t-2} + Z_{t-1}) + Z_t = Z_t + \phi Z_{t-1} + \phi^2 X_{t-2} = Z_t + \phi Z_{t-1} + \phi^2 Z_{t-2} + \phi^3 X_{t-3} = \dots = \sum_{i\ge 0}\phi^i Z_{t-i}, \qquad (10)$$
where the last step follows from taking $\lim_{n\to\infty}\phi^n X_{t-n}$ and noting that if $|\phi| < 1$, this limit is 0 (in a subtle sense you might want to think about; in other words, ask yourself this: what does it mean for the limit of a sequence of random variables to be 0?).

What we just did was rewrite a stationary AR(1) process as a linear process (see Lecture 4), more specifically as an MA($\infty$) process.

However, as the next example shows, one has to be a little bit careful when dealing with the AR equation (9).


Example 7.2. If $|\phi| > 1$, we can show that the time series defined by
$$X_t = -\sum_{j\ge 1}\Big(\frac{1}{\phi}\Big)^j Z_{t+j} \qquad (11)$$
satisfies the AR(1) equation $X_t = \phi X_{t-1} + Z_t$. Indeed,
$$\phi X_{t-1} + Z_t = \phi\Big(-\sum_{j\ge 1}\Big(\frac{1}{\phi}\Big)^j Z_{t-1+j}\Big) + Z_t = -\sum_{j\ge 1}\Big(\frac{1}{\phi}\Big)^{j-1} Z_{t+j-1} + Z_t = -\sum_{j\ge 0}\Big(\frac{1}{\phi}\Big)^{j} Z_{t+j} + Z_t = -Z_t - \sum_{j\ge 1}\Big(\frac{1}{\phi}\Big)^{j} Z_{t+j} + Z_t = -\sum_{j\ge 1}\Big(\frac{1}{\phi}\Big)^{j} Z_{t+j} = X_t.$$
This time series is clearly stationary, since it is a linear process with summable coefficients, i.e., such that $\sum_{j\ge 1}\big|\frac{1}{\phi}\big|^j < \infty$. So does this mean that there are two distinct stationary AR(1) processes, those defined in (10) and in (11)? We know that this cannot be the case (by uniqueness of stationary AR processes), so there must be something preventing one of these processes from being an AR(1) process. It turns out that in (11), $Z_t$ is not independent of $X_s$ for $s < t$ since, for instance, $X_t = -\sum_{j\ge 1}(\frac{1}{\phi})^j Z_{t+j}$, so that $Z_{t+1} = \phi\big(-X_t - \sum_{j\ge 2}(\frac{1}{\phi})^j Z_{t+j}\big)$.

Example 7.3. If $W_t = X_t + c\phi^t$, where $X_t$ is a stationary AR(1) process (which means that $X_t = \phi X_{t-1} + Z_t$ with $|\phi| < 1$), we see that
$$W_t = \phi W_{t-1} + Z_t.$$
Indeed,
$$\phi W_{t-1} + Z_t = \phi(X_{t-1} + c\phi^{t-1}) + Z_t = \phi X_{t-1} + Z_t + c\phi^t = X_t + c\phi^t = W_t.$$
Also, since by assumption $Z_t$ is independent of $X_s$, $s < t$, and since $W_t = X_t + c\phi^t$, we have that $Z_t$ is independent of $W_s$, $s < t$. Therefore, $W_t$ is an AR(1) process. Moreover, $|\phi| < 1$. So is $W_t$ a stationary AR(1) process? No, since $E[W_t] = E[X_t + c\phi^t] = c\phi^t$, which is not constant since it depends on $t$. So we see that there is more than one AR(1) process for a given parameter $\phi$ with $|\phi| < 1$. However, as mentioned several times before, only one can be stationary.


Math 4506 (Fall 2019) September 23, 2019 Prof. Christian Benes

Lecture #8: Yule-Walker equations; Causality and Invertibility

8.1 Yule-Walker Equations for AR(p) Processes

If $X$ is an AR($p$) process, then
$$X_t = \phi_1 X_{t-1} + \dots + \phi_p X_{t-p} + Z_t.$$
If, moreover, $X$ is stationary, then there exists $\mu$ such that for all $t$, $E[X_t] = \mu$. Then
$$E[X_t] = \phi_1 E[X_{t-1}] + \dots + \phi_p E[X_{t-p}] + E[Z_t] \Rightarrow \mu\Big(1 - \sum_{j=1}^{p}\phi_j\Big) = 0.$$
Therefore, if $X$ is stationary,
$$\mu = 0 \quad \text{or} \quad 1 - \sum_{j=1}^{p}\phi_j = 0. \qquad (12)$$
Now note that $\Phi(z) = 1 - \sum_{j=1}^{p}\phi_j z^j$. Since we are assuming that $X$ is stationary, we know that all solutions of the characteristic equation $\Phi(z) = 0$ must be outside of the unit disk. In particular, $z = 1$ cannot be a solution of the characteristic equation, so that $0 \ne \Phi(1) = 1 - \sum_{j=1}^{p}\phi_j$. Therefore, we see from (12) that if $X$ is stationary, then $\mu = 0$.

If $X$ is an AR($p$) process, we can write, for any $j \in \{0, \dots, p\}$,
$$X_t = \phi_1 X_{t-1} + \dots + \phi_p X_{t-p} + Z_t \Rightarrow E[X_t X_{t-j}] = \sum_{i=1}^{p}\phi_i E[X_{t-i}X_{t-j}] + E[Z_t X_{t-j}]$$
$$\Rightarrow \gamma(j) = \sum_{i=1}^{p}\phi_i\gamma(j-i) + E[Z_t X_{t-j}]. \qquad (13)$$

This gives us:

• If $j = 0$,
$$\gamma(0) = \sum_{i=1}^{p}\phi_i\gamma(i) + E[Z_t X_t] = \sum_{i=1}^{p}\phi_i\gamma(i) + \sigma^2. \qquad (14)$$

• Since $Z_t$ is uncorrelated with $X_{t-j}$ whenever $j \in \{1, \dots, p\}$, we get, for all $j \in \{1, \dots, p\}$,
$$\gamma(j) = \sum_{i=1}^{p}\phi_i\gamma(j-i),$$


which, in matrix notation, can be written as
$$\Gamma_p \boldsymbol{\phi} = \gamma_p, \qquad (15)$$
where $\Gamma_p = (\gamma(i-j))_{i,j=1}^{p}$ is the covariance matrix, $\gamma_p = (\gamma(1), \dots, \gamma(p))'$, and $\boldsymbol{\phi} = (\phi_1, \dots, \phi_p)'$.

In particular, dividing every element on both sides of the equality by $\gamma(0)$ yields
$$R_p \boldsymbol{\phi} = \rho_p, \qquad (16)$$
where $R_p = (\rho(i-j))_{i,j=1}^{p}$ is the correlation matrix, $\rho_p = (\rho(1), \dots, \rho(p))'$, and $\boldsymbol{\phi} = (\phi_1, \dots, \phi_p)'$.

The equations for $j = 0$ and $j \in \{1, \dots, p\}$ are a set of $p+1$ equations in the $2p+2$ variables $\sigma^2, \phi_1, \dots, \phi_p, \gamma(0), \dots, \gamma(p)$. If the model is entirely specified, we know $\sigma^2, \phi_1, \dots, \phi_p$ and can therefore solve for $\gamma(0), \dots, \gamma(p)$. This is of course true only if the matrix defining our system of equations is nonsingular.

Once we have $\gamma(0), \dots, \gamma(p)$, we can use (13) to compute $\gamma(j)$ for all $j \ge p+1$ recursively:
$$\gamma(j) = \sum_{i=1}^{p}\phi_i\gamma(j-i).$$
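The sketch below, for an AR(2) with the arbitrary parameters $\phi = (0.5, 0.3)$ and $\sigma^2 = 1$, solves the three Yule-Walker equations for $\gamma(0), \gamma(1), \gamma(2)$ numerically and then extends $\gamma$ recursively, exactly as described above.

# Yule-Walker for an AR(2); phi = (0.5, 0.3) and sigma^2 = 1 are arbitrary choices.
phi <- c(0.5, 0.3); sigma2 <- 1
A <- rbind(c(1, -phi[1], -phi[2]),        # gamma(0) - phi1*gamma(1) - phi2*gamma(2) = sigma^2
           c(-phi[1], 1 - phi[2], 0),     # gamma(1) - phi1*gamma(0) - phi2*gamma(1) = 0
           c(-phi[2], -phi[1], 1))        # gamma(2) - phi1*gamma(1) - phi2*gamma(0) = 0
gam <- solve(A, c(sigma2, 0, 0))          # (gamma(0), gamma(1), gamma(2))
for (j in 4:11) gam[j] <- sum(phi * gam[(j - 1):(j - 2)])   # recursion for j >= 3
gam / gam[1]                              # should match ARMAacf(ar = phi, lag.max = 10)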

8.2 AR(2) processes

We can use the Yule-Walker equations to find the autocorrelation function of a given stationary AR(2) process satisfying
$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + Z_t.$$
Recall that an AR(2) process is stationary if all the solutions of the characteristic equation
$$1 - \phi_1 z - \phi_2 z^2 = 0 \iff \phi_2 z^2 + \phi_1 z - 1 = 0$$
have magnitude $> 1$. For such processes, the Yule-Walker equations yield
$$\gamma(0) = \phi_1\gamma(1) + \phi_2\gamma(2) + \sigma^2,$$
$$\gamma(1) = \phi_1\gamma(0) + \phi_2\gamma(1),$$
$$\gamma(2) = \phi_1\gamma(1) + \phi_2\gamma(0),$$
and, for $j \ge 3$, $\gamma(j) = \phi_1\gamma(j-1) + \phi_2\gamma(j-2)$.

We can compute this explicitly (by hand, or using some software):
$$\gamma(0) = \frac{(\phi_2-1)\sigma^2}{(1+\phi_2)(\phi_1^2-(\phi_2-1)^2)}, \quad \gamma(1) = \frac{-\phi_1\sigma^2}{(1+\phi_2)(\phi_1^2-(\phi_2-1)^2)}, \quad \gamma(2) = \frac{-(\phi_1^2-\phi_2^2+\phi_2)\sigma^2}{(1+\phi_2)(\phi_1^2-(\phi_2-1)^2)}.$$


However, this is not particularly instructive if we wish to know the shape of the ACF. Instead, let's first focus on the set of points $(\phi_1, \phi_2) \in \mathbb{R}^2$ for which the AR(2) process has a stationary solution. The quadratic formula tells us that
$$1 - \phi_1 z - \phi_2 z^2 = 0 \iff z = \frac{-\phi_1 \pm \sqrt{\phi_1^2 + 4\phi_2}}{2\phi_2}.$$
To know when these numbers are greater than one in absolute value, we consider three cases:

1. $\phi_1^2 + 4\phi_2 < 0$. Then $z$ takes two complex (conjugate) values, each of which has squared magnitude
$$\frac{\phi_1^2}{4\phi_2^2} + \frac{-(\phi_1^2 + 4\phi_2)}{4\phi_2^2} = -\frac{1}{\phi_2}.$$

2. $\phi_1^2 + 4\phi_2 = 0$. Then $z$ takes one real value, which has magnitude $\frac{1}{2}\big|\frac{\phi_1}{\phi_2}\big|$.

3. $\phi_1^2 + 4\phi_2 > 0$. Then $z$ takes two real values, with magnitudes
$$\Big|\frac{-\phi_1 + \sqrt{\phi_1^2 + 4\phi_2}}{2\phi_2}\Big| \quad \text{and} \quad \Big|\frac{-\phi_1 - \sqrt{\phi_1^2 + 4\phi_2}}{2\phi_2}\Big|.$$

One can show (see Appendix B, p. 84) that these roots are greater than 1 in absolute value if and only if
$$\phi_1 + \phi_2 < 1, \quad \phi_2 - \phi_1 < 1, \quad \text{and} \quad |\phi_2| < 1.$$
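The following R sketch checks numerically that the three inequalities above agree with the root criterion; the randomly drawn $(\phi_1, \phi_2)$ pairs are only for illustration.

# Comparing the triangle conditions with the root criterion on random (phi1, phi2) pairs.
set.seed(1)
for (k in 1:5) {
  phi <- runif(2, -2, 2)
  roots_ok <- all(Mod(polyroot(c(1, -phi))) > 1)               # all roots outside unit disk?
  triangle <- (phi[1] + phi[2] < 1) && (phi[2] - phi[1] < 1) && (abs(phi[2]) < 1)
  cat(sprintf("phi = (% .2f, % .2f): roots ok = %s, triangle = %s\n",
              phi[1], phi[2], roots_ok, triangle))
}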

It turns out that key quantities in understanding AR(2) processes are the reciprocals of the roots of the characteristic equation,
$$G_1 = \frac{2\phi_2}{-\phi_1 - \sqrt{\phi_1^2 + 4\phi_2}} = \frac{\phi_1 - \sqrt{\phi_1^2 + 4\phi_2}}{2} \quad \text{and} \quad G_2 = \frac{2\phi_2}{-\phi_1 + \sqrt{\phi_1^2 + 4\phi_2}} = \frac{\phi_1 + \sqrt{\phi_1^2 + 4\phi_2}}{2},$$
where the second expression in each case is obtained by multiplying numerator and denominator by the conjugate of the denominator.

The autocorrelations of an AR(2) process can be expressed in terms of $G_1$ and $G_2$:

1. If there is only one root to the characteristic equation, i.e., if $\phi_1^2 + 4\phi_2 = 0$, we have
$$\rho_k = \Big(1 + \frac{1+\phi_2}{1-\phi_2}\,k\Big)\Big(\frac{\phi_1}{2}\Big)^k, \quad k \ge 0.$$

2. Otherwise,
$$\rho_k = \frac{(1-G_2^2)G_1^{k+1} - (1-G_1^2)G_2^{k+1}}{(G_1-G_2)(1+G_1G_2)}.$$
In particular, if the roots are complex, i.e., if $\phi_1^2 + 4\phi_2 < 0$, we can write
$$\rho_k = R^k\,\frac{\sin(\Theta k + \Phi)}{\sin\Phi},$$
with $R = \sqrt{-\phi_2}$ (recall that for a stationary AR(2) process, $|\phi_2| < 1$, and if the roots are complex, it is easy to see that $\phi_2 < 0$ since $\phi_1^2 + 4\phi_2 < 0$), $\Theta$ satisfying $\cos\Theta = \frac{\phi_1}{2\sqrt{-\phi_2}}$, and $\tan\Phi = \frac{1-\phi_2}{1+\phi_2}\tan\Theta$.


Math 4506 (Fall 2019) September 25, 2019 Prof. Christian Benes

Lecture #9: Causality and Invertibility

9.1 Causality and Invertibility

There are two dual (you can think of duality as some form of symmetry) forms in which one might be able to express time series. Roughly,

• if $X_t$ is defined in terms of $\{Z_s\}_{s\le t}$, we call $X$ causal;

• if $Z_t$ is defined in terms of $\{X_s\}_{s\le t}$, we call $X$ invertible.

More formally:

Definition 9.1. A time series $X_t$ is

• causal if there exist constants $\psi_j$ with $\sum_{j\ge 0}|\psi_j| < \infty$ such that
$$X_t = \sum_{j\ge 0}\psi_j Z_{t-j},$$
where $Z_t \sim WN(0,\sigma^2)$. Note that such a process can also be thought of as an MA($\infty$) process.

• invertible if there exist constants $\pi_j$ with $\sum_{j\ge 0}|\pi_j| < \infty$ such that
$$Z_t = \sum_{j\ge 0}\pi_j X_{t-j},$$
where $Z_t \sim WN(0,\sigma^2)$.

Clearly, any MA($q$) process is causal and any AR($p$) process is invertible (both by definition). We will now show that some MA($q$) processes are invertible as well and that some AR($p$) processes are causal.

9.2 Stationary AR(p) processes are causal

If $X_t$ is an AR($p$) process, then, as we've seen before, using the backward shift operator $B$,
$$X_t - \sum_{i=1}^{p}\phi_i X_{t-i} = Z_t \iff \Phi(B)X_t = Z_t \iff X_t = (\Phi(B))^{-1}Z_t,$$
where $\Phi(z) = 1 - \sum_{i=1}^{p}\phi_i z^i$ and $(\Phi(B))^{-1}$ is the inverse operator of $\Phi(B)$.

What exactly is the operator $(\Phi(B))^{-1}$? We can try to write it explicitly by assuming that it has the form $\Psi(B) = 1 + \sum_{i\ge 1}\psi_i B^i$. Then
$$(\Phi(B))(\Phi(B))^{-1} = 1 \Rightarrow \Phi(B)\Psi(B) = 1, \qquad (17)$$
where 1 is the identity operator (i.e., the operator that doesn't do anything to $X_t$: $1X_t = X_t$), NOT the number 1.

We will know what $\Psi(B)$ is if we can figure out what all the $\psi_i$ are (at least in terms of the $\phi_i$, which define the process $X_t$). The second equality in (17) can be re-written for polynomials:
$$\Phi(z)\Psi(z) = 1.$$
The left side of this equality is a polynomial and the right side is the number 1. For these two to be equal, all the coefficients of the polynomial must be 0, except that of order 0, which must be 1. By solving the equation we get for each coefficient, we can figure out what the $\psi_i$ are:
$$\Phi(z)\Psi(z) = 1 \iff \Big(1 - \sum_{i=1}^{p}\phi_i z^i\Big)\Big(1 + \sum_{i\ge 1}\psi_i z^i\Big) = 1.$$
Expanding the left side in increasing order of degree, we get
$$1 + (\psi_1 - \phi_1)z + (\psi_2 - \psi_1\phi_1 - \phi_2)z^2 + (\psi_3 - \psi_2\phi_1 - \psi_1\phi_2 - \phi_3)z^3 + \cdots,$$
which yields the equations
$$1 = 1, \quad \psi_1 - \phi_1 = 0, \quad \psi_2 - \psi_1\phi_1 - \phi_2 = 0, \quad \psi_3 - \psi_2\phi_1 - \psi_1\phi_2 - \phi_3 = 0, \quad \dots, \quad \psi_k - \sum_{i=1}^{k}\psi_{k-i}\phi_i = 0,$$
where $\psi_0 = 1$, which give the following equations for the $\psi_i$:
$$\psi_1 = \phi_1, \quad \psi_2 = \psi_1\phi_1 + \phi_2 = \phi_1^2 + \phi_2, \quad \psi_3 = \psi_2\phi_1 + \psi_1\phi_2 + \phi_3 = \phi_1^3 + 2\phi_1\phi_2 + \phi_3, \quad \dots, \quad \psi_k = \sum_{i=1}^{k}\psi_{k-i}\phi_i,$$
which gives us the values of all the $\psi_i$ recursively.

It of course isn't obvious from the recursive equations above that $\sum_{j\ge 0}|\psi_j| < \infty$. The following theorem says exactly when that is the case:

Theorem 9.1. An AR($p$) process is causal iff whenever $\Phi_p(z) = 0$, then $|z| > 1$. In other words, an AR($p$) process is causal iff all zeros of $\Phi_p$ are outside of the unit disk.

In the following example, we use the expression we just obtained to express an AR(1) process explicitly. In other words, we'll show that a stationary AR(1) process is causal and will re-derive its ACF (we already derived it once in Lecture 6).

Example 9.1. Recall that an AR(1) process is defined to be the stationary solution of
$$X_t - \phi X_{t-1} = Z_t,$$
where $Z_t \sim WN(0,\sigma^2)$. We already know such a process exists if $|\phi| < 1$. Applying the recursive equations above to the AR(1) case, where $\Phi(z) = 1 - \phi z$, $|\phi| < 1$, we get
$$\psi_1 = \phi, \quad \psi_2 = \phi^2, \quad \psi_3 = \phi^3, \quad \dots, \quad \psi_k = \phi^k.$$
Therefore,
$$\Psi(z) = \Phi^{-1}(z) = \sum_{k\ge 0}(\phi z)^k,$$
and therefore
$$X_t = \Psi(B)Z_t = \sum_{k\ge 0}\phi^k Z_{t-k}.$$
Note that since $\sum_{k\ge 0}|\phi|^k < \infty$ (because $|\phi| < 1$), we see that indeed $X_t$ is causal.

There is an easy (more intuitive) way of checking that $X_t = \Psi(B)Z_t = \sum_{k\ge 0}\phi^k Z_{t-k}$ indeed satisfies the autoregressive equation: suppose $X_t = \sum_{j\ge 0}\phi^j Z_{t-j}$. Then
$$X_t - \phi X_{t-1} = \sum_{j\ge 0}\phi^j Z_{t-j} - \sum_{j\ge 0}\phi^{j+1}Z_{t-(j+1)} = \sum_{j\ge 0}\phi^j Z_{t-j} - \sum_{j\ge 1}\phi^j Z_{t-j} = Z_t.$$
By Proposition 7.2, $X_t$ is stationary with
$$E[X_t] = 0 \quad \text{and} \quad \gamma_X(h) = \frac{\sigma^2\phi^h}{1-\phi^2}, \quad h \ge 0.$$


9.3 MA(q) processes can be invertible

We can mimic the work done in the previous section to find an invertible expression for MA($q$) processes. Suppose $X_t$ is an MA($q$) process. Then
$$X_t = \sum_{i=0}^{q}\theta_i Z_{t-i} = \Theta(B)Z_t \Rightarrow \Theta^{-1}(B)X_t = Z_t.$$
(Here, $\theta_0 = 1$.) Suppose $\Theta^{-1}(B)$ is of the form $\Pi(B) = 1 - \sum_{i=1}^{\infty}\pi_i B^i$. Then
$$\Pi(B)\Theta(B) = \Big(1 - \sum_{i=1}^{\infty}\pi_i B^i\Big)\Big(\sum_{i=0}^{q}\theta_i B^i\Big) = 1.$$
(Again, here, "1" is the identity operator, not the number.) Equating coefficients of the polynomials on both sides of the equality, we get the equations
$$1 = 1, \quad -\pi_1 + \theta_1 = 0, \quad -\pi_2 - \pi_1\theta_1 + \theta_2 = 0, \quad -\pi_3 - \pi_2\theta_1 - \pi_1\theta_2 + \theta_3 = 0, \quad \dots, \quad \theta_k - \sum_{i=1}^{k}\pi_i\theta_{k-i} = 0,$$

which give the following equations for the $\pi_i$:
$$\pi_1 = \theta_1, \quad \pi_2 = \theta_2 - \pi_1\theta_1 = \theta_2 - \theta_1^2, \quad \pi_3 = \theta_3 - \pi_2\theta_1 - \pi_1\theta_2 = \theta_3 - \theta_1(\theta_2 - \theta_1^2) - \theta_1\theta_2, \quad \dots$$
As above, it isn't obvious from the recursive equations above that $\sum_{j\ge 0}|\pi_j| < \infty$, but the following theorem tells us when this is true:

Theorem 9.2. An MA($q$) process is invertible iff whenever $\Theta_q(z) = 0$, then $|z| > 1$. In other words, an MA($q$) process is invertible iff all zeros of $\Theta_q$ are outside of the unit disk.

Example 9.2. As in the AR case above, we consider the particular case where $q = 1$ to express an MA(1) process in its inverted form. To achieve this, we just need to solve the equations above when $\theta_1 = \theta$ and $\theta_i = 0$ if $i \ge 2$. Doing that, we get
$$\pi_1 = \theta, \quad \pi_2 = -\theta^2, \quad \pi_3 = \theta^3, \quad \dots, \quad \pi_k = (-1)^{k+1}\theta^k.$$
In particular, we see that an MA(1) process is invertible if $|\theta| < 1$ (since in that case, $\sum_{j\ge 0}|\pi_j| < \infty$), thus confirming Theorem 9.2.


Math 4506 (Fall 2019) October 2, 2019 Prof. Christian Benes

Lecture #10: ARMA processes

Reference. Sections 4.4 and 4.5 from the textbook.

10.1 ARMA Processes

What happens when you mix an AR($p$) and an MA($q$) process? Not too surprisingly, you get an ARMA($p,q$) process.

Definition 10.1. A time series $X_t$ is an ARMA($p,q$) process if $X_t$ is stationary and, for all $t$,
$$X_t - \sum_{i=1}^{p}\phi_i X_{t-i} = \sum_{j=0}^{q}\theta_j Z_{t-j}, \qquad (18)$$
where $\theta_0 = 1$, $Z_t \sim WN(0,\sigma^2)$, and the polynomials $1 - \sum_{i=1}^{p}\phi_i z^i$ and $\sum_{j=0}^{q}\theta_j z^j$ have no common factors.

Note 10.1. Clearly, AR($p$) processes are just a particular case of ARMA($p,q$) processes (in the case when $\theta_i = 0$ for $i = 1, \dots, q$). So are MA($q$) processes (in the case when $\phi_i = 0$ for $i = 1, \dots, p$).

Note 10.2. Recall the backward shift operators $B^j$ defined for $j \ge 0$ by $B^j X_t = X_{t-j}$. Then if we define the polynomials
$$\Phi(z) = 1 - \sum_{i=1}^{p}\phi_i z^i \quad \text{and} \quad \Theta(z) = \sum_{j=0}^{q}\theta_j z^j,$$
we can re-write equation (18) in the more succinct form
$$\Phi(B)X_t = \Theta(B)Z_t. \qquad (19)$$

To simplify the notation and derivations of properties of ARMA processes, we start by focusing on the case where $p = q = 1$ and will come back to the general case later. An ARMA(1,1) process $X_t$ is a stationary time series satisfying the equation
$$X_t - \phi X_{t-1} = Z_t + \theta Z_{t-1},$$
or, equivalently,
$$\Phi(B)X_t = \Theta(B)Z_t,$$
where $\Phi(z) = 1 - \phi z$ and $\Theta(z) = 1 + \theta z$. We've already seen a heuristic derivation of a solution for the AR(1) process. We will now look for a solution in a less explicit but quicker way for the ARMA(1,1) process. Note that we could have used the same quicker method for the AR(1) process.

First a few generalities:

We know from Proposition 7.2 that if $\sum_{i\in\mathbb{Z}}|\psi_i| < \infty$ and $Y_t$ is a stationary time series, then the time series
$$\psi(B)Y_t,$$
where $\psi(B) = \sum_{j\in\mathbb{Z}}\psi_j B^j$, is stationary as well. This suggests that we can repeatedly apply operators of the form $\psi(B) = \sum_{j\in\mathbb{Z}}\psi_j B^j$ (also called filters) to a stationary time series without losing stationarity:

Suppose that $\sum_{i\in\mathbb{Z}}|\alpha_i| < \infty$ and $\sum_{i\in\mathbb{Z}}|\beta_i| < \infty$ and define the polynomials $\alpha(z) = \sum_{j\in\mathbb{Z}}\alpha_j z^j$ and $\beta(z) = \sum_{j\in\mathbb{Z}}\beta_j z^j$. Then Proposition 7.2 implies that successive applications of the operators $\alpha(B)$ and $\beta(B)$ to a stationary time series $Y_t$ yield another stationary time series, that is,
$$W_t = \alpha(B)\beta(B)Y_t$$
is stationary. In that case,
$$W_t = \sum_{j\in\mathbb{Z}}\eta_j Y_{t-j},$$
where
$$\eta_j = \sum_{k\in\mathbb{Z}}\alpha_k\beta_{j-k} = \sum_{k\in\mathbb{Z}}\beta_k\alpha_{j-k} \qquad (20)$$
(you will show this on Homework 3). Equivalently,
$$W_t = H(B)Y_t,$$
where $H(B) = \alpha(B)\beta(B) = \beta(B)\alpha(B)$. Note that the operator $\alpha(B)\beta(B)$ is obtained from $\alpha(B)$ and $\beta(B)$ by performing a formal product of these two operators as if they were polynomials and grouping the terms which have the same powers of $B$.

Let's return to the ARMA(1,1) process which, by definition, is a stationary time series $X_t$ satisfying
$$X_t - \phi X_{t-1} = Z_t + \theta Z_{t-1}.$$
Equivalently, it is a stationary solution to
$$\Phi(B)X_t = \Theta(B)Z_t,$$
where $\Phi(z) = 1 - \phi z$ and $\Theta(z) = 1 + \theta z$.

We start by finding the Taylor series (let's call it $\Psi(z)$) for $\frac{1}{\Phi(z)} = \frac{1}{1-\phi z}$. By analogy with the formula for geometric series, we see that
$$\frac{1}{\Phi(z)} = \frac{1}{1-\phi z} = \sum_{j\ge 0}\phi^j z^j = \Psi(z).$$
Now if $|\phi| < 1$, the coefficients of the series $\sum_{j\ge 0}\phi^j z^j = \sum_{j\in\mathbb{Z}}\tilde\phi_j z^j$ (where $\tilde\phi_j = \phi^j$ if $j \ge 0$ and $\tilde\phi_j = 0$ if $j < 0$) are absolutely summable, that is, they satisfy $\sum_{j\in\mathbb{Z}}|\tilde\phi_j| < \infty$. Therefore, the generalities above apply and we can use (20) to conclude that $\Psi(B)\Phi(B) = 1$, the identity operator.

If we apply $\Psi(B)$ to the two sides of the equation which defines the ARMA(1,1) process,
$$\Phi(B)X_t = \Theta(B)Z_t,$$
we get
$$X_t = \Psi(B)\Theta(B)Z_t.$$
Using equation (20), we get
$$\Psi(B)\Theta(B) = \sum_{i\ge 0}\phi^i B^i(1+\theta B) = \sum_{j\ge 0}\eta_j B^j,$$
where $\eta_0 = 1$ and $\eta_j = (\phi+\theta)\phi^{j-1}$ if $j \ge 1$. Writing $H(B) = \sum_{j\ge 0}\eta_j B^j$ now gives an explicit expression for the ARMA(1,1) process:
$$X_t = H(B)Z_t = Z_t + (\phi+\theta)\sum_{j\ge 1}\phi^{j-1}Z_{t-j}.$$

We just studied the ARMA(1,1) process in the case where $|\phi| < 1$. Let us now examine the same process when $|\phi| > 1$.

First note that in this case (this, again, is something you will show on Homework 3)
$$\frac{1}{\Phi(z)} = \frac{1}{1-\phi z} = -\sum_{j\ge 1}\phi^{-j}z^{-j}.$$
Mimicking the case $|\phi| < 1$, we write $\Psi(z) = -\sum_{j\ge 1}\phi^{-j}z^{-j}$ and apply $\Psi(B)$ to both sides of
$$\Phi(B)X_t = \Theta(B)Z_t,$$
thus obtaining
$$X_t = \Psi(B)\Theta(B)Z_t = -\theta\phi^{-1}Z_t - (\theta+\phi)\sum_{j\ge 1}\phi^{-(j+1)}Z_{t+j}.$$

Finally, in the cases where $|\phi| = 1$, the ARMA(1,1) equations have no stationary solutions (you showed this on the homework in the purely autoregressive case), implying that in those cases, there is no ARMA(1,1) process.

Looking at the explicit solutions for the ARMA(1,1) equations which we just derived, we see that

• If $|\phi| < 1$, $X_t$ depends only on "past" values of $Z$, that is, $X_t$ is defined in terms of $\{Z_s\}_{s\le t}$. We call $X$ causal.

• If $|\phi| > 1$, $X_t$ depends only on "future" values of $Z$, that is, $X_t$ is defined in terms of $\{Z_s\}_{s\ge t}$. We call $X$ noncausal.

• If $|\phi| = 1$, there is no stationary solution to the ARMA(1,1) equation.

10.2 Invertibility and causality of ARMA(p, q) Processes

Recall the following definition:

Definition 10.2. A time series $X_t$ is an ARMA($p,q$) process if $X_t$ is stationary and, for all $t$,
$$X_t - \sum_{i=1}^{p}\phi_i X_{t-i} = \sum_{j=0}^{q}\theta_j Z_{t-j},$$
where $\theta_0 = 1$, $Z_t \sim WN(0,\sigma^2)$, and the polynomials $\Phi(z) = 1 - \sum_{i=1}^{p}\phi_i z^i$ and $\Theta(z) = \sum_{j=0}^{q}\theta_j z^j$ have no common factors.

Using the backward shift operator, we can re-write equation (18) in the more succinct form
$$\Phi(B)X_t = \Theta(B)Z_t.$$
ARMA processes are commonly used for a number of reasons. One of these is their linear structure, which simplifies a number of calculations, particularly when predicting. Another is the fact that for many autocovariance functions, one can find an ARMA process with that autocovariance function.

Definition 10.3. An ARMA($p,q$) process is

• causal if there exist constants $\psi_j$ with $\sum_{j\ge 0}|\psi_j| < \infty$ such that
$$X_t = \sum_{j\ge 0}\psi_j Z_{t-j}.$$
Note that in that case, an ARMA($p,q$) process is also what we defined to be an MA($\infty$) process.

• invertible if there exist constants $\pi_j$ with $\sum_{j\ge 0}|\pi_j| < \infty$ such that
$$Z_t = \sum_{j\ge 0}\pi_j X_{t-j}.$$

Theorem 10.1. An ARMA($p,q$) process is

• causal if $\Phi(z) \ne 0$ for all $|z| \le 1$;

• invertible if $\Theta(z) \ne 0$ for all $|z| \le 1$.

Theorem 10.1 tells us how to verify whether a given ARMA process is causal or invertible: all one needs to do is solve the two equations $\Phi(z) = 0$ and $\Theta(z) = 0$, of order $p$ and $q$, respectively.

For practical purposes, in particular to find the ACF of a causal invertible ARMA process, it will be useful to determine the coefficients $\psi_j$ and $\pi_j$ of the causal and invertible forms of the process. This can be done using an idea you've seen already:

If
$$X_t = \sum_{j\ge 0}\psi_j Z_{t-j},$$
then (19) becomes
$$\Phi(B)\sum_{j\ge 0}\psi_j Z_{t-j} = \Theta(B)Z_t.$$
Therefore, to find the coefficients $\psi_j$ in terms of the coefficients $\phi_j$ and $\theta_j$, we just need to match the coefficients in
$$\Big(1 - \sum_{i=1}^{p}\phi_i z^i\Big)\sum_{k\ge 0}\psi_k z^k = \sum_{j=0}^{q}\theta_j z^j,$$
which can be re-written more explicitly as
$$(1 - \phi_1 z - \phi_2 z^2 - \dots - \phi_p z^p)(\psi_0 + \psi_1 z + \psi_2 z^2 + \psi_3 z^3 + \cdots) = 1 + \theta_1 z + \theta_2 z^2 + \dots + \theta_q z^q.$$
This yields:
$$\psi_0 = 1,$$
$$\psi_1 - \phi_1\psi_0 = \theta_1 \Rightarrow \psi_1 = \theta_1 + \phi_1,$$
$$\psi_2 - \phi_1\psi_1 - \phi_2\psi_0 = \theta_2 \Rightarrow \psi_2 = \theta_2 + \phi_1\psi_1 + \phi_2\psi_0,$$
$$\cdots$$
$$\psi_j = \theta_j + \sum_{i=1}^{j}\phi_i\psi_{j-i},$$
with $\theta_j = 0$ for all $j > q$ and $\phi_i = 0$ for all $i > p$. In summary, we get the recursive formula:
$$\psi_j = \begin{cases} \theta_j + \sum_{i=1}^{j}\phi_i\psi_{j-i}, & j \le \min\{p,q\}, \\ \theta_j + \sum_{i=1}^{p}\phi_i\psi_{j-i}, & p < j \le q, \\ \sum_{i=1}^{j}\phi_i\psi_{j-i}, & q < j \le p, \\ \sum_{i=1}^{p}\phi_i\psi_{j-i}, & j > \max\{p,q\}. \end{cases}$$
These equations will allow us, whenever we are dealing with a causal ARMA process, to write it in its causal form: $X_t = \sum_{j\ge 0}\psi_j Z_{t-j}$, with $\psi_j$ as above.

We can use the same procedure to determine the coefficients $\pi_j$ for an invertible ARMA process: if
$$Z_t = \sum_{j\ge 0}\pi_j X_{t-j},$$
then (19) becomes
$$\Phi(B)X_t = \Theta(B)\sum_{j\ge 0}\pi_j X_{t-j},$$
so to find the coefficients $\pi_j$, we match the coefficients in
$$1 - \sum_{i=1}^{p}\phi_i z^i = \sum_{j=0}^{q}\theta_j z^j\sum_{k\ge 0}\pi_k z^k,$$
or, equivalently,
$$1 - \phi_1 z - \phi_2 z^2 - \dots - \phi_p z^p = (1 + \theta_1 z + \theta_2 z^2 + \dots + \theta_q z^q)(\pi_0 + \pi_1 z + \pi_2 z^2 + \pi_3 z^3 + \cdots).$$
This yields:
$$\pi_0 = 1,$$
$$\pi_1 + \theta_1\pi_0 = -\phi_1 \Rightarrow \pi_1 = -\phi_1 - \theta_1\pi_0,$$
$$\pi_2 + \theta_1\pi_1 + \theta_2\pi_0 = -\phi_2 \Rightarrow \pi_2 = -\phi_2 - \theta_1\pi_1 - \theta_2\pi_0,$$
$$\cdots$$
$$\pi_j = -\phi_j - \sum_{i=1}^{j}\theta_i\pi_{j-i},$$
with $\phi_j = 0$ for all $j > p$ and $\theta_i = 0$ for all $i > q$.


Math 4506 (Fall 2019) October 7, 2019 Prof. Christian Benes

Lecture #11: ACF of ARMA processes; First Statistical Steps

Reference. Sections 4.4 and 4.5 from the textbook.

11.1 ACF for causal ARMA processes

The causal representation of ARMA processes will make it relatively easy to compute the autocorrelation function (ACF) for some ARMA processes. As you already know, one can get the ACF $\rho$ from the autocovariance function (ACVF) $\gamma$ using the relationship $\rho(h) = \frac{\gamma(h)}{\gamma(0)}$.

As an example of this, we look at a few particular ARMA processes and derive their ACFs.

If an ARMA process is causal, we know that we can write $X_t = \sum_{j\ge 0}\psi_j Z_{t-j}$. Then, using the fact that $\{Z_t\}_{t\in\mathbb{Z}}$ is a sequence of uncorrelated random variables with variance $\sigma^2$, we get
$$\gamma(h) = \mathrm{Cov}(X_t, X_{t+h}) = \mathrm{Cov}\Big(\sum_{j\ge 0}\psi_j Z_{t-j}, \sum_{k\ge 0}\psi_k Z_{t+h-k}\Big) = \begin{cases}\sigma^2\sum_{j\ge 0}\psi_j\psi_{j+h}, & h \ge 0,\\ \sigma^2\sum_{i\ge 0}\psi_{i-h}\psi_i, & h \le 0,\end{cases} \;=\; \sigma^2\sum_{j\ge 0}\psi_j\psi_{j+|h|}.$$

Example 11.1. In this example, we derive the ACVF for causal ARMA(1,1) processes. Note that there is a slightly more detailed discussion of the ACF of ARMA(1,1) processes in Section 4.4 in the textbook. Make sure you read it.

We've seen that if $|\phi| < 1$, such a process can be written as the following MA($\infty$) process:
$$X_t = \psi(B)Z_t = Z_t + (\phi+\theta)\sum_{j\ge 1}\phi^{j-1}Z_{t-j}.$$
Therefore, the expression
$$\gamma(h) = \sigma^2\sum_{j\ge 0}\psi_j\psi_{j+|h|}$$
yields
$$\gamma(0) = \sigma^2\sum_{j\ge 0}\psi_j^2 = \sigma^2\Big(1 + \sum_{j\ge 1}\big((\phi+\theta)\phi^{j-1}\big)^2\Big) = \sigma^2\Big(1 + (\phi+\theta)^2\sum_{j\ge 0}\phi^{2j}\Big) = \sigma^2\Big(1 + \frac{(\phi+\theta)^2}{1-\phi^2}\Big),$$


$$\gamma(1) = \sigma^2\sum_{j\ge 0}\psi_j\psi_{j+1} = \sigma^2\Big((\phi+\theta) + (\phi+\theta)^2\sum_{j\ge 1}\phi^{2j-1}\Big) = \sigma^2\Big((\phi+\theta) + \frac{\phi(\phi+\theta)^2}{1-\phi^2}\Big),$$
and if $h \ge 2$,
$$\gamma(h) = \sigma^2\sum_{j\ge 0}\psi_j\psi_{j+h} = \sigma^2\Big((\phi+\theta)\phi^{h-1} + (\phi+\theta)^2\sum_{j\ge 1}\phi^{2j+h-2}\Big) = \sigma^2\Big((\phi+\theta)\phi^{h-1} + \frac{\phi^h(\phi+\theta)^2}{1-\phi^2}\Big) = \phi^{h-1}\gamma(1).$$

Example 11.2. Consider the ARMA process defined by
$$X_t + \tfrac{1}{2}X_{t-1} = Z_t - \tfrac{1}{4}Z_{t-1}, \quad Z_t \sim WN(0,\sigma^2).$$
Then the autoregressive polynomial for this process, $\Phi(z) = 1 + \frac{1}{2}z$, has one zero, namely $z = -2$. Since $|-2| > 1$, this ARMA process is causal. Using the fact that $\theta = -\frac{1}{4}$, $\phi = -\frac{1}{2}$, we get from Example 11.1
$$\gamma(0) = \sigma^2\Big(1 + \frac{9/16}{3/4}\Big) = \frac{7}{4}\sigma^2$$
and, for $h \ge 1$,
$$\gamma(h) = \sigma^2\Big(-\frac{1}{2}\Big)^{h-1}\Big(-\frac{3}{4} + \frac{(-1/2)(9/16)}{3/4}\Big) = \sigma^2\Big(-\frac{1}{2}\Big)^{h-1}\Big(-\frac{9}{8}\Big) = (-1)^h\,9\Big(\frac{1}{2}\Big)^{h+2}\sigma^2.$$
We therefore see that the ACF
$$\rho(h) = \frac{\gamma(h)}{\gamma(0)} = (-1)^h\,\frac{9}{7}\Big(\frac{1}{2}\Big)^h, \quad h \ge 1,$$
is alternating and decays exponentially.


11.2 The statistics point of view

In everything that follows, we assume that we are dealing with a stationary time series. We will hope that this assumption is satisfied by the residual sequence $Y_t$ of our time series model.

So far, we've looked at time series from a probability point of view, that is, we've developed time series models. Of course, our goal is to make sure our models depict reality appropriately, so our next task will be to examine the data and use it to determine which models are appropriate.

11.2.1 Basic Ideas

One of the big goals of statistics is to estimate population parameters. This is done by calculating statistics (or estimators), which are simply numbers computed from data, and using them as point estimates of the appropriate parameter.

If $\theta$ is a parameter in a distribution (such as the mean, the variance, or anything else), the standard notation for an estimator for $\theta$ is $\hat\theta$, and estimates are denoted by $\theta_e$. Note that the book uses a different (and worse) notation, but you should still be able to figure out from context what is meant.

So what's the difference between an estimate and an estimator? At first it may seem quite subtle. The key is to understand that before we observe data, we are dealing with random variables (we don't know yet which values they will take, so they are still random), while after we see the data, these random variables have crystallized into (non-random) real numbers.

• An estimator is a random variable based on the random variables for which one wishes to estimate something.

• An estimate is a number obtained after observing realizations of the random variables.

Typically, there are a number of reasonable estimators (e.g., the maximum likelihood estimator or the method of moments estimator) for a given parameter and there are several criteria according to which their value is judged. One of them is the notion of unbiasedness:

Definition 11.1. An estimator $\hat\theta$ for a parameter $\theta$ is unbiased if
$$E[\hat\theta] = \theta.$$

If you draw independent samples $y_1, y_2, \dots, y_n$ from the sequence of random variables $Y_1, \dots, Y_n$ with an unknown mean $\mu$ and unknown variance $\sigma^2$, then an unbiased estimator of $\mu$ is the sample mean
$$\hat\mu = \bar Y = \frac{Y_1 + \dots + Y_n}{n},$$
and a common unbiased estimator of $\sigma^2$ is the sample variance
$$\hat\sigma^2 = S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar Y)^2.$$
The corresponding estimates are:
$$\mu_e = \bar y = \frac{y_1 + \dots + y_n}{n} \quad \text{and} \quad \sigma^2_e = s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar y)^2.$$

Note 11.1. Though they are less natural, the following are also unbiased estimators for the mean:

• $\hat\mu_1 = Y_1$

• $\hat\mu_2 = \frac{Y_1+Y_2}{2}$

• $\hat\mu_3 = \frac{2^n}{2^n-1}\sum_{i=1}^{n}(1/2)^i Y_i$

11.2.2 Confidence Intervals

We now discuss another method for drawing conclusions on the processes from which data might originate. This method relies on confidence intervals, an object somewhat related to the idea of hypothesis testing. We describe this method via an example.

Suppose that $\{X_i\}_{i=1}^n \sim N(\mu, \sigma^2)$ are i.i.d. Then the sample mean (recall that it is an estimator, thus a random variable)
$$\bar X = \frac{1}{n}\sum_{i=1}^{n}X_i \sim N\Big(\mu, \frac{\sigma^2}{n}\Big).$$
[Recall that this can be shown either with moment generating functions or by computing the convolution of the p.d.f.'s. In either case, one just needs to deal with the case $n = 2$ and generalize the result by induction.]

In particular, knowing the distribution of $\bar X$ allows us to compute probabilities for it.

We already know how we would test the hypothesis that $\mu$ has some specific value. We now look at how to derive a confidence interval for $\mu$.

We know that if $\{X_i\}_{i=1}^n \sim N(\mu, \sigma^2)$ are independent, then
$$\bar X \sim N\Big(\mu, \frac{\sigma^2}{n}\Big),$$
which implies that
$$\frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim N(0,1),$$
so
$$P\Big(-z_{\alpha/2} \le \frac{\bar X - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\Big) = 1 - \alpha. \qquad (21)$$
In particular, as a numerical example, we have
$$P\Big(-1.96 \le \frac{\bar X - \mu}{\sigma/\sqrt{n}} \le 1.96\Big) = 0.95.$$
We can now play around with (21) and try to isolate $\mu$ to find an interval in which $\mu$ has a $1-\alpha$ chance of finding itself:
$$1 - \alpha = P\Big(-z_{\alpha/2} \le \frac{\bar X - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\Big) = P\big(-z_{\alpha/2}\sigma/\sqrt{n} \le \bar X - \mu \le z_{\alpha/2}\sigma/\sqrt{n}\big) = P\big(\bar X - z_{\alpha/2}\sigma/\sqrt{n} \le \mu \le \bar X + z_{\alpha/2}\sigma/\sqrt{n}\big).$$
Writing this in a slightly more elegant way gives
$$P\Big(\bar X - \frac{\sigma}{\sqrt{n}}z_{\alpha/2} \le \mu \le \bar X + \frac{\sigma}{\sqrt{n}}z_{\alpha/2}\Big) = 1 - \alpha. \qquad (22)$$
This means that the probability that $\mu$ finds itself in the interval
$$\Big(\bar X - \frac{\sigma}{\sqrt{n}}z_{\alpha/2},\ \bar X + \frac{\sigma}{\sqrt{n}}z_{\alpha/2}\Big)$$
is $1-\alpha$. That interval is therefore called a $(1-\alpha)$-confidence interval. It is a random interval which has a $(1-\alpha)$ chance of containing the true value $\mu$.
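A small R sketch of (22), assuming simulated data with $\mu = 2$, $\sigma = 3$ (treated as known) and $n = 100$; these numbers are only for illustration.

# A 95% confidence interval for mu with known sigma, on simulated N(mu, sigma^2) data.
set.seed(42)
mu <- 2; sigma <- 3; n <- 100
x <- rnorm(n, mean = mu, sd = sigma)
z <- qnorm(0.975)                                  # z_{alpha/2} for alpha = 0.05 (about 1.96)
c(mean(x) - z * sigma / sqrt(n), mean(x) + z * sigma / sqrt(n))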

11.2.3 The chi-square distribution

The chi-square distribution is one of the most common distributions in statistics. It appears in a number of very different tests. This distribution is so natural because it is an offspring of the normal distribution.

Definition 11.2. Suppose $Z_1, \dots, Z_n$ are i.i.d. $N(0,1)$ random variables. Then
$$X = \sum_{i=1}^{n}Z_i^2$$
is said to have the chi-square distribution with $n$ degrees of freedom. We write $X \sim \chi^2_n$.

Since $X$ is defined from independent normal random variables, one can fairly easily derive the pdf for $X \sim \chi^2_n$:
$$f(x) = \frac{(1/2)^{n/2}}{\Gamma(n/2)}x^{n/2-1}e^{-x/2}.$$


Note 11.2. Clearly, since the chi-square is a sum of squares, it is always positive.

Note 11.3. A quick look at the pdf above shows that

• If $X \sim \chi^2_2$, then $X \sim \mathrm{Exp}(1/2)$.

• Chi-square distributions are a particular case of gamma distributions: if $X \sim \chi^2_n$, then $X \sim \Gamma(n/2, 1/2)$.

Since we have a pdf for $X \sim \chi^2_n$, we can derive its moments. However, it will be easier to derive the mean and the variance using the expression $X = \sum_{i=1}^{n}Z_i^2$:
$$E[X] = E\Big[\sum_{i=1}^{n}Z_i^2\Big] = \sum_{i=1}^{n}E[Z_i^2] = n,$$
$$E[X^2] = E\Big[\Big(\sum_{i=1}^{n}Z_i^2\Big)^2\Big] = \sum_{i,j=1}^{n}E[Z_i^2 Z_j^2] = nE[Z_1^4] + (n^2-n)E[Z_1^2]E[Z_2^2] = nE[Z_1^4] + (n^2-n).$$
Since $Z_1 \sim N(0,1)$, we can use its moment generating function, $M_Z(t) = e^{t^2/2}$, to compute $E[Z_1^4]$:
$$E[Z_1^4] = M_Z^{(4)}(0) = 3.$$
Therefore,
$$E[X^2] = 3n + n^2 - n = n^2 + 2n,$$
implying that
$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = n^2 + 2n - n^2 = 2n.$$
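These two moments are easy to confirm by simulation; a quick sketch with the arbitrary choice of $n = 5$ degrees of freedom:

# E[X] = n and Var(X) = 2n for X ~ chi^2_n, checked by simulation with n = 5.
set.seed(7)
x <- rchisq(100000, df = 5)
c(mean(x), var(x))   # should be close to 5 and 10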


[Figures: the pdf of a chi-square random variable with 1, 2, 3, 4, and 5 degrees of freedom.]

11.3 Estimators for covariance, correlation

By analogy with the definition of the sample variance, it is probably not too difficult to believe that a reasonable estimator for the covariance of a random vector $\vec X = (X_1, \dots, X_m)'$, based on a sample coming from the $n$ independent random vectors $\vec X_1 = (X_{1,1}, \dots, X_{m,1})', \dots, \vec X_n = (X_{1,n}, \dots, X_{m,n})'$, would be the sample covariance matrix $Q = (Q_{i,j})_{1\le i,j\le m}$, where
$$Q_{i,j} = \frac{1}{n-1}\sum_{k=1}^{n}(X_{i,k} - \bar X_i)(X_{j,k} - \bar X_j),$$
where $\bar X_i = \frac{1}{n}\sum_{k=1}^{n}X_{i,k}$. The "$n-1$" is there because it makes each $Q_{i,j}$ unbiased, as you will show on a homework problem.

Note that in the setting of stationary time series, the covariance is a function of one parameter only, so we will define the sample covariance in a slightly different way.

Sample Autocovariance Function (ACVF)
$$\hat\gamma(h) := \frac{1}{n}\sum_{t=1}^{n-|h|}(X_{t+|h|} - \bar X)(X_t - \bar X), \quad -n < h < n.$$
Notice that $\hat\gamma(-h) = \hat\gamma(h)$.

Notice also that the sum is divided by $n$, not $n - |h| - 1$ as we might have expected by analogy with the definition of the sample covariance above. The main reason for this is that with this definition, the sample covariance matrix ends up being nonnegative definite (don't worry about why). Though we gain a little with this nice property, we lose a little by having an estimator which is not unbiased. However, it turns out that when we deal with large samples, this estimator will usually be close to unbiased.

Sample Autocovariance Matrix
The sample covariance matrix for stationary time series is simply the matrix of sample autocovariance functions given by
$$\hat\Gamma_n := (\hat\gamma(i-j))_{1\le i,j\le n}.$$
Note that $\hat\Gamma_n$ is nonnegative definite.

Sample Autocorrelation Function (ACF)
$$\hat\rho(h) := \frac{\hat\gamma(h)}{\hat\gamma(0)}, \quad -n < h < n.$$
A plot of $\rho_e(h)$ versus $h$ is called a sample correlogram or often just correlogram.

Sample Autocorrelation Matrix
The sample correlation matrix is simply the matrix of sample autocorrelation functions given by
$$\hat R_n := (\hat\rho(i-j))_{1\le i,j\le n}.$$
Notice that $\hat R_n$ is nonnegative definite, and that each diagonal entry of $\hat R_n$ is 1 since $\hat\rho(0) = 1$.
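The sketch below computes $\hat\gamma(h)$ directly from the definition above, on simulated data, and compares it with R's acf(..., type = "covariance"), which uses the same $1/n$ convention.

# Sample ACVF from the definition vs. R's acf(type = "covariance"), on simulated data.
set.seed(2019)
x <- rnorm(50)
n <- length(x); xbar <- mean(x)
gamma_hat <- sapply(0:10, function(h)
  sum((x[(1 + h):n] - xbar) * (x[1:(n - h)] - xbar)) / n)
rbind(by_hand = gamma_hat,
      acf_R   = drop(acf(x, lag.max = 10, type = "covariance", plot = FALSE)$acf))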


Math 4506 (Fall 2019) October 16, 2019 Prof. Christian Benes

Lecture #12: First Statistical Steps

Reference. Sections 3.2 and 3.6 from the textbook.

Before using the ideas developed in the last few lectures to define and analyze ARMA processes, let's take a step back to take a first look at time series from a statistical point of view (now that all of you have the tools needed to do so).

12.1 Estimators for covariance, correlation

Example 12.1. Suppose that the observed data set is $0, 4, 8, 4, 0, -4, 0, -4$. Viewing this as a "time series" means that $(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8) = (0, 4, 8, 4, 0, -4, 0, -4)$. The sample mean is therefore
$$\bar x = \frac{1}{8}\sum_{t=1}^{8}x_t = \frac{0+4+8+4+0-4+0-4}{8} = \frac{8}{8} = 1,$$
and the sample autocovariance function is
$$\gamma_e(h) := \frac{1}{8}\sum_{t=1}^{8-|h|}(x_{t+|h|} - 1)(x_t - 1), \quad -8 < h < 8.$$

Thus, we can easily compute that
$$\gamma_e(0) = \frac{1}{8}\sum_{t=1}^{8}(x_t-1)^2 = \frac{1}{8}\big[(-1)^2 + 3^2 + 7^2 + 3^2 + (-1)^2 + (-5)^2 + (-1)^2 + (-5)^2\big] = \frac{120}{8},$$
$$\gamma_e(1) = \gamma_e(-1) = \frac{1}{8}\sum_{t=1}^{7}(x_{t+1}-1)(x_t-1) = \frac{1}{8}\big[(3)(-1) + (7)(3) + (3)(7) + (-1)(3) + (-5)(-1) + (-1)(-5) + (-5)(-1)\big] = \frac{51}{8},$$
$$\gamma_e(2) = \gamma_e(-2) = \frac{1}{8}\sum_{t=1}^{6}(x_{t+2}-1)(x_t-1) = \frac{1}{8}\big[(7)(-1) + (3)(3) + (-1)(7) + (-5)(3) + (-1)(-1) + (-5)(-5)\big] = \frac{6}{8},$$
$$\gamma_e(3) = \gamma_e(-3) = \frac{1}{8}\sum_{t=1}^{5}(x_{t+3}-1)(x_t-1) = \frac{1}{8}\big[(3)(-1) + (-1)(3) + (-5)(7) + (-1)(3) + (-5)(-1)\big] = \frac{-39}{8},$$
$$\gamma_e(4) = \gamma_e(-4) = \frac{1}{8}\sum_{t=1}^{4}(x_{t+4}-1)(x_t-1) = \frac{1}{8}\big[(-1)(-1) + (-5)(3) + (-1)(7) + (-5)(3)\big] = \frac{-36}{8},$$
$$\gamma_e(5) = \gamma_e(-5) = \frac{1}{8}\sum_{t=1}^{3}(x_{t+5}-1)(x_t-1) = \frac{1}{8}\big[(-5)(-1) + (-1)(3) + (-5)(7)\big] = \frac{-33}{8},$$
$$\gamma_e(6) = \gamma_e(-6) = \frac{1}{8}\sum_{t=1}^{2}(x_{t+6}-1)(x_t-1) = \frac{1}{8}\big[(-1)(-1) + (-5)(3)\big] = \frac{-14}{8},$$
$$\gamma_e(7) = \gamma_e(-7) = \frac{1}{8}\sum_{t=1}^{1}(x_{t+7}-1)(x_t-1) = \frac{1}{8}\big[(-5)(-1)\big] = \frac{5}{8}.$$

The sample autocorrelation function is
$$\rho_e(h) := \frac{\gamma_e(h)}{\gamma_e(0)} = \frac{\gamma_e(h)}{120/8}, \quad -8 < h < 8,$$
so that
$$\rho_e(0) = 1, \quad \rho_e(1) = \rho_e(-1) = 51/120, \quad \rho_e(2) = \rho_e(-2) = 6/120, \quad \rho_e(3) = \rho_e(-3) = -39/120,$$
$$\rho_e(4) = \rho_e(-4) = -36/120, \quad \rho_e(5) = \rho_e(-5) = -33/120, \quad \rho_e(6) = \rho_e(-6) = -14/120, \quad \rho_e(7) = \rho_e(-7) = 5/120.$$

As for the sample covariance and correlation matrices, we have
$$\Gamma_{8,e} = (\gamma_e(i-j))_{1\le i,j\le 8} \quad \text{and} \quad R_{8,e} = (\rho_e(i-j))_{1\le i,j\le 8};$$
since $\gamma_e(-h) = \gamma_e(h)$, both are symmetric with constant diagonals. The first row of $\Gamma_{8,e}$ is
$$(120/8,\ 51/8,\ 6/8,\ -39/8,\ -36/8,\ -33/8,\ -14/8,\ 5/8),$$
and the first row of $R_{8,e}$ is
$$(1,\ 51/120,\ 6/120,\ -39/120,\ -36/120,\ -33/120,\ -14/120,\ 5/120).$$

Note that this example can be done in R in almost no time, since R will compute sample correlograms for you. Here's how: type

> x=c(0,4,8,4,0,-4,0,-4)

and

> acf(x)

This gives the following graph:

[Figure: correlogram (ACF plot) of Series x.]

To see the actual values of the sample autocorrelation function, not just the graph, type

> a=acf(x)
> a

This gives the following:

Autocorrelations of series 'x', by lag

    0     1     2      3      4      5      6     7
1.000 0.425 0.050 -0.325 -0.300 -0.275 -0.117 0.042

12.2 Test for the residual sequence

The simplest random sequence is white noise, as it has the simplest covariance structure. We will see here how to determine if the residual sequence in a time series could be modeled by white noise. The key idea here is that if $Y_1, \dots, Y_n$ are i.i.d. with finite variance, then $\hat\rho(1), \dots, \hat\rho(n-1)$ are approximately i.i.d. with distribution $N(0, 1/n)$. This fact is far from obvious, so feel free not to worry about why it is the case. Note that this approximation is good for small lags, but becomes bad for large lags.

Now suppose $X \sim N(0, 1/n)$. Then $\sqrt{n}X \sim N(0,1)$. Therefore,
$$P(-1.96/\sqrt{n} \le X \le 1.96/\sqrt{n}) = P(-1.96 \le \sqrt{n}X \le 1.96) \approx 0.95 = 95\%.$$
Of course, the same applies to our random variables $\hat\rho(i)$ above.

For any $i$, under the assumption that $\hat\rho(i) \sim N(0, 1/n)$, $\hat\rho(i)$ has a 95% chance of landing in that interval. In particular, if our assumption that $\hat\rho(i) \sim N(0, 1/n)$ and that the $\hat\rho(i)$ are independent is correct, roughly 95% of the $\hat\rho(i)$ should be in that interval.

This gives us a nice procedure for determining whether the random variables of our residual sequence $Y_1, \dots, Y_n$ could be i.i.d. with finite variance or not:

If much more than 5% of the sample autocorrelations land outside of the interval
$$(-1.96/\sqrt{n},\ 1.96/\sqrt{n}),$$
then there is no good reason to believe that $\hat\rho(1), \dots, \hat\rho(n-1)$ are approximately i.i.d. with distribution $N(0, 1/n)$ and therefore no good reason to believe that $Y_1, \dots, Y_n$ are i.i.d. with finite variance. In that case, we reject the hypothesis that $Y_1, \dots, Y_n$ are i.i.d. with finite variance. Although this is not formal (we don't yet have a systematic quantitative rule for rejecting this hypothesis), it at least suggests a simple way of checking whether a random sequence may be i.i.d. noise or not.

Let's look at this using a "controlled" example, that is, an example where we know what the true time series model is. The only way to know this is if we create the time series. We will generate a white noise Gaussian time series of 2000 time steps and look at its correlogram to see if it is as we expect. We will then reproduce the experiment with a random walk for which the increments are the values of the white noise Gaussian time series. This is done as follows: first generate 2000 independent standard normal random variables by typing


> w=rnorm(2000)

Generate a plot of the Gaussian white noise by typing

> plot(w,type="l")

This gives the following graph (if you reproduce this at home, you'll get a different graph):

[Figure: plot of the Gaussian white noise series w.]

Now we can obtain the sample correlogram for our data set w by typing

>acf(w)

This gives

[Figure: correlogram of Series w.]

To see the values of the autocorrelation function for small lags, type

> a=acf(w)
> a

If the time series were completely uncorrelated (which we know it is, as we generated it), we would expect about 95% of these values to be less than $1.96/\sqrt{n}$ in absolute value. To know what the value is in this case, type

> 1.96/sqrt(2000)
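A sketch completing this check: count the fraction of sample autocorrelations (say at lags 1 to 40) that fall outside the $\pm 1.96/\sqrt{n}$ bounds; for genuine white noise it should be around 5%.

# Fraction of sample autocorrelations of w outside the +/- 1.96/sqrt(n) bounds.
r <- acf(w, lag.max = 40, plot = FALSE)$acf[-1]   # sample ACF at lags 1,...,40
mean(abs(r) > 1.96 / sqrt(length(w)))             # expected to be roughly 0.05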


12.3 Some Examples of Correlograms

To understand a bit better what the correlogram tells us about the underlying process, let's look at a few additional examples of correlograms of time series which exhibit specific kinds of patterns. We will do this exercise numerically and you will analyze this question more carefully on the homework from a theoretical point of view.

Example 12.2. Consider random walk, which we can generate recursively this way:

> w=rnorm(2000)
> x=w
> for (t in 2:2000) x[t]=x[t-1]+w[t]

To see the picture, type

> plot(x,type="l")

[Figure: plot of the random walk x.]

The correlogram for the random walk is the following:

>acf(x)

[Figure: correlogram of Series x.]

Below, we examine the correlograms of two particularly nice time series, one of which is perfectly linear and the other perfectly cyclical.


Example 12.3. If for $t \ge 1$, $X_t = t$, then $X_t$ is a non-random time series. We can nonetheless compute the autocorrelation function for this time series. Let's do this with R, for instance for a time series of length 1000:

First, we generate the time series:

> X=c(1:1000)
> for (i in 1:1000) X[i]=i

If you want to check the numerical values of X, don't forget you can just type

> X

Now to see the correlogram, type

> acf(X)

This gives the following graph:

This gives the following graph:

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0

Lag

ACF

Series X

Note that the sample autocorrelations are all fairly close to 1. This is not surprising, since the values of the underlying time series are very strongly correlated at small lags. To see the actual values of the sample autocorrelations in the graph, type

> a=acf(X)

> a

Note also that R only gives you the first 30 values of the correlogram. This is because by default it only gives you the first 10 log10(n) values of the correlogram (where n is the length of the time series). This is partly because the sample correlations quickly become poor estimates at larger lags. To see all the sample correlations, type the following:

> a=acf(X,lag.max=999)

> a

This will produce the following picture, as well as the corresponding numerical values:


[Figure: sample ACF of X ("Series X") for lags 0 to 999; the values decrease steadily from 1 and become negative at large lags.]

It may seem surprising that there are negative correlations at large lags. This can be explained by the fact that correlations are computed relative to the mean of the data set ((n + 1)/2 here), and for large lags, typically one of the two values in each product lies above the mean and the other lies below it.

Exercise 12.10. Reproduce the steps above with a linear time series of length 10,000. Is the number of lags given by default by R what you expected it to be? How does the correlogram for the first 10 lags compare to the correlogram for the first 10 lags obtained above?

Note 12.1. On the homework, you will show that the phenomenon you observed in Exercise 12.10 is not accidental.

Example 12.4. We now do the same thing as above for a periodic time series. Noting that cos(2πx/n) has period n, we define the following periodic time series of period 10:

> X=c(1:200)

> for (i in 1:200) X[i]=cos(pi*i/5)

> acf(X)

This gives

[Figure: sample ACF of X ("Series X"), lags 0 to 20, oscillating between roughly −1 and 1.]

At first glance, the graph looks perfectly periodic with period 10. Typing


> a=acf(X)
> a

shows that this is not completely the case: the amplitude decreases. This is due to the fact that the sample autocovariance function

\[
\hat\gamma(h) := \frac{1}{n}\sum_{t=1}^{n-|h|}\bigl(X_{t+|h|}-\bar X\bigr)\bigl(X_t-\bar X\bigr)
\]

contains a sum with fewer and fewer terms as the lag h increases, but the multiplicative factor 1/n is always the same.

Note 12.2. You will show on the homework that for a periodic time series of integer period, the true autocorrelation function is actually perfectly periodic, unlike the sample autocorrelation function.


Math 4506 (Fall 2019) October 21, 2019
Prof. Christian Benes

Lecture #13: Testing if a Time Series Could Be White Noise; Inference for Mean; Additive Model

Reference. Sections 8.1, 3.2 from the textbook.

13.1 Tests for the estimated noise sequence

When trying to determine an appropriate decomposition (e.g. additive)

Xt = mt + st + Yt

for a time series, where the goal will be to find a description for m, s, and Y which is in tune with our observations, a good first thing to check is whether Yt is stationary with the simplest possible covariance structure, in other words, whether Yt is white noise.

Below, we develop a few tests (the more, the better) aimed at answering the same question. One of these relies on the chi-square distribution. Note that if you are interested in finding quantiles for the chi-square, one option is to go to

http://www.stat.tamu.edu/˜west/applets/chisqdemo.html

Alternatively, you can use the R command "qchisq(alpha,n)", which will give you the value of χ²_{α,n}.

13.1.1 The Turning Point Test

We already saw that we should be very skeptical of independence of a sequence of random variables if the signs of their realizations alternate with too much regularity. This idea is at the center of the turning point test.

Definition 13.1. Suppose y1, . . . , yn is a sequence of realizations of a time series. We say that there is a turning point at time i if yi = max{yi−1, yi, yi+1} or yi = min{yi−1, yi, yi+1}.

If Y1, . . . , Yn is a random sequence, we can define the random variable T to be the number of turning points.

We now assume that Y1, . . . , Yn is an i.i.d. sequence. Clearly, if 2 ≤ i ≤ n− 1,

P(i is a turning point) = P(Yi = max{Yi−1, Yi, Yi+1}) + P(Yi = min{Yi−1, Yi, Yi+1}) = 2/3.

Now suppose that for 2 ≤ i ≤ n− 1,

\[
I_i = \mathbf{1}\{i \text{ is a turning point}\} =
\begin{cases}
1 & \text{if } i \text{ is a turning point}\\
0 & \text{if } i \text{ is not a turning point.}
\end{cases}
\]


Then T = ∑_{i=2}^{n−1} Ii and E[Ii] = P(i is a turning point), so

\[
\mu_T = E[T] = E\Bigl[\sum_{i=2}^{n-1} I_i\Bigr] = \sum_{i=2}^{n-1} P(i \text{ is a turning point}) = (n-2)\,\frac{2}{3}.
\]

Similarly, one can show that

\[
\sigma_T^2 = \operatorname{Var}(T) = \frac{16n - 29}{90}.
\]

One of the many versions of the central limit theorem now implies that if n is large,

\[
T \overset{\text{approx}}{\sim} N\Bigl((n-2)\,\frac{2}{3},\; \frac{16n-29}{90}\Bigr).
\]

So we will reject the hypothesis at level α if

\[
\frac{|T - \mu_T|}{\sigma_T} > z_{\alpha/2}.
\]
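To make this concrete, here is a minimal R sketch of the turning point test as described above; the helper name turning.point.test and the use of a two-sided normal approximation for the p-value are my own choices for the illustration, not part of the lecture.

# count turning points and compare with the normal approximation above
turning.point.test <- function(y) {
  n <- length(y)
  # an index i (2 <= i <= n-1) is a turning point if y[i] is a local max or local min
  Tstat <- sum(sapply(2:(n - 1), function(i) {
    y[i] == max(y[(i - 1):(i + 1)]) || y[i] == min(y[(i - 1):(i + 1)])
  }))
  mu <- 2 * (n - 2) / 3
  sigma <- sqrt((16 * n - 29) / 90)
  z <- (Tstat - mu) / sigma
  list(T = Tstat, z = z, p.value = 2 * (1 - pnorm(abs(z))))
}
# Example: for white noise the test should usually not reject
# turning.point.test(rnorm(200))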

13.1.2 The Portmanteau Tests

Recall from Lecture 12 that if Y1, . . . , Yn are i.i.d. with finite variance, then ρ̂(1), . . . , ρ̂(n−1) are approximately i.i.d. with distribution N(0, 1/n). Therefore, if Y1, . . . , Yn are i.i.d. with finite variance, then √n ρ̂(1), . . . , √n ρ̂(n−1) are approximately i.i.d. with distribution N(0, 1),

so the sum of their squares is approximately a chi-square. More precisely, if 1 ≤ k ≤ n − 1,

\[
\sum_{i=1}^{k}\bigl(\sqrt{n}\,\hat\rho(i)\bigr)^2 \overset{\text{approx}}{\sim} \chi^2_k,
\]

that is,

\[
Q = n\sum_{i=1}^{k}\hat\rho(i)^2 \overset{\text{approx}}{\sim} \chi^2_k. \tag{23}
\]

This gives us a number of different options for which test statistic to use (one for each 1 ≤ k ≤ n − 1). So which k do we choose? Typically, k = log10 n is a choice that ensures that the approximation is good.

The statistic Q is called the Box-Pierce statistic. Based on (23), we should reject the hypothesis that Y1, . . . , Yn are i.i.d. with finite variance if Q falls within an unlikely region for a χ²_k random variable. We will reject the hypothesis at level α if q, the realization of Q, satisfies

q > χ²_{1−α,k},


where χ²_{1−α,k} is such that if X ∼ χ²_k, then P(X > χ²_{1−α,k}) = α.

There is a variant (really an improvement) of the Box-Pierce statistic, called the Ljung-Box statistic:

\[
Q = n(n+2)\sum_{i=1}^{k}\frac{\hat\rho(i)^2}{n-i} \overset{\text{approx}}{\sim} \chi^2_k.
\]

As this statistic is better than the Box-Pierce statistic, you should use the Ljung-Box statistic rather than the Box-Pierce statistic. The reason I mention the Box-Pierce statistic is that it is easier to see why its distribution might be well approximated by the chi-square distribution. In R, the command for both tests is Box.test. The optional argument type="L" tells R to use the Ljung-Box statistic, while type="B" tells R to use the Box-Pierce statistic.

Example 13.1. In this example, we will generate two time series. Z will be white noise and X will be an AR(1) process. For both time series, we will test via both the Box-Pierce and the Ljung-Box test whether they could be white noise. (Of course, since we generated the data, we know the answer already.)

> Z=rnorm(100)

> X=Z

> for (i in 2:100) X[i]=X[i-1]/2+Z[i]/2

> Box.test(Z,lag=2,type="L")

Box-Ljung test

data: Z

X-squared = 0.033, df = 2, p-value = 0.9836

> Box.test(Z,lag=2,type="B")

Box-Pierce test

data: Z

X-squared = 0.0318, df = 2, p-value = 0.9842

> Box.test(X,lag=2,type="L")

Box-Ljung test

data: X

X-squared = 25.3737, df = 2, p-value = 3.091e-06

> Box.test(X,lag=2,type="B")

Box-Pierce test

data: X

X-squared = 24.6017, df = 2, p-value = 4.548e-06


We see that in both cases, both tests do exactly what one would hope, i.e., they reject (overwhelmingly) the white noise hypothesis for X and fail to reject (by a wide margin) the white noise hypothesis for Z. Try this at home with different values of n and different time series X that aren't white noise.

13.2 Inference for µ

Even in the case of dependent variables X1, . . . , Xn, the sample mean X̄ is a natural estimator for µ. However, since we are not assuming here that the Xt are independent, we can't just claim that X̄ ∼ N(µ, σ²/n).

It is still true that

\[
E[\bar X_n] = E\Bigl[\frac{1}{n}\sum_{i=1}^{n} X_i\Bigr] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\,n\mu = \mu.
\]

However,

\[
\begin{aligned}
\operatorname{Var}(\bar X_n) &= \operatorname{Var}\Bigl(\frac{1}{n}\sum_{i=1}^{n} X_i\Bigr) = \frac{1}{n^2}\operatorname{Var}\Bigl(\sum_{i=1}^{n} X_i\Bigr) = \frac{1}{n^2}\Biggl(E\Bigl[\Bigl(\sum_{i=1}^{n} X_i\Bigr)^2\Bigr] - E\Bigl[\sum_{i=1}^{n} X_i\Bigr]^2\Biggr)\\
&= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\bigl(E[X_iX_j] - E[X_i]E[X_j]\bigr) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\operatorname{Cov}(X_i,X_j)\\
&= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\gamma(i-j) = \frac{1}{n^2}\sum_{h=-n+1}^{n-1}(n-|h|)\gamma(h) = \frac{1}{n}\sum_{h=-n+1}^{n-1}\Bigl(1-\frac{|h|}{n}\Bigr)\gamma(h).
\end{aligned}
\]

Now any nonsingular linear transformation of a multivariate Gaussian vector is multivariate Gaussian too, so since

\[
\begin{pmatrix}
\frac{1}{n} & \frac{1}{n} & \frac{1}{n} & \cdots & \frac{1}{n} & \frac{1}{n}\\
0 & 1 & 0 & \cdots & 0 & 0\\
0 & 0 & 1 & \cdots & 0 & 0\\
\vdots & \vdots & \vdots & & \vdots & \vdots\\
0 & 0 & 0 & \cdots & 1 & 0\\
0 & 0 & 0 & \cdots & 0 & 1
\end{pmatrix}
\begin{pmatrix}
X_1\\ X_2\\ \vdots\\ X_{n-1}\\ X_n
\end{pmatrix}
=
\begin{pmatrix}
\bar X_n\\ X_2\\ \vdots\\ X_{n-1}\\ X_n
\end{pmatrix},
\]

the vector

\[
\begin{pmatrix}
\bar X_n\\ X_2\\ \vdots\\ X_{n-1}\\ X_n
\end{pmatrix}
\]


is multivariate normal, which means that its marginals are normal, so that X̄n has a normal distribution. Since we know its mean and variance, we know everything about it:

\[
\bar X_n \sim N\Biggl(\mu,\; \frac{1}{n}\sum_{h=-n+1}^{n-1}\Bigl(1-\frac{|h|}{n}\Bigr)\gamma(h)\Biggr).
\]

Equivalently,

\[
\frac{\sqrt{n}\,(\bar X_n - \mu)}{\sqrt{v}} \sim N(0,1), \qquad \text{where } v = \sum_{h=-n+1}^{n-1}\Bigl(1-\frac{|h|}{n}\Bigr)\gamma(h).
\]

So we know everything about X̄n for a Gaussian time series. But what if X is not Gaussian? Then we turn to the usual trick, namely, the central limit theorem, which tells us that a large sum of random variables is close to being Gaussian. In that case, we get the following:

\[
\frac{\sqrt{n}\,(\bar X_n - \mu)}{\sqrt{v}} \overset{\text{approx.}}{\sim} N(0,1),
\]

where v is as above. This gives the following 100(1−α)% confidence interval (or approximate confidence interval if Xt is not Gaussian) for µ:

\[
\Bigl(\bar X - \frac{\sqrt{v}}{\sqrt{n}}\,z_{\alpha/2},\; \bar X + \frac{\sqrt{v}}{\sqrt{n}}\,z_{\alpha/2}\Bigr).
\]

Note 13.1. Since we usually don't know v, we have to estimate it too. A natural estimator is

\[
\hat v = \sum_{h=-n}^{n}\Bigl(1-\frac{|h|}{n}\Bigr)\hat\gamma(h).
\]


Math 4506 (Fall 2019) October 23, 2019
Prof. Christian Benes

Lecture #14: Trend and Seasonal Variation

Reference. Section 3.3 from the textbook.

14.1 The Additive Model

Recall the example we briefly looked at in Lecture 3, where we had a data set composed of the number of monthly aircraft miles (in millions) flown by U.S. airlines between 1963 and 1970. The graph for this data set was the following:

[Figure: monthly aircraft miles series Air.ts, 1963-1970, with values ranging from about 6000 to 16000.]

Given a data set such as the one above, how can we construct a model for it? The idea will be to decompose random data into three distinct components:

• A trend component mt (increase of populations, increase in global temperature, etc.)

• A seasonal component st (describing cyclical phenomena such as annual temperature patterns, etc.)

• A random noise component Yt describing the non-deterministic aspect of the time series. Note that the book uses zt for this component. In the notes, I'll write Yt, as the letter z usually suggests a normal distribution, which may not be the actual underlying distribution of the random noise component.


A common model is the so-called additive model, that is, one where we try to find mt, st, Yt such that a given time series can be expressed as

Xt = mt + st + Yt.

We will never know what mt, st, and Yt actually are, but we can estimate them. The estimates will be called m̂t, ŝt, and ŷt. Note that we'll use the same notation for estimates and estimators in this case. Once we see the data, our estimates have to satisfy

xt = m̂t + ŝt + ŷt,

where m̂t is an estimate for mt, ŝt is an estimate for st, and ŷt is an estimate for Yt.

The corresponding data set can be found at

http://robjhyndman.com/tsdldata/data/kendall3.dat

and looks like this:

       Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
1963  6827  6178  7084  8162  8462  9644 10466 10748  9963  8194  6848  7027
1964  7269  6775  7819  8371  9069 10248 11030 10882 10333  9109  7685  7602
1965  8350  7829  8829  9948 10638 11253 11424 11391 10665  9396  7775  7933
1966  8186  7444  8484  9864 10252 12282 11637 11577 12417  9637  8094  9280
1967  8334  7899  9994 10078 10801 12950 12222 12246 13281 10366  8730  9614
1968  8639  8772 10894 10455 11179 10588 10794 12770 13812 10857  9290 10925
1969  9491  8919 11607  8852 12537 14759 13667 13731 15110 12185 10645 12161
1970 10840 10436 13589 13402 13103 14933 14147 14057 16234 12389 11595 12772

In fact, this is not exactly the form in which the data set is found on that website. There, it doesn't have any labels. As it turns out, it is quite straightforward to include those labels with R.

Let’s look at the graph above. Two patterns are striking. There appears to be

• an increasing pattern

• a clear cyclical pattern with some apparently fixed period

14.1.1 The Trend

There are a number of methods available to analyze the trend. We will see here how the function "decompose" in R estimates the trend and will discuss some refinements of this later in the semester.

For a given time series Xt, one natural way of estimating the trend mt is to assume that it is influenced by factors at a number of times around t, so we can let m̂t be a moving average


of values of Xt around time t. In general, if the time series {Xt}_{1≤t≤N} consists of N data points, we can, for some arbitrary a, define, for a + 1 ≤ t ≤ N − a,

\[
\hat m_t = \frac{1}{1+2a}\sum_{k=-a}^{a} X_{t+k}.
\]

Alternatively, if there is good reason to take an average with values weighted by 1/(2a), as in the case where there is good reason to think the natural period of the time series is 2a, we can define instead

\[
\hat m_t = \frac{1}{2a}\Bigl(\tfrac{1}{2}X_{t-a} + \sum_{k=-a+1}^{a-1} X_{t+k} + \tfrac{1}{2}X_{t+a}\Bigr),
\]

so that the sum of the weights equals 1 as well.

In the time series above, since the period is 12, we would define, for 7 ≤ t ≤ 90,

\[
\hat m_t = \frac{1}{12}\Bigl(\tfrac{1}{2}X_{t-6} + \sum_{k=-5}^{5} X_{t+k} + \tfrac{1}{2}X_{t+6}\Bigr).
\]

Note that this process means that the trend estimate is undefined for 1 ≤ t ≤ 6 and 91 ≤ t ≤ 96.
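Here is a minimal R sketch of this period-12 moving average, using stats::filter; the helper name ma12 is just for illustration, and the weights are exactly the ones in the formula above.

ma12 <- function(x) {
  w <- c(0.5, rep(1, 11), 0.5) / 12       # half weights at the two ends, as in the formula
  stats::filter(x, filter = w, sides = 2) # centered average; NA for the first and last 6 times
}
# trend.hat <- ma12(Air.ts)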

14.1.2 Seasonal Variation

In cyclical data, numerical values start repeating themselves with each new cycle. For example, 1, 3, 2, 1, 1, 3, 2, 1, 1, 3, 2, 1, 1, 3, 2, 1, . . . is a cyclical data set with period 4. The cycles have length 4.

If we have already found a model for the trend of a time series, we define

ât = Xt − m̂t,

for which we now wish to estimate a seasonal component (assuming this makes sense; in the example above, it certainly does) and later a random component.

If there is a true cyclical component in a time series, its values have to repeat themselves with every period. Since the values of the time series ât are unlikely to be exactly cyclical, we have to estimate the actual values of this cyclical component. This is done by averaging the values of ât corresponding to the same position in the cycle. For instance, in the example above, the values of ât for each given month will be averaged. More precisely, we define, for t = 7, . . . , 18,

\[
\hat c_t = \hat c_{t+12} = \cdots = \hat c_{t+72} = \frac{1}{7}\sum_{i=0}^{6}\hat a_{t+12i}.
\]

Since in our example ât is only defined for t = 7, . . . , 90 (since m̂t is only defined for those t), we define by extension ĉt for t = 1, . . . , 6 by ĉt = ĉt+12, and for t = 91, . . . , 96 by ĉt = ĉt−12, so


that ĉt is defined for all t = 1, . . . , 96. Why did we call this object ĉt rather than ŝt? Because we need to do one more thing to get ŝt. The time series ĉt is cyclical, but we will transform it into a series for which the mean is 0. This is achieved by defining, for t = 1, . . . , 96,

\[
\hat s_t = \hat c_t - \frac{1}{12}\sum_{i=7}^{18}\hat c_i.
\]

Exercise 14.11. Show that the mean of the values {ŝt}_{t∈{1,...,96}} is 0.

Note again that everything we are doing here is based on the fact that a natural cycle of our time series has length 12, but all the steps can be reproduced for time series of any period.
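The following R sketch carries out the averaging and recentering described above for a monthly series, assuming the trend estimate comes from the moving average sketched earlier; the helper name seasonal.hat and the month-by-month grouping are illustrative choices (the recipe in the text indexes t = 7, . . . , 18, which amounts to the same monthly averages).

seasonal.hat <- function(x, m.hat, period = 12) {
  a <- x - m.hat                                        # detrended series, NA at the edges
  # average the detrended values position-by-position in the cycle, ignoring the NAs
  c.hat <- tapply(a, (seq_along(a) - 1) %% period, mean, na.rm = TRUE)
  s.hat <- c.hat - mean(c.hat)                          # recenter so the seasonal effects average to 0
  rep(s.hat, length.out = length(x))                    # repeat the cycle over the whole series
}
# s.hat <- seasonal.hat(as.numeric(Air.ts), as.numeric(ma12(Air.ts)))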

14.1.3 Random Component

The estimate for the random component is just

ŷt = xt − m̂t − ŝt.

This random component is our main focus in this course, but let's see how to obtain it from any time series for which the additive model would be a good fit.

14.1.4 Decomposing a Time Series with R

First, in order for R to be able to do any time series analysis with your data, it must know that it is dealing with a time series. You will need to use a command to transform your data set into a time series.

First, load the data set from the web by typing

>www="http://robjhyndman.com/tsdldata/data/kendall3.dat"

and create a data file by typing

>Air=scan(www)

I chose the name "Air", but you can of course call the data set what you'd like. Then, transform that data set into a time series. Since the years go from 1963 to 1970 and, for each year, the months go from 1 to 12, it makes sense to put the data into an 8-by-12 array. This is done as follows:

>Air.ts=ts(Air,start=c(1963,1),end=c(1970,12),fr=12)

The command "fr=12" tells R that your time series follows a natural cycle that has period 12. R will automatically deduce from this information that your data is measured on a monthly basis and yield the table above. But the fact that you've created an object of the time series class (conventionally given a name ending in ".ts") will allow R to do much more, in fact most things one might want to do with time series and which R can't do with objects that aren't time series.

Typing


>decompose(Air.ts)

will show all the values of the time series m̂t, ŝt, and ŷt (which are the estimates for mt, st, and Yt). If you type

>plot(decompose(Air.ts))

you will see the graphs of xt, m̂t, ŝt, and ŷt:

[Figure: "Decomposition of additive time series" -- panels showing the observed, trend, seasonal, and random components over 1964-1970.]

Note that if you wish to analyze the seasonal component, trend, or random component separately, you can use the following commands:

> D=decompose(Air.ts)

> S=D$seasonal

> T=D$trend[7:90]

> R=D$random[7:90]

This gives you the data arrays S, T, and R containing all the values of the seasonal component, trend, and random component. Note that since T and R have no values for the first 6 and last 6 times, we had to ask the software to disregard them when creating T and R.


14.2 The Least Squares Method

Not all time series can be modeled by stationary processes. One might hope, however, that a time series Xt could be expressed as follows:

Xt = f(t) + Yt,

where f(t) is a deterministic function and Yt is stationary. The least squares method can be of use when trying to extract a function f(t) which might describe a trend and a seasonal component from our time series.

Here is a quick reminder of the basic principles of the least squares method.

The basic idea is as follows. Suppose that we decide (subjectively) that the best model for the evolution of a measurable quantity over time is a linear function of the form f(t) = β0 + β1 t. Not all straight lines will seem to be equally good models once we see the data. In particular, a model will not be very good if all data points lie above or below the line given by the model.

The least squares method tries to minimize the sum of the squares of the differences between the data values and the values predicted by the model. More specifically, if the observed data consist of the n points (t1, y1), . . . , (tn, yn), the least squares method finds the parameters β0 and β1 so as to minimize

\[
\sum_{i=1}^{n}\bigl(y_i - f(t_i)\bigr)^2.
\]

If

Yt = β0 + β1 t + Xt,

where Xt is stationary (which means that Yt is the sum of a stationary process and a linear function), the estimators for β0 and β1 that yield the least squares estimates are as follows:

\[
\hat\beta_1 = \frac{\frac{1}{n}\sum_{t=1}^{n} t\,Y_t - \bar t\,\bar Y}{\frac{1}{n}\sum_{t=1}^{n} t^2 - \bar t^{\,2}}, \qquad
\hat\beta_0 = \bar Y - \hat\beta_1\,\bar t = \bar Y - \bar t\;\frac{\frac{1}{n}\sum_{t=1}^{n} t\,Y_t - \bar t\,\bar Y}{\frac{1}{n}\sum_{t=1}^{n} t^2 - \bar t^{\,2}}.
\]

The same idea works for any deterministic function, not just linear ones, which can be defined in terms of any number of parameters (for instance polynomials, exponential functions, etc.).
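As a quick numerical check of the formulas above, the following R sketch computes β̂0 and β̂1 directly and compares them with lm(); the simulated series y and the sample size are arbitrary choices made for the illustration.

t <- 1:100
y <- 2 + 0.5 * t + rnorm(100)      # simulated linear trend plus noise
# closed-form least squares estimates from the formulas above
beta1.hat <- (mean(t * y) - mean(t) * mean(y)) / (mean(t^2) - mean(t)^2)
beta0.hat <- mean(y) - beta1.hat * mean(t)
c(beta0.hat, beta1.hat)
coef(lm(y ~ t))                    # should agree up to rounding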

Example 14.1. We now see how R does least squares regression for us by looking at the daily closing prices of Hewlett-Packard stock for 672 trading days up to June 7, 2007. The data can be obtained as follows:

> www="http://www.maths.adelaide.edu.au/andrew.metcalfe/Data/HP.txt"

> HP.dat=read.table(www,header=T);attach(HP.dat)

> plot(Price,type="l")

This gives:


[Figure: daily closing price of Hewlett-Packard stock (Price), 672 trading days, values roughly between 20 and 45.]

We will perform a linear regression on the data set, using the command

> HP.lm=lm(Price~time(Price))

The summary is then

> HP.lm

Call:

lm(formula = Price ~ time(Price))

Coefficients:

(Intercept) time(Price)

17.2333 0.0398

This means that the least squares line is

f(t) = 17.2333 + 0.0398t.

Note that you can obtain confidence intervals for the parameters as follows:

> confint(HP.lm)

2.5 % 97.5 %

(Intercept) 17.00349739 17.46311534

time(Price) 0.03921052 0.04039385

Now let's see if, as we might hope, Yt = Xt − f(t) can be described by an ARMA model, by examining the ACF of the residuals:

> acf(resid(HP.lm))


[Figure: sample ACF of resid(HP.lm), lags 0 to 25; the correlations decay slowly.]

The slowly decaying ACF suggests that we may not be in the presence of a time series that could be well modeled by an ARMA process, so we will need to find a way to deal with non-stationary time series.


Math 4506 (Fall 2019) October 28, 2019
Prof. Christian Benes

Lecture #15: Pre-Midterm Q&A


Math 4506 (Fall 2019) October 30, 2019
Prof. Christian Benes

Lecture #16: Midterm


Math 4506 (Fall 2019) November 4, 2019
Prof. Christian Benes

Lecture #17: Differencing

Reference. Section 5.1 from the textbook.

17.1 Differencing

In time series analysis, one goal is to reduce a given time series model Xt to a stationary time series, whenever possible. An important observation we have made already is that taking differences of a nonstationary time series can yield a stationary time series, as is the case with random walk, for which taking differences yields white noise.

We will need the following definition, part of which we’ve seen already:

Definition 17.1. We define the backwards shift operator B by

BXt = Xt−1.

For j ≥ 2, we define the operator B^j by

B^j Xt = B(B^{j−1} Xt) = Xt−j.

The order-1 difference operator ∇ is defined by

∇Xt = Xt − Xt−1.

For j ≥ 2, the order-j difference operator ∇^j is defined by

∇^j Xt = ∇(∇^{j−1} Xt).

Note 17.1. Conveniently, operations on the operator B follow the same rules as polynomials. For instance,

∇^2 Xt = ∇(∇Xt) = ∇(Xt − Xt−1) = (Xt − Xt−1) − (Xt−1 − Xt−2) = Xt − 2Xt−1 + Xt−2

and

∇^3 Xt = (1 − B)(1 − B)(1 − B)Xt = (1 − B)^3 Xt = (1 − 3B + 3B^2 − B^3)Xt = Xt − 3Xt−1 + 3Xt−2 − Xt−3.

The first important thing to note is that if the time series Xt is stationary, then so is ∇Xt. This is a direct consequence of Proposition 7.2. So we certainly don't lose stationarity by differencing. However, as we saw in the random walk example, differencing can transform a nonstationary time series into one that is stationary.

Assume now that our time series model can reasonably be written in the form

Xt = mt + st + Yt,


where mt is a trend, st a seasonal component, and Yt a random noise component (which may or may not be stationary). We saw in the last lecture how m and s can be estimated, leaving us with the random process Yt. If Yt is not stationary, we can try taking differences until we have a time series that is stationary. However, we can also apply this process to Xt before estimating m and s. The main idea of this method comes from calculus: for many functions, taking derivatives "flattens" the function.

• What do I mean by this? Take for instance f(x) = x^2. For large x, the slope of the tangent to the graph is very steep. However, the slope of the tangent to the graph of f′(x) is the same everywhere (it's 2). Moreover, f′′(x) is a constant. Clearly, the same thing works for any polynomial. Since many functions look locally like a polynomial, one can also hope for this idea to work for a larger class of functions.

• Why is this useful? A stationary series has constant mean, so transforming the trend m into a constant is necessary if we wish to transform our time series into a stationary time series. We will also see a slight modification of this idea that gets rid of seasonal components.

17.1.1 Differencing when there is no seasonal component

Suppose that our time series model is

Xt = mt + Yt.

Let's see how this method works on time series where m is a polynomial. Suppose mt = a0 + a1 t + · · · + an t^n and Xt = mt + Yt. You will show in the homework that

∇^n Xt = cn + ∇^n Yt.

Since ∇^n Yt is a stationary sequence with mean 0, ∇^n Xt is stationary with constant mean cn.

This suggests that we may try to apply successive difference operators to a given time series until it is stationary (that is, until the trend has been removed). Note that this method is considerably less sophisticated than the method where we remove the trend using least squares.

17.1.2 Differencing when there is a seasonal component

Suppose now that our time series model is

Xt = mt + st + Yt.

Since we know that taking successive derivatives of (e.g.) sin(x) only sends us back and forth between sin and cos functions, it should be clear that if the goal is to get rid of the


seasonal term, the lag-1 differencing operator won't take us far. However, if we know that the seasonality period is d, we can try to use a differencing operator taking into account the period:

Definition 17.2. The lag-d differencing operator ∇_d is defined by

∇_d Xt = Xt − Xt−d = (1 − B^d)Xt.

If Xt = mt + st + Yt and st is d-periodic, which means that st+d = st, we get

∇_d Xt = mt − mt−d + st − st−d + Yt − Yt−d = mt − mt−d + Yt − Yt−d,

the sum of a trend term and a noise term. Now we're back in the case of the last subsection, which we already know how to handle.
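In R, both kinds of differencing can be done with the diff command, as in the following minimal sketch (the use of Air.ts from Lecture 14 is just for illustration):

d1  <- diff(Air.ts)                    # order-1 difference, (1 - B)X_t
d2  <- diff(Air.ts, differences = 2)   # order-2 difference, (1 - B)^2 X_t
d12 <- diff(Air.ts, lag = 12)          # lag-12 difference, (1 - B^12)X_t, removes a 12-periodic component
acf(diff(d12))                         # differencing once more targets what is left of the trend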


Math 4506 (Fall 2019) November 6, 2019
Prof. Christian Benes

Lecture #18: Differencing and ARIMA Models; Logarithmic Transformations

Reference. Sections 5.2 and 5.4 from the textbook.

18.1 Differencing

18.1.1 A Model

Recall the example we looked at at the end of Lecture 14, examining the daily closing prices of Hewlett-Packard stock for 672 trading days up to June 7, 2007. We now have a way of going a bit further in that example.

Example 18.1.

[Figure: daily closing price of Hewlett-Packard stock (Price), as in Lecture 14.]

After extraction of a linear trend, we ended up with the following sample ACF for the residuals.


[Figure: sample ACF of resid(HP.lm), lags 0 to 25, decaying slowly.]

The slowly decaying sample ACF suggests that we may not be in the presence of a stationary time series, so let's try differencing it:

> Diff=diff(resid(HP.lm))

> plot(Diff,type="l")

[Figure: time series plot of Diff, the differenced residuals.]

> acf(Diff)


[Figure: sample ACF of Diff, lags 0 to 25.]

The ACF of the differenced series suggests that we may be in the presence of white noise. Let's find its mean and variance:

> mean(Diff)

[1] 6.368846e-05

> var(Diff)

[1] 0.2112592

Since the differenced Yt can be modeled by white noise, Yt can be modeled by a random walk. So a reasonable model would be Xt = 17.233 + 0.0398t + St, where St is a random walk constructed by adding normal random variables with mean 0 and variance 0.2112592, that is,

\[
X_t = 17.233 + 0.0398\,t + \sum_{i=1}^{t} U_i,
\]

where the Ui ∼ N(0, 0.21126) are independent.

18.2 ARIMA Processes

We can now use the ideas developed above to define a new class of processes which are not necessarily stationary. ARIMA (auto-regressive integrated moving average) processes are an extension of ARMA processes. In short, a process is an ARIMA(p, d, q) process if differencing it d times gives an ARMA(p, q) process. In particular, if d = 0, an ARIMA process is an ARMA process. More precisely, an ARIMA(p, 0, q) process is the same thing as a causal ARMA(p, q) process. Formally,

Definition 18.1. For d ∈ N ∪ {0}, a process Xt is an ARIMA(p, d, q) process if Yt := ∇^d Xt = (1 − B)^d Xt is an ARMA(p, q) process.
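For example, the following R sketch simulates an ARIMA(1,1,1) process with arima.sim and checks that a single difference produces a series whose sample ACF behaves like that of a stationary ARMA process; the particular coefficients 0.5 and 0.4 are arbitrary choices for the illustration.

set.seed(1)
y <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.5, ma = 0.4), n = 300)
plot(y, type = "l")   # nonstationary-looking path
acf(y)                # slowly decaying sample ACF
acf(diff(y))          # after one difference, the sample ACF decays quickly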


18.3 ARIMA(p, 1, q) Processes

If Yt is an ARIMA process, we generally assume that Yt is observed for times t ≥ 1 but that time is indexed starting at some negative time −m, and so one assumes that Yt = 0 if t < −m. Then, by definition of an ARIMA(p, 1, q) process, the process Wt defined for t ≥ −m + 1 by Wt = Yt − Yt−1 is an ARMA(p, q) process, so that for t ≥ −m + 1 + p,

\[
W_t = \sum_{i=1}^{p}\phi_i W_{t-i} + \sum_{i=0}^{q}\theta_i Z_{t-i},
\]

with θ0 = 1.

Moreover, note that

\[
Y_t = (Y_t - Y_{t-1}) + (Y_{t-1} - Y_{t-2}) + (Y_{t-2} - Y_{t-3}) + \cdots + (Y_{-m+1} - Y_{-m}) + (Y_{-m} - Y_{-m-1})
= \sum_{i=1}^{t+m+1}(Y_{t+1-i} - Y_{t-i}) = \sum_{i=1}^{t+m+1} W_{t+1-i} = \sum_{i=-m}^{t} W_i. \tag{24}
\]

In particular, this allows us to understand the ARIMA(0,1,1) model, also called the IMA(1,1) model (since there is no autoregressive part), where

Wt = Zt + θZt−1

with |θ| < 1. Using (24) gives

\[
Y_t = \sum_{i=-m}^{t} W_i = \sum_{i=-m}^{t}(Z_i + \theta Z_{i-1}) = Z_t + (1+\theta)\sum_{i=1}^{m+t} Z_{t-i} + \theta Z_{-m-1}. \tag{25}
\]

Note that as t increases, the number of terms in this sum increases, which suggests (correctly) that the sum doesn't represent a stationary time series. Equation (25) allows us to compute the covariance and correlation of an IMA(1,1) process: for h > 0,

\[
\operatorname{Cov}(Y_t, Y_{t-h}) = \operatorname{Cov}\Bigl(Z_t + (1+\theta)\sum_{i=1}^{m+t} Z_{t-i} + \theta Z_{-m-1},\; Z_{t-h} + (1+\theta)\sum_{i=1}^{m+t-h} Z_{t-h-i} + \theta Z_{-m-1}\Bigr)
= \sigma^2\bigl(1 + \theta + (1+\theta)^2(m+t-h) + \theta^2\bigr).
\]

In particular, since

\[
\operatorname{Var}(Y_t) = \sigma^2\bigl(1 + \theta + (1+\theta)^2(m+t) + \theta^2\bigr),
\]

we see that

\[
\operatorname{Corr}(Y_t, Y_{t-h}) = \frac{1 + \theta + (1+\theta)^2(m+t-h) + \theta^2}{1 + \theta + (1+\theta)^2(m+t) + \theta^2}.
\]

If t is large and h is moderate (so that m + t − h is fairly close to m + t), this ratio is approximately

\[
\frac{m+t-h}{m+t},
\]

which is approximately equal to 1.

The example above provides additional reinforcement of the idea that a slow decay of the autocorrelation function can be an indication that the corresponding time series is nonstationary.


18.4 Multiplicative model

The idea of least squares can be used with any function we think might dictate the general trend of our data. This is particularly useful with periodic functions for which there is a natural assumption to be made about the period (e.g. for local meteorological data, where the trend tends to follow an annual cycle).

Example 18.2. We will look at the data set of monthly totals of international airline passengers, 1949 to 1960.

> AP=AirPassengers

> plot(AP)

[Figure: monthly totals of international airline passengers (AP), 1949-1960, values ranging from about 100 to 600.]

Plotting this time series, we notice an upward trend together with a clear cyclical behavior, coupled with an increase of the amplitude of the cycle over time.

Such an increase in amplitude would make it difficult to model the time series well as Xt = mt + st + Yt with Yt stationary, since if the amplitude of st doesn't increase (which it can't; it's periodic), the variability of Yt would have to increase, making it non-stationary.

One way to get rid of an increase in amplitude is to take logarithms. Let's ignore the random component for a minute. Then an appropriate model for such an increase in amplitude would be

Xt = mt st Yt,

since in that case, a larger value of the trend would cause a larger amplitude (5 sin t has a larger amplitude than 3 sin t). Now if we take logarithms, we get

ln Xt = ln mt + ln st + ln Yt,

where ln mt is a new trend and ln st is still a periodic function. However, there is no longer a multiplicative effect, so the new time series shouldn't exhibit an increasing amplitude. Of course, this only works if we truly have a multiplicative model (Xt = mt st Yt), which we may or may not, but it certainly suggests that taking logarithms might solve the problem of the increasing amplitude.


Therefore, we create a new time series composed of the natural logarithms of the original series.

> LAP=log(AP)

This gives the following time series, in which we can see that the increase in amplitude seems to have mostly disappeared.

> plot(LAP)

[Figure: plot of LAP = log(AP), 1949-1960.]

Now we have a time series which actually looks like it could well be modeled as

Yt = mt + st +Wt,

where Wt could be stationary. We of course have to check more carefully whether our impression is correct.

We will first see if we can extract the trend and will then focus on the seasonal component.

Any class of functions might be used for least squares regression, but a good first try is generally the family of polynomials.

In order to be able to plot our regression curves together with the data set, we first define a time vector which corresponds to the time spanned by our time series. To see what that time is, we can just type

> time(LAP)

We see that the time goes from 1949 to 1960.917 with increments of 1/12 (representing months, or 1/12 of a year). Note that the time series is of length 144. Note also that the times of this time series are not integers, but this doesn't change the way in which we perform our analysis, only the way the horizontal axis is indexed.

We define

> T = c()

> for (i in 1:144) T[i]=1949+(i-1)/12

We are now ready to do a linear regression:


> LAP.lm=lm(LAP~time(LAP))

To see what the estimates for the slope and the intercept are, we type

> coef(LAP.lm)

(Intercept) time(LAP)

-230.1878355 0.1205806

This means that the least squares line for the data set LAP is y = −230.1878355+0.1205806t.

To see how this line compares to the time series, let’s draw them together by typing

> plot(LAP)

to get the plot of the time series and

> lines(T,LAP.lm$fit,col="red")

for the least squares line (note that the object LAP.lm$fit contains the fitted values, which we are plotting against T). This gives the following picture:

[Figure: LAP together with the least squares line.]

To determine how good this fit is (it will be good once the residuals look like they could be stationary), we plot the residuals and their ACF:

> plot(LAP.lm$resid,type="l")

[Figure: plot of the residuals LAP.lm$resid.]


> acf(LAP.lm$resid)

[Figure: sample ACF of LAP.lm$resid, lags 0 to 20.]

The ACF of the residuals exhibits an obvious trend (other than that of a damped sine wave or an exponential curve), so the residuals are certainly not a realization of a stationary time series.

If we look at the time series and the linear fit, we see that in addition to not having accounted for a seasonal component, we may not have guessed quite right by choosing the trend to be a straight line. Here is how we fit polynomials of degrees 2 and 3 to the data:

> t=time(LAP)

> t2=t^2

> t3=t^3

> LAP.lm2=lm(LAP~t+t2)

> LAP.lm3=lm(LAP~t+t2+t3)

To see the least squares quadratic curve together with the least squares straight line, we type

> plot(LAP)

> lines(T,LAP.lm$fit,col="green")

> lines(T,LAP.lm2$fit,col="red")

[Figure: LAP with the least squares line (green) and the least squares quadratic curve (red).]


We can also plot the least squares cubic polynomial, but notice that it covers the quadratic polynomial, meaning that it doesn't vary by any noticeable amount from the quadratic polynomial.

[Figure: LAP with the least squares quadratic and cubic fits (the two curves are indistinguishable).]

We therefore choose the trend to be a polynomial of degree two and turn to the periodic component. To find its equation, we type

> coef(LAP.lm2)

(Intercept) t t2

-1.228769e+04 1.245592e+01 -3.154887e-03

and see that a reasonable choice for mt is

mt = −12287.69 + 12.46t − 0.00315t^2.


Math 4506 (Fall 2019) November 11, 2019
Prof. Christian Benes

Lecture #19: More Logarithmic Transformations; The Partial Autocorrelation Function

Reference. Section 6.2 from the textbook.

19.1 Other Models for Which Taking Logarithms is useful

There are other situations where the variance of a time series may change over time and where logarithms may be helpful.

Suppose Xt is such that E[Xt] = µt and √Var(Xt) = µt σ (the second requirement means that the standard deviation increases linearly with the mean). Then, by analogy with the normal case, where if X ∼ N(µ, σ²), then (X − µ)/σ = Z ∼ N(0, 1), so that X = µ + σZ, we see that

\[
X_t = \mu_t + \mu_t\sigma\,\frac{X_t - \mu_t}{\mu_t\sigma} = \mu_t\Bigl(1 + \frac{X_t - \mu_t}{\mu_t}\Bigr).
\]

Therefore, taking logarithms on both sides and using the fact that the Taylor series for ln(1 + x) is

\[
\ln(1+x) = \sum_{n\ge 1}(-1)^{n+1}\frac{x^n}{n} = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots \qquad (|x| < 1),
\]

so that ln(1 + x) ≈ x if |x| is small, we see that

\[
\ln X_t \approx \ln\mu_t + \frac{X_t - \mu_t}{\mu_t}.
\]

Therefore,

E[ln Xt] ≈ ln µt

and

\[
\operatorname{Var}(\ln X_t) \approx \operatorname{Var}\Bigl(\frac{X_t - \mu_t}{\mu_t}\Bigr) = \frac{1}{\mu_t^2}\operatorname{Var}(X_t) = \frac{\mu_t^2\sigma^2}{\mu_t^2} = \sigma^2,
\]

so we see that while Xt doesn't have constant variance, ln Xt does.

Similarly, if Xt is such that Xt = (1 + Wt)Xt−1 (compare this with the behavior of exponential functions), where Wt is a mean-0 stationary time series, then

\[
\ln\Bigl(\frac{X_t}{X_{t-1}}\Bigr) = \ln(X_t) - \ln(X_{t-1}) = \ln(1 + W_t) \approx W_t,
\]

using again, in the last step, the Taylor expansion of ln(1 + x). So ∇ ln Xt ≈ Wt is stationary, so that taking differences of the logarithms transforms a non-stationary time series into a stationary one.
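A quick way to see this in R is to look at the differenced logarithms of a series that grows roughly multiplicatively, such as the airline passenger data from Lecture 18 (a minimal sketch):

AP <- AirPassengers
plot(diff(log(AP)))   # the log-differences look much closer to stationary
acf(diff(log(AP)))    # compare with acf(AP), whose sample ACF decays very slowly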


19.2 Partial Autocorrelation Function

Unfortunately, the sample ACF of an AR(p) process doesn't generally yield as much information as the sample ACF of an MA(q) or even an AR(1) process. It tends to exhibit a combination of exponential decay and sinusoidal behavior, which is a way of detecting that an AR model may be appropriate, but it doesn't tell us anything about the order p. Fortunately, there is another object which allows us to determine p, the so-called partial autocorrelation function (PACF).

The partial autocorrelation function at lag k of a stationary time series Xt is loosely defined to be the autocorrelation between Xt and Xt+k when considering only the linear dependence between these two random variables. This definition makes sense particularly in the context of AR processes:

Suppose that Xt+k is a regression on its k previous lagged values:

\[
X_{t+k} = \sum_{i=1}^{k}\phi_{ki} X_{t+k-i} + Z_{t+k} \tag{26}
\]

with Zt+k independent of Xt+k−i, i > 0. In other words, there is a linear relationship between Xt+k and its k direct predecessors, and φki can be thought of as the "constant of proportionality" between Xt+k and Xt+k−i. For any j ∈ {1, . . . , k}, (26) implies that

\[
E[X_{t+k} X_{t+k-j}] = \sum_{i=1}^{k}\phi_{ki}\,E[X_{t+k-i} X_{t+k-j}] + E[Z_{t+k} X_{t+k-j}].
\]

This is equivalent to

\[
\gamma_X(j) = \sum_{i=1}^{k}\phi_{ki}\,\gamma_X(j-i) + 0 = \sum_{i=1}^{k}\phi_{ki}\,\gamma_X(j-i),
\]

since Zt+k and Xt+k−j are independent if j ≥ 1. We can rewrite these equations in terms of autocorrelations:

\[
\rho_X(j) = \sum_{i=1}^{k}\phi_{ki}\,\rho_X(j-i).
\]

Now for each k, this gives a set of k linear equations in the k unknowns φk1, . . . , φkk, which we know how to solve. More explicitly, we get the following sequence of systems of equations:

• k = 1:
  ρX(1) = φ11 ρX(0) = φ11.

• k = 2:
  ρX(1) = φ21 ρX(0) + φ22 ρX(1) = φ21 + φ22 ρX(1)
  ρX(2) = φ21 ρX(1) + φ22 ρX(0) = φ21 ρX(1) + φ22.


• k = 3:
  ρX(1) = φ31 ρX(0) + φ32 ρX(1) + φ33 ρX(2) = φ31 + φ32 ρX(1) + φ33 ρX(2)
  ρX(2) = φ31 ρX(1) + φ32 ρX(0) + φ33 ρX(1) = φ31 ρX(1) + φ32 + φ33 ρX(1)
  ρX(3) = φ31 ρX(2) + φ32 ρX(1) + φ33 ρX(0) = φ31 ρX(2) + φ32 ρX(1) + φ33.

• General k:
  ρX(1) = φk1 ρX(0) + φk2 ρX(1) + · · · + φkk ρX(k − 1)
  ...
  ρX(k) = φk1 ρX(k − 1) + φk2 ρX(k − 2) + · · · + φkk.     (27)

To solve (27), it is useful to recall Cramer’s rule:

Theorem 19.1. Suppose that A is an n × n matrix with det(A) ≠ 0 and b is an n × 1 vector. Then if x = (x1, . . . , xn)′, the solution to the equation Ax = b is given by

\[
x_i = \frac{\det(A_i)}{\det(A)},
\]

where Ai is the matrix obtained from A by replacing the ith column by b.

Applying this to (27), we get

\[
\phi_{11} = \rho_X(1),
\qquad
\phi_{22} = \frac{\begin{vmatrix} 1 & \rho_X(1)\\ \rho_X(1) & \rho_X(2)\end{vmatrix}}{\begin{vmatrix} 1 & \rho_X(1)\\ \rho_X(1) & 1\end{vmatrix}},
\qquad
\phi_{33} = \frac{\begin{vmatrix} 1 & \rho_X(1) & \rho_X(1)\\ \rho_X(1) & 1 & \rho_X(2)\\ \rho_X(2) & \rho_X(1) & \rho_X(3)\end{vmatrix}}{\begin{vmatrix} 1 & \rho_X(1) & \rho_X(2)\\ \rho_X(1) & 1 & \rho_X(1)\\ \rho_X(2) & \rho_X(1) & 1\end{vmatrix}},
\quad \dots,
\]

\[
\phi_{kk} = \frac{\begin{vmatrix}
1 & \rho_X(1) & \rho_X(2) & \cdots & \rho_X(k-2) & \rho_X(1)\\
\rho_X(1) & 1 & \rho_X(1) & \cdots & \rho_X(k-3) & \rho_X(2)\\
\vdots & \vdots & \vdots & & \vdots & \vdots\\
\rho_X(k-1) & \rho_X(k-2) & \rho_X(k-3) & \cdots & \rho_X(1) & \rho_X(k)
\end{vmatrix}}{\begin{vmatrix}
1 & \rho_X(1) & \rho_X(2) & \cdots & \rho_X(k-2) & \rho_X(k-1)\\
\rho_X(1) & 1 & \rho_X(1) & \cdots & \rho_X(k-3) & \rho_X(k-2)\\
\vdots & \vdots & \vdots & & \vdots & \vdots\\
\rho_X(k-1) & \rho_X(k-2) & \rho_X(k-3) & \cdots & \rho_X(1) & 1
\end{vmatrix}}.
\]


Definition 19.1. The partial autocorrelation function α of a time series X is defined by the following equations:

α(0) = 1, α(k) = φ_{k,k} for k ≥ 1,

where φ_{k,k} is given by equation (27). Equivalently, φ_{k,k} is the last component of the vector φ_k given by the equation

R_k φ_k = ρ_k,

where R_k = (ρ(i − j))_{i,j=1}^{k} is the autocorrelation matrix and ρ_k = (ρ(1), . . . , ρ(k))′. The sample partial autocorrelation function α̂ of a time series X is defined just like α, except that R_k and ρ_k are replaced by R̂_k and ρ̂_k.
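The following R sketch computes the sample PACF at lag k directly from this definition and compares it with R's pacf command; the helper name sample.pacf is an illustrative choice.

sample.pacf <- function(x, k) {
  rho <- acf(x, lag.max = k, plot = FALSE)$acf[-1]   # rho.hat(1), ..., rho.hat(k)
  R <- toeplitz(c(1, rho[-k]))                       # R.hat_k = (rho.hat(i - j))
  solve(R, rho)[k]                                   # last component of R_k^{-1} rho_k
}
# X <- arima.sim(model = list(ar = 0.8), n = 500)
# sample.pacf(X, 2); pacf(X, plot = FALSE)$acf[2]    # should agree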

19.3 PACF for AR(p) Processes

The usefulness of the PACF becomes evident once one derives the PACF of an AR(p) process. We do this now, starting with the case p = 2: If

Xt = φ1Xt−1 + φ2Xt−2 + Zt,

then

ρX(k) = φ1 ρX(k − 1) + φ2 ρX(k − 2),

where ρX(h) is the ACF of Xt. We get from the work done above that

α(1) = φ11 = ρ(1).

\[
\alpha(2) = \phi_{22} = \frac{\begin{vmatrix} 1 & \rho_X(1)\\ \rho_X(1) & \rho_X(2)\end{vmatrix}}{\begin{vmatrix} 1 & \rho_X(1)\\ \rho_X(1) & 1\end{vmatrix}} = \frac{\rho_X(2) - \rho_X(1)^2}{1 - \rho_X(1)^2}.
\]

\[
\alpha(3) = \phi_{33} = \frac{\begin{vmatrix} 1 & \rho_X(1) & \rho_X(1)\\ \rho_X(1) & 1 & \rho_X(2)\\ \rho_X(2) & \rho_X(1) & \rho_X(3)\end{vmatrix}}{\begin{vmatrix} 1 & \rho_X(1) & \rho_X(2)\\ \rho_X(1) & 1 & \rho_X(1)\\ \rho_X(2) & \rho_X(1) & 1\end{vmatrix}}
= \frac{\begin{vmatrix} 1 & \rho_X(1) & \phi_1 + \phi_2\rho_X(1)\\ \rho_X(1) & 1 & \phi_1\rho_X(1) + \phi_2\\ \rho_X(2) & \rho_X(1) & \phi_1\rho_X(2) + \phi_2\rho_X(1)\end{vmatrix}}{\begin{vmatrix} 1 & \rho_X(1) & \rho_X(2)\\ \rho_X(1) & 1 & \rho_X(1)\\ \rho_X(2) & \rho_X(1) & 1\end{vmatrix}} = 0,
\]

since the last column of the determinant in the numerator is a linear combination of the previous two columns. The same argument shows that in general, the PACF of an AR(p) process satisfies

α(k) = 0 for all k ≥ p + 1, while α(p) = φp ≠ 0.

This fact is why the PACF is such a useful object!

In practice, we will be estimating the PACF by computing the sample PACF. We will need to determine whether a value α̂(h) of the sample PACF is small enough for us to reasonably assume that α(h) could be 0. This will again be the case if

\[
\hat\alpha(h) \in \Bigl(-\frac{1.96}{\sqrt n},\; \frac{1.96}{\sqrt n}\Bigr).
\]


More precisely, if |α̂(h)| > 1.96/√n for all h ≤ p and |α̂(h)| < 1.96/√n at least 95% of the time for h > p, it will be reasonable to look for an AR(p) model for our data. This is under the assumption that the time series is multivariate Gaussian.

Example 19.1. Consider the AR process defined by

Xt − 0.8Xt−1 = Zt,

where Zt ∼ WN(0, 1). The three pictures below show a realization of the time series, as well as the corresponding ACF and PACF. As expected, since X is an AR(1) process, α(h) = 0 for all h > 1. One should also note that α(1) = φ1.

To produce the process and the pictures below, use the following commands:

> Z=rnorm(200)

> X=Z

> for (i in 2:200) X[i]=0.8*X[i-1]+Z[i]

> plot(X,type="l")

[Figure: a plot of X.]

> acf(X)

[Figure: the sample autocorrelation function of X.]


> pacf(X)

[Figure: the sample partial autocorrelation function of X.]

Example 19.2. Consider the AR process defined by

Xt + 0.8Xt−1 = Zt,

where Zt ∼ WN(0, 1). The three pictures below show a realization of the time series, as well as the corresponding ACF and PACF. As expected, since X is an AR(1) process, α(h) = 0 for all h > 1. One should also note that α(1) = φ1.

[Figure: a plot of X.]

> acf(X)

[Figure: the sample autocorrelation function of X.]


> pacf(X)

[Figure: the sample partial autocorrelation function of X.]

Example 19.3. Consider the AR process defined by

Xt + 0.3Xt−1 + 0.4Xt−2 − 0.6Xt−4 = Zt,

where Zt ∼ WN(0, 1). The three pictures below show a realization of the time series, as well as the corresponding ACF and PACF. As expected, since X is an AR(4) process, α(h) = 0 for all h > 4. One should also note that α(4) = φ4.

To produce the process and the pictures below, use the following commands:

> Z=rnorm(200)

> X=Z

> for (i in 5:200) X[i]=-0.3*X[i-1]-0.4*X[i-2]+0.6*X[i-4]+Z[i]

> plot(X,type="l")

[Figure: a plot of X.]


> acf(X)

[Figure: the sample autocorrelation function of X.]

> pacf(X)

[Figure: the sample partial autocorrelation function of X.]


Math 4506 (Fall 2019) November 13, 2019
Prof. Christian Benes

Lecture #20: Model Selection

Reference. Section 6.3-6.6 from the textbook.

20.1 Yule-Walker Estimation for AR(p) Processes

Consider the ARMA(p, q) process X defined by

Φ(B)Xt = Θ(B)Zt,

where Zt ∼ WN(0, σ2). If X is a causal AR(p) process, we can write

\[
X_t = \sum_{j\ge 0}\psi_j Z_{t-j} = \Psi(B)Z_t,
\]

where Ψ(z) = 1/Φ(z). Therefore, as we have already seen in Lecture 8, for any j ∈ {0, . . . , p},

\[
\begin{aligned}
\Phi(B)X_t = Z_t \;&\Rightarrow\; \Phi(B)X_t\,X_{t-j} = Z_t X_{t-j} \;\Rightarrow\; X_t X_{t-j} = \sum_{i=1}^{p}\phi_i X_{t-i}X_{t-j} + Z_t X_{t-j}\\
&\Rightarrow\; E[X_t X_{t-j}] = \sum_{i=1}^{p}\phi_i E[X_{t-i}X_{t-j}] + E[Z_t X_{t-j}]\\
&\Rightarrow\; \gamma(j) = \sum_{i=1}^{p}\phi_i\,\gamma(j-i) + E[Z_t X_{t-j}].
\end{aligned}
\]

Using the fact that Xt = Ψ(B)Zt, we get:

• If j = 0,

\[
\gamma(0) = \sum_{i=1}^{p}\phi_i\gamma(i) + E[Z_t X_t] = \sum_{i=1}^{p}\phi_i\gamma(i) + \sigma^2. \tag{28}
\]

• Since Zt is uncorrelated with Xt−j for all j ∈ {1, . . . , p},

\[
\gamma(j) = \sum_{i=1}^{p}\phi_i\,\gamma(j-i),
\]

which, in matrix notation, can be written as

\[
\Gamma_p\,\boldsymbol\phi = \gamma_p, \tag{29}
\]

where Γ_p = (γ(i − j))_{i,j=1}^{p} is the covariance matrix, γ_p = (γ(1), . . . , γ(p))′, and φ = (φ1, . . . , φp)′.


The equations for j = 0 and j ∈ {1, . . . , p} are a set of p + 1 equations in the 2p + 2 variables σ², φ1, . . . , φp, γ(0), . . . , γ(p). If the model is entirely specified, we know σ², φ1, . . . , φp and can therefore solve for γ(0), . . . , γ(p). On the other hand, if we happened to know γ(0), . . . , γ(p), we could find σ², φ1, . . . , φp. The last two sentences are of course true only if the matrix defining our system of equations is nonsingular.

Generally, we don't know the true covariances of a time series, but we can estimate them. If we replace γ(i) by γ̂(i) in (28) and (29), we get

\[
\hat\gamma(0) = \sum_{i=1}^{p}\hat\phi_i\hat\gamma(i) + \hat\sigma^2, \tag{30}
\]

\[
\hat\Gamma_p\,\hat{\boldsymbol\phi} = \hat\gamma_p, \tag{31}
\]

where φ̂ = (φ̂1, . . . , φ̂p)′. It turns out that if γ̂(0) ≠ 0, the matrix Γ̂_p is nonsingular for all p ≥ 1. If γ̂(0) ≠ 0, we can divide both sides of the equations above by γ̂(0) to get an expression in terms of the sample ACF:

\[
1 = \sum_{i=1}^{p}\hat\phi_i\hat\rho(i) + \frac{\hat\sigma^2}{\hat\gamma(0)},
\qquad
\hat R_p\,\hat{\boldsymbol\phi} = \hat\rho_p.
\]

And since Γ̂_p is nonsingular when γ̂(0) ≠ 0, we can take inverses and solve for φ̂:

\[
\hat\sigma^2 = \hat\gamma(0)\Bigl(1 - \sum_{i=1}^{p}\hat\phi_i\hat\rho(i)\Bigr) = \hat\gamma(0)\bigl(1 - \hat{\boldsymbol\phi}'\hat\rho_p\bigr),
\qquad
\hat{\boldsymbol\phi} = \hat R_p^{-1}\hat\rho_p.
\]

Using the second equation to re-write the first, we get the sample Yule-Walker equations:

\[
\hat{\boldsymbol\phi} = \hat R_p^{-1}\hat\rho_p, \tag{32}
\]

\[
\hat\sigma^2 = \hat\gamma(0)\bigl(1 - \hat\rho_p'\,\hat R_p^{-1}\hat\rho_p\bigr). \tag{33}
\]

From this set of equations we can find, for any m ≥ 1, an AR(m) model based on the Yule-Walker method:

Definition 20.1. The process

\[
X_t - \sum_{i=1}^{p}\hat\phi_i X_{t-i} = Z_t, \qquad Z_t \sim WN(0, \hat v_p),
\]

where

\[
\hat{\boldsymbol\phi} = \hat R_p^{-1}\hat\rho_p
\qquad\text{and}\qquad
\hat v_p = \hat\gamma(0)\bigl(1 - \hat\rho_p'\,\hat R_p^{-1}\hat\rho_p\bigr),
\]

is called the fitted Yule-Walker AR(p) model.


Note 20.1. The estimators one gets via this method will generally vary with p. For instance, if you choose p = 1, the coefficient φ̂1 will not be the same as if you choose p = 2.

At this point, we have a method (albeit not completely systematic) which allows us to come up with an AR(p) model for a given data set:

• Look at the PACF to determine p.

• Find the fitted Yule-Walker AR(p) model.

The part of this procedure that is not particularly systematic is that of determining the "right" value of p. It turns out there is a better but considerably more complicated method, which we will discuss later, which simultaneously finds the "best" p (in some sense) and the corresponding Yule-Walker coefficients for every p.

As always when estimating, we would like to know how good our estimates are. The key fact is that under the assumption that the data are a realization of an AR(p) process,

\[
\hat{\boldsymbol\phi} \overset{\text{approx.}}{\sim} N\bigl(\boldsymbol\phi,\; n^{-1}\sigma^2\Gamma_p^{-1}\bigr), \tag{34}
\]

which, when replacing the unknown parameters by their estimators, becomes

\[
\hat{\boldsymbol\phi} \overset{\text{approx.}}{\sim} N\bigl(\boldsymbol\phi,\; n^{-1}\hat v_p\hat\Gamma_p^{-1}\bigr). \tag{35}
\]

Equation (35) implies that

\[
Y := \frac{\sqrt n}{\sqrt{\hat v_p}}\,(\hat{\boldsymbol\phi} - \boldsymbol\phi) \overset{\text{approx.}}{\sim} N(0, \hat\Gamma_p^{-1}).
\]

If we write \(\hat v_p\hat\Gamma_p^{-1} = (a_{i,j})_{1\le i,j\le p}\), this implies for every j ∈ {1, . . . , p} that

\[
P\Bigl(\hat\phi_j - \frac{\sqrt{a_{j,j}}}{\sqrt n}\,z_{\alpha/2} \le \phi_j \le \hat\phi_j + \frac{\sqrt{a_{j,j}}}{\sqrt n}\,z_{\alpha/2}\Bigr) \approx 1 - \alpha.
\]

This is a result for a specific coefficient of the AR(p) process. We can also find a confidence region for the collection of coefficients {φ_{p,i}}_{1≤i≤p}. Using the fact that if X ∼ N(µ, Σ), then

\[
(\mathbf X - \boldsymbol\mu)'\Sigma^{-1}(\mathbf X - \boldsymbol\mu) \sim \chi^2_p
\]

(this is the multi-dimensional analogue of the one-dimensional statement that if X ∼ N(µ, σ²), then (X − µ)²/σ² ∼ χ²_1), we get that

\[
Y'\hat\Gamma_p Y \overset{\text{approx}}{\sim} \chi^2_p,
\]

so that

\[
P\Bigl((\hat{\boldsymbol\phi} - \boldsymbol\phi)'\hat\Gamma_p(\hat{\boldsymbol\phi} - \boldsymbol\phi) \le \frac{\hat v_p\,\chi^2_{1-\alpha,p}}{n}\Bigr) \approx 1 - \alpha.
\]

This gives a way of checking our model's precision, as the region

\[
\Bigl\{\boldsymbol\phi \in \mathbb R^p : (\hat{\boldsymbol\phi} - \boldsymbol\phi)'\hat\Gamma_p(\hat{\boldsymbol\phi} - \boldsymbol\phi) \le \frac{\hat v_p\,\chi^2_{1-\alpha,p}}{n}\Bigr\}
\]

contains φ with approximate probability 1 − α.
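As an illustration, the following R sketch computes approximate 95% confidence intervals for the coefficients of a fitted Yule-Walker AR model, assuming that the asy.var.coef component returned by ar (an estimate of n^{-1} v̂_p Γ̂_p^{-1} above) is available from the fit; the simulated AR(1) series is an arbitrary choice for the example.

set.seed(5)
Z <- rnorm(1000); X <- Z
for (i in 2:1000) X[i] <- 0.5 * X[i - 1] + Z[i]
fit <- ar(X, order.max = 1, aic = FALSE, method = "yule-walker")
se <- sqrt(diag(fit$asy.var.coef))                 # approximate standard errors of the coefficients
cbind(lower = fit$ar - 1.96 * se, upper = fit$ar + 1.96 * se)  # should cover 0.5 most of the time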


20.2 Fitting our first model

We can now use the "ar" command to find an appropriate model. This command finds the optimal (in the sense of minimizing the AICC; we will discuss this later) p, coefficients φ1, . . . , φp, and σ² according to various methods, including the Yule-Walker method, which is the default method. Type "help(ar)" in R for more information about this command.

> set.seed(5)

> Z=rnorm(1000)

> X=Z

> for (i in 2:1000) X[i]=0.5*X[i-1]+Z[i]

> options(digits=16)

> ar(X)

Call:

ar(x = X)

Coefficients:

1

0.4804512923397

Order selected 1    sigma^2 estimated as 1.024832068202

We can check that these values are indeed what the Yule-Walker equations prescribe. Recall equation (32):

\[
\hat{\boldsymbol\phi} = \hat R_p^{-1}\hat\rho_p,
\]

and equation (33):

\[
\hat\sigma^2 = \hat\gamma(0)\bigl(1 - \hat\rho_p'\,\hat R_p^{-1}\hat\rho_p\bigr).
\]

If p = 1, these equations are just

\[
\hat\phi_1 = \hat\rho_0^{-1}\hat\rho_1, \qquad \hat\sigma^2 = \hat\gamma(0)\bigl(1 - \hat\rho_1'\,\hat\rho_0^{-1}\hat\rho_1\bigr),
\]

which can be simplified to

\[
\hat\phi_1 = \hat\rho_1, \qquad \hat\sigma^2 = \hat\gamma(0)\bigl(1 - \hat\rho_1^{\,2}\bigr).
\]

Now we can ask R to give us the estimate ρ̂1 and use the Yule-Walker equations directly:

> A=acf(X)

> A$acf[2]  # the lag-1 value of the sample ACF, with more precision than just printing A

[1] 0.4804512923

Therefore, φ̂1 = ρ̂1 ≈ 0.480451. We see that the value R gives directly via the command "ar" is the same for φ̂1.


Math 4506 (Fall 2019) November 18, 2019
Prof. Christian Benes

Lecture #21: Model Fitting and Parameter Estimation; Forecasting

Reference. Sections 6.7, 7.2, and Chapter 9 from the textbook.

In what follows, we will consider cleverly designed algorithms which will allow us to come up with ARMA models (i.e., find the "best" p, q, {φi}_{i=1}^{p}, {θj}_{j=1}^{q}) according to certain criteria, given a data set. Some of these algorithms are quite complicated and we'll just gaze at them superficially. We start with an algorithm which estimates {φi}_{i=1}^{p} for AR(p) processes when p is fixed. We will see later how to actually estimate which p gives the best fitted model.

21.1 Method of Moments Estimation

We won't discuss this method, since it often isn't particularly effective. The only thing you need to know is that for AR($p$) processes, the method of moments estimates are the same as the Yule-Walker estimates.

21.2 Least Squares Estimation

We begin with an obvious definition we could have made some time ago and which will be useful in the future, as not all stationary time series have zero mean:

Definition 21.1. If $\{Y_t - \mu\}$ is an ARMA process, we say that $\{Y_t\}$ is an ARMA process with mean $\mu$.

21.2.1 AR processes

We start by seeing how least squares estimation works for an AR(1) process (not necessarily with mean 0) satisfying the equation

\[ Y_t - \mu = \phi(Y_{t-1} - \mu) + Z_t. \]

The key idea is to see the time series $Y_t - \mu$ as a function of $Y_{t-1} - \mu$ and therefore to minimize the conditional sum of squares function

\[ S_c(\phi, \mu) = \sum_{t=2}^{n} \left( (Y_t - \mu) - \phi(Y_{t-1} - \mu) \right)^2. \]

To do this, we just need to solve $\nabla S_c(\phi, \mu) = \vec{0}$.


We first look at the partial derivative with respect to µ:

\[ \frac{\partial S_c}{\partial \mu} = \sum_{t=2}^{n} 2\left((Y_t - \mu) - \phi(Y_{t-1} - \mu)\right)(\phi - 1) = 2(\phi - 1)\left( \sum_{t=2}^{n} Y_t - \phi \sum_{t=2}^{n} Y_{t-1} + (n-1)(\phi - 1)\mu \right). \]

This is equal to 0 if and only if

\begin{align*}
\sum_{t=2}^{n} Y_t - \phi \sum_{t=2}^{n} Y_{t-1} + (n-1)(\phi - 1)\mu = 0
&\iff \sum_{t=2}^{n} Y_t - \phi \sum_{t=1}^{n-1} Y_t + (n-1)(\phi - 1)\mu = 0 \\
&\iff (1-\phi)\sum_{t=2}^{n-1} Y_t + Y_n - \phi Y_1 + (n-1)(\phi - 1)\mu = 0 \\
&\iff \mu = \frac{1}{(n-1)(\phi - 1)}\left( (\phi - 1)\sum_{t=2}^{n-1} Y_t + \phi Y_1 - Y_n \right) \\
&\iff \mu = \frac{1}{n-1}\sum_{t=2}^{n-1} Y_t + \frac{1}{(n-1)(1-\phi)}(Y_n - \phi Y_1), \tag{36}
\end{align*}

so $\hat{\mu} \approx \frac{1}{n-1}\sum_{t=2}^{n-1} Y_t \approx \bar{Y}$, since as $n$ gets large, $\frac{1}{(n-1)(1-\phi)}(Y_n - \phi Y_1)$ goes to 0. This is why one often chooses $\hat{\mu} = \bar{Y}$. It is good to keep in mind that if $n$ is small, one may wish to use the exact expression in (36).

Let’s now equate to 0 the partial derivative with respect to φ:

\[ \frac{\partial S_c}{\partial \phi} = \sum_{t=2}^{n} 2\left((Y_t - \mu) - \phi(Y_{t-1} - \mu)\right)(\mu - Y_{t-1}). \]

This is 0 if and only if

\[ \phi\sum_{t=2}^{n} (Y_{t-1} - \mu)^2 = \sum_{t=2}^{n} (Y_t - \mu)(Y_{t-1} - \mu) \iff \phi = \frac{\sum_{t=2}^{n} (Y_t - \mu)(Y_{t-1} - \mu)}{\sum_{t=2}^{n} (Y_{t-1} - \mu)^2}, \]

so, replacing $\mu$ by its estimator $\bar{Y}$, we get

\[ \hat{\phi} = \frac{\sum_{t=2}^{n} (Y_t - \bar{Y})(Y_{t-1} - \bar{Y})}{\sum_{t=2}^{n} (Y_{t-1} - \bar{Y})^2}. \]
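As a quick R sketch (our own illustration; Y is assumed to be an observed series), the closed-form conditional least squares estimate is easy to compute and can be compared with the lag-1 sample autocorrelation:

> n=length(Y)
> Ybar=mean(Y)
> sum((Y[2:n]-Ybar)*(Y[1:(n-1)]-Ybar))/sum((Y[1:(n-1)]-Ybar)^2)   # conditional least squares estimate of phi
> acf(Y,plot=FALSE)$acf[2]                                        # Yule-Walker estimate, for comparison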

We now see how the least-squares estimators are obtained for general AR processes:

For p ≥ 2,

\[ S_c(\phi_1, \dots, \phi_p, \mu) = \sum_{t=p+1}^{n} \left( Y_t - \mu - \sum_{i=1}^{p} \phi_i (Y_{t-i} - \mu) \right)^2, \]

and one can show that

\[ \frac{\partial S_c(\phi_1,\dots,\phi_p,\mu)}{\partial \mu} = 2\left(\sum_{i=1}^{p}\phi_i - 1\right) \sum_{t=p+1}^{n} \left( Y_t - \mu - \sum_{i=1}^{p} \phi_i (Y_{t-i} - \mu) \right) = 0 \iff \mu \approx \bar{Y},
\]


so, replacing $\mu$ by its estimator $\bar{Y}$, we get, for $1 \le j \le p$,

\begin{align*}
\frac{\partial S_c(\phi_1,\dots,\phi_p,\mu)}{\partial \phi_j} = 0
&\iff \sum_{t=p+1}^{n} (Y_t - \bar{Y})(Y_{t-j} - \bar{Y}) = \sum_{t=p+1}^{n} \sum_{i=1}^{p} \phi_i (Y_{t-i} - \bar{Y})(Y_{t-j} - \bar{Y}) \\
&\iff \sum_{t=p+1}^{n} (Y_t - \bar{Y})(Y_{t-j} - \bar{Y}) = \sum_{i=1}^{p} \phi_i \sum_{t=p+1}^{n} (Y_{t-i} - \bar{Y})(Y_{t-j} - \bar{Y}) \\
&\iff \frac{\sum_{t=p+1}^{n} (Y_t - \bar{Y})(Y_{t-j} - \bar{Y})}{\sum_{t=p+1}^{n} (Y_t - \bar{Y})^2} = \sum_{i=1}^{p} \phi_i \, \frac{\sum_{t=p+1}^{n} (Y_{t-i} - \bar{Y})(Y_{t-j} - \bar{Y})}{\sum_{t=p+1}^{n} (Y_t - \bar{Y})^2} \\
&\overset{\text{approx.}}{\iff} r_j = \sum_{i=1}^{p} \phi_i r_{i-j}.
\end{align*}

These are just the sample Yule-Walker equations, so we see that the conditional least squares method yields the same estimators as the Yule-Walker method (which are the same as the method of moments estimators).

21.2.2 General ARMA processes

The problem is more difficult in this case and can be addressed by a number of numerical methods, but the key idea is that since one wants to regress the time series $Y_t$ on prior values of $Y$, one might wish to write the time series in invertible form $Y_t = Z_t + \sum_{j\ge 1} \pi_j Y_{t-j}$.

To understand the idea of maximum likelihood estimation well, it is helpful to have some notions of forecasting first. The idea we develop below fits naturally between the topics of least squares estimation and maximum likelihood estimation.

21.3 A Least Squares Predictor

We now turn to one of the important goals of time series analysis: to forecast future values of a time series. Since we need to know something about the underlying structure of the data in order to achieve this, we will continue assuming that the time series we are working with are stationary. As we know from hypothesis testing or the construction of confidence intervals, it is good to have an estimate/prediction, but much better to know how good the estimator/predictor is and in what sense.

If our predictor for a time series $X$ at some future time $n+h$ is to depend on values at previous times ($1$ to $n$), the easiest assumption one can make is that it depends on those values linearly. We define

\[ P_n X_{n+h} := a_0 + \sum_{i=1}^{n} a_i X_{n+1-i} = a_0 + a_1 X_n + \dots + a_n X_1, \]

the best linear predictor of $X_{n+h}$, to be the one with the least expected square error (or least mean square error).


It is not obvious that such a predictor is unique or even exists (not every function has a minimum). The next theorem claims both existence and uniqueness of $P_n X_{n+h}$ and shows how to find the coefficients $\{a_i\}_{i=0}^{n}$.

Theorem 21.1. Suppose $X$ is a stationary time series with mean $\mu$ and autocovariance function $\gamma$, and

\[ S_n(a_0, \dots, a_n) := E\left[(X_{n+h} - P_n X_{n+h})^2\right] = E\left[\left(X_{n+h} - (a_0 + a_1 X_n + \dots + a_n X_1)\right)^2\right]. \]

Then the unique vector $(a_0, \dots, a_n)'$ which minimizes $S_n$ satisfies

\[ a_0 = \mu\left(1 - \sum_{i=1}^{n} a_i\right) \tag{37} \]

and

\[ \Gamma_n \mathbf{a}_n = \gamma_n(h), \tag{38} \]

where $\mathbf{a}_n := (a_1, \dots, a_n)'$, $\Gamma_n = (\gamma(i-j))_{i,j=1}^{n}$ is the autocovariance matrix of $X$, and $\gamma_n(h) = (\gamma(h), \gamma(h+1), \dots, \gamma(h+n-1))'$.

Proof. $S_n(a_0, \dots, a_n)$ is a positive quadratic function in the variables $a_0, \dots, a_n$ and therefore has a unique minimum. To find it, we need to solve the equation

\[ \nabla S_n(a_0, \dots, a_n) = 0 \]

or, equivalently, the $n+1$ equations

\[ \frac{\partial S_n(a_0, \dots, a_n)}{\partial a_j} = 0, \quad 0 \le j \le n. \]

We can compute the partial derivatives inside the expectation in the definition of $S_n$ and re-write these equations as

\[ E\left[X_{n+h} - (a_0 + a_1 X_n + \dots + a_n X_1)\right] = 0, \tag{39} \]

\[ E\left[X_j\left(X_{n+h} - (a_0 + a_1 X_n + \dots + a_n X_1)\right)\right] = 0, \quad 1 \le j \le n. \tag{40} \]

Equation (39) becomes

\[ a_0 = E\left[X_{n+h} - (a_1 X_n + \dots + a_n X_1)\right] = \mu(1 - (a_1 + \dots + a_n)) = \mu\left(1 - \sum_{i=1}^{n} a_i\right) \tag{41} \]

and equation (40) can be rewritten as

\[ a_0\mu = \gamma(n - j + h) + \mu^2 - \sum_{i=1}^{n} a_i\left(\gamma(n + 1 - i - j) + \mu^2\right), \quad 1 \le j \le n \]


or, equivalently, replacing $j$ by $n+1-j$ (i.e., counting backwards) and using (41),

\[ \mu^2\left(1 - \sum_{i=1}^{n} a_i\right) = \gamma(j + h - 1) - \sum_{i=1}^{n} a_i\gamma(j - i) + \mu^2\left(1 - \sum_{i=1}^{n} a_i\right), \quad 1 \le j \le n, \]

so that

\[ \sum_{i=1}^{n} a_i\gamma(j - i) = \gamma(j + h - 1), \quad 1 \le j \le n, \]

which can be re-written in matrix form as

\[ \Gamma_n \mathbf{a}_n = \gamma_n(h). \tag{42} \]
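The system (42) is easy to solve numerically. Here is a small R sketch (the function name and setup are ours, not from the notes): given a function g computing $\gamma(k)$, it builds $\Gamma_n$ with toeplitz() and solves for $\mathbf{a}_n$; for an AR(1) autocovariance it returns $(\phi^h, 0, \dots, 0)$, in agreement with Example 22.1 in the next lecture.

> blp.coef=function(g,n,h){
+ Gamma=toeplitz(g(0:(n-1)))      # (gamma(i-j))_{i,j=1..n}
+ solve(Gamma,g(h:(h+n-1)))       # solves Gamma_n a_n = gamma_n(h)
+ }
> phi=0.5
> g=function(k) phi^abs(k)/(1-phi^2)   # AR(1) autocovariance with sigma^2 = 1
> round(blp.coef(g,n=5,h=2),10)        # gives (phi^2, 0, 0, 0, 0)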


Math 4506 (Fall 2019) November 20, 2019
Prof. Christian Benes

Lecture #22: Forecasting

Reference. Chapter 9 from the textbook.

The following proposition essentially follows from the work done in the proof of Theorem 21.1:

Proposition 22.3. (Properties of $P_n X_{n+h}$)

1. $P_n X_{n+h} = \mu + \sum_{i=1}^{n} a_i(X_{n+1-i} - \mu)$, where $\mathbf{a}_n$ satisfies (42).

2. The mean-squared error of the predictor satisfies $E[(X_{n+h} - P_n X_{n+h})^2] = \gamma(0) - \mathbf{a}_n'\gamma_n(h)$.

3. $E[X_{n+h} - P_n X_{n+h}] = 0$.

4. $E[(X_{n+h} - P_n X_{n+h})X_j] = 0$ for $j = 1, \dots, n$.

22.1 Prediction

Example 22.1. (Predictions for an AR(1) process) Suppose $X_t = \phi X_{t-1} + Z_t$, where $|\phi| < 1$ and $\{Z_t\} \sim WN(0, \sigma^2)$. Recall that for such a process,

\[ \gamma_X(h) = \frac{\sigma^2}{1-\phi^2}\,\phi^{|h|}, \quad h \in \mathbb{Z}. \]

Equation (42) now becomes

\[
\frac{\sigma^2}{1-\phi^2}
\begin{pmatrix}
1 & \phi & \phi^2 & \dots & \phi^{n-1} \\
\phi & 1 & \phi & \dots & \phi^{n-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\phi^{n-1} & \phi^{n-2} & \phi^{n-3} & \dots & 1
\end{pmatrix}
\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_{n-1} \\ a_n \end{pmatrix}
=
\begin{pmatrix} \gamma(h) \\ \gamma(h+1) \\ \vdots \\ \gamma(h+n-2) \\ \gamma(h+n-1) \end{pmatrix}.
\]

In particular, using the fact that $h > 0$, this is

\[
\frac{\sigma^2}{1-\phi^2}
\begin{pmatrix}
1 & \phi & \phi^2 & \dots & \phi^{n-1} \\
\phi & 1 & \phi & \dots & \phi^{n-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\phi^{n-1} & \phi^{n-2} & \phi^{n-3} & \dots & 1
\end{pmatrix}
\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_{n-1} \\ a_n \end{pmatrix}
=
\frac{\sigma^2}{1-\phi^2}
\begin{pmatrix} \phi^h \\ \phi^{h+1} \\ \vdots \\ \phi^{h+n-2} \\ \phi^{h+n-1} \end{pmatrix},
\]


clearly implying that

\[ \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_{n-1} \\ a_n \end{pmatrix} = \begin{pmatrix} \phi^h \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix}. \]

Since $\mu = 0$, we get from (37) that $a_0 = 0$, so that

\[ P_n X_{n+h} = \phi^h X_n. \]

Point 2 of Proposition 22.3 implies that the mean-squared error of the predictor is

\[ \gamma(0) - \mathbf{a}_n'\gamma_n(h) = \frac{\sigma^2}{1-\phi^2} - \phi^h\,\frac{\sigma^2}{1-\phi^2}\,\phi^h = \sigma^2\,\frac{1-\phi^{2h}}{1-\phi^2}. \]

Note that the mean-squared error of the predictor increases with the variance of the white noise used to generate $X$. This makes sense, since increasing $\sigma^2$ increases $\mathrm{Var}(X_t)$, yielding more variable data, which makes predictions more difficult.
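A quick simulation check of this formula (a sketch under the assumption $\sigma^2 = 1$; arima.sim generates the AR(1) series, and the variable names are ours):

> set.seed(1)
> phi=0.5; h=2
> W=arima.sim(model=list(ar=phi),n=100000)
> err=W[(1+h):length(W)]-phi^h*W[1:(length(W)-h)]   # prediction errors of P_n X_{n+h} = phi^h X_n
> mean(err^2)                                       # empirical mean-squared error
> (1-phi^(2*h))/(1-phi^2)                           # theoretical value sigma^2(1-phi^(2h))/(1-phi^2)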

22.2 Reduction to Mean Zero Time Series

We start by showing that whenever we have to deal with a stationary time series that doesn't have zero mean, we can assume that it does (which usually simplifies computations) and deal with the mean only once we've made a prediction for the corresponding zero-mean time series.

Suppose $\{Y_t\}$ is a stationary time series with mean $\mu$. Then if $X_t := Y_t - \mu$, $\{X_t\}$ is a stationary time series with mean 0. Therefore, the linearity of the prediction operator implies that

\[ P_n Y_{n+h} = P_n(X_{n+h} + \mu) = P_n X_{n+h} + \mu. \]

Also,

\begin{align*}
E[(Y_{n+h} - P_n Y_{n+h})^2] &= E[((Y_{n+h} - \mu) - (P_n Y_{n+h} - \mu))^2] \\
&= E[(X_{n+h} - P_n X_{n+h})^2] = \gamma_X(0) - \mathbf{a}_n'\gamma_n(h),
\end{align*}

where $\gamma_n(h) = (\gamma_X(h), \dots, \gamma_X(h+n-1))'$.

Example 22.2. (An AR process with nonzero mean) The process $\{Y_t\}$ is an AR(1) process with mean $\mu$ if $X_t = Y_t - \mu$ is AR(1). So by Example 22.1,

\[ P_n Y_{n+h} = \phi^h X_n + \mu = \phi^h(Y_n - \mu) + \mu \]

and

\[ E[(Y_{n+h} - P_n Y_{n+h})^2] = \frac{\sigma^2(1-\phi^{2h})}{1-\phi^2}. \]


22.3 Forecasting Based on an Infinite Past

Recall the following definition:

Definition 22.1. A time series $\{X_t\}$ is an ARMA($p$, $q$) process if $\{X_t\}$ is stationary and, for all $t$,

\[ X_t - \sum_{i=1}^{p} \phi_i X_{t-i} = \sum_{j=0}^{q} \theta_j Z_{t-j}, \tag{43} \]

where $\theta_0 = 1$, $\{Z_t\} \sim WN(0, \sigma^2)$, and the polynomials $\Phi(z) = 1 - \sum_{i=1}^{p} \phi_i z^i$ and $\Theta(z) = \sum_{j=0}^{q} \theta_j z^j$ have no common factors. In short, we can write

\[ \Phi(B) X_t = \Theta(B) Z_t. \tag{44} \]

Recall that a few lectures ago we derived an expression for ARMA(1,1) processes under the assumption that $|\phi| < 1$ by defining $\chi(z) = \sum_{j\ge 0} \phi^j z^j$ and applying $\chi(B)$ to both sides of

\[ \Phi(B) X_t = \Theta(B) Z_t, \]

thus obtaining

\[ X_t = \chi(B)\Theta(B) Z_t = Z_t + (\phi + \theta)\sum_{j\ge 1} \phi^{j-1} Z_{t-j}. \]

Similarly, if $|\theta| < 1$, we can write

\[ \xi(z) = \frac{1}{\Theta(z)} = \sum_{j=0}^{\infty} (-\theta)^j z^j. \]

Then (44) becomes

\[ \xi(B)\Phi(B) X_t = \xi(B)\Theta(B) Z_t = Z_t, \]

that is,

\[ \pi(B) X_t = Z_t, \]

where

\[ \pi(B) = \xi(B)\Phi(B) = \sum_{j=0}^{\infty} (-\theta)^j B^j (1 - \phi B) = 1 - (\phi + \theta)\sum_{j\ge 1} (-\theta)^{j-1} B^j, \]

so that

\[ Z_t = X_t - (\phi + \theta)\sum_{j\ge 1} (-\theta)^{j-1} X_{t-j}, \]

implying that

\[ X_t = (\phi + \theta)\sum_{j\ge 1} (-\theta)^{j-1} X_{t-j} + Z_t. \]

As the computations above show, for some time series the value $X_t$ depends on all the values $X_s$, $s < t$. Therefore, the best "linear" predictor should depend on all the values $X_s$, $s < t$, as well.

Definition 22.2. The prediction operator based on the infinite past, $P_n$, is defined by

\[ P_n X_{n+h} = \sum_{j=1}^{\infty} \alpha_j X_{n+1-j}, \]

where the coefficients $\alpha_j$ minimize the expected square error $E[(X_{n+h} - P_n X_{n+h})^2]$.

Note 22.1. The limit in the definition is taken to be the mean square limit. See the brief discussion of this topic in Lecture 4.

Proposition 22.4. (Properties of $P_n X_{n+h}$)

1. $E[(X_{n+h} - P_n X_{n+h})X_i] = 0$ for all $i \le n$.

2. $P_n(aX_{n+h_1} + bX_{n+h_2} + c) = aP_n(X_{n+h_1}) + bP_n(X_{n+h_2}) + c$.

3. $P_n\left(\sum_{i\ge 1} \alpha_i X_{n+1-i}\right) = \sum_{i\ge 1} \alpha_i X_{n+1-i}$.

4. $P_n X_{n+h} = E[X_{n+h}]$ if $\mathrm{Cov}(X_{n+h}, X_i) = 0$ for all $i \le n$.

Example 22.3. Consider the ARMA(1,1) process with $|\phi| < 1$, $|\theta| < 1$:

\[ (1 - \phi B)X_t = (1 + \theta B)Z_t, \quad \{Z_t\} \sim WN(0, \sigma^2). \]

We saw above that

\[ X_t = (\phi + \theta)\sum_{j\ge 1} (-\theta)^{j-1} X_{t-j} + Z_t. \]

Applying $P_n$ to both sides of the equality and using the properties of the proposition above yields

\[ P_n X_{n+1} = (\phi + \theta)\sum_{j\ge 1} (-\theta)^{j-1} X_{n+1-j}, \]

so since

\[ X_{n+1} = (\phi + \theta)\sum_{j\ge 1} (-\theta)^{j-1} X_{n+1-j} + Z_{n+1}, \]

we see that

\[ X_{n+1} - P_n X_{n+1} = Z_{n+1}, \]

implying that the expected square error is

\[ E[(X_{n+1} - P_n X_{n+1})^2] = \sigma^2. \]


Math 4506 (Fall 2019) November 25, 2019
Prof. Christian Benes

Lecture #23: More Forecasting

Reference. Section 7.3

23.1 The Innovations Algorithm

The innovations algorithm is designed to facilitate the computation of predictors via predictors used in the past. It works not just for stationary time series, but for any time series with second moments.

Consider a mean 0 time series $\{X_t\}$ with $E[X_t^2] < \infty$ for all $t$ and $\kappa(i,j) = E[X_i X_j]$. Note that since we aren't assuming here that $\{X_t\}$ is stationary, we can't talk about an autocovariance function, only about covariances. We define

\[ \hat{X}_n = \begin{cases} 0, & n = 1, \\ P_{n-1} X_n, & n \ge 2, \end{cases} \]

the innovations

\[ U_n = X_n - \hat{X}_n, \]

and

\[ v_n = E\left[\left(X_{n+1} - \hat{X}_{n+1}\right)^2\right] = E[U_{n+1}^2]. \]

Since $\hat{X}_n = -\sum_{i=1}^{n-1} a_{n-1,i} X_{n-i}$ (the constants are arbitrary and the minus sign is here just to make the equation below look better), we get

\[
\begin{pmatrix} U_1 \\ U_2 \\ U_3 \\ \vdots \\ U_{n-1} \\ U_n \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & \dots & 0 & 0 \\
a_{1,1} & 1 & 0 & \dots & 0 & 0 \\
a_{2,2} & a_{2,1} & 1 & \dots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
a_{n-2,n-2} & a_{n-2,n-3} & a_{n-2,n-4} & \dots & 1 & 0 \\
a_{n-1,n-1} & a_{n-1,n-2} & a_{n-1,n-3} & \dots & a_{n-1,1} & 1
\end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ \vdots \\ X_{n-1} \\ X_n \end{pmatrix},
\]

which can be re-written in short as

\[ \mathbf{U}_n = A_n \mathbf{X}_n. \]

We know from linear algebra that the inverse matrix of $A_n$ can be written in the form

\[
C_n := A_n^{-1} =
\begin{pmatrix}
1 & 0 & 0 & \dots & 0 & 0 \\
\theta_{1,1} & 1 & 0 & \dots & 0 & 0 \\
\theta_{2,2} & \theta_{2,1} & 1 & \dots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\theta_{n-2,n-2} & \theta_{n-2,n-3} & \theta_{n-2,n-4} & \dots & 1 & 0 \\
\theta_{n-1,n-1} & \theta_{n-1,n-2} & \theta_{n-1,n-3} & \dots & \theta_{n-1,1} & 1
\end{pmatrix}.
\]


In particular, since $C_n\mathbf{U}_n = \mathbf{X}_n$ and $\mathbf{U}_n = \mathbf{X}_n - \hat{\mathbf{X}}_n$,

\[ \hat{\mathbf{X}}_n = \mathbf{X}_n - \mathbf{U}_n = (C_n - 1_n)\mathbf{U}_n = \Theta_n\mathbf{U}_n = \Theta_n(\mathbf{X}_n - \hat{\mathbf{X}}_n), \]

where

\[
\Theta_n =
\begin{pmatrix}
0 & 0 & 0 & \dots & 0 & 0 \\
\theta_{1,1} & 0 & 0 & \dots & 0 & 0 \\
\theta_{2,2} & \theta_{2,1} & 0 & \dots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\theta_{n-2,n-2} & \theta_{n-2,n-3} & \theta_{n-2,n-4} & \dots & 0 & 0 \\
\theta_{n-1,n-1} & \theta_{n-1,n-2} & \theta_{n-1,n-3} & \dots & \theta_{n-1,1} & 0
\end{pmatrix}.
\]

If we write the equations of this system individually, we get $\hat{X}_1 = 0$ and, for $n \ge 1$,

\[ \hat{X}_{n+1} = \sum_{j=1}^{n} \theta_{n,j}\left(X_{n+1-j} - \hat{X}_{n+1-j}\right). \]

So if at time $n$ we wish to make a prediction $\hat{X}_{n+1}$, we can do so by using our past predictions $\{\hat{X}_{n+1-j}\}_{1\le j\le n}$ and the actual values $\{X_{n+1-j}\}_{1\le j\le n}$. Of course, we also need the coefficients $\{\theta_{n,j}\}_{1\le j\le n}$. It would be great if we could compute these recursively as well. It turns out we can:

Theorem 23.1. (Innovations Algorithm)

\[ v_0 = \kappa(1,1), \]

and for $n \ge 1$,

\[ \theta_{n,n-k} = v_k^{-1}\left( \kappa(n+1, k+1) - \sum_{j=0}^{k-1} \theta_{k,k-j}\,\theta_{n,n-j}\, v_j \right), \quad 0 \le k \le n-1, \tag{45} \]

the sum being empty if $k = 0$, and

\[ v_n = \kappa(n+1, n+1) - \sum_{j=0}^{n-1} \theta_{n,n-j}^2\, v_j. \]

Note 23.1. So to compute $\{\theta_{n,j}\}_{1\le j\le n}$, we start by computing

\[ \theta_{1,1} = v_0^{-1}\kappa(2,1), \]

then

\[ v_1 = \kappa(2,2) - \theta_{1,1}^2 v_0, \]

then

\[ \theta_{2,2} = v_0^{-1}\kappa(3,1), \qquad \theta_{2,1} = v_1^{-1}\left(\kappa(3,2) - \theta_{1,1}\theta_{2,2}v_0\right), \]

then

\[ v_2 = \kappa(3,3) - \left(\theta_{2,2}^2 v_0 + \theta_{2,1}^2 v_1\right), \]

etc.
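Theorem 23.1 translates directly into a short recursive program. Here is an R sketch (our own helper function, not a built-in): kappa is a function returning $\kappa(i,j)$ and N is the number of recursion steps; the function returns the $\theta_{n,j}$ and the $v_n$.

> innovations=function(kappa,N){
+ v=numeric(N+1)                  # v[n+1] stores v_n
+ theta=matrix(0,N,N)             # theta[n,j] stores theta_{n,j}
+ v[1]=kappa(1,1)                 # v_0
+ for (n in 1:N){
+ for (k in 0:(n-1)){
+ s=0
+ if (k>=1) for (j in 0:(k-1)) s=s+theta[k,k-j]*theta[n,n-j]*v[j+1]
+ theta[n,n-k]=(kappa(n+1,k+1)-s)/v[k+1]       # equation (45)
+ }
+ v[n+1]=kappa(n+1,n+1)-sum(theta[n,n-(0:(n-1))]^2*v[1:n])
+ }
+ list(theta=theta,v=v)
+ }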


Note 23.2. An important property of the innovations is that the components of $\mathbf{X}_n - \hat{\mathbf{X}}_n$ are uncorrelated.

Example 23.1. (Prediction for an MA(1) process) In the case of an MA(1) process

\[ X_t = Z_t + \theta Z_{t-1}, \quad |\theta| < 1, \]

we have $\kappa(n,n) = \gamma(0)$, $\kappa(n+1,n) = \gamma(1)$, and $\kappa(m,n) = 0$ if $m > n+1$, so the equations of the innovations algorithm become

\[ v_0 = \gamma(0), \]

and for $n \ge 2$,

\[ \theta_{n,i} = 0, \quad i \ge 2, \]

and for $n \ge 1$,

\[ \theta_{n,1} = v_{n-1}^{-1}\gamma(1), \qquad v_n = \gamma(0) - \theta_{n,1}^2 v_{n-1}. \]

To see why $\theta_{n,i} = 0$ for $n, i \ge 2$, we can note that $\theta_{n,n} = v_0^{-1}\kappa(n+1, 1) = v_0^{-1}\gamma(n) = 0$ if $n \ge 2$. Moreover, if $k \le n-2$, then $(n+1) - (k+1) \ge 2$, so $\kappa(n+1, k+1) = 0$, so that we get from (45)

\[ \theta_{n,n-k} = -\sum_{j=0}^{k-1} \theta_{k,k-j}\,\theta_{n,n-j}\, v_j. \tag{46} \]

We already know that $\theta_{n,n} = 0$. We can now use this in (46) as follows:

\[ \theta_{n,n-1} = -\sum_{j=0}^{0} \theta_{1,1-j}\,\theta_{n,n-j}\, v_j = -\theta_{1,1}\theta_{n,n}v_0 = 0. \]

Now that we know $\theta_{n,n} = \theta_{n,n-1} = 0$, we get

\[ \theta_{n,n-2} = -\sum_{j=0}^{1} \theta_{2,2-j}\,\theta_{n,n-j}\, v_j = -(\theta_{2,2}\theta_{n,n}v_0 + \theta_{2,1}\theta_{n,n-1}v_1) = 0. \]

Continuing like this, we can show that $\theta_{n,i} = 0$ for $2 \le i \le n$.

In the particular case of an MA(1) process, using the fact that

\[ \gamma(0) = \sigma^2(1 + \theta^2), \qquad \gamma(1) = \sigma^2\theta, \]

these equations become

\[ v_0 = \sigma^2(1 + \theta^2), \]

and for $n \ge 1$,

\[ \theta_{n,i} = 0, \quad i \ge 2, \qquad \theta_{n,1} = v_{n-1}^{-1}\sigma^2\theta, \]


\[ v_n = \sigma^2(1 + \theta^2) - \left(v_{n-1}^{-1}\sigma^2\theta\right)^2 v_{n-1} = \sigma^2\left(1 + \theta^2 - v_{n-1}^{-1}\sigma^2\theta^2\right). \]

For example, for the MA process

\[ X_t = Z_t + \tfrac{1}{2}Z_{t-1}, \]

we get

\begin{align*}
v_0 &= \tfrac{5}{4}\sigma^2, \\
\theta_{1,1} &= v_0^{-1}\cdot\tfrac{1}{2}\sigma^2 = \tfrac{2}{5}, \\
v_1 &= \sigma^2\left(1 + \tfrac{1}{4} - v_0^{-1}\sigma^2\cdot\tfrac{1}{4}\right) = \sigma^2\left(1 + \tfrac{1}{4} - \tfrac{1}{5}\right) = \tfrac{21}{20}\sigma^2, \\
\theta_{2,1} &= v_1^{-1}\sigma^2\cdot\tfrac{1}{2} = \tfrac{10}{21}, \\
v_2 &= \sigma^2\left(1 + \tfrac{1}{4} - v_1^{-1}\sigma^2\cdot\tfrac{1}{4}\right) = \sigma^2\left(1 + \tfrac{1}{4} - \tfrac{5}{21}\right) = \tfrac{85}{84}\sigma^2, \\
\theta_{3,1} &= v_2^{-1}\sigma^2\cdot\tfrac{1}{2} = \tfrac{42}{85}, \\
&\;\;\vdots
\end{align*}

This gives

\begin{align*}
\hat{X}_2 &= \theta_{1,1}\left(X_1 - \hat{X}_1\right) = \tfrac{2}{5}X_1, \\
\hat{X}_3 &= \theta_{2,1}\left(X_2 - \hat{X}_2\right) = \tfrac{10}{21}\left(X_2 - \tfrac{2}{5}X_1\right) = \tfrac{10}{21}X_2 - \tfrac{4}{21}X_1, \\
\hat{X}_4 &= \theta_{3,1}\left(X_3 - \hat{X}_3\right) = \tfrac{42}{85}\left(X_3 - \tfrac{10}{21}X_2 + \tfrac{4}{21}X_1\right) = \tfrac{42}{85}X_3 - \tfrac{20}{85}X_2 + \tfrac{8}{85}X_1, \\
&\;\;\vdots
\end{align*}
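We can check these values with the sketch implementation of the innovations algorithm given after Theorem 23.1 (taking $\sigma^2 = 1$, so $\kappa(i,i) = 5/4$, $\kappa(i,i\pm 1) = 1/2$, and $\kappa(i,j) = 0$ otherwise):

> kappa=function(i,j) if (i==j) 5/4 else if (abs(i-j)==1) 1/2 else 0
> out=innovations(kappa,3)
> out$theta[1,1]; out$theta[2,1]; out$theta[3,1]   # 2/5, 10/21, 42/85
> out$v[1:3]                                       # 5/4, 21/20, 85/84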


Math 4506 (Fall 2019) November 27, 2019
Prof. Christian Benes

Lecture #24: Maximum Likelihood Estimation

Reference. Section 7.3

24.1 Maximum Likelihood Estimation

When looking for estimators, a statistician has a number of tools at her disposal. Two of the most common are to look for method of moments or maximum likelihood estimators. We've already seen, when discussing the Yule-Walker method, how to come up with method of moments estimators. We now discuss the maximum likelihood method. This method relies on the knowledge (up to the unknown parameters) of the underlying distribution and the common assumption (which may of course be wrong, but is often appropriate) that the data $\mathbf{X}_n = (X_1, \dots, X_n)$ come from the normal distribution. In that case, the likelihood of $\mathbf{X}_n$ is defined by

\[ L = \frac{1}{(2\pi)^{n/2}(\det\Gamma_n)^{1/2}} \exp\left\{ -\frac{1}{2}\mathbf{X}_n'\Gamma_n^{-1}\mathbf{X}_n \right\}. \]

The only parameters the likelihood depends on are those present in it, that is, the covariances. It turns out that if we think in terms of innovations, we can simplify the last expression quite a bit by finding appropriate replacements for $\det\Gamma_n$ and $\mathbf{X}_n'\Gamma_n^{-1}\mathbf{X}_n$.

We know from earlier that if

\[ \hat{X}_n = \begin{cases} 0, & n = 1, \\ P_{n-1} X_n, & n \ge 2, \end{cases} \]

is the least squares linear predictor, then $\mathbf{X}_n = C_n(\mathbf{X}_n - \hat{\mathbf{X}}_n)$, where

\[
C_n =
\begin{pmatrix}
1 & 0 & 0 & \dots & 0 & 0 \\
\theta_{1,1} & 1 & 0 & \dots & 0 & 0 \\
\theta_{2,2} & \theta_{2,1} & 1 & \dots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\theta_{n-2,n-2} & \theta_{n-2,n-3} & \theta_{n-2,n-4} & \dots & 1 & 0 \\
\theta_{n-1,n-1} & \theta_{n-1,n-2} & \theta_{n-1,n-3} & \dots & \theta_{n-1,1} & 1
\end{pmatrix}.
\]

An important property of the innovations is that the components of $\mathbf{X}_n - \hat{\mathbf{X}}_n$ are uncorrelated, so that if $v_{j-1} = E\left[\left(X_j - \hat{X}_j\right)^2\right]$, the covariance matrix of $\mathbf{X}_n - \hat{\mathbf{X}}_n$ is

\[
D_n =
\begin{pmatrix}
v_0 & 0 & 0 & \dots & 0 & 0 \\
0 & v_1 & 0 & \dots & 0 & 0 \\
0 & 0 & v_2 & \dots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \dots & v_{n-2} & 0 \\
0 & 0 & 0 & \dots & 0 & v_{n-1}
\end{pmatrix}.
\]


You proved on the first homework assignment of the semester that if $\mathbf{Y} = \mathbf{a} + B\mathbf{X}$, then $\Sigma_{\mathbf{Y}} = B\Sigma_{\mathbf{X}}B'$. Applying this to our current situation, where $\mathbf{X}_n = C_n(\mathbf{X}_n - \hat{\mathbf{X}}_n)$, we get that

\[ \Gamma_n = C_n D_n C_n', \]

so that

\begin{align*}
\mathbf{X}_n'\Gamma_n^{-1}\mathbf{X}_n &= (\mathbf{X}_n - \hat{\mathbf{X}}_n)'C_n'\,C_n'^{-1}D_n^{-1}C_n^{-1}\,C_n(\mathbf{X}_n - \hat{\mathbf{X}}_n) \\
&= (\mathbf{X}_n - \hat{\mathbf{X}}_n)'D_n^{-1}(\mathbf{X}_n - \hat{\mathbf{X}}_n) = \sum_{j=1}^{n} \frac{(X_j - \hat{X}_j)^2}{v_{j-1}}.
\end{align*}

We also get that

\[ \det\Gamma_n = \det(C_n)\det(D_n)\det(C_n') = \prod_{i=1}^{n} v_{i-1}, \]

so that we can rewrite the likelihood as

\[ L = \frac{1}{(2\pi)^{n/2}\left(\prod_{i=1}^{n} v_{i-1}\right)^{1/2}} \exp\left\{ -\frac{1}{2}\sum_{j=1}^{n} \frac{(X_j - \hat{X}_j)^2}{v_{j-1}} \right\}. \]

All these quantities are easily computed using the innovations algorithm, and so is the likelihood. In particular, using the definition $r_n = v_n/\sigma^2$ (note that $r_n$ is independent of $\sigma^2$), we obtain the Gaussian likelihood for an ARMA process

\[ L(\boldsymbol{\phi},\boldsymbol{\theta},\sigma^2) = \frac{1}{\sqrt{(2\pi\sigma^2)^n\left(\prod_{i=1}^{n} r_{i-1}\right)}} \exp\left\{ -\frac{1}{2\sigma^2}\sum_{j=1}^{n} \frac{(X_j - \hat{X}_j)^2}{r_{j-1}} \right\}. \]

To maximize the likelihood, one must differentiate it and look for zeros.

Note that since we are now assuming that $X$ is an ARMA process, the innovations algorithm tells us that $v_0 = \mathrm{Var}(X_1)$ and, for $n \ge 1$, $v_n = \mathrm{Var}(X_{n+1}) - \sum_{j=0}^{n-1}\theta_{n,n-j}^2 v_j$.

Wold's theorem (see p. 383 of the textbook) says that every ARMA process can be expressed as an MA($\infty$) process (you showed this in a homework problem for AR(1) processes). This means that we can write $X_t = \sum_{k\ge 0}\psi_k Z_{t-k}$ with $\psi_k = f_k(\boldsymbol{\phi},\boldsymbol{\theta})$, which implies that $v_0 = \sigma^2 f(\boldsymbol{\phi},\boldsymbol{\theta})$, where $\sigma^2 = \mathrm{Var}(Z_t)$, and the equality $v_n = \mathrm{Var}(X_{n+1}) - \sum_{j=0}^{n-1}\theta_{n,n-j}^2 v_j$ implies that for $n \ge 1$, $v_n = \sigma^2 f_n(\boldsymbol{\phi},\boldsymbol{\theta})$. This implies that $r_n = v_n/\sigma^2$ does not depend on $\sigma^2$, so we will be able to treat it like a constant (when differentiating with respect to $\sigma^2$). This allows us to keep the notation simple and write $S = \sum_{j=1}^{n}\frac{(X_j - \hat{X}_j)^2}{r_{j-1}}$ and $P = \prod_{i=1}^{n} r_{i-1}$. Then the product rule implies

product rule implies

∂σ2(L) =

∂σ2

((σ2)−n/2√

(2π)nPexp

− 1

2σ2S

)

= −n2

(σ2)−n/2−1√(2π)nP

exp

− 1

2σ2S

+

(σ2)−n/2√(2π)nP

exp

− 1

2σ2S

(1

σ2

)2

S.


This last expression is 0 if and only if

\[ \frac{n}{2}(\sigma^2)^{-n/2-1} = (\sigma^2)^{-n/2}\left(\frac{1}{\sigma^2}\right)^2\frac{S}{2} \iff \sigma^2 n = S. \]

This yields the estimator for the white noise variance $\sigma^2$:

\[ \hat{\sigma}^2 = \frac{1}{n}\sum_{j=1}^{n}\frac{(X_j - \hat{X}_j)^2}{r_{j-1}}. \]

Now, maximizing $L(\boldsymbol{\phi},\boldsymbol{\theta},\sigma^2)$ is equivalent to maximizing

\[ \ln\left( \frac{1}{\sqrt{\sigma^{2n}\left(\prod_{i=1}^{n} r_{i-1}\right)}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{j=1}^{n}\frac{(X_j - \hat{X}_j)^2}{r_{j-1}}\right\} \right) = -\frac{1}{2}\left( n\ln\sigma^2 + \sum_{i=1}^{n}\ln r_{i-1} \right) - \frac{1}{2\sigma^2}\sum_{j=1}^{n}\frac{(X_j - \hat{X}_j)^2}{r_{j-1}}, \]

or, equivalently, to minimizing

\[ n\ln\sigma^2 + \sum_{i=1}^{n}\ln r_{i-1} + \frac{1}{\sigma^2}\sum_{j=1}^{n}\frac{(X_j - \hat{X}_j)^2}{r_{j-1}}, \]

which, after replacing $\sigma^2$ by its estimator (a function of the estimators $\hat{\boldsymbol{\phi}}$ and $\hat{\boldsymbol{\theta}}$), becomes

\[ n\ln\left(\frac{1}{n}\sum_{j=1}^{n}\frac{(X_j - \hat{X}_j)^2}{r_{j-1}}\right) + \sum_{i=1}^{n}\ln r_{i-1} + n, \]

so that maximizing the likelihood amounts to computing

\[ \hat{\sigma}^2 = \frac{1}{n}\sum_{j=1}^{n}\frac{(X_j - \hat{X}_j)^2}{r_{j-1}} \tag{47} \]

and using the predictors obtained from the parameters $\hat{\boldsymbol{\phi}}$ and $\hat{\boldsymbol{\theta}}$ that minimize

\[ \ell(\boldsymbol{\phi},\boldsymbol{\theta}) = \ln\left(\frac{1}{n}\sum_{j=1}^{n}\frac{(X_j - \hat{X}_j)^2}{r_{j-1}}\right) + n^{-1}\sum_{i=1}^{n}\ln r_{i-1}. \tag{48} \]

(Note that this is a difficult exercise.)
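In practice this minimization is done numerically. In R, arima() with method="ML" maximizes a Gaussian likelihood of this form (internally via a state-space representation, which is equivalent to the innovations form); a short sketch on simulated data:

> set.seed(1)
> W=arima.sim(model=list(ar=0.5,ma=0.4),n=1000)
> fit=arima(W,order=c(1,0,1),include.mean=FALSE,method="ML")
> fit$coef     # maximum likelihood estimates of phi and theta
> fit$sigma2   # corresponds to the estimator (47)
> fit$loglik   # the maximized log-likelihood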


Math 4506 (Fall 2019) December 2, 2019
Prof. Christian Benes

Lecture #25: AIC, Model Diagnostics

Reference. Section 6.5 and Chapter 8 from the textbook.

25.1 The Akaike Information Criterion

The Akaike criterion assigns to each model (i.e., to each pair $(p, q)$) a numerical value related to the likelihood of the model given the data. One then chooses $p, q, \hat{\boldsymbol{\phi}}_p, \hat{\boldsymbol{\theta}}_q$ in such a way as to minimize

\[ \mathrm{AIC} = -2\ln(\text{Likelihood}) + 2(p + q + 1). \]

The smaller the AIC, the larger the likelihood. Note that sometimes the AIC is taken to be (this is the case with R)

\[ \mathrm{AIC} = -2\ln(\text{Likelihood}) + 2(p + q + 2), \]

but the values of $p$ and $q$ that minimize it are the same in both cases, so that for our purpose either choice is fine.

There is an improved version, which is as follows: the AICC (Akaike's Information Corrected Criterion) is

\[ \mathrm{AICC} = \mathrm{AIC} + \frac{2(p + q + 1)(p + q + 3)}{n - (p + q + 3)}. \]
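Since the AIC reported by R already includes its own parameter count, the AICC can be computed from it with a small helper (a sketch implementing the correction term above; p and q must be supplied by hand, and the helper name is ours):

> aicc=function(fit,p,q){
+ n=length(residuals(fit))
+ AIC(fit)+2*(p+q+1)*(p+q+3)/(n-(p+q+3))
+ }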

We will now see how this can be used in a specific example:

Example 25.1. Suppose X is the ARMA(1,1) process defined by

Xt = 0.5Xt−1 + Zt + 0.4Zt−1.

We can simulate the process as follows:

> Z=rnorm(10000)

> X=Z

> for (i in 2:10000) X[i]=0.5*X[i-1]+Z[i]+0.4*Z[i-1]

Not too surprisingly (since we know what the actual process is), neither the ACF nor the PACF suggests that an MA or AR model is appropriate. However, they do suggest that a stationary model might be, so we look for the best fit among ARMA processes with $p, q \le 5$:

> m=matrix(0,6,6)

> for (i in 0:5) for (j in 0:5) m[i+1,j+1]=AIC(arima(X,order=c(i,0,j)))


> m

         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
[1,] 35627.07 29773.30 28743.51 28496.02 28466.50 28445.61
[2,] 29174.76 28435.98 28437.21 28439.21 28437.70 28437.46
[3,] 28558.52 28437.20 28435.47 28439.79 28437.97 28438.71
[4,] 28451.61 28439.18 28437.47 28439.12 28439.91 28440.69
[5,] 28445.16 28435.51 28437.51 28439.36 28439.20 28441.21
[6,] 28443.23 28437.51 28439.48 28440.73 28441.20 28443.05

This shows that the smallest AIC value is 28435.47, obtained by the ARMA(2,2) model (row $p+1 = 3$, column $q+1 = 3$). The AIC of the ARMA(1,1) model, which is the true model here, is 28435.98, only very slightly larger, so both models should certainly be considered. Note that if you were to perform the same experiment again, the AICs might suggest altogether different models (though with such a large data set, the model chosen using the AIC is likely to be at least close to the right one).

25.2 Residuals

If a model obtained for a time series is good, it should account for all the structure that is present in that time series. In other words, anything it doesn't account for should be "completely random". What the model doesn't account for (that is, what's left once we look at the difference between the data and the model) is what we call the residuals. If the residuals for our model are white noise, we have a reasonable model and can pat ourselves on the back.

Given a time series $X$, the innovations algorithm computes at each time $t$ a predictor $\hat{X}_{t+1}$, using the recursively computed values of $v_i$ and $\theta_{i,j}$.

The fact that the innovations $X_n - \hat{X}_n$ are uncorrelated then implies that the rescaled residuals

\[ R_n := \frac{X_n - \hat{X}_n}{\sqrt{s_{n-1}}} \sim WN(0, 1), \]

where $s_{n-1} = E[(X_n - \hat{X}_n)^2]$.

Now, in reality, the predictor $\hat{X}_n$ is not known exactly: it depends on the parameters $\boldsymbol{\theta}$, $\boldsymbol{\phi}$, and $\sigma^2$, for which one can only try to guess the true values. If the predictor is based on the maximum likelihood estimators for the parameters and the model is right, then

\[ W_n := \frac{X_n - \hat{X}_n(\hat{\boldsymbol{\theta}}, \hat{\boldsymbol{\phi}})}{\sqrt{r_{n-1}}} \overset{\text{approx.}}{\sim} WN(0, \sigma^2). \]

The rescaled residuals are then defined to be

\[ R_n = \frac{W_n}{\hat{\sigma}}. \]

If the fitted model is appropriate, $R_n$ should look like $WN(0,1)$. Here are a few ways of verifying that this is indeed the case:


25.2.1 Checking Normality of Residuals

To check if $\{x_i\}_{1\le i\le n}$ could be an independent sample from a $N(0,1)$ distribution, we have several tools at our disposal. You probably saw how to perform goodness of fit tests in your statistics class. Another approach is to consider quantile-quantile plots (also called qq plots).

Assume the data $\{x_i\}_{1\le i\le n}$ are in increasing order and let $\Phi$ be the standard normal c.d.f. For $1 \le i \le n$, let

\[ p_i = \frac{i}{n+1} \quad \text{and} \quad q_{p_i} = \Phi^{-1}(p_i). \]

The plot of the points $(x_i, q_{p_i})$ is a quantile-quantile plot; it shows the data together with the equiprobable quantiles based on the sample size. If the data come from a standard normal distribution, the points in this plot should be close to perfectly lined up.
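This is essentially what R's qqnorm() command does (with a slightly different choice of plotting positions). A hand-made sketch, for a numeric sample x (a hypothetical name):

> x=sort(x)                 # put the data in increasing order
> n=length(x)
> q=qnorm((1:n)/(n+1))      # equiprobable standard normal quantiles
> plot(x,q); abline(0,1)    # points should be close to the line for N(0,1) data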

25.2.2 Checking If the Residuals Are White Noise

There are a few methods we've seen already to determine if the residuals could be white noise. Here is a quick reminder of what they are:

• Straightforward examination of the ACF (for instance, roughly 95% of the sample autocorrelations should fall within $(-1.96/\sqrt{n},\, 1.96/\sqrt{n})$). Note that this is not rigorous and should be used only to give you a general idea of whether you are in the presence of white noise or not.

• The Portmanteau test.

• The turning point test.

Given a fitted model for a time series, R will compute the rescaled residuals if instructed to do so. All that is then left to do to convince ourselves that our model is appropriate is to verify that there is no evidence against the residuals being $WN(0,1)$.
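In R, the Portmanteau (Ljung-Box) test is available as Box.test(); here is a small sketch on a simulated AR(1) fit (tsdiag(), used below, plots p-values of the same statistic at several lags):

> set.seed(2)
> W=arima.sim(model=list(ar=0.5),n=500)
> fit=arima(W,order=c(1,0,0))
> Box.test(residuals(fit),lag=20,type="Ljung-Box",fitdf=1)   # fitdf = p+q; a large p-value is consistent with white noise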

Example 25.2. We will look at a data set containing the Euro/Dollar exchange rates (the value of one $US in Euros) from May 6, 2010 to May 6, 2011 (both dates included). This data set can be obtained at

http://userhome.brooklyn.cuny.edu/cbenes/Euro-Dollar.txt

Note that the data set goes backwards in time when going from top to bottom, so we'll need to reverse it. We start by importing this data set:

> www="http://userhome.brooklyn.cuny.edu/cbenes/Euro-Dollar.txt"

> ED=read.table(www,header=T)


Now we make the data set go forward in time as follows:

> ed = c()

> for (i in 1:366) ed[i]=ED[367-i,1]

We first take a look at this data set:

> plot(ed,type="l")

[Figure: plot of ed, the value of one $US in Euros, May 6, 2010 - May 6, 2011.]

We can then look at the ACF and PACF of this time series:

> acf(ed)

[Figure: the sample autocorrelation function of ed.]

> pacf(ed)


[Figure: the sample partial autocorrelation function of ed.]

The PACF suggests that an AR(2) process may be appropriate. However, the fact that the ACF decays so slowly should raise some skepticism.

Let’s see what happens if we do look for an AR(2) model for this time series:

> ar2=arima(ed,order=c(2,0,0))

> ar2

Call:

arima(x = ed, order = c(2, 0, 0))

Coefficients:

ar1 ar2 intercept

1.3644 -0.3703 0.7399

s.e. 0.0491 0.0493 0.0261

sigma2 estimated as 1.270e-05: log likelihood = 1541.23, aic = -3074.46

We first check if the residuals look normal:

> qqnorm(ar2$resid)

[Figure: normal Q-Q plot of the residuals of ar2.]


The relatively straight quantile-quantile plot suggests that the residuals could indeed be normal.

To determine if this model is appropriate, we use the following command, which gives us a plot of the residuals, of their ACF, and p-values for the Ljung-Box statistic at various lags.

> tsdiag(ar2)

[Figure: output of tsdiag(ar2): standardized residuals, ACF of residuals, and p-values for the Ljung-Box statistic.]

We can also obtain details of the residuals with the following command:

> resid(ar2)

Since the p-values of the Ljung-Box statistic are greater than 5% for all lags shown by R, the diagnostic plot convinces us that the residuals could very well be a realization of white noise, so our model appears to be appropriate. Using the following command to get the standard deviation of the residuals

> sd(ar2$resid)

[1] 0.003567808

we see that the fitted model is

\[ X_t - 0.7399 = 1.3644(X_{t-1} - 0.7399) - 0.3703(X_{t-2} - 0.7399) + 0.003567808\cdot WN(0,1). \]


The model we found is adequate, as shown by inspection of the residuals, but the ACF should have led us away from the AR(2) model. Indeed, the ACF of an AR process generally decays exponentially (you know at least that this is the case for AR(1) processes) or is a damped sine wave, i.e., an oscillating function stuck between a positive exponentially decaying function and its negative. Since this is not what we observed in the sample ACF of the Euro-Dollar time series, the AR(2) model is probably wrong (though it was adequate).

When the ACF decays very slowly, differencing is usually recommended, as a slowly decaying ACF is symptomatic of a time series that may not be stationary. Think for instance of random walk, for which we've seen that the ACF decays very slowly and the PACF levels off after lag 1, which is very similar to what we observed in the Euro-Dollar data. While random walk is not a stationary process, if we difference it, we get white noise, which certainly is stationary. So let's revisit the example.

First, let’s create the lag-1 differenced time series:

> Diffed=diff(ed)

Then let’s try to get some visual information about that series:

> plot(Diffed,type="l")

[Figure: plot of the differenced series Diffed.]

We can then look at the ACF and PACF of this time series:

> acf(Diffed)

[Figure: the sample autocorrelation function of Diffed.]


> pacf(Diffed)

[Figure: the sample partial autocorrelation function of Diffed.]

From these pictures, it isn't clear that the process is an AR or an MA process, so we look for the best ARMA process among all those with $p \le 4$ and $q \le 4$ (higher values of $p$ or $q$ are generally not advised for prediction purposes):

> m=matrix(0,5,5)

> for (i in 0:4)

+ for (j in 0:4) m[i+1,j+1]=AIC(arima(Diffed,order=c(i,0,j)))

The matrix m then contains the AIC for all ARMA($p$, $q$) models with $0 \le p, q \le 4$, from which we see (check it at home if you don't believe me) that the model with the lowest AIC is an ARMA(2,3) model.

We can now check if the model is adequate:

> arma23=arima(Diffed,order=c(2,0,3))

> qqnorm(arma23$res)

[Figure: normal Q-Q plot of the residuals of arma23.]

The quantile-quantile plot suggests that the residuals could well be normal.

> tsdiag(arma23)


[Figure: output of tsdiag(arma23): standardized residuals, ACF of residuals, and p-values for the Ljung-Box statistic.]

The ACF of the residuals clearly suggests that the residuals could very well be white noise, so again the model is adequate. Moreover, the p-values of the Ljung-Box statistic are considerably larger, so we should be more confident that these residuals are white noise than we were for the residuals of the non-differenced series.


Math 4506 (Fall 2019) December 4, 2019
Prof. Christian Benes

Lecture #26: Forecasting

26.1 Exponential Smoothing

Exponential smoothing is a smoothing method which attempts to extract the "true" trend of the time series by assuming that this trend is determined by the entire time series up to any given time, but with less and less weight attached to times that are far back in time. More precisely, for some fixed $\alpha \in [0,1]$, the moving averages $m_t$ are defined as follows:

\[ m_n = \sum_{j=0}^{n-2} \alpha(1-\alpha)^j X_{n-j} + (1-\alpha)^{n-1} X_1. \]

Note that the weights add up to 1. Indeed,

\[ \sum_{j=0}^{n-2} \alpha(1-\alpha)^j + (1-\alpha)^{n-1} = \alpha\left( \frac{1}{1-(1-\alpha)} - \frac{(1-\alpha)^{n-1}}{1-(1-\alpha)} \right) + (1-\alpha)^{n-1} = 1. \]

Note also that if $\alpha$ is close to 1, then $m_n$ is basically equal to $X_n$, and if $\alpha$ is close to 0, then $m_n$ is basically a weighted mean of all values of the time series up to the present.

Note that $m_{n-1} = \sum_{j=1}^{n-2} \alpha(1-\alpha)^{j-1}X_{n-j} + (1-\alpha)^{n-2}X_1$, so

\[ m_n = (1-\alpha)m_{n-1} + \alpha X_n. \]

The $h$-step predictor based on exponential smoothing is

\[ P_n X_{n+h} = m_n. \]

Exponential smoothing doesn't assume anything about the shape and form of the underlying trend and possible cyclical behavior of the time series. It therefore lends itself particularly well to time series for which we don't have any a priori information regarding a possible cyclical behavior or clearly-defined trend. We will see later how to handle a trend or periodic component by extending the idea of exponential smoothing, which, together with its extensions, is also called the Holt-Winters method.
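The recursion is trivial to implement by hand; here is a sketch (our own function; HoltWinters(), used below, automates this with its own choice of starting value and an optimized alpha):

> expsmooth=function(x,alpha){
+ m=numeric(length(x))
+ m[1]=x[1]                                           # m_1 = X_1
+ for (n in 2:length(x)) m[n]=(1-alpha)*m[n-1]+alpha*x[n]
+ m
+ }
> m=expsmooth(as.numeric(LakeHuron),alpha=0.2)         # alpha chosen arbitrarily here
> tail(m,1)                                            # P_n X_{n+h} = m_n for every h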

We look at the Lake Huron levels time series.

Example 26.1. You are free to choose the value of the parameter $\alpha$, but if you don't, R will choose it in such a way that it minimizes the sum of the squares of the one-step prediction errors

\[ \sum_{i=1}^{n-1} (m_i - X_{i+1})^2. \]


Exponential smoothing is a particular case of the Holt-Winters method, which contains two additional parameters (beta and gamma). For simple exponential smoothing, these parameters should be set to FALSE:

> data(LakeHuron)

> x=LakeHuron

> Huron.hw = HoltWinters(x,beta=FALSE,gamma=FALSE)

> PRED=predict(Huron.hw,n=5)

> plot(x)

> lines(PRED,col="red")

This gives the following picture. As expected from the definition above, the predictions for all future lags are the same:

[Figure: the Lake Huron series with the exponential smoothing predictions (constant in the lead time) appended in red.]

26.2 Double Exponential Smoothing

We revisit exponential smoothing by looking for a predictor that is linear (rather than constant) in $h$. Define

\[ a_1 = b_1 = X_1 \]

and, for $n > 1$,

\[ a_n = (1-\alpha)(a_{n-1} + b_{n-1}) + \alpha X_n, \qquad b_n = (1-\beta)b_{n-1} + \beta(a_n - a_{n-1}). \]

We then define the lag-$h$ predictor by

\[ P_n X_{n+h} = a_n + h b_n. \]

We now revisit the example above:


Example 26.2. Again, R will choose $\alpha$ and $\beta$ in such a way that they minimize the sum of the squares of the one-step prediction errors

\[ \sum_{i=1}^{n-1} \left(P_i X_{i+1} - X_{i+1}\right)^2. \]

For double exponential smoothing, gamma should be set to FALSE:

> data(LakeHuron)

> x=LakeHuron

> Huron.hw = HoltWinters(x,gamma=FALSE)

> PRED=predict(Huron.hw,n=5)

> plot(x)

> lines(PRED,col="red")

This gives the following picture. As expected from the definition above, the predictions now increase linearly in the lead time $h$:

[Figure: the Lake Huron series with the double exponential smoothing predictions appended in red.]

We can find the exact values of the predictions by typing

> PRED

Time Series:

Start = 1973

End = 1977

Frequency = 1


fit

[1,] 580.1901

[2,] 580.4201

[3,] 580.6502

[4,] 580.8802

[5,] 581.1103

26.3 Fitting and Predicting

In this example we will see how to make predictions using an ARMA fit for a data set:

> data(LakeHuron)

> x=LakeHuron

> m=matrix(0,6,6)

> for (i in 0:5) for (j in 0:5) m[i+1,j+1]=AIC(arima(LakeHuron,order=c(i,0,j)))

> m

         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
[1,] 335.2698 255.2950 230.9306 222.1263 222.5113 222.6902
[2,] 219.1959 214.4905 216.4645 217.8882 219.3345 221.3152
[3,] 215.2664 216.4764 218.4106 219.5158 220.3386 222.1918
[4,] 216.0377 217.8048 219.6967 221.1937 222.1703 224.1686
[5,] 217.6237 219.2071 220.4332 221.8397 223.2962 224.9742
[6,] 219.5631 220.3120 222.3383 223.2915 225.2813 226.9103

This suggests an ARMA(1,1) model for this data set:

> fit=arima(x,order=c(1,0,1))

> fit

Call:

arima(x = x, order = c(1, 0, 1))

Coefficients:

ar1 ma1 intercept

-0.4925 0.3854 0.0012

s.e. 0.0586 0.0619 0.0093

sigma2 estimated as 1.003: log likelihood = -14204.51, aic = 28417.01

Now let’s use this model to predict the value for the first 5 years following the data set:

> PREDICT=predict(fit,n.ahead=5)

> PREDICT

$pred


Time Series:

Start = 10001

End = 10005

Frequency = 1

[1] -0.30418498 0.15160347 -0.07287637 0.03768193 -0.01676901

$se

Time Series:

Start = 10001

End = 10005

Frequency = 1

[1] 1.001512 1.007245 1.008630 1.008966 1.009048

Finally, let's plot the time series with the 5 predicted values and confidence intervals for the Lake Huron levels for the five years following the end of the data set:

> plot(x,xlim=c(1875,1985))

> lines(PREDICT$pred,col="red")

> lines(PREDICT$pred+1.96*PREDICT$se,col="red",lty=3)

> lines(PREDICT$pred-1.96*PREDICT$se,col="red",lty=3)

> lines(PREDICT$pred+1.645*PREDICT$se,col="red",lty=3)

> lines(PREDICT$pred-1.645*PREDICT$se,col="red",lty=3)

This gives the predicted values up to 5 steps into the future with the 95% and 90% confidence bands (the dotted red lines):


[Figure: the time series with the 5 predicted values and the dotted confidence bands.]


Math 4506 (Fall 2019) December 9, 2019
Prof. Christian Benes

Lecture #27: Two Models Incorporating a Periodic Component: Holt-Winters and ARMA

27.1 Periodic Components

Definition 27.1. A function $f : \mathbb{R} \to \mathbb{R}$ has period $d$ if for every $x \in \mathbb{R}$,

\[ f(x+d) = f(x). \]

One nice thing about periodic functions is that if we add functions of period $d$, we end up with another function of period $d$. Indeed, if $f$ and $g$ are periodic with period $d$,

\[ (f+g)(x+d) = f(x+d) + g(x+d) \overset{f,g \text{ periodic}}{=} f(x) + g(x) = (f+g)(x), \]

so $f+g$ is periodic.

The most natural candidates for periodic functions of period $d$ are $\sin\!\left(\frac{2\pi t}{d}\right)$ and $\cos\!\left(\frac{2\pi t}{d}\right)$, but of course functions of period $d/k$, for $k \in \mathbb{N}$, are also periodic with period $d$, so $\sin\!\left(\frac{4\pi t}{d}\right)$ and $\cos\!\left(\frac{4\pi t}{d}\right)$ are candidates as well. In fact, $\sin\!\left(\frac{2k\pi t}{d}\right)$ and $\cos\!\left(\frac{2k\pi t}{d}\right)$ are possible functions, and we may wish to consider all those with $2k \le d$ (if $2k > d$, then the period of the sine function is less than a unit of time, which will yield useless information).

27.2 Holt-Winters Method

We revisit exponential smoothing by looking for a predictor that is linear in $h$ and has a periodic component of period $p$. The process is again recursive and requires initial values, which can be defined in a number of reasonable ways. For instance, we can define

\[ a_{p+1} = X_{p+1}, \qquad b_{p+1} = \frac{X_{p+1} - X_1}{p}, \]

and for $i = 1, \dots, p+1$,

\[ s_i = X_i - X_1 - (i-1)b_{p+1}. \]

For $n > p$, let

\[ a_n = (1-\alpha)(a_{n-1} + b_{n-1}) + \alpha(X_n - s_{n-p}), \qquad b_n = (1-\beta)b_{n-1} + \beta(a_n - a_{n-1}), \]

and

\[ s_n = (1-\gamma)s_{n-p} + \gamma(X_n - a_n). \]

We then define the lag-$h$ predictor by

\[ P_n X_{n+h} = a_n + h b_n + s_{n-p+1+((h-1) \bmod p)}. \]


Example 27.1. We now revisit the airline passenger model. The time series has a natural period of 12 (in months), since we have monthly data and it is reasonable to assume that passenger numbers follow annual cycles.

> AP=AirPassengers

> LAP=log(AP)

> LAP.hw=HoltWinters(LAP)

> LAP.hw

Holt-Winters exponential smoothing with trend and additive seasonal component.

Call:

HoltWinters(x = LAP)

Smoothing parameters:

alpha: 0.3266015

beta : 0.005744138

gamma: 0.8206654

Coefficients:

[,1]

a 6.172308435

b 0.008981893

s1 -0.073201087

s2 -0.140973564

s3 -0.036703294

s4 0.014522733

s5 0.032554237

s6 0.154873570

s7 0.294317062

s8 0.276063997

s9 0.088237657

s10 -0.032657089

s11 -0.198012716

s12 -0.102863837

Let’s predict the next 4 years for the time series:

> PRED=predict(LAP.hw,n=48)

The following command gives some space to the plot for predictions to be added:

> plot(LAP,xlim=c(1949,1965),ylim=c(4.5,7))


[Figure: the log airline passenger series plotted with room left for the predictions to be added.]

> lines(PRED,col="red")

[Figure: the log airline passenger series with the 4 years of Holt-Winters predictions in red.]

Note 27.1. In the example above, R already knew that the period of the data was 12. In general, to use the Holt-Winters method on any given data set, you will first need to specify the frequency using, for instance, the command "A=ts(data,frequency=12)" if your data set is called "data" and you wish to call the time series with period 12 "A". You can then perform your analysis on the data set A.


27.3 A Complete ARMA-based Model with Periodic Component

Example 27.2. We now revisit the airline passenger model, finding an ARMA model with a seasonal component. We are looking for a periodic function of period 1 (in years) or 12 (in months) that would fit the data well. There are many such functions.

So in our problem, we will consider $\sin\!\left(\frac{2k\pi t}{d}\right)$ and $\cos\!\left(\frac{2k\pi t}{d}\right)$ with $d = 12$ and $k = 1, \dots, 6$:

> AP=AirPassengers

> LAP=log(AP)

> t=time(LAP)

> t2=t^2

> COS=SIN=matrix(nr=length(AP),nc=6)

> for (i in 1:6){

+ SIN[,i]=sin(2*pi*i*t)

+ COS[,i]=cos(2*pi*i*t)

+ }

We can now try a least squares fit with a quadratic trend and a periodic component of increasing complexity:

> LAP.lm11=lm(LAP~t+t2+COS[,1]+SIN[,1])

> plot(LAP)

> T = c()

> for (i in 1:144) T[i]=1949+(i-1)/12

> lines(T,LAP.lm11$fit,col="red")

[Figure: the log airline passenger series with the fitted curve from LAP.lm11 in red.]

> LAP.lm12=lm(LAP~t+t2+COS[,1]+SIN[,1]+COS[,2]+SIN[,2])

> lines(T,LAP.lm12$fit,col="green")


[Figure: the log airline passenger series with the fitted curve from LAP.lm12 added in green.]

> LAP.lm13=lm(LAP~t+t2+COS[,1]+SIN[,1]+COS[,2]+SIN[,2]+COS[,3]+SIN[,3])

> lines(T,LAP.lm13$fit,col="blue")

We can regress on three more sets of curves:

> LAP.lm14=lm(LAP~t+t2+COS[,1]+SIN[,1]+COS[,2]+SIN[,2]+COS[,3]+SIN[,3]+COS[,4]+SIN[,4])

> LAP.lm15=lm(LAP~t+t2+COS[,1]+SIN[,1]+COS[,2]+SIN[,2]+COS[,3]+SIN[,3]+COS[,4]+SIN[,4]+COS[,5]+SIN[,5])

> LAP.lm16=lm(LAP~t+t2+COS[,1]+SIN[,1]+COS[,2]+SIN[,2]+COS[,3]+SIN[,3]+COS[,4]+SIN[,4]+COS[,5]+SIN[,5]+COS[,6])

(note that SIN[,6] is not included, as it would only yield values of zero) and get the following curve, which fits best among those we've tried (since it has the largest number of parameters):

> plot(LAP)

> lines(T,LAP.lm16$fit,col="red")

[Figure: the log airline passenger series with the fitted curve from LAP.lm16 in red.]

To know what our regression curve is, we type

> coef(LAP.lm16)


(Intercept) t t2 COS[, 1] SIN[, 1] COS[, 2]

-1.205314e+04 1.221572e+01 -3.093389e-03 -1.471879e-01 2.807718e-02 5.679671e-02

SIN[, 2] COS[, 3] SIN[, 3] COS[, 4] SIN[, 4] COS[, 5]

5.905909e-02 -8.709331e-03 -2.731366e-02 1.111352e-02 -3.199814e-02 5.909835e-03

SIN[, 5] COS[, 6]

-2.126938e-02 -2.936203e-03

Note 27.2. We are omitting the 6th sine component: since the observation times $t$ are integer multiples of $1/12$, the argument $12\pi t$ of $\sin(12\pi t)$ is always an integer multiple of $\pi$, so this component is identically zero at the observation times.

We now check if the residuals could be modeled by a stationary time series:

> acf(LAP.lm16$res)

[Figure: the sample ACF of LAP.lm16$res.]

> pacf(LAP.lm16$res)

[Figure: the sample PACF of LAP.lm16$res.]

The ACF and PACF suggest that the residuals from our least-squares fit could be modeled by an AR(1) process. We now check if this impression is validated:


> LAP.ar=arima(resid(LAP.lm16),order=c(1,0,0))

> LAP.ar

Call:

arima(x = resid(LAP.lm16), order = c(1, 0, 0))

Coefficients:

ar1 intercept

0.6732 0.0006

s.e. 0.0612 0.0085

sigma^2 estimated as 0.001144: log likelihood = 283.07, aic = -560.14

We look at the residuals of our AR model (note that we include "[-1]" in our command, as the residuals of the AR model are undefined at the first time) and get

> acf(LAP.ar$res[-1])

[Figure: the sample ACF of LAP.ar$res[-1].]

> tsdiag(LAP.ar)

[Figure: output of tsdiag(LAP.ar): standardized residuals, ACF of residuals, and p-values for the Ljung-Box statistic.]

Since the residuals from the random component could be white noise, we have found an adequate model for the Air Passenger time series $X_t$:

\begin{align*}
\ln X_t ={}& -12053 + 12.22t - 0.0031t^2 - 0.1472\cos(2\pi t) + 0.028\sin(2\pi t) + 0.0568\cos(4\pi t) \\
& + 0.059\sin(4\pi t) - 0.0087\cos(6\pi t) - 0.0273\sin(6\pi t) + 0.0111\cos(8\pi t) - 0.032\sin(8\pi t) \\
& + 0.0059\cos(10\pi t) - 0.0213\sin(10\pi t) - 0.0029\cos(12\pi t) + Y_t,
\end{align*}

where $Y_t$ is an AR(1) process with mean 0.0006 and $\phi = 0.6732$.

Now this can be used to predict $X_t$ at future times, by just plugging $t$ into the least squares curve (which is non-random, so there's nothing to predict there) and by performing a prediction for $Y_t$, which is something we know how to do since $Y_t$ is an AR process.
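For instance, the stochastic part of a 12-month forecast can be obtained directly from the fitted AR(1) model for the residuals (a sketch using the objects above; the deterministic part is obtained by plugging the corresponding future times into the fitted regression curve, and exponentiating the sum gives forecasts of the passenger counts themselves):

> Y.fc=predict(LAP.ar,n.ahead=12)$pred    # AR(1) forecast of Y_t for the next 12 months
> t.new=1961+(0:11)/12                    # the corresponding future times (the data end in December 1960)
> # ln X_t is forecast by evaluating the fitted regression curve at t.new and adding Y.fc;
> # exp() of that sum gives the forecast of the number of passengers.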


Math 4506 (Fall 2019) December 11, 2019
Prof. Christian Benes

Lecture #28: Q&A
