Brooklyn College, CUNY
Math 4506 – Time Series
Lecture Notes
Fall 2019
Christian Benes
http://userhome.brooklyn.cuny.edu/cbenes/timeseries.html
Math 4506 (Fall 2019), August 28, 2019. Prof. Christian Benes
Lecture #1: Introduction; Probability Review
1.1 What this course is about: time series and time series models
Essentially, all models are wrong, but some are useful.
George E. P. Box
Probabilists study time series models. These are abstract random objects which are completely well-defined and can generate sets of data (using random number generators).
Statisticians study time series (which are data sets) and try to find the right model for them, that is, the time series model from which the data could have been generated.
In that sense, probabilists and statisticians do opposite jobs, the first being (arguably) more elegant, the second being (definitely) more practical.
Below are some examples of time series. The first three are “real-world” data. The following 6 are computer-generated. Our goal in this course will be to find ways to construct models from which these data could have arisen.
[Figure: Baltimore city annual water use, liters per capita per day, 1890-1968]
[Figure: Daily value of one $US in Euros, May 6, 2010 - May 6, 2011]
[Figure: Closing value of NASDAQ 100 index, July 25 2008 - January 23, 2009]
[Figure: ten random data points]
What can we say about the underlying distribution?
[Figure: ten data points from a second data set]
What about these 10 data points?
[Figure: ten data points from a third data set]
Same question.
Scale is important when visualizing data. Here are the same data sets as on the previous page, shown all three at the same scale:
[Figure: the three data sets above, each plotted on the same vertical scale from -2 to 2]
It turns out that these data are drawn from the (multivariate) normal distributions N(0, Σ1), N(0, Σ2), N(0, Σ3), respectively, where Σ1 is the 10 × 10 identity matrix, Σ2 is the 10 × 10 matrix with 1's on the diagonal and 4/5 in every off-diagonal entry, and Σ3 is the 10 × 10 matrix with 1's on the diagonal and 24/25 in every off-diagonal entry.
If you’re not sure what this means, don’t worry. Details are coming up. In a nutshell, the samples of the first data set are drawn from independent normal random variables, while those from the other two sets are drawn from a family of pairwise positively correlated random variables (with covariances 4/5 in the first case and 24/25 in the second).
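To see the effect of the covariance structure concretely, here is a small simulation sketch (in Python with NumPy rather than the course's R, purely for illustration; the sample size and seed are arbitrary choices). It draws many vectors from N(0, Σ1) and N(0, Σ2) and compares the empirical covariances.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10          # dimension, as in the examples above
N = 10_000      # number of simulated vectors (arbitrary, just large)

# Sigma1: independent components; Sigma2: pairwise covariance 4/5
Sigma1 = np.eye(n)
Sigma2 = np.full((n, n), 0.8)
np.fill_diagonal(Sigma2, 1.0)

X1 = rng.multivariate_normal(np.zeros(n), Sigma1, size=N)
X2 = rng.multivariate_normal(np.zeros(n), Sigma2, size=N)

# Empirical covariance matrices, which should be close to Sigma1 and Sigma2
S1 = np.cov(X1, rowvar=False)
S2 = np.cov(X2, rowvar=False)

off = ~np.eye(n, dtype=bool)          # mask selecting off-diagonal entries
print(S1[off].mean())                 # near 0
print(S2[off].mean())                 # near 0.8
```

The average off-diagonal entry is near 0 for the independent case and near 4/5 for the correlated one, which is exactly the qualitative difference visible in the plots.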
The main purpose of time series modeling is to come up (as one would expect) with the stochastic process (time series model) from which the observed data (time series) is a realization. This is an impossible task, as suggested by the quote at the beginning of this lecture.
Randomness in the real world is simply too complex to grasp completely. However, there are ways to determine, according to some (sometimes subjective) criteria, which models work better and which models don't work as well in a given setting.
Here’s where finding a model for data is tricky: There are many choices for a model which at first (and even second) glance seem reasonable for a given data set. I am sure none of you would have been shocked if I had told you that the second to last data set above was drawn from independent normal random variables with mean 0 and standard deviation 1/2. Nor would you have been very troubled if I’d suggested that they were generated using independent exponential random variables with mean 1.
This illustrates the fact that in time series modeling, one often has a choice between a number of models (in the case I just mentioned, types of random variables) and, within these, a number of parameters (means, variances, covariances, etc.).
In this course, you will be exposed to a number of models which all depend on a number of parameters. There usually isn't a systematic way to choose a model (and the corresponding parameters), so modeling usually requires a fair dose of theoretical understanding (to determine if a model is even acceptable in a given setting) and flair (since all models are wrong, experience comes in handy when trying to find one that is better than others).
Since the title of this course is Time Series, it might be useful if we know what a time series is!
Definition 1.1. A time series is simply a set of observations xt, with each data point being observed at a specific time t.
A time series model is a set of random variables Xt, each of which corresponds to a specific time t.
Notation
The symbol A := B means A is defined to equal B, whereas C = D by itself means simply that C and D are equal. This is an important distinction because if you write A := B, then there is no need to verify the equality of A and B. They are equal by definition. However, if C = D, then there IS something that needs to be proved, namely the equality of C and D (which might not be obvious).
For example, you may recall that for a random variable X,
Var(X) := E[(X − E[X])^2]
and
Var(X) = E[X^2] − E[X]^2.
1.2 Introduction to Random Variables
While writing my book [Stochastic Processes] I had an argument with Feller. He asserted that everyone said “random variable” and I asserted that everyone said “chance variable.” We obviously had to use the same name in our books, so we decided the issue by a stochastic procedure. That is, we tossed for it and he won.
Joe Doob
In probability, Ω is used to denote the sample space of outcomes of an experiment.
Example 1.1. Toss a die once: Ω = {1, 2, 3, 4, 5, 6}.
Example 1.2. Toss two dice: Ω = {(i, j) : 1 ≤ i ≤ 6, 1 ≤ j ≤ 6}.
Note that in each case Ω is a finite set. (That is, the cardinality of Ω, written |Ω|, is finite.)
Example 1.3. Consider a needle attached to a spinning wheel centred at the origin. When the wheel is spun, the angle ω made by the tip of the needle with the positive x-axis is measured. The possible values of ω are Ω = [0, 2π).
In this case, Ω is an uncountably infinite set. (That is, Ω is uncountable with |Ω| = ∞.)
Definition 1.2. A random variable X is a function from the sample space Ω to the real numbers R = (−∞, ∞). Symbolically,
X : Ω → R, ω ↦ X(ω).
Example 1.4. (1.1 continued). Let X denote the upmost face when a die is tossed. Then, X(i) = i, i = 1, . . . , 6.
Example 1.5. (1.2 continued). Let X denote the sum of the upmost faces when two dice are tossed. Then, X((i, j)) = i + j, i = 1, . . . , 6, j = 1, . . . , 6. Note that the elements of Ω are ordered pairs, so that the function X(·) acts on (i, j) giving X((i, j)). We will often omit the inner parentheses and simply write X(i, j).
Example 1.6. (1.3 continued). Let X denote the cosine of the angle made by the needleon the spinning wheel and the positive x-axis. Then X(ω) = cos(ω) so that X(ω) ∈ [−1, 1].
Remark. As mentioned in the definition, a random variable is really a function whose input variable is random, that is, determined by chance (or God, or destiny, or karma, or whatever you think decides how our world works). The use of the notation X and X(ω) is EXACTLY the same as the use of f and f(x) in elementary calculus. For example, f(x) = x^2, f(t) = t^2, f(ω) = ω^2, and X(ω) = ω^2 all describe EXACTLY the same function (at least if we assume the domains are the same), namely, the function which takes a number and squares it.
What makes random variables slightly more complicated than functions is that, unlike thevariable x from calculus, the variable ω is random and therefore comes from a distribution.
1.3 Discrete and Continuous Random Variables
Definition 1.3. Suppose that X is a random variable. Suppose that there exists a function f : R → R with the properties that f(x) ≥ 0 for all x, ∫_{−∞}^{∞} f(x) dx = 1, and
P({ω ∈ Ω : X(ω) ≤ a}) =: P(X ≤ a) = ∫_{−∞}^{a} f(x) dx.
We call f the (probability) density (function) of X and say that X is a continuous random variable. Furthermore, the function F defined by F(a) := P(X ≤ a) is called the (probability) distribution (function) of X.
Note 1.1. By the Fundamental Theorem of Calculus, F'(x) = f(x).
Remark. There exist continuous random variables which do not have densities. Although it's good to know that the definition of continuous random variables is slightly more general than what is suggested above, you won't need to worry about it in this course.
Example 1.7. A random variable X is said to be normally distributed with parameters µ, σ^2, if the density of X is
f(x) = (1/(σ√(2π))) exp(−(x − µ)^2/(2σ^2)), −∞ < µ < ∞, 0 < σ < ∞.
This is sometimes written X ∼ N(µ, σ^2). In Exercise 1.2, you will show that the mean of X is µ and the variance of X is σ^2.
Definition 1.4. Suppose that X is a random variable. Suppose that there exists a function p : Z → R with the properties that p(k) ≥ 0 for all k, Σ_{k=−∞}^{∞} p(k) = 1, and
P({ω ∈ Ω : X(ω) ≤ N}) =: P(X ≤ N) = Σ_{k=−∞}^{N} p(k).
We call p the (probability mass function or) density of X and say that X is a discrete random variable. Furthermore, the function F defined by F(N) := P(X ≤ N) is called the (probability) distribution (function) of X.
Example 1.8. (1.2 continued). If X is defined to be the sum of the upmost faces when two dice are tossed, then the density of X, written p(k) := P(X = k), is given by
p(2) p(3) p(4) p(5) p(6) p(7) p(8) p(9) p(10) p(11) p(12)
1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
and p(k) = 0 for any other k ∈ Z.
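The table above can be checked by brute-force enumeration of the 36 equally likely outcomes. A short sketch (in Python, for illustration; the course itself works in R):

```python
from fractions import Fraction
from collections import Counter

# Enumerate all 36 equally likely outcomes (i, j) and tally each possible sum
counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))
p = {k: Fraction(c, 36) for k, c in counts.items()}

print(p[7])   # 1/6
print(p[2])   # 1/36
```

Exact fractions avoid any floating-point noise, and the probabilities sum to 1 as they must.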
Remark. There do exist random variables which are neither discrete nor continuous; however, such random variables will not concern us.
1.4 Expectation and Variance
Suppose that X : Ω → R is a random variable (either discrete or continuous), and that g : R → R is a (piecewise) continuous function. Then Y := g ∘ X : Ω → R defined by Y(ω) = g(X(ω)) is also a random variable. We usually write Y = g(X).
We now define the expectation of the random variable Y , distinguishing the discrete andcontinuous cases.
Definition 1.5. If X is a discrete random variable and g is as above, then the expectation of g ∘ X is given by
E[g(X)] := Σ_k g(k) p(k),
where p is the probability mass function of X.
Definition 1.6. If X is a continuous random variable and g is as above, then the expectation of g ∘ X is given by
E[g(X)] := ∫_{−∞}^{∞} g(x) f(x) dx,
where f is the probability density function of X.
Notice that if g is the identity function (that is, g(x) = x for all x), we get the expectation of X itself:
• E[X] := Σ_k k p(k), if X is discrete, and
• E[X] := ∫_{−∞}^{∞} x f(x) dx, if X is continuous.
µ := E[X] is also called the mean of X. Note that −∞ ≤ µ ≤ ∞. If −∞ < µ < ∞, then we say that X has a finite mean, or that X is an integrable random variable, and we write X ∈ L1.
Exercise 1.1. Suppose that X is a Cauchy random variable. That is, X is a continuous random variable with density function
f(x) = 1/(π(x^2 + 1)).
Carefully show that X ∉ L1 (that is, X doesn't have a finite mean).
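A numerical illustration (not a proof) of why the Cauchy mean fails to exist: the truncated integral ∫_0^T x f(x) dx equals log(1 + T^2)/(2π), which grows without bound as T → ∞, so E|X| = ∞. The sketch below (Python; the step count is an arbitrary choice) approximates it with the trapezoid rule:

```python
import numpy as np

def truncated_first_moment(T, steps=200_001):
    # Trapezoid-rule approximation of the integral of x * f(x) on [0, T],
    # where f(x) = 1 / (pi * (1 + x^2)) is the Cauchy density
    x = np.linspace(0.0, T, steps)
    fx = x / (np.pi * (1.0 + x ** 2))
    dx = x[1] - x[0]
    return float((fx[:-1] + fx[1:]).sum() * dx / 2)

# Closed form for comparison: log(1 + T^2) / (2 pi), which diverges as T grows
for T in (10, 100, 1000):
    print(T, truncated_first_moment(T))
```

The printed values keep growing (roughly by log(10)/π per factor of 10 in T), matching the closed form.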
Theorem 1.1 (Linearity of Expectation). Suppose that X : Ω → R and Y : Ω → R are (discrete or continuous) random variables with X ∈ L1 and Y ∈ L1. Suppose also that f : R → R and g : R → R are both (piecewise) continuous and such that f(X) ∈ L1 and g(Y) ∈ L1. Then, for any a, b ∈ R, af(X) + bg(Y) ∈ L1 and, furthermore,
E[af(X) + bg(Y )] = aE[f(X)] + bE[g(Y )].
Using Definitions 1.5 and 1.6, we can compute the kth moments E[X^k] of a random variable X. One frequent assumption about a random variable is that it has a finite second moment. This is to ensure that the Central Limit Theorem can be used.
Definition 1.7. If X is a random variable with E[X^2] < ∞, then we say that X has a finite second moment and write X ∈ L2. If X ∈ L2, then we define the variance of X to be the number σ^2 := E[(X − µ)^2]. The standard deviation of X is the number σ := √(σ^2). (As usual, this is the positive square root.)
Remark. It is an important fact that if X ∈ L2, then it must be the case that X ∈ L1.
The following is a useful formula when computing variances (people sometimes confuse it with the definition of variance, which it's not; for the definition, see above).
Theorem 1.2. Suppose X ∈ L2. Then
Var(X) = E[X^2] − E[X]^2.
Proof. By linearity of expectation,
Var(X) = E[(X − µ)^2] = E[X^2 − 2µX + µ^2] = E[X^2] − E[2µX] + E[µ^2]
= E[X^2] − 2µE[X] + µ^2 = E[X^2] − 2µ^2 + µ^2 = E[X^2] − µ^2 = E[X^2] − E[X]^2.
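The identity can also be sanity-checked in exact arithmetic on a small example, say a fair die (a sketch using Python's fractions; the die is just a convenient discrete distribution):

```python
from fractions import Fraction

# Fair die: p(k) = 1/6 for k = 1, ..., 6
p = {k: Fraction(1, 6) for k in range(1, 7)}

mu  = sum(k * pk for k, pk in p.items())        # E[X]
EX2 = sum(k * k * pk for k, pk in p.items())    # E[X^2]

var_def  = sum((k - mu) ** 2 * pk for k, pk in p.items())  # E[(X - mu)^2]
var_form = EX2 - mu ** 2                                   # E[X^2] - E[X]^2

print(mu, var_def, var_form)  # 7/2 35/12 35/12
```

Both ways of computing the variance give exactly 35/12, as Theorem 1.2 guarantees.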
The following exercise is a little bit tedious, but you should make sure you know how to do it. If you remember doing it and remember well how it works, feel free to skip it. Since this lecture and the next are mostly review, I am including several exercises which are meant to refresh your memory on some basic ideas from probability but which you may know very well how to do already. That's why I'm including the comment "(optional)" next to them. I will not include these problems on the homework assignments.
Exercise 1.2. (optional) The purpose of this exercise is to make sure you can compute some straightforward (but messy) integrals [Hint: A change of variables will make them easier to handle.]. Suppose that X ∼ N(µ, σ^2); that is, X is a normally distributed random variable with parameters µ, σ^2. (See Example 1.7 for the density of X.) Show directly (without using any unstated properties of expectations or distributions) that
• E[X] = µ,
• E[X^2] = σ^2 + µ^2, and
• E[e^{−θX}] = exp(−(θµ − σ^2θ^2/2)), for 0 ≤ θ < ∞.
• Var(X) = σ^2. [Note that this follows from the first two parts and Theorem 1.2.]
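A Monte Carlo sanity check of these identities (a sketch, not a substitute for the exercise; µ = 1, σ = 2, θ = 1/2 are arbitrary choices, and the sample size just makes the estimates stable):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 2.0
X = rng.normal(mu, sigma, size=1_000_000)

theta = 0.5
# exp(-(theta*mu - sigma^2 * theta^2 / 2)); with these numbers it equals exp(0) = 1
mgf_exact = np.exp(-(theta * mu - sigma ** 2 * theta ** 2 / 2))

print(X.mean())                   # near mu = 1
print((X ** 2).mean())            # near sigma^2 + mu^2 = 5
print(np.exp(-theta * X).mean())  # near mgf_exact = 1
```

The three sample averages match the claimed values up to Monte Carlo error of order 1/√N.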
This is the reason that if X ∼ N(µ, σ^2), we say that X is normally distributed with mean µ and variance σ^2 (not just with parameters µ and σ^2).
1.5 Bivariate Random Variables
Theorem 1.3. If X and Y are random variables with X ∈ L2 and Y ∈ L2, then the product XY is a random variable with XY ∈ L1.
Definition 1.8. If X and Y are both random variables in L2, then the covariance of X and Y, written Cov(X, Y), is defined to be
Cov(X, Y) := E[(X − µX)(Y − µY)],
where µX := E[X], µY := E[Y]. Whenever the covariance of X and Y exists, we define the correlation of X and Y to be
Corr(X, Y) := Cov(X, Y)/(σX σY)   (†)
where σX is the standard deviation of X, and σY is the standard deviation of Y.
Remark. By convention, 0/0 := 0 in the definition of correlation. This arbitrary choice is designed to simplify some formulas and means that if Var(X) = 0 or Var(Y) = 0, then Corr(X, Y) = 0 (this follows from the fact that if Var(X) = 0 or Var(Y) = 0, then Cov(X, Y) = 0). Since if Var(X) = 0, X is constant (in which case we call X degenerate, which in this context just means non-random), this means that the correlation of two random variables is always 0 if one of them is degenerate.
Definition 1.9. We say that X and Y are uncorrelated if Cov(X, Y ) = 0 (or, equivalently,if Corr(X, Y ) = 0).
Fact 1.1. If X ∈ L2 and Y ∈ L2, then the following computational formulas hold:
• Cov(X, Y ) = E[XY ]− E[X]E[Y ];
• Var(X) = Cov(X,X);
Exercise 1.3. Verify the two computational formulas above. [Note that the formulas don't necessarily hold without the assumption that X ∈ L2 and Y ∈ L2, so make sure you explain why these assumptions are needed in general.]
The following result tells us how to deal with the covariance of linear combinations of random variables.
Theorem 1.4. If X, Y, Z ∈ L2 and a, b, c ∈ R, then
Cov(aX + bY + c, Z) = aCov(X,Z) + bCov(Y, Z).
Exercise 1.4. (optional) Prove Theorem 1.4.
Note 1.2. From this theorem follows another result which you already know:
Var(aX) = a^2 Var(X).
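Theorem 1.4 also holds exactly for sample covariances, since they are built from the same bilinear expression. A quick numerical check (a sketch; the data, constants, and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X, Y, Z = rng.standard_normal((3, 500))   # three samples of size 500
a, b, c = 2.0, -3.0, 5.0

def cov(U, V):
    # sample covariance with the usual 1/(n-1) normalization
    return float(np.cov(U, V)[0, 1])

lhs = cov(a * X + b * Y + c, Z)
rhs = a * cov(X, Z) + b * cov(Y, Z)
print(lhs, rhs)  # equal up to floating-point rounding
```

Note that the constant c drops out entirely, just as in the theorem.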
Definition 1.10. Two random variables X and Y are said to be independent if f(x, y), the joint density of (X, Y), can be expressed as
f(x, y) = fX(x) · fY (y)
where fX is the (marginal) density of X and fY is the (marginal) density of Y .
Remark. Notice that we have combined the cases of a discrete and a continuous random variable into one definition. You can substitute the phrases probability mass function or probability density function as appropriate.
The following result is often needed and at first glance not completely obvious.
Theorem 1.5. If X and Y are independent random variables with X ∈ L1 and Y ∈ L1, then
• the product XY is a random variable with XY ∈ L1, and
• E[XY ] = E[X]E[Y ].
Exercise 1.5. (optional) Using this theorem, quickly prove that if X and Y are independent random variables, then they are necessarily uncorrelated. (As the next exercise shows, the converse, however, is not true: there do exist uncorrelated, dependent random variables.)
Exercise 1.6. (optional) Consider the random variable X defined by P(X = −1) = 1/4, P(X = 0) = 1/2, P(X = 1) = 1/4. Let the random variable Y be defined as Y := X^2. Hence, P(Y = 0 | X = 0) = 1, P(Y = 1 | X = −1) = 1, P(Y = 1 | X = 1) = 1.
• Show that the density of Y is P (Y = 0) = 1/2, P (Y = 1) = 1/2.
• Find the joint density of (X, Y ), and show that X and Y are not independent.
• Find the density of XY , compute E[XY ], and show that X and Y are uncorrelated.
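Since everything in this exercise is finite, all three parts can be verified by direct enumeration. A sketch in exact arithmetic (the variable names are just for the example):

```python
from fractions import Fraction

# pmf of X
pX = {-1: Fraction(1, 4), 0: Fraction(1, 2), 1: Fraction(1, 4)}

# joint pmf of (X, Y) with Y = X^2: each value x of X forces Y = x^2
joint = {(x, x * x): px for x, px in pX.items()}

# marginal pmf of Y
pY = {0: joint[(0, 0)], 1: joint[(-1, 1)] + joint[(1, 1)]}

E_XY = sum(x * y * p for (x, y), p in joint.items())
E_X  = sum(x * p for x, p in pX.items())

print(pY)          # {0: 1/2, 1: 1/2}
print(E_XY, E_X)   # 0 0, so Cov(X, Y) = E[XY] - E[X]E[Y] = 0: uncorrelated
# But X and Y are NOT independent: P(X = 1, Y = 0) = 0, while P(X = 1)P(Y = 0) = 1/8
print(joint.get((1, 0), 0), pX[1] * pY[0])
```

So the pair is uncorrelated yet dependent, which is exactly the point of the exercise.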
The following result allows us to get a grip on the variance in algebraic manipulations when the random variables involved are independent:
Theorem 1.6 (Linearity of Variance in the Case of Independence). Suppose that X : Ω → R and Y : Ω → R are (discrete or continuous) random variables with X ∈ L2 and Y ∈ L2. If X and Y are independent, then X + Y ∈ L2 and
Var(X + Y ) = Var(X) + Var(Y ).
Math 4506 (Fall 2019), September 4, 2019. Prof. Christian Benes
Lecture #2: Multivariate Random Variables
2.1 Multivariate Random Variables
We just saw that pairs of random variables can be more complicated than what one might like to think. It is not enough to know the distributions of the random variables X and Y to know how they behave together.
Think of the following example: You may know the distribution of the heights (X) and weights (Y) of people in a certain population. However, this by itself will not tell you how height affects weight and vice-versa. The information on how the random variables are related is not contained in the distributions of X and Y (that is, the marginals). To have an idea of the relative behavior of random variables, one needs the correlation coefficient.
Recall:
• If we want to describe a single random variable (also called a univariate random variable), we need a density f(x), which graphically can be described as a curve (or a set of points in the discrete case) in the plane.
• If we want to describe a pair of random variables (also called bivariate random variables), we need a joint density f(x, y), which graphically can be described as a surface (or a set of points in the discrete case) in space.
This extends easily to higher dimensions:
• If we want to describe a family of n random variables, we need a joint density f(x1, . . . , xn), which graphically can be described as a hyper-surface (or a set of points in the discrete case) in (n + 1)-dimensional space.
We are usually comfortable with drawing or imagining objects in 1, 2, or 3 dimensions. In higher dimensions, we tend to get a headache before we can make sense of what we are trying to represent, so we will limit ourselves to depicting densities of univariate and bivariate random variables and will deal with the rest algebraically (and refer to pictures in dimensions ≤ 3 when we get confused and need a picture to help us out).
We will write
x = (x1, x2, . . . , xn)'
and will think of random vectors as being column vectors. Therefore, the random vector X = (X1, . . . , Xn)' has joint distribution (we will often just say distribution)
F(x1, . . . , xn) = P(X1 ≤ x1, . . . , Xn ≤ xn).
An equivalent way of writing this is
F (x) = P (X ≤ x).
Recall that if F(x, y) is a bivariate distribution (say for jointly continuous r.v.'s), then
F(x) = P(X ≤ x) = P(X ≤ x, Y ≤ ∞) = ∫_{−∞}^{x} ∫_{−∞}^{∞} f(a, b) db da = F(x, ∞).
The distributions of subsets of random variables are obtained in the same way as in 2 dimensions: If F(x1, . . . , xn) is a multivariate distribution, then, for instance,
F(x1, x2, xn) = P(X1 ≤ x1, X2 ≤ x2, X3 ≤ ∞, . . . , Xn−1 ≤ ∞, Xn ≤ xn) = F(x1, x2, ∞, . . . , ∞, xn).
For univariate random variables, you know that the p.d.f. is the derivative of the distribution function. In higher dimensions, this is true as well, but since we are dealing with functions of several variables, we have to talk about partial derivatives:
f(x1, . . . , xn) = ∂^n F(x1, . . . , xn)/(∂x1 · · · ∂xn).
The random variables X1, . . . , Xn are independent if
F (x1, . . . , xn) = FX1(x1) · · ·FXn(xn)
or, alternatively, if the joint p.d.f. (p.m.f.) is the product of the marginal p.d.f’s (p.m.f’s).
Since the random vector X = (X1, . . . , Xn)' is a vector, so is its mean E[X] = (E[X1], . . . , E[Xn])'. Since there is a covariance between any two of the Xi, there is a total of n^2 covariances, which compose the covariance matrix
ΣX = [Cov(Xi, Xj)]_{1≤i,j≤n}, the n × n matrix whose (i, j) entry is Cov(Xi, Xj).
Note that
• Since Cov(Xi, Xi) = Var(Xi), the diagonal entries of ΣX are the variances Var(X1), . . . , Var(Xn).
• Since for any i, j ∈ {1, . . . , n}, Cov(Xi, Xj) = Cov(Xj, Xi), the covariance matrix is symmetric.
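In practice, ΣX is estimated from repeated observations of the random vector. A small sketch (the data-generating step is an arbitrary choice for illustration) showing that the sample covariance matrix inherits the two properties just noted:

```python
import numpy as np

rng = np.random.default_rng(3)
# 1000 observations of a 3-dimensional random vector, one observation per row
data = rng.standard_normal((1000, 3))
data[:, 2] += data[:, 0]            # introduce some correlation between components

S = np.cov(data, rowvar=False)      # 3x3 sample covariance matrix

print(np.allclose(S, S.T))                                 # True: symmetric
print(np.allclose(np.diag(S), data.var(axis=0, ddof=1)))   # True: diagonal = variances
```

Here `rowvar=False` tells NumPy that each column is a variable and each row an observation.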
2.2 Some Basic Linear Algebra
Caveat 2.1. I may not be entirely consistent with notation in what follows. Sometimes, vectors will be represented by boldfaced symbols (x) and sometimes like this: ~x. On rare occasions, I may use the same notation as for scalars, since that notation is common as well. If that's the case, you should be able to figure out from context whether you're dealing with a vector or not.
For a k × ℓ matrix A = [ai,j] (1 ≤ i ≤ k, 1 ≤ j ≤ ℓ), an ℓ × n matrix B = [bi,j] (1 ≤ i ≤ ℓ, 1 ≤ j ≤ n), and a vector ~v = (v1, . . . , vℓ)', we have the following definitions:
• The product of the two matrices is AB = [ci,j] (1 ≤ i ≤ k, 1 ≤ j ≤ n), where
ci,j = Σ_{m=1}^{ℓ} ai,m bm,j.
• In particular, the product of the matrix A and the vector ~v is the vector A~v whose ith entry is
(A~v)i = Σ_{j=1}^{ℓ} ai,j vj, for i = 1, . . . , k.
• The transpose of the matrix A is A' = [ci,j] (1 ≤ i ≤ ℓ, 1 ≤ j ≤ k), where ci,j = aj,i.
• The determinant of a matrix A, written det(A), is something fairly easy to compute but its definition isn't exactly short, so those who can't remember it should look it up in a book on linear algebra. Wikipedia also has a definition and some examples. Note that the determinant is defined only for square matrices (with the same number of rows and columns). We say that A is singular if det(A) = 0. Otherwise, A is nonsingular.
• The following definitions are for the case k = ` (that is, A is a square matrix):
– If A is nonsingular, the inverse of A, denoted by A^{−1}, is the unique matrix such that
AA^{−1} = A^{−1}A = 1k,
where 1k denotes the k × k identity matrix (1's on the diagonal, 0's everywhere else). If it is clear from context what the dimensions of the matrix are, we write 1 = 1k.
– A is called orthogonal if A′ = A−1. In that case,
AA′ = A′A = 1.
– A is symmetric if for all 1 ≤ i, j ≤ k,
ai,j = aj,i.
– A is positive semi-definite if for all vectors ~v = (v1, . . . , vk)',
~v'A~v ≥ 0.
Theorem 2.1. If an n × n matrix A is symmetric, it can be written as
A = PΛP',
where Λ = diag(λ1, . . . , λn), the diagonal matrix with entries λ1, . . . , λn, and P is orthogonal. Here, λ1, . . . , λn are the eigenvalues of A.
Theorem 2.2. The covariance matrix of a random vector ~X is symmetric and positive semi-definite.
Proof. Symmetry is obvious. If ~v = (v1, . . . , vn)' and Σ is a covariance matrix, then
~v'Σ~v = Σ_{i,j=1}^{n} vi vj Cov(Xi, Xj) = Var(Σ_{i=1}^{n} vi Xi) ≥ 0.
Corollary 2.1. The covariance matrix Σ of a random vector ~X can be written in the form
Σ = PΛP',
where Λ = diag(λ1, . . . , λn) and P is orthogonal.
Proof. This follows from the symmetry of Σ.
Note 2.1. Since Σ is positive semi-definite, its eigenvalues λ1, . . . , λn are nonnegative, so we can define
Λ^{1/2} := diag(λ1^{1/2}, . . . , λn^{1/2})
and
B = PΛ^{1/2}P'.
Then, since PP' = P'P = 1,
B^2 = BB = PΛ^{1/2}P'PΛ^{1/2}P' = PΛP' = Σ.
Since B^2 = Σ, it makes perfect sense to define
Σ^{1/2} := PΛ^{1/2}P' = B.   (1)
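The construction of Σ^{1/2} can be carried out numerically exactly as in the note: diagonalize, take square roots of the eigenvalues, and recombine. A sketch (the particular Σ is an arbitrary positive semi-definite example):

```python
import numpy as np

Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

# Spectral decomposition Sigma = P Lambda P' (eigh is for symmetric matrices
# and returns real eigenvalues lam and an orthogonal matrix of eigenvectors P)
lam, P = np.linalg.eigh(Sigma)
B = P @ np.diag(np.sqrt(lam)) @ P.T   # B = P Lambda^{1/2} P' = Sigma^{1/2}

print(np.allclose(B @ B, Sigma))      # True: B^2 = Sigma
print(np.allclose(B, B.T))            # True: B is symmetric
```

Note that `eigh` (rather than the general `eig`) is the right tool here precisely because Σ is symmetric, which is what Theorem 2.1 exploits.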
Since we will often deal with linear transformations of random variables, the following proposition will be useful:
Proposition 2.1. If X is a random vector, a is a (nonrandom) vector, B is a matrix, and Y = BX + a, then
E[Y] = a + B E[X],
ΣY = B ΣX B'.
Proof. See first homework assignment.
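Before doing the proof, the two identities can at least be checked numerically; in fact, the analogous statements hold exactly for sample means and sample covariances, which the sketch below verifies (B, a, and the data are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 3))        # 200 observations of a 3-vector, one per row
B = np.array([[1.0, 2.0,  0.0],
              [0.0, 1.0, -1.0]])         # a 2x3 matrix
a = np.array([5.0, -3.0])

Y = X @ B.T + a                          # y_i = B x_i + a, applied to each row

SX = np.cov(X, rowvar=False)
SY = np.cov(Y, rowvar=False)

print(np.allclose(SY, B @ SX @ B.T))                         # True
print(np.allclose(Y.mean(axis=0), a + B @ X.mean(axis=0)))   # True
```

Both checks hold up to rounding because sample means and sample covariances are built from the same linear and bilinear expressions as their population counterparts.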
2.3 Multivariate Normal Random Variables
You already know that the normal distribution is the most important of them all, since the central limit theorem tells us that as soon as we start adding up random variables, a normal pops up. Recall from Lecture 1 that a normal random variable X with parameters µ, σ^2 has density
f(x) = (1/(σ√(2π))) exp(−(x − µ)^2/(2σ^2)), −∞ < µ < ∞, 0 < σ < ∞.
You should verify that this is the one-dimensional particular case of the multivariate normal density with mean µ and nonsingular covariance matrix Σ (written X ∼ N(µ, Σ)):
fX(x) = (1/((2π)^n det(Σ))^{1/2}) exp(−(1/2)(x − µ)'Σ^{−1}(x − µ)).
Note 2.2. Make sure you understand why one needs Σ to be nonsingular in order for the definition of the multivariate normal density to make sense.
Exercise 2.7. Suppose X ∼ N(0, 1), Y ∼ N(0, 2) are bivariate normal with correlation coefficient ρ(X, Y) = 1/2.
• Find the joint density of X and Y .
• Let S1 be the square with vertices (0,0), (1,0), (0,1), and (1,1) and let S2 be the square with vertices (0,0), (1,0), (0,-1), and (1,-1). Without doing any computations, explain which of P((X, Y) ∈ S1) and P((X, Y) ∈ S2) should be greater.
You probably recall that if X ∼ N(µ, σ^2), you can apply a linear transformation to change X into a standard normal:
Z = (X − µ)/σ ∼ N(0, 1).
The same works for the multivariate normal:
Exercise 2.8. Prove that if X ∼ N(~µ, Σ), then
Z := Σ^{−1/2}(X − ~µ) ∼ N(0, 1).
In particular (prove this only in the bivariate case), the components of Z are independent.
Hint: Use Proposition 2.1.
Note 2.3. This last exercise shows how to obtain a standard normal vector from any multivariate normal distribution. On the homework, you will also show how to do the converse, that is, obtain any multivariate normal distribution from the standard multivariate normal.
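The standardization in Exercise 2.8 can be tried out numerically: whiten draws from a correlated normal with Σ^{−1/2} = PΛ^{−1/2}P' and check that the result looks like a standard normal vector (a sketch; µ, Σ, and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])           # positive definite (det = 0.56 > 0)

X = rng.multivariate_normal(mu, Sigma, size=50_000)

# Sigma^{-1/2} via the spectral decomposition Sigma = P Lambda P'
lam, P = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = P @ np.diag(lam ** -0.5) @ P.T

Z = (X - mu) @ Sigma_inv_sqrt.T          # z_i = Sigma^{-1/2} (x_i - mu) for each row

print(Z.mean(axis=0))                    # near (0, 0)
print(np.cov(Z, rowvar=False))           # near the 2x2 identity matrix
</```

The empirical mean is near 0 and the empirical covariance near the identity, as the exercise predicts.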
You can generate multivariate normal random variables in R using the following commands (note that comments about what a line does will follow the symbol %; these comments are not part of what you should include in your input line):
> library(MASS) % this loads the library in which the multivariate normal generator is
> S=c(1,0,0,1) % this generates the vector (1, 0, 0, 1)
> dim(S)=c(2,2) % this transforms the vector into a 2-by-2 matrix
> S % this allows you to check what S is.
[,1] [,2]
[1,] 1 0
[2,] 0 1
> mu=c(0,0) % this is the mean (row) vector
> mu
[1] 0 0
> dim(mu)=c(2,1) %this makes the mean vector into a column vector
> mu
[,1]
[1,] 0
[2,] 0
> N=mvrnorm(100,mu,S) % this generates 100 samples from the multivariate normal distribution with mean mu and covariance matrix S
> plot(N)
[Scatter plot of N[,1] against N[,2]]
> S2=c(1,1,1,1)
> dim(S2)=c(2,2)
> N2=mvrnorm(100,mu,S2)
> plot(N2)
[Scatter plot of N2[,1] against N2[,2]]
> S3=c(1,-0.8,-0.8,1)
> dim(S3)=c(2,2)
> N3=mvrnorm(100,mu,S3)
> plot(N3)
[Scatter plot of N3[,1] against N3[,2]]
The following are the graphs of 3 multivariate normal densities (any two pictures on the same line are of the same pdf, but seen from different angles). Try to say as much as you can about their means and covariance matrices.
[Six surface plots: the three densities, each shown from two angles]
[Figure: surface plots, two angles]
The joint pdf of two independent standard normal random variables
[Figure: surface plots, two angles]
The joint pdf of two normal random variables with mean 0 and covariance matrix Σ = [1 1/2; 1/2 1].
[Figure: surface plots, two angles]
The joint pdf of two normal random variables with mean 0 and covariance matrix Σ = [1 −1/2; −1/2 1].
When pictures of surfaces don't make as much sense as we'd like, we can always look at level curves. Here are the same graphs as above with level curves:
[Figure: surface plot and its level curves]
The joint pdf of two independent standard normal random variables
[Figure: surface plot and its level curves]
The joint pdf of two normal random variables with mean 0 and covariance matrix Σ = [1 1/2; 1/2 1].
[Figure: surface plot and its level curves]
The joint pdf of two normal random variables with mean 0 and covariance matrix Σ = [1 −1/2; −1/2 1].
When you draw samples from a distribution, you should see most of your data points accumulate in areas of high probability. The shapes of these areas are precisely given by the level curves:
[Figure: scatter plot with level curves]
50 samples from a bivariate normal random variable with mean 0 and covariance matrix Σ = [1 0; 0 1].
[Figure: scatter plot with level curves]
50 samples from a bivariate normal random variable with mean 0 and covariance matrix Σ = [1 1/2; 1/2 1].
[Figure: scatter plot with level curves]
50 samples from a bivariate normal random variable with mean 0 and covariance matrix Σ = [1 −1/2; −1/2 1].
The connection between the data and the distribution becomes more obvious as the data set increases in size:
[Figure: scatter plot with level curves]
500 samples from a bivariate normal random variable with mean 0 and covariance matrix Σ = [1 0; 0 1].
[Figure: scatter plot with level curves]
500 samples from a bivariate normal random variable with mean 0 and covariance matrix Σ = [1 1/2; 1/2 1].
[Figure: scatter plot with level curves]
500 samples from a bivariate normal random variable with mean 0 and covariance matrix Σ = [1 −1/2; −1/2 1].
Math 4506 (Fall 2019), September 5, 2019. Prof. Christian Benes
Lecture #3: Decomposing Time Series; Stationarity
Reference. The material in this section is an introduction to time series and is meant to complement Chapter 1 in the textbook. Make sure you read that chapter in its entirety and work in parallel with R to reproduce what is being done in the textbook. This lecture also covers most of the topics from Chapter 2, which we will re-visit in more detail in the next lecture.
3.1 Basic decomposition
The following graph represents the number of monthly aircraft miles (in millions) flown by U.S. airlines between 1963 and 1970:
[Time series plot of Air.ts: monthly aircraft miles, 1964-1970]
Given a data set such as the one above, how can we construct a model for it? The idea will be to decompose random data into three distinct components:
• A trend component mt (increase of populations, increase in global temperature, etc.)
• A seasonal component st (describing cyclical phenomena such as annual temperature patterns, etc.)
• A random noise component Yt describing the non-deterministic aspect of the time series. Note that the book uses zt for this component. In the notes, I'll write Yt, as the letter z usually suggests a normal distribution, which may not be the actual underlying distribution of the random noise component.
A common model is the so-called additive model, that is, one where we try to find mt, st, Yt such that a given time series can be expressed as
Xt = mt + st + Yt.
We will never know what mt, st, and Yt actually are, but we can estimate them. The estimates will be called m̂t, ŝt, and ŷt. Note that we'll use the same notation for estimates and estimators in this case. Once we see the data, our estimates have to satisfy

xt = m̂t + ŝt + ŷt,

where m̂t is an estimate for mt, ŝt is an estimate for st, and ŷt is an estimate for Yt.
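To make the decomposition concrete, here is a minimal sketch (in Python rather than the course's R, and on a made-up monthly series, so the numbers and names are purely illustrative) of one classical way to compute such estimates: a centered 12-term moving average for the trend estimate and monthly means of the detrended series for the seasonal estimate.

```python
import numpy as np

# Toy monthly series with a linear trend, a period-12 seasonal pattern,
# and random noise, mimicking the additive model x_t = m_t + s_t + y_t.
rng = np.random.default_rng(1)
t = np.arange(120)
x = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 120)

# Trend estimate: centered 12-term moving average (half weights at the
# ends, the standard filter for an even period).
kernel = np.r_[0.5, np.ones(11), 0.5] / 12
m_hat = np.convolve(x, kernel, mode="same")
m_hat[:6] = m_hat[-6:] = np.nan  # edges have incomplete windows

# Seasonal estimate: average the detrended series month by month,
# then normalize so the seasonal terms sum to zero over a year.
detrended = x - m_hat
s_hat = np.array([np.nanmean(detrended[i::12]) for i in range(12)])
s_hat -= s_hat.mean()
s_full = np.tile(s_hat, 10)

# Whatever is left is the noise estimate y_hat.
y_hat = x - m_hat - s_full
```

By construction x = m̂ + ŝ + ŷ wherever the moving average is defined, and the recovered seasonal pattern closely tracks the sinusoid that was put in.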
The corresponding data set can be found at
http://robjhyndman.com/tsdldata/data/kendall3.dat
and looks like this:
      Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
1963  6827  6178  7084  8162  8462  9644 10466 10748  9963  8194  6848  7027
1964  7269  6775  7819  8371  9069 10248 11030 10882 10333  9109  7685  7602
1965  8350  7829  8829  9948 10638 11253 11424 11391 10665  9396  7775  7933
1966  8186  7444  8484  9864 10252 12282 11637 11577 12417  9637  8094  9280
1967  8334  7899  9994 10078 10801 12950 12222 12246 13281 10366  8730  9614
1968  8639  8772 10894 10455 11179 10588 10794 12770 13812 10857  9290 10925
1969  9491  8919 11607  8852 12537 14759 13667 13731 15110 12185 10645 12161
1970 10840 10436 13589 13402 13103 14933 14147 14057 16234 12389 11595 12772
In fact, this is not exactly the form in which the data set is found on that website. There, it doesn't have any labels. As it turns out, it is quite straightforward to include those labels with R.
Let's look at the graph above. Two patterns are striking. There appear to be
• an increasing pattern
• a clear cyclical pattern with some apparently fixed period
Some questions we'll try to answer throughout the course are: "How can we extract these patterns?" "Once we've extracted the patterns, are we left with pure randomness or does the randomness have a structure?" "Can we use these patterns to make predictions for future values of this time series?"
3.2 Stationary Time Series
We will eventually return to a more careful analysis of the trend and seasonal component of a time series, but focus for now on Yt, the random component of a time series after extraction of a trend and cyclical component.
Multidimensional distributions are very complicated objects and involve more parameters than we would like to deal with. We will focus on two essential quantities giving information about a time series: the means and the covariances.
Definition 3.1. If Xt is a time series with Xt ∈ L1 for each t, then the mean function (or trend) of Xt is the non-random function µ(t) := E[Xt].
Definition 3.2. If Xt is a time series with Xt ∈ L2 for each t, then the autocovariance function of Xt is the non-random function
γ(t, s) := Cov(Xt, Xs) = E [(Xt − µ(t))(Xs − µ(s))] .
The autocorrelation function of Xt is

ρ(t, s) = γ(t, s) / √(Var(Xt) Var(Xs)) = Corr(Xt, Xs).
Definition 3.3. We call the time series Xt second-order (or weakly) stationary if
• there is a constant µ such that µ(t) = µ for all t, and
• γ(t + h, t) only depends on h; that is, if γ(t + h, t) = γ(h, 0) =: γ(h) for all t and for all h.
Exercise 3.9. For a second-order stationary process, show that Var(Xt) = γ(0) for each t.
Via the last exercise, the second condition for second-order stationarity allows us to rephrase the definition above:
Definition 3.4. Suppose that Xt is a second-order stationary process. The autocovariance function (ACVF) at lag h of Xt is
γ(h) := Cov(Xt+h, Xt).
The autocorrelation function (ACF) at lag h of Xt is
ρ(h) := Corr(Xt+h, Xt).
Note 3.1. By Exercise 3.9,
ρ(h) = Cov(Xt+h, Xt) / √(Var(Xt+h) Var(Xt)) = γ(h)/γ(0).
3.3 Some simple time series models
All the time series below are discrete-time, that is, the time set is a subset of the integers.
Example 3.1. (White Noise.)
Often when taking measurements, little imprecisions (in the measuring device and on the part of the measurer) will yield measurements that are a little off. It is often assumed that these errors are uncorrelated and that they all come from the same distribution with zero mean. A sequence of random variables {Xn}n≥1 with E[Xn] = 0 and E[XkXm] = σ²δ(k − m) is called white noise. (The name comes from the spectrum of a stationary process, which we may discuss at the end of the semester. There are also noises that are pink, red, blue, purple, etc.) Here δ(k − m) is the Kronecker delta, defined by

δ(x) = 1 if x = 0, and 0 if x ≠ 0.
Two important particular cases of white noise are:
• The distribution of Xi is binary: P (Xi = a) = 1−P (Xi = −a) = 1/2 for some a ∈ R.
• Xi ∼ N(0, σ2). In this case, we talk about Gaussian white noise.
Example 3.2. (IID Noise.)
A sequence of independent, identically distributed random variables {Xn}n≥1 with E[Xn] = 0 is called i.i.d. noise.
Example 3.3. (Random walk.) If {Xi}i≥1 is i.i.d. noise,

Sn = ∑_{i=1}^n Xi

is a random walk. In particular, if P(Xi = 1) = 1 − P(Xi = −1) = 1/2, we have a symmetric simple random walk.
Random walks have been a (very crude) choice of model for the stock market for a long time.
Two independent realizations of a simple random walk of 100 time steps.
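Realizations like the two above take only a couple of lines to produce. Here is a sketch in Python (in R one could equivalently use cumsum(sample(c(-1,1), 100, replace=TRUE))):

```python
import numpy as np

rng = np.random.default_rng(7)

# Two independent symmetric simple random walks with 100 steps each:
# every step is +1 or -1 with probability 1/2; S_n is the running sum.
steps = rng.choice([-1, 1], size=(2, 100))
walks = np.cumsum(steps, axis=1)
```

Each row of `walks` is one realization; plotting the rows against the step index reproduces pictures like the ones above.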
Example 3.4. (Gaussian time series.)
{Xn}n≥1 is a Gaussian time series if for every collection of integers {ik}1≤k≤n, the vector

(Xi1, . . . , Xin)

is multivariate Gaussian.

Since many natural quantities have a normal distribution, this is a natural model in many settings. It also has the advantage of allowing many kinds of dependence between the data.
3.4 Autocovariance function: some examples
We saw that for stationary time series, covariance depends only on one parameter (the time between two given random variables), allowing us to define an autocovariance function at lag h. In the examples below, we compute the autocovariance function of the simple time series which we defined during the last lecture and use it to determine which of them are stationary and which are not.
Example 3.5 (White Noise). Suppose that Xt is white noise. We now verify that Xt is second-order stationary. First, it is obvious that µ(t) = 0 for all t. Second, if s ≠ t, then the assumption that the collection is uncorrelated implies that γ(t, s) = 0. On the other hand, if s = t, then γ(t, t) = Var(Xt) = σ². Thus, µ(t) = 0 for all t, and

γ(h) = γ(t + h, t) = σ² if h = 0, and 0 if h ≠ 0.

This shows that Xt is indeed second-order stationary since γ depends only on h. We write Xt ∼ WN(0, σ²) to indicate that Xt is white noise with Var(Xt) = σ², for each t.
Example 3.6 (IID Noise). Suppose instead that Xt is a collection of independent random variables, each with mean 0 and variance σ². We say that Xt is iid noise. As with white noise, we easily see that iid noise is stationary with trend µ(t) = 0 and

γ(h) = γ(t + h, t) = σ² if h = 0, and 0 if h ≠ 0.

We write Xt ∼ IID(0, σ²) to indicate that Xt is iid noise with Var(Xt) = σ², for each t.
Remark. With these two examples, we see that two different processes may both have the same trend and autocovariance function. Thus, µ(t) and γ(t + h, t) are NOT always enough to distinguish stationary processes. (However, for stationary Gaussian processes they are enough.)
Example 3.7. If St = ∑_{i=1}^t Xi (where {Xi} is a sequence of independent random variables with P(Xi = 1) = 1 − P(Xi = −1) = 1/2 and therefore Var(Xi) = 1) is symmetric simple random walk, we find that if s > t,

γ(s, t) = Cov(Ss, St) = Cov(St + Xt+1 + . . . + Xs, St) = Cov(St, St) = Var(St) = ∑_{i=1}^t Var Xi = t.

In particular, γ(t + h, t) = t, which implies that simple random walk is not a stationary time series (since stationary time series have a constant variance).
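The computation Var(St) = t is easy to confirm by simulation; a quick Monte Carlo sketch in Python (sample sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate many symmetric simple random walks and compare the empirical
# variance of S_t with the theoretical value Var(S_t) = t.
steps = rng.choice([-1, 1], size=(20000, 50))
walks = np.cumsum(steps, axis=1)
for t in (10, 25, 50):
    print(t, walks[:, t - 1].var())
```

The empirical variances grow linearly in t, which is precisely the failure of stationarity noted above.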
Math 4506 (Fall 2019), September 9, 2019, Prof. Christian Benes
Lecture #4: Linear Processes; MA processes
Reference. Chapter 2 and Sections 4.1 and 4.2 from the textbook.
4.1 Inequalities
Many probabilists are enthralled by inequalities (upper/lower bounds). One of the many purposes for finding upper bounds is to check that quantities are finite, by checking it for a more tractable but larger quantity. (This is something you've seen in the comparison test for integrals: Though it's not straightforward to check that ∫_{1000}^∞ e^{−x²} log log log|x + 1| dx < ∞, the fact that for x ≥ 1000, 0 ≤ e^{−x²} log log log|x + 1| ≤ e^{−x} implies that 0 ≤ ∫_{1000}^∞ e^{−x²} log log log|x + 1| dx ≤ ∫_{1000}^∞ e^{−x} dx < ∞.)
A very common inequality in analysis and probability is Jensen’s inequality.
Definition 4.1. A function φ : R → R is called convex if for x, y ∈ R, 0 ≤ p ≤ 1,
φ(px+ (1− p)y) ≤ pφ(x) + (1− p)φ(y).
Theorem 4.1 (Jensen's inequality). Suppose φ : R → R is convex. Suppose X is a random variable satisfying E[|X|] < ∞ and E[|φ(X)|] < ∞. Then

φ(E[X]) ≤ E[φ(X)].
Proof. If φ is convex, then for every x0 ∈ R, there is a number c(x0) such that φ(x) ≥ φ(x0) + c(x0)(x − x0) for all x ∈ R (a supporting line at x0).
Choosing x0 = E[X] and letting x = X, we get
φ(X) ≥ c(E[X])(X − E[X]) + φ(E[X]).
Taking expectations on both sides concludes the proof.
Example 4.1. Two straightforward consequences of Jensen's inequality are:
|E[X]| ≤ E[|X|].
E[X]2 ≤ E[X2].
In particular, applying the second inequality to the random variable |X|, we get

E[|X|]² ≤ E[|X|²] = E[X²],

so that if E[X] = 0 and Var(X) = σ², E[|X|] ≤ σ. (2)
Two other very commonly useful inequalities are:
Theorem 4.2. (Cauchy-Schwarz inequality) If X, Y ∈ L2,
E[|XY |]2 ≤ E[X2]E[Y 2].
Note 4.1. This last inequality is the probabilistic version of the C-S inequality and should be compared with the C-S inequality in its most standard form:

(∑_{i=1}^n xi yi)² ≤ (∑_{i=1}^n xi²)(∑_{i=1}^n yi²). (3)
Theorem 4.3. (Triangle inequality) If x, y ∈ R,
|x + y| ≤ |x| + |y|.

By induction, if x1, . . . , xn ∈ R,

|∑_{i=1}^n xi| ≤ ∑_{i=1}^n |xi|.
4.2 Linear Processes
Definition 4.2. We define the backwards shift operator B by
BXt = Xt−1.
For j ≥ 2, we define B^j Xt = B B^{j−1} Xt.

In other words, B^j Xt = Xt−j.
Definition 4.3. A time series {Xt}t∈Z is a linear process if for every t ∈ Z, we can write

Xt = ∑_{i=−∞}^∞ ψi Zt−i, (4)

where Zt ∼ WN(0, σ²) and the scalar sequence {ψi}i∈Z satisfies ∑_{i∈Z} |ψi| < ∞. Using the shortcut Ψ(B) = ∑_{i=−∞}^∞ ψi B^i, we can write

Xt = Ψ(B)Zt.

If ψi = 0 for all i < 0, we call X a moving average or MA(∞) process.
Note 4.2. Infinite sums of random variables are somewhat delicate. You know what it means for an infinite sum of real numbers to converge, but for random variables, it isn't clear at first what the corresponding meaning would be. In fact, there are a number of different ways to give a meaning to the notion of convergence of random variables.

For technical reasons, convergence of a sum of random variables is often taken in the mean square sense: the partial sums of {Yk}k≥1 converge in the mean square sense if there exists a random variable Y such that

E[(∑_{k=1}^n Yk − Y)²] → 0 as n → ∞.

In any case, it should be intuitively clear that some requirement on the ψi is necessary, since if all the ψi were equal to 1, Xt would be an infinite sum of i.i.d. random variables, which does not converge (since we're always adding more random variables that don't shrink, the sum would not stabilize).
The requirement ∑_{i∈Z} |ψi| < ∞ ensures that the random series ∑_{i=−∞}^∞ ψi Zt−i has a limit. I won't expect you to completely understand what this means, but if you care about it, here's the argument:

∑_{i≥0} |ψi| < ∞ ⇒ ∑_{i≥0} ψi² < ∞ ⇒ ∑_{i≥0} ψi² E[Z²t−i] < ∞ ⇒ ∑_{i=m}^n ψi² E[Z²t−i] → 0 as m, n → ∞.
(The last implication is the Cauchy criterion for convergence of series.) Now by the Cauchy-Schwarz inequality ((3) with yi = 1 for all i ∈ {1, . . . , n}),

∑_{i=m}^n ψi² E[Z²t−i] = E[∑_{i=m}^n ψi² Z²t−i] ≥ E[(∑_{i=m}^n ψi Zt−i)²].

Therefore,

∑_{i=m}^n ψi² E[Z²t−i] → 0 as m, n → ∞ ⇒ E[(∑_{i=m}^n ψi Zt−i)²] → 0 as m, n → ∞
⇒ ∑_{i=m}^n ψi Zt−i converges as n, m → ∞ ⇒ ∑_{i≥0} ψi Zt−i converges.

The last implication is the Cauchy criterion for convergence of sequences of random variables.
Now that we know that the process defined in (4) exists, let's also show that for any t ∈ Z, Xt ∈ L1:

If ∑_{i∈Z} |ψi| < ∞, using the triangle inequality (for the first inequality; note that since it's an infinite sum, we have to take limits) and Jensen's inequality, via (2), for the last, we get

E[|Xt|] ≤ ∑_{i∈Z} E|ψi Zt−i| = ∑_{i∈Z} |ψi| E|Zt−i| ≤ σ ∑_{i∈Z} |ψi|.
4.3 Moving Average Processes
We will now construct stationary time series that have a non-zero autocovariance up to a certain lag q but have zero autocovariance at all later lags. One simple and natural way is to start with white noise Zt (denoted Zt ∼ WN(0, σ²)) and to construct a new sequence of random variables which depend on an overlapping subset of the Zt.
Definition 4.4. A moving-average process of order q is defined for t ∈ Z by the equation

Xt = Zt + θ1Zt−1 + . . . + θqZt−q = Zt + ∑_{i=1}^q θi Zt−i = ∑_{i=0}^q θi Zt−i = Θ(B)Zt,

where Zt ∼ WN(0, σ²), θ0 = 1, θ1, . . . , θq are constants, and Θ(z) = 1 + ∑_{i=1}^q θi z^i.
We now check that Xt is a stationary sequence:

E[Xt] = E[Zt] + ∑_{i=1}^q θi E[Zt−i] = 0.
If h > q,

Cov(Xt, Xt+h) = Cov(∑_{i=0}^q θi Zt−i, ∑_{j=0}^q θj Zt+h−j) = ∑_{i,j=0}^q θi θj Cov(Zt−i, Zt+h−j) = 0,

since if h > q and j ≤ q, then t + h − j > t, so that t + h − j > t − i, so that Zt−i and Zt+h−j are uncorrelated.
If 0 ≤ h ≤ q, the random variables Xt and Xt+h contain some of the same Zi. The only pairs with non-zero covariance are those with t − i = t + h − j, i.e., j = i + h, so

Cov(Xt, Xt+h) = Cov(∑_{i=0}^q θi Zt−i, ∑_{j=0}^q θj Zt+h−j) = σ² ∑_{i=h}^q θ_{q−i} θ_{q−i+h} = σ² ∑_{i=0}^{q−h} θ_{q−i−h} θ_{q−i}.

Since this covariance does not depend on t, we see that the moving-average process of order q is weakly stationary.
To find the autocorrelation function, we just need to compute

E[Xt²] = Cov(Xt, Xt) = Cov(∑_{i=0}^q θi Zt−i, ∑_{i=0}^q θi Zt−i) = σ² ∑_{i=0}^q θi².
Combining all our computations above, we get

γX(h) = σ² ∑_{i=0}^{q−|h|} θ_{q−i−|h|} θ_{q−i} for 0 ≤ |h| ≤ q, and γX(h) = 0 for |h| > q, (5)

and

ρX(h) = (∑_{i=0}^{q−|h|} θ_{q−i−|h|} θ_{q−i}) / (∑_{i=0}^q θi²) for 0 ≤ |h| ≤ q, and ρX(h) = 0 for |h| > q. (6)
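Formula (5) can be sanity-checked numerically. The sketch below (in Python, with arbitrarily chosen coefficients θ1 = 0.4, θ2 = −0.3 and σ² = 1) compares the theoretical autocovariances of an MA(2) process with sample estimates from a long simulated path:

```python
import numpy as np

theta = np.array([1.0, 0.4, -0.3])   # theta_0 = 1, theta_1, theta_2
q = 2

# Theoretical gamma(h) = sigma^2 * sum_{i=0}^{q-h} theta_i * theta_{i+h}
gamma_theory = [sum(theta[i] * theta[i + h] for i in range(q - h + 1))
                for h in range(q + 1)]

# Sample autocovariances from a long simulated MA(2) path
rng = np.random.default_rng(2)
z = rng.normal(0.0, 1.0, 200_000)
x = z[2:] + theta[1] * z[1:-1] + theta[2] * z[:-2]
x = x - x.mean()
n = len(x)
gamma_sample = [float(np.dot(x[: n - h], x[h:]) / n) for h in range(q + 1)]

print(np.round(gamma_theory, 3), np.round(gamma_sample, 3))
```

The sample values agree with the formula to a couple of decimal places, and sample autocovariances at lags beyond q = 2 would be close to 0.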
Math 4506 (Fall 2019), September 11, 2019, Prof. Christian Benes
Lecture #5: MA processes - Autocovariance; AR processes
Reference. Section 4.2 from the textbook.
5.1 ACF of MA Processes
Example 5.1. (MA(1) process) Let's examine the ACF of a MA(1) process: If

Xt = Zt + θ1Zt−1,

we have θ0 = 1, θ1 ≠ 0, and θi = 0 for all i > 1. Therefore, using (5) and (6), we get

γX(0) = σ² ∑_{i=0}^{1} θ_{1−i} θ_{1−i} = σ²(1 + θ1²),
γX(1) = σ² ∑_{i=0}^{0} θ_{−i} θ_{1−i} = σ² θ0 θ1 = σ² θ1,
γX(h) = 0, |h| ≥ 2,

and

ρX(0) = 1,
ρX(1) = σ²θ1 / (σ²(1 + θ1²)) = θ1/(1 + θ1²),
ρX(h) = 0, |h| ≥ 2.
Example 5.2. (MA(2) process) We'll now compute the ACF of a MA(2) process. Again, this is straightforward with the help of (5) and (6):

γX(0) = σ² ∑_{i=0}^{2} θ_{2−i} θ_{2−i} = σ²(1 + θ1² + θ2²),
γX(1) = σ² ∑_{i=0}^{1} θ_{1−i} θ_{2−i} = σ²(θ1θ2 + θ1),
γX(2) = σ² ∑_{i=0}^{0} θ_{−i} θ_{2−i} = σ²θ2,
γX(h) = 0, |h| ≥ 3.

Therefore, the ACF is

ρX(0) = 1,
ρX(1) = (θ1θ2 + θ1)/(1 + θ1² + θ2²),
ρX(2) = θ2/(1 + θ1² + θ2²),
ρX(h) = 0, |h| ≥ 3.
Example 5.3. Let us now simulate two MA(2) processes. First, consider the process
Xt = Zt + Zt−1 − Zt−2.
We can simulate it as follows:
> Z=rnorm(500)
> X=Z
> for (i in 3:500) X[i]=Z[i]+Z[i-1]-Z[i-2]
> plot(X,type="l")
[Line plot of the simulated series X]
Let’s now change the signs of the coefficients in the time series above to see what the process
Xt = Zt − Zt−1 + Zt−2
looks like.
> Z=rnorm(500)
> X=Z
> for (i in 3:500) X[i]=Z[i]-Z[i-1]+Z[i-2]
> plot(X,type="l")
[Line plot of the simulated series X]
Math 4506 (Fall 2019), September 16, 2019, Prof. Christian Benes
Lecture #6: AR processes
Reference. Section 4.3 from the textbook.
6.1 AR processes
Recall the following definition:
Definition 6.1. We define the backwards shift operator B by
BXt = Xt−1.
For j ≥ 2, we define B^j Xt = B B^{j−1} Xt.

In other words, B^j Xt = Xt−j.
Example 6.1. Recall that for n ≥ 1, we defined random walks Sn as follows: If {Xi}i≥1 is i.i.d. noise,

Sn = ∑_{i=1}^n Xi.

Another way of defining random walk is by defining S1 = X1 and for n ≥ 2,

Sn = Sn−1 + Xn,

or, with the backward shift notation,

Sn − BSn = Xn.
We can use the factorization that we use for real numbers in this case as well, but have to be careful and realize that the symbolic factorization is for operators (in particular, 1 represents the identity operator, not the number one). This gives
(1−B)Sn = Xn.
One natural way of introducing correlation into a time series model is by defining the time series recursively.
Definition 6.2. We define an autoregressive process of order p to be a process X satisfying for all t ∈ Z,

Xt − φ1Xt−1 − . . . − φpXt−p = Zt (7)
⇐⇒ (1 − φ1B − φ2B² − · · · − φpB^p)Xt = Zt
⇐⇒ Φp(B)Xt = Zt,

where Zt ∼ WN(0, σ²), Zt is independent of Xs, s < t, and Φp(z) = 1 − ∑_{i=1}^p φi z^i.
Note that random walk Sn is defined by the equation

(1 − B)Sn = Xn,

so random walk is a particular case of an AR(1) process. We already saw that random walk is not stationary, so we see that there are processes satisfying the AR equation that aren't stationary. Note that this is different from MA processes, which are always stationary.
6.2 Stationarity of AR processes
It turns out that for any set of parameters {φi}1≤i≤p, this process exists. However, it isn't always stationary. The criterion for stationarity is quite simple: an AR(p) process is stationary if and only if all roots of the characteristic equation Φp(z) = 0 have modulus greater than 1. In that case, the process is uniquely defined by equation (7). In other words, if z1, . . . , zp are the roots of the characteristic equation, we need |zi| > 1 for all i ∈ {1, . . . , p}. Note that the zi have to be thought of as complex numbers.
Let’s see what might go wrong when |φ| = 1 by looking at simple random walk:
Example 6.2. Is the AR(3) process defined by
Xt = Xt−2 +Xt−3 + Zt
stationary?
We can rewrite the equation above as Φ3(B)Xt = Zt, where Φ3(z) = 1 − z² − z³. Therefore, we need to find the roots of the characteristic polynomial Φ3(z) = 1 − z² − z³. This is best done with the help of R: first define the vector of coefficients of the polynomial
> a=c(1,0,-1,-1)
Then compute the roots:
> polyroot(a)
which gives
[1] 0.7548777+0.0000000i -0.8774388+0.7448618i -0.8774388-0.7448618i
To be able to tell right away what the modulus of these roots is, type
> roots=polyroot(a)
> abs(roots)
which gives
[1] 0.7548777 1.1509639 1.1509639
Since one of these roots has modulus less than 1, the process is not stationary.
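If R is not at hand, the same computation can be done with NumPy. Note that np.roots takes coefficients from the highest degree down, the reverse of polyroot's ordering:

```python
import numpy as np

# Phi_3(z) = 1 - z^2 - z^3, written highest degree first: -z^3 - z^2 + 0z + 1
roots = np.roots([-1.0, -1.0, 0.0, 1.0])
print(np.abs(roots))
# The process is stationary iff every root has modulus > 1.
print(bool(np.all(np.abs(roots) > 1)))
```

The smallest modulus is about 0.755 < 1, so this check also concludes that the process is not stationary.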
6.3 Simulations of AR(2) Processes
Note that these simulations are for different AR processes than the ones we looked at in class, but the principle is exactly the same.
Example 6.3. Consider the AR(2) process defined by
Xt = 0.7Xt−1 + 0.2Xt−2 + Zt,
where Zt ∼ WN(0, 1). We can produce a realization of the time series:
> Z=rnorm(200)
> X1=Z
> for (i in 3:200) X1[i]=0.7*X1[i-1]+0.2*X1[i-2]+Z[i]
> plot(X1, type="l")
A plot of X1
We can now do the same thing for the AR(2) process defined by
Xt = 0.7Xt−1 − 0.2Xt−2 + Zt,
where Zt ∼ WN(0, 1):

> X2=Z
> for (i in 3:200) X2[i]=0.7*X2[i-1]-0.2*X2[i-2]+Z[i]
> plot(X2, type="l")
A plot of X2
Note that to put both pictures in the same window on your screen, you can use the following commands:
> par(mfrow = c(2, 1))
> plot(X1, type="l")
> plot(X2, type="l")
A plot of X1 and X2
We can also see what the ACFs of these processes look like:
> Y1=ARMAacf(ar=c(0.7,0.2),lag.max=15)
> Y2=ARMAacf(ar=c(0.7,-0.2),lag.max=15)
> plot(Y1)
> plot(Y2)
A plot of the ACFs of X1 and X2
Example 6.4. What does the realization of a non-stationary AR(2) process look like? Let's simulate a realization of the process defined by
Xt = Xt−1 + 0.2Xt−2 + Zt,
where Zt ∼ WN(0, 1):
> Z=rnorm(200)
> X=Z
> for (i in 3:200) X[i]=1*X[i-1]+0.2*X[i-2]+Z[i]
> plot(X,type="l")
A plot of X
As it's not very clear what is happening other than that the process is blowing up, let's reduce the domain a bit:
> plot(X[1:50],type="l")
A plot of X for times 1 to 50
> plot(X[1:20],type="l")
A plot of X for times 1 to 20
We see that the process looks potentially stationary for a short while, but eventually appears to be going off to infinity. One would certainly not claim based on this picture that the variance or covariances at a given lag are constant over time.
6.4 Autocorrelation for Stationary AR(1) Processes
We can write the AR(1) process as follows:
Xt = φXt−1 + Zt. (8)
If Xt is stationary, the constancy of E[Xt] implies that E[Xt] = 0. Indeed, if E[Xt] doesn't depend on t, we can take expected values on both sides of (8) and get
E[Xt] = φE[Xt−1]+E[Zt] ⇐⇒ E[Xt] = φE[Xt] ⇐⇒ E[Xt](1−φ) = 0 ⇐⇒ φ = 1 or E[Xt] = 0.
If φ = 1, the root of the characteristic equation is 1, so Xt is not stationary. Therefore, E[Xt] = 0 for all t. If h > 0,
γX(h) = E[XtXt−h] = φE[Xt−1Xt−h] + E[ZtXt−h] = φγX(h− 1).
We can repeat this procedure h− 1 times to obtain
γX(h) = φhγX(0).
Here is a case where finding the autocorrelation function is easier than finding the autocovariance function. Indeed, the last equation yields automatically for all h ∈ Z (using the fact that γX(h) = γX(−h))

ρX(h) = γX(h)/γX(0) = φ^|h|.
Since Cov(Zt, Xt−1) = 0, we find that

γX(0) = Cov(Xt, Xt) = Cov(φXt−1 + Zt, φXt−1 + Zt) = φ²γX(0) + σ²,

implying that

γX(0) = σ²/(1 − φ²).
This gives for all h ∈ Z,

γX(h) = ρX(h)γX(0) = φ^|h| σ²/(1 − φ²).
Note 6.1. When looking at sample correlograms we will often need to determine which of our models has a correlogram resembling that provided by the data. For that purpose it is important to note that the correlogram above is an exponentially decaying function of h (which alternates between positive and negative values if φ < 0).
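This exponential decay is easy to observe in simulation. A sketch in Python (φ = 0.6 chosen arbitrarily) comparing sample autocorrelations of a long AR(1) path with the theoretical φ^h:

```python
import numpy as np

phi, n = 0.6, 200_000
rng = np.random.default_rng(3)
z = rng.normal(0.0, 1.0, n)

# Iterate X_t = phi * X_{t-1} + Z_t
x = np.empty(n)
x[0] = z[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + z[t]

x = x - x.mean()
gamma0 = float(np.dot(x, x) / n)
rho_hat = [float(np.dot(x[: n - h], x[h:]) / n / gamma0) for h in (1, 2, 3)]
print([round(r, 3) for r in rho_hat])   # compare with phi, phi^2, phi^3
```

The sample variance is also close to σ²/(1 − φ²) = 1/0.64 ≈ 1.56, matching the formula for γX(0) above.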
Math 4506 (Fall 2019), September 18, 2019, Prof. Christian Benes

Lecture #7: Autocovariance of linear processes; stationarity of AR processes
7.1 Linear Processes
Proposition 7.2. If Yt is stationary with mean 0 and autocovariance function γY and ∑_{i∈Z} |ψi| < ∞, then

Xt = ∑_{i=−∞}^∞ ψi Yt−i

is stationary with mean 0 and autocovariance function

γX(h) = ∑_{j,k=−∞}^∞ ψj ψk γY(h + k − j).

In particular, if Y ∼ WN(0, σ²) (which by definition means that X is linear), X is stationary with mean 0 and autocovariance function

γX(h) = σ² ∑_{k=−∞}^∞ ψk ψk+h.
Proof. This is a straightforward computation, once one knows that for a convergent random series (which we have here thanks to the requirements on the ψi) the expected value of each of the infinite sums below is the sum of the expected values.
E[Xt] = E[∑_{i=−∞}^∞ ψi Yt−i] = ∑_{i=−∞}^∞ ψi E[Yt−i] = 0,

since Yt is a mean zero time series. Also,

E[Xt+hXt] = ∑_{j,k=−∞}^∞ ψj ψk E[Yt+h−j Yt−k] = ∑_{j,k=−∞}^∞ ψj ψk E[Yk+h−j Y0] = ∑_{j,k=−∞}^∞ ψj ψk γY(k + h − j).
Example 7.1. Suppose Xt is an MA(∞) process, that is,

Xt = ∑_{i≥0} θi Zt−i,

with ∑_{i≥0} |θi| < ∞. Then since Zt is stationary, we can apply Proposition 7.2 to Xt to obtain

γX(h) = ∑_{j,k=−∞}^∞ θj θk γZ(h + k − j) = σ² ∑_{k=0}^∞ θk θk+h,

so that

ρX(h) = (∑_{k=0}^∞ θk θk+h) / (∑_{k=0}^∞ θk²).

In particular, if Xt is an MA(q) process (meaning that θi = 0 if i > q),

γX(h) = σ² ∑_{k=0}^{q−h} θk θk+h and ρX(h) = (∑_{k=0}^{q−h} θk θk+h) / (∑_{k=0}^q θk²).
7.2 Stationarity of AR(1) Process - Take Two
Recall Definition 6.2: An autoregressive process of order p satisfies for all t ∈ Z,

Xt − φ1Xt−1 − . . . − φpXt−p = Zt (9)
⇐⇒ (1 − φ1B − φ2B² − · · · − φpB^p)Xt = Zt
⇐⇒ Φp(B)Xt = Zt,

where Zt ∼ WN(0, σ²), Zt is independent of Xs, s < t, and Φp(z) = 1 − ∑_{i=1}^p φi z^i. In particular, X is an AR(1) process if it satisfies

Xt − φXt−1 = Zt,

where Zt ∼ WN(0, σ²) and Zt is independent of Xs for s < t.
We've already seen that an AR(1) process is stationary if and only if |φ| < 1. Here is a way of seeing why:

Xt = φXt−1 + Zt ⇒ Xt = φ(φXt−2 + Zt−1) + Zt = Zt + φZt−1 + φ²Xt−2
= Zt + φZt−1 + φ²(φXt−3 + Zt−2) = Zt + φZt−1 + φ²Zt−2 + φ³Xt−3 = · · · = ∑_{i≥0} φ^i Zt−i, (10)

where the last step follows from taking lim_{n→∞} φ^n Xt−n and noting that if |φ| < 1, this limit is 0 (in a subtle sense you might want to think about; in other words, ask yourself this: what does it mean for the limit of a sequence of random variables to be 0?).
What we just did was rewrite a stationary AR(1) process as a linear process (see Lecture 4), more specifically as an MA(∞) process.
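A quick numerical check of (10): iterate the AR(1) recursion and compare the value at a fixed time with the truncated sum ∑ φ^i Zt−i (a sketch in Python; φ = 0.5 and the truncation at 200 terms are arbitrary choices):

```python
import numpy as np

phi, n = 0.5, 1000
rng = np.random.default_rng(4)
z = rng.normal(0.0, 1.0, n)

# Iterate the AR(1) recursion X_t = phi * X_{t-1} + Z_t
x = np.empty(n)
x[0] = z[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + z[t]

# Truncated MA(infinity) representation at the last time point;
# the neglected tail is of order phi^200, i.e., negligible.
t = n - 1
x_ma = sum(phi**i * z[t - i] for i in range(200))
print(abs(x[t] - x_ma))
```

The two values agree to machine precision, illustrating how quickly the term φ^n Xt−n in the expansion becomes irrelevant.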
However, as the next example shows, one has to be a little bit careful when dealing with the AR equation (9).
Example 7.2. If |φ| > 1, we can show that the time series defined by

Xt = −∑_{j≥1} (1/φ)^j Zt+j (11)

satisfies the AR(1) equation Xt = φXt−1 + Zt. Indeed,

φXt−1 + Zt = φ(−∑_{j≥1} (1/φ)^j Zt−1+j) + Zt = −∑_{j≥1} (1/φ)^{j−1} Zt+j−1 + Zt
= −∑_{j≥0} (1/φ)^j Zt+j + Zt = −Zt − ∑_{j≥1} (1/φ)^j Zt+j + Zt
= −∑_{j≥1} (1/φ)^j Zt+j = Xt.

This time series is clearly stationary, since it is a linear process with summable coefficients, i.e., such that ∑_{j≥1} |1/φ|^j < ∞. So does this mean that there are two distinct stationary AR(1) processes, those defined in (10) and in (11)? We know that this cannot be the case (by uniqueness of stationary AR processes), so there must be something preventing one of these processes from being an AR(1) process. It turns out that in (11), Zt is not independent of Xs for s < t since, for instance, Xt = −∑_{j≥1} (1/φ)^j Zt+j, so that Zt+1 = φ(−Xt − ∑_{j≥2} (1/φ)^j Zt+j).
Example 7.3. If Wt = Xt + cφ^t, where Xt is a stationary AR(1) process (which means that Xt = φXt−1 + Zt with |φ| < 1), we see that

Wt = φWt−1 + Zt.

Indeed,

φWt−1 + Zt = φ(Xt−1 + cφ^{t−1}) + Zt = φXt−1 + Zt + cφ^t = Xt + cφ^t = Wt.

Also, since by assumption Zt is independent of Xs, s < t, and since Wt = Xt + cφ^t, we have that Zt is independent of Ws, s < t. Therefore, Wt is an AR(1) process. Moreover, |φ| < 1. So is Wt a stationary AR(1) process? No, since E[Wt] = E[Xt + cφ^t] = cφ^t, which is not constant since it depends on t. So we see that there is more than one AR(1) process for a given parameter φ with |φ| < 1. However, as mentioned several times before, only one can be stationary.
Math 4506 (Fall 2019), September 23, 2019, Prof. Christian Benes
Lecture #8: Yule-Walker equations; Causality and Invertibility
8.1 Yule-Walker Equations for AR(p) Processes
If X is an AR(p) process, then

Xt = φ1Xt−1 + . . . + φpXt−p + Zt.

If, moreover, X is stationary, then there exists µ such that for all t, E[Xt] = µ. Then

E[Xt] = φ1E[Xt−1] + . . . + φpE[Xt−p] + E[Zt] ⇒ µ(1 − ∑_{j=1}^p φj) = 0.

Therefore, if X is stationary,

µ = 0 or 1 − ∑_{j=1}^p φj = 0. (12)
Now note that Φ(z) = 1 − ∑_{j=1}^p φj z^j. Since we are assuming that X is stationary, we know that all solutions of the characteristic equation Φ(z) = 0 must be outside of the unit disk. In particular, z = 1 cannot be a solution of the characteristic equation, so that Φ(1) = 1 − ∑_{j=1}^p φj ≠ 0. Therefore, we see from (12) that if X is stationary, then µ = 0.
If X is an AR(p) process, we can write for any j ∈ {0, . . . , p},

Xt = φ1Xt−1 + . . . + φpXt−p + Zt ⇒ E[XtXt−j] = ∑_{i=1}^p φi E[Xt−iXt−j] + E[ZtXt−j]
⇒ γ(j) = ∑_{i=1}^p φi γ(j − i) + E[ZtXt−j]. (13)
This gives us:

• If j = 0,

γ(0) = ∑_{i=1}^p φi γ(i) + E[ZtXt] = ∑_{i=1}^p φi γ(i) + σ². (14)
• Since Zt is uncorrelated with Xt−j whenever j ∈ {1, . . . , p}, we get for all j ∈ {1, . . . , p},

γ(j) = ∑_{i=1}^p φi γ(j − i),

which, in matrix notation, can be written as

Γp φ = γp, (15)

where Γp = (γ(i − j))_{i,j=1}^p is the covariance matrix, γp = (γ(1), . . . , γ(p))′, and φ = (φ1, . . . , φp)′.

In particular, dividing every element on both sides of the equality by γ(0) yields

Rp φ = ρp, (16)

where Rp = (ρ(i − j))_{i,j=1}^p is the correlation matrix, ρp = (ρ(1), . . . , ρ(p))′, and φ = (φ1, . . . , φp)′.
The equations for j = 0 and j ∈ {1, . . . , p} are a set of p + 1 equations in the 2p + 2 variables σ², φ1, . . . , φp, γ(0), . . . , γ(p). If the model is entirely specified, we know σ², φ1, . . . , φp and can therefore solve for γ(0), . . . , γ(p). This is of course true only if the matrix defining our system of equations is nonsingular.
Once we have γ(0), . . . , γ(p), we can use (13) to compute γ(j) for all j ≥ p + 1 recursively:

γ(j) = ∑_{i=1}^p φi γ(j − i).
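For a concrete stationary AR(2) instance (φ1 = 0.5, φ2 = 0.3, σ² = 1, values picked purely for illustration), the equations above form a small linear system in γ(0), γ(1), γ(2); a sketch in Python:

```python
import numpy as np

phi1, phi2, sigma2 = 0.5, 0.3, 1.0

# Yule-Walker system in the unknowns gamma(0), gamma(1), gamma(2):
#   gamma(0) - phi1*gamma(1) - phi2*gamma(2) = sigma^2      (j = 0)
#   gamma(1) = phi1*gamma(0) + phi2*gamma(1)                (j = 1)
#   gamma(2) = phi1*gamma(1) + phi2*gamma(0)                (j = 2)
A = np.array([
    [1.0,   -phi1,       -phi2],
    [-phi1, 1.0 - phi2,   0.0],
    [-phi2, -phi1,        1.0],
])
b = np.array([sigma2, 0.0, 0.0])
gamma = np.linalg.solve(A, b)
print(np.round(gamma, 4))

# Later lags follow from gamma(j) = phi1*gamma(j-1) + phi2*gamma(j-2).
gamma3 = phi1 * gamma[2] + phi2 * gamma[1]
```

The solution agrees with the closed-form AR(2) expressions derived in the next section.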
8.2 AR(2) processes
We can use the Yule-Walker equations to find the autocorrelation function of a given stationary AR(2) process satisfying

Xt = φ1Xt−1 + φ2Xt−2 + Zt.

Recall that an AR(2) process is stationary if all the solutions of the characteristic equation

1 − φ1z − φ2z² = 0 ⇐⇒ φ2z² + φ1z − 1 = 0

have magnitude > 1. For such processes, the Yule-Walker equations yield

γ(0) = φ1γ(1) + φ2γ(2) + σ²
γ(1) = φ1γ(0) + φ2γ(1)
γ(2) = φ1γ(1) + φ2γ(0)

and for j ≥ 3,

γ(j) = φ1γ(j − 1) + φ2γ(j − 2).
We can compute this explicitly (by hand, or using some software):

γ(0) = (φ2 − 1)σ² / [(1 + φ2)(φ1² − (φ2 − 1)²)],
γ(1) = −φ1σ² / [(1 + φ2)(φ1² − (φ2 − 1)²)],
γ(2) = −(φ1² − φ2² + φ2)σ² / [(1 + φ2)(φ1² − (φ2 − 1)²)].
However, this is not particularly instructive if we wish to know the shape of the ACF. Instead, let's first focus on the set of points (φ1, φ2) ∈ R² for which the AR(2) process has a stationary solution. The quadratic formula tells us that

1 − φ1z − φ2z² = 0 ⇐⇒ z = (−φ1 ± √(φ1² + 4φ2)) / (2φ2).
To know when these numbers are greater than one in absolute value, we consider three cases:

1. φ1² + 4φ2 < 0. Then z takes two complex values, each of which has squared magnitude

φ1²/(4φ2²) − (φ1² + 4φ2)/(4φ2²) = −1/φ2.

2. φ1² + 4φ2 = 0. Then z takes one real value, which has magnitude |φ1/(2φ2)|.

3. φ1² + 4φ2 > 0. Then z takes two real values, with magnitudes

|(−φ1 + √(φ1² + 4φ2))/(2φ2)| and |(−φ1 − √(φ1² + 4φ2))/(2φ2)|.

One can show (see Appendix B, p. 84) that these roots are greater than 1 in absolute value if and only if

φ1 + φ2 < 1, φ2 − φ1 < 1, and |φ2| < 1.
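This "stationarity triangle" can be verified numerically: for a few arbitrarily chosen parameter pairs, check that the roots of the characteristic polynomial all lie outside the unit circle exactly when the three inequalities hold (a sketch in Python; the function names are made up):

```python
import numpy as np

def roots_outside(phi1, phi2):
    # Roots of 1 - phi1*z - phi2*z^2, written highest degree first.
    roots = np.roots([-phi2, -phi1, 1.0])
    return bool(np.all(np.abs(roots) > 1.0))

def in_triangle(phi1, phi2):
    return (phi1 + phi2 < 1) and (phi2 - phi1 < 1) and (abs(phi2) < 1)

for phi1, phi2 in [(0.5, 0.3), (1.0, 0.2), (0.7, -0.2), (-1.5, -0.9), (0.2, 1.1)]:
    print((phi1, phi2), roots_outside(phi1, phi2), in_triangle(phi1, phi2))
```

The two columns agree for each pair; note that (1.0, 0.2) is exactly the non-stationary process of Example 6.4.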
It turns out that key quantities in understanding AR(2) processes are the reciprocals of the roots of the characteristic equation,

G1 = 2φ2/(−φ1 − √(φ1² + 4φ2)) = (φ1 − √(φ1² + 4φ2))/2

and

G2 = 2φ2/(−φ1 + √(φ1² + 4φ2)) = (φ1 + √(φ1² + 4φ2))/2,

where each second expression is obtained by multiplying numerator and denominator by the conjugate of the denominator.
The autocorrelations of an AR(2) process can be expressed in terms of G1 and G2:

1. If there is only one root to the characteristic equation, i.e., if φ1² + 4φ2 = 0, we have

ρk = (1 + ((1 + φ2)/(1 − φ2)) k)(φ1/2)^k, k ≥ 0.

2. Otherwise,

ρk = ((1 − G2²)G1^{k+1} − (1 − G1²)G2^{k+1}) / ((G1 − G2)(1 + G1G2)).

In particular, if the roots are complex, i.e., if φ1² + 4φ2 < 0, we can write

ρk = R^k sin(Θk + Φ)/sin Φ,

with R = √(−φ2) (recall that for a stationary AR(2) process, |φ2| < 1 and if the roots are complex, it is easy to see that φ2 < 0 since φ1² + 4φ2 < 0), Θ satisfying cos Θ = φ1/(2√(−φ2)), and Φ satisfying tan Φ = ((1 − φ2)/(1 + φ2)) tan Θ.
Math 4506 (Fall 2019), September 25, 2019, Prof. Christian Benes
Lecture #9: Causality and Invertibility
9.1 Causality and Invertibility
There are two dual (you can think of duality as some form of symmetry) forms in which one might be able to express time series. Roughly,

• if Xt is defined in terms of {Zs}s≤t, we call X causal.

• if Zt is defined in terms of {Xs}s≤t, we call X invertible.
More formally:
Definition 9.1. A time series Xt is
• causal if there exist constants ψj with ∑_{j≥0} |ψj| < ∞ such that

Xt = ∑_{j≥0} ψj Zt−j,

where Zt ∼ WN(0, σ²). Note that such a process can also be thought of as an MA(∞) process.

• invertible if there exist constants πj with ∑_{j≥0} |πj| < ∞ such that

Zt = ∑_{j≥0} πj Xt−j,

where Zt ∼ WN(0, σ²).
Clearly, any MA(q) process is causal and any AR(p) process is invertible (both by definition). We will now show that some MA(q) processes are invertible as well and that some AR(p) processes are causal.
9.2 Stationary AR(p) processes are causal
If Xt is an AR(p) process, then, as we've seen before, using the backward shift operator B,

Xt − ∑_{i=1}^p φi Xt−i = Zt ⇐⇒ Φ(B)Xt = Zt ⇐⇒ Xt = (Φ(B))⁻¹Zt,

where Φ(z) = 1 − ∑_{i=1}^p φi z^i and (Φ(B))⁻¹ is the inverse operator of Φ(B).
What exactly is the operator (Φ(B))⁻¹? We can try to write it explicitly by assuming that it has the form Ψ(B) = 1 + ∑_{i≥1} ψi B^i. Then

(Φ(B))(Φ(B))⁻¹ = 1 ⇒ Φ(B)Ψ(B) = 1, (17)

where 1 is the identity operator (i.e., the operator that doesn't do anything to Xt: 1Xt = Xt), NOT the number 1.
We will know what $\Psi(B)$ is if we can figure out what all the $\psi_i$ are (at least in terms of the $\phi_i$, which define the process $X_t$). The second equality in (17) can be re-written for polynomials:
$$\Phi(z)\Psi(z) = 1.$$
The left side of this equality is a polynomial and the right side is the number 1. For these two to be equal, all the coefficients of the polynomial must be 0, except that of order 0, which must be 1. By solving the equation we get for each coefficient, we can figure out what the $\psi_i$ are:
$$\Phi(z)\Psi(z) = 1 \iff \Bigl(1 - \sum_{i=1}^{p}\phi_i z^i\Bigr)\Bigl(1 + \sum_{i\ge 1}\psi_i z^i\Bigr) = 1.$$
Expanding the left side in increasing order of degree, we get
$$1 + (\psi_1 - \phi_1)z + (\psi_2 - \psi_1\phi_1 - \phi_2)z^2 + (\psi_3 - \psi_2\phi_1 - \psi_1\phi_2 - \phi_3)z^3 + \cdots,$$
which yields the equations
$$1 = 1,\quad \psi_1 - \phi_1 = 0,\quad \psi_2 - \psi_1\phi_1 - \phi_2 = 0,\quad \psi_3 - \psi_2\phi_1 - \psi_1\phi_2 - \phi_3 = 0,\quad \ldots,\quad \psi_k - \sum_{i=1}^{k}\psi_{k-i}\phi_i = 0,$$
where $\psi_0 = 1$ and $\phi_i = 0$ for $i > p$. These give the following equations for the $\psi_i$:
$$\psi_1 = \phi_1,\quad \psi_2 = \psi_1\phi_1 + \phi_2 = \phi_1^2 + \phi_2,\quad \psi_3 = \psi_2\phi_1 + \psi_1\phi_2 + \phi_3 = \phi_1^3 + 2\phi_1\phi_2 + \phi_3,\quad \ldots,\quad \psi_k = \sum_{i=1}^{k}\psi_{k-i}\phi_i,$$
which gives us the values of all the ψi recursively.
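The recursion lends itself to a few lines of code. Below is a minimal sketch in Python (the course software is R, but the arithmetic is identical); the function name `ar_psi` is our own, not from the notes or the textbook:

```python
def ar_psi(phi, K):
    """psi_0, ..., psi_K for the causal expansion of an AR(p) process,
    where phi = [phi_1, ..., phi_p], via psi_k = sum_{i=1}^k psi_{k-i} phi_i
    (with phi_i treated as 0 for i > p)."""
    psi = [1.0]  # psi_0 = 1
    for k in range(1, K + 1):
        psi.append(sum(psi[k - i] * phi[i - 1]
                       for i in range(1, k + 1) if i <= len(phi)))
    return psi

# For AR(1), the recursion should reproduce psi_k = phi^k:
print(ar_psi([0.5], 4))  # [1.0, 0.5, 0.25, 0.125, 0.0625]
```

For an AR(2), the first values agree with the hand computation above: $\psi_1 = \phi_1$, $\psi_2 = \phi_1^2 + \phi_2$, $\psi_3 = \phi_1^3 + 2\phi_1\phi_2$.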
It of course isn't obvious from the recursive equations above that $\sum_{j\ge 0}|\psi_j| < \infty$. The following theorem says exactly when that is the case:
Theorem 9.1. An AR(p) process is causal iff whenever $\Phi_p(z) = 0$, then $|z| > 1$. In other words, an AR(p) process is causal iff all zeros of $\Phi_p$ are outside of the unit disk.
In the following example, we use the expression we just obtained to express an AR(1) process explicitly. In other words, we'll show that a stationary AR(1) process is causal and will re-derive its acf (which we already derived in an earlier lecture).
Example 9.1. Recall that an AR(1) process is defined to be the stationary solution of
$$X_t - \phi X_{t-1} = Z_t,$$
where $\{Z_t\} \sim WN(0, \sigma^2)$. We already know such a process exists if $|\phi| < 1$. Applying the recursive equations above to the AR(1) case, where $\Phi(z) = 1 - \phi z$, $|\phi| < 1$, we get
$$\psi_1 = \phi,\quad \psi_2 = \phi^2,\quad \psi_3 = \phi^3,\quad \ldots,\quad \psi_k = \phi^k.$$
Therefore,
$$\Psi(z) = \Phi^{-1}(z) = \sum_{k\ge 0}(\phi z)^k,$$
and therefore
$$X_t = \Psi(B)Z_t = \sum_{k\ge 0}\phi^k Z_{t-k}.$$
Note that since $\sum_{k\ge 0}|\phi|^k < \infty$ (since $|\phi| < 1$), we see that $\{X_t\}$ is indeed causal.
There is an easy (more intuitive) way of checking that $X_t = \Psi(B)Z_t = \sum_{k\ge 0}\phi^k Z_{t-k}$ indeed satisfies the autoregressive equation: suppose $X_t = \sum_{j\ge 0}\phi^j Z_{t-j}$. Then
$$X_t - \phi X_{t-1} = \sum_{j\ge 0}\phi^j Z_{t-j} - \sum_{j\ge 0}\phi^{j+1}Z_{t-1-j} = \sum_{j\ge 0}\phi^j Z_{t-j} - \sum_{j\ge 0}\phi^{j+1}Z_{t-(j+1)} = \sum_{j\ge 0}\phi^j Z_{t-j} - \sum_{j\ge 1}\phi^j Z_{t-j} = Z_t.$$
By Proposition 7.2, $\{X_t\}$ is stationary with
$$E[X_t] = 0 \quad\text{and}\quad \gamma_X(h) = \frac{\sigma^2\phi^h}{1-\phi^2}, \qquad h \ge 0.$$
9.3 MA(q) processes can be invertible
We can mimic the work done in the previous section to find an invertible expression for MA(q) processes. Suppose $\{X_t\}$ is an MA(q) process. Then
$$X_t = \sum_{i=0}^{q}\theta_i Z_{t-i} = \Theta(B)Z_t \;\Rightarrow\; \Theta^{-1}(B)X_t = Z_t.$$
(Here, $\theta_0 = 1$.) Suppose $\Theta^{-1}(B)$ is of the form $\Pi(B) = 1 - \sum_{i\ge 1}\pi_i B^i$. Then
$$\Pi(B)\Theta(B) = \Bigl(1 - \sum_{i\ge 1}\pi_i B^i\Bigr)\Bigl(\sum_{i=0}^{q}\theta_i B^i\Bigr) = 1.$$
(Again, here, "1" is the identity operator, not the number.) Equating coefficients of the polynomials on both sides of the equality, we get the equations
$$1 = 1,\quad -\pi_1 + \theta_1 = 0,\quad -\pi_2 - \pi_1\theta_1 + \theta_2 = 0,\quad -\pi_3 - \pi_2\theta_1 - \pi_1\theta_2 + \theta_3 = 0,\quad \ldots,\quad \theta_k - \sum_{i=1}^{k}\pi_i\theta_{k-i} = 0$$
(with $\theta_0 = 1$),
which give the following equations for the $\pi_i$:
$$\pi_1 = \theta_1,\quad \pi_2 = \theta_2 - \pi_1\theta_1 = \theta_2 - \theta_1^2,\quad \pi_3 = \theta_3 - \pi_2\theta_1 - \pi_1\theta_2 = \theta_3 - 2\theta_1\theta_2 + \theta_1^3,\quad \ldots$$
As above, it isn't obvious from these recursive equations that $\sum_{j\ge 0}|\pi_j| < \infty$, but this theorem tells us when this is true:
Theorem 9.2. An MA(q) process is invertible iff whenever $\Theta_q(z) = 0$, then $|z| > 1$. In other words, an MA(q) process is invertible iff all zeros of $\Theta_q$ are outside of the unit disk.
Example 9.2. As in the AR case above, we consider the particular case where q = 1 to express an MA(1) process in its inverted form. To achieve this, we just need to solve the equations above when $\theta_1 = \theta$ and $\theta_i = 0$ for $i \ge 2$. Doing that, we get
$$\pi_1 = \theta,\quad \pi_2 = -\theta^2,\quad \pi_3 = \theta^3,\quad \ldots,\quad \pi_k = (-1)^{k+1}\theta^k.$$
In particular, we see that an MA(1) process is invertible if $|\theta| < 1$ (since in that case $\sum_{j\ge 0}|\pi_j| < \infty$), thus confirming Theorem 9.2.
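The π recursion can also be checked numerically. A minimal Python sketch (the function name `ma_pi` is ours), which for the MA(1) case should reproduce $\pi_k = (-1)^{k+1}\theta^k$:

```python
def ma_pi(theta, K):
    """pi_1, ..., pi_K for an invertible MA(q) process,
    where theta = [theta_1, ..., theta_q], via
    pi_k = theta_k - sum_{i=1}^{k-1} pi_i theta_{k-i} (theta_k = 0 for k > q)."""
    th = lambda k: theta[k - 1] if 1 <= k <= len(theta) else 0.0
    pi = []  # pi[k-1] holds pi_k
    for k in range(1, K + 1):
        pi.append(th(k) - sum(pi[i - 1] * th(k - i) for i in range(1, k)))
    return pi

# MA(1) with theta = 0.5: signs alternate and magnitudes are theta^k.
print(ma_pi([0.5], 5))  # [0.5, -0.25, 0.125, -0.0625, 0.03125]
```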
Math 4506 (Fall 2019) October 2, 2019 Prof. Christian Benes
Lecture #10: ARMA processes
Reference. Sections 4.4 and 4.5 from the textbook.
10.1 ARMA Processes
What happens when you mix an AR(p) and an MA(q) process? Not too surprisingly, you get an ARMA(p, q) process.
Definition 10.1. A time series $\{X_t\}$ is an ARMA(p, q) process if $\{X_t\}$ is stationary and, for all $t$,
$$X_t - \sum_{i=1}^{p}\phi_i X_{t-i} = \sum_{j=0}^{q}\theta_j Z_{t-j}, \qquad (18)$$
where $\theta_0 = 1$, $\{Z_t\} \sim WN(0, \sigma^2)$, and the polynomials $1 - \sum_{i=1}^{p}\phi_i z^i$ and $\sum_{j=0}^{q}\theta_j z^j$ have no common factors.
Note 10.1. Clearly, AR(p) processes are just a particular case of ARMA(p, q) processes (the case when $\theta_i = 0$ for $i = 1, \ldots, q$). So are MA(q) processes (the case when $\phi_i = 0$ for $i = 1, \ldots, p$).
Note 10.2. Recall the backward shift operators $B^j$, defined for $j \ge 0$ by $B^j X_t = X_{t-j}$. Then if we define the polynomials
$$\Phi(z) = 1 - \sum_{i=1}^{p}\phi_i z^i \quad\text{and}\quad \Theta(z) = \sum_{j=0}^{q}\theta_j z^j,$$
we can re-write equation (18) in the more succinct form
$$\Phi(B)X_t = \Theta(B)Z_t. \qquad (19)$$
To simplify the notation and derivations of properties of ARMA processes, we start by focusing on the case where p = q = 1 and will come back to the general case later. An ARMA(1, 1) process $\{X_t\}$ is a stationary time series satisfying the equation
$$X_t - \phi X_{t-1} = Z_t + \theta Z_{t-1},$$
or, equivalently, $\Phi(B)X_t = \Theta(B)Z_t$,
where $\Phi(z) = 1 - \phi z$ and $\Theta(z) = 1 + \theta z$. We've already seen a heuristic derivation of a solution for the AR(1) process. We will now look for a solution in a less explicit but quicker way for the ARMA(1,1) process. Note that we could have used the same quicker method for the AR(1) process.
First a few generalities:
We know from Proposition 7.2 that if $\sum_{i\in\mathbb{Z}}|\psi_i| < \infty$ and $\{Y_t\}$ is a stationary time series, then the time series
$$\psi(B)Y_t, \quad\text{where}\quad \psi(B) = \sum_{j\in\mathbb{Z}}\psi_j B^j,$$
is stationary as well. This suggests that we can repeatedly apply operators of the form $\psi(B) = \sum_{j\in\mathbb{Z}}\psi_j B^j$ (also called filters) to a stationary time series without losing stationarity:
Suppose that $\sum_{i\in\mathbb{Z}}|\alpha_i| < \infty$ and $\sum_{i\in\mathbb{Z}}|\beta_i| < \infty$, and define the series $\alpha(z) = \sum_{j\in\mathbb{Z}}\alpha_j z^j$ and $\beta(z) = \sum_{j\in\mathbb{Z}}\beta_j z^j$. Then Proposition 7.2 implies that successive applications of the operators $\alpha(B)$ and $\beta(B)$ to a stationary time series $\{Y_t\}$ yield another stationary time series, that is,
$$W_t = \alpha(B)\beta(B)Y_t$$
is stationary. In that case,
$$W_t = \sum_{j\in\mathbb{Z}}\eta_j Y_{t-j}, \quad\text{where}\quad \eta_j = \sum_{k\in\mathbb{Z}}\alpha_k\beta_{j-k} = \sum_{k\in\mathbb{Z}}\beta_k\alpha_{j-k} \qquad (20)$$
(you will show this on Homework 3). Equivalently,
$$W_t = H(B)Y_t,$$
where $H(B) = \alpha(B)\beta(B) = \beta(B)\alpha(B)$. Note that the operator $\alpha(B)\beta(B)$ is obtained from $\alpha(B)$ and $\beta(B)$ by performing a formal product of these two operators as if they were polynomials and grouping the terms which have the same powers of $B$.
Let's return to the ARMA(1,1) process which, by definition, is a stationary time series $\{X_t\}$ satisfying
$$X_t - \phi X_{t-1} = Z_t + \theta Z_{t-1}.$$
Equivalently, it is a stationary solution to
$$\Phi(B)X_t = \Theta(B)Z_t,$$
where $\Phi(z) = 1 - \phi z$ and $\Theta(z) = 1 + \theta z$.
We start by finding the Taylor series (let's call it $\Psi(z)$) for $\frac{1}{\Phi(z)} = \frac{1}{1-\phi z}$. By analogy with the formula for geometric series, we see that
$$\frac{1}{\Phi(z)} = \frac{1}{1-\phi z} = \sum_{j\ge 0}\phi^j z^j = \Psi(z).$$
Now if $|\phi| < 1$, the coefficients of the series $\sum_{j\ge 0}\phi^j z^j = \sum_{j\in\mathbb{Z}}\psi_j z^j$ (where $\psi_j = \phi^j$ if $j \ge 0$ and $\psi_j = 0$ if $j < 0$) are absolutely summable, that is, they satisfy $\sum_{j\in\mathbb{Z}}|\psi_j| < \infty$. Therefore, the generalities above apply and we can use (20) to conclude that $\Psi(B)\Phi(B) = 1$, the identity operator.
If we apply $\Psi(B)$ to the two sides of the equation which defines the ARMA(1,1) process,
$$\Phi(B)X_t = \Theta(B)Z_t,$$
we get
$$X_t = \Psi(B)\Theta(B)Z_t.$$
Using equation (20), we get
$$\Psi(B)\Theta(B) = \sum_{i\ge 0}\phi^i B^i (1 + \theta B) = \sum_{j\ge 0}\eta_j B^j,$$
where $\eta_0 = 1$ and $\eta_j = (\phi+\theta)\phi^{j-1}$ for $j \ge 1$. Writing $H(B) = \sum_{j\ge 0}\eta_j B^j$ now gives an explicit expression for the ARMA(1,1) process:
$$X_t = H(B)Z_t = Z_t + (\phi+\theta)\sum_{j\ge 1}\phi^{j-1}Z_{t-j}.$$
We just studied the ARMA(1,1) process in the case where $|\phi| < 1$. Let us now examine the same process when $|\phi| > 1$.
First note that in this case (this, again, is something you will show on Homework 3)
$$\frac{1}{\Phi(z)} = \frac{1}{1-\phi z} = -\sum_{j\ge 1}\phi^{-j}z^{-j}.$$
Mimicking the case $|\phi| < 1$, we write $\Psi(z) = -\sum_{j\ge 1}\phi^{-j}z^{-j}$ and apply $\Psi(B)$ to both sides of
$$\Phi(B)X_t = \Theta(B)Z_t,$$
thus obtaining
$$X_t = \Psi(B)\Theta(B)Z_t = -\theta\phi^{-1}Z_t - (\theta+\phi)\sum_{j\ge 1}\phi^{-(j+1)}Z_{t+j}.$$
Finally, in the case where $|\phi| = 1$, the ARMA(1,1) equations have no stationary solutions (you showed this on the homework in the purely autoregressive case), implying that in that case, there is no ARMA(1,1) process.
Looking at the explicit solutions for the ARMA(1,1) equations which we just derived, we see that
• If $|\phi| < 1$, $X_t$ depends only on "past" values of $Z$, that is, $X_t$ is defined in terms of $\{Z_s\}_{s\le t}$. We call $X$ causal.
• If $|\phi| > 1$, $X_t$ depends only on "future" values of $Z$, that is, $X_t$ is defined in terms of $\{Z_s\}_{s\ge t}$. We call $X$ noncausal.
• If $|\phi| = 1$, there is no stationary solution to (19).
10.2 Invertibility and causality of ARMA(p, q) Processes
Recall the following definition:
Definition 10.2. A time series $\{X_t\}$ is an ARMA(p, q) process if $\{X_t\}$ is stationary and, for all $t$,
$$X_t - \sum_{i=1}^{p}\phi_i X_{t-i} = \sum_{j=0}^{q}\theta_j Z_{t-j},$$
where $\theta_0 = 1$, $\{Z_t\} \sim WN(0, \sigma^2)$, and the polynomials $\Phi(z) = 1 - \sum_{i=1}^{p}\phi_i z^i$ and $\Theta(z) = \sum_{j=0}^{q}\theta_j z^j$ have no common factors.
Using the backward shift operator, we can re-write equation (18) in the more succinct form
$$\Phi(B)X_t = \Theta(B)Z_t.$$
ARMA processes are commonly used for a number of reasons. One of these is their linear structure, which simplifies a number of calculations, particularly when predicting. Another is the fact that for many autocovariance functions, one can find an ARMA process with that autocovariance function.
Definition 10.3. An ARMA(p, q) process is
• causal if there exist constants $\psi_j$ with $\sum_{j\ge 0}|\psi_j| < \infty$ such that
$$X_t = \sum_{j\ge 0}\psi_j Z_{t-j}.$$
Note that in that case, an ARMA(p, q) process is also what we defined to be an MA($\infty$) process.
• invertible if there exist constants $\pi_j$ with $\sum_{j\ge 0}|\pi_j| < \infty$ such that
$$Z_t = \sum_{j\ge 0}\pi_j X_{t-j}.$$
Theorem 10.1. An ARMA(p, q) process is
• causal if $\Phi(z) \ne 0$ for all $|z| \le 1$;
• invertible if $\Theta(z) \ne 0$ for all $|z| \le 1$.
Theorem 10.1 tells us how to verify whether a given ARMA process is causal or invertible: all one needs to do is find the roots of the two equations $\Phi(z) = 0$ and $\Theta(z) = 0$, of degree p and q, respectively, and check whether any of them lie in the closed unit disk.
For practical purposes, in particular to find the ACF of a causal invertible ARMA process, it will be useful to determine the coefficients $\psi_j$ and $\pi_j$ of the causal and invertible forms of the process. This can be done using an idea you've seen already:
If
$$X_t = \sum_{j\ge 0}\psi_j Z_{t-j},$$
then (19) becomes
$$\Phi(B)\sum_{j\ge 0}\psi_j Z_{t-j} = \Theta(B)Z_t.$$
Therefore, to find the coefficients $\psi_j$ in terms of the coefficients $\phi_j$ and $\theta_j$, we just need to match the coefficients in
$$\Bigl(1 - \sum_{i=1}^{p}\phi_i z^i\Bigr)\sum_{k\ge 0}\psi_k z^k = \sum_{j=0}^{q}\theta_j z^j,$$
which can be re-written more explicitly as
$$(1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p)(\psi_0 + \psi_1 z + \psi_2 z^2 + \psi_3 z^3 + \cdots) = 1 + \theta_1 z + \theta_2 z^2 + \cdots + \theta_q z^q.$$
This yields
$$\psi_0 = 1,\qquad \psi_1 - \phi_1\psi_0 = \theta_1 \;\Rightarrow\; \psi_1 = \theta_1 + \phi_1,\qquad \psi_2 - \phi_1\psi_1 - \phi_2\psi_0 = \theta_2 \;\Rightarrow\; \psi_2 = \theta_2 + \phi_1\psi_1 + \phi_2\psi_0,\qquad \ldots,$$
and in general
$$\psi_j = \theta_j + \sum_{i=1}^{j}\phi_i\psi_{j-i},$$
with $\theta_j = 0$ for all $j > q$ and $\phi_i = 0$ for all $i > p$. In summary, we get the recursive formula
$$\psi_j = \begin{cases} \theta_j + \sum_{i=1}^{j}\phi_i\psi_{j-i}, & j \le \min\{p, q\},\\ \theta_j + \sum_{i=1}^{p}\phi_i\psi_{j-i}, & p < j \le q,\\ \sum_{i=1}^{j}\phi_i\psi_{j-i}, & q < j \le p,\\ \sum_{i=1}^{p}\phi_i\psi_{j-i}, & j > \max\{p, q\}. \end{cases}$$
These equations will allow us, whenever we are dealing with a causal ARMA process, to write it in its causal form $X_t = \sum_{j\ge 0}\psi_j Z_{t-j}$, with $\psi_j$ as above.
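The recursion above is a single loop in code. Here is a rough Python sketch (the function name is ours); for an ARMA(1,1) it should reproduce the closed form $\psi_j = (\phi+\theta)\phi^{j-1}$ derived earlier:

```python
def arma_psi(phi, theta, K):
    """psi_0, ..., psi_K for a causal ARMA(p,q):
    psi_j = theta_j + sum_{i=1}^{j} phi_i psi_{j-i},
    with theta_0 = 1 and theta_j = 0 (j > q), phi_i = 0 (i > p)."""
    th = lambda j: theta[j - 1] if 1 <= j <= len(theta) else 0.0
    psi = [1.0]  # psi_0 = 1
    for j in range(1, K + 1):
        psi.append(th(j) + sum(phi[i - 1] * psi[j - i]
                               for i in range(1, j + 1) if i <= len(phi)))
    return psi

# ARMA(1,1) with phi = 0.5, theta = 0.25: psi_j = 0.75 * 0.5**(j-1) for j >= 1.
print(arma_psi([0.5], [0.25], 4))  # [1.0, 0.75, 0.375, 0.1875, 0.09375]
```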
We can use the same procedure to determine the coefficients $\pi_j$ for an invertible ARMA process:
If
$$Z_t = \sum_{j\ge 0}\pi_j X_{t-j},$$
then (19) becomes
$$\Phi(B)X_t = \Theta(B)\sum_{j\ge 0}\pi_j X_{t-j},$$
so to find the coefficients $\pi_j$, we match the coefficients in
$$1 - \sum_{i=1}^{p}\phi_i z^i = \Bigl(\sum_{j=0}^{q}\theta_j z^j\Bigr)\sum_{k\ge 0}\pi_k z^k,$$
or, equivalently,
$$1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p = (1 + \theta_1 z + \theta_2 z^2 + \cdots + \theta_q z^q)(\pi_0 + \pi_1 z + \pi_2 z^2 + \pi_3 z^3 + \cdots).$$
This yields
$$\pi_0 = 1,\qquad \pi_1 + \theta_1\pi_0 = -\phi_1 \;\Rightarrow\; \pi_1 = -\phi_1 - \theta_1\pi_0,\qquad \pi_2 + \theta_1\pi_1 + \theta_2\pi_0 = -\phi_2 \;\Rightarrow\; \pi_2 = -\phi_2 - \theta_1\pi_1 - \theta_2\pi_0,\qquad \ldots,$$
and in general
$$\pi_j = -\phi_j - \sum_{i=1}^{j}\theta_i\pi_{j-i},$$
with $\phi_j = 0$ for all $j > p$ and $\theta_i = 0$ for all $i > q$.
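As with the ψ's, the π recursion is easy to run numerically. A short Python sketch (names ours); for an ARMA(1,1) the output should satisfy $\pi_1 = -(\phi+\theta)$ and $\pi_j = -\theta\,\pi_{j-1}$ for $j \ge 2$:

```python
def arma_pi(phi, theta, K):
    """pi_0, ..., pi_K for an invertible ARMA(p,q):
    pi_j = -phi_j - sum_{i=1}^{j} theta_i pi_{j-i}, pi_0 = 1,
    with phi_j = 0 (j > p) and theta_i = 0 (i > q)."""
    ph = lambda j: phi[j - 1] if 1 <= j <= len(phi) else 0.0
    th = lambda i: theta[i - 1] if 1 <= i <= len(theta) else 0.0
    pi = [1.0]
    for j in range(1, K + 1):
        pi.append(-ph(j) - sum(th(i) * pi[j - i] for i in range(1, j + 1)))
    return pi

# ARMA(1,1) with phi = 0.5, theta = 0.25:
print(arma_pi([0.5], [0.25], 3))  # [1.0, -0.75, 0.1875, -0.046875]
```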
Math 4506 (Fall 2019) October 7, 2019 Prof. Christian Benes
Lecture #11: ACF of ARMA processes; First Statistical Steps
Reference. Sections 4.4 and 4.5 from the textbook.
11.1 ACF for causal ARMA processes
The causal representation of ARMA processes will make it relatively easy to compute the autocorrelation function (ACF) for some ARMA processes. As you already know, one can get the ACF $\rho$ from the autocovariance function (ACVF) $\gamma$ using the relationship $\rho(h) = \frac{\gamma(h)}{\gamma(0)}$.
As an example of this, we look at a few particular ARMA processes and derive their ACFs.
If an ARMA process is causal, we know that we can write $X_t = \sum_{j\ge 0}\psi_j Z_{t-j}$. Then, using the fact that $\{Z_t\}_{t\in\mathbb{Z}}$ is a sequence of uncorrelated random variables with variance $\sigma^2$, we get
$$\gamma(h) = \mathrm{Cov}(X_t, X_{t+h}) = \mathrm{Cov}\Bigl(\sum_{j\ge 0}\psi_j Z_{t-j}, \sum_{k\ge 0}\psi_k Z_{t+h-k}\Bigr) = \mathrm{Cov}\Bigl(\sum_{j\ge 0}\psi_j Z_{t-j}, \sum_{j\ge -h}\psi_{j+h} Z_{t-j}\Bigr)$$
$$= \begin{cases}\sigma^2\sum_{j\ge 0}\psi_j\psi_{j+h}, & h \ge 0\\ \sigma^2\sum_{j\ge -h}\psi_j\psi_{j+h}, & h \le 0\end{cases} \;=\; \begin{cases}\sigma^2\sum_{j\ge 0}\psi_j\psi_{j+h}, & h \ge 0\\ \sigma^2\sum_{i\ge 0}\psi_{i-h}\psi_i, & h \le 0\end{cases} \;=\; \sigma^2\sum_{j\ge 0}\psi_j\psi_{j+|h|}.$$
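Numerically, γ(h) can be approximated by truncating this sum, since the $\psi_j$ are absolutely summable. A small Python sketch (function name ours), checked against the MA(1) values $\gamma(0) = \sigma^2(1+\theta^2)$ and $\gamma(1) = \sigma^2\theta$:

```python
def acvf_from_psi(psi, h, sigma2=1.0):
    """Approximate gamma(h) = sigma^2 * sum_{j>=0} psi_j psi_{j+|h|}
    from a truncated list psi = [psi_0, psi_1, ...]."""
    h = abs(h)
    return sigma2 * sum(psi[j] * psi[j + h] for j in range(len(psi) - h))

theta = 0.5
psi = [1.0, theta]  # MA(1): psi_0 = 1, psi_1 = theta, psi_j = 0 for j >= 2
print(acvf_from_psi(psi, 0))  # 1 + theta^2 = 1.25
print(acvf_from_psi(psi, 1))  # theta = 0.5
print(acvf_from_psi(psi, 2))  # 0.0
```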
Example 11.1. In this example, we derive the ACVF for causal ARMA(1,1) processes. Note that there is a slightly more detailed discussion of the ACF of ARMA(1,1) processes in Section 4.4 of the textbook. Make sure you read it.
We've seen that if $|\phi| < 1$, such a process can be written as the following MA($\infty$) process:
$$X_t = H(B)Z_t = Z_t + (\phi+\theta)\sum_{j\ge 1}\phi^{j-1}Z_{t-j}.$$
Therefore, the expression
$$\gamma(h) = \sigma^2\sum_{j\ge 0}\psi_j\psi_{j+|h|}$$
yields
$$\gamma(0) = \sigma^2\sum_{j\ge 0}\psi_j^2 = \sigma^2\Bigl(1 + \sum_{j\ge 1}\bigl((\phi+\theta)\phi^{j-1}\bigr)^2\Bigr) = \sigma^2\Bigl(1 + (\phi+\theta)^2\sum_{j\ge 0}\phi^{2j}\Bigr) = \sigma^2\Bigl(1 + \frac{(\phi+\theta)^2}{1-\phi^2}\Bigr),$$
$$\gamma(1) = \sigma^2\sum_{j\ge 0}\psi_j\psi_{j+1} = \sigma^2\Bigl((\phi+\theta) + (\phi+\theta)^2\sum_{j\ge 1}\phi^{2j-1}\Bigr) = \sigma^2\Bigl((\phi+\theta) + \frac{\phi(\phi+\theta)^2}{1-\phi^2}\Bigr),$$
and, for $h \ge 2$,
$$\gamma(h) = \sigma^2\sum_{j\ge 0}\psi_j\psi_{j+h} = \sigma^2\Bigl((\phi+\theta)\phi^{h-1} + (\phi+\theta)^2\sum_{j\ge 1}\phi^{2j+h-2}\Bigr) = \sigma^2\Bigl((\phi+\theta)\phi^{h-1} + \frac{\phi^h(\phi+\theta)^2}{1-\phi^2}\Bigr) = \phi^{h-1}\gamma(1).$$
Example 11.2. Consider the ARMA process defined by
$$X_t + \tfrac{1}{2}X_{t-1} = Z_t - \tfrac{1}{4}Z_{t-1}, \qquad \{Z_t\} \sim WN(0, \sigma^2).$$
The autoregressive polynomial for this process, $\Phi(z) = 1 + \frac{1}{2}z$, has one zero, namely $z = -2$. Since $|-2| > 1$, this ARMA process is causal. Using the fact that $\theta = -\frac{1}{4}$ and $\phi = -\frac{1}{2}$, we get from Example 11.1
$$\gamma(0) = \sigma^2\Bigl(1 + \frac{9/16}{3/4}\Bigr) = \frac{7}{4}\sigma^2$$
and, for $h \ge 1$,
$$\gamma(h) = \sigma^2\Bigl(-\frac{1}{2}\Bigr)^{h-1}\Bigl(-\frac{3}{4} + \frac{(-1/2)(9/16)}{3/4}\Bigr) = \sigma^2\Bigl(-\frac{1}{2}\Bigr)^{h-1}\Bigl(-\frac{9}{8}\Bigr) = (-1)^h\,9\Bigl(\frac{1}{2}\Bigr)^{h+2}\sigma^2.$$
We therefore see that the ACF satisfies, for $h \ge 1$,
$$\rho(h) = \frac{\gamma(h)}{\gamma(0)} = (-1)^h\,\frac{9}{7}\Bigl(\frac{1}{2}\Bigr)^h,$$
which alternates in sign and decays exponentially.
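These closed-form values are easy to corroborate numerically by truncating the causal expansion $X_t = Z_t + (\phi+\theta)\sum_{j\ge 1}\phi^{j-1}Z_{t-j}$. A rough Python check (the truncation level of 200 terms is our own choice; $\sigma^2 = 1$):

```python
phi, theta = -0.5, -0.25  # the process of Example 11.2

# causal coefficients: psi_0 = 1, psi_j = (phi + theta) * phi**(j-1) for j >= 1
psi = [1.0] + [(phi + theta) * phi ** (j - 1) for j in range(1, 200)]

def gamma(h):
    # gamma(h) = sum_j psi_j psi_{j+h}  (sigma^2 = 1), truncated
    return sum(psi[j] * psi[j + h] for j in range(len(psi) - h))

print(round(gamma(0), 10))             # 1.75, i.e. 7/4
print(round(gamma(1) / gamma(0), 10))  # -0.6428571429, i.e. rho(1) = -9/14
```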
11.2 The statistics point of view
In everything that follows, we assume that we are dealing with a stationary time series. We will hope that this assumption is satisfied by the residual sequence $\{Y_t\}$ of our time series model.
So far, we've looked at time series from a probability point of view, that is, we've developed time series models. Of course, our goal is to make sure our models depict reality appropriately, so our next task will be to examine the data and use it to determine which models are appropriate.
11.2.1 Basic Ideas
One of the big goals of Statistics is to estimate population parameters. This is done by calculating statistics (or estimators), which are quantities computed from data, and using them as point estimates of the appropriate parameter.
If $\theta$ is a parameter in a distribution (such as the mean, the variance, or anything else), the standard notation for an estimator of $\theta$ is $\hat\theta$, and estimates are denoted by $\theta_e$. Note that the book uses a different (and worse) notation, but you should still be able to figure out from context what is meant.
So what's the difference between an estimate and an estimator? At first it may seem quite subtle. The key is to understand that before we observe data, we are dealing with random variables (we don't know yet which values they will take, so they are still random), while after we see the data, these random variables have crystallized into (non-random) real numbers.
• An estimator is a random variable based on the random variables for which one wishes to estimate something.
• An estimate is a number obtained after observing realizations of the random variables.
Typically, there are a number of reasonable estimators (e.g., the maximum likelihood estimator or the method of moments estimator) for a given parameter, and there are several criteria according to which their value is judged. One of them is the notion of unbiasedness:
Definition 11.1. An estimator $\hat\theta$ for a parameter $\theta$ is unbiased if
$$E[\hat\theta] = \theta.$$
If you draw independent samples $y_1, y_2, \ldots, y_n$ from the sequence of random variables $Y_1, \ldots, Y_n$ with an unknown mean $\mu$ and unknown variance $\sigma^2$, then an unbiased estimator of $\mu$ is the sample mean
$$\hat\mu = \bar{Y} = \frac{Y_1 + \cdots + Y_n}{n},$$
and a common unbiased estimator of $\sigma^2$ is the sample variance
$$\hat\sigma^2 = S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2.$$
The corresponding estimates are
$$\mu_e = \bar{y} = \frac{y_1 + \cdots + y_n}{n} \quad\text{and}\quad \sigma^2_e = s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2.$$
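In code, these estimates take one line each; Python's standard library provides them directly (in R, `mean(y)` and `var(y)` play the same role, and `var` likewise divides by n − 1). The data vector below is made up for illustration:

```python
from statistics import mean, variance  # variance uses the n - 1 denominator

y = [2.0, 4.0, 4.0, 6.0]
mu_e = mean(y)       # sample mean
s2_e = variance(y)   # sample variance
print(mu_e)          # 4.0
print(s2_e)          # 8/3, about 2.667
```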
Note 11.1. Though they are less natural, the following are also unbiased estimators for the mean:
• $\hat\mu_1 = Y_1$
• $\hat\mu_2 = \frac{Y_1 + Y_2}{2}$
• $\hat\mu_3 = \frac{2^n}{2^n - 1}\sum_{i=1}^{n}(1/2)^i Y_i$
11.2.2 Confidence Intervals
We now discuss another method for drawing conclusions about the processes from which data might originate. This method relies on confidence intervals, an object somewhat related to the idea of hypothesis testing. We describe this method via an example.
Suppose that $\{X_i\}_{i=1}^{n} \sim N(\mu, \sigma^2)$ are i.i.d. Then the sample mean (recall that it is an estimator, thus a random variable) satisfies
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_i \sim N\Bigl(\mu, \frac{\sigma^2}{n}\Bigr).$$
[Recall that this can be shown either with moment generating functions or by computing the convolution of the p.d.f.'s. In either case, one just needs to deal with the case $n = 2$ and generalize the result by induction.]
In particular, knowing the distribution of $\bar{X}$ allows us to compute probabilities for it. We already know how we would test the hypothesis that $\mu$ has some specific value. We now look at how to derive a confidence interval for $\mu$.
We know that if $\{X_i\}_{i=1}^{n} \sim N(\mu, \sigma^2)$ are independent, then
$$\bar{X} \sim N\Bigl(\mu, \frac{\sigma^2}{n}\Bigr),$$
which implies that
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1),$$
so
$$P\Bigl(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\Bigr) = 1 - \alpha. \qquad (21)$$
In particular, as a numerical example, we have
$$P\Bigl(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\Bigr) = 0.95.$$
We can now play around with (21) and try to isolate $\mu$ to find an interval in which $\mu$ has a $1-\alpha$ chance of finding itself:
$$1 - \alpha = P\Bigl(-z_{\alpha/2} \le \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\Bigr) = P\bigl(-z_{\alpha/2}\sigma/\sqrt{n} \le \bar{X} - \mu \le z_{\alpha/2}\sigma/\sqrt{n}\bigr)$$
$$= P\bigl(-z_{\alpha/2}\sigma/\sqrt{n} - \bar{X} \le -\mu \le z_{\alpha/2}\sigma/\sqrt{n} - \bar{X}\bigr) = P\bigl(\bar{X} + z_{\alpha/2}\sigma/\sqrt{n} \ge \mu \ge \bar{X} - z_{\alpha/2}\sigma/\sqrt{n}\bigr).$$
Writing this in a slightly more elegant way gives
$$P\Bigl(\bar{X} - \frac{\sigma}{\sqrt{n}}z_{\alpha/2} \le \mu \le \bar{X} + \frac{\sigma}{\sqrt{n}}z_{\alpha/2}\Bigr) = 1 - \alpha. \qquad (22)$$
This means that the probability that $\mu$ finds itself in the interval
$$\Bigl(\bar{X} - \frac{\sigma}{\sqrt{n}}z_{\alpha/2},\ \bar{X} + \frac{\sigma}{\sqrt{n}}z_{\alpha/2}\Bigr)$$
is $1-\alpha$. That interval is therefore called a $(1-\alpha)$-confidence interval. It is a random interval which has a $(1-\alpha)$ chance of containing the true value $\mu$.
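Computing such an interval is mechanical once $z_{\alpha/2}$ is available; Python's standard library exposes the normal quantile function. A sketch (the numbers fed in are made up for illustration):

```python
from math import sqrt
from statistics import NormalDist

def mean_ci(xbar, sigma, n, alpha=0.05):
    """(1 - alpha)-confidence interval for mu, with sigma known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}; about 1.96 when alpha = 0.05
    half = z * sigma / sqrt(n)
    return (xbar - half, xbar + half)

lo, hi = mean_ci(xbar=10.0, sigma=2.0, n=100)
print(lo, hi)  # roughly 9.608 and 10.392
```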
11.2.3 The chi-square distribution
The chi-square distribution is one of the most common distributions in statistics. It appears in a number of very different tests. This distribution is so natural because it is an offspring of the normal distribution.
Definition 11.2. Suppose $Z_1, \ldots, Z_n$ are i.i.d. N(0, 1) random variables. Then
$$X = \sum_{i=1}^{n}Z_i^2$$
is said to have the chi-square distribution with $n$ degrees of freedom. We write
$$X \sim \chi^2_n.$$
Since $X$ is defined from independent normal random variables, one can fairly easily derive the pdf for $X \sim \chi^2_n$:
$$f(x) = \frac{(1/2)^{n/2}}{\Gamma(n/2)}\,x^{n/2-1}e^{-x/2}, \qquad x > 0.$$
Note 11.2. Clearly, since a chi-square random variable is a sum of squares, it is always nonnegative.
Note 11.3. A quick look at the pdf above shows that
• If $X \sim \chi^2_2$, then $X \sim \mathrm{Exp}(1/2)$.
• Chi-square distributions are a particular case of gamma distributions: if $X \sim \chi^2_n$, then $X \sim \Gamma(n/2, 1/2)$.
Since we have a pdf for $X \sim \chi^2_n$, we could derive its moments from it. However, it is easier to derive the mean and the variance using the expression $X = \sum_{i=1}^{n}Z_i^2$:
$$E[X] = E\Bigl[\sum_{i=1}^{n}Z_i^2\Bigr] = \sum_{i=1}^{n}E[Z_i^2] = n,$$
$$E[X^2] = E\Bigl[\Bigl(\sum_{i=1}^{n}Z_i^2\Bigr)^2\Bigr] = \sum_{i,j=1}^{n}E[Z_i^2 Z_j^2] = nE[Z_1^4] + (n^2-n)E[Z_1^2 Z_2^2] = nE[Z_1^4] + (n^2-n)E[Z_1^2]E[Z_2^2] = nE[Z_1^4] + (n^2-n).$$
Since $Z_1 \sim N(0, 1)$, we can use its moment generating function, $M_Z(t) = e^{t^2/2}$, to compute $E[Z_1^4]$:
$$E[Z_1^4] = M_Z^{(4)}(0) = 3.$$
Therefore,
$$E[X^2] = 3n + n^2 - n = n^2 + 2n,$$
implying that
$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = n^2 + 2n - n^2 = 2n.$$
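Both moments can be corroborated by integrating the pdf numerically; the sketch below (our own midpoint-rule grid) does this in Python for $n = 4$:

```python
from math import gamma as Gamma, exp

n = 4
def f(x):
    """Chi-square pdf with n degrees of freedom."""
    return (0.5 ** (n / 2)) / Gamma(n / 2) * x ** (n / 2 - 1) * exp(-x / 2)

dx = 0.001
xs = [dx * (k + 0.5) for k in range(int(100 / dx))]  # midpoint rule on (0, 100]
m1 = sum(x * f(x) * dx for x in xs)                  # E[X], close to n = 4
var = sum(x * x * f(x) * dx for x in xs) - m1 ** 2   # Var(X), close to 2n = 8
print(round(m1, 3), round(var, 3))
```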
[Figures: the pdfs of chi-square random variables with 1, 2, 3, 4, and 5 degrees of freedom.]
11.3 Estimators for covariance, correlation
By analogy with the definition of the sample variance, it is probably not too difficult to believe that a reasonable estimator for the covariance matrix of a random vector $\vec{X} = (X_1, \ldots, X_m)'$, based on a sample coming from the $n$ independent random vectors $\vec{X}_1 = (X_{1,1}, \ldots, X_{m,1})', \ldots, \vec{X}_n = (X_{1,n}, \ldots, X_{m,n})'$, would be the sample covariance matrix $Q = (Q_{i,j})_{1\le i,j\le m}$, where
$$Q_{i,j} = \frac{1}{n-1}\sum_{k=1}^{n}(X_{i,k} - \bar{X}_i)(X_{j,k} - \bar{X}_j),$$
with $\bar{X}_i = \frac{1}{n}\sum_{k=1}^{n}X_{i,k}$. The "$n-1$" is there because it makes each $Q_{i,j}$ unbiased, as you will show on a homework problem.
Note that in the setting of stationary time series, the covariance is a function of one parameter only, so we will define the sample covariance in a slightly different way.
Sample Autocovariance Function (ACVF)
$$\hat\gamma(h) := \frac{1}{n}\sum_{t=1}^{n-|h|}(X_{t+|h|} - \bar{X})(X_t - \bar{X}), \qquad -n < h < n.$$
Notice that $\hat\gamma(-h) = \hat\gamma(h)$.
Notice also that the sum is divided by $n$, not $n - |h| - 1$ as we might have expected by analogy with the definition of the sample covariance above. The main reason for this is that with this definition, the sample covariance matrix ends up being nonnegative definite (don't worry about why). Though we gain a little with this nice property, we lose a little by having an estimator which is not unbiased. However, it turns out that when we deal with large samples, this estimator will usually be close to unbiased.
Sample Autocovariance Matrix
The sample covariance matrix for a stationary time series is simply the matrix of sample autocovariances given by
$$\hat\Gamma_n := (\hat\gamma(i-j))_{1\le i,j\le n}.$$
Note that $\hat\Gamma_n$ is nonnegative definite.
Sample Autocorrelation Function (ACF)
$$\hat\rho(h) := \frac{\hat\gamma(h)}{\hat\gamma(0)}, \qquad -n < h < n.$$
A plot of $\rho_e(h)$ versus $h$ is called a sample correlogram, or often just correlogram.
Sample Autocorrelation Matrix
The sample correlation matrix is simply the matrix of sample autocorrelations given by
$$\hat{R}_n := (\hat\rho(i-j))_{1\le i,j\le n}.$$
Notice that $\hat{R}_n$ is nonnegative definite, and that each diagonal entry of $\hat{R}_n$ is 1 since $\hat\rho(0) = 1$.
Math 4506 (Fall 2019) October 16, 2019 Prof. Christian Benes
Lecture #12: First Statistical Steps
Reference. Sections 3.2 and 3.6 from the textbook.
Before using the ideas developed in the last few lectures to define and analyze ARMA processes, let's take a step back to take a first look at time series from a statistical point of view (now that all of you have the tools needed to do so).
12.1 Estimators for covariance, correlation
Example 12.1. Suppose that the observed data set is $0, 4, 8, 4, 0, -4, 0, -4$. Viewing this as a "time series" means that $(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8) = (0, 4, 8, 4, 0, -4, 0, -4)$. The sample mean is therefore
$$\bar{x} = \frac{1}{8}\sum_{t=1}^{8}x_t = \frac{0+4+8+4+0-4+0-4}{8} = \frac{8}{8} = 1,$$
and the sample autocovariance function is
$$\gamma_e(h) := \frac{1}{8}\sum_{t=1}^{8-|h|}(x_{t+|h|}-1)(x_t-1), \qquad -8 < h < 8.$$
Thus, we can easily compute that
$$\gamma_e(0) = \frac{1}{8}\sum_{t=1}^{8}(x_t-1)^2 = \frac{1}{8}\bigl[(0-1)^2+(4-1)^2+(8-1)^2+(4-1)^2+(0-1)^2+(-4-1)^2+(0-1)^2+(-4-1)^2\bigr] = \frac{120}{8},$$
$$\gamma_e(1) = \gamma_e(-1) = \frac{1}{8}\sum_{t=1}^{7}(x_{t+1}-1)(x_t-1) = \frac{1}{8}\bigl[(4-1)(0-1)+(8-1)(4-1)+(4-1)(8-1)+(0-1)(4-1)+(-4-1)(0-1)+(0-1)(-4-1)+(-4-1)(0-1)\bigr] = \frac{51}{8},$$
$$\gamma_e(2) = \gamma_e(-2) = \frac{1}{8}\sum_{t=1}^{6}(x_{t+2}-1)(x_t-1) = \frac{1}{8}\bigl[(8-1)(0-1)+(4-1)(4-1)+(0-1)(8-1)+(-4-1)(4-1)+(0-1)(0-1)+(-4-1)(-4-1)\bigr] = \frac{6}{8},$$
$$\gamma_e(3) = \gamma_e(-3) = \frac{1}{8}\sum_{t=1}^{5}(x_{t+3}-1)(x_t-1) = \frac{1}{8}\bigl[(4-1)(0-1)+(0-1)(4-1)+(-4-1)(8-1)+(0-1)(4-1)+(-4-1)(0-1)\bigr] = -\frac{39}{8},$$
$$\gamma_e(4) = \gamma_e(-4) = \frac{1}{8}\sum_{t=1}^{4}(x_{t+4}-1)(x_t-1) = \frac{1}{8}\bigl[(0-1)(0-1)+(-4-1)(4-1)+(0-1)(8-1)+(-4-1)(4-1)\bigr] = -\frac{36}{8},$$
$$\gamma_e(5) = \gamma_e(-5) = \frac{1}{8}\sum_{t=1}^{3}(x_{t+5}-1)(x_t-1) = \frac{1}{8}\bigl[(-4-1)(0-1)+(0-1)(4-1)+(-4-1)(8-1)\bigr] = -\frac{33}{8},$$
$$\gamma_e(6) = \gamma_e(-6) = \frac{1}{8}\sum_{t=1}^{2}(x_{t+6}-1)(x_t-1) = \frac{1}{8}\bigl[(0-1)(0-1)+(-4-1)(4-1)\bigr] = -\frac{14}{8},$$
$$\gamma_e(7) = \gamma_e(-7) = \frac{1}{8}\sum_{t=1}^{1}(x_{t+7}-1)(x_t-1) = \frac{1}{8}\bigl[(-4-1)(0-1)\bigr] = \frac{5}{8}.$$
The sample autocorrelation function is
$$\rho_e(h) := \frac{\gamma_e(h)}{\gamma_e(0)} = \frac{\gamma_e(h)}{120/8}, \qquad -8 < h < 8,$$
so that
$$\rho_e(0) = 1,\quad \rho_e(\pm 1) = \frac{51}{120},\quad \rho_e(\pm 2) = \frac{6}{120},\quad \rho_e(\pm 3) = -\frac{39}{120},\quad \rho_e(\pm 4) = -\frac{36}{120},\quad \rho_e(\pm 5) = -\frac{33}{120},\quad \rho_e(\pm 6) = -\frac{14}{120},\quad \rho_e(\pm 7) = \frac{5}{120}.$$
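These hand computations can be verified by coding the definition of $\gamma_e$ directly. A Python sketch (the function name is ours; R's `acf` performs the same computation):

```python
def sample_acvf(x, h):
    """gamma_e(h) = (1/n) sum_{t=1}^{n-|h|} (x_{t+|h|} - xbar)(x_t - xbar)."""
    n, h = len(x), abs(h)
    xbar = sum(x) / n
    return sum((x[t + h] - xbar) * (x[t] - xbar) for t in range(n - h)) / n

x = [0, 4, 8, 4, 0, -4, 0, -4]
print([round(8 * sample_acvf(x, h), 1) for h in range(8)])
# 8*gamma_e(h): [120.0, 51.0, 6.0, -39.0, -36.0, -33.0, -14.0, 5.0]
print([round(sample_acvf(x, h) / sample_acvf(x, 0), 3) for h in range(8)])
# rho_e(h):    [1.0, 0.425, 0.05, -0.325, -0.3, -0.275, -0.117, 0.042]
```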
As for the sample covariance and correlation matrices, $\Gamma_{8,e} = (\gamma_e(i-j))_{1\le i,j\le 8}$ and $R_{8,e} = (\rho_e(i-j))_{1\le i,j\le 8}$ are the symmetric Toeplitz matrices whose first rows are
$$\Bigl(\frac{120}{8},\ \frac{51}{8},\ \frac{6}{8},\ -\frac{39}{8},\ -\frac{36}{8},\ -\frac{33}{8},\ -\frac{14}{8},\ \frac{5}{8}\Bigr)$$
and
$$\Bigl(1,\ \frac{51}{120},\ \frac{6}{120},\ -\frac{39}{120},\ -\frac{36}{120},\ -\frac{33}{120},\ -\frac{14}{120},\ \frac{5}{120}\Bigr),$$
respectively; each subsequent row is the row above shifted one entry to the right.
Note that this example can be done in R in almost no time, since R will compute sample correlograms for you. Here's how: type
> x=c(0,4,8,4,0,-4,0,-4)
and
> acf(x)
This gives the following graph:
[Figure: sample correlogram of the series x.]
To see the actual values of the sample autocorrelation function, not just the graph, type:
> a=acf(x)
> a
This gives the following:
Autocorrelations of series 'x', by lag
0      1      2      3      4      5      6      7
1.000  0.425  0.050 -0.325 -0.300 -0.275 -0.117  0.042
12.2 Test for the residual sequence
The simplest random sequence is white noise, as it has the simplest covariance structure. We will see here how to determine whether the residual sequence in a time series could be modeled by white noise. The key idea here is that if $Y_1, \ldots, Y_n$ are i.i.d. with finite variance, then $\hat\rho(1), \ldots, \hat\rho(n-1)$ are approximately i.i.d. with distribution $N(0, 1/n)$. This fact is far from obvious, so feel free not to worry about why it is the case. Note that this approximation is good for small lags, but becomes bad for large lags.
Now suppose $X \sim N(0, 1/n)$. Then $\sqrt{n}X \sim N(0, 1)$. Therefore,
$$P(-1.96/\sqrt{n} \le X \le 1.96/\sqrt{n}) = P(-1.96 \le \sqrt{n}X \le 1.96) \approx 0.95 = 95\%.$$
Of course, the same applies to our random variables $\hat\rho(i)$ above. For any $i$, under the assumption that $\hat\rho(i) \sim N(0, 1/n)$, $\hat\rho(i)$ has a 95% chance of landing in that interval. In particular, if our assumption that $\hat\rho(i) \sim N(0, 1/n)$ and that the $\hat\rho(i)$ are independent is correct, roughly 95% of the $\hat\rho(i)$ should be in that interval.
This gives us a nice procedure for determining whether the random variables of our residual sequence $Y_1, \ldots, Y_n$ could be i.i.d. with finite variance or not: if much more than 5% of the sample autocorrelations land outside of the interval
$$(-1.96/\sqrt{n},\ 1.96/\sqrt{n}),$$
then there is no good reason to believe that $\hat\rho(1), \ldots, \hat\rho(n-1)$ are approximately i.i.d. with distribution $N(0, 1/n)$, and therefore no good reason to believe that $Y_1, \ldots, Y_n$ are i.i.d. with finite variance. In that case, we reject the hypothesis that $Y_1, \ldots, Y_n$ are i.i.d. with finite variance. Although this is not formal (we don't yet have a systematic quantitative rule for rejecting this hypothesis), it at least suggests a simple way of checking whether a random sequence may be i.i.d. noise or not.
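The whole check takes only a few lines of code. A Python version of the procedure (the seed, the choice of 30 lags, and the use of `random.gauss` are our own):

```python
import random
from math import sqrt

random.seed(1)
n = 2000
w = [random.gauss(0, 1) for _ in range(n)]  # i.i.d. N(0,1) "residuals"
wbar = sum(w) / n
g0 = sum((v - wbar) ** 2 for v in w) / n

def rho_hat(h):
    """Sample autocorrelation at lag h."""
    return sum((w[t + h] - wbar) * (w[t] - wbar) for t in range(n - h)) / n / g0

band = 1.96 / sqrt(n)
print(round(band, 4))  # 0.0438
outside = sum(1 for h in range(1, 31) if abs(rho_hat(h)) > band)
print(outside, "of 30 sample autocorrelations fall outside the band")
```

For truly i.i.d. data we expect roughly 5% of the 30 lags, i.e. one or two, to fall outside the band.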
Let's look at this using a "controlled" example, that is, an example where we know what the true time series model is. The only way to know this is if we create the time series ourselves. We will generate a Gaussian white noise time series of 2000 time steps and look at its correlogram to see if it is as we expect. We will then reproduce the experiment with a random walk for which the increments are the values of the Gaussian white noise time series. This is done as follows: first generate 2000 independent standard normal random variables by typing
>w=rnorm(2000)
Generate a plot of the Gaussian white noise by typing
>plot(w,type="l")
This gives the following graph (if you reproduce this at home, you'll get a different graph):
[Figure: time plot of the Gaussian white noise series w.]
Now we can obtain the sample correlogram for our data set w by typing
>acf(w)
This gives
[Figure: sample correlogram of the series w.]
To see the values of the autocorrelation function for small lags, type
>a=acf(w)
>a
If the time series were completely uncorrelated (which we know it is, as we generated it), we would expect about 95% of these values to be less than $1.96/\sqrt{n}$ in absolute value. To know what that bound is in this case, type
>1.96/sqrt(2000)
12.3 Some Examples of Correlograms
To understand a bit better what the correlogram tells us about the underlying process, let's look at a few additional examples of correlograms of time series which exhibit specific kinds of patterns. We will do this exercise numerically, and you will analyze this question more carefully on the homework from a theoretical point of view.
Example 12.2. Consider a random walk, which we can generate recursively this way:
>w=rnorm(2000)
>x=w
>for (t in 2:2000) x[t]=x[t-1]+w[t]
To see the picture, type
>plot(x,type="l")
[Figure: time plot of the random walk x.]
The correlogram for the random walk is the following:
>acf(x)
[Figure: sample correlogram of the random walk x.]
Below, we examine the correlograms of two particularly nice time series, one of which is perfectly linear and the other perfectly cyclical.
Example 12.3. If for $t \ge 1$, $X_t = t$, then $\{X_t\}$ is a non-random time series. We can nonetheless compute the sample autocorrelation function for this time series. Let's do this with R, for instance for a time series of length 1000:
First, we generate the time series:
> X=c(1:1000)
> for (i in 1:1000) X[i]=i
If you want to check the numerical values of X, don’t forget you can just type
> X
Now to see the correlogram, type
> acf(X)
This gives the following graph:
[Figure: sample correlogram of the series X.]
Note that the sample autocorrelations are all fairly close to 1. This is not surprising, since the values of the underlying time series are very strongly correlated for small lags. To see the actual values of the sample autocorrelations in the graph, type
> a=acf(X)
> a
Note also that R only gives you the first 30 values of the correlogram. This is because by default it only gives you the first $10\log_{10}(n)$ values of the correlogram (where $n$ is the length of the time series). This is partly because the estimates of the sample correlation quickly become bad for larger lags. To see all the sample correlations, type the following:
> a=acf(X,lag.max=999)
> a
This will produce the following picture, as well as the corresponding numerical values:
[Correlogram of X ("Series X") over all lags 0 to 999: the sample ACF decreases roughly linearly from 1, crosses 0, and is negative at large lags.]
It may seem surprising that there are negative correlations at large lags. This can be explained by the fact that correlations are computed relative to the mean of the data set ((n + 1)/2 here), and at large lags, typically one of the two values lies above the mean and the other lies below it.
Exercise 12.10. Reproduce the steps above with a linear time series of length 10,000. Is the number of lags given by default by R what you expected it to be? How does the correlogram for the first 10 lags compare to the correlogram for the first 10 lags obtained above?
Note 12.1. On the homework, you will show that the phenomenon you observed in Exercise 12.10 is not accidental.
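The default number of lags can be checked directly; a small Python sketch of the ⌊10 log10(n)⌋ rule:

```python
import math

def default_lag_max(n):
    # R's acf() default: 10 * log10(n), truncated to an integer
    return math.floor(10 * math.log10(n))

print(default_lag_max(1000))   # 30, matching the 30 lags shown above
print(default_lag_max(10000))  # 40, relevant for Exercise 12.10
```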
Example 12.4. We now do the same thing as above for a periodic time series. Noting that cos(2πt/n) has period n, we define the following periodic time series of period 10:
> X=c(1:200)
> for (i in 1:200) X[i]=cos(pi*i/5)
> acf(X)
This gives
[Correlogram of X ("Series X"): the sample ACF oscillates between about −1 and 1 with period 10, with slowly decreasing amplitude.]
At first glance, the graph looks perfectly periodic with period 10. Typing
> a=acf(X)
> a
shows that this is not completely the case: the amplitude decreases. This is due to the fact that the sample autocovariance function
$$\hat\gamma(h) := \frac{1}{n}\sum_{t=1}^{n-|h|}\big(X_{t+|h|} - \overline{X}\big)\big(X_t - \overline{X}\big)$$
contains a sum with fewer and fewer terms as the lag h increases, while the multiplicative factor 1/n stays the same.
Note 12.2. You will show on the homework that for a periodic time series of integer period, the true autocorrelation function is actually perfectly periodic, unlike the sample autocorrelation function.
Math 4506 (Fall 2019) October 21, 2019
Prof. Christian Benes
Lecture #13: Testing if a Time Series Could Be White Noise; Inference for the Mean; Additive Model
Reference. Sections 8.1, 3.2 from the textbook.
13.1 Tests for the estimated noise sequence
When trying to determine an appropriate decomposition (e.g. additive)
Xt = mt + st + Yt
for a time series, where the goal will be to find a description for m, s, and Y which is in tune with our observations, a good first thing to check is whether Yt is stationary with the simplest possible covariance structure; in other words, whether Yt is white noise.
Below, we develop a few tests (the more, the better) aimed at answering this question. One of these relies on the chi-square distribution. Note that if you are interested in finding quantiles of the chi-square distribution, one option is to go to
http://www.stat.tamu.edu/~west/applets/chisqdemo.html
Alternatively, you can use the R command qchisq(alpha,n), which will give you the value of χ²_{α,n}.
13.1.1 The Turning Point Test
We already saw that we should be very skeptical of the independence of a sequence of random variables if the signs of their realizations alternate with too much regularity. This idea is at the center of the turning point test.
Definition 13.1. Suppose y1, . . . , yn is a sequence of realizations of a time series. We say that there is a turning point at time i if yi = max{yi−1, yi, yi+1} or yi = min{yi−1, yi, yi+1}.
If Y1, . . . , Yn is a random sequence, we can define the random variable T to be the number of turning points.
We now assume that Y1, . . . , Yn is an i.i.d. sequence. Clearly, if 2 ≤ i ≤ n − 1,
$$P(i \text{ is a turning point}) = P\big(Y_i = \max\{Y_{i-1}, Y_i, Y_{i+1}\}\big) + P\big(Y_i = \min\{Y_{i-1}, Y_i, Y_{i+1}\}\big) = \frac{2}{3}.$$
Now suppose that for 2 ≤ i ≤ n − 1,
$$I_i = \mathbf{1}\{i \text{ is a turning point}\} = \begin{cases} 1 & \text{if } i \text{ is a turning point,} \\ 0 & \text{if } i \text{ is not a turning point.} \end{cases}$$
Then T = Σ_{i=2}^{n−1} Ii and E[Ii] = P(i is a turning point), so
$$\mu_T = E[T] = E\Big[\sum_{i=2}^{n-1} I_i\Big] = \sum_{i=2}^{n-1} P(i \text{ is a turning point}) = \frac{2(n-2)}{3}.$$
Similarly, one can show that
$$\sigma_T^2 = \mathrm{Var}(T) = \frac{16n-29}{90}.$$
One of the many versions of the central limit theorem now implies that if n is large,
$$T \overset{\text{approx}}{\sim} N\Big(\frac{2(n-2)}{3},\ \frac{16n-29}{90}\Big).$$
So we will reject the hypothesis at level α if
$$\frac{|T - \mu_T|}{\sigma_T} > z_{\alpha/2}.$$
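The turning point test is easy to implement; a Python sketch (using strict inequalities, which for continuous data agrees with the definition above with probability 1):

```python
import math

def turning_point_test(y):
    """Return the turning point count T and the standardized statistic z."""
    n = len(y)
    t = sum(1 for i in range(1, n - 1)
            if (y[i] > y[i-1] and y[i] > y[i+1]) or (y[i] < y[i-1] and y[i] < y[i+1]))
    mu = 2 * (n - 2) / 3                   # E[T] for an i.i.d. sequence
    sigma = math.sqrt((16 * n - 29) / 90)  # sqrt(Var(T))
    return t, (t - mu) / sigma

# A perfectly alternating sequence has a turning point at every interior time,
# so |z| is huge and the i.i.d. hypothesis is rejected at any reasonable level.
y = [(-1) ** i for i in range(100)]
t, z = turning_point_test(y)
print(t, z)
```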
13.1.2 The Portmanteau Tests
Recall from Lecture 12 that if Y1, . . . , Yn are i.i.d. with finite variance, then ρ̂(1), . . . , ρ̂(n−1) are approximately i.i.d. with distribution N(0, 1/n). Therefore, if Y1, . . . , Yn are i.i.d. with finite variance, then √n ρ̂(1), . . . , √n ρ̂(n−1) are approximately i.i.d. with distribution N(0, 1), so the sum of their squares is approximately a chi-square. More precisely, if 1 ≤ k ≤ n − 1,
$$\sum_{i=1}^{k}\big(\sqrt{n}\,\hat\rho(i)\big)^2 \overset{\text{approx}}{\sim} \chi^2_k,$$
that is,
$$Q = n\sum_{i=1}^{k}\hat\rho(i)^2 \overset{\text{approx}}{\sim} \chi^2_k. \qquad (23)$$
This gives us a number of different options for which test statistic to use (one for each 1 ≤ k ≤ n − 1). So which k do we choose? Typically, k = log10 n is a choice that ensures that the approximation is good.
The estimator Q is called the Box-Pierce statistic. Based on (23), we should reject the hypothesis that Y1, . . . , Yn are i.i.d. with finite variance if Q falls within an unlikely region for a χ²_k random variable. We will reject the hypothesis at level α if q, the realization of Q, satisfies
$$q > \chi^2_{1-\alpha,k},$$
where χ²_{1−α,k} is such that if X ∼ χ²_k, then
$$P\big(X > \chi^2_{1-\alpha,k}\big) = \alpha.$$
There is a variant (really an improvement) of the Box-Pierce statistic, called the Ljung-Box statistic:
$$Q = n(n+2)\sum_{i=1}^{k}\frac{\hat\rho(i)^2}{n-i} \overset{\text{approx}}{\sim} \chi^2_k.$$
As this estimator is better than the Box-Pierce statistic, you should use the Ljung-Box statistic rather than the Box-Pierce statistic. The reason I mention the Box-Pierce statistic at all is that it is easier to see why its distribution might be well approximated by the chi-square distribution. In R, the command for both tests is Box.test. The optional argument type="L" tells R to use the Ljung-Box statistic, while type="B" tells R to use the Box-Pierce statistic.
Example 13.1. In this example, we will generate two time series: Z will be white noise and X will be an AR(1) process. We will test, for both time series and via both the Box-Pierce and the Ljung-Box test, whether they could be white noise. (Of course, since we generated the data, we know the answer already.)
> Z=rnorm(100)
> X=Z
> for (i in 2:100) X[i]=X[i-1]/2+Z[i]/2
> Box.test(Z,lag=2,type="L")
Box-Ljung test
data: Z
X-squared = 0.033, df = 2, p-value = 0.9836
> Box.test(Z,lag=2,type="B")
Box-Pierce test
data: Z
X-squared = 0.0318, df = 2, p-value = 0.9842
> Box.test(X,lag=2,type="L")
Box-Ljung test
data: X
X-squared = 25.3737, df = 2, p-value = 3.091e-06
> Box.test(X,lag=2,type="B")
Box-Pierce test
data: X
X-squared = 24.6017, df = 2, p-value = 4.548e-06
We see that in both cases, both tests do exactly what one would hope, i.e., they overwhelmingly reject the white noise hypothesis for X and comfortably fail to reject it for Z. Try this at home with different values of n and different time series X that aren't white noise.
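Both statistics are simple to compute by hand; a Python sketch (the helpers are my own; unlike R's Box.test, no p-value is computed here, so the statistics would be compared with χ²_k quantiles):

```python
import numpy as np

def acf_vals(x, k):
    """Sample autocorrelations rho_hat(1), ..., rho_hat(k), 1/n convention."""
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    g0 = np.sum((x - xbar) ** 2) / n
    return np.array([np.sum((x[h:] - xbar) * (x[:n - h] - xbar)) / n / g0
                     for h in range(1, k + 1)])

def box_pierce(x, k):
    n = len(x)
    return n * np.sum(acf_vals(x, k) ** 2)

def ljung_box(x, k):
    n = len(x)
    rho = acf_vals(x, k)
    return n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, k + 1)))

rng = np.random.default_rng(1)
z = rng.standard_normal(100)           # white noise
x = np.empty(100); x[0] = z[0]
for i in range(1, 100):                # AR(1): x[i] = x[i-1]/2 + z[i]/2
    x[i] = x[i - 1] / 2 + z[i] / 2

print(box_pierce(z, 2), ljung_box(z, 2))   # small: consistent with white noise
print(box_pierce(x, 2), ljung_box(x, 2))   # large: white noise implausible
```

Note that the Ljung-Box statistic is always at least as large as the Box-Pierce statistic, since each term is inflated by the factor (n + 2)/(n − i) > 1.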
13.2 Inference for µ
Even in the case of dependent variables X1, . . . , Xn, X̄ is a natural estimator for µ. However, since we are not assuming here that the Xt are independent, we can't just claim that X̄ ∼ N(µ, σ²/n).
It is still true that
$$E[\overline{X}_n] = E\Big[\frac{1}{n}\sum_{i=1}^{n} X_i\Big] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\,n\mu = \mu.$$
However,
$$\mathrm{Var}(\overline{X}_n) = \mathrm{Var}\Big(\frac{1}{n}\sum_{i=1}^{n} X_i\Big) = \frac{1}{n^2}\mathrm{Var}\Big(\sum_{i=1}^{n} X_i\Big) = \frac{1}{n^2}\bigg(E\Big[\Big(\sum_{i=1}^{n} X_i\Big)^2\Big] - E\Big[\sum_{i=1}^{n} X_i\Big]^2\bigg)$$
$$= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\big(E[X_iX_j] - E[X_i]E[X_j]\big) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{Cov}(X_i, X_j)$$
$$= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\gamma(i-j) = \frac{1}{n^2}\sum_{h=-n+1}^{n-1}(n-|h|)\gamma(h) = \frac{1}{n}\sum_{h=-n+1}^{n-1}\Big(1-\frac{|h|}{n}\Big)\gamma(h).$$
Now any nonsingular linear transformation of a multivariate Gaussian vector is multivariate Gaussian too, so since
$$\begin{pmatrix} \frac{1}{n} & \frac{1}{n} & \frac{1}{n} & \cdots & \frac{1}{n} & \frac{1}{n} \\ 0 & 1 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & 0 \\ 0 & 0 & 0 & \cdots & 0 & 1 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_{n-1} \\ X_n \end{pmatrix} = \begin{pmatrix} \overline{X}_n \\ X_2 \\ \vdots \\ X_{n-1} \\ X_n \end{pmatrix},$$
the vector on the right-hand side is multivariate normal, which means that its marginals are normal, so that X̄n has a normal distribution. Since we know its mean and variance, we know everything about it:
$$\overline{X}_n \sim N\bigg(\mu,\ \frac{1}{n}\sum_{h=-n+1}^{n-1}\Big(1-\frac{|h|}{n}\Big)\gamma(h)\bigg).$$
Equivalently,
$$\frac{\sqrt{n}\,(\overline{X}_n - \mu)}{\sqrt{v}} \sim N(0, 1), \qquad \text{where } v = \sum_{h=-n+1}^{n-1}\Big(1-\frac{|h|}{n}\Big)\gamma(h).$$
So we know everything about X̄n for a Gaussian time series. But what if X is not Gaussian? Then we turn to the usual trick, namely the central limit theorem, which tells us that a large sum of random variables is close to being Gaussian. In that case, we get the following:
$$\frac{\sqrt{n}\,(\overline{X}_n - \mu)}{\sqrt{v}} \overset{\text{approx}}{\sim} N(0, 1),$$
where v is as above. This gives the following 100(1 − α)% confidence interval (or approximate confidence interval if Xt is not Gaussian) for µ:
$$\Big(\overline{X} - \frac{\sqrt{v}}{\sqrt{n}}\,z_{\alpha/2},\ \overline{X} + \frac{\sqrt{v}}{\sqrt{n}}\,z_{\alpha/2}\Big).$$
Note 13.1. Since we usually don't know v, we have to estimate it too. A natural estimator is
$$\hat v = \sum_{h=-n}^{n}\Big(1-\frac{|h|}{n}\Big)\hat\gamma(h).$$
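A Python sketch of this confidence interval (one hedge: I truncate the sum over lags at a moderate max_lag rather than using all n lags as in the formula above, since summing the sample autocovariances over every lag makes the estimate of v very noisy; everything else follows the formulas of this section):

```python
import numpy as np

def mean_ci(x, max_lag, z=1.96):
    """Approximate 95% CI for mu: Xbar +/- z * sqrt(v_hat / n)."""
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    gamma = [np.sum((x[h:] - xbar) * (x[:n - h] - xbar)) / n
             for h in range(max_lag + 1)]
    # v_hat = sum over |h| <= max_lag of (1 - |h|/n) * gamma_hat(h)
    v = gamma[0] + 2 * sum((1 - h / n) * gamma[h] for h in range(1, max_lag + 1))
    half = z * np.sqrt(v / n)
    return xbar - half, xbar + half

rng = np.random.default_rng(2)
x = 5 + rng.standard_normal(400)   # i.i.d. with true mean 5
lo, hi = mean_ci(x, max_lag=10)
print(lo, hi)
```

For i.i.d. data the interval is close to the usual X̄ ± 1.96 σ/√n; for dependent data the autocovariance terms widen or narrow it accordingly.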
Math 4506 (Fall 2019) October 23, 2019
Prof. Christian Benes
Lecture #14: Trend and Seasonal Variation
Reference. Section 3.3 from the textbook.
14.1 The Additive Model
Recall the example we briefly looked at in Lecture 3, where we had a data set composed of the number of monthly aircraft miles (in millions) flown by U.S. airlines between 1963 and 1970. The graph for this data set was the following:
[Plot of the time series Air.ts against Time: monthly aircraft miles from 1963 to 1970, rising from about 6000 to about 16000 with a clear yearly cycle.]
Given a data set such as the one above, how can we construct a model for it? The idea will be to decompose random data into three distinct components:
• A trend component mt (increase of populations, increase in global temperature, etc.)
• A seasonal component st (describing cyclical phenomena such as annual temperature patterns, etc.)
• A random noise component Yt describing the non-deterministic aspect of the time series. Note that the book uses zt for this component. In these notes, I'll write Yt, as the letter z usually suggests a normal distribution, which may not be the actual underlying distribution of the random noise component.
A common model is the so-called additive model, that is, one where we try to find mt, st, Yt such that a given time series can be expressed as
Xt = mt + st + Yt.
We will never know what mt, st, and Yt actually are, but we can estimate them. The estimates will be called m̂t, ŝt, and ŷt. Note that we'll use the same notation for estimates and estimators in this case. Once we see the data, our estimates have to satisfy
xt = m̂t + ŝt + ŷt,
where m̂t is an estimate for mt, ŝt is an estimate for st, and ŷt is an estimate for Yt.
The corresponding data set can be found at
http://robjhyndman.com/tsdldata/data/kendall3.dat
and looks like this:
     Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
1963 6827  6178  7084  8162  8462  9644  10466 10748 9963  8194  6848  7027
1964 7269  6775  7819  8371  9069  10248 11030 10882 10333 9109  7685  7602
1965 8350  7829  8829  9948  10638 11253 11424 11391 10665 9396  7775  7933
1966 8186  7444  8484  9864  10252 12282 11637 11577 12417 9637  8094  9280
1967 8334  7899  9994  10078 10801 12950 12222 12246 13281 10366 8730  9614
1968 8639  8772  10894 10455 11179 10588 10794 12770 13812 10857 9290  10925
1969 9491  8919  11607 8852  12537 14759 13667 13731 15110 12185 10645 12161
1970 10840 10436 13589 13402 13103 14933 14147 14057 16234 12389 11595 12772
In fact, this is not exactly the form in which the data set is found on that website. There, it doesn't have any labels. As it turns out, it is quite straightforward to include those labels with R.
Let’s look at the graph above. Two patterns are striking. There appears to be
• an increasing pattern
• a clear cyclical pattern with some apparently fixed period
14.1.1 The Trend
There are a number of methods available to analyze the trend. We will see here how the function decompose in R estimates the trend, and will discuss some refinements of this later in the semester.
For a given time series Xt, one natural way of estimating the trend mt is to assume that it is influenced by values at a number of times around t, so we can let m̂t be a moving average
of values of Xt around time t. In general, if the time series {Xt}1≤t≤N consists of N data points, we can, for some arbitrary a, define, for a + 1 ≤ t ≤ N − a,
$$\hat m_t = \frac{1}{1+2a}\sum_{k=-a}^{a} X_{t+k}.$$
Alternatively, if there is good reason to take a weighted average in which the two extreme values each receive half the weight of the others, as in the case where there is a good reason to think the natural period of the time series is 2a, we can define instead
$$\hat m_t = \frac{1}{2a}\Big(\frac{1}{2}X_{t-a} + \sum_{k=-a+1}^{a-1} X_{t+k} + \frac{1}{2}X_{t+a}\Big),$$
so that the sum of the weights equals 1 as well.
In the time series above, since the period is 12, we would define, for 7 ≤ t ≤ 90,
$$\hat m_t = \frac{1}{12}\Big(\frac{1}{2}X_{t-6} + \sum_{k=-5}^{5} X_{t+k} + \frac{1}{2}X_{t+6}\Big).$$
Note that this process means that the trend estimate is undefined for 1 ≤ t ≤ 6 and 91 ≤ t ≤ 96.
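The weighted moving average above can be sketched in Python (`ma_trend` is my own helper; NaN marks the times where the trend is undefined):

```python
import numpy as np

def ma_trend(x, period=12):
    """Centered moving average with half-weights at the two ends (period even)."""
    x = np.asarray(x, dtype=float)
    n, a = len(x), period // 2
    m = np.full(n, np.nan)
    for t in range(a, n - a):
        m[t] = (0.5 * x[t - a] + x[t - a + 1:t + a].sum() + 0.5 * x[t + a]) / period
    return m

# Sanity check: on a purely linear series the moving average reproduces the line.
x = np.arange(1, 97, dtype=float)
m = ma_trend(x, 12)
print(m[6:10])   # equals x[6:10] exactly
```

Since the window is symmetric and the weights sum to 1, any linear trend passes through the filter unchanged, while a 12-periodic component averages out.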
14.1.2 Seasonal Variation
In cyclical data, numerical values start repeating themselves with each new cycle. For example, 1, 3, 2, 1, 1, 3, 2, 1, 1, 3, 2, 1, 1, 3, 2, 1, . . . is a cyclical data set with period 4: the cycles have length 4.
If we have already found a model for the trend of a time series, we define
$$a_t = X_t - \hat m_t,$$
for which we now wish to estimate a seasonal component (assuming this makes sense; in the example above, it certainly does) and later a random component.
If there is a true cyclical component in a time series, its values have to repeat themselves with every period. Since the values of the time series at are unlikely to be exactly cyclical, we have to estimate the actual values of this cyclical component. This is done by averaging the values of at corresponding to the same position in the cycle. For instance, in the example above, the values of at for each given month will be averaged. More precisely, we define, for t = 7, . . . , 18,
$$c_t = c_{t+12} = \cdots = c_{t+72} = \frac{1}{7}\sum_{i=0}^{6} a_{t+12i}.$$
Since in our example at is only defined for t = 7, . . . , 90 (because m̂t is only defined for those t), we define by extension ct for t = 1, . . . , 6 by ct = ct+12 and for t = 91, . . . , 96 by ct = ct−12, so that ct is defined for all t = 1, . . . , 96. Why did we call this object ct rather than ŝt? Because we need to do one more thing to get ŝt. The time series ct is cyclical, but we will transform it into a series whose mean is 0. This is achieved by defining, for t = 1, . . . , 96,
$$\hat s_t = c_t - \frac{1}{12}\sum_{i=7}^{18} c_i.$$
Exercise 14.11. Show that the mean of the values {ŝt}t∈{1,...,96} is 0.
Note again that everything we are doing here is based on the fact that a natural cycle of our time series has length 12, but all the steps can be reproduced for time series of any period.
14.1.3 Random Component
The estimate for the random component is just
$$\hat y_t = x_t - \hat m_t - \hat s_t.$$
This random component is our main focus for this course, but let's see how to obtain it from any time series for which the additive model would be a good fit.
14.1.4 Decomposing a Time Series with R
First, in order for R to be able to do any time series analysis with your data, it must know that it is dealing with a time series. You will need to use a command to transform your data set into a time series.
First, load the data set from the web by typing
>www="http://robjhyndman.com/tsdldata/data/kendall3.dat"
and create a data file by typing
>Air=scan(www)
I chose the name "Air", but you can of course call the data set what you'd like. Then, transform that data set into a time series. Since the years go from 1963 to 1970 and for each year the months go from 1 to 12, it makes sense to put the data into an 8-by-12 array. This is done as follows:
>Air.ts=ts(Air,start=c(1963,1),end=c(1970,12),fr=12)
The argument "fr=12" tells R that your time series follows a natural cycle of period 12. R will automatically deduce from this information that your data is measured on a monthly basis and yield the table above. More importantly, you have now created an object of the time series class (the suffix ".ts" in the name is just a naming convention), which allows R to do most things one might want to do with time series, things it can't do with objects that aren't time series.
Typing
>decompose(Air.ts)
will show all the values of the time series m̂t, ŝt, and ŷt (which are the estimates for mt, st, and Yt). If you type
>plot(decompose(Air.ts))
you will see the graphs of xt, m̂t, ŝt, and ŷt:
[Plot: "Decomposition of additive time series", with four panels (observed, trend, seasonal, random) plotted against Time from 1964 to 1970.]
Note that if you wish to analyze the seasonal component, trend, or random component separately, you can use the following commands:
> D=decompose(Air.ts)
> S=D$seasonal
> T=D$trend[7:90]
> R=D$random[7:90]
This gives you the data arrays S, T, and R containing all the values of the seasonal component, trend, and random component. Note that since T and R have no values for the first six and last six times, we had to ask the software to disregard those times when creating T and R.
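The whole additive decomposition described above can be sketched in Python (`decompose_additive` is my own simplified version of what R's decompose does, following the steps of this section):

```python
import numpy as np

def decompose_additive(x, period=12):
    """Moving-average trend, month-by-month seasonal averages recentred
    to mean 0, and the residual random component."""
    x = np.asarray(x, dtype=float)
    n, a = len(x), period // 2
    trend = np.full(n, np.nan)
    for t in range(a, n - a):
        trend[t] = (0.5 * x[t - a] + x[t - a + 1:t + a].sum() + 0.5 * x[t + a]) / period
    detrended = x - trend
    # average the detrended values position by position, ignoring the NaN ends
    c = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    seasonal = np.tile(c - c.mean(), n // period + 1)[:n]
    random = x - trend - seasonal
    return trend, seasonal, random

t = np.arange(96)
x = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12)   # linear trend + period-12 cycle
trend, seasonal, random = decompose_additive(x)
print(np.nanmax(np.abs(random)))   # tiny: the two components explain everything
```

On this noiseless example the residual is essentially zero; on real data (such as Air.ts) the residual is the estimate ŷt of the random component.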
14.2 The Least Squares Method
Not all time series can be modeled by stationary processes. One might hope, however, that a time series Xt could be expressed as follows:
$$X_t = f(t) + Y_t,$$
where f(t) is a deterministic function and Yt is stationary. The least squares method can be of use when trying to extract a function f(t) which might describe a trend and a seasonal component of our time series.
Here is a quick reminder of the basic principles of the least squares method.
The basic idea is as follows. Suppose that we decide (subjectively) that the best model for the evolution of a measurable quantity over time is a linear function of the form y = f(t) = β0 + β1t. Not all straight lines will seem to be equally good models once we see the data. In particular, a model will not be too good if all data points lie above or below the line given by the model.
The least squares method minimizes the sum of the squares of the differences between the data values and the values predicted by the model. More specifically, if the observed data consist of the n points (t1, y1), . . . , (tn, yn), the least squares method finds the parameters β0 and β1 so as to minimize
$$\sum_{i=1}^{n}\big(y_i - f(t_i)\big)^2.$$
If
$$Y_t = \beta_0 + \beta_1 t + X_t,$$
where Xt is stationary (which means that Yt is the sum of a stationary process and a linear function), the estimators for β0 and β1 that yield the least squares estimates are as follows:
$$\hat\beta_1 = \frac{\frac{1}{n}\sum_{t=1}^{n} t\,Y_t - \bar t\,\overline{Y}}{\frac{1}{n}\sum_{t=1}^{n} t^2 - \bar t^2}, \qquad \hat\beta_0 = \overline{Y} - \hat\beta_1\,\bar t.$$
The same idea works for any deterministic function, not just linear ones, which can be defined in terms of any number of parameters (for instance polynomials, exponential functions, etc.).
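These closed-form estimators are straightforward to verify numerically; a Python sketch (on a noiseless line, they recover the true coefficients):

```python
import numpy as np

def ols_line(t, y):
    """Least squares estimates for y = b0 + b1 * t, via the formulas above."""
    t, y = np.asarray(t, dtype=float), np.asarray(y, dtype=float)
    b1 = (np.mean(t * y) - t.mean() * y.mean()) / (np.mean(t ** 2) - t.mean() ** 2)
    b0 = y.mean() - b1 * t.mean()
    return b0, b1

t = np.arange(10, dtype=float)
y = 3.0 + 2.0 * t           # exact line, no noise
b0, b1 = ols_line(t, y)
print(b0, b1)               # recovers 3 and 2
```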
Example 14.1. We now see how R does least squares regression for us by looking at the daily closing prices of Hewlett-Packard stock for 672 trading days up to June 7, 2007. The data can be obtained as follows:
> www="http://www.maths.adelaide.edu.au/andrew.metcalfe/Data/HP.txt"
> HP.dat=read.table(www,header=T);attach(HP.dat)
> plot(Price,type="l")
This gives:
[Plot of Price against its index (0 to about 670): the closing price rises from around 20 to around 45.]
We will perform a linear regression on the data set, using the command
> HP.lm=lm(Price~time(Price))
The summary is then
> HP.lm
Call:
lm(formula = Price ~ time(Price))
Coefficients:
(Intercept) time(Price)
17.2333 0.0398
This means that the least squares line is
f(t) = 17.2333 + 0.0398t.
Note that you can obtain confidence intervals for the parameters as follows:
> confint(HP.lm)
2.5 % 97.5 %
(Intercept) 17.00349739 17.46311534
time(Price) 0.03921052 0.04039385
Now let's see if, as we might hope, Yt = Xt − f(t) can be described by an ARMA model, by examining the ACF of the residuals:
> acf(resid(HP.lm))
[Correlogram of resid(HP.lm): the sample ACF decays only very slowly from 1 over lags 0 to 25.]
The slowly decaying ACF suggests that we may not be in the presence of a time series that could be well modeled by an ARMA process, so we will need to find a way to deal with non-stationary time series.
Math 4506 (Fall 2019) October 28, 2019
Prof. Christian Benes
Lecture #15: Pre-Midterm Q&A
Math 4506 (Fall 2019) October 30, 2019
Prof. Christian Benes
Lecture #16: Midterm
Math 4506 (Fall 2019) November 4, 2019
Prof. Christian Benes
Lecture #17: Differencing
Reference. Section 5.1 from the textbook.
17.1 Differencing
In time series analysis, one goal is to reduce a given time series model Xt to a stationary time series whenever possible. An important observation we have made already is that taking differences of a nonstationary time series can yield a stationary time series, as is the case with random walk, for which taking differences yields white noise.
We will need the following definition, part of which we’ve seen already:
Definition 17.1. We define the backwards shift operator B by
BXt = Xt−1.
For j ≥ 2, we define the operator Bj by
BjXt = B(Bj−1Xt) = Xt−j.
The order 1 difference operator ∇ is defined by
∇Xt = Xt −Xt−1.
For j ≥ 2, the order j difference operator ∇j is defined by
∇jXt = ∇∇j−1Xt.
Note 17.1. Conveniently, operations on the operator B follow the same rules as polynomials. For instance,
∇2Xt = ∇(∇Xt) = ∇(Xt −Xt−1) = (Xt −Xt−1)− (Xt−1 −Xt−2) = Xt − 2Xt−1 +Xt−2
and
$$\nabla^3 X_t = (1-B)(1-B)(1-B)X_t = (1-B)^3 X_t = (1-3B+3B^2-B^3)X_t = X_t - 3X_{t-1} + 3X_{t-2} - X_{t-3}.$$
The first important thing to note is that if the time series Xt is stationary, then so is ∇Xt. This is a direct consequence of Proposition 7.2. So we certainly don't lose stationarity by differencing. However, as we saw in the random walk example, differencing can transform a nonstationary time series into one that is stationary.
Assume now that our time series model can reasonably be written in the form
Xt = mt + st + Yt,
where mt is a trend, st a seasonal component, and Yt a random noise component (which may or may not be stationary). We saw in the past lecture how m and s can be estimated, leaving us with the random process Yt. If Yt is not stationary, we can try taking differences until we have a time series that is stationary. However, we can also apply this process to Xt before estimating m and s. The main idea of this method comes from calculus: for many functions, taking derivatives "flattens" the function.
• What do I mean by this? Take for instance f(x) = x². For large x, the slope of the tangent to the graph is very steep. However, the slope of the tangent to the graph of f′(x) is the same everywhere (it's 2). Moreover, f′′(x) is constant. Clearly, the same thing works for any polynomial. Since many functions look locally like a polynomial, one can also hope for this idea to work for a larger class of functions.
• Why is this useful? A stationary series has constant mean, so transforming the trend m into a constant is necessary if we wish to transform our time series into a stationary time series. We will also see a slight modification of this idea that gets rid of seasonal components.
17.1.1 Differencing when there is no seasonal component
Suppose that our time series model is
Xt = mt + Yt.
Let's see how this method works on time series where m is a polynomial. Suppose
$$m_t = \sum_{i=0}^{n} a_i t^i \qquad \text{and} \qquad X_t = m_t + Y_t.$$
You will show in the homework that
$$\nabla^n X_t = c_n + \nabla^n Y_t.$$
Since ∇ⁿYt is a stationary sequence with mean 0, ∇ⁿXt is stationary with constant mean cn.
This suggests that we may try to apply successive difference operators to a given time series until it is stationary (that is, until the trend has been removed). Note that this method is considerably less sophisticated than the method where we remove the trend using least squares.
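A Python sketch of this flattening effect (np.diff(x, n=2) applies the difference operator ∇ twice; the quadratic trend is my own illustrative choice):

```python
import numpy as np

t = np.arange(200, dtype=float)
trend = 2 + 0.5 * t + 0.01 * t ** 2    # deterministic quadratic trend m_t

rng = np.random.default_rng(3)
x = trend + rng.standard_normal(200)   # X_t = m_t + Y_t with Y_t stationary

d2_trend = np.diff(trend, n=2)         # nabla^2 m_t: constant (= 2 * 0.01)
d2_x = np.diff(x, n=2)                 # nabla^2 X_t: constant mean + stationary noise
print(d2_trend[0], d2_x.mean())
```

Differencing twice turns the degree-2 polynomial trend into the constant 0.02, leaving a stationary series with that constant mean.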
17.1.2 Differencing when there is a seasonal component
Suppose now that our time series model is
Xt = mt + st + Yt.
Since we know that taking successive derivatives of (e.g.) sin(x) only sends us back and forth between sine and cosine functions, it should be clear that if the goal is to get rid of the seasonal term, the lag-1 differencing operator won't take us far. However, if we know that the seasonality period is d, we can try to use a differencing operator taking the period into account:
Definition 17.2. The lag-d differencing operator ∇d is defined by
∇dXt = Xt −Xt−d = (1−Bd)Xt.
If Xt = mt + st + Yt and st is d-periodic, which means that st+d = st, we get
∇dXt = mt −mt−d + st − st−d + Yt − Yt−d = mt −mt−d + Yt − Yt−d,
the sum of a trend term and a noise term. Now we're back in the case of the last subsection, which we already know how to handle.
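A Python sketch of the lag-d differencing operator with d = 12 (the simulated series, with trend 0.3t and a 12-periodic component, is my own illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(240)
s = 10 * np.sin(2 * np.pi * t / 12)         # 12-periodic seasonal component
x = 0.3 * t + s + rng.standard_normal(240)  # trend + seasonal + noise

d12 = x[12:] - x[:-12]                      # lag-12 difference: (1 - B^12) X_t
print(d12.mean())  # near 12 * 0.3 = 3.6: the seasonal part has vanished
```

As the computation above predicts, ∇₁₂Xt = (mt − mt−12) + (Yt − Yt−12): the periodic term cancels exactly, and what remains is a constant (from the linear trend) plus noise.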
Math 4506 (Fall 2019) November 6, 2019
Prof. Christian Benes
Lecture #18: Differencing and ARIMA Models; Logarithmic Transformations
Reference. Sections 5.2 and 5.4 from the textbook.
18.1 Differencing
18.1.1 A Model
Recall the example with which we finished the last lecture, examining the daily closing prices of Hewlett-Packard stock for 672 trading days up to June 7, 2007. We now have a way of going a bit further in that example.
Example 18.1.
[Plot of Price against its index: the HP closing price series from before, rising from around 20 to around 45.]
After extraction of a linear trend, we ended up with the following sample ACF for the residuals.
[Correlogram of resid(HP.lm): the slowly decaying sample ACF from Lecture 14.]
The slowly decaying sample ACF suggests that we may not be in the presence of a stationary time series, so let's try to difference it:
> Diff=diff(resid(HP.lm))
> plot(Diff,type="l")
[Plot of Diff against its index: the differenced residuals fluctuate around 0, roughly between −2 and 3.]
> acf(Diff)
[Correlogram of Diff: apart from the value 1 at lag 0, the sample autocorrelations are all small.]
The ACF of the differenced series suggests that we may be in the presence of white noise. Let's find its mean and variance:
> mean(Diff)
[1] 6.368846e-05
> var(Diff)
[1] 0.2112592
Since the differenced Yt can be modeled by white noise, Yt can be modeled by a random walk. So a reasonable model would be Xt = 17.233 + 0.0398t + St, where St is a random walk constructed by adding normal random variables with mean 0 and variance 0.2112592, that is,
$$X_t = 17.233 + 0.0398\,t + \sum_{i=1}^{t} U_i,$$
where the Ui ∼ N(0, 0.21126) are independent.
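This fitted model is easy to simulate; a Python sketch (differencing the simulated path recovers a constant plus white noise, exactly as in the analysis above):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 672
t = np.arange(1, n + 1)
u = rng.normal(0.0, np.sqrt(0.21126), size=n)   # U_i ~ N(0, 0.21126)
s = np.cumsum(u)                                # random walk S_t
x = 17.233 + 0.0398 * t + s                     # the fitted model X_t

d = np.diff(x)                                  # = 0.0398 + U_t: drift plus white noise
print(d.mean(), d.var())
```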
18.2 ARIMA Processes
We can now use the ideas developed above to define a new class of processes which are not necessarily stationary. ARIMA (autoregressive integrated moving average) processes are an extension of ARMA processes. In short, a process is an ARIMA(p, d, q) process if differencing it d times gives an ARMA(p, q) process. In particular, if d = 0, an ARIMA process is an ARMA process. More precisely, an ARIMA(p, 0, q) process is the same thing as a causal ARMA(p, q) process. Formally:
Definition 18.1. For d ∈ N ∪ {0}, a process Xt is an ARIMA(p, d, q) process if Yt := ∇ᵈXt = (1 − B)ᵈXt is an ARMA(p, q) process.
18.3 ARIMA(p, 1, q) Processes
If Yt is an ARIMA process, we generally assume that Yt is observed for time t ≥ 1 but that time is indexed starting at some negative time −m, and so one assumes that Yt = 0 if t < −m. Then, by definition of an ARIMA(p, 1, q) process, the process Wt defined for t ≥ −m + 1 by Wt = Yt − Yt−1 is an ARMA(p, q) process, so that for t ≥ −m + 1 + p,
$$W_t = \sum_{i=1}^{p}\phi_i W_{t-i} + \sum_{i=0}^{q}\theta_i Z_{t-i}, \qquad \text{with } \theta_0 = 1.$$
Moreover, note that
$$Y_t = (Y_t - Y_{t-1}) + (Y_{t-1} - Y_{t-2}) + \cdots + (Y_{-m+1} - Y_{-m}) + (Y_{-m} - Y_{-m-1}) = \sum_{i=1}^{t+m+1}(Y_{t+1-i} - Y_{t-i}) = \sum_{i=1}^{t+m+1} W_{t+1-i} = \sum_{i=-m}^{t} W_i. \qquad (24)$$
In particular, this allows us to understand the ARIMA(0,1,1) model, also called the IMA(1,1) model (since there is no autoregressive part), where
$$W_t = Z_t + \theta Z_{t-1}$$
with |θ| < 1. Using (24) gives
$$Y_t = \sum_{i=-m}^{t} W_i = \sum_{i=-m}^{t}(Z_i + \theta Z_{i-1}) = Z_t + (1+\theta)\sum_{i=1}^{m+t} Z_{t-i} + \theta Z_{-m-1}. \qquad (25)$$
Note that as t increases, the number of terms in this sum increases, which suggests (correctly) that the sum doesn't represent a stationary time series. Equation (25) allows us to compute the covariance and correlation of an IMA(1,1) process: for h > 0,
$$\mathrm{Cov}(Y_t, Y_{t-h}) = \mathrm{Cov}\bigg(Z_t + (1+\theta)\sum_{i=1}^{m+t} Z_{t-i} + \theta Z_{-m-1},\; Z_{t-h} + (1+\theta)\sum_{i=1}^{m+t-h} Z_{t-h-i} + \theta Z_{-m-1}\bigg) = \sigma^2\big(1 + \theta + (1+\theta)^2(m+t-h) + \theta^2\big).$$
In particular, since
$$\mathrm{Var}(Y_t) = \sigma^2\big(1 + \theta + (1+\theta)^2(m+t) + \theta^2\big),$$
we see that
$$\mathrm{Corr}(Y_t, Y_{t-h}) = \frac{1 + \theta + (1+\theta)^2(m+t-h) + \theta^2}{1 + \theta + (1+\theta)^2(m+t) + \theta^2}.$$
If t is large and h is moderate (so that m + t − h is fairly close to m + t), this ratio is approximately
$$\frac{m+t-h}{m+t},$$
which is approximately equal to 1.
The example above provides additional reinforcement of the idea that a slow decay of the autocovariance function can be an indication that the corresponding time series is nonstationary.
18.4 Multiplicative model
The idea of least squares can be used with any function we think might dictate the general trend of our data. This is particularly useful with periodic functions for which there is a natural assumption to be made about the period (e.g. for local meteorological data, where the trend tends to follow an annual cycle).
Example 18.2. We will look at the data set of monthly totals of international airline passengers, 1949 to 1960.
> AP=AirPassengers
> plot(AP)
[Plot of AP against Time: monthly airline passenger totals from 1949 to 1960, rising from about 100 to about 600 with cycles of growing amplitude.]
Plotting this time series, we notice an upward trend together with a clear cyclical behavior, coupled with an increase in the amplitude of the cycle over time.
Such an increase in amplitude would make it difficult to model the time series well as Xt = mt + st + Yt with Yt stationary, since if the amplitude of st doesn't increase (which it can't; it's periodic), the variability of Yt would have to increase, making it non-stationary.
One way to get rid of an increase in amplitude is to take logarithms. Let's ignore the random component for a minute. Then an appropriate model for such an increase in amplitude would be
Xt = mt st Yt,
since in that case, a larger value of the trend would cause a larger amplitude (5 sin t has a larger amplitude than 3 sin t). Now if we take logarithms, we get
ln Xt = ln mt + ln st + ln Yt,
where ln mt is a new trend and ln st is still a periodic function. However, there is no more multiplicative effect, so the new time series shouldn't exhibit an increasing amplitude. Of course, this only works if we truly have a multiplicative model (Xt = mt st Yt), which we may or may not, but it certainly suggests that taking logarithms might solve the problem of the increasing amplitude.
Therefore, we create a new time series composed of the natural logarithms of the original series.
> LAP=log(AP)
This gives the following time series, in which we can see that the increase in amplitude seems to have mostly disappeared.
> plot(LAP)
[Plot of LAP (the log of AP) against Time, 1949 to 1960: values rise from about 5.0 to 6.5, with cycles of roughly constant amplitude.]
Now we have a time series which actually looks like it could well be modeled as
Yt = mt + st +Wt,
where Wt could be stationary. We of course have to check more carefully whether our impression is correct.
We will first see if we can extract the trend, and will then focus on the seasonal component.
Any class of functions might be used for least squares regression, but a good first try is generally the family of polynomials.
In order to be able to plot our regression curves together with the data set, we first define a time vector which corresponds to the time spanned by our time series. To see what that time is, we can just type
> time(LAP)
We see that the time goes from 1949 to 1960.917 with increments of 1/12 (representing months, or 1/12 of a year). Note that the time series is of length 144, and that its times are not integers; this doesn't change the way in which we perform our analysis, only the way the horizontal axis is indexed.
We define
> T = c()
> for (i in 1:144) T[i]=1949+(i-1)/12
We are now ready to do a linear regression:
> LAP.lm=lm(LAP~time(LAP))
To see what the estimates for the slope and the intercept are, we type
> coef(LAP.lm)
(Intercept) time(LAP)
-230.1878355 0.1205806
This means that the least squares line for the data set LAP is y = −230.1878355+0.1205806t.
To see how this line compares to the time series, let’s draw them together by typing
> plot(LAP)
to get the plot of the time series and
> lines(T,LAP.lm$fit,col="red")
for the least squares line (note that the object LAP.lm$fit contains the fitted values, which we are plotting against T). This gives the following picture:
[Plot of LAP with the least squares line superimposed.]
To determine how good this fit is (it will be good once the residuals look like they could be stationary), we plot the residuals and their ACF:
> plot(LAP.lm$resid,type="l")
[Plot of the residuals LAP.lm$resid against their index: an obvious wave-like pattern between about −0.3 and 0.3.]
> acf(LAP.lm$resid)
[Correlogram of LAP.lm$resid: the sample ACF shows a clear periodic pattern rather than a quick decay.]
The ACF of the residuals exhibits an obvious trend (other than that of a damped sine wave or an exponential curve), so the residuals are certainly not a realization of a stationary time series.
If we look at the time series and the linear fit, we see that in addition to not having accounted for a seasonal component, we may not have guessed quite right by choosing the trend to be a straight line. Here is how we fit polynomials of degrees 2 and 3 to the data:
> t=time(LAP)
> t2=tˆ2
> t3=tˆ3
> LAP.lm2=lm(LAP˜t+t2)
> LAP.lm3=lm(LAP˜t+t2+t3)
To see the least squares quadratic curve together with the least squares straight line, we type
> plot(LAP)
> lines(T,LAP.lm$fit,col="green")
> lines(T,LAP.lm2$fit,col="red")
[Figure: the LAP time series with the fitted least squares line (green) and the least squares quadratic curve (red).]

We can also plot the least squares cubic polynomial, but notice that it covers the quadratic polynomial, meaning that it doesn't differ by any noticeable amount from the quadratic polynomial.
[Figure: the same plot with the least squares cubic polynomial added; it is indistinguishable from the quadratic curve.]
We therefore choose the trend to be a polynomial of degree two and turn to the periodic component. To find its equation, we type
> coef(LAP.lm2)
(Intercept) t t2
-1.228769e+04 1.245592e+01 -3.154887e-03
and see that a reasonable choice for $m_t$ is
$$m_t = -12287.69 + 12.46t - 0.00315t^2.$$

Math 4506 (Fall 2019) November 11, 2019 Prof. Christian Benes

Lecture #19: More Logarithmic Transformations; The Partial Autocorrelation Function
Reference. Section 6.2 from the textbook.
19.1 Other Models for Which Taking Logarithms Is Useful

There are other situations where the variance of a time series may change over time and where logarithms may be helpful.

Suppose $X_t$ is such that $E[X_t] = \mu_t$ and $\sqrt{\operatorname{Var}(X_t)} = \mu_t\sigma$ (the second requirement means that the standard deviation increases linearly with the mean). Then, by analogy with the normal case, where if $X \sim N(\mu,\sigma^2)$, then $\frac{X-\mu}{\sigma} = Z \sim N(0,1)$, so that $X = \mu + \sigma Z$, we see that
$$X_t = \mu_t + \mu_t\sigma\,\frac{X_t-\mu_t}{\mu_t\sigma} = \mu_t\left(1 + \frac{X_t-\mu_t}{\mu_t}\right).$$
Therefore, taking logarithms on both sides and using the fact that the Taylor series for $\ln(1+x)$ is
$$\ln(1+x) = \sum_{n\ge1}(-1)^{n+1}\frac{x^n}{n} = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots \qquad \text{for } |x| < 1,$$
we see that $\ln(1+x) \approx x$ if $|x|$ is small, so
$$\ln X_t \approx \ln\mu_t + \frac{X_t-\mu_t}{\mu_t}.$$
Therefore,
$$E[\ln X_t] \approx \ln\mu_t$$
and
$$\operatorname{Var}(\ln X_t) \approx \operatorname{Var}\left(\frac{X_t-\mu_t}{\mu_t}\right) = \frac{1}{\mu_t^2}\operatorname{Var}(X_t) = \frac{\mu_t^2\sigma^2}{\mu_t^2} = \sigma^2,$$
so we see that while $X_t$ doesn't have constant variance, $\ln X_t$ does.

Similarly, if $X_t$ is such that $X_t = (1+W_t)X_{t-1}$ (compare this with the behavior of exponential functions), where $W_t$ is a mean 0 stationary time series, then
$$\ln\left(\frac{X_t}{X_{t-1}}\right) = \ln(X_t) - \ln(X_{t-1}) = \ln(1+W_t) \approx W_t,$$
using again, in the last step, the Taylor expansion of $\ln(1+x)$. So $\nabla\ln X_t \approx W_t$ is stationary, and taking differences of the logarithms transforms a non-stationary time series into a stationary one.
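This log-differencing trick is easy to verify numerically. The following Python sketch (an added illustration with simulated data, not part of the original notes) builds a multiplicative series and checks that the differenced logs recover $W_t$ up to a second-order error:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
W = 0.02 * rng.standard_normal(n)      # small mean-0 stationary "returns"

# Multiplicative recursion X_t = (1 + W_t) X_{t-1}, started at 100
X = 100 * np.cumprod(1 + W)

# Differencing the logs should approximately recover W_t
dlogX = np.diff(np.log(X))
print(np.max(np.abs(dlogX - W[1:])))
```

The maximum discrepancy is of order $W_t^2/2$, exactly the first neglected Taylor term of $\ln(1+x)$.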
19.2 Partial Autocorrelation Function
Unfortunately, the sample ACF of an AR(p) process doesn't generally yield as much information as the sample ACF of an MA(q) or even an AR(1) process. It tends to exhibit a combination of exponential decay and sinusoidal behavior, which is a way of detecting that an AR model may be appropriate, but it doesn't tell us anything about the order p. Fortunately, there is another object which allows us to determine p, the so-called partial autocorrelation function (PACF).

The partial autocorrelation function at lag $k$ of a stationary time series $X_t$ is loosely defined to be the correlation between $X_t$ and $X_{t+k}$ once the linear dependence on the intermediate variables has been taken into account. This definition makes sense particularly in the context of AR processes:
Suppose that $X_{t+k}$ is a regression on its $k$ previous lagged values:
$$X_{t+k} = \sum_{i=1}^{k}\phi_{ki}X_{t+k-i} + Z_{t+k}, \qquad (26)$$
with $Z_{t+k}$ independent of $X_{t+k-i}$, $i > 0$. In other words, there is a linear relationship between $X_{t+k}$ and its $k$ direct predecessors, and $\phi_{ki}$ can be thought of as the "constant of proportionality" between $X_{t+k}$ and $X_{t+k-i}$. For any $j \in \{1,\ldots,k\}$, (26) implies that
$$E[X_{t+k}X_{t+k-j}] = \sum_{i=1}^{k}\phi_{ki}E[X_{t+k-i}X_{t+k-j}] + E[Z_{t+k}X_{t+k-j}].$$
This is equivalent to
$$\gamma_X(j) = \sum_{i=1}^{k}\phi_{ki}\gamma_X(j-i) + 0 = \sum_{i=1}^{k}\phi_{ki}\gamma_X(j-i),$$
since $Z_{t+k}$ and $X_{t+k-j}$ are independent if $j \geq 1$. We can rewrite these equations in terms of autocorrelations:
$$\rho_X(j) = \sum_{i=1}^{k}\phi_{ki}\rho_X(j-i).$$
Now for each $k$, this gives a set of $k$ linear equations in the $k$ unknowns $\phi_{k1},\ldots,\phi_{kk}$, which we know how to solve. More explicitly, we get the following sequence of systems of equations:

• $k = 1$: $\rho_X(1) = \phi_{11}\rho_X(0) = \phi_{11}$.

• $k = 2$:
$$\rho_X(1) = \phi_{21}\rho_X(0) + \phi_{22}\rho_X(1) = \phi_{21} + \phi_{22}\rho_X(1)$$
$$\rho_X(2) = \phi_{21}\rho_X(1) + \phi_{22}\rho_X(0) = \phi_{21}\rho_X(1) + \phi_{22}.$$
• $k = 3$:
$$\rho_X(1) = \phi_{31}\rho_X(0) + \phi_{32}\rho_X(1) + \phi_{33}\rho_X(2) = \phi_{31} + \phi_{32}\rho_X(1) + \phi_{33}\rho_X(2)$$
$$\rho_X(2) = \phi_{31}\rho_X(1) + \phi_{32}\rho_X(0) + \phi_{33}\rho_X(1) = \phi_{31}\rho_X(1) + \phi_{32} + \phi_{33}\rho_X(1)$$
$$\rho_X(3) = \phi_{31}\rho_X(2) + \phi_{32}\rho_X(1) + \phi_{33}\rho_X(0) = \phi_{31}\rho_X(2) + \phi_{32}\rho_X(1) + \phi_{33}.$$

• general $k$:
$$\rho_X(1) = \phi_{k1}\rho_X(0) + \phi_{k2}\rho_X(1) + \ldots + \phi_{kk}\rho_X(k-1)$$
$$\vdots$$
$$\rho_X(k) = \phi_{k1}\rho_X(k-1) + \phi_{k2}\rho_X(k-2) + \ldots + \phi_{kk}. \qquad (27)$$
To solve (27), it is useful to recall Cramer’s rule:
Theorem 19.1. Suppose that $A$ is an $n\times n$ matrix with $\det(A) \neq 0$ and $\vec b$ is an $n\times1$ vector. Then if $\vec x = (x_1,\ldots,x_n)'$, the solution to the equation $A\vec x = \vec b$ is given by
$$x_i = \frac{\det(A_i)}{\det(A)},$$
where $A_i$ is the matrix obtained from $A$ by replacing the $i$th column by $\vec b$.

Applying this to (27), we get
$$\phi_{11} = \rho_X(1), \qquad \phi_{22} = \frac{\begin{vmatrix}1 & \rho_X(1)\\ \rho_X(1) & \rho_X(2)\end{vmatrix}}{\begin{vmatrix}1 & \rho_X(1)\\ \rho_X(1) & 1\end{vmatrix}}, \qquad \phi_{33} = \frac{\begin{vmatrix}1 & \rho_X(1) & \rho_X(1)\\ \rho_X(1) & 1 & \rho_X(2)\\ \rho_X(2) & \rho_X(1) & \rho_X(3)\end{vmatrix}}{\begin{vmatrix}1 & \rho_X(1) & \rho_X(2)\\ \rho_X(1) & 1 & \rho_X(1)\\ \rho_X(2) & \rho_X(1) & 1\end{vmatrix}}, \qquad \ldots,$$
$$\phi_{kk} = \frac{\begin{vmatrix}1 & \rho_X(1) & \rho_X(2) & \cdots & \rho_X(k-2) & \rho_X(1)\\ \rho_X(1) & 1 & \rho_X(1) & \cdots & \rho_X(k-3) & \rho_X(2)\\ \vdots & \vdots & \vdots & & \vdots & \vdots\\ \rho_X(k-1) & \rho_X(k-2) & \rho_X(k-3) & \cdots & \rho_X(1) & \rho_X(k)\end{vmatrix}}{\begin{vmatrix}1 & \rho_X(1) & \rho_X(2) & \cdots & \rho_X(k-2) & \rho_X(k-1)\\ \rho_X(1) & 1 & \rho_X(1) & \cdots & \rho_X(k-3) & \rho_X(k-2)\\ \vdots & \vdots & \vdots & & \vdots & \vdots\\ \rho_X(k-1) & \rho_X(k-2) & \rho_X(k-3) & \cdots & \rho_X(1) & 1\end{vmatrix}}.$$
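These determinants are just Cramer's rule applied to the system $R_k\phi_k = \rho_k$, so $\phi_{kk}$ can equivalently be obtained by solving that linear system directly. Here is a small Python sketch (an added illustration, not part of the original notes), checked against the theoretical AR(1) ACF $\rho(h) = \phi^h$, for which $\alpha(1) = \phi$ and $\alpha(k) = 0$ for $k \geq 2$:

```python
import numpy as np

def pacf_from_acf(rho, kmax):
    """phi_kk for k = 1..kmax, from autocorrelations rho[0..kmax] (rho[0] = 1)."""
    alphas = []
    for k in range(1, kmax + 1):
        # R_k has entries rho(|i-j|); the right-hand side is (rho(1), ..., rho(k))'
        R = np.array([[rho[abs(i - j)] for j in range(k)] for i in range(k)])
        phi = np.linalg.solve(R, rho[1:k + 1])
        alphas.append(phi[-1])               # alpha(k) = phi_kk, the last component
    return alphas

# Theoretical ACF of an AR(1) with phi = 0.8: rho(h) = 0.8^h
rho = np.array([0.8 ** h for h in range(6)])
print(pacf_from_acf(rho, 4))
```

For sample data one would feed in the sample ACF instead of the theoretical one, obtaining the sample PACF.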
Definition 19.1. The partial autocorrelation function $\alpha$ of a time series $X$ is defined by the following equations:
$$\alpha(0) = 1, \qquad \alpha(k) = \phi_{k,k} \text{ for } k \geq 1,$$
where $\phi_{k,k}$ is given by equation (27). Equivalently, $\phi_{k,k}$ is the last component of the vector $\phi_k$ given by the equation
$$R_k\phi_k = \rho_k,$$
where $R_k = (\rho(i-j))_{i,j=1}^k$ is the autocorrelation matrix and $\rho_k = (\rho(1),\ldots,\rho(k))'$. The sample partial autocorrelation function $\hat\alpha$ of a time series $X$ is defined just like $\alpha$, except that $R_k$ and $\rho_k$ are replaced by $\hat R_k$ and $\hat\rho_k$.
19.3 PACF for AR(p) Processes
The usefulness of the PACF becomes evident once one derives the PACF of an AR(p) process. We do this now, starting with the case $p = 2$: if
$$X_t = \phi_1X_{t-1} + \phi_2X_{t-2} + Z_t,$$
then
$$\rho_X(k) = \phi_1\rho_X(k-1) + \phi_2\rho_X(k-2),$$
where $\rho_X(h)$ is the ACF of $X_t$. We get from the work done above that
$$\alpha(1) = \phi_{11} = \rho_X(1),$$
$$\alpha(2) = \phi_{22} = \frac{\begin{vmatrix}1 & \rho_X(1)\\ \rho_X(1) & \rho_X(2)\end{vmatrix}}{\begin{vmatrix}1 & \rho_X(1)\\ \rho_X(1) & 1\end{vmatrix}} = \frac{\rho_X(2)-\rho_X(1)^2}{1-\rho_X(1)^2},$$
$$\alpha(3) = \phi_{33} = \frac{\begin{vmatrix}1 & \rho_X(1) & \rho_X(1)\\ \rho_X(1) & 1 & \rho_X(2)\\ \rho_X(2) & \rho_X(1) & \rho_X(3)\end{vmatrix}}{\begin{vmatrix}1 & \rho_X(1) & \rho_X(2)\\ \rho_X(1) & 1 & \rho_X(1)\\ \rho_X(2) & \rho_X(1) & 1\end{vmatrix}} = \frac{\begin{vmatrix}1 & \rho_X(1) & \phi_1+\phi_2\rho_X(1)\\ \rho_X(1) & 1 & \phi_1\rho_X(1)+\phi_2\\ \rho_X(2) & \rho_X(1) & \phi_1\rho_X(2)+\phi_2\rho_X(1)\end{vmatrix}}{\begin{vmatrix}1 & \rho_X(1) & \rho_X(2)\\ \rho_X(1) & 1 & \rho_X(1)\\ \rho_X(2) & \rho_X(1) & 1\end{vmatrix}} = 0,$$
since the last column of the determinant in the numerator is a linear combination of the previous two columns. The same argument shows that in general, the PACF of an AR(p) process satisfies
$$\alpha(k) = 0 \quad \forall k \geq p+1, \qquad \alpha(p) = \phi_p \neq 0.$$
This fact is why the PACF is such a useful object!
In practice, we will be estimating the PACF by computing the sample PACF. We will need to determine when a sample PACF value $\hat\alpha(h)$ is small enough for us to reasonably assume that it could be 0. This will again be the case if
$$\hat\alpha(h) \in \left(-\frac{1.96}{\sqrt n}, \frac{1.96}{\sqrt n}\right).$$
More precisely, if $|\hat\alpha(h)| > \frac{1.96}{\sqrt n}$ for all $h \leq p$ and $|\hat\alpha(h)| < \frac{1.96}{\sqrt n}$ at least 95% of the time for $h > p$, it will be reasonable to look for an AR(p) model for our data. This is under the assumption that the time series is multivariate Gaussian.
Example 19.1. Consider the AR process defined by
$$X_t - 0.8X_{t-1} = Z_t,$$
where $Z_t \sim WN(0,1)$. The three pictures below show a realization of the time series, as well as the corresponding ACF and PACF. As expected, since $X$ is an AR(1) process, $\alpha(h) = 0$ for all $h > 1$. One should also note that $\alpha(1) = \phi_1$.
To produce the process and the pictures below, use the following commands:
> Z=rnorm(200)
> X=Z
> for (i in 2:200) X[i]=0.8*X[i-1]+Z[i]
> plot(X,type="l")
[Figure: A plot of X.]
> acf(X)
[Figure: The sample autocorrelation function of X.]
> pacf(X)
[Figure: The sample partial autocorrelation function of X.]
Example 19.2. Consider the AR process defined by
$$X_t + 0.8X_{t-1} = Z_t,$$
where $Z_t \sim WN(0,1)$. The three pictures below show a realization of the time series, as well as the corresponding ACF and PACF. As expected, since $X$ is an AR(1) process, $\alpha(h) = 0$ for all $h > 1$. One should also note that $\alpha(1) = \phi_1$.

[Figure: A plot of X.]
> acf(X)
[Figure: The sample autocorrelation function of X.]
> pacf(X)
[Figure: The sample partial autocorrelation function of X.]
Example 19.3. Consider the AR process defined by
$$X_t + 0.3X_{t-1} + 0.4X_{t-2} - 0.6X_{t-4} = Z_t,$$
where $Z_t \sim WN(0,1)$. The three pictures below show a realization of the time series, as well as the corresponding ACF and PACF. As expected, since $X$ is an AR(4) process, $\alpha(h) = 0$ for all $h > 4$. One should also note that $\alpha(4) = \phi_4$.
To produce the process and the pictures below, use the following commands:
> Z=rnorm(200)
> X=Z
> for (i in 5:200) X[i]=-0.3*X[i-1]-0.4*X[i-2]+0.6*X[i-4]+Z[i]
> plot(X,type="l")

[Figure: A plot of X.]
> acf(X)
[Figure: The sample autocorrelation function of X.]
> pacf(X)
[Figure: The sample partial autocorrelation function of X.]
Math 4506 (Fall 2019) November 13, 2019 Prof. Christian Benes
Lecture #20: Model Selection
Reference. Section 6.3-6.6 from the textbook.
20.1 Yule-Walker Estimation for AR(p) Processes
Consider the ARMA(p, q) process $X$ defined by
$$\Phi(B)X_t = \Theta(B)Z_t,$$
where $Z_t \sim WN(0,\sigma^2)$. If $X$ is a causal AR(p) process, we can write
$$X_t = \sum_{j\ge0}\psi_jZ_{t-j} = \Psi(B)Z_t,$$
where $\Psi(z) = \frac{1}{\Phi(z)}$. Therefore, as we have already seen in Lecture 8, for any $j \in \{0,\ldots,p\}$,
$$\Phi(B)X_t = Z_t \;\Rightarrow\; X_tX_{t-j} = \sum_{i=1}^{p}\phi_iX_{t-i}X_{t-j} + Z_tX_{t-j}$$
$$\Rightarrow\; E[X_tX_{t-j}] = \sum_{i=1}^{p}\phi_iE[X_{t-i}X_{t-j}] + E[Z_tX_{t-j}]$$
$$\Rightarrow\; \gamma(j) = \sum_{i=1}^{p}\phi_i\gamma(j-i) + E[Z_tX_{t-j}].$$
Using the fact that Xt = Ψ(B)Zt, we get:
• If $j = 0$,
$$\gamma(0) = \sum_{i=1}^{p}\phi_i\gamma(i) + E[Z_tX_t] = \sum_{i=1}^{p}\phi_i\gamma(i) + \sigma^2. \qquad (28)$$

• Since $Z_t$ is uncorrelated with $X_{t-j}$ for all $j \in \{1,\ldots,p\}$,
$$\gamma(j) = \sum_{i=1}^{p}\phi_i\gamma(j-i),$$
which, in matrix notation, can be written as
$$\Gamma_p\boldsymbol\phi = \gamma_p, \qquad (29)$$
where $\Gamma_p = (\gamma(i-j))_{i,j=1}^p$ is the covariance matrix, $\gamma_p = (\gamma(1),\ldots,\gamma(p))'$, and $\boldsymbol\phi = (\phi_1,\ldots,\phi_p)'$.
The equations for $j = 0$ and $j \in \{1,\ldots,p\}$ are a set of $p+1$ equations in the $2p+2$ variables $\sigma^2, \phi_1,\ldots,\phi_p, \gamma(0),\ldots,\gamma(p)$. If the model is entirely specified, we know $\sigma^2, \phi_1,\ldots,\phi_p$ and can therefore solve for $\gamma(0),\ldots,\gamma(p)$. On the other hand, if we happened to know $\gamma(0),\ldots,\gamma(p)$, we could find $\sigma^2, \phi_1,\ldots,\phi_p$. The last two sentences are of course true only if the matrix defining our system of equations is nonsingular.

Generally, we don't know the true covariances of a time series, but we can estimate them. If we replace $\gamma(i)$ by $\hat\gamma(i)$ in (28) and (29), we get
$$\hat\gamma(0) = \sum_{i=1}^{p}\hat\phi_i\hat\gamma(i) + \hat\sigma^2, \qquad (30)$$
$$\hat\Gamma_p\hat{\boldsymbol\phi} = \hat{\boldsymbol\gamma}_p, \qquad (31)$$
where $\hat{\boldsymbol\phi} = (\hat\phi_1,\ldots,\hat\phi_p)'$. It turns out that if $\hat\gamma(0) \neq 0$, the matrix $\hat\Gamma_p$ is nonsingular for all $p \geq 1$, and we can divide both sides of the equations above by $\hat\gamma(0)$ to get an expression in terms of the sample ACF:
$$1 = \sum_{i=1}^{p}\hat\phi_i\hat\rho(i) + \frac{\hat\sigma^2}{\hat\gamma(0)},$$
$$\hat R_p\hat{\boldsymbol\phi} = \hat{\boldsymbol\rho}_p.$$
And since $\hat\Gamma_p$ is nonsingular when $\hat\gamma(0) \neq 0$, we can take inverses and solve for $\hat{\boldsymbol\phi}$:
$$\hat\sigma^2 = \hat\gamma(0)\left(1-\sum_{i=1}^{p}\hat\phi_i\hat\rho(i)\right) = \hat\gamma(0)\left(1-\hat{\boldsymbol\phi}'\hat{\boldsymbol\rho}_p\right),$$
$$\hat{\boldsymbol\phi} = \hat R_p^{-1}\hat{\boldsymbol\rho}_p.$$
Using the second equation to rewrite the first, we get the sample Yule-Walker equations:
$$\hat{\boldsymbol\phi} = \hat R_p^{-1}\hat{\boldsymbol\rho}_p, \qquad (32)$$
$$\hat\sigma^2 = \hat\gamma(0)\left(1-\hat{\boldsymbol\rho}_p'\hat R_p^{-1}\hat{\boldsymbol\rho}_p\right). \qquad (33)$$
From this set of equations we can find, for any $m \geq 1$, an AR(m) model based on the Yule-Walker method:

Definition 20.1. The process
$$X_t - \sum_{i=1}^{p}\hat\phi_iX_{t-i} = Z_t, \qquad Z_t \sim WN(0,\hat v_p),$$
where
$$\hat{\boldsymbol\phi} = \hat R_p^{-1}\hat{\boldsymbol\rho}_p \qquad\text{and}\qquad \hat v_p = \hat\gamma(0)\left(1-\hat{\boldsymbol\rho}_p'\hat R_p^{-1}\hat{\boldsymbol\rho}_p\right)$$
is called the fitted Yule-Walker AR(p) model.
Note 20.1. The estimators one gets via this method will generally vary with $p$. For instance, if you choose $p = 1$, the coefficient $\hat\phi_1$ will not be the same as if you choose $p = 2$.

At this point, we have a method (albeit not a completely systematic one) which allows us to come up with an AR(p) model for a given data set:
• Look at the PACF to determine p.
• Find the fitted Yule-Walker AR(p) model.
The part of this procedure that is not particularly systematic is that of determining the "right" value of p. It turns out there is a better but considerably more complicated method, which we will discuss later, which simultaneously finds the "best" p (in some sense) and the corresponding Yule-Walker coefficients for every p.
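To make the procedure concrete, here is an added Python sketch (simulated data, illustrative only, not part of the original notes) that carries out the second step for $p = 1$: simulate an AR(1) with $\phi = 0.5$ and $\sigma^2 = 1$, then form the fitted Yule-Walker quantities $\hat\phi_1 = \hat\rho(1)$ and $\hat v_1 = \hat\gamma(0)(1-\hat\rho(1)^2)$:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
Z = rng.standard_normal(n)

# Simulate an AR(1): X_t = 0.5 X_{t-1} + Z_t
X = np.empty(n)
X[0] = Z[0]
for t in range(1, n):
    X[t] = 0.5 * X[t - 1] + Z[t]

# Sample autocovariances with the usual 1/n normalization
Xc = X - X.mean()
gamma0 = np.dot(Xc, Xc) / n
gamma1 = np.dot(Xc[:-1], Xc[1:]) / n
rho1 = gamma1 / gamma0

# Fitted Yule-Walker AR(1): phi_hat = rho(1), v_hat = gamma(0)(1 - rho(1)^2)
phi_hat = rho1
sigma2_hat = gamma0 * (1 - rho1 ** 2)
print(phi_hat, sigma2_hat)
```

With this sample size, the estimates land close to the true values 0.5 and 1, as the asymptotic theory below predicts.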
As always when estimating, we would like to know how good our estimates are. The key fact is that, under the assumption that the data are a realization of an AR(p) process,
$$\hat{\boldsymbol\phi} \overset{\text{approx.}}{\sim} N\left(\boldsymbol\phi,\; n^{-1}\sigma^2\Gamma_p^{-1}\right), \qquad (34)$$
which, when replacing the unknown parameters by their estimators, becomes
$$\hat{\boldsymbol\phi} \overset{\text{approx.}}{\sim} N\left(\boldsymbol\phi,\; n^{-1}\hat v_p\hat\Gamma_p^{-1}\right). \qquad (35)$$
Equation (35) implies that
$$Y := \frac{\sqrt n}{\sqrt{\hat v_p}}\left(\hat{\boldsymbol\phi}-\boldsymbol\phi\right) \overset{\text{approx.}}{\sim} N\left(0,\; \hat\Gamma_p^{-1}\right).$$
If we write $\hat v_p\hat\Gamma_p^{-1} = (a_{i,j})_{1\le i,j\le p}$, this implies for every $j \in \{1,\ldots,p\}$ that
$$P\left(\hat\phi_j - \frac{\sqrt{a_{j,j}}}{\sqrt n}z_{\alpha/2} \le \phi_j \le \hat\phi_j + \frac{\sqrt{a_{j,j}}}{\sqrt n}z_{\alpha/2}\right) \approx 1-\alpha.$$
This is a result for a specific coefficient of the AR(p) process. We can also find a confidence region for the collection of coefficients $\{\phi_{p,i}\}_{1\le i\le p}$. Using the fact that if $\mathbf X \sim N(\boldsymbol\mu,\Sigma)$, then
$$(\mathbf X-\boldsymbol\mu)'\Sigma^{-1}(\mathbf X-\boldsymbol\mu) \sim \chi^2_p$$
(this is the multi-dimensional analogue of the one-dimensional statement that if $X \sim N(\mu,\sigma^2)$, then $\frac{(X-\mu)^2}{\sigma^2} \sim \chi^2_1$), we get that
$$Y'\hat\Gamma_pY \overset{\text{approx.}}{\sim} \chi^2_p,$$
so that
$$P\left((\hat{\boldsymbol\phi}-\boldsymbol\phi)'\hat\Gamma_p(\hat{\boldsymbol\phi}-\boldsymbol\phi) \le \frac{\hat v_p\chi^2_{1-\alpha,p}}{n}\right) \approx 1-\alpha.$$
This gives a way of checking our model's precision, as the region
$$\left\{\boldsymbol\phi \in \mathbb R^p : (\hat{\boldsymbol\phi}-\boldsymbol\phi)'\hat\Gamma_p(\hat{\boldsymbol\phi}-\boldsymbol\phi) \le \frac{\hat v_p\chi^2_{1-\alpha,p}}{n}\right\}$$
contains $\boldsymbol\phi$ with approximate probability $1-\alpha$.
20.2 Fitting our first model
We can now use the "ar" command to find an appropriate model. This command finds the p that is optimal in the sense of minimizing the AICC (we will discuss this later), together with coefficients $\hat\phi_1,\ldots,\hat\phi_p$ and $\hat\sigma^2$, according to various methods, including the Yule-Walker method, which is the default method. Type "help(ar)" in R for more information about this command.
> set.seed(5)
> Z=rnorm(1000)
> X=Z
> for (i in 2:1000) X[i]=0.5*X[i-1]+Z[i]
> options(digits=16)
> ar(X)
Call:
ar(x = X)
Coefficients:
1
0.4804512923397
Order selected 1 sigma^2 estimated as 1.024832068202
We can check that these values are indeed what the Yule-Walker equations prescribe. Recall equation (32):
$$\hat{\boldsymbol\phi} = \hat R_p^{-1}\hat{\boldsymbol\rho}_p,$$
and equation (33):
$$\hat\sigma^2 = \hat\gamma(0)\left(1-\hat{\boldsymbol\rho}_p'\hat R_p^{-1}\hat{\boldsymbol\rho}_p\right).$$
If $p = 1$, these equations are just
$$\hat\phi_1 = \hat\rho_0^{-1}\hat\rho_1, \qquad \hat\sigma^2 = \hat\gamma(0)\left(1-\hat\rho_1\hat\rho_0^{-1}\hat\rho_1\right),$$
which, since $\hat\rho_0 = 1$, simplify to
$$\hat\phi_1 = \hat\rho_1, \qquad \hat\sigma^2 = \hat\gamma(0)\left(1-\hat\rho_1^2\right).$$
Now we can ask R to give us the estimate $\hat\rho_1$ directly:
> A=acf(X)
> A$acf[2]   # the second value of the acf, i.e., the lag-1 sample autocorrelation, with greater precision
[1] 0.4804512923
Therefore, $\hat\phi_1 = \hat\rho_1 \approx 0.480451$. We see that the value R gives directly via the command "ar" is the same for $\hat\phi_1$.
Math 4506 (Fall 2019) November 18, 2019 Prof. Christian Benes

Lecture #21: Model Fitting and Parameter Estimation; Forecasting
Reference. Sections 6.7, 7.2, and Chapter 9 from the textbook.
In what follows, we will consider cleverly designed algorithms which will allow us to come up with ARMA models (i.e., find the "best" $p$, $q$, $\{\phi_i\}_{i=1}^p$, $\{\theta_j\}_{j=1}^q$) according to certain criteria, given a data set. Some of these algorithms are quite complicated and we'll just gaze at them superficially. We start with an algorithm which estimates $\{\phi_i\}_{i=1}^p$ for AR(p) processes when $p$ is fixed. We will see later how to estimate which $p$ gives the best fitted model.
21.1 Method of Moments Estimation
We won't discuss this method in detail since it often isn't particularly effective. The only thing you need to know is that for AR(p) processes, the method of moments estimates are the same as the Yule-Walker estimates.
21.2 Least Squares Estimation
We begin with an obvious definition we could have made some time ago and which will be useful in the future, as not all stationary time series have zero mean:

Definition 21.1. If $\{Y_t - \mu\}$ is an ARMA process, we say that $\{Y_t\}$ is an ARMA process with mean $\mu$.
21.2.1 AR processes
We start by seeing how least squares estimation works for an AR(1) process (not necessarily mean 0) satisfying the equation
$$Y_t - \mu = \phi(Y_{t-1}-\mu) + Z_t.$$
The key idea is to see the time series $Y_t - \mu$ as a function of $Y_{t-1} - \mu$ and therefore to minimize the conditional sum of squares function
$$S_c(\phi,\mu) = \sum_{t=2}^{n}\left((Y_t-\mu) - \phi(Y_{t-1}-\mu)\right)^2.$$
To do this, we just need to solve $\nabla S_c(\phi,\mu) = \vec 0$.

We first look at the partial derivative with respect to $\mu$:
$$\frac{\partial S_c}{\partial\mu} = \sum_{t=2}^{n}2\left((Y_t-\mu) - \phi(Y_{t-1}-\mu)\right)(\phi-1) = 2(\phi-1)\left(\sum_{t=2}^{n}Y_t - \phi\sum_{t=2}^{n}Y_{t-1} + (n-1)(\phi-1)\mu\right).$$
This is equal to 0 if and only if
$$\sum_{t=2}^{n}Y_t - \phi\sum_{t=2}^{n}Y_{t-1} + (n-1)(\phi-1)\mu = 0 \iff \sum_{t=2}^{n}Y_t - \phi\sum_{t=1}^{n-1}Y_t + (n-1)(\phi-1)\mu = 0$$
$$\iff (1-\phi)\sum_{t=2}^{n-1}Y_t + Y_n - \phi Y_1 + (n-1)(\phi-1)\mu = 0$$
$$\iff \mu = \frac{1}{(n-1)(\phi-1)}\left((\phi-1)\sum_{t=2}^{n-1}Y_t - Y_n + \phi Y_1\right)$$
$$\iff \mu = \frac{1}{n-1}\sum_{t=2}^{n-1}Y_t + \frac{1}{(n-1)(1-\phi)}\left(Y_n - \phi Y_1\right), \qquad (36)$$
so $\hat\mu \approx \frac{1}{n-1}\sum_{t=2}^{n-1}Y_t \approx \bar Y$, since as $n$ gets large, $\frac{1}{(n-1)(1-\phi)}(Y_n - \phi Y_1)$ goes to 0. This is why one often chooses $\hat\mu = \bar Y$. It is good to keep in mind that if $n$ is small, one may wish to use the exact expression in (36).
Let's now set the partial derivative with respect to $\phi$ equal to 0:
$$\frac{\partial S_c}{\partial\phi} = \sum_{t=2}^{n}2\left((Y_t-\mu) - \phi(Y_{t-1}-\mu)\right)(\mu - Y_{t-1}).$$
This is 0 if and only if
$$\phi\sum_{t=2}^{n}(Y_{t-1}-\mu)^2 = \sum_{t=2}^{n}(Y_t-\mu)(Y_{t-1}-\mu) \iff \phi = \frac{\sum_{t=2}^{n}(Y_t-\mu)(Y_{t-1}-\mu)}{\sum_{t=2}^{n}(Y_{t-1}-\mu)^2},$$
so, replacing $\mu$ by its estimator $\bar Y$, we get
$$\hat\phi = \frac{\sum_{t=2}^{n}(Y_t-\bar Y)(Y_{t-1}-\bar Y)}{\sum_{t=2}^{n}(Y_{t-1}-\bar Y)^2}.$$
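On a tiny made-up data set, the closed form above can be evaluated by hand; this added Python sketch (not part of the original notes) simply transcribes the formula:

```python
# Conditional least squares estimate of phi for an AR(1) with mean,
# with mu replaced by the sample mean (illustrative data, hypothetical values)
Y = [1.0, 2.0, 3.0, 2.0, 1.0]
Ybar = sum(Y) / len(Y)                   # 1.8

num = sum((Y[t] - Ybar) * (Y[t - 1] - Ybar) for t in range(1, len(Y)))
den = sum((Y[t - 1] - Ybar) ** 2 for t in range(1, len(Y)))
phi_hat = num / den
print(phi_hat)
```

Here the numerator is 0.16 and the denominator 2.16, so $\hat\phi = 2/27 \approx 0.074$.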
We now see how the least-squares estimators are obtained for general AR processes:
For $p \geq 2$,
$$S_c(\phi_1,\ldots,\phi_p,\mu) = \sum_{t=p+1}^{n}\left(Y_t - \mu - \sum_{i=1}^{p}\phi_i(Y_{t-i}-\mu)\right)^2,$$
and one can show that
$$\frac{\partial S_c(\phi_1,\ldots,\phi_p,\mu)}{\partial\mu} = -2\left(1-\sum_{i=1}^{p}\phi_i\right)\sum_{t=p+1}^{n}\left(Y_t-\mu-\sum_{i=1}^{p}\phi_i(Y_{t-i}-\mu)\right) = 0 \iff \mu \approx \bar Y,$$
so, replacing $\mu$ by its estimator $\bar Y$, we get for $1 \leq j \leq p$,
$$\frac{\partial S_c(\phi_1,\ldots,\phi_p,\mu)}{\partial\phi_j} = 0 \iff \sum_{t=p+1}^{n}(Y_t-\bar Y)(Y_{t-j}-\bar Y) = \sum_{t=p+1}^{n}\sum_{i=1}^{p}\phi_i(Y_{t-i}-\bar Y)(Y_{t-j}-\bar Y)$$
$$\iff \sum_{t=p+1}^{n}(Y_t-\bar Y)(Y_{t-j}-\bar Y) = \sum_{i=1}^{p}\phi_i\sum_{t=p+1}^{n}(Y_{t-i}-\bar Y)(Y_{t-j}-\bar Y)$$
$$\iff \frac{\sum_{t=p+1}^{n}(Y_t-\bar Y)(Y_{t-j}-\bar Y)}{\sum_{t=p+1}^{n}(Y_t-\bar Y)^2} = \sum_{i=1}^{p}\phi_i\,\frac{\sum_{t=p+1}^{n}(Y_{t-i}-\bar Y)(Y_{t-j}-\bar Y)}{\sum_{t=p+1}^{n}(Y_t-\bar Y)^2}$$
$$\overset{\text{approx.}}{\iff} r_j = \sum_{i=1}^{p}\phi_ir_{i-j}.$$
These are just the sample Yule-Walker equations, so we see that the conditional least squares method yields the same estimators as the Yule-Walker method (which are the same as the method of moments estimators).
21.2.2 General ARMA processes
The problem is more difficult in this case and can be addressed by a number of numerical methods, but the key idea is that since one wants to regress the time series $Y_t$ on prior values of $Y$, one might wish to write the time series in invertible form $Y_t = Z_t + \sum_{j\ge1}\pi_jY_{t-j}$.

To understand well the idea of maximum likelihood estimation, it is helpful to have some notions of forecasting first. The idea we develop below fits naturally between the topics of least squares estimation and maximum likelihood estimation.
21.3 A Least Squares Predictor
We now turn to one of the important goals of time series analysis: to forecast future values of a time series. Since we need to know something about the underlying structure of the data in order to achieve this, we will continue assuming that the time series we are working with are stationary. As we know from hypothesis testing or the construction of confidence intervals, it is good to have an estimate/prediction, but much better to know how good the estimator/predictor is and in what sense.

If our predictor for a time series $X$ at some future time $n+h$ is to depend on values at previous times (1 to $n$), the easiest assumption one can make is that it depends on those values linearly. We define
$$P_nX_{n+h} := a_0 + \sum_{i=1}^{n}a_iX_{n+1-i} = a_0 + a_1X_n + \ldots + a_nX_1,$$
the best linear predictor of $X_{n+h}$, to be the one with the least expected square error (or least mean square error).
It is not obvious that such a predictor is unique or even exists (not every function has a minimum). The next theorem claims both existence and uniqueness of $P_nX_{n+h}$ and shows how to find the coefficients $\{a_i\}_{i=0}^n$.

Theorem 21.1. Suppose $X$ is a stationary time series with mean $\mu$ and autocovariance function $\gamma$, and let
$$S_n(a_0,\ldots,a_n) := E\left[(X_{n+h}-P_nX_{n+h})^2\right] = E\left[\left(X_{n+h}-(a_0+a_1X_n+\ldots+a_nX_1)\right)^2\right].$$
Then the unique vector $(a_0,\ldots,a_n)'$ which minimizes $S_n$ satisfies
$$a_0 = \mu\left(1-\sum_{i=1}^{n}a_i\right) \qquad (37)$$
and
$$\Gamma_n\mathbf a_n = \gamma_n(h), \qquad (38)$$
where $\mathbf a_n := (a_1,\ldots,a_n)'$, $\Gamma_n = (\gamma(i-j))_{i,j=1}^n$ is the autocovariance matrix of $X$, and
$$\gamma_n(h) = (\gamma(h),\gamma(h+1),\ldots,\gamma(h+n-1))'.$$
Proof. $S_n(a_0,\ldots,a_n)$ is a positive quadratic function in the variables $a_0,\ldots,a_n$ and therefore has a unique minimum. To find it, we need to solve the equation
$$\nabla S_n(a_0,\ldots,a_n) = 0$$
or, equivalently, the $n+1$ equations
$$\frac{\partial S_n(a_0,\ldots,a_n)}{\partial a_j} = 0, \qquad 0 \leq j \leq n.$$
We can compute partial derivatives inside the expectation in the definition of $S_n$ and rewrite these equations as
$$E\left[X_{n+h}-(a_0+a_1X_n+\ldots+a_nX_1)\right] = 0, \qquad (39)$$
$$E\left[X_j\left(X_{n+h}-(a_0+a_1X_n+\ldots+a_nX_1)\right)\right] = 0, \qquad 1 \leq j \leq n. \qquad (40)$$
Equation (39) becomes
$$a_0 = E\left[X_{n+h}-(a_1X_n+\ldots+a_nX_1)\right] = \mu\left(1-(a_1+\ldots+a_n)\right) = \mu\left(1-\sum_{i=1}^{n}a_i\right) \qquad (41)$$
and equation (40) can be rewritten as
$$a_0\mu = \gamma(n-j+h)+\mu^2 - \sum_{i=1}^{n}a_i\left(\gamma(n+1-i-j)+\mu^2\right), \qquad 1 \leq j \leq n,$$
or equivalently, replacing $j$ by $n+1-j$ (i.e., counting backwards) and using (41),
$$\mu^2\left(1-\sum_{i=1}^{n}a_i\right) = \gamma(j+h-1) - \sum_{i=1}^{n}a_i\gamma(j-i) + \mu^2\left(1-\sum_{i=1}^{n}a_i\right), \qquad 1 \leq j \leq n,$$
so that
$$\sum_{i=1}^{n}a_i\gamma(j-i) = \gamma(j+h-1), \qquad 1 \leq j \leq n,$$
which can be rewritten in matrix form as
$$\Gamma_n\mathbf a_n = \gamma_n(h). \qquad (42)$$
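Equation (42) can also be solved numerically. In this added Python sketch (illustrative, not part of the original notes), we use the AR(1) autocovariance $\gamma(h) = \sigma^2\phi^{|h|}/(1-\phi^2)$; the solution comes out to $(\phi^h, 0, \ldots, 0)'$:

```python
import numpy as np

# AR(1) autocovariance: gamma(h) = sigma^2 phi^|h| / (1 - phi^2)
phi, sigma2, n, h = 0.6, 1.0, 8, 3
def gamma(k):
    return sigma2 * phi ** abs(k) / (1 - phi ** 2)

# The system Gamma_n a_n = gamma_n(h) from Theorem 21.1 (mean 0, so a_0 = 0)
Gam = np.array([[gamma(i - j) for j in range(n)] for i in range(n)])
rhs = np.array([gamma(h + k) for k in range(n)])
a = np.linalg.solve(Gam, rhs)
print(a)    # first entry phi**h = 0.216, the rest 0 (up to rounding)
```

This matches the closed-form AR(1) predictor $P_nX_{n+h} = \phi^hX_n$ derived in the next lecture.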
Math 4506 (Fall 2019) November 20, 2019 Prof. Christian Benes
Lecture #22: Forecasting
Reference. Chapter 9 from the textbook.
The following proposition essentially follows from the work done in the proof of Theorem 21.1:

Proposition 22.3. (Properties of $P_nX_{n+h}$)

1. $P_nX_{n+h} = \mu + \sum_{i=1}^{n}a_i(X_{n+1-i}-\mu)$, where $\mathbf a_n$ satisfies (42).

2. The mean-squared error of the predictor satisfies
$$E[(X_{n+h}-P_nX_{n+h})^2] = \gamma(0) - \mathbf a_n'\gamma_n(h).$$

3. $E[X_{n+h}-P_nX_{n+h}] = 0$.

4. $E[(X_{n+h}-P_nX_{n+h})X_j] = 0$ for $j = 1,\ldots,n$.
22.1 Prediction
Example 22.1. (Predictions for an AR(1) process) Suppose $X_t = \phi X_{t-1} + Z_t$, where $|\phi| < 1$ and $Z_t \sim WN(0,\sigma^2)$. Recall that for such a process,
$$\gamma_X(h) = \frac{\sigma^2}{1-\phi^2}\,\phi^{|h|}, \qquad h \in \mathbb Z.$$
Equation (42) now becomes
$$\frac{\sigma^2}{1-\phi^2}\begin{pmatrix}1 & \phi & \phi^2 & \cdots & \phi^{n-1}\\ \phi & 1 & \phi & \cdots & \phi^{n-2}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \phi^{n-1} & \phi^{n-2} & \phi^{n-3} & \cdots & 1\end{pmatrix}\begin{pmatrix}a_1\\ a_2\\ \vdots\\ a_n\end{pmatrix} = \begin{pmatrix}\gamma(h)\\ \gamma(h+1)\\ \vdots\\ \gamma(h+n-1)\end{pmatrix}.$$
In particular, using the fact that $h > 0$, the right-hand side equals
$$\frac{\sigma^2}{1-\phi^2}\left(\phi^h, \phi^{h+1}, \ldots, \phi^{h+n-1}\right)',$$
clearly implying that
$$(a_1, a_2, \ldots, a_{n-1}, a_n)' = (\phi^h, 0, \ldots, 0, 0)'.$$
Since $\mu = 0$, we get from (37) that $a_0 = 0$, so that
$$P_nX_{n+h} = \phi^hX_n.$$
Point 2 of Proposition 22.3 implies that the mean-squared error of the predictor is
$$\gamma(0) - \mathbf a_n'\gamma_n(h) = \frac{\sigma^2}{1-\phi^2} - \phi^h\,\frac{\sigma^2}{1-\phi^2}\,\phi^h = \sigma^2\,\frac{1-\phi^{2h}}{1-\phi^2}.$$
Note that the mean-squared error of the predictor increases with the variance of the white noise used to generate $X$. This makes sense, since increasing $\sigma^2$ increases $\operatorname{Var}(X_t)$, yielding more variable data, which makes predictions more difficult.
22.2 Reduction to Mean Zero Time Series
We start by showing that whenever we have to deal with a stationary time series that doesn't have zero mean, we can assume that it does (which usually simplifies computations) and deal with the mean only once we've made a prediction for the corresponding zero-mean time series:

Suppose $\{Y_t\}$ is a stationary time series with mean $\mu$. Then if $X_t := Y_t - \mu$, $\{X_t\}$ is a stationary time series with mean 0. Therefore, the linearity of the prediction operator implies that
$$P_nY_{n+h} = P_n(X_{n+h}+\mu) = P_nX_{n+h} + \mu.$$
Also,
$$E[(Y_{n+h}-P_nY_{n+h})^2] = E[((Y_{n+h}-\mu)-(P_nY_{n+h}-\mu))^2] = E[(X_{n+h}-P_nX_{n+h})^2] = \gamma_X(0) - \mathbf a_n'\gamma_n(h),$$
where $\gamma_n(h) = (\gamma_X(h),\ldots,\gamma_X(h+n-1))'$.

Example 22.2. (An AR process with nonzero mean) The process $Y_t$ is an AR(1) process with mean $\mu$ if $X_t = Y_t - \mu$ is AR(1). So by Example 22.1,
$$P_nY_{n+h} = \phi^hX_n + \mu = \phi^h(Y_n-\mu) + \mu$$
and
$$E[(Y_{n+h}-P_nY_{n+h})^2] = \frac{\sigma^2(1-\phi^{2h})}{1-\phi^2}.$$
22.3 Forecasting Based on an Infinite Past
Recall the following definition:
Definition 22.1. A time series $\{X_t\}$ is an ARMA(p, q) process if $\{X_t\}$ is stationary and for all $t$,
$$X_t - \sum_{i=1}^{p}\phi_iX_{t-i} = \sum_{j=0}^{q}\theta_jZ_{t-j}, \qquad (43)$$
where $\theta_0 = 1$, $Z_t \sim WN(0,\sigma^2)$, and the polynomials $\Phi(z) = 1-\sum_{i=1}^{p}\phi_iz^i$ and $\Theta(z) = \sum_{j=0}^{q}\theta_jz^j$ have no common factors. In short, we can write
$$\Phi(B)X_t = \Theta(B)Z_t. \qquad (44)$$
Recall that a few lectures ago we derived an expression for ARMA(1,1) processes under the assumption that $|\phi| < 1$ by defining $\chi(z) = \sum_{j\ge0}\phi^jz^j$ and applying $\chi(B)$ to both sides of
$$\Phi(B)X_t = \Theta(B)Z_t,$$
thus obtaining
$$X_t = \chi(B)\Theta(B)Z_t = Z_t + (\phi+\theta)\sum_{j\ge1}\phi^{j-1}Z_{t-j}.$$
Similarly, if $|\theta| < 1$, we can write
$$\xi(z) = \frac{1}{\Theta(z)} = \sum_{j=0}^{\infty}(-\theta)^jz^j.$$
Then (44) becomes
$$\xi(B)\Phi(B)X_t = \xi(B)\Theta(B)Z_t = Z_t,$$
that is,
$$\pi(B)X_t = Z_t,$$
where
$$\pi(B) = \xi(B)\Phi(B) = \sum_{j=0}^{\infty}(-\theta)^jB^j(1-\phi B) = 1 - (\phi+\theta)\sum_{j\ge1}(-\theta)^{j-1}B^j,$$
so that
$$Z_t = X_t - (\phi+\theta)\sum_{j\ge1}(-\theta)^{j-1}X_{t-j},$$
implying that
$$X_t = (\phi+\theta)\sum_{j\ge1}(-\theta)^{j-1}X_{t-j} + Z_t.$$
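The closed form for the $\pi$-weights can be checked by multiplying out the truncated power series $\xi(z)\Phi(z)$; here is a short added Python sketch (illustrative parameter values, not part of the original notes):

```python
import numpy as np

phi, theta = 0.5, 0.4                       # illustrative ARMA(1,1) parameters
N = 30                                      # truncation order

# Coefficients of xi(z) = sum_j (-theta)^j z^j and Phi(z) = 1 - phi z
xi = np.array([(-theta) ** j for j in range(N)])
Phi = np.zeros(N)
Phi[0], Phi[1] = 1.0, -phi

# pi(z) = xi(z) Phi(z): polynomial multiplication, truncated to N coefficients
pi = np.convolve(xi, Phi)[:N]

# Claimed closed form: pi_0 = 1, pi_j = -(phi + theta)(-theta)^(j-1) for j >= 1
claimed = np.array([1.0] + [-(phi + theta) * (-theta) ** (j - 1) for j in range(1, N)])
print(np.max(np.abs(pi - claimed)))
```

The two coefficient sequences agree to machine precision, confirming the algebra above.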
As the computations above show, for some time series the value $X_t$ depends on all the values $X_s$, $s < t$. Therefore, the best "linear" predictor should depend on all the values $X_s$, $s < t$, as well.
Definition 22.2. The prediction operator based on the infinite past, $\tilde P_n$, is defined by
$$\tilde P_nX_{n+h} = \sum_{j=1}^{\infty}\alpha_jX_{n+1-j},$$
where the coefficients $\alpha_j$ minimize the expected square error
$$E[(X_{n+h}-\tilde P_nX_{n+h})^2].$$

Note 22.1. The infinite sum in the definition is taken to be a mean square limit. See the brief discussion of this topic in Lecture 4.
Proposition 22.4. (Properties of $\tilde P_nX_{n+h}$)

1. $E[(X_{n+h}-\tilde P_nX_{n+h})X_i] = 0$ for all $i \leq n$.

2. $\tilde P_n(aX_{n+h_1} + bX_{n+h_2} + c) = a\tilde P_n(X_{n+h_1}) + b\tilde P_n(X_{n+h_2}) + c$.

3. $\tilde P_n\left(\sum_{i\ge1}\alpha_iX_{n+1-i}\right) = \sum_{i\ge1}\alpha_iX_{n+1-i}$.

4. $\tilde P_nX_{n+h} = E[X_{n+h}]$ if $\operatorname{Cov}(X_{n+h},X_i) = 0$ for all $i \leq n$.
Example 22.3. Consider the ARMA(1,1) process with $|\phi| < 1$, $|\theta| < 1$:
$$(1-\phi B)X_t = (1+\theta B)Z_t, \qquad Z_t \sim WN(0,\sigma^2).$$
We saw above that
$$X_t = (\phi+\theta)\sum_{j\ge1}(-\theta)^{j-1}X_{t-j} + Z_t.$$
Applying $\tilde P_n$ to both sides of the equality and using the properties in the proposition above yields
$$\tilde P_nX_{n+1} = (\phi+\theta)\sum_{j\ge1}(-\theta)^{j-1}X_{n+1-j},$$
so since
$$X_{n+1} = (\phi+\theta)\sum_{j\ge1}(-\theta)^{j-1}X_{n+1-j} + Z_{n+1},$$
we see that
$$X_{n+1} - \tilde P_nX_{n+1} = Z_{n+1},$$
implying that the expected square error is
$$E[(X_{n+1}-\tilde P_nX_{n+1})^2] = \sigma^2.$$
Math 4506 (Fall 2019) November 25, 2019 Prof. Christian Benes
Lecture #23: More Forecasting
Reference. Section 7.3
23.1 The Innovations Algorithm
The innovations algorithm is designed to facilitate the computation of predictors via predictors used in the past. It works not just for stationary time series, but for any time series with second moments.

Consider a mean 0 time series $\{X_t\}$ with $E[X_t^2] < \infty$ for all $t$, and let $\kappa(i,j) = E[X_iX_j]$. Note that since we aren't assuming here that $\{X_t\}$ is stationary, we can't talk about an autocovariance function, only about covariances. We define
$$\hat X_n = \begin{cases}0, & n = 1,\\ P_{n-1}X_n, & n \geq 2,\end{cases}$$
the innovations
$$U_n = X_n - \hat X_n,$$
and
$$v_n = E\left[\left(X_{n+1}-\hat X_{n+1}\right)^2\right] = E[U_{n+1}^2].$$
Since $\hat X_n = -\sum_{i=1}^{n-1}a_{n-1,i}X_{n-i}$ (the constants are arbitrary and the minus sign is here just to make the equation below look nicer), we get
$$\begin{pmatrix}U_1\\U_2\\U_3\\\vdots\\U_{n-1}\\U_n\end{pmatrix} = \begin{pmatrix}1 & 0 & 0 & \cdots & 0 & 0\\ a_{1,1} & 1 & 0 & \cdots & 0 & 0\\ a_{2,2} & a_{2,1} & 1 & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ a_{n-2,n-2} & a_{n-2,n-3} & a_{n-2,n-4} & \cdots & 1 & 0\\ a_{n-1,n-1} & a_{n-1,n-2} & a_{n-1,n-3} & \cdots & a_{n-1,1} & 1\end{pmatrix}\begin{pmatrix}X_1\\X_2\\X_3\\\vdots\\X_{n-1}\\X_n\end{pmatrix},$$
which can be rewritten in short as
$$\mathbf U_n = A_n\mathbf X_n.$$
We know from linear algebra that the inverse matrix of $A_n$ can be written in the form
$$C_n := A_n^{-1} = \begin{pmatrix}1 & 0 & 0 & \cdots & 0 & 0\\ \theta_{1,1} & 1 & 0 & \cdots & 0 & 0\\ \theta_{2,2} & \theta_{2,1} & 1 & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ \theta_{n-2,n-2} & \theta_{n-2,n-3} & \theta_{n-2,n-4} & \cdots & 1 & 0\\ \theta_{n-1,n-1} & \theta_{n-1,n-2} & \theta_{n-1,n-3} & \cdots & \theta_{n-1,1} & 1\end{pmatrix}.$$
In particular, since
$$C_n\mathbf U_n = \mathbf X_n \qquad\text{and}\qquad \mathbf U_n = \mathbf X_n - \hat{\mathbf X}_n,$$
we have
$$\hat{\mathbf X}_n = \mathbf X_n - \mathbf U_n = (C_n - I_n)\mathbf U_n = \Theta_n\mathbf U_n = \Theta_n(\mathbf X_n - \hat{\mathbf X}_n),$$
where
$$\Theta_n = \begin{pmatrix}0 & 0 & 0 & \cdots & 0 & 0\\ \theta_{1,1} & 0 & 0 & \cdots & 0 & 0\\ \theta_{2,2} & \theta_{2,1} & 0 & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ \theta_{n-2,n-2} & \theta_{n-2,n-3} & \theta_{n-2,n-4} & \cdots & 0 & 0\\ \theta_{n-1,n-1} & \theta_{n-1,n-2} & \theta_{n-1,n-3} & \cdots & \theta_{n-1,1} & 0\end{pmatrix}.$$
If we write the equations of this system individually, we get $\hat X_1 = 0$ and, for $n \geq 1$,
$$\hat X_{n+1} = \sum_{j=1}^{n}\theta_{n,j}\left(X_{n+1-j}-\hat X_{n+1-j}\right).$$
So if at time $n$ we wish to make a prediction $\hat X_{n+1}$, we can do so by using our past predictions $\{\hat X_{n+1-j}\}_{1\le j\le n}$ and the actual values $\{X_{n+1-j}\}_{1\le j\le n}$. Of course, we also need the coefficients $\{\theta_{n,j}\}_{1\le j\le n}$. It would be great if we could compute these recursively as well. It turns out we can:
Theorem 23.1. (Innovations Algorithm)
$$v_0 = \kappa(1,1),$$
and for $n \geq 1$,
$$\theta_{n,n-k} = v_k^{-1}\left(\kappa(n+1,k+1) - \sum_{j=0}^{k-1}\theta_{k,k-j}\theta_{n,n-j}v_j\right), \qquad 0 \leq k \leq n-1, \qquad (45)$$
the sum being empty if $k = 0$, and
$$v_n = \kappa(n+1,n+1) - \sum_{j=0}^{n-1}\theta_{n,n-j}^2v_j.$$
Note 23.1. So to compute $\{\theta_{n,j}\}_{1\le j\le n}$, we start by computing
$$\theta_{1,1} = v_0^{-1}\kappa(2,1),$$
then
$$v_1 = \kappa(2,2) - \theta_{1,1}^2v_0,$$
then
$$\theta_{2,2} = v_0^{-1}\kappa(3,1), \qquad \theta_{2,1} = v_1^{-1}\left(\kappa(3,2) - \theta_{1,1}\theta_{2,2}v_0\right),$$
then
$$v_2 = \kappa(3,3) - \left(\theta_{2,2}^2v_0 + \theta_{2,1}^2v_1\right),$$
etc.
Note 23.2. An important property of the innovations is that the components of $\mathbf X_n - \hat{\mathbf X}_n$ are uncorrelated.
Example 23.1. (Prediction for an MA(1) process) In the case of an MA(1) process
$$X_t = Z_t + \theta Z_{t-1}, \qquad |\theta| < 1,$$
we have $\kappa(n,n) = \gamma(0)$, $\kappa(n+1,n) = \gamma(1)$, and $\kappa(m,n) = 0$ if $m > n+1$, so the equations of the innovations algorithm become
$$v_0 = \gamma(0),$$
with $\theta_{n,i} = 0$ for $n \geq 2$ and $i \geq 2$, and for $n \geq 1$,
$$\theta_{n,1} = v_{n-1}^{-1}\gamma(1), \qquad v_n = \gamma(0) - \theta_{n,1}^2v_{n-1}.$$
To see why $\theta_{n,i} = 0$ for $n, i \geq 2$, we can note that $\theta_{n,n} = v_0^{-1}\kappa(n+1,1) = v_0^{-1}\gamma(n) = 0$ if $n \geq 2$. Moreover, if $k \leq n-2$, then $(n+1)-(k+1) \geq 2$, so $\kappa(n+1,k+1) = 0$, and we get from (45)
$$\theta_{n,n-k} = -v_k^{-1}\sum_{j=0}^{k-1}\theta_{k,k-j}\theta_{n,n-j}v_j. \qquad (46)$$
We already know that $\theta_{n,n} = 0$. We can now use this in (46) as follows:
$$\theta_{n,n-1} = -v_1^{-1}\sum_{j=0}^{0}\theta_{1,1-j}\theta_{n,n-j}v_j = -v_1^{-1}\theta_{1,1}\theta_{n,n}v_0 = 0.$$
Now that we know $\theta_{n,n} = \theta_{n,n-1} = 0$, we get
$$\theta_{n,n-2} = -v_2^{-1}\sum_{j=0}^{1}\theta_{2,2-j}\theta_{n,n-j}v_j = -v_2^{-1}\left(\theta_{2,2}\theta_{n,n}v_0 + \theta_{2,1}\theta_{n,n-1}v_1\right) = 0.$$
Continuing like this, we can show $\theta_{n,i} = 0$ for all $2 \leq i \leq n$.
In the particular case of an MA(1) process, using the fact that
$$\gamma(0) = \sigma^2(1+\theta^2), \qquad \gamma(1) = \sigma^2\theta,$$
these equations become
$$v_0 = \sigma^2(1+\theta^2),$$
and for $n \geq 1$,
$$\theta_{n,i} = 0 \;(i \geq 2), \qquad \theta_{n,1} = v_{n-1}^{-1}\sigma^2\theta,$$
$$v_n = \sigma^2(1+\theta^2) - \left(v_{n-1}^{-1}\sigma^2\theta\right)^2v_{n-1} = \sigma^2\left(1+\theta^2 - v_{n-1}^{-1}\sigma^2\theta^2\right).$$
For example, for the MA process
$$X_t = Z_t + \frac12Z_{t-1},$$
we get
$$v_0 = \frac54\sigma^2, \qquad \theta_{1,1} = v_0^{-1}\cdot\frac12\sigma^2 = \frac25,$$
$$v_1 = \sigma^2\left(1+\frac14 - v_0^{-1}\sigma^2\cdot\frac14\right) = \sigma^2\left(1+\frac14-\frac15\right) = \frac{21}{20}\sigma^2, \qquad \theta_{2,1} = v_1^{-1}\sigma^2\cdot\frac12 = \frac{10}{21},$$
$$v_2 = \sigma^2\left(1+\frac14 - v_1^{-1}\sigma^2\cdot\frac14\right) = \sigma^2\left(1+\frac14-\frac{5}{21}\right) = \frac{85}{84}\sigma^2, \qquad \theta_{3,1} = v_2^{-1}\sigma^2\cdot\frac12 = \frac{42}{85},$$
$$\vdots$$
This gives
$$\hat X_2 = \theta_{1,1}\left(X_1 - \hat X_1\right) = \frac25X_1,$$
$$\hat X_3 = \theta_{2,1}\left(X_2 - \hat X_2\right) = \frac{10}{21}\left(X_2 - \frac25X_1\right) = \frac{10}{21}X_2 - \frac{4}{21}X_1,$$
$$\hat X_4 = \frac{42}{85}\left(X_3 - \frac{10}{21}X_2 + \frac{4}{21}X_1\right) = \frac{42}{85}X_3 - \frac{20}{85}X_2 + \frac{8}{85}X_1,$$
$$\vdots$$
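The recursion of Theorem 23.1 is short to implement. The following Python sketch (an added illustration, not part of the original notes) reproduces the fractions of the worked example above using exact rational arithmetic:

```python
from fractions import Fraction

def innovations(kappa, n):
    """Innovations algorithm (Theorem 23.1): returns theta[(m, j)] and v[0..n-1]."""
    v = [kappa(1, 1)]                                  # v_0 = kappa(1,1)
    theta = {}
    for m in range(1, n):                              # compute theta_{m,.} and v_m
        for k in range(m):                             # theta_{m,m-k}, 0 <= k <= m-1
            s = sum(theta[(k, k - j)] * theta[(m, m - j)] * v[j] for j in range(k))
            theta[(m, m - k)] = (kappa(m + 1, k + 1) - s) / v[k]
        v.append(kappa(m + 1, m + 1)
                 - sum(theta[(m, m - j)] ** 2 * v[j] for j in range(m)))
    return theta, v

# MA(1) with theta = 1/2 and sigma^2 = 1, in exact rational arithmetic
t = Fraction(1, 2)
def kappa(i, j):
    if i == j:
        return 1 + t ** 2          # gamma(0) = sigma^2 (1 + theta^2)
    if abs(i - j) == 1:
        return t                   # gamma(1) = sigma^2 theta
    return Fraction(0)

theta, v = innovations(kappa, 4)
print(v[0], theta[(1, 1)], v[1], theta[(2, 1)], v[2], theta[(3, 1)])
```

The output is exactly 5/4, 2/5, 21/20, 10/21, 85/84, and 42/85, matching the hand computation above.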
Math 4506 (Fall 2019) November 27, 2019 Prof. Christian Benes
Lecture #24: Maximum Likelihood Estimation
Reference. Section 7.3
24.1 Maximum Likelihood Estimation
When looking for estimators, a statistician has a number of tools at her disposal. Two of the most common are to look for method of moments or maximum likelihood estimators. We've already seen, when discussing the Yule-Walker method, how to come up with method of moments estimators. We now discuss the maximum likelihood method. This method relies on knowledge (up to the unknown parameters) of the underlying distribution and on the common assumption (which may of course be wrong, but is often appropriate) that the data $\mathbf X_n = (X_1,\ldots,X_n)$ come from the normal distribution. In that case, the likelihood of $\mathbf X_n$ is defined by
$$L = \frac{1}{(2\pi)^{n/2}(\det\Gamma_n)^{1/2}}\exp\left\{-\frac12\mathbf X_n'\Gamma_n^{-1}\mathbf X_n\right\}.$$
The likelihood depends on the only parameters which are present in it, that is, the covariances. It turns out that if we think in terms of innovations, we can simplify the last expression quite a bit by finding appropriate replacements for $\det\Gamma_n$ and $\mathbf X_n'\Gamma_n^{-1}\mathbf X_n$.
We know from earlier that if
$$\hat X_n = \begin{cases}0, & n = 1,\\ P_{n-1}X_n, & n \geq 2\end{cases}$$
is the least squares linear predictor, then $\mathbf X_n = C_n(\mathbf X_n - \hat{\mathbf X}_n)$, where
$$C_n = \begin{pmatrix}1 & 0 & 0 & \cdots & 0 & 0\\ \theta_{1,1} & 1 & 0 & \cdots & 0 & 0\\ \theta_{2,2} & \theta_{2,1} & 1 & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ \theta_{n-2,n-2} & \theta_{n-2,n-3} & \theta_{n-2,n-4} & \cdots & 1 & 0\\ \theta_{n-1,n-1} & \theta_{n-1,n-2} & \theta_{n-1,n-3} & \cdots & \theta_{n-1,1} & 1\end{pmatrix}.$$
An important property of the innovations is that the components of $\mathbf X_n - \hat{\mathbf X}_n$ are uncorrelated, so that if $v_{j-1} = E\left[\left(X_j - \hat X_j\right)^2\right]$, the covariance matrix of $\mathbf X_n - \hat{\mathbf X}_n$ is
$$D_n = \begin{pmatrix}v_0 & 0 & \cdots & 0\\ 0 & v_1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & v_{n-1}\end{pmatrix}.$$
You proved on the first homework assignment of the semester that if $\mathbf Y = a + B\mathbf X$, then $\Sigma_{\mathbf Y} = B\Sigma_{\mathbf X}B'$. Applying this to our current situation, where $\mathbf X_n = C_n(\mathbf X_n - \hat{\mathbf X}_n)$, we get that
$$\Gamma_n = C_nD_nC_n',$$
so that
$$\mathbf X_n'\Gamma_n^{-1}\mathbf X_n = (\mathbf X_n - \hat{\mathbf X}_n)'C_n'\,C_n'^{-1}D_n^{-1}C_n^{-1}\,C_n(\mathbf X_n - \hat{\mathbf X}_n) = (\mathbf X_n - \hat{\mathbf X}_n)'D_n^{-1}(\mathbf X_n - \hat{\mathbf X}_n) = \sum_{j=1}^{n}\frac{(X_j - \hat X_j)^2}{v_{j-1}}.$$
We also get that
\[
\det \Gamma_n = \det(C_n)\det(D_n)\det(C_n') = \prod_{i=1}^n v_{i-1},
\]
so that we can rewrite the likelihood as
\[
L = \frac{1}{(2\pi)^{n/2}\bigl(\prod_{i=1}^n v_{i-1}\bigr)^{1/2}} \exp\Biggl(-\frac{1}{2}\sum_{j=1}^n \frac{(X_j - \hat X_j)^2}{v_{j-1}}\Biggr).
\]
All these quantities are easily computed using the innovations algorithm, and so is the likelihood. In particular, using the definition $r_n = v_n/\sigma^2$ (note that $r_n$ is independent of $\sigma^2$), we obtain the Gaussian likelihood for an ARMA process:
\[
L(\boldsymbol\phi, \boldsymbol\theta, \sigma^2) = \frac{1}{\sqrt{(2\pi\sigma^2)^n\bigl(\prod_{i=1}^n r_{i-1}\bigr)}} \exp\Biggl(-\frac{1}{2\sigma^2}\sum_{j=1}^n \frac{(X_j - \hat X_j)^2}{r_{j-1}}\Biggr).
\]
To maximize the likelihood, one differentiates it and looks for zeros.
Note that since we are now assuming that X is an ARMA process, the innovations algorithm tells us that $v_0 = \mathrm{Var}(X_1)$ and, for $n \ge 1$, $v_n = \mathrm{Var}(X_{n+1}) - \sum_{j=0}^{n-1}\theta_{n,n-j}^2 v_j$.
Wold's theorem (see p. 383 of the textbook) says that every ARMA process can be expressed as an MA($\infty$) process (you showed this in a homework problem for AR(1) processes). This means that we can write $X_t = \sum_{k\ge 1}\psi_k Z_{t-k}$ with $\psi_k = f_k(\boldsymbol\phi, \boldsymbol\theta)$, which implies that $v_0 = \sigma^2 f(\boldsymbol\phi, \boldsymbol\theta)$, where $\sigma^2 = \mathrm{Var}(Z_t)$, and the equality $v_n = \mathrm{Var}(X_{n+1}) - \sum_{j=0}^{n-1}\theta_{n,n-j}^2 v_j$ implies that for $n \ge 1$, $v_n = \sigma^2 f_n(\boldsymbol\phi, \boldsymbol\theta)$. This implies that $r_n = v_n/\sigma^2$ does not depend on $\sigma^2$, so we will be able to treat it like a constant (when differentiating with respect to $\sigma^2$). This allows us to keep the notation simple and write
\[
S = \sum_{j=1}^n \frac{(X_j - \hat X_j)^2}{r_{j-1}} \qquad \text{and} \qquad P = \prod_{i=1}^n r_{i-1}.
\]
Then, the product rule implies
\[
\frac{\partial}{\partial\sigma^2}(L) = \frac{\partial}{\partial\sigma^2}\Biggl(\frac{(\sigma^2)^{-n/2}}{\sqrt{(2\pi)^n P}}\exp\Bigl(-\frac{1}{2\sigma^2}S\Bigr)\Biggr) = -\frac{n}{2}\,\frac{(\sigma^2)^{-n/2-1}}{\sqrt{(2\pi)^n P}}\exp\Bigl(-\frac{1}{2\sigma^2}S\Bigr) + \frac{(\sigma^2)^{-n/2}}{\sqrt{(2\pi)^n P}}\exp\Bigl(-\frac{1}{2\sigma^2}S\Bigr)\frac{S}{2}\Bigl(\frac{1}{\sigma^2}\Bigr)^2.
\]
This last term is 0 if and only if
\[
\frac{n}{2}(\sigma^2)^{-n/2-1} = (\sigma^2)^{-n/2}\Bigl(\frac{1}{\sigma^2}\Bigr)^2\frac{S}{2} \iff n\sigma^2 = S.
\]
This yields the estimator for the white noise variance $\sigma^2$:
\[
\hat\sigma^2 = \frac{1}{n}\sum_{j=1}^n \frac{(X_j - \hat X_j)^2}{r_{j-1}}.
\]
Now, maximizing $L(\boldsymbol\phi, \boldsymbol\theta, \sigma^2)$ is equivalent to maximizing
\[
\ln\Biggl(\frac{1}{\sqrt{\sigma^{2n}\bigl(\prod_{i=1}^n r_{i-1}\bigr)}}\exp\Biggl(-\frac{1}{2\sigma^2}\sum_{j=1}^n \frac{(X_j - \hat X_j)^2}{r_{j-1}}\Biggr)\Biggr) = -\frac{1}{2}\Biggl(n\ln\sigma^2 + \sum_{i=1}^n \ln r_{i-1}\Biggr) - \frac{1}{2\sigma^2}\sum_{j=1}^n \frac{(X_j - \hat X_j)^2}{r_{j-1}},
\]
or, equivalently, to minimizing
\[
n\ln\sigma^2 + \sum_{i=1}^n \ln r_{i-1} + \frac{1}{\sigma^2}\sum_{j=1}^n \frac{(X_j - \hat X_j)^2}{r_{j-1}},
\]
which, after replacing $\sigma^2$ by its estimator (a function of the estimators $\hat{\boldsymbol\phi}$ and $\hat{\boldsymbol\theta}$), becomes
\[
n\ln\Biggl(\frac{1}{n}\sum_{j=1}^n \frac{(X_j - \hat X_j)^2}{r_{j-1}}\Biggr) + \sum_{i=1}^n \ln r_{i-1} + n,
\]
so that maximizing the likelihood amounts to computing
\[
\hat\sigma^2 = \frac{1}{n}\sum_{j=1}^n \frac{(X_j - \hat X_j)^2}{r_{j-1}} \tag{47}
\]
and using the predictors obtained from the parameters $\hat{\boldsymbol\phi}$ and $\hat{\boldsymbol\theta}$ that minimize
\[
\ell(\boldsymbol\phi, \boldsymbol\theta) = \ln\Biggl(\frac{1}{n}\sum_{j=1}^n \frac{(X_j - \hat X_j)^2}{r_{j-1}}\Biggr) + n^{-1}\sum_{i=1}^n \ln r_{i-1}. \tag{48}
\]
(Note that this is a difficult exercise.)
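For concreteness, once the innovations algorithm has produced the one-step prediction errors $X_j - \hat X_j$ and the ratios $r_{j-1}$, the two quantities (47) and (48) are a few lines of arithmetic. Here is a minimal Python sketch (the notes use R; the function name and the toy inputs below are ours, purely for illustration):

```python
import math

def reduced_likelihood(errors, r):
    """Given one-step prediction errors X_j - X^_j and the ratios
    r_{j-1} = v_{j-1}/sigma^2 from the innovations algorithm,
    return (sigma^2 estimate, reduced likelihood), i.e. (47) and (48)."""
    n = len(errors)
    S = sum(e * e / rj for e, rj in zip(errors, r))
    sigma2_hat = S / n                                          # equation (47)
    ell = math.log(S / n) + sum(math.log(rj) for rj in r) / n   # equation (48)
    return sigma2_hat, ell

# toy inputs: four prediction errors with all r_j equal to 1 (white noise case)
s2, ell = reduced_likelihood([0.5, -1.2, 0.3, 0.9], [1.0, 1.0, 1.0, 1.0])
```

Minimizing `ell` numerically over $(\boldsymbol\phi, \boldsymbol\theta)$ and then reading off `sigma2_hat` is exactly the program described above.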
Math 4506 (Fall 2019) December 2, 2019 Prof. Christian Benes
Lecture #25: AIC, Model Diagnostics
Reference. Section 6.5 and Chapter 8 from the textbook.
25.1 The Akaike Information Criterion
The Akaike criterion assigns to each model (i.e., to each pair p, q) a numerical value related to the likelihood of the model given the data. One then chooses $p, q, \boldsymbol\phi_p, \boldsymbol\theta_q$ in such a way as to minimize
AIC = −2 ln(Likelihood) + 2(p+ q + 1).
The smaller the AIC, the larger the likelihood. Note that sometimes the AIC is taken to be (this is the case with R)

AIC = −2 ln(Likelihood) + 2(p + q + 2),

but the values of p and q that minimize it are the same in both cases, so for our purpose either choice is fine.
There is an improved version, which is as follows: the AICC (Akaike's Information Corrected Criterion) is
\[
\mathrm{AICC} = \mathrm{AIC} + \frac{2(p+q+1)(p+q+3)}{n - (p+q+3)}.
\]
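As a quick arithmetic sketch (in Python rather than R; the function names and toy numbers below are ours), here is how the two criteria combine a log-likelihood with the penalty terms:

```python
def aic(loglik, p, q):
    # AIC = -2 ln(Likelihood) + 2(p + q + 1)
    return -2.0 * loglik + 2.0 * (p + q + 1)

def aicc(loglik, p, q, n):
    # AICC = AIC + 2(p+q+1)(p+q+3) / (n - (p+q+3)), as defined above
    k = p + q
    return aic(loglik, p, q) + 2.0 * (k + 1) * (k + 3) / (n - (k + 3))

# hypothetical values: a fit with log-likelihood -100 on n = 50 observations
value = aicc(-100.0, 1, 1, 50)
```

For small n the correction term is substantial; as n grows it vanishes, so AICC and AIC then select the same model.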
We will now see how this can be used in a specific example:
Example 25.1. Suppose X is the ARMA(1,1) process defined by
Xt = 0.5Xt−1 + Zt + 0.4Zt−1.
We can simulate the process as follows:
> Z=rnorm(10000)
> X=Z
> for (i in 2:10000) X[i]=0.5*X[i-1]+Z[i]+0.4*Z[i-1]
Not too surprisingly (since we know what the actual process is), neither the ACF nor the PACF suggests that an MA or AR model is appropriate. However, they do suggest that a stationary model might be, so we look for the best fit among ARMA processes with p, q ≤ 5:
> m=matrix(0,6,6)
> for (i in 0:5) for (j in 0:5) m[i+1,j+1]=AIC(arima(X,order=c(i,0,j)))
> m
         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
[1,] 35627.07 29773.30 28743.51 28496.02 28466.50 28445.61
[2,] 29174.76 28435.98 28437.21 28439.21 28437.70 28437.46
[3,] 28558.52 28437.20 28435.47 28439.79 28437.97 28438.71
[4,] 28451.61 28439.18 28437.47 28439.12 28439.91 28440.69
[5,] 28445.16 28435.51 28437.51 28439.36 28439.20 28441.21
[6,] 28443.23 28437.51 28439.48 28440.73 28441.20 28443.05
This shows that the smallest AIC value is 28435.47, obtained by the ARMA(2,2) model. We see that the AIC of the ARMA(1,1) model, 28435.98, is only very slightly larger, so both models should certainly be considered. Note that if you were to perform the same experiment again, the AICs might suggest altogether different models (though with such a large data set, the model chosen using the AIC is likely to be the right one).
25.2 Residuals
If a model obtained for a time series is good, it should account for all the structure that is present in that time series. In other words, anything it doesn't account for should be "completely random". What the model doesn't account for (that is, what's left once we look at the difference between the data and the model) is what we call the residuals. If the residuals for our model are white noise, we have a reasonable model and can pat ourselves on the back.
Given a time series X, the innovations algorithm computes at each time t a predictor $\hat X_{t+1}$, using the recursively computed values of $v_i$ and $\theta_{i,j}$.
The fact that the innovations $X_n - \hat X_n$ are uncorrelated then implies that the rescaled residuals
\[
R_n := \frac{X_n - \hat X_n}{\sqrt{s_{n-1}}} \sim WN(0,1),
\]
where $s_{n-1} = E[(X_n - \hat X_n)^2]$.
Now, in reality, the predictor $\hat X_n$ is not known exactly: it depends on the parameters $\boldsymbol\theta, \boldsymbol\phi$, and $\sigma^2$, for which one can only try to guess the true values. If the predictor is based on the maximum likelihood estimators for the parameters and the model is right, then
\[
W_n := \frac{X_n - \hat X_n(\hat{\boldsymbol\theta}, \hat{\boldsymbol\phi})}{\sqrt{r_{n-1}}} \overset{\text{approx}}{\sim} WN(0, \sigma^2).
\]
The rescaled residuals are then defined to be
\[
\hat R_n = \frac{W_n}{\hat\sigma}.
\]
If the fitted model is appropriate, $\hat R_n$ should look like WN(0,1). Here are a few ways of verifying that this is indeed the case:
25.2.1 Checking Normality of Residuals
To check if $(x_i)_{1\le i\le n}$ could be an independent sample from a N(0,1), we have several tools at our disposal. You probably saw how to perform goodness-of-fit tests in your statistics class. Another approach is to consider quantile-quantile plots (also called qq plots):

Assume the data $(x_i)_{1\le i\le n}$ are in increasing order and let $\Phi$ be the standard normal c.d.f. For $1 \le i \le n$, let
\[
p_i = \frac{i}{n+1} \qquad \text{and} \qquad q_{p_i} = \Phi^{-1}(p_i).
\]
The plot of the points $(x_i, q_{p_i})$ is a quantile-quantile plot and plots the data together with the equiprobable quantiles based on the sample size. If the data come from a standard normal distribution, the points in this plot should be close to perfectly lined up.
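This construction is easy to carry out directly. A minimal Python sketch (the notes use R's `qqnorm`; here $\Phi^{-1}$ comes from the standard library's `statistics.NormalDist`, and the five residual values are made up):

```python
from statistics import NormalDist

def qq_points(xs):
    """Return the points (x_(i), q_{p_i}) with p_i = i/(n+1), as above."""
    xs = sorted(xs)
    n = len(xs)
    inv_cdf = NormalDist().inv_cdf   # standard normal quantile function
    return [(x, inv_cdf(i / (n + 1))) for i, x in enumerate(xs, start=1)]

# five hypothetical residuals; for N(0,1) data the points lie near the line y = x
pts = qq_points([0.3, -1.1, 0.8, -0.2, 1.5])
```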
25.2.2 Checking If the Residuals Are White Noise
There are a few methods we’ve seen already to determine if the residuals could be whitenoise. Here is a quick reminder of what they are:
• Straightforward examination of the ACF (for instance, 95% of the autocorrelations should fall within $(-1.96/\sqrt{n}, 1.96/\sqrt{n})$). Note that this is not rigorous and should be used only to give you a general idea of whether you are in the presence of white noise or not.
• The Portmanteau test.
• The turning point test.
Given a model $\hat X_t$ for a time series, R will compute the rescaled residuals if instructed to do so. All that is then left to do to convince ourselves that our model is appropriate is to verify that there is no evidence against the residuals being WN(0,1).
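The portmanteau test in its Ljung-Box form, which R's diagnostic plots report below, is also easy to state concretely: $Q = n(n+2)\sum_{k=1}^h \hat\rho_k^2/(n-k)$, where $\hat\rho_k$ is the sample ACF of the residuals. A Python sketch (illustration only; the helper names are ours, and R's `tsdiag`/`Box.test` do this for you):

```python
def acf_hat(x, k):
    """Sample autocorrelation of x at lag k (mean-corrected)."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x) / n
    ck = sum((x[t] - m) * (x[t + k] - m) for t in range(n - k)) / n
    return ck / c0

def ljung_box(x, h):
    """Ljung-Box statistic Q = n(n+2) sum_{k=1}^h acf_hat(x,k)^2/(n-k).
    Under the white-noise hypothesis, Q is approximately chi-squared,
    which is what the p-values reported by R are based on."""
    n = len(x)
    return n * (n + 2) * sum(acf_hat(x, k) ** 2 / (n - k) for k in range(1, h + 1))
```

Large values of Q (small p-values) are evidence against the residuals being white noise.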
Example 25.2. We will look at a data set containing the Euro/Dollar exchange rates (the value of one $US in Euros) from May 6, 2010 to May 6, 2011 (both dates included). This data set can be obtained at
http://userhome.brooklyn.cuny.edu/cbenes/Euro-Dollar.txt
Note that the data set goes backwards in time when going from top to bottom, so we'll need to invert it. We start by importing this data set:
> www="http://userhome.brooklyn.cuny.edu/cbenes/Euro-Dollar.txt"
> ED=read.table(www,header=T)
Now we make the data set go forward in time as follows:
> ed = c()
> for (i in 1:366) ed[i]=ED[367-i,1]
We first take a look at this data set:
> plot(ed,type="l")
[Plot: the daily value of one $US in Euros, May 6, 2010 - May 6, 2011]
We can then look at the ACF and PACF of this time series:
> acf(ed)
[Plot: the autocorrelation function of the USD/EUR series]
> pacf(ed)
[Plot: the partial autocorrelation function of the series]
The PACF suggests that an AR(2) process may be appropriate. However, the fact that the ACF decays so slowly should raise some skepticism.
Let’s see what happens if we do look for an AR(2) model for this time series:
> ar2=arima(ed,order=c(2,0,0))
> ar2
Call:
arima(x = ed, order = c(2, 0, 0))
Coefficients:
ar1 ar2 intercept
1.3644 -0.3703 0.7399
s.e. 0.0491 0.0493 0.0261
sigma2 estimated as 1.270e-05: log likelihood = 1541.23, aic = -3074.46
We first check if the residuals look normal:
> qqnorm(ar2$resid)
[Plot: Normal Q-Q plot of the residuals (theoretical quantiles vs. sample quantiles)]
The relatively straight quantile-quantile plot suggests that the residuals could indeed be normal.

To determine if this model is appropriate, we use the following command, which gives us a plot of the residuals, of their ACF, and p-values for the Ljung-Box statistic at various lags:
> tsdiag(ar2)
[Plots: standardized residuals, ACF of residuals, and p-values for the Ljung-Box statistic at lags 1-10]
We can also obtain details of the residuals with the following command:
> resid(ar2)
Since the p-values of the Ljung-Box statistic are greater than 5% for all lags shown by R, the diagnostic plot convinces us that the residuals could very well be a realization of white noise, so our model appears to be appropriate. Using the following command to get the standard deviation of the residuals
> sd(ar2$resid)
[1] 0.003567808
we see that the fitted model is
\[
X_t - 0.7399 = 1.3644(X_{t-1} - 0.7399) - 0.3703(X_{t-2} - 0.7399) + 0.003567808 \cdot WN(0,1).
\]
The model we found is adequate, as shown by inspection of the residuals, but the ACF should have led us away from the AR(2) model. Indeed, the ACF of an AR process generally decays exponentially (you know at least that this is the case for AR(1) processes) or is a damped sine wave, i.e., an oscillating function stuck between a positive exponentially decaying function and its negative. Since this is not what we observed in the sample ACF of the Euro-Dollar time series, the AR(2) model is probably wrong (though it was adequate).

When the ACF decays very slowly, differencing is usually recommended, as a slowly decaying ACF is symptomatic of a time series that may not be stationary. Think for instance of random walk, for which we've seen that the ACF decays very slowly while the PACF levels off after lag 1, which is very similar to what we observed in the Euro-Dollar data. While random walk is not a stationary process, if we difference the process, we get white noise, which certainly is stationary. So let's revisit the example.
First, let’s create the lag-1 differenced time series:
> Diffed=diff(ed)
Then let’s try to get some visual information about that series:
> plot(Diffed,type="l")
[Plot: the lag-1 differenced series Diffed]
We can then look at the ACF and PACF of this time series:
> acf(Diffed)
[Plot: the autocorrelation function of Diffed]
> pacf(Diffed)
[Plot: the partial autocorrelation function of Diffed]
From these pictures, it isn't clear that the process is an AR or an MA process, so we look for the best ARMA process among all those with p ≤ 4 and q ≤ 4 (higher values of p or q are generally not advised for prediction purposes):
> m=matrix(0,5,5)
> for (i in 0:4)
+ for (j in 0:4) m[i+1,j+1]=AIC(arima(Diffed,order=c(i,0,j)))
The matrix m then contains the AIC for all ARMA(p, q) models with 0 ≤ p, q ≤ 4, from which we see (check it at home if you don't believe me) that the model with the lowest AIC is an ARMA(2,3) model.
We can now check if the model is adequate:
> arma23=arima(Diffed,order=c(2,0,3))
> qqnorm(arma23$res)
-3 -2 -1 0 1 2 3
-0.010
-0.005
0.000
0.005
0.010
0.015
Normal Q-Q Plot
Theoretical Quantiles
Sam
ple
Qua
ntile
s
The quantile-quantile plot suggests that the data could well be normal.
> tsdiag(arma23)
[Plots: standardized residuals, ACF of residuals, and p-values for the Ljung-Box statistic at lags 1-10]
The ACF of the residuals clearly suggests that the residuals could very well be white noise, so again, the model is adequate. Moreover, the p-values of the Ljung-Box statistic are considerably larger, so we should be more confident that the residuals are white noise than we were for the residuals of the non-differenced series.
Math 4506 (Fall 2019) December 4, 2019 Prof. Christian Benes
Lecture #26: Forecasting
26.1 Exponential Smoothing
Exponential smoothing is a smoothing method which attempts to extract the "true" trend of the time series by assuming that this trend is determined by the entire time series up to any given time, but with less and less weight attached to times that are far back in time. More precisely, for some fixed α ∈ [0, 1], the moving averages $m_t$ are defined as follows:
\[
m_n = \sum_{j=0}^{n-2}\alpha(1-\alpha)^j X_{n-j} + (1-\alpha)^{n-1}X_1.
\]
Note that the weights add up to 1. Indeed,
\[
\sum_{j=0}^{n-2}\alpha(1-\alpha)^j + (1-\alpha)^{n-1} = \alpha\Biggl(\frac{1}{1-(1-\alpha)} - \frac{(1-\alpha)^{n-1}}{1-(1-\alpha)}\Biggr) + (1-\alpha)^{n-1} = 1.
\]
Note also that if α is close to 1, then $m_n$ is basically equal to $X_n$, and if α is close to 0, then $m_n$ is basically a weighted mean of all values of the time series up to the present.

Note that $m_{n-1} = \sum_{j=1}^{n-2}\alpha(1-\alpha)^{j-1}X_{n-j} + (1-\alpha)^{n-2}X_1$, so
\[
m_n = (1-\alpha)m_{n-1} + \alpha X_n.
\]
The h-step predictor based on exponential smoothing is
\[
P_nX_{n+h} = m_n.
\]
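The recursion and the weighted-sum definition above describe the same sequence, which is easy to check numerically. A short Python sketch with made-up data (the actual fitting below is done with R's HoltWinters; the function names here are ours):

```python
def exp_smooth(x, alpha):
    """m_1 = X_1, then m_n = (1 - alpha) m_{n-1} + alpha X_n."""
    m = [x[0]]
    for xn in x[1:]:
        m.append((1 - alpha) * m[-1] + alpha * xn)
    return m

def exp_smooth_direct(x, alpha):
    """m_n from the weighted sum:
    sum_{j=0}^{n-2} alpha (1-alpha)^j X_{n-j} + (1-alpha)^{n-1} X_1."""
    n = len(x)
    return sum(alpha * (1 - alpha) ** j * x[n - 1 - j] for j in range(n - 1)) \
        + (1 - alpha) ** (n - 1) * x[0]

x = [3.0, 5.0, 4.0, 6.0, 5.5]   # hypothetical data
m = exp_smooth(x, 0.3)          # m[-1] is the h-step predictor for every h
```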
Exponential smoothing doesn't assume anything about the shape and form of the underlying trend and possible cyclical behavior of the time series. It therefore lends itself particularly well to time series for which we don't have any a priori information regarding any possible cyclical behavior or clearly-defined trend. We will see later how to handle a trend or periodic component by extending the idea of exponential smoothing, which, together with its extensions, is also called the Holt-Winters method.
We look at the Lake Huron levels time series.
Example 26.1. You are free to choose the value of the parameter α, but if you don't, R will do it in such a way that it minimizes the sum of the squares of the one-step prediction errors
\[
\sum_{i=1}^{n-1}(m_i - X_i)^2.
\]
Exponential smoothing is a particular case of the Holt-Winters method, which contains two additional parameters (beta and gamma). For simple exponential smoothing, these parameters should be set to FALSE:
> data(LakeHuron)
> x=LakeHuron
> Huron.hw = HoltWinters(x,beta=FALSE,gamma=FALSE)
> PRED=predict(Huron.hw,n=5)
> plot(x)
> lines(PRED,col="red")
This gives the following picture. As expected from the definition above, the predictions for all future lags are the same:

[Plot: the LakeHuron series with the constant predicted values appended in red]
26.2 Double Exponential Smoothing
We revisit exponential smoothing by looking for a predictor that is linear (rather than constant) in h. Define
\[
a_1 = b_1 = X_1
\]
and, for n > 1,
\[
a_n = (1-\alpha)(a_{n-1} + b_{n-1}) + \alpha X_n, \qquad b_n = (1-\beta)b_{n-1} + \beta(a_n - a_{n-1}).
\]
We then define the lag-h predictor by
\[
P_nX_{n+h} = a_n + hb_n.
\]
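This recursion transcribes directly into a few lines. A Python sketch with hypothetical data (R's HoltWinters with gamma=FALSE is what the example below actually uses):

```python
def double_exp_smooth(x, alpha, beta):
    """a_1 = b_1 = X_1; then a_n = (1-alpha)(a_{n-1}+b_{n-1}) + alpha X_n
    and b_n = (1-beta) b_{n-1} + beta (a_n - a_{n-1}).  Returns (a_n, b_n)."""
    a = b = x[0]
    for xn in x[1:]:
        a_prev = a
        a = (1 - alpha) * (a + b) + alpha * xn
        b = (1 - beta) * b + beta * (a - a_prev)
    return a, b

# on an exactly linear series the method locks onto the trend
a, b = double_exp_smooth([1.0, 2.0, 3.0, 4.0], alpha=0.5, beta=0.5)
forecast = a + 3 * b   # the lag-3 predictor P_n X_{n+3} = a_n + 3 b_n
```

Unlike simple exponential smoothing, the forecast now changes with h: it follows the line $a_n + hb_n$.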
We now revisit the example above:
Example 26.2. Again, R will choose α and β in such a way that it minimizes the sum of the squares of the one-step prediction errors
\[
\sum_{i=1}^{n-1}(m_i - X_i)^2.
\]
For double exponential smoothing, gamma should be set to FALSE:
> data(LakeHuron)
> x=LakeHuron
> Huron.hw = HoltWinters(x,gamma=FALSE)
> PRED=predict(Huron.hw,n=5)
> plot(x)
> lines(PRED,col="red")
This gives the following picture. This time, as expected from the definition above, the predicted values lie on a line:

[Plot: the LakeHuron series with the linearly increasing predicted values appended in red]
We can find the exact values of the predictions by typing
> PRED
Time Series:
Start = 1973
End = 1977
Frequency = 1
fit
[1,] 580.1901
[2,] 580.4201
[3,] 580.6502
[4,] 580.8802
[5,] 581.1103
26.3 Fitting and Predicting
In this example we will see how to make predictions using an ARMA fit for a data set:
> data(LakeHuron)
> x=LakeHuron
> m=matrix(0,6,6)
> for (i in 0:5) for (j in 0:5) m[i+1,j+1]=AIC(arima(LakeHuron,order=c(i,0,j)))
> m
         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
[1,] 335.2698 255.2950 230.9306 222.1263 222.5113 222.6902
[2,] 219.1959 214.4905 216.4645 217.8882 219.3345 221.3152
[3,] 215.2664 216.4764 218.4106 219.5158 220.3386 222.1918
[4,] 216.0377 217.8048 219.6967 221.1937 222.1703 224.1686
[5,] 217.6237 219.2071 220.4332 221.8397 223.2962 224.9742
[6,] 219.5631 220.3120 222.3383 223.2915 225.2813 226.9103
This suggests an ARMA(1,1) model for this data set:
> fit=arima(x,order=c(1,0,1))
> fit
Call:
arima(x = x, order = c(1, 0, 1))
Coefficients:
ar1 ma1 intercept
-0.4925 0.3854 0.0012
s.e. 0.0586 0.0619 0.0093
sigma2 estimated as 1.003: log likelihood = -14204.51, aic = 28417.01
Now let’s use this model to predict the value for the first 5 years following the data set:
> PREDICT=predict(fit,n.ahead=5)
> PREDICT
$pred
Time Series:
Start = 10001
End = 10005
Frequency = 1
[1] -0.30418498 0.15160347 -0.07287637 0.03768193 -0.01676901
$se
Time Series:
Start = 10001
End = 10005
Frequency = 1
[1] 1.001512 1.007245 1.008630 1.008966 1.009048
Finally, let's plot the time series with the 5 predicted values and 95% confidence intervals for the Lake Huron levels for 1981 to 1985:
> plot(x,xlim=c(1875,1985))
> lines(PREDICT$pred,col="red")
> lines(PREDICT$pred+1.96*PREDICT$se,col="red",lty=3)
> lines(PREDICT$pred-1.96*PREDICT$se,col="red",lty=3)
> lines(PREDICT$pred+1.645*PREDICT$se,col="red",lty=3)
> lines(PREDICT$pred-1.645*PREDICT$se,col="red",lty=3)
This gives the predicted values up to 5 steps into the future with the 95% and 90% confidence bands (the dotted red lines):
[Plot: the time series with the 5 predicted values and the confidence bands]
Math 4506 (Fall 2019) December 9, 2019 Prof. Christian Benes
Lecture #27: Two Models Incorporating a Periodic Component: Holt-Winters and ARMA
27.1 Periodic Components
Definition 27.1. A function f : R→ R has period d if for every x ∈ R,
f(x+ d) = f(x).
One nice thing about periodic functions is that if we add functions of period d, we end up with another function of period d. Indeed, if f and g are periodic with period d,
\[
(f+g)(x+d) = f(x+d) + g(x+d) \overset{f,g \text{ periodic}}{=} f(x) + g(x) = (f+g)(x),
\]
so f + g is periodic.
The most natural candidates for periodic functions of period d are $\sin(\frac{2\pi t}{d})$ and $\cos(\frac{2\pi t}{d})$, but of course functions of period d/k, for k ∈ ℕ, are also periodic with period d, so $\sin(\frac{4\pi t}{d})$ and $\cos(\frac{4\pi t}{d})$ are candidates as well. In fact, $\sin(\frac{2k\pi t}{d})$ and $\cos(\frac{2k\pi t}{d})$ are possible functions, and we may wish to consider all those with 2k ≤ d (if 2k > d, then the period of the sine function is less than a unit of time, which will yield useless information).
27.2 Holt-Winters Method
We revisit exponential smoothing by looking for a predictor that is linear in h and with a periodic component of period p. The process is again recursive and requires initial values, which can be defined in a number of reasonable ways. For instance, we can define
\[
a_{p+1} = X_{p+1}, \qquad b_{p+1} = \frac{X_{p+1} - X_1}{p},
\]
and, for $i = 1, \dots, p+1$,
\[
s_i = X_i - X_1 - (i-1)b_{p+1}.
\]
For $n > p+1$, let
\[
a_n = (1-\alpha)(a_{n-1} + b_{n-1}) + \alpha(X_n - s_{n-p}),
\]
\[
b_n = (1-\beta)b_{n-1} + \beta(a_n - a_{n-1}),
\]
\[
s_n = (1-\gamma)s_{n-p} + \gamma(X_n - a_n).
\]
We then define the lag-h predictor by
\[
P_nX_{n+h} = a_n + hb_n + s_{n-p+1+((h-1) \bmod p)}.
\]
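The full recursion, including the initialization above, can be transcribed the same way. A Python sketch (illustration only; the code is 0-indexed, so x[i] plays the role of $X_{i+1}$; in practice R's HoltWinters does all of this, including choosing α, β, γ):

```python
def holt_winters(x, p, alpha, beta, gamma):
    """Additive Holt-Winters with period p, initialized as above.
    Returns (a_n, b_n, s), where s[i] holds the seasonal index s_{i+1}."""
    n = len(x)
    a = x[p]                                         # a_{p+1} = X_{p+1}
    b = (x[p] - x[0]) / p                            # b_{p+1} = (X_{p+1} - X_1)/p
    s = [x[i] - x[0] - i * b for i in range(p + 1)]  # s_i, i = 1, ..., p+1
    for t in range(p + 1, n):
        a_prev = a
        a = (1 - alpha) * (a + b) + alpha * (x[t] - s[t - p])
        b = (1 - beta) * b + beta * (a - a_prev)
        s.append((1 - gamma) * s[t - p] + gamma * (x[t] - a))
    return a, b, s

def hw_predict(a, b, s, n, p, h):
    # P_n X_{n+h} = a_n + h b_n + s_{n-p+1+((h-1) mod p)}  (s is 0-indexed here)
    return a + h * b + s[n - p + ((h - 1) % p)]

# a purely linear toy series: the seasonal indices stay at 0
# and the forecast continues the line
a, b, s = holt_winters([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], p=2,
                       alpha=0.4, beta=0.3, gamma=0.2)
```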
Example 27.1. We now revisit the airline passenger model. The time series has a natural period of 12 (in months), since we have monthly data and it is reasonable to assume that passenger numbers follow annual cycles.
> AP=AirPassengers
> LAP=log(AP)
> LAP.hw=HoltWinters(LAP)
> LAP.hw
Holt-Winters exponential smoothing with trend and additive seasonal component.
Call:
HoltWinters(x = LAP)
Smoothing parameters:
alpha: 0.3266015
beta : 0.005744138
gamma: 0.8206654
Coefficients:
[,1]
a 6.172308435
b 0.008981893
s1 -0.073201087
s2 -0.140973564
s3 -0.036703294
s4 0.014522733
s5 0.032554237
s6 0.154873570
s7 0.294317062
s8 0.276063997
s9 0.088237657
s10 -0.032657089
s11 -0.198012716
s12 -0.102863837
Let’s predict the next 4 years for the time series:
> PRED=predict(LAP.hw,n=48)
The following command gives some space to the plot for predictions to be added:
> plot(LAP,xlim=c(1949,1965),ylim=c(4.5,7))
[Plot: the log AirPassengers series, with the axes extended to leave room for the predictions]
> lines(PRED,col="red")
[Plot: the log AirPassengers series with four years of Holt-Winters predictions in red]
Note 27.1. In the example above, R already knew that the period of the data was 12. In general, to use the Holt-Winters method on any given data set, you will first need to specify the frequency using, for instance, the command "A=ts(data,frequency=12)" if your data set is called "data" and you wish to call the time series with period 12 "A". You can then perform your analysis on the data set A.
27.3 A Complete ARMA-based Model with Periodic Component
Example 27.2. We now revisit the airline passenger model, finding an ARMA model with a seasonal component. We are looking for a periodic function of period 1 (in years) or 12 (in months) that would fit the data well. There are many such functions.

So in our problem, we will consider $\sin(\frac{2k\pi t}{d})$ and $\cos(\frac{2k\pi t}{d})$ with d = 12 and k = 1, ..., 6:
> AP=AirPassengers
> LAP=log(AP)
> t=time(LAP)
> t2=t^2
> COS=SIN=matrix(nr=length(AP),nc=6)
> for (i in 1:6) {
+ SIN[,i]=sin(2*pi*i*t)
+ COS[,i]=cos(2*pi*i*t)
+ }

We can now try a least squares fit with a quadratic trend and a periodic component of increasing complexity:
> LAP.lm11=lm(LAP~t+t2+COS[,1]+SIN[,1])
> plot(LAP)
> T = c()
> for (i in 1:144) T[i]=1949+(i-1)/12
> lines(T,LAP.lm11$fit,col="red")
[Plot: log air passengers with the fitted curve LAP.lm11 in red]
> LAP.lm12=lm(LAP~t+t2+COS[,1]+SIN[,1]+COS[,2]+SIN[,2])
> lines(T,LAP.lm12$fit,col="green")
[Plot: log air passengers with the fitted curve LAP.lm12 in green]
> LAP.lm13=lm(LAP~t+t2+COS[,1]+SIN[,1]+COS[,2]+SIN[,2]+COS[,3]+SIN[,3])
> lines(T,LAP.lm13$fit,col="blue")
We can regress on three more curves
> LAP.lm14=lm(LAP~t+t2+COS[,1]+SIN[,1]+COS[,2]+SIN[,2]+COS[,3]+SIN[,3]+COS[,4]+SIN[,4])
> LAP.lm15=lm(LAP~t+t2+COS[,1]+SIN[,1]+COS[,2]+SIN[,2]+COS[,3]+SIN[,3]+COS[,4]+SIN[,4]
+COS[,5]+SIN[,5])
> LAP.lm16=lm(LAP~t+t2+COS[,1]+SIN[,1]+COS[,2]+SIN[,2]+COS[,3]+SIN[,3]+COS[,4]+SIN[,4]
+COS[,5]+SIN[,5]+COS[,6])
(note that SIN[,6] is not included as it would only yield values of zero) and get the following (optimal among those we've tried, since it has the largest number of parameters) curve
> plot(LAP)
> lines(T,LAP.lm16$fit,col="red")
[Plot: log air passengers with the fitted curve LAP.lm16 in red]
To know what our regression curve is, we type
> coef(LAP.lm16)
(Intercept) t t2 COS[, 1] SIN[, 1] COS[, 2]
-1.205314e+04 1.221572e+01 -3.093389e-03 -1.471879e-01 2.807718e-02 5.679671e-02
SIN[, 2] COS[, 3] SIN[, 3] COS[, 4] SIN[, 4] COS[, 5]
5.905909e-02 -8.709331e-03 -2.731366e-02 1.111352e-02 -3.199814e-02 5.909835e-03
SIN[, 5] COS[, 6]
-2.126938e-02 -2.936203e-03
Note 27.2. We are omitting the 6th sine component since SIN[,6] = $\sin(12\pi t)$ vanishes at every monthly observation time t = 1949 + k/12, k = 0, 1, 2, ...
We now check if the residuals could be modeled by a stationary time series:
> acf(LAP.lm16$res)
[Plot: the ACF of the residuals LAP.lm16$res]
> pacf(LAP.lm16$res)
[Plot: the PACF of the residuals LAP.lm16$res]
The ACF and PACF suggest that the residuals from our least-squares fit could be modeled by an AR(1) process. We now check if this impression is validated:
> LAP.ar=arima(resid(LAP.lm16),order=c(1,0,0))
Call:
arima(x = resid(LAP.lm16), order = c(1, 0, 0))
Coefficients:
ar1 intercept
0.6732 0.0006
s.e. 0.0612 0.0085
sigma^2 estimated as 0.001144: log likelihood = 283.07, aic = -560.14
We look at the residuals of our AR model (note that we include "[-1]" in our command as the residuals of the AR model are undefined at the first time) and get
> acf(LAP.ar$res[-1])
[Plot: the ACF of LAP.ar$res[-1]]
> tsdiag(LAP.ar)
[Plots: standardized residuals, ACF of residuals, and p-values for the Ljung-Box statistic at lags 1-10]
Since the residuals of the random component could be white noise, we have found an adequate model for the Air Passenger time series $X_t$:
lnXt = −12053 + 12.22t− 0.0031t2 − 0.1472 cos(2πt) + 0.028 sin(2πt) + 0.0568 cos(4πt)
+ 0.059 sin(4πt)− 0.0087 cos(6πt)− 0.0273 sin(6πt) + 0.0111 cos(8πt)− 0.032 sin(8πt)
+ 0.0059 cos(10πt)− 0.0213 sin(10πt)− 0.0029 cos(12πt) + Yt,
where Yt is a mean 0.0006 AR(1) process with φ = 0.6732.
Now this can be used to predict $X_t$ at future times, by just plugging t into the least squares curve (which is non-random, so there's nothing to predict there) and by performing a prediction for $Y_t$, which is something we know how to do since $Y_t$ is an AR process.
Math 4506 (Fall 2019) December 11, 2019 Prof. Christian Benes
Lecture #28: Q&A