
Stationary stochastic processes, parts of Chapters 2 and 6

Georg Lindgren, Holger Rootzén, and Maria Sandsten

Question marks indicate references to other parts of the book. Comments and plots regarding spectral densities are not supposed to be understood.


Chapter 1

Stationary processes

1.1 Introduction

In Section 1.2, we introduce the moment functions: the mean value function, which is the expected process value as a function of time t, and the covariance function, which is the covariance between process values at times s and t. We recall some simple rules for expectations and covariances, for example that covariances are linear in both arguments. We also give many examples of how the mean value and covariance functions should be interpreted.

The main focus is on processes whose statistical properties do not change with time – they are (statistically) stationary. Strict stationarity and weak stationarity are defined.

A dynamical system, for example a linear system, is often described by a set of state variables, which summarize all important properties of the system at time t, and which change with time under the influence of some environmental variables. Often the variables are random, and then they must be modeled as a stochastic process. State variables are further dealt with in Chapter ??.

The statistical problem of how to find good models for a random phenomenon is also dealt with in this chapter, in particular how one should estimate the mean value function and covariance function from data. The dependence between different process values needs to be taken into account when constructing confidence intervals and testing hypotheses.

1.2 Moment functions

The statistical properties of a stochastic process {X(t), t ∈ T} are determined by the distribution functions. Expectation and standard deviation capture two important properties of the marginal distribution of X(t), and for a stochastic process these may be functions of time. To describe the time dynamics of the sample functions, we also need some simple measures of the dependence over time. The statistical definitions are simple, but the practical interpretation can be complicated. We illustrate this by the simple concepts of “average temperature” and “day-to-day” correlation.


Figure 1.1: Daily average temperature in Målilla during January, for 1988–1997. The fat curves mark the years 1992 and 1997.

Example 1.1. (“Daily temperature”) Figure 1.1 shows plots of the temperature in the small Swedish village of Målilla, averaged over each day, during the month of January for the ten years 1988–1997. Obviously, there have been large variations between years, and it has been rather cold for several days in a row.

The global circulation is known to be a very chaotic system, and it is hard to predict the weather more than a few days ahead. However, modern weather forecasting has adopted a statistical approach in the predictions, alongside the computer-intensive numerical methods which form the basis for all weather forecasts. Nature is regarded as a stochastic weather generator, where the distributions depend on geographical location, time of the year, etc., and with strong dependence from day to day. One can very well imagine that the data in the figure are the results of such a “weather roulette”, which for each year decides on the dominant weather systems, and on the day-to-day variation. With the statistical approach, we can think of the ten years of data as observations of a stochastic process X1, . . . , X31. The mean value function is m(t) = E[Xt]. Since there is no theoretical reason to assume any particular values for the expected temperatures, one has to rely on historical data. In meteorology, the observed mean temperature during a 30-year period is often used as a standard.

The covariance structure in the temperature series can also be analyzed from the data. Figure 1.2 illustrates the dependence between the temperatures from one day to the next. For each of the nine years 1988–1996 we show to the left scatter


Figure 1.2: Scatter plots of temperatures for years 1988–1996 for two successive days (left plot) and two days, five days apart (right plot). One can see a weak similarity between temperatures for adjacent days, but it is hard to see any connection with five days' separation.

plots of the pairs (Xt, Xt+1), i.e., with the temperature one day on the horizontal axis and the temperature the next day on the vertical axis. There seems to be a weak dependence: two successive days are correlated. To the right we have similar scatter plots, but now with five days' separation, i.e., the data are (Xt, Xt+5). There is almost no correlation between two days that are five days apart. □

1.2.1 Definitions

We now introduce the basic statistical measures of average and correlation. Let {X(t), t ∈ T} be a real-valued stochastic process with discrete or continuous time.

Definition 1.1 For any stochastic process, the first and second order moment functions are defined as

m(t) = E[X(t)]   mean value function (mvf)
v(t) = V[X(t)]   variance function (vf)
r(s, t) = C[X(s), X(t)]   covariance function (cvf)
b(s, t) = E[X(s)X(t)]   second-moment function
ρ(s, t) = ρ[X(s), X(t)]   correlation function

There are some simple relations between these functions:

r(t, t) = C[X(t), X(t)] = V[X(t)] = v(t),

r(s, t) = b(s, t) − m(s)m(t),

ρ(s, t) = C[X(s), X(t)] / √(V[X(s)] V[X(t)]) = r(s, t) / √(r(s, s) r(t, t)).
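To make the definitions concrete, here is a minimal sketch (not from the book; the sinusoid-plus-noise process and all variable names are illustrative assumptions) of how the moment functions in Definition 1.1 can be estimated in Python from an ensemble of independent realizations, and how the relations above can be checked numerically.

import numpy as np

# Hypothetical ensemble: 2000 independent realizations of a process observed at 50 times.
rng = np.random.default_rng(0)
n_real, n_times = 2000, 50
t = np.arange(n_times)
x = np.sin(0.2 * t) + rng.normal(size=(n_real, n_times))   # x[i, t] = realization i at time t

m_hat = x.mean(axis=0)                              # mean value function m(t)
v_hat = x.var(axis=0)                               # variance function v(t)
xc = x - m_hat                                      # mean-corrected realizations
r_hat = xc.T @ xc / n_real                          # covariance function r(s, t)
b_hat = x.T @ x / n_real                            # second-moment function b(s, t)
rho_hat = r_hat / np.sqrt(np.outer(v_hat, v_hat))   # correlation function rho(s, t)

# The simple relations above hold for the estimates as well:
assert np.allclose(np.diag(r_hat), v_hat)
assert np.allclose(r_hat, b_hat - np.outer(m_hat, m_hat))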


These functions provide essential information about the process. The meanings of the mean value and variance functions are intuitively clear and easy to understand. For example, the mean value function describes how the expected value changes with time, just as we expect colder weather during winter months than during summer. The (square root of the) variance function tells us what magnitude of fluctuations we can expect. The covariance function has no such immediate interpretation, even if its statistical meaning is clear enough as a covariance. For example, in the ocean wave example, Example ??, the covariance r(s, s+5) is negative and r(s, s+10) is positive, corresponding to the fact that measurements five seconds apart often fall on opposite sides of the mean level, while measurements ten seconds apart often are on the same side. Simply stated, the covariance function measures the similarity between observations as a function of the times of measurement.

If there is more than one stochastic process in a study, one can distinguish the moment functions by indexing them, as mX, rX, etc. A complete name for the covariance function is then the auto-covariance function, to distinguish it from a cross-covariance function. In Chapter ??, we will investigate this measure of co-variation between two stochastic processes.

Definition 1.2 The function

rX,Y (s, t) = C[X(s), Y (t)] = E[X(s)Y (t)]−mX(s)mY (t),

is called the cross-covariance function between {X(t), t ∈ T} and {Y (t), t ∈ T}.

1.2.2 Simple properties and rules

The first and second order moment functions are linear and bi-linear, respectively. We formulate the following generalization of the rules E[aX + bY ] = aE[X] + bE[Y ] and V[aX + bY ] = a²V[X] + b²V[Y ], the latter of which holds for uncorrelated random variables X and Y.

Theorem 1.1. Let a1, . . . , ak and b1, . . . , bl be real constants, and let X1, . . . , Xk and Y1, . . . , Yl be random variables in the same experiment, i.e., defined on a common sample space. Then

E[∑_{i=1}^{k} a_i X_i] = ∑_{i=1}^{k} a_i E[X_i],

V[∑_{i=1}^{k} a_i X_i] = ∑_{i=1}^{k} ∑_{j=1}^{k} a_i a_j C[X_i, X_j],

C[∑_{i=1}^{k} a_i X_i, ∑_{j=1}^{l} b_j Y_j] = ∑_{i=1}^{k} ∑_{j=1}^{l} a_i b_j C[X_i, Y_j].

The rule for the covariance between sums of random variables, C[∑ a_i X_i, ∑ b_j Y_j], is easy to remember and use: the total covariance between two sums is a double sum


of all covariances between pairs of one term from the first sum and one term from the second sum.
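The rule is easy to verify numerically. The sketch below (my own illustration; the coefficients and the correlated normal variables are arbitrary assumptions) checks the bilinearity formula of Theorem 1.1 on simulated data; note that the identity holds exactly for sample covariances too, since they are themselves bilinear.

import numpy as np

rng = np.random.default_rng(1)
a = np.array([1.0, -2.0, 0.5])                      # coefficients for X1, X2, X3
b = np.array([3.0, 1.0])                            # coefficients for Y1, Y2
cov = np.eye(5) + 0.4                               # a positive definite 5x5 covariance
z = rng.multivariate_normal(np.zeros(5), cov, size=100_000)
X, Y = z[:, :3], z[:, 3:]                           # three X's and two Y's, mutually correlated

lhs = np.cov(X @ a, Y @ b)[0, 1]                    # C[sum a_i X_i, sum b_j Y_j]
S = np.cov(z, rowvar=False)                         # full 5x5 sample covariance matrix
rhs = a @ S[:3, 3:] @ b                             # sum_i sum_j a_i b_j C[X_i, Y_j]
print(lhs, rhs)                                     # identical up to rounding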

Remember that two independent random variables X and Y are always uncorrelated, i.e., C[X, Y ] = 0, but the reverse does not necessarily hold; even if C[X, Y ] = 0, there can be a strong dependence between X and Y.

We finish the section with some examples of covariance calculations.

Example 1.2. Assume X1 and X2 to be independent random variables and define a new variable

Z = X1 − 2X2.

The variance of Z is then, since C[X1, X2] = 0,

V[Z] = V[X1 − 2X2] = C[X1 − 2X2, X1 − 2X2]

= C[X1, X1]− 2C[X1, X2]− 2C[X2, X1] + 4C[X2, X2] = V[X1] + 4V[X2].

We also calculate the variance for the variable Y = X1 − 3:

V[Y ] = V[X1 − 3] = C[X1 − 3, X1 − 3]

= C[X1, X1]− C[X1, 3]− C[3, X1] + C[3, 3] = V[X1],

i.e., the same as for X1, which is clear since Y differs from X1 only by a constant, and adding a constant does not affect the variance of a random variable. □

Example 1.3. From a sequence {Ut} of independent random variables with mean zero and variance σ², we construct a new process {Xt} by

Xt = Ut + 0.5 · Ut−1.

This is a “moving average” process, which is the topic of Chapter 2. By means of Theorem 1.1, we can calculate its mean value and covariance function. Of course, m(t) = E[Xt] = E[Ut + 0.5 · Ut−1] = 0. For the covariance function, we have to work harder, and to keep the computations under control, we do separate calculations according to the size of t − s. First, take s = t,

r(t, t) = V[Xt] = V[Ut + 0.5 · Ut−1]
        = V[Ut] + 0.5 · C[Ut−1, Ut] + 0.5 · C[Ut, Ut−1] + 0.5² · V[Ut−1]
        = σ² + 0 + 0 + 0.25σ² = 1.25σ²,

where we used that V[Ut] = V[Ut−1] = σ², and that Ut and Ut−1 are independent, so C[Ut, Ut−1] = C[Ut−1, Ut] = 0. For s = t + 1 we get

r(s, t) = C[Ut+1 + 0.5 · Ut, Ut + 0.5 · Ut−1]
        = C[Ut+1, Ut] + 0.5 · C[Ut+1, Ut−1] + 0.5 · C[Ut, Ut] + 0.5² · C[Ut, Ut−1]
        = 0 + 0 + 0.5 · V[Ut] + 0
        = 0.5σ².


The case s = t − 1 gives the same result, and for s ≥ t + 2 or s ≤ t − 2 one easily finds that r(s, t) = 0. Process values X(s) and X(t) with time separation |s − t| greater than 1 are therefore uncorrelated (they are even independent). All moving average processes share the common property that they have a finite correlation time: beyond some time lag the correlation is exactly zero. □
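A small simulation sketch (my own check, not part of the book; the value σ² = 2 and the sample size are arbitrary assumptions) of the covariances computed in Example 1.3: the sample covariances of the simulated MA(1) series should come out close to 1.25σ², 0.5σ² and 0.

import numpy as np

rng = np.random.default_rng(2)
sigma2 = 2.0
n = 200_000
u = rng.normal(scale=np.sqrt(sigma2), size=n + 1)   # independent U_t, mean 0, variance sigma2
x = u[1:] + 0.5 * u[:-1]                            # X_t = U_t + 0.5 U_{t-1}

def sample_cov(x, tau):
    """Estimate r(tau) = C[X_t, X_{t+tau}] for a zero-mean stationary series."""
    return np.mean(x[:len(x) - tau] * x[tau:])

print(sample_cov(x, 0), 1.25 * sigma2)              # ~ 2.5
print(sample_cov(x, 1), 0.50 * sigma2)              # ~ 1.0
print(sample_cov(x, 2), 0.0)                        # ~ 0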

Example 1.4. (“Independent, stationary increments”) Processes with independent, stationary increments (see Section ??, page ??) have particularly simple expectation and covariance functions, as we shall now see.

Let the process {X(t), t ≥ 0} start at X(0) = 0 and have independent, stationary increments, with finite variance. Thus, the distribution of the change X(s + t) − X(s) over an interval (s, s + t] depends only on the interval length t and not on its location. In particular, X(s + t) − X(s) has the same distribution as X(t) = X(t) − X(0), and also the same mean m(t) and variance v(t). We first show that both m(t) and v(t) are proportional to the interval length t.

Since E[X(s+ t)−X(s)] = E[X(t)], one has

m(s + t) = E[X(s+ t)] = E[X(s)] + E[X(s+ t)−X(s)] = m(s) +m(t),

which means that for s, t ≥ 0, the mean function is a solution to the equation

m(s + t) = m(s) +m(t),

which is known as Cauchy’s functional equation. If we now look only for continuous solutions to the equation, it is easy to argue that m(t) is of the form (note that m(0) = 0)

m(t) = E[X(t)] = k1 · t, t ≥ 0,

for some constant k1 = m(1). (The reader could prove this by first taking t = 1/q, with integer q, and then t = p/q for integer p.)

The variance has a similar form, which follows from the independence and stationarity of the increments. For s, t ≥ 0, we write X(s + t) as the sum of two increments, Y = X(s) = X(s) − X(0) and Z = X(s + t) − X(s). Then Y has variance V[Y ] = V[X(s)] = v(s), by definition. The second increment is over an interval of length t, and since the distribution of an increment depends only on the interval length and not on its location, Z has the same distribution as X(t), and hence V[Z] = V[X(t)] = v(t). Thus,

v(s+ t) = V[X(s + t)] = V[Y + Z] = V[Y ] + V[Z] = v(s) + v(t).

As before, the only continuous solution is

v(t) = V[X(t)] = k2 · t (t ≥ 0)

for some constant k2 = V[X(1)] ≥ 0. Thus, we have shown that both the mean and variance functions are proportional to the interval length.


Finally, we turn to the covariance function. First take the case s ≤ t. Then we can split X(t) as the sum of X(s) and the increment from s to t, and get

r(s, t) = C[X(s), X(t)] = C[X(s), X(s) + (X(t)−X(s))]

= C[X(s), X(s)] + 0 = V[X(s)] = k2 · s.

For s > t, we just interchange t and s, and realize that it is the minimum of the two times that determines the covariance: r(s, t) = k2 · t.

Summing up: the covariance function for a process with stationary, independent increments starting at X(0) = 0 is

r(s, t) = V[X(1)] ·min(s, t).

Besides the Poisson process, we will meet the Wiener process as an important example of a process of this type; see Section ??. □
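As a numerical illustration (a sketch under my own assumptions: Gaussian increments with drift 0.3 and unit variance, so that k1 = 0.3 and k2 = 1), a random walk started at 0 has independent, stationary increments, and its sample mean and covariance should behave as derived in Example 1.4.

import numpy as np

rng = np.random.default_rng(3)
n_paths, n_steps = 50_000, 30
increments = rng.normal(loc=0.3, scale=1.0, size=(n_paths, n_steps))
x = np.cumsum(increments, axis=1)                   # X(1), ..., X(30), with X(0) = 0

s, t = 8, 20                                        # two integer time points, s < t
print(x[:, s - 1].mean(), 0.3 * s)                  # mean function: k1 * s = 2.4
print(np.cov(x[:, s - 1], x[:, t - 1])[0, 1], min(s, t))   # covariance: k2 * min(s, t) = 8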

Remark 1.1. We have defined the second-moment function as b(s, t) = E[X(s)X(t)], as distinguished from the covariance function, r(s, t) = b(s, t) − m(s)m(t). In the signal processing literature, b(s, t) is often called the auto-correlation function. Since, in statistics, the technical term correlation is reserved for the normalized covariance, we will not use that terminology. In daily language, correlation means just “co-variation”.

A practical drawback with the second-moment function is that the definition does not correct for a non-zero mean value. For signals with non-zero mean, b(s, t) may be dominated by the mean value product, hiding the interesting dependence structure. The covariance function is always equal to the second-moment function of the mean value corrected series. □

1.2.3 Interpretation of moments and moment functions

Expectation and the mean function

The mean function m(t) of a stochastic process {X(t), t ∈ T} is defined as the expected value of X(t) as a function of time. From the law of large numbers in probability theory we know the precise meaning of this statement: over many independent repetitions of the experiment, i.e., many independent observations of the random X(t), the arithmetic mean (i.e., the average) of the observations tends to be close to m(t). But as its name suggests, one would also like to interpret it in another way: the mean function should say something about the average of the realization x(t) over time. These two meanings of the word “average” constitute one of the subtle difficulties in the applications of stochastic process theory – later we shall try to throw some light upon the problem when we discuss how to estimate the mean value function, and introduce the concept of ergodicity in Section 1.5.2.


Figure 1.3: Measurements of EEG from two different channels, and a scatter plot of the two signals.

Correlation and covariance function

The second order moment functions, the covariance and correlation functions, measure the degree of correlation between process values at different times, r(s, t) = C[X(s), X(t)], ρ(s, t) = ρ[X(s), X(t)].

First, we discuss the correlation coefficient, ρ = ρ[X, Y ], between two random variables X and Y with positive variance:

ρ = C[X, Y ] / √(V[X] V[Y ]) = E[(X − mX)(Y − mY )] / √(V[X] V[Y ]).

The correlation coefficient is a dimensionless constant that remains unchanged after a change of scale: for constants a > 0, c > 0, b, d,

ρ(aX + b, cY + d) = ρ(X, Y ).

Further, it is easy to see that it always lies between −1 and +1. To see this, use Theorem 1.1 to find the variance of X − λY for the special choice λ = C[X, Y ]/V[Y ] = ρ√(V[X]/V[Y ]). Since a variance is always non-negative, we have

0 ≤ V[X − λY ] = V[X] − 2λC[X, Y ] + λ²V[Y ] = V[X](1 − ρ²),    (1.1)

which is possible only if −1 ≤ ρ ≤ 1.

Example 1.5. (“ElectroEncephaloGram, EEG”) The ElectroEncephaloGram is the graphic representation of spontaneous brain activity measured with electrodes attached to the scalp. Usually, EEG is measured from several channels at different positions on the head. Channels at nearby positions will be heavily correlated, which can be seen in Figure 1.3, where the curves have a very similar appearance. They are, however, not exactly the same, and using the samples as observations of two different stochastic processes, an estimate of the correlation coefficient between X(t) and


Y (t) will be ρ∗ ≈ 0.9, i.e., rather close to one. Figure 1.3 also shows a scatter plot of the two signals, where the strong correlation is seen in that the samples are distributed close to a straight line. □

The covariance E[(X − mX)(Y − mY )] measures the degree of linear covariation. If there is a tendency for observations of X and Y to be either both large or both small, compared to their expected values, then the product (X − mX)(Y − mY ) is more often positive than negative, and the correlation is positive. If, on the other hand, large values of X often occur together with small values of Y, and vice versa, then the product is more often negative than positive, and the correlation is negative. From (1.1) we see that if the correlation coefficient is +1 or −1, there is an exact linear relation between X and Y, in the sense that there are constants a, b such that V[X − bY ] = 0, which means that X − bY is constant, i.e., P(X = a + bY ) = 1. Repeated observations of the pair (X, Y ) would then fall on a straight line. The closer the correlation is to ±1, the closer to a straight line the observations fall.1

Figure 1.4 shows scatter plots of observations of two-dimensional normal variables with different degrees of correlation. As seen in the figure, there is quite a scatter around a straight line even with a correlation as high as 0.9.


Figure 1.4: Observations of two-dimensional normal variables X, Y with E[X] = E[Y ] = 0 and V[X] = V[Y ] = 1, for some different correlation coefficients ρ.

Now back to the interpretation of the covariance function and its scaled version, the correlation function, of a stochastic process. If the correlation function, ρ(s, t),

1 Note, however, that the correlation coefficient only measures the degree of linear dependence, and one can easily construct an example with perfect dependence even though the variables are uncorrelated; take for example Y = X² with X ∈ N(0, 1).


Figure 1.5: Realizations of normal sequences with mt = 0 and r(s, t) = ρ^{|s−t|}.

attains a value close to 1 for some arguments s and t, then the realizations x(s) and x(t) will vary together. If, on the other hand, the correlation is close to −1, the covariation is still strong, but goes in the opposite direction.

Example 1.6. This example illustrates how the sign of the correlation function is reflected in the variation of a stochastic process. Figure 1.5 shows realizations of a sequence of normal random variables {Xt} with mt = 0 and r(t, t) = V[Xt] = 1, and covariance function (= correlation function, since the variance is 1) r(s, t) = ρ^{|s−t|}, for four different ρ-values. The realization with ρ = 0.9 shows rather strong correlation between neighboring observations, which becomes a little less obvious with ρ = 0.5. For ρ = −0.5 the correlation between observations next to each other, i.e., |s − t| = 1, is negative, and this is reflected in the alternating signs in the realization.

Figure 1.6 illustrates the same thing in a different way. Pairs of observations (Xt, Xt+k), for t = 1, 2, . . . , n, are plotted for three different time lags, k = 1, k = 2, k = 5. For ρ = 0.9 the correlation is always positive, but becomes weaker with increasing distance. For ρ = −0.9 the correlation is negative when the distance is odd, and positive when it is even, becoming weaker with increasing distance. □
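Realizations like those in Figures 1.5 and 1.6 can be generated by a simple recursion. The sketch below (my own; the AR(1) recursion used as generator and the sample size are assumptions) produces a stationary Gaussian sequence with mt = 0 and r(s, t) = ρ^{|s−t|}, and compares lag-k sample correlations with the theoretical value ρ^k.

import numpy as np

def gauss_markov(rho, n, rng):
    """Stationary Gaussian sequence with unit variance and covariance rho**|s - t|."""
    x = np.empty(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.normal()
    return x

rng = np.random.default_rng(4)
for rho in (0.9, 0.5, -0.5, -0.9):
    x = gauss_markov(rho, 100_000, rng)
    for k in (1, 2, 5):
        est = np.corrcoef(x[:-k], x[k:])[0, 1]
        print(f"rho = {rho:+.1f}, lag {k}: estimate {est:+.3f}, theory {rho**k:+.3f}")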

1.3 Stationary processes

There is an unlimited number of ways to generate dependence between X(s) and X(t) in a stochastic process, and it is necessary to impose some restrictions and further assumptions if one wants to derive any useful and general properties of the process. There are three main assumptions that make the dependence manageable.


Figure 1.6: Scatter plots of (Xt, Xt+k) for different k-values and ρ = 0.9 and ρ = −0.9.

The first is the Markov principle, which says that the statistical distribution of what will happen between time s and time t depends on what happened up to time s only through the value at time s. This means that X(s) is a state variable that summarizes the history before time s. In Section 1.4, we will meet some applications of this concept. The second principle is a variation of the Markov principle, which assumes that the expected future change is 0, independently of how the process reached its present value. Processes with this property are called martingales, and they are central in stochastic calculus and financial statistics.

The third principle that makes the dependence manageable is the stationarity principle, and that is the topic of this book. In everyday language, the word stationary indicates something that does not change with time, or stays permanently in its position. In statistics, it means that the statistical properties do not change. A random function is called “stationary” if the fluctuations have the same statistical distributions whenever one chooses to observe it. The word stationary is mostly used for processes in time. An alternative term is homogeneous, which can also be used for processes with a space parameter, for example a random surface.

For a process to be stationary, all statistical properties have to be unchanged with time. This is a very strict requirement, and to make life, and mathematics, simpler one can often be content with a weaker condition, namely that the mean value and the covariances do not change.

1.3.1 Strictly stationary processes

Definition 1.3 A stochastic process {X(t), t ∈ T} is called strictly stationary if its statistical distributions remain unchanged after a shift of the time scale.


Since the distributions of a stochastic process are defined by the finite-dimensional distribution functions,2 we can formulate an alternative definition of strict stationarity:

If, for every n, every choice of times t1, . . . , tn ∈ T, and every time lag τ such that ti + τ ∈ T, the n-dimensional random vector (X(t1 + τ), . . . , X(tn + τ)) has the same distribution as the vector (X(t1), . . . , X(tn)), then the process {X(t), t ∈ T} is said to be strictly stationary.

If {X(t), t ∈ T} is strictly stationary, then the marginal distribution of X(t) is independent of t. Also, the two-dimensional distributions of (X(t1), X(t2)) are independent of the absolute location of t1 and t2; only the distance t1 − t2 matters. As a consequence, the mean function m(t) is constant, and the covariance function r(s, t) is a function of t − s only, not of the absolute location of s and t. Also higher order moments, like the third order moment E[X(s)X(t)X(u)], remain unchanged if one adds a constant time shift to s, t, u, and so on for fourth order, fifth order, etc.

1.3.2 Weakly stationary processes

There are very good reasons to study so-called weakly stationary processes, where the first and second order moments are time invariant, i.e., where the mean is constant and the covariances depend only on the time distance. The two main reasons are, first, that weakly stationary Gaussian processes are automatically also strictly stationary, since their distributions are completely determined by the mean values and covariances; see Section ?? and Chapter ??. Secondly, stochastic processes passed through linear filters are effectively handled via the first two moments of the input; see Chapters ??–??.

Definition 1.4 If the mean function m(t) is constant and the covariance function r(s, t) is everywhere finite and depends only on the time difference τ = t − s, the process {X(t), t ∈ T} is called weakly stationary, or covariance stationary.

Note: When we say that a process is stationary, we mean that it is weakly stationary, unless we explicitly say otherwise. A process that is not stationary is called non-stationary.

It is clear that every strictly stationary process with finite variance is also weakly stationary. If the process is normal and weakly stationary, then it is also strictly stationary – more on that in Chapter ?? – but in general one cannot draw such a conclusion.

For a stationary process, we write m for the constant mean value and make the following simplified definition and notation for the covariance function.

2 See Section ?? on Kolmogorov’s existence theorem


Definition 1.5 If {X(t), t ∈ T} is a weakly stationary process with mean m, the covariance and correlation functions3 are defined as

r(τ) = C[X(t), X(t + τ)] = E[(X(t) − m)(X(t + τ) − m)] = E[X(t)X(t + τ)] − m²,

ρ(τ) = ρ[X(t), X(t + τ)] = r(τ)/r(0),

respectively. In particular, the variance is

r(0) = V[X(t)] = E[(X(t) − m)²].

We will use the symbol τ to denote a time lag, as in the definition of r and ρ.

We have used the same notation for the covariance function for a stationary as for a non-stationary process. No confusion should arise from this – one argument in the stationary case and two arguments in the general case. For a stationary process, r(s, s + τ) = r(τ).

The mean and covariance functions can tell us much about how process values are connected, but they fail to provide detailed information about the sample functions, as is seen in the next two examples.

Example 1.7. (“Random telegraph signal”) This extremely simple process jumps between two states, 0 and 1, according to the following rules.

Let the signal X(t) start at time t = 0 with equal probability for the two states, i.e., P(X(0) = 0) = P(X(0) = 1) = 1/2, and let the switching times be decided by a Poisson process {Y (t), t ≥ 0} with intensity λ, independent of X(0). Then {X(t), t ≥ 0} is a weakly stationary process; in fact, it is also strictly stationary, but we don’t show that.

Let us calculate E[X(t)] and E[X(s)X(t)]. At time t, the signal is equal to

X(t) = (1/2)(1 − (−1)^{X(0)+Y(t)}),

since, for example, if X(0) = 0 and Y (t) is an even number, X(t) is back at 0, while if it has jumped an odd number of times, it is at 1. Since X(0) and Y (t) are independent,

E[X(t)] = E[(1/2)(1 − (−1)^{X(0)+Y(t)})] = 1/2 − (1/2) E[(−1)^{X(0)}] E[(−1)^{Y(t)}],

which is constant, equal to 1/2, since E[(−1)^{X(0)}] = (1/2)(−1)^0 + (1/2)(−1)^1 = 0. As a byproduct, we get that P(X(t) = 0) = P(X(t) = 1) = E[X(t)] = 1/2.

For E[X(s)X(t)], we observe that the product X(s)X(t) can be either 0 or 1, and it is 1 only when both X(s) and X(t) are 1. Therefore, for s < t,

E[X(s)X(t)] = P(X(s) = X(t) = 1) = P(X(s) = 1, X(t)−X(s) = 0).

3 Norbert Wiener introduced the term covariance function in his work on stochastic harmonic analysis in the nineteen twenties.


Now X(t) − X(s) = 0 only if there is an even number of jumps in (s, t], i.e., Y (t) − Y (s) is even. Using the independence between X(s) and the Poisson distributed increment Y (t) − Y (s), with expectation λ(t − s), we get for τ = t − s > 0,

E[X(s)X(t)] = P(X(s) = 1) · P(Y (t) − Y (s) is even)
            = (1/2) P(Y (t) − Y (s) is even) = (1/2) ∑_{k=0,2,4,...} e^{−λτ}(λτ)^k/k!
            = (1/4) e^{−λτ}{e^{λτ} + e^{−λτ}} = (1/4)(1 + e^{−2λτ}).

For 0 < t < s we just replace t − s with s − t = −(t − s) = |t − s|.

Conclusion: The random telegraph signal is weakly stationary with mean m = E[X(t)] = 1/2, and exponentially decreasing covariance function

r(τ) = E[X(s)X(s + τ)] − m² = (1/4) e^{−2λ|τ|}.

□
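A short simulation sketch (not from the book; the values of λ, τ and the number of replications are arbitrary assumptions) of the random telegraph signal: X at two time points is obtained from X(0) and the parity of the Poisson jump counts, and the sample covariance is compared with the formula just derived.

import numpy as np

rng = np.random.default_rng(5)
lam, tau = 2.0, 0.4                                 # Poisson intensity and time lag
n_rep = 200_000

x0 = rng.integers(0, 2, size=n_rep)                 # X(0): 0 or 1 with probability 1/2 each
jumps_s = rng.poisson(lam * 1.0, size=n_rep)        # Y(s) for s = 1.0
jumps_t = jumps_s + rng.poisson(lam * tau, size=n_rep)   # Y(s + tau), independent increment
xs = (x0 + jumps_s) % 2                             # X(s): parity of initial state plus jumps
xt = (x0 + jumps_t) % 2                             # X(s + tau)

r_est = np.mean(xs * xt) - np.mean(xs) * np.mean(xt)
print(r_est, 0.25 * np.exp(-2 * lam * tau))         # both close to 0.050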

Example 1.8. Figure 1.7 shows realizations of two stationary processes with different distributions and quite different sample function behavior, but with exactly the same covariance function

r(τ) = σ²e^{−α|τ|}.


Figure 1.7: Realizations of two processes with the same covariance function, r(τ) = σ²e^{−α|τ|}: (a) random telegraph signal, (b) Gaussian process.

The “random telegraph signal” in (a) jumps in a random fashion between the two levels 0 and 1, while the normal process in (b) is continuous, but rather irregular.


This normal process is called a Gauss-Markov process, and it will be studied further in Chapter ??, under the name of the Ornstein-Uhlenbeck process. □

1.3.3 Important properties of the covariance function

All covariance functions share the following very important properties.

Theorem 1.2. If r(τ) is the covariance function for a stationary process {X(t), t ∈ T}, then

a. V[X(t)] = r(0) ≥ 0,

b. V[X(t + h) ± X(t)] = E[(X(t + h) ± X(t))²] = 2(r(0) ± r(h)),

c. r(−τ) = r(τ),

d. |r(τ)| ≤ r(0),

e. if |r(τ)| = r(0) for some τ ≠ 0, then r is periodic,

f. if r(τ) is continuous at τ = 0, then r(τ) is continuous everywhere.

Proof: (a) is clear by definition. (b) Take the variance of the variables X(t + h) + X(t) and X(t + h) − X(t):

V[X(t + h) ± X(t)] = V[X(t + h)] + V[X(t)] ± 2C[X(t), X(t + h)] = r(0) + r(0) ± 2r(h) = 2(r(0) ± r(h)).

(c) The covariance is symmetric in the arguments, so

r(−τ) = C[X(t), X(t − τ)] = C[X(t − τ), X(t)] = r(t − (t − τ)) = r(τ).

(d) Since the variance of X(t + h) ± X(t) is non-negative regardless of the sign, part (b) gives that r(0) ± r(h) ≥ 0, and hence |r(h)| ≤ r(0).

(e) If r(τ) = r(0), part (b) gives that V[X(t + τ) − X(t)] = 0, which implies that X(t + τ) = X(t) for all t, so X(t) is periodic with period τ. If, on the other hand, r(τ) = −r(0), then V[X(t + τ) + X(t)] = 0, and X(t + τ) = −X(t) for all t, and X(t) is periodic with period 2τ. Finally, it is easy to see that if X(t) is periodic, then also r(t) is periodic.

(f) We consider the increment of the covariance function at t,

(r(t + h) − r(t))² = (C[X(0), X(t + h)] − C[X(0), X(t)])² = (C[X(0), X(t + h) − X(t)])²,    (1.2)

where we used that C[U, V ] − C[U, W ] = C[U, V − W ]. Further, covariances obey Schwarz’ inequality,

(C[Y, Z])² ≤ V[Y ]V[Z].

Applied to the right hand side in (1.2) this yields, according to (b),

(r(t + h) − r(t))² ≤ V[X(0)] · V[X(t + h) − X(t)] = 2r(0)(r(0) − r(h)).


If r(τ) is continuous at τ = 0, then the right hand side r(0) − r(h) → 0 as h → 0. Then also the left hand side r(t + h) − r(t) → 0, and hence r(τ) is continuous at τ = t. □

The theorem is important since it restricts the class of functions that can be used as covariance functions for real stationary processes. For example, a covariance function must be symmetric and attain its maximum at τ = 0. It must also be continuous everywhere if it is continuous at the origin, which excludes, for example, the function in Figure 1.8. In Chapter ??, Theorem ??, we shall present a definite answer to the question of which functions can appear as covariance functions.

Figure 1.8: A function with a jump discontinuity at some τ ≠ 0 cannot be a covariance function.

1.4 State variables

In Section 1.3 we mentioned the two main methods to model the dependence between X(s) and X(t) in a stochastic process: the Markov principle and the stationarity principle. The two principles can be combined, and often are, when one seeks to model how a dynamical system evolves under the influence of external random forces or signals. A mathematical model of a dynamical system often consists of a set of state variables, which develop under the influence of a number of input signals, i.e., external factors that affect the system and may change with time. A set of variables may be called state variables for a dynamical system if

• they completely determine the state of the system at any given time t0, and

• the state of the system at time t > t0 is completely determined by the state variables at time t0 and the input signals between t0 and t.

The current values of the state variables for a dynamical system determine completely, together with the new input signals, the state variable values at any later time point. How the system arrived at its present state is of no interest. In the theory of dynamical systems, the exact relation between the input signals and the state variables is often described by a differential equation, like (??). If the input signal is a stochastic process, it is not very meaningful to solve this equation, since the input is random and will be different from experiment to experiment. But there will


always be reasons to determine the statistical properties of the system and its state variables, like the distribution, expected value, variance, covariances, etc., since these are defined by the physical properties of the system and by the statistical properties of the input. We deal with these relations in Chapters ?? and ??.

1.5 Estimation of mean value and covariance function

So far, we have described the simplest characteristics of a stationary process, the mean value and covariance function, and these were defined in a probabilistic way, as expectations that can be calculated from a distribution or probability density function. They were used to describe and predict what can be expected from a realization of the process. We did not pay much attention to how the model should be chosen.

As always in statistics, one aim is data analysis, parameter estimation, and model fitting, in order to find a good model that can be used in practice for prediction of a time series and for simulation and theoretical analysis. In this section, we discuss the properties of the natural estimators of the mean value and covariance function of a stationary process {Xn, n = 1, 2, . . .} with discrete time, from a series of data x1, . . . , xn.

Before we start with the technical details, just a few words about how to think about parameter estimation, model fitting, and data analysis.

Formal parameter estimation: In the formal statistical setting, a model is postulated beforehand, and the data are assumed to be observations of random variables with distributions that are known, apart from some unknown parameters. The task is then to suggest a function of the observations that has good properties as an estimate of the unknown parameters. The model is not questioned, and the estimation procedure is evaluated by how close the parameter estimates are to the true parameter values.

Model selection: In this setting, one seeks to find which model, among many possible alternatives, is most likely to have produced the data. This includes parameter estimation as well as model choice. The procedure is evaluated by its ability to reproduce the observations and to predict future observations.

Model fitting: This is the most realistic situation in practice. No model is assumed to be “true”, and the task is to find a model, including suitable parameter values, that best reproduces observed data, and that predicts the future well.

In this course, we will stay at the first level, with the technical properties of estimation procedures, but the reader is reminded that this is a formal approach. No model should be regarded as true, unless there are some external reasons to assume some specific structure, for example derived from physical principles – and perhaps not even then!


1.5.1 Estimation of the mean value function

Let {Xn, n = 1, 2, . . .} be a weakly stationary sequence with mean value m and with (known or unknown) covariance function r(τ). The process need not be strictly stationary, and we make no specific assumption about its distribution, apart from the constant, but unknown, mean value m. Let x1, . . . , xn be observations of the first variables X1, . . . , Xn.

Theorem 1.3. (a) The arithmetic mean of the observations, m∗_n = (1/n) ∑_{t=1}^{n} x_t, is an unbiased estimator of the mean value m of the process, i.e., E[m∗_n] = m, regardless of the distribution.

(b) If the infinite series ∑_{t=0}^{∞} r(t) is convergent, then the asymptotic variance of m∗_n is given by

lim_{n→∞} n V[m∗_n] = ∑_{t=−∞}^{∞} r(t) = r(0) + 2 ∑_{t=1}^{∞} r(t),    (1.3)

which means that, for large n, the variance of the mean value estimator is V[m∗_n] ≈ (1/n) ∑_t r(t).

(c) Under the condition in (b), m∗_n is a consistent estimator of m as n → ∞, in the sense that E[(m∗_n − m)²] → 0, and also P(|m∗_n − m| > ε) → 0, for all ε > 0.

Proof: (a) The expectation of a sum of random variables is equal to the sum of the expectations; this is true regardless of whether the variables are dependent or not. Therefore, E[m∗_n] = (1/n) ∑_{t=1}^{n} E[X_t] = m.

(b) We calculate the variance by means of Theorem 1.1. From the theorem,

V[m∗_n] = (1/n²) V[∑_{t=1}^{n} X_t] = (1/n²) ∑_{s=1}^{n} ∑_{t=1}^{n} C[X_s, X_t] = (1/n²) ∑_{s=1}^{n} ∑_{t=1}^{n} r(s − t),

where we sum along the diagonals to collect all the n − |u| terms with s − t = u, to get V[m∗_n] = (1/n²) ∑_{u=−n+1}^{n−1} (n − |u|) r(u). Hence,

n V[m∗_n] = −r(0) + (2/n) ∑_{u=0}^{n−1} (n − u) r(u).    (1.4)

Now, if ∑_{t=0}^{∞} r(t) is convergent, then S_n = ∑_{t=0}^{n−1} r(t) → ∑_{t=0}^{∞} r(t) = S, say, which implies that (1/n) ∑_{k=1}^{n} S_k → S as n → ∞. (If x_n → x then also (1/n) ∑_{k=1}^{n} x_k → x.) Thus,

(1/n) ∑_{u=0}^{n−1} (n − u) r(u) = (1/n) ∑_{k=1}^{n} S_k → S,

and this, together with (1.4), gives the result, i.e., n V[m∗_n] → −r(0) + 2 ∑_{u=0}^{∞} r(u) = ∑_{u=−∞}^{∞} r(u) = r(0) + 2 ∑_{u=1}^{∞} r(u).


(c) The first statement follows from (a) and (b), since E[(m∗_n − m)²] = V[m∗_n] + (E[m∗_n − m])² → 0. The second statement is a direct consequence of Chebyshev’s inequality,4

P(|m∗_n − m| > ε) ≤ E[(m∗_n − m)²]/ε².

□

The consequences of the theorem are extremely important for data analysis with dependent observations. A positive correlation between successive observations tends to increase the variance of the mean value estimate and make it more uncertain. We will give two examples of this, first to demonstrate the difficulties with visual interpretation, and then to use the theorem for a more precise analysis.

Example 1.9. (“Dependence can deceive the eye”) Figure 1.9 shows 40 realizations of 25 successive data points from two stationary processes with discrete time. In the upper diagram, the data are dependent and each successive value contains part of the previous one, Xt+1 = 0.9Xt + et, where the et are independent normal variables. (This is an AR(1)-process, an autoregressive process, which we will study in detail in Chapter 2.) In the lower diagram, all variables are independent normal variables Yt = e′t. Variances are chosen so that V[Xt] = V[Yt]. The solid thick line connects the averages of the 25 points in each sample.

One should note that the observations in the AR(1)-process within each sample are less spread out than those in the samples with independent data. Even though there is less variation within each sample, the observed average values are more variable for the AR(1)-process. The calculated standard deviation in the 40 dependent samples is 0.68 on average, while the average standard deviation is 0.99 for independent data. Therefore, if one had only one sample to analyze, one could be led to believe that a dependent sample would give better precision in the estimate of the overall mean level. But it is just the opposite: the more spread out sample gives better precision. □
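The effect can be reproduced in a few lines. The sketch below (my own illustration; the number of samples and the seed are arbitrary assumptions) compares the average within-sample standard deviation and the standard deviation of the sample means for AR(1) data and for independent data with the same variance.

import numpy as np

rng = np.random.default_rng(6)
n_samples, n_obs, a = 4000, 25, 0.9

def ar1_sample(n, a, rng):
    """Stationary AR(1) sample with V[X_t] = 1: X_{t+1} = a X_t + e_t."""
    x = np.empty(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = a * x[t - 1] + np.sqrt(1 - a**2) * rng.normal()
    return x

ar1 = np.array([ar1_sample(n_obs, a, rng) for _ in range(n_samples)])
iid = rng.normal(size=(n_samples, n_obs))

print("average within-sample std:", ar1.std(axis=1).mean(), iid.std(axis=1).mean())
print("std of the sample means  :", ar1.mean(axis=1).std(), iid.mean(axis=1).std())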

Example 1.10. (“How many data samples are necessary”) How many data points should be sampled from a stationary process in order to obtain a specified precision in an estimate of the mean value function? As we saw in the previous example, the answer depends on the covariance structure. Successive time series data taken from nature often exhibit a very simple type of dependence, and often the covariance function can be approximated by a geometrically decreasing function, r(τ) = σ²e^{−α|τ|}.

Suppose that we want to estimate the mean value m of a stationary sequence {Xt}, and that we have reason to believe that the covariance function is r(τ) = σ²e^{−|τ|}. We estimate m by the average m∗_n = x̄ = (1/n) ∑_{k=1}^{n} x_k of observations of X1, . . . , Xn.

4 Pafnuty Lvovich Chebyshev, Russian mathematician, 1821–1894.


Figure 1.9: Upper diagram: 40 samples of 25 dependent AR(1)-variables, Xt+1 = 0.9Xt + et. Lower diagram: independent data Yt with V[Yt] = V[Xt] = 1. Solid lines connect the averages of the 25 data points in each series.

If the variables had been uncorrelated, the standard deviation of m∗_n would have been σ/√n. We use Theorem 1.3, and calculate

∑_{t=−∞}^{∞} r(t) = σ² ∑_{t=−∞}^{∞} e^{−|t|} = σ² (1 + 2 ∑_{t=1}^{∞} e^{−t}) = σ² (1 + 2 · (1/e)/(1 − 1/e)) = σ² (e + 1)/(e − 1),

so m∗_n is unbiased and consistent. If n is large, the variance is

V[m∗_n] ≈ (σ²/n) · (e + 1)/(e − 1),

and the standard deviation is

D[m∗_n] ≈ (σ/√n) ((e + 1)/(e − 1))^{1/2} = σ · 1.471/√n.

We see that positively correlated data give almost 50% larger standard deviation in the m-estimate than uncorrelated data. To compensate for this reduction in precision, it is necessary to measure over a longer time period; more precisely, one has to obtain 1.471²n ≈ 2.16n measurements instead of n.


The constant α determines the decay of the correlation. For a general exponential covariance function r(τ) = σ²e^{−α|τ|} = σ²θ^{|τ|}, the asymptotic standard deviation is, for large n,

D[m∗_n] ≈ (σ/√n) ((e^α + 1)/(e^α − 1))^{1/2} = (σ/√n) ((1 + θ)/(1 − θ))^{1/2},    (1.5)

which can be quite large for θ near 1. As a rule of thumb, one may have to increase the number of observations by a factor (K^{1/τ_K} + 1)/(K^{1/τ_K} − 1), where τ_K is the time lag at which the correlation is equal to 1/K. □

Example 1.11. (“Oscillating data can decrease variance”) If the decay parameter θ in r(τ) = σ²θ^{|τ|} is negative, the observations oscillate around the mean value, and the variance of the observed average will be smaller than for independent data. The “errors” tend to compensate each other. With θ = −1/e, instead of θ = 1/e as in the previous example, the standard deviation of the observed mean is D[m∗_n] ≈ (σ/√n) ((e − 1)/(e + 1))^{1/2} = σ · 0.6798/√n. □

Example 1.12. (Confidence interval for the mean) If the process {Xt} in Example 1.10 is a Gaussian process, as in Section ??, the estimator m∗_n = (1/n) ∑_{t=1}^{n} Xt has a normal distribution with expectation m and approximate standard deviation D[m∗_n] ≈ 1.471/√n, i.e., m∗_n ∈ N(m, D[m∗_n]²). This means, for example, that

P(m − λ_{α/2} D[m∗_n] ≤ m∗_n ≤ m + λ_{α/2} D[m∗_n]) = 1 − α,    (1.6)

where λ_{α/2} is a quantile in the standard normal distribution:

P(−λ_{α/2} ≤ Y ≤ λ_{α/2}) = 1 − α,

if Y ∈ N(0, 1). Rearranging the inequality, we can write (1.6) as

P(m∗_n − λ_{α/2} D[m∗_n] ≤ m ≤ m∗_n + λ_{α/2} D[m∗_n]) = 1 − α.

We thus obtain a confidence interval for m,

I_m : (m∗_n − λ_{α/2} D[m∗_n], m∗_n + λ_{α/2} D[m∗_n]),    (1.7)

with confidence level 1 − α. The interpretation is this: If the experiment “observe X1, . . . , Xn and calculate m∗_n and the interval I_m according to (1.7)” is repeated many times, some of the so constructed intervals will be “correct” and cover the true m-value, and others will not. In the long run, the proportion of correct intervals will be 1 − α.

Suppose now that we have observed the first 100 values of Xt, and got the sum x1 + · · · + x100 = 34.9. An estimate of m is m∗_n = 34.9/100 = 0.349, and a 95% confidence interval for m is

0.349 ± λ_{0.025} · 1.471/√100 = 0.349 ± 1.96 · 0.1471 = 0.349 ± 0.288.


(The confidence level shall be 0.95, i.e., α = 0.05, and λ_{0.025} = 1.96.) Thus, we find the 95% confidence interval for m to be (0.061, 0.637).

Compare this with the interval constructed under the assumption of independent observations:

0.349 ± λ_{0.025} · 1/√100 = 0.349 ± 1.96 · 0.1 = 0.349 ± 0.2;

see Figure 1.9, which shows the increased variability in the observed average for positively dependent variables. Of course, oscillating data with a negative one-step correlation would give smaller variability.

The analysis was based on the normality assumption, but it is approximately valid also for moderately non-normal data. □
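As a numerical check (a sketch using only the numbers of Example 1.12, with σ = 1 and λ_{0.025} = 1.96 taken from the text), both intervals can be computed directly:

import numpy as np

n, xbar = 100, 34.9 / 100
lam = 1.96                                          # lambda_{0.025}, standard normal quantile
d_dep = np.sqrt((np.e + 1) / (np.e - 1)) / np.sqrt(n)   # D[m*_n] when r(tau) = e^{-|tau|}
d_ind = 1 / np.sqrt(n)                              # standard deviation for independent data

print(f"dependent data  : {xbar:.3f} +/- {lam * d_dep:.3f}")   # 0.349 +/- 0.288
print(f"independent data: {xbar:.3f} +/- {lam * d_ind:.3f}")   # 0.349 +/- 0.196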

1.5.2 Ergodicity

Ensemble average

The expectation m = E[X] of a random variable X is what one will get on average in a long series of independent observations of X, i.e., the long run average of the result when the experiment is repeated many times:

m∗_n = (1/n) ∑_{k=1}^{n} x_k → m, when n → ∞.    (1.8)

Here x1, . . . , xn are n independent observations of X. The expectation m is called the ensemble average of the experiment, and it is the average of the possible outcomes of the experiment, weighted by their probabilities.

Time average

If Xt is a variable in a time series, or more generally in a stochastic process, the expectation is a function of time, m(t) = E[Xt], but if the process is stationary, strictly or weakly, it is a constant, i.e., all variables have the same expectation. A natural question is therefore: Can one replace repeated observations of Xt at a fixed t with observations of the entire time series, and estimate the common expectation m by the time average m∗_n = (1/n) ∑_{t=1}^{n} x_t, observed in one single realization of the process? The vector (x1, . . . , xn) is here one observation of (X1, . . . , Xn).

We know, from the law of large numbers, that the average of repeated measurements, (1.8), converges to the ensemble average m as n → ∞. But is the same true for the time average? Does it also converge to m?

The answer is yes, for some processes. These processes are called ergodic, or, more precisely, linearly ergodic.

The answer is yes, for some processes. These processes are called ergodic, ormore precisely linearly ergodic.

A linearly ergodic stationary sequence is a stationary process {Xn, n = 1, 2, . . .}where the common mean value m (= the ensemble average) can be consistentlyestimated by the time average,

x1 + · · ·+ xn

n→ m,


when x1, . . . , xn are observations in one single realization of {Xt}.

Theorem 1.3(c) gives a sufficient condition for a stationary sequence to be linearly ergodic: ∑_{τ=0}^{∞} r(τ) < ∞.

If the process is linearly ergodic, one can estimate the expectation of any linear function aXt + b of Xt by the corresponding time average (1/n) ∑_{t=1}^{n} (a x_t + b). This explains the name “linearly” ergodic.

Example 1.13. The essence of an ergodic phenomenon or process is that everything that can conceivably happen in repeated experiments also happens in one single realization, if it is extended indefinitely. We give an example of a non-ergodic experiment, to clarify the meaning.

What is the meaning of “the average earth temperature”? A statistical interpretation could be the expectation in a random experiment where the temperature is measured at a randomly chosen location. Choosing many locations at random at the same time would give a consistent estimate of the average temperature at the chosen time. The procedure gives all temperatures present on the earth at that time a chance to contribute to the estimate. However, the average may well change with the time of the year, sun activity, volcanic activity, etc.

Another type of average is obtained from stationary measurements over time. A long series of temperature data from Lund, spread out over the year, would give a good estimate of the average temperature in Lund, disregarding “global warming”, of course. The procedure gives all temperatures that occur in Lund over the year a chance to contribute to the estimate. The temperature process in Lund is probably linearly ergodic, but as a model for the earth temperature it is not ergodic.

An automatic procedure to get an estimate of the average earth temperature would be to start a random walk, for example a Brownian motion (see Section ??), to criss-cross the earth surface, continuously measuring the temperature. □

1.5.3 Estimating the covariance function

The covariance function of a stationary sequence {Xn} is r(τ) = E[(Xt − m)(Xt+τ − m)] = E[Xt Xt+τ ] − m², and we assume, to begin with, that m is known.

Theorem 1.4. The estimator

r∗_n(τ) = (1/n) ∑_{t=1}^{n−τ} (x(t) − m)(x(t + τ) − m),   τ ≥ 0,    (1.9)

is asymptotically unbiased, i.e., E[r∗_n(τ)] → r(τ) when n → ∞.

Proof:

E[r∗_n(τ)] = (1/n) ∑_{t=1}^{n−τ} E[(Xt − m)(Xt+τ − m)] = (1/n) ∑_{t=1}^{n−τ} r(τ) = ((n − τ)/n) r(τ) → r(τ) when n → ∞.


□

Remark 1.2. Since r∗_n(τ) is asymptotically unbiased, it is consistent as soon as its variance goes to 0 when n → ∞. Theorem 1.3 applied to the process Yt = (Xt − m)(Xt+τ − m) will yield the result, since Yt has expectation r(τ). But to use the theorem, we must calculate the covariances between Ys and Yt, i.e.,

C[(Xs − m)(Xs+τ − m), (Xt − m)(Xt+τ − m)],

and that will require some knowledge of the fourth moments of the X-process. We deal with this problem in Theorem 1.5. □

Remark 1.3. Why divide by n instead of n − τ? The estimator r∗_n(τ) is in any case only asymptotically unbiased, for large n. There are many good reasons to use the biased form in (1.9), with n in the denominator, instead of n − τ. The most important reason is that r∗_n(τ) then has all the properties of a true covariance function; in Theorem ?? in Chapter ??, we shall see what these are. Furthermore, division by n gives a smaller mean square error, i.e.,

E[(r∗_n(τ) − r(τ))²]

is smaller with n in the denominator than with n − τ. □
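A direct implementation sketch of the estimator (1.9) (my own code; the function name and interface are not from the book), with the 1/n normalization discussed above and an optional mean correction for the case, treated after Theorem 1.5, where m is unknown:

import numpy as np

def cov_estimate(x, max_lag, mean=None):
    """r*_n(tau) = (1/n) sum_{t=1}^{n-tau} (x_t - m)(x_{t+tau} - m), tau = 0, ..., max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = x.mean() if mean is None else mean          # subtract the sample mean if m is unknown
    xc = x - m
    return np.array([xc[:n - tau] @ xc[tau:] / n for tau in range(max_lag + 1)])

For lags well below n the estimate is close to r(τ), but, as Example 1.14 below shows, estimates at neighbouring lags remain correlated.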

Example 1.14. It can be hard to interpret an estimated covariance function, and it is easy to be misled by a visual inspection. It turns out that, for large n, both the variance V[r∗_n(τ)] and the covariances C[r∗_n(s), r∗_n(t)] are of the same order, namely 1/n, and that the correlation coefficient

C[r∗_n(s), r∗_n(t)] / √(V[r∗_n(s)] V[r∗_n(t)]),    (1.10)

between two covariance function estimates will not go to 0 as n increases. There always remains some correlation between r∗_n(s) and r∗_n(t), which gives the estimated covariance function r∗_n a regular, almost periodic shape, also when r(τ) is almost 0. This fact caused worries about the usefulness of sample covariance calculations, and spurred the interest in serious research on time series analysis in the nineteen forties. The exact expressions for the covariances of the covariance estimates are given in the following Theorem 1.5.

The phenomenon is clear in Figure 1.10. We generated three realizations of an AR(2)-process and produced three estimates of its covariance function, based on n = 128 observations each. Note, for example, that for τ ≈ 8–10 two of the covariance estimates are clearly positive and the third is negative, while the true covariance is almost zero; for more on the AR(2)-process, see Example 2.2 in Chapter 2. □



Figure 1.10: Three estimates of the theoretical covariance function for an AR(2)-process Xt = Xt−1 − 0.5Xt−2 + et, (n = 128 observations in each estimate). The true covariance function is bold.

Theorem 1.5. (a) If {Xn, n = 1, 2, . . .} is a stationary Gaussian process with mean m and covariance function r(τ), such that ∑_{t=0}^{∞} r(t)² < ∞, then r∗n(τ) defined by (1.9) is a consistent estimator of r(τ).

(b) Under the same condition, for s, t = s + τ,

    n C[r∗n(s), r∗n(t)] → ∑_{u=−∞}^{∞} {r(u)r(u + τ) + r(u − s)r(u + t)},   (1.11)

when n → ∞.

(c) If Xn = ∑_{k=−∞}^{∞} ck en−k is an infinite moving average (with ∑_{k=−∞}^{∞} |ck| < ∞) of independent, identically distributed random variables {ek} with E[ek] = 0, V[ek] = σ², and E[ek⁴] = ησ⁴ < ∞, then the conclusion of (b) still holds, with the right hand side in (1.11) replaced by

    (η − 3)r(s)r(t) + ∑_{u=−∞}^{∞} {r(u)r(u + τ) + r(u − s)r(u + t)}.

Note that η = 3 when the ek are Gaussian.

(d) Under the conditions in (a) or (c), the estimates are asymptotically normal when n → ∞.


Proof: We prove (b). Part (a) follows from (b) and Theorem 1.4. For parts (c) and (d), we refer to [4, Chap. 7].

We can assume m = 0, and compute

n C[r∗n(s), r∗n(t)] = (1/n) C[ ∑_{j=1}^{n−s} XjXj+s , ∑_{k=1}^{n−t} XkXk+t ]

    = (1/n) { ∑_{j=1}^{n−s} ∑_{k=1}^{n−t} E[XjXj+sXkXk+t] − ∑_{j=1}^{n−s} E[XjXj+s] · ∑_{k=1}^{n−t} E[XkXk+t] }.   (1.12)

Now, it is a nice property of the normal distribution, known as Isserlis’ theorem, that the higher product moments can be expressed in terms of the covariances. In this case,

E[XjXj+sXkXk+t] = E[XjXj+s]E[XkXk+t]+E[XjXk]E[Xj+sXk+t]+E[XjXk+t]E[Xj+sXk].

Collecting terms with k − j = u and summing over u, the normed covariance (1.12) can, for τ ≥ 0, be written as,

∑_{u=−n+s+1}^{n−t−1} (1 − a(u)/n) {r(u)r(u + t − s) + r(u − s)r(u + t)}  →  ∑_{u=−∞}^{∞} {r(u)r(u + τ) + r(u − s)r(u + t)},

when n → ∞, where

    a(u) = t + |u|   for u < 0,
    a(u) = t         for 0 ≤ u ≤ τ,
    a(u) = s + |u|   for u > τ.

The convergence holds under the condition that ∑_{t=0}^{∞} r(t)² is finite; cf. the proof of Theorem 1.3(b). 2

If the mean value m is unknown, one just subtracts an estimate, and uses the estimate

r∗n(τ) = (1/n) ∑_{t=1}^{n−τ} (xt − x̄)(xt+τ − x̄),

where x̄ is the total average of the n observations x1, . . . , xn. The conclusions of Theorem 1.5 remain true.

Example 1.15. (“Interest rates”) Figure 1.11 shows the monthly interest rates for U.S. “1-year Treasury constant maturity” over the 19 years 1990-2008. There seems to be a cyclic variation around a downward trend. If we remove the linear trend by subtracting a linear regression line, we get a series of data that we regard as a realization of a stationary sequence for which we estimate the covariance function.

2



Figure 1.11: Interest rate with linear trend, the residuals = variation around the trend line, and the estimated covariance function for the residuals.

Testing zero correlation

In time series analysis, it is often important to make a statistical test to see if the variables in a stationary sequence are uncorrelated. Then, one can estimate the correlation function ρ(τ) = r(τ)/r(0) by

ρ∗(τ) = r∗n(τ)/r∗n(0),

based on n sequential observations. If ρ(τ) = 0, for τ = 1, 2, . . . , p, then the Box-Ljung statistic,

Q = n(n + 2) ∑_{τ=1}^{p} ρ∗(τ)²/(n − τ),

has an approximate χ²-distribution with p degrees of freedom, if n is large. This means that if Q > χ²_α(p), one can reject the hypothesis of zero correlation. (χ²_α(p) is the upper α-quantile of the χ²-distribution, with p degrees of freedom.) The somewhat simpler Box-Pierce statistic,

Q = n ∑_{τ=1}^{p} ρ∗(τ)²,

has the same asymptotic distribution, but requires a larger sample size to work properly.
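As an illustration, here is a minimal sketch of the Box-Ljung test in Python, assuming numpy and scipy are available; the function name box_ljung and all variable names are ours, not part of the text.

    import numpy as np
    from scipy.stats import chi2

    def box_ljung(x, p, alpha=0.05):
        """Box-Ljung test of zero correlation at lags 1..p."""
        x = np.asarray(x, float) - np.mean(x)
        n = len(x)
        r = np.array([x[:n - tau] @ x[tau:] / n for tau in range(p + 1)])
        rho = r[1:] / r[0]                        # estimated correlations rho*(tau)
        Q = n * (n + 2) * np.sum(rho**2 / (n - np.arange(1, p + 1)))
        return Q, Q > chi2.ppf(1 - alpha, p)      # reject if Q exceeds the quantile

    rng = np.random.default_rng(2)
    print(box_ljung(rng.standard_normal(500), p=10))  # white noise: usually not rejected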


1.5.4 Ergodicity a second time

Linear ergodicity means that one can estimate the expectation = ensemble average, of a stationary process by means of the observed time average in a single realization. We have also mentioned that, under certain conditions, also the covariance function, which is the ensemble average of a cross product,

r(τ) = E[(X(t)−m)(X(t+ τ)−m)],

can be consistently estimated from the corresponding time average. A process with this property could be called ergodic of second order.

In a completely ergodic, or simply “ergodic”, process one can consistently estimate any expectation

E[g(X(t1), . . . , X(tp))],

where g is an arbitrary function of a finite number of X(t)-variables, by the corresponding time average in a single realization,

(1/n) ∑_{t=1}^{n} g(x(t + t1), . . . , x(t + tp)).
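A sketch of such a time average, under the assumption that g depends on the process through a fixed set of lags; the function name and the example choice of g are ours.

    import numpy as np

    def time_average(x, g, lags):
        """Estimate E[g(X(t+t1),...,X(t+tp))] by averaging g over one realization."""
        x = np.asarray(x, float)
        n = len(x) - max(lags)
        return np.mean([g(*(x[t + np.array(lags)])) for t in range(n)])

    # Example: g(u, v) = u*v with lags (0, 1) estimates E[X_t X_{t+1}].
    rng = np.random.default_rng(3)
    x = rng.standard_normal(10000)
    print(time_average(x, lambda u, v: u * v, lags=(0, 1)))  # near 0 for white noise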


Chapter 2

ARMA-processes

This chapter deals with two of the oldest and most useful of all stationary process models, the autoregressive AR-model and the moving average MA-model. They form the basic elements in time series analysis of both stationary and non-stationary sequences, including model identification and parameter estimation. Predictions can be made in an algorithmically simple way in these models, and they can be used in efficient Monte Carlo simulation of more general time series.

Modeling a stochastic process by means of a spectral density gives a lot of flexibility, since every non-negative, symmetric, integrable function is possible. On the other hand, the dependence in the process can take many possible shapes, which makes estimation of spectrum or covariance function difficult. Simplifying the covariance, by assuming some sort of independence in the process generation, is one way to get a more manageable model. The AR- and MA-models both contain such independent generators. The AR-models include feedback, for example coming from an automatic control system, and they can generate sequences with heavy dynamics. They are time-series versions of a Markov model. The MA-models are simpler, but less flexible, and they are used when correlations have a finite time span.

In Section 2.1 we present the general covariance and spectral theory for AR- and MA-processes, and also for a combined model, the ARMA-model. Section 2.2 deals with parameter estimation in the AR-model, and Section 2.3 presents an introduction to prediction methods based on AR- and ARMA-models.

2.1 Auto-regression and Moving average: AR(p) and MA(q)

In this chapter, {et, t = 0, ±1, . . . } denotes white noise in discrete time, i.e., a sequence of uncorrelated random variables with mean 0 and variance σ²,

E[et] = 0,   C[es, et] = σ² if s = t, and 0 otherwise.


The sequence {et} is called the innovation process and its spectral density is constant,

Re(f) = σ2 for − 1/2 < f ≤ 1/2.

2.1.1 Autoregressive process, AR(p)

An autoregressive process of order p, or shorter, AR(p)-process, is created by white noise passing through a feedback filter as in Figure 2.1.

Xt = −a1Xt−1 − a2Xt−2 − · · · − apXt−p + et


Figure 2.1: AR(p)-process. (The operator T−1 delays the signal one time unit.)

An AR(p)-process is defined by its generating polynomial A(z). Let a0 = 1, a1, . . . , ap, be real coefficients and define the polynomial

    A(z) = a0 + a1z + · · · + ap z^p,

in the complex variable z. The polynomial is called stable if the characteristic equation

z^p A(z^{−1}) = a0 z^p + a1 z^{p−1} + · · · + ap = 0

has all its roots inside the unit circle, or, equivalently, if all the zeros of the generating polynomial A(z) lie outside the unit circle.

Definition 2.1 Let A(z) be a stable polynomial of degree p. A stationary sequence {Xt} is called an AR(p)-process with generating polynomial A(z), if the sequence {et}, given by

Xt + a1Xt−1 + · · ·+ apXt−p = et, (2.1)

is a white noise sequence with E[et] = 0, constant variance, V[et] = σ², and et uncorrelated with Xt−1, Xt−2, . . . . The variables et are the innovations to the AR-process. In a Gaussian stationary AR-process, the innovations are also Gaussian.

Equation (2.1) becomes more informative if written in the form

Xt = −a1Xt−1 − · · · − apXt−p + et,


where one can see how new values are generated as linear combinations of old values plus a small uncorrelated innovation. Note that it is important that the innovation at time t is uncorrelated with the process so far; it should be a real “innovation”, introducing something new to the process. Of course, et is correlated with Xt and all subsequent Xs for s ≥ t. Figure 2.2 illustrates how et+k influences future Xs.


Figure 2.2: In an AR-process the innovation et influences all Xs for s ≥ t.

Remark 2.1. If A(z) is a stable polynomial of degree p, and {et} a sequence of independent normal random variables, et ∈ N(0, σ²), there always exists a Gaussian stationary AR(p)-process with A(z) as its generating polynomial and et as innovations. The filter equation Xt + a1Xt−1 + · · · + apXt−p = et gives the X-process as solution to a linear difference equation with right hand side equal to {et}. If the process was started a very long time ago, T ≈ ∞, the solution is approximately independent of the initial values. 2

Theorem 2.1. If {Xt} is an AR(p)-process, with generating polynomial A(z) and innovation variance σ², then mX = E[Xt] = 0, and the covariance function rX is the solution of the Yule-Walker equations,

rX(k) + a1rX(k − 1) + · · ·+ aprX(k − p) = 0, k = 1, 2, . . . (2.2)

with initial condition

rX(0) + a1rX(1) + · · ·+ aprX(p) = σ2. (2.3)

The general solution to (2.2) is of the form rX(τ) = ∑_{k=1}^{p} Ck r_k^τ, where r_k, k = 1, 2, . . . , p, with |r_k| < 1, are the roots of the characteristic equation, or modifications thereof, if there are multiple roots.

Proof: The filter equation (2.1) is used to define the AR-process from the innovations, and Figure 2.1 illustrates how Xt is obtained as a filtration of the et-sequence.

Taking expectations in (2.1), we find that me = E[et] and mX = E[Xt] satisfy the equation

mX + a1mX + · · ·+ apmX = me,

i.e., mX A(1) = me = 0, and since A(1) ≠ 0, one has mX = 0.

To show that rX(τ) satisfies the Yule-Walker equations, we take covariances

between Xt−k and the variables on both sides of equation (2.1),

C[Xt−k, Xt + a1Xt−1 + · · ·+ apXt−p] = C[Xt−k, et].


Here the left hand side is equal to

rX(k) + a1rX(k − 1) + · · ·+ aprX(k − p),

while the right hand side is equal to 0 for k = 1, 2, . . . , and equal to σ2 for k = 0:

C[Xt−k, et] = 0 for k = 1, 2, . . . ,  and  C[Xt−k, et] = σ² for k = 0.

This follows from the characteristics of an AR-process: For k = 1, 2, . . . the innovations et are uncorrelated with Xt−k, while for k = 0, we have,

C[Xt, et] = C[−a1Xt−1 − · · · − apXt−p + et, et] = C[et, et] = σ2,

by definition. 2

The Yule-Walker equation (2.2) is a linear difference equation, which can be solved recursively. To find the initial values, one has to solve the system of p + 1 linear equations,

    rX(0) + a1 rX(−1) + . . . + ap rX(−p) = σ²,
    rX(1) + a1 rX(0) + . . . + ap rX(−p + 1) = 0,
        ...
    rX(p) + a1 rX(p − 1) + . . . + ap rX(0) = 0.   (2.4)

Note that there are p+ 1 equations and p+ 1 unknowns, since rX(−k) = rX(k).
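A sketch of this computation in Python: build the p + 1 equations (2.4) using rX(−k) = rX(k), solve them for rX(0), . . . , rX(p), and then continue with the recursion (2.2). The function name is ours; the numerical example anticipates Example 2.2.

    import numpy as np

    def ar_covariance(a, sigma2, max_lag):
        """Covariance function of X_t + a1*X_{t-1} + ... + ap*X_{t-p} = e_t.
        a = [a1, ..., ap]; solves (2.4) for r(0..p), then recurses with (2.2)."""
        a = np.asarray(a, float)
        p = len(a)
        coef = np.r_[1.0, a]                      # a0 = 1, a1, ..., ap
        A = np.zeros((p + 1, p + 1))
        for k in range(p + 1):                    # row k: sum_j a_j r(|k - j|) = b_k
            for j in range(p + 1):
                A[k, abs(k - j)] += coef[j]
        b = np.zeros(p + 1)
        b[0] = sigma2
        r = list(np.linalg.solve(A, b))           # r(0), ..., r(p)
        for k in range(p + 1, max_lag + 1):       # Yule-Walker recursion (2.2)
            r.append(-sum(coef[j] * r[k - j] for j in range(1, p + 1)))
        return np.array(r)

    # AR(2) from Example 2.2(a): X_t = X_{t-1} - 0.5 X_{t-2} + e_t, sigma2 = 1.
    print(np.round(ar_covariance([-1.0, 0.5], 1.0, 8), 3))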

Remark 2.2. The Yule-Walker equations are named after George Udny Yule, British statistician (1871-1951), and Sir Gilbert Thomas Walker, British physicist, climatologist, and statistician (1868-1958), who were the first to use AR-processes as models for natural phenomena. In the 1920s, Yule worked on time series analysis and suggested the AR(2)-process as an alternative to the Fourier method as a means to describe periodicities and explain correlation in the sunspot cycle; cf. the comment about A. Schuster and the periodogram in Chapter ??. Yule’s analysis was published as: On a method of investigating periodicities in disturbed series, with special reference to Wolfer’s sunspot numbers, 1927.

G.T. Walker was trained in physics and mathematics in Cambridge, but worked for 20 years as head of the Indian Meteorological Department, where he was concerned with the serious problem of monsoon forecasting. He shared Yule’s scepticism about deterministic Fourier methods, and favored correlation and regression methods. He made systematic studies of air pressure variability at Darwin, Australia, and found that it exhibited a “quasi-periodic” behavior, with no single period, but rather a band of periods – we would say “continuous spectrum” – between 3 and 3¼ years. Walker extended Yule’s AR(2)-model to an AR(p)-model and derived the

general form of the Yule-Walker equation, and applied it to the Darwin pressure: On periodicity in series of related terms, 1931. His name is now attached to the Walker oscillation as part of the complex El Niño – Southern Oscillation phenomenon.¹

2

1 See R.W. Katz: Sir Gilbert Walker and a connection between El Niño and statistics. Statistical Science, 17 (2002), 97–112.


Remark 2.3. There are (at least) three good reasons to use AR-processes in time series modeling:

• Many series are actually generated in a feedback system,

• The AR-process is flexible, and by a smart choice of coefficients it can approximate most covariance and spectrum structures; parameter estimation is simple,

• They are easy to use in forecasting: suppose we want to predict, at time t, the future value Xt+1, knowing all . . . , Xt−p+1, . . . , Xt. The linear predictor

X̂t+1 = −a1Xt − a2Xt−1 − · · · − apXt−p+1,

is the best prediction of Xt+1 in least squares sense.

For further arguments, see Sections 2.2 and 2.3. 2

Example 2.1. (”AR(1)-process”) A process in discrete time with geometrically decaying covariance function is an AR(1)-process, and it can be generated by filtering white noise through a one-step feedback filter. With θ1 = −a1, the recurrence equation

Xt + a1Xt−1 = et, i.e., Xt = θ1Xt−1 + et,

has a stationary process solution if |a1| < 1. With innovation variance V[et] = σ², the initial values for the Yule-Walker equation rX(k + 1) = −a1 rX(k) are found from the equation system (2.4),

rX(0) + a1rX(1) = σ2,

rX(1) + a1rX(0) = 0,

which gives V[Xt] = rX(0) = σ²/(1 − a1²), and the covariance function,

r(τ) = σ²/(1 − a1²) · (−a1)^{|τ|} = σ²/(1 − θ1²) · θ1^{|τ|}.

2
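A quick numerical check of this formula, assuming Gaussian innovations; the parameter values, the seed and the simulation length are arbitrary choices of ours.

    import numpy as np

    a1, sigma2 = -0.7, 1.0                  # X_t = 0.7 X_{t-1} + e_t
    rng = np.random.default_rng(4)
    e = rng.standard_normal(200_000) * np.sqrt(sigma2)
    x = np.zeros_like(e)
    for t in range(1, len(e)):
        x[t] = -a1 * x[t - 1] + e[t]
    n = len(x)
    for tau in range(4):
        est = x[:n - tau] @ x[tau:] / n                     # sample covariance
        theory = sigma2 / (1 - a1**2) * (-a1)**tau          # formula above
        print(tau, round(est, 3), round(theory, 3))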

Example 2.2. (”AR(2)-process”) An AR(2)-process is a simple model for damped random oscillations with “quasi-periodicity”, i.e., a more or less vague periodicity,

Xt + a1Xt−1 + a2Xt−2 = et.

The condition for stability is that the coefficients lie inside the triangle

|a2| < 1,   |a1| < 1 + a2,

illustrated in Figure 2.3.



Figure 2.3: Stability region for the AR(2)-process. The parabola a2 = a1²/4 is the boundary between a covariance function of type (2.8) with complex roots and type (2.6) with real roots.

To find the variance and initial value in the Yule-Walker equation (2.2), we re-arrange (2.4), with r(−k) = r(k), for k = 1, 2:

    r(0) + a1 r(1) + a2 r(2) = σ²,
    a1 r(0) + (1 + a2) r(1) = 0,   (k = 1),
    a2 r(0) + a1 r(1) + r(2) = 0,   (k = 2),

leading to the variance and first order covariance,

    V[Xt] = r(0) = σ² · (1 + a2)/(1 − a2) · 1/((1 + a2)² − a1²),

    r(1) = −σ² · a1/(1 − a2) · 1/((1 + a2)² − a1²),   (2.5)

respectively.

We can now express the general solution to the Yule-Walker equation in terms of the roots

z1,2 = −a1/2 ± √((a1/2)² − a2),

to the characteristic equation, z² A(z^{−1}) = z² + a1z + a2 = 0. The covariance function is of one of the types,

    r(τ) = K1 z1^{|τ|} + K2 z2^{|τ|},   (2.6)
    r(τ) = K1 z1^{|τ|} (1 + K2|τ|),   (2.7)
    r(τ) = K1 ρ^{|τ|} cos(β|τ| − φ),   (2.8)

where the different types appear if the roots are (1) real-valued and different, (2) real-valued and equal, or (3) complex conjugated.

For the real root cases, the constants K1, K2 can be found by solving the equation system

    K1 + K2 = r(0),
    K1 z1 + K2 z2 = r(1),


with the starting values from (2.5).

For the complex root case, write the complex conjugated roots in polar form,

z1 = ρ e^{i2πf} and z2 = ρ e^{−i2πf},

where 0 < ρ < 1 and 0 < f ≤ 1/2. Then, the covariance function r(τ) is (for τ ≥ 0),

    r(τ) = K1 z1^τ + K2 z2^τ = ρ^τ (K1 e^{i2πfτ} + K2 e^{−i2πfτ})
         = ρ^τ ((K1 + K2) cos(2πfτ) + i(K1 − K2) sin(2πfτ))
         = ρ^τ (K3 cos(2πfτ) + K4 sin(2πfτ)),

where K3 and K4 are real constants (since r(τ) is real-valued). With

K5 = |K3 + iK4| = √(K3² + K4²)  and  φ = arg(K3 + iK4),

we can write

K3 = K5 cos φ,

K4 = K5 sin φ,

and find that r(τ) = ρ^τ K5 cos(2πfτ − φ).

Figure 2.4 shows realizations together with covariance functions and spectral densities for two different Gaussian AR(2)-processes. Note the peak in the spectral density for the process in (a), not present in (b), depending on the roots z1,2 = −a1/2 ± √((a1/2)² − a2).

a) With σ² = 1, a1 = −1 and a2 = 0.5, the roots to the characteristic equation are complex conjugates, z1,2 = (1 ± i)/2 = 2^{−1/2} e^{±iπ/4}, and

Xt = Xt−1 − 0.5Xt−2 + et,

    rX(τ) = √6.4 · 2^{−|τ|/2} cos(π|τ|/4 − θ),  where θ = arctan(1/3).

b) With σ² = 1, a1 = −0.5 and a2 = −0.25, the roots to the characteristic equation are real, z1,2 = (1 ± √5)/4, and

Xt = 0.5Xt−1 + 0.25Xt−2 + et,

    r(τ) = (0.96 + 0.32√5)/(−1 + √5)^{|τ|} + (0.96 − 0.32√5)/(−1 − √5)^{|τ|}.

The possibility of having complex roots to the characteristic equation makes the AR(2)-process a very flexible modeling tool in the presence of “quasi-periodicities” near one single period; cf. Remark 2.2. 2



Figure 2.4: Realization, covariance function, and spectral density (log scale) for two different AR(2)-processes with (a) a1 = −1, a2 = 0.5, and (b) a1 = −0.5, a2 = −0.25.

2.1.2 Moving average, MA(q)

A moving average process is generated by filtration of a white noise process {et} through a transversal filter; see Figure 2.5.

Xt = et + c1et−1 + c2et−2 + · · ·+ cqet−q


Figure 2.5: MA(q)-process. (The operator T−1 delays the signal one time unit.)

An MA(q)-process is defined by its generating polynomial

C(z) = c0 + c1z + · · · + cq z^q.

There are no necessary restrictions on its zeros, like there are for the AR(p)-process, but it is often favorable to require that it has all its zeros outside the unit circle;


then the filter is called invertible. Expressed in terms of the characteristic equation z^q C(z^{−1}) = 0, the roots should be inside the unit circle. Usually, one normalizes the polynomial and adjusts the innovation variance V[et] = σ², and takes c0 = 1.

Definition 2.2 The process {Xt}, given by

Xt = et + c1et−1 + · · ·+ cqet−q,

is called a moving average process of order q, MA(q)-process, with innovation sequence {et} and generating polynomial C(z).

The sequence {Xt} is an improper average of the latest q + 1 innovations. We do not require the weights ck to be positive, and their sum need not be equal to 1.

Theorem 2.2. An MA(q)-process {Xt} is stationary, with mX = E[Xt] = 0 , and

    rX(τ) = σ² ∑_{j−k=τ} cj ck  for |τ| ≤ q,  and rX(τ) = 0 otherwise.

The main feature of an MA(q)-process is that its covariance function is 0 for |τ | > q .

Proof: This is left to the reader. Start with the easiest case, which is treated in the next example and in ??. 2
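As a hint, the formula of Theorem 2.2 can also be evaluated numerically; here is a sketch, with c0 = 1 prepended and our own function name.

    import numpy as np

    def ma_covariance(c, sigma2, max_lag):
        """Covariance of an MA(q)-process X_t = e_t + c1 e_{t-1} + ... + cq e_{t-q}.
        c = [c1, ..., cq]; r(tau) = sigma2 * sum_{j-k=tau} c_j c_k, zero for |tau| > q."""
        coef = np.r_[1.0, np.asarray(c, float)]          # c0 = 1
        q = len(coef) - 1
        r = [sigma2 * np.sum(coef[tau:] * coef[:len(coef) - tau]) if tau <= q else 0.0
             for tau in range(max_lag + 1)]
        return np.array(r)

    # MA(1) with c1 = 0.9 (cf. Example 2.3): r(0) = 1.81, r(1) = 0.9, zero beyond.
    print(ma_covariance([0.9], 1.0, 3))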


Figure 2.6: Realizations, covariance functions, and spectral densities (log-scale) for two different MA(1)-processes: (a) Xt = et + 0.9et−1, (b) Xt = et − 0.9et−1.


Example 2.3. For the MA(1)-process Xt = et + c1et−1, the covariance is,

    r(τ) = σ²(1 + c1²) for τ = 0,   σ²c1 for |τ| = 1,   0 for |τ| ≥ 2.

Figure 2.6 shows realizations, covariance functions, and spectral densities for two different Gaussian MA(1)-processes with c1 = ±0.9, and

r(0) = 1.81, r(1) = ±0.9,

R(f) = 1.81± 1.8 cos 2πf.

2

2.1.3 Mixed model, ARMA(p,q)

A natural generalization of the AR- and MA-processes is a combination, with one AR- and one MA-filter in series, letting the right hand side of the AR-definition (2.1) be an MA-process. The result is called an ARMA(p,q)-process,

Xt + a1Xt−1 + · · ·+ apXt−p = et + c1et−1 + · · ·+ cqet−q,

where {et} is a white noise process, such that et and Xt−k are uncorrelated for k = 1, 2, . . .

2.2 Estimation of AR-parameters

AR-, MA-, and ARMA-models are the basic elements in statistical time series analysis, both for stationary phenomena and for non-stationary. Here, we will only give a first example of parameter estimation for the simplest case, the AR(p)-process.

Estimation of the parameters a1, . . . , ap and σ² = V[et] in an AR(p)-process is easy. The AR-equation

Xt + a1Xt−1 + . . .+ apXt−p = et, (2.9)

can be seen as a multiple regression model,

Xt = −a1Xt−1 − . . .− apXt−p + et,

where the residuals (= innovations) et are uncorrelated with the regressors Xt−1, Xt−2, . . .. With terminology borrowed from regression analysis one can call Ut = (−Xt−1, . . . , −Xt−p) the independent regressor variables, and regard the new observation Xt as the dependent variable. With the parameter vector

θ = (a1, . . . , ap)′,

we can write the AR-equation in standard multiple regression form,

Xt = Utθ + et,


and use standard regression technique.

Suppose we have n successive observations of the AR(p)-process (2.9), x1, x2, . . . , xn,

and define ut = (−xt−1, . . . ,−xt−p),

xt = utθ + et, for t = p+ 1, . . . , n.

The least squares estimate θ̂ is the θ-value that minimizes

    Q(θ) = ∑_{t=p+1}^{n} (xt − utθ)².

The solution can be formulated in matrix language. With

    X = (xp+1, xp+2, . . . , xn)′,   E = (ep+1, ep+2, . . . , en)′,

and U the (n − p) × p matrix with rows up+1, up+2, . . . , un,

    U =  [ −xp     −xp−1   . . .  −x1
           −xp+1   −xp     . . .  −x2
             ...     ...            ...
           −xn−1   −xn−2   . . .  −xn−p ] ,

the regression equation can be written in compact form, X = Uθ + E, and the function to minimize is

Q(θ) = (X−Uθ)′(X−Uθ).

Theorem 2.3. The least squares estimates of the parameters (a1, . . . , ap) = θ, and the innovation variance σ² = V[et], in an AR(p)-process are given by

    θ̂ = (U′U)^{−1}U′X,    σ̂² = Q(θ̂)/(n − p).

The estimates are consistent, and converge to the true values when n → ∞; see Appendix ??.

The theorem claims that if the observations come from an AR(p)-process, then the parameters can be correctly estimated, provided the series is long enough. The covariance function can then also be estimated. However, one does not know for sure if the process is an AR(p)-process, and even if one did, the value of p would probably be unknown. Statistical time series analysis has developed techniques to test possible model orders, and to evaluate how well the fitted model agrees with data.

Example 2.4. We use the estimation technique on the AR(2)-process from Example 2.2, Xt = Xt−1 − 0.5Xt−2 + et. Based on n = 512 observations the estimates were amazingly close to the true values, namely a1 = −1.0361, a2 = 0.4954, σ = 0.9690; see Table 2.1 for the standard errors of the estimates. 2
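A sketch of the least squares computation of Theorem 2.3, assuming numpy; the function name fit_ar and the simulated data are ours, and np.linalg.lstsq is used instead of forming (U′U)^{−1} explicitly.

    import numpy as np

    def fit_ar(x, p):
        """Least squares estimates of (a1,...,ap) and sigma^2 in an AR(p)-model."""
        x = np.asarray(x, float)
        n = len(x)
        U = np.column_stack([-x[p - k - 1:n - k - 1] for k in range(p)])  # rows (-x_{t-1},...,-x_{t-p})
        X = x[p:]
        theta = np.linalg.lstsq(U, X, rcond=None)[0]       # solves (U'U) theta = U'X
        resid = X - U @ theta
        sigma2 = resid @ resid / (n - p)
        return theta, sigma2

    # Fit the AR(2)-model of Example 2.4 to a simulated series of length 512.
    rng = np.random.default_rng(5)
    e = rng.standard_normal(600)
    x = np.zeros(600)
    for t in range(2, 600):
        x[t] = x[t - 1] - 0.5 * x[t - 2] + e[t]
    theta, s2 = fit_ar(x[-512:], 2)
    print(np.round(theta, 3), round(s2, 3))    # close to (-1, 0.5) and 1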

2.3 Prediction in AR- and ARMA-models

Forecasting or predicting future values in a time series is one of the most important applications of ARMA-models. Given a sequence of observations xt, xt−1, xt−2, . . ., of a stationary sequence {Xt}, one wants to predict the value xt+τ of the process, τ time units later, as well as possible in mean square sense. The value of τ is called the prediction horizon. We only consider predictors that are linear combinations of observed values.


                        a1        a2       σ
    true value         −1         0.5      1
    estimated value    −1.0384    0.4954   0.969
    standard error      0.0384    0.0385

Table 2.1: Parametric estimates of AR(2)-parameters.

2.3.1 Forecasting an AR-process

Let us first consider one-step ahead prediction, i.e., τ = 1, and assume that the process {Xt} is an AR(p)-process,

Xt + a1Xt−1 + · · ·+ apXt−p = et, (2.10)

with uncorrelated innovations {et} with mean 0 and finite variance, σ² = V[et], and with et+1 uncorrelated with Xt, Xt−1, . . .. In the relation (2.10), delayed one time unit,

Xt+1 = −a1Xt − a2Xt−1 − · · · − apXt−p+1 + et+1,

all terms on the right hand side are known at time t, except et+1, which in turn is uncorrelated with the observations of Xt, Xt−1, . . .. It is then clear that it is not possible to predict the value of et+1 from the known observations – we only know that it will be an observation from a distribution with mean 0 and variance σ². The best thing to do is to predict et+1 with its expected value 0. The predictor of Xt+1

would then be,

    X̂t+1 = −a1Xt − a2Xt−1 − · · · − apXt−p+1.   (2.11)

Theorem 2.4. The predictor (2.11) is optimal in the sense that if Yt+1 is any other linear predictor, based only on Xt, Xt−1, . . ., then

    E[(Xt+1 − Yt+1)²] ≥ E[(Xt+1 − X̂t+1)²].

Proof: Since X̂t+1 and Yt+1 are based only on Xt, Xt−1, . . ., they are uncorrelated with et+1, and since Xt+1 = X̂t+1 + et+1, one has

    E[(Xt+1 − Yt+1)²] = E[(X̂t+1 + et+1 − Yt+1)²]
        = E[e²t+1] + 2E[et+1]E[X̂t+1 − Yt+1] + E[(X̂t+1 − Yt+1)²]
        = E[e²t+1] + E[(X̂t+1 − Yt+1)²] ≥ E[e²t+1] = E[(Xt+1 − X̂t+1)²],

with equality only if Yt+1 = X̂t+1. 2

Repeating the one-step ahead prediction, one can extend the prediction horizon. To predict Xt+2, consider the identity

Xt+2 = −a1Xt+1 − a2Xt − · · · − apXt−p+2 + et+2,


and insert Xt+1 = X̂t+1 + et+1, to get

    Xt+2 = −a1(−a1Xt − · · · − apXt−p+1 + et+1) − a2Xt − · · · − apXt−p+2 + et+2
         = (a1² − a2)Xt + (a1a2 − a3)Xt−1 + · · · + (a1ap−1 − ap)Xt−p+2 + a1apXt−p+1 − a1et+1 + et+2.

Here, −a1et+1 + et+2 is uncorrelated with Xt, Xt−1, . . ., and in the same way as before, we see that the best two-step ahead predictor is

    X̂t+2 = (a1² − a2)Xt + (a1a2 − a3)Xt−1 + · · · + (a1ap−1 − ap)Xt−p+2 + a1apXt−p+1.

Repeating the procedure gives the best predictor, in mean square sense, for any prediction horizon.
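Repeated substitution is the same as iterating the AR-recursion with all future innovations replaced by their mean 0, which is easy to code; a sketch with our own function name follows.

    import numpy as np

    def ar_forecast(x_hist, a, horizon):
        """Predictors X^_{t+1}, ..., X^_{t+horizon} for X_t + a1 X_{t-1} + ... + ap X_{t-p} = e_t,
        given observations x_hist = (..., x_{t-1}, x_t). Future innovations are set to 0."""
        a = np.asarray(a, float)
        p = len(a)
        buf = list(x_hist[-p:])                   # last p observed values
        preds = []
        for _ in range(horizon):
            nxt = -np.dot(a, buf[::-1])           # -a1*X_t - a2*X_{t-1} - ...
            preds.append(nxt)
            buf = buf[1:] + [nxt]
        return np.array(preds)

    # Two-step forecast for the AR(2)-process X_t = X_{t-1} - 0.5 X_{t-2} + e_t.
    print(ar_forecast([0.2, 1.0], a=[-1.0, 0.5], horizon=2))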

2.3.2 Prediction of ARMA-processes

To predict an ARMA-process requires more work than the AR-process, since the unobserved old innovations are correlated with the observed data, and have delayed influence on future observations. An optimal predictor therefore requires reconstruction of old es-values, based on observed Xs, s ≤ t.

Let the ARMA-process be defined by,

Xt + a1Xt−1 + · · ·+ apXt−p = et + c1et−1 + · · ·+ cqet−q, (2.12)

with {et} as before. We present the solution to the one-step ahead prediction; generalization to many-steps ahead is very similar.

We formulate the solution by means of the generating polynomials (with a0 = c0 = 1),

    A(z) = 1 + a1z + · · · + ap z^p,

    C(z) = 1 + c1z + · · · + cq z^q,

and assume both polynomials have their zeros outside the unit circle, so A(z) is stable and C(z) is invertible. Further, define the backward translation operator T−1,

    T−1Xt = Xt−1,   T−1et = et−1,   T−2Xt = (T−1)²Xt = T−1(T−1Xt) = T−1Xt−1 = Xt−2,   etc.

The defining equation (2.12) can now be written in compact form as,

A(T−1)Xt = C(T−1)et, (2.13)

and by formal operation with the polynomials one can write

    Xt+1 = [C(T−1)/A(T−1)] et+1 = et+1 + [(C(T−1) − A(T−1))/(A(T−1)T−1)] T−1 et+1
         = et+1 + [(C(T−1) − A(T−1))/(A(T−1)T−1)] et.   (2.14)


According to (2.13), et = [A(T−1)/C(T−1)] Xt, and inserting this into (2.14) we get,

    Xt+1 = et+1 + [(C(T−1) − A(T−1))/(A(T−1)T−1)] · [A(T−1)/C(T−1)] Xt
         = et+1 + [(C(T−1) − A(T−1))/(C(T−1)T−1)] Xt.

Here, the innovation et+1 is uncorrelated with known observations, while the second term only contains known X-values, and can be used as predictor. To find the explicit form, we expand the polynomial ratio in a power series,

    (C(z) − A(z))/(C(z)z) = ((c1 − a1)z + (c2 − a2)z² + · · ·) / (z(1 + c1z + · · · + cq z^q)) = d0 + d1z + d2z² + · · · ,

which, with the T−1 -operator inserted, gives the desired form,

    Xt+1 = et+1 + [(C(T−1) − A(T−1))/(C(T−1)T−1)] Xt = et+1 + d0Xt + d1Xt−1 + d2Xt−2 + · · · .

Hence, the best predictor of Xt+1 is

    X̂t+1 = [(C(T−1) − A(T−1))/(C(T−1)T−1)] Xt = d0Xt + d1Xt−1 + d2Xt−2 + · · · .   (2.15)

Our computations have been formal, but one can show that if the sums

g0et + g1et−1 + · · · and d0Xt + d1Xt−1 + · · · , (2.16)

where g0, g1, . . ., are the coefficients in the power series expansion of (C(z) − A(z))/A(z), are absolutely convergent, then

    Xt+1 = et+1 + ∑_{k=0}^{∞} dk Xt−k,

and X̂t+1 is the optimal predictor.

Here are some more arguments for the calculations. We have assumed that the

polynomials A(z) and C(z) in the complex variable z have their zeros outside the unit circle. Then, known theorems for analytic functions say that the radii of convergence of the power series C(z) = ∑ ck z^k and D(z) = ∑ dk z^k are greater than 1, and then |ck|, |dk| ≤ constant · θ^k, for some θ, |θ| < 1. This in turn implies, by some more elaborate probability theory, that all series in (2.16) are absolutely convergent for (almost) all realizations of {Xt}. In summary, the predictor X̂t+1

is always well-defined and it is the optimal one-step ahead predictor in the stable-invertible case, when A(z) and C(z) have their zeros outside the unit circle.

For more details, see e.g. Åström: Introduction to Stochastic Control [31], or Yaglom: An Introduction to the Theory of Stationary Random Functions [30], for a classical account.
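The coefficients d0, d1, . . . in (2.15) can be obtained numerically by formal power series division; the following is a sketch, assuming a stable and invertible model (the function name and the truncation at n_terms coefficients are ours).

    import numpy as np

    def predictor_coefficients(a, c, n_terms=20):
        """Coefficients d_0, d_1, ... in the one-step ARMA predictor (2.15),
        i.e. the power series of (C(z) - A(z)) / (z C(z)); a0 = c0 = 1."""
        p, q = len(a), len(c)
        m = max(p, q)
        A = np.r_[1.0, np.asarray(a, float), np.zeros(m - p)]
        C = np.r_[1.0, np.asarray(c, float), np.zeros(m - q)]
        num = np.r_[(C - A)[1:], np.zeros(n_terms)]   # coefficients of (C(z)-A(z))/z
        d = np.zeros(n_terms)
        for k in range(n_terms):                      # long division by C(z)
            d[k] = num[k] - np.dot(C[1:min(q, k) + 1], d[k - min(q, k):k][::-1])
        return d

    # AR(1) check: a = [a1], c = [] gives d_0 = -a1 and d_k = 0 for k >= 1.
    print(np.round(predictor_coefficients([-0.5], []), 3)[:4])
    # ARMA(1,1): X_t - 0.5 X_{t-1} = e_t + 0.3 e_{t-1}.
    print(np.round(predictor_coefficients([-0.5], [0.3]), 3)[:5])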


Example 2.5. (”Prediction of electrical power demand”) Trading electrical power on a daily or even hourly basis has become economically important, and the use of statistical methods has grown. Predictions of demand a few days ahead are one of the basic factors in the pricing of electrical power that takes place on the power markets. (Of course, forecasts of the demand several months or years ahead are also important, but that requires different data and different methods.)

Short term predictions are needed also for production planning of water-generated power, and for decisions about selling and buying electricity. If a distributor has made big errors in the prediction, he can be forced to start up expensive fossil production units, or sell the surplus at a bargain price.

Electrical power consumption is not a stationary process. It varies systematically over the year, depending on season and weather, and there are big differences between days of the week and times of the day. Before one can use any stationary process model, these systematic variations need to be estimated. If one has successfully estimated and subtracted the systematic part, one can hope to fit an ARMA-model to the residuals, i.e., the variation around the weekly/daily profile, and the parameters of the model have to be estimated. Then a new problem arises, since the parameters need not be constant, but may be time and weather dependent. Figure 2.7 shows observed power consumption during one autumn week, together with one-hour ahead predicted consumption based on an ARMA-model, and the prediction error. It turned out, in this experiment, that the ARMA-model contained just as much information about the future demand as a good weather prediction, and that there was no need to include any more data in the prediction. 2

2.3.3 The orthogonality principle

The optimal AR-predictor (2.11) and Theorem 2.4 illustrate a general property of optimal linear prediction. Let Y and X1, . . . , Xn be correlated random variables with mean zero.

Theorem 2.5. A linear predictor Ŷ = ∑_{k=1}^{n} ak Xk of Y by means of X1, . . . , Xn is optimal in mean square sense if and only if the prediction error Y − Ŷ is uncorrelated with each Xk, i.e., C[Y − Ŷ, Xk] = 0, for k = 1, . . . , n.

The coefficients {ak} of the optimal predictor are a solution to the linear equation system

    C[Y, Xk] = ∑_{j=1}^{n} aj C[Xj, Xk],   k = 1, . . . , n.   (2.17)
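A sketch of solving the normal equations (2.17) numerically, assuming the covariances are known; the function name and the AR(1)-type example are ours.

    import numpy as np

    def best_linear_predictor(cov_YX, cov_XX):
        """Coefficients a_1,...,a_n of the optimal linear predictor of Y from X_1,...,X_n,
        obtained by solving (2.17): C[Y, X_k] = sum_j a_j C[X_j, X_k]."""
        return np.linalg.solve(np.asarray(cov_XX, float), np.asarray(cov_YX, float))

    # Predict Y = X(t+1) from X(t), X(t-1) when r(tau) = 0.5**|tau| (AR(1)-type covariance):
    r = lambda tau: 0.5 ** abs(tau)
    cov_XX = [[r(0), r(1)], [r(1), r(0)]]
    cov_YX = [r(1), r(2)]
    print(best_linear_predictor(cov_YX, cov_XX))   # [0.5, 0.0]: only the latest value matters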


Figure 2.7: Measured electricity consumption during one autumn week (top diagram), ARMA-predicted consumption, including weekday and hour correction, one hour ahead (middle diagram), and the prediction error (lower diagram).


Bibliography

[1] D. L. Bartholomew and J. A. Tague: Quadratic power spectrum estimation with orthogonal frequency division multiple windows. IEEE Trans. on Signal Processing, 43(5):1279–1282, May 1995.

[2] J.S. Bendat & A.G. Piersol: Random Data. Wiley, New York, 2nd ed., 1986.

[3] P. Bloomfield: Fourier Analysis of Time Series: An Introduction. Wiley, New York 1976.

[4] P.J. Brockwell & R.A. Davis: Time Series: Theory and Methods. 2nd ed., Springer-Verlag, New York 1991.

[5] M. P. Clark and C. T. Mullis: Quadratic estimation of the power spectrum using orthogonal time-division multiple windows. IEEE Trans. on Signal Processing, 41(1):222–231, Jan 1993.

[6] J. W. Cooley and J. W. Tukey: An algorithm for the machine calculation of Fourier series. Math. Comput., 19:297–301, 1965.

[7] H. Cramér & M.R. Leadbetter: Stationary and Related Stochastic Processes. Wiley, New York 1967.

[8] F.A. Graybill: Introduction to Matrices with Applications in Statistics. Wadsworth, 1969.

[9] G. Grimmett and D. Stirzaker: Probability and Random Processes. Oxford University Press, 2001.

[10] K. Hasselmann et al.: Measurements of wind-wave growth and swell decay during the JOint North Sea Wave Project (JONSWAP). Deutsche Hydrographische Zeitschrift, Reihe A, No. 8, 1973.

[11] S.M. Kay: Modern Spectral Estimation: Theory and Application. Prentice-Hall, Englewood Cliffs 1988.

[12] A.N. Kolmogorov: Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer-Verlag, Berlin, 1933.

[13] A.N. Kolmogorov: Foundations of the Theory of Probability, 2nd English Edition. Chelsea Publishing Company, New York, 1956. (Translation of [12].)


[14] G. Lindgren: Lectures on Stationary Stochastic Processes. Lund 2006.

[15] R.M. Mazo: Brownian Motion: Fluctuations, Dynamics and Application. Oxford University Press, Oxford, 2002.

[16] L. Olbjer, U. Holst, J. Holst: Tidsserieanalys. Matematisk statistik, LTH, 5th ed., 2002.

[17] D. B. Percival and A. T. Walden: Spectral Analysis for Physical Applications: Multitaper and Conventional Univariate Techniques. Cambridge University Press, 1993.

[18] D.S.G. Pollock: A Handbook of Time-Series Analysis, Signal Processing and Dynamics. Academic Press, 1999.

[19] S.O. Rice: Mathematical analysis of random noise. Bell System Technical Journal, Vol. 23 and 24, pp. 282–332 and 46–156, respectively.

[20] K. S. Riedel: Minimum bias multiple taper spectral estimation. IEEE Trans. on Signal Processing, 43(1):188–195, January 1995.

[21] A. Schuster: On the investigation of hidden periodicities with application to a supposed 26-day period of meteorological phenomena. Terr. Magnet., 3:13–41, 1898.

[22] D. Slepian: Prolate spheroidal wave functions, Fourier analysis and uncertainty – V: The discrete case. Bell System Journal, 57(5):1371–1430, May-June 1978.

[23] G. Sparr and A. Sparr: Kontinuerliga system. Studentlitteratur, Lund, 2000.

[24] P. Stoica and R. Moses: Spectral Analysis of Signals. Pearson Prentice Hall, 2005.

[25] D. J. Thomson: Spectrum estimation and harmonic analysis. Proc. of the IEEE, 70(9):1055–1096, Sept 1982.

[26] A. T. Walden, E. McCoy, and D. B. Percival: The variance of multitaper spectrum estimates for real Gaussian processes. IEEE Trans. on Signal Processing, 42(2):479–482, Feb 1994.

[27] P. D. Welch: The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms. IEEE Trans. on Audio Electroacoustics, AU-15(2):70–73, June 1967.

[28] http://en.wikipedia.org/wiki/Distribution_(mathematics)

[29] A.M. Yaglom: Correlation Theory of Stationary and Related Random Functions, I–II. Springer-Verlag, New York 1987.


[30] A.M. Yaglom: An Introduction to the Theory of Stationary Random Functions. Dover Publications (reprint of 1962 edition).

[31] K.J. Åström: Introduction to Stochastic Control. Academic Press, New York 1970.

[32] B. Øksendal: Stochastic Differential Equations: An Introduction with Applications. 6th ed., Springer, 2003.