Between and beyond: Irregular series, interpolation, variograms, and smoothing

Nicholas J. Cox

Mind the gap!

Repeated reminder, London Underground.

Irregular series

Irregular series are series in which non-missing values are not all equally spaced.

Special case: Values would be equally spaced (every day, every year, …), but there are some gaps with missing values, for human or inhuman reasons.

General case: Values are just at known times or points with no necessary rules about spacing.

Irregular series often seem to invite interpolation.

Luke Howard (1772–1864)

Best remembered for his nomenclature for clouds (cumulus, stratus, cirrus and so forth).

Here we use some of his temperature data from Plaistow, near London, in 1807 as a sandbox.

Howard, Luke. 1818. The Climate of London, Deduced from Meteorological Observations, Made at Different Places in the Neighbourhood of the Metropolis. Volume I. London: W. Phillips, etc.

[Figure: maximum temperature (°F) against date, 7 May to 4 Jun]

[Figure: minimum temperature (°F) against date, 7 May to 4 Jun]

Series of events

N.B. We are not talking here about series of events, or realisations of point processes.

In such series occurrences are typically irregularly spaced, but the gaps are inherent in the process, not a failing of our data.

Examples range from eruptions to elections.

[Figure: three event timelines. Eruptions, Mt Adams WA (axis -8000 to 2000); elections of black Presidents, USA (axis 1789 to 2016); elections of women Presidents, USA (axis 1789 to 2016)]

Interpolation

Interpolation is the art of reading between the lines.

Historically, it is a deterministic process, often a matter of going beyond printed tables of functions (logarithmic, trigonometric, and so forth).

In principle, we should worry about the statistical properties of interpolation. It is estimation or prediction.

In practice, imputation now appears better known among statistical researchers.

Interpolation in (official) Stata

The ipolate command for linear interpolation (and extrapolation) was added in Stata 3.1 (1993).

The Mata functions spline3() and spline3eval() were added in Stata 9.0 (2005).

User-written programs

Programs (NJC) are available from SSC for

cubic interpolation: cipolate (2002)
cubic spline interpolation: csipolate (2009)
piecewise cubic Hermite interpolation: pchipolate (2012)
nearest neighbour interpolation: nnipolate (2012)

A combined and extended program mipolate will shortly be available too.

Two dimensions too

Note also bipolate (Joseph Canner, SSC) (2014).

By default it uses quintic polynomials.

Other available methods include thin plate splines and Shepard’s method.

Note also twoway contour.

mipolate generalises ipolate

Interpolation is of yvar with respect to specified xvar.

Prior tsset or xtset is not assumed.

Regular spacing is not assumed.

Multiple values of yvar at the same xvar are averaged first.

Groupwise operations using by: are supported.

Linear and cubic

Linear interpolation just uses previous and following known values (only). This is done by ipolate, and also mipolate by default.

Cubic interpolation is another classic method, using two previous and two following known values (only). This is done by mipolate, cubic.

The default of mipolate with either method (as with ipolate) is not to extrapolate.
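A minimal Python sketch of the two rules just described, not the ipolate or mipolate code. The function names and the points_each_side argument are made up for illustration. Leaving a value missing whenever two known values are not available on both sides is an assumption of this sketch; a real program may treat the ends of a series differently.

```python
from bisect import bisect_left


def lagrange(xs, ys, x):
    """Evaluate the polynomial through the points (xs[k], ys[k]) at x."""
    total = 0.0
    for k, (xk, yk) in enumerate(zip(xs, ys)):
        weight = 1.0
        for m, xm in enumerate(xs):
            if m != k:
                weight *= (x - xm) / (xk - xm)
        total += yk * weight
    return total


def interpolate(known_x, known_y, x, points_each_side=1):
    """Interpolate at x from known values sorted by distinct known_x.

    points_each_side=1 is linear interpolation (previous and following
    known value only); points_each_side=2 is cubic interpolation (two
    previous and two following). Returns None where that many known
    values are not available on both sides, so nothing is extrapolated.
    """
    j = bisect_left(known_x, x)            # first index with known_x[j] >= x
    if j < len(known_x) and known_x[j] == x:
        return known_y[j]                  # x is itself a known point
    lo, hi = j - points_each_side, j + points_each_side
    if lo < 0 or hi > len(known_x):
        return None                        # too close to an end: leave missing
    return lagrange(known_x[lo:hi], known_y[lo:hi], x)
```

Regular spacing is not needed: with known values of x cubed at x = 0, 1, 3, 4, 7, the cubic rule recovers 8 at x = 2, while the linear rule gives 14.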

Un peu d’histoire (a little history)

Cubic interpolation is often attributed to Joseph-Louis Lagrange (1736–1813) but was proposed earlier by Edward Waring (1735?–1798).

[Portraits: Lagrange and Waring]

Cubic splines

As before, we are using cubic polynomials locally, but they are constrained to join smoothly.

The syntax is mipolate, spline.

This is merely a wrapper for the official Mata functions.

As before, the default of mipolate with this option is not to extrapolate.

Linear extrapolation

As with ipolate, linear extrapolation is available as an option in mipolate to fill in missing values at the ends of series.

What your teachers told you is true: extrapolation is dangerous.

“Don’t point that straight line: It can go off anywhere.”

(Allude here to Mark Twain on the Mississippi.)

Piecewise cubic Hermite interpolation

This method also uses piecewise cubics joining smoothly. The syntax is mipolate, pchip.

The interpolant is shape-preserving and cannot overshoot locally.

Sections in which yvar is increasing, decreasing or constant with xvar remain so after interpolation. Hence local maxima and minima also remain so.

This interpolation method also extrapolates.
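A rough Python sketch of why such an interpolant cannot overshoot; it is not the mipolate code. Slopes at interior known points are set to zero wherever the neighbouring secant slopes differ in sign or one is zero, and otherwise to a weighted harmonic mean of them. The one-sided end slopes are a simplifying assumption of this sketch. It only interpolates within the range of the known points, so extrapolation is left out, and all names are hypothetical.

```python
from bisect import bisect_right


def pchip_slopes(x, y):
    """Slopes at the known points, chosen so the interpolant cannot overshoot."""
    n = len(x)
    h = [x[k + 1] - x[k] for k in range(n - 1)]               # interval widths
    delta = [(y[k + 1] - y[k]) / h[k] for k in range(n - 1)]  # secant slopes
    d = [0.0] * n
    for k in range(1, n - 1):
        if delta[k - 1] * delta[k] > 0:
            # Same sign on both sides: weighted harmonic mean of the secants.
            w1 = 2 * h[k] + h[k - 1]
            w2 = h[k] + 2 * h[k - 1]
            d[k] = (w1 + w2) / (w1 / delta[k - 1] + w2 / delta[k])
        # Otherwise a local extremum or a flat stretch: the slope stays 0.
    # ASSUMPTION: simple one-sided end slopes, not a three-point end formula.
    d[0], d[-1] = delta[0], delta[-1]
    return d


def pchip(x, y, t):
    """Evaluate the piecewise cubic Hermite interpolant at t, x[0] <= t <= x[-1].

    x must be sorted with distinct values and at least two points.
    """
    d = pchip_slopes(x, y)
    k = min(max(bisect_right(x, t) - 1, 0), len(x) - 2)       # interval index
    h = x[k + 1] - x[k]
    s = (t - x[k]) / h
    h00 = (1 + 2 * s) * (1 - s) ** 2                          # Hermite basis
    h10 = s * (1 - s) ** 2
    h01 = s * s * (3 - 2 * s)
    h11 = s * s * (s - 1)
    return h00 * y[k] + h10 * h * d[k] + h01 * y[k + 1] + h11 * h * d[k + 1]
```

With known values 0, 1, 2, 10 at x = 0, 1, 3, 4, the interpolated curve rises throughout; with a flat stretch between two equal known values, it stays flat.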

Charles Hermite (1822–1901)

Other methods

mipolate adds forward, backward and nearest neighbour interpolation: use the previous, next or nearest known value.

Using the last known value is often dubious statistically, but it is a very common request in data management.

The other methods are provided mostly for completeness.

There is small print (option choices) about how to break ties when two values are equally near.
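A small Python sketch of these three rules, including one possible tie-breaking choice. The function fill and its method and ties arguments are hypothetical, not mipolate's option names. Returning nothing for nearest when x lies outside the known range is an assumption of this sketch.

```python
from bisect import bisect_left, bisect_right


def fill(known_x, known_y, x, method="forward", ties="before"):
    """Fill in a value at x from known values sorted by distinct known_x.

    method="forward"  uses the previous known value,
    method="backward" uses the next known value,
    method="nearest"  uses whichever is nearer; ties ("before" or "after")
    says which to use when both are equally near.
    Returns None if no suitable known value exists.
    """
    i = bisect_right(known_x, x) - 1       # last index with known_x[i] <= x
    j = bisect_left(known_x, x)            # first index with known_x[j] >= x
    prev = i if i >= 0 else None
    nxt = j if j < len(known_x) else None
    if method == "forward":
        pick = prev
    elif method == "backward":
        pick = nxt
    elif method == "nearest":
        if prev is None or nxt is None:
            pick = None                    # ASSUMPTION: no filling beyond the ends
        else:
            gap_before = x - known_x[prev]
            gap_after = known_x[nxt] - x
            if gap_before == gap_after:
                pick = prev if ties == "before" else nxt
            else:
                pick = prev if gap_before < gap_after else nxt
    else:
        raise ValueError("unknown method: " + method)
    return None if pick is None else known_y[pick]
```

With known values 10, 20, 60 at x = 1, 2, 6, the point x = 4 is equally near 2 and 6, so the result is 20 or 60 depending only on the tie rule.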

mipolate summary

Seven methods, and whether linear extrapolation is available:

Method            Linear extrapolation?
linear            yes
cubic             yes
(cubic) spline    yes
pchip             no
forward           no
backward          no
nearest           no

[Figure: maximum temperature (°F) against date, 7 May to 4 Jun, with interpolated curves; legend: spline, cubic, pchip, linear]

[Figure: minimum temperature (°F) against date, 7 May to 4 Jun, with interpolated curves; legend: linear, pchip, cubic, spline]

Simple messages

There are many interpolation methods to choose from.

They will often disagree, even for simple-looking instances.

Disagreement gives a handle on uncertainty.

In a real problem, simulate missings and test how well known values are estimated.

What makes most sense in your problem will reflect its dependence structure.
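One way to act on the advice to simulate missings, sketched in Python under simple assumptions: hide each interior known value in turn, interpolate it linearly from its two neighbours, and summarise the errors. The names are hypothetical. Hiding one value at a time is only one scheme; real gaps are often longer, and other interpolation methods can be swapped in.

```python
from math import sqrt


def linear(x0, y0, x1, y1, x):
    """Straight-line interpolation between (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def leave_one_out_rmse(xs, ys):
    """Hide each interior known value in turn, interpolate it linearly
    from its two neighbours, and return the root mean square error.

    xs must be sorted with distinct values and at least three points.
    """
    errors = []
    for k in range(1, len(xs) - 1):
        guess = linear(xs[k - 1], ys[k - 1], xs[k + 1], ys[k + 1], xs[k])
        errors.append(guess - ys[k])
    return sqrt(sum(e * e for e in errors) / len(errors))
```

A series that really is a straight line gives an error of zero; the more the series wiggles between known points, the larger the figure.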

Leo Breiman (1928–2005)

The main thing to learn about statistics is what is sensible and honest and possible.

Doubt and suspicion, as well as technical knowledge, are indispensable tools in statistics.

1973. Statistics: With a view towards applications. Boston: Houghton Mifflin, pp. 1, 18.

We turn from a project that is nearly done to one that is very much in progress.

Variograms

Variograms (more properly semivariograms) are plots of

(mean) half difference between values squared

versus

separation, distance or lag.

By a tempting abuse of terminology, we often use the same name for the underlying relationship as a function.

First known use of term ‘variogram’

Geoffrey H. Jowett (1922– ) in 1955:

The comparison of means of sets of observations from sections of independent stochastic series. Journal of the Royal Statistical Society. Series B (Methodological) 17: 208–227.

Spatial and time series

Variograms are central to one approach to spatial statistics, in this context often known as geostatistics.

Georges Matheron (1930–2000) is most often mentioned here.

But variograms can be very useful for time series too.

Time series too

Variograms are prominent in these texts on time series and longitudinal data:

Diggle, P.J. 1990. Time Series: A Biostatistical Introduction. Oxford: Oxford University Press.

Diggle, P.J., Heagerty, P.J., Liang, K-Y. and Zeger, S.L. 2002. Analysis of Longitudinal Data. Oxford: Oxford University Press.

User-written programs

Programs (NJC) are available from SSC for

variograms in one dimension: variog (2005)
variograms in two dimensions: variog2 (2005)

A combined and extended program vgram is under development.

Generality of variograms

So, variograms are – without undue strain – defined

for time series and for spatial series, whether regular or irregular,

as they just depend on separation being measured.

Plotting the mean for each distinct separation is a common, but not compulsory, convention.

A simple example: webuse air2

[Figure: Airline Passengers (1949-1960) against time, 1949 to 1960]

Variograms

vgram air

vgram air, recast(connected) xla(0(12)72)

[Figure: Semi-variogram of Airline Passengers (1949-1960); semi-variance against lag, lag axis labelled 0 to 80]

[Figure: Semi-variogram of Airline Passengers (1949-1960) as a connected plot; semi-variance against lag, lag axis labelled 0(12)72]

Comparison at different lags

We are plotting mean squared differences between values compared at lags 1, 2, 3, …

In this example, we have monthly data, so are comparing values 1, 2, 3, … months apart.

Many readers may be familiar with the same idea for calculating autocorrelation and cross-correlation.

The variogram – like the raw data plot – hints at a structure of trend plus seasonality.

Variograms of residuals, not data

Here, as elsewhere, it is a good idea to work with residuals, rather than the original data.

Time series modellers could have a happy time arguing which model was best for the airline data, but we just use a Poisson regression on time and look at its residuals.

On the versatility and virtuosity of Poisson regression, check out Gould, William. http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/

Sometimes, structure is this simple

[Figure, panel "Poisson regression": Airline Passengers (1949-1960) against time (in months), with fitted curve air = exp(-224.1 + .11747 time); n = 144, RMSE = 45.799, R2 = 85.5%]

[Figure, panel "Residuals from Poisson": Semi-variogram of response residual; semi-variance against lag]

A little more formally

The semivariogram γ(h) for response z is given by

$2\gamma(h) = A\{[z(i) - z(i+h)]^2\}$

where A{} denotes averaging over pairs of values at lag h.

As emphasised, using a mean is a convention. The fuller picture (literally!) is a plot of $[z(i) - z(i+h)]^2$ versus h. This is often known as a variogram cloud.

I borrow the notation A() from Whittle, P. 1970. Probability. Harmondsworth: Penguin.
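The definition can be sketched directly in Python for an equally spaced series. This is an illustration of the formula only, not the vgram code; the function name and the use of None for missing values are assumptions. The number of pairs at each lag is returned alongside the estimate.

```python
def semivariogram(z, max_lag):
    """Return {h: (gamma, pairs)} for lags h = 1, ..., max_lag.

    z is a list of values in time order at equal spacing; None marks a
    missing value. gamma is half the mean squared difference over all
    pairs (z[i], z[i + h]) with both values present. Lags with no such
    pairs are omitted.
    """
    result = {}
    for h in range(1, max_lag + 1):
        squares = [
            (z[i] - z[i + h]) ** 2
            for i in range(len(z) - h)
            if z[i] is not None and z[i + h] is not None
        ]
        if squares:
            result[h] = (0.5 * sum(squares) / len(squares), len(squares))
    return result
```

For z = 1, 2, 4, 7 the squared differences at lag 2 are 9 and 25, so the semivariance there is 8.5 from 2 pairs. Plotting the individual squared differences against h instead of their mean would give the variogram cloud.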

Where does the 2 come from?

The units of the semivariogram are those of the response squared.

Adding the variance to the graph as a reference line underlines the connection.

A non-standard formula for the variance is, for any i, j, $\tfrac{1}{2} E\{(z_i - z_j)^2\}$.
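A short derivation shows where the half comes from. It assumes, as conditions not spelled out above, that i and j differ and that the two values are uncorrelated with a common mean μ and variance σ²:

```latex
\begin{aligned}
\tfrac{1}{2} E\{(z_i - z_j)^2\}
  &= \tfrac{1}{2} E\{[(z_i - \mu) - (z_j - \mu)]^2\} \\
  &= \tfrac{1}{2} \left[\sigma^2 + \sigma^2 - 2\,\mathrm{cov}(z_i, z_j)\right] \\
  &= \sigma^2 - \mathrm{cov}(z_i, z_j) \\
  &= \sigma^2 \quad \text{when } \mathrm{cov}(z_i, z_j) = 0 .
\end{aligned}
```

So at lags where values are essentially uncorrelated the semivariogram should sit near the variance, and it falls below it where values at that lag are positively correlated.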

[Figure: Semi-variogram of response residual, with the variance shown as a reference line; semi-variance against lag]

Back to vgram

vgram (not yet public) is already quite general. We take possibilities one by one.

o With just one argument, the response, it checks for a tsset or xtset time variable and uses it to define separations if found. Note that panel data are supported for free.

o With just one argument otherwise, the order of the observations is taken to define position in time or space.

o With two arguments, the second variable is taken to define position. A width() option is required to specify the width of bins within which differences squared are averaged. Equal and unequal spacing can thus both be accommodated.

o With three arguments, the second and third variables are taken to define position. A width() option is required to specify the width of bins within which differences squared are averaged. Distance is calculated from coordinates using Pythagoras’ theorem.
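The binning idea in the last two cases can be sketched in Python. This is not the vgram code: the function name, the (x, y, z) triples and the rule that bin b holds distances from b × width up to but excluding (b + 1) × width are all assumptions made for illustration.

```python
from math import hypot


def binned_semivariogram(points, width):
    """Return {bin: (gamma, pairs)} from points given as (x, y, z) triples.

    The distance between two points is sqrt(dx**2 + dy**2) (Pythagoras).
    Bin b collects pairs with b * width <= distance < (b + 1) * width;
    gamma is half the mean squared difference in z within the bin.
    """
    sums, counts = {}, {}
    for i in range(len(points)):
        xi, yi, zi = points[i]
        for j in range(i + 1, len(points)):      # visit each pair once
            xj, yj, zj = points[j]
            b = int(hypot(xi - xj, yi - yj) // width)
            sums[b] = sums.get(b, 0.0) + (zi - zj) ** 2
            counts[b] = counts.get(b, 0) + 1
    return {b: (0.5 * sums[b] / counts[b], counts[b]) for b in sorted(sums)}
```

Positions on a single axis, equally spaced or not, are handled by setting every y to 0. The pair counts come back with the estimates because a bin with few pairs deserves less trust.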

Why not just use autocorrelation?

Variograms are defined for a wider class of processes. Autocorrelation functions require weak stationarity; variograms are defined for processes with stationary increments.

Variograms are more flexible in the face of irregular spacing.

The very wide use of autocorrelation reflects custom and familiarity as well as intrinsic merit.

A further example

We look at rainfalls for 8 May 1986 (a single day) for 467 stations in Switzerland.

[Figure: rainfall 8 May 1986 (mm), shown in classes with percentile breaks at 5, 25, 50, 75, 95%: up to 3.3, 3.3 - 9.9, 9.9 - 15.2, 15.2 - 26.3, 26.3 - 39.4, above 39.4]

[Figure: Semi-variogram of rainfall 8 May 1986 (mm); semi-variance against lag, where lags are 10 km bands]

How much information?

Optionally, vgram can save the semivariogram results to new variables.

Keeping track of the number of pairs used at each lag is important.

Here we exploit the feature that spikeplot can show frequencies on a square root scale.

[Figure: spike plot of the number of pairs at each lag; frequency on root scale against lag]

To do list

variogram clouds

robust estimators

more flexible binning

spherical distances too

direction as well as lag

model fitting (valid functional forms)

use for interpolation (and smoothing) (kriging, Gaussian process regression)

Variogram virtues

Defined for time and spatial series. Defined for regular and irregular series. Can help identify and check for structure.

… even if you have no interest in their most mentioned use, as a means towards the end of spatial interpolation.

This paper…

This paper fills a much needed gap in the literature.

See Jackson, A. 1997. Chinese acrobatics, an old-time brewery, and the “much needed gap”: The life of Mathematical Reviews. Notices of the American Mathematical Society 44: 330–337.

Acknowledgments

Historical portraits: Wikipedia.

MATLAB code for pchip: Moler, C. 2004. Numerical Computing with MATLAB. Philadelphia: SIAM. Chapter 3. http://www.mathworks.com/moler/interp.pdf

The Swiss rainfall data can be found here: http://www.ai-geostats.org/pub/AI_GEOSTATS/AI_GEOSTATSData/sic97data_01.zip
