
Statistical Analysis of Biological Data and Time-Series

Introduction to Optimal Interpolation and Variational Analysis

Alexander Barth, Aida Alvera-Azcárate, Pascal Joassin,

Jean-Marie Beckers, Charles Troupin

[email protected]

Revision 1.1

Chapter 1

Theory of optimal interpolation and variational analysis

Contents

1.1 The gridding problem
1.2 Sampling locations
1.3 True field at the sampling locations
1.4 Observation errors
1.5 Data analysis and gridding methodologies
1.6 Objective analysis
1.7 Cressman method
1.8 Optimal interpolation
    1.8.1 Error covariances
    1.8.2 Assumptions
    1.8.3 Analysis
1.9 Variational Inverse Method
    1.9.1 Equivalence between OI and variational analysis
    1.9.2 Error fields by analogy with optimal interpolation
    1.9.3 Diva
    1.9.4 Example
1.10 Summary

1.1 The gridding problem

Gridding is the determination of a field φ(r) on a regular grid of positions r based on arbitrarily located observations. Often the vector r lies in a 2D, 3D or even 4D space.


• The fewer observations are available, the harder the gridding problem is

• In oceanography, in situ observations are sparse

• Observations are inhomogeneously distributed in space and time (more observations in the coastal zones and in summer)

• The variability of the ocean is the sum of various processes occurring at different spatial and temporal scales.

Figure 1.1: Example of an oceanographic field.

• Figure 1.1 shows an idealized square domain with a barrier (e.g. a peninsula or a dike).

• This field is the true field that we want to reconstruct based on observations. Let's assume that the field represents temperature.

• The barrier suppresses exchanges between its two sides.

• The field varies smoothly over some length-scale.


1.2 Sampling locations

Figure 1.2: Sampling locations within the domain

• In regions where a measurement campaign has been carried out, a higher spatial coverage is achieved.

• Based on the values of the field at the shown locations, we will estimate the true field.

1.3 True field at the sampling locations


Figure 1.3: Values of the true field extracted at the locations of the observations.

• some information about the position of the structures and fronts is lost

• no method can provide exactly the true field. But the more information about its structure we include in the analysis, the closer we can get to the true field.


1.4 Observation errors

Observations are in general affected by different error sources and other “problems” that need to be taken into account:

1. Instrumental errors (limited precision or possible bias of the sensor)

2. Representative errors: the observations do not necessarily correspond to the field we want to obtain. For example, we want to have a monthly average, but the observations are instantaneous (or averages over a very short period of time).

3. Synopticity errors: not all observations are taken at the same time.

4. Other error sources: human errors (e.g. permutation of longitude and latitude), transmission errors, malfunctioning of the instrument, wrong decimal separators...


Figure 1.4: Observations with errors

• a random perturbation was added to the observations.

• This simulates the impact of the different error sources.

• To simplify matters, each observation was perturbed independently.

1.5 Data analysis and gridding methodologies

Because observations have errors, it is always better to produce a field approximation and never a strict interpolation.


Figure 1.5: Gridded field using linear interpolation. This method is implemented in the function griddata of Matlab and GNU Octave. Right panel shows true field.

• Figure 1.5 shows what would happen if the observations were interpolated linearly (a minimal sketch of this approach is given after this list).

• The domain is decomposed, based on the Delaunay triangulation, into triangles whose vertices are the locations of the data points.

• Within each triangle, the value is interpolated linearly.

• The noise on the observations is in general amplified when higher order schemes, such as cubic interpolation, are used.
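As a concrete illustration, here is a minimal Python sketch of this kind of piecewise-linear gridding using SciPy's griddata (which, like the Matlab/GNU Octave function, relies on a Delaunay triangulation); the synthetic observations and grid are assumptions of the sketch, not the data of the example above.

import numpy as np
from scipy.interpolate import griddata

# Synthetic scattered observations (purely illustrative values)
rng = np.random.default_rng(0)
obs_x = rng.uniform(0, 10, 50)
obs_y = rng.uniform(0, 10, 50)
obs_val = 19 + 2 * np.sin(obs_x / 3) * np.cos(obs_y / 3)

# Regular target grid
xi, yi = np.meshgrid(np.linspace(0, 10, 100), np.linspace(0, 10, 100))

# Piecewise-linear interpolation on the Delaunay triangulation of the data
# points; grid points outside the convex hull of the observations become NaN
field_lin = griddata((obs_x, obs_y), obs_val, (xi, yi), method="linear")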

There are a number of methods that have been traditionally used to solve the gridding problem:

• Subjective: drawing by hand (figure 1.6)

• Objective: predefined mathematical operations


Figure 1.6: Isohyets (lines of constant precipitation) drawn by hand (from http://www.srh.noaa.gov/hgx/hurricanes/1970s.htm)

1.6 Objective analysis

As opposed to the subjective method, objective analysis techniques aim to use mathematical formulations to infer the value of the field at unobserved locations based on the observations.

• first guess is removed from observations

• Gridding procedure (also called “analysis”):

Gridded field = first guess of the field + weighted sum of observation anomalies (observation minus first guess at the observation locations)
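In symbols (the notation here is ours, not the document's), the analysis at a position r can be written as

\varphi_a(\mathbf{r}) = \varphi_b(\mathbf{r}) + \sum_{i=1}^{N} w_i(\mathbf{r}) \left[ d_i - \varphi_b(\mathbf{r}_i) \right]

where φ_b is the first guess, d_i the observation at location r_i and w_i(r) the weight given to that observation when estimating the field at r.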

There are several ways to define the weights, which result in different gridding techniques. In the following, we review the most commonly used gridding methods.

1.7 Cressman method

Cressman weights depend only on the distance between the location where the value of the field should be estimated and the location of the observation (figure 1.7).


Figure 1.7: Cressman weights (blue) and Barnes weights (red).

The search radius is the typical control parameter and defines the length-scale over which an observation is used. This length scale can be made to vary in space depending on data coverage and/or physical scales. This parameter is chosen by the users based on their knowledge of the domain and the problem.
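A minimal Python sketch of Cressman weighting follows; the classic weight function w = (R² − d²)/(R² + d²) for distances d < R (and zero beyond the search radius R) is assumed here, and all names are illustrative.

import numpy as np

def cressman_analysis(grid_x, grid_y, obs_x, obs_y, obs_val, R):
    """Estimate the field at each grid point as a Cressman-weighted average
    of the observations lying within the search radius R."""
    field = np.full(grid_x.shape, np.nan)
    for idx in np.ndindex(grid_x.shape):
        d2 = (obs_x - grid_x[idx])**2 + (obs_y - grid_y[idx])**2
        w = np.where(d2 < R**2, (R**2 - d2) / (R**2 + d2), 0.0)
        if w.sum() > 0:   # no estimate where no observation lies within R
            field[idx] = np.sum(w * obs_val) / np.sum(w)
    return field

Grid points without any observation within R keep the value NaN, which corresponds to the first limitation listed below.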

Figure 1.8: Gridded field by Cressman weighting. Right panel shows true field.

The Cressman weighting is a very simple and numerically quite efficient method. However, it suffers from some limitations which are apparent in figure 1.8.

• No estimate can be obtained at locations where no observation lies within the search radius R.

• In regions with very few observations, the method can return a discontinuous field.

• The presence of barriers cannot be taken into account easily.

• All observations are assumed to have a similar error variance since the weighting is based only on distance.


1.8 Optimal interpolation

• The first guess and the observations have errors

• These errors are characterized by the error covariance

• Optimal interpolation seeks the optimal combination between the first guess and the observations

1.8.1 Error covariances

• Need to be specified for observation and first guess

• Error covariances describe how different variables or the same variable at different locations are related on average.

• The error covariance is in general a matrix.

• the square root of the diagonal elements is the average error of the observations and the first guess

• the off-diagonal elements describe the relationships between the variables

• Covariance is directly related to the correlation

1.8.2 Assumptions

Hypotheses concerning those errors:

• observations and the first guess are unbiased, i.e. the error is zero on average.

• error covariance matrices are known. Those matrices are related to the expected magnitude of the error.

• the error of the first guess is independent of the observation errors.

1.8.3 Analysis

The optimal interpolation (OI) scheme can be derived as the Best Linear Unbiased Estimator (BLUE) of the true state, which has the following properties:

• The estimator is linear in the first guess and observations

• The estimator is not biased

• This estimate has a minimal total error variance, i.e. no other linear unbiased estimator would have an error variance lower than that of the BLUE estimator.

From these assumptions the optimal weights of the observations can be derived to grid the field. The optimal weights depend on the error covariances. In particular,


• Large expected error of the observations compared to the first guess → more weight given to the first guess than to the observations (and vice versa)

• Variables are strongly correlated spatially → smooth gridded field.

• See the file diva intro slides.pdf if you want to know how the weights are computed; a generic sketch of the standard formula is given after this list

• An error estimate can also be computed by optimal interpolation.
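For orientation only, here is a minimal numpy sketch of the standard BLUE/OI update. The notation (first guess xb, background error covariance B, observation operator H, observation error covariance R, observations y) and the direct matrix inversion are assumptions of this sketch, not the document's implementation.

import numpy as np

def optimal_interpolation(xb, B, H, R, y):
    """Standard BLUE/OI analysis step (minimal sketch)."""
    # Gain giving more weight to the observations when their expected error
    # is small compared to the first-guess error: K = B H^T (H B H^T + R)^-1
    K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)
    xa = xb + K @ (y - H @ xb)   # analysis = first guess + weighted innovation
    Pa = B - K @ H @ B           # analysis error covariance (error estimate)
    return xa, Pa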

Figure 1.9: Gridded field using optimal interpolation. Right panel shows true field.

As an illustration, the field obtained by optimal interpolation is shown in figure 1.9.

Figure 1.10: Relative error variance of the gridded field by optimal interpolation

• The error variance of the field is clearly related to the location of the observations.

• This relative error is bounded between 0 and 1 because it is scaled by the error variance of the first guess.


1.9 Variational Inverse Method

The variational inverse method (VIM) or variational analysis (Brasseur et al., 1996) aims to find a field with the following characteristics:

• smooth

• somewhat close to first guess

• close to the observed values

This is achieved by defining a cost function which penalizes a field that doesn't satisfy those requirements (a common form is written out after the list below).

• “close” is quantified using the RMS error

• “smooth” is quantified by averaging the square of the first and second derivatives of the field (for a slowly varying field those derivatives are small).

• those different requirements are weighted to define what is more important, a smooth field or a field close to observations.
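One common way of writing such a cost function (following the formulation used in the Diva literature, e.g. Brasseur et al., 1996; the weights μ_j and the coefficients α_0, α_1, α_2 are the tuning parameters referred to above) is

J[\varphi] = \sum_{j=1}^{N_d} \mu_j \left[ d_j - \varphi(\mathbf{r}_j) \right]^2 + \left\| \varphi - \varphi_b \right\|^2,
\qquad
\left\| \varphi \right\|^2 = \int_D \left( \alpha_2 \, \nabla\nabla\varphi : \nabla\nabla\varphi + \alpha_1 \, \nabla\varphi \cdot \nabla\varphi + \alpha_0 \, \varphi^2 \right) dD

The first term penalizes the misfit to the observations d_j; the norm penalizes large first and second derivatives of the deviation from the first guess φ_b, which enforces smoothness.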

1.9.1 Equivalence between OI and variational analysis

• The variational inverse method is identical to the optimal interpolation analysis if the weights are defined as the inverse of the covariance matrix used in optimal interpolation analysis.

• This allows the weights to be computed from the error covariances

• Error covariances can be obtained from the statistical behaviour of the field

1.9.2 Error fields by analogy with optimal interpolation

Error fields can be derived since:

• The optimal interpolation analysis is equivalent to VIM if the weights used by VIM are the inverse of the error covariances used by optimal interpolation.

• In the context of optimal interpolation, it can be shown that the error field equals the analysis of the covariance fields

• ⇒ error field of VIM equals analysis (by VIM) of covariance fields (the input of the analysis tool is a vector containing the covariance of the data points with the point at which the error estimate is to be calculated)


1.9.3 Diva

Diva is a program written in Fortran and shell scripts that performs a variational analysis. This program is available at http://modb.oce.ulg.ac.be/modb/diva.html. Its features include:

• graphical interface or command line

• 3D fields are computed by performing a succession of 2D analyses

• generalized cross-validation for parameter estimation

• Land and topography are taken into account.

1.9.4 Example

Figure 1.11: Field reconstructed using Diva implementing the variational inverse method. Right panel shows true field.

• The results are similar to the field obtained by optimal interpolation.

• But now, the land-sea boundary is taken into account.


Figure 1.12: Relative error of the field reconstructed using Diva

• The error is lowest near the observations and tends to 1 far away from them.

• The presence of the barrier also impacts the estimation of error variance.

• The error estimate should be interpreted cautiously since we assumed that the error covariances used are correct.

1.10 Summary

• A first guess is needed

• Observations have errors (in particular the observations do not necessarily represent what we want)

• Error covariances are used to quantify the magnitude of the errors

• gridded field of minimum expected error → optimal interpolation

• smooth gridded field close to observations → variational inverse method

• Both approaches are actually the same (but some things are easier to do in one approach than in the other)


Chapter 2

Modelling of error covariances

“Garbage in, garbage out”

2.1 Background error covariance matrix

The error covariance needs to be specified for optimal interpolation and variational analysis. It plays a crucial role in the accuracy of the output. A common test in the gridding problem is to apply the method to a single isolated data point. The resulting field is proportional to the error covariance of the background.

2.1.1 Parametric error covariance

• In most optimal interpolation approaches, the error covariance is explicitly modelled based on the decomposition into variance and correlation.

• For simplicity, the variance of the background field is often assumed constant.

• The correlation matrix is parameterized using analytical functions.

• The correlation is essentially a function of distance

• The correlation is one for zero distance

• Generally, the correlation decreases if distance increases

• The correlation function is often a Gaussian (a minimal construction is sketched after this list).

• The length-scale of the function is an important parameter since it determines over which distance a data point can be extrapolated.

• This length-scale is the typical scale of the processes in the field. In the ocean, mesoscale processes have a length-scale of the order of the radius of deformation (figure 2.1).

• Biochemical tracers advected by ocean currents will also have this typical length-scale.
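A minimal sketch of such a parametric background error covariance follows: a constant variance sigma2 and a Gaussian correlation with length-scale L are assumed, and all names are illustrative.

import numpy as np

def gaussian_background_covariance(x, y, sigma2, L):
    """Covariance between all pairs of points: a constant variance times a
    Gaussian correlation that depends only on the distance between points."""
    pts = np.column_stack((x, y))
    d2 = np.sum((pts[:, None, :] - pts[None, :, :])**2, axis=-1)
    return sigma2 * np.exp(-d2 / (2 * L**2))

The effect of islands and barriers or of additional fields, discussed below, would be introduced by further modifying the entries of this matrix.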


Figure 2.1: Gulf Stream and eddy field as measured by satellite altimetry on 28 January 2008 (units are m). The diameters of these eddies are related to the radius of deformation.

Modelling the error covariance matrix directly offers numerous “tweaks” so that the error covariance better reflects our expectations of how variables are correlated.

• The effect of islands and barriers can be taken into account by imposing that the error covariance between two locations is zero if a straight line connecting them intersects a land point.

• Additional fields can be used to model the error covariance. For example, satellite altimetry gives an indication of which water masses are likely to have similar properties.

The impact of an observation can thus be limited to structures of similar sea surface height. The impact of the sea surface height on the covariance for a realistic example is shown in figure 2.2.


Figure 2.2: (a) Sea surface height. (b) Isotropic covariance. (c) Covariance modified by sea surface height.

2.1.2 Error covariance based on differential operators

• For variational inverse methods, the concept of error covariance is implicit. Instead of modelling the error covariance directly, the variational approach models its inverse with the help of a differential operator.

• Additional constraints can be included to modify the implicit covariance function.

• For example, an advection constraint is used which imposes that the field is (approximately) a stationary solution of the advection-diffusion equation.

• The advection constraint thus aims at modelling the effect of the velocity field on the reconstructed field. Such a constraint is introduced by adding a term to the cost function.

• The effect of this constraint is to propagate the information along the direction ofthe flow (figure 2.3). This option is available in Diva.


Figure 2.3: (a) Flow field. (b) Correlation modified by flow field.

2.2 Observation error covariance

The observations are often assumed to be uncorrelated. This may be a valid assumption for the instrumental error, but it is not in general for the representativity error, since this error is spatially correlated. However, it is complicated to use a non-diagonal matrix for the observation error covariance since it must be inverted. Common ways to deal with this are:

• Binning of observations: instead of working with all observations, those which are close together are averaged into bins, and each bin is treated as a single observation (a minimal sketch follows this list).

• Inflating the error variance: reducing the weight of the observations in the analysis mimics the effect of spatially correlated observation errors.

• In situ observations often have a high vertical and/or temporal resolution (e.g. vertical profiles, moorings) but a low horizontal resolution, so the error correlation of the observations matters mainly in the time and vertical dimensions. However, if 2D analyses are performed independently for each vertical level and time instance, on data that have been vertically and temporally interpolated (or averaged, similar to the binning approach), the correlation in those dimensions does not play a role.
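A minimal sketch of the binning idea from the first item above; the regular bin size and the plain averaging are assumptions of this illustration.

import numpy as np

def bin_observations(obs_x, obs_y, obs_val, bin_size):
    """Average the observations falling in the same spatial bin and treat
    each bin as a single observation located at the bin centre."""
    binned = {}
    for bx, by, v in zip(np.floor(obs_x / bin_size).astype(int),
                         np.floor(obs_y / bin_size).astype(int), obs_val):
        binned.setdefault((bx, by), []).append(v)
    x_new = np.array([(i + 0.5) * bin_size for i, _ in binned])
    y_new = np.array([(j + 0.5) * bin_size for _, j in binned])
    v_new = np.array([np.mean(vals) for vals in binned.values()])
    return x_new, y_new, v_new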

2.3 Summary

• the more realistic the error covariances are, the better the analysis will be

• in optimal interpolation, the error covariances are often analytical functions

• in the variational inverse method, the error covariances are implicitly defined. The cost function penalizes abrupt changes of the field and large deviations from the first guess


• Key parameters:

• correlation length of the error of the first guess

• error variance of the observation and error variance of the first guess

• their ratio (first-guess error variance over observation error variance) is called the signal-to-noise ratio


Chapter 3

Parameter optimization and cross-validation

Optimal interpolation and the variational inverse method are only optimal if their underlying error covariance matrices are known. Even after one particular class of error covariance matrices has been adopted, several parameters have to be determined, such as the correlation length and the signal-to-noise ratio. The optimality of any method is obviously lost if some parameters are wrong.

3.1 A priori parameter optimization

By analysing the difference between the background estimate and the observations, useful information on their statistics can be derived. Since the errors of the background estimate and the errors of the observations are supposed to be independent, one can show that the covariance of their difference is the sum of the two error covariance matrices.

This relationship can be used to estimate some parameters of the error statistics if the following assumptions are true:

• The error variance of the background is everywhere the same

• The correlation of the background error between two points depends only on the distance between them.

• The observation errors have all the same variance and are uncorrelated.

The shape of the covariance function is determined empirically by the following steps (a minimal sketch of this procedure is given after figure 3.1):

• We estimate the error covariance by averaging over a sample of couples of observations (or all possible couples of observations).

• The couples are binned according to their distance.

• The covariance is computed by averaging over all couples in a given bin.

• This procedure is repeated for all bins.


• The analytical function is fitted to obtain the correlation length-scale and the error variance of the first guess σ².

• The first data bin with zero radial distance is not taken into account for the fitting since it includes the observation error variance. The variance in the first bin can be used to estimate this error variance.

Figure 3.1: Determination of the correlation length-scale and error covariance of the background; the plot shows the binned data covariance, the error on the covariance, the data used for fitting, the fitted curve and the relative weight during fitting. This approach is realized with the tool divafit.
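A minimal Python sketch of the empirical part of this procedure follows; the Gaussian covariance model and all names are assumptions of this illustration, not the function actually used by divafit.

import numpy as np
from scipy.optimize import curve_fit

def binned_covariance(x, y, anomaly, bin_width, nbins):
    """Covariance of (observation - background) averaged over all couples of
    observations falling in the same distance bin."""
    i, j = np.triu_indices(len(anomaly), k=1)          # all couples
    dist = np.hypot(x[i] - x[j], y[i] - y[j])
    prod = anomaly[i] * anomaly[j]
    edges = np.arange(nbins + 1) * bin_width
    which = np.digitize(dist, edges) - 1
    cov = np.array([prod[which == b].mean() if np.any(which == b) else np.nan
                    for b in range(nbins)])
    return edges[:-1] + bin_width / 2, cov

def gauss_cov(r, sigma2, L):
    """Assumed analytical covariance model fitted to the binned values."""
    return sigma2 * np.exp(-r**2 / (2 * L**2))

# centers, cov = binned_covariance(x, y, anomaly, bin_width=0.1, nbins=10)
# popt, _ = curve_fit(gauss_cov, centers[1:], cov[1:])   # skip the first bin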

3.2 Validation

Validation consists in using an independent data set to determine the accuracy of the results. This data set may be observations explicitly set aside for validation purposes or newly available data. One can use, for example, satellite SST to validate a climatology obtained from in situ data.

Some poorly known parameters can be varied to minimize the RMS error between the analysis and the validation data set. This provides a means to objectively calibrate some parameters. It is crucial for this step that the validation data set is independent of the observations used in the analysis.

Exercise 1

What would be the “optimal” signal-to-noise ratio if the analysed observations included the validation data set (in the context of optimal interpolation or the variational inverse method, with all other parameters kept constant)?


3.3 Cross-validation

In the cross-validation approach,

• one data point is excluded from the data-set

• the analysis is performed without it.

• The analyzed field is interpolated at the location of the excluded point. This interpolated value is thus not influenced by the excluded data point. Their difference thus gives an indication of the accuracy of the field at that location.

• This procedure is repeated for all observations

• The RMS error is computed from the squared differences over all observations

However, this procedure would require an additional analysis for every observation. For most applications, this is prohibitive in terms of computation. Craven and Wahba (1979) showed a numerically efficient way to compute the cross-validation RMS error with only one additional analysis.

If a large number of observations are available, then a different approach can be adopted to estimate the error of the analysis:

• A small subset of randomly chosen observations is set aside (the cross-validation data set).

• The analysis is performed without the cross-validation data set.

• The result is compared to the observations initially set aside and an RMS error is computed.

• The two last steps are repeated to optimize certain parameters of the analysis.

• Once the optimal parameters are derived, a final analysis is carried out using all data.

The size of the cross-validation data set has to be sufficiently large so that the error estimate is robust, but at the same time its size has to be a negligible fraction of the total data set (typically 1%). This explains why a large amount of data is necessary. This approach is used in DINEOF to determine the optimal number of EOFs. A minimal sketch of this subset approach is given below.
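In the sketch, analyse is a hypothetical callable standing in for the actual analysis tool: it is assumed to take observation positions, values and parameters and to return a function that evaluates the analysed field at arbitrary positions. The 1% fraction follows the text above.

import numpy as np

def cross_validation_rms(x, y, val, analyse, params, cv_fraction=0.01, seed=0):
    """Set a random subset of the observations aside, run the analysis
    without them and return the RMS misfit at the withheld locations."""
    rng = np.random.default_rng(seed)
    n = len(val)
    cv = rng.choice(n, size=max(1, int(cv_fraction * n)), replace=False)
    keep = np.setdiff1d(np.arange(n), cv)
    field = analyse(x[keep], y[keep], val[keep], **params)   # hypothetical call
    misfit = field(x[cv], y[cv]) - val[cv]
    return np.sqrt(np.mean(misfit**2))

The RMS error is computed for each candidate set of parameters and the set giving the lowest value is retained for the final analysis with all data.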

• An estimate of the global accuracy of an analysis is already useful information in itself.

• But by performing the analysis with several sets of parameters, one can compute the cross-validator for each analysis.

• The set of parameters producing the lowest cross-validator is thus the optimal one.


3.4 Summary

• Don’t skip the validation of your results!

• If you don't have an additional dataset, cross-validation allows you to estimate the error of your analysis:

• set some data aside

• do the analysis

• compute the RMS error between the analysis and the data set aside

• By minimizing the RMS error, you can estimate some parameters of your analysis


Chapter 4

Real world example

The following shows an application of Diva by Troupin et al. (2008) to compute three-dimensional fields of temperature and salinity in the Northern Atlantic.

4.1 Data

Data were gathered from various databases in the region 0–60°N × 0–50°W. We processed them to remove duplicates, detect outliers and perform vertical interpolation with the Weighted Parabolas method (Reiniger and Ross, 1968).


(a) Winter, 200m (b) Winter, 3500m

(c) Summer, 200m (d) Summer, 3500m

Figure 4.1: Localisation of temperature observations at 200 m and 3500 m in winter and summer.

4.2 Finite-element mesh

Resolution of the minimization problem relies on a highly optimized finite-element technique, which provides computational efficiency independent of the number of data points and allows real boundaries (coastlines and bottom) to be taken into account.


Figure 4.2: Meshes generated at 200 m (left) and 3500 m (right).

4.3 Results

Our results are compared with the 1/4° climatology from the World Ocean Atlas 2001 (WOA01 in the following; Boyer et al., 2005).


(a) Diva, 200m (b) WOA01, 200m

(c) Diva, 3500m (d) WOA01, 3500m

Figure 4.3: Comparison of the analysed temperature field with Diva (left) and WOA01 (right).

Acknowledgements

• The National Fund of Scientific Research (F.R.S.-FNRS, Belgium)

• SeaDataNet project

• SESAME, Southern European Seas: Assessing and Modelling Ecosystem changes

• ECOOP, European COastal-shelf sea OPerational observing and forecasting system


Bibliography

Abramowitz, M. and A. Stegun, 1964: Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover Publications, New York.

Alvera-Azcarate, A., A. Barth, M. Rixen, and J.-M. Beckers, 2005: Reconstruction of incomplete oceanographic data sets using Empirical Orthogonal Functions. Application to the Adriatic Sea Surface Temperature. Ocean Modelling, 9, 325–346, doi:10.1016/j.ocemod.2004.08.001.

Beckers, J.-M. and M. Rixen, 2003: EOF calculation and data filling from incomplete oceanographic datasets. Journal of Atmospheric and Oceanic Technology, 20, 1839–1856.

Boyer, T., S. Levitus, H. Garcia, R. Locarnini, C. Stephens, and J. Antonov, 2005: Objective analyses of annual, seasonal, and monthly temperature and salinity for the world ocean on a 0.25° grid. International Journal of Climatology, 25, 931–945.

Brankart, J.-M. and P. Brasseur, 1996: Optimal analysis of in situ data in the Western Mediterranean using statistics and cross-validation. Journal of Atmospheric and Oceanic Technology, 13, 477–491.

Brasseur, P., J.-M. Beckers, J.-M. Brankart, and R. Schoenauen, 1996: Seasonal temperature and salinity fields in the Mediterranean Sea: Climatological analyses of an historical data set. Deep-Sea Research, 43, 159–192.

Craven, P. and G. Wahba, 1979: Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31, 377–403.

Girard, D., 1989: A fast Monte-Carlo cross-validation procedure for large least squares problems with noisy data. Numerische Mathematik, 56, 1–23.

Reiniger, R. and C. Ross, 1968: A method of interpolation with application to oceanographic data. Deep-Sea Research, 15, 185–193.

Troupin, C., M. Ouberdous, F. Machin, M. Rixen, D. Sirjacobs, and J.-M. Beckers, 2008: Three-dimensional analysis of oceanographic data with the software DIVA. Geophysical Research Abstracts, 4th EGU General Assembly, volume 10.
