
Data-driven modelling. Part 4.

Models of uncertainty

Dimitri P. Solomatine

UNESCO-IHE Institute for Water Education, Hydroinformatics Chair

Let’s consider a (physically-based) Model

Example: a hydrological conceptual model

[Figure: schematic of the conceptual hydrological model, showing inputs, parameters, states, the transform function and the output.]

Legend:
SF – Snow
RF – Rain
EA – Evapotranspiration
SP – Snow cover
IN – Infiltration
R – Recharge
SM – Soil moisture
CFLUX – Capillary transport
UZ – Storage in upper reservoir
PERC – Percolation
LZ – Storage in lower reservoir
Q0 – Fast runoff component
Q1 – Slow runoff component
Q – Total runoff, Q = Q0 + Q1

Model output: deterministic forecasts. How to find the 90% uncertainty bounds?

[Figure: hydrograph of the deterministic model forecast; axes: Time (days), Discharge (m3/s).]

Steps in uncertainty analysis

• Identification of sources of uncertainty (input, parameter, model, operational)
• Quantification of uncertainty (e.g. distribution and range of parameters)
• Reduction of uncertainty (e.g. better understanding of the model, accurate measurement, etc.)
• Studying propagation of uncertainty through the model (e.g. by Monte Carlo simulation)
• Quantification of uncertainty in the model outputs (e.g. analysis of the results)
• Application of the uncertain information in the decision-making process

Sources of model uncertainty

y = M(x, p) + εs + εθ + εx + εy

[Figure: schematic of the conceptual hydrological model (SF – Snow, RF – Rain, SP – Snow cover, IN – Infiltration, R – Recharge, Q0 – Fast runoff component, Q – Total runoff), on which the sources of uncertainty are indicated.]

Observational (data) uncertainty (εx, εy):
a) measurement errors due to both instrumental and human errors
b) errors due to inadequate representativeness of a data sample, caused by scale incompatibility or by a difference in time and space between the variable measured by the instrument and the corresponding model variable

Model (structural) uncertainty εs

Parametric uncertainty εθ

Measures of Uncertainty

• Probability distribution
• Confidence bounds (two quantiles, say 5% and 95%)
• Other measures: range, variance, other moments

[Figure: left panel, predictive probability P(H <= h) against observed river stage h (m); right panel, discharge (m3/s) against time (hr).]

Uncertainty of a calibrated (“optimal”) model

[Figure: output Y against time, showing the model output ŷ, the measured value y and the actual (unknown) value y*, with the model error and the observation error marked.]

Uncertainty of an optimal model M:
• Model M is calibrated on measured data y
• We say that the uncertainty of the model M is manifested in the model error
• This error incorporates all uncertainties due to observational errors, inaccurately estimated parameters and inadequate model structure
• We can build an ML model U to predict the pdf of the error of the model M

Propagation of uncertainty through the model

ŷ = M(x, p), where x = input, p = parameters

Uncertainty in x and p propagates to the output y:
• pdf of parameters (pdfp) → pdf of output (pdfy)
• pdf of inputs (pdfx) → pdf of output (pdfy)

Suggested approach to “building up” model uncertainty

uncertainty of an optimal model M(p*)
+ uncertainty of M(p*) due to DATA uncertainty
+ uncertainty of M(p) due to PARAMETERS uncertainty
+ uncertainty of M(p) due to STRUCTURAL uncertainty
= uncertainty of a model class M(p), given the probabilistic properties of parameters and data

Monte Carlo Simulation

Monte Carlo casino: roulette wheel

It is a random number generator: it uses a discrete uniform distribution over the integers 0 to 36

Monte Carlo simulation in analysing parametric uncertainty

Monte Carlo sampling: illustration

y = M(x, s, θ) + εs + εθ + εx + εy

Monte Carlo simulation of parametric uncertainty

Consider the model M calculating y (e.g., discharge):
ŷ(t) = M(X(t), p)
where
X(t) = vector of inputs (precipitation, temperature, etc.), known for t = 1, …, T
p = vector of parameters (soil properties, roughness, etc.)

Monte Carlo approach:
• sample N parameter vectors pi
• run the model for each of them, ŷi(t) = M(X(t), pi), and generate N outputs (keep only some of them if GLUE is used)
• assess the distribution of the outputs ŷi(t) for each time moment t (or its parameters: mean, variance, prediction intervals, quantiles)
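The Monte Carlo approach above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_model` is a hypothetical one-parameter linear reservoir standing in for M, the uniform parameter range is invented, and the 5% and 95% quantiles are computed by simple interpolation between order statistics.

```python
import random
import statistics

def toy_model(x, p):
    """Hypothetical stand-in for M(X(t), p): a linear reservoir with one parameter k."""
    k = p[0]
    storage, out = 0.0, []
    for rain in x:
        storage += rain
        q = k * storage          # outflow is a fraction k of the current storage
        storage -= q
        out.append(q)
    return out

def monte_carlo(model, x, sample_p, n_runs, seed=0):
    """Run the model for n_runs sampled parameter vectors; return one output series per run."""
    rng = random.Random(seed)
    return [model(x, sample_p(rng)) for _ in range(n_runs)]

def quantile(values, q):
    """Empirical quantile by linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def summarize(runs, lower=0.05, upper=0.95):
    """For each time step: mean, variance and two quantiles over all runs."""
    summary = []
    for t in range(len(runs[0])):
        vals = [run[t] for run in runs]
        summary.append({
            "mean": statistics.fmean(vals),
            "var": statistics.pvariance(vals),
            "q_low": quantile(vals, lower),
            "q_up": quantile(vals, upper),
        })
    return summary

# Assumed example data: a short rainfall series and k sampled uniformly from [0.1, 0.9]
rain = [0.0, 5.0, 12.0, 3.0, 0.0, 0.0, 8.0, 1.0]
runs = monte_carlo(toy_model, rain, lambda rng: [rng.uniform(0.1, 0.9)], n_runs=2000)
bounds = summarize(runs)
```

Each entry of `bounds` describes the output distribution at one time moment; `q_low` and `q_up` form a 90% prediction interval for this toy setting.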

Issues with Monte Carlo

There is, however, a problem during model operation:
How to assess the parametric uncertainty of the model M for t = T+1, when new input data X(T+1) is fed?

It is difficult since MC analysis is computationally intensive:
1) convergence of the Monte Carlo simulation is very slow (O(N^-0.5)), so a large number of runs is needed to establish a reliable estimate of uncertainties
2) the number of simulations increases exponentially with the dimension d of the parameter vector (O(n^d)) if the entire parameter domain is to be covered

Idea: encapsulate the results of MC simulation in a machine learning model (MLUE method, to be considered later)

Economical sampling: Latin Hypercube Sampling (LHS), McKay 1979

• the range of each variable is divided into M equally probable intervals
• M sample points are placed so that each interval of each variable contains exactly one point (as shown in the figure)
• the number of divisions, M, is equal for each variable
• LHS does not require more samples for more dimensions (variables)
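A minimal sketch of Latin Hypercube Sampling, assuming every variable is uniformly distributed over its range so that equally probable intervals are also equally wide. The function name `latin_hypercube` and the example bounds are illustrative choices, not part of the original slides.

```python
import random

def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples points; the range of each variable is cut into n_samples
    equally probable (here: equally wide) intervals, each used exactly once."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        width = (high - low) / n_samples
        # one random point inside every interval of this variable
        col = [low + (k + rng.random()) * width for k in range(n_samples)]
        rng.shuffle(col)  # random pairing of intervals across variables
        columns.append(col)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]

# Assumed example: 5 samples of two parameters with ranges [0, 1] and [10, 20]
bounds = [(0.0, 1.0), (10.0, 20.0)]
samples = latin_hypercube(5, bounds)
```

The number of samples equals the number of intervals M and does not depend on the number of variables, which is the property stated above.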

Building Models of Uncertainty: Using Methods of Computational Intelligence

Data: attributes, inputs, outputs

Data are often:
• missing, noisy
• temporal and spatial, heterogeneous
• non-stationary

Measured data (attributes)                        Model output
             Inputs                  Output       y* = f(x)
Instances    x1    x2    …    xn     y            y*
Instance 1   x11   x12   …    x1n    y1           y1*
Instance 2   x21   x22   …    x2n    y2           y2*
…
Instance K   xK1   xK2   …    xKn    yK           yK*

CI in building models of natural processes: why not build a model of uncertainty?

[Figure: input data X drive the natural process, which gives the actual (observed) output Y, and the data-driven model M(p, x), which gives the predicted output Y’; the parameters p are found so that Error(p) → min. The table of measured data is repeated next to the diagram.]

• CI provides methods to build data-driven models
• Ideally, such models are “ultimate models” since they are not polluted by theories

Example of a data-driven (ML) model

• observed data characterises the input-output relationship X → Y
• model parameters are found by optimization
• the model then predicts output for the new input without actual knowledge of what drives Y

Linear regression model: Y = a0 + a1 X

[Figure: observed data, Y (e.g., flow) against X (e.g. rainfall), with the actual input-output relationship and three fitted models (red, green, blue); for a new input value the model predicts a new output value.]

Which model is “better”: green, red or blue?
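As a small illustration of "model parameters are found by optimization", the linear model Y = a0 + a1 X can be fitted by ordinary least squares in closed form. The rainfall and flow numbers below are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of Y = a0 + a1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx               # slope minimising the sum of squared errors
    a0 = mean_y - a1 * mean_x    # intercept
    return a0, a1

def predict(a0, a1, x_new):
    """Model prediction for a new input value."""
    return a0 + a1 * x_new

# Invented observations: X = rainfall, Y = flow
rainfall = [0.0, 2.0, 4.0, 6.0, 8.0]
flow = [1.0, 2.1, 2.9, 4.2, 4.8]
a0, a1 = fit_line(rainfall, flow)
flow_new = predict(a0, a1, 5.0)
```

The fitted line is only one of the candidate models; a more flexible curve may fit the observed points better yet predict new inputs worse.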

An example of a ML model: artificial neural network

$$Y_k = g_{out}\Big(b_{0k} + \sum_{j} b_{jk}\, g_{hid}\big(a_{0j} + \sum_{i} a_{ij}\, x_i\big)\Big), \qquad \text{where } g(u) = \frac{1}{1 + e^{-u}}$$
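A forward pass of a network with one hidden layer and the logistic function g(u) = 1 / (1 + e^-u) in both layers can be sketched as follows. The storage layout is an assumption made for readability: `a[j][i]` holds the weight from input i to hidden node j, and `b[k][j]` the weight from hidden node j to output k.

```python
import math

def sigmoid(u):
    """Logistic activation g(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def mlp_forward(x, a0, a, b0, b):
    """One-hidden-layer network.
    a0[j], a[j][i]: bias and input weights of hidden node j
    b0[k], b[k][j]: bias and hidden weights of output node k"""
    hidden = [
        sigmoid(a0[j] + sum(a[j][i] * x[i] for i in range(len(x))))
        for j in range(len(a0))
    ]
    return [
        sigmoid(b0[k] + sum(b[k][j] * hidden[j] for j in range(len(hidden))))
        for k in range(len(b0))
    ]

# Assumed example: 2 inputs, 2 hidden nodes, 1 output, arbitrary weights
y = mlp_forward(
    x=[0.5, -1.0],
    a0=[0.1, -0.2],
    a=[[0.4, 0.3], [-0.6, 0.8]],
    b0=[0.05],
    b=[[1.2, -0.7]],
)
```

Training (finding the weights a and b) is not shown; in practice it is done by an optimization algorithm such as backpropagation.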

A novel idea: train a machine learning model to predict Error distribution

[Figure: flowchart with the blocks PHYSICAL SYSTEM, HYDROLOGIC FORECASTING MODEL and DATA-DRIVEN (ML) ERROR FORECASTING MODEL, linked by input data, model parameters, observed output, model output, model errors, forecasted errors and improved output.]

Train an ML model (e.g. a neural network) to represent the pdf of output uncertainty.

D.P. Solomatine, D.L. Shrestha (2009). A novel method to estimate model uncertainty using machine learning techniques. Water Resources Res. 45, W00B11.

D.L. Shrestha, N. Kayastha, D.P. Solomatine (2009). A novel approach to parameter uncertainty analysis of hydrological models using neural networks. Hydrol. Earth Syst. Sci., 13, 1235–1248.

Models and CI in Hydroinformatics

1. Models of systems (processes) M (s)

2. Models of models M (M(s))

3. Models of models’ uncertainty U (M(s))

4. Optimisation of models and systems s*, M*(s)

Uncertainty analysis methods considered

1. UNEEC = machine learning model of the past errors of the optimal process model is built

2. MLUE = machine learning model of the process model’s Monte Carlo simulation results is built


1. UNEEC method: UNcertainty Estimation based on local Errors and Clustering

A machine learning model of the past errors of the optimal process model is built

UNEEC: assumptions, constraints

Assumptions
• Model error is an indicator of the model uncertainty
• Model error depends on the current condition of a natural system and can be predicted
• Model errors are similar for similar conditions

Constraints
• Model structure and parameters are fixed
• Need to re-train the error model when the catchment characteristics change (e.g. land use change)
• Data hungry: more data are needed for reliable results

Idea 1: assess CDF from all historical data about errors

[Figure: model error (Qt − Qt’) against time for the past records (examples), and the resulting empirical error distribution; the prediction interval is bounded by the α/2 and (1 − α/2) quantiles of this distribution.]
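Idea 1 amounts to taking two quantiles of the empirical distribution of all past errors. A minimal sketch, assuming a significance level `alpha` (so the interval covers 1 − alpha of past errors) and linear interpolation between order statistics; the error values are invented.

```python
def quantile(sorted_values, q):
    """Empirical quantile of an already sorted list, with linear interpolation."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def error_prediction_interval(errors, alpha=0.1):
    """Lower and upper error limits from all historical errors (observed minus modelled)."""
    s = sorted(errors)
    return quantile(s, alpha / 2), quantile(s, 1 - alpha / 2)

# Invented past errors of a model: -10, -9, ..., 10
past_errors = [float(e) for e in range(-10, 11)]
pi_lower, pi_upper = error_prediction_interval(past_errors, alpha=0.1)
```

The same interval is then attached to every new forecast, which is the limitation that the local (cluster-based) ideas address.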

Idea 2: local modelling of errors (link CDF to “characteristic variables”)

[Figure: past records (examples) plotted in the space of inputs, e.g. flow Qt-1 and rainfall Rt-2, and grouped into clusters; the error (Qt − Qt’) distribution in each cluster gives a prediction interval bounded by its α/2 and (1 − α/2) quantiles, which is different for each cluster.]

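Idea 3: Use fuzzy clustering of examples to generate training data sets

Error limits (or prediction intervals) of each example are computed from the prediction intervals of the clusters (PIC):

$$PI^{L}_{example} = \sum_{clus=1}^{N_{clus}} \mu_{clus,example}\, PIC^{L}_{clus}$$

• $\mu_{clus,example}$ is the membership grade of the example to cluster clus

Eager learning (ANN or M5 model tree): train regression (ANN) models
PIL = fL(X)
PIU = fU(X)

New record: the trained fL and fU models will estimate the prediction interval.

[Figure: examples in the space of inputs (flow Qt-1, rainfall Rt-2) with fuzzy clusters, and the output error limits estimated for a new record.]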

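Using instance-based learning

Error limits (or prediction intervals):

$$PI^{L}_{example} = \sum_{clus=1}^{N_{clus}} \mu_{clus,example}\, PIC^{L}_{clus}$$

• $\mu_{clus,example}$ is the membership grade of the example to cluster clus

Instance-based learning: for a new record, the distance function is computed to estimate the fuzzy weight.

[Figure: examples in the space of inputs (flow Qt-1, rainfall Rt-2) with fuzzy clusters, and a new record whose output error limits are estimated from its distances to the clusters.]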

UNEEC details. Step 1: clustering

Clustering (finding groups of data in the space characterising the hydro-meteorological condition): K-means clustering, fuzzy C-means clustering

Objective function:

$$\min_{(U,V)} J_m(U, V) = \sum_{j=1}^{c} \sum_{i=1}^{N} \mu_{i,j}^{m}\, D_{i,j}^{2}$$

Distance:

$$D_{i,j}^{2} = \lVert x_i - v_j \rVert_A^{2}$$

Degree of fuzzification:

$$1 \le m$$

Constraint:

$$\sum_{j=1}^{c} \mu_{i,j} = 1, \quad \forall i$$
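The clustering step can be sketched with a plain fuzzy C-means loop that alternates the usual centre and membership updates. The sketch assumes the Euclidean norm (A = identity), a fuzzification degree m > 1 (the update divides by m − 1) and a fixed number of iterations instead of a convergence test; the function name and the sample points are illustrative.

```python
import random

def fuzzy_c_means(points, c, m=2.0, n_iter=100, seed=0):
    """Return (centres, mu), where mu[i][j] is the membership of point i in cluster j."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # random initial memberships, normalised so that sum_j mu[i][j] = 1
    mu = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        mu.append([v / total for v in row])
    centres = []
    for _ in range(n_iter):
        # centres v_j: means of the points weighted by mu[i][j] ** m
        centres = []
        for j in range(c):
            w = [mu[i][j] ** m for i in range(n)]
            centres.append(
                [sum(w[i] * points[i][d] for i in range(n)) / sum(w) for d in range(dim)]
            )
        # memberships from squared distances D_ij^2 (floored to avoid division by zero)
        for i in range(n):
            d2 = [
                max(sum((points[i][d] - centres[j][d]) ** 2 for d in range(dim)), 1e-12)
                for j in range(c)
            ]
            inv = [dist ** (-1.0 / (m - 1.0)) for dist in d2]
            total = sum(inv)
            mu[i] = [v / total for v in inv]
    return centres, mu

# Invented examples in a two-dimensional input space
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centres, memberships = fuzzy_c_means(points, c=2)
```

Each row of `memberships` sums to one, which is the constraint stated above.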

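UNEEC details. Step 2: Determining Prediction Interval (PI) for each cluster

With the errors $e_i$ of the examples sorted in ascending order, the lower and upper prediction intervals of cluster j are

$$PIC_j^{L} = e_i, \quad i: \sum_{k=1}^{i} \mu_{k,j} < \frac{\alpha}{2} \sum_{i=1}^{N} \mu_{i,j}$$

$$PIC_j^{U} = e_i, \quad i: \sum_{k=1}^{i} \mu_{k,j} < \Big(1 - \frac{\alpha}{2}\Big) \sum_{i=1}^{N} \mu_{i,j}$$

[Figure: cumulative membership $\sum \mu_{i,j}$ against the sorted examples i = 1, …, N, with the levels $\frac{\alpha}{2}\sum_{i=1}^{N}\mu_{i,j}$ and $(1-\frac{\alpha}{2})\sum_{i=1}^{N}\mu_{i,j}$ marking PIC^L and PIC^U.]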
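The cluster limits are membership-weighted quantiles of the errors. The sketch below assumes one particular reading of the rule: errors are sorted in ascending order and the limit is the last error whose cumulative membership is still below the threshold (α/2 or 1 − α/2 of the total membership). Other quantile conventions would give slightly different limits.

```python
def cluster_prediction_interval(errors, memberships, alpha=0.1):
    """errors[i]: model error of example i
    memberships[i]: membership grade of example i in the cluster considered"""
    pairs = sorted(zip(errors, memberships))   # examples sorted by error, ascending
    total = sum(mu for _, mu in pairs)

    def weighted_quantile(level):
        cumulative = 0.0
        chosen = pairs[0][0]
        for error, mu in pairs:
            cumulative += mu
            if cumulative < level * total:
                chosen = error   # cumulative membership still below the threshold
            else:
                break
        return chosen

    return weighted_quantile(alpha / 2), weighted_quantile(1 - alpha / 2)

# Invented example: 21 errors, all belonging fully to the cluster
errors = [float(e) for e in range(-10, 11)]
pic_lower, pic_upper = cluster_prediction_interval(errors, [1.0] * 21, alpha=0.1)
```

With real fuzzy memberships the examples close to the cluster centre dominate the cumulative sum, so each cluster gets its own interval.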

UNEEC details. Step 3, 4, 5: Building and using the model

Step 3: Generation of prediction intervals for each example

$$PI_i^{L} = \sum_{j=1}^{c} \mu_{i,j}\, PIC_j^{L}, \qquad PI_i^{U} = \sum_{j=1}^{c} \mu_{i,j}\, PIC_j^{U}$$

Step 4: Building the uncertainty model

$$PI^{L} = f^{L}(X_u), \qquad PI^{U} = f^{U}(X_u)$$

Step 5: Using the uncertainty model

$$PI^{L} = f^{L}(X_{vu}), \qquad PI^{U} = f^{U}(X_{vu})$$

Model outputs with uncertainty bounds:

$$PL_i^{L} = \hat{y}_i + PI_i^{L}, \qquad PL_i^{U} = \hat{y}_i + PI_i^{U}$$

[Figure: flowchart of steps 3 to 5, ending in the model outputs with uncertainty bounds.]
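Step 3 and the final prediction limits reduce to a weighted sum and an addition. A minimal sketch with invented numbers; the function names are illustrative, and the regression models f of steps 4 and 5 are not shown.

```python
def example_interval(mu_row, pic_lower, pic_upper):
    """Step 3: PI of one example as the membership-weighted sum of cluster PIs.
    mu_row[j]: membership of the example in cluster j
    pic_lower[j], pic_upper[j]: lower and upper PI of cluster j"""
    pi_lower = sum(mu * pic for mu, pic in zip(mu_row, pic_lower))
    pi_upper = sum(mu * pic for mu, pic in zip(mu_row, pic_upper))
    return pi_lower, pi_upper

def prediction_limits(y_hat, pi_lower, pi_upper):
    """Prediction limits: model output plus the lower and upper error limits."""
    return y_hat + pi_lower, y_hat + pi_upper

# Invented example: two clusters, example belongs 25% to the first and 75% to the second
pi_l, pi_u = example_interval([0.25, 0.75], pic_lower=[-2.0, -4.0], pic_upper=[1.0, 3.0])
pl_l, pl_u = prediction_limits(10.0, pi_l, pi_u)
```

The lower error limit is usually negative, so adding it to the model output gives a lower bound below the forecast.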


UNEEC methodology


2. MLUE method: Machine Learning in Uncertainty Estimation

A machine learning model of the process model’s Monte Carlo simulation results is built

Monte Carlo simulation of parametric uncertainty

y = M(x, s, θ) + εs + εθ + εx + εy

Consider the model M calculating y (e.g., discharge):
y(t) = M(X(t), p)
where
X(t) = vector of inputs (precipitation, temperature, etc.), known for t = 1, …, T
p = vector of parameters (soil properties, roughness, etc.)

Monte Carlo approach:
• sample N parameter vectors pi
• run the model for each of them, yi(t) = M(X(t), pi), and generate N outputs (keep only some of them if GLUE is used)
• assess the distribution of the outputs yi(t) for each time moment t (or its parameters: mean, variance, prediction intervals, quantiles)

The problem:
How to assess the parametric uncertainty of the model M for t = T+1, when new input data X(T+1) is fed?

Issues with MC

Issues with re-running MC for new inputs:
1) convergence of the Monte Carlo simulation is very slow (O(N^-0.5)), so a large number of runs is needed to establish a reliable estimate of uncertainties
2) the number of simulations increases exponentially with the dimension d of the parameter vector (O(n^d)) if the entire parameter domain is to be covered

Idea: encapsulate the results of MC simulation in a machine learning model

MLUE Methodology (1)

• Consider the sources of uncertainty to be analysed within the framework of Monte Carlo simulation
• Execute the MC simulations to generate the data: yi(t) = M(X(t), pi)
• Estimate the uncertainty measures of the MC realizations, e.g., mean, variance, prediction intervals, quantiles
• To start with, estimate two quantiles (say, 5% and 95%), forming the prediction interval PI

MLUE Methodology (2)

• Analyze the dependency of the uncertainty measures (quantiles) on the input and state variables of the hydrological model (we used correlation and average mutual information analysis)
• Select the input variables for the machine learning model based on the dependency analysis
• Train the machine learning model U to predict the uncertainty measures of the MC realizations: PI = U(X)
• Validate the machine learning model U by estimating the uncertainty measures with the “new” input data

Validation

• Measuring the predictive capability of the uncertainty model U (the accuracy of uncertainty models in approximating the quantiles of the model outputs generated by MC simulations): coefficient of correlation (r) and root mean squared error (RMSE)
• Measuring the statistics of the uncertainty estimation (i.e. goodness of the model U as uncertainty estimator): prediction interval coverage probability (PICP) and mean prediction interval (MPI) (Shrestha & Solomatine 2006, 2008)

$$PICP = \frac{1}{n} \sum_{t=1}^{n} C, \quad \text{with } C = \begin{cases} 1, & PL_t^{L} \le y_t \le PL_t^{U} \\ 0, & \text{otherwise} \end{cases}$$

$$MPI = \frac{1}{n} \sum_{t=1}^{n} \big(PL_t^{U} - PL_t^{L}\big)$$

• Visualizing: scatter and time plots of the prediction intervals obtained from the MC simulation and their predicted values
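Both statistics are straightforward to compute once the prediction limits are available. A minimal sketch with invented observations and limits; `picp` returns a fraction (multiply by 100 for per cent).

```python
def picp(y, lower, upper):
    """Prediction interval coverage probability: share of observations inside the limits."""
    inside = sum(1 for yt, lo, up in zip(y, lower, upper) if lo <= yt <= up)
    return inside / len(y)

def mpi(lower, upper):
    """Mean prediction interval: average width of the interval."""
    return sum(up - lo for lo, up in zip(lower, upper)) / len(lower)

# Invented example: four observations with their lower and upper prediction limits
observed = [1.0, 2.0, 3.0, 4.0]
pl_lower = [0.0, 0.0, 4.0, 3.0]
pl_upper = [2.0, 1.0, 5.0, 5.0]
coverage = picp(observed, pl_lower, pl_upper)
width = mpi(pl_lower, pl_upper)
```

The two measures are read together: a high PICP is only informative if the MPI stays reasonably small.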

Application

Study area: Brue catchment, UK

• Catchment area: 135 km2
• Location: south west of England
• Average annual rainfall: 867 mm
• Mean river flow: 1.92 m3/s
• Calibration data: 24/06/94-24/06/95
• Validation data: 24/06/95-31/05/96

Study area: Brue catchment, UK

[Figure: hourly rainfall (mm/hour) and discharge (m3/s) against time (hours), 1994/6/24 05:00 - 1996/05/31 13:00, split into calibration data (8760 points) and validation data (8217 points).]

Conceptual Hydrological model HBV

[Figure: schematic of the HBV model with SF – Snow, RF – Rain, EA, SP – Snow cover, IN – Infiltration, R – Recharge, SM, CFLUX, UZ, LZ, PERC, the fast runoff component Q0, and the transform function giving Q – Total runoff, Q = Q0 + Q1.]

Data Analysis

• Analysis of the dependency between various combinations of the input variables and the output:
  - correlation
  - average mutual information (AMI) between REt and the PIs (the optimal lag time is around 7-9 hours)
• Additional analysis of the correlation and AMI between the PIs and the observed discharge Qt was carried out: the discharges with lags of 0, 1 and 2 have very high correlation with the PIs

Experimental setup

MC simulation:
• 9 parameters of the HBV model are sampled uniformly from the feasible ranges
• Nash-Sutcliffe coefficient of efficiency (CE) is used as error measure
• Convergence: stabilized after 10,000 runs (75,000 runs made)
• Only 25,000 “good” models are considered (rejection threshold is set to 0) to compute prediction quantiles
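The selection of "good" models can be sketched as follows. The Nash-Sutcliffe coefficient is computed in its standard form; treating the rejection threshold as a strict lower bound on CE (runs with CE equal to the threshold are rejected) is an assumption, since the slides do not say how ties are handled. The data are invented.

```python
def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe coefficient of efficiency: 1 - SSE / variance of observations."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / spread

def behavioural_runs(runs, observed, threshold=0.0):
    """Keep only the simulations whose CE exceeds the rejection threshold."""
    return [run for run in runs if nash_sutcliffe(observed, run) > threshold]

# Invented example: one perfect run, one no better than the mean, one poor run
observed = [1.0, 2.0, 3.0, 4.0]
runs = [
    [1.0, 2.0, 3.0, 4.0],   # CE = 1
    [2.5, 2.5, 2.5, 2.5],   # CE = 0
    [4.0, 3.0, 2.0, 1.0],   # CE = -3
]
kept = behavioural_runs(runs, observed)
```

The prediction quantiles are then computed from the retained runs only.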

Experimental setup

Machine learning model U:
PI = U(REt-5a, Qt-1, ΔQt-1)
where
PI – lower or upper prediction interval
REt-5a – average of REt-5, REt-6, REt-7, REt-8 and REt-9
ΔQt-1 = Qt-1 − Qt-2

Input variables were selected based on the analysis of their relatedness to the output error (average mutual information):

$$AMI = \sum_{i,j} P_{XY}(x_i, y_j) \log_2 \frac{P_{XY}(x_i, y_j)}{P_X(x_i)\, P_Y(y_j)}$$

Methods: M5 model trees, locally weighted regression, MLP neural networks
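Average mutual information between two series can be estimated from a joint histogram. The sketch below assumes equal-width binning with `n_bins` bins per variable, which is one common choice; the slides do not state how the probabilities were estimated.

```python
import math
from collections import Counter

def average_mutual_information(xs, ys, n_bins=4):
    """AMI in bits, with probabilities estimated from equal-width bins."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins or 1.0   # guard against a constant series
        return [min(int((v - lo) / width), n_bins - 1) for v in values]

    bx, by = to_bins(xs), to_bins(ys)
    n = len(xs)
    p_xy = Counter(zip(bx, by))   # joint counts
    p_x = Counter(bx)             # marginal counts of X
    p_y = Counter(by)             # marginal counts of Y
    return sum(
        (count / n) * math.log2((count / n) / ((p_x[i] / n) * (p_y[j] / n)))
        for (i, j), count in p_xy.items()
    )

# Invented examples: a fully dependent pair and an independent pair
series = [float(v) for v in range(8)]
ami_dependent = average_mutual_information(series, series)
ami_independent = average_mutual_information([0.0, 0.0, 7.0, 7.0], [0.0, 7.0, 0.0, 7.0])
```

Computing the AMI between a lagged input and the prediction interval for several lags gives the kind of lag analysis used to select REt-5a.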

Results

UNEEC: Clustering result example

[Figure: discharge (m3/s) against time (hours), with the examples assigned to clusters C1 to C5.]

UNEEC: Performance (MLP ANN)

[Figure: scatter plots of predicted against target lower interval (m3/s) and of predicted against target upper interval (m3/s).]

Prediction interval   Data set      Mean     Std. dev.   RMSE     Corr. coef.
lower                 training      110.91   53.6        5.9582   0.9937
lower                 CV            112.18   52.64       6.0852   0.9934
lower                 training+CV   111.35   53.32       5.9582   0.9937
upper                 training      115.16   55.11       3.9002   0.9975
upper                 CV            116.69   54.18       3.9332   0.9974
upper                 training+CV   115.66   54.79       3.9002   0.9975

UNEEC: Estimation of uncertainty bounds

[Figure: observed discharge (m3/s) and prediction bounds (m3/s) against time (hours), comparing PIs by instance-based learning with PIs by regression, with the clusters C1 to C5 and the residuals (m3/s) shown below; one panel covers hours 0 to 8000 and a detail panel hours 4700 to 5000.]

MLUE: Estimation of prediction intervals


MLUE: Performances

Predictive capability

        Corr. coef.        RMSE
        PIL      PIU       PIL      PIU
MT      0.841    0.792     0.614    1.641
LWR     0.822    0.798     0.643    1.604
ANN     0.847    0.806     0.584    1.568

Goodness of uncertainty measures

              MCS      MT       LWR      ANN
PICP (%)      77.24    66.97    75.16    65.54
MPI (m3/s)    2.09     2.03     1.93     1.96

MCS = Monte Carlo simulation
MT = M5 model tree
LWR = locally weighted regression
ANN = MLP neural network

Extensions

Estimation of several quantiles (5%, then 10% to 90% in steps of 10%, and 95%), i.e. estimating the cdf of the MC realizations by machine learning models

Use of Machine learning methods in predicting uncertainty: conclusions

• Machine learning methods are able to replicate:
  - past performance of a process model
  - results of Monte Carlo simulations
• The methods are computationally efficient and can be used in real-time applications
• They are applicable to various kinds of models

Future work:
• Other ML methods are to be tested
• The methods can be applied in the context of other sources of uncertainty: input, structure, or combined

Thank you for your attention

