

JOURNAL OF THE AMERICAN WATER RESOURCES ASSOCIATION, VOL. 38, NO. 1, AMERICAN WATER RESOURCES ASSOCIATION, FEBRUARY 2002

FLOOD STAGE FORECASTING WITH SUPPORT VECTOR MACHINES1

Shie-Yui Liong and Chandrasekaran Sivapragasam2

ABSTRACT: Machine learning techniques are finding more and more applications in the field of forecasting. A novel regression technique, called Support Vector Machine (SVM), based on statistical learning theory, is explored in this study. SVM is based on the principle of Structural Risk Minimization, as opposed to the principle of Empirical Risk Minimization espoused by conventional regression techniques. The flood data at Dhaka, Bangladesh, are used in this study to demonstrate the forecasting capabilities of SVM. The result is compared with that of an Artificial Neural Network (ANN) based model for one-lead day to seven-lead day forecasting. The improvements in maximum predicted water level errors by SVM over ANN for four-lead day to seven-lead day are 9.6 cm, 22.6 cm, 4.9 cm, and 15.7 cm, respectively. The result shows that the prediction accuracy of SVM is at least as good as, and in some cases (particularly at higher lead days) actually better than, that of ANN, yet it overcomes many of the limitations of ANN, for example in arriving at ANN's optimal network architecture and choosing a useful training set. Thus, SVM appears to be a very promising prediction tool.
(KEY TERMS: structural risk minimization; support vector machines; flood forecasting; neural networks.)

INTRODUCTION

Classical statistical estimation methods (like standard regression) are linear, model driven, and parametric in nature, which assumes strong a priori knowledge about the unknown dependency. The training data are used to estimate the parameter values. However, in many real world problems, this underlying assumption is not always true. Further, such approaches are impractical in high-dimensional cases, as large numbers of training samples are required for parameter estimation. In the recent past, empirical, nonlinear, data driven models [like the Artificial Neural Network (ANN)] have been widely used to address the shortcomings of the parametric approach. The model depends on the available data to be "learned," without any a priori hypothesis about the kind of relationship, which is allowed to be complex and nonlinear. However, these methods lack an underlying mathematical theory and are usually motivated by biological arguments.

Recently, the support vector machine has attracted the attention of many researchers. It has a functional form (similar to the model driven approach), the complexity of which is decided by the available data to be "learned" (similar to the data driven approach), and although SVM has an underlying functional form, its exact nature is not assumed a priori (unlike model driven methods). In other words, this machine can be seen as a statistical tool that approaches the problem like a Neural Network, with a novel way to train polynomial function, Radial Basis Function, or neural network regression estimators. More precisely, it is an approximate implementation of the method of structural risk minimization. This induction principle minimizes an upper bound on the error rate of a learning machine on test data (i.e., the generalization error) rather than minimizing the training error itself (as in empirical risk minimization, used by ANN). This helps in making generalizations on unseen data.

Support vector classifiers have already become competitive with the best available techniques for classification. In the recent past, their excellent performance on regression and time series prediction has been demonstrated in various studies (Vapnik et al., 1996; Mukherjee et al., 1997).

The intention of this paper is twofold: to introduce SVM with its applications and to compare it with ANN. ANN has been chosen for comparison

1Paper No. 01047 of the Journal of the American Water Resources Association. Discussions are open until October 1, 2002.
2Respectively, Associate Professor and Research Scholar, Department of Civil Engineering, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260 (E-Mail/Liong: [email protected]).

JOURNAL OF THE AMERICAN WATER RESOURCES ASSOCIATION 173 JAWRA

because SVM is essentially a data driven model, although it has a final functional form similar to the model driven approach. Further, ANN has been shown to perform better than many conventional regression methods (Karunanithi et al., 1994; Hsu et al., 1995; Tokar and Johnson, 1999; Toth et al., 2000; Arthur et al., 2001). In the next section, a discussion of the Structural Risk Minimization principle is presented, followed by an introduction to the support vector machine for regression. Next is a qualitative discussion on the advantages of SVM over ANN in light of the recent report by the ASCE Task Committee (2000a and 2000b). The Bangladesh flood problem is then described, followed by the implementation of this technique on the flood stage data. The results of this study are compared with those resulting from a neural network model (Liong et al., 1999) in terms of prediction accuracy for one- to seven-lead day advance forecasting. Further, as a special case study, the benefits of SVM in identifying a useful training set and its generalization capability under data deficit situations are demonstrated. Finally, a brief discussion concludes the paper.

STRUCTURAL RISK MINIMIZATION PRINCIPLE

The problem of learning from data (examples) is to choose, from a given set of functions $f_\beta$, $\beta \in \Lambda$, the one that best approximates the measured output based on a training set of n examples $(x_1, y_1), \ldots, (x_n, y_n)$, each generated from an unknown probability distribution P(x,y). The best approximation implies the smallest possible value of the following risk, $R(\beta)$:

$$R(\beta) = \int \left( y - f_\beta(x) \right)^2 dP(x,y) \qquad (1)$$

The problem is that $R(\beta)$ is unknown, since P(x,y) is unknown. Therefore an induction principle for risk minimization is necessary.

The straightforward approach is to minimize the empirical risk given by

$$R_{emp}(\beta) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f_\beta(x_i) \right)^2 \qquad (2)$$

However, this approach does not guarantee a small actual risk (test set error) for a small error on training exemplars if the number of training examples, n, is limited. To obtain the best possible actual risk from limited data, novel techniques have been developed in the last two decades based on statistical learning theory. According to this theory, the generalization ability of learning machines depends on capacity concepts that are more sophisticated than merely the dimensionality of the space or the number of free parameters of the loss function (as espoused by the classical paradigm of generalization). One such technique is the Structural Risk Minimization principle (Vapnik, 1999). It is based on the fact that, for the above learning problem, for any $\beta \in \Lambda$ the bound on test error is of the form

$$R(\beta) \le R_{emp}(\beta) + \Phi\!\left(\frac{h}{n}\right) \qquad (3)$$

where the first term is an estimate of the risk and the second term is the confidence interval for this estimate. The parameter h is called the VC-dimension (named after the authors) of a set of functions. It can be seen as the capacity of the set of functions implementable by the learning machine. For ANN, determining h corresponds to choosing an appropriate network architecture for a given training set. During the training phase, the network tries to minimize the first term in Equation (3). If the chosen architecture happens to be too complex for the given amount of training data, the confidence interval term will be large. So, even if one could minimize the empirical risk, the actual risk still remains large, thus resulting in poor generalization.

According to the Structural Risk Minimization principle (SRM), one can control the actual risk by controlling the two terms in Equation (3). Thus, for a given set of observations $(x_1, y_1), \ldots, (x_n, y_n)$, the SRM principle chooses the function $f_{\beta^*}$ in the subset $\{f_\beta : \beta \in \Lambda\}$ for which the guaranteed risk bound given by Equation (3) is minimal.

INTRODUCTION TO SUPPORT VECTOR MACHINE

The Support Vector Machine (SVM) is an approximate implementation of the SRM principle. We will first describe the SVM methodology for linear functions, and then extend the idea to deal with nonlinearity using the so-called "kernel trick."

The basic idea is to perform linear regression to find a function f(x) that has at most ε deviation from the actually obtained targets, y, for all the training data, and at the same time is as flat as possible (less complex). The decision function is

$$f(x) = \langle a, x \rangle + b \qquad (4)$$

where $\langle \cdot\,, \cdot \rangle$ denotes the dot product. Flatness in Equation (4) means that one seeks a small a. Formally, the above problem can be written as the following convex optimization problem, with slack variables $\xi_i$, $\xi_i^*$ to account for the outliers in the training data:

$$\text{Minimize} \quad \frac{1}{2}\|a\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$$

$$\text{Subject to} \quad d_i - (\langle a, x_i \rangle + b) \le \varepsilon + \xi_i$$
$$(\langle a, x_i \rangle + b) - d_i \le \varepsilon + \xi_i^*$$
$$\xi_i, \xi_i^* \ge 0 \qquad (5)$$

In the above equation, minimizing the first term amounts to minimizing the VC-dimension of the learning machine, and the second term controls the empirical risk. The constant C > 0 in Equation (5) determines the tradeoff between the flatness of f and the amount up to which deviations larger than ε are tolerated. This formulation corresponds to what is called Vapnik's ε-insensitive loss function, as described below.

The following defines an ε-tube so that if the predicted value is within the tube, the loss function is zero, and if the predicted point is outside the tube, the loss is the magnitude of the difference between the predicted value and the radius ε of the tube:

$$C(f(x) - d) = \begin{cases} |f(x) - d| - \varepsilon & \text{for } |f(x) - d| \ge \varepsilon \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

Figure 1 gives the graphical description. Only the points outside the shaded region are penalized, in a linear fashion, to contribute to the cost.

Figure 1. Linear Loss Function With Slack ξ and User-Specified Accuracy ε.
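As an illustration (ours, not part of the original study; the function name and test values are arbitrary), the ε-insensitive loss of Equation (6) can be written in a few lines:

```python
import numpy as np

def eps_insensitive_loss(f_x, d, eps):
    """Vapnik's epsilon-insensitive loss: zero inside the eps-tube,
    linear in the deviation beyond it (Equation 6)."""
    return np.maximum(np.abs(f_x - d) - eps, 0.0)

# Points inside the tube (|f(x) - d| <= eps) incur no loss;
# the third point lies 1.5 above its target, so its loss is 1.5 - 0.5 = 1.0.
losses = eps_insensitive_loss(np.array([1.0, 2.0, 3.5]),
                              np.array([1.2, 2.0, 2.0]), 0.5)
print(losses)
```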

The above optimization problem subject to the constraints in the loss function can actually be solved more easily in dual formulation, by introducing a dual set of variables, $\alpha_i$ and $\alpha_i^*$, which are the Lagrange multipliers. By carrying out the minimization of the objective function thus formed with respect to the weight vector a and the slack variables $\xi_i$ and $\xi_i^*$, and some mathematical calculations, the dual problem for the regression may now be defined as

maximize

$$-\frac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle x_i, x_j \rangle - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} y_i (\alpha_i - \alpha_i^*) \qquad (7)$$

subject to the following constraints:

$$\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0$$
$$\alpha_i, \alpha_i^* \in [0, C], \quad i = 1, 2, \ldots, n \qquad (8)$$

The solution of the above problem yields $\alpha_i$ and $\alpha_i^*$ for all i = 1, 2, ..., n. It can be shown that all the training exemplars within the ε-insensitive zone yield $\alpha_i$ and $\alpha_i^*$ as zeros. The remaining nonzero coefficients essentially define the final decision function. The training exemplars corresponding to these nonvanishing coefficients are called support vectors. In simple terms, support vectors are those examples that "support" or help in defining the decision function, while other examples become redundant. Thus the learning machine derives its name. The final decision function is given as

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b \qquad (9)$$

The advantage of the dual form can be seen clearly in Equations (7) and (9), where the training patterns appear only as dot products. Since in the dual representation the dot product of two vectors of any dimension can be easily estimated, SVM can deal with any increase in the number of attributes with relatively much greater ease. This advantage is used to deal with nonlinear function approximation, as described below.
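To make the dual representation concrete, the following sketch (ours; the study itself used Gunn's MATLAB toolbox, whereas this uses scikit-learn, and the toy data and parameter values are arbitrary) fits a linear support vector regressor and rebuilds its predictions directly from the dual coefficients and support vectors, as in Equation (9):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 0.01 * rng.standard_normal(40)

# Fit a linear support vector regressor; the training patterns enter the
# dual solution only through dot products, as in Equation (7).
model = SVR(kernel="linear", C=10.0, epsilon=0.05).fit(X, y)

# Rebuild f(x) = sum_i (alpha_i - alpha_i*) <x_i, x> + b  (Equation 9).
# scikit-learn stores the nonzero (alpha_i - alpha_i*) for the support
# vectors only, in dual_coef_.
f_manual = model.dual_coef_ @ (model.support_vectors_ @ X.T) + model.intercept_

assert np.allclose(f_manual.ravel(), model.predict(X))
```

The assertion confirms that the exemplars with vanishing coefficients contribute nothing: only the support vectors define the decision function.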

Dealing with Nonlinearity

In order to deal with nonlinearity, the input data, x, in input space are mapped to a high dimensional feature space via a nonlinear mapping function, φ. Then a linear regression is performed in the feature space. The objective function in dual form can now be represented as

maximize

$$-\frac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle \phi(x_i), \phi(x_j) \rangle - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} y_i (\alpha_i - \alpha_i^*) \qquad (10)$$

subject to the constraints as given in Equation (8). It should be noted that $\langle \phi(x_i), \phi(x_j) \rangle$ represents the dot product in the feature space. For example, a two dimensional input space can be represented with a possible mapping of second degree monomials as

$$(x_1, x_2) \rightarrow \phi(x_1, x_2) = (x_1^2, x_2^2, x_1 x_2) \qquad (11)$$

The use of features of degree d soon results in a number of features that becomes computationally infeasible for reasonable numbers of attributes and feature degrees. Further, the generalization of the learning machine may be very poor. These problems are accounted for by using the so-called "kernel" functions

$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle \qquad (12)$$

It can be noted that we are actually interested in computing the inner product in the feature space $\langle \phi(x_i), \phi(x_j) \rangle$ and not in the feature space representation as such. The use of kernels makes it possible to map the data implicitly into a feature space and to train a linear machine in such a space, potentially side-stepping the computational problems inherent in evaluating the feature space. It can be shown that any symmetrical kernel function satisfying Mercer's condition (Cristianini and Shawe-Taylor, 2000) corresponds to a dot product in some feature space. The final decision function is

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(x_i, x) + b \qquad (13)$$
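To see numerically that a kernel computes a feature-space dot product without ever forming φ explicitly, consider the second-degree mapping of Equation (11). In this sketch (ours, not from the paper) the cross term is scaled by √2 so that the mapping corresponds exactly to the polynomial kernel K(x, z) = ⟨x, z⟩²:

```python
import numpy as np

def phi(x):
    # Explicit second-degree map for 2-D input, with the cross term
    # scaled by sqrt(2): (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2).
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    # Equivalent kernel: K(x, z) = (<x, z>)^2, no explicit mapping needed.
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# <x, z> = 1, so K(x, z) = 1; the explicit dot product in feature space agrees.
assert np.isclose(np.dot(phi(x), phi(z)), poly_kernel(x, z))
```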

The SVM for nonlinear regression can be implemented in the form of a polynomial learning machine, a radial-basis function network, or a two-layer perceptron, for example, based on the type of kernel function we choose.

It should be noted that ε, C, and the kernel-specific parameters must be tuned to their optimal values by the user to get the final regression estimate. At the moment, identification of optimal values for these parameters is largely a trial and error process. Further, other than the ε-insensitive loss function, a quadratic loss function may also be used, in which case ε = 0. This has the obvious disadvantage of losing the sparseness of representation (i.e., all the training exemplars become support vectors). Details on SVM can be found in Vapnik (1995), Drucker et al. (1996), Smola and Schölkopf (1998), Haykin (1999), Vapnik (1999), and Cristianini and Shawe-Taylor (2000).

In this study, the kernel function used is the Radial Basis Function (RBF)

$$K(x_i, x) = \exp\!\left( -\frac{(x_i - x)(x_i - x)^T}{2\sigma^2} \right) \qquad (14)$$

where σ is the width of the RBFs. Also, we preferred the quadratic loss function over the ε-insensitive loss function for the following reasons: (1) we wanted an impartial comparison with Artificial Neural Networks (ANN), which use a quadratic loss function; (2) the ε-insensitive loss function is at least three to four times more computer memory intensive in comparison with the quadratic loss function; and (3) the number of tunable parameters is smaller, as ε = 0.
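The RBF kernel of Equation (14) translates directly into NumPy (an illustrative sketch; the function and variable names are ours):

```python
import numpy as np

def rbf_kernel_matrix(X, Z, sigma):
    """K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 * sigma^2)),
    the Radial Basis Function kernel of Equation (14)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
K = rbf_kernel_matrix(X, X, sigma=1.0)
# Diagonal entries are exp(0) = 1; the off-diagonal entry is exp(-0.5),
# since the squared distance between the two points is 1.
print(K)
```

A larger σ widens the basis functions, so distant training patterns retain more influence; this is the parameter tuned alongside C in the study.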

Advantages of SVM over ANN

In the past two decades or so, ANNs have been widely used by researchers in hydrologic applications like rainfall-runoff modeling, streamflow prediction, ground water modeling, rainfall forecasting, reservoir operations, etc. In a recent review by the ASCE Task Committee (2000a, 2000b) it was pointed out that, although ANN does have many attractive features, it suffers from some major limitations, inviting a skeptical attitude towards the methodology. SVM seems to be a powerful alternative that overcomes some of the basic lacunae in the application of ANNs, while retaining all the strengths of ANN. We present the advantages of SVM with a view to addressing the questions posed by the ASCE Task Committee (2000a, 2000b) on the limitations of ANN.

1. ANN is a Black-Box Model. The set of optimal weights and threshold values (after the training) does not reveal any information to the user. This has been

one of the primary reasons for a skeptical attitude towards this methodology. However, unlike ANN, SVM is not a "black box" model. It can be analyzed theoretically using concepts from computational learning theory. The final values of the Lagrange multipliers indicate the relative importance of the training patterns in arriving at the decision function. Although research is still underway to reach a complete understanding of SVM in function approximation, SVM offers some basic explanation of how it arrives at the final decision.

2. Identifying Optimal Training Set. As noted by the ASCE Task Committee (2000a, 2000b), in many hydrological applications there is a prohibitive cost and time associated with data collection. Since ANN is data intensive, without proper quality and quantity of data the generalization will be very poor. Since SVM is based on the Structural Risk Minimization principle (SRM), rather than Empirical Risk Minimization (ERM), it offers a better generalization error as compared to ANN for a given training set. Further, it can be seen that, after the completion of training with the ε-insensitive loss function, the number of training patterns required for defining the final decision function turns out to be a small fraction of the original training set. This may offer a way to store only the "optimal data set" rather than the whole training set. For the quadratic loss function, all the training patterns become support vectors, and yet, for a reasonable accuracy, the many training patterns resulting in a very low value of $(\alpha_i - \alpha_i^*)$ can be taken as redundant.

3. Improving Time Series Analysis. One of the important issues in time series analysis is memory structure, which is usually characterized by a covariance function. In many studies, temporal variations are often represented by including past inputs/outputs as current inputs. However, it is not immediately clear how far back one must go in the past to include temporal effects. This makes the resulting ANN structure more complicated, with a greater number of tunable parameters. SVM, however, can deal with the increase in the number of attributes with relatively much greater ease, since in the dual representation the dot product of two vectors of any dimension can be easily estimated.
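The lagged-input construction described above can be sketched as follows (our illustration; the helper name and toy series are not from the study):

```python
import numpy as np

def make_lagged_inputs(series, n_lags):
    """Represent temporal effects by using the past n_lags values
    as input attributes for predicting the next value."""
    X = np.column_stack([series[i:len(series) - n_lags + i]
                         for i in range(n_lags)])
    y = series[n_lags:]
    return X, y

levels = np.arange(10.0)            # stand-in for a daily water-level record
X, y = make_lagged_inputs(levels, n_lags=3)
# X[0] = [0, 1, 2] predicts y[0] = 3, and so on down the series.
```

Each extra lag adds an attribute; for SVM the kernel absorbs this growth, whereas for ANN each added input enlarges the weight matrix to be trained.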

4. Adaptive Learning. Since SVM learning is not "black box" learning, it is data adaptive to some degree. In fact, since only the useful training vectors form the basis for defining the final decision function, SVM is expected to give a relatively good generalization performance for future hydrologic conditions also, unless the catchment undergoes a drastic natural or man-made change affecting the underlying physical process.

5. Optimal Architecture. SVM relieves the user from the time-consuming trial-and-error procedure of synthesizing a suitable network architecture as in ANN, where the choice of network architecture is usually determined by the user's past experience and preference rather than the physical aspects of the problem. The choice of an optimal network architecture is one of the major issues in ANN applications. In SVM, the final architecture is automatically obtained from the solution of the optimization problem that gives the support vectors. The number of support vectors can actually be seen as the number of hidden neurons in a single hidden layer architecture.

6. No Local Minima. The optimization problem formulated for SVM is always uniquely solvable and thus does not suffer from the limitations of regularization schemes (like early stopping, etc.) in ANN that may lead it to a local minimum.

7. Exploiting Higher Dimensional Features. In feature selection, one frequently seeks to identify the smallest set of features that still conveys the essential information contained in the original attributes, called "dimensionality reduction." This is considered helpful particularly because, with growth in the number of features, the computational and generalization performance degrades. However, this poses the obvious problem of not exploiting the capabilities of the higher dimensional feature space. For example, in the use of principal component analysis (PCA), dimensionality reduction is sometimes done by simply removing features corresponding to directions in which the data have low variance, though there is no guarantee that these features are not essential for the analysis. SVM offers an efficient way to deal with this problem owing to the dual representation of the machine, in which the training examples always appear in the form of inner products between pairs of examples. Thus, efficient use of the high dimensional feature space is possible through kernel functions.

8. Learning Bias in Higher Dimension. Besides computational problems, the danger of overfitting inherent in high dimensions may result in poor generalization. SVM provides a sophisticated learning bias from statistical learning theory to account for the generalization problem in the higher dimensional feature space.

FORECASTING DHAKA FLOOD STAGE

The city of Dhaka is located in the flat region in proximity to the confluence of three major rivers: the Brahmaputra, the Ganges, and the Meghna. All the rivers carry heavy runoff during the monsoon (May to September), when their catchments incur intense rainfalls as high as 1,100 cm. The major rivers and their tributaries have their origins outside Bangladesh, and only about 7.5 percent of their total catchment area of about 1,500,000 km2 lies within Bangladesh. Approximately 90 percent of their annual flows originate outside the country.

Following the two successive catastrophic flood events in 1987 and 1988, it has been recognized (Hoque, 1994) that it is necessary to: (1) examine the effectiveness of flood control options, (2) improve the accuracy of current flood forecasting practices, (3) extend prediction beyond the major rivers in Bangladesh, and (4) evaluate a sound flood action policy. Therefore, a sound flood forecasting tool is absolutely needed.

Historically, data on rainfall, water level, and discharge at selected locations are available for the past 32 years. Since 1990, real-time water level data have been received from about 40 water level stations and 46 rainfall stations within Bangladesh. The Flood Forecasting and Warning Division of the Bangladesh Water Development Board (BWDB) has been established to process the data and make routine flood forecasts during the monsoon period. At present, the real time flood forecast is being carried out by BWDB using the MIKE11 (DHI, 1993) modeling system. Physically based models (like MIKE11) are not ideal for real-time forecasting due to the large amount of data they require and their long computational time. Further, according to the Flood Action Plan 10 (FAP10) report, this could be achieved effectively only by obtaining water level data within the Indian Territory for the two major rivers, the Ganges and the Brahmaputra. There is an agreement on data transmission between India and Bangladesh but, due to unknown reasons, the transmitted data were delayed or not received at all. Liong et al. (1999) suggested a data driven approach (an ANN model) for predicting the water level at Dhaka with minimum information: only the historical water level data available within the country. The present study suggests SVM as a potential alternative, which relieves the user from the time-consuming trial-and-error procedure of synthesizing a suitable network architecture as in ANN, without compromising on prediction accuracy.

Data from a total of eight water level stations are available (Figure 2). However, based on the sensitivity analysis carried out by Liong et al. (1999), it was found that only five stations contribute significantly to the flood stage at Dhaka. The stations are: Dhaka (ST 1), Rajshahi (ST 2), Chilmari (ST 3), Durgapur (ST 4), Kanairghat (ST 5), and Habiganj (ST 7). Only the current day water level at these stations is considered for water level prediction at Dhaka. The influence of the magnitude of the lateral inflows is represented in the water levels of the gauging stations located downstream.

SVM is implemented on the Bangladesh flood data using a MATLAB tool developed by Gunn (1997). Daily water level data from 1991 to 1996 are available for the five gauging stations. These data are used for training and verification of the SVM. Similar to the study by Liong et al. (1999), data for the years 1992, 1993, and 1995 are selected for training, and those of 1991, 1994, and 1996 are used for verification.

An RBF kernel is used with the spread σ as defined in Equation (14). Different combinations of C (Equation 5) and σ values are tried to yield the best performance on the training data. Table 1 enumerates the best values of the C and σ parameters resulting for one- to seven-day lead period advance forecasting.
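The tuning procedure described above amounts to a grid search over C and σ. A minimal sketch follows (ours, with scikit-learn rather than the MATLAB tool used in the study; the synthetic data and candidate values are arbitrary, scikit-learn parameterizes the RBF width as gamma = 1/(2σ²), and its ε-insensitive loss with ε = 0 is not the quadratic loss the study adopted):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(60, 5))   # stand-in for 5-station water levels
y = X.sum(axis=1) + 0.05 * rng.standard_normal(60)

# Convert candidate sigma (RBF width) values to scikit-learn's gamma.
sigmas = np.array([3.5, 5.5, 7.5])
grid = {"C": [50, 75, 130], "gamma": list(1.0 / (2.0 * sigmas ** 2))}

# Exhaustively try every (C, gamma) pair, scoring by cross-validated R2.
search = GridSearchCV(SVR(kernel="rbf", epsilon=0.0), grid, cv=3).fit(X, y)
print(search.best_params_)
```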

The performance of the predictions is evaluated by various goodness-of-fit measures. They are the coefficient of efficiency (Nash-Sutcliffe coefficient), R2, the Root Mean Square Error (RMSE), and the Mean Absolute Error (MAE), expressed as

$$R^2 = 1 - \frac{\sum_{i=1}^{p} \left[ (x_m)_i - (x_s)_i \right]^2}{\sum_{i=1}^{p} \left[ (x_m)_i - \bar{x}_m \right]^2} \qquad (15)$$

$$RMSE = \sqrt{\frac{1}{p} \sum_{i=1}^{p} \left[ (x_m)_i - (x_s)_i \right]^2} \qquad (16)$$

$$MAE = \frac{1}{p} \sum_{i=1}^{p} \left| (x_m)_i - (x_s)_i \right| \qquad (17)$$

where the subscripts m and s represent the measured and predicted water levels, respectively; p is the total number of events considered; and $\bar{x}_m$ is the mean value of the measured data. An R2 value of 1 implies a perfect match between the measured and the predicted data, in which case both RMSE and MAE are 0.
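Equations (15) through (17) translate directly into code (our sketch; the function names are ours, and the toy data simply exercise the perfect-match case noted above):

```python
import numpy as np

def r2_nash_sutcliffe(x_m, x_s):
    """Coefficient of efficiency, Equation (15)."""
    return 1.0 - np.sum((x_m - x_s) ** 2) / np.sum((x_m - x_m.mean()) ** 2)

def rmse(x_m, x_s):
    """Root Mean Square Error, Equation (16)."""
    return np.sqrt(np.mean((x_m - x_s) ** 2))

def mae(x_m, x_s):
    """Mean Absolute Error, Equation (17)."""
    return np.mean(np.abs(x_m - x_s))

measured = np.array([4.1, 4.5, 5.0, 5.4])    # toy water levels (m)
predicted = measured.copy()                   # perfect match
assert r2_nash_sutcliffe(measured, predicted) == 1.0
assert rmse(measured, predicted) == 0.0 and mae(measured, predicted) == 0.0
```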

The study carried out by Liong et al. (1999), using a multi-layered feed-forward neural network (MLF) with error back-propagation, is considered for result comparison. The final neural network architecture adopted in their study consisted of an input layer of five neurons representing the water levels at the

Figure 2. Bangladesh River System and Gauging Stations Used in Study.

five stations, one hidden layer with 30 neurons, and an output layer with one neuron representing the water level at Dhaka. A sigmoid activation function was selected, and 2,000 epochs were used as the criterion to terminate the computation. The learning rate and momentum rate were fixed at 0.1.

TABLE 1. C and σ Values for Different Lead Days.

Number of Lead Days      C       σ
(1)                     (2)     (3)
1                        75     5.5
2                        50     4.0
3                        60     3.5
4                        65     6.0
5                        50     7.5
6                       130    13.5
7                       170    25

The present study is further extended to identify the useful training exemplars. After removing the "non-essential" training vectors, the performance of SVM on the reduced data set is compared to that of ANN. This case study is conducted for seven-day prediction.

RESULTS AND DISCUSSION

The result of SVM on the Bangladesh flood data shows its robustness in both training and verification. As expected, the training set for shorter lead days yields higher R2 values and lower RMSE and MAE values than that of longer lead days, as listed in Table 2. The R2 value is 0.999 for up to three-lead day and 0.974 for seven-lead day. The maximum water level prediction error varies from 0.159 m for one-lead day to 0.687 m for seven-lead day forecasting. The R2 for

TABLE 2. Water Level Prediction Error and Goodness-of-Fit.

                 Time          Goodness-of-Fit                  Maximum Water        Error
Status          (days)   R2 Value     RMSE (m)       MAE (m)    Level Error (m)   Std. Dev. (m)
Training           1   0.999 (0.996)  0.032 (0.078)  0.022 (0.057)  +0.159 (0.360)  0.032 (0.078)
                   2   0.999 (0.990)  0.037 (0.118)  0.026 (0.089)  +0.217 (0.535)  0.036 (0.118)
                   3   0.999 (0.983)  0.036 (0.153)  0.026 (0.116)  +0.169 (0.747)  0.036 (0.153)
                   4   0.998 (0.974)  0.052 (0.187)  0.037 (0.140)  +0.271 (0.985)  0.052 (0.184)
                   5   0.993 (0.961)  0.097 (0.229)  0.074 (0.170)  +0.405 (0.989)  0.098 (0.211)
                   6   0.990 (0.956)  0.119 (0.243)  0.089 (0.179)  +0.466 (1.017)  0.119 (0.242)
                   7   0.974 (0.942)  0.184 (0.275)  0.138 (0.201)  +0.687 (1.126)  0.184 (0.273)
Validation         1   0.993 (0.993)  0.079 (0.079)  0.058 (0.058)  +0.470 (0.475)  0.076 (0.076)
                   2   0.986 (0.985)  0.115 (0.116)  0.092 (0.093)  +0.417 (0.441)  0.112 (0.116)
                   3   0.977 (0.974)  0.144 (0.153)  0.116 (0.123)  +0.490 (0.614)  0.140 (0.153)
                   4   0.968 (0.960)  0.168 (0.186)  0.137 (0.150)  +0.535 (0.631)  0.166 (0.179)
                   5   0.956 (0.938)  0.194 (0.231)  0.156 (0.183)  +0.479 (0.705)  0.193 (0.205)
                   6   0.943 (0.934)  0.222 (0.234)  0.178 (0.184)  +0.629 (0.678)  0.220 (0.230)
                   7   0.931 (0.916)  0.244 (0.262)  0.194 (0.205)  +0.700 (0.857)  0.244 (0.256)

NOTES: + implies that the measured value is underestimated. Values in ( ) refer to ANN results obtained by Liong et al. (1999).

verification set varies from 0.993 for one-lead day to 0.931 for seven-lead day. The maximum water level prediction error varies from 0.47 m for one-lead day to 0.70 m for seven-lead day.

Table 2 also lists the R2, RMSE, and MAE values of the verification data for one-lead day to seven-lead day forecasting. The results are now compared to those of ANN (the values in brackets in Table 2). They show that SVM is at least as robust as ANN in prediction; in fact, for higher lead days, SVM performs better than ANN in both training and verification. Figure 3 shows the scatter plots for the verification data. The plots clearly indicate that even at seven-lead day the prediction is still very good. Figure 4 compares the measured water levels with those predicted by ANN and SVM for seven-lead day forecasting of the verification data. The SVM predictions are very close to those of ANN. The validation results are marginally better than ANN's for shorter lead days, while five-lead days or higher show clearly better performance in terms of R2: the R2 value of SVM is 0.931 as compared to 0.916 for ANN at seven-lead day. Also, the maximum water level prediction error of SVM for seven-lead day is 0.700 m as opposed to 0.857 m resulting from ANN, an improvement of 15.7 cm. This maximum error of 0.70 m is actually an overprediction and is, therefore, favorable in terms of flood evacuation measures. The large events are predicted more accurately by both SVM and ANN.

Figure 5 gives a pictorial representation of the form of the final equation. The (αi − αi*) values for different lead days are found after SVM training.

It is found (after some sequential elimination and retraining of the SVM network) that, out of the total of 467 training exemplars (years 1992, 1993, and 1995), only 139 are actually useful; these correspond to all exemplars with |αi − αi*| greater than 60. The seven-day prediction results from ANN and SVM training with the 139 exemplars are given in Table 3. In terms of R2, SVM performs better than ANN with the useful (and reduced) data set, indicating the advantage of the SRM principle. As seen from the scatter plot (Figure 6), the medium and some of the high water levels are better predicted by SVM.
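The selection criterion above can be sketched in NumPy as follows; the helper name and the toy coefficients are hypothetical, and in practice the retained indices would be used to retrain the SVM on the reduced set.

```python
import numpy as np

def useful_exemplars(alpha_diff, threshold=60.0):
    """Indices of training exemplars whose |alpha_i - alpha_i*| exceeds
    the threshold; the rest are treated as non-essential and dropped."""
    alpha_diff = np.asarray(alpha_diff, dtype=float)
    return np.flatnonzero(np.abs(alpha_diff) > threshold)

# Toy coefficients: only two exemplars exceed the |.| > 60 criterion.
coeffs = [0.0, 75.0, -10.0, -120.0, 59.9]
keep = useful_exemplars(coeffs)   # indices 1 and 3 survive the cut
```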

The parameters of SVM, C and σ, are finalized after some trial and error. Generally, the value of C is sensitive between 0 and 100. The trial-and-error determination of C and σ converges very quickly, whereas ANN took significantly longer to arrive at the optimal architecture for the given training set. Table 4 shows the various time-consuming and effortful trials conducted before arriving at the optimal ANN architecture for the case of training with only the 139 useful exemplars.
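The trial-and-error tuning of C and σ can be organized as a simple grid search scored on a validation set; the helper below and its stand-in score function are illustrative assumptions, not the authors' procedure.

```python
import itertools

def grid_search(c_values, sigma_values, score_fn):
    """Trial-and-error over (C, sigma): return the pair with the highest
    validation score, mimicking the tuning described in the text."""
    best_pair, best_score = None, float("-inf")
    for c, sigma in itertools.product(c_values, sigma_values):
        score = score_fn(c, sigma)   # e.g. validation R^2 of a trained SVM
        if score > best_score:
            best_pair, best_score = (c, sigma), score
    return best_pair, best_score

# Stand-in score function peaking at C = 75, sigma = 5.5
# (the one-lead-day values reported in Table 1).
score = lambda c, s: -((c - 75) ** 2 + (s - 5.5) ** 2)
best, _ = grid_search([50, 75, 100], [4.5, 5.5, 6.5], score)
```

In a real run, `score_fn` would train an SVM with the candidate (C, σ) pair and return its goodness-of-fit on held-out data.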


Figure 3. Scatter Plots of Different Lead Days: Verification.
[Figure: panels (a) through (g) plot predicted against measured water levels for one-lead day through seven-lead day.]


Figure 4. Seven-Lead Day Plots for Verification Data.
[Figure: measured water levels against time (days), with ANN and SVM predictions, for (a) 1991, (b) 1994, and (c) 1996 verification years.]


Figure 5. Pictorial Representation of the Final Form of the SVM Equation.
[Figure: input nodes ST1 through ST5 feed a hidden layer of n (= 467 training exemplars) inner-product kernels K(xi, x), whose outputs are weighted by (αi − αi*) and summed.]
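The final form depicted in Figure 5, f(x) = Σi (αi − αi*) K(xi, x) + b with an RBF kernel of width σ, can be evaluated directly once training has fixed the coefficients; the support vectors, coefficients, and bias below are toy values for illustration only.

```python
import numpy as np

def rbf_kernel(xi, x, sigma):
    """Radial Basis Function kernel K(xi, x) with width sigma."""
    return np.exp(-np.sum((xi - x) ** 2) / (2.0 * sigma ** 2))

def svm_predict(x, support_vectors, alpha_diff, b, sigma):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b."""
    return sum(a * rbf_kernel(sv, x, sigma)
               for sv, a in zip(support_vectors, alpha_diff)) + b

# Toy example: two support vectors in a 5-dimensional input space
# (one water level per gauging station), with made-up coefficients.
svs = [np.zeros(5), np.ones(5)]
alpha_diff = [0.5, -0.2]
level = svm_predict(np.zeros(5), svs, alpha_diff, b=3.0, sigma=5.5)
```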

TABLE 3. Case Study for Identifying "Useful" Exemplars.

                                                                            Maximum Water        Error
Case Study                     R2 Value       RMSE (m)       MAE (m)        Level Error (m)   Std. Dev. (m)
With 467 Training Exemplars    0.931 (0.916)  0.244 (0.262)  0.194 (0.205)  +0.700 (0.857)    0.244 (0.256)
With 139 Useful Exemplars      0.928 (0.893)  0.244 (0.296)  0.194 (0.239)  +0.683 (-1.003)   0.243 (0.272)

NOTES: + implies that the measured value is underestimated. Values in ( ) refer to ANN results.

Flood forecasting is concerned with making an accurate forecast over a considerably short time period so as to allow the authorities to issue warnings of the impending flood and to take evacuation measures as necessary. The SRM principle allows SVM to generalize better than ANN, particularly under data-deficient conditions. In general, the prediction accuracies of ANN and SVM in this particular study are not significantly different. However, SVM relieves the user from the time-consuming trial-and-error procedure of synthesizing a suitable network architecture as in ANN (which is one of the major lacunae in ANN


Figure 6. Scatter Plot for Seven-Lead Day Verification With Useful Training Exemplars.
[Figure: predicted against measured water levels for (a) ANN (7-lead day) and (b) SVM (7-lead day).]

TABLE 4. Optimal ANN Architecture Selection for Training With Useful Exemplars.

Exemplar Set   Optimization Parameter        Variables                        Optimal Value
A              Epoch Size                    100, 500, 1000, 1500, 2000       1000 epochs
B              Hidden Neurons                5, 8, 15, 20, 30                 15 hidden neurons
C              Learning Rate                 0.1, 0.3, 0.5, 0.7, 0.9          0.1 learning rate
D              Momentum Constant             0.1, 0.3, 0.5, 0.7, 0.9          0.1 momentum constant
E              Activation Function           Sigmoid, Tanh, Linear            Sigmoid
F              Hidden Layers                 1, 2, 3                          1 hidden layer
G              Initial Weight Distribution   ±0.1, ±0.3, ±0.5, ±0.7, ±0.9     ±0.3 initial weight distribution


applications, as reported by many researchers). This becomes particularly important when the number of input variables is considerably large, as when we desire to include more information from the same and other gauging stations. The resulting ANN architecture can become too big to handle and would require a large number of exemplars to train the network to the desired generalization error. The dual form of representing the input variables in SVM offers a significant advantage in computational efficiency. Further, it is possible to identify "useful" training exemplars. This may help in data storage, particularly for real-time forecasting problems where huge data handling is required.

CONCLUSION

The application of the Support Vector Machine as a robust forecasting tool has been shown through its implementation on the Bangladesh flood stage data. Since SVM is an approximate implementation of the Structural Risk Minimization principle, it offers better generalization performance, particularly when the available training set is limited. Further, SVM's inherent properties give it an edge in overcoming some of the major lacunae in the application of ANN. For example, unlike ANN, SVM does not require the architecture to be defined a priori. SVM also helps in identifying the useful training exemplars from the available training set. The optimization problem formulated for SVM is always uniquely solvable and thus does not suffer from the limitations of regularization as in ANN, which may lead ANN to local minima. SVM is both easy to implement and to use.

SVM is, however, still at a very early stage, and not many applications can be found in function approximation or regression problems. Identification of optimal parameters is still an active research area in SVM, and presently there seems to be no alternative but trial and error to arrive at near-optimal parameters. More research has to be done to establish the specific advantages SVM offers in comparison to ANN, which may help open a new avenue in hydrologic modeling and forecasting.

APPENDIX: NOTATIONS

The following symbols are used in this paper:

β = feasible subset value;
fβ = feasible function;
R(β) = actual risk;
Remp(β) = empirical risk;
… = confidence interval;
h = VC-dimension;
x = input vector (in this study, it refers to the current water levels at five stations);
y = measured output (in this study, it refers to the water level at Dhaka);
G = higher dimensional feature space (where the input vector dimension is increased);
φ = the nonlinear mapping function;
f = the linear function in feature space;
a and b = coefficients that have to be estimated;
d = measured targets;
ξ and ξ* = slack variables;
ε = insensitive loss function;
α and α* = Lagrange multipliers;
C = user specified constant, determining the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated;
n = the number of training samples;
K = the kernel function;
σ = width of the RBFs;
p = total number of events considered for prediction;
R2 = Nash-Sutcliffe coefficient;
RMSE = root mean square error; and
MAE = mean absolute error.

Subscripts

i, j = positive integer indices;
m = measured; and
s = predicted.


LITERATURE CITED

Arthur, W. T., C. M. Tam, and D. K. Liu, 2001. Comparative Study of ANN and MRA for Predicting Hoisting Times of Tower Cranes. Building and Environment 36:457-467.

ASCE Task Committee, 2000a. Artificial Neural Networks in Hydrology. 1. Preliminary Concepts. Journal of Hydrologic Engineering, ASCE 5(2):115-123.

ASCE Task Committee, 2000b. Artificial Neural Networks in Hydrology. 2. Hydrologic Applications. Journal of Hydrologic Engineering, ASCE 5(2):124-137.

Cristianini, N. and J. Shawe-Taylor, 2000. An Introduction to Support Vector Machines. Cambridge University Press, United Kingdom.

DHI, 1993. MIKE 11: User Manual. Horsholm, Denmark.

Drucker, H., C. Burges, L. Kaufman, A. Smola, and V. Vapnik, 1996. Linear Support Vector Regression Machines. NIPS 96.

Gunn, S. R., 1997. Support Vector Machines for Classification and Regression. Technical Report, Image Speech and Intelligent Systems Research Group, University of Southampton, United Kingdom.

Haykin, S., 1999. Neural Networks: A Comprehensive Foundation. Prentice Hall International Inc., New Jersey.

Hoque, M. M., 1994. Evaluating Design Characteristics for Floods in Bangladesh. In: Proceedings of 2nd International Conference on River Flood Hydraulics, W. R. White and J. Watts (Editors). Wiley, Chichester, England and New York, New York, pp. 15-26.

Hsu, K. L., H. V. Gupta, and S. Sorooshian, 1995. Artificial Neural Network Modelling of the Rainfall-Runoff Process. Water Resources Research 31(10):2517-2530.

Karunanithi, N., W. J. Grenney, D. Whitley, and K. Bovee, 1994. Neural Networks for River Flow Prediction. Journal of Computing in Civil Engineering, ASCE 8(2):201-220.

Liong, S. Y., W. H. Lim, and G. N. Paudyal, 1999. River Stage Forecasting in Bangladesh: Neural Network Approach. Journal of Computing in Civil Engineering, ASCE 14(1):1-8.

Mukherjee, S., E. Osuna, and F. Girosi, 1997. Nonlinear Prediction of Chaotic Time Series Using Support Vector Machines. In: Neural Networks for Signal Processing – Proceedings of the IEEE Workshop. IEEE, Piscataway, New Jersey, pp. 511-520.

Smola, A. J. and B. Schölkopf, 1998. A Tutorial on Support Vector Regression. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, United Kingdom.

Tokar, A. S. and P. A. Johnson, 1999. Rainfall-Runoff Modeling Using Artificial Neural Networks. Journal of Hydrologic Engineering, ASCE 4(3):232-239.

Toth, E., A. Brath, and A. Montanari, 2000. Comparison of Short-Term Rainfall Prediction Models for Real-Time Flood Forecasting. Journal of Hydrology 239:132-147.

Vapnik, V., S. Golowich, and A. Smola, 1996. Support Vector Method for Function Approximation, Regression Estimation and Signal Processing. NIPS 96.

Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York, New York.

Vapnik, V., 1999. An Overview of Statistical Learning Theory. IEEE Transactions on Neural Networks 10(5):988-999.
