development of early-warning protocol for predicting chlorophyll-a concentration using machine...


Page 1: Development of early-warning protocol for predicting chlorophyll-a concentration using machine learning models in freshwater and estuarine reservoirs, Korea

Science of the Total Environment 502 (2015) 31–41

Contents lists available at ScienceDirect

Science of the Total Environment

journal homepage: www.elsevier.com/locate/scitotenv

Development of early-warning protocol for predicting chlorophyll-a concentration using machine learning models in freshwater and estuarine reservoirs, Korea

Yongeun Park a, Kyung Hwa Cho b, Jihwan Park a, Sung Min Cha c, Joon Ha Kim a,⁎
a School of Environmental Science and Engineering, Gwangju Institute of Science and Technology (GIST), 261 Cheomdan-gwagiro, Buk-gu, Gwangju 500-712, Republic of Korea
b School of Urban and Environmental Engineering, Ulsan National Institute of Science and Technology (UNIST), 50 UNIST-gil, Eonyang-eup, Ulju-gun, Ulsan 689-798, Republic of Korea
c Jeollanam-do Environmental Industries Promotion Institute, 650-94 Songgye-ro, Seongjeon-myeon, Gangjin-gun, Jeollanam-do, 527-811, Republic of Korea

HIGHLIGHTS

• Two machine learning models were used to predict chlorophyll-a concentration.
• Two models were trained and validated using 7-year monitoring data.
• Sensitivity analysis determined the sensitivity of input variables for the models.
• The support vector machine was found to be a reliable early-warning model.
• This study proposed a simple early warning protocol for managing algal blooms.

⁎ Corresponding author. Tel.: +82 62 715 3277; fax: +82 62 715 2434.
E-mail address: [email protected] (J.H. Kim).

http://dx.doi.org/10.1016/j.scitotenv.2014.09.005
0048-9697/© 2014 Elsevier B.V. All rights reserved.

ARTICLE INFO

Article history:
Received 15 April 2014
Received in revised form 8 August 2014
Accepted 1 September 2014
Available online xxxx

Editor: Simon Pollard

Keywords:
Artificial neural network
Support vector machine
Early warning
Chlorophyll-a
Prediction accuracy
Sensitivity analysis

ABSTRACT

Chlorophyll-a (Chl-a) is a direct indicator used to evaluate the ecological state of a waterbody, such as algal blooms that degrade the water quality in lakes, reservoirs and estuaries. In this study, artificial neural network (ANN) and support vector machine (SVM) were used to predict Chl-a concentration for the early warning in the Juam Reservoir and Yeongsan Reservoir, which are located in an upstream region (freshwater reservoir) and downstream region (estuarine reservoir), respectively. Weekly water quality data and meteorological data for a 7-year period were used to train and validate both the ANN and SVM models. The Latin-hypercube one-factor-at-a-time (LH-OAT) method and a pattern search algorithm were applied to perform sensitivity analyses for the input variables and to optimize the parameters of the two models, respectively. Results revealed that the two models well-reproduced the temporal variation of Chl-a based on the weekly input variables. In particular, the SVM model showed better performance than the ANN model, displaying a higher prediction accuracy in the validation step. The Williams–Kloot test and sensitivity analysis demonstrated that the SVM model was superior for predicting Chl-a in terms of prediction accuracy and description of the cause-and-effect relationship between Chl-a concentration and environmental variables in both the Juam Reservoir and Yeongsan Reservoir. Furthermore, a 7-day interval was determined as an efficient early warning interval in the two reservoirs. As such, this study suggested an effective early-warning prediction method for Chl-a concentration and improved the eutrophication management scheme for reservoirs.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Algal blooms commonly occur in receiving waterbodies, causing a potential deterioration of water quality, often resulting in problems such as depletion of oxygen, reduced water transparency, and decreased biodiversity in marine and freshwater environments (Hartnett and Nash, 2004). These problems subsequently pose serious direct and indirect threats to aquatic ecosystems and public health (Glasgow et al., 2004). To prevent severe occurrences, efficient mitigation and management techniques should be developed by monitoring and modeling algal blooms.

One promising action could be reliable early-warning predictions of algal blooms by incorporating key environmental variables (e.g., temperature, light, and nutrients) (Anderson et al., 2001). Direct modeling of algal blooms, however, may be limited due to practical problems such as insufficient observations and the complex behavior of the algal community. Currently, the chlorophyll-a (Chl-a) concentration is a useful indicator for measuring the abundance and variety of phytoplankton and/or algal biomass (Boyer et al., 2009). Because all photosynthetic algae contain Chl-a, algal blooms can be easily predicted by investigating the Chl-a concentration in waterbodies.

Complex relationships between diverse environmental factors (e.g., temperature, light, nutrients) are involved in predicting Chl-a concentration (Lee et al., 2003). In order to reliably simulate the Chl-a concentration, models need to incorporate hydrological, geochemical, and ecological variables that impact algal growth. Several models have been used to predict Chl-a concentration, and can be categorized according to deterministic and stochastic approaches. Even though a process-based mathematical model has been widely implemented to predict the general ecological response of phytoplankton to several environmental factors (Thomann and Mueller, 1987), the physical dynamics of algal bloom phenomena are not understood well due to the uncertainty of kinetic rate coefficients for different species (Lee and Lee, 1995; Yabunaka et al., 1997). This fact limits the development of an appropriate formulation for simulating algal blooms and subsequently requires an alternative approach for modeling, such as the promotion of data-driven methodology (Lee et al., 2003). In this study, among current models, two data-driven approaches (i.e., stochastic approaches) are compared as an early-warning prediction model of Chl-a concentration in a lake system, through the evaluation of model performance.

Artificial neural network (ANN) and support vector machine (SVM) models are promising approaches used to reflect the nonlinearity between Chl-a concentration and environmental factors using stochastic error minimization approaches. In particular, ANN is a powerful pattern recognition approach that has been used in areas including business, industry, engineering, and science (Widrow et al., 1994) and has also been applied to predicting algal blooms (Lee et al., 2003). However, ANN is limited in that empirical risk minimization (ERM) only minimizes the training error and does not address generalization of its performance in the prediction step (Barton and Meckesheimer, 2006; Yuan and Wang, 2008). Recently, SVM has been introduced as an alternative method for overcoming the intrinsic weaknesses of ANN modeling, while retaining all the advantages of ANN (Govindaraju, 2000). SVM maintains steady performance regardless of input dimensionality and correctly determines the global optimum during the regression process (Ren and Bai, 2010).

In this paper, a detailed assessment for two stochastic models is performed in order to evaluate the early warning predictability of Chl-a concentration through the prediction accuracy and an analysis of the relationship between input and output. Although a straightforward comparison (e.g., determining model performance using prediction accuracy) between ANN and SVM has been presented (Balabin and Lomakina, 2011; Behzad et al., 2009; Chen et al., 2005), there has yet to be a comprehensive analysis in terms of the application of parameter optimization and sensitivity analysis. Therefore, to investigate the model performances in predicting the Chl-a concentration, ANN and SVM models were set up in the Juam Reservoir (freshwater reservoir) and Yeongsan Reservoir (estuarine reservoir) in the southwestern part of Korea, which have distinct site-specific characteristics. The objectives of this study are: 1) to develop a reliable model for early warning prediction of Chl-a using ANN and SVM by optimizing key model parameters, 2) to evaluate model-specific features based on model performance in terms of a statistical evaluation and sensitivity analysis in response to different input variables, and 3) to propose a simple early warning protocol for managing algal blooms using Chl-a concentration as a key decision-supporting system.

2. Materials and methods

2.1. Site description and data acquisition

The Juam Reservoir (JAR) and Yeongsan Reservoir (YSR) are located in the southwestern region of Korea (see Fig. 1). The JAR, surrounded by mountainous valleys at approximately 400 m altitude above sea level, is a freshwater lake that flows into the Seomjin River. This lake is the most important freshwater resource in the region (e.g., Gwangju, Naju, Mokpo) and supplies 25 million m³/day for drinking water. It is approximately 40 km long, and has a surface area of 33 km², average depth of 14 m (maximum depth of 47 m), and basin area of 1010 km² (Shin et al., 2000). In contrast, the YSR is an estuarine reservoir built in 1981 by damming the downstream end of the Yeongsan River, which supplies agricultural water and prevents flooding in the surrounding regions; it is 23.5 km from the Mongtan Bridge to the Yeongsan Estuarine Dam, and has a surface area of 34.6 km², average depth of 10 m (maximum depth of 21 m), and a basin area of 3468 km² (Lee et al., 2009).

Water quality in the JAR is maintained in relatively good condition compared to other major reservoirs in South Korea, in terms of nutrients and Chl-a concentration (Jones et al., 2003). Because the JAR basin is dominantly composed (80%) of forested area (8% agricultural area, 0.6% urban area, etc.), nutrients released from the basin are relatively insignificant (Jones et al., 2006). Water quality in the YSR, however, has been drawing researchers' attention in recent years because of the aggravated aqua-ecological state caused by its structural deficiency (i.e., estuarine dam) and the pollutant load from the Yeongsan Watershed. As the dam is constructed at the outlet of the Yeongsan River, natural water circulation has been inhibited, resulting in a degradation of water quality due to the anoxic and hypoxic conditions in the bottom layer (Lee et al., 2009, 2010). Furthermore, there are numerous point and non-point sources discharging into the YSR from the Yeongsan Watershed, causing eutrophication (Ki et al., 2007; Cho et al., 2009a).

The water quality data in the two reservoirs were monitored by the Yeongsan River Environmental Research Center (near dike dams in both the JAR and YSR; see Fig. 1). Surface water samples were taken at weekly intervals over a 7-year period (from 2006 to 2012). Water quality observations in this study include chlorophyll-a (Chl-a), phosphate phosphorus (PO4-P), ammonium nitrogen (NH3-N), nitrate nitrogen (NO3-N), and water temperature. Daily meteorological data were monitored by the Korea Meteorological Administration at local weather stations; of these, solar radiation and wind speed were used in this study. The five water quality variables and two meteorological variables were used as the inputs and outputs for the two stochastic models. The descriptive statistics for the data are shown in Table 1, including the number of data, minimum, maximum, mean, and standard deviation. As mentioned above, a wider range of water quality parameters was observed in the YSR, compared to the JAR.

2.2. Theoretical background of applied stochastic models

2.2.1. Artificial neural network

ANN is a useful method for classifying patterns of multi-variable datasets and modeling complex environmental processes (Cho et al., 2011). The structure of ANN consists of two or more layers of nodes, including an input layer, hidden layer, and output layer that are connected by links having varying weights. The nodal data are multiplied by the weights to compute the signal strength, and are then transferred to the next node in the network; the input layer nodes accept the input vectors and forward the signals to the next layer according to the connection. This process is continued until the signals reach the output layer. Note that a back propagation learning algorithm was applied in this study to minimize the objective function. ANN consists of three layers having p input nodes ($x_i^{1}, x_i^{2}, \ldots, x_i^{p}$, with $p = 8$), q hidden nodes ($H_i^{1}, H_i^{2}, \ldots, H_i^{q}$), and one output node ($g_i^{1}$) (see Fig. S1 in the Supplementary Material). These structural components were applied to develop ANN, and followed a generalized mathematical expression (Norgaard et al., 2000). Hidden node outputs ($H_i^{q}$) were determined using Eq. (1) based on a transfer function ($f_1$) associated with the input elements ($x_i$), weight ($w^{h}_{pq}$), and bias ($b_q^{1}$), and the final network output ($g_i^{1}$) was then calculated from the hidden node output ($H_i^{q}$) using the transfer function ($f_2$) having a connection weight ($w^{o}_{q1}$) and bias ($b_1^{2}$).

Fig. 1. Map of study sites showing monitoring stations in the JAR and YSR.

These connection weights were calculated using the mean square error (MSE) between the final network output ($g_i^{1}$) and the real value.

$$H_i^{q} = f_1\left(\sum_{p=1}^{N} w^{h}_{pq}\, x_i^{p} + b_q^{1}\right) \tag{1}$$

$$g_i^{1} = f_2\left(\sum_{q=1}^{L} w^{o}_{q1}\, H_i^{q} + b_1^{2}\right) \tag{2}$$

where $H_i^{q}$ is the output in the hidden nodes, $g_i^{1}$ is the output in the network, $x_i^{p}$ is the pth element of the ith input data, $w^{h}_{pq}$ is the connection weight between the pth node of the input layer and the qth node of the hidden layer, $w^{o}_{q1}$ is the connection weight between the qth node of the hidden layer and the node of the output layer, $b_q^{1}$ and $b_1^{2}$ are bias terms, p and q are the numbers of nodes in the input and hidden layers, respectively, $f_1$ is the activation function acting on the input vector, and $f_2$ is the output function operating on the scalar output.

Table 1
Descriptive statistics of water quality and meteorological data for 7-year period (from 2006 to 2012) collected at the JAR and YSR. The first five data columns are water quality data; the last two are meteorological data.

Site  Statistic  Chl-a (μg/l)  PO4-P (mg/l)  NH3-N (mg/l)  NO3-N (mg/l)  Water temperature (°C)  Solar radiation (MJ/m2)  Wind speed (m/s)
JAR   N a        357           357           357           357           357                     357                      357
JAR   Min. b     0.3           0.000         0.001         0.189         3.5                     0.99                     0.1
JAR   Max. b     19.0          0.016         0.205         1.264         25.0                    28.79                    5.8
JAR   Median     2.4           0.002         0.051         0.504         12.0                    14.43                    1.2
YSR   N          350           350           350           350           350                     350                      350
YSR   Min.       0.0           0.000         0.001         0.000         1.0                     1.63                     0.6
YSR   Max.       119.7         0.199         1.841         6.597         33.0                    30.27                    9.8
YSR   Median     7.2           0.059         0.086         2.133         18.0                    14.15                    2.7

a N indicates the number of samples collected.
b Min. and Max. represent minimum and maximum values, respectively.

Here, MATLAB (MathWorks, Inc., Natick, Massachusetts, USA) code was used to develop the ANN model. The training functions (e.g., activation and output functions) used included the logistic sigmoid function, tangent sigmoid function, and linear function. The number of hidden nodes, learning rate, and momentum were the model parameters of ANN, where the optimization algorithm was applied to determine the best parameter set. The number of hidden nodes was used to consider complex non-linear relationships between the input and output. Note that learning and momentum rates are crucial parameters for determining the learning process in terms of potential, stability, and computing time. The learning rate determines the magnitude of the correction by adjusting the weight of the computation unit; the momentum controls the lifespan of corrections in the modification of weights during the training process.
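The forward pass of Eqs. (1) and (2), the MSE objective, and the role of the learning rate and momentum can be illustrated with a minimal Python sketch. This is not the authors' MATLAB code: the function names (`forward`, `mse`, `momentum_update`) are hypothetical, the hyperbolic tangent stands in for the tangent sigmoid transfer function, and the update rule shown is the common textbook form of gradient descent with momentum, assumed here for illustration.

```python
import math


def forward(x, w_h, b_h, w_o, b_o, f1=math.tanh, f2=lambda z: z):
    """Evaluate Eq. (1) and then Eq. (2) for one input vector x.

    w_h[p][q] : weight from input node p to hidden node q
    b_h[q]    : bias of hidden node q
    w_o[q]    : weight from hidden node q to the single output node
    b_o       : bias of the output node
    """
    n_hidden = len(b_h)
    # Eq. (1): hidden-node outputs
    hidden = [
        f1(sum(w_h[p][q] * x[p] for p in range(len(x))) + b_h[q])
        for q in range(n_hidden)
    ]
    # Eq. (2): network output
    output = f2(sum(w_o[q] * hidden[q] for q in range(n_hidden)) + b_o)
    return hidden, output


def mse(predicted, observed):
    """Mean square error between network outputs and real values."""
    return sum((g - y) ** 2 for g, y in zip(predicted, observed)) / len(observed)


def momentum_update(weight, gradient, previous_step, lr=0.10, mo=0.10):
    """One weight correction: lr scales the new correction, and mo carries
    over part of the previous correction (assumed textbook form)."""
    step = mo * previous_step - lr * gradient
    return weight + step, step
```

With eight inputs and five hidden nodes, `w_h` would be an 8 × 5 nested list; the default `lr` and `mo` values simply echo the order of magnitude reported for the optimized ANN and are not prescriptive.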

2.2.2. Support vector machine

SVM is popular due to its attractive features and empirical performance (Cortes and Vapnik, 1995; Vapnik, 1998). It performs classifications by constructing an N-dimensional hyperplane that optimally separates data into two categories. SVM is a similar machine learning model to classical multilayer perceptron neural networks, in which the model uses kernel functions as an alternative training method for the linear, radial basis function (RBF), and multi-layer perceptron classifiers. The weights of the network are found by solving a quadratic programming problem using linear constraints rather than by solving a non-convex equation. In this study, support vector regression (SVR) was applied to forecast the time-series data (Vapnik et al., 1997). The basic idea of SVR is to find a function that estimates the network output ($s_i$), showing the deviation from the real values for all training data (see Fig. S1 in the Supplementary Material). Initially, the input data $X_i$ is mapped into a higher-dimensional feature space via a nonlinear mapping function $\varphi(X_i)$, and the linear regression is then implemented in this space. SVM subsequently approximates the function from the equation

$$s(X_i) = \sum_{i=1}^{T} w_i\,\varphi(X_i) + b \tag{3}$$

where $w_i$ and $b$ are the coefficients determined by minimizing the regularized risk function based on the network output and real value. In this process, in order to simply calculate the nonlinear mapping, a kernel function approach is applied. The kernel function, $\kappa(X_i, X)$, is computed using the inner product $\langle \cdot , \cdot \rangle$ between the nonlinear mapping data ($\varphi(X_i)$, $\varphi(X)$).

Here, MATLAB (MathWorks, Inc., Natick, Massachusetts, USA) code was used to develop the SVM model. The training functions (e.g., kernel function) used included linear, exponential radial basis, and Gaussian radial basis functions. SVM includes three key parameters to set up the model: C, epsilon, and sigma. The hyper-parameter C is a regularization constant used to determine the tradeoff between the complexity of the decision rule and the frequency of error. Epsilon (ε) determines complexity by adjusting the number of support vectors as a prescribed parameter to consider the training error; sigma (σ) is a scale parameter related to the model performance stability.
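To make the roles of the kernel, ε, and σ concrete, the following Python sketch shows a linear kernel, a Gaussian RBF kernel in its usual form exp(−‖a − b‖² / (2σ²)), the ε-insensitive loss used in support vector regression, and a prediction written in the standard kernel (dual) form. All function names are hypothetical, the coefficients passed to `svr_predict` are assumed to come from an already solved quadratic program, and the exact kernel definitions used in the authors' MATLAB code are not given in the text.

```python
import math


def linear_kernel(a, b):
    """Inner product of two input vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def gaussian_rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel; sigma sets the width (scale) of the kernel."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def epsilon_insensitive_loss(predicted, observed, epsilon=0.1):
    """Errors smaller than epsilon are ignored; larger ones grow linearly."""
    return max(0.0, abs(predicted - observed) - epsilon)


def svr_predict(x, support_vectors, coefficients, bias, kernel):
    """Kernel-form prediction: sum_i c_i * kernel(X_i, x) + bias."""
    return sum(
        c * kernel(sv, x) for sv, c in zip(support_vectors, coefficients)
    ) + bias
```

A wider ε tube leaves more training points with zero loss, which is why ε influences the number of support vectors; C would weight the summed loss against the flatness of the regression function in the training objective, which is not implemented here.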

2.3. Methodologies of model application

2.3.1. Training and validation of ANN and SVM

The process for applying ANN and SVM is described in Fig. 2. The minimum number of input variables was selected to develop a simple and effective model, with cause-and-effect relationships between inputs and algal dynamics considered. Lee et al. (2003) reported that algal dynamics is affected by ten variables: namely, solar radiation, total inorganic nitrogen, time lagged chlorophyll-a, phosphate, dissolved oxygen, secchi-disk depth, water temperature, rainfall, wind speed, and tidal range. Among these, seven variables, five from water quality data (Chl-a, PO4-P, NH3-N, NO3-N, and water temperature) and two from meteorological data (solar radiation and wind speed), were chosen as inputs based on relative importance, excluding non-measured secchi-disk depth (Baird et al., 2001; Raven and Geider, 1988; Søndergaard et al., 2003). The sampling interval was also considered as an input variable to evaluate early warning prediction of Chl-a. The maximum, mean, and minimum sampling intervals in days were 26, 7, and 4 in the JAR and 18, 7, and 3 in the YSR, respectively. The two models used this sampling interval data in time-lag form to train patterns between current input and future output. These various sampling interval data were also used to identify which time-lag is the most appropriate timing for early detection. For early warning predictions, the time-lag input data based on its interval were used. In total, eight time-lag variables were applied to the two models as inputs. Data noise was subsequently removed using a noise control system (e.g., moving average), and the model parameters of the two models were then determined using a global optimization algorithm for each learning algorithm. After finalizing the determination, the Chl-a concentrations were predicted from the models and the values compared to the observed Chl-a concentrations to evaluate the prediction accuracy. Finally, a sensitivity analysis was applied to assess logical relevance in order to determine significant input variables for predicting the Chl-a concentration.
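The time-lag pairing and noise removal described above might look like the following Python sketch. It assumes a centered moving average with a window of three samples and pairs the seven variables observed at one sampling date, plus the interval in days to the next date, with the Chl-a value at that next date. The window length, the function names, and the data layout are illustrative assumptions; the paper does not specify them.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at both ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def make_lagged_pairs(days, features, chl_a):
    """Pair inputs at sample t with the Chl-a observed at sample t + 1.

    days     : sampling dates as day counts, in increasing order
    features : one list of input variables per sampling date
    chl_a    : observed Chl-a per sampling date
    """
    pairs = []
    for t in range(len(days) - 1):
        interval = days[t + 1] - days[t]  # sampling interval used as an input
        pairs.append((features[t] + [interval], chl_a[t + 1]))
    return pairs
```

With seven variables per date, each input vector produced by `make_lagged_pairs` has eight entries, which matches the eight time-lag inputs mentioned in the text.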

The input data set was divided into training and validation sets. The data covering a 5-year period (from 2006 to 2010) were used for training, and the data covering a 2-year period (from 2011 to 2012) were used for validation. All inputs were standardized between −1 and 1 to accurately train the model, thereby minimizing the output error. The neural network model used in this study was composed of three layers with multiple nodes: input, hidden, and output layers (see Fig. S1 in the Supplementary Material).
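A minimal Python sketch of this split and scaling follows. It assumes min-max scaling to [−1, 1] with the range taken from the training data, which is one common reading of "standardized between −1 and 1"; the record layout (a dictionary with a `year` key) and the function names are hypothetical.

```python
def split_by_year(records, last_training_year=2010):
    """Training set: 2006 to 2010; validation set: 2011 to 2012."""
    train = [r for r in records if r["year"] <= last_training_year]
    valid = [r for r in records if r["year"] > last_training_year]
    return train, valid


def fit_range(values):
    """Minimum and maximum of one variable, taken from the training set."""
    return min(values), max(values)


def scale(value, lo, hi):
    """Map the interval [lo, hi] linearly onto [-1, 1]."""
    return 2.0 * (value - lo) / (hi - lo) - 1.0
```

Validation values outside the training range fall slightly outside [−1, 1] under this scheme, which is usually acceptable for transfer functions with unbounded input.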

2.3.2. Parameter optimization for ANN and SVM models

Model parameters significantly influence the learning and approximation ability for predicting precise solutions during the training process, i.e., the model performance is dominantly influenced by the network architecture itself. In particular, the performance of ANN has very different characteristics depending on the network size (Maier and Dandy, 1998). If a complex structure with excessive parameter values is constructed, there could be over-fitting in the training process; i.e., though a perfect match may be obtained during training, poor prediction is achieved in the validation process with unknown patterns (Bebis and Georgiopoulos, 1994). In SVM, to ensure a good prediction performance, the main concern is to set appropriate parameters for a given training data set (Cherkassky and Ma, 2004). However, even though the determination of proper model parameters is known to be an important process, challenges remain. Currently, most approaches are based on prior knowledge, user expertise, or experimental trial, such that there is no guarantee that the chosen parameters have been optimized (Ren and Bai, 2010). Hence, in this study, we applied a pattern search algorithm to determine globally optimal parameters for the ANN and SVM models (see Appendix A in the Supplementary Material). The algorithm is useful for determining the global minimum of objective functions by minimizing the prediction error between data sets. The model parameters for the ANN and SVM models were subsequently chosen from within the parameter ranges listed in Table S1 in the Supplementary Material; the ranges were chosen from reference studies (Maier and Dandy, 2000; Cho et al., 2011; Patuelli et al., 2011). The values of C, epsilon, and sigma were investigated using a pattern search within this parameter range, based on a reference study (Wang et al., 2003).
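The idea behind a pattern search can be sketched in Python as a basic compass search: poll each parameter in both directions with a fixed step, accept any improvement, and halve the step when no poll improves the objective. This is a simplified stand-in and not the algorithm of Appendix A or any MATLAB routine; on its own it converges to a local minimum, so the function name, the default step, and the stopping rule are all illustrative assumptions.

```python
def pattern_search(objective, start, lower, upper, step=1.0, tol=1e-3, max_iter=1000):
    """Minimize objective(params) inside box bounds by compass polling."""
    best = list(start)
    best_value = objective(best)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(best)):
            for direction in (1.0, -1.0):
                trial = list(best)
                # Move one parameter and clip it to its bounds.
                trial[i] = min(upper[i], max(lower[i], trial[i] + direction * step))
                value = objective(trial)
                if value < best_value:
                    best, best_value, improved = trial, value, True
        if not improved:
            step *= 0.5  # refine the mesh when no poll point is better
    return best, best_value
```

In the setting of this study, `objective` would train a model with the candidate parameters (for example C, ε, and σ for SVM) and return its prediction error, and the bounds would correspond to the parameter ranges taken from the reference studies.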

2.4. Sensitivity analysis

In this study, a sensitivity analysis was applied to investigate sensitive input variables that influence the prediction of Chl-a concentration. The Latin hypercube one-factor-at-a-time (LH-OAT) method was used as the assessment tool for checking global sensitivity over the entire input range (see Appendix B in the Supplementary Material). The boundary of each variable was set to be the minimum and the maximum value of each input variable. The objective function was set to be the predicted Chl-a concentration.
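A simplified Python sketch of the LH-OAT idea is given below: draw Latin hypercube points over the input ranges, perturb one input at a time around each point, and average the absolute relative change of the model output per input. The details are assumptions made for illustration (perturbation as a fraction of the input range, each perturbation applied to the base point rather than sequentially, and the relative-change formula), and the procedure in Appendix B may differ; all names are hypothetical.

```python
import random


def latin_hypercube(n_points, bounds, rng):
    """One sample per stratum and per variable, strata shuffled independently."""
    columns = []
    for lo, hi in bounds:
        width = (hi - lo) / n_points
        column = [lo + (k + rng.random()) * width for k in range(n_points)]
        rng.shuffle(column)
        columns.append(column)
    return [[columns[j][k] for j in range(len(bounds))] for k in range(n_points)]


def lh_oat(model, bounds, n_points=10, fraction=0.05, seed=0):
    """Mean absolute relative effect (in percent) of each input on model output."""
    rng = random.Random(seed)
    effects = [0.0] * len(bounds)
    for base in latin_hypercube(n_points, bounds, rng):
        base_out = model(base)
        for j, (lo, hi) in enumerate(bounds):
            delta = fraction * (hi - lo)
            moved = list(base)
            # Step upward unless that would leave the input range.
            moved[j] = base[j] + delta if base[j] + delta <= hi else base[j] - delta
            out = model(moved)
            mean_out = (out + base_out) / 2.0
            if mean_out != 0.0:
                effects[j] += abs(100.0 * (out - base_out) / mean_out / fraction)
    return [e / n_points for e in effects]
```

Here `model` would be the trained ANN or SVM predictor of Chl-a, and `bounds` would hold the minimum and maximum of each input variable, so that inputs can be ranked by their averaged effect.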

Fig. 2. Logical flow for developing two stochastic models: Logsig, Tansig, and Purelin indicate the transfer function used to train patterns of data sets in ANN; Linear, RBF (radial basis function), ExRBF (exponential RBF), and GaRBF (Gaussian RBF) indicate the kernel function used to train patterns of data sets in SVM.

2.5. Williams–Kloot test for assessing model performance

The Williams–Kloot test was applied to compare the model accuracy for predicting Chl-a concentration (Williams and Kloot, 1953). The test has been commonly used to select the best model between two (or more) models in various fields (Brassard and Correia, 1977; Cho et al., 2012; Pachepsky et al., 2006), based on the equation

$$Y - \frac{1}{2}\left(Y_A + Y_B\right) = \lambda\left(Y_B - Y_A\right) \tag{4}$$

where $Y$ is the measurement, $Y_A$ is the value predicted by model A, $Y_B$ is the value predicted by model B, and $\lambda$ is the slope of this relation. If $\lambda$ is significantly different from zero (p-value < 0.05) and positive, model B is deemed better than model A.
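The slope λ in Eq. (4) can be estimated by least squares, as in the Python sketch below. It assumes a regression through the origin, since Eq. (4) has no intercept term, and returns the slope together with its t statistic; converting the t statistic to a p-value requires a Student t distribution with n − 1 degrees of freedom, which is left out to keep the sketch dependency-free. The function name is hypothetical.

```python
import math


def williams_kloot(observed, pred_a, pred_b):
    """Slope and t statistic of Y - (YA + YB)/2 regressed on (YB - YA)."""
    z = [y - 0.5 * (a + b) for y, a, b in zip(observed, pred_a, pred_b)]
    d = [b - a for a, b in zip(pred_a, pred_b)]
    sum_dd = sum(di * di for di in d)
    slope = sum(di * zi for di, zi in zip(d, z)) / sum_dd
    residual_ss = sum((zi - slope * di) ** 2 for di, zi in zip(d, z))
    std_error = math.sqrt(residual_ss / (len(z) - 1) / sum_dd)
    if std_error == 0.0:
        # Perfect linear relation: the statistic is unbounded.
        return slope, math.copysign(math.inf, slope)
    return slope, slope / std_error
```

If model B reproduces the observations exactly, the slope equals 0.5; if model A does, it equals −0.5. This matches the reading of Eq. (4) that a significantly positive λ favors model B.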

3. Results and discussions

3.1. Determination of optimal model

3.1.1. Training and validation of ANN and SVM

The 7-year period data including Chl-a, PO4-P, NH3-N, NO3-N, water temperature, solar radiation, and wind speed were used to train and validate the two models. The modeling accuracy was quantitatively compared using the Nash–Sutcliffe efficiency (NSE) coefficient, coefficient of determination (R2), and mean absolute error (MAE) between the predicted and the measured Chl-a concentrations. The optimized models in the JAR and YSR are shown in Table 2, including the training function, model parameters, and statistical evaluation. Fig. 3 compares the observed and predicted Chl-a concentrations in both the training and validation steps in the JAR and YSR, respectively. The prediction accuracies revealed that the two models showed acceptable accuracies for predicting the Chl-a concentration. In the JAR, the prediction accuracy of ANN and SVM was almost identical during the training step, whereas the accuracy of SVM was slightly higher than that of ANN during the validation step. Similarly, in the YSR, the accuracy was almost identical during the training step, whereas the accuracy of SVM was slightly higher than the accuracy of ANN during the validation step.
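The three accuracy measures can be written out in Python as follows, using their standard definitions: NSE as one minus the ratio of the squared prediction error to the variance of the observations about their mean, R2 as the squared Pearson correlation between observed and predicted values, and MAE as the mean absolute difference. The function names are hypothetical, and the paper does not state which R2 definition was used, so the squared-correlation form is an assumption.

```python
def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the mean of observations."""
    mean_obs = sum(observed) / len(observed)
    error_ss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    total_ss = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - error_ss / total_ss


def r_squared(observed, predicted):
    """Squared Pearson correlation (undefined if either series is constant)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


def mae(observed, predicted):
    """Mean absolute error, in the units of the observations."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)
```

NSE penalizes bias while the squared correlation does not, so a prediction that is proportional to the observations but systematically too high can have R2 close to 1 and a much lower NSE.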

3.1.2. Comparison of model performanceAlthough the prediction accuracies in the training steps were almost

identical in the two models, the SVM showed slightly a higher predic-tion accuracy than ANN during the validation steps in both the JARand YSR. The higher performance of SVM can be described as follows.First, it could be caused by the difference of ability in the interpretationof highly nonlinear relationships, i.e., nonlinear interferences cause therelatively poor accuracy of ANN (Thissen et al., 2004; Balabin andLomakina, 2011). In the nonlinear equalization performance compari-son, SVM had similar or superior performance than ANN (Sebald andBucklew, 2000). Second, the structural risk minimization principle of

Page 6: Development of early-warning protocol for predicting chlorophyll-a concentration using machine learning models in freshwater and estuarine reservoirs, Korea

Table 2
Comparison of optimized ANN and SVM performances for predicting Chl-a concentration in the JAR and YSR during training and validation steps.

| Site | Model (training function) | Model parameters (a, b) | NSE (c), Tr (d) | NSE, Vl (d) | R² (c), Tr | R², Vl | MAE (c), Tr | MAE, Vl |
|---|---|---|---|---|---|---|---|---|
| JAR | ANN (Purelin/Purelin) | lr: 0.10; mo: 0.10; #N: 5 | 0.71 | 0.73 | 0.71 | 0.74 | 0.52 | 0.85 |
| JAR | SVM (Linear) | C: 53.04; ε: 0.06; σ: 10.00 | 0.71 | 0.75 | 0.71 | 0.75 | 0.53 | 0.84 |
| YSR | ANN (Purelin/Purelin) | lr: 0.10; mo: 0.29; #N: 5 | 0.63 | 0.41 | 0.63 | 0.43 | 5.72 | 5.73 |
| YSR | SVM (Gaussian RBF) | C: 42.00; ε: 0.10; σ: 6.63 | 0.63 | 0.45 | 0.64 | 0.45 | 5.78 | 5.81 |

(a) lr, mo, and #N denote the learning rate, momentum, and number of hidden neurons, respectively.
(b) C, ε, and σ indicate the penalty parameter, epsilon, and sigma, respectively.
(c) NSE, R², and MAE denote the Nash–Sutcliffe model efficiency coefficient, coefficient of determination, and mean absolute error, respectively.
(d) Tr and Vl indicate the training step and validation step, respectively.

36 Y. Park et al. / Science of the Total Environment 502 (2015) 31–41

SVM was more effective than the empirical risk minimization principle of ANN in terms of minimizing error; SVM uses an upper bound for the generalization error, rather than reducing the training error (Kim, 2003). Third, the method for determining global solutions in ANN is difficult to converge because of its inherent algorithm design, whereas the SVM has ready access to global optimal solutions, obtained by solving a linearly constrained quadratic programming problem (Chen et al., 2005). Fourth, the model parameters for ANN are more complex than those for SVM (Basu et al., 2003). Overall, the optimization of model parameters is not stable in ANN: even when the data set is the same, the model parameters obtained at the end of the optimization process differ from run to run. In other words, ANN has difficulty determining optimized model parameters.

Fig. 3. Comparison of Chl-a concentration between observed and predicted values in ANN and SVM models during training and validation steps: A and B are the results in JAR and YSR, respectively; black solid circle is the observed value; gray solid line with filled triangle is the predicted value by ANN; black solid line with open circle is the predicted value by SVM.
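For reference, the "linearly constrained quadratic programming problem" mentioned above corresponds, in the standard ε-insensitive support vector regression formulation (Vapnik, 1998), to the problem below. This is the textbook form and is given only as an assumption about the variant used in the study. The symbols w, b, φ, ξ_i, and ξ_i^* are introduced here for illustration, and the Gaussian kernel is written in one common parameterization, so the σ convention may differ from the one behind Table 2.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad & \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^{*}\right) \\
\text{subject to} \quad & y_i - \langle w, \phi(x_i) \rangle - b \le \varepsilon + \xi_i, \\
& \langle w, \phi(x_i) \rangle + b - y_i \le \varepsilon + \xi_i^{*}, \\
& \xi_i,\ \xi_i^{*} \ge 0, \qquad i = 1, \dots, n,
\end{aligned}
\qquad
K(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^{2}}{2\sigma^{2}}\right)
```

Because the objective is a convex quadratic and all constraints are linear, any local minimum of this problem is also a global minimum, which is the property contrasted with ANN training in the text. The term ½‖w‖² controls model capacity, while C weights the training errors that fall outside the ε tube.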

The Williams–Kloot test result statistically confirmed that the SVM model performed better than ANN in terms of the slope and significance (Fig. 4). ANN and SVM display no significant performance difference for the training step in either the JAR or YSR, based on the slope, error, and probability (p-value > 0.05 in JAR and YSR). However, based on the slope (2.34 ± 2.09 in JAR and 1.95 ± 1.53 in YSR) and probability (p-value < 0.05 in both JAR and YSR), SVM displayed significantly superior performance in the validation step for both the JAR and YSR, i.e., the statistics clearly reveal a performance difference between ANN and SVM in the validation step.
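The idea behind this comparison can be sketched in a few lines. The code below follows one formulation of the Williams–Kloot test (Williams and Kloot, 1953) found in the model-comparison literature: the deviation of the observations from the mean of the two predictions is regressed, through the origin, on the difference between the predictions. The exact variant, scaling, and significance calculation used by the authors are not given in this section, so the function name, the through-origin regression, and the sample data are all assumptions.

```python
import math


def williams_kloot(observed, pred_a, pred_b):
    """Regress y - (a + b)/2 on (b - a) through the origin (assumed formulation).

    Returns (slope, standard_error, t_statistic). Under this formulation a
    positive slope favours model b and a negative slope favours model a.
    """
    z = [y - 0.5 * (a + b) for y, a, b in zip(observed, pred_a, pred_b)]
    d = [b - a for a, b in zip(pred_a, pred_b)]
    sdd = sum(di * di for di in d)
    slope = sum(zi * di for zi, di in zip(z, d)) / sdd
    residuals = [zi - slope * di for zi, di in zip(z, d)]
    dof = len(z) - 1  # one fitted parameter, no intercept
    se = math.sqrt(sum(r * r for r in residuals) / dof / sdd)
    t_stat = slope / se if se > 0 else math.inf
    return slope, se, t_stat


# Hypothetical data: model b tracks the observations, model a predicts a constant
obs = [1.0, 2.0, 3.0, 4.0]
model_a = [2.0, 2.0, 2.0, 2.0]
model_b = [1.1, 1.9, 3.1, 3.9]
slope, se, t_stat = williams_kloot(obs, model_a, model_b)
print(slope, se, t_stat)
```

A p-value would follow from comparing the t statistic with a Student t distribution with the stated degrees of freedom; that step is omitted here to keep the sketch free of external dependencies.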

3.2. Sensitivity analysis

3.2.1. JAR

The sensitivities of the input variables for outputting Chl-a concentration were examined by changing the values of the input variables within a range between the minimum and maximum values. Table 3 ranks the sensitivities of the 6 input variables applied in the two models. By considering the final effect value, the variables can be divided into 3 groups (i.e., Rank 1, Rank 2, and Ranks 3–6). The most sensitive variable (Rank 1) is PO4-P in both the ANN and SVM models. Algal growth rate is remarkably affected by interactions between temperature, light intensity, and nutrient concentration (Middlebrooks and Porcella, 1971; Baird et al., 2001). Among them, nutrient availability (mainly nitrogen and phosphorus) plays a key role in algal growth, biomass development, and dominance of species (Dugdale, 1967; Tilman et al., 1982). In particular, phosphorus plays an important role in the photosynthetic production of a lake, as a limiting factor in freshwater (Boyce et al., 1987; Correll, 1999; Dillon and Rigler, 1974; Edmondson, 1970; Hecky and Kilham, 1988; Schindler et al., 1971). Dzialowski et al. (2005) reported that reservoirs were P limited when the TN:TP ratio was more than 65 (molar). Based on the mean ratio of TN:TP (i.e., about 145 from 2006 to 2012), the JAR was classified as a phosphorus-limited lake for phytoplankton growth (Jones et al., 2003, 2006). Hence, PO4-P is reasonably identified as the most significant variable for simulating Chl-a concentration in the two models.

Fig. 4. Plots of Williams–Kloot test results used to statistically compare model performance between two models in the JAR and YSR: A and C indicate results based on the training step in the JAR and YSR, respectively; B and D indicate the results based on the validation step; the black solid circle represents the JAR and the gray solid circle indicates the YSR.

The second most sensitive variables (Rank 2) are solar radiation for ANN and NO3-N for SVM. Even though phosphorus is a key nutrient for the primary production of phytoplankton in freshwater, nitrogen remains an influential nutrient in some lakes (Diaz and Pedrozo, 1996; Elser et al., 1990; Jansson et al., 1996; Vincent et al., 1984). Considering the nutrient concentration (i.e., total nitrogen and total phosphorus), JAR belonged to a lower quartile class compared to the major reservoirs in South Korea (Jones et al., 2003). In this class, nitrogen can play an important role as a secondary limiting factor for phytoplankton production. On the other hand, light was seen to be insignificant as a constraint for algal growth (Sterner and Grover, 1998). As such, NO3-N is a more reasonable secondary sensitive variable than solar radiation. Other variables (Ranks 3 to 6) were similar, except for solar radiation and NO3-N. In terms of nitrogen availability for phytoplankton growth, Dortch (1990) reported that there was no definite explanation in terms of inhibition and preference between NH3-N and NO3-N, due to substantial variations in patterns. Thus, the classification of a sensitive nitrogen source between the two variables may be an insignificant issue. Overall, SVM is deemed a more reasonable model than ANN.

3.2.2. YSR

The most sensitive variables are solar radiation for ANN and NH3-N for SVM. Compared to freshwater, coastal and estuarine ecosystems show a complex trend of phytoplankton growth due to the extreme diversity found in the physical-biogeochemical systems. This complexity increases the difficulty in determining a common limiting factor for phytoplankton growth during seasonal variation in the system (Le Pape et al., 1996). In general, algal growth has been limited by nitrogen in marine and estuarine waters (Carpenter and Capone, 1983; Elser et al., 2007; Howarth and Marino, 2006; Vitousek and Howarth, 1991), whereas phytoplankton growth is dominantly affected by light


Table 3
Sensitivity rank of input variables in ANN and SVM using the LH-OAT sensitivity analysis in the JAR and YSR.

| Rank | JAR, ANN: variable | JAR, ANN: final effect | JAR, SVM: variable | JAR, SVM: final effect | YSR, ANN: variable | YSR, ANN: final effect | YSR, SVM: variable | YSR, SVM: final effect |
|---|---|---|---|---|---|---|---|---|
| 1 | PO4-P | 1909.60 | PO4-P | 187.59 | S_rad | 50.94 | NH3-N | 45.17 |
| 2 | S_rad (a) | 1128.60 | NO3-N | 10.48 | NO3-N | 39.39 | NO3-N | 39.44 |
| 3 | NH3-N | 226.00 | W_speed | 0.86 | NH3-N | 31.12 | S_rad | 36.89 |
| 4 | NO3-N | 176.50 | S_rad | 0.54 | W_temp | 24.41 | W_temp | 26.26 |
| 5 | W_temp (a) | 118.61 | NH3-N | 0.45 | W_speed | 20.67 | PO4-P | 21.07 |
| 6 | W_speed (a) | 37.80 | W_temp | 0.08 | PO4-P | 4.13 | W_speed | 6.38 |

(a) S_rad, W_temp, and W_speed are the abbreviated forms of solar radiation, water temperature, and wind speed, respectively.


availability under nutrient-saturated conditions in a eutrophic waterbody (Pennock, 1985; Wofsy, 1983). Although nutrient loading significantly increased in estuarine water, phytoplankton production was maintained in the system or sometimes declined (Alpine and Cloern, 1992; Balls et al., 1995; Le Pape et al., 1996; Richardson and Heilmann, 1995). The Spearman rank correlation between the variables (e.g., Chl-a vs TN, Chl-a vs TP, and Chl-a vs solar radiation) was applied here to determine their relative importance and hence the variable sensitivity (see Fig. S2 and Table S2 in the Supplementary Material). Based on the results in Table S2, nitrogen may be more significant than solar radiation; NH3-N and NO3-N are, therefore, deemed the most significant variables for simulating Chl-a concentration in the YSR.

Phosphorus, a significant limiting factor in upstream reservoirs (e.g., JAR, a phosphorus-limited lake), was determined to be a relatively insignificant variable in the YSR (Rank 6 in ANN and Rank 5 in SVM). In the YSR, eutrophic conditions have prevailed over the past 15 years in terms of the trophic state index (TSI) derived from total phosphorus (TP), Secchi disk depth (SD), and Chl-a (Cho et al., 2009b). Kim et al. (2001) revealed that the TN:TP ratio was substantially low in the YSR (i.e., less than 50), indicating that the limiting factor for algal growth was likely to be nitrogen in this eutrophic downstream reservoir. Similarly, the TN:TP ratio in this study was approximately 31 for the 7-year period (from 2006 to 2012). Based on the above information, water temperature is also a significant factor, because it significantly influences algal growth under nutrient-saturated conditions (Raven and Geider, 1988). In a comparison between phosphorus and wind speed, phosphorus was found to be a more significant factor than wind speed. Wind speed induces physical movement (e.g., vertical mixing) of the water, and this movement subsequently enriches the water column with nutrients released from the sediment (Havens et al., 1996; Søndergaard et al., 2003). This suggests that there is an indirect relationship between wind speed and algal growth. Hence, placing phosphorus in a rank between water temperature (Rank 4) and wind speed (Rank 6) in the SVM model may be reasonable.
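The TN:TP reasoning used for both reservoirs can be illustrated with a small helper. The sketch below converts mass concentrations to a molar ratio using the atomic masses of nitrogen and phosphorus, and applies the P-limitation threshold of 65 quoted from Dzialowski et al. (2005). Whether the ratios reported in this study (about 145 for the JAR and about 31 for the YSR) are molar or mass-based is not stated here, so the function names and the sample concentrations are hypothetical.

```python
N_ATOMIC_MASS = 14.007  # g/mol
P_ATOMIC_MASS = 30.974  # g/mol


def molar_tn_tp(tn_mg_per_l, tp_mg_per_l):
    """Convert TN and TP mass concentrations (mg/L) to a molar TN:TP ratio."""
    return (tn_mg_per_l / N_ATOMIC_MASS) / (tp_mg_per_l / P_ATOMIC_MASS)


def is_p_limited(tn_tp_ratio, threshold=65.0):
    """True if the ratio exceeds the P-limitation threshold quoted in the text."""
    return tn_tp_ratio > threshold


# Mean TN:TP ratios reported in the text for 2006 to 2012
print("JAR P-limited:", is_p_limited(145.0))
print("YSR P-limited:", is_p_limited(31.0))

# Hypothetical concentrations, for illustration only
print("molar TN:TP:", molar_tn_tp(1.0, 0.02))
```

A ratio below the threshold does not by itself establish nitrogen limitation; the text supports that reading for the YSR with the additional evidence from Kim et al. (2001) and the Spearman correlations.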

In conclusion, ANN is not seen to be a rational model compared to SVM. First, it misidentified the most significant limiting factor between nitrogen and solar radiation; second, it did not regard phosphorus as a more meaningful variable than wind speed. In contrast, SVM showed better agreement with the above explanations. Therefore, SVM was deemed a reasonable and logical model for predicting Chl-a concentration.
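The kind of ranking reported in Table 3 can be sketched as follows. The code combines Latin hypercube (LH) sampling of the input space with one-at-a-time (OAT) perturbations and averages the relative change in the output for each variable. It is a simplified illustration of the general LH-OAT idea, not the authors' implementation: the perturbation size, the form of the effect measure, the function names, and the toy model standing in for a trained ANN or SVM are all assumptions.

```python
import random


def latin_hypercube(bounds, n_points, rng):
    """Draw one sample per stratum for each variable, shuffling strata independently."""
    columns = []
    for low, high in bounds:
        width = (high - low) / n_points
        column = [low + (k + rng.random()) * width for k in range(n_points)]
        rng.shuffle(column)
        columns.append(column)
    return [list(point) for point in zip(*columns)]


def lh_oat(model, bounds, n_points=50, fraction=0.05, seed=0):
    """Mean absolute relative output change (%) per variable, scaled by the perturbation fraction."""
    rng = random.Random(seed)
    effects = [0.0] * len(bounds)
    for base in latin_hypercube(bounds, n_points, rng):
        y0 = model(base)
        for i, (low, high) in enumerate(bounds):
            moved = list(base)
            # Assumed perturbation: a fixed fraction of the variable's range, kept inside the bounds
            step = fraction * (high - low)
            moved[i] = base[i] + step if base[i] + step <= high else base[i] - step
            y1 = model(moved)
            denom = (abs(y0) + abs(y1)) / 2 or 1.0  # avoid division by zero
            effects[i] += abs(100.0 * (y1 - y0) / denom / fraction) / n_points
    return effects


# Hypothetical stand-in for a trained model: strong dependence on x[0], weak on x[1]
def toy_model(x):
    return 10.0 * x[0] + 0.1 * x[1] + 5.0


effects = lh_oat(toy_model, bounds=[(0.0, 1.0), (0.0, 1.0)])
ranking = sorted(range(len(effects)), key=lambda i: -effects[i])
print(effects, ranking)
```

In practice `toy_model` would be replaced by the prediction function of the trained model, and `bounds` by the observed minimum and maximum of each input variable.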

3.3. Early warning Chl-a prediction test for different sampling intervals

The conventional phytoplankton monitoring approach is direct observation of live cells using light microscopy (Sournia, 1978). This approach accurately provides specific information on the current state of blooms and quantitatively measures the cell abundance in a water sample. It is, however, time-consuming and labor-intensive work requiring a high level of experience and expertise for the analysis, thereby limiting its real-time or near-real-time use for monitoring phytoplankton (Sellner et al., 2003). More recently, a remote sensing approach using optical instruments (e.g., a satellite) has been applied to detect harmful algal blooms (Tomlinson et al., 2004). This approach is able to support the widespread spatial and temporal monitoring of the blooms (Cullen et al., 1997). However, when the temporal scale for early warning is considered in detecting the blooms, problems remain, such as non-detectability due to bad weather conditions and a low possibility of forecasting (e.g., after a few days or a week) due to the lack of real-time detection. For this reason, a modeling approach is considered in this study as an alternative to the conventional detection approach.

As for the lead time (i.e., the time lag of the input data used to predict Chl-a concentration), it is affected by the monitoring frequency, which is based on cost and other practical considerations (e.g., administrative process, measuring equipment, and facility installation). There is a limitation when it comes to determining the appropriate lead time in the practical field. Yabunaka et al. (1997) applied a 7-day time lag for predicting Chl-a concentration using daily-interpolated data calculated from once- or twice-monthly monitoring data in the eutrophic freshwater Lake Kasumigaura. Maier et al. (1998) tested various time-lag conditions (e.g., 1-, 2-, …, k-week time lag) to predict cyanobacteria concentration using weekly monitoring data for 7-year and 28-week periods. Barciela et al. (1999) used daily, weekly, and annual time-lag intervals to predict Chl-a concentration using 3-year weekly monitoring data. Lee et al. (2003) predicted 7-day time-lag Chl-a concentration using 19-year monitoring data from 1982 to 2000. Although these four studies showed relatively good prediction performance for the Chl-a concentration, interpolated data for the frequent-interval time series data were used to train and validate the model. This data interpolation can ignore real data patterns between the interpolated points, thereby simulating inaccurate Chl-a concentration. In this study, 6-, 7-, and 8-day time-lag input data were tested for predicting Chl-a concentration based on the sampling interval of the monitoring program.
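One way to build such time-lag data sets from irregularly sampled records, without interpolation, is sketched below. It assumes that each sample's input variables are paired with the Chl-a value measured at the next sampling date and that pairs are grouped by the actual gap in days. The section does not describe how the pairs were formed, so this pairing rule, the function name, and the sample records are assumptions.

```python
from datetime import date


def lagged_pairs(records, allowed_lags=(6, 7, 8)):
    """Pair inputs at one sampling date with Chl-a at the next, grouped by gap in days.

    records: list of (sampling_date, inputs, chl_a), sorted by date.
    Returns {lag_days: [(inputs_at_t, chl_a_at_t_plus_lag), ...]}.
    """
    groups = {lag: [] for lag in allowed_lags}
    for (d0, x0, _), (d1, _, y1) in zip(records, records[1:]):
        lag = (d1 - d0).days
        if lag in groups:  # gaps outside the allowed lags are dropped
            groups[lag].append((x0, y1))
    return groups


# Hypothetical records: (date, {input variables}, Chl-a)
records = [
    (date(2011, 1, 3), {"PO4-P": 0.010}, 3.1),
    (date(2011, 1, 10), {"PO4-P": 0.012}, 3.4),  # 7 days after the previous sample
    (date(2011, 1, 16), {"PO4-P": 0.011}, 3.0),  # 6 days after the previous sample
    (date(2011, 1, 24), {"PO4-P": 0.013}, 3.8),  # 8 days after the previous sample
]
groups = lagged_pairs(records)
print({lag: len(pairs) for lag, pairs in groups.items()})
```

Because only measured values are paired, no interpolated points enter the training or test sets; the cost is that the number of pairs per lag depends on how regular the sampling was.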

SVM, selected as a reliable model for predicting Chl-a concentration in the JAR and YSR, was applied to assess the possibility of providing early warning predictions. Fig. 5 presents a comparison of Chl-a concentrations between observed and predicted values using SVM, based on different sampling intervals. A total of 170 data sets having different sampling intervals (6-day = 46, 7-day = 103, and 8-day = 22) during the 2-year period (2011 to 2012) in the JAR and YSR were compared. In general, the sampling frequency was once per week in both reservoirs. The sampling interval, however, was inconsistent due to irregular sampling by environmental officials; the interval is thus divided into 6-day, 7-day, and 8-day. Based on the prediction accuracy (see Table 4), SVM can be a reasonable model for early-warning detection at a 7-day lead time; for the 8-day interval, the prediction accuracy is poor. Lee et al. (2005) found that the Chl-a concentration is strongly correlated with past Chl-a concentration within a week, and is only weakly correlated with nutrients. Although results from literature studies were not directly compared with this study due to differences in data time intervals, it was reported that the models used can predict the Chl-a concentration well using a 7-day time-lag input data set (Barciela et al., 1999; Palani et al., 2008; Yabunaka et al., 1997). Lee et al. (2003) reported that a 1-week sampling interval was needed to predict algal dynamics capturing short-term trends. Hence, considering the current monitoring plan (e.g., weekly sampling interval) and the correlation of Chl-a with environmental variables, the 7-day interval can be reasonable for the early-warning prediction. However, the exact comparison of the model performance between the time-lag results in this study is hindered because the number of time-lag samples is different in the training and early-warning test processes. The model, therefore, needs to be further evaluated using more time-lag data sets associated with 6-day and 8-day intervals.

Fig. 5. Comparison of Chl-a concentration between observed and predicted values using SVM for different sampling intervals to evaluate the possibility of early-warning detection: A, B, and C depict the results for 6-day (6d), 7-day (7d), and 8-day (8d) interval data, respectively; the black data with the left-hand vertical axis are from the JAR; the gray data with the right-hand vertical axis are from the YSR.

Table 4
Prediction accuracy of SVM model for early-warning detection.

| Model evaluation statistics | 6-Day | 7-Day | 8-Day |
|---|---|---|---|
| NSE (a) | 0.38 | 0.69 | −1.73 |
| R² (b) | 0.48 | 0.72 | 0.21 |
| MAE (c) | 4.38 | 2.81 | 3.37 |

(a) NSE is the Nash–Sutcliffe model efficiency coefficient.
(b) R² is the coefficient of determination.
(c) MAE is the mean absolute error.
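The negative NSE for the 8-day interval in Table 4 can be read directly from the standard definition of the Nash–Sutcliffe efficiency. The form below is the textbook one; O_i, P_i, and \bar{O} denote the observed values, the predicted values, and the observed mean, and are symbols introduced here for illustration.

```latex
\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n} (O_i - P_i)^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2},
\qquad
\mathrm{NSE} = -1.73 \;\Rightarrow\; \sum_{i=1}^{n} (O_i - P_i)^2 = 2.73 \sum_{i=1}^{n} (O_i - \bar{O})^2 .
```

In other words, for the 8-day group the summed squared prediction error is about 2.7 times the summed squared deviation of the observations from their mean, so simply predicting the observed mean (NSE = 0) would have scored better. For the 7-day group, NSE = 0.69 corresponds to a squared error of 0.31 times that quantity.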

4. Conclusion

The main purpose of this study was to provide a rational model for the early-warning prediction of Chl-a concentration using a comprehensive evaluation of two stochastic models in a reservoir system. Here, ANN and SVM models were applied to predict the Chl-a concentration using weekly water quality data and meteorological data over a 7-year period. They were then compared in terms of prediction accuracy and sensitivity, depending on changes in the model input variables. Through this study, the major conclusions are as follows:

(1) Two machine learning models were successfully set up, with optimal model parameters determined using a pattern search algorithm, showing a reasonable prediction accuracy during both the training and validation processes.

(2) Even though the training prediction accuracies were almost identical between the two models in both the JAR and YSR, SVM displayed higher prediction accuracies in the validation step than ANN.

(3) The Williams–Kloot test result statistically demonstrated that the prediction accuracy of SVM was significantly higher than that of ANN in the validation step.

(4) Sensitive input variables were determined by the LH-OAT method, based on an interpretation of the characteristics in JAR and YSR. The most sensitive input variable was PO4-P in both ANN and SVM in JAR, whereas in YSR the most sensitive variables were solar radiation for ANN and NH3-N for SVM.

(5) SVM interpreted Chl-a concentration more convincingly than ANN. The cause-and-effect relationship between Chl-a concentration and environmental variables was clearly described in both the JAR and YSR.

(6) SVM showed good performance for the early warning prediction of Chl-a for different sampling intervals. Overall, the 7-day interval was suggested as a reasonable interval for early warning.
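The pattern search mentioned in conclusion (1) can be illustrated with a basic coordinate (compass) search, which polls a step in each direction along every parameter and halves the step when no poll improves the objective. The variant and settings used by the authors are not described in this section, so the function below, its defaults, and the toy error surface are assumptions intended only to show the mechanism.

```python
def pattern_search(objective, start, step=1.0, tol=1e-3, max_iter=1000):
    """Minimise objective by polling +/- step along each coordinate; halve step on failure."""
    best = list(start)
    best_val = objective(best)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(best)):
            for direction in (1.0, -1.0):
                trial = list(best)
                trial[i] += direction * step
                val = objective(trial)
                if val < best_val:
                    best, best_val, improved = trial, val, True
        if not improved:
            step /= 2.0  # refine the mesh around the current best point
    return best, best_val


# Hypothetical validation-error surface over two tuning parameters. A real run would
# train a model for each trial point and return its validation error instead.
def toy_error(p):
    return (p[0] - 1.7) ** 2 + (p[1] - 0.8) ** 2


best, err = pattern_search(toy_error, start=[0.0, 0.0])
print(best, err)
```

The method needs no gradients, which suits objectives such as validation error that are only available by retraining the model at each candidate parameter set.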

This study provided useful tools for reliably predicting early-warning Chl-a concentrations by considering reasonable variables in a reservoir. It is expected that this model could be easily accessed and operated by decision-makers and engineers working with both freshwater and estuarine reservoirs.

Conflict of interest

None.

Acknowledgments

This research was supported by the R&D Program for Society of the National Research Foundation (NRF) funded by the Ministry of Science, ICT & Future Planning (Grant number: NRF-2014M3C8A4030498). The authors are also grateful to the National Institute of Environmental Research (NIER), Korea, and the Korea Meteorological Administration (KMA) for permitting use of their data.

Appendix A. Supplementary data

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.scitotenv.2014.09.005.

References

Alpine AE, Cloern JE. Trophic interactions and direct physical effects control phytoplankton biomass and production in an estuary. Limnol Oceanogr 1992;37:946–55.

Anderson DM, Andersen P, Bricelj VM, Cullen JJ, Rensel JE. Monitoring and management strategies for harmful algal blooms in coastal waters. APEC #201-MR-01.1, Asia Pacific Economic Program, Singapore, and Intergovernmental Oceanographic Commission Technical Series No. 59, Paris; 2001.

Baird ME, Emsley SM, McGlade JM. Modelling the interacting effects of nutrient uptake, light capture and temperature on phytoplankton growth. J Plankton Res 2001;23(8):829–40.

Balabin RM, Lomakina EI. Support vector machine regression (SVR/LS-SVM) – An alternative to neural networks (ANN) for analytical chemistry? Comparison of nonlinear methods on near infrared (NIR) spectroscopy data. Analyst 2011;136(8):1703–12.

Balls PW, Macdonald A, Pugh K, Edwards AC. Long-term nutrient enrichment of an estuarine system: Ythan, Scotland (1958–1993). Environ Pollut 1995;90:311–21.

Barciela RM, Garcia E, Fernandez E. Modelling primary production in a coastal embayment affected by upwelling using dynamic ecosystem models and artificial neural networks. Ecol Model 1999;120:199–211.

Barton RR, Meckesheimer M. Chapter 18 Metamodel-based Simulation Optimization. Handbooks in operations research and management science, 13. 2006. p. 535–74.

Basu A, Walters C, Shepherd M. Support vector machines for text categorization. IEEE Proceedings of the 36th Annual Hawaii International Conference on System Sciences; 2003. p. 7.

Bebis G, Georgiopoulos M. Feed-forward neural networks. IEEE Potentials 1994;13(4):27–31.

Behzad M, Asghari K, Eazi M, Palhang M. Generalization performance of support vector machines and neural networks in runoff modeling. Expert Syst Appl 2009;36(4):7624–9.

Boyce FM, Charlton MN, Rathke DC, Mortimer H, Bennett J. Lake Erie research: recent results, remaining gaps. J Great Lakes Res 1987;13:826–40.

Boyer JN, Kelble CR, Ortner PB, Rudnick DT. Phytoplankton bloom status: chlorophyll a biomass as an indicator of water quality condition in the southern estuaries of Florida, USA. Ecol Indic 2009;9s:S56–67.

Brassard JR, Correia MJ. Computer program for fitting multimodal probability density functions. Comput Prog Biomed 1977;7(1):1–20.

Carpenter EJ, Capone DG. Nitrogen in the marine environment. New York: Academic Press; 1983.

Chen WH, Hsu SH, Shen HP. Application of SVM and ANN for intrusion detection. Comput Oper Res 2005;32(10):2617–34.

Cherkassky V, Ma Y. Practical selection of SVM parameters and noise estimation for SVM regression. Neural Netw 2004;17(1):113–26.

Cho KH, Kang JH, Ki SJ, Park Y, Cha SM, Kim JH. Determination of the optimal parameters in regression models for the prediction of chlorophyll-a: a case study of the Yeongsan Reservoir, Korea. Sci Total Environ 2009a;407(8):2536–45.

Cho KH, Park Y, Kang JH, Ki SJ, Cha S, Lee SW, et al. Interpretation of seasonal water quality variation in the Yeongsan Reservoir, Korea using multivariate statistical analyses. Water Sci Technol 2009b;59(11):2219–26.

Cho KH, Sthiannopkao S, Pachepsky YA, Kim KW, Kim JH. Prediction of contamination potential of groundwater arsenic in Cambodia, Laos, and Thailand using artificial neural network. Water Res 2011;45(17):5535–44.

Cho KH, Pachepsky YA, Kim JH, Kim J, Park M. The modified SWAT model for predicting fecal coliforms in the Wachusett reservoir watershed, USA. Water Res 2012;46(15):4750–60.

Correll DL. Phosphorus: a rate limiting nutrient in surface waters. Poult Sci 1999;78:674–82.

Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995;20(3):273–97.

Cullen JJ, Ciotti AM, Davis RF, Lewis MR. Optical detection and assessment of algal blooms. Limnol Oceanogr 1997;42:1223–39.

Diaz MM, Pedrozo FL. Nutrient limitation in Andean–Patagonian lakes at latitude 40–41°S. Arch Hydrobiol 1996;138:123–43.

Dillon PJ, Rigler FH. The phosphorus–chlorophyll relationship in lakes. Limnol Oceanogr 1974;19(5):767–73.

Dortch Q. The interaction between ammonium and nitrate uptake in phytoplankton. Mar Ecol Prog Ser 1990;61:183–201.

Dugdale RC. Nutrient limitation in the sea: dynamics, identification and significance. Limnol Oceanogr 1967;12:685–95.

Dzialowski AR, Wang SH, Lim NC, Spotts WW, Huggins DG. Nutrient limitation of phytoplankton growth in central plains reservoirs, USA. J Plankton Res 2005;27(6):587–95.

Edmondson WT. Phosphorus, nitrogen, and algae in Lake Washington after diversion of sewage. Science 1970;169:690–1.

Elser JJ, Marzolf ER, Goldman CR. Phosphorus and nitrogen limitation of phytoplankton growth in the freshwaters of North America: a review and critique of experimental enrichments. Can J Fish Aquat Sci 1990;47:1468–77.

Elser JJ, Bracken MES, Cleland EE. Global analysis of nitrogen and phosphorus limitation of primary producers in freshwater, marine and terrestrial ecosystems. Ecol Lett 2007;10:1135–42.

Glasgow HB, Burkholder JM, Reed RE, Lewitus AJ, Kleinman JE. Real-time remote monitoring of water quality: a review of current applications, and advancements in sensor, telemetry, and computing technologies. J Exp Mar Biol Ecol 2004;300(1–2):409–48.

Govindaraju RS. Artificial neural networks in hydrology. II: Hydrologic applications. J Hydrol Eng 2000;5(2):124–37.

Hartnett M, Nash S. Modelling nutrient and chlorophyll_a dynamics in an Irish brackish waterbody. Environ Modell Softw 2004;19(1):47–56.

Havens KE, East TL, Meeker RH, Davis WP, Steinman AD. Phytoplankton and periphyton responses to in situ experimental nutrient enrichment in a shallow subtropical lake. J Plankton Res 1996;18:551–66.

Hecky RE, Kilham P. Nutrient limitation of phytoplankton in freshwater and marine environments: a review of recent evidence on the effects of enrichment. Limnol Oceanogr 1988;33:796–822.

Howarth R, Marino R. Nitrogen as the limiting nutrient for eutrophication in coastal marine ecosystems: evolving views over three decades. Limnol Oceanogr 2006;51:364–76.

Jansson M, Blomqvist P, Jonsson A, Bergstrom A-K. Nutrient limitation of bacterioplankton, autotrophic and mixotrophic phytoplankton, and heterotrophic nanoflagellates in Lake Ortrasket. Limnol Oceanogr 1996;41:1552–9.

Jones JR, Knowlton MF, An K-G. Trophic state, seasonal patterns and empirical models in South Korean reservoirs. Lake Reserv Manag 2003;19:64–78.

Jones JR, Thompson A, Seong CN, Jung JS, Yang H. Monsoon influence on the limnology of Juam Lake, South Korea. Verh Internat Verein Limnol 2006;29:1215–22.

Ki SJ, Lee YG, Kim SW, Lee YJ, Kim JH. Spatial and temporal pollutant budget analyses toward the total maximum daily loads management for the Yeongsan watershed in Korea. Water Sci Technol 2007;55(1–2):367–74.

Kim KJ. Financial time series forecasting using support vector machines. Neurocomputing 2003;55(1–2):307–19.

Kim B, Park JH, Hwang G, Jun MS, Choi K. Eutrophication of reservoirs in South Korea. Limnology 2001;2(3):223–9.

Le Pape O, Del Amo Y, Ménesguen A, Aminot A, Quequiner B, Treguer P. Resistance of a coastal ecosystem to increasing eutrophic conditions: the Bay of Brest (France), a semi-enclosed zone of Western Europe. Cont Shelf Res 1996;16:1885–907.

Lee HS, Lee JHW. Continuous monitoring of short term dissolved oxygen and algal dynamics. Water Res 1995;29(12):2789–96.


Lee JHW, Huang Y, Dickman M, Jayawardena AW. Neural network modeling of coastal algal blooms. Ecol Model 2003;159:179–201.

Lee JHW, Hodgkiss IJ, Wong KTM, Lam IHY. Real time observations of coastal algal blooms by an early warning system. Estuar Coast Shelf Sci 2005;65:172–90.

Lee YG, An KG, Ha PT, Lee KY, Kang JH, Cha SM, et al. Decadal and seasonal scale changes of an artificial lake environment after blocking tidal flows in the Yeongsan Estuary region, Korea. Sci Total Environ 2009;407(23):6063–72.

Lee YG, Kang JH, Ki SJ, Cha SM, Cho KH, Lee YS, et al. Factors dominating stratification cycle and seasonal water quality variation in a Korean estuarine reservoir. J Environ Monitor 2010;12(5):1072–81.

Maier HR, Dandy GC. The effect of internal parameters and geometry on the performance of back-propagation neural networks: an empirical study. Environ Modell Softw 1998;13(2):193–209.

Maier HR, Dandy GC. Neural networks for the prediction and forecasting of water resources variables: a review of modelling issues and applications. Environ Modell Softw 2000;15(1):101–24.

Maier HR, Dandy GC, Burch MD. Use of artificial neural networks for modelling cyanobacteria Anabaena spp. in the River Murray, South Australia. Ecol Model 1998;105:257–72.

Middlebrooks EJ, Porcella DB. Rational multivariate algal growth kinetics. J Sanit Eng Div 1971;97(1):135–40.

Norgaard M, Ravn O, Poulsen NK, Hansen LK. Neural network for modeling and control of dynamic systems. Springer; 2000.

Pachepsky Y, Guber A, Jacques D, Simunek J, Van Genuchten MT, Nicholson T, et al. Information content and complexity of simulated soil water fluxes. Geoderma 2006;134(3–4):253–66.

Palani S, Liong SY, Tkalich P. An ANN application for water quality forecasting. Mar Pollut Bull 2008;56(9):1586–97.

Patuelli R, Reggiani A, Nijkamp P, Schanne N. Neural networks for regional employment forecasts: are the parameters relevant? J Geogr Syst 2011;13(1):67–85.

Pennock JR. Chlorophyll distributions in the Delaware estuary: regulation by light-limitation. Estuar Coast Shelf Sci 1985;21:711–25.

Raven JA, Geider RJ. Temperature and algal growth. New Phytol 1988;110(4):441–61.

Ren Y, Bai G. Determination of optimal SVM parameters by using GA/PSO. J Comput 2010;5(8):1160–8.

Richardson K, Heilmann JP. Primary production in the Kattegat: past and present. Ophelia 1995;41:317–28.

Schindler DW, Armstrong FAJ, Holmgren SK, Brunskill GJ. Eutrophication of Lake 222, Experimental Lakes Area, northwestern Ontario, by addition of phosphate and nitrate. J Fish Res Board Can 1971;28:1763–82.

Sebald DJ, Bucklew JA. Support vector machine techniques for nonlinear equalization. IEEE Trans Sig Process 2000;48(11):3217–26.

Sellner KG, Doucette GJ, Kirkpatrick GJ. Harmful algal blooms: causes, impacts and detection. J Ind Microbiol Biotechnol 2003;30:383–406.

Shin SC, Huh IA, Yoon JH, Kim JH, Kim SS, Jang NI, et al. Study on the fate of the pollutants and the change of ecosystem in Juam Lake. NIER No. 2000-22-584. Yeongsan-River Water Quality Laboratory, National Institute of Environmental Research; 2000.

Søndergaard M, Jensen JP, Jeppesen E. Role of sediment and internal loading of phosphorus in shallow lakes. Hydrobiologia 2003;506/509:135–45.

Sournia A. Phytoplankton manual. Monographs on oceanographic methodology, vol. 6. Paris: UNESCO; 1978. p. 337.

Sterner RW, Grover JP. Algal growth in warm temperate reservoirs: kinetic examination of nitrogen, temperature, light, and other nutrients. Water Res 1998;32(12):3539–48.

Thissen U, Pepers M, Üstün B, Melssen WJ, Buydens LMC. Comparing support vector machines to PLS for spectral regression applications. Chemometr Intell Lab Syst 2004;73(2):169–79.

Thomann RV, Mueller JA. Principles of surface water quality modeling and control. New York: Harper & Row, Inc.; 1987.

Tilman D, Kilham SS, Kilham P. Phytoplankton community ecology: the role of limiting nutrients. Annu Rev Ecol Syst 1982;13:349–72.

Tomlinson MC, Stumpf RP, Ransibrahmanakul V, Truby EW, Kirkpatrick GJ, Pederson BA, et al. Evaluation of the use of SeaWiFS imagery for detecting Karenia brevis harmful algal blooms in the eastern Gulf of Mexico. Remote Sens Environ 2004;91:293–303.

Vapnik V. Statistical learning theory. New York: John Wiley; 1998.

Vapnik VN, Golowich S, Smola AJ. Support vector method for function approximation, regression estimation, and signal processing. Adv Neural Inf Process Syst 1997;9:281–7.

Vincent WF, Wurtsbaugh W, Vincent CL, Richerson PJ. Seasonal dynamics of nutrient limitation in a tropical high-altitude lake (Lake Titicaca, Peru–Bolivia): application of physiological bioassays. Limnol Oceanogr 1984;29:540–52.

Vitousek PM, Howarth RW. Nitrogen limitation on land and in the sea – how can it occur? Biogeochemistry 1991;13:87–115.

Wang W, Xu Z, Lu W, Zhang X. Determination of the spread parameter in the Gaussian kernel for classification and regression. Neurocomputing 2003;55(3–4):643–63.

Widrow B, Rumelhart DE, Lehr MA. Neural networks: applications in industry, business and science. Commun ACM 1994;37(3):93–105.

Williams EJ, Kloot NH. Interpolation in a series of correlated observations. Aust J Appl Sci 1953;4:1–17.

Wofsy SC. A simple model to predict extinction coefficients and phytoplankton biomass in eutrophic waters. Limnol Oceanogr 1983;28:1144–55.

Yabunaka K, Hosomi M, Murakami A. Novel application of a back-propagation artificial neural network model formulated to predict algal bloom. Water Sci Technol 1997;36(5):89–97.

Yuan XF, Wang YN. Parameter selection of support vector machine for function approximation based on chaos optimization. J Syst Eng Electron 2008;19(1):191–7.