
Available online at www.sciencedirect.com

ScienceDirect

journal homepage: www.elsevier.com/locate/watres

Water Research 80 (2015) 109-118

A multivariate based event detection method and performance comparison with two baseline methods

Shuming Liu*, Kate Smith, Han Che

School of Environment, Tsinghua University, Beijing 100084, China

Article info

Article history:
Received 27 January 2015
Received in revised form 30 April 2015
Accepted 5 May 2015
Available online 12 May 2015

Keywords:
Contaminant classification
Conventional sensor
Early warning system
Euclidean distance
Pearson correlation
Water quality

Abstract

Early warning systems have been widely deployed to protect water systems from accidental and intentional contamination events. Conventional detection algorithms are often criticized for having high false positive rates and low true positive rates. This mainly stems from the inability of these methods to determine whether variation in sensor measurements is caused by equipment noise or the presence of contamination. This paper presents a new detection method that identifies the existence of contamination by comparing Euclidean distances of correlation indicators, which are derived from the correlation coefficients of multiple water quality sensors. The performance of the proposed method was evaluated using data from a contaminant injection experiment and compared with two baseline detection methods. The results show that the proposed method can differentiate between fluctuations caused by equipment noise and those due to the presence of contamination. It yielded a higher probability of detection and a lower false alarm rate than the two baseline methods. With optimized parameter values, the proposed method can correctly detect 95% of all contamination events with a 2% false alarm rate.

© 2015 Elsevier Ltd. All rights reserved.

Abbreviations: ANN, artificial neural network; ARMA, autoregressive moving average; CIE, contaminant injection experiment; EWS, early warning system; FAR, false alarm rate; FN, false negative; FP, false positive; LPF, linear prediction filters; MED, multivariate Euclidean distance; ORP, oxidation reduction potential; PE, Pearson correlation Euclidean distance-based method; PD, probability of detection; READiw, real-time event adaptive detection, identification and warning; ROC, receiver operating characteristic; SCADA, supervisory control and data acquisition; SVM, support vector machine; TN, true negative; TOC, total organic carbon; TP, true positive.

* Corresponding author. E-mail address: [email protected] (S. Liu).

http://dx.doi.org/10.1016/j.watres.2015.05.013
0043-1354/© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

China has suffered thousands of water contamination events over the past few decades. Between 1992 and 2006, an average of 1906 contamination accidents occurred per year (Yang et al., 2010). For example, the Songhua River was contaminated by nitrobenzene from a chemical plant explosion in 2005, which resulted in a 4-day suspension of water supply to Harbin, China (Wang et al., 2012). More recently, in February 2012, the drinking water source of a city in the lower Yangtze River area of Jiangsu province was contaminated by a phenol spill from a South Korean cargo ship. One approach for avoiding or mitigating the impact of contamination is to establish an Early Warning System (EWS), which normally includes online sensors, a connected supervisory control and data acquisition (SCADA) system, a detection algorithm and a decision support system (Hasan et al., 2004). An EWS should provide a fast and accurate means to distinguish between normal variations in water parameters and actual contamination events.

A key part of an EWS is the detection algorithm, which utilizes data from online sensors to evaluate water quality and detect the presence of contamination. Conventional water quality sensors have been playing a growing role in EWS because they are easy to maintain, reliable and cost-effective. As summarized by McKenna et al. (2008), there are two approaches to developing and testing event detection methods using water quality sensor signals. First, laboratory and test-loop evaluation of sensors and associated event detection algorithms provides direct measurement of chemical changes in background water quality caused by specific contaminants (Hall et al., 2007; Kroll and King, 2006a, b; Liu et al., 2015a, b). For example, Hall et al. (2007) carried out a sensor response experiment for 9 types of contaminants and found that more than one sensor responded to each tested contaminant. After noticing this phenomenon, researchers have attempted to develop contaminant detection methods using responses from multiple sensors. Yang et al. (2009) developed a real-time event adaptive detection, identification and warning (READiw) methodology in a drinking water pipe. The suggested adaptive transformation of sensor measurements reduced background noise and enhanced contaminant signals. In the method employed by Yang et al. (2009), the relative values of concentrations of free and total chlorine, pH and oxidation reduction potential (ORP) are used for contaminant classification. This allows for contaminant detection and further classification based on chlorine kinetics. Kroll (2006) developed the Hach HST approach using multiple sensors for event detection and contaminant identification. In this approach, signals from 5 separate orthogonal measurements of water quality (pH, conductivity, turbidity, chlorine residual, total organic carbon (TOC)) are processed from a 5-parameter measure into a single scalar trigger signal. The deviation signal is then compared to a preset threshold level. If the signal exceeds the threshold, the trigger is activated (Kroll, 2006). In Kroll's method, although responses from multiple sensors are utilized, their internal relationship is not explored. McKenna et al. (2008) argued that a drawback of the laboratory and test-loop results and the resulting algorithms is that variation of the background water quality in these systems may be considerably less than the variation observed in actual water systems. Another drawback of these types of methods is that the threshold level is site dependent. When applied to a situation different from the one for which the method is developed, field calibration is necessary.

The second approach to event detection is based on signal processing and data-driven techniques (McKenna et al., 2008). For example, Hart et al. (2007) reported a linear prediction filters (LPF) method. The LPF method predicts the water quality at a future time step and evaluates the residual between predicted and observed water quality values. Klise and McKenna (2006) developed an algorithm to classify the current measurement as normal or anomalous by calculating the multivariate Euclidean distance (MED). The MED approach provides a measure of the distance between the sampled water quality and the previously measured samples contained in the history window. McKenna et al. (2008) compared the performance of LPF, MED and a time series increments method. These algorithms process water quality data at each time step to identify periods of anomalous water quality and the probability of a water contamination event having occurred at that time step. The averaged deviation between the observed and predicted responses from time series data for each sensor is compared with a preset threshold. If the averaged deviation is greater than the preset threshold value, an alarm is triggered. Allgeier et al. (2005) and Raciti et al. (2012) used artificial neural networks (ANN) and support vector machines (SVM) to classify water quality data into normal and anomalous classes after supervised learning. Perelman et al. (2012) and Arad et al. (2013) reported a general framework that integrates a data-driven estimation model with sequential probability updating to detect quality faults in water distribution systems using multivariate water quality time series. A common feature of signal processing and data-driven methods is that they rely mainly on purely mathematical data analysis. The characteristics of sensor responses to contaminants and the connections between these are not considered by these methods. For online water quality sensors, fluctuations can be caused either by background factors (equipment noise and variability in hydraulics and water demand) or by the presence of a contaminant. Signal processing and data-driven methods have very limited ability to differentiate between these two types of fluctuations, which can lead to false positive alarms.

To overcome this drawback, this paper describes a new method for real-time contamination detection using multiple conventional water quality sensors for source water. The proposed method aims to achieve contamination detection by using Pearson correlation coefficients to explore the correlative relationship between signals from multiple sensors and their Euclidean distances. A Pearson correlation coefficient is a measure of the strength and direction of the linear relationship between two variables (Mudelsee, 2003). In recent years, it has been used for classification purposes. For example, Monedero et al. (2011) applied Pearson correlation coefficients to the problem of detecting fraud and other non-technical losses in a power utility. Benesty et al. (2008) used the coefficients to reduce noise in speech estimation, a topic which has attracted a considerable amount of attention over the past few decades. However, the application of this correlation coefficient in the field of contamination detection has not been explored. The method proposed in this study is tested using data from a laboratory contaminant injection experiment (CIE). Its detection performance is evaluated and compared with two baseline methods.

2. Methods and materials

2.1. The proposed event detection method

The proposed event detection method, called the Pearson correlation Euclidean distance-based (PE) method, includes three steps: calculation of Pearson correlation coefficients, calculation of correlation indicators and calculation of Euclidean distances.


2.1.1. Step 1: calculation of Pearson correlation coefficients

Pearson correlation coefficients for multiple sensor signals are calculated. In a previous study, Liu et al. (2014) reported that multiple water quality sensors could respond to a contamination event simultaneously, which is defined as a correlative response and utilized in this study for event detection. Step 1 involves quantifying the extent of correlation using Pearson correlation coefficients, r, which are calculated as follows

$$ r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{Y})^2}} \tag{1} $$

in which X and Y refer to signal series from two separate water quality sensors (e.g. pH and ORP), $x_i$ and $y_i$ are the ith values in the signal series, and $\bar{X}$ and $\bar{Y}$ are the mean values (mathematical expectations) of the two series. The number of data points, or window size, is given by n. The window size is the number of past observations used to calculate the Pearson correlation coefficient. For each sensor, a new observation enters the sliding window at every time step t and the oldest observation exits (i.e., first in, first out) (Arad et al., 2013).
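As an illustration of Step 1, the sliding-window calculation of Equation (1) can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: the function names are hypothetical, and returning r = 0 for a constant signal (zero variance) is an assumption, since the text does not say how that case is handled.

```python
import math
from collections import deque


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series (Eq. 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption (not from the paper): a constant signal gives r = 0.
        return 0.0
    return cov / (math.sqrt(var_x) * math.sqrt(var_y))


def sliding_pearson(signal_x, signal_y, n):
    """Yield (t, r_XY) at every time step once the window holds n values."""
    # deque(maxlen=n) drops the oldest value automatically (first in, first out).
    win_x, win_y = deque(maxlen=n), deque(maxlen=n)
    for t, (x, y) in enumerate(zip(signal_x, signal_y)):
        win_x.append(x)
        win_y.append(y)
        if len(win_x) == n:
            yield t, pearson(win_x, win_y)
```

For s sensors, the same function would be applied to every pair of signals to fill the s × s correlation matrix used in the next step.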

2.1.2. Step 2: calculation of correlation indicator

The value of $r_{XY}$ is between -1 and 1. If the value of $r_{XY}$ is close to 0, the correlation between X and Y is deemed to be weak. In this study, a correlation indicator $C_{XY}$ is used to denote whether two vectors are closely related. The value of $C_{XY}$ is either 0 or 1, which is obtained, as shown in Equation (2), by comparing $r_{XY}$ with a pre-set indicator threshold $C^*$.

$$ C_{XY} = \begin{cases} 0 & \text{if } |r_{XY}| < C^* \text{ or } X = Y \\ 1 & \text{if } C^* \le |r_{XY}| \le 1 \end{cases} \tag{2} $$
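Equation (2) amounts to thresholding the absolute value of each correlation coefficient. The following Python sketch shows one way to write it; the function names and the `same_signal` flag (used to express the X = Y case on the matrix diagonal) are hypothetical choices made for this illustration.

```python
def correlation_indicator(r_xy, c_star, same_signal=False):
    """Return C_XY per Eq. (2): 1 if C* <= |r_XY| <= 1, otherwise 0.

    A signal paired with itself (X = Y) always gets 0.
    """
    if same_signal:
        return 0
    return 1 if abs(r_xy) >= c_star else 0


def indicator_matrix(r_matrix, c_star):
    """Apply Eq. (2) to every entry of an s x s correlation matrix."""
    s = len(r_matrix)
    return [
        [correlation_indicator(r_matrix[i][j], c_star, same_signal=(i == j))
         for j in range(s)]
        for i in range(s)
    ]
```

With $C^*$ = 0.4, for example, coefficients of 0.45 and -0.45 both give an indicator of 1, while -0.36 gives 0.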

2.1.3. Step 3: calculation of Euclidean distance

For the case of s sensors, the correlation coefficients form an s × s matrix, as do the correlation indicators. The correlation indicators above the diagonal are taken to construct a 1 × m dimension vector V, which is called the correlation indicator vector (Fig. 1). m is determined by

$$ m = \sum_{i=1}^{s-1} i \tag{3} $$

The Euclidean distance between the correlation indicator vector and the point of origin, $D_{PE}$, is calculated using

$$ D_{PE} = \sqrt{\sum_{k=1}^{m} \left[ V_k - O \right]^2} \tag{4} $$

in which $V_k$ is the kth item in the correlation indicator vector V and O refers to the point of origin. The items in vector V are the values $C_{XY}$ above the diagonal in the correlation indicator matrix (see Fig. 1).

Fig. 1. Schematic graph of the correlation indicator matrix and correlation indicator vector.

A contamination alarm will be triggered if

$$ D_{PE} \ge D^*_{PE} \tag{5} $$

in which $D^*_{PE}$ is a detection threshold.
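Step 3 and the alarm rule of Equation (5) can be sketched as follows. The function names are hypothetical, and the row-by-row ordering of the vector is an assumption; the ordering does not affect $D_{PE}$. Equation (3) gives m = s(s - 1)/2, so s = 8 sensors yield m = 28. Because every item of V is 0 or 1 and the origin is 0, $D_{PE}$ reduces to the square root of the number of sensor pairs whose indicator is 1.

```python
import math


def indicator_vector(c_matrix):
    """Collect the entries above the diagonal, row by row (assumed order)."""
    s = len(c_matrix)
    return [c_matrix[i][j] for i in range(s) for j in range(i + 1, s)]


def distance_to_origin(vector):
    """Eq. (4) with the point of origin O = 0."""
    return math.sqrt(sum(v ** 2 for v in vector))


def is_alarm(d_pe, d_star):
    """Eq. (5): trigger an alarm when D_PE reaches the detection threshold."""
    return d_pe >= d_star
```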

2.2. Performance evaluation

The accuracy of an event detection method is assessed by its ability to place the current state of water quality into one of two classes: background and event. Accuracy is composed of the ability of the method to detect the event (the sensitivity of the method) and to correctly exclude the normal operating conditions from being inadvertently classified as an event (the specificity of the method). Evaluation of an event detection method requires that the results be examined on a common scale to assess the tradeoffs between false positive (FP) and false negative (FN) decisions. This evaluation should remain independent of the basis of the detection method and the quantity of the input data. In this study, the receiver operating characteristic (ROC) curve is adopted as a performance indicator. This curve has been used in other studies (McKenna et al., 2008; Arad et al., 2013).

The ROC curve defines the probability of detection (PD) that can be obtained as a function of the corresponding false alarm rate (FAR). As shown in Equations (6) and (7), FAR is the number of FPs divided by the total number of background situations (FPs plus TNs), and PD is the number of true positives (TPs) divided by the total number of events (TPs plus FNs). The area under the ROC curve is the preferred single-valued measure of accuracy of the technique being evaluated (Swets, 1988). The maximum value of the area under the ROC curve is 1. The closer the area is to 1, the better the performance of the detection method.

$$ FAR = \frac{FP}{FP + TN} \tag{6} $$

$$ PD = \frac{TP}{TP + FN} \tag{7} $$

in which TN is true negative.

In a practical situation, for a specific event detection method, the steps to developing an ROC curve are as follows:

Step 1: collect water quality sensor signals for a period of time;

Step 2: vary the threshold/parameter values in the event detection method while recording the FARs and PDs to create the ROC curve.
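The two steps above can be sketched in Python as a threshold sweep. This is an illustrative sketch under stated assumptions: `scores` stands for any per-time-step detection statistic (such as a distance), an alarm is raised when the score is at or above the threshold, the end points (0, 0) and (1, 1) are added to close the curve, and the area is computed with the trapezoidal rule. None of these choices is specified in the text, and the function names are hypothetical.

```python
def far_pd(scores, is_event, threshold):
    """FAR and PD (Eqs. 6 and 7) when an alarm means score >= threshold."""
    tp = fp = tn = fn = 0
    for score, event in zip(scores, is_event):
        alarm = score >= threshold
        if alarm and event:
            tp += 1
        elif alarm and not event:
            fp += 1
        elif event:
            fn += 1
        else:
            tn += 1
    far = fp / (fp + tn) if (fp + tn) else 0.0
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    return far, pd


def roc_curve(scores, is_event, thresholds):
    """Return the (FAR, PD) points for all thresholds, sorted by FAR."""
    points = {far_pd(scores, is_event, th) for th in thresholds}
    points.update({(0.0, 0.0), (1.0, 1.0)})  # assumed end points
    return sorted(points)


def roc_area(points):
    """Trapezoidal area under ROC points that are sorted by FAR."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A perfectly separating score, for instance, produces the points (0, 0), (0, 1) and (1, 1) and an area of 1.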

For an event detection method with one threshold/parameter, there exists one ROC curve. For an event detection method with multiple parameters/thresholds, multiple ROC curves can be obtained. Among these curves, the one with maximized area is defined as the optimal ROC curve. This can be determined through an optimization process, which is demonstrated in the next section. Once the optimal ROC curve is obtained, it is necessary to select a point on the curve where the PD is maximized and the FAR is minimized, typically the top-left point on the curve. The event detection method can then be implemented with the threshold values corresponding to this point.

Besides the ROC curve area, the FAR for a PD of 0.95 is also evaluated as a performance indicator. Sensitivity and specificity are inextricably linked. As the sensitivity is increased, the number of false positive (FP) events also increases (i.e. the specificity of the detection method decreases). Decreasing the sensitivity means some events will be missed and this increases the number of false negatives (FN). To assess the tradeoffs between FP and FN decisions on a common scale, the false alarm rate (FAR) at a given probability of detection (PD) (in this case PD = 0.95) is used. For example, the performance of a detection method yielding an FAR = 0.1 with a PD = 0.95 is deemed better than that of a method yielding an FAR = 0.2 with a PD = 0.95.
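Reading this indicator off a discrete ROC curve requires a convention, which the text does not give. The sketch below assumes one plausible choice: the smallest FAR among the points whose PD is at least the target. The function name and the list-of-tuples representation are hypothetical.

```python
def far_at_pd(points, target_pd=0.95):
    """Smallest FAR among ROC points (FAR, PD) whose PD reaches target_pd.

    Returns None when no point reaches the target (assumed convention).
    """
    candidates = [far for far, pd in points if pd >= target_pd]
    return min(candidates) if candidates else None
```

Under this convention, a lower returned value indicates a better method, in line with the FAR = 0.1 versus FAR = 0.2 comparison above.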

2.3. Optimal parameter values

The PE method has three parameters: n (window size), $C^*$ (indicator threshold) and $D^*_{PE}$ (distance threshold). In this study, the optimal values were determined using an optimization analysis. The objective function is configured to be:

$$ \text{Objective function} = \max(\text{area under the ROC curve}) \tag{8} $$

In the optimization analysis, a range was pre-set for each parameter according to the problem under discussion. All possible values of n, $C^*$ and $D^*_{PE}$ from within these ranges were taken to calculate FARs and PDs, which form the ROC curves. The ROC curve with maximum area is denoted as the optimal ROC curve. By iterating all possible values for all parameters, the optimal ROC curve satisfying the objective function could be obtained. The values of n and $C^*$ associated with the optimal ROC curve are defined as the optimized values of parameters n and $C^*$. The optimized value of $D^*_{PE}$ can be identified using the top-left point on the optimal ROC curve.
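The exhaustive search described above can be sketched as a simple grid search. In this hypothetical sketch, `roc_area_for(n, c_star)` is a caller-supplied function that sweeps the distance threshold for one (n, $C^*$) pair and returns the area under the resulting ROC curve; how that function is built is left to the caller and is not taken from the paper.

```python
from itertools import product


def optimise(roc_area_for, n_values, c_values):
    """Search all (n, C*) pairs for the largest area under the ROC curve.

    roc_area_for(n, c_star) is a hypothetical caller-supplied function.
    Returns (best_area, best_n, best_c_star); ties keep the first pair found.
    """
    best = None
    for n, c_star in product(n_values, c_values):
        area = roc_area_for(n, c_star)
        if best is None or area > best[0]:
            best = (area, n, c_star)
    return best
```

The cost grows with the product of the grid sizes, which stays manageable for short datasets and a small number of parameters.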

2.4. Two baseline detection methods

2.4.1. Multivariate Euclidean distance (MED) method

The MED method considers changes in water quality by comparing two successive distances in a multivariate space defined by the water quality signal (McKenna et al., 2008; Klise and McKenna, 2006). The distances compared are (1) the Euclidean distance between the water quality measurement at the current time step and the mean water quality value over the previous $P_E$ time steps and (2) the Euclidean distance between the water quality value of the previous time step and the mean value. An example of a multivariate space is the eight-dimensional space defined by the values of temperature, pH, turbidity, conductivity, ORP, UV-254, nitrate, and phosphate.

The distance measure, $d_{MED}$, is the difference between the Euclidean distances (1) and (2) in the n-dimensional space and is calculated using

$$ d_{MED} = \sqrt{\sum_{j=1}^{n} \left[ Z(t)_j - Z(m)_j \right]^2} - \sqrt{\sum_{j=1}^{n} \left[ Z(t-1)_j - Z(m)_j \right]^2} \tag{9} $$

in which n is the number of water quality parameters (or the number of dimensions of the space; here n does not denote the window size of the PE method), $Z(t)_j$ is the jth water quality parameter in the current time step (i.e. t), $Z(t-1)_j$ is the jth water quality parameter in the previous time step (i.e. t-1), and $Z(m)_j$ is the mean value of the jth water quality parameter. A constant detection threshold, $d^*_{MED}$, is applied to determine if an event has occurred. The MED method identifies a contamination event when $d_{MED}$ is above this threshold. Otherwise, measurements are considered background.
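Equation (9) can be sketched in Python as below. The function names are hypothetical, and the inputs are assumed to be standardized measurement vectors with one entry per water quality parameter.

```python
import math


def window_mean(history):
    """Mean of each parameter over a history window of measurement vectors."""
    count = len(history)
    return [sum(column) / count for column in zip(*history)]


def med_distance(current, previous, mean):
    """d_MED (Eq. 9): distance of Z(t) to the mean minus that of Z(t-1)."""
    d_now = math.sqrt(sum((z - m) ** 2 for z, m in zip(current, mean)))
    d_prev = math.sqrt(sum((z - m) ** 2 for z, m in zip(previous, mean)))
    return d_now - d_prev
```

A positive value means the current measurement lies further from the window mean than the previous one did.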

2.4.2. Linear prediction filters (LPF) method

A linear predictor estimates the current value of a time series as a linear combination of previous samples. This method is also known as the autoregressive moving average (ARMA) predictor (McKenna et al., 2008; Zetterqvist, 1991), and the most common representation is

$$ Z^*(t) = -\sum_{i=1}^{P_{LPF}} a_i Z(P_{LPF} - i) \tag{10} $$

in which $Z^*(t)$ is the predicted water quality value, $Z(P_{LPF} - i)$ are the previously observed values, $a_i$ are the predictor coefficients, and $P_{LPF}$ is the order of the prediction filter polynomial.

The error generated by this estimate is

$$ d_{LPF}(t) = Z^*(t) - Z(t) \tag{11} $$

in which $Z(t)$ is the observed true signal value and $d_{LPF}(t)$ is the difference between prediction $Z^*(t)$ and observation $Z(t)$. Each water quality signal is treated separately, and then all water quality signals are combined by taking the average absolute value of the differences across all signals. The average difference or distance is then compared to the detection threshold $d^*_{LPF}$ to determine whether the difference is significant. If the difference is greater than the detection threshold, an alarm will be sounded.
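The prediction and error steps can be sketched as follows, under two assumptions that go beyond the text. First, the sum in Equation (10) is taken over the $P_{LPF}$ most recent observations, Z(t-1) to Z(t-$P_{LPF}$), which is the usual form of a linear predictor. Second, the coefficients $a_i$ are supplied by the caller, because the text does not describe how they are estimated. The function names are hypothetical.

```python
def lpf_predict(history, coeffs):
    """Predict Z*(t) = -sum_i a_i * Z(t - i), i = 1..P, from past values.

    history holds past observations in time order; coeffs holds a_1..a_P.
    """
    p = len(coeffs)
    recent = history[-p:][::-1]  # Z(t-1), Z(t-2), ..., Z(t-P)
    return -sum(a * z for a, z in zip(coeffs, recent))


def lpf_distance(histories, observations, coeffs_per_signal):
    """Average absolute prediction error (Eq. 11) across all signals."""
    errors = [
        abs(lpf_predict(h, a) - z)
        for h, z, a in zip(histories, observations, coeffs_per_signal)
    ]
    return sum(errors) / len(errors)
```

With coefficients (-2, 1), for example, the predictor extrapolates the last two values linearly, so the history 1, 2, 3 gives a prediction of 4.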

2.4.3. Data standardization

In the MED and LPF methods, values from or derived from different water quality sensors are summed together. These data are in different units. Therefore, they need to be standardized to a common scale. In this study, standardization is achieved by subtracting the dataset mean, μ, from each water quality signal, X(t), and then dividing this difference by the standard deviation, σ, of the dataset:

$$ Z(t) = \frac{X(t) - \mu}{\sigma} \tag{12} $$

Here μ refers to the mean of the dataset in the previous time steps with a window size denoted by $n_{ml}$. The performances of the MED and LPF methods are also evaluated using the ROC curve. It should be noted that data for the PE method do not need to be standardized because the Pearson coefficient is dimensionless.
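A minimal Python sketch of Equation (12) is given below. The function name is hypothetical. Using the population standard deviation and returning 0 for a constant window are assumptions, as the text does not specify either.

```python
import math


def standardise(value, window):
    """Z(t) = (X(t) - mu) / sigma, with mu and sigma taken from the window.

    window holds the previous n_ml observations of the same signal.
    """
    mean = sum(window) / len(window)
    # Assumption: population standard deviation (divide by the window size).
    sigma = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if sigma == 0:
        return 0.0  # assumed convention for a constant window
    return (value - mean) / sigma
```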


2.5. Datasets

In order to collect sensor signals over the course of a contamination event, a pilot-scale CIE platform was developed in the laboratory. Eight types of sensors were utilized in this experiment. They can measure the following 8 parameters simultaneously and continuously: temperature, pH, turbidity, conductivity, ORP, UV-254, nitrate-nitrogen and phosphate. The background experiment was conducted for 60 min. Then cadmium nitrate at a concentration of 0.008 mg/l was injected into the CIE platform for a period of 29 min. This was done to simulate a chemical spill. Signals from 8 sensors were recorded. Fig. 2 shows the sensor signals from the experiment, in which period a (left of the red dotted line) is background, while period b (right of the red dotted line) represents the time when cadmium nitrate is added (color in the web version). For each sensor, the dataset contained 89 values, in which the first 60 values were data without contamination and the last 29 values were data with contamination. For more information about the CIE platform and the injection experiment, readers can refer to Liu et al. (2014). Detailed data from this experiment are provided in the Supplementary material.

To facilitate performance evaluation, each time step is defined as a contamination event or a background situation. In total, there are 29 contamination events and 60 background situations.

3. Results

3.1. Initial application

To demonstrate how the PE method works, arbitrary parameter values were adopted in the initial application. The values used were: n = 15, $C^*$ = 0.4, and $D^*_{PE}$ = 3.0. Taking time steps 50 (background situation) and 75 (contamination event) as examples, the implementation of the PE method at these two time steps is demonstrated here.

Using Equations (1), (2) and (4), the correlation coefficients, correlation indicators and $D_{PE}$ at time steps 50 and 75 were calculated and are shown in Tables 1 and 2. The distances from the correlation indicator vectors to the point of origin were 1.74 and 3.60 respectively. According to Equation (5), an alarm was triggered at time step 75 (note: 3.60 > 3.0), whereas time step 50 was classified as background. The calculation process was repeated for all time steps and from this a PD of 0.93 and an FAR of 0.20 were obtained. This means that 93% of events were detected and 20% of background situations were incorrectly classified as contamination events.

Fig. 2. Sensor signals from cadmium nitrate injection experiment. Abbreviations: ORP = oxidation reduction potential; Cond. = conductivity; Turb. = turbidity; Phos. = phosphate; Temp. = temperature.

3.2. Performance with optimal parameter values

The PE method has three parameters: n (window size), $C^*$ (indicator threshold) and $D^*_{PE}$ (detection threshold). Using the method introduced in Section 2.3, the optimal values of parameters were obtained and are shown in Table 3. Fig. 3 displays the ROC curves obtained in the parameter optimization process. The optimal ROC curve is highlighted in red (in the web version). For the PE method, the area under the optimal ROC curve shown in Fig. 3 was 0.97. The associated n and $C^*$ values were 20 and 0.37 respectively. The top-left point on the optimal ROC curve, point a, denotes the best possible detection capability of the PE method for the dataset under discussion. The corresponding PD and FAR values for point a are 0.97 and 0.025, which suggests that 97% of contamination events were detected correctly and 2.5% of background situations were incorrectly grouped as contamination events. Meanwhile, the FAR at a PD of 0.95 is 0.02 (see Table 4), which means the PE method can detect 95% of contamination events correctly at the cost of wrongly identifying 2% of background situations as contamination events. The results show the method discriminates very well between background and contamination and that performance is improved by adopting optimal parameter values.

3.3. Performance comparisons with the MED and LPFmethods

The MED and LPF methods were applied to the same dataset

and compared to the detection performance of the proposed

PE method. Including the parameter nml used for data stan-

dardization, the MED and LPF methods both have three pa-

rameters (MED: nml, PE and dMED; LPF: nml, PLPF and dLPF). The

ent. Abbreviations: ORP ¼ oxidation reduction potential;

mp. ¼ temperature.

Page 6: 2015 2nd author A multivariate based event detection method and performance comparison with two baseline methods Water Research

Table 1 e The calculation results at the 50th time step.

Turb.a pH Cond. Temp. ORP Nitr. UV Phos.

Turb. 1.00 �0.09(0)b 0.05(0) 0.45(1) �0.06(0) �0.04(0) �0.28(0) 0.00(0)

pH 1.00 �0.45(1) �0.13(0) �0.30(0) �0.36(0) �0.32(0) 0.00(0)

Cond. 1.00 �0.05(0) �0.16(0) 0.04(0) 0.18(0) 0.00(0)

Temp. 1.00 0.20(0) �0.31(0) 0.11(0) 0.00(0)

ORP 1.00 0.27(0) 0.45(1) 0.00(0)

Nitra. 1.00 0.35(0) 0.00(0)

UV 1.00 0.00(0)

Phos. 1.00

Correlation indicator vector [0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]

DPE 1.74

a Turb. e turbidity; Cond. e conductivity; Temp. e temperature; ORP e oxidation reduction potential; Nitr. e nitrate-nitrogen; Phos. e

phosphate.b Meaning of entry �0.09(0): number in bracket e correlation indicator; number outside of bracket e correlation coefficient.

Table 2 - The calculation results at the 75th time step.

| | Turb. (a) | pH | Cond. | Temp. | ORP | Nitr. | UV | Phos. |
|---|---|---|---|---|---|---|---|---|
| Turb. | 1.00 | -0.22(0) (b) | 0.03(0) | -0.75(1) | -0.36(0) | 0.10(0) | -0.45(1) | 0.00(0) |
| pH | | 1.00 | 0.87(1) | 0.77(1) | -0.09(0) | -0.94(1) | -0.07(0) | 0.00(0) |
| Cond. | | | 1.00 | 0.53(1) | 0.00(0) | -0.87(1) | -0.04(0) | 0.00(0) |
| Temp. | | | | 1.00 | 0.20(0) | -0.63(1) | 0.31(0) | 0.00(0) |
| ORP | | | | | 1.00 | 0.07(0) | 0.62(1) | 0.00(0) |
| Nitr. | | | | | | 1.00 | 0.06(0) | 0.00(0) |
| UV | | | | | | | 1.00 | 0.00(0) |
| Phos. | | | | | | | | 1.00 |

Correlation indicator vector: [0 0 1 1 1 1 0 0 0 0 0 1 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0]

DPE: 3.60

(a) Turb. = turbidity; Cond. = conductivity; Temp. = temperature; ORP = oxidation reduction potential; Nitr. = nitrate-nitrogen; Phos. = phosphate.

(b) Meaning of entry -0.22(0): the number in brackets is the correlation indicator; the number outside the brackets is the correlation coefficient.

Table 3 - The initial range and optimal values of parameters for Pearson correlation Euclidean distance-based (PE), multivariate Euclidean distance (MED), and linear prediction filters (LPF) methods.

| Detection method | Parameter | Initial range (optimization) | Step (optimization) | Optimal value (dataset from lab) |
|---|---|---|---|---|
| PE | n | [5 29] | 1 | 20 |
| | C* | [0 1] | 0.001 | 0.37 |
| | D*PE | [1 6] | 0.01 | 3.3 |
| MED | nml | [5 30] | 1 | 9 |
| | PE | [1 29] | 1 | 24 |
| | d*MED | [0 2] | 0.005 | 0.72 |
| LPF | nml | [5 50] | 1 | 34 |
| | PLPF | [1 20] | 1 | 5 |
| | d*LPF | [0 1] | 0.1 | 0.5 |

Water Research 80 (2015) 109-118

same methods were adopted to evaluate the performances of the MED and LPF methods and obtain the optimal values of their parameters (described in Sections 2.2 and 2.3).

Using the optimal parameter values shown in Table 3, the best performance of the MED and LPF methods was obtained and is presented in Table 4. As shown in Table 4, the MED method achieved an area of 0.57 under the optimal ROC curve and the LPF method yielded an area of 0.37, both smaller than the area obtained using the PE method (0.97). The top-left points on the optimal ROC curves (points b and c in Fig. 3) gave PD = 0.52, FAR = 0.22 and PD = 0.38, FAR = 0.54 for the MED and LPF methods respectively. The top-left point on the optimal ROC curve represents the best possible performance the detection method can achieve. The FARs for the MED and LPF methods are clearly greater than that of the PE method, and the PDs are smaller. This means the MED and LPF methods correctly detected fewer contamination events and incorrectly reported more background situations as contamination events. Meanwhile, the FARs at a PD of 0.95 were 0.95 for the MED method and 0.94 for the LPF method. This indicates that, for the MED and LPF methods, the cost of detecting 95% of contamination events is that nearly 95% of background situations are wrongly classified. From this comparison, it was concluded that the PE method has better potential to detect contamination events with a lower false alarm rate. It is thus more promising for use in contamination event detection in an early warning system.
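
The area under an ROC curve, as compared above, can be approximated numerically once a set of (FAR, PD) points is available. The following is a minimal sketch using the trapezoidal rule; the function name `roc_auc` and the example inputs are illustrative assumptions and not the procedure used in this study. The only values taken from the text are the PE top-left point (FAR = 0.025, PD = 0.97) from Table 4.

```python
def roc_auc(points):
    """Approximate the area under an ROC curve with the trapezoidal rule.

    points: iterable of (FAR, PD) pairs. The end points (0, 0) and (1, 1)
    are added automatically (an assumption of this sketch).
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Trapezoid between two neighbouring operating points.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# A point on the diagonal adds nothing beyond random guessing.
diagonal_area = roc_auc([(0.5, 0.5)])      # 0.5

# Coarse polyline through the PE top-left point reported in Table 4.
# This is NOT the full ROC curve, only a single operating point.
pe_polyline_area = roc_auc([(0.025, 0.97)])  # approximately 0.9725
```

A curve lying on the diagonal gives an area of 0.5, which is why an area close to 0.5 indicates performance near random guessing. A polyline through one operating point is only a coarse stand-in for a full ROC curve, so the second value should not be read as a reproduction of the reported area.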

3.4. Discussion

To further investigate how the three methods perform at each time step, their detection results are displayed in Fig. 4, in which graphs (a), (b) and (c) show the calculated distance and detection results of the PE, LPF and MED methods respectively. Graph (d) indicates the actual background (blue) and contamination (red) (in the web version). The red dashed lines are the alarm thresholds for each method. The solid red blocks indicate that a contamination alarm has been triggered, while the solid blue blocks mean no alarm (background) (in the web version). The empty red blocks in graph (d) show the actual contamination events, while the empty blue blocks denote background situations. The remaining blank parts in graphs (a), (b) and (c) indicate that the methods have not started to produce detection results due to parameter settings. For example, the optimal value of n in the PE method is 20, so the first detection result was at the 21st minute.

Fig. 3 - Optimization results of the Pearson correlation Euclidean distance-based (PE), multivariate Euclidean distance (MED), and linear prediction filters (LPF) methods.

As shown in Fig. 2, the fluctuations in the first 60 min were mainly caused by equipment noise or variability in hydraulics, while those in the last 29 min were mainly due to the presence of contamination. In the first 60 min, the PE method yielded only one DPE greater than 3.1 (the detection threshold D*PE): DPE = 3.11 at the 31st minute (graph (a) in Fig. 4). The PE method therefore incorrectly classified only one background situation as a contamination event. Meanwhile, in the last 29 min, it yielded only one DPE smaller than 3.1, which was DPE = 3.09 at the 70th minute. All the others were greater than the threshold value and were thus correctly identified as contamination. This suggests that the PE method has the ability to differentiate the presence of contamination from other causes of fluctuation.
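
As a rough cross-check, the counts described above can be converted into a probability of detection and a false alarm rate. The sketch below assumes one detection result per minute, that the PE method produced results from the 21st minute onward (because n = 20), and that the experiment consisted of 60 background minutes followed by 29 contamination minutes. The function name `detection_rates` is hypothetical.

```python
def detection_rates(true_alarms, missed_events, false_alarms, quiet_background):
    """Return (PD, FAR) from per-time-step classification counts."""
    pd = true_alarms / (true_alarms + missed_events)
    far = false_alarms / (false_alarms + quiet_background)
    return pd, far


# Assumed counts for the PE method, derived from the description in the text:
background_steps = 60 - 20        # minutes 21..60 have a detection result
contamination_steps = 29          # the last 29 minutes
false_alarms = 1                  # DPE = 3.11 at the 31st minute
missed_events = 1                 # DPE = 3.09 at the 70th minute

pd, far = detection_rates(
    true_alarms=contamination_steps - missed_events,
    missed_events=missed_events,
    false_alarms=false_alarms,
    quiet_background=background_steps - false_alarms,
)
print(f"PD = {pd:.3f}, FAR = {far:.3f}")
```

Under these assumptions the result is PD = 28/29 (about 0.97) and FAR = 1/40 = 0.025, which is in line with the PE top-left values listed in Table 4.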

Table 4 - The performance of three detection methods (dataset from laboratory). (a)

| Method | Area under ROC | PD (top-left) | FAR (top-left) | FAR at a PD of 0.95 |
|---|---|---|---|---|
| PE | 0.97 | 0.97 | 0.025 | 0.02 |
| MED | 0.57 | 0.52 | 0.22 | 0.95 |
| LPF | 0.37 | 0.38 | 0.48 | 0.94 |

(a) PE = Pearson correlation Euclidean distance-based; MED = multivariate Euclidean distance; LPF = linear prediction filters; ROC = receiver operating characteristic; PD = probability of detection; FAR = false alarm rate.

In contrast, the LPF and MED methods incorrectly reported a large number of background situations as contamination events and missed many real events. Graph (b) reveals that the LPF method rarely triggered alarms after the 67th minute. Most alarms appeared in the period from the 54th to the 67th minute, just before and after the injection of cadmium nitrate. As shown in Fig. 2, measurements taken by the UV, ORP and conductivity sensors fluctuated somewhat between the 54th and the 60th minute (moment of injection), which was mainly due to equipment noise. The fluctuations after the 60th minute were mainly caused by the injection of cadmium nitrate. As introduced in Section 2.4, the LPF method triggers an alarm by evaluating the difference between the actual sensor signals and the predicted values. In other words, if the prediction for a time step differs significantly from the observation, or the signal is unpredictable, an event is declared. This explains why alarms were triggered between the 54th and the 67th minutes.
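
The residual-based alarm rule described above can be illustrated with a small sketch. It is not the filter defined in Section 2.4: the prediction coefficients `[2.0, -1.0]` (a simple linear extrapolation from the two most recent values) and the function names are hypothetical, and the readings are assumed to be already standardized. Only the threshold of 0.5 (d*LPF) is taken from Table 3.

```python
def lpf_predict(history, coeffs):
    """Predict the next value as a weighted sum of the most recent values.

    coeffs[0] weights the most recent value, coeffs[1] the one before, etc.
    """
    recent = history[-len(coeffs):][::-1]   # most recent value first
    return sum(a * x for a, x in zip(coeffs, recent))


def lpf_distance(histories, observed, coeffs):
    """Average absolute prediction error over all sensors."""
    residuals = [
        abs(obs - lpf_predict(h, coeffs))
        for h, obs in zip(histories, observed)
    ]
    return sum(residuals) / len(residuals)


D_STAR_LPF = 0.5                 # detection threshold d*LPF (Table 3)
coeffs = [2.0, -1.0]             # hypothetical: x[t] ~ 2*x[t-1] - x[t-2]

# Two hypothetical standardized sensor histories.
histories = [[0.0, 0.1, 0.2], [1.0, 1.0, 1.0]]

smooth = [0.3, 1.0]              # both sensors continue their trend
jump = [0.3, 2.4]                # second sensor jumps unexpectedly

d_smooth = lpf_distance(histories, smooth, coeffs)
d_jump = lpf_distance(histories, jump, coeffs)
alarm_smooth = d_smooth > D_STAR_LPF
alarm_jump = d_jump > D_STAR_LPF
```

In this sketch any sufficiently large jump raises an alarm, whether it comes from noise or from a contaminant, which is the behaviour the paragraph above attributes to the LPF method.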

For the MED method, as shown in graph (c) of Fig. 4, the alarms were relatively evenly spread over the whole experiment period. The MED method identifies a contamination event when the distance between the current water quality value and the mean value is significantly different from the distance between the water quality value of the previous time step and the mean value. The time of the alarms was consistent with the time of fluctuations. This suggests that the MED method is useful for detecting fluctuations. However, it might miss situations where the variation at the current time step is similar to the average variation over previous time steps. Furthermore, as shown in Fig. 3, the ROC curves for the MED and LPF methods follow a roughly diagonal trend, which suggests that the performances of the MED and LPF methods are very poor. This is mainly because these two methods trigger an alarm by detecting the existence of fluctuations: when a fluctuation is large enough, an alarm is triggered. These two methods do not have the capacity to further examine the cause of fluctuations. For this reason, they tend to mistakenly identify background as contamination, which leads to a high false positive rate. The method proposed in this study uses correlation coefficients, which help to differentiate the presence of contamination from other causes of fluctuation and thereby improve its detection performance.
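
A distance-to-window-mean check of the kind discussed above can be sketched as follows. This is a simplified reading of the MED rule, not the authors' implementation: the readings are assumed to be already standardized (the nml step is omitted), the sensor values are made up, and the function name `med_distance` is hypothetical. The window length of 24 and the threshold of 0.72 are the optimal values listed in Table 3.

```python
import math


def med_distance(window, current):
    """Euclidean distance between the current reading vector and the
    per-sensor mean of the preceding window of reading vectors."""
    n = len(window)
    means = [sum(row[k] for row in window) / n for k in range(len(current))]
    return math.sqrt(sum((c - m) ** 2 for c, m in zip(current, means)))


D_STAR_MED = 0.72                          # detection threshold d*MED (Table 3)
window = [[0.0, 0.0, 0.0, 0.0]] * 24       # 24 previous steps, 4 hypothetical sensors

spike = [0.5, 0.4, 0.3, 0.3]               # large fluctuation, cause unknown
shift = [0.005, -0.005, 0.005, -0.005]     # small coordinated change

alarm_spike = med_distance(window, spike) > D_STAR_MED
alarm_shift = med_distance(window, shift) > D_STAR_MED
```

The check responds only to the size of the deviation: the large fluctuation triggers an alarm and the small coordinated change does not, regardless of what caused either of them.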

Both the MED and LPF methods can detect situations that result in sudden changes in signals. However, these two methods cannot differentiate between fluctuations caused by equipment noise and those caused by contamination. As shown in graphs (b) and (c), many false alarms were triggered before the 60th minute (i.e. for time steps corresponding to background). Taking the 56th minute as an example, the ORP, conductivity, turbidity and UV all showed significant fluctuations at this time step. The Euclidean distance between the vector at the 56th minute and the vector made up of the mean values of the previous 24 time steps (parameter PE) was calculated to be 0.79 using the MED method, which is greater than 0.72 (the detection threshold d*MED). Therefore, a contamination event alarm was declared at this time step. For the LPF method, the averaged difference between the predictions and observations was calculated to be 1.2, which is greater than 0.5 (the detection threshold d*LPF). The LPF method therefore also incorrectly reported an alarm at this time step. For the PE method, although fluctuations appeared on several sensors at this time step, as shown in Table 5, their correlation coefficients were rather small, which suggested these fluctuations were independent and mainly caused by equipment noise. Using Equation (4), the Euclidean distance between the correlation indicator vector and the point of origin, DPE, was calculated as 1.4, which is lower than 3.1 (the detection threshold D*PE). Therefore, no alarm was triggered by the PE method at this time step.
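
The PE calculation for the 56th minute can be retraced from the correlation coefficients in Table 5. The sketch below assumes that an indicator is set to 1 when the absolute correlation coefficient reaches C* = 0.37 (the optimal value in Table 3), which is consistent with the indicators printed in the tables, and that DPE is the Euclidean norm of the indicator vector as described above. The variable names are illustrative.

```python
import math

C_STAR = 0.37   # correlation threshold C* (Table 3)
D_STAR = 3.1    # detection threshold D*PE as quoted in the text

# Upper-triangle correlation coefficients at the 56th time step (Table 5),
# one list per row, diagonal omitted.
upper = [
    [-0.10, 0.50, 0.28, 0.16, -0.28, -0.21, 0.00],  # Turb. vs pH ... Phos.
    [-0.00, -0.26, -0.24, -0.23, -0.22, 0.00],      # pH vs Cond. ... Phos.
    [-0.09, -0.19, 0.15, -0.15, 0.00],              # Cond.
    [0.12, -0.13, 0.33, 0.00],                      # Temp.
    [0.13, 0.17, 0.00],                             # ORP
    [0.53, 0.00],                                   # Nitr.
    [0.00],                                         # UV
]

coeffs = [r for row in upper for r in row]          # 28 sensor pairs
indicators = [1 if abs(r) >= C_STAR else 0 for r in coeffs]

# Distance of the indicator vector from the origin.
d_pe = math.sqrt(sum(i * i for i in indicators))
alarm = d_pe > D_STAR
print(f"DPE = {d_pe:.2f}, alarm = {alarm}")
```

Only two of the 28 pairs reach the threshold (0.50 and 0.53), so the distance is the square root of 2, about 1.41, matching the reported DPE of 1.4 and staying below the alarm threshold. The order in which the pairs are flattened differs from the vector printed in Table 5, but the ordering does not affect the distance.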

An advantage of online water quality sensors is that they can provide water quality information quickly. However, online water quality sensors have also been criticized for their low stability. A common issue for online sensors is that their signals always contain device noise. The MED and LPF methods rely mainly on pure mathematical data analysis. They depend either on comparison with previous time steps or on comparison between the predicted and observed values for the same time step. The characteristics of the signals and the connections between these signals are not taken into consideration. Fluctuations caused by equipment noise and those caused by the presence of contaminants are viewed as identical by the MED and LPF methods: both methods classified the significant fluctuations as events. The analysis here shows that equipment noise can cause the MED and LPF methods to give erroneous results and trigger false alarms. The proposed PE method conducts data analysis by taking the physical meaning of signals into consideration. By employing correlation coefficients, the PE method distinguishes between these two types of variations and overcomes this drawback. The FAR at a PD of 0.95 is 0.02 for the PE method, which is significantly lower than the FAR for MED (0.95) and LPF (0.94).

Meanwhile, the PE method also has the advantage of detecting events when the fluctuation is small. For example, at the 73rd time step, as shown in Fig. 2, the signal fluctuations are minimal. The MED method yielded dMED = 0.01, which is lower than 0.72 (the detection threshold d*MED). The LPF method obtained dLPF = 0.31, which is smaller than 0.50 (the detection threshold d*LPF). Therefore, neither the MED nor the LPF method reported contamination. The PE method caught these slight variations and yielded DPE = 3.7, which is greater than 3.1 (the detection threshold D*PE). This indicates that the signal characteristics (correlated responses) can help in judging the presence of a contaminant, especially where variations in sensor readings are not significant. By jointly employing the characteristics of sensor readings and data processing techniques, the PE method demonstrates better detection performance than the MED and LPF methods. Table 4 shows clearly that for both performance indicators (the area under the ROC curve and the FAR at a PD of 0.95) the PE method has an advantage over the MED and LPF methods.
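
The point that correlated responses matter more than their size can be illustrated with synthetic data. In the sketch below, which is an idealized illustration and not the experimental data, eight hypothetical sensors are observed over a window of n = 20 steps. The background signals are large but mutually uncorrelated (cosines of different frequencies), while the event signals share a very small common drift. The indicator rule (absolute Pearson coefficient of at least C* = 0.37) and the treatment of flat signals as uncorrelated are assumptions of this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a flat signal counts as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def d_pe(signals, c_star):
    """Norm of the 0/1 correlation indicator vector over all sensor pairs."""
    ones = 0
    for i in range(len(signals)):
        for j in range(i + 1, len(signals)):
            if abs(pearson(signals[i], signals[j])) >= c_star:
                ones += 1
    return math.sqrt(ones)  # indicators are 0 or 1, so the sum of squares is a count


N = 20          # window length n (Table 3)
C_STAR = 0.37   # correlation threshold C* (Table 3)
D_STAR = 3.1    # detection threshold D*PE as quoted in the text


def wave(k, amplitude):
    """Cosine with k full cycles over the window; different k are uncorrelated."""
    return [amplitude * math.cos(2 * math.pi * k * t / N) for t in range(N)]


# Large, independent fluctuations (idealized equipment noise).
background = [wave(k, 0.5) for k in range(1, 9)]

# Tiny common drift (up or down per sensor) plus a small independent ripple.
signs = [1, -1, 1, 1, -1, -1, 1, -1]
event = [
    [s * 0.0005 * t + w for t, w in enumerate(wave(k, 0.002))]
    for k, s in zip(range(1, 9), signs)
]

d_background = d_pe(background, C_STAR)
d_event = d_pe(event, C_STAR)
print(d_background > D_STAR, d_event > D_STAR)
```

In this synthetic case the background signals swing by about 1.0 yet give a distance of 0, whereas the event signals move by roughly 0.01 but are mutually correlated and give a distance of about 5.29, above the 3.1 threshold.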

For any detection method, the aim is to maximize the probability of detection and minimize the false alarm rate. In this study, the authors tried to achieve this aim by solving a single-objective optimization problem in which maximizing the area under the ROC curve is the objective. The points on the ROC curve represent the PDs and FARs the detection method can achieve for all possible threshold values (parameter values in this study). Each point on the ROC curve represents a trade-off between FAR and PD. As discussed by Flach (2010), when selecting a solution point from an ROC curve, there is a possibility that non-optimal solutions (dominated points) might be selected. In this study, this was avoided by choosing only the top-left point on the ROC curve for discussion.

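Fig. 4 - The detection results at each time step for the Pearson correlation Euclidean distance-based (PE), multivariate Euclidean distance (MED), and linear prediction filters (LPF) methods. Red dashed line: alarm threshold; solid red block: contamination alarm; solid blue block: no alarm (background); empty red block: contamination events; empty blue block: background situations. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)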


Table 5 - The correlation coefficients and correlation indicators at the 56th time step.

| | Turb. (a) | pH | Cond. | Temp. | ORP | Nitr. | UV | Phos. |
|---|---|---|---|---|---|---|---|---|
| Turb. | 1.00 | -0.10(0) (b) | 0.50(1) | 0.28(0) | 0.16(0) | -0.28(0) | -0.21(0) | 0.00(0) |
| pH | | 1.00 | -0.00(0) | -0.26(0) | -0.24(0) | -0.23(0) | -0.22(0) | 0.00(0) |
| Cond. | | | 1.00 | -0.09(0) | -0.19(0) | 0.15(0) | -0.15(0) | 0.00(0) |
| Temp. | | | | 1.00 | 0.12(0) | -0.13(0) | 0.33(0) | 0.00(0) |
| ORP | | | | | 1.00 | 0.13(0) | 0.17(0) | 0.00(0) |
| Nitr. | | | | | | 1.00 | 0.53(1) | 0.00(0) |
| UV | | | | | | | 1.00 | 0.00(0) |
| Phos. | | | | | | | | 1.00 |

Correlation indicator vector: [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

DPE: 1.4

(a) Turb. = turbidity; Cond. = conductivity; Temp. = temperature; ORP = oxidation reduction potential; Nitr. = nitrate-nitrogen; Phos. = phosphate.

(b) Meaning of entry -0.10(0): the number in brackets is the correlation indicator; the number outside the brackets is the correlation coefficient.


The top-left point needed to satisfy two conditions: 1) it should be a non-dominated point; 2) it should have the shortest distance from the top-left corner of the x-y axes. The need for the first condition can be avoided if the problem is formulated as a two-objective optimization problem in which a Pareto front with only non-dominated points is obtained (Reed et al., 2013; Zhu et al., 2014; Zhang et al., 2012). Therefore, a full-scale two-objective optimization should be conducted in a future study.
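
The two conditions for the top-left point can be expressed as a short selection routine. The sketch below is illustrative only: the function names are hypothetical, and the list of (FAR, PD) points is made up apart from (0.02, 0.95) and (0.025, 0.97), which correspond to the PE values in Table 4.

```python
import math


def non_dominated(points):
    """Keep points for which no other point has lower-or-equal FAR and
    higher-or-equal PD, with at least one of the two strictly better."""
    keep = []
    for far, pd in points:
        dominated = any(
            (f <= far and p >= pd) and (f < far or p > pd)
            for f, p in points
        )
        if not dominated:
            keep.append((far, pd))
    return keep


def top_left_point(points):
    """Non-dominated point closest to the ideal corner (FAR = 0, PD = 1)."""
    candidates = non_dominated(points)
    return min(candidates, key=lambda fp: math.hypot(fp[0], 1.0 - fp[1]))


# Hypothetical ROC points as (FAR, PD) pairs.
roc_points = [
    (0.0, 0.0),
    (0.02, 0.95),
    (0.025, 0.97),
    (0.10, 0.96),   # dominated by (0.025, 0.97)
    (0.30, 0.98),
    (1.0, 1.0),
]

best = top_left_point(roc_points)
print(best)
```

Filtering first and then minimizing the distance mirrors the two conditions in the order they are stated. A two-objective formulation would return the whole non-dominated set rather than a single point.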

3.5. Future study

The proposed method utilizes the signals from online water quality sensors to detect the presence of contamination. These signals might contain uncertainties, which could affect the detection performance of the proposed method. The robustness of the proposed method when faced with uncertainties should be evaluated in the future. The performance of the proposed method should also be evaluated using actual field contamination data. Although the proposed method demonstrated great improvement in reducing false positive errors, a 2% false positive rate is still significant. More studies should be conducted in the future to further reduce the false positive rate.

4. Conclusions

1. This paper proposed a new contamination event detection method, which jointly employs the characteristics of sensor readings and data processing techniques. Using a dataset from a contaminant injection experiment and optimized parameter values, the proposed method can detect 95% of contamination events correctly with a 2% false alarm rate.

2. Fluctuations in signals could be caused by equipment noise or by the presence of a contaminant. A drawback of baseline detection methods is that they cannot differentiate well between these two types of fluctuations. Results from this study show that the proposed method has an obvious advantage over the MED and LPF methods in this area. The proposed method only incorrectly classified 2% of background situations as contamination events, while the MED and LPF methods incorrectly classified 22% and 38% of background situations respectively.

3. The proposed method also demonstrated some strength in detecting events that were only represented by slight variations in water quality sensor signals. This means it achieves a higher true positive alarm rate than the MED and LPF methods.

Acknowledgements

This work is financially supported by the National Nature and Science Foundation and Water Major Program (2012ZX07408-002).

Appendix A. Supplementary data

Supplementary data related to this article can be found at

http://dx.doi.org/10.1016/j.watres.2015.05.013.

References

Allgeier, S., Murray, R., Mckenna, S., Shalvi, D., 2005. Overview of Event Detection Systems for WaterSentinel. Environmental Protection Agency, Washington, DC.

Arad, J., Housh, M., Perelman, L., Ostfeld, A., 2013. A dynamic thresholds scheme for contaminant event detection in water distribution systems. Water Res. 47 (5), 1899-1908.

Benesty, J., Chen, J., Huang, Y., 2008. On the importance of the Pearson correlation coefficient in noise reduction. IEEE Trans. Audio Speech Lang. Process. 16 (4), 757-765.

Flach, P., 2010. ROC analysis. In: Sammut, C., Webb, G.I. (Eds.), Encyclopedia of Machine Learning. Springer, pp. 869-875.

Hall, J., Zaffiro, A.D., Marx, R.B., Kefauver, P.C., Krishnan, E.R., Herrmann, J.G., 2007. On-line water quality parameters as indicators of distribution system contamination. J. Am. Water Works Assoc. 99 (1), 66-77.

Hart, D., McKenna, S.A., Klise, K., Cruz, V., Wilson, M., 2007. CANARY: a Water Quality Event Detection Algorithm Development Tool (Reston, VA).

Hasan, J., States, S., Deininger, R., 2004. Safeguarding the security of public water supplies using early warning systems: a brief review. J. Contemp. Water Res. Educ. 129 (1), 27-33.

Klise, K.A., McKenna, S.A., 2006. Water quality change detection: multivariate algorithms. In: Proc. SPIE, Optics and Photonics in Global Homeland Security II 6203, pp. 62030J.1-62030J.9.


Kroll, D., 2006. Securing Our Water Supply: Protecting a Vulnerable Resource. Pennwell, Oklahoma, USA.

Kroll, D., King, K., 2006a. EPA Verification Testing and Real World Deployment of a Heuristic System for Water Security Monitoring. Abstracts of Papers of the American Chemical Society, p. 231.

Kroll, D., King, K., 2006b. The utilization on-line of common parameter monitoring as a surveillance tool for enhancing water security. Water Contam. Emergencies - Enhancing Our Response (302), 89-98.

Liu, S., Che, H., Smith, K., Chen, L., 2014. Contamination event detection using multiple types of conventional water quality sensors in source water. Environ. Sci. Process. Impacts 16 (8), 2028-2038.

Liu, S., Che, H., Smith, K., Chang, T., 2015a. A real time method of contaminant classification using conventional water quality sensors. J. Environ. Manag. 154, 13-21.

Liu, S., Che, H., Smith, K., Chang, T., 2015b. Contaminant classification using cosine distance based on multiple conventional sensors. Environ. Sci. Process. Impacts 17, 343-350.

McKenna, S.A., Wilson, M., Klise, K.A., 2008. Detecting changes in water quality data. J. Am. Water Works Assoc. 100 (1), 74-85.

Monedero, I., Biscarri, F., Leon, C., Guerrero, J.I., Biscarri, J., Millan, R., 2011. Detection of frauds and other non-technical losses in a power utility using Pearson coefficient, Bayesian networks and decision trees. Int. J. Electr. Power Energy Syst. 34 (1), 90-98.

Mudelsee, M., 2003. Estimating Pearson's correlation coefficient with bootstrap confidence interval from serially dependent time series. Math. Geol. 35 (6), 651-665.

Perelman, L., Arad, J., Housh, M., Ostfeld, A., 2012. Event detection in water distribution systems from multivariate water quality time series. Environ. Sci. Technol. 46 (15), 8212-8219.

Raciti, M., Cucurull, J., Nadjm-Tehrani, S., 2012. Critical Infrastructure Protection. Springer, pp. 98-119.

Reed, P., Hadka, D., Herman, J.D., Kasprzyk, J.R., Kollat, J.B., 2013. Evolutionary multiobjective optimization in water resources: the past, present, and future. Adv. Water Resour. 51, 438-456.

Swets, J.A., 1988. Measuring the accuracy of diagnostic systems. Science 240 (4857), 1285-1293.

Wang, C., Feng, Y., Zhao, S., Li, B.-L., 2012. A dynamic contaminant fate model of organic compound: a case study of nitrobenzene pollution in Songhua River, China. Chemosphere 88 (1), 69-76.

Yang, J., Bi, J., Zhang, H.-y., Li, F.-y., Zhou, J.-b., Liu, B.-b., 2010. Evolvement of the relationship between environmental pollution accident and economic growth in China. China Environ. Sci. 30 (4), 571-576.

Yang, Y.J., Haught, R.C., Goodrich, J.A., 2009. Real-time contaminant detection and classification in a drinking water pipe using conventional water quality sensors: techniques and experimental results. J. Environ. Manag. 90 (8), 2494-2506.

Zetterqvist, L., 1991. Statistical estimation and interpretation of trends in water quality time series. Water Resour. Res. 27 (7), 1637-1648.

Zhang, C., Wang, G., Peng, Y., Tang, G., Liang, G., 2012. A negotiation-based multi-objective, multi-party decision-making model for inter-basin water transfer scheme optimization. Water Resour. Manag. 26 (14), 4029-4038.

Zhu, X., Zhang, C., Yin, J., Zhou, H., Jiang, Y., 2014. Optimization of water diversion based on reservoir operating rules - a case study of the Biliu River reservoir, China. J. Hydrologic Eng. 19 (2), 411-421.