Water Research 80 (2015) 109-118
Available online at www.sciencedirect.com
ScienceDirect
journal homepage: www.elsevier.com/locate/watres
A multivariate based event detection method and performance comparison with two baseline methods
Shuming Liu*, Kate Smith, Han Che
School of Environment, Tsinghua University, Beijing 100084, China
Article info
Article history:
Received 27 January 2015
Received in revised form
30 April 2015
Accepted 5 May 2015
Available online 12 May 2015
Keywords:
Contaminant classification
Conventional sensor
Early warning system
Euclidean distance
Pearson correlation
Water quality
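Abbreviations: ANN, artificial neural network; ARMA, autoregressive moving average; CIE, contaminant injection experiment; EWS, early warning system; FAR, false alarm rate; FN, false negative; FP, false positive; LPF, linear prediction filters; MED, multivariate Euclidean distance; ORP, oxidation reduction potential; PE, Pearson correlation Euclidean distance-based method; PD, probability of detection; READiw, real-time event adaptive detection, identification and warning; ROC, receiver operating characteristic; SCADA, supervisory control and data acquisition; SVM, support vector machine; TN, true negative; TOC, total organic carbon; TP, true positive.
* Corresponding author.
E-mail address: [email protected] (S. Liu).
http://dx.doi.org/10.1016/j.watres.2015.05.013
0043-1354/© 2015 Elsevier Ltd. All rights reserved.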
Abstract
Early warning systems have been widely deployed to protect water systems from acci-
dental and intentional contamination events. Conventional detection algorithms are often
criticized for having high false positive rates and low true positive rates. This mainly stems
from the inability of these methods to determine whether variation in sensor measure-
ments is caused by equipment noise or the presence of contamination. This paper presents
a new detection method that identifies the existence of contamination by comparing
Euclidean distances of correlation indicators, which are derived from the correlation co-
efficients of multiple water quality sensors. The performance of the proposed method was
evaluated using data from a contaminant injection experiment and compared with two
baseline detection methods. The results show that the proposed method can differentiate
between fluctuations caused by equipment noise and those due to the presence of
contamination. It yielded a higher probability of detection and a lower false alarm rate than
the two baseline methods. With optimized parameter values, the proposed method can
correctly detect 95% of all contamination events with a 2% false alarm rate.
© 2015 Elsevier Ltd. All rights reserved.
1. Introduction
China has suffered thousands of water contamination events
over the past few decades. Between 1992 and 2006, an average
of 1906 contamination accidents occurred per year (Yang
et al., 2010). For example, the Songhua River was contami-
nated by nitrobenzene from a chemical plant explosion in
2005, which resulted in a 4 day suspension of water supply to
Harbin, China (Wang et al., 2012). More recently, in February
2012, the drinking water source of a city in the lower Yangtze
River area of Jiangsu province was contaminated by a phenol
spill from a South Korean cargo ship. One approach for
avoiding or mitigating the impact of contamination is to
establish an Early Warning System (EWS), which normally
includes online sensors, a connected supervisory control and
data acquisition (SCADA) system, a detection algorithm and a
decision support system (Hasan et al., 2004). EWS should
provide a fast and accurate means to distinguish between
normal variations in water parameters and actual contami-
nation events.
A key part of an EWS is the detection algorithm, which
utilizes data from online sensors to evaluate water quality and
detect the presence of contamination. Conventional water
quality sensors have been playing a growing role in EWS
because they are easy to maintain, reliable and cost-effective.
As summarized by McKenna et al. (2008), there are two ap-
proaches to developing and testing event detection methods
using water quality sensor signals. First, laboratory and test-
loop evaluation of sensors and associated event detection al-
gorithms provides direct measurement of chemical changes
in background water quality caused by specific contaminants
(Hall et al., 2007; Kroll and King, 2006a, b; Liu et al., 2015a, b).
For example, Hall et al. (2007) carried out a sensor response
experiment for 9 types of contaminants and realized that
more than one sensor responded to each tested contaminant.
After noticing this phenomenon, researchers have attempted
to develop contaminant detection methods using responses
from multiple sensors. Yang et al. (2009) developed a real-time
event adaptive detection, identification and warning (READiw)
methodology in a drinking water pipe. The suggested adaptive
transformation of sensor measurements reduced background
noise and enhanced contaminant signals. In the method
employed by Yang et al. (2009), the relative value of concen-
trations of free and total chlorine, pH and oxidation reduction
potential (ORP) are used for contaminant classification. This
allows for contaminant detection and further classification
based on chlorine kinetics. Kroll (2006) developed the Hach
HST approach using multiple sensors for event detection and
contaminant identification. In this approach, signals from 5
separate orthogonal measurements of water quality (pH,
conductivity, turbidity, chlorine residual, total organic carbon
(TOC)) are processed from a 5-parameter measure into a single
scalar trigger signal. The deviation signal is then compared to
a preset threshold level. If the signal exceeds the threshold,
the trigger is activated (Kroll, 2006). In Kroll's method,
although responses from multiple sensors are utilized, their
internal relationship is not explored. McKenna et al. (2008)
argued that a drawback of the laboratory and test-loop re-
sults and the resulting algorithms is that variation of the
background water quality in these systems may be consider-
ably less than the variation observed in actual water systems.
Another drawback of these types of methods is that the
threshold level is site dependent. When applied to a situation
different from the one for which the method is developed,
field calibration is necessary.
The second approach to event detection is based on signal
processing and data-driven techniques (McKenna et al.,
2008). For example, Hart et al. (2007) reported a linear pre-
diction filters (LPF) method. The LPF method predicts the
water quality at a future time step and evaluates the residual
between predicted and observed water quality values. Klise
and McKenna (2006) developed an algorithm to classify the
current measurement as normal or anomalous by calcu-
lating the multivariate Euclidean distance (MED). The MED
approach provides a measure of the distance between the
sampled water quality and the previously measured samples
contained in the history window. McKenna et al. (2008)
compared the performance of LPF, MED and a time series
increments method. These algorithms process water quality
data at each time step to identify periods of anomalous water
quality and the probability of a water contamination event
having occurred at that time step. The averaged deviation
between the observed and predicted responses from time
series data for each sensor is compared with a preset
threshold. If the averaged deviation is greater than the preset
threshold value, an alarm is triggered. Allgeier et al. (2005)
and Raciti et al. (2012) used artificial neural networks (ANN)
and support vector machines (SVM) to classify water quality
data into normal and anomalous classes after supervised
learning. Perelman et al. (2012) and Arad et al. (2013) reported
a general framework that integrates a data-driven estima-
tion model with sequential probability updating to detect
quality faults in water distribution systems using multivar-
iate water quality time series. A common feature of signal
processing and data-driven methods is that they rely mainly
on pure mathematical data analysis. The characteristics of
sensor responses to contaminants and the connections be-
tween these are not considered by these methods. For online
water quality sensors, fluctuations can be caused either by
background effects (equipment noise and variability in
hydraulics and water demand) or by the presence of a
contaminant. Signal processing and data-driven methods have
very limited ability to differentiate between these two types
of fluctuations, which can lead to false positive alarms.
To overcome this drawback, this paper describes a new
method for real-time contamination detection using multiple
conventional water quality sensors for source water. The
proposed method aims to achieve contamination detection by
using Pearson correlation coefficients to explore the correla-
tive relationship between signals from multiple sensors and
their Euclidean distances. A Pearson correlation coefficient is
a measure of the strength and direction of the linear rela-
tionship between two variables (Mudelsee, 2003). In recent
years, it has been used for classification purposes. For example,
Monedero et al. (2011) applied Pearson correlation coefficients
to the problem of detecting fraud and other non-technical
losses in a power utility. Benesty et al. (2008) used the co-
efficients to reduce noise in speech estimation, a topic which
has attracted a considerable amount of attention over the past
few decades. However, the application of this correlation co-
efficient in the field of contamination detection has not been
explored. The method proposed in this study is tested using
data from a laboratory contaminant injection experiment
(CIE). Its detection performance is evaluated and compared
with two baseline methods.
2. Methods and materials
2.1. The proposed event detection method
The proposed event detection method, called Pearson corre-
lation Euclidean distance-based method (PE), includes three
steps: calculation of Pearson correlation coefficients, calcula-
tion of correlation indicators and calculation of Euclidean
distances.
2.1.1. Step 1: calculation of Pearson correlation coefficients
Pearson correlation coefficients for multiple sensor signals are
calculated. In a previous study, Liu et al. (2014) reported that
multiple water quality sensors could respond to a contami-
nation event simultaneously, which is defined as a correlative
response and utilized in this study for event detection. Step 1
involves quantifying the extent of correlation using Pearson
correlation coefficients, r, which are calculated as follows
$$ r_{XY} = \frac{\sum_{i=1}^{n}\left(x_i-\bar{X}\right)\left(y_i-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i-\bar{X}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(y_i-\bar{Y}\right)^2}} \tag{1} $$
in which X and Y refer to signal series from two separate water
quality sensors (e.g. pH and ORP), $x_i$ and $y_i$ are the ith values in
the signal series, and $\bar{X}$ and $\bar{Y}$ stand for their mathematical expectations.
The number of data or window size is given by n. The window
size is the number of past observations used to calculate the
Pearson correlation coefficient. For each sensor, a new
observation enters the sliding window at every time step t and
the oldest observation exits (i.e., first in first out) (Arad et al.,
2013).
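As an illustration of Step 1, the sliding-window calculation of Equation (1) could be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in this study: the function names `pearson` and `windowed_pearson` are hypothetical, and returning 0 for a signal that is constant within the window (zero variance) is an assumption made here to avoid division by zero.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series (Eq. 1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        # Assumption: a constant signal is treated as uncorrelated (r = 0).
        return 0.0
    return cov / (math.sqrt(ss_x) * math.sqrt(ss_y))

def windowed_pearson(x, y, n):
    """r_XY at every time step that has n past observations (first in, first out)."""
    return [pearson(x[t - n + 1:t + 1], y[t - n + 1:t + 1])
            for t in range(n - 1, len(x))]
```

With a window size n, the first coefficient becomes available only once n observations have been collected, so the returned list is n - 1 entries shorter than the input series.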
2.1.2. Step 2: calculation of correlation indicator
The value of rXY is between -1 and 1. If the value of rXY is close
to 0, the correlation between X and Y is deemed to be weak. In
this study, a correlation indicator CXY is used to denote whether
two vectors are closely related. The value of CXY is either 0 or 1,
which is obtained, as shown in Equation (2), by comparing rXY
with a pre-set indicator threshold C*.
$$ C_{XY} = \begin{cases} 0 & \text{if } |r_{XY}| < C^* \text{ or } X = Y \\ 1 & \text{if } C^* \le |r_{XY}| \le 1 \end{cases} \tag{2} $$
2.1.3. Step 3: calculation of Euclidean distance
For the case of s sensors, the correlation coefficient forms an
s × s matrix, as does the correlation indicator. The correlation
indicators above the diagonal are taken to construct a 1 × m
dimension vector V, which is called the correlation indicator
vector (Fig. 1). m is determined by
$$ m = \sum_{i=1}^{s-1} i \tag{3} $$
The Euclidean distance between the correlation indicator
vector and the point of origin, DPE, is calculated using
$$ D_{PE} = \sqrt{\sum_{k=1}^{m}\left[V_k - O\right]^2} \tag{4} $$
in which Vk is the kth item in the correlation indicator vector V
and O refers to the point of origin. The items in vector V are the
values CXY above the diagonal in the correlation indicator
matrix (see Fig. 1).
A contamination alarm will be triggered if
$$ D_{PE} \ge D^*_{PE} \tag{5} $$
in which D*PE is a detection threshold.
Fig. 1. Schematic graph of the correlation indicator matrix and correlation indicator vector.
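Steps 2 and 3 can be illustrated with a short sketch that turns an s × s matrix of Pearson coefficients into the correlation indicator vector, computes its distance to the origin, and applies the alarm rule. The function names are hypothetical and the matrix is assumed to be supplied as a list of lists; this is an illustration of Equations (2) to (5), not the code used in this study.

```python
import math

def indicator_vector(r, c_star):
    """Correlation indicators above the diagonal of the s x s matrix r (Eq. 2)."""
    s = len(r)
    return [1 if abs(r[i][j]) >= c_star else 0
            for i in range(s) for j in range(i + 1, s)]

def pe_distance(v):
    """Euclidean distance from the indicator vector to the origin O = 0 (Eq. 4)."""
    return math.sqrt(sum(vk ** 2 for vk in v))

def pe_alarm(r, c_star, d_star):
    """True if the contamination alarm of Eq. (5) is triggered."""
    return pe_distance(indicator_vector(r, c_star)) >= d_star
```

Because every item of V is 0 or 1, the distance equals the square root of the number of sensor pairs whose absolute correlation reaches C*, and the vector length is m = s(s - 1)/2, in agreement with Equation (3).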
2.2. Performance evaluation
The accuracy of an event detection method is assessed by its
ability to place the current state of water quality into one of
two classes: background and event. Accuracy is composed of
the ability of the method to detect the event (this is the
sensitivity of the method) and to correctly exclude the normal
operating conditions from being inadvertently classified as an
event (this is the specificity of the method). Evaluation of an
event detection method requires that the results be examined
on a common scale to assess the tradeoffs between false
positive (FP) and false negative (FN) decisions. This evaluation
should remain independent of the basis of the detection
method and the quantity of the input data. In this study, the
receiver operating characteristic (ROC) curve is adopted as a
performance indicator. This curve has been used in other
studies (McKenna et al., 2008; Arad et al., 2013).
The ROC curve defines the probability of detection (PD) that
can be obtained as a function of the corresponding false alarm
rate (FAR). FAR is the number of FPs divided by the total
number of background situations (Equation (6)). The PD is
defined as the number of true positives (TPs) divided by the
total number of actual events (Equation (7)). The area
under the ROC curve is the preferred single-valued measure of
accuracy of the technique being evaluated (Swets, 1988). The
maximum value of the area under the ROC curve is 1. The
closer the area is to 1, the better the detection method
performs.
$$ FAR = \frac{FP}{FP + TN} \tag{6} $$
$$ PD = \frac{TP}{TP + FN} \tag{7} $$
in which TN is true negative.
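Equations (6) and (7) can be illustrated with a small counting routine. The sketch below assumes that each time step carries a Boolean alarm decision and a Boolean ground-truth label, and that both classes are present in the data; the function name `far_and_pd` is hypothetical.

```python
def far_and_pd(alarms, truth):
    """Return (FAR, PD) from per-time-step alarm decisions and true labels."""
    tp = sum(1 for a, e in zip(alarms, truth) if a and e)
    fp = sum(1 for a, e in zip(alarms, truth) if a and not e)
    fn = sum(1 for a, e in zip(alarms, truth) if not a and e)
    tn = sum(1 for a, e in zip(alarms, truth) if not a and not e)
    far = fp / (fp + tn)  # Eq. (6): share of background steps that raised an alarm
    pd = tp / (tp + fn)   # Eq. (7): share of event steps that raised an alarm
    return far, pd
```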
In a practical situation, for a specific event detection
method, the steps to developing an ROC curve are as follows:
Step 1: collect water quality sensor signals for a period of
time;
Step 2: vary the threshold/parameter values in the event
detection method while recording the FARs and PDs to create
the ROC curve.
For an event detection method with one threshold/
parameter, there exists one ROC curve. For an event detection
method with multiple parameters/thresholds, multiple ROC
curves can be obtained. Among these curves, the one with
maximized area is defined as the optimal ROC curve. This can
be determined through an optimization process, which is
demonstrated in the next section. Once the optimal ROC curve
is obtained, it is necessary to select a point on the curve where
the PD is maximized and the FAR is minimized, typically the
top-left point on the curve. The event detection method can
then be implemented with the threshold values correspond-
ing to this point.
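The two steps for developing an ROC curve can be sketched as a threshold sweep. This is a minimal sketch under several assumptions that are not stated in the text: an alarm is raised when the distance is greater than or equal to the threshold, the curve is anchored at (0, 0) and (1, 1), the area is obtained with the trapezoidal rule, and the "top-left point" is taken to be the point closest to (FAR, PD) = (0, 1). All function names are hypothetical.

```python
import math

def rates(distances, truth, threshold):
    """(FAR, PD) obtained when an alarm is raised for distance >= threshold."""
    tp = sum(1 for d, e in zip(distances, truth) if d >= threshold and e)
    fp = sum(1 for d, e in zip(distances, truth) if d >= threshold and not e)
    n_event = sum(1 for e in truth if e)
    n_back = len(truth) - n_event
    return fp / n_back, tp / n_event

def roc_curve(distances, truth, thresholds):
    """Sorted, de-duplicated (FAR, PD) points for all candidate thresholds."""
    points = {rates(distances, truth, th) for th in thresholds}
    points |= {(0.0, 0.0), (1.0, 1.0)}  # assumption: anchor both ends of the curve
    return sorted(points)

def roc_area(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def top_left(points):
    """Point closest to the ideal corner (FAR = 0, PD = 1)."""
    return min(points, key=lambda p: math.hypot(p[0], 1.0 - p[1]))
```

For a detector that separates the two classes perfectly, the sweep passes through (0, 1) and the area equals 1; a detector whose output is unrelated to the labels gives points near the diagonal and an area close to 0.5.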
Besides the ROC curve area, the FAR for a PD of 0.95 is also
evaluated as a performance indicator. Sensitivity and speci-
ficity are inextricably linked. As the sensitivity is increased,
the number of false positive (FP) events also increases (i.e. the
specificity of the detection method decreases). Decreasing the
sensitivity means some events will be missed and this in-
creases the number of false negatives (FN). Evaluation of an
event detection method requires that the results be examined
on a common scale to assess the tradeoffs between FP and FN
decisions. To facilitate this, the false alarm rate (FAR) at a
given probability of detection (PD) (in this case PD = 0.95) is
used. For example, the performance of a detection method
yielding an FAR = 0.1 with a PD = 0.95 is deemed better than
that of a method yielding an FAR = 0.2 with a PD = 0.95.
2.3. Optimal parameter values
The PE method has three parameters, n (window size), C*
(indicator threshold) and D*PE (detection threshold). In this
study, the optimal values were determined using an optimi-
zation analysis. The objective function is configured to be:
Objective function ¼ max (area under the ROC curve) (8)
In the optimization analysis, a range was pre-set for each
parameter according to the problem under discussion. All
possible values of n, C* and D*PE from within these ranges
were taken to calculate FARs and PDs, which form the ROC
curves. The ROC curve with maximum area is denoted as the
optimal ROC curve. By iterating all possible values for all
parameters, the optimal ROC curve satisfying the objective
function could be obtained. The values of n and C* associated
with the optimal ROC curve are defined as the optimized
values of parameters n and C*. The optimized value of D*PE can
be identified using the top-left point on the optimal ROC
curve.
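The exhaustive search described above can be sketched as a simple grid search over n and C*. The sketch assumes a hypothetical callable `roc_area_for(n, c_star)` that runs the PE method over the dataset for the given window size and indicator threshold, sweeps D*PE, and returns the area under the resulting ROC curve; only the search loop itself is shown.

```python
from itertools import product

def grid_search(roc_area_for, n_values, c_values):
    """Return (n, C*, area) for the parameter pair with the largest ROC area."""
    best_area, best_n, best_c = -1.0, None, None
    for n, c_star in product(n_values, c_values):
        area = roc_area_for(n, c_star)  # hypothetical evaluation callable
        if area > best_area:
            best_area, best_n, best_c = area, n, c_star
    return best_n, best_c, best_area

# Candidate grids matching the initial ranges and steps listed in Table 3.
n_values = range(5, 30)                       # n in [5, 29], step 1
c_values = [k / 1000 for k in range(1001)]    # C* in [0, 1], step 0.001
```

Once the optimal pair is found, D*PE would be read from the top-left point of the corresponding ROC curve, as described in the text.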
2.4. Two baseline detection methods
2.4.1. Multivariate Euclidean distance (MED) method
The MED method considers changes in water quality by
comparing two successive distances in a multivariate space
defined by the water quality signal (McKenna et al., 2008; Klise
and McKenna, 2006). The distances compared are (1) the
Euclidean distance between the water quality measurement
at the current time step and the mean water quality value over
the previous P_E time steps and (2) the Euclidean distance
between the water quality value of the previous time step and
the mean value. An example of a multivariate space is the
eight-dimensional space defined by the values of tempera-
ture, pH, turbidity, conductivity, ORP, UV-254, nitrate, and
phosphate.
The distance measure, dMED, is the difference between the
Euclidean distances (1) and (2) in the n dimensional space and
is calculated using
$$ d_{MED} = \sqrt{\sum_{j=1}^{n}\left[Z(t)_j - Z(m)_j\right]^2} - \sqrt{\sum_{j=1}^{n}\left[Z(t-1)_j - Z(m)_j\right]^2} \tag{9} $$
in which n is the number of water quality parameters (or the
number of dimensions of the space), Z(t)j is the jth water
quality parameter in the current time step (i.e. t), Z(t-1)j is
the jth water quality parameter in the previous time step (i.e.
t-1), and Z(m)j is the mean value of the jth water quality
parameter. A constant detection threshold, d*MED, is applied to
determine if an event has occurred. The MED method iden-
tifies a contamination event when dMED is above this
threshold. Otherwise, measurements are considered
background.
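Equation (9) can be illustrated as follows. The sketch assumes that the standardized observations are stored as a list of vectors (one vector of sensor values per time step), that the mean is taken over the P_E time steps preceding the current one (excluding the current step), and that t is at least P_E; the function name `med_distance` is hypothetical.

```python
import math

def med_distance(z, t, p_e):
    """d_MED at time step t: difference of two Euclidean distances (Eq. 9)."""
    window = z[t - p_e:t]  # assumption: the previous P_E steps, current step excluded
    dims = len(z[t])
    mean = [sum(row[j] for row in window) / p_e for j in range(dims)]
    d_now = math.sqrt(sum((z[t][j] - mean[j]) ** 2 for j in range(dims)))
    d_prev = math.sqrt(sum((z[t - 1][j] - mean[j]) ** 2 for j in range(dims)))
    return d_now - d_prev
```

An event would then be reported when the returned value is above the detection threshold d*MED.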
2.4.2. Linear prediction filters (LPF) method
A linear predictor estimates the current value of a time series
as a linear combination of previous samples. This method is
also known as the autoregressive moving average (ARMA)
predictor (McKenna et al., 2008; Zetterqvist, 1991), and the
most common representation is
$$ Z^*(t) = -\sum_{i=1}^{P_{LPF}} a_i Z(P_{LPF} - i) \tag{10} $$
in which Z*(t) is the predicted water quality value, Z(PLPF-i) are
the previously observed values, ai are the predictor
coefficients, and PLPF is the order of the prediction filter
polynomial.
The error generated by this estimate is
$$ d_{LPF}(t) = Z^*(t) - Z(t) \tag{11} $$
in which Z(t) is the observed true signal value, dLPF(t) is the
difference between prediction Z*(t) and observation Z(t). Each
water quality signal is treated separately, and then all water
quality signals are combined by taking the average absolute
value of the differences across all signals. The average dif-
ference or distance is then compared to the detection
threshold d*LPF to determine whether the difference is signifi-
cant. If the difference is greater than the detection threshold,
an alarm will be sounded.
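The prediction and residual steps of the LPF method can be sketched as below. Several assumptions are made that go beyond the text: the predictor coefficients are taken as given (their estimation is not described here), the same coefficients are applied to every signal for brevity, and the previous samples are indexed relative to the current time step, i.e. the prediction uses Z(t-1), ..., Z(t-P). The function names are hypothetical.

```python
def lpf_predict(history, coeffs):
    """Predict the next value from past samples (most recent last), as in Eq. (10)."""
    # coeffs[0] multiplies the most recent sample, coeffs[1] the one before, etc.
    return -sum(a * history[-i] for i, a in enumerate(coeffs, start=1))

def lpf_distance(signals, t, coeffs):
    """Average absolute prediction error across all signals at time step t."""
    residuals = []
    for z in signals:
        predicted = lpf_predict(z[:t], coeffs)
        residuals.append(abs(predicted - z[t]))  # |d_LPF(t)| from Eq. (11)
    return sum(residuals) / len(residuals)
```

For example, the coefficient list [-1.0] corresponds to a persistence predictor, for which the predicted value equals the previous observation. An alarm would be raised when the averaged distance exceeds d*LPF.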
2.4.3. Data standardization
In the MED and LPF methods, values from or derived from
different water quality sensors are summed together. These
data are in different units. Therefore, they need to be
standardized to a common scale. In this study, standardization is
achieved by subtracting the dataset mean, μ, from each water
quality signal, X(t), and then dividing this difference by the
standard deviation, σ, of the dataset:
$$ Z(t) = \frac{X(t) - \mu}{\sigma} \tag{12} $$
Here μ refers to the mean of the dataset in the previous time
steps with a window size denoted by nml. The performances of
the MED and LPF methods are also evaluated using the ROC
curve. It should be noted that data for the PE method do not
need to be standardized because the Pearson coefficient is
dimensionless.
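Equation (12) with a trailing window can be sketched as follows. The sketch assumes that both the mean and the standard deviation are computed over the nml observations preceding time step t, that the population form of the standard deviation is used, and that a constant window maps to 0; none of these details are specified in the text, and the function name `standardize` is hypothetical.

```python
import math

def standardize(x, t, n_ml):
    """Z(t) = (X(t) - mu) / sigma using the n_ml observations before time step t."""
    window = x[t - n_ml:t]
    mu = sum(window) / n_ml
    # Assumption: population standard deviation over the same window.
    sigma = math.sqrt(sum((v - mu) ** 2 for v in window) / n_ml)
    if sigma == 0:
        return 0.0  # assumption: a constant window gives no deviation signal
    return (x[t] - mu) / sigma
```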
2.5. Datasets
In order to collect sensor signals over the course of a
contamination event, a pilot-scale CIE platform was devel-
oped in the laboratory. Eight types of sensors were utilized in
this experiment. They can measure the following 8 parame-
ters simultaneously and continuously: temperature, pH,
turbidity, conductivity, ORP, UV-254, nitrate-nitrogen and
phosphate. The background experiment was conducted for
60 min. Then cadmium nitrate at a concentration of 0.008 mg/l
was injected into the CIE platform for a period of 29 min. This
was done to simulate a chemical spill. Signals from 8 sensors
were recorded. Fig. 2 shows the sensor signals from the
experiment, in which period a (left of the red dotted line) is
background, while period b represents the time when cad-
mium nitrate is added (right of the red dotted line) (in the web
version). For each sensor, the dataset contained 89 values, in
which the first 60 values were for data without contamination
and the last 29 values were data with contamination. There
were 29 contamination events in total. For more information
about the CIE platform and the injection experiment, readers
can refer to Liu et al. (2014). Detailed data from this experi-
ment are provided in the Supplementary material.
To facilitate performance evaluation, each time step is
defined as a contamination event or a background situation.
In total, there are 29 contamination events and 60 background
situations.
3. Results
3.1. Initial application
To demonstrate how the PE method works, arbitrary param-
eter values were adopted in the initial application. The values
used were: n = 15, C* = 0.4, and D*PE = 3.0. Taking time steps 50
(background situation) and 75 (contamination event) as ex-
amples, the implementation of the PE method at these two
time steps is demonstrated here.
Using Equations (1), (2) and (4), the correlation coefficients,
correlation indicators and DPE at time steps 50 and 75 were
calculated and are shown in Tables 1 and 2. The distances
from the correlation indicator vectors to the point of origin
were 1.74 and 3.60 respectively. According to Equation (5), an
alarm was triggered at time step 75 (Note: 3.60 > 3.0), whereas
time step 50 was classified as background. The calculation
process was repeated for all time steps and from this a PD of
0.93 and an FAR of 0.20 were obtained. This means that 93% of
events were detected and 20% of background situations were
incorrectly classified as contamination events.
Fig. 2. Sensor signals from cadmium nitrate injection experiment. Abbreviations: ORP = oxidation reduction potential; Cond. = conductivity; Turb. = turbidity; Phos. = phosphate; Temp. = temperature.
3.2. Performance with optimal parameter values
The PE method has three parameters: n (window size), C*
(indicator threshold) and D*PE (detection threshold). Using the
method introduced in Section 2.3, the optimal values of pa-
rameters were obtained and are shown in Table 3. Fig. 3 dis-
plays the ROC curves obtained in the parameter optimization
process. The optimal ROC curve is highlighted in red (in the
web version). For the PE method, the area under the optimal
ROC curve shown in Fig. 3 was 0.97. The associated n and C*
values were 20 and 0.37 respectively. The top-left point on the
optimal ROC curve, point a, denotes the best possible detec-
tion capability of the PE method for the dataset under dis-
cussion. The corresponding PD and FAR values for point a are
0.97 and 0.025, which suggests that 97% of contamination
events were detected correctly and 2.5% of background situ-
ations were incorrectly grouped as contamination events.
Meanwhile, the FAR at a PD of 0.95 is 0.02 (see Table 4), which
means the PE method can detect 95% of contamination events
correctly with the compromise of wrongly identifying 2% of
background situations as contamination events. The results
show the method discriminates very well between back-
ground and contamination and that performance is improved
by adopting optimal parameter values.
3.3. Performance comparisons with the MED and LPF methods
The MED and LPF methods were applied to the same dataset
and compared to the detection performance of the proposed
PE method. Including the parameter nml used for data
standardization, the MED and LPF methods both have three
parameters (MED: nml, P_E and d*MED; LPF: nml, PLPF and d*LPF). The
same methods were adopted to evaluate the performances of
the MED and LPF methods and obtain the optimal values of
parameters (described in Sections 2.2 and 2.3).
Table 1. The calculation results at the 50th time step.
        Turb.^a     pH          Cond.       Temp.       ORP         Nitr.       UV          Phos.
Turb.   1.00        -0.09(0)^b  0.05(0)     0.45(1)     -0.06(0)    -0.04(0)    -0.28(0)    0.00(0)
pH                  1.00        -0.45(1)    -0.13(0)    -0.30(0)    -0.36(0)    -0.32(0)    0.00(0)
Cond.                           1.00        -0.05(0)    -0.16(0)    0.04(0)     0.18(0)     0.00(0)
Temp.                                       1.00        0.20(0)     -0.31(0)    0.11(0)     0.00(0)
ORP                                                     1.00        0.27(0)     0.45(1)     0.00(0)
Nitr.                                                               1.00        0.35(0)     0.00(0)
UV                                                                              1.00        0.00(0)
Phos.                                                                                       1.00
Correlation indicator vector: [0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
DPE: 1.74
a Turb. = turbidity; Cond. = conductivity; Temp. = temperature; ORP = oxidation reduction potential; Nitr. = nitrate-nitrogen; Phos. = phosphate.
b Meaning of entry -0.09(0): number in bracket = correlation indicator; number outside of bracket = correlation coefficient.
Table 2. The calculation results at the 75th time step.
        Turb.^a     pH          Cond.       Temp.       ORP         Nitr.       UV          Phos.
Turb.   1.00        -0.22(0)^b  0.03(0)     -0.75(1)    -0.36(0)    0.10(0)     -0.45(1)    0.00(0)
pH                  1.00        0.87(1)     0.77(1)     -0.09(0)    -0.94(1)    -0.07(0)    0.00(0)
Cond.                           1.00        0.53(1)     0.00(0)     -0.87(1)    -0.04(0)    0.00(0)
Temp.                                       1.00        0.20(0)     -0.63(1)    0.31(0)     0.00(0)
ORP                                                     1.00        0.07(0)     0.62(1)     0.00(0)
Nitr.                                                               1.00        0.06(0)     0.00(0)
UV                                                                              1.00        0.00(0)
Phos.                                                                                       1.00
Correlation indicator vector: [0 0 1 1 1 1 0 0 0 0 0 1 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0]
DPE: 3.60
a Turb. = turbidity; Cond. = conductivity; Temp. = temperature; ORP = oxidation reduction potential; Nitr. = nitrate-nitrogen; Phos. = phosphate.
b Meaning of entry -0.22(0): number in bracket = correlation indicator; number outside of bracket = correlation coefficient.
Table 3. The initial range and optimal values of parameters for Pearson correlation Euclidean distance-based (PE), multivariate Euclidean distance (MED), and linear prediction filters (LPF) methods.
Detection method   Parameter   Initial range   Step    Optimal value (dataset from lab)
PE                 n           [5 29]          1       20
                   C*          [0 1]           0.001   0.37
                   D*PE        [1 6]           0.01    3.3
MED                nml         [5 30]          1       9
                   P_E         [1 29]          1       24
                   d*MED       [0 2]           0.005   0.72
LPF                nml         [5 50]          1       34
                   PLPF        [1 20]          1       5
                   d*LPF       [0 1]           0.1     0.5
By using the optimal parameter values shown in Table 3,
the best performance of the MED and LPF methods was ob-
tained and is presented in Table 4. As shown in Table 4, the
MED method achieved an area of 0.57 under the optimal ROC
curve and the LPF method yielded an area of 0.37, which were
both smaller than the area obtained using the PE method (i.e.,
0.97). The top-left points on the optimal ROC curves (points b
and c in Fig. 3) gave PD = 0.52, FAR = 0.22 and PD = 0.38,
FAR = 0.54 for the MED and LPF methods respectively. The
top-left point on the optimal ROC represents the best possible
performance the detection method can achieve. It is obvious
that FARs for the MED and LPF methods are greater than the
one for the PE method and the PDs are smaller. This means the
MED and LPF methods correctly detected fewer contamination
events and incorrectly reported more background situations
as contamination events. Meanwhile, the FARs at a PD of 0.95
were 0.95 for the MED method and 0.94 for the LPF method.
This indicates that, for the MED and LPF methods, the down-
side of detecting 95% of contamination events is that nearly
95% of background situations are wrongly classified. From this
comparison, it was concluded that the PE method has better
potential to detect contamination events with a lower false
alarm rate. Thus it is more promising for use in contamination
event detection in an early warning system.
3.4. Discussion
To further investigate how the three methods perform at each
time step, their detection results are displayed in Fig. 4, in
which graphs (a), (b) and (c) show the calculated distance and
detection results of the PE, LPF and MED methods respectively.
Graph (d) indicates the actual background (blue) and
contamination (red) (in the web version). The red dashed lines
are the alarm thresholds for each method. The solid red blocks
indicate that a contamination alarm has been triggered, while
the solid blue blocks mean no alarm (background) (in the web
version). The empty red blocks in graph (d) show the actual
contamination events, while the empty blue blocks denote
background situations. The remaining blank parts in the
graphs (a), (b) and (c) indicate that the methods have not
started to produce detection results due to parameter settings.
For example, the optimal value of n in the PE method is 20. The
first detection result was at the 21st minute.
Fig. 3. Optimization results of the Pearson correlation Euclidean distance-based (PE), multivariate Euclidean distance (MED), and linear prediction filters (LPF) methods.
As shown in Fig. 2, the fluctuations in the first 60 min were
mainly caused by equipment noise or variability in hydraulics,
while the ones in the last 29 min were mainly due to the
presence of contamination. The PE method only yielded one
DPE greater than 3.1 (the detection threshold D*PE) in the first
60 min, which was DPE = 3.11 and which occurred at the 31st
minute (Graph a in Fig. 4). The PE method only incorrectly
classified one background situation as a contamination event
(at the 31st minute). Meanwhile, in the last 29 min, it yielded
one DPE smaller than 3.1, which was DPE = 3.09 at the 70th
minute. All the others were greater than the threshold value
and were thus correctly identified as contamination. This
suggests that the PE method has the ability to differentiate the
presence of contamination from other causes of fluctuation.
In contrast, the LPF and MED methods incorrectly reported
a large number of background situations as contamination
events and missed many real events. Graph (b) reveals
that the LPF method rarely triggered alarms after the 67th
minute. Most alarms appeared between the 54th and the
67th minute, just before and after the injection of cadmium
nitrate. As shown in Fig. 2, measurements taken by the UV,
ORP and conductivity sensors fluctuated somewhat between
the 54th and the 60th minute (moment of injection), which was
mainly due to equipment noise. The fluctuations after the 60th
minute were mainly caused by the injection of cadmium
nitrate. As introduced in Section 2.4, the LPF method triggers an
alarm by evaluating the difference between the actual sensor
signals and the predicted values. In other words, if the
prediction for a time step differs significantly from the
observation, or the signal is unpredictable, an event is
declared. This explains why alarms were triggered between
the 54th and the 67th minutes.

Table 4 – The performance of three detection methods (dataset from laboratory).a

Method   Area under ROC   PD (top-left)   FAR (top-left)   FAR at a PD of 0.95
PE       0.97             0.97            0.025            0.02
MED      0.57             0.52            0.22             0.95
LPF      0.37             0.38            0.48             0.94

a PE = Pearson correlation Euclidean distance-based; MED = multivariate Euclidean distance; LPF = linear prediction filters; ROC = receiver operating characteristic; PD = probability of detection; FAR = false alarm rate.
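The LPF alarm rule described above can be illustrated with a minimal sketch. It is not the implementation from Section 2.4: the filter coefficients, the sample readings and the function names are all assumptions made for illustration, and only the threshold of 0.5 is taken from the text. The point is that the rule responds to any large prediction error, whatever its cause.

```python
def linear_prediction(history, coeffs):
    """Predict the next value as a weighted sum of the most recent values.

    coeffs[0] weights the latest value, coeffs[1] the one before, and so on.
    """
    recent = history[-len(coeffs):][::-1]
    return sum(c * x for c, x in zip(coeffs, recent))

def lpf_alarm(histories, observations, coeffs, threshold):
    """Average the absolute prediction errors over sensors and compare with the threshold."""
    errors = [abs(obs - linear_prediction(h, coeffs))
              for h, obs in zip(histories, observations)]
    mean_error = sum(errors) / len(errors)
    return mean_error, mean_error > threshold

# Made-up normalized readings for two sensors (not data from the experiment).
histories = [[0.0, 0.1, 0.0, 0.1], [1.0, 1.0, 1.1, 1.0]]
coeffs = [0.5, 0.5]  # assumed filter: mean of the last two values
quiet = lpf_alarm(histories, [0.05, 1.05], coeffs, threshold=0.5)
jump = lpf_alarm(histories, [1.5, 2.5], coeffs, threshold=0.5)
print(quiet[1], jump[1])  # False True
```

A sudden jump raises an alarm regardless of whether it comes from noise or from contamination, which is consistent with the alarms observed both before and after the injection.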
For the MED method, as shown in graph (c) of Fig. 4, the
alarms were spread relatively evenly over the whole
experiment period. The MED method identifies a contamination
event when the distance between the current water quality
value and the mean value is significantly different from the
distance between the water quality value of the previous time
step and the mean value. The times of the alarms were
consistent with the times of the fluctuations. This suggests that
the MED method is useful for detecting fluctuations. However, it
might miss situations where the variation at the current time
step is similar to the average variation over previous time
steps. Furthermore, as shown in Fig. 3, the ROC curves for the
MED and LPF methods follow a roughly diagonal trend, which
suggests that the performances of the MED and LPF methods
are very poor. This is mainly because these two methods
trigger an alarm by detecting the existence of fluctuations.
When a fluctuation is large enough, an alarm is triggered.
These two methods cannot further examine the cause of a
fluctuation. For this reason, they tend to mistakenly
identify background as contamination, which leads to a high
false positive rate. The method proposed in this study uses
correlation coefficients, which help to differentiate the
presence of contamination from other causes of fluctuation
and thus reduce false alarms.
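To make the MED behaviour concrete, the following sketch computes the Euclidean distance between the current multi-sensor vector and the mean vector of a preceding window, and compares it with the threshold of 0.72 quoted in the text. The window contents, the normalization and the function name `med_distance` are assumptions for illustration only; the paper's own formulation is the authoritative one.

```python
import math

def med_distance(window, current):
    """Euclidean distance between the current vector and the window mean vector."""
    n_sensors = len(current)
    means = [sum(row[j] for row in window) / len(window) for j in range(n_sensors)]
    return math.sqrt(sum((c - m) ** 2 for c, m in zip(current, means)))

# Made-up normalized readings for three sensors over a short window.
window = [[0.0, 0.0, 0.0], [0.2, 0.0, 0.0], [0.1, 0.0, 0.0]]
noise_spike = [0.1, 0.8, 0.0]  # a single sensor jumps on its own
d = med_distance(window, noise_spike)
print(round(d, 2), d > 0.72)  # 0.8 True
```

In this toy case a spike on one sensor alone is enough to exceed the threshold, which illustrates why a distance-only rule cannot tell isolated noise from a correlated response to contamination.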
Both the MED and LPF methods can detect situations that
result in sudden changes in signals. However, these two
methods cannot differentiate between fluctuations caused by
equipment noise and those caused by contamination. As shown
in graphs (b) and (c), many false alarms
were triggered before the 60th minute (i.e. for time steps
corresponding to background). Taking the 56th minute as an
example, the ORP, conductivity, turbidity and UV all showed
significant fluctuations at this time step. The Euclidean
distance between the vector at the 56th minute and the vector
made up of mean values of the previous 24 time steps
(parameter PE) was calculated to be 0.79 using the MED
method, which is greater than 0.72 (the detection threshold
d*MED). Therefore, a contamination event alarm was declared at
this time step. For the LPF method, the average difference
between the predictions and observations was calculated to
be 1.2, which is greater than 0.5 (the detection threshold d*LPF).
The LPF method also incorrectly reported an alarm at this time step.
For the PE method, although fluctuations appeared on several
sensors at this time step, as shown in Table 5, their correlation
coefficients were rather small, which suggests that these
fluctuations were independent and mainly caused by equipment
noise. Using Equation (4), the Euclidean distance between the
correlation indicator vector and the point of origin, DPE, was
calculated as 1.4, which is lower than 3.1 (the detection
threshold D*PE). Therefore, no alarm was triggered by the PE
method at this time step.
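The value of DPE at the 56th time step can be checked directly from the correlation indicator vector listed in Table 5. The sketch below assumes that, for this binary vector, Equation (4) reduces to the plain Euclidean norm, as the description "distance between the correlation indicator vector and the point of origin" suggests; the variable names are illustrative.

```python
import math

# Correlation indicator vector at the 56th time step, copied from Table 5.
indicators = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
THRESHOLD_PE = 3.1  # detection threshold D*PE reported in the text

d_pe = math.sqrt(sum(v ** 2 for v in indicators))  # distance from the origin
print(f"D_PE = {d_pe:.1f}, alarm = {d_pe > THRESHOLD_PE}")  # D_PE = 1.4, alarm = False
```

With only two of the 28 sensor pairs flagged as correlated, the distance is the square root of 2, about 1.4, well below the threshold.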
An advantage of online water quality sensors is that they
can provide water quality information quickly. However,
online water quality sensors have also been
criticized for their low stability. A common issue for online
sensors is that their signals always contain device noise. The
MED and LPF methods rely mainly on pure mathematical
data analysis. They depend either on comparison with
previous time steps or on comparison between the predicted
and observed values for the same time step. The
characteristics of the signals and the connections between
these signals are not taken into consideration. Fluctuations
caused by equipment noise and those caused by the presence of
contaminants are viewed as identical by the MED and LPF methods.
Both methods classified the significant
fluctuations as events. The analysis here shows that
equipment noise can cause the MED and LPF methods to give
erroneous results and trigger false alarms. The
proposed PE method conducts data analysis by taking the
physical meaning of signals into consideration. By employing
correlation coefficients, the PE method distinguishes between
these two types of variations and overcomes this
drawback. The FAR at a PD of 0.95 is 0.02 for the PE method,
which is significantly lower than the FAR for MED (0.95) and
LPF (0.94).
Meanwhile, the PE method also has the advantage of detecting
events when the fluctuation is small. For example, at the
73rd time step, as shown in Fig. 2, the signal fluctuations are
minimal. The MED method yielded dMED = 0.01, which is
lower than 0.72 (the detection threshold d*MED). The LPF
method obtained dLPF = 0.31, which is smaller than 0.50 (the
detection threshold d*LPF). Therefore, neither the MED nor the
LPF method reported contamination. The PE method caught
these slight variations and yielded DPE = 3.7, which is
greater than 3.1 (the detection threshold D*PE). This indicates
that the signal characteristics (correlative responses) can
help in judging the presence of a contaminant, especially
when variations in sensor readings are not significant.
By jointly employing the characteristics of sensor
readings and data processing techniques, the PE method
demonstrates a better detection performance than the MED
and LPF methods. Table 4 shows clearly that for both
performance indicators (the area under the ROC curve and the
FAR at a PD of 0.95) the PE method has an advantage over the
MED and LPF methods.
For any detection method, the aim is to maximize the
probability of detection and minimize the false alarm rate. In
this study, the authors tried to achieve this aim by solving a
single-objective optimization problem in which maximizing the
area under the ROC curve is the objective. The points on the ROC
curve represent the PDs and FARs the detection method can
achieve for all possible threshold values (parameter values in
this study). Each point on the ROC curve represents a trade-off
between FAR and PD. As discussed by Flach (2010), when
selecting a solution point from an ROC curve, there exists the
possibility that non-optimal solutions (dominated points)
might be selected. In this study, this was avoided by only
choosing the top-left point on the ROC curve for discussion.

Fig. 4 – The detection results at each time step for the Pearson correlation Euclidean distance-based (PE), multivariate Euclidean distance (MED), and linear prediction filters (LPF) methods. Red dashed line: alarm threshold; solid red block: contamination alarm; solid blue block: no alarm (background); empty red block: contamination events; empty blue block: background situations. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 5 – The correlation coefficients and correlation indicators at the 56th time step.

        Turb.a     pH         Cond.      Temp.      ORP        Nitr.      UV         Phos.
Turb.   1.00       -0.10(0)b  0.50(1)    0.28(0)    0.16(0)    -0.28(0)   -0.21(0)   0.00(0)
pH                 1.00       -0.00(0)   -0.26(0)   -0.24(0)   -0.23(0)   -0.22(0)   0.00(0)
Cond.                         1.00       -0.09(0)   -0.19(0)   0.15(0)    -0.15(0)   0.00(0)
Temp.                                    1.00       0.12(0)    -0.13(0)   0.33(0)    0.00(0)
ORP                                                 1.00       0.13(0)    0.17(0)    0.00(0)
Nitr.                                                          1.00       0.53(1)    0.00(0)
UV                                                                        1.00       0.00(0)
Phos.                                                                                1.00

Correlation indicator vector: [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
DPE: 1.4

a Turb. = turbidity; Cond. = conductivity; Temp. = temperature; ORP = oxidation reduction potential; Nitr. = nitrate-nitrogen; Phos. = phosphate.
b Meaning of entry -0.10(0): the number in brackets is the correlation indicator; the number outside the brackets is the correlation coefficient.
The top-left point needed to satisfy two conditions: 1) it
should be a non-dominated point; 2) it should have the shortest
distance to the top-left corner of the x–y axes. The need for
the first condition can be avoided if the problem is formulated
as a two-objective optimization problem in which a Pareto
front with only non-dominated points is obtained (Reed
et al., 2013; Zhu et al., 2014; Zhang et al., 2012). Therefore, a
full-scale two-objective optimization should be conducted in a
future study.
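The two conditions for the top-left point can be sketched as follows. This is an illustrative reading, not the authors' procedure: the ROC points are made up (only the pair 0.025, 0.97 echoes a value from Table 4), and the function names are hypothetical. A point (FAR, PD) is treated as dominated if another point has a FAR no higher and a PD no lower, with at least one of the two strictly better.

```python
import math

def non_dominated(points):
    """Keep the (FAR, PD) points that no other point dominates."""
    keep = []
    for far, pd in points:
        dominated = any(
            (f <= far and p >= pd) and (f < far or p > pd)
            for f, p in points
        )
        if not dominated:
            keep.append((far, pd))
    return keep

def top_left(points):
    """Non-dominated point closest to the ideal corner (FAR = 0, PD = 1)."""
    return min(non_dominated(points),
               key=lambda fp: math.hypot(fp[0], 1.0 - fp[1]))

# Made-up ROC points given as (FAR, PD) pairs.
roc = [(0.0, 0.60), (0.025, 0.97), (0.10, 0.90), (0.30, 0.98), (1.0, 1.0)]
print(non_dominated(roc))
print(top_left(roc))
```

In this toy example the point (0.10, 0.90) is dropped because (0.025, 0.97) is better on both axes, and the remaining point nearest to the corner is selected. A two-objective formulation would return the non-dominated set directly, which is why the first condition would no longer need a separate check.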
3.5. Future study
The proposed method utilizes the signals from online water
quality sensors to detect the presence of contamination. In
this process, the signals might contain uncertainties. The
detection performance of the proposed method could be
affected by these uncertainties. The robustness of the pro-
posed method when faced with uncertainties should be
evaluated in the future. The performance of the proposed
method should also be evaluated using actual field contam-
ination data. Although the proposed method demonstrated
great improvement in reducing false positive errors, a 2%
false positive rate is still significant. More studies should be
conducted in the future to further reduce the false positive
rate.
4. Conclusions
1. This paper proposed a new contamination event detection
method, which jointly employs the characteristics of
sensor readings and data processing techniques. Using a
dataset from a contaminant injection experiment and
optimized parameter values, the proposed method can
detect 95% of contamination events correctly with a 2%
false alarm rate.
2. Fluctuations in signals could be caused by equipment
noise or by the presence of a contaminant. A drawback of
baseline detection methods is that they cannot
differentiate well between these two types of fluctuations.
Results from this study show that the proposed method
has an obvious advantage over the MED and LPF methods
in this area. The proposed method only incorrectly
classified 2% of background situations as contamination
events, while the MED and LPF methods incorrectly
classified 22% and 38% of background situations
respectively.
3. The proposed method also demonstrated some strength in
detecting events that were only represented by slight
variations in water quality sensor signals. This means it
achieves a higher true positive alarm rate than the MED and
LPF methods.
Acknowledgements
This work is financially supported by the National Nature and
Science Foundation and Water Major Program (2012ZX07408-
002).
Appendix A. Supplementary data
Supplementary data related to this article can be found at
http://dx.doi.org/10.1016/j.watres.2015.05.013.
References
Allgeier, S., Murray, R., Mckenna, S., Shalvi, D., 2005. Overview of Event Detection Systems for WaterSentinel. Environmental Protection Agency, Washington, DC.
Arad, J., Housh, M., Perelman, L., Ostfeld, A., 2013. A dynamic thresholds scheme for contaminant event detection in water distribution systems. Water Res. 47 (5), 1899–1908.
Benesty, J., Chen, J., Huang, Y., 2008. On the importance of the Pearson correlation coefficient in noise reduction. IEEE Trans. Audio Speech Lang. Process. 16 (4), 757–765.
Flach, P., 2010. ROC analysis. In: Sammut, C., Webb, G.I. (Eds.), Encyclopedia of Machine Learning. Springer, pp. 869–875.
Hall, J., Zaffiro, A.D., Marx, R.B., Kefauver, P.C., Krishnan, E.R., Herrmann, J.G., 2007. On-line water quality parameters as indicators of distribution system contamination. J. Am. Water Works Assoc. 99 (1), 66–77.
Hart, D., McKenna, S.A., Klise, K., Cruz, V., Wilson, M., 2007. CANARY: a Water Quality Event Detection Algorithm Development Tool (Reston, VA).
Hasan, J., States, S., Deininger, R., 2004. Safeguarding the security of public water supplies using early warning systems: a brief review. J. Contemp. Water Res. Educ. 129 (1), 27–33.
Klise, K.A., McKenna, S.A., 2006. Water quality change detection: multivariate algorithms. In: Proc. SPIE, Optics and Photonics in Global Homeland Security II 6203, pp. 62030J.1–62030J.9.
Kroll, D., 2006. Securing Our Water Supply: Protecting a Vulnerable Resource. Pennwell, Oklahoma, USA.
Kroll, D., King, K., 2006a. EPA Verification Testing and Real World Deployment of a Heuristic System for Water Security Monitoring. Abstracts of Papers of the American Chemical Society, p. 231.
Kroll, D., King, K., 2006b. The utilization on-line of common parameter monitoring as a surveillance tool for enhancing water security. Water Contam. Emergencies – Enhancing Our Response (302), 89–98.
Liu, S., Che, H., Smith, K., Chen, L., 2014. Contamination event detection using multiple types of conventional water quality sensors in source water. Environ. Sci. Process. Impacts 16 (8), 2028–2038.
Liu, S., Che, H., Smith, K., Chang, T., 2015a. A real time method of contaminant classification using conventional water quality sensors. J. Environ. Manag. 154, 13–21.
Liu, S., Che, H., Smith, K., Chang, T., 2015b. Contaminant classification using cosine distance based on multiple conventional sensors. Environ. Sci. Process. Impacts 17, 343–350.
McKenna, S.A., Wilson, M., Klise, K.A., 2008. Detecting changes in water quality data. J. Am. Water Works Assoc. 100 (1), 74–85.
Monedero, I., Biscarri, F., Leon, C., Guerrero, J.I., Biscarri, J., Millan, R., 2011. Detection of frauds and other non-technical losses in a power utility using Pearson coefficient, Bayesian networks and decision trees. Int. J. Electr. Power Energy Syst. 34 (1), 90–98.
Mudelsee, M., 2003. Estimating Pearson's correlation coefficient with bootstrap confidence interval from serially dependent time series. Math. Geol. 35 (6), 651–665.
Perelman, L., Arad, J., Housh, M., Ostfeld, A., 2012. Event detection in water distribution systems from multivariate water quality time series. Environ. Sci. Technol. 46 (15), 8212–8219.
Raciti, M., Cucurull, J., Nadjm-Tehrani, S., 2012. Critical Infrastructure Protection. Springer, pp. 98–119.
Reed, P., Hadka, D., Herman, J.D., Kasprzyk, J.R., Kollat, J.B., 2013. Evolutionary multiobjective optimization in water resources: the past, present, and future. Adv. Water Resour. 51, 438–456.
Swets, J.A., 1988. Measuring the accuracy of diagnostic systems. Science 240 (4857), 1285–1293.
Wang, C., Feng, Y., Zhao, S., Li, B.-L., 2012. A dynamic contaminant fate model of organic compound: a case study of nitrobenzene pollution in Songhua River, China. Chemosphere 88 (1), 69–76.
Yang, J., Bi, J., Zhang, H.-y., Li, F.-y., Zhou, J.-b., Liu, B.-b., 2010. Evolvement of the relationship between environmental pollution accident and economic growth in China. China Environ. Sci. 30 (4), 571–576.
Yang, Y.J., Haught, R.C., Goodrich, J.A., 2009. Real-time contaminant detection and classification in a drinking water pipe using conventional water quality sensors: techniques and experimental results. J. Environ. Manag. 90 (8), 2494–2506.
Zetterqvist, L., 1991. Statistical estimation and interpretation of trends in water quality time series. Water Resour. Res. 27 (7), 1637–1648.
Zhang, C., Wang, G., Peng, Y., Tang, G., Liang, G., 2012. A negotiation-based multi-objective, multi-party decision-making model for inter-basin water transfer scheme optimization. Water Resour. Manag. 26 (14), 4029–4038.
Zhu, X., Zhang, C., Yin, J., Zhou, H., Jiang, Y., 2014. Optimization of water diversion based on reservoir operating rules – a case study of the Biliu River reservoir, China. J. Hydrologic Eng. 19 (2), 411–421.