

Anomaly Detection in Sensor Systems Using Lightweight Machine Learning

H.H.W.J. Bosman∗†, A. Liotta∗
∗Department of Electrical Engineering, Eindhoven University of Technology
P.O. Box 513, 5600MB, Eindhoven, The Netherlands
{h.h.w.j.bosman, a.liotta} @ tue.nl

G. Iacca†, H.J. Wörtche†
†INCAS3
P.O. Box 797, 9400AT, Assen, The Netherlands
{heddebosman, giovanniiacca, heinrichwoertche} @ incas3.eu

Abstract—The maturing field of Wireless Sensor Networks (WSN) results in long-lived deployments that produce large amounts of sensor data. While capabilities of WSN motes increase, their resources, primarily energy, are still limited. Lightweight online on-mote processing may improve energy consumption by selecting only unexpected sensor data (anomalies) for transmission, which is commonly more energy consuming. We detect anomalies by analyzing sensor reading predictions from a linear model. We use Recursive Least Squares (RLS) to estimate the model parameters, because for large datasets the standard Linear Least Squares Estimation (LLSE) is not resource friendly.

We evaluate the use of fixed-point RLS with adaptive thresholding, and its application to anomaly detection in embedded systems. We present an extensive experimental campaign on generated and real-world datasets, with floating-point RLS, LLSE, and a rule-based method as benchmarks. The methods are evaluated on the prediction accuracy of the models, and on the detection of anomalies, which are injected in the generated dataset.

The experimental results show that the proposed algorithm is comparable, in terms of prediction accuracy and detection performance, to the other LS methods. However, fixed-point RLS is efficiently implementable in embedded devices. The presented method enables online on-mote anomaly detection with results comparable to offline LS methods.

Index Terms—Anomaly detection, Embedded Systems, Recursive Least Squares, Adaptive Systems

I. INTRODUCTION

This paper evaluates the use of Recursive Least Squares (RLS) in combination with adaptive thresholding for online anomaly detection on embedded systems. RLS is a lightweight means to perform online parameter estimation in a Wireless Sensor Network (WSN). The prediction error of a linear model is used for online anomaly detection. This method can be used for anomaly detection within one WSN mote (an embedded device with sensors, microprocessor and networking capabilities), or within its neighborhood. However, the main focus of this work is evaluating the performance of embedded RLS combined with adaptive thresholding for anomaly detection in one WSN mote.

The WSN community is maturing and its focus is shifting to applications and deployments, where resource-constrained embedded devices are tightly coupled to the environment. With this shift, the amount of data is increasing due to the longer lifetime of deployments and increasing sensor possibilities. Anomaly detection promises not only to pinpoint data of interest (e.g. faulty data) within a vast amount of information offline, but also to optimize resource usage when implemented online.

Figure 1. Injected anomalies in noisy linear data: (a) Spike, (b) Noise, (c) Constant, (d) Drift.

There is increasing interest in creating models of the environment, either mathematical or statistical, such that when deviations occur, so-called anomalies, the user is notified. Anomalies may be caused by system faults (hardware or software), but also by other interesting phenomena in the environment. System faults require maintenance by the system developers, whereas environmental anomalies are mainly of interest to the users. These models are most commonly created and evaluated offline, in a central data processing station. This requires the WSN to send all data, including the irrelevant data, to a central point, thus causing an inefficient use of resources such as communication and energy supply. Using a lightweight anomaly detection method to send only data of interest will, therefore, increase battery life.

In this work we focus on sensor data anomalies resulting from the sensor system itself and the environment. We aim to keep the algorithm generic and application agnostic, and therefore use a common categorization of anomalies into the types spike, constant, noise and drift, as identified in the literature [1], [2] and [3] (Fig. 1). Spikes are short-duration peaks in the data that may be caused, for instance, by bit-flips or by malfunctioning hardware connections. Noise is an unexpected increase in variance, detected through historical data, and may, for example, be caused by a depleted battery. Constant anomalies appear when changes and noise in the original signal do not lead to any changes in the measured value. Such an anomaly may, for instance, be caused by a sensor getting stuck (physically or in software) or by a loose electrical connection. Drift is an offset of the measured value relative to the original signal. This offset may be constant or changing, for instance due to slow degradation of the sensor, a change in calibration, a change in the environment, or an unmodeled (cyclic) process.

In our search for lightweight anomaly-detection approaches we evaluate the effectiveness of combining RLS (as described in [4]) with adaptive thresholding in the context of resource-constrained embedded devices. The inputs of RLS may come from different sensors in the same node, but could also be a set of historical values of inputs and outputs, or values from neighboring nodes. Anomalies can then be detected by analyzing the prediction error for individual sensors, where, typically, an error larger than average is classified as anomalous.

The following section gives an overview of related work, after which the method is explained. Section IV shows that the RLS estimation is similar in performance to the benchmark methods, both in terms of average prediction error and in detecting anomalies. We elaborate on this conclusion and on future possibilities in Section V.

II. RELATED WORK

The method of Least Squares was described by Gauss and Legendre in the early 19th century to estimate the orbits of comets [5]. Since then, it has been used in fields such as finance [6] and machine learning [7] to estimate the parameters of models. While non-linear models approximate the parameters iteratively, linear least squares estimation (LLSE) models have closed-form solutions. For these solutions, however, all data has to be known.

RLS is a variant of LLSE which recursively finds parameters that minimize the least squares error. Without the need to store all previous data, but at the cost of higher computational complexity, this method is suitable for online and real-time applications, such as tracking objects with radar [8], adaptive filtering and control [9], system identification [10], localization and mapping [11] or even reinforcement learning [12]. None of these methods, however, focus on anomaly detection. Moreover, here we explore this application in the context of highly constrained devices.

Anomaly detection methods (also known as fault detection or outlier detection) are available for centralized systems [13]. But, since we target embedded systems such as WSN motes, we are faced with severe limitations and design requirements. While RLS is suitable for low-memory systems (see Section IV), the main limitation of many mote families currently used in WSN is the lack of a floating-point math unit. Fixed-point math is necessary instead, with some drawbacks in terms of overflow and underflow in multiplication and division operations, which may eventually result in instability.

In the field of WSN, RLS methods are not widely applied. However, there are signal tracking methods, based on distributed versions of RLS, that do not require central processing to form global parameter estimates. These can be found, for instance, in Consensus-based RLS [14] and Diffusion RLS [15] methods. The global estimates are achieved by first letting each node send local measurements and regressors to its neighbors. In a second step, the measurements and regressors of the neighbors are aggregated to form the new local regressor estimates, which are eventually sent to the neighbors again. While these methods use an embedded version of RLS, they focus on estimating parameters for a global model, and not on local anomaly detection, which is our case.

Figure 2. Effects of stabilization with α = 1.1: (a) Stabilized, (b) Unstabilized.

RLS estimations can also be used for anomaly detection in a node-local or neighborhood environment. Sensors in one node, or within neighboring nodes, that have a linear relation with each other can be used in a linear model to predict, for instance, sensor readings. When these predictions deviate excessively from the real values, the measurement can be regarded as anomalous. For instance, Ahmed et al. [16] use RLS-based anomaly detection in network systems to detect security threats. They construct and adapt a dictionary that approximates normal features, with which a kernel feature space is formed. This method is implemented in networking hardware which, in general, has considerably more resources at its disposal compared to WSN motes. We do not focus on network traffic, although such data may be similar to sensor data.

Recently, online anomaly detection in WSN has been gaining interest due to the increasing computational power that comes with the advance of technology, and to the large volumes of sensor data being produced. An example of anomaly detection in WSN is the piecewise linear regression technique by Yao et al. [3]. In their work, the data is represented, and compressed, as a sequence of linear pieces that are fitted with LLSE. When a new data point deviates too much from the existing piece, a new piece is created. The authors assume a 24-hour cycle in the data and therefore compare the slopes of the pieces with pieces from 24 hours before. Excessive deviation of the slopes then indicates an anomaly. This method, however, does not take into account the correlations between different sensors.

Real-world deployments have proven to contain sensor faults of common types which can and should be detected [2]. Anomaly detection techniques for WSN are surveyed in [17], [18] and [19]. These surveys pinpoint as open issues the realization of online decentralized anomaly detection, as well as usability. This work moves towards online decentralized anomaly detection by providing an experiment-based assessment of a lightweight online learning and prediction model for a single WSN mote.

III. METHOD

A. Recursive Least Squares

The RLS method, explained in [4], is a recursive method that in each time step n estimates parameters θn,i for each sensor Xi in a linear model y = θX. In order to generate these parameters we need an estimate of the inverse auto-correlation matrix Pn. Algorithm 1 shows how these are updated with every new data sample. Furthermore, we assume the data to 1) have some linear correlation and 2) have Gaussian noise. Later in this section we show how to use the prediction error for anomaly detection.

Algorithm 1 Recursive Least Squares
1: Init δ > 1, α > 1, θ0,i = {0}, P0 = δI
2: for each sample step n do
3:   εn ← yn − X⊤n θn−1
4:   Kn ← Pn−1 Xn
5:   µn ← X⊤n Kn
6:   βn ← 1/(α + µn)
7:   κn ← βn Kn
8:   θn ← θn−1 + κn εn
9:   Pn ← (1/α) [Pn−1 − βn Kn K⊤n]

In the above Algorithm 1, δ is the initialization value for the diagonals of Pn, α is the forgetting factor, i.e. how fast the algorithm adapts, and θ0,i are the initial parameter estimates. The inputs are the vector X and the estimation target y. Furthermore, εn is the prediction error (residual) of the previous iteration's model on the current sample, βn is the inverse forgetting factor, i.e. the learning rate, and Kn, µn and κn are temporary values which determine the update gain and direction of Pn and θ. For more details please refer to [20].
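The update loop above can be sketched in a few lines of floating-point Python. This is an illustrative reimplementation of Algorithm 1, not the authors' embedded code; the example target function and all constants are ours.

```python
import numpy as np

def rls_step(theta, P, x, y, alpha=1.01):
    """One update of Algorithm 1 (floating point).
    theta: parameter estimate, P: inverse auto-correlation estimate,
    x: input vector, y: target sample, alpha: forgetting factor."""
    eps = y - x @ theta                        # line 3: prediction error
    K = P @ x                                  # line 4
    mu = x @ K                                 # line 5
    beta = 1.0 / (alpha + mu)                  # line 6
    kappa = beta * K                           # line 7
    theta = theta + kappa * eps                # line 8
    P = (P - beta * np.outer(K, K)) / alpha    # line 9
    return theta, P, eps

# Recover y = 2*x1 - 1*x2 + 0.5; the constant third input acts as intercept.
rng = np.random.default_rng(0)
theta = np.zeros(3)
P = np.eye(3) * 100.0                          # delta = 100
for _ in range(2000):
    x = np.array([rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0])
    y = 2.0 * x[0] - 1.0 * x[1] + 0.5 + rng.normal(0, 0.01)
    theta, P, eps = rls_step(theta, P, x, y)
print(np.round(theta, 2))
```

Note that with α > 1 the gain decays faster, so the estimate adapts more slowly, matching the paper's use of α as a "resistance" to short-term influences.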

B. Stability

For stable and accurate estimation, the RLS algorithm requires infinite-precision values. Embedded systems, however, can only deal with limited-precision, fixed-point variables. Therefore, Bottomley [4] has analyzed the implications of quantizing the above computations. This analysis resulted in recursive estimates of the total quantization error, for which Bottomley suggests correction methods.

In our experiments with the fixed-point implementation we indeed encountered stability issues, mostly when α is close to 1, as shown in Fig. 2. We obtained stable filters by implementing a subset of the correction methods in [4], including some necessary modifications detailed below. Most notable is the effect of truncating instead of rounding computation results.

Furthermore, biasing the inverse auto-correlation matrix Pn (line 9 in Algorithm 1) to be positive definite aids in stabilizing the filter. Contrary to what is suggested in [4], we bias not only the nonzero diagonal elements of Pn but also the zero elements, in order to prevent underflow of the fixed-point values. Underflows in Pn could stop the parameter updates, as can be derived from lines 4 and 9 in Algorithm 1.

Together with the two suggestions from [4] above, this added change ensures that the diagonals of Pn are never zero and thus that the estimate keeps being updated. Moreover, we assume the inputs to be no greater than 16-bit values. This allows the 32-bit fixed-point representation (Q16.16¹) to keep 8 bits of space for overflows, and 8 bits of space for underflows.
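The difference between truncating and rounding a Q16.16 multiply, mentioned above as the most notable stabilization factor, can be demonstrated with integer arithmetic. The helper names below are ours, for illustration only:

```python
# Q16.16 fixed point: a 32-bit integer with 16 integer and 16 fractional bits.
FRAC_BITS = 16
ONE = 1 << FRAC_BITS

def to_q(x: float) -> int:
    return int(round(x * ONE))

def from_q(q: int) -> float:
    return q / ONE

def mul_trunc(a: int, b: int) -> int:
    # Truncating multiply: drop the low fractional bits of the product.
    return (a * b) >> FRAC_BITS

def mul_round(a: int, b: int) -> int:
    # Rounding multiply: add half an LSB before shifting, which removes
    # the systematic downward bias that truncation introduces.
    return (a * b + (1 << (FRAC_BITS - 1))) >> FRAC_BITS

a, b = to_q(0.7), to_q(0.2)
print(from_q(mul_trunc(a, b)), from_q(mul_round(a, b)))  # truncation is one LSB low here
```

In a recursion such as the Pn update, this per-operation bias accumulates, which is one mechanism behind the instability shown in Fig. 2.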

C. Anomaly detection

In order to use RLS for anomaly detection, we apply a detection threshold to the estimated error εn of the linear RLS model. We use two analysis methods: fixed thresholding and adaptive thresholding. In the following we explain both.

For fixed thresholding, we define a threshold t over the absolute estimation error |εn| that, when exceeded, indicates an anomaly. For demonstration purposes we use a fixed value for all datasets; however, one could also determine the threshold by analyzing an initial segment of data.

For adaptive thresholding, described in Algorithm 2, we use the Exponentially Weighted Moving Average (EWMA) method [21] to determine the moving average µ of the estimation error εn, and the moving absolute deviation σs around this average. We choose not to use the standard deviation because it requires a square root operation, which is not commonly supported on embedded processors. Furthermore, we subsample the absolute deviation σs into a slower σl, such that we can base a second threshold on σl for slower changes in the signal.

The algorithm requires two initialization phases. In the first phase, INITPHASEONESTEP, we wait for µ, σs and σl to represent the data. The thresholds are formed by multiplying σs and σl with factors Fs and Fl, which are determined in INITPHASETWOSTEP such that the data of this phase stays below the thresholds. After these two initialization phases, we start detecting anomalies, as seen in the function DETECTIONPHASESTEP in Algorithm 2.

IV. EVALUATION

This section describes the evaluation procedure that we used. We first describe the datasets used for evaluation. Then we analyze the performance of RLS through a comparative evaluation with other least squares methods. These methods are offline RLS implemented in R with floating point (RLS/FP) [4], traditional LLSE model fitting [22], and windowed LLSE (LLSE/Win). The latter performs LLSE fitting on a sliding window w of 32 data points over the dataset. This value has been chosen to allow an embedded implementation, as it requires only a modest amount of memory. The above methods are used to benchmark our RLS approach, although our final aim is to test its applicability to lightweight sensor systems.

¹Qx.y denotes a fixed-point number format where x bits are used for the integer part and y bits for the fractional part.

Algorithm 2 Adaptive Threshold Detection
1: function EWMA(X0, Y−1, λ)
2:   return ((1−λ) · Y−1) + (λ · X0)
3: function UPDATEEWMAS(εn, anomaly)
4:   if !anomaly then
5:     σs ← EWMA(abs(εn−µ), σs, λ2)
6:     if resampleStep then
7:       σl ← EWMA(abs(σs), σl, λ2)
8:   µ ← EWMA(εn, µ, λ1)
9: function INITPHASEONESTEP(εn)
10:   UpdateEWMAs(εn, false)
11: function INITPHASETWOSTEP(εn)
12:   UpdateEWMAs(εn, false)
13:   if abs(εn−µ) > σs·Fs then
14:     Fs ← abs(εn−µ)/σs
15:   if abs(εn−µ) > σl·Fl then
16:     Fl ← abs(εn−µ)/σl
17: function DETECTIONPHASESTEP(εn)
18:   anomaly ← false
19:   if (abs(εn−µ) > σs·Fs) ∨ (abs(εn−µ) > σl·Fl) then
20:     anomaly ← true
21:   UpdateEWMAs(εn, anomaly)
22:   return anomaly
23: initialize µ, σs, σl, Fs, Fl, λ1 and λ2
24: for first 1000 samples do InitPhaseOneStep(εn)
25: for second 1000 samples do InitPhaseTwoStep(εn)
26: for rest of lifetime do DetectionPhaseStep(εn)

A. Datasets

The performance comparison of the RLS implementation is carried out on 3 synthesized datasets and on traces of a real-world indoor WSN setup. For this work we assume the data inputs have a linear relation to each other. The synthesized data contains 28800 data points for 50 nodes, each node with 3 sensors: one for the y values and two for the input values X1 and X2. The real-world data contains 42113 data points for 19 nodes, each node with 3 sensors: temperature (y), humidity (X1) and light (X2). In both cases a third input X3 is a constant, acting as the intercept term; hence the number of inputs |θn| = 3.

The following describes the general components of each synthesized set, as also seen in Fig. 3; normally distributed noise with mean 0 and variance 0.1 is added to each.
• Set 1 consists of lines having a random slope and intercept.
• Set 2 consists of lines with random slope, random intercept and a sine wave.
• For Set 3 a common trend is generated by creating a random walk that is then stretched with cubic interpolation. For each sensor of each node, this common trend is then rescaled, and a sine wave is added that is amplitude modulated by a scaled random walk (resembling effects that, for instance, clouds may have on temperature). Then another scaled random walk is added, so that the data resembles daily fluctuations of indoor temperature, both in appearance and in frequency spectrum.

Figure 3. Components of simulated dataset: (a) line, (b) sine, (c) noise (normally distributed), (d) interpolated random walk (trend) and (e) a random walk.

In the second half of each of these three datasets we add anomalies of types spike, noise, constant and drift, at random positions and of various durations and intensities. These anomalies are present not only in the signal that is predicted, but also in the input signals.
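The four injected anomaly types of Fig. 1 can be illustrated as transformations of a noisy linear signal. The function and parameters below are our sketch, not the paper's generator:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
signal = 0.01 * np.arange(n) + rng.normal(0, 0.1, n)   # noisy linear data

def inject(sig, kind, start, length=50, intensity=2.0):
    """Inject one of the four anomaly types from Fig. 1 (illustrative)."""
    out = sig.copy()
    if kind == "spike":
        out[start] += intensity                                       # short peak
    elif kind == "noise":
        out[start:start+length] += rng.normal(0, intensity, length)   # variance increase
    elif kind == "constant":
        out[start:start+length] = out[start]                          # stuck-at value
    elif kind == "drift":
        out[start:start+length] += np.linspace(0, intensity, length)  # growing offset
    return out

spiky = inject(signal, "spike", 500)
```

The same injector can be applied to the input signals X1 and X2 as well, matching the statement that anomalies are injected in both inputs and outputs.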

The real-world data traces are collected from an indoor WSN setup with TelosB motes, sampling every 5 minutes for about 5 months. The TelosB consists of an MSP430 MCU and a TI CC2420 radio to wirelessly transmit the sensor measurements. Furthermore, the temperature and humidity sensors are packaged together in the SHT11 chip, and the light sensor is from the Hamamatsu S1087 series. The 19 nodes were placed in each room, in the hallways and in the toilets of the INCAS3 office building called Villa Aschwing, a house from 1877. In this dataset anomalies occur during the whole period of measurement. These anomalies are mostly the result of failing communication, bad placement (i.e. direct sunlight) and the opening of windows. The annotations were made with rule-based tools and checked by hand.

B. Complexity

Our implementation is evaluated based on the computational complexity and memory usage for each new measurement. The complexity is derived from analysis of the pseudo code; the results can be seen in Tab. I. The computational complexity and memory usage of RLS are constant in the order of |θn|², i.e. all operations described in lines 3 to 9 of Algorithm 1 are in the order of |θn|² times a constant number of operations, while LLSE has to recompute the parameter estimates based on all n previous samples. The latter also means that, over time, memory usage and computational cost increase for LLSE. By using a sliding window for LLSE this cost is kept in the order of the window length w. We have implemented the RLS algorithm for the TelosB platform, a 16-bit MSP430-based platform (compiled with GCC, optimized for size) and ran it in MSPsim [23] to count the real CPU cycles of lines 3 to 9 in Algorithm 1.

From Tab. II we see that the algorithm complexity agrees with the theoretical models of Tab. I, i.e. the computational complexity per iteration increases in the order of |θn|². The large numbers are mainly due to the 16-bit fixed-point computations, which are implemented in software. The memory usage increases slightly with the number of input parameters, while program size tends to vary due to compiler optimization.

Method    RLS       RLS/FP    LLSE       LLSE/Win
CPU       O(|θn|²)  O(|θn|²)  O(n|θn|²)  O(w|θn|²)
Memory    O(|θn|²)  O(|θn|²)  O(n²)      O(w²)

Table I. Comparison of order of complexity derived from the algorithms.

|θn|                         2      3      4
CPU Cycles/iteration         21k    42k    74k
CPU Time/iteration (ms)      5      10     18
Max Stack/iteration (bytes)  106    118    118
Total program ROM            19788  19914  19878
Total program RAM            2090   2178   2298

Table II. Number of input parameters vs. resource usage of RLS, implemented in TinyOS, compiled for the TelosB platform with MSP430-GCC and determined in MSPsim.

We can estimate the energy consumption of the algorithm by using the CPU time mentioned in Tab. II and the specifications in the datasheet of the microcontroller. For the latter we consider the MSP430F1611 that is used on the TelosB motes. Its power consumption in active mode is at most 14.4 mW at 8 MHz; thus, for 5 ms it uses 0.072 mJ. In comparison, the TelosB radio, a TI CC2420, uses 52.2 mW in transmit mode; a transmission time of 5 ms would thus take 0.261 mJ. Therefore, when neighboring data is also used as input for the algorithm, a balance has to be found between the communication rate and the estimation error. Nevertheless, this shows that it is worthwhile to detect anomalies locally, and only send anomalous data to a central processing point for further analysis.
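The energy comparison above is a simple power-times-time computation, reproduced here from the datasheet figures quoted in the text:

```python
# Back-of-the-envelope energy comparison, using the MSP430F1611 and CC2420
# figures quoted in the text (power in mW, duration in ms, energy in mJ).
cpu_power_mw = 14.4         # MSP430F1611 active mode at 8 MHz (maximum)
radio_tx_power_mw = 52.2    # CC2420 transmit mode
duration_ms = 5.0           # one RLS iteration with |theta| = 2 (Tab. II)

cpu_energy_mj = cpu_power_mw * duration_ms / 1000.0         # 0.072 mJ
radio_energy_mj = radio_tx_power_mw * duration_ms / 1000.0  # 0.261 mJ
print(f"CPU: {cpu_energy_mj:.3f} mJ, radio: {radio_energy_mj:.3f} mJ, "
      f"ratio: {radio_energy_mj / cpu_energy_mj:.2f}x")
```

Transmitting for the same 5 ms costs roughly 3.6 times the energy of computing locally, which is the quantitative basis for preferring on-mote detection.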

C. Prediction residuals

For each of the algorithms, the average absolute prediction error is the measure of precision. This average is taken over the part of the data without anomalies. The RLS algorithm has one parameter that influences its learning speed: the forgetting factor α. We consider a reasonable range for α to be from 1 to around 32, where the latter is experimentally determined by the formula in line 6 of Algorithm 1 and the bit size of the fixed-point fraction. The larger this factor, the slower the algorithm converges to a least squares solution, but the more it resists short-term influences such as anomalies.

Due to the added noise on the inputs and outputs of the simulated data, the residuals can never be zero. Hence, if the residuals are significantly smaller than the average absolute difference between two measurements, ε, then we are overfitting. This happens when the least squares estimates are updated in too large steps, which results in the parameters not converging to a global optimum but adapting to local changes.

We have tested the effect of different α values on the prediction error; the results can be seen in Tab. III. We see that the different RLS implementations are very similar in their average error, and with smaller forgetting factors may be overfitting slightly. While the LLSE implementation seems to have a higher average error, it is the ground truth for the whole dataset because it can be solved with a closed-form solution. The windowed version of LLSE is slightly overfitting, possibly due to the small window size.

          Dataset 1         Dataset 2         Dataset 3         TelosB traces
ε         .119              .119              .069              .028
α         5    10   15      5    10   15      5    10   15      5    10   15
RLS       .082 .096 .115    .082 .105 .117    .062 .159 .196    .090 .234 .355
RLS/FP    .082 .095 .101    .082 .104 .112    .059 .146 .179    .089 .217 .311
LLSE      .107 .107 .107    .106 .106 .106    .180 .180 .180    .292 .292 .292
LLSE/Win  .080 .080 .080    .092 .092 .092    .054 .054 .054    .034 .034 .034

Table III. Comparison of prediction residuals for different algorithms.

D. Detection performance

We compare the anomaly detection performance of the aforementioned methods, and contrast it with a rule-based method taken as baseline. This rule-based method is as follows: first, take the absolute first-order difference of the data; then, use the same threshold t as before over the difference. Any point that crosses the threshold is labeled as an anomaly. To detect constant anomalies, a second rule is added that labels a point anomalous when the difference between that point and the two previous points is zero, which proved sufficient to detect such anomalies in the used real-world dataset.
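The two rules of this baseline can be sketched as follows; the function name and the example threshold are ours, for illustration:

```python
import numpy as np

def rule_based_detect(x, t=0.5):
    """Sketch of the rule-based baseline: rule 1 thresholds the absolute
    first-order difference; rule 2 flags three consecutive identical values
    (constant anomalies). The threshold t is illustrative."""
    x = np.asarray(x, dtype=float)
    anomalies = np.zeros(len(x), dtype=bool)
    diff = np.abs(np.diff(x, prepend=x[0]))
    anomalies |= diff > t                      # rule 1: large jump
    for i in range(2, len(x)):                 # rule 2: stuck-at value
        if x[i] == x[i-1] == x[i-2]:
            anomalies[i] = True
    return anomalies

data = [0.0, 0.1, 0.2, 3.0, 0.3, 0.4, 0.4, 0.4, 0.5]
print(rule_based_detect(data).astype(int))
```

Note that rule 1 flags both edges of a spike (the jump up and the jump back down), while rule 2 triggers only once three equal samples have accumulated.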

Fig. 4 shows the anomaly detection accuracy for the different types of anomalies and (only for the RLS-based algorithms) for different forgetting factors α, both on the generated and the real-world data. In general the True Positive (TP) rate is low, both for the RLS-based methods and for LLSE, while the False Positive (FP) rate varies with the type of anomaly. Furthermore, the figure shows that not all anomalies are equally easy to detect.

When the learning rate is faster than the drift rate, the anomaly is likely to go undetected, because the RLS method adapts to the gradual change. This shows in the low TP rates for the RLS-based methods in Fig. 4, where the most detections are made by the rule-based LLSE method, with a TP rate of 21% and only 4.9% FP. Therefore, if any detection is made, it is likely at the start of the drift, with high α (slow learning), while the rest of the drift period is not flagged as anomalous. In general the LLSE method performs better at detecting drift. This is mostly due to the fact that this method is not adaptive but uses the whole dataset, and the fact that the drift anomalies in the generated dataset last for a relatively short period of time, such that their overall effect on the estimate of LLSE is minor. The FP rate for drift is, however, very low (at most 3.5%), thus the precision is quite high.

Constant anomalies are similarly hard to detect. This shows in the low TP rate, more so for smaller values of α, such as α = 5. For such anomalies the low RLS performance is due to the sensor repeating the last known measurement, which may still be a plausible value for short time periods. When the anomalies persist, and the RLS algorithm adapts slowly (i.e. high α), the constant anomaly may be detected, as is the case in Fig. 5.

Figure 4. True Positives (TP), False Negatives (FN) and False Positives (FP) as a ratio of the number of anomalies in the dataset, for (a) generated spike, (b) generated noise, (c) generated constant, (d) generated drift and (e) real-world mixed anomalies. Fix values are the result of the embedded fixed-point RLS, FP values of the floating-point implementation. The adaptive threshold is denoted EWMA and the fixed threshold is denoted Rule.

Figure 5. Data from Set 3 and its fixed-point RLS estimate, with constant anomalies, using a high forgetting factor α = 15.

The rule-based method, however, performs almost flawlessly in detecting constant anomalies, resulting in precision and recall of over 98%.

The performance of RLS for spikes is in general relatively good. Detection is comparable to that of LLSE, with 20 to 50% TP for RLS and 22 to 50% TP for LLSE. The fixed threshold performs below the baseline, but is very resource friendly. The adaptive threshold outperforms the baseline method in detecting anomalies, without an increase in False Positives. Noise anomaly detection performance is lower than that for spikes. Similar to the detection of spikes, the adaptive threshold performs better than the baseline, with a TP rate of around 24% and a lower FP rate of only 0.8%.

The effect of the forgetting factor α is very noticeable for the fixed-threshold detection, where constant anomaly detection benefits most. However, in all cases the false positive rate also increases when the forgetting factor increases. Therefore, a good choice of α seems to be between 5 and 10, where a specific application may bias the choice towards lower values when spikes are more likely to occur. The adaptive threshold seems to adapt to the changed forgetting factor, which results in a stable performance across the different forgetting factors. It outperforms the fixed-threshold detection, and, for spike and noise anomalies, the baseline too.

V. CONCLUSION

In this paper we evaluated the use of fixed-point Recursive Least Squares, in combination with fixed and adaptive thresholding of the prediction error, for anomaly detection. For the 3 synthetic datasets and the real-world dataset, detecting anomalies with fixed-point RLS is comparable to the LLSE method, but RLS is efficiently implementable in embedded systems. The windowed LLSE version seems to perform better at some tasks, but also seems to overfit. Furthermore, it requires more processing. The rule-based method outperforms the other methods in detecting spike, constant and noise anomalies, but fails to detect any drift. When combining the RLS method with, for instance, rule-based anomaly detection, the performance can be boosted significantly for the detection of constant anomalies.
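The claim that fixed-point RLS is efficiently implementable rests on replacing floating-point arithmetic with integer operations. A minimal sketch of the kind of arithmetic involved is shown below; the Q15 format and the rounding choice are assumptions for illustration, since the paper's exact representation is not restated here.

```python
# Sketch of Q15 fixed-point arithmetic, as could be used to run RLS on a
# 16-bit mote without a floating-point unit. Assumptions: Q15 format
# (values in [-1, 1) mapped to 16-bit integers), round-to-nearest.

Q = 15                                # number of fractional bits

def to_q15(x):
    """Convert a float in [-1, 1) to its Q15 integer representation."""
    return int(round(x * (1 << Q)))

def q15_mul(a, b):
    """Multiply two Q15 values using a 32-bit intermediate, rounding back to Q15."""
    return (a * b + (1 << (Q - 1))) >> Q

def from_q15(a):
    """Convert a Q15 integer back to a float."""
    return a / (1 << Q)
```

On a 16-bit MCU such as the MSP430 targeted by MSPsim [23], these shifts and integer multiplies map directly onto native instructions, which is what makes the fixed-point variant cheap relative to software-emulated floating point.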

The inputs to the RLS algorithm as used in this paper are node-local and concurrent. However, time correlations can be exploited by adding an extra set of inputs that represent, for instance, previous measurements. Furthermore, we may also use information from neighboring nodes as input. There are, however, some assumptions to be made. The measured modality should change only linearly with a change in location; for instance, temperature at location A should relate linearly to temperature at location B. More importantly, we need periodic information from neighboring nodes. This requires each node to broadcast its readings to its neighbors every now and then. We must assume that the measured phenomenon changes on a time scale much larger than the broadcast period, so that on a short time scale measurements are correlated and missing data, for instance due to packet loss, can be accounted for. Moreover, using neighboring information requires a balance between the power consumption of networking and that of local processing. Thus, using RLS as an anomaly detection method opens the possibility to include historical and neighborhood information.
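The extension outlined in the paragraph above, i.e., augmenting the node-local input vector with lagged values and a neighbor's last broadcast, could be sketched as follows. The function and argument names are illustrative; the paper only outlines the idea.

```python
# Sketch of building an extended regressor for RLS: bias term, current
# local reading, lagged local readings, and a neighbor's last broadcast.
# Assumptions: names and the number of lags are illustrative choices.

from collections import deque

def make_regressor(n_lags=2):
    history = deque([0.0] * n_lags, maxlen=n_lags)   # most recent lag first

    def build(local_reading, neighbor_reading):
        # If a neighbor broadcast is lost, its last known value can be
        # reused, relying on the slow-phenomenon assumption above.
        x = [1.0, local_reading] + list(history) + [neighbor_reading]
        history.appendleft(local_reading)
        return x

    return build
```

The resulting vector can be fed directly into an RLS update as its input, so the same predictor then exploits historical and neighborhood correlations at the cost of a slightly larger state.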

This work opens up various research possibilities. One aspect which requires further investigation is the use of adaptive forgetting factors, which enables fast learning followed by a greater resilience to change. Another option to investigate is the extraction of features from the sensors. Together with the aforementioned possibilities, this work shows that RLS is a promising method for online distributed anomaly detection.

ACKNOWLEDGMENT

INCAS3 is co-funded by the Province of Drenthe, the Municipality of Assen, the European Fund for Regional Development and the Ministry of Economic Affairs, Peaks in the Delta.

REFERENCES

[1] K. Ni, N. Ramanathan, M. Chehade, L. Balzano, S. Nair, S. Zahedi, E. Kohler, G. Pottie, M. Hansen, and M. Srivastava, “Sensor network data fault types,” ACM Transactions on Sensor Networks, vol. 5, no. 3, p. 25, 2009.

[2] A. B. Sharma, L. Golubchik, and R. Govindan, “Sensor faults: Detection methods and prevalence in real-world datasets,” ACM Transactions on Sensor Networks, vol. 6, no. 3, 2010.

[3] Y. Yao, A. Sharma, L. Golubchik, and R. Govindan, “Online anomaly detection for sensor systems: A simple and efficient approach,” Performance Evaluation, 2010.

[4] G. E. Bottomley, “A novel approach for stabilizing recursive least squares filters,” IEEE Transactions on Signal Processing, vol. 39, 1991.

[5] H. Sorenson, “Least-squares estimation: from Gauss to Kalman,” IEEE Spectrum, vol. 7, no. 7, pp. 63–68, 1970.

[6] M. Moreno and J. Navas, “On the robustness of least-squares Monte Carlo (LSM) for pricing American derivatives,” Review of Derivatives Research, vol. 6, no. 2, pp. 107–128, 2003.

[7] J. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.

[8] N. Levine, “A new technique for increasing the flexibility of recursive least squares data smoothing,” Bell System Technical Journal, 1961.

[9] K. Åström, “Theory and applications of adaptive control - a survey,” Automatica, vol. 19, no. 5, pp. 471–486, 1983.

[10] L. Ljung and S. Gunnarsson, “Adaptation and tracking in system identification - a survey,” Automatica, vol. 26, no. 1, pp. 7–21, 1990.

[11] S. Challa, F. Leipold, S. Deshpande, and M. Liu, “Simultaneous localization and mapping in wireless sensor networks,” in Intelligent Sensors, Sensor Networks and Information Processing Conference. IEEE, 2005, pp. 81–87.

[12] X. Xu, H. He, and D. Hu, “Efficient reinforcement learning using recursive least-squares methods,” Journal of Artificial Intelligence Research, vol. 16, pp. 259–292, 2002.

[13] Y. Zhang and J. Jiang, “Bibliographical review on reconfigurable fault-tolerant control systems,” Annual Reviews in Control, vol. 32, no. 2, pp. 229–252, 2008.

[14] I. Schizas, G. Mateos, and G. Giannakis, “Consensus-based distributed recursive least-squares estimation using ad hoc wireless sensor networks,” in Signals, Systems and Computers, 2007. IEEE, 2007, pp. 386–390.

[15] F. Cattivelli, C. Lopes, and A. Sayed, “Diffusion recursive least-squares for distributed estimation over adaptive networks,” IEEE Transactions on Signal Processing, vol. 56, no. 5, pp. 1865–1877, 2008.

[16] T. Ahmed, M. Coates, and A. Lakhina, “Multivariate online anomaly detection using kernel recursive least squares,” in INFOCOM 2007. 26th IEEE International Conference on Computer Communications. IEEE, 2007, pp. 625–633.

[17] Y. Zhang, N. Meratnia, and P. Havinga, “Outlier detection techniques for wireless sensor networks: A survey,” IEEE Communications Surveys & Tutorials, vol. 12, no. 2, pp. 159–170, 2010.

[18] R. Jurdak, X. Wang, O. Obst, and P. Valencia, “Wireless sensor network anomalies: Diagnosis and detection strategies,” Intelligence-Based Systems Engineering, pp. 309–325, 2011.

[19] M. Xie, S. Han, B. Tian, and S. Parvin, “Anomaly detection in wireless sensor networks: A survey,” Journal of Network and Computer Applications, 2011.

[20] I. M. Grant, “Recursive least squares,” Teaching Statistics, vol. 9, no. 1, pp. 15–18, 1987.

[21] S. Roberts, “Control chart tests based on geometric moving averages,” Technometrics, vol. 1, no. 3, pp. 239–250, 1959.

[22] G. Golub, “Numerical methods for solving linear least squares problems,” Numerische Mathematik, vol. 7, no. 3, pp. 206–216, 1965.

[23] J. Eriksson, A. Dunkels, N. Finne, F. Österlind, and T. Voigt, “MSPsim - an Extensible Simulator for MSP430-equipped Sensor Boards,” in Proceedings of the European Conference on Wireless Sensor Networks (EWSN), Poster/Demo session, Jan. 2007. [Online]. Available: http://dunkels.com/adam/eriksson07mspsim.pdf