how to determine the optimal anomaly detection method for

41

Upload: others

Post on 18-Jan-2022

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: How to Determine the Optimal Anomaly Detection Method For

How to Determine theOptimal AnomalyDetection Method For YourApplication

Cynthia FreemanResearch Engineer

Jonathan MerrimanSoftware Engineer

Page 2: How to Determine the Optimal Anomaly Detection Method For

Background

Page 3: How to Determine the Optimal Anomaly Detection Method For

Time Series

▶ A time series is a sequence of data points indexed in order of time.▶ How are time series used?

▶ Stock Market▶ Tracking KPIs▶ Medical Sensors▶ Weather Patterns

Page 4: How to Determine the Optimal Anomaly Detection Method For

Anomalies

An anomaly in a time series is a pattern that does not conform to pastpatterns of behavior.

Applications:

▶ E�cient troubleshooting

▶ Fraud detection

▶ Ensuring undisrupted business

▶ Saving lives in system health monitoring

Page 5: How to Determine the Optimal Anomaly Detection Method For

Anomaly Detection is Hard

▶ What is anomalous?

▶ Online anomaly detection

▶ Lack of labeled data

▶ Data imbalance

▶ Minimize false positives

▶ Plethora of anomaly detection methods

Page 6: How to Determine the Optimal Anomaly Detection Method For

Which anomaly detection method should I use?

▶ Base this decision o� of the characteristics the time series possesses

▶ Evaluate anomaly detection methods on 4 time series characteristics as anexample

▶ Experiment with 2 evaluation criteria▶ Window-based F-score▶ Numenta Anomaly Benchmark (NAB) Score

Page 7: How to Determine the Optimal Anomaly Detection Method For

Signal Processing Flow for Anomaly Detection

signal

residual

detect

�lter

score

Page 8: How to Determine the Optimal Anomaly Detection Method For

Simple Example: Gaussian

▶ Estimate mean and variance oversliding window

▶ Compute a score based on the tailprobability

S(yt) = P(yt ≤ τ |µ, σ2)

▶ Use max relative to upper andlower extremes

02-24 0002-24 12

02-25 0002-25 12

02-26 0002-26 12

02-27 0002-27 12

02-28 0010

0

10

20

30

Page 9: How to Determine the Optimal Anomaly Detection Method For

Simple Example: Gaussian

2014-02-24

2014-02-25

2014-02-26

2014-02-27

2014-02-28

0.5

0.6

0.7

0.8

0.9

1.0

Anom

aly

Scor

e

2014-02-20

2014-02-21

2014-02-22

2014-02-23

2014-02-24

2014-02-25

2014-02-26

2014-02-27

2014-02-28

2014-03-010

5

10

15

20

25

30

35

log

Page 10: How to Determine the Optimal Anomaly Detection Method For

Time Series Characteristics

Page 11: How to Determine the Optimal Anomaly Detection Method For

Seasonality

▶ Presence of variations that occurat speci�c regular intervals

▶ Real data often exhibits seasonale�ects at multiple time scales.

▶ Day-of-week▶ Hour-of-day▶ Can be irregular

▶ Day-of-month▶ Holidays

▶ ACF plot is one way to detectseasonality

01Jul

2014

30 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22

timestamp

Page 12: How to Determine the Optimal Anomaly Detection Method For

Concept Drift

The underlying process can change over time.

▶ Bayesian Online Changepoint Detection

▶ ecp package in R

30

40

50

60

https://github.com/hildensia/bayesian_changepoint_detection

Page 13: How to Determine the Optimal Anomaly Detection Method For

Trend

The process mean can change over time.

Page 14: How to Determine the Optimal Anomaly Detection Method For

Missing Time Steps

0 1000 2000 3000 4000 5000 6000 7000 8000

60

65

70

75

80

85

Page 15: How to Determine the Optimal Anomaly Detection Method For

Time Series Modeling for Anomaly

Detection

Page 16: How to Determine the Optimal Anomaly Detection Method For

Nonstationarity: Di�erencing

▶ First-order di�erence to removetrend:

[∆y](t) = y(t)− y(t− 1)

▶ Seasonal di�erencing with periods:

[∆sy](t) = y(t)− y(t− s)

02-24 0002-24 12

02-25 0002-25 12

02-26 0002-26 12

02-27 0002-27 12

02-28 0010

0

10

20

30

02-24 0002-24 12

02-25 0002-25 12

02-26 0002-26 12

02-27 0002-27 12

02-28 0020

10

0

10

20

Page 17: How to Determine the Optimal Anomaly Detection Method For

Nonstationarity: Decomposition STL

Local regression with LOESS

y(t) = S(t) + T(t) + ϵ(t)

▶ Decompose into season andtrend

▶ LOESS smoothing caninterpolate missing data

▶ Residual should look morestationary

Page 18: How to Determine the Optimal Anomaly Detection Method For

ARMA

A family of Gaussian models with temporal correlation.

y(t)−p∑i=1

θiy(t− i)︸ ︷︷ ︸AR

= ϵ(t) +

q∑j=1

ϕjϵ(t− j)︸ ︷︷ ︸MA

Autoregressive (AR)

The value at time t is a linear combination of p past values plus current noise signal.

Moving Average (MA)

The value at time t is a linear combination of q past values of noise.

Page 19: How to Determine the Optimal Anomaly Detection Method For

ARMA for Nonstationary Signals

ARIMAARMA on di�erenced signal.

SARIMAExtend ARIMA to incorporate longer-term seasonal correlation.

SARIMAXAdd eXogenous variables.

Page 20: How to Determine the Optimal Anomaly Detection Method For

ARMA

▶ Generative model having Gaussian distribution at each timestep

▶ Optimal model order selection is not straightforward

▶ See: Box-Jenkins method

Page 21: How to Determine the Optimal Anomaly Detection Method For

Prophet

Uses an additive model:

y(t) = g(t) + s(t) + h(t) + ϵt

▶ g(t) is linear/logistic growth trend

▶ s(t) is yearly/weekly seasonal component

▶ h(t) is user-provided list of holidays

https://github.com/facebook/prophet

Page 22: How to Determine the Optimal Anomaly Detection Method For

Extreme Studentized Deviate Test

How many outliers does the data set contain?ESD test requires an upper bound on the number of outliers.Assuming data is approximately normally distributed,

1. Compute the statistic,

Ri =maxi |xi − x̄|

s

2. Remove observation that maximizes |xi − x̄|, and repeat

3. Compare Ri up to critical value

Page 23: How to Determine the Optimal Anomaly Detection Method For

Twitter AnomalyDetection

▶ Uses STL but replaces trend with median▶ Anomalies can a�ect trend estimation▶ Leads to arti�cial anomalies in the residual

▶ Apply Extreme Studentized Deviate (ESD) test▶ Need to specify an upper limit on the # of outliers▶ x̄ is median and s is Median Absolute Deviation

https://github.com/twitter/AnomalyDetection

Page 24: How to Determine the Optimal Anomaly Detection Method For

Recurrent Neural Network

▶ Given a window of nlag time stepsin the past, predict a window ofnseq time steps in the future

▶ Anomaly score is an average of theprediction error

▶ Adaptive: uses onlinegradient-based optimizer, built todeal with concept drift

▶ Choice of nseq can greatly a�ectfalse positive rate

Online Anomaly Detection with Concept Drift Adaptation using RNNs CoDS-COMAD ’18, January 11–13, 2018, Goa, India

where T is length of the time series:

reset gate : r(i)t = �(W(i)r · [D(z(i�1)

t ), z(i)t�1])

update gate : u(i)t = �(W(i)

u · [D(z(i�1)t ), z(i)t�1])

proposed state : z̃(i)t = tanh(W(i)p · [D(z(i�1)

t ), rt � z(i)t�1])

hidden state : z(i)t = (1� u(i)t )� z(i)t�1 + u(i)

t � z̃(i)t

(1)where � is Hadamard product, [a,b] is concatenation ofvectors a and b, D(·) is dropout operator that randomly setsthe dimensions of its argument to zero with probability equalto dropout rate, z0t equals the value of the input time series attime t. Wr, Wu, and Wp are weight matrices of appropriate

dimensions s.t. r(i)t ,u(i)t , z̃(i)t , and z(i)t are vectors in Rc(i) ,

where c(i) is the number of units in layer i. The sigmoid (�)and tanh activation functions are applied element-wise. Thehidden state z(i)t is used to obtain the output via a linear ornon-linear output layer. The parametersW = [Wr,Wu,Wp]of the RNN consist of the weight matrices in Equations 1.Dropout is used for regularization [28, 33] and is applied onlyto the non-recurrent connections, ensuring information flowacross time-steps.

3 APPROACHWe assume that a model that is able to predict the next fewsteps in a time series based on history is a suitable model forthe time series. In other words, if a model is able to predict thevalues at next few steps in a time series well, it has e↵ectivelylearned to capture the important temporal characteristics inthe time series. In case of stationary time series, once trained,such a model is expected to predict the time series depictingnormal behavior well but perform badly on the predictiontask on time series corresponding to anomalous behavior.This idea has been e↵ectively applied to several domainswhere a temporal model for normal behavior is learned usingmultilayered LSTM based RNNs (e.g. [6, 7, 27, 43]).

In the online setting, such a model of normal behaviorneeds to be updated in an incremental manner as new dataarrives. We use the most recent prediction error(s) to servetwo primary purposes: i) update the model parameters, ii)compute the anomaly score. We identify following key prop-erties of a good online anomaly detection model based onprediction errors:

• If there is a drift in values being observed over a signif-icant period of time, the model should quickly adaptto the new norm by updating the parameters.

• If there is a short subsequence of anomalous values,the model need not be updated. Rather the modelupdation step should be restrained.

• If there is a continuous drift in the normal behavior,the model should get updated continuously or modelthe continuous drift as part of the normal behavior.

We next describe our approach that caters to the afore-mentioned criteria. We show how multi-step predictions are

Anomaly Score Computation

Prediction using RNN

Anomaly Score Computation

RNN Updation usingBPTT

At time t At time t+1

Prediction using RNN

RNN Updation usingBPTT

Figure 1: Steps in Online RNN-AD approach

obtained using RNNs and then used for anomaly score com-putation as well as incremental model updation. Overall stepsof the algorithm are depicted in Figure 1.

3.1 Online RNN-ADConsider a multivariate time series x = {x1,x2, ...,xt}, wheret is the length of the time series, each point xi 2 Rm (i =1 . . . t) in the time series is an m-dimensional vector (e.g. eachof the m dimensions corresponds to a di↵erent sensor). Attime t, we use a past window of length w as raw input timeseries it:

it = xt�w�(l�1) . . .xt�l (2)

The RNN model obtained at time t�1 with parameters Wt�1

is used to obtain the l-steps ahead estimates x0t�l+1 . . .x

0t

corresponding to last l raw observed values ot:

ot = xt�l+1 . . .xt (3)

The raw input it and raw target output ot are appropri-ately normalized before feeding it to the latest RNN networkrepresented by weight matrix Wt�1. The error in predictingot by processing it via the RNN Wt�1 is used to obtainthe anomaly score at and update the RNN to Wt. We nextdescribe these steps in detail:

Figure 2: E↵ect of local normalization in case of sud-den mean shift

Illustration from Saurav et al. '18

Page 25: How to Determine the Optimal Anomaly Detection Method For

HTM for Anomaly Detection

Hierarchical Temporal Memory Network

▶ HTM outputs sparserepresentation of input and nextprediction step to determine theprediction error modeled as arolling normal distribution

▶ HTM not implmented in a widelyaccessible way

▶ Cannot handle missing time stepsinnately

Illustration from Ahmad et al. '17

Page 26: How to Determine the Optimal Anomaly Detection Method For

HOT-SAX

Heuristically Ordered Timeseries - Symbolic Aggregated ApproXimation

▶ Finds Discords: Subsequences of time series that are maximally di�erentfrom all remaining subsequences

▶ Transform timeseries into alphabetical symbols and compare the distancesbetween words

▶ Not built for concept drift detection

▶ Ine�cient for very large time series

hepatitis database, for visualization [8][11], and a host of other data mining tasks.

4.1 A Brief Review of SAX A time series C of length n can be represented in a w-dimensional space by a vector

wccC ,,1 � . The ith element of C is calculated by the following equation:

¦��

i

ijjn

wi

wn

wn

cc1)1(

In other words, to transform the time series from n dimensions to w dimensions, the data is divided into w equal sized “frames.”The mean value of the data falling within a frame is calculated and a vector of these values becomes the dimensionality-reduced representation. This simple representation is known as Piecewise Aggregate Approximation (PAA).

Having transformed a time series into the PAA representation, we can apply a further transformation to obtain a discrete representation. It is desirable to have a discretization technique that will produce symbols with equiprobability [2][6]. In empirical tests on more than 50 datasets, we noted that normalized subsequences have highly Gaussian distribution[10], so we can simply determine the “breakpoints” that will produce equal-sized areas under Gaussian curve.

Definition 9. Breakpoints: Breakpoints are a sorted list of numbers % = E1 ,…,Ea-1 such that the area under a N(0,1)Gaussian curve from Ei to Ei+1 = 1/a (E0 and Ea are defined as -f and f, respectively).

These breakpoints may be determined by looking them up in astatistical table. For example, Table 3 gives the breakpoints for values of a from 3 to 5.

Table 3: A lookup table that contains the breakpointsthat divides a Gaussian distribution in an arbitrarynumber (from 3 to 5) of equiprobable regions

Eia 3 4 5

E1 -0.43 -0.67 -0.84

E2 0.43 0 -0.25

E3 0.67 0.25

E4 0.84

Once the breakpoints have been obtained we can discretize atime series in the following manner. We first obtain a PAA ofthe time series. All PAA coefficients that are below the smallestbreakpoint are mapped to the symbol “a”, all coefficientsgreater than or equal to the smallest breakpoint and less than the second smallest breakpoint are mapped to the symbol “b”, etc. Figure 3 illustrates the idea.

Figure 3: A time series (thin black line) is discretized by first obtaining a PAA approximation (heavy gray line) and thenusing predetermined breakpoints to map the PAA coefficientsinto symbols (bold letters). In the example above, with n = 128,w = 8 and a = 3, the time series is mapped to the word cbccbaab

Note that in this example, the 3 symbols, “a”, “b” and “c” are approximately equiprobable as we desired. We call the concatenation of symbols that represent a subsequence a word .

Definition 10. Word: A subsequence C of length n can be represented as a word

w1as follows. Let Dj denote

the jth element of the alphabet, i.e., D1 = a and D2 = b. Then the mapping from a PAA approximation

ccC ˆ,,ˆˆ �

C to a word C is obtained as follows:

ˆ

jijji ciffc EED �d �1ˆ

We have now completely defined SAX representation. Note that our observation that normalized subsequences have highlyGaussian distribution [10], is not critical to correctness of any of the algorithms that use SAX, including the ones in this work. A pathological dataset that violates this assumption will onlyaffect the efficiency of the algorithms.

4.2 Approximating the Magic Outer Loop We begin by creating two data structures to support our heuristics. We are given n, the length of the discords in advance, and we must choose two parameters, the cardinality of the SAX alphabet size D, and the SAX word size w. We defer a discussion of how to set these parameters until later in this section, but note that the values of D and w only affect the efficiency of our algorithm, not the final result, which dependsonly on the user supplied length of the discord.We begin by creating a SAX representation of the entire time series, by sliding a window of length n across time series T,extracting subsequences, converting them to SAX words, and placing them in an array where the index refers back to the original sequence. Figure 4 gives a visual intuition of this,where both D and w are set to 3.

0 500 1000 1500 2000 2500

C1

c a a

c

c

b

a

c

cb

ac

b

ac

b

a

cb

a

1 3

2

(m – n) -1

(m – n)+1

77

9

23

731

c

C1^

Subsequence extracted

Converted to SAX

Inserted into array

Raw time series

Augmented Trie

a

b

b

::

::

a

b

a

2

1

2

::

::

3

1

3

c

c

b

::

::

a

a

a

b(m – n) +1

a(m – n)

c(m – n) -1

::::

::::

c3

c2

c1

a

b

b

::

::

a

b

a

2

1

2

::

::

3

1

3

c

c

b

::

::

a

a

a

b( ) +1

a( –

c( – ) -1

::::

::::

c3

c2

c1

0 500 1000 1500 2000 2500

C1

c a a

c

c

b

a

c

cb

ac

b

ac

b

a

cb

a

1 3

2

(m – n) -1

(m – n)+1

77

9

23

731

c

C1^

Subsequence extracted

Converted to SAX

Inserted into array

Raw time series

Augmented Trie

0 500 1000 1500 2000 2500

C1

c a a

c

c

b

a

c

cb

ac

b

ac

b

a

cb

a

1 3

2

(m – n) -1

(m – n)+1

77

9

23

731

c

c

c

b

a

c

cb

ac

b

ac

b

a

cb

a

cb

a

cb

ac

b

a

cb

ac

b

a

cb

a

cb

a

cb

a

1 3

2

(m – n) -1

(m – n)+1

77

9

23

731

c

C1^C1^

Subsequence extracted

Converted to SAX

Inserted into array

Raw time series

Augmented Trie

a

b

b

::

::

a

b

a

2

1

2

::

::

3

1

3

c

c

b

::

::

a

a

a

b(m – n) +1

a(m – n)

c(m – n) -1

::::

::::

c3

c2

c1

a

b

b

::

::

a

b

a

2

1

2

::

::

3

1

3

c

c

b

::

::

a

a

a

b( ) +1

a( –

c( – ) -1

::::

::::

c3

c2

c1

0 20 40 60 80 100 120

-1 .5

- 1

-0 .5

0

0.5

1

1.5

ba

a

b

cc

b

cFigure 4: The two data structures used to support the Inner and Outer heuristics. (left) An array of SAX words, where the last column contains a count of how often each word occurs in the array. (right) An excerpt of an trie with leaves that contain a listof all array indices that map to that terminal node

Note that the index goes from 1 to (m – n) + 1, because the right edge of n-length sliding window “bumps” against end of the m-length time series.Once we have this ordered list of SAX words, we can imbedthem into an augmented trie where the leaf nodes contain a linked list index of all word occurrences that map there. The

Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05) 1550-4786/05 $20.00 © 2005 IEEE

5.1.1 Anomaly Detection in Space Telemetry We consider the problem of finding anomalies in sensor time series. In Figure 5, we see an example of a Space ShuttleMarotta Valve time series that was annotated as normal by a NASA engineer.

Figure 5: An example of a Space Shuttle Marotta Valve timeseries that was annotated as normal

The engineer further annotated several other traces of the samesensor that have several kinds of faults. We first consider aneasy problem; in Figure 6, the expert annotated the last in aseries of 5 energize/deenergize cycles as “Poppet pulled significantly out of the solenoid before energizing”. The problem is immediately obvious to even the untrained eye, and the discord128 completely spans the offending section.

Figure 6: An example of an annotated Marotta Valve timeseries. The discord discovered (highlighted in bold) exactly corresponds with the expert’s annotation

In Figure 7, we consider a much more subtle problem; once again we find the discord128, and once again its location exactlycoincides with the expert’s annotation.

Figure 7: Another example of an annotated Marotta Valve timeseries. While the discord discovered (highlighted in bold)exactly corresponds with the expert’s annotation, it is difficult to see why at this scale

At the scale shown, it is impossible to see what the expert sawto justify her decision; however, in Figure 8, we can see that thediscord has a “double hump”, whereas the corresponding section of the other four cycles have a single hump.

Figure 8: (Left) A subsection of Figure 7 showing the discord found. (Right) A Zoom-in of the discord, and the four corresponding sections form the normal cycles explains the fault

5.1.2 Anomaly Detection in Electrocardiograms We have already considered the utility of discords in an Electrocardiogram (ECG) in Figure 1. That was a very simpleand “clean” example for clarity; however, it is remarkable howvaried and complex normal healthy ECGs can be. For example,Figure 9 shows a very complicated signal with remarkablevariability Surprisingly, this ECG contains only one smallanomaly, which is easily discovered by a discord. In Figure 10, we consider an ECG that has several differenttypes of anomalies. Here, the first 3 discords exactly line upwith the cardiologist’s annotations.

0 5 00 0 1 00 0 0 1 5 0 0 0

B ID M C C on g es tiv e H ea rt F a ilu re D a ta b a s e: R ec ord 1 5

r0 5 00 0 1 00 0 0 1 5 0 0 0

B ID M C C on g es tiv e H ea rt F a ilu re D a ta b a s e: R ec ord 1 5

r

0 100 200 300 400 500 600 700 800 900 1000

EnergizingPhase

De-Energizing Phase

Space Shuttle M arottaValve: N ormal cycle

0 100 200 300 400 500 600 700 800 900 1000

EnergizingPhase

De-Energizing Phase

Space Shuttle M arottaValve: N ormal cycle Figure 9: An ECG that has been annotated by a cardiologist

(bottom bar) as containing one premature ventricular contraction. The discord256 (bold line) exactly coincides with theheart anomaly

In this figure we could perhaps spot the anomalies by eye;however, the full time series is much longer, and impossible to scrutinize without a scrollbar and much patience.

0 5 0 0 0 1 0 0 0 0 1 5 0 0 0

M IT - B I H A rrh y th m ia D a ta b a s e : R e c o rd 1 0 8

r S r

1 s t D is c o rd2 n d D is c o rd

3 rd D is c o rd

0 5 0 0 0 1 0 0 0 0 1 5 0 0 0

M IT - B I H A rrh y th m ia D a ta b a s e : R e c o rd 1 0 8

r S r

1 s t D is c o rd2 n d D is c o rd

3 rd D is c o rd

Figure 10: An excerpt of an ECG that has been annotated by a cardiologist (bottom bar) as containing 3 various anomalies. The first 3 discord600 (bold lines) exactly coincides with the anomalies

0 50 0 1000 1500 2000 2500 3000 3500 4000 4500 5000

P opp et pulle d sign if icantly out o fthe so leno id be fo re energ iz ing The D e-

E nerg iz ingp hase isnorm a l

M aro tta V alve S eries I

0 50 0 1000 1500 2000 2500 3000 3500 4000 4500 5000

P opp et pulle d sign if icantly out o fthe so leno id be fo re energ iz ing The D e-

E nerg iz ingp hase isnorm a l

M aro tta V alve S eries I

In the above cases, we simply set the length of the discords to be approximately one full heartbeat (note that the two datasets havedifferent sampling rates). Although we found that we could double or half the parameters without affecting the quality ofresults, on just a handful of the dozens of ECG datasets we examined, the discords had a harder time finding the anomalous heartbeats. We conferred with cardiologist, Dr. Helga Van Herle M.D., who informed us that heart irregularities can sometimesmanifest themselves at scales significantly shorter than a singleheartbeat. Armed with this knowledge, we searched for discords at approximately ¼ the length of a single heartbeat. In Figure11, we show the results of such a search.

0 5 0 0 1 0 0 0 1 5 0 0 2 0 0 0 2 5 0 0 3 0 0 0 3 5 0 0 4 0 0 0 4 5 0 0 5 0 0 0

P o p p e t p u lle d o u t o f th e s o le n o id b e fo re e n e rg iz in gM a ro t ta V a lv e S e r ie s I I

0 5 0 0 1 0 0 0 1 5 0 0 2 0 0 0 2 5 0 0 3 0 0 0 3 5 0 0 4 0 0 0 4 5 0 0 5 0 0 0

P o p p e t p u lle d o u t o f th e s o le n o id b e fo re e n e rg iz in gM a ro t ta V a lv e S e r ie s I I

0 500 1000 1500

rQT Database: Record 0606

0 500 1000 1500

rQT Database: Record 0606

Figure 11: An ECG that has been annotated by a cardiologist(bottom bar) as containing one premature ventricular contraction. The discord40 (bold line) exactly coincides with the heart anomaly

While the result is satisfying in that it immediately locates the anomaly, it is not obvious from the figure that the discord is actually different for the other heartbeats. In Figure 12 (left) we see a zoom-in of the subsequence surrounding the discord, andwe can see that the discord falls over the ST wave. In Figure 12 (right), we manually extracted 4 ST waves from the subsequence in Figure 11 and clustered them together with the discord. This makes the source of the anomaly apparent.

0 50 1001500 2000 2500

Poppet pulled out of thesolenoid before

energizing

Discord

Correspondingsection ofother cycles

Discord

0 50 1001500 2000 2500

Poppet pulled out of thesolenoid before

energizing

Discord

Correspondingsection ofother cycles

Discord

900 1000 1100 1200r

P

Q

R

S

T

Discord

4

1

3

2

900 1000 1100 1200r

P

Q

R

S

T

Discord

4

1

3

2

Figure 12: (left) A zoom-in of a section of Figure 11. The firstheartbeat has been annotated with the classic notation. (right)Five ST waves from Figure 11 (including the discord)hierarchically clustered

Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05) 1550-4786/05 $20.00 © 2005 IEEE

Illustrations from Keough et al. 2005

Page 27: How to Determine the Optimal Anomaly Detection Method For

Evaluation Strategies

Page 28: How to Determine the Optimal Anomaly Detection Method For

Anomaly Scores

Anomaly detectors are adapted to output a score between 0 and 1

▶ HTM: Use provided score

▶ Twitter AD and HOT-SAX: Use binary determination

▶ Windowed gaussian: Apply Q function to standardized signal

▶ STL, SARIMA, Prophet: Apply Q function to standardized residual

Page 29: How to Determine the Optimal Anomaly Detection Method For

Numenta Anomaly Benchmark Scoring

▶ For every predicted anomaly y, itsscore σ(y) is determined by itsposition relative to its containingwindow or an immediatelypreceding window

▶ For every ground truth anomaly,construct an anomaly windowwith the anomaly in the center.

.1×length of time series# of true anomalies

v. is fully automated across all datasets (any data specific parameter tuning must be done online without human intervention)

The NAB scoring algorithm aims to reward these characteristics. There are three key aspects of scoring in NAB: anomaly windows, the scoring function, and application profiles. These are described in more detail below.

Traditional scoring methods, such as precision and recall, don’t suffice because they don’t effectively test anomaly detection algorithms for real-time use. For example, they do not incorporate time and do not reward early detection. Therefore, the standard classification metrics – true positive (TP), false positive (FP), true negative (TN), and false negative (FN) are not applicable for evaluating algorithms for the above requirements.

Fig. 2. Shaded red regions represent the anomaly windows for this data file. The shaded purple region is the first 15% of the data file, representing the probationary period. During this period the detector is allowed to learn the data patterns without being tested.

To promote early detection NAB defines anomaly windows. Each window represents a range of data points that is centered around a ground truth anomaly label. Fig. 2 shows an example using the data from Fig 1. A scoring function (described in more detail below) uses these windows to identify and weight true positives, false positives, and false negatives. If there are multiple detections within a window, the earliest detection is given credit and counted as a true positive. Additional positive detections within the window are ignored. The sigmoidal scoring function gives higher positive scores to true positive detections earlier in a window and negative scores to detections outside the window (i.e. the false positives). These properties are illustrated in Fig. 3 with an example.

How large should the windows be? The earlier a detector can reliably identify anomalies the better, implying these windows should be as large as possible. The tradeoff with extremely large windows is that random or unreliable detections would be regularly reinforced. Using the underlying assumption that true anomalies are rare, we define anomaly window length to be 10% the length of a data file, divided by the number of anomalies in the given file. This technique allows us to provide a generous window for early detections and also allow the detector to get partial credit if detections are

soon after the ground truth anomaly. 10% is a convenient number but note the exact number is not critical. We tested a range of window sizes (between 5% and 20%) and found that, partly due to the scaled scoring function, the end score was not sensitive to this percentage.

Different applications may place different emphases as to the relative importance of true positives vs. false negatives and false positives. For example, the graph in Fig. 2 represents an expensive industrial machine that one may find in a manufacturing plant. A false negative leading to machine failure in a factory can lead to production outages and be extremely expensive. A false positive on the other hand might require a technician to inspect the data more closely. As such the cost of a false negative is far higher than the cost of a false positive. Alternatively, an application monitoring the statuses of individual servers in a datacenter might be sensitive to the number of false positives and be fine with the occasional missed anomaly since most server clusters are relatively fault tolerant.

To gauge how algorithms operate within these different application scenarios, NAB introduces the notion of application profiles. For TPs, FPs, FNs, and TNs, NAB applies different relative weights associated with each profile to obtain a separate score per profile.

Fig. 3. Scoring example for a sample anomaly window, where the values represent the scaled sigmoid function, the second term in Eq. (1). The first point is an FP preceding the anomaly window (red dashed lines) and contributes -1.0 to the score. Within the window we see two detections, and only count the earliest TP for the score. There are two FPs after the window. The first is less detrimental because it is close to the window, and the second yields -1.0 because it’s too far after the window to be associated with the true anomaly. TNs make no score contributions. The scaled sigmoid values are multiplied by the relevant application profile weight, as shown in Eq. (1), the NAB score for this example would calculate as: −1.0!!" + 0.9999!!" −0.8093!!" − 1.0!!" . With the standard application profile this would result in a total score of 0.6909.

NAB includes three different application profiles: standard, reward low FPs, and reward low FNs. The standard profile assigns TPs, FPs, and FNs with relative weights (tied to the window size) such that random detections made 10% of the

Illustration from Lavin & Ahmad '15

Page 30: How to Determine the Optimal Anomaly Detection Method For

Numenta Anomaly Benchmark Scoring (Continued)

▶ The raw score is computed as:

Sd =

∑y∈Yd

σ(y)

+ AFNfd

AFN is cost of false negatives

▶ Then rescale to get summary score:

100× S− SnullSperfect − Snull

▶ Choose threshold that maximizes score

Page 31: How to Determine the Optimal Anomaly Detection Method For

Window-based F-score

▶ Segment into nonoverlapping windows

▶ Window is anomalous if it contains an anomaly

▶ Treat like binary classi�cation and report F1

▶ Choose threshold that minimizes # of errors

▶ Prefer detection in case of tie

Page 32: How to Determine the Optimal Anomaly Detection Method For

Results and Conclusions

Page 33: How to Determine the Optimal Anomaly Detection Method For

Characteristic Corpora

Seasonality10 datasets

63,336 samples

23 ground truth anomalies

Trend10 datasets

31,596 samples

17 ground truth anomalies

Concept Drift10 datasets

32,402 samples

27 ground truth anomalies

Missing Timesteps10 datasets

33,245 samples

22 ground truth anomalies

1,254 missing samples

https://github.com/numenta/NAB

Page 34: How to Determine the Optimal Anomaly Detection Method For

Example

Page 35: How to Determine the Optimal Anomaly Detection Method For

Which methods are promising given a characteristic?

Seasonality and TrendSTL, SARIMA, Prophet

Concept DriftRequires more complex methods such as HTMs

Missing Time Steps

▶ Performance varies based on evaluation strategy▶ Area for future work: more methods needed!

Page 36: How to Determine the Optimal Anomaly Detection Method For

Which evaluation strategy should I use?

▶ F-score scheme is more restrictive

▶ NAB scores have more wiggle room for false positives due to reward forearly detection

▶ What evaluation metric to use is entirely based on the needs of the user

Page 37: How to Determine the Optimal Anomaly Detection Method For

In Summary

▶ The existence of an anomaly detection method that is optimal for alldomains is a myth

▶ Determine the characteristics present in the data to narrow down thechoices for anomaly detection methods

Page 38: How to Determine the Optimal Anomaly Detection Method For

Questions?

Cynthia [email protected]

Jonathan [email protected]

https://github.com/cynthiaw2004/adclasses

Page 39: How to Determine the Optimal Anomaly Detection Method For
Page 40: How to Determine the Optimal Anomaly Detection Method For

Rate today ’s session

Session page on conference website O’Reilly Events App

Page 41: How to Determine the Optimal Anomaly Detection Method For

Timing

Average time to generate anomaly scores: