finding changes in real data

© 2017 MapR Technologies

Detecting Change


Contact Information

Ted Dunning, PhD

Chief Application Architect, MapR Technologies

Board member, Apache Software Foundation

O’Reilly author

Email [email protected] [email protected]

Twitter @ted_dunning




Who We Are

• MapR Technologies

– We make a kick-ass platform for big data computing

– Support many workloads including Hadoop / Spark / HPC / Other

– Extended to allow streams and tables in basic platform

– Free for academic research / training

• Apache Software Foundation

– Culture hub for building open source communities

– Shared values around openness for contribution as well as use

– Many major projects are part of Apache

– Even more minor ones!


Basic Outline

• Goal Setting

• Basic Ideas

– LLR (finding changes in counts)

– Poisson rate change detection (finding changes in event timing)

– Distribution estimation / visualization

– Labeled events and adding labels

• Free Improvisation on Themes


Why Is This Practically Important

• The novice came to the master and said “something is broken”

• The master replied “What has changed?”

• And the student was enlightened


The Second Student

• Another student said to the master, “I see something has changed … something may have broken”


• The master replied, “You have no question to ask. You have no need of enlightenment”

• And thus the student was enlightened


• There are some very powerful techniques available, some only very recently, that can make the detection of change much easier than you might think. I will describe the practical use of several of these techniques including t-digest, non-linear histograms, variable rate Poisson models and combinations of these.


Comparing Counts

• Suppose we have two situations A and B, each with many observations, nA and nB

• And some event x occurred n1A and n1B times in each situation

         x       other
    A    n1A     nA - n1A
    B    n1B     nB - n1B


Comparing Counts

• Have we seen a change in the frequency of x?

• Frequency ratios?

– Breaks with small counts

• χ² test?

– Breaks with small counts


Log-Likelihood Ratio Test (Root LLR)

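• In R

    # N times the Shannon entropy of the counts in k;
    # the (k==0) term makes empty cells contribute 0 instead of NaN
    entropy = function(k) {
      -sum(k*log((k==0)+(k/sum(k))))
    }

    # log-likelihood ratio score for a contingency table k
    llr = function(k) {
      (entropy(rowSums(k))+entropy(colSums(k))
        -entropy(k))*2
    }

• Like mutual information × 2N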


Spot the Anomaly

• Root LLR is roughly like standard deviations

             A       not A
    B        13      1000
    not B    1000    100,000
    Root LLR: 0.89

             A       not A
    B        1       0
    not B    0       2
    Root LLR: 1.95

             A       not A
    B        1       0
    not B    0       10,000
    Root LLR: 4.51

             A       not A
    B        10      0
    not B    0       100,000
    Root LLR: 14.29
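
The four scores on this slide can be checked with the two R functions from the previous slide. This is only a minimal sketch: `root_llr` is a helper name introduced here, and it takes the plain square root of the LLR without attaching any sign for the direction of the deviation.

```r
# Restated from the slide: N times the entropy of a table of counts
entropy <- function(k) -sum(k * log((k == 0) + k / sum(k)))
llr <- function(k) 2 * (entropy(rowSums(k)) + entropy(colSums(k)) - entropy(k))

# root_llr is a hypothetical helper name: square root of the LLR score
root_llr <- function(k) sqrt(llr(k))

# The four 2x2 tables from the slide, rows B / not B, columns A / not A
tables <- list(
  matrix(c(13, 1000, 1000, 100000), nrow = 2, byrow = TRUE),
  matrix(c(1, 0, 0, 2),             nrow = 2, byrow = TRUE),
  matrix(c(1, 0, 0, 10000),         nrow = 2, byrow = TRUE),
  matrix(c(10, 0, 0, 100000),       nrow = 2, byrow = TRUE)
)
scores <- sapply(tables, root_llr)
print(round(scores, 2))
```

The scores come out close to the 0.89, 1.95, 4.51 and 14.29 listed on the slide, in that order. Note how the first table, with the largest raw counts, is the least surprising.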


How Does it Work

Empirical fit to asymptotic distribution is very good


How Does it Work?


OK, we can detect changes in counts


Real-life Example

• Query: “Paco de Lucia”

• Conventional meta-data search results:

– “hombres de paco” times 400

– not much else

• Recommendation based search:

– Flamenco guitar and dancers

– Spanish and classical guitar

– Van Halen doing a classical/flamenco riff


Real-life Example


Example 2 - Common Point of Compromise

• Scenario:

– Merchant 0 is compromised, leaks account data during compromise

– Fraud committed elsewhere during exploit

– High background level of fraud

– Limited detection rate for exploits

• Goal:

– Find merchant 0

• Meta-goal:

– Screen algorithms for this task without leaking sensitive data


Example 2 - Common Point of Compromise

[Diagram: card data is stolen from Merchant 0 (the skim); that skimmed data is then used in frauds at other merchants, Merchant n (the exploit).]


Simulation Setup

[Figure: daily count of compromises and frauds over days 0 to 100, with the compromise period and the exploit period marked. x axis: day; y axis: count.]


Detection Strategy

• Select histories that precede non-fraud

• And histories that precede fraud detection

• Analyze 2x2 cooccurrence of merchant n versus fraud detection
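
As a rough illustration of the last step, the 2x2 table for one merchant can be built from four counts and scored. Everything here is hypothetical: the counts are invented, and `g_stat` and `score_merchant` are names introduced for this sketch. The score is written in observed-versus-expected form, which gives the same number as the entropy-based LLR shown earlier.

```r
# Same statistic as the LLR score, written as 2 * sum(observed * log(observed / expected))
g_stat <- function(k) {
  expected <- outer(rowSums(k), colSums(k)) / sum(k)
  2 * sum(ifelse(k > 0, k * log(k / expected), 0))   # empty cells contribute 0
}

# Hypothetical inputs: number of card histories that precede fraud (or not),
# and how many of each kind include the merchant being screened
score_merchant <- function(fraud_with, fraud_total, clean_with, clean_total) {
  k <- matrix(c(fraud_with, fraud_total - fraud_with,
                clean_with, clean_total - clean_with),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("fraud", "no fraud"), c("merchant", "other")))
  g_stat(k)
}

suspicious <- score_merchant(40, 200, 50, 10000)  # merchant over-represented before fraud
ordinary   <- score_merchant(1, 200, 50, 10000)   # merchant at the background rate
print(c(suspicious = suspicious, ordinary = ordinary))
```

Scoring every merchant this way and sorting by score would put a compromised merchant near the top, provided it shows up in fraud histories more often than the background rate explains.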


What about the real world?


[Figure: LLR score for real data. x axis: number of merchants (log scale, 10^0 to 10^6); y axis: breach score (LLR), 0 to 80. The highest-scoring points are labeled “Really truly bad guys”.]


What about time?


Finding Changes in Timing

• Suppose our input is events embedded in time

• Suppose we want to find changes in our input in real-time

• Waiting and counting is fine if we don’t have to react now

• We can do much better


Poisson Event Rate Change

• Detection of fallout

– Time since last is very sensitive for complete failure

• Detection of change relative to reference

– Time since n-th most recent

– LLR with time

• Have to trade detection speed versus false positive rate and size of change

• Can run multiple detectors at once


Basic idea: Time interval is better than counts


Sporadic Events: Finding Normal and Anomalous Patterns

• Time between events is much more usable than absolute times

• Counts don’t link as directly to probability models

• Time interval is log ρ

• This is a big deal


Event Stream (timing)

• Events of various types arrive at irregular intervals

– we can assume Poisson distribution

• The key question is whether frequency has changed relative to

expected values

– This shows up as a change in interval

• Want alert as soon as possible


Converting Event Times to Anomaly

[Figure: thresholds marked at the 99.9%-ile and 99.99%-ile.]


In the real world, event rates often vary


Time Intervals Are Key to Modeling Sporadic Events

[Figure: time between events over time. x axis: t (days); y axis: dt (min).]


Poisson Distribution

• Time between events is exponentially distributed

• This means that long delays are exponentially rare

• If we know λ we can select a good threshold

– or we can pick a threshold empirically

$\Delta t \sim \lambda e^{-\lambda t}$

$P(\Delta t > T) = e^{-\lambda T}$

$-\log P(\Delta t > T) = \lambda T$
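
The last relation can be turned around to pick a threshold: for a false alarm probability alpha, T = -log(alpha) / λ. A minimal sketch, assuming λ is known and constant; the rate of 5 events per minute and the name `silence_threshold` are made up for illustration.

```r
# Length of silence T such that P(dt > T) = alpha under an exponential model
silence_threshold <- function(lambda, alpha) -log(alpha) / lambda

lambda <- 5                                   # hypothetical: 5 events per minute
t_999 <- silence_threshold(lambda, 0.001)     # 99.9% threshold, in minutes

# Cross-check against R's exponential quantile function
print(all.equal(t_999, qexp(0.999, rate = lambda)))
```

Because the threshold scales as 1 / λ, the same alpha works at any event rate once intervals are multiplied by the rate.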


After Rate Correction

[Figure: intervals after rate correction, with 99.9%-ile and 99.99%-ile thresholds marked. x axis: t (days); y axis: dt / rate.]


Detecting Anomalies in Sporadic Events

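[Diagram: incoming events at times t_i feed an n-th order difference (Δn). A rate predictor, driven by rate history, supplies λ_t, giving the scaled delay δ = λ(t_i - t_{i-n}). A t-digest tracks the 99.97%-ile of δ as the threshold t, and an alarm is raised when δ > t.]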
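
The flow in this diagram might be sketched as follows. Two simplifications are assumed and should be read as stand-ins: the rate predictor is replaced by a known constant rate, and the t-digest is replaced by base R's `quantile()` over a stored history. The function names and the simulated data are hypothetical.

```r
# delta_i = lambda * (t_i - t_{i-n}): n-th order interval scaled by the rate
scaled_delay <- function(times, rate, n = 1) rate * diff(times, lag = n)

set.seed(1)
lambda <- 2                                      # assumed known, constant rate
history <- cumsum(rexp(10000, rate = lambda))    # simulated reference event times

# Stand-in for the t-digest: empirical 99.97%-ile of the scaled delays
delta_ref <- scaled_delay(history, lambda, n = 1)
threshold <- quantile(delta_ref, 0.9997, names = FALSE)

# Alarm when the scaled time since the previous event exceeds the threshold
is_anomalous <- function(t_prev, t_now) lambda * (t_now - t_prev) > threshold

print(is_anomalous(0, 0.5))   # one expected interval of silence
print(is_anomalous(0, 10))    # twenty expected intervals of silence
```

In a streaming setting the stored history would not be kept; a quantile sketch such as t-digest would be updated per event instead.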




Seasonality Poses a Challenge

[Figure: Christmas Traffic. x axis: date (Nov 17 to Dec 27); y axis: hits / 1000.]


Something more is needed …

[Figure: Christmas Traffic. x axis: date (Nov 17 to Dec 27); y axis: hits / 1000.]


We need a better rate predictor…



Idea: Predict log(rate) from lagged log(rate)

• Predict log because

– Peak to valley ratio

– Traffic grew by 30%

– All rates are positive

– Just because I said so

• Let model see many lagged values

• Use L1 regularized linear model to pick important historical values

– We would have moved to something fancier if this hadn’t worked
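
A rough sketch of the lagged log(rate) setup, on simulated hourly counts with a daily cycle. To stay within base R, plain least squares (`lm`) is swapped in for the L1 regularized model described above, so it does not select lags; the data, the 48-hour lag window and the variable names are all assumptions made for illustration.

```r
set.seed(42)
hours <- 0:(24 * 28 - 1)                           # four weeks of hourly data
true_rate <- exp(3 + sin(2 * pi * hours / 24))     # simulated daily cycle
counts <- rpois(length(hours), true_rate)
log_rate <- log(counts + 1)                        # +1 guards against log(0)

lags <- 48
# embed() puts x[t] in column 1 and x[t-1], ..., x[t-lags] in the other columns
m <- embed(log_rate, lags + 1)
y <- m[, 1]
X <- m[, -1]

fit <- lm(y ~ X)            # stand-in: ordinary least squares, no L1 penalty
pred <- fitted(fit)
print(cor(pred, y))         # how well lagged values predict the current log(rate)
```

With an L1 penalty in place of `lm` (for example a lasso fit from a package such as glmnet, if it is available), most of the 48 coefficients would be expected to shrink to zero, leaving the few lags that carry the daily pattern.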


A New Rate Predictor for Sporadic Events


Improved Prediction with Adaptive Modeling

[Figure: Christmas Prediction. x axis: date (Dec 17 to Dec 29); y axis: hits (x 1000).]


Some days the magic works

Some days ...

We use slightly different magic


Detecting More Subtle Changes

• Time-since-last finds complete failures well

• Nth order time finds more subtle rate changes

• But that subtlety delays detection of complete failure

– First order delay has 99.9% confidence at 6.5 units

– 10th order delay has 99.9% confidence at 12.5 units

• But 10th order delay can find speedups, first order cannot
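
One way to see the speedup point, assuming a constant rate normalized to one event per unit time: the time spanned by the last n intervals then follows a Gamma(n, 1) distribution, and a speedup shows up as that span being unusually short. The sketch below looks only at this lower tail; `speedup_threshold` is a name introduced here.

```r
# Span of the last n intervals, in units of the expected interval, is Gamma(n, 1).
# A speedup alarm fires when the span falls below the alpha quantile.
speedup_threshold <- function(n, alpha = 0.001) qgamma(alpha, shape = n, rate = 1)

first_order <- speedup_threshold(1)    # single interval
tenth_order <- speedup_threshold(10)   # span of ten intervals, expected value 10
print(c(first_order = first_order, tenth_order = tenth_order))
```

Under these assumptions a single interval has to be about a thousand times shorter than expected before it is significant, which is rarely resolvable in practice, while ten events only need to arrive in roughly 3 units instead of the expected 10.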


10th order difference of Poisson distribution


Finding Changes in Time Series

• So far, we only have times

• What about when we have times and measurements together?

– These are called time-series!

• First step can be to discretize the measurement

– Quintiles or deciles are good candidates

– Multi-scale discretization is a fine thing to do

• That gives us arrival times for measurements in each bin

– And this is susceptible to the rate model on previous slides
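
The discretization step could look like the sketch below, which uses simulated times and measurements. Quintile boundaries are taken from the same data for brevity; in practice they would presumably come from a reference period. All variable names are illustrative.

```r
set.seed(3)
t <- cumsum(rexp(1000, rate = 1))      # simulated measurement times
x <- rnorm(1000)                       # simulated measurement values

# Quintile boundaries, then a bin index 1..5 for every measurement
breaks <- quantile(x, probs = seq(0, 1, by = 0.2))
bin <- cut(x, breaks, include.lowest = TRUE, labels = FALSE)

arrivals <- split(t, bin)              # arrival times of measurements in each bin
gaps <- lapply(arrivals, diff)         # inter-arrival times per bin
print(sapply(gaps, mean))              # each bin sees about 1/5 of the overall rate
```

Each element of `gaps` is then an ordinary sequence of event intervals, so a shift in the measurement distribution appears as a rate change in one or more bins.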


Finding Changes in Time Series

• Comprehensive approaches also possible (for counts)

• Time aware variant of G-test is possible

vs

Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19, 1 (March 1993)

http://bit.ly/surprise-and-coincidence


Propagation Anomalies

• What happens when something shadows part of the coverage field for mobile telecom?

– Can happen in urban areas with a construction crane

• Can solve heuristically

– Subtract from reference image composed by long term averages

– Doesn’t deal well with weak signal regions and low S/N

• Can solve probabilistically

– Compute anomaly for each measurement, use mean of log(p)


Variable Signal/Noise Makes Heuristic Tricky

Far from the transmitter, received signal is dominated by noise. This makes subtraction of average value a bad algorithm.


Other Issues

• Finding changes in coverage area is similarly tricky

• Coverage area is roughly where tower signal strength is higher than neighbors

• Except for fuzziness due to hand-off delays

• Except for bias due to large-scale caller motions

– Rush hour

– Event mobs


Simple Answer for Propagation Anomalies

• Cluster signal strength reports

• Cluster locations using k-means, large k

• Model report rate anomaly using discrete event models

• Model signal strength anomaly using percentile model

• Trade larger k against higher report rates, faster detection

• Overall anomaly is sum of individual log(p) anomalies
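
The last bullet can be illustrated for the report-rate part alone. The sketch assumes nine clusters around one tower, each already reduced to a rate-scaled time since its last report, and assumes the clusters are independent; the delay values are invented.

```r
# Hypothetical rate-scaled delays (time since last report x expected rate), one per cluster
delta <- c(0.4, 1.2, 0.1, 6.8, 0.9, 7.5, 0.3, 1.1, 0.6)

p <- pexp(delta, rate = 1, lower.tail = FALSE)   # P(silence this long) per cluster
cluster_score <- -log(p)                         # individual anomaly scores
composite <- sum(cluster_score)                  # overall anomaly for the tower

# Under the no-change assumption the sum of 9 such scores is Gamma(9, 1)
p_overall <- pgamma(composite, shape = length(delta), rate = 1, lower.tail = FALSE)
print(c(composite = composite, p_overall = p_overall))
```

Two moderately quiet clusters, neither alarming alone, can push the composite score well into the tail, which is the point of summing.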


Tower Coverage Areas


Just One Tower


Cluster Reports for That Tower


Cluster Reports for That Tower

[Figure: reports for the tower grouped into clusters numbered 1 to 9.]

Can also sub-divide each cluster into signal strength ranges

Multiple scales of clustering can also be used to trade off geographic versus temporal resolution


Example

[Figure: three panels of event intervals, each with y axis dt.]

Each cluster gives us a sequence of events.

Individual anomaly scores can be scaled and added to get composite anomaly score

Optimality of combined signal derives from optimality of components.


Characterizing Distributions

• What about sequences of values from arbitrary distributions

– Can we find changes in the distribution?

– For instance, what about latencies?

• Non-linear histogram - FloatHistogram

• Fully Adaptive histogram – t-digest


FloatHistogram

• Assume all measurements are in the range

• Divide this range into power of 2 sub-ranges

• Sub-divide each sub-range evenly with steps

• Relative error is bounded in measurement space


• Bin index can be computed using FP representation!
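
The binning scheme can be sketched with ordinary arithmetic. This is not the floating point bit trick the slide refers to, only an equivalent calculation for illustration; `float_bin`, `x_min` and `steps` are names chosen here, and the range and step count in the example are arbitrary.

```r
# Bin index for values x >= x_min:
#   e    = which power-of-2 sub-range [x_min * 2^e, x_min * 2^(e+1)) holds x
#   frac = position inside that sub-range, in [0, 1)
float_bin <- function(x, x_min, steps) {
  r <- x / x_min
  e <- floor(log2(r))
  frac <- r / 2^e - 1
  e * steps + floor(frac * steps)
}

idx <- float_bin(c(1, 1.5, 3), x_min = 1, steps = 4)
print(idx)
```

Every bin in sub-range e has width x_min * 2^e / steps, so its width relative to the values it holds is at most 1 / steps, which is where the bounded relative error comes from.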


T-digest

• Or we can talk about small errors in q

• Accumulate samples, sort, merge

• Merge if k-size < 1
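
A much simplified sketch of the merge rule, not the library's algorithm. It assumes the arcsine scale function k(q) = δ (asin(2q - 1) / π + 1/2) from the t-digest paper, with δ = 10, and greedily grows each centroid over sorted samples while its k-size stays below 1. The function names are hypothetical.

```r
# Assumed scale function: steep near q = 0 and q = 1, flat in the middle
k_scale <- function(q, delta) delta * (asin(2 * q - 1) / pi + 0.5)

tdigest_sketch <- function(x, delta = 10) {
  x <- sort(x)
  n <- length(x)
  means <- numeric(0)
  weights <- numeric(0)
  cur_sum <- x[1]; cur_w <- 1      # centroid currently being built
  done <- 0                        # samples already in finished centroids
  for (v in x[-1]) {
    q_left  <- done / n
    q_right <- (done + cur_w + 1) / n
    if (k_scale(q_right, delta) - k_scale(q_left, delta) < 1) {
      cur_sum <- cur_sum + v       # merge: k-size stays below 1
      cur_w <- cur_w + 1
    } else {
      means <- c(means, cur_sum / cur_w)
      weights <- c(weights, cur_w)
      done <- done + cur_w
      cur_sum <- v                 # start a new centroid
      cur_w <- 1
    }
  }
  data.frame(mean = c(means, cur_sum / cur_w), weight = c(weights, cur_w))
}

set.seed(11)
digest <- tdigest_sketch(rexp(10000))
print(digest)
```

Centroids near the extremes end up with far fewer samples than those in the middle, which is what keeps errors in q small in the tails.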


T-digest




[Figure: k versus q. x axis: q, 0 to 1; y axis: k, 0 to 10.]


T-digest




• Interpolate using centroids in x

• Very good near extremes, no dynamic allocation



Finding Change with Histograms

• With fixed bins, we can simply count and compare counts for different bins

• Thus, histogram change reduces to count change

• Or to changes in event times
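
As a sketch of the reduction to count change, two samples can be binned with the same fixed breaks and the resulting 2 x B table scored with the LLR functions from earlier in the deck. The latencies are simulated, the breaks are arbitrary, and comparing the score to a chi-squared distribution with B - 1 degrees of freedom is the usual asymptotic approximation rather than anything specific to this talk.

```r
# Restated from the LLR slide
entropy <- function(k) -sum(k * log((k == 0) + k / sum(k)))
llr <- function(k) 2 * (entropy(rowSums(k)) + entropy(colSums(k)) - entropy(k))

set.seed(7)
before <- rexp(5000, rate = 1)      # simulated reference latencies
same   <- rexp(5000, rate = 1)      # new sample, no change
after  <- rexp(5000, rate = 0.8)    # new sample, latencies about 25% longer

breaks <- c(0, 0.25, 0.5, 1, 2, 4, Inf)    # fixed, non-linear bins
bin_counts <- function(x) as.vector(table(cut(x, breaks, include.lowest = TRUE)))

# p-value for "both samples share one histogram"
compare <- function(a, b) {
  k <- rbind(bin_counts(a), bin_counts(b))
  pchisq(llr(k), df = ncol(k) - 1, lower.tail = FALSE)
}

p_same  <- compare(before, same)
p_after <- compare(before, after)
print(c(no_change = p_same, shifted = p_after))
```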


Visualizing Histograms

• We want to detect small changes

– Consider log-scale for Y

• Non-linear bin spacing is really good for increasing counts

– Reweight by bin-width

– Changing x axis changes y axis


Good Results


Bad Results


With Better Scaling


Bad Results


With FloatHistogram


Summary

• Counts – LLR

• Events – Poisson + nth-order diffs

• Decimate in space

• Decimate in measurement space

– t-digest, FloatHistogram

• Don’t forget visualization



Q & A

