Finding Changes in Real Data
© 2017 MapR Technologies 1
Detecting Change
© 2017 MapR Technologies 2
Contact Information
Ted Dunning, PhD
Chief Application Architect, MapR Technologies
Board member, Apache Software Foundation
O’Reilly author
Twitter @ted_dunning
© 2017 MapR Technologies 3
Who We Are
• MapR Technologies
– We make a kick-ass platform for big data computing
– Support many workloads including Hadoop / Spark / HPC / Other
– Extended to allow streams and tables in basic platform
– Free for academic research / training
• Apache Software Foundation
– Culture hub for building open source communities
– Shared values around openness for contribution as well as use
– Many major projects are part of Apache
– Even more minor ones!
© 2017 MapR Technologies 4
Basic Outline
• Goal Setting
• Basic Ideas
– LLR (finding changes in counts)
– Poisson rate change detection (finding changes in events timing)
– Distribution estimation / visualization
– Labeled events and adding labels
• Free Improvisation on Themes
© 2017 MapR Technologies 7
Why Is This Practically Important
• The novice came to the master and said “something is broken”
• The master replied “What has changed?”
• And the student was enlightened
© 2017 MapR Technologies 10
The Second Student
• Another student said to the master, “I see something has
changed … something may have broken”
• The master replied, “You have no question to ask. You have no
need of enlightenment”
• And thus the student was enlightened
© 2017 MapR Technologies 11
• There are some very powerful techniques available, some only
very recently, that can make the detection of change much
easier than you might think. I will describe the practical use of
several of these techniques including t-digest, non-linear
histograms, variable rate Poisson models and combinations of
these.
© 2017 MapR Technologies 12
Comparing Counts
• Suppose we have two situations A and B, each with many
observations, nA and nB
• And some event x occurred n1A and n1B times in each situation
        x      other
  A     n1A    nA - n1A
  B     n1B    nB - n1B
© 2017 MapR Technologies 13
Comparing Counts
• Have we seen a change in the frequency of x?
• Frequency ratios?
– Breaks with small counts
• χ² test?
– Breaks with small counts
© 2017 MapR Technologies 14
Log-Likelihood Ratio Test (Root LLR)
• In R
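# Unnormalized Shannon entropy of a vector or matrix of counts.
# The (k==0) term turns 0*log(0) into 0*log(1) = 0.
entropy = function(k) {
  -sum(k*log((k==0)+(k/sum(k))))
}
# Log-likelihood ratio (G statistic) for a contingency table of counts k.
llr = function(k) {
  (entropy(rowSums(k))+entropy(colSums(k))
  -entropy(k))*2
}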
• Equal to 2 N times the mutual information, where N is the total count
© 2017 MapR Technologies 15
Spot the Anomaly
• Root LLR is roughly like standard deviations
         A       not A
B        13      1000
not B    1000    100,000
Root LLR: 0.89

         A       not A
B        1       0
not B    0       2
Root LLR: 1.95

         A       not A
B        1       0
not B    0       10,000
Root LLR: 4.51

         A       not A
B        10      0
not B    0       100,000
Root LLR: 14.29
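The four scores can be checked with a short Python transliteration of the R functions from the previous slide. This is a sketch: the name `root_llr` and the `tables` list are introduced here for illustration and are not part of the talk.

```python
import math

def entropy(counts):
    """Unnormalized entropy -sum(k * log(k / total)); zero cells contribute 0."""
    total = sum(counts)
    return -sum(k * math.log(k / total) for k in counts if k > 0)

def llr(table):
    """Log-likelihood ratio (G statistic) for a table given as a list of rows."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    cells = [k for r in table for k in r]
    return 2 * (entropy(rows) + entropy(cols) - entropy(cells))

def root_llr(table):
    """Square root of LLR; max() guards against tiny negative rounding errors."""
    return math.sqrt(max(llr(table), 0.0))

# The four 2x2 tables from the slide, in the same order.
tables = [
    [[13, 1000], [1000, 100000]],
    [[1, 0], [0, 2]],
    [[1, 0], [0, 10000]],
    [[10, 0], [0, 100000]],
]
scores = [root_llr(t) for t in tables]
```

The computed values agree with the scores on the slide to within a unit in the last printed digit. Note how the single cooccurrence in the third table scores far higher than the 13 cooccurrences in the first, because the first table's margins are large enough to explain them.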
© 2017 MapR Technologies 16
How Does it Work
Empirical fit to asymptotic
distribution is very good
© 2017 MapR Technologies 17
How Does it Work?
© 2017 MapR Technologies 18
OK, we can detect changes in counts
© 2017 MapR Technologies 19
Real-life Example
• Query: “Paco de Lucia”
• Conventional meta-data search results:
– “hombres de paco” times 400
– not much else
• Recommendation based search:
– Flamenco guitar and dancers
– Spanish and classical guitar
– Van Halen doing a classical/flamenco riff
© 2017 MapR Technologies 20
Real-life Example
© 2017 MapR Technologies 21
Example 2 - Common Point of Compromise
• Scenario:
– Merchant 0 is compromised, leaks account data during compromise
– Fraud committed elsewhere during exploit
– High background level of fraud
– Limited detection rate for exploits
• Goal:
– Find merchant 0
• Meta-goal:
– Screen algorithms for this task without leaking sensitive data
© 2017 MapR Technologies 22
Example 2 - Common Point of Compromise
skim exploit
Merchant 0
Skimmed data
Merchant n
Card data is stolen
from Merchant 0
That data is used
in frauds at other
merchants
© 2017 MapR Technologies 23
Simulation Setup
[Figure: daily counts of compromises and frauds over days 0 to 100 (x-axis: day, y-axis: count), with the compromise period and the exploit period marked]
© 2017 MapR Technologies 24
Detection Strategy
• Select histories that precede non-fraud
• And histories that precede fraud detection
• Analyze 2x2 cooccurrence of merchant n versus fraud
detection
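One way to read the strategy above, as a minimal Python sketch: for every merchant, build the 2x2 table of "merchant appears in the history" versus "history precedes a detected fraud" and score it with a signed root LLR. The history format, the function names and the toy data are all assumptions made for illustration, not the simulation used in the talk.

```python
import math
from collections import Counter

def entropy(counts):
    total = sum(counts)
    return -sum(k * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]
    return 2 * (entropy(rows) + entropy(cols) - entropy([k11, k12, k21, k22]))

def signed_root_llr(k11, k12, k21, k22):
    """Positive when the merchant is over-represented in fraud histories."""
    root = math.sqrt(max(llr(k11, k12, k21, k22), 0.0))
    over = k11 * (k21 + k22) > k21 * (k11 + k12)
    return root if over else -root

def score_merchants(histories):
    """histories: list of (merchants_seen, preceded_fraud) pairs (hypothetical format)."""
    n_fraud = sum(1 for _, fraud in histories if fraud)
    n_clean = len(histories) - n_fraud
    in_fraud, in_clean = Counter(), Counter()
    for merchants, fraud in histories:
        (in_fraud if fraud else in_clean).update(set(merchants))
    scores = {}
    for m in set(in_fraud) | set(in_clean):
        k11, k21 = in_fraud[m], in_clean[m]
        scores[m] = signed_root_llr(k11, n_fraud - k11, k21, n_clean - k21)
    return scores

# Toy data: m0 shows up in most fraud histories but few clean ones.
histories = (
    [({"m0", "m1"}, True)] * 8 + [({"m2"}, True)] * 2 +
    [({"m0", "m2"}, False)] * 5 + [({"m1", "m2"}, False)] * 45
)
scores = score_merchants(histories)
suspect = max(scores, key=scores.get)
```

The sign matters here: a merchant that is strongly under-represented in fraud histories also gets a large LLR, so ranking by the unsigned score would mix the two cases.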
© 2017 MapR Technologies 26
What about the
real world?
© 2017 MapR Technologies 27
[Figure: LLR score for real data. x-axis: number of merchants, log scale from 10^0 to 10^6; y-axis: breach score (LLR), 0 to 80. The few high-scoring points are labeled “Really truly bad guys”.]
© 2017 MapR Technologies 28
What about time?
© 2017 MapR Technologies 29
Finding Changes in Timing
• Suppose our input is events embedded in time
• Suppose we want to find changes in our input in real-time
• Waiting and counting is fine if we don’t have to react now
• We can do much better
© 2017 MapR Technologies 30
Poisson Event Rate Change
• Detection of fallout
– Time since last is very sensitive for complete failure
• Detection of change relative to reference
– Time since n-th most recent
– LLR with time
• Have to trade detection speed versus false positive rate and
size of change
• Can run multiple detectors at once
© 2017 MapR Technologies 31
Basic idea: Time interval is better than counts
© 2017 MapR Technologies 32
Sporadic Events: Finding Normal and Anomalous Patterns
• Time between events is much more usable than absolute times
• Counts don’t link as directly to probability models
• Time interval is log ρ
• This is a big deal
© 2017 MapR Technologies 33
Event Stream (timing)
• Events of various types arrive at irregular intervals
– we can assume Poisson distribution
• The key question is whether frequency has changed relative to
expected values
– This shows up as a change in interval
• Want alert as soon as possible
© 2017 MapR Technologies 34
Converting Event Times to Anomaly
[Figure: event times converted to an anomaly score, with 99.9%-ile and 99.99%-ile levels marked]
© 2017 MapR Technologies 35
In the real world, event rates often vary
© 2017 MapR Technologies 36
Time Intervals Are Key to Modeling Sporadic Events
[Figure: interval between events, dt (min), plotted against t (days) over days 0 to 4]
© 2017 MapR Technologies 38
Poisson Distribution
• Time between events is exponentially distributed
• This means that long delays are exponentially rare
• If we know λ we can select a good threshold
– or we can pick a threshold empirically
$\Delta t \sim \lambda e^{-\lambda t}$
$P(\Delta t > T) = e^{-\lambda T}$
$-\log P(\Delta t > T) = \lambda T$
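The formulas above turn directly into a threshold rule. A minimal Python sketch, with function names chosen here for illustration: given a known rate λ and an acceptable false-alarm probability per interval, solve e^(-λT) = p for T.

```python
import math

def interval_threshold(rate, tail_prob):
    """Gap length T such that P(dt > T) = tail_prob when events arrive at `rate`."""
    return -math.log(tail_prob) / rate

def surprise(rate, gap):
    """-log P(dt > gap) = rate * gap: grows linearly with the length of the silence."""
    return rate * gap

# Example: 2 events per minute, alarm on gaps that occur with probability 0.001.
threshold_minutes = interval_threshold(2.0, 1e-3)
```

Because the surprise is linear in the gap, a running "time since last event" can be compared against the threshold continuously, without waiting for a counting window to close.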
© 2017 MapR Technologies 39
After Rate Correction
[Figure: dt / rate plotted against t (days) over days 0 to 4, with 99.9%-ile and 99.99%-ile levels marked]
© 2017 MapR Technologies 40
Detecting Anomalies in Sporadic Events
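[Diagram: incoming events at times t_i pass through an n-event delay (Δn) to form δ = λ(t_i − t_{i−n}). A rate predictor driven by rate history supplies λ_t. A t-digest of δ gives the 99.97%-ile threshold t, and an alarm fires when δ > t.]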
© 2017 MapR Technologies 42
Seasonality Poses a Challenge
[Figure: Christmas Traffic. x-axis: date, Nov 17 to Dec 27; y-axis: hits / 1000, 0 to 8]
© 2017 MapR Technologies 43
Something more is needed …
[Figure: Christmas Traffic. x-axis: date, Nov 17 to Dec 27; y-axis: hits / 1000, 0 to 8]
© 2017 MapR Technologies 44
We need a better rate predictor…
[Diagram: the same detector, in which the rate predictor turns rate history into λ_t, used to scale δ = λ(t_i − t_{i−n}) before the t-digest 99.97%-ile test δ > t]
© 2017 MapR Technologies 47
Idea: Predict log(rate) from lagged log(rate)
• Predict log because
– Peak to valley ratio
– Traffic grew by 30 %
– All rates are positive
– Just because I said so
• Let model see many lagged values
• Use L1 regularized linear model to pick important historical
values
– We would have moved to something fancier if this hadn’t worked
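As a rough illustration of the idea on this slide, the sketch below regresses log(rate) on lagged log(rate) values with an L1 penalty, using plain coordinate descent in pure Python. Everything in it is an assumption made for the example (the function names, the choice of solver, the penalty value, the synthetic weekly series); the talk does not say which solver or lags were used.

```python
import math

def soft_threshold(z, t):
    """Shrink z toward zero by t (the coordinate-wise step for an L1 penalty)."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def fit_lagged_lasso(log_rate, lags, alpha, sweeps=200):
    """Fit log_rate[t] ~ b + sum_j w[j] * log_rate[t - lags[j]], minimizing
    (1 / 2n) * squared error + alpha * sum(|w|) by coordinate descent."""
    start = max(lags)
    y = log_rate[start:]
    n = len(y)
    cols = [[log_rate[t - lag] for t in range(start, len(log_rate))] for lag in lags]
    y_mean = sum(y) / n
    col_means = [sum(c) / n for c in cols]
    cols = [[v - m for v in c] for c, m in zip(cols, col_means)]  # center features
    resid = [v - y_mean for v in y]                               # residual at w = 0
    norms = [sum(v * v for v in c) for c in cols]
    w = [0.0] * len(lags)
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            if norms[j] == 0.0:
                continue
            # Correlation of column j with the residual, excluding its own term.
            rho = sum(x * r for x, r in zip(col, resid)) + w[j] * norms[j]
            new_w = soft_threshold(rho, n * alpha) / norms[j]
            if new_w != w[j]:
                step = new_w - w[j]
                resid = [r - step * x for r, x in zip(resid, col)]
                w[j] = new_w
    intercept = y_mean - sum(wj * m for wj, m in zip(w, col_means))
    return w, intercept

def predict(log_rate, lags, w, intercept, t):
    """Predicted log(rate) at index t from the values at t - lag."""
    return intercept + sum(wj * log_rate[t - lag] for wj, lag in zip(w, lags))

# Synthetic daily log(hits) with a weekly shape and slow growth (made-up numbers).
weekly = [9.0, 9.2, 9.3, 9.2, 9.1, 7.5, 7.3]
series = [weekly[d % 7] + 0.01 * d for d in range(140)]
lags = list(range(1, 15))
w, b = fit_lagged_lasso(series, lags, alpha=0.001)
next_rate = math.exp(predict(series, lags, w, b, len(series)))
```

Working in log space keeps the predicted rate positive after `exp`, and makes "30% growth" an additive shift. With a weekly pattern, the 7-day and 14-day lags nearly reproduce the target, so those are the lags an L1 penalty tends to keep; in practice a tested library solver would be a safer choice than this hand-rolled loop.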
© 2017 MapR Technologies 48
A New Rate Predictor for Sporadic Events
© 2017 MapR Technologies 49
Improved Prediction with Adaptive Modeling
[Figure: Christmas Prediction. x-axis: date, Dec 17 to Dec 29; y-axis: hits (x 1000), 0 to 8]
© 2017 MapR Technologies 50
Some days the magic works
Some days ...
We use slightly different magic
© 2017 MapR Technologies 51
Detecting More Subtle Changes
• Time-since-last finds complete failures well
• Nth order time finds more subtle rate changes
• But that subtlety delays detection of complete failure
– First order delay has 99.9% confidence at 6.5 units
– 10th order delay has 99.9% confidence at 12.5 units
• But 10th order delay can find speedups, first order cannot
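The trade-off above can be made concrete with a small sketch. Under a constant rate λ, the scaled span of n consecutive gaps, δ = λ(t_i − t_{i−n}), follows an Erlang (gamma) distribution with shape n, whose upper tail has a closed form. The class below tests both tails at each event. The class name, the fixed rate and the tail probability are assumptions for illustration; the detector in the talk uses a rate predictor and a t-digest for the threshold instead of a closed-form tail.

```python
import math
from collections import deque

def erlang_upper_tail(n, x):
    """P(sum of n unit-rate exponential gaps > x) = exp(-x) * sum_{k<n} x^k / k!."""
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= x / k
        total += term
    return math.exp(-x) * total

class NthOrderDetector:
    """Flags windows of n gaps that are improbably long ('slow') or short ('fast')."""

    def __init__(self, n, rate, tail_prob=1e-3):
        self.n, self.rate, self.tail_prob = n, rate, tail_prob
        self.times = deque(maxlen=n + 1)   # t_{i-n} ... t_i

    def observe(self, t):
        """Record an event at time t and return 'slow', 'fast' or None."""
        self.times.append(t)
        if len(self.times) <= self.n:
            return None                    # not enough history yet
        delta = self.rate * (t - self.times[0])
        upper = erlang_upper_tail(self.n, delta)
        if upper < self.tail_prob:
            return "slow"
        if 1.0 - upper < self.tail_prob:
            return "fast"
        return None

slow = NthOrderDetector(n=3, rate=1.0)
slow_results = [slow.observe(t) for t in (0, 1, 2, 3, 30)]
fast = NthOrderDetector(n=3, rate=1.0)
fast_results = [fast.observe(t) for t in (0, 0.001, 0.002, 0.003)]
```

This version only evaluates when an event arrives, so it cannot see a complete outage by itself; a timer that checks the time since the last event, as on the earlier slides, would cover that case. Running several orders side by side is cheap.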
© 2017 MapR Technologies 57
10th order difference of
Poisson distribution
© 2017 MapR Technologies 58
Finding Changes in Time Series
• So far, we only have times
• What about when we have times and measurements together?
– These are called time-series!
• First step can be to discretize the measurement
– Quintiles or deciles are good candidates
– Multi-scale discretization is a fine thing to do
• That gives us arrival times for measurements in each bin
– And this is susceptible to the rate model on previous slides
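A minimal sketch of the discretization step, assuming quintile bins computed from the data itself (a reference period could be passed instead). The function name and the synthetic series are made up for the example.

```python
import bisect
import statistics
from collections import defaultdict

def bin_arrival_times(times, values, n_bins=5, reference=None):
    """Split a time series into per-bin event streams.

    Returns the cut points and a dict mapping bin index (0 = lowest values)
    to the list of times at which a measurement fell into that bin."""
    source = reference if reference is not None else values
    cuts = statistics.quantiles(source, n=n_bins)    # n_bins - 1 cut points
    arrivals = defaultdict(list)
    for t, v in zip(times, values):
        arrivals[bisect.bisect_right(cuts, v)].append(t)
    return cuts, dict(arrivals)

# Synthetic series: 100 time steps, values are a shuffled 0..99.
times = list(range(100))
values = [(37 * i) % 100 for i in range(100)]
cuts, arrivals = bin_arrival_times(times, values)
```

Each list in `arrivals` is now an event stream in its own right, so the gaps between its entries can be fed to the interval-based detectors described earlier.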
© 2017 MapR Technologies 59
Finding Changes in Time Series
• Comprehensive approaches also possible (for counts)
• Time aware variant of G-test is possible
vs
Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), March 1993.
http://bit.ly/surprise-and-coincidence
© 2017 MapR Technologies 60
Propagation Anomalies
• What happens when something shadows part of the coverage
field for mobile telecom?
– Can happen in urban areas with a construction crane
• Can solve heuristically
– Subtract from reference image composed by long term averages
– Doesn’t deal well with weak signal regions and low S/N
• Can solve probabilistically
– Compute anomaly for each measurement, use mean of log(p)
© 2017 MapR Technologies 63
Variable Signal/Noise Makes Heuristic Tricky
Far from the transmitter,
received signal is dominated by
noise. This makes subtraction of
average value a bad algorithm.
© 2017 MapR Technologies 64
Other Issues
• Finding changes in coverage area is similarly tricky
• Coverage area is roughly where tower signal strength is higher
than neighbors
• Except for fuzziness due to hand-off delays
• Except for bias due to large-scale caller motions
– Rush hour
– Event mobs
© 2017 MapR Technologies 65
Simple Answer for Propagation Anomalies
• Cluster signal strength reports
• Cluster locations using k-means, large k
• Model report rate anomaly using discrete event models
• Model signal strength anomaly using percentile model
• Trade larger k against higher report rates, faster detection
• Overall anomaly is sum of individual log(p) anomalies
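The last bullet can be sketched in a few lines of Python. Each cluster contributes a report-rate term (the Poisson surprise λ·Δt for the current silence) and a signal-strength term (−log of a tail probability taken from a percentile model). The data layout and the idea of passing the tail probability in directly are assumptions for the example.

```python
import math

def cluster_anomaly(rate, last_report, signal_tail_prob, now):
    """-log p for one cluster: silence under a Poisson rate plus signal-strength tail."""
    rate_term = rate * (now - last_report)      # -log P(no report for this long)
    signal_term = -math.log(signal_tail_prob)   # -log P(signal at least this extreme)
    return rate_term + signal_term

def overall_anomaly(clusters, now):
    """clusters: {name: (expected_rate, last_report_time, signal_tail_prob)}."""
    parts = {name: cluster_anomaly(r, last, p, now) for name, (r, last, p) in clusters.items()}
    return sum(parts.values()), parts

clusters = {
    "c1": (2.0, 9.0, 1.0),               # recent report, unremarkable signal
    "c2": (0.5, 4.0, math.exp(-1.5)),    # long silence and weak signal
}
total, parts = overall_anomaly(clusters, now=10.0)
```

Adding the log terms corresponds to multiplying the individual probabilities, which treats the clusters as independent.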
© 2017 MapR Technologies 66
Tower Coverage Areas
© 2017 MapR Technologies 67
Just One Tower
© 2017 MapR Technologies 68
Cluster Reports for That Tower
© 2017 MapR Technologies 69
Cluster Reports for That Tower
[Figure: reports for the tower grouped into clusters numbered 1 to 9]
Can also sub-divide each cluster into signal strength ranges
Multiple scales of clustering can also be used to trade off geographic versus temporal resolution
© 2017 MapR Technologies 70
Example
[Figure: three panels, each plotting dt for one sequence of events]
Each cluster gives us a
sequence of events.
Individual anomaly scores can
be scaled and added to get
composite anomaly score
Optimality of combined signal
derives from optimality of
components.
© 2017 MapR Technologies 71
Characterizing Distributions
• What about sequences of values from arbitrary distributions
– Can we find changes in the distribution?
– For instance, what about latencies?
• Non-linear histogram - FloatHistogram
• Fully Adaptive histogram – t-digest
© 2017 MapR Technologies 73
FloatHistogram
• Assume all measurements are in the range
• Divide this range into power of 2 sub-ranges
• Sub-divide each sub-range evenly with steps
• Relative error is bounded in measurement space
• Bin index can be computed using FP representation!
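The last bullet can be illustrated with `math.frexp`, which exposes the exponent and mantissa of a float. This is a portable sketch of the idea, not the FloatHistogram implementation; the function names and the `steps` parameter are assumptions.

```python
import math

def bin_index(x, steps):
    """Bin index for x > 0: the power-of-2 range picks a block of `steps` bins,
    and the mantissa picks the evenly spaced bin inside that block."""
    m, e = math.frexp(x)                       # x = m * 2**e with 0.5 <= m < 1
    sub = min(int((2.0 * m - 1.0) * steps), steps - 1)
    return e * steps + sub

def bin_lower_edge(index, steps):
    """Smallest value that falls into bin `index`."""
    e, sub = divmod(index, steps)
    return math.ldexp(1.0 + sub / steps, e - 1)

# With 4 steps per power of 2, the range [1, 2) splits at 1.25, 1.5 and 1.75.
idx = bin_index(1.5, 4)
edges = (bin_lower_edge(idx, 4), bin_lower_edge(idx + 1, 4))
```

Every bin is at most 1/`steps` of its lower edge wide, which is where the bounded relative error comes from, and no search over bin boundaries is needed.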
© 2017 MapR Technologies 76
T-digest
• Or we can talk about small errors in q
• Accumulate samples, sort, merge
• Merge if k-size < 1
• Interpolate using centroids in x
• Very good near extremes, no dynamic allocation
[Figure: k plotted against q, for q from 0 to 1 and k from 0 to 10]
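A much simplified sketch of the sort-and-merge step and the interpolation, in Python. It assumes an arcsine scale function k(q) = δ(asin(2q − 1)/π + 1/2), which runs from 0 to δ and is steep near q = 0 and q = 1; that matches the 0 to 10 range of the curve for δ = 10 but is an assumption here, as are all function names. It is not the t-digest library's code, which also handles incremental updates and buffering.

```python
import math

def k_scale(q, delta):
    """Scale function: maps quantile q in [0, 1] to k in [0, delta]."""
    q = min(max(q, 0.0), 1.0)
    return delta * (math.asin(2.0 * q - 1.0) / math.pi + 0.5)

def merge_into_centroids(samples, delta):
    """Sort, then greedily merge neighbors while the centroid's k-size stays <= 1."""
    points = sorted(samples)
    n = len(points)
    centroids = []                       # (mean, weight), in increasing order
    weight_before = 0                    # samples in centroids already emitted
    cur_sum, cur_w = points[0], 1
    k_left = k_scale(0.0, delta)
    for x in points[1:]:
        q_right = (weight_before + cur_w + 1) / n
        if k_scale(q_right, delta) - k_left <= 1.0:
            cur_sum += x
            cur_w += 1
        else:
            centroids.append((cur_sum / cur_w, cur_w))
            weight_before += cur_w
            k_left = k_scale(weight_before / n, delta)
            cur_sum, cur_w = x, 1
    centroids.append((cur_sum / cur_w, cur_w))
    return centroids

def quantile(centroids, q):
    """Estimate a quantile by interpolating between centroid means."""
    total = sum(w for _, w in centroids)
    target = q * total
    cum, prev_mid, prev_mean = 0.0, None, None
    for mean, w in centroids:
        mid = cum + w / 2.0              # cumulative weight at the centroid's center
        if target <= mid:
            if prev_mid is None:
                return mean
            frac = (target - prev_mid) / (mid - prev_mid)
            return prev_mean + frac * (mean - prev_mean)
        prev_mid, prev_mean = mid, mean
        cum += w
    return centroids[-1][0]

centroids = merge_into_centroids(range(1000), delta=10)
median = quantile(centroids, 0.5)
```

Because k changes fastest near the ends of the q range, centroids near the extremes hold few samples and centroids near the median hold many, which is what keeps tail quantiles accurate with a small, bounded number of centroids.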
© 2017 MapR Technologies 77
Finding Change with Histograms
• With fixed bins, we can simply count and compare counts for
different bins
• Thus, histogram change reduces to count change
• Or to changes in event times
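To illustrate the reduction to count changes, the sketch below compares two histograms that share the same bins. Each bin is scored with a signed root LLR on the 2x2 table (this bin versus all other bins) x (after versus before). The function names and the example counts are made up.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(k * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]
    return 2 * (entropy(rows) + entropy(cols) - entropy([k11, k12, k21, k22]))

def bin_change_scores(before, after):
    """Signed root LLR per bin: positive when the bin's share of samples grew."""
    n_before, n_after = sum(before), sum(after)
    scores = []
    for b, a in zip(before, after):
        root = math.sqrt(max(llr(a, n_after - a, b, n_before - b), 0.0))
        grew = a * n_before > b * n_after
        scores.append(root if grew else -root)
    return scores

# Example: latency histogram where only the slowest bin gained samples.
before = [100, 200, 400, 200, 100]
after = [100, 200, 400, 200, 160]
scores = bin_change_scores(before, after)
worst_bin = max(range(len(scores)), key=scores.__getitem__)
```

The other bins come out slightly negative because their share of the total shrank, even though their raw counts did not change.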
© 2017 MapR Technologies 78
Visualizing Histograms
• We want to detect small changes
– Consider log-scale for Y
• Non-linear bin spacing is really good for increasing counts
– Reweight by bin-width
– Changing x axis changes y axis
© 2017 MapR Technologies 79
Good Results
© 2017 MapR Technologies 80
Bad Results
© 2017 MapR Technologies 81
Bad Results
© 2017 MapR Technologies 82
With Better Scaling
© 2017 MapR Technologies 83
Bad Results
© 2017 MapR Technologies 85
With FloatHistogram
© 2017 MapR Technologies 86
Summary
• Counts – LLR
• Events – Poisson + nth-order diffs
• Decimate in space
• Decimate in measurement space
– t-digest, FloatHistogram
• Don’t forget visualization
[Figures repeated from earlier slides: the sporadic-event detector diagram and the t-digest curve of k versus q]
© 2017 MapR Technologies 87
Q & A