PyData NYC 2015 - Automatically Detecting Outliers with Datadog
TRANSCRIPT
One of these things is not like the others
Automatically Detecting Outliers
Homin Lee, Data Scientist
Outline
● Monitoring
● Alerting
● Outlier vs. Anomaly Detection
● Outlier Detection Algorithms
● Our Python Implementation
Outlier Detection Algorithms
MAD: median absolute deviation
DBSCAN: density-based spatial clustering of applications with noise
Median Absolute Deviation
MAD(D) = median( { |d_i - median(D)| } )
D = { 1, 2, 3, 4, 5, 6, 100 }
median = 4
deviations = { -3, -2, -1, 0, 1, 2, 96 }
abs deviations (sorted) = { 0, 1, 1, 2, 2, 3, 96 }
MAD = 2 (std dev = 33.8)
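The worked example above can be checked with a few lines of standard-library Python. This is only a sketch of the arithmetic on the slide; the variable names are illustrative, and the population standard deviation is used because that is the variant that matches the slide's 33.8.

```python
import statistics

# Data set from the slide: six small values and one extreme value.
D = [1, 2, 3, 4, 5, 6, 100]

median = statistics.median(D)

# Signed deviations from the median, then their sorted absolute values.
deviations = [d - median for d in D]
abs_deviations = sorted(abs(x) for x in deviations)

# MAD is the median of the absolute deviations.
mad = statistics.median(abs_deviations)

# Population standard deviation, for comparison with MAD.
std_dev = statistics.pstdev(D)

print(median, mad, round(std_dev, 1))  # prints: 4 2 33.8
```

The single extreme value (100) inflates the standard deviation but leaves the MAD almost untouched, which is why MAD is the more robust measure of spread here.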
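Python
import numpy as np

# Assumed value: the slide does not define NORM_CONSTANT. 0.6745 is the usual
# factor that makes MAD comparable to a standard deviation for normal data.
NORM_CONSTANT = 0.6745

def MAD(slist, tol, pct):
    # Pool the points of all series and compute the median and MAD.
    val_array = np.concatenate(slist)
    median = np.median(val_array)
    diffs = np.abs(val_array - median)
    mad = np.median(diffs)
    # Points farther than this from the pooled median count as outlying.
    outlier_factor = tol * mad / NORM_CONSTANT
    outliers = []
    for series in slist:
        series_diffs = np.abs(series - median)
        outlier_values = series_diffs[series_diffs > outlier_factor]
        pct_outliers = 100 * (len(outlier_values) / float(len(series)))
        # Flag the series if more than pct percent of its points are outlying.
        if pct_outliers > pct:
            outliers.append(series)
    return outliers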
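As a rough illustration of how the MAD-based check above behaves, the sketch below runs the same idea on synthetic data. Everything in it is an assumption made for the example: the function name `mad_outlier_series`, the sine-wave "hosts", the `tol` and `pct` values, and the value 0.6745 for `NORM_CONSTANT`, which the slide does not state. It returns indices rather than the series themselves to keep the output readable.

```python
import numpy as np

# Assumption: the slide does not give NORM_CONSTANT; 0.6745 is the usual factor
# that makes MAD comparable to a standard deviation for normally distributed data.
NORM_CONSTANT = 0.6745


def mad_outlier_series(slist, tol, pct):
    """Return indices of series with more than `pct` percent of points
    farther than tol * MAD / NORM_CONSTANT from the pooled median."""
    values = np.concatenate(slist)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    threshold = tol * mad / NORM_CONSTANT
    flagged = []
    for i, series in enumerate(slist):
        far = np.abs(series - median) > threshold
        if 100.0 * far.mean() > pct:
            flagged.append(i)
    return flagged


# Hypothetical data: four hosts oscillating around 10, one host around 50.
t = np.linspace(0, 2 * np.pi, 60)
hosts = [10 + np.sin(t + k) for k in range(4)]
hosts.append(50 + np.sin(t))

print(mad_outlier_series(hosts, tol=3.0, pct=50.0))  # prints: [4]
```

Because the median and MAD are computed over the pooled points of all hosts, the one shifted host barely moves either statistic, so nearly all of its points land beyond the threshold while the other hosts stay within it.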
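Python
import numpy as np
import scipy.spatial.distance
import sklearn.cluster

NORM_CONSTANT = 0.6745  # assumed value; not given on the slide

def DBSCAN(slist, tol):
    # One row per series; all series must have the same length.
    values_array = np.array(slist)
    median_series = np.median(values_array, axis=0)
    # Distance of each series from the pointwise median series.
    dists = scipy.spatial.distance.cdist(values_array, np.array([median_series]))
    eps = tol * np.median(dists) / NORM_CONSTANT
    db_scan = sklearn.cluster.DBSCAN(min_samples=1, eps=eps)
    db_labels = db_scan.fit_predict(values_array)
    # Series outside the largest cluster are reported as outliers.
    most = np.argmax(np.bincount(db_labels))
    return [slist[i] for i, l in enumerate(db_labels) if l != most]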
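The DBSCAN approach above treats each whole time series as one point and looks for series that fall outside the largest cluster. The following is a minimal sketch of that idea using scikit-learn's `DBSCAN`. The synthetic hosts and the multiplier 3.0 are assumptions for illustration, and `eps` is scaled here without the slide's `NORM_CONSTANT`, whose value is not given.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: each row is one host's series. Four hosts follow the same
# curve with small offsets; the fifth is shifted far above the others.
t = np.linspace(0, 2 * np.pi, 60)
base = 10 + np.sin(t)
hosts = np.array([base + 0.1 * k for k in range(4)] + [base + 40.0])

# Distance of each series from the pointwise median series.
median_series = np.median(hosts, axis=0)
dists = np.linalg.norm(hosts - median_series, axis=1)

# Assumption: eps is a fixed multiple of the median distance.
eps = 3.0 * np.median(dists)

# With min_samples=1 every point is a core point, so no series is labelled
# noise (-1); series within eps of one another chain into the same cluster.
labels = DBSCAN(eps=eps, min_samples=1).fit_predict(hosts)

# Everything outside the largest cluster is reported as an outlier.
largest = np.argmax(np.bincount(labels))
outliers = [i for i, label in enumerate(labels) if label != largest]
print(outliers)  # prints: [4]
```

Setting `min_samples=1` is what makes `np.bincount` safe here, since it guarantees there are no negative labels; the outliers are then simply the members of the smaller clusters.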
DASHBOARDS
Build Real-Time Interactive Dashboards
CORRELATION
Search And Correlate Metrics And Events
See It All In One Place
Your Servers, Your Clouds, Your Metrics, Your Apps, Your Team. Together.
COLLABORATION
Share What You Saw, Write What You Did
METRIC ALERTS
Get Alerted On Critical Issues
DEVELOPER API
Instrument Your Apps, Write New Integrations
Flexible Pricing
To Match Your Dynamic Infrastructure.
Free
● Up to 5 Hosts
● 1 Day retention
● Custom metrics and events
● Discussion group supported
Pro
● Up to 500 Hosts
● $15 Per Host / Month
● 13 Month retention
● Custom metrics and events
● Metric alerts*
● Email supported
Enterprise
● 500+ Hosts
● Contact us for pricing: +1 866 329 [email protected]
● Customized retention
● Custom metrics and events
● Metric alerts*
● Email and phone supported