pydata nyc 2015 - automatically detecting outliers with datadog

One of these things is not like the others: Automatically Detecting Outliers. Homin Lee, Data Scientist

Upload: datadog

Post on 20-Mar-2017


One of these things is not like the others

Automatically Detecting Outliers

Homin Lee, Data Scientist

Outline

● Monitoring
● Alerting
● Outlier vs. Anomaly Detection
● Outlier Detection Algorithms
● Our Python Implementation

Monitor Everything

Datadog gathers performance data from all your application components.

Monitor Everything?

Alerting

Alerting?

Outlier Detection

Outliers vs. Anomalies

Outlier Detection Algorithms

MAD: median absolute deviation

DBSCAN: density-based spatial clustering of applications with noise

Robust Outlier Detection Algorithms

Median Absolute Deviation

MAD(D) = median( { |d_i - median(D)| } )

D = { 1, 2, 3, 4, 5, 6, 100 }
median = 4

deviations = { -3, -2, -1, 0, 1, 2, 96 }
abs deviations (sorted) = { 0, 1, 1, 2, 2, 3, 96 }

MAD = 2 (std dev = 33.8)
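
The worked example above can be checked in a few lines. This is a minimal sketch using only Python's standard library; the helper name mad is an illustration, not something taken from the talk. It uses the population standard deviation, which is the variant that reproduces the 33.8 on the slide.

```python
from statistics import median, pstdev


def mad(values):
    """Median absolute deviation: the median of |d_i - median(D)|."""
    center = median(values)
    return median(abs(v - center) for v in values)


D = [1, 2, 3, 4, 5, 6, 100]

print(median(D))            # 4
print(mad(D))               # 2
print(round(pstdev(D), 1))  # 33.8 (population standard deviation)
```

The single value 100 drags the standard deviation up to 33.8 while the MAD stays at 2, which is why the talk files MAD under robust measures of spread.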

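Median Absolute Deviation

Parameters: tolerance, pct

tolerance: how many normalized MADs a point must lie from the median to count as outlying (the slide's figure marks tol. = 3.0)
pct: the percentage of a series' points that must be outlying before the whole series is flagged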
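
To make the two parameters concrete, here is a rough sketch that applies a tolerance of 3.0 to a single series. Two things here are assumptions rather than facts from the slides: the value 0.6745 for NORM_CONSTANT (the slides use the name but never give its value; 0.6745 is the conventional factor relating MAD to the standard deviation of normal data), and the helper name outlying_points. The talk's own implementation pools several series before taking the median; this sketch looks at one series only.

```python
from statistics import median

# Assumption: the slides never state this value. 0.6745 is the conventional
# factor that makes MAD comparable to a standard deviation for normal data.
NORM_CONSTANT = 0.6745


def outlying_points(values, tol):
    """Return the points farther than tol normalized MADs from the median."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    threshold = tol * mad / NORM_CONSTANT
    return [v for v in values if abs(v - center) > threshold]


D = [1, 2, 3, 4, 5, 6, 100]
flagged = outlying_points(D, tol=3.0)

# Share of the series that is outlying; this is what gets compared to pct.
pct_outlying = 100 * len(flagged) / len(D)

print(flagged)  # [100]
```

With these numbers the threshold is about 8.9, so only 100 is flagged, and roughly 14% of the series is outlying. A pct setting of 10 would flag the series; a setting of 20 would not.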

DBSCAN

Parameters: epsilon, min_samples

[Figure with labels 1, d, d/2, d/4, 3d/4]

epsilon ~ median(dist from median series) × tolerance

MAD or DBSCAN?

Some subtleties

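Python

import numpy as np

# Assumption: the slide does not define NORM_CONSTANT. 0.6745 is the usual
# factor relating MAD to the standard deviation of normal data.
NORM_CONSTANT = 0.6745

def MAD(slist, tol, pct):
    # Pool the values of every series to get a global median and MAD.
    val_array = np.concatenate(slist)
    median = np.median(val_array)
    diffs = np.abs(val_array - median)
    mad = np.median(diffs)
    # A point is outlying if it lies more than tol normalized MADs from the median.
    outlier_factor = tol * mad / NORM_CONSTANT
    outliers = []
    for series in slist:
        series_diffs = np.abs(series - median)
        outlier_values = series_diffs[series_diffs > outlier_factor]
        # Flag the series if more than pct percent of its points are outlying.
        pct_outliers = 100 * (len(outlier_values) / float(len(series)))
        if pct_outliers > pct:
            outliers.append(series)
    return outliers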

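Python

import numpy as np
import scipy.spatial.distance
import sklearn.cluster

# NORM_CONSTANT is a normalization constant whose value is not given on this slide.

def DBSCAN(slist, tol):
    # One row per series; all series must have the same length.
    values_array = np.array(slist)
    median_series = np.median(values_array, axis=0)
    # Euclidean distance from each series to the pointwise median series.
    dists = scipy.spatial.distance.cdist(values_array, np.array([median_series]))
    eps = tol * np.median(dists) / NORM_CONSTANT
    # min_samples=1 makes every series a core point, so no label is -1.
    db_scan = sklearn.cluster.DBSCAN(min_samples=1, eps=eps)
    db_labels = db_scan.fit_predict(values_array)
    # Every series outside the largest cluster is an outlier.
    most = np.argmax(np.bincount(db_labels))
    return [slist[i] for i, l in enumerate(db_labels) if l != most]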

Thanks!

Appendix

DASHBOARDS: Build Real-Time Interactive Dashboards

CORRELATION: Search And Correlate Metrics And Events

See It All In One Place: Your Servers, Your Clouds, Your Metrics, Your Apps, Your Team. Together.

COLLABORATION: Share What You Saw, Write What You Did

METRIC ALERTS: Get Alerted On Critical Issues

DEVELOPER API: Instrument Your Apps, Write New Integrations

Flexible Pricing: To Match Your Dynamic Infrastructure.

Free: Up to 5 Hosts
● 1 Day retention
● Custom metrics and events
● Discussion group supported

Pro: Up to 500 Hosts
● $15 Per Host / Month
● 13 Month retention
● Custom metrics and events
● Metric alerts*
● Email supported

Enterprise: 500+ Hosts
● Contact us for pricing: +1 866 329 [email protected]
● Customized retention
● Custom metrics and events
● Metric alerts*
● Email and phone supported