Monitoring Anomalies in Experimentation Platform

Deepak Vasthimal – MTS @ eBay
Connect with me: https://www.linkedin.com/in/whatisdeepak
Available on eBay Tech Blog: https://goo.gl/6bUbE9

Post on 26-Jan-2017


Overview of Experimentation (A/B Test) Platform

• A/B testing compares different experiences and measures their performance.

• Variations can be in UI, components, algorithms, etc.

• Measures bankable metrics and non-bankable metrics (activity click rates).

• Enables data-driven decisions.

• Avoids releasing features to over 150 million users based on intuition alone.

Monitoring Anomalies in the Experimentation Platform

Experimentation Reporting

• 1500+ experiments.

• Migrated from Teradata/SQL to Hadoop using Scala.

• Processes hundreds of TBs of data daily on a Hadoop cluster with around 400 M/R jobs.

• 200+ metrics are generated daily using a batch system.

• Built using a mix of open source technologies like Scala, Scoobi, Hive, Hadoop and proprietary tech like Teradata and MicroStrategy.

High Level Flow

Anomalies with Experiments

• Traffic corruption – Traffic between test and control is skewed by UID/GUID.

• Tag corruption – Data is lost or corrupted during logging and transfer to HDFS.

• GUID reset – Browser cookie.

• Cache refresh – eBay application servers maintain caches of experiment configurations. A software or hardware glitch can corrupt the cache.
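
The slides do not say how skew between test and control is detected. As one possible illustration, the sketch below flags traffic corruption with a chi-square test on the observed split. The function names, the 50/50 expected split, and the 0.001 threshold are all assumptions made for this example, not details from the talk.

```python
import math


def traffic_skew_p_value(test_users, control_users, expected_test_share=0.5):
    """P-value of a chi-square test (1 degree of freedom) that the observed
    test/control split matches the expected share. Hypothetical helper."""
    if not 0 < expected_test_share < 1:
        raise ValueError("expected_test_share must be between 0 and 1")
    total = test_users + control_users
    if total == 0:
        raise ValueError("no traffic observed")
    expected_test = total * expected_test_share
    expected_control = total - expected_test
    chi2 = ((test_users - expected_test) ** 2 / expected_test
            + (control_users - expected_control) ** 2 / expected_control)
    # For 1 degree of freedom, P(X >= chi2) equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def is_traffic_corrupted(test_users, control_users,
                         expected_test_share=0.5, alpha=0.001):
    """Flag the experiment when the split is very unlikely under the
    expected share. The alpha threshold is an assumed value."""
    p_value = traffic_skew_p_value(test_users, control_users, expected_test_share)
    return p_value < alpha


# Made-up counts of distinct GUIDs per variant:
print(is_traffic_corrupted(50_000, 50_300))  # small imbalance, not flagged
print(is_traffic_corrupted(50_000, 53_000))  # large imbalance, flagged
```

A strict threshold such as this keeps false alarms rare when the check runs daily across many experiments; the right value depends on the platform.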

Monitoring Anomalies

• Identify and categorize anomalies within experiments using Teradata/Hive.

• Store identified anomalies in HDFS and route them to InfluxDB (TSDB).

• Visualize using Grafana.

• I was introduced to Grafana through a SpaceX tweet by Torkel.
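
To make the "route to InfluxDB" step concrete, here is a minimal sketch that formats one anomaly record in InfluxDB line protocol (`measurement,tags fields timestamp`), the text format InfluxDB accepts on writes. The measurement name, tag keys, and field names are hypothetical; the talk does not describe the actual schema.

```python
def _escape_key(value):
    # Tag keys, tag values, and field keys escape commas, equals signs, and spaces.
    return str(value).replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _escape_measurement(name):
    # Measurement names escape commas and spaces.
    return str(name).replace(",", "\\,").replace(" ", "\\ ")


def _format_field(value):
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"       # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'        # strings are double-quoted


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line-protocol record; timestamp is in nanoseconds."""
    if not fields:
        raise ValueError("at least one field is required")
    tag_part = "".join(
        f",{_escape_key(k)}={_escape_key(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape_key(k)}={_format_field(v)}" for k, v in fields.items()
    )
    return f"{_escape_measurement(measurement)}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical anomaly record for one experiment:
line = to_line_protocol(
    "experiment_anomaly",
    {"experiment_id": "12345", "type": "traffic_corruption"},
    {"count": 3},
    1485388800000000000,
)
print(line)
# experiment_anomaly,experiment_id=12345,type=traffic_corruption count=3i 1485388800000000000
```

Keeping the anomaly category as a tag rather than a field would let a Grafana panel group and filter by it, since InfluxDB indexes tags.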

Reasons we chose Grafana

• Visually pleasing graphs.

• Easy setup.

• Built-in query editor (SQL & UI).

• Instantly change dashboards using the duplicate dashboards/panels feature.

• InfluxDB was chosen because it can be set up in minutes, compared to Graphite/Prometheus.

• The entire pipeline took a couple of days.

Home Page

DrillDown

Search

Scale

• InfluxDB (v 0.11-1) is installed on a single node with 45 GB of memory.

• Grafana (v 3.0.2) is installed on a single node with 45 GB of memory.

• 2,000 points are ingested daily, which is minuscule.

• There are currently around 10 months of historical data.
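
A quick back-of-the-envelope calculation shows why this volume is small. The 30-day month is an assumption made for the estimate, and the helper names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def total_points(points_per_day, months, days_per_month=30):
    """Rough count of stored points; days_per_month=30 is an assumption."""
    return points_per_day * months * days_per_month


def average_points_per_second(points_per_day):
    """Average ingest rate if the daily points were spread evenly."""
    return points_per_day / SECONDS_PER_DAY


print(total_points(2000, 10))           # 600000 points over ~10 months
print(average_points_per_second(2000))  # well under one point per second
```

For comparison, the Elasticsearch monitoring use case mentioned later in the deck ingests 10k+ points per second.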

Grafana <3 is spreading.

• Performance analysis of the Experimentation platform's MapReduce jobs using job counters.

• Monitoring an Elasticsearch cluster (10k+ points per second).

• Anomaly detection in Tracking data (10k+ points per minute).

• Each use case stores data in InfluxDB.

Thank You
