BigML Inc
Today’s Webinar
• Speaker: Poul Petersen, CIO
• Moderator: Andrew Shikiar, VP Business Development
• Enter questions into chat box – we’ll answer some via text; others at the end of the session
• For direct follow-up, email us at [email protected]
Agenda
• What’s New
• Anomaly Detection
• Coming Soon
• Questions
Model Clusters
[Figure: clusters numbered 1 to 7]

Spicy  Body  Nutty
5.1    3.5   1.4
5.7    2.6   3.5
6.7    2.5   5.8
…      …     …

Spicy  Body  Nutty  In Cluster 5?
5.1    3.5   1.4    TRUE
5.7    2.6   3.5    FALSE
6.7    2.5   5.8    TRUE
…      …     …      …

Use models to discover rules that describe clusters
Model Clusters
• Dataset of 86 whiskies
• Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.
GOAL: Cluster the whiskies by flavor profile, then discover rules that distinguish the clusters from each other.
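The relabeling step behind the "In Cluster 5?" column can be sketched in a few lines of Python. This is only an illustration of the idea, not BigML's implementation: the function name `label_cluster_membership` and the cluster ids in `assignments` are hypothetical, while the three (Spicy, Body, Nutty) rows are the ones shown on the slide.

```python
def label_cluster_membership(rows, assignments, target_cluster):
    """Append a boolean 'in target cluster?' label to each row."""
    if len(rows) != len(assignments):
        raise ValueError("rows and assignments must have the same length")
    return [list(row) + [cluster == target_cluster]
            for row, cluster in zip(rows, assignments)]


# (Spicy, Body, Nutty) rows taken from the slide.
rows = [(5.1, 3.5, 1.4), (5.7, 2.6, 3.5), (6.7, 2.5, 5.8)]
# Hypothetical cluster ids, chosen so that rows 1 and 3 fall in cluster 5.
assignments = [5, 2, 5]

labeled = label_cluster_membership(rows, assignments, target_cluster=5)
for row in labeled:
    print(row)
```

The labeled rows form an ordinary supervised dataset, so a model trained to predict the new boolean column yields rules that describe what sets that cluster apart.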
Missing Splits
Real World Data … is messy
• Define missing tokens: N/A, Null, etc.
• Filter out missing values
• Add a new feature to replace missing values
• Default numeric values in cluster
• Proportional prediction for missing input data
• Allow splits on missing values
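Two of the options listed above, defining missing tokens and allowing splits on missing values, can be sketched as follows. This is an illustrative sketch only, not BigML's implementation: the names `MISSING_TOKENS`, `parse_numeric`, `goes_left`, and `partition`, as well as the sample values, are hypothetical.

```python
# Hypothetical set of tokens that should be read as "missing".
MISSING_TOKENS = {"N/A", "Null", ""}


def parse_numeric(token):
    """Return a float, or None when the token marks a missing value."""
    token = token.strip()
    return None if token in MISSING_TOKENS else float(token)


def goes_left(value, threshold, missing_goes_left):
    """Route one value at a numeric split that also decides where missing goes."""
    if value is None:
        return missing_goes_left
    return value <= threshold


def partition(values, threshold, missing_goes_left):
    """Split values into (left, right) branches, keeping missing values."""
    left = [v for v in values if goes_left(v, threshold, missing_goes_left)]
    right = [v for v in values if not goes_left(v, threshold, missing_goes_left)]
    return left, right


raw = ["1.0", "N/A", "3.5", "Null", "0.2"]
values = [parse_numeric(t) for t in raw]
left, right = partition(values, threshold=1.0, missing_goes_left=True)
print(left, right)
```

Because the split itself decides which branch receives missing values, rows with gaps do not have to be filtered out or imputed before modeling.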
Online Predictions
• Single predictions
• Computed in real-time using browser JS
• JS will be open sourced
• Available for models, ensembles, and clusters
Fast(er) Ensembles
Old                    New                    Savings
n * [ F + T + M + S ]  F + T + n * [ M + S ]  ( n - 1 ) * [ F + T ]

• Fetch Dataset: “F” secs
• Transform Dataset: “T” secs
• Model Dataset: “M” secs
• Store Model: “S” secs
• Number of Models: “n”

[Figure: Time plotted against Number of Models “n”]
Insight: if the dataset fits in memory, we can perform the fetch and transform steps once and model quickly in memory
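The three formulas can be checked with a small calculation. The sketch below simply evaluates them; the function names and the timing values (n = 10, F = 30, T = 20, M = 5, S = 1 seconds) are made up for illustration and are not measurements from BigML.

```python
def old_time(n, F, T, M, S):
    """Old approach: fetch, transform, model, and store once per model."""
    return n * (F + T + M + S)


def new_time(n, F, T, M, S):
    """New approach: fetch and transform once, then model and store n times."""
    return F + T + n * (M + S)


def savings(n, F, T):
    """Time saved by skipping the repeated fetch and transform steps."""
    return (n - 1) * (F + T)


# Hypothetical timings in seconds.
n, F, T, M, S = 10, 30, 20, 5, 1
print(old_time(n, F, T, M, S))  # 560
print(new_time(n, F, T, M, S))  # 110
print(savings(n, F, T))         # 450
```

The savings grow linearly with the number of models, so the benefit is largest for big ensembles whose dataset is expensive to fetch and transform.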
Anomaly Detection
An unsupervised algorithm to find unusual data quickly and easily
Learning Tasks
Cluster (Unsupervised Learning)
• Provide: unlabeled data
• Learning Task: group data by similarity
Anomalies (Unsupervised Learning)
• Provide: unlabeled data
• Learning Task: rank data by dissimilarity
Trees (Supervised Learning)
• Provide: labeled data
• Learning Task: be able to predict label
Learning Tasks

sepal length  sepal width  petal length  petal width  species
5.1           3.5          1.4           0.2          setosa
5.7           2.6          3.5           1.0          versicolor
6.7           2.5          5.8           1.8          virginica
…             …            …             …            …

Inputs “X”: sepal length, sepal width, petal length, petal width. “Y”: species.
Learning Task: Find function “f” such that: f(X)≈Y

sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …

Learning Task: Find “k” clusters such that the data in each cluster is self similar

sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …

Learning Task: Assign value from 0 (similar) to 1 (dissimilar) to each instance.
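Anomalies
Isolation Forest:
Grow a random decision tree until each instance is in its own leaf
[Figure: a random tree with its Depth marked; an instance that is “easy” to isolate ends up in a shallow leaf, one that is “hard” to isolate ends up in a deep leaf]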
Now repeat the process several times and use average Depth to compute anomaly score: 0 (similar) -> 1 (dissimilar)
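The idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not BigML's implementation: it follows only the branch that contains the instance of interest instead of growing the whole tree, it uses the full dataset with no subsampling, and it turns average depth into a score with the normalization from the original Isolation Forest paper, which the slide does not specify. All function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def isolation_depth(point, data, rng):
    """Count the random splits needed before `point` is alone in its branch."""
    rows, depth = data, 0
    while len(rows) > 1:
        # Only fields on which the remaining rows still differ can split them.
        fields = [i for i in range(len(point))
                  if min(r[i] for r in rows) != max(r[i] for r in rows)]
        if not fields:
            break  # exact duplicates cannot be separated any further
        f = rng.choice(fields)
        lo, hi = min(r[f] for r in rows), max(r[f] for r in rows)
        split = rng.uniform(lo, hi)
        if split == lo:
            continue  # degenerate split that separates nothing; draw again
        if point[f] < split:
            rows = [r for r in rows if r[f] < split]
        else:
            rows = [r for r in rows if r[f] >= split]
        depth += 1
    return depth


def expected_depth(n):
    """Average path length of an unsuccessful binary search tree lookup."""
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def anomaly_score(point, data, n_trees=100, seed=0):
    """Score close to 1 for easily isolated points, lower for typical ones."""
    if len(data) < 2:
        raise ValueError("need at least two instances")
    rng = random.Random(seed)
    mean_depth = sum(isolation_depth(point, data, rng)
                     for _ in range(n_trees)) / n_trees
    return 2.0 ** (-mean_depth / expected_depth(len(data)))


# Made-up data: a tight grid of 20 points plus one far-away instance.
inliers = [(x * 0.1, y * 0.1) for x in range(5) for y in range(4)]
outlier = (10.0, 10.0)
data = inliers + [outlier]

inlier_score = anomaly_score(inliers[9], data)
outlier_score = anomaly_score(outlier, data)
print(inlier_score, outlier_score)
```

The far-away instance is usually cut off by the very first split, so its average depth is small and its score is high, while a point inside the grid needs several splits and scores lower.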
Workflow

Clusters
• DATASET → CLUSTER (cluster)
• CLUSTER + INSTANCE → CENTROID (centroid)
• CLUSTER + DATASET → CSV (batchcentroid)

Anomalies
• DATASET → ANOMALY (anomaly)
• ANOMALY + INSTANCE → ANOMALY SCORE (anomalyscore)
• ANOMALY + DATASET → CSV (batchanomalyscore)
Use Cases
• Unusual instance discovery
• Intrusion Detection
• Fraud
• Identify Incorrect Data
• Remove Outliers
• Model Competence / Input Data Drift
Anomalies
• High dimensions - 10,000 fields
• Mixed data:
• numerical: 3.4
• categorical: red, green, blue
• date time: 2014-05-14T12:34:56
• unstructured text: “The quick brown fox…”
• Computing anomaly score for new data
• Using anomaly detectors programmatically
Coming
Coming Soon
• Config panel for anomaly detection
• Project Management
• In-memory sample server
• Dynamic scatterplots
FEEDBACK
TWITTER: @bigmlcom
Get Started Today!
RESOURCES
Join us for future webinars & hangouts