Operationalizing Security Data Science for the Cloud: Challenges, Solutions, and Trade-offs
TRANSCRIPT
Choosing the Learner
Binary Classification
Regression
Multiclass Classification
Unsupervised
Ranking
Anomaly Detection
Collaborative Filtering
Sequence Prediction
Reinforcement Learning
Representation Learning
Choosing the Learning Task
• Binary Classification
• Anomaly Detector
• Ranking
Defining Data Input
• Data Loaders (text, binary, SVMlight, transpose loader)
• Data type
Applying Data Transforms
• Cleaning missing data
• Dealing with categorical data
• Dealing with text data
• Data normalization
Choosing the Learner
• Binary Classification
• Regression
• Multiclass
• Unsupervised
• Ranking
• Anomaly Detection
• Collaborative Filtering
• Sequence Prediction
Choosing Output
• Save the features of a model?
• Save the model as text?
• Save the model as binary?
• Save the per-instance results?
Choosing Run Options
• Run locally?
• Run distributed on an HPC cluster?
• Are all paths in the experiment node-accessible?
• Priority?
• Max concurrent processes?
View Results
• Too large? Work with a sample
• Right size? Load the data
• Histogram: per feature, sampled instances
Debug and Visualize Errors
• Error in the data
• Error in the learner
• Error in the optimizer
• Error in the experimentation setup
Analyze Model Predictions
• Root-cause analysis
• Grading
Operationalizing Security Data Science
Ram Shankar Siva Kumar (@ram_ssk) Andrew Wicker
Microsoft
Security Data Science Projects are different
• Traditional programming projects: spec/prototype, implement, ship
• Data science projects: at each stage, relabel, refeaturize, retrain
• With data-driven features, all components drift:
  • Learner: more accurate, faster, lower memory footprint, …
  • Features: there are always better ones
  • Data: all distributions drift
• Security projects: at each stage, assess the threat, build detections, respond. All components drift:
  • Threat: new attacks constantly come out
  • Detection: newer log sources
  • Response: better tooling, newer TSGs (troubleshooting guides)
So wait…when do we ship??
You ship when your solution is operational
Security Experts
Engineers
Legal
Service Engineers
Product Managers
Machine Learning Experts
Operational is more than your “model is working”…
Detect unusual user activity to prevent data exfiltration
Detect unusual user activity using Application logs, with false positive rate < 1%, for all Azure customers, in near real-time
Detect unusual user activity using Application logs, with false positive rate < 1%, for all Azure customers, in near real-time
=> The Problem => Data => Model Evaluation => Model Deployment => Model Scale-out
Operationalize Security Data Science: Components
Model Evaluation: How do you know your system works?
Model Evaluation Metrics
• E.g., false positive rate
• Makes your customer (and, ergo, your business) happy
• How to measure this?
Model Usage Metrics
• E.g., call rate: how much is the model in use?
• Makes your division happy
• Collected by your pipeline after deployment
Model Validation Metrics
• E.g., MSE, reconstruction error, …
• How well does the model generalize?
• Makes the data scientist happy
• Comes pre-built with ML frameworks (scikit-learn, CNTK)
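The validation metrics in the last group come pre-built with common ML frameworks. As a minimal sketch, assuming a binary classifier and hypothetical label/score arrays, scikit-learn can produce both the false positive rate the customer cares about and the validation metrics the data scientist cares about:

```python
# Minimal sketch: computing evaluation/validation metrics with scikit-learn.
# The labels and scores below are hypothetical placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix, mean_squared_error, roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])        # ground-truth labels (1 = malicious)
y_pred = np.array([0, 1, 1, 0, 1, 0, 0, 0])        # hard predictions from the model
y_score = np.array([0.1, 0.7, 0.9, 0.2, 0.8, 0.3, 0.1, 0.4])  # raw model scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
false_positive_rate = fp / (fp + tn)                # the metric the customer cares about
mse = mean_squared_error(y_true, y_score)           # typical validation metric
auc = roc_auc_score(y_true, y_score)                # how well the model separates classes

print(f"FPR={false_positive_rate:.3f}  MSE={mse:.3f}  AUC={auc:.3f}")
```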
Model Evaluation: How to gather an evaluation dataset?
• Good: Use benchmark datasets
  • List of curated datasets: www.secrepo.com
  • Con: remember, attackers have them too!
• Better: Use previous Indicators of Compromise (IOCs)
  • Honeypots, commercial IOC feeds
  • Steps: gather confirmed IOCs, then "backprop" them through the generated alerts; this lets you calculate false positives and false negatives (see the sketch below)
• Best: Curate your own dataset
(Arrow label: the options become more specialized from Good to Best.)
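A minimal sketch of the "backprop the IOCs" step, assuming a hypothetical alert list and an IOC set keyed by an indicator value such as an IP or file hash: alerts matching a confirmed IOC are true positives, unmatched alerts are candidate false positives, and IOCs that never produced an alert are false negatives.

```python
# Minimal sketch: scoring generated alerts against confirmed IOCs.
# The field name ("indicator") and the data are hypothetical.
confirmed_iocs = {"203.0.113.7", "deadbeef0badf00d", "198.51.100.23"}

generated_alerts = [
    {"alert_id": 1, "indicator": "203.0.113.7"},
    {"alert_id": 2, "indicator": "192.0.2.55"},       # not a confirmed IOC
    {"alert_id": 3, "indicator": "deadbeef0badf00d"},
]

alerted_indicators = {a["indicator"] for a in generated_alerts}

true_positives = [a for a in generated_alerts if a["indicator"] in confirmed_iocs]
false_positives = [a for a in generated_alerts if a["indicator"] not in confirmed_iocs]
false_negatives = confirmed_iocs - alerted_indicators  # IOCs the model never alerted on

print(f"TP={len(true_positives)}  FP={len(false_positives)}  FN={len(false_negatives)}")
```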
Curating your own dataset, option 1: Inject fake malicious data
• How: label synthetic data as "eviluser" and check whether "eviluser" pops to the top of the reports every day
• Pro: low overhead; you don't have to depend on a red team to test your detection
• Con: the injected data may not be representative of true attacker activity
(Diagram: synthetic data is injected into storage, scored by the model, and flows to the alerting system.)
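A minimal sketch of the daily injection check, under the assumption that the report is a ranked list of (user, score) pairs and the synthetic records are labeled "eviluser":

```python
# Minimal sketch: verify the injected "eviluser" surfaces near the top of the daily report.
# The report contents and the top-N threshold are hypothetical.
daily_report = [                     # ranked (user, anomaly score), highest first
    ("eviluser", 0.97),
    ("alice",    0.41),
    ("bob",      0.22),
]

TOP_N = 10
top_users = [user for user, _ in daily_report[:TOP_N]]

if "eviluser" in top_users:
    print("Injection detected: the pipeline works end to end.")
else:
    print("ALERT: synthetic malicious data did not surface; investigate the pipeline.")
```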
Curating your own dataset, option 2: Employ commonly used attacker tools
• How: spin up a malicious process using Metasploit, PowerSploit, or Veil in your environment, then look for traces in your logs
• Pro: easy to implement; with a little tutorial, your development team can run the tool and generate attack data in the logs
• Con: the machine learning system will only learn to detect known attacker toolkits and will not generalize over the attack methodology
(Diagram: tainted data lands in storage, is scored by the model, and feeds the alerting system.)
Curating your own dataset, option 3: Red team pentests your environment
• How: a red team attacks the system, and we collect the logs from the attacks as tainted data
• Pro: the closest technique to real-world attacks
• Con: red team engagements are point-in-time exercises, and expensive
(Diagram: tainted data lands in storage, is scored by the model, and feeds the alerting system.)
Growing your dataset: Generative Adversarial Networks
Source: https://medium.com/@devnag/generative-adversarial-networks-gans-in-50-lines-of-code-pytorch-e81b79659e3f#.djcfc6eo0
Source: http://www.evolvingai.org/ppgn
Model Deployment: Tailoring alerts based on customers' geographic location
Azure has data centers all around the world!
Localization affects model building
• Privacy laws vary across the board
  • An IP address is treated as EII in some regions, but not in others
• "Anyone logging into the corporate network at midnight during the weekend is anomalous"
  • Weekend in the Middle East != weekend in the Americas
  • Seasonality varies
Option 1: Shotgun Deployment
• How: deploy the same model code across different regions
• Pros:
  • Easy deployment
  • Uniform metrics
  • A single TSG to debug all service incidents
• Cons:
  • Lose macro trends in favor of micro trends
  • Model/region incompatibility
(Diagram: the same model deployed to Region 1, Region 2, and Region 3.)
Option 2: Tiered Modeling
• How:
  • Federated models: each region is modeled separately
  • Results are scrubbed according to compliance laws and privacy agreements
  • Scrubbed results are used as input to "Model Prime"
  • Model Prime: results are collated to search for global trends (a sketch of this arrangement follows the diagram below)
• Pros:
  • Bespoke modeling for every region
  • A balance between micro and macro modeling
• Cons:
  • Complicated deployment
  • Depending on the agreements, Model Prime may not be possible
(Diagram: Models 1, 2, and 3 in Regions 1, 2, and 3 each send scrubbed results up to Model Prime.)
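A minimal sketch of the tiered arrangement, with hypothetical region names, field names, and scrubbing policy: each region scores its own events, a scrubbing step drops fields the local privacy agreement does not allow to leave the region, and Model Prime only ever sees scrubbed results.

```python
# Minimal sketch: federated per-region models feeding scrubbed results to "Model Prime".
# Region names, field names, scoring logic, and the scrubbing policy are hypothetical.
REGION_SCRUB_POLICY = {
    "emea":     {"drop": ["ip_address", "user_name"]},   # e.g. IP treated as EII here
    "americas": {"drop": ["user_name"]},
}

def score_region(events):
    """Per-region model: attach an anomaly score to each event (placeholder logic)."""
    return [{**e, "score": min(1.0, e["failed_logins"] / 10)} for e in events]

def scrub(region, results):
    """Remove fields that must not leave the region before forwarding to Model Prime."""
    drop = set(REGION_SCRUB_POLICY.get(region, {}).get("drop", []))
    return [{k: v for k, v in r.items() if k not in drop} for r in results]

def model_prime(scrubbed_results):
    """Global model: here just a cross-region average score, standing in for macro-trend search."""
    flat = [r["score"] for region in scrubbed_results for r in region]
    return sum(flat) / len(flat) if flat else 0.0

regional = {
    "emea":     score_region([{"ip_address": "203.0.113.7", "user_name": "a", "failed_logins": 8}]),
    "americas": score_region([{"ip_address": "192.0.2.55",  "user_name": "b", "failed_logins": 1}]),
}
global_trend = model_prime([scrub(region, results) for region, results in regional.items()])
print(f"Global anomaly trend: {global_trend:.2f}")
```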
Model Scale-Out: A Case Study
Detecting Malicious Activities: Detect risky or malicious activity in SharePoint Online activity logs, with precision > 90%, for all SPO users, in near real-time
=> The Problem => Data => Model Evaluation => Model Deployment => Model Scale-out
Exploratory Analysis
• Typical data science work:
  • Sample data
  • Script for preprocessing data
  • Summary statistics
  • Script for evaluating approaches
• All done locally on a dev machine using R/Python (sketched below)
  • Facilitates quick turnaround
  • Avoids having to debug at scale
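A minimal sketch of that local loop in Python/pandas; the file path and column names are hypothetical placeholders for the activity-log sample.

```python
# Minimal sketch: local exploratory analysis on a sample of the activity logs.
# File path and column names are hypothetical.
import pandas as pd

df = pd.read_csv("activity_sample.csv")                  # small sample pulled to the dev machine

sample = df.sample(frac=0.1, random_state=42)            # work on a 10% sample for fast iteration
sample["timestamp"] = pd.to_datetime(sample["timestamp"])
sample = sample.dropna(subset=["user_id", "operation"])  # basic preprocessing

print(sample.describe(include="all"))                    # summary statistics
print(sample["operation"].value_counts().head(20))       # most common operations
```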
Model Evaluation
• Labels from known incidents and investigations
• Inject labels by mimicking malicious activity
  • The SPO team helps us understand the malicious activity
  • The red team helps us simulate the malicious activity
• Precision > 90%
Model: Bayesian Network
• Probabilistic graphical model
  • Related to GMMs, CRFs, MRFs
• Represents variables and conditional-independence assertions in a directed acyclic graph
  • Directed edges encode conditional dependencies
  • Conditional probability distributions for each variable
(Diagram: the classic alarm network. Burglary and Earthquake point to Alarm; Alarm points to JohnCalls and MaryCalls.)
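A minimal sketch of inference on the burglary/earthquake network from the diagram, computing P(Burglary | JohnCalls, MaryCalls) by brute-force enumeration. The probability tables are the standard textbook values, used purely for illustration, not anything from the production model.

```python
# Minimal sketch: inference by enumeration on the burglary/earthquake Bayesian network.
# CPT values are the standard textbook numbers, for illustration only.
from itertools import product

P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}    # P(JohnCalls=True | Alarm)
P_M = {True: 0.70, False: 0.01}    # P(MaryCalls=True | Alarm)

def joint(b, e, a, j, m):
    """Full joint probability P(b, e, a, j, m) from the chain rule over the DAG."""
    pa = P_A[(b, e)]
    return (P_B[b] * P_E[e]
            * (pa if a else 1 - pa)
            * (P_J[a] if j else 1 - P_J[a])
            * (P_M[a] if m else 1 - P_M[a]))

# P(Burglary=True | JohnCalls=True, MaryCalls=True)
num = sum(joint(True, e, a, True, True) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True, True) for b, e, a in product([True, False], repeat=3))
print(f"P(Burglary | both call) = {num / den:.3f}")   # roughly 0.284
```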
Initial Prototype (v0.1)
• One activity model for all users
• Run the model in the cloud with an Azure Worker Role
• Storage accounts for input data and output scores
• Pros:
  • Easy to manage
  • Small memory footprint
• Cons:
  • Does not scale
  • Low throughput
(Diagram: data from Users 1-3 flows into a single Activity Model running in an Azure Worker Role, which emits scores.)
Improved Approach
• One model for each user
  • Personalized activity suspiciousness
  • Cluster low-activity users for better model results
• Replace storage accounts with Azure Event Hubs
  • Low-latency, cloud-scale "queues" (see the sketch after the diagram)
(Diagram: activity from Users 1-3 enters through an Event Hub, is scored by per-user Models 1…n inside the Azure Worker Role, and scores exit through a second Event Hub.)
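A minimal sketch of emitting per-user scores to an Event Hub with the current azure-eventhub Python SDK (v5); the connection string, hub name, and payload shape are hypothetical placeholders, not the production values.

```python
# Minimal sketch: publishing per-user anomaly scores to an Azure Event Hub (azure-eventhub v5).
# Connection string, hub name, and score payload are hypothetical placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://<namespace>.servicebus.windows.net/;...",  # placeholder
    eventhub_name="user-activity-scores",
)

scores = [{"user_id": "user-1", "score": 0.87}, {"user_id": "user-2", "score": 0.05}]

with producer:
    batch = producer.create_batch()            # respects the hub's maximum batch size
    for s in scores:
        batch.add(EventData(json.dumps(s)))
    producer.send_batch(batch)                 # low-latency, cloud-scale "queue"
```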
Model Scale-Out: Memory
• Millions of per-user models: more than can fit in worker-role memory
• Store the models in a storage account and load them as needed
(Diagram: the previous architecture, with Models 1…n backed by a Model Storage account.)
Model Scale-Out: Latency
• The model storage account adds too much latency
• A Redis cache minimizes model-loading latency (see the sketch below)
• LRU policy as we process user activity events
(Diagram: the previous architecture, with a Redis Cache between the worker role and Model Storage.)
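A minimal sketch of the read path, assuming per-user models are pickled blobs in an Azure storage container with Redis in front of it. Connection strings, container and key names, and the TTL are hypothetical; in practice Redis' own maxmemory LRU eviction, rather than explicit code, keeps only the hot models cached.

```python
# Minimal sketch: load a per-user model from blob storage, with Redis in front to cut latency.
# Connection strings, container/key names, and the TTL are hypothetical placeholders.
import pickle
import redis
from azure.storage.blob import BlobServiceClient

cache = redis.Redis(host="localhost", port=6379)
blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")

def load_user_model(user_id):
    key = f"model:{user_id}"
    cached = cache.get(key)                   # fast path: model already in Redis
    if cached is not None:
        return pickle.loads(cached)

    blob = blob_service.get_blob_client(container="user-models", blob=f"{user_id}.pkl")
    raw = blob.download_blob().readall()      # slow path: fetch from model storage
    cache.set(key, raw, ex=3600)              # keep hot models cached; Redis eviction handles LRU
    return pickle.loads(raw)
```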
Data Compliance
• Models cannot use certain PII
• Balkanized cloud environments
• Tiered model development
• Resolve user information for the UX: UserID -> User Name
Data Compliance
(Diagram: the scoring pipeline from before, extended with a User Account DB and its own Redis Cache for resolving UserID -> User Name.)
Cloud Resource Competition
(Diagram: Signals 1 through m all compete for the shared User Account DB and its Redis Cache.)
From v0.1 to v1.0
Conclusion
Operationalize Security Data Science: Components
=> Model Evaluation => Model Deployment => Model Scale-out
The Rand Test: a test to see whether your Security Data Science solution is operational
Answer Yes/No to the following:
1) Do you have an established pipeline to collect relevant security data?
2) Do you have established SLAs/data contracts with partner teams?
3) Can you seamlessly update the model with new features and re-train?
4) Did you evaluate the model with real attack data?
5) Does your model respect different privacy laws, across all regions?
6) Do you account for model localization?
7) Is your model scalable, end to end?
8) Do you hold live site meetings about your solution?
9) Can security responders leverage the model for insights during an investigation?
10) Do you have a framework to collect feedback from security analysts on the results?
By @ram_ssk, Andrew Wicker
Score: Yes = 1 point
10: All systems operational!
5: One small step…
0: Houston! We have a problem
Model Evaluation Model Deployment Model Scale-out