Operationalizing Security Data Science for the Cloud: Challenges, Solutions, and Trade-offs
TRANSCRIPT
Choosing the Learner
Binary Classification
Regression
Multiclass Classification
Unsupervised
Ranking
Anomaly Detection
Collaborative Filtering
Sequence Prediction
Reinforcement Learning
Representation Learning
Choosing the Learning Task
• Binary Classification
• Anomaly Detector
• Ranking
Defining Data Input
• Data Loaders (text, binary, SVMlight, transpose loader)
• Data type
Applying Data Transforms
• Cleaning missing data
• Dealing with categorical data
• Dealing with text data
• Data normalization
Choosing the Learner
• Binary Classification
• Regression
• Multiclass
• Unsupervised
• Ranking
• Anomaly Detection
• Collaborative Filtering
• Sequence Prediction
Choosing Output
• Save the features of a model?
• Save the model as text?
• Save the model as binary?
• Save the per-instance results?
Choosing Run Options
• Run locally?
• Run distributed on an HPC cluster?
• Are all paths in the experiment node-accessible?
• Priority?
• Max concurrent processes?
View Results
• Too large? Work with a sample
• Right size? Load the data
• Histogram: per feature, sampled instances
Debug and Visualize Errors
• Error in the data
• Error in the learner
• Error in the optimizer
• Error in the experimentation setup
Analyze Model Predictions
• Root-cause analysis
• Grading
Operationalizing Security Data Science
Ram Shankar Siva Kumar (@ram_ssk) Andrew Wicker
Microsoft
Security Data Science Projects are different
• Traditional programming projects: spec/prototype, implement, ship
• Data science projects: at each stage, relabel, refeaturize, retrain
• With data-driven features, all components drift:
  • Learner: more accurate, faster, lower memory footprint, …
  • Features: there are always better ones
  • Data: all distributions drift
• Security projects: at each stage, assess the threat, build detections, respond. All components drift:
  • Threat: new attacks constantly come out
  • Detection: newer log sources
  • Response: better tooling, newer TSGs (troubleshooting guides)
So wait…when do we ship??
You ship when your solution is operational
Security Experts
Engineers
Legal
Service Engineers
Product Managers
Machine Learning Experts
Operational is more than your “model is working”…
Detect unusual user activity to prevent data exfiltration
Detect unusual user activity using Application logs, with false positive rate < 1%, for all Azure customers, in near real-time
Detect unusual user activity using Application logs, with false positive rate < 1%, for all Azure customers, in near real-time
=> The Problem => Data => Model Evaluation => Model Deployment => Model Scale-out
Operationalize Security Data Science: Components
Model Evaluation: How do you know your system works?
Model Evaluation Metrics
• E.g., false positive rate
• Makes your customer (and, ergo, your business) happy
• How to measure this?
Model Usage Metrics
• E.g., call rate: how much is the model in use?
• Makes your division happy
• Collected by your pipeline after deployment
Model Validation Metrics
• E.g., MSE, reconstruction error, …
• How well does the model generalize?
• Makes the data scientist happy
• Comes pre-built with ML frameworks (scikit-learn, CNTK)
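The validation metrics in the last group come pre-built with common ML frameworks. As a minimal sketch, assuming a binary classifier and hypothetical label/score arrays, scikit-learn can produce both the false positive rate the customer cares about and the validation metrics the data scientist cares about:

```python
# Minimal sketch: computing evaluation/validation metrics with scikit-learn.
# The labels and scores below are hypothetical placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix, mean_squared_error, roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])        # ground-truth labels (1 = malicious)
y_pred = np.array([0, 1, 1, 0, 1, 0, 0, 0])        # hard predictions from the model
y_score = np.array([0.1, 0.7, 0.9, 0.2, 0.8, 0.3, 0.1, 0.4])  # raw model scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
false_positive_rate = fp / (fp + tn)                # the metric the customer cares about
mse = mean_squared_error(y_true, y_score)           # typical validation metric
auc = roc_auc_score(y_true, y_score)                # how well the model separates classes

print(f"FPR={false_positive_rate:.3f}  MSE={mse:.3f}  AUC={auc:.3f}")
```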
Model Evaluation: How to gather an evaluation dataset?
• Good: Use benchmark datasets
  • List of curated datasets: www.secrepo.com
  • Con: remember, attackers have them too!
• Better: Use previous Indicators of Compromise (IOCs)
  • Honeypots, commercial IOC feeds
  • Steps: gather confirmed IOCs, then "backprop" them through the generated alerts; this lets you calculate false positives and false negatives (see the sketch below)
• Best: Curate your own dataset
(Arrow label: the options become more specialized from Good to Best.)
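A minimal sketch of the "backprop the IOCs" step, assuming a hypothetical alert list and an IOC set keyed by an indicator value such as an IP or file hash: alerts matching a confirmed IOC are true positives, unmatched alerts are candidate false positives, and IOCs that never produced an alert are false negatives.

```python
# Minimal sketch: scoring generated alerts against confirmed IOCs.
# The field name ("indicator") and the data are hypothetical.
confirmed_iocs = {"203.0.113.7", "deadbeef0badf00d", "198.51.100.23"}

generated_alerts = [
    {"alert_id": 1, "indicator": "203.0.113.7"},
    {"alert_id": 2, "indicator": "192.0.2.55"},       # not a confirmed IOC
    {"alert_id": 3, "indicator": "deadbeef0badf00d"},
]

alerted_indicators = {a["indicator"] for a in generated_alerts}

true_positives = [a for a in generated_alerts if a["indicator"] in confirmed_iocs]
false_positives = [a for a in generated_alerts if a["indicator"] not in confirmed_iocs]
false_negatives = confirmed_iocs - alerted_indicators  # IOCs the model never alerted on

print(f"TP={len(true_positives)}  FP={len(false_positives)}  FN={len(false_negatives)}")
```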
Curating your own dataset, option 1: Inject fake malicious data
• How: label synthetic data as "eviluser" and check whether "eviluser" pops to the top of the reports every day
• Pro: low overhead; you don't have to depend on a red team to test your detection
• Con: the injected data may not be representative of true attacker activity
(Diagram: synthetic data is injected into storage, scored by the model, and flows to the alerting system.)
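A minimal sketch of the daily injection check, under the assumption that the report is a ranked list of (user, score) pairs and the synthetic records are labeled "eviluser":

```python
# Minimal sketch: verify the injected "eviluser" surfaces near the top of the daily report.
# The report contents and the top-N threshold are hypothetical.
daily_report = [                     # ranked (user, anomaly score), highest first
    ("eviluser", 0.97),
    ("alice",    0.41),
    ("bob",      0.22),
]

TOP_N = 10
top_users = [user for user, _ in daily_report[:TOP_N]]

if "eviluser" in top_users:
    print("Injection detected: the pipeline works end to end.")
else:
    print("ALERT: synthetic malicious data did not surface; investigate the pipeline.")
```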
Curating your own dataset, option 2: Employ commonly used attacker tools
• How: spin up a malicious process using Metasploit, PowerSploit, or Veil in your environment, then look for traces in your logs
• Pro: easy to implement; with a little tutorial, your development team can run the tool and generate attack data in the logs
• Con: the machine learning system will only learn to detect known attacker toolkits and will not generalize over the attack methodology
(Diagram: tainted data lands in storage, is scored by the model, and feeds the alerting system.)
Curating your own dataset, option 3: Red team pentests your environment
• How: a red team attacks the system, and we collect the logs from the attacks as tainted data
• Pro: the closest technique to real-world attacks
• Con: red team engagements are point-in-time exercises, and expensive
(Diagram: tainted data lands in storage, is scored by the model, and feeds the alerting system.)
Growing your dataset: Generative Adversarial Networks
Source: https://medium.com/@devnag/generative-adversarial-networks-gans-in-50-lines-of-code-pytorch-e81b79659e3f#.djcfc6eo0
Source: http://www.evolvingai.org/ppgn
Model Deployment: Tailoring alerts based on customers' geographic location
Azure has data centers all around the world!
Localization affects model building
• Privacy laws vary across the board
  • An IP address is treated as EII in some regions, but not in others
• "Anyone logging into the corporate network at midnight during the weekend is anomalous"
  • Weekend in the Middle East != weekend in the Americas
  • Seasonality varies
Option 1: Shotgun Deployment
• How: deploy the same model code across different regions
• Pros:
  • Easy deployment
  • Uniform metrics
  • A single TSG to debug all service incidents
• Cons:
  • Lose macro trends in favor of micro trends
  • Model/region incompatibility
(Diagram: the same model deployed to Region 1, Region 2, and Region 3.)
Option 2: Tiered Modeling
• How:
  • Federated models: each region is modeled separately
  • Results are scrubbed according to compliance laws and privacy agreements
  • Scrubbed results are used as input to "Model Prime"
  • Model Prime: results are collated to search for global trends (a sketch of this arrangement follows the diagram below)
• Pros:
  • Bespoke modeling for every region
  • A balance between micro and macro modeling
• Cons:
  • Complicated deployment
  • Depending on the agreements, Model Prime may not be possible
(Diagram: Models 1, 2, and 3 in Regions 1, 2, and 3 each send scrubbed results up to Model Prime.)
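A minimal sketch of the tiered arrangement, with hypothetical region names, field names, and scrubbing policy: each region scores its own events, a scrubbing step drops fields the local privacy agreement does not allow to leave the region, and Model Prime only ever sees scrubbed results.

```python
# Minimal sketch: federated per-region models feeding scrubbed results to "Model Prime".
# Region names, field names, scoring logic, and the scrubbing policy are hypothetical.
REGION_SCRUB_POLICY = {
    "emea":     {"drop": ["ip_address", "user_name"]},   # e.g. IP treated as EII here
    "americas": {"drop": ["user_name"]},
}

def score_region(events):
    """Per-region model: attach an anomaly score to each event (placeholder logic)."""
    return [{**e, "score": min(1.0, e["failed_logins"] / 10)} for e in events]

def scrub(region, results):
    """Remove fields that must not leave the region before forwarding to Model Prime."""
    drop = set(REGION_SCRUB_POLICY.get(region, {}).get("drop", []))
    return [{k: v for k, v in r.items() if k not in drop} for r in results]

def model_prime(scrubbed_results):
    """Global model: here just a cross-region average score, standing in for macro-trend search."""
    flat = [r["score"] for region in scrubbed_results for r in region]
    return sum(flat) / len(flat) if flat else 0.0

regional = {
    "emea":     score_region([{"ip_address": "203.0.113.7", "user_name": "a", "failed_logins": 8}]),
    "americas": score_region([{"ip_address": "192.0.2.55",  "user_name": "b", "failed_logins": 1}]),
}
global_trend = model_prime([scrub(region, results) for region, results in regional.items()])
print(f"Global anomaly trend: {global_trend:.2f}")
```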
Model Scale-Out: A Case Study
Detecting Malicious Activities: Detect risky or malicious activity in SharePoint Online activity logs, with precision > 90%, for all SPO users, in near real-time
=> The Problem => Data => Model Evaluation => Model Deployment => Model Scale-out
Exploratory Analysis
• Typical data science work:
  • Sample data
  • Script for preprocessing data
  • Summary statistics
  • Script for evaluating approaches
• All done locally on a dev machine using R/Python (sketched below)
  • Facilitates quick turnaround
  • Avoids having to debug at scale
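A minimal sketch of that local loop in Python/pandas; the file path and column names are hypothetical placeholders for the activity-log sample.

```python
# Minimal sketch: local exploratory analysis on a sample of the activity logs.
# File path and column names are hypothetical.
import pandas as pd

df = pd.read_csv("activity_sample.csv")                  # small sample pulled to the dev machine

sample = df.sample(frac=0.1, random_state=42)            # work on a 10% sample for fast iteration
sample["timestamp"] = pd.to_datetime(sample["timestamp"])
sample = sample.dropna(subset=["user_id", "operation"])  # basic preprocessing

print(sample.describe(include="all"))                    # summary statistics
print(sample["operation"].value_counts().head(20))       # most common operations
```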
Model Evaluation
• Labels from known incidents and investigations
• Inject labels by mimicking malicious activity
  • The SPO team helps us understand the malicious activity
  • The red team helps us simulate the malicious activity
• Precision > 90%
Model: Bayesian Network
• Probabilistic graphical model
  • Related to GMMs, CRFs, MRFs
• Represents variables and conditional-independence assertions in a directed acyclic graph
  • Directed edges encode conditional dependencies
  • Conditional probability distributions for each variable
(Diagram: the classic alarm network. Burglary and Earthquake point to Alarm; Alarm points to JohnCalls and MaryCalls.)
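A minimal sketch of inference on the burglary/earthquake network from the diagram, computing P(Burglary | JohnCalls, MaryCalls) by brute-force enumeration. The probability tables are the standard textbook values, used purely for illustration, not anything from the production model.

```python
# Minimal sketch: inference by enumeration on the burglary/earthquake Bayesian network.
# CPT values are the standard textbook numbers, for illustration only.
from itertools import product

P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}    # P(JohnCalls=True | Alarm)
P_M = {True: 0.70, False: 0.01}    # P(MaryCalls=True | Alarm)

def joint(b, e, a, j, m):
    """Full joint probability P(b, e, a, j, m) from the chain rule over the DAG."""
    pa = P_A[(b, e)]
    return (P_B[b] * P_E[e]
            * (pa if a else 1 - pa)
            * (P_J[a] if j else 1 - P_J[a])
            * (P_M[a] if m else 1 - P_M[a]))

# P(Burglary=True | JohnCalls=True, MaryCalls=True)
num = sum(joint(True, e, a, True, True) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True, True) for b, e, a in product([True, False], repeat=3))
print(f"P(Burglary | both call) = {num / den:.3f}")   # roughly 0.284
```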
Initial Prototype (v0.1)
• One activity model for all users
• Run the model in the cloud with an Azure Worker Role
• Storage accounts for input data and output scores
• Pros:
  • Easy to manage
  • Small memory footprint
• Cons:
  • Does not scale
  • Low throughput
(Diagram: data from Users 1-3 flows into a single Activity Model running in an Azure Worker Role, which emits scores.)
Improved Approach
• One model for each user
  • Personalized activity suspiciousness
  • Cluster low-activity users for better model results
• Replace storage accounts with Azure Event Hubs
  • Low-latency, cloud-scale "queues" (see the sketch after the diagram)
(Diagram: activity from Users 1-3 enters through an Event Hub, is scored by per-user Models 1…n inside the Azure Worker Role, and scores exit through a second Event Hub.)
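A minimal sketch of emitting per-user scores to an Event Hub with the current azure-eventhub Python SDK (v5); the connection string, hub name, and payload shape are hypothetical placeholders, not the production values.

```python
# Minimal sketch: publishing per-user anomaly scores to an Azure Event Hub (azure-eventhub v5).
# Connection string, hub name, and score payload are hypothetical placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://<namespace>.servicebus.windows.net/;...",  # placeholder
    eventhub_name="user-activity-scores",
)

scores = [{"user_id": "user-1", "score": 0.87}, {"user_id": "user-2", "score": 0.05}]

with producer:
    batch = producer.create_batch()            # respects the hub's maximum batch size
    for s in scores:
        batch.add(EventData(json.dumps(s)))
    producer.send_batch(batch)                 # low-latency, cloud-scale "queue"
```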
Model Scale-Out: Memory
• Millions of per-user models: more than can fit in worker-role memory
• Store the models in a storage account and load them as needed
(Diagram: the previous architecture, with Models 1…n backed by a Model Storage account.)
Model Scale-Out: Latency
• The model storage account adds too much latency
• A Redis cache minimizes model-loading latency (see the sketch below)
• LRU policy as we process user activity events
(Diagram: the previous architecture, with a Redis Cache between the worker role and Model Storage.)
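A minimal sketch of the read path, assuming per-user models are pickled blobs in an Azure storage container with Redis in front of it. Connection strings, container and key names, and the TTL are hypothetical; in practice Redis' own maxmemory LRU eviction, rather than explicit code, keeps only the hot models cached.

```python
# Minimal sketch: load a per-user model from blob storage, with Redis in front to cut latency.
# Connection strings, container/key names, and the TTL are hypothetical placeholders.
import pickle
import redis
from azure.storage.blob import BlobServiceClient

cache = redis.Redis(host="localhost", port=6379)
blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")

def load_user_model(user_id):
    key = f"model:{user_id}"
    cached = cache.get(key)                   # fast path: model already in Redis
    if cached is not None:
        return pickle.loads(cached)

    blob = blob_service.get_blob_client(container="user-models", blob=f"{user_id}.pkl")
    raw = blob.download_blob().readall()      # slow path: fetch from model storage
    cache.set(key, raw, ex=3600)              # keep hot models cached; Redis eviction handles LRU
    return pickle.loads(raw)
```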
Data Compliance
• Models cannot use certain PII
• Balkanized cloud environments
• Tiered model development
• Resolve user information for the UX: UserID -> User Name
Data Compliance
(Diagram: the scoring pipeline from before, extended with a User Account DB and its own Redis Cache for resolving UserID -> User Name.)
Cloud Resource Competition
(Diagram: Signals 1 through m all compete for the shared User Account DB and its Redis Cache.)
From v0.1 to v1.0
Conclusion
Operationalize Security Data Science: Components
=> Model Evaluation => Model Deployment => Model Scale-out
The Rand Test: a test to see whether your Security Data Science solution is operational
Answer Yes/No to the following:
1) Do you have an established pipeline to collect relevant security data?
2) Do you have established SLAs/data contracts with partner teams?
3) Can you seamlessly update the model with new features and re-train?
4) Did you evaluate the model with real attack data?
5) Does your model respect different privacy laws, across all regions?
6) Do you account for model localization?
7) Is your model scalable, end to end?
8) Do you hold live site meetings about your solution?
9) Can security responders leverage the model for insights during an investigation?
10) Do you have a framework to collect feedback from security analysts on the results?
By @ram_ssk, Andrew Wicker
Score: Yes = 1 point
10: All systems operational!
5: One small step…
0: Houston! We have a problem
Model Evaluation Model Deployment Model Scale-out