Applying Machine Learning to Network Security Monitoring - BayThreat 2013
DESCRIPTION
Video (at YouTube) - http://bit.ly/19TNSTF
Big Data Security Analytics, Data Science and Machine Learning are a few of the new buzzwords that have invaded our industry of late. Most of what we hear are promises of a unicorn-laden, silver-bullet panacea from heavy-handed marketing folks, evoking an expected pushback from the most enlightened members of our community. This talk will help parse what we as a community need to know and understand about these concepts: the technical details and actual capabilities behind them, where they fail, and how they can be exploited and fooled by an attacker. The talk will also share results of the author's ongoing research (on MLSec Project) into applying machine learning techniques to information security monitoring.
TRANSCRIPT
Applying Machine Learning to Network Security Monitoring
Alexandre Pinto Chief Data Scientist | MLSec Project
@alexcpsec @MLSecProject
• This is a talk about BUILDING, not breaking – NO systems were harmed in the development of this talk. – This is NOT about 1337 Android Malware
• Only thing we are likely to break here is the time limit on the talk
• This talk includes more MATH than the daily recommended intake by the FDA.
• All stunts described in this talk were performed by trained professionals.
WARNING!
• 13 years in Information Security, done a little bit of everything. • Past 7 or so years leading security consultancy and monitoring teams in Brazil, London and the US. – If there is any way a SIEM can hurt you, it has happened to me.
• Researching machine learning and data science in general for the past year or so and presenting on the intersection of it and InfoSec throughout the year.
• Created MLSec Project in July 2013 to give structure to the research being done.
Who's Alex?
• Definitions • Big Data • Data Science • Machine Learning
• Y U DO DIS? • Network Security Monitoring • PoC || GTFO • Feature Intuition • How to get started?
Agenda
Big Data + Machine Learning + Data Science
Big Data + Machine Learning + Data Science
Big Data
(Security) Data Scientist
Data Science Venn Diagram by Drew Conway
• “Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.”
-- Josh Wills, Cloudera
• “Machine learning systems automatically learn programs from data” (*)
• You don’t really code the program, but it is inferred from data.
• Intuition of trying to mimic the way the brain learns: that's where terms like artificial intelligence come from.
Enter Machine Learning
(*) CACM 55(10) - A Few Useful Things to Know about Machine Learning (Domingos 2012)
• Supervised Learning: – Classification (NN, SVM, Naïve Bayes)
– Regression (linear, logistic)
Kinds of Machine Learning
Source – scikit-learn.github.io/scikit-learn-tutorial/general_concepts.html
• Unsupervised Learning: – Clustering (k-means) – Decomposition (PCA, SVD)
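The two families above can be contrasted with a toy sketch in pure Python (the 1-D "bytes transferred" feature and its values are hypothetical; in practice you would reach for scikit-learn, per the source above):

```python
# Toy contrast between the two families, in pure Python.
# Hypothetical 1-D feature: "bytes transferred" per connection.

def nearest_centroid_fit(xs, ys):
    """Supervised: learn one centroid per label from LABELED data."""
    return {label: sum(x for x, y in zip(xs, ys) if y == label) /
                   sum(1 for y in ys if y == label)
            for label in set(ys)}

def nearest_centroid_predict(centroids, x):
    """Classify x by the closest learned centroid."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

def kmeans_1d(xs, k=2, iters=20):
    """Unsupervised: discover k groups with NO labels at all."""
    cents = sorted(xs)[::max(1, len(xs) // k)][:k]   # naive init
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: abs(cents[i] - x))].append(x)
        cents = [sum(g) / len(g) if g else cents[i] for i, g in enumerate(groups)]
    return sorted(cents)

xs = [1.0, 1.2, 0.9, 10.0, 10.5, 9.8]
ys = ["benign", "benign", "benign", "malicious", "malicious", "malicious"]

model = nearest_centroid_fit(xs, ys)
print(nearest_centroid_predict(model, 9.0))  # -> malicious
print(kmeans_1d(xs, k=2))                    # two centroids near 1.03 and 10.1
```

Same data both times; the only difference is whether the labels `ys` are used, which is the whole supervised/unsupervised distinction.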
Classification Example
VS
Regression Example
Considerations on Data Gathering • Models will (generally) get better with more data
– But we always have to consider bias and variance as we select our data points
– Also adversaries – we may be force-fed “bad data”, find signal in weird noise or design bad (or exploitable) features
• “I’ve got 99 problems, but data ain’t one”
Domingos, 2012; Abu-Mostafa, Caltech, 2012
• Sales
Applications of Machine Learning
• Trading
• Image and Voice Recognition
• Common reactions from Security Professionals: • “Eh, cool…” *blank stare* *walks away* • “Are you high, bro?”
Y U DO DIS?
• “Why aren’t you doing some cool research like Android Malware?”
Math is HARD
• Fraud detection systems: – Is what he just did consistent with past behavior?
• Network anomaly detection (?): – More like bad statistical analysis – Did not advance a lot, IMO
• Predicting likelihood of attack actors – Create different predictive models and chain them to gain more confidence in each step.
Security Applications of ML
• SPAM filters
• Adversaries - Exploiting the learning process • Understand the model, understand the machine, and you can circumvent it
• Something the InfoSec community knows very well • Any predictive model in InfoSec will be pushed to the limit
• Again, think back on the way SPAM engines evolved.
Considerations on Data Gathering
Network Security Monitoring
• Rules in a SIEM solution invariably are: – “Something” has happened “x” times; – “Something” has happened and another “something2” has happened, with some relationship (time, same fields, etc.) between them.
• Configuring a SIEM = iterate on combinations until: – Customer or management is fooled… I mean, satisfied; – Consulting money runs out
• Behavioral rules (anomaly detection) help a bit with the “x”s, but still, very laborious and time-consuming.
Correlation Rules: A Primer
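The two rule shapes just described can be sketched directly (the event tuples, field names and thresholds below are hypothetical; real SIEM rule engines are far more elaborate):

```python
# Sketch of the two SIEM rule shapes:
# (1) "something" happened x times within a window, and
# (2) "something", then "something2", joined on a shared field within a window.

from collections import deque

def threshold_rule(events, name, x, window):
    """Alert when event `name` fires >= x times within `window` seconds."""
    times, alerts = deque(), []
    for t, ev, _src in events:
        if ev != name:
            continue
        times.append(t)
        while times and t - times[0] > window:   # slide the window forward
            times.popleft()
        if len(times) >= x:
            alerts.append(t)
    return alerts

def sequence_rule(events, first, second, window):
    """Alert when `second` follows `first` for the same source within `window`."""
    last_seen, alerts = {}, []
    for t, ev, src in events:
        if ev == first:
            last_seen[src] = t
        elif ev == second and src in last_seen and t - last_seen[src] <= window:
            alerts.append((src, t))
    return alerts

events = [  # (timestamp_seconds, event_name, source_ip) -- toy data
    (0, "fw_block", "10.0.0.1"), (5, "fw_block", "10.0.0.1"),
    (8, "fw_block", "10.0.0.1"), (9, "ids_alert", "10.0.0.1"),
]
print(threshold_rule(events, "fw_block", 3, 60))           # -> [8]
print(sequence_rule(events, "fw_block", "ids_alert", 60))  # -> [('10.0.0.1', 9)]
```

The "iterate on combinations until satisfied" pain is exactly the manual tuning of `x` and `window` in code like this.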
• Alert-based: – “Traditional” log management – SIEM – Using “Threat Intelligence” (i.e. blacklists) for about a year or so
– Lack of context – Low effectiveness – You get the results handed over to you
Kinds of Network Security Monitoring
• Exploration-based: – Network Forensics tools (2/3 years ago)
– Elasticsearch-based LM systems
– High effectiveness – Lots of people necessary – Lots of HIGHLY trained people
• Big Data Security Analytics (BDSA): – Run exploration-based monitoring on Hadoop – More like Big Data Security Monitoring (BDSM)
Alert-based + Exploration-based
A wild army of robots appears
Using robots to catch bad guys
• We developed a set of algorithms to detect malicious behavior from log entries of firewall blocks
• Over 6 months of data from SANS DShield (thanks, guys!) • After a lot of statistics-based math (true positive ratio, true negative ratio, odds likelihood), it could pinpoint actors that would be 13x-18x more likely to attack you.
• Today more like 30x on the SANS data, and finding around 80% of “badness” in participant deployments.
PoC || GTFO
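A back-of-envelope version of the odds-likelihood arithmetic mentioned above (the rates here are made up for illustration; the actual MLSec Project models are more involved):

```python
# Hypothetical sketch of likelihood-ratio scoring for "actors more likely
# to attack you". All rates below are invented for illustration.

def likelihood_ratio(tpr, fpr):
    """LR+ = P(flagged | attacker) / P(flagged | non-attacker)."""
    return tpr / fpr

def posterior_odds(prior_odds, lr):
    """Bayes' rule in odds form: posterior odds = prior odds * LR."""
    return prior_odds * lr

# e.g. the feature catches 90% of attackers but also 5% of benign actors:
lr = likelihood_ratio(tpr=0.90, fpr=0.05)
print(round(lr, 1))                                  # -> 18.0
# A flagged actor's odds of attacking go up by that factor:
print(round(posterior_odds(prior_odds=0.001, lr=lr), 3))  # -> 0.018
```

An LR+ of 18 is exactly the "18x more likely to attack you" framing: evidence multiplies the prior odds, it does not replace them.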
• Assumptions to aggregate the data • Correlation / proximity / similarity BY BEHAVIOR • “Bad Neighborhoods” concept: – Spamhaus x CyberBunker – Google Report (June 2013) – Moura 2013
• Group by Geolocation • Group by Netblock (/16, /24) • Group by ASN – (thanks, Team Cymru)
Feature Intuition: IP Proximity
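The netblock grouping can be sketched with Python's stdlib `ipaddress` module (the source IPs below are illustrative documentation-range addresses, not real attack data):

```python
# Sketch of the "group by netblock" aggregation: collapse each blocked
# source IP to its /24 and count hits per neighborhood.

from collections import Counter
from ipaddress import ip_interface

def netblock(ip, prefix=24):
    """Collapse an IPv4 address to its containing /prefix network."""
    return str(ip_interface(f"{ip}/{prefix}").network)

# Hypothetical firewall-block sources (TEST-NET documentation ranges):
hits = ["198.51.100.7", "198.51.100.200", "198.51.100.23", "203.0.113.5"]

counts = Counter(netblock(ip, 24) for ip in hits)
print(counts.most_common(1))  # -> [('198.51.100.0/24', 3)]
```

Grouping by /16 or by ASN is the same move at a coarser granularity: three scattered sightings become one "bad neighborhood" signal.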
Map of the Internet
• (Hilbert Curve) • Block port 22 • 2013-07-20
[Hilbert-curve map of IPv4 space showing port-22 firewall blocks; hot regions labeled CN, RU, CN/BR/TH and “MULTICAST AND FRIENDS”, with a “You are here!” marker]
• Even bad neighborhoods renovate: – Attackers may change ISPs/proxies – Botnets may be shut down / relocate – A little paranoia is OK, but not EVERYONE is out to get you (at least not all at once)
Feature Intuition: Temporal Decay
• As days pass, let's forget, bit by bit, who attacked
• Last time I saw this actor, and how often did I see them
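One simple way to "forget, bit by bit" is exponential decay with a chosen half-life (the 7-day half-life and unit score below are assumptions for illustration, not the project's actual parameters):

```python
# Hypothetical temporal-decay sketch: an actor's maliciousness score
# halves every `half_life` days with no new sightings.

import math

def decayed_score(score, days_since_seen, half_life=7.0):
    """Exponential decay: score * 2^(-days/half_life)."""
    return score * math.exp(-math.log(2) * days_since_seen / half_life)

print(round(decayed_score(1.0, 0), 3))    # -> 1.0    (seen today)
print(round(decayed_score(1.0, 7), 3))    # -> 0.5    (one half-life ago)
print(round(decayed_score(1.0, 28), 3))   # -> 0.062  (mostly forgotten)
```

New sightings would reset or add to the score, so persistent actors stay hot while one-off attackers fade out, which is the renovation effect described above.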
• Behavior: block on port 22
• Trial inference on 100k IP addresses per Class A subnet
• Logarithm scale: brightest tiles are 10 to 1000 times more likely to attack.
MLSec Project
• Who resolves to this IP address? • Number of domains that resolve to the IP address • Distribution of their lifetime • Entropy, size, ccTLDs • Registrar information
• Reverse DNS information… • History of DNS registration… • (Thanks, DNSDB!)
Feature Intuition: DNS features
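One of the listed features, entropy of a domain name, is easy to sketch (Shannon entropy over characters; the sample domains are illustrative, and real pipelines would compute it per DNS label):

```python
# Sketch of one DNS feature: Shannon entropy of a domain label.
# Algorithmically generated domains tend to score higher than dictionary words.

import math
from collections import Counter

def entropy(s):
    """Shannon entropy in bits per character of the string s."""
    counts, n = Counter(s), len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(round(entropy("google"), 2))          # -> 1.92 (repetitive, low entropy)
print(round(entropy("xkqz7f3jw9pd1m"), 2))  # -> 3.81 (DGA-like, high entropy)
```

On its own this is a weak signal (short real words can score high too), which is why it sits alongside lifetime, ccTLD and registrar features rather than replacing them.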
• YAY! We have a bunch of numbers per IP address/domain! • How do you define what is malicious or not?
• “Advanced expertise in both information security and data science will be a necessary ingredient in enabling accurate discrimination between malicious and benign activity.”
- Anton Chuvakin, Gartner
• Kinda easy for security tools (if you trust them) • Web application logs need deeper statistical analysis • Not a normal / standard-deviation thing
Training the Model
• Programming is a must (Python / R) • Statistical knowledge keeps you from making dumb mistakes
• Specific machine learning courses and books: – Coursera (ML / Data Analysis / Data Science)
• Practice, Practice, Practice: – Explore your data! – (Security Onion) – Kaggle – KDD, VAST, VizSec
How do I get started on this?
MLSec Project
• Sign up, send logs, receive reports generated by machine learning models!
• Working with several companies on trying out these models in their environments with their data
• We are hiring (KINDA)
• Visit https://www.mlsecproject.org, message @MLSecProject or just e-mail me.
• Inbound attacks on exposed services (DEFCON/BH 2013): – Information from inbound connections on firewalls, IPS, WAFs – Feature extraction and supervised learning
• Malware Distribution and Botnets: – Information from outbound connections on firewalls, DNS and Web Proxy
– Initial labeling provided by intelligence feeds and AV/anti-malware – Semi-supervised learning involved
• Kill-chain Ensemble Models: – Increased precision by composing different behaviors – Web server path -> go through Firewall, then IPS, then WAF – Early confirmation of attack failure or success
MLSec Project - Current Research
Thanks! • Q&A? • Feedback?
Alexandre Pinto @alexcpsec
@MLSecProject https://www.mlsecproject.org/
"Essentially, all models are wrong, but some are useful." - George E. P. Box