Applying Machine Learning to Network Security Monitoring - BayThreat 2013
DESCRIPTION
Video (at YouTube) - http://bit.ly/19TNSTF
Big Data Security Analytics, Data Science and Machine Learning are a few of the new buzzwords that have invaded our industry of late. Most of what we hear are promises of a unicorn-laden, silver-bullet panacea from heavy-handed marketing folks, evoking an expected pushback from the most enlightened members of our community. This talk will help parse what we as a community need to know and understand about these concepts: the technical details and actual capabilities behind them, where they fail, and how they can be exploited and fooled by an attacker. The talk will also share results of the author's ongoing research (on MLSec Project) into applying machine learning techniques to information security monitoring.
TRANSCRIPT
Applying Machine Learning to Network Security Monitoring
Alexandre Pinto Chief Data Scientist | MLSec Project
@alexcpsec @MLSecProject
• This is a talk about BUILDING, not breaking – NO systems were harmed in the development of this talk. – This is NOT about 1337 Android Malware
• Only thing we are likely to break here is the time limit on the talk
• This talk includes more MATH than the daily recommended intake by the FDA.
• All stunts described in this talk were performed by trained professionals.
WARNING!
• 13 years in Information Security, done a little bit of everything. • Past 7 or so years leading security consultancy and monitoring teams in Brazil, London and the US. – If there is any way a SIEM can hurt you, it has happened to me.
• Researching machine learning and data science in general for the past year or so and presenting on the intersection of it and InfoSec throughout the year.
• Created MLSec Project in July 2013 to give structure to the research being done.
Who's Alex?
• Definitions • Big Data • Data Science • Machine Learning
• Y U DO DIS? • Network Security Monitoring • PoC || GTFO • Feature Intuition • How to get started?
Agenda
Big Data + Machine Learning + Data Science
Big Data + Machine Learning + Data Science
Big Data
(Security) Data Scientist
Data Science Venn Diagram by Drew Conway
• “Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.”
-- Josh Wills, Cloudera
• “Machine learning systems automatically learn programs from data” (*)
• You don’t really code the program, but it is inferred from data.
• Intuition of trying to mimic the way the brain learns: that's where terms like artificial intelligence come from.
Enter Machine Learning
(*) CACM 55(10) - A Few Useful Things to Know about Machine Learning (Domingos 2012)
• Supervised Learning: – Classification (NN, SVM, Naïve Bayes)
– Regression (linear, logistic)
Kinds of Machine Learning
Source – scikit-learn.github.io/scikit-learn-tutorial/general_concepts.html
• Unsupervised Learning: – Clustering (k-means) – Decomposition (PCA, SVD)
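The two families above can be contrasted with a toy sketch in pure Python (the 1-D "bytes transferred" feature and its values are hypothetical; in practice you would reach for scikit-learn, per the source above):

```python
# Toy contrast between the two families, in pure Python.
# Hypothetical 1-D feature: "bytes transferred" per connection.

def nearest_centroid_fit(xs, ys):
    """Supervised: learn one centroid per label from LABELED data."""
    return {label: sum(x for x, y in zip(xs, ys) if y == label) /
                   sum(1 for y in ys if y == label)
            for label in set(ys)}

def nearest_centroid_predict(centroids, x):
    """Classify x by the closest learned centroid."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

def kmeans_1d(xs, k=2, iters=20):
    """Unsupervised: discover k groups with NO labels at all."""
    cents = sorted(xs)[::max(1, len(xs) // k)][:k]   # naive init
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: abs(cents[i] - x))].append(x)
        cents = [sum(g) / len(g) if g else cents[i] for i, g in enumerate(groups)]
    return sorted(cents)

xs = [1.0, 1.2, 0.9, 10.0, 10.5, 9.8]
ys = ["benign", "benign", "benign", "malicious", "malicious", "malicious"]

model = nearest_centroid_fit(xs, ys)
print(nearest_centroid_predict(model, 9.0))  # -> malicious
print(kmeans_1d(xs, k=2))                    # two centroids near 1.03 and 10.1
```

Same data both times; the only difference is whether the labels `ys` are used, which is the whole supervised/unsupervised distinction.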
Classification Example
VS
Regression Example
Considerations on Data Gathering • Models will (generally) get better with more data
– But we always have to consider bias and variance as we select our data points
– Also adversaries – we may be force-fed “bad data”, find signal in weird noise or design bad (or exploitable) features
• “I’ve got 99 problems, but data ain’t one”
Domingos, 2012; Abu-Mostafa, Caltech, 2012
• Sales
Applications of Machine Learning
• Trading
• Image and Voice Recognition
• Common reactions from Security Professionals: • “Eh, cool…” *blank stare* *walks away* • “Are you high, bro?”
Y U DO DIS?
• “Why aren’t you doing some cool research like Android Malware?”
Math is HARD
• Fraud detection systems: – Is what he just did consistent with past behavior?
• Network anomaly detection (?): – More like bad statistical analysis – Did not advance a lot, IMO
• Predicting likelihood of attack actors – Create different predictive models and chain them to gain more confidence in each step.
Security Applications of ML
• SPAM filters
• Adversaries - Exploiting the learning process • Understand the model, understand the machine, and you can circumvent it
• Something the InfoSec community knows very well • Any predictive model in InfoSec will be pushed to the limit
• Again, think back on the way SPAM engines evolved.
Considerations on Data Gathering
Network Security Monitoring
• Rules in a SIEM solution invariably are: – “Something” has happened “x” times; – “Something” has happened and another “something2” has happened, with some relationship (time, same fields, etc.) between them.
• Configuring a SIEM = iterate on combinations until: – Customer or management is fooled… I mean, satisfied; – Consulting money runs out
• Behavioral rules (anomaly detection) help a bit with the “x”s, but still, very laborious and time-consuming.
Correlation Rules: A Primer
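The two rule shapes just described can be sketched directly (the event tuples, field names and thresholds below are hypothetical; real SIEM rule engines are far more elaborate):

```python
# Sketch of the two SIEM rule shapes:
# (1) "something" happened x times within a window, and
# (2) "something", then "something2", joined on a shared field within a window.

from collections import deque

def threshold_rule(events, name, x, window):
    """Alert when event `name` fires >= x times within `window` seconds."""
    times, alerts = deque(), []
    for t, ev, _src in events:
        if ev != name:
            continue
        times.append(t)
        while times and t - times[0] > window:   # slide the window forward
            times.popleft()
        if len(times) >= x:
            alerts.append(t)
    return alerts

def sequence_rule(events, first, second, window):
    """Alert when `second` follows `first` for the same source within `window`."""
    last_seen, alerts = {}, []
    for t, ev, src in events:
        if ev == first:
            last_seen[src] = t
        elif ev == second and src in last_seen and t - last_seen[src] <= window:
            alerts.append((src, t))
    return alerts

events = [  # (timestamp_seconds, event_name, source_ip) -- toy data
    (0, "fw_block", "10.0.0.1"), (5, "fw_block", "10.0.0.1"),
    (8, "fw_block", "10.0.0.1"), (9, "ids_alert", "10.0.0.1"),
]
print(threshold_rule(events, "fw_block", 3, 60))           # -> [8]
print(sequence_rule(events, "fw_block", "ids_alert", 60))  # -> [('10.0.0.1', 9)]
```

The "iterate on combinations until satisfied" pain is exactly the manual tuning of `x` and `window` in code like this.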
• Alert-based: – “Traditional” log management – SIEM – Using “Threat Intelligence” (i.e. blacklists) for about a year or so
– Lack of context – Low effectiveness – You get the results handed over to you
Kinds of Network Security Monitoring
• Exploration-based: – Network Forensics tools (2/3 years ago)
– Elasticsearch-based LM systems
– High effectiveness – Lots of people necessary – Lots of HIGHLY trained people
• Big Data Security Analytics (BDSA): – Run exploration-based monitoring on Hadoop – More like Big Data Security Monitoring (BDSM)
Alert-based + Exploration-based
A wild army of robots appears
Using robots to catch bad guys
• We developed a set of algorithms to detect malicious behavior from log entries of firewall blocks
• Over 6 months of data from SANS DShield (thanks, guys!) • After a lot of statistics-based math (true positive ratio, true negative ratio, odds likelihood), it could pinpoint actors that would be 13x-18x more likely to attack you.
• Today more like 30x on the SANS data, and finding around 80% of “badness” in participant deployments.
PoC || GTFO
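A back-of-envelope version of the odds-likelihood arithmetic mentioned above (the rates here are made up for illustration; the actual MLSec Project models are more involved):

```python
# Hypothetical sketch of likelihood-ratio scoring for "actors more likely
# to attack you". All rates below are invented for illustration.

def likelihood_ratio(tpr, fpr):
    """LR+ = P(flagged | attacker) / P(flagged | non-attacker)."""
    return tpr / fpr

def posterior_odds(prior_odds, lr):
    """Bayes' rule in odds form: posterior odds = prior odds * LR."""
    return prior_odds * lr

# e.g. the feature catches 90% of attackers but also 5% of benign actors:
lr = likelihood_ratio(tpr=0.90, fpr=0.05)
print(round(lr, 1))                                  # -> 18.0
# A flagged actor's odds of attacking go up by that factor:
print(round(posterior_odds(prior_odds=0.001, lr=lr), 3))  # -> 0.018
```

An LR+ of 18 is exactly the "18x more likely to attack you" framing: evidence multiplies the prior odds, it does not replace them.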
• Assumptions to aggregate the data • Correlation / proximity / similarity BY BEHAVIOR • “Bad Neighborhoods” concept: – Spamhaus x CyberBunker – Google Report (June 2013) – Moura 2013
• Group by Geolocation • Group by Netblock (/16, /24) • Group by ASN – (thanks, Team Cymru)
Feature Intuition: IP Proximity
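The netblock grouping can be sketched with Python's stdlib `ipaddress` module (the source IPs below are illustrative documentation-range addresses, not real attack data):

```python
# Sketch of the "group by netblock" aggregation: collapse each blocked
# source IP to its /24 and count hits per neighborhood.

from collections import Counter
from ipaddress import ip_interface

def netblock(ip, prefix=24):
    """Collapse an IPv4 address to its containing /prefix network."""
    return str(ip_interface(f"{ip}/{prefix}").network)

# Hypothetical firewall-block sources (TEST-NET documentation ranges):
hits = ["198.51.100.7", "198.51.100.200", "198.51.100.23", "203.0.113.5"]

counts = Counter(netblock(ip, 24) for ip in hits)
print(counts.most_common(1))  # -> [('198.51.100.0/24', 3)]
```

Grouping by /16 or by ASN is the same move at a coarser granularity: three scattered sightings become one "bad neighborhood" signal.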
Map of the Internet
• (Hilbert Curve) • Block port 22 • 2013-07-20
[Hilbert-curve map of IPv4 space showing port-22 firewall blocks; hot regions labeled CN, RU, CN/BR/TH and “MULTICAST AND FRIENDS”, with a “You are here!” marker]
• Even bad neighborhoods renovate: – Attackers may change ISPs/proxies – Botnets may be shut down / relocate – A little paranoia is OK, but not EVERYONE is out to get you (at least not all at once)
Feature Intuition: Temporal Decay
• As days pass, let's forget, bit by bit, who attacked
• Last time I saw this actor, and how often did I see them
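One simple way to "forget, bit by bit" is exponential decay with a chosen half-life (the 7-day half-life and unit score below are assumptions for illustration, not the project's actual parameters):

```python
# Hypothetical temporal-decay sketch: an actor's maliciousness score
# halves every `half_life` days with no new sightings.

import math

def decayed_score(score, days_since_seen, half_life=7.0):
    """Exponential decay: score * 2^(-days/half_life)."""
    return score * math.exp(-math.log(2) * days_since_seen / half_life)

print(round(decayed_score(1.0, 0), 3))    # -> 1.0    (seen today)
print(round(decayed_score(1.0, 7), 3))    # -> 0.5    (one half-life ago)
print(round(decayed_score(1.0, 28), 3))   # -> 0.062  (mostly forgotten)
```

New sightings would reset or add to the score, so persistent actors stay hot while one-off attackers fade out, which is the renovation effect described above.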
• Behavior: block on port 22
• Trial inference on 100k IP addresses per Class A subnet
• Logarithm scale: brightest tiles are 10 to 1000 times more likely to attack.
MLSec Project
• Who resolves to this IP address? • Number of domains that resolve to the IP address • Distribution of their lifetime • Entropy, size, ccTLDs • Registrar information
• Reverse DNS information… • History of DNS registration… • (Thanks, DNSDB!)
Feature Intuition: DNS features
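One of the listed features, entropy of a domain name, is easy to sketch (Shannon entropy over characters; the sample domains are illustrative, and real pipelines would compute it per DNS label):

```python
# Sketch of one DNS feature: Shannon entropy of a domain label.
# Algorithmically generated domains tend to score higher than dictionary words.

import math
from collections import Counter

def entropy(s):
    """Shannon entropy in bits per character of the string s."""
    counts, n = Counter(s), len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(round(entropy("google"), 2))          # -> 1.92 (repetitive, low entropy)
print(round(entropy("xkqz7f3jw9pd1m"), 2))  # -> 3.81 (DGA-like, high entropy)
```

On its own this is a weak signal (short real words can score high too), which is why it sits alongside lifetime, ccTLD and registrar features rather than replacing them.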
• YAY! We have a bunch of numbers per IP address/domain! • How do you define what is malicious or not?
• “Advanced expertise in both information security and data science will be a necessary ingredient in enabling accurate discrimination between malicious and benign activity.”
- Anton Chuvakin, Gartner
• Kinda easy for security tools (if you trust them) • Web application logs need deeper statistical analysis • Not a normal / standard-deviation thing
Training the Model
• Programming is a must (Python / R) • Statistical knowledge keeps you from making dumb mistakes
• Specific machine learning courses and books: – Coursera (ML / Data Analysis / Data Science)
• Practice, Practice, Practice: – Explore your data! – (Security Onion) – Kaggle – KDD, VAST, VizSec
How do I get started on this?
MLSec Project
• Sign up, send logs, receive reports generated by machine learning models!
• Working with several companies on trying out these models in their environments with their data
• We are hiring (KINDA)
• Visit https://www.mlsecproject.org, message @MLSecProject or just e-mail me.
• Inbound attacks on exposed services (DEFCON/BH 2013): – Information from inbound connections on firewalls, IPS, WAFs – Feature extraction and supervised learning
• Malware Distribution and Botnets: – Information from outbound connections on firewalls, DNS and Web Proxy
– Initial labeling provided by intelligence feeds and AV/anti-malware – Semi-supervised learning involved
• Kill-chain Ensemble Models: – Increased precision by composing different behaviors – Web server path -> go through Firewall, then IPS, then WAF – Early confirmation of attack failure or success
MLSec Project - Current Research
Thanks! • Q&A? • Feedback?
Alexandre Pinto @alexcpsec
@MLSecProject https://www.mlsecproject.org/
"Essentially, all models are wrong, but some are useful." - George E. P. Box