using data science for cybersecurity


Using Data Science for Cybersecurity
Anirudh Kondaveeti, Principal Data Scientist, Pivotal
Jeff Kelly, Principal Product Marketing Manager, Pivotal

Today’s Speakers


Anirudh Kondaveeti Principal Data Scientist, Pivotal

Jeff Kelly Product Marketing, Pivotal

Moderator Presenter


●  Cybercrime costs average US enterprise $17m per year*

●  Cost grew at 15% CAGR over last three years

●  Any given cybercrime can cost significantly more

●  Target’s 2014 hack cost company approximately $162m

●  Costs not just financial, also reputational

Cost of Cybercrime on the Rise

*Source: 2016 Cost of Cyber Crime Study & the Risk of Business Innovation, Ponemon Institute


●  Amateur hackers giving way to professionals

●  Developing new, more sophisticated, methods

●  Professional hackers make their services available for a fee

●  Costs to commit cybercrime dropping

●  Average subscription fee for a one hour/month DDoS package is roughly $38*

Hackers Growing More Sophisticated

*Source: Q2 2015 Global DDoS Threat Landscape, Incapsula


●  Defending the perimeter no longer enough

●  No 100%, fool-proof way to keep bad actors out

●  Some threats come from within

●  The idea of a perimeter becoming obsolete with mobile, cloud, IoT

●  Need better methods for threat detection inside the network

Perimeter Defense Inadequate

Data Science for Cybersecurity

Security must move beyond signature-based matching
•  Necessary defense direction: Find the Unknown
•  Need an advanced platform: Security is a Big Data problem
•  Multiple decentralized sources of traditional or unconventional data
•  Need a platform for better BI, reporting, and cross-source correlation
•  Develop intelligence: Security is an Advanced Analytics problem

Figure labels: BI and Compliance-driven; Investigation-driven; Behavior-metrics; Data-science driven

Background


Lateral Movement Detection

Advanced Persistent Threat (APT)

1. Phishing and Zero Day Attack: A handful of users are targeted by two phishing attacks; one user opens a zero-day payload (CVE-2011-0609)

2. Back Door: The user's machine is accessed remotely by the Poison Ivy tool

3. Lateral Movement: Attacker elevates access to important user, service and admin accounts, and specific systems

4. Data Gathering: Data is acquired from target servers and staged for exfiltration

5. Exfiltrate: Data is exfiltrated via encrypted files over FTP to an external, compromised machine at a hosting provider

APT Kill Chain

What: Identify anomalous user-level access to hosts

How: Look at People & Machines
•  Users (User Behavior Models)
•  Network, Servers (User Peer Models)

Scenarios:
•  Network reconnaissance from remote adversary on hijacked device
•  Ill-intentioned activities by legitimate employee
•  Access policy abuse

Business values:
•  Immediate security alert generation
•  Enhanced SIEM alert queue prioritization
•  Focused monitoring
•  Future integration with other analytic models for 360° attack view

Lateral Movement Detection

Flow diagram:
•  Inputs: Logs, Active Directory Activity, Active Directory Metadata, LDAP Activity, Server Information (Structured and Semi-structured)
•  Platform: Data Computing Appliance, Greenplum, External Tables, DIA
•  Models: Regression Based Model, Cluster Based Model, Recommendation System Based, User Behavioral Model
•  Output: Anomalous Users

Lateral Movement Detection (LMD) – Flow Diagram

Model to identify users with unusual variation in the number of servers accessed over time

Build a regression model for each user (Y = aX + b)

No. of servers accessed each week (Y) ~ Week Index (X)

Find the slope of the regression line for each user (a)

Identify users who have a high positive or negative slope to find users with unusual activity
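A minimal sketch of the per-user regression above, assuming weekly server counts are already aggregated per user. The user names, the sample counts, and the slope threshold of 1.0 are hypothetical and chosen only for illustration.

```python
def weekly_slope(servers_per_week):
    """Least-squares slope a in Y = aX + b, with X = week index 0..n-1."""
    n = len(servers_per_week)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(servers_per_week) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, servers_per_week))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def flag_users(counts_by_user, threshold):
    """Return every user's slope and the users whose |slope| exceeds threshold."""
    slopes = {user: weekly_slope(counts) for user, counts in counts_by_user.items()}
    flagged = sorted(user for user, a in slopes.items() if abs(a) > threshold)
    return slopes, flagged


# Hypothetical data: number of servers accessed in each of six weeks.
counts_by_user = {
    "alice": [3, 3, 4, 3, 3, 4],    # stable
    "bob": [2, 4, 6, 8, 10, 12],    # steadily growing
    "carol": [9, 7, 5, 3, 1, 0],    # steadily shrinking
}
slopes, flagged = flag_users(counts_by_user, threshold=1.0)
print(flagged)
```

Both a steep rise and a steep drop are flagged, matching the "high positive or negative slope" rule; in practice the threshold would likely be set from the distribution of slopes across all users.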

Figure: Regression plot of number of servers for a user (X-axis: Week of the year; Y-axis: Number of Servers)

Regression-Based Model

Build historical behavioral profile for each user based on following features:
•  Servers accessed
•  IP addresses logged in from
•  Geographical information of login

Models stress individual user/job log-in frequency

Multiple Feature Generations reduce false alarms:
•  Aggregate servers to respective server group
•  Incorporate server criticality
•  Assign more weight to less popular servers and IP addresses
•  E.g. print servers are low-weighted
•  Use recommendation engine to suggest servers to users based on job roles and peers
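One way to "assign more weight to less popular servers" is an inverse-popularity weight. The sketch below is an assumption, not the weighting used in the original work: the log-based formula, the server names, and the user names are all hypothetical.

```python
import math


def popularity_weights(access_by_user):
    """Weight each server by how rare it is across users.

    A server used by every user gets weight 1.0; rarer servers get more.
    The formula log(n_users / n_accessing) + 1 is an illustrative choice.
    """
    n_users = len(access_by_user)
    counts = {}
    for servers in access_by_user.values():
        for server in servers:
            counts[server] = counts.get(server, 0) + 1
    return {s: math.log(n_users / c) + 1.0 for s, c in counts.items()}


# Hypothetical access sets: everyone prints, few people touch the HR database.
access_by_user = {
    "u1": {"print01", "mail01"},
    "u2": {"print01", "mail01"},
    "u3": {"print01", "hr-db"},
    "u4": {"print01", "mail01"},
}
weights = popularity_weights(access_by_user)
print(weights)
```

With this choice a login to the widely used print server contributes little to a user's profile, while a login to the rarely used server contributes more.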

Figure: Servers s1 to s10 accessed by a user over time. The user typically uses only a few servers, then begins logging into a lot of new servers.

User Behavior Models (UBM)

Column groups: PCA Model Built per User (Training Data); Testing Data

            Week1   Week2   ...   Week10   Week11   ...   Week15
server1       2       3     ...     1        0      ...     0
server2       4       7     ...     1        3      ...     7
server3       0       2     ...     0        0      ...     0
...          ...     ...    ...    ...      ...     ...    ...
server25      1       3     ...     5        8      ...     1

User behavior matrix is created using ‘x’ weeks of history for a user. The current week is used as test data.

PCA is a dimensionality reduction technique that captures the set of components of a multidimensional vector that account for most of the variance.

Principal dimensions are calculated from the training data.

Principal Component Analysis (PCA) Scoring

Reconstruction Error

1. Training Data (User Behavior Matrix) → Run PCA → Principal Dimensions
2. Test Vector (User data for new week) → Project onto Principal Dimensions → Reconstruct → Reconstructed Test Vector
3. Difference between two vectors → Anomaly Score

Ref: A Lakhina, M Crovella, C Diot, Diagnosing network-wide traffic anomalies
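A minimal sketch of reconstruction-error scoring, assuming NumPy is available. The matrix here is laid out as weeks × servers (the transpose of the slide's table) so that each week is one observation; the sample data and the choice of one principal dimension (`k=1`) are hypothetical.

```python
import numpy as np


def fit_principal_dimensions(train, k):
    """Return the training mean and the top-k principal dimensions (rows)."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal directions of the centered training data.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:k]


def anomaly_score(test_vector, mean, components):
    """Norm of the part of the test vector not explained by the components."""
    centered = test_vector - mean
    projected = components @ centered          # project onto principal dimensions
    reconstructed = components.T @ projected   # map back to server space
    return float(np.linalg.norm(centered - reconstructed))


# Hypothetical history: the user only ever touches the first two servers.
train = np.array([
    [2, 4, 0, 0],
    [3, 7, 0, 0],
    [1, 1, 0, 0],
    [2, 5, 0, 0],
    [3, 6, 0, 0],
], dtype=float)
mean, components = fit_principal_dimensions(train, k=1)

normal_score = anomaly_score(np.array([2.0, 5.0, 0.0, 0.0]), mean, components)
anomalous_score = anomaly_score(np.array([2.0, 5.0, 0.0, 6.0]), mean, components)
print(normal_score, anomalous_score)
```

The second test week adds heavy use of a server the user has never touched, which the principal dimensions cannot reconstruct, so its score is much higher.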


Oversampling PCA

Reference and Image Source: YR Yeh, ZY Lee, YJ Lee, Anomaly Detection via Over-sampling Principal Component Analysis


1. Training Data (User Behavior Matrix) → Run PCA → First Principal Vector
2. Training Data (User Behavior Matrix) + Oversampled Test Data → Run PCA → First Principal Vector after oversampling
3. Difference in angle between them → Anomaly Score
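A minimal sketch of the oversampling idea, assuming NumPy is available: duplicate the test vector several times, rerun PCA, and measure how far the first principal vector rotates. Scoring by 1 − |cosine| of the angle, the number of copies, and the two-dimensional sample data are illustrative assumptions.

```python
import numpy as np


def first_principal_vector(data):
    """Unit-length first principal direction of the row-wise observations."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def oversampling_score(train, test_vector, copies=10):
    """Score a test vector by how much it rotates the first principal vector."""
    u = first_principal_vector(train)
    oversampled = np.vstack([train, np.tile(test_vector, (copies, 1))])
    u_os = first_principal_vector(oversampled)
    # Both vectors are unit length; abs() removes the sign ambiguity of SVD.
    cosine = abs(float(u @ u_os))
    return 1.0 - min(cosine, 1.0)


# Hypothetical training data lying roughly along one direction.
train = np.array([[1, 1], [2, 2.1], [3, 2.9], [4, 4.2], [5, 5]], dtype=float)

normal_score = oversampling_score(train, np.array([6.0, 6.0]))
anomalous_score = oversampling_score(train, np.array([1.0, 9.0]))
print(normal_score, anomalous_score)
```

A test vector that follows the existing pattern barely moves the principal direction, while an outlying vector, once oversampled, pulls it noticeably.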


SQL & R: R code finds the principal components (using SVD); a PL/R wrapper over the R code runs it in parallel, one model per user

•  User1 Data → User1 Model
•  User2 Data → User2 Model
•  User3 Data → User3 Model
•  User4 Data → User4 Model
•  User5 Data → User5 Model

Parallelized PCA using PL/R
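The slides run this step as R code wrapped in PL/R inside Greenplum. The sketch below is a Python stand-in for the same "one model per user, fitted in parallel" pattern, not the original code; it assumes NumPy, uses a thread pool in place of database segments, and generates hypothetical per-user data.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def fit_user_model(user_matrix):
    """Fit one PCA model (via SVD) for a single user's weeks x servers matrix."""
    centered = user_matrix - user_matrix.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    return {"components": vt, "singular_values": singular_values}


def fit_all_users(data_by_user, max_workers=4):
    """Fit an independent model per user, distributing users across workers."""
    users = list(data_by_user)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        models = pool.map(fit_user_model, (data_by_user[u] for u in users))
        return dict(zip(users, models))


# Hypothetical data: 10 weeks x 5 servers of login counts for five users.
rng = np.random.default_rng(0)
data_by_user = {
    f"user{i}": rng.poisson(3, size=(10, 5)).astype(float) for i in range(1, 6)
}
models = fit_all_users(data_by_user)
print(sorted(models))
```

Because each user's model depends only on that user's data, the work partitions cleanly by user, which is what makes the per-user parallelism straightforward.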

Users rate items

To recommend items to a particular user A:

•  Find other users U similar to A

•  Identify the set of items I accessed by U

•  Recommend these items I to A

Users = Employees

Items = Servers accessed

Image Source: http://dataconomy.com/2015/03/an-introduction-to-recommendation-engines/
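The three recommendation steps can be sketched with a simple user-based approach. Jaccard similarity over sets of accessed servers, the `top_k` cutoff, and the employee and server names are assumptions for illustration; the slides do not state which similarity measure was used.

```python
def jaccard(a, b):
    """Overlap between two sets of servers, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_servers(target, access_by_user, top_k=2):
    """Recommend servers used by the target's most similar peers."""
    own = access_by_user[target]
    # Step 1: find the users U most similar to the target user A.
    peers = sorted(
        (u for u in access_by_user if u != target),
        key=lambda u: (-jaccard(own, access_by_user[u]), u),
    )[:top_k]
    # Steps 2 and 3: collect the servers I those peers use that A does not.
    recommended = set()
    for peer in peers:
        recommended |= access_by_user[peer] - own
    return peers, recommended


# Hypothetical access history: users = employees, items = servers.
access_by_user = {
    "A": {"s1", "s2"},
    "B": {"s1", "s2", "s3"},
    "C": {"s1", "s2", "s4"},
    "D": {"s8", "s9"},
}
peers, recommended = recommend_servers("A", access_by_user)
print(peers, sorted(recommended))
```

Servers that come back as recommended are ones a peer with similar access already uses, so a first login to them is less surprising than a login to an unrelated server.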

Recommendation System-Based Model

•  Historical profile for each user based on number of days per week for a particular server weighted by recommendations
•  AD Logs, LDAP data (job title, dept, etc.)
•  Heat Map (Top Figure)
   –  X-Axis: Week Index
   –  Y-Axis: Server
   –  Value: Number of days per week weighted by recommendations
•  Outlier Plot (Bottom Figure)
   –  X-Axis: Week Index
   –  Y-Axis: Outlier Score

Figure: Heat map before recommendations and heat map after recommendations (servers g1 to g5)

Servers g3 & g4 are recommended, hence their weight is decreased

Outlier score in the test week decreases because the new servers that the user accesses are recommended for the user's job profile
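A minimal sketch of how down-weighting recommended servers lowers the outlier score. The outlier score here is a plain Euclidean distance from the historical mean, used as a simple stand-in for the model-based score in the slides; the weight of 0.2 and the sample weeks are hypothetical.

```python
import math


def weighted_week(days_by_server, recommended, recommended_weight=0.2):
    """Scale days-per-week down for servers that are recommended to the user."""
    return {
        server: days * (recommended_weight if server in recommended else 1.0)
        for server, days in days_by_server.items()
    }


def outlier_score(history, test_week):
    """Euclidean distance of the test week from the mean of past weeks."""
    means = {
        server: sum(week[server] for week in history) / len(history)
        for server in test_week
    }
    return math.sqrt(sum((test_week[s] - means[s]) ** 2 for s in test_week))


def score_with_recommendations(history, test_week, recommended):
    """Apply the same weighting to history and test week, then score."""
    weighted_history = [weighted_week(week, recommended) for week in history]
    return outlier_score(weighted_history, weighted_week(test_week, recommended))


# Hypothetical days-per-week counts: the user starts using g3 and g4 in the test week.
history = [
    {"g1": 5, "g2": 3, "g3": 0, "g4": 0, "g5": 0},
    {"g1": 4, "g2": 4, "g3": 0, "g4": 0, "g5": 0},
    {"g1": 5, "g2": 2, "g3": 0, "g4": 0, "g5": 0},
]
test_week = {"g1": 5, "g2": 3, "g3": 4, "g4": 3, "g5": 0}

score_before = score_with_recommendations(history, test_week, recommended=set())
score_after = score_with_recommendations(history, test_week, recommended={"g3", "g4"})
print(score_before, score_after)
```

The test week looks far less unusual once g3 and g4 are treated as expected for the user's role, which is the false-alarm reduction the heat maps illustrate.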

Recommendation System-Based Model

Using historical Windows event data to build graphs* of typical user behavior:
•  Which machines does the user log into?
•  Which machines does the user log in from?
•  How often?
•  In which order?

Ask if this behavior is typical:
•  Is it typical for this user?
•  Is it typical for someone in a particular department?
•  Is it typical for someone in the user’s job role?

Graph models are sensitive to direction, order, and frequency
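A minimal sketch of a direction- and frequency-aware login graph, assuming each event is reduced to a (source machine, destination machine) pair. The machine names and the rule "flag any edge never seen in history" are hypothetical simplifications; order of logins is not modeled here.

```python
from collections import Counter


def build_login_graph(events):
    """Count directed edges (source -> destination) from historical logins."""
    return Counter((src, dst) for src, dst in events)


def unseen_edges(graph, new_events):
    """Return new login edges that never appear in the historical graph."""
    return [edge for edge in new_events if graph[edge] == 0]


# Hypothetical history for one user: laptop -> mail and laptop -> files are routine.
history = [
    ("laptop", "mail"),
    ("laptop", "files"),
    ("laptop", "mail"),
    ("laptop", "mail"),
]
graph = build_login_graph(history)

# New week: one routine login, one reversed edge, one hop to a sensitive database.
new_events = [("laptop", "mail"), ("mail", "laptop"), ("files", "finance-db")]
suspicious = unseen_edges(graph, new_events)
print(suspicious)
```

Because edges are directed, a login from the mail server back to the laptop counts as new even though the two machines are already connected in the other direction. The same graph could be built per department or job role to ask the peer-level questions above.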

Figure: Typical Behavior vs. Anomalous Behavior login graphs over hosts 34.23.123.4, 34.23.123.51, 34.23.1.1, 34.23.0.1, and 34.23.2.8; one node is labeled "DB with financial information"

*Reference: Alexander D. Kent, Lorie M. Liebrock, Joshua C. Neil. Authentication graphs: Analyzing user behavior within an enterprise network.

Graph Model

Challenge:
•  Cybersecurity threats, data privacy, data protection and fraudulent behavior going undetected, leaving customer vulnerable to security risks, loss of money
•  Need to gain timely insight into unusual/suspicious internal behavior to allow for proper action
•  Tools in place cannot be customized to leverage historical security data and allow for predictive analytics

Solution:
•  Leveraged Data Science to show use cases analyzing their Active Directory data, identifying fraud, unapproved file sharing, etc.
•  Utilized Big Data Suite, specifically Greenplum + MADlib + R to store and analyze data with potential to build out Hadoop data lake with HDB (aka HAWQ)

Pivotal Solution includes: Pivotal Greenplum, Pivotal HDB, Apache MADlib

Fortune 100 Companies Leverage Pivotal to Tackle Enterprise-wide Security Risks with Analytics

•  Pivotal Data Science expertise and partnership with customers to identify high-value use cases to solve and build data science center of excellence for security analytics

•  Tight integration to Analytical Tools that run in-database and across all of the data, to cover the most possible use cases

•  Scalable Solution that can grow as data needs grow, leveraging commodity hardware to keep costs low as data volume increases

•  Join key Pivotal customers in the Security Advisory Council for collaboration and knowledge sharing

Why Pivotal for Security Analytics

Additional Resources & Next Steps
•  Read: Pivotal Data Science Blog https://blog.pivotal.io/channels/data-science-pivotal
•  Strategic: Pivotal Data Science Analytics Road mapping Engagement https://pivotal.io/contact
•  Tune in: Next data science webinar: “Using Data Science to Detect Healthcare Fraud, Waste, and Abuse,” March 14, 2017 https://pivotal.io/resources/1/webinars
•  Hands on: Pivotal Greenplum Sandbox https://network.pivotal.io/products/pivotal-gpdb
•  Hands on: Apache MADlib (incubating) http://madlib.incubator.apache.org/

Questions? Using Data Science for Cybersecurity
