Using Data Science for Cybersecurity

Post on 11-Apr-2017

Category:

Technology

  • Using Data Science for Cybersecurity

    Anirudh Kondaveeti, Principal Data Scientist, Pivotal

    Jeff Kelly, Principal Product Marketing Manager, Pivotal

  • Today's Speakers

    Anirudh Kondaveeti Principal Data Scientist, Pivotal

    Jeff Kelly Product Marketing, Pivotal

    Moderator Presenter

  • Cost of Cybercrime on the Rise

    Cybercrime costs the average US enterprise $17m per year*

    Cost grew at a 15% CAGR over the last three years

    Any given cybercrime can cost significantly more

    Target's 2014 hack cost the company approximately $162m

    Costs are not just financial but also reputational

    *Source: 2016 Cost of Cyber Crime Study & the Risk of Business Innovation, Ponemon Institute

  • Hackers Growing More Sophisticated

    Amateur hackers giving way to professionals

    Developing new, more sophisticated methods

    Professional hackers make their services available for a fee

    Costs to commit cybercrime dropping

    Average subscription fee for a one hour/month DDoS package is roughly $38*

    *Source: Q2 2015 Global DDoS Threat Landscape, Incapsula

  • Perimeter Defense Inadequate

    Defending the perimeter is no longer enough

    No 100% fool-proof way to keep bad actors out

    Some threats come from within

    The idea of a perimeter is becoming obsolete with mobile, cloud, IoT

    Need better methods for threat detection inside the network

  • Data Science for Cybersecurity

  • Background

    Security must move beyond signature-based matching

      Necessary defense direction: Find the Unknown

    Need an advanced platform: Security is a Big Data problem

      Multiple decentralized sources of traditional or unconventional data

      Need a platform for better BI, reporting, and cross-source correlation

    Develop intelligence: Security is an Advanced Analytics problem

    [Figure labels: BI and Compliance-driven; Investigation-driven; Behavior-metrics; Data-science driven]

  • Lateral Movement Detection

  • APT Kill Chain

    Advanced Persistent Threat (APT)

    1. Phishing and Zero Day Attack: A handful of users are targeted by two phishing attacks; one user opens a zero-day payload (CVE-2011-0609)

    2. Back Door: The user machine is accessed remotely by the Poison Ivy tool

    3. Lateral Movement: Attacker elevates access to important user, service and admin accounts, and specific systems

    4. Data Gathering: Data is acquired from target servers and staged for exfiltration

    5. Exfiltrate: Data is exfiltrated via encrypted files over FTP to an external, compromised machine at a hosting provider

  • Lateral Movement Detection

    What: Identify anomalous user-level access to hosts

    How: Look at People & Machines

      Users (User Behavior Models)

      Network, Servers (User Peer Models)

    Scenarios:

      Network reconnaissance from remote adversary on hijacked device

      Ill-intentioned activities by legitimate employee

      Access policy abuse

    Business values:

      Immediate security alert generation

      Enhanced SIEM alert queue prioritization

      Focused monitoring

      Future integration with other analytic models for 360 attack view

  • Lateral Movement Detection (LMD) Flow Diagram

    [Flow diagram. Data sources: Logs, Active Directory Activity, Active Directory Metadata, Server Information, LDAP Activity (structured and semi-structured). Platform labels: Data Computing Appliance, Greenplum, External Tables, DIA. Models: Regression Based Model, Cluster Based Model, Recommendation System Based, User Behavioral Model. Output: Anomalous Users.]

  • Regression-Based Model

    Model to identify users with unusual variation in the number of servers accessed over time

    Build a regression model for each user (Y = aX + b)

    No. of servers accessed each week (Y) ~ Week Index (X)

    Find the slope of the regression line for each user (a)

    Identify users who have a high positive or negative slope to find users with unusual activity

    [Figure: Regression plot of number of servers for a user. X-axis: Week of the year; Y-axis: Number of Servers]
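
    The per-user regression described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the user names, the weekly counts, and the slope threshold are all hypothetical, and the slope is computed with the ordinary least-squares formula.

```python
def weekly_slope(counts):
    """Least-squares slope a of Y = aX + b, where X is the week index 0..n-1
    and Y is the number of servers accessed in that week."""
    n = len(counts)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx

def flag_users(servers_per_week, threshold):
    """Return {user: slope} for users whose absolute slope exceeds the threshold.
    The threshold is an assumed tuning parameter, not a value from the talk."""
    slopes = {user: weekly_slope(c) for user, c in servers_per_week.items()}
    return {user: a for user, a in slopes.items() if abs(a) > threshold}

# Hypothetical data: number of distinct servers accessed per week.
servers_per_week = {
    "alice": [3, 3, 4, 3, 3, 4],     # roughly flat
    "bob": [2, 4, 6, 8, 10, 12],     # steadily growing
}
flagged = flag_users(servers_per_week, threshold=1.0)
print(flagged)
```

    In practice the threshold would probably be chosen relative to the distribution of slopes across all users rather than fixed in advance.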

  • User Behavior Models (UBM)

    Build historical behavioral profile for each user based on the following features:

      Servers accessed

      IP addresses logged in from

      Geographical information of login

    Models stress individual user/job log-in frequency

    Multiple feature generations reduce false alarms:

      Aggregate servers to respective server group

      Incorporate server criticality

      Assign more weight to less popular servers and IP addresses (e.g. print servers are low-weighted)

      Use recommendation engine to suggest servers to users based on job roles and peers

    [Figure: Servers s1 to s10 accessed by one user over time. The user typically uses only a few servers, then begins logging into a lot of new servers.]
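
    One way to "assign more weight to less popular servers" is an inverse-popularity weight, similar to inverse document frequency in text retrieval. The slide does not say which weighting was used, so the formula below is an assumption, and the server and user names are hypothetical.

```python
import math

def server_weights(access):
    """access maps user -> set of servers that user logged into.
    Returns an IDF-style weight per server: 1 + log(total users / users of server).
    A server used by everyone (e.g. a print server) gets the minimum weight 1.0."""
    n_users = len(access)
    users_per_server = {}
    for servers in access.values():
        for server in servers:
            users_per_server[server] = users_per_server.get(server, 0) + 1
    return {s: 1.0 + math.log(n_users / k) for s, k in users_per_server.items()}

# Hypothetical access history.
access = {
    "alice": {"print01", "build01"},
    "bob": {"print01", "finance-db"},
    "carol": {"print01", "build01"},
}
weights = server_weights(access)
print(weights)
```

    With weights like these, a first-time login to a rarely used server contributes more to a user's anomaly score than a login to a server that nearly everyone uses.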

  • Principal Component Analysis (PCA) Scoring

                Week1  Week2  ...  Week10  Week11  ...  Week15
    server1       2      3    ...     1       0    ...     0
    server2       4      7    ...     1       3    ...     7
    server3       0      2    ...     0       0    ...     0
    ...
    server25      1      3    ...     5       8    ...     1

    PCA Model Built per User (Training Data): Week1 to Week10. Testing Data: Week11 to Week15.

    User behavior matrix is created using x weeks of history for a user. The current week is used as test data.

    PCA is a dimensionality reduction technique that captures the set of components of a multidimensional vector which account for most of the variance.

    Principal dimensions are calculated from the training data.

  • Principal Component Analysis (PCA) Scoring

    Reconstruction Error

    [Flow diagram: Training Data (User Behavior Matrix) → Run PCA → Principal Dimensions. Test Vector (User data for new week) → Project onto Principal Dimensions → Reconstruct → Reconstructed Test Vector. Anomaly Score = difference between the test vector and the reconstructed test vector.]

    Ref: A Lakhina, M Crovella, C Diot, Diagnosing network-wide traffic anomalies
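
    The reconstruction-error score can be sketched as follows. This is an illustrative pure-Python version, not the R/SVD code used in the talk: it finds the principal dimensions by power iteration with deflation, and the function names, the number of components `k`, and the sample data are all assumptions.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_principal_dimensions(weeks, k, iters=500, seed=0):
    """weeks: one row per training week, one column per server.
    Returns (mean, components) with up to k unit-length principal dimensions."""
    n, d = len(weeks), len(weeks[0])
    mean = [sum(row[j] for row in weeks) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in weeks]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.random() for _ in range(d)]
        for _ in range(iters):
            w = [dot(row, v) for row in cov]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:              # no variance left to explain
                return mean, components
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in cov])
        components.append(v)
        # Deflate: remove the variance explained by this component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return mean, components

def anomaly_score(test, mean, components):
    """Project the centered test vector onto the principal dimensions,
    reconstruct it, and return the length of the difference."""
    z = [t - m for t, m in zip(test, mean)]
    recon = [0.0] * len(z)
    for v in components:
        coeff = dot(z, v)
        recon = [r + coeff * x for r, x in zip(recon, v)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, recon)))

# Hypothetical history: logins per week to three servers; server 3 is never used.
training = [[2, 4, 0], [3, 6, 0], [1, 2, 0], [4, 8, 0], [2, 4, 0]]
mean, components = fit_principal_dimensions(training, k=1)
normal_week = anomaly_score([3, 6, 0], mean, components)
odd_week = anomaly_score([2, 4, 5], mean, components)
print(normal_week, odd_week)
```

    The week that follows the user's usual pattern reconstructs almost perfectly, while the week with logins to a never-used server leaves a large residual.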

  • Principal Component Analysis (PCA) Scoring

    Oversampling PCA

    [Flow diagram: Training Data (User Behavior Matrix) → Run PCA → First Principal Vector. Training Data (User Behavior Matrix) plus Oversampled Test Data → Run PCA → First Principal Vector after oversampling. Anomaly Score = difference in angle between the two vectors.]

    Reference and Image Source: YR Yeh, ZY Lee, YJ Lee, Anomaly Detection via Over-sampling Principal Component Analysis
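
    A rough sketch of the oversampling idea: duplicate the test week several times, append the copies to the training data, and measure how far the first principal vector rotates. The score used here, one minus the absolute cosine of the angle, is one common way to express the "difference in angle"; the number of copies and the data are hypothetical.

```python
import math
import random

def first_principal_vector(rows, iters=500, seed=0):
    """Unit-length first principal direction of rows, via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iters):
        w = [sum(a * b for a, b in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:                  # data has no variance
            break
        v = [x / norm for x in w]
    return v

def ospca_score(train, test, copies):
    """1 - |cos(angle)| between the first principal vector before and after
    adding `copies` duplicates of the test vector to the training data."""
    u = first_principal_vector(train)
    u_os = first_principal_vector(train + [test] * copies)
    cosine = abs(sum(a * b for a, b in zip(u, u_os)))
    return 1.0 - min(1.0, cosine)

# Hypothetical history: logins per week to three servers.
training = [[2, 4, 0], [3, 6, 0], [1, 2, 0], [4, 8, 0], [2, 4, 0]]
normal_score = ospca_score(training, [3, 6, 0], copies=2)
odd_score = ospca_score(training, [2, 4, 5], copies=2)
print(normal_score, odd_score)
```

    A week that matches the existing pattern leaves the principal vector unchanged, so its score stays near zero, whereas an unusual week pulls the vector toward itself and scores higher.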

  • Parallelized PCA using PL/R

    R Code to find the Principal Components (using SVD)

    SQL & R

    PL/R wrapper over the R Code to run in parallel

    [Diagram: User1 Data to User5 Data each feed a separate model, User1 Model to User5 Model]

  • Recommendation System-Based Model

    Users rate items

    To recommend items to a particular user A:

      Find other users U similar to A

      Identify the set of items I accessed by U

      Recommend these items I to A

    Users = Employees

    Items = Servers accessed

    Image Source: http://dataconomy.com/2015/03/an-introduction-to-recommendation-engines/
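
    The three steps on this slide correspond to user-based collaborative filtering. The sketch below assumes Jaccard similarity over the sets of servers each employee has accessed; the slide does not name a similarity measure, and the employee and server names are made up.

```python
def jaccard(a, b):
    """Similarity of two server sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_servers(user, access, top_k=2):
    """Find the top_k employees most similar to `user`, collect the servers
    they accessed, and recommend those the user has not accessed yet."""
    peers = sorted(
        (other for other in access if other != user),
        key=lambda other: jaccard(access[user], access[other]),
        reverse=True,
    )[:top_k]
    recommended = set()
    for peer in peers:
        recommended |= access[peer]
    return recommended - access[user]

# Hypothetical access history: employee -> servers accessed.
access = {
    "alice": {"s1", "s2"},
    "bob": {"s1", "s2", "s3"},
    "carol": {"s1", "s2", "s4"},
    "dave": {"s9"},
}
suggestions = recommend_servers("alice", access)
print(suggestions)
```

    Servers in the recommended set are ones a user could plausibly start using, so a first login to one of them is less suspicious than a login to a server no peer uses.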

  • Recommendation System-Based Model

    Historical profile for each user based on number of days per week for a particular server, weighted by recommendations

    AD Logs, LDAP data (job title, dept, etc.)

    Heat Map (top figure): X-axis: Week Index; Y-axis: Server; Value: Number of days per week weighted by recommendations

    Outlier Plot (bottom figure): X-axis: Week Index; Y-axis: Outlier Score

    [Figures: Heat map before recommendations and heat map after recommendations, for servers g1 to g5]

    Servers g3 & g4 are recommended, hence weight is decreased

    Outlier score in test week decreases because the new servers that the user accesses are recommended for the user's job profile

  • Graph Model

    Using historical Windows event data to build graphs* of typical user behavior

      Which machines does the user log into?

      Which machines does the user log in from?

      How often?

      In which order?

    Ask if this behavior is typical

      Is it typical for this user?

      Is it typical for someone in a particular department?

      Is this typical for someone in the user's job role?

    Graph models are sensitive to direction, order, and frequency

    [Figure: Typical Behavior and Anomalous Behavior login graphs over hosts 34.23.123.4, 34.23.1.1, 34.23.0.1, 34.23.2.8 and 34.23.123.51, with one node labelled "DB with financial information"]

    *Reference: Alexander D. Kent, Lorie M. Liebrock, Joshua C. Neil. Authentication graphs: Analyzing user behavior within an enterprise network.
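
    A very small sketch of the graph idea: treat each historical login as a directed edge from the source machine to the destination machine, count how often each edge occurs, and flag new logins whose edge is missing or rare. The host addresses are taken from the slide's figure, but the specific edges, counts, and the `min_count` cutoff are invented for illustration. Login order is not modelled here.

```python
from collections import Counter

def build_login_graph(events):
    """events: iterable of (source_host, destination_host) logins.
    Returns a Counter mapping each directed edge to its frequency."""
    return Counter(events)

def unusual_logins(graph, new_events, min_count=1):
    """Return the new logins whose directed edge was seen fewer than
    min_count times in the historical graph."""
    return [edge for edge in new_events if graph[edge] < min_count]

# Hypothetical history of logins for one user.
history = (
    [("34.23.123.4", "34.23.1.1")] * 5
    + [("34.23.1.1", "34.23.0.1")] * 3
    + [("34.23.0.1", "34.23.2.8")] * 4
)
graph = build_login_graph(history)

new_events = [
    ("34.23.123.4", "34.23.1.1"),     # seen before
    ("34.23.123.4", "34.23.123.51"),  # never seen
    ("34.23.1.1", "34.23.123.4"),     # known hosts, reversed direction
]
flagged = unusual_logins(graph, new_events)
print(flagged)
```

    Because edges are directed, a login between two familiar machines in the reverse of the usual direction is still flagged, which reflects the slide's point that graph models are sensitive to direction and frequency.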

  • Challenge: Cybersecurity threats, data privacy, data protection and fraudulent behavior going undetected, leaving customer vulnerable to security risks, loss of money

    Need to gain timely insight into unusual/suspicious internal behavior to allow for proper action

    Tools in place cannot be customized to leverage historica