demystifying machine learning and

Copyright©2016Splunk Inc.

TobyRyanEmersonIncidentResponseTeam(CIRT)

DemystifyingMachineLearningAndAnomalyDetection:PracticalApplicationsinSplunk forInsiderThreatDetectionandNetworkSecurityAnalytics

Bio• Current: Manager, Behavioral Analytics, Emerson Computer Incident Response Team (CIRT)

– “Unofficial” Data Scientist– Serve as the design lead for our Splunk custom analytics platform– Manage the Insider Threat Program– Member - Carnegie Mellon CERT Open Source Insider Threat (OSIT) working group– Chair – OSIT Data Analytics Special Interest Group– Board of Advisors - Carnegie Mellon CERT Open Source Insider Threat (OSIT) working group

• Prior to Emerson: Special Agent, US Naval Criminal Investigative Service (NCIS)– Insider Threat, Cyber, and Fraud Investigations (8 years)

• 1996-2007: The “Lost” Years

• BS in Spatial Information Science and Engineering – University of Maine (1996)(I was doing data science before it was cool!)

2

GoalsoftheSession:

3

• You will be able to describe the similarities and differences between internal/insider and external threats

• You will be able to map Machine Learning (ML) and Anomaly Detection (AD) algorithms to security use-cases

• You can start demystifying ML and AD by using practical security applications of ML and AD with Splunk Enterprise

• You will have the knowledge of where to start your own Security-Purposed ML and AD platform using Splunk Enterprise.

• You can start the conversation between technical experts and non-technical Insider Threat experts

Agenda• Overview of threat types• Data Science cycle for security• Architecture of a Splunk-based Anomaly Detection platform• Types of anomalies used in security use-cases• Solving a security problem with Machine Learning

– Deep dive for email analytics– Practical applications in ML– Anomaly Detection model improvement– Clustering for security

• Practical uses of ML and AD in various security and insider threat uses cases• Advanced use-cases• Wrap up and Questions

4

WhyIWantToTalkToYou….

5

InsiderThreatProgramsarealmostequallydistributedbetweenHumanResources,Legal,Security,andInformationSecurityThat’sroughly75%thatareNOT inatechnicaldepartmentIfwearethe75%,howdoweapproachourInformationSecuritydepartmentstoexplainwhatwearelookingfor?Ifwearethe25%,howdoweexplainwhatwecando?

Highly Technical IT Security HR/Legal/Security A Disconnect

Internalvs.ExternalThreats• InsiderThreatcategories:

ê MaliciousInsider

ê Non-MaliciousInsider(“AccidentalInsiderThreat”)

– NegligentInsider

ê Externalactorbehaving likeaninsider

• 3types ofInsiderThreats:ê DataTheft(IntellectualProperty,PII,Financial,etc.)

ê Fraud

ê Sabotage

6

AlarmingStatistics• 62%OfemployeesthinkitisOKtomoveworkdocumentstopersonalcomputersormobiledevices

• 51%ThinkitisOKtotakecorporatedatabecausepoliciesarenotenforced;overhalfofemployeessurveyedwholosttheirjobintheprevious12monthskeptconfidentialdata

• 56%Donotthinkitisacrimetousecompetitor’stradesecrets

7

So……..IfyoustoodatthedooronaFridayandstoppedallresigningemployees,youwouldhavea1in2chanceofcatchingsomebody

Source:2013SymantecGlobalSurvey– InsiderThreat

WhatTheStatisticsSay- Generally

Insiderthreatsaccountfor25%-45%ofcyberattacks

MaliciousInsidersstealdata,commitfraud,orsetthesabotageinactionwithin

thelast30daysofemployment

NegligentInsidersarebecomingthemajorityofinsiderthreats

10%-20%Ofemployeesclickonmaliciouslinksinphishingemails

Privilegedusers(Admin,DBA,ITSecurity,accesstotradesecrets,etc.)are

companies’biggestconcern

8

9

Let’sGetIntoTheData……

TheDataScienceCycleForSecurity

10

DetermineUse-Case

DataMining&Exploration

DataValidating&Cleaning

ComputationalScaling/Storage

MachineLearning&AnomalyDetection

Model

Refinement

Alerts&Visualization

ModelTesting

TheDataScienceCycleForSecurity(V2)

11

DetermineUse-Case





Model

Refinement


ModelTesting

InSplunk Terms….

12

DetermineUse-Case





Model

Refinement


ModelTestingSelectSourcetype(s)

VerboseModeExploring

CompareEvents/TableData

API/Short-termStorage

Whatfieldstomeasure:Numericalvs.Text

RepeatCycle

Splunk Alerts/Dashboards

Dotheresultsmakesense?

Architecture

13

LogSources

ExplorationMiningCleaningValidation

ShortTermStorageOff-Splunk ComputationsMachineLearningAnomalyDetection

Globalenterpriseenvironmentwith100,000+users/accounts

API

ModelTesting&Validation

Alerts&Visualizations

UF

Off-Splunk CalculationsatScale

14

Whymoveoff-Splunk?ComputationallyExpensive

NeedsforMachineLearningandAnomalyDetectionaredifferentSplunk usedforexplorationandmodeldevelopmentattestingscaleSplunk MachineLearningToolkit

Allareopensourceproducts

AnomalyDetection&MachineLearning

15

WhatisAD?Typesofsecurityanomalies:

ê spikesinactivity

ê rareevents

ê first-observed

ê outliers

ê statechange

ê simpleexistence

Whatdothesehaveincommon?

time-based

Thebasiccomparisonparameterisself-comparisonovertime.Advancedparametersincludepeer-basedcomparison.

WhatisML?ê SupervisedML

– Classification/Regression

ê UnsupervisedML

– Clustering

ê Semi-Supervised

– Rule-basedAD

ForADandsecurity,MLcanestablishabaselineofnormal(negative)values

UnsupervisedLearning

16

UnsupervisedMachineLearning– Youhaveunlabeleddataandwanttogroupthedatabyfeature(s)

– Thealgorithmmakesitsownstructureoutofthedata

– Youdonotknowwhatoutlierslooklike

– Goodforthedataexplorationphasesofsecurityanomalydetection

– Examplesusedinsecurityapplicationsinclude:

ê Clustering:k-means,k-medians,ExpectationMaximization

ê Association:lessrelevantbecauseinhighlystructuredsearcheswearelessconcernedwith

associationsbetweenfieldsforsecurity anomalydetection

SupervisedLearning

17

SupervisedMachineLearning

– Youhavelabeleddataandthealgorithmpredictstheoutput

– Classificationvs.Regression

– ExampleMLalgorithmsinclude:ê LinearandLogisticRegressionê RandomForestê SupportVectorMachineê DBSCAN

Semi-SupervisedMachineLearning– Youhave“some”labeleddata,butnotall– MostsecurityMLapplicationsfallinthiscategory– LabelPropagation– Rule-basedanomalydetection

ForSECURITY-PURPOSEDapplicationsofML,acombinationofunsupervised,supervised,and

Semi-Supervisedlearningalgorithmsisabestpractice

Inrealisticapplications,security-purposedADrequireshighlystructureddataand

humantrainingofthealgorithm

WhyDoWeUseML&AD?

18

Growingabilityofadversariestoavoidedgedetection

Signature-basedtoolsonlydetectonaknownsignature- theymusthaveanexampleornameofa

bad“thing”tosaywhattheyaremeasuringis“bad”

Behavioral-baseddetectionstargethumannature

ê Whatisattheendonan“endpoint?”

ê Whatistheweakestlink?

ê Targetscomputerandnetworkbehaviortodeterminewhat’snormalandwhat’snot

Whydosomanyphishingemailsgetthrough?

DeepDive:EmailAnalyticsfortheNegligentInsider

19

Uh-oh…..Inthisdeepdive,we’llexamineacompromisethatwouldnotbecaughtwithtraditionalsecuritystacktoolsbutwillbecaughtusingbasicML&AD

Use-CaseDeepDive

EmailUse-Case

20

Yourcompanyhasbeenhitwithalargenumberofphishingemailsthatwerenotdetectedbytraditionalsignature-basedtools

Severalemployeeshaveclickedonthephishinglinkandenteredtheircredentials

Theadversaryhastakenoverseveralaccountsandsentthousandsofadditionalemails,internalandexternal

Use-CaseDeepDive


21

Whatlooksinterestinginthissourcetype?

Whatcouldbeusedtodetectananomaly?

Whatisimportanttonoteabouttheevents?

Sendanemailtoyourself,thentoaco-worker,thentoseveralpeople,etc.asavalidationtest;tracetheactionsthroughSplunk

ML&ADforSecurityBestPractice:

Validatedatabyviewingyourownactionsonthenetwork

sourcetype="MSExchange:2010:MessageTracking"

Use-CaseDeepDive

DataCleaning

22

sourcetype="MSExchange:2010:MessageTracking"sender="[email protected]"recipient_count!=NONE|dedup message_id sortby _time|table_timedirectionalitysenderrecipientmessage_subject message_id recipient_count total_bytes |sort-_time

Whatfieldsarebestpoisedformeasuring?

Whatfieldsprovideenoughcontextforanalysis?

Use-CaseDeepDive

#!/usr/bin/env python#ProgrammedbyGrantRichardSteiner#Lasteditedon07/21/2016- GS#07/19/2016- OutputtoCSV#07/20/2016- tranferred overtoworkstation(runningcentOS 7),setupcron job,queries,outputstoproper#directory#07/21/2106- fixingpermissions,mounteddirectorytolinux machine,shouldoutputtocorrectdirectory#importnecessarylibrariesimporttimeimportsubprocessimportdatetime asdt#totimetheprogram,notnecessarySTART_TIME=time.time()#fornamingfilesSTART_DATE=dt.datetime(2016,01,1)#########################################################################################Thisblockgetsdataandpumpsittothenecessarylocationlocation='/data/ws13/unique_email_averages_%s.csv'%str((dt.datetime.now()- START_DATE).days)#Splunk searchsearch_string ='"searchsourcetype=MSExchange:2010:MessageTrackingsender=*@emerson.comrecipient_count!=NONE'\

'earliest=-1d@dlatest=-0d@d|dedup message_id sortby _time'\'|fields_timesendermessage_subject message_id recipient_count '\'|eval recipient_count =if(recipient_count=NULL,0,recipient_count)|bucket_timespan=1d'\'|statssum(recipient_count)asDaily_Total bysender,_time"'

#calltothecURL executabletorunthesearchfromtheSplunk APIcommand_string ="/usr/bin/curl"\

"-k-ucirt_insider:insiderTHREAT!!dash#"\"https://splunk.emrsn.org:8089/services/search/jobs/export"\"--data-urlencode search=%s-doutput_mode=csv-o"\"%s"%(search_string,location)

#runthecalltotheSplunk APIsubprocess.check_output(command_string,shell=True)#####################################################################################printtime.time()- start_time

MovingData

23

UtilizetheSplunk APItocall,move,transfer,ordepositdata

Inthisuse-case,wearepullingover120daysworthofdatainitially,andthenpullingdailytotals

Dataismovedtoashort-termstoragecontainer(inthiscase,ElasticSearch,butMySQL,orotheropensourceSQLorNoSQLDBsworkfine

Whyarewedoingthis?ê Splunk’s AWESOMEintegrationcapabilityê ComputationallyexpensivewithinSplunkê Enterpriseconstraints

Use-CaseDeepDive

WhereAreWeInThePlatform?

24

LogSources




API

UF


Use-CaseDeepDive

ML&ADModel

25

Whatfeaturesdowechoose?

Supervised?Unsupervised?Classification?

Whatstatisticalmodeldowechoose?

Startbyclusteringalldataê Splunk “cluster”commandfortextand“kmeans”fornumericalfields

ê Whatcommandisaclusteringcommandindisguise?

Whydoweclusterfirst?

Howmanyfeaturesdowechoose?

|statscountby{fieldbeingmeasured}ML&ADforSecurityBestPractice:

Fromanincidentresponseperspective,highlystructuredandsinglefeaturedataisrequiredtominimizetime

consideringfalsepositives

Use-CaseDeepDive

26

K-MeansClustering

sourcetype="MSExchange:2010:MessageTracking"sender="*@emerson.com"recipient_count!=NONE|dedupmessage_id sortby _time|table_timedirectionalitysenderrecipientmessage_subject message_idrecipient_count total_bytes |bucket_timespan=1d|statssum(recipient_count)asdaily_total bysender,_time|kmeans k=5daily_total |statscountbyCLUSTERNUMcentroid_daily_total

Use-CaseDeepDive

TrainingDataAndTheMLProcess

27

Collectasetoftrainingdata(univariate/singlefeature/singlefield)ê Inourcase,itis60-120daysworthofdailyemailtotals

ê Next,splitthedatabytimeinto3groups:trainingset,cross-validationset,

testset

DetermineifyourdatasetisGaussian(NormalDistribution)

ML&ADforSecurityBestPractices:- Splithistoricaldata60-20-20intotraining,cross-validation,andtestsets- Don’treusedata;donotuserandomsplits

Use-CaseDeepDive

28

Don’tworry,Splunk andSciPy/Scikit dothemathforyou!

AlgorithmSelectionFornormaldistributions,Inter-QuartileRange(IQR)isagoodplacetostart

WecantestbackinSplunk forspecificclusterusers(usingself-generateddata)

Otheroptionsavailableinclude:

– Scikit-learn.orghasthepythonmodules– MATLAB,GNUOctave,andRallhaveextensiveMLandADpackages– PythonhaseasyGaussiantestalgorithms(usedinthisexample)

ê scipy.stats.mstats.normaltestê scipy.stats.shapiro

Scikit-Learnhasin-depthexplanationsofeachalgorithmand

commanddescriptionssuchas“fit(x)”and“predict(x)”,etc.

Use-CaseDeepDive

ModelTesting

30

sourcetype="MSExchange:2010:MessageTracking"sender="[email protected]"recipient_count!=NONE|dedup message_id sortby _time|table_timedirectionalitysenderrecipientmessage_subject message_id recipient_count total_bytes |timechart sum(recipient_count)asdaily_total span=1d|eventstats median(daily_total)asmedian,p25(daily_total)asp25,p75(daily_total)asp75,mean(daily_total)asmean|eval iqr =p75- p25|eval xplier =2|eval low_lim =median- (iqr *xplier)|eval high_lim =median+(iqr *xplier)|eval anomaly=if(daily_total <low_lim ORdaily_total >high_lim,daily_total,0)|table_timedaily_total anomaly

FalsePositive FalsePositives

TruePositive

Use-CaseDeepDive

Better?

31

sourcetype="MSExchange:2010:MessageTracking"sender="[email protected]"recipient_count!=NONE|dedup message_idsortby _time|table_timedirectionalitysenderrecipientmessage_subject message_id recipient_count total_bytes |timechartsum(recipient_count)asdaily_total span=1d|eventstats median(daily_total)asmedian,p10(daily_total)asp10,p90(daily_total)asp90,mean(daily_total)asmean|eval iqr =p90- p10|eval xplier =2|eval low_lim =median- (iqr *xplier)|eval high_lim =median+(iqr *xplier)|eval anomaly=if(daily_total <low_lim ORdaily_total >high_lim,daily_total,0)|table_timedaily_total anomaly

Use-CaseDeepDive

ValidatingModels

32

• How can we validate models?

Precision =# of correct positive values

# of all positive results

# of correct positive values

# that should have been positiveRecall =

precision x recall

precision + recallF1 Score = 2

F1 Score is the harmonic mean, or average of rates, where F1 is best at a value of 1, and worst at a value of 0.

Firstmodel: F1 =0.4

Secondmodel: F1 =1.0

Use-CaseDeepDive

Bewareofmissingfalsenegativesbytuningtoomuchtooquickly;tuningisaniterativeprocessovertime

33

LogSources




API

UF


WhereAreWeInThePlatform? Use-CaseDeepDive

OneLastNoteonNegligentInsiderEmailAnalytics

34

Considernotonlyalargenumberofrecipientsoutsideauser’snormalbehavior,butconsiderthenumberofnewrecipientsWhatistheaveragenumberofnewrecipientsanemployeeemailseachday?One?Five?Establishasetoftrainingdataandrecordtheuniquerecipientsover60daysCreateananomalydetectionthatfireswhenthenumberofnewrecipientsexceedsthebaselinevarianceAddtothe“#ofrecipientsperday”dataforhigherfidelityalertsExample:

ê Baselinenumberofdailyrecipients=30ê Today’samount=75,butfallsonthefringeofbeinganoutlierê Numberofnewrecipients=1-5;falsepositiveê Numberofnewrecipients=50;truepositive

Use-CaseDeepDive

35

MoveTheDataToShort-termStorageForMeasurement Use-Case

DeepDive

Nowinapositiontocompare#ofnewrecipients


36

• Theoutputoftheoff-Splunk calculationscanbepicked

upbytheSplunk UForwrittentoaflatfile

• AllowstheusertocapitalizeontheSplunk interface

• Advantages/DisadvantagesofIndexingand

Sourcetyping:

ê Treatlikeanyotherdatasourceforcalculations

ê Technically“re-indexing”data,howeveranomalydata

setswillbesmall

Use-CaseDeepDive

Refinement

37

• Treatdifferentclusterswithdifferentmodels

• Continuallyvalidatedataandresults

• Understandwhyfalsepositivescomeup

• Addlengthtotrainingdatatimeifpossible

• IfaclusterisnotGaussian,tryothermodels,ortrytofitthedatato

aNormalDistribution

• Comparesimplerule-basedmodelssuchas3xmean=anomaly

Use-CaseDeepDive

Additional Use-Cases &Use-Case Starter Searches

Use-case:Cross-correlationOfDepartingEmployeesThroughRule-basedSearches

39

• Cross-correlate USB activity with departing employees for insider threat detection

• Background: Historical data from insider threat incidents indicate large transfers of data prior to departure

• Data Source:ê Endpoint Agent Logs (McAfee, Symantec, Kaspersky, etc.)ê Message Tracking Logs

DepartingEmployees

(HighestRisk)

SpikeinUSBActivity

Use-case:Cross-correlationOfDepartingEmployeesThroughRule-basedADSearches

40

• MostUSBactivitywillnotbeanormaldistribution

• UtilizeK-Meanstodetermineuserswhoback-uptheirmachinesviaUSB

• Mustutilizearule-basedapproach

• Setadailythresholdsuchas:daily_total >20MBtoindicatelargedatatransfers

sourcetype=“endpointagent"api="FileWrite"file_size!=0user=xxxxx |eval file_size_MB=(file_size/1048576)|dedupfile_size_MB,src |renameapi asaction,parameterasfile_path,file_name asprogram,src ashost_name |timechartsum(file_size_MB)asDaily_Total span=1d

41

• CreateasearchthatincludesdatafromyourHR

organizationOR…learnwhoisdepartingthrougha

rulebasedsearchofemailsubjectsê “*termination*”

ê “*resignation*”

ê AutogeneratedOracleorPeoplesoft emails

• CombinetheresultswiththespikesinUSBactivity

searchtocreateatwo-featureclassificationlearning

algorithm

Use-case:Cross-correlationOfDepartingEmployeesThroughRule-basedADSearches

sourcetype="MSExchange:2010:MessageTracking"recipient_count!=NONEmessage_subject!="*outofoffice*"message_subject!="Automaticreply*"message_subject!="CustomerSatisfaction*"message_subject!="READ:*"message_subject!="*determination*"(message_subject="*resignation*"ORmessage_subject="*termination*")|dedupmessage_id sortby _time|table_timedirectionalitysenderrecipientmessage_subject message_id

ML&ADforSecurityBestPractice:Cleanandreducesizeofthedatasettoincludeonlythoseitemsofvalue

Use-Case:EmailClientTypeStateChange

42

• DetectrareoranomalousvaluesinclienttypesusedbyOutlookWebAccess(OWA)orOutlookAnywhereusersasasignofcompromise

• Rarityor“novelty-based”anomalydetectionovertime

• First-observeddetection

(sourcetype="MSWindows:2008R2:IIS"ORsourcetype=iis)cs_username!="-"cs_user_agent!="-"(WebApplication=OWAORWebApplication=owa)|fields_timecs_username cs_user_agent WebApplication eventtype |statscountbycs_username,cs_user_agent|sort-count|statslist(cs_user_agent)asuser_agent,list(count)ascountbycs_username |wheremvcount(user_agent)>2

Use-case:SudoLogsAndSabotage

43

• Lookforanomalouspatternsinsudotorootprivileges

ê Time-basedloginsê Unauthorizedscriptsê DataTheftê Unauthorizedserveraccess

• Useacombinationofsupervisedrule-baseddetectionforscriptexecutionandtime-basedanomalydetectionforauthenticationdata

sourcetype=sudo dba_user |table_timehostdba_user Sudo_User PWDCOMMAND|statscountbydba_user,Sudo_User,COMMAND|sort-count|statslist(COMMAND)asCommand,list(count)ascountbydba_user,Sudo_User

InsiderThreatUse-CaseStarterSearches

44

• DetermineanomalousvaluessurroundingprivilegedwindowsadminswhoutilizeRDP

ê Determinemodelforbaseline– willprobablynotbeanormaldistribution

ê Setdetectionsforvaluesoutsideofbaselineê Featuresincludedestinationservers,time-of-day,andmultiplenoveltyevents

• UseclassificationandsupervisedlearningtodetectsensitivefiletypesinUSBtraffic

ê CADfiles,financialdocuments,engineering,businessintelligence,pricing,etc.

ê Setpositivevaluestofileextensionsofinterestsuchas.DWG

sourcetype="WinEventLog:Security"(EventCode=4624OREventCode=4625)Logon_Type=10user=“*admin*"|table_timeuserAccount_NameWorkstation_Name ComputerName src_ip

sourcetype=“endpointagent""FileWrite"file_size!=0user=xxxxx “*.DWG”|eval file_size_MB=(file_size/1048576)|dedup file_size_MB,src |statscountbysrc {oruser}

InternalAndExternalThreatUse-caseIdeas

45

• FTPServers:clusteringIPaddresses,frequencyspike,rule-baseddetectionsusingcompany-specificcriteria(IIS/FW/LBLogs)

• Phishingandfraudemaildetection:domainmismatchusingemailmetadata(message_id)tocomparesendingdomain,displayname,andreturnpath(MessageTrackingLogs)

• LargeNumberoffiledownloads/views/printsfromapplicationhousingsensitivedocuments(App-specificand/orIISlogsifweb-based)

• Anomalousportactivity(Ports53,25,21,22,443,123,etc.)• Authenticationanomalies:logintoarareorfirst-observeddevice,off-hourlogin,patternofsingle

failedloginsfromseveralmachinesorSharepoint locations(the“probing”user)• Detectingsharedcredentials– especiallyamongsensitiveusers(DBAs,Admins,etc.)

WrappingUp- WhatHaveWeCovered?

46

• TheDataScienceCycleforSecurity

• DeepDiveintoML&ADforsecurity

• DemystifiedthemathbehindML&ADandprovided

simplesolutionssuchasclassificationalgorithms

• VariousUse-CasesforsecurityML&AD

46

The process of cleaning and carefully selecting data is more important than choosing the right algorithm

47

Questions?

THANKYOU

demystifying machine learning and

Documents