demystifying machine learning and
TRANSCRIPT
Copyright©2016Splunk Inc.
TobyRyanEmersonIncidentResponseTeam(CIRT)
DemystifyingMachineLearningAndAnomalyDetection:PracticalApplicationsinSplunk forInsiderThreatDetectionandNetworkSecurityAnalytics
Bio• Current: Manager, Behavioral Analytics, Emerson Computer Incident Response Team (CIRT)
– “Unofficial” Data Scientist– Serve as the design lead for our Splunk custom analytics platform– Manage the Insider Threat Program– Member - Carnegie Mellon CERT Open Source Insider Threat (OSIT) working group– Chair – OSIT Data Analytics Special Interest Group– Board of Advisors - Carnegie Mellon CERT Open Source Insider Threat (OSIT) working group
• Prior to Emerson: Special Agent, US Naval Criminal Investigative Service (NCIS)– Insider Threat, Cyber, and Fraud Investigations (8 years)
• 1996-2007: The “Lost” Years
• BS in Spatial Information Science and Engineering – University of Maine (1996)(I was doing data science before it was cool!)
2
GoalsoftheSession:
3
• You will be able to describe the similarities and differences between internal/insider and external threats
• You will be able to map Machine Learning (ML) and Anomaly Detection (AD) algorithms to security use-cases
• You can start demystifying ML and AD by using practical security applications of ML and AD with Splunk Enterprise
• You will have the knowledge of where to start your own Security-Purposed ML and AD platform using Splunk Enterprise.
• You can start the conversation between technical experts and non-technical Insider Threat experts
Agenda• Overview of threat types• Data Science cycle for security• Architecture of a Splunk-based Anomaly Detection platform• Types of anomalies used in security use-cases• Solving a security problem with Machine Learning
– Deep dive for email analytics– Practical applications in ML– Anomaly Detection model improvement– Clustering for security
• Practical uses of ML and AD in various security and insider threat uses cases• Advanced use-cases• Wrap up and Questions
4
WhyIWantToTalkToYou….
5
InsiderThreatProgramsarealmostequallydistributedbetweenHumanResources,Legal,Security,andInformationSecurityThat’sroughly75%thatareNOT inatechnicaldepartmentIfwearethe75%,howdoweapproachourInformationSecuritydepartmentstoexplainwhatwearelookingfor?Ifwearethe25%,howdoweexplainwhatwecando?
Highly Technical IT Security HR/Legal/Security A Disconnect
Internalvs.ExternalThreats• InsiderThreatcategories:
ê MaliciousInsider
ê Non-MaliciousInsider(“AccidentalInsiderThreat”)
– NegligentInsider
ê Externalactorbehaving likeaninsider
• 3types ofInsiderThreats:ê DataTheft(IntellectualProperty,PII,Financial,etc.)
ê Fraud
ê Sabotage
6
AlarmingStatistics• 62%OfemployeesthinkitisOKtomoveworkdocumentstopersonalcomputersormobiledevices
• 51%ThinkitisOKtotakecorporatedatabecausepoliciesarenotenforced;overhalfofemployeessurveyedwholosttheirjobintheprevious12monthskeptconfidentialdata
• 56%Donotthinkitisacrimetousecompetitor’stradesecrets
7
So……..IfyoustoodatthedooronaFridayandstoppedallresigningemployees,youwouldhavea1in2chanceofcatchingsomebody
Source:2013SymantecGlobalSurvey– InsiderThreat
WhatTheStatisticsSay- Generally
Insiderthreatsaccountfor25%-45%ofcyberattacks
MaliciousInsidersstealdata,commitfraud,orsetthesabotageinactionwithin
thelast30daysofemployment
NegligentInsidersarebecomingthemajorityofinsiderthreats
10%-20%Ofemployeesclickonmaliciouslinksinphishingemails
Privilegedusers(Admin,DBA,ITSecurity,accesstotradesecrets,etc.)are
companies’biggestconcern
8
TheDataScienceCycleForSecurity
10
DetermineUse-Case
DataMining&Exploration
DataValidating&Cleaning
ComputationalScaling/Storage
MachineLearning&AnomalyDetection
Model
Refinement
Alerts&Visualization
ModelTesting
TheDataScienceCycleForSecurity(V2)
11
DetermineUse-Case
DataMining&Exploration
DataValidating&Cleaning
ComputationalScaling/Storage
MachineLearning&AnomalyDetection
Model
Refinement
Alerts&Visualization
ModelTesting
InSplunk Terms….
12
DetermineUse-Case
DataMining&Exploration
DataValidating&Cleaning
ComputationalScaling/Storage
MachineLearning&AnomalyDetection
Model
Refinement
Alerts&Visualization
ModelTestingSelectSourcetype(s)
VerboseModeExploring
CompareEvents/TableData
API/Short-termStorage
Whatfieldstomeasure:Numericalvs.Text
RepeatCycle
Splunk Alerts/Dashboards
Dotheresultsmakesense?
Architecture
13
LogSources
ExplorationMiningCleaningValidation
ShortTermStorageOff-Splunk ComputationsMachineLearningAnomalyDetection
Globalenterpriseenvironmentwith100,000+users/accounts
API
ModelTesting&Validation
Alerts&Visualizations
UF
Off-Splunk CalculationsatScale
14
Whymoveoff-Splunk?ComputationallyExpensive
NeedsforMachineLearningandAnomalyDetectionaredifferentSplunk usedforexplorationandmodeldevelopmentattestingscaleSplunk MachineLearningToolkit
Allareopensourceproducts
AnomalyDetection&MachineLearning
15
WhatisAD?Typesofsecurityanomalies:
ê spikesinactivity
ê rareevents
ê first-observed
ê outliers
ê statechange
ê simpleexistence
Whatdothesehaveincommon?
time-based
Thebasiccomparisonparameterisself-comparisonovertime.Advancedparametersincludepeer-basedcomparison.
WhatisML?ê SupervisedML
– Classification/Regression
ê UnsupervisedML
– Clustering
ê Semi-Supervised
– Rule-basedAD
ForADandsecurity,MLcanestablishabaselineofnormal(negative)values
UnsupervisedLearning
16
UnsupervisedMachineLearning– Youhaveunlabeleddataandwanttogroupthedatabyfeature(s)
– Thealgorithmmakesitsownstructureoutofthedata
– Youdonotknowwhatoutlierslooklike
– Goodforthedataexplorationphasesofsecurityanomalydetection
– Examplesusedinsecurityapplicationsinclude:
ê Clustering:k-means,k-medians,ExpectationMaximization
ê Association:lessrelevantbecauseinhighlystructuredsearcheswearelessconcernedwith
associationsbetweenfieldsforsecurity anomalydetection
SupervisedLearning
17
SupervisedMachineLearning
– Youhavelabeleddataandthealgorithmpredictstheoutput
– Classificationvs.Regression
– ExampleMLalgorithmsinclude:ê LinearandLogisticRegressionê RandomForestê SupportVectorMachineê DBSCAN
Semi-SupervisedMachineLearning– Youhave“some”labeleddata,butnotall– MostsecurityMLapplicationsfallinthiscategory– LabelPropagation– Rule-basedanomalydetection
ForSECURITY-PURPOSEDapplicationsofML,acombinationofunsupervised,supervised,and
Semi-Supervisedlearningalgorithmsisabestpractice
Inrealisticapplications,security-purposedADrequireshighlystructureddataand
humantrainingofthealgorithm
WhyDoWeUseML&AD?
18
Growingabilityofadversariestoavoidedgedetection
Signature-basedtoolsonlydetectonaknownsignature- theymusthaveanexampleornameofa
bad“thing”tosaywhattheyaremeasuringis“bad”
Behavioral-baseddetectionstargethumannature
ê Whatisattheendonan“endpoint?”
ê Whatistheweakestlink?
ê Targetscomputerandnetworkbehaviortodeterminewhat’snormalandwhat’snot
Whydosomanyphishingemailsgetthrough?
DeepDive:EmailAnalyticsfortheNegligentInsider
19
Uh-oh…..Inthisdeepdive,we’llexamineacompromisethatwouldnotbecaughtwithtraditionalsecuritystacktoolsbutwillbecaughtusingbasicML&AD
Use-CaseDeepDive
EmailUse-Case
20
Yourcompanyhasbeenhitwithalargenumberofphishingemailsthatwerenotdetectedbytraditionalsignature-basedtools
Severalemployeeshaveclickedonthephishinglinkandenteredtheircredentials
Theadversaryhastakenoverseveralaccountsandsentthousandsofadditionalemails,internalandexternal
Use-CaseDeepDive
DataMining&Exploration
21
Whatlooksinterestinginthissourcetype?
Whatcouldbeusedtodetectananomaly?
Whatisimportanttonoteabouttheevents?
Sendanemailtoyourself,thentoaco-worker,thentoseveralpeople,etc.asavalidationtest;tracetheactionsthroughSplunk
ML&ADforSecurityBestPractice:
Validatedatabyviewingyourownactionsonthenetwork
sourcetype="MSExchange:2010:MessageTracking"
Use-CaseDeepDive
DataCleaning
22
sourcetype="MSExchange:2010:MessageTracking"sender="[email protected]"recipient_count!=NONE|dedup message_id sortby _time|table_timedirectionalitysenderrecipientmessage_subject message_id recipient_count total_bytes |sort-_time
Whatfieldsarebestpoisedformeasuring?
Whatfieldsprovideenoughcontextforanalysis?
Use-CaseDeepDive
#!/usr/bin/env python#ProgrammedbyGrantRichardSteiner#Lasteditedon07/21/2016- GS#07/19/2016- OutputtoCSV#07/20/2016- tranferred overtoworkstation(runningcentOS 7),setupcron job,queries,outputstoproper#directory#07/21/2106- fixingpermissions,mounteddirectorytolinux machine,shouldoutputtocorrectdirectory#importnecessarylibrariesimporttimeimportsubprocessimportdatetime asdt#totimetheprogram,notnecessarySTART_TIME=time.time()#fornamingfilesSTART_DATE=dt.datetime(2016,01,1)#########################################################################################Thisblockgetsdataandpumpsittothenecessarylocationlocation='/data/ws13/unique_email_averages_%s.csv'%str((dt.datetime.now()- START_DATE).days)#Splunk searchsearch_string ='"searchsourcetype=MSExchange:2010:MessageTrackingsender=*@emerson.comrecipient_count!=NONE'\
'earliest=-1d@dlatest=-0d@d|dedup message_id sortby _time'\'|fields_timesendermessage_subject message_id recipient_count '\'|eval recipient_count =if(recipient_count=NULL,0,recipient_count)|bucket_timespan=1d'\'|statssum(recipient_count)asDaily_Total bysender,_time"'
#calltothecURL executabletorunthesearchfromtheSplunk APIcommand_string ="/usr/bin/curl"\
"-k-ucirt_insider:insiderTHREAT!!dash#"\"https://splunk.emrsn.org:8089/services/search/jobs/export"\"--data-urlencode search=%s-doutput_mode=csv-o"\"%s"%(search_string,location)
#runthecalltotheSplunk APIsubprocess.check_output(command_string,shell=True)#####################################################################################printtime.time()- start_time
MovingData
23
UtilizetheSplunk APItocall,move,transfer,ordepositdata
Inthisuse-case,wearepullingover120daysworthofdatainitially,andthenpullingdailytotals
Dataismovedtoashort-termstoragecontainer(inthiscase,ElasticSearch,butMySQL,orotheropensourceSQLorNoSQLDBsworkfine
Whyarewedoingthis?ê Splunk’s AWESOMEintegrationcapabilityê ComputationallyexpensivewithinSplunkê Enterpriseconstraints
Use-CaseDeepDive
WhereAreWeInThePlatform?
24
LogSources
ExplorationMiningCleaningValidation
ModelTesting&Validation
Alerts&Visualizations
API
UF
ShortTermStorageOff-Splunk ComputationsMachineLearningAnomalyDetection
Use-CaseDeepDive
ML&ADModel
25
Whatfeaturesdowechoose?
Supervised?Unsupervised?Classification?
Whatstatisticalmodeldowechoose?
Startbyclusteringalldataê Splunk “cluster”commandfortextand“kmeans”fornumericalfields
ê Whatcommandisaclusteringcommandindisguise?
Whydoweclusterfirst?
Howmanyfeaturesdowechoose?
|statscountby{fieldbeingmeasured}ML&ADforSecurityBestPractice:
Fromanincidentresponseperspective,highlystructuredandsinglefeaturedataisrequiredtominimizetime
consideringfalsepositives
Use-CaseDeepDive
26
K-MeansClustering
sourcetype="MSExchange:2010:MessageTracking"sender="*@emerson.com"recipient_count!=NONE|dedupmessage_id sortby _time|table_timedirectionalitysenderrecipientmessage_subject message_idrecipient_count total_bytes |bucket_timespan=1d|statssum(recipient_count)asdaily_total bysender,_time|kmeans k=5daily_total |statscountbyCLUSTERNUMcentroid_daily_total
Use-CaseDeepDive
TrainingDataAndTheMLProcess
27
Collectasetoftrainingdata(univariate/singlefeature/singlefield)ê Inourcase,itis60-120daysworthofdailyemailtotals
ê Next,splitthedatabytimeinto3groups:trainingset,cross-validationset,
testset
DetermineifyourdatasetisGaussian(NormalDistribution)
ML&ADforSecurityBestPractices:- Splithistoricaldata60-20-20intotraining,cross-validation,andtestsets- Don’treusedata;donotuserandomsplits
Use-CaseDeepDive
AlgorithmSelectionFornormaldistributions,Inter-QuartileRange(IQR)isagoodplacetostart
WecantestbackinSplunk forspecificclusterusers(usingself-generateddata)
Otheroptionsavailableinclude:
– Scikit-learn.orghasthepythonmodules– MATLAB,GNUOctave,andRallhaveextensiveMLandADpackages– PythonhaseasyGaussiantestalgorithms(usedinthisexample)
ê scipy.stats.mstats.normaltestê scipy.stats.shapiro
Scikit-Learnhasin-depthexplanationsofeachalgorithmand
commanddescriptionssuchas“fit(x)”and“predict(x)”,etc.
Use-CaseDeepDive
ModelTesting
30
sourcetype="MSExchange:2010:MessageTracking"sender="[email protected]"recipient_count!=NONE|dedup message_id sortby _time|table_timedirectionalitysenderrecipientmessage_subject message_id recipient_count total_bytes |timechart sum(recipient_count)asdaily_total span=1d|eventstats median(daily_total)asmedian,p25(daily_total)asp25,p75(daily_total)asp75,mean(daily_total)asmean|eval iqr =p75- p25|eval xplier =2|eval low_lim =median- (iqr *xplier)|eval high_lim =median+(iqr *xplier)|eval anomaly=if(daily_total <low_lim ORdaily_total >high_lim,daily_total,0)|table_timedaily_total anomaly
FalsePositive FalsePositives
TruePositive
Use-CaseDeepDive
Better?
31
sourcetype="MSExchange:2010:MessageTracking"sender="[email protected]"recipient_count!=NONE|dedup message_idsortby _time|table_timedirectionalitysenderrecipientmessage_subject message_id recipient_count total_bytes |timechartsum(recipient_count)asdaily_total span=1d|eventstats median(daily_total)asmedian,p10(daily_total)asp10,p90(daily_total)asp90,mean(daily_total)asmean|eval iqr =p90- p10|eval xplier =2|eval low_lim =median- (iqr *xplier)|eval high_lim =median+(iqr *xplier)|eval anomaly=if(daily_total <low_lim ORdaily_total >high_lim,daily_total,0)|table_timedaily_total anomaly
Use-CaseDeepDive
ValidatingModels
32
• How can we validate models?
Precision =# of correct positive values
# of all positive results
# of correct positive values
# that should have been positiveRecall =
precision x recall
precision + recallF1 Score = 2
F1 Score is the harmonic mean, or average of rates, where F1 is best at a value of 1, and worst at a value of 0.
Firstmodel: F1 =0.4
Secondmodel: F1 =1.0
Use-CaseDeepDive
Bewareofmissingfalsenegativesbytuningtoomuchtooquickly;tuningisaniterativeprocessovertime
33
LogSources
ExplorationMiningCleaningValidation
ModelTesting&Validation
Alerts&Visualizations
API
UF
ShortTermStorageOff-Splunk ComputationsMachineLearningAnomalyDetection
WhereAreWeInThePlatform? Use-CaseDeepDive
OneLastNoteonNegligentInsiderEmailAnalytics
34
Considernotonlyalargenumberofrecipientsoutsideauser’snormalbehavior,butconsiderthenumberofnewrecipientsWhatistheaveragenumberofnewrecipientsanemployeeemailseachday?One?Five?Establishasetoftrainingdataandrecordtheuniquerecipientsover60daysCreateananomalydetectionthatfireswhenthenumberofnewrecipientsexceedsthebaselinevarianceAddtothe“#ofrecipientsperday”dataforhigherfidelityalertsExample:
ê Baselinenumberofdailyrecipients=30ê Today’samount=75,butfallsonthefringeofbeinganoutlierê Numberofnewrecipients=1-5;falsepositiveê Numberofnewrecipients=50;truepositive
Use-CaseDeepDive
35
MoveTheDataToShort-termStorageForMeasurement Use-Case
DeepDive
Nowinapositiontocompare#ofnewrecipients
Alerts&Visualizations
36
• Theoutputoftheoff-Splunk calculationscanbepicked
upbytheSplunk UForwrittentoaflatfile
• AllowstheusertocapitalizeontheSplunk interface
• Advantages/DisadvantagesofIndexingand
Sourcetyping:
ê Treatlikeanyotherdatasourceforcalculations
ê Technically“re-indexing”data,howeveranomalydata
setswillbesmall
Use-CaseDeepDive
Refinement
37
• Treatdifferentclusterswithdifferentmodels
• Continuallyvalidatedataandresults
• Understandwhyfalsepositivescomeup
• Addlengthtotrainingdatatimeifpossible
• IfaclusterisnotGaussian,tryothermodels,ortrytofitthedatato
aNormalDistribution
• Comparesimplerule-basedmodelssuchas3xmean=anomaly
Use-CaseDeepDive
Use-case:Cross-correlationOfDepartingEmployeesThroughRule-basedSearches
39
• Cross-correlate USB activity with departing employees for insider threat detection
• Background: Historical data from insider threat incidents indicate large transfers of data prior to departure
• Data Source:ê Endpoint Agent Logs (McAfee, Symantec, Kaspersky, etc.)ê Message Tracking Logs
DepartingEmployees
(HighestRisk)
SpikeinUSBActivity
Use-case:Cross-correlationOfDepartingEmployeesThroughRule-basedADSearches
40
• MostUSBactivitywillnotbeanormaldistribution
• UtilizeK-Meanstodetermineuserswhoback-uptheirmachinesviaUSB
• Mustutilizearule-basedapproach
• Setadailythresholdsuchas:daily_total >20MBtoindicatelargedatatransfers
sourcetype=“endpointagent"api="FileWrite"file_size!=0user=xxxxx |eval file_size_MB=(file_size/1048576)|dedupfile_size_MB,src |renameapi asaction,parameterasfile_path,file_name asprogram,src ashost_name |timechartsum(file_size_MB)asDaily_Total span=1d
41
• CreateasearchthatincludesdatafromyourHR
organizationOR…learnwhoisdepartingthrougha
rulebasedsearchofemailsubjectsê “*termination*”
ê “*resignation*”
ê AutogeneratedOracleorPeoplesoft emails
• CombinetheresultswiththespikesinUSBactivity
searchtocreateatwo-featureclassificationlearning
algorithm
Use-case:Cross-correlationOfDepartingEmployeesThroughRule-basedADSearches
sourcetype="MSExchange:2010:MessageTracking"recipient_count!=NONEmessage_subject!="*outofoffice*"message_subject!="Automaticreply*"message_subject!="CustomerSatisfaction*"message_subject!="READ:*"message_subject!="*determination*"(message_subject="*resignation*"ORmessage_subject="*termination*")|dedupmessage_id sortby _time|table_timedirectionalitysenderrecipientmessage_subject message_id
ML&ADforSecurityBestPractice:Cleanandreducesizeofthedatasettoincludeonlythoseitemsofvalue
Use-Case:EmailClientTypeStateChange
42
• DetectrareoranomalousvaluesinclienttypesusedbyOutlookWebAccess(OWA)orOutlookAnywhereusersasasignofcompromise
• Rarityor“novelty-based”anomalydetectionovertime
• First-observeddetection
(sourcetype="MSWindows:2008R2:IIS"ORsourcetype=iis)cs_username!="-"cs_user_agent!="-"(WebApplication=OWAORWebApplication=owa)|fields_timecs_username cs_user_agent WebApplication eventtype |statscountbycs_username,cs_user_agent|sort-count|statslist(cs_user_agent)asuser_agent,list(count)ascountbycs_username |wheremvcount(user_agent)>2
Use-case:SudoLogsAndSabotage
43
• Lookforanomalouspatternsinsudotorootprivileges
ê Time-basedloginsê Unauthorizedscriptsê DataTheftê Unauthorizedserveraccess
• Useacombinationofsupervisedrule-baseddetectionforscriptexecutionandtime-basedanomalydetectionforauthenticationdata
sourcetype=sudo dba_user |table_timehostdba_user Sudo_User PWDCOMMAND|statscountbydba_user,Sudo_User,COMMAND|sort-count|statslist(COMMAND)asCommand,list(count)ascountbydba_user,Sudo_User
InsiderThreatUse-CaseStarterSearches
44
• DetermineanomalousvaluessurroundingprivilegedwindowsadminswhoutilizeRDP
ê Determinemodelforbaseline– willprobablynotbeanormaldistribution
ê Setdetectionsforvaluesoutsideofbaselineê Featuresincludedestinationservers,time-of-day,andmultiplenoveltyevents
• UseclassificationandsupervisedlearningtodetectsensitivefiletypesinUSBtraffic
ê CADfiles,financialdocuments,engineering,businessintelligence,pricing,etc.
ê Setpositivevaluestofileextensionsofinterestsuchas.DWG
sourcetype="WinEventLog:Security"(EventCode=4624OREventCode=4625)Logon_Type=10user=“*admin*"|table_timeuserAccount_NameWorkstation_Name ComputerName src_ip
sourcetype=“endpointagent""FileWrite"file_size!=0user=xxxxx “*.DWG”|eval file_size_MB=(file_size/1048576)|dedup file_size_MB,src |statscountbysrc {oruser}
InternalAndExternalThreatUse-caseIdeas
45
• FTPServers:clusteringIPaddresses,frequencyspike,rule-baseddetectionsusingcompany-specificcriteria(IIS/FW/LBLogs)
• Phishingandfraudemaildetection:domainmismatchusingemailmetadata(message_id)tocomparesendingdomain,displayname,andreturnpath(MessageTrackingLogs)
• LargeNumberoffiledownloads/views/printsfromapplicationhousingsensitivedocuments(App-specificand/orIISlogsifweb-based)
• Anomalousportactivity(Ports53,25,21,22,443,123,etc.)• Authenticationanomalies:logintoarareorfirst-observeddevice,off-hourlogin,patternofsingle
failedloginsfromseveralmachinesorSharepoint locations(the“probing”user)• Detectingsharedcredentials– especiallyamongsensitiveusers(DBAs,Admins,etc.)
WrappingUp- WhatHaveWeCovered?
46
• TheDataScienceCycleforSecurity
• DeepDiveintoML&ADforsecurity
• DemystifiedthemathbehindML&ADandprovided
simplesolutionssuchasclassificationalgorithms
• VariousUse-CasesforsecurityML&AD
46
The process of cleaning and carefully selecting data is more important than choosing the right algorithm