Copyright © 2016 Splunk Inc.
Jacob Leverich, Software Engineer, Splunk
Advanced Machine Learning in SPL with the Machine Learning Toolkit
Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.
Who am I?
Splunker for 2 years, based in San Francisco
Engineering lead for…
– ML Toolkit and Showcase App
– ITSI Anomaly Detection and Adaptive Thresholding features
– Splunk custom search command interface
Initial author of fit/apply commands in ML Toolkit
Die-hard Longhorns fan
Agenda
Machine Learning + Splunk
ML-SPL: Machine Learning in SPL
– What it is
– How it works
Overview of Algorithms and Analytics available in ML-SPL
Tips for Feature Engineering in SPL
Wrap-up
Machine Learning + Splunk
Machine Learning is Not Magic
…it’s a process.
The process starts with a question:
– How many requests do I expect in the next hour?
– How likely is this hard drive to fail in the near future?
– Am I being hacked?
  • Is it unexpected for Joe to log into the bastion host at 2am?
Machine Learning is Not Magic
…it’s a process.
[Diagram: process stages: Collect Data, Clean/Transform, Explore/Visualize, Model, Evaluate, Publish/Deploy]
“Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says”, Forbes, Mar 23, 2016
Splunk for Data Preparation
[Diagram: process stages (Collect Data, Clean/Transform, Explore/Visualize, Model, Evaluate, Publish/Deploy), annotated with Splunk features:]
props.conf, transforms.conf, Datamodels, Add-ons from Splunkbase, etc.
Pivot, Table UI, SPL
ML Toolkit
Alerts, Dashboards, Reports
ML-SPL: Machine Learning in SPL
ML-SPL: What is it?
A suite of SPL search commands specifically for Machine Learning:
– fit
– apply
– summary
– listmodels
– deletemodel
– sample
Implemented using modules from the Python for Scientific Computing add-on for Splunk:
– scikit-learn, numpy, pandas, statsmodels, scipy
ML-SPL Commands: A “grammar” for ML
Fit (i.e. train) a model from search results
… | fit <ALGORITHM> <TARGET> from <VARIABLES …> <PARAMETERS> into <MODEL>
Apply a model to obtain predictions from (new) search results
… | apply <MODEL>
Inspect the model built by <ALGORITHM> (e.g. display coefficients)
| summary <MODEL>
ML-SPL Commands: fit
… | fit <ALGORITHM> <TARGET> from <VARIABLES …> <PARAMETERS> into <MODEL>
Examples:
… | fit LinearRegression system_temp from cpu_load fan_rpm into temp_model
… | fit KMeans k=10 downloads purchases posts days_active visits_per_day into user_behavior_clusters
… | fit LinearRegression petal_length from species
[Slide callout: “optional”]
fit: How It Works
1. Discard fields that are null for all search results.
2. Discard non-numeric fields with >100 distinct values.
3. Discard search results with any null fields.
4. Convert non-numeric fields to binary indicator variables (i.e. “dummy coding”).
5. Convert to a numeric matrix and hand over to <ALGORITHM>.
6. Compute predictions for all search results.
7. Save the learned model.
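Steps 1 to 4 above can be sketched in plain Python. This is only an illustration under stated assumptions, not the toolkit's actual code: the function names `is_number` and `prepare` are hypothetical, nulls are represented as `None`, and the cardinality limit is lowered to 3 in the usage lines so that the toy table from the slides triggers step 2.

```python
def is_number(value):
    """Return True if the value can be read as a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def prepare(rows, target, fields, max_distinct=100):
    """Hypothetical sketch of fit steps 1-4 on a list of dict rows."""
    # Step 1: discard fields that are null for all rows.
    fields = [f for f in fields if any(r.get(f) is not None for r in rows)]

    # Step 2: discard non-numeric fields with more than max_distinct values.
    kept = []
    for f in fields:
        values = {r.get(f) for r in rows if r.get(f) is not None}
        if all(is_number(v) for v in values) or len(values) <= max_distinct:
            kept.append(f)
    fields = kept

    # Step 3: discard rows with a null target or any null remaining field.
    needed = [target] + fields
    rows = [r for r in rows if all(r.get(f) is not None for f in needed)]

    # Step 4: dummy-code non-numeric fields into 0/1 indicator columns.
    columns, matrix = [], [[] for _ in rows]
    for f in fields:
        values = [r[f] for r in rows]
        if all(is_number(v) for v in values):
            columns.append(f)
            for out, v in zip(matrix, values):
                out.append(float(v))
        else:
            # One column per distinct value, in order of first appearance.
            for level in dict.fromkeys(values):
                columns.append(f"{f}={level}")
                for out, v in zip(matrix, values):
                    out.append(1.0 if v == level else 0.0)
    labels = [r[target] for r in rows]
    return labels, columns, matrix


# The toy table from the slides; None marks a null field.
rows = [
    {"field_A": "ok", "field_B": 41, "field_C": None, "field_D": "red", "field_E": "172.24.16.5"},
    {"field_A": "ok", "field_B": 32, "field_C": None, "field_D": "green", "field_E": "192.168.0.2"},
    {"field_A": "FRAUD", "field_B": 1, "field_C": None, "field_D": "blue", "field_E": "10.6.6.6"},
    {"field_A": "ok", "field_B": 43, "field_C": None, "field_D": None, "field_E": "171.64.72.1"},
    {"field_A": None, "field_B": 2, "field_C": None, "field_D": "blue", "field_E": "192.168.0.2"},
]
labels, columns, matrix = prepare(
    rows, "field_A", ["field_B", "field_C", "field_D", "field_E"], max_distinct=3
)
```

With the limit set to 3, field_C is dropped in step 1, field_E in step 2, and the last two rows in step 3, leaving three rows with one numeric column and three indicator columns.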
fit: How It Works
1. Discard fields that are null for all search results.
Target   Explanatory Variables…
field_A  field_B  field_C  field_D  field_E
ok       41       (null)   red      172.24.16.5
ok       32       (null)   green    192.168.0.2
FRAUD    1        (null)   blue     10.6.6.6
ok       43       (null)   (null)   171.64.72.1
(null)   2        (null)   blue     192.168.0.2
… | fit LogisticRegression field_A from field_*
fit: How It Works
2. Discard non-numeric fields with >100 distinct values.
Target   Explanatory Variables…
field_A  field_B  field_D  field_E
ok       41       red      172.24.16.5
ok       32       green    192.168.0.2
FRAUD    1        blue     10.6.6.6
ok       43       (null)   171.64.72.1
(null)   2        blue     192.168.0.2
… | fit LogisticRegression field_A from field_*
fit: How It Works
3. Discard search results with any null fields.
Target   Explanatory Variables…
field_A  field_B  field_D
ok       41       red
ok       32       green
FRAUD    1        blue
ok       43       (null)
(null)   2        blue
… | fit LogisticRegression field_A from field_*
fit: How It Works
Target   Explanatory Variables…
field_A  field_B  field_D
ok       41       red
ok       32       green
FRAUD    1        blue
… | fit LogisticRegression field_A from field_*
4. Convert non-numeric fields to binary indicator variables.
field_A  field_B  field_D=red  field_D=green  field_D=blue
ok       41       1            0              0
ok       32       0            1              0
FRAUD    1        0            0              1
fit: How It Works
5. Convert to a numeric matrix and hand over to <ALGORITHM>.
… | fit LogisticRegression field_A from field_*
y = [1,1,0]
X = [[41,1,0,0],[32,0,1,0],[1,0,0,1]]
e.g. for LogisticRegression:
$$\hat{y} = \frac{1}{1 + e^{-\theta^{T} x}}$$
Find θ using maximum likelihood estimation.
Model inference generally delegated to scikit-learn and statsmodels (e.g. sklearn.linear_model.LogisticRegression).
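To make the logistic formula concrete, here is a minimal sketch that evaluates it for the matrix X above. The coefficients in `theta` are hypothetical values picked by hand for illustration; they are not the result of maximum likelihood estimation, and the function name is an assumption.

```python
import math


def predict_probability(theta, x):
    """Logistic function applied to the linear score theta . x."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical coefficients, chosen by hand (not fitted).
theta = [0.1, 0.5, 0.5, -3.0]

# The numeric matrix from the slide: field_B plus three indicator columns.
X = [[41, 1, 0, 0], [32, 0, 1, 0], [1, 0, 0, 1]]

probabilities = [predict_probability(theta, row) for row in X]
# Threshold at 0.5 to turn probabilities into class labels (1 = ok, 0 = FRAUD).
predictions = [1 if p >= 0.5 else 0 for p in probabilities]
```

With these hand-picked coefficients the thresholded predictions reproduce the y vector from the slide; a real fit would search for the θ that makes the training labels most likely.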
fit: How It Works
6. Compute predictions for all search results.
Target   Explanatory Variables…                        Prediction
field_A  field_B  field_C  field_D  field_E       predicted(field_A)
ok       41       (null)   red      172.24.16.5   ok
ok       32       (null)   green    192.168.0.2   ok
FRAUD    1        (null)   blue     10.6.6.6      FRAUD
ok       43       (null)   (null)   171.64.72.1   ok
(null)   2        (null)   blue     192.168.0.2   FRAUD
… | fit LogisticRegression field_A from field_*
fit: How It Works
7. Save the learned model.
Serialize model settings, coefficients, etc. into a Splunk lookup table.
• Replicated amongst members of Search Head Cluster.
• Automatically distributed to Indexers with search bundle.
… | fit LogisticRegression field_A from field_* into logreg_model
fit: Properties
Each event is an “example” for the learning algorithm.
Resilient to missing values. (but be careful!)
Automatically handles categorical (i.e. non-numeric) fields.
SAVES ITS WORK:
– Learned model can be applied to new, unseen data with the apply command.
fit: Scalability
Some algorithms are inherently not scalable.
– e.g. training Kernel-based Support Vector Machines grows faster than linearly in the number of events N (roughly $O(N^2)$ to $O(N^3)$)
Input is sampled using reservoir sampling.
– Per-algorithm sample reservoir size, typically 100,000 events
– Configurable in mlspl.conf.
Some algorithms support incremental fitting, e.g.: SGDRegressor, SGDClassifier, NaiveBayes
– Use “partial_fit=t” option with fit command.
– No sampling, no event limit!
For the most part, you don’t need to care.
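Reservoir sampling keeps a fixed-size uniform sample from a stream of unknown length in a single pass. The sketch below is the classic "Algorithm R" and only illustrates the idea; the talk does not show the toolkit's own sampler, and the function name and the seeded generator are assumptions.

```python
import random


def reservoir_sample(events, size, rng=None):
    """Keep a uniform random sample of at most `size` items from a stream."""
    rng = rng or random.Random()
    reservoir = []
    for index, event in enumerate(events):
        if index < size:
            # Fill the reservoir with the first `size` events.
            reservoir.append(event)
        else:
            # Keep the new event with probability size / (index + 1),
            # overwriting a uniformly chosen slot.
            slot = rng.randint(0, index)
            if slot < size:
                reservoir[slot] = event
    return reservoir


# Sample 1,000 events from a stream of 100,000 without storing the stream.
sample = reservoir_sample(range(100_000), size=1000, rng=random.Random(0))
```

Memory use depends only on the reservoir size, which is why a per-algorithm limit can be enforced regardless of how many events the search returns.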
ML-SPL Commands: apply
… | apply <MODEL>
Examples:
… | apply temp_model
… | apply user_behavior_clusters
… | apply petal_length_from_species
apply: How It Works
1. Load the learned model.
2. Discard fields that are null for all search results.
3. Discard non-numeric fields with >100 distinct values.
4. Convert non-numeric fields to binary indicator variables (i.e. “dummy coding”).
5. Discard variables not in the learned model.
6. Fill missing fields with 0’s.
7. Convert to a numeric matrix and hand over to <ALGORITHM>.
8. Compute predictions for all search results.
apply: How It Works
… | apply fraud_model
Target   Explanatory Variables…
field_A  field_B  field_D
ok       41       red
ok       32       green
FRAUD    1        blue
(null)   41       yellow
4. Convert non-numeric fields to binary indicator variables.
field_A  field_B  field_D=red  field_D=green  field_D=blue  field_D=yellow
ok       41       1            0              0             0
ok       32       0            1              0             0
FRAUD    1        0            0              1             0
(null)   41       0            0              0             1
apply: How It Works
… | apply fraud_model
Target   Explanatory Variables…
field_A  field_B  field_D=red  field_D=green  field_D=blue  field_D=yellow
ok       41       1            0              0             0
ok       32       0            1              0             0
FRAUD    1        0            0              1             0
(null)   41       0            0              0             1
5. Discard variables not in the learned model.
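The effect of discarding unseen variables and then filling missing ones with 0 can be sketched as a column alignment against the list of columns saved at fit time. The function name and the `model_columns` list are hypothetical; the talk does not show how the toolkit stores this information.

```python
def align_to_model(row, model_columns):
    """Build one numeric input row in the column order the model was fit on."""
    encoded = {}
    for field, value in row.items():
        if value is None:
            continue  # null field: no column is produced for it
        if isinstance(value, (int, float)):
            encoded[field] = float(value)
        else:
            encoded[f"{field}={value}"] = 1.0  # dummy coding
    # Columns the model never saw are never looked up, so they are dropped;
    # columns the row lacks default to 0.
    return [encoded.get(column, 0.0) for column in model_columns]


# Hypothetical column list remembered from fit time.
model_columns = ["field_B", "field_D=red", "field_D=green", "field_D=blue"]

known = align_to_model({"field_B": 32, "field_D": "green"}, model_columns)
unseen = align_to_model({"field_B": 41, "field_D": "yellow"}, model_columns)
missing = align_to_model({"field_B": 43, "field_D": None}, model_columns)
```

Note that an unseen category ("yellow") and a null value end up with the same all-zero indicator columns, which is one reason the slides advise care around missing values.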
apply: How It Works
7. Convert to a numeric matrix and hand over to <ALGORITHM>.
… | apply fraud_model
y = [1,1,0,1,?]
X = [[41,1,0,0],[32,0,1,0],[1,0,0,1],[41,0,0,0]]
e.g. for LogisticRegression:
$$\hat{y} = \frac{1}{1 + e^{-\theta^{T} x}}$$
Compute $\hat{y}$ using θ found by fit command.
apply: How It Works
8. Compute predictions for all search results.
Target   Explanatory Variables…                        Prediction
field_A  field_B  field_C  field_D  field_E       predicted(field_A)
ok       41       (null)   red      172.24.16.5   ok
ok       32       (null)   green    192.168.0.2   ok
FRAUD    1        (null)   blue     10.6.6.6      FRAUD
ok       43       (null)   (null)   171.64.72.1   ok
(null)   41       (null)   yellow   192.168.0.2   ok
… | apply fraud_model
apply: Properties
Learned models can be applied to new, unseen data.
| fit is to | apply as | outputlookup is to | lookup
Resilient to missing values. (but, again, be careful!)
Automatically handles categorical (i.e. non-numeric) fields.
apply: Scalability
No limits.
When possible, executes at the Indexing tier.
– Fully parallelized; harness the CPU power of your Indexing Cluster.
– Must set “streaming_apply = true” in mlspl.conf.
ML-SPL Commands: summary
… | summary <MODEL>
Examples:
… | summary temp_model
… | summary user_behavior_clusters
… | summary petal_length_from_species
$$\hat{y} = \frac{1}{1 + e^{-\theta^{T} x}}$$
Algorithms and Analytics in ML-SPL
Regression Algorithms (i.e. predict numeric fields)
LinearRegression
– …including Lasso, Ridge, ElasticNet
KernelRidge
DecisionTreeRegressor
RandomForestRegressor
SGDRegressor
All implemented with sklearn models.
Classification Algorithms (i.e. predict categorical fields)
LogisticRegression
DecisionTreeClassifier
RandomForestClassifier
SGDClassifier
SVM
Naïve Bayes
– Including BernoulliNB and GaussianNB
Clustering Algorithms (i.e. group like with like)
KMeans
DBSCAN
Birch
SpectralClustering
Feature Engineering Algorithms (i.e. data pre-processing)
TFIDF (term-frequency x inverse document-frequency)
– Transform free-form text into numeric fields
StandardScaler (i.e. normalization)
FieldSelector (i.e. choose K best features for regression/classification)
PCA and KernelPCA
“Pipeline” Multiple Algorithms
Example: Text Analytics
– TFIDF to transform free-form messages into numeric fields, followed by…
  • KMeans to group similar messages
  • BernoulliNB to classify messages (e.g. according to sentiment)
  • PCA to visualize distribution of messages
– … | fit TFIDF message | fit KMeans message_tfidf_* | ...
Analogous to Pipeline concept from sklearn or Spark MLlib
“Pipeline” Multiple Algorithms
ML-SPL analytics are stackable.
Very advanced ML use-cases are succinctly expressible.
Tips for Feature Engineering
Tips for Feature Engineering
Work on aggregates, not raw events.
– DO NOT use fit on 1,000,000,000 events. DO use stats.
Use eval to compute new features.
Use streamstats to construct leading indicators.
…
Work on aggregates, not raw events
… | fit KMeans k=10 downloads purchases posts days_active visits_per_day into user_behavior_clusters
Use stats and lookup tables to construct features:
index=activity_logs
| stats count by action user_id
| xyseries user_id action count
| fillnull
| lookup user_activity user_id OUTPUT days_active visits_per_day
| fit KMeans k=10 …
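What the stats, xyseries and fillnull stages produce together can be pictured as a pivot from one event per action to one feature row per user. The sketch below mimics that shape in plain Python with made-up events; the function name is hypothetical, and the lookup enrichment step is left out.

```python
from collections import Counter


def pivot_counts(events):
    """Count events per (user_id, action) and pivot to one row per user."""
    counts = Counter((e["user_id"], e["action"]) for e in events)
    actions = sorted({action for _, action in counts})
    users = sorted({user for user, _ in counts})
    # A Counter returns 0 for pairs it has not seen, which plays the
    # role of fillnull here.
    return {user: {a: counts[(user, a)] for a in actions} for user in users}


# Made-up events for illustration.
events = [
    {"user_id": "u1", "action": "download"},
    {"user_id": "u1", "action": "download"},
    {"user_id": "u1", "action": "post"},
    {"user_id": "u2", "action": "purchase"},
]
features = pivot_counts(events)
```

Each user ends up as a single example with one numeric column per action, which is the form a clustering algorithm such as KMeans expects.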
Use eval to compute new features
Coerce numbers into categories by prepending a string:
– … | eval region_id = "Region " + region_id | …
Model interactions between features:
– … | eval X_factor = importance * urgency | …
– Use + for categorical fields, * for numeric
Make non-linear features out of numeric values:
– … | eval temperature = pow(temperature,2) | …
– … | eval latency = log(latency) | …
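The three kinds of eval-based features can be mirrored in a few lines of Python to show what the resulting fields look like. The function name and the sample row are made up for illustration; the base-10 logarithm is used on the assumption that this matches eval's log() when no base is given.

```python
import math


def add_features(row):
    """Return a copy of the row with the engineered features applied."""
    out = dict(row)
    # Coerce a numeric ID into a category by prepending a string.
    out["region_id"] = "Region " + str(row["region_id"])
    # Interaction between two numeric features.
    out["X_factor"] = row["importance"] * row["urgency"]
    # Non-linear transforms of numeric values.
    out["temperature"] = math.pow(row["temperature"], 2)
    out["latency"] = math.log10(row["latency"])
    return out


# Made-up row for illustration.
row = {"region_id": 7, "importance": 3, "urgency": 4,
       "temperature": 20.0, "latency": 1000.0}
engineered = add_features(row)
```

After the transformation, region_id is a string and will be dummy-coded by fit rather than treated as a number with a meaningful order.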
Use streamstats for leading indicators
index=application_log OR index=tickets
| timechart span=1d count(failure) as FAILS, count("Change Request") as CHANGES
| reverse
| streamstats window=3 sum(FAILS) as FAILS_NEXT_3DAYS
| reverse
| fit LinearRegression FAILS_NEXT_3DAYS from CHANGES into FAILS_PREDICTION_MODEL
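The reverse, streamstats, reverse idiom turns a trailing window into a forward-looking one. The sketch below shows both views on made-up daily counts, assuming the window of 3 includes the current day (so each value covers the current day and the two days after it); the function names are hypothetical.

```python
def forward_window_sum(values, window=3):
    """Sum each value with the values that follow it, `window` items in total."""
    return [sum(values[i:i + window]) for i in range(len(values))]


def reverse_trailing_reverse(values, window=3):
    """Same result, computed the way the search does it:
    reverse, take a trailing windowed sum, reverse again."""
    reversed_values = values[::-1]
    trailing = [
        sum(reversed_values[max(0, i - window + 1):i + 1])
        for i in range(len(reversed_values))
    ]
    return trailing[::-1]


fails = [0, 2, 1, 5, 0, 3]  # made-up daily failure counts
fails_next_3days = forward_window_sum(fails)
```

Both functions give the same series, so the target used for the regression on a given day looks ahead in time, which is what makes CHANGES a candidate leading indicator.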
Wrap-up
What did we cover?
Machine Learning + Splunk
ML-SPL: Machine Learning in SPL
– What it is
– How it works
Overview of Algorithms and Analytics available in ML-SPL
Tips for Feature Engineering in SPL
What Now?
Install the ML Toolkit from Splunkbase!
– http://tiny.cc/splunkmlapp
Don’t miss Manish Sainani’s or Adam Oliner’s talks!
Product Manager: Manish Sainani
Field Expert: Andrew Stein
Me: Jacob Leverich
THANK YOU
fit: Misc. details
Multi-class classification problems typically modeled as “one-vs-rest”
Some algorithms do NOT support saved models, e.g.:
– DBSCAN and SpectralClustering
ML-SPL Commands
fit <ALGORITHM> <TARGET> from <VARIABLES …> <PARAMETERS> into <MODEL>
– Fit (i.e. train) a model from search results
apply <MODEL>
– Apply a model to obtain predictions from (new) search results
summary <MODEL>
– Inspect the model inferred by <ALGORITHM> (e.g., display coefficients)