advanced machine learning in spl with the …...advanced machine learning in spl with the machine...

Copyright©2016Splunk Inc.

JacobLeverichSoftwareEngineer,Splunk

AdvancedMachineLearninginSPLwiththeMachineLearningToolkit

Disclaimer

2

Duringthecourseofthispresentation,wemaymakeforwardlookingstatementsregardingfutureeventsortheexpectedperformanceofthecompany.Wecautionyouthatsuchstatementsreflectourcurrentexpectationsandestimatesbasedonfactorscurrentlyknowntousandthatactualeventsorresultscoulddiffermaterially.Forimportantfactorsthatmaycauseactualresultstodifferfromthose

containedinourforward-lookingstatements,pleasereviewourfilingswiththeSEC.Theforward-lookingstatementsmadeinthethispresentationarebeingmadeasofthetimeanddateofitslivepresentation.Ifreviewedafteritslivepresentation,thispresentationmaynotcontaincurrentoraccurateinformation.Wedonotassumeanyobligationtoupdateanyforwardlookingstatementswemaymake.Inaddition,anyinformationaboutourroadmapoutlinesourgeneralproductdirectionandissubjecttochangeatanytimewithoutnotice.Itisforinformationalpurposesonlyandshallnot,beincorporatedintoanycontractorothercommitment.Splunkundertakesnoobligationeithertodevelopthefeaturesor

functionalitydescribedortoincludeanysuchfeatureorfunctionalityinafuturerelease.

WhoamI?Splunker for2years,basedinSanFrancisco

Engineeringleadfor…– MLToolkitandShowcaseApp– ITSIAnomalyDetectionandAdaptiveThresholdingfeatures– Splunk customsearchcommandinterface

Initialauthoroffit/applycommandsinMLToolkit

Die-hardLonghornsfan

3

Agenda

MachineLearning+SplunkML-SPL:MachineLearninginSPL– Whatitis– Howitworks

OverviewofAlgorithmsandAnalyticsavailableinML-SPLTipsforFeatureEngineeringinSPLWrapup

4

MachineLearning+Splunk

MachineLearningisNotMagic

…it’saprocess.

Theprocessstartswithaquestion:– HowmanyrequestsdoIexpectinthenexthour?– Howlikelyisthisharddrivetofailinthenearfuture?– AmIbeinghacked?

ê IsitunexpectedforJoetologintothebastionhostat2am?

6

MachineLearningisNotMagic

…it’saprocess.

7

CollectData

Explore/Visualize

Model

Evaluate

Clean/Transform

Publish/Deploy

8

“CleaningBigData:MostTime-Consuming,LeastEnjoyableDataScienceTask,SurveySays”,ForbesMar23,2016

Splunk forDataPreparation

9

CollectData

Explore/Visualize

Model

Evaluate

Clean/Transform

Publish/Deploy

props.conf,transforms.conf,DatamodelsAdd-onsfromSplunkbase,etc.

Pivot,TableUI,SPLMLToolkit

Alerts,Dashboards,Reports

ML-SPL:MachineLearninginSPL

ML-SPL:Whatisit?AsuiteofSPLsearchcommandsspecificallyforMachineLearning:– fit– apply– summary– listmodels– deletemodel– sample

ImplementedusingmodulesfromthePythonforScientificComputingadd-onforSplunk:– scikit-learn,numpy,pandas,statsmodels,scipy

ML-SPLCommands:A“grammar”forMLFit(i.e.train)amodelfromsearchresults… | fit <ALGORITHM> <TARGET> from <VARIABLES …>

<PARAMETERS> into <MODEL>

Applyamodeltoobtainpredictionsfrom(new)searchresults… | apply <MODEL>

Inspectthemodelbuiltby<ALGORITHM> (e.g.displaycoefficients)| summary <MODEL>

ML-SPLCommands:fit

13

… | fit <ALGORITHM> <TARGET> from <VARIABLES …><PARAMETERS> into <MODEL>

Examples:

… | fit LinearRegressionsystem_temp from cpu_load fan_rpminto temp_model

… | fit KMeans k=10downloads purchases posts days_active visits_per_dayinto user_behavior_clusters

… | fit LinearRegressionpetal_length from species

optional

fit:HowItWorks

15

1. Discardfieldsthatarenullforallsearchresults.2. Discardnon-numericfieldswith>100distinctvalues.3. Discardsearchresultswithanynullfields.4. Convertnon-numericfieldstobinaryindicatorvariables

(i.e.“dummycoding”).5. Converttoanumericmatrixandhandoverto<ALGORITHM>.6. Computepredictionsforallsearchresults.7. Savethelearnedmodel.

fit:HowItWorks

16

1. Discardfieldsthatarenullforallsearchresults.

field_A field_B field_C field_D field_E

ok 41 red 172.24.16.5

ok 32 green 192.168.0.2

FRAUD 1 blue 10.6.6.6

ok 43 171.64.72.1

2 blue 192.168.0.2

Target ExplanatoryVariables…

… | fit LogisticRegression field_A from field_*

fit:HowItWorks

17

2. Discardnon-numericfieldswith>100distinctvalues.

field_A field_B field_D field_E

ok 41 red 172.24.16.5

ok 32 green 192.168.0.2

FRAUD 1 blue 10.6.6.6

ok 43 171.64.72.1

2 blue 192.168.0.2



fit:HowItWorks

18

3. Discardsearchresultswithanynullfields.

field_A field_B field_D

ok 41 red

ok 32 green

FRAUD 1 blue

ok 43

2 blue



fit:HowItWorks

19


ok 41 red

ok 32 green

FRAUD 1 blue



4. Convertnon-numericfieldstobinaryindicatorvariables.

field_A field_B field_D=red …=green …=blue

ok 41 1 0 0

ok 32 0 1 0

FRAUD 1 0 0 1

fit:HowItWorks

20

5. Converttoanumericmatrixandhandoverto<ALGORITHM>.y = X=


[1,1,0] [[41,1,0,0],[32,0,1,0],[1,0,0,1]]

𝑦" = 1

1 + 𝑒((*+,)Find𝜃 usingmaximumlikelihoodestimation.

e.g.forLogisticRegression:

Modelinferencegenerallydelegatedtoscikit-learnandstatsmodels.(e.g.sklearn.linear_model.LogisticRegression)

fit:HowItWorks

21

6. Computepredictionsforallsearchresults.

field_A field_B field_C field_D field_E predicted(field_A)

ok 41 red 172.24.16.5 ok

ok 32 green 192.168.0.2 ok

FRAUD 1 blue 10.6.6.6 FRAUD

ok 43 171.64.72.1 ok

2 blue 192.168.0.2 FRAUD



Prediction

fit:HowItWorks

25

7. Savethelearnedmodel.

Serializemodelsettings,coefficients,etc.intoaSplunk lookuptable.• ReplicatedamongstmembersofSearchHeadCluster.• AutomaticallydistributedtoIndexerswithsearchbundle.

… | fit LogisticRegression field_A from field_* into logreg_model

fit:Properties

27

Eacheventisan“example”forthelearningalgorithm.

Resilienttomissingvalues.(butbecareful!)

Automaticallyhandlescategorical(e.g.non-numeric)fields.

SAVESITSWORK:– Learnedmodelcanbeappliedtonew,unseen data

withtheapply command.

fit:Scalability

28

Somealgorithmsareinherentlynotscalable.– e.g.Kernel-basedSupportVectorMachinesis𝑂 𝑁1

Inputissampledusingreservoirsampling.– Per-algorithmsamplereservoirsize,typically100,000events– Configurableinmlspl.conf.

Somealgorithmssupportincrementalfitting,e.g.:SGDRegressor,SGDClassifier,NaiveBayes– Use“partial_fit=t”optionwithfit command.– Nosampling,noeventlimit!

Forthemostpart,youdon’tneedtocare.

ML-SPLCommands:apply

29

… | apply <MODEL>

Examples:

… | apply temp_model

… | apply user_behavior_clusters

… | apply petal_length_from_species

apply:HowItWorks

31

1. Loadthelearnedmodel.2. Discardfieldsthatarenullforallsearchresults.3. Discardnon-numericfieldswith>100distinctvalues.4. Convertnon-numericfieldstobinaryindicatorvariables

(i.e.“dummycoding”).5. Discardvariablesnotinthelearnedmodel.6. Fillmissingfieldswith0’s.7. Converttoanumericmatrixandhandoverto<ALGORITHM>.8. Computepredictionsforallsearchresults.


ok 41 red

ok 32 green

FRAUD 1 blue

41 yellow

field_A field_B field_D=red …=green …=blue …=yellow

ok 41 1 0 0 0

ok 32 0 1 0 0

FRAUD 1 0 0 1 0

41 0 0 0 1

apply:HowItWorks

32


… | apply fraud_model

4. Convertnon-numericfieldstobinaryindicatorvariables.

field_A field_B field_D=red …=green …=blue …=yellow

ok 41 1 0 0 0

ok 32 0 1 0 0

FRAUD 1 0 0 1 0

41 0 0 0 1

apply:HowItWorks

33



5. Discardvariablesnotinthelearnedmodel.

apply:HowItWorks

34

5. Converttoanumericmatrixandhandoverto<ALGORITHM>.

y = X=


[1,1,0,1,?] [[41,1,0,0],[32,0,1,0],[1,0,0,1],[41,0,0,0]]

𝑦" = 1

1 + 𝑒((*+,)Compute𝑦" usingθ foundbyfit command.

e.g.forLogisticRegression:

apply:HowItWorks

35

7. Computepredictionsforallsearchresults.

field_A field_B field_C field_D field_E predicted(field_A)

ok 41 red 172.24.16.5 ok

ok 32 green 192.168.0.2 ok

FRAUD 1 blue 10.6.6.6 FRAUD

ok 43 171.64.72.1 ok

41 yellow 192.168.0.2 ok



Prediction

apply:Properties

36

Learnedmodelscanbeappliedtonew,unseen data.

| fit isto| apply

as

| outputlookup isto| lookup

Resilienttomissingvalues.(but,again,becareful!)

Automaticallyhandlescategorical(e.g.non-numeric)fields.

apply:Scalability

37

Nolimits.

Whenpossible,executesattheIndexingtier.– Fullyparallelized;harnesstheCPUpowerofyourIndexingCluster.– Mustset“streaming_apply = true”inmlspl.conf.

ML-SPLCommands:summary

38

… | summary <MODEL>

Examples:

… | summary temp_model

… | summary user_behavior_clusters

… | summary petal_length_from_species

40

𝑦" = 1

1 + 𝑒((*+,)

AlgorithmsandAnalyticsinML-SPL

RegressionAlgorithms(e.g.predictnumericfields)

LinearRegression– …includingLasso,Ridge,ElasticNet

KernelRidgeDecisionTreeRegressorRandomForestRegressorSGDRegressor

Allimplementedwithsklearn models.

ClassificationAlgorithms(e.g.predictcategoricalfields)

LogisticRegressionDecisionTreeClassifierRandomForestClassifierSGDClassifierSVMNaïve Bayes– IncludingBernoulliNB andGuassianNB

ClusteringAlgorithms(e.g.grouplikewithlike)

KMeansDBSCANBirchSpectralClustering

FeatureEngineeringAlgorithms(e.g.datapre-processing)

TFIDF (term-frequencyxinversedocument-frequency)– Transformfree-formtextintonumericfields

StandardScaler (i.e.normalization)FieldSelector (i.e.chooseKbestfeaturesforregression/classification)PCA andKernelPCA

“Pipeline”MultipleAlgorithms

46

Example:TextAnalytics– TFIDF totransformfree-formmessagesintonumericfields,

followedby…ê KMeans togroupsimilarmessagesê BernoulliNB toclassifymessages(e.g.accordingtosentiment)ê PCA tovisualizedistributionofmessages

– … | fit TFIDF message | fit Kmeans message_tfidf_* | ...

AnalogoustoPipelineconceptfromsklearn orSparkMLLib

“Pipeline”MultipleAlgorithms

49

ML-SPLanalyticsarestackable.

VeryadvancedMLuse-casesaresuccinctlyexpressible.

TipsforFeatureEngineering

TipsforFeatureEngineeringWorkonaggregates,notrawevents.– DONOTusefiton1,000,000,000events.DOusestats.

Useeval tocomputenewfeatures.

Usestreamstats toconstructleadingindicators.

…

Workonaggregates,notrawevents… | fit KMeans k=10

downloads purchases posts days_active visits_per_dayinto user_behavior_clusters

Usestats andlookuptablestoconstructfeatures:

index=activity_logs| stats count by action user_id| xyseries user_id action count | fillnull| lookup user_activity user_id

OUTPUT days_active visits_per_day| fit KMeans k=10 …

Usestreamstats forleadingindicators

56

index=application_log OR index=tickets| timechart span=1d count(failure) as FAILS,

count(“Change Request”) as CHANGES| reverse| streamstats window=3 sum(FAILS) as FAILS_NEXT_3DAYS| reverse| fit LinearRegression FAILS_NEXT_3DAYS from CHANGES

into FAILS_PREDICTION_MODEL

Wrap-up

Whatdidwecover?

MachineLearning+SplunkML-SPL:MachineLearninginSPL– Whatitis– Howitworks

OverviewofAlgorithmsandAnalyticsavailableinML-SPLTipsforFeatureEngineeringinSPL

58

WhatNow?

59

InstalltheMLToolkitfromSplunkbase!– http://tiny.cc/splunkmlapp

Don’tmissManishSainani’s orAdamOliner’s talks!

ProductManager:ManishSainani <[email protected]>FieldExpert:AndrewStein<[email protected]>Me:JacobLeverich<[email protected]>

THANKYOU

fit:Misc.details

61

Multi-classclassificationproblemstypicallymodeledas“one-vs-rest”

SomealgorithmsdoNOTsupportsavedmodels,e.g.:– DBSCANandSpectralClustering

ML-SPLCommandsfit <ALGORITHM> <TARGET> from <VARIABLES …> <PARAMETERS> into <MODEL>– Fit(i.e.train)amodelfromsearchresults

apply <MODEL>– Applyamodeltoobtainpredictionsfrom(new)searchresults

summary <MODEL>– Inspectthemodelinferredby<ALGORITHM> (e.g.,displaycoefficients)

SlideTitle

advanced machine learning in spl with the …...advanced machine learning in spl with the machine...

Documents