real-time alarm veriﬁca/on with spark streaming and ... · 1/23/2017 · real-time alarm...

Real-TimeAlarmVerifica/onwithSparkStreamingandMachineLearning

AnaSima,JanStampfli,KurtStockingerZurichUniversityofAppliedSciences

SwissBigDataUserGroupMee3ng

January23,2017

AboutMe

•  Prof.Dr.KurtStockinger•  Sincesummer2013atZHAW

•  Databases,DataWarehousing,BigData

•  2007-2013:•  DataWarehouse&BusinessIntelligenceArchitect

•  2004-2007:•  ComputerScien3st

•  1999-2003:•  ComputerScien3st

2

WhatistheProblemwithAlarmSystems?

3

ATypicalExample

4

Seriousuniversi3es…

…withSeriousStudents

5

ASeriousPartyinaStudentHome

6

…SeriousFireFighters

7

WhyMachineLearning?

•  Around90%ofreportedincidentsarefalsealarms•  Humanoperatorsshouldpriori3zetruealarms•  Machinelearningisableto

•  discoverthepa_ernsburiedinthealarmdata•  separatetruefromfalsealarmswithahighaccuracy

8

9

DIGITAL FOOTPRINT OF A SECURED OBJECT

Alarm panel

Transmitter

User behavior

Alarm types

User behavior will improve overall process User based business models are possible with machine learning tools e.g. Services payments only when an alarm panel is activated

ResearchProjectwithSitasys

10

TalkOutline

•  Machinelearningforalarmverifica3on

•  End-to-endperformanceanalysis:•  Streamprocessing(liveanalysis)•  Batchprocessing(historicanalysis)•  Onlinemachinelearning(liveverifica3on)

11

MachineLearning

withSpark

12

AboutMe

JanStampfli•  ZHAWBachelorinComputerScience

•  Graduatedin2014•  ResearchassistantatZHAW

•  SinceSeptember2014

•  Workinginresearchprojectsonthetopics•  BigData•  MachineLearning•  Informa3onRetrieval

13

The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again.

ATypicalMachineLearningProcess

14

SparkMLPipeline

h_p://spark.apache.org/docs/latest/ml-pipeline.html

15

loca/on labelBern 1Chur 0Sion 1

hash206691120996822577109

predic/on probabili/es0 [0.51,0.49]0 [0.99,0.01]1 [0.13,0.87]

DataFrame

Pipeline

PipelineTransformer EstimatorTransformerTransformer

EstimatorEstimator

Pipeline

SparkMLTuning

•  RandomForest

•  Numberoftrees•  Maxdepthoftrees

•  ParameterMap•  [10,20,30]•  [5,25,50]

16

DataFrame

CrossValidator

Pipeline Evaluator

h_p://spark.apache.org/docs/latest/ml-tuning.html

SparkMLEvalua3on

h_p://spark.apache.org/docs/latest/mllib-evalua3on-metrics.html 17

PredictionDataFrame

label probabili/es1 [0.51,0.49]0 [0.99,0.01]1 [0.13,0.87]

Evaluator

AccuracyPrecision

Recall

…

AppliedMachineLearning

18

DataaboutRealAlarms

•  Features

•  Loca3on,sensortype,dayofweek,…•  Labels

•  0=falsealarm•  1=truealarm

19

Total Falsealarms TruealarmsTotal 339’841 165’997 173’844Training 169’920 84’960 84’960Test 169’921 81’037 88’884

AppliedAlgorithms

•  RandomForest

•  SupportVectorMachine

•  Logis3cRegression

•  DeepNeuralNetwork(DNN)

20

Results

21

JanStampfli,KurtStockinger(2016).ApplieddataScience:UsingMachineLearningforAlarmVerifica3on,ERCIMNews,107,10.

Results

21

Accuracy

92.33% 92.05%

•  Confusionmatrix

•  Precision(correctlypredictedalarms)

•  Recall(correctlydetectedalarms)

22

Evalua3on–RandomForest

2222

Predic/onTrue Predic/onFalse

LabelTrue 82’999 5’885

LabelFalse 7’148 73’889


LabelTrue 92.07%

LabelFalse 92.62%


LabelTrue 93.38%

LabelFalse 91.18%

ConclusionaboutML

•  Humanoperatorsarenowabletopriori3zealarmspredictedastrue

•  Falsealarmsmusts3llbeevaluated

23

Performance

24

AboutMe

AnaSima

•  EPFLMasterinSWSystems,2015

•  Previouslyworkedonaframeworkforunifiedstream&batchprocessing(Cyclone)

•  JoinedZHAWsinceNovember201625

Welcomebackfromtheholidays!

26

Whyisalarmprocessinga3me-cri3calmission?

27

Whenthingsgowrong…

28

It’sallaboutresponse3me!

Performancetes3ng

29

DataConsumer

Streaming

Historic

MachineLearning

DataGenerator

Alarmstream

Designingtheconsumer

•  Spark-”AunifiedengineforBigDataProcessing”

30

SQL

StreamingMachineLearning

GraphProcessing

DataFrames

Kryo Gson Jackson

DataSets

RDDs

Noonesizefitsall,sohowdoIchoose???

31

Thebasecase

32

Producer

WriteNalarms/s

Consumer

Project&FilterDis3nctComputeHistogramPredictTrue/False

Alarms

Historicquery

ConsumerWorkflow

•  Everywindowof10s•  Streaming

•  mAddrsInWindow = alarms.map(Alarm::getMacAddress).distinct();

•  Historic•  Select MacAddress, count(*) from mongoDB Where MacAddress in mAddrsInWindow

Group By MacAddress

•  MachineLearning•  (Coveredinfirsthalfoftalk)

33

34

SAVEBenchmarkIden3fyingBo_lenecks

1

ZHAWDatalab

Iden3fyingbo_lenecks(1)

1.  Startedexperimentwith•  8-coreKataProducer•  8-coreSparkConsumer

2.  Consumerslowevenforabasicstreamingtask•  Countnumberofelementsinwindow=>afewsec

3.  Producernotoutpuungexpectedthroughput•  Stumbled@around12K/s(recordsize≃600B) 35

Commonrootcause?

36

Producer

Serialize,(fasterxmlJackson)WriteNalarms/s

ConsumerDeserialize,


SerializedAlarms

Historicquery

Doesitreallyma_er?

•  Per-objectserializa3ononlytakesafewtensofns

37

•  Mul3plythatby12K..

•  Andbtw,haveitallsentin1s!

Areweusingtherighttool?

•  Answer:benchmarks*!

•  ”IfyourenvironmentprimarilydealswithlotsofsmallJSONrequests[…]thenGSONisyourlibraryofinterest.Jacksonstrugglesthemostwithsmallfiles.”

•  Jackson:upto3xslowerforsmallobjects

*TheulMmateJSONlibrary:hRp://blog.takipi.com/the-ulMmate-json-library-json-simple-vs-gson-vs-jackson-vs-json/

38

39


2

ZHAWDatalab

Results(1)

05,00010,00015,00020,00025,00030,000

Jackson Gson

Producermaxthroughput(alarms/s)

+

40*<30LOCchange=>�~2xspeedupinbothProducer&Consumer

05,00010,00015,00020,00025,00030,000

Jackson Gson

Consumermaxthroughput(alarms/s)

Howmuchwillfitin10s?

41

2.91

2.3

0.02

0.02

0.02

7

5

7.5

0

2

4

6

8

10

12

Kata+Jackson(3.5Kalarms)

Kata+GSON(3.5Kalarms)


Breakdownof/me/subtask

Streaming Historic MachineLearning

Note:thealarmsRDDisusedforboththestreaming&theMLpart=>serializerchangebenefitsboth

Iden3fyingbo_lenecks(2)

1.  Caching?•  Broughtsomeimprovement,butsmall(~5%)•  Theamountofdatareusedistooli_le(20MB)

2.  Simplifyingthedesign?•  TheKISS*principle

•  KeepItSimple,Stupid!

42

Cappedcollec3on

Simplifica3on

43

Alarms

Cappedcollec3on

Producer

WriteNalarms/s

Consumer


Alarms

44


3

ZHAWDatalab

Results(2)

Producer

05,00010,00015,00020,00025,00030,000

Maxthroughput(alarms/s)

45

05,00010,00015,00020,00025,00030,000


Note:Producerslow,buts3ll@previousmaxconsumerthroughput!Consumer:~50%speedup.

+

Howmuchwillfitin10s?

46

2.91 1

0.02

0.02 0.02

7

5

2

0

2

4

6

8

10

12

Kata+Jackson(3.5Kalarms)


Mongo+GSON(3.5Kalarms)

Breakdownof/me/subtask

Streaming Historic MachineLearning

Rootcause?

•  UsingKatadirectstreamwithasinglepar33on=>poorparallelism•  Mostopera3onsrunonasingleexecutor

•  EventhoughSparkisconfiguredtorunon8cores

•  UsingMongoDB=>finergrainedcontroloverparallelismlevelfortheRDD•  Readanarrayofdocumentsfromtailablecursor•  ParallelizearrayintoRDD&runMLoneachpart

47

FIXIT!

ConfiguringtheKataDirectStreaminSparkwithproperseungs….

(numpar33ons=numcores)

48

Results(3)

Producer

05,00010,00015,00020,00025,00030,000


49

05,00010,00015,00020,00025,00030,000


+

Designingtheconsumer

•  Spark-”AunifiedengineforBigDataProcessing”

50

SQL

StreamingMachineLearning

GraphProcessing

DataFrames

Kryo Gson Jackson

DataSets

RDDs

Easytousetools…cansome/mesbetrickytogetright

Conclusions & Lessons Learned

•  Machine learning algorithm of Spark can be directly applied •  Works well for small and big data •  No re-writing necessary to make algorithm scale across computing cluster

•  Spark Streaming can’t be used out of the box for stream and batch processing: •  Need to use a persistency layer •  Requires significant performance tuning

•  Working with our industry partner Sitasys: •  Great collaboration •  Working with real-world problem is very rewarding •  Can make real contribution and enhance existing algorithms and

technology 51

Appendix

52

PipelineExampleCode(Java)//CreatestringindexerTransformer.StringIndexerstringIndexer=newStringIndexer()

.setInputCol("loca/on").setOutputCol("indexLoca/on");//CreaterandomforestEsMmator.RandomForestClassifierrandomForestClf=newRandomForestClassifier()

.setFeaturesCol("indexLoca/on").setLabelCol("label");//CreatepipelineEsMmator(specifyingtheMLworkflow).Pipelinepipeline=newPipeline()

.setStages(newPipelineStage[]{stringIndexer,randomForestClf});//CreatecrossvalidaMonEsMmator(wrappingthepipelineEsMmator).ParamMap[]paramGrid=newParamGridBuilder()

.addGrid(randomForestClf.numTrees()(),newint[]{10,20,30}).addGrid(randomForestClf.maxDepth(),newint[]{5,25,50}).build();

CrossValidatorcrossValidator=newCrossValidator().setEs3mator(pipeline).setEs3matorParamMaps(paramGrid).setEvaluator(newBinaryClassifica3onEvaluator()).setNumFolds(10);

53

Training,Tes3ngandEvalua3onExampleCode(Java)//Given(creaMonofdatasetnoincludedinexamplecode)Dataset<Row>trainSet=…;Dataset<Row>testSet=…;//Fitthecrossvalidatortotrainingdocuments.CrossValidatorModelmodel=crossValidator.fit(trainSet);//MakepredicMonsontestdocuments.Dataset<Row>predic3ons=model.transform(testSet);//Createevaluator.Mul3classClassifica3onEvaluatorevaluator=newMul3classClassifica3onEvaluator()

.setLabelCol("label").setPredic3onCol("predic/on");//EvaluatepredicMons.evaluator.setMetricName("accuracy");doubleaccuracy=evaluator.evaluate(predic3ons); 54

real-time alarm veriﬁca/on with spark streaming and ... · 1/23/2017 · real-time alarm...

Documents