real-time alarm verifica/on with spark streaming and ... · 1/23/2017 · real-time alarm...
TRANSCRIPT
Real-TimeAlarmVerifica/onwithSparkStreamingandMachineLearning
AnaSima,JanStampfli,KurtStockingerZurichUniversityofAppliedSciences
SwissBigDataUserGroupMee3ng
January23,2017
AboutMe
• Prof.Dr.KurtStockinger• Sincesummer2013atZHAW
• Databases,DataWarehousing,BigData
• 2007-2013:• DataWarehouse&BusinessIntelligenceArchitect
• 2004-2007:• ComputerScien3st
• 1999-2003:• ComputerScien3st
2
WhatistheProblemwithAlarmSystems?
3
ATypicalExample
4
Seriousuniversi3es…
…withSeriousStudents
5
ASeriousPartyinaStudentHome
6
…SeriousFireFighters
7
WhyMachineLearning?
• Around90%ofreportedincidentsarefalsealarms• Humanoperatorsshouldpriori3zetruealarms• Machinelearningisableto
• discoverthepa_ernsburiedinthealarmdata• separatetruefromfalsealarmswithahighaccuracy
8
9
DIGITAL FOOTPRINT OF A SECURED OBJECT
Alarm panel
Transmitter
User behavior
Alarm types
User behavior will improve overall process User based business models are possible with machine learning tools e.g. Services payments only when an alarm panel is activated
ResearchProjectwithSitasys
10
TalkOutline
• Machinelearningforalarmverifica3on
• End-to-endperformanceanalysis:• Streamprocessing(liveanalysis)• Batchprocessing(historicanalysis)• Onlinemachinelearning(liveverifica3on)
11
MachineLearning
withSpark
12
AboutMe
JanStampfli• ZHAWBachelorinComputerScience
• Graduatedin2014• ResearchassistantatZHAW
• SinceSeptember2014
• Workinginresearchprojectsonthetopics• BigData• MachineLearning• Informa3onRetrieval
13
The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again.
ATypicalMachineLearningProcess
14
SparkMLPipeline
h_p://spark.apache.org/docs/latest/ml-pipeline.html
15
loca/on labelBern 1Chur 0Sion 1
hash206691120996822577109
predic/on probabili/es0 [0.51,0.49]0 [0.99,0.01]1 [0.13,0.87]
DataFrame
Pipeline
PipelineTransformer EstimatorTransformerTransformer
EstimatorEstimator
Pipeline
SparkMLTuning
• RandomForest
• Numberoftrees• Maxdepthoftrees
• ParameterMap• [10,20,30]• [5,25,50]
16
DataFrame
CrossValidator
Pipeline Evaluator
h_p://spark.apache.org/docs/latest/ml-tuning.html
SparkMLEvalua3on
h_p://spark.apache.org/docs/latest/mllib-evalua3on-metrics.html 17
PredictionDataFrame
label probabili/es1 [0.51,0.49]0 [0.99,0.01]1 [0.13,0.87]
Evaluator
AccuracyPrecision
Recall
…
AppliedMachineLearning
18
DataaboutRealAlarms
• Features
• Loca3on,sensortype,dayofweek,…• Labels
• 0=falsealarm• 1=truealarm
19
Total Falsealarms TruealarmsTotal 339’841 165’997 173’844Training 169’920 84’960 84’960Test 169’921 81’037 88’884
AppliedAlgorithms
• RandomForest
• SupportVectorMachine
• Logis3cRegression
• DeepNeuralNetwork(DNN)
20
Results
21
JanStampfli,KurtStockinger(2016).ApplieddataScience:UsingMachineLearningforAlarmVerifica3on,ERCIMNews,107,10.
Results
21
Accuracy
92.33% 92.05%
• Confusionmatrix
• Precision(correctlypredictedalarms)
• Recall(correctlydetectedalarms)
22
Evalua3on–RandomForest
2222
Predic/onTrue Predic/onFalse
LabelTrue 82’999 5’885
LabelFalse 7’148 73’889
Predic/onTrue Predic/onFalse
LabelTrue 92.07%
LabelFalse 92.62%
Predic/onTrue Predic/onFalse
LabelTrue 93.38%
LabelFalse 91.18%
ConclusionaboutML
• Humanoperatorsarenowabletopriori3zealarmspredictedastrue
• Falsealarmsmusts3llbeevaluated
23
Performance
24
AboutMe
AnaSima
• EPFLMasterinSWSystems,2015
• Previouslyworkedonaframeworkforunifiedstream&batchprocessing(Cyclone)
• JoinedZHAWsinceNovember201625
Welcomebackfromtheholidays!
26
Whyisalarmprocessinga3me-cri3calmission?
27
Whenthingsgowrong…
28
It’sallaboutresponse3me!
Performancetes3ng
29
DataConsumer
Streaming
Historic
MachineLearning
DataGenerator
Alarmstream
Designingtheconsumer
• Spark-”AunifiedengineforBigDataProcessing”
30
SQL
StreamingMachineLearning
GraphProcessing
DataFrames
Kryo Gson Jackson
DataSets
RDDs
Noonesizefitsall,sohowdoIchoose???
31
Thebasecase
32
Producer
WriteNalarms/s
Consumer
Project&FilterDis3nctComputeHistogramPredictTrue/False
Alarms
Historicquery
ConsumerWorkflow
• Everywindowof10s• Streaming
• mAddrsInWindow = alarms.map(Alarm::getMacAddress).distinct();
• Historic• Select MacAddress, count(*) from mongoDB Where MacAddress in mAddrsInWindow
Group By MacAddress
• MachineLearning• (Coveredinfirsthalfoftalk)
33
34
SAVEBenchmarkIden3fyingBo_lenecks
1
ZHAWDatalab
Iden3fyingbo_lenecks(1)
1. Startedexperimentwith• 8-coreKataProducer• 8-coreSparkConsumer
2. Consumerslowevenforabasicstreamingtask• Countnumberofelementsinwindow=>afewsec
3. Producernotoutpuungexpectedthroughput• Stumbled@around12K/s(recordsize≃600B) 35
Commonrootcause?
36
Producer
Serialize,(fasterxmlJackson)WriteNalarms/s
ConsumerDeserialize,
Project&FilterDis3nctComputeHistogramPredictTrue/False
SerializedAlarms
Historicquery
Doesitreallyma_er?
• Per-objectserializa3ononlytakesafewtensofns
37
• Mul3plythatby12K..
• Andbtw,haveitallsentin1s!
Areweusingtherighttool?
• Answer:benchmarks*!
• ”IfyourenvironmentprimarilydealswithlotsofsmallJSONrequests[…]thenGSONisyourlibraryofinterest.Jacksonstrugglesthemostwithsmallfiles.”
• Jackson:upto3xslowerforsmallobjects
*TheulMmateJSONlibrary:hRp://blog.takipi.com/the-ulMmate-json-library-json-simple-vs-gson-vs-jackson-vs-json/
38
39
SAVEBenchmarkIden3fyingBo_lenecks
2
ZHAWDatalab
Results(1)
05,00010,00015,00020,00025,00030,000
Jackson Gson
Producermaxthroughput(alarms/s)
+
40*<30LOCchange=>�~2xspeedupinbothProducer&Consumer
05,00010,00015,00020,00025,00030,000
Jackson Gson
Consumermaxthroughput(alarms/s)
Howmuchwillfitin10s?
41
2.91
2.3
0.02
0.02
0.02
7
5
7.5
0
2
4
6
8
10
12
Kata+Jackson(3.5Kalarms)
Kata+GSON(3.5Kalarms)
Kata+GSON(5.7Kalarms)
Breakdownof/me/subtask
Streaming Historic MachineLearning
Note:thealarmsRDDisusedforboththestreaming&theMLpart=>serializerchangebenefitsboth
Iden3fyingbo_lenecks(2)
1. Caching?• Broughtsomeimprovement,butsmall(~5%)• Theamountofdatareusedistooli_le(20MB)
2. Simplifyingthedesign?• TheKISS*principle
• KeepItSimple,Stupid!
42
Cappedcollec3on
Simplifica3on
43
Alarms
Cappedcollec3on
Producer
WriteNalarms/s
Consumer
Project&FilterDis3nctComputeHistogramPredictTrue/False
Alarms
44
SAVEBenchmarkIden3fyingBo_lenecks
3
ZHAWDatalab
Results(2)
Producer
05,00010,00015,00020,00025,00030,000
Maxthroughput(alarms/s)
45
05,00010,00015,00020,00025,00030,000
Maxthroughput(alarms/s)
Note:Producerslow,buts3ll@previousmaxconsumerthroughput!Consumer:~50%speedup.
+
Howmuchwillfitin10s?
46
2.91 1
0.02
0.02 0.02
7
5
2
0
2
4
6
8
10
12
Kata+Jackson(3.5Kalarms)
Kata+GSON(3.5Kalarms)
Mongo+GSON(3.5Kalarms)
Breakdownof/me/subtask
Streaming Historic MachineLearning
Rootcause?
• UsingKatadirectstreamwithasinglepar33on=>poorparallelism• Mostopera3onsrunonasingleexecutor
• EventhoughSparkisconfiguredtorunon8cores
• UsingMongoDB=>finergrainedcontroloverparallelismlevelfortheRDD• Readanarrayofdocumentsfromtailablecursor• ParallelizearrayintoRDD&runMLoneachpart
47
FIXIT!
ConfiguringtheKataDirectStreaminSparkwithproperseungs….
(numpar33ons=numcores)
48
Results(3)
Producer
05,00010,00015,00020,00025,00030,000
Maxthroughput(alarms/s)
49
05,00010,00015,00020,00025,00030,000
Maxthroughput(alarms/s)
+
Designingtheconsumer
• Spark-”AunifiedengineforBigDataProcessing”
50
SQL
StreamingMachineLearning
GraphProcessing
DataFrames
Kryo Gson Jackson
DataSets
RDDs
Easytousetools…cansome/mesbetrickytogetright
Conclusions & Lessons Learned
• Machine learning algorithm of Spark can be directly applied • Works well for small and big data • No re-writing necessary to make algorithm scale across computing cluster
• Spark Streaming can’t be used out of the box for stream and batch processing: • Need to use a persistency layer • Requires significant performance tuning
• Working with our industry partner Sitasys: • Great collaboration • Working with real-world problem is very rewarding • Can make real contribution and enhance existing algorithms and
technology 51
Appendix
52
PipelineExampleCode(Java)//CreatestringindexerTransformer.StringIndexerstringIndexer=newStringIndexer()
.setInputCol("loca/on").setOutputCol("indexLoca/on");//CreaterandomforestEsMmator.RandomForestClassifierrandomForestClf=newRandomForestClassifier()
.setFeaturesCol("indexLoca/on").setLabelCol("label");//CreatepipelineEsMmator(specifyingtheMLworkflow).Pipelinepipeline=newPipeline()
.setStages(newPipelineStage[]{stringIndexer,randomForestClf});//CreatecrossvalidaMonEsMmator(wrappingthepipelineEsMmator).ParamMap[]paramGrid=newParamGridBuilder()
.addGrid(randomForestClf.numTrees()(),newint[]{10,20,30}).addGrid(randomForestClf.maxDepth(),newint[]{5,25,50}).build();
CrossValidatorcrossValidator=newCrossValidator().setEs3mator(pipeline).setEs3matorParamMaps(paramGrid).setEvaluator(newBinaryClassifica3onEvaluator()).setNumFolds(10);
53
Training,Tes3ngandEvalua3onExampleCode(Java)//Given(creaMonofdatasetnoincludedinexamplecode)Dataset<Row>trainSet=…;Dataset<Row>testSet=…;//Fitthecrossvalidatortotrainingdocuments.CrossValidatorModelmodel=crossValidator.fit(trainSet);//MakepredicMonsontestdocuments.Dataset<Row>predic3ons=model.transform(testSet);//Createevaluator.Mul3classClassifica3onEvaluatorevaluator=newMul3classClassifica3onEvaluator()
.setLabelCol("label").setPredic3onCol("predic/on");//EvaluatepredicMons.evaluator.setMetricName("accuracy");doubleaccuracy=evaluator.evaluate(predic3ons); 54