Anomaly Detection using Spark MLlib and Spark Streaming

Anomaly Detection: Offline Training using Spark MLlib; Online Testing using Spark Streaming. Details: https://github.com/keiraqz/anomaly-detection. Keira Zhou, Dec 2015


Page 1: Anomaly Detection using Spark MLlib and Spark Streaming

Anomaly Detection
Offline Training using Spark MLlib; Online Testing using Spark Streaming
Details: https://github.com/keiraqz/anomaly-detection

Keira Zhou, Dec 2015

Page 2: Anomaly Detection using Spark MLlib and Spark Streaming

The Model

The model is trained using the K-means approach (Spark MLlib KMeans), on the "normal" dataset only. After the model is trained, the centroid of the "normal" dataset is returned along with a threshold.

During the validation stage, any data point that is farther than the threshold from the centroid is considered an "anomaly".

Page 3: Anomaly Detection using Spark MLlib and Spark Streaming

Dataset

The dataset is downloaded from KDD Cup 1999 Data for Anomaly Detection [1].

Training Set: The training set is separated from the whole dataset and contains only the data points labeled as "normal".

Validation Set: The validation set uses the whole dataset. All data points that are NOT labeled as "normal" are considered "anomalies".

[1] KDD Cup 1999 Data: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
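The split described above can be sketched as follows, assuming each KDD record is a comma-separated line whose last field is its label (e.g. "normal." for normal traffic); in the actual pipeline this filtering would be done with Spark:

```python
def split_dataset(lines):
    """Training keeps only "normal" records; validation keeps everything."""
    training, validation = [], []
    for line in lines:
        label = line.strip().split(",")[-1]
        validation.append(line)      # validation set: the whole dataset
        if label == "normal.":
            training.append(line)    # training set: "normal" records only
    return training, validation

records = [
    "0,tcp,http,181,5450,normal.",
    "0,tcp,private,0,0,neptune.",  # not "normal" -> treated as an anomaly
]
train, valid = split_dataset(records)
print(len(train), len(valid))  # 1 2
```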

Page 4: Anomaly Detection using Spark MLlib and Spark Streaming

Offline Training

The majority of the training code follows the tutorial from Sean Owen, Cloudera:
Video: https://www.youtube.com/watch?v=TC5cKYBZAeI
Slides-1: http://www.slideshare.net/CIGTR/anomaly-detection-with-apache-spark
Slides-2: http://www.slideshare.net/cloudera/anomaly-detection-with-apache-spark-2

A couple of modifications have been made to fit personal interest:
- Instead of training multiple clusters, the code trains only on "normal" data points.
- Only one cluster center is recorded, and the threshold is set to the distance of the 2000th-furthest data point.
- During the later validation stage, all points that are further than the threshold are labeled as "anomaly".

Page 5: Anomaly Detection using Spark MLlib and Spark Streaming

Online Testing

Validation is run as a streaming job using Spark Streaming.

Currently the application reads the input data from a local file. In an ideal situation, the program would read the data from an ingestion tool such as Kafka.

The trained model (centroid and threshold) is also saved in a local file. In production, this information should be saved into a database.

Page 6: Anomaly Detection using Spark MLlib and Spark Streaming

Spark Streaming context: process every 3 seconds.

Load the trained model: load it from the local file and put it into a queueStream.

The streaming task: calculate the distance between each data point and the centroid, then compare it to the threshold.
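The per-batch task can be sketched without Spark by iterating over micro-batches; in the actual application a StreamingContext with a 3-second batch interval feeds the same check, and the centroid and threshold below are placeholder values:

```python
import math

centroid, threshold = [0.0, 0.0], 5.0  # placeholder model from offline training

def flag_anomalies(batch, centroid, threshold):
    """The streaming task: keep points whose distance to the centroid
    exceeds the threshold."""
    return [
        point
        for point in batch
        if math.sqrt(sum((x - c) ** 2 for x, c in zip(point, centroid))) > threshold
    ]

# Two simulated micro-batches (one per 3-second interval in the real job)
batches = [
    [[1.0, 1.0], [8.0, 8.0]],
    [[0.5, 0.2]],
]
for batch in batches:
    print(flag_anomalies(batch, centroid, threshold))
```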

Page 7: Anomaly Detection using Spark MLlib and Spark Streaming

Notes

Currently the application reads the input data from a local file. In an ideal situation, the program would read the data from an ingestion tool such as Kafka.

The trained model (centroid and threshold) is also saved in a local file. In production, this information should be saved into a database.

The output of the testing can be saved into a database for visualization

Page 8: Anomaly Detection using Spark MLlib and Spark Streaming

More Details

https://github.com/keiraqz/anomaly-detection