Anomaly Detection in Data Mining. Hybrid Approach between Filtering-and-refinement and DBSCAN

Eng. Ştefan-Iulian Handra
Prof. Dr. Eng. Horia Ciocârlie

SACI, May 2011

Contents

1. Introduction

2. Anomaly detection classical approaches

3. Filtering-and-refinement

4. Hybrid method

5. Experimental results

6. Conclusions and Further Development

7. Bibliography

1. Introduction

Anomaly detection: the process of finding individual objects that are different from the normal objects

Applications: safety-critical systems, insurance, health, electronic and bank fraud detection, military surveillance of enemy activities, data mining

2. Classical techniques

The Nearest Neighbor approach:

- calculates the distance between every analyzed instance in the data set and its k-th nearest neighbor

- instances in sparse regions (large distance) are considered anomalies, instances in dense regions are considered normal
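The k-th nearest neighbor score described above can be sketched as follows. This is a minimal illustration in Python rather than the Java/WEKA setup used in the presentation; the function name and the toy data are assumptions made for the example.

```python
import math

def kth_neighbor_distance(points, k):
    """Return, for each point, the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # distances from p to every other point, nearest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Toy data (made up): a tight group of four points plus one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = kth_neighbor_distance(points, k=2)

# The instance with the largest score lies in the sparsest region.
outlier_index = max(range(len(points)), key=scores.__getitem__)
```

The brute-force search compares every pair of instances, which is the kind of cost the filtering stage presented later tries to avoid.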

The Density based Local Outliers approach:

- assigns each instance a local outlier factor (LOF) describing the degree to which the instance is an outlier with respect to its local neighborhood

- the density around the instance is compared with the average density around its nearest neighbors
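A simplified version of the local outlier factor can be written in a few lines. The sketch below follows the usual definition (k-distance, reachability distance, local reachability density) but, as a simplification, takes exactly k neighbors even when distances tie and assumes there are no duplicate points; the function name and data are illustrative only.

```python
import math

def local_outlier_factors(points, k):
    """Simplified LOF: values near 1 are normal, values well above 1 are outliers."""
    n = len(points)
    # neighbors[i]: the k nearest neighbors of point i as (distance, index) pairs
    neighbors = []
    for i in range(n):
        ranked = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbors.append(ranked[:k])
    k_distance = [nbrs[-1][0] for nbrs in neighbors]

    def local_reachability_density(i):
        # reachability distance of i from neighbor j: max(k-distance(j), d(i, j))
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF(i): average density of the neighbors divided by the density of i
    return [sum(lrd[j] for _, j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Toy data (made up): four points on a unit square and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
lof = local_outlier_factors(points, k=2)
```

On this toy data the four square corners get a factor of 1 and the distant point a much larger one, which matches the intuition of comparing an instance's density with that of its neighbors.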


The DBSCAN algorithm:

- well known clustering algorithm

- based on the density-reachability and density-connectivity concepts

- it does not assign every entry to a cluster; entries left unassigned are treated as noise

- weaknesses: lacks scalability and fast response capabilities
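To make the density-reachability idea concrete, here is a compact DBSCAN sketch. It is not the WEKA implementation used later in the presentation; the parameter names `eps` and `min_pts` follow common usage, and the data is made up.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def region(i):
        # indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            neighbors = region(j)
            if len(neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(neighbors)
    return labels

# Toy data (made up): two dense groups and one stray point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (4, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The stray point ends up labeled as noise, which is the property that makes DBSCAN usable for anomaly detection. Each `region` call scans all points, which illustrates the scalability weakness listed above.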


The Random Forest approach:

- ensemble of individual tree predictors

- each tree depends on the values of a random vector sampled independently with the same distribution in all the trees

- advantage: discovers new patterns that the Euclidean distance does not

- weaknesses: needs labeled data and is slow to compute

3. Filtering-and-refinement

- classical methods focus on the normal instances in order to detect anomalies

- Filtering-and-refinement (F&R) introduces a change of paradigm: it focuses on the anomalies and not on the normal instances

- two-stage approach

Filtering stage:

- removes the majority of normal instances

Refinement stage:

- examines the remaining data with different density-based measures
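The two stages can be expressed as a small generic skeleton. The sketch below is only an illustration of the control flow: `cheap_score`, `keep_fraction`, `refine` and the helper `isolated` are hypothetical names, and the one-dimensional data is made up.

```python
from statistics import mean

def filter_and_refine(instances, cheap_score, keep_fraction, refine):
    """Stage 1 keeps the most suspicious instances; stage 2 examines only those."""
    ranked = sorted(instances, key=cheap_score, reverse=True)
    n_keep = max(1, round(len(ranked) * keep_fraction))
    candidates = ranked[:n_keep]   # output of the filtering stage
    return refine(candidates)      # output of the refinement stage

def isolated(candidates, radius=5.0):
    """Toy refinement: keep candidates with no other candidate within `radius`."""
    return [x for i, x in enumerate(candidates)
            if all(abs(x - y) > radius for j, y in enumerate(candidates) if j != i)]

# Made-up one-dimensional data: values around 10 and a single value of 50.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
center = mean(values)
anomalies = filter_and_refine(values, lambda x: abs(x - center), 0.3, isolated)
```

Only three of the ten values reach the refinement stage here, so the more expensive check runs on a fraction of the data.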


Advantages:

- saves the majority of the processing time by only analyzing the remaining data in the second step

- flexible and combinable with different density-based algorithms

Disadvantage: has seen little testing in practice

4. Hybrid method

- combination of Filtering-and-refinement and DBSCAN

- filtering stage: using the average value

- refinement stage: using DBSCAN

- Java routines for the filtering stage

- WEKA processing for the refinement stage


Two separate implementations:

- F&R1: in the filtering stage we removed the largest possible percentage of normal instances (~85%)

- F&R2: in the filtering stage we removed a substantial but smaller percentage of normal instances (~65%)
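The slides state only that the filtering stage uses the average value; the exact rule is not given. One plausible reading, shown below purely as an assumption, ranks instances by their distance to the per-feature mean and drops the closest 85% (F&R1) or 65% (F&R2). The function name `filter_by_mean` and the data are hypothetical.

```python
import math

def filter_by_mean(rows, remove_fraction):
    """Drop the `remove_fraction` of rows closest to the per-feature mean (assumed rule)."""
    n_features = len(rows[0])
    mean = [sum(r[f] for r in rows) / len(rows) for f in range(n_features)]
    ranked = sorted(rows, key=lambda r: math.dist(r, mean))  # most typical first
    n_remove = round(len(rows) * remove_fraction)
    return ranked[n_remove:]

# Synthetic rows, for illustration only: 19 rows near (5, 5) plus one far away.
rows = [(5 + (i % 3) - 1, 5 + (i % 2)) for i in range(19)] + [(15, 0)]

fr1 = filter_by_mean(rows, 0.85)  # F&R1-style setting: 3 rows remain
fr2 = filter_by_mean(rows, 0.65)  # F&R2-style setting: 7 rows remain
```

The remaining rows would then be passed to DBSCAN for the refinement stage. The lower removal rate keeps more borderline instances at the price of a longer refinement run.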


- automatically generated anomalies

- we modeled the data set in Java so that the anomalies can be told apart from the normal instances

- 3 separate runs to compare the results (F&R1, F&R2, normal)
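Generating anomalies while keeping track of which rows are artificial could look like the sketch below. How the authors generated their anomalies is not described, so drawing each feature uniformly from a range is an assumption, as are the function name and the parameters `low`, `high` and `seed`.

```python
import random

def inject_anomalies(rows, n_anomalies, low, high, seed=0):
    """Append n_anomalies random rows and return (all_rows, is_anomaly flags)."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    n_features = len(rows[0])
    generated = [tuple(rng.uniform(low, high) for _ in range(n_features))
                 for _ in range(n_anomalies)]
    flags = [False] * len(rows) + [True] * n_anomalies
    return list(rows) + generated, flags

# Made-up normal rows clustered at (5, 5); anomalies drawn from a distant range.
normal_rows = [(5.0, 5.0)] * 10
data, is_anomaly = inject_anomalies(normal_rows, n_anomalies=3, low=20.0, high=30.0)
```

Keeping the flags next to the data is what allows the detector's output to be checked afterwards.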

5. Experimental results

5.1. Data sets used

- 24 variations of data sets, each containing over 20,000 entries

- each data set consists of one letter column followed by 16 numeric feature columns describing that letter

- for each run the generated anomalies are also stored in separate data sets, so that the anomaly detection can be validated
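With the generated anomalies stored separately, validation reduces to comparing two sets of row identifiers. The sketch below assumes that the discovered ratio is the share of generated anomalies that the detector reported; the identifiers are made up.

```python
def discovered_ratio(injected, reported):
    """Fraction of injected anomalies that the detector reported."""
    injected, reported = set(injected), set(reported)
    return len(injected & reported) / len(injected)

injected_ids = {3, 17, 42, 99}    # made-up row ids of the generated anomalies
reported_ids = {3, 42, 99, 250}   # made-up detector output
ratio = discovered_ratio(injected_ids, reported_ids)  # 3 of the 4 were found
```

This measure ignores false alarms such as row 250; a fuller evaluation would also report how many normal instances were flagged.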

5.2. Results

[Figure: chart comparing "Anomalies" with "Anomalies discovered"; vertical axis from 0 to 120]

[Figure: "Anomalies discovered ratio"; vertical axis from 0 to 1; horizontal axis labels of the form 1FR1-50, 1Normal-50, 2FR2-50, 3FR1-50, 3Normal-50, ..., with suffixes -50, -100, -500 and -1000]

- for F&R1 and F&R2 the most costly execution of the filtering stage took ~10 s

Approach    Best Time (s)    Worst Time (s)
FR1         3                29
FR2         8                156
Normal      908              1070
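The speed gain implied by the table can be worked out directly. The ratios below compare best with best and worst with worst; since the slides do not say whether those times come from the same data sets, they are indicative only.

```python
# (best, worst) running times in seconds, copied from the table above
times = {"FR1": (3, 29), "FR2": (8, 156), "Normal": (908, 1070)}

base_best, base_worst = times["Normal"]
speedup = {name: (base_best / best, base_worst / worst)
           for name, (best, worst) in times.items() if name != "Normal"}

for name, (s_best, s_worst) in speedup.items():
    print(f"{name}: {s_best:.1f}x (best vs best), {s_worst:.1f}x (worst vs worst)")
```

Even in the least favourable comparison (FR2, worst case) the filtered run is several times faster than the normal run.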


6. Conclusions and Further Development

6.1. Conclusions

- both F&R approaches are more accurate than the classical approach

- the F&R approach can also be applied to other clustering algorithms that leave instances with unusual properties unassigned to any cluster

- large overall speed gain compared to the classical approach (3 to 29 s for FR1 and 8 to 156 s for FR2, against 908 to 1070 s for the normal run)

- saves disk space and processing resources

- the hybrid method spends the majority of its time processing anomalies rather than normal instances

6.2. Further Development

- adaptation of the algorithm to different domains

- use “filtered out” instances for training parallel neural networks

- experiment with a hybrid method between the Random Forest (RF) predictor and the F&R approach

7. Bibliography

- Xiao Yu, Lu An Tang, Jiawei Han. “Filtering and Refinement: A Two-Stage Approach for Efficient and Effective Anomaly Detection.” In Ninth IEEE International Conference on Data Mining, 2009.

- Liu F.T., Ting K.M., and Zhou Z. “Isolation forest.” In ICDM’08, 2008.

- Shi T. and Horvath S. “Unsupervised learning with random forest predictors.” In J. Computational and Graphical Statistics, 2006.

- Wenke Lee, Salvatore J. Stolfo, Philip K. Chan, Eleazar Eskin, Wei Fan, Matthew Miller, Shlomo Hershkop and Junxin Zhang. “Real Time Data Mining-based Intrusion Detection.” Conference paper of the North Carolina State University at Raleigh Department of Computer Science, Jan 2008.

Thank you for your attention!