
RDF: A Density-based Outlier Detection Method Using Vertical Data Representation

Dongmei Ren, Baoying Wang, William Perrizo

North Dakota State University, U.S.A

Introduction

Related Work

Breunig et al. [6] first proposed a density-based approach to mining outliers over datasets with different densities.

Papadimitriou et al. [7] introduced the local correlation integral (LOCI), but the method is not efficient.

Contributions of this paper

1. A relative density factor (RDF). RDF expresses the same amount of information as LOF (local outlier factor) [6] and MDEF (multi-granularity deviation factor) [7], but RDF is easier to compute.

2. An RDF-based outlier detection method. It efficiently prunes the data points that are deep in clusters and detects outliers only within the remaining small subset of the data.

3. A vertical data representation in P-trees. P-trees improve the efficiency of the method further.

Definitions

Definition 1: Disk Neighborhood --- DiskNbr(x,r)

Given a point x and radius r, the disk neighborhood of x is defined as the set $DiskNbr(x, r) = \{x' \in X \mid d(x, x') \le r\}$, where $d(x, x')$ is the distance between x and x'.

[Figure: a point x with its direct DiskNbr and its indirect DiskNbr, illustrating the direct and indirect neighbors of x.]

Definition 2: Density of DiskNbr(x, r) --- Dens(x,r)

$$Dens(x, r) = \frac{|DiskNbr(x, r)|}{r^{dim}}$$

where dim is the number of dimensions.
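Definitions 1 and 2 can be illustrated with a brute-force sketch in Python. The function names `disk_nbr` and `dens`, the Euclidean distance, and the sample points are assumptions made for illustration; the slides do not fix a distance function, and the method itself uses P-trees rather than a linear scan.

```python
import math

def disk_nbr(points, x, r):
    """Return every point within distance r of x (Euclidean distance assumed)."""
    return [p for p in points if math.dist(p, x) <= r]

def dens(points, x, r):
    """Density of DiskNbr(x, r): neighbor count divided by r ** dim."""
    dim = len(x)
    return len(disk_nbr(points, x, r)) / r ** dim

# Hypothetical 2-D sample: three points near the origin and one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (3.0, 3.0)]
print(len(disk_nbr(points, (0.0, 0.0), 1.0)))
print(dens(points, (0.0, 0.0), 1.0))
```

Because x is counted in its own neighborhood here, the density is never zero for a point of the dataset.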

Definitions (Continued)

Definition 3: Relative Density Factor (RDF) of point x with radius r --- RDF(x,r)

$$RDF(x, r) = \frac{AVG_{q \in DiskNbr(x,r)}\, Dens(q, r)}{Dens(x, r)} = \frac{\sum_{q \in DiskNbr(x,r)} |DiskNbr(q, r)|}{|DiskNbr(x, r)|^{2}}$$

RDF is used to measure outlierness. Outliers are points with high RDF values.

Special case: RDF between DiskNbr(x,r) and {DiskNbr(x,2r) - DiskNbr(x,r)}

$$RDF(x, r) = \frac{|DiskNbr(x, 2r)| - |DiskNbr(x, r)|}{|DiskNbr(x, r)| \cdot (2^{dim} - 1)}$$

[Figure: point x with its direct neighbors inside radius r and its indirect neighbors between r and 2r.]
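The two RDF formulas can be checked numerically with a small brute-force sketch. The helper names (`disk_nbr`, `rdf`, `rdf_ring`), the Euclidean distance, and the sample data are assumptions for illustration only; x is assumed to belong to the dataset so that its neighborhood is never empty.

```python
import math

def disk_nbr(points, x, r):
    """All points within distance r of x (Euclidean distance assumed)."""
    return [p for p in points if math.dist(p, x) <= r]

def rdf(points, x, r):
    """Sum of neighbor counts over DiskNbr(x, r), divided by |DiskNbr(x, r)| squared."""
    nbrs = disk_nbr(points, x, r)
    total = sum(len(disk_nbr(points, q, r)) for q in nbrs)
    return total / len(nbrs) ** 2

def rdf_ring(points, x, r):
    """Special case: density of the ring between r and 2r relative to the inner disk."""
    dim = len(x)
    inner = len(disk_nbr(points, x, r))
    outer = len(disk_nbr(points, x, 2 * r))
    return (outer - inner) / (inner * (2 ** dim - 1))

# Hypothetical data: a tight cluster of four points and one point off to the side.
cluster = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]
loner = (1.4, 0.25)
points = cluster + [loner]

print(rdf(points, loner, 1.0))       # larger: the loner's neighbors are denser than it is
print(rdf(points, (0.0, 0.0), 1.0))  # smaller: a point inside the cluster
print(rdf_ring(points, (0.0, 0.0), 1.0))
```

In this toy data the isolated point receives the higher RDF, which matches the statement that outliers are points with high RDF values.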

The Proposed Outlier Detection Method

Given a dataset X, the proposed outlier detection method consists of two procedures: "Find Outliers" and "Prune Non-outliers".

The method efficiently prunes non-outliers (points deep in clusters) and then finds outliers over the remaining small subset of the data, which consists of points on cluster boundaries and real outliers.

[Figure: pruning non-outliers. Starting from a point x, the neighborhood is expanded through the radii r, 2r, 4r, 6r.]

[Figure: finding outliers among the remaining points, with the detected outliers marked.]

Finding Outliers

Three possible distributions with regard to RDF:

(a) 1/(1+ε) ≤ RDF ≤ (1+ε): prune all neighbors of x and call the "Pruning Non-outliers" procedure.

(b) RDF < 1/(1+ε): prune all direct neighbors of x and calculate RDF for each indirect neighbor.

(c) RDF > (1+ε): x is an outlier; prune the indirect neighbors of x.
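The three-way test on RDF can be written as a small helper. The function name `rdf_case` and the returned labels are hypothetical; only the thresholds 1/(1+ε) and (1+ε) come from the slides.

```python
def rdf_case(rdf_value, eps):
    """Map an RDF value to one of the three distributions (a), (b), (c)."""
    upper = 1.0 + eps
    lower = 1.0 / upper
    if rdf_value > upper:
        return "c"  # x is an outlier; prune its indirect neighbors
    if rdf_value < lower:
        return "b"  # prune direct neighbors; examine each indirect neighbor
    return "a"      # density roughly constant; prune all neighbors

for value in (1.0, 0.5, 2.0):
    print(value, rdf_case(value, eps=0.8))
```

With ε = 0.8 the band of case (a) runs from about 0.56 to 1.8, so a larger ε sends more points down the pruning path.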

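Finding Outliers using P-Trees

P-Tree based direct neighbors --- $PDN_x^r$

For point x, let X = (x1, x2, …, xn), or in bit form X = (x1,m-1, … x1,0), (x2,m-1, … x2,0), …, (xn,m-1, … xn,0), where xi,j is the jth bit value in the ith attribute. For the ith attribute,

$$PDN_{x_i}^{r} = P_{x' > x_i - r} \ \text{AND}\ P_{x' \le x_i + r}$$

For multiple attributes,

$$PDN_x^{r} = \text{AND}_{i=1}^{n}\, PDN_{x_i}^{r}, \qquad |DiskNbr(x, r)| = rc(PDN_x^{r})$$

where rc is the root count of a P-tree, i.e. the number of 1 bits it represents.

P-Tree based indirect neighbors --- $PIN_x^r$

$$PIN_x^{r} = \left(\text{OR}_{q \in DiskNbr(x,r)}\, PDN_q^{r}\right) \ \text{AND}\ PDN_x^{r\,\prime}$$

where the prime denotes the complement.

Pruning is done by P-Tree ANDing based on the above three distributions:

(a), (c): $PU = PU \ \text{AND}\ PDN_x^{r\,\prime} \ \text{AND}\ PIN_x^{r\,\prime}$

(b): $PU = PU \ \text{AND}\ PDN_x^{r\,\prime}$

where PU is a P-tree representing the unprocessed data points.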

Pruning Non-outliers

The pruning is a neighborhood expanding process. It calculates RDF between {DiskNbr(x,2kr) - DiskNbr(x,kr)} and DiskNbr(x,kr), where k is an integer, and prunes based on the value of RDF:

1/(1+ε) ≤ RDF ≤ (1+ε) (density stays roughly constant): continue expanding the neighborhood by doubling the radius.

RDF < 1/(1+ε) (significant decrease of density): stop expanding, prune DiskNbr(x,kr), and call the "Finding Outliers" procedure.

RDF > (1+ε) (significant increase of density): stop expanding and call "Pruning Non-outliers".

[Figure: pruning non-outliers. Starting from point x, the neighborhood is expanded from r to 2r to 4r.]
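The expansion loop can be sketched with a brute-force neighbor count. This is a minimal illustration under stated assumptions: the function names `count_within` and `expand`, the `max_doublings` safety limit, the Euclidean distance, and the 1-D sample data are all hypothetical, and the procedure on the slides works on P-trees rather than on a linear scan.

```python
import math

def count_within(points, x, radius):
    """Number of points within the given radius of x (Euclidean distance assumed)."""
    return sum(1 for p in points if math.dist(p, x) <= radius)

def expand(points, x, r, eps, max_doublings=16):
    """Double the radius while the ring-to-disk RDF stays inside the band."""
    dim = len(x)
    radius = r
    for _ in range(max_doublings):
        inner = count_within(points, x, radius)
        outer = count_within(points, x, 2 * radius)
        ratio = (outer - inner) / (inner * (2 ** dim - 1))
        if ratio < 1 / (1 + eps):
            return radius, "decrease"  # prune DiskNbr(x, radius), then find outliers
        if ratio > 1 + eps:
            return radius, "increase"  # denser region ahead; prune from there
        radius *= 2                    # density roughly constant: keep expanding
    return radius, "constant"

# Hypothetical 1-D data: 17 evenly spaced points, starting from the middle one.
points = [(float(i),) for i in range(17)]
print(expand(points, (8.0,), 1.0, eps=0.8))
```

On this evenly spaced data the ratio stays inside the band until the disk of radius 8 covers the whole dataset, at which point the ring is empty and the expansion stops with a density decrease.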

Pruning Non-outliers Using P-Trees

We define ξ-neighbors as the neighbors with ξ bits of dissimilarity with x, e.g. ξ = 1, 2, ..., 8 if x is an 8-bit value.

For point x, let X = (x1, x2, …, xn), or in bit form X = (x1,m-1, … x1,0), (x2,m-1, … x2,0), …, (xn,m-1, … xn,0), where xi,j is the jth bit value in the ith attribute. For the ith attribute, the ξ-neighbors of x are calculated by

$$P_{x_i}^{\xi} = \text{AND}_{j=\xi}^{m-1}\, P_{x_i,j}, \qquad \text{where}\quad P_{x_i,j} = \begin{cases} P_{i,j}, & x_{i,j} = 1 \\ P'_{i,j}, & x_{i,j} = 0 \end{cases}$$

and over all attributes by

$$P_X^{\xi} = \text{AND}_{i=1}^{n}\, P_{x_i}^{\xi}$$

The pruning is accomplished by $PU = PU \ \text{AND}\ P_X^{\xi\,\prime}$, where $P_X^{\xi\,\prime}$ is the complement set of $P_X^{\xi}$.
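The bit-slice ANDing behind ξ-neighbors can be sketched with Python integers used as flat bitmaps, one per bit position of a single attribute. This is only an illustration of the vertical layout: real P-trees are compressed tree structures, and the function names, the single-attribute restriction, and the reading of "ξ bits of dissimilarity" as "the ξ low-order bits may differ" are assumptions.

```python
def bit_slices(values, m):
    """slices[j] is a bitmap whose bit k is 1 when bit j of values[k] is 1."""
    slices = [0] * m
    for k, v in enumerate(values):
        for j in range(m):
            if (v >> j) & 1:
                slices[j] |= 1 << k
    return slices

def xi_neighbors(values, m, x, xi):
    """Bitmap of rows that agree with x on the m - xi high-order bits."""
    all_rows = (1 << len(values)) - 1
    slices = bit_slices(values, m)
    mask = all_rows
    for j in range(xi, m):
        if (x >> j) & 1:
            mask &= slices[j]              # keep rows whose bit j is 1
        else:
            mask &= all_rows & ~slices[j]  # keep rows whose bit j is 0
    return mask

def root_count(bitmap):
    """Number of 1 bits, playing the role of rc() on a P-tree."""
    return bin(bitmap).count("1")

# Hypothetical 4-bit attribute; the start point is x = 12 (binary 1100).
values = [12, 13, 15, 8, 3]
for xi in range(5):
    print(xi, root_count(xi_neighbors(values, 4, 12, xi)))

# Pruning: remove the 2-neighbors of x from the unprocessed set PU.
pu = (1 << len(values)) - 1
pu &= ~xi_neighbors(values, 4, 12, 2)
print(bin(pu))
```

Each increase of ξ drops one more low-order bit from the comparison, so the matched interval doubles in width, which mirrors the radius doubling of the pruning procedure.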

RDF-based Outlier Detection Process

Algorithm: RDF-based Outlier Detection using P-Trees

Input: Dataset X, radius r, distribution parameter ε.

Output: An outlier set Ols.

// PU: unprocessed points represented by P-Trees
// |PU|: number of points in PU
// PO: outliers

// Build up P-Trees for dataset X
PU ← createP-Trees(X);
i ← 1;
WHILE |PU| > 0 DO
    x ← PU.first; // pick an arbitrary point x
    PO ← FindOutliers(x, r, ε);
    i ← i + 1;
ENDWHILE

“Find Outliers” and “Prune Non-Outliers” Procedures

Algorithm: PruneNonOutliers
Input: point x, distribution parameter ε, dataset X
Output: pruned dataset PU
// Pi,j is the P-tree for the jth bit of the ith attribute of X
// PNx^ξ is the ξ-neighborhood of a point x
// n is the number of attributes, m is the number of bits in each attribute
// Pxi,i,j' is the complement set of Pxi,i,j
FOR i = 1 TO n
    FOR j = 0 TO m-1
        IF xi,j = 1
            Pxi,i,j ← Pi,j
        ELSE
            Pxi,i,j ← P'i,j
    ENDFOR
ENDFOR
PU ← 1; PX ← 1; ξ ← 0;
DO
    FOR i = 1 TO n
        Pxi ← Pxi,i,1
        FOR j = 0 TO m-ξ
            Pxi ← Pxi AND Pxi,i,j+1
        ENDFOR
        PX ← PX AND Pxi
    ENDFOR
    PNx^ξ ← PX;
    ξ ← ξ + 1;
    rdf ← (rc(PNx^ξ) - rc(PNx^(ξ-1))) / (rc(PNx^(ξ-1)))^2
WHILE (1/(1+ε) ≤ rdf ≤ (1+ε)) // keep expanding while the density stays constant
q ← {PNx'^(ξ-1) AND PNx^ξ}
IF rdf < 1/(1+ε)
    PU ← PU AND PNx'^(ξ-1); // pruning
    FindOutliers(q, r, ε);
ELSE IF rdf > (1+ε)
    PruneNonOutliers(q, r, ε)
ENDIF

Algorithm: FindOutliers
Input: point x, radius r, distribution parameter ε
Output: pruned dataset PU
// PDN(x): direct neighbors of x
// PIN(x): indirect neighbors of x
// rdf is the relative density factor
PDN(x, r) ← PX≤x+r AND PX>x-r
sum ← 0; PN ← 0;
FOR each point q in PDN(x, r)
    PN(q, r) ← PX≤q+r AND PX>q-r;
    PN ← PN OR PN(q, r);
    sum ← sum + |PN(q, r)|;
ENDFOR
PIN(x) ← PN AND PDN'(x);
rdf ← sum / (|PDN(x)|^2);
switch (rdf)
    case 1/(1+ε) ≤ rdf ≤ (1+ε):
        PU ← PU AND PDN'(x) AND PIN'(x);
        PruneNonOutliers(x, r, ε);
    case rdf < 1/(1+ε):
        PU ← PU AND PDN'(x);
        FOR each point q in PIN(x)
            FindOutliers(q, r, ε);
        ENDFOR
    case rdf > (1+ε):
        // Add point x to the outlier set Ols
        Ols ← Ols OR x;
        PU ← PU AND PIN'(x);

Experimental Study

NHL data set (1996)

Compared with LOF and aLOCI:

LOF: Local Outlier Factor method
aLOCI: approximate Local Correlation Integral method

Run Time Comparison and Scalability Comparison

Starting from a data size of 16,384, RDF outperforms LOF and aLOCI in terms of scalability and speed.

Run Time Comparisons of LOF, aLOCI, RDF (run time in seconds)

Data Size    256     1024    4096     16384     65536
LOF          0.23    1.92    38.79    103.19    1813.43
aLOCI        0.17    1.87    35.81    87.34     985.39
RDF          0.58    2.1     8.34     37.82     108.91

[Figure: Scalability Comparison of LOF, aLOCI, RDF. Run time (s) plotted against data size (256 to 65536) for LOF, aLOCI, and RDF.]

Reference

1. V. Barnett, T. Lewis, "Outliers in Statistical Data", John Wiley & Sons.

2. Knorr, Edwin M. and Raymond T. Ng. A Unified Notion of Outliers: Properties and Computation. 3rd International Conference on Knowledge Discovery and Data Mining Proceedings, 1997, pp. 219-222.

3. Knorr, Edwin M. and Raymond T. Ng. Algorithms for Mining Distance-Based Outliers in Large Datasets. Very Large Data Bases Conference Proceedings, 1998, pp. 24-27.

4. Knorr, Edwin M. and Raymond T. Ng. Finding Intensional Knowledge of Distance-Based Outliers. Very Large Data Bases Conference Proceedings, 1999, pp. 211-222.

5. Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim, "Efficient algorithms for mining outliers from large datasets", Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, ISSN 0163-5808.

6. Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander, "LOF: Identifying Density-based Local Outliers", Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, Dallas, TX, 2000.

7. Spiros Papadimitriou, Hiroyuki Kitagawa, Phillip B. Gibbons, Christos Faloutsos, "LOCI: Fast Outlier Detection Using the Local Correlation Integral", 19th International Conference on Data Engineering, March 05 - 08, 2003, Bangalore, India.

8. A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Comp. Surveys, 31(3):264-323, 1999.

9. Arning, Andreas, Rakesh Agrawal, and Prabhakar Raghavan. A Linear Method for Deviation Detection in Large Databases. 2nd International Conference on Knowledge Discovery and Data Mining Proceedings, 1996, pp. 164-169.

10. S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-Driven Exploration of OLAP Data Cubes. EDBT'98.

11. Q. Ding, M. Khan, A. Roy, and W. Perrizo, The P-tree algebra. Proceedings of the ACM SAC, Symposium on Applied Computing, 2002.

12. W. Perrizo, "Peano Count Tree Technology," Technical Report NDSU-CSOR-TR-01-1, 2001.

13. M. Khan, Q. Ding and W. Perrizo, "k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees", Proc. of PAKDD 2002, Springer-Verlag LNAI 2776, 2002.

14. Wang, B., Pan, F., Cui, Y., and Perrizo, W., Efficient Quantitative Frequent Pattern Mining Using Predicate Trees, CAINE 2003.

15. Pan, F., Wang, B., Zhang, Y., Ren, D., Hu, X. and Perrizo, W., Efficient Density Clustering for Spatial Data, PKDD 2003.

Thank you!

Determination of Parameters

Determination of r

Breunig et al. [6] show that choosing MinPts between 10 and 30 works well in general (MinPts-neighborhood).

Choosing MinPts = 20, we take the average radius of the 20-neighborhood, r_average.

In our algorithm, r = r_average = 0.5.

Determination of ε

The selection of ε is a tradeoff between accuracy and speed: the larger ε is, the faster the algorithm works; the smaller ε is, the more accurate the results are.

We chose ε = 0.8 experimentally and obtained the same result (the same outliers) as Breunig et al.'s method, but much faster.

The results shown in the experimental part are based on ε = 0.8.
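One way to read "the average radius of the 20-neighborhood" is the mean distance from each point to its k-th nearest neighbor. The sketch below follows that reading; the function name `average_k_distance`, the Euclidean distance, the choice to exclude the point itself, and the sample data are assumptions, and the slides do not spell out the exact computation.

```python
import math

def average_k_distance(points, k):
    """Mean distance from each point to its k-th nearest other point."""
    total = 0.0
    for p in points:
        # Distances to every other point, smallest first (the point itself is skipped).
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        total += dists[k - 1]
    return total / len(points)

# Hypothetical 1-D data with unit spacing.
points = [(0.0,), (1.0,), (2.0,), (3.0,)]
print(average_k_distance(points, 1))
print(average_k_distance(points, 2))
```

Under this reading, k = 20 would correspond to the MinPts = 20 choice above, and the returned value would be used as the radius r.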
