
Page 1: Performance Improvement for Bayesian Classification on Spatial Data with P-Trees Amal S. Perera Masum H. Serazi William Perrizo Dept. of Computer Science

Performance Improvement for Bayesian Classification on Spatial Data with P-Trees

Amal S. Perera, Masum H. Serazi, William Perrizo

Dept. of Computer Science, North Dakota State University

Fargo, ND 58105

These notes contain NDSU confidential and proprietary material. Patents pending on the P-Tree technology.

Page 2

Outline

• Introduction
• P-Tree
• P-Tree Algebra
• Bayesian Classifier
• Calculating Probabilities using P-Trees
• Band-based vs. Bit-based Approach
• Sample Data
• Classification Accuracy
• Classification Time
• Conclusion

Page 3

Introduction

• Classification is a form of data analysis and data mining that can be used to extract models describing important data classes or to predict future data trends.

• Some data classification techniques are:

Decision Tree Induction, Bayesian Classification, Neural Networks, K-Nearest Neighbor, Case-Based Reasoning, Genetic Algorithms, Rough Sets, and Fuzzy Logic techniques.

• A Bayesian classifier is a statistical classifier that uses Bayes’ theorem to predict class membership as the conditional probability that a given data sample falls into a particular class.

Page 4

Introduction (cont.)

• The P-Tree data structure allows us to compute the Bayesian probability values efficiently, without resorting to the naïve Bayesian assumption.

• Bayesian classification with P-Trees has been used successfully in precision agriculture to predict yield from remotely sensed imagery, and in genomics (yeast two-hybrid classification) to place in the ACM KDD-Cup 2002 competition. http://www.biostata.wisc.edu/~craven/kddcup/winners.html

• To completely eliminate the naïve assumption, a bit-based Bayesian classification is used instead of a band-based approach.

Page 5

P-Tree

• Most spatial data comes in a band-sequential format called BSQ.

• Each BSQ band is divided into several files, one for each bit position of the data values. This format is called ‘bit sequential’ or bSQ.

• Each bSQ bit file, Bij (the file constructed from the jth bits of the ith band), is converted into a tree structure called a Peano Tree (P-Tree).

• P-Trees represent tabular data in a lossless, compressed, bit-by-bit, recursive, data-mining-ready arrangement.

Page 6

A bSQ file, its raster spatial file and P-Tree

Key terms: Peano (Z) ordering, pure (Pure-1/Pure-0) quadrant, root count, level, fan-out, QID (quadrant ID).

The 64-bit bSQ file

1111110011111000111111001111111011111111111111111111111101111111

corresponds to the 8x8 raster

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

and its P-Tree records the 1-bit count of each Peano quadrant, expanding only impure quadrants down to the leaf bits:

55
+- 16            (pure-1)
+- 8
|  +- 3  -> 1110
|  +- 0          (pure-0)
|  +- 4          (pure-1)
|  +- 1  -> 0010
+- 15
|  +- 4
|  +- 4
|  +- 3  -> 1101
|  +- 4
+- 16            (pure-1)

Page 7

P-Tree Algebra

• Logical operators:
  – AND
  – OR
  – Complement
  – Other (XOR, etc.)

• Applying these operators, we calculate value P-Trees, interval P-Trees, and slice P-Trees.

Ptree: 55
+- 16
+- 8
|  +- 3  -> 1110
|  +- 0
|  +- 4
|  +- 1  -> 0010
+- 15
|  +- 4
|  +- 4
|  +- 3  -> 1101
|  +- 4
+- 16

Complement: 9
+- 0
+- 8
|  +- 1  -> 0001
|  +- 4
|  +- 0
|  +- 3  -> 1101
+- 1
|  +- 0
|  +- 0
|  +- 1  -> 0010
|  +- 0
+- 0

Page 8

P-Tree Algebra (cont.)

(’ indicates the COMPLEMENT operation)

• Basic P-Trees can be combined using logical operations to produce P-Trees for the original values at any level of bit precision. Using 8-bit precision for values, Pb11010011, which counts the number of occurrences of 11010011 in each quadrant, can be constructed from the basic P-Trees as:

Pb11010011 = Pb1 AND Pb2 AND Pb3’ AND Pb4 AND Pb5’ AND Pb6’ AND Pb7 AND Pb8

(The AND operation is simply the pixel-wise AND of the bits.)

• Similarly, any data set in the relational format can be represented as P-Trees. For any combination of values, (v1,v2,…,vn), where vi is from band-i, the quadrant-wise count of occurrences of this combination of values is given by:

P(v1,v2,…,vn) = P1v1 ^ P2v2 ^ … ^ Pnvn
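The value-P-Tree construction above can be sketched in Python. This is a minimal illustration, not the NDSU implementation: each basic P-Tree is modeled as an uncompressed bit column packed into an integer (ignoring the quadrant compression), so AND, complement, and root count become single integer operations. All names and data are illustrative.

```python
# Minimal sketch (not the NDSU P-Tree implementation): each basic
# P-Tree is a flat bit column stored as a Python int, so the logical
# operations are single integer operations and the root count is a
# population count.

N = 8  # number of tuples (pixels) in the toy data set

def complement(p, n=N):
    """Complement P-Tree: flip every bit in the column."""
    return ~p & ((1 << n) - 1)

def root_count(p):
    """Root count: number of 1-bits in the column."""
    return bin(p).count("1")

# Basic P-Trees for one 2-bit band: the jth bit of each tuple's value.
p_b1 = 0b11110000  # most significant bit of each tuple
p_b2 = 0b11001100  # least significant bit of each tuple

# Value P-Tree for value 10 (bit 1 = 1, bit 2 = 0):
p_10 = p_b1 & complement(p_b2)
print(root_count(p_10))  # tuples whose band value is 10 -> 2
```

A value P-Tree for any bit precision is built the same way: AND the basic P-Tree where the pattern bit is 1, and its complement where the pattern bit is 0, exactly as in the Pb11010011 example above.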

Page 9

Bayesian Classifier

Pr(Ci | X) is the posterior probability

Pr(Ci) is the prior probability

Can find conditional probabilities, Pr(X|Ci).

Classify X with Max Pr(Ci | X)

Since Pr(X) is constant for all classes, it is sufficient to maximize Pr(X|Ci) * Pr(Ci).

Based on Bayes’ Theorem:

Pr(Ci | X) = Pr(X | Ci) * Pr(Ci) / Pr(X)

Page 10

Calculating Probabilities Pr(X|Ci)

Using the naïve assumption:

Pr(X | Ci) = Pr(X1 | Ci) × Pr(X2 | Ci) × … × Pr(Xn | Ci)

Scan the data and calculate Pr(X | Ci) for the given X.

Using P-Trees:

Pr(X|Ci) = # training samples in Ci having pattern X / # samples in class Ci

= RC[ P1(X1) ^ P2(X2) ^ … ^Pn(Xn) ^ PC(Ci) ] / RC[ PC(Ci) ]

Problem: what if RC[ P1(X1) ^ P2(X2) ^ … ^ Pn(Xn) ^ PC(Ci) ] = 0 for all i, i.e., the unclassified pattern does not exist in the training set?
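The root-count formula above can be sketched as follows, again modeling P-Trees as packed bit columns (a simplification that ignores the tree compression). The function name and the toy bit patterns are illustrative, not from the paper.

```python
# Hedged sketch of Pr(X|Ci) from root counts. pattern_ptrees holds the
# value P-Trees P1(X1)..Pn(Xn); class_ptree is PC(Ci). P-Trees are
# modeled as bit columns packed into ints; data is illustrative.

def root_count(p):
    return bin(p).count("1")

def pr_x_given_c(pattern_ptrees, class_ptree):
    p = class_ptree
    for pk in pattern_ptrees:   # P1(X1) ^ P2(X2) ^ ... ^ Pn(Xn) ^ PC(Ci)
        p &= pk
    denom = root_count(class_ptree)
    return root_count(p) / denom if denom else 0.0

# Toy example: 8 tuples, class Ci covers tuples 0-3 (high bits).
pc_ci = 0b11110000
p1_x1 = 0b11011010
p2_x2 = 0b01110001

print(pr_x_given_c([p1_x1, p2_x2], pc_ci))  # -> 0.5
```

If the AND root count is 0 for every class, this returns 0 everywhere: that is exactly the problem noted above, which the band-based and bit-based fallbacks on the next pages address.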

Page 11

Band-based P-Tree Approach

• When RC = 0 for the given pattern for every class:
  – Reduce the restrictiveness of the pattern by removing the attribute with the least information gain.
  – Recalculate (assuming attribute 2 has the least IG):
    Pr(X | Ci) = RC[ P1X1 ^ P3X3 ^ … ^ PnXn ^ PCCi ] / RC[ PCCi ]

• Information gain is calculated using P-Trees; this is a one-time calculation for the entire training data.
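The information-gain ranking that drives this fallback can be sketched as a standard entropy calculation. In the real classifier the counts would come from P-Tree root counts; here they are supplied as plain lists, and all names and data are illustrative.

```python
# Sketch of the IG ranking used to pick which attribute to drop: the
# attribute with the least information gain is removed first. Counts
# are illustrative; in the P-Tree classifier they come from root counts.
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_gain(class_counts, partition):
    """partition: for each attribute value, the class counts in that cell."""
    total = sum(class_counts)
    expected = sum(sum(cell) / total * entropy(cell) for cell in partition)
    return entropy(class_counts) - expected

class_counts = [6, 2]           # two classes over 8 training tuples
part_a1 = [[3, 1], [3, 1]]      # attribute 1 tells us nothing about class
part_a2 = [[6, 0], [0, 2]]      # attribute 2 separates the classes

# Attribute 1 has the least IG, so it is the first to be dropped.
print(info_gain(class_counts, part_a1) < info_gain(class_counts, part_a2))  # -> True
```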

Page 12

Bit-based Approach

• Search for similar patterns by removing the least significant bits in the attribute space.

• The order of the bits to be removed is selected by calculating the information gain (IG).

Figure: four panels (a)-(d), each a 4x4 grid over the 2-bit attributes R and G (axis values 00, 01, 10, 11), showing the matched region growing as pattern bits are removed.

E.g., Calculate the Bayesian conditional probability value for the pattern [G,R] = [10,01] in 2-attribute space.

Assume IG for 1st significant bit of R < that of G.

Assume IG for 2nd significant bit of G < that of R.

Initially, search for the pattern [10,01] (a).

If not found, remove the 2nd significant bit of G (the lower IG) and search for [1_,01]. The search space increases (b).

If not found, also remove the 2nd significant bit of R and search for [1_,0_]. The search space increases (c).

If not found, remove the 1st significant bit of R (lower IG than G's 1st bit) and search for [1_,_ _]. The search space increases (d).
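The widening steps (a)-(d) above can be sketched as progressively dropping bit P-Trees from the AND, in the IG-determined order. Bit columns are again modeled as packed integers; drop_order and the toy bit patterns are illustrative assumptions, not the paper's data.

```python
# Hedged sketch of the bit-based widening: each pattern bit has a basic
# P-Tree (complemented where the pattern bit is 0). Wildcarding a bit
# simply removes its P-Tree from the AND, enlarging the match set.

N = 8
ALL_ONES = (1 << N) - 1

def match_count(bit_ptrees, active):
    p = ALL_ONES
    for pt, on in zip(bit_ptrees, active):
        if on:
            p &= pt
    return bin(p).count("1")

# Bit P-Trees for [G,R] = [10,01]: indices 0..3 = G bit 1, G bit 2,
# R bit 1, R bit 2. Chosen so that no tuple matches all four bits.
bits = [0b00000001, 0b00000010, 0b00000100, 0b00001000]
drop_order = [1, 3, 2]  # G's 2nd bit, then R's 2nd bit, then R's 1st bit

active = [True] * len(bits)
for i in [None] + drop_order:
    if i is not None:
        active[i] = False       # widen: wildcard this bit
    if match_count(bits, active):
        break                   # a similar pattern was found

print(match_count(bits, active))  # -> 1
```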

Page 13

Experiments

• The experimental data was extracted from two sets of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota.


• The images were taken in 1997 and 1998.

• Each image contains 3 bands, red, green and blue reflectance values.


• Three other files contain synchronized soil moisture, nitrate and yield values.

Page 14

Classification Accuracy

• The accuracy of the proposed bit-based approach is compared with the band-based approach and with KNN using Euclidean distance.

• It is clear that our approach outperforms the others.

Figure: classification accuracy (0-90 %) for the '97 data vs. training data size (1K, 4K, 16K, 65K, 260K pixels), comparing Band-Ptree, KNN-Euc., and Bit.

Page 15

Classification Accuracy (cont.)

• The accuracy of the approach was also compared to an existing Bayesian belief network classifier. The classifier is J Cheng's Bayesian Belief Network available at

http://www.cs.ualberta.ca/~jcheng/ .

– This classifier was the winning entry for the KDD Cup 2001 data mining competition. The developer claims that the classifier can perform with or without domain knowledge.

• For the comparison, smaller training data sets ranging from 4K to 16K pixels were used, due to the inability of the implementation to handle larger data sets.

Accuracy:

Training Size (pixels) | Bit-Ptree Based | Bayesian Belief
4,000                  | 66 %            | 26 %
16,000                 | 67 %            | 51 %

The belief network was built without using any domain knowledge, to make it comparable to the P-Tree approach.

Page 16

Classification Time

• P-Tree approach requires no build time (lazy classifier).

• In most lazy classifiers the classification time per tuple varies with the number of items in the training set due to the requirement of having to scan the training data.

• P-Tree approach does not require a traditional data scan.

• The data in the figure was collected using 5 significant bits and a threshold probability of 0.85.

• The time is given for scalability comparisons.

Figure: variation of classification time with training sample size (pixels) for the bit-P-tree algorithm.

Page 17

Conclusion

• Naïve assumption reduces the accuracy of the classification in this particular application domain.

• Our approach increases accuracy of a P-Tree Bayesian classifier by completely eliminating the naïve assumption.

– The new approach has better accuracy than the existing P-Tree based Bayesian classifier.

– It was also shown to be better than a Bayesian belief network implementation and a Euclidean-distance-based KNN approach.

• It has the same computational cost, in terms of P-Tree operations, as the previous P-Tree approach, and it is scalable with respect to the size of the data set.