Clustering Analysis of Spatial Data Using Peano Count Trees
Qiang Ding, William Perrizo
Department of Computer Science, North Dakota State University, USA
(Ptree technology is patented by NDSU)
Overview
Introduction
Data Structures
Clustering Algorithms based on Partitioning
Our Approach
Example
Conclusion & Discussion
Introduction
Existing methods are not always suitable for cluster analysis because of the size of the datasets.
The Peano Count Tree (PC-tree) provides a lossless, compressed, clustering-ready representation of a spatial dataset.
We introduce an efficient clustering method based on this structure.
Background on Spatial Data
Band – attribute
Pixel – transaction (tuple)
Value – 0~255 (one byte)
Different kinds of images have different numbers of bands
– TM4/5: 7 bands (B, G, R, NIR, MIR, TIR, MIR2)
– TM7: 8 bands (B, G, R, NIR, MIR, TIR, MIR2, PC)
– TIFF: 3 bands (B, G, R)
– Ground data: individual bands (Yield, Moisture, Nitrate, Temperature)
Spatial Data Formats
Existing formats
– BSQ (Band Sequential)
– BIL (Band Interleaved by Line)
– BIP (Band Interleaved by Pixel)
New format
– bSQ (bit Sequential)
Spatial Data Formats (Cont.)

Example: a 2×2 image with two bands.

BAND-1:
254 127   (1111 1110) (0111 1111)
 14 193   (0000 1110) (1100 0001)

BAND-2:
 37 240   (0010 0101) (1111 0000)
200  19   (1100 1000) (0001 0011)

BSQ format (2 files)
Band 1: 254 127 14 193
Band 2: 37 240 200 19

BIL format (1 file)
254 127 37 240 14 193 200 19

BIP format (1 file)
254 37 127 240 14 200 193 19

bSQ format (16 files, one per bit position; one row per pixel)
B11 B12 B13 B14 B15 B16 B17 B18   B21 B22 B23 B24 B25 B26 B27 B28
 1   1   1   1   1   1   1   0     0   0   1   0   0   1   0   1
 0   1   1   1   1   1   1   1     1   1   1   1   0   0   0   0
 0   0   0   0   1   1   1   0     1   1   0   0   1   0   0   0
 1   1   0   0   0   0   0   1     0   0   0   1   0   0   1   1
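To make the four orderings concrete, here is a minimal sketch (illustrative code, not from the paper; the helper name bit_file is ours) that derives the BIL, BIP, and bSQ layouts from the 2×2, two-band example above:

    band1 = [254, 127, 14, 193]   # band 1, raster order (2x2 image)
    band2 = [37, 240, 200, 19]    # band 2

    # BSQ: one file (one sequence) per band
    bsq = [band1, band2]

    # BIL: interleave bands line by line (2 pixels per line here)
    width = 2
    bil = []
    for row in range(2):
        for band in (band1, band2):
            bil.extend(band[row * width:(row + 1) * width])

    # BIP: interleave bands pixel by pixel
    bip = [v for pair in zip(band1, band2) for v in pair]

    # bSQ: one file per bit position; Bij holds bit j (j=1 is the MSB) of band i
    def bit_file(band, j):
        return [(v >> (8 - j)) & 1 for v in band]

    print(bil)                 # [254, 127, 37, 240, 14, 193, 200, 19]
    print(bip)                 # [254, 37, 127, 240, 14, 200, 193, 19]
    print(bit_file(band1, 1))  # B11 = [1, 0, 0, 1]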
bSQ Format
Reasons for using the bSQ format
– Different bits contribute to the value differently.
• The bSQ format facilitates the representation of a precision hierarchy (from 1-bit up to 8-bit precision).
– The bSQ format facilitates the creation of efficient structures
• P-trees
• P-tree algebra
Example
– A Landsat Thematic Mapper (TM) satellite image is in BSQ format
• 7 bands, B1, …, B7 (Landsat-7 has 8), and ~40,000,000 8-bit data values.
• In this case, the bSQ format will consist of 56 separate files, B11, …, B78, each containing ~40,000,000 bits.
Peano Count Tree (PC-tree)
P-trees represent spatial data in a bit-by-bit, recursive, quadrant-by-quadrant arrangement.
P-trees are lossless representations of the original data.
P-trees are compressed structures.
An example of a P-tree
Key terms: Peano or Z-ordering; pure (pure-1/pure-0) quadrant; root count; level; fan-out; QID (quadrant ID).

A 64-bit bSQ file (1111110011111000111111001111111011111111111111111111111101111111), arranged as a spatial dataset (2-D raster order):

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

Its PC-tree (fan-out 4; the root count 55 is the number of 1-bits in the whole file, and each child holds the 1-bit count of one quadrant in Peano order; pure quadrants are not expanded further):

                          55
           ____________/ / \ \___________
          /        ___ /     \___        \
         /        /              \        \
       16      ____8__          _15__      16
              /  / |  \        / | \ \
             3  0  4   1      4  4  3  4
           //|\       //|\        //|\
           1110       0010        1101
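The quadrant-by-quadrant construction can be sketched directly. The following is a minimal illustration in Python (our sketch, not the authors' implementation) that builds a PC-tree from a 2^n × 2^n bit array, recursing in Peano order and stopping at pure quadrants:

    def build_pctree(bits, top=0, left=0, size=None):
        """bits: 2^n x 2^n list of 0/1 rows. Returns (count, children),
        where children is None for a pure or single-bit quadrant."""
        if size is None:
            size = len(bits)
        count = sum(bits[r][c]
                    for r in range(top, top + size)
                    for c in range(left, left + size))
        if count in (0, size * size) or size == 1:
            return (count, None)          # pure-0, pure-1, or single bit
        half = size // 2
        children = [
            build_pctree(bits, top,        left,        half),  # quadrant 0
            build_pctree(bits, top,        left + half, half),  # quadrant 1
            build_pctree(bits, top + half, left,        half),  # quadrant 2
            build_pctree(bits, top + half, left + half, half),  # quadrant 3
        ]
        return (count, children)

    rows = [[1,1,1,1,1,1,0,0],
            [1,1,1,1,1,0,0,0],
            [1,1,1,1,1,1,0,0],
            [1,1,1,1,1,1,1,0],
            [1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1],
            [0,1,1,1,1,1,1,1]]
    tree = build_pctree(rows)
    print(tree[0])                        # 55
    print([c[0] for c in tree[1]])        # [16, 8, 15, 16]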
An example of a P-tree (Cont.)
QID (Quadrant ID): at each level, the four quadrants are numbered 0, 1, 2, 3 in Peano (Z) order.
Example: the pixel at (7, 1) has binary coordinates (111, 001). Interleaving the row and column bits level by level gives QID 10.10.11, i.e., 2.2.3.
P-tree Algebra
AND
OR
Complement
Other (XOR, etc.)
PC-tree:
                          55
           ____________/ / \ \___________
          /        ___ /     \___        \
         /        /              \        \
       16      ____8__          _15__      16
              /  / |  \        / | \ \
             3  0  4   1      4  4  3  4
           //|\       //|\        //|\
           1110       0010        1101

Its complement (counts 0's, not 1's):
                           9
           ____________/ / \ \___________
          /        ___ /     \___        \
         /        /              \        \
        0      ____8__          __1__      0
              /  / |  \        / | \ \
             1  4  0   3      0  0  1  0
           //|\       //|\        //|\
           0001       1101        0010
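The algebra is easy to sketch over the (count, children) nodes produced by the construction sketch earlier (again our illustration, not the authors' code). Pure quadrants are what make ANDing fast: whole subtrees are resolved without being traversed.

    def ptree_and(a, b, size):
        """AND of two P-trees over the same size x size quadrant."""
        if a[0] == size * size:      # a is pure-1: result is b
            return b
        if b[0] == size * size:      # b is pure-1: result is a
            return a
        if a[0] == 0 or b[0] == 0:   # either pure-0: result is pure-0
            return (0, None)
        half = size // 2
        kids = [ptree_and(x, y, half) for x, y in zip(a[1], b[1])]
        return (sum(k[0] for k in kids), kids)

    def ptree_not(a, size):
        """Complement: counts 0's instead of 1's."""
        count, children = a
        if children is None:
            return (size * size - count, None)
        half = size // 2
        return (size * size - count, [ptree_not(k, half) for k in children])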
Basic, Value, and Tuple P-trees
Basic P-trees
P11, P12, …, P18, P21, …, P28, …, P71, …, P78
Value or Interval P-trees
P1,5 = P1,101 = P11 AND P12’ AND P13
Tuple P-trees
P(5,2,7) = P(101,010,111) = P1,101 ^ P2,010 ^ P3,111
= P11 ^ P12’ ^ P13 ^ P21’ ^ P22 ^ P23’ ^ P31 ^ P32 ^ P33
Notational alternatives: P1,5 = P1,101 = P(101, , )
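Value and tuple P-trees then come straight out of the algebra: AND the basic P-trees of a band, complementing those whose bit position holds a 0. A small sketch reusing ptree_and and ptree_not from above (the helper name value_ptree is hypothetical):

    def value_ptree(basic, value_bits, size):
        """basic: basic P-trees for one band, most significant bit first.
        value_bits: e.g. '101' for P1,101 = P11 AND P12' AND P13."""
        result = (size * size, None)                      # pure-1 identity
        for p, b in zip(basic, value_bits):
            term = p if b == '1' else ptree_not(p, size)  # complement 0-bits
            result = ptree_and(result, term, size)
        return result

    # e.g. p1_5 = value_ptree([p11, p12, p13], '101', 8)
    # A tuple P-tree is the AND of one value P-tree per band.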
Clustering Methods
A categorization of major clustering methods
Partitioning methods
• K-means, K-medoids, …
Hierarchical methods
• Agglomerative, divisive, …
Density-based methods
Grid-based methods
Model-based methods
The K-Means Clustering Method
Given k, the k-means algorithm is implemented as follows:
1. Partition the objects into k nonempty subsets.
2. Compute the seed points (centroids, or mean points) of the clusters of the current partition.
3. Assign each object to the cluster with the nearest seed point.
4. Repeat steps 2-3 until some stopping condition is satisfied.
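For reference, a compact sketch of the standard k-means loop just described (generic textbook k-means, not P-tree-specific; assumes the objects are numeric tuples):

    import random

    def kmeans(points, k, max_iters=100):
        centroids = random.sample(points, k)          # initial seed points
        for _ in range(max_iters):
            clusters = [[] for _ in range(k)]
            for p in points:                          # nearest seed point
                i = min(range(k),
                        key=lambda j: sum((a - b) ** 2
                                          for a, b in zip(p, centroids[j])))
                clusters[i].append(p)
            new = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl
                   else centroids[i]                  # keep empty cluster's seed
                   for i, cl in enumerate(clusters)]
            if new == centroids:                      # stopping condition
                break
            centroids = new
        return centroids, clusters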
The K-Means Clustering Method
Strength
– Relatively efficient: O(nkt), where
• n = number of objects,
• k = number of clusters,
• t = number of iterations,
• and normally k, t << n.
Weakness
– Requires a metric so that the mean is defined.
– Needs k, the number of clusters, to be specified in advance.
– Sensitive to noisy data and outliers, since a small number of such data can substantially influence the mean value.
The K-Medoids Clustering Method
Find representative objects, called medoids, in clusters
• Often a “middle-ish” or “median” object.
PAM (Partitioning Around Medoids, 1987)
• Pick k medoids; check all (medoid, non-medoid) pairs for an improved clustering; if one is found, replace the medoid. Repeat until some stopping condition.
• PAM is effective for small data sets but does not scale well to large data sets.
CLARA (Clustering LARge Applications) (Kaufmann & Rousseeuw, 1990)
• Draw many sample sets; apply PAM to each; return the best clustering.
CLARANS (Clustering Large Applications based upon RANdomized Search) (Ng & Han, 1994)
• Similar to CLARA, except a graph is used to guide replacements.
PAM (Partitioning Around Medoids)
Use real objects to represent the clusters:
1. Select k representative objects arbitrarily.
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCih.
3. If TCih < 0, replace i with h; then assign each non-selected object to the most similar representative object.
4. Repeat steps 2-3 until there is no change.
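A direct, if naive, rendering of these four steps (our sketch; d is any user-supplied distance function, and objects are assumed hashable, e.g. tuples):

    def total_cost(objects, medoids, d):
        """Sum of each object's distance to its nearest medoid."""
        return sum(min(d(o, m) for m in medoids) for o in objects)

    def pam(objects, k, d):
        medoids = objects[:k]              # step 1: arbitrary initial medoids
        changed = True
        while changed:                     # step 4: repeat until no change
            changed = False
            for i in list(medoids):        # step 2: all (medoid, non-medoid)
                for h in objects:
                    if h in medoids:
                        continue
                    trial = [h if m == i else m for m in medoids]
                    tc_ih = (total_cost(objects, trial, d)
                             - total_cost(objects, medoids, d))
                    if tc_ih < 0:          # step 3: swap improves clustering
                        medoids = trial
                        changed = True
        # assign each object to the most similar representative object
        return {m: [o for o in objects
                    if min(medoids, key=lambda x: d(o, x)) == m]
                for m in medoids}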
CLARA (Clustering Large Applications)
It draws multiple samples of the data set, applies PAM to each sample, and gives the best clustering as the output.
Strength: deals with larger data sets than PAM.
Weakness:
– Efficiency depends on the sample size.
– A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.
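CLARA is then a thin loop over PAM. A sketch reusing pam and total_cost from the PAM sketch above (the 40 + 2k sample size is the one suggested by Kaufman & Rousseeuw):

    import random

    def clara(objects, k, d, num_samples=5, sample_size=None):
        if sample_size is None:
            sample_size = 40 + 2 * k           # suggested sample size
        best, best_cost = None, float('inf')
        for _ in range(num_samples):
            sample = random.sample(objects, min(sample_size, len(objects)))
            medoids = list(pam(sample, k, d))  # medoids found on this sample
            cost = total_cost(objects, medoids, d)  # scored on the full data
            if cost < best_cost:
                best, best_cost = medoids, cost
        return best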
CLARANS (A Clustering Algorithm based on Randomized Search)
CLARANS draws a sample of neighbors dynamically.
The clustering process can be presented as a search of a graph in which every node is a potential solution, that is, a set of k medoids.
If a local optimum is found, CLARANS starts from a new randomly selected node in search of a new local optimum.
It is more efficient and scalable than both PAM and CLARA.
Our Approach
Represent the original data set as interval P-trees using the higher-order-bits concept hierarchy (interval = value at a higher level in the hierarchy).
These P-trees can be viewed as groups of very similar data elements.
Prune out outliers by disregarding sparse groups:
Input: total number of objects (N), all interval P-trees, pruning criteria (e.g., a root count threshold and an outlier (ol) percentage threshold t)
Output: interval P-trees after pruning
(1) Choose the interval P-tree with the smallest root count (Pv).
(2) Apply the pruning criteria (e.g., RC(Pv) < threshold and (ol := ol + RC(Pv))/N < t); if they hold, remove Pv and repeat, until the pruning criteria fail.
Find clusters by
– traversing the P-tree levels until there are k, or
– using PAM, where each interval P-tree is one object (see the sketch below).
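A sketch of the pruning loop just described, using the (count, children) node representation from the earlier P-tree sketches (illustrative names; RC(Pv) is simply node[0] here):

    def prune_outliers(interval_ptrees, n_objects, rc_threshold, t):
        """interval_ptrees: list of (count, children) nodes; returns survivors."""
        survivors = list(interval_ptrees)
        ol = 0                                        # running outlier count
        while survivors:
            pv = min(survivors, key=lambda p: p[0])   # (1) smallest root count
            rc = pv[0]
            # (2) pruning criteria: sparse group, and outlier fraction stays < t
            if rc < rc_threshold and (ol + rc) / n_objects < t:
                ol += rc
                survivors.remove(pv)
            else:
                break                                 # criteria fail: stop
        return survivors

    # The survivors are then clustered, e.g. with PAM, each interval
    # P-tree standing in for one object.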
Example
BSQ:
0,0 | 0011 | 0111 | 1000 | 1011
0,1 | 0011 | 0011 | 1000 | 1111
0,2 | 0111 | 0011 | 0100 | 1011
0,3 | 0111 | 0010 | 0101 | 1011
1,0 | 0011 | 0111 | 1000 | 1011
1,1 | 0011 | 0011 | 1000 | 1011
1,2 | 0111 | 0011 | 0100 | 1011
1,3 | 0111 | 0010 | 0101 | 1011
2,0 | 0010 | 1011 | 1000 | 1111
2,1 | 0010 | 1011 | 1000 | 1111
2,2 | 1010 | 1010 | 0100 | 1011
2,3 | 1111 | 1010 | 0100 | 1011
3,0 | 0010 | 1011 | 1000 | 1111
3,1 | 1010 | 1011 | 1000 | 1111
3,2 | 1111 | 1010 | 0100 | 1011
3,3 | 1111 | 1010 | 0100 | 1011
bSQ (the first four files, B11-B14, each shown as a 4×4 grid in raster order):
B11    B12    B13    B14
0000   0011   1111   1111
0000   0011   1111   1111
0011   0001   1111   0001
0111   0011   1111   0011
The remaining files B21, …, B24, B31, …, B34, B41, …, B44 are obtained in the same way from bands 2-4.
Value P-trees:
P1,0000 P1,0100 P1,1000 P1,1100 P1,0010 P1,0110 P1,1010 P1,1110
P1,0001 P1,0101 P1,1001 P1,1101 P1,0011 P1,0111 P1,1011 P1,1111
P2,0000 P2,0100 P2,1000 P2,1100 P2,0010 P2,0110 P2,1010 P2,1110
P2,0001 P2,0101 P2,1001 P2,1101 P2,0011 P2,0111 P2,1011 P2,1111
P3,0000 P3,0100 P3,1000 P3,1100 P3,0010 P3,0110 P3,1010 P3,1110
P3,0001 P3,0101 P3,1001 P3,1101 P3,0011 P3,0111 P3,1011 P3,1111
P4,0000 P4,0100 P4,1000 P4,1100 P4,0010 P4,0110 P4,1010 P4,1110
P4,0001 P4,0101 P4,1001 P4,1101 P4,0011 P4,0111 P4,1011 P4,1111
Basic P-trees:
P11, P12, P13, P14, P21, P22, P23, P24,
P31, P32, P33, P34, P41, P42, P43, P44
Their complements:
P11’, P12’, P13’, P14’, P21’, P22’, P23’, P24’,
P31’, P32’, P33’, P34’, P41’, P42’, P43’, P44’
Tuple P-trees (only the nine with non-zero root counts; shown as root count | level-1 quadrant counts | leaf bits):
P(0010,1011,1000,1111)   3 | 0 0 3 0 | 1110
P(1010,1010,0100,1011)   1 | 0 0 0 1 | 1000
P(1010,1011,1000,1111)   1 | 0 0 1 0 | 0001
P(0011,0011,1000,1011)   1 | 1 0 0 0 | 0001
P(0011,0011,1000,1111)   1 | 1 0 0 0 | 0100
P(0011,0111,1000,1011)   2 | 2 0 0 0 | 1010
P(0111,0010,0101,1011)   2 | 0 2 0 0 | 0101
P(0111,0011,0100,1011)   2 | 0 2 0 0 | 1010
P(1111,1010,0100,1011)   3 | 0 0 0 3 | 0111
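Since a tuple P-tree's root count is just the number of pixels carrying that exact combination of band values, the root counts above can be checked directly against the BSQ listing:

    from collections import Counter

    # The 16 pixel tuples from the BSQ listing, in raster order.
    pixels = [
        ("0011","0111","1000","1011"), ("0011","0011","1000","1111"),
        ("0111","0011","0100","1011"), ("0111","0010","0101","1011"),
        ("0011","0111","1000","1011"), ("0011","0011","1000","1011"),
        ("0111","0011","0100","1011"), ("0111","0010","0101","1011"),
        ("0010","1011","1000","1111"), ("0010","1011","1000","1111"),
        ("1010","1010","0100","1011"), ("1111","1010","0100","1011"),
        ("0010","1011","1000","1111"), ("1010","1011","1000","1111"),
        ("1111","1010","0100","1011"), ("1111","1010","0100","1011"),
    ]
    for tup, rc in sorted(Counter(pixels).items()):
        print("P(%s) root count = %d" % (",".join(tup), rc))
    # Nine distinct tuples, with the same root counts as listed above,
    # e.g. P(0010,1011,1000,1111) -> 3 and P(1111,1010,0100,1011) -> 3.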
P-tree Performance
[Figure: Time Required vs. Bit Number: average time required to perform a multi-operand ANDing operation on a TM file (40 million pixels), for bit numbers 1-8.]
Conclusion & Discussion
PAM is not efficient for medium and large data sets.
CLARA and CLARANS draw samples from the original data randomly.
Our algorithm (using P-trees, which are lossless, data-mining-ready data structures) does not draw samples; it groups the data first.
– Each interval P-tree can be viewed as a group.
– PAM then only needs to deal with the P-trees, and the number of P-trees is much smaller than the number of data points CLARA and CLARANS must deal with.
– Because P-tree ANDing is very fast, our algorithm is very fast.