Clustering Analysis of Spatial Data Using Peano Count Trees
Qiang Ding, William Perrizo
Department of Computer Science, North Dakota State University, USA
(Ptree technology is patented by NDSU)
Overview
Introduction
Data Structures
Clustering Algorithms based on Partitioning
Our Approach
Example
Conclusion & Discussion
Introduction
Existing methods are not always suitable for cluster analysis because of the size of the datasets.
The Peano Count Tree (PC-tree) provides a lossless, compressed, clustering-ready representation of a spatial dataset.
We introduce an efficient clustering method based on this structure.
Background on Spatial Data
Band – attribute
Pixel – transaction (tuple)
Value – 0~255 (one byte)
Different kinds of images have different numbers of bands
– TM4/5: 7 bands (B, G, R, NIR, MIR, TIR, MIR2)
– TM7: 8 bands (B, G, R, NIR, MIR, TIR, MIR2, PC)
– TIFF: 3 bands (B, G, R)
– Ground data: individual bands (Yield, Moisture, Nitrate, Temperature)
Spatial Data Formats
Existing formats
– BSQ (Band Sequential)
– BIL (Band Interleaved by Line)
– BIP (Band Interleaved by Pixel)
New format
– bSQ (bit Sequential)
Spatial Data Formats (Cont.)

Example: a 2×2 image with two bands.

BAND-1:
254 127   (1111 1110) (0111 1111)
 14 193   (0000 1110) (1100 0001)

BAND-2:
 37 240   (0010 0101) (1111 0000)
200  19   (1100 1000) (0001 0011)

BSQ format (2 files)
Band 1: 254 127 14 193
Band 2: 37 240 200 19

BIL format (1 file)
254 127 37 240 14 193 200 19

BIP format (1 file)
254 37 127 240 14 200 193 19

bSQ format (16 files, one per bit position; one row per pixel)
B11 B12 B13 B14 B15 B16 B17 B18   B21 B22 B23 B24 B25 B26 B27 B28
 1   1   1   1   1   1   1   0     0   0   1   0   0   1   0   1
 0   1   1   1   1   1   1   1     1   1   1   1   0   0   0   0
 0   0   0   0   1   1   1   0     1   1   0   0   1   0   0   0
 1   1   0   0   0   0   0   1     0   0   0   1   0   0   1   1
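To make the four orderings concrete, here is a minimal sketch (illustrative code, not from the paper; the helper name bit_file is ours) that derives the BIL, BIP, and bSQ layouts from the 2×2, two-band example above:

    band1 = [254, 127, 14, 193]   # band 1, raster order (2x2 image)
    band2 = [37, 240, 200, 19]    # band 2

    # BSQ: one file (one sequence) per band
    bsq = [band1, band2]

    # BIL: interleave bands line by line (2 pixels per line here)
    width = 2
    bil = []
    for row in range(2):
        for band in (band1, band2):
            bil.extend(band[row * width:(row + 1) * width])

    # BIP: interleave bands pixel by pixel
    bip = [v for pair in zip(band1, band2) for v in pair]

    # bSQ: one file per bit position; Bij holds bit j (j=1 is the MSB) of band i
    def bit_file(band, j):
        return [(v >> (8 - j)) & 1 for v in band]

    print(bil)                 # [254, 127, 37, 240, 14, 193, 200, 19]
    print(bip)                 # [254, 37, 127, 240, 14, 200, 193, 19]
    print(bit_file(band1, 1))  # B11 = [1, 0, 0, 1]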
bSQ Format
Reasons for using the bSQ format
– Different bits contribute to the value differently.
• The bSQ format facilitates the representation of a precision hierarchy (from 1-bit up to 8-bit precision).
– The bSQ format facilitates the creation of efficient structures
• P-trees
• P-tree algebra
Example
– A Landsat Thematic Mapper (TM) satellite image is in BSQ format
• 7 bands, B1, …, B7 (Landsat-7 has 8), and ~40,000,000 8-bit data values.
• In this case, the bSQ format will consist of 56 separate files, B11, …, B78, each containing ~40,000,000 bits.
Peano Count Tree (PC-tree)
P-trees represent spatial data in a bit-by-bit, recursive, quadrant-by-quadrant arrangement.
P-trees are lossless representations of the original data.
P-trees are compressed structures.
An example of a P-tree
Key terms: Peano or Z-ordering; pure (pure-1/pure-0) quadrant; root count; level; fan-out; QID (quadrant ID).

A 64-bit bSQ file (1111110011111000111111001111111011111111111111111111111101111111), arranged as a spatial dataset (2-D raster order):

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

Its PC-tree (fan-out 4; the root count 55 is the number of 1-bits in the whole file, and each child holds the 1-bit count of one quadrant in Peano order; pure quadrants are not expanded further):

                          55
           ____________/ / \ \___________
          /        ___ /     \___        \
         /        /              \        \
       16      ____8__          _15__      16
              /  / |  \        / | \ \
             3  0  4   1      4  4  3  4
           //|\       //|\        //|\
           1110       0010        1101
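The quadrant-by-quadrant construction can be sketched directly. The following is a minimal illustration in Python (our sketch, not the authors' implementation) that builds a PC-tree from a 2^n × 2^n bit array, recursing in Peano order and stopping at pure quadrants:

    def build_pctree(bits, top=0, left=0, size=None):
        """bits: 2^n x 2^n list of 0/1 rows. Returns (count, children),
        where children is None for a pure or single-bit quadrant."""
        if size is None:
            size = len(bits)
        count = sum(bits[r][c]
                    for r in range(top, top + size)
                    for c in range(left, left + size))
        if count in (0, size * size) or size == 1:
            return (count, None)          # pure-0, pure-1, or single bit
        half = size // 2
        children = [
            build_pctree(bits, top,        left,        half),  # quadrant 0
            build_pctree(bits, top,        left + half, half),  # quadrant 1
            build_pctree(bits, top + half, left,        half),  # quadrant 2
            build_pctree(bits, top + half, left + half, half),  # quadrant 3
        ]
        return (count, children)

    rows = [[1,1,1,1,1,1,0,0],
            [1,1,1,1,1,0,0,0],
            [1,1,1,1,1,1,0,0],
            [1,1,1,1,1,1,1,0],
            [1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1],
            [1,1,1,1,1,1,1,1],
            [0,1,1,1,1,1,1,1]]
    tree = build_pctree(rows)
    print(tree[0])                        # 55
    print([c[0] for c in tree[1]])        # [16, 8, 15, 16]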
An example of a P-tree (Cont.)
QID (Quadrant ID): at each level, the four quadrants are numbered 0, 1, 2, 3 in Peano (Z) order.
Example: the pixel at (7, 1) has binary coordinates (111, 001). Interleaving the row and column bits level by level gives QID 10.10.11, i.e., 2.2.3.
P-tree Algebra
AND
OR
Complement
Other (XOR, etc.)
PC-tree:
                          55
           ____________/ / \ \___________
          /        ___ /     \___        \
         /        /              \        \
       16      ____8__          _15__      16
              /  / |  \        / | \ \
             3  0  4   1      4  4  3  4
           //|\       //|\        //|\
           1110       0010        1101

Its complement (counts 0's, not 1's):
                           9
           ____________/ / \ \___________
          /        ___ /     \___        \
         /        /              \        \
        0      ____8__          __1__      0
              /  / |  \        / | \ \
             1  4  0   3      0  0  1  0
           //|\       //|\        //|\
           0001       1101        0010
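The algebra is easy to sketch over the (count, children) nodes produced by the construction sketch earlier (again our illustration, not the authors' code). Pure quadrants are what make ANDing fast: whole subtrees are resolved without being traversed.

    def ptree_and(a, b, size):
        """AND of two P-trees over the same size x size quadrant."""
        if a[0] == size * size:      # a is pure-1: result is b
            return b
        if b[0] == size * size:      # b is pure-1: result is a
            return a
        if a[0] == 0 or b[0] == 0:   # either pure-0: result is pure-0
            return (0, None)
        half = size // 2
        kids = [ptree_and(x, y, half) for x, y in zip(a[1], b[1])]
        return (sum(k[0] for k in kids), kids)

    def ptree_not(a, size):
        """Complement: counts 0's instead of 1's."""
        count, children = a
        if children is None:
            return (size * size - count, None)
        half = size // 2
        return (size * size - count, [ptree_not(k, half) for k in children])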
Basic, Value, and Tuple P-trees
Basic P-trees
P11, P12, …, P18, P21, …, P28, …, P71, …, P78
Value or Interval P-trees
P1,5 = P1,101 = P11 AND P12’ AND P13
Tuple P-trees
P(5,2,7) = P(101,010,111) = P1,101 ^ P2,010 ^ P3,111
= P11 ^ P12’ ^ P13 ^ P21’ ^ P22 ^ P23’ ^ P31 ^ P32 ^ P33
Notational alternatives: P1,5 = P1,101 = P(101, , )
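Value and tuple P-trees then come straight out of the algebra: AND the basic P-trees of a band, complementing those whose bit position holds a 0. A small sketch reusing ptree_and and ptree_not from above (the helper name value_ptree is hypothetical):

    def value_ptree(basic, value_bits, size):
        """basic: basic P-trees for one band, most significant bit first.
        value_bits: e.g. '101' for P1,101 = P11 AND P12' AND P13."""
        result = (size * size, None)                      # pure-1 identity
        for p, b in zip(basic, value_bits):
            term = p if b == '1' else ptree_not(p, size)  # complement 0-bits
            result = ptree_and(result, term, size)
        return result

    # e.g. p1_5 = value_ptree([p11, p12, p13], '101', 8)
    # A tuple P-tree is the AND of one value P-tree per band.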
Clustering Methods
A categorization of major clustering methods
Partitioning methods
• K-means, K-medoids, …
Hierarchical methods
• Agglomerative, divisive, …
Density-based methods
Grid-based methods
Model-based methods
The K-Means Clustering Method
Given k, the k-means algorithm is implemented as follows:
1. Partition the objects into k nonempty subsets.
2. Compute the seed points (centroids, or mean points) of the clusters of the current partition.
3. Assign each object to the cluster with the nearest seed point.
4. Repeat steps 2-3 until some stopping condition is satisfied.
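For reference, a compact sketch of the standard k-means loop just described (generic textbook k-means, not P-tree-specific; assumes the objects are numeric tuples):

    import random

    def kmeans(points, k, max_iters=100):
        centroids = random.sample(points, k)          # initial seed points
        for _ in range(max_iters):
            clusters = [[] for _ in range(k)]
            for p in points:                          # nearest seed point
                i = min(range(k),
                        key=lambda j: sum((a - b) ** 2
                                          for a, b in zip(p, centroids[j])))
                clusters[i].append(p)
            new = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl
                   else centroids[i]                  # keep empty cluster's seed
                   for i, cl in enumerate(clusters)]
            if new == centroids:                      # stopping condition
                break
            centroids = new
        return centroids, clusters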
The K-Means Clustering Method
Strength
– Relatively efficient: O(nkt), where
• n = number of objects,
• k = number of clusters,
• t = number of iterations,
• and normally k, t << n.
Weakness
– Requires a metric so that the mean is defined.
– Needs k, the number of clusters, to be specified in advance.
– Sensitive to noisy data and outliers, since a small number of such data can substantially influence the mean value.
The K-Medoids Clustering Method
Find representative objects, called medoids, in clusters
• Often a “middle-ish” or “median” object.
PAM (Partitioning Around Medoids, 1987)
• Pick k medoids; check all (medoid, non-medoid) pairs for an improved clustering; if one is found, replace the medoid. Repeat until some stopping condition.
• PAM is effective for small data sets but does not scale well to large data sets.
CLARA (Clustering LARge Applications) (Kaufmann & Rousseeuw, 1990)
• Draw many sample sets; apply PAM to each; return the best clustering.
CLARANS (Clustering Large Applications based upon RANdomized Search) (Ng & Han, 1994)
• Similar to CLARA, except a graph is used to guide replacements.
PAM (Partitioning Around Medoids)
Use real objects to represent the clusters:
1. Select k representative objects arbitrarily.
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCih.
3. If TCih < 0, replace i with h; then assign each non-selected object to the most similar representative object.
4. Repeat steps 2-3 until there is no change.
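A direct, if naive, rendering of these four steps (our sketch; d is any user-supplied distance function, and objects are assumed hashable, e.g. tuples):

    def total_cost(objects, medoids, d):
        """Sum of each object's distance to its nearest medoid."""
        return sum(min(d(o, m) for m in medoids) for o in objects)

    def pam(objects, k, d):
        medoids = objects[:k]              # step 1: arbitrary initial medoids
        changed = True
        while changed:                     # step 4: repeat until no change
            changed = False
            for i in list(medoids):        # step 2: all (medoid, non-medoid)
                for h in objects:
                    if h in medoids:
                        continue
                    trial = [h if m == i else m for m in medoids]
                    tc_ih = (total_cost(objects, trial, d)
                             - total_cost(objects, medoids, d))
                    if tc_ih < 0:          # step 3: swap improves clustering
                        medoids = trial
                        changed = True
        # assign each object to the most similar representative object
        return {m: [o for o in objects
                    if min(medoids, key=lambda x: d(o, x)) == m]
                for m in medoids}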
CLARA (Clustering Large Applications)
It draws multiple samples of the data set, applies PAM to each sample, and gives the best clustering as the output.
Strength: deals with larger data sets than PAM.
Weakness:
– Efficiency depends on the sample size.
– A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.
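CLARA is then a thin loop over PAM. A sketch reusing pam and total_cost from the PAM sketch above (the 40 + 2k sample size is the one suggested by Kaufman & Rousseeuw):

    import random

    def clara(objects, k, d, num_samples=5, sample_size=None):
        if sample_size is None:
            sample_size = 40 + 2 * k           # suggested sample size
        best, best_cost = None, float('inf')
        for _ in range(num_samples):
            sample = random.sample(objects, min(sample_size, len(objects)))
            medoids = list(pam(sample, k, d))  # medoids found on this sample
            cost = total_cost(objects, medoids, d)  # scored on the full data
            if cost < best_cost:
                best, best_cost = medoids, cost
        return best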
CLARANS (A Clustering Algorithm based on Randomized Search)
CLARANS draws a sample of neighbors dynamically.
The clustering process can be presented as a search of a graph in which every node is a potential solution, that is, a set of k medoids.
If a local optimum is found, CLARANS starts from a new randomly selected node in search of a new local optimum.
It is more efficient and scalable than both PAM and CLARA.
Our Approach
Represent the original data set as interval P-trees using the higher-order-bits concept hierarchy (interval = value at a higher level in the hierarchy).
These P-trees can be viewed as groups of very similar data elements.
Prune out outliers by disregarding sparse groups:
Input: total number of objects (N), all interval P-trees, pruning criteria (e.g., a root count threshold and an outlier (ol) percentage threshold t)
Output: interval P-trees after pruning
(1) Choose the interval P-tree with the smallest root count (Pv).
(2) Apply the pruning criteria (e.g., RC(Pv) < threshold and (ol := ol + RC(Pv))/N < t); if they hold, remove Pv and repeat, until the pruning criteria fail.
Find clusters by
– traversing the P-tree levels until there are k, or
– using PAM, where each interval P-tree is one object (see the sketch below).
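A sketch of the pruning loop just described, using the (count, children) node representation from the earlier P-tree sketches (illustrative names; RC(Pv) is simply node[0] here):

    def prune_outliers(interval_ptrees, n_objects, rc_threshold, t):
        """interval_ptrees: list of (count, children) nodes; returns survivors."""
        survivors = list(interval_ptrees)
        ol = 0                                        # running outlier count
        while survivors:
            pv = min(survivors, key=lambda p: p[0])   # (1) smallest root count
            rc = pv[0]
            # (2) pruning criteria: sparse group, and outlier fraction stays < t
            if rc < rc_threshold and (ol + rc) / n_objects < t:
                ol += rc
                survivors.remove(pv)
            else:
                break                                 # criteria fail: stop
        return survivors

    # The survivors are then clustered, e.g. with PAM, each interval
    # P-tree standing in for one object.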
Example
BSQ:
0,0 | 0011 | 0111 | 1000 | 1011
0,1 | 0011 | 0011 | 1000 | 1111
0,2 | 0111 | 0011 | 0100 | 1011
0,3 | 0111 | 0010 | 0101 | 1011
1,0 | 0011 | 0111 | 1000 | 1011
1,1 | 0011 | 0011 | 1000 | 1011
1,2 | 0111 | 0011 | 0100 | 1011
1,3 | 0111 | 0010 | 0101 | 1011
2,0 | 0010 | 1011 | 1000 | 1111
2,1 | 0010 | 1011 | 1000 | 1111
2,2 | 1010 | 1010 | 0100 | 1011
2,3 | 1111 | 1010 | 0100 | 1011
3,0 | 0010 | 1011 | 1000 | 1111
3,1 | 1010 | 1011 | 1000 | 1111
3,2 | 1111 | 1010 | 0100 | 1011
3,3 | 1111 | 1010 | 0100 | 1011
bSQ (the first four files, B11-B14, each shown as a 4×4 grid in raster order):
B11    B12    B13    B14
0000   0011   1111   1111
0000   0011   1111   1111
0011   0001   1111   0001
0111   0011   1111   0011
The remaining files B21, …, B24, B31, …, B34, B41, …, B44 are obtained in the same way from bands 2-4.
Value P-trees:
P1,0000 P1,0100 P1,1000 P1,1100 P1,0010 P1,0110 P1,1010 P1,1110
P1,0001 P1,0101 P1,1001 P1,1101 P1,0011 P1,0111 P1,1011 P1,1111
P2,0000 P2,0100 P2,1000 P2,1100 P2,0010 P2,0110 P2,1010 P2,1110
P2,0001 P2,0101 P2,1001 P2,1101 P2,0011 P2,0111 P2,1011 P2,1111
P3,0000 P3,0100 P3,1000 P3,1100 P3,0010 P3,0110 P3,1010 P3,1110
P3,0001 P3,0101 P3,1001 P3,1101 P3,0011 P3,0111 P3,1011 P3,1111
P4,0000 P4,0100 P4,1000 P4,1100 P4,0010 P4,0110 P4,1010 P4,1110
P4,0001 P4,0101 P4,1001 P4,1101 P4,0011 P4,0111 P4,1011 P4,1111
Basic P-trees:
P11, P12, P13, P14, P21, P22, P23, P24,
P31, P32, P33, P34, P41, P42, P43, P44
Their complements:
P11’, P12’, P13’, P14’, P21’, P22’, P23’, P24’,
P31’, P32’, P33’, P34’, P41’, P42’, P43’, P44’
Tuple P-trees (only the nine with non-zero root counts; shown as root count | level-1 quadrant counts | leaf bits):
P(0010,1011,1000,1111)   3 | 0 0 3 0 | 1110
P(1010,1010,0100,1011)   1 | 0 0 0 1 | 1000
P(1010,1011,1000,1111)   1 | 0 0 1 0 | 0001
P(0011,0011,1000,1011)   1 | 1 0 0 0 | 0001
P(0011,0011,1000,1111)   1 | 1 0 0 0 | 0100
P(0011,0111,1000,1011)   2 | 2 0 0 0 | 1010
P(0111,0010,0101,1011)   2 | 0 2 0 0 | 0101
P(0111,0011,0100,1011)   2 | 0 2 0 0 | 1010
P(1111,1010,0100,1011)   3 | 0 0 0 3 | 0111
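Since a tuple P-tree's root count is just the number of pixels carrying that exact combination of band values, the root counts above can be checked directly against the BSQ listing:

    from collections import Counter

    # The 16 pixel tuples from the BSQ listing, in raster order.
    pixels = [
        ("0011","0111","1000","1011"), ("0011","0011","1000","1111"),
        ("0111","0011","0100","1011"), ("0111","0010","0101","1011"),
        ("0011","0111","1000","1011"), ("0011","0011","1000","1011"),
        ("0111","0011","0100","1011"), ("0111","0010","0101","1011"),
        ("0010","1011","1000","1111"), ("0010","1011","1000","1111"),
        ("1010","1010","0100","1011"), ("1111","1010","0100","1011"),
        ("0010","1011","1000","1111"), ("1010","1011","1000","1111"),
        ("1111","1010","0100","1011"), ("1111","1010","0100","1011"),
    ]
    for tup, rc in sorted(Counter(pixels).items()):
        print("P(%s) root count = %d" % (",".join(tup), rc))
    # Nine distinct tuples, with the same root counts as listed above,
    # e.g. P(0010,1011,1000,1111) -> 3 and P(1111,1010,0100,1011) -> 3.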
P-tree Performance
[Figure: Time Required vs. Bit Number: average time required to perform a multi-operand ANDing operation on a TM file (40 million pixels), for bit numbers 1-8.]
Conclusion & Discussion
PAM is not efficient for medium and large data sets.
CLARA and CLARANS draw samples from the original data randomly.
Our algorithm (using P-trees, which are lossless, data-mining-ready data structures) does not draw samples; it groups the data first.
– Each interval P-tree can be viewed as a group.
– PAM then only needs to deal with the P-trees, and the number of P-trees is much smaller than the number of data points CLARA and CLARANS must deal with.
– Because P-tree ANDing is very fast, our algorithm is very fast.