Association Rule Mining on Remotely Sensed Imagery Using Peano-trees (P-trees)
Qin Ding, Qiang Ding, and William Perrizo
Computer Science Department
North Dakota State University, USA
May 2002
(P-tree technology is patent pending by NDSU)
Outline
Concepts
– Association Rule Mining
– Market Basket Data
– Remotely Sensed Imagery (RSI) data
– Peano Count Trees (P-trees)
Association rule mining on RSI data using P-trees
Performance analysis
Conclusion
Association Rule Mining
Originally proposed for market basket data.
Given
– A set of items I = {i1, i2, …, im} (e.g., items purchasable in a market)
– A set of transactions D (e.g., customers checking out; each transaction = id + itemset)
An association rule is X=>Y, where X and Y are disjoint itemsets
– X and Y are considered as events. E.g., X is the event that a transaction contains X.
– X=>Y is the event: "if t contains X, then it contains Y".
– X is called the antecedent, Y is called the consequent.
Two measures: support (% of transactions containing both X and Y) and confidence (% of those transactions containing X which also contain Y)
Given minimum thresholds, minsup and minconf,
– Find the frequent itemsets, i.e., those with support above minsup.
– Derive all rules supported by frequent itemsets, with confidence above minconf.
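The two measures can be illustrated with a minimal Python sketch. The basket contents and the function names `support` and `confidence` are made up for illustration and are not from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0


# Hypothetical market basket data: one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Rule {bread} => {milk}
rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```

Here the rule {bread} => {milk} is supported by 2 of 4 transactions, and 2 of the 3 transactions containing bread also contain milk.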
Association rule mining on RSI data
RSI data can be viewed as a relational table
– Each band (column) is an attribute (for simplicity we assume all values are bytes)
– Each pixel (row) is a transaction.
– Each interval in each band is an item.
– Row/column or longitude/latitude is the primary key
ARM task on RSI data
– To mine implicit relations among different bands, for example, relations among spectral bands and yield.
– Example Rule (NDVI): NIR[192,255] ^ RED[0,63] => Yield[128,255]
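As a rough illustration of "each interval in each band is an item", the sketch below turns one pixel into a transaction of interval items. The helper names, the pixel values, and the choice of how many high-order bits define an interval in each band are assumptions for this example only.

```python
def interval_item(band, value, bits):
    """Map an 8-bit band value to the item (band, low, high), where the
    interval is determined by the top `bits` bits of the value."""
    width = 256 >> bits              # interval width, e.g. 64 for 2 bits
    low = (value // width) * width
    return (band, low, low + width - 1)


def pixel_to_transaction(pixel, bits_per_band):
    """A pixel (dict of band -> byte value) becomes a set of interval items."""
    return {interval_item(b, v, bits_per_band[b]) for b, v in pixel.items()}


# Hypothetical pixel and precision: 2 bits for NIR and RED, 1 bit for Yield.
pixel = {"NIR": 200, "RED": 40, "Yield": 150}
bits = {"NIR": 2, "RED": 2, "Yield": 1}
transaction = pixel_to_transaction(pixel, bits)
```

This pixel contains exactly the three items of the example rule: NIR[192,255], RED[0,63] and Yield[128,255].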
Important ARM Algorithms
Apriori – stepwise algorithm
DHP (Direct Hashing and Pruning) – hash itemset counts and prune transactions
Partition – divide the database into small partitions such that each can be processed independently and efficiently in memory.
DIC (Dynamic Itemset Counting) – overlap the counting of candidate itemsets at different points during a scan.
FP-growth – uses a Frequent Pattern tree (FP-tree) to mine frequent itemsets without candidate generation.
Others…
Remotely Sensed Imagery (RSI) Data
Satellite image
– TM (Thematic Mapper) imagery (6, 7 or 8 bands)
– TM is Landsat satellite imagery covering the earth every 18 days since 1972.
– ETM+ (Landsat-7) contains 8 bands: 7 VIR bands (Blue, Green, Red, NIR, MIR, TIR, MIR2) and 1 Panchromatic band (PC).
Aerial photography
– TIFF (3 bands: Blue, Green, Red)
Ground data
– Yield, Moisture, Nitrate, Temperature, Elevation, etc.
Precision Agriculture Dataset: TIFF Image and related Bands (1320×1320)
[Figure panels: RGB, Moisture, Yield, Nitrate]
As a relation
x    y    R   G   B   Y    M   N
812  445  43  60  59  146  83  188
812  446  43  58  50  146  83  188
812  447  44  60  52  146  83  187
812  448  43  63  54  146  83  186
812  449  43  69  52  146  83  186
812  450  47  73  54  146  83  185
812  451  50  68  58  146  83  184
812  452  51  65  54  146  83  183
812  453  46  63  54  146  83  182
812  454  33  53  50  146  83  182
812  455  30  49  47  146  83  181
812  456  41  55  54  146  83  180
812  457  40  55  57  146  83  179
812  458  43  56  52  146  83  178
812  459  42  52  52  146  83  177
812  460  40  58  45  146  83  176
812  461  40  66  47  146  83  176
812  462  38  59  47  145  83  175
812  463  34  51  55  145  82  175
812  464  39  53  63  145  82  174
812  465  36  54  57  145  82  173
812  466  42  57  48  145  82  173
812  467  40  59  43  145  82  172
812  468  39  68  50  145  82  172
812  469  40  56  57  145  82  172
812  470  30  45  43  145  82  172
812  471  33  57  45  145  82  172
812  472  35  58  62  145  82  173
812  473  30  54  63  145  82  173
812  474  30  57  52  145  82  173
x: Row
y: Column
R: Red
G: Green
B: Blue
Y: Yield
M: Moisture
N: Nitrate
Spatial Data Formats
BAND-1
254 127   (1111 1110) (0111 1111)
 14 193   (0000 1110) (1100 0001)
BAND-2
 37 240   (0010 0101) (1111 0000)
200  19   (1100 1000) (0001 0011)
BSQ format (2 files)
Band 1: 254 127 14 193
Band 2: 37 240 200 19
BIL format (1 file)
254 127 37 240 14 193 200 19
BIP format (1 file)
254 37 127 240 14 200 193 19
bSQ format (16 files, one per bit position of each band; each column below is one file)
B11 B12 B13 B14 B15 B16 B17 B18 B21 B22 B23 B24 B25 B26 B27 B28
 1   1   1   1   1   1   1   0   0   0   1   0   0   1   0   1
 0   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0
 0   0   0   0   1   1   1   0   1   1   0   0   1   0   0   0
 1   1   0   0   0   0   0   1   0   0   0   1   0   0   1   1
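The four layouts can be reproduced from the two 2×2 bands with a short Python sketch. The function names are illustrative only; the bSQ files are keyed by (band number, bit position), with bit 1 taken as the most significant bit, matching the B11 ... B28 labels.

```python
band1 = [[254, 127], [14, 193]]
band2 = [[37, 240], [200, 19]]
bands = [band1, band2]


def bsq(bands):
    """Band sequential: one file per band, pixels in raster order."""
    return [[v for row in band for v in row] for band in bands]


def bil(bands):
    """Band interleaved by line: row r of band 1, row r of band 2, ..."""
    rows = len(bands[0])
    return [v for r in range(rows) for band in bands for v in band[r]]


def bip(bands):
    """Band interleaved by pixel: all band values of a pixel together."""
    rows, cols = len(bands[0]), len(bands[0][0])
    return [band[r][c] for r in range(rows) for c in range(cols) for band in bands]


def bsq_bits(bands, nbits=8):
    """Bit sequential (bSQ): one bit file per (band, bit position)."""
    files = {}
    for b, band in enumerate(bands, start=1):
        flat = [v for row in band for v in row]
        for j in range(1, nbits + 1):
            # j = 1 selects the most significant bit of each byte
            files[(b, j)] = [(v >> (nbits - j)) & 1 for v in flat]
    return files


bit_files = bsq_bits(bands)
```

Each entry of `bit_files` corresponds to one column of the bSQ table, for example `bit_files[(1, 1)]` is file B11.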
Peano Count Tree (P-tree)
P-tree represents RSI data bit-by-bit in a recursive quadrant-by-quadrant arrangement.
P-trees are a lossless compressed representation of the original data.
An example of a 2-D P-tree
Key terms: quadrant-based; pure (pure-1/pure-0) quadrant; Peano or Z-ordering; root count.
bSQ file arranged as a spatial dataset (2-D raster order):
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 1 1 1 0 0 0 0
P-tree (quadrant counts, listed level by level in Peano order):
39
16 8 15 0
3 0 4 1   4 4 3 4
1 1 1 0   0 0 1 0   1 1 0 1
bSQ file:
1111111111111111111000001111001011111111111111111111111111111111
Peano Mask Tree (PM-tree)
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 1 1 1 0 0 0 0
PM-tree (1 = pure-1 quadrant, 0 = pure-0 quadrant, m = mixed quadrant), listed level by level:
m
1 m m 0
m 0 1 m   1 1 m 1
1 1 1 0   0 0 1 0   1 1 0 1
Truth-Trees (1 if the condition is true of the quadrant, else 0)
– E.g., Pure-1 and Pure-0 Trees
– All are lossless compressed representations of the dataset
P-tree terminology: Peano or Z-ordering; pure-1/pure-0 quadrant; root count; level; fan-out; QID (quadrant ID).
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1
P-tree (listed level by level):
55
16 8 15 16
3 0 4 1   4 4 3 4
1 1 1 0   0 0 1 0   1 1 0 1
The four quadrants at each level are numbered 0 1 2 3 in Peano order.
Example QID: the pixel at (7, 1), i.e. (111, 001) in binary, has QID 10.10.11 = 2.2.3.
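The quadrant counts and the QID example can be checked with a small recursive sketch in Python. The tuple representation `(count, children)` and the function names are choices made for this illustration, not the authors' data structure, and the sketch recounts bits at every level for clarity rather than speed. It uses the 8×8 bit image with root count 39.

```python
def build_ptree(bits, r0=0, c0=0, size=None):
    """Return (count, children) for the square quadrant at (r0, c0).
    children is None for a pure quadrant (all 0s or all 1s); otherwise it
    holds the four sub-quadrants in Peano (Z) order:
    upper-left, upper-right, lower-left, lower-right."""
    if size is None:
        size = len(bits)
    count = sum(bits[r][c]
                for r in range(r0, r0 + size)
                for c in range(c0, c0 + size))
    if count == 0 or count == size * size:
        return (count, None)
    half = size // 2
    offsets = ((0, 0), (0, half), (half, 0), (half, half))
    return (count, [build_ptree(bits, r0 + dr, c0 + dc, half)
                    for dr, dc in offsets])


def qid(row, col, levels):
    """Quadrant ID of a pixel: interleave row and column bits, one
    base-4 digit per level, most significant level first."""
    digits = []
    for level in reversed(range(levels)):
        digits.append(2 * ((row >> level) & 1) + ((col >> level) & 1))
    return ".".join(str(d) for d in digits)


image = [
    "11111100",
    "11111000",
    "11111100",
    "11111110",
    "11110000",
    "11110000",
    "11110000",
    "01110000",
]
bits = [[int(ch) for ch in row] for row in image]
tree = build_ptree(bits)
level1_counts = [child[0] for child in tree[1]]
```

The root count is `tree[0]`, and `qid(7, 1, 3)` gives the QID of the pixel in row 7, column 1 of an 8×8 image.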
P-tree Operations
Each tree is listed level by level; the children of non-pure nodes appear left to right.

P-tree (counts):
55
16 8 15 16
3 0 4 1   4 4 3 4
1110 0010 1101

The same data as a PM-tree:
m
1 m m 1
m 0 1 m   1 1 m 1
1110 0010 1101

P-tree-1:
m
1 m m 1
m 0 1 m   1 1 m 1
1110 0010 1101

P-tree-2:
m
1 0 m 0
1 1 1 m
0100

AND-Result:
m
1 0 m 0
1 1 m m
1101 0100

OR-Result:
m
1 m 1 1
m 0 1 m
1110 0010

Complement (count form):
9
0 8 1 0
1 4 0 3   0 0 1 0
0001 1101 0010

Complement (mask form):
m
0 m m 0
m 1 0 m   0 0 m 0
0001 1101 0010
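The AND, OR and complement results can be reproduced with a short recursive sketch in Python. The nested-list encoding (0 = pure-0, 1 = pure-1, a list of four children = mixed node) and the function names are assumptions of this example, not the authors' implementation.

```python
def normalize(children):
    """Collapse a node whose four children are all pure-1 or all pure-0."""
    if all(c == 1 for c in children):
        return 1
    if all(c == 0 for c in children):
        return 0
    return children


def p_and(a, b):
    if a == 0 or b == 0:
        return 0          # a pure-0 operand forces a pure-0 result
    if a == 1:
        return b          # a pure-1 operand leaves the other unchanged
    if b == 1:
        return a
    return normalize([p_and(x, y) for x, y in zip(a, b)])


def p_or(a, b):
    if a == 1 or b == 1:
        return 1
    if a == 0:
        return b
    if b == 0:
        return a
    return normalize([p_or(x, y) for x, y in zip(a, b)])


def p_not(a):
    if isinstance(a, int):
        return 1 - a
    return [p_not(c) for c in a]


def root_count(tree, pixels):
    """Number of 1 bits represented by a tree covering `pixels` pixels."""
    if isinstance(tree, int):
        return tree * pixels
    return sum(root_count(c, pixels // 4) for c in tree)


# P-tree-1 and P-tree-2 from the slide (8x8 image, 64 pixels).
t1 = [1, [[1, 1, 1, 0], 0, 1, [0, 0, 1, 0]], [1, 1, [1, 1, 0, 1], 1], 1]
t2 = [1, 0, [1, 1, 1, [0, 1, 0, 0]], 0]

and_result = p_and(t1, t2)
or_result = p_or(t1, t2)
complement = p_not(t1)
```

Because a pure operand short-circuits the recursion, whole quadrants are combined without visiting their pixels, which is where the compressed representation saves work.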
P-tree ANDing Operation
Each tree is listed level by level; the children of non-pure nodes appear left to right.

PM-tree1:
m
1 m m 1
m 0 1 m   1 1 m 1
1110 0010 1101

PM-tree2:
m
1 0 m 0
1 1 1 m
0100

Result:
m
1 0 m 0
1 1 m m
1101 0100

Depth-first Pure-1 path code
PM-tree1: 0 100 101 102 12 132 20 21 220 221 223 23 3
PM-tree2: 0 20 21 22 231

PM-tree1       PM-tree2   RESULT
0              0          0
20             20         20
21             21         21
220 221 223    22         220 221 223
23             231        231
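One way to read the path-code table: a pure-1 path in one tree matches a path in the other when one is a prefix of the other, and the longer (more specific) path goes into the result. The Python sketch below implements that reading with a simple pairwise comparison; it is an illustration, not necessarily how the authors implemented the operation.

```python
def and_path_codes(codes_a, codes_b):
    """AND two lists of depth-first pure-1 path codes.
    Each code is a string of quadrant digits (0-3). Two codes overlap
    when one is a prefix of the other; the overlap is the longer code."""
    result = []
    for a in codes_a:
        for b in codes_b:
            if a.startswith(b):
                result.append(a)
            elif b.startswith(a):
                result.append(b)
    return result


tree1_codes = "0 100 101 102 12 132 20 21 220 221 223 23 3".split()
tree2_codes = "0 20 21 22 231".split()
and_codes = and_path_codes(tree1_codes, tree2_codes)
```

The pairwise loop is quadratic; since both lists are in depth-first order, a single merge-style pass over the two lists would also work.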
Various P-trees
Basic P-trees: Pi,j
Value P-trees: Pi(v)
Tuple P-trees: P(v1, v2, …, vn)
Interval P-trees: Pi(v1, v2)
Cube P-trees: P([v11, v12], …, [vN1, vN2])
Predicate P-trees: P(p)
[Diagram: arrows between these kinds of P-trees, labeled with the operations (AND, OR, COMPLEMENT) used to derive one kind from another.]
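As a hedged illustration of how value and interval P-trees relate to basic P-trees, the Python sketch below uses uncompressed bit vectors (Python integers, one bit per pixel) as a stand-in for P-trees. It assumes that a value P-tree is the AND of the basic P-trees for the 1 bits of the value and the complements of those for the 0 bits, and that an interval P-tree is the OR of value P-trees; the function names and sample values are illustrative.

```python
def basic_bitvectors(values, nbits=8):
    """basic[j] has bit p set when pixel p has a 1 in bit position j
    (j = 1 is the most significant bit). Stand-in for basic P-trees Pi,j."""
    basic = {}
    for j in range(1, nbits + 1):
        mask = 0
        for p, v in enumerate(values):
            if (v >> (nbits - j)) & 1:
                mask |= 1 << p
        basic[j] = mask
    return basic


def value_bitvector(basic, bit_string, npixels):
    """Stand-in for a value P-tree Pi(v): AND basic vectors, using the
    complement wherever the corresponding bit of v is 0."""
    full = (1 << npixels) - 1
    result = full
    for j, bit in enumerate(bit_string, start=1):
        result &= basic[j] if bit == "1" else (~basic[j] & full)
    return result


def interval_bitvector(basic, bit_strings, npixels):
    """Stand-in for an interval P-tree: OR of the value vectors."""
    result = 0
    for s in bit_strings:
        result |= value_bitvector(basic, s, npixels)
    return result


# Band 1 of the spatial data formats example, in raster order.
values = [254, 127, 14, 193]
basic = basic_bitvectors(values)
p_11 = value_bitvector(basic, "11", len(values))              # values in [192,255]
p_00_01 = interval_bitvector(basic, ["00", "01"], len(values))  # values in [0,127]
root_count_11 = bin(p_11).count("1")
```

Pixels 0 and 3 (254 and 193) fall in [192,255], so `p_11` has bits 0 and 3 set; pixels 1 and 2 (127 and 14) fall in [0,127].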
Association Rule Mining on RSI Data using P-trees
Admissible Itemsets (Asets)
– Asets are itemsets of the form $\mathrm{Int}_1 \times \mathrm{Int}_2 \times \cdots \times \mathrm{Int}_n = \prod_{i=1}^{n} \mathrm{Int}_i$, where $\mathrm{Int}_i$ is an interval of values in $\mathrm{Band}_i$ (some of which may be the full value range).
– Example: Aset $\{[01,01]_1, [11,11]_2\}$ (the subscript is the band number)
P-ARM algorithm
Pruning techniques
P-ARM algorithm
Procedure P-ARM {
  Data_Discretization;
  F1 = {frequent 1-Asets};
  For (k = 2; Fk-1 ≠ ∅) do begin
    Ck = p-gen(Fk-1);
    Forall candidate Asets c ∈ Ck do
      c.count = AND_rootcount(c);
    Fk = {c ∈ Ck | c.count >= minsup}
  end
  Answer = ∪k Fk
}
• F1 is determined directly from P-tree root counts and pruning techniques rather than by scanning the transaction database.
• The p-gen function differs from the apriori-gen function in Apriori by using some pruning techniques.
• The AND_rootcount function calculates Aset counts directly by ANDing the appropriate basic P-trees instead of scanning the transaction database.
• The support count for Aset {B1[0,64), B2[64,128)} (or {[00,00]_1, [01,01]_2}, where the subscript is the band number) is the root count of P1(00) AND P2(01).
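The control flow of the procedure can be sketched in Python as follows. This is a simplified stand-in, not the authors' implementation: plain integers used as bit vectors (one bit per pixel) replace compressed P-trees, discretization keeps the top 2 bits of each byte, `p_gen` applies only band-based pruning plus the usual Apriori subset check, and the pixel data are hypothetical.

```python
from itertools import combinations


def item_bitsets(pixels, bits=2):
    """One bit vector per item (band, interval code); bit p is set when
    pixel p falls in that interval. Stand-in for value P-trees."""
    sets = {}
    for p, pixel in enumerate(pixels):
        for band, value in enumerate(pixel, start=1):
            item = (band, value >> (8 - bits))
            sets[item] = sets.get(item, 0) | (1 << p)
    return sets


def and_rootcount(aset, sets):
    """Support count of an Aset: AND its item vectors, count the 1 bits."""
    items = iter(aset)
    mask = sets[next(items)]
    for item in items:
        mask &= sets[item]
    return bin(mask).count("1")


def p_gen(prev_frequent):
    """Join frequent (k-1)-Asets into candidate k-Asets."""
    prev = set(prev_frequent)
    candidates = set()
    for a, b in combinations(prev, 2):
        c = a | b
        if len(c) != len(a) + 1:
            continue
        bands = [band for band, _ in c]
        if len(set(bands)) != len(bands):
            continue  # band-based pruning: two items from the same band
        if all(frozenset(s) in prev for s in combinations(c, len(c) - 1)):
            candidates.add(c)
    return candidates


def p_arm(pixels, minsup_count, bits=2):
    sets = item_bitsets(pixels, bits)
    frequent = {frozenset([item]): bin(m).count("1")
                for item, m in sets.items()
                if bin(m).count("1") >= minsup_count}
    answer = dict(frequent)
    while frequent:
        counts = {c: and_rootcount(c, sets) for c in p_gen(frequent)}
        frequent = {c: n for c, n in counts.items() if n >= minsup_count}
        answer.update(frequent)
    return answer


# Hypothetical 3-band pixels (band 1, band 2, band 3), byte values.
pixels = [(200, 30, 180), (210, 40, 190), (220, 50, 60), (100, 150, 70)]
answer = p_arm(pixels, minsup_count=2)
```

With a minimum support count of 2, the largest frequent Aset in this toy data is band 1 in [192,255], band 2 in [0,63] and band 3 in [128,191], which holds for the first two pixels.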
Pruning Techniques
Band-based pruning
– An itemset with two different items (intervals) from the same band has support zero, so it can be pruned.
Constraint-based pruning
– E.g., specify yield as the only consequent band of interest.
– Note: in the performance comparisons we did not use this pruning technique (to maintain fairness, since it is hard to implement in other algorithms).
Bit-based pruning for multi-level rules
– If Aset [128,255] (or [1,1]_2, where the subscript is the band number) is not frequent, then the Asets [128,191] (or [10,10]_2) and [192,255] (or [11,11]_2) cannot be frequent either.
Others
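Bit-based pruning can be sketched as follows in Python: a finer interval (one more bit) is only considered if its coarser parent interval is frequent. The function names and the support counts below are hypothetical and serve only to illustrate the idea.

```python
def to_range(prefix, nbits=8):
    """Value range covered by a bit prefix, e.g. '10' -> (128, 191)."""
    width = 1 << (nbits - len(prefix))
    low = int(prefix, 2) * width
    return (low, low + width - 1)


def child_intervals(prefix):
    """The two finer intervals obtained by appending one more bit."""
    return [prefix + "0", prefix + "1"]


def candidates_after_pruning(counts_by_prefix, minsup_count):
    """Keep only finer intervals whose parent interval is frequent."""
    keep = []
    for prefix, count in counts_by_prefix.items():
        if count >= minsup_count:
            keep.extend(child_intervals(prefix))
    return keep


# Hypothetical 1-bit support counts for one band.
one_bit_counts = {"0": 90, "1": 10}
finer = candidates_after_pruning(one_bit_counts, minsup_count=20)
```

Here [128,255] (prefix 1) is not frequent, so its halves [128,191] and [192,255] are never counted; only the halves of [0,127] remain as candidates.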
P-ARM versus Apriori
Scalability with support threshold
[Figure: run time (sec.), 0 to 800, versus support threshold, 10% to 90%, for P-ARM and Apriori; 1,742,400 pixels (transactions).]
P-ARM versus Apriori (cont.)
Scalability with number of transactions
[Figure: time (sec.), 0 to 1200, versus number of transactions (K), 100 to 1700, for Apriori and P-ARM; support threshold = 10%.]
P-ARM versus FP-growth
Scalability with support threshold
[Figure: run time (sec.), 0 to 800, versus support threshold, 10% to 90%, for P-ARM and FP-growth; 1,742,400 pixels (transactions).]
P-ARM versus FP-growth (cont.)
Scalability with the number of transactions
[Figure: time (sec.), 0 to 1200, versus number of transactions (K), 100 to 1700, for FP-growth and P-ARM; support threshold = 10%.]
Conclusion
A model for association rule mining on RSI data
– P-trees facilitate fast calculation of support
– P-trees facilitate significant pruning techniques
Applications other than precision agriculture
– Flood prediction and monitoring
– Community and regional planning
– Virtual archeology
– Mineral exploration
– Bioinformatics/Genomics
– VLSI design