p-trees: universal data structure for query optimization to data mining


Upload: lahela

Post on 13-Jan-2016

TRANSCRIPT

Page 1: P-Trees: Universal Data Structure for Query Optimization to Data Mining

P-Trees: Universal Data Structure for Query Optimization to Data Mining

Most people have Data from which they want information.

So, most people need DBMSs whether they know it or not.

The main component of any DBMS is the query processor, but so far QPs deal with standard workload only (on left).

On the standard_query end, we have much work yet to be done to solve the problem of delivering standard workload answers with low response times (D. DeWitt, ACM SIGMOD’02).

On the Data Mining end, we have barely scratched the surface.

(But those scratches have made the difference between becoming the biggest corporation in the world and filing for bankruptcy – Walmart vs. KMart)

These notes contain NDSU confidential & proprietary material. Patents pending on bSQ, Ptree technology.

[Figure: the workload spectrum, from standard querying to data mining]

Standard querying (simple searching and aggregating): SELECT-FROM-WHERE; complex queries (nested, EXISTS, ...); FUZZY queries (e.g., BLAST searches, ...); OLAP (rollup, drilldown, slice/dice, ...)

Machine learning / data mining: supervised (classification, regression); unsupervised (clustering); association rule mining

Page 2: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Data Mining

Querying: ask specific questions - expect specific answers. We will get back to querying later.

Data Mining: “Go into the DATA MOUNTAIN. Come out with gems” (but also, very likely, some fool’s gold). Relevance and interestingness analysis assays the gems (helps pick out the valuable information).

A Universal Model for Association Rule Mining, Classification and Clustering of a data table, R(A1..An), where the Ais are feature attributes assumed numeric (categorical attributes can be coded numeric)

• First order the rows:
– Rids or RRNs provide an ordering
– arrival ordinal provides an ordering
– Peano order of pixels in an image provides an ordering
– Raster order does also, but
» For images, raster order should first be converted to Peano order since a raster line is not a geometric or a geographic object. More later.
» In raster order, pixel-ids, (x, y), are sorted by bit-position in the order x1x2..xny1y2..yn, while Peano order is x1y1x2y2…xnyn
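The two bit orderings can be compared with a small sketch. The function names `raster_key` and `peano_key` are illustrative, not from the slides; both assume n-bit coordinates with x1 and y1 as the most significant bits.

```python
def raster_key(x: int, y: int, nbits: int) -> int:
    """Sort key with bit order x1x2..xn y1y2..yn (all of x, then all of y)."""
    return (x << nbits) | y


def peano_key(x: int, y: int, nbits: int) -> int:
    """Sort key with bit order x1y1x2y2..xnyn (bits of x and y interleaved)."""
    key = 0
    for i in range(nbits - 1, -1, -1):      # most significant bit first
        key = (key << 1) | ((x >> i) & 1)   # next bit of x
        key = (key << 1) | ((y >> i) & 1)   # next bit of y
    return key


# All pixels of a 4 x 4 image, sorted both ways.
pixels = [(x, y) for x in range(4) for y in range(4)]
raster_sorted = sorted(pixels, key=lambda p: raster_key(p[0], p[1], 2))
peano_sorted = sorted(pixels, key=lambda p: peano_key(p[0], p[1], 2))
```

In the Peano ordering the first four pixels form one 2 x 2 block, which is why each run of consecutive positions corresponds to a quadrant.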

Page 3: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Peano Tree (P-tree) Data Structure for Data Mining

Given a data table, R(A1,…, An)

• Order it (e.g., arrival order or RRN)

• Decompose it into attribute projections (maintain the ordering on each): R → R[Ai], i=1,…,n (Band SeQuential or BSQ projections)

• Decompose each attribute projection by bit position into bit-projections: R[Ai] → Rij, j=1,…,mi (Bit SeQuential or bSQ projections)

(e.g., if each Ai-value datatype is a byte, then mi=8 for all i)

• Build a d-dimensional basic P-tree from each bit-projection: {Pij | i=1,…,n and j=1,…,mi}

R → R[Ai] → Rij → basic P-trees, Pij
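The first two decomposition steps can be sketched as follows. The helper names are hypothetical, and the four sample rows reuse the byte values from the two-band spatial example elsewhere in these notes.

```python
def bsq_projections(rows):
    """R -> R[Ai]: one list per attribute, keeping the row order."""
    return [list(column) for column in zip(*rows)]


def bit_projections(column, nbits=8):
    """R[Ai] -> Rij: one bit list per bit position, j = 1 is the high-order bit."""
    return [[(v >> (nbits - 1 - j)) & 1 for v in column] for j in range(nbits)]


# Ordered table R(A1, A2) with byte-valued attributes.
R = [(254, 37), (127, 240), (14, 200), (193, 19)]

attribute_files = bsq_projections(R)                     # 2 BSQ projections
bit_files = [bit_projections(col) for col in attribute_files]
# bit_files[i][j] is the bit-projection R(i+1)(j+1), ready for P-tree building.
```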

How is this last step done?

Page 4: P-Trees: Universal Data Structure for Query Optimization to Data Mining

1-D CP-tree

Construct the 1-D CP-tree using recursive left-right half counts (inodes have bitant counts)

1111110011111000111111001111111011111111111111111111111101111111

A bit-projection (bSQ file)

from a table with 64 rows
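As a rough illustration of the recursive half counts, the sketch below lists the 1-bit count of every node, level by level. It is an uncompressed listing: unlike a stored P-tree, it keeps expanding pure halves. The name `half_counts` is an assumption.

```python
BITS = "1111110011111000111111001111111011111111111111111111111101111111"


def half_counts(bits: str):
    """Per-level lists of 1-bit counts, root first, halving the node width each level."""
    levels = []
    width = len(bits)
    while width >= 1:
        levels.append(
            [bits[i:i + width].count("1") for i in range(0, len(bits), width)]
        )
        width //= 2
    return levels


levels = half_counts(BITS)
# levels[0] is the root count, levels[1] the two half counts, and so on.
```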

[Figure: the 1-D CP-tree for this bSQ file; the root count is 55]

Page 5: P-Trees: Universal Data Structure for Query Optimization to Data Mining

2-D CP-tree (same bSQ file)

Construct the 2D tree by removing every other row of the 1-D tree (inodes have quadrant counts)

1111110011111000111111001111111011111111111111111111111101111111
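A minimal sketch of a count tree with a configurable fanout, assuming the file length is a power of the fanout (4 for the 2-D case, 8 for the 3-D case). Pure nodes are not expanded, mirroring the eliminated pure quadrants. The name `count_tree` and the nested-tuple representation are illustrative choices.

```python
BITS = "1111110011111000111111001111111011111111111111111111111101111111"


def count_tree(bits: str, fanout: int):
    """Return (count, children); pure nodes (all 0s or all 1s) get no children."""
    count = bits.count("1")
    n = len(bits)
    if count == 0 or count == n or n == 1:
        return (count, [])
    step = n // fanout
    children = [count_tree(bits[i:i + step], fanout) for i in range(0, n, step)]
    return (count, children)


tree_2d = count_tree(BITS, 4)   # quadrant counts
tree_3d = count_tree(BITS, 8)   # octant counts
```

With fanout 4 the level-1 counts equal the third level of the 1-D half counts, which is the sense in which the 2-D tree skips every other row.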

[Figure: the 2-D CP-tree for this bSQ file, with root count 55 and quadrant counts 11, 13, 16, 15; pure quadrants are eliminated]

Page 6: P-Trees: Universal Data Structure for Query Optimization to Data Mining

4-D CP-tree from the same bSQ file?

Remove every other row (insufficient number of rows!). Can construct a 4D with 2D leaves:

1111110011111000111111001111111011111111111111111111111101111111

[Figure: the 4-D CP-tree with 2-D leaves for this bSQ file; the root count is 55]

Page 7: P-Trees: Universal Data Structure for Query Optimization to Data Mining

What about 3-D? (same bSQ file)

Construct the 3D tree from the 2D tree by removing every other row and splitting the other rows (inodes have octant counts)

1111110011111000111111001111111011111111111111111111111101111111

[Figure: the 3-D CP-tree for this bSQ file, with root count 55 and octant counts 6, 5, 6, 7, 8, 8, 8, 7]

Page 8: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Summary

Given a feature relation, R(A1,…, An)

– 1. Order rows (RRN, Rid, ArrivalOrdinal, a Raster Spatial Order, …)

– 2. Choose a dimension, d (or combination; d1, d2, ...)

– 3. Choose fanout(s) (e.g., d1^n1 d2^n2 … dr^nr)

Basic P-trees can be implemented in many formats:
– CountP-Tree (CP) (inode contains quadrant-1-bit-count and child-pointers)
– PredicateP-trees (inodes: 1 iff predicate=true throughout quadrant, and child ptrs)
  Pure1-Trees (P1), Pure0-Trees (P0), PureNot1-Tree (NP1), PureNot0-Tree (NP0)
  ValueP-trees (VP)
  TupleP-trees (TP)
  HalfPure-trees (HP)

Above are lossless, compressed, DM-Ready.

Interval-Ptree, Box-Ptree

How do we datamine heterogeneous datasets? i.e., R, S, T, .. describing the same entity, with different keys/attribs
– Universal Relation: transform into one relation (union keys?)
– Key Fusion: R(K..); S(K’..). Mine them as separate relations but map keys using a tautology.
The two methods are related. The Universal Relation approach usually includes defining a universal key to which all local keys are mapped (using a (possibly fragmented) tautological lookup table).

Page 9: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Spatial Data Formats (e.g., images with natural 2-D structure and coordinates, (x,y); raster ordering)

BAND-1
254 (1111 1110)   127 (0111 1111)
 14 (0000 1110)   193 (1100 0001)

BAND-2
 37 (0010 0101)   240 (1111 0000)
200 (1100 1000)    19 (0001 0011)

BSQ format (2 files)
Band 1: 254 127 14 193
Band 2: 37 240 200 19


Page 12: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Spatial Data Formats (Cont.)

BAND-1
254 (1111 1110)   127 (0111 1111)
 14 (0000 1110)   193 (1100 0001)

BAND-2
 37 (0010 0101)   240 (1111 0000)
200 (1100 1000)    19 (0001 0011)

BSQ format (2 files)
Band 1: 254 127 14 193
Band 2: 37 240 200 19

BIL format (1 file)
254 127 37 240 14 193 200 19

BIP format (1 file)
254 37 127 240 14 200 193 19

bSQ format (16 files) (related to bit planes in graphics); each column below is one file

B11 B12 B13 B14 B15 B16 B17 B18 B21 B22 B23 B24 B25 B26 B27 B28
 1   1   1   1   1   1   1   0   0   0   1   0   0   1   0   1
 0   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0
 0   0   0   0   1   1   1   0   1   1   0   0   1   0   0   0
 1   1   0   0   0   0   0   1   0   0   0   1   0   0   1   1
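The four layouts can be derived from the same 2 x 2, two-band image with a few comprehensions. This is only a sketch of the orderings; the variable names are illustrative.

```python
# The two-band 2 x 2 image from the slide, in raster (row-major) layout.
band1 = [[254, 127], [14, 193]]
band2 = [[37, 240], [200, 19]]
bands = [band1, band2]
ROWS, COLS, NBITS = 2, 2, 8

# BSQ: one file per band, pixels in raster order.
bsq = [[v for row in band for v in row] for band in bands]

# BIL: one file, bands interleaved line by line.
bil = [v for r in range(ROWS) for band in bands for v in band[r]]

# BIP: one file, bands interleaved pixel by pixel.
bip = [band[r][c] for r in range(ROWS) for c in range(COLS) for band in bands]

# bSQ: one file per (band, bit position); j = 1 is the high-order bit.
bsq_bits = {
    f"B{b + 1}{j + 1}": [(v >> (NBITS - 1 - j)) & 1 for v in bsq[b]]
    for b in range(len(bands))
    for j in range(NBITS)
}
```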

Page 13: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Suppose we start with a bit-projection of a raster-ordered spatial file (image)? First, re-order into Peano order.

Raster ordered bSQ file:

1111110011111000111111001111111011111111111111111111111101111111

Spatial arrangement (shows Peano order):

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

Count P-tree (quadrant counts):

55
16 8 15 16
quadrant 1: 3 0 4 1      quadrant 2: 4 4 3 4
leaves: 1110 0010 1101

Terms illustrated: Peano or Z-ordering; Pure (Pure-1/Pure-0) quadrant; Root Count; Level; Fan-out; QID (Quadrant ID)
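The quadrant counts for this file can be reproduced with a short recursive sketch. It reads the raster-ordered bits into an 8 x 8 grid and visits quadrants in Z order (upper-left, upper-right, lower-left, lower-right). The function name and the `(count, children)` representation are assumptions.

```python
BITS = "1111110011111000111111001111111011111111111111111111111101111111"
N = 8
grid = [[int(BITS[r * N + c]) for c in range(N)] for r in range(N)]


def quad_count_tree(r0: int, c0: int, size: int):
    """Return (count, children) for the size x size quadrant at (r0, c0)."""
    count = sum(grid[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size))
    if count in (0, size * size) or size == 1:
        return (count, [])              # pure quadrant: no children stored
    h = size // 2
    children = [quad_count_tree(r0 + dr, c0 + dc, h) for dr in (0, h) for dc in (0, h)]
    return (count, children)


root = quad_count_tree(0, 0, N)
level1 = [child[0] for child in root[1]]
```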

Page 14: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Same example

Count P-tree:

55
16 8 15 16
quadrant 1: 3 0 4 1      quadrant 2: 4 4 3 4
leaves: 1110 0010 1101

Quadrants at each level are numbered 0 1 2 3 in Peano (Z) order, and the tree levels are labelled Level-0 through Level-3.

Example: the pixel ( 7, 1 ) = ( 111, 001 ) has QID 2 . 2 . 3, obtained by interleaving the coordinate bits: 10.10.11
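The QID of a pixel can be computed by pairing up coordinate bits, as in the (7, 1) example. The function name `qid` is illustrative; following the example, the first coordinate is assumed to supply the high bit of each quadrant digit.

```python
def qid(x: int, y: int, nbits: int):
    """Quadrant path of pixel (x, y): one digit x_i y_i per level, high bits first."""
    return [
        (((x >> i) & 1) << 1) | ((y >> i) & 1)
        for i in range(nbits - 1, -1, -1)
    ]


path = qid(7, 1, 3)
label = ".".join(str(d) for d in path)
```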

Page 15: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Each bSQ file, Rij, generates a BasicPtree, Pij.

Each value, v, in Ai generates a ValuePtree, VPi(v) (1 iff purely v throughout quadrant).
Each tuple (v1..vn) in R generates a TuplePtree, TP(v1..vn) (1 iff purely (v1..vn) throughout quadrant).
Any row-predicate on R generates a PredicatePtree, <pred>P (1 iff <pred> true throughout quadrant).
Any interval, [l,u], in Ai generates an IntervalPtree, [l,u]Pi (1 iff v ∈ [l,u] throughout quadrant).
Any box, [li,ui], in R generates a RectanglePtree, [li,ui]P (1 iff (v1..vn) ∈ [li,ui] throughout quadrant).

(Each Ptree can be expressed as a CountTree with inode-value=count or a BooleanTree with inode-value=bit.)

Some common types of Ptrees, given R(A1..An)

Value Ptree: P_1(001) = P1'_{11} ^ P1'_{12} ^ P1_{13}

Tuple Ptree (1 if quad contains only that tuple, else 0)

P(001, 010, 111) = P_1(001) ^ P_2(010) ^ P_3(111) = P1'_{11} ^ P1'_{12} ^ P1_{13} ^ P1'_{21} ^ P1_{22} ^ P1'_{23} ^ P1_{31} ^ P1_{32} ^ P1_{33}
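The value and tuple formulas can be checked on uncompressed bit vectors, which stand in for the P-trees here (one bit per row, first row in the high-order position). The four sample rows and all helper names are hypothetical; only the AND/complement pattern comes from the formulas above.

```python
# Hypothetical table R(A1, A2, A3) in 3-bit precision.
rows = [(1, 2, 7), (1, 2, 7), (5, 2, 7), (1, 3, 0)]
NBITS = 3
FULL = (1 << len(rows)) - 1          # mask with one 1-bit per row


def basic(i: int, j: int) -> int:
    """Bit vector for bit j (1 = high-order) of attribute i (1-based)."""
    vec = 0
    for row in rows:
        vec = (vec << 1) | ((row[i - 1] >> (NBITS - j)) & 1)
    return vec


def value_vec(i: int, v: int) -> int:
    """AND of basic vectors, complemented wherever the value bit is 0."""
    vec = FULL
    for j in range(1, NBITS + 1):
        b = basic(i, j)
        vec &= b if (v >> (NBITS - j)) & 1 else (~b & FULL)
    return vec


def tuple_vec(values) -> int:
    """AND of the value vectors of every attribute."""
    vec = FULL
    for i, v in enumerate(values, start=1):
        vec &= value_vec(i, v)
    return vec


rows_with_tuple = tuple_vec((0b001, 0b010, 0b111))
```

The number of 1-bits in `rows_with_tuple` plays the role of the root count of the tuple P-tree.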

Basic Ptrees: P1_{11}, …, P1_{18}, P1_{21}, …, P1_{28}, …, P1_{71}, …, P1_{78}

[Figure labels: attribute; value in 3-bit precision; tuple (1,2,7) in 3-bit precision]

Review: given a feature relation, R(A1,…, An)
– 1. Order rows (RRN, Rid, ArrivalOrdinal, a Raster Spatial Order, …)
– 2. Choose a dimension, d (or combination; d1, d2, ...) and fanout(s) (e.g., d1^n1 d2^n2 … dr^nr)

Page 16: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Predicate Ptrees (inode: 1 iff condition = true throughout quadrant)

The example bit file (spatial layout):

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

Each tree is listed as: root; level-1 children; children of quadrant [1]; children of quadrant [2]; leaf vectors of [1.0], [1.3], [2.2].

CP  (Count):      55; 16 8 15 16; 3 0 4 1; 4 4 3 4; 1110 0010 1101
P1  (Pure1):      0;  1 0 0 1;    0 0 1 0; 1 1 0 1; 1110 0010 1101
P0  (Pure0):      0;  0 0 0 0;    0 1 0 0; 0 0 0 0; 0001 1101 0010
NP0 (NotPure0):   1;  1 1 1 1;    1 0 1 1; 1 1 1 1; 1110 0010 1101
NP1 (NotPure1):   1;  0 1 1 0;    1 1 0 1; 0 0 1 0; 0001 1101 0010
MP  (Mixed):      1;  0 1 1 0;    1 0 0 1; 0 0 1 0  (leaves are always 0000, so omitted)
HP  (HalfPure):   1;  1 1 1 1;    1 0 1 0; 1 1 1 1; 1110 0010 1101

Predicate Ptrees can be stored in QidVector (QV) format: for each mixed quadrant, (qid, ChildTruthVector).

Qid     P1-QV   P0-QV   NP0-QV  NP1-QV  MP-QV   HP-QV
[]      1001    0000    1111    0110    0110    1111
[1]     0010    0100    1011    1101    1001    1010
[1.0]   1110    0001    1110    0001            1110
[1.3]   0010    1101    0010    1101            0010
[2]     1101    0000    1111    0010    0010    1111
[2.2]   1101    0010    1101    0010            1101

HPtrees result from the HalfPure1 predicate: 1BitCnt ≥ Pure1Cnt/2
– Lossless. 1 means pure1 iff no child ptrs. 0 means pure0 iff no child ptrs.
– Delete any number of bottom levels = HPtree of coarser granularity.
– ANDing HPtrees: if any operand is 0, result = 0; if all operands are 1 and have children, result = 1; else result could be 0 or 1 (depends upon children, but likely = 0).
– The HPtree of the complement of a bSQ file is the flip of the HPtree.
– HPtrees have the same leaves as Pure1Trees.
– HPtree is the “high-order bit” tree of the CPtree.
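Each child truth vector can be derived from the child counts of a count tree and the size of the child quadrants. The sketch below does this for the predicates listed on this slide; the `PREDICATES` table and `child_vector` name are illustrative, and the HalfPure test (at least half the bits are 1) is an assumed reading of the predicate.

```python
# Predicate on (count of 1-bits, number of bits in the quadrant).
PREDICATES = {
    "P1": lambda c, n: c == n,        # purely 1
    "P0": lambda c, n: c == 0,        # purely 0
    "NP0": lambda c, n: c != 0,       # not purely 0
    "NP1": lambda c, n: c != n,       # not purely 1
    "MP": lambda c, n: 0 < c < n,     # mixed
    "HP": lambda c, n: 2 * c >= n,    # at least half 1s (assumed reading)
}


def child_vector(name: str, counts, size: int) -> str:
    """Child truth vector of one node, given its children's counts and their size."""
    return "".join("1" if PREDICATES[name](c, size) else "0" for c in counts)


root_children = [16, 8, 15, 16]   # each child covers 16 bits
quad1_children = [3, 0, 4, 1]     # each child covers 4 bits

root_vectors = {name: child_vector(name, root_children, 16) for name in PREDICATES}
quad1_vectors = {name: child_vector(name, quad1_children, 4) for name in PREDICATES}
```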

Page 17: P-Trees: Universal Data Structure for Query Optimization to Data Mining

The P-tree Algebra (Complement, AND, OR, …)

Complementing a Ptree (Ptree for the flip of the bSQ file) (we use the “prime” notation)

– Count-Ptree: formed by purity-complementing each count.
– Purity-Ptree (P1, P0, NP0, NP1): formed by bit-flipping leaves only.
– HPtree: formed by bit-flipping all (comp = flip)

(We use “underscore” for the flip of a tree)
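A small numeric check of these relationships, using the level-1 quadrant counts 16, 8, 15, 16 of the running example (each quadrant covers 16 bits). Reading "purity-complementing a count" as size minus count is an assumption of this sketch.

```python
counts, size = [16, 8, 15, 16], 16

# Counts for the complemented (bit-flipped) bSQ file.
complement_counts = [size - c for c in counts]

p1 = [int(c == size) for c in counts]     # pure-1 bits
p0 = [int(c == 0) for c in counts]        # pure-0 bits
np1 = [int(c != size) for c in counts]    # not-pure-1 bits

# Pure-1 bits of the complemented file, and the bitwise flip of P1.
p1_of_complement = [int(c == size) for c in complement_counts]
flip_p1 = [1 - b for b in p1]
```

On these counts the pure-1 vector of the complemented file equals the pure-0 vector of the original, and flipping P1 gives NP1.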

[Figure: the P1, P0, NP0 and NP1 trees of the example with their QV tables, each shown next to its flipped (underscored) tree]

Identities shown in the figure, writing flip(T) for the underscored tree:
– P1 = P0’ and P0 = P1’
– NP0 = NP1’ and NP1 = NP0’ = flip(P1)
– flip(NP0) = P0 and flip(NP1) = P1

Page 18: P-Trees: Universal Data Structure for Query Optimization to Data Mining

ANDing (for all Truth-trees, just AND bit-wise)

Pure1-quad-list method: For each operand, list the qids of the pure1 quads in depth-first order. Do one multi-cursor scan across the operand lists; for every pure1 quad common to all operands, install it in the result.

Operand 1: 0, 100, 101, 102, 12, 132, 20, 21, 220, 221, 223, 23, 3
Operand 2: 0, 20, 21, 22, 231
AND:       0, 20, 21, 220, 221, 223, 231

[Figure: the P1, P0, NP0 and NP1 trees of operand 1 and operand 2, and the result trees P1op1 ^ P1op2, P1op1 ^ P0op2 = P1op1 ^ P1’op2, NP0op1 ^ NP0op2 and NP0op1 ^ NP0’op2]

Operand 1:

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

Operand 2:

1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 0 1 0 0 0 0
1 1 0 0 0 0 0 0

Operand 1 AND operand 2:

1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 0 1 0 0 0 0
0 1 0 0 0 0 0 0

Depth-first traversal using 1^1=1, 1^0=0, 0^0=0 (bitwise).
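One way to sketch the multi-cursor scan for two operands: qids are written as digit strings without dots, each list is in depth-first order, and a pure1 quad is common to both operands when one qid is a prefix of the other (the smaller quad is the one installed). The function name is hypothetical, and the sketch assumes single-digit quadrant numbers.

```python
def and_pure1_lists(a_list, b_list):
    """AND two pure1-quad lists with one cursor per operand."""
    result = []
    i = j = 0
    while i < len(a_list) and j < len(b_list):
        a, b = a_list[i], b_list[j]
        if a.startswith(b):        # a lies inside (or equals) b: a is pure1 in both
            result.append(a)
            i += 1
        elif b.startswith(a):      # b lies strictly inside a
            result.append(b)
            j += 1
        elif a < b:                # disjoint quads: advance the earlier cursor
            i += 1
        else:
            j += 1
    return result


op1 = ["0", "100", "101", "102", "12", "132", "20", "21", "220", "221", "223", "23", "3"]
op2 = ["0", "20", "21", "22", "231"]
anded = and_pure1_lists(op1, op2)

# Root count of the result for an 8 x 8 file: a quad at depth d covers 4**(3 - d) bits.
root_count = sum(4 ** (3 - len(q)) for q in anded)
```

More than two operands could be handled by folding the same function over the lists.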

Page 19: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Association Rule Mining (ARM)

Association Rule Mining on R is a matter of finding all (qualifying) rules of the form A → C, where A is a subset of tuples called the antecedent and is disjoint from the subset, C, called the consequent.

Page 20: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Precision Ag ARM example: identifying high / low crop yields (usually a classification problem)

E.g., R( X, Y, R, G, B, Y ), where R/G/B are red/green/blue reflectance from the pixel or grid cell at (x,y)
– Y is the yield at (x,y).
– Assume all are 8-bit values.

High Support and Confidence rules are searched for in which the consequent is entirely in the Yield attribute, such as:
– [192,255]G ∧ [0,63]R → [128,255]Y

How to apply rules?
– Obtain rules from previous year’s data, then apply rules in the current year after each aerial photo is taken at different stages of plant growth.
– By irrigating/adding Nitrate where lower Yield is indicated, overall Yield may be increased.
– We note that this problem is more of a classification problem (classify Yield levels); that is, the rules are classification rules, not association.

Page 21: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Image Data Terminology
– Pixel: a point in a space
– Band: feature attribute of the pixels
– Value: usually one byte (0~255)

Images have different numbers of bands
– TM4/5: 7 bands (B, G, R, NIR, MIR, TIR, MIR2)
– TM7: 8 bands (B, G, R, NIR, MIR, TIR, MIR2, PC)
– TIFF: 3 bands (B, G, R)
– Ground data: individual bands (Yield, Moisture, Nitrate, Temp, elevation…)

RSI data can be viewed as a collection of pixels. Each has a value for each feature attribute.

[Figure: TIFF image and Yield Map]

E.g., the RSI dataset above has 320 rows and 320 cols of pixels (102,400 pixels) and 4 feature attributes (B,G,R,Y). The (B,G,R) feature bands are in the TIFF image and the Y feature is color coded in the Yield Map.

Existing formats
– BSQ (Band Sequential)
– BIL (Band Interleaved by Line)
– BIP (Band Interleaved by Pixel)

New format
– bSQ (bit Sequential)

Page 22: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Data Mining in Genomics

• There is (will be?) an explosion of gene expression data.

• Current emphasis is on extracting meaningful information from huge raw data sets.

• A consistent data store and the use of P-trees to facilitate Association Rule Mining as well as Clustering / Classification will facilitate the extraction of information and answers from raw data on demand.

• Microarray data is most often represented as a Gene Table, G(Gid, E1, E2, …, En), where Gid is the gene identifier, E1…En are the various treatments (or conditions or experiments), and the data values are gene expression levels (Excel spreadsheet).

• A gene regulatory pathway component can be represented as an association rule, {G1..Gn} → Gm, where {G1…Gn} is the antecedent & Gm is the consequent.

• Currently, data-mining techniques concentrate on the Gene table - specifically, on finding clusters of genes that exhibit similar expression patterns under selected treatments

• clustering the gene table

Page 23: P-Trees: Universal Data Structure for Query Optimization to Data Mining

ARM for Microarray Data (Contd.)

• An alternate data format exists (called the “Experiment Table”): T(Eid, G1, G2, …, Gn), where Eid is the Experiment (or Treatment or Condition) identifier and G1…Gn are the gene identifiers.
• Experiment tables are a convenient form for ARM of gene expression levels.
• Goal is to mine for rules among genes by associating treatment table columns.

ExpmtID \ GeneID   G1    G2    G3    G4
E1                 ….    ….    ….    ….
E2                 ….    ….    ….    ….
E3                 ….    ….    ….    ….
E4                 ….    ….    ….    ….
(cells hold gene expression values)

The form of the Experiment Table with binary values (coding only whether an expression level exceeds or does not exceed a threshold) is identical to Market Basket Data, for which a wealth of Rule Mining techniques have been developed in the last 8-10 years.

Page 24: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Experiment Table

      G1    G2    G3    G4
E1    …     …     …     …
E2    …     …     …     …
E3    …     …     …     …
E4    …     …     …     …

Gene Table is usually given as a standard (MS Excel) spreadsheet of gene expression levels coming from microarray experiments.

It is a 2-D data cube which can be rotated (to the Experiment Table), rolled up, sliced, diced, drilled down, association rule mined, etc.

Gene Table

      E1    E2    E3    E4
G1    …     …     …     …
G2    …     …     …     …
G3    …     …     …     …
G4    …     …     …     …

Page 25: P-Trees: Universal Data Structure for Query Optimization to Data Mining

A Universal Format? E.g., one large universal table with 5 dimensions based on the MIAME standard?

– E = Experimental design – Hybridisation Procedures
– A = Array design
– S = Samples
– M = Measurements
– N = Normalization Control

for data mining across all experiments and genes?

Page 26: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Gene-Rep

Eid (E,A,S,M,N in 5D Peano order)   G1    G2    …    Gn
E,A,S,M,N1                          ….    ….         ….
E,A,S,M,N2                          ….    ….         ….
. . .
E,A,S,M,Nm                          ….    ….         ….
(cells hold gene expression values)

“MIAME HYPERCUBE” (5-D Universal Gene Expression Cube)

Cardinality is high, but compression will be substantial (next slide).

Page 27: P-Trees: Universal Data Structure for Query Optimization to Data Mining

MIAME HYPERCUBE rolled up onto (E,S)

[Figure: a matrix with rows E1A1S1M1N1, …, E1A1S1M1Nn, …, EnAnSnMnNn (E = Experiment) and columns G1, G2, …, Gn (Gene); a few blocks of non-zero values lie among large regions of zeros]

The non-zero blocks may occur off the diagonal. The point: massive but very sparse dataset!

The AD (All Digital) implementation format for distributed P-tree processing is one in which the bit filter (mask) approach is universally adopted. If the dimension is 5 (as in the MIAME HYPERCUBE), the only operation is a 32-bit AND operation, which fits current commodity processors perfectly (32-bit registers!).

A hardware drop-in 5-D P-tree AND card for standard PCs please!!! “Anyone? Anyone?..”

Page 28: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Market Basket ARM example: identifying purchasing patterns

• If a customer buys beer, s/he will buy chips (so shelve the chips near the beer?)

E.g., Boolean relation, R(Tid, Aspirin, Beer, Chips, Dates, .., Zippo)
• Tid = transaction id (for a customer going thru checkout). In any field of a tuple there is a 1 if the customer has that product in his/her basket, else 0 (existence, not count).

Support and Confidence: Given itemsets, A and C,
• Supp(A) = ratio of the number of transactions supporting A over the total number of transactions.
• Supp(A ∪ C) = ratio of the number of customers buying A ∪ C over the total number of customers.
• Conf(A → C) = ratio of the number of customers buying A ∪ C over the number of customers buying A = Supp(A ∪ C) / Supp(A). (In list notation, A ∪ C is written AC.)

Thresholds
• Frequent Itemsets = Support exceeds a min support threshold (minsupp).
– Lk denotes the set of frequent k-itemsets (sets with k items in them).
• High Confidence Rules = Confidence exceeds a min threshold (minconf).
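The two measures can be written down directly from these definitions. The four transactions below are the ones used in the association rule example later in these notes; the function names `supp` and `conf` are illustrative.

```python
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}


def supp(itemset: set) -> float:
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def conf(antecedent: set, consequent: set) -> float:
    """Conf(A -> C) = Supp(A u C) / Supp(A)."""
    return supp(antecedent | consequent) / supp(antecedent)


supp_a = supp({"A"})
supp_ac = supp({"A", "C"})
conf_a_c = conf({"A"}, {"C"})
conf_c_a = conf({"C"}, {"A"})
```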

Page 29: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Lists versus Vectors in MBR

In most MBR, the table (relation) is T(Tid, Itemset)
– where Tid is the customer transaction id (during checkout)
– Itemset is the set of items the customer purchases (in the Market Basket).

• Itemset is a set and therefore T is non-First_Normal_Form
• Therefore the bit-vector approach is usually taken in MBR: BT(Tid, item1 item2 … itemn)
• Itemset is expressed as a bit-vector, [0100101…1000]
– where each item is assigned to a bit position and that bit is 1 if t-itemset contains that item and 0 otherwise.
– The Vector version corresponds to the table model we have been using, with R(A1,…,An); ordering is by Tid and the Ai‘s are the items in an assigned order (the datatype of each is Boolean)
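A minimal sketch of the list-to-vector transformation, using the four transactions of the association rule example from these notes. The item order A..F is the assigned bit order; the variable names are assumptions.

```python
# List model: T(Tid, Itemset)
T = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}
items = ["A", "B", "C", "D", "E", "F"]   # assigned bit positions

# Vector model: BT(Tid, item1 ... itemn), one Boolean per item.
BT = {tid: [int(item in itemset) for item in items] for tid, itemset in T.items()}

# Column sums give the support count of each single item.
support_counts = [sum(vector[k] for vector in BT.values()) for k in range(len(items))]
```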

Page 30: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Many-to-Many Relationships (M-M) (List vs. Vector model?)

An M-M relationship between entities, E1 and E2, is modeled as a table, R(E1, E2List), where E2List is the list of all E2-occurrences related to the corresponding E1 occurrence.

Or it is modeled as the “rotation” R’(E2, E1List). Note that both tables are non-1NF!

Non-1NF tables are difficult, so the List model is typically transformed to the Vector model: R(E1, E2,1, E2,2, … , E2,n), where each E2,j value is Boolean (1 iff that E2-occurrence is related to the E1 occurrence).

This transformation and the APRIORI work done between 1992-present has made MBR a sea-change event.

Walmart adopted it early to analyze and manage supply and Kmart did not. This year Walmart became the world's largest company and Kmart filed for bankruptcy protection. Is it effective technology?

Gene-to-Experiment and CustomerTrans-to-Item are M-M relationships – quite similar!

Page 31: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Association Rule Example

Each transaction is a list (or bit vector) of items purchased by a customer in a visit:

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Tid             A  B  C  D  E  F
2000            1  1  1  0  0  0
1000            1  0  1  0  0  0
4000            1  0  0  1  0  0
5000            0  1  0  0  1  1
Support Count   3  2  2  1  1  1

minsupp=50%, minconf=50%

Find the frequent itemsets: the sets of items that have minsupp.

A subset of a frequent itemset must also be a frequent itemset: if {A, B} is a frequent itemset, {A} and {B} must be frequent.

APRIORI: Iteratively find frequent itemsets with size from 1 to k. Use the frequent itemsets to generate association rules.

Ck will denote the candidate frequent k-itemsets. Lk will denote the frequent k-itemsets.

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

Suppose the items in Lk-1 are listed in an order.

Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning
  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) delete c from Ck
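The two steps above translate almost line for line into Python. In this sketch itemsets are sorted tuples and `apriori_gen` is an assumed name; the input in the usage lines is the L2 of the worked Apriori example in these notes.

```python
from itertools import combinations


def apriori_gen(prev):
    """Candidate k-itemsets from the frequent (k-1)-itemsets (sorted tuples)."""
    prev = set(prev)
    # Step 1: self-join on the first k-2 items, keeping the last items ordered.
    joined = {
        p + (q[-1],)
        for p in prev
        for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    }
    # Step 2: prune candidates that have an infrequent (k-1)-subset.
    return {
        c for c in joined
        if all(s in prev for s in combinations(c, len(c) - 1))
    }


L2 = {(1, 3), (2, 3), (2, 5), (3, 5)}
C3 = apriori_gen(L2)

L1 = {(1,), (2,), (3,), (5,)}
C2 = apriori_gen(L1)
```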

Page 32: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Apriori on database D (Minsup=2)

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2: {1 2} {1 3} {1 5} {2 3} {2 5} {3 5}

Scan D → C2 with counts:
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2:
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3: {2 3 5}

Scan D → L3:
itemset   sup
{2 3 5}   2

The same example with P-trees (Minsup=2)

TID   1  2  3  4  5
100   1  0  1  1  0
200   0  1  1  0  1
300   1  1  1  0  1
400   0  1  0  0  1

Build Ptrees (one scan of D); each is shown as root count, leaf bits:
P1: 2, 1010     P2: 3, 0111     P3: 3, 1110     P4: 1, 1000     P5: 3, 0111

L1={1,2,3,5}

P1^P2: 1, 0010     P1^P3: 2, 1010     P1^P5: 1, 0010
P2^P3: 2, 0110     P2^P5: 3, 0111     P3^P5: 2, 0110

L2={13,23,25,35}

P1^P2^P3: 1, 0010     P1^P3^P5: 1, 0010     P2^P3^P5: 2, 0110

L3={235}

{123} pruned because {12} not frequent.

{135} pruned because {15} not frequent.

• The P-ARM algorithm assumes a fixed value precision in all bands.

• p-gen function for numeric spatial data differs from apriori-gen by using additional pruning.

•AND_rootcount function is used to calculate Itemset counts directly by ANDing the appropriate basic Ptrees instead of scanning the transaction databases.
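The idea of counting itemsets by ANDing instead of rescanning can be sketched with plain Python integers standing in for the basic Ptrees of the four-transaction example (items 1 to 5, Minsup=2). This is not the P-ARM implementation: the trees are uncompressed bit vectors, and `and_rootcount` here is only a stand-in for the function of that name.

```python
from itertools import combinations

# One bit vector per item; bits are TIDs 100, 200, 300, 400 from left to right.
columns = {1: 0b1010, 2: 0b0111, 3: 0b1110, 4: 0b1000, 5: 0b0111}
MINSUP = 2


def and_rootcount(itemset) -> int:
    """Support count of an itemset: AND its item vectors and count the 1-bits."""
    vec = 0b1111
    for item in itemset:
        vec &= columns[item]
    return bin(vec).count("1")


frequent = {}
level = [(i,) for i in sorted(columns) if and_rootcount((i,)) >= MINSUP]
k = 1
while level:
    frequent[k] = level
    prev = set(level)
    items = sorted({i for itemset in level for i in itemset})
    # Candidates: (k+1)-itemsets whose k-subsets are all frequent.
    candidates = [
        c for c in combinations(items, k + 1)
        if all(s in prev for s in combinations(c, k))
    ]
    level = [c for c in candidates if and_rootcount(c) >= MINSUP]
    k += 1
```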

Page 33: P-Trees: Universal Data Structure for Query Optimization to Data Mining

P-ARM versus Apriori

Scalability with support threshold

• 1320 × 1320 pixel TIFF-Yield dataset (total number of transactions is ~1,700,000).

• 2-bits precision

• Equi-length partition

[Figure: run time (sec., 0 to 800) versus support threshold (10% to 90%) for P-ARM and Apriori]

Compare with Apriori (classical method) and FP-growth (recently proposed). Find all frequent itemsets, not just those containing Yield, for fairness. The images are actual aerial TIFF images with synchronized yield maps.

Scalability with number of transactions

[Figure: time (sec., 0 to 1200) versus number of transactions (100K to 1700K) for Apriori and P-ARM]

Identical results. P-ARM is more scalable for lower support thresholds. The P-ARM algorithm is more scalable to large spatial datasets.

Page 34: P-Trees: Universal Data Structure for Query Optimization to Data Mining

P-ARM versus FP-growth

Scalability with support threshold

[Figure: run time (sec., 0 to 800) versus support threshold (10% to 90%) for P-ARM and FP-growth]

17,424,000 pixels (transactions)

Scalability with number of trans

[Figure: time (sec., 0 to 1200) versus number of transactions (100K to 1700K) for FP-growth and P-ARM]

FP-growth = efficient, tree-based frequent pattern mining method (details later). Identical results. For a dataset of 100K bytes, FP-growth runs very fast. But for images of large size, P-ARM achieves better performance. P-ARM also achieves better performance in the case of a low support threshold.

Page 35: P-Trees: Universal Data Structure for Query Optimization to Data Mining

P-cube and P-table of R(A1,…,An)

Given R(A1, A2, A3 ), form P-trees for R

Form the data cube, P-cube , of all TupleP-trees,

PcubeR

Applying Peano ordering to the P-cube cells defines the P-table,

PtableR([],[0],[0.0]…[1],[1.0]... )

Quadrants are the feature attribute (column) names, listed in depth-first or pre-order.

Can form P-trees on PtableR

- What are they? - What is the relationship to the Haar wavelet low-pass tree?

[Figure: the P-cube of R(A1, A2, A3) with 2-bit values (00, 01, 10, 11) on each of the axes A1, A2, A3; the cell at (v1, v2, v3) holds rcP(v1,v2,v3), the root count of the tuple P-tree P(v1,v2,v3)]

Page 36: P-Trees: Universal Data Structure for Query Optimization to Data Mining

High Confidence Rules

Application areas on spatial data
– Forest fires
– Big ticket item buyer identification
– Gene function determination
– Identification of agricultural pest infestations

Traditional algorithms are not suitable
– Too many frequent itemsets in the case of low support threshold

P-tree, P-cube

Establish a very low minsupp though
– To eliminate rules that result from noise and outliers

Eliminate redundant rules

Page 37: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Confident Rule Mining Algorithm

Build the set of confident rules, C (initially empty) as follows:– Start with 1-bit values, 2 bands;

– then 1-bit values and 3 bands; …

– then 2-bit values and 2 bands;

– then 2-bit values and 3 bands; …

– . . .

– At each stage defined above, do the following:

• Find all confident rules by rolling-up the T-cube along each potential consequent set using summation.

• Compare these sums with the support threshold to isolate the rule support sets that meet the minimum support.

• Compare the normalized T-cube values (divide by the rolled-up sum) with the minimum confidence level to isolate the confident rules.

• Place any new confident rule in C, but only if non-redundant.

Page 38: P-Trees: Universal Data Structure for Query Optimization to Data Mining

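Example

Assume a minimum confidence threshold of 80% and a minimum support threshold of 10%. Start with 1-bit values and 2 bands, B1 and B2. The T-cube, its roll-up sums along each band, and the confidence thresholds (80% of each sum) are:

             B1=0    B1=1  |  sum   threshold
  B2=1         5      19   |   24     19.2
  B2=0        25      15   |   40     32
  -------------------------+
  sum         30      34
  threshold   24      27.2

Only the cell (B1=0, B2=0) = 25 reaches one of its thresholds (25 ≥ 24), so

C: B1={0} => B2={0}, c = 25/30 = 83.3%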
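The roll-up and threshold comparison of this example can be sketched in a few lines of Python. This is only an illustration: it reads the 2x2 T-cube as (B1=0,B2=0)=25, (B1=0,B2=1)=5, (B1=1,B2=0)=15, (B1=1,B2=1)=19, which is the assignment consistent with the sums 30 and 34 and with the 83.3% rule. The function name and the choice to test support on the rule's own cell are assumptions, not part of the slides.

```python
# T-cube counts read off the example: T[(b1, b2)] = number of pixels
T = {(0, 0): 25, (0, 1): 5, (1, 0): 15, (1, 1): 19}

def confident_rules(tcube, min_conf, min_supp):
    """Return (rule, confidence %) pairs for a 2-band T-cube."""
    total = sum(tcube.values())
    # roll up the cube along each band by summation
    b1_sum, b2_sum = {}, {}
    for (b1, b2), n in tcube.items():
        b1_sum[b1] = b1_sum.get(b1, 0) + n
        b2_sum[b2] = b2_sum.get(b2, 0) + n
    rules = []
    for (b1, b2), n in sorted(tcube.items()):
        if n < min_supp * total:           # assumed support test on the rule's cell
            continue
        if n >= min_conf * b1_sum[b1]:     # antecedent B1, consequent B2
            rules.append((f"B1={b1} => B2={b2}", round(100 * n / b1_sum[b1], 1)))
        if n >= min_conf * b2_sum[b2]:     # antecedent B2, consequent B1
            rules.append((f"B2={b2} => B1={b1}", round(100 * n / b2_sum[b2], 1)))
    return rules

print(confident_rules(T, 0.80, 0.10))
```

With the thresholds of the example, only the rule B1=0 => B2=0 survives; B2=1 => B1=1 comes close (19/24) but stays below 80%.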

Page 39: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Methods to Improve Apriori's Efficiency

– Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.

– Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.

– Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.

– Sampling: mine a subset of the given data with a lower support threshold, plus a method to determine completeness.

– Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.

The core of the Apriori algorithm:

– Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets.

– Use database scans and pattern matching to collect counts for the candidate itemsets.

The bottleneck of Apriori is candidate generation:

1. Huge candidate sets: 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets. To discover a frequent pattern of size 100, e.g., {a1, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.

2. Multiple scans of the database: (n + 1) scans are needed, where n is the length of the longest pattern.
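To make the two core steps concrete, here is a minimal Python sketch of textbook Apriori: candidate generation from frequent (k-1)-itemsets, followed by one counting scan per level. The function names and the toy transactions are illustrative assumptions; none of the efficiency methods listed above are applied.

```python
from itertools import combinations

def generate_candidates(frequent_prev, k):
    """Join frequent (k-1)-itemsets and prune candidates that have an infrequent subset."""
    prev = set(frequent_prev)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates

def apriori(transactions, min_count):
    """Return all itemsets contained in at least min_count transactions."""
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if sum(i in t for t in transactions) >= min_count}
    result, k = set(frequent), 2
    while frequent:
        candidates = generate_candidates(frequent, k)
        # one database scan per level to count the candidates
        frequent = {c for c in candidates
                    if sum(c <= t for t in transactions) >= min_count}
        result |= frequent
        k += 1
    return result

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(sorted(sorted(s) for s in apriori(transactions, 3)))
```

Each level costs a full scan of the transactions, which is the second bottleneck named above.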

Page 40: P-Trees: Universal Data Structure for Query Optimization to Data Mining

SMILEY Distributed P-Tree Architecture

– Synchronized dataset to be data mined, R(A1,…,An), at some URL (RSI, genomic dataset, MBR dataset, …)

– Use of data mining results: DADI (Drag And Drop Interface), DVI (Data Visualization Interface)

– SMILEY, BisonBlast, BisonArray (ARM, classification, clustering algorithm implementations)

– DCI (Data Capture Interface), DMI (Data Mining Interface)

– Basic P-trees striped (distributed) across a Beowulf cluster (MidAS) (parameters: qid striping level(s), fanout, implementation model, …)

Page 41: P-Trees: Universal Data Structure for Query Optimization to Data Mining

BSM: A Bit-Level Decomposition Storage Model

A model of query optimization of all types

Vertical partitioning has been studied in the context of both centralized and distributed database systems. It is a good strategy when most queries retrieve only a small number of columns. The decomposition of a relation also permits a number of transactions to execute concurrently. Copeland et al. presented an attribute-level decomposition storage model (DSM) [CK85], storing each column of a relational table in a separate binary table. The DSM showed great comparability in performance.

Beyond attribute-level decomposition, Wong et al. further took advantage of encoding attribute values using a small number of bits to reduce the storage space [WLO+85]. In this paper, we decompose the attributes of relational tables to the bit-position level, apply an SPJ query optimization strategy to them, store the query results in one relational table, and finally mine the data using P-tree methods.

Our method offers these advantages:

– (1) With vertical partitioning, we need to read only what we need. This makes hardware caching work well and greatly increases the effectiveness of the I/O device.

– (2) We encode attribute values in bit vector format, which makes compression easy.

– (3) SPJ queries can be formulated as Boolean expressions, which facilitates fast implementation in hardware.

– (4) Our model is suited not only to query processing but to data mining as well.
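As a rough illustration of points (2) and (3), the following Python sketch decomposes an integer column into one bit vector per bit position and answers an equality selection with a Boolean expression over those vectors. The function names and the sample column are assumptions made for the example; it does not reproduce the authors' storage format or compression.

```python
def bit_slices(values, width):
    """Decompose integers into `width` bit vectors, most significant bit first."""
    return [[(v >> (width - 1 - j)) & 1 for v in values] for j in range(width)]

def equals_mask(slices, constant):
    """Row mask for column == constant.

    Each slice is ANDed in as-is where the constant has a 1-bit and
    complemented where the constant has a 0-bit.
    """
    width = len(slices)
    mask = [1] * len(slices[0])
    for j, column_bits in enumerate(slices):
        want = (constant >> (width - 1 - j)) & 1
        mask = [m & (b if want else 1 - b) for m, b in zip(mask, column_bits)]
    return mask

column = [6, 5, 1, 6, 7, 3]
slices = bit_slices(column, 3)
print(slices)                  # three bit vectors, one per bit position
print(equals_mask(slices, 6))  # rows where the column equals 6
```

Only the bit vectors named in the predicate are touched, which is the point of advantage (1).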

• [CK85] G. Copeland and S. Khoshafian. A Decomposition Storage Model. Proc. ACM Int. Conf. on Management of Data (SIGMOD’85), pp.268-279, Austin, TX, May 1985.

• [WLO+85] H. K. T. Wong, H.-F. Liu, F. Olken, D. Rotem, and L. Wong. Bit Transposed Files. Proc. Int. Conf. on Very Large Data Bases (VLDB’85), pp.448-457, Stockholm, Sweden, 1985.

Page 42: P-Trees: Universal Data Structure for Query Optimization to Data Mining

SPJ Query Optimization Strategies - One-table Selections

There are two categories of queries in one-table selections: equality queries and range queries. Most techniques [WLO+85, OQ97, CI98] used to optimize them employ encoding schemes: equality encoding and range encoding. Chan and Ioannidis [CI99] defined a more general query format called the interval query. An interval query on attribute A is a query of the form “x≤A≤y” or “NOT (x≤A≤y)”. It reduces to an equality query or a range query when x and y satisfy the corresponding conditions (for example, x = y gives an equality query).

We defined interval P-trees in previous work [DKR+02]; they are equivalent to the bit vectors of the corresponding intervals. So for each restriction of the form above, we have one corresponding interval P-tree. The result of ANDing all the corresponding interval P-trees represents all the rows that satisfy the conjunction of all the restrictions in the WHERE clause.
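The idea can be sketched with uncompressed bit vectors standing in for interval P-trees. In the Python sketch below each bitmap is an integer whose bit i describes row i. The comparison loop is a standard bit-sliced "less than or equal" evaluation, used here as an assumption about how an interval bit vector could be computed; it is not the authors' P-tree algebra, and all names and sample columns are illustrative.

```python
def to_bitmaps(values, width):
    """One integer bitmap per bit position (MSB first); row i is bit i of each bitmap."""
    return [sum(((v >> (width - 1 - j)) & 1) << i for i, v in enumerate(values))
            for j in range(width)]

def le_bitmap(bitmaps, c, n_rows):
    """Rows with A <= c, scanning the bit positions from the most significant down."""
    full = (1 << n_rows) - 1
    width = len(bitmaps)
    lt, eq = 0, full                       # rows already smaller / still equal so far
    for j, bm in enumerate(bitmaps):
        if (c >> (width - 1 - j)) & 1:
            lt |= eq & ~bm & full          # a 0 where c has a 1 means strictly smaller
            eq &= bm
        else:
            eq &= ~bm & full
    return lt | eq

def interval_bitmap(bitmaps, x, y, n_rows):
    """Rows with x <= A <= y (assumes 0 <= x <= y and y fits in the bit width)."""
    full = (1 << n_rows) - 1
    below_x = le_bitmap(bitmaps, x - 1, n_rows) if x > 0 else 0
    return le_bitmap(bitmaps, y, n_rows) & ~below_x & full

def rows(bitmap):
    """Row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

A = [6, 5, 1, 6, 7, 3, 0, 4]
B = [4, 3, 1, 2, 6, 7, 2, 5]
mask_a = interval_bitmap(to_bitmaps(A, 3), 3, 6, len(A))   # 3 <= A <= 6
mask_b = interval_bitmap(to_bitmaps(B, 3), 2, 5, len(B))   # 2 <= B <= 5
print(rows(mask_a & mask_b))   # rows satisfying both restrictions
```

The final AND of the two interval bitmaps corresponds to the conjunction in the WHERE clause.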

• [CI98] C.Y. Chan and Y. Ioannidis. Bitmap Index Design and Evaluation. Proc. ACM Intl. Conf. on Management of Data (SIGMOD’98), pp.355-366, Seattle, WA, June 1998.

• [CI99] C.Y. Chan and Y.E. Ioannidis. An Efficient Bitmap Encoding Scheme for Selection Queries. Proc. ACM Intl. Conf. on Management of Data (SIGMOD’99), pp.216-226, Philadelphia, PA, 1999.

• [DKR+02] Q. Ding, M. Khan, A. Roy, and W. Perrizo. The P-tree algebra. Proc. ACM Symposium Applied Computing (SAC 2002), pp.426-431, Madrid, Spain, 2002.

• [OQ97] P. O’Neil and D. Quass. Improved Query Performance with Variant Indexes. Proc. ACM Int. Conf. on Management of Data (SIGMOD’97), pp.38-49, Tucson, AZ, May 1997.

Page 43: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Select-Project-StarJoin (SPSJ) Queries

A Select-Project-StarJoin query is an SPJ query in which there is one multiway join along with selections and projections. Typically there is a central fact relation to which several dimension relations are joined. The dimension relations can be viewed as points on a star centered on the fact relation. For example, given the Student (S), Course (C), and Enrollment (E) database shown below (the bit encoding of certain attributes is shown after each value), take the SPSJ query

SELECT S.s, S.name, C.name FROM S, C, E WHERE S.s=E.s AND C.c=E.c AND S.gen=M AND E.grade=A AND C.term=S

S                    C                       E
|s    |name |gen|    |c    |name|st|term|    |s    |c    |grade|
|0 000|CLAY |M 0|    |0 000|BI  |ND|F 0 |    |0 000|1 001|B 10 |
|1 001|THAIS|M 0|    |1 001|DB  |ND|S 1 |    |0 000|0 000|A 11 |
|2 010|GOOD |F 1|    |2 010|DM  |NJ|S 1 |    |3 011|1 001|A 11 |
|3 011|BAID |F 1|    |3 011|DS  |ND|F 0 |    |3 011|3 011|D 00 |
|4 100|PERRY|M 0|    |4 100|SE  |NJ|S 1 |    |1 001|3 011|D 00 |
|5 101|JOAN |F 1|    |5 101|AI  |ND|F 0 |    |1 001|0 000|B 10 |
                                             |2 010|2 010|B 10 |
                                             |2 010|3 011|A 11 |
                                             |4 100|4 100|B 10 |
                                             |5 101|5 101|B 10 |

The bSQ attributes are stored as follows (note, e.g., bit-1 of S.s has been labeled Ss1, etc.). Each vector is written in groups of four bits, read top to bottom.

Ss1  Ss2  Ss3  Sg   Cc1  Cc2  Cc3  Ct   Es1  Es2  Es3  Ec1  Ec2  Ec3  Eg1  Eg2
0011 0000 0101 0001 0011 0000 0101 0110 0000 0000 0011 0000 0010 1010 1101 0100
00   11   01   11   00   11   01   10   0000 1111 1100 0000 0111 1101 1011 1001
                                        11   00   01   11   00   01   11   00

BSQ attributes stored as single attribute files:

S.name   C.name  C.st
|CLAY |  |BI |   |ND|
|THAIS|  |DB |   |ND|
|GOOD |  |DM |   |NJ|
|BAID |  |DS |   |ND|
|PERRY|  |SE |   |NJ|
|JOAN |  |AI |   |ND|

Page 44: P-Trees: Universal Data Structure for Query Optimization to Data Mining

For character string attributes, LZW, run-length, or some other compression scheme could be used to further reduce storage requirements. The compression scheme should be chosen so that any range of offset entries can be uncompressed independently of the rest. Each of these BSQ files would require only a few pages of storage, allowing the entire BSQ file to be brought into memory whenever any portion of it is needed, thus eliminating the need for indexes and paging.

A bit mask is formed for each selection as follows. The bit mask for S.gen=M is just the complement of S.g (since M has been coded as 0), therefore mS=Sg'. Similarly, mC=Ct and mE=Eg1 AND Eg2.

mS    mC    mE
1110  0110  0100
00    10    1001
            00

Logically ANDing mE into the E.s and E.c attributes reduces E.s and E.c to the three positions selected by mE:

position  Es1 Es2 Es3  Ec1 Ec2 Ec3
1          0   0   0    0   0   0
4          0   1   1    0   0   1
7          0   1   0    0   1   1

We note that the S.s and C.c attributes would need to be reduced only when S.s and C.c are not already surrogate attributes. In E, each tuple is compared with the participation masks, mS and mC, to eliminate non-participating (E.s, E.c) pairs. The (E.s, E.c) pairs in binary are (000, 000), (011, 001) and (010, 011) or, in decimal, (0,0), (3,1) and (2,3). The masks mS and mC reveal that S.s=0,1,4 and C.c=1,2,4 are the participating values. Therefore (3,1) and (2,3) are non-participating pairs and can be eliminated, because S.s=3 and S.s=2 do not participate. The pair (0,0) passes mS but not mC, since C.c=0 (BI, offered in term F) is not a participating value, so with the data shown no participating pair remains and the result is empty. Had (0,0) participated, only the S.name value at offset 0 and the C.name value at offset 0 would need to be retrieved, giving the output (0, CLAY, BI).
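The mask computation can be replayed directly on the bit vectors listed for this example. The Python sketch below is an illustration only: it uses plain lists instead of P-trees, the helper names are assumptions, and the vectors are copied from the slide in their stored order. It prints the participating key values and the (E.s, E.c) pairs selected by mE; with the vectors as listed, the pair (0,0) is dropped by mC because C.c=0 is not among the participating course values.

```python
def bits(text):
    """Parse a vector written in groups, e.g. '0011 00', into a list of 0/1."""
    return [int(ch) for ch in text.replace(" ", "")]

def and_(a, b):
    return [x & y for x, y in zip(a, b)]

def not_(a):
    return [1 - x for x in a]

def decode(columns, mask):
    """Decimal value of each row selected by mask, from bit columns given MSB first."""
    return [int("".join(str(col[i]) for col in columns), 2)
            for i, keep in enumerate(mask) if keep]

# bSQ vectors as listed on the slide
Ss = [bits("0011 00"), bits("0000 11"), bits("0101 01")]
Sg = bits("0001 11")
Cc = [bits("0011 00"), bits("0000 11"), bits("0101 01")]
Ct = bits("0110 10")
Es = [bits("0000 0000 11"), bits("0000 1111 00"), bits("0011 1100 01")]
Ec = [bits("0000 0000 11"), bits("0010 0111 00"), bits("1010 1101 01")]
Eg1, Eg2 = bits("1101 1011 11"), bits("0100 1001 00")

mS, mC, mE = not_(Sg), Ct, and_(Eg1, Eg2)   # gen=M, term=S, grade=A
s_participants = set(decode(Ss, mS))
c_participants = set(decode(Cc, mC))
pairs = list(zip(decode(Es, mE), decode(Ec, mE)))
survivors = [(s, c) for s, c in pairs if s in s_participants and c in c_participants]

print(s_participants, c_participants)
print(pairs, survivors)
```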

To review, once the basic P-trees for the join and selection attributes have been processed to remove all non-participants, only the participating BSQ values need to be accessed. The basic P-tree files for the join and selection attributes would typically be striped across a cluster of nodes so that the AND operations can be done very quickly in parallel. Our implementation on a 16-node cluster of 266 MHz Pentium computers shows that any multiway AND operation can be done in a few milliseconds.

Page 45: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Select-Project-Join (SPJ) Queries

We deal with an example in which more than one join is required and there is more than one join attribute (a bushy query tree). We organize our query trees using the "constellation" model, in which one of the fact files is considered central and the others are points in a star around that central fact file. Each secondary star-point fact file can be the center of a "sub-star". We apply the selection masks first. Then we perform semi-joins from the boundary toward the central fact file. Finally we perform semi-joins back out again. The result is the full elimination of all non-participants. The following is an example of such a bushy query. Those details that are identical to the above are not repeated here.

SELECT S.n, C.n, R.capacity FROM S, C, E, O, R WHERE S.s=E.s AND C.c=O.c AND O.o=E.o AND O.r=R.r AND S.gen=M AND C.cred=2 AND E.grade=A AND R.capacity=10;

In this query, O is taken as the central fact relation of the following database.

S              C               E                     O                   R
|s    |n|gen|  |c   |n|cred|   |s    |o    |grade|   |o    |c   |r   |   |r   |capacity|
|0 000|A|M 0|  |0 00|B|1 01|   |0 000|1 001|2 10 |   |0 000|0 00|1 01|   |0 00|30 11   |
|1 001|T|M 0|  |1 01|D|3 11|   |0 000|0 000|3 11 |   |1 001|0 00|1 01|   |1 01|20 10   |
|2 010|S|F 1|  |2 10|M|3 11|   |3 011|1 001|3 11 |   |2 010|1 01|0 00|   |2 10|30 11   |
|3 011|B|F 1|  |3 11|S|2 10|   |3 011|3 011|0 00 |   |3 011|1 01|1 01|   |3 11|10 01   |
|4 100|C|M 0|                  |1 001|3 011|0 00 |   |4 100|2 10|0 00|
|5 101|J|F 1|                  |1 001|0 000|2 10 |   |5 101|2 10|2 10|
                               |2 010|2 010|2 10 |   |6 110|2 10|3 11|
                               |2 010|3 011|3 11 |   |7 111|3 11|2 10|
                               |4 100|4 100|2 10 |
                               |5 101|5 101|2 10 |

Single attribute files: Sn = (A, T, S, B, C, J); Cn = (B, D, M, S).

Bit vectors (each written in groups, read top to bottom):

Ss1  Ss2  Ss3  Sgen     Cc1 Cc2 Ccred1 Ccred2     Rr1 Rr2 Rcap1 Rcap2
0011 0000 0101 0001     00  01  01     11         00  01  11    10
00   11   01   11       11  01  11     10         11  01  10    11

Es1  Es2  Es3  Eo1  Eo2  Eo3     Egrade1 Egrade2
0000 0000 0011 0000 0010 1010    1101    0100
0000 1111 1100 0000 0111 1101    1011    1001
11   00   01   11   00   01      11      00

Oo1  Oo2  Oo3  Oc1  Oc2  Or1  Or2
0011 0000 0101 0011 0000 0001 1100
0011 1111 0101 0011 1101 0011 0110

Page 46: P-Trees: Universal Data Structure for Query Optimization to Data Mining

1. Apply the selection masks

mE    mR  mC
0100  00  00
1001  10  01
00

to the following vectors (Rc1, Rc2 and Cr1, Cr2 abbreviate Rcap1, Rcap2 and Ccred1, Ccred2):

Es1  Es2  Es3  Eo1  Eo2  Eo3     Rc1 Rc2 Cr1 Cr2
0000 0000 0011 0000 0010 1010    11  10  01  11
0000 1111 1100 0000 0111 1101    10  11  11  10
11   00   01   11   00   01

Oo1  Oo2  Oo3  Oc1  Oc2  Or1  Or2
0011 0000 0101 0011 0000 0001 1100
0011 1111 0101 0011 1101 0011 0110

2. This results in the following (the O vectors are unchanged):

position  Es1 Es2 Es3  Eo1 Eo2 Eo3
1          0   0   0    0   0   0
4          0   1   1    0   0   1
7          0   1   0    0   1   1

Rc1 Rc2 Cr1 Cr2
1   1   1   0

3. Apply the join-attribute selection masks externally to further reduce the P-trees. From mS, s=0,1,4 are the participants; from mC, c=3 is the only participant; from mR, r=2 is the only participant.

mS    mC  mR
1110  00  00
00    01  10

This leaves the E, R and C vectors of step 2 as they are. In O, the Oo1, Oo2 and Oo3 vectors are unchanged, O.c retains a single entry 11 (c=3), and O.r retains two entries 10 (r=2).

4. Completing the elimination of newly discovered non-participants internally in each file leaves a single E tuple with s = 000 and o = 000, and a single O tuple with o = 111, c = 11 and r = 10.

5. Thus, s has to be 0 (000), o has to be 0 (000), c has to be 3 (11) and r has to be 2 (10). But o also has to be 7 (111). Since o cannot be both 0 and 7, there are no participants.

Page 47: P-Trees: Universal Data Structure for Query Optimization to Data Mining

DISTINCT Keyword, GROUP BY Clause, ORDER BY Clause, HAVING Clause and Aggregate Operations

Duplicate elimination after a projection (SQL DISTINCT keyword) is one of the most expensive operations in query processing. In general, it is as expensive as the join operation. However, in our approach, it can be done automatically while forming the output tuples (since that is done in order). While forming all output records for a particular value of the ORDER BY attribute, duplicates can be easily eliminated without the need for an expensive algorithm.

The ORDER BY and GROUP BY clauses are very commonly used in queries and can require a sorting of the output relation. However, in our approach, if the central relation is chosen to be the one with the sort attribute and the surrogation is according to the attribute order (typically the case, and always the case for numeric attributes), then the final output records can be put together and aggregated in the requested order without a separate sort step and at no additional cost. Aggregation operators such as COUNT, SUM, AVG, MAX, and MIN can be implemented without additional cost during the output formation step, and any HAVING decision can be made as output records are being composed as well.

If the COUNT aggregate is requested by itself, we note that P-trees automatically provide the full counts for any predicate with just one multiway AND operation.

Page 48: P-Trees: Universal Data Structure for Query Optimization to Data Mining

The following example illustrates these points.

SELECT DISTINCT C.c, R.capacity FROM S, C, E, O, R WHERE S.s=E.s AND C.c=O.c AND O.o=E.o AND O.r=R.r AND C.cred>1 AND (E.grade='B' OR E.grade='A') AND R.capacity>10 ORDER BY C.c;

The database is the same as in the preceding SPJ example: relations S, C, E, O and R, with the same bit vectors.

Apply the selection masks:

mE = Egrade1   mR = Rcap1   mC = Ccred1
1101           11           01
1011           10           11
11

Page 49: P-Trees: Universal Data Structure for Query Optimization to Data Mining

This results in the following reduced vectors (a "-" marks an eliminated position):

Es1  Es2  Es3  Eo1  Eo2  Eo3     Rr1 Rr2   Cc1 Cc2
00-0 00-0 00-1 00-0 00-0 10-0    00  01    -0  -1
0-00 1-11 1-00 0-00 0-11 1-01    1-  0-    11  01
11   00   01   11   00   01

Semijoining toward the center, E with O (on o=0,1,2,3,4,5), R with O (on r=0,1,2) and C with O (on c=1,2,3), reduces

Oo1  Oo2  Oo3  Oc1  Oc2  Or1  Or2
0011 0000 0101 0011 0000 0001 1100
0011 1111 0101 0011 1101 0011 0110

to

Oo1  Oo2  Oo3  Oc1  Oc2  Or1  Or2
--11 --00 --01 --11 --00 --01 --00
00-- 11-- 01-- 00-- 11-- 00-- 01--

Thus, the participants are c=1,2; r=0,1,2; o=2,3,4,5. Semijoining back again produces the following.

Cc1 Cc2   Rr1 Rr2   Es1  Es2  Es3  Eo1  Eo2  Eo3
-0  -1    00  01    ---- ---- ---- ---- ---- ----
1-  0-    1-  0-    --00 --11 --00 --00 --11 --01
                    11   00   01   11   00   01

Thus, the s participants are s=2,4,5:

Ss1  Ss2  Ss3
--11 --00 --01
0-   1-   0-

Output tuples are determined from the participating O.c P-trees. RC(PO.c(2)) = RC(Oc1 ^ Oc2') = 2, since

Oc1    Oc2'   Oc1 ^ Oc2'
--11   --11   --11
00--   00--   00--

The two 1-bits belong to the O-tuples with O.o surrogate values 4 and 5. The r-values of O.r for these two tuples are 0 and 2. Thus, we retrieve the R.capacity values at offsets 0 and 2. However, both of these R.capacity values are 30. Thus, this duplication is discovered without sorting or additional processing. The only output for c=2 is (2,30). Similarly, RC(PO.c(1)) = RC(Oc1' ^ Oc2) = 2, since

Oc1'   Oc2    Oc1' ^ Oc2
--00   --00   --00
11--   11--   11--

Finally note, if ORDER BY clause is over an attribute which is not in the relation O (e.g., over student number, s) then we center the query tree (or wheel) on a fact file that contains the ORDER BY attribute (e.g., on E in this case). If the ORDER BY attribute is not in any fact file (in a dimension file only) then the final query tree can be re-arranged to center on the dimension file containing that attribute.  Since output ordering and duplicate elimination are traditionally very expensive sub-operations of SPJ query processing, the fact that our BDM model and the P-tree data structure provide a fast and efficient way to accomplish these operations is a very favorable aspect of the approach.
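The semijoin sequence of this example (selection masks, semijoin toward the central file O, semijoin back out, ordered and duplicate-free output) can be checked with a small Python sketch. It works on plain tuples rather than on P-trees, so it only mirrors the logic, not the bit-level implementation. The variable names are illustrative, and the O tuple with o=0 is assumed to have r=1, following its bit code 01.

```python
# Relations of the example, keyed by their surrogate values (decimal).
C = {0: 1, 1: 3, 2: 3, 3: 2}                       # c -> cred
R = {0: 30, 1: 20, 2: 30, 3: 10}                   # r -> capacity
E = [(0, 1, "B"), (0, 0, "A"), (3, 1, "A"), (3, 3, "D"), (1, 3, "D"),
     (1, 0, "B"), (2, 2, "B"), (2, 3, "A"), (4, 4, "B"), (5, 5, "B")]  # (s, o, grade)
O = [(0, 0, 1), (1, 0, 1), (2, 1, 0), (3, 1, 1),
     (4, 2, 0), (5, 2, 2), (6, 2, 3), (7, 3, 2)]   # (o, c, r); r of o=0 assumed 1

def run_query():
    # 1. selection masks (S has no selection and every E.s occurs in S)
    c_ok = {c for c, cred in C.items() if cred > 1}
    r_ok = {r for r, cap in R.items() if cap > 10}
    e_ok = [(s, o) for s, o, grade in E if grade in ("A", "B")]
    # 2. semijoin toward the central file O
    o_in_e = {o for _, o in e_ok}
    o_part = [(o, c, r) for o, c, r in O
              if o in o_in_e and c in c_ok and r in r_ok]
    # 3. semijoin back out to find the remaining participants
    o_ok = {o for o, _, _ in o_part}
    participants = {
        "o": o_ok,
        "c": {c for _, c, _ in o_part},
        "r": {r for _, _, r in o_part},
        "s": {s for s, o in e_ok if o in o_ok},
    }
    # 4. compose the output one C.c value at a time; a set per value removes
    #    duplicates (sorted() is used here only to make the printout stable)
    output = []
    for c in sorted(participants["c"]):
        capacities = {R[r] for _, oc, r in o_part if oc == c}
        output.extend((c, cap) for cap in sorted(capacities))
    return participants, output

participants, output = run_query()
print(participants)
print(output)
```

The participants agree with the values derived above (c=1,2; r=0,1,2; o=2,3,4,5; s=2,4,5), and the single tuple for c=2 shows the duplicate capacity 30 being collapsed while the group is formed.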

Page 50: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Combining Data Mining and Query Processing

Many data mining requests involve pre-selection, pre-join, and pre-projection of a database to isolate the specific data subset to which the data mining algorithm is to be applied. For example, in the above database, one might be interested in all association rules of a given support threshold and confidence threshold across all the relations of the database. The brute-force way to do this is to first join all relations into one universal relation and then to mine that gigantic relation. This is not a feasible solution in most cases due to the size of the resulting universal relation. Furthermore, some selection on that universal relation is often desirable prior to the mining step.

Our approach accommodates combinations of querying and data mining without necessitating the creation of a massive universal relation as an intermediate step. Essentially, the full vertical partitioning and P-trees provide a selection and join path which can be combined with the data mining algorithm to produce the desired solution without extensive processing and massive space requirements. The collection of P-trees and BSQ files constitutes a lossless, compressed version of the universal relation. Therefore the above techniques, when combined with the required data mining algorithm, can produce the combined result very efficiently and directly.

Page 51: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Appendix on examples and implementations

Example1: One band, B1, with 3-bit precision

The band B1 and its three bit planes B11, B12, B13, in raster order:

B1                B11               B12               B13
6 6 6 6 5 5 1 1   1 1 1 1 1 1 0 0   1 1 1 1 0 0 0 0   0 0 0 0 1 1 1 1
6 6 6 6 5 1 1 1   1 1 1 1 1 0 0 0   1 1 1 1 0 0 0 0   0 0 0 0 1 1 1 1
6 6 6 6 5 5 0 1   1 1 1 1 1 1 0 0   1 1 1 1 0 0 0 0   0 0 0 0 1 1 0 1
6 6 6 6 5 5 5 0   1 1 1 1 1 1 1 0   1 1 1 1 0 0 0 0   0 0 0 0 1 1 1 0
7 6 7 7 5 5 5 5   1 1 1 1 1 1 1 1   1 1 1 1 0 0 0 0   1 0 1 1 1 1 1 1
6 6 7 7 5 5 5 5   1 1 1 1 1 1 1 1   1 1 1 1 0 0 0 0   0 0 1 1 1 1 1 1
7 7 4 6 5 5 5 5   1 1 1 1 1 1 1 1   1 1 0 1 0 0 0 0   1 1 0 0 1 1 1 1
3 7 6 6 5 5 5 5   0 1 1 1 1 1 1 1   1 1 1 1 0 0 0 0   1 1 0 0 1 1 1 1

The P-trees in vector form, with PNP0V and P1V combined into one table per tree:

P11                     P12                     P13
qid      NP0   P1       qid      NP0   P1       qid      NP0   P1
[ ]      1111  1001     [ ]      1010  1000     [ ]      0111  0001
[01]     1011  0010     [10]     1111  1110     [01]     1111  1110
[10]     1111  1101     [10.11]  0111           [10]     1110  0110
[01.00]  1110  1110                             [01.11]  0110
[01.11]  0010  0010                             [10.00]  1000
[10.10]  1101  1101

The second vector in the leaf rows of P11 is redundant, since at a leaf NP0 = P1.
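As a cross-check of the P11 table, the following Python sketch builds the (NP0, P1) vector pairs for the bit plane B11 by recursive quadrant decomposition in Peano order. It is a top-down illustration only; the function name and the dictionary layout are assumptions, and leaf quadrants store a single vector.

```python
B11 = [
    "11111100", "11111000", "11111100", "11111110",
    "11111111", "11111111", "11111111", "01111111",
]

def build_ptree(raster):
    """Return {qid: (NP0, P1)} for inner nodes and {qid: bits} for mixed leaf quadrants."""
    table = {}

    def visit(row, col, size, qid):
        half = size // 2
        # children in Peano order: 00 upper-left, 01 upper-right,
        # 10 lower-left, 11 lower-right
        kids = [(row, col), (row, col + half), (row + half, col), (row + half, col + half)]
        if size == 2:                      # leaf quadrant: four pixels
            table[qid] = "".join(raster[r][c] for r, c in kids)
            return
        np0, p1 = "", ""
        for name, (r, c) in zip(("00", "01", "10", "11"), kids):
            ones = sum(raster[i][j] == "1"
                       for i in range(r, r + half) for j in range(c, c + half))
            np0 += "0" if ones == 0 else "1"            # not pure-0
            p1 += "1" if ones == half * half else "0"   # pure-1
            if 0 < ones < half * half:                  # mixed child: descend
                visit(r, c, half, name if qid == "" else qid + "." + name)
        table[qid] = (np0, p1)

    visit(0, 0, len(raster), "")
    return table

for qid, vectors in sorted(build_ptree(B11).items()):
    print("[" + qid + "]", vectors)
```

Only mixed quadrants get a row, which is where the compression of pure regions comes from.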

Page 52: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example1: ANDing to get rc P1(6)

Notation: P1_1k and NP0_1k are the P1 and NP0 vectors of P-tree P1k, ' is the bit complement, and CNT = (1-count) * 4^level.

P1(6) = P1(110) = P1_11 ^ P1_12 ^ P0_13 = P1_11 ^ P1_12 ^ NP0'_13

PM1(110) = P1(110) xor NP0(110) = (P1_11 ^ P1_12 ^ NP0'_13) xor (NP0_11 ^ NP0_12 ^ P1'_13)

At [ ]: CNT[ ] = 1 * 4^2 = 16, since P1(110)[ ] = 1001 ^ 1000 ^ 1000 = 1000.

PM1(110)[ ] = (1001 ^ 1000 ^ 1000) xor (1111 ^ 1010 ^ 1110) = 1000 xor 1010 = 0010, so [10] is the only mixed child.

At [10]: CNT[10] = 0 * 4^1 = 0, since P1(110)[10] = 1101 ^ 1110 ^ 0001 = 0000.

PM1(110)[10] = (1101 ^ 1110 ^ 0001) xor (1111 ^ 1111 ^ 1001) = 0000 xor 1001 = 1001, so [10.00] and [10.11] are the mixed children.

At [10.00]: CNT[10.00] = 3 * 4^0 = 3, since P1(110)[10.00] = 1111 ^ 1111 ^ 0111 = 0111.

At [10.11]: CNT[10.11] = 3 * 4^0 = 3, since P1(110)[10.11] = 1111 ^ 0111 ^ 1111 = 0111.

Thus, rcP1(6) = 16 + 0 + 3 + 3 = 22.

Bp  qid      NP0   P1
11  [ ]      1111  1001
12  [ ]      1010  1000
13  [ ]      0111  0001
11  [01]     1011  0010
13  [01]     1111  1110
11  [01.00]  1110
11  [01.11]  0010
13  [01.11]  0110
11  [10]     1111  1101
12  [10]     1111  1110
13  [10]     1110  0110
13  [10.00]  1000
11  [10.10]  1101
12  [10.11]  0111

For P(p) = P(100- ---- , … , 011- ---- ), at each [..]:

1. Swap and take the bit complement of each [..]NP0V, [..]P1V pair corresponding to a 0-bit.

2. AND the resulting vector pairs. Result: [..]NP0V(p), [..]P1V(p).

3. To get PMV(p) for the next level, xor the two vectors.

Page 53: P-Trees: Universal Data Structure for Query Optimization to Data Mining

ANDing in the NP0V-P1V Vector-Pair Format

For P(p) = P(110- ---- , … , ---- ---- ) (the previous example, P1(6), at qid [ ]), at each [..]:

1. Swap and complement each [..]NP0V, [..]P1V pair corresponding to a 0-bit. The result is denoted with *.

2. AND the resulting vector pairs. Result: [..]NP0V(p), [..]P1V(p).

3. To get PMV(p) for the next level, xor the two vectors to get [..]PMV(p).

pos  bit  NP0V  P1V     NP0V*  P1V*
1    1    1111  1001    1111   1001
2    1    1010  1000    1010   1000
3    0    0111  0001    1110   1000
…
AND over all positions: NP0V(p) = 1010, P1V(p) = 1000

PMV(p) = 1010 xor 1000 = 0010
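The three steps can be traced in code on the vector-pair table of Example 1. The Python sketch below is a simplified, single-machine reading of the procedure: the table layout, the function names and the way pure quadrants are propagated to lower levels are assumptions made for the illustration. It walks only the mixed quadrants indicated by PMV and accumulates (1-count) * 4^level.

```python
# Vector-pair table of Example 1: tree k holds bit k of band B1.
# Inner nodes map qid -> (NP0, P1); mixed leaf quadrants map qid -> bits.
TREES = {
    1: {"": ("1111", "1001"), "01": ("1011", "0010"), "10": ("1111", "1101"),
        "01.00": "1110", "01.11": "0010", "10.10": "1101"},
    2: {"": ("1010", "1000"), "10": ("1111", "1110"), "10.11": "0111"},
    3: {"": ("0111", "0001"), "01": ("1111", "1110"), "10": ("1110", "0110"),
        "01.11": "0110", "10.00": "1000"},
}
FULL = 0b1111

def child_states(np0, p1):
    """Classify the four children (00, 01, 10, 11) of a node from its vector pair."""
    states = []
    for i in range(4):
        m = 1 << (3 - i)
        states.append("pure1" if p1 & m else "mixed" if np0 & m else "pure0")
    return states

def vector_pair(tree, qid, state, level):
    """(NP0, P1) of a node; pure quadrants have no table row of their own."""
    if state == "pure1":
        return FULL, FULL
    if state == "pure0":
        return 0, 0
    entry = tree[qid]
    if level == 0:                 # leaf: a single vector, NP0 == P1
        v = int(entry, 2)
        return v, v
    return int(entry[0], 2), int(entry[1], 2)

def root_count(value_bits, qid="", states=("mixed",) * 3, level=2):
    np0_and, p1_and, kids = FULL, FULL, []
    for bit, tree_id, state in zip(value_bits, (1, 2, 3), states):
        np0, p1 = vector_pair(TREES[tree_id], qid, state, level)
        kids.append(child_states(np0, p1))
        if bit == "0":             # step 1: swap and complement for a 0-bit
            np0, p1 = ~p1 & FULL, ~np0 & FULL
        np0_and &= np0             # step 2: AND the vector pairs
        p1_and &= p1
    count = bin(p1_and).count("1") * 4 ** level
    if level == 0:
        return count
    mixed = np0_and ^ p1_and       # step 3: PMV marks the children to visit
    for i, name in enumerate(("00", "01", "10", "11")):
        if mixed & (1 << (3 - i)):
            child = name if qid == "" else qid + "." + name
            count += root_count(value_bits, child, [k[i] for k in kids], level - 1)
    return count

print(root_count("110"))   # root count of value 6 in band B1
```

For the value 110 the recursion visits [ ], [10], [10.00] and [10.11], the same quadrants as in the worked example.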

Page 54: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Distributed P trees?

Assume a 5-computer cluster: NodeC, Node00, Node01, Node10, Node11.

Send to Nij if the qid ends in ij:

NodeC                     Node00
Bp  qid      NP0   P1     Bp  qid      NP0   P1
11  [ ]      1111  1001   11  [01.00]  1110
12  [ ]      1010  1000   13  [10.00]  1000
13  [ ]      0111  0001

Node01                    Node10
Bp  qid      NP0   P1     Bp  qid      NP0   P1
11  [01]     1011  0010   11  [10]     1111  1101
13  [01]     1111  1110   11  [10.10]  1101
                          12  [10]     1111  1110
                          13  [10]     1110  0110

Node11
Bp  qid      NP0   P1
11  [01.11]  0010
12  [10.11]  0111
13  [01.11]  0110


With P1_1k and NP0_1k denoting the P1 and NP0 vectors of P-tree P1k, and ' the bit complement:

P1(110) = P1_11 ^ P1_12 ^ P0_13 = P1_11 ^ P1_12 ^ NP0'_13

PM1(110) = (P1_11 ^ P1_12 ^ NP0'_13) xor (NP0_11 ^ NP0_12 ^ P1'_13)

At NC: CNT[ ] = 1 * 4^2 = 16, since P1(110)[ ] = 1001 ^ 1000 ^ 1000 = 1000.

PM1(110)[ ] = (1001 ^ 1000 ^ 1000) xor (1111 ^ 1010 ^ 1110) = 0010

At N10: CNT[10] = 0 * 4^1 = 0, since P1(110)[10] = 1101 ^ 1110 ^ 0001 = 0000.

PM1(110)[10] = (1101 ^ 1110 ^ 0001) xor (1111 ^ 1111 ^ 1001) = 0000 xor 1001 = 1001

At N00: CNT[10.00] = 3 * 4^0 = 3, since P1(110)[10.00] = 1111 ^ 1111 ^ 0111 = 0111.

At N11: CNT[10.11] = 3 * 4^0 = 3, since P1(110)[10.11] = 1111 ^ 0111 ^ 1111 = 0111.

Every node sends its accumulated CNT to C, where rcP1(6) = 16 + 0 + 3 + 3 = 22 is calculated.
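The placement rule "send to Nij if the qid ends in ij" is simple enough to state as code. The Python sketch below groups the rows of the combined table by destination node; the function names are illustrative, and only the (tree, qid) part of each row is kept since the vectors do not affect placement.

```python
from collections import defaultdict

# (tree, qid) rows of the combined vector-pair table for P11, P12, P13
ROWS = [("11", ""), ("12", ""), ("13", ""), ("11", "01"), ("13", "01"),
        ("11", "01.00"), ("11", "01.11"), ("13", "01.11"), ("11", "10"),
        ("12", "10"), ("13", "10"), ("13", "10.00"), ("11", "10.10"), ("12", "10.11")]

def node_for(qid):
    """Destination node for a qid; the root quadrant [ ] stays on node C."""
    return "C" if qid == "" else qid.split(".")[-1]

def stripe(rows):
    """Group table rows by the node that stores them."""
    placement = defaultdict(list)
    for tree, qid in rows:
        placement[node_for(qid)].append((tree, qid))
    return dict(placement)

for node, stored in sorted(stripe(ROWS).items()):
    print(node, stored)
```

With this rule, the quadrants visited during one AND ([ ], [10], [10.00], [10.11] in the example) land on different nodes, so each level of the computation is handed to another machine.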

Page 55: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Distributed P trees?

P11                     P12                     P13
qid      NP0   P1       qid      NP0   P1       qid      NP0   P1
[ ]      1111  1001     [ ]      1010  1000     [ ]      0111  0001
[01]     1011  0010     [10]     1111  1110     [01]     1111  1110
[10]     1111  1101     [10.11]  0111           [10]     1110  0110
[01.00]  1110                                   [01.11]  0110
[01.11]  0010                                   [10.00]  1000
[10.10]  1101

Alternatively, send to Nodeij if the qid starts with the qid segment ij. Is this better? How would the AND code be revised? What about AND performance?

Or: send to Nodeij if the qid segment at the largest position divisible by p is ij. E.g., if p=4: [0]->0; [0.3]->0; [0.3.2]->0; [0.3.2.2]->2; [0.3.2.2.3]->2; [0.3.2.2.3.1]->2; [0.3.2.2.3.1.0]->2; [0.3.2.2.3.1.0.1]->1; etc. This is similar to fanout 4. Implement it by multicasting externally only at every 4th segment. More generally, choose any increasing sequence p=(p1..pL), define ⌊x⌋p = max{pi ≤ x}, then multicast [s1.s2…sk] --> Node s⌊k⌋p.

Placement when striping by the first qid segment:

Node00: (nothing)        Node11: (nothing)

NodeC
Bp  qid      NP0   P1
11  [ ]      1111  1001
12  [ ]      1010  1000
13  [ ]      0111  0001

Node01                    Node10
Bp  qid      NP0   P1     Bp  qid      NP0   P1
11  [01]     1011  0010   11  [10]     1111  1101
11  [01.00]  1110         11  [10.10]  1101
11  [01.11]  0010         12  [10]     1111  1110
13  [01]     1111  1110   12  [10.11]  0111
13  [01.11]  0110         13  [10]     1110  0110
                          13  [10.00]  1000

Page 56: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Distributed P trees?


Alternatively, the sequence can be a tree in the most general setting, i.e., a different sequence can be used on different branches, tuned to the very best tree of "multicast delays". Define a function F: {set of qids} --> {0,1,...} where, if F([q1.q2...qn]) = p > 0, then F([q1.q2...qn-1]) = p-1, and if F([q1.q2...qn]) = 0, then there is a multicast at this level. Said another way, there is a "multicast tree" that tells you when to multicast (to the node corresponding to the last segment of the qid).

[Figure: example multicast tree rooted at [ ], with multicast points at qids such as [0.1], [0.0.0], [3.3.3.3] and [0.1.3.3.3].]

Each node knows whether it is supposed to make a distributed call for the next level or whether it is supposed to compute that level itself (multicast to itself) by consulting the tree (or we could attach that information when we stripe). In this way we have full flexibility to tune the multicast-compute balance to minimize execution time, on a "per P-tree basis".

The AD-implementation vector format (All Digital) replaces the qid column with a depth-first ordered vector indicating the mixed inodes.

Page 57: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example 1 (bottom-up)1 1 1 1 1 1 0 01 1 1 1 1 0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1

B11

6 6 6 6 5 5 1 16 6 6 6 5 1 1 1 6 6 6 6 5 5 0 1 6 6 6 6 5 5 5 0 7 6 7 7 5 5 5 5 6 6 7 7 5 5 5 5 7 7 4 6 5 5 5 5 3 7 6 6 5 5 5 5

Band, B1, with 3-bit values

Bp qid NP0 P111[00.00] 1111

Bp qid NP0 P111[00.00] 111111[00.01] 1111

Bp qid NP0 P111[00.00] 111111[00.01] 111111[00.10] 1111

Bp qid NP0 P111[00.00] 111111[00.01] 111111[00.10] 111111[00.11] 1111

Bp qid NP0 P111[00] 0000 1111

Bp qid NP0 P111[00] 0000 111111[01.00] 1110

This ends the possibility of a larger pure1 quad. So 00 can be installed in the parent as a pure1.

Bp qid NP0 P111[01.00] 111011[01.01] 0000

Mixed leaf quad sent. This also ends the possibility that the parent is pure, so it and all its siblings are installed as bits in the parent.

11[01.10] 1111

11[01.11] 0001

Mixed leaf quad sent. This ends the parent, so install bits in the grandparent also.

Node-00Node-00 Bp qid NP0 P111[01.00] 1110

Node-01Node-01 Bp qid NP0 P111[01] 1011 0010

Node-10Node-10 Bp qid NP0 P1

Node-11Node-11 Bp qid NP0 P111[01.11] 0001

Node-CNode-C Bp qid NP0 P111[] 01__ 10__

Page 58: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example 1 (bottom-up)1 1 1 1 1 1 0 01 1 1 1 1 0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1

B11

6 6 6 6 5 5 1 16 6 6 6 5 1 1 1 6 6 6 6 5 5 0 1 6 6 6 6 5 5 5 0 7 6 7 7 5 5 5 5 6 6 7 7 5 5 5 5 7 7 4 6 5 5 5 5 3 7 6 6 5 5 5 5

Band, B1, with 3-bit values

Bp qid NP0 P111[10.00] 1111

Bp qid NP0 P111[10.00] 111111[10.01] 1111

Bp qid NP0 P111[10.00] 111111[10.01] 111111[10.10] 110111[10.11] 1111

Bp qid NP0 P111[11.00] 111111[11.01] 111111[11.10] 111111[11.11] 1111

Bp qid NP0 P111[11] 0000 1111

Node-00Node-00 Bp qid NP0 P111[01.00] 1110

Node-01Node-01 Bp qid NP0 P111[01] 1011 0010

Node-10Node-10 Bp qid NP0 P111[10.10] 110111[10] 1111 1101

Node-11Node-11 Bp qid NP0 P111[01.11] 0001

Node-CNode-C Bp qid NP0 P111[] 0111 1001

Ends the possibility of a larger pure1 quad. All can be installed in the parent/grandparent as a 1-bit. 10.10 can be installed.

Ends quad-11. All can be installed in the parent as a 1-bit.

Bottom-up bottom-line: Since it is better to use 2-D than 3-D (higher compression), it should be better to use 1-D than 2-D? This should be investigated.

Page 59: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2

B1 B11 B12 B13

6 6 6 6 5 5 1 16 6 6 6 5 1 1 1 6 6 6 6 5 6 6 6 6 5 5 0 7 6 7 7 5 5 5 5 6 6 7 7 5 5 5 7 7 4 6 5

1 1 1 1 1 1 0 01 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 0 0 0 01 1 1 1 0 0 0 0 1 1 1 1 0 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0

0 0 0 0 1 1 1 10 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 0 1

B2 B21 B22 B23

4 4 4 4 3 2 1 14 4 4 2 3 2 1 1 3 3 2 2 3 3 3 2 2 3 3 2 3 6 6 6 2 2 2 2 6 6 7 7 2 2 2 6 6 5 3 2

1 1 1 1 0 0 0 01 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 1 1 0 0

0 0 0 0 1 1 0 00 0 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1

0 0 0 0 1 0 1 10 0 0 0 1 0 1 1 1 1 0 0 1 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0

X, Y, B1, B2

000 000 6 4 000 001 6 4 000 010 6 4 000 011 6 4 000 100 5 3 000 101 5 2 000 110 1 1 000 111 1 1 001 000 6 4 001 001 6 4 001 010 6 4 001 011 6 2 001 100 5 3 001 101 1 2 001 110 1 1 001 111 1 1 010 000 6 3 010 001 6 3 010 010 6 2 010 011 6 2 010 100 5 3 011 000 6 3 011 001 6 3 011 010 6 2 011 011 6 2 011 100 5 3 011 101 5 3 011 111 0 2 100 111 5 2 100 000 7 3 100 001 6 6 100 010 7 6 100 011 7 6 100 100 5 2 100 101 5 2 100 110 5 2 101 000 6 6 101 001 6 6 101 010 7 7 101 011 7 7 101 100 5 2 101 101 5 2 101 110 5 2 111 000 7 6 111 001 7 6 111 010 4 5 111 011 6 3 111 100 5 2


Page 60: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2: Striping

The (X, Y, B1, B2) tuples are the same as those listed for Example2 above.

X, Y, B11 B12 B13 B21 B22 B23

0 0 0  0 0 0  1 1 0  1 0 0
0 0 0  0 0 1  1 1 0  1 0 0
0 0 0  0 1 0  1 1 0  1 0 0
0 0 0  0 1 1  1 1 0  1 0 0
0 0 0  1 0 0  1 0 1  0 1 1
0 0 0  1 0 1  1 0 1  0 1 0
0 0 0  1 1 0  0 0 1  0 0 1
0 0 0  1 1 1  0 0 1  0 0 1
0 0 1  0 0 0  1 1 0  1 0 0
0 0 1  0 0 1  1 1 0  1 0 0
0 0 1  0 1 0  1 1 0  1 0 0
0 0 1  0 1 1  1 1 0  0 1 0
0 0 1  1 0 0  1 0 1  0 1 1
0 0 1  1 0 1  0 0 1  0 1 0
0 0 1  1 1 0  0 0 1  0 0 1
0 0 1  1 1 1  0 0 1  0 0 1
0 1 0  0 0 0  1 1 0  0 1 1
0 1 0  0 0 1  1 1 0  0 1 1
0 1 0  0 1 0  1 1 0  0 1 0
0 1 0  0 1 1  1 1 0  0 1 0
0 1 0  1 0 0  1 0 1  0 1 1
0 1 1  0 0 0  1 1 0  0 1 1
0 1 1  0 0 1  1 1 0  0 1 1
0 1 1  0 1 0  1 1 0  0 1 0
0 1 1  0 1 1  1 1 0  0 1 0
0 1 1  1 0 0  1 0 1  0 1 1
0 1 1  1 0 1  1 0 1  0 1 1
0 1 1  1 1 1  0 0 0  0 1 0
1 0 0  0 0 0  1 1 1  0 1 1
1 0 0  0 0 1  1 1 0  1 1 0
1 0 0  0 1 0  1 1 1  1 1 0
1 0 0  0 1 1  1 1 1  1 1 0
1 0 0  1 0 0  1 0 1  0 1 0
1 0 0  1 0 1  1 0 1  0 1 0
1 0 0  1 1 0  1 0 1  0 1 0
1 0 0  1 1 1  1 0 1  0 1 0
1 0 1  0 0 0  1 1 0  1 1 0
1 0 1  0 0 1  1 1 0  1 1 0
1 0 1  0 1 0  1 1 1  1 1 1
1 0 1  0 1 1  1 1 1  1 1 1
1 0 1  1 0 0  1 0 1  0 1 0
1 0 1  1 0 1  1 0 1  0 1 0
1 0 1  1 1 0  1 0 1  0 1 0
1 1 0  0 0 0  1 1 1  1 1 0
1 1 0  0 0 1  1 1 1  1 1 0
1 1 0  0 1 0  1 0 0  1 0 1
1 1 0  0 1 1  1 1 0  0 1 1
1 1 0  1 0 0  1 0 1  0 1 0
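The step from the value table (X, Y, B1, B2) to the bit table can be sketched as follows. This is a minimal illustration only: the function names `bit_slices` and `slice_row` are hypothetical, and the 3-bit width is taken from the example data.

```python
def bit_slices(value, width=3):
    """Return the bits of `value`, most significant first, as a list of 0/1."""
    return [(value >> (width - 1 - j)) & 1 for j in range(width)]


def slice_row(x, y, b1, b2, width=3):
    """One raster row: X bits, Y bits, then B11..B13 and B21..B23."""
    return (bit_slices(x, width) + bit_slices(y, width)
            + bit_slices(b1, width) + bit_slices(b2, width))


# Row "000 100 5 3" from the value table.
row = slice_row(0b000, 0b100, 5, 3)
print(" ".join(map(str, row)))  # 0 0 0 1 0 0 1 0 1 0 1 1
```

Column j of band i (for example B13) is then the bit file for bit position j of band i.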

x1 y1 x2 y2 x3 y3, B11 B12 B13 B21 B22 B23

0 0 0 0 0 0  1 1 0  1 0 0
0 0 0 0 0 1  1 1 0  1 0 0
0 0 0 0 1 0  1 1 0  1 0 0
0 0 0 0 1 1  1 1 0  1 0 0
0 0 0 1 0 0  1 1 0  1 0 0
0 0 0 1 0 1  1 1 0  1 0 0
0 0 0 1 1 0  1 1 0  1 0 0
0 0 0 1 1 1  1 1 0  0 1 0
0 0 1 0 0 0  1 1 0  0 1 1
0 0 1 0 0 1  1 1 0  0 1 1
0 0 1 0 1 0  1 1 0  0 1 1
0 0 1 0 1 1  1 1 0  0 1 1
0 0 1 1 0 0  1 1 0  0 1 0
0 0 1 1 0 1  1 1 0  0 1 0
0 0 1 1 1 0  1 1 0  0 1 0
0 0 1 1 1 1  1 1 0  0 1 0
0 1 0 0 0 0  1 0 1  0 1 1
0 1 0 0 0 1  1 0 1  0 1 0
0 1 0 0 1 0  1 0 1  0 1 1
0 1 0 0 1 1  0 0 1  0 1 0
0 1 0 1 0 0  0 0 1  0 0 1
0 1 0 1 0 1  0 0 1  0 0 1
0 1 0 1 1 0  0 0 1  0 0 1
0 1 0 1 1 1  0 0 1  0 0 1
0 1 1 0 0 0  1 0 1  0 1 1
0 1 1 0 1 0  1 0 1  0 1 1
0 1 1 0 1 1  1 0 1  0 1 1
0 1 1 1 1 1  0 0 0  0 1 0
1 0 0 0 0 0  1 1 1  0 1 1
1 0 0 0 0 1  1 1 0  1 1 0
1 0 0 0 1 0  1 1 0  1 1 0
1 0 0 0 1 1  1 1 0  1 1 0
1 0 0 1 0 0  1 1 1  1 1 0
1 0 0 1 0 1  1 1 1  1 1 0
1 0 0 1 1 0  1 1 1  1 1 1
1 0 0 1 1 1  1 1 1  1 1 1
1 0 1 0 0 0  1 1 1  1 1 0
1 0 1 0 0 1  1 1 1  1 1 0
1 0 1 1 0 0  1 0 0  1 0 1
1 0 1 1 0 1  1 1 0  0 1 1
1 1 0 0 0 0  1 0 1  0 1 0
1 1 0 0 0 1  1 0 1  0 1 0
1 1 0 0 1 0  1 0 1  0 1 0
1 1 0 0 1 1  1 0 1  0 1 0
1 1 0 1 0 0  1 0 1  0 1 0
1 1 0 1 0 1  1 0 1  0 1 0
1 1 0 1 1 0  1 0 1  0 1 0
1 1 1 0 0 0  1 0 1  0 1 0
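The Peano-order key x1y1x2y2x3y3 is the bitwise interleaving of X and Y. A minimal sketch, assuming 3-bit coordinates as in Example2 (the names `peano_key` and `quadrant_id` are hypothetical helpers, not part of the slides):

```python
def peano_key(x, y, width=3):
    """Interleave the bits of x and y, most significant first: x1 y1 x2 y2 ..."""
    bits = []
    for j in range(width - 1, -1, -1):
        bits.append((x >> j) & 1)
        bits.append((y >> j) & 1)
    return bits


def quadrant_id(x, y, width=3):
    """Quadrant path such as '11.01.01': one x-bit/y-bit pair per level."""
    key = peano_key(x, y, width)
    return ".".join(f"{key[i]}{key[i + 1]}" for i in range(0, len(key), 2))


# Raster row X=100, Y=111 becomes Peano key 1 1 0 1 0 1.
print(peano_key(0b100, 0b111))    # [1, 1, 0, 1, 0, 1]
print(quadrant_id(0b100, 0b111))  # 11.01.01
```

Sorting the raster rows by `peano_key` gives the Peano order, and the first pair of the quadrant id (x1y1) names the top-level quadrant the row falls in.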

          __PNP0V_   __P1V__
Band      111 222    111 222
bit-pos   123 123    123 123
[ ]       === ===    === ===
          110 111    110 000
          101 011    000 000
          111 111    100 000
          101 010    101 010

00_PNP0V__ __P1V__ 110 111 110 000

11_PNP0V__ __P1V__ 101 010 101 010

01_PNP0V__ __P1V__ 101 011 000 000

10_PNP0V__ __P1V__ 111 111 100 000

Send B21B22B23 to Node00

Send B11B13 B22B23 to Node01

Send B12B13 B21B22B23 to Node10

Send nothing to Node11

Node C:
Bp  qid   NP0   P1
11  [ ]   1111  1011
12  [ ]   1010  1000
13  [ ]   0111  0001
21  [ ]   1010  0000
22  [ ]   1111  0001
23  [ ]   1110  0000

Purity Template [ ]: 16 12 12 8

Raster order Peano order

OR for PNP0; AND for P1
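The "OR for PNP0, AND for P1" rule can be sketched on one bit plane of one quadrant. The data below are the B11 bits of quadrant [01], read from the Peano-ordered table and grouped by sub-quadrant 00, 01, 10, 11. The function name `np0_p1` is hypothetical, and the sketch assumes every sub-quadrant holds at least one tuple.

```python
def np0_p1(subquads):
    """subquads: four lists of bits, one per sub-quadrant 00, 01, 10, 11.

    NP0 bit = OR of the bits  (some tuple has a 1, so the quadrant is not pure-0).
    P1 bit  = AND of the bits (every tuple has a 1, so the quadrant is pure-1).
    Assumes no sub-quadrant is empty.
    """
    np0 = "".join("1" if any(q) else "0" for q in subquads)
    p1 = "".join("1" if all(q) else "0" for q in subquads)
    return np0, p1


# B11 within quadrant [01]; the quadrant has only 12 of its 16 possible tuples.
b11_in_01 = [[1, 1, 1, 0], [0, 0, 0, 0], [1, 1, 1], [0]]
np0, p1 = np0_p1(b11_in_01)
purity_template = [len(q) for q in b11_in_01]
print(np0, p1, purity_template)  # 1010 0010 [4, 4, 3, 1]
```

The result agrees with the entry 11[01] 1010 0010 and the purity template [01] 4 4 3 1 shown on the Node 01 slide.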

Page 61: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2: striping at Node 00

0 0 0 0 0 0 1 0 00 0 0 0 0 1 1 0 00 0 0 0 1 0 1 0 00 0 0 0 1 1 1 0 0

0 0 0 1 0 0 1 0 00 0 0 1 0 1 1 0 00 0 0 1 1 0 1 0 00 0 0 1 1 1 0 1 0

0 0 1 0 0 0 0 1 10 0 1 0 0 1 0 1 10 0 1 0 1 0 0 1 10 0 1 0 1 1 0 1 1

0 0 1 1 0 0 0 1 00 0 1 1 0 1 0 1 00 0 1 1 1 0 0 1 00 0 1 1 1 1 0 1 0

x1y1x2y2x3y3B11B12B13 B21B22B23

_PNP0V__ __P1V__ 110 100 110 100

_PNP0V__ __P1V__ 110 010 110 010

_PNP0V__ __P1V__ 110 110 110 000

_PNP0V__ __P1V__ 110 011 110 011

Send nothing to Node00

Send nothing to Node10

Send nothing to Node11

          _PNP0V__   __P1___
Band      111 222    111 222
bit-pos   123 123    123 123
[00 ]     === ===    === ===
              100        100
              110        000
              011        011
              010        010

Send [ ]B21B22 to Node01

Node 00:
Bp  qid        NP0   P1
21  [00 ]      1100  1000
22  [00 ]      0111  0011
23  [00 ]      0010  0010
PurityTemplate [00]: 4 4 4 4
11  [01.00 ]         1110
23  [01.00 ]         1010
12  [10.00 ]         1111
13  [10.00 ]         1000
21  [10.00 ]         0111
22  [10.00 ]         1111
23  [10.00 ]         1000

From [01 ]

x1 y1 x2 y2 x3 y3  B11 B23
0 1 0 0 0 0  1 1
0 1 0 0 0 1  1 0
0 1 0 0 1 0  1 1
0 1 0 0 1 1  0 0

          P1
Band      1 2
bit-pos   1 3
[01.00 ]  = =
          1 1
          1 0
          1 1
          0 0

To [01 ]

From [10 ]

x1 y1 x2 y2 x3 y3  B12 B13  B21 B22 B23
1 0 0 0 0 0  1 1  0 1 1
1 0 0 0 0 1  1 0  1 1 0
1 0 0 0 1 0  1 0  1 1 0
1 0 0 0 1 1  1 0  1 1 0

          P1
Band      11  222
bit-pos   23  123
[10.00 ]  ==  ===
          11  011
          10  110
          10  110
          10  110

Pages on disk (Node 00), one page per bit position; columns Bp qid NP0 P1:
12  [10.00 ]         1111
13  [10.00 ]         1000
21  [00 ]      1100  1000
    [10.00 ]         0111
22  [00 ]      0111  0011
    [10.00 ]         1111
23  [00 ]      0010  0010
    [01.00 ]         1010
    [10.00 ]         1000
11  [01.00 ]         1110
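Judging from the "Send ... to Node" notes and the "Pages on disk" listings, a vector appears to be stored at the node named by the last component of its quadrant id, with the root [ ] at Node C, and a sub-quadrant is forwarded only when it is mixed (NP0 bit 1, P1 bit 0). The sketch below encodes that reading; the rule is an inference from the example, and `home_node` and `mixed_destinations` are hypothetical names.

```python
LABELS = ["00", "01", "10", "11"]


def home_node(qid):
    """qid is a list of 2-bit quadrant labels, e.g. ['10', '11'] for [10.11].

    Assumed rule: the last label names the storing node; the root goes to 'C'.
    """
    return qid[-1] if qid else "C"


def mixed_destinations(np0, p1):
    """Sub-quadrants that are mixed, i.e. not pure-0 and not pure-1."""
    return [LABELS[i] for i in range(4) if np0[i] == "1" and p1[i] == "0"]


print(home_node([]), home_node(["00"]), home_node(["01", "00"]))  # C 00 00
# B21 at [00] has NP0 1100 and P1 1000, so only sub-quadrant 01 is mixed.
print(mixed_destinations("1100", "1000"))  # ['01']
# B23 at [00] has NP0 0010 and P1 0010: nothing is mixed, nothing is sent.
print(mixed_destinations("0010", "0010"))  # []
```

This matches "Send [ ]B21B22 to Node01" on the Node 00 slide: B21 and B22 each have a mixed sub-quadrant 01, while B23 has none.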

Page 62: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2: striping at Node 01

0 1 0 0 0 0 1 1 1 10 1 0 0 0 1 1 1 1 00 1 0 0 1 0 1 1 1 10 1 0 0 1 1 0 1 1 0

0 1 0 1 0 0 0 1 0 10 1 0 1 0 1 0 1 0 10 1 0 1 1 0 0 1 0 10 1 0 1 1 1 0 1 0 1

0 1 1 0 0 0 1 1 1 10 1 1 0 1 0 1 1 1 10 1 1 0 1 1 1 1 1 1

0 1 1 1 1 1 0 0 1 0

x1y1x2y2x3y3 B11 B13 B22B23

_PNP0V__ __P1V__ 1 1 11 0 1 10

_PNP0V__ __P1V__ 0 0 10 0 0 10

_PNP0V__ __P1V__ 0 1 01 0 1 01

_PNP0V__ __P1V__ 1 1 11 1 1 11

Send [01]B11B23 to Node00

Send nothing to Node10

Send nothing to Node11

Send nothing to Node01

          _PNP0V__   __P1___
Band      111 222    111 222
bit-pos   123 123    123 123
[01 ]     === ===    === ===
          1 1  11    0 1  10
          0 1  01    0 1  01
          1 1  11    1 1  11
          0 0  10    0 0  10

0 0 0 1 0 0 1 0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 1 0 0 0 0 1 1 1 0 1

x1y1x2y2x3y3 B21B22

From [00 ]

P1Band 22bit-pos 12[00.01 ] == 10 10 10 01

To [00 ]

1 0 0 1 0 0 01 0 0 1 0 1 01 0 0 1 1 0 11 0 0 1 1 1 1

x1y1x2y2x3y3 B23

From [10 ]

P1Band 2bit-pos 3[10.01 ] == 0 0 1 1

Pages on disk (Node 01), one page per bit position; columns Bp qid NP0 P1:
21  [00.01 ]         1110
23  [01 ]      1110  0110
    [10.01 ]         0011
22  [01 ]      1010  1010
    [00.01 ]         0001
13  [01 ]      1110  1110
11  [01 ]      1010  0010

Node 01:
Bp  qid        NP0   P1
11  [01 ]      1010  0010
13  [01 ]      1110  1110
22  [01 ]      1010  1010
23  [01 ]      1110  0110
PurityTemplate [01]: 4 4 3 1
21  [00.01 ]         1110
22  [00.01 ]         0001
23  [10.01 ]         0011

Page 63: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2: striping at Node 10

1 0 0 0 0 0 1 1 0 1 11 0 0 0 0 1 1 0 1 1 01 0 0 0 1 0 1 0 1 1 01 0 0 0 1 1 1 0 1 1 0

1 0 0 1 0 0 1 1 1 1 01 0 0 1 0 1 1 1 1 1 01 0 0 1 1 0 1 1 1 1 11 0 0 1 1 1 1 1 1 1 1

1 0 1 0 0 0 1 1 1 1 01 0 1 0 0 1 1 1 1 1 0

1 0 1 1 0 0 0 0 1 0 11 0 1 1 0 1 1 0 0 1 1

x1y1x2y2x3y3 B12B13B21B22B23

_PNP0V__ __P1V__ 11 111 10 010

_PNP0V__ __P1V__ 10 111 00 001

_PNP0V__ __P1V__ 11 111 11 110

_PNP0V__ __P1V__ 11 110 11 110

Send [10]B13B21B23 to Node00

Send nothing to Node10

Send [10]B12B21B22 to Node11

Send [10] B23 to Node01

          _PNP0V__   __P1___
Band      111 222    111 222
bit-pos   123 123    123 123
[10 ]     === ===    === ===
           11 111     10 010
           11 111     11 110
           11 110     11 110
           10 111     00 001

To [00 ] To[01 ]

To [11 ]

Pages on disk (Node 10), one page per bit position:
Bp  qid        NP0   P1
12  [10 ]      1111  1110
13  [10 ]      1110  0110
21  [10 ]      1111  0110
22  [10 ]      1111  1110
23  [10 ]      1101  0001
PurityTemplate [10]: 4 4 2 2

Page 64: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2: striping at Node11

1 0 1 1 0 0 0 1 0 1 0 1 1 0 1 1 0 1

x1y1x2y2x3y3 B12 B21B22

From [10 ]

P1Band 122bit-pos 223[10.11 ] === 010 101

Bp qid NP0 P1 1112[10.11 ] 0122[10.11 ] 1023[10.11 ] 01

Bp qid NP0 P1 1112[10.11 ] 01

Bp qid NP0 P1 1123[10.11 ] 01

Bp qid NP0 P1 1122[10.11 ] 10

Pages on disk

Page 65: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2.1AND at NodeC or [ ]

Disk 11
Bp  qid        NP0   P1
12  [10.11 ]         01
23  [10.11 ]         01
22  [10.11 ]         10

Disk 10   PT[10] 4 4 2 2
Bp  qid        NP0   P1
12  [10 ]      1111  1110
13  [10 ]      1110  0110
21  [10 ]      1111  0110
22  [10 ]      1111  1110
23  [10 ]      1101  0001

Disk 01   PT[01] 4 4 3 1
Bp  qid        NP0   P1
21  [00.01 ]         1110
23  [01 ]      1110  0110
23  [10.01 ]         0011
22  [01 ]      1010  1010
22  [00.01 ]         0001
13  [01 ]      1110  1110
11  [01 ]      1010  0010

Disk 00   PT[00] 4 4 4 4
Bp  qid        NP0   P1
12  [10.00 ]         1111
13  [10.00 ]         1000
21  [00 ]      1100  1000
21  [10.00 ]         0111
22  [00 ]      0111  0011
22  [10.00 ]         1111
23  [00 ]      0010  0010
23  [01.00 ]         1010
23  [10.00 ]         1000
11  [01.00 ]         1110

Disk C    PT[ ] 16 12 12 8
Bp  qid        NP0   P1
11  [ ]        1111  1011
12  [ ]        1010  1000
13  [ ]        0111  0001
21  [ ]        1010  0000
22  [ ]        1111  0001
23  [ ]        1110  0000

RC(P 101,010) = P11 ^ P’12 ^ P13 ^ P’21 ^ P22 ^ P’23

[ ] NP0
11   1111
12'  0111
13   0111
21'  1111
22   1111
23'  1111
     ------
AND  0111

[ ] P1
11   1011
12'  0101
13   0001
21'  0101
22   0001
23'  0001
     ------
AND  0001

Sum = 8 so far. Invocation = [ ] 101,010, sent to Nodes 01 and 10.
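The root-level step for RC(P 101,010) can be sketched as below. The sketch assumes the complement rule that the slide's numbers are consistent with: for a primed operand, the NP0 vector is the complement of the stored P1 vector and the P1 vector is the complement of the stored NP0 vector. `ROOT` copies the Node C table; `and_step` is a hypothetical helper name.

```python
ROOT = {  # Bp: (NP0, P1) at qid [ ], copied from the Node C table
    "11": ("1111", "1011"), "12": ("1010", "1000"), "13": ("0111", "0001"),
    "21": ("1010", "0000"), "22": ("1111", "0001"), "23": ("1110", "0000"),
}
PURITY_TEMPLATE = [16, 12, 12, 8]   # tuples per quadrant 00, 01, 10, 11
LABELS = ["00", "01", "10", "11"]


def and_step(vectors, pattern, width=4):
    """pattern maps Bp -> required bit (1 means P, 0 means P')."""
    mask = (1 << width) - 1
    np0 = p1 = mask
    for bp, bit in pattern.items():
        v_np0, v_p1 = (int(s, 2) for s in vectors[bp])
        if bit == 1:
            np0 &= v_np0
            p1 &= v_p1
        else:
            # Complemented operand: swap the roles of the two stored vectors.
            np0 &= ~v_p1 & mask
            p1 &= ~v_np0 & mask
    return format(np0, f"0{width}b"), format(p1, f"0{width}b")


pattern = dict(zip(["11", "12", "13", "21", "22", "23"], [1, 0, 1, 0, 1, 0]))
np0, p1 = and_step(ROOT, pattern)
pure_count = sum(n for n, b in zip(PURITY_TEMPLATE, p1) if b == "1")
mixed = [LABELS[i] for i in range(4) if np0[i] == "1" and p1[i] == "0"]
print(np0, p1, pure_count, mixed)  # 0111 0001 8 ['01', '10']
```

Quadrant 11 is pure-1 for the pattern and contributes its 8 tuples directly; quadrants 01 and 10 are mixed, so the invocation is forwarded to Nodes 01 and 10.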

P1-pattern   NP0  P1
11  xxxx
12  prime
13  xxxx
21  prime
22  xxxx
23  prime

NP0-pattern  NP0  P1
11  xxxx
12  prime
13  xxxx
21  prime
22  xxxx
23  prime

Page 66: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2.1AND at Node01


Invocation = [01] 101,010. Sent to Node00.

[01] NP0
11   1010
12
13   1110
21
22   1010
23   1001
     ------
AND  1000

[01] P1
11   0010
12
13   1110
21
22   1010
23   0001
     ------
AND  0000

P1-pattern NP0 P111 xxxx12 prime13 xxxx21 prime22 xxxx23 prime

NP0-pattern NP0 P111 xxxx12 prime13 xxxx21 prime22 xxxx23 prime

[ ] 101,010 received

Disk 10 PT[10] 4 4 2 2Disk 01 PT[01] 4 4 3 1Disk 00 PT[00] 4 4 4 4Disk C PT[ ] 16 12 12 8

Page 67: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2.1AND at Node10


Invocation = [10] 101,010. Sent nowhere (no mixed).

[10] NP0
11
12   0001
13   1110
21   1001
22   1111
23   1110
     ------
AND  0000

[10] P1
11
12
13
21
22
23
     ------
AND

P1-pattern NP0 P111 xxxx12 prime13 xxxx21 prime22 xxxx23 prime

NP0-pattern NP0 P111 xxxx12 prime13 xxxx21 prime22 xxxx23 prime

[ ] 101,010 received

Disk 10 PT[10] 4 4 2 2Disk 01 PT[01] 4 4 3 1Disk 00 PT[00] 4 4 4 4Disk C PT[ ] 16 12 12 8

Page 68: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2.1AND at Node00


[01.00] P1
11   1110
12
13
21
22
23   0101
     ------
AND  0100

Sum = 1, sent to NodeC, gives a sum total of 8 + 1 = 9.
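At a leaf quadrant each vector has one bit per tuple, so the count is the number of 1 bits in the AND. A minimal sketch for the leaf [01.00], assuming that bit planes with no stored leaf vector are already pure-1 for the pattern and impose no constraint (`leaf_count` is a hypothetical name):

```python
def leaf_count(leaf_vectors, pattern):
    """leaf_vectors: Bp -> 4-bit P1 string for one leaf quadrant.

    Bit planes missing from leaf_vectors are assumed pure-1 for this pattern.
    Returns the AND vector and the number of matching tuples.
    """
    result = 0b1111
    for bp, bits in leaf_vectors.items():
        v = int(bits, 2)
        # Complement the vector when the pattern asks for the primed operand.
        result &= v if pattern[bp] == 1 else ~v & 0b1111
    return format(result, "04b"), bin(result).count("1")


pattern = {"11": 1, "12": 0, "13": 1, "21": 0, "22": 1, "23": 0}  # 101,010
# Stored at Node 00 for qid [01.00]: B11 = 1110 and B23 = 1010.
vec, count = leaf_count({"11": "1110", "23": "1010"}, pattern)
print(vec, count, 8 + count)  # 0100 1 9
```

The single matching tuple in [01.00] is added to the 8 tuples counted at the root, giving the root count of 9.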

Disk 10 PT[10] 4 4 2 2Disk 01 PT[01] 4 4 3 1Disk 00 PT[00] 4 4 4 4Disk C PT[ ] 16 12 12 8

[01] 101,010 received

P1-pattern P111 xxxx12 prime13 xxxx21 prime22 xxxx23 prime

Page 69: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2.2AND at NodeC or [ ]


Disk 10 PT[10] 4 4 2 2Disk 01 PT[01] 4 4 3 1Disk 00 PT[00] 4 4 4 4Disk C PT[ ] 16 12 12 8

RC(P 100,101) = P11 ^ P’12 ^ P’13 ^ P21 ^ P’22 ^ P23

Node C:
Bp  qid   NP0   P1
11  [ ]   1111  1011
12  [ ]   1010  1000
13  [ ]   0111  0001
21  [ ]   1010  0000
22  [ ]   1111  0001
23  [ ]   1110  0000

[ ] NP0   ------   AND 0010

[ ] P1    ------   AND 0000

Sum = 0 so far. Invocation = [ ] 100,101, sent to Node 10.

P1-pattern   NP0  P1
11  xxxx
12  prime
13  prime
21  xxxx
22  prime
23  xxxx

NP0-pattern  NP0  P1
11  xxxx
12  prime
13  prime
21  xxxx
22  prime
23  xxxx

Page 70: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2.2AND at Node10


Invocation = [10] 100,101. Sent to Node 11.

[10] NP0   11  12  13  21  22  23   ------   AND 0001

[10] P1    11  12  13  21  22  23   ------   AND 0000

[ ] 100,101 received

Disk 10 PT[10] 4 4 2 2Disk 01 PT[01] 4 4 3 1Disk 00 PT[00] 4 4 4 4Disk C PT[ ] 16 12 12 8

P1-pattern NP0 P111 xxxx12 prime13 prime21 xxxx22 prime23 xxxx

NP0-pattern NP0 P111 xxxx12 prime13 prime21 xxxx22 prime23 xxxx

Page 71: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2.2AND at Node11


[10] P111 0112 13 21 22 0123 01AND------ 01

[10] 100,101 received

Disk 10 PT[10] 4 4 2 2Disk 01 PT[01] 4 4 3 1Disk 00 PT[00] 4 4 4 4Disk C PT[ ] 16 12 12 8

Sum=1, sent to NodeC gives a sum total of 1

Page 72: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2, bottom-up

0 0 0 0 0 0 1 1 0 1 0 00 0 0 0 0 1 1 1 0 1 0 00 0 0 0 1 0 1 1 0 1 0 00 0 0 0 1 1 1 1 0 1 0 00 0 0 1 0 0 1 1 0 1 0 00 0 0 1 0 1 1 1 0 1 0 00 0 0 1 1 0 1 1 0 1 0 00 0 0 1 1 1 1 1 0 0 1 00 0 1 0 0 0 1 1 0 0 1 10 0 1 0 0 1 1 1 0 0 1 10 0 1 0 1 0 1 1 0 0 1 10 0 1 0 1 1 1 1 0 0 1 10 0 1 1 0 0 1 1 0 0 1 00 0 1 1 0 1 1 1 0 0 1 00 0 1 1 1 0 1 1 0 0 1 00 0 1 1 1 1 1 1 0 0 1 00 1 0 0 0 0 1 0 1 0 1 10 1 0 0 0 1 1 0 1 0 1 00 1 0 0 1 0 1 0 1 0 1 10 1 0 0 1 1 0 0 1 0 1 00 1 0 1 0 0 0 0 1 0 0 10 1 0 1 0 1 0 0 1 0 0 10 1 0 1 1 0 0 0 1 0 0 10 1 0 1 1 1 0 0 1 0 0 10 1 1 0 0 0 1 0 1 0 1 10 1 1 0 1 0 1 0 1 0 1 10 1 1 0 1 1 1 0 1 0 1 10 1 1 1 1 1 0 0 0 0 1 01 0 0 0 0 0 1 1 1 0 1 11 0 0 0 0 1 1 1 0 1 1 01 0 0 0 1 0 1 1 0 1 1 01 0 0 0 1 1 1 1 0 1 1 01 0 0 1 0 0 1 1 1 1 1 01 0 0 1 0 1 1 1 1 1 1 01 0 0 1 1 0 1 1 1 1 1 11 0 0 1 1 1 1 1 1 1 1 11 0 1 0 0 0 1 1 1 1 1 01 0 1 0 0 1 1 1 1 1 1 01 0 1 1 0 0 1 0 0 1 0 11 0 1 1 0 1 1 1 0 0 1 11 1 0 0 0 0 1 0 1 0 1 01 1 0 0 0 1 1 0 1 0 1 01 1 0 0 1 0 1 0 1 0 1 01 1 0 0 1 1 1 0 1 0 1 01 1 0 1 0 0 1 0 1 0 1 01 1 0 1 0 1 1 0 1 0 1 01 1 0 1 1 0 1 0 1 0 1 01 1 1 0 0 0 1 0 1 0 1 0

x1y1x2y2x3y3 B11B12B13B21B22B23

Bp  qid       NP0   P1
11  [00.00]         1111
12  [00.00]         1111
13  [00.00]         0000
21  [00.00]         1111
22  [00.00]         0000
23  [00.00]         0000

Peano order

Page 73: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2, bottom-up


Leaf vectors so far (Bp by qid):
Bp   [00.00]  [00.01]
11   1111     1111
12   1111     1111
13   0000     0000
21   1111     1110
22   0000     0001
23   0000     0000

Peano order

Mixed quads (can be sent to node01)

Bp  qid       NP0   P1
21  [00.01]         1110
22  [00.01]         0001

Page 74: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2, bottom-up


Leaf vectors so far (Bp by qid):
Bp   [00.00]  [00.01]  [00.10]
11   1111     1111     1111
12   1111     1111     1111
13   0000     0000     0000
21   1111     1110     0000
22   0000     0001     1111
23   0000     0000     1111

Peano order

At 00:
Bp  qid       NP0   P1
23  [00]      001-  001-

Mixed quads (sent to node00)

At 01:
Bp  qid       NP0   P1
21  [00.01]         1110
22  [00.01]         0001

Page 75: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Example2, bottom-up


Leaf vectors (Bp by qid):
Bp   [00.00]  [00.01]  [00.10]  [00.11]
11   1111     1111     1111     1111
12   1111     1111     1111     1111
13   0000     0000     0000     0000
21   1111     1110     0000     0000
22   0000     0001     1111     1111
23   0000     0000     1111     0000

Peano order

00 quads that are pure are:
Bp  qid       NP0   P1
11  [00]      1111  1111
12  [00]      1111  1111
13  [00]      0000  0000

At 00:
Bp  qid       NP0   P1
23  [00]      0010  0010

At 01:
Bp  qid       NP0   P1
21  [00.01]         1110
22  [00.01]         0001
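The bottom-up step from four leaf vectors to the parent's NP0 and P1 vectors can be sketched as follows, using the band-2 leaf vectors of quadrant [00] listed above. The helper name `roll_up` is hypothetical, and treating a child as mixed exactly when its NP0 and P1 bits differ is the reading assumed here.

```python
LABELS = ["00", "01", "10", "11"]


def roll_up(leaves):
    """leaves: four 4-bit strings for sub-quadrants 00, 01, 10, 11.

    Returns the parent's NP0 and P1 vectors plus the labels of the mixed
    children, whose leaf vectors still have to be kept (and shipped).
    """
    np0 = "".join("1" if "1" in leaf else "0" for leaf in leaves)
    p1 = "".join("1" if "0" not in leaf else "0" for leaf in leaves)
    mixed = [LABELS[i] for i in range(4) if np0[i] != p1[i]]
    return np0, p1, mixed


leaves_at_00 = {
    "21": ["1111", "1110", "0000", "0000"],
    "22": ["0000", "0001", "1111", "1111"],
    "23": ["0000", "0000", "1111", "0000"],
}
for bp, leaves in leaves_at_00.items():
    print(bp, *roll_up(leaves))
# 21 1100 1000 ['01']
# 22 0111 0011 ['01']
# 23 0010 0010 []
```

Only 21[00.01] and 22[00.01] remain as mixed leaf vectors, which is why they are the ones listed "At 01"; the pure children are fully described by the parent's two vectors.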

Page 76: P-Trees: Universal Data Structure for Query Optimization to Data Mining

Appendix: Firm Math Foundation (RRN order assumed to be raster order)

Given any relation or table R(A1..An), assign RRNs from {0, 1, .., (2^d)^L - 1} (d = dimension, L = level). Write RRNs as bit strings: x11..x1L.x21..x2L..xd1..xdL (d=2: x1..xL y1..yL).

For k = 0..L, define a level-k polytant Q[x11x21..xd1•x12..xd2•..•x1k..xdk] by Q = { t ∈ R | t.Kij = xij }, where Kij is the ij-th bit of the RRN.

- Q = (SRdk([x11..x1L.x21..x2L..xd1..xdL])).R = {t|t.R SRdk([x11..x1L.x21..x2L..xd1..xdL])} (tuple variable notation

- d=2: Q[x1y1•..•xkyk] is a quadrant.
- Q[ ] = R; Q[x11x21..xd1•x12..xd2•..•x1L..xdL] = single_tuple = 1x..x1-polytant.

- This imposes a “d-space” structure on R (for RSI, which already has such a structure, this step can be skipped).

Quadrant-conditions: On each quadrant, Q, in R define conditions (Q → {T,F}) (level=k):

Q-COND    DESCR
pure1     true if C is true of all Q-tuples
pure0     true if C is false of all Q-tuples
mixed     true if C is true of some Q-tuples and false of some Q-tuples
p-count   true if C is true of exactly p Q-tuples (0 ≤ p ≤ card Q = 2^(dk))

Every Ptree is a Quadrant-condition Ptree on R, e.g.:
Pij, basic Ptree, is Pcond where cond = (SR8-j ( SLj-1 ( t.Ai )))
P1i(v), for value v ∈ Ai, is Pcond where cond = (t.Ai = v, ∀ t ∈ Q)
NP0(a1..an) is Pcond where cond = ( ∀ i : ( ∃ t ∈ Q : t.Ai = ai ) )
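The quadrant conditions pure1, pure0, mixed and p-count can be illustrated on one small quadrant. The sketch below evaluates a condition C on every tuple of Q; the function name `quadrant_condition` and the (B1, B2) tuple layout are illustrative assumptions, and the quadrant is assumed to be non-empty.

```python
def quadrant_condition(tuples, cond):
    """Evaluate condition `cond` on every tuple of quadrant Q.

    Returns (kind, p): kind is 'pure1', 'pure0' or 'mixed', and p is the
    p-count, the number of Q-tuples for which cond is true.
    Assumes Q is non-empty.
    """
    p = sum(1 for t in tuples if cond(t))
    if p == len(tuples):
        kind = "pure1"
    elif p == 0:
        kind = "pure0"
    else:
        kind = "mixed"
    return kind, p


# Quadrant [01.00] of Example2, tuples given as (B1, B2).
q = [(5, 3), (5, 2), (5, 3), (1, 2)]
print(quadrant_condition(q, lambda t: t[0] == 5))  # ('mixed', 3)
print(quadrant_condition(q, lambda t: t[1] >= 2))  # ('pure1', 4)
print(quadrant_condition(q, lambda t: t[0] == 7))  # ('pure0', 0)
```

In these terms a P1 bit records pure1 and an NP0 bit records "not pure0" for the same condition on the same quadrant.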

Notation: bSQ files, Pij(cond) ; BSQ files, Pi(cond); Relations, P.