PINE: Podium Incremental Neighborhood Evaluator (PowerPoint presentation transcript)

DESCRIPTION
pTrees (predicate Tree technologies) provide fast, accurate horizontal processing of compressed, data-mining-ready, vertical data structures.
[Slide figure: a document x course Text matrix, a person x course Enroll matrix, and a Buy matrix; these cards are redrawn in the MYRRH slides below.]
Applications:
MYRRH (ManY-Relationship-Rule Harvester) uses pTrees for association rule mining of multiple relationships.
ConCur (Concurrency Control) uses pTrees for ROCC and ROLL concurrency control.
PGP-D (Pretty Good Protection of Data) protects vertical pTree data.
FAUST (Fast Accurate Unsupervised, Supervised Treemining) uses pTrees for classification and clustering of spatial data.
PINE (Podium Incremental Neighborhood Evaluator) uses pTrees for Closed k Nearest Neighbor Classification.
DOVE (DOmain VEctors) uses pTrees for database query processing.
1st, Vertical Processing of Horizontal Data (VPHD): for horizontally structured records, we scan vertically.

R(A1 A2 A3 A4), Base 10 = Base 2:
2 7 6 1   010 111 110 001
3 7 6 0   011 111 110 000
2 6 5 1   010 110 101 001
2 7 5 7   010 111 101 111
3 2 1 4   011 010 001 100
2 2 1 5   010 010 001 101
7 0 1 4   111 000 001 100
7 0 1 4   111 000 001 100

predicate Trees (pTrees): project each attribute (now 4 files), then vertically slice off each bit position (now 12 files: R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43), then compress each bit slice into a pTree. E.g., the compression of R11 = 0 0 0 0 0 0 1 1 into P11 records the truth of the predicate "pure1 (all 1's)" in a tree, recursively on halves, until the half is pure (all 1's or all 0's):

1. Whole thing pure1? false -> 0
2. Left half pure1? false -> 0 (and it is pure0, so this branch ends)
3. Right half pure1? false -> 0
4. Left half of right half pure1? false -> 0
5. Right half of right half pure1? true -> 1

2nd, use the pTrees to find the number of occurrences of 7 0 1 4. Since 7 0 1 4 = 111 000 001 100, count the 1-bits of
P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43
(uncomplemented pTrees where the target bit is 1, complemented where it is 0). Reading the count level by level from the result pTree: 0*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 2 occurrences (the last two records).
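The two steps above can be sketched in a few lines. This is a minimal illustration (plain Python lists standing in for compressed pTrees; the data is the 8-row R(A1..A4) table above): slice each 3-bit attribute vertically, then AND the slices, complemented where the target bit is 0.

```python
# Vertical slicing of R(A1..A4) and AND-based counting of (7,0,1,4).
R = [(2, 7, 6, 1), (3, 7, 6, 0), (2, 6, 5, 1), (2, 7, 5, 7),
     (3, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4)]
n = len(R)

def bit_slice(rows, attr, bit):
    # bit 2 is the high-order bit of a 3-bit value
    return [(r[attr] >> bit) & 1 for r in rows]

slices = {(a, b): bit_slice(R, a, b) for a in range(4) for b in (2, 1, 0)}

def count(target):
    acc = [1] * n
    for a in range(4):
        for b in (2, 1, 0):
            s = slices[(a, b)]
            want = (target[a] >> b) & 1
            # AND in the slice, complemented where the target bit is 0
            acc = [x & (y if want else 1 - y) for x, y in zip(acc, s)]
    return sum(acc)

print(count((7, 0, 1, 4)))   # -> 2
```

The same `count` works for any tuple; e.g., `count((2, 7, 6, 1))` finds the single first record.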
First, 3NN using horizontal data to classify an unclassified sample a = (0 0 0 0 0 0) (attributes a5 a6 a11 a12 a13 a14; a10 = C is the class).

Key  a1 a2 a3 a4 a5 a6 a7 a8 a9 a10=C a11 a12 a13 a14 a15 a16 a17 a18 a19 a20
t12  1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 0 0 1
t13  1 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1
t15  1 0 1 0 0 0 1 1 0 1 0 1 0 1 0 0 1 1 0 0
t16  1 0 1 0 0 0 1 1 0 1 1 0 1 0 1 0 0 0 1 0
t21  0 1 1 0 1 1 0 0 0 1 1 0 1 0 0 0 1 1 0 1
t27  0 1 1 0 1 1 0 0 0 1 0 0 1 1 0 0 1 1 0 0
t31  0 1 0 0 1 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1
t32  0 1 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 0 0 1
t33  0 1 0 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 1
t35  0 1 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 1 0 0
t51  0 1 0 1 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 1
t53  0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1
t55  0 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0
t57  0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 0
t61  1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1
t72  0 0 1 1 0 0 1 1 0 0 0 1 1 0 1 1 0 0 0 1
t75  0 0 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0

Area for the 3 nearest nbrs after scanning the first three rows (columns a5 a6 a10=C a11 a12 a13 a14; distance from a = 000000):
t12  0 0 1 0 1 1 0   distance 2
t13  0 0 1 0 1 0 0   distance 1
t15  0 0 1 0 1 0 1   distance 2

Continue the scan: t16 distance=2, don't replace; t21 distance=4, don't replace; t27 distance=4, don't replace; t31 distance=3, don't replace; t32 distance=3, don't replace; t33 distance=2, don't replace; t35 distance=3, don't replace; t51 distance=2, don't replace; t53 distance=1, replace:
t53  0 0 0 0 1 0 0   distance 1
t55 distance=2, don't replace; t57 distance=2, don't replace; t61 distance=3, don't replace; t72 distance=2, don't replace; t75 distance=2, don't replace.

Vote of the 3NN set {t12, t13, t53}: C=0 gets 1 vote, C=1 gets 2. C=1 wins!
PINE (Podium Incremental Neighborhood Evaluator) uses pTrees for Closed k Nearest Neighbor Classification (CkNNC).
(Same training table as above.)
Next, C3NN using horizontal data: a second pass is necessary to find all other voters at distance 2 from a. (3NN set after the 1st scan: t12 distance 2, t13 distance 1, t53 distance 1; vote after 1st scan: C=1 wins.)

Second scan: t12 d=2, already voted; t13 d=1, already voted; t15 d=2, include it also; t16 d=2, include it also; t21 d=4, don't include; t27 d=4, don't include; t31 d=3, don't include; t32 d=3, don't include; t33 d=2, include it also; t35 d=3, don't include; t51 d=2, include it also; t53 d=1, already voted; t55 d=2, include it also; t57 d=2, include it also; t61 d=3, don't include; t72 d=2, include it also; t75 d=2, include it also.

Closed-3NN vote over all points at distance <= 2: C=1 gets 5 votes (t12 t13 t15 t16 t33), C=0 gets 6 votes (t51 t53 t55 t57 t72 t75). C=0 wins now!
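The closed-3NN vote just described can be sketched directly (a minimal illustration on the six relevant attributes; the bit tuples are transcribed from the training table above):

```python
# Closed 3NN (C3NN): every training point within the distance of the
# 3rd nearest neighbor votes.  Rows are (class C, (a5,a6,a11,a12,a13,a14)).
train = {
 "t12": (1, (0,0,0,1,1,0)), "t13": (1, (0,0,0,1,0,0)), "t15": (1, (0,0,0,1,0,1)),
 "t16": (1, (0,0,1,0,1,0)), "t21": (1, (1,1,1,0,1,0)), "t27": (1, (1,1,0,0,1,1)),
 "t31": (1, (1,0,1,0,1,0)), "t32": (1, (1,0,0,1,1,0)), "t33": (1, (1,0,0,1,0,0)),
 "t35": (1, (1,0,0,1,0,1)), "t51": (0, (0,0,1,0,1,0)), "t53": (0, (0,0,0,1,0,0)),
 "t55": (0, (0,0,0,1,0,1)), "t57": (0, (0,0,0,0,1,1)), "t61": (0, (1,0,1,0,1,0)),
 "t72": (0, (0,0,0,1,1,0)), "t75": (0, (0,0,0,1,0,1)),
}
a = (0, 0, 0, 0, 0, 0)

def dist(u, v):                 # Hamming distance
    return sum(x != y for x, y in zip(u, v))

k = 3
ds = sorted(dist(v, a) for _, v in train.values())
radius = ds[k - 1]              # distance of the k-th nearest neighbor (= 2 here)
votes = [c for c, v in train.values() if dist(v, a) <= radius]
print(votes.count(1), votes.count(0))   # -> 5 6, so C=0 wins
```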
PINE: a Closed 3NN method using pTrees (vertical data structures). The pTree-based C3NN goes as follows: first let all training points at distance=0 vote, then distance=1, then distance=2, ..., until at least 3 votes are cast.

Attribute pTrees (one 17-bit vector per attribute; bit order t12 t13 t15 t16 t21 t27 t31 t32 t33 t35 t51 t53 t55 t57 t61 t72 t75):
a1    11110000000000100
a2    00001111111111000
a3    11111100000000111
a4    00000000001111011
a5    00001111110000100
a6    00001100000000000
a7    11110000001111011
a8    11110000001111011
a9    00000011110000100
a10=C 11111111110000000    C' 00000000001111111
a11   00011010001000100
a12   11100001110110011
a13   10011111001001110
a14   00100100010011001
a15   11010001100100010
a16   10000001000000010
a17   00101110011011101
a18   00101110011011101
a19   01010000100100000
a20   11001011101100110

For distance=0 (exact matches with s = 000000 on a5 a6 a11 a12 a13 a14), construct the pTree
Ps = P'5 & P'6 & P'11 & P'12 & P'13 & P'14
   = 11110000001111011 & 11110011111111111 & 11100101110111011 & 00011110001001100 & 01100000110110001 & 11011011101100110
   = 00000000000000000
No neighbors at distance=0; then AND with PC and PC' to compute the vote (both counts are 0).
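The distance=0 step can be sketched with integer bit vectors standing in for the (uncompressed) pTrees; the 17-bit slices are transcribed from the slide above:

```python
# Distance=0 voters: AND the six relevant slices, complemented wherever the
# sample bit is 0 (here the sample s is all zeros, so every slice is complemented).
N = 17
MASK = (1 << N) - 1
P = {                       # leftmost slide bit = most significant bit
    5:  int("00001111110000100", 2),
    6:  int("00001100000000000", 2),
    11: int("00011010001000100", 2),
    12: int("11100001110110011", 2),
    13: int("10011111001001110", 2),
    14: int("00100100010011001", 2),
}
PC = int("11111111110000000", 2)          # class pTree a10=C
s = {5: 0, 6: 0, 11: 0, 12: 0, 13: 0, 14: 0}

Ps = MASK
for i, bit in s.items():
    Ps &= P[i] if bit else (~P[i] & MASK)

print(Ps)   # -> 0: no exact matches, so no distance-0 voters
```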
pTree-based C3NN: find all distance=1 nbrs. Construct the pTree
PD(s,1) = OR_i P_{|si-ti|=1; |sj-tj|=0, j != i}
        = OR_i ( P_i & AND_{j in {5,6,11,12,13,14}-{i}} P'_j ),   i in {5,6,11,12,13,14}
i.e., the OR of six AND-terms (for i = 5, 6, 11, 12, 13, 14), each taking the uncomplemented slice for the one attribute i and the complemented slices for the other five (since s = 000000, |si-ti|=1 means bit i is on). Using the slices a5 a6 a11 a12 a13 a14 and their complements from the previous slide:
PD(s,1) = 01000000000100000
so t13 and t53 are the distance=1 neighbors; ANDing with PC and PC' gives one vote for C=1 (t13) and one for C=0 (t53).
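The OR-of-six-ANDs just described, sketched with the same integer-encoded slices:

```python
# PD(s,1): one AND-term per attribute i "flipped" relative to the all-zero
# sample; OR the six terms together.
N = 17
MASK = (1 << N) - 1
P = {
    5:  int("00001111110000100", 2),
    6:  int("00001100000000000", 2),
    11: int("00011010001000100", 2),
    12: int("11100001110110011", 2),
    13: int("10011111001001110", 2),
    14: int("00100100010011001", 2),
}
attrs = [5, 6, 11, 12, 13, 14]

PD1 = 0
for i in attrs:
    term = P[i]                       # |s_i - t_i| = 1 (sample bit is 0)
    for j in attrs:
        if j != i:
            term &= ~P[j] & MASK      # |s_j - t_j| = 0
    PD1 |= term

print(format(PD1, "017b"))   # -> 01000000000100000  (t13 and t53)
```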
pTree-based C3NN, distance=2 nbrs: OR all double-dimension interval pTrees:
PD(s,2) = OR_{i,j in {5,6,11,12,13,14}} P_{i,j}
P_{i,j} = P_i & P_j & AND_{k in {5,6,11,12,13,14}-{i,j}} P'_k
using the slices (bit order t12 t13 t15 t16 t21 t27 t31 t32 t33 t35 t51 t53 t55 t57 t61 t72 t75):
a5    00001111110000100
a6    00001100000000000
a10=C 11111111110000000
a11   00011010001000100
a12   11100001110110011
a13   10011111001001110
a14   00100100010011001
The 15 pairwise terms are P5,6 P5,11 P5,12 P5,13 P5,14 P6,11 P6,12 P6,13 P6,14 P11,12 P11,13 P11,14 P12,13 P12,14 P13,14; each is the AND of the two uncomplemented slices with the complements of the other four. Accumulating votes term by term, after P12,14 we already have 3 nearest nbrs and could quit and declare C=1 the winner. But completing all 15 terms gives the full C3NN set, and we can declare C=0 the winner! PINE = CkNN in which all training samples vote, weighted by their nearness to a (~Olympic podiums).
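The distance=1 and distance=2 constructions generalize to any distance d: one AND-term per d-subset of flipped attributes, OR-ed together. A minimal sketch (same integer-encoded slices; the sample is all zeros, so "flipped" simply means the slice is used uncomplemented):

```python
# PD(s,d) for the all-zero sample s: OR over all d-subsets of the six
# attributes of the AND of "on" slices and complements of "off" slices.
from itertools import combinations

N = 17
MASK = (1 << N) - 1
P = {
    5:  int("00001111110000100", 2),
    6:  int("00001100000000000", 2),
    11: int("00011010001000100", 2),
    12: int("11100001110110011", 2),
    13: int("10011111001001110", 2),
    14: int("00100100010011001", 2),
}
attrs = [5, 6, 11, 12, 13, 14]

def PD(d):
    out = 0
    for flipped in combinations(attrs, d):
        term = MASK
        for k in attrs:
            term &= P[k] if k in flipped else (~P[k] & MASK)
        out |= term
    return out

for d in range(3):
    print(d, format(PD(d), "017b"))
# d=0 is empty, d=1 marks t13 and t53, d=2 marks the nine remaining voters
```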
FAUST using impure pTrees (ipTrees)

All pTrees are defined by row-set predicates (true/false on any row set). E.g., on T(A,B,C), the units bit slice pTree of T.A using the predicate ">60% 1-bits" is true iff more than 60% of the A-values are odd.

The IRIS dataset can be downloaded from the UCI Data Repository. To cluster the IRIS dataset of 150 iris flower samples (50 setosa, 50 versicolor, 50 virginica) using 2-level 60% ipTrees (with each upper-level bit representing the predicate truth applied to 10 consecutive iris samples), level-1 is shown below. FAUST clusters perfectly using only this level (an order of magnitude smaller bit vectors, so faster processing!).

level-1 values:    SL   SW   PL   PW
setosa             38   38   14    2
setosa             50   38   15    2
setosa             50   34   16    2
setosa             48   42   15    2
setosa             50   34   12    2
versicolor          1   24   45   15
versicolor         56   30   45   14
versicolor         57   28   32   14
versicolor         54   26   45   13
versicolor         57   30   42   12
virginica          73   29   58   17
virginica          64   26   51   22
virginica          72   28   49   16
virginica          77   30   48   22
virginica          67   26   50   19
Level-1 mn       54.2 30.8 35.8 11.6
setosa mn        47.2 37.2 14.4    2
versicolor mn    45   27.6 41.8 13.6
virginica mn     70.6 27.8 51.2 19.2
SL mn  gap     SW mn   gap     PL mn   gap     PW mn   gap
ve 45   2.2    ve 27.6  .2     se 14.4 27.4    se  2   11.6
se 47.2 23.4   vi 27.8 9.4     ve 41.8  9.4    ve 13.6  5.6
vi 70.6        se 37.2         vi 51.2         vi 19.2
level_1 bit table: 15 rows (5 setosa, 5 versicolor, 5 virginica, one row per stride of 10 consecutive samples) x bit-slice columns s10gt60_PSL,j, s10gt60_PSW,j, s10gt60_PPL,j, s10gt60_PPW,j, computed from the 150 level_0 raw bits.
E.g., for PPW,1:
level_0 (the 150 raw bits, in strides of 10):
1111101110 1100100111 1010110111 1001011011 1111011111 1110100101 1111011111 1010011011 1100101000 0101000010 0101100110 0100011111 1001011100 1011110110 0111011011
level_1 = s10gt60_PPW,1 (each of the 15 level_1 bits strides 10 level_0 bits): 11111 11100 01011
level_2 = s150_s10_gt60_PPW,1 (the level_2 root bit strides all 150 level_0 bits): 1
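A minimal sketch of building those two upper levels from the raw bits above (here "gt60" is treated as "at least 60% 1-bits", which matches the slide's level_1 bits):

```python
# Build level_1 (stride 10) and level_2 (stride 150) of the gt60% ipTree
# for the PW bit slice PPW,1 shown above.
level_0 = (
    "1111101110 1100100111 1010110111 1001011011 1111011111"
    " 1110100101 1111011111 1010011011 1100101000 0101000010"
    " 0101100110 0100011111 1001011100 1011110110 0111011011"
).replace(" ", "")

def gt60(bits):                     # the row-set predicate
    return 1 if bits.count("1") >= 0.6 * len(bits) else 0

level_1 = "".join(str(gt60(level_0[i:i + 10])) for i in range(0, 150, 10))
level_2 = gt60(level_0)             # the root strides all 150 leaf bits

print(level_1)   # -> 111111110001011
print(level_2)   # -> 1
```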
FAUST (Fast Accurate Unsupervised, Supervised Treemining) uses pTrees for classification and clustering of spatial data.
First cut on PW (the largest relative gap): cH = 2 + 11.6/2 = 7.8.

SL mn  gap     SW mn   gap     PL mn   gap     PW mn   gap
ve 45   2.2    ve 27.6  .2     se 14.4 27.4    se  2   11.6
se 47.2 23.4   vi 27.8 9.4     ve 41.8  9.4    ve 13.6  5.6
vi 70.6        se 37.2         vi 51.2         vi 19.2

CLASS        PW
setosa        2
setosa        2
setosa        2
setosa        2
setosa        2
versicolor   15
versicolor   14
versicolor   14
versicolor   13
versicolor   12
virginica    17
virginica    22
virginica    16
virginica    22
virginica    19
Every setosa falls below the cut at 7.8 and every other sample above it (perfect on setosa!)
Remove setosa; next cut on SL: cH = 45 + 25.6/2 = 57.8.

SL mn  gap     SW mn   gap     PL mn  gap     PW mn   gap
ve 45  25.6    ve 27.6  .2     ve 41.8 9.4    ve 13.6 5.6
vi 70.6        vi 27.8         vi 51.2        vi 19.2

CLASS        SL
versicolor    1
versicolor   56
versicolor   57
versicolor   54
versicolor   57
virginica    73
virginica    64
virginica    72
virginica    77
virginica    67
(perfect classification of the rest!)
FAUST (simplest version)
For each attribute (column):
1. calculate the mean of each class;
2. sort those means ascending;
3. calc mean_gaps = differences of consecutive means (gapL is the gap on the low side of a mean; gapH is on the high side);
4. choose the best class and attribute for cutting: the largest (relative) mean_gap.
Then remove the class cut off by the max RELATIVE gap and repeat (steps 1, 2, 3 were done on the previous slide).
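The four steps above can be sketched on the level-1 PW values from the earlier slide (a minimal illustration of the mean-gap cut, not the authors' implementation):

```python
# Simplest FAUST cut: class means, sorted; cut at the midpoint of the
# largest gap between consecutive means.
data = {
    "setosa":     [2, 2, 2, 2, 2],
    "versicolor": [15, 14, 14, 13, 12],
    "virginica":  [17, 22, 16, 22, 19],
}
means = sorted((sum(v) / len(v), c) for c, v in data.items())
gaps = [(means[i + 1][0] - means[i][0], i) for i in range(len(means) - 1)]
g, i = max(gaps)
cut = means[i][0] + g / 2       # midpoint of the largest gap

print(means[i][1], cut)         # the cut separates setosa from the rest
```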
FAUST using impure pTrees (ipTrees), page 2

Take 24 samples from each class as training (every other one in the list of 50). First form 3-level gt50% ipTrees with level=1 stride=12; second form 3-level gt50% ipTrees with level=1 stride=24 (i.e., just a root above 3 leaf strides, 1 for each class).

Conclusion: for uncompressed 50% ipTrees (with root truth values), the root values come out close to the means.
level_1   s24gt50_PSL,j  s24gt50_PSW,j  s24gt50_PPL,j  s24gt50_PPW,j   (values)

level=1 stride=12 (each of the 2 level=1 bits per class strides 12 of the 24):
se 1 1 0 0 1 1 1 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0        51 38 15 0
se 1 1 0 0 1 0 1 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 1 0        50 34 14 2
ve 0 1 1 1 0 0 1 0 1 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 0      57 28 45 14
ve 0 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 0 0 0 0 0 1 0 0 0      63 30 40 8
vi 1 0 0 1 0 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 0      72 28 49 18
vi 1 0 0 0 1 0 1 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 1 1 0      69 30 48 22

level=1 stride=24 (each level=1 bit strides all 24 of a class):
se 1 1 0 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 0 0 0 1 0        51 34 15 2
ve 0 1 1 1 0 0 1 0 1 1 1 1 0 1 0 1 0 0 1 0 0 1 1 1 0      57 30 41 14
vi 1 0 0 1 0 0 1 0 1 1 1 1 0 1 1 0 0 0 1 0 1 0 1 1 0      73 30 49 22
In the previous two FAUST slides, three-level 60% ipTrees were used (leaves are level=0, the root is level=2), with each level=1 bit representing the predicate truth applied to 10 consecutive iris samples (leaf bits; i.e., level=1 stride=10). Above, instead of taking the entire 150 IRIS samples, 24 are selected from each class as training samples; the 60% is replaced by 50%, and level=1 stride=10 is replaced first with level=1 stride=12, then with level=1 stride=24.
Note: The means (averages) are almost the same in all cases.
FAUST using impure pTrees (ipTrees) page 3
ipTree construction can be done during the [one-time] construction of the basic pTrees.

R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43:
0 1 0 1 1 1 1 1 0 0 0 1
0 1 1 1 1 1 1 1 0 0 0 0
0 1 0 1 1 0 1 0 1 0 0 1
0 1 0 1 1 1 1 0 1 1 1 1
1 0 1 0 1 0 0 0 1 1 0 0
0 1 0 0 1 0 0 0 1 1 0 1
1 1 1 0 0 0 0 0 1 1 0 0
1 1 1 0 0 0 0 0 1 1 0 0
R11 = 0 0 0 0 1 0 1 1

8_4_2_1_gte50%_ipTree11:
level 3 (stride 8, root): 0   (3 of 8 1-bits, < 50%)
level 2 (stride 4):       0 1
level 1 (stride 2):       0 0 1 1
level 0 (stride 1, leaves): 0 0 0 0 1 0 1 1

node_naming: (Level, offset (left-to-right)); e.g., the lower left corner node is (0,0). The array of nodes at level L is [L, *].
pTree naming: Sn-1_..._S1_S0_gteX%_ipTree for an n-level ipTree with predicate gteX%, where S = Stride = # of leaf bits strided by the node. If it is a basic pTree, the pTree subscripts specify attribute and bit slice.

Note on bottom-up ipTree construction: one must record the 1-count of the stride of each inode. E.g., in binary trees, if one child is 1 and the other is 0, it could be that the 1-child is pure1 and the 0-child is just below 50% (so parent_node = 1), or that the 1-child is just above 50% and the 0-child has almost no 1-bits (so parent_node = 0). (Example on the next slide.) This can be done during the one pass through each bit slice required for bottom-up construction of pure1 pTrees.
binary pure1 pTree11 = 8_4_2_1_gte100%ipTree11 of the changed R11 = 1 0 0 0 1 0 1 1:
level 3 (root): 0
level 2:        0 0
level 1:        0 0 0 1
level 0:        1 0 0 0 1 0 1 1

bottom-up ipTree construction (R11 changed so that this issue of recording 1-counts as you go is pertinent):
1. the 1-child is pure1 and the 0-child is just below 50% (so parent_node = 1), or
2. the 1-child is just above 50% and the 0-child has almost no 1-bits (so parent_node = 0).

1 1 0 1 1 1 1 1 0 0 0 1
0 1 1 1 1 1 1 1 0 0 0 0
0 1 0 1 1 0 1 0 1 0 0 1
0 1 0 1 1 1 1 0 1 1 1 1
1 0 1 0 1 0 0 0 1 1 0 0
0 1 0 0 1 0 0 0 1 1 0 1
1 1 1 0 0 0 0 0 1 1 0 0
1 1 1 0 0 0 0 0 1 1 0 0
R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43

8_4_2_1_gte50%_ipTree11:
level 0: 1 0 0 0 1 0 1 1
level 1: 1 0 1 1
level 2, left node: 0 or 1? The 1-count of the left branch = 1 and the 1-count of the right branch = 0, so the stride=4 subtree 1-count = 1 (< 50%): node = 0. We know the 1-count of the right branch = 0 (pure0), but we wouldn't know the 1-count of the left branch unless it was recorded.
level 2, right node: 1 (3 of 4).
level 3 (root): 0 or 1? We need to know that the left branch 1-count = 1 and the right branch 1-count = 3, so this stride=8 subtree 1-count = 4 (>= 50%): node = 1.
Finally, note that recording the 1-counts as we build the tree upwards is a near-zero-extra-cost step.
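The bottom-up construction with recorded 1-counts can be sketched as follows (a minimal illustration on the changed R11; each node keeps (truth, 1-count) so parent truths are computed from exact counts, never from child truths):

```python
# Bottom-up gte50% ipTree: each node is (truth, 1-count of its stride).
def build_iptree(bits, threshold=0.5):
    level = [(b, b) for b in bits]            # leaves: truth = bit = count
    tree = [level]
    stride = 1
    while len(level) > 1:
        stride *= 2
        nxt = []
        for i in range(0, len(level), 2):
            cnt = level[i][1] + level[i + 1][1]   # exact count, not child truths
            nxt.append((1 if cnt >= threshold * stride else 0, cnt))
        level = nxt
        tree.append(level)
    return tree          # tree[-1][0] is (root_truth, total 1-count)

tree = build_iptree([1, 0, 0, 0, 1, 0, 1, 1])
print([t for t, _ in tree[1]])   # stride-2 truths -> [1, 0, 1, 1]
print(tree[2][0])                # left stride-4 node -> (0, 1)
print(tree[3][0])                # root -> (1, 4): 4 of 8, >= 50%
```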
RoloDex Model: 2 entities, many relationships. Example 2-entity relationship "cards":
cust-item card; author-doc card; term-doc card; doc-doc card; exp-gene card; gene-gene card (ppi); exp-PI card; itemset-itemset card; people-course Enrollments.
On the cust-item card, for an antecedent ItemSet A: Supp(A) = CusFreq(ItemSet), and Conf(A -> B) = Supp(A union B) / Supp(A).
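The two measures above can be sketched on a toy cust-item card (bit vectors over customers; the card data here is made up for illustration):

```python
# Support and confidence from a boolean item -> customer-bit-vector "card":
# Supp(A) = fraction of customers buying every item of A,
# Conf(A -> B) = Supp(A | B) / Supp(A).
card = {                 # 5 customers, one bit each (toy data)
    "i1": 0b10110,
    "i2": 0b10011,
    "i3": 0b01110,
}
N_CUST = 5

def supp(itemset):
    v = (1 << N_CUST) - 1
    for item in itemset:
        v &= card[item]            # customers buying every item of the set
    return bin(v).count("1") / N_CUST

def conf(A, B):
    return supp(A | B) / supp(A)

print(supp({"i1"}))                # 3 of 5 customers
print(conf({"i1"}, {"i3"}))        # 2 of those 3 also bought i3
```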
[Slide figures: a customer-rates-movie card (entries 0-5) and the derived customer-rates-movie-as-5 bitmap card; people-course Enrollments; items, people and terms axes.]
DataCube Model for 3 entities: items, people and terms.
term-term card (share stem?)
Relational Model:
Items: i1 i2 i3 i4 i5: |0 001|0 |0 11| |1 001|0 |1 01| |2 010|1 |0 10|
People: p1 p2 p3 p4: |0 100|A|M| |1 001|T|M| |2 010|S|F| |3 011|B|F| |4 100|C|M|
Terms: t1 t2 t3 t4 t5 t6: |1 010|1 101|2 11| |2 001|0 000|3 11| |3 011|1 001|3 11| |4 011|3 001|0 00|
Relationship: p1 i1 t1: |0 0| 1 |0 1| 1 |1 0| 1 |2 0| 2 |3 0| 2 |4 1| 2 |5 1| 2
RoloDex Model: 2 Entities, many relationships.
MYRRH (pTree-based ManY-Relationship-Rule Harvester) uses pTrees for ARM of multiple relationships.
MYRRH_2e_2r (note: standard pARM is MYRRH_2e_1r), e.g., Rate5(Cust,Book) or R5(C,B), and Purchase(Book,Cust) or P(B,C).
R5(C,B) (the generic R(E,F)); rows b = 1..4, columns c = 2..5:
0 0 0 1
0 0 1 0
0 0 0 1
0 1 0 0

P(B,C) (the generic S(E,F)):
1 0 0 1
0 1 1 1
1 0 0 0
1 1 0 0

If cust c rates book b as 5, then c purchases b. For b in B, {c | rate5(b,c)=y} is contained in {c | purchase(c,b)=y}:
ct(R5pTree_b & PpTree_b) / ct(R5pTree_b) >= mncnf
ct(R5pTree_b) / sz(R5pTree_b) >= mnsp
Speed of AND: R5pTreeSet & PpTreeSet (compute each ct(R5pTree_b & PpTree_b)); slice counts, b in B, ct(R5pTree_b & PpTree_b) with AND.
Schema: size(C) = size(R5pTree_b) = size(BpTree_b) = 4; size(B) = size(R5pTree_c) = size(BpTree_c) = 4.
Pre-computed 1-counts: BpTree_c: 3 2 1 2; R5pTree_c: 0 1 1 2; BpTree_b: 2 3 1 2; R5pTree_b: 1 1 1 1.
R5pTree_b & PpTree_b 1-counts: 1 1 0 1.
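The per-book AND counts and confidences can be sketched directly from the two cards above (rows are books b = 1..4, bits are customers c = 2..5):

```python
# "rates-5 implies purchases", checked book by book with bit-vector rows
# of the R5(C,B) and P(B,C) cards shown above.
R5 = [0b0001, 0b0010, 0b0001, 0b0100]   # row per book, bit per customer
P  = [0b1001, 0b0111, 0b1000, 0b1100]

counts = [bin(r & p).count("1") for r, p in zip(R5, P)]
confs = [bin(r & p).count("1") / bin(r).count("1") for r, p in zip(R5, P)]
print(counts)   # -> [1, 1, 0, 1]  (the pre-computed AND 1-counts)
print(confs)    # -> [1.0, 1.0, 0.0, 1.0]
```

The rule holds with confidence 1 for books 1, 2 and 4 and fails for book 3.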
Given e in E: if R(e,f) then S(e,f): ct(Re & Se)/ct(Re) >= mncnf, ct(Re)/sz(Re) >= mnsp.
The quantified variants (AND for "for all", OR for "there exists"):
If for all e in A, R(e,f), then for all e in B, S(e,f): ct( AND_{e in A} Re & AND_{e in B} Se ) / ct( AND_{e in A} Re ) >= mncnf. ...
If for all e in A, R(e,f), then for some e in B, S(e,f): ct( AND_{e in A} Re & OR_{e in B} Se ) / ct( AND_{e in A} Re ) >= mncnf. ...
If for some e in A, R(e,f), then for all e in B, S(e,f): ct( OR_{e in A} Re & AND_{e in B} Se ) / ct( OR_{e in A} Re ) >= mncnf. ...
If for some e in A, R(e,f), then for some e in B, S(e,f): ct( OR_{e in A} Re & OR_{e in B} Se ) / ct( OR_{e in A} Re ) >= mncnf. ...
Consider 2 customer classes, Class1 = {C=2,3} and Class2 = {C=4,5}. Then P(B,C) is the TrainingSet:
C\B  1 2 3 4
2    1 0 1 1
3    0 1 0 1
4    0 1 0 0
5    1 1 0 0
Then the DiffSup table is:
B=1 B=2 B=3 B=4
0   1   1   2
Book=4 is very discriminative of Class1 and Class2, e.g., Class1 = salary>$100K. With P1 = {B=1|2} and P2 = {B=3|4}:
     P1  P2
C1   0   1
C2   1   0
DS   1   1
P1 [and P2, B=2 and B=3] is somewhat discriminative of the classes, whereas B=1 is not.
Are "Discriminative Patterns" covered by ARM? E.g., does the same information come out of strong rule mining? Does DP yield information across multiple relationships, e.g., determining the classes via the other relationship?
MYRRH_2e_3r: Rate1(Cust,Book) or R1(C,B), Purchase(Book,Cust) or P(B,C), Sell(Cust,Book) or S(B,C).
If cust c rates book b as 1 and c purchases b, then likely c sells b at term end. For b in B, {c | R1(c,b)=y & P(c,b)=y} is contained in {c | S(c,b)=y}:
ct(R1pTree_b & PpTree_b & SpTree_b) / ct(R1pTree_b & PpTree_b) >= minconf
[Figure: the R1(C,B), P(B,C) and S(B,C) matrices over B = 1..4, C = 2..5.]
3e_3r: Students who buy b, and courses using b: does the student enroll in the course? {(s,c) | Buy(s,b)=y & Text(b,c)=y} is contained in {(s,c) | Enroll(s,c)=y}:
cnt(EpTreeSubSet(BpTree_b x TpTree_b)) / (cnt(BpTree_b) * cnt(TpTree_b)) > mncf
[Figure: book x course Text matrix, student x course Enroll matrix, and Buy matrix.]

3e_2r: Rate5(Student,Course), PurchHardCov(Book,Stu): if a student s rates any course as 5, then s purchases a HardCover book. [Figure: R5(S,C) and PHC(B,S) matrices.]
[Figure: book x course Text matrix, student x course Enroll and Buy matrices, and a course x offering Location matrix.] If s enrolls in c, and c is Offered at L, and L uses Text = b, then s Buys b.
4e_4r: Any 2 adjacent relationships can be collapsed into 1: R(c,b) and P(b,e) for some b iff RP(c,e). By doing so, we have a whole new relationship to analyze.

R(C,B):
0 0 0 1
0 0 1 0
0 0 0 1
0 1 0 0

P(B,C):
1 0 0 1
0 1 1 1
1 0 0 0
1 1 0 0

Given c, {b | R(c,b)} is List(PR,c). For b in List(PR,c), {e in C | P(b,e)} is List(PP,b). Therefore {e | RP(c,e)} = OR_{b in List(PR,c)} PP,b.

RP(C,C):
0 1 0 1
0 1 1 0
0 0 1 0
0 0 1 1
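The collapse rule above is boolean matrix composition, computed by OR-ing pTrees. A minimal sketch on toy 4x4 cards (the bit patterns here are illustrative, not the slide's exact matrices):

```python
# RP(c,e) = OR over b in List(PR,c) of P's pTree for b,
# i.e., boolean composition of R(C,B) with P(B,C).
R = [0b0001, 0b0010, 0b0001, 0b0100]   # R(c,b): row per c, bit per b (toy data)
P = [0b1001, 0b0111, 0b1000, 0b1100]   # P(b,e): row per b, bit per e (toy data)
n = 4

RP = []
for c in range(n):
    acc = 0
    for b in range(n):
        if (R[c] >> (n - 1 - b)) & 1:  # b is in List(PR,c)
            acc |= P[b]                # OR in b's pTree over e
    RP.append(acc)

print([format(r, "04b") for r in RP])
```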
Derived relationships (S=STUDENT 2..5, C=COURSE 1..4, B=BOOK 1..4):

P=PURCHASE(S,B):      E=ENROLL(S,C):      T=TEXT(C,B):
0 0 0 1               1 0 0 1             0 0 0 1
0 0 1 0               0 1 1 1             0 1 1 0
0 0 0 1               1 0 0 0             1 0 1 0
0 1 0 0               1 1 0 0             0 1 0 1

Let Tc = the C-pTree of T for C=c, with list = {b | T(c,b)}.

PT=PURCHASE_TEXT(S,C), PTc = OR_{b in List(Tc)} Pb (also PTs = OR_{b in List(Ps)} Tb):
0 0 0 1
0 0 1 1
0 1 1 0
0 0 0 1

ET=ENROLL_TEXT(S,B), ETs = OR_{c in List(Es)} Tc (also ETb = OR_{c in List(Tb)} Ec):
1 1 0 1
1 1 1 1
1 1 1 1
1 0 0 0

PE=PURCHASE_ENROLL(C,B), PEc = OR_{s in List(Ec)} Ps (also PEb = OR_{s in List(Pb)} Es):
0 0 1 1
0 0 1 0
0 0 1 1
1 0 1 0
With PGP-D, to get pTree info, you need: the ordering (the mapping of bit position to table row) and the predicate (e.g., the table column id and bit-slice number or bitmap involved). pTrees are compressed, data-mining-ready vertical data structures which need not be uncompressed to be used. PGP-D is a mechanism in which we "scramble" pTree information (predicate info, but also possibly ordering info) in a way that the data can be processed without unscrambling.

For data mining purposes, the scrambled pTrees would be unrevealing of the raw data to anyone, but a qualified person could issue a data-mining request (classification/ARM/clustering). It is different from encrypting.

The Predicate Key (PK) reveals the pTree predicates (for basic pTrees, e.g., the "predicate" specifies which column and which bit position). Make all pTrees (over the entire [distributed] DB) the same length. Pad in the front [and the back?] so that statistics cannot reveal the pTree start position. Scramble the locations of the pTrees. For basic pTrees, the PK would reveal offset and pre-pad. The example PK reveals that the 1st pTree is found at offset=5 (it has been shuffled forward 5 pTree slots, of the slots reserved for that table) and that its first 54 bits are pad bits.

If the DB had 5000 files with 50 columns each (on avg) and each column had 32 bits (on avg), we have 8 million pTrees. We could pad with statistically indistinguishable additions to make it impossible to try enough alternatives in human time to break the key.

An additional thought: in the distributed case (multiple sites), since we'd want lots of pTrees, it would make sense to always fully replicate (making all retrievals local). Thus we are guaranteed that all pTrees are statistically "real looking" (because they ARE real). We might not need to pad with bogus pTrees.

A hacker could extract just the first bit of every pTree (e.g., the 8M bits that ARE the first horizontal record), then shuffle those bits until something meaningful appears (or starts to appear). From all meaningful shuffles, he/she might be able to break the key code (e.g., look at the 2nd, 3rd, etc.). To get around that possibility, we could store the entire database as one massive "Big Bit String" and include in the Predicate Key the start offset of each pTree (shuffled randomly). We would include a column with the [randomly determined] amount of padding (now variable) so that the position of the first start bit is unknowable. Alternatively, we could use a common length but put random "non-pTree" gaps between the pTrees. Alternatively, the "Key" could simply specify the start address of each pTree (and its length?).
PGP-D Pretty Good Protection of Data protects vertical pTree data.
5,54 | 7,539 | 87,3 | 209,126 | 25,896 | 888,23 | ...key=array(offset,pad)
Could also construct a large collection of bogus key-lookup tables (identifying the correct one to the authorized subgroup only) as an additional layer; encrypt?

For multiple users at different levels of security (with rights to parts of the DB and not others), we would have a separate key for each user level. Using the key would be simple and quick, and once the key is applied, accessing and processing the data would be at zero additional time cost (the current thinking is that we would not encrypt or otherwise alter the pTrees themselves, just their identity). One would only need to work on the "key mechanism" to improve the method in speed and protection level (individual pTrees are intact/unaltered).

Some data collections need not be protected in their entirety; protection tends to be by column and not by row, and pTrees are good for column protection (i.e., it is usually the case that certain attributes are sensitive and others are routine public information). When there are differences in protection level by row (subsets of instances of the entity require different protection levels), we would simply create each subset as a separate "file" (all of the same massive length through padding) and protect each at the proper level.
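The key=array(offset,pad) idea can be sketched as a lookup (a hypothetical layout: fixed-size slots and a front pad, with the key entries taken from the example PK above):

```python
# PGP-D style retrieval: the key entry for a logical pTree gives the slot it
# was shuffled to and how many pad bits precede the real bits.
key = [(5, 54), (7, 539), (87, 3), (209, 126), (25, 896), (888, 23)]

def fetch_ptree(storage, slot_len, logical_id):
    offset, pad = key[logical_id]
    slot = storage[offset * slot_len:(offset + 1) * slot_len]
    return slot[pad:]            # strip the front padding; the tail may also be pad

# toy storage: 1000 slots of 1600 bits each, zero-filled for illustration
storage = "0" * (1000 * 1600)
print(len(fetch_ptree(storage, 1600, 0)))   # 1600 - 54 = 1546 bits
```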
ROLL CC: data items are requested for [read and/or write] access by a transaction using a REQUEST VECTOR (RV), a bit vector. Each data item is mapped to a bit position (or the ordering can be assumed to be the table ordering). A 1-bit at a position indicates that the item is requested by the transaction; a 0-bit means it is not. If read and write modes are distinguished, ROLL uses a read bit and a write bit for each item.

ROLL has 3 basic methods:
POST (allows a transaction to request its data item needs). POST is an atomic "enqueue" operation (only atomicity is required; it is the only critical section). POSTs can be batched so that low-priority transaction POSTs are delayed in favor of higher-priority ones.
CHECK (determines requested data item availability). CHECK returns the logical OR of all RVs behind it; the result is called the "Access Vector" or AV. A background, ever-running process can be creating and attaching AVs to each RV; then a transaction's CHECK need only proceed until it encounters (ORs in) an AV which specifies new item availability. Re-CHECKing can be done at any time.
RELEASE: sets some or all of a transaction's 1-bits to 0-bits.
Queue of POSTed RVs, head to tail: RVi (head) = 010010:0, RVj = 110010:0, RVk = ...010011:0, with corresponding AVi = 010010:0, AVj = 110010:0, AVk = 110011:0. (The critical-section POST of the next RV_Ti+1 is done by copying tail_ptr to RV_Ti+1_ptr and then resetting tail_ptr to RV_Ti+1. The background CreateAVs process begins at the head, repeatedly ORing the RV_Ts going left to right.)

CHECK_RV_Tj begins at its position: it ORs the next RVs into a copy of RV_Tj+1, moving right (for max recency; else it just checks its own AV), building an AV_Tj, until it determines sufficient availability. Then it suspends CHECK and begins processing the newly available data items (but it may go all the way to the head before suspending). It could also maintain the list of RVs blocking its access, so that its next CHECK can OR only those RVs to get AV_Tj (or check only those AVs).
Every Tj RELEASES bits (sets them to 0 in RV_Tj) as the corresponding data item is no longer needed.

Designate a separate ROLL for each partition, OR use multi-level pTrees where the upper level is the file level. ROLL RVs and AVs are pTrees with the same structure: the upper level is the file level, then whatever record-level pTree structure is used for the basic pTrees representing the data in the file itself. E.g., for an image file the ordering of tuples (pixels) might be Peano or Z ordering, and therefore the RV and AV (except for the top file level) would also indicate pixel access needs with the same pTree structure (1 means "need that pixel"). So the ROLL elements (RVs and AVs) are just coded record-level bit slices (or trees, in the multi-level pTree case).

AVs for each POSTed RV would be created by a background process in reverse POST order (time-stamped?). As soon as a CHECK process encounters an AV which provides additional accesses not previously available to that transaction, it can stop the CHECK and use those items; or it can continue, to gain a larger set of available items (by ignoring the AV and ORing only the RVs it encounters; this would make sense if the TS is old and/or an entire set of accesses is required to make progress at all, e.g., an entire file).

A record is "available" iff the entire record is available AND every field is. A field is available iff its record and that field are available. First Come First Served, except: low-priority transactions are delayed for incoming high-priority transactions. A read-only data mine ignores concurrency altogether.
ConCur (Concurrency Control): ROCC and ROLL concurrency control using pTrees.
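The POST / CHECK / RELEASE cycle can be sketched with integer bit-vector RVs (a minimal single-threaded illustration; in the real system POST is the atomic critical section and AVs are built by a background process):

```python
# ROLL sketch: an item is available to a transaction iff no RV queued ahead
# of it still requests that item.
queue = []                          # POSTed request vectors, head first

def post(rv):                       # atomic enqueue in the real system
    queue.append(rv)
    return len(queue) - 1           # the transaction's queue position

def check(pos):
    av = 0
    for rv in queue[:pos]:          # OR of all RVs ahead = Access Vector
        av |= rv
    return queue[pos] & ~av         # requested items nobody ahead holds

def release(pos, bits):
    queue[pos] &= ~bits             # drop items no longer needed

i = post(0b010010)
j = post(0b110010)
avail = check(j)
print(format(avail, "06b"))         # -> 100000: item 010010 is held by i
release(i, 0b010010)
print(format(check(j), "06b"))      # -> 110010: everything j asked for is free
```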
Domain Vectors (DVs) are bitmaps representing the presence of a domain's values. The mapping which assigns domain vector positions to domain values is the Domain Vector Table (DVT).

DOMAIN VECTORS: Given a domain D (e.g., D = {3-letter strings} for a name field), the DVT is:
nam | surrogate
====|==========
aaa | 0
aab | 1
...
aaz | 25
...
zzz | 17575

Then an attribute R.A in a relation R has Domain Vector DV(R.A) = (0010100100110...0), with a 1-bit in the nth position iff the domain value with surrogate n occurs in R.A. E.g.,
DV(CUSTOMER.nam) = (0...010...010...010...0) with 1-bits at surrogates 1886 (JAN), 1897 (SUE), 3289 ("JAY"), 13395 ("JON") (e.g., JAN is the 1886th domain value, i.e., has surrogate 1886).
The DV Accelerator method is as follows. Keep a DV for some fields (particularly primary keys and frequently joined attributes). Note: to reduce the size of these vectors, surrogate the "extant domain" (the currently appearing domain values), assigning the next surrogate to new values. Update the DV after the Insert of a new record:
i. form a Modify Vector (MV), e.g., if ABE joins the buying club, form the MV with a 1 in the 31st position, 0 elsewhere;
ii. OR the MV into the DV.

DOVE (DOmain VEctor query processing): DB query processing using pTrees.
Delete tuple (assume the field value was not duplicated):
i. form the MV for the deleted value (e.g., ABE drops membership);
ii. XOR the MV into the DV.

To Join:
i. materialize the primary DV;
ii. logically AND the other DV into it, producing a JOIN VECTOR (note that a JV is a key-value-sorted list of matches);
iii. apply the JV to each file index, producing surrogate lists (-1- a nested loop is efficient since all records match, but inefficient rereading of pages may occur; -2- step iv is a guess for sparse joins);
iv. sort the surrogate lists, read the files, sort each file, merge-join (this should minimize page reads and page faults).

Projection: depth-first retrieval on the index (already optimal).

Selection:
i. form a Select Vector (SV), with a 1 for all values to be selected (if the filter is a logical combination of key ranges, form key-range vectors and use the corresponding logical ops (OR, AND, NOT)); e.g., SELECT ALL CUSTOMERS STARTING WITH J: SV = (0..01..10..0) with 1-bits from surrogate 6760 to 7436;
ii. logically AND the DV into the SV;
iii. apply the SV to the file index, producing a surrogate list;
iv. sort the surrogate list, read the file.
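The join acceleration can be sketched with the 3-letter surrogate scheme above (helper names here are assumptions, not DOVE's actual API):

```python
# DOVE-style join: AND the two attributes' domain vectors; surviving 1-bits
# are exactly the values present on both sides.
def surrogate(name):                 # aaa -> 0 ... zzz -> 17575
    a, b, c = (ord(ch) - ord("a") for ch in name.lower())
    return (a * 26 + b) * 26 + c

def dv(values):                      # domain vector as a Python int bitset
    v = 0
    for s in map(surrogate, values):
        v |= 1 << s
    return v

cust = ["jan", "sue", "jay", "jon"]
orders = ["sue", "jon", "abe"]
jv = dv(cust) & dv(orders)           # the JOIN VECTOR
matches = [n for n in cust if (jv >> surrogate(n)) & 1]
print(sorted(matches))   # -> ['jon', 'sue']
```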
http://web.cs.ndsu.nodak.edu/~perrizo/classes/765/dvex1.html
http://web.cs.ndsu.nodak.edu/~perrizo/classes/765/qpo.html
http://web.cs.ndsu.nodak.edu/~perrizo/classes/765/dvex0.html