PINE: Podium Incremental Neighborhood Evaluator (PowerPoint presentation transcript)

DESCRIPTION
pTrees (predicate Tree technologies) provide fast, accurate horizontal processing of compressed, data-mining-ready, vertical data structures.
[Slide figure: a document x course Text matrix, a person x course Enroll matrix, and a Buy matrix; these cards are redrawn in the MYRRH slides below.]
Applications:
MYRRH (ManY-Relationship-Rule Harvester) uses pTrees for association rule mining of multiple relationships.
ConCur (Concurrency Control) uses pTrees for ROCC and ROLL concurrency control.
PGP-D (Pretty Good Protection of Data) protects vertical pTree data.
FAUST (Fast Accurate Unsupervised, Supervised Treemining) uses pTrees for classification and clustering of spatial data.
PINE (Podium Incremental Neighborhood Evaluator) uses pTrees for Closed k Nearest Neighbor Classification.
DOVE (DOmain VEctors) uses pTrees for database query processing.
1st, Vertical Processing of Horizontal Data (VPHD): for horizontally structured records, we scan vertically.

R(A1 A2 A3 A4), Base 10 = Base 2:
2 7 6 1   010 111 110 001
3 7 6 0   011 111 110 000
2 6 5 1   010 110 101 001
2 7 5 7   010 111 101 111
3 2 1 4   011 010 001 100
2 2 1 5   010 010 001 101
7 0 1 4   111 000 001 100
7 0 1 4   111 000 001 100

predicate Trees (pTrees): project each attribute (now 4 files), then vertically slice off each bit position (now 12 files: R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43), then compress each bit slice into a pTree. E.g., the compression of R11 = 0 0 0 0 0 0 1 1 into P11 records the truth of the predicate "pure1 (all 1's)" in a tree, recursively on halves, until the half is pure (all 1's or all 0's):

1. Whole thing pure1? false -> 0
2. Left half pure1? false -> 0 (and it is pure0, so this branch ends)
3. Right half pure1? false -> 0
4. Left half of right half pure1? false -> 0
5. Right half of right half pure1? true -> 1

2nd, use the pTrees to find the number of occurrences of 7 0 1 4. Since 7 0 1 4 = 111 000 001 100, count the 1-bits of
P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43
(uncomplemented pTrees where the target bit is 1, complemented where it is 0). Reading the count level by level from the result pTree: 0*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 2 occurrences (the last two records).
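The two steps above can be sketched in a few lines. This is a minimal illustration (plain Python lists standing in for compressed pTrees; the data is the 8-row R(A1..A4) table above): slice each 3-bit attribute vertically, then AND the slices, complemented where the target bit is 0.

```python
# Vertical slicing of R(A1..A4) and AND-based counting of (7,0,1,4).
R = [(2, 7, 6, 1), (3, 7, 6, 0), (2, 6, 5, 1), (2, 7, 5, 7),
     (3, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4)]
n = len(R)

def bit_slice(rows, attr, bit):
    # bit 2 is the high-order bit of a 3-bit value
    return [(r[attr] >> bit) & 1 for r in rows]

slices = {(a, b): bit_slice(R, a, b) for a in range(4) for b in (2, 1, 0)}

def count(target):
    acc = [1] * n
    for a in range(4):
        for b in (2, 1, 0):
            s = slices[(a, b)]
            want = (target[a] >> b) & 1
            # AND in the slice, complemented where the target bit is 0
            acc = [x & (y if want else 1 - y) for x, y in zip(acc, s)]
    return sum(acc)

print(count((7, 0, 1, 4)))   # -> 2
```

The same `count` works for any tuple; e.g., `count((2, 7, 6, 1))` finds the single first record.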
First, 3NN using horizontal data to classify an unclassified sample a = (0 0 0 0 0 0) (attributes a5 a6 a11 a12 a13 a14; a10 = C is the class).

Key  a1 a2 a3 a4 a5 a6 a7 a8 a9 a10=C a11 a12 a13 a14 a15 a16 a17 a18 a19 a20
t12  1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 0 0 1
t13  1 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1
t15  1 0 1 0 0 0 1 1 0 1 0 1 0 1 0 0 1 1 0 0
t16  1 0 1 0 0 0 1 1 0 1 1 0 1 0 1 0 0 0 1 0
t21  0 1 1 0 1 1 0 0 0 1 1 0 1 0 0 0 1 1 0 1
t27  0 1 1 0 1 1 0 0 0 1 0 0 1 1 0 0 1 1 0 0
t31  0 1 0 0 1 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1
t32  0 1 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 0 0 1
t33  0 1 0 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 1
t35  0 1 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 1 0 0
t51  0 1 0 1 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 1
t53  0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1
t55  0 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0
t57  0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 0
t61  1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1
t72  0 0 1 1 0 0 1 1 0 0 0 1 1 0 1 1 0 0 0 1
t75  0 0 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0

Area for the 3 nearest nbrs after scanning the first three rows (columns a5 a6 a10=C a11 a12 a13 a14; distance from a = 000000):
t12  0 0 1 0 1 1 0   distance 2
t13  0 0 1 0 1 0 0   distance 1
t15  0 0 1 0 1 0 1   distance 2

Continue the scan: t16 distance=2, don't replace; t21 distance=4, don't replace; t27 distance=4, don't replace; t31 distance=3, don't replace; t32 distance=3, don't replace; t33 distance=2, don't replace; t35 distance=3, don't replace; t51 distance=2, don't replace; t53 distance=1, replace:
t53  0 0 0 0 1 0 0   distance 1
t55 distance=2, don't replace; t57 distance=2, don't replace; t61 distance=3, don't replace; t72 distance=2, don't replace; t75 distance=2, don't replace.

Vote of the 3NN set {t12, t13, t53}: C=0 gets 1 vote, C=1 gets 2. C=1 wins!
PINE (Podium Incremental Neighborhood Evaluator) uses pTrees for Closed k Nearest Neighbor Classification (CkNNC).
(Same training table as above.)
Next, C3NN using horizontal data: a second pass is necessary to find all other voters at distance 2 from a. (3NN set after the 1st scan: t12 distance 2, t13 distance 1, t53 distance 1; vote after 1st scan: C=1 wins.)

Second scan: t12 d=2, already voted; t13 d=1, already voted; t15 d=2, include it also; t16 d=2, include it also; t21 d=4, don't include; t27 d=4, don't include; t31 d=3, don't include; t32 d=3, don't include; t33 d=2, include it also; t35 d=3, don't include; t51 d=2, include it also; t53 d=1, already voted; t55 d=2, include it also; t57 d=2, include it also; t61 d=3, don't include; t72 d=2, include it also; t75 d=2, include it also.

Closed-3NN vote over all points at distance <= 2: C=1 gets 5 votes (t12 t13 t15 t16 t33), C=0 gets 6 votes (t51 t53 t55 t57 t72 t75). C=0 wins now!
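The closed-3NN vote just described can be sketched directly (a minimal illustration on the six relevant attributes; the bit tuples are transcribed from the training table above):

```python
# Closed 3NN (C3NN): every training point within the distance of the
# 3rd nearest neighbor votes.  Rows are (class C, (a5,a6,a11,a12,a13,a14)).
train = {
 "t12": (1, (0,0,0,1,1,0)), "t13": (1, (0,0,0,1,0,0)), "t15": (1, (0,0,0,1,0,1)),
 "t16": (1, (0,0,1,0,1,0)), "t21": (1, (1,1,1,0,1,0)), "t27": (1, (1,1,0,0,1,1)),
 "t31": (1, (1,0,1,0,1,0)), "t32": (1, (1,0,0,1,1,0)), "t33": (1, (1,0,0,1,0,0)),
 "t35": (1, (1,0,0,1,0,1)), "t51": (0, (0,0,1,0,1,0)), "t53": (0, (0,0,0,1,0,0)),
 "t55": (0, (0,0,0,1,0,1)), "t57": (0, (0,0,0,0,1,1)), "t61": (0, (1,0,1,0,1,0)),
 "t72": (0, (0,0,0,1,1,0)), "t75": (0, (0,0,0,1,0,1)),
}
a = (0, 0, 0, 0, 0, 0)

def dist(u, v):                 # Hamming distance
    return sum(x != y for x, y in zip(u, v))

k = 3
ds = sorted(dist(v, a) for _, v in train.values())
radius = ds[k - 1]              # distance of the k-th nearest neighbor (= 2 here)
votes = [c for c, v in train.values() if dist(v, a) <= radius]
print(votes.count(1), votes.count(0))   # -> 5 6, so C=0 wins
```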
PINE: a Closed 3NN method using pTrees (vertical data structures). The pTree-based C3NN goes as follows: first let all training points at distance=0 vote, then distance=1, then distance=2, ..., until at least 3 votes are cast.

Attribute pTrees (one 17-bit vector per attribute; bit order t12 t13 t15 t16 t21 t27 t31 t32 t33 t35 t51 t53 t55 t57 t61 t72 t75):
a1    11110000000000100
a2    00001111111111000
a3    11111100000000111
a4    00000000001111011
a5    00001111110000100
a6    00001100000000000
a7    11110000001111011
a8    11110000001111011
a9    00000011110000100
a10=C 11111111110000000    C' 00000000001111111
a11   00011010001000100
a12   11100001110110011
a13   10011111001001110
a14   00100100010011001
a15   11010001100100010
a16   10000001000000010
a17   00101110011011101
a18   00101110011011101
a19   01010000100100000
a20   11001011101100110

For distance=0 (exact matches with s = 000000 on a5 a6 a11 a12 a13 a14), construct the pTree
Ps = P'5 & P'6 & P'11 & P'12 & P'13 & P'14
   = 11110000001111011 & 11110011111111111 & 11100101110111011 & 00011110001001100 & 01100000110110001 & 11011011101100110
   = 00000000000000000
No neighbors at distance=0; then AND with PC and PC' to compute the vote (both counts are 0).
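The distance=0 step can be sketched with integer bit vectors standing in for the (uncompressed) pTrees; the 17-bit slices are transcribed from the slide above:

```python
# Distance=0 voters: AND the six relevant slices, complemented wherever the
# sample bit is 0 (here the sample s is all zeros, so every slice is complemented).
N = 17
MASK = (1 << N) - 1
P = {                       # leftmost slide bit = most significant bit
    5:  int("00001111110000100", 2),
    6:  int("00001100000000000", 2),
    11: int("00011010001000100", 2),
    12: int("11100001110110011", 2),
    13: int("10011111001001110", 2),
    14: int("00100100010011001", 2),
}
PC = int("11111111110000000", 2)          # class pTree a10=C
s = {5: 0, 6: 0, 11: 0, 12: 0, 13: 0, 14: 0}

Ps = MASK
for i, bit in s.items():
    Ps &= P[i] if bit else (~P[i] & MASK)

print(Ps)   # -> 0: no exact matches, so no distance-0 voters
```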
pTree-based C3NN: find all distance=1 nbrs. Construct the pTree
PD(s,1) = OR_i P_{|si-ti|=1; |sj-tj|=0, j != i}
        = OR_i ( P_i & AND_{j in {5,6,11,12,13,14}-{i}} P'_j ),   i in {5,6,11,12,13,14}
i.e., the OR of six AND-terms (for i = 5, 6, 11, 12, 13, 14), each taking the uncomplemented slice for the one attribute i and the complemented slices for the other five (since s = 000000, |si-ti|=1 means bit i is on). Using the slices a5 a6 a11 a12 a13 a14 and their complements from the previous slide:
PD(s,1) = 01000000000100000
so t13 and t53 are the distance=1 neighbors; ANDing with PC and PC' gives one vote for C=1 (t13) and one for C=0 (t53).
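The OR-of-six-ANDs just described, sketched with the same integer-encoded slices:

```python
# PD(s,1): one AND-term per attribute i "flipped" relative to the all-zero
# sample; OR the six terms together.
N = 17
MASK = (1 << N) - 1
P = {
    5:  int("00001111110000100", 2),
    6:  int("00001100000000000", 2),
    11: int("00011010001000100", 2),
    12: int("11100001110110011", 2),
    13: int("10011111001001110", 2),
    14: int("00100100010011001", 2),
}
attrs = [5, 6, 11, 12, 13, 14]

PD1 = 0
for i in attrs:
    term = P[i]                       # |s_i - t_i| = 1 (sample bit is 0)
    for j in attrs:
        if j != i:
            term &= ~P[j] & MASK      # |s_j - t_j| = 0
    PD1 |= term

print(format(PD1, "017b"))   # -> 01000000000100000  (t13 and t53)
```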
pTree-based C3NN, distance=2 nbrs: OR all double-dimension interval pTrees:
PD(s,2) = OR_{i,j in {5,6,11,12,13,14}} P_{i,j}
P_{i,j} = P_i & P_j & AND_{k in {5,6,11,12,13,14}-{i,j}} P'_k
using the slices (bit order t12 t13 t15 t16 t21 t27 t31 t32 t33 t35 t51 t53 t55 t57 t61 t72 t75):
a5    00001111110000100
a6    00001100000000000
a10=C 11111111110000000
a11   00011010001000100
a12   11100001110110011
a13   10011111001001110
a14   00100100010011001
The 15 pairwise terms are P5,6 P5,11 P5,12 P5,13 P5,14 P6,11 P6,12 P6,13 P6,14 P11,12 P11,13 P11,14 P12,13 P12,14 P13,14; each is the AND of the two uncomplemented slices with the complements of the other four. Accumulating votes term by term, after P12,14 we already have 3 nearest nbrs and could quit and declare C=1 the winner. But completing all 15 terms gives the full C3NN set, and we can declare C=0 the winner! PINE = CkNN in which all training samples vote, weighted by their nearness to a (~Olympic podiums).
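The distance=1 and distance=2 constructions generalize to any distance d: one AND-term per d-subset of flipped attributes, OR-ed together. A minimal sketch (same integer-encoded slices; the sample is all zeros, so "flipped" simply means the slice is used uncomplemented):

```python
# PD(s,d) for the all-zero sample s: OR over all d-subsets of the six
# attributes of the AND of "on" slices and complements of "off" slices.
from itertools import combinations

N = 17
MASK = (1 << N) - 1
P = {
    5:  int("00001111110000100", 2),
    6:  int("00001100000000000", 2),
    11: int("00011010001000100", 2),
    12: int("11100001110110011", 2),
    13: int("10011111001001110", 2),
    14: int("00100100010011001", 2),
}
attrs = [5, 6, 11, 12, 13, 14]

def PD(d):
    out = 0
    for flipped in combinations(attrs, d):
        term = MASK
        for k in attrs:
            term &= P[k] if k in flipped else (~P[k] & MASK)
        out |= term
    return out

for d in range(3):
    print(d, format(PD(d), "017b"))
# d=0 is empty, d=1 marks t13 and t53, d=2 marks the nine remaining voters
```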
FAUST using impure pTrees (ipTrees)

All pTrees are defined by row-set predicates (true/false on any row set). E.g., on T(A,B,C), the units bit slice pTree of T.A using the predicate ">60% 1-bits" is true iff more than 60% of the A-values are odd.

The IRIS dataset can be downloaded from the UCI Data Repository. To cluster the IRIS dataset of 150 iris flower samples (50 setosa, 50 versicolor, 50 virginica) using 2-level 60% ipTrees (with each upper-level bit representing the predicate truth applied to 10 consecutive iris samples), level-1 is shown below. FAUST clusters perfectly using only this level (an order of magnitude smaller bit vectors, so faster processing!).

level-1 values:    SL   SW   PL   PW
setosa             38   38   14    2
setosa             50   38   15    2
setosa             50   34   16    2
setosa             48   42   15    2
setosa             50   34   12    2
versicolor          1   24   45   15
versicolor         56   30   45   14
versicolor         57   28   32   14
versicolor         54   26   45   13
versicolor         57   30   42   12
virginica          73   29   58   17
virginica          64   26   51   22
virginica          72   28   49   16
virginica          77   30   48   22
virginica          67   26   50   19
Level-1 mn       54.2 30.8 35.8 11.6
setosa mn        47.2 37.2 14.4    2
versicolor mn    45   27.6 41.8 13.6
virginica mn     70.6 27.8 51.2 19.2
SL mn  gap     SW mn   gap     PL mn   gap     PW mn   gap
ve 45   2.2    ve 27.6  .2     se 14.4 27.4    se  2   11.6
se 47.2 23.4   vi 27.8 9.4     ve 41.8  9.4    ve 13.6  5.6
vi 70.6        se 37.2         vi 51.2         vi 19.2
level_1 bit table: 15 rows (5 setosa, 5 versicolor, 5 virginica, one row per stride of 10 consecutive samples) x bit-slice columns s10gt60_PSL,j, s10gt60_PSW,j, s10gt60_PPL,j, s10gt60_PPW,j, computed from the 150 level_0 raw bits.
E.g., for PPW,1:
level_0 (the 150 raw bits, in strides of 10):
1111101110 1100100111 1010110111 1001011011 1111011111 1110100101 1111011111 1010011011 1100101000 0101000010 0101100110 0100011111 1001011100 1011110110 0111011011
level_1 = s10gt60_PPW,1 (each of the 15 level_1 bits strides 10 level_0 bits): 11111 11100 01011
level_2 = s150_s10_gt60_PPW,1 (the level_2 root bit strides all 150 level_0 bits): 1
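A minimal sketch of building those two upper levels from the raw bits above (here "gt60" is treated as "at least 60% 1-bits", which matches the slide's level_1 bits):

```python
# Build level_1 (stride 10) and level_2 (stride 150) of the gt60% ipTree
# for the PW bit slice PPW,1 shown above.
level_0 = (
    "1111101110 1100100111 1010110111 1001011011 1111011111"
    " 1110100101 1111011111 1010011011 1100101000 0101000010"
    " 0101100110 0100011111 1001011100 1011110110 0111011011"
).replace(" ", "")

def gt60(bits):                     # the row-set predicate
    return 1 if bits.count("1") >= 0.6 * len(bits) else 0

level_1 = "".join(str(gt60(level_0[i:i + 10])) for i in range(0, 150, 10))
level_2 = gt60(level_0)             # the root strides all 150 leaf bits

print(level_1)   # -> 111111110001011
print(level_2)   # -> 1
```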
FAUST (Fast Accurate Unsupervised, Supervised Treemining) uses pTrees for classification and clustering of spatial data.
First cut on PW (the largest relative gap): cH = 2 + 11.6/2 = 7.8.

SL mn  gap     SW mn   gap     PL mn   gap     PW mn   gap
ve 45   2.2    ve 27.6  .2     se 14.4 27.4    se  2   11.6
se 47.2 23.4   vi 27.8 9.4     ve 41.8  9.4    ve 13.6  5.6
vi 70.6        se 37.2         vi 51.2         vi 19.2

CLASS        PW
setosa        2
setosa        2
setosa        2
setosa        2
setosa        2
versicolor   15
versicolor   14
versicolor   14
versicolor   13
versicolor   12
virginica    17
virginica    22
virginica    16
virginica    22
virginica    19
Every setosa falls below the cut at 7.8 and every other sample above it (perfect on setosa!)
Remove setosa; next cut on SL: cH = 45 + 25.6/2 = 57.8.

SL mn  gap     SW mn   gap     PL mn  gap     PW mn   gap
ve 45  25.6    ve 27.6  .2     ve 41.8 9.4    ve 13.6 5.6
vi 70.6        vi 27.8         vi 51.2        vi 19.2

CLASS        SL
versicolor    1
versicolor   56
versicolor   57
versicolor   54
versicolor   57
virginica    73
virginica    64
virginica    72
virginica    77
virginica    67
(perfect classification of the rest!)
FAUST (simplest version)
For each attribute (column):
1. calculate the mean of each class;
2. sort those means ascending;
3. calc mean_gaps = differences of consecutive means (gapL is the gap on the low side of a mean; gapH is on the high side);
4. choose the best class and attribute for cutting: the largest (relative) mean_gap.
Then remove the class cut off by the max RELATIVE gap and repeat (steps 1, 2, 3 were done on the previous slide).
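The four steps above can be sketched on the level-1 PW values from the earlier slide (a minimal illustration of the mean-gap cut, not the authors' implementation):

```python
# Simplest FAUST cut: class means, sorted; cut at the midpoint of the
# largest gap between consecutive means.
data = {
    "setosa":     [2, 2, 2, 2, 2],
    "versicolor": [15, 14, 14, 13, 12],
    "virginica":  [17, 22, 16, 22, 19],
}
means = sorted((sum(v) / len(v), c) for c, v in data.items())
gaps = [(means[i + 1][0] - means[i][0], i) for i in range(len(means) - 1)]
g, i = max(gaps)
cut = means[i][0] + g / 2       # midpoint of the largest gap

print(means[i][1], cut)         # the cut separates setosa from the rest
```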
FAUST using impure pTrees (ipTrees), page 2

Take 24 samples from each class as training (every other one in the list of 50). First form 3-level gt50% ipTrees with level=1 stride=12; second form 3-level gt50% ipTrees with level=1 stride=24 (i.e., just a root above 3 leaf strides, 1 for each class).

Conclusion: for uncompressed 50% ipTrees (with root truth values), the root values come out close to the means.
level_1   s24gt50_PSL,j  s24gt50_PSW,j  s24gt50_PPL,j  s24gt50_PPW,j   (values)

level=1 stride=12 (each of the 2 level=1 bits per class strides 12 of the 24):
se 1 1 0 0 1 1 1 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0        51 38 15 0
se 1 1 0 0 1 0 1 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 1 0        50 34 14 2
ve 0 1 1 1 0 0 1 0 1 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 0      57 28 45 14
ve 0 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 0 0 0 0 0 1 0 0 0      63 30 40 8
vi 1 0 0 1 0 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 0      72 28 49 18
vi 1 0 0 0 1 0 1 0 1 1 1 1 0 1 1 0 0 0 0 0 1 0 1 1 0      69 30 48 22

level=1 stride=24 (each level=1 bit strides all 24 of a class):
se 1 1 0 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 0 0 0 1 0        51 34 15 2
ve 0 1 1 1 0 0 1 0 1 1 1 1 0 1 0 1 0 0 1 0 0 1 1 1 0      57 30 41 14
vi 1 0 0 1 0 0 1 0 1 1 1 1 0 1 1 0 0 0 1 0 1 0 1 1 0      73 30 49 22
In the previous two FAUST slides, three-level 60% ipTrees were used (leaves are level=0, the root is level=2), with each level=1 bit representing the predicate truth applied to 10 consecutive iris samples (leaf bits; i.e., level=1 stride=10). Above, instead of taking the entire 150 IRIS samples, 24 are selected from each class as training samples; the 60% is replaced by 50%, and level=1 stride=10 is replaced first with level=1 stride=12, then with level=1 stride=24.
Note: The means (averages) are almost the same in all cases.
FAUST using impure pTrees (ipTrees) page 3
ipTree construction can be done during the [one-time] construction of the basic pTrees.

R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43:
0 1 0 1 1 1 1 1 0 0 0 1
0 1 1 1 1 1 1 1 0 0 0 0
0 1 0 1 1 0 1 0 1 0 0 1
0 1 0 1 1 1 1 0 1 1 1 1
1 0 1 0 1 0 0 0 1 1 0 0
0 1 0 0 1 0 0 0 1 1 0 1
1 1 1 0 0 0 0 0 1 1 0 0
1 1 1 0 0 0 0 0 1 1 0 0
R11 = 0 0 0 0 1 0 1 1

8_4_2_1_gte50%_ipTree11:
level 3 (stride 8, root): 0   (3 of 8 1-bits, < 50%)
level 2 (stride 4):       0 1
level 1 (stride 2):       0 0 1 1
level 0 (stride 1, leaves): 0 0 0 0 1 0 1 1

node_naming: (Level, offset (left-to-right)); e.g., the lower left corner node is (0,0). The array of nodes at level L is [L, *].
pTree naming: Sn-1_..._S1_S0_gteX%_ipTree for an n-level ipTree with predicate gteX%, where S = Stride = # of leaf bits strided by the node. If it is a basic pTree, the pTree subscripts specify attribute and bit slice.

Note on bottom-up ipTree construction: one must record the 1-count of the stride of each inode. E.g., in binary trees, if one child is 1 and the other is 0, it could be that the 1-child is pure1 and the 0-child is just below 50% (so parent_node = 1), or that the 1-child is just above 50% and the 0-child has almost no 1-bits (so parent_node = 0). (Example on the next slide.) This can be done during the one pass through each bit slice required for bottom-up construction of pure1 pTrees.
binary pure1 pTree11 = 8_4_2_1_gte100%ipTree11 of the changed R11 = 1 0 0 0 1 0 1 1:
level 3 (root): 0
level 2:        0 0
level 1:        0 0 0 1
level 0:        1 0 0 0 1 0 1 1

bottom-up ipTree construction (R11 changed so that this issue of recording 1-counts as you go is pertinent):
1. the 1-child is pure1 and the 0-child is just below 50% (so parent_node = 1), or
2. the 1-child is just above 50% and the 0-child has almost no 1-bits (so parent_node = 0).

1 1 0 1 1 1 1 1 0 0 0 1
0 1 1 1 1 1 1 1 0 0 0 0
0 1 0 1 1 0 1 0 1 0 0 1
0 1 0 1 1 1 1 0 1 1 1 1
1 0 1 0 1 0 0 0 1 1 0 0
0 1 0 0 1 0 0 0 1 1 0 1
1 1 1 0 0 0 0 0 1 1 0 0
1 1 1 0 0 0 0 0 1 1 0 0
R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43

8_4_2_1_gte50%_ipTree11:
level 0: 1 0 0 0 1 0 1 1
level 1: 1 0 1 1
level 2, left node: 0 or 1? The 1-count of the left branch = 1 and the 1-count of the right branch = 0, so the stride=4 subtree 1-count = 1 (< 50%): node = 0. We know the 1-count of the right branch = 0 (pure0), but we wouldn't know the 1-count of the left branch unless it was recorded.
level 2, right node: 1 (3 of 4).
level 3 (root): 0 or 1? We need to know that the left branch 1-count = 1 and the right branch 1-count = 3, so this stride=8 subtree 1-count = 4 (>= 50%): node = 1.
Finally, note that recording the 1-counts as we build the tree upwards is a near-zero-extra-cost step.
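The bottom-up construction with recorded 1-counts can be sketched as follows (a minimal illustration on the changed R11; each node keeps (truth, 1-count) so parent truths are computed from exact counts, never from child truths):

```python
# Bottom-up gte50% ipTree: each node is (truth, 1-count of its stride).
def build_iptree(bits, threshold=0.5):
    level = [(b, b) for b in bits]            # leaves: truth = bit = count
    tree = [level]
    stride = 1
    while len(level) > 1:
        stride *= 2
        nxt = []
        for i in range(0, len(level), 2):
            cnt = level[i][1] + level[i + 1][1]   # exact count, not child truths
            nxt.append((1 if cnt >= threshold * stride else 0, cnt))
        level = nxt
        tree.append(level)
    return tree          # tree[-1][0] is (root_truth, total 1-count)

tree = build_iptree([1, 0, 0, 0, 1, 0, 1, 1])
print([t for t, _ in tree[1]])   # stride-2 truths -> [1, 0, 1, 1]
print(tree[2][0])                # left stride-4 node -> (0, 1)
print(tree[3][0])                # root -> (1, 4): 4 of 8, >= 50%
```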
RoloDex Model: 2 entities, many relationships. Example 2-entity relationship "cards":
cust-item card; author-doc card; term-doc card; doc-doc card; exp-gene card; gene-gene card (ppi); exp-PI card; itemset-itemset card; people-course Enrollments.
On the cust-item card, for an antecedent ItemSet A: Supp(A) = CusFreq(ItemSet), and Conf(A -> B) = Supp(A union B) / Supp(A).
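The two measures above can be sketched on a toy cust-item card (bit vectors over customers; the card data here is made up for illustration):

```python
# Support and confidence from a boolean item -> customer-bit-vector "card":
# Supp(A) = fraction of customers buying every item of A,
# Conf(A -> B) = Supp(A | B) / Supp(A).
card = {                 # 5 customers, one bit each (toy data)
    "i1": 0b10110,
    "i2": 0b10011,
    "i3": 0b01110,
}
N_CUST = 5

def supp(itemset):
    v = (1 << N_CUST) - 1
    for item in itemset:
        v &= card[item]            # customers buying every item of the set
    return bin(v).count("1") / N_CUST

def conf(A, B):
    return supp(A | B) / supp(A)

print(supp({"i1"}))                # 3 of 5 customers
print(conf({"i1"}, {"i3"}))        # 2 of those 3 also bought i3
```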
[Slide figures: a customer-rates-movie card (entries 0-5) and the derived customer-rates-movie-as-5 bitmap card; people-course Enrollments; items, people and terms axes.]
DataCube Model for 3 entities: items, people and terms.
term-term card (share stem?)
Relational Model:
Items: i1 i2 i3 i4 i5: |0 001|0 |0 11| |1 001|0 |1 01| |2 010|1 |0 10|
People: p1 p2 p3 p4: |0 100|A|M| |1 001|T|M| |2 010|S|F| |3 011|B|F| |4 100|C|M|
Terms: t1 t2 t3 t4 t5 t6: |1 010|1 101|2 11| |2 001|0 000|3 11| |3 011|1 001|3 11| |4 011|3 001|0 00|
Relationship: p1 i1 t1: |0 0| 1 |0 1| 1 |1 0| 1 |2 0| 2 |3 0| 2 |4 1| 2 |5 1| 2
RoloDex Model: 2 Entities, many relationships.
MYRRH (pTree-based ManY-Relationship-Rule Harvester) uses pTrees for ARM of multiple relationships.
MYRRH_2e_2r (note: standard pARM is MYRRH_2e_1r), e.g., Rate5(Cust,Book) or R5(C,B), and Purchase(Book,Cust) or P(B,C).
R5(C,B) (the generic R(E,F)); rows b = 1..4, columns c = 2..5:
0 0 0 1
0 0 1 0
0 0 0 1
0 1 0 0

P(B,C) (the generic S(E,F)):
1 0 0 1
0 1 1 1
1 0 0 0
1 1 0 0

If cust c rates book b as 5, then c purchases b. For b in B, {c | rate5(b,c)=y} is contained in {c | purchase(c,b)=y}:
ct(R5pTree_b & PpTree_b) / ct(R5pTree_b) >= mncnf
ct(R5pTree_b) / sz(R5pTree_b) >= mnsp
Speed of AND: R5pTreeSet & PpTreeSet (compute each ct(R5pTree_b & PpTree_b)); slice counts, b in B, ct(R5pTree_b & PpTree_b) with AND.
Schema: size(C) = size(R5pTree_b) = size(BpTree_b) = 4; size(B) = size(R5pTree_c) = size(BpTree_c) = 4.
Pre-computed 1-counts: BpTree_c: 3 2 1 2; R5pTree_c: 0 1 1 2; BpTree_b: 2 3 1 2; R5pTree_b: 1 1 1 1.
R5pTree_b & PpTree_b 1-counts: 1 1 0 1.
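The per-book AND counts and confidences can be sketched directly from the two cards above (rows are books b = 1..4, bits are customers c = 2..5):

```python
# "rates-5 implies purchases", checked book by book with bit-vector rows
# of the R5(C,B) and P(B,C) cards shown above.
R5 = [0b0001, 0b0010, 0b0001, 0b0100]   # row per book, bit per customer
P  = [0b1001, 0b0111, 0b1000, 0b1100]

counts = [bin(r & p).count("1") for r, p in zip(R5, P)]
confs = [bin(r & p).count("1") / bin(r).count("1") for r, p in zip(R5, P)]
print(counts)   # -> [1, 1, 0, 1]  (the pre-computed AND 1-counts)
print(confs)    # -> [1.0, 1.0, 0.0, 1.0]
```

The rule holds with confidence 1 for books 1, 2 and 4 and fails for book 3.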
Given e in E: if R(e,f) then S(e,f): ct(Re & Se)/ct(Re) >= mncnf, ct(Re)/sz(Re) >= mnsp.
The quantified variants (AND for "for all", OR for "there exists"):
If for all e in A, R(e,f), then for all e in B, S(e,f): ct( AND_{e in A} Re & AND_{e in B} Se ) / ct( AND_{e in A} Re ) >= mncnf. ...
If for all e in A, R(e,f), then for some e in B, S(e,f): ct( AND_{e in A} Re & OR_{e in B} Se ) / ct( AND_{e in A} Re ) >= mncnf. ...
If for some e in A, R(e,f), then for all e in B, S(e,f): ct( OR_{e in A} Re & AND_{e in B} Se ) / ct( OR_{e in A} Re ) >= mncnf. ...
If for some e in A, R(e,f), then for some e in B, S(e,f): ct( OR_{e in A} Re & OR_{e in B} Se ) / ct( OR_{e in A} Re ) >= mncnf. ...
Consider 2 customer classes, Class1 = {C=2,3} and Class2 = {C=4,5}. Then P(B,C) is the TrainingSet:
C\B  1 2 3 4
2    1 0 1 1
3    0 1 0 1
4    0 1 0 0
5    1 1 0 0
Then the DiffSup table is:
B=1 B=2 B=3 B=4
0   1   1   2
Book=4 is very discriminative of Class1 and Class2, e.g., Class1 = salary>$100K. With P1 = {B=1|2} and P2 = {B=3|4}:
     P1  P2
C1   0   1
C2   1   0
DS   1   1
P1 [and P2, B=2 and B=3] is somewhat discriminative of the classes, whereas B=1 is not.
Are "Discriminative Patterns" covered by ARM? E.g., does the same information come out of strong rule mining? Does DP yield information across multiple relationships, e.g., determining the classes via the other relationship?
MYRRH_2e_3r: Rate1(Cust,Book) or R1(C,B), Purchase(Book,Cust) or P(B,C), Sell(Cust,Book) or S(B,C).
If cust c rates book b as 1 and c purchases b, then likely c sells b at term end. For b in B, {c | R1(c,b)=y & P(c,b)=y} is contained in {c | S(c,b)=y}:
ct(R1pTree_b & PpTree_b & SpTree_b) / ct(R1pTree_b & PpTree_b) >= minconf
[Figure: the R1(C,B), P(B,C) and S(B,C) matrices over B = 1..4, C = 2..5.]
3e_3r: Students who buy b, and courses using b: does the student enroll in the course? {(s,c) | Buy(s,b)=y & Text(b,c)=y} is contained in {(s,c) | Enroll(s,c)=y}:
cnt(EpTreeSubSet(BpTree_b x TpTree_b)) / (cnt(BpTree_b) * cnt(TpTree_b)) > mncf
[Figure: book x course Text matrix, student x course Enroll matrix, and Buy matrix.]

3e_2r: Rate5(Student,Course), PurchHardCov(Book,Stu): if a student s rates any course as 5, then s purchases a HardCover book. [Figure: R5(S,C) and PHC(B,S) matrices.]
[Figure: book x course Text matrix, student x course Enroll and Buy matrices, and a course x offering Location matrix.] If s enrolls in c, and c is Offered at L, and L uses Text = b, then s Buys b.
4e_4r: Any 2 adjacent relationships can be collapsed into 1: R(c,b) and P(b,e) for some b iff RP(c,e). By doing so, we have a whole new relationship to analyze.

R(C,B):
0 0 0 1
0 0 1 0
0 0 0 1
0 1 0 0

P(B,C):
1 0 0 1
0 1 1 1
1 0 0 0
1 1 0 0

Given c, {b | R(c,b)} is List(PR,c). For b in List(PR,c), {e in C | P(b,e)} is List(PP,b). Therefore {e | RP(c,e)} = OR_{b in List(PR,c)} PP,b.

RP(C,C):
0 1 0 1
0 1 1 0
0 0 1 0
0 0 1 1
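The collapse rule above is boolean matrix composition, computed by OR-ing pTrees. A minimal sketch on toy 4x4 cards (the bit patterns here are illustrative, not the slide's exact matrices):

```python
# RP(c,e) = OR over b in List(PR,c) of P's pTree for b,
# i.e., boolean composition of R(C,B) with P(B,C).
R = [0b0001, 0b0010, 0b0001, 0b0100]   # R(c,b): row per c, bit per b (toy data)
P = [0b1001, 0b0111, 0b1000, 0b1100]   # P(b,e): row per b, bit per e (toy data)
n = 4

RP = []
for c in range(n):
    acc = 0
    for b in range(n):
        if (R[c] >> (n - 1 - b)) & 1:  # b is in List(PR,c)
            acc |= P[b]                # OR in b's pTree over e
    RP.append(acc)

print([format(r, "04b") for r in RP])
```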
Derived relationships (S=STUDENT 2..5, C=COURSE 1..4, B=BOOK 1..4):

P=PURCHASE(S,B):      E=ENROLL(S,C):      T=TEXT(C,B):
0 0 0 1               1 0 0 1             0 0 0 1
0 0 1 0               0 1 1 1             0 1 1 0
0 0 0 1               1 0 0 0             1 0 1 0
0 1 0 0               1 1 0 0             0 1 0 1

Let Tc = the C-pTree of T for C=c, with list = {b | T(c,b)}.

PT=PURCHASE_TEXT(S,C), PTc = OR_{b in List(Tc)} Pb (also PTs = OR_{b in List(Ps)} Tb):
0 0 0 1
0 0 1 1
0 1 1 0
0 0 0 1

ET=ENROLL_TEXT(S,B), ETs = OR_{c in List(Es)} Tc (also ETb = OR_{c in List(Tb)} Ec):
1 1 0 1
1 1 1 1
1 1 1 1
1 0 0 0

PE=PURCHASE_ENROLL(C,B), PEc = OR_{s in List(Ec)} Ps (also PEb = OR_{s in List(Pb)} Es):
0 0 1 1
0 0 1 0
0 0 1 1
1 0 1 0
With PGP-D, to get pTree info, you need: the ordering (the mapping of bit position to table row) and the predicate (e.g., the table column id and bit-slice number or bitmap involved). pTrees are compressed, data-mining-ready vertical data structures which need not be uncompressed to be used. PGP-D is a mechanism in which we "scramble" pTree information (predicate info, but also possibly ordering info) in a way that the data can be processed without unscrambling.

For data mining purposes, the scrambled pTrees would be unrevealing of the raw data to anyone, but a qualified person could issue a data-mining request (classification/ARM/clustering). It is different from encrypting.

The Predicate Key (PK) reveals the pTree predicates (for basic pTrees, e.g., the "predicate" specifies which column and which bit position). Make all pTrees (over the entire [distributed] DB) the same length. Pad in the front [and the back?] so that statistics cannot reveal the pTree start position. Scramble the locations of the pTrees. For basic pTrees, the PK would reveal offset and pre-pad. The example PK reveals that the 1st pTree is found at offset=5 (it has been shuffled forward 5 pTree slots, of the slots reserved for that table) and that its first 54 bits are pad bits.

If the DB had 5000 files with 50 columns each (on avg) and each column had 32 bits (on avg), we have 8 million pTrees. We could pad with statistically indistinguishable additions to make it impossible to try enough alternatives in human time to break the key.

An additional thought: in the distributed case (multiple sites), since we'd want lots of pTrees, it would make sense to always fully replicate (making all retrievals local). Thus we are guaranteed that all pTrees are statistically "real looking" (because they ARE real). We might not need to pad with bogus pTrees.

A hacker could extract just the first bit of every pTree (e.g., the 8M bits that ARE the first horizontal record), then shuffle those bits until something meaningful appears (or starts to appear). From all meaningful shuffles, he/she might be able to break the key code (e.g., look at the 2nd, 3rd, etc.). To get around that possibility, we could store the entire database as one massive "Big Bit String" and include in the Predicate Key the start offset of each pTree (shuffled randomly). We would include a column with the [randomly determined] amount of padding (now variable) so that the position of the first start bit is unknowable. Alternatively, we could use a common length but put random "non-pTree" gaps between the pTrees. Alternatively, the "Key" could simply specify the start address of each pTree (and its length?).
PGP-D Pretty Good Protection of Data protects vertical pTree data.
5,54 | 7,539 | 87,3 | 209,126 | 25,896 | 888,23 | ...key=array(offset,pad)
Could also construct a large collection of bogus key-lookup tables (identifying the correct one to the authorized subgroup only) as an additional layer; encrypt?

For multiple users at different levels of security (with rights to parts of the DB and not others), we would have a separate key for each user level. Using the key would be simple and quick, and once the key is applied, accessing and processing the data would be at zero additional time cost (the current thinking is that we would not encrypt or otherwise alter the pTrees themselves, just their identity). One would only need to work on the "key mechanism" to improve the method in speed and protection level (individual pTrees are intact/unaltered).

Some data collections need not be protected in their entirety; protection tends to be by column and not by row, and pTrees are good for column protection (i.e., it is usually the case that certain attributes are sensitive and others are routine public information). When there are differences in protection level by row (subsets of instances of the entity require different protection levels), we would simply create each subset as a separate "file" (all of the same massive length through padding) and protect each at the proper level.
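The key=array(offset,pad) idea can be sketched as a lookup (a hypothetical layout: fixed-size slots and a front pad, with the key entries taken from the example PK above):

```python
# PGP-D style retrieval: the key entry for a logical pTree gives the slot it
# was shuffled to and how many pad bits precede the real bits.
key = [(5, 54), (7, 539), (87, 3), (209, 126), (25, 896), (888, 23)]

def fetch_ptree(storage, slot_len, logical_id):
    offset, pad = key[logical_id]
    slot = storage[offset * slot_len:(offset + 1) * slot_len]
    return slot[pad:]            # strip the front padding; the tail may also be pad

# toy storage: 1000 slots of 1600 bits each, zero-filled for illustration
storage = "0" * (1000 * 1600)
print(len(fetch_ptree(storage, 1600, 0)))   # 1600 - 54 = 1546 bits
```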
ROLL CC: data items are requested for [read and/or write] access by a transaction using a REQUEST VECTOR (RV), a bit vector. Each data item is mapped to a bit position (or the ordering can be assumed to be the table ordering). A 1-bit at a position indicates that the item is requested by the transaction; a 0-bit means it is not. If read and write modes are distinguished, ROLL uses a read bit and a write bit for each item.

ROLL has 3 basic methods:
POST (allows a transaction to request its data item needs). POST is an atomic "enqueue" operation (only atomicity is required; it is the only critical section). POSTs can be batched so that low-priority transaction POSTs are delayed in favor of higher-priority ones.
CHECK (determines requested data item availability). CHECK returns the logical OR of all RVs behind it; the result is called the "Access Vector" or AV. A background, ever-running process can be creating and attaching AVs to each RV; then a transaction's CHECK need only proceed until it encounters (ORs in) an AV which specifies new item availability. Re-CHECKing can be done at any time.
RELEASE: sets some or all of a transaction's 1-bits to 0-bits.
Queue of POSTed RVs, head to tail: RVi (head) = 010010:0, RVj = 110010:0, RVk = ...010011:0, with corresponding AVi = 010010:0, AVj = 110010:0, AVk = 110011:0. (The critical-section POST of the next RV_Ti+1 is done by copying tail_ptr to RV_Ti+1_ptr and then resetting tail_ptr to RV_Ti+1. The background CreateAVs process begins at the head, repeatedly ORing the RV_Ts going left to right.)

CHECK_RV_Tj begins at its position: it ORs the next RVs into a copy of RV_Tj+1, moving right (for max recency; else it just checks its own AV), building an AV_Tj, until it determines sufficient availability. Then it suspends CHECK and begins processing the newly available data items (but it may go all the way to the head before suspending). It could also maintain the list of RVs blocking its access, so that its next CHECK can OR only those RVs to get AV_Tj (or check only those AVs).
Every Tj RELEASES bits (sets them to 0 in RV_Tj) as the corresponding data item is no longer needed.

Designate a separate ROLL for each partition, OR use multi-level pTrees where the upper level is the file level. ROLL RVs and AVs are pTrees with the same structure: the upper level is the file level, then whatever record-level pTree structure is used for the basic pTrees representing the data in the file itself. E.g., for an image file the ordering of tuples (pixels) might be Peano or Z ordering, and therefore the RV and AV (except for the top file level) would also indicate pixel access needs with the same pTree structure (1 means "need that pixel"). So the ROLL elements (RVs and AVs) are just coded record-level bit slices (or trees, in the multi-level pTree case).

AVs for each POSTed RV would be created by a background process in reverse POST order (time-stamped?). As soon as a CHECK process encounters an AV which provides additional accesses not previously available to that transaction, it can stop the CHECK and use those items; or it can continue, to gain a larger set of available items (by ignoring the AV and ORing only the RVs it encounters; this would make sense if the TS is old and/or an entire set of accesses is required to make progress at all, e.g., an entire file).

A record is "available" iff the entire record is available AND every field is. A field is available iff its record and that field are available. First Come First Served, except: low-priority transactions are delayed for incoming high-priority transactions. A read-only data mine ignores concurrency altogether.
ConCur (Concurrency Control): ROCC and ROLL concurrency control using pTrees.
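The POST / CHECK / RELEASE cycle can be sketched with integer bit-vector RVs (a minimal single-threaded illustration; in the real system POST is the atomic critical section and AVs are built by a background process):

```python
# ROLL sketch: an item is available to a transaction iff no RV queued ahead
# of it still requests that item.
queue = []                          # POSTed request vectors, head first

def post(rv):                       # atomic enqueue in the real system
    queue.append(rv)
    return len(queue) - 1           # the transaction's queue position

def check(pos):
    av = 0
    for rv in queue[:pos]:          # OR of all RVs ahead = Access Vector
        av |= rv
    return queue[pos] & ~av         # requested items nobody ahead holds

def release(pos, bits):
    queue[pos] &= ~bits             # drop items no longer needed

i = post(0b010010)
j = post(0b110010)
avail = check(j)
print(format(avail, "06b"))         # -> 100000: item 010010 is held by i
release(i, 0b010010)
print(format(check(j), "06b"))      # -> 110010: everything j asked for is free
```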
Domain Vectors (DVs) are bitmaps representing the presence of a domain's values. The mapping which assigns domain vector positions to domain values is the Domain Vector Table (DVT).

DOMAIN VECTORS: Given a domain D (e.g., D = {3-letter strings} for a name field), the DVT is:
nam | surrogate
====|==========
aaa | 0
aab | 1
...
aaz | 25
...
zzz | 17575

Then an attribute R.A in a relation R has Domain Vector DV(R.A) = (0010100100110...0), with a 1-bit in the nth position iff the domain value with surrogate n occurs in R.A. E.g.,
DV(CUSTOMER.nam) = (0...010...010...010...0) with 1-bits at surrogates 1886 (JAN), 1897 (SUE), 3289 ("JAY"), 13395 ("JON") (e.g., JAN is the 1886th domain value, i.e., has surrogate 1886).
The DV Accelerator method is as follows. Keep a DV for some fields (particularly primary keys and frequently joined attributes). Note: to reduce the size of these vectors, surrogate the "extant domain" (the currently appearing domain values), assigning the next surrogate to new values. Update the DV after the Insert of a new record:
i. form a Modify Vector (MV), e.g., if ABE joins the buying club, form the MV with a 1 in the 31st position, 0 elsewhere;
ii. OR the MV into the DV.

DOVE (DOmain VEctor query processing): DB query processing using pTrees.
Delete tuple (assume the field value was not duplicated):
i. form the MV for the deleted value (e.g., ABE drops membership);
ii. XOR the MV into the DV.

To Join:
i. materialize the primary DV;
ii. logically AND the other DV into it, producing a JOIN VECTOR (note that a JV is a key-value-sorted list of matches);
iii. apply the JV to each file index, producing surrogate lists (-1- a nested loop is efficient since all records match, but inefficient rereading of pages may occur; -2- step iv is a guess for sparse joins);
iv. sort the surrogate lists, read the files, sort each file, merge-join (this should minimize page reads and page faults).

Projection: depth-first retrieval on the index (already optimal).

Selection:
i. form a Select Vector (SV), with a 1 for all values to be selected (if the filter is a logical combination of key ranges, form key-range vectors and use the corresponding logical ops (OR, AND, NOT)); e.g., SELECT ALL CUSTOMERS STARTING WITH J: SV = (0..01..10..0) with 1-bits from surrogate 6760 to 7436;
ii. logically AND the DV into the SV;
iii. apply the SV to the file index, producing a surrogate list;
iv. sort the surrogate list, read the file.
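The join acceleration can be sketched with the 3-letter surrogate scheme above (helper names here are assumptions, not DOVE's actual API):

```python
# DOVE-style join: AND the two attributes' domain vectors; surviving 1-bits
# are exactly the values present on both sides.
def surrogate(name):                 # aaa -> 0 ... zzz -> 17575
    a, b, c = (ord(ch) - ord("a") for ch in name.lower())
    return (a * 26 + b) * 26 + c

def dv(values):                      # domain vector as a Python int bitset
    v = 0
    for s in map(surrogate, values):
        v |= 1 << s
    return v

cust = ["jan", "sue", "jay", "jon"]
orders = ["sue", "jon", "abe"]
jv = dv(cust) & dv(orders)           # the JOIN VECTOR
matches = [n for n in cust if (jv >> surrogate(n)) & 1]
print(sorted(matches))   # -> ['jon', 'sue']
```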
http://web.cs.ndsu.nodak.edu/~perrizo/classes/765/dvex1.html
http://web.cs.ndsu.nodak.edu/~perrizo/classes/765/qpo.html
http://web.cs.ndsu.nodak.edu/~perrizo/classes/765/dvex0.html