big data management and analytics assignment 8
TRANSCRIPT
DATABASESYSTEMSGROUP
Big Data Management and AnalyticsAssignment 8
1
DATABASESYSTEMSGROUP
Assignment 8-1
Finding similar items
Suppose that the universal set is given by {1,β¦,10}. Construct minhashsignatures for the following sets:
(a) π1 = {3,6,9}
(b) π2 = {2,4,6,8}
(c) π3 = {2,3,4}
1. Construct the signatures for the sets using the following list ofpermutations:
β’ (1,2,3,4,5,6,7,8,9,10)
β’ (10,8,6,4,2,9,7,5,3,1)
β’ (4,7,2,9,1,5,3,10,6,8)
2
DATABASESYSTEMSGROUP
Assignment 8-1
i. Create first a characteristic matrix foreach permutation
(a) Set one column with the shingles
(b) Set columns for each document
(c) Fill the document columns by setting a 1 where there is an occurence for each shingle, and 0 else
3
Element π1 π2 π3
1 0 0 0
2 0 1 1
3 1 0 1
4 0 1 1
5 0 0 0
6 1 1 0
7 0 0 0
8 0 1 0
9 1 0 0
10 0 0 0
DATABASESYSTEMSGROUP
Assignment 8-1
i. Create first a characteristic matrix for each permutation
4
Element π1 π2 π3
1 0 0 0
2 0 1 1
3 1 0 1
4 0 1 1
5 0 0 0
6 1 1 0
7 0 0 0
8 0 1 0
9 1 0 0
10 0 0 0
Element π1 π2 π3
10 0 0 0
8 0 1 0
6 1 1 0
4 0 1 1
2 0 1 1
9 1 0 0
7 0 0 0
5 0 0 0
3 1 0 1
1 0 0 0
Element π1 π2 π3
4 0 1 1
7 0 0 0
2 0 1 1
9 1 0 0
1 0 0 0
5 0 1 0
3 1 0 1
10 0 1 0
6 1 1 0
8 0 1 0
1st permutation 2nd permutation 3rd permutation
DATABASESYSTEMSGROUP
Assignment 8-1
ii. Compute the minhash for each permutation
5
Element π1 π2 π3
1 0 0 0
2 0 1 1
3 1 0 1
4 0 1 1
5 0 0 0
6 1 1 0
7 0 0 0
8 0 1 0
9 1 0 0
10 0 0 0
1st permutation
Select the first occurence of 1 per setand get the elements at which they canbe found
Which leads to the following minhash:β π1 = 3β π2 = 2β π3 = 2
DATABASESYSTEMSGROUP
Assignment 8-1
ii. Compute the minhash for each permutation
6
The same procedure for the 2nd permutationβ¦
β π1 = 6β π2 = 8β π3 = 4
Element π1 π2 π3
10 0 0 0
8 0 1 0
6 1 1 0
4 0 1 1
2 0 1 1
9 1 0 0
7 0 0 0
5 0 0 0
3 1 0 1
1 0 0 0
2nd permutation
DATABASESYSTEMSGROUP
Assignment 8-1
ii. Compute the minhash for each permutation
7
β¦and for the 3rd permutation
Which leads to the following minhash:β π1 = 9β π2 = 4β π3 = 4
Element π1 π2 π3
4 0 1 1
7 0 0 0
2 0 1 1
9 1 0 0
1 0 0 0
5 0 1 0
3 1 0 1
10 0 1 0
6 1 1 0
8 0 1 0
3rd permutation
This yields the following signatures:ππΌπΊ π1 = {3,6,9}ππΌπΊ π2 = {2,8,4}ππΌπΊ π3 = {2,4,4}
DATABASESYSTEMSGROUP
Assignment 8-1
2. Instead of using the previously given permutations use hash functions:
β1 π₯ = π₯ πππ 10β2 π₯ = (2π₯ + 1) πππ 10β3 π₯ = (3π₯ + 2) πππ 10
8
DATABASESYSTEMSGROUP
Assignment 8-1
2. Instead of using the previously given permutations use hash functions:
i. Set up a table:
9
Element πΊπ πΊπ πΊπ ππ(π) ππ(π) ππ(π)
1 0 0 0
2 0 1 1
3 1 0 1
4 0 1 1
5 0 0 0
6 1 1 0
7 0 0 0
8 0 1 0
9 1 0 0
10 0 0 0
DATABASESYSTEMSGROUP
Assignment 8-1
ii. Compute the hash values (except for zero-rows):
10
Element πΊπ πΊπ πΊπ ππ(π) ππ(π) ππ(π)
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 9
10 0 0 0 - - -
DATABASESYSTEMSGROUP
Assignment 8-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
11
πΊπ πΊπ πΊπ
ππ β β β
ππ β β β
ππ β β β
e πΊπ πΊπ πΊπ ππ(π) ππ(π) ππ(π)
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 9
10 0 0 0 - - -
DATABASESYSTEMSGROUP
Assignment 8-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
12
πΊπ πΊπ πΊπ
ππ β 2 2
ππ β 5 5
ππ β 8 8
e πΊπ πΊπ πΊπ ππ(π) ππ(π) ππ(π)
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 9
10 0 0 0 - - -
Update of 2nd row:
Update only π2, π3!
π2:min β, 2 = 2min β, 5 = 5min β, 8 = 8
π3:min β, 2 = 2min β, 5 = 5min β, 8 = 8
DATABASESYSTEMSGROUP
Assignment 8-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
13
πΊπ πΊπ πΊπ
ππ 3 2 2
ππ 7 5 5
ππ 1 8 1
e πΊπ πΊπ πΊπ ππ(π) ππ(π) ππ(π)
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 9
Update of 3rd row:
Update only π1, π3!
π1:min β, 3 = 3min β, 7 = 7min β, 1 = 1
π3:min 2,3 = 2min 5,7 = 5min 8,1 = 1
DATABASESYSTEMSGROUP
Assignment 8-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
14
πΊπ πΊπ πΊπ
ππ 3 2 2
ππ 7 5 5
ππ 1 4 1
e πΊπ πΊπ πΊπ ππ(π) ππ(π) ππ(π)
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 9
10 0 0 0 - - -
Update of 4th row:
Update only π2, π3!
π2:min 2,4 = 2min 5,9 = 5min 8,4 = 4
π3:min 2,4 = 2min 5,9 = 5min 1,4 = 1
DATABASESYSTEMSGROUP
Assignment 8-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
15
πΊπ πΊπ πΊπ
ππ 3 2 2
ππ 3 3 5
ππ 0 0 1
e πΊπ πΊπ πΊπ ππ(π) ππ(π) ππ(π)
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 9
10 0 0 0 - - -
Update of 6th row:
Update only π1, π2!
π1:min 3,6 = 3min 7,3 = 3min 1,0 = 0
π2:min 2,6 = 2min 5,3 = 3min 4,0 = 0
DATABASESYSTEMSGROUP
Assignment 8-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
16
πΊπ πΊπ πΊπ
ππ 3 2 2
ππ 3 3 5
ππ 0 0 1
e πΊπ πΊπ πΊπ ππ(π) ππ(π) ππ(π)
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 9
10 0 0 0 - - -
Update of 8th row:
Update only π2!
π2:min 2,8 = 2min 3,7 = 3min 0,6 = 0
DATABASESYSTEMSGROUP
Assignment 8-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
17
πΊπ πΊπ πΊπ
ππ 3 2 2
ππ 3 3 5
ππ 0 0 1
e πΊπ πΊπ πΊπ ππ(π) ππ(π) ππ(π)
1 0 0 0 - - -
2 0 1 1 2 5 8
3 1 0 1 3 7 1
4 0 1 1 4 9 4
5 0 0 0 - - -
6 1 1 0 6 3 0
7 0 0 0 - - -
8 0 1 0 8 7 6
9 1 0 0 9 9 9
10 0 0 0 - - -
Update of 8th row:
Update only π1!
π1:min 3,9 = 3min 3,9 = 3min 0,9 = 0
DATABASESYSTEMSGROUP
Assignment 8-1
iii. Create a table for all hash functions and sets and initialize them withinfinite distance:
This yields the following signatures:ππΌπΊ π1 = 3,3,0ππΌπΊ π2 = 2,3,0ππΌπΊ π3 = 2,5,1
18
πΊπ πΊπ πΊπ
ππ 3 2 2
ππ 3 3 5
ππ 0 0 1
DATABASESYSTEMSGROUP
Assignment 8-1
3. How does the estimated Jaccard similarity from (1.) and (2.) comparewith the true Jaccard similarity of the original data? How to reducedeviations in the approximated Jaccard similarities?
19
RECAP: Jaccard similarity:
ππ½ππππππ π·1, π·2 = π·1β©π·2π·1βͺπ·2
DATABASESYSTEMSGROUP
Assignment 8-1
i. Actual Jaccard similarity:
20
RECAP: Jaccard similarity:
ππ½ππππππ π·1, π·2 = π·1β©π·2π·1βͺπ·2
πΊπ πΊπ πΊπ
πΊπ - 1 6 1 5
πΊπ - - 2 5
πΊπ - - -
(a) π1 = {3,6,9}(b) π2 = {2,4,6,8}(c) π3 = {2,3,4}
DATABASESYSTEMSGROUP
Assignment 8-1
ii. Permutation estimated Jaccard similarity:
21
πΊπ πΊπ πΊπ
πΊπ - 0 6 0 5
πΊπ - - 2 3
πΊπ - - -
(a) π1 = {3,6,9}(b) π2 = {2,8,4}(c) π3 = {2,4,4}
DATABASESYSTEMSGROUP
Assignment 8-1
iii. Hash estimated Jaccard similarity:
22
πΊπ πΊπ πΊπ
πΊπ - 2 3 0 5
πΊπ - - 1 5
πΊπ - - -
(a) π1 = {3,3,0}(b) π2 = {2,3,0}(c) π3 = {2,5,1}
DATABASESYSTEMSGROUP
Assignment 8-1
iv. Comparison:
23
πΊπ πΊπ πΊπ
πΊπ - 2 3 0 3
πΊπ - - 1 3
πΊπ - - -
πΊπ πΊπ πΊπ
πΊπ - 0 3 0 3
πΊπ - - 2 3
πΊπ - - -
πΊπ πΊπ πΊπ
πΊπ - 1 6 1 5
πΊπ - - 2 5
πΊπ - - -
Actual J. similarity Perm. est. J similarity Hash. est. J similarity
Estimation of the actual J. similarity is rather poor, why?
Too small minhash vectors. Get more permutations of the universal set ormore hash functions to extend the minhash vectors!
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
24
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
Wait for the first 6 points to arrive
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
25
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
Wait for the first 6 points to arrive
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
26
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
Wait for the first 6 points to arrive
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
27
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
Wait for the first 6 points to arrive
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
28
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
Wait for the first 6 points to arrive
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
29
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
Wait for the first 6 points to arrive
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
30
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
Apply standard k-Means to create q clusters.
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
31
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
For each discovered cluster, assign a uniqueID and create ist micro-cluster summary.
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
32
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
For each discovered cluster, assign a uniqueID and create ist micro-cluster summary.
πΆπΉπ1 = (πΆπΉ2π, πΆπΉ1π, πΆπΉ2π‘, πΆπΉ1π‘, π)
12 + 22
12 + 12=5
2
1 + 2
1 + 1=3
2
12 + 22 = 5
1 + 2 = 3
2
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
33
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
For each discovered cluster, assign a uniqueID and create ist micro-cluster summary.
πΆπΉπ1 = (5
2,3
2, 5,3,2)
cπππ‘ππ = 322 =
1.51
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
34
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
πΆπΉπ1 = (5
2,3
2, 5,3,2)
πππππ’π = πΆπΉ2ππ β πΆπΉ1π
π2
= 522 β
322
2
=2.51
β1.51
2
=2.51
β2.251
= 2.5 β 2.25 + (1 β 1) = 0.5
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
35
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
For each discovered cluster, assign a uniqueID and create ist micro-cluster summary.
ππ πͺπππΏ πͺπππΏ πͺπππ πͺπππ π πππ πππ
1 5
2
3
2
5 3 2 1.5
1
0.5
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
36
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
For each discovered cluster, assign a uniqueID and create ist micro-cluster summary.
ππ πͺπππΏ πͺπππΏ πͺπππ πͺπππ π πππ πππ
1 5
2
3
2
5 3 2 1.5
1
0.5
2 32
145
8
17
25 7 2 4
8.5
0.5
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
37
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
For each discovered cluster, assign a uniqueID and create ist micro-cluster summary.
ππ πͺπππΏ πͺπππΏ πͺπππ πͺπππ π πππ πππ
1 5
2
3
2
5 3 2 1.5
1
0.5
2 32
145
8
17
25 7 2 4
8.5
0.5
3 181
25
19
7
61 11 2 9.5
3.5
0.7 := centroids
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
38
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
Compute distance between point p and eachof the q maintained micro-cluster centroids.
πππΏ2(πΆ1, π) = 2.06
πΆ1
πΆ2
πΆ3ππΏ2(πΆ2, π) = 5.85
ππΏ2(πΆ3, π) = 7.51
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
39
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
Select a cluster clu as the closestmicro-cluster to p.
πππΏ2(πΆ1, π) = 2.06
πΆ1
πΆ2
πΆ3ππΏ2(πΆ2, π) = 5.85
ππΏ2(πΆ3, π) = 7.51
DATABASESYSTEMSGROUP
Assignment 8-2 - CluStream
40
β’ initPoints = 6
β’ q = 3
β’ Factor of clu radius t = 5
t 1 2 3 4 5 6 7 8 9 10 11 12
p 1
1
2
1
4
9
4
8
10
4
9
3
2
3
11
3
12
12
12
11
11
12
4
2
0 1 2 3 4 5 6 7 8 9101112
123456789
101112
Find the maximum boundary of clu:= factor of t of clu radius= 5 β 0.5 = 2.5
π
ππΏ2(πΆ1, π) = 2.06
πΆ1
πΆ2
πΆ3is p within the maximum cluster boundary of πΆ1?
(π₯ β π₯π1ππππ‘ππππ)2+(π¦ β π¦π1ππππ‘ππππ)
2β€ (π‘ β ππππΆ1)2?
(2 β 1.5)2+(3 β 1)2β€ (5 β 0.5)2?