class 18 hashing -...
TRANSCRIPT
hashing
prof. Stratos Idreos
HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/
class 18
CS165, Fall 2015, Stratos Idreos
database
professors(id,name,…)
courses(id,name,profId,…)
students(id,name,…)
enrolled(studentId,courseId,…) foreign keys
give me all students enrolled in cs165
select students.name from students, enrolled, courses where courses.name='cs165' and enrolled.courseId=courses.id and students.id=enrolled.studentId
join
new resL[]; new resR[]; k=0
for (i=0; i<L.size; i++)
  for (j=0; j<R.size; j++)
    if (L[i]==R[j]) { resL[k]=i; resR[k++]=j; }
nested loops
L R
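A minimal Python sketch of the nested loops join above. The function name and the sample inputs are illustrative and not from the slides; it returns the matching positions of L and R, like resL and resR.

```python
def nested_loops_join(L, R):
    """Return positions (res_l, res_r) of all pairs with L[i] == R[j]."""
    res_l, res_r = [], []
    for i, lv in enumerate(L):        # outer loop over the left input
        for j, rv in enumerate(R):    # inner loop over the right input
            if lv == rv:
                res_l.append(i)
                res_r.append(j)
    return res_l, res_r


# illustrative inputs
res_l, res_r = nested_loops_join([3, 1, 4, 1], [1, 5, 3])
```

Every value of L is compared with every value of R, so the cost grows with L.size * R.size.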
hash join
1. hash join input 1 into a hash table
2. hash join input 2 and probe the hash table
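A minimal Python sketch of the hash join above. It uses a Python dict as a stand-in for the hash table; the function and variable names are assumptions made for illustration.

```python
from collections import defaultdict


def hash_join(build, probe):
    """Build a hash table on `build`, then probe it with each value of `probe`."""
    table = defaultdict(list)
    for pos, val in enumerate(build):          # build phase: val -> positions
        table[val].append(pos)
    res_build, res_probe = [], []
    for pos, val in enumerate(probe):          # probe phase
        for bpos in table.get(val, ()):        # .get() does not create new entries
            res_build.append(bpos)
            res_probe.append(pos)
    return res_build, res_probe
```

Each input is read once, instead of once per value of the other input as in nested loops.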
hash table
hash: h=f(val), bucket=h mod k
k buckets, N keys
each bucket holds (val (key),pos) entries: [(val1,pos1), (val7,pos7), …]
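A minimal Python sketch of this layout: k buckets, each a list of (val,pos) entries. Python's built-in hash() stands in for f, and the class and method names are assumptions for illustration.

```python
class ChainedHashTable:
    def __init__(self, k):
        self.k = k
        self.buckets = [[] for _ in range(k)]    # k buckets, initially empty

    def _bucket(self, val):
        h = hash(val)          # h = f(val); built-in hash() is a stand-in for f
        return h % self.k      # bucket = h mod k

    def insert(self, val, pos):
        self.buckets[self._bucket(val)].append((val, pos))

    def lookup(self, val):
        # scan only the one bucket that val hashes to
        return [p for v, p in self.buckets[self._bucket(val)] if v == val]
```

With N keys spread evenly over k buckets, a lookup scans about N/k entries.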
partition: stream input from Level 2 through Level 1 into partitions p1 p2 p3 p4
1. read input into stream buffer, hash and write to respective partition buffer
2. when input buffer is consumed, bring the next one
3. when a partition buffer is full, write to L2 (Level 2)
we can partition into L1-1 pieces in one pass
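The steps above can be sketched in Python as follows. This is a simplification under stated assumptions: the input is consumed value by value rather than in blocks, plain lists stand in for the Level 1 buffers and the Level 2 partitions, and all names are hypothetical.

```python
def partition(stream, num_partitions, buffer_size):
    buffers = [[] for _ in range(num_partitions)]   # partition buffers in Level 1
    level2 = [[] for _ in range(num_partitions)]    # partitions written to Level 2
    for val in stream:                              # steps 1-2: consume the input
        p = hash(val) % num_partitions
        buffers[p].append(val)
        if len(buffers[p]) == buffer_size:          # step 3: buffer full, write out
            level2[p].extend(buffers[p])
            buffers[p].clear()
    for p in range(num_partitions):                 # write out partially filled buffers
        level2[p].extend(buffers[p])
    return level2
```

Writing only full buffers keeps the writes to Level 2 sequential per partition.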
grace hash join
hash partitioning
then join each pair of partitions independently in memory
join input 1 join input 2
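A minimal Python sketch of the grace hash join idea, assuming every partition fits in memory (no recursive partitioning). Names are illustrative; hash() stands in for the partitioning hash function.

```python
from collections import defaultdict


def grace_hash_join(left, right, num_partitions):
    """Return (pos_left, pos_right) pairs; the order of the pairs is not defined."""
    # hash partitioning: the same hash function on both inputs, so equal
    # values always land in partitions with the same number
    parts_l = [[] for _ in range(num_partitions)]
    parts_r = [[] for _ in range(num_partitions)]
    for pos, val in enumerate(left):
        parts_l[hash(val) % num_partitions].append((val, pos))
    for pos, val in enumerate(right):
        parts_r[hash(val) % num_partitions].append((val, pos))

    result = []
    for pl, pr in zip(parts_l, parts_r):   # join each pair independently
        table = defaultdict(list)
        for val, pos in pr:                # build on the right piece
            table[val].append(pos)
        for val, pos in pl:                # probe with the left piece
            for rpos in table.get(val, ()):
                result.append((pos, rpos))
    return result
```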
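join one pair of partitions (p1 left, p1 right), read from Level 2 into Level 1:
build a hash table on p1 right (up to L1-2 of Level 1)
stream p1 left as input, probe the hash table, write the result
works as long as at least one of the pieces <= L1-2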
grace hash join
hash partitioning
apply recursively if a partition does not fit in memory (L1-2)
today: how to create the hash table
select R.C
fetch R.A
select S.F
fetch S.A
join
max R.D
min S.G
static hashing
hash table
bucket
updates may create long lists
can this happen anyway, i.e., even with no updates?
simple solution would be to double the hash table and rehash everything…
but…
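A minimal Python sketch of the "double the hash table and rehash everything" approach, assuming the bucket layout of (val,pos) lists and bucket=h mod k used earlier. The function name is hypothetical.

```python
def rehash_doubled(buckets):
    """Build a table with twice as many buckets and re-insert every entry."""
    new_k = 2 * len(buckets)
    new_buckets = [[] for _ in range(new_k)]
    for bucket in buckets:
        for val, pos in bucket:                      # every entry is touched
            new_buckets[hash(val) % new_k].append((val, pos))
    return new_buckets
```

Note that the loop visits all entries of the old table, not only those of the overflowing bucket.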
dynamic hashing
extendible hashing - linear hashing
extendible hashing
hash table = directory + buckets
when there is no more space, split individual buckets and if needed double the directory
with the directory in memory we need 1 I/O to fetch the bucket
this is not true in general when going from memory to caches
directory maps hash values to buckets
h=hash(key)
use the binary form of h: use the X last bits to map to 2^X buckets
for X=2: 00 01 10 11
for X=3: 000 001 010 011 100 101 110 111
extending the directory …
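Taking the X last bits of h is a single bit mask. A minimal Python sketch; the function name is an assumption.

```python
def directory_slot(h, x):
    """Directory entry for hash value h when the X last bits are used."""
    return h & ((1 << x) - 1)      # keep only the x least significant bits


h = 13                                     # binary 1101
slot_x2 = directory_slot(h, 2)             # last 2 bits: 01
slot_x3 = directory_slot(h, 3)             # last 3 bits: 101
```

Going from X to X+1 keeps the old bits and adds one more, so every old entry corresponds to exactly two new entries.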
[figure: directory for X=2 (00 01 10 11) and for X=3 (000 001 010 011 100 101 110 111) pointing to buckets; a full bucket is split into itself and a new bucket; buckets are labeled "can split" and "cannot split"]
extend directory only when no split is possible
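A minimal Python sketch of extendible hashing as outlined above. It assumes a per-bucket depth counter to decide whether a bucket can be split without extending the directory, which is a common way to implement the "can split / cannot split" distinction but is not spelled out on the slides. The class names, the capacity of 2, and the max_depth guard (which stops splitting when many keys share the same hash bits) are assumptions.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth       # number of hash bits this bucket distinguishes
        self.items = []          # (key, pos) entries


class ExtendibleHash:
    def __init__(self, capacity=2, max_depth=16):
        self.capacity = capacity
        self.max_depth = max_depth
        self.x = 1                                   # X: bits used by the directory
        self.directory = [Bucket(1), Bucket(1)]      # 2^X entries

    def _slot(self, key):
        return hash(key) & ((1 << self.x) - 1)       # X last bits of h

    def lookup(self, key):
        return [p for k, p in self.directory[self._slot(key)].items if k == key]

    def insert(self, key, pos):
        bucket = self.directory[self._slot(key)]
        while len(bucket.items) >= self.capacity and bucket.depth < self.max_depth:
            if bucket.depth == self.x:
                # cannot split: only one directory entry points to this bucket,
                # so double the directory first (entry i and i + 2^X share a bucket)
                self.directory = self.directory + self.directory
                self.x += 1
            self._split(bucket)
            bucket = self.directory[self._slot(key)]
        bucket.items.append((key, pos))

    def _split(self, bucket):
        bucket.depth += 1
        new = Bucket(bucket.depth)
        bit = 1 << (bucket.depth - 1)                # the newly distinguished bit
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = new              # half of the entries move over
        old_items, bucket.items = bucket.items, []
        for k, p in old_items:                       # redistribute only this bucket
            self.directory[self._slot(k)].items.append((k, p))
```

Only the entries of the full bucket are rehashed; doubling the directory copies pointers, not data.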
linear hashing
do not use a directory (indirection cost)
instead incrementally extend the hash table one bucket at a time
e.g., use 2 bits of hash value (h0)
buckets: 00 01 10 11, with a pointer to the next bucket to split
directory is implicit
when an insert causes overflow somewhere, split the next bucket
use +1 bit of hash value
for X=2: 00 01 10 11, directory is implicit, pointer to the next bucket to split
for X=3: 000 001 010 011 100 101 110 111, new buckets, directory is implicit
search: use the current hash function (bits); if the bucket is at or after the next bucket to split, ok; else (the bucket is already split) use the next hash function (bits+1)
for X=2: 00 01 10 11, pointer to the next bucket to split
for X=3: 000 001 010 011 100 101 110 111, with one bucket marked C
keep splitting the next bucket with each overflow until all buckets are split (for X=2), then restart for X=3
any problems? what would happen when we split bucket C?
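A minimal Python sketch of linear hashing as described on the preceding slides. Assumptions: buckets are Python lists that simply grow on overflow (standing in for overflow chains), the capacity of 2 is arbitrary, and all names are hypothetical.

```python
class LinearHash:
    def __init__(self, bits=2, capacity=2):
        self.bits = bits                 # bits used by the current hash function
        self.next = 0                    # next bucket to split
        self.capacity = capacity
        self.buckets = [[] for _ in range(1 << bits)]

    def _slot(self, key):
        h = hash(key)
        slot = h & ((1 << self.bits) - 1)              # current hash function
        if slot < self.next:                           # this bucket is already split
            slot = h & ((1 << (self.bits + 1)) - 1)    # next hash function (bits+1)
        return slot

    def lookup(self, key):
        return [p for k, p in self.buckets[self._slot(key)] if k == key]

    def insert(self, key, pos):
        bucket = self.buckets[self._slot(key)]
        overflow = len(bucket) >= self.capacity
        bucket.append((key, pos))        # the overflowing bucket just gets longer
        if overflow:
            self._split_next()           # split the NEXT bucket, not this one

    def _split_next(self):
        old = self.buckets[self.next]
        self.buckets[self.next] = []
        self.buckets.append([])          # new bucket at index next + 2^bits
        mask = (1 << (self.bits + 1)) - 1
        for k, p in old:                 # redistribute using one more bit
            self.buckets[hash(k) & mask].append((k, p))
        self.next += 1
        if self.next == 1 << self.bits:  # all buckets of this round are split
            self.bits += 1               # restart with one more bit
            self.next = 0
```

In this sketch the bucket that overflows is not necessarily the one that gets split, so a bucket can stay over capacity until the split pointer reaches it.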
select R.C
fetch R.A
select S.F
fetch S.A
join
max R.D
min S.G
What can we do to start working immediately? (hint: vectorized processing)
What can we do if we wait for all data to arrive? (hint: bulk processing)
symmetric hash join (inputs L and R)
while there exist buffered values from L, hash and probe HT of R
1) hash
2) probe
3) output
when buffer is empty, switch!
while there exist buffered values from R, hash and probe HT of L
1) hash
2) probe
3) output
when buffer is empty, switch!
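A minimal Python sketch of the symmetric hash join. It assumes a buffer of one value per side, so it switches between L and R after every value, and it assumes "hash" means inserting the value into the hash table of its own side. Names are illustrative.

```python
from collections import defaultdict
from itertools import zip_longest

_END = object()   # marks an exhausted input


def symmetric_hash_join(left, right):
    """Yield (pos_left, pos_right) matches while both inputs are still arriving."""
    ht_l, ht_r = defaultdict(list), defaultdict(list)
    for pos, (lv, rv) in enumerate(zip_longest(left, right, fillvalue=_END)):
        if lv is not _END:
            ht_l[lv].append(pos)                  # 1) hash into HT of L
            for rpos in ht_r.get(lv, ()):         # 2) probe HT of R
                yield (pos, rpos)                 # 3) output
        if rv is not _END:                        # switch sides
            ht_r[rv].append(pos)                  # 1) hash into HT of R
            for lpos in ht_l.get(rv, ()):         # 2) probe HT of L
                yield (lpos, pos)                 # 3) output


matches = list(symmetric_hash_join([1, 2, 3], [3, 1, 1]))
```

Each matching pair is produced exactly once, at the moment the later of its two values arrives, so results start flowing before either input is complete.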
static hashing with 2 passes
phase 1: hash the keys and count the size of each bucket
phase 2: hash the keys again and write them into the hash table
we now know exactly what hash table to build
phase 1: hash the keys and count the size of each bucket
D = [size1, size2, size3, …, sizeK]
D is an array of k slots, where k is the number of buckets we want to have
for each bucket we know its size
phase 2: hash the keys again and write them into the hash table
we now know exactly what hash table to build
D = [size1, size2, size3, …, sizeK] becomes D = [0, D[0]+s1, D[1]+s2, D[2]+s3, …]
pass D and sum all counts to get offsets in a sequentially stored hash table
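The two phases can be sketched in Python as follows, assuming bucket=h mod k with the built-in hash() as the hash function and a single flat list as the sequentially stored hash table. Names are illustrative.

```python
def build_two_pass(keys, k):
    # phase 1: hash every key and count the size of each bucket
    counts = [0] * k
    for key in keys:
        counts[hash(key) % k] += 1

    # pass over the counts and sum them: offsets[b] is where bucket b starts
    offsets = [0] * (k + 1)
    for b in range(k):
        offsets[b + 1] = offsets[b] + counts[b]

    # phase 2: hash again and write each entry into its bucket's region
    table = [None] * len(keys)
    cursor = offsets[:k]                 # copy: next free slot of each bucket
    for pos, key in enumerate(keys):
        b = hash(key) % k
        table[cursor[b]] = (key, pos)
        cursor[b] += 1
    return table, offsets


table, offsets = build_two_pass([5, 2, 8, 3, 6], 3)
# bucket b occupies table[offsets[b]:offsets[b + 1]]
```

Because every bucket has exactly the space it needs, there are no overflow lists and no empty slots.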
which hash function should we use?
h=a*key+b, bucket=h mod K
use X LSBs: can be done with binary ops only
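A minimal Python sketch of h=a*key+b with the bucket taken from the X least significant bits. It assumes the number of buckets is a power of two (2^X), which is what makes the mod replaceable by a bit mask; the constants a and b in the usage line are arbitrary illustrative values, not a recommendation.

```python
def make_hash(a, b, x):
    """Return a function mapping an integer key to one of 2**x buckets."""
    mask = (1 << x) - 1

    def bucket_of(key):
        h = a * key + b
        return h & mask          # X LSBs; same as h mod 2**x, no division needed

    return bucket_of


bucket_of = make_hash(a=31, b=7, x=3)    # 8 buckets; a and b are arbitrary here
```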
what happens after the join?
select R.C
fetch R.A
select S.F
fetch S.A
join
max R.D
min S.G
select max(R.D),min(S.G) from R,S where R.A=S.A and R.C<10 and S.F>30
block operator
access patterns
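A minimal Python sketch of this plan over columns stored as lists, one operator at a time. The helper names and the sample data are made up for illustration; the join is a simple dict-based hash join.

```python
def select(col, pred):
    """Positions of the values that qualify."""
    return [pos for pos, v in enumerate(col) if pred(v)]


def fetch(col, positions):
    """Values of col at the given positions."""
    return [col[p] for p in positions]


def join(lvals, rvals):
    """Pairs (i, j) with lvals[i] == rvals[j], via a hash table on rvals."""
    table = {}
    for j, v in enumerate(rvals):
        table.setdefault(v, []).append(j)
    return [(i, j) for i, v in enumerate(lvals) for j in table.get(v, ())]


# made-up sample columns
R = {"A": [1, 2, 3, 4], "C": [5, 20, 7, 9], "D": [10, 40, 30, 20]}
S = {"A": [4, 3, 3, 9], "F": [50, 10, 40, 60], "G": [8, 1, 6, 2]}

pos_r = select(R["C"], lambda v: v < 10)          # select R.C
pos_s = select(S["F"], lambda v: v > 30)          # select S.F
pairs = join(fetch(R["A"], pos_r), fetch(S["A"], pos_s))   # fetch R.A, S.A + join
match_r = [pos_r[i] for i, _ in pairs]            # back to positions in R
match_s = [pos_s[j] for _, j in pairs]            # back to positions in S
result = (max(fetch(R["D"], match_r)),            # max R.D
          min(fetch(S["G"], match_s)))            # min S.G
```

The positions coming out of the join are in no particular order, so the last two fetches access R.D and S.G with a random pattern.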
select R.A, R.B, R.C, S.A, S.B, S.C from R, S where R.J=S.J and …
preparing the R join input:
R.* scan -> pos (ordered, sparse)
fetch R.J at pos -> pos + R.J (ordered, sparse)
same for the S join input
we need the original positions so we can fetch other R columns after the join
join input R.J: R.J + posR (ordered, sparse)
join input S.J: S.J + posS (ordered, sparse)
partition (=reorder) both join inputs to join
join
join result: posR + posS
both sides are unordered
join result: posR + posS, both sides are unordered
cluster on R -> join result clustered on R (posR + posS)
fetch R payload with sequential pattern
keep ID + posS, e.g., ID = 1 2 3 4
cluster on S -> ID + posS clustered on S, e.g., ID = 4 2 3 1
fetch S payload with sequential pattern -> e.g., ID + S.A with ID = 4 2 3 1
sort on ID (decluster): radix declustering
all result columns aligned
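A minimal Python sketch of the cluster / fetch / decluster sequence above. It uses Python's built-in sort for clustering and a direct scatter by ID for declustering, so it shows the data flow only and is not the cache-conscious radix algorithm of the cited paper. Function and variable names are assumptions.

```python
def project_after_join(pairs, r_payload, s_payload):
    """pairs: unordered (posR, posS) join result. Returns aligned (R value, S value) rows."""
    # cluster on R: order by posR so the R payload is fetched sequentially
    by_r = sorted(pairs, key=lambda p: p[0])
    r_vals = [r_payload[pos_r] for pos_r, _ in by_r]

    # give every result row an ID: its place in the R-clustered order
    id_pos_s = [(row_id, pos_s) for row_id, (_, pos_s) in enumerate(by_r)]

    # cluster on S: order by posS so the S payload is fetched sequentially
    by_s = sorted(id_pos_s, key=lambda t: t[1])
    id_s_val = [(row_id, s_payload[pos_s]) for row_id, pos_s in by_s]

    # decluster: put the S values back in ID order so they line up with r_vals
    s_vals = [None] * len(pairs)
    for row_id, val in id_s_val:
        s_vals[row_id] = val
    return list(zip(r_vals, s_vals))


rows = project_after_join([(2, 0), (0, 3), (1, 1)],
                          ["r0", "r1", "r2"],
                          ["s0", "s1", "s2", "s3"])
```

Both payload fetches walk their column in increasing position order; the only out-of-order access left is the final scatter by ID.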
Cache-Conscious Radix Decluster Projections. S. Manegold, P. Boncz, N. Nes, and M. Kersten. Very Large Databases Conference, 2004
Textbook Chapter 11
Cache-Conscious Radix Decluster Projections. S. Manegold, P. Boncz, N. Nes, and M. Kersten. Very Large Databases Conference, 2004
DATA SYSTEMS
prof. Stratos Idreos
class 18
hashing