Hashing with linear probing
Rasmus Pagh
IT University of Copenhagen
Teoripärlor, Lund, September 16, 2007
Hashing with linear probing
It was settled in the 60s that this is inferior to e.g. double hashing. So why care?
(Pictures: race car, 389 km/h; golf car, 20 km/h.)
Race car vs golf car
• Linear probing uses a sequential scan and is thus cache-friendly.
• On my laptop: 24x speed difference between sequential and random access (for 4-byte words)!
• Experimental studies have shown linear probing to be faster than other methods for load factor α in the range 30-70% (for “small” keys).
• But: No theory behind the hash functions used for linear probing in practice.
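The sequential scan mentioned in the first bullet can be sketched as follows. This is a minimal Python illustration, not code from the talk: the class name `LinearProbingSet`, the fixed capacity, and the use of Python's built-in `hash` are assumptions made for the example, and resizing and deletion are left out.

```python
class LinearProbingSet:
    """Minimal open-addressing set (illustrative sketch, not a tuned implementation)."""

    def __init__(self, capacity=16):
        self.slots = [None] * capacity  # None marks an empty slot
        self.size = 0

    def _probe(self, key):
        """Scan consecutive slots from h(key) until key or an empty slot is found.

        Returns (slot index, number of slots inspected).
        """
        m = len(self.slots)
        i = hash(key) % m
        steps = 1
        while self.slots[i] is not None and self.slots[i] != key:
            i = (i + 1) % m  # sequential scan, wrapping around at the end
            steps += 1
        return i, steps

    def insert(self, key):
        # Always keep at least one empty slot so that every scan terminates.
        if self.size + 1 >= len(self.slots):
            raise RuntimeError("table full; resizing is omitted in this sketch")
        i, steps = self._probe(key)
        if self.slots[i] is None:
            self.slots[i] = key
            self.size += 1
        return steps

    def contains(self, key):
        i, _ = self._probe(key)
        return self.slots[i] is not None
```

Here `insert` returns the number of slots inspected, which is one natural way to read the "cost" of an operation in the analysis later in the talk.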
History of linear probing
• First described in 1954.
• Analyzed in 1962 by D. Knuth, aged 24. Assumes hash function h is truly random.
• Over 30 papers using this assumption.
• Siegel and Schmidt (1990) showed that it suffices that h is O(log n)-wise independent.
Our main result: It suffices that h is 5-wise independent.
log(n)-wise independence
• Siegel (1989) showed time-space trade-offs for evaluation of a function from a log(n)-wise independent family:

                  Time                             Space
  Lower bound     $\log n / \log(s / \log n)$      $s$
  Upper bound 1   $O(\log n)$                      $O(\log n)$
  Upper bound 2   $O(1)$                           $n^{\varepsilon}$

• Upper bound 2 is theoretically appealing, but has a huge constant factor, and uses many random memory accesses!
5-wise independence
• Polynomial hash function, Carter and Wegman (FOCS ’79). Already quite fast.
  $h(x) = \left( \sum_{i=0}^{4} a_i x^i \bmod p \right) \bmod r$
• Tabulation-based hash function, Thorup and Zhang (SODA ’04). Within factor 2 of the fastest universal hash functions.
  $h(x_1, x_2) = T_1[x_1] \oplus T_2[x_2] \oplus T_3[x_1 + x_2]$
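A minimal Python sketch of evaluating the polynomial hash function on this slide. The prime `P = 2**61 - 1`, the name `make_poly_hash`, and the optional `coeffs` argument are choices made for this illustration only; keys are assumed to be non-negative integers smaller than `P`, and the final reduction modulo `r` is reproduced as written on the slide.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; assumed to exceed every key (illustrative choice)

def make_poly_hash(r, coeffs=None, rng=random):
    """Return h(x) = ((a_0 + a_1 x + ... + a_4 x^4) mod P) mod r.

    If coeffs is None, the five coefficients are drawn uniformly from [0, P).
    """
    if coeffs is None:
        coeffs = [rng.randrange(P) for _ in range(5)]

    def h(x):
        acc = 0
        # Horner's rule: (((a_4 x + a_3) x + a_2) x + a_1) x + a_0
        for a in reversed(coeffs):
            acc = (acc * x + a) % P
        return acc % r

    return h
```

Horner's rule keeps the evaluation at four multiplications and four additions modulo `P` per key, with no table lookups.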
Today
• Background and motivation
• Hash functions
‣ New analysis of linear probing (joint work with Anna Pagh and Milan Ružić)
• Hash functions - details
Total cost of insertions
((analysis on blackboard))
Hash function details
((analysis on blackboard))
$h(x_1, x_2) = T_1[x_1] \oplus T_2[x_2] \oplus T_3[x_1 + x_2]$
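The evaluation of this tabulation-based function can be sketched in Python as below. It assumes that the garbled operator on the slide is XOR and that `x1 + x2` is ordinary integer addition, so the third table needs `2 * 2**c_bits - 1` entries. The name `make_tabulation_hash`, the parameters `c_bits` and `out_bits`, and filling the tables with fully random values are assumptions for illustration; the sketch shows how a key is hashed and is not an argument for 5-wise independence.

```python
import random

def make_tabulation_hash(c_bits=16, out_bits=32, rng=random):
    """Return h(x) = T1[x1] ^ T2[x2] ^ T3[x1 + x2] for keys of 2 * c_bits bits."""
    n = 1 << c_bits
    t1 = [rng.getrandbits(out_bits) for _ in range(n)]
    t2 = [rng.getrandbits(out_bits) for _ in range(n)]
    # x1 + x2 ranges over 0 .. 2n - 2, hence 2n - 1 entries.
    t3 = [rng.getrandbits(out_bits) for _ in range(2 * n - 1)]

    def h(x):
        # x must satisfy 0 <= x < 2 ** (2 * c_bits), otherwise t1 is indexed out of range.
        x1 = x >> c_bits      # high character
        x2 = x & (n - 1)      # low character
        return t1[x1] ^ t2[x2] ^ t3[x1 + x2]

    return h
```

Each evaluation costs three table lookups and two XORs; the price is the memory for the three tables.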
Single insertion upper bound
1. Choose max t so that B balls hash to B-t slots, for some B.
2. Choose max C such that C balls hash to C+t slots.
Lemma: Cost(insertion) ≤ 1 + C + t
Proof idea
• Lemma: If an operation on x goes on for more than k steps, then there are “unusually many” keys with hash values in either:
  1) some interval with h(x) as right endpoint, or
  2) the interval [h(x), h(x)+k].
  (Figure: number line marking h(x) and h(x) + k.)
• To bound the cost, upper bound the probability of each event using tail bounds for sums of random variables with limited independence.
Our main result
Theorem 2. Consider any sequence of insertions, deletions, and lookups in a linear probing hash table using a 5-wise independent hash function. Then the expected cost of any operation, performed at load factor $\alpha$, is
$O(1 + (1 - \alpha)^{-3})$.
As a consequence, the expected average cost of successful lookups is $O(1 + (1 - \alpha)^{-2})$.
(A factor $(1 - \alpha)^{-1}$ from what can be proved using full independence.)
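To get a feel for the two bounds in Theorem 2, here they are evaluated at a few load factors. The constants hidden by the O-notation are ignored, so these numbers show growth rates rather than probe counts.

```latex
\begin{array}{c|c|c}
\alpha & (1-\alpha)^{-3} & (1-\alpha)^{-2} \\ \hline
0.5 & 8 & 4 \\
0.7 & \approx 37.0 & \approx 11.1 \\
0.9 & 1000 & 100
\end{array}
```

In each row the two columns differ by exactly the factor $(1-\alpha)^{-1}$, which is 2, about 3.3, and 10 respectively.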
End remarks
• Theory and practice of linear probing now (seem) much closer.
• We can generalize to variable key lengths.
• Open:
‣ Still many hashing schemes where theory does not provide satisfactory methods.
‣ Tighter analysis, lower independence?
Advertisement
• Call for applications, PhD scholarships at ITU: http://www1.itu.dk/sw66047.asp
• For project proposals in algorithms, talk to me before October 1.
Problems
• Show that the total number of insertion steps is independent of the insertion order.
• Show how to efficiently implement deletions in a linear probing hash table.
• Show that the following hash function is 5-wise independent (hint: recursion):
$h(x_1, x_2, x_3, x_4) = T_1[x_1] \oplus T_2[x_2] \oplus T_3[x_3] \oplus T_4[x_4] \oplus T_5[x_1 + x_2] \oplus T_6[x_3 + x_4] \oplus T_7[x_1 + x_2 + x_3 + x_4]$
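Before attempting a proof of the first problem, the claim can be checked empirically. The Python sketch below is an experiment, not a proof: it inserts a fixed multiset of hash values in several random orders and compares the total number of slots inspected. The function names and the parameters `n`, `m`, `trials`, and `seed` are hypothetical, and `n < m` is assumed so that every insertion finds an empty slot.

```python
import random

def total_insertion_steps(hash_values, m):
    """Insert keys with the given hash values, in order, into an empty table
    of m slots; return the total number of slots inspected."""
    occupied = [False] * m
    total = 0
    for hv in hash_values:
        i = hv % m
        total += 1                 # inspect the home slot
        while occupied[i]:
            i = (i + 1) % m        # linear probing with wrap-around
            total += 1
        occupied[i] = True
    return total

def check_order_independence(n=50, m=64, trials=20, seed=0):
    """Return True if every shuffled insertion order gives the same total."""
    rng = random.Random(seed)
    hash_values = [rng.randrange(m) for _ in range(n)]
    baseline = total_insertion_steps(hash_values, m)
    for _ in range(trials):
        rng.shuffle(hash_values)
        if total_insertion_steps(hash_values, m) != baseline:
            return False
    return True
```

For instance, the hash values 1, 1, 2 and the reordering 2, 1, 1 both take 5 steps in a table of 8 slots, although the individual insertions cost different amounts.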
Practical exercise
• Implement a dictionary for null-terminated strings using linear probing. Beware of the “bad” way(s) of implementing it!
• Use the dictionary to count the number of distinct words in “War and Peace”: http://www.gutenberg.org/etext/2600
• Compare the performance to the standard hash table in your programming environment.