intermediate information structurespages.cpsc.ucalgary.ca/~rokne/cpsc335/stuff/slides_2017/... ·...

CPSC 335 Intermediate Information Structures Computer Science University of Calgary Canada LECTURE 5 POLYA and DYNAMIC AND EXTENDIBLE HASHING Jon Rokne Modified from Marina’s lectures.


Page 1: Intermediate Information Structurespages.cpsc.ucalgary.ca/~rokne/CPSC335/stuff/SLIDES_2017/... · 2017-01-25 · GORN again. Because 10 is a prefix of both H(CAT) and H(ELK), both

CPSC 335

Intermediate Information Structures

Computer Science

University of Calgary

Canada

LECTURE 5

POLYA

and

DYNAMIC AND EXTENDIBLE HASHING

Jon Rokne

Modified from Marina’s lectures.


POLYA


UNDERSTANDING THE PROBLEM

•  First. You have to understand the problem.

•  What is the unknown? What are the data? What is the condition?

•  Is it possible to satisfy the condition? Is the condition sufficient to determine the unknown? Or is it insufficient? Or redundant? Or contradictory?

•  Draw a figure. Introduce suitable notation.

•  Separate the various parts of the condition. Can you write them down?

G. Polya, How to Solve It


DEVISING A PLAN

•  Second. Find the connection between the data and the unknown. You may be obliged to consider auxiliary problems if an immediate connection cannot be found. You should obtain eventually a plan of the solution.

•  Have you seen it before? Or have you seen the same problem in a slightly different form?

•  Do you know a related problem? Do you know a theorem that could be useful?


•  Look at the unknown! And try to think of a familiar problem having the same or a similar unknown.

•  Here is a problem related to yours and solved before. Could you use it? Could you use its result? Could you use its method? Should you introduce some auxiliary element in order to make its use possible?


•  Could you restate the problem? Could you restate it still differently? Go back to definitions.

•  If you cannot solve the proposed problem try to solve first some related problem. Could you imagine a more accessible related problem? A more general problem? A more special problem? An analogous problem?


•  Could you solve a part of the problem? Keep only a part of the condition, drop the other part; how far is the unknown then determined, how can it vary? Could you derive something useful from the data? Could you think of other data appropriate to determine the unknown? Could you change the unknown or data, or both if necessary, so that the new unknown and the new data are nearer to each other?

•  Did you use all the data? Did you use the whole condition? Have you taken into account all essential notions involved in the problem?


CARRYING OUT THE PLAN

•  Third. Carry out your plan.

•  Carrying out your plan of the solution, check each step. Can you see clearly that the step is correct? Can you prove that it is correct?


LOOKING BACK

•  Fourth. Examine the solution obtained.

•  Can you check the result? Can you check the argument?

•  Can you derive the solution differently? Can you see it at a glance?

•  Can you use the result, or the method, for some other problem?


Applying Polya’s method


•  Extendible hashing

•  Expandable and dynamic hashing

•  Virtual hashing

•  Summary

OUTLINE


•  Standard hashing works on a fixed file size.

•  What if we add or delete many keys? What if the file size changes significantly?

•  Then we need separate techniques. Two types:
- Directory schemes
- Directoryless schemes

Hash Functions for Extendible Hashing


•  Keys are stored in buckets.

•  Each bucket can hold only a fixed number of items.

•  The index is an extendible table; h(x) hashes a key value x to a bit string; only a portion of the bit string is used to build the directory.

Example: add kn, where h(kn) = 11011.

[Figure: a directory (table) with entries 00, 01, 10, 11 pointing to buckets, shown before and after adding kn. Before: buckets b00, b01 and b1 hold the pseudokeys 00011, 00110, 00101, 01100, 01011, 10011, 11110, 11111. After: bucket b1 is replaced by b10 and b11, which hold 10011, 11011, 11110, 11111.]

Extendible Hashing
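The lookup step in this example can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function name `lookup` are assumptions, and the mapping of both 10 and 11 to a single bucket b1 is read off the bucket labels in the figure.

```python
def lookup(directory, depth, pseudokey):
    """Return the bucket selected by the first `depth` bits of a pseudokey."""
    return directory[pseudokey[:depth]]


# Assumed state before kn is added: cells 10 and 11 share the bucket b1.
directory = {"00": "b00", "01": "b01", "10": "b1", "11": "b1"}
before = lookup(directory, 2, "11011")   # h(kn) = 11011 lands in b1

# Assumed state after b1 has been split into b10 and b11.
directory.update({"10": "b10", "11": "b11"})
after = lookup(directory, 2, "11011")    # the same pseudokey now lands in b11
```

Only the first two bits of 11011 are inspected in both cases; the split changes where the directory cell points, not how the cell is chosen.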


•  Directory schemes
- Extendible hashing (Fagin et al. 1979)
- Expandable hashing (Knott 1971)
- Dynamic hashing (Larson 1978)

•  Directoryless schemes
- Virtual hashing (Litwin 1978)

Hash Functions for Extendible Hashing


•  Size of a bucket = maximum number of pseudokeys (3 in our example).

•  Once a bucket is full, split it into two. Two situations are possible:
- The directory remains the same size; only the pointers to the buckets are adjusted.
- The size of the directory grows from $2^k$ to $2^{k+1}$, i.e. the directory size can be 1, 2, 4, 8, 16, etc. (8 is shown in the figure). Doubling the directory does not by itself change the number of buckets, so some directory entries will point to the same bucket.

Finally, one can use the bit string to build the index but store the actual key in the bucket.

[Figure: a directory of size 8 with entries 000, 001, 010, 011, 100, 101, 110, 111.]

Extendible Hashing
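The two split situations can be made concrete with a small sketch. It is not the course's reference implementation: the class names, the `depth` field stored on each bucket, and the use of bit strings as pseudokeys are all assumptions chosen for readability, and the sketch assumes a bucket never receives more identical pseudokeys than it can hold.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth   # how many leading bits this bucket's label has
        self.items = []      # pseudokeys, kept as bit strings for readability


class ExtendibleTable:
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.depth = 0                   # directory depth k; 2**k cells
        self.directory = [Bucket(0)]

    def _cell(self, pseudokey):
        # The leftmost `depth` bits of the pseudokey select the directory cell.
        return int(pseudokey[:self.depth], 2) if self.depth else 0

    def insert(self, pseudokey):
        bucket = self.directory[self._cell(pseudokey)]
        while len(bucket.items) >= self.capacity:
            if bucket.depth == self.depth:
                # Directory grows from 2**k to 2**(k+1): every cell is duplicated.
                self.directory = [b for b in self.directory for _ in range(2)]
                self.depth += 1
            # Split the full bucket and adjust the pointers that referred to it.
            self._split(bucket)
            bucket = self.directory[self._cell(pseudokey)]
        bucket.items.append(pseudokey)

    def _split(self, old):
        halves = (Bucket(old.depth + 1), Bucket(old.depth + 1))
        for item in old.items:
            halves[int(item[old.depth])].items.append(item)
        shift = self.depth - old.depth - 1
        for i, b in enumerate(self.directory):
            if b is old:
                self.directory[i] = halves[(i >> shift) & 1]


table = ExtendibleTable(capacity=3)
for p in ["00011", "00110", "00101", "01100", "01011", "10011", "11110", "11111"]:
    table.insert(p)
shared_before = table.directory[2] is table.directory[3]   # cells 10 and 11 share a bucket
table.insert("11011")   # splits that bucket; the directory keeps its size
```

Inserting the first eight pseudokeys doubles the directory twice (size 1 to 2 to 4); the final insertion of 11011 only splits a bucket and redirects one pointer, which is the first of the two situations.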


1.   Use as much space as needed.

2.   Input the file name and the number of words to insert. Use bucket size 128.

3.   Use any function h(k) that returns a string of up to 32 bits (an integer type can be used).

4.   Bucket: char array.

5.   Main idea: only the FIRST bits of the bit string returned by h(k) are used for the search.

Extendible Hashing
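Items 3 and 5 can be illustrated as follows. The choice of CRC-32 from Python's `zlib` module is only a stand-in for "any function h(k)", and the names `h`, `first_bits` and `HASH_BITS` are assumptions made for this sketch.

```python
import zlib

HASH_BITS = 32  # width of the pseudokey, as allowed by item 3


def h(word):
    """Hash a word to a 32-bit unsigned integer (CRC-32 is just a stand-in)."""
    return zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF


def first_bits(pseudokey, depth):
    """Return the `depth` most significant bits of a 32-bit pseudokey."""
    return pseudokey >> (HASH_BITS - depth)


cell = first_bits(h("moose"), 3)   # directory cell for a directory of depth 3
```

With a directory of depth 3 the result is always in the range 0 to 7, so it can be used directly as an index into the directory array.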


Assume that a hashing technique is applied to a dynamically changing file composed of buckets, and each bucket can hold only a fixed number of items.

Extendible hashing accesses the data stored in buckets indirectly through an index that is dynamically adjusted to reflect changes in the file. The characteristic feature of extendible hashing is the organization of the index, which is an expandable table.

Extendible Hashing


•  A hash function applied to a certain key indicates a position in the index and not in the file (or table of keys). Values returned by such a hash function are called pseudokeys.

•  The file requires no reorganization when data are added to or deleted from it, since these changes are indicated in the index.

•  Only one hash function h is used, but depending on the size of the index, only a portion of the address h(K) is utilized.

•  A simple way to achieve this effect is to look at the address as a string of bits from which only the i leftmost bits are used.

The number i is the depth of the directory. In figure 1(a) (in the next slide), the depth is equal to two.

Extendible Hashing


Extendible Hashing

Figure 1. An example of extendible hashing (Drozdek Textbook)


Extendible Hashing insertion/deletion examples

Suppose that we are using an extendible hash table with bucket size 2 and suppose that our hash function H is such that

H(ANT) = 1110…    H(DOG) = 0101…    H(PIG) = 1001…
H(BEAR) = 0010…   H(ELK) = 1000…    H(RAT) = 0000…
H(CAT) = 1010…    H(GORN) = 1010…   H(WOLF) = 0111…
H(COW) = 0001…    H(MOOSE) = 0001…


Extendible Hashing insertion/deletion examples

Each bucket has an associated label (or signature) indicating which cells in the directory point to it: namely, all those having an index whose binary representation has the label as a prefix.
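The rule relating a label to directory cells can be written down directly. The function name `cells_for_label` and the representation of cells as bit strings are assumptions made for this sketch.

```python
def cells_for_label(label, dir_depth):
    """List the directory cells (as bit strings) that have `label` as a prefix."""
    free = dir_depth - len(label)   # number of bits not fixed by the label
    if free < 0:
        raise ValueError("label is longer than the directory depth")
    if free == 0:
        return [label]
    return [label + format(i, "0{}b".format(free)) for i in range(2 ** free)]


cells = cells_for_label("1", 3)
```

In a directory of depth 3, a bucket labelled 1 is referenced by the four cells 100, 101, 110 and 111, while a bucket labelled 000 is referenced by a single cell.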


Extendible Hashing insertion/deletion examples

For each of the following operations, apply it to the hash table above (not to the result of applying the previous operations) and show the hash table that results.

(a) Insert WOLF.
(b) Insert ANT.
(c) Insert GORN.
(d) Delete DOG.
(e) Delete RAT.
(f) Delete CAT.
(g) Insert MOOSE.

SOLUTIONS:

(a) Insert WOLF. WOLF fits quite nicely alongside DOG in the bucket with label 01. (Illustration omitted.)


Extendible Hashing insertion/deletion examples

(b) Insert ANT. This causes overflow of the bucket with label 1, and thus that bucket is split into buckets with labels 10 and 11, into which CAT and ELK are placed appropriately, after which we attempt to insert ANT again. Because 10 is a prefix of both H(CAT) and H(ELK), both of these animals are placed into the bucket with label 10, leaving the 11 bucket empty. Insertion of ANT now goes smoothly, as it belongs in the 11 bucket.


Extendible Hashing insertion/deletion examples

(c) Insert GORN. This causes overflow of the bucket with label 1, and thus that bucket is split into buckets with labels 10 and 11, into which CAT and ELK are placed appropriately, after which we attempt to insert GORN again. Because 10 is a prefix of both H(CAT) and H(ELK), both of these animals are placed into the bucket with label 10, leaving the 11 bucket empty. Attempting to insert GORN leads to splitting the 10 bucket into buckets with label 100 and 101. ELK is placed into the former and CAT into the latter. Attempting to insert GORN once again, we find room for him in the 101 bucket.
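The insertions in (a) to (c) can be replayed with a small label-based model. This is a sketch under stated assumptions: the starting table is not shown in this transcript, so `initial_table` is inferred from the solutions (RAT and COW in 000, BEAR in 001, DOG in 01, CAT and ELK in 1); only the four visible bits of each hash value are used; and the directory itself is not modelled, only the bucket labels.

```python
H = {"ANT": "1110", "BEAR": "0010", "CAT": "1010", "COW": "0001",
     "DOG": "0101", "ELK": "1000", "GORN": "1010", "MOOSE": "0001",
     "PIG": "1001", "RAT": "0000", "WOLF": "0111"}
BUCKET_SIZE = 2


def initial_table():
    # Assumed starting state, inferred from the worked solutions.
    return {"000": ["RAT", "COW"], "001": ["BEAR"], "01": ["DOG"], "1": ["CAT", "ELK"]}


def find_label(buckets, key):
    """Return the label of the bucket whose label is a prefix of H(key)."""
    return next(label for label in buckets if H[key].startswith(label))


def insert(buckets, key):
    """Insert key, splitting full buckets as needed; return the directory depth."""
    label = find_label(buckets, key)
    while len(buckets[label]) >= BUCKET_SIZE:
        old = buckets.pop(label)
        buckets[label + "0"] = []
        buckets[label + "1"] = []
        for k in old:                       # redistribute by the next bit
            buckets[find_label(buckets, k)].append(k)
        label = find_label(buckets, key)    # retry the insertion
    buckets[label].append(key)
    return max(len(lab) for lab in buckets)


table = initial_table()
depth = insert(table, "GORN")
```

Running it reproduces solution (c): the bucket 1 is split twice, ELK ends up alone in 100, CAT and GORN share 101, and 11 stays empty.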


Extendible Hashing insertion/deletion examples

(d) Delete DOG. Remove DOG from the 01 bucket. As there are no sibling buckets with which to combine it, we simply leave the 01 bucket empty. (Only a bucket with label 00 could be a "sibling" to the bucket with label 01, and there is no such bucket.) (Illustration omitted.)

(e) Delete RAT. Remove RAT from the 000 bucket. As the 000 and 001 buckets are "siblings" and the total # of entries in the two of them is now two, we can merge them into a 00 bucket containing COW and BEAR. Because now the maximum length of any bucket's label is two, we can halve the size of the directory, making its depth two. (In real life, we probably wouldn't merge two buckets unless the resulting bucket were somewhat less than full, because otherwise the resulting bucket would be likely to undergo a split in the near future.)

(f) Delete CAT. Remove CAT from the 1 bucket. There is no sibling bucket, so that is all we can do. (Illustration omitted.)
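Deletion with sibling merging, as in (d) to (f), can be sketched in the same label-based style. The starting table is again an assumption inferred from the solutions, and the merge rule used here (merge whenever the two siblings fit into one bucket) is the simple rule of the exercise, not the more cautious policy mentioned in the parenthetical remark of (e).

```python
def delete(buckets, key, bucket_size=2):
    """Remove key, merge sibling buckets where possible; return the directory depth."""
    label = next(lab for lab in buckets if key in buckets[lab])
    buckets[label].remove(key)
    while label:
        # The sibling label differs from this label only in its last bit.
        sibling = label[:-1] + ("1" if label[-1] == "0" else "0")
        if sibling not in buckets:
            break                                   # no sibling bucket exists
        if len(buckets[label]) + len(buckets[sibling]) > bucket_size:
            break                                   # merged bucket would overflow
        merged = buckets.pop(label) + buckets.pop(sibling)
        label = label[:-1]
        buckets[label] = merged
    return max(len(lab) for lab in buckets)


# Assumed starting state, inferred from the worked solutions.
table = {"000": ["RAT", "COW"], "001": ["BEAR"], "01": ["DOG"], "1": ["CAT", "ELK"]}
depth = delete(table, "RAT")
```

Deleting RAT merges 000 and 001 into a 00 bucket holding COW and BEAR, and the longest label now has two bits, so the directory depth drops to two, as in solution (e).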


(g) Insert MOOSE. This causes overflow of the bucket with label 000. Because this bucket has depth 3, which equals the directory depth (DIR_DEPTH), we double the size of the directory, making each entry in the new directory point to the correct bucket. Then we split the overflowing bucket into buckets with labels 0000 and 0001, into which RAT and COW are placed appropriately. Then we attempt once more to insert MOOSE. This time, MOOSE fits nicely alongside COW in the 0001 bucket.


http://www.cosc.brocku.ca/~efoxwell/2P03/slides/Week12Slides.pdf (the next 6 slides)


Expandable Hashing

•  Similar in idea to extendible hashing, but a binary tree is used to store the index on the buckets.

Dynamic Hashing

•  Multiple binary trees are used.

Outcome:
- To shorten the search.
- Based on the key, select which tree to search.

Expandable & Dynamic Hashing
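A binary-tree index of the kind mentioned for expandable hashing might look as follows. This is a loose sketch, not Knott's or Larson's actual data structure: the classes `Node` and `Bucket`, the function `find_bucket`, and the sample labels 000, 001, 01 and 1 are all assumptions made for illustration.

```python
class Bucket:
    def __init__(self, label):
        self.label = label   # bit prefix shared by every key in this bucket
        self.keys = []


class Node:
    def __init__(self, zero, one):
        self.child = {"0": zero, "1": one}   # subtree for next bit 0 / next bit 1


def find_bucket(root, pseudokey):
    """Follow the bits of the pseudokey down the tree until a bucket is reached."""
    node = root
    for bit in pseudokey:
        if isinstance(node, Bucket):
            break
        node = node.child[bit]
    return node


# Index over four buckets with labels 000, 001, 01 and 1.
root = Node(Node(Node(Bucket("000"), Bucket("001")), Bucket("01")), Bucket("1"))
found = find_bucket(root, "0101")
```

Here the tree replaces the directory table: a bucket with a short label sits close to the root, so no duplicate directory cells are needed for it.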


•  Larson's method.

•  The index is simplified: it is represented as a set of binary trees.

•  The height of each tree is limited.

•  h(x) is searched in ALL trees.

•  Time: with m trees and at most k keys in each, overall m*lg k.

•  Advantage: shorter search time in the index file.

Dynamic Hashing


Litwin's Virtual Hashing

•  Expand the buckets in a linear fashion.

•  Store them contiguously in memory.

•  No table is needed; the procedure is simple.

Virtual Hashing
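The slide gives no algorithm, so the following is only an illustration of how a directoryless scheme can locate a bucket without a table. It uses the address rule of linear hashing, a closely related directoryless technique, which may differ in detail from the virtual hashing named on the slide. The names `address`, `level`, `split` and `n0` are assumptions.

```python
def address(key_hash, level, split, n0=4):
    """Bucket number for a hashed key, computed without any directory.

    n0    -- initial number of buckets
    level -- how many times the whole file has doubled so far
    split -- buckets 0 .. split-1 have already been split in the current round
    """
    a = key_hash % (n0 * 2 ** level)
    if a < split:
        # This bucket was already split, so use the next, finer hash function.
        a = key_hash % (n0 * 2 ** (level + 1))
    return a


slot = address(5, level=0, split=2)
```

With four initial buckets and two of them already split, the key with hash 5 maps first to bucket 1, which has been split, and is therefore re-addressed to bucket 5. Buckets are numbered consecutively, which is what lets them be stored contiguously.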


Summary

•  Extendible hashing advantages:
- The initially allocated space can increase indefinitely.
- Locating the bucket where a key belongs requires only a very fast bit comparison.
- Very flexible in the choice of bucket size, and allows buckets to be stored on disk or accessed in remote memory.

•  Extendible hashing disadvantages:
- Increased algorithm complexity.
- Extra memory overhead to store the index inside the bucket.