
Upload: martin-floyd

Post on 12-Jan-2016


CS212: DATA STRUCTURES

Lecture 10: Hashing


Outline

• Map Abstract Data type
• Map Abstract Data type methods
• What is hash
• Hash tables
• Bucket Arrays
• Hash function


Map Abstract Data type

A map allows us to store elements so that they can be located quickly using keys.

A map stores key-value pairs (k, v), where k is the key and v is its corresponding value.

Each key is unique.

Motivation: to search for any stored element by using its key.


Map Abstract Data type

Example: In a map storing student records (student's name, address, and course grades), the key is the student's ID number.

• Keys (labels) are assigned to values (diskettes).
• Labeled diskettes are inserted into the map (file cabinet).
• Keys can be used later to retrieve or remove values.


Map Abstract Data type methods

size(): return the number of entries in map M.

isEmpty(): test whether M is empty.

get(k): if M contains an entry e with key equal to k, then return the value of e; else return null.

put(k,v): if M doesn't have an entry with key equal to k, then add entry (k,v) to M and return null. Else, replace with v the existing value of the entry with key equal to k and return the old value.

remove(k): remove from M the entry with key equal to k and return its value; if M has no such entry, return null.

keys(): return an iterable collection containing all the keys stored in M.

values(): return an iterable collection containing all the values associated with keys stored in M.

entries(): return an iterable collection containing all the key-value entries in M.


Map Abstract Data type methods

Example:

Operation    Output   Map
isEmpty()    true     ∅
put(5,A)     null     {(5,A)}
put(7,B)     null     {(5,A), (7,B)}
put(2,C)     null     {(5,A), (7,B), (2,C)}
put(8,D)     null     {(5,A), (7,B), (2,C), (8,D)}
put(2,E)     C        {(5,A), (7,B), (2,E), (8,D)}
get(7)       B        {(5,A), (7,B), (2,E), (8,D)}
get(4)       null     {(5,A), (7,B), (2,E), (8,D)}
get(2)       E        {(5,A), (7,B), (2,E), (8,D)}
size()       4        {(5,A), (7,B), (2,E), (8,D)}
remove(5)    A        {(7,B), (2,E), (8,D)}
remove(2)    E        {(7,B), (8,D)}
get(2)       null     {(7,B), (8,D)}
isEmpty()    false    {(7,B), (8,D)}
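The trace above can be replayed with a small sketch of the map ADT. It simply wraps Python's built-in dict; the class name SimpleMap and the snake_case method names are illustrative choices rather than anything from the lecture, and None stands in for null.

```python
class SimpleMap:
    """Minimal map ADT sketch backed by a Python dict (illustrative only)."""

    def __init__(self):
        self._data = {}

    def size(self):
        return len(self._data)

    def is_empty(self):
        return len(self._data) == 0

    def get(self, k):
        # Return the value stored under k, or None if k is absent.
        return self._data.get(k)

    def put(self, k, v):
        # Store (k, v); return the previous value for k, or None if k is new.
        old = self._data.get(k)
        self._data[k] = v
        return old

    def remove(self, k):
        # Delete the entry for k and return its value (None if k is absent).
        return self._data.pop(k, None)

    def keys(self):
        return list(self._data.keys())

    def values(self):
        return list(self._data.values())

    def entries(self):
        return list(self._data.items())


m = SimpleMap()
print(m.is_empty())        # True
for k, v in [(5, "A"), (7, "B"), (2, "C"), (8, "D")]:
    m.put(k, v)
print(m.put(2, "E"))       # C (old value is returned)
print(m.get(7), m.get(4))  # B None
print(m.size())            # 4
print(m.remove(5))         # A
print(m.remove(2))         # E
print(m.entries())         # [(7, 'B'), (8, 'D')]
```

One limitation of this sketch: because None doubles as "no entry", it cannot distinguish a missing key from a key whose stored value is None.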


What is hash?

Hashing is the process of mapping a large number of data items to a smaller table with the help of a hash function.

Hashing uses a data structure called a hash table. Although hash tables provide fast insertion, deletion, and retrieval, operations that involve searching through the data, such as finding the minimum or maximum value, are not performed very quickly.

Hashing is also used in many cryptographic algorithms.

Hash table advantages:

From linear search to binary search, we improved our search efficiency from O(n) to O(log n). We now present a new data structure, called a hash table, that will improve our efficiency to O(1), or constant time, on average.


Hash Table

A hash table is a data structure in which keys are mapped to array positions by a hash function. The table can be searched for an item quickly by using the hash function to form an address from the key.

A hash function is a function which, when applied to the key, produces an integer that can be used as an address in a hash table.
• Perfect hash function: maps each key to a distinct position, so no collisions occur.
• Good hash function: keeps the number of collisions small by spreading keys evenly over the table.

When more than one element tries to occupy the same array position, we have a collision.

A collision is the condition that results when two or more keys produce the same hash location.


Bucket Arrays

A bucket array for a hash table is an array A of size N, where each cell of A is thought of as a "bucket" (that is, a collection of key-value pairs) and the integer N defines the capacity of the array.

Example: A bucket array of size 11 for the entries (1,D), (3,C), (3,F), (3,Z), (6,A), (6,C), and (7,Q).
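The example can be sketched as a list of lists. This assumes that each integer key is used directly as the cell index, which is how a plain bucket array works when keys already lie in the range [0, N − 1]; the variable names are illustrative.

```python
N = 11  # capacity of the bucket array

# Each cell of A is a bucket: a list of (key, value) pairs.
A = [[] for _ in range(N)]

entries = [(1, "D"), (3, "C"), (3, "F"), (3, "Z"), (6, "A"), (6, "C"), (7, "Q")]
for k, v in entries:
    A[k].append((k, v))  # assumption: the key itself is the cell index

for i, bucket in enumerate(A):
    print(i, bucket)
```

Cells 1, 3, 6, and 7 hold entries and the other seven cells stay empty, which previews the space drawback of bucket arrays.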


Bucket Arrays drawbacks

If the keys are unique integers in the range [0, N − 1], searches, insertions, and removals in the bucket array take O(1) time. This sounds like a great achievement, but it has two drawbacks.

First, the space used is proportional to N. Thus, if N is much larger than the number of entries n actually present in the map, we waste space.

The second drawback is that keys are required to be integers in the range [0, N − 1], which is often not the case.


Hash functions

The hash function maps the record's key to an integer called the hash index.

A collision occurs when two keys are mapped to the same hash index.

One way to resolve collisions is to allow each bucket to store multiple records. This is called chaining.

Example:

Index   Records in bucket
0
1       data, information
2
3
4       math, Discrete mathematics, Algebra, Solid geometry
5
6
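Chaining can be sketched as follows. The lecture does not say which hash function produced the indices in the example, so this sketch assumes integer keys and the index |key| mod capacity; it also uses Python lists as the chains instead of linked lists. The class name ChainedHashTable is hypothetical.

```python
class ChainedHashTable:
    """Hash table sketch that resolves collisions by chaining."""

    def __init__(self, capacity=7):
        # One chain (a list of (key, value) pairs) per bucket.
        self._buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        # Assumed hash function: |key| mod capacity.
        return abs(key) % len(self._buckets)

    def put(self, key, value):
        bucket = self._buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: overwrite value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))   # new key: append to the chain

    def get(self, key):
        for k, v in self._buckets[self._index(key)]:
            if k == key:
                return v
        return None


t = ChainedHashTable(7)
t.put(5, "data")
t.put(12, "information")  # 12 mod 7 = 5, collides with key 5
t.put(19, "math")         # 19 mod 7 = 5, collides again
print(t.get(12))          # information
print(t.get(26))          # None (26 also maps to index 5 but was never stored)
```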


Hash functions

The search time of each algorithm depends on the number n of elements in the collection S of data.

Hashing, or hash addressing, is a searching technique that is essentially independent of the number n.

Comparison of keys was the main operation used by the previously discussed searching methods.

A different way of searching is to calculate the position of the key in the table based on the value of the key.

We need to find a function h that can transform a key K (string, number, record, etc.) into an index in the table used for storing items of the same type as K. This function is called a hash function.


Hash functions

1. Division function: One simple compression function is the division method, which maps an integer i to |i| mod N.

Example:

Suppose we want to store a sequence of randomly generated numbers, keys: 5, 17, 37, 20, 42, 3. The array A, the hash table, where we want to store the numbers:

Index:   0    1    2    3    4    5    6    7    8
Value: |    |    |    |    |    |    |    |    |    |

We need a way of mapping the numbers to the array indexes, a hash function, that will let us store the numbers and later recompute the index when we want to retrieve them. There is a natural choice for this.


Hash functions

Our hash table has 9 fields, and the mod function, which sends every integer to its remainder modulo 9, will map an integer to a number between 0 and 8.

5 mod 9 = 5

17 mod 9 = 8

37 mod 9 = 1

20 mod 9 = 2

42 mod 9 = 6

3 mod 9 = 3

We store the values:

Index:   0    1    2    3    4    5    6    7    8
Value: |    | 37 | 20 |  3 |    |  5 | 42 |    | 17 |

In this case, computing the hash value of the number n to be stored, n mod 9, costs a constant amount of time. So does the actual storage, because n is stored directly in an array field.
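The worked example can be checked with a few lines of code. This is only a sketch of the division method for these particular keys; the function name h and the use of None for an empty field are illustrative choices.

```python
def h(i, n):
    """Division method: map integer i to |i| mod n."""
    return abs(i) % n


N = 9
table = [None] * N            # None marks an empty field
for key in [5, 17, 37, 20, 42, 3]:
    table[h(key, N)] = key    # these particular keys do not collide

print(table)  # [None, 37, 20, 3, None, 5, 42, None, 17]
```

Retrieval recomputes the same index: to look up 42, compute h(42, 9) = 6 and inspect field 6 directly.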


Hash Functions

1. Division

A hash function must guarantee that the number it returns is a valid index to one of the table entries. The simplest way is to use division modulo TSize = sizeof(table), as in h(K) = K mod TSize. It is best if TSize is a prime number.

Advantages:
• simple
• useful if we don't know much about the keys
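A small experiment suggests why a prime TSize tends to work better. The keys below are deliberately chosen as multiples of 10, so they share a factor with a table of size 10; both the keys and the two table sizes are made up for illustration and do not come from the lecture.

```python
def h(key, tsize):
    """Division hash: h(K) = K mod TSize."""
    return key % tsize


keys = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

slots_10 = {h(k, 10) for k in keys}   # TSize = 10 (not prime)
slots_11 = {h(k, 11) for k in keys}   # TSize = 11 (prime)

print(len(slots_10))  # 1  (every key lands in slot 0)
print(len(slots_11))  # 10 (every key lands in its own slot)
```

With a prime table size, regular patterns in the keys are less likely to line up with the modulus, so the keys spread over more slots.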


Hash Functions

2. Extraction

Idea: use only part of the key to compute the hash value (address, index).

Example: Key is (SSN) 123456789. This method might use, for example, the first four digits (1234), the last four (6789), or the first two combined with the last two (1289) as the index.
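The three variants in the example can be sketched with one helper that picks digits by position. The function name extract and the 0-based position lists are assumptions made for illustration.

```python
def extract(key, positions):
    """Build an index from the decimal digits of key at the given 0-based positions."""
    digits = str(key)
    return int("".join(digits[p] for p in positions))


ssn = 123456789
print(extract(ssn, [0, 1, 2, 3]))  # 1234 (first four digits)
print(extract(ssn, [5, 6, 7, 8]))  # 6789 (last four digits)
print(extract(ssn, [0, 1, 7, 8]))  # 1289 (first two and last two)
```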


Hash Functions

3. Folding

Idea: divide the key into parts, then combine (fold) the parts to create the index.

The key is divided into several parts. These parts are combined or folded together and are usually transformed in a certain way to create the index (address) into the table.

This is done by first dividing the key into parts, where each part has the same length as the desired index.

Note: after combining the key parts, if the resulting index is greater than the desired length, you can apply either division (which is usually used) or extraction.


Hash Functions

There are two types of folding.

1) Shift folding

The key is divided into several parts, then these parts are added together to create the index.

Example: Key is (SSN) 123456789

(SSN) 123-45-6789 can be divided into three parts, 123, 456, 789, and then these parts can be added. The resulting 1,368 can be divided modulo TSize.
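Shift folding for the SSN example might look like the sketch below. The lecture does not give a value for TSize, so the 1,000-cell table used here is an assumption, and the helper names are illustrative.

```python
def split_parts(key, part_len):
    """Split the decimal digits of key into chunks of part_len digits."""
    digits = str(key)
    return [digits[i:i + part_len] for i in range(0, len(digits), part_len)]


def shift_fold(key, part_len):
    """Shift folding: add the parts exactly as they appear in the key."""
    return sum(int(p) for p in split_parts(key, part_len))


TSIZE = 1000  # hypothetical table size, not given in the lecture

total = shift_fold(123456789, 3)
print(split_parts(123456789, 3))  # ['123', '456', '789']
print(total)                      # 1368
print(total % TSIZE)              # 368
```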


Hash Functions

2) Boundary folding

Same as shift folding, except that every other part is written backwards.

Example: Key is (SSN) 123456789

(SSN) with three parts, 123, 456, 789:
• the first part is taken in the same order
• the second part is in reverse order
• the third part is in the same order

The result is 123 + 654 + 789 = 1,566, then division.

Example: Key is 23459087632

Boundary folding: 234 + 095 + 876 + 23 = 1,228

This process is simple and fast, especially when bit patterns are used instead of numerical values; in that case, replace addition in the previous examples with XOR.
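Both boundary-folding examples can be reproduced with a short sketch. It reverses the second, fourth, and so on parts before adding, as described above; the helper names are illustrative, and the final division step is left to the caller because TSize is not specified.

```python
def split_parts(key, part_len):
    """Split the decimal digits of key into chunks of part_len digits."""
    digits = str(key)
    return [digits[i:i + part_len] for i in range(0, len(digits), part_len)]


def boundary_fold(key, part_len):
    """Boundary folding: reverse every other part, then add all parts."""
    total = 0
    for n, part in enumerate(split_parts(key, part_len)):
        if n % 2 == 1:           # 2nd, 4th, ... parts are written backwards
            part = part[::-1]
        total += int(part)
    return total


print(boundary_fold(123456789, 3))    # 1566 = 123 + 654 + 789
print(boundary_fold(23459087632, 3))  # 1228 = 234 + 095 + 876 + 23
```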


Hash Functions (cont.)

4. Mid-Square function

Idea: square the key (the key is multiplied by itself), then use the middle (mid) part of the result as the address.

Note: extraction could be used to extract the mid part.

Example: Key is 3121

Square the key: 3,121² = 9,740,641

Then use the mid part as the address (406).

Here, for a 1,000-cell table, h(3,121) = 406.
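A minimal sketch of the mid-square method follows. It works on decimal digits and assumes that, when the middle cannot be centered exactly, the window leans one digit to the left; the lecture does not specify that case. The function name mid_square is illustrative.

```python
def mid_square(key, num_digits):
    """Square key and return the middle num_digits decimal digits as an int."""
    squared = str(key * key)
    start = (len(squared) - num_digits) // 2  # leans left if not centerable
    return int(squared[start:start + num_digits])


print(3121 * 3121)          # 9740641
print(mid_square(3121, 3))  # 406
```

Three digits are taken here because a 1,000-cell table has indices 0 through 999.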


Detecting and resolving collisions

Even with the methods introduced previously, collisions may still occur. Two keys cannot be stored in the same location, so we must find a way to resolve collisions.

Choice of hash function and choice of table size may reduce collisions, but will not eliminate them.

Methods for resolving collisions:
• open addressing: find another empty position
• chaining: use linked lists
• bucket addressing: store elements at the same location
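Open addressing can take several forms. The sketch below uses linear probing (try the next slot, wrapping around at the end of the table), which is one common choice and is not named in the lecture. It assumes integer keys hashed with |key| mod capacity and omits removal, which would need extra bookkeeping. The class name LinearProbingTable is hypothetical.

```python
class LinearProbingTable:
    """Open-addressing sketch: on a collision, probe the next slot."""

    def __init__(self, capacity=9):
        self._slots = [None] * capacity  # each slot: (key, value) or None

    def _index(self, key):
        return abs(key) % len(self._slots)

    def put(self, key, value):
        n = len(self._slots)
        start = self._index(key)
        for step in range(n):
            i = (start + step) % n
            # Use the slot if it is empty or already holds this key.
            if self._slots[i] is None or self._slots[i][0] == key:
                self._slots[i] = (key, value)
                return i                 # report where the entry ended up
        raise RuntimeError("table is full")

    def get(self, key):
        n = len(self._slots)
        start = self._index(key)
        for step in range(n):
            i = (start + step) % n
            if self._slots[i] is None:   # empty slot: key cannot be further on
                return None
            if self._slots[i][0] == key:
                return self._slots[i][1]
        return None


t = LinearProbingTable(9)
print(t.put(5, "a"))    # 5
print(t.put(14, "b"))   # 6 (14 mod 9 = 5 is taken, so the next slot is used)
print(t.put(23, "c"))   # 7 (slots 5 and 6 are taken)
print(t.get(14))        # b
print(t.get(32))        # None
```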


Applications of Hash tables

• Lots of recent research into using distributed hash tables in peer-to-peer networks (searching, lookup)
• Symbol tables (compilers)
• Databases (of phone numbers, IP addresses, etc.)
• Dictionaries


References:
• Textbook, Chapter 10: Hashing

End Of Chapter