
Comp 335 File Structures

Hashing

What is Hashing?

A process used with record files that tries to achieve O(1) (i.e., constant-time) access to a record’s location in the file.

An algorithm, called a hash function (h), is given a primary key as input; the resulting output is the location of the record within the file; h(key) = address.

Hashing Example

Assume you want to store 5,000 data records on file. You want this to be a hashed file for quick access. Each record will be fixed in length and the primary key for each record is an employee number which is 8 digits long.

A common hash function is called modulo arithmetic.

h(key) = key mod n; n = 5000

h(82461792) = 82461792 mod 5000 = 1792

The address (RRN) of the record with this key is 1792.
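The modulo calculation above can be sketched in a few lines of Python. The function name `hash_mod` is illustrative only, and the sketch assumes the key is already an integer.

```python
def hash_mod(key, n):
    """Return the home address (RRN) of an integer key in a file of n addresses."""
    if n <= 0:
        raise ValueError("address space must be positive")
    return key % n


# Employee number from the example, hashed into 5,000 addresses.
print(hash_mod(82461792, 5000))  # 1792
```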

Other Hashing Methods: Folding

Folding requires extracting certain groupings from the key and then adding or multiplying the groupings in some fashion to form the hash address.

Example: Key = “BISON”, Address Space = 101

Step 1 – Get the ASCII value of each character in the string: B(66), I(73), S(83), O(79), N(78)

Step 2 – Add the values at even index positions (0, 2, 4): 66 + 83 + 78 = 227

Step 3 – Add the values at odd index positions (1, 3): 73 + 79 = 152

Step 4 – Multiply the results: 227 * 152 = 34504

Step 5 – Take the result modulo the address space: 34504 mod 101 = 63 (hash address)
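The five steps can be written as a short Python sketch. The name `hash_fold` is hypothetical, and the sketch assumes zero-based indexing and single-byte (ASCII) characters, as in the example.

```python
def hash_fold(key, address_space):
    """Fold a string key: sum even-index codes, sum odd-index codes, multiply, reduce."""
    codes = [ord(ch) for ch in key]   # Step 1: character codes
    even_sum = sum(codes[0::2])       # Step 2: indexes 0, 2, 4, ...
    odd_sum = sum(codes[1::2])        # Step 3: indexes 1, 3, ...
    product = even_sum * odd_sum      # Step 4
    return product % address_space    # Step 5


print(hash_fold("BISON", 101))  # 63
```

Note that a one-character key has an odd-index sum of 0 and therefore always hashes to address 0 under this particular folding rule.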

Other Hashing Methods: Mid-Square

Involves squaring the “numeric” form of a key and extracting some of the digits from the “middle of the square”.

Example: Assume the address space is 1000.

Key (4-digit integer) = 2973

2973 * 2973 = 8838729

Extract the “middle” digits: 387 (hash address)
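A minimal Python sketch of the mid-square step follows. The name `hash_mid_square` is illustrative, and the rule for locating the "middle" (centre the window, leaning left when the leftover digit count is odd) is an assumption, since the example does not specify it.

```python
def hash_mid_square(key, num_digits=3):
    """Square the key and return num_digits digits taken from the middle of the square."""
    squared = str(key * key).zfill(num_digits)  # pad very small squares with zeros
    start = (len(squared) - num_digits) // 2    # assumed centring rule
    return int(squared[start:start + num_digits])


# 2973 * 2973 = 8838729; the middle three digits are 387.
print(hash_mid_square(2973))  # 387
```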

Other Hashing Methods: Radix Transformation

Convert the key to a different base and then use modulo arithmetic.

Example: Address space is 100. Key is 453 (base 10).

Conversion to base 11: 382

382 mod 100 = 82 (hash address)
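A sketch of radix transformation in Python, using key 453, whose base-11 representation is 382 (3 * 121 + 8 * 11 + 2 = 453). The helper names are hypothetical. Reading the converted digits back with decimal place values is an assumption about how the converted key is treated; it also covers the case where a base-11 digit equals 10.

```python
def to_base(value, base):
    """Return the digits of a non-negative integer in the given base, most significant first."""
    digits = []
    while value > 0:
        value, remainder = divmod(value, base)
        digits.append(remainder)
    return digits[::-1] or [0]


def hash_radix(key, base, address_space):
    """Convert the key to another base, reread the digits as decimal, then take the modulo."""
    reread = 0
    for digit in to_base(key, base):
        reread = reread * 10 + digit  # assumed: decimal place values for the new digits
    return reread % address_space


print(to_base(453, 11))          # [3, 8, 2]
print(hash_radix(453, 11, 100))  # 82
```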

Other Hashing Methods: Multiplicative Function

Involves multiplying the key by some constant less than one; the hash function returns some of the digits of the fractional part of the result.

Example: Address space = 1000

Key (5-digit integer): 82165

Multiplier: 0.39731

82165 * 0.39731 = 32644.97615

The first three digits of the fractional part are the hash address = 976
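The multiplicative method can be sketched as below. The name `hash_multiplicative` is illustrative. Scaling the fractional part by the address space and truncating is an assumption; it matches "take the first three digits" only when the address space is a power of ten, as it is here.

```python
import math


def hash_multiplicative(key, multiplier, address_space):
    """Multiply the key by a constant in (0, 1) and scale the fractional part to an address."""
    if not 0 < multiplier < 1:
        raise ValueError("multiplier must be between 0 and 1")
    fractional = (key * multiplier) % 1.0          # keep only the fractional part
    return math.floor(fractional * address_space)  # first digits of the fraction


# 82165 * 0.39731 = 32644.97615, so the fractional digits 976 form the address.
print(hash_multiplicative(82165, 0.39731, 1000))  # 976
```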

Major Problem with Hashing

Given a random set of keys and a hash function (h), it is highly probable that some keys in the set will be hash synonyms. In other words, the same hash function output can be obtained from different keys in the set.

A hashing algorithm can yield three different types of address distributions:

Perfect – no synonyms given a set of keys; the probability of obtaining a perfect distribution from a large set of unknown keys is very, very low (textbook – 1 out of $10^{120{,}000}$)

Random – “few” synonyms generated; what we strive for!

Scud – many synonyms generated

If the set of keys is known beforehand, it is possible to generate a perfect hashing algorithm (Pearson, Cichelli)

Collisions

When two or more keys hash to the same address, this is called a collision.

This has to be accounted for with random hashing algorithms.

The handling of collisions becomes a critical issue in the overall search efficiency of a given file. Remember each search could mean a “disk access”.

Decreasing the Probability of Collisions

Increase the address space – a common technique; allocate more addresses in the file than records to store; this can decrease the possibility of collisions greatly assuming the hashing algorithm is random. The disadvantage obviously is wasted space.

Place more than one record at an address. Such addresses are commonly referred to as buckets. A single address can store an array of records. This has been shown to increase search efficiency.

Collision Resolution

Even if you have tried to decrease the probability of collisions, they still can and will happen.

Ways to resolve collisions:

Linear Probing

Double Hashing

Prime area with overflow

Chaining

Linear Probing

If a key is hashed to an address already occupied or full, search the address space linearly until the first free space is found.

Easy to implement; however, this technique can lead to poor search efficiency. It can take home addresses away from other keys, resulting in more collision handling.

It can also take many accesses to determine that a key does not exist.

What if a key is deleted using this technique? If the slot is simply marked empty, a later search can stop early and miss keys that were stored past it, so deletions need special handling (for example, a “deleted” marker).
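The following in-memory Python sketch illustrates linear probing, including one common way to handle deletion: a "deleted" marker (tombstone) that searches skip over but inserts may reuse. The class and method names are hypothetical, the list stands in for the file's address space, and `key % size` is assumed as the hash function.

```python
EMPTY = None
DELETED = object()  # tombstone: slot was used once, so searches must keep going


class LinearProbeTable:
    def __init__(self, size):
        self.size = size
        self.slots = [EMPTY] * size

    def _probe(self, key):
        home = key % self.size              # assumed hash function
        for step in range(self.size):
            yield (home + step) % self.size

    def insert(self, key):
        first_free = None
        for addr in self._probe(key):
            slot = self.slots[addr]
            if slot is EMPTY:
                if first_free is None:
                    first_free = addr
                break                        # key cannot be stored beyond an empty slot
            if slot is DELETED:
                if first_free is None:
                    first_free = addr        # remember the first reusable slot
            elif slot == key:
                return addr                  # already present
        if first_free is None:
            raise RuntimeError("no free address")
        self.slots[first_free] = key
        return first_free

    def search(self, key):
        for addr in self._probe(key):
            slot = self.slots[addr]
            if slot is EMPTY:
                return None                  # a never-used slot ends the search
            if slot is not DELETED and slot == key:
                return addr
        return None

    def delete(self, key):
        addr = self.search(key)
        if addr is None:
            return False
        self.slots[addr] = DELETED
        return True


table = LinearProbeTable(7)
for k in (10, 17, 24):                       # all three keys have home address 3
    table.insert(k)
table.delete(17)
print(table.search(24))                      # 5: still found past the tombstone
```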

Double Hashing

Upon a collision, the key is re-hashed using a different algorithm; the result determines the increment used to search for an open address.

The same problems exist as with linear probing.

Research has shown that this technique will give better performance than linear probing.
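A minimal sketch of double hashing in Python. The second hash `1 + key % (size - 1)` and the prime table size are assumptions chosen so that the increment is never zero and every address is eventually visited; the function names are illustrative.

```python
def probe_sequence(key, size):
    """Yield the addresses visited for a key; size is assumed to be prime."""
    home = key % size                # first hash: home address
    step = 1 + (key % (size - 1))    # second hash: increment, never zero
    for i in range(size):
        yield (home + i * step) % size


def insert(table, key):
    """Store the key at the first open address along its probe sequence."""
    for addr in probe_sequence(key, len(table)):
        if table[addr] is None:
            table[addr] = key
            return addr
    raise RuntimeError("no free address")


table = [None] * 7
# All three keys share home address 3 but use different increments.
print([insert(table, k) for k in (10, 17, 24)])  # [3, 2, 4]
```

Because synonyms follow different increments, they tend not to pile up in one contiguous run the way they do under linear probing.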

Prime area with Overflow

Usually used with buckets. A bucket holds x records in the prime address space and also contains a pointer to an overflow area of the file, which is entry-sequenced. This pointer gives the location of the bucket’s first overflow record, and each overflow record contains a pointer to the next overflow record.

This is a common technique and gives excellent search efficiency.
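The layout can be sketched in memory as follows. The class name `BucketFile`, the use of list indexes as "pointers", and linking each new overflow record at the head of its bucket's chain are all assumptions made for illustration; a real file would store byte offsets or RRNs.

```python
class BucketFile:
    def __init__(self, num_buckets, bucket_size):
        self.bucket_size = bucket_size
        # Prime area: each bucket holds its records plus a pointer into the overflow area.
        self.prime = [{"records": [], "overflow": None} for _ in range(num_buckets)]
        # Overflow area: entry-sequenced (key, index of next overflow record) pairs.
        self.overflow = []

    def insert(self, key):
        bucket = self.prime[key % len(self.prime)]
        if len(bucket["records"]) < self.bucket_size:
            bucket["records"].append(key)
            return
        # Bucket is full: append to the overflow area and link at the chain head.
        self.overflow.append((key, bucket["overflow"]))
        bucket["overflow"] = len(self.overflow) - 1

    def search(self, key):
        bucket = self.prime[key % len(self.prime)]
        if key in bucket["records"]:
            return True
        index = bucket["overflow"]
        while index is not None:          # follow the overflow chain
            stored, index = self.overflow[index]
            if stored == key:
                return True
        return False


f = BucketFile(num_buckets=5, bucket_size=2)
for k in (5, 10, 15, 20):                 # all hash to bucket 0
    f.insert(k)
print(f.prime[0]["records"], f.overflow)  # [5, 10] [(15, None), (20, 0)]
```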

Chaining

The file consists of a hash table which is simply an array of pointers. When a key is hashed, the result is an index into the hash table. At this location is a pointer to the first record which has this hash address. All the records are then “chained” together as a linked list.

The data record portion of the file can be entry-sequenced.
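A small in-memory sketch of chaining: the hash table holds only pointers, and the records live in an entry-sequenced list where each record carries the index of the next record with the same hash address. The names are hypothetical and `key % table_size` is assumed as the hash function.

```python
class ChainedFile:
    def __init__(self, table_size):
        self.table = [None] * table_size  # hash table: array of pointers (record indexes)
        self.records = []                 # entry-sequenced (key, data, next index) tuples

    def insert(self, key, data):
        slot = key % len(self.table)
        # The new record points at the old head of the chain, then becomes the head.
        self.records.append((key, data, self.table[slot]))
        self.table[slot] = len(self.records) - 1

    def search(self, key):
        index = self.table[key % len(self.table)]
        while index is not None:          # walk the linked list for this address
            stored_key, data, index = self.records[index]
            if stored_key == key:
                return data
        return None


c = ChainedFile(5)
c.insert(7, "first")    # hash address 2
c.insert(12, "second")  # also hash address 2, chained to the first
print(c.search(7), c.search(12), c.search(2))  # first second None
```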

Hash Address Distributions

Assuming you have a random hash function, the Poisson Function can be used to compute various probabilities such as:

How many empty hash slots will there be?

What percentage of the time will access to a key result in more than one access to find it?

What is the probability that a certain hash address will have x number of keys assigned to it?

Poisson Function

$$p(x) = \frac{(r/n)^x \, e^{-r/n}}{x!}$$

n – the address space (number of addresses)

r – number of keys to hash

x – number of records assigned to a given address

r/n = packing density; load factor
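The formula translates directly into Python. The function name `poisson` and the sample values (1,000 keys in 1,000 addresses) are illustrative choices.

```python
import math


def poisson(x, r, n):
    """Probability that a given address receives exactly x of r keys spread over n addresses."""
    load = r / n  # packing density (load factor)
    return (load ** x) * math.exp(-load) / math.factorial(x)


r, n = 1000, 1000
for x in range(4):
    print(x, round(poisson(x, r, n), 3))

# Expected number of empty addresses is n * p(0).
print(round(n * poisson(0, r, n)))
```

Multiplying p(x) by n gives the expected number of addresses that receive exactly x keys, which is how the formula answers the "how many empty slots" question.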

Poisson Function Example

Assume 1,000 records to be hashed into a 1,000 address hashed file.

1) What is the probability that a given address will have two keys hashed to it?

$$p(2) = \frac{(1{,}000/1{,}000)^2 \, e^{-1{,}000/1{,}000}}{2!} = \frac{e^{-1}}{2} = \frac{.368}{2} = .184$$

2) 1,000 (number of addresses) * .184 = 184

Therefore approximately 184 addresses will have 2 keys hashed to them, which means these addresses account for 184 overflow records (one each, assuming one record per address).

Poisson Function Example

Assume 1,000 records to be hashed into a 1,500 address hashed file.

1) What is the probability that a given address will have two keys hashed to it?

$$p(2) = \frac{(1{,}000/1{,}500)^2 \, e^{-1{,}000/1{,}500}}{2!} \approx \frac{(.67)^2 \, e^{-.67}}{2} = \frac{(.449)(.512)}{2} = \frac{.230}{2} = .115$$

2) 1,500 (number of addresses) * .115 = 172.5 (173)

Therefore approximately 173 addresses will have 2 keys hashed to them, which means these addresses account for 173 overflow records (one each, assuming one record per address).
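As a rough cross-check, the sketch below evaluates the same formula without rounding r/n. It also estimates total overflow as the number of keys minus the expected number of occupied addresses, which counts addresses that receive three or more keys as well. That estimate assumes one record per address (bucket size 1), and the function names are hypothetical.

```python
import math


def poisson(x, r, n):
    """Probability that a given address receives exactly x of r keys spread over n addresses."""
    load = r / n
    return (load ** x) * math.exp(-load) / math.factorial(x)


def expected_overflow(r, n):
    """Keys that cannot sit at their home address, assuming one record per address."""
    occupied = n * (1 - poisson(0, r, n))  # addresses that receive at least one key
    return r - occupied


for n in (1000, 1500):
    two_key_addresses = n * poisson(2, 1000, n)
    print(n, round(two_key_addresses, 1), round(expected_overflow(1000, n), 1))
```

With r/n left as 2/3 rather than rounded to .67, p(2) for the 1,500-address file is about .114, or roughly 171 addresses; the worked example's 173 follows from the rounding.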
