a note on hashing functions and tabu search algorithms

ELSEVIER European Journal of Operational Research 95 (1996) 237-239


Theory and Methodology

A note on hashing functions and tabu search algorithms

William B. Carlton a, J. Wesley Barnes b,*

a Department of Mechanical Engineering, ETC 5.128, The University of Texas at Austin, Austin, TX 78712, USA

b Department of Mechanical Engineering, ETC 5.128D, The University of Texas at Austin, Austin, TX 78712, USA

Received 23 February 1995; revised 31 July 1995

Abstract

Woodruff and Zemel present four effective functions that can be used for hashing within tabu search algorithms. However, the authors overlook one property that may affect the performance of algorithms that use the proposed functions. This paper clarifies the effects that the "birthday paradox" may have on algorithms using these functions.

In their 1993 paper, Woodruff and Zemel [4] present a variety of functions that can be used for hashing within a tabu search context. The overriding purpose of any hashing function is to enable a tabu search algorithm to rapidly detect whenever it transitions to a solution that has already been visited. The primary strength of tabu search algorithms is their ability to rapidly move through the solution space and not become stuck in a "complex attractor" as described by Battiti and Tecchiolli [1] or some other suboptimal location.

Woodruff and Zemel present four different hashing functions for use within a tabu search algorithm. While each function is intended to be used within a well-defined context, each of these functions uses a series of random integers in order to compute the hashing function. The authors state that these hashing functions h serve a threefold purpose:

* Corresponding author. Email: [email protected].

1. "Computation and update of h should be as easy as possible. This means that the structure of h should reflect the structure of the neighborhood sets.

2. The integer generated should be in a range that results in a reasonable storage requirement and comparison effort (e.g. an integer requiring two or four bytes).

3. The probability of collision should be low. A collision occurs when two different vectors are encountered with the same hash function value."

The authors define the following four hashing functions, which assign a hashing value to a solution vector. The choice of the appropriate function is based on the form of the solution and the nature of the problem at hand. The authors give a detailed discussion of when each function is most appropriately used. The first two hashing functions are based upon the following parameters:

X = a solution vector at some step in the tabu search algorithm, where $x_i \in X$ for $i = 1, \ldots, n$.

Z = a vector of pseudo random integers drawn from the range $(1, \ldots, m)$, where $z_i \in Z$ for $i = 1, \ldots, n$.

$$h_0 = \sum_{i=1}^{n} z_i x_i.$$

The authors note that "(u)nless m is very small, the function may take values in excess of the maximum integer that can be represented by the computer, MAXINT. On most machines, this will result in overflow so the function will effectively be

$$h_1 = \left[ \sum_{i=1}^{n} z_i x_i \right] \bmod [\mathrm{MAXINT} + 1].$$"
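The two functions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`make_weights`, `h0`, `h1`, `h1_update`) are hypothetical, the word size is fixed at two bytes, and the incremental update is one possible reading of the requirement that "computation and update of h should be as easy as possible", not code taken from Woodruff and Zemel.

```python
import random

MAXINT = 2**16 - 1  # assumed two-byte unsigned word


def make_weights(n, m, seed=0):
    """Draw n pseudo random integers z_1..z_n from the range 1..m."""
    rng = random.Random(seed)
    return [rng.randint(1, m) for _ in range(n)]


def h0(x, z):
    """Unbounded weighted sum of the solution vector: sum of z_i * x_i."""
    return sum(zi * xi for zi, xi in zip(z, x))


def h1(x, z, maxint=MAXINT):
    """h0 reduced modulo MAXINT + 1, mimicking unsigned integer overflow."""
    return h0(x, z) % (maxint + 1)


def h1_update(h, z, i, old, new, maxint=MAXINT):
    """Update an existing h1 value after component i changes from old to new.

    Hypothetical helper: only the changed term is recomputed, so the cost
    does not depend on the length of the solution vector.
    """
    return (h + z[i] * (new - old)) % (maxint + 1)
```

For example, with z = [3, 5, 7, 11] and x = [1, 0, 1, 1], `h0(x, z)` is 3 + 7 + 11 = 21; flipping the second component to 1 and calling `h1_update(21, z, 1, 0, 1)` gives the same value as rehashing the new vector from scratch.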

The third function is based on a matrix associated with the solution vector X, where Z = a matrix of pseudo random integers drawn from the range $(1, \ldots, m)$, so that

$$h_2 = \sum_{i=1} Z(x_i, x_{i+1}).$$

The authors note that "(g)eneration of the Z matrix may require excessive effort and storage. A better scheme is to simulate the existence of the matrix by generating a ... vector of pre-computed random weights z and replacing Z(i,j) with z(i)z(j)."

This leads to the fourth hashing function, again where overflow is anticipated:

$$h_3 = \left[ \sum_{i=1} z(x_i)\, z(x_{i+1}) \right] \bmod [\mathrm{MAXINT} + 1].$$
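A minimal Python sketch of the fourth function follows. It assumes that the solution vector is a sequence of labels (for example, the cities of a tour), that z can be indexed by those labels, and that the sum runs over consecutive pairs of the sequence; the text here does not state the upper summation limit, so that last point is an assumption. The function name `h3` and the `maxint` parameter are chosen for illustration.

```python
def h3(tour, z, maxint=2**16 - 1):
    """Sum z[x_i] * z[x_{i+1}] over consecutive pairs, modulo MAXINT + 1.

    tour : sequence of element labels (the solution vector)
    z    : mapping or list giving the random weight of each label
    """
    total = 0
    for a, b in zip(tour, tour[1:]):
        total += z[a] * z[b]
    return total % (maxint + 1)
```

With weights z = {0: 2, 1: 3, 2: 5}, the sequence [0, 1, 2] hashes to 2·3 + 3·5 = 21 before any wraparound occurs.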

The four functions that Woodruff and Zemel present accomplish the first two objectives. With respect to the third objective, the performance of the functions depends on the parameters selected. In analyzing the performance, their paper fails to consider one property of the hashing functions that can have a significant impact on parameter selection. It is this property that we will attempt to clarify with this note.

With regard to the probability of collision, the authors say that even if the probability of a collision is not zero, the overall effects of such occurrences on an algorithm are not significant. However, if collisions occur, then the algorithm will detect a false return to a solution and may react differently than is intended by the analyst. Even more troubling is the case of an algorithm that alters the search parameters based on the detection of cycling. In this case the "false positive" detections may cause a deterministic algorithm to behave as though it were stochastic and return different solutions depending solely upon the random number stream used to generate the integers to compute the hashing function. The authors claim that, in general, the functions "may take values in excess of the maximum [unsigned] integer that can be represented on the computer, MAXINT." The authors state that in general the probability of collision is 1/MAXINT. If the computer's word size is two bytes, then the maximum unsigned integer, MAXINT, is $2^{16} - 1$ = 65,535, and 1/MAXINT = 0.0000153. This seems to be an acceptable risk of collision. However, in practice false positives occur more frequently than this figure suggests.

During the development of a deterministic algorithm using hashing function $h_3$ [2], we observed unexplained random behavior. The only random component in our algorithm was the generation of the random integers in Z. Because we were in the early development stages, we used very few iterations during any one experiment. According to the results presented by the authors, we should have had a very small chance of collisions. However, using as few as 200 iterations, we observed different solutions and different search paths being produced by an otherwise deterministic search. These random results occurred whenever we used different seed values to generate different random vectors Z. Investigation revealed that the hashing function values assigned to nonidentical tours in independent experiments collided with a much greater frequency than predicted.

Table 1

| MAXINT = 365: Iterations | Pr(Collision) | MAXINT = $2^{16} - 1$ = 65,535: Iterations | Pr(Collision) | MAXINT = $2^{32} - 1$ = 4.295 × $10^9$: Iterations | Pr(Collision) |
|---|---|---|---|---|---|
| 10 | 0.11695 | 100 | 0.07278 | 1,000 | 0.00012 |
| 20 | 0.41144 | 200 | 0.26211 | 5,000 | 0.00291 |
| 30 | 0.70632 | 500 | 0.85168 | 10,000 | 0.01157 |
| 50 | 0.97037 | 1,000 | 0.99953 | 20,000 | 0.04550 |
| 75 | 0.99972 | 2,000 | 1.00000 | 50,000 | 0.25251 |
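The seed dependence described above can be explored with a small experiment. The sketch below is not the authors' algorithm or data: it simply hashes the same stream of random tours under several weight vectors and reports where the first collision between two different tours occurs. All parameter values (`n_cities`, `n_tours`, `m`, the fixed tour seed) and the function names are hypothetical choices for illustration, and the hash assumes a sum over consecutive cities with a two-byte word.

```python
import random


def h3(tour, z, modulus=2**16):
    """Hash a tour as the sum of z[a] * z[b] over consecutive cities."""
    return sum(z[a] * z[b] for a, b in zip(tour, tour[1:])) % modulus


def first_collision(seed, n_cities=30, n_tours=200, m=1000):
    """Return the index of the first tour whose hash value matches that of
    a different, earlier tour, or None if no such collision occurs."""
    weight_rng = random.Random(seed)
    z = [weight_rng.randint(1, m) for _ in range(n_cities)]
    tour_rng = random.Random(12345)  # identical tour stream for every seed
    seen = {}  # hash value -> tour that produced it
    for t in range(n_tours):
        tour = list(range(n_cities))
        tour_rng.shuffle(tour)
        key = tuple(tour)
        h = h3(tour, z)
        if h in seen and seen[h] != key:
            return t
        seen[h] = key
    return None


if __name__ == "__main__":
    for seed in range(5):
        print(seed, first_collision(seed))
```

Because the tours are identical across runs, any difference in the reported index comes only from the random weights, which is the sense in which a collision-sensitive search can inherit randomness from Z.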

This discrepancy is due primarily to the authors' definition of "collision". In property (3) above, note that "A collision occurs when two different vectors are encountered with the same hash function value." By this definition, a collision occurs when a particular vector has the same hashing function value as another specific vector. This is significantly different from the definition, "a collision occurs whenever any two different vectors have the same hashing function value." This is not a new insight; Knuth [3] indicates that it is due to a well-known paradox, the birthday paradox. The birthday paradox is a result from probability theory that allows one to show that if there are 25 people at a party, there is a better than 50-50 chance (56.9%, to be precise) that two people present have the same birthday. In fact, if 50 people come to the party, it is almost certain that two people present will have the same birthday (97% chance).

For a "hashing" example, let each guest's hashing function be the Julian date of her birthday (January 1 = 1 and December 31 = 365, if leap years are not considered). The word size of our party's "computer" is MAXINT = 365, and the number of guests, 25, corresponds to the number of "vectors" that require hashing. Again, the probability that any two people have the same birthday (have the same hashing value) is not the same as the probability that the host has the same birthday as another guest. The probability that the host has the same birthday as another uniquely specified guest is 1/MAXINT = 1/365 = 0.00274. However, the probability that any two of the guests have the same birthday is 0.569, which is significantly different.
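The two probabilities contrasted above can be computed directly. The following sketch assumes that hash values (birthdays) are independent and uniform over d possible values; the function names are illustrative only.

```python
def pr_specific_pair(d):
    """Probability that two specified items share a value: 1/d."""
    return 1.0 / d


def pr_any_collision(n, d):
    """Probability that at least two of n uniform draws from d values match.

    Computed as one minus the probability that all n draws are distinct.
    """
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= (d - k) / d
    return 1.0 - p_distinct
```

With d = 365, `pr_specific_pair(365)` is about 0.00274, whereas `pr_any_collision(25, 365)` is about 0.569 and `pr_any_collision(50, 365)` is about 0.970, matching the figures quoted in the party example.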

It is this paradox that is not specifically addressed in the Woodruff and Zemel article. The more important question that should be answered is, "Is the probability of any two vectors having the same hashing value acceptable given the word size of the computer?" Table 1 gives an indication of the impact of the birthday paradox on realized collisions.

Table 1 indicates that for a 2 byte word size, if 200 vectors are hashed, then there is a better than 1 in 4 chance that two vectors will have the same hashing value. This assumes that the hashing function used is uniformly distributed across all values of unsigned integers, $(0, \ldots, \mathrm{MAXINT})$. Worse, it indicates that from among only 1,000 different vectors it is virtually certain that at least two vectors will have hashing values that match. This may not be acceptable under any hashing scheme, if one desires to develop a completely deterministic algorithm. However, if one uses a 4 byte word, then the probability of collision may be small enough to give a reasonable chance of avoiding collisions altogether.
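The entries of Table 1 can be checked against the standard birthday-problem expression. What follows is a sketch under the uniformity assumption stated above; the symbols $n$ (number of vectors hashed), $d$ (number of possible hashing values), and $P(n, d)$ are introduced here for illustration and are not notation from Woodruff and Zemel.

$$P(n, d) = 1 - \prod_{k=0}^{n-1} \left( 1 - \frac{k}{d} \right) \approx 1 - \exp\left( -\frac{n(n-1)}{2d} \right).$$

For $n = 200$ and $d = 65{,}535$ the exponent is $39{,}800 / 131{,}070 \approx 0.304$, which gives $P \approx 0.262$, in line with the Table 1 entry of 0.26211. Setting $P = 1/2$ in the approximation gives $n \approx \sqrt{2 d \ln 2} \approx 1.18 \sqrt{d}$, that is, roughly 300 vectors for a 2 byte word and roughly 77,000 vectors for a 4 byte word.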

This phenomenon will occur regardless of the hashing function used. If the analyst chooses a hashing function that does not distribute hashing values uniformly across the range of unsigned integers, then the effects of the birthday paradox will become more pronounced. The analyst should therefore still take care to use a hashing function such as one of those suggested by Woodruff and Zemel. We have found that the functions proposed by Woodruff and Zemel work well in practice, provided that one is aware of the effects of this paradox.


Acknowledgements

The authors gratefully acknowledge the valuable assistance of Dr. Dave Woodruff, and the helpful comments provided by the reviewers in the preparation of this note.

References

[1] Battiti, R., and Tecchiolli, G., "The reactive tabu search", ORSA Journal on Computing 6/2 (1994) 126-140.

[2] Carlton, W.B., and Barnes, J.W., A Tabu Search Approach to the Travelling Salesman Problem with Time Windows, submitted to Institute of Industrial Engineers Transactions, 1995.

[3] Knuth, D., The Art of Computer Programming, Vol. 3, Addison-Wesley, Reading, MA, 1973.

[4] Woodruff, D.L., and Zemel, E., "Hashing vectors for tabu search", Annals of Operations Research 41 (1993) 123-137.
