chapter 9 tables and information retrieval 表格及信息获取

Chapter 9 Tables And Information Retrieval

表格及信息获取

This chapter continues the study of information retrieval that was started

in chapter7, but now concentrating on tables instead of lists. We begin

with ordinary rectangular arrays, then we consider other kinds of arrays,

and then we generalize to the study of hash tables. One of our major

purposes in this chapter is to analyze and compare various algorithms, to see

which are preferable under different conditions. Applications in the chapter

include a sorting method based on tables and a version of the Life game using a

hash table.

本章继续研究信息获取，从第 7 章开始，我们针对的是 lists ，现在针对的是 tables ，我们首先从矩形表格开始，然后是其它的数组，最后主要讨论哈希表，本章的主要目的是分析、比较各种算法，揭示出在不同情形下哪种算法最合适。本章的应用是基于表格的一种排序策略和使用哈希表的 lifegame 。

Content Points

Rectangular Arrays

Tables of Various Shapes

Tables: A New Abstract Data Type

Application: Radix Sort

Hashing

Analysis of Hashing

Comparison of Methods

Breaking the lgn barrier By use of key comparisons alone, it is impossible to complete a search of n items in fewer than lgn comparisons, on average 。通过关键字比较的查找，平均情况下至少需要 lgn 次比较。 Ordinary table lookup or array access requires only constant time O(1). 普通的表格查找或数组访问时间只需 O(1) 。 Both table lookup and searching share the same essential purpose, that of information retrieval. The key used for searching and the index used for table lookup have the same essential purpose: one piece of information that is used to locate further information.利用某一条信息去查找更多信息。

Breaking the lgn barrier Both table lookup and searching algorithms provide functions from a set of keys or indices to locations in a list or array.

In this chapter we study ways to implement and access various kinds of tables in contiguous storage.

Several steps may be needed to retrieve an entry from some kinds of tables, but the time required remains O(1). It is bounded by a constant that does not depend on the size of the table. Thus table lookup can be more efficient than any searching method.

The entries of the tables will be indexed by sequences of integers,just as array entries are indexed by such sequences.

Breaking the lgn barrier We shall implement abstractly defined tables with arrays. In order to distinguish between the abstract concept and its implementation, we introduce:( 用数组来表示 table)

Example:Thus T(1, 2, 3) denotes the entry of the table T that is indexed by the sequence 1, 2, 3, and A[1][2][3] denotes the correspondingly indexed entry of the C++ array A.( 逻辑表示与存储表示的区分）

Rectangular Tables

Rectangular Tables Row-Major and Column-Major Ordering Indexing Rectangular Tables

Entry (i,j) in a rectangular table goes to position ni+ j in a sequential array.

A formula of this kind, which gives the sequential location of a table entry, is called an index function.（下标函数）

Variation: An Access Array （访问数组）This method is to keep an auxiliary array, a part of the multiplication table for n. The array will contain the values 0, n, 2n, 3n, ……, (m – 1)n

TABLES OF VARIOUS SHAPES （各种特殊矩阵）

Triangular Tables （三角矩阵）

index function:

entry （ i ， j ） of the triangular table corresponds to entry,

1/2*i(i+1)+j

of the contiguous array.

access array,

Jagged Tables（锯齿矩阵）

where there is no predictable relation between the position of a row and its length.

Inverted Tables

Inverted Tables For the names we set up one access array. The first entry in this table is the position where the records of the subscriber whose name is first in alphabetical order are stored. （按姓名中的字母序） In a second access array,the first entry is the location of the subscriber’s records whose telephone number happens to be smallest in numerical order. （按电话号码数字次序） In a third access array the entries give the locations of the records sorted lexicographically by address. （按地址的统计次序）

TABLES: A NEW ABSTRACT DATA TYPE

Functions

函数与表格查找非常相似。根据自变量计算相应的值或是根据下标确定相应的值，


Functions In mathematics a function is defined in terms of two sets and

a correspondence from elements of the first set to elements of the second. If f is a function from a set A to a set B, then assigns to each element of A a unique element of B.

The set A is called the domain of f , and the set B is called the codomain of f .

The subset of B containing just those elements that occur as values of f is called the range of f . Index set, value type( 下标集合、值类型）

Table access begins with an index and uses the table to look up a corresponding value. Hence for a table we call the domain the index set, and we call the codomain the base type or value type.

Example : 1 、 double array[n]; 2 、 a triangular table;


An Abstract Data Type


Implementation

The implementation of a table is divided into two smaller problems: finding an access array or index function and programming an array.


Comparisons （ List and table ）retrieval

The time required for searching a list generally depends on the number n of entries in the list and is at least lgn, but the time for accessing a table does not usually depend on the number of entries in the table; that is, it is usually O(1). For this reason, in many applications, table access is significantly faster than list searching.


Comparisons （ List and table ）traversal

On the other hand, traversal is a natural operation for a list but not for a table. It is generally easy to move through a list performing some operation with every entry in the list. In general, it may not be nearly so easy to perform an operation on every entry in a table, particularly if some special order for the entries is specified in advance. （在 table 中遍历不是它的操作）

Comparisons （ tables and arrays ） In general, we shall use table as we have defined it in this section

and restrict the term array to mean the programming feature available in C++ and most high-level languages and used for implementing both tables and contiguous lists.(array 是高级语言中的概念，通常用于实现表格及表。 )

APPLICATION: RADIX SORT

The IdeaThe idea is to consider the key one character at a time and to divide the entries into as many sublists as there are possibilities for the given character from the key. If our keys, for example, are words or other alphabetic strings, then we divide the list into 26 sublists at each stage. That is, we set up a table of 26 lists and distribute the entries into the lists according to one of the characters in the key.( 每次只考虑关键字中的一个字符，并将所有元素分成多个子表；需要创建一个含有 26 个表的表格，并按关键字中的某个字符将元素分配到表中。）


Example:

APPLICATION: RADIX SORTImplementation

After each time the items have been partitioned into sublists in a table, the sublists must be recombined into a single list so that the items can be redistributed according to the next most significant position in the key. We shall treat the sublists as queues, since entries are always inserted at the end of a sublist and, when recombining the sublists, removed from the beginning. （子表需重新组合，将子表看成为队列）

APPLICATION: RADIX SORT Implementation

APPLICATION: RADIX SORT List definition

template <class Record>class Sortable_list: public List<Record> {public: // sorting methods void radix_sort( ); // Specify any other sorting methods here.private: // auxiliary functions void rethread(Queue queues[]);//used to recombine the queues.};

Record definitionclass Record {public:char key_letter(int position) const;Record( ); //default constructoroperator Key( ) const; // cast to Key// Add other methods and data members for the class.};

APPLICATION: RADIX SORT The Sorting Method

const int max_chars = 28;template <class Record>void Sortable_list<Record> :: radix_sort( )/* Post: The entries of the Sortable_list have been sorted

so all their keys are in alphabetical order.Uses: Methods from classes List, Queue, and Record;functions position and rethread. */{Record data;Queue queues[max_chars];

APPLICATION: RADIX SORT The Sorting Method

for (int position = key_size - 1; position >= 0; position--) {// Loop from the least to the most significant position.

while (remove(0, data) == success) {int queue_number = alphabetic_order(data.key_letter(posit

ion));queues[queue_number].append(data); // Queue operation.

}rethread(queues); // Reassemble the list.}} /* alphabetic_order to determine which Queue corresponds to

a particular character, and Sortable_list :: rethread( ) to recombine the queues as the reordered Sortable_list. */


Selecting a Queue The function alphabetic_order checks whether a particular character is in the alphabet and assigns it to the appropriate position, where all nonalphabetical characters other than blanks go to position 27. Blanks are assigned to position 0. The function is also adjusted to make no distinction between upper- and lowercase.

int alphabetic_order(char c)/* Post: The function returns the alphabetic position of character c, or it returns 0 if the character is blank. */{if (c == ‘ ’ ) return 0;if (‘a’ <= c && c <= ‘z’) return c – ‘a’+ 1;if (‘A’ <= c && c <= ‘Z’) return c – ‘A’ + 1;return 27;}

APPLICATION: RADIX SORT Connecting the Queues template <class Record>

void Sortable_list<Record> :: rethread(Queue queues[ ])/* Post: All the queues are combined back to the Sortable_list, leaving all the queues empty.Uses: Methods of classes List and Queue. */{Record data;for (int i = 0; i < max_chars; i++)

while (!queues[i].empty( )) {queues[i].retrieve(data);insert(size( ), data);queues[i].serve( );}

}

APPLICATION: RADIX SORT Analysis

The time used by radix sort is O(nk) where n is the number of items being sorted and k is the number of characters in a key.

The relative performance of radix sort to other methods will relate to the relative sizes of nk and n lg n; that is, of k and lg n.

If the keys are long but there are relatively few of them, then k is large and lg n relatively small, and other methods (such as mergesort) will outperform radix sort.

If k is small (the keys are short) and there are a large number of keys, then radix sort will be faster than any other method we have studied.

HASHING Sparse Tables( 稀疏表格 )

Index Functions When the key is no longer an index that can be used

directly as in array indexing. What we can do is to set up a one-to-one correspondence between the keys and indices . The index function that we produce will be somewhat more complicated than those of previous sections, but in principle it can still be done. （当关键字不再是一个可以直接引用的下标，仍然可以建立一个关键字和下标之间的一一对应关系。）

The only difficulty arises when the number of possible keys exceeds the amount of space available for our table.

{Zhao, Qian, Sun, Li, Wu, Chen, Han, Ye, Dai}

example：We have nine keys as below:

The record stores at f(key) =

(Ord(first alphabet) -Ord('A')+1)/2

0 1 2 3 4 5 6 7 8 9 10 11 12 13Chen ZhaoQian SunLi WuHan YeDai

question: insert another record “ Zhou”, how to do?

Sparse Tables Hashing Tables

The idea of a hash table is to allow many of the different possible keys to be mapped to the same location in an array under the action of the index function. Then there will be a possibility that two records will want to be in the same place, but if the number of records that actually occur is small relative to the size of the array, then this possibility will cause little loss of time. Even when most entries in the array are occupied, hash methods can be an effective means of information retrieval. （哈希表中允许不同的关键字记录映射到数组的同一个位置上）

Sparse Tables

Sparse Tables Hashing Tables

Start with an array that holds the hash table. （用一个数组来存储哈希表格）

Use a hash function to take a key and map it to some index in the array. This function will generally map several different keys to the same index. （使用哈希函数将关键字映射为数组的下标 , 一个下标可能会对应多个关键字）

If the desired record is in the location given by the index,then we are finished; otherwise we must use some method to resolve the collision that may have occurred between two records wanting to go to the same location. （如果被查找记录就在对应下标位置，否则根据如何解决冲突的方法确定位置）

To use hashing we must

(a) Find good hash functions （找到一个好的哈希函数）

(b) Determine how to resolve collisions. （解决冲突的方法）

summary：Hash function:记录的存储位置与记录的关键字之间的对应关系函数。

Hash table:根据 hash function 和解决冲突的方法所构造出的记录存储位置表。一般采用顺序数组结构进行存储。

Hash 法查找和前面介绍过的通过关键字进行查找方法的不同：

不是建立在关键字比较的基础上的查找。一种表格查找，在理想状态下能通过 hash 函数的

计算求得记录的下标值进行直接查找。在有冲突的一般情形下，只要根据冲突解决办法从

冲突位置开始进行探测。

Choosing a Hash Function A hash function should be easy and quick to compute.（简单、计算快速） A hash function should achieve an even distribution of the keys that actually occur across the range of indices.（均匀） The usual way to make a hash function is to take the key, chop it up, mix the pieces together in various ways, and thereby obtain an index that will be uniformly distributed over the range of indices. （对关键字进行分割，混合等多种运算） Note that there is nothing random about a hash function. If the function is evaluated more than once on the same key,then it must give the same result every time, so the key can be retrieved without fail. （哈希函数对于某一个关键字，每次计算都应得到相同的结果。）

Choosing a Hash Function Truncation: Sometimes we ignore part of the key, and use the remaining part as the index. （截断） Folding: We may partition the key into several parts and combine the parts in a convenient way. （折叠） Modular arithmetic: We may convert the key to an integer,divide by the size of the index range, and take the remainder as the result. （模运算） A better spread of keys is often obtained by taking the size of the table (the index range) to be a prime number. （质数模）

Choosing a Hash Function C++ Example

As a simple example, let us write a hash function in C++ for transforming a key consisting of eight alphanumeric characters into an integer in the range 0 . . hash_size-1.

Therefore, we shall assume that we have a class Key with the methods and functions of the following definition:

class Key: public String{

public:

char key_letter(int position) const;

void make_blank( );

// Add constructors and other methods.

};

Choosing a Hash Function We inherit the methods of the class String from Chapter 6. The method key_letter(int position) must return the character in a particular position of the Key, or return a blank if the Key has length less than n. The final method make_blank sets up an empty Key.

int hash(const Key &target)/* Post: target has been hashed, returning a value between 0 and hash_ size -1 .Uses: Methods for the class Key . */{int value = 0;for (int position = 0; position < 8; position++)

value = 4 * value + target.key_letter(position);return value%hash_size;}

Collision Resolution with Open Addressing （开放定址法解决冲突）

Linear Probing （线性探测法） Linear probing starts with the hash address and searches sequ

entially for the target key or an empty position. The array should be considered circular, so that when the last location is reached, the search proceeds to the first location of the array. （向后顺序查找一个空位置，当后面没有空位时，从开始位置向后探测）

Clustering （聚集） The major drawback of linear probing is that, as the table beco

mes about half full,there is a tendency toward clustering; that is, records start to appear in long strings of adjacent positions with gaps between the strings. Thus the sequential searches needed to find an empty position become longer and longer.

Collision Resolution with Open Addressing

avoid clustering

If we are to avoid the problem of clustering, then we must use some more sophisticated way to select the sequence of locations to check when a collision occurs.（使用更好的方法解决聚集问题）

There are many ways to do so. One, called rehashing, uses a second hash function to obtain the second position to consider. If this position is filled, then some other rehashing method is needed to get the third position, and so on. （再哈希法）


Increment Functions （增量函数）

We will do just as well to find a more sophisticated way of determining the distance to move from the first hash position and apply this method, whatever the first hash location is. Hence we wish to design an increment function that can depend on the key or on the number of probes already made and that will avoid clustering. （比较好的方法是：确定从原哈希函数确定的位置的相对距离，因此我们设计一个增量函数，这个函数可能与关键字或已经进行的探测次数有关。）


Quadratic Probing （二次探测）

Other methods

Key-dependent increments

Random probing

这时，增量函数为： i2 (i=1,2,3,……)

Example: we have the records as below

{ 19, 01, 23, 14, 55, 68, 11, 82, 36 }

The hash function is H(key) = key MOD 11And the table has length 11

0 1 2 3 4 5 6 7 8 9 10

0 1 2 3 4 5 6 7 8 9 10

1901 23 1455 68

1901 23 14 68

Linear probing

Quadratic probing

11 82 36

55 11 82 36

1 1 2 1 3 6 2 5 1

C++ Algorithmsconst int hash_ size = 997; // a prime number of appropriate size

class Hash_table {

public:

Hash_table( );

void clear( );

Error_code insert(const Record &new entry);

Error_code retrieve(const Key &target, Record &found) const;

private:

Record table[hash_ size];

};



Hash Table InsertionError_code Hash_table :: insert(const Record &new_entry)

/* Post: If the Hash table is full, a code of overflow is returned. If the table already contains an item with the key of new entry a code of duplicate error is returned. Otherwise:The Record new entry is inserted into the Hash table and success is returned.

Uses: Methods for classes Key , and Record . The function hash . */

{//采用二次探测解决冲突Error_code result = success;

int probe_count, //Counter to be sure that table is not full.

increment, //Increment used for quadratic probing.此处 increment指的是相邻两次探测的距离，与前面增量函数的含义不同。probe; //Position currently probed in the hash table.

Key null; //Null key for comparison purposes.

null.make_blank( );

probe = hash(new_entry);

Hash Table Insertionprobe_count = 0;increment = 1;while (table[probe] != null // Is the location empty? && table[probe] != new_entry // Duplicate key? && probe_ count < (hash_size + 1)/2) { // Has overflow occurred?

probe_ count++;probe = (probe + increment)%hash_size;increment += 2; // Prepare increment for next iteration.

}if (table[probe] == null) table[probe] = new_entry;// Insert new_entry.else if (table[probe] == new_entry) result = duplicate_error;else result = overflow; // The table is full.return result;}


Probe h h+1 h+4 h+9 h+16 h+25

Increment 1 3 5 7 9

Collision Resolution by Chaining

Collision Resolution by Chaining

Advantages and Disadvantage of chaining

The linked lists from the hash table are called chains.

If the records are large, a chained hash table can save space.

Collision resolution with chaining is simple; clustering is no problem.

The hash table itself can be smaller than the number of records;overflow is no problem.

Deletion is quick and easy in a chained hash table.

If the records are very small and the table nearly full, chaining may take more space.

Collision Resolution by Chaining Code for Chained Hash Tables

Definitionclass Hash table {public:// Specify methods here.private:List<Record> table[hash size];};

The class List can be any one of the generic linked implementations of a list studied in Chapter 6. Constructor

The implementation of the constructor simply calls the constructor for each list in the array. Clear

To clear the table, we must clear the linked list in each of the table positions, using the List method clear( ).

Collision Resolution by Chaining Code for Chained Hash Tables

Retrieval

sequential search(table[hash(target)], target, position);

Insertion

table[hash(new entry)].insert(0, new entry);

Deletion

总结：解决冲突的方法：

1 、开放定址法线性探测法二次探测法随机探测法增量函数与关键字有关

2 、再哈希法3 、链接表（链地址法）4 、建立一个公共溢出区（参见习题 p409 E7)

ANALYSIS OF HASHING The Birthday Surprise

The probability that m people all have different birthdays is

This expression becomes less than 0.5 whenever m 23.

结论 :随机挑选 23 个人，就可能就 2 个人的生日是一样的。

ANALYSIS OF HASHINGFor hashing, the birthday surprise says that for any problem of reasonable size, collisions will almost certainly occur. A probe is one comparison of a key with the target. （定义一次探测为与目标记录的一次比较） The load factor of the table is = n/t , where n positions are occupied out of a total of t positions in the table. （装载因子）

ANALYSIS OF HASHING

CONCLUSIONS: COMPARISON OF METHODS

We have studied four principal methods of information retrieval,the first two for lists and the second two for tables. Often we can choose either lists or tables for our data structures. Sequential search is O(n)

Sequential search is the most flexible method. The data may be stored in any order, with either contiguous or linked representation.

Binary search is O(log n)Binary search demands more, but is faster: The keys must be in order, and the data must be in random-access representation (contiguous storage).

CONCLUSIONS: COMPARISON OF METHODS

Table lookup is O(1)Ordinary lookup in contiguous tables is best, both in speed and convenience, unless a list is preferred, or the set of keys is sparse, or insertions or deletions are frequent.

Hash-table retrieval is O(1)Hashing requires the most structure, a peculiar ordering of the keys well suited to retrieval from the hash table, but generally useless for any other purpose. If the data are to be available for human inspection, then some kind of order is needed, and a hash table is inappropriate.

Pointers and Pitfalls 1. Use top-down design for your data structures, just as you

do for your algorithms. 2. Before considering detailed structures, decide what

operations on the data will be required. 3. For the design and programming of lists, see Chapter 6. 4. Use the logical structure of the data to decide what kind

of table to use. 5. Let the structure of the data help you decide whether an

index function or an access table is better for accessing a table of data.

6. In using a hash table, let the nature of the data and the required operations help you decide between chaining and open addressing.

chapter 9 tables and information retrieval 表格及信息获取

Documents