Information Retrieval ETH Zürich, Fall 2012 Thomas Hofmann
LECTURE 2 INDEXING 26.09.2012 Information Retrieval, ETHZ 2012 1
Today’s Overview
1. Introduction
2. Dictionaries
3. Index Construction
4. Distributed Indexing
5. Multiple Query Terms
6. Advanced Posting List Intersection
7. Web-scale Index Serving
Class from 9:15-10:45 (no break), 11-12: Exercise
INTRODUCTION
Basic Index: Challenge
Design solution to a simple lookup problem:
Efficiently identify documents containing a given term t
“Efficiently” = do this in time O(# documents returned)
Use a data structure to be constructed off-line (@ indexing time) in order to avoid linear scanning (@ query time).
Trade pre-processing costs & index space (memory, disk) for better response time & query throughput.
Any data structure for storing a set of records could be used. Here: focus on arrays & linked lists = posting lists.
Pre-Computer Age: Book Index
Book indexes
Record pages mentioning (e.g.) keywords and names
Goes back to the age of printed books (15th century)
Posting Lists
[Figure: the posting list for the term “ETHZ” is an array or linked list of docIDs: docID_1, docID_2, docID_3, docID_4, …]
docID_1 = docID(“swissinfo.ch/…”)
docID_2 = docID(“wikipedia.org/wiki/ETH_Zurich”)
docID_3 = docID(“www.systems.ethz.ch/…”)
docID_4 = docID(“www.ethz.ch”)
DICTIONARIES
Basic Index: Dictionary
For each admissible (i.e. single term) query we need to find the corresponding posting list (if it exists, else NULL)
We need an efficient data structure for term look-up, i.e. a dictionary
Preferred solution: Hash table
§ Hash function
§ Mechanism for dealing with collisions: e.g. linked list
§ O(1) expected access for “good” hash functions and a large enough table size n
§ Standard implementations: resize (re-hash) when the load factor exceeds 0.75
¤ Büttcher, S., Clarke, C. L. A., and Cormack, G. V.: Information Retrieval: Implementing and Evaluating Search Engines, Section 4.2, 2010.
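The dictionary described above can be sketched as a small hash table with collision lists. This is a minimal illustration, not a production design: the class name `TermDictionary`, the initial slot count, the doubling strategy and the use of Python's built-in `hash` are all assumptions made for the example.

```python
class TermDictionary:
    """Hash table with collision lists that maps a term to its posting list."""

    def __init__(self, num_slots=8, max_load=0.75):
        self.slots = [[] for _ in range(num_slots)]  # one collision list per slot
        self.size = 0                                # number of stored terms
        self.max_load = max_load                     # resize threshold (load factor)

    def _slot(self, term):
        return hash(term) % len(self.slots)

    def lookup(self, term):
        """Return the posting list for term, or None if the term is unknown."""
        for key, postings in self.slots[self._slot(term)]:
            if key == term:
                return postings
        return None

    def insert(self, term, postings):
        chain = self.slots[self._slot(term)]
        for i, (key, _) in enumerate(chain):
            if key == term:
                chain[i] = (term, postings)          # overwrite an existing entry
                return
        chain.append((term, postings))
        self.size += 1
        if self.size / len(self.slots) > self.max_load:
            self._resize()

    def _resize(self):
        """Double the number of slots and re-hash every entry."""
        old = self.slots
        self.slots = [[] for _ in range(2 * len(old))]
        for chain in old:
            for term, postings in chain:
                self.slots[self._slot(term)].append((term, postings))
```

A lookup for an unknown term returns `None`, which plays the role of NULL for inadmissible queries.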
Dictionary Hash Table
[Figure: hash table dictionary. The terms (class, ETHZ, mountain, weather, …) are hashed (h) to table slots 0, 1, 2, …, r, r+1, …, n. Each slot points to a collision list whose entries have the form <token> <posting list address>, e.g. mountain 549283471, ETHZ 398437231, class 234443989, weather 770209991.]
INDEX CONSTRUCTION
Basic Index: Generation
Construct all posting lists in one pass over the document collection.
INDEXGENERATION(C)
  for all documents d in collection C
    for all terms t occurring in d
      if not EXISTS(posting_list(t))
        then CREATE(posting_list(t))
             ADD(posting_list(t),d)
      else if not CONTAINS(posting_list(t),d)
        then ADD(posting_list(t),d)
  return posting_list
Note: indexing terms (=vocabulary) can be identified on the fly. Dictionary construction can happen in parallel.
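As a rough illustration, the one-pass generation can be written in a few lines of Python. The sketch assumes that the collection is a dict from doc-id to text and that terms are obtained by lowercasing and splitting on whitespace; both are simplifications chosen for the example.

```python
from collections import defaultdict

def index_generation(collection):
    """collection: dict mapping doc-id -> document text (hypothetical input format)."""
    posting_lists = defaultdict(list)
    for doc_id in sorted(collection):              # visit documents in doc-id order
        for term in collection[doc_id].lower().split():
            postings = posting_lists[term]         # created on first sight of the term
            # Documents are processed one at a time, so checking the last
            # entry is enough to test whether doc_id is already in the list.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return dict(posting_lists)
```

Because documents are visited in ascending doc-id order, every posting list comes out sorted without an explicit sort.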
Index Construction
Conceptually: 3 steps
1. Make a pass through the collection and assemble all postings, i.e. pairs (term, doc-id) or (term-id, doc-id)
2. Sort the postings using the term(-id) as the primary and the doc-id as the secondary key
3. Organize doc-ids into posting lists for each term
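The three steps map almost one to one onto code. The following is a small in-memory sketch; the function name and the whitespace tokenization are assumptions, and a set is used in step 1 so that repeated occurrences of a term in one document yield a single posting.

```python
from itertools import groupby

def sort_based_index(collection):
    """collection: dict mapping doc-id -> document text (hypothetical input format)."""
    # Step 1: assemble all (term, doc-id) postings
    postings = {(term, doc_id)
                for doc_id, text in collection.items()
                for term in text.lower().split()}
    # Step 2: sort by term (primary key), then by doc-id (secondary key)
    ordered = sorted(postings)
    # Step 3: organize the doc-ids into one posting list per term
    return {term: [doc_id for _, doc_id in group]
            for term, group in groupby(ordered, key=lambda posting: posting[0])}
```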
Scalable Index Construction
In-memory index construction does not scale. How can we construct an index for very large collections, taking into account the hardware constraints on memory, disk, speed, etc.?
Sort-Based Index Construction
As we build the index, we parse documents one at a time. The final postings list for any term is potentially incomplete until the end.
At 10–12 bytes per postings entry, this demands a lot of space for large collections.
For large document collections, we need to store intermediate results on disk.
Blocked Sort-Based Indexing (BSBI)
12-byte (4+4+4) postings (term-id, doc-id, term frequency in the document)
Must now sort many billions of postings by term-id. Define a block to consist of (say) 10M such postings. We can easily fit that many postings into memory (10M × 12 bytes = 120 MB).
Basic idea of algorithm:
Accumulate postings for each block, sort, write to disk. Then merge the blocks into one long sorted order.
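The basic idea can be sketched as follows. This is a toy version under stated assumptions: postings are (term-id, doc-id) tuples, blocks are serialized with `pickle` into temporary files, and the final merge uses `heapq.merge`, which reads the sorted blocks lazily. The helper names `write_block`, `read_block` and `bsbi` are made up for the example.

```python
import heapq
import itertools
import pickle
import tempfile

def write_block(postings):
    """Sort one block in memory and write it to a temporary file on disk."""
    f = tempfile.TemporaryFile()
    for posting in sorted(postings):
        pickle.dump(posting, f)
    f.seek(0)                      # rewind so the block can be read back later
    return f

def read_block(f):
    """Stream the postings of a sorted block back from disk, one at a time."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return

def bsbi(posting_stream, block_size):
    """posting_stream yields (term_id, doc_id) pairs; returns a merged, sorted iterator."""
    stream = iter(posting_stream)
    blocks = []
    while True:
        block = list(itertools.islice(stream, block_size))  # at most block_size postings
        if not block:
            break
        blocks.append(write_block(block))
    return heapq.merge(*(read_block(f) for f in blocks))
```

Only one block of postings plus one posting per block during the merge has to be held in memory at any time.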
BSBI Index Construction
BSBI: Merging Blocks
Problems with Sort-Based Algorithm
Our assumption was: we can keep the dictionary in memory.
We need the dictionary (which grows dynamically) in order to implement a term to term-id mapping. Actually, we could work with (term, doc-id) postings instead of (term-id, doc-id) postings . . .
. . . but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)
. . . term fingerprinting is an alternative, but inexact: distinct terms can collide on the same fingerprint.
Single Pass in Memory Indexing
Abbreviation: SPIMI
Key idea #1: Generate separate dictionaries for each block; no need to maintain a term to term-id mapping across blocks.
Key idea #2: Don’t sort. Accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block.
These separate indexes can then be merged into one big index.
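A compact sketch of both key ideas is given below. It keeps the blocks in memory for brevity, whereas a real implementation would write each block index to disk before merging. The function names, the (doc-id, text) input format and the whitespace tokenization are assumptions for the example.

```python
import heapq
from collections import defaultdict
from itertools import groupby

def spimi_invert(block_docs):
    """Build a complete inverted index for one block, with a block-local dictionary."""
    index = defaultdict(list)
    for doc_id, text in block_docs:
        for term in text.lower().split():
            postings = index[term]             # terms are used directly, no term-ids
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)        # accumulate postings as they occur
    return sorted(index.items())               # only the terms are sorted, once per block

def merge_blocks(blocks):
    """Merge the per-block indexes (lists of (term, postings)) into one index."""
    merged = {}
    by_term = heapq.merge(*blocks, key=lambda entry: entry[0])
    for term, group in groupby(by_term, key=lambda entry: entry[0]):
        merged[term] = [doc_id for _, postings in group for doc_id in postings]
    return merged
```

If the blocks are passed in doc-id order, the merged posting lists stay sorted, because `heapq.merge` keeps entries with equal keys in the order of their input blocks.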
DISTRIBUTED INDEXING
Distributed Index Generation
For web-scale indexing: must use a distributed computer cluster
Individual machines are fault-prone and may unpredictably slow down or fail
How do we exploit such a pool of machines?
Master Coordination
Maintain a master machine directing the indexing job – considered “safe”
Break up indexing into sets of parallel tasks
Master machine assigns each task to an idle machine from a pool.
Parallel Tasks
We will define two sets of parallel tasks and deploy two types of machines to solve them:
§ Parsers
§ Inverters
Break the input document collection into splits (corresponding to blocks in BSBI/SPIMI)
Each split is a subset of documents.
Parsers
Master assigns a split to an idle parser machine. Parser reads a document at a time and emits (term, doc) pairs.
Parser writes pairs into j term-partitions, each covering a range of terms’ first letters
E.g., a-f, g-p, q-z (here: j = 3)
Inverters
An inverter collects all (term, doc) pairs (= postings) for one term-partition.
It sorts the pairs and writes them to postings lists.
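The division of labour between parsers and inverters can be simulated in a single process. The sketch below is only meant to show the data flow: the function names are invented, the master's task assignment is left out, and sending terms with other first characters (digits, umlauts) to the last partition is an assumption.

```python
from collections import defaultdict

PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]   # first-letter ranges, j = 3

def partition_of(term):
    for j, (low, high) in enumerate(PARTITIONS):
        if low <= term[0] <= high:
            return j
    return len(PARTITIONS) - 1     # assumption: everything else goes to the last partition

def parse(split):
    """Parser: read one split and emit (term, doc) pairs into j term-partitions."""
    segments = [[] for _ in PARTITIONS]
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments

def invert(pairs):
    """Inverter: sort the pairs of one term-partition and build postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)
```

With two splits, each inverter receives the matching segment from every parser:

```python
splits = [[(1, "ETHZ systems")], [(2, "mountain weather ETHZ")]]
parsed = [parse(split) for split in splits]
index_parts = [invert([pair for segments in parsed for pair in segments[j]])
               for j in range(len(PARTITIONS))]
```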
Data Flow
MapReduce
The index construction algorithm we just described is an instance of MapReduce.
MapReduce is a robust and conceptually simple framework for distributed computing . . . without having to write code for the distribution part.
An open source implementation is called Hadoop.
Hadoop is a key tool for big data. See lecture 3 of Donald Kossmann’s class.
¤ J. Dean & S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. 6th Symposium on Operating System Design & Implementation, 2004.
MULTIPLE QUERY TERMS
Basic Index: Modified Challenge
Deal with multiple terms:
§ Efficiently identify documents containing a given set of terms t1,…,tk.
§ This is also known as Boolean retrieval with “AND”.
In which way do we need to generalize the
• index data structures
• index generation, and
• query processing?
Multi-Term Posting Lists
Challenge: #terms may be large: O(billions), but #sets of terms grows exponentially in the set size
In practice some sets (or n-grams) of terms may be used frequently (“mountain bike trails”), but most term combinations will never be observed.
Idea #1: Multi-term posting lists
§ Identify frequent k-term combinations (from documents or query logs, k=2 or k=3). Create posting lists for those.
§ Advantage: popular k-term queries can be answered as fast as one-term queries
Intersecting Posting Lists
Idea #2: Traverse multiple posting lists in parallel to compute intersection.
In order to be effective (for posting lists of roughly equal length), the posting lists must be sorted: sort the entries in each list using the same total order (e.g. ascending documentID).
Basic method:
§ Always advance in the posting list with the smallest current element.
§ Check for documents contained in all lists
Intersecting Posting Lists: Example
ETHZ: 370871 391223 623920 … 789908
systems: 177883 300012 391223 391248 …
information retrieval: 391223 800123 990002 991226 …
[Animation over several slides: unsorted postings (e.g. 370871 927382 391223 623920 …) are first sorted; the pointers then advance through the three sorted lists in parallel until all of them point to the same docID.]
add docID 391223 to result set
Intersection Algorithm
For simplicity, we focus on the case of two posting lists
Multiple terms can be handled by generalizing to k posting lists
… or by creating temporary intermediate posting lists and recursion
Optimization: start with shorter posting lists
INTERSECT(p1, p2)
  answer := < >
  while (p1 != NULL) AND (p2 != NULL)
    if docID(p1) == docID(p2) then
      ADD(answer, docID(p1))
      ADVANCE(p1)
      ADVANCE(p2)
    else if docID(p1) < docID(p2)
      ADVANCE(p1)
    else
      ADVANCE(p2)
  return answer
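A direct Python transcription of INTERSECT, using list indices instead of pointers, might look like this. The second function, `intersect_many`, is an assumed helper that handles k lists through intermediate results and starts with the shortest lists, as suggested above.

```python
def intersect(p1, p2):
    """Intersect two posting lists sorted by ascending docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_many(posting_lists):
    """Intersect k >= 1 lists pairwise via intermediate results, shortest lists first."""
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:
            break                  # the intersection is already empty
        result = intersect(result, postings)
    return result
```

Starting with the shortest lists keeps the intermediate results small, since an intersection can never be longer than its shortest input.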
Intersecting Posting Lists
How expensive is the parallel intersection of k posting lists?
Number of pointer advances: at most |L_1| + … + |L_k|, since every step advances at least one pointer
Reasonable efficiency, if posting lists are approximately of the same length.
Access time dominated by longest posting list. Can we also devise a method that is dominated by the shortest?
ADVANCED POSTING LIST INTERSECTION
Alternative Posting List Intersection
Naïve approach when |L_1| << |L_2|
§ Build a hash map dictionary of docIDs for L_1 and L_2
§ Look up the elements of L_1 in the dictionary for L_2
§ O(|L_1|) expected time, given that the dictionaries are built ahead of time
Only works well in highly asymmetric case. Can we do better?
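The naïve lookup approach fits in one line of Python. The sketch assumes that the docIDs of the long list are already available as a hash set (built at indexing time); the function and parameter names are made up for the example.

```python
def intersect_by_lookup(short_list, long_list_ids):
    """long_list_ids: hash set of the docIDs in the long list, assumed to be pre-built."""
    # One expected O(1) membership test per element of the short list.
    return [doc_id for doc_id in short_list if doc_id in long_list_ids]
```

The running time depends only on the length of the short list, which is why the approach pays off only in the highly asymmetric case.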
Alternative Posting List Intersection: Refinement
§ Compute hashed sets h(L1) and h(L2)
§ Bucketed bit set representation of set of hash values
§ Fast intersection in bit set representation
§ Exact intersection
§ #bits in h: small enough to allow for fast intersection; large enough to make L’1 and L’2 small.
¤ P. Bille, A. Pagh, and R. Pagh. Fast Evaluation of Union-Intersection Expressions. In ISAAC, pages 739–750, 2007.
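One possible reading of the refinement is sketched below; it is not the algorithm of the cited paper. The sketch assumes a toy hash function (docID modulo the number of bits), uses a Python integer as the bit set of hash values, and interprets the reduced lists as the docIDs whose hash value survives the bit set intersection. All names are hypothetical.

```python
def hashed_bitset(postings, num_bits):
    """Return (bit set of hash values as an int, buckets: hash value -> docIDs)."""
    bits, buckets = 0, {}
    for doc_id in postings:
        h = doc_id % num_bits              # assumption: a toy hash function
        bits |= 1 << h
        buckets.setdefault(h, []).append(doc_id)
    return bits, buckets

def intersect_hashed(l1, l2, num_bits=64):
    bits1, buckets1 = hashed_bitset(l1, num_bits)
    bits2, buckets2 = hashed_bitset(l2, num_bits)
    common = bits1 & bits2                 # fast intersection of the two bit sets
    result = []
    h = 0
    while common:
        if common & 1:
            # Exact intersection, restricted to docIDs that share hash value h
            result.extend(set(buckets1[h]) & set(buckets2[h]))
        common >>= 1
        h += 1
    return sorted(result)
```

In this sketch the bit sets and buckets are rebuilt on every call; in an index they would be computed once per posting list.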
Posting Lists with Skip Pointers
Other ways to speed up list based intersection: introduce skip pointers
Traverse skip pointers instead of next element pointer, if whole segment can be skipped.
Where to put skip pointers? Heuristic: for a posting list of length L, use about √L evenly spaced skip pointers
Trade-offs:
(1) space and I/O (!) requirements for skip pointers vs. not
(2) additional comparisons with skip pointers vs. skip gains
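The following sketch shows how skip pointers change the intersection loop. It makes several assumptions: posting lists are Python lists of distinct, ascending docIDs, skip pointers are implicit (every position that is a multiple of the skip length points one skip length ahead), and the skip length follows the square root heuristic. The names `advance` and `intersect_with_skips` are invented for the example.

```python
import math

def advance(postings, pos, skip, target):
    """Move past postings[pos] (which is < target), following a skip pointer if safe."""
    if pos % skip == 0 and pos + skip < len(postings) and postings[pos + skip] <= target:
        return pos + skip          # every skipped entry is smaller than target
    return pos + 1

def intersect_with_skips(p1, p2):
    """Intersect two sorted posting lists with implicit sqrt-spaced skip pointers."""
    skip1 = max(1, math.isqrt(len(p1)))
    skip2 = max(1, math.isqrt(len(p2)))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i = advance(p1, i, skip1, p2[j])
        else:
            j = advance(p2, j, skip2, p1[i])
    return answer
```

Each failed skip test costs one extra comparison, which is the second trade-off listed above.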
Use of Skip Pointers: Example
Both lists have reached 8. The next element in the top list is 41, and we can advance to it. In the bottom list, however, we can skip over a whole block and move past 31, skipping 4 elements.
WEB SCALE INDEX SERVING
Disk vs. RAM
When building a scalable (i.e. Web scale) index, one key design question is whether to use disk or RAM (today also: SSD).
§ RAM ~200x more expensive than disk
§ Disk ~10-20x slower to access
§ Additional overhead for random access = disk seeks
Hardware economics also influence system architecture.
Distributed Index: Sharding
A related problem for large indexes is how to split up the index into pieces or shards.
Relevant performance dimensions are response time or latency (how long does it take to compute a response?) as well as throughput (how many queries/s can be answered?). In addition fault tolerance may be an issue.
There are two basic ways of sharding: document sharding or term sharding.
Document sharding: each shard contains short posting lists (for a subset of documents).
Term sharding: each shard contains few posting lists.
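The difference between the two schemes is easy to see on a toy index. In this sketch, documents are assigned to shards by docID modulo the number of shards and terms are assigned round-robin in alphabetical order; both assignment rules are assumptions chosen for brevity.

```python
def document_shards(index, num_shards):
    """Each shard holds all terms, but only the postings of its own documents."""
    shards = [{} for _ in range(num_shards)]
    for term, postings in index.items():
        for doc_id in postings:
            shards[doc_id % num_shards].setdefault(term, []).append(doc_id)
    return shards

def term_shards(index, num_shards):
    """Each shard holds the complete posting lists of a subset of the terms."""
    shards = [{} for _ in range(num_shards)]
    for i, term in enumerate(sorted(index)):
        shards[i % num_shards][term] = index[term]
    return shards
```

With document sharding, a query for "ethz" must be sent to every shard and the partial results are concatenated; with term sharding, a single shard holds the whole posting list.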
Document Sharding
Term Sharding
Document Sharding - Pros & Cons
Pros
§ each shard can process queries independently
§ easy to keep additional per-doc information
§ network traffic (requests/responses) small
Cons
§ query has to be processed by each shard
§ O(K*N) disk seeks for K word query on N shards
Term Sharding - Pros & Cons
Pros
§ K word query => handled by at most K shards
§ O(K) disk seeks for K word query
Cons
§ much higher network bandwidth needed
§ data about each term for each matching doc must be collected in one place
§ harder to have per-doc information
Document sharding is the “standard” approach in Web search.
Basic Design Principles
Document Keying
§ Documents assigned small integer ids (docids)
§ Smaller ids for higher quality/more important docs: allows for approximation/cut-offs
Index Servers
§ Given (query) return sorted list of (score, docid, ...)
§ Partitioned (“sharded”) by docid
§ Index shards are replicated for capacity
§ Cost is O(# queries * # docs in index)
Web Search Serving System (Google @ year ~2000)
Caching
Cache servers
§ Cache both index results and doc snippets
§ Hit rates typically 30-60%
• Depends on frequency of index updates, query traffic, level of personalization, etc.
Main benefits
§ Performance! 10s of machines do work of 100(0)s
§ Reduce query latency on hits
§ Cache served queries are typically popular and often expensive
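To make the role of a cache server concrete, here is a minimal query result cache. The slides do not say which eviction policy is used; least recently used (LRU) eviction, the class name `QueryCache` and the `backend` callback standing in for the index servers are all assumptions.

```python
from collections import OrderedDict

class QueryCache:
    """Query result cache with LRU eviction in front of an expensive backend."""

    def __init__(self, capacity, backend):
        self.capacity = capacity
        self.backend = backend               # function: query -> results
        self.entries = OrderedDict()         # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def search(self, query):
        if query in self.entries:
            self.hits += 1
            self.entries.move_to_end(query)  # mark as most recently used
            return self.entries[query]
        self.misses += 1
        results = self.backend(query)        # only misses reach the index servers
        self.entries[query] = results
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False) # evict the least recently used entry
        return results
```

The hit rate is `hits / (hits + misses)`; a cache like this would also have to be invalidated whenever the index is updated.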
Dealing with Growth
More web pages: more shards
More queries: more replicas
From Document Sharding to In-memory Index
Must add shards to keep response time low as index size increases
... but query cost increases with # of shards
§ typically >= 1 disk seek / shard / query term
§ even for very rare terms
As # of replicas increases, total amount of memory available increases
Eventually, have enough memory to hold an entire copy of the index in memory
Radically changes many design parameters
In Memory Index (a la Google)
Anecdote from the Life of a Search Engine 1999 :-)
Index updates (~once per month)
§ Wait until traffic is low
§ Take some replicas offline
§ Copy new index to these replicas
§ Start new frontends pointing at updated index
Disk-optimized update scheme