
Information Retrieval ETH Zürich, Fall 2012 Thomas Hofmann

LECTURE 2 INDEXING 26.09.2012 Information Retrieval, ETHZ 2012 1

Today’s Overview

1.  Introduction
2.  Dictionaries
3.  Index Construction
4.  Distributed Indexing
5.  Multiple Query Terms
6.  Advanced Posting List Intersection
7.  Web-scale Index Serving

Class from 9:15-10:45 (no break), 11-12: Exercise


INTRODUCTION


Basic Index: Challenge

Design a solution to a simple lookup problem:

Efficiently identify documents containing a given term t

“Efficiently” = do this in time O(# documents returned)

Use a data structure to be constructed off-line (@ indexing time) in order to avoid linear scanning (@ query time).

Tradeoff response time & query throughput for pre-processing costs & index space (memory, disk).

Any data structure for storing a set of records could be used. Here: focus on arrays & linked lists = posting lists.


Pre-Computer Age: Book Index

Book indexes:

§  Record the pages mentioning (e.g.) keywords and names

§  Go back to the age of printed books (15th century)


Posting Lists

Posting list for the term “ETHZ” (array or linked list): docID_1 docID_2 docID_3 docID_4 …

docID_1 = docID(“swissinfo.ch/…”)
docID_2 = docID(“wikipedia.org/wiki/ETH_Zurich”)
docID_3 = docID(“www.systems.ethz.ch/…”)
docID_4 = docID(“www.ethz.ch”)

DICTIONARIES


Basic Index: Dictionary

For each admissible (i.e. single term) query we need to find the corresponding posting list (if it exists, else NULL)

We need an efficient data structure for term look-up, i.e. a dictionary.

Preferred solution: Hash table

§  Hash function

§  Mechanism for dealing with collisions: e.g. linked list

§  O(1) access for “good” hash functions and large enough n

§  Standard implementations: resize (re-hash) the table when the load factor exceeds 0.75
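The following is a minimal sketch of such a dictionary, assuming chaining with Python lists as collision lists and a doubling strategy on resize. The class name `TermDictionary` and its methods are hypothetical; the posting list addresses are the example values from the slide figure.

```python
class TermDictionary:
    """Hash table with chaining: maps a term to its posting-list address."""

    def __init__(self, num_buckets=8, max_load=0.75):
        self.buckets = [[] for _ in range(num_buckets)]  # one collision list per slot
        self.size = 0
        self.max_load = max_load

    def _slot(self, term):
        # Python's built-in hash() stands in for the hash function h.
        return hash(term) % len(self.buckets)

    def put(self, term, address):
        bucket = self.buckets[self._slot(term)]
        for i, (t, _) in enumerate(bucket):
            if t == term:                      # term already present: overwrite
                bucket[i] = (term, address)
                return
        bucket.append((term, address))
        self.size += 1
        if self.size / len(self.buckets) > self.max_load:
            self._resize()

    def get(self, term):
        """Return the posting-list address, or None if the term is unknown."""
        for t, address in self.buckets[self._slot(term)]:
            if t == term:
                return address
        return None

    def _resize(self):
        # Double the table and re-insert every entry (assumed growth policy).
        entries = [e for bucket in self.buckets for e in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for term, address in entries:
            self.buckets[self._slot(term)].append((term, address))


d = TermDictionary(num_buckets=2)
for term, address in [("mountain", 549283471), ("ETHZ", 398437231),
                      ("class", 234443989), ("weather", 770209991)]:
    d.put(term, address)
```

Look-ups stay O(1) on average as long as the collision lists remain short, which is what the load-factor threshold is meant to guarantee.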


¤ Büttcher, S., Clarke Ch. L. A., and Cormack, G. V.: Information Retrieval. Implementing and Evaluating Search Engines, Section 4.2, 2010.

Dictionary Hash Table

[Figure: terms (class, ETHZ, mountain, weather, …) are mapped by the hash function h to table slots 0, 1, 2, …, r, r+1, …, n. Each slot points to a collision list of entries <token> <posting list address>, e.g. mountain 549283471, ETHZ 398437231, class 234443989, weather 770209991.]

INDEX CONSTRUCTION


Basic Index: Generation

Construct all posting lists in one pass over the document collection.

INDEXGENERATION(C)
  for all documents d in collection C
    for all terms t occurring in d
      if not EXISTS(posting_list(t))
        then CREATE(posting_list(t))
             ADD(posting_list(t), d)
      else if not CONTAINS(posting_list(t), d)
        then ADD(posting_list(t), d)
  return posting_list

Note: indexing terms (=vocabulary) can be identified on the fly. Dictionary construction can happen in parallel.
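A minimal Python sketch of INDEXGENERATION, assuming the collection is a sequence of (doc_id, text) pairs and that terms are obtained by lower-casing and splitting on whitespace (both are assumptions made for illustration, not part of the slide).

```python
def index_generation(collection):
    """Build all posting lists in one pass over (doc_id, text) pairs."""
    posting_lists = {}
    for doc_id, text in collection:
        for term in text.lower().split():
            # EXISTS / CREATE: setdefault creates an empty list on first sight.
            postings = posting_lists.setdefault(term, [])
            # CONTAINS: all terms of one document are processed together,
            # so if doc_id is already in the list it must be the last entry.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return posting_lists


collection = [(1, "ETHZ systems group"),
              (2, "ETHZ information retrieval ETHZ")]
index = index_generation(collection)
```

If documents are visited in ascending docID order, the resulting posting lists are sorted as a side effect.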


Index Construction

Conceptually: 3 steps

1.  Make a pass through the collection and assemble all postings, i.e. pairs (term, doc-id) or (term-id, doc-id)

2.  Sort the postings using the term(-id) as the primary and the doc-id as the secondary key

3.  Organize doc-ids into posting lists for each term
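The three steps can be sketched in memory as follows. The (doc_id, text) input format and whitespace tokenization are assumptions made for illustration.

```python
from itertools import groupby


def build_index_by_sorting(collection):
    # Step 1: assemble all (term, doc_id) postings in one pass.
    postings = [(term, doc_id)
                for doc_id, text in collection
                for term in text.lower().split()]
    # Step 2: sort by term (primary key) and doc_id (secondary key);
    # set() drops repeated occurrences of a term within one document.
    postings = sorted(set(postings))
    # Step 3: group the doc-ids of each term into one posting list.
    return {term: [doc_id for _, doc_id in group]
            for term, group in groupby(postings, key=lambda p: p[0])}


index = build_index_by_sorting([(2, "ETHZ information retrieval ETHZ"),
                                (1, "ETHZ systems group")])
```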


Scalable Index Construction

In-memory index construction does not scale. How can we construct an index for very large collections?

Taking into account the hardware constraints on memory, disk, speed etc.


Sort-Based Index Construction

As we build the index, we parse docs one at a time. The postings list for any term is potentially incomplete until the end.

At 10–12 bytes per postings entry, this demands a lot of space for large collections.

For large document collections, we need to store intermediate results on disk.


Blocked Sort-Based Indexing (BSBI)

12-byte (4+4+4) postings (term-id, doc-id, term frequency)

We must now sort many billions of postings by term-id. Define a block to consist of (say) 10M such postings. We can easily fit that many postings into memory.

Basic idea of algorithm:

Accumulate postings for each block, sort, write to disk. Then merge the blocks into one long sorted order.
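A small sketch of the block-then-merge idea, assuming (term_id, doc_id) postings. In-memory lists stand in for the sorted block files on disk, and the function names are hypothetical.

```python
import heapq


def bsbi_sorted_blocks(postings, block_size):
    """Cut the posting stream into blocks and sort each block in memory."""
    block = []
    for posting in postings:
        block.append(posting)
        if len(block) == block_size:
            yield sorted(block)      # in a real system: write the block to disk
            block = []
    if block:
        yield sorted(block)


def bsbi_merge(sorted_blocks):
    """Merge the sorted blocks into one sorted order and build posting lists."""
    index = {}
    # heapq.merge reads the blocks sequentially and never re-sorts them.
    for term_id, doc_id in heapq.merge(*sorted_blocks):
        docs = index.setdefault(term_id, [])
        if not docs or docs[-1] != doc_id:   # skip duplicate postings
            docs.append(doc_id)
    return index


postings = [(2, 1), (1, 1), (2, 2), (1, 3), (3, 2), (1, 1)]
index = bsbi_merge(bsbi_sorted_blocks(postings, block_size=2))
```

Only one block has to fit into memory at a time during sorting; the merge needs just the current head of each block.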


BSBI Index Construction


BSBI: Merging Blocks


Problems with Sort-Based Algorithm

Our assumption was: we can keep the dictionary in memory.

We need the dictionary (which grows dynamically) in order to implement a term to term-id mapping. Actually, we could work with (term, doc-id) postings instead of (term-id, doc-id) postings . . .

. . . but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)

Term fingerprinting (hashing each term to a fixed-size id) is an alternative, but inexact, since distinct terms may collide.


Single-Pass In-Memory Indexing

Abbreviation: SPIMI

Key idea #1: Generate separate dictionaries for each block. There is no need to maintain a term to term-id mapping across blocks.

Key idea #2: Don’t sort. Accumulate postings in postings lists as they occur.

With these two ideas we can generate a complete inverted index for each block.

These separate indexes can then be merged into one big index.
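A simplified sketch of the two key ideas, assuming documents arrive as (doc_id, text) pairs in ascending docID order and that a block is closed once it holds a given number of postings. The block indexes are kept in memory here, whereas a real system would write each one to disk before merging.

```python
def spimi_blocks(collection, max_postings):
    """Yield one complete inverted index (term -> doc_ids) per block."""
    block, count = {}, 0
    for doc_id, text in collection:
        for term in text.lower().split():
            # Key idea #1: a fresh dictionary per block, keyed by the term itself.
            postings = block.setdefault(term, [])
            # Key idea #2: no sorting, postings are appended as they occur.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
                count += 1
        if count >= max_postings:        # block is full: hand it over
            yield block
            block, count = {}, 0
    if block:
        yield block


def merge_block_indexes(blocks):
    """Merge the per-block indexes into one big index."""
    index = {}
    for block in blocks:
        for term, postings in block.items():
            # Blocks cover disjoint, increasing docID ranges, so concatenation
            # keeps every posting list sorted.
            index.setdefault(term, []).extend(postings)
    return index


collection = [(1, "eth systems"), (2, "eth retrieval"), (3, "retrieval systems eth")]
blocks = list(spimi_blocks(collection, max_postings=3))
index = merge_block_indexes(blocks)
```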


DISTRIBUTED INDEXING


Distributed Index Generation

For web-scale indexing: must use a distributed computer cluster

Individual machines are fault-prone and may unpredictably slow down or fail

How do we exploit such a pool of machines?


Master Coordination

Maintain a master machine directing the indexing job – considered “safe”

Break up indexing into sets of parallel tasks

Master machine assigns each task to an idle machine from a pool.


Parallel Tasks

We will define two sets of parallel tasks and deploy two types of machines to solve them:

§  Parsers

§  Inverters

Break the input document collection into splits (corresponding to blocks in BSBI/SPIMI)

Each split is a subset of documents.


Parsers

Master assigns a split to an idle parser machine.

Parser reads one document at a time and emits (term, doc) pairs.

Parser writes pairs into j term-partitions, each covering a range of first letters of terms.

E.g., a-f, g-p, q-z (here: j = 3)


Inverters

An inverter collects all (term, doc) pairs (= postings) for one term-partition.

The inverter sorts the pairs and writes them to postings lists.
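The parser and inverter roles can be sketched as two plain functions. This is a single-process illustration under stated assumptions: splits are lists of (doc_id, text) pairs, every term starts with a letter a-z, and lists stand in for the partition files exchanged between machines.

```python
from collections import defaultdict

PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]   # j = 3 term-partitions


def partition_of(term):
    """Index of the term-partition responsible for the term's first letter."""
    for j, (low, high) in enumerate(PARTITIONS):
        if low <= term[0] <= high:
            return j
    raise ValueError("term does not start with a letter a-z: " + term)


def parse(split):
    """Parser: emit (term, doc) pairs and write them into j partitions."""
    segments = [[] for _ in PARTITIONS]
    for doc_id, text in split:
        for term in text.lower().split():
            segments[partition_of(term)].append((term, doc_id))
    return segments


def invert(segments):
    """Inverter: collect all pairs of one term-partition, sort, build lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted({pair for seg in segments for pair in seg}):
        postings[term].append(doc_id)
    return dict(postings)


splits = [[(1, "eth zurich")], [(2, "systems eth")]]
parsed = [parse(split) for split in splits]                    # parser tasks
index_parts = [invert([segments[j] for segments in parsed])    # inverter tasks
               for j in range(len(PARTITIONS))]
```

Each call to `parse` or `invert` is independent of the others, which is what allows a master to hand them out to idle machines.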


Data Flow


Map Reduce

The index construction algorithm we just described is an instance of Map Reduce.

Map Reduce is a robust and conceptually simple framework for distributed computing that spares us from writing code for the distribution part.

The open source version is called Hadoop.

Hadoop is a key tool for big data. See lecture 3 of Donald Kossmann’s class.


¤ J. Dean & S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. 6th Symposium on Operating System Design & Implementation, 2004.

MULTIPLE QUERY TERMS


Basic Index: Modified Challenge

Deal with multiple terms:

§  Efficiently identify documents containing a given set of terms t1,…,tk.

§  This is also known as Boolean retrieval with “AND”.

In which way do we need to generalize the

•  index data structures

•  index generation, and

•  query processing?


Multi-Term Posting Lists

Challenge: #terms may be large: O(billions), but #sets of terms grows exponentially in the set size

In practice some sets (or n-grams) of terms may be used frequently (“mountain bike trails”), but most term combinations will never be observed.

Idea #1: Multi-term posting lists

§  Identify frequent k-term combinations (from documents or query logs, k=2 or k=3). Create posting lists for those.

§  Advantage: popular k term queries can be answered as fast as one term queries


Intersecting Posting Lists

Idea #2: Traverse multiple posting lists in parallel to compute intersection.

In order to be effective (for ~ equal length posting lists): sorted posting lists. Sort the entries in each list using the same total order (e.g. ascending documentID).

Basic method:

§  Always advance in the posting list with the smallest current element.

§  Check for documents contained in all lists
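A sketch of the basic method for k sorted posting lists, assuming each list is a Python list of ascending docIDs. The function name is hypothetical, and the docIDs are those of the example on the next slide.

```python
def intersect_all(lists):
    """Traverse k sorted posting lists in parallel and return common docIDs."""
    if not lists:
        return []
    positions = [0] * len(lists)
    answer = []
    # Stop as soon as any list is exhausted.
    while all(pos < len(lst) for pos, lst in zip(positions, lists)):
        current = [lst[pos] for pos, lst in zip(positions, lists)]
        smallest = min(current)
        if smallest == max(current):
            # All lists agree on the current docID: it is in the intersection.
            answer.append(smallest)
            positions = [pos + 1 for pos in positions]
        else:
            # Advance in the list with the smallest current element.
            positions[current.index(smallest)] += 1
    return answer


ethz = [370871, 391223, 623920, 789908]
systems = [177883, 300012, 391223, 391248]
information_retrieval = [391223, 800123, 990002, 991226]
result = intersect_all([ethz, systems, information_retrieval])
```

Because all lists share the same total order, a docID smaller than the current element of some other list can never appear in that list later, so it is safe to move past it.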


Intersecting Posting Lists: Example


[Figure, shown in four steps: the posting lists are first sorted by ascending docID and then traversed in parallel.]

ETHZ: 370871 391223 623920 … 789908

systems: 177883 300012 391223 391248 …

information retrieval: 391223 800123 990002 991226 …

When all three lists point at 391223: add docID 391223 to result set

Intersection Algorithm

For simplicity, we focus on the case of two posting lists

Multiple terms can be handled by generalizing to k posting lists

… or by creating temp intermediate posting lists and recursion

Optimization: start with shorter posting lists


INTERSECT(p1, p2)
  answer := < >
  while (p1 != NULL) AND (p2 != NULL)
    if docID(p1) == docID(p2) then
      ADD(answer, docID(p1))
      ADVANCE(p1)
      ADVANCE(p2)
    else if docID(p1) < docID(p2)
      ADVANCE(p1)
    else
      ADVANCE(p2)
  return answer
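A runnable Python rendering of INTERSECT, with array indices in place of list pointers, plus a hypothetical helper `intersect_terms` that handles several terms through intermediate posting lists and starts with the shortest lists.

```python
from functools import reduce


def intersect(p1, p2):
    """Intersect two posting lists sorted by ascending docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_terms(posting_lists):
    """Intersect one or more posting lists, shortest first."""
    # Starting with the shortest lists keeps intermediate results small.
    ordered = sorted(posting_lists, key=len)
    return reduce(intersect, ordered)


result = intersect_terms([[1, 2, 3, 5, 8, 13, 21], [2, 3, 5, 7], [3, 5, 21]])
```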

Intersecting Posting Lists

How expensive is the parallel intersection of k posting lists?

Number of pointer advances: at most |L_1| + … + |L_k|, i.e. linear in the total length of the lists.

Reasonable efficiency, if posting lists are approximately of the same length.

Access time dominated by longest posting list. Can we also devise a method that is dominated by the shortest?


ADVANCED POSTING LIST INTERSECTION


Alternative Posting List Intersection

Naïve approach when |L_1| << |L_2|

§  Build a hash map dictionary of docIDs for L_1 and L_2

§  Lookup the elements of L_1 in the dictionary for L_2

§  O(|L_1|) time

Only works well in highly asymmetric case. Can we do better?
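A minimal sketch of the look-up approach, assuming the hash-based dictionary for the long list is a Python set that was built ahead of time. The function name and the example docIDs are made up for illustration.

```python
def intersect_by_lookup(short_list, long_list_ids):
    """Probe each docID of the short list in the hashed long list."""
    # One expected O(1) membership test per element of the short list.
    return [doc_id for doc_id in short_list if doc_id in long_list_ids]


long_list = list(range(0, 1_000_000, 2))   # hypothetical frequent term
long_list_ids = set(long_list)             # hash dictionary, built once
short_list = [7, 10, 123456, 999999]       # hypothetical rare term
result = intersect_by_lookup(short_list, long_list_ids)
```

The running time depends only on the short list, but the approach pays for an extra hashed copy of every posting list and gives no advantage when both lists are long.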


Alternative Posting List Intersection: Refinement

§  Compute hashed sets h(L1) and h(L2)

§  Bucketed bit set representation of set of hash values

§  Fast intersection in bit set representation

§  Exact intersection

§  #bits in h: small enough to allow for fast intersection; large enough to make L’1 and L’2 small.
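The following is a much simplified illustration of the idea, not the algorithm of the cited paper. It assumes a Python integer as the bit set, an arbitrary multiplicative hash, and an exact comparison restricted to the buckets whose hash value survives the bit-set intersection.

```python
def bucket_of(doc_id, num_bits):
    # Arbitrary multiplicative hash, chosen only for this sketch.
    return (doc_id * 2654435761 % 2**32) % num_bits


def hashed_set(doc_ids, num_bits):
    """Return (bit set of hash values, docIDs grouped by hash value)."""
    bits, buckets = 0, {}
    for doc_id in doc_ids:
        h = bucket_of(doc_id, num_bits)
        bits |= 1 << h
        buckets.setdefault(h, []).append(doc_id)
    return bits, buckets


def intersect_hashed(set1, set2):
    """Intersect bit sets first, then compare only the surviving buckets."""
    bits1, buckets1 = set1
    bits2, buckets2 = set2
    common = bits1 & bits2          # fast intersection on the bit sets
    answer = []
    h = 0
    while common:
        if common & 1:
            # Exact step: equal hash values do not imply equal docIDs.
            candidates = set(buckets2[h])
            answer.extend(d for d in buckets1[h] if d in candidates)
        common >>= 1
        h += 1
    return sorted(answer)


l1 = [370871, 391223, 623920, 789908]
l2 = [177883, 300012, 391223, 391248]
result = intersect_hashed(hashed_set(l1, 64), hashed_set(l2, 64))
```

Both sets must be built with the same number of bits. More bits leave fewer candidates for the exact step but make the bit-set intersection itself more expensive.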


¤  P. Bille, A. Pagh, and R. Pagh. Fast Evaluation of Union-Intersection Expressions. In ISAAC, pages 739–750, 2007.

Posting Lists with Skip Pointers

Other ways to speed up list based intersection: introduce skip pointers

Traverse skip pointers instead of next element pointer, if whole segment can be skipped.

Where to put skip pointers? Heuristic: for a posting list of length L, use about √L evenly spaced skip pointers.

Trade-offs:

(1) space and I/O (!) requirements for skip pointers vs. not

(2) additional comparisons with skip pointers vs. skip gains


Use of Skip Pointers: Example


Suppose 8 has been reached in both lists. The next element in the top list is 41, and we can advance to that element. In the bottom list, however, we can skip over a block and move past 31, skipping 4 elements.
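A sketch of intersection with skip pointers, assuming array-based posting lists where every position that is a multiple of about sqrt(length) carries an implicit skip pointer to the next such position. The two lists are hypothetical, chosen to be consistent with the description above; the slide's figure itself is not reproduced here.

```python
import math


def intersect_with_skips(p1, p2):
    """Intersect two sorted posting lists, following skip pointers when safe."""
    skip1 = max(1, math.isqrt(len(p1)))   # sqrt spacing heuristic
    skip2 = max(1, math.isqrt(len(p2)))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip pointer only if its target does not overshoot.
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer


top = [2, 4, 8, 41, 48, 64, 128]
bottom = [1, 2, 3, 8, 11, 17, 21, 31]
result = intersect_with_skips(top, bottom)
```

A skip is taken only when the target docID is still at most the current docID of the other list, so no common element can be jumped over. Each failed skip test costs one extra comparison, which is the second trade-off listed above.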

WEB SCALE INDEX SERVING


Disk vs. RAM

When building a scalable (i.e. Web scale) index, one key design question is to use disk vs. RAM (today also: SSD).

§  RAM ~200x more expensive than disk

§  Disk ~10-20x slower to access

§  Additional overhead for random access = disk seeks

Hardware economics also influence system architecture.


Distributed Index: Sharding

A related problem for large indexes is how to split up the index into pieces or shards.

Relevant performance dimensions are response time or latency (how long does it take to compute a response?) as well as throughput (how many queries/s can be answered?). In addition, fault tolerance may be an issue.

There are two basic ways of sharding: document sharding or term sharding.

§  Document sharding: each shard contains short posting lists (for a subset of documents).

§  Term sharding: each shard contains few posting lists (complete lists for a subset of terms).
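The two schemes can be contrasted on a toy index. This is a sketch under assumptions: documents are assigned to shards by docID modulo the number of shards, and terms by a CRC32 hash of the term. Both assignment rules are illustrative choices, not taken from the slides.

```python
import zlib


def document_shards(index, num_shards):
    """Each shard holds every term, but only the docIDs assigned to it."""
    shards = [{} for _ in range(num_shards)]
    for term, postings in index.items():
        for doc_id in postings:
            shards[doc_id % num_shards].setdefault(term, []).append(doc_id)
    return shards


def term_shards(index, num_shards):
    """Each shard holds the complete posting lists of a subset of terms."""
    shards = [{} for _ in range(num_shards)]
    for term, postings in index.items():
        # crc32 is stable across runs, unlike Python's built-in str hash.
        shards[zlib.crc32(term.encode()) % num_shards][term] = postings
    return shards


index = {"eth": [1, 2, 3, 4], "systems": [2, 4], "zurich": [3]}
by_document = document_shards(index, 2)
by_term = term_shards(index, 2)
```

With document sharding a query must visit every shard, but each shard can intersect its short lists locally. With term sharding a query visits only the shards that own its terms, but their full posting lists then have to be brought together in one place.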

Document Sharding


Term Sharding


Document Sharding - Pros & Cons

Pros

§  each shard can process queries independently

§  easy to keep additional per-doc information

§  network traffic (requests/responses) small

Cons

§  query has to be processed by each shard

§  O(K*N) disk seeks for K word query on N shards

Term Sharding - Pros & Cons

Pros

§  K word query => handled by at most K shards

§  O(K) disk seeks for K word query

Cons

§  much higher network bandwidth needed

§  data about each term for each matching doc must be collected in one place

§  harder to have per-doc information

Document sharding is “standard” approach in Web search.

Basic Design Principles

Document Keying

§  Documents assigned small integer ids (docids)

§  Smaller ids for higher quality/more important docs: allows for approximation/cut-offs

Index Servers

§  Given (query) return sorted list of (score, docid, ...)

§  Partitioned (“sharded”) by docid

§  Index shards are replicated for capacity

§  Cost is O(# queries * # docs in index)


Web Search Serving System (Google @ year ~2000)


Caching

Cache servers

§  Cache both index results and doc snippets

§  Hit rates typically 30-60%

•  Depends on frequency of index updates, query traffic, level of personalization, etc.

Main benefits

§  Performance! 10s of machines do work of 100(0)s

§  Reduce query latency on hits

§  Cache served queries are typically popular and often expensive


Dealing with Growth

More web pages: more shards

More queries: more replicas


From Document Sharding to In-memory Index

Must add shards to keep response time low as index size increases

... but query cost increases with # of shards

§  typically >= 1 disk seek / shard / query term

§  even for very rare terms

As # of replicas increases, total amount of memory available increases

Eventually, have enough memory to hold an entire copy of the index in memory.

Radically changes many design parameters


In Memory Index (a la Google)


Anecdote from the Life of a Search Engine 1999

Index updates (~once per month)

§  Wait until traffic is low

§  Take some replicas offline

§  Copy new index to these replicas

§  Start new frontends pointing at updated index

Disk-optimized update scheme

