munching & crunching - lucene index post-processing
DESCRIPTION
Lucene EuroCon 2010 presentation on index post-processing (splitting, merging, sorting, pruning), tiered search, bitwise search, and a few slides on MapReduce indexing models (I ran out of time to show them, but they are there...)

TRANSCRIPT
Munching & crunching: Lucene index post-processing and applications
Andrzej Białecki
Intro
Started using Lucene in 2003 (1.2-dev?)
Created Luke – the Lucene Index Toolbox
Nutch, Hadoop committer, Lucene PMC member
Nutch project lead
Munching and crunching? But really...
Stir your imagination
Think outside the box
Show some unorthodox use and practical applications
Close ties to scalability, performance, distributed search and query latency
20 May 2010, Apache Lucene EuroCon
Agenda
● Post-processing: splitting, merging, sorting, pruning
● Tiered search
● Bit-wise search
● (Map-reduce indexing models)
Why post-process indexes?
Isn't it better to build them right from the start?
Sometimes it's not convenient or feasible:
Correcting the impact of unexpected common words
Targeting a specific index size or composition:
Creating evenly-sized shards
Re-balancing shards across servers
Fitting indexes completely in RAM
… and sometimes it's impossible to do it right from the start:
Trimming index size while retaining the quality of top-N results
Merging indexes
It's easy to merge several small indexes into one
A fundamental Lucene operation during indexing (SegmentMerger)
Command-line utilities exist: IndexMergeTool
API:
IndexWriter.addIndexes(IndexReader...)
IndexWriter.addIndexesNoOptimize(Directory...)
Hopefully a more flexible API on the flex branch
Solr: through CoreAdmin action=mergeindexes
Note: schemas must be compatible
Splitting indexes
IndexSplitter tool:
Moves whole segments to standalone indexes
Pros: nearly no IO/CPU involved – just rename & create new SegmentInfos file
Cons:
Requires a multi-segment index!
Very limited control over content of resulting indexes → MergePolicy
[Diagram: the original index (segments _0, _1, _2 plus the segments_2 file) becomes three standalone indexes, each holding one renamed segment (_0) and its own new segments file]
Splitting indexes, take 2
MultiPassIndexSplitter tool:
Uses an IndexReader that keeps the list of deletions in memory
The source index remains unmodified
For each partition:
Marks all source documents not in the partition as deleted
Writes a target split using IndexWriter.addIndexes(IndexReader)
IndexWriter knows how to skip deleted documents
Removes the “deleted” mark from all source documents
Pros:
Arbitrary splits possible (even partially overlapping)
Source index remains intact
Cons:
Reads the complete index N times – I/O is O(N * indexSize)
Takes twice as much space (the source index remains intact)
… but maybe it's a feature?
[Diagram: the original index holds d1–d4; in each pass the documents outside the current partition are marked deleted (del1, del2) and the remainder is written out, yielding one new index per pass]
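The per-pass mechanics can be sketched in plain Java, with lists standing in for a real Lucene index (all names here are illustrative, not Lucene API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Toy model of the MultiPassIndexSplitter idea: one full pass over the
// source per partition; documents outside the partition are treated as
// "deleted" and skipped, the rest are copied to a new index.
class MultiPassSplitSketch {
    static <T> List<List<T>> split(List<T> sourceDocs, List<IntPredicate> partitions) {
        List<List<T>> targets = new ArrayList<>();
        for (IntPredicate inPartition : partitions) {          // pass N
            List<T> target = new ArrayList<>();
            for (int docId = 0; docId < sourceDocs.size(); docId++) {
                if (inPartition.test(docId)) {                 // not "deleted"
                    target.add(sourceDocs.get(docId));
                }
            }
            targets.add(target);                               // one new index per pass
        }
        return targets;                                        // source left intact
    }
}
```

Because each partition is just a predicate over docIDs, overlapping or otherwise arbitrary splits fall out for free, at the cost of one full read per partition.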
Splitting indexes, take 3
SinglePassSplitter:
Uses the same processing workflow as SegmentMerger, only with multiple outputs
Writes new SegmentInfos and FieldInfos
Merge (pass-through) stored fields
Merge (pass-through) term dictionary
Merge (pass-through) postings with payloads
Merge (pass-through) term vectors
Renumbers document IDs on-the-fly to form a contiguous space
Pros: flexibility as with MultiPassIndexSplitter
Status: work started, to be contributed soon...
[Diagram: stored fields, term dictionary, postings+payloads and term vectors flow from the source index through a partitioner into multiple output indexes; source docIDs (1 2 3 4 5 6 …) are renumbered on the fly into a contiguous sequence (1' 2' 3' …) per output]
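The renumbering itself is a single counting pass; a minimal sketch in plain Java (illustrative names, assuming each document's target partition is already known):

```java
// Toy model (not Lucene code) of the renumbering a single-pass splitter
// performs: documents keep their source order, but each output partition
// gets its own contiguous docID space starting at 0.
class RenumberSketch {
    // partitionOf[oldId] = target partition of that document;
    // returns newId[oldId] = the document's ID inside its partition.
    static int[] renumber(int[] partitionOf, int numPartitions) {
        int[] nextFree = new int[numPartitions];   // next unused ID per partition
        int[] newId = new int[partitionOf.length];
        for (int oldId = 0; oldId < partitionOf.length; oldId++) {
            newId[oldId] = nextFree[partitionOf[oldId]]++;
        }
        return newId;
    }
}
```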
Splitting indexes, summary
SinglePassSplitter – best tradeoff of flexibility/IO/CPU
Interesting scenarios with SinglePassSplitter:
Split by ranges, round-robin, by field value, by frequency, to a target size, etc.
“Extract” handful of documents to a separate index
“Move” documents between indexes:
“extract” from source
Add to target (merge)
Delete from source
Now the source index may reside on a network FS – the amount of I/O is O(1 * indexSize)
Index sorting - introduction
“Early termination” technique: if the full execution of a query takes too long, terminate and estimate
Termination conditions:
Number of documents – LimitedCollector in Nutch
Time – TimeLimitingCollector
(see also the extended LUCENE-1720 TimeLimitingIndexReader)
Problems:
Difficult to estimate total hits
Important docs may not be collected if they have high docIDs
Index sorting - details
Define a global ordering of documents (e.g. PageRank, popularity, quality, etc.)
Documents with good rank should generally score higher
Sort (internal) IDs by this ordering, descending
Map from old to new IDs to follow this ordering
Change the IDs in the postings
[Diagram: the original index lists doc IDs 0–7 with ranks c, e, h, f, a, d, g, b – early termination works poorly; applying the ID mapping (old→new: 4→0, 7→1, 0→2, 5→3, 1→4, 3→5, 6→6, 2→7) yields the sorted index with ranks a–h in order – early termination works well]
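The core step – turning a per-document rank into an old→new ID mapping – can be sketched as follows (illustrative names, not the actual IndexSorter code):

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch of the sorting step: build an old->new docID mapping
// so that higher-ranked documents get lower internal IDs; this mapping is
// then applied to every docID stored in the postings.
class IndexSortSketch {
    static int[] oldToNew(double[] rank) {
        Integer[] order = new Integer[rank.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // sort old IDs by descending rank
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> rank[i]).reversed());
        int[] mapping = new int[rank.length];
        for (int newId = 0; newId < order.length; newId++) {
            mapping[order[newId]] = newId;   // old ID -> its position after sorting
        }
        return mapping;
    }
}
```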
Index sorting - summary
Implementation in Nutch: IndexSorter
Based on PageRank – sorts by decreasing page quality
Uses FilterIndexReader
NOTE: “Early termination” will (significantly) reduce quality of results with non-sorted indexes – use both or neither
Index pruning
Quick refresher on the index composition:
Stored fields
Term dictionary
Term frequency data
Positional data (postings)
With or without payload data
Term frequency vectors
The number of documents may run into the millions
The number of terms is commonly well into the millions
Not to mention individual postings…
Index pruning & top-N retrieval
N is usually << 1000
Very often search quality is judged based on top-20
Question: do we really need to keep and process ALL terms and ALL postings for a good-quality top-N search for common queries?
Index pruning hypothesis
There should be a way to remove some of the less important data
While retaining the quality of top-N results!
Question: what data is less important?
Some answers:
That of poorly-scoring documents
That of common (less selective) terms
Dynamic pruning: skips less relevant data during query processing → runtime cost...
But can we do this work in advance (static pruning)?
What do we need for top-N results?
Work backwards
“Foreach” common query:
Run it against the full index
Record the top-N matching documents
“Foreach” document in the results:
Record the terms and term positions that contributed to the score
Finally: remove all non-recorded postings and terms
First proposed by D. Carmel (2001) for single term queries
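As a toy illustration of this work-backwards idea, restricted to single-term queries over an in-memory posting map (the names and the scoring model are assumptions, not Carmel's actual implementation):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy model of Carmel-style static pruning for single-term queries:
// postings map a term to (docId -> score). For each query term we keep only
// the postings of its top-N scoring documents; postings never recorded by
// any query are removed.
class CarmelPruneSketch {
    static Map<String, TreeSet<Integer>> prune(
            Map<String, Map<Integer, Double>> postings, List<String> queries, int n) {
        Map<String, TreeSet<Integer>> kept = new HashMap<>();
        for (String term : queries) {
            Map<Integer, Double> docs = postings.getOrDefault(term, Map.of());
            docs.entrySet().stream()
                .sorted(Map.Entry.<Integer, Double>comparingByValue().reversed())
                .limit(n)                                     // record top-N contributors
                .forEach(e -> kept.computeIfAbsent(term, t -> new TreeSet<>())
                                  .add(e.getKey()));
        }
        return kept;                                          // everything else is pruned
    }
}
```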
… but it's too simplistic:
Hmm, what about less common queries? 80/20 rule of “good enough”?
Term-level is too primitive:
Document-centric pruning
Impact-centric pruning
Position-centric pruning
[Diagram: posting lists with positions (0 quick, 1 brown, 2 fox) before and after pruning]
Query 1: brown – topN(full) == topN(pruned)
Query 2: “brown fox” – topN(full) != topN(pruned)
Smarter pruning
Not all term positions are equally important
Metrics of term and position importance:
Plain in-document term frequency (TF)
TF-IDF score obtained from top-N results of TermQuery (Carmel method)
Residual IDF – measure of term informativeness (selectivity)
Key-phrase positions, or term clusters
Kullback-Leibler divergence from a language model →
[Diagram: term/frequency curves comparing the corpus language model with a document language model]
Applications
Obviously, performance-related
Some papers claim a modest impact on quality when pruning up to 60% of postings
See LUCENE-1812 for some benchmarks confirming this claim
Removal / restructuring of (some) stored content
Legacy indexes, or ones created with a fossilized external chain
Stored field pruning
Some stored data can be compacted, removed, or restructured
Use case: source text for generating “snippets”
Split the content into sentences
Reorder sentences by a static “importance” score (e.g. how many rare terms they contain)
NOTE: this may use collection wide statistics!
Remove the bottom x% of sentences
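A minimal sketch of this sentence-level compaction, with the importance score left to the caller (illustrative names, not the LUCENE-1812 API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Sketch of stored-field compaction for snippet sources: sentences are
// reordered by a caller-supplied importance score (e.g. how many rare terms
// they contain, which may use collection-wide statistics), and the bottom
// fraction is dropped.
class SnippetSourcePruner {
    static List<String> prune(List<String> sentences,
                              ToDoubleFunction<String> importance,
                              double keepFraction) {
        List<String> sorted = new ArrayList<>(sentences);
        sorted.sort(Comparator.comparingDouble(importance).reversed());
        int keep = (int) Math.ceil(keepFraction * sorted.size());
        return sorted.subList(0, keep);        // bottom (1 - keepFraction) removed
    }
}
```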
LUCENE-1812: contrib/pruning tools and API
Based on FilterIndexReader
Produces output indexes via IndexWriter.addIndexes(IndexReader[])
Design:
PruningReader – a subclass of FilterIndexReader with the necessary boilerplate and hooks for pruning policies
StorePruningPolicy – implements rules for modifying stored fields (and list of field names)
TermPruningPolicy – implements rules for modifying term dictionary, postings and payloads
PruningTool – command-line utility to configure and run PruningReader
PruningReader
Details of LUCENE-1812
IndexWriter consumes source data filtered via PruningReader
Internal document IDs are preserved – suitable for bitset ops and retrieval by internal ID:
If the source index has no deletions
If target index is empty
[Diagram: stored fields, term dictionary, postings+payloads and term vectors of the source index pass through StorePruningPolicy and TermPruningPolicy into an IndexWriter, which builds the target index via IW.addIndexes(IndexReader...)]
API: StorePruningPolicy
May remove (some) fields from (some) documents
May as well modify the values
May rename / add fields
API: TermPruningPolicy
Thresholds (in the order of precedence):
Per term
Per field
Default
Plain TF pruning – TFTermPruningPolicy:
Removes all postings for a term where the TF (in-document term frequency) is below a threshold
Top-N term-level – CarmelTermPruningPolicy:
Runs a TermQuery search for the top-N docs
Removes all postings for a term outside the top-N docs
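The TF rule with the threshold precedence above can be sketched like this (a toy posting map, not the actual TFTermPruningPolicy code; the per-field level is omitted for brevity):

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of TF pruning: postings map term -> (docId -> tf); a posting
// survives only if its in-document frequency reaches the per-term threshold,
// falling back to the default threshold.
class TfPruneSketch {
    static Map<String, Map<Integer, Integer>> prune(
            Map<String, Map<Integer, Integer>> postings,
            Map<String, Integer> perTerm, int defaultThreshold) {
        Map<String, Map<Integer, Integer>> pruned = new TreeMap<>();
        postings.forEach((term, docs) -> {
            int min = perTerm.getOrDefault(term, defaultThreshold);
            Map<Integer, Integer> kept = new TreeMap<>();
            docs.forEach((doc, tf) -> { if (tf >= min) kept.put(doc, tf); });
            if (!kept.isEmpty()) pruned.put(term, kept);  // terms may vanish entirely
        });
        return pruned;
    }
}
```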
Results so far...
TF pruning:
Term query recall very good
Phrase query recall very poor – expected...
Carmel pruning – slightly better term position selection, but still heavy negative impact on phrase queries
Recognizing and keeping key phrases would help:
Use a query log for frequent-phrase mining?
Use collocation miner (Mahout)?
Savings on pruning will be smaller, but quality will significantly improve
References
Static Index Pruning for Information Retrieval Systems, Carmel et al., SIGIR '01
A document-centric approach to static index pruning in text retrieval systems, Büttcher & Clarke, CIKM '06
Locality-based pruning methods for web search, de Moura et al., ACM TOIS '08
Pruning strategies for mixed-mode querying, Anh & Moffat, CIKM '06
Index pruning applied ...
Index 1: a heavily pruned index that fits in RAM:
Excellent speed
Poor search quality for many less-common query types
Index 2: a slightly pruned index that fits partially in RAM:
Good speed
Good quality for many common query types
Still poor quality for some other rare query types
Index 3: the full index on disk:
Slow speed
Excellent quality for all query types
QUESTION: Can we come up with a combined search strategy?
Tiered search
Can we predict the best tier without actually running the query?
How to evaluate if the predictor was right?
[Diagram: a tier selector predicts the best tier and an evaluator checks the answer; search box 1 holds a 70% pruned index in RAM, search box 2 a 30% pruned index on SSD, and search box 3 the full (0% pruned) index on HDD]
Tiered search: tier selector and evaluator
The best tier can be predicted (often enough):
Carmel pruning yields excellent results for simple term queries
Phrase-based pruning yields good results for phrase queries (though less often)
Quality evaluator: when is the predictor wrong?
Could be very complex, based on a gold standard and qrels
Could be very simple: acceptable number of results
Fall-back strategy:
Serial: poor latency, but minimizes load on bulkier tiers
Partially parallel:
submit to the next tier only the border-line queries
Pick the first acceptable answer – reduces latency
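The serial fall-back can be sketched as a loop over tiers with a deliberately simple evaluator (illustrative API, assuming "enough hits" as the acceptance test):

```java
import java.util.List;
import java.util.function.Function;

// Sketch of the serial fall-back strategy: try the most heavily pruned tier
// first; a very simple quality evaluator ("enough results") decides whether
// to fall through to the next, bulkier tier.
class TieredSearchSketch {
    static <Q, R> List<R> search(Q query,
                                 List<Function<Q, List<R>>> tiers,
                                 int minAcceptableHits) {
        List<R> results = List.of();
        for (Function<Q, List<R>> tier : tiers) {
            results = tier.apply(query);
            if (results.size() >= minAcceptableHits) {
                return results;            // evaluator accepted this tier's answer
            }
        }
        return results;                    // the full index is the last resort
    }
}
```

The partially parallel variant would instead submit borderline queries to the next tier while the current one runs, picking the first acceptable answer.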
Tiered versus distributed
Both are applicable to indexes and query loads exceeding single-machine capabilities
Distributed sharded search: increases latency for all queries (send + execute + integrate from all shards)
… plus replicas to increase QPS:
Increases hardware / management costs
While not improving latency
Tiered search:
Excellent latency for common queries
More complex to build and maintain
Arguably lower hardware cost for comparable scale / QPS
Tiered search benefits
Majority of common queries handled by first tier: RAM-based, high QPS, low latency
Partially parallel mode reduces average latency for more complex queries
Hardware investment likely smaller than for distributed search setup of comparable QPS / latency
Example Lucene API for tiered search
Could be implemented as a Solr SearchComponent...
Lucene implementation details
References
Efficiency trade-offs in two-tier web search systems, Baeza-Yates et al., SIGIR '09
ResIn: a combination of results caching and index pruning for high-performance web search engines, Baeza-Yates et al., SIGIR '08
Three-level caching for efficient query processing in large Web search engines, Long & Suel, WWW '05
Bit-wise search
Given a bit pattern query:
1010 1001 0101 0001
Find documents with matching bit patterns in a field
Applications:
Permission checking
De-duplication
Plagiarism detection
Two variants: non-scoring (filtering) and scoring
Non-scoring bitwise search (LUCENE-2460)
Builds a Filter from the intersection of:
A DocIdSet of documents matching a Query
An integer value and operation (AND, OR, XOR)
A “value source” that caches the integer values of a field (from FieldCache)
Corresponding Solr field type and QParser: SOLR-1913
Useful for filtering (not scoring)
[Diagram: docIDs 0–4 carry flags 0x01–0x05 and a type field (a, b, b, a, a); the DocIdSet of “type:a” is intersected with the bitwise test (val=0x01, op=AND) over the cached flags values to produce the Filter]
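A sketch of the filtering idea over plain arrays (illustrative, not the LUCENE-2460 API; the AND match rule – all query bits set in the cached field value – is an assumption here, and the OR/XOR variants are omitted):

```java
import java.util.BitSet;

// Sketch of a non-scoring bitwise filter over a cached int field: the docs
// already matched by the query are intersected with those whose flags value
// passes the bitwise test.
class BitwiseFilterSketch {
    static BitSet filter(BitSet queryMatches, int[] flags, int value) {
        BitSet result = new BitSet(flags.length);
        for (int docId = queryMatches.nextSetBit(0); docId >= 0;
                 docId = queryMatches.nextSetBit(docId + 1)) {
            if ((flags[docId] & value) == value) {   // AND test on the cached value
                result.set(docId);
            }
        }
        return result;
    }
}
```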
Scoring bitwise search (SOLR-1918)
A BooleanQuery in disguise: 1010 = Y-1000 | N-0100 | Y-0010 | N-0001
Solr 32-bit BitwiseField – its Analyzer creates the bitmasks field
Currently supports only a single value per field
Creates a BooleanQuery from the query's int value
Useful when searching for best matching (ranked) bit patterns
[Table: D1 flags=1010 → bits Y1000 N0100 Y0010 N0001; D2 flags=1011 → Y1000 N0100 Y0010 Y0001; D3 flags=0011 → N1000 N0100 Y0010 Y0001]
Q = bits:Y1000 bits:N0100 bits:Y0010 bits:N0001
Results: D1 matches 4 of 4 → #1; D2 matches 3 of 4 → #2; D3 matches 2 of 4 → #3
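The pattern-to-terms expansion and the resulting ranking can be sketched as follows (illustrative names, not the actual SOLR-1918 field type):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the scoring bitwise idea: a bit pattern expands into one Y/N
// term per bit (1010 -> Y1000 N0100 Y0010 N0001), and documents are ranked
// by how many of those terms they match -- a BooleanQuery of optional
// clauses in disguise.
class BitwiseScoreSketch {
    static List<String> patternTerms(int pattern, int width) {
        List<String> terms = new ArrayList<>();
        for (int bit = width - 1; bit >= 0; bit--) {
            String mask = String.format("%" + width + "s",
                    Integer.toBinaryString(1 << bit)).replace(' ', '0');
            terms.add(((pattern >> bit & 1) != 0 ? "Y" : "N") + mask);
        }
        return terms;
    }

    // number of Y/N clauses a document's bits satisfy = its rank key
    static int score(int docBits, int pattern, int width) {
        int matches = 0;
        for (int bit = 0; bit < width; bit++) {
            if ((docBits >> bit & 1) == (pattern >> bit & 1)) matches++;
        }
        return matches;
    }
}
```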
Summary
Index post-processing covers a range of useful scenarios:
Merging and splitting, remodeling, extracting, moving…
Pruning less important data
Tiered search + pruned indexes:
High performance
Practically unchanged quality
Less hardware
Bitwise search:
Filtering by matching bits
Ranking by best matching patterns
Meta-summary
Stir your imagination
Think outside the box
Show some unorthodox use and practical applications
Close ties to scalability, performance, distributed search and query latency
Q & A
Thank you!
Massive indexing with map-reduce
Map-reduce indexing models:
Google model
Nutch model
Modified Nutch model
Hadoop contrib/indexing model
Tradeoff analysis and recommendations
Google model
Map():
  IN: <seq, docText>
  terms = analyze(docText)
  foreach (term)
    emit(term, <seq, position>)
Reduce():
  IN: <term, list(<seq, pos>)>
  foreach (<seq, pos>)
    docId = calculate(seq, taskId)
    Postings(term).append(docId, pos)
Pros: analysis on the map side
Cons:
Too many tiny intermediate records → Combiner
DocID synchronization across map and reduce tasks
Lucene: very difficult (impossible?) to create an index this way
Nutch model (also in SOLR-1301)
Map():
  IN: <seq, docPart>
  docId = docPart.get(“url”)
  emit(docId, docPart)
Reduce():
  IN: <docId, list(docPart)>
  doc = luceneDoc(list(docPart))
  indexWriter.addDocument(doc)
Pros: easy to build a Lucene index
Cons:
Analysis on the reduce side
Many costly merge operations (large indexes built from scratch on the reduce side)
(plus currently needs a copy from the local FS to HDFS – see LUCENE-2373)
Modified Nutch model (N/A...)
Map():
  IN: <seq, docPart>
  docId = docPart.get(“url”)
  ts = analyze(docPart)
  emit(docId, <docPart, ts>)
Reduce():
  IN: <docId, list(<docPart, ts>)>
  doc = luceneDoc(list(<docPart, ts>))
  indexWriter.addDocument(doc)
Pros:
Analysis on the map side
Easy to build a Lucene index
Cons:
Many costly merge operations (large indexes built from scratch on the reduce side)
(plus currently needs a copy from the local FS to HDFS – see LUCENE-2373)
Hadoop contrib/indexing model
Map():
  IN: <seq, docText>
  doc = luceneDoc(docText)
  indexWriter.addDocument(doc)
  emit(random, indexData)
Reduce():
  IN: <random, list(indexData)>
  foreach (indexData)
    indexWriter.addIndexes(indexData)
Pros:
Analysis on the map side
Many merges on the map side
Also supports other operations (deletes, updates)
Cons: serialization is costly; records are big and require more RAM to sort
Massive indexing - summary
If you first need to collect document parts → SOLR-1301 model
If you use complex analysis → Hadoop contrib/index
NOTE: there is no good integration yet of Solr and the Hadoop contrib/index module...