munching & crunching - lucene index post-processing
DESCRIPTION
Lucene EuroCon 2010 presentation on index post-processing (splitting, merging, sorting, pruning), tiered search, bitwise search, and a few slides on MapReduce indexing models (I ran out of time to show them, but they are there...)

TRANSCRIPT
Munching & crunching: Lucene index post-processing and applications
Andrzej Białecki
Intro
Started using Lucene in 2003 (1.2-dev?)
Created Luke – the Lucene Index Toolbox
Nutch, Hadoop committer, Lucene PMC member
Nutch project lead
Munching and crunching? But really...
Stir your imagination
Think outside the box
Show some unorthodox use and practical applications
Close ties to scalability, performance, distributed search and query latency
20 May 2010, Apache Lucene EuroCon
Agenda
● Post-processing: splitting, merging, sorting, pruning
● Tiered search
● Bit-wise search
● (Map-reduce indexing models)
Why post-process indexes?
Isn't it better to build them right from the start?
Sometimes it's not convenient or feasible:
Correcting the impact of unexpected common words
Targeting a specific index size or composition:
Creating evenly-sized shards
Re-balancing shards across servers
Fitting indexes completely in RAM
… and sometimes it's impossible to do it right from the start:
Trimming index size while retaining the quality of top-N results
Merging indexes
It's easy to merge several small indexes into one
A fundamental Lucene operation during indexing (SegmentMerger)
Command-line utilities exist: IndexMergeTool
API:
IndexWriter.addIndexes(IndexReader...)
IndexWriter.addIndexesNoOptimize(Directory...)
Hopefully a more flexible API on the flex branch
Solr: through CoreAdmin action=mergeindexes
Note: schemas must be compatible
Splitting indexes
IndexSplitter tool:
Moves whole segments to standalone indexes
Pros: nearly no IO/CPU involved – just rename & create new SegmentInfos file
Cons:
Requires a multi-segment index!
Very limited control over content of resulting indexes → MergePolicy
[Diagram: the original index (segments _0, _1, _2 plus the segments_2 file) becomes three standalone indexes, each holding one renamed segment (_0) and its own new segments file]
Splitting indexes, take 2
MultiPassIndexSplitter tool:
Uses an IndexReader that keeps the list of deletions in memory
The source index remains unmodified
For each partition:
Marks all source documents not in the partition as deleted
Writes a target split using IndexWriter.addIndexes(IndexReader)
IndexWriter knows how to skip deleted documents
Removes the “deleted” mark from all source documents
Pros:
Arbitrary splits possible (even partially overlapping)
Source index remains intact
Cons:
Reads the complete index N times – I/O is O(N * indexSize)
Takes twice as much space (the source index remains intact)
… but maybe it's a feature?
[Diagram: the original index holds d1–d4; in each pass the documents outside the current partition are marked deleted (del1, del2) and the remainder is written out, yielding one new index per pass]
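The per-pass mechanics can be sketched in plain Java, with lists standing in for a real Lucene index (all names here are illustrative, not Lucene API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Toy model of the MultiPassIndexSplitter idea: one full pass over the
// source per partition; documents outside the partition are treated as
// "deleted" and skipped, the rest are copied to a new index.
class MultiPassSplitSketch {
    static <T> List<List<T>> split(List<T> sourceDocs, List<IntPredicate> partitions) {
        List<List<T>> targets = new ArrayList<>();
        for (IntPredicate inPartition : partitions) {          // pass N
            List<T> target = new ArrayList<>();
            for (int docId = 0; docId < sourceDocs.size(); docId++) {
                if (inPartition.test(docId)) {                 // not "deleted"
                    target.add(sourceDocs.get(docId));
                }
            }
            targets.add(target);                               // one new index per pass
        }
        return targets;                                        // source left intact
    }
}
```

Because each partition is just a predicate over docIDs, overlapping or otherwise arbitrary splits fall out for free, at the cost of one full read per partition.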
Splitting indexes, take 3
SinglePassSplitter:
Uses the same processing workflow as SegmentMerger, only with multiple outputs
Writes new SegmentInfos and FieldInfos
Merge (pass-through) stored fields
Merge (pass-through) term dictionary
Merge (pass-through) postings with payloads
Merge (pass-through) term vectors
Renumbers document IDs on-the-fly to form a contiguous space
Pros: flexibility as with MultiPassIndexSplitter
Status: work started, to be contributed soon...
[Diagram: stored fields, term dictionary, postings+payloads and term vectors flow from the source index through a partitioner into multiple output indexes; source docIDs (1 2 3 4 5 6 …) are renumbered on the fly into a contiguous sequence (1' 2' 3' …) per output]
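The renumbering itself is a single counting pass; a minimal sketch in plain Java (illustrative names, assuming each document's target partition is already known):

```java
// Toy model (not Lucene code) of the renumbering a single-pass splitter
// performs: documents keep their source order, but each output partition
// gets its own contiguous docID space starting at 0.
class RenumberSketch {
    // partitionOf[oldId] = target partition of that document;
    // returns newId[oldId] = the document's ID inside its partition.
    static int[] renumber(int[] partitionOf, int numPartitions) {
        int[] nextFree = new int[numPartitions];   // next unused ID per partition
        int[] newId = new int[partitionOf.length];
        for (int oldId = 0; oldId < partitionOf.length; oldId++) {
            newId[oldId] = nextFree[partitionOf[oldId]]++;
        }
        return newId;
    }
}
```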
Splitting indexes, summary
SinglePassSplitter – best tradeoff of flexibility/IO/CPU
Interesting scenarios with SinglePassSplitter:
Split by ranges, round-robin, by field value, by frequency, to a target size, etc.
“Extract” handful of documents to a separate index
“Move” documents between indexes:
“extract” from source
Add to target (merge)
Delete from source
Now the source index may reside on a network FS – the amount of I/O is O(1 * indexSize)
Index sorting - introduction
“Early termination” technique: if the full execution of a query takes too long, terminate and estimate
Termination conditions:
Number of documents – LimitedCollector in Nutch
Time – TimeLimitingCollector
(see also the extended LUCENE-1720 TimeLimitingIndexReader)
Problems:
Difficult to estimate total hits
Important docs may not be collected if they have high docIDs
Index sorting - details
Define a global ordering of documents (e.g. PageRank, popularity, quality, etc.)
Documents with good rank should generally score higher
Sort (internal) IDs by this ordering, descending
Map from old to new IDs to follow this ordering
Change the IDs in the postings
[Diagram: the original index lists doc IDs 0–7 with ranks c, e, h, f, a, d, g, b – early termination works poorly; applying the ID mapping (old→new: 4→0, 7→1, 0→2, 5→3, 1→4, 3→5, 6→6, 2→7) yields the sorted index with ranks a–h in order – early termination works well]
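The core step – turning a per-document rank into an old→new ID mapping – can be sketched as follows (illustrative names, not the actual IndexSorter code):

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch of the sorting step: build an old->new docID mapping
// so that higher-ranked documents get lower internal IDs; this mapping is
// then applied to every docID stored in the postings.
class IndexSortSketch {
    static int[] oldToNew(double[] rank) {
        Integer[] order = new Integer[rank.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // sort old IDs by descending rank
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> rank[i]).reversed());
        int[] mapping = new int[rank.length];
        for (int newId = 0; newId < order.length; newId++) {
            mapping[order[newId]] = newId;   // old ID -> its position after sorting
        }
        return mapping;
    }
}
```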
Index sorting - summary
Implementation in Nutch: IndexSorter
Based on PageRank – sorts by decreasing page quality
Uses FilterIndexReader
NOTE: “Early termination” will (significantly) reduce quality of results with non-sorted indexes – use both or neither
Index pruning
Quick refresher on the index composition:
Stored fields
Term dictionary
Term frequency data
Positional data (postings)
With or without payload data
Term frequency vectors
The number of documents may run into the millions
The number of terms is commonly well into the millions
Not to mention individual postings…
Index pruning & top-N retrieval
N is usually << 1000
Very often search quality is judged based on top-20
Question: do we really need to keep and process ALL terms and ALL postings for a good-quality top-N search for common queries?
Index pruning hypothesis
There should be a way to remove some of the less important data
While retaining the quality of top-N results!
Question: what data is less important?
Some answers:
That of poorly-scoring documents
That of common (less selective) terms
Dynamic pruning: skips less relevant data during query processing → runtime cost...
But can we do this work in advance (static pruning)?
What do we need for top-N results?
Work backwards
“Foreach” common query:
Run it against the full index
Record the top-N matching documents
“Foreach” document in the results:
Record the terms and term positions that contributed to the score
Finally: remove all non-recorded postings and terms
First proposed by D. Carmel (2001) for single term queries
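As a toy illustration of this work-backwards idea, restricted to single-term queries over an in-memory posting map (the names and the scoring model are assumptions, not Carmel's actual implementation):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy model of Carmel-style static pruning for single-term queries:
// postings map a term to (docId -> score). For each query term we keep only
// the postings of its top-N scoring documents; postings never recorded by
// any query are removed.
class CarmelPruneSketch {
    static Map<String, TreeSet<Integer>> prune(
            Map<String, Map<Integer, Double>> postings, List<String> queries, int n) {
        Map<String, TreeSet<Integer>> kept = new HashMap<>();
        for (String term : queries) {
            Map<Integer, Double> docs = postings.getOrDefault(term, Map.of());
            docs.entrySet().stream()
                .sorted(Map.Entry.<Integer, Double>comparingByValue().reversed())
                .limit(n)                                     // record top-N contributors
                .forEach(e -> kept.computeIfAbsent(term, t -> new TreeSet<>())
                                  .add(e.getKey()));
        }
        return kept;                                          // everything else is pruned
    }
}
```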
… but it's too simplistic:
Hmm, what about less common queries? 80/20 rule of “good enough”?
Term-level is too primitive:
Document-centric pruning
Impact-centric pruning
Position-centric pruning
[Diagram: posting lists with positions (0 quick, 1 brown, 2 fox) before and after pruning]
Query 1: brown – topN(full) == topN(pruned)
Query 2: “brown fox” – topN(full) != topN(pruned)
Smarter pruning
Not all term positions are equally important
Metrics of term and position importance:
Plain in-document term frequency (TF)
TF-IDF score obtained from top-N results of TermQuery (Carmel method)
Residual IDF – measure of term informativeness (selectivity)
Key-phrase positions, or term clusters
Kullback-Leibler divergence from a language model →
[Diagram: term/frequency curves comparing the corpus language model with a document language model]
Applications
Obviously, performance-related
Some papers claim a modest impact on quality when pruning up to 60% of postings
See LUCENE-1812 for some benchmarks confirming this claim
Removal / restructuring of (some) stored content
Legacy indexes, or ones created with a fossilized external chain
Stored field pruning
Some stored data can be compacted, removed, or restructured
Use case: source text for generating “snippets”
Split the content into sentences
Reorder sentences by a static “importance” score (e.g. how many rare terms they contain)
NOTE: this may use collection wide statistics!
Remove the bottom x% of sentences
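A minimal sketch of this sentence-level compaction, with the importance score left to the caller (illustrative names, not the LUCENE-1812 API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Sketch of stored-field compaction for snippet sources: sentences are
// reordered by a caller-supplied importance score (e.g. how many rare terms
// they contain, which may use collection-wide statistics), and the bottom
// fraction is dropped.
class SnippetSourcePruner {
    static List<String> prune(List<String> sentences,
                              ToDoubleFunction<String> importance,
                              double keepFraction) {
        List<String> sorted = new ArrayList<>(sentences);
        sorted.sort(Comparator.comparingDouble(importance).reversed());
        int keep = (int) Math.ceil(keepFraction * sorted.size());
        return sorted.subList(0, keep);        // bottom (1 - keepFraction) removed
    }
}
```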
LUCENE-1812: contrib/pruning tools and API
Based on FilterIndexReader
Produces output indexes via IndexWriter.addIndexes(IndexReader[])
Design:
PruningReader – a subclass of FilterIndexReader with the necessary boilerplate and hooks for pruning policies
StorePruningPolicy – implements rules for modifying stored fields (and list of field names)
TermPruningPolicy – implements rules for modifying term dictionary, postings and payloads
PruningTool – command-line utility to configure and run PruningReader
PruningReader
Details of LUCENE-1812
IndexWriter consumes source data filtered via PruningReader
Internal document IDs are preserved – suitable for bitset ops and retrieval by internal ID:
If the source index has no deletions
If target index is empty
[Diagram: stored fields, term dictionary, postings+payloads and term vectors of the source index pass through StorePruningPolicy and TermPruningPolicy into an IndexWriter, which builds the target index via IW.addIndexes(IndexReader...)]
API: StorePruningPolicy
May remove (some) fields from (some) documents
May as well modify the values
May rename / add fields
API: TermPruningPolicy
Thresholds (in the order of precedence):
Per term
Per field
Default
Plain TF pruning – TFTermPruningPolicy:
Removes all postings for a term where the TF (in-document term frequency) is below a threshold
Top-N term-level – CarmelTermPruningPolicy:
Runs a TermQuery search for the top-N docs
Removes all postings for a term outside the top-N docs
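The TF rule with the threshold precedence above can be sketched like this (a toy posting map, not the actual TFTermPruningPolicy code; the per-field level is omitted for brevity):

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of TF pruning: postings map term -> (docId -> tf); a posting
// survives only if its in-document frequency reaches the per-term threshold,
// falling back to the default threshold.
class TfPruneSketch {
    static Map<String, Map<Integer, Integer>> prune(
            Map<String, Map<Integer, Integer>> postings,
            Map<String, Integer> perTerm, int defaultThreshold) {
        Map<String, Map<Integer, Integer>> pruned = new TreeMap<>();
        postings.forEach((term, docs) -> {
            int min = perTerm.getOrDefault(term, defaultThreshold);
            Map<Integer, Integer> kept = new TreeMap<>();
            docs.forEach((doc, tf) -> { if (tf >= min) kept.put(doc, tf); });
            if (!kept.isEmpty()) pruned.put(term, kept);  // terms may vanish entirely
        });
        return pruned;
    }
}
```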
Results so far...
TF pruning:
Term query recall very good
Phrase query recall very poor – expected...
Carmel pruning – slightly better term position selection, but still heavy negative impact on phrase queries
Recognizing and keeping key phrases would help:
Use a query log for frequent-phrase mining?
Use collocation miner (Mahout)?
Savings on pruning will be smaller, but quality will significantly improve
References
Static Index Pruning for Information Retrieval Systems, Carmel et al., SIGIR '01
A document-centric approach to static index pruning in text retrieval systems, Büttcher & Clarke, CIKM '06
Locality-based pruning methods for web search, de Moura et al., ACM TOIS '08
Pruning strategies for mixed-mode querying, Anh & Moffat, CIKM '06
Index pruning applied ...
Index 1: a heavily pruned index that fits in RAM:
Excellent speed
Poor search quality for many less-common query types
Index 2: a slightly pruned index that fits partially in RAM:
Good speed
Good quality for many common query types
Still poor quality for some other rare query types
Index 3: the full index on disk:
Slow speed
Excellent quality for all query types
QUESTION: Can we come up with a combined search strategy?
Tiered search
Can we predict the best tier without actually running the query?
How to evaluate if the predictor was right?
[Diagram: a tier selector predicts the best tier and an evaluator checks the answer; search box 1 holds a 70% pruned index in RAM, search box 2 a 30% pruned index on SSD, and search box 3 the full (0% pruned) index on HDD]
Tiered search: tier selector and evaluator
The best tier can be predicted (often enough):
Carmel pruning yields excellent results for simple term queries
Phrase-based pruning yields good results for phrase queries (though less often)
Quality evaluator: when is the predictor wrong?
Could be very complex, based on a gold standard and qrels
Could be very simple: acceptable number of results
Fall-back strategy:
Serial: poor latency, but minimizes load on bulkier tiers
Partially parallel:
submit to the next tier only the border-line queries
Pick the first acceptable answer – reduces latency
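The serial fall-back can be sketched as a loop over tiers with a deliberately simple evaluator (illustrative API, assuming "enough hits" as the acceptance test):

```java
import java.util.List;
import java.util.function.Function;

// Sketch of the serial fall-back strategy: try the most heavily pruned tier
// first; a very simple quality evaluator ("enough results") decides whether
// to fall through to the next, bulkier tier.
class TieredSearchSketch {
    static <Q, R> List<R> search(Q query,
                                 List<Function<Q, List<R>>> tiers,
                                 int minAcceptableHits) {
        List<R> results = List.of();
        for (Function<Q, List<R>> tier : tiers) {
            results = tier.apply(query);
            if (results.size() >= minAcceptableHits) {
                return results;            // evaluator accepted this tier's answer
            }
        }
        return results;                    // the full index is the last resort
    }
}
```

The partially parallel variant would instead submit borderline queries to the next tier while the current one runs, picking the first acceptable answer.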
Tiered versus distributed
Both are applicable to indexes and query loads exceeding single-machine capabilities
Distributed sharded search: increases latency for all queries (send + execute + integrate from all shards)
… plus replicas to increase QPS:
Increases hardware / management costs
While not improving latency
Tiered search:
Excellent latency for common queries
More complex to build and maintain
Arguably lower hardware cost for comparable scale / QPS
Tiered search benefits
Majority of common queries handled by first tier: RAM-based, high QPS, low latency
Partially parallel mode reduces average latency for more complex queries
Hardware investment likely smaller than for distributed search setup of comparable QPS / latency
Example Lucene API for tiered search
Could be implemented as a Solr SearchComponent...
Lucene implementation details
References
Efficiency trade-offs in two-tier web search systems, Baeza-Yates et al., SIGIR '09
ResIn: a combination of results caching and index pruning for high-performance web search engines, Baeza-Yates et al., SIGIR '08
Three-level caching for efficient query processing in large Web search engines, Long & Suel, WWW '05
Bit-wise search
Given a bit pattern query:
1010 1001 0101 0001
Find documents with matching bit patterns in a field
Applications:
Permission checking
De-duplication
Plagiarism detection
Two variants: non-scoring (filtering) and scoring
Non-scoring bitwise search (LUCENE-2460)
Builds a Filter from the intersection of:
A DocIdSet of documents matching a Query
An integer value and operation (AND, OR, XOR)
A “value source” that caches the integer values of a field (from FieldCache)
Corresponding Solr field type and QParser: SOLR-1913
Useful for filtering (not scoring)
[Diagram: docIDs 0–4 carry flags 0x01–0x05 and a type field (a, b, b, a, a); the DocIdSet of “type:a” is intersected with the bitwise test (val=0x01, op=AND) over the cached flags values to produce the Filter]
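A sketch of the filtering idea over plain arrays (illustrative, not the LUCENE-2460 API; the AND match rule – all query bits set in the cached field value – is an assumption here, and the OR/XOR variants are omitted):

```java
import java.util.BitSet;

// Sketch of a non-scoring bitwise filter over a cached int field: the docs
// already matched by the query are intersected with those whose flags value
// passes the bitwise test.
class BitwiseFilterSketch {
    static BitSet filter(BitSet queryMatches, int[] flags, int value) {
        BitSet result = new BitSet(flags.length);
        for (int docId = queryMatches.nextSetBit(0); docId >= 0;
                 docId = queryMatches.nextSetBit(docId + 1)) {
            if ((flags[docId] & value) == value) {   // AND test on the cached value
                result.set(docId);
            }
        }
        return result;
    }
}
```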
Scoring bitwise search (SOLR-1918)
A BooleanQuery in disguise: 1010 = Y-1000 | N-0100 | Y-0010 | N-0001
Solr 32-bit BitwiseField – its Analyzer creates the bitmasks field
Currently supports only a single value per field
Creates a BooleanQuery from the query's int value
Useful when searching for best matching (ranked) bit patterns
[Table: D1 flags=1010 → bits Y1000 N0100 Y0010 N0001; D2 flags=1011 → Y1000 N0100 Y0010 Y0001; D3 flags=0011 → N1000 N0100 Y0010 Y0001]
Q = bits:Y1000 bits:N0100 bits:Y0010 bits:N0001
Results: D1 matches 4 of 4 → #1; D2 matches 3 of 4 → #2; D3 matches 2 of 4 → #3
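The pattern-to-terms expansion and the resulting ranking can be sketched as follows (illustrative names, not the actual SOLR-1918 field type):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the scoring bitwise idea: a bit pattern expands into one Y/N
// term per bit (1010 -> Y1000 N0100 Y0010 N0001), and documents are ranked
// by how many of those terms they match -- a BooleanQuery of optional
// clauses in disguise.
class BitwiseScoreSketch {
    static List<String> patternTerms(int pattern, int width) {
        List<String> terms = new ArrayList<>();
        for (int bit = width - 1; bit >= 0; bit--) {
            String mask = String.format("%" + width + "s",
                    Integer.toBinaryString(1 << bit)).replace(' ', '0');
            terms.add(((pattern >> bit & 1) != 0 ? "Y" : "N") + mask);
        }
        return terms;
    }

    // number of Y/N clauses a document's bits satisfy = its rank key
    static int score(int docBits, int pattern, int width) {
        int matches = 0;
        for (int bit = 0; bit < width; bit++) {
            if ((docBits >> bit & 1) == (pattern >> bit & 1)) matches++;
        }
        return matches;
    }
}
```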
Summary
Index post-processing covers a range of useful scenarios:
Merging and splitting, remodeling, extracting, moving…
Pruning less important data
Tiered search + pruned indexes:
High performance
Practically unchanged quality
Less hardware
Bitwise search:
Filtering by matching bits
Ranking by best matching patterns
Meta-summary
Stir your imagination
Think outside the box
Show some unorthodox use and practical applications
Close ties to scalability, performance, distributed search and query latency
Q & A
Thank you!
Massive indexing with map-reduce
Map-reduce indexing models:
Google model
Nutch model
Modified Nutch model
Hadoop contrib/indexing model
Tradeoff analysis and recommendations
Google model
Map():
  IN: <seq, docText>
  terms = analyze(docText)
  foreach (term)
    emit(term, <seq, position>)
Reduce():
  IN: <term, list(<seq, pos>)>
  foreach (<seq, pos>)
    docId = calculate(seq, taskId)
    Postings(term).append(docId, pos)
Pros: analysis on the map side
Cons:
Too many tiny intermediate records → Combiner
DocID synchronization across map and reduce tasks
Lucene: very difficult (impossible?) to create an index this way
Nutch model (also in SOLR-1301)
Map():
  IN: <seq, docPart>
  docId = docPart.get(“url”)
  emit(docId, docPart)
Reduce():
  IN: <docId, list(docPart)>
  doc = luceneDoc(list(docPart))
  indexWriter.addDocument(doc)
Pros: easy to build a Lucene index
Cons:
Analysis on the reduce side
Many costly merge operations (large indexes built from scratch on the reduce side)
(plus currently needs a copy from the local FS to HDFS – see LUCENE-2373)
Modified Nutch model (N/A...)
Map():
  IN: <seq, docPart>
  docId = docPart.get(“url”)
  ts = analyze(docPart)
  emit(docId, <docPart, ts>)
Reduce():
  IN: <docId, list(<docPart, ts>)>
  doc = luceneDoc(list(<docPart, ts>))
  indexWriter.addDocument(doc)
Pros:
Analysis on the map side
Easy to build a Lucene index
Cons:
Many costly merge operations (large indexes built from scratch on the reduce side)
(plus currently needs a copy from the local FS to HDFS – see LUCENE-2373)
Hadoop contrib/indexing model
Map():
  IN: <seq, docText>
  doc = luceneDoc(docText)
  indexWriter.addDocument(doc)
  emit(random, indexData)
Reduce():
  IN: <random, list(indexData)>
  foreach (indexData)
    indexWriter.addIndexes(indexData)
Pros:
Analysis on the map side
Many merges on the map side
Also supports other operations (deletes, updates)
Cons: serialization is costly; records are big and require more RAM to sort
Massive indexing - summary
If you first need to collect document parts → SOLR-1301 model
If you use complex analysis → Hadoop contrib/index
NOTE: there is no good integration yet of Solr and the Hadoop contrib/index module...