Advanced Indexing Techniques with Apache Lucene - Payloads
Michael Busch (buschmi@apache.org)
Agenda
• Part 1: Inverted Index 101
  – Posting Lists
  – Stored Fields vs. Payloads
• Part 2: Use cases for Payloads
  – BoostingTermQuery
  – Simple facet counting
Lucene’s data structures
[Diagram: a search runs against the Inverted Index and yields Results; stored fields are then retrieved from the Store to build the Hits.]
c:\docs\shakespeare.txt:
To be or not to be.
c:\docs\einstein.txt:
The important thing is not to stop questioning.
Query: not
String comparison slow!
Solution: Inverted index
c:\docs\shakespeare.txt (document 0):
To be or not to be.
c:\docs\einstein.txt (document 1):
The important thing is not to stop questioning.
Query: not
Inverted index (term: document IDs)
be: 0
important: 1
is: 1
not: 0, 1
or: 0
questioning: 1
stop: 1
to: 0, 1
the: 1
thing: 1
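The term-to-document mapping on this slide can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Lucene's implementation: the class name `InvertedIndexSketch`, the method `build`, and the simple lower-casing tokenizer are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {
    /** Maps each lower-cased term to the ascending IDs of the documents that contain it. */
    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Hypothetical tokenizer: lower-case, then split on anything that is not a letter.
            for (String token : docs.get(docId).toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (token.isEmpty()) continue;
                List<Integer> postings = index.computeIfAbsent(token, t -> new ArrayList<>());
                // Documents are visited in ascending order, so comparing with the
                // last entry is enough to avoid listing a document twice.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("To be or not to be.");
        docs.add("The important thing is not to stop questioning.");
        // Answering the query "not" is a single map lookup instead of a scan over all text.
        System.out.println(build(docs).get("not")); // prints [0, 1]
    }
}
```

A lookup in the map replaces the slow string comparison over every document, which is the point of the slide.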
c:\docs\shakespeare.txt (document 0):
To be or not to be.
Token positions: 0 1 2 3 4 5
c:\docs\einstein.txt (document 1):
The important thing is not to stop questioning.
Token positions: 0 1 2 3 4 5 6 7
Query: "not to"
Inverted index (term: document IDs)
be: 0
important: 1
is: 1
not: 0, 1
or: 0
questioning: 1
stop: 1
to: 0, 1
the: 1
thing: 1
c:\docs\shakespeare.txt (document 0):
To be or not to be.
Token positions: 0 1 2 3 4 5
c:\docs\einstein.txt (document 1):
The important thing is not to stop questioning.
Token positions: 0 1 2 3 4 5 6 7
Query: "not to"
Inverted index (term: document ID [positions])
be: 0 [1, 5]
important: 1 [1]
is: 1 [3]
not: 0 [3], 1 [4]
or: 0 [2]
questioning: 1 [7]
stop: 1 [6]
to: 0 [0, 4], 1 [5]
the: 1 [0]
thing: 1 [2]
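How positions make the phrase query "not to" answerable can be sketched as follows. The class `PositionalIndexSketch` and its methods are hypothetical names for this example, the tokenizer is a simple assumption, and only two-term phrases are handled; Lucene's own phrase matching is more general.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class PositionalIndexSketch {
    // term -> (document ID -> token positions in ascending order)
    private final Map<String, SortedMap<Integer, List<Integer>>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Indexes one document and returns the ID assigned to it. */
    int addDocument(String text) {
        int docId = nextDocId++;
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
        return docId;
    }

    /** Returns the documents in which 'second' occurs directly after 'first'. */
    List<Integer> phrase(String first, String second) {
        List<Integer> hits = new ArrayList<>();
        SortedMap<Integer, List<Integer>> a =
                postings.getOrDefault(first, new TreeMap<Integer, List<Integer>>());
        SortedMap<Integer, List<Integer>> b =
                postings.getOrDefault(second, new TreeMap<Integer, List<Integer>>());
        for (Map.Entry<Integer, List<Integer>> entry : a.entrySet()) {
            List<Integer> secondPositions = b.get(entry.getKey());
            if (secondPositions == null) continue; // 'second' is not in this document at all
            for (int p : entry.getValue()) {
                if (secondPositions.contains(p + 1)) { // adjacent positions form the phrase
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }
}
```

With the two example documents indexed, `phrase("not", "to")` matches both: "not" is at position 3 and "to" at 4 in document 0, and they sit at positions 4 and 5 in document 1.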
c:\docs\shakespeare.txt (document 0):
To be or not to be.
Token positions: 0 1 2 3 4 5
c:\docs\einstein.txt (document 1):
The important thing is not to stop questioning.
Token positions: 0 1 2 3 4 5 6 7
Inverted index with Payloads
[Figure: the same posting lists with three columns per term: Document IDs, Positions, and Payloads. Each position is followed by its payload.]
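A posting list that carries a payload with every position could be modelled roughly like this. It is a sketch under stated assumptions: `PayloadPostingSketch` and `Occurrence` are hypothetical names, and the in-memory lists stand in for Lucene's compressed on-disk format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PayloadPostingSketch {
    /** One occurrence of a term: document, token position, and an opaque payload. */
    static final class Occurrence {
        final int docId;
        final int position;
        final byte[] payload;

        Occurrence(int docId, int position, byte[] payload) {
            this.docId = docId;
            this.position = position;
            this.payload = payload;
        }
    }

    private final Map<String, List<Occurrence>> postings = new HashMap<>();

    /** Records one occurrence; the index never interprets the payload bytes. */
    void add(String term, int docId, int position, byte[] payload) {
        postings.computeIfAbsent(term, t -> new ArrayList<>())
                .add(new Occurrence(docId, position, payload));
    }

    /** Returns the payloads stored for a term in one document, in insertion order. */
    List<byte[]> payloads(String term, int docId) {
        List<byte[]> result = new ArrayList<>();
        for (Occurrence o : postings.getOrDefault(term, new ArrayList<Occurrence>())) {
            if (o.docId == docId) result.add(o.payload);
        }
        return result;
    }
}
```

The payload is arbitrary application data, so an empty array is as valid as a one-byte boost or a four-byte hash.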
So far…
• String comparison slow
• Inverted index used to accelerate search
• Store positions in posting lists to allow phrase searches
• Store payloads in posting lists to store arbitrary data with each position
Lucene’s data structures
[Diagram: a search runs against the Inverted Index and yields Results; stored fields are then retrieved from the Store to build the Hits.]
Store
Field 1: title
Field 2: content
Field 3: hashvalue
Documents:
D0 F1 F2 F3 D1 F1 F2 D2 F1 F2 F3
Store
D0 F1 F2 F3 D1 F1 F2 D2 F1 F2 F3
• Optimized for random access
• Document-locality
Store
D0 F1 F2 F3 D1 F1 F2 D2 F1 F2 F3
Posting list with Payloads
[Figure: a posting list with columns Document IDs, Positions, and Payloads; each entry has position 0 and carries the F3 value as its payload.]
• Optimized for scanning and skipping
• Value-locality
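The difference between document-locality and value-locality can be illustrated with a toy model. Everything here is an assumption for the sake of the example: the class `LocalitySketch`, the arrays standing in for the store and the posting list, and the sample hash strings.

```java
import java.util.ArrayList;
import java.util.List;

final class LocalitySketch {
    /**
     * Document-locality: one record per document holding all of its stored fields
     * (index 0 = title, 1 = content, 2 = hashvalue when present).
     */
    static List<String> hashesFromStore(String[][] store) {
        List<String> hashes = new ArrayList<>();
        for (String[] record : store) {   // one record access per document
            if (record.length > 2) {
                hashes.add(record[2]);    // title and content are passed over to reach field 3
            }
        }
        return hashes;
    }

    /**
     * Value-locality: the hash values of all documents sit next to each other,
     * parallel to the document IDs of a single posting list.
     */
    static List<String> hashesFromPostings(int[] docIds, String[] payloads) {
        List<String> hashes = new ArrayList<>();
        for (int i = 0; i < docIds.length; i++) {
            hashes.add(payloads[i]);      // one sequential scan, nothing else is touched
        }
        return hashes;
    }
}
```

Both methods return the same values; the second one reads only the data it needs, which is why a posting list with payloads suits scanning a single field across many documents.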
Agenda
• Part 1: Inverted Index 101
  – Posting Lists
  – Stored Fields vs. Payloads
• Part 2: Use cases for Payloads
  – BoostingTermQuery
  – Simple facet counting
Payloads - API
org.apache.lucene.analysis.Token
void setPayload(Payload payload)
org.apache.lucene.index.TermPositions
int getPayloadLength();
byte[] getPayload(byte[] data, int offset)
Example: BoostingTermQuery
Analyzer:
final byte BoldBoost = 5;
…
Token token = new Token(…);
…
if (isBold) {
  // Attach the boost as a one-byte payload to tokens that were bold in the source.
  token.setPayload(new Payload(new byte[] {BoldBoost}));
}
…
return token;
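Example: BoostingTermQuery
Similarity:
Similarity boostingSimilarity = new DefaultSimilarity() {
  @Override
  public float scorePayload(byte[] payload, int offset, int length) {
    // A one-byte payload is the boost written by the analyzer.
    if (length == 1) return payload[offset];
    // Anything else falls back to the default payload score.
    return super.scorePayload(payload, offset, length);
  }
};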
Example: BoostingTermQuery
BoostingTermQuery:
Query btq = new BoostingTermQuery(new Term("field", "searchterm"));
Searching:
Searcher searcher = new IndexSearcher(…);
searcher.setSimilarity(boostingSimilarity);
…
Hits hits = searcher.search(btq);
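To see what the payload boost does to a score, here is a library-free model. It is not Lucene's scoring formula: the class `PayloadBoostSketch`, the neutral value of 1 for non-boost payloads, and the rule "base score times the average payload score of the term's occurrences" are assumptions chosen to keep the example small.

```java
final class PayloadBoostSketch {
    /** Follows the slide's scorePayload: a one-byte payload is the boost, anything else is neutral. */
    static float scorePayload(byte[] payload, int offset, int length) {
        return length == 1 ? payload[offset] : 1f;
    }

    /** Assumed combination rule: base score times the average payload score. */
    static float boostedScore(float baseScore, byte[][] payloadsOfOccurrences) {
        if (payloadsOfOccurrences.length == 0) {
            return baseScore; // no occurrences carry information, leave the score alone
        }
        float sum = 0f;
        for (byte[] payload : payloadsOfOccurrences) {
            sum += scorePayload(payload, 0, payload.length);
        }
        return baseScore * (sum / payloadsOfOccurrences.length);
    }

    public static void main(String[] args) {
        // One bold occurrence (boost 5) and one plain occurrence (empty payload).
        byte[][] occurrences = { {5}, {} };
        System.out.println(boostedScore(2f, occurrences)); // prints 6.0
    }
}
```

In this model a document whose matching term is bold outranks an otherwise identical document where the term is plain text.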
Example: Simple facet counting
Analyzer:
public TokenStream tokenStream(String fieldName, Reader reader) {
  if (fieldName.equals("_facet")) {
    // Emit exactly one token per document; its payload carries the hash of the URL.
    return new TokenStream() {
      boolean done = false;
      public Token next() {
        if (done) return null;
        Token token = new Token(…);
        token.setPayload(new Payload(computeHash(url)));
        done = true;
        return token;
      }
    };
  }
  …
}
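The slide does not show `computeHash`. One possible shape, assuming the facet is the site (host name) of the URL and a four-byte payload is acceptable, is sketched below. The class name `SiteHashPayloadSketch`, the use of `String.hashCode`, and the `decode` helper are all assumptions for this example.

```java
import java.net.URI;
import java.nio.ByteBuffer;

final class SiteHashPayloadSketch {
    /** Hypothetical computeHash: hashes the host part of the URL into four big-endian bytes. */
    static byte[] computeHash(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            host = url; // not an absolute URL with a host, hash the whole string instead
        }
        return ByteBuffer.allocate(4).putInt(host.hashCode()).array();
    }

    /** Reads the hash back from a payload buffer, starting at the given offset. */
    static int decode(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 4).getInt();
    }
}
```

Two pages from the same site produce the same payload, so a collector can group hits by the decoded integer without loading any stored field.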
Example: Simple facet counting
HitCollector:
• Use different PriorityQueues for different sites
• Instead of returning top-n results of the whole data set, return top-n results per site
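The per-site priority queues might look like the following sketch. The class `PerSiteTopNSketch` is hypothetical; its `collect` method plays the role of a hit collector callback, with the site hash passed in directly where a real collector would read it from the payload.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class PerSiteTopNSketch {
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    private static final Comparator<Hit> BY_SCORE = Comparator.comparingDouble((Hit h) -> h.score);

    private final int n;
    // site hash -> min-heap holding the best n hits seen so far for that site
    private final Map<Integer, PriorityQueue<Hit>> queues = new HashMap<>();

    PerSiteTopNSketch(int n) {
        this.n = n;
    }

    /** Called once per matching document. */
    void collect(int docId, float score, int siteHash) {
        PriorityQueue<Hit> queue =
                queues.computeIfAbsent(siteHash, s -> new PriorityQueue<Hit>(BY_SCORE));
        queue.add(new Hit(docId, score));
        if (queue.size() > n) {
            queue.poll(); // drop the lowest-scoring hit of this site
        }
    }

    /** Returns the document IDs of the best hits for one site, best first. */
    List<Integer> topDocs(int siteHash) {
        List<Hit> hits = new ArrayList<>(queues.getOrDefault(siteHash, new PriorityQueue<Hit>(BY_SCORE)));
        hits.sort(BY_SCORE.reversed());
        List<Integer> ids = new ArrayList<>();
        for (Hit hit : hits) {
            ids.add(hit.docId);
        }
        return ids;
    }
}
```

Each queue is bounded by n, so memory grows with the number of distinct sites among the hits rather than with the total number of hits.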
Summary
Example: Simple facet counting
• In this example: facet (site) used for scoring, but extendable for facet counting
• Good performance due to locality of facet values
Conclusion
• Payloads offer great flexibility
• Payloads are stored very space-efficiently
• Sophisticated data structures enable efficient skipping over payloads
• Payloads should be used whenever special data is required for finding hits and scoring
Outlook
• Finalize API (currently Beta)
• Add more out-of-the-box query types
• Per-document Payloads
Questions?