Document Search Optimization using Lucene API
Harsha Ummerpillai ([email protected])
Shyam Gedela ([email protected])
Michigan State University
11/13/2014
During KD 2013 we presented MSU’s approach to optimizing and improving the performance of Rice document search using the Lucene API. We are back this year to share the lessons learned and the results of our implementation.
Background
Kuali Days 2014, Indianapolis
Introduction
• Background
• Goals for Lucene implementation
• Technical Recap
• Implementation
• Performance Results
• How to
• Demo
Background
• Document Search - Why is it important
• MSU implementation
  – Go Live - Jan 1 2011
    • Rice 2.1.9
    • KFS 5.0.x
    • KMM 1.x
    • OOI 1.x
  – ~4 years of operation
  – ~4 million documents
  – ~50 million search attributes
Goals
• Goals for Lucene implementation
  – Fast: improved and consistent response times
  – Configurable: can be enabled/disabled using configuration
  – Seamless: no change to user screen
  – Scalable
  – Customizable
Document search
• Client applications define searchable attributes in Data Dictionary.
• Rice extracts the attributes and builds the index, saving key-value pairs into the DB.
• Attributes are saved into 4 different tables based on data type.
• Existing structure: 1 document to n indexed records
• Standard searchable fields
  – Status codes
  – Initiator
  – Approver
  – Action dates
• Custom attributes defined by document types
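The "1 document to n indexed records" structure can be pictured with a small Java sketch. It is only an illustration under stated assumptions: the class name, the `Bucket` names, the routing rule, and the sample attribute names are all hypothetical and are not taken from Rice; the slide only says that attributes are stored as key-value pairs in four tables chosen by data type.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SearchAttributeRows {

    // Hypothetical stand-ins for the four type-specific tables.
    enum Bucket { STRING, INTEGER, DECIMAL, DATE }

    // One indexed record: a key-value pair tied back to its document.
    static final class Row {
        final String documentId;
        final String key;
        final String value;
        final Bucket bucket;

        Row(String documentId, String key, String value, Bucket bucket) {
            this.documentId = documentId;
            this.key = key;
            this.value = value;
            this.bucket = bucket;
        }
    }

    // Assumed routing rule: pick the bucket from the Java type of the value.
    static Bucket bucketFor(Object value) {
        if (value instanceof Integer || value instanceof Long) return Bucket.INTEGER;
        if (value instanceof Number) return Bucket.DECIMAL;
        if (value instanceof LocalDate) return Bucket.DATE;
        return Bucket.STRING;
    }

    // One document with n attributes becomes n rows.
    static List<Row> toRows(String documentId, Map<String, Object> attributes) {
        List<Row> rows = new ArrayList<>();
        for (Map.Entry<String, Object> e : attributes.entrySet()) {
            rows.add(new Row(documentId, e.getKey(),
                    String.valueOf(e.getValue()), bucketFor(e.getValue())));
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Object> attrs = new LinkedHashMap<>();
        attrs.put("accountNumber", "1031400");           // hypothetical attribute
        attrs.put("itemCount", 3);                        // hypothetical attribute
        attrs.put("createDate", LocalDate.of(2014, 11, 13));
        for (Row r : toRows("4001", attrs)) {
            System.out.println(r.documentId + " " + r.bucket + " " + r.key + "=" + r.value);
        }
    }
}
```

With roughly 4 million documents and 50 million search attributes, this fan-out is what makes the relational search expensive and motivates a separate index.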
Technical features
• Documents are queued for Lucene indexing with four separate stages
  – WAIT_FOR_REALTIME("0")
  – READY_FOR_REALTIME("1")
  – WAIT_FOR_MASTER("2")
  – READY_FOR_MASTER("3")
• Two indexes: master and real-time
• Master refreshed 3 times a day
• Real-time index refreshed every 5 seconds
• Single master node writes index to shared file storage
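The four queue stages read like Java enum constants with a string code. A minimal sketch of such an enum follows; the constant names and codes come from the slide, while the enum name `IndexQueueStage`, the wrapper class, and the `fromCode` helper are assumptions made for illustration and may not match the actual patch. The sketch does not model the order in which a document moves between stages, since the slides do not state it.

```java
import java.util.Arrays;

public class IndexQueueStageDemo {

    // Hypothetical enum name; constants and codes are as listed on the slide.
    enum IndexQueueStage {
        WAIT_FOR_REALTIME("0"),
        READY_FOR_REALTIME("1"),
        WAIT_FOR_MASTER("2"),
        READY_FOR_MASTER("3");

        private final String code;

        IndexQueueStage(String code) {
            this.code = code;
        }

        String getCode() {
            return code;
        }

        // Assumed helper: map a stored code back to its stage.
        static IndexQueueStage fromCode(String code) {
            return Arrays.stream(values())
                    .filter(stage -> stage.code.equals(code))
                    .findFirst()
                    .orElseThrow(() ->
                            new IllegalArgumentException("Unknown stage code: " + code));
        }
    }

    public static void main(String[] args) {
        // List every stage alongside the code that would be stored in the queue.
        for (IndexQueueStage stage : IndexQueueStage.values()) {
            System.out.println(stage.getCode() + " -> " + stage);
        }
    }
}
```

Storing a short code rather than the constant name keeps the queue table compact and lets the constants be renamed without touching stored data.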
Index Storage
• Directory structure within Lucene Index store
  – temp: Storage location before merge into active index
  – meta-info: Index stats and message files
Performance Test Scenarios
• 7 business scenarios
• Invaluable for daily operations
  – E.g. how many payment requests are department approved but have not been extracted by PDP (vendors not paid)
Performance Charts - Comparison
[Chart: search performance comparison, No Lucene vs. Lucene, for the scenarios ACCT, Approver, PO, REQS, PCDO, PREQ and CM; value axis from 0 to 350000, units not recoverable]
We have created an open contribution JIRA, CONTRIB-95, and are happy to provide the latest fixes and patches from our production environment.
How To
How to guide
• Visit contribution JIRA https://jira.kuali.org/browse/CONTRIB-95
• Download and apply the patch file to the Rice (base version 2.1.9) workspace.
• Add Lucene configuration properties to the Rice application configuration file.
• Set up the shared file store location where the index will be saved and shared.
• Add the Lucene index queue table using lucene-setup.sql
• Build and start the Rice application with Lucene configuration enabled.
• Visit “Administration > Lucene Administration” and click “Build Master Index”.
• Click the refresh link to see the status; when the index.ready file is listed, the master index is ready for use.
• Create a document and check that it is available in search; if the real-time indexer is working, the new document should appear in search results within 5 to 10 seconds.
• Use the administration page to see the latest status and manage the index.
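The readiness check in the steps above (the master index is usable once index.ready is listed) could also be scripted. The sketch below is a guess at how that might look, not part of the patch: the class and method names are hypothetical, and the directory that holds index.ready is passed in as a parameter because the slides do not say where in the index store the file lives.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class MasterIndexReadyCheck {

    // Marker file name as mentioned in the how-to steps.
    static final String READY_MARKER = "index.ready";

    // Returns true when the marker file exists in the given directory.
    // The directory location is an assumption supplied by the caller.
    static boolean isMasterIndexReady(Path statusDir) {
        return Files.isRegularFile(statusDir.resolve(READY_MARKER));
    }

    public static void main(String[] args) throws IOException {
        // Demonstration against a throwaway directory, not a real index store.
        Path statusDir = Files.createTempDirectory("lucene-status");
        System.out.println("ready before build: " + isMasterIndexReady(statusDir));

        Files.createFile(statusDir.resolve(READY_MARKER));
        System.out.println("ready after marker: " + isMasterIndexReady(statusDir));
    }
}
```

In a real deployment the path would point at the shared file store configured for the index, and the check would be repeated until the marker appears.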
References
• Lucene http://lucene.apache.org/core/
• KD 2013 Presentation https://jira.kuali.org/secure/attachment/77886/KD-2013-Optimizing-Document-Search-using-Lucene.pptx
• CONTRIB-95 https://jira.kuali.org/browse/CONTRIB-95