Tuning Oracle Text for
Rapid Document Retrieval
- A Case Study
Sandi Wilson, Senior Oracle DBA
Summit Information Systems, a Fiserv Company
Speaker Background
Corvallis, Oregon, US
BS in Computer Science, UCF
Oracle DBA 10+ years (7.3.4+)
Current Employment: Fiserv
– Banking applications
– Financial document archiving and retrieval
Agenda
What is Oracle Text?
Vocabulary
Scope
Structures
Parameters
Packages
Impacting Performance
Q&A
What is Oracle Text?
Earlier versions
– SQL*TextRetrieval
– TextServer
– ConText Option/Cartridge
– InterMedia Text
Robust text search and document classification
Supports a wide variety of source document types
What is Oracle Text? (cont.)
Query methods include: keywords, contexts, themes, word stems, pattern matching, fuzzy matching, HTML/XML sections
Output formats include: original document format, unformatted text, HTML with keyword highlighting, information visualization
Vocabulary
OT
– Oracle Text
Token/Search token
– a text string
Docid
– an identifier assigned by and used within the OT index to reference a source document
Preferences
– Parameter classes
Scope
CONTEXT index type
Keyword & XML section searches
10gR2 Standard edition features
Non-static source: 50-500GB
Index Creation
Use:
CREATE INDEX mytext_idx
ON Documents(text) INDEXTYPE IS ctxsys.context
PARAMETERS ('datastore mydatastore storage mystorage section group mysectiongroup lexer mylexer memory 400M');
Creates a set of internal index tables using defined or default index preferences
Index Creation (cont.)
Parses and tokenizes text according to lexer parameters
Filters by removing tokens identified in stopword list
Source documents are processed in batches and appended to the index
Internal Index Structure
DR$<index_name>$I – “Token” table
– List of tokens and where they occur
– BLOB: token_info
DR$<index_name>$X – B-tree index on $I table
DR$<index_name>$N – “Deleted” docids table
– List of invalid docids
Internal Index Structure (cont.)
DR$<index_name>$K
– Index-organized “Lookup” table
– Maps external (source data) rowids to internal index docids
DR$<index_name>$R
– Index-organized “Reverse Lookup” table
– Maps internal index docids to external rowids
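The internal objects can be inspected directly from the data dictionary. The following is a minimal sketch, assuming the index is named mytext_idx (as in the Index Creation example) and is owned by the current user; the row limit of 20 is arbitrary.

```sql
-- List the internal DR$ objects that belong to the (assumed) index MYTEXT_IDX.
SELECT object_name, object_type
FROM   user_objects
WHERE  object_name LIKE 'DR$MYTEXT_IDX$%'
ORDER  BY object_name;

-- Peek at the token table. TOKEN_INFO (the BLOB) is left out because it is binary.
SELECT token_text, token_type, token_first, token_last, token_count
FROM   dr$mytext_idx$i
WHERE  ROWNUM <= 20;
```

TOKEN_FIRST and TOKEN_LAST give the docid range covered by each row, which makes heavily fragmented tokens easy to spot.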
Index Parameters
Storage
– Define storage clauses for DR$ objects
Datastore
– Identifies where source text is located
Lexer
– Rules for converting text to tokens
Stop words
– High frequency words, excluded from indexing
Section Groups
– HTML, XML, AUTO
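Preferences such as these are created with CTX_DDL before the index is built. The sketch below covers only the datastore and storage preferences and reuses the names from the CREATE INDEX example (mydatastore, mystorage); the tablespace name text_idx_ts is a hypothetical placeholder.

```sql
BEGIN
  -- Datastore: the source text is stored directly in the indexed column.
  CTX_DDL.CREATE_PREFERENCE('mydatastore', 'DIRECT_DATASTORE');

  -- Storage: place the $I table and its $X index in a dedicated tablespace.
  -- TEXT_IDX_TS is an assumed tablespace name; substitute your own.
  CTX_DDL.CREATE_PREFERENCE('mystorage', 'BASIC_STORAGE');
  CTX_DDL.SET_ATTRIBUTE('mystorage', 'I_TABLE_CLAUSE',
                        'tablespace text_idx_ts');
  CTX_DDL.SET_ATTRIBUTE('mystorage', 'I_INDEX_CLAUSE',
                        'tablespace text_idx_ts compress 2');
END;
/
```

BASIC_STORAGE has corresponding attributes for the other internal tables (K_TABLE_CLAUSE, R_TABLE_CLAUSE, N_TABLE_CLAUSE) that can be set the same way.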
Index Packages
CTX_ADM
CTX_DDL
CTX_OUTPUT
CTX_REPORT
CTX_ADM
Set_Parameter
– Use:
CTX_ADM.SET_PARAMETER('max_index_memory','400M');
– default_index_memory
– log_directory
– default_<preference> (e.g. datastore, lexer, stoplist)
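Before changing a system parameter it helps to check its current value. A small sketch using the CTX_PARAMETERS view; UPPER() is applied only as a precaution against case differences in the stored names.

```sql
-- Show the current Oracle Text memory and logging parameters.
SELECT par_name, par_value
FROM   ctx_parameters
WHERE  UPPER(par_name) IN ('MAX_INDEX_MEMORY',
                           'DEFAULT_INDEX_MEMORY',
                           'LOG_DIRECTORY')
ORDER  BY par_name;
```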
CTX_DDL
SYNC_INDEX
– Use:
CTX_DDL.SYNC_INDEX('mytext_idx','100M');
– Processes newly added or modified documents
– Pending documents are listed in the CTX_USER_PENDING view
– Source documents are processed in batches and appended to the index
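A typical sync cycle checks the pending queue and then processes it. This is a sketch, assuming the index name mytext_idx from the earlier examples.

```sql
-- How many rows are waiting to be indexed, per index?
SELECT pnd_index_name, COUNT(*) AS pending_rows
FROM   ctx_user_pending
GROUP  BY pnd_index_name;

-- Process the pending rows using up to 100M of indexing memory.
BEGIN
  CTX_DDL.SYNC_INDEX('mytext_idx', '100M');
END;
/
```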
CTX_DDL (cont.)
Optimize_Index
– Use:
CTX_DDL.OPTIMIZE_INDEX('mytext_idx','fast');
– Defragments token info
– Removes references to deleted documents
– Modes: Fast (no garbage collection), Full, Token, Token Type, Rebuild
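The other modes are selected through the same procedure. A sketch under the same assumptions as above (index mytext_idx); the token "statement" and the 60-minute limit are illustrative values only.

```sql
BEGIN
  -- FULL: defragment and remove references to deleted documents,
  -- stopping after at most 60 minutes.
  CTX_DDL.OPTIMIZE_INDEX('mytext_idx', 'FULL', maxtime => 60);

  -- TOKEN: optimize a single token only ("statement" is a hypothetical token).
  CTX_DDL.OPTIMIZE_INDEX('mytext_idx', 'TOKEN', token => 'statement');

  -- REBUILD: rebuild the $I table.
  CTX_DDL.OPTIMIZE_INDEX('mytext_idx', 'REBUILD');
END;
/
```

In practice these calls would be run separately rather than back to back; they are grouped here only to show the syntax.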
CTX_DDL (cont.)
Create_Preference
Add_Stopword
Add_Stop_Section
Add_MDATA_Section
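As an illustration of the section procedures, the sketch below builds the section group named in the CREATE INDEX example (mysectiongroup). The XML tags title and doctype are hypothetical; use the tags that appear in your own documents.

```sql
BEGIN
  CTX_DDL.CREATE_SECTION_GROUP('mysectiongroup', 'XML_SECTION_GROUP');

  -- Ordinary searchable section for the (assumed) <title> tag.
  CTX_DDL.ADD_FIELD_SECTION('mysectiongroup', 'title', 'title');

  -- MDATA section for the (assumed) <doctype> tag: its value is indexed
  -- as document metadata rather than as regular text.
  CTX_DDL.ADD_MDATA_SECTION('mysectiongroup', 'doctype', 'doctype');
END;
/
```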
CTX_OUTPUT
Start_Log/End_Log
Add_Event/Remove_Event
– EVENT_INDEX_PRINT_ROWID
– EVENT_OPT_PRINT_TOKEN
– EVENT_INDEX_PRINT_TOKEN
Start_Query_Log/End_Query_Log
Add_Trace/Remove_Trace
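Logging is usually wrapped around a single indexing operation. A minimal sketch, assuming the index mytext_idx and a hypothetical log file name; the file is written to the directory set by the log_directory parameter, and the log is closed with CTX_OUTPUT.END_LOG.

```sql
BEGIN
  -- Open a log file (name is an assumption) in the configured log_directory.
  CTX_OUTPUT.START_LOG('mytext_sync.log');

  -- Record the rowid of each row as it is indexed.
  CTX_OUTPUT.ADD_EVENT(CTX_OUTPUT.EVENT_INDEX_PRINT_ROWID);

  CTX_DDL.SYNC_INDEX('mytext_idx', '100M');

  -- Close the log when the operation is finished.
  CTX_OUTPUT.END_LOG;
END;
/
```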
CTX_REPORT
Use to generate reports on index information and query activity
Name                   Description
DESCRIBE_INDEX         Creates a report describing the index.
DESCRIBE_POLICY        Creates a report describing a policy.
CREATE_INDEX_SCRIPT    Creates a SQL*Plus script to duplicate the named index.
CREATE_POLICY_SCRIPT   Creates a SQL*Plus script to duplicate the named policy.
CTX_REPORT (cont.)
Name                Description
INDEX_SIZE          Creates a report to show the internal objects of an index, their tablespaces and used sizes.
INDEX_STATS         Creates a report to show the various statistics of an index.
QUERY_LOG_SUMMARY   Creates a report showing query statistics.
TOKEN_INFO          Creates a report showing the information for a token, decoded.
TOKEN_TYPE          Translates a name and returns a numeric token type.
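The reports are returned as CLOBs. The sketch below, run from SQL*Plus, assumes the index mytext_idx; the table ctx_report_output is a hypothetical holding table for the INDEX_STATS report.

```sql
-- SQL*Plus settings so that the CLOB report is not truncated.
SET LONG 100000
SET PAGESIZE 0

-- Function form: show the internal objects and their sizes.
SELECT CTX_REPORT.INDEX_SIZE('mytext_idx') FROM dual;

-- Procedure form: INDEX_STATS writes into a CLOB, stored here in a
-- hypothetical holding table for later review.
CREATE TABLE ctx_report_output (result CLOB);

DECLARE
  rpt CLOB := NULL;
BEGIN
  CTX_REPORT.INDEX_STATS('mytext_idx', rpt);
  INSERT INTO ctx_report_output VALUES (rpt);
  COMMIT;
  DBMS_LOB.FREETEMPORARY(rpt);
END;
/
```

INDEX_STATS scans the whole token table, so on a large index it can take a long time to run.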
Impacting Performance
Parameters
Creation
Synchronization
Optimization
Query tuning
Other
[Diagram labels: $I, $K, $N, $R, $X, create, sync, optimize, lexer, stop, output, report]
Index Parameters and Performance
Define large max_index_memory and default_index_memory
Choose appropriate lexer preferences
– Avoid tokenizing useful composite strings, e.g. e-mail addresses
Define comprehensive stopword lists
– Avoid indexing unhelpful tokens, e.g. “account”, “credit”, “authorize”
– Use CTX_REPORT.INDEX_STATS to identify new stopword candidates
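Both recommendations translate into CTX_DDL calls. A sketch, assuming the lexer name mylexer from the CREATE INDEX example and a hypothetical stoplist name mystoplist; the printjoin characters are one plausible choice for keeping e-mail-like strings together, not the presenter's confirmed setting.

```sql
BEGIN
  -- Lexer: treat @ . _ - as part of a token so that composite strings
  -- such as e-mail addresses are indexed as a single token.
  CTX_DDL.CREATE_PREFERENCE('mylexer', 'BASIC_LEXER');
  CTX_DDL.SET_ATTRIBUTE('mylexer', 'PRINTJOINS', '@._-');

  -- Stoplist: exclude high-frequency, low-value words from the index.
  CTX_DDL.CREATE_STOPLIST('mystoplist', 'BASIC_STOPLIST');
  CTX_DDL.ADD_STOPWORD('mystoplist', 'account');
  CTX_DDL.ADD_STOPWORD('mystoplist', 'credit');
  CTX_DDL.ADD_STOPWORD('mystoplist', 'authorize');
END;
/
```

The stoplist takes effect when it is named in the index parameters, for example by adding `stoplist mystoplist` to the PARAMETERS string.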
Index Creation and Performance
Maximize memory allocation for index creation
– If possible, reduce SGA in order to increase max_index_memory
Define MDATA sections to avoid costly “mixed” queries
Use NOLOGGING storage parameter during creation, alter to LOGGING after index is created
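One way to apply the NOLOGGING advice is through the storage preference, followed by ALTER statements once the build completes. This is a sketch only: the preference name mystorage_nolog and tablespace text_idx_ts are hypothetical, and it shows just the $I table and $X index.

```sql
BEGIN
  -- Storage preference that creates the token table and its index NOLOGGING.
  CTX_DDL.CREATE_PREFERENCE('mystorage_nolog', 'BASIC_STORAGE');
  CTX_DDL.SET_ATTRIBUTE('mystorage_nolog', 'I_TABLE_CLAUSE',
                        'tablespace text_idx_ts nologging');
  CTX_DDL.SET_ATTRIBUTE('mystorage_nolog', 'I_INDEX_CLAUSE',
                        'tablespace text_idx_ts compress 2 nologging');
END;
/

-- ... CREATE INDEX mytext_idx ... PARAMETERS ('storage mystorage_nolog ...') ...

-- After the index is built, switch the internal objects back to LOGGING.
ALTER TABLE dr$mytext_idx$i LOGGING;
ALTER INDEX dr$mytext_idx$x LOGGING;
```

Objects built NOLOGGING are not recoverable from redo, so a backup after the build is advisable.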
Synchronization and Performance
Sync infrequently
Maximize memory allocation for sync
Optimization and Performance
Optimize after index creation
Optimize frequently
Rebuild mode achieves best $I defragmentation and space consolidation
Use lightweight 'token' and 'token type' modes for targeted improvements
Implement an aggressive optimization schedule
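A recurring optimization can be scheduled with DBMS_SCHEDULER. The sketch below is one possible schedule, not the presenter's: the job name, the 02:00 daily start, and the 120-minute limit are all assumptions.

```sql
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'OPTIMIZE_MYTEXT_IDX',   -- hypothetical job name
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN CTX_DDL.OPTIMIZE_INDEX(''mytext_idx'', ''FULL'', maxtime => 120); END;',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=DAILY; BYHOUR=2',  -- assumed quiet period
    enabled         => TRUE);
END;
/
```

Because FULL mode with a maxtime limit resumes where the previous run stopped, a nightly job of this kind works through the index incrementally.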
Query Tuning
Use a single CONTAINS clause
Do not nest CONTAINS clauses in an inner loop
When possible, use MDATA sections instead of mixed queries
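The query patterns above can be illustrated as follows. This is a sketch against the Documents table from the CREATE INDEX example; the id and doctype columns, the search terms, and the doctype MDATA section are hypothetical.

```sql
-- Less efficient: two CONTAINS clauses against the same index.
SELECT id
FROM   documents
WHERE  CONTAINS(text, 'mortgage') > 0
AND    CONTAINS(text, 'statement') > 0;

-- Preferred: a single CONTAINS clause that combines both terms.
SELECT id
FROM   documents
WHERE  CONTAINS(text, 'mortgage AND statement') > 0;

-- "Mixed" query: a text predicate plus a relational predicate.
SELECT id
FROM   documents
WHERE  CONTAINS(text, 'mortgage') > 0
AND    doctype = 'statement';

-- MDATA alternative: the metadata filter is resolved inside the text index.
SELECT id
FROM   documents
WHERE  CONTAINS(text, 'mortgage AND MDATA(doctype, statement)') > 0;
```

The MDATA form requires that doctype was defined as an MDATA section in the index's section group and that the documents carry the corresponding tag.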
Other Performance Considerations
Gather statistics on the index, but not the internal tables
Use KEEP pool for internal tables
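Both points can be sketched in a few statements, assuming the index mytext_idx in the current schema and a KEEP buffer pool that has already been sized (db_keep_cache_size). Only the $I table and $X index are shown.

```sql
-- Gather optimizer statistics on the Text index itself.
BEGIN
  DBMS_STATS.GATHER_INDEX_STATS(ownname => USER, indname => 'MYTEXT_IDX');
END;
/

-- Assign the most frequently read internal objects to the KEEP pool.
ALTER TABLE dr$mytext_idx$i STORAGE (BUFFER_POOL KEEP);
ALTER INDEX dr$mytext_idx$x STORAGE (BUFFER_POOL KEEP);
```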
Summary
Determine appropriate parameters for your OT application
Allocate maximum memory for OT processing
Synchronize infrequently
Optimize frequently
Tune the text queries
Q&A