
Benchmarking Oracle 8i Intermedia Text

• Background for this benchmark

• Interesting new features in OIMT

• Benchmarking, methodology and problems

• Results

• Conclusions

ODF 000419 Benchmarking Oracle8i Intermedia Text


Background for this benchmark

The task of the thesis project

The EDMS Search Engine

CERN’s EDMS

CADIM/EDB, managing product data documents

MP5, managing physical components

Oracle 7, database platform

Implement Oracle8i in the future?


Oracle 8i Intermedia

A product embedded in Oracle8i

Intermedia allows a unified technique for accessing various types of data such as:

Text, documents, images, audio and video


Features in Intermedia Text

Database integration

• Mixed queries against multiple text columns

• Single SQL API

• Indexes on most database columns

• All index data in the database

• Automatic triggers on text columns to detect changes

Full-text search

• Exact word and phrase search, operators, removing stopwords

• XML and HTML document section searching


Features in Intermedia Text

About Search (All languages)

• Parses any text search to perform an optimal search

• Complements full-text search

Theme Identification (English only)

• Identifies strong themes in documents using a ”Theme base”

• ”Thematic” information is put into the index

• Theme search is available via the ABOUT operator

• May be customized for specific terminology


Features in Intermedia Text

Document services

• View documents as plain text or HTML

• Store documents in a database, in a file system or at a URL address

Multilingual text search

• Full-text search in most languages including Japanese, Chinese and Korean

• Support for all Oracle-NLS character sets

• Stemming and Fuzzy search for Dutch, English, French, German and Spanish

• Base-letter support and alternate spelling for Western European languages


Features in Intermedia Text

To be investigated:

• The text indexing technique

• The new query operators

• How to use this in the EDMS Search Engine


OIMT Text Indexes

An OIMT text index can be created on:

• varchar2 columns

• Large object (LOB) columns

• BFILE columns, indexing entire files

Allowing queries like:

select id from table where contains(column, 'Atlas')>0;

1 One of the detectors in the LHC ring is ATLAS

2 The Atlasmountains are situated in the north of Africa

-> 1

… contains(column, 'Atlas inner detector')>0;

1 The problems with the ATLAS inner detector...

2 The inner detector of ATLAS...

-> 1
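The two examples above suggest that CONTAINS matches whole words regardless of case, and that a multi-word query matches only as a contiguous phrase. A minimal Python sketch of that behaviour follows. It assumes a simple tokenizer that splits on anything other than letters and digits; the function names `tokens` and `contains` are illustrative and are not OIMT's API, and no score is computed.

```python
import re

def tokens(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains(text, query):
    """Return True if the query words occur in text as a contiguous phrase."""
    doc, phrase = tokens(text), tokens(query)
    n = len(phrase)
    return any(doc[i:i + n] == phrase for i in range(len(doc) - n + 1))

rows = {
    1: "One of the detectors in the LHC ring is ATLAS",
    2: "The Atlasmountains are situated in the north of Africa",
}
# "Atlasmountains" is a single token, so only row 1 matches the word "Atlas".
print([rid for rid, text in rows.items() if contains(text, "Atlas")])

phrases = {
    1: "The problems with the ATLAS inner detector...",
    2: "The inner detector of ATLAS...",
}
# Row 2 has the same words in a different order, so the phrase does not match.
print([rid for rid, text in phrases.items()
       if contains(text, "Atlas inner detector")])
```

Both print statements select row 1 only, in line with the results shown on the slide.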

Query optimization

Selects the best execution plan based on analyze table … compute statistics;


OIMT Text Indexes

An OIMT text index is an ”inverted” index

1 the LHC accelerator

2 the LEP accelerator

-> accelerator: row #1 position 2, row #2 position 2

LHC: row #1 position 1

LEP: row #2 position 1
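The example above can be reproduced with a small Python sketch. It assumes that "the" is a stopword that is dropped before positions are assigned (which is what puts "accelerator" at position 2), and that rows and positions are counted from 1. The names `STOPWORDS` and `build_inverted_index` are illustrative and say nothing about how OIMT lays out its index tables.

```python
from collections import defaultdict

STOPWORDS = {"the"}  # assumed stoplist, just enough for this example

def build_inverted_index(rows):
    """Map each token to a list of (row id, position) postings."""
    index = defaultdict(list)
    for row_id, text in rows.items():
        words = [w for w in text.lower().split() if w not in STOPWORDS]
        for position, word in enumerate(words, start=1):
            index[word].append((row_id, position))
    return dict(index)

rows = {1: "the LHC accelerator", 2: "the LEP accelerator"}
index = build_inverted_index(rows)
for token in ("accelerator", "lhc", "lep"):
    print(token, index[token])
```

A lookup for a word is then a single dictionary access that returns every row and position where the word occurs, rather than a scan of all rows.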

A text index consists of five objects:

Four tables (I, K, N, R) and one B-tree index (X)

Statement to create an OIMT text index:

create index myindex on table(column)
indextype is ctxsys.context;

Note that the table must have a primary key

Updating: two choices

Automatically, by starting ctxsrv and committing

Rebuilding ”manually”


OIMT Text Indexes

The indexing ”pipeline”: Datastore, Filter, Sectioner, Lexer

Datastore

• Loops over the rows and reads data out of the column

• Or reads from remote servers via pointers, accessed over HTTP or FTP

Filter

• Transforms the data into a text representation

• Output can be in HTML or XML

Sectioner

• Takes the output from the filter and converts it to plain text

• Different sectioners for different formats

• Detects important section tags

Lexer

• Separates text into tokens and words

• Removes stopwords
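As a rough illustration of how the four stages hand data to each other, here is a Python sketch. Every function is a hypothetical stand-in: the datastore reads from an in-memory dict rather than a column or a URL, the filter is a pass-through because the sample data is already HTML, the sectioner strips tags with a regular expression, and the stoplist is invented for the example.

```python
import re

STOPLIST = {"the", "of", "is", "a"}  # invented stoplist for the example

def datastore(table):
    """Loop over the rows and yield (row id, raw column data)."""
    for row_id, data in table.items():
        yield row_id, data

def filter_stage(data):
    """Turn stored data into marked-up text; here the data is already HTML."""
    return data

def sectioner(markup):
    """Record opening section tags, then reduce the markup to plain text."""
    sections = re.findall(r"<(\w+)>", markup)
    plain = re.sub(r"<[^>]+>", " ", markup)
    return sections, plain

def lexer(plain):
    """Split plain text into tokens and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", plain.lower())
    return [w for w in words if w not in STOPLIST]

def index_rows(table):
    """Run every row through datastore, filter, sectioner and lexer."""
    result = {}
    for row_id, data in datastore(table):
        sections, plain = sectioner(filter_stage(data))
        result[row_id] = (sections, lexer(plain))
    return result

table = {1: "<title>The magnets of the LHC</title>"}
print(index_rows(table))
```

Keeping each stage as a separate function mirrors the idea that each one can be swapped independently, for example a different sectioner for a different document format.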


OIMT Text Indexes

The preference system allows customization of text indexes

Classes of ”customizable” objects:

DATASTORE, FILTER, SECTION_GROUP, LEXER, WORDLIST, STOPLIST, STORAGE

Create a preference to customize an OIMT text index:

execute ctx_ddl.create_preference('my_pref', 'BASIC_LEXER');

execute ctx_ddl.set_attribute('my_pref', 'INDEX_THEMES', 'YES');

create index my_index on table(column)
indextype is ctxsys.context parameters('LEXER my_pref');

Default preferences

Preferences that are not set take their values from the system defaults


OIMT Query operators

Scoring operators: WEIGHT(*), THRESHOLD(>), ACCUM(,), MINUS(-)

Examples: …contains(column, '(edms*2) AND cms')>10;

…contains(column, 'edms, cms')>0;

A document containing both edms and cms scores higher than one containing either word alone

Word expansion operators: WILDCARD(%), FUZZY(?), STEM($), SOUNDEX(!), EQUIV(=)

Examples: …contains(column, '?mignets')>0;

Fuzzy corrects misspellings

1 The magnets of the LHC accelerator…

-> 1

…contains(column, '$go')>0;

Stem treats e.g. go, went, gone as the ”same” word, as well as plurals (magnet = magnets)

1 I will go to the cinema.

2 I went back home.

3 The train has gone.

-> 1,2,3

…contains(column, '!dog')>0;

Soundex retrieves all words that sound alike

1 I have a dog.

2 Someone has dug a hole.

-> 1,2
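Why "dog" and "dug" both match becomes clear from the classic Soundex code, which keeps the first letter, encodes the following consonants as digits and ignores vowels. The sketch below implements the standard American Soundex algorithm in Python. Whether OIMT uses exactly this variant is an assumption; the slides do not say.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(word):
    """Return the four-character American Soundex code of a word."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    code = word[0].upper()
    previous = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":  # h and w do not separate letters with equal codes
            previous = digit
    return (code + "000")[:4]  # pad with zeros or truncate to four characters

print(soundex("dog"), soundex("dug"))
```

Both words reduce to the same code because they differ only in a vowel, so a Soundex query for one retrieves the other.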

OIMT Query operators

Proximity operator: NEAR(;)

Examples: …contains(column, 'NEAR((edms, cms), 4, TRUE)')>0;

1 EDMS is available for CMS, ATLAS, LHCb...

2 The EDMS at CERN manages all product data documents of CMS

-> 1
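A rough Python sketch of the proximity test follows. It assumes that the second argument limits the number of words allowed between the two terms and that TRUE requires the terms to appear in the given order. The slides do not spell out these semantics, and the helper `near` is hypothetical.

```python
import re

def near(text, first, second, max_span, ordered=True):
    """True if the two terms occur with at most max_span words between them."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [i for i, w in enumerate(words) if w == second.lower()]
    for a in first_pos:
        for b in second_pos:
            if ordered and b < a:
                continue  # second term appears before the first
            if a != b and abs(b - a) - 1 <= max_span:
                return True
    return False

rows = {
    1: "EDMS is available for CMS, ATLAS, LHCb...",
    2: "The EDMS at CERN manages all product data documents of CMS",
}
print([rid for rid, text in rows.items() if near(text, "edms", "cms", 4)])
```

Under these assumptions row 1 has three words between the terms and matches, while row 2 has eight and does not, which agrees with the result on the slide.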

Section limiting and theme operators: WITHIN, ABOUT

ABOUT is used as a ”general” operator, optimizing the query by including stem ($) and theme search if available

Thesaurus operator: SYN

Examples: SYN(dog) == {boxer} | {rottweiler} | {terrier}

Boolean operators: AND, OR, NOT


OIMT Benchmarks

• Is SELECT … CONTAINS(column, 'word') > 0 faster than SELECT … LIKE '%word%'?

• How do indexes on entire files perform?

• What is the price in terms of memory, storage and maintenance?

• Is retrieval still fast?

• Is updating flexible?

• Do the query operators perform as expected?


Comparing CONTAINS and LIKE, retrieval times

Searchword "ALICE", matching 5% of the documents

[Plot: retrieval time [s] on the y-axis (0 to 1.8) against table size [# of rows] on the x-axis (6009, 12018, 24036, 48072). Two series: "contains" (using OIMT's CONTAINS) and "like" (full table scan using LIKE, as in EDMS Search today).]


Comparing CONTAINS and LIKE, retrieval times

Searchword "POLYIMIDE", matching 0.01% of the documents

[Plot: retrieval time [s] on the y-axis (0 to 1.8) against table size [# of rows] on the x-axis (6009, 12018, 24036, 48072). Two series: "contains" (using OIMT's CONTAINS) and "like" (full table scan using LIKE, as in EDMS Search today).]


Indexing entire files

Tests with fewer than 1000 files work fine

Serious problems were encountered when indexing up to 9000 files:

• Security bug

A user gets the database owner's OS privileges when accessing files through FILE_DATASTORE.

Special roles will be introduced in version 8.1.7 to manage this problem.

• Bug: certain Excel files are not indexed

Said to be fixed in version 8.1.7.

• Storage problems: tablespace, temp, shared pool, LOB segments

An OIMT text index is complex; several storage parameters had to be extended, which was non-trivial.

• Irrelevant error messages

Sometimes when problems occur, irrelevant error messages or no error messages at all are returned, making troubleshooting difficult.


Indexing entire files

Statistics about the 9000 file index:

Total index size: 1.0 GB

Number of files: 8942

Accumulated file size: 4.4 GB

Total index size / accumulated file size: 23.6 %

Average index size per file: 115.8 KB/file

Average file size: 490.1 KB/file

Creation time: 7:06 hours

Creation time per file: 2.9 seconds/file

Updating: works fine, 2 seconds/inserted row

Querying: 1-3 seconds/query (depends on the query!)

Machine: Sun, 4 CPUs at 300 MHz, 1.5 GB RAM
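The derived figures can be cross-checked from the raw ones. The short Python check below recomputes the index-to-file size ratio from the two per-file averages, and the creation time per file from the total creation time. It uses only numbers stated on this slide.

```python
files = 8942
avg_index_kb = 115.8             # average index size per file
avg_file_kb = 490.1              # average file size
creation_s = 7 * 3600 + 6 * 60   # 7:06 hours expressed in seconds

ratio_percent = 100 * avg_index_kb / avg_file_kb
seconds_per_file = creation_s / files

print(f"index size / file size: {ratio_percent:.1f} %")
print(f"creation time per file: {seconds_per_file:.1f} s")
```

Both values round to the figures reported above (23.6 % and 2.9 seconds per file).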


A view of the test table

The index was created on the file reference column of this table


Conclusions

Text indexes

• Indexing entire files works, but is not trivial in version 8.1.6; to be improved in 8.1.7?

• Querying is fast, both for indexed varchar2 columns and file columns

• Updating text indexes can be done automatically

• Index preferences may be customized

• Multilingual

Query operators

• Work mainly as expected, providing powerful tools for an advanced search engine

• Wildcards are very slow when executed on file-indexed columns

O8i IMT

• A very interesting ”platform” for the future EDMS Search Engine

• Will provide tools for fast full-text searches, with both simple and advanced queries


The EDMS Search Engine Interface

Application interfaces are non-trivial to create

It is important to make the interface user-friendly

Full-text searches may retrieve many hits; limiting these may be crucial

Some ideas:

Ask the users for hints

Keep the interface as simple as possible; avoid too many graphical objects etc.

Offer both a simple and an advanced search option

Menus for choosing query operators, scoring etc.


An OIMT Test Search Engine:

9000 indexed files from the EDMS production database

http://oraweb01.cern.ch:9000/anders/owa/edms_extended_search.enter_search