Collections Management Museums EMu Searching EMu Searching Explained (What’s going on under the hood!) Bernard Marshall Chief Technical Officer KE Software


Collections Management Museums

EMu Searching

EMu Searching Explained

(What’s going on under the hood!)

Bernard Marshall, Chief Technical Officer

KE Software


Overview

• The basic theory
• Tools and tuning
• Searching issues


EMu search mechanism

• Two level superimposed coding scheme for partial match retrieval

• Developed from research at the University of Melbourne (early 1980s)

• Designed to provide very high speed retrieval from very large datasets

• The more search terms provided, the faster the search

• One set of indexes for all searching (except key searches)


Record Descriptor

• Encodes the contents of one record into a single bit string
• Descriptors stored sequentially in the rec file
• Each record descriptor has the data offset (from the data file) appended

[Figure: the rec file holds rec descriptors 1 to 5 in sequence, each followed by an offset that points to the corresponding record data in the data file. The data file is drawn with record data in the order 1, 3, 2.]


Field        Terms          Bits set (k = 2)   Descriptor (b = 15)
First Name   Boris          3, 10              00010 00000 10000
Surname      Badenov        1, 4               01001 00000 00000
City         Frostbite      3, 7               00010 00100 00000
             Falls          8, 14              00000 00010 00001
Country      Pottsylvania   4, 9               00001 00001 00000
Rec Descriptor                                 01011 00111 10001

[Figure: the term, together with k, b and the column number, is fed to a pseudo-random number generator, which produces the bit numbers.]
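The table above can be checked with a short sketch. The bit numbers are copied from the slide, and bit 0 is taken to be the leftmost bit, which is what the slide's descriptors imply. The `term_bits` helper is a hypothetical stand-in for the pseudo-random number generator: the slides do not say which generator EMu uses, so it will not reproduce the slide's bit numbers.

```python
import hashlib
import random

B = 15  # descriptor length in bits (b on the slide)
K = 2   # bits set per term (k on the slide)


def term_bits(term, column_no, k=K, b=B):
    """Hypothetical generator: seed a PRNG from (column, term), draw k distinct bit numbers."""
    seed = hashlib.sha256(f"{column_no}:{term.lower()}".encode()).digest()
    rng = random.Random(seed)
    return sorted(rng.sample(range(b), k))


def format_descriptor(bit_numbers, b=B):
    """Render bit numbers as a bit string in groups of five, bit 0 leftmost."""
    bits = ["0"] * b
    for n in bit_numbers:
        bits[n] = "1"
    joined = "".join(bits)
    return " ".join(joined[i:i + 5] for i in range(0, b, 5))


# Bit numbers exactly as listed on the slide.
SLIDE_BITS = {
    ("First Name", "Boris"): [3, 10],
    ("Surname", "Badenov"): [1, 4],
    ("City", "Frostbite"): [3, 7],
    ("City", "Falls"): [8, 14],
    ("Country", "Pottsylvania"): [4, 9],
}

# The record descriptor is the OR (set union) of all term descriptors.
rec_bits = sorted(set().union(*SLIDE_BITS.values()))
print(format_descriptor(rec_bits))  # 01011 00111 10001
```

Note that "Boris" and "Frostbite" both set bit 3. Overlaps like this are what "superimposed" coding refers to.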


Record descriptor (searching)

• Generate record descriptor for search term(s)
• AND with all record descriptors to find matching record(s)

Field        Terms   Bits set (k = 2)   Descriptor (b = 15)
First Name   Boris   3, 10              00010 00000 10000
Query Descriptor                        00010 00000 10000

    00010 00000 10000   Boris query descriptor
AND 01011 00111 10001   record descriptor
  = 00010 00000 10000   resultant descriptor
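The AND test above can be sketched as follows. The descriptors are the ones on the slide. The matching rule, that a record is a candidate when the resultant descriptor equals the query descriptor, is how the slide's example reads. The function names are illustrative only.

```python
def to_int(descriptor: str) -> int:
    """Parse a descriptor such as '00010 00000 10000' into an integer bit mask."""
    return int(descriptor.replace(" ", ""), 2)


def may_match(query: str, record: str) -> bool:
    """True when every bit set in the query descriptor is also set in the record descriptor."""
    q = to_int(query)
    return (q & to_int(record)) == q


RECORD = "01011 00111 10001"  # record descriptor from the slide
BORIS = "00010 00000 10000"   # query descriptor for "Boris"

print(may_match(BORIS, RECORD))                # True
print(may_match("10000 00000 00000", RECORD))  # False: bit 0 is not set in the record
```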


False matches

• Query descriptor matches a record descriptor that does not contain the search term

Field        Terms     Bits set (k = 2)   Descriptor (b = 15)
First Name   Natasha   7, 9               00000 00101 00000
Query Descriptor                          00000 00101 00000

    00000 00101 00000   Natasha query descriptor
AND 01011 00111 10001   record descriptor
  = 00000 00101 00000   resultant descriptor


False matches

• Chance of a false match is related to bit density
• The lower the bit density, the lower the probability of a false match
• EMu uses a bit density of < 25%; that is, less than 25% of bits are one
• Probability of a false match with k = 5 is 1 in 1,024 record descriptors checked for a single term query (0.25^5 at 25% density)
• Probability for a two term query is 1 in 1,048,576
• Lower bit density requires more disk space and produces longer record descriptors
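The two probabilities quoted above follow from a simple model, sketched here under two assumptions that the slides do not spell out: bits are set independently, and the density is exactly 25% (the slide's upper bound). A false match then needs all k query bits per term to be one by chance.

```python
from fractions import Fraction


def false_match_probability(bit_density, k, terms=1):
    """Chance that all k * terms query bits are already set in an unrelated descriptor."""
    return Fraction(bit_density) ** (k * terms)


one_term = false_match_probability(Fraction(1, 4), k=5)
two_terms = false_match_probability(Fraction(1, 4), k=5, terms=2)
print(one_term, two_terms)  # 1/1024 1/1048576
```

Each extra term multiplies the false match probability by the same factor, which is consistent with the earlier claim that more search terms make the search faster.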


Segment descriptor

• Encodes the contents of multiple records into a bit string
• Descriptors stored sequentially in the seg file (bitsliced)

[Figure: rec descriptors 1, 2, 3, ... listed in sequence, with seg descriptor 1 drawn alongside the first group of rec descriptors, seg descriptor 2 alongside the next group, and so on.]


Segment descriptor

• For each group of records (Nr) a single descriptor is calculated as for a record descriptor

• Segment level has its own values for k (number of bits to set) and b (length of bit string)


Segment descriptor (searching)

• Segment searching checks Nr records per descriptor
• For efficient disk access for searching, “flip” seg file (bitslicing)
• Penalty is slower record insertions / updates (use oflow file)

00001 00000 00100 00000 01000 seg query descriptor

10011 00010 00111 00001 11001 seg descriptor 1

00011 10000 00001 01100 00100 seg descriptor 2

01000 00110 11000 00011 01001 seg descriptor 3

01001 00100 01100 00101 01000 seg descriptor 4


Segment descriptor (bitsliced)

1000 …
0011 …
0000 …
1100 …
1101 …
0100 …
0000 …
0011 …
1010 …
0000 …
0010 …
0011 …
1001 …
…

AND = 1001 …

• Each bit slice is ANDed to determine matching segments

• Matching segments are given by bit positions with a value of one
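The bit slices can be reproduced from the four segment descriptors and the segment query descriptor shown on the previous slide. The sketch below is an illustration, not EMu's file format: it transposes the descriptors so that slice i holds bit i of every segment, then ANDs only the slices selected by the query.

```python
SEG_DESCRIPTORS = [
    "10011 00010 00111 00001 11001",  # seg descriptor 1
    "00011 10000 00001 01100 00100",  # seg descriptor 2
    "01000 00110 11000 00011 01001",  # seg descriptor 3
    "01001 00100 01100 00101 01000",  # seg descriptor 4
]
SEG_QUERY = "00001 00000 00100 00000 01000"


def bitslice(descriptors):
    """Transpose: slice i is bit i of every descriptor, one character per segment."""
    rows = [d.replace(" ", "") for d in descriptors]
    return ["".join(column) for column in zip(*rows)]


def matching_segments(slices, query):
    """AND together the slices for the bits set in the query; ones mark matching segments."""
    width = len(slices[0])
    result = (1 << width) - 1  # start with every segment as a candidate
    for position, bit in enumerate(query.replace(" ", "")):
        if bit == "1":
            result &= int(slices[position], 2)
    return format(result, f"0{width}b")


slices = bitslice(SEG_DESCRIPTORS)
print(slices[:5])                            # ['1000', '0011', '0000', '1100', '1101']
print(matching_segments(slices, SEG_QUERY))  # 1001
```

The result 1001 marks segments 1 and 4 as candidates. Only the slices for the query's set bits are touched (three here), regardless of how long each descriptor is.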


Complete search sequence

• Build segment query descriptor for query terms
• Search bitslice segment file for list of matching segments
• Build record query descriptor for query terms
• Search record descriptors in matching segments for matching records
• Check each candidate record for an exact match before showing it to the user (this removes false matches)
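The five steps above can be sketched end to end. Everything named here is hypothetical: `term_bits` is a stand-in hash rather than EMu's generator, the parameter values (nr, kr, br, ks, bs) are arbitrary, the sample records are made up from names used on earlier slides, and the segment level is scanned directly rather than bitsliced to keep the sketch short.

```python
import hashlib


def term_bits(term, k, b):
    """Hypothetical hash: k distinct bit positions in range(b) for a term."""
    bits, counter = set(), 0
    while len(bits) < k:
        digest = hashlib.sha256(f"{term.lower()}:{counter}".encode()).digest()
        bits.add(int.from_bytes(digest[:4], "big") % b)
        counter += 1
    return bits


def descriptor(terms, k, b):
    """OR the bits of every term into one integer bit mask."""
    mask = 0
    for term in terms:
        for bit in term_bits(term, k, b):
            mask |= 1 << bit
    return mask


class TwoLevelIndex:
    def __init__(self, records, nr=2, kr=3, br=64, ks=2, bs=128):
        self.records, self.nr = records, nr
        self.kr, self.br, self.ks, self.bs = kr, br, ks, bs
        self.rec = [descriptor(r.split(), kr, br) for r in records]
        # One segment descriptor per group of nr records, with its own k and b.
        self.seg = [
            descriptor([t for r in records[i:i + nr] for t in r.split()], ks, bs)
            for i in range(0, len(records), nr)
        ]

    def search(self, query):
        terms = query.split()
        seg_q = descriptor(terms, self.ks, self.bs)  # step 1: segment query descriptor
        rec_q = descriptor(terms, self.kr, self.br)  # step 3: record query descriptor
        hits = []
        for s, seg in enumerate(self.seg):           # step 2: find matching segments
            if (seg & seg_q) != seg_q:
                continue
            last = min((s + 1) * self.nr, len(self.records))
            for i in range(s * self.nr, last):       # step 4: check record descriptors
                if (self.rec[i] & rec_q) != rec_q:
                    continue
                words = {w.lower() for w in self.records[i].split()}
                if all(t.lower() in words for t in terms):  # step 5: exact match
                    hits.append(i)
        return hits


RECORDS = [  # made-up sample data
    "Boris Badenov Frostbite Falls Pottsylvania",
    "Natasha Pottsylvania",
    "Boris Frostbite",
    "Natasha Falls",
    "Badenov Pottsylvania",
]
index = TwoLevelIndex(RECORDS)
print(index.search("Boris Pottsylvania"))  # [0]
print(index.search("Natasha"))             # [1, 3]
print(index.search("Natasha Frostbite"))   # []
```

The descriptor tests can only produce false positives, never false negatives, so the final exact check is what makes the result correct whatever the hash does.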


Number of disk accesses (logical)

• For a single search term with one matching record:
    ks – bits set per term (segment level)
    1 – disk read to read segment to match record descriptor

• Number of logical reads is independent of the table size
• Number of physical reads increases as table grows (but disk read ahead helps here)


Client query evaluation

• Attachment searches performed and matching IRNs on reference column added to query statement

• Reverse attachment searches performed and matching reference values added to query statement

• Local search terms added to query statement

• Also search columns added to query statement

• Search performed


What is a term?

Type       Term                 Query examples
Text       word                 Frostbite, falls
Float      number               9.12
Integer    number               12
Date       day, month, year     12-10-2010
Time       hour, min, sec       13:12:10.0
Lat/Long   deg, min, sec, dir   120 12 10.43 N
String     value                A1-124/7

• A term is the basic index component


Term modifiers

Modifier   Applicable types   Query examples
Null       all types          *, !*
Partial    text, string       ab*, a{a-z}*
Stem       text               ~electric
Phonetic   text               @smythe
Phrase     text               “Red house”

• Modifiers alter how the term is indexed


Indexing tools

• texdensity
    - Prints out the bit density for segment and record descriptors

• texanalyse
    - Prints the number of terms per record

• texconf
    - Calculate a suitable index configuration
    - Adjust configuration parameters manually


Configuration parameters

• params file in table directory
    - Override default configuration parameters
        • Bit density (rec/seg)
        • File system block size
        • False match probability (rec/seg)
        • Minimum number of records per segment
    - XML based file


Searching Issues – false matches

• Issue
    - Some queries are slow but disk activity is high
• Diagnose
    - texadmin database usage shows a high number of index false matches
    - texdensity shows high density or large standard deviation with high maximum density (check seg and rec)
    - texanalyse shows a large standard deviation for the number of index terms (check seg and rec)
• Fix
    - Reconfigure table
    - Set configuration parameters manually


Searching Issues – common terms

• Issue
    - Some queries containing common terms are slow
    - “False” segment matches
• Diagnose
    - Querying on each term individually results in a large number of matches (query is quick)
    - Querying on the combination of terms becomes slow
• Fix
    - Cluster table on a common term
    - Sort data before indexing


Searching Issues – block size mismatch

• Issue
    - Overall searching is slow but disk activity is high
    - Using ZFS with large record size
• Diagnose
    - Determine the block size of the file system used to hold index files
    - Use texconf to determine the block size used for indexing
• Fix
    - Set blocksize configuration parameter manually
    - Adjust ZFS record size to 16K


Searching Issues – RAID configuration

• Issue
    - Record updates are very slow
    - Fast disks but performance less than optimal
• Diagnose
    - Disk controller or driver is configured to use RAID 5 or 6
• Fix
    - Optimal performance in a RAID environment is RAID 1+0 (RAID 10) (stripe/mirror)
    - Ensure striping agrees with block size of file system
    - Enable striping where possible


Searching Issues – Unindexed fields

• Issue
    - Wildcard / stem / phonetic based queries are extremely slow
• Diagnose
    - Use emuindexing to check indexing of fields being queried
• Fix
    - Add Registry entries to enable the indexing required:
        • System|Setting|Table|table|Stem Index|colname;colname;...
        • System|Setting|Table|table|Phonetic Index|colname;colname;...
        • System|Setting|Table|table|Null Index|colname;colname;...
        • System|Setting|Table|table|Partial Index|colname=parts;...


Searching Issues – Range queries slow

• Issue
    - Queries containing ranges are slow
• Diagnose
    - Use emuindexing to check if range indexing is enabled
• Fix
    - Use emurangeupdate to optimise range based searching
    - Add Registry entries to enable the indexing required:
        • System|Setting|Table|table|Range Buckets|colname|bucket;...


Searching Issues – Large attachment queries

• Issue
    - Query is very slow when performing a query containing attachments and other terms
• Diagnose
    - “Optimising query” status is displayed for a long time
• Cause
    - The search engine is re-organising the query:
        (a AND b) AND (c OR d OR e OR f OR g)
      becomes
        (a AND b AND c) OR (a AND b AND d) OR (a AND b AND e) OR (a AND b AND f) OR (a AND b AND g)
• Fix
    - Rewrite the query optimiser
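The re-organisation shown above distributes the AND terms over every OR alternative. The sketch below illustrates that rewrite only; it is not EMu's optimiser, and the function names are made up. It shows why the work grows with the number of alternatives, for example the IRNs returned by an attachment search.

```python
from itertools import product


def distribute(and_terms, *or_groups):
    """Expand AND terms combined with one or more OR groups into a list of AND clauses."""
    return [list(and_terms) + list(choice) for choice in product(*or_groups)]


def render(clauses):
    """Format the clauses as (x AND y ...) OR (x AND z ...) ..."""
    return " OR ".join("(" + " AND ".join(clause) + ")" for clause in clauses)


clauses = distribute(["a", "b"], ["c", "d", "e", "f", "g"])
print(render(clauses))
# (a AND b AND c) OR (a AND b AND d) OR (a AND b AND e) OR (a AND b AND f) OR (a AND b AND g)

# With two OR groups the clause count is the product of the group sizes.
print(len(distribute(["a"], ["c", "d"], ["x", "y", "z"])))  # 6
```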


References

• EMu 4.0.01 Release Notes
    - System Tuning
• Configuration
    - www.kesoftware.com/downloads/EMu/documents/configuration.pdf
• Range Indexing
    - www.kesoftware.com/downloads/EMu/documents/Range Indexing/rangeindexing.pdf