Introduction to the Xapian Search Engine

Sébastien François, EPrints Lead Developer

EPrints Developer Powwow, ULCC

Presentation

Open Source Search Engine Library

Written in C++ (we use the Perl bindings)

Uses the BM25 ranking function, which provides relevance-ranked matching

“Scales well”: 100+ million documents

Oh… code that we don’t need to maintain!
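For readers unfamiliar with BM25, the sketch below computes the textbook Okapi BM25 weight of a single term in a single document. It is not Xapian's implementation, which has its own parameters and normalisation; the function name `bm25_term_weight` and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math


def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook Okapi BM25 weight of one term in one document (illustrative only)."""
    # Inverse document frequency: rarer terms get a higher weight.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: long documents are penalised.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # Term-frequency saturation: repeated occurrences give diminishing returns.
    return idf * tf * (k1 + 1.0) / (tf + norm)


# A rare term outweighs a common one when everything else is equal.
rare = bm25_term_weight(tf=2, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=5)
common = bm25_term_weight(tf=2, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=500)
print(rare > common)
```

A document's score for a query is then the sum of these weights over the query terms it contains.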

Core Concepts

Database

Document

◦ data

◦ terms

◦ values

(Xapian) Metadata management

Searching

Are you ready for it?

Core Concepts: Database

Collection of files storing indexes, positions, term frequencies, …

One writer at a time (write-lock), multiple concurrent readers

Stored in archives/<id>/var/xapian/

Supports searching multiple DBs (unused in EPrints)

Can store arbitrary metadata

Core Concepts: Document

A Document is an item returned by a search

So it’s also the meaty bit of indexing

Maps to a single data-obj in EPrints

Has three main components:

◦ data

◦ terms

◦ values

Core Concepts: Document Data

Arbitrary blob of data

Not processed by Xapian

Used to store the information needed to display the results

In EPrints, used to store the data-obj identifier in order to quickly build EPrints::List objects

Could be used to store more complex data: cached citations, a JSON/Perl representation of the data-obj

Limit: ~100MB per Document
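As an illustration of the "opaque blob" idea, the sketch below serialises a data-obj identifier (plus an optional cached citation) to JSON and reads the identifiers back from a list of search results. It is plain Python and does not use Xapian; the helper names `make_document_data` and `ids_from_results` and the field names `id` and `citation` are hypothetical.

```python
import json


def make_document_data(dataobj_id, citation=None):
    """Serialise what is needed to display a result; the index never parses it."""
    payload = {"id": dataobj_id}
    if citation is not None:
        payload["citation"] = citation  # optional cached citation
    return json.dumps(payload)


def ids_from_results(blobs):
    """Recover the data-obj identifiers from the blobs returned by a search."""
    return [json.loads(blob)["id"] for blob in blobs]


blobs = [make_document_data(42, "cached citation text"), make_document_data(7)]
print(ids_from_results(blobs))
```

Storing only the identifier keeps the blob small; caching a citation trades index size for fewer database lookups at display time.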

Core Concepts: Document Terms

Basis of relevance search: a search is a process of comparing the terms specified by a Query against the terms in the DB

Three main types of terms:

◦ Un-prefixed terms: can be seen as a general pool of indexed terms

◦ Prefixed terms: allow searching a sub-set of the information (title, authors…)

◦ Boolean terms: used to index identifiers (which don’t add any useful information to the probabilistic indexes)
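The three kinds of terms can be pictured with a toy inverted index. This is a plain-Python sketch, not the Xapian API; the prefixes `S` and `XTYPE:` and the function `index_document` are made up for the example.

```python
from collections import defaultdict

# Hypothetical prefixes for this sketch.
TITLE_PREFIX = "S"
TYPE_PREFIX = "XTYPE:"


def index_document(index, doc_id, title, doc_type):
    """Add one document's terms to a toy inverted index (term -> set of doc ids)."""
    for word in title.lower().split():
        index[word].add(doc_id)                 # un-prefixed: general pool
        index[TITLE_PREFIX + word].add(doc_id)  # prefixed: title-only search
    index[TYPE_PREFIX + doc_type].add(doc_id)   # boolean: exact identifier


index = defaultdict(set)
index_document(index, 1, "Xapian search engine", "article")
index_document(index, 2, "Search interfaces", "book")

print(sorted(index["search"]))         # both documents
print(sorted(index["XTYPE:article"]))  # only document 1
```

The same word is indexed twice on purpose: once in the general pool and once under its field prefix, so it can be found by either kind of search.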

Core Concepts: Document Terms (2)

Boolean terms are useful for filtering on exact values (e.g. subjects:PM, type:article, …). No text processing is involved; a value appears 0 or 1 times in a Document.

Textual data - TermGenerator class:

◦ Provides the Stemmer and Stopper classes (note: language-dependent)

◦ Spelling correction

◦ Exact matching (“hello world”) and the termpos joys
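Exact ("phrase") matching is the reason term positions have to be recorded at indexing time. The toy sketch below shows the principle in plain Python; it is not how Xapian stores positions, and `index_positions` and `matches_phrase` are hypothetical helpers.

```python
def index_positions(text):
    """Map each lower-cased word to the list of positions where it occurs."""
    positions = {}
    for pos, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(pos)
    return positions


def matches_phrase(positions, phrase):
    """True if the words of `phrase` occur at consecutive positions."""
    words = phrase.lower().split()
    if not words:
        return False
    return any(
        all(start + i in positions.get(word, []) for i, word in enumerate(words))
        for start in positions.get(words[0], [])
    )


doc = index_positions("say hello world to the world")
print(matches_phrase(doc, "hello world"))  # True
print(matches_phrase(doc, "world hello"))  # False
```

Without positions, both queries would match, because the document contains both words.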

Core Concepts: Document Terms (3)

Un-prefixed terms are used for the simple search

Prefixed terms are used for field-based searches (such as the advanced search)

Boolean terms are used for any identifier-type fields; this includes facets (when searching)

Core Concepts: Document Values

“Search helpers”: we use them for ordering and faceting (occurrences & available facets)

Each value (e.g. an order-value, a facet value) is stored in a numbered slot (32-bit integer)

Mappings between a meaningful string and a slot are stored in the Xapian DB as metadata

eprint.creators_name.en (1000000) is the slot for the order-value for the field “creators_name” on the dataset “eprint” for English

Core Concepts: Document Values (2)

eprint.facet.type.0 (1500300) is the 1st slot for the facet “type” on the dataset “eprint”

Used by the MultiValueSorter class to order data (when not ordered by relevance)

Used to find out the available facets (after a search) and the occurrences of their values, e.g. there are 3 items of type ‘article’, 14 items of date ‘2013’

The Xapian documentation advises keeping the number of values low (they slow down searching)

We usually limit the number of slots for a facet to 5
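Ordering by stored values rather than by relevance amounts to sorting on a tuple of slot contents. The sketch below imitates that idea in plain Python; it does not call MultiValueSorter, and the slot numbers and the `sort_by_slots` helper are invented for the example.

```python
# Toy documents: each maps a slot number to a stored value.
# The slot numbers are made up for this sketch.
NAME_SLOT, DATE_SLOT = 0, 1

docs = [
    {NAME_SLOT: "smith", DATE_SLOT: "2013"},
    {NAME_SLOT: "jones", DATE_SLOT: "2013"},
    {NAME_SLOT: "jones", DATE_SLOT: "2012"},
]


def sort_by_slots(documents, slots):
    """Order documents by the values in `slots`, compared in the given order."""
    return sorted(documents, key=lambda doc: tuple(doc.get(slot, "") for slot in slots))


ordered = sort_by_slots(docs, [NAME_SLOT, DATE_SLOT])
print([(d[NAME_SLOT], d[DATE_SLOT]) for d in ordered])
```

Values compare as strings here, which is why order-values need to be stored in a form that sorts correctly as text.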

Core Concepts: Metadata management

We need to keep track of our slot mappings in the Xapian Database (not done by Xapian for us)

EPrints reserves 1 000 000 slots per dataset:

◦ 500 000 for order-values (1 per orderable field)

◦ 500 000 for facet slots (1 per facetable value)

EPrints also stores the current slot offsets to know:

◦ where the range for the next dataset starts

◦ where the next order-value slot is

EPrints also stores some other useful information as Metadata
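The bookkeeping described above can be sketched as a small registry that hands out slot numbers and remembers the offsets. This is a toy model in plain Python, not the EPrints code: the class `SlotRegistry`, the starting base of 1 000 000 and the sequential allocation are assumptions, so it does not reproduce the actual EPrints numbering of facet slots.

```python
SLOTS_PER_DATASET = 1_000_000
ORDER_RANGE = 500_000  # first half: order-values; second half: facet slots


class SlotRegistry:
    """Toy registry mapping meaningful keys to slot numbers (hypothetical layout)."""

    def __init__(self, first_base=1_000_000):
        self.next_base = first_base  # where the next dataset's range starts
        self.datasets = {}           # dataset name -> current offsets
        self.slots = {}              # key string -> slot number

    def _dataset(self, name):
        if name not in self.datasets:
            base = self.next_base
            self.datasets[name] = {"next_order": base, "next_facet": base + ORDER_RANGE}
            self.next_base += SLOTS_PER_DATASET
        return self.datasets[name]

    def _allocate(self, key, dataset, counter):
        # Reuse an existing mapping, otherwise take the next free slot.
        if key not in self.slots:
            offsets = self._dataset(dataset)
            self.slots[key] = offsets[counter]
            offsets[counter] += 1
        return self.slots[key]

    def order_slot(self, dataset, field, lang):
        return self._allocate(f"{dataset}.{field}.{lang}", dataset, "next_order")

    def facet_slot(self, dataset, facet, n):
        return self._allocate(f"{dataset}.facet.{facet}.{n}", dataset, "next_facet")


registry = SlotRegistry()
print(registry.order_slot("eprint", "creators_name", "en"))
print(registry.facet_slot("eprint", "type", 0))
print(registry.order_slot("user", "name", "en"))
```

In a real index, the `slots` mapping and the offsets would be persisted as database metadata so that indexing and searching agree on the numbers.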

Core Concepts: Metadata management (2)

Core Concepts: Searching

Reverse process of indexing

Composed of a tree of Query objects (and sometimes a QueryParser object) linked by boolean operators

use Search::Xapian qw(:ops);

$query = Search::Xapian::Query->new( "hello" );

$query = Search::Xapian::Query->new( OP_AND, $query, Search::Xapian::Query->new( "world" ) );

Can be stringified to see how the query is interpreted (easier to read than SQL!)
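The "tree of Query objects" idea, and why stringifying it helps with debugging, can be shown with a toy class. This is plain Python, not the Xapian Query class, and the output format is invented for the sketch.

```python
class Query:
    """Toy query node: either a single term or an operator over sub-queries."""

    def __init__(self, *args):
        if len(args) == 1:
            self.op, self.term, self.children = None, args[0], []
        else:
            self.op, self.term = args[0], None
            # Bare strings are wrapped so every child is a Query node.
            self.children = [a if isinstance(a, Query) else Query(a) for a in args[1:]]

    def __str__(self):
        if self.op is None:
            return self.term
        return "(" + f" {self.op} ".join(str(c) for c in self.children) + ")"


query = Query("hello")
query = Query("AND", query, "world")
query = Query("OR", query, Query("goodbye"))
print(query)  # ((hello AND world) OR goodbye)
```

Each call wraps the previous query in a new node, so the parentheses in the printed form mirror the order in which the tree was built.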

Core Concepts: Searching - QueryParser

Parses user queries

Supports:

◦ wildcards: wild* will match wildcat

◦ boolean operators: pear AND (red OR green NOT blue)

◦ love/hate operators: crab +nebula -crustacean

◦ exact match: “lorem ipsum”

◦ synonyms: colour/color, realise/realize

◦ stemming: happiness/happy -> happi

◦ suggestions: may provide a corrected query

Features can be turned on/off (all are enabled in EPrints)
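To make the love/hate operators concrete, the sketch below splits a user query into required, excluded and optional words and applies them to a bag of document words. It is a simplified model in plain Python, not the QueryParser itself; `split_love_hate` and `matches` are hypothetical, and treating optional words as "any of" when nothing is required is an assumption of the sketch.

```python
def split_love_hate(user_query):
    """Split a query into required (+), excluded (-) and optional words."""
    required, excluded, optional = [], [], []
    for token in user_query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)
    return required, excluded, optional


def matches(doc_words, user_query):
    """Apply the love/hate rules of this sketch to a list of document words."""
    required, excluded, optional = split_love_hate(user_query)
    words = set(doc_words)
    if any(w in words for w in excluded):
        return False  # a hated word rules the document out
    if required:
        return all(w in words for w in required)  # optional words only affect ranking
    return any(w in words for w in optional)


print(split_love_hate("crab +nebula -crustacean"))
print(matches(["crab", "nebula"], "crab +nebula -crustacean"))
```

With this query, a document must mention "nebula", must not mention "crustacean", and "crab" is a bonus rather than a requirement.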

Core Concepts: Searching - Enquire

The object which runs the query

Alternative ordering methods can be applied

A MatchDecider may be provided to filter out results (in fact, we use that to compute facets)

Returns an MSet (Match Set) which contains the actual matching Documents
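One way a decider callback can double as a facet counter is sketched below: it is called for every candidate match, tallies a value as a side effect, and accepts everything. This is a plain-Python model of the idea rather than the Xapian MatchDecider interface or the EPrints code; `run_search` and `FacetCounter` are hypothetical names.

```python
from collections import Counter


def run_search(documents, wanted_term, decider=None):
    """Return the documents containing `wanted_term` that the decider accepts."""
    matches = []
    for doc in documents:
        if wanted_term not in doc["terms"]:
            continue
        if decider is not None and not decider(doc):
            continue
        matches.append(doc)
    return matches


class FacetCounter:
    """Decider that accepts every document but tallies a value as a side effect."""

    def __init__(self, field):
        self.field = field
        self.counts = Counter()

    def __call__(self, doc):
        self.counts[doc["values"][self.field]] += 1
        return True  # never filters anything out


docs = [
    {"terms": {"xapian", "search"}, "values": {"type": "article"}},
    {"terms": {"search"}, "values": {"type": "book"}},
    {"terms": {"search", "engine"}, "values": {"type": "article"}},
    {"terms": {"perl"}, "values": {"type": "article"}},
]

facets = FacetCounter("type")
mset = run_search(docs, "search", decider=facets)
print(len(mset), dict(facets.counts))
```

Because the decider only sees documents that already match the query, the counts describe the result set rather than the whole index.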

Final words

http://xapian.org

◦ architecture overview

◦ documentation

◦ advice for implementation

Questions?

EPrints implementation…
