Introduction to the Xapian Search Engine
Sébastien François, EPrints Lead Developer
EPrints Developer Powwow, ULCC
Open Source Search Engine Library
Written in C++ (we use the Perl bindings)
Uses the BM25 ranking function, which provides relevance matching
“Scales well”: 100+ million documents
Oh… code that we don’t need to maintain!
Presentation
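For readers unfamiliar with BM25, the textbook form of the scoring function can be sketched as below. This is not Xapian's implementation (its BM25 weighting scheme exposes further parameters); the function name `bm25_score`, the defaults `k1=1.2` and `b=0.75`, and the tiny corpus are assumptions made for illustration.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Textbook BM25 score of one document for a bag-of-words query.

    corpus is a list of documents, each a list of terms; doc_terms is
    the document being scored. Names and defaults are illustrative.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)
        # Smoothed inverse document frequency; always positive.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation with document-length normalisation.
        norm = tf + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score

corpus = [
    ["xapian", "search", "engine"],
    ["eprints", "repository", "software"],
    ["search", "search", "relevance"],
]
scores = [bm25_score(["search", "relevance"], d, corpus) for d in corpus]
```

The third document matches both query terms (one of them twice), so it ranks first; the second matches neither and scores zero.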
Database
Document
◦ data
◦ terms
◦ values
(Xapian) Metadata management
Searching
Are you ready for it?
Core Concepts
Collection of files storing indexes, positions, term
frequencies, …
One write-lock, multiple read-locks
Stored in archives/<id>/var/xapian/
Supports multiple DBs (unused in EPrints)
Can store arbitrary metadata
Core Concepts: Database
A Document is an item returned by a search
So it’s also the meaty bit of indexing
Maps to a single data-obj in EPrints
Has three main components:
◦ data
◦ terms
◦ values
Core Concepts: Document
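The three components can be pictured with a small stand-in class. This is a toy model, not the Xapian `Document` class or its API; the field types and the sample contents are assumptions, and slot `1000000` is borrowed from the order-value example given later in the talk.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Toy stand-in for a search document (not the Xapian class)."""
    data: bytes = b""                            # opaque blob, returned as-is
    terms: set = field(default_factory=set)      # what queries match against
    values: dict = field(default_factory=dict)   # slot number -> stored value

doc = Document()
doc.data = b"42"                                 # e.g. a data-obj identifier
doc.terms.update({"xapian", "search"})           # indexed terms
doc.values[1000000] = "francois sebastien"       # an order-value in a numbered slot
```

Only `terms` take part in matching; `data` and `values` are carried along for display, ordering and faceting.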
Arbitrary blob of data
Un-processed by Xapian
Used to store information needed to display the results
Used to store the data-obj identifier in EPrints in order to
quickly build EPrints::List objects
Could be used to store more complex data: cached
citations, JSON/Perl representation of the data-obj
Limit ~100MB per Document
Core Concepts: Document Data
Basis of relevance search: a search is a process of
comparing the terms specified by a Query against the
terms in the DB
Three main types of terms:
◦ Un-prefixed terms: can be seen as a general pool of indexed terms
◦ Prefixed terms: allow searching a subset of the information (title,
authors…)
◦ Boolean terms: used to index identifiers (which don’t add any useful
information to the probabilistic indexes)
Core Concepts: Document Terms
Boolean terms useful for filtering exact values (e.g.
subjects:PM, type:article, …). No text processing involved,
values appear 0 or 1 time in Documents.
Textual data - TermGenerator class:
◦ Provides the Stemmer and Stopper classes (note: language-
dependent)
◦ Spelling correction
◦ Exact matching (“hello world”) and the termpos joys
Core Concepts: Document Terms (2)
Unprefixed terms used for the simple search
Prefixed terms used for a field-based search (such as the
advanced search)
Boolean terms used for any identifier-type of fields – this
includes facets (when searching)
Core Concepts: Document Terms (3)
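A minimal sketch of how one record might yield the three kinds of terms. The `field:term` prefix format, the field names and the function `index_terms` are hypothetical; real indexing would go through Xapian's TermGenerator, with stemming and stopping, which this toy version omits.

```python
import re

def index_terms(record, text_fields, boolean_fields):
    """Return the set of terms for one record (a dict of field -> value).

    Field names and the "field:term" prefix format are hypothetical.
    """
    terms = set()
    for name in text_fields:
        for word in re.findall(r"\w+", record.get(name, "").lower()):
            terms.add(word)                 # un-prefixed: simple search
            terms.add(f"{name}:{word}")     # prefixed: field-based search
    for name in boolean_fields:
        value = record.get(name)
        if value is not None:
            # Boolean: exact identifier, no text processing applied.
            terms.add(f"{name}:{value}")
    return terms

record = {"title": "Hello World", "type": "article", "subjects": "PM"}
terms = index_terms(record, ["title"], ["type", "subjects"])
```

Note that `article` never enters the general pool: the identifier is reachable only through its boolean term `type:article`.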
“Search helpers”: we use them for ordering and faceting
(occurrences & available facets)
Each value (e.g. an order-value, a facet value) is stored in
a numbered slot (32-bit integer)
Mappings between a meaningful string and a slot are
stored in the Xapian DB as metadata
eprint.creators_name.en (1000000) is the slot for the
order-value for the field “creators_name” on the dataset
“eprint” for English
Core Concepts: Document Values
eprint.facet.type.0 (1500300) is the 1st slot for a facet
“type” on the dataset eprint
Used by the MultiValueSorter class to order data (when
not ordered by relevance)
Used to find out available facets (after a search) and the
occurrences of the values e.g. there are 3 items of type
‘article’, 14 items of date ‘2013’
Xapian documentation advises keeping the number of
values low (too many values slow down searching)
We usually limit the number of slots for a facet to 5
Core Concepts: Document Values (2)
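Both uses of values can be sketched in plain Python. Each match is reduced to its slot-to-value mapping; `DATE_SLOT` is a hypothetical slot number, while the other two reuse the examples above. The tuple sort only imitates what a multi-value sorter does and is not the Xapian MultiValueSorter API.

```python
from collections import Counter

NAME_SLOT = 1000000    # order-value for creators_name (from the slide)
DATE_SLOT = 1000001    # hypothetical second order-value slot
TYPE_SLOT = 1500300    # first facet slot for "type" (from the slide)

# Each matched document, reduced to its slot -> value mapping.
matches = [
    {NAME_SLOT: "smith", DATE_SLOT: "2013", TYPE_SLOT: "article"},
    {NAME_SLOT: "francois", DATE_SLOT: "2013", TYPE_SLOT: "book"},
    {NAME_SLOT: "jones", DATE_SLOT: "2012", TYPE_SLOT: "article"},
]

# Ordering: compare on the date slot first, then on the name slot.
ordered = sorted(matches, key=lambda v: (v[DATE_SLOT], v[NAME_SLOT]))

# Faceting: count how often each value occurs in the facet slot.
facet_counts = Counter(v[TYPE_SLOT] for v in matches)
```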
We need to keep track of our slot mappings in the Xapian
Database (not done by Xapian for us)
EPrints reserves 1 000 000 slots per dataset:
◦ 500 000 for order-values (1 per orderable field)
◦ 500 000 for facet slots (1 per facetable value)
EPrints also stores the current slot offsets to know:
◦ where the range for the next dataset starts
◦ where the next order-value slot is
EPrints also stores some other useful information as
Metadata
Core Concepts: Metadata management
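The bookkeeping described above can be sketched as a small registry. This is not the EPrints code: the class `SlotRegistry` and its method names are hypothetical, a plain dict stands in for the Xapian DB metadata, and the first range starting at 1 000 000 is inferred from the `eprint.creators_name.en` example. Range-overflow checks are left out.

```python
SLOTS_PER_DATASET = 1_000_000
ORDER_SLOTS = 500_000          # first half of a range: order-values
SLOTS_PER_FACET = 5            # the usual number of slots per facet

class SlotRegistry:
    """Toy key -> slot registry; the dict stands in for Xapian DB metadata."""

    def __init__(self, first_base=1_000_000):
        self.metadata = {}           # persisted "name" -> slot mappings
        self.next_base = first_base  # where the next dataset's range starts
        self.cursors = {}            # dataset -> [next order slot, next facet slot]

    def _dataset(self, dataset):
        if dataset not in self.cursors:
            base = self.next_base
            self.cursors[dataset] = [base, base + ORDER_SLOTS]
            self.next_base += SLOTS_PER_DATASET
        return self.cursors[dataset]

    def order_slot(self, dataset, field, lang):
        key = f"{dataset}.{field}.{lang}"
        if key not in self.metadata:
            cursor = self._dataset(dataset)
            self.metadata[key] = cursor[0]
            cursor[0] += 1
        return self.metadata[key]

    def facet_slots(self, dataset, facet):
        cursor = self._dataset(dataset)
        keys = [f"{dataset}.facet.{facet}.{i}" for i in range(SLOTS_PER_FACET)]
        if keys[0] not in self.metadata:
            for key in keys:
                self.metadata[key] = cursor[1]
                cursor[1] += 1
        return [self.metadata[key] for key in keys]

registry = SlotRegistry()
name_slot = registry.order_slot("eprint", "creators_name", "en")
type_slots = registry.facet_slots("eprint", "type")
user_slot = registry.order_slot("user", "name", "en")
```

Asking for the same key twice returns the stored slot, which is the point of keeping the mapping as metadata: slot numbers must stay stable between indexing and searching.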
Core Concepts: Metadata management (2)
Reverse process of indexing
Composed of a tree of Query objects (and sometimes a
QueryParser object) linked by boolean operators
use Search::Xapian qw(:ops);
$query = Search::Xapian::Query->new( "hello" );
$query = Search::Xapian::Query->new( OP_AND, $query, "world" );
Can be stringified to see how the query is interpreted
(easier to read than SQL!)
Core Concepts: Searching
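To make the tree idea concrete, here is a toy `Query` class that mirrors the two-line example above. It is a stand-in written for this note, not the Xapian class: it supports only `AND` and `OR`, and its string form is merely similar in spirit to what Xapian prints.

```python
class Query:
    """Toy query tree: a leaf holds a term, a branch holds an operator."""

    def __init__(self, *args):
        if len(args) == 1:
            self.op, self.children = None, [args[0]]
            return
        op, *operands = args
        if op not in ("AND", "OR"):
            raise ValueError(f"unsupported operator: {op}")
        self.op = op
        # Bare strings are wrapped so that every child is a Query.
        self.children = [c if isinstance(c, Query) else Query(c) for c in operands]

    def __str__(self):
        if self.op is None:
            return self.children[0]
        return "(" + f" {self.op} ".join(str(c) for c in self.children) + ")"

    def matches(self, terms):
        """Evaluate the tree against a document's set of terms."""
        if self.op is None:
            return self.children[0] in terms
        results = [c.matches(terms) for c in self.children]
        return all(results) if self.op == "AND" else any(results)

query = Query("hello")
query = Query("AND", query, "world")
```

Here `str(query)` gives `(hello AND world)`, and the query matches a document only if both terms are present.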
Parses user queries
Supports:
◦ wildcards: wild* will match wildcat
◦ boolean operators: pear AND (red OR green NOT blue)
◦ love/hate operators: crab +nebula -crustacean
◦ exact match: “lorem ipsum”
◦ synonyms: colour/color, realise/realize
◦ stemming: happiness/happy -> happi
◦ suggestions: may provide a corrected query
Features can be turned on/off (all are enabled in EPrints)
Core Concepts: Searching - QueryParser
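The love/hate operators are the least familiar item in this list, so here is a rough sketch of their meaning: `+term` must be present, `-term` must be absent, and bare terms are optional once something is required. The functions `split_love_hate` and `matches` are hypothetical helpers; the real QueryParser handles far more (phrases, brackets, stemming) and also uses optional terms for ranking.

```python
def split_love_hate(user_query):
    """Split a query string into required, excluded and optional terms."""
    required, excluded, optional = [], [], []
    for token in user_query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)
    return required, excluded, optional

def matches(user_query, doc_terms):
    """Decide whether a document (a set of terms) satisfies the query."""
    required, excluded, optional = split_love_hate(user_query)
    if any(term in doc_terms for term in excluded):
        return False
    if required:
        # Optional terms no longer decide the match, only the required ones.
        return all(term in doc_terms for term in required)
    return any(term in doc_terms for term in optional)

parts = split_love_hate("crab +nebula -crustacean")
```

With this query, a document about the Crab Nebula matches whether or not it mentions "crab", while anything mentioning "crustacean" is rejected.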
The object which runs the query
Alternative ordering methods can be applied
A MatchDecider method may be provided to filter out
results (in fact, we use that to compute facets)
Returns an MSet (Match Set) which contains the actual
matching Documents
Core Concepts: Search - Enquire
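The trick of computing facets through the match decider can be sketched as follows. Everything here is a toy: `enquire` is a hypothetical function standing in for the Enquire object, documents are plain dicts, and matching is a simple OR over terms with no ranking. The point is only that the decider sees every candidate, so it can count as a side effect while accepting everything.

```python
from collections import Counter

def enquire(documents, query_terms, decider=None):
    """Return the documents matching any query term (a toy 'MSet').

    documents is a list of dicts with "id", "terms" and "type" keys;
    decider, if given, sees every candidate and may reject it.
    """
    mset = []
    for doc in documents:
        if not query_terms & doc["terms"]:
            continue
        if decider is not None and not decider(doc):
            continue
        mset.append(doc)
    return mset

documents = [
    {"id": 1, "terms": {"xapian", "search"}, "type": "article"},
    {"id": 2, "terms": {"search"}, "type": "book"},
    {"id": 3, "terms": {"eprints"}, "type": "article"},
]

facets = Counter()

def counting_decider(doc):
    # Accept everything, but record the facet value as a side effect.
    facets[doc["type"]] += 1
    return True

mset = enquire(documents, {"search"}, decider=counting_decider)
```

Document 3 never reaches the decider because it does not match the query, so the facet counts describe the result set rather than the whole database.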
http://xapian.org
◦ architecture overview
◦ documentation
◦ advice for implementation
Questions?
EPrints implementation…
Final words