sphinx full text search server


Upload: andrew-kandels

Post on 15-Jan-2015


DESCRIPTION

Sphinx is a standalone, full-text search daemon that allows advanced searching over large collections of blocks of text, either from a database or as documents on a file system. Sphinx can scale to billions of documents while still providing sub-second results to boolean queries, wildcards and other advanced search features. I cover basic setup, building a simple index, and demonstrate how to use SQL queries to retrieve results through its API.

TRANSCRIPT

Page 1: Sphinx Full Text Search Server

By Andrew Kandels

Search Server

Sphinx is an open source full text search server, designed from the ground up with performance, relevance (a.k.a. search quality), and integration simplicity in mind.

• Craigslist serves 200 million queries/day
• Used by Slashdot, Mozilla, Meetup
• Scales to billions of documents (distributed)
• Supports almost any data source (SQL, XML, etc.)
• Batch and real-time indexes

Page 2: Sphinx Full Text Search Server

What is a Search Server?

Sphinx is like a database because…

• It has a schema

• It has field types (integer, boolean, strings, dates)

• It responds to queries (SQL, API):

SELECT * FROM Books WHERE MATCH('a rose by any other name')

Page 3: Sphinx Full Text Search Server

Documents

Sphinx indexes data from just about any source.

SELECT CONCAT(a.first_name, ' ', a.last_name) AS full_name,
    COUNT(b.book_id) AS num_books,
    MIN(b.publish_date) AS first_published
FROM author a
INNER JOIN book b ON a.author_id = b.author_id
GROUP BY a.author_id, a.first_name, a.last_name

<?xml version="1.0"?>
<author>
  <id>1433</id>
  <name>Mark Twain</name>
  <books>
    <book>A Connecticut Yankee in King Arthur’s Court</book>
  </books>
</author>

Page 4: Sphinx Full Text Search Server

How it Works

Sphinx parses plain text queries and answers with rows.

Search

@author_id 15 "Mark Twain" king << arthur

Results

1. document=1433, weight=1692, createdAt=Jan 1 1889

Page 5: Sphinx Full Text Search Server

Relevance

Only the strongest will survive, but relevance is in the eye of the beholder. Some factors include:

• How many times did our keywords match?
• How many times did they repeat in the query?
• How frequently do keywords appear?
• Do keywords in the document appear in the same order as the query?
• Did we match exactly, or is it a stemmed match?
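Two of the factors above (how often the keywords match, and whether they appear in the same order as the query) can be sketched as a toy scoring function. This is only an illustration: the function name, the bonus value of 10, and the sample documents are assumptions made here, and this is not how Sphinx's built-in rankers compute weights.

```python
def toy_score(query: str, document: str) -> int:
    """Toy relevance score: keyword hits plus a bonus for an in-order phrase match."""
    q_words = query.lower().split()
    d_words = document.lower().split()
    # Factor: how many times did our keywords match?
    hits = sum(d_words.count(word) for word in set(q_words))
    # Factor: do the keywords appear in the document in the same order as the query?
    n = len(q_words)
    in_order = any(d_words[i:i + n] == q_words for i in range(len(d_words) - n + 1))
    # The bonus of 10 is an arbitrary illustrative weight.
    return hits + (10 if in_order else 0)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "king arthur and king lot",
    2: "arthur was a king",
    3: "a yankee in court",
}

# Highest score first.
ranked = sorted(docs, key=lambda doc_id: toy_score("king arthur", docs[doc_id]), reverse=True)
print(ranked)
```

Document 1 wins because it repeats a keyword and contains the words in query order; document 2 has both keywords but in the wrong order.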

Page 6: Sphinx Full Text Search Server

B-Tree Index

User

First Name  Last Name  City        State  Notes
Allison     Janney     Baltimore   MA     Cregg
John        Spencer    Des Moines  IA     McGarry
Bradley     Whitford   Newport     VA     Lyman
Martin      Sheen      Seattle     WA     Bartlett
Janel       Moloney    Hollywood   CA     Moss
Richard     Schiff     Lincoln     NE     Ziegler

Index (Last Name (4))

Row #  Contents
1      Jann
5      Molo
6      Schi
4      Shee
2      Spen
3      Whit

A B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.
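The prefix index from the example can be imitated in a few lines. Note the substitution: a sorted list searched with binary search stands in for a real B-tree here, which gives logarithmic lookups but not the cheap insertions a B-tree provides. The row numbers assume the user rows are numbered 1 to 6 in the order shown, and the `lookup` helper is a name invented for this sketch.

```python
import bisect

# User rows, numbered 1-6 in the order they appear in the example table.
users = {
    1: ("Allison", "Janney"),
    2: ("John", "Spencer"),
    3: ("Bradley", "Whitford"),
    4: ("Martin", "Sheen"),
    5: ("Janel", "Moloney"),
    6: ("Richard", "Schiff"),
}

# Index on the first 4 characters of the last name, kept in sorted order.
index = sorted((last[:4], row) for row, (_first, last) in users.items())
keys = [key for key, _row in index]


def lookup(last_name: str) -> list[int]:
    """Return the row numbers whose last name equals last_name."""
    prefix = last_name[:4]
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix)
    # The 4-character prefix only narrows the candidates; confirm the full value.
    return [row for _key, row in index[lo:hi] if users[row][1] == last_name]


print(index)
print(lookup("Sheen"))
```

The sorted index comes out in the same order as the "Row # / Contents" listing above, which is what makes the binary search possible.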

Page 7: Sphinx Full Text Search Server

Logical Queries

Logical conditions return a boolean result based on an expression:

country = "United States"
AND num_published >= 50
AND (author_id = 5 OR author_id = 8 OR author_id = 10)

Logical queries can be complex and typically evaluate against the whole value of a column.

Page 8: Sphinx Full Text Search Server

Stemming

Stemming (a.k.a. morphology) is the process for reducing inflected or derived words to their stem, base or root form.

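For example, "fishing" and "fished" both reduce to the stem "fish". Synonyms are a related but separate problem: "dove" and "pigeon" are different words that can mean the same thing, and stemming alone will not connect them.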
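Reducing words to a stem can be sketched with crude suffix stripping. This is a deliberately naive illustration, not one of the stemmers Sphinx ships with; the suffix list and the three-character minimum are assumptions chosen for the example.

```python
# Suffixes to strip, longest first (an illustrative list, far from complete).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip one common English suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ("fishing", "fished", "fishes", "fish", "dove", "pigeon"):
    print(word, "->", naive_stem(word))
```

The inflected forms of "fish" collapse to one stem, while "dove" and "pigeon" stay distinct: relating words by meaning needs a synonym list, not a stemmer.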

Page 9: Sphinx Full Text Search Server

Tokenizing

Sphinx breaks down documents into keywords. This is called tokenization.

Word breaker characters allow exception cases for keywords like AT&T, C++ or T-Mobile.

Short words can be ignored (words shorter than the min_word_len setting), but a placeholder is saved to support proximity and phrase searching.
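A minimal tokenizer sketch shows both ideas: exception keywords survive intact, and skipped short words still consume a position. The regular expression, the exception list and the `min_word_len` default of 3 are assumptions for this illustration; Sphinx's real tokenizer is driven by its own configuration.

```python
import re

# Hypothetical exception list: keywords that must not be split at punctuation.
EXCEPTIONS = ("AT&T", "C++", "T-Mobile")

# Try the exceptions first, then fall back to runs of word characters.
TOKEN_RE = re.compile("|".join(re.escape(e) for e in EXCEPTIONS) + r"|\w+")


def tokenize(text: str, min_word_len: int = 3) -> list[tuple[int, str]]:
    """Return (position, keyword) pairs; short words are dropped but keep their slot."""
    tokens = []
    for position, match in enumerate(TOKEN_RE.finditer(text), start=1):
        word = match.group().lower()
        # Skipping the word without renumbering preserves the gap for phrase matching.
        if len(word) >= min_word_len:
            tokens.append((position, word))
    return tokens


print(tokenize("A man caught a fish"))
print(tokenize("I use C++ at AT&T"))
```

Because positions are not renumbered, a phrase or proximity match can still tell that "caught" and "fish" were two words apart in the original text.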

Page 10: Sphinx Full Text Search Server

Full Text Index

Inversion

Document

A man caught a fish

Index (Full Text)

[spacer]
man, person, human, being
caught, catch, catcher, catching, catches
[spacer]
fish, fishing, fished, fisher

Metadata

man     2  1
caught  3  1
fish    5  1
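Inversion can be sketched as building a map from each keyword to the places it occurs. The structure below (keyword to a list of document id and position pairs) is a simplified assumption for illustration and does not reflect Sphinx's on-disk format; the second sample document is invented.

```python
from collections import defaultdict


def build_inverted_index(documents: dict[int, str]) -> dict[str, list[tuple[int, int]]]:
    """Map each keyword to a list of (document id, word position) postings."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split(), start=1):
            index[word].append((doc_id, position))
    return dict(index)


docs = {
    1: "A man caught a fish",
    2: "The fish swam away",  # hypothetical second document
}
index = build_inverted_index(docs)

print(index["man"])
print(index["fish"])
```

Answering a keyword query then becomes a dictionary lookup instead of a scan over every document.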

Page 11: Sphinx Full Text Search Server

Full Text Queries

A full text query searches across multiple columns or within the contents of columns. This is also known as keyword searching.

Boolean Search            fiction AND (Twain OR Dickens)
Phrase Search             "Mark Twain"
Field-Based Search        @author_id 15
Proximity Search          "fear itself"~2, fear << itself
Substring Search          @author[4] Mark
Quorum Search             "the world is a wonderful place"/3
Same Sentence/Paragraph   fear SENTENCE itself
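Of the query types above, quorum search is perhaps the least familiar: a document matches when it contains at least the given number of the listed words. A minimal sketch of that rule, with a hypothetical `quorum_match` helper and made-up documents, might look like this:

```python
def quorum_match(query_words: str, threshold: int, document: str) -> bool:
    """True if the document contains at least `threshold` distinct query words."""
    wanted = set(query_words.lower().split())
    present = set(document.lower().split())
    return len(wanted & present) >= threshold


query = "the world is a wonderful place"

print(quorum_match(query, 3, "what a wonderful world"))
print(quorum_match(query, 3, "hello world"))
```

The first document shares three words with the query ("a", "wonderful", "world") and passes a threshold of 3; the second shares only one and does not.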

Page 12: Sphinx Full Text Search Server

Getting Sphinx

Download it from http://www.sphinxsearch.com (RPM, DEB, Tarball)

Page 13: Sphinx Full Text Search Server

Important Files and Binaries

A successful Sphinx installation will yield the following:

searchd The search daemon, answers queries

indexer Collects documents and builds the index

search Performs a search (useful for debugging)

sphinx.conf Defines your data and configures your indexes and daemon

Page 14: Sphinx Full Text Search Server

Sphinx.conf

Defaults to /etc/sphinx/sphinx.conf, but can exist anywhere.

It can even be executable:

#!/usr/bin/env php
source mysource
{
    type     = mysql
    sql_host = <?php echo DB_HOST; ?>
}
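The same idea works with any interpreter that can print the configuration to standard output. Below is a sketch of a Python equivalent; reading the host from a `DB_HOST` environment variable and the `render_config` helper are assumptions made for this example, not something the original slide shows.

```python
#!/usr/bin/env python3
import os


def render_config(db_host: str) -> str:
    """Return a source block with the database host filled in."""
    return (
        "source mysource\n"
        "{\n"
        "    type     = mysql\n"
        f"    sql_host = {db_host}\n"
        "}\n"
    )


# DB_HOST as an environment variable is an assumption of this sketch.
print(render_config(os.environ.get("DB_HOST", "localhost")), end="")
```

Generating the file this way keeps credentials and hostnames in one place instead of duplicating them between the application and sphinx.conf.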

Page 15: Sphinx Full Text Search Server

Sphinx.conf Blocks

The contents of sphinx.conf consist of several named blocks:

source Defines your data source and queries

index Defines which sources to index for searching

indexer Configure the indexer utility

searchd Configure the search daemon

Page 16: Sphinx Full Text Search Server

Source

Define the connection to your database and query in the source block.

source filmssource
{
    type     = mysql
    sql_host = localhost
    sql_user = root
    sql_pass =
    sql_db   = sakila

    sql_query = \
        SELECT f.film_id, f.title, f.description, \
            f.release_year, f.rating, l.name as language \
        FROM film f \
        INNER JOIN language l \
            ON l.language_id = f.language_id

    sql_attr_uint   = release_year
    sql_attr_string = rating
    sql_attr_string = language
}

Page 17: Sphinx Full Text Search Server

Index

Define which sources to include and index parameters:

index films
{
    source         = filmssource
    charset_type   = utf-8
    path           = /home/andrew/sphinx/films
    stopwords      = /home/andrew/sphinx/stopwords.txt
    enable_star    = 1
    min_word_len   = 2
    min_prefix_len = 0
    min_infix_len  = 2
}

Page 18: Sphinx Full Text Search Server

Indexer (optional)

Configure the indexing process which runs occasionally as a batch:

indexer
{
    mem_limit = 256M
}

Page 19: Sphinx Full Text Search Server

Searchd (optional)

Configure the search daemon (searchd) which answers queries:

searchd
{
    listen          = localhost:9312
    listen          = localhost:9306:mysql41
    log             = /home/andrew/sphinx.log
    read_timeout    = 8
    max_children    = 30
    pid_file        = /home/andrew/sphinx.pid
    max_matches     = 25
    seamless_rotate = 1
    preopen_indexes = 1
    unlink_old      = 1
}

Page 20: Sphinx Full Text Search Server

stopwords.txt

To generate stopwords from your data, use the indexer binary:

indexer --config /path/to/sphinx.conf --buildstops /path/to/stopwords.txt 25

of
who
must
in
and
the
mad
An

Builds a stopwords.txt file with the 25 most commonly found words. Use --buildfreqs to include counts.

Stopwords can dramatically reduce the index size and time-to-build, but it’s a good idea to inspect the output before using it!
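What the stopword builder does conceptually is count word frequencies and keep the top N. A rough sketch of that idea follows; the `build_stops` function and the sample sentences are invented for illustration, and the real indexer works on your configured sources rather than a Python list.

```python
from collections import Counter


def build_stops(documents: list[str], top_n: int) -> list[str]:
    """Return the top_n most frequent words across all documents."""
    counts = Counter(word for text in documents for word in text.lower().split())
    return [word for word, _count in counts.most_common(top_n)]


# Hypothetical sample data.
docs = [
    "a mad scientist who must battle a teacher",
    "a feminist and a mad cow",
]

print(build_stops(docs, 2))
```

The example also shows why inspecting the list matters: "mad" ranks second here, yet it is a meaningful search term that probably should not be dropped.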

Page 21: Sphinx Full Text Search Server

Build your Index

To generate your index, use the indexer binary:

indexer --config /path/to/sphinx.conf --all --rotate

Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file 'sphinx.conf'...
indexing index 'films'...
collected 1000 docs, 0.1 MB
sorted 0.3 Mhits, 100.0% done
total 1000 docs, 108077 bytes
total 0.148 sec, 727012 bytes/sec, 6726.80 docs/sec
total 3 reads, 0.003 sec, 675.6 kb/call avg, 1.1 msec/call avg
total 11 writes, 0.004 sec, 331.8 kb/call avg, 0.4 msec/call avg

Page 22: Sphinx Full Text Search Server

Start the Server

Start the server by executing the searchd binary:

searchd --config /path/to/sphinx.conf

Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file 'sphinx.conf'...
listening on 127.0.0.1:9312
listening on 127.0.0.1:9306
precaching index 'films'
precached 1 indexes in 0.001 sec

Page 23: Sphinx Full Text Search Server

Run a Search

Test your index by running a search:

search --limit 3 robot

Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file './sphinx.conf'...
index 'films': query 'robot ': returned 77 matches of 77 total in 0.000 sec

displaying matches:
1. document=138, weight=1612, release_year=2006, rating=R, language=English
2. document=920, weight=1612, release_year=2006, rating=G, language=English
3. document=6, weight=1581, release_year=2006, rating=PG, language=English

words:
1. 'robot': 77 documents, 79 hits

Page 24: Sphinx Full Text Search Server

MySQL Interface

You can query Sphinx using the MySQL protocol:

mysql -h127.0.0.1 -P 9306

Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 2.0.4-release (r3135)

Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

Page 25: Sphinx Full Text Search Server

MySQL Interface

Queries are written in SphinxQL, which is much like SQL:

mysql> SELECT * FROM films WHERE MATCH('robot') ORDER BY release_year DESC LIMIT 5;
+------+--------+--------------+--------+----------+
| id   | weight | release_year | rating | language |
+------+--------+--------------+--------+----------+
|    6 |   1581 |         2006 | PG     | English  |
|   16 |   1581 |         2006 | NC-17  | English  |
|   25 |   1581 |         2006 | G      | English  |
|   42 |   1581 |         2006 | NC-17  | English  |
|   61 |   1581 |         2006 | G      | English  |
+------+--------+--------------+--------+----------+
5 rows in set (0.00 sec)

Page 26: Sphinx Full Text Search Server

MySQL Interface

Additional metrics can also be retrieved:

mysql> SHOW META;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total         | 77    |
| total_found   | 77    |
| time          | 0.000 |
| keyword[0]    | robot |
| docs[0]       | 77    |
| hits[0]       | 79    |
+---------------+-------+
6 rows in set (0.00 sec)

Page 27: Sphinx Full Text Search Server

MySQL Interface

You can even do grouping:

mysql> SELECT rating, COUNT(*) AS num_movies, MIN(release_year) AS first_year FROM films GROUP BY rating ORDER BY num_movies DESC;
+------+--------+--------------+--------+------------+--------+
| id   | weight | release_year | rating | first_year | @count |
+------+--------+--------------+--------+------------+--------+
|    7 |      1 |         2006 | PG-13  |       2006 |    223 |
|    3 |      1 |         2006 | NC-17  |       2006 |    210 |
|    8 |      1 |         2006 | R      |       2006 |    195 |
|    1 |      1 |         2006 | PG     |       2006 |    194 |
|    2 |      1 |         2006 | G      |       2006 |    178 |
+------+--------+--------------+--------+------------+--------+
5 rows in set (0.00 sec)

Page 28: Sphinx Full Text Search Server

Other Applications

Sphinx does more than just full text search. It has other practical applications as well:

• Metrics and Reporting

• Data Warehouse

• Materialized Views

• Operational Data Store

• Offloading Queries

Page 29: Sphinx Full Text Search Server

Quick and Dirty PHP

Integrate Sphinx by using any MySQL driver (like PDO):
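The slide's PHP code is not part of this transcript. As a rough stand-in in Python, the same approach can be sketched against any DB-API 2.0 cursor. Everything here is an assumption for illustration: the `search_films` helper, the `%s` placeholder style (used by common MySQL drivers), and the PyMySQL usage in the comments, whose compatibility with searchd has not been verified here.

```python
def search_films(cursor, keywords: str, limit: int = 5) -> list[tuple]:
    """Run a SphinxQL full text query through a DB-API 2.0 cursor."""
    # The driver substitutes the parameters, so the keywords are escaped for us.
    cursor.execute(
        "SELECT * FROM films WHERE MATCH(%s) ORDER BY release_year DESC LIMIT %s",
        (keywords, limit),
    )
    return list(cursor.fetchall())


# Usage sketch (assumes the third-party PyMySQL package is installed and
# searchd is listening for the MySQL protocol on port 9306):
#
#   import pymysql
#   conn = pymysql.connect(host="127.0.0.1", port=9306)
#   for row in search_films(conn.cursor(), "robot"):
#       print(row)
```

Taking a cursor as an argument keeps the function independent of the driver, so the same code can be exercised with a stub cursor in tests.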

Page 30: Sphinx Full Text Search Server

SphinxAPI

Or use a native extension like SphinxClient for PHP:

Download it here: http://pecl.php.net/sphinx

Page 31: Sphinx Full Text Search Server

Indexing Strategies

Sphinx supports several types of indexes:

• Disk

• In-memory

• Distributed

• Real-time

Page 32: Sphinx Full Text Search Server

Main+delta Batch Indexes

Disk indexes often use the main+delta(s) strategy:

• One or more delta indexes collect new data as often as every minute.

• Larger batch indexes rebuild daily, weekly or even less frequently.

Disk indexes have the following benefits:

• They can be re-indexed online without interruption (--rotate)

• They can be distributed over filesystems and hardware
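The main+delta idea can be simulated in memory: a large main index is rebuilt rarely, a small delta index covers everything added since, and a search consults both. The sketch below is a toy model with invented helper names and sample documents; it ignores updated or deleted documents, which a real setup has to handle.

```python
def build_index(documents: dict[int, str]) -> dict[str, set[int]]:
    """Map each keyword to the set of document ids containing it."""
    index: dict[str, set[int]] = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search(keyword: str, *indexes: dict[str, set[int]]) -> list[int]:
    """Look the keyword up in every index and merge the matches."""
    found: set[int] = set()
    for index in indexes:
        found |= index.get(keyword.lower(), set())
    return sorted(found)


# Hypothetical data: document 3 arrived after the last full rebuild.
all_docs = {1: "robot uprising", 2: "quiet drama", 3: "robot romance"}
main_max_id = 2  # highest document id covered by the main index

main = build_index({i: t for i, t in all_docs.items() if i <= main_max_id})
delta = build_index({i: t for i, t in all_docs.items() if i > main_max_id})

print(search("robot", main))
print(search("robot", main, delta))
```

Only the small delta has to be rebuilt when new documents arrive, which is why it can run as often as every minute while the main index is rebuilt far less frequently.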

Page 33: Sphinx Full Text Search Server

The End

Andrew Kandels

Website: http://andrewkandels.com

Twitter: @andrewkandels

Facebook/G+: No thanks

There’s a book!