Practical full-text search in MySQL Bill Karwin MySQL University • 2009-12-3


Page 1: Practical Full Text Search

Practical full-text search in MySQL

Bill Karwin

MySQL University • 2009-12-3

Page 2: Practical Full Text Search

Me

• 20+ years experience

  • Application/SDK developer

  • Support, Training, Proj Mgmt

  • C, Java, Perl, PHP

• SQL maven

  • MySQL, PostgreSQL, InterBase

  • Zend Framework

  • Oracle, SQL Server, IBM DB2, SQLite

• Community contributor

Page 3: Practical Full Text Search

Full Text Search

Page 4: Practical Full Text Search

In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.

http://www.flickr.com/photos/tryingyouth/

Page 5: Practical Full Text Search

Test Data

• StackOverflow.com data dump, exported October 2009

• 1.5 million tuples

• ~1 Gigabyte

Page 6: Practical Full Text Search

StackOverflow ER diagram

searchable text

Page 7: Practical Full Text Search

Naive Searching

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

(Jamie Zawinski)

Page 8: Practical Full Text Search

Accuracy issue

• Substring patterns match irrelevant words: a search for ‘one’ also matches ‘money’, ‘prone’, etc.:

body LIKE '%one%'

• Regular expressions in MySQL support escapes for word boundaries:

body RLIKE '[[:<:]]one[[:>:]]'
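The difference between the two patterns can be reproduced outside the database. The following is a minimal sketch in Java, not taken from the slides: the class name `WordMatchDemo`, its method names, and the sample strings are all hypothetical. `String.contains` stands in for `LIKE '%one%'`, and the regex anchor `\b` stands in for MySQL's `[[:<:]]` and `[[:>:]]` word-boundary markers.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class; names and sample data are illustrative only.
class WordMatchDemo {
    // Substring test: the analogue of body LIKE '%one%'.
    static List<String> substringMatches(List<String> docs, String term) {
        return docs.stream()
                   .filter(d -> d.contains(term))
                   .collect(Collectors.toList());
    }

    // Whole-word test: \b marks a word boundary on each side of the term.
    static List<String> wholeWordMatches(List<String> docs, String term) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\b");
        return docs.stream()
                   .filter(d -> p.matcher(d).find())
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "only one way to do it",
            "save money on hosting",
            "prone to errors");
        // All three strings contain the substring "one".
        System.out.println(substringMatches(docs, "one"));
        // Only the first string contains "one" as a separate word.
        System.out.println(wholeWordMatches(docs, "one"));
    }
}
```

The whole-word version fixes the accuracy problem but still has to inspect every document, which is the performance problem discussed next.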

Page 9: Practical Full Text Search

Performance issue

• LIKE with wildcards:

SELECT * FROM Posts
WHERE body LIKE '%performance%';

time: 22 sec

• POSIX regular expressions:

SELECT * FROM Posts
WHERE body RLIKE 'performance';

time: 108 sec

Page 10: Practical Full Text Search

Why so slow?

CREATE TABLE telephone_book ( full_name VARCHAR(50));

CREATE INDEX name_idx ON telephone_book (full_name);

INSERT INTO telephone_book VALUES ('Riddle, Thomas'), ('Thomas, Dean');

Page 11: Practical Full Text Search

Why so slow?

• Search for all with last name “Thomas” (uses index):

SELECT * FROM telephone_book
WHERE full_name LIKE 'Thomas%';

• Search for all with first name “Thomas” (doesn’t use index):

SELECT * FROM telephone_book
WHERE full_name LIKE '%Thomas';

Page 12: Practical Full Text Search

Indexes don’t help searching for substrings
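The telephone-book example can be mimicked with a sorted set, which behaves much like an ordered index. This is a minimal sketch under stated assumptions, not MySQL's actual B-tree implementation: the class `PrefixVsSuffix` and its methods are hypothetical, and comparison here is case-sensitive, unlike typical MySQL collations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical demo class; a TreeSet stands in for an ordered index.
class PrefixVsSuffix {
    // Prefix search: entries sharing a prefix are contiguous in sorted
    // order, so the set can jump directly to the matching range.
    static SortedSet<String> startsWith(TreeSet<String> index, String prefix) {
        return index.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    // Suffix search: matches are scattered through the sorted order,
    // so every entry has to be inspected.
    static List<String> endsWith(TreeSet<String> index, String suffix) {
        List<String> hits = new ArrayList<>();
        for (String name : index) {
            if (name.endsWith(suffix)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>(
            List.of("Riddle, Thomas", "Thomas, Dean"));
        System.out.println(startsWith(index, "Thomas")); // [Thomas, Dean]
        System.out.println(endsWith(index, "Thomas"));   // [Riddle, Thomas]
    }
}
```

With two rows the difference is invisible, but the range lookup stays cheap as the set grows, while the suffix scan grows with the number of entries.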

Page 13: Practical Full Text Search

Solutions

1. Full-Text Indexing in SQL

2. Sphinx Search

3. Apache Lucene

4. Inverted Index

5. Search Engine Service

Page 14: Practical Full Text Search

MySQL FULLTEXT Index

Page 15: Practical Full Text Search

MySQL FULLTEXT Index

• Special index type for MyISAM

• Integrated with SQL queries

• Balances features vs. speed vs. space

Page 16: Practical Full Text Search

MySQL FULLTEXT:

Indexing

CREATE FULLTEXT INDEX PostText ON Posts(title, body, tags);

time: 15 min 6 sec

Page 17: Practical Full Text Search

MySQL FULLTEXT:

Index Caching

SET GLOBAL key_buffer_size = 600*1024*1024;

LOAD INDEX INTO CACHE Posts INDEX(PostText);

time: 11 sec

Page 18: Practical Full Text Search

MySQL FULLTEXT:

Querying

SELECT * FROM Posts
WHERE MATCH( column(s) ) AGAINST( 'query pattern' );

MATCH() must include all columns of the index, in the order defined

Page 19: Practical Full Text Search

MySQL FULLTEXT:

Natural Language Mode

Searches concepts with free text queries:

SELECT * FROM Posts
WHERE MATCH( title, body, tags )
  AGAINST('improving mysql performance' IN NATURAL LANGUAGE MODE)
LIMIT 100;

time with index: 80 milliseconds

Page 20: Practical Full Text Search

MySQL FULLTEXT:

Boolean Mode

Searches words using mini-language:

SELECT * FROM Posts
WHERE MATCH( title, body, tags )
  AGAINST('+mysql +performance' IN BOOLEAN MODE);

time with index: 50 milliseconds

Page 21: Practical Full Text Search

Lucene

Page 22: Practical Full Text Search

Lucene

• Apache Project since 2001

• Apache License

• Java implementation

• Ports exist for other languages:

  • Lucy (C)

  • Lucene.NET (C#)

  • Zend_Search_Lucene (PHP)

  • PyLucene (Python)

  • Plucene (Perl)

  • Ferret (Ruby)

Page 23: Practical Full Text Search

Lucene:

How to use

1. Add documents to index

2. Parse query

3. Execute query

Page 24: Practical Full Text Search

Lucene:

Creating an index

• Programmatic solution in Java...

time: 6 minutes, 50 seconds

Page 25: Practical Full Text Search

Lucene:

Indexing

// Connect to MySQL; the credentials are carried in the JDBC URL
String url = "jdbc:mysql://localhost/stackoverflow?"
    + "user=myappuser&password=xxxx";
Class.forName("com.mysql.jdbc.Driver");
Connection con = DriverManager.getConnection(url);

// Any SQL query; stream the rows instead of buffering the whole result
String sql = "SELECT PostId, Title, Body, Tags FROM Posts";
com.mysql.jdbc.Statement stmt =
    (com.mysql.jdbc.Statement) con.createStatement();
stmt.enableStreamingResults();
ResultSet rs = stmt.executeQuery(sql);

// Open the Lucene index writer (true = create a new index)
IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
    new StandardAnalyzer(Version.LUCENE_CURRENT), true,
    IndexWriter.MaxFieldLength.LIMITED);

Page 26: Practical Full Text Search

Lucene:

Indexing

// Loop over the SQL result; each row is a Document with four Fields
while (rs.next()) {
    Document doc = new Document();

    // PostId is stored for retrieval but not indexed for searching
    doc.add(new Field("PostId", rs.getString("PostId"),
        Field.Store.YES, Field.Index.NO));
    doc.add(new Field("Title", rs.getString("Title"),
        Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("Body", rs.getString("Body"),
        Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("Tags", rs.getString("Tags"),
        Field.Store.YES, Field.Index.ANALYZED));

    writer.addDocument(doc);
}

// Finish and close the index
writer.optimize();
writer.close();

Page 27: Practical Full Text Search

Lucene:

Querying

• Parse a Lucene query

// Define the fields to search
String[] fields = new String[3];
fields[0] = "Title"; fields[1] = "Body"; fields[2] = "Tags";

// Parse the search query
Query q = new MultiFieldQueryParser(fields,
    new StandardAnalyzer()).parse("performance");

• Execute the query

Searcher s = new IndexSearcher(indexDirectory, true);
Hits h = s.search(q);

time: 120 milliseconds

Page 28: Practical Full Text Search

Sphinx Search

Page 29: Practical Full Text Search

Sphinx Search

• Started in 2001

• GPLv2 license

• Good database integration: SphinxSE storage engine for MySQL

Page 30: Practical Full Text Search

Sphinx Search:

How to use

1. Edit configuration file

2. Index the data

3. Query the index

4. Issues

Page 31: Practical Full Text Search

Sphinx Search:

sphinx.conf

source stackoverflowsrc
{
    type = mysql
    sql_host = localhost
    sql_user = myappuser
    sql_pass = xxxx
    sql_db = stackoverflow
    sql_query = SELECT PostId, Title, Body, Tags FROM Posts
    sql_query_info = SELECT * FROM Posts WHERE PostId=$id
}

Page 32: Practical Full Text Search

Sphinx Search:

sphinx.conf

index stackoverflow
{
    source = stackoverflowsrc
    path = /opt/local/var/db/sphinx/stackoverflow
}

Page 33: Practical Full Text Search

Sphinx Search:

Building index

indexer -c sphinx.conf stackoverflow

collected 1517638 docs, 1021.3 MB
sorted 171.5 Mhits, 100.0% done
total 1517638 docs, 1021342525 bytes
total 147.060 sec, 6945093.00 bytes/sec, 10319.88 docs/sec

time: 2 min 27 sec

Page 34: Practical Full Text Search

Sphinx Search:

Querying index

search -c sphinx.conf -i stackoverflow -b "sql & performance"

time: 12 milliseconds

Page 35: Practical Full Text Search

Sphinx Search:

Issues

Cost to update index = cost to build index

• Build a “main” index plus a “delta” index for recent changes

• Merge indexes periodically (much less costly)

• But not all data fits into this model; e.g. it is good for a forum, but bad for a wiki
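The main-plus-delta idea can be illustrated with a small in-memory analogy. This sketch is not Sphinx code and does not use any Sphinx API; the class `MainDeltaIndex` and all of its methods are hypothetical, and simple maps from word to document ids stand in for the two indexes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory analogy of a "main" index plus a "delta" index.
class MainDeltaIndex {
    private final Map<String, Set<Integer>> main = new HashMap<>();
    private final Map<String, Set<Integer>> delta = new HashMap<>();

    // New documents only touch the small delta index.
    void add(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                delta.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A query consults both indexes and unions the results.
    Set<Integer> search(String word) {
        Set<Integer> hits = new TreeSet<>();
        hits.addAll(main.getOrDefault(word, Set.of()));
        hits.addAll(delta.getOrDefault(word, Set.of()));
        return hits;
    }

    // Periodic merge: fold the delta into main, then start a fresh delta.
    void merge() {
        delta.forEach((word, ids) ->
            main.computeIfAbsent(word, k -> new TreeSet<>()).addAll(ids));
        delta.clear();
    }

    // Number of distinct words currently waiting in the delta index.
    int deltaSize() {
        return delta.size();
    }
}
```

Note that the sketch only ever appends documents. That suits forum-style data, where old posts rarely change; edits to documents already in the main index, as on a wiki, have no place in this scheme.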

Page 36: Practical Full Text Search

Inverted Index

Page 37: Practical Full Text Search

Inverted index

[Diagram: Posts, PostTags, Tags. PostTags is a many-to-many relationship between Posts and words; Tags holds the searchable words.]

Page 38: Practical Full Text Search

Inverted index:

Updated ER Diagram

new tables

Page 39: Practical Full Text Search

Inverted index:

Data definition

CREATE TABLE Tags (
  TagId SERIAL PRIMARY KEY,
  Tag VARCHAR(50) NOT NULL,
  UNIQUE KEY (Tag)
);

CREATE TABLE PostTags (
  PostId INT NOT NULL,
  TagId BIGINT UNSIGNED NOT NULL,  -- same type as Tags.TagId (SERIAL)
  PRIMARY KEY (PostId, TagId),
  FOREIGN KEY (PostId) REFERENCES Posts (PostId),
  FOREIGN KEY (TagId) REFERENCES Tags (TagId)
);

Page 40: Practical Full Text Search

Inverted index:

Indexing

1. Query all Posts.Tags strings, e.g. “<mysql><search><performance>”

2. Loop over tag strings

3. Dump two CSV files:

  • Tags.csv

  • PostTags.csv

4. Load CSV files with mysqlimport

time: 23.5 seconds

time: 5.2 seconds
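Steps 1 to 3 above can be sketched in Java. This is an illustrative sketch only, not the script used for the talk: the class `TagIndexer`, its methods, and the sample posts are hypothetical, and the result is kept in memory rather than written to Tags.csv and PostTags.csv.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that turns tag strings into an inverted index.
class TagIndexer {
    // One tag is the text between a pair of angle brackets.
    private static final Pattern TAG = Pattern.compile("<([^<>]+)>");

    // Split "<mysql><search><performance>" into its individual tags.
    static List<String> parseTags(String tagString) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(tagString);
        while (m.find()) {
            tags.add(m.group(1));
        }
        return tags;
    }

    // Map each tag to the sorted set of post ids that carry it.
    static Map<String, Set<Integer>> buildIndex(Map<Integer, String> postTags) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> e : postTags.entrySet()) {
            for (String tag : parseTags(e.getValue())) {
                index.computeIfAbsent(tag, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> posts = Map.of(
            1, "<mysql><search><performance>",
            2, "<java><performance>");
        Map<String, Set<Integer>> index = buildIndex(posts);
        System.out.println(index.get("performance")); // [1, 2]
    }
}
```

In the relational version, the keys of this map correspond to rows of Tags and each (tag, post id) pair corresponds to a row of PostTags.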

Page 41: Practical Full Text Search

Inverted index:

Querying

SELECT p.* FROM Posts p
JOIN PostTags pt USING (PostId)
JOIN Tags t USING (TagId)
WHERE t.Tag = 'performance';

time: 250 milliseconds

Page 42: Practical Full Text Search

Inverted Index:

Is it right for you?

• Best for searching selected words

• Simple, portable, standard SQL

• Not as fast as specialized technology, but far better than using LIKE

Page 43: Practical Full Text Search

Search Engine Services

Page 44: Practical Full Text Search

Search engine services:

Google Custom Search Engine

• http://www.google.com/cse/

• DEMO ➪ http://www.karwin.com/demo/gcse-demo.html

even big web sites use this solution

Page 45: Practical Full Text Search

Search engine services:

Is it right for you?

• Your site is public and allows external indexing

• Search is a non-critical feature for you

• Search results are satisfactory

• You need to offload search processing

Page 46: Practical Full Text Search

Comparison: Time to Build Index

LIKE expression none

MySQL FULLTEXT 15 min

Apache Lucene 6 min 50 sec

Sphinx Search 2 min 27 sec

Inverted index 28 sec

Google / Yahoo! offline

Page 47: Practical Full Text Search

Comparison: Index Storage

LIKE expression none

MySQL FULLTEXT 466 MB

Apache Lucene 1323 MB

Sphinx Search 933 MB

Inverted index 48 MB

Google / Yahoo! offline

Page 48: Practical Full Text Search

Comparison: Query Speed

LIKE expression 22 seconds

MySQL FULLTEXT 50-80 ms

Apache Lucene 120 ms

Sphinx Search 12 ms

Inverted index 250 ms

Google / Yahoo! *

Page 49: Practical Full Text Search

Comparison: Bottom-Line

                  indexing  storage  query  solution

LIKE expression   none      none     2000x  SQL

MySQL FULLTEXT    32x       10x      6x     RDBMS

Apache Lucene     15x       27x      10x    3rd party

Sphinx Search     5x        20x      1x     3rd party

Inverted index    1x        1x       20x    SQL

Google / Yahoo!   offline   offline  *      Service

Page 50: Practical Full Text Search

Copyright 2009 Bill Karwin

www.slideshare.net/billkarwin

Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/

You are free to share - to copy, distribute and transmit this work, under the following conditions:

Attribution. You must attribute this work to Bill Karwin.

Noncommercial. You may not use this work for commercial purposes.

No Derivative Works. You may not alter, transform, or build upon this work.