Introduction to Full-text Search
About me: Full-time (mostly) Java developer. Part-time general technical/sysadmin/geeky guy. Interested in: hard problems, search, performance, parallelism, scalability.
Why should you care?
Because every application needs search
We live in an era of big, complex and connected applications.
That means a lot of data
But it's no use if you can't find anything!
But it's no use if you can't quickly find something relevant!
Quick
Relevant
Customized Experience
You can't win by being generic, but you can be the best for your specific type of content.
Deathy's Tip
So back to our full-text search...
Some core ideas: "index" (or "inverted index") and "document"
Don't be too quick in deciding what a "document" is. Put some thought into it or you'll regret it (speaking from a lot of experience)
Deathy’s Tip
First we need some documents, more specifically some text samples
Documents
Doc1: "The cow says moo"
Doc2: "The dog says woof"
Doc3: "The cow-dog says moof"
"Stolen" from http://www.slideshare.net/tomdyson/being-google
Important: individual words are the basis for the index
Individual words
index = ["cow", "dog", "moo", "moof", "The", "says", "woof"]
For each word we have a list of documents to which it belongs
Words, with appearances
index = {
    "cow": ["Doc1", "Doc3"],
    "dog": ["Doc2", "Doc3"],
    "moo": ["Doc1"],
    "moof": ["Doc3"],
    "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"],
    "woof": ["Doc2"]
}
Q1: Find documents which contain "moo"
A1: index["moo"]
Q2: Find documents which contain "The" and "dog"
A2: set(index["The"]) & set(index["dog"])
Try to think of search as unions/intersections or other filters on sets.
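The toy index above really does work this way with plain Python sets; a minimal sketch:

```python
# A minimal sketch of the index from the previous slides as a plain
# Python dict, with the two queries answered by a dictionary lookup
# and a set intersection.
index = {
    "cow": ["Doc1", "Doc3"], "dog": ["Doc2", "Doc3"], "moo": ["Doc1"],
    "moof": ["Doc3"], "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"], "woof": ["Doc2"],
}

# Q1: documents containing "moo" -- a single dictionary lookup
q1 = index["moo"]                            # ['Doc1']

# Q2: documents containing both "The" and "dog" -- a set intersection
q2 = set(index["The"]) & set(index["dog"])   # {'Doc2', 'Doc3'}
```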
Most searches use simple terms and "boolean" operators.
"boolean" operators:
"word" - word MAY/SHOULD appear in document
"+word" - word MUST appear in document
"-word" - word MUST NOT appear in document
Example Query: "+type:book content:java content:python -content:ruby"
Find books, with "java" or "python" in content but which don't contain "ruby" in content.
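These three operators can be sketched as set operations over the inverted index from the earlier slides. The function name `evaluate` and its signature are ours, not from any library; Lucene's real query parser is far more elaborate.

```python
# Hedged sketch: MUST (+), MUST NOT (-) and SHOULD (bare) clauses
# evaluated as set intersections, unions and differences.
def evaluate(index, query, all_docs):
    must, must_not, should = [], [], []
    for term in query.split():
        if term.startswith("+"):
            must.append(term[1:])
        elif term.startswith("-"):
            must_not.append(term[1:])
        else:
            should.append(term)

    docs = set(all_docs)
    for t in must:          # every MUST term narrows the result set
        docs &= set(index.get(t, []))
    if should:              # at least one SHOULD term has to match
        docs &= set().union(*(index.get(t, []) for t in should))
    for t in must_not:      # MUST NOT terms are subtracted
        docs -= set(index.get(t, []))
    return docs

index = {
    "cow": ["Doc1", "Doc3"], "dog": ["Doc2", "Doc3"], "moo": ["Doc1"],
    "moof": ["Doc3"], "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"], "woof": ["Doc2"],
}
# MUST contain "The", SHOULD contain "dog", MUST NOT contain "woof"
result = evaluate(index, "+The dog -woof", ["Doc1", "Doc2", "Doc3"])
```

Here `result` is `{"Doc3"}`: all three documents contain "The", the SHOULD clause keeps Doc2 and Doc3, and the MUST NOT clause removes Doc2.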
Err...wait...what the hell does "content:java" mean?
Reviewing the "document" concept
An index consists of one or more documents.
Each document consists of one or more "field"s. Each field has a name and content.
Field examples: content, title, author, publication date, etc.
So how are fields handled internally?
In most cases very simply: a word belongs to a specific field, so the field name can be stored in the term directly.
New index example
index = {
    "content:cow": ["Doc1", "Doc3"],
    "content:dog": ["Doc2", "Doc3"],
    "content:moo": ["Doc1"],
    "content:moof": ["Doc3"],
    "content:The": ["Doc1", "Doc2", "Doc3"],
    "content:says": ["Doc1", "Doc2", "Doc3"],
    "content:woof": ["Doc2"],
    "type:example_documents": ["Doc1", "Doc2", "Doc3"]
}
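Since the field name is folded into the term itself, a fielded lookup stays an ordinary dictionary lookup. A small sketch (the `lookup` helper is ours, purely for illustration):

```python
# A fielded term is just the field name prefixed onto the word.
index = {
    "content:cow": ["Doc1", "Doc3"],
    "content:dog": ["Doc2", "Doc3"],
    "type:example_documents": ["Doc1", "Doc2", "Doc3"],
}

def lookup(field, word):
    # "content" + "cow" -> key "content:cow"
    return index.get(field + ":" + word, [])

result = lookup("content", "cow")   # ['Doc1', 'Doc3']
```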
But enough of that
We missed the most important thing!
We saved the most important thing for last!
Analysis
or, for mortals: how you get from a long text to small tokens/words/terms
…borrowing from Lucene naming/API...
(One) Tokenizer
and zero or more Filters
First...
Some more interesting documents
Doc1: "The quick brown fox jumps over the lazy dog"
Doc2: "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"
Doc3: "And the final score is: no TARDIS, no screwdriver, two minutes to spare. Who da man?!"
Tokenizer: Breaks up a single string into smaller tokens.
You define what splitting rules are best for you.
Whitespace Tokenizer
Just break into tokens wherever there is some space. So we get something like:
Doc1: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Doc2: ["All", "Daleks:", "Exterminate!", "Exterminate!", "EXTERMINATE!!", "EXTERMINATE!!!"]
Doc3: ["And", "the", "final", "score", "is:", "no", "TARDIS,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "Who", "da", "man?!"]
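In Python a whitespace tokenizer is essentially `str.split()`: break on runs of whitespace and keep punctuation attached to the tokens, exactly as in the output above. A sketch:

```python
# Whitespace tokenizer: split on whitespace only, keeping punctuation
# glued to the tokens ("Daleks:", "Exterminate!", ...).
def whitespace_tokenize(text):
    return text.split()

tokens = whitespace_tokenize("All Daleks: Exterminate! Exterminate!")
# ['All', 'Daleks:', 'Exterminate!', 'Exterminate!']
```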
But wait, that doesn't look right...
So we apply Filters
Filter
Transforms one single token into another single token, multiple tokens, or no token at all.
You can apply several of them, in a specific order.
Filter 1: lower-case (since we don't want the search to be case-sensitive)
Result
Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Doc2: ["all", "daleks:", "exterminate!", "exterminate!", "exterminate!!", "exterminate!!!"]
Doc3: ["and", "the", "final", "score", "is:", "no", "tardis,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "who", "da", "man?!"]
Filter 2: remove punctuation
Result
Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Doc2: ["all", "daleks", "exterminate", "exterminate", "exterminate", "exterminate"]
Doc3: ["and", "the", "final", "score", "is", "no", "tardis", "no", "screwdriver", "two", "minutes", "to", "spare", "who", "da", "man"]
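The two filters can be sketched as plain Python functions chained in order. The function names are ours, not Lucene's:

```python
import string

# Each filter maps a list of tokens to a new list of tokens; a token
# may be dropped entirely when nothing is left of it.
def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def punctuation_filter(tokens):
    table = str.maketrans("", "", string.punctuation)
    # drop tokens that become empty once punctuation is stripped
    return [s for t in tokens if (s := t.translate(table))]

tokens = "All Daleks: Exterminate! EXTERMINATE!!".split()
result = punctuation_filter(lowercase_filter(tokens))
# ['all', 'daleks', 'exterminate', 'exterminate']
```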
Add more filter seasoning until it tastes just right.
Lots of things you can do with filters:
case normalization
removing unwanted/unneeded characters
transliteration/normalization of special characters
stopwords
synonyms
Possibilities are endless, enjoy experimenting with them!
Just one warning…
Always use the same analysis rules when indexing and when parsing search text entered by the user!
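One way to make that hard to get wrong is to route both sides through a single function. A hedged sketch (`analyze` is our name, not a library API):

```python
import string

# One analyze() pipeline reused both when indexing and when parsing
# the user's search text, so the terms line up on both sides.
def analyze(text):
    table = str.maketrans("", "", string.punctuation)
    return [s for tok in text.lower().split() if (s := tok.translate(table))]

# Index a document through the pipeline.
index = {}
for doc_id, text in {"Doc2": "All Daleks: Exterminate!"}.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)

# The raw query "EXTERMINATE!!" only matches because it goes through
# the SAME analyzer before the lookup.
hits = index.get(analyze("EXTERMINATE!!")[0], set())   # {'Doc2'}
```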
I bet you want to start working with this
Implementations
Lucene (Java mainly; ports for .NET, Python, C++)
Solr (if using from other languages)
Xapian
Sphinx
OpenFTS
MySQL Full-Text Search (kind of…)
Related Books
The theory: Introduction to Information Retrieval
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
Warning: contains a lot of math.
The practice (for Lucene at least): Lucene in Action, Second Edition
http://www.manning.com/hatcher3/
Warning: contains a lot of Java.
Questions?
Contact me (with interesting problems involving lots of data)
@[email protected]
http://blog.deathy.info/ (yeah…I know…)
Fin.
So where’s the Halloween Party?
Happy Halloween !