Introduction to Full-text Search
About me: Full-time (mostly) Java developer. Part-time general technical/sysadmin/geeky guy. Interested in: hard problems, search, performance, parallelism, scalability.
Why should you care?
Because every application needs search
We live in an era of big, complex and connected applications.
That means a lot of data
But it's no use if you can't find anything!
But it's no use if you can't quickly find something relevant!
Quick
Relevant
Customized Experience
You can't win by being generic, but you can be the best for your specific type of content.
Deathy's Tip
So back to our full-text search...
Some core ideas: "index" (or "inverted index") and "document"
Don't be too quick in deciding what a "document" is. Put some thought into it or you'll regret it (speaking from a lot of experience)
Deathy’s Tip
First we need some documents, more specifically some text samples
Documents
Doc1: "The cow says moo"
Doc2: "The dog says woof"
Doc3: "The cow-dog says moof"
"Stolen" from http://www.slideshare.net/tomdyson/being-google
Important: individual words are the basis for the index
Individual words
index = ["cow", "dog", "moo", "moof", "The", "says", "woof"]
For each word we have a list of documents to which it belongs
Words, with appearances
index = {
    "cow": ["Doc1", "Doc3"],
    "dog": ["Doc2", "Doc3"],
    "moo": ["Doc1"],
    "moof": ["Doc3"],
    "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"],
    "woof": ["Doc2"]
}
Q1: Find documents which contain "moo"
A1: index["moo"]
Q2: Find documents which contain "The" and "dog"
A2: set(index["The"]) & set(index["dog"])
Try to think of search as unions/intersections or other filters on sets.
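The toy index above really does work this way with plain Python sets; a minimal sketch:

```python
# A minimal sketch of the index from the previous slides as a plain
# Python dict, with the two queries answered by a dictionary lookup
# and a set intersection.
index = {
    "cow": ["Doc1", "Doc3"], "dog": ["Doc2", "Doc3"], "moo": ["Doc1"],
    "moof": ["Doc3"], "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"], "woof": ["Doc2"],
}

# Q1: documents containing "moo" -- a single dictionary lookup
q1 = index["moo"]                            # ['Doc1']

# Q2: documents containing both "The" and "dog" -- a set intersection
q2 = set(index["The"]) & set(index["dog"])   # {'Doc2', 'Doc3'}
```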
Most searches use simple terms and "boolean" operators.
"boolean" operators:
"word" - word MAY/SHOULD appear in document
"+word" - word MUST appear in document
"-word" - word MUST NOT appear in document
Example Query: "+type:book content:java content:python -content:ruby"
Find books, with "java" or "python" in content but which don't contain "ruby" in content.
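These three operators can be sketched as set operations over the inverted index from the earlier slides. The function name `evaluate` and its signature are ours, not from any library; Lucene's real query parser is far more elaborate.

```python
# Hedged sketch: MUST (+), MUST NOT (-) and SHOULD (bare) clauses
# evaluated as set intersections, unions and differences.
def evaluate(index, query, all_docs):
    must, must_not, should = [], [], []
    for term in query.split():
        if term.startswith("+"):
            must.append(term[1:])
        elif term.startswith("-"):
            must_not.append(term[1:])
        else:
            should.append(term)

    docs = set(all_docs)
    for t in must:          # every MUST term narrows the result set
        docs &= set(index.get(t, []))
    if should:              # at least one SHOULD term has to match
        docs &= set().union(*(index.get(t, []) for t in should))
    for t in must_not:      # MUST NOT terms are subtracted
        docs -= set(index.get(t, []))
    return docs

index = {
    "cow": ["Doc1", "Doc3"], "dog": ["Doc2", "Doc3"], "moo": ["Doc1"],
    "moof": ["Doc3"], "The": ["Doc1", "Doc2", "Doc3"],
    "says": ["Doc1", "Doc2", "Doc3"], "woof": ["Doc2"],
}
# MUST contain "The", SHOULD contain "dog", MUST NOT contain "woof"
result = evaluate(index, "+The dog -woof", ["Doc1", "Doc2", "Doc3"])
```

Here `result` is `{"Doc3"}`: all three documents contain "The", the SHOULD clause keeps Doc2 and Doc3, and the MUST NOT clause removes Doc2.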
Err...wait...what the hell does "content:java" mean?
Reviewing the "document" concept
An index consists of one or more documents.
Each document consists of one or more "field"s. Each field has a name and content.
Field examples: content, title, author, publication date, etc.
So how are fields handled internally?
In most cases very simply: a word belongs to a specific field, so the field name can be stored in the term directly.
New index example
index = {
    "content:cow": ["Doc1", "Doc3"],
    "content:dog": ["Doc2", "Doc3"],
    "content:moo": ["Doc1"],
    "content:moof": ["Doc3"],
    "content:The": ["Doc1", "Doc2", "Doc3"],
    "content:says": ["Doc1", "Doc2", "Doc3"],
    "content:woof": ["Doc2"],
    "type:example_documents": ["Doc1", "Doc2", "Doc3"]
}
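Since the field name is folded into the term itself, a fielded lookup stays an ordinary dictionary lookup. A small sketch (the `lookup` helper is ours, purely for illustration):

```python
# A fielded term is just the field name prefixed onto the word.
index = {
    "content:cow": ["Doc1", "Doc3"],
    "content:dog": ["Doc2", "Doc3"],
    "type:example_documents": ["Doc1", "Doc2", "Doc3"],
}

def lookup(field, word):
    # "content" + "cow" -> key "content:cow"
    return index.get(field + ":" + word, [])

result = lookup("content", "cow")   # ['Doc1', 'Doc3']
```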
But enough of that
We missed the most important thing!
We saved the most important thing for last!
Analysis
or, for mortals: how you get from a long text to small tokens/words/terms
…borrowing from Lucene naming/API...
(One) Tokenizer
and zero or more Filters
First...
Some more interesting documents
Doc1: "The quick brown fox jumps over the lazy dog"
Doc2: "All Daleks: Exterminate! Exterminate! EXTERMINATE!! EXTERMINATE!!!"
Doc3: "And the final score is: no TARDIS, no screwdriver, two minutes to spare. Who da man?!"
Tokenizer: Breaks up a single string into smaller tokens.
You define what splitting rules are best for you.
Whitespace Tokenizer
Just break into tokens wherever there is some space. So we get something like:
Doc1: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Doc2: ["All", "Daleks:", "Exterminate!", "Exterminate!", "EXTERMINATE!!", "EXTERMINATE!!!"]
Doc3: ["And", "the", "final", "score", "is:", "no", "TARDIS,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "Who", "da", "man?!"]
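In Python a whitespace tokenizer is essentially `str.split()`: break on runs of whitespace and keep punctuation attached to the tokens, exactly as in the output above. A sketch:

```python
# Whitespace tokenizer: split on whitespace only, keeping punctuation
# glued to the tokens ("Daleks:", "Exterminate!", ...).
def whitespace_tokenize(text):
    return text.split()

tokens = whitespace_tokenize("All Daleks: Exterminate! Exterminate!")
# ['All', 'Daleks:', 'Exterminate!', 'Exterminate!']
```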
But wait, that doesn't look right...
So we apply Filters
Filter
Transforms one single token into another single token, multiple tokens, or no token at all.
You can apply several of them, in a specific order.
Filter 1: lower-case (since we don't want the search to be case-sensitive)
Result
Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Doc2: ["all", "daleks:", "exterminate!", "exterminate!", "exterminate!!", "exterminate!!!"]
Doc3: ["and", "the", "final", "score", "is:", "no", "tardis,", "no", "screwdriver,", "two", "minutes", "to", "spare.", "who", "da", "man?!"]
Filter 2: remove punctuation
Result
Doc1: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Doc2: ["all", "daleks", "exterminate", "exterminate", "exterminate", "exterminate"]
Doc3: ["and", "the", "final", "score", "is", "no", "tardis", "no", "screwdriver", "two", "minutes", "to", "spare", "who", "da", "man"]
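The two filters can be sketched as plain Python functions chained in order. The function names are ours, not Lucene's:

```python
import string

# Each filter maps a list of tokens to a new list of tokens; a token
# may be dropped entirely when nothing is left of it.
def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def punctuation_filter(tokens):
    table = str.maketrans("", "", string.punctuation)
    # drop tokens that become empty once punctuation is stripped
    return [s for t in tokens if (s := t.translate(table))]

tokens = "All Daleks: Exterminate! EXTERMINATE!!".split()
result = punctuation_filter(lowercase_filter(tokens))
# ['all', 'daleks', 'exterminate', 'exterminate']
```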
Add more filter seasoning until it tastes just right.
Lots of things you can do with filters:
case normalization
removing unwanted/unneeded characters
transliteration/normalization of special characters
stopwords
synonyms
Possibilities are endless, enjoy experimenting with them!
Just one warning…
Always use the same analysis rules when indexing and when parsing search text entered by the user!
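One way to make that hard to get wrong is to route both sides through a single function. A hedged sketch (`analyze` is our name, not a library API):

```python
import string

# One analyze() pipeline reused both when indexing and when parsing
# the user's search text, so the terms line up on both sides.
def analyze(text):
    table = str.maketrans("", "", string.punctuation)
    return [s for tok in text.lower().split() if (s := tok.translate(table))]

# Index a document through the pipeline.
index = {}
for doc_id, text in {"Doc2": "All Daleks: Exterminate!"}.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)

# The raw query "EXTERMINATE!!" only matches because it goes through
# the SAME analyzer before the lookup.
hits = index.get(analyze("EXTERMINATE!!")[0], set())   # {'Doc2'}
```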
I bet you want to start working with this
Implementations
Lucene (Java mainly; ports for .NET, Python, C++)
Solr (if using from other languages)
Xapian
Sphinx
OpenFTS
MySQL Full-Text Search (kind of…)
Related Books
The theory: Introduction to Information Retrieval
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
Warning: contains a lot of math.
The practice (for Lucene at least): Lucene in Action, Second Edition
http://www.manning.com/hatcher3/
Warning: contains a lot of Java.
Questions?
Contact me (with interesting problems involving lots of data)
@[email protected]
http://blog.deathy.info/ (yeah…I know…)
Fin.
So where’s the Halloween Party?
Happy Halloween !