www.sti-innsbruck.at © Copyright 2012 STI INNSBRUCK

Apache Lucene

Ioan Toma

based on slides from Aaron Bannert (aaron@codemass.com)

What is Apache Lucene?

“Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.”

- from http://lucene.apache.org/

Features

• Scalable, High-Performance Indexing
  – over 95GB/hour on modern hardware
  – small RAM requirements -- only 1MB heap
  – incremental indexing as fast as batch indexing
  – index size roughly 20-30% the size of text indexed

• Powerful, Accurate and Efficient Search Algorithms
  – ranked searching -- best results returned first
  – many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
  – fielded searching (e.g., title, author, contents)
  – date-range searching
  – sorting by any field
  – multiple-index searching with merged results
  – allows simultaneous update and searching

Features

• Cross-Platform Solution
  – Available as Open Source software under the Apache License which lets you use Lucene in both commercial and Open Source programs
  – 100%-pure Java
  – Implementations in other programming languages available that are index-compatible
    • CLucene - Lucene implementation in C++
    • Lucene.Net - Lucene implementation in .NET
    • Zend Search - Lucene implementation in the Zend Framework for PHP 5

Ranked Searching

1. Phrase Matching
2. Keyword Matching
   – Prefer more unique terms first
   – Scoring and ranking take into account the uniqueness of each term when determining a document’s relevance score
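
The idea that rarer terms count for more can be sketched with an inverse-document-frequency weight. This is a toy illustration in Python, not Lucene's actual scoring formula; the corpus `DOCS` and the functions `idf`, `score`, and `rank` are hypothetical names introduced only for this example.

```python
import math

# Toy corpus (hypothetical data): document id -> list of terms.
DOCS = {
    "d1": ["star", "wars", "movie"],
    "d2": ["star", "trek", "movie"],
    "d3": ["movie", "review"],
}

def idf(term, docs):
    """Inverse document frequency: the fewer documents contain a term, the larger its weight."""
    containing = sum(1 for terms in docs.values() if term in terms)
    return math.log((1 + len(docs)) / (1 + containing)) + 1.0

def score(query_terms, doc_terms, docs):
    """Sum of (term frequency in the document) * (IDF weight) over the query terms."""
    return sum(doc_terms.count(t) * idf(t, docs) for t in query_terms)

def rank(query_terms, docs):
    """Return ids of matching documents, best score first (ties broken by id)."""
    scored = [(score(query_terms, terms, docs), doc_id) for doc_id, terms in docs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ordered if s > 0]
```

Here "wars" occurs in one document and "movie" in all three, so a match on "wars" moves a document further up the ranking than a match on "movie".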

Flexible Queries

• Phrases: "star wars"
• Wildcards: star*
• Ranges: {star TO stun}, [2006 TO 2007] (curly braces exclude the endpoints, square brackets include them)
• Boolean Operators: star AND wars
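
What these query types select can be sketched against a small term list. The code below is a Python toy that does not use Lucene's query parser; `TERMS`, `POSTINGS`, `wildcard`, `term_range`, and `boolean_and` are hypothetical names, and the data is made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical sorted term dictionary and postings (term -> set of document ids).
TERMS = ["stair", "star", "stars", "start", "stop", "stun", "wars"]
POSTINGS = {"star": {1, 2, 3}, "wars": {1, 4}}

def wildcard(pattern, terms):
    """Terms matching a shell-style wildcard pattern such as 'star*'."""
    return [t for t in terms if fnmatchcase(t, pattern)]

def term_range(lower, upper, terms, inclusive):
    """Terms between two bounds in lexicographic order, with or without the endpoints."""
    if inclusive:
        return [t for t in terms if lower <= t <= upper]
    return [t for t in terms if lower < t < upper]

def boolean_and(left, right, postings):
    """Documents that contain both terms: the intersection of their postings."""
    return sorted(postings.get(left, set()) & postings.get(right, set()))
```

With this data, `wildcard("star*", TERMS)` selects star, stars, and start, while an exclusive range from star to stun selects stars, start, and stop.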

Field-specific Queries

• Field-specific queries can be used to target specific fields in the Document Index.

• For example:

title:"star wars" AND director:"George Lucas"
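
A field-specific query restricts each condition to one named field. The Python sketch below mimics that behaviour on plain dictionaries; it is not Lucene code, and `MOVIES`, `phrase_in_field`, and `search` are hypothetical names with sample data chosen for illustration.

```python
# Hypothetical sample documents, each a mapping of field name -> text.
MOVIES = [
    {"title": "Star Wars", "director": "George Lucas"},
    {"title": "Star Trek", "director": "J.J. Abrams"},
    {"title": "American Graffiti", "director": "George Lucas"},
]

def phrase_in_field(doc, field, phrase):
    """True if the words of the phrase appear consecutively in the given field (case-insensitive)."""
    words = doc.get(field, "").lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def search(docs, **field_phrases):
    """AND together one phrase condition per field, e.g. title=..., director=...."""
    return [d for d in docs
            if all(phrase_in_field(d, f, p) for f, p in field_phrases.items())]
```

Calling `search(MOVIES, title="star wars", director="George Lucas")` corresponds to the example query: both field conditions must hold for a document to match.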

Sorting

• Can sort by any field in a Document
  – For example, by Price, Release Date, Amazon Sales Rank, etc.

• By default, Lucene will sort results by their relevance score. Sorting by any other field in a Document is also supported.
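
The difference between the default relevance ordering and sorting by another field can be sketched as follows. This is a Python toy, not Lucene's sorting API; `HITS` and `sort_hits` are hypothetical, and the scores and prices are made-up values.

```python
from operator import itemgetter

# Hypothetical search hits with a relevance score and one extra stored field.
HITS = [
    {"title": "Disc A", "score": 0.4, "price": 9.99},
    {"title": "Disc B", "score": 0.9, "price": 19.99},
    {"title": "Disc C", "score": 0.7, "price": 4.99},
]

def sort_hits(hits, field=None, reverse=False):
    """Sort by relevance (highest score first) unless another field is named."""
    if field is None:
        return sorted(hits, key=itemgetter("score"), reverse=True)
    return sorted(hits, key=itemgetter(field), reverse=reverse)
```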

LUCENE INTERNALS

Everything is a Document

• A document can represent anything textual:
  – Word Document
  – DVD (the textual metadata only)
  – Website Member (name, ID, etc.)
• A Lucene Document need not refer to an actual file on a disk; it could also represent a row in a relational database.
• Developers are responsible for turning their own data sets into Lucene Documents
• A document is seen as a list of fields, where a field has a name and a value
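
The "list of fields" view, and the step of turning application data into documents, can be sketched like this. The Python classes below only mirror the concept; they are not Lucene's `Document` and `Field` classes, and `row_to_document` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Field:
    name: str
    value: str

@dataclass
class Document:
    fields: List[Field]

    def get(self, name: str) -> Optional[str]:
        """Return the value of the first field with this name, or None."""
        for field in self.fields:
            if field.name == name:
                return field.value
        return None

def row_to_document(columns, row):
    """Turn one database-style row into a Document: one field per column, values as text."""
    return Document([Field(col, str(val)) for col, val in zip(columns, row)])
```

For example, `row_to_document(("id", "name"), (42, "Ada"))` yields a document whose `id` field holds the text "42".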

Indexes

• The unit of indexing in Lucene is a term. A term is often a word.

• Indexes track term frequencies
• Every term maps back to the Documents that contain it
• Lucene uses an inverted index, which allows it to quickly locate every document associated with a given set of input search terms.
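
A minimal inverted index that tracks term frequencies can be sketched in a few lines. This Python toy keeps everything in memory and says nothing about Lucene's on-disk format; `build_inverted_index` and `lookup` are hypothetical names.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency} for the documents that contain it."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def lookup(index, terms):
    """Ids of documents containing every one of the given terms."""
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings) if postings else set()
```

Finding the documents for a set of terms is then a matter of looking up each term and intersecting the resulting id sets, without scanning any document text.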

Basic Indexing

1. Parse different types of documents (HTML, PDF, Word, text files, etc.)
2. Extract tokens and related info (Lucene Analyzer)
3. Add the Document to an Index

Lucene provides a standard analyzer for English and other Latin-based languages.
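
Steps 2 and 3 can be sketched as a small analysis function feeding an index. This Python toy only approximates what an analyzer does (lowercasing, splitting on non-alphanumerics, dropping stop words); it is not Lucene's standard analyzer, the stop-word list is an assumption, and the regular expression handles ASCII text only.

```python
import re

# Assumed stop-word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in"})

def analyze(text):
    """Lowercase the text, split it into ASCII alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def add_document(index, doc_id, text):
    """Add the analyzed tokens of one document to a term -> set(doc_id) index."""
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)
```

Parsing HTML, PDF, or Word files into plain text (step 1) is assumed to have happened before `add_document` is called.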

Basic Searching

1. Create a Query
   • (e.g., by parsing user input)
2. Open an Index
3. Search the Index
   • Use the same Analyzer as before
4. Iterate through returned Documents
   • Extract out needed results
   • Extract out result scores (if needed)
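
The four steps can be sketched end to end with an in-memory index. This is a Python toy rather than Lucene code: `analyze`, `build_index`, and `search` are hypothetical, and the score is simply the number of query terms a document matches. The point it illustrates is that the query passes through the same analysis function that was used at indexing time.

```python
def analyze(text):
    """Shared analysis step: the same function is used for indexing and for querying."""
    return text.lower().split()

def build_index(stored):
    """Build a term -> set(doc_id) index from stored documents (doc_id -> text)."""
    index = {}
    for doc_id, text in stored.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, stored, user_input):
    # 1. Create a query by analyzing the user input.
    terms = analyze(user_input)
    # 2./3. Consult the (already opened) index; score = number of matched query terms.
    scores = {}
    for term in terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # 4. Iterate through the hits, best first, extracting stored text and score.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(stored[doc_id], score) for doc_id, score in ranked]
```

If the query were not lowercased the way the indexed text was, an input such as "STAR wars" would find nothing.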

Lucene as SOA

1. Design an HTTP query syntax
   – GET queries
   – XML for results

2. Wrap Tomcat around core code

3. Write a Client Library

Because this design follows SOA principles, basic building blocks such as load balancers can be deployed to quickly scale up the capacity of the search subsystem.
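
One possible shape for the HTTP query syntax and the XML result format is sketched below in Python. The slides do not specify either, so the parameter names `q` and `limit`, the element names `results` and `hit`, and the example host are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qs
import xml.etree.ElementTree as ET

def parse_search_url(url):
    """Extract the query text and result limit from a GET URL (assumed parameters: q, limit)."""
    params = parse_qs(urlsplit(url).query)
    query = params.get("q", [""])[0]
    limit = int(params.get("limit", ["10"])[0])
    return query, limit

def render_results(query, hits):
    """Serialize (doc_id, score) pairs as an XML document (assumed element names)."""
    root = ET.Element("results", {"query": query})
    for doc_id, score in hits:
        ET.SubElement(root, "hit", {"id": str(doc_id), "score": f"{score:.2f}"})
    return ET.tostring(root, encoding="unicode")
```

A servlet running in Tomcat would sit between these two functions, and a client library would build such URLs and parse the XML on the caller's side.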

Lucene as SOA Diagram: Single-Machine Architecture

A Lucene-based application includes three components:

1. Lucene Custom Client Library

2. Search Service

3. Custom Core Search Library

LUCENE SCALABILITY

Scalability Limits

• 3 main scalability factors:
  – Query Rate
  – Index Size
  – Update Rate

Query Rate Scalability

• Lucene is already fast
  – Built-in caching
• Easy solution for heavy workloads (gives near-linear scaling):
  – Add more query servers behind a load balancer
  – Can grow as your traffic grows
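
The load-balancer idea can be sketched as round-robin dispatch over identical query servers. This Python toy models each server as a callable holding a full copy of the index; `RoundRobinBalancer` and `make_server` are hypothetical, and a real deployment would use a network load balancer rather than in-process code.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Send each incoming query to the next server in turn."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one query server is required")
        self._servers = cycle(servers)

    def route(self, query):
        server = next(self._servers)
        return server(query)

def make_server(name):
    """Hypothetical query server: returns its own name alongside the query it handled."""
    return lambda query: (name, query)
```

Because every server can answer every query, adding a server adds capacity without changing the others, which is where the near-linear scaling comes from.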

Lucene as SOA Diagram: High-Scale Multi-Machine Architecture

Index Size Scalability

• Can easily handle millions of Documents
• Lucene is very commonly deployed into systems with 10s of millions of Documents.
• Main limits related to Index size that one is likely to run into will be disk capacity and disk I/O limits.

If you need bigger:
• Built-in multi-machine capabilities
  – Can merge multiple remote indexes at query-time.
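
Merging results from several indexes at query time can be sketched as a k-way merge of ranked lists. This Python toy is not Lucene's multi-index search; `merge_hits` is a hypothetical helper, and it assumes each index returns `(score, doc_id)` pairs already sorted best first, with scores that are comparable across indexes.

```python
import heapq
from itertools import islice

def merge_hits(shard_results, limit):
    """Merge per-index hit lists (each sorted by descending score) and keep the top `limit`."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))
```

Only the first `limit` merged hits are consumed, so each index needs to return no more than `limit` of its own best results.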

Update Rate

• Lucene is threadsafe
  – Can update and query at the same time
• I/O is the limiting factor

Strategies for achieving even higher update rates:

– Vertical Split – for big workloads (Centralized Index Building)
  1. Build indexes apart from query service
  2. Push updated indexes on intervals
– Horizontal Split – for huge workloads
  1. Split data into columns
  2. Merge columns for queries
  3. Columns only receive their own data for updates
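
The horizontal split can be sketched as hash partitioning: each document is assigned to one column, updates touch only that column, and queries fan out to all columns and merge the results. This is a Python toy under stated assumptions; `PartitionedIndex` is hypothetical, CRC32 is just one possible stable hash, and re-indexing a changed document (removing its old terms) is left out for brevity.

```python
import zlib

class PartitionedIndex:
    """Toy index split into independent columns, each a term -> set(doc_id) mapping."""

    def __init__(self, num_columns):
        self.columns = [dict() for _ in range(num_columns)]

    def _column_for(self, doc_id):
        # A stable hash, so the same document always lands in the same column.
        return zlib.crc32(doc_id.encode("utf-8")) % len(self.columns)

    def update(self, doc_id, text):
        """Index a document; only its own column receives the update."""
        column = self.columns[self._column_for(doc_id)]
        for term in text.lower().split():
            column.setdefault(term, set()).add(doc_id)

    def search(self, term):
        """Query every column and merge the per-column hits."""
        hits = set()
        for column in self.columns:
            hits |= column.get(term.lower(), set())
        return sorted(hits)
```

Since each column sees only its share of the updates, the write load per machine drops as columns are added, at the cost of touching every column on each query.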