sudarshan gaikaiwari - lucene @ yelp


Upload: lucidworks-archived

Post on 17-Dec-2014


DESCRIPTION

This talk describes how Yelp uses Lucene to provide search services. It includes:
* Statistics of Yelp search usage
* An overview of the Yelp search architecture: Yelp uses different services to search different types of data. Some are based on Lucene and some on Solr.
* A deeper dive into business and review search, the most important search service at Yelp.

TRANSCRIPT

Page 1: Sudarshan Gaikaiwari - Lucene @ Yelp

Lucene @ Yelp

Sudarshan Gaikaiwari

Page 2: Sudarshan Gaikaiwari - Lucene @ Yelp

Bio

1. Over a decade of experience in information retrieval
2. Used IR techniques at Symantec's DLP group
3. Search Engineer at Yelp

Page 3: Sudarshan Gaikaiwari - Lucene @ Yelp

Outline

1. Overview of search services at Yelp
2. Federation Motivation
3. Lucy Indexing
4. Lucy Searching
5. Efficiently Retrieving top k hits

Page 4: Sudarshan Gaikaiwari - Lucene @ Yelp

The services we provide

Page 5: Sudarshan Gaikaiwari - Lucene @ Yelp

Lucy: business search

Page 6: Sudarshan Gaikaiwari - Lucene @ Yelp

Lucy also powers phone search

Page 7: Sudarshan Gaikaiwari - Lucene @ Yelp

Cathy: she 'talks' a lot

Page 8: Sudarshan Gaikaiwari - Lucene @ Yelp

Listsearch: it searches lists....

Page 9: Sudarshan Gaikaiwari - Lucene @ Yelp

Reviewsearch: it searches reviews....

Page 10: Sudarshan Gaikaiwari - Lucene @ Yelp

DYM: did you really mean that?

Page 11: Sudarshan Gaikaiwari - Lucene @ Yelp

Suggest: auto completion

Page 12: Sudarshan Gaikaiwari - Lucene @ Yelp

Federation Motivation

Page 13: Sudarshan Gaikaiwari - Lucene @ Yelp

Problem

Search is too slow

Page 14: Sudarshan Gaikaiwari - Lucene @ Yelp

Hard Disk Seek Latency

Disk seek: 10,000,000 ns

Source: Software Engineering Advice from Building Large-Scale Distributed Systems, Jeffrey Dean

Page 15: Sudarshan Gaikaiwari - Lucene @ Yelp

RAM read latency

Main memory reference: 100 ns

Page 16: Sudarshan Gaikaiwari - Lucene @ Yelp

Pinning Index in RAM

● vmtouch
● mlock
● http://hoytech.com/vmtouch/

Page 17: Sudarshan Gaikaiwari - Lucene @ Yelp

Problem

Index is too large to fit in memory on a single machine

Page 18: Sudarshan Gaikaiwari - Lucene @ Yelp

Geographical sharding

Page 19: Sudarshan Gaikaiwari - Lucene @ Yelp

Geographical Sharding drawbacks

1. Cumbersome manual process to determine shard boundary
2. No guarantee that a boundary can be found

Page 20: Sudarshan Gaikaiwari - Lucene @ Yelp

Federation

1. Split index across multiple machines
2. Shard on business id
3. TF-IDF scores from different machines should be comparable

Page 21: Sudarshan Gaikaiwari - Lucene @ Yelp

Mapping businesses to shards

1. Assigning businesses to shards

shard = shardlist[hash(business_id) % len(shardlist)]

Problems

1. Involves re-indexing all the businesses if we want to add a new shard
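
The re-indexing cost of modulo placement can be estimated with a small sketch. It assumes made-up business ids and shard names, and it swaps Python's built-in `hash()` for an MD5-based hash so that results are repeatable across processes; none of these details come from the talk.

```python
import hashlib


def stable_hash(business_id):
    """Deterministic hash; the built-in hash() of a str varies between processes."""
    digest = hashlib.md5(business_id.encode("utf-8")).hexdigest()
    return int(digest, 16)


def assign_shard(business_id, shardlist):
    """Modulo placement, following the formula on this slide."""
    return shardlist[stable_hash(business_id) % len(shardlist)]


def fraction_moved(business_ids, old_shards, new_shards):
    """Share of businesses whose shard changes between the two shard lists."""
    moved = sum(
        1 for b in business_ids
        if assign_shard(b, old_shards) != assign_shard(b, new_shards)
    )
    return moved / len(business_ids)


# Hypothetical data: 10,000 made-up business ids, growing from 4 to 5 shards.
business_ids = [f"biz-{i}" for i in range(10_000)]
old_shards = ["shard0", "shard1", "shard2", "shard3"]
new_shards = old_shards + ["shard4"]
moved = fraction_moved(business_ids, old_shards, new_shards)
print(f"{moved:.0%} of businesses change shard")
```

With a uniform hash, a business keeps its shard only when `h % 4 == h % 5`, which holds for 4 of every 20 hash values, so roughly four in five businesses would move in this scenario.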

Page 22: Sudarshan Gaikaiwari - Lucene @ Yelp

Virtual Nodes

Page 23: Sudarshan Gaikaiwari - Lucene @ Yelp

Advantages

1. Flexibility (move vbuckets from one shard to another)
2. Split hot spot shards
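
A minimal sketch of the vbucket idea, under stated assumptions: the bucket count (1024), the round-robin starting layout, the MD5-based hash, and every function name here are illustrative and not taken from the talk. The point it shows is that a business maps to a fixed vbucket, and only the vbucket-to-shard table changes when a shard is added.

```python
import hashlib

NUM_VBUCKETS = 1024  # assumed value; stays fixed for the lifetime of the index


def vbucket_for(business_id):
    """Map a business to a virtual bucket; this mapping never changes."""
    digest = hashlib.md5(business_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_VBUCKETS


def initial_table(shards):
    """Spread vbuckets round-robin over the starting shards."""
    return {vb: shards[vb % len(shards)] for vb in range(NUM_VBUCKETS)}


def move_vbuckets(table, vbuckets, target_shard):
    """Return a new table with the given vbuckets reassigned to target_shard."""
    updated = dict(table)
    for vb in vbuckets:
        updated[vb] = target_shard
    return updated


def shard_for(business_id, table):
    return table[vbucket_for(business_id)]


# Hypothetical scenario: add "shard4" and hand it every fifth vbucket.
table = initial_table(["shard0", "shard1", "shard2", "shard3"])
moved_vbuckets = [vb for vb in range(NUM_VBUCKETS) if vb % 5 == 0]
new_table = move_vbuckets(table, moved_vbuckets, "shard4")

business_ids = [f"biz-{i}" for i in range(10_000)]
relocated = [b for b in business_ids if shard_for(b, table) != shard_for(b, new_table)]
print(f"{len(relocated) / len(business_ids):.0%} of businesses need re-indexing")
```

Only businesses in the moved vbuckets change shard, so the re-indexing work is proportional to the number of vbuckets moved rather than to the size of the whole index. Splitting a hot shard is the same operation applied to a subset of that shard's vbuckets.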

Page 24: Sudarshan Gaikaiwari - Lucene @ Yelp

Lucy Master Slave Architecture

Separate indexing (masters)
A master for each shard of a service

Searching (slaves)
A slave for every replica of a service

Page 25: Sudarshan Gaikaiwari - Lucene @ Yelp

Lucy Indexing

Page 26: Sudarshan Gaikaiwari - Lucene @ Yelp
Page 27: Sudarshan Gaikaiwari - Lucene @ Yelp
Page 28: Sudarshan Gaikaiwari - Lucene @ Yelp
Page 29: Sudarshan Gaikaiwari - Lucene @ Yelp
Page 30: Sudarshan Gaikaiwari - Lucene @ Yelp

Lucy Searching

Page 31: Sudarshan Gaikaiwari - Lucene @ Yelp
Page 32: Sudarshan Gaikaiwari - Lucene @ Yelp

Federator: Combining results across shards

1. Once we distribute an index across shards, we need a component that searches all these shards and combines their results.
2. Written in Python (runs inside a Python web process).
3. Uses the Tornado IO loop to send requests to all shards.
4. The transfer protocol for the requests is JSON-RPC.
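
The scatter-gather pattern described above can be sketched as follows. This is not the Yelp federator: it uses the standard library's `asyncio` in place of the Tornado IO loop, and `search_shard`, `federated_search`, and the `FAKE_SHARDS` data are hypothetical stand-ins for real JSON-RPC calls. It also assumes scores from different shards are directly comparable.

```python
import asyncio
import heapq

# Hypothetical per-shard results: (score, business_id), already sorted best-first.
FAKE_SHARDS = {
    "shard0": [(9.1, "biz-a"), (4.0, "biz-d"), (1.2, "biz-g")],
    "shard1": [(8.7, "biz-b"), (6.5, "biz-c"), (0.9, "biz-h")],
    "shard2": [(3.3, "biz-e"), (2.8, "biz-f")],
}


async def search_shard(shard, query, num_hits):
    """Stand-in for a JSON-RPC call to one shard; returns its top num_hits."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return FAKE_SHARDS[shard][:num_hits]


async def federated_search(query, start, count):
    """Query every shard concurrently, merge by score, and slice out one page."""
    num_hits = start + count  # each shard must cover every page up to this one
    per_shard = await asyncio.gather(
        *(search_shard(shard, query, num_hits) for shard in FAKE_SHARDS)
    )
    # Each shard's list is sorted descending, so a k-way merge preserves order.
    merged = heapq.merge(*per_shard, key=lambda hit: hit[0], reverse=True)
    return list(merged)[start:start + count]


page = asyncio.run(federated_search("pizza", start=2, count=3))
print(page)
```

Because every shard returns its hits already sorted, the federator only needs a k-way merge rather than a full sort of the combined list.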

Page 33: Sudarshan Gaikaiwari - Lucene @ Yelp

Lucy Server

Page 34: Sudarshan Gaikaiwari - Lucene @ Yelp
Page 35: Sudarshan Gaikaiwari - Lucene @ Yelp
Page 36: Sudarshan Gaikaiwari - Lucene @ Yelp

Tokens to Business Attributes

Page 37: Sudarshan Gaikaiwari - Lucene @ Yelp

Executing queries

1. Gather the top results for a query
2. Collect attribute statistics for attributes like places, categories

Page 38: Sudarshan Gaikaiwari - Lucene @ Yelp

Lucene

1. Efficiently executes queries over the index
2. Provides a score for how relevant the business is to the words in the query (word score)
3. Upgrading Lucene to 2.9/3.1 is WIP

Page 39: Sudarshan Gaikaiwari - Lucene @ Yelp
Page 40: Sudarshan Gaikaiwari - Lucene @ Yelp

Successive geobounds relaxation

Page 41: Sudarshan Gaikaiwari - Lucene @ Yelp

Successive geobounds relaxation

Page 42: Sudarshan Gaikaiwari - Lucene @ Yelp

Federation

Page 43: Sudarshan Gaikaiwari - Lucene @ Yelp

Efficiently Retrieving top k hits

1. When a user moves through multiple pages, the number of hits to be returned increases

num hits = start + count

2. So if we need to retrieve 500 hits, the naive way would be to retrieve 500 hits from each shard and then sort them

Page 44: Sudarshan Gaikaiwari - Lucene @ Yelp

Distribution of hits in shards

Page 45: Sudarshan Gaikaiwari - Lucene @ Yelp
Page 46: Sudarshan Gaikaiwari - Lucene @ Yelp

Probability a hit is in a shard

Page 47: Sudarshan Gaikaiwari - Lucene @ Yelp

Binomial Distribution

Probability that r of the top k hits are in a particular shard

Mean

Variance

Page 48: Sudarshan Gaikaiwari - Lucene @ Yelp

Formula

Std Deviation

Formula
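
For reference, these are the standard binomial results that the headings on these two slides name. The symbol X (the number of the top k hits that land in one particular shard) is introduced here for illustration, and the formulas assume each hit falls in that shard independently with probability p.

```latex
P(X = r) = \binom{k}{r}\, p^{r} (1 - p)^{k - r}, \qquad
\mu = k p, \qquad
\sigma^{2} = k p (1 - p), \qquad
\sigma = \sqrt{k p (1 - p)}
```

For example, with k = 100 and p = 0.2 (five equally likely shards), the mean is 20 hits per shard and the standard deviation is 4.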

Page 49: Sudarshan Gaikaiwari - Lucene @ Yelp

Simulation

k = 100, p = 0.2

Formula      Hits selected from each shard      Results missed (%)
[formula]    24                                 0.017
[formula]    32                                 0.0001407
[formula]    44                                 0.00000
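
A simulation of this kind could look like the sketch below. It is an assumption-laden reconstruction, not the code behind the slide: it takes p = 0.2 to mean five equally likely shards, and the function name, trial count, and seed are arbitrary. For these parameters the mean is 20 and the standard deviation is 4, so the per-shard sizes 24, 32 and 44 happen to equal the mean plus 1, 3 and 6 standard deviations; that may be what the Formula column contained, but the transcript does not say so.

```python
import random


def missed_fraction(k, num_shards, per_shard_limit, trials=2000, seed=0):
    """Estimate the fraction of true top-k hits lost when every shard
    returns at most per_shard_limit hits."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    missed = 0
    for _ in range(trials):
        counts = [0] * num_shards
        for _ in range(k):
            counts[rng.randrange(num_shards)] += 1  # hit lands on a random shard
        # Hits beyond the limit on a shard never reach the federator.
        missed += sum(max(0, c - per_shard_limit) for c in counts)
    return missed / (k * trials)


for limit in (24, 32, 44, 100):
    print(limit, missed_fraction(k=100, num_shards=5, per_shard_limit=limit))
```

The function returns a fraction rather than a percentage, and its estimates depend on the trial count and seed, so they should not be expected to match the slide's figures exactly.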

Page 50: Sudarshan Gaikaiwari - Lucene @ Yelp

Simulation Graph

Page 51: Sudarshan Gaikaiwari - Lucene @ Yelp

Results

1. ~ 50% savings over 100 hits (44 hits requested from each shard)

2. 77% savings over 1000 hits (228 hits requested from each shard)
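
The savings figures can be checked with a short calculation. The shard count of 5 is an assumption inferred from p = 0.2 and is not stated on this slide; it cancels out of the ratio in any case. The function name is illustrative.

```python
def savings(k, per_shard_hits, num_shards=5):
    """Fraction of hits not transferred, relative to asking every shard for k hits."""
    naive = k * num_shards                 # every shard returns k hits
    reduced = per_shard_hits * num_shards  # every shard returns the smaller amount
    return 1 - reduced / naive


print(f"{savings(100, 44):.1%}")    # 100-hit case
print(f"{savings(1000, 228):.1%}")  # 1000-hit case
```

Under these assumptions the 100-hit case works out to 56%, in line with the slide's rounded "~ 50%", and the 1000-hit case to 77.2%.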

Page 52: Sudarshan Gaikaiwari - Lucene @ Yelp

Future work

1. In-memory index
2. Move towards real-time search

Page 53: Sudarshan Gaikaiwari - Lucene @ Yelp

Come Join Us!

Page 54: Sudarshan Gaikaiwari - Lucene @ Yelp

Thank You
