HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket

Kyungseog Oh May 22, 2012 HBaseCon

DESCRIPTION

Solbase is an exciting new open-source, real-time search engine being developed at Photobucket to serve the more than 30 million search requests Photobucket handles each day. Solbase replaces Lucene’s file system-based index with HBase, which allows the system to update in real time and scale linearly to serve millions of daily search requests on a large dataset. This session will explore the architecture of Solbase as well as some of the inherent Lucene/Solr issues we overcame. Finally, we’ll go over performance metrics of Solbase against production traffic.

TRANSCRIPT

Page 1: HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket

Kyungseog Oh, May 22, 2012, HBaseCon

Page 2: What is Solbase?

Solbase is an open-source, real-time search platform based on Lucene, Solr, and HBase, built at Photobucket

Page 3: Search at Photobucket

• 40% of total page views
• 500 million ‘docs’ or images
• 30 million search requests per day
• 120 Gigabyte size
• Previous infrastructure built on Solr/Lucene

Page 4: Why Solbase?

• Memory issues
• Indexing time
• Speed
• Capacity and Scalability

Page 5: Lucene Memory Issues

• Field Cache
  – Sortable and filterable fields stored in a Java array the size of the maximum document number
• Example
  – Every doc is sorted by an integer field; for 500 million documents the array is 2 GB in size
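
The 2 GB figure can be checked with simple arithmetic. The sketch below assumes each array slot is a 4-byte Java int; the slide does not state the element size, so that value and the function name are illustrative only.

```python
# Back-of-the-envelope size of a Lucene-style field cache for one integer field.
# Assumption (not stated on the slide): each slot is a 4-byte Java int.

BYTES_PER_INT = 4


def field_cache_bytes(max_doc: int, bytes_per_value: int = BYTES_PER_INT) -> int:
    """One array slot per document number, regardless of which docs a query matches."""
    return max_doc * bytes_per_value


size = field_cache_bytes(500_000_000)
print(size)          # 2000000000
print(size / 10**9)  # 2.0 (decimal gigabytes)
```

Because the array is sized by the maximum document number, every additional sortable or filterable field would cost roughly the same amount again under this assumption.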

Page 6: Indexing Time

• Solr indexing took 15-16 hours to rebuild the indices
• We wanted to provide near real-time updates

Page 7: Speed

• Every 100 ms improvement in response time equates to approximately 1 extra page view per visit
• Can end up being hundreds of millions of extra page views per month

Page 8: Capacity & Scalability

• Impractical to add a significant number of new docs and data (Geo, Exif, etc.)
• Difficult to divide the data set to create a brand new shard
• Fault tolerance is not built in

Page 9: The Concept

Modify Lucene and Solr to use HBase as the source of index and document data

Page 10: Solbase tables

Term/Document Tables

create 'TV', {NAME=>'d', COMPRESSION=>'SNAPPY', VERSIONS=>1, REPLICATION_SCOPE=>1}

create 'Docs', {NAME=>'field', COMPRESSION=>'SNAPPY', VERSIONS=>1, REPLICATION_SCOPE=>1}, {NAME=>'allTerms', COMPRESSION=>'SNAPPY', VERSIONS=>1, REPLICATION_SCOPE=>1}, {NAME=>'timestamp', COMPRESSION=>'SNAPPY', VERSIONS=>1, REPLICATION_SCOPE=>1}

Page 11: Query Methodology

Term Queries are HBase range scans

Start key: <field><delimiter><term><delimiter><begin doc id> (0x00000000)

End key: <field><delimiter><term><delimiter><end doc id> (0xffffffff)
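
The key layout above can be illustrated with a small in-memory stand-in for an HBase scan. This is a sketch under assumptions, not Solbase's code: the delimiter byte, the 4-byte big-endian doc id encoding, the field name `tags`, and all function names are hypothetical, since the slide shows only the `<field><delimiter><term><delimiter><doc id>` pattern.

```python
import bisect
import struct

# Hypothetical delimiter; the slide only shows <delimiter>, not its byte value.
DELIM = b"\x7f"


def term_key(field: str, term: str, doc_id: int) -> bytes:
    """Row key: <field><delimiter><term><delimiter><4-byte big-endian doc id>."""
    return field.encode() + DELIM + term.encode() + DELIM + struct.pack(">I", doc_id)


def scan_bounds(field, term, begin_doc=0x00000000, end_doc=0xFFFFFFFF):
    """Start and end keys covering every doc id for one field/term pair."""
    return term_key(field, term, begin_doc), term_key(field, term, end_doc)


def range_scan(sorted_keys, start, stop):
    """Stand-in for a scan over lexicographically sorted row keys.

    Both bounds are inclusive here; a real HBase scan usually treats the
    stop row as exclusive.
    """
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_right(sorted_keys, stop)
    return sorted_keys[lo:hi]


keys = sorted([
    term_key("tags", "cat", 7),
    term_key("tags", "cat", 300),
    term_key("tags", "catalog", 5),
    term_key("tags", "dog", 7),
])
start, stop = scan_bounds("tags", "cat")
hits = [struct.unpack(">I", k[-4:])[0] for k in range_scan(keys, start, stop)]
print(hits)  # [7, 300]
```

Big-endian doc ids are used in the sketch so that byte-wise key order matches numeric doc id order, which is what lets a single contiguous scan return one term's postings in order.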

Page 12: Solbase – Distributed Processing

[Diagram: Solr Sharding. A Master and four Shards, each Shard with its own Index File.]

[Diagram: Solbase Sharding. A Master and four Shards, with HBase as the shared store.]

Page 13: Solbase – Sorts & Filters

• Extra bits in Encoded Metadata
• Solved Lucene’s sort/filter field cache issue
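
One way to picture the idea of carrying sort and filter values in per-posting metadata is sketched below. The slides do not describe Solbase's actual encoding, so the 8-byte layout, the field meanings, and every name here are assumptions made purely for illustration.

```python
import struct

# Hypothetical layout: the slides do not describe Solbase's actual encoding.
# Each posting carries its own sort value, so no global per-document array is needed.
POSTING = struct.Struct(">II")  # (doc id, sort value), both unsigned 32-bit


def encode_posting(doc_id: int, sort_value: int) -> bytes:
    return POSTING.pack(doc_id, sort_value)


def decode_posting(raw: bytes):
    return POSTING.unpack(raw)


def top_docs(raw_postings, limit, min_value=0):
    """Filter and sort using only the values embedded in the matched postings."""
    decoded = (decode_posting(raw) for raw in raw_postings)
    kept = [(doc, val) for doc, val in decoded if val >= min_value]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept[:limit]]


postings = [encode_posting(1, 50), encode_posting(2, 900), encode_posting(3, 10)]
print(top_docs(postings, limit=2))                 # [2, 1]
print(top_docs(postings, limit=2, min_value=100))  # [2]
```

In this sketch, memory use scales with the number of postings a query touches rather than with the maximum document number, which is the contrast with the field cache described on the earlier slide.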

Page 14: Solbase – Indexing Process

• Initial Indexing
  – Leveraging Map/Reduce Framework
• Real-Time Indexing
  – Using Solr’s update API
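
For readers unfamiliar with Solr's update API, the sketch below builds a standard Solr XML `<add>` message for a single document. The field names and the helper function are hypothetical, not Photobucket's schema, and the slides do not say whether Solbase accepts exactly this payload.

```python
import xml.etree.ElementTree as ET


def solr_add_message(doc: dict) -> str:
    """Build a Solr XML <add> message for one document."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Field names below are hypothetical, not Photobucket's schema.
payload = solr_add_message({"id": 42, "title": "sunset at the beach"})
print(payload)
```

In stock Solr, a message like this is sent as an HTTP POST to the update handler with an XML content type; the sketch stops at building the payload so it runs without a server.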

Page 15: Results

• Term ‘me’ takes 13 seconds to load from HBase, 500 ms from cache
  – ‘me’ has ~14M docs, the largest term in our indices
• Most terms not in cache take < 200 ms
• Most cached terms take < 20 ms
• Average query time for native Solr/Lucene: 169 ms
• Average query time for Solbase: 109 ms or 35% decrease
• ~300 real-time updates per second
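
The headline percentages follow directly from the reported timings. A small check, using only the numbers on the slide (the function name is illustrative):

```python
def percent_decrease(before_ms: float, after_ms: float) -> float:
    """Relative reduction from before_ms to after_ms, in percent."""
    return (before_ms - after_ms) / before_ms * 100


# Average query time: native Solr/Lucene 169 ms vs. Solbase 109 ms.
print(round(percent_decrease(169, 109), 1))  # 35.5

# Term 'me': 13 seconds from HBase vs. 500 ms from cache.
print(round(13_000 / 500))  # 26
```

So the quoted 35% decrease is the 60 ms saving relative to the 169 ms baseline, and the cache makes the largest term about 26 times faster to load.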

Page 16: HBase configuration/Limitation

• Compatibility issue with latest Solr
• CDH3 latest build
• HBase/Solbase clusters per data center

Page 17: Repos

• https://github.com/Photobucket/Solbase
• https://github.com/Photobucket/Solbase-Lucene
• https://github.com/Photobucket/Solbase-Solr

Page 18: Q&A