lily for the bay area hbase ug - nyc edition

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org Presenting Lily Bay Area HBase UG - NYC - 10/11/2010

Upload: ngdata

Post on 12-Apr-2017



Presenting Lily

Bay Area HBase UG - NYC - 10/11/2010


Devoxx: Nov. 15-19, Antwerp, Belgium

NoSQL/Cloud track


Outerthought

» software product company

» scalable content applications

» open source product portfolio

» Java, REST, internet


Technology


» Lily : NoSQL-based content repository (HBase + SOLR)

» Kauri : REST centric webapp dev framework

» Daisy : techdoc / QDoc / publishing CMS


Needs for Scalable Content


»wire-speed capturing

» batch-oriented post-processing

» semantic lifting : extracting knowledge out of noise

» data and inferred data become one

➡ NoSQL & write-optimized storage

➡ map/reduce

➡ Natural Language Processing

➡ smart content repositories


The Lily Project

content repository: store + search

REST-centric content app UI framework

content augmentation (enrichment)

ins and outs

cloud-scale content applications

alternative indexes

batch processing and process coordination

[Diagram: braces group these layers by who provides them: us, partners, customers]


Lily essentials

»www.lilyproject.org

»Apache license for maximal flexibility

» (lots of) documentation at docs.outerthought.org


Lily content repository


» Scalable store (HBase) and search (SOLR)

» flexible content model

» index maintenance

» high-level API

» base foundation

content application

repository


HBase

» a data model in which some column families keep all versions and others do not, which fits our CMS document model very well

» ordered tables with the ability to do range scans on them, which makes it possible to build scalable indexes on top of them

» HDFS, a convenient place to store large blobs

» Apache license and community, a familiar environment for us
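The two HBase properties named above (per-column-family version retention and ordered tables with range scans) can be illustrated with a small in-memory sketch. This is not the HBase client API: `ToyTable`, its methods, the family names `versioned` and `latest`, and the row keys are all hypothetical names chosen for illustration.

```python
import bisect


class ToyTable:
    """In-memory stand-in for an ordered, versioned table (not the HBase API)."""

    def __init__(self, max_versions):
        # max_versions maps a column family name to how many versions it keeps
        self.max_versions = max_versions
        self.rows = {}          # rowkey -> {(family, qualifier): [(ts, value), ...]}
        self.sorted_keys = []   # rowkeys kept sorted so range scans are cheap

    def put(self, rowkey, family, qualifier, ts, value):
        if rowkey not in self.rows:
            self.rows[rowkey] = {}
            bisect.insort(self.sorted_keys, rowkey)
        cells = self.rows[rowkey].setdefault((family, qualifier), [])
        cells.append((ts, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)  # newest first
        del cells[self.max_versions[family]:]               # drop versions beyond the limit

    def versions(self, rowkey, family, qualifier):
        """Return the retained values, newest first."""
        return [value for _, value in self.rows[rowkey][(family, qualifier)]]

    def scan(self, start, stop):
        """Return rowkeys in [start, stop), in sorted order."""
        lo = bisect.bisect_left(self.sorted_keys, start)
        hi = bisect.bisect_left(self.sorted_keys, stop)
        return self.sorted_keys[lo:hi]


# One family keeps history, the other keeps only the latest value (assumed names).
table = ToyTable({"versioned": 100, "latest": 1})
table.put("doc-002", "versioned", "title", 1, "Draft")
table.put("doc-002", "versioned", "title", 2, "Final")
table.put("doc-002", "latest", "state", 1, "new")
table.put("doc-002", "latest", "state", 2, "published")
table.put("doc-001", "latest", "state", 1, "new")
table.put("img-001", "latest", "state", 1, "new")

title_history = table.versions("doc-002", "versioned", "title")
current_state = table.versions("doc-002", "latest", "state")
doc_keys = table.scan("doc-", "doc.")   # "." sorts right after "-", so this covers the "doc-" prefix
```

The versioned family keeps the full edit history of a document field, while the other family holds only the current value; the prefix scan returns only the `doc-` rows without touching the rest of the table.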


1. Store, 2. Search...? Ouch.

»CMS = two types of search

» structured, ‘logic’ search

    » numbers, strings

    » based on logic (SQL, anyone?)

» information retrieval (or: full-text search)

    » text

    » based on statistics
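The difference between the two search types can be sketched in a few lines. The documents, field names and the TF-IDF scoring below are illustrative assumptions, not Lily's or SOLR's actual ranking: structured search applies an exact predicate, while full-text search ranks documents by a statistical score.

```python
import math
from collections import Counter

# Hypothetical sample records: structured fields plus free text.
docs = {
    "A": {"pages": 12, "lang": "en", "text": "hbase stores rows in sorted order"},
    "B": {"pages": 40, "lang": "en", "text": "solr ranks text by relevance"},
    "C": {"pages": 7, "lang": "nl", "text": "hbase and solr together store and rank text"},
}


def structured_search(docs, predicate):
    """Logic-based search: a record either matches the predicate or it does not."""
    return sorted(key for key, doc in docs.items() if predicate(doc))


def ranked_search(docs, query):
    """Statistics-based search: order records by a simple TF-IDF score."""
    n = len(docs)
    tokens = {key: doc["text"].split() for key, doc in docs.items()}
    # Document frequency: in how many records does each term occur?
    df = Counter(term for toks in tokens.values() for term in set(toks))
    scores = {}
    for key, toks in tokens.items():
        tf = Counter(toks)
        score = sum(
            tf[term] / len(toks) * math.log(n / df[term])
            for term in query.split()
            if df[term]
        )
        if score > 0:
            scores[key] = score
    return sorted(scores, key=scores.get, reverse=True)


exact_hits = structured_search(docs, lambda d: d["lang"] == "en" and d["pages"] > 10)
ranked_hits = ranked_search(docs, "solr text")
```

The structured query returns an unordered set of exact matches, whereas the ranked query returns its matches in order of estimated relevance.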


Search ponderings

»All of that, at scale


Structured Search

»HBase Indexing Library

» idea from Google App Engine datastore indexes

» http://code.google.com/appengine/articles/index_building.html


content table

rowkey   col    col
A        val3   foo6
B        val2   foo7

index table A

rowkey   col
val2-B
val3-A

(index rows are kept in rowkey order)
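The index table idea in the diagram can be sketched as follows: each index row key combines the indexed value with the content row key, so a range scan over the sorted index yields the matching content rows. This is a minimal illustration, not the HBase Indexing Library API; the column names `col1` and `col2`, the `-` separator and both function names are assumptions, and a real index would need a byte encoding that preserves sort order for numbers and other types.

```python
import bisect

# Content table from the diagram; the column names are hypothetical.
content = {
    "A": {"col1": "val3", "col2": "foo6"},
    "B": {"col1": "val2", "col2": "foo7"},
}


def build_index(content, column, sep="-"):
    """Return sorted index row keys of the form '<value><sep><content rowkey>'."""
    return sorted(
        f"{row[column]}{sep}{rowkey}"
        for rowkey, row in content.items()
        if column in row
    )


def lookup_range(index, low, high, sep="-"):
    """Return content rowkeys whose index key k satisfies low <= k < high."""
    start = bisect.bisect_left(index, low)
    stop = bisect.bisect_left(index, high)
    return [key.rsplit(sep, 1)[1] for key in index[start:stop]]


index_a = build_index(content, "col1")
only_val2 = lookup_range(index_a, "val2", "val3")
all_vals = lookup_range(index_a, "val", "vam")
```

Because the index is sorted by value first, an equality or range query becomes a contiguous scan of the index table followed by lookups of the returned row keys in the content table.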


Full-text / IR search

» Lucene?

» no sharding (for scale)

» no replication (for availability)

» batched index updates (not real-time)


Beyond Lucene

» Katta

» scalable architecture, however only search, no indexing

» Elastic Search

» very young (sorry)

» hbasene et al.

» stores inverted index in HBase, does not scale all features

» SOLR

» widely used, schema, facets, query syntax, cloud branch


+ ?

= Easy! Or?


➙ Need for reliable queuing


Connecting things

»we needed a reliable bridge between our main storage (HBase) and our index/search server(s) (SOLR)

» indexing, reindexing, mass reindexing (M/R)

»we need a reliable method of updating HBase secondary indexes

» all of that eventually to run distributed

» distribution means coping with failure


Solution

» ... a QUEUE ! (Meh)

» ACMEMessageQueue? Bzzzzzt. We wanted fault-safe HBase persistence for the queues. Also for ease of administration.

»➙ WAL & Queue implemented on top of HBase tables
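One way to picture a queue built on top of an ordered table is sketched below. It is only an illustration under stated assumptions, not Lily's implementation: a plain dict stands in for the HBase table, the row key is a zero-padded sequence number, and `TableQueue`, `put`, `process` and `flaky_index` are hypothetical names.

```python
import itertools


class TableQueue:
    """Toy queue whose messages are rows in an ordered key-value table."""

    def __init__(self):
        self.table = {}                  # stand-in for a persistent table: rowkey -> payload
        self._seq = itertools.count()

    def put(self, payload):
        # Zero-padded sequence numbers sort in insertion order.
        rowkey = f"{next(self._seq):012d}"
        self.table[rowkey] = payload
        return rowkey

    def process(self, handler, batch=10):
        """Scan the oldest rows; delete a row only after its handler succeeded."""
        done = 0
        for rowkey in sorted(self.table)[:batch]:
            try:
                handler(self.table[rowkey])
            except Exception:
                continue                 # message stays in the table and is retried later
            del self.table[rowkey]
            done += 1
        return done


queue = TableQueue()
for doc in ["doc-1", "doc-2", "doc-3"]:
    queue.put(doc)

indexed = []
failures = {"doc-2": 1}                  # simulate one failed attempt for doc-2


def flaky_index(doc):
    if failures.get(doc, 0) > 0:
        failures[doc] -= 1
        raise RuntimeError("back-end unavailable")
    indexed.append(doc)


first_pass = queue.process(flaky_index)
second_pass = queue.process(flaky_index)
```

Since a message is deleted only after it has been handled, a crash or a back-end failure leaves it in the table, and the persistence guarantees of the queue are those of the underlying store.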


WAL & Queue = RowLog Library

» WAL

    » guaranteed execution of synchronous actions

    » call doesn’t return before secondary action finishes

    » e.g. update secondary indexes

    » if all goes well, size = #concurrent ops

    » useful outside of Lily context as well!

» Queue

    » triggering of async actions

    » e.g. (re)index (updated) record with SOLR back-end

    » size depends on speed of back-end process
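The WAL half of this slide can be sketched as a log entry that is written before the secondary action runs and removed only once it has finished. This is a hedged illustration, not the RowLog API: `ToyWal`, `run` and `replay` are hypothetical names, and a dict stands in for the HBase table that would hold the log.

```python
class ToyWal:
    """Toy write-ahead log guaranteeing that a synchronous secondary action runs."""

    def __init__(self):
        self.log = {}        # stand-in for persisted entries: entry id -> payload
        self._next_id = 0

    def run(self, payload, secondary_action):
        entry_id = self._next_id
        self._next_id += 1
        self.log[entry_id] = payload     # 1. record the intent first
        secondary_action(payload)        # 2. run synchronously; the call blocks until done
        del self.log[entry_id]           # 3. remove the entry only after success

    def replay(self, secondary_action):
        """After a failure, re-run every action whose entry is still in the log."""
        for entry_id in sorted(self.log):
            secondary_action(self.log[entry_id])
            del self.log[entry_id]


wal = ToyWal()
secondary_index = []

wal.run("rec-1", secondary_index.append)
size_after_success = len(wal.log)


def broken(_payload):
    raise RuntimeError("index table unavailable")


try:
    wal.run("rec-2", broken)
except RuntimeError:
    pass
size_after_failure = len(wal.log)

wal.replay(secondary_index.append)
size_after_replay = len(wal.log)
```

When every action succeeds, the log only holds entries for operations currently in flight, which matches the "size = #concurrent ops" remark; entries that survive a failure are exactly the actions that still need to be executed.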


The Sum

» Lily model (records & fields)

» mapped onto HBase (=storage)

» indexed and searchable through SOLR

» using a WAL/Queue mechanism implemented in HBase

» runtime based on Kauri

» with client/server comms via Avro (and a REST interface with JSON)


Architecture


Architecture


Lily roadmap

» development started Sept. 2009

» development trunk opened Jul. 2010

» end of Oct. 2010: milestone/beta release

    » fully distributable

    » spec-complete

» Onwards:

» ‘business-level’ 1.0 release (packaging, testing, performance)

» user/auth management & access control

» UI framework (Kauri)

» ins and outs, semantic lifting


» [email protected]

» @stevenn

Thanks for your hospitality and attention !
