Lily for the Bay Area HBase UG - NYC edition
IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org
Presenting Lily
Bay Area HBase UG - NYC - 10/11/2010
Devoxx: Nov. 15-19, Antwerp, Belgium
NoSQL/Cloud track
Outerthought
» software product company
» scalable content applications
» open source product portfolio
» Java, REST, internet
Technology
» Lily : NoSQL-based content repository (HBase + SOLR)
» Kauri : REST-centric webapp dev framework
» Daisy : techdoc / QDoc / publishing CMS
Needs for Scalable Content
» wire-speed capturing
» batch-oriented post-processing
» semantic lifting : extracting knowledge out of noise
» data and inferred data become one
➡ NoSQL & write-optimized storage
➡ map/reduce
➡ Natural Language Processing
➡ smart content repositories
The Lily Project
content repository: store + search
REST-centric content app UI framework
content augmentation (enrichment)
ins and outs
cloud-scale content applications
alternative indexes
batch processing and process coordination
} us
} partners
customers
Lily essentials
» www.lilyproject.org
» Apache license for maximal flexibility
» (lots of) documentation at docs.outerthought.org
Lily content repository
» Scalable store (HBase) and search (SOLR)
» flexible content model
» index maintenance
» high-level API
» base foundation
content application
repository
HBase
» a data model in which some column families keep all versions and others do not, which fits our CMS document model very well
» ordered tables with the ability to do range scans on them, which allows building scalable indexes on top of them
» HDFS, a convenient place to store large blobs
» Apache license and community, a familiar environment for us
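
The first bullet above can be illustrated with a small in-memory model. This is only a sketch, not HBase client code: the `Family` class, the `history` and `current` family names, and the `doc1` row are hypothetical. The one thing it borrows from HBase is the idea that the number of retained versions is set per column family.

```python
from collections import defaultdict


class Family:
    """Toy column family that keeps at most max_versions values per cell."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        # (row, qualifier) -> list of (timestamp, value), newest first
        self.cells = defaultdict(list)

    def put(self, row, qualifier, ts, value):
        versions = self.cells[(row, qualifier)]
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, qualifier, n=1):
        """Return up to n values, newest first."""
        return [value for _, value in self.cells[(row, qualifier)][:n]]


# One family keeps (practically) every version, the other only the latest.
history = Family(max_versions=10**9)
current = Family(max_versions=1)

for ts, title in enumerate(["Draft", "Review", "Final"], start=1):
    history.put("doc1", "title", ts, title)
    current.put("doc1", "title", ts, title)
```

A versioned document field would live in a family like `history`, while a field that needs no history would live in one like `current`.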
1. Store, 2. Search...? Ouch.
» CMS = two types of search
» structured, ‘logic’ search
  » numbers, strings
  » based on logic (SQL, anyone?)
» information retrieval (or: full-text search)
  » text
  » based on statistics
Search ponderings
» All of that, at scale
Structured Search
» HBase Indexing Library
» idea from Google App Engine datastore indexes
» http://code.google.com/appengine/articles/index_building.html
content table
rowkey   col    col
A        val3   foo6
B        val2   foo7

index table A (rowkeys in col order)
rowkey
val2-B
val3-A
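
The index-table idea in the diagram can be sketched in a few lines. This is a minimal in-memory illustration, not the HBase Indexing Library API: a sorted Python list stands in for the ordered index table, and `build_index`, `range_scan`, `lookup`, and the `-` separator are assumptions made for the example. A real key encoding would have to handle values that contain the separator.

```python
import bisect

# Content table from the diagram: rowkey -> columns.
content = {"A": {"col": "val3"}, "B": {"col": "val2"}}

SEP = "-"  # assumed separator between indexed value and content rowkey


def build_index(table, column):
    """Index rowkeys are '<value>-<content rowkey>', kept in sorted order."""
    return sorted(f"{row[column]}{SEP}{key}" for key, row in table.items())


def range_scan(index, start, stop):
    """Return all index rowkeys in [start, stop), like an ordered-table scan."""
    lo = bisect.bisect_left(index, start)
    hi = bisect.bisect_left(index, stop)
    return index[lo:hi]


def lookup(index, value):
    """Find content rowkeys whose indexed column equals value."""
    prefix = value + SEP
    hits = range_scan(index, prefix, prefix + "\uffff")
    return [key[len(prefix):] for key in hits]


index = build_index(content, "col")
```

Because the index rowkeys sort by value, an equality or range query on `col` becomes a contiguous scan over the index instead of a scan over the whole content table.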
Full-text / IR search
» Lucene?
» no sharding (for scale)
» no replication (for availability)
» batched index updates (not real-time)
Beyond Lucene
» Katta
  » scalable architecture, however only search, no indexing
» Elastic Search
  » very young (sorry)
» hbasene et al.
  » stores inverted index in HBase, does not scale all features
» SOLR
  » widely used, schema, facets, query syntax, cloud branch
➙ Need for reliable queuing
Connecting things
» we needed a reliable bridge between our main storage (HBase) and our index/search server(s) (SOLR)
» indexing, reindexing, mass reindexing (M/R)
» we need a reliable method of updating HBase secondary indexes
» all of that eventually to run distributed
» distribution means coping with failure
Solution
» ... a QUEUE ! (Meh)
» ACMEMessageQueue? Bzzzzzt. We wanted fault-safe HBase persistence for the queues. Also for ease of administration.
» ➙ WAL & Queue implemented on top of HBase tables
WAL & Queue = RowLog Library
» WAL
  » guaranteed execution of synchronous actions
  » call doesn’t return before secondary action finishes
  » e.g. update secondary indexes
  » if all goes well, size = #concurrent ops
  » useful outside of Lily context as well!
» Queue
  » triggering of async actions
  » e.g. (re)index (updated) record with SOLR back-end
  » size depends on speed of back-end process
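
The difference between the two modes can be sketched as follows. This is a toy model under stated assumptions, not the RowLog Library's API: the `RowLog` class, `wal_write`, and `queue_write` are hypothetical names, a Python dict stands in for the HBase table that persists the messages, and plain lists stand in for the secondary index and the SOLR back-end.

```python
class RowLog:
    """Messages are kept in a dict that stands in for an HBase table."""

    def __init__(self, handler):
        self.handler = handler
        self.pending = {}  # seq -> message; kept until the handler succeeds
        self.next_seq = 0

    def put(self, message):
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = message
        return seq

    def process(self, seq):
        """Run the handler; remove the message only if it succeeded."""
        try:
            self.handler(self.pending[seq])
        except Exception:
            return False  # message stays pending and can be retried later
        del self.pending[seq]
        return True

    def process_all(self):
        """What a background worker would do: drain pending messages in order."""
        for seq in sorted(self.pending):
            self.process(seq)


def wal_write(log, message):
    """WAL style: the call returns only after the secondary action has run."""
    return log.process(log.put(message))


def queue_write(log, message):
    """Queue style: record the message and return; a worker drains it later."""
    log.put(message)


secondary_index = []  # stand-in for an HBase secondary index
solr_docs = []        # stand-in for the SOLR back-end

wal = RowLog(secondary_index.append)
queue = RowLog(solr_docs.append)

wal_write(wal, "rec1")      # index is updated before this call returns
queue_write(queue, "rec1")  # SOLR is not touched until the queue is drained
```

In this model the WAL holds only in-flight operations when nothing fails, while the queue grows or shrinks with the speed of whatever calls `process_all`.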
The Sum
» Lily model (records & fields)
» mapped onto HBase (=storage)
» indexed and searchable through SOLR
» using a WAL/Queue mechanism implemented in HBase
» runtime based on Kauri
» with client/server comms via Avro (and a REST interface with JSON)
Lily roadmap
» development started Sept. 2009
» development trunk opened Jul. 2010
» end of Oct. 2010: milestone/beta release
  » fully distributable
  » spec-complete
» Onwards:
  » ‘business-level’ 1.0 release (packaging, testing, performance)
  » user/auth management & access control
  » UI framework (Kauri)
  » ins and outs, semantic lifting
» @stevenn
Thanks for your hospitality and attention !