Musings on Secondary Indexing in HBase
DESCRIPTION
Presentation on secondary indexes from the 9/11/12 HBase Contributor's Meetup. It summarizes the discussion so far and some possible future directions.
TRANSCRIPT
![Page 1: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/1.jpg)
Secondary Indexing
the discussion so far….
9/11/12 HBase Pow-wow
Jesse Yates
Salesforce.com
![Page 2: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/2.jpg)
What is it?
![Page 3: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/3.jpg)
Problem
• HBase rows are multi-dimensional
  – Only sorted on the row key
• How do you efficiently look up values deeper in the row, beyond the row key?
![Page 4: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/4.jpg)
Example
| Row | Family | Qualifier | Timestamp | Value |
|-----|--------|-----------|-----------|-------|
| 1 | Name | First | 0 | Babe |
| 1 | Name | Last | 0 | Ruth |
How do we find all people with the last name ‘Ruth’?
Full table scan!
![Page 5: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/5.jpg)
Indexing!
| Row | Family | Qualifier | Timestamp | Value |
|-----|--------|-----------|-----------|-------|
| Ruth | Name | Last | 0 | 1 |
Store the property we need to search for as the primary key
• Pointer back to the primary row
• Fast lookup - O(lg(n))
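The index-table idea above can be sketched in a few lines. This is a minimal in-memory model, not HBase client code: `data_table`, `build_index`, and `index_lookup` are hypothetical names, rows 2 and 3 are made-up sample data, and the index is modeled as sorted `(value, row key)` pairs so that several rows can share one last name.

```python
import bisect

# Hypothetical in-memory stand-in for the primary table:
# row key -> {(family, qualifier): value}. Rows 2 and 3 are made up.
data_table = {
    "1": {("Name", "First"): "Babe", ("Name", "Last"): "Ruth"},
    "2": {("Name", "First"): "Hank", ("Name", "Last"): "Aaron"},
    "3": {("Name", "First"): "Claire", ("Name", "Last"): "Ruth"},
}

def build_index(table, family, qualifier):
    """Return a sorted list of (indexed value, primary row key) entries."""
    column = (family, qualifier)
    return sorted(
        (cells[column], row) for row, cells in table.items() if column in cells
    )

def index_lookup(index, value):
    """Binary-search to the first entry for `value`, then read its entries."""
    i = bisect.bisect_left(index, (value, ""))
    rows = []
    while i < len(index) and index[i][0] == value:
        rows.append(index[i][1])  # pointer back to the primary row
        i += 1
    return rows

last_name_index = build_index(data_table, "Name", "Last")
print(index_lookup(last_name_index, "Ruth"))  # ['1', '3']
```

The binary search stands in for the O(lg(n)) seek into a sorted index; the primary table is never scanned.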
![Page 6: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/6.jpg)
Use Cases
• Point lookups
  – Volume of data influences usefulness of index
    • Let user decide if they need to use an index
• Scan lookup
  – WHERE age > 16
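A scan lookup such as `WHERE age > 16` can be served from an index whose keys sort in the same order as the indexed values. The sketch below is an in-memory illustration only; the key layout (`index_key`, zero-padded age plus row key) and the sample rows are assumptions, chosen because HBase row keys sort lexicographically as bytes.

```python
import bisect

def index_key(age, row_key):
    """Zero-pad the age so lexicographic order matches numeric order."""
    return f"{age:03d}:{row_key}"

# Made-up sample rows: primary row key -> age.
ages = {"alice": 15, "bob": 16, "carol": 17, "dave": 42}
age_index = sorted(index_key(age, row) for row, age in ages.items())

def scan_age_greater_than(index, threshold):
    """Seek to the first key above the threshold, then read to the end."""
    start_key = f"{threshold + 1:03d}:"
    start = bisect.bisect_left(index, start_key)
    return [key.split(":", 1)[1] for key in index[start:]]

print(scan_age_greater_than(age_index, 16))  # ['carol', 'dave']
```

Without the zero padding, the key for age 9 would sort after the key for age 16, and the range scan would return wrong rows.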
![Page 7: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/7.jpg)
Implementations
![Page 8: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/8.jpg)
Omid
Full transactional support
Centralized oracle
![Page 9: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/9.jpg)
Lily
WAL implementation on top of HBase
100-500 writes/sec
![Page 10: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/10.jpg)
Percolator
Full transactions
Distributed, optimistic locking
~10 sec latencies possible
![Page 11: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/11.jpg)
Culvert
Async
Dead project, incomplete
![Page 12: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/12.jpg)
http://jyates.github.com/2012/07/09/consistent-enough-secondary-indexes.html
Client-side coordinated index
Use timestamps to coordinate
Not yet implemented
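The slide gives no detail on the scheme, so the following is only one possible reading of "use timestamps to coordinate", not the design from the linked post. In this assumed version the client writes the index entry and the data cell with the same timestamp, readers ignore index entries that no longer match the primary row, and a separate pass removes them. All names (`put`, `lookup`, `cleanup`) and the two dictionaries are hypothetical.

```python
data = {}   # primary row key -> (indexed value, timestamp), latest write wins
index = {}  # (indexed value, primary row key) -> timestamp

def put(row, value, ts):
    """Write the index entry first, then the data, using one timestamp."""
    key = (value, row)
    index[key] = max(ts, index.get(key, ts))
    current = data.get(row)
    if current is None or ts >= current[1]:
        data[row] = (value, ts)

def lookup(value):
    """Return rows whose current value still matches the index entry."""
    rows = []
    for (indexed_value, row), ts in sorted(index.items()):
        if indexed_value == value and data.get(row) == (value, ts):
            rows.append(row)
    return rows

def cleanup():
    """Drop index entries that no longer match the primary row."""
    stale = [key for key, ts in index.items() if data.get(key[1]) != (key[0], ts)]
    for key in stale:
        del index[key]
    return len(stale)

put("1", "Ruth", 10)
put("1", "Gehrig", 20)   # the old "Ruth" entry is now stale
print(lookup("Ruth"))    # []
print(lookup("Gehrig"))  # ['1']
print(cleanup())         # 1
```

Because readers verify each index hit against the primary row, a stale entry costs an extra check but never returns a wrong row, which is why cleanup can run later.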
![Page 13: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/13.jpg)
Trend Micro Implementation
Still just POC???
![Page 14: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/14.jpg)
Solr/Lucene
Standard Lucene library bolted on HBase
Not commonly used
Lots of formats/codecs already written
![Page 15: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/15.jpg)
Considerations for HBase
What do we need to do?
![Page 16: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/16.jpg)
Built-in vs. external library vs. semi-supported (e.g. security)
![Page 17: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/17.jpg)
Which should I use??
• HBase experts write a single ‘right’ impl
• Officially endorse a ‘correct’ version
• What changes do we need to make
• How close to the core is the project
  – Written in everywhere
  – hbase-index module
  – External library
![Page 18: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/18.jpg)
Async vs. Synchronous vs. Transactional
![Page 19: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/19.jpg)
Key Observation
“Secondary indexing is inherently an easier problem than full transactions… secondary index updates are idempotent.”
- Lars Hofhansl
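One way to see why idempotence matters: if an index write is a plain put keyed by value, row, and timestamp, a client that is unsure whether a write succeeded can simply retry it. The sketch below uses a hypothetical in-memory index (`apply_index_put`) and contrasts it with a counter increment, which is not safe to retry.

```python
def apply_index_put(index, value, row_key, ts):
    """Put an index entry; replaying the same put leaves the index unchanged."""
    key = (value, row_key)
    index[key] = max(ts, index.get(key, ts))

def apply_increment(counters, name):
    """A non-idempotent update, for contrast: every replay changes the state."""
    counters[name] = counters.get(name, 0) + 1

applied_once = {}
apply_index_put(applied_once, "Ruth", "1", 0)

retried = {}
for _ in range(3):  # e.g. the client timed out and retried twice
    apply_index_put(retried, "Ruth", "1", 0)

print(applied_once == retried)  # True

counters = {}
for _ in range(3):
    apply_increment(counters, "writes")
print(counters["writes"])  # 3
```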
![Page 20: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/20.jpg)
Async vs. Synchronous vs. Transactional
• We don’t need full transactions
  – Transactions are slow
  – Transactions fail with increasing probability as number of servers increases
• Optionally async or sync
  – Async
    • Inherently ‘dirty’ index
• How does index cleanup work?
  – Inherently different for each type
![Page 21: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/21.jpg)
Locality
![Page 22: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/22.jpg)
Where’s my data?
• Extra columns vs. index table
• HBase Region-pinning
  – Has to be best-effort or will decrease availability
  – Helps minimize RPC overhead
  – Cross-table region-pinning
  – Needs a coprocessor hook to be useful
• HDFS block allocation
  – Keep index and data blocks on same HDFS node
![Page 23: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/23.jpg)
Index Cardinality
![Page 24: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/24.jpg)
How much data are we talking?
“Seems like there are 3 categories of sparseness:
1. sparse indexes (like ipAddress) where a per-table approach is more efficient for reads
2. dense indexes (like eventType) where there are likely values of every index key on each region
3. very dense indexes (like male/female) where you should just be doing a table scan anyway”
- Matt Corgan (9/10/12)
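As a rough illustration of the three categories, a column can be bucketed by how many distinct values it holds relative to its row count. The function name and both thresholds (`sparse_ratio`, `very_dense_max`) are arbitrary assumptions for this sketch; they do not come from the quote or from HBase.

```python
def classify_index(values, sparse_ratio=0.5, very_dense_max=4):
    """Bucket a column by distinct values relative to rows (hypothetical thresholds)."""
    if not values:
        raise ValueError("need at least one value to classify")
    distinct = len(set(values))
    if distinct / len(values) >= sparse_ratio:
        return "sparse"      # e.g. ipAddress: nearly every row has its own value
    if distinct <= very_dense_max:
        return "very dense"  # e.g. male/female: a table scan is likely better
    return "dense"           # e.g. eventType: every value shows up on most regions

# Made-up sample columns with 1000 rows each.
ip_addresses = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
event_types = [f"event-{i % 10}" for i in range(1000)]
genders = ["male" if i % 2 else "female" for i in range(1000)]

print(classify_index(ip_addresses))  # sparse
print(classify_index(event_types))   # dense
print(classify_index(genders))       # very dense
```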
![Page 25: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/25.jpg)
Impact on implementation
• Need a lot of knowledge of data to pick the right kind of index
  – User knows their data, let them do the hard work of picking indexes
![Page 26: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/26.jpg)
Pluggability
![Page 27: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/27.jpg)
Everyone’s got an impl already
• We need to make HBase flexible enough to support (most) current indexing formats with minimal overhead for switching
  – Lucene style Codec/CodecProvider?
![Page 28: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/28.jpg)
Client-interface
![Page 29: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/29.jpg)
What should it look like?
• Minimal changes to the top-level interfaces
  – Add a single new flag?
  – Configuration based?
• Enough that the user gets to be smart about what should be used
  – We can’t get all cases right – just provide building blocks
• Automatically use an index?
• Scanner/Filter style use?
![Page 30: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/30.jpg)
Properties for the client
• Should the user even see the index lookups?
• ACID?
• Ordering of results?
  – Support the current sorted order?
  – Batch lookup?
• Implications on current features
  – Replication
  – Splitting
![Page 31: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/31.jpg)
Schema(less)
• Schema enforced?
  – Rigid usage of index matching an expected schema?
  – Schema table? Reserved schema columns? .META.?
• Schema-less
  – Let the user apply whatever they think and use only what actually works
• Best-effort
  – Use client-hinted schema and try to apply all the known indexes
![Page 32: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/32.jpg)
My random thoughts….
• Client-side managed indexes are efficient
  – Minimal RPC overhead
    • Cleanup is async to client and rarely misses
  – Solves the cross-region/server problem
    • Region-pinning is a nice-to-have optimization
  – Scales without concern for locality
  – Flexible enough to support custom codecs
  – Can be built to provide server-side optimizations
    • Locality aware indexes to minimize RPCs
![Page 33: Musings on Secondary Indexing in HBase](https://reader036.vdocuments.mx/reader036/viewer/2022062513/5550982eb4c90590208b46ea/html5/thumbnails/33.jpg)
Discussion!