HBase at Flurry
2013-08-20, Dave Latham
Overview
- History
- Stats
- How We Store Data
- Challenges
- Mistakes We Made
- Tips / Patterns
- Future
- Moral of the Story
History
- 2008 – Flurry Analytics for Mobile Apps: sharded MySQL, or HBase!
- Launched on 0.18.1 with a 3-node cluster
- Great community
- Now running 0.94.5 (+ patches)
- 2 data centers with 2 clusters each, bidirectional replication
Stats – Main cluster
- 1000 slave nodes per cluster
- 32 GB RAM, 4 drives (1 or 2 TB), 1 GigE, dual quad-core * 2 HT = 16 procs
- Each node runs DataNode, TaskTracker, RegionServer (11 GB), 5 mappers, 2 reducers
- ~30 tables, 250k regions, 430 TB (after LZO)
- 2 big tables are about 90% of that
  ▪ 1 wide table: 3 CF, 4 billion rows, up to 1MM cells per row
  ▪ 1 tall table: 1 CF, 1 trillion rows, most with 1 cell per row
Stats – Low latency cluster
- 12 physical nodes
- 5 region servers with 20 GB heaps on each
- 1 table: 8 billion small rows, 500 GB (LZO)
- All in block cache (after a 20-minute warmup)
- 100k-1MM QPS, 99.9% reads
- 2 ms mean latency, 99% < 10 ms
- 25 ms GC pause every 40 seconds
- Slow after compaction
App Layer
- DAO for Java apps
- Requires (interface sketch after this list):
  ▪ writeRowIndex / readRowIndex
  ▪ readKeyValue / writeRowContents
- Provides:
  ▪ save / delete
  ▪ streamEntities / pagination
  ▪ MR input formats on entities (rather than Result)
- Uses HTable or asynchbase
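A minimal sketch of what such a DAO contract might look like. The method names come from the slide, but every signature here is an assumption, not Flurry's actual code; it assumes the HBase 0.94-era client API.

    // Sketch of the DAO contract described above; signatures are assumptions.
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Put;

    public interface EntityDao<T> {

        // Required from each entity type: row key encoding/decoding...
        byte[] writeRowIndex(T entity);
        T readRowIndex(byte[] rowKey);

        // ...and cell-level (de)serialization.
        void writeRowContents(T entity, Put put);
        T readKeyValue(byte[] rowKey, Iterable<KeyValue> cells);

        // Provided in return: generic persistence...
        void save(T entity) throws IOException;
        void delete(T entity) throws IOException;

        // ...and streaming reads with pagination, so MapReduce jobs and
        // callers see entities rather than raw Result objects.
        Iterator<T> streamEntities(byte[] startRow, byte[] stopRow) throws IOException;
    }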
Migrations
- Change row key format; the DAO supports both formats during the transition (dual-write sketch after this list):
1. Create new table
2. Write to both
3. Migrate existing data
4. Validate
5. Move reads to the new table
6. Write to (only) the new table
7. Drop the old table
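A hedged sketch of the dual-write phase (steps 2-3), assuming the EntityDao interface sketched earlier; DualWriteDao, oldDao, and newDao are hypothetical names, not Flurry's actual code.

    // Hypothetical dual-write wrapper for the migration recipe above.
    // Reads stay on the old table until step 5 flips them.
    import java.io.IOException;

    public class DualWriteDao<T> {
        private final EntityDao<T> oldDao; // old format: still the source of truth
        private final EntityDao<T> newDao; // new format: shadow writes

        public DualWriteDao(EntityDao<T> oldDao, EntityDao<T> newDao) {
            this.oldDao = oldDao;
            this.newDao = newDao;
        }

        public void save(T entity) throws IOException {
            oldDao.save(entity); // serving reads
            newDao.save(entity); // kept current while the backfill (step 3) runs
        }
    }

Existing rows are then backfilled (step 3), typically with a MapReduce job, and validated (step 4) before reads move over.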
Challenges – Big Cluster
- Bottlenecks (not horizontally scalable):
  ▪ HMaster (e.g. HLog cleaning falls behind creation [HBASE-9208])
  ▪ NameNode: disable table / shutdown => many HDFS files at once; scan table directory => slow region assignments
  ▪ ZooKeeper (HBase replication)
  ▪ JobTracker (heap)
  ▪ META region
Challenges – Big Cluster (cont.)
- Too many regions (250k)
  ▪ Max size grew 256 MB -> 1 GB -> 5 GB
  ▪ Slow reassignments on failure
  ▪ Slow hbck recovery
  ▪ Lots of META queries / big client cache (soft refs can exacerbate this)
  ▪ Slow rolling restarts
- More failures (common and otherwise)
  ▪ Zombie RS
Challenges – Big Cluster (cont.)
- Latency long tail
  ▪ HTable write buffer flush (sketch after this list)
  ▪ GC pauses
  ▪ RegionServer failure
- (See The Tail at Scale – Jeff Dean, Luiz André Barroso)
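One concrete tail contributor is the client-side write buffer: with auto-flush disabled, HTable buffers Puts and ships them all at once when the buffer fills, so most put() calls return instantly while one unlucky call absorbs the entire flush. A sketch against the 0.94-era client API; the table and column names are placeholders.

    // Demonstrates the buffered-write behavior called out above.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WriteBufferExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "my_table");
            table.setAutoFlush(false);                 // buffer Puts client-side
            table.setWriteBufferSize(2 * 1024 * 1024); // 2 MB buffer

            for (long i = 0; i < 1000000; i++) {
                Put put = new Put(Bytes.toBytes(i));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(i));
                table.put(put); // most calls are local; the call that fills
                                // the buffer blocks on a full network flush
            }
            table.flushCommits(); // drain whatever is left
            table.close();
        }
    }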
More Challenges
- Shared cluster for MapReduce and live queries
  ▪ IO-bound requests hog handler threads
  ▪ Even cached reads get slow
  ▪ RegionServer falls behind, stays behind
  ▪ If the cluster goes down, it takes a while to come back
More Challenges
- HDFS-5042: completed files lost after power failure
- ZOOKEEPER-1277: servers stop serving when the lower 32 bits of the zxid roll over
- ZOOKEEPER-1731: unsynchronized access to ServerCnxnFactory.connectionBeans results in deadlock
Mistakes We Made (So You Don’t Have To)
- Small region size -> many regions
- Nagle's algorithm (config sketch after this list)
- Trying to solve a crisis you don't understand (hbck -fixSplitParents)
- Setting up replication
- Custom backup / restore
  ▪ CopyTable OOM
  ▪ Verification
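On the Nagle's item: the usual remedy is enabling TCP_NODELAY on both sides of the HBase RPC so small requests stop waiting to coalesce. A sketch follows; the property names are believed to match the 0.94 line, but treat them as assumptions and verify against your hbase-default.xml.

    // Disable Nagle's algorithm for HBase RPC. Property names are assumed
    // from the 0.94 era; confirm them in hbase-default.xml before relying
    // on this.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class TcpNoDelayConfig {
        public static Configuration create() {
            Configuration conf = HBaseConfiguration.create();
            conf.setBoolean("hbase.ipc.client.tcpnodelay", true); // client side
            conf.setBoolean("ipc.server.tcpnodelay", true);       // server side
            return conf;
        }
    }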
Tips / Patterns
- Compact data matters (even with compression)
  ▪ Block cache and network traffic are not compressed
- Avoid random reads on non-cached tables (duh!)
- Write cell fragments, combine at read time, to avoid random reads (sketch after this list)
  ▪ Compact later - coprocessor?
  ▪ Can lead to large rows
  ▪ Probabilistic counter
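A hedged sketch of the fragment pattern, assuming the 0.94-era client API; the class, table layout, and names are illustrative. Each update is a blind write of a new cell under a unique qualifier, so the write path never does a random read; reads sum the fragments.

    // Write-fragments / combine-on-read counter; names are illustrative.
    import java.util.UUID;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FragmentCounter {
        private static final byte[] CF = Bytes.toBytes("f");

        // Blind write: no read-modify-write, so no random read here.
        public static void addFragment(HTable table, byte[] row, long delta)
                throws Exception {
            Put put = new Put(row);
            byte[] uniqueQualifier = Bytes.toBytes(UUID.randomUUID().toString());
            put.add(CF, uniqueQualifier, Bytes.toBytes(delta));
            table.put(put);
        }

        // Combine at read time: sum every fragment cell in the row.
        public static long readTotal(HTable table, byte[] row) throws Exception {
            Result result = table.get(new Get(row).addFamily(CF));
            long total = 0;
            for (KeyValue kv : result.raw()) {
                total += Bytes.toLong(kv.getValue());
            }
            return total;
        }
    }

A periodic rewrite (or, as the slide suggests, a compaction-time coprocessor) can merge fragments back into a single cell so rows do not grow without bound.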
Future
- HDFS HA
- Snapshots (see how they work with 100k regions on 1000 servers)
- 2000-node clusters
  ▪ Test those bottlenecks
  ▪ Larger regions, larger HDFS blocks, larger HLogs
- More (independent) clusters
- Load-aware balancing?
- Separate RPC priorities for workloads
- 0.96
Moral of the Story
- Scaled 1000x and more on the same DB
- If you're on the edge, you need to understand your system
  ▪ Monitor
  ▪ Open source
  ▪ Load test
- Know your load: disk or cache (or SSDs?)
Questions
And maybe some answers