HBase at Flurry
2013-08-20, Dave Latham
Overview
- History
- Stats
- How We Store Data
- Challenges
- Mistakes We Made
- Tips / Patterns
- Future
- Moral of the Story
History
- 2008 – Flurry Analytics for Mobile Apps: sharded MySQL, or HBase!
- Launched on 0.18.1 with a 3-node cluster
- Great community
- Now running 0.94.5 (+ patches)
- 2 data centers with 2 clusters each, bidirectional replication
Stats – Main cluster
- 1000 slave nodes per cluster
- 32 GB RAM, 4 drives (1 or 2 TB), 1 GigE, dual quad-core * 2 HT = 16 procs
- Each node runs DataNode, TaskTracker, RegionServer (11 GB), 5 mappers, 2 reducers
- ~30 tables, 250k regions, 430 TB (after LZO)
- 2 big tables are about 90% of that
  ▪ 1 wide table: 3 CF, 4 billion rows, up to 1MM cells per row
  ▪ 1 tall table: 1 CF, 1 trillion rows, most with 1 cell per row
Stats – Low latency cluster
- 12 physical nodes
- 5 region servers with 20 GB heaps on each
- 1 table: 8 billion small rows, 500 GB (LZO)
- All in block cache (after a 20-minute warmup)
- 100k-1MM QPS, 99.9% reads
- 2 ms mean latency, 99% < 10 ms
- 25 ms GC pause every 40 seconds
- Slow after compaction
App Layer
- DAO for Java apps
- Requires (interface sketch after this list):
  ▪ writeRowIndex / readRowIndex
  ▪ readKeyValue / writeRowContents
- Provides:
  ▪ save / delete
  ▪ streamEntities / pagination
  ▪ MR input formats on entities (rather than Result)
- Uses HTable or asynchbase
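A minimal sketch of what such a DAO contract might look like. The method names come from the slide, but every signature here is an assumption, not Flurry's actual code; it assumes the HBase 0.94-era client API.

    // Sketch of the DAO contract described above; signatures are assumptions.
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Put;

    public interface EntityDao<T> {

        // Required from each entity type: row key encoding/decoding...
        byte[] writeRowIndex(T entity);
        T readRowIndex(byte[] rowKey);

        // ...and cell-level (de)serialization.
        void writeRowContents(T entity, Put put);
        T readKeyValue(byte[] rowKey, Iterable<KeyValue> cells);

        // Provided in return: generic persistence...
        void save(T entity) throws IOException;
        void delete(T entity) throws IOException;

        // ...and streaming reads with pagination, so MapReduce jobs and
        // callers see entities rather than raw Result objects.
        Iterator<T> streamEntities(byte[] startRow, byte[] stopRow) throws IOException;
    }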
Migrations
- Change row key format; the DAO supports both formats during the transition (dual-write sketch after this list):
1. Create new table
2. Write to both
3. Migrate existing data
4. Validate
5. Move reads to the new table
6. Write to (only) the new table
7. Drop the old table
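A hedged sketch of the dual-write phase (steps 2-3), assuming the EntityDao interface sketched earlier; DualWriteDao, oldDao, and newDao are hypothetical names, not Flurry's actual code.

    // Hypothetical dual-write wrapper for the migration recipe above.
    // Reads stay on the old table until step 5 flips them.
    import java.io.IOException;

    public class DualWriteDao<T> {
        private final EntityDao<T> oldDao; // old format: still the source of truth
        private final EntityDao<T> newDao; // new format: shadow writes

        public DualWriteDao(EntityDao<T> oldDao, EntityDao<T> newDao) {
            this.oldDao = oldDao;
            this.newDao = newDao;
        }

        public void save(T entity) throws IOException {
            oldDao.save(entity); // serving reads
            newDao.save(entity); // kept current while the backfill (step 3) runs
        }
    }

Existing rows are then backfilled (step 3), typically with a MapReduce job, and validated (step 4) before reads move over.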
Challenges – Big Cluster
- Bottlenecks (not horizontally scalable):
  ▪ HMaster (e.g. HLog cleaning falls behind creation [HBASE-9208])
  ▪ NameNode: disable table / shutdown => many HDFS files at once; scan table directory => slow region assignments
  ▪ ZooKeeper (HBase replication)
  ▪ JobTracker (heap)
  ▪ META region
Challenges – Big Cluster (cont.)
- Too many regions (250k)
  ▪ Max size grew 256 MB -> 1 GB -> 5 GB
  ▪ Slow reassignments on failure
  ▪ Slow hbck recovery
  ▪ Lots of META queries / big client cache (soft refs can exacerbate this)
  ▪ Slow rolling restarts
- More failures (common and otherwise)
  ▪ Zombie RS
Challenges – Big Cluster (cont.)
- Latency long tail
  ▪ HTable write buffer flush (sketch after this list)
  ▪ GC pauses
  ▪ RegionServer failure
- (See The Tail at Scale – Jeff Dean, Luiz André Barroso)
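One concrete tail contributor is the client-side write buffer: with auto-flush disabled, HTable buffers Puts and ships them all at once when the buffer fills, so most put() calls return instantly while one unlucky call absorbs the entire flush. A sketch against the 0.94-era client API; the table and column names are placeholders.

    // Demonstrates the buffered-write behavior called out above.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WriteBufferExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "my_table");
            table.setAutoFlush(false);                 // buffer Puts client-side
            table.setWriteBufferSize(2 * 1024 * 1024); // 2 MB buffer

            for (long i = 0; i < 1000000; i++) {
                Put put = new Put(Bytes.toBytes(i));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(i));
                table.put(put); // most calls are local; the call that fills
                                // the buffer blocks on a full network flush
            }
            table.flushCommits(); // drain whatever is left
            table.close();
        }
    }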
More Challenges
- Shared cluster for MapReduce and live queries
  ▪ IO-bound requests hog handler threads
  ▪ Even cached reads get slow
  ▪ RegionServer falls behind, stays behind
  ▪ If the cluster goes down, it takes a while to come back
More Challenges
- HDFS-5042: completed files lost after power failure
- ZOOKEEPER-1277: servers stop serving when the lower 32 bits of the zxid roll over
- ZOOKEEPER-1731: unsynchronized access to ServerCnxnFactory.connectionBeans results in deadlock
Mistakes We Made (So You Don’t Have To)
- Small region size -> many regions
- Nagle's algorithm (config sketch after this list)
- Trying to solve a crisis you don't understand (hbck -fixSplitParents)
- Setting up replication
- Custom backup / restore
  ▪ CopyTable OOM
  ▪ Verification
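On the Nagle's item: the usual remedy is enabling TCP_NODELAY on both sides of the HBase RPC so small requests stop waiting to coalesce. A sketch follows; the property names are believed to match the 0.94 line, but treat them as assumptions and verify against your hbase-default.xml.

    // Disable Nagle's algorithm for HBase RPC. Property names are assumed
    // from the 0.94 era; confirm them in hbase-default.xml before relying
    // on this.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class TcpNoDelayConfig {
        public static Configuration create() {
            Configuration conf = HBaseConfiguration.create();
            conf.setBoolean("hbase.ipc.client.tcpnodelay", true); // client side
            conf.setBoolean("ipc.server.tcpnodelay", true);       // server side
            return conf;
        }
    }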
Tips / Patterns
- Compact data matters (even with compression)
  ▪ Block cache and network traffic are not compressed
- Avoid random reads on non-cached tables (duh!)
- Write cell fragments, combine at read time, to avoid random reads (sketch after this list)
  ▪ Compact later - coprocessor?
  ▪ Can lead to large rows
  ▪ Probabilistic counter
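A hedged sketch of the fragment pattern, assuming the 0.94-era client API; the class, table layout, and names are illustrative. Each update is a blind write of a new cell under a unique qualifier, so the write path never does a random read; reads sum the fragments.

    // Write-fragments / combine-on-read counter; names are illustrative.
    import java.util.UUID;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FragmentCounter {
        private static final byte[] CF = Bytes.toBytes("f");

        // Blind write: no read-modify-write, so no random read here.
        public static void addFragment(HTable table, byte[] row, long delta)
                throws Exception {
            Put put = new Put(row);
            byte[] uniqueQualifier = Bytes.toBytes(UUID.randomUUID().toString());
            put.add(CF, uniqueQualifier, Bytes.toBytes(delta));
            table.put(put);
        }

        // Combine at read time: sum every fragment cell in the row.
        public static long readTotal(HTable table, byte[] row) throws Exception {
            Result result = table.get(new Get(row).addFamily(CF));
            long total = 0;
            for (KeyValue kv : result.raw()) {
                total += Bytes.toLong(kv.getValue());
            }
            return total;
        }
    }

A periodic rewrite (or, as the slide suggests, a compaction-time coprocessor) can merge fragments back into a single cell so rows do not grow without bound.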
Future
- HDFS HA
- Snapshots (see how they work with 100k regions on 1000 servers)
- 2000-node clusters
  ▪ Test those bottlenecks
  ▪ Larger regions, larger HDFS blocks, larger HLogs
- More (independent) clusters
- Load-aware balancing?
- Separate RPC priorities for workloads
- 0.96
Moral of the Story
- Scaled 1000x and more on the same DB
- If you're on the edge, you need to understand your system
  ▪ Monitor
  ▪ Open source
  ▪ Load test
- Know your load: disk or cache (or SSDs?)
Questions
And maybe some answers