Transcript
Page 1: Tales from the Cloudera Field

Tales From the Cloudera Field

Kevin O’Dell, Kate Ting, Aleks Shulman
{kevin, kate, aleks}@cloudera.com

Page 2: Tales from the Cloudera Field

©2014 Cloudera, Inc. All rights reserved.


Who Are We?

Kevin O’Dell

- Previously HBase Support Team Lead

- Currently Systems Engineer with a focus on HBase deployments

Kate Ting

- Technical Account Manager of Cloudera’s largest HBase deployments

- Co-author of O’Reilly’s Apache Sqoop Cookbook

Aleks Shulman

- HBase Test Engineer focused on ensuring HBase is enterprise ready

- Primary focus on building compatibility frameworks for rolling upgrades

Page 3: Tales from the Cloudera Field


Cloudera Internal HBase Metrics

• Cloudera uses HBase internally for the Support Team
• We ingest Tickets, Cluster Stats, and Apache Mailing Lists

• Cloudera has ~20K HBase nodes under management

• Over 60% of my accounts use HBase

Page 4: Tales from the Cloudera Field


Agenda

● Tales Getting Production Started
● Tales Fixing Production Bugs
● Tales Upgrading Production Clusters

Page 5: Tales from the Cloudera Field


Agenda

● Tales Getting Production Started
● Tales Fixing Production Bugs
● Tales Upgrading Production Clusters

Page 6: Tales from the Cloudera Field


HBase Deployment Mistakes

• Cluster Sizing

• Managing Your Regions

• General Recommendations

Page 7: Tales from the Cloudera Field


Why Cluster Sizing Matters

• Jobs Failing
• Writes Blocking
• Performance Issues

Page 8: Tales from the Cloudera Field


Heavy Write Sizing

Calculating Total Available Memstore

java_max_heap = 16GB
memstore_upper = 0.50

java_max_heap * memstore_upper = memstore_total_size

Calculating Max Regions

desired_flush_size = 128MB
repl_factor = 3 (default)
max_file_size = 20GB

memstore_total_size / desired_flush_size = total_regions_per_rs

max_file_size * (total_regions_per_rs * repl_factor) = raw_storage_per_node

Chart: X-axis = Flush_Size, Y-axis = Region_Count
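As a worked example, the sizing formulas above can be evaluated with the values shown on the slide. This is only a sketch: it assumes 1GB = 1024MB, and the variable names with unit suffixes are chosen here for illustration.

```python
# Worked example of the heavy-write sizing formulas, using the slide's values.
# Assumption: 1 GB = 1024 MB.
MB_PER_GB = 1024

java_max_heap_gb = 16        # RegionServer heap
memstore_upper = 0.50        # fraction of heap available to memstores
desired_flush_size_mb = 128  # target memstore flush size
repl_factor = 3              # HDFS replication factor (default)
max_file_size_gb = 20        # maximum region size

# Total memstore space on one RegionServer.
memstore_total_size_mb = java_max_heap_gb * MB_PER_GB * memstore_upper

# Regions one RegionServer can host while still flushing at the desired size.
total_regions_per_rs = int(memstore_total_size_mb // desired_flush_size_mb)

# Raw disk needed per node if every region grows to the maximum file size.
raw_storage_per_node_gb = max_file_size_gb * total_regions_per_rs * repl_factor

print(f"memstore_total_size  = {memstore_total_size_mb:.0f} MB")
print(f"total_regions_per_rs = {total_regions_per_rs}")
print(f"raw_storage_per_node = {raw_storage_per_node_gb} GB")
```

With these inputs the memstore budget is 8192MB, which gives 64 regions per RegionServer and 3840GB of raw storage per node.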

Page 9: Tales from the Cloudera Field


Update for Known Writes Sizing

Calculating force flushes

write_throughput = 20MB/s
hlog_size = 128MB
number_of_hlogs = 64

hlog_size * number_of_hlogs = amount_of_data_before_flush

(write_throughput * 60 * 60) / amount_of_data_before_flush = number_nodes_before_flush

Calculating Max Regions

total_data_size = 350TB
maxfile_size = 20GB

((total_data_size * 1024) / maxfile_size) / desired_RS_count = total_regions_per_rs
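The known-writes formulas above can be checked the same way. This is a sketch using the slide's inputs. The slide gives no value for desired_RS_count, so the 100 used below is purely an assumed figure for illustration.

```python
# Worked example of the known-writes sizing formulas, using the slide's values.
GB_PER_TB = 1024

write_throughput_mb_s = 20   # sustained write rate
hlog_size_mb = 128           # size of one HLog (WAL file)
number_of_hlogs = 64         # HLogs kept before a flush is forced
total_data_size_tb = 350
maxfile_size_gb = 20
desired_rs_count = 100       # ASSUMPTION: not given on the slide

# Data that accumulates in the HLogs before flushes are forced.
amount_of_data_before_flush_mb = hlog_size_mb * number_of_hlogs

# One hour of writes divided by that capacity (the slide's formula).
hourly_writes_mb = write_throughput_mb_s * 60 * 60
number_nodes_before_flush = hourly_writes_mb / amount_of_data_before_flush_mb

# Regions needed for the full data set, then spread across RegionServers.
total_regions = (total_data_size_tb * GB_PER_TB) / maxfile_size_gb
total_regions_per_rs = total_regions / desired_rs_count

print(f"amount_of_data_before_flush = {amount_of_data_before_flush_mb} MB")
print(f"number_nodes_before_flush   = {number_nodes_before_flush:.2f}")
print(f"total regions               = {total_regions:.0f}")
print(f"total_regions_per_rs        = {total_regions_per_rs:.1f}")
```

Changing desired_rs_count shows how quickly the per-server region count moves relative to the limit derived from memstore size.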

Page 10: Tales from the Cloudera Field


Why Is Region Management Important?

• Initial loads are failing
• Region Servers are crashing from overload

Page 11: Tales from the Cloudera Field


Region Management Best Practices

Region Split Policy

ConstantSize
- Split on max file size
- Use when pre-splitting all tables

UpperBoundSplitPolicy
- Split on smarter intervals
- Use when not able to pre-split all tables

Balancer Policy

SimpleLoadBalancer
- Aimlessly balance regions
- Use with lots of tables with low region count

ByTable
- Balance by table
- Use with few tables with high region count

Page 12: Tales from the Cloudera Field


General Recommendations

Feature / Benefit / When to Enable

Short Circuit Reads (SCR)
- Benefit: Speed up read times by bypassing the DataNode layer
- When to enable: Always

Snappy Compression
- Benefit: Speed up read times and lower data consumption
- When to enable: On heavily accessed tables

Bloom Filters
- Benefit: Speed up read times when numerous HFiles are present
- When to enable: Row should always be used; Row+Column is more accurate but uses more memory

HLog Compression
- Benefit: Speed up writes and recovery times
- When to enable: Always

Data Block Encoding
- Benefit: Compress long keys to store more in cache
- When to enable: Best for short/tall tables with long, similar keys. Scans may be slower

Page 13: Tales from the Cloudera Field


Agenda

● Tales Getting Production Started
● Tales Fixing Production Bugs
● Tales Upgrading Production Clusters

Page 14: Tales from the Cloudera Field


Tales Fixing Production Bugs

● RegionServer Hotspotting

● Faulty Hardware

● Application Bug

Page 15: Tales from the Cloudera Field


Tales Fixing Production Bugs

● RegionServer Hotspotting

● Faulty Hardware

● Application Bug

Page 16: Tales from the Cloudera Field


Fixing #1: RegionServer Hotspotting - Solution

● Spread rows over all RS by salting the row key

● Hundreds of regions available, but increments only hit tens of regions

● While locks wait to time out, blocked clients hold onto handlers

Page 17: Tales from the Cloudera Field


Fixing #1: RegionServer Hotspotting - Solution

● Option 1: Change row key to something that scales
○ Reduce contention by reducing connections: each client picks one salt and writes only to one RS
● Option 2: Implement new coalescing feature in native HBaseSink, compressing entire batch of Flume events into single HBase RPC call

[row1, colA+=1] [row1, colB+=1] [row1, colB+=1]
=> [row1 colA+=1 colB+=2]
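Both options can be illustrated with a short sketch. It is not the HBaseSink implementation. The helper names (pick_salt, salted_key, coalesce), the key format, and the client id are hypothetical, and no HBase client is involved. The first two helpers show a client settling on a single salt prefix. The third merges a batch of increments so that each row needs only one call.

```python
import hashlib
from collections import defaultdict


def pick_salt(client_id: str, num_salts: int) -> int:
    """Map a client to one stable salt bucket (hypothetical helper)."""
    digest = hashlib.md5(client_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_salts


def salted_key(salt: int, row_key: str) -> str:
    """Prefix the row key with the salt so writes spread across regions."""
    return f"{salt:02d}|{row_key}"


def coalesce(increments):
    """Merge (row, column, delta) increments into one set of deltas per row."""
    merged = defaultdict(lambda: defaultdict(int))
    for row, column, delta in increments:
        merged[row][column] += delta
    return {row: dict(columns) for row, columns in merged.items()}


# The batch from the slide: three increments collapse into one row update.
batch = [("row1", "colA", 1), ("row1", "colB", 1), ("row1", "colB", 1)]
print(coalesce(batch))

# A client picks its salt once and reuses it for every write.
salt = pick_salt("flume-agent-7", num_salts=16)  # illustrative client id
print(salted_key(salt, "row1"))
```

One consequence of salting a counter this way is that a reader has to sum the value across all salt prefixes to get the logical total.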

Page 18: Tales from the Cloudera Field


Tales Fixing Production Bugs

● RegionServer Hotspotting

● Faulty Hardware

● Application Bug

Page 19: Tales from the Cloudera Field


Fixing #2: Faulty Hardware

● Diagnostics run on bad hardware caused HBase failures

● HBase recoverability = RS back online + locality (compaction)

● Stress test with prod load before it is needed (e.g. holiday season)

● Imagine financial impact of 7 hours of downtime?

Page 20: Tales from the Cloudera Field


Fixing #2: Faulty Hardware - Solution

● Recover faster by failing fast
○ Too many retries cause HBase task to exit before it can print exception identifying stuck RS
● Decrease time needed to finish HBase major compaction
○ Run multiple threads during compaction
● Replay in parallel
○ Decrease HLog size to limit # of edits to be replayed, increase # of HLogs, constrain WAL file size to minimize time corresponding region is not available

Page 21: Tales from the Cloudera Field


Fixing #2: Faulty Hardware - Solution

● Shorten column family names
○ Reduce scan time, skip bulk loads, reduce memory usage
● Turn off write cache
○ Node crash erases writes in memory, rebuilds block with outdated data, causing corrupt replica
● Turn on checksum
○ Enables RS to use other replicas from the cluster instead of failing the operation if there’s a corrupted HFile

Page 22: Tales from the Cloudera Field


Tales Fixing Production Bugs

● RegionServer Hotspotting

● Faulty Hardware

● Application Bug

Page 23: Tales from the Cloudera Field


Fixing #3: Application Bug

● HBase timestamps were hardcoded to be too far out - new data written went unused

● Bug put backup system out of commission for one month
○ More vulnerable to HBase outages

Page 24: Tales from the Cloudera Field


Fixing #3: Application Bug

Page 25: Tales from the Cloudera Field


Fixing #3: Application Bug - Solution

● Detailed knowledge of internals required to undo damage
○ Modified the timestamp to some time in the past for all records via custom MR jobs over one month:
■ back up data, generate new HFile with correct timestamp, bulkload data, run MD5
● Don’t muck around with setting the timestamp yourself
● Do use always-increasing timestamps for new puts to a row
● Do use a separate timestamp attribute of the row
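Why a far-future timestamp makes later writes go unused can be seen in a toy model. The Cell class below is a hypothetical stand-in, not HBase code. It only mimics the rule that a default read returns the version with the highest timestamp. The timestamps are made-up values.

```python
class Cell:
    """Toy model of one cell: versions keyed by timestamp (not HBase code)."""

    def __init__(self):
        self.versions = {}

    def put(self, value, timestamp):
        """Store a version under an explicit timestamp."""
        self.versions[timestamp] = value

    def get(self):
        """Return the value with the highest timestamp, or None if empty."""
        if not self.versions:
            return None
        return self.versions[max(self.versions)]


NOW = 1_400_000_000_000                          # illustrative time in ms
FAR_FUTURE = NOW + 10 * 365 * 24 * 3600 * 1000   # hypothetical hardcoded value

cell = Cell()
cell.put("stale", FAR_FUTURE)  # application bug: timestamp far in the future
cell.put("fresh", NOW + 1)     # a later write with a correct timestamp

print(cell.get())  # prints "stale": the far-future version still wins
```

The later write is stored, but every read keeps returning the older value until the bad timestamp is rewritten, which is why the cleanup required regenerating HFiles.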

Page 26: Tales from the Cloudera Field


Agenda

● Tales Getting Production Started
● Tales Fixing Production Bugs
● Tales Upgrading Production Clusters

Page 27: Tales from the Cloudera Field


Internal Case Study

CDH4->C5 (0.94->0.96) Upgrade Automation Failed

What Happened?

Root Cause
• HBase Snapshots vs. HDFS Snapshots
• Snapshot directory rename

Outcome
• All issues resolved before C5b1 was shipped

2013-07-12 17:11:42,656 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation MkdirOp [length=0, inodeId=0, path=/hbase/.snapshot, timestamp=1373674083434, permissions=hbase:supergroup:rwxr-xr-x, opCode=OP_MKDIR, txid=614]

org.apache.hadoop.HadoopIllegalArgumentException: ".snapshot" is a reserved name. Please rename it before upgrade.
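A pre-upgrade check for this failure could look like the sketch below. It assumes a list of HDFS paths has already been collected by some other means. The helper name and the sample listing are hypothetical. The check flags any path with a component named exactly ".snapshot", the reserved name from the error above.

```python
RESERVED_NAME = ".snapshot"


def find_reserved_paths(paths):
    """Return the paths containing a component named exactly '.snapshot'."""
    return [p for p in paths if RESERVED_NAME in p.strip("/").split("/")]


# Hypothetical listing; only exact '.snapshot' components should be flagged.
listing = [
    "/hbase/.snapshot",
    "/hbase/.snapshot/s1",
    "/hbase/data",
    "/hbase/.snapshots-old",
]

for path in find_reserved_paths(listing):
    print(f"rename before upgrade: {path}")
```

Matching whole path components rather than substrings avoids flagging names that merely start with the reserved word.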

Page 28: Tales from the Cloudera Field

Automating Upgrades
Testing the Upgrade lifecycle

Page 29: Tales from the Cloudera Field


What is Important?

The Administrator Experience Matters
● Major version upgrades
● Rolling upgrades

The Developer Experience Matters
● API Compatibility Testing

Page 30: Tales from the Cloudera Field


And Here Is Why It Is Important

Customer Continuity
• Smooth upgrades
• Curated process
• Understanding of customer cluster lifecycle

Developer Continuity
• Forward and backward compatibility
• Binary Compatibility
• Wire Compatibility

Automation
• You can only really make a guarantee about things that are automated
• Product is easier to support
• Confidence is only possible with testing

Page 31: Tales from the Cloudera Field

Upgrades

Page 32: Tales from the Cloudera Field


Cold vs. Rolling Upgrades

C3u5 CDH4.0.x CDH4.1.x CDH4.2.x CDH4.3.x CDH4.4.x CDH4.5.x CDH4.6.x C5.0 C5.1

[Diagram: arrows labeled "Rolling Upgrade" and "Cold Upgrade" spanning the versions listed above]

Page 33: Tales from the Cloudera Field


Upgrades from HBase 0.90 -> 0.98

CDH Version HBase Version

CDH3u5 HBase 0.90.6

CDH4.1.0 HBase 0.92.1

CDH4.2.0 HBase 0.94.2

CDH4.4.0 HBase 0.94.6

CDH4.6.0 HBase 0.94.15

CDH5.0.0 HBase 0.96.1.1

CDH5.1.0 HBase 0.98.1

Upgrade from version A -> version B -> version C
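The table above can be held as a small lookup to see which HBase versions an upgrade path crosses. This is only a sketch. The helper names are hypothetical, and the example path (CDH3u5, then a CDH4 release, then CDH5.0.0) is one possible A -> B -> C choice, not necessarily the one marked on the slide.

```python
# CDH release -> bundled HBase version, copied from the table above.
CDH_TO_HBASE = {
    "CDH3u5": "0.90.6",
    "CDH4.1.0": "0.92.1",
    "CDH4.2.0": "0.94.2",
    "CDH4.4.0": "0.94.6",
    "CDH4.6.0": "0.94.15",
    "CDH5.0.0": "0.96.1.1",
    "CDH5.1.0": "0.98.1",
}


def hbase_versions_along(path):
    """Return the HBase version at each step of a CDH upgrade path."""
    return [CDH_TO_HBASE[cdh] for cdh in path]


def minor_line(hbase_version):
    """Reduce a full version such as '0.94.15' to its minor line, '0.94'."""
    return ".".join(hbase_version.split(".")[:2])


# Illustrative A -> B -> C path.
path = ["CDH3u5", "CDH4.6.0", "CDH5.0.0"]
for cdh, hbase in zip(path, hbase_versions_along(path)):
    print(f"{cdh}: HBase {hbase} (line {minor_line(hbase)})")
```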

Page 34: Tales from the Cloudera Field


Cold Upgrade Results

● Upgrades work!
● Steps:
○ Start at CDH3u5
○ Upgrade to a version of CDH4
○ Upgrade to CDH5.0.0
● Data Integrity
○ Different bloom filters
○ Different compression formats
● Next Steps
○ CDH 5.1.0 expected to be based on 0.98.1

Page 35: Tales from the Cloudera Field


Rolling Upgrade Results

● What is tested?
○ Ingest via Java API
○ MapReduce over HBase
■ Bulk load
■ RowCount/Export
● Status
○ Rolling upgrade broken (red) in CDH <=4.1.2 due to region_mover issue
○ Soft failure (yellow) for starting version <CDH4.1.0 - due to MapReduce JT/TT version mismatch issue
○ All else green!

How to Read This: Pick a column and read down to see for which versions rolling upgrades are advised

Page 36: Tales from the Cloudera Field


Improved Supportability Through Testing

Case Study: Customer Rolling Upgrade Simulation

Large Customer
● Upgrading from CDH4.1.4+patches
● Considered several CDH versions to upgrade
○ Custom patches

Automation
● Automated testing added to simulate rolling upgrade
○ CM
○ HA+QJM
○ Parcels
● Scales
○ 4 nodes, 20 nodes, 80 nodes
● Subsequently used for other customers with similar upgrade paths

Page 37: Tales from the Cloudera Field


Here’s to Fewer Tales Next Year..

Automated Testing Better Cluster Mgmt Fewer Tales From the Field

Page 38: Tales from the Cloudera Field


Kevin O’Dell @kevinrodell

Kate Ting @kate_ting

Aleks Shulman @a_shulman @clouderaTest

Questions?

