Enterprise-Grade Rolling Upgrade for a Live Hadoop Cluster
TRANSCRIPT
Page 1 © Hortonworks Inc. 2014
Sanjay Radia, Vinod Kumar Vavilapalli, Hortonworks Inc.
Page 2 © Hortonworks Inc. 2014
Agenda
• Introduction
• What is Rolling Upgrade?
• Problem – several key issues to be addressed
– Wire compatibility and side-by-side installs are not sufficient!
– Must address: data safety, service degradation and disruption
• Enhancements to various components
– Packaging – side-by-side install
– HDFS, YARN, Hive, Oozie
Page 3 © Hortonworks Inc. 2014
Hello, my name is Sanjay Radia
•Chief Architect, Founder, Hortonworks
•Part of the Hadoop team at Yahoo! since 2007
–Chief Architect of Hadoop Core at Yahoo!
–Apache Hadoop PMC and Committer
• Prior
–Data center automation, schedulers, virtualization, Java, HA, OSs, File
Systems
– (Startup, Sun Microsystems, Inria …)
–Ph.D., University of Waterloo
Page 4 © Hortonworks Inc. 2014
HDP Upgrade: Two Upgrade Modes
Stop-the-Cluster Upgrade
Shut down the services and the cluster, then upgrade. Traditionally this was the only way.
Rolling Upgrade
Upgrade the cluster and its services while the cluster is actively running jobs and applications.
Note: upgrade time is proportional to the number of nodes, not data size.
Enterprises run critical services and data on a Hadoop cluster.
They need a live cluster upgrade that maintains SLAs without degradation.
Page 5 © Hortonworks Inc. 2014
But You Can Revert to the Prior State
Rollback
Revert the bits and state of the cluster and its services back to a checkpointed state.
Why? This is an emergency procedure.
Downgrade
Downgrade the services and components to the prior version, but keep any new data and metadata that has been generated.
Why? You are not happy with performance, or app compatibility, etc.
Page 6 © Hortonworks Inc. 2014
But aren’t wire compatibility and side-by-side installs sufficient for rolling upgrades?
Unfortunately no! Not if you want to:
• Keep data safe
• Keep running jobs/apps running correctly
• Maintain SLAs
• Allow downgrade/rollback in case of problems
Page 7 © Hortonworks Inc. 2014
Issues that need to be addressed (1)
• Data safety
• HDFS’s upgrade checkpoint does not work for rolling upgrade
• Service degradation – note that every daemon is restarted in rolling fashion
• HDFS write pipeline
• YARN app masters restart
• Node manager restart
• The Hive server is processing client queries – it cannot restart to a new version without loss
• Clients must not see failures – many components do not have retry
But Hadoop deals with failures – it repairs pipelines and restarts tasks – so what is the big deal?
Service degradation will be high, because every daemon is restarted.
Page 8 © Hortonworks Inc. 2014
Issues that need to be addressed (2)
• Maintaining the job submitter’s context (correctness)
• YARN tasks get their context from the local node
– In the past, the submitter’s and the node’s contexts were identical
– But with RU, a node’s binaries are being upgraded and hence may be inconsistent with the submitter’s
– Half of a job could execute with the old binaries and the other half with the new ones!
• Persistent state
• Backward compatibility for upgrade (or convert)
• Forward compatibility for downgrade (or convert)
• Wire compatibility
• With clients (forward and backward)
• Internally (between masters and slaves, or peers)
– Note: the upgrade happens in a rolling fashion
Page 9 © Hortonworks Inc. 2014
Component Enhancements
• Packaging – Side-by-side installs
• HDFS Enhancements
• Yarn Enhancements
• Retaining Job/App Context
• Hive Enhancements
Page 10 © Hortonworks Inc. 2014
Packaging: Side-by-side Installs (1)
• Need side-by-side installs of multiple versions on the same node
• Some components are at version N, while others are at N+1
• For the same component, some daemons are at version N and others at N+1 on the same node (e.g. NN and DN)
• HDP’s solution: use the OS-distro standard packaging solution
• Rejected a proprietary packaging solution (no lock-in)
• Want to support RU via Ambari and manually
• Standard packaging solutions like RPMs have useful tools and mechanisms
– Tools to install, uninstall, query, etc.
– Manage dependencies automatically
– Admins do not need to learn new tools and formats
• Side benefit for “stop-the-world” upgrades:
• Can install the new binaries before the shutdown
Page 11 © Hortonworks Inc. 2014
Packaging: Side-by-side installs (2)
• Layout: side-by-side
• /usr/hdp/2.2.0.0/hadoop
• /usr/hdp/2.2.0.0/hive
• /usr/hdp/2.3.0.0/hadoop
• /usr/hdp/2.3.0.0/hive
• Define what is current for each component’s daemons and clients
• /usr/hdp/current/hdfs-nn -> /usr/hdp/2.3.0.0/hadoop
• /usr/hdp/current/hadoop-client -> /usr/hdp/2.2.0.0/hadoop
• /usr/hdp/current/hdfs-dn -> /usr/hdp/2.2.0.0/hadoop
• distro-select helps you manage the version switch
• Our solution: the package name contains the version number
• E.g. hadoop_2_2_0_0 is the RPM package name itself
– hadoop_2_3_0_0 is a different, peer package
• Bin commands point to current:
/usr/bin/hadoop -> /usr/hdp/current/hadoop-client/bin/hadoop
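The per-daemon “current” layout above can be sketched with plain symlinks. The following is an illustrative local mock-up (a scratch directory stands in for /usr/hdp), not the actual HDP packaging scripts:

```shell
# Mock-up of the side-by-side layout in a scratch dir (stands in for /usr/hdp)
mkdir -p hdp/2.2.0.0/hadoop hdp/2.3.0.0/hadoop hdp/current

# Per-daemon "current" symlinks: each daemon can point at a different
# stack version while the rolling upgrade is in flight
ln -sfn ../2.3.0.0/hadoop hdp/current/hdfs-nn        # NN already on 2.3
ln -sfn ../2.2.0.0/hadoop hdp/current/hdfs-dn        # DN still on 2.2
ln -sfn ../2.2.0.0/hadoop hdp/current/hadoop-client  # clients still on 2.2

# Flipping a daemon to the new version is a single atomic symlink swap
ln -sfn ../2.3.0.0/hadoop hdp/current/hdfs-dn
readlink hdp/current/hdfs-dn
```

Because each daemon has its own link, the NN can run new bits while DNs are flipped one at a time.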
Page 12 © Hortonworks Inc. 2014
Packaging: Side-by-side installs (3)
• distro-select tool to select the current binary
• Per-component, per-daemon
• Maintain stack consistency – that is what QE tested
• Each component refers to its siblings of the same stack version
• Each component knows the “hadoop home” of the same stack
– Wrapper bin-scripts set this up
• Config updates can optionally be synchronized with the binary upgrade
• Configs can sit in their old location
• But what if the new binary version requires a slightly different config?
• Each binary version has its own config pointer
– /usr/hdp/2.2.0.0/hadoop/conf -> /etc/hadoop/conf
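The per-version config pointer is likewise just a symlink. A minimal local sketch (scratch-dir paths stand in for the real /usr/hdp and /etc layout):

```shell
# Scratch-dir mock-up: each binary version carries its own conf pointer
mkdir -p etc/hadoop/conf hdp/2.2.0.0/hadoop hdp/2.3.0.0/hadoop
ln -sfn ../../../etc/hadoop/conf hdp/2.2.0.0/hadoop/conf

# A new version that needs slightly different configs can point elsewhere,
# leaving the old version's configs untouched for downgrade
mkdir -p etc/hadoop/conf-2.3
ln -sfn ../../../etc/hadoop/conf-2.3 hdp/2.3.0.0/hadoop/conf
readlink hdp/2.2.0.0/hadoop/conf
```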
Page 13 © Hortonworks Inc. 2014
Component Enhancements
• Packaging – Side-by-side installs
• HDFS Enhancements
• Yarn Enhancements
• Retaining Job/App Context
• Hive Enhancements
Page 14 © Hortonworks Inc. 2014
HDFS Enhancements (1)
Data safety
• Since 2007, HDFS has supported an upgrade checkpoint
• Backups of HDFS are not practical – too large
• Protects against bugs in a new HDFS version deleting files
• Standard practice is to use it for ALL upgrades, even patch releases
• But this only works for a “stop-the-world” full upgrade, and it does not support downgrade
• It would be irresponsible to do a rolling upgrade without such a mechanism
HDP 2.2 has an enhanced upgrade checkpoint (HDFS-5535)
• Markers for rollback
• “Hardlinks” to protect against deletes caused by bugs in the new version of the HDFS code
• The old scheme had hardlinks too, but we now delay the deletes
• Added downgrade capability
• Protobuf-based fsImage for compatible extensibility
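In Apache Hadoop, this checkpoint flow is driven through `dfsadmin`. A sketch of the command sequence (it assumes a running HA cluster; this is the administrator’s view, not the full runbook):

```shell
hdfs dfsadmin -rollingUpgrade prepare    # create the rollback image/markers
hdfs dfsadmin -rollingUpgrade query      # repeat until the NN reports it is ready
# ... upgrade NameNodes and DataNodes in rolling fashion ...
hdfs dfsadmin -rollingUpgrade finalize   # commit; the delayed deletes are applied
# To revert instead, restart the NameNode with the "-rollingUpgrade rollback" startup option
```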
Page 15 © Hortonworks Inc. 2014
HDFS Enhancements (2)
Minimize service degradation and retain data safety
• Fast datanode restart (HDFS-5498)
• Write pipeline – every DN will be upgraded, and hence many write pipelines will break and be repaired
• Umbrella JIRA HDFS-5535
– Repair the pipeline to the same DN during RU (avoid replica data copy)
– Retain the same number of replicas in the pipeline
• Upgrade the HA standby and fail over (NN HA has been available for a long time)
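The fast-DataNode-restart path is also exposed through `dfsadmin`. A sketch of the per-DN upgrade step (the hostname and IPC port below are illustrative):

```shell
# Tell the DN to shut down for upgrade; clients writing to it will wait
# for the restart instead of immediately repairing the pipeline elsewhere
hdfs dfsadmin -shutdownDatanode dn1.example.com:50020 upgrade
hdfs dfsadmin -getDatanodeInfo dn1.example.com:50020   # poll until it stops responding
# ... flip the DN's binaries to the new version and start it again ...
```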
Page 16 © Hortonworks Inc. 2014
Component Enhancements
• Packaging – Side-by-side installs
• HDFS Enhancements
• Yarn Enhancements
• Retaining Job/App Context
• Hive Enhancements
Page 17 © Hortonworks Inc. 2014
YARN Enhancements: Minimize Service Degradation
• YARN RM retains the app/job queue (2013)
• YARN RM HA (2014)
• Note: this retains the queues, but ALL jobs are restarted
• YARN RM can restart while retaining jobs (2015)
Page 18 © Hortonworks Inc. 2014
YARN Enhancements: Minimize Service Degradation
• A restarted YARN NodeManager retains existing containers (2015)
• Recall that restarting containers would cause serious SLA degradation
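In Apache Hadoop, these restart behaviors are switched on in yarn-site.xml. The property names below come from Apache YARN’s yarn-default.xml; the recovery-directory path is illustrative:

```xml
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
```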
Page 19 © Hortonworks Inc. 2014
YARN Enhancement: Compatibility
• Versioning of state-stores of RM and NMs
• Compatible evolution of tokens over time
• Wire compatibility between mixed versions of RM
Page 20 © Hortonworks Inc. 2014
Component Enhancements
• Packaging – Side-by-side installs
• HDFS Enhancements
• Yarn Enhancements
• Retaining Job/App Context
• Hive Enhancements
Page 21 © Hortonworks Inc. 2014
Retaining Job/App Context
Previously a job/app used libraries from the local node
• This worked because the client node and the compute nodes had the same version
• But during RU, different nodes have different versions
• Tasks must use the same version the client used when submitting the job
• Solution:
• Framework libraries are now installed in HDFS
• The client context is sent as a “distro-version” variable in the job config
• Side benefit:
– Frameworks are now installed on a single node and then uploaded to HDFS
• Note: Oozie was also enhanced to maintain a consistent context
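In Apache MapReduce this is done with the framework-on-HDFS mechanism. The property name is from mapred-default.xml; the HDP-style path with its version variable is illustrative:

```xml
<property>
  <name>mapreduce.application.framework.path</name>
  <value>/hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework</value>
</property>
```

Each job thus pins the framework tarball matching the submitter’s version, regardless of which binaries happen to be on the node that runs the task.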
Page 22 © Hortonworks Inc. 2014
Component Enhancements
• Packaging – Side-by-side installs
• HDFS Enhancements
• Yarn Enhancements
• Retaining Job/App Context
• Hive Enhancements
Page 23 © Hortonworks Inc. 2014
Hive Enhancements
• Fast restarts + client-side reconnection
• Hive metastore and Hive client
• HiveServer2: a stateful server that submits the client’s queries
• Need to keep it running until the old queries complete
• Solution:
• Allow multiple Hive servers to run, each registered in ZooKeeper
• New client requests go to the new servers
• An old server completes its old queries but does not receive any new ones
– The old server is removed from ZooKeeper
• Side benefits
• HA + load-balancing solution for HiveServer2
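With dynamic service discovery, clients name the ZooKeeper ensemble rather than a specific HiveServer2 instance. The JDBC URL form is from Hive’s service-discovery support; the hostnames are illustrative:

```shell
beeline -u "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"
```

ZooKeeper hands the client a live server, so a draining old-version server simply stops appearing in the namespace.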
Page 24 © Hortonworks Inc. 2014
Automated Rolling Upgrade
• Via Ambari
• Via your own cluster-management scripts
Page 25 © Hortonworks Inc. 2014
HDP Rolling Upgrades Runbook
Prerequisites
• HA
• Configs
Prepare
• Install bits
• DB backups
• HDFS checkpoint
Rolling Upgrade -> Finalize
Rolling Downgrade (if needed)
Rollback (NOT rolling – shut down all services)
Note: upgrade time is proportional to the number of nodes, not data size.
Page 28 © Hortonworks Inc. 2014
Both Manual and Automated Rolling Upgrade
• Ambari supports fully automated upgrades
• Verifies prerequisites
• Performs the HDFS upgrade checkpoint, prompts for DB backups
• Performs the rolling upgrade
• All the components, in the right order
• Smoke tests at each critical stage
• Opportunities for admin verification at critical stages
• Downgrade if you change your mind
• We have published the runbook for those who do not use Ambari
• You can do it manually or automate your own process
Page 29 © Hortonworks Inc. 2014
Runbook: Rolling Upgrade
Ambari has an automated process for rolling upgrades.
Services are switched over to the new version in rolling fashion; any components not installed on the cluster are skipped.
Order:
1. Zookeeper
2. Ranger
3. Core Masters (HDFS, YARN, HBase)
4. Core Slaves (HDFS, YARN, HBase)
5. Hive
6. Oozie
7. Falcon
8. Clients (HDFS, YARN, MR, Tez, HBase, Pig, Hive, Phoenix, Mahout)
9. Kafka
10. Knox
11. Storm
12. Slider
13. Flume
14. Hue
15. Finalize
Page 30 © Hortonworks Inc. 2014
Runbook: Rolling Downgrade
The same set of services is switched back in rolling fashion: Zookeeper, Ranger, Core Masters, Core Slaves, Hive, Oozie, Falcon, Clients, Kafka, Knox, Storm, Slider, Flume, Hue – then Downgrade and Finalize.
Page 31 © Hortonworks Inc. 2014
Summary
• Enterprises run critical services and data on a Hadoop cluster
• They need a live cluster upgrade that maintains SLAs without degradation
• We enhanced Hadoop components for enterprise-grade rolling upgrade
• Non-proprietary packaging using OS-standard solutions (RPMs, Debs, …)
• Data safety
– HDFS checkpoints and write pipelines
• Maintain SLAs – solved a number of service-degradation problems
– HDFS write pipelines, YARN RM, NM state recovery, Hive, …
• Jobs/apps continue to run correctly with the right context
• Allow downgrade/rollback in case of problems
• Are all enhancements truly open source and pushed back to Apache?
• Yes, of course – that is how Hortonworks does business
Page 32 © Hortonworks Inc. 2014
Backup slides
Page 33 © Hortonworks Inc. 2014
Why didn’t we use “alternatives”?
• The alternatives system generally keeps one version active, not two
• We need to move some services as a pack (clients)
• We need to support managing confs and binaries together and separately
• Maybe we could have done it, but it was getting complex …