keep your hadoop cluster at its best

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Keep Your Hadoop Cluster at its Best!
Chris Nauroth, Sheetal Dolas
Hadoop Summit, San Jose, 2016

Upload: dataworks-summithadoop-summit

Post on 16-Apr-2017



About Us

Chris Nauroth
[email protected] cnauroth

⬢ Principal Engineer @ Hortonworks
⬢ Committer and PMC, Apache Hadoop
– Key contributor to HDFS ACLs, Windows compatibility, and operability improvements
⬢ Hadoop user since 2010
– Experience deploying, maintaining and using Hadoop clusters

About Us

Sheetal Dolas
[email protected] sheetal_dolas

⬢ SmartSense Engineering Lead @ Hortonworks
⬢ Most of the career has been in the field, solving real-life business problems
⬢ Last 6+ years in Big Data
⬢ Committer and PMC, Apache Metron

Agenda

⬢ Days in a life of Hadoop users – Real war stories!
⬢ Hadoop Operational Challenges
⬢ Winning and avoiding the wars
⬢ Q & A

Days in a life of Hadoop users
Real war stories!

Story I: Unstable NameNode, Frequent Fail Overs

⬢ NameNode periodically becomes unresponsive
⬢ In HA scenario, fails over to standby
⬢ In short time, falls back again
⬢ Very frequent fail overs and fail backs

It was the garbage collection!

Story II: Very high CPU usage but low throughput

⬢ Unusually high system CPU usage
⬢ Jobs slowed down
⬢ Reduced data IO

[Chart: System CPU, User CPU, N/W IO]

Transparent Huge Pages (THP) was turned on!

Story III: Cascading impact and cluster melt down

⬢ HDFS upgraded
⬢ HDFS utilization kept on increasing even after large data deletion
⬢ Rebalancing made the situation worse
⬢ Eventually HDFS became unresponsive

[Diagram labels: HDFS Upgrade, HDFS Space, Job Performance, Cluster Stability]

The un-finalized HDFS upgrade had a cascading impact on the cluster!

Story IV: Overloaded cluster

⬢ Jobs run slower
⬢ Always waiting containers and jobs, all YARN queues are fully utilized
⬢ Some jobs had to wait for hours to get the container slots

[Chart: Requested Memory, Used Memory]

Sub optimally configured container sizes!

Story V: Accidental deletion of critical datasets

⬢ User accidentally executed hdfs dfs -rm -R on a root directory
⬢ Delete is issued in parallel, control + c did not help
⬢ In panic, user shuts down HDFS immediately (fortunately)
⬢ Restarts later to check trash, loses all data
⬢ It’s nearly impossible to recover blocks from local file system

This is a more common mistake than one may think!

Story VI: Hive query returning random results

⬢ A Hive query returns different results every time
⬢ Results are usually accurate during office hours
⬢ After office hours, results keep changing randomly on every execution

-- QUERY: WHAT IS TODAY’S TOTAL SALE AS OF NOW?
SELECT SUM(amount)
FROM   sales
WHERE  sale_date = TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()))

One of the hosts had a different time zone!
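The effect can be reproduced outside Hive. The sketch below assumes, purely for illustration, that most hosts run on UTC-7 and the odd host runs on UTC; the slide only says that one host had a different time zone. It shows that the same instant maps to two different calendar dates, so "today" depends on which host evaluates the query. The helper name local_date is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def local_date(instant, utc_offset_hours):
    """Return the calendar date a host with the given UTC offset sees."""
    host_tz = timezone(timedelta(hours=utc_offset_hours))
    return instant.astimezone(host_tz).date()


# One instant, shortly after office hours on the US west coast.
instant = datetime(2016, 6, 29, 2, 30, tzinfo=timezone.utc)

date_on_utc_host = local_date(instant, 0)       # assumed odd host (UTC)
date_on_pacific_host = local_date(instant, -7)  # assumed regular host (UTC-7)

print(date_on_utc_host, date_on_pacific_host)
```

During office hours both hosts agree on the date, which is consistent with the results looking accurate in the daytime and drifting in the evening.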


and the stories continue…


Hadoop operational challenges

Hadoop has lots of configurations

⬢ So many configurations! Overwhelming for many users
⬢ Best practices are evolving and change across versions

Many configurations are cluster and workload specific

⬢ A configuration good for one cluster may not be suitable for another cluster
⬢ Optimally configured clusters may become suboptimal tomorrow as they grow

Large clusters add to the complexities

⬢ Managing, updating and keeping nodes in sync becomes challenging
⬢ Nodes going down miss the maintenance cycles and get out of sync
⬢ Newly added nodes may have different standards (Java version, OS, user configurations, etc.)
⬢ Clusters start having heterogeneous hardware over a period of time

Winning and avoiding the wars with SmartSense

SmartSense

⬢ Proactive support & personalized cluster insights by
– Enabling faster case resolution
– Applying industry best practices
– Providing proactive analysis
⬢ SmartSense is a collection of tools and services
– Evaluates cluster’s current configuration and runtime environment against rich set of rules
– Rules are dynamic, reacting to thresholds tailored to the specific cluster and its workloads
– Continuously evolving and improving rule sets, developed by or in close consultation with active committers, support engineers, field engineers.

SmartSense Architecture

[Architecture diagram. Labels: Agent, Worker Node, Ambari, Server, Landing Zone, Bundle, Gateway, SmartSense Analytics]

Addressing: Unstable NameNode, Frequent Fail Overs

Daunting Questions
⬢ What is the right heap size for my NN?
⬢ What should be the new gen size?
⬢ Which GC should I use?
⬢ What GC options should be configured?
⬢ What if my cluster grows?

SmartSense Answer
⬢ Rule: hdfs_nn_jvm_opts
⬢ Calculates heap size based on
– Current heap usage
– Total number of objects in file system
– Best practices
⬢ Recalculates dependent JVM options based on heap size
⬢ Validates existing JVM opts
⬢ Provides continuous validations and proactive recommendations

NameNode JVM Opts

⬢ Heap size
– 200 bytes per HDFS object (files, directories, blocks)
– 25% buffer
⬢ -Xms should be the same as -Xmx
⬢ New generation size should be 1/8th of -Xmx (capped at 8G)
⬢ Use Concurrent Mark Sweep (CMS) garbage collection
– -XX:+UseConcMarkSweepGC
– -XX:CMSInitiatingOccupancyFraction=70
– -XX:+UseCMSInitiatingOccupancyOnly
– -XX:ParallelGCThreads=8
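As a rough illustration of the arithmetic on this slide, the sketch below turns an object count into a set of JVM options. It is not the hdfs_nn_jvm_opts rule itself, which also looks at current heap usage. Rounding the heap up to whole GiB, the 1 GiB floor, and expressing the new generation through -XX:NewSize / -XX:MaxNewSize are assumptions made here; the function name is hypothetical.

```python
import math

BYTES_PER_OBJECT = 200   # from the slide: files, directories, blocks
BUFFER = 0.25            # from the slide: 25% headroom
NEW_GEN_CAP_MIB = 8 * 1024  # from the slide: new gen capped at 8G
GIB = 1024 ** 3


def namenode_jvm_opts(num_objects):
    """Sketch: derive NameNode JVM options from the HDFS object count."""
    heap_bytes = num_objects * BYTES_PER_OBJECT * (1 + BUFFER)
    # Assumption: round up to whole GiB, never below 1 GiB.
    heap_gib = max(1, math.ceil(heap_bytes / GIB))
    # New generation is 1/8th of the heap, capped at 8G.
    new_gen_mib = min(heap_gib * 1024 // 8, NEW_GEN_CAP_MIB)
    return [
        f"-Xms{heap_gib}g",
        f"-Xmx{heap_gib}g",
        f"-XX:NewSize={new_gen_mib}m",      # assumed flag choice
        f"-XX:MaxNewSize={new_gen_mib}m",   # assumed flag choice
        "-XX:+UseConcMarkSweepGC",
        "-XX:CMSInitiatingOccupancyFraction=70",
        "-XX:+UseCMSInitiatingOccupancyOnly",
        "-XX:ParallelGCThreads=8",
    ]


# Example: a namespace with 100 million objects.
print(" ".join(namenode_jvm_opts(100_000_000)))
```

Re-running the calculation as the namespace grows is the point of the slide: the heap and the dependent options should move together.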

Addressing: Very high CPU usage but low throughput

Daunting Questions
⬢ Is THP applicable to my OS version?
⬢ Is it disabled? Completely disabled?
⬢ How do I make sure it is disabled on newly added nodes too?
⬢ How do I make these configurations person independent?

SmartSense Answer
⬢ Rule: os_thp
⬢ Checks if THP is completely disabled
⬢ Provides OS specific disabling instructions
⬢ Continuous evaluation that validates newly added nodes and re-commissioned nodes

Disable THP

⬢ For RedHat & CentOS
echo "never" > /sys/kernel/mm/redhat_transparent_hugepage/enabled
⬢ For Debian, Ubuntu & SUSE
echo "never" > /sys/kernel/mm/transparent_hugepage/enabled

[Chart: System CPU, User CPU, N/W IO]
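A node can be checked for the current THP setting by reading the same sysfs files. The sketch below is an illustration rather than the os_thp rule: it tries the two paths named on the slide and relies on the kernel marking the active mode with square brackets (for example "always madvise [never]"). The function names are hypothetical.

```python
import re
from pathlib import Path

# Paths taken from the slide (RedHat/CentOS first, then Debian/Ubuntu/SUSE).
THP_PATHS = [
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
    "/sys/kernel/mm/transparent_hugepage/enabled",
]


def active_thp_mode(contents):
    """Return the bracketed (active) mode from the sysfs file contents."""
    match = re.search(r"\[(\w+)\]", contents)
    return match.group(1) if match else None


def thp_status():
    """Return (path, active mode) for the first THP file found, else (None, None)."""
    for path in THP_PATHS:
        candidate = Path(path)
        if candidate.exists():
            return path, active_thp_mode(candidate.read_text())
    return None, None


if __name__ == "__main__":
    path, mode = thp_status()
    if path is None:
        print("No THP control file found on this host")
    else:
        print(f"{path}: {mode}")
```

Running a check like this on every node, including newly added ones, is what keeps the setting from depending on a single person remembering it.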

Addressing: Cascading impact and cluster melt down

Daunting Questions
⬢ Should I finalize upgrade?
⬢ What is the right time to finalize?
⬢ How do I make sure it does not fall through the cracks?

SmartSense Answer
⬢ Rule: hdfs_nn_finalize_upgrade
⬢ Checks HDFS health after upgrade
⬢ Evaluates how long HDFS has been running in un-finalized state
⬢ Reminds until it is finalized

HDFS Upgrade finalization

⬢ Check NN UI / JMX for upgrade status
⬢ Do not finalize HDFS upgrade until
– All files and blocks have been verified after upgrade
– Critical jobs have been executed at least once after upgrade
⬢ Finalize between 2 - 7 days after upgrade
hdfs dfsadmin -finalizeUpgrade
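The timing guidance above can be written down as a small reminder check. This is a sketch of the decision only, not the hdfs_nn_finalize_upgrade rule; the function name, the single "verified" flag standing in for both preconditions, and the message wording are assumptions.

```python
from datetime import date

MIN_DAYS = 2  # from the slide: finalize between 2 and 7 days after upgrade
MAX_DAYS = 7


def finalize_advice(upgraded_on, today, verified):
    """Sketch: advise whether an un-finalized HDFS upgrade should be finalized.

    verified: files/blocks checked and critical jobs run at least once.
    """
    days = (today - upgraded_on).days
    if not verified:
        return "wait: verify files, blocks and critical jobs first"
    if days < MIN_DAYS:
        return "wait: too early to finalize"
    if days <= MAX_DAYS:
        return "finalize: run hdfs dfsadmin -finalizeUpgrade"
    return "overdue: finalize now, the upgrade has been left un-finalized too long"


print(finalize_advice(date(2016, 6, 20), date(2016, 6, 24), verified=True))
```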

Addressing: Overloaded cluster

Daunting Questions
⬢ What is the right container size for my cluster?
⬢ If I add additional components (HBase, Storm), how does the container size change?
⬢ How do container sizes change when I add new types of nodes in the cluster?
⬢ What is the impact on container sizes if I add SSDs to the nodes?

SmartSense Answer
⬢ Rules: yarn_container_size, mr_container_size, tez_container_size
⬢ Evaluates resources available on individual host (CPU, Memory, Disks, Running Services etc.)
⬢ Calculates technology specific container sizes (MR, Tez, Hive)
⬢ Continuously evaluates as the cluster dynamics change

Container sizing

⬢ Identify resources (CPU, Memory, Disks) available on each node
⬢ Keep aside resources required for other processes (OS, DN, NM, HBase RS)
⬢ Calculate max possible containers for each resource (CPU, Memory, Disks)
– CPU Containers: 4x cores
– Disk Containers: (3x HDD + 10x SSD)
– Memory Containers: (Available RAM / 2)
⬢ Number of containers = Min (CPU Containers, Disk Containers, Memory Containers)
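The calculation above fits in a few lines. The sketch assumes that "Available RAM" is given in GB and that all inputs are what remains after setting aside resources for the OS and other daemons; the slide does not state the unit. The function name and the returned "limited by" label are hypothetical.

```python
def max_containers(cores, hdds, ssds, available_ram_gb):
    """Sketch: containers per node as the minimum across CPU, disk and memory."""
    limits = {
        "cpu": 4 * cores,                  # 4x cores
        "disk": 3 * hdds + 10 * ssds,      # 3x HDD + 10x SSD
        "memory": available_ram_gb // 2,   # Available RAM / 2 (GB assumed)
    }
    limiting = min(limits, key=limits.get)
    return limits[limiting], limiting


# Example node: 16 cores, 12 HDDs, no SSDs, 96 GB left for containers.
count, limited_by = max_containers(cores=16, hdds=12, ssds=0, available_ram_gb=96)
print(count, limited_by)
```

Returning the limiting resource alongside the count shows why adding SSDs or new node types changes the answer: the bottleneck moves from one resource to another.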

Addressing: Accidental deletion of critical datasets

Daunting Questions
⬢ Is HDFS trash enabled?
⬢ What is a safe trash interval?
⬢ How to prevent accidental deletion of critical data?

SmartSense Answer
⬢ Rule: hdfs_trash_interval
– Checks if trash is enabled
– Validates if trash interval is within reasonable limits
⬢ Rule: hdfs_nn_protect_imp_dirs
– New feature available in Hadoop 2.8
– Helps you mark critical directories such as “/”, “/user”, “/user/apps/hive”, “/user/apps/hbase” etc. as delete protected

HDFS Trash interval and directory protection

fs.trash.interval sets the number of minutes after which the trashed data gets deleted
– 0 means trash disabled (data gets deleted immediately)
– Keep it in the range 1440 (1 day) – 10080 (7 days)
– Recommended 4320 (3 days)
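The trash interval limits above are easy to check mechanically. The sketch below is an illustration of such a check, not the hdfs_trash_interval rule; the function name and the message wording are assumptions, while the thresholds come from the slide.

```python
TRASH_MIN_MINUTES = 1440          # 1 day
TRASH_MAX_MINUTES = 10080         # 7 days
TRASH_RECOMMENDED_MINUTES = 4320  # 3 days


def check_trash_interval(minutes):
    """Sketch: classify a fs.trash.interval value against the slide's limits."""
    if minutes == 0:
        return "disabled: deleted data is removed immediately"
    if minutes < TRASH_MIN_MINUTES:
        return f"too short: use at least {TRASH_MIN_MINUTES} minutes"
    if minutes > TRASH_MAX_MINUTES:
        return f"too long: use at most {TRASH_MAX_MINUTES} minutes"
    return "ok"


for value in (0, 60, TRASH_RECOMMENDED_MINUTES, 20160):
    print(value, "->", check_trash_interval(value))
```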

fs.protected.directories specifies directories that will be delete protected
– Available from Hadoop 2.8
– List all key directories there ("/", "/user", "/user/apps", "/user/apps/hive", "/user/apps/hbase", "/user/apps/hbase/data", "/mapred", "/mapred/system", "/tmp" etc.)

Addressing: Hive query returning random results

Daunting Questions
⬢ Is my cluster configured consistently?
⬢ How do I prevent such hard to analyze issues?
⬢ How do I make sure newly added nodes do not bring these types of issues?
⬢ How do I make these set ups person independent?

SmartSense Answer
⬢ Rule: os_time_zone
⬢ Checks if all hosts have the same time zone
⬢ Rule: os_service_ntpd_on makes sure all host times are in sync
⬢ Continuous evaluation that validates newly added nodes and re-commissioned nodes

There are 250+ more such rules

Operations
hdfs_dn_volume_tolerance, hdfs_dn_xceivers, hdfs_nn_handler_count, …
yarn_zk_quorum, yarn_nm_recovery, …
os_hostname_reverse_lookup, os_ssd_tuning, …
hive_mr_strict_mode, hive_datanucleus_cache, …
tez_am_heap, tez_shuffle_buffer, …

Performance
ams_mc_distributed_configs, ams_mc_write_path, ...
hbase_jvm_opts, hbase_rs_open_region_threads, hbase_tcp_nodelay, ...
hdfs_dn_jvm_opts, hdfs_mount_options, hdfs_nn_dn_staleness_interval, ...
hive_auto_convert_join, hive_disable_caching, hive_enable_cbo, ...

Security
hdfs_dn_volume_tolerance, hdfs_audit_log, hdfs_block_access_token, hdfs_enable_security_check, hdfs_nn_super_user_group, hdfs_zkfc_ha_acl, ...
ranger_policy_refresh_interval, smartsense_2_way_ssl_enabled, ...
yarn_ats_security, yarn_enable_acl, ...

There is more than just configurations

⬢ How do I show back/charge back my tenants?
⬢ Who are the top users of my platform?
⬢ What type of workloads are running on my cluster?
⬢ Which jobs have significant impact on my cluster?
⬢ How do I improve performance of key jobs?
⬢ What is a good time for maintenance?


Activity Analysis

Summary

⬢ There are many things involved in managing a Hadoop cluster
⬢ Best practices evolve and change across versions
⬢ What is optimal today may not be optimal tomorrow
⬢ Changing cluster dynamics and workload characteristics need continuous re-evaluation and configuration adjustments
⬢ SmartSense can significantly help avoid common mistakes, issues and pitfalls, and simplify Hadoop operations

Let's keep your Hadoop cluster at its best!
Thank You!


Appendix

SmartSense Bundle Security

⬢ All Bundles are Anonymized and Encrypted
⬢ Multiple built-in security measures
– Ambari clear text passwords are not collected
– Hive and Oozie database properties are not collected
– All IP addresses and host names are anonymized
⬢ Extensible security rules
– Exclude properties within specific Hadoop configuration files
– Global REGEX replacements across all configuration, metrics, and logs

SmartSense Stack Support

SmartSense 1.x: HDP 2.4, HDP 2.3, HDP 2.2, HDP 2.1, HDP 2.0
SmartSense 1.x: Ambari 2.2 (Built-In!), Ambari 2.1 (Plug-In), Ambari 2.0 (Plug-In), Ambari 1.7, Ambari 1.6