HUG August 2010: Best Practices

Apache Hadoop Grid Patterns and Anti-Patterns. Arun C Murthy, Yahoo! Grid Team, CCDI

Post on 20-Jan-2015





Arun Murthy, from the Hadoop team at Yahoo!, will introduce a compendium of best practices for applications running on Apache Hadoop. In fact, we introduce the notion of a Grid Pattern which, similar to a Design Pattern, represents a general reusable solution for applications running on the Grid. He will also cover the anti-patterns of applications running on Apache Hadoop clusters. Arun will enumerate characteristics of well-behaved applications and provide guidance on appropriate uses of various features and capabilities of the Hadoop framework. The talk is largely prescriptive in nature; a useful way to look at the presentation is to understand that applications that follow, in spirit, the best practices prescribed here are very likely to be efficient, well-behaved in the multi-tenant environment of Apache Hadoop clusters, and unlikely to fall afoul of most policies and limits.


Page 1: HUG August 2010: Best practices

Apache Hadoop Grid Patterns and Anti-Patterns

Arun C Murthy, Yahoo! Grid Team, CCDI

Page 2: HUG August 2010: Best practices


8/18/10 2

Who am I?

  Yahoo!

›  Grid Team (CCDI)

›  Lead the Apache Hadoop Map-Reduce Development Team


›  Developer on Apache Hadoop since April 2006

›  Committer

›  Member of Apache Hadoop PMC

Page 3: HUG August 2010: Best practices

Apache Hadoop

The Software

  Hadoop Distributed File System

  Hadoop Map-Reduce

  Open source from Apache

  Written in Java

  Runs on

›  Linux, Solaris, Mac OS/X

›  Commodity hardware

Page 4: HUG August 2010: Best practices


HDFS

  Designed to store large files

  Stores files as large blocks (64 to 128 MB)

  Each block stored on multiple servers

  Data is automatically re-replicated on need

  Accessed from command line, Java API or C API

Page 5: HUG August 2010: Best practices

Data Processing

Hadoop Map-Reduce

  Map-Reduce is a programming model for efficient distributed computing

  Efficiency from

›  Streaming through data, reducing seeks

›  Pipelining

  A good fit for a lot of applications

›  Log processing

›  Web index building

Page 6: HUG August 2010: Best practices

Hadoop in the Enterprise

Usage and Importance

  A large number of corporations use Apache Hadoop at scale for several business-critical applications


›  Large, shared, multi-tenant deployments to minimize fragmentation across organizations

  Millions of dollars at stake!

›  Yahoo

•  Advertising, Search

•  40,000 machines and counting

Page 7: HUG August 2010: Best practices

Hadoop in the Enterprise

… however

  Hadoop isn’t a silver bullet (at least as yet!)

›  Hadoop still depends on users to utilize it effectively

›  Pig/Hive help, but one can still write badly suited queries

  Need to adapt legacy applications to Hadoop, especially the Map-Reduce paradigm

  Efficient usage of Hadoop clusters is critical to getting return on the investment

Page 8: HUG August 2010: Best practices

Hadoop Map-Reduce

Overview

  It works like a Unix pipeline:

›  cat input | grep | sort | uniq -c | cat > output

›  Input | Map | Shuffle & Sort | Reduce | Output

  Works on key/value pairs

›  map <k1, v1> -> <k2, v2>

›  reduce <k2, v2> -> <k3, v3>
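
The pipeline on this slide can be sketched in a few lines of plain Python. This is only an in-memory illustration of the model, not Hadoop's API: `map_fn`, `reduce_fn` and `run_job` are hypothetical names, and the word-count job is an assumed example. Note that after the shuffle, reduce receives all values for a key grouped together.

```python
from collections import defaultdict

def map_fn(_key, line):
    # map: <k1, v1> -> zero or more <k2, v2>; here (offset, line) -> (word, 1)
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce: <k2, [v2, ...]> -> <k3, v3>; here (word, [1, 1, ...]) -> (word, total)
    yield word, sum(counts)

def run_job(records, mapper, reducer):
    # Shuffle & sort: group every map output value under its key, then visit keys in order.
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

lines = [(0, "a b a"), (1, "b c")]
result = run_job(lines, map_fn, reduce_fn)
print(result)
```

In a real cluster the grouping step is distributed across machines, which is why the later slides pay so much attention to the cost of the shuffle.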

Page 9: HUG August 2010: Best practices

Best Practices

Input to Applications

  Optimized to process large data-sets

  Pattern: Coalesce processing of multiple small input files into a smaller number of maps, and use larger HDFS block-sizes for processing very large data-sets.
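
To make the coalescing idea concrete, here is a minimal sketch that greedily packs small files into map inputs of a target size. It is an illustration only and not the algorithm of any Hadoop input format; `coalesce_splits`, the file names and the 8 MB / 128 MB sizes are all assumptions.

```python
def coalesce_splits(file_sizes, target_bytes):
    """Greedily pack (name, size) pairs, in order, into splits of at most target_bytes.

    A file larger than target_bytes ends up in a split of its own.
    """
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Close the current split if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits

MB = 1024 * 1024
files = [(f"part-{i:04d}", 8 * MB) for i in range(1000)]  # assumed: 1,000 files of 8 MB
splits = coalesce_splits(files, target_bytes=128 * MB)
print(len(files), "files ->", len(splits), "maps")
```

With one map per file this job would launch 1,000 maps; packing 16 files per map brings that down to 63.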

Page 10: HUG August 2010: Best practices

Best Practices

Map-Reduce - Mappers

  Process multiple files per map for jobs with a very large number of small input files

  Process large chunks of data per-map for large-scale data-processing

›  PetaSort – 66,000 maps with 12.5G per map

  Pattern: Unless the application's maps are heavily CPU bound, there is almost no reason to ever require more than 60,000-70,000 maps for a single application.

Page 11: HUG August 2010: Best practices

Best Practices

Map-Reduce - Mappers

  Process multiple files per map for jobs with a very large number of small input files

  Process large chunks of data per-map for large-scale data-processing

›  PetaSort – 66,000 maps with 12.5G per map

  The shuffle cross-bar (maps * reduces) is a key performance factor

  Pattern: Applications should use fewer maps to process data in parallel, as few as possible without having really bad failure recovery cases.

›  Unless the application's maps are heavily CPU bound, there is almost no reason to ever require more than 60,000-70,000 maps for a single application
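
A little arithmetic shows why the map count matters for the shuffle. The sketch below assumes one map per fixed-size chunk of input and that every reduce fetches one segment from every map; the 10 TB input, the 1,000 reduces and the function names are illustrative assumptions, not figures from the talk.

```python
def num_maps(input_bytes, bytes_per_map):
    # One map per chunk of input; -(-a // b) is integer ceiling division.
    return -(-input_bytes // bytes_per_map)

def crossbar(maps, reduces):
    # Assumes every reduce fetches one segment from every map.
    return maps * reduces

MB, TB = 1024**2, 1024**4
input_bytes = 10 * TB   # assumed data-set size
reduces = 1_000         # assumed reduce count

for per_map in (128 * MB, 512 * MB):
    maps = num_maps(input_bytes, per_map)
    print(f"{per_map // MB} MB per map: {maps:,} maps, "
          f"{crossbar(maps, reduces):,} map-to-reduce transfers")
```

Under these assumptions, moving from 128 MB to 512 MB per map cuts the map count from 81,920 (above the 60,000-70,000 guideline) to 20,480, and the cross-bar shrinks by the same factor of four.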

Page 12: HUG August 2010: Best practices

Best Practices

Map-Reduce – Combiner and Shuffle

  Combiner

›  Map-side aggregation to help reduce network traffic for the shuffle

›  Cost of using combiners


›  Compression of intermediate output

  Pattern: Use combiners judiciously and ensure they really work! Compress intermediate outputs.
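
One way to read "ensure they really work" is to compare the number of records going into the combiner with the number coming out. The following is a plain-Python sketch of map-side aggregation for an assumed word-count job; `map_words` and `combine` are hypothetical names and nothing here is Hadoop API.

```python
from collections import Counter

def map_words(line):
    # Map output without a combiner: one (word, 1) record per word.
    return [(word, 1) for word in line.split()]

def combine(records):
    # Map-side aggregation: sum the counts per key before the shuffle.
    totals = Counter()
    for word, count in records:
        totals[word] += count
    return sorted(totals.items())

map_output = map_words("to be or not to be")
combined = combine(map_output)
ratio = len(combined) / len(map_output)
print(len(map_output), "records ->", len(combined), "after combine, ratio", round(ratio, 2))
```

If the ratio stays close to 1 on real data (for example, when almost every key is unique within a map), the combiner adds map-side work without shrinking the shuffle, which is the situation the pattern warns about.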

Page 13: HUG August 2010: Best practices

Best Practices

Map-Reduce – Reducers

  Efficiency depends on shuffle, and the cross-bar

  Configure appropriate number of reduces

›  Too few reduces hurt the nodes

›  Too many hurt the cross-bar

  Pattern: Applications should ensure that each reduce processes at least 1-2 GB of data, and at most 5-10 GB of data, in most scenarios.
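
The per-reduce bounds translate directly into a range for the number of reduces. The helper below is a hypothetical sketch: `reduce_count_range` is not a Hadoop function, and the 2,000 GB shuffle size is an assumed example; the 2 GB and 10 GB limits are taken from the upper ends of the ranges in the pattern.

```python
def reduce_count_range(shuffle_bytes, min_per_reduce, max_per_reduce):
    """Return (fewest, most) reduces that keep each reduce within the given bounds."""
    # Fewest reduces: each one takes as much as allowed (ceiling division).
    fewest = max(1, -(-shuffle_bytes // max_per_reduce))
    # Most reduces: each one still gets at least the minimum.
    most = max(1, shuffle_bytes // min_per_reduce)
    return fewest, max(fewest, most)

GB = 1024**3
fewest, most = reduce_count_range(2_000 * GB, min_per_reduce=2 * GB, max_per_reduce=10 * GB)
print(f"between {fewest} and {most} reduces")
```

For small inputs the range collapses to a single reduce, which is consistent with the "in most scenarios" qualifier in the pattern.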

Page 14: HUG August 2010: Best practices

Best Practices

Map-Reduce – Output

  Number of output artifacts is linear w.r.t. number of configured reduces

  Compress outputs

  Use appropriate file-formats for the output

›  E.g. compressed text-files are not a great idea if you aren’t using a splittable codec

  Think of the consumer of your data-set!

  Consider using larger HDFS block-sizes.

  Pattern: Application outputs should be a few large files, with each file spanning multiple HDFS blocks and appropriately compressed.

Page 15: HUG August 2010: Best practices

Best Practices

Map-Reduce – Distributed Cache

  Efficient distribution of read-only files for applications

  Designed for a small number of mid-sized files

  Pattern: Applications should ensure that artifacts in the distributed-cache do not require more I/O than the actual input to the application tasks.

Page 16: HUG August 2010: Best practices

Best Practices

Map-Reduce – Counters

  Global (across all tasks) counters, aggregated by the framework


  Pattern: Applications should not use more than 10, 15 or 25 custom counters.

Page 17: HUG August 2010: Best practices

Best Practices

Map-Reduce – Total Order Outputs

  Sampling Partitioner

›  Do not use a single reducer!

›  E.g. Terasort/Petasort benchmarks

  Joining fully sorted data-sets

›  Do not need the same cardinality (e.g. number of buckets) for the data-sets being joined

  Pattern: For total-order outputs, use a sampling partitioner with multiple reducers rather than a single reducer.
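
The sampling-partitioner idea on this slide can be sketched as follows: sample the keys, pick cut points from the sorted sample, and route each key to the reduce whose range contains it. This is a plain-Python illustration under assumed names (`sample_split_points`, `partition`) and assumed sizes, not the code used by the Terasort or Petasort benchmarks.

```python
import bisect
import random

def sample_split_points(keys, num_reduces, sample_size, seed=0):
    # Sample the input keys and pick num_reduces - 1 evenly spaced cut points.
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_reduces
    return [sample[int(step * i)] for i in range(1, num_reduces)]

def partition(key, split_points):
    # Keys below the first cut point go to reduce 0, the next range to reduce 1, and so on.
    return bisect.bisect_right(split_points, key)

keys = list(range(10_000))
random.Random(1).shuffle(keys)  # assumed unsorted input

split_points = sample_split_points(keys, num_reduces=4, sample_size=1_000)
buckets = [[] for _ in range(4)]
for k in keys:
    buckets[partition(k, split_points)].append(k)

# Each reduce sorts only its own bucket; concatenating the buckets in order is globally sorted.
total_order = [k for bucket in buckets for k in sorted(bucket)]
print([len(b) for b in buckets])
```

Because the partition function is monotonic in the key, every key in reduce 0 sorts before every key in reduce 1, so the sort work is spread over all reduces instead of landing on one.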

Page 18: HUG August 2010: Best practices

Best Practices

HDFS – NameNode and JobTracker Operations

  NameNode: Please don’t hurt me!

›  Not yet a silver bullet…

›  Do not perform metadata operations for map/reduce tasks at the backend

  Do not contact the JobTracker for cluster statistics etc. from the backend

  Pattern: Applications should not perform any metadata operations on the file-system from the backend; these should be confined to the job-client during job-submission. Furthermore, applications should be careful not to contact the JobTracker from the backend.

Page 19: HUG August 2010: Best practices

Best Practices

Map-Reduce – Logs and Web-UI

  Tasks’ stdout/stderr stored on TaskTrackers

›  Limit amount of logs

  JobTracker/NameNode Web-UI

›  Do not screen-scrape!

Page 20: HUG August 2010: Best practices

Best Practices

Oozie – Workflows

  Production pipelines are run via Oozie

  Ensure workflows have a small number of medium-to-large sized Map-Reduce jobs

›  Collapse smaller jobs

  Pattern: A single Map-Reduce job in a workflow should process at least a few tens of GB of data.

Page 21: HUG August 2010: Best practices


In a large enough cluster, you see any and all of these…

  Applications not using a higher-level interface such as Pig/Hive

  Processing thousands of small files (sized less than 1 HDFS block, typically 128MB) with one map processing a single small file.

  Processing very large data-sets with a small HDFS block size (i.e. 128MB), resulting in tens of thousands of maps.

  Applications with a large number (thousands) of maps with a very small runtime (e.g. 5s).

  Straightforward aggregations without the use of the Combiner.

  Applications with greater than 60,000-70,000 maps.

  Applications processing large data-sets with very few reduces (e.g. 1).

›  Pig scripts processing large data-sets without using the PARALLEL keyword

›  Applications using a single reduce for total-order among the output records

Page 22: HUG August 2010: Best practices


  Applications processing data with very large number of reduces, such that each reduce processes less than 1-2GB of data.

  Applications writing out multiple, small, output files from each reduce.

  Applications using the DistributedCache to distribute a large number of artifacts and/or very large artifacts (hundreds of MBs each).

  Applications using tens or hundreds of counters per task.

  Applications performing metadata operations (e.g. listStatus) on the file-system from the map/reduce tasks.

  Applications doing screen scraping of JobTracker web-ui for status of queues/jobs or worse, job-history of completed jobs.

  Workflows comprising hundreds or thousands of small jobs processing small amounts of data.

Work underway in yahoo-hadoop-0.20.200 to prevent anti-patterns
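
Several of the anti-patterns listed above lend themselves to simple automated checks. The sketch below is purely illustrative and does not describe the work mentioned on this slide: `find_anti_patterns` and the shape of the `job` dictionary are hypothetical, the map, runtime, per-reduce and counter thresholds are read off the slides, and the 10 GB cutoff for "large data-set with a single reduce" is an assumption.

```python
GB = 1024**3

def find_anti_patterns(job):
    """Flag a few anti-patterns; `job` is a hypothetical dict of job statistics."""
    warnings = []
    if job["maps"] > 70_000:
        warnings.append("more than 60,000-70,000 maps")
    if job["maps"] >= 1_000 and job["avg_map_seconds"] <= 5:
        warnings.append("thousands of maps with a very small runtime")
    if job["reduces"] == 1 and job["shuffle_bytes"] > 10 * GB:  # assumed cutoff
        warnings.append("large data-set processed by a single reduce")
    if job["reduces"] > 1 and job["shuffle_bytes"] / job["reduces"] < 1 * GB:
        warnings.append("each reduce processes less than 1-2GB")
    if job["custom_counters"] > 25:
        warnings.append("too many custom counters")
    return warnings

bad_job = {"maps": 90_000, "avg_map_seconds": 4, "reduces": 1,
           "shuffle_bytes": 500 * GB, "custom_counters": 40}
good_job = {"maps": 2_000, "avg_map_seconds": 120, "reduces": 100,
            "shuffle_bytes": 400 * GB, "custom_counters": 5}

for warning in find_anti_patterns(bad_job):
    print("warning:", warning)
```

A check like this could run against job statistics after completion; the thresholds are guidelines from the talk, so treating them as warnings rather than hard failures seems the safer design.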

Page 23: HUG August 2010: Best practices

Blog Post

Page 24: HUG August 2010: Best practices

