HUG August 2010: Best Practices

Post on 20-Jan-2015








Arun Murthy, from the Hadoop team at Yahoo!, will introduce a compendium of best practices for applications running on Apache Hadoop. In fact, we introduce the notion of a Grid Pattern which, similar to a Design Pattern, represents a general reusable solution for applications running on the Grid. He will also cover the anti-patterns of applications running on Apache Hadoop clusters. Arun will enumerate characteristics of well-behaved applications and provide guidance on appropriate uses of various features and capabilities of the Hadoop framework. The talk is largely prescriptive in nature; a useful way to look at the presentation is to understand that applications that follow, in spirit, the best practices prescribed here are very likely to be efficient and well-behaved in the multi-tenant environment of Apache Hadoop clusters, and unlikely to fall afoul of most policies and limits.


Apache Hadoop Grid Patterns and Anti-Patterns

Arun C Murthy Yahoo! Grid Team, CCDI


8/18/10 2

Who am I?

  Yahoo!

›  Grid Team (CCDI)

›  Lead the Apache Hadoop Map-Reduce Development Team


›  Developer on Apache Hadoop since April 2006

›  Committer

›  Member of Apache Hadoop PMC

Apache Hadoop

The Software

  Hadoop Distributed File System

  Hadoop Map-Reduce

  Open source from Apache

  Written in Java

  Runs on

›  Linux, Solaris, Mac OS/X

›  Commodity hardware


HDFS

  Designed to store large files

  Stores files as large blocks (64 to 128 MB)

  Each block stored on multiple servers

  Data is automatically re-replicated on need

  Accessed from command line, Java API or C API

Data Processing

Hadoop Map-Reduce

  Map-Reduce is a programming model for efficient distributed computing

  Efficiency from

›  Streaming through data, reducing seeks

›  Pipelining

  A good fit for a lot of applications

›  Log processing

›  Web index building

Hadoop in the Enterprise

Usage and Importance

  Large number of corporations use Apache Hadoop at scale for several business critical


›  Large, shared, multi-tenant deployments to minimize fragmentation across organizations

  Millions of dollars at stake!

›  Yahoo

•  Advertising, Search

•  40,000 machines and counting

Hadoop in the Enterprise

… however

  Hadoop isn’t a silver bullet (at least as yet!)

›  Hadoop still depends on users to utilize it effectively

›  Pig/Hive help, but one can still write badly suited queries

  Need to adapt legacy applications to Hadoop, especially the Map-Reduce paradigm

  Efficient usage of Hadoop clusters is critical to getting return on the investment

Hadoop Map-Reduce

Overview

  It works like a Unix pipeline:

›  cat input | grep | sort | uniq -c | cat > output

›  Input | Map | Shuffle & Sort | Reduce | Output

  Works on key/value pairs

›  map <k1, v1> -> list(<k2, v2>)

›  reduce <k2, list(v2)> -> list(<k3, v3>)
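
The pipeline analogy can be made concrete with a small in-memory sketch. This is not Hadoop code: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names chosen for illustration, and the "shuffle" is just a dictionary that groups values by key. It mirrors the `sort | uniq -c` step of the Unix pipeline as a word count.

```python
from collections import defaultdict


def map_fn(_key, line):
    # Map: one input record in, zero or more (word, 1) pairs out.
    return [(word, 1) for word in line.split()]


def reduce_fn(word, counts):
    # Reduce: one key plus all of its grouped values in, output pairs out.
    return [(word, sum(counts))]


def run_job(records):
    # Shuffle: group every mapped value under its key.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Sort: reducers see keys in sorted order.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


lines = ["b a", "a c a"]
result = run_job(enumerate(lines))
print(result)  # [('a', 3), ('b', 1), ('c', 1)]
```

In a real cluster the grouping step is the network shuffle between map and reduce tasks, which is why later slides focus on its cost.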

Best Practices

Input to Applications

  Optimized to process large data-sets

  Pattern: Coalesce processing of multiple small input files into smaller number of maps and use larger HDFS block-sizes for processing very large data-sets.
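
One way to picture the coalescing pattern is a greedy packer that groups small files into splits of roughly a target size, so each map processes many files. The sketch below is an illustration only; `coalesce_splits` and the file names are hypothetical, and the 512 MB target is an assumed value. Hadoop itself ships input formats in this spirit (for example `CombineFileInputFormat`), which also take data locality into account.

```python
def coalesce_splits(file_sizes, target_bytes):
    """Greedily pack (name, size) pairs into splits of about target_bytes each."""
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Start a new split once adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


MB = 1024 * 1024
# Hypothetical input: 1000 small files of 4 MB each.
files = [(f"part-{i:05d}", 4 * MB) for i in range(1000)]
splits = coalesce_splits(files, 512 * MB)
print(len(files), "files ->", len(splits), "maps")  # 1000 files -> 8 maps
```

With one map per file this job would launch 1000 maps; packed to about 512 MB per split it needs 8.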

Best Practices

Map-Reduce - Mappers

  Process multiple-files per map for jobs with very large number of small input files

  Process large chunks of data per-map for large-scale data-processing

›  PetaSort – 66,000 maps with 12.5G per map

  Pattern: Unless the application's maps are heavily CPU bound, there is almost no reason to ever require more than 60,000-70,000 maps for a single application.

  The shuffle cross-bar (maps * reduces) is a key performance factor

  Pattern: Applications should use as few maps as possible to process data in parallel, without incurring really bad failure-recovery cases.

›  Unless the application's maps are heavily CPU bound, there is almost no reason to ever require more than 60,000-70,000 maps for a single application
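
The arithmetic behind the map-count guidance is simple enough to sketch. The helper names `num_maps` and `shuffle_crossbar`, the 100 TB input, and the 2,000 reduces are assumptions made for this example, not figures from the talk.

```python
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4


def num_maps(input_bytes, bytes_per_map):
    # Ceiling division: one map per chunk of input.
    return -(-input_bytes // bytes_per_map)


def shuffle_crossbar(maps, reduces):
    # Every reduce fetches a partition from every map.
    return maps * reduces


input_bytes = 100 * TB  # hypothetical job input
reduces = 2000          # hypothetical reduce count
for label, per_map in [("128 MB", 128 * MB), ("512 MB", 512 * MB), ("2 GB", 2 * GB)]:
    maps = num_maps(input_bytes, per_map)
    print(f"{label} per map: {maps} maps, cross-bar {shuffle_crossbar(maps, reduces)}")
```

Under these assumptions, 128 MB per map gives 819,200 maps, while 2 GB per map gives 51,200, which is back inside the 60,000-70,000 guideline and shrinks the cross-bar by the same factor of 16.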

Best Practices

Map-Reduce – Combiner and Shuffle

  Combiner

›  Map-side aggregation to help reduce network traffic for the shuffle

›  Cost of using combiners


›  Compression of intermediate output

  Pattern: Use combiners judiciously and ensure they really work! Compress intermediate outputs.
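
A quick way to check whether a combiner "really works" is to compare the number of records a map would send to the shuffle with and without it. The sketch below simulates this in memory for a word count; `map_output` and `combine` are hypothetical helpers, not Hadoop APIs.

```python
from collections import Counter


def map_output(lines):
    # Raw map output: one (word, 1) record per word occurrence.
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    # Map-side aggregation: sum the values of each key before the shuffle.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


split = ["to be or not to be", "to do or not to do"]
raw = map_output(split)
combined = combine(raw)
print(len(raw), "records without combiner")    # 12 records without combiner
print(len(combined), "records with combiner")  # 5 records with combiner
```

When keys repeat within a map, as here, the shuffle shrinks. When nearly every key is distinct, the combined output is as large as the raw output and the combiner only adds cost, which is the case the slide warns about.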

Best Practices

Map-Reduce – Reducers

  Efficiency depends on shuffle, and the cross-bar

  Configure appropriate number of reduces

›  Too few reduces hurt the nodes

›  Too many hurt the cross-bar

  Pattern: Applications should ensure that each reduce processes at least 1-2 GB of data, and at most 5-10 GB of data, in most scenarios.
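
The per-reduce bounds translate directly into a range for the number of reduces. The sketch below assumes the outer ends of the guideline (1 GB minimum, 10 GB maximum per reduce); `reduce_count_range` is a hypothetical helper and the 1 TB shuffle size is an example value.

```python
GB = 1024 ** 3


def reduce_count_range(shuffle_bytes, min_per_reduce=1 * GB, max_per_reduce=10 * GB):
    """Return (fewest, most) reduces that keep each reduce within the bounds."""
    # Fewest reduces: each one takes as much as allowed (ceiling division).
    fewest = max(1, -(-shuffle_bytes // max_per_reduce))
    # Most reduces: each one still gets at least the minimum.
    most = max(fewest, shuffle_bytes // min_per_reduce)
    return fewest, most


print(reduce_count_range(1024 * GB))  # (103, 1024)
print(reduce_count_range(GB // 2))    # (1, 1)
```

For a 1 TB shuffle this suggests somewhere between 103 and 1024 reduces; for inputs below the minimum, a single reduce is the only choice that respects the lower bound.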

Best Practices

Map-Reduce – Output

  Number of output artifacts is linear w.r.t. number of configured reduces

  Compress outputs

  Use appropriate file-formats for the output

›  E.g. compressed text-files are not a great idea if you aren’t using a splittable codec

  Think of the consumer of your data-set!

  Consider using larger HDFS block-sizes.

  Pattern: Application outputs should be a few large files, with each file spanning multiple HDFS blocks and appropriately compressed.

Best Practices

Map-Reduce – Distributed Cache

  Efficient distribution of read-only files for applications

  Designed for small number of mid-sized files

  Pattern: Applications should ensure that artifacts in the distributed-cache do not require more I/O than the actual input to the application tasks.

Best Practices

Map-Reduce – Counters

  Global (across all tasks) counters, aggregated by the framework


  Pattern: Applications should not use more than 10, 15 or 25 custom counters.

Best Practices

Map-Reduce – Total Order Outputs

  Sampling Partitioner

›  Do not use a single reducer!

›  E.g. Terasort/Petasort benchmarks

  Joining fully sorted data-sets

›  Do not need same cardinality e.g. number of buckets for the data-sets being joined

  Pattern: For total-order output, use a sampling partitioner with multiple reduces rather than a single reducer.
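
The sampling-partitioner idea on this slide can be sketched as follows: sample the keys, pick split points from the sorted sample, and route each key to the reduce that owns its range. Each reduce then sorts its own range, and concatenating the reduce outputs in order yields a total order without a single reducer. The function names and sample size are assumptions for illustration; Hadoop provides its own `TotalOrderPartitioner` for this purpose.

```python
import bisect
import random


def sample_split_points(keys, num_reduces, sample_size=1000, seed=0):
    # Sort a random sample and take evenly spaced keys as range boundaries.
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_reduces
    return [sample[int(step * i)] for i in range(1, num_reduces)]


def partition(key, split_points):
    # Keys below the first split point go to reduce 0, and so on.
    return bisect.bisect_right(split_points, key)


num_reduces = 4
keys = list(range(10000))
random.Random(1).shuffle(keys)  # hypothetical unsorted input

points = sample_split_points(keys, num_reduces)
buckets = [[] for _ in range(num_reduces)]
for key in keys:
    buckets[partition(key, points)].append(key)

# Each "reduce" sorts its own bucket; concatenation is globally sorted.
ordered = [key for bucket in buckets for key in sorted(bucket)]
print(ordered == sorted(keys))  # True
```

Because the split points come from a sample, bucket sizes are only approximately equal; a larger sample gives a more even spread across reduces.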

Best Practices

HDFS – NameNode and JobTracker Operations

  NameNode: Please don’t hurt me!

›  Not yet a silver bullet…

›  Do not perform metadata operations for map/reduce tasks at the backend

  Do not contact the JobTracker for cluster statistics etc. from the backend

  Pattern: Applications should not perform any metadata operations on the file-system from the backend; these should be confined to the job-client during job-submission. Furthermore, applications should be careful not to contact the JobTracker from the backend.

Best Practices

Map-Reduce – Logs and Web-UI

  Tasks’ stdout/stderr stored on TaskTrackers

›  Limit amount of logs

  JobTracker/NameNode Web-UI

›  Do not screen-scrape!

Best Practices

Oozie – Workflows

  Production pipelines are run via Oozie

  Ensure workflows have small number of medium-to-large sized Map-Reduce jobs

›  Collapse smaller jobs

  Pattern: A single Map-Reduce job in a workflow should process at least a few tens of GB of data.


In a large enough cluster, you see any and all of these…

  Applications not using a higher-level interface such as Pig/Hive

  Processing thousands of small files (sized less than 1 HDFS block, typically 128MB) with one map processing a single small file.

  Processing very large data-sets with small HDFS block size i.e. 128MB resulting in tens of thousands of maps.

  Applications with a large number (thousands) of maps with a very small runtime (e.g. 5s).

  Straight-forward aggregations without the use of the Combiner.

  Applications with greater than 60,000-70,000 maps.

  Applications processing large data-sets with very few reduces (e.g. 1).

›  Pig scripts processing large data-sets without using the PARALLEL keyword

›  Applications using a single reduce for total-order among the output records



  Applications processing data with very large number of reduces, such that each reduce processes less than 1-2GB of data.

  Applications writing out multiple, small, output files from each reduce.

  Applications using the DistributedCache to distribute a large number of artifacts and/or very large artifacts (hundreds of MBs each).

  Applications using tens or hundreds of counters per task.

  Applications performing metadata operations (e.g. listStatus) on the file-system from the map/reduce tasks.

  Applications doing screen scraping of JobTracker web-ui for status of queues/jobs or worse, job-history of completed jobs.

  Workflows comprising hundreds or thousands of small jobs processing small amounts of data.
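
Several of the anti-patterns above are numeric and could be checked mechanically before a job runs. The sketch below is a hypothetical lint, not part of Hadoop: the `JobStats` record and its fields are invented for this example, and the thresholds are assumptions read off the slides (70,000 maps, 5-second maps, 1 GB and 10 GB per reduce, 25 custom counters).

```python
from dataclasses import dataclass

GB = 1024 ** 3


@dataclass
class JobStats:
    # Hypothetical summary of a job; not a Hadoop type.
    num_maps: int
    num_reduces: int
    shuffle_bytes: int
    avg_map_seconds: float
    custom_counters: int


def find_anti_patterns(job):
    findings = []
    if job.num_maps > 70_000:
        findings.append("more than 60,000-70,000 maps")
    if job.num_maps >= 1000 and job.avg_map_seconds <= 5:
        findings.append("thousands of maps with a very small runtime")
    if job.num_reduces > 0:
        per_reduce = job.shuffle_bytes / job.num_reduces
        if job.num_reduces > 1 and per_reduce < 1 * GB:
            findings.append("each reduce processes less than 1-2 GB")
        if per_reduce > 10 * GB:
            findings.append("too few reduces for the data size")
    if job.custom_counters > 25:
        findings.append("too many custom counters")
    return findings


bad = JobStats(num_maps=100_000, num_reduces=1, shuffle_bytes=500 * GB,
               avg_map_seconds=4.0, custom_counters=40)
good = JobStats(num_maps=5_000, num_reduces=100, shuffle_bytes=300 * GB,
                avg_map_seconds=120.0, custom_counters=10)
print(find_anti_patterns(bad))
print(find_anti_patterns(good))  # []
```

The non-numeric anti-patterns, such as metadata operations from tasks or screen-scraping the web UI, need code review or framework-level limits rather than a check like this.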

Work underway in yahoo-hadoop-0.20.200 to prevent anti-patterns

Blog Post
