Hadoop Simulation and Performance
Apache Hadoop India Summit 2011
Ranjit Mathew, Yahoo! R & D India
Copyright © 2011 Yahoo! All rights reserved.
Overview
Introduction
GridMix3
PigMix2
Tips
Plans
Q & A
Introduction
Why?
Capacity Planning
Benchmarking
Comparative evaluation of releases
Basis for improvements
Debugging
Performance Evaluation Techniques
Analytical Modeling
› Use statistics, queuing theory, etc. to model system
› Use models to predict behavior
Simulation
› Simulate work-load based on representation or traces
› Benchmarking used to compare variants
Measurement
› Use metrics gathered from tools and logs
› Measure under peak, regular and light work-loads
Ref.: “The Art of Computer Systems Performance Analysis”, Raj K. Jain (Wiley, 1991)
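As an illustration of the analytical-modeling approach, the sketch below applies the textbook M/M/1 queue formulas to a single server. It is not taken from the talk: the function name `mm1_metrics` and the example rates are assumptions chosen for illustration, and a real Hadoop cluster is far more complex than one queue.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Return (utilization, mean jobs in system, mean response time) for an M/M/1 queue.

    arrival_rate and service_rate are in jobs per unit time (hypothetical inputs).
    """
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival_rate must be below service_rate")
    utilization = arrival_rate / service_rate            # rho = lambda / mu
    mean_jobs = utilization / (1 - utilization)          # N = rho / (1 - rho)
    mean_response = 1 / (service_rate - arrival_rate)    # R = 1 / (mu - lambda)
    return utilization, mean_jobs, mean_response


if __name__ == "__main__":
    rho, jobs, response = mm1_metrics(arrival_rate=8.0, service_rate=10.0)
    print(f"utilization={rho:.2f} jobs_in_system={jobs:.2f} response_time={response:.2f}")
```

The model predicts behavior from two parameters only, which is what makes it cheap compared with simulation or measurement, and also why its predictions need checking against real runs.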
Hadoop Performance Evaluation Tools
GridMix3
PigMix2
TeraSort / GraySort
DFSIO, NNBench, S-Live
HiBench
etc.
GridMix3
GridMix Evolution
GridMix1 (HADOOP-2369):
• Representative mix of Jobs
• mapreduce/src/benchmarks/gridmix
GridMix2 (HADOOP-3770):
• More configurable; uses JobControl
• mapreduce/src/benchmarks/gridmix2
GridMix3 (MAPREDUCE-776):
• Trace-based; better emulation-accuracy
• mapreduce/src/contrib/gridmix
Rumen (MAPREDUCE-751):
• Supporting tool for GridMix3 and other tools
• mapreduce/src/tools/org/apache/hadoop/tools/rumen
GridMix3
Macro benchmark for Hadoop
Trace-based submission of synthetic Jobs
Traces based on production clusters
Traces generated by Rumen
No access to original Job’s code or data
Emulates I/O and other aspects
Highly configurable
Rumen
Comprises:
› TraceBuilder - Job Traces from Job History and Configuration
› Folder - Scales Job Traces to a given time-window
Job Traces are in JSON format
Insulation for release-to-release changes in format and contents
Statistical information on Jobs in Trace
Provides API to access Job Traces
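To make the idea of scaling a JSON Job Trace to a time-window concrete, here is a minimal sketch. It is a simplified linear compression, not Rumen's actual Folder algorithm, and it does not use Rumen's API. The field names `jobID` and `submitTime`, the sample trace and the function `scale_trace` are assumptions made for illustration.

```python
import json


def scale_trace(jobs, window_ms):
    """Rebase submit times to start at 0 and compress them to fit within window_ms.

    jobs is a list of dicts with a hypothetical "submitTime" field in milliseconds.
    Traces that already fit in the window are only rebased, not stretched.
    """
    if not jobs:
        return []
    first = min(job["submitTime"] for job in jobs)
    last = max(job["submitTime"] for job in jobs)
    span = last - first
    factor = window_ms / span if span > window_ms else 1.0
    scaled = []
    for job in sorted(jobs, key=lambda j: j["submitTime"]):
        copy = dict(job)  # leave the input trace untouched
        copy["submitTime"] = round((job["submitTime"] - first) * factor)
        scaled.append(copy)
    return scaled


# Hypothetical three-job trace spanning two minutes.
trace_json = """[
  {"jobID": "job_a", "submitTime": 1000},
  {"jobID": "job_b", "submitTime": 61000},
  {"jobID": "job_c", "submitTime": 121000}
]"""

if __name__ == "__main__":
    for job in scale_trace(json.loads(trace_json), window_ms=60000):
        print(job["jobID"], job["submitTime"])
```

Compressing a long production trace this way lets a benchmark run finish in a fixed time, at the cost of a denser submission pattern than the original.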
GridMix3 Flow
[Diagram: Job Histories & Configuration (Production Cluster) → Rumen → Job Trace → GridMix3 (Data Generator, Job Submitter) → Benchmark Cluster]
GridMix3 Architecture
[Diagram: GridMix3 components JobFactory, JobSubmitter and JobMonitor; JobStory (from Rumen), GridmixJob, MapReduce Job and Status; Jobs are submitted to the JobTracker]
GridMix3 Emulation-Accuracy
Submission Policies and Job Types
Submission policy determines when Jobs are submitted:
› STRESS - Keep the cluster under stress (without overwhelming it)
› REPLAY - Faithful emulation of inter-job submission times
› SERIAL - Submit a Job only after the previous one finishes
Types of synthetic Jobs:
› LOADJOB - Emulates work-load from Job Trace
› SLEEPJOB - Does no work; sleeps for the periods given in the Job Trace
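The difference between REPLAY and SERIAL submission can be sketched with a toy scheduler. This is not GridMix3 code: the `TraceJob` class, its fields and both functions are hypothetical, and job runtimes are assumed to be known up front. STRESS is left out because it depends on live cluster load.

```python
from dataclasses import dataclass


@dataclass
class TraceJob:
    name: str
    submit_offset: float  # seconds after trace start in the original trace
    runtime: float        # assumed runtime in seconds on the benchmark cluster


def replay_schedule(jobs):
    """REPLAY-like: preserve the original gaps between job submissions."""
    ordered = sorted(jobs, key=lambda j: j.submit_offset)
    start = ordered[0].submit_offset if ordered else 0.0
    return [(j.name, j.submit_offset - start) for j in ordered]


def serial_schedule(jobs):
    """SERIAL-like: submit each job only after the previous one has finished."""
    ordered = sorted(jobs, key=lambda j: j.submit_offset)
    schedule, clock = [], 0.0
    for job in ordered:
        schedule.append((job.name, clock))
        clock += job.runtime  # next job waits for this one to complete
    return schedule


if __name__ == "__main__":
    trace = [TraceJob("a", 0, 30), TraceJob("b", 10, 20), TraceJob("c", 15, 5)]
    print("REPLAY:", replay_schedule(trace))
    print("SERIAL:", serial_schedule(trace))
```

In this toy trace, REPLAY submits job `b` while `a` is still running, whereas SERIAL delays it until `a` completes, so SERIAL isolates per-job behavior and REPLAY reproduces contention.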
PigMix2
PigMix Evolution
PigMix1:
• Representative mix of 12 Pig scripts and Java programs
• http://wiki.apache.org/pig/PigMix
• http://wiki.apache.org/pig/DataGeneratorHadoop
PigMix2 (PIG-200):
• Added 5 Pig scripts and Java programs
• Re-factored data-generation
PigMix2
Benchmark for Pig
Representative mix of 17 Pig scripts
Corresponding native MapReduce Java programs
Specifications-based input-data generator
PigMix2 Flow
[Diagram: Data Generator → Input Data → PigMix2, run on the Benchmark Cluster]
Tips
Minimize Variance
Check hardware, especially for failing hard-drives
Use large data-sets to minimize effects of overheads
Beware of speculative execution
Set ipc.ping.interval to 5000 (HADOOP-5380)
Use appropriate PARALLEL clause in PigMix2 Pig scripts
Several runs needed for proper analysis
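One way to judge whether enough runs have been made is to look at the spread of the measured run times. The sketch below is an illustration only: the function `summarize_runs` and the sample timings are assumptions, not figures from the talk.

```python
import statistics


def summarize_runs(run_times):
    """Return mean, sample standard deviation and coefficient of variation.

    run_times is a list of wall-clock durations (seconds) from repeated runs
    of the same benchmark under the same configuration.
    """
    if len(run_times) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(run_times)
    stdev = statistics.stdev(run_times)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


if __name__ == "__main__":
    # Hypothetical timings from three runs of the same benchmark.
    summary = summarize_runs([100, 110, 90])
    print(f"mean={summary['mean']:.1f}s stdev={summary['stdev']:.1f}s cv={summary['cv']:.1%}")
```

A high coefficient of variation suggests that differences between two releases may be noise, and that the sources of variance listed above deserve a look before drawing conclusions.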
Apples to Apples Comparison
Benchmarking versus Production Cluster:
› Same hardware
› Same software stack
› Same configuration
› Similar networking
› Same size (might not be feasible)
Extrapolating results can be tricky
Plans
Future Work
Greater emulation-accuracy in GridMix3:
› Distributed Cache
› Compression
› CPU usage
› Memory usage
More comprehensive Job Traces from Rumen
Integration of PigMix2 with Pig Statistics
Q & A