TRANSCRIPT
Programming on Hadoop
Outline
• Different perspectives on Cloud Computing
• The Anatomy of a Data Center
• The Taxonomy of Computation
  – Computation-intensive
  – Data-intensive
• The Hadoop Eco-system
• Limitations of Hadoop
Cloud Computing
• From the user's perspective
  – A service that enables users to run their applications on the Internet
• From the service provider's perspective
  – A resource pool used to deliver cloud services through the Internet
  – The resource pool is hosted in an on-premise data center
  – What does the data center (DC) look like?
An Example of DC
• Google’s Data Center in 2009.
From Jeffrey Dean’s talk on WSDM2009
A Closer Look at DC – Overview
Figure is copied from [4]
A Closer Look at DC – Cooling
Figure is copied from [4]
A Closer Look at DC – Computing Resources
Figure is copied from [4]
The Commodity Server
• A commodity server is NOT a low-end server
  – Standard components vs. proprietary hardware
• A common configuration in 2008
  – Processor: 2 quad-core Intel Xeon 2.0GHz CPUs
  – Memory: 8 GB ECC RAM
  – Storage: 4 × 1TB SATA disks
  – Network: Gigabit Ethernet
Approaches to Deliver Service
• The dedicated approach
  – Serve each customer with dedicated computing resources
• The shared approach (multi-tenant architecture)
  – Serve customers with a shared resource pool
The Dedicated Approach
• Pros:
  – Easy to implement
  – Performance & security guarantees
• Cons:
  – Painful for customers to scale their applications
  – Poor resource utilization
The Shared Approach
• Pros:
  – No pain for customers to scale their applications
  – Better resource utilization
  – Better performance in some cases
  – Lower service cost per customer
• Cons:
  – Needs a complicated software layer
  – Performance isolation/tuning may be complicated
    • To achieve better performance, customers should be familiar with the software/hardware architecture to some degree
The Hadoop Eco-system
• A software infrastructure that delivers a DC as a service through the shared-resources approach
  – Customers can use Hadoop to develop/deploy certain data-intensive applications on the cloud
• We focus on the Hadoop core in this lecture
  – Hadoop == Hadoop-core from here on
Core: Hadoop Distributed File System (HDFS), MapReduce
Extensions: HBase, Chukwa, Hive, Pig
The Taxonomy of Computations
• Computation-intensive tasks
  – Small data (in-memory), lots of CPU cycles per data item processed
  – Examples: machine learning
• Data-intensive tasks
  – Large-volume data (on-disk), relatively few CPU cycles per data item processed
  – Examples: DBMS
The Data-intensive Tasks
• Streaming-oriented data access
  – Read/write a large portion of the dataset in a streaming (sequential) manner
  – Characteristics:
    • No seeks, high throughput
    • Optimized for a high data transfer rate
• Random-oriented data access
  – Read/write a small number of data items randomly located in the dataset
  – Characteristics:
    • Seek-oriented
    • Optimized for low-latency access to each data item
What Hadoop does & doesn’t
• Hadoop can
  – Perform high-throughput streaming data access
  – Perform limited low-latency random data access through HBase
  – Perform large-scale analysis through MapReduce
• Hadoop cannot
  – Perform transactions
  – Serve certain time-critical applications
Hadoop Quick Start
• Very simple
  – Download the Hadoop package from Apache
    • http://hadoop.apache.org/
  – Unpack it into a folder
  – Do some configuration in hadoop-site.xml
    • fs.default.name selects the default file system (e.g., HDFS)
    • mapred.job.tracker points to the JobTracker of the MapReduce cluster
  – Start
    • Format the file system only once (in a fresh installation)
      – bin/hadoop namenode -format
    • Launch the HDFS & MapReduce cluster
      – bin/start-all.sh
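The two properties mentioned above go into conf/hadoop-site.xml. A minimal sketch, assuming a single-machine setup; the host names and port numbers here are placeholders, not values from the slides:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Default file system; an hdfs:// URI points at the NameNode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- Host:port of the MapReduce JobTracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```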
The Launched HDFS cluster
The Launched MapReduce Cluster
The Hadoop Distributed Filesystem
• Wraps the DC as a resource pool and provides a set of APIs that let users read/write data from/into the DC sequentially
A Closer Look at the API
• Aha, writing "Hello World!"
  – bin/hadoop jar test.jar

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Main {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream fsOut = fs.create(new Path("testFile"));
    fsOut.writeBytes("Hello Hadoop");
    fsOut.close();
  }
}
A Closer Look at the API (cont.)
• Reading data from the HDFS
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Main {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream fsIn = fs.open(new Path("testFile"));
    byte[] buf = new byte[1024];
    int len = fsIn.read(buf);
    System.out.println(new String(buf, 0, len));
  }
}
Inside HDFS
• A single-NameNode, multiple-DataNodes architecture (see [5] for reference)
  – Chops each file into a set of fixed-size blocks and stores those data blocks on all available DataNodes
  – The NameNode hosts all file system metadata (file-block mapping, block locations, etc.) in memory
  – DataNodes host all file data for reading/writing
Inside HDFS – Architecture
• Figure is copied from http://hadoop.apache.org/common/docs/current/hdfs_design.html
Inside HDFS – Writing Data
Figure is copied from [2]
Inside HDFS – Reading Data
• What is the problem with reading/writing?
Figure is copied from [2]
The HDFS Cons
• Single reader/writer
  – Reads and writes a single block at a time
  – Only touches ONE DataNode
  – Data transfer rate == disk bandwidth of a SINGLE node
  – Too slow for a large file
    • Suppose disk bandwidth == 100MB/sec
    • Reading/writing a 1TB file requires ~3 hrs
  – How to fix it?
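The back-of-envelope number above is easy to check; a small sketch in plain Java (no Hadoop dependency, numbers taken from the slide):

```java
public class TransferTime {
    // Seconds needed to move `bytes` at `bandwidth` bytes/sec through one disk
    static double seconds(double bytes, double bandwidth) {
        return bytes / bandwidth;
    }

    public static void main(String[] args) {
        double oneTB = 1e12;           // 1 TB
        double diskBw = 100 * 1e6;     // 100 MB/sec
        double hours = seconds(oneTB, diskBw) / 3600.0;
        System.out.printf("Single disk: %.1f hours%n", hours);  // ~2.8 hours
        // With N readers working in parallel on disjoint blocks,
        // the time drops by roughly a factor of N:
        int n = 100;
        System.out.printf("%d readers: %.1f minutes%n",
                          n, seconds(oneTB, diskBw) / n / 60.0);
    }
}
```

This is exactly the motivation for the multiple readers/writers pattern on the next slides.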
Multiple Reader/Writers
• Read/write a large data set using multiple processes
  – Each process reads/writes a subset of the whole data set and materializes that subset as a file
  – A file collection makes up the whole data set
    • Typically, the file collection is stored in a directory named after the data set
Multiple Readers/Writers (cont.)
• Question – what is the proper number of readers and writers?
[Figure: Data set A is split into Sub-sets 1–4; Processes 1–4 each write one sub-set as files part-0001 … part-0004 under the directory /root/datasetA]
Multiple Readers/Writers (cont.)
• Reading/writing a large data set with multiple readers/writers and materializing it as a collection of files is a common pattern in HDFS
• But, too painful!
  – Invocation of multiple readers/writers in the cluster
  – Coordination of those readers/writers
  – Machine failures
  – …
• Rescue: MapReduce
The MapReduce System
• MapReduce is a programming model and its associated implementation for processing and generating large data sets [1]
• The computation performs key/value-oriented operations and consists of two functions
  – Map: transforms an input key/value pair into a set of intermediate key/value pairs
  – Reduce: merges intermediate key/value pairs with the same key and produces another key/value pair
The MapReduce Programming Model
• Map: (k0, v0) -> (k1, [v1])
• Reduce: (k1, [v1]) -> (k2, v2)
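The model can be illustrated without Hadoop at all. Below is a minimal word-count sketch in plain Java; the explicit shuffle step stands in for the grouping the framework performs, and all names here are illustrative, not Hadoop API:

```java
import java.util.*;

public class MiniMapReduce {
    // Map: (docId, text) -> [(word, 1)]
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : text.toLowerCase().split("\\s+"))
            out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // Shuffle: group intermediate values by key (done by the framework in Hadoop)
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce: (word, [1, 1, ...]) -> count
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> inter = map("doc1", "hello hadoop hello world");
        for (Map.Entry<String, List<Integer>> e : shuffle(inter).entrySet())
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        // prints: hadoop 1, hello 2, world 1
    }
}
```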
The System Architecture
• One JobTracker for job submission
• Multiple TaskTrackers for invocation of mappers or reducers
Figure is from Google Images
The Mapper Interface
• Mapper/Reducer are defined as generic Java interfaces in Hadoop
public interface Mapper&lt;K1, V1, K2, V2&gt; {
  void map(K1 key, V1 value, OutputCollector&lt;K2, V2&gt; output, Reporter reporter);
}

public interface Reducer&lt;K2, V2, K3, V3&gt; {
  void reduce(K2 key, Iterator&lt;V2&gt; values, OutputCollector&lt;K3, V3&gt; output, Reporter reporter);
}
The Data Types of MapReduce
• MapReduce makes no assumptions about the data type
  – It does not know what constitutes a key/value pair
• Users must decide on the appropriate input/output data types
  – The runtime data-interpreting pattern
  – Achieved by implementing two Hadoop interfaces
    • RecordReader&lt;K, V&gt; for parsing input key/value pairs
    • RecordWriter&lt;K, V&gt; for serializing output key/value pairs
The RecordReader/Writer Interface
interface RecordReader&lt;K, V&gt; {
  // Omit other functions
  boolean next(K key, V value);
}

interface RecordWriter&lt;K, V&gt; {
  // Omit other functions
  void write(K key, V value);
}
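To make the pattern concrete, here is a toy line-oriented reader in the same spirit: key = line number, value = line text. This is an illustrative plain-Java sketch, not Hadoop's actual LineRecordReader:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class LineReader {
    private final BufferedReader in;
    private long lineNo = 0;
    private String key;
    private String value;

    public LineReader(BufferedReader in) { this.in = in; }

    // Mirrors RecordReader.next(): advance to the next record, report success
    public boolean next() throws IOException {
        String line = in.readLine();
        if (line == null) return false;   // end of the split
        key = Long.toString(lineNo++);
        value = line;
        return true;
    }

    public String getKey() { return key; }
    public String getValue() { return value; }

    public static void main(String[] args) throws IOException {
        LineReader r = new LineReader(new BufferedReader(new StringReader("foo\nbar")));
        while (r.next())
            System.out.println(r.getKey() + " -> " + r.getValue());
        // prints: 0 -> foo, then 1 -> bar
    }
}
```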
The Overall Picture
• The data set is split into many parts
• Each part is processed by one mapper
• The intermediate results are processed by reducers
• Each reducer writes its results as a file

InputSplit-n -> RecordReader -> map -> shuffle/merge -> reduce -> RecordWriter -> part-000n
Performance Tuning
• A lot of factors…
• At the architecture level
  – Record parsing, map-side sorting, …, see [3]
  – Shuffling, see many research papers at VLDB, SIGMOD
• Parameter tuning
  – Memory buffers for mappers/reducers
  – The rule of thumb for concurrent mappers and reducers
    • Map: one map per file block
    • Reduce: a small multiple of the number of available TaskTrackers
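These knobs also live in hadoop-site.xml. A sketch using commonly cited parameter names from Hadoop of that era; the values are placeholders to adapt per cluster, not recommendations from the slides:

```xml
<configuration>
  <!-- Map-side sort buffer, in MB -->
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
  </property>
  <!-- Concurrent map/reduce slots per TaskTracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <!-- Total reduces for a job: a small multiple of the TaskTracker count -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>
  </property>
</configuration>
```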
Limitations of Hadoop
• HDFS
  – No reliable append yet
  – Files are immutable
• MapReduce
  – Basically row-oriented
  – Support for complicated computation is not strong
Reference
• [1] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters
• [2] Tom White. Hadoop: The Definitive Guide
• [3] Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu. The Performance of MapReduce: An In-depth Study
• [4] Luiz André Barroso, Urs Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
• [5] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google File System
Thank You!