TRANSCRIPT
Programming on Hadoop
Outline
• Different perspectives on Cloud Computing
• The Anatomy of a Data Center
• The Taxonomy of Computation
  – Computation-intensive
  – Data-intensive
• The Hadoop Eco-system
• Limitations of Hadoop
Cloud Computing
• From the user's perspective
  – A service that enables users to run their applications on the Internet
• From the service provider's perspective
  – A resource pool used to deliver cloud services through the Internet
  – The resource pool is hosted in an on-premise data center
  – What does the data center (DC) look like?
An Example of DC
• Google’s Data Center in 2009.
From Jeffrey Dean’s talk on WSDM2009
A Closer Look at DC – Overview
Figure is copied from [4]
A Closer Look at DC – Cooling
Figure is copied from [4]
A Closer Look at DC – Computing Resources
Figure is copied from [4]
The Commodity Server
• A commodity server is NOT a low-end server
  – Standard components vs. proprietary hardware
• A common configuration in 2008
  – Processor: 2 quad-core Intel Xeon 2.0GHz CPUs
  – Memory: 8 GB ECC RAM
  – Storage: 4 × 1TB SATA disks
  – Network: Gigabit Ethernet
Approaches to Deliver Service
• The dedicated approach
  – Serve each customer with dedicated computing resources
• The shared approach (multi-tenant architecture)
  – Serve customers with a shared resource pool
The Dedicated Approach
• Pros:
  – Easy to implement
  – Performance & security guarantees
• Cons:
  – Painful for customers to scale their applications
  – Poor resource utilization
The Shared Approach
• Pros:
  – No pain for customers to scale their applications
  – Better resource utilization
  – Better performance in some cases
  – Lower service cost per customer
• Cons:
  – Needs a complicated software layer
  – Performance isolation/tuning may be complicated
    • To achieve better performance, customers should be familiar with the software/hardware architecture to some degree
The Hadoop Eco-system
• A software infrastructure that delivers a DC as a service through the shared-resources approach
  – Customers can use Hadoop to develop/deploy certain data-intensive applications on the cloud
• We focus on the Hadoop core in this lecture
  – Hadoop == Hadoop-core from here on
Core: Hadoop Distributed File System (HDFS), MapReduce
Extensions: HBase, Chukwa, Hive, Pig
The Taxonomy of Computations
• Computation-intensive tasks
  – Small data (in-memory), lots of CPU cycles per data item processed
  – Examples: machine learning
• Data-intensive tasks
  – Large-volume data (on-disk), relatively few CPU cycles per data item processed
  – Examples: DBMS
The Data-intensive Tasks
• Streaming-oriented data access
  – Read/write a large portion of the dataset in a streaming (sequential) manner
  – Characteristics:
    • No seeks, high throughput
    • Optimized for a high data transfer rate
• Random-oriented data access
  – Read/write a small number of data items randomly located in the dataset
  – Characteristics:
    • Seek-oriented
    • Optimized for low-latency access to each data item
What Hadoop does & doesn’t
• Hadoop can
  – Perform high-throughput streaming data access
  – Perform limited low-latency random data access through HBase
  – Perform large-scale analysis through MapReduce
• Hadoop cannot
  – Perform transactions
  – Serve certain time-critical applications
Hadoop Quick Start
• Very simple
  – Download the Hadoop package from Apache
    • http://hadoop.apache.org/
  – Unpack it into a folder
  – Do some configuration in hadoop-site.xml
    • fs.default.name selects the default file system (e.g., HDFS)
    • mapred.job.tracker points to the JobTracker of the MapReduce cluster
  – Start
    • Format the file system only once (in a fresh installation)
      – bin/hadoop namenode -format
    • Launch the HDFS & MapReduce cluster
      – bin/start-all.sh
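The two properties mentioned above go into conf/hadoop-site.xml. A minimal sketch, assuming a single-machine setup; the host names and port numbers here are placeholders, not values from the slides:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Default file system; an hdfs:// URI points at the NameNode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- Host:port of the MapReduce JobTracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```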
The Launched HDFS cluster
The Launched MapReduce Cluster
The Hadoop Distributed Filesystem
• Wraps the DC as a resource pool and provides a set of APIs that let users read/write data from/into the DC sequentially
A Closer Look at the API
• Aha, writing "Hello World!"
  – bin/hadoop jar test.jar

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Main {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream fsOut = fs.create(new Path("testFile"));
    fsOut.writeBytes("Hello Hadoop");
    fsOut.close();
  }
}
A Closer Look at the API (cont.)
• Reading data from the HDFS
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Main {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream fsIn = fs.open(new Path("testFile"));
    byte[] buf = new byte[1024];
    int len = fsIn.read(buf);
    System.out.println(new String(buf, 0, len));
  }
}
Inside HDFS
• A single-NameNode, multiple-DataNodes architecture (see [5] for reference)
  – Chops each file into a set of fixed-size blocks and stores those data blocks on all available DataNodes
  – The NameNode hosts all file system metadata (file-block mapping, block locations, etc.) in memory
  – DataNodes host all file data for reading/writing
Inside HDFS – Architecture
• Figure is copied from http://hadoop.apache.org/common/docs/current/hdfs_design.html
Inside HDFS – Writing Data
Figure is copied from [2]
Inside HDFS – Reading Data
• What is the problem with reading/writing?
Figure is copied from [2]
The HDFS Cons
• Single reader/writer
  – Reads and writes a single block at a time
  – Only touches ONE DataNode
  – Data transfer rate == disk bandwidth of a SINGLE node
  – Too slow for a large file
    • Suppose disk bandwidth == 100MB/sec
    • Reading/writing a 1TB file requires ~3 hrs
  – How to fix it?
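The back-of-envelope number above is easy to check; a small sketch in plain Java (no Hadoop dependency, numbers taken from the slide):

```java
public class TransferTime {
    // Seconds needed to move `bytes` at `bandwidth` bytes/sec through one disk
    static double seconds(double bytes, double bandwidth) {
        return bytes / bandwidth;
    }

    public static void main(String[] args) {
        double oneTB = 1e12;           // 1 TB
        double diskBw = 100 * 1e6;     // 100 MB/sec
        double hours = seconds(oneTB, diskBw) / 3600.0;
        System.out.printf("Single disk: %.1f hours%n", hours);  // ~2.8 hours
        // With N readers working in parallel on disjoint blocks,
        // the time drops by roughly a factor of N:
        int n = 100;
        System.out.printf("%d readers: %.1f minutes%n",
                          n, seconds(oneTB, diskBw) / n / 60.0);
    }
}
```

This is exactly the motivation for the multiple readers/writers pattern on the next slides.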
Multiple Reader/Writers
• Read/write a large data set using multiple processes
  – Each process reads/writes a subset of the whole data set and materializes that subset as a file
  – A file collection makes up the whole data set
    • Typically, the file collection is stored in a directory named after the data set
Multiple Readers/Writers (cont.)
• Question – what is the proper number of readers and writers?
[Figure: Data set A is split into Sub-sets 1–4; Processes 1–4 each write one sub-set as files part-0001 … part-0004 under the directory /root/datasetA]
Multiple Readers/Writers (cont.)
• Reading/writing a large data set with multiple readers/writers and materializing it as a collection of files is a common pattern in HDFS
• But, too painful!
  – Invocation of multiple readers/writers in the cluster
  – Coordination of those readers/writers
  – Machine failures
  – …
• Rescue: MapReduce
The MapReduce System
• MapReduce is a programming model and its associated implementation for processing and generating large data sets [1]
• The computation performs key/value-oriented operations and consists of two functions
  – Map: transforms an input key/value pair into a set of intermediate key/value pairs
  – Reduce: merges intermediate key/value pairs with the same key and produces another key/value pair
The MapReduce Programming Model
• Map: (k0, v0) -> (k1, [v1])
• Reduce: (k1, [v1]) -> (k2, v2)
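The model can be illustrated without Hadoop at all. Below is a minimal word-count sketch in plain Java; the explicit shuffle step stands in for the grouping the framework performs, and all names here are illustrative, not Hadoop API:

```java
import java.util.*;

public class MiniMapReduce {
    // Map: (docId, text) -> [(word, 1)]
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : text.toLowerCase().split("\\s+"))
            out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // Shuffle: group intermediate values by key (done by the framework in Hadoop)
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce: (word, [1, 1, ...]) -> count
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> inter = map("doc1", "hello hadoop hello world");
        for (Map.Entry<String, List<Integer>> e : shuffle(inter).entrySet())
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        // prints: hadoop 1, hello 2, world 1
    }
}
```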
The System Architecture
• One JobTracker for job submission
• Multiple TaskTrackers for invocation of mappers or reducers
Figure is from Google Images
The Mapper Interface
• Mapper/Reducer are defined as generic Java interfaces in Hadoop
public interface Mapper&lt;K1, V1, K2, V2&gt; {
  void map(K1 key, V1 value, OutputCollector&lt;K2, V2&gt; output, Reporter reporter);
}

public interface Reducer&lt;K2, V2, K3, V3&gt; {
  void reduce(K2 key, Iterator&lt;V2&gt; values, OutputCollector&lt;K3, V3&gt; output, Reporter reporter);
}
The Data Types of MapReduce
• MapReduce makes no assumptions about the data type
  – It does not know what constitutes a key/value pair
• Users must decide on the appropriate input/output data types
  – The runtime data-interpreting pattern
  – Achieved by implementing two Hadoop interfaces
    • RecordReader&lt;K, V&gt; for parsing input key/value pairs
    • RecordWriter&lt;K, V&gt; for serializing output key/value pairs
The RecordReader/Writer Interface
interface RecordReader&lt;K, V&gt; {
  // Omit other functions
  boolean next(K key, V value);
}

interface RecordWriter&lt;K, V&gt; {
  // Omit other functions
  void write(K key, V value);
}
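To make the pattern concrete, here is a toy line-oriented reader in the same spirit: key = line number, value = line text. This is an illustrative plain-Java sketch, not Hadoop's actual LineRecordReader:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class LineReader {
    private final BufferedReader in;
    private long lineNo = 0;
    private String key;
    private String value;

    public LineReader(BufferedReader in) { this.in = in; }

    // Mirrors RecordReader.next(): advance to the next record, report success
    public boolean next() throws IOException {
        String line = in.readLine();
        if (line == null) return false;   // end of the split
        key = Long.toString(lineNo++);
        value = line;
        return true;
    }

    public String getKey() { return key; }
    public String getValue() { return value; }

    public static void main(String[] args) throws IOException {
        LineReader r = new LineReader(new BufferedReader(new StringReader("foo\nbar")));
        while (r.next())
            System.out.println(r.getKey() + " -> " + r.getValue());
        // prints: 0 -> foo, then 1 -> bar
    }
}
```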
The Overall Picture
• The data set is split into many parts
• Each part is processed by one mapper
• The intermediate results are processed by reducers
• Each reducer writes its results as a file

InputSplit-n -> RecordReader -> map -> shuffle/merge -> reduce -> RecordWriter -> part-000n
Performance Tuning
• A lot of factors…
• At the architecture level
  – Record parsing, map-side sorting, …, see [3]
  – Shuffling, see many research papers at VLDB, SIGMOD
• Parameter tuning
  – Memory buffers for mappers/reducers
  – The rule of thumb for concurrent mappers and reducers
    • Map: one map per file block
    • Reduce: a small multiple of the number of available TaskTrackers
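These knobs also live in hadoop-site.xml. A sketch using commonly cited parameter names from Hadoop of that era; the values are placeholders to adapt per cluster, not recommendations from the slides:

```xml
<configuration>
  <!-- Map-side sort buffer, in MB -->
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
  </property>
  <!-- Concurrent map/reduce slots per TaskTracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <!-- Total reduces for a job: a small multiple of the TaskTracker count -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>
  </property>
</configuration>
```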
Limitations of Hadoop
• HDFS
  – No reliable append yet
  – Files are immutable
• MapReduce
  – Basically row-oriented
  – Support for complicated computation is not strong
Reference
• [1] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters
• [2] Tom White. Hadoop: The Definitive Guide
• [3] Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu. The Performance of MapReduce: An In-depth Study
• [4] Luiz André Barroso, Urs Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
• [5] Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google File System
Thank You!