
Large-Scale Data Processing / Cloud Computing, Lecture 5 – Hadoop Runtime

彭波 (Peng Bo), School of Information Science and Technology, Peking University

7/23/2013, http://net.pku.edu.cn/~course/cs402/

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Jimmy Lin, University of Maryland / SEWM Group

Course grading
• Project: cancelled
• 4 assignments
  – wordcount (not graded)
  – co-occurrence
  – index
  – pagerank
• 1 week per assignment
  – grace time: one day
  – 10% deducted for each day of delay (60% at most)

'wordcount': How does it work?

Hadoop Cluster

[Diagram: architecture of a Hadoop cluster. The job submission node runs the jobtracker, and the namenode runs the namenode daemon. Each slave node runs a tasktracker and a datanode daemon on top of the local Linux file system.]

Job submission process

Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).

Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.

Computes the input splits for the job. If the splits cannot be computed (because the input paths don’t exist, for example), the job is not submitted and an error is thrown to the MapReduce program.

Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).

Tells the jobtracker that the job is ready for execution by calling submitJob() on JobTracker (step 4).
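To make this flow concrete, here is a minimal driver-side sketch using the classic org.apache.hadoop.mapred API of Hadoop 1.x; the class name SubmitExample and the input/output paths are illustrative only, and the job relies on the default identity mapper and reducer just to exercise the submission path described above.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitExample.class);
    conf.setJobName("submit-demo");

    // The output directory must not already exist, otherwise the
    // "check output specification" step fails and the job is never submitted.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // runJob() goes through the steps above: it obtains a new job ID from the
    // jobtracker, computes the input splits, copies the job JAR, configuration
    // and splits to the jobtracker's filesystem, and then calls submitJob().
    RunningJob job = JobClient.runJob(conf);
    System.out.println("Job " + job.getID()
        + " finished, successful = " + job.isSuccessful());
  }
}

Running it as, for example, hadoop jar myjob.jar SubmitExample /input /output triggers exactly the sequence of steps listed above.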

InputFormat Class Hierarchy
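The hierarchy itself appears as a diagram in the original slides. As a rough sketch of its root (paraphrased from the Hadoop 1.x org.apache.hadoop.mapred API), every input format answers two questions: how to split the input, and how to read records from a split. File-based formats such as TextInputFormat, KeyValueTextInputFormat and SequenceFileInputFormat all derive from the abstract FileInputFormat base class.

import java.io.IOException;

import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Root of the hierarchy (old mapred API, paraphrased):
public interface InputFormat<K, V> {

  // Logically split the job's input into pieces, e.g. one split per HDFS block.
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

  // Create a reader that turns one split into a stream of (key, value) records.
  RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
      throws IOException;
}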

[Diagram: MapReduce data flow. Each mapper emits (key, value) pairs; a combiner aggregates them locally; a partitioner assigns each key to a reducer. Shuffle and sort: aggregate values by keys. Each reducer then processes the grouped values for its keys and writes the final output pairs.]
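The diagram traces the usual word-count data flow. A minimal sketch of the corresponding mapper and reducer, in the classic org.apache.hadoop.mapred API (the class names Map and Reduce follow the conventional WordCount example):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // map: (offset, line) -> (word, 1) for each word in the line
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, count)
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}

The same Reduce class is typically registered as the combiner as well, since summing partial counts is associative and commutative.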

Serialization

Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.

Deserialization is the reverse process of turning a byte stream back into a series of structured objects.

In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs).
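As a small illustration, the hypothetical helpers below (serialize and deserialize are our names, not a Hadoop API) turn a Writable object, the interface introduced in the next section, into a byte array and back:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class WritableSerialization {

  // Serialization: structured object -> byte stream
  public static byte[] serialize(Writable writable) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream dataOut = new DataOutputStream(out);
    writable.write(dataOut);
    dataOut.close();
    return out.toByteArray();
  }

  // Deserialization: byte stream -> structured object (filled in place)
  public static void deserialize(Writable writable, byte[] bytes) throws IOException {
    ByteArrayInputStream in = new ByteArrayInputStream(bytes);
    DataInputStream dataIn = new DataInputStream(in);
    writable.readFields(dataIn);
    dataIn.close();
  }

  public static void main(String[] args) throws IOException {
    IntWritable original = new IntWritable(42);
    byte[] bytes = serialize(original);   // 4 bytes for an IntWritable

    IntWritable copy = new IntWritable();
    deserialize(copy, bytes);
    System.out.println(copy.get());       // prints 42
  }
}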

The Writable Interface

public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

public interface WritableComparable<T> extends Writable, Comparable<T> { }

A Writable which is also Comparable: implementations provide

  public int compareTo(T other)

which defines the sort order of keys.
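A minimal sketch of a custom key type implementing WritableComparable; the IntPair name and its two fields are illustrative, not part of Hadoop:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A hypothetical composite key: two ints, ordered by first and then second.
public class IntPair implements WritableComparable<IntPair> {
  private int first;
  private int second;

  public IntPair() {}                        // no-arg constructor needed for deserialization

  public IntPair(int first, int second) {
    this.first = first;
    this.second = second;
  }

  @Override
  public void write(DataOutput out) throws IOException {    // serialization
    out.writeInt(first);
    out.writeInt(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException { // deserialization
    first = in.readInt();
    second = in.readInt();
  }

  @Override
  public int compareTo(IntPair other) {      // used when sorting keys
    if (first != other.first) {
      return Integer.compare(first, other.first);
    }
    return Integer.compare(second, other.second);
  }

  @Override
  public int hashCode() {                    // used by the default HashPartitioner
    return first * 163 + second;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof IntPair)) return false;
    IntPair other = (IntPair) o;
    return first == other.first && second == other.second;
  }
}

compareTo defines the order in which keys reach the reducer, while hashCode feeds the default partitioner, so equal keys land in the same partition.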

Shuffle and Sort

[Diagram: map side and reduce side of the shuffle. Map output is collected in a circular buffer in memory; when it fills, it is spilled to disk, and the spills are merged into a single partitioned, sorted set of intermediate files on disk. Reducers fetch their partitions from this mapper and from other mappers, while other reducers fetch theirs in the same way. The slide asks where the Combiner can run: it may be applied during the spill and merge steps.]
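The buffer-and-spill behaviour sketched above is tunable. A sketch of the relevant Hadoop 1.x job properties follows; the exact property names and defaults are version-dependent, and the values shown are only the commonly cited 1.x defaults.

import org.apache.hadoop.mapred.JobConf;

public class ShuffleTuning {
  public static void main(String[] args) {
    JobConf conf = new JobConf();

    // Size in MB of the in-memory circular buffer that collects map output
    conf.setInt("io.sort.mb", 100);

    // When the buffer is this full, a background thread starts spilling to disk
    conf.setFloat("io.sort.spill.percent", 0.80f);

    // How many spill files are merged at once into the final map output
    conf.setInt("io.sort.factor", 10);
  }
}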

Partitioner

public abstract class Partitioner<KEY, VALUE> {
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
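A minimal sketch of a custom Partitioner against the abstract class above (new org.apache.hadoop.mapreduce API, matching the class shown; the WordPartitioner name is ours, and the body mirrors what the default HashPartitioner does):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// All keys with the same hash go to the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask off the sign bit so the result is always in [0, numPartitions)
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be attached to a job with job.setPartitionerClass(WordPartitioner.class).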

Q&A
