
Large-Scale Data Processing / Cloud Computing, Lecture 5 – Hadoop Runtime

彭波 (Peng Bo), School of Information Science and Technology, Peking University

7/23/2013, http://net.pku.edu.cn/~course/cs402/

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Jimmy Lin, University of Maryland / SEWM Group

Course grading
• Project: cancelled
• 4 assignments
  – wordcount (not graded)
  – co-occurrence
  – index
  – pagerank
• 1 week per assignment
  – grace time: one day
  – 10% deducted for each day of delay (60% at most)

'wordcount': How does it work?

Hadoop Cluster

[Diagram: architecture of a Hadoop cluster. The job submission node runs the jobtracker, and the namenode runs the namenode daemon. Each slave node runs a tasktracker and a datanode daemon on top of the local Linux file system.]

Job submission process

Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).

Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.

Computes the input splits for the job. If the splits cannot be computed (because the input paths don’t exist, for example), the job is not submitted and an error is thrown to the MapReduce program.

Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).

Tells the jobtracker that the job is ready for execution by calling submitJob() on JobTracker (step 4).
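To make this flow concrete, here is a minimal driver-side sketch using the classic org.apache.hadoop.mapred API of Hadoop 1.x; the class name SubmitExample and the input/output paths are illustrative only, and the job relies on the default identity mapper and reducer just to exercise the submission path described above.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitExample.class);
    conf.setJobName("submit-demo");

    // The output directory must not already exist, otherwise the
    // "check output specification" step fails and the job is never submitted.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // runJob() goes through the steps above: it obtains a new job ID from the
    // jobtracker, computes the input splits, copies the job JAR, configuration
    // and splits to the jobtracker's filesystem, and then calls submitJob().
    RunningJob job = JobClient.runJob(conf);
    System.out.println("Job " + job.getID()
        + " finished, successful = " + job.isSuccessful());
  }
}

Running it as, for example, hadoop jar myjob.jar SubmitExample /input /output triggers exactly the sequence of steps listed above.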

InputFormat Class Hierarchy
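The hierarchy itself appears as a diagram in the original slides. As a rough sketch of its root (paraphrased from the Hadoop 1.x org.apache.hadoop.mapred API), every input format answers two questions: how to split the input, and how to read records from a split. File-based formats such as TextInputFormat, KeyValueTextInputFormat and SequenceFileInputFormat all derive from the abstract FileInputFormat base class.

import java.io.IOException;

import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Root of the hierarchy (old mapred API, paraphrased):
public interface InputFormat<K, V> {

  // Logically split the job's input into pieces, e.g. one split per HDFS block.
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

  // Create a reader that turns one split into a stream of (key, value) records.
  RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
      throws IOException;
}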

[Diagram: MapReduce data flow. Each mapper emits (key, value) pairs; a combiner aggregates them locally; a partitioner assigns each key to a reducer. Shuffle and sort: aggregate values by keys. Each reducer then processes the grouped values for its keys and writes the final output pairs.]
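The diagram traces the usual word-count data flow. A minimal sketch of the corresponding mapper and reducer, in the classic org.apache.hadoop.mapred API (the class names Map and Reduce follow the conventional WordCount example):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // map: (offset, line) -> (word, 1) for each word in the line
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, count)
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}

The same Reduce class is typically registered as the combiner as well, since summing partial counts is associative and commutative.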

Serialization

Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.

Deserialization is the reverse process of turning a byte stream back into a series of structured objects.

In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs).
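As a small illustration, the hypothetical helpers below (serialize and deserialize are our names, not a Hadoop API) turn a Writable object, the interface introduced in the next section, into a byte array and back:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class WritableSerialization {

  // Serialization: structured object -> byte stream
  public static byte[] serialize(Writable writable) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream dataOut = new DataOutputStream(out);
    writable.write(dataOut);
    dataOut.close();
    return out.toByteArray();
  }

  // Deserialization: byte stream -> structured object (filled in place)
  public static void deserialize(Writable writable, byte[] bytes) throws IOException {
    ByteArrayInputStream in = new ByteArrayInputStream(bytes);
    DataInputStream dataIn = new DataInputStream(in);
    writable.readFields(dataIn);
    dataIn.close();
  }

  public static void main(String[] args) throws IOException {
    IntWritable original = new IntWritable(42);
    byte[] bytes = serialize(original);   // 4 bytes for an IntWritable

    IntWritable copy = new IntWritable();
    deserialize(copy, bytes);
    System.out.println(copy.get());       // prints 42
  }
}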

The Writable Interface

public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

public interface WritableComparable<T> extends Writable, Comparable<T> { }

A Writable which is also Comparable: implementations provide

  public int compareTo(T other)

which defines the sort order of keys.
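A minimal sketch of a custom key type implementing WritableComparable; the IntPair name and its two fields are illustrative, not part of Hadoop:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A hypothetical composite key: two ints, ordered by first and then second.
public class IntPair implements WritableComparable<IntPair> {
  private int first;
  private int second;

  public IntPair() {}                        // no-arg constructor needed for deserialization

  public IntPair(int first, int second) {
    this.first = first;
    this.second = second;
  }

  @Override
  public void write(DataOutput out) throws IOException {    // serialization
    out.writeInt(first);
    out.writeInt(second);
  }

  @Override
  public void readFields(DataInput in) throws IOException { // deserialization
    first = in.readInt();
    second = in.readInt();
  }

  @Override
  public int compareTo(IntPair other) {      // used when sorting keys
    if (first != other.first) {
      return Integer.compare(first, other.first);
    }
    return Integer.compare(second, other.second);
  }

  @Override
  public int hashCode() {                    // used by the default HashPartitioner
    return first * 163 + second;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof IntPair)) return false;
    IntPair other = (IntPair) o;
    return first == other.first && second == other.second;
  }
}

compareTo defines the order in which keys reach the reducer, while hashCode feeds the default partitioner, so equal keys land in the same partition.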

Shuffle and Sort

[Diagram: map side and reduce side of the shuffle. Map output is collected in a circular buffer in memory; when it fills, it is spilled to disk, and the spills are merged into a single partitioned, sorted set of intermediate files on disk. Reducers fetch their partitions from this mapper and from other mappers, while other reducers fetch theirs in the same way. The slide asks where the Combiner can run: it may be applied during the spill and merge steps.]
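The buffer-and-spill behaviour sketched above is tunable. A sketch of the relevant Hadoop 1.x job properties follows; the exact property names and defaults are version-dependent, and the values shown are only the commonly cited 1.x defaults.

import org.apache.hadoop.mapred.JobConf;

public class ShuffleTuning {
  public static void main(String[] args) {
    JobConf conf = new JobConf();

    // Size in MB of the in-memory circular buffer that collects map output
    conf.setInt("io.sort.mb", 100);

    // When the buffer is this full, a background thread starts spilling to disk
    conf.setFloat("io.sort.spill.percent", 0.80f);

    // How many spill files are merged at once into the final map output
    conf.setInt("io.sort.factor", 10);
  }
}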

Partitioner

public abstract class Partitioner<KEY, VALUE> {
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
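A minimal sketch of a custom Partitioner against the abstract class above (new org.apache.hadoop.mapreduce API, matching the class shown; the WordPartitioner name is ours, and the body mirrors what the default HashPartitioner does):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// All keys with the same hash go to the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask off the sign bit so the result is always in [0, numPartitions)
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be attached to a job with job.setPartitionerClass(WordPartitioner.class).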

Q&A
