Large-Scale Data Processing / Cloud Computing
Lecture 5 – Hadoop Runtime

Peng Bo (彭波)
School of Electronics Engineering and Computer Science, Peking University (北京大学信息科学技术学院)

7/23/2013
http://net.pku.edu.cn/~course/cs402/

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Jimmy Lin, University of Maryland
SEWM Group


Course Grading

• Project cancelled
• 4 assignments
  – wordcount (not graded)
  – co-occurrence
  – index
  – pagerank
• 1 week
  – grace time: one day
  – 10% deducted for each day of delay (60% at most)


'wordcount': How does it work?


Hadoop Cluster

[Figure: Hadoop cluster. The namenode runs the namenode daemon, and the job submission node runs the jobtracker. Each of the three slave nodes runs a tasktracker and a datanode daemon on top of the Linux file system.]


Job Submission Process


Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).

Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.

Computes the input splits for the job. If the splits cannot be computed (because the input paths don’t exist, for example), the job is not submitted and an error is thrown to the MapReduce program.


Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker’s filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).

Tells the jobtracker that the job is ready for execution by calling submitJob() on JobTracker (step 4).
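The client-side checks described in these steps can be illustrated with a small sketch. It is not Hadoop's JobClient code: the class SubmissionSketch, its methods, and the Split type are hypothetical names, the filesystem is replaced by an in-memory set of directories and a map of file sizes, and the split computation is plain fixed-size chunking rather than Hadoop's actual split logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the client-side checks; not Hadoop's JobClient.
class SubmissionSketch {

    // One contiguous byte range of an input file (hypothetical type).
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }
    }

    // Rejects the job if the output directory is unspecified or already exists.
    static void checkOutputSpec(String outputDir, Set<String> existingDirs) {
        if (outputDir == null) {
            throw new IllegalStateException("Output directory not set");
        }
        if (existingDirs.contains(outputDir)) {
            throw new IllegalStateException("Output directory already exists: " + outputDir);
        }
    }

    // Cuts each input file into fixed-size ranges; fails if an input path is missing.
    static List<Split> computeSplits(List<String> inputPaths,
                                     Map<String, Long> fileSizes,
                                     long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (String path : inputPaths) {
            Long size = fileSizes.get(path);
            if (size == null) {
                throw new IllegalStateException("Input path does not exist: " + path);
            }
            for (long off = 0; off < size; off += splitSize) {
                splits.add(new Split(path, off, Math.min(splitSize, size - off)));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed toy filesystem: one existing directory and one 250-byte file.
        Set<String> dirs = Set.of("/user/demo/in");
        Map<String, Long> files = Map.of("/user/demo/in/part-0", 250L);

        checkOutputSpec("/user/demo/out", dirs); // passes: directory does not exist yet
        List<Split> splits = computeSplits(List.of("/user/demo/in/part-0"), files, 100L);
        System.out.println(splits.size() + " splits"); // prints "3 splits"
    }
}
```

In this sketch both failures surface as exceptions before anything is copied or submitted, which mirrors the order of the steps: validation first, resource copying and submitJob() only afterwards.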


InputFormat Class Hierarchy


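[Figure: MapReduce data flow with combiners and partitioners]

Input: (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)

map (4 mappers) emits: {a:1, b:2}  {c:3, c:6}  {a:5, c:2}  {b:7, c:8}

combine (one per mapper) emits: {a:1, b:2}  {c:9}  {a:5, c:2}  {b:7, c:8}

partition (one per mapper)

Shuffle and Sort: aggregate values by keys

a: [1, 5]  b: [2, 7]  c: [2, 9, 8]  (without combiners: c: [2, 3, 6, 8])

reduce (3 reducers) emits: (r1, s1) (r2, s2) (r3, s3)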
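The data flow in this figure can be replayed in memory. The sketch below is plain Java, not Hadoop code, and it assumes that both the combiner and the reducer sum values, which is consistent with c:3 and c:6 becoming c:9 in the figure. The class DataFlowSketch and its methods are hypothetical names, and partitioning is collapsed into a single sorted map of keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory replay of the figure's map -> combine -> shuffle -> reduce flow.
class DataFlowSketch {

    // Combiner: sums values per key within one mapper's output.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return combined;
    }

    // Shuffle and sort: groups values by key across all mappers (TreeMap keeps keys sorted).
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> combinedOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> out : combinedOutputs) {
            for (Map.Entry<String, Integer> kv : out.entrySet()) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        return grouped;
    }

    // Reducer: sums the grouped values of each key (assumed reduce function).
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            result.put(e.getKey(), sum);
        }
        return result;
    }

    // Runs the whole flow on the four mapper outputs shown in the figure.
    static Map<String, Integer> run() {
        List<List<Map.Entry<String, Integer>>> mapOutputs = List.of(
                List.of(Map.entry("a", 1), Map.entry("b", 2)),
                List.of(Map.entry("c", 3), Map.entry("c", 6)),
                List.of(Map.entry("a", 5), Map.entry("c", 2)),
                List.of(Map.entry("b", 7), Map.entry("c", 8)));
        List<Map<String, Integer>> combined = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> out : mapOutputs) {
            combined.add(combine(out));
        }
        return reduce(shuffle(combined));
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints {a=6, b=9, c=19}
    }
}
```

Because summation is associative and commutative, the final totals are the same with or without the combiner; only the number of values crossing the shuffle changes (three values for c instead of four).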


Serialization

Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.

Deserialization is the reverse process of turning a byte stream back into a series of structured objects.

In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs).


The Writable Interface

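import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public interface Writable {
    // Serialize this object's fields to out.
    void write(DataOutput out) throws IOException;
    // Deserialize this object's fields from in.
    void readFields(DataInput in) throws IOException;
}

// A Writable which is also Comparable.
public interface WritableComparable<T> extends Writable, Comparable<T> {
    // Inherited from Comparable<T>: int compareTo(T o);
}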
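To show how write() and readFields() pair up, here is a minimal round-trip sketch. It declares a local Writable interface with the same two methods so that it compiles without Hadoop on the classpath; the WordPair type and the roundTrip() helper are hypothetical names chosen for illustration and are not part of Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in mirroring the interface on the slide, so the sketch runs without Hadoop.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical key type: a pair of words, ordered by left word, then right word.
class WordPair implements Writable, Comparable<WordPair> {
    String left = "";
    String right = "";

    WordPair() {} // no-argument constructor so an empty instance can be filled by readFields()

    WordPair(String left, String right) {
        this.left = left;
        this.right = right;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(left);
        out.writeUTF(right);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read in exactly the order they were written.
        left = in.readUTF();
        right = in.readUTF();
    }

    @Override
    public int compareTo(WordPair other) {
        int c = left.compareTo(other.left);
        return c != 0 ? c : right.compareTo(other.right);
    }

    // Serializes p to a byte array, then deserializes it into a fresh object.
    static WordPair roundTrip(WordPair p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            WordPair copy = new WordPair();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordPair copy = roundTrip(new WordPair("hadoop", "runtime"));
        System.out.println(copy.left + " " + copy.right); // prints "hadoop runtime"
    }
}
```

A key type like this needs both halves: the Writable methods so it can cross the network during the shuffle, and compareTo() so keys can be sorted before they reach the reducer.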


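Shuffle and Sort

[Figure: shuffle and sort between a Mapper and a Reducer. Stages shown: circular buffer (in memory), spills (on disk), merged spills (on disk), intermediate files (on disk). Two points in the pipeline are annotated "Combiner" and "Combiner?". Arrows also lead to other reducers and from other mappers.]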
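The buffer, spill, and merge stages can be sketched as an external sort in miniature. This is a simplified illustration and not Hadoop's implementation: SpillSketch and its methods are hypothetical names, the "disk" spills are in-memory lists, the buffer spills only when completely full, and partitions and combiners are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Buffer keys in memory, spill sorted runs when full, then merge the runs.
class SpillSketch {
    private final int capacity;
    private final List<String> buffer = new ArrayList<>();        // stand-in for the in-memory buffer
    private final List<List<String>> spills = new ArrayList<>();  // stand-in for spill files on disk

    SpillSketch(int capacity) {
        this.capacity = capacity;
    }

    // Called once per map output key; spills when the buffer fills up.
    void collect(String key) {
        buffer.add(key);
        if (buffer.size() >= capacity) {
            spill();
        }
    }

    // Sorts the buffered keys and stores them as one spill run.
    private void spill() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> run = new ArrayList<>(buffer);
        Collections.sort(run);
        spills.add(run);
        buffer.clear();
    }

    // Flushes the remaining keys, then k-way merges all sorted runs into one sorted list.
    List<String> close() {
        spill();
        // Each queue entry is {spill index, position}, ordered by the key it points at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
                (x, y) -> spills.get(x[0]).get(x[1]).compareTo(spills.get(y[0]).get(y[1])));
        for (int i = 0; i < spills.size(); i++) {
            heads.add(new int[] {i, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            List<String> run = spills.get(head[0]);
            merged.add(run.get(head[1]));
            if (head[1] + 1 < run.size()) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    int spillCount() {
        return spills.size();
    }

    // Seven keys through a buffer of size 3: two full spills plus one final partial spill.
    static List<String> demo() {
        SpillSketch sketch = new SpillSketch(3);
        for (String key : new String[] {"c", "a", "b", "c", "a", "c", "b"}) {
            sketch.collect(key);
        }
        return sketch.close();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [a, a, b, b, c, c, c]
    }
}
```

Since every spill is already sorted, the merge only ever compares the current head of each run, so the merged output is sorted without re-sorting all keys at once.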


Partitioner

public abstract class Partitioner<KEY, VALUE> {
    // Returns the partition (reducer) index for a key, in the range [0, numPartitions).
    public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
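As an illustration of how getPartition() might be implemented, the sketch below assigns keys to reducers by hash modulo the number of partitions, the same idea as Hadoop's default HashPartitioner. It declares a local copy of the abstract class so it runs without Hadoop; HashSketchPartitioner is a hypothetical name.

```java
// Local stand-in mirroring the abstract class on the slide, so the sketch runs without Hadoop.
abstract class Partitioner<KEY, VALUE> {
    public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}

// Hypothetical partitioner: picks a reducer from the key's hash code.
class HashSketchPartitioner<K, V> extends Partitioner<K, V> {

    @Override
    public int getPartition(K key, V value, int numPartitions) {
        // Clear the sign bit so the remainder is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        Partitioner<String, Integer> partitioner = new HashSketchPartitioner<>();
        for (String key : new String[] {"a", "b", "c"}) {
            System.out.println(key + " -> reducer " + partitioner.getPartition(key, 1, 3));
        }
        // prints:
        // a -> reducer 1
        // b -> reducer 2
        // c -> reducer 0
    }
}
```

The value argument is ignored here, so all records with the same key land on the same reducer, which is what lets the shuffle aggregate values by key.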


Q&A