Large-Scale Data Processing / Cloud Computing
Lecture 5 – Hadoop Runtime
Peng Bo (彭波)
School of Electronics Engineering and Computer Science, Peking University
7/23/2013
http://net.pku.edu.cn/~course/cs402/
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Jimmy Lin, University of Maryland
SEWM Group
Course grading
• Project cancelled
• 4 assignments
  – wordcount (not graded)
  – co-occurrence
  – index
  – pagerank
• 1 week
  – grace time: one day
  – 10% for each day of delay (60% at most)
'wordcount'
How does it work?
Hadoop Cluster (figure)
• namenode: namenode daemon
• job submission node: jobtracker
• slave node (three shown, …): tasktracker, datanode daemon, Linux file system
Job submission process
• Asks the jobtracker for a new job ID by calling getNewJobId() on JobTracker (step 2).
• Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
• Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
• Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).
• Tells the jobtracker that the job is ready for execution by calling submitJob() on JobTracker (step 4).
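The split computation mentioned in the steps above can be pictured with a minimal sketch. It assumes a single input file cut into fixed-size byte ranges. The class `SplitSketch`, the method `computeSplits` and the sizes used are hypothetical names for illustration only; Hadoop's own input formats take more into account (block locations and configuration settings, for instance).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: cut one file of fileLength bytes into splits of at most splitSize bytes.
class SplitSketch {
    // Each split is {offset, length}; one map task would process one split.
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        if (fileLength < 0 || splitSize <= 0) {
            // Mirrors the idea that a job is rejected when splits cannot be computed.
            throw new IllegalArgumentException("cannot compute splits");
        }
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 250-byte file with 100-byte splits yields three splits; the last one is shorter.
        for (long[] s : computeSplits(250, 100)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

Failing early, before anything is copied to the cluster, matches the order of the steps: the splits are validated at submission time, so a bad input path never reaches the jobtracker.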
InputFormat Class Hierarchy
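MapReduce data flow (figure)
• input: (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)
• map (four mappers), output: [a 1, b 2] [c 3, c 6] [a 5, c 2] [b 7, c 8]
• combine (one per mapper), output: [a 1, b 2] [c 9] [a 5, c 2] [b 7, c 8]
• partition (one per mapper)
• Shuffle and Sort: aggregate values by keys
• reduce (three reducers), input: a [1, 5], b [2, 7], c [2, 9, 8] (without the combiner: c [2, 3, 6, 8])
• reduce output: (r1, s1) (r2, s2) (r3, s3)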
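The data flow in this figure can be replayed in memory with a small sketch. It uses the figure's map outputs (a 1, b 2; c 3, c 6; a 5, c 2; b 7, c 8) and assumes that both the combiner and the reducer sum values, which is consistent with c 3 and c 6 becoming c 9 but is otherwise an assumption. `DataFlowSketch`, `Pair` and all method names are hypothetical; none of this is Hadoop code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of the figure's data flow; not Hadoop code.
class DataFlowSketch {
    static final class Pair {
        final String key;
        final int value;
        Pair(String key, int value) { this.key = key; this.value = value; }
    }

    // The map outputs shown in the figure, one inner list per mapper.
    static List<List<Pair>> figureMapOutputs() {
        return Arrays.asList(
            Arrays.asList(new Pair("a", 1), new Pair("b", 2)),
            Arrays.asList(new Pair("c", 3), new Pair("c", 6)),
            Arrays.asList(new Pair("a", 5), new Pair("c", 2)),
            Arrays.asList(new Pair("b", 7), new Pair("c", 8)));
    }

    // Combine: sum the values of equal keys within a single mapper's output.
    static Map<String, Integer> combine(List<Pair> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Pair p : mapOutput) combined.merge(p.key, p.value, Integer::sum);
        return combined;
    }

    // Partition: choose the reducer that receives a key.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Combine, partition, shuffle and sort, then reduce by summing each key's values.
    static Map<String, Integer> run(List<List<Pair>> mapOutputs, int numReducers) {
        // Shuffle and sort: per reducer, group values by key in sorted key order.
        List<Map<String, List<Integer>>> reducerInputs = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) reducerInputs.add(new TreeMap<>());
        for (List<Pair> mapOutput : mapOutputs) {
            for (Map.Entry<String, Integer> e : combine(mapOutput).entrySet()) {
                reducerInputs.get(partition(e.getKey(), numReducers))
                        .computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                        .add(e.getValue());
            }
        }
        // Reduce: every key lives in exactly one reducer, so outputs can be merged safely.
        Map<String, Integer> result = new TreeMap<>();
        for (Map<String, List<Integer>> input : reducerInputs) {
            for (Map.Entry<String, List<Integer>> group : input.entrySet()) {
                int sum = 0;
                for (int v : group.getValue()) sum += v;
                result.put(group.getKey(), sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(figureMapOutputs(), 3)); // prints {a=6, b=9, c=19}
    }
}
```

Because summation is associative and commutative, running the combiner changes how many values cross the network (c arrives as 2, 9, 8 instead of 2, 3, 6, 8) but not the final totals.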
Serialization
Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.
Deserialization is the reverse process of turning a byte stream back into a series of structured objects.
In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs).
The Writable Interface
public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}
A Writable which is also Comparable:
public interface WritableComparable<T> extends Writable, Comparable<T> { }
Implementations supply public int compareTo(T o), inherited from Comparable<T>.
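A minimal sketch of a custom type that follows this contract is given here. So that it compiles without Hadoop on the classpath, it declares a local stand-in `Writable` interface copied from the slide. The `WordCountPair` class, its fields and the `roundTrip` helper are hypothetical and serve only to show a write/readFields round trip through a byte stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-in copied from the slide so the sketch compiles without Hadoop.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical key type: a (word, count) pair ordered by word, then by count.
class WordCountPair implements Writable, Comparable<WordCountPair> {
    String word = "";
    int count;

    WordCountPair() {} // no-argument constructor so an empty instance can be filled by readFields
    WordCountPair(String word, int count) { this.word = word; this.count = count; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read in exactly the order they were written.
        word = in.readUTF();
        count = in.readInt();
    }

    @Override
    public int compareTo(WordCountPair other) {
        int byWord = word.compareTo(other.word);
        return byWord != 0 ? byWord : Integer.compare(count, other.count);
    }

    // Serialize to a byte array, then deserialize into a fresh object.
    static WordCountPair roundTrip(WordCountPair original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            WordCountPair copy = new WordCountPair();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordCountPair copy = roundTrip(new WordCountPair("hadoop", 3));
        System.out.println(copy.word + " " + copy.count);
    }
}
```

The ordering defined by `compareTo` is what lets a type be used as a key, since keys are sorted between the map and reduce phases.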
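Shuffle and Sort (figure)
• Map side: Mapper → circular buffer (in memory) → spills (on disk) → merged spills (on disk), with the Combiner applied along the way
• Reduce side: intermediate files (on disk), fetched from this mapper and other mappers → Reducer (Combiner?)
• Other reducers fetch their own partitions from the merged spills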
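The step from spills to merged spills is a merge of runs that are each already sorted. A minimal sketch of such a k-way merge is shown here, assuming each spill is simply a sorted list of string keys. `SpillMergeSketch` and `merge` are hypothetical names, and real spill files hold serialized records organized by partition rather than in-memory strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: each "spill" is an already-sorted run of keys.
class SpillMergeSketch {
    // k-way merge of sorted runs into one sorted run, using a min-heap of cursors.
    static List<String> merge(List<List<String>> spills) {
        // Each heap entry is {spill index, position within that spill},
        // ordered by the key the cursor currently points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (x, y) -> spills.get(x[0]).get(x[1]).compareTo(spills.get(y[0]).get(y[1])));
        for (int i = 0; i < spills.size(); i++) {
            if (!spills.get(i).isEmpty()) heap.add(new int[] {i, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();
            List<String> spill = spills.get(cursor[0]);
            merged.add(spill.get(cursor[1]));
            // Advance the cursor within its spill if more keys remain.
            if (cursor[1] + 1 < spill.size()) heap.add(new int[] {cursor[0], cursor[1] + 1});
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> spills = Arrays.asList(
            Arrays.asList("a", "c"), Arrays.asList("b", "c"), Arrays.asList("a", "d"));
        System.out.println(merge(spills)); // prints [a, a, b, c, c, d]
    }
}
```

Equal keys end up adjacent in the merged run, which is the point at which a combiner could collapse them before the data leaves the map side.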
Partitioner
public abstract class Partitioner<KEY, VALUE> {
    public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
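A minimal sketch of one possible implementation is hash partitioning, where equal keys always land in the same partition. To stay compilable without Hadoop, the block declares a local stand-in `Partitioner` class mirroring the slide. `HashPartitionerSketch` is a hypothetical name, and the formula is the usual hash-modulo scheme rather than anything specific to this lecture.

```java
// Local stand-in mirroring the slide's class so the sketch compiles without Hadoop.
abstract class Partitioner<KEY, VALUE> {
    public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}

// Hash partitioning: the partition depends only on the key, never on the value.
class HashPartitionerSketch<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numPartitions) {
        // Mask the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        Partitioner<String, Integer> p = new HashPartitionerSketch<>();
        for (String key : new String[] {"a", "b", "c", "a"}) {
            System.out.println(key + " -> partition " + p.getPartition(key, 1, 3));
        }
    }
}
```

Ignoring the value keeps all records for one key on a single reducer. A custom partitioner would override the same method when keys need to be grouped by something other than their hash.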
Q&A