Team 3: Xiaokui Shu, Ron Cohen. CS5604 at Virginia Tech, December 6, 2010
Contents
Introduction: Hadoop, MapReduce
Working with Hadoop: environment, MapReduce programming
Summary
Introduction :: Hadoop
Hadoop is a software framework that users program against, like a super-library.
It targets distributed applications and ships built-in solutions; those solutions depend on the framework.
It was inspired by Google's MapReduce and Google File System (GFS) papers.
Introduction :: Hadoop
Who uses Hadoop?
A9.com (Amazon): Amazon's product search indices
Adobe: 30 nodes running HDFS, Hadoop, and HBase
Baidu: handles about 3000 TB per week
Facebook: stores copies of internal log and dimension data sources
Also Last.fm, LinkedIn, IBM, Yahoo!, Google...
Introduction :: Hadoop
Subprojects: Hadoop Common, HDFS, MapReduce, ZooKeeper
Introduction :: Hadoop :: IR
Connections to the IR book:
Ch. 4, Index construction: distributed indexing (Section 4.4)
Ch. 20, Web crawling and indexes: distributed crawler (20.2), distributed indexing (20.3)
Introduction :: MapReduce
MapReduce is a software framework for distributed computing, introduced by Google.
It targets massive amounts of data with simple processing requirements, and is portable across a variety of platforms: clusters, CMP/SMP machines, and GPGPUs.
Introduction :: MapReduce
(Figure cited from "MapReduce: Simplified Data Processing on Large Clusters")
Introduction :: MapReduce
Map: map(k1, v1) -> list(k2, v2)
Reduce: reduce(k2, list(v2)) -> list(v3)
Hadoop MapReduce data flow:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
Introduction :: MapReduce :: Ex Source
$ cat file01
Hello World Bye World
$ cat file02
Hello Hadoop Goodbye Hadoop
Introduction :: MapReduce :: Ex Map Output
For file01:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
For file02:
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
Introduction :: MapReduce :: Ex Reduce Output
<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
Introduction :: MapReduce
More input means more mappers; an optional combiner function runs after map.
More reducers mean a partition function runs before reduce.
As a programmer, you focus only on Map & Reduce.
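The WordCount flow above (map each file to `<word, 1>` pairs, then sum per key) can be sketched in plain Java with no Hadoop installation; `WordCountFlow`, `mapPhase`, and `reducePhase` are illustrative names, not Hadoop API:

```java
import java.util.*;

// Plain-Java sketch of the WordCount data flow; no Hadoop needed.
public class WordCountFlow {
    // Map: document contents -> list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> mapPhase(String document) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : document.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    // Reduce: sum partial counts per word; the same logic serves as a combiner
    static Map<String, Integer> reducePhase(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by key, like reducer output
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.addAll(mapPhase("Hello World Bye World"));       // file01
        pairs.addAll(mapPhase("Hello Hadoop Goodbye Hadoop")); // file02
        System.out.println(reducePhase(pairs));
        // {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

Running it reproduces the reduce output slide; in real Hadoop the shuffle groups pairs by key across machines, which this single-process sketch flattens into one list.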
Working With Hadoop :: Env
Hadoop is written in Java (with C++ supported via Pipes). It runs in three modes:
Local (Standalone) Mode
Pseudo-Distributed Mode
Fully-Distributed Mode
Our instance on the IBM cloud is set up in Pseudo-Distributed Mode.
Working With Hadoop
Process:
1. Start the Hadoop service
2. Prepare input
3. Write your MapReduce program
4. Compile your program
5. Run your application with Hadoop
Working With Hadoop :: Env
Start the Hadoop service (the first command formats a new HDFS filesystem):
$ bin/hadoop namenode -format
$ bin/start-all.sh
Put input into HDFS:
$ bin/hadoop fs -put localdir hinputdir
You can also use -get, -rm, and -cat with fs.
Working With Hadoop :: Env
Compile your program and create a jar:
$ javac -classpath ${HADOOP}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf wordcount.jar -C wordcount_classes/ .
Run your application with Hadoop:
$ bin/hadoop jar wordcount.jar org.myorg.WordCount hinputdir houtputdir
Working With Hadoop :: Prog

void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));
Cited from Wikipedia
Working With Hadoop :: Prog

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Working With Hadoop :: Prog

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Working With Hadoop :: Prog
Configurations & Main class
Leave the rest of the work to the Hadoop MapReduce framework.
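The driver itself is not shown on the slides; a minimal sketch, following the r0.20 MapReduce tutorial and the old org.apache.hadoop.mapred API that the Map and Reduce classes above use, could look like this (class and path names are illustrative):

```java
// Job configuration and main class for WordCount (old mapred API).
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);           // k3 type
    conf.setOutputValueClass(IntWritable.class);  // v3 type

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);          // combiner reuses the reducer
    conf.setReducerClass(Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));  // e.g. hinputdir
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // e.g. houtputdir

    JobClient.runJob(conf);  // submit and wait for completion
  }
}
```

Note how setting the combiner to the Reduce class implements the "combine" step in the data flow shown earlier; everything else (splitting, shuffling, scheduling, retries) is left to the framework.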
Summary
Hadoop introduction and connections to the IR book
MapReduce overview, with the WordCount example
Environment configuration and writing your MapReduce application
References
Hadoop Project: http://hadoop.apache.org/
MapReduce in Hadoop: http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html
MapReduce: Simplified Data Processing on Large Clusters: http://portal.acm.org/citation.cfm?id=1327452.1327492&coll=GUIDE&dl=&idx=J79&part=magazine&WantType=Magazines&title=Communications%20of%20the%20ACM
Hadoop Single-Node Setup: http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
Who uses Hadoop: http://wiki.apache.org/hadoop/PoweredBy
Thank You!