sdec2011 shashank-introducing hadoop
![Page 1: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/1.jpg)
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners.
Introducing Hadoop
Mastering Hadoop Map-reduce for Data Analysis
Shashank Tiwari
blog: shanky.org | twitter: @tshanky
![Page 2: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/2.jpg)
What is Hadoop
![Page 3: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/3.jpg)
HDFS Architecture
![Page 4: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/4.jpg)
Namenode/Datanode, JobTracker/TaskTracker
![Page 5: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/5.jpg)
MapReduce
![Page 6: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/6.jpg)
ZK Namespace
![Page 7: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/7.jpg)
Essential HBase Schema
![Page 8: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/8.jpg)
Multi-dimensional View
![Page 9: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/9.jpg)
A Map/Hash View
• {
• "row_key_1" : {
• "name" : { "first_name" : "Jolly", "last_name" : "Goodfellow" },
• "location" : { "zip" : "94301" }
• }
• }
![Page 10: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/10.jpg)
Architectural View (HBase)
![Page 11: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/11.jpg)
The Persistence Mechanism
![Page 12: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/12.jpg)
The underlying file format
![Page 13: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/13.jpg)
Installing & Setting up Hadoop
• Required software: Java 1.6.x, ssh + sshd
• Download
• Install
• Configure
• single-node
• pseudo-distributed
• cluster
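Before downloading anything, it can help to confirm the required software is present. The following is a minimal sketch, not part of the original slides; the function name `check_prereqs` is hypothetical, and it only checks that each tool is visible on PATH, not its version.

```shell
# Report which of the named tools are visible on PATH.
# Prints one line per tool and returns non-zero if any is missing.
check_prereqs() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Possible call for the requirements listed above. Note that sshd often
# lives in /usr/sbin, which may not be on a non-root user's PATH:
# check_prereqs java ssh sshd && java -version
```

The Java version itself (1.6.x on the slide) still has to be read from the output of `java -version`.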
![Page 14: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/14.jpg)
Download
• Source: http://hadoop.apache.org/
• Version:
• 0.20.203.x -- current stable
• 0.20.x -- previous stable
• Includes
• Hadoop Common -- common utilities, HDFS, MapReduce
![Page 15: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/15.jpg)
Install
• Extract: tar zxvf hadoop-0.20.203.0rc1.tar.gz
• Move & Create Symbolic Link
• ln -s hadoop-0.20.203.0 hadoop
• On Windows
• http://developer.yahoo.com/hadoop/tutorial/module3.html
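The extract and symlink steps can be wrapped into one small helper. This is a sketch under assumptions: the function name `install_hadoop` and the target directory are hypothetical, and the name of the extracted directory is passed in explicitly because it can differ from the tarball name (here `hadoop-0.20.203.0rc1.tar.gz` unpacks to `hadoop-0.20.203.0`, per the slide).

```shell
# install_hadoop <tarball> <target-dir> <extracted-dir-name>
# Extracts the release into <target-dir> and points a version-independent
# "hadoop" symlink at the extracted directory.
install_hadoop() {
    tarball=$1
    target=$2
    release_dir=$3

    mkdir -p "$target" || return 1
    tar zxf "$tarball" -C "$target" || return 1

    # Replace any earlier "hadoop" symlink so upgrades only need a re-link.
    ( cd "$target" && rm -f hadoop && ln -s "$release_dir" hadoop )
}
```

A possible call would be `install_hadoop hadoop-0.20.203.0rc1.tar.gz "$HOME/opt" hadoop-0.20.203.0`, after which `$HOME/opt/hadoop` always refers to the current release.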
![Page 16: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/16.jpg)
Configure -- single-node
• Edit: conf/hadoop-env.sh
• Set JAVA_HOME
• Default configuration is single-node
• Run bin/hadoop with no arguments (prints the available command options)
• Reference: http://hadoop.apache.org/common/docs/r0.20.203.0/single_node_setup.html
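Setting JAVA_HOME amounts to adding one export line to conf/hadoop-env.sh. A minimal sketch follows; the helper name `set_java_home` is hypothetical, and the JDK path must be supplied by the reader because it varies by system.

```shell
# set_java_home <env-file> <jdk-dir>
# Appends an export line for JAVA_HOME to a hadoop-env.sh style file.
# When the file is sourced, the last assignment wins, so an earlier or
# commented-out JAVA_HOME line can be left in place.
set_java_home() {
    env_file=$1
    java_home=$2

    if [ ! -d "$java_home" ]; then
        echo "not a directory: $java_home" >&2
        return 1
    fi
    printf 'export JAVA_HOME="%s"\n' "$java_home" >> "$env_file"
}
```

For example, `set_java_home conf/hadoop-env.sh /path/to/jdk`, where `/path/to/jdk` stands in for the local JDK location.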
![Page 17: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/17.jpg)
Configure -- pseudo-distributed
• Edit: conf/core-site.xml (configure the HDFS NameNode address)
• Edit: conf/hdfs-site.xml (configure HDFS replication factor)
• Edit: conf/mapred-site.xml (configure MapReduce JobTracker daemon)
• Enable ssh to localhost (without passphrase)
• Reference: http://hadoop.apache.org/common/docs/r0.20.203.0/single_node_setup.html
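The three edits can be sketched as a script that writes minimal versions of the files. The property names (`fs.default.name`, `dfs.replication`, `mapred.job.tracker`) are the ones used by the 0.20.x line; the host and port values (localhost:9000, localhost:9001) and a replication factor of 1 follow the referenced single-node setup guide as best recalled, so treat them as assumptions and check them against that page. The function names and the RSA key type are also assumptions.

```shell
# write_pseudo_conf <conf-dir>
# Writes minimal pseudo-distributed versions of the three config files.
write_pseudo_conf() {
    conf_dir=$1
    mkdir -p "$conf_dir" || return 1

    # NameNode address (assumed value: hdfs://localhost:9000)
    cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

    # One replica per block, since there is only one DataNode
    cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

    # JobTracker address (assumed value: localhost:9001)
    cat > "$conf_dir/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
}

# enable_local_ssh: create a passphrase-less key (if none exists) and
# authorize it for logins to localhost. Not called automatically.
enable_local_ssh() {
    key="$HOME/.ssh/id_rsa"
    mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh" || return 1
    [ -f "$key" ] || ssh-keygen -t rsa -P '' -f "$key" || return 1
    cat "$key.pub" >> "$HOME/.ssh/authorized_keys" || return 1
    chmod 600 "$HOME/.ssh/authorized_keys"
}
```

Running `write_pseudo_conf conf` from the Hadoop directory overwrites the three files, so back up any existing configuration first. After `enable_local_ssh`, `ssh localhost` should log in without prompting for a passphrase.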
![Page 18: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/18.jpg)
Start Hadoop
• Format HDFS: bin/hadoop namenode -format
• Start all daemons: bin/start-all.sh
• Verify logs
• Browse the web interface:
• Namenode: http://localhost:50070/
• JobTracker: http://localhost:50030/
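The startup sequence above can be sketched as two small helpers. The function names `start_hadoop` and `check_ui` are hypothetical; the commands and URLs are the ones listed on the slide. Formatting is destructive for an existing HDFS, so it belongs only in first-time setup.

```shell
# start_hadoop <hadoop-home>
# Formats HDFS and starts all daemons. Intended for first-time setup only:
# formatting an existing namenode discards the filesystem metadata.
start_hadoop() {
    hadoop_home=$1

    if [ ! -x "$hadoop_home/bin/hadoop" ]; then
        echo "no executable bin/hadoop under $hadoop_home" >&2
        return 1
    fi
    "$hadoop_home/bin/hadoop" namenode -format || return 1
    "$hadoop_home/bin/start-all.sh" || return 1
}

# check_ui <url>
# Succeeds if the web interface at <url> answers with a non-error status.
check_ui() {
    curl -s -f -o /dev/null "$1"
}

# Possible sequence on a pseudo-distributed node:
# start_hadoop "$HOME/opt/hadoop"
# check_ui http://localhost:50070/ && echo "Namenode UI is up"
# check_ui http://localhost:50030/ && echo "JobTracker UI is up"
```

The daemons take a few seconds to come up, so the UI checks may fail if run immediately after the start script returns. The JDK's `jps` tool is another quick way to see which Hadoop daemons are running.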
![Page 19: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/19.jpg)
Take Hadoop for a test-drive
• Run examples (hadoop-examples-0.20.203.0.jar)
• Grep using regular expressions
• Copy files to HDFS: bin/hadoop fs -put bin input
• Grep the input files for text beginning with ‘start’
• Verify output on HDFS: bin/hadoop fs -cat output/*
• Copy output to local filesystem & verify: bin/hadoop fs -get output output && cat output/*
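The slide does not show the grep command itself. A sketch of the full test-drive follows, assuming the examples jar sits in the Hadoop directory and that its `grep` program takes an input path, an output path, and a regular expression, as in the official single-node guide. The function name and the pattern `start[a-z.]+` are illustrative assumptions.

```shell
# run_grep_example <hadoop-home> <regex>
# Copies the local bin directory into HDFS as "input", runs the bundled
# grep example over it, and prints the result stored in HDFS "output".
run_grep_example() {
    hadoop_home=$1
    regex=$2

    (
        cd "$hadoop_home" || exit 1
        bin/hadoop fs -put bin input &&
        bin/hadoop jar hadoop-examples-0.20.203.0.jar grep input output "$regex" &&
        # Quote the glob so HDFS expands it, not the local shell.
        bin/hadoop fs -cat 'output/*'
    )
}

# Possible call, matching text that begins with "start":
# run_grep_example "$HOME/opt/hadoop" 'start[a-z.]+'
```

The job fails if the HDFS `output` directory already exists, so remove it (for example with `bin/hadoop fs -rmr output`) before running the example a second time.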
![Page 20: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/20.jpg)
Configure -- cluster
• References:
• http://hadoop.apache.org/common/docs/r0.20.203.0/cluster_setup.html (official documentation)
• http://developer.yahoo.com/hadoop/tutorial/module7.html (Managing a Hadoop Cluster. Source: YDN)
• http://wiki.datameer.com/display/DAS1/Hadoop+Cluster+Configuration+Tips
![Page 21: Sdec2011 shashank-introducing hadoop](https://reader034.vdocuments.mx/reader034/viewer/2022042613/5457cdafb1af9fcf338b51df/html5/thumbnails/21.jpg)
Questions?
• blog: shanky.org | twitter: @tshanky