![Page 1: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/1.jpg)
Overview of Hadoop for Data Mining
Federal Big Data Group
confidential
Mark Silverman
Treeminer, Inc.
155 Gibbs Street Suite 514
Rockville, Maryland 20850
(240) [email protected]
![Page 2: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/2.jpg)
TREEMINER, INC. CONFIDENTIAL
Agenda
• Introduction to Hadoop
• Developing and testing a Map/Reduce application
• Auto-Clustering in Hadoop and Interworking with Apache Storm
![Page 3: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/3.jpg)
Introduction to Hadoop
• Hadoop consists of:
  • Clustered, distributed, highly available file system (HDFS)
  • Execution framework (Map/Reduce)
![Page 4: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/4.jpg)
Hadoop File System
• “Rack” aware
• Local storage
• Distributed copies (generally 3)
(Diagram: Rack)
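To make "rack aware" and "generally 3" copies concrete, here is a minimal sketch of how three replicas of one block might be spread across racks. It assumes the commonly documented HDFS default: first copy on the writer's node, second on a node in a different rack, third on another node in that same remote rack. The function name `place_replicas` and the topology dictionary are hypothetical, and the sketch picks nodes deterministically where a real cluster would choose among candidates.

```python
def place_replicas(writer, topology, replication=3):
    """Pick datanodes for the replicas of one block.

    topology maps a rack name to the list of node names in that rack
    (a hypothetical structure for illustration); writer is the node
    that is writing the block.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]

    # Replica 1: local storage on the writer's own node.
    chosen = [writer]

    # Replica 2: a node on a different rack, so one rack failure
    # cannot destroy every copy.
    remote_racks = [r for r in sorted(topology) if r != local_rack and topology[r]]
    if replication >= 2 and remote_racks:
        remote = remote_racks[0]
        chosen.append(topology[remote][0])

        # Replica 3: a different node on that same remote rack,
        # which limits cross-rack traffic to a single transfer.
        if replication >= 3 and len(topology[remote]) > 1:
            chosen.append(topology[remote][1])

    return chosen[:replication]


if __name__ == "__main__":
    cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    print(place_replicas("n1", cluster))
```

With the two-rack cluster above, a block written from `n1` ends up with one local copy and two copies on the other rack.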
![Page 5: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/5.jpg)
Sample Hadoop File System
![Page 6: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/6.jpg)
Hadoop “Eco-System”
• Hive
  Allows SQL-like querying of data in HDFS
• Pig
  Basic scripting language for Hadoop
• Databases
  HBase, Accumulo, Cassandra, Neo4j
![Page 7: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/7.jpg)
Map / Reduce
Parallel Execution Framework
![Page 8: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/8.jpg)
Map / Reduce
Parallel Execution Framework
![Page 9: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/9.jpg)
WordCount Example
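The slide's diagram does not survive in this transcript, so the following is a minimal sketch of the WordCount data flow in plain Python. It does not use the Hadoop API; it only imitates the three stages (map, shuffle/sort, reduce) in a single process. All function names are hypothetical, and lowercasing words is an assumption made for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle/sort: group values by key, sorted by key, which is the
    work the framework performs between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three stages in sequence over an iterable of text lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The fox"]
    print(word_count(sample))
```

On a real cluster the mappers run in parallel on the nodes that hold the input blocks and the reducers run in parallel over disjoint sets of keys; the sketch keeps only the logical flow.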
![Page 10: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/10.jpg)
Getting Started
• Cloudera and Hortonworks offer sandboxes that are easy to download; each is a fully contained Hadoop implementation in a VM. Hadoop can also be downloaded directly from Apache.
http://hortonworks.com/products/hortonworks-sandbox/
http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-3-x.html
http://hadoop.apache.org/releases.html
![Page 11: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/11.jpg)
Developing In Map / Reduce
• Standalone Mode – Hadoop runs as a single process, best for debugging
• Pseudo-Distributed – Separate processes on the same server
• Fully Distributed – Full-blown cluster
![Page 12: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/12.jpg)
Eclipse Framework
• Write code in Eclipse
• PC or Linux
• Options:
  • Run Hadoop on Windows
  • Run Eclipse in Linux with Plugin
  • Run Eclipse in Windows, Remote debug and profiling
• Profiling: YourKit
![Page 13: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/13.jpg)
WordCount
• Create a project in Eclipse
• Load wordcount code (widely available and in sandbox downloads)
• Compile jar file
• Execute on Hadoop in standalone mode
$ hadoop jar path/to/file.jar input output
![Page 14: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/14.jpg)
Monitoring Hadoop Jobs
![Page 15: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/15.jpg)
Monitoring Hadoop Jobs
![Page 16: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/16.jpg)
Resources
http://www.cloudera.com
http://www.hortonworks.com
http://hadoop.apache.org
http://web.stanford.edu/class/cs246/homeworks/tutorial.pdf
Hadoop: The Definitive Guide by Tom White
![Page 17: Overview of Hadoop for Data Mining Federal Big Data Group confidential Mark Silverman Treeminer, Inc. 155 Gibbs Street Suite 514 Rockville, Maryland 20850](https://reader030.vdocuments.mx/reader030/viewer/2022032523/56649d825503460f94a67cd1/html5/thumbnails/17.jpg)
Example: Document AutoClustering using Hadoop and Storm
https://www.youtube.com/watch?v=5X65WV0n4rU