“cloudera hadoop” - data mining...

"Cloudera Hadoop" by Kittirak Moungmingsuk

Managing Director of Cluster Kit Co., Ltd. and President of the Open Source Education and Development Association (OSEDA)

Big Data & Analytics seminar by Data Cube (facebook.com/datacube.th)

Cloudera Hadoop

Kittirak Moungmingsuk

kittirak@clusterkit.co.th

Apr 4, 2015 Data Cube Seminar @ KU HOME

Getting to Know Each Other First

Kittirak Moungmingsuk (nickname: ก�ก)

Currently wears many hats at a small company called "Cluster Kit"

and has been entrusted by many people to serve as President of the Open Source Education and Development Association (OSEDA)

Education

Nak Tham Tri (elementary-level Dhamma studies), Ubon Ratchathani provincial school, Wat Pa Wiwek (Thammachan)

Enjoys surfing the internet, traveling, and taking part in various activities

Cluster Kit: Achievement

ThaiGrid (Tera Cluster): 800 cores, Linux cluster

133 cores, Windows cluster

Sila Cluster @ Ramkhamhaeng U.: 286 cores

BIOTEC (Eclipse Cluster): 704 cores

Virgin Radio Thailand: 7 nodes, web cluster

Geo-Informatics and Space Technology Development Agency (GISTDA): 10 nodes, web cluster

HAII (HAII Cluster I, II): 480 cores

Top500.org (update Nov 2014)

Top500 Architecture Share (June 2014)

Top500 OS Share

Why Big Data?

Source: https://practicalanalytics.files.wordpress.com/2012/10/newstyleofit.jpg

Source: http://smartdatacollective.com/yellowfin/75616/why-big-data-and-business-intelligence-one-direction

Facebook Usage Statistics (June 2014)

829 million daily active users

654 million mobile daily active users

1.32 billion monthly active users

1.07 billion mobile monthly active users

Approximately 81.7% of our daily active users are outside the US and Canada

Source: http://newsroom.fb.com/company-info/

Google Usage Statistics

Data from http://expandedramblings.com/index.php/by-the-numbers-a-gigantic-list-of-google-stats-and-facts/#.VDavqq2mhNA

Amount of monthly Google searches

11.944 billion (3/20/14)

Number of monthly unique visitors

187 million (3/25/14)

Number of Servers

Google

> 1 million servers

Facebook

180,900 Servers

https://www.facebook.com/ArcadianLearnings/posts/549836811713533

Low Cost, High Performance

http://www.opencompute.org/

Software

Python, C++, Java, JavaScript, Go, Sawzall (a custom logging language)

Hadoop

PHP, C++, Java, Python, and Ruby.

Apache Web Server

Hadoop

Memcached, Flashcache

HipHop transforms PHP source code into C++ to gain performance benefits.

What is Hadoop?

MapReduce

How to build Hadoop cluster

How to execute MapReduce

Hive SQL

Hadoop – How was it Born?

Hadoop was created to process huge volumes of data as the amount of generated data continued to increase rapidly (Big Data). The Web was also generating more and more information, which made indexing its content quite challenging.

What Is Apache Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Image Source: http://blogs.ejb.cc/archives/4290/hadoop-technical-manuals-a-the-hadoop-ecosystem/tumblr_lbbwggcer71qappj8

HDFS Architecture

Source: http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html

Data Replication

Source: http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html

Hadoop - Basic Architecture

Source: http://www.mplsvpn.info/2012/11/hadoop-architecture-types-of-hadoop.html

Hadoop - Basic Architecture (contd.)

Source: http://www.mplsvpn.info/2012/11/hadoop-architecture-types-of-hadoop.html

MapReduce

MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. – Wikipedia

Map: (K1, V1) -> list(K2, V2)

Reduce: (K2, list(V2)) -> list(K3, V3)
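The two signatures above can be illustrated with a small single-process driver. This is only a sketch of the programming model, not of Hadoop's implementation: `run_mapreduce`, `max_mapper`, `max_reducer`, and the sample "year,temperature" records are all hypothetical names and data chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")


def run_mapreduce(
    records: Iterable[Tuple[K1, V1]],
    mapper: Callable[[K1, V1], List[Tuple[K2, V2]]],
    reducer: Callable[[K2, List[V2]], List[Tuple[K3, V3]]],
) -> List[Tuple[K3, V3]]:
    """Run map, group-by-key, and reduce in one process (illustration only)."""
    # Map: (K1, V1) -> list(K2, V2), grouped by K2 as the shuffle would do.
    groups: Dict[K2, List[V2]] = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)

    # Reduce: (K2, list(V2)) -> list(K3, V3), visiting keys in sorted order.
    output: List[Tuple[K3, V3]] = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


def max_mapper(line_no: int, line: str) -> List[Tuple[str, int]]:
    # Hypothetical input format: "year,temperature".
    year, temp = line.split(",")
    return [(year, int(temp))]


def max_reducer(year: str, temps: List[int]) -> List[Tuple[str, int]]:
    return [(year, max(temps))]


# Made-up sample records: (line number, line text).
records = [(0, "2013,31"), (1, "2014,35"), (2, "2013,34"), (3, "2014,29")]
result = run_mapreduce(records, max_mapper, max_reducer)
print(result)
```

In a real cluster the map calls, the grouping, and the reduce calls run on different machines; only the shape of the two functions carries over.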

MapReduce

Output in a list of (Key, Value)

Output in a list of (Key, List of Values)

Image source: http://www.rabidgremlin.com/data20/#(3)/

Map function

Reduce function

WordCount - MapReduce
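As a rough sketch of word count in the MapReduce style, the following follows the Hadoop Streaming convention, where a mapper emits tab-separated "key, value" lines and a reducer receives those lines sorted by key. It is not the code from the original slide. The function names and the sample input are assumptions, and the `sorted` call merely stands in for Hadoop's shuffle-and-sort step.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each word; the input must be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def word_count(lines):
    # sorted() stands in for the shuffle-and-sort between map and reduce.
    mapped = sorted(mapper(lines))
    return list(reducer(mapped))


# Made-up sample input.
sample = ["big data big cluster", "data node"]
counts = word_count(sample)
print("\n".join(counts))
```

Because the reducer only relies on identical keys being adjacent, it can process its input as a stream without holding every word in memory.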

WordCount by Pig

A = load 'input/*';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into 'output/wordcount-pig';

Source: http://salsahpc.indiana.edu/ScienceCloud/pig_word_count_tutorial.htm

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs

https://pig.apache.org/

Hive provides a mechanism to project structure onto data stored in Hadoop and to query that data using a SQL-like language called HiveQL.

hive> show tables;

hive> create table country (country_id int, country string) row format delimited fields terminated by ',' stored as textfile;

hive> desc country;

hive> load data local inpath '/tmp/country.csv' into table country;

hive> select count(country_id) from country where country like 'T%';
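To make clear what the last query computes, here is a plain-Python equivalent that does not use Hive at all. It is a sketch under assumptions: the CSV contents are made up (the real `/tmp/country.csv` is not shown in the slides), and `load_country` and `count_starting_with` are hypothetical helper names.

```python
import csv
import io

# Made-up rows standing in for /tmp/country.csv (country_id,country).
COUNTRY_CSV = "1,Thailand\n2,Japan\n3,Turkey\n4,Laos\n"


def load_country(text):
    """Parse comma-delimited rows into (country_id, country) tuples."""
    reader = csv.reader(io.StringIO(text))
    return [(int(row[0]), row[1]) for row in reader]


def count_starting_with(rows, prefix):
    """Mimic: select count(country_id) from country where country like 'T%'."""
    return sum(
        1
        for country_id, country in rows
        if country_id is not None and country.startswith(prefix)
    )


rows = load_country(COUNTRY_CSV)
t_count = count_starting_with(rows, "T")
print(t_count)
```

With Hive the same filter and count are compiled into jobs that run across the cluster, so the query keeps working when the table no longer fits on one machine.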

Apache™ Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop® and using the MapReduce paradigm.

Mahout supports four main data science use cases:

Collaborative filtering

Clustering

Classification

Frequent itemset mining

List of algorithms (for distributed mode)

Canopy Clustering

Dirichlet Process Clustering

Fuzzy K-Means

Hierarchical Clustering

K-Means Clustering

Latent Dirichlet Allocation

Mean Shift Clustering

Minhash Clustering

Spectral Clustering

Distributed Item-based Collaborative Filtering

Collaborative Filtering Using a Parallel Matrix Factorization

Bayesian

Random Forests

Parallel FP Growth Algorithm

Source: http://hortonworks.com/hadoop/mahout/

Source: http://imgbuddy.com/hadoop-ecosystem-components.asp

HADOOP 1.0 vs 2.0

Source: http://hortonworks.com/blog/apache-hadoop-2-is-ga/

Hadoop 2.0: YARN (Yet Another Resource Negotiator)

Source: http://hortonworks.com/get-started/yarn/

Cloudera Hadoop (CDH)

CDH is Cloudera's open source software distribution

http://www.cloudera.com/

Source: http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html

Cloudera GUI (Hue)

Other Hadoop Platforms

Hortonworks

http://hortonworks.com/

MapR

https://www.mapr.com/

References

https://www.facebook.com/Engineering

https://www.facebook.com/data

https://www.facebook.com/publication

http://research.google.com

http://googleblog.blogspot.com

References (contd.)

Stratapps, “An Introduction to Hadoop”, http://stratapps.net/intro-hadoop.php

edureka!, “Introduction to Hadoop 2.0 and advantages of Hadoop 2.0 over 1.0”, http://www.edureka.co/blog/introduction-to-hadoop-2-0-and-advantages-of-hadoop-2-0/, May 2014

Second-Hand Computers for Rural Children Project

The End.

Download this slide at http://goo.gl/DoibT2

Tweet to me at @kittirak
