“cloudera hadoop” - data mining...

Post on 13-Mar-2020


“Cloudera Hadoop” by Kittirak Moungmingsuk

Managing Director of Cluster Kit Co., Ltd. and President of the Open Source Education and Development Association

Big Data & Analytics seminar by Data Cube (facebook.com/datacube.th)

Cloudera Hadoop

Kittirak Moungmingsuk

kittirak@clusterkit.co.th

Apr 4, 2015, Data Cube Seminar @ KU HOME

2

About me

Kittirak Moungmingsuk, nickname [illegible]

Currently wears many hats at a small company called “Cluster Kit”

and has been entrusted by many people to serve as President of the Open Source Education and Development Association (OSEDA)

Education

Nak Tham Tri (elementary-level Dhamma studies), Ubon Ratchathani provincial school, Wat Pa Wiwek (Thammachan)

Enjoys the internet, traveling, and taking part in various activities

3

Cluster Kit: Achievements

ThaiGrid (Tera Cluster): 800 cores, Linux cluster; 133 cores, Win cluster

Sila Cluster @ Ramkhamhaeng U.: 286 cores

BIOTEC (Eclipse Cluster): 704 cores

Virgin Radio Thailand: 7 nodes, web cluster

Geo-Informatics and Space Technology Development Agency (GISTDA): 10 nodes, web cluster

HAII (HAII Cluster I, II): 480 cores

4

Top500.org (update Nov 2014)

5

Top500 Architecture Share (June 2014)

6

Top500 OS Share

7

Why Big Data?

8

Source: https://practicalanalytics.files.wordpress.com/2012/10/newstyleofit.jpg

9

Source: http://smartdatacollective.com/yellowfin/75616/why-big-data-and-business-intelligence-one-direction

10

11

Facebook Usage Statistics (June 2014)

829 million daily active users

654 million mobile daily active users

1.32 billion monthly active users

1.07 billion mobile monthly active users

Approximately 81.7% of our daily active users are outside the US and Canada

Source: http://newsroom.fb.com/company-info/

12

Google Usage Statistics

Data from http://expandedramblings.com/index.php/by-the-numbers-a-gigantic-list-of-google-stats-and-facts/#.VDavqq2mhNA

Number of monthly Google searches: 11.944 billion (3/20/14)

Number of monthly unique visitors: 187 million (3/25/14)

13

Number of servers

Google

More than a million servers

Facebook

180,900 servers

https://www.facebook.com/ArcadianLearnings/posts/549836811713533

14

Low Cost, High Performance

15

http://www.opencompute.org/

16

Software

Linux

Python, C++, Java, JavaScript, Go, Sawzall (a custom logging language)

Hadoop

Linux

PHP, C++, Java, Python, and Ruby.

Apache Web Server

MySQL

Hadoop

Memcached, Flashcache

HipHop to transform PHP source code into C++ and gain performance benefits.

17

What is Hadoop?

HDFS

MapReduce

How to build Hadoop cluster

How to execute MapReduce

Hive SQL

18

Hadoop – How was it Born?

Hadoop was created to process huge volumes of data as the amount of generated data continued to increase rapidly (Big Data). The web was also producing more and more information, and indexing that content was becoming quite challenging.

19

What Is Apache Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Image Source: http://blogs.ejb.cc/archives/4290/hadoop-technical-manuals-a-the-hadoop-ecosystem/tumblr_lbbwggcer71qappj8

20

HDFS Architecture

Source: http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html

21

Data Replication

Source: http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
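As a rough illustration of the block and replication model described in the HDFS design document cited above, the storage arithmetic can be sketched in Python. The 64 MB block size and the replication factor of 3 are the defaults in Hadoop 1.x; the function name and the 1 GB example file are assumptions made up for this sketch.

```python
MB = 1024 * 1024

def hdfs_footprint(file_size_bytes, block_size=64 * MB, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for one file.

    Hypothetical helper for illustration only; it is not part of Hadoop.
    """
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    # Integer ceiling division: a partial last block still counts as a block.
    blocks = (file_size_bytes + block_size - 1) // block_size
    # Every block is stored `replication` times on different DataNodes.
    replicas = blocks * replication
    # The last block only occupies its actual length, so raw usage is
    # simply the file size multiplied by the replication factor.
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes

blocks, replicas, raw_bytes = hdfs_footprint(1024 * MB)
print(blocks, replicas, raw_bytes // MB)
```

For a 1 GB file this gives 16 blocks, 48 block replicas, and 3072 MB of raw cluster storage.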

22

Hadoop - Basic Architecture

Source: http://www.mplsvpn.info/2012/11/hadoop-architecture-types-of-hadoop.html

23

Hadoop - Basic Architecture (contd.)

Source: http://www.mplsvpn.info/2012/11/hadoop-architecture-types-of-hadoop.html

24

MapReduce

MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. – Wikipedia

map: (K1, V1) -> list(K2, V2)

reduce: (K2, list(V2)) -> list(K3, V3)
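The two signatures can be made concrete with a small in-memory sketch in Python. This is not how Hadoop runs jobs (there is no distribution, no disk, no fault tolerance); it only shows the map, group-by-key, and reduce steps. The driver `run_mapreduce`, the two sample functions, and the temperature records are all hypothetical names and data chosen for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to each (K1, V1), group values by K2, then reduce."""
    grouped = defaultdict(list)
    for k1, v1 in records:
        # map: (K1, V1) -> list(K2, V2)
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)  # "shuffle": collect values per key
    output = []
    for k2 in sorted(grouped):
        # reduce: (K2, list(V2)) -> list(K3, V3)
        output.extend(reduce_fn(k2, grouped[k2]))
    return output

def map_year_temp(_offset, line):
    """Parse a 'year,temperature' line into one (year, temperature) pair."""
    year, temp = line.split(",")
    return [(year, int(temp))]

def reduce_max_temp(year, temps):
    """Keep only the highest temperature seen for each year."""
    return [(year, max(temps))]

records = [(0, "2014,31"), (1, "2014,35"), (2, "2015,33")]
result = run_mapreduce(records, map_year_temp, reduce_max_temp)
print(result)
```

Here K1 is a line offset, V1 a text line, K2 the year, and V2 a temperature, so the reducer receives every temperature for a year in a single list.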

25

MapReduce

Output in a list of (Key, Value)

Output in a list of (Key, List of Values)

Image source: http://www.rabidgremlin.com/data20/#(3)/

26

WordCount - MapReduce

Map function

Reduce function
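As a stand-in for the two functions named on this slide, here is a minimal Python sketch written in the style of Hadoop Streaming, where the mapper emits tab-separated `word<TAB>1` lines and the reducer sums the counts over sorted input. It runs locally in memory rather than on a cluster, and the function names and sample sentences are illustrative assumptions, not the slide's original code.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same word.

    Assumes the records are already sorted by key, which is what the
    framework's shuffle-and-sort step provides between map and reduce.
    """
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)

def word_count(lines):
    """Run map, a local sort (standing in for the shuffle), and reduce."""
    return dict(reducer(sorted(mapper(lines))))

counts = word_count(["the quick fox", "the lazy dog the end"])
print(counts)
```

The `sorted` call is the only piece that replaces cluster behavior: on Hadoop, the framework sorts and routes mapper output so that each reducer sees all records for a given word together.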

27

WordCount by Pig

A = load 'input/*';

B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

C = group B by word;

D = foreach C generate COUNT(B), group;

store D into 'output/wordcount-pig';

Source: http://salsahpc.indiana.edu/ScienceCloud/pig_word_count_tutorial.htm

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs

https://pig.apache.org/

28

Hive

Hive provides a mechanism to project structure onto data stored in Hadoop and to query that data using a SQL-like language called HiveQL.

hive> show tables;

hive> create table country (country_id int, country string) row format delimited fields terminated by ',' stored as textfile;

hive> desc country;

hive> load data local inpath '/tmp/country.csv' into table country;

hive> select count(country_id) from country where country like 'T%';
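To show what the final query computes, the same count can be sketched in plain Python over a small CSV. The sample rows are invented for illustration (the contents of `/tmp/country.csv` are not given), and the function name is hypothetical. Rows whose id does not parse as an integer are skipped, which only approximates how Hive would treat them as NULL and leave them out of `count(country_id)`.

```python
import csv
import io

# Invented sample data standing in for /tmp/country.csv.
SAMPLE_CSV = "1,Thailand\n2,Japan\n3,Turkey\n4,Laos\n"

def count_countries_with_prefix(csv_text, prefix):
    """Count rows with a valid integer id whose country starts with prefix."""
    count = 0
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # malformed line: no country column
        country_id, country = row[0], row[1]
        try:
            int(country_id)
        except ValueError:
            continue  # non-integer id, treated like a NULL and not counted
        # LIKE 'T%' is a case-sensitive prefix match.
        if country.startswith(prefix):
            count += 1
    return count

matches = count_countries_with_prefix(SAMPLE_CSV, "T")
print(matches)
```

With the sample rows above, two countries (Thailand and Turkey) match the prefix `T`.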

29

Apache™ Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop® and using the MapReduce paradigm.

Mahout supports four main data science use cases:

Collaborative filtering

Clustering

Classification

Frequent itemset mining

30

List of algorithms (for distributed mode)

Canopy Clustering

Dirichlet Process Clustering

Fuzzy K-Means

Hierarchical Clustering

K-Means Clustering

Latent Dirichlet Allocation

Mean Shift Clustering

Minhash Clustering

Spectral Clustering

Distributed Item-based Collaborative Filtering

Collaborative Filtering Using a Parallel Matrix Factorization

Bayesian

Random Forests

Parallel FP Growth Algorithm

Source: http://hortonworks.com/hadoop/mahout/

31

Source: http://imgbuddy.com/hadoop-ecosystem-components.asp

32

HADOOP 1.0 vs 2.0

Source: http://hortonworks.com/blog/apache-hadoop-2-is-ga/

33

Hadoop 2.0: YARN (Yet Another Resource Negotiator)

Source: http://hortonworks.com/get-started/yarn/

34

Cloudera Hadoop (CDH)

CDH is Cloudera's open source software distribution

http://www.cloudera.com/

Source: http://www.cloudera.com/content/cloudera/en/products-and-services/cdh.html

35

Cloudera GUI (Hue)

36

Other Hadoop Platforms

Hortonworks

http://hortonworks.com/

MapR

https://www.mapr.com/

37

References

https://www.facebook.com/Engineering

https://www.facebook.com/data

https://www.facebook.com/publication

http://research.google.com

http://googleblog.blogspot.com

38

References

Stratapps, “An Introduction to Hadoop”, http://stratapps.net/intro-hadoop.php

edureka!, “Introduction to Hadoop 2.0 and advantages of Hadoop 2.0 over 1.0”, http://www.edureka.co/blog/introduction-to-hadoop-2-0-and-advantages-of-hadoop-2-0/, May 2014

39

Second-hand Computers for Children in Rural Areas project

The End.

Download this slide at http://goo.gl/DoibT2

Tweet to me at @kittirak
