Introduction to the Hadoop Ecosystem

Upload: ud

Post on 01-Feb-2016


TRANSCRIPT

Page 1: Introduction to the Hadoop Ecosystem

Introduction to the Hadoop Ecosystem

Page 2: Introduction to the Hadoop Ecosystem

About me

Page 3: Introduction to the Hadoop Ecosystem

About us

Page 4: Introduction to the Hadoop Ecosystem

Agenda

Page 5: Introduction to the Hadoop Ecosystem

Agenda

Page 6: Introduction to the Hadoop Ecosystem

Let’s face it…

Page 7: Introduction to the Hadoop Ecosystem

But on the other hand

Page 8: Introduction to the Hadoop Ecosystem

Think about it…

Page 9: Introduction to the Hadoop Ecosystem

Think about it…

Page 10: Introduction to the Hadoop Ecosystem

Think about it…

Page 11: Introduction to the Hadoop Ecosystem

The 3 V’s of Big Data

Page 12: Introduction to the Hadoop Ecosystem

My favorite definition

Page 13: Introduction to the Hadoop Ecosystem

Why Hadoop?

Page 14: Introduction to the Hadoop Ecosystem

How to scale data?

Page 15: Introduction to the Hadoop Ecosystem

But…

Page 16: Introduction to the Hadoop Ecosystem

But…

Page 17: Introduction to the Hadoop Ecosystem

What is Hadoop?

Page 18: Introduction to the Hadoop Ecosystem

What is Hadoop?

Page 19: Introduction to the Hadoop Ecosystem

What is Hadoop?

Page 20: Introduction to the Hadoop Ecosystem

What is Hadoop?

Page 21: Introduction to the Hadoop Ecosystem

The Hadoop App Store

HDFS, MapRed, HCat, Pig, Hive, HBase, Ambari, Avro, Cassandra, Chukwa, Intel, Sync, Flume, Hana, HyperT, Impala, Mahout, Nutch, Oozie, Sqoop, Scribe, Tez, Vertica, Whirr, ZooKee, Horton, Cloudera, MapR, EMC, IBM, Talend, TeraData, Pivotal, Informat, Microsoft, Pentaho, Jasper, Kognitio, Tableau, Splunk, Platfora, Rack, Karma, Actuate, MicStrat

Page 22: Introduction to the Hadoop Ecosystem

The Hadoop App Store

less → more

• HDFS
• MapReduce
• Hadoop Ecosystem
• Hadoop YARN

+

• Test & Packaging
• Installation
• Monitoring
• Business Support

+

• Integrated Environment
• Visualization
• (Near-)Realtime analysis
• Modeling
• ETL & Connectors

Page 23: Introduction to the Hadoop Ecosystem

Agenda

Page 24: Introduction to the Hadoop Ecosystem

Data Storage

Page 25: Introduction to the Hadoop Ecosystem

Data Storage

Page 26: Introduction to the Hadoop Ecosystem

Hadoop Distributed File System

Page 27: Introduction to the Hadoop Ecosystem

Hadoop Distributed File System

Page 28: Introduction to the Hadoop Ecosystem

HDFS Architecture

Page 29: Introduction to the Hadoop Ecosystem

Data Processing

Page 30: Introduction to the Hadoop Ecosystem

Data Processing

Page 31: Introduction to the Hadoop Ecosystem

MapReduce

Page 32: Introduction to the Hadoop Ecosystem

Typical large-data problem

Page 33: Introduction to the Hadoop Ecosystem

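MapReduce Flow

map output: (a,1) (b,2) | (c,3) (c,6) | (a,3) (c,2) | (b,7) (c,8)
combine: (a,1) (b,2) | (c,9) | (a,3) (c,2) | (b,7) (c,8)
shuffle and sort: a → [1,3], b → [2,7], c → [2,8,9]
reduce output: (a,4) (b,9) (c,19)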
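The key/value pairs on the MapReduce Flow slide can be traced in plain Java without a cluster. The following is a minimal sketch, not Hadoop code: the class `MapReduceFlowSketch` and its methods are hypothetical names, the four mapper outputs are the ones read off the slide, and summing is assumed as both the combine and the reduce function.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory walk-through of map -> combine -> shuffle -> reduce. */
final class MapReduceFlowSketch {

    /** Helper that builds one (key, value) pair. */
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    /** Combine: pre-aggregate the pairs emitted by ONE mapper by summing per key. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    /** Shuffle and sort: collect the values of all mappers under their key. */
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> combinedOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : combinedOutputs) {
            for (Map.Entry<String, Integer> pair : output.entrySet()) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        return grouped;
    }

    /** Reduce: sum the grouped values of each key. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            result.put(entry.getKey(), sum);
        }
        return result;
    }

    /** Runs the whole flow on the four mapper outputs from the slide. */
    static Map<String, Integer> run() {
        List<List<Map.Entry<String, Integer>>> mapOutputs = Arrays.asList(
            Arrays.asList(kv("a", 1), kv("b", 2)),
            Arrays.asList(kv("c", 3), kv("c", 6)),
            Arrays.asList(kv("a", 3), kv("c", 2)),
            Arrays.asList(kv("b", 7), kv("c", 8)));
        List<Map<String, Integer>> combined = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            combined.add(combine(output));
        }
        return reduce(shuffle(combined));
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints {a=4, b=9, c=19}
    }
}
```

The combiner only shrinks what travels over the network: the second mapper sends (c,9) instead of (c,3) and (c,6), and the final sums stay the same.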

Page 34: Introduction to the Hadoop Ecosystem

Combined Hadoop Architecture

Page 35: Introduction to the Hadoop Ecosystem

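WordCount Mapper in Java

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emits (word, 1) for every whitespace-separated token of the input line.
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}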

Page 36: Introduction to the Hadoop Ecosystem

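WordCount Reducer in Java

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // Sums all counts collected for one word and emits (word, total).
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}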
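The WordCount mapper and reducer only run against the Hadoop libraries. To see what the two classes compute together, the same logic can be sketched locally with the JDK alone. This is an illustration under assumptions, not the Hadoop API: `LocalWordCount` is a hypothetical class, and a single sorted map stands in for the shuffle between map and reduce.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Hypothetical single-process stand-in for the WordCount mapper and reducer. */
final class LocalWordCount {

    /** Tokenizes each line like the mapper and sums the ones like the reducer. */
    static Map<String, Integer> count(String... lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // Map side emits (word, 1); merge() plays the role of the reduce-side sum.
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("to be or not", "to be")); // prints {be=2, not=1, or=1, to=2}
    }
}
```

Each call to `merge` corresponds to one `output.collect(word, one)` in the mapper followed by the addition in the reducer.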

Page 37: Introduction to the Hadoop Ecosystem

Agenda

Page 38: Introduction to the Hadoop Ecosystem

Scripting for Hadoop

Page 39: Introduction to the Hadoop Ecosystem

Scripting for Hadoop

Page 40: Introduction to the Hadoop Ecosystem

Apache Pig

Page 41: Introduction to the Hadoop Ecosystem

Pig in the Hadoop ecosystem

Hadoop Distributed File System
Distributed Programming Framework
Metadata Management
Scripting

Page 42: Introduction to the Hadoop Ecosystem

Pig Latin

users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
filteredUsers = FILTER users BY age >= 18 AND age <= 50;
joinResult = JOIN filteredUsers BY name, pages BY user;
grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
sorted = ORDER summed BY clicks DESC;
top10 = LIMIT sorted 10;
STORE top10 INTO 'top10sites';
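For comparison, here is what the Pig Latin pipeline (filter, join, group, count, order, limit) looks like as an in-memory Java sketch. It is not a MapReduce job: `TopSitesSketch`, the sample users and the sample page views are hypothetical, user names are assumed to be unique, and ties in the click count are broken by URL only to make the output deterministic.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical in-memory version of the top-sites Pig Latin script. */
final class TopSitesSketch {

    /**
     * @param ageByUser users relation as name -> age (names assumed unique)
     * @param pageViews pages relation as {user, url} pairs
     * @param limit     number of sites to return
     */
    static List<Map.Entry<String, Long>> topSites(
            Map<String, Integer> ageByUser, List<String[]> pageViews, int limit) {
        // FILTER users BY age, JOIN with pages on the user name, GROUP BY url, COUNT.
        Map<String, Long> clicksByUrl = pageViews.stream()
            .filter(view -> {
                Integer age = ageByUser.get(view[0]);
                return age != null && age >= 18 && age <= 50;
            })
            .collect(Collectors.groupingBy(view -> view[1], Collectors.counting()));

        // ORDER BY clicks DESC (ties by url), then LIMIT.
        return clicksByUrl.entrySet().stream()
            .sorted((x, y) -> {
                int byClicks = Long.compare(y.getValue(), x.getValue());
                return byClicks != 0 ? byClicks : x.getKey().compareTo(y.getKey());
            })
            .limit(limit)
            .collect(Collectors.toList());
    }

    /** Runs the pipeline on a small made-up data set. */
    static List<Map.Entry<String, Long>> demo() {
        Map<String, Integer> ages = new HashMap<>();
        ages.put("alice", 25);
        ages.put("bob", 17);
        ages.put("carol", 40);
        ages.put("dave", 60);
        List<String[]> views = Arrays.asList(
            new String[] {"alice", "a.com"},
            new String[] {"bob", "a.com"},
            new String[] {"carol", "a.com"},
            new String[] {"carol", "b.com"},
            new String[] {"dave", "b.com"},
            new String[] {"alice", "c.com"},
            new String[] {"eve", "c.com"});
        return topSites(ages, views, 2);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [a.com=2, b.com=1]
    }
}
```

This only works while both relations fit into one JVM. Pig compiles the same nine statements into distributed jobs, which is the point of the script.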

Page 43: Introduction to the Hadoop Ecosystem

Pig Execution Plan

Page 44: Introduction to the Hadoop Ecosystem

Try that with Java…

Page 45: Introduction to the Hadoop Ecosystem

SQL for Hadoop

Page 46: Introduction to the Hadoop Ecosystem

SQL for Hadoop

Page 47: Introduction to the Hadoop Ecosystem

Apache Hive

Page 48: Introduction to the Hadoop Ecosystem

Hive in the Hadoop ecosystem

Hadoop Distributed File System
Distributed Programming Framework
Metadata Management
Scripting
Query

Page 49: Introduction to the Hadoop Ecosystem

Hive Architecture

Page 50: Introduction to the Hadoop Ecosystem

Hive Example

CREATE TABLE users(name STRING, age INT);
CREATE TABLE pages(user STRING, url STRING);

LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;

SELECT pages.url, count(*) AS clicks
FROM users JOIN pages ON (users.name = pages.user)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
ORDER BY clicks DESC
LIMIT 10;

Page 51: Introduction to the Hadoop Ecosystem

But wait, there’s still more!

Page 52: Introduction to the Hadoop Ecosystem

Data storage
Data processing
Metadata Management
Scripting
SQL-like queries
NoSQL Database
Machine Learning
Cluster Coordination
Import & Export of relational data
Cluster installation & management
Workflow automation
Import & Export of data flows

Page 53: Introduction to the Hadoop Ecosystem

Agenda

Page 54: Introduction to the Hadoop Ecosystem

Classical enterprise platform

Page 55: Introduction to the Hadoop Ecosystem

Big Data Platform

Page 56: Introduction to the Hadoop Ecosystem

Pattern #1: Refine data

Page 57: Introduction to the Hadoop Ecosystem

Pattern #2: Explore data

Page 58: Introduction to the Hadoop Ecosystem

Pattern #3: Enrich data

Page 59: Introduction to the Hadoop Ecosystem

Bringing it all together…

Page 60: Introduction to the Hadoop Ecosystem

Digital Advertising

Page 61: Introduction to the Hadoop Ecosystem

Ad Serving Architecture

Page 62: Introduction to the Hadoop Ecosystem

What’s next?

Page 63: Introduction to the Hadoop Ecosystem

Hadoop 1.0

Page 64: Introduction to the Hadoop Ecosystem

MapReduce is good for…

Page 65: Introduction to the Hadoop Ecosystem

MapReduce is OK for…

Page 66: Introduction to the Hadoop Ecosystem

MapReduce is not good for…

Page 67: Introduction to the Hadoop Ecosystem

MapReduce limitations

Page 68: Introduction to the Hadoop Ecosystem

Hadoop 2.0: Next-gen platform

Redundant, reliable storage
Cluster resource mgmt. + data processing

Redundant, reliable storage
Cluster resource management
Data processing
Data processing

Page 69: Introduction to the Hadoop Ecosystem

Taking Hadoop beyond batch

Redundant, reliable storage
Cluster resource management
MapReduce
Tez
HOYA
Storm, …
Giraph
Spark
Search, …

Page 70: Introduction to the Hadoop Ecosystem

A brief history of Hadoop 2.0

Page 71: Introduction to the Hadoop Ecosystem

Hadoop 2.0 Projects

Page 72: Introduction to the Hadoop Ecosystem

Hadoop 2.0 Projects

Page 73: Introduction to the Hadoop Ecosystem

YARN: Architecture

ResourceManager
NodeManager (×8)

Page 74: Introduction to the Hadoop Ecosystem

YARN: Architecture

Page 75: Introduction to the Hadoop Ecosystem

YARN: Architecture

ResourceManager
NodeManager (×12)

Page 76: Introduction to the Hadoop Ecosystem

Hadoop 2.0 Projects

Page 77: Introduction to the Hadoop Ecosystem

HDFS Federation

Page 78: Introduction to the Hadoop Ecosystem

HDFS Federation: Architecture

NameNode 1
Namespace 1: logs, finance
Block Management 1: blocks 1, 2, 3, 4

NameNode 2
Namespace 2: insights, reports
Block Management 2: blocks 5, 6, 7, 8

DataNode 1, DataNode 2, DataNode 3, DataNode 4
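With federation, each NameNode owns one namespace, so a client needs some way to find the NameNode responsible for a path. The idea can be sketched as a small mount table. This is a hypothetical illustration and not the HDFS client API: the class `FederationMountTable` and the path prefixes `/logs`, `/finance`, `/insights` and `/reports` are assumptions derived from the namespace names on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical client-side table that maps path prefixes to NameNodes. */
final class FederationMountTable {

    private final Map<String, String> mounts = new LinkedHashMap<>();

    /** Registers the NameNode that serves everything below pathPrefix. */
    void mount(String pathPrefix, String nameNode) {
        mounts.put(pathPrefix, nameNode);
    }

    /** Returns the NameNode whose namespace contains the given path. */
    String resolve(String path) {
        for (Map.Entry<String, String> mount : mounts.entrySet()) {
            String prefix = mount.getKey();
            // Match the directory itself or anything below it, but not "/logsfoo".
            if (path.equals(prefix) || path.startsWith(prefix + "/")) {
                return mount.getValue();
            }
        }
        throw new IllegalArgumentException("No namespace mounted for " + path);
    }

    /** Builds the layout from the slide: two NameNodes, two namespaces each. */
    static FederationMountTable slideExample() {
        FederationMountTable table = new FederationMountTable();
        table.mount("/logs", "NameNode 1");
        table.mount("/finance", "NameNode 1");
        table.mount("/insights", "NameNode 2");
        table.mount("/reports", "NameNode 2");
        return table;
    }

    public static void main(String[] args) {
        FederationMountTable table = slideExample();
        System.out.println(table.resolve("/logs/2013/app.log")); // prints NameNode 1
        System.out.println(table.resolve("/reports/q1"));        // prints NameNode 2
    }
}
```

The DataNodes are shared: on the slide every DataNode stores blocks for both NameNodes, so only the metadata is partitioned.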

Page 79: Introduction to the Hadoop Ecosystem

HDFS: Quorum based storage

Active NameNode: Block Map, Edits File
Standby NameNode: Block Map, Edits File
Journal Node (×3)
DataNode (×5)
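In quorum-based storage, the active NameNode sends every namespace edit to the JournalNodes and treats it as durable once a majority has acknowledged it. The following sketch shows just that majority rule. It is a simplified assumption-laden model, not Hadoop code: `QuorumJournalSketch` and its nested `JournalNode` class are hypothetical, and a boolean flag stands in for network reachability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of writing an edit to a quorum of JournalNodes. */
final class QuorumJournalSketch {

    /** Stand-in for a JournalNode: accepts an edit only while reachable. */
    static final class JournalNode {
        private final boolean reachable;
        private final List<String> edits = new ArrayList<>();

        JournalNode(boolean reachable) {
            this.reachable = reachable;
        }

        boolean append(String edit) {
            if (!reachable) {
                return false;
            }
            edits.add(edit);
            return true;
        }
    }

    /** Sends the edit to all JournalNodes; succeeds if a majority acknowledges. */
    static boolean logEdit(List<JournalNode> journalNodes, String edit) {
        int acks = 0;
        for (JournalNode node : journalNodes) {
            if (node.append(edit)) {
                acks++;
            }
        }
        int majority = journalNodes.size() / 2 + 1;
        return acks >= majority;
    }

    public static void main(String[] args) {
        List<JournalNode> oneDown = Arrays.asList(
            new JournalNode(true), new JournalNode(true), new JournalNode(false));
        List<JournalNode> twoDown = Arrays.asList(
            new JournalNode(true), new JournalNode(false), new JournalNode(false));
        System.out.println(logEdit(oneDown, "mkdir /logs")); // prints true
        System.out.println(logEdit(twoDown, "mkdir /logs")); // prints false
    }
}
```

With the three JournalNodes shown on the slide the majority is two, so the edit log survives the loss of one JournalNode.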

Page 80: Introduction to the Hadoop Ecosystem

Hadoop 2.0 Projects

Page 81: Introduction to the Hadoop Ecosystem

Hive: Current Focus Area

Page 82: Introduction to the Hadoop Ecosystem

Stinger: Extending the sweet spot

Page 83: Introduction to the Hadoop Ecosystem

Stinger Initiative at a glance

Page 84: Introduction to the Hadoop Ecosystem

Tez: The Execution Engine

Page 85: Introduction to the Hadoop Ecosystem

Pig/Hive MR vs. Pig/Hive Tez

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Pig/Hive on MapReduce: Job 1, Job 2, Job 3, separated by I/O synchronization barriers
Pig/Hive on Tez: Single Job

Page 86: Introduction to the Hadoop Ecosystem

Tez Service

Page 87: Introduction to the Hadoop Ecosystem

Tez: Low latency

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Page 88: Introduction to the Hadoop Ecosystem

Stinger: Summary

Page 89: Introduction to the Hadoop Ecosystem

Hadoop 2.0 Applications

Page 90: Introduction to the Hadoop Ecosystem

Hadoop 2.0 Applications

Page 91: Introduction to the Hadoop Ecosystem

MapReduce 2.0

Page 92: Introduction to the Hadoop Ecosystem

Hadoop 2.0 Applications

Page 93: Introduction to the Hadoop Ecosystem

HOYA: HBase on YARN

Page 94: Introduction to the Hadoop Ecosystem

Hadoop 2.0 Applications

Page 95: Introduction to the Hadoop Ecosystem

Twitter Storm

Page 96: Introduction to the Hadoop Ecosystem

Storm: Conceptual view

Page 97: Introduction to the Hadoop Ecosystem

Hadoop 2.0 Applications

Page 98: Introduction to the Hadoop Ecosystem

Spark

Page 99: Introduction to the Hadoop Ecosystem

Data Sharing in Spark

Page 100: Introduction to the Hadoop Ecosystem

Hadoop 2.0 Applications

Page 101: Introduction to the Hadoop Ecosystem

Apache Giraph

Page 102: Introduction to the Hadoop Ecosystem

Hadoop 2.0 Summary

Page 103: Introduction to the Hadoop Ecosystem

Getting started…

Page 104: Introduction to the Hadoop Ecosystem

Hortonworks Sandbox

Page 105: Introduction to the Hadoop Ecosystem

Books about Hadoop

Page 106: Introduction to the Hadoop Ecosystem

The end …or the beginning?