Introduction to the Hadoop Ecosystem
About me
About us
Agenda
Let’s face it…
But on the other hand…
Think about it…
The 3 V’s of Big Data
My favorite definition
Why Hadoop?
How to scale data?
But…
What is Hadoop?
The Hadoop App Store
HDFS, MapReduce, HCat, Pig, Hive, HBase, Ambari, Avro, Cassandra, Chukwa, Intel, Sync, Flume, Hana, Hypertable, Impala, Mahout, Nutch, Oozie, Sqoop, Scribe, Tez, Vertica, Whirr, ZooKeeper
Hortonworks, Cloudera, MapR, EMC, IBM, Talend, Teradata, Pivotal, Informatica, Microsoft, Pentaho, Jasper, Kognitio, Tableau, Splunk, Platfora, Rackspace, Karmasphere, Actuate, MicroStrategy
• HDFS
• MapReduce
• Hadoop Ecosystem
• Hadoop YARN
+
• Test & Packaging
• Installation
• Monitoring
• Business Support
+
• Integrated Environment
• Visualization
• (Near-)Realtime analysis
• Modeling
• ETL & Connectors
Agenda
Data Storage
Hadoop Distributed File System
HDFS Architecture
Data Processing
MapReduce
Typical large-data problem
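MapReduce Flow
[Figure: MapReduce flow. Map output: (a,1) (b,2) (c,3) (c,6) (a,3) (c,2) (b,7) (c,8). After combine: (a,1) (b,2) (c,9) (a,3) (c,2) (b,7) (c,8). After shuffle and sort: a: [1,3], b: [2,7], c: [2,8,9]. Reduce output: (a,4) (b,9) (c,19).]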
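The map, shuffle, and reduce phases of this flow can be tried out without a cluster. The following is a minimal in-memory sketch in plain Java, assuming a word-count style job; the class name MapReduceFlowSketch and its methods are made up for illustration and are not part of the Hadoop API. It leaves out everything Hadoop adds on top (input splits, combiners, partitioning across nodes, fault tolerance).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-process illustration of the MapReduce flow (not Hadoop code).
public class MapReduceFlowSketch {

    // Map phase: emit a (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: collect all emitted values per key, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map, shuffle, and reduce in sequence over the given input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(emitted).entrySet()) {
            result.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {a=2, b=2, c=4}
        System.out.println(run(Arrays.asList("a b c", "c a", "b c c")));
    }
}
```

In a real job each phase runs in parallel on different nodes, and the shuffle moves data over the network; here the three methods only mirror the order of the steps.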
Combined Hadoop Architecture
Word Count Mapper in Java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emit (word, 1) for each whitespace-separated token of the input line.
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Word Count Reducer in Java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  // Sum the counts emitted for one word and write (word, total).
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Agenda
Scripting for Hadoop
Apache Pig
Pig in the Hadoop ecosystem
Hadoop Distributed File System
Distributed Programming Framework
Metadata Management
Scripting
Pig Latin
users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
filteredUsers = FILTER users BY age >= 18 AND age <= 50;
joinResult = JOIN filteredUsers BY name, pages BY user;
grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
sorted = ORDER summed BY clicks DESC;
top10 = LIMIT sorted 10;
STORE top10 INTO 'top10sites';
Pig Execution Plan
Try that with Java…
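To give a feel for the comparison, here is roughly the same logic (filter users aged 18 to 50, join with pages, count clicks per URL, keep the top entries) written in plain Java on in-memory lists. This is only a sketch under stated assumptions: it is not a MapReduce job, the class name TopSitesSketch and the sample data are invented for illustration, and it assumes user names are unique so that the join can be done with a set lookup. A distributed MapReduce version would typically need considerably more code, since the join, the aggregation, and the sort each have to be expressed as map and reduce steps.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical in-memory counterpart of the Pig Latin script (not a Hadoop job).
public class TopSitesSketch {

    // users: rows of {name, age}; pages: rows of {user, url}, as in the Pig script.
    static Map<String, Long> topSites(List<String[]> users, List<String[]> pages, int limit) {
        // FILTER users BY age >= 18 AND age <= 50 (assumes unique user names)
        Set<String> names = users.stream()
                .filter(u -> {
                    int age = Integer.parseInt(u[1]);
                    return age >= 18 && age <= 50;
                })
                .map(u -> u[0])
                .collect(Collectors.toSet());

        // JOIN filteredUsers BY name, pages BY user; GROUP BY url; COUNT
        Map<String, Long> clicks = pages.stream()
                .filter(p -> names.contains(p[0]))
                .collect(Collectors.groupingBy(p -> p[1], Collectors.counting()));

        // ORDER BY clicks DESC (ties broken by URL); LIMIT
        return clicks.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(
                new String[] {"alice", "30"},
                new String[] {"bob", "17"},
                new String[] {"carol", "45"});
        List<String[]> pages = Arrays.asList(
                new String[] {"alice", "a.example"},
                new String[] {"carol", "a.example"},
                new String[] {"bob", "b.example"},
                new String[] {"carol", "c.example"});
        // Prints {a.example=2, c.example=1}
        System.out.println(topSites(users, pages, 10));
    }
}
```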
SQL for Hadoop
Apache Hive
Hive in the Hadoop ecosystem
Hadoop Distributed File System
Distributed Programming Framework
Metadata Management
Scripting
Query
Hive Architecture
Hive Example
CREATE TABLE users (name STRING, age INT);
CREATE TABLE pages (user STRING, url STRING);
LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;
-- ORDER BY gives a global ordering; SORT BY only orders rows within each reducer.
SELECT pages.url, count(*) AS clicks
FROM users JOIN pages ON (users.name = pages.user)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
ORDER BY clicks DESC
LIMIT 10;
But wait, there’s still more!
Data storage
Data processing
Metadata Management
Scripting
SQL-like queries
NoSQL Database
Machine Learning
Cluster Coordination
Import & Export of relational data
Cluster installation & management
Workflow automatization
Import & Export of data flows
Agenda
Classical enterprise platform
Big Data Platform
Pattern #1: Refine data
Pattern #2: Explore data
Pattern #3: Enrich data
Bringing it all together…
Digital Advertising
Ad Serving Architecture
What’s next?
Hadoop 1.0
MapReduce is good for…
MapReduce is OK for…
MapReduce is not good for…
MapReduce limitations
Hadoop 2.0: Next-gen platform
Cluster resource mgmt. + data processing
Redundant, reliable storage
Data processing
Data processing
Cluster resource management
Redundant, reliable storage
Taking Hadoop beyond batch
MapReduce, Tez, HOYA, Storm, …, Giraph, Spark, Search, …
Cluster resource management
Redundant, reliable storage
A brief history of Hadoop 2.0
Hadoop 2.0 Projects
YARN: Architecture
[Figure: one ResourceManager and multiple NodeManagers]
Hadoop 2.0 Projects
HDFS Federation
HDFS Federation: Architecture
NameNode 1: Namespace 1 (logs, finance), Block Management 1 (blocks 1, 2, 3, 4)
NameNode 2: Namespace 2 (insights, reports), Block Management 2 (blocks 5, 6, 7, 8)
DataNode 1, DataNode 2, DataNode 3, DataNode 4
HDFS: Quorum based storage
Active NameNode (Block Map, Edits File)
Standby NameNode (Block Map, Edits File)
Journal Node, Journal Node, Journal Node
DataNode, DataNode, DataNode, DataNode, DataNode
Hadoop 2.0 Projects
Hive: Current Focus Area
Stinger: Extending the sweet spot
Stinger Initiative at a glance
Tez: The Execution Engine
Pig/Hive MR vs. Pig/Hive Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
I/O Synchronization Barrier
I/O Synchronization Barrier
Job 1
Job 2
Job 3
Single Job
Tez Service
Tez: Low latency
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
Stinger: Summary
Hadoop 2.0 Applications
MapReduce 2.0
Hadoop 2.0 Applications
HOYA: HBase on YARN
Hadoop 2.0 Applications
Twitter Storm
Storm: Conceptual view
Hadoop 2.0 Applications
Spark
Data Sharing in Spark
Hadoop 2.0 Applications
Apache Giraph
Hadoop 2.0 Summary
Getting started…
Hortonworks Sandbox
Books about Hadoop
The end…or the beginning?