Hadoop in Three Use Cases

2 December 2011

Joey Echeverria | Solutions Architect

[email protected] | @fwiffo



Page 2: Hadoop in three use cases

©2011 Cloudera, Inc. All Rights Reserved.

About Joey

• Solutions Architect
• 6 months
• 3+ years
• Local

Page 3: Hadoop in three use cases

Cloudera’s Distribution including Apache Hadoop

• Coordination: Apache ZooKeeper
• Data Integration: Apache Flume*, Apache Sqoop*
• Fast Read/Write Access: Apache HBase
• Languages / Compilers: Apache Pig, Apache Hive
• Workflow: Apache Oozie*
• Scheduling: Apache Oozie*
• Metadata: Apache Hive
• File System Mount: FUSE-DFS
• UI Framework: HUE
• SDK: HUE SDK

*currently under incubation in the Apache Software Foundation

Page 4: Hadoop in three use cases

Extract, Transform, and Load

Page 5: Hadoop in three use cases

ETL before Hadoop

Difficult to maintain, not scalable

[Diagram components: Logs, Files, Relational Databases, Custom ETL Scripts, Enterprise Data Warehouse]

Page 6: Hadoop in three use cases

ETL before Hadoop

May be scalable, expensive

[Diagram components: Logs, Files, Relational Databases, Enterprise Data Warehouse. SQL: raw table → warehouse tables]

Page 7: Hadoop in three use cases

ETL with Hadoop

Managed, flexible, scalable

[Diagram components: Logs, Files, Relational Databases, Enterprise Data Warehouse]

Page 8: Hadoop in three use cases

Steps

1. In
2. Process
3. Out

Page 9: Hadoop in three use cases

Flume

Page 10: Hadoop in three use cases

Flume

Page 11: Hadoop in three use cases

ETL with Hadoop

Managed, flexible, scalable

[Diagram components: Logs, Files, Relational Databases, Flume, Enterprise Data Warehouse]

Page 12: Hadoop in three use cases

HDFS

Page 13: Hadoop in three use cases

HDFS

[Diagram: a Client, a NameNode, and DataNodes 01 through 12. The Client calls open("file.txt") on the NameNode, the NameNode answers with DataNodes 02, 06, 10, and data then moves between the Client and those three DataNodes.]

Page 14: Hadoop in three use cases

HDFS

• Distributed
• Replication
• Bulk I/O
• Fault tolerant
• Scalable
• Append only
• Not POSIX
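The read path from the HDFS diagram, together with the "Replication" and "Fault tolerant" bullets, can be sketched in a few lines of Python. This is a toy model, not the HDFS client API: the NameNode and DataNode classes and the read_file helper are hypothetical, and the sketch assumes that DataNodes 02, 06, and 10 each hold a full replica of file.txt.

```python
class NameNode:
    """Toy NameNode: stores metadata only (which DataNodes hold a file)."""

    def __init__(self):
        self.locations = {}

    def add_file(self, path, datanode_ids):
        self.locations[path] = list(datanode_ids)

    def open(self, path):
        # Returns DataNode ids, never the file contents.
        return list(self.locations[path])


class DataNode:
    """Toy DataNode: stores the actual bytes."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}
        self.alive = True

    def read(self, path):
        if not self.alive:
            raise ConnectionError(f"DataNode {self.node_id} is down")
        return self.blocks[path]


def read_file(namenode, datanodes, path):
    """Ask the NameNode where the file lives, then read from the first live replica."""
    for node_id in namenode.open(path):
        try:
            return datanodes[node_id].read(path)
        except ConnectionError:
            continue  # fall through to the next replica
    raise IOError(f"no live replica for {path}")


# Twelve DataNodes, as in the diagram; replicas on 02, 06, and 10 (assumed).
datanodes = {f"{i:02d}": DataNode(f"{i:02d}") for i in range(1, 13)}
namenode = NameNode()
namenode.add_file("file.txt", ["02", "06", "10"])
for node_id in ["02", "06", "10"]:
    datanodes[node_id].blocks["file.txt"] = b"data"

datanodes["02"].alive = False  # simulate a failed node
result = read_file(namenode, datanodes, "file.txt")
```

Even with DataNode 02 marked as down, the read succeeds from the next replica, which is the point of keeping several copies.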

Page 15: Hadoop in three use cases

ETL with Hadoop

Managed, flexible, scalable

[Diagram components: Logs, Files, Relational Databases, Flume, HDFS, Enterprise Data Warehouse]

Page 16: Hadoop in three use cases

FUSE-DFS

Page 17: Hadoop in three use cases

FUSE-DFS

• FUSE
  – User space
  – File systems
• FUSE-DFS
  – /hdfs
  – Mostly transparent
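"Mostly transparent" means that programs can use ordinary file calls on the mounted path. The sketch below illustrates that idea with plain Python file operations. The helper names are hypothetical, and a temporary directory stands in for the /hdfs mount point so the example runs without a cluster.

```python
import os
import tempfile


def list_files(mount_point):
    """Walk a mounted directory tree using nothing but standard file calls."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(mount_point):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.append(os.path.relpath(full_path, mount_point))
    return sorted(found)


def read_first_line(path):
    """Read one line with the built-in open(); no Hadoop-specific API involved."""
    with open(path, "r") as handle:
        return handle.readline().rstrip("\n")


# A temporary directory plays the role of the /hdfs mount (assumption).
with tempfile.TemporaryDirectory() as mount_point:
    os.makedirs(os.path.join(mount_point, "logs"))
    log_path = os.path.join(mount_point, "logs", "app.log")
    with open(log_path, "w") as handle:
        handle.write("first line\nsecond line\n")

    files = list_files(mount_point)
    first = read_first_line(log_path)
```

On a real FUSE-DFS mount the same two helpers would be pointed at /hdfs instead, subject to the HDFS limits listed earlier (append only, not POSIX).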

Page 18: Hadoop in three use cases

ETL with Hadoop

Managed, flexible, scalable

[Diagram components: Logs, Files, Relational Databases, Flume, FUSE-DFS, HDFS, Enterprise Data Warehouse]

Page 19: Hadoop in three use cases

Sqoop

Page 20: Hadoop in three use cases

Sqoop

• SQL to Hadoop
• Parallel import
• File formats
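A rough sketch of how a parallel import can divide the work: take the minimum and maximum of a numeric key column and hand each mapper a contiguous slice. Sqoop exposes this idea through its --split-by and --num-mappers options; the code below is a simplified illustration rather than Sqoop's own implementation, and the table name orders and column name id are hypothetical.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous slices."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    low = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first slices
        if size == 0:
            continue  # more mappers than keys
        high = low + size - 1
        ranges.append((low, high))
        low = high + 1
    return ranges


def split_queries(table, column, ranges):
    """Build one bounded SELECT per slice; each would run in its own mapper."""
    return [
        f"SELECT * FROM {table} WHERE {column} >= {low} AND {column} <= {high}"
        for low, high in ranges
    ]


ranges = split_ranges(1, 10, 4)
queries = split_queries("orders", "id", ranges)
```

Each query covers a disjoint slice of the key space, so the mappers can pull rows from the database at the same time without overlapping.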

Page 21: Hadoop in three use cases

ETL with Hadoop

Managed, flexible, scalable

[Diagram components: Logs, Files, Relational Databases, Flume, FUSE-DFS, Sqoop, HDFS, Enterprise Data Warehouse]

Page 22: Hadoop in three use cases

Pig

Page 23: Hadoop in three use cases

Pig

• Scripting language
• Generates MapReduce jobs
• Perl for Hadoop
• Great for ETL

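-- Load three integer fields per record with the default PigStorage loader.
A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
-- Group the records by f1; each group carries a bag of the matching A tuples.
B = GROUP A BY f1;
-- Count the tuples in each group's bag A.
C = FOREACH B GENERATE COUNT(A);
DUMP C;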
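For readers who do not know Pig Latin, the same group-and-count logic looks roughly like this in plain Python. The sample rows are invented for illustration, and unlike the Pig script the sketch keeps the group key next to each count so the result is easier to read.

```python
from collections import Counter

# Hypothetical stand-in for the contents of 'data': (f1, f2, f3) tuples.
rows = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]


def count_by_first_field(records):
    """Group records by f1 and count how many records fall into each group."""
    counts = Counter(f1 for f1, _f2, _f3 in records)
    return sorted(counts.items())


group_counts = count_by_first_field(rows)
```

Pig turns the equivalent script into MapReduce jobs, so the same logic can run over data far larger than one machine's memory.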

Page 24: Hadoop in three use cases

ETL with Hadoop

Managed, flexible, scalable

[Diagram components: Logs, Files, Relational Databases, Flume, FUSE-DFS, Sqoop, HDFS, Pig, Enterprise Data Warehouse]

Page 25: Hadoop in three use cases

Sqoop with connectors

Page 26: Hadoop in three use cases

Sqoop with connectors

• MySQL*
• PostgreSQL*
• Teradata*
• Netezza*
• Oracle*
• Couchbase*
• Microsoft SQL Server
• VoltDB

*Cloudera certified connector

Page 27: Hadoop in three use cases

ETL with Hadoop

Managed, flexible, scalable

[Diagram components: Logs, Files, Relational Databases, Flume, FUSE-DFS, Sqoop (shown twice), HDFS, Pig, Enterprise Data Warehouse]

Page 28: Hadoop in three use cases

Recommendations

Page 29: Hadoop in three use cases

Recommendations with Hadoop

[Diagram components: Logs, Relational Databases, Web Application, CUSTOMERS]

Page 30: Hadoop in three use cases

Flume

Page 31: Hadoop in three use cases

Recommendations with Hadoop

[Diagram components: Logs, Relational Databases, Flume, Web Application, CUSTOMERS]

Page 32: Hadoop in three use cases

HDFS

Page 33: Hadoop in three use cases

Recommendations with Hadoop

[Diagram components: Logs, Relational Databases, Flume, HDFS, Web Application, CUSTOMERS]

Page 34: Hadoop in three use cases

Sqoop

Page 35: Hadoop in three use cases

Recommendations with Hadoop

[Diagram components: Logs, Relational Databases, Flume, Sqoop, HDFS, Web Application, CUSTOMERS]

Page 36: Hadoop in three use cases

Pig

Page 37: Hadoop in three use cases

Recommendations with Hadoop

[Diagram components: Logs, Relational Databases, Flume, Sqoop, HDFS, Pig, Web Application, CUSTOMERS]

Page 38: Hadoop in three use cases

Mahout

Page 39: Hadoop in three use cases

Mahout

• Scalable machine learning algorithms
  – Collaborative Filtering
  – User and Item based recommenders
  – K-Means, Fuzzy K-Means clustering
  – Mean Shift clustering
  – Singular value decomposition
  – Complementary Naive Bayes classifier
  – …
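To give a feel for what an item-based recommender does, here is a minimal co-occurrence sketch in Python. It is not Mahout's API (Mahout is a Java library): the function names and the purchase data are hypothetical, and the scoring is the simplest possible variant, counting how often two items appear together.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(purchases):
    """Count how often each pair of items shows up in the same user's history."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in purchases.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user, purchases, top_n=2):
    """Score items the user lacks by their co-occurrence with items the user has."""
    counts = cooccurrence(purchases)
    owned = purchases[user]
    scores = defaultdict(int)
    for item in owned:
        for other, n in counts[item].items():
            if other not in owned:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _score in ranked[:top_n]]


# Invented purchase histories.
purchases = {
    "alice": {"book", "lamp"},
    "bob": {"book", "lamp", "desk"},
    "carol": {"book", "desk"},
    "dave": {"lamp", "chair"},
}
suggestions = recommend("alice", purchases)
```

Mahout's value is in running this kind of computation as distributed jobs over data sets that do not fit on one machine.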

Page 40: Hadoop in three use cases

Recommendations with Hadoop

[Diagram components: Logs, Relational Databases, Flume, Sqoop, HDFS, Pig, Mahout, Web Application, CUSTOMERS]

Page 41: Hadoop in three use cases

MapReduce

Page 42: Hadoop in three use cases

MapReduce

[Diagram: three phases labeled map, shuffle, and reduce. In the map phase, toOne() turns each of nine input records into a key:1 pair. The shuffle groups the pairs by key into the lists [1,1,1,1], [1,1], [1,1], and [1]. In the reduce phase, count() turns those lists into 4, 2, 2, and 1.]

Page 43: Hadoop in three use cases

MapReduce

• Distributed
• Code to data
• Reliable
• Scalable
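The map, shuffle, and reduce phases from the MapReduce diagram can be imitated on a single machine in Python. The function names to_one and count mirror the toOne() and count() labels in the diagram. The record values "a" through "d" are placeholders, since the diagram's keys did not survive in the transcript, and run_job is a hypothetical driver rather than a Hadoop API.

```python
from collections import defaultdict


def to_one(record):
    """Map: emit a (key, 1) pair for every input record."""
    return (record, 1)


def shuffle(pairs):
    """Shuffle: collect all values that share a key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def count(key, values):
    """Reduce: collapse a key's list of ones into a single total."""
    return (key, sum(values))


def run_job(records):
    mapped = [to_one(record) for record in records]
    grouped = shuffle(mapped)
    return dict(count(key, values) for key, values in grouped.items())


# Nine records that reproduce the 4, 2, 2, 1 totals from the diagram.
records = ["a", "a", "a", "a", "b", "b", "c", "c", "d"]
totals = run_job(records)
```

In Hadoop the map and reduce calls run on many machines, next to the data they read, and the framework performs the shuffle between them.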

Page 44: Hadoop in three use cases

Recommendations with Hadoop

[Diagram components: Logs, Relational Databases, Flume, Sqoop, HDFS, Pig (shown twice), Mahout, MapReduce, Web Application, CUSTOMERS]

Page 45: Hadoop in three use cases

Oozie

Page 46: Hadoop in three use cases

Oozie

• Workflows
• Coordinator
  – Triggers
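An Oozie workflow is a graph of actions with dependencies, and a coordinator starts workflows when a trigger fires. The Python sketch below shows both ideas in miniature. It is not Oozie's configuration format (real workflows and coordinators are defined in XML), and the action names, dataset names, and helper functions are hypothetical.

```python
def run_order(dependencies):
    """Return an order in which every action runs after the actions it depends on.

    dependencies maps each action name to the set of actions that must finish first.
    """
    remaining = {action: set(deps) for action, deps in dependencies.items()}
    order = []
    while remaining:
        ready = sorted(action for action, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("workflow contains a cycle")
        for action in ready:
            order.append(action)
            del remaining[action]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


def should_trigger(required_inputs, available_inputs):
    """Coordinator-style trigger: fire only when every required input exists."""
    return set(required_inputs) <= set(available_inputs)


# Hypothetical workflow for the recommendations use case.
workflow = {
    "collect-logs": set(),
    "sqoop-import": set(),
    "pig-clean": {"collect-logs", "sqoop-import"},
    "mahout-train": {"pig-clean"},
    "hbase-load": {"mahout-train"},
}
order = run_order(workflow)
ready_now = should_trigger(["logs/2011-12-02"], ["logs/2011-12-01", "logs/2011-12-02"])
```

Here the trigger waits for a day's log directory to appear before the workflow is allowed to start.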

Page 47: Hadoop in three use cases

Recommendations with Hadoop

[Diagram components: Logs, Relational Databases, Flume, Sqoop, HDFS, Pig (shown twice), Mahout, MapReduce, Oozie, Web Application, CUSTOMERS]

Page 48: Hadoop in three use cases

HBase

Page 49: Hadoop in three use cases

HBase

• Key/value store
• Data stored in HDFS
• Access model is get/put/del
  – Plus range scans and versions
• Random reads and writes for Hadoop
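The access model in the bullets above (get, put, delete, range scans, versions) can be pictured with a small in-memory table. This is a toy illustration, not the HBase client API: the ToyTable class is hypothetical, it ignores column families, and it deletes rows outright instead of writing delete markers.

```python
import bisect


class ToyTable:
    """In-memory sketch of a sorted key/value table with versioned cells."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.rows = {}         # row key -> list of (timestamp, value), newest first
        self.sorted_keys = []  # row keys kept in sorted order for range scans

    def put(self, key, value, timestamp):
        if key not in self.rows:
            bisect.insort(self.sorted_keys, key)
            self.rows[key] = []
        versions = self.rows[key]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # keep only the newest versions

    def get(self, key, versions=1):
        return [value for _ts, value in self.rows.get(key, [])[:versions]]

    def delete(self, key):
        if key in self.rows:
            del self.rows[key]
            self.sorted_keys.remove(key)

    def scan(self, start, stop):
        """Return (key, newest value) for keys in the half-open range [start, stop)."""
        low = bisect.bisect_left(self.sorted_keys, start)
        high = bisect.bisect_left(self.sorted_keys, stop)
        return [(key, self.rows[key][0][1]) for key in self.sorted_keys[low:high]]


table = ToyTable()
table.put("user1", "a", timestamp=1)
table.put("user1", "b", timestamp=2)
table.put("user2", "c", timestamp=1)
table.put("user3", "d", timestamp=1)

latest = table.get("user1")
history = table.get("user1", versions=2)
first_scan = table.scan("user1", "user3")
table.delete("user2")
second_scan = table.scan("user1", "user4")
```

Keeping row keys sorted is what makes a range scan cheap: the scan reads one contiguous slice instead of checking every key.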

Page 50: Hadoop in three use cases

Recommendations with Hadoop

[Diagram components: Logs, Relational Databases, Flume, Sqoop, HDFS, Pig (shown twice), Mahout, MapReduce, Oozie, HBase, Web Application, CUSTOMERS]

Page 51: Hadoop in three use cases

Business Intelligence

Page 52: Hadoop in three use cases

Business Intelligence with Hadoop

[Diagram components: Logs, Relational Databases, BI / Analytics, ANALYSTS]

Page 53: Hadoop in three use cases

Flume

Page 54: Hadoop in three use cases

Business Intelligence with Hadoop

[Diagram components: Logs, Relational Databases, Flume, BI / Analytics, ANALYSTS]

Page 55: Hadoop in three use cases

HDFS

Page 56: Hadoop in three use cases

Business Intelligence with Hadoop

[Diagram components: Logs, Relational Databases, Flume, HDFS, BI / Analytics, ANALYSTS]

Page 57: Hadoop in three use cases

Sqoop

Page 58: Hadoop in three use cases

Business Intelligence with Hadoop

[Diagram components: Logs, Relational Databases, Flume, Sqoop, HDFS, BI / Analytics, ANALYSTS]

Page 59: Hadoop in three use cases

Hive

Page 60: Hadoop in three use cases

Hive

• Data warehouse
• Ad-hoc queries
  – Not real-time (minutes)
• SQL
• Tables
• Joins
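The kind of ad-hoc query an analyst might run, a join followed by an aggregate, is sketched below. Python's built-in sqlite3 module stands in for Hive here so the example runs anywhere. The tables page_views and users and their contents are hypothetical, and while a query of this shape should read much the same in Hive's SQL dialect, Hive would execute it as batch jobs taking minutes rather than returning immediately.

```python
import sqlite3

# In-memory database as a stand-in for Hive tables (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER, country TEXT);
    CREATE TABLE page_views (user_id INTEGER, url TEXT);
    INSERT INTO users VALUES (1, 'US'), (2, 'DE');
    INSERT INTO page_views VALUES (1, '/a'), (1, '/b'), (2, '/a');
    """
)

# Join the two tables and count page views per country.
query = """
    SELECT u.country, COUNT(*) AS views
    FROM page_views v
    JOIN users u ON v.user_id = u.user_id
    GROUP BY u.country
    ORDER BY u.country
"""
views_by_country = conn.execute(query).fetchall()
conn.close()
```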

Page 61: Hadoop in three use cases

Business Intelligence with Hadoop

[Diagram components: Logs, Relational Databases, Flume, Sqoop, HDFS, Hive, BI / Analytics, ANALYSTS]

Page 62: Hadoop in three use cases

MapReduce

Page 63: Hadoop in three use cases

Business Intelligence with Hadoop

[Diagram components: Logs, Relational Databases, Flume, Sqoop, HDFS, Hive, MapReduce, BI / Analytics, ANALYSTS]

Page 64: Hadoop in three use cases

Oozie

Page 65: Hadoop in three use cases

Business Intelligence with Hadoop

[Diagram components: Logs, Relational Databases, Flume, Sqoop, HDFS, Hive, Oozie, MapReduce, BI / Analytics, ANALYSTS]

Page 66: Hadoop in three use cases

HBase

Page 67: Hadoop in three use cases

Business Intelligence with Hadoop

[Diagram components: Logs, Relational Databases, Flume, Sqoop, HDFS, HBase, Hive, Oozie, MapReduce, BI / Analytics, ANALYSTS]

Page 68: Hadoop in three use cases

Hive

Page 69: Hadoop in three use cases

Hive for Business Intelligence

• JDBC
  – JasperReports*
  – Pentaho*
• ODBC
  – MicroStrategy*^

* Vendor certified connector
^ Cloudera certified connector

Page 70: Hadoop in three use cases

Business Intelligence with Hadoop

[Diagram components: Logs, Relational Databases, Flume, Sqoop, HDFS, Hive (shown twice), HBase, Oozie, MapReduce, BI / Analytics, ANALYSTS]

Page 71: Hadoop in three use cases

CDH

• Coordination: Apache ZooKeeper
• Data Integration: Apache Flume*, Apache Sqoop*
• Fast Read/Write Access: Apache HBase
• Languages / Compilers: Apache Pig, Apache Hive
• Workflow: Apache Oozie*
• Scheduling: Apache Oozie*
• Metadata: Apache Hive
• File System Mount: FUSE-DFS
• UI Framework: HUE
• SDK: HUE SDK

*currently under incubation in the Apache Software Foundation

Page 72: Hadoop in three use cases

What’s next?

• Cloudera Training Videos
• CDH Virtual Machines
• Hadoop: The Definitive Guide, 2nd Edition
• Cloudera University
  – Developer Training in Columbia, MD
    • Dec 13-16, Feb 13-16
  – Administrator Training in Herndon, VA
    • Jan 4-6
  – Private Training

Page 73: Hadoop in three use cases

We’re Hiring!

• http://www.cloudera.com/company/careers/
• Customer Operations
  – Customer Operations Engineer
  – Customer Operations Tools Developer
• Customer Solutions
  – Solutions Architect
• Engineering
  – Senior Data Integration Developer
  – Senior Distributed Systems Engineer
  – Senior UI Engineer
  – Software Quality Engineer
  – Technical Writer
• IT/Operations
  – Systems Administrator
