
Hadoop World 2011: Integrating Hadoop with Enterprise RDBMS Using Apache Sqoop and Other Tools - Guy Harrison, Quest Software & Arvind Prabhakar, Cloudera

November 2011

Apache Sqoop (Incubating)

Integrating Hadoop with Enterprise RDBMS – Part I

Arvind Prabhakar (arvind at apache dot org)

Apache Sqoop Committer and Software Engineer at Cloudera


©2011 Quest Software, Inc. All rights reserved.

Hadoop Data Processing


In This Session…

How Sqoop Works

Roadmap


Data Import


Sqoop Overview


Pre-processing


Code Generation


Type Mapping


Data Transfer


Post-Processing


Sqoop Connectors

Oracle – Developed by Quest Software

Couchbase – Developed by Couchbase

Netezza – Developed by Cloudera

Teradata – Developed by Cloudera

SQL Server – Developed by Microsoft

Microsoft PDW – Developed by Microsoft

VoltDB – Developed by VoltDB


Sqoop Roadmap

SQOOP-365: Proposal for Sqoop 2.0

• https://issues.apache.org/jira/browse/SQOOP-365

Highlights

• Sqoop as a Service

• Connections as First Class Objects

• Role based Security


Sqoop 2 Architecture (proposed)


Thank You!

Q & A will be after part II of this session.


Guy Harrison, Quest Software

Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools


Introductions


Agenda

• Scenarios for RDBMS-Hadoop interaction

• Case study: Quest extension to SQOOP

• Other RDBMS-Hadoop integrations


Hadoop meets RDBMS – scenarios


Scenario #1: Reference data in RDBMS

[Diagram labels: CUSTOMERS, WEBLOGS, PRODUCTS, HDFS, RDBMS]


Scenario #2: Hadoop for off-line analytics

[Diagram labels: CUSTOMERS, PRODUCTS, HDFS, RDBMS, SALES HISTORY]


Scenario #3: MapReduce output to RDBMS

[Diagram labels: WEBLOGS SUMMARY, RDBMS, DB QUERY TOOL, WEBLOGS, HDFS]


Scenario #4: Hadoop as RDBMS “active archive”

[Diagram labels: SALES 2011, HDFS, RDBMS, QUERY TOOL, SALES 2010, SALES 2009, SALES 2008, SALES 2009, SALES 2008]


Case Study: extending SQOOP for Oracle


SQOOP extensibility

• SQOOP implements a generic approach to RDBMS/Hadoop data transfer

• But database optimization is highly platform specific

• Each RDBMS has distinct optimization strategies

• For Oracle, optimization requires:

• Bypassing Oracle caching layers

• Avoiding Oracle optimizer meddling

• Exploiting Oracle metadata to balance mapper load


Reading from Oracle – default SQOOP

[Diagram labels: CACHE; ORACLE TABLE; index blocks; RANGE SCAN; MAPPER with ORACLE SESSION (ID > 0 and ID < MAX/2); MAPPER with ORACLE SESSION (ID > MAX/2)]
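
The ID-range predicates on this slide (ID > 0 and ID < MAX/2; ID > MAX/2) suggest that each mapper is given one contiguous range of key values. The slides do not show the splitting code, so the following is only a minimal Python sketch of that idea. The function names, the `ID` column default, and the sample bounds are assumptions made for illustration, not Sqoop's API.

```python
def split_id_range(min_id, max_id, num_mappers):
    """Divide [min_id, max_id] into contiguous, near-equal ranges, one per mapper."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    lo = min_id
    for i in range(num_mappers):
        # The first `extra` mappers take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than keys: skip empty ranges
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


def where_clauses(ranges, column="ID"):
    """Render each range as the predicate a mapper's query would carry."""
    return [f"{column} >= {lo} AND {column} <= {hi}" for lo, hi in ranges]


# Hypothetical table with IDs 1..100, imported by 4 mappers.
splits = split_id_range(1, 100, 4)
for clause in where_clauses(splits):
    print(clause)
```

Equal key ranges do not guarantee equal row counts or equal numbers of physical blocks per mapper, which is the load-balancing concern raised for Oracle.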


Oracle – parallelism gone bad (1)

[Diagram labels: Oracle SALES table; HDFS; four Hadoop Mappers]


Oracle – parallelism gone bad (2)

[Diagram labels: ORACLE TABLE; HDFS; four HADOOP MAPPERs]


Ideal architecture

[Diagram labels: ORACLE TABLE; HDFS; four HADOOP MAPPERs, each with its own ORACLE SESSION]


Design goals

• Partition data based on physical storage

• Bypass Oracle buffering

• Bypass Oracle parallelism

• Do not require or use indexes

• Never read the same data block more than once

• Support Oracle datatypes
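
Two of these goals, partitioning by physical storage and never reading a block twice, can be illustrated with a small allocation sketch. This is not the connector's code. It assumes a hypothetical list of storage extents with block counts and uses a simple greedy rule (largest extent first, to the least-loaded mapper) as one possible way to balance the work.

```python
import heapq


def allocate_extents(extents, num_mappers):
    """Assign each (extent_id, block_count) pair to exactly one mapper.

    Greedy rule: take extents from largest to smallest and give each one
    to the mapper that currently holds the fewest blocks.
    """
    heap = [(0, m) for m in range(num_mappers)]  # (blocks assigned, mapper index)
    heapq.heapify(heap)
    assignment = {m: [] for m in range(num_mappers)}
    for extent_id, blocks in sorted(extents, key=lambda e: e[1], reverse=True):
        load, mapper = heapq.heappop(heap)
        assignment[mapper].append(extent_id)
        heapq.heappush(heap, (load + blocks, mapper))
    return assignment


# Hypothetical extents: (identifier, number of data blocks).
extents = [("e1", 128), ("e2", 64), ("e3", 64), ("e4", 8), ("e5", 120)]
plan = allocate_extents(extents, 2)

block_counts = dict(extents)
loads = [sum(block_counts[e] for e in plan[m]) for m in sorted(plan)]
print(plan, loads)
```

Because every extent appears in exactly one mapper's list, no block is scanned more than once, and no index is needed to decide which mapper reads what.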


Import Throughput

[Chart: elapsed time (ms, 0 to 7,000) versus number of mappers (0 to 35); series: SQOOP, SQOOP with Quest Connector]


[Bar chart: pct reduction (axis 0 to 100) in elapsed time, CPU time, network round trips, IO requests, and IO time; bar values as extracted: 80.84, 89.72, 98.95, 99.08, 98.71]

16 mappers, 50M rows, 50 GB clustered data


Export Throughput

[Chart: seconds (500 to 3,000) versus number of mappers (0 to 25); series: SQOOP, SQOOP with Quest Connect]


Export load

[Chart: database time (s, 0 to 30,000) versus number of mappers (0 to 30); series: SQOOP, SQOOP with Quest connect]


Working with the SQOOP framework

• SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:

• Extend ManagerFactory (what to handle)

• Extend ConnManager (DB connection and metadata)

• For imports:

• Extend DataDrivenDBInputFormat (gets the data)

• Data allocation (getSplits())

• Split serialization (“io.serializations” property)

• Data access logic (createDBRecordReader(), getSelectQuery())

• Implement progress (nextKeyValue(), getProgress())

• Similar procedure for extending exports
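
The division of responsibilities listed above can be mimicked in a few lines. The sketch below is a Python analogue only: every class and method name in it is hypothetical, and it does not reproduce the signatures of the Java classes named on the slide (ManagerFactory, ConnManager, DataDrivenDBInputFormat). It shows the roles: a factory that decides what to handle, a manager for connection and metadata, an input format that allocates data, and a record reader that reports progress.

```python
class ToyOracleManager:
    """Role of the connection manager: connection details and metadata."""

    def __init__(self, connect_string):
        self.connect_string = connect_string

    def column_names(self, table):
        # Hard-coded fake metadata; a real manager would query the database.
        return {"SALES": ["ID", "AMOUNT"]}.get(table, [])


class ToyOracleManagerFactory:
    """Role of the manager factory: decide which connections to handle."""

    def accept(self, connect_string):
        if connect_string.startswith("jdbc:oracle:"):
            return ToyOracleManager(connect_string)
        return None  # leave other databases to another factory


class ToyInputFormat:
    """Role of the input format: allocate the data among mappers."""

    def __init__(self, rows):
        self.rows = rows

    def get_splits(self, num_mappers):
        # Strided allocation keeps the example short; any scheme would do.
        return [self.rows[i::num_mappers] for i in range(num_mappers)]


class ToyRecordReader:
    """Role of the record reader: iterate over one split and report progress."""

    def __init__(self, split):
        self.split = split
        self.pos = 0
        self.current = None

    def next_key_value(self):
        if self.pos >= len(self.split):
            return False
        self.current = self.split[self.pos]
        self.pos += 1
        return True

    def get_progress(self):
        return 1.0 if not self.split else self.pos / len(self.split)


factory = ToyOracleManagerFactory()
manager = factory.accept("jdbc:oracle:thin:@//db.example.com:1521/ORCL")
splits = ToyInputFormat(list(range(10))).get_splits(3)
reader = ToyRecordReader(splits[0])
progress_seen = []
while reader.next_key_value():
    progress_seen.append(reader.get_progress())
print(manager.column_names("SALES"), splits, progress_seen)
```

The host name and table in the usage lines are placeholders. The point is that the database-specific pieces are small and isolated from the job machinery that runs them.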


Extensions to native SQOOP

• MERGE functionality

• Update if exists, insert otherwise

• Hive connector

• Source defined as HQL query rather than HDFS directory

• Eclipse UI
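
The "update if exists, insert otherwise" behaviour of the MERGE extension can be demonstrated in miniature. The sketch below is not the connector's implementation. It uses Python's built-in sqlite3 module and a hypothetical `sales` table purely so the example is runnable, and it emulates the merge with an UPDATE followed by a conditional INSERT.

```python
import sqlite3


def merge_row(conn, key, amount):
    """Update the row with this key if it exists; otherwise insert it."""
    cur = conn.execute("UPDATE sales SET amount = ? WHERE id = ?", (amount, key))
    if cur.rowcount == 0:  # no existing row was updated
        conn.execute("INSERT INTO sales (id, amount) VALUES (?, ?)", (key, amount))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO sales VALUES (1, 10.0)")  # pre-existing row

# Exported records: id 1 already exists, id 2 is new.
for key, amount in [(1, 15.0), (2, 20.0)]:
    merge_row(conn, key, amount)
conn.commit()

rows = conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()
print(rows)
```

After the loop, the existing row holds the new amount and the new row has been added, so re-running an export does not create duplicate keys.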


Availability

• Apache-licensed source available from:

https://github.com/QuestSoftwareTCD/OracleSQOOPconnector

• Download from (Quest):

http://www.quest.com/hadoop/

• Download from (Cloudera):

http://ccp.cloudera.com/display/SUPPORT/Downloads


Other SQOOP connectors

• Microsoft SQL Server:

• http://www.microsoft.com/download/en/details.aspx?id=27584

• Teradata:

• https://ccp.cloudera.com/display/con/Cloudera+Connector+for+Teradata+User+Guide%2C+version+1.0-beta-u4

• MicroStrategy:

• https://ccp.cloudera.com/display/con/MicroStrategy+Free+Download+License+Agreement

• Netezza:

• https://ccp.cloudera.com/display/con/Netezza+Free+Download+License+Agreement

• VoltDB:

• http://voltdb.com/company/blog/sqoop-voltdb-export-and-hadoop-integration


Other Hadoop – RDBMS integrations


Oracle Big Data Appliance

• 18 Sun X4270 M2 servers

• 48GB per node (864GB total)

• 2x6 Core CPU per node (216 cores total)

• 12x2TB HDD per node (216 spindles, 432 TB)

• 40Gb/s Infiniband between nodes

• 10Gb/s Ethernet to datacenter

• Apache Hadoop

• Oracle NoSQL

• Oracle loader for Hadoop

• Multi-stage C-optimized unidirectional loader

www.oracle.com/us/bigdata/index.html


[Diagram labels: Oracle Exadata, Oracle Exalogic, Oracle Big Data Appliance, Oracle NoSQL, Oracle Loader for Hadoop, Apache Hadoop, Oracle RDBMS, Oracle WebLogic, Oracle Exalytics, Oracle Essbase, Oracle TimesTen]


Microsoft


Hadapt

• Formerly HadoopDB – Hadoop/Postgres hybrid

• Postgres servers on data nodes allow for accelerated (indexed) Hive queries

• Extensions to the Hive optimizer

http://www.hadapt.com/


Greenplum

• SQL based access to HDFS data via in-DB MapReduce

http://www.greenplum.com/sites/default/files/EMC_Greenplum_Hadoop_DB_TB_0.pdf


Toad for Cloud Databases

• Federated SQL queries across Hive, HBase, NoSQL, RDBMS


Conclusions

• RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption

• SQOOP provides a good general purpose framework for transferring data between any JDBC database and Hadoop

• We’d like to see it become a standard

• Each RDBMS offers distinct tuning opportunities, so optimized SQOOP extensions offer real value

• Hadoop-RDBMS integration projects are proliferating rapidly
