Hadoop World 2011: Integrating Hadoop with Enterprise RDBMS Using Apache Sqoop and Other Tools - Guy...

Post on 15-Jul-2015


November 2011

Apache Sqoop (Incubating)

Integrating Hadoop with Enterprise RDBMS – Part I

Arvind Prabhakar (arvind at apache dot org)

Apache Sqoop Committer and Software Engineer at Cloudera


©2011 Quest Software, Inc. All rights reserved.

Hadoop Data Processing


In This Session…

How Sqoop Works

Roadmap


Data Import


Sqoop Overview


Pre-processing


Code Generation


Type Mapping


Data Transfer


Post-Processing


Sqoop Connectors

Oracle – Developed by Quest Software

Couchbase – Developed by Couchbase

Netezza – Developed by Cloudera

Teradata – Developed by Cloudera

SQL Server – Developed by Microsoft

Microsoft PDW – Developed by Microsoft

VoltDB – Developed by VoltDB


Sqoop Roadmap

SQOOP-365: Proposal for Sqoop 2.0

• https://issues.apache.org/jira/browse/SQOOP-365

Highlights

• Sqoop as a Service

• Connections as First Class Objects

• Role based Security


Sqoop 2 Architecture (proposed)


Thank You!

Q & A will be after part II of this session.


Guy Harrison, Quest Software

Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools


Introductions


Agenda

• Scenarios for RDBMS-Hadoop interaction

• Case study: Quest extension to SQOOP

• Other RDBMS-Hadoop integrations


Hadoop meets RDBMS – scenarios

Scenario #1: Reference data in RDBMS

[Diagram labels: CUSTOMERS, WEBLOGS, PRODUCTS, HDFS, RDBMS]

Scenario #2: Hadoop for off-line analytics

[Diagram labels: CUSTOMERS, PRODUCTS, SALES HISTORY, HDFS, RDBMS]

Scenario #3: MapReduce output to RDBMS

[Diagram labels: WEBLOGS SUMMARY, RDBMS, DB QUERY TOOL, WEBLOGS, HDFS]

Scenario #4: Hadoop as RDBMS “active archive”

[Diagram labels: SALES 2011, SALES 2010, SALES 2009, SALES 2008 (the 2009 and 2008 labels each appear twice), HDFS, RDBMS, QUERY TOOL]


Case Study: extending SQOOP for Oracle


SQOOP extensibility

• SQOOP implements a generic approach to RDBMS/Hadoop data transfer

• But database optimization is highly platform-specific

• Each RDBMS has distinct optimization strategies

• For Oracle, optimization requires:

  • Bypassing Oracle caching layers

  • Avoiding Oracle optimizer meddling

  • Exploiting Oracle metadata to balance mapper load

Reading from Oracle – default SQOOP

[Diagram: two mappers, each with its own Oracle session, query ID ranges (ID > 0 and ID < MAX/2; ID > MAX/2) through index range scans over index blocks; other labels: CACHE, ORACLE TABLE]
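
The range predicates on this slide (ID > 0 and ID < MAX/2, ID > MAX/2) come from dividing a key range evenly among mappers. The following is a minimal Python sketch of that idea, not Sqoop's actual Java code; the names `id_range_splits` and `where_clause` and the assumption of a dense integer key are illustrative only.

```python
def id_range_splits(min_id, max_id, num_mappers):
    """Divide the closed interval [min_id, max_id] into contiguous ranges.

    Returns a list of (low, high) tuples, at most one per mapper.
    Assumes an integer key; this is an illustrative sketch only.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")

    total = max_id - min_id + 1
    count = min(num_mappers, total)      # never create empty ranges
    base, extra = divmod(total, count)   # first `extra` ranges get one more key

    splits = []
    low = min_id
    for i in range(count):
        size = base + (1 if i < extra else 0)
        high = low + size - 1
        splits.append((low, high))
        low = high + 1
    return splits


def where_clause(column, low, high):
    """Build the bounding predicate one mapper would add to its query."""
    return f"{column} >= {low} AND {column} <= {high}"


if __name__ == "__main__":
    for low, high in id_range_splits(1, 1000, 4):
        print(where_clause("ID", low, high))
```

Because the ranges are defined on key values rather than on where rows are stored, each mapper's query is naturally served by an index range scan, which is the access path the diagram shows.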

Oracle – parallelism gone bad (1)

Oracle – parallelism gone bad (2)

Ideal architecture

[Diagram labels across these slides: Oracle SALES table, Oracle table, HDFS, Hadoop mappers (four per diagram), Oracle sessions (one beside each mapper in the final diagram)]


Design goals

• Partition data based on physical storage

• Bypass Oracle buffering

• Bypass Oracle parallelism

• Do not require or use indexes

• Never read the same data block more than once

• Support Oracle datatypes
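
One way to read the first and fifth goals together: if work is handed out by storage extent rather than by key range, every block belongs to exactly one mapper. The Python sketch below is an assumption about how such an allocation could look, not the Quest connector's code; the `Extent` tuple, its fields, and `allocate_extents` are hypothetical, and in practice the extent list would have to come from Oracle metadata.

```python
import heapq
from collections import namedtuple

# Hypothetical description of one contiguous run of blocks in a data file.
Extent = namedtuple("Extent", ["file_id", "start_block", "num_blocks"])


def allocate_extents(extents, num_mappers):
    """Assign each extent to exactly one mapper, balancing block counts.

    Greedy heuristic: take extents largest first and give each one to
    the mapper that currently has the fewest blocks.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")

    # Min-heap of (blocks assigned so far, mapper index).
    heap = [(0, mapper) for mapper in range(num_mappers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_mappers)]

    for extent in sorted(extents, key=lambda e: e.num_blocks, reverse=True):
        load, mapper = heapq.heappop(heap)
        assignment[mapper].append(extent)
        heapq.heappush(heap, (load + extent.num_blocks, mapper))
    return assignment


if __name__ == "__main__":
    demo = [Extent(1, 0, 8), Extent(1, 8, 4), Extent(2, 0, 4), Extent(2, 4, 2)]
    for mapper, part in enumerate(allocate_extents(demo, 2)):
        print(mapper, sum(e.num_blocks for e in part), part)
```

No index is consulted, and since the extents are disjoint, no block is read twice.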


Import Throughput

[Chart: elapsed time (ms, axis 0 to 7,000) against number of mappers (axis 0 to 35); series: SQOOP, SQOOP with Quest Connector]


[Chart: percentage reduction (axis 0 to 100) in elapsed time, CPU time, network round trips, IO requests, and IO time; data labels as extracted: 80.84, 89.72, 98.95, 99.08, 98.71; test setup: 16 mappers, 50M rows, 50 GB clustered data]
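
A percentage reduction can be hard to read as a speedup, so the helper below converts one into the other. It is a plain arithmetic sketch in Python; `speedup_from_reduction` is an illustrative name, and no pairing of a figure with a particular metric is implied.

```python
def speedup_from_reduction(pct_reduction):
    """Convert a percentage reduction into an improvement factor.

    A reduction of p percent leaves (100 - p) percent of the original
    cost, so the improvement factor is 100 / (100 - p).
    """
    if not 0 <= pct_reduction < 100:
        raise ValueError("pct_reduction must be in the range [0, 100)")
    return 100.0 / (100.0 - pct_reduction)


if __name__ == "__main__":
    # The five data labels that survive from the chart.
    for pct in (80.84, 89.72, 98.95, 99.08, 98.71):
        factor = speedup_from_reduction(pct)
        print(f"{pct:.2f}% reduction is roughly a {factor:.1f}x improvement")
```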


Export Throughput

[Chart: seconds (axis 500 to 3,000) against number of mappers (axis 0 to 25); series: SQOOP, SQOOP with Quest Connect]


Export load

[Chart: database time (s, axis 0 to 30,000) against number of mappers (axis 0 to 30); series: SQOOP, SQOOP with Quest connect]


Working with the SQOOP framework

• SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:

  • Extend ManagerFactory (what to handle)

  • Extend ConnManager (DB connection and metadata)

• For imports:

  • Extend DataDrivenDBInputFormat (gets the data)

    • Data allocation (getSplits())

    • Split serialization (“io.serializations” property)

    • Data access logic (createDBRecordReader(), getSelectQuery())

    • Implement progress (nextKeyValue(), getProgress())

• Similar procedure for extending exports
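
The import steps pair nextKeyValue() with getProgress(). The Python sketch below mirrors only the roles of those two methods so the contract is easy to see; a real connector would implement them in Java against Hadoop's RecordReader, and the class `SplitRecordReader` with its in-memory row list is hypothetical.

```python
class SplitRecordReader:
    """Reads the rows of one split and reports how far it has got.

    next_key_value() plays the role of nextKeyValue(): advance to the
    next row and return False once the split is exhausted.
    get_progress() plays the role of getProgress(): return a fraction
    between 0.0 and 1.0.
    """

    def __init__(self, rows):
        self._rows = list(rows)   # stand-in for a database cursor
        self._position = 0
        self._current = None

    def next_key_value(self):
        if self._position >= len(self._rows):
            self._current = None
            return False
        # Key is the row's position in the split, value is the row.
        self._current = (self._position, self._rows[self._position])
        self._position += 1
        return True

    def current(self):
        return self._current

    def get_progress(self):
        if not self._rows:
            return 1.0            # an empty split is already complete
        return self._position / len(self._rows)


if __name__ == "__main__":
    reader = SplitRecordReader(["row-a", "row-b", "row-c", "row-d"])
    while reader.next_key_value():
        print(reader.current(), reader.get_progress())
```

With a live cursor the total row count is usually unknown up front, so a real implementation would need some other basis for the progress estimate.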


Extensions to native SQOOP

• MERGE functionality

  • Update if exists, insert otherwise

• Hive connector

  • Source defined as HQL query rather than HDFS directory

• Eclipse UI
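
The MERGE behaviour listed above, update if exists and insert otherwise, can be sketched as an update-then-insert loop. The example uses Python's built-in sqlite3 module purely so that it runs standalone; the `sales` table, its columns, and `merge_rows` are hypothetical, and it does not show how the Quest connector issues its statements against Oracle.

```python
import sqlite3


def merge_rows(conn, rows):
    """Apply (id, amount) pairs: update existing ids, insert new ones.

    Returns a tuple (updated_count, inserted_count).
    """
    updated = inserted = 0
    for row_id, amount in rows:
        cursor = conn.execute(
            "UPDATE sales SET amount = ? WHERE id = ?", (amount, row_id)
        )
        if cursor.rowcount == 0:
            # No existing row matched, so this id is new.
            conn.execute(
                "INSERT INTO sales (id, amount) VALUES (?, ?)", (row_id, amount)
            )
            inserted += 1
        else:
            updated += 1
    conn.commit()
    return updated, inserted


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
    conn.execute("INSERT INTO sales (id, amount) VALUES (1, 10.0)")
    print(merge_rows(conn, [(1, 15.0), (2, 20.0)]))
    print(conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall())
```

Doing the merge row by row keeps the sketch short; a bulk export would more plausibly stage the data and merge it in a single set-based statement.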


Availability

• Apache-licensed source available from:

https://github.com/QuestSoftwareTCD/OracleSQOOPconnector

• Download from (Quest):

http://www.quest.com/hadoop/

• Download from (Cloudera):

http://ccp.cloudera.com/display/SUPPORT/Downloads


Other SQOOP connectors

• Microsoft SQL Server:

• http://www.microsoft.com/download/en/details.aspx?id=27584

• Teradata:

• https://ccp.cloudera.com/display/con/Cloudera+Connector+for+Teradata+User+Guide%2C+version+1.0-beta-u4

• MicroStrategy:

• https://ccp.cloudera.com/display/con/MicroStrategy+Free+Download+License+Agreement

• Netezza:

• https://ccp.cloudera.com/display/con/Netezza+Free+Download+License+Agreement

• VoltDB:

• http://voltdb.com/company/blog/sqoop-voltdb-export-and-hadoop-integration


Other Hadoop – RDBMS integrations


Oracle Big Data Appliance

• 18 Sun X4270 M2 servers

• 48GB per node (864GB total)

• 2x6 Core CPU per node (216 total)

• 12x2TB HDD per node (216 spindles, 432 TB)

• 40Gb/s Infiniband between nodes

• 10Gb/s Ethernet to datacenter

• Apache Hadoop

• Oracle NoSQL

• Oracle loader for Hadoop

• Multi-stage C-optimized unidirectional loader

www.oracle.com/us/bigdata/index.html

[Diagram labels: Oracle Exadata, Oracle Exalogic, Oracle Big Data Appliance, Oracle NoSQL, Oracle Loader for Hadoop, Apache Hadoop, Oracle RDBMS, Oracle WebLogic, Oracle Exalytics, Oracle Essbase, Oracle TimesTen]


Microsoft


Hadapt

• Formerly HadoopDB – Hadoop/Postgres hybrid

• Postgres servers on data nodes allow for accelerated (indexed) Hive queries

• Extensions to the Hive optimizer

http://www.hadapt.com/


Greenplum

• SQL based access to HDFS data via in-DB MapReduce

http://www.greenplum.com/sites/default/files/EMC_Greenplum_Hadoop_DB_TB_0.pdf


Toad for Cloud Databases

• Federated SQL queries across Hive, HBase, NoSQL, RDBMS


Conclusions

• RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption

• SQOOP provides a good general-purpose framework for transferring data between any JDBC database and Hadoop

  • We’d like to see it become a standard

• Each RDBMS offers distinct tuning opportunities, so optimized SQOOP extensions offer real value

• Hadoop-RDBMS integration projects are proliferating rapidly
