Hadoop World 2011: Integrating Hadoop with Enterprise RDBMS Using Apache Sqoop and Other Tools - Guy...
Post on 15-Jul-2015
November 2011
Apache Sqoop (Incubating): Integrating Hadoop with Enterprise RDBMS – Part I
Arvind Prabhakar (arvind at apache dot org)
Apache Sqoop Committer and Software Engineer at Cloudera
©2011 Quest Software, Inc. All rights reserved.
Hadoop Data Processing
In This Session…
How Sqoop Works
Roadmap
Data Import
Sqoop Overview
Pre-processing
Code Generation
Type Mapping
Data Transfer
Post-Processing
Sqoop Connectors
Oracle – Developed by Quest Software
Couchbase – Developed by Couchbase
Netezza – Developed by Cloudera
Teradata – Developed by Cloudera
SQL Server – Developed by Microsoft
Microsoft PDW – Developed by Microsoft
VoltDB – Developed by VoltDB
Sqoop Roadmap
SQOOP-365: Proposal for Sqoop 2.0
• https://issues.apache.org/jira/browse/SQOOP-365
Highlights
• Sqoop as a Service
• Connections as First Class Objects
• Role based Security
Sqoop 2 Architecture (proposed)
For More Information
Website:
http://incubator.apache.org/sqoop/
Mailing Lists:
incubator-sqoop-user-subscribe@apache.org
incubator-sqoop-dev-subscribe@apache.org
Issue Tracker:
http://issues.apache.org/jira/browse/SQOOP
Thank You!
Q & A will be after part II of this session.
Guy Harrison, Quest Software
Integrating Hadoop with Enterprise RDBMS Using Apache SQOOP and Other Tools
Introductions
Agenda
• Scenarios for RDBMS-Hadoop interaction
• Case study: Quest extension to SQOOP
• Other RDBMS-Hadoop integrations
Hadoop meets RDBMS – scenarios
Scenario #1: Reference data in RDBMS
[Diagram labels: CUSTOMERS, WEBLOGS, PRODUCTS; HDFS, RDBMS]
Scenario #2: Hadoop for off-line analytics
[Diagram labels: CUSTOMERS, PRODUCTS, SALES HISTORY; HDFS, RDBMS]
Scenario #3: MapReduce output to RDBMS
[Diagram labels: WEBLOGS SUMMARY, WEBLOGS, DB QUERY TOOL; RDBMS, HDFS]
Scenario #4: Hadoop as RDBMS “active archive”
[Diagram labels: SALES 2011, SALES 2010, SALES 2009, SALES 2008, QUERY TOOL; HDFS, RDBMS]
Case Study: extending SQOOP for Oracle
SQOOP extensibility
• SQOOP implements a generic approach to RDBMS/Hadoop data transfer
• But database optimization is highly platform specific
• Each RDBMS has distinct optimization strategies
• For Oracle, optimization requires:
• Bypassing Oracle caching layers
• Avoiding Oracle optimizer meddling
• Exploiting Oracle metadata to balance mapper load
Reading from Oracle – default SQOOP
[Diagram labels: CACHE, ORACLE TABLE, index blocks, RANGE SCAN; two MAPPERs, each with its own ORACLE SESSION, reading "ID > 0 and ID < MAX/2" and "ID > MAX/2"]
[Diagram labels: Oracle SALES table, four Hadoop mappers, HDFS]
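The range-based reads in the default SQOOP diagram (one mapper reading "ID > 0 and ID < MAX/2", another reading "ID > MAX/2") can be illustrated with a small sketch. This is a simplified model of splitting a numeric key range among mappers, not Sqoop's actual splitter code; the function names `split_id_range` and `where_clauses` are hypothetical.

```python
def split_id_range(min_id, max_id, num_mappers):
    """Divide the closed range [min_id, max_id] into contiguous,
    non-overlapping sub-ranges, at most one per mapper."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")
    total = max_id - min_id + 1
    parts = min(num_mappers, total)      # never create empty ranges
    base, extra = divmod(total, parts)   # the first `extra` ranges get one more ID
    ranges = []
    lo = min_id
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


def where_clauses(column, ranges):
    """Render one WHERE predicate per sub-range (illustrative SQL text only)."""
    return [f"{column} >= {lo} AND {column} <= {hi}" for lo, hi in ranges]


if __name__ == "__main__":
    for clause in where_clauses("ID", split_id_range(1, 100, 4)):
        print(clause)
```

Equal ID ranges do not guarantee equal row counts: if the key values are skewed, some mappers receive far more rows than others, and each predicate is typically answered through an index range scan in its own Oracle session.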
Oracle – parallelism gone bad (1)
[Diagram labels: ORACLE TABLE, HDFS, four HADOOP MAPPERs]
Oracle – parallelism gone bad (2)
[Diagram labels: ORACLE TABLE, HDFS]
Ideal architecture
[Diagram labels: four HADOOP MAPPERs, each paired with its own ORACLE SESSION]
Design goals
• Partition data based on physical storage
• Bypass Oracle buffering
• Bypass Oracle parallelism
• Do not require or use indexes
• Never read the same data block more than once
• Support Oracle datatypes
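Two of the design goals above, partitioning by physical storage and never reading a block twice, can be sketched as an allocation problem. The example below assumes the table's storage is described as a list of extents (file, starting block, block count) and hands each extent to the currently least-loaded mapper. The `Extent` tuple and the greedy strategy are assumptions for illustration only; the slides do not show how the Quest connector actually allocates work.

```python
import heapq
from typing import NamedTuple


class Extent(NamedTuple):
    """Hypothetical description of one contiguous run of data blocks."""
    file_id: int
    start_block: int
    num_blocks: int


def assign_extents(extents, num_mappers):
    """Give every extent to exactly one mapper, largest extents first,
    always choosing the mapper with the fewest blocks assigned so far."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    heap = [(0, m) for m in range(num_mappers)]   # (blocks assigned, mapper index)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_mappers)]
    for ext in sorted(extents, key=lambda e: e.num_blocks, reverse=True):
        load, mapper = heapq.heappop(heap)
        assignment[mapper].append(ext)
        heapq.heappush(heap, (load + ext.num_blocks, mapper))
    return assignment


if __name__ == "__main__":
    demo = [Extent(1, 0, 8), Extent(1, 8, 8), Extent(2, 0, 4),
            Extent(2, 4, 4), Extent(3, 0, 8)]
    for i, part in enumerate(assign_extents(demo, 2)):
        print(i, sum(e.num_blocks for e in part), part)
```

Because each extent appears in exactly one mapper's list, no block is read more than once, and the balance depends on block counts rather than on key values or indexes.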
Import Throughput
[Chart: elapsed time (ms, 0 to 7,000) against number of mappers (0 to 35); series: SQOOP, SQOOP with Quest Connector]
[Chart: percentage reduction; 16 mappers, 50M rows, 50 GB clustered data. Metrics: elapsed time, CPU time, network round trips, IO requests, IO time. Values as extracted: 80.84, 89.72, 98.95, 99.08, 98.71]
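Percentage reductions such as the 80.84 to 99.08 figures above are easier to compare when converted into "times less" factors. The helper below is a plain arithmetic sketch; the function name is hypothetical, and the values are taken from the chart without pairing them to specific metrics, since the extracted text does not preserve that pairing reliably.

```python
def factor_from_reduction(pct_reduction):
    """Convert a percentage reduction into a 'times less' factor.

    A reduction of p percent leaves (100 - p) percent of the original,
    so the original is 100 / (100 - p) times the reduced value.
    """
    if not 0 <= pct_reduction < 100:
        raise ValueError("pct_reduction must be in [0, 100)")
    return 100.0 / (100.0 - pct_reduction)


if __name__ == "__main__":
    for pct in (80.84, 89.72, 98.95, 99.08, 98.71):
        print(f"{pct}% reduction: about {factor_from_reduction(pct):.1f}x less")
```

For instance, an 80.84% reduction means the reduced value is roughly one fifth of the original.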
Export Throughput
[Chart: seconds (500 to 3,000) against number of mappers (0 to 25); series: SQOOP, SQOOP with Quest Connector]
Export load
[Chart: database time (s, 0 to 30,000) against number of mappers (0 to 30); series: SQOOP, SQOOP with Quest Connector]
Working with the SQOOP framework
• SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:
• Extend ManagerFactory (what to handle)
• Extend ConnManager (DB connection and metadata)
• For imports:
• Extend DataDrivenDBInputFormat (gets the data)
• Data allocation (getSplits())
• Split serialization (“io.serializations” property)
• Data access logic (createDBRecordReader(), getSelectQuery())
• Implement progress (nextKeyValue(), getProgress())
• Similar procedure for extending exports
Extensions to native SQOOP
• MERGE functionality
• Update if exists, insert otherwise
• Hive connector
• Source defined as HQL query rather than HDFS directory
• Eclipse UI
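The "update if exists, insert otherwise" behaviour of the MERGE extension listed above can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite database as a stand-in for Oracle, and the `sales` table and `merge_rows` helper are hypothetical; the slides do not show how the Quest connector implements its merge.

```python
import sqlite3


def merge_rows(conn, rows):
    """Apply (id, amount) rows: update the row if the id exists,
    insert it otherwise. Returns (updated_count, inserted_count)."""
    updated = inserted = 0
    for row_id, amount in rows:
        cur = conn.execute(
            "UPDATE sales SET amount = ? WHERE id = ?", (amount, row_id)
        )
        if cur.rowcount == 0:
            # No existing row matched, so this is a new record.
            conn.execute(
                "INSERT INTO sales (id, amount) VALUES (?, ?)", (row_id, amount)
            )
            inserted += 1
        else:
            updated += 1
    conn.commit()
    return updated, inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO sales VALUES (1, 10.0)")
counts = merge_rows(conn, [(1, 12.5), (2, 7.0)])
print(counts, conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall())
```

Row-at-a-time merging keeps the sketch short; a production export would more likely batch rows and let the database perform the merge in a single statement.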
Availability
• Apache-licensed source available from:
https://github.com/QuestSoftwareTCD/OracleSQOOPconnector
• Download from (Quest):
http://www.quest.com/hadoop/
• Download from (Cloudera):
http://ccp.cloudera.com/display/SUPPORT/Downloads
Other SQOOP connectors
• Microsoft SQL Server:
• http://www.microsoft.com/download/en/details.aspx?id=27584
• Teradata:
• https://ccp.cloudera.com/display/con/Cloudera+Connector+for+Teradata+User+Guide%2C+version+1.0-beta-u4
• MicroStrategy:
• https://ccp.cloudera.com/display/con/MicroStrategy+Free+Download+License+Agreement
• Netezza:
• https://ccp.cloudera.com/display/con/Netezza+Free+Download+License+Agreement
• VoltDB:
• http://voltdb.com/company/blog/sqoop-voltdb-export-and-hadoop-integration
Other Hadoop – RDBMS integrations
Oracle Big Data Appliance
• 18 Sun X4270 M2 servers
• 48GB per node (864GB total)
• 2x6 Core CPU per node (216 total)
• 12x2TB HDD per node (216 spindles, 432 TB)
• 40Gb/s Infiniband between nodes
• 10Gb/s Ethernet to datacenter
• Apache Hadoop
• Oracle NoSQL
• Oracle loader for Hadoop
• Multi-stage C-optimized unidirectional loader
www.oracle.com/us/bigdata/index.html
[Diagram labels: Oracle Exadata, Oracle Exalogic, Oracle Big Data Appliance, Oracle NoSQL, Oracle Loader for Hadoop, Apache Hadoop, Oracle RDBMS, Oracle WebLogic, Oracle Exalytics, Oracle Essbase, Oracle TimesTen]
Microsoft
Hadapt
• Formerly HadoopDB – Hadoop/Postgres hybrid
• Postgres servers on data nodes allow for accelerated (indexed) Hive queries
• Extensions to the Hive optimizer
http://www.hadapt.com/
Greenplum
• SQL based access to HDFS data via in-DB MapReduce
http://www.greenplum.com/sites/default/files/EMC_Greenplum_Hadoop_DB_TB_0.pdf
Toad for Cloud Databases
• Federated SQL queries across Hive, HBase, NoSQL, RDBMS
Conclusions
• RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption
• SQOOP provides a good general purpose framework for transferring data between any JDBC database and Hadoop
• We’d like to see it become a standard
• Each RDBMS offers distinct tuning opportunities, so optimized SQOOP extensions offer real value
• Hadoop-RDBMS integration projects are proliferating rapidly