Sqoop2 Refactoring for Generic Data Transfer - Hadoop Strata Sqoop Meetup
DESCRIPTION
Sqoop2 is Sqoop as a service. Its focus is on ease of use, ease of extensibility, and security. Recently, Sqoop2 was refactored to handle generic data transfer needs.
TRANSCRIPT
Sqoop 2: Refactoring for generic data transfer
Abraham Elmahrek
Cloudera Ingest!
Introduction to Sqoop 2
Provide a rest API and Java API for easy integration. Existing clients include a Hue UI and a command line client.
Provide a connector SDK and focus on pluggability. Existing connectors include Generic JDBC connector and HDFS connector.
Emphasize separation of responsibilities. Eventually have ACLs or RBAC.
Ease of use. Extensibility. Security.
Life of a Request
• Client
– Talks to the server over REST + JSON
– Does nothing but send requests
• Server
– Extracts metadata from the data source
– Delegates to the execution engine
– Does all the heavy lifting, really
• MapReduce
– Parallelizes execution of the job
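The thin-client design above can be sketched in a few lines: the client only constructs REST requests and leaves all the work to the server. The endpoint path and class name below are illustrative placeholders, not the exact Sqoop2 REST API.

```java
// Hypothetical sketch of the client side: build a REST + JSON request and
// nothing more. The "/sqoop/v1/submission/action" path is a made-up
// placeholder for illustration only.
public class SqoopClientSketch {
    // Returns the raw HTTP request text the client would send to start a job.
    public static String submitJobRequest(String server, long jobId) {
        return "POST " + server + "/sqoop/v1/submission/action/" + jobId
             + " HTTP/1.1\nContent-Type: application/json\n\n{}";
    }
}
```

The point of the sketch is the division of labor: the client serializes intent as JSON over REST, while metadata extraction and execution-engine delegation stay entirely on the server.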
Workflow
Job Types
IMPORT into Hadoop and EXPORT out of Hadoop
Responsibilities
Connector responsibilities vs. Sqoop framework responsibilities
Transfer data from Connector A to Hadoop
Connector Definitions
• Connectors define:
– How to connect to a data source
– How to extract data from a data source
– How to load data into a data source
public Importer getImporter(); // Supply extract method
public Exporter getExporter(); // Supply load method
public Class getConnectionConfigurationClass();
public Class getJobConfigurationClass(MJob.Type type); // MJob.Type is IMPORT or EXPORT
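Putting the four methods together, a connector might look like the following minimal sketch. The nested classes (`MyImporter`, the config classes, and the local `JobType` enum standing in for `MJob.Type`) are made-up placeholders, not real Sqoop SDK types.

```java
// Hedged sketch of a connector implementing the methods from the slide.
// All nested types are illustrative stand-ins for SDK interfaces.
public class ExampleConnector {
    public static class Importer {}          // stand-in for the SDK Importer
    public static class Exporter {}          // stand-in for the SDK Exporter
    public static class ConnectionConfig {}  // connection form fields
    public static class ImportJobConfig {}   // IMPORT job form fields
    public static class ExportJobConfig {}   // EXPORT job form fields
    public enum JobType { IMPORT, EXPORT }   // mirrors MJob.Type

    public Importer getImporter() {          // supplies the extract logic
        return new Importer();
    }

    public Exporter getExporter() {          // supplies the load logic
        return new Exporter();
    }

    public Class<?> getConnectionConfigurationClass() {
        return ConnectionConfig.class;
    }

    public Class<?> getJobConfigurationClass(JobType type) {
        return type == JobType.IMPORT ? ImportJobConfig.class
                                      : ExportJobConfig.class;
    }
}
```

The framework introspects the returned configuration classes to render forms and validate input, so a connector only declares *what* it needs, not *how* the framework collects it.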
Intermediate Data Format
• Describes a single record as it moves through Sqoop
• Currently available:
– CSV
col1,col2,col3,...
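A toy illustration of the CSV intermediate data format: each record becomes one line of comma-separated values, with strings quoted so embedded commas do not split a field. This is a simplified sketch, not the actual Sqoop CSV intermediate-data-format implementation.

```java
import java.util.List;
import java.util.stream.Collectors;

// Simplified CSV encoding of one record, for illustration only.
public class CsvRecord {
    // Join field values with commas; single-quote strings so a comma
    // inside a string value is not mistaken for a field separator.
    public static String encode(List<Object> fields) {
        return fields.stream()
                .map(f -> f instanceof String ? "'" + f + "'" : String.valueOf(f))
                .collect(Collectors.joining(","));
    }
}
```

For example, `encode(List.of(1, "Alice", 3.5))` yields `1,'Alice',3.5` — one record, one line, in the `col1,col2,col3,...` shape shown above.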
What’s Wrong w/ the Current Implementation?
• Treating Hadoop as a first-class citizen prevents transfers between components within the Hadoop ecosystem
– HBase to HDFS not supported
– HDFS to Accumulo not supported
• The Hadoop ecosystem is not well defined
– Accumulo was not considered part of the Hadoop ecosystem
– What’s next? Kafka?
Refactoring
• Connectors already define extractors and loaders
– Refactor the connector SDK
• Pull HDFS integration out into a connector
• Improve Schema integration
Transfer data from Connector A to Connector B
Connector SDK
• Connectors assume all roles
• Add a Direction: FROM and TO
• Initializers and destroyers for both directions
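The direction-aware surface can be sketched as follows: instead of importer/exporter, a connector supplies an extractor for its FROM role and a loader for its TO role, with per-direction initializers and destroyers. All type names here are illustrative placeholders, not the real SDK classes.

```java
// Hedged sketch of a refactored, direction-aware connector.
public class DirectionalConnector {
    public enum Direction { FROM, TO }
    public static class Extractor {}    // reads records out of the source
    public static class Loader {}       // writes records into the target
    public static class Initializer {}  // per-direction setup
    public static class Destroyer {}    // per-direction teardown

    public Extractor getExtractor() { return new Extractor(); } // FROM role
    public Loader getLoader()       { return new Loader(); }    // TO role

    // Both directions get their own lifecycle hooks.
    public Initializer getInitializer(Direction d) { return new Initializer(); }
    public Destroyer getDestroyer(Direction d)     { return new Destroyer(); }
}
```

Because every connector exposes both roles, the framework can pair any connector's FROM side with any other connector's TO side, which is exactly what generic Connector A to Connector B transfer requires.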
Connector responsibilities
HDFS Connector
• Move the Hadoop role into a connector
• Schemaless
• Data formats
– Text (CSV)
– Sequence
– etc.
Schema Improvements
• Schema per connector
• The intermediate data format (IDF) has a Schema
• Introduce a matcher
• Schema represents data as it moves through the system
Matcher
• The matcher ensures data goes to the right place
• Combinations:
– FROM and TO schema
– FROM schema only
– TO schema only
– No schema = error
Matcher
Ensures that the FROM schema matches the TO schema by the index location of each column in the Schema
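A minimal sketch of location-based matching: fields are copied from the FROM record into the TO record by column index, padding with nulls when the TO schema is wider and dropping trailing fields when it is narrower. This is illustrative, not the actual Sqoop matcher implementation.

```java
// Toy location matcher: align records purely by column position.
public class LocationMatcher {
    // Copy fromRecord[i] into position i of the TO record for each
    // index that exists in both; pad missing trailing columns with null.
    public static Object[] match(Object[] fromRecord, int toColumnCount) {
        Object[] to = new Object[toColumnCount];
        for (int i = 0; i < toColumnCount; i++) {
            to[i] = i < fromRecord.length ? fromRecord[i] : null;
        }
        return to;
    }
}
```

Matching by index is the simplest strategy; name-based or user-defined matching would replace the positional copy with a lookup keyed on column names.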
• Matching strategies:
– Location
– Name
– User defined
Thank you