Graphic Guide to Offloading the Data Warehouse


  • TABLE OF CONTENTS:

    Intro: Why Offload Is the Best Way to Get Started with Hadoop

    A Framework for EDW Offload

    From SQL to Hadoop ETL in 5 Steps

    Case Study: Leading Healthcare Organization Offloads EDW to Hadoop

    Conclusion

  • INTRO: WHY OFFLOAD IS THE BEST WAY TO GET STARTED WITH HADOOP

    Wouldn't it be great to have a single, consistent version of the truth for all your corporate data? Nearly two decades ago, that was the vision of the Enterprise Data Warehouse (EDW), enabled by ETL tools that Extract data from multiple sources, Transform it with operations such as sorting, aggregating, and joining, and then Load it into a central repository. But early success resulted in greater demands for information; users became increasingly dependent on data for better business decisions. When data integration tools could not keep up with this insatiable hunger for information, organizations pushed transformations down to the data warehouse, in many cases resorting back to hand coding, and ELT emerged.

    Today, 70% of all data warehouses are performance and capacity constrained, according to Gartner. Many organizations are spending millions of dollars a year in database capacity just to process ELT workloads. Faced with growing costs and unmet requirements, many are looking for an alternative and are increasingly considering offloading data and transformations to Hadoop for the cost savings alone. Multiple sources report that the cost of managing data in Hadoop can range from $500 to $2,000 per terabyte, compared to $20,000 to $100,000 per terabyte for high-end data warehouses.

    Hadoop can become a massively scalable and cost-effective staging area, or Enterprise Data Hub, for all corporate data. By offloading ELT workloads into Hadoop you can:

    Keep data for as long as you want at a significantly lower cost

    Free up premium database capacity

    Defer additional data warehouse expenditures

    Significantly reduce batch windows so your users get access to fresher data

    Provide business users direct access to data stored in Hadoop for data exploration and discovery


  • However, Hadoop is not a complete ETL solution. While Hadoop offers powerful utilities and virtually unlimited horizontal scalability, it does not provide the complete set of functionality users need for enterprise ETL. In most cases, these gaps must be filled through complex manual coding and advanced programming skills in Java, Hive, Pig and other Hadoop technologies, skills that are expensive and difficult to find, slowing Hadoop adoption and frustrating organizations eager to deliver results.
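    To make the gap concrete, the sketch below shows roughly what hand coding a single, trivial transformation looks like in plain MapReduce: filtering a pipe-delimited extract down to its active records. The class name, field position and status value are hypothetical, used only to illustrate the Java boilerplate, compilation and deployment work that even a one-line filter requires.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical map-only job: keep only records whose status column is "ACTIVE".
public class FilterActiveRecords {

  public static class FilterMapper extends Mapper<Object, Text, NullWritable, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\\|");
      // Assume the fourth column holds the record status.
      if (fields.length > 3 && "ACTIVE".equals(fields[3])) {
        context.write(NullWritable.get(), value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "filter-active-records");
    job.setJarByClass(FilterActiveRecords.class);
    job.setMapperClass(FilterMapper.class);
    job.setNumReduceTasks(0);                  // map-only: no reduce phase needed
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

    Multiply this by every join, aggregation and change-data-capture rule in an ELT workload, and the skills problem described above becomes clear.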

    That's why Syncsort has developed DMX-h, Hadoop ETL software that combines the benefits of enterprise-caliber, high-performance ETL with Hadoop, enabling you to optimize your data warehouse while gaining all the benefits of a complete ETL solution. Hand-in-hand with DMX-h, Syncsort has identified a set of best practices to accelerate your data offload efforts into Hadoop. This three-phased approach begins with identifying the data and transformations to target first, then offloading the data and workloads into Hadoop with a graphical tool and no coding, and finally ensuring your Hadoop ETL environment can meet business requirements with enterprise-class performance optimization and security. Let's explore this approach further.

    A THREE-PHASE APPROACH: THE SYNCSORT OFFLOAD FRAMEWORK

    Analyze, understand & document SQL jobs

    Identify SQL ELT workloads suitable for offload

    Use a single point-&-click interface to extract & load virtually any data into HDFS; replicate existing ELT workloads in Hadoop; develop in Windows & deploy in Hadoop

    No manual coding required

    No code generation

    Deploy as part of your Hadoop cluster

    Deliver faster throughput per node

    Fully support Kerberos & LDAP

    Easily monitor via a Web console

    Close integration with Cloudera Manager & Hadoop Job Tracker


  • PHASE I: Identify the data and transformations that will bring you the highest savings with minimum effort and risk. In most cases, 20% of data transformations consume up to 80% of resources. Analyzing, understanding and documenting SQL workloads is an essential first step. Syncsort's unique utility, SILQ, helps you do this. SILQ takes a SQL script as input and provides a detailed flow chart of the entire data flow. Using an intuitive web-based interface, users can easily drill down to get detailed information about each step within the data flow, including tables and data transformations. SILQ even offers hints and best practices for developing equivalent transformations using Syncsort DMX-h, a unique solution for Hadoop ETL that eliminates the need for custom code, delivers smarter connectivity to all your data, and improves Hadoop's processing efficiency.

    One of the biggest barriers to offloading from the data warehouse into Hadoop has been a legacy of thousands of scripts built and extended over time. Understanding and documenting massive amounts of SQL code and then mastering the advanced programming skills to offload these transformations has left many organizations reluctant to move. SILQ removes this roadblock, eliminating the complexity and risk.

    PHASE II: Offload expensive ETL workloads and the associated data to Hadoop quickly and securely with a single tool, using current skills within your organization. You need to be able to easily replicate existing workloads without intensive manual coding projects, even bringing mainframe data into Hadoop, which offers no native support for mainframes. The next section of this guide will focus exclusively on this phase.

    PHASE III: Optimize & Secure the new environment. Once the transformations are complete, you then need to make sure you have the tools and processes in place to manage, secure and operationalize your Enterprise Data Hub for ongoing success. The organization expects the same level of functionality and services provided before, only faster and less costly now that the transformations are in Hadoop. You need to leverage business-class tools to optimize the performance of your Enterprise Data Hub.

    Syncsort DMX-h is fully integrated with Hadoop, running on every node of your cluster. This means faster throughput per node without code generation. Syncsort integrates with tools in the Hadoop ecosystem such as Cloudera Manager, Ambari, and the Hadoop JobTracker, allowing you to easily deploy and manage enterprise deployments from just a few nodes to several hundred nodes. A zero-footprint, web-based monitoring console allows users to monitor and manage data flows through a web browser and even mobile devices such as smart phones and tablets. You can also secure your Hadoop cluster using common security standards such as Kerberos and LDAP. And to simplify management and reusability in order to meet service level agreements, built-in metadata capabilities are available as part of Hadoop ETL. This guide focuses on Phase II.


  • OVERCOMING SQL CHALLENGES WITH SILQ

    SQL still remains one of the primary approaches for data integration. Thus, data warehouse offload projects often start by analyzing and understanding ELT SQL scripts. In most cases, though, SQL can easily grow to hundreds or even thousands of lines developed by several people over the years, making it almost impossible to maintain and understand.

    SILQ is the only SQL offload utility specifically designed to overcome these challenges, helping your data warehouse offload initiative go smoothly. The following figure shows a snapshot of the SQL used for the purposes of this example along with a fragment of the fully documented flow chart generated by SILQ.


  • A CLOSER LOOK AT OFFLOADING ETL WORKLOADS INTO HADOOP

    Phase II, the task of re-writing heavy ELT workloads to run in Hadoop, is typically associated with the need for highly skilled programmers in Java, Hive and other Hadoop technologies. This requirement is even higher when the data and processing involve complex data structures like mainframe files and complex data warehouse SQL processes that need to be converted and run on Hadoop. But this doesn't have to be the case.

    Syncsort DMX-h provides a simpler, all-graphical approach to shift ELT workloads and the associated data into Hadoop before loading the processed data back into the EDW for agile data discovery and visual analytics. This is done using native connectors to virtually any data source, including most relational databases, appliances, social data and even mainframes, and a graphical user interface to develop complex processing, workflows, scheduling and monitoring for Hadoop.

    Graphical Offload Using DMX-h

    The following flow demonstrates the simplicity of the approach. This end-to-end flow is implemented in a single job comprised of multiple steps that can include sub-jobs and tasks. Data is offloaded from the EDW (in this case Teradata) and the mainframe, and loaded to the Hadoop Distributed File System (HDFS) using native connectors. The loaded data is then transformed on Hadoop using DMX-h, followed by a load back to Teradata/EDW from HDFS.

  • The DMX-h Graphical User Interface (GUI) has 3 basic components:

    1. The Job Editor is used to build a DMX-h job or workflow of sub-jobs or tasks. The job defines the execution dependencies (the order in which tasks will run) and the data flow for a set of tasks. The tasks may be DMX-h tasks or custom tasks which allow you to integrate external scripts or programs into your DMX-h application.

    2. The Task Editor is used to build the tasks that comprise a DMX-h job. Tasks are the simplest unit of work. Each task reads data from one or more sources, processes that data and outputs it to one or more targets.

    3. The Server dialog is used to schedule and monitor Hadoop jobs.

    Below is the equivalent job flow in the DMX-h Job Editor that represents the flow depicted in Figure 1. As you can see, the job contains 3 sub-jobs:

    1. Load_DataWarehouse_Mainframe_To_HDFS

    2. MapReduce_Join_Aggregate_EDW_and_Mainframe

    3. Load_Aggregate_Data_to_DataWarehouse

    Let's take a detailed look at the steps involved in each one of these sub-jobs.


  • STEP 1: EXTRACTING SOURCE DATA FROM MAINFRAME & THE EDW

    The first sub-job, Load_DataWarehouse_Mainframe_To_HDFS, consists of 2 tasks: Extract_Active_EDW_data and ConvertLocalMainframeFileHDFS. The EDW to Hadoop extraction task and its functionality are described in Steps 1.1, 1.2 and 1.3. Although excluded from this guide, the second, mainframe-to-Hadoop task follows a similar graphical approach, using mainframe COBOL copybooks to read and translate complex mainframe data structures from EBCDIC to ASCII delimited records on Hadoop. Syncsort DMX-h also includes a library of Use Case Accelerators for common data flows, which makes it easier to learn how to create your own jobs.
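    The copybook-driven translation is handled graphically by DMX-h, but it helps to see what the underlying character conversion involves. The hypothetical snippet below decodes a fixed-width EBCDIC record with the JVM's Cp037 (IBM EBCDIC) charset and re-emits it as a pipe-delimited ASCII line; the record layout is invented for the example, and real mainframe files add packed-decimal (COMP-3) fields, REDEFINES and OCCURS clauses that a copybook-aware tool resolves for you.

```java
import java.nio.charset.Charset;

// Hypothetical illustration: decode a fixed-width EBCDIC record into a
// pipe-delimited ASCII line. Field offsets would normally come from a
// COBOL copybook; the 10-byte id and 20-byte name below are made up.
public class EbcdicToAscii {

  // IBM EBCDIC (US/Canada); available when the JDK's extended charsets are installed.
  private static final Charset EBCDIC = Charset.forName("Cp037");

  public static String toDelimited(byte[] record) {
    String custId = new String(record, 0, 10, EBCDIC).trim();
    String name   = new String(record, 10, 20, EBCDIC).trim();
    return custId + "|" + name;
  }

  public static void main(String[] args) {
    byte[] sample = ("0000012345" + "JOHN SMITH          ").getBytes(EBCDIC);
    System.out.println(toDelimited(sample));   // prints 0000012345|JOHN SMITH
  }
}
```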

  • Using native connectors, you can easily create a DMX-h task to extract data from Teradata or any other major database in parallel and load it into HDFS without writing any code.

    1.1: EXTRACTING DATA FROM THE ENTERPRISE DATA WAREHOUSE
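    For comparison, a hand-coded version of this extract-and-load step might look like the following sketch, which pulls rows over JDBC and writes them to HDFS as delimited text. The connection URL, query and output path are placeholders, and a single-threaded read like this gives up exactly the parallelism and restartability that native connectors provide.

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical single-threaded JDBC-to-HDFS extract (requires the database's
// JDBC driver on the classpath). A production loader would partition the read,
// handle failures and support restart from the last good block.
public class JdbcToHdfs {
  public static void main(String[] args) throws Exception {
    String jdbcUrl = "jdbc:teradata://edw-host/DATABASE=sales";   // placeholder URL
    Configuration conf = new Configuration();                     // reads core-site.xml / hdfs-site.xml

    try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "password");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT order_id, cust_id, amount FROM orders");
         FileSystem fs = FileSystem.get(conf);
         BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
             fs.create(new Path("/staging/edw/orders.txt")), StandardCharsets.UTF_8))) {

      while (rs.next()) {
        out.write(rs.getString(1) + "|" + rs.getString(2) + "|" + rs.getString(3));
        out.newLine();
      }
    }
  }
}
```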


  • Similarly, you can specify the columns you need from the source data warehouse table.

    1.2: SOURCE DATABASE TABLE DIALOG


  • Once the columns are specified, another dialog window allows you to specify the mapping between the extracted source EDW columns on the left and the delimited HDFS target file on the right.

    1.3: REFORMAT TARGET LAYOUT DIALOG


  • STEP 2: JOINING AND SORTING THE SOURCE DATASETS USING A MAPREDUCE ETL JOB

    The second sub-job in our example is the MapReduce sub-job, MapReduce_Join_Aggregate_EDW_and_Mainframe. DMX-h provides an extremely simple way of defining the mapper and reducer ETL jobs graphically without having to write a single line of code. Everything to the left of the Map Reduce link is the Map logic and everything to the right is the Reduce logic.

    DMX-h provides the ability to have multiple Map and Reduce tasks strung together without having to write any intermediate data between them. Furthermore, none of the Map and Reduce logic requires any coding or code generation. All of the tasks are executed natively by the pre-compiled DMX-h ETL engine on every Hadoop node as part of the mappers and reducers. In this case, the Map task filters and sorts the two files loaded from the EDW and mainframe, and the Reduce task then joins and aggregates the data and writes the result back to HDFS.
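    For context, the sketch below shows roughly what a hand-written version of this step looks like in plain MapReduce: a reduce-side join in which each mapper tags its records with a source marker and the join key, and the reducer pairs the two sides and aggregates. The class names, field positions and the per-key sum are hypothetical; DMX-h expresses the same logic graphically, with no Java to write, compile or maintain.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical reduce-side join of an EDW extract and a mainframe extract on a
// shared key (column 0), followed by a simple per-key sum of the EDW amounts.
public class JoinAggregateEdwMainframe {

  // Tag EDW records with "E" and emit (key, amount).
  public static class EdwMapper extends Mapper<Object, Text, Text, Text> {
    protected void map(Object k, Text v, Context ctx) throws IOException, InterruptedException {
      String[] f = v.toString().split("\\|");
      ctx.write(new Text(f[0]), new Text("E|" + f[2]));
    }
  }

  // Tag mainframe records with "M" and emit (key, region).
  public static class MainframeMapper extends Mapper<Object, Text, Text, Text> {
    protected void map(Object k, Text v, Context ctx) throws IOException, InterruptedException {
      String[] f = v.toString().split("\\|");
      ctx.write(new Text(f[0]), new Text("M|" + f[1]));
    }
  }

  // Join the two sides for each key and emit one aggregated record per match.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double total = 0;
      List<String> regions = new ArrayList<>();
      for (Text t : values) {
        String[] parts = t.toString().split("\\|", 2);
        if ("E".equals(parts[0])) total += Double.parseDouble(parts[1]);
        else regions.add(parts[1]);
      }
      for (String region : regions) {
        ctx.write(key, new Text(region + "|" + total));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "join-aggregate-edw-mainframe");
    job.setJarByClass(JoinAggregateEdwMainframe.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, EdwMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, MainframeMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

    Note that this sketch buffers one side of the join in reducer memory; handling skewed keys, multi-step flows and restartability is where the hand-coded approach gets genuinely expensive.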

  • The No Coding Advantage

    A Fortune 100 organization tried to create a Change Data Capture (CDC) job using HiveQL. It took 9 weeks to complete and resulted in over 47 scripts and numerous user-defined functions written in Java to overcome the limitations of HiveQL. The code wasn't scalable and would be costly to maintain, requiring teams of skilled developers. Other code-heavy tools introduced more complexity, cost and performance issues. Syncsort DMX-h:

    Cut development time by 2/3

    Required only 4 DMX-h graphical jobs

    Eliminated the need for Java user-defined functions

    Delivered a 24x performance improvement

    The DMX-h GUI makes it easy for you to graphically link metadata across MapReduce jobs to perform metadata lineage and impact analysis across mappers and reducers. The blue arrows below track a certain column's lineage across multiple steps in the MapReduce job.

    2.1: LEVERAGING METADATA FOR LINEAGE & IMPACT ANALYSIS FOR MAPREDUCE JOBS


  • STEP 3: LOADING THE FINAL DATASET INTO THE DATA WAREHOUSE

    The third and final sub-job in our example, Load_Aggregate_Data_to_DataWarehouse, takes the output of the Reduce tasks from HDFS and loads it back into the EDW using native Teradata connectors (TTU or TPT).

    From Data Blending to Data Discovery

    With Syncsort DMX-h it is easy to set up high-performance source and target access for all major databases, including highly optimized, native connectivity for Teradata, Vertica, EMC Greenplum, Oracle and more. Alternatively, you can also land the data back into HDFS or even create a Tableau data extract file for visual data discovery and analysis.

  • You can graphically map the delimited HDFS data columns on the left to the EDW table columns on the right. Syncsort DMX-h supports graphical controls for inserting, truncating/inserting, and updating tables, as well as for setting commit intervals for the EDW. You can also create new tables using the Create new button.

    3.1: TARGET DATABASE TABLE DIALOG
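    The commit interval mirrors what a hand-coded loader would do with JDBC batching, as in the hypothetical sketch below: rows are batched and committed every N records so that a late failure does not roll back the entire load. The table, columns, interval and connection URL are placeholders, and for large volumes a native bulk utility such as TPT would normally replace row-at-a-time inserts.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Hypothetical loader: insert pipe-delimited records into the EDW, committing
// every COMMIT_INTERVAL rows. Requires the target database's JDBC driver.
public class LoadToEdw {
  private static final int COMMIT_INTERVAL = 10_000;

  public static void main(String[] args) throws Exception {
    String jdbcUrl = "jdbc:teradata://edw-host/DATABASE=sales";   // placeholder URL
    try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "password");
         PreparedStatement ps = conn.prepareStatement(
             "INSERT INTO agg_orders (cust_id, region, total_amount) VALUES (?, ?, ?)");
         BufferedReader in = new BufferedReader(new FileReader(args[0]))) {

      conn.setAutoCommit(false);
      long rows = 0;
      String line;
      while ((line = in.readLine()) != null) {
        String[] f = line.split("\\|");
        ps.setString(1, f[0]);
        ps.setString(2, f[1]);
        ps.setBigDecimal(3, new BigDecimal(f[2]));
        ps.addBatch();
        if (++rows % COMMIT_INTERVAL == 0) {
          ps.executeBatch();   // send the pending batch to the database
          conn.commit();       // commit interval reached
        }
      }
      ps.executeBatch();       // flush any remaining rows
      conn.commit();
    }
  }
}
```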

    EXECUTING, SCHEDULING AND MONITORING GRAPHICALLY

    Using the GUI, you can execute and schedule jobs graphically on a Hadoop cluster, on standalone UNIX, Linux and Windows servers, as well as on your workstation or laptop. You can also monitor the jobs and view logs through the Server dialog. E-mail notifications are also available based on the status of the job, including e-mailing copies of job logs. Steps 4 and 5 demonstrate how to do this.


  • STEP 4: EXECUTING AND SCHEDULING YOUR HADOOP ETL JOBS

    Syncsort DMX-h allows you to easily develop and test your Hadoop ETL jobs graphically in Windows and then deploy them in Hadoop. The Run Job dialog allows users to specify the run-time parameters and execute jobs immediately or on a given schedule.

  • STEP 5: MONITORING YOUR HADOOP ETL JOBS

    Comprehensive logging capabilities as well as integration with Cloudera Manager and the Hadoop JobTracker make it easy to monitor and track your DMX-h jobs. This is the last but very important step, as visibility into ETL workloads is critical. These tools provide the same level of enterprise-grade functionality the organization has come to expect before Hadoop, and make it easy to identify and quickly correct errors, enhance productivity, and optimize the performance of your Hadoop environment.

    You can use the Server dialog to view a comprehensive, real-time list of all the DMX-h jobs, including those running, completed (success, exceptions, terminated) and scheduled.

    5.1: DMX-H MONITORING CAPABILITIES

  • Using the Job Log, you can track the successful completion of DMX-h MapReduce jobs. The same log and dialog can also include non-Hadoop job logs.

    5.2: DMX-H JOB LOGS


  • Since the DMX-h engine is executed natively as part of every mapper and reducer, you can monitor DMX-h statistics through the stderr logs of the map and reduce tasks in the Hadoop JobTracker. This is an example from one of the reduce task logs that invoked DMX-h to perform a join between 5,023 and 4,858 records.

    5.3: HADOOP JOBTRACKER

    See For Yourself with a Free Test Drive & Pre-Built Templates!

    Download a free trial at syncsort.com/try and follow these steps with your own SQL scripts.

    DMX-h Use Case Accelerators are a set of pre-built graphical template jobs that help you get started with loading data to Hadoop and processing data on Hadoop. You can find them at: http://www.syncsort.com/TestDrive/Resources


  • CASE STUDY: LEADING HEALTHCARE ORGANIZATION OFFLOADS EDW TO HADOOP

    A leading healthcare organization continuously experiences exponential growth in data volumes and has invested millions of dollars in creating its data environment to support a team of skilled professionals who use this real-world data to drive safety, health outcomes, as well as late-phase and comparative effectiveness research.

    Faced with a cost-cutting initiative, the organization needed to reduce its hardware and software spend and decided to explore moving its ETL/ELT workloads from its EDW to Hadoop.

    The healthcare organization found that Hadoop offers a cost-effective and scalable data processing environment; on average, the cost to store and process data in Hadoop would be 1% of the cost to process and store the same data in its EDW. But while Hadoop is a key enabling technology for large-scale, cost-effective data processing for ETL workloads, the native Hadoop tools for building and migrating applications (Hive, Pig, Java) require custom coding and lack enterprise features and enterprise support. To fully address its cost-cutting imperative, the organization needed tools that would allow it to leverage existing staff skilled in ETL without requiring significant additional staff with new skills (MapReduce), which are scarce and expensive.

    Projected TCO over 3 years: $1.8M for ELT on the Enterprise Data Warehouse versus $390K for ELT on Hadoop, a projected TCO savings of $1.4M.

  • The healthcare organization turned to Syncsort and found that its existing ETL developers could be productive in Hadoop by leveraging Syncsort DMX-h. The easy-to-use GUI allows existing staff to create data flows with a point-and-click approach and avoid the complexities of MapReduce and manual coding.

    By offloading its EDW to Hadoop with Syncsort, the healthcare organization realized the following benefits:

    Projected TCO savings over 3 years are $1.4M

    Eliminated an immediate $300K EDW expense

    Activated its Hadoop initiative with a modern, secure and scalable enterprise-grade solution

    Enabled Big Data for next-generation analytics

    Fast-tracked their EDW offload to Hadoop with no need for specialized skills, manual coding or tuning

    Achieved comparable high-end performance at a tremendously lower cost

    CONCLUSION

    For years, many organizations have struggled with the cost and processing limitations of using their EDW for data integration. Once considered a best practice, staging areas have become the dirty secret of every data warehouse environment, one that consumes the lion's share of time, money and effort. That's why many Hadoop implementations start with ETL initiatives. But Hadoop also presents its own challenges, and without the right tools offloading ELT workloads can be a lengthy and expensive process, even with relatively inexpensive hardware.

  • Syncsort DMX-h addresses these challenges with an approach that doesn't require you to write, tune or maintain any code, but instead allows you to leverage your existing ETL skills. Even if you are not familiar with ETL, this graphical approach allows you to offload ELT workloads fast, in 5 easy steps. While we used a specific example in this guide for illustrative purposes, the steps can be applied to any ELT offload project as follows:

    STEP 1: Extract Data from original sources, commonly including a mix of relational, multi-structured, social media, and mainframe sources

    STEP 2: Re-design data transformations previously developed in SQL to run on Hadoop using a point-and-click, graphical user interface

    STEP 3: Load the Final Dataset into the desired target repository, usually the Data Warehouse or Hadoop itself

    STEP 4: Execute and Schedule Your Hadoop ETL Jobs on a Hadoop cluster. Simply specify the run-time parameters and execute jobs immediately or on a given schedule.

    STEP 5: Manage and Monitor Your Hadoop ETL Jobs in real-time with comprehensive logging capabilities and integration with Cloudera Manager and the Hadoop JobTracker.

    The 5-step process shows how you can use Syncsort DMX-h for a simpler, all-graphical approach to shift ELT workloads and the associated data into Hadoop before loading the processed data back into the EDW for agile data discovery and visual analytics. The DMX-h GUI and engine support important Hadoop authentication protocols, including Kerberos, and integrate with them natively thanks to a simple architecture and tight integration with Hadoop. DMX-h Use Case Accelerators, a library of reusable templates for some of the most common data flows, dramatically improve productivity when deploying Hadoop jobs. Finally, Syncsort's Hadoop ETL engine runs natively on each Hadoop node, providing much more scalability and efficiency per node versus custom Java, Hive and Pig code, whether written manually by developers or generated automatically by other tools.

    With Syncsort DMX-h you gain a graphical approach to offload ELT workloads into Hadoop with no coding, and a practical way to optimize and free up one of the most valuable investments in your IT infrastructure: the data warehouse. Moreover, offloading ELT workloads into Hadoop will put you on the fast track to a modern data architecture that delivers a single, consistent version of the truth for all of your corporate data.


  • ABOUT US

    Syncsort provides fast, secure, enterprise-grade software spanning Big Data solutions in Hadoop to Big Iron on mainframes. We help customers around the world collect, process and distribute more data in less time, with fewer resources and lower costs. 87 of the Fortune 100 companies are Syncsort customers, and Syncsort's products are used in more than 85 countries to offload expensive and inefficient legacy data workloads, speed data warehouse and mainframe processing, and optimize cloud data integration. Experience Syncsort at www.syncsort.com

    © 2014 Syncsort Incorporated. All rights reserved. Company and product names used herein may be the trademarks of their respective companies. DMXh-EB-001-0614US

    Learn More!

    GUIDE: 5 Steps to Offloading Your Data Warehouse with Hadoop >

    SOLUTIONS: Explore More Hadoop Solutions >

    RESOURCES: View More Hadoop Guides, eBooks, Webcasts, Videos, & More >