A Practical Guide to Big Data Readiness



TABLE OF CONTENTS

Introduction: Are You Ready for Big Data?
The Big Data Continuum
Stage 1: Awakening
Stage 2: Advancing
Stage 3: Plateauing
Stage 4: Dynamic
Stage 5: Evolved
Conclusion
Learn More

INTRODUCTION: ARE YOU READY FOR BIG DATA?

The Big Data problem is a big business problem. Analyzing Big Data to extract meaningful value is no longer a luxury; it's a necessity as companies strive to remain relevant and competitive in the marketplace.

Technological shifts create both opportunities and challenges. While the Internet revolution gave rise to Amazon and iTunes, it also meant the end of Borders, the defunct bookstore chain, and Tower Records. Big Data will be no different. Organizations unable to keep pace with the three Vs of Big Data (Volume, Variety, and Velocity) are at risk of becoming twenty-first-century road kill.

How did we get here? The fact is that organizations have struggled to make sense of data for decades. Since the dawn of computing, there have been periods of innovation that disrupted the entire market. From mainframes to PCs, from the Internet to social and mobile technologies, each fundamental shift in the computing landscape has created unique challenges for organizations' existing data management architectures and processes. One-off point solutions using custom coding in the early '90s gave way to ETL platforms and the enterprise data warehouse, all promising information nirvana: a single version of the truth.

More recently, as datasets explode with unprecedented speed and variety, and the needs of the business become ever more complex, data management is more challenging than ever before. Traditional architectures are breaking once again, and organizations are racing to adapt and rebuild them to handle Big Data. Big Data is driving the next technological shift, and data integration is at the epicenter of the transformation.
SURVIVE AND THRIVE WITH BIG DATA

So how can organizations evaluate their readiness in the context of this new environment and, most importantly, prepare for the challenges ahead? How can you be sure you're making the right investments to embrace, and capitalize on, the opportunities of Big Data?

Organizations across different industries and sectors fall into a wide range of maturity levels in terms of the processes and technologies they use to manage their data and their ability to extract value from it. Therefore, the first steps in preparing for Big Data involve a rigorous assessment of your existing data management architecture and processes, and a strategic roadmap that addresses the challenges and opportunities ahead. In essence: Where are you today, and where do you need to be in the next 12 months?

That's where the Big Data Continuum can help. The Big Data Continuum is a framework that can help you answer these questions and propel your organization to the next level:

Assess your company's data management maturity level.

Identify potential pitfalls as you evaluate and implement new technologies and processes.

Learn how to successfully address common problems that arise at each stage.

Fast-track your journey to embrace Big Data and capitalize on the fourth V: Value.

With decades of data management expertise and a long history of innovation, Syncsort has worked with thousands of companies to help them solve their big data issues, long before they knew the name Big Data. Based on our extensive experience helping customers of all sizes and at all levels of data integration maturity, we've designed a framework to help organizations evolve in their quest to leverage data for competitive advantage. We call this framework the Big Data Continuum.

    THE FIVE STAGES OF DATA INTEGRATION MATURITY:

    Awakening. Data integration tasks are mostly performed using custom coded approaches, often using SQL to

    transform and integrate data inside a database.

    Advancing. Organizations realize the value of data and start standardizing data integration processes on a

common platform such as Informatica, DataStage, and others, leading to greater efficiencies and economies of scale.

Plateauing. Initial successes with an enterprise data warehouse spark the need for more insights. However, increasing data volumes and changing business requirements push the limits of traditional data integration and data warehousing architectures. Stopgap measures trigger a transition from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform), shifting heavy data transformation workloads into the enterprise data warehouse. The IT backlog grows despite standards and best practices. Initial success is replaced by unsustainable costs and user frustration.

Dynamic. Organizations start to look for alternative solutions to meet these challenges in less time, with less effort, and at lower cost. They experiment with Big Data frameworks like Hadoop to address architectural limitations of traditional platforms and look for ways to leverage the accumulated expertise within their organizations.

Evolved. Companies at this stage are scaling Hadoop across the entire enterprise, using it as an integral component of their production data management infrastructure. Big Data platforms become a new standard within these organizations, augmenting traditional architectures at significantly lower costs.

The rest of this paper examines the Big Data Continuum in more detail and provides specific readiness strategies to help your organization address the challenges and opportunities of each stage.

STAGE 1: AWAKENING
For organizations in the Awakening stage, hand coding, often using Structured Query Language (SQL) inside the database, is the most common method to transform and integrate data sets. According to data warehousing expert Rick Sherman, "much of the data integration projects in corporate enterprises are still being done through manual coding methods that are inefficient and often not documented."2

    The problems associated with hand coding and using

    SQL for data integration tasks are well understood

    and include:

    Low Productivity: Developing, maintaining, and extending custom software code is a productivity drain

    and quickly becomes unsustainable. It is particularly challenging to tune, maintain and extend existing

    code when the original developers are no longer in the same roles or have left the company. Custom

    code also makes it difficult to perform impact analysis or data lineage to understand dependencies and

    data flows.

[Figure: Custom SQL code used for ETL processing]

2. Rick Sherman, "Misconceptions Holding Back Use of Data Integration Tools," BI Trends + Strategies, August 2012.
Poor Performance: SQL was not designed for ETL processing. It is a special-purpose programming language designed for querying and managing data stored in relational databases. Using SQL for ETL tasks is inefficient, creating performance bottlenecks and jeopardizing service level agreements (SLAs) for ETL processing windows.

High Cost: Pushing intensive data transformations down to the database steals expensive database cycles from the tasks for which it was intended, resulting in added infrastructure costs and jeopardizing performance SLAs for processing database queries.

All of these issues can make it difficult for organizations to extract information and deliver business value from data, especially as data-driven information and decision making feed a self-reinforcing cycle, creating demand for even more data-driven information. Custom coding will often solve problems at the outset, but as the need for more and faster information grows, these approaches simply can't keep pace with the demands of the business.

READINESS STRATEGIES

Migrate SQL scripts to a high-performance ETL tool. ETL tools have become the de facto answer to SQL scripting, maintenance, and performance issues. When choosing an ETL tool, beware of complex engines and code generators that push SQL down to the database.

Analyze and document complex code and SQL scripts used in data integration processes, and create graphical flow charts to depict SQL logic.

Identify the top 20%. Typically, 20% of SQL scripts consume up to 80% of the time and cost, due to hardware, tuning, and maintenance. Usual suspects include SQL with merge/upsert, joins, materialized views, cursors, and union operations.

Migrate SQL scripts using the 80/20 rule. When planning and evaluating the benefits of SQL migration, it is important to realize that a complete migration of all SQL code is not necessary to achieve significant benefits. Instead, focus on the top 20% to deliver quick results and significant savings.
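To make the "usual suspects" concrete, here is a minimal, hypothetical sketch of the kind of in-database SQL upsert that typically lands in the top 20%; the table and column names are illustrative assumptions, not taken from this guide.

    -- Hypothetical nightly upsert that joins and aggregates inside the database.
    -- Scripts like this often dominate ETL elapsed time and are strong migration candidates.
    MERGE INTO dw.customer_summary AS tgt
    USING (
        SELECT o.customer_id,
               COUNT(*)      AS order_count,
               SUM(o.amount) AS total_amount
        FROM   staging.orders o
        JOIN   staging.customers c ON c.customer_id = o.customer_id
        GROUP  BY o.customer_id
    ) AS src
    ON (tgt.customer_id = src.customer_id)
    WHEN MATCHED THEN UPDATE SET
        order_count  = src.order_count,
        total_amount = src.total_amount
    WHEN NOT MATCHED THEN INSERT (customer_id, order_count, total_amount)
        VALUES (src.customer_id, src.order_count, src.total_amount);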
    HOW SYNCSORT CAN HELP

Syncsort's SQL migration solution is specifically designed to help organizations at the Awakening stage eliminate the SQL ETL coding and maintenance nightmare by migrating existing SQL ETL scripts to a few graphical DMX jobs. Syncsort DMX is high-performance ETL software that accelerates overall performance and eliminates the need for database staging areas, seamlessly reducing the total cost and complexity of data integration.

Intelligent, self-documenting flow charts are automatically generated so you can clearly understand complex SQL scripts used in data integration processes.

A few graphical jobs vs. thousands of lines of SQL code. Replace thousands of lines of SQL code with a few graphical jobs, allowing even novice users to quickly develop and maintain data integration jobs.

Improved IT productivity and sustained optimal performance. Seamlessly scale as data volumes grow, without the need for manual coding or tuning.

Migrate SQL scripts to a high-performance ETL tool. Look for the following characteristics to identify the high-impact code for migration (a minimal sketch of how such candidates might be ranked follows below):

High elapsed processing times.

Very complex scripts, including multiple merges, joins, cursors, and unions.

High impact on resource utilization, including CPU, memory, and storage.

Unstable or error-prone.
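As a purely illustrative sketch, and assuming job runtimes are already being captured somewhere (here a hypothetical etl_job_log table, not mentioned in this guide), candidate scripts could be ranked by total elapsed time before deciding what to migrate first:

    -- Hypothetical ranking of ETL scripts by total elapsed time over recent runs.
    -- The etl_job_log table and its columns are assumptions for illustration only.
    SELECT   script_name,
             COUNT(*)             AS runs,
             SUM(elapsed_seconds) AS total_elapsed_seconds
    FROM     etl_job_log
    GROUP BY script_name
    ORDER BY total_elapsed_seconds DESC
    FETCH FIRST 20 ROWS ONLY;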

STAGE 2: ADVANCING
As organizations progress to the Advancing stage, they will experience:

    More Data. The number and type of data sources users need to leverage increases, often including

    dissimilar data in different formats (e.g. text, mainframe, web logs, and CRM).

    More End Users. The range of end users that must be satisfied increases, including executives, managers

    and field and operations staff, for example.

    More Queries. As the number and roles of end users grow, so do the number, variety, and complexity of

    queries that must be performed on the data.

Companies at this stage come to realize that continuing to use point solutions and hand-coded approaches will

    hold them back. As a result, they will begin to evaluate, adopt and standardize on ETL tools and data integration

    platforms. In addition to investments in IT infrastructure, organizations start to develop and enforce best practices

    and accumulate technical expertise that can prove critical to progress along the Big Data Continuum.

    When surveyed, more organizations identified their data integration readiness at these first two stages of the Big

    Data Continuum than at any of the others.


    READINESS STRATEGIES

Beware of code generators and push-down optimizations. Some organizations have adopted tools that generate SQL or offer so-called push-down optimizations as a means to achieve faster performance at scale. Unfortunately, most of these tools, including Talend and Informatica, require significant skills and ongoing manual tuning to achieve and sustain acceptable performance, creating similar challenges to hand coding and maintaining SQL-based data integration logic.

Improve staff productivity. Select an ETL tool with Windows-based paradigms that don't require a long learning curve or specialized skills. Data integration

    tools should allow users to focus on business rules and workflows, rather than

    complex tuning parameters to achieve and maintain high performance. Look

    for ease of use as well as ease of re-use, with impact analysis and data lineage

    capabilities to make it easy to revise and extend existing applications as business

    requirements change.

Choose a tool that maximizes run-time performance and efficiency. A tool

    that delivers superior run-time processing performance and efficiency will

    maximize resource utilization, minimize costs, and provide superior throughput.

    Look for a solution that performs all transformation processing outside of the

    database, minimizing performance bottlenecks and inefficient utilization of

    expensive database resources. Doing so can keep costs under control and allow

    you to build a solid foundation for the future, avoiding potential issues often

    encountered in the subsequent stages.

    Leverage all your data. Having the right data source and target connectivity is

    critical for leveraging all your data, to help make the best business decisions and

    discover new business opportunities.

Establish a Big Data Center of Excellence (COE). A center of excellence is key to developing and retaining Big Data expertise within the organization. The COE should

also set and enforce standards for the data management architecture, define the strategic roadmap, establish best practices, and provide

    training and support to the organization.


    HOW SYNCSORT CAN HELP

Syncsort's DMX high-performance ETL solution provides companies

    at the Advancing stage with a two-fold approach: it makes addressing

    their immediate productivity issues fast and easy, while providing a solid

    foundation for future data growth.

Template-driven design. DMX offers a clear, intuitive graphical user

    interface that makes it easy for both business and technical users to

    develop and deploy ETL processes.


    Faster transformations for unparalleled ETL

    performance. The solution packages a library of

    hundreds of smart algorithms to handle the most

    demanding data integration transformations and

    functions, delivering up to 10X faster elapsed

    processing times than Informatica, Talend, and

    other conventional tools.

Smart ETL Optimizer. You don't have to worry

    about ongoing, time-consuming tuning efforts to

    maintain optimum performance. Our unique ETL

    Optimizer ensures you will always get maximum

performance, so you can design for the business without wasting time tuning.

    Comprehensive connectivity to leverage all your data. The high

    performance ETL solution provides out-of-the-box connectivity to

    relational sources, flat files, mainframes, Hadoop, and everything in

    between.

    Flexibility and reusability with no strings attached. A file-based

repository delivers all the benefits of a complete metadata layer without dependencies on third-party systems such as relational databases.

STAGE 3: PLATEAUING
    Over time, increasing demands for information oftentimes prove to be too much for traditional architectures to

    handle. As data volumes grow and business users demand fresher data, popular data integration tools such

    as Informatica and DataStage force organizations to push data transformations down to the enterprise data

    warehouse, effectively causing a transition from ETL to ELT. Unfortunately, SQL is almost never the best approach

    for data integration tasks. Relational database management systems (RDBMS) were specifically designed to solve

    problems that involve a big question with a small answer (i.e. user queries). However, when dealing with data

    transformations, the T in ETL, the answer is generally as big, if not bigger, than the question.
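For illustration only, here is a hypothetical ELT-style statement of the kind this stage produces: the transformation runs inside the warehouse and writes back a result set as large as its inputs, consuming database cycles that were meant for user queries. All object names are assumptions, not taken from this guide.

    -- Hypothetical ELT pattern: the "T" runs inside the warehouse and the
    -- answer (the enriched table) is as big as, or bigger than, the question.
    INSERT INTO edw.sales_enriched
    SELECT s.sale_id,
           s.sale_ts,
           s.amount,
           c.segment,
           p.category,
           SUM(s.amount) OVER (PARTITION BY s.customer_id
                               ORDER BY s.sale_ts) AS running_customer_total
    FROM   edw.sales_raw s
    JOIN   edw.customers c ON c.customer_id = s.customer_id
    JOIN   edw.products  p ON p.product_id  = s.product_id;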

    Moreover, organizations can face unacceptable bottlenecks

    and delays, not only for data transformations but also for

    analytical queries, as both processes compete for EDW

    resources. IT staff and budget can quickly be consumed by

    expensive and tedious stopgap measures: manual tuning

efforts, hardware upgrades, and additional data warehouse capacity. Early excitement fades and gives way to user frustration, incremental costs, and a crippling IT backlog.

    The resulting business ramifications of these bottlenecks can

    be severe, including lost revenue opportunities, impaired

    decision making, customer attrition, and so on.

The RDBMS is optimized to solve query loads, that is, big questions with a small answer. However, ETL involves big questions with sometimes even bigger answers. By offloading heavy data transformations from the EDW, you can free up database capacity and budget while accelerating overall data performance.

    READINESS STRATEGIES

Offload transformations from the data warehouse. Inefficient and underperforming ETL tools have forced many IT developers to push transformations down to the database, adding complexity and requiring massive investments in additional database capacity. This approach will actually move you backward along the Big Data Continuum, increasing database costs and the effort to maintain and tune scripts. Look for approaches that shift intensive transformations out of the database.

    Leverage acceleration technologies to extend your existing data

    integration infrastructure. Most organizations have spent considerable time

and money building their existing data integration infrastructure, so rip-and-replace approaches aren't practical. Rather than buying extra hardware and

    database capacity, you can identify where the bottlenecks occur and bring in

    specialized data integration technology to accelerate these processes. For

    example, technology now exists that can efficiently handle sorts, merges, and

    aggregations, and that integrates seamlessly with your existing architecture.

Accelerating technologies increase an organization's Big Data readiness by

    removing performance bottlenecks while allowing them to leverage their existing

    architecture. These plug-and-play technologies typically result in significant

    savings that can be used to fund initiatives to move into the Dynamic stage.

    Start with the top 20% of data transformations. Usually 20% of the

    transformations incur 80% of the processing problems. Offloading and

    accelerating these transformations will provide the best bang for the buck.

Consider using Hadoop to offload all ETL processes from the data warehouse.

    Hadoop is emerging as the de facto operating system for Big Data. Thanks to its

    massively scalable and fault-tolerant architecture, Hadoop can be much more

    effective from a performance and cost perspective than the data warehouse in

    processing ETL workloads. In addition, shifting ETL workloads to Hadoop

    can free up valuable database capacity to accelerate user queries.


    HOW SYNCSORT CAN HELP

Syncsort's ETL optimization solution helps organizations maximize the

    return on their data integration investments, allowing them to keep their

    existing infrastructure while shifting the heavy transformation processes to

    Syncsort DMX.

Accelerate your existing data integration environment, including Informatica and DataStage, by 10x or more. Syncsort packages a library of hundreds of smart algorithms, as well as an ETL Optimizer, to handle the most demanding data integration transformations and deliver up to 10x faster elapsed times.

    Simply plug DMX into your existing environment. DMX provides

    advanced metadata interchange capabilities to bi-directionally

    exchange metadata with other applications. This makes it easy

    to plug the solution into existing data integration environments to

    seamlessly accelerate performance, eliminate constant tuning, and

    facilitate regulatory compliance.

Free up your database and your budget. Syncsort's ETL optimization

    solution shifts all data transformations from the enterprise data

    warehouse into the DMX high-performance ETL engine, freeing up

    database resources for faster user queries.

    Get Hadoop-ready. Syncsort offers high-performance data

integration software with everything you need to deploy enterprise-grade ETL capabilities on Hadoop. DMX-h offers a unique approach

    to Hadoop ETL that lowers the barriers for adoption, helping your

    organization unleash the full potential of Hadoop. Thanks to a

library of Use Case Accelerators, it's easy for organizations to get

    started with Hadoop by implementing common ETL tasks such as

    joins, change data capture (CDC), web log aggregations, mainframe

    data access and more.

STAGE 4: DYNAMIC
    Hadoop is helping organizations in all industries gain greater insights, processing more data in less time and at a

    lower cost. According to organizations surveyed, the top benefits from their use of Hadoop are finding previously

undiscovered insights and reducing the overall costs of data.

Two of the most common approaches include data warehouse optimization and mainframe offload. By shifting transformations, the T in ETL, out of the data warehouse and into Hadoop, organizations can quickly realize significant value, including shortened ETL batch windows, faster database user queries, and significant operational savings in the form of spare database capacity. Similarly, enterprises that rely on mainframe processing to support mission-critical applications can capitalize on valuable insights and savings by offloading data and batch processing from the mainframe into Hadoop.

    It is important to recognize, however, that Hadoop is not a complete ETL solution. Hadoop is an operating system

    that provides the underlying services to create Big Data applications. While it offers powerful utilities and massive

    horizontal scalability, it does not provide the full set of capabilities that users need to deliver enterprise ETL

    applications and functionality. If not addressed correctly, the gaps between the operating-level services that

    Hadoop offers and the functionality that enterprise-grade ETL requires can slow Hadoop adoption and frustrate

    organizations eager to deliver results, jeopardizing subsequent investments.

Hadoop is an open-source software framework that excels at processing and analyzing large amounts of data at scale. Hadoop makes it practical to scale out processing tasks across large numbers of nodes by handling the complicated aspects of creating, managing, and executing a set of parallel processes over a cluster of low-cost computers.

ETL, the process of collecting, processing, and distributing data, has emerged as one of the most common use cases for Hadoop.3 In fact, industry analyst Gartner predicts that most organizations will adapt their data integration strategy, using Hadoop as a form of preprocessor for Big Data integration in the data warehouse.4

Use of Hadoop can become a game changer for organizations, dramatically improving the cost structure for gaining new insights, for analyzing larger data sets and new data types, and for quickly and flexibly bringing new services to market.

[Figure: Typical MapReduce data flow. An Input Formatter feeds MAP tasks, whose sorted output (with optional Partitioner and Combiner) is spilled to local disk, shuffled, sorted again, and passed to REDUCE tasks, which write results through an Output Formatter to HDFS.]
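As an illustration that is not part of the original guide, the kind of web-log aggregation discussed above can be expressed in HiveQL (Hive's SQL dialect), which Hive compiles into exactly this map, sort/shuffle, and reduce flow; the table and column names here are assumptions.

    -- Hypothetical HiveQL aggregation over raw web logs stored in HDFS.
    -- Hive translates this query into MapReduce jobs on the cluster.
    CREATE TABLE weblog_daily_summary AS
    SELECT  to_date(event_ts)        AS event_date,
            page_url,
            COUNT(*)                 AS hits,
            COUNT(DISTINCT user_id)  AS unique_visitors
    FROM    weblog_raw
    GROUP BY to_date(event_ts), page_url;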

3. http://blog.cloudera.com/blog/2013/02/big-datas-new-use-cases-transformation-active-archive-and-exploration/

4. Mark A. Beyer and Ted Friedman, "Big Data Adoption in the Logical Data Warehouse," Gartner Research, February 2013.

    READINESS STRATEGIES

    During experimentation and early stages of Hadoop, the main objective is to prove the

    value that Hadoop can bring to organizations by augmenting or extending existing data

    integration and data warehouse architectures. Therefore, data connectivity and quick

development of common ETL use cases are critical for organizations at the Dynamic stage. Connectivity to the right data sources can maximize the value of the framework

    and avoid having Hadoop become yet another silo within the enterprise. In addition,

    quickly ramping productivity with Hadoop allows IT to deliver quantifiable successes that

    pave the way for more widespread adoption. Success at this stage enables companies

    to move to the Evolved stage, where Hadoop becomes an integral component of the

    production data management architecture.

    Select a tool with a wide variety of connectors to source and target systems.

    Simplify importing data from various sources into Hadoop, as well as exporting

    data from Hadoop to other systems.

Leverage mainframe data. Mainframe data can be the critical reference point for

    new data sources, such as web logs and sensor data. Therefore, make sure the

    tool provides connectivity and data translation capabilities for the mainframe.

Ensure the tool offers a comprehensive library of pre-built, out-of-the-box data transformations. The most common data flows include joins, aggregations, and change data capture (a minimal CDC sketch follows this list). Reusable templates can accelerate development of prototype applications and proof of value.

    Avoid tools that generate code. These tools will burden your organization with

    heavy tuning and maintenance.

    Test and break your system. As you build your proof-of-concept, stress testing

    your system will help you assess the reliability of your implementation and will

    teach your staff critical skills to maintain and support it down the road.

    Identify and prioritize use cases. Identify one (or a small number of) proof-of-

    concept use cases for Hadoop. Candidate use cases often involve recurring ETL

    processes that place a heavy burden on the existing data warehouse.
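As a purely illustrative sketch of the change data capture (CDC) flow mentioned above, and assuming two hypothetical customer snapshots are already loaded, the changed rows could be classified like this (valid in standard SQL and in HiveQL); all table and column names are assumptions:

    -- Hypothetical CDC between a current and a previous snapshot.
    -- Rows only in the current snapshot are inserts, rows only in the
    -- previous snapshot are deletes, and rows in both that differ are updates.
    SELECT  COALESCE(cur.customer_id, prev.customer_id) AS customer_id,
            CASE
                WHEN prev.customer_id IS NULL THEN 'INSERT'
                WHEN cur.customer_id  IS NULL THEN 'DELETE'
                ELSE 'UPDATE'
            END AS change_type
    FROM    snapshot_current cur
    FULL OUTER JOIN snapshot_previous prev
            ON cur.customer_id = prev.customer_id
    WHERE   prev.customer_id IS NULL
       OR   cur.customer_id  IS NULL
       OR   cur.email  <> prev.email
       OR   cur.status <> prev.status;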

    HOW SYNCSORT CAN HELP

Syncsort's DMX-h high-performance data integration software provides a smarter approach to Hadoop ETL, including an intuitive graphical interface

    for easily creating and maintaining jobs, a wide range of productivity

    features, metadata facilities for development re-use and data lineage, high-

    performance connectivity capabilities, and an ability to run natively within

    the MapReduce framework, avoiding code generation.

    Smarter connectivity to all your data. With DMX-h, you only need

    one tool to connect all sources and targets to Hadoop, including

    relational databases, appliances, files, XML, and even cloud. No

coding or scripting is needed. DMX-h can also be used to pre-process data (cleanse, sort, partition, and compress) prior to

    loading it into Hadoop, resulting in enhanced performance and

    significant storage savings.

    Smarter mainframe data ingestion and translation. DMX-h offers

    unique capabilities to read, translate, and distribute mainframe

    data with Hadoop. It supports mainframe record formats such

    as fixed, variable, variable with block descriptor, and VSAM, and

    also translates data from EBCDIC to ASCII, and imports COBOL

    copybooks without coding.

Smarter testing, debugging, and troubleshooting. DMX-h allows you to develop, test, and troubleshoot locally in Windows before deploying into Hadoop. In addition, DMX-h provides comprehensive logging capabilities, as well as integration with Hadoop's JobTracker for easier

    log consumption.

    Smarter productivity to fast-track your way to

    successful Hadoop ETL. DMX-h helps you get started

    and become fully productive with Hadoop quickly

    by providing a library of Use Case Accelerators that

    implement common ETL tasks such as joins, change

    data capture (CDC), web log aggregations, mainframe

    data access, and more.

STAGE 5: EVOLVED
While most organizations at this stage are not looking to replace their existing data warehousing infrastructure with Hadoop, ETL is a different story. Hadoop is poised to completely change the way organizations collect, process, and distribute their data. ETL is shifting to Hadoop ETL, and Big Data is becoming the new standard architecture, providing greater value to the organization at a cost structure that is radically lower than traditional architectures. That's why the ability to cost-effectively utilize Big Data is quickly becoming a requirement for companies to survive.

    For example, an organization can store

    aggregated web log data in their relational

    database, while keeping the complete

    datasets at the most granular level in Hadoop.

    This allows them to run new queries against

    the full historical data at any time to find new

insights, which can be a true game-changer as organizations aggressively look for new

    insights and offerings to differentiate from

    the competition.
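For illustration only, under the hypothetical web-log scenario above, an analyst could run an ad-hoc HiveQL query directly against the full granular history kept in Hadoop, without touching the aggregates stored in the relational database; all table and column names (including visit_number) are assumptions:

    -- Hypothetical ad-hoc question answered from the complete granular history in Hadoop:
    -- which referrers drove the most first-time visitors in a given quarter?
    SELECT  referrer_url,
            COUNT(DISTINCT user_id) AS first_time_visitors
    FROM    weblog_raw
    WHERE   visit_number = 1
      AND   event_ts >= '2013-07-01'
      AND   event_ts <  '2013-10-01'
    GROUP BY referrer_url
    ORDER BY first_time_visitors DESC
    LIMIT 20;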


    As organizations begin to standardize on Hadoop as the new Big Data platform, they must keep hardware

    and resource costs under control. Although Hadoop leverages commodity hardware, the total cost for system

    resources can still be significant. When dealing with large numbers of nodes, hardware costs add up. Programming

resources (e.g., HiveQL, Pig, Java, MapReduce) can also prove expensive. Using Hadoop for ETL processing requires specialized and expensive developers who can be hard to find and hire. For example, the Wall Street Journal recently cited that a Hadoop programmer can now earn as much as $300,000 per year.

Today, the reality is that very few organizations have reached the Evolved stage. Less than 2% of organizations

    surveyed are using Hadoop as an integral component of their data management platform. But many organizations

    are working towards this goal, and almost 11% expect to be at this stage within the next twelve months. Those who

    get there faster will have a definite competitive edge.

    READINESS STRATEGIES

    Organizations at this stage need to focus on approaches that will allow them to efficiently scale

    adoption of Big Data technologies across the entire enterprise. As companies move from proof-

    of-value solutions to full-scale adoption, it is critical to understand that what worked in the earlier

    stages may not always work in the Evolved stage.

Select an approach with built-in optimizations that enhance Hadoop's vertical

    scalability to reduce hardware requirements. Run performance benchmarks and study

    which tools deliver the best combination of price/performance for your most common

    use cases.

    Ensure code does not become a coding nightmare. While learning and developing Pig,

    HiveQL, and Java code might be fun at the beginning, highly repetitive tasks such as

    joins, change data capture (CDC), and aggregations can quickly become a nightmare to

troubleshoot and maintain. Using tools with a template-driven approach can make you more productive by letting you focus on more value-added activities.

Choose a Hadoop ETL tool with a user-friendly graphical interface. Easily build ETL

    jobs without the need to develop, debug, and maintain complex Java, Pig, HiveQL, and

    other specialized code for MapReduce. Using common ETL paradigms will allow you

    to leverage existing ETL skills within your organization, minimizing barriers for wider

    Hadoop adoption.

    Consider an ETL tool with native Hadoop integration. Beware of ETL tools that claim

    integration with Hadoop but simply generate code such as HiveQL, Pig, or Java. These

    approaches can create additional performance overhead and maintenance hurdles down

    the road.

    Leverage a metadata repository. This will facilitate reusability, data lineage, and impact

    analysis capabilities.

    Rationalize your data warehouse. Identify the top 20% of ETL workflows causing

    problems within your existing enterprise data warehouse. Start by shifting these

    processes into Hadoop. Operational savings and additional database capacity can then

    be used to fund more strategic initiatives.

    Secure your Hadoop data. Any viable approach to Hadoop ETL must provide ironclad

security that meets your organization's and industry's data security requirements.

    Seamless support for Kerberos and LDAP is key.

    Augment your Center of Excellence (COE) with Hadoop best practices and guidelines.

Enhance your organization's COE to provide expertise in Hadoop and related tools, and

    to define and standardize guidelines to identify and align the appropriate IT resources

    with the appropriate use cases throughout your organization.

    HOW SYNCSORT CAN HELP

    Syncsort DMX-h turns Hadoop into a more robust and feature-rich ETL

    solution, enabling users to maximize the benefits of MapReduce without

compromising on the capabilities and ease of use offered by conventional

    data integration tools.

Faster performance per node. DMX-h is not a code generator. Instead, Hadoop automatically invokes the highly efficient DMX-h

    runtime engine, which executes on all nodes as an integral part

    of Hadoop. DMX-h can help organizations in the Evolved stage

    by delivering consistently higher performance per node as data

    volumes grow.

Hadoop ETL without coding. DMX-h enables people with a much broader range of skills, not just MapReduce programmers, to create ETL tasks that execute within the MapReduce framework, replacing complex Java, Pig, or HiveQL code with a powerful, easy-to-use graphical development environment.

    Enterprise-grade security for Hadoop ETL. DMX-h helps you

    keep all your data secure with market-leading support for common

    protocols such as LDAP and Kerberos.

    Smarter Hadoop deployments. DMX-h offers tight integration

with all major Hadoop distributions, including Apache, Cloudera,

    Hortonworks, MapR, and PivotalHD. Seamless integration with

    Cloudera Manager allows you to easily deploy and upgrade DMX-h

    in your entire Hadoop cluster with the click of a button.

Optimized sort for MapReduce processes and HiveQL. Thanks to Syncsort's recently committed contribution to the open source community, MAPREDUCE-2454, you can simply plug DMX-h

    into your existing Hadoop clusters to seamlessly optimize existing

    Hive and MapReduce jobs for even greater performance and more

    efficient use of your Hadoop cluster.

    Smarter Economics. Keep costs down as you scale Hadoop across

the entire organization. DMX-h's unique capabilities help you

    maximize savings, delivering best-in-class ETL technology at a price

    point that is more consistent with the cost structure of open source

    solutions. Achieve significant operational savings faster by shifting

    existing ETL workloads from high-end platforms to Hadoop.

Syncsort developed and contributed key features to the Apache open source community to make the sort function pluggable within Hadoop. MAPREDUCE-2454 (https://issues.apache.org/jira/browse/MAPREDUCE-2454) allows you to run the fastest and most efficient sort technology natively within Hadoop to optimize existing MapReduce operations without any code changes or tuning.

ARE YOU READY TO EMBRACE THE CHALLENGES AND OPPORTUNITIES OF BIG DATA?

The Big Data Continuum, a framework developed with decades of data management expertise, can help you assess your readiness and prepare for the challenges ahead:

Assess your company's data management maturity level.

    Identify potential pitfalls as you evaluate and implement new technologies and processes.

    Learn how to successfully address common problems that arise at each stage.

Fast-track your journey to embrace Big Data and capitalize on the fourth V: Value.

    The key stages of the Big Data Continuum are:

    Awakening. Primarily using hand-coding techniques to process data.

    Advancing. Standardizing on traditional data integration platforms.

    Plateauing. Straining the limits of traditional data integration architectures.

    Dynamic. Experimenting with Hadoop.

    Evolved. Standardizing on Hadoop as the operating system for Big Data across the entire enterprise.

ARE YOU READY FOR BIG DATA?

Organizations that are further along the Big Data Continuum have a much better chance to succeed and enjoy first-mover advantage, while laggards will find themselves at risk of declining revenues, market share, and relevance. Regardless of where you are on the Big Data Continuum, Syncsort offers smarter solutions to help you leverage all your data assets and build a solid foundation for Big Data. With thousands of deployments across all major platforms, Syncsort's solutions, from SQL migration to high-performance ETL to Hadoop, can help you thrive in the world of Big Data.

LEARN MORE

Discover Syncsort's Big Data Solutions

Take a Free Test Drive of Our Hadoop ETL Solution

Check Out Our Infographic: The Big Picture on Big Data & Hadoop

Read a Report: The European Big Picture on Big Data & Hadoop

Syncsort provides data-intensive organizations across the Big Data continuum with a smarter way to collect and process the ever-expanding data avalanche. With thousands of deployments across all major platforms, including mainframe, Syncsort helps customers around the world overcome the architectural limits of today's ETL and Hadoop environments, empowering their organizations to drive better business outcomes in less time, with fewer resources, and at lower TCO. For more information, visit www.syncsort.com.
