A Practical Guide to Big Data Readiness
7/25/2019 A Practical Guide to Big Data Readiness
TABLE OF CONTENTS

Introduction: Are You Ready for Big Data?
The Big Data Continuum
Stage 1: Awakening
Stage 2: Advancing
Stage 3: Plateauing
Stage 4: Dynamic
Stage 5: Evolved
Conclusion
Learn More
INTRODUCTION: ARE YOU READY FOR BIG DATA?

The Big Data problem is a big business problem. Analyzing Big Data to extract meaningful value is no longer a
luxury; it's a necessity as companies strive to remain relevant and competitive in the marketplace.

Technological shifts create both opportunities and challenges. For instance, while the Internet revolution gave
rise to Amazon and iTunes, it also meant the end of Borders, the defunct bookstore chain, and of Tower Records.
Big Data will be no different. Organizations unable to effectively keep pace amidst the three Vs of Big Data
(Volume, Variety, and Velocity) are at risk of becoming twenty-first century road kill.

How did we get here? The fact is that organizations have struggled to make sense of data for decades. And,
since the dawn of computing, there have been periods of innovation that have disrupted the entire market.
From mainframes to PCs, from the Internet to social and mobile technologies, each fundamental shift in the
computing landscape has created unique challenges for organizations' existing data management architectures
and processes. One-off point solutions using custom coding in the early 90s gave way to ETL platforms and
the enterprise data warehouse, all promising information nirvana: a single version of the truth.

More recently, as datasets explode with unprecedented speed and variety, and the needs of the business
become ever more complex, data management is more challenging than ever before. Traditional architectures
are breaking once again, and organizations are racing to adapt and rebuild them to handle Big Data. Big Data is
driving the next technological shift, and data integration is at the epicenter of the transformation.
1 The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. IDC, December 2012.
SURVIVE AND THRIVE WITH BIG DATA

So how can organizations evaluate their readiness in the context of this new environment and, most importantly,
prepare for the challenges ahead? How can you be sure you're making the right investments to embrace, and
capitalize on, the opportunities of Big Data?
That's where the Big Data Continuum can help. The Big Data Continuum is a framework that can help you:
Assess your company's data management maturity level.
Identify potential pitfalls as you evaluate and implement new technologies and processes.
Learn how to successfully address common problems that arise at each stage.
Fast-track your journey to embrace Big Data and capitalize on the fourth V: Value.
With decades of data management expertise and a long history of innovation,
Syncsort has worked with thousands of companies to help them solve their big data
issues, long before they knew the name Big Data. Based on our extensive experience
helping customers of all sizes and at all levels of data integration maturity, we've
designed a framework to help organizations evolve in their quest to leverage data for
competitive advantage. We call this framework The Big Data Continuum.
Organizations across different industries and sectors fall into
a wide range of maturity levels in terms of the processes and
technologies they use to manage their data, and their ability
to extract value from it. Therefore, the first steps in preparing
for Big Data involve a rigorous assessment of your existing
data management architecture and processes, and a strategic
roadmap that includes the challenges and opportunities
ahead. In essence: Where are you today, and where do you
need to be in the next 12 months?
The Big Data Continuum is a framework that can help you
answer these questions and propel your organization to the
next level.
THE FIVE STAGES OF DATA INTEGRATION MATURITY:
Awakening. Data integration tasks are mostly performed using custom coded approaches, often using SQL to
transform and integrate data inside a database.
Advancing. Organizations realize the value of data and start standardizing data integration processes on a
common platform such as Informatica, DataStage, and others, leading to greater efficiencies and economies of
scale.
Plateauing. Initial successes with an enterprise data warehouse spark the need for more insights. However,
increasing data volumes and changing business requirements push the limits of traditional data integration and
data warehousing architectures. Stopgap measures trigger a transition from ETL (Extract, Transform, Load) to
ELT (Extract, Load, Transform), shifting heavy data transformation workloads into the enterprise data warehouse.
The IT backlog grows despite standards and best practices. Initial success is replaced by unsustainable costs
and user frustration.
Dynamic. Organizations start to look for alternative solutions to meet these challenges in less time, with less
effort, and at lower cost. They experiment with Big Data frameworks like Hadoop to address architectural
limitations of traditional platforms and look for ways to leverage the accumulated expertise within their
organizations.
Evolved. Companies at this stage are scaling Hadoop across the entire enterprise, using it as an integral
component of their production data management infrastructure. Big Data platforms become a new standard
within these organizations, augmenting traditional architectures at significantly lower costs.
The rest of this paper examines the Big Data Continuum in more detail and provides specific
readiness strategies to help your organization address the challenges and opportunities
of each stage.
STAGE 1: AWAKENING

For organizations in the Awakening stage, hand coding,
often using Structured Query Language (SQL)
inside the database, is the most common method to
transform and integrate data sets. According to data
warehousing expert Rick Sherman, much of the data
integration projects in corporate enterprises are still
being done through manual coding methods that are
inefficient and often not documented.2
The problems associated with hand coding and using
SQL for data integration tasks are well understood
and include:
Low Productivity: Developing, maintaining, and extending custom software code is a productivity drain
and quickly becomes unsustainable. It is particularly challenging to tune, maintain and extend existing
code when the original developers are no longer in the same roles or have left the company. Custom
code also makes it difficult to perform impact analysis or data lineage to understand dependencies and
data flows.
Figure: Custom SQL Code Used for ETL Processing
2 Rick Sherman. Misconceptions Holding Back Use of Data Integration Tools. BI Trends + Strategies, August 2012.
READINESS STRATEGIES
Migrate SQL scripts to a high-performance ETL tool. ETL tools have
become the de facto solution to SQL scripting, maintenance and
performance issues. When choosing an ETL tool, beware of complex
engines and code-generators that push SQL down to the database.
Analyze and document complex code and SQL scripts used in data
integration processes and create graphical flow charts to depict SQL logic.
Identify the top 20%. Typically, 20% of SQL scripts consume up to 80%
of the time and cost, due to hardware, tuning and maintenance. Usual
suspects include SQL with merge/upsert, joins, materialized views, cursors
and union operations.
Migrate SQL scripts using the 80/20 rule. When planning and evaluating
the benefits of SQL migration, it is important to realize that a complete
migration of all SQL code is not necessary to achieve significant benefits.
Instead, focus on the top 20% to deliver quick results and significant savings.
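The 80/20 triage above can be approximated with a short script. The following Python sketch is illustrative only; the pattern list, scoring, file names, and function names are assumptions for this example, not part of any Syncsort tooling. It scans SQL scripts for the "usual suspect" constructs and ranks them as migration candidates.

```python
import re

# Illustrative "usual suspect" constructs from the 80/20 heuristic:
# merges/upserts, joins, cursors, unions, materialized views.
SUSPECT_PATTERNS = {
    "merge/upsert": r"\bMERGE\b|\bUPSERT\b",
    "join": r"\bJOIN\b",
    "cursor": r"\bCURSOR\b",
    "union": r"\bUNION\b",
    "materialized view": r"\bMATERIALIZED\s+VIEW\b",
}

def score_script(sql_text):
    """Count occurrences of each costly construct in one SQL script."""
    counts = {}
    for name, pattern in SUSPECT_PATTERNS.items():
        counts[name] = len(re.findall(pattern, sql_text, flags=re.IGNORECASE))
    counts["total"] = sum(counts.values())
    return counts

def rank_scripts(scripts):
    """Rank named scripts by suspect-construct count, worst first."""
    scored = [(name, score_script(text)["total"]) for name, text in scripts.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical scripts to triage:
scripts = {
    "load_orders.sql": "MERGE INTO orders o USING staging s ON o.id = s.id ...",
    "report.sql": "SELECT region, SUM(amount) FROM sales GROUP BY region",
}
print(rank_scripts(scripts))  # load_orders.sql ranks first
```

In practice you would also weigh elapsed processing time and resource usage alongside such a static scan, as the sidebar on high-impact code suggests.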
Poor Performance: SQL was not designed for ETL processing. Instead, it is a special-purpose programming
language designed for querying and managing data stored in relational databases. Using SQL for ETL
tasks is inefficient, creating performance bottlenecks and jeopardizing service level agreements (SLAs) for
ETL processing windows.
High Cost: Pushing intensive data transformations down to the database steals expensive database
cycles from the tasks for which it was intended, resulting in added infrastructure costs and jeopardizing
performance SLAs for processing database queries.
All of these issues can make it difficult for organizations to extract information and deliver business value from
data, especially as data-driven information and decision making become a vicious cycle, creating the demand for
even more data-driven information. Often, custom coding will solve problems at the outset, but as the need for
more and faster information grows, these approaches simply can't keep pace with the demands of the business.
HOW SYNCSORT CAN HELP
Syncsort's SQL migration solution is specifically designed to help
organizations at the Awakening stage eliminate the SQL ETL coding and
maintenance nightmare by migrating existing SQL ETL scripts to a few
graphical DMX jobs. Syncsort DMX is high-performance ETL software that
accelerates overall performance and eliminates the need for database
staging areas, seamlessly reducing the total cost and complexity of data
integration.
Intelligent, self-documenting flow charts are automatically
generated so you can clearly understand complex SQL scripts used
in data integration processes.
A few graphical jobs vs. thousands of lines of SQL code. Replace
thousands of lines of SQL code with a few graphical jobs, allowing
even novice users to quickly develop and maintain data integration
jobs.
Improved IT productivity and sustained optimal performance.
Seamlessly scale as data volumes grow, without the need for
manual coding or tuning.
Migrate SQL scripts to a high-
performance ETL tool. Look for the
following characteristics to identify
the high-impact code for migration:
High elapsed processing times.
Very complex scripts, including
multiple merges, joins, cursors and
unions.
High impact on resource utilization,
including CPU, memory, and storage.
Unstable or error-prone.
STAGE 2: ADVANCING

As organizations progress to the Advancing stage, they will experience:
More Data. The number and type of data sources users need to leverage increases, often including
dissimilar data in different formats (e.g. text, mainframe, web logs, and CRM).
More End Users. The range of end users that must be satisfied increases, including executives, managers
and field and operations staff, for example.
More Queries. As the number and roles of end users grow, so do the number, variety, and complexity of
queries that must be performed on the data.
Companies at this stage come to realize that continuing to use point solutions and hand-coded approaches will
hold them back. As a result, they will begin to evaluate, adopt, and standardize on ETL tools and data integration
platforms. In addition to investments in IT infrastructure, organizations start to develop and enforce best practices
and accumulate technical expertise that can prove critical to progress along the Big Data Continuum.
When surveyed, more organizations identified their data integration readiness at these first two stages of the Big
Data Continuum than at any of the others.
READINESS STRATEGIES
Beware of code-generators and push-down optimizations. Some organizations
have adopted tools that generate SQL or offer so-called push-down
optimizations as a means to achieve faster performance at scale. Unfortunately,
most of these tools, including Talend and Informatica, require significant
skills and ongoing manual tuning to achieve and sustain acceptable performance,
creating similar challenges to hand coding and maintaining SQL-based data
integration logic.
Improve staff productivity. Select an ETL tool with Windows-based paradigms
that don't require a long learning curve or specialized skills. Data integration
tools should allow users to focus on business rules and workflows, rather than
complex tuning parameters to achieve and maintain high performance. Look
for ease of use as well as ease of re-use, with impact analysis and data lineage
capabilities to make it easy to revise and extend existing applications as business
requirements change.
Choose a tool that maximizes run-time performance and efficiency. A tool
that delivers superior run-time processing performance and efficiency will
maximize resource utilization, minimize costs, and provide superior throughput.
Look for a solution that performs all transformation processing outside of the
database, minimizing performance bottlenecks and inefficient utilization of
expensive database resources. Doing so can keep costs under control and allow
you to build a solid foundation for the future, avoiding potential issues often
encountered in the subsequent stages.
Leverage all your data. Having the right data source and target connectivity is
critical for leveraging all your data, to help make the best business decisions and
discover new business opportunities.
Establish a Big Data Center of Excellence (COE). A center of excellence is key
to developing and retaining Big Data expertise within the organization. The COE
should also set and enforce standards for the data management architecture,
define the strategic roadmap, establish best practices, and provide
training and support to the organization.
HOW SYNCSORT CAN HELP
Syncsort's DMX high-performance ETL solution provides companies
at the Advancing stage with a two-fold approach: it makes addressing
their immediate productivity issues fast and easy, while providing a solid
foundation for future data growth.
Template-driven design. DMX offers a clear, intuitive graphical user
interface that makes it easy for both business and technical users to
develop and deploy ETL processes.
Faster transformations for unparalleled ETL
performance. The solution packages a library of
hundreds of smart algorithms to handle the most
demanding data integration transformations and
functions, delivering up to 10X faster elapsed
processing times than Informatica, Talend, and
other conventional tools.
Smart ETL Optimizer. You don't have to worry
about ongoing, time-consuming tuning efforts to
maintain optimum performance. Our unique ETL
Optimizer ensures you will always get maximum
performance, so you can design for the business
without wasting time tuning.
Comprehensive connectivity to leverage all your data. The high-performance
ETL solution provides out-of-the-box connectivity to
relational sources, flat files, mainframes, Hadoop, and everything in
between.
Flexibility and reusability with no strings attached. A file-based
repository delivers all the benefits of a complete metadata layer
without dependencies on third-party systems such as relational
databases.
STAGE 3: PLATEAUING

Over time, increasing demands for information often prove to be too much for traditional architectures to
handle. As data volumes grow and business users demand fresher data, popular data integration tools such
as Informatica and DataStage force organizations to push data transformations down to the enterprise data
warehouse, effectively causing a transition from ETL to ELT. Unfortunately, SQL is almost never the best approach
for data integration tasks. Relational database management systems (RDBMS) were specifically designed to solve
problems that involve a big question with a small answer (i.e. user queries). However, when dealing with data
transformations, the T in ETL, the answer is generally as big, if not bigger, than the question.
Moreover, organizations can face unacceptable bottlenecks
and delays, not only for data transformations but also for
analytical queries, as both processes compete for EDW
resources. IT staff and budget can quickly be consumed by
expensive and tedious stopgap measures: manual tuning
efforts, hardware upgrades, and additional data warehouse
capacity. Early excitement fades and gives way to user
frustration, incremental costs, and a crippling IT backlog.
The resulting business ramifications of these bottlenecks can
be severe, including lost revenue opportunities, impaired
decision making, customer attrition, and so on.
The RDBMS is optimized to
solve query loads. That is, big
questions with a small answer.
However, ETL involves big
questions with sometimes even
bigger answers. By offloading
heavy data transformations
from the EDW, you can free up
database capacity and budget
while accelerating overall data
performance.
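To make the offloading idea concrete, here is a hedged, toy Python sketch of the sort-then-aggregate pattern an external ETL engine performs outside the database. The data and names are invented for illustration, and an in-memory sort stands in for the disk-based external sort a real engine would use.

```python
from itertools import groupby
from operator import itemgetter

# Toy illustration: aggregate outside the database instead of pushing
# a heavy GROUP BY down to the warehouse. sorted() stands in for a
# high-performance external (disk-based) sort here.
rows = [
    ("east", 100), ("west", 50), ("east", 25), ("west", 75), ("east", 10),
]

def aggregate(rows):
    """Sort by key, then stream grouped sums: the sort/aggregate
    pattern an ETL engine performs off the database."""
    ordered = sorted(rows, key=itemgetter(0))
    return {key: sum(amount for _, amount in group)
            for key, group in groupby(ordered, key=itemgetter(0))}

print(aggregate(rows))  # {'east': 135, 'west': 125}
```

The design point is that the database only serves the raw rows; the expensive sort and aggregation run in the external engine, freeing database cycles for user queries.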
READINESS STRATEGIES
Offload transformations from the data warehouse. Inefficient and
underperforming ETL tools have forced many IT developers to push
transformations down to the database, adding complexity and requiring massive
investments in additional database capacity. This approach will actually move
you backward along the Big Data Continuum, increasing database costs and
the effort to maintain and tune scripts. Look for approaches that shift intensive
transformations out of the database.
Leverage acceleration technologies to extend your existing data
integration infrastructure. Most organizations have spent considerable time
and money building their existing data integration infrastructure, so rip &
replace approaches aren't practical. Rather than buying extra hardware and
database capacity, you can identify where the bottlenecks occur and bring in
specialized data integration technology to accelerate these processes. For
example, technology now exists that can efficiently handle sorts, merges, and
aggregations, and that integrates seamlessly with your existing architecture.
Accelerating technologies increase an organization's Big Data readiness by
removing performance bottlenecks while allowing them to leverage their existing
architecture. These plug-and-play technologies typically result in significant
savings that can be used to fund initiatives to move into the Dynamic stage.
Start with the top 20% of data transformations. Usually 20% of the
transformations incur 80% of the processing problems. Offloading and
accelerating these transformations will provide the best bang for the buck.
Consider using Hadoop to offload all ETL processes from the data warehouse.
Hadoop is emerging as the de facto operating system for Big Data. Thanks to its
massively scalable and fault-tolerant architecture, Hadoop can be much more
effective from a performance and cost perspective than the data warehouse in
processing ETL workloads. In addition, shifting ETL workloads to Hadoop
can free up valuable database capacity to accelerate user queries.
HOW SYNCSORT CAN HELP
Syncsort's ETL optimization solution helps organizations maximize the
return on their data integration investments, allowing them to keep their
existing infrastructure while shifting the heavy transformation processes to
Syncsort DMX.
Accelerate your existing data integration environment, including
Informatica and DataStage by 10x or more. Syncsort packages a
library of hundreds of smart algorithms, as well as an ETL Optimizer
to handle the most demanding data integration transformations and
deliver up to 10x faster elapsed times.
Simply plug DMX into your existing environment. DMX provides
advanced metadata interchange capabilities to bi-directionally
exchange metadata with other applications. This makes it easy
to plug the solution into existing data integration environments to
seamlessly accelerate performance, eliminate constant tuning, and
facilitate regulatory compliance.
Free up your database and your budget. Syncsort's ETL optimization
solution shifts all data transformations from the enterprise data
warehouse into the DMX high-performance ETL engine, freeing up
database resources for faster user queries.
Get Hadoop-ready. Syncsort offers high-performance data
integration software with everything you need to deploy enterprise-
grade ETL capabilities on Hadoop. DMX-h offers a unique approach
to Hadoop ETL that lowers the barriers to adoption, helping your
organization unleash the full potential of Hadoop. Thanks to a
library of Use Case Accelerators, it's easy for organizations to get
started with Hadoop by implementing common ETL tasks such as
joins, change data capture (CDC), web log aggregations, mainframe
data access, and more.
STAGE 4: DYNAMIC

Hadoop is helping organizations in all industries gain greater insights, processing more data in less time and at a
lower cost. According to organizations surveyed, the top benefits from their use of Hadoop are finding previously
undiscovered insights and reducing the overall costs of data.
Two of the most common approaches include data warehouse optimization and mainframe offload. By shifting
transformations, the T in ETL, out of the data warehouse and into Hadoop, organizations can quickly
realize significant value, including
shortened ETL batch windows, faster
database user queries, and significant
operational savings in the form of spare
database capacity. Similarly, enterprises
that rely on mainframe processing to
support mission-critical applications
can capitalize on valuable insights and
savings by offloading data and batch
processing from the mainframe into
Hadoop.
It is important to recognize, however, that Hadoop is not a complete ETL solution. Hadoop is an operating system
that provides the underlying services to create Big Data applications. While it offers powerful utilities and massive
horizontal scalability, it does not provide the full set of capabilities that users need to deliver enterprise ETL
applications and functionality. If not addressed correctly, the gaps between the operating-level services that
Hadoop offers and the functionality that enterprise-grade ETL requires can slow Hadoop adoption and frustrate
organizations eager to deliver results, jeopardizing subsequent investments.
Hadoop is an open-source software framework that excels at processing and
analyzing large amounts of data at scale. Hadoop makes it practical to scale
out processing tasks across large numbers of nodes by handling the complicated
aspects of creating, managing, and executing a set of parallel processes over a
cluster of low-cost computers.
ETL, the process of collecting, processing, and distributing data, has emerged as
one of the most common use cases for Hadoop.3 In fact, industry analyst Gartner
predicts that most organizations will adapt their data integration strategy, using
Hadoop as a form of preprocessor for Big Data integration in the data warehouse.4
Use of Hadoop can become a game changer for organizations, dramatically
improving the cost structure for gaining new insights, for analyzing larger data sets
and new data types, and for quickly and flexibly bringing new services to market.
Figure: Typical MapReduce data flow. Each map task reads input through an Input Formatter, applies MAP with an optional Partitioner and Combiner, and sorts output to local disk; sorted partitions are then merged (SORT) and passed to REDUCE, whose results are written through an Output Formatter to HDFS.
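The map, sort/shuffle, and reduce stages in the figure can be sketched in miniature. The following single-process Python example is only an illustration of the MapReduce pattern applied to the web-log aggregation use case this paper mentions; it is not Hadoop API code, and all data and function names are invented.

```python
from collections import defaultdict

# Miniature, single-process sketch of the map -> sort/shuffle -> reduce
# flow shown in the figure, applied to web-log aggregation.
logs = [
    "GET /home 200", "GET /cart 404", "GET /home 200", "POST /cart 200",
]

def map_phase(record):
    """Emit (url, 1) for each log record, like a MapReduce mapper."""
    _, url, _ = record.split()
    yield (url, 1)

def reduce_phase(pairs):
    """Sum values per key from sorted (key, value) pairs, like a reducer."""
    totals = defaultdict(int)
    for url, count in pairs:
        totals[url] += count
    return dict(totals)

# Shuffle/sort: collect mapper output and sort by key before reducing.
mapped = sorted(pair for record in logs for pair in map_phase(record))
print(reduce_phase(mapped))  # {'/cart': 2, '/home': 2}
```

On a real cluster the partitioner routes keys to reducers and the combiner pre-aggregates on each node; here both collapse into the single sort-then-reduce step.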
3 http://blog.cloudera.com/blog/2013/02/big-datas-new-use-cases-transformation-active-archive-and-exploration/
4 Mark A. Beyer and Ted Friedman. Big Data Adoption in the Logical Data Warehouse. Gartner Research, February 2013.
READINESS STRATEGIES
During experimentation and early stages of Hadoop, the main objective is to prove the
value that Hadoop can bring to organizations by augmenting or extending existing data
integration and data warehouse architectures. Therefore, data connectivity and quick
development of common ETL use cases are critical for organizations at the Dynamic
stage. Connectivity to the right data sources can maximize the value of the framework
and avoid having Hadoop become yet another silo within the enterprise. In addition,
quickly ramping productivity with Hadoop allows IT to deliver quantifiable successes that
pave the way for more widespread adoption. Success at this stage enables companies
to move to the Evolved stage, where Hadoop becomes an integral component of the
production data management architecture.
Select a tool with a wide variety of connectors to source and target systems.
Simplify importing data from various sources into Hadoop, as well as exporting
data from Hadoop to other systems.
Leverage mainframe data. Mainframe data can be the critical reference point for
new data sources, such as web logs and sensor data. Therefore, make sure the
tool provides connectivity and data translation capabilities for the mainframe.
Ensure the tool offers a comprehensive library of pre-built, out-of-the-box data
transformations. The most common data flows include joins, aggregations,
and change data capture. Reusable templates can accelerate development of
prototype applications and proof of value.
Avoid tools that generate code. These tools will burden your organization with
heavy tuning and maintenance.
Test and break your system. As you build your proof-of-concept, stress testing
your system will help you assess the reliability of your implementation and will
teach your staff critical skills to maintain and support it down the road.
Identify and prioritize use cases. Identify one (or a small number of) proof-of-
concept use cases for Hadoop. Candidate use cases often involve recurring ETL
processes that place a heavy burden on the existing data warehouse.
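One of the common flows named above, change data capture (CDC), can be prototyped simply by diffing two keyed snapshots. This Python sketch is a hypothetical illustration with invented data; production CDC tools typically read database logs or timestamps rather than comparing full snapshots.

```python
# Hedged sketch of snapshot-based change data capture (CDC):
# diff yesterday's and today's keyed rows into inserts/updates/deletes.
def capture_changes(previous, current):
    """Return (inserts, updates, deletes) between two keyed snapshots."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

# Hypothetical customer snapshots:
previous = {1: "alice", 2: "bob", 3: "carol"}
current = {1: "alice", 2: "bobby", 4: "dave"}
ins, upd, dele = capture_changes(previous, current)
print(ins, upd, dele)  # {4: 'dave'} {2: 'bobby'} {3: 'carol'}
```

Snapshot diffing like this is a reasonable proof-of-concept workload for Hadoop, since it is recurring and joins two full datasets.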
HOW SYNCSORT CAN HELP
Syncsort's DMX-h high-performance data integration software provides a
smarter approach to Hadoop ETL including an intuitive graphical interface
for easily creating and maintaining jobs, a wide range of productivity
features, metadata facilities for development re-use and data lineage, high-
performance connectivity capabilities, and an ability to run natively within
the MapReduce framework, avoiding code generation.
Smarter connectivity to all your data. With DMX-h, you only need
one tool to connect all sources and targets to Hadoop, including
relational databases, appliances, files, XML, and even cloud. No
coding or scripting is needed. DMX-h can also be used to pre-process
data (cleanse, sort, partition, and compress) prior to
loading it into Hadoop, resulting in enhanced performance and
significant storage savings.
Smarter mainframe data ingestion and translation. DMX-h offers
unique capabilities to read, translate, and distribute mainframe
data with Hadoop. It supports mainframe record formats such
as fixed, variable, variable with block descriptor, and VSAM, and
also translates data from EBCDIC to ASCII, and imports COBOL
copybooks without coding.
Smarter testing, debugging, and troubleshooting.
DMX-h allows you to develop, test, and troubleshoot locally
in Windows before deploying into Hadoop. In addition,
DMX-h provides comprehensive logging capabilities, as
well as integration with Hadoop's JobTracker for easier
log consumption.
Smarter productivity to fast-track your way to
successful Hadoop ETL. DMX-h helps you get started
and become fully productive with Hadoop quickly
by providing a library of Use Case Accelerators that
implement common ETL tasks such as joins, change
data capture (CDC), web log aggregations, mainframe
data access, and more.
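As a small illustration of the EBCDIC-to-ASCII translation described above, Python's standard cp037 codec can decode mainframe text fields. This is a toy example with invented data; real mainframe records also involve COBOL copybook layouts, packed-decimal fields, and variable-length formats that it does not cover.

```python
# Hedged sketch of mainframe EBCDIC-to-ASCII translation. Python's
# stdlib ships the cp037 codec (a common EBCDIC code page).
ebcdic_bytes = bytes([0xC8, 0x85, 0x93, 0x93, 0x96])  # "Hello" in EBCDIC

def ebcdic_to_ascii(raw):
    """Decode EBCDIC (code page 037) bytes into a text string."""
    return raw.decode("cp037")

print(ebcdic_to_ascii(ebcdic_bytes))  # Hello
```

Character translation is the easy part; the harder work a tool like DMX-h automates is interpreting record layouts so each field is decoded with the right rule.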
STAGE 5: EVOLVED

While most organizations at this stage are not looking to replace their existing data warehousing infrastructure
with Hadoop, ETL is a different story. Hadoop is poised to completely change the way organizations collect,
process, and distribute their data. ETL is shifting to Hadoop ETL, and Big Data is becoming the new standard
architecture, providing greater value to the organization at a cost structure that is radically lower than traditional
architectures. And that's why the ability to cost-effectively utilize Big Data is quickly becoming a requirement for
companies to survive.
For example, an organization can store
aggregated web log data in their relational
database, while keeping the complete
datasets at the most granular level in Hadoop.
This allows them to run new queries against
the full historical data at any time to find new
insights, which can be a true game-changer
as organizations aggressively look for new
insights and offerings to differentiate from
the competition.
As organizations begin to standardize on Hadoop as the new Big Data platform, they must keep hardware
and resource costs under control. Although Hadoop leverages commodity hardware, the total cost for system
resources can still be significant. When dealing with large numbers of nodes, hardware costs add up. Programming
resources (e.g., HiveQL, Pig, Java, MapReduce) can also prove expensive. Using Hadoop for ETL processing
requires specialized and expensive developers who can be hard to find and hire. For example, the Wall Street
Journal recently cited that a Hadoop programmer can now earn as much as $300,000 per year.
Today, the reality is that very few organizations have reached the Evolved stage. Less than 2% of organizations
surveyed are using Hadoop as an integral component of their data management platform. But many organizations
are working towards this goal, and almost 11% expect to be at this stage within the next twelve months. Those who
get there faster will have a definite competitive edge.
READINESS STRATEGIES
Organizations at this stage need to focus on approaches that will allow them to efficiently scale
adoption of Big Data technologies across the entire enterprise. As companies move from proof-
of-value solutions to full-scale adoption, it is critical to understand that what worked in the earlier
stages may not always work in the Evolved stage.
Select an approach with built-in optimizations that enhance Hadoop's vertical
scalability to reduce hardware requirements. Run performance benchmarks and study
which tools deliver the best combination of price/performance for your most common
use cases.
Avoid a hand-coding nightmare. While learning and developing Pig,
HiveQL, and Java code might be fun at the beginning, highly repetitive tasks such as
joins, change data capture (CDC), and aggregations can quickly become a nightmare to
troubleshoot and maintain. Tools with a template-driven approach can make you
more productive by letting you focus on higher value-added activities.
Choose a Hadoop ETL tool with a user-friendly graphical interface. Easily build ETL
jobs without the need to develop, debug, and maintain complex Java, Pig, HiveQL, and
other specialized code for MapReduce. Using common ETL paradigms will allow you
to leverage existing ETL skills within your organization, minimizing barriers for wider
Hadoop adoption.
Consider an ETL tool with native Hadoop integration. Beware of ETL tools that claim
integration with Hadoop but simply generate code such as HiveQL, Pig, or Java. These
approaches can create additional performance overhead and maintenance hurdles down
the road.
Leverage a metadata repository. This will facilitate reusability, data lineage, and impact
analysis capabilities.
Rationalize your data warehouse. Identify the top 20% of ETL workflows causing
problems within your existing enterprise data warehouse. Start by shifting these
processes into Hadoop. Operational savings and additional database capacity can then
be used to fund more strategic initiatives.
Secure your Hadoop data. Any viable approach to Hadoop ETL must provide ironclad
security that meets your organization's and industry's data security requirements.
Seamless support for Kerberos and LDAP is key.
Augment your Center of Excellence (COE) with Hadoop best practices and guidelines.
Enhance your organizations COE to provide expertise in Hadoop and related tools, and
to define and standardize guidelines to identify and align the appropriate IT resources
with the appropriate use cases throughout your organization.
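Among the repetitive tasks the strategies above warn against hand-coding, change data capture (CDC) is a good example of logic that is simple to state but tedious to rewrite and maintain for every feed. A minimal sketch of what a CDC comparison does, in plain Python with illustrative field names:

```python
def cdc(previous, current, key="id"):
    """Classify rows as inserts, updates, or deletes by comparing
    two snapshots of a table, keyed on a primary key."""
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    inserts = [r for k, r in curr.items() if k not in prev]
    deletes = [r for k, r in prev.items() if k not in curr]
    updates = [r for k, r in curr.items() if k in prev and r != prev[k]]
    return inserts, updates, deletes

# Two snapshots of a hypothetical inventory table:
yesterday = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
today     = [{"id": 1, "qty": 5}, {"id": 2, "qty": 9}, {"id": 3, "qty": 1}]

ins, upd, dele = cdc(yesterday, today)
```

Hand-written in Pig or HiveQL, this same comparison balloons into joins and null checks that must be debugged separately for every table; a template-driven tool generates it once from metadata.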
HOW SYNCSORT CAN HELP
Syncsort DMX-h turns Hadoop into a more robust and feature-rich ETL
solution, enabling users to maximize the benefits of MapReduce without
compromising on the capabilities and ease of use offered by conventional
data integration tools.
Faster performance per node. DMX-h is not a code generator.
Instead, Hadoop automatically invokes the highly efficient DMX-h
runtime engine, which executes on all nodes as an integral part
of Hadoop. DMX-h can help organizations in the Evolved stage
by delivering consistently higher performance per node as data
volumes grow.
Hadoop ETL without coding. DMX-h enables people with a much
broader range of skills, not just MapReduce programmers, to
create ETL tasks that execute within the MapReduce framework, replacing complex Java, Pig, or HiveQL code with a powerful, easy-
to-use graphical development environment.
Enterprise-grade security for Hadoop ETL. DMX-h helps you
keep all your data secure with market-leading support for common
protocols such as LDAP and Kerberos.
Smarter Hadoop deployments. DMX-h offers tight integration
with all major Hadoop distributions, including Apache, Cloudera,
Hortonworks, MapR, and PivotalHD. Seamless integration with
Cloudera Manager allows you to easily deploy and upgrade DMX-h
in your entire Hadoop cluster with the click of a button.
Optimized sort for MapReduce processes and HiveQL. Thanks
to Syncsort's recently committed contribution to the open source
community, MAPREDUCE-2454, you can simply plug DMX-h
into your existing Hadoop clusters to seamlessly optimize existing
Hive and MapReduce jobs for even greater performance and more
efficient use of your Hadoop cluster.
Smarter Economics. Keep costs down as you scale Hadoop across
the entire organization. DMX-h's unique capabilities help you
maximize savings, delivering best-in-class ETL technology at a price
point that is more consistent with the cost structure of open source
solutions. Achieve significant operational savings faster by shifting
existing ETL workloads from high-end platforms to Hadoop.
Syncsort developed and contributed key features to the Apache open source community to make the sort function pluggable with Hadoop. MAPREDUCE-2454 allows you to run the fastest and most efficient sort technology natively within Hadoop to optimize existing MapReduce operations without any code changes or tuning.
ARE YOU READY TO EMBRACE THE CHALLENGES AND OPPORTUNITIES OF BIG DATA?
The Big Data Continuum, a framework developed with decades of data management expertise, can help
you assess your readiness and prepare for the challenges ahead:
Assess your companys data management maturity level.
Identify potential pitfalls as you evaluate and implement new technologies and processes.
Learn how to successfully address common problems that arise at each stage.
Fast-track your journey to embrace Big Data and capitalize on the fourth V: Value.
The key stages of the Big Data Continuum are:
Awakening. Primarily using hand-coding techniques to process data.
Advancing. Standardizing on traditional data integration platforms.
Plateauing. Straining the limits of traditional data integration architectures.
Dynamic. Experimenting with Hadoop.
Evolved. Standardizing on Hadoop as the operating system for Big Data across the entire enterprise.
ARE YOU READY FOR BIG DATA?
Organizations that are further along the Big Data Continuum have a much better chance to succeed and enjoy first-mover advantage, while laggards will find themselves at risk of declining revenues, market share, and relevance.
Regardless of where you are on the Big Data Continuum, Syncsort offers smarter solutions to help you leverage all your
data assets and build a solid foundation for Big Data. With thousands of deployments across all major platforms, Syncsort's solutions, from SQL migration to high-performance ETL to Hadoop, can help you thrive in the world
of Big Data.
Discover Syncsort's Big Data Solutions
Take a Free Test Drive of Our Hadoop ETL Solution
Check Out Our Infographic: The Big Picture on Big Data & Hadoop
Read a Report: The European Big Picture on Big Data & Hadoop
Syncsort provides data-intensive organizations across the Big Data Continuum with a smarter way to collect and
process the ever-expanding data avalanche. With thousands of deployments across all major platforms, including
mainframe, Syncsort helps customers around the world overcome the architectural limits of today's ETL and
Hadoop environments, empowering their organizations to drive better business outcomes in less time, with fewer
resources and lower TCO. For more information visit www.syncsort.com.
Like This Guide? Share