A Practical Guide to Big Data Readiness
7/25/2019 A Practical Guide to Big Data Readiness
TABLE OF CONTENTS

Introduction: Are You Ready for Big Data?
The Big Data Continuum
Stage 1: Awakening
Stage 2: Advancing
Stage 3: Plateauing
Stage 4: Dynamic
Stage 5: Evolved
Conclusion
Learn More
INTRODUCTION: ARE YOU READY FOR BIG DATA?

The Big Data problem is a big business problem. Analyzing Big Data to extract meaningful value is no longer a
luxury; it's a necessity as companies strive to remain relevant and competitive in the marketplace.

Technological shifts create both opportunities and challenges. For instance, while the Internet revolution gave
rise to Amazon and iTunes, it also meant the end of Borders, the defunct bookstore chain, and of Tower Records.
Big Data will be no different. Organizations unable to effectively keep pace amidst the three Vs of Big Data
(Volume, Variety, and Velocity) are at risk of becoming twenty-first century road kill.

How did we get here? The fact is that organizations have struggled to make sense of data for decades. And,
since the dawn of computing, there have been periods of innovation that have disrupted the entire market.
From mainframes to PCs, from the Internet to social and mobile technologies, each fundamental shift in the
computing landscape has created unique challenges for organizations' existing data management architectures
and processes. One-off point solutions using custom coding in the early 90s gave way to ETL platforms and
the enterprise data warehouse, all promising information nirvana: a single version of the truth.

More recently, as datasets explode with unprecedented speed and variety, and the needs of the business
become ever more complex, data management is more challenging than ever before. Traditional architectures
are breaking once again, and organizations are racing to adapt and rebuild them to handle Big Data. Big Data is
driving the next technological shift, and data integration is at the epicenter of the transformation.
1 The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. IDC, December 2012.
SURVIVE AND THRIVE WITH BIG DATA

So how can organizations evaluate their readiness in the context of this new environment and, most importantly,
prepare for the challenges ahead? How can you be sure you're making the right investments to embrace, and
capitalize on, the opportunities of Big Data?
That's where the Big Data Continuum can help. The Big Data Continuum is a framework that can help you:
Assess your company's data management maturity level.
Identify potential pitfalls as you evaluate and implement new technologies and processes.
Learn how to successfully address common problems that arise at each stage.
Fast-track your journey to embrace Big Data and capitalize on the fourth V: Value.
With decades of data management expertise and a long history of innovation,
Syncsort has worked with thousands of companies to help them solve their big data
issues, long before they knew the name Big Data. Based on our extensive experience
helping customers of all sizes and at all levels of data integration maturity, we've
designed a framework to help organizations evolve in their quest to leverage data for
competitive advantage. We call this framework The Big Data Continuum.
Organizations across different industries and sectors fall into
a wide range of maturity levels in terms of the processes and
technologies they use to manage their data, and their ability
to extract value from it. Therefore, the first steps in preparing
for Big Data involve a rigorous assessment of your existing
data management architecture and processes, and a strategic
roadmap that includes the challenges and opportunities
ahead. In essence: Where are you today, and where do you
need to be in the next 12 months?
The Big Data Continuum is a framework that can help you
answer these questions and propel your organization to the
next level.
THE FIVE STAGES OF DATA INTEGRATION MATURITY:
Awakening. Data integration tasks are mostly performed using custom coded approaches, often using SQL to
transform and integrate data inside a database.
Advancing. Organizations realize the value of data and start standardizing data integration processes on a
common platform such as Informatica, DataStage, and others, leading to greater efficiencies and economies of
scale.
Plateauing. Initial successes with an enterprise data warehouse spark the need for more insights. However,
increasing data volumes and changing business requirements push the limits of traditional data integration and
data warehousing architectures. Stopgap measures trigger a transition from ETL (Extract, Transform, Load) to
ELT (Extract, Load, Transform), shifting heavy data transformation workloads into the enterprise data warehouse.
The IT backlog grows despite standards and best practices. Initial success is replaced by unsustainable costs
and user frustration.
Dynamic. Organizations start to look for alternative solutions to meet these challenges in less time, with less
effort, and at lower cost. They experiment with Big Data frameworks like Hadoop to address architectural
limitations of traditional platforms and look for ways to leverage the accumulated expertise within their
organizations.
Evolved. Companies at this stage are scaling Hadoop across the entire enterprise, using it as an integral
component of their production data management infrastructure. Big Data platforms become a new standard
within these organizations, augmenting traditional architectures at significantly lower costs.
The rest of this paper examines the Big Data Continuum in more detail and provides specific
readiness strategies to help your organization address the challenges and opportunities
of each stage.
STAGE 1: AWAKENING

For organizations in the Awakening stage, hand coding,
often using Structured Query Language (SQL)
inside the database, is the most common method to
transform and integrate data sets. According to data
warehousing expert Rick Sherman, much of the data
integration projects in corporate enterprises are still
being done through manual coding methods that are
inefficient and often not documented.2
The problems associated with hand coding and using
SQL for data integration tasks are well understood
and include:
Low Productivity: Developing, maintaining, and extending custom software code is a productivity drain
and quickly becomes unsustainable. It is particularly challenging to tune, maintain and extend existing
code when the original developers are no longer in the same roles or have left the company. Custom
code also makes it difficult to perform impact analysis or data lineage to understand dependencies and
data flows.
Figure: Custom SQL Code Used for ETL Processing
2 Rick Sherman. Misconceptions Holding Back Use of Data Integration Tools. BI Trends + Strategies, August 2012.
READINESS STRATEGIES
Migrate SQL scripts to a high-performance ETL tool. ETL tools have
become the de facto solution to SQL scripting, maintenance and
performance issues. When choosing an ETL tool, beware of complex
engines and code-generators that push SQL down to the database.
Analyze and document complex code and SQL scripts used in data
integration processes and create graphical flow charts to depict SQL logic.
Identify the top 20%. Typically, 20% of SQL scripts consume up to 80%
of the time and cost, due to hardware, tuning and maintenance. Usual
suspects include SQL with merge/upsert, joins, materialized views, cursors
and union operations.
Migrate SQL scripts using the 80/20 rule. When planning and evaluating
the benefits of SQL migration, it is important to realize that a complete
migration of all SQL code is not necessary to achieve significant benefits.
Instead, focus on the top 20% to deliver quick results and significant savings.
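The 80/20 triage above can be approximated with a short script. The following Python sketch is illustrative only; the pattern list, scoring, file names, and function names are assumptions for this example, not part of any Syncsort tooling. It scans SQL scripts for the "usual suspect" constructs and ranks them as migration candidates.

```python
import re

# Illustrative "usual suspect" constructs from the 80/20 heuristic:
# merges/upserts, joins, cursors, unions, materialized views.
SUSPECT_PATTERNS = {
    "merge/upsert": r"\bMERGE\b|\bUPSERT\b",
    "join": r"\bJOIN\b",
    "cursor": r"\bCURSOR\b",
    "union": r"\bUNION\b",
    "materialized view": r"\bMATERIALIZED\s+VIEW\b",
}

def score_script(sql_text):
    """Count occurrences of each costly construct in one SQL script."""
    counts = {}
    for name, pattern in SUSPECT_PATTERNS.items():
        counts[name] = len(re.findall(pattern, sql_text, flags=re.IGNORECASE))
    counts["total"] = sum(counts.values())
    return counts

def rank_scripts(scripts):
    """Rank named scripts by suspect-construct count, worst first."""
    scored = [(name, score_script(text)["total"]) for name, text in scripts.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical scripts to triage:
scripts = {
    "load_orders.sql": "MERGE INTO orders o USING staging s ON o.id = s.id ...",
    "report.sql": "SELECT region, SUM(amount) FROM sales GROUP BY region",
}
print(rank_scripts(scripts))  # load_orders.sql ranks first
```

In practice you would also weigh elapsed processing time and resource usage alongside such a static scan, as the sidebar on high-impact code suggests.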
Poor Performance: SQL was not designed for ETL processing. Instead, it is a special-purpose programming
language designed for querying and managing data stored in relational databases. Using SQL for ETL
tasks is inefficient, creating performance bottlenecks and jeopardizing service level agreements (SLAs) for
ETL processing windows.
High Cost: Pushing intensive data transformations down to the database steals expensive database
cycles from the tasks for which it was intended, resulting in added infrastructure costs and jeopardizing
performance SLAs for processing database queries.
All of these issues can make it difficult for organizations to extract information and deliver business value from
data, especially as data-driven information and decision making become a vicious cycle, creating the demand for
even more data-driven information. Often, custom coding will solve problems at the outset, but as the need for
more and faster information grows, these approaches simply can't keep pace with the demands of the business.
HOW SYNCSORT CAN HELP
Syncsort's SQL migration solution is specifically designed to help
organizations at the Awakening stage eliminate the SQL ETL coding and
maintenance nightmare by migrating existing SQL ETL scripts to a few
graphical DMX jobs. Syncsort DMX is high-performance ETL software that
accelerates overall performance and eliminates the need for database
staging areas, seamlessly reducing the total cost and complexity of data
integration.
Intelligent, self-documenting flow charts are automatically
generated so you can clearly understand complex SQL scripts used
in data integration processes.
A few graphical jobs vs. thousands of lines of SQL code. Replace
thousands of lines of SQL code with a few graphical jobs, allowing
even novice users to quickly develop and maintain data integration
jobs.
Improved IT productivity and sustained optimal performance.
Seamlessly scale as data volumes grow, without the need for
manual coding or tuning.
Migrate SQL scripts to a high-
performance ETL tool. Look for the
following characteristics to identify
the high-impact code for migration:
High elapsed processing times.
Very complex scripts, including
multiple merges, joins, cursors and
unions.
High impact on resource utilization,
including CPU, memory, and storage.
Unstable or error-prone.
STAGE 2: ADVANCING

As organizations progress to the Advancing stage, they will experience:
More Data. The number and type of data sources users need to leverage increases, often including
dissimilar data in different formats (e.g. text, mainframe, web logs, and CRM).
More End Users. The range of end users that must be satisfied increases, including executives, managers
and field and operations staff, for example.
More Queries. As the number and roles of end users grow, so do the number, variety, and complexity of
queries that must be performed on the data.
Companies at this stage come to realize that continuing to use point solutions and hand-coded approaches will
hold them back. As a result, they will begin to evaluate, adopt, and standardize on ETL tools and data integration
platforms. In addition to investments in IT infrastructure, organizations start to develop and enforce best practices
and accumulate technical expertise that can prove critical to progress along the Big Data Continuum.
When surveyed, more organizations identified their data integration readiness at these first two stages of the Big
Data Continuum than at any of the others.
READINESS STRATEGIES
Beware of code-generators and push-down optimizations. Some organizations
have adopted tools that generate SQL or offer so-called push-down
optimizations as a means to achieve faster performance at scale. Unfortunately,
most of these tools, including Talend and Informatica, require significant
skills and ongoing manual tuning to achieve and sustain acceptable performance,
creating similar challenges to hand coding and maintaining SQL-based data
integration logic.
Improve staff productivity. Select an ETL tool with Windows-based paradigms
that don't require a long learning curve or specialized skills. Data integration
tools should allow users to focus on business rules and workflows, rather than
complex tuning parameters to achieve and maintain high performance. Look
for ease of use as well as ease of re-use, with impact analysis and data lineage
capabilities to make it easy to revise and extend existing applications as business
requirements change.
Choose a tool that maximizes run-time performance and efficiency. A tool
that delivers superior run-time processing performance and efficiency will
maximize resource utilization, minimize costs, and provide superior throughput.
Look for a solution that performs all transformation processing outside of the
database, minimizing performance bottlenecks and inefficient utilization of
expensive database resources. Doing so can keep costs under control and allow
you to build a solid foundation for the future, avoiding potential issues often
encountered in the subsequent stages.
Leverage all your data. Having the right data source and target connectivity is
critical for leveraging all your data, to help make the best business decisions and
discover new business opportunities.
Establish a Big Data Center of Excellence (COE). A center of excellence is key
to developing and retaining Big Data expertise within the organization. The COE
should also set and enforce standards for the data management architecture,
define the strategic roadmap, establish best practices, and provide
training and support to the organization.
HOW SYNCSORT CAN HELP
Syncsort's DMX high-performance ETL solution provides companies
at the Advancing stage with a two-fold approach: it makes addressing
their immediate productivity issues fast and easy, while providing a solid
foundation for future data growth.
Template-driven design. DMX offers a clear, intuitive graphical user
interface that makes it easy for both business and technical users to
develop and deploy ETL processes.
Faster transformations for unparalleled ETL
performance. The solution packages a library of
hundreds of smart algorithms to handle the most
demanding data integration transformations and
functions, delivering up to 10X faster elapsed
processing times than Informatica, Talend, and
other conventional tools.
Smart ETL Optimizer. You don't have to worry
about ongoing, time-consuming tuning efforts to
maintain optimum performance. Our unique ETL
Optimizer ensures you will always get maximum
performance, so you can design for the business
without wasting time tuning.
Comprehensive connectivity to leverage all your data. The high-performance
ETL solution provides out-of-the-box connectivity to
relational sources, flat files, mainframes, Hadoop, and everything in
between.
Flexibility and reusability with no strings attached. A file-based
repository delivers all the benefits of a complete metadata layer
without dependencies on third-party systems such as relational
databases.
STAGE 3: PLATEAUING

Over time, increasing demands for information often prove to be too much for traditional architectures to
handle. As data volumes grow and business users demand fresher data, popular data integration tools such
as Informatica and DataStage force organizations to push data transformations down to the enterprise data
warehouse, effectively causing a transition from ETL to ELT. Unfortunately, SQL is almost never the best approach
for data integration tasks. Relational database management systems (RDBMS) were specifically designed to solve
problems that involve a big question with a small answer (i.e. user queries). However, when dealing with data
transformations, the T in ETL, the answer is generally as big, if not bigger, than the question.
Moreover, organizations can face unacceptable bottlenecks
and delays, not only for data transformations but also for
analytical queries, as both processes compete for EDW
resources. IT staff and budget can quickly be consumed by
expensive and tedious stopgap measures: manual tuning
efforts, hardware upgrades, and additional data warehouse
capacity. Early excitement fades and gives way to user
frustration, incremental costs, and a crippling IT backlog.
The resulting business ramifications of these bottlenecks can
be severe, including lost revenue opportunities, impaired
decision making, customer attrition, and so on.
The RDBMS is optimized to
solve query loads. That is, big
questions with a small answer.
However, ETL involves big
questions with sometimes even
bigger answers. By offloading
heavy data transformations
from the EDW, you can free up
database capacity and budget
while accelerating overall data
performance.
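To make the offloading idea concrete, here is a hedged, toy Python sketch of the sort-then-aggregate pattern an external ETL engine performs outside the database. The data and names are invented for illustration, and an in-memory sort stands in for the disk-based external sort a real engine would use.

```python
from itertools import groupby
from operator import itemgetter

# Toy illustration: aggregate outside the database instead of pushing
# a heavy GROUP BY down to the warehouse. sorted() stands in for a
# high-performance external (disk-based) sort here.
rows = [
    ("east", 100), ("west", 50), ("east", 25), ("west", 75), ("east", 10),
]

def aggregate(rows):
    """Sort by key, then stream grouped sums: the sort/aggregate
    pattern an ETL engine performs off the database."""
    ordered = sorted(rows, key=itemgetter(0))
    return {key: sum(amount for _, amount in group)
            for key, group in groupby(ordered, key=itemgetter(0))}

print(aggregate(rows))  # {'east': 135, 'west': 125}
```

The design point is that the database only serves the raw rows; the expensive sort and aggregation run in the external engine, freeing database cycles for user queries.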
READINESS STRATEGIES
Offload transformations from the data warehouse. Inefficient and
underperforming ETL tools have forced many IT developers to push
transformations down to the database, adding complexity and requiring massive
investments in additional database capacity. This approach will actually move
you backward along the Big Data Continuum, increasing database costs and
the effort to maintain and tune scripts. Look for approaches that shift intensive
transformations out of the database.
Leverage acceleration technologies to extend your existing data
integration infrastructure. Most organizations have spent considerable time
and money building their existing data integration infrastructure, so rip &
replace approaches aren't practical. Rather than buying extra hardware and
database capacity, you can identify where the bottlenecks occur and bring in
specialized data integration technology to accelerate these processes. For
example, technology now exists that can efficiently handle sorts, merges, and
aggregations, and that integrates seamlessly with your existing architecture.
Accelerating technologies increase an organization's Big Data readiness by
removing performance bottlenecks while allowing them to leverage their existing
architecture. These plug-and-play technologies typically result in significant
savings that can be used to fund initiatives to move into the Dynamic stage.
Start with the top 20% of data transformations. Usually 20% of the
transformations incur 80% of the processing problems. Offloading and
accelerating these transformations will provide the best bang for the buck.
Consider using Hadoop to offload all ETL processes from the data warehouse.
Hadoop is emerging as the de facto operating system for Big Data. Thanks to its
massively scalable and fault-tolerant architecture, Hadoop can be much more
effective from a performance and cost perspective than the data warehouse in
processing ETL workloads. In addition, shifting ETL workloads to Hadoop
can free up valuable database capacity to accelerate user queries.
HOW SYNCSORT CAN HELP
Syncsort's ETL optimization solution helps organizations maximize the
return on their data integration investments, allowing them to keep their
existing infrastructure while shifting the heavy transformation processes to
Syncsort DMX.
Accelerate your existing data integration environment, including
Informatica and DataStage by 10x or more. Syncsort packages a
library of hundreds of smart algorithms, as well as an ETL Optimizer
to handle the most demanding data integration transformations and
deliver up to 10x faster elapsed times.
Simply plug DMX into your existing environment. DMX provides
advanced metadata interchange capabilities to bi-directionally
exchange metadata with other applications. This makes it easy
to plug the solution into existing data integration environments to
seamlessly accelerate performance, eliminate constant tuning, and
facilitate regulatory compliance.
Free up your database and your budget. Syncsort's ETL optimization
solution shifts all data transformations from the enterprise data
warehouse into the DMX high-performance ETL engine, freeing up
database resources for faster user queries.
Get Hadoop-ready. Syncsort offers high-performance data
integration software with everything you need to deploy enterprise-
grade ETL capabilities on Hadoop. DMX-h offers a unique approach
to Hadoop ETL that lowers the barriers to adoption, helping your
organization unleash the full potential of Hadoop. Thanks to a
library of Use Case Accelerators, it's easy for organizations to get
started with Hadoop by implementing common ETL tasks such as
joins, change data capture (CDC), web log aggregations, mainframe
data access, and more.
STAGE 4: DYNAMIC

Hadoop is helping organizations in all industries gain greater insights, processing more data in less time and at a
lower cost. According to organizations surveyed, the top benefits from their use of Hadoop are finding previously
undiscovered insights and reducing the overall costs of data.
Two of the most common approaches include data warehouse optimization and mainframe offload. By shifting
transformations, the T in ETL, out of the data warehouse and into Hadoop, organizations can quickly
realize significant value, including
shortened ETL batch windows, faster
database user queries, and significant
operational savings in the form of spare
database capacity. Similarly, enterprises
that rely on mainframe processing to
support mission-critical applications
can capitalize on valuable insights and
savings by offloading data and batch
processing from the mainframe into
Hadoop.
It is important to recognize, however, that Hadoop is not a complete ETL solution. Hadoop is an operating system
that provides the underlying services to create Big Data applications. While it offers powerful utilities and massive
horizontal scalability, it does not provide the full set of capabilities that users need to deliver enterprise ETL
applications and functionality. If not addressed correctly, the gaps between the operating-level services that
Hadoop offers and the functionality that enterprise-grade ETL requires can slow Hadoop adoption and frustrate
organizations eager to deliver results, jeopardizing subsequent investments.
Hadoop is an open-source software framework that excels at processing and
analyzing large amounts of data at scale. Hadoop makes it practical to scale
out processing tasks across large numbers of nodes by handling the complicated
aspects of creating, managing, and executing a set of parallel processes over a
cluster of low-cost computers.
ETL, the process of collecting, processing, and distributing data, has emerged as
one of the most common use cases for Hadoop.3 In fact, industry analyst Gartner
predicts that most organizations will adapt their data integration strategy, using
Hadoop as a form of preprocessor for Big Data integration in the data warehouse.4
Use of Hadoop can become a game changer for organizations, dramatically
improving the cost structure for gaining new insights, for analyzing larger data sets
and new data types, and for quickly and flexibly bringing new services to market.
Figure: Typical MapReduce data flow. Each map task reads input through an Input Formatter, applies MAP with an optional Partitioner and Combiner, and sorts output to local disk; sorted partitions are then merged (SORT) and passed to REDUCE, whose results are written through an Output Formatter to HDFS.
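The map, sort/shuffle, and reduce stages in the figure can be sketched in miniature. The following single-process Python example is only an illustration of the MapReduce pattern applied to the web-log aggregation use case this paper mentions; it is not Hadoop API code, and all data and function names are invented.

```python
from collections import defaultdict

# Miniature, single-process sketch of the map -> sort/shuffle -> reduce
# flow shown in the figure, applied to web-log aggregation.
logs = [
    "GET /home 200", "GET /cart 404", "GET /home 200", "POST /cart 200",
]

def map_phase(record):
    """Emit (url, 1) for each log record, like a MapReduce mapper."""
    _, url, _ = record.split()
    yield (url, 1)

def reduce_phase(pairs):
    """Sum values per key from sorted (key, value) pairs, like a reducer."""
    totals = defaultdict(int)
    for url, count in pairs:
        totals[url] += count
    return dict(totals)

# Shuffle/sort: collect mapper output and sort by key before reducing.
mapped = sorted(pair for record in logs for pair in map_phase(record))
print(reduce_phase(mapped))  # {'/cart': 2, '/home': 2}
```

On a real cluster the partitioner routes keys to reducers and the combiner pre-aggregates on each node; here both collapse into the single sort-then-reduce step.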
3 http://blog.cloudera.com/blog/2013/02/big-datas-new-use-cases-transformation-active-archive-and-exploration/
4 Mark A. Beyer and Ted Friedman. Big Data Adoption in the Logical Data Warehouse. Gartner Research, February 2013.
READINESS STRATEGIES
During experimentation and early stages of Hadoop, the main objective is to prove the
value that Hadoop can bring to organizations by augmenting or extending existing data
integration and data warehouse architectures. Therefore, data connectivity and quick
development of common ETL use cases are critical for organizations at the Dynamic
stage. Connectivity to the right data sources can maximize the value of the framework
and avoid having Hadoop become yet another silo within the enterprise. In addition,
quickly ramping productivity with Hadoop allows IT to deliver quantifiable successes that
pave the way for more widespread adoption. Success at this stage enables companies
to move to the Evolved stage, where Hadoop becomes an integral component of the
production data management architecture.
Select a tool with a wide variety of connectors to source and target systems.
Simplify importing data from various sources into Hadoop, as well as exporting
data from Hadoop to other systems.
Leverage mainframe data. Mainframe data can be the critical reference point for
new data sources, such as web logs and sensor data. Therefore, make sure the
tool provides connectivity and data translation capabilities for the mainframe.
Ensure the tool offers a comprehensive library of pre-built, out-of-the-box data
transformations. The most common data flows include joins, aggregations,
and change data capture. Reusable templates can accelerate development of
prototype applications and proof of value.
Avoid tools that generate code. These tools will burden your organization with
heavy tuning and maintenance.
Test and break your system. As you build your proof-of-concept, stress testing
your system will help you assess the reliability of your implementation and will
teach your staff critical skills to maintain and support it down the road.
Identify and prioritize use cases. Identify one (or a small number of) proof-of-
concept use cases for Hadoop. Candidate use cases often involve recurring ETL
processes that place a heavy burden on the existing data warehouse.
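One of the common flows named above, change data capture (CDC), can be prototyped simply by diffing two keyed snapshots. This Python sketch is a hypothetical illustration with invented data; production CDC tools typically read database logs or timestamps rather than comparing full snapshots.

```python
# Hedged sketch of snapshot-based change data capture (CDC):
# diff yesterday's and today's keyed rows into inserts/updates/deletes.
def capture_changes(previous, current):
    """Return (inserts, updates, deletes) between two keyed snapshots."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

# Hypothetical customer snapshots:
previous = {1: "alice", 2: "bob", 3: "carol"}
current = {1: "alice", 2: "bobby", 4: "dave"}
ins, upd, dele = capture_changes(previous, current)
print(ins, upd, dele)  # {4: 'dave'} {2: 'bobby'} {3: 'carol'}
```

Snapshot diffing like this is a reasonable proof-of-concept workload for Hadoop, since it is recurring and joins two full datasets.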
HOW SYNCSORT CAN HELP
Syncsort's DMX-h high-performance data integration software provides a
smarter approach to Hadoop ETL including an intuitive graphical interface
for easily creating and maintaining jobs, a wide range of productivity
features, metadata facilities for development re-use and data lineage, high-
performance connectivity capabilities, and an ability to run natively within
the MapReduce framework, avoiding code generation.
Smarter connectivity to all your data. With DMX-h, you only need
one tool to connect all sources and targets to Hadoop, including
relational databases, appliances, files, XML, and even cloud. No
coding or scripting is needed. DMX-h can also be used to pre-process
data (cleanse, sort, partition, and compress) prior to
loading it into Hadoop, resulting in enhanced performance and
significant storage savings.
Smarter mainframe data ingestion and translation. DMX-h offers
unique capabilities to read, translate, and distribute mainframe
data with Hadoop. It supports mainframe record formats such
as fixed, variable, variable with block descriptor, and VSAM, and
also translates data from EBCDIC to ASCII, and imports COBOL
copybooks without coding.
Smarter testing, debugging, and troubleshooting.
DMX-h allows you to develop, test, and troubleshoot locally
in Windows before deploying into Hadoop. In addition,
DMX-h provides comprehensive logging capabilities, as
well as integration with Hadoop's JobTracker for easier
log consumption.
Smarter productivity to fast-track your way to
successful Hadoop ETL. DMX-h helps you get started
and become fully productive with Hadoop quickly
by providing a library of Use Case Accelerators that
implement common ETL tasks such as joins, change
data capture (CDC), web log aggregations, mainframe
data access, and more.
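As a small illustration of the EBCDIC-to-ASCII translation described above, Python's standard cp037 codec can decode mainframe text fields. This is a toy example with invented data; real mainframe records also involve COBOL copybook layouts, packed-decimal fields, and variable-length formats that it does not cover.

```python
# Hedged sketch of mainframe EBCDIC-to-ASCII translation. Python's
# stdlib ships the cp037 codec (a common EBCDIC code page).
ebcdic_bytes = bytes([0xC8, 0x85, 0x93, 0x93, 0x96])  # "Hello" in EBCDIC

def ebcdic_to_ascii(raw):
    """Decode EBCDIC (code page 037) bytes into a text string."""
    return raw.decode("cp037")

print(ebcdic_to_ascii(ebcdic_bytes))  # Hello
```

Character translation is the easy part; the harder work a tool like DMX-h automates is interpreting record layouts so each field is decoded with the right rule.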
STAGE 5: EVOLVED

While most organizations at this stage are not looking to replace their existing data warehousing infrastructure
with Hadoop, ETL is a different story. Hadoop is poised to completely change the way organizations collect,
process, and distribute their data. ETL is shifting to Hadoop ETL, and Big Data is becoming the new standard
architecture, providing greater value to the organization at a cost structure that is radically lower than traditional
architectures. And that's why the ability to cost-effectively utilize Big Data is quickly becoming a requirement for
companies to survive.
For example, an organization can store
aggregated web log data in their relational
database, while keeping the complete
datasets at the most granular level in Hadoop.
This allows them to run new queries against
the full historical data at any time to find new
insights, which can be a true game-changer
as organizations aggressively look for new
insights and offerings to differentiate from
the competition.
As organizations begin to standardize on Hadoop as the new Big Data platform, they must keep hardware
and resource costs under control. Although Hadoop leverages commodity hardware, the total cost for system
resources can still be significant. When dealing with large numbers of nodes, hardware costs add up. Programming
resources (e.g., HiveQL, Pig, Java, MapReduce) can also prove expensive. Using Hadoop for ETL processing
requires specialized and expensive developers who can be hard to find and hire. For example, the Wall Street
Journal recently cited that a Hadoop programmer can now earn as much as $300,000 per year.
Today, the reality is that very few organizations have reached the Evolved stage. Less than 2% of organizations
surveyed are using Hadoop as an integral component of their data management platform. But many organizations
are working towards this goal, and almost 11% expect to be at this stage within the next twelve months. Those who
get there faster will have a definite competitive edge.
READINESS STRATEGIES
Organizations at this stage need to focus on approaches that will allow them to efficiently scale
adoption of Big Data technologies across the entire enterprise. As companies move from proof-
of-value solutions to full-scale adoption, it is critical to understand that what worked in the earlier
stages may not always work in the Evolved stage.
Select an approach with built-in optimizations that enhance Hadoop's vertical
scalability to reduce hardware requirements. Run performance benchmarks and study
which tools deliver the best combination of price/performance for your most common
use cases.
Avoid a hand-coding nightmare. While learning and developing Pig,
HiveQL, and Java code might be fun at the beginning, highly repetitive tasks such as
joins, change data capture (CDC), and aggregations can quickly become a nightmare to
troubleshoot and maintain. Tools with a template-driven approach can make you
more productive by letting you focus on higher value-added activities.
Choose a Hadoop ETL tool with a user-friendly graphical interface. Easily build ETL
jobs without the need to develop, debug, and maintain complex Java, Pig, HiveQL, and
other specialized code for MapReduce. Using common ETL paradigms will allow you
to leverage existing ETL skills within your organization, minimizing barriers for wider
Hadoop adoption.
Consider an ETL tool with native Hadoop integration. Beware of ETL tools that claim
integration with Hadoop but simply generate code such as HiveQL, Pig, or Java. These
approaches can create additional performance overhead and maintenance hurdles down
the road.
Leverage a metadata repository. This will facilitate reusability, data lineage, and impact
analysis capabilities.
Rationalize your data warehouse. Identify the top 20% of ETL workflows causing
problems within your existing enterprise data warehouse. Start by shifting these
processes into Hadoop. Operational savings and additional database capacity can then
be used to fund more strategic initiatives.
Secure your Hadoop data. Any viable approach to Hadoop ETL must provide ironclad
security that meets your organization's and industry's data security requirements.
Seamless support for Kerberos and LDAP is key.
Augment your Center of Excellence (COE) with Hadoop best practices and guidelines.
Enhance your organizations COE to provide expertise in Hadoop and related tools, and
to define and standardize guidelines to identify and align the appropriate IT resources
with the appropriate use cases throughout your organization.
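Among the repetitive tasks the strategies above warn against hand-coding, change data capture (CDC) is a good example of logic that is simple to state but tedious to rewrite and maintain for every feed. A minimal sketch of what a CDC comparison does, in plain Python with illustrative field names:

```python
def cdc(previous, current, key="id"):
    """Classify rows as inserts, updates, or deletes by comparing
    two snapshots of a table, keyed on a primary key."""
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    inserts = [r for k, r in curr.items() if k not in prev]
    deletes = [r for k, r in prev.items() if k not in curr]
    updates = [r for k, r in curr.items() if k in prev and r != prev[k]]
    return inserts, updates, deletes

# Two snapshots of a hypothetical inventory table:
yesterday = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
today     = [{"id": 1, "qty": 5}, {"id": 2, "qty": 9}, {"id": 3, "qty": 1}]

ins, upd, dele = cdc(yesterday, today)
```

Hand-written in Pig or HiveQL, this same comparison balloons into joins and null checks that must be debugged separately for every table; a template-driven tool generates it once from metadata.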
HOW SYNCSORT CAN HELP
Syncsort DMX-h turns Hadoop into a more robust and feature-rich ETL
solution, enabling users to maximize the benefits of MapReduce without
compromising on the capabilities and ease of use offered by conventional
data integration tools.
Faster performance per node. DMX-h is not a code generator.
Instead, Hadoop automatically invokes the highly efficient DMX-h
runtime engine, which executes on all nodes as an integral part
of Hadoop. DMX-h can help organizations in the Evolved stage
by delivering consistently higher performance per node as data
volumes grow.
Hadoop ETL without coding. DMX-h enables people with a much
broader range of skills, not just MapReduce programmers, to
create ETL tasks that execute within the MapReduce framework, replacing complex Java, Pig, or HiveQL code with a powerful, easy-
to-use graphical development environment.
Enterprise-grade security for Hadoop ETL. DMX-h helps you
keep all your data secure with market-leading support for common
protocols such as LDAP and Kerberos.
Smarter Hadoop deployments. DMX-h offers tight integration
with all major Hadoop distributions, including Apache, Cloudera,
Hortonworks, MapR, and PivotalHD. Seamless integration with
Cloudera Manager allows you to easily deploy and upgrade DMX-h
in your entire Hadoop cluster with the click of a button.
Optimized sort for MapReduce processes and HiveQL. Thanks
to Syncsort's recently committed contribution to the open source
community, MAPREDUCE-2454, you can simply plug DMX-h
into your existing Hadoop clusters to seamlessly optimize existing
Hive and MapReduce jobs for even greater performance and more
efficient use of your Hadoop cluster.
Smarter Economics. Keep costs down as you scale Hadoop across
the entire organization. DMX-h's unique capabilities help you
maximize savings, delivering best-in-class ETL technology at a price
point that is more consistent with the cost structure of open source
solutions. Achieve significant operational savings faster by shifting
existing ETL workloads from high-end platforms to Hadoop.
Syncsort developed and contributed key features to the Apache open source community to make the sort function pluggable with Hadoop. MAPREDUCE-2454 allows you to run the fastest and most efficient sort technology natively within Hadoop to optimize existing MapReduce operations without any code changes or tuning.
ARE YOU READY TO EMBRACE THE CHALLENGES AND OPPORTUNITIES OF BIG DATA?
The Big Data Continuum, a framework developed with decades of data management expertise, can help
you assess your readiness and prepare for the challenges ahead:
Assess your companys data management maturity level.
Identify potential pitfalls as you evaluate and implement new technologies and processes.
Learn how to successfully address common problems that arise at each stage.
Fast-track your journey to embrace Big Data and capitalize on the fourth V: Value.
The key stages of the Big Data Continuum are:
Awakening. Primarily using hand-coding techniques to process data.
Advancing. Standardizing on traditional data integration platforms.
Plateauing. Straining the limits of traditional data integration architectures.
Dynamic. Experimenting with Hadoop.
Evolved. Standardizing on Hadoop as the operating system for Big Data across the entire enterprise.
ARE YOU READY FOR BIG DATA?
Organizations that are further along the Big Data Continuum have a much better chance to succeed and enjoy first-mover advantage, while laggards will find themselves at risk of declining revenues, market share, and relevance.
Regardless of where you are on the Big Data Continuum, Syncsort offers smarter solutions to help you leverage all your
data assets and build a solid foundation for Big Data. With thousands of deployments across all major platforms, Syncsort's solutions, from SQL migration to high-performance ETL to Hadoop, can help you thrive in the world
of Big Data.
Discover Syncsort's Big Data Solutions
Take a Free Test Drive of Our Hadoop ETL Solution
Check Out Our Infographic: The Big Picture on Big Data & Hadoop
Read a Report: The European Big Picture on Big Data & Hadoop
Syncsort provides data-intensive organizations across the Big Data Continuum with a smarter way to collect and
process the ever-expanding data avalanche. With thousands of deployments across all major platforms, including
mainframe, Syncsort helps customers around the world overcome the architectural limits of today's ETL and
Hadoop environments, empowering their organizations to drive better business outcomes in less time, with fewer
resources and lower TCO. For more information visit www.syncsort.com.
Like This Guide? Share