
tdwi.org

TDWI E-Book

Sponsored by Syncsort

MARCH 2014

Modernizing Your Data Warehouse Architecture with Hadoop

Q&A: Best Practices for Offloading Tasks to Hadoop

How (and Why) Hadoop Is Changing the Data Warehousing Paradigm

Hadoop and Data Management: A Close Relationship

About Syncsort

Q&A: Best Practices for Offloading Tasks to Hadoop

The growing costs of IT infrastructure and larger data volumes are driving enterprises to offload data management chores to Hadoop. We explore how enterprises can make the most of this shift with Jorge A. Lopez, director of product marketing at Syncsort. With over 15 years of experience in business intelligence and data integration, Lopez is responsible for product marketing and strategy at Syncsort.

TDWI: How would you describe the current state of big data integration today?

Jorge A. Lopez: For decades, organizations have struggled with critical performance and scalability shortcomings of conventional data integration. These shortcomings forced them to push heavy data integration workloads down to the data warehouse. As a result, core data integration experienced a shift from extract, transform, and load (ETL) to extract, load, and transform (ELT).

Although this worked in the short term, it also created a whole new set of problems for the IT organization, from longer batch windows to shorter data retention and rapidly growing database costs. Today, big data is aggravating these problems. More than ever, your organization’s survival depends on your ability to transform data into actionable insights. This need to analyze more data from a more diverse set of sources in less time, while keeping costs under control, is creating a lot of tension in existing data integration architectures.

What is driving organizations to incorporate Hadoop into their data management environments?

It’s precisely this tension between the evolving needs of the business and the growing costs of IT infrastructure that’s driving many Hadoop implementations. Hadoop is offering a better alternative to data integration, with an approach that is economically feasible, while providing the required levels of performance and massive scalability.

Hadoop has the potential of becoming the ideal staging area where you can store and archive all of your data (both structured and unstructured), but you can also pre-process it—execute all the batch workloads—and then feed it to other pieces of your architecture. By effectively offloading data and ELT workloads from the data warehouse into Hadoop, organizations can significantly reduce batch windows, keep readily available data as long as they need, and free up significant data warehouse capacity. This means no trade-offs and no tension—just the data you need to drive your business.

What challenges do organizations face as they offload data and workloads from their data warehouse to Hadoop?

It’s key to understand that Hadoop is not a complete ETL solution. In my opinion, Hadoop is much closer to an operating system. It provides services for developers and vendors to create big data applications. However, although it offers powerful utilities and massive horizontal scalability, it does not provide the set of functionality users need to deliver enterprise ETL capabilities. That’s why offloading data and ETL workloads to Hadoop can be intimidating.

Some of the key challenges involve identifying the right tools to close the functional gaps between enterprise ETL and Hadoop. Where do you begin? How do you know which workloads to move? Do you have all the tools necessary to access and move your data and processing? How can you optimize processing once it’s inside Hadoop?

What are some best practices to overcome these challenges?

First of all, let me be clear: the data warehouse is not going away. The goal of offloading is to free up database capacity to reduce costs, improve database user query response time, and use that premium database capacity more wisely. To that end, most organizations follow a three-step approach.

The first step consists of identifying infrequently used (cold) data and heavy ELT workloads in your data warehouse. We have heard from partners and customers alike that, in many cases, ELT processes performed on this “cold” data can waste significant premium storage and CPU resources in your data warehouse, yet add zero value. Similarly, heavy transformations, including changed data capture (CDC), slowly changing dimensions, ranking functions, volatile tables, multiple merges, joins, cursors, and unions, can consume up to 80 percent of resources.
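To make that first step concrete, here is a minimal, hypothetical sketch of flagging cold tables from an exported query log. The CSV layout, file name, and 90-day threshold are assumptions for illustration only; in practice you would draw on the warehouse's own usage or audit views, or a data usage analytics tool.

```python
# Hypothetical sketch: flag "cold" tables from an exported warehouse query log.
# Assumes a CSV with columns table_name,last_queried (ISO dates); a real audit
# would use the warehouse's own usage views or a dedicated usage analytics tool.
import csv
from datetime import datetime, timedelta

COLD_THRESHOLD = timedelta(days=90)  # assumption: 90 days without a query = cold
now = datetime.now()

with open("query_log_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        last_queried = datetime.fromisoformat(row["last_queried"])
        if now - last_queried > COLD_THRESHOLD:
            # Candidate for offloading to the Hadoop staging/archive tier
            print(f"COLD: {row['table_name']} (last queried {row['last_queried']})")
```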

The second step is all about moving the data and replicating the existing ELT workloads into Hadoop. The best way to do this is by leveraging existing skills within your organization. Tools with point-and-click interfaces can help accelerate development and ongoing maintenance of the new environment.

Finally, once you’ve offloaded data and workloads from the data warehouse, you will need enterprise-grade tools to manage, secure, and operationalize the new environment. Here, it is important to look out for solutions that support common security standards such as Kerberos and LDAP, as well as monitoring and management tools.

Where does the cloud fit within this picture?

Hadoop presents a great opportunity to collect, process, and analyze extreme data volumes at a much lower cost. However, procuring, deploying, and maintaining a Hadoop environment can be a daunting task, and that’s exactly where the cloud comes into the picture.

Cloud services, such as Amazon EMR, Google Cloud, and others, allow organizations to instantly provision a Hadoop framework, effectively lowering the barriers for wider adoption and leveling the playing field. Not only that, it means any organization, regardless of its size, can have a Hadoop cluster with virtually unlimited scalability. That’s why the convergence of cloud and Hadoop is so disruptive.
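As a rough illustration of how low that provisioning barrier has become, the sketch below spins up a small Hadoop cluster on Amazon EMR using the boto3 SDK (which postdates this e-book). The cluster name, release label, instance types, and IAM role names are placeholders to adapt to your own account and region.

```python
# Minimal sketch: provisioning a small Hadoop cluster on Amazon EMR with boto3.
# Cluster name, release label, instance types, and IAM roles are illustrative
# placeholders; adjust to your account and region.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="dw-offload-poc",
    ReleaseLabel="emr-6.15.0",          # assumption: any current EMR release
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR instance profile
    ServiceRole="EMR_DefaultRole",      # default EMR service role
)
print("Cluster ID:", response["JobFlowId"])
```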

Does legacy data have a place in Hadoop?

Absolutely. Historically, the most data-intensive businesses have relied on mainframes to manage their “big data.” These organizations—retail, financial, healthcare, telecommunications—know they cannot neglect mainframe data. However, they also need to be aware of skills, integration, and cost gaps between mainframe and Hadoop in order to provide fast, reliable, and secure access to mainframe data.

What products or services does Syncsort offer for offloading data and processing to Hadoop?

Syncsort provides targeted solutions to address the challenges of offloading data and workloads from legacy systems, such as data warehouses and mainframes, to Hadoop. Our fully integrated approach gives you the tools to automatically identify data and processing suitable for offload, easily migrate them to Hadoop with the help of a graphical user interface, and, once there, optimize and secure your Hadoop environment.

This is true for both the enterprise data warehouse and the mainframe. Our mainframe heritage means you can also analyze mainframe workloads and easily access, translate, and move mainframe data to Hadoop. Our solutions can be deployed on premises with Syncsort DMX-h or in the cloud with Ironcluster for Amazon EMR.

How (and Why) Hadoop Is Changing the Data Warehousing Paradigm

By Jack Norris

Hadoop will not replace relational databases or traditional data warehouse platforms, but its superior price/performance ratio can help organizations lower costs while maintaining their existing applications and reporting infrastructure. How should your enterprise get started?

The emergence of new data sources and the need to analyze virtually everything, including unstructured data and live event streams, has led many organizations to a startling conclusion: a single enterprise data warehousing platform can no longer handle the growing breadth and depth of analytical workloads. Being purpose-built for big data analytics, Hadoop is now becoming a strategic addition to the data warehousing environment, where it is able to fulfill several roles.

Why Hadoop (and Why Now)

Organizations across all industries are confronting the same challenge: data is arriving faster than existing data warehousing platforms are able to absorb and analyze it. The migration to online channels, for example, is driving unprecedented volumes of transaction and clickstream data, which are, in turn, driving up the cost of data warehouses, ETL processing, and analytics.

Compounding the challenge is that much of this new data is unstructured. Many businesses, for example, now want to analyze more complex high-value data types (such as clickstream and social media data, as well as un-modeled, multi-structured data) to gain new insights. The problem is that these new data types do not fit the existing massively parallel processing model that was designed for structured data in most data warehouses.

The cost to scale traditional data warehousing technologies is high and eventually becomes prohibitive. Even if the cost could be justified, the performance would be insufficient to accommodate today’s growing volume, velocity, and variety of data. Something more scalable and cost-effective is needed, and Hadoop satisfies both of these needs.

Hadoop is a complete, open-source ecosystem for capturing, organizing, storing, searching, sharing, analyzing, visualizing, and otherwise processing disparate data sources (structured, semi-structured, and unstructured) in a cluster of commodity computers. This architecture gives Hadoop clusters incremental and virtually unlimited scalability—from a few to a few thousand servers, each offering local storage and computation.

Hadoop’s ability to store and analyze large data sets in parallel on a large cluster of computers yields exceptional performance, while the use of commodity hardware results in a remarkably low cost. In fact, Hadoop clusters often cost 50 to 100 times less on a per-terabyte basis than today’s typical data warehouse.

With such an impressive price/performance ratio, it should come as no surprise that Hadoop is changing the data warehousing paradigm.


Hadoop’s Role in the New Data Warehousing Paradigm

Hadoop’s role in data warehousing is evolving rapidly. Initially, Hadoop was used as a transitory platform for extract, transform, and load (ETL) processing. In this role, Hadoop is used to offload processing and transformations performed in the data warehouse. This replaces an ELT (extract, load, and transform) process that required loading data into the data warehouse as a means to perform complex and large-scale transformations. With Hadoop, data is extracted and loaded into the Hadoop cluster, where it can then be transformed, potentially in near real time, with the results loaded into the data warehouse for further analysis.
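A minimal sketch of that offload pattern, using Hadoop Streaming with two short Python scripts, is shown below. The input layout (comma-separated order lines with a customer ID and an amount in the first two fields), the HDFS paths, and the streaming jar location are assumptions for illustration; the point is that the heavy aggregation runs in the Hadoop cluster and only the summarized result is loaded into the warehouse.

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming sketch of an offloaded transformation: aggregate
# raw order lines (hypothetical layout: customer_id,amount,...) by customer.
# Run with something like:
#   hadoop jar hadoop-streaming.jar -input /raw/orders -output /staged/customer_totals \
#       -mapper mapper.py -reducer reducer.py
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) < 2:
        continue                       # skip malformed rows
    customer_id, amount = fields[0], fields[1]
    print(f"{customer_id}\t{amount}")  # key<TAB>value for the shuffle phase
```

The matching reducer sums the values Hadoop delivers to it in key order; this summarized output, not the raw data, is what gets loaded back into the data warehouse:

```python
#!/usr/bin/env python3
# reducer.py -- sums amounts per customer; keys arrive already sorted, so a
# running total per key is enough.
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0.0
    total += float(value)
if current_key is not None:
    print(f"{current_key}\t{total}")
```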

In all fairness, ELT processes began as a way of taking advantage of the parallel query processing available in the data warehouse platform. Offloading transformation processing to Hadoop frees up considerable capacity in the data warehouse, thereby postponing or avoiding an expensive expansion or upgrade to accommodate the relentless data deluge.

Hadoop has a role to play in the “front end” of performing transformation processing as well as in the “back end” of offloading data from a data warehouse. With virtually unlimited scalability at a per-terabyte cost that is more than 50 times less than traditional data warehouses, Hadoop is quite well suited for data archiving. Because Hadoop can perform analytics on the archived data, it is necessary to move only the specific result sets to the data warehouse (and not the full, large set of raw data) for further analysis.

Appfluent, a data usage analytics provider, calls this the “Active Archive”—an oxymoron that accurately reflects the value-added potential of using Hadoop in today’s data warehousing environment. Appfluent has found that, for many companies, about 85 percent of their tables go unused, and within the active tables, up to 50 percent of the columns go unused. The combined savings from eliminating “dead data” at the ETL stage and relocating “dormant data” to a low-cost Hadoop Active Archive can be extraordinary.

Although Hadoop can provide a superior price/performance ratio at both the front and back ends of a data warehouse, its best role may well be as an end in and of itself. This is particularly true given how much Hadoop has evolved since its early days of batch-oriented analysis of Web content for search engines.

Consider the inclusion of HBase, for example, in the Hadoop ecosystem. HBase is a non-relational, NoSQL database that sits atop the Hadoop Distributed File System (HDFS). HBase applications have several advantages in certain distributions, including the ability to achieve high performance and consistently low latency for database operations.
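As a rough illustration (not tied to any particular distribution), the sketch below reads and writes HBase from Python through the happybase client. It assumes the HBase Thrift gateway is running and that a table named clickstream with a column family d already exists; both are assumptions made purely for the example.

```python
# Sketch of low-latency reads/writes against HBase from Python via happybase.
# Assumes the HBase Thrift gateway is running and a table 'clickstream' with
# column family 'd' already exists -- both are assumptions for this example.
import happybase

connection = happybase.Connection("hbase-thrift-host", port=9090)
table = connection.table("clickstream")

# Write one event keyed by user and timestamp
table.put(b"user42|2014-03-01T12:00:00", {b"d:page": b"/checkout", b"d:ms": b"87"})

# Point lookup and a short prefix scan -- the kind of consistently low-latency
# operations the article attributes to HBase
print(table.row(b"user42|2014-03-01T12:00:00"))
for key, data in table.scan(row_prefix=b"user42|", limit=10):
    print(key, data)

connection.close()
```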

Of course, Hadoop’s original MapReduce framework—purpose-built for large-scale parallel processing—is also eminently suitable for data analytics in a data warehouse. In fact, MapReduce is fully capable of everything from complex analyses of structured data to exploratory analyses of un-modeled, multi-structured data.

An exploratory analysis, for example, could derive structure from unstructured data, enabling the data to be loaded into HBase, Hive, or the existing data warehouse for further analysis. Such “pre-processing” is so effective and cost-effective that a growing number of ETL processes are being rewritten as MapReduce jobs. These efforts are often assisted by Hive’s ability to convert ETL-generated SQL transformations into MapReduce jobs.
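For instance, a transformation that once ran as ELT SQL inside the warehouse can be handed to Hive, which plans and executes it as MapReduce jobs over data already sitting in HDFS. The sketch below submits such a statement through the PyHive client; the host, credentials, and table and column names are hypothetical.

```python
# Sketch: submitting an ETL-style SQL transformation to Hive, which compiles it
# into MapReduce jobs and runs it next to the data in HDFS. Connection details
# and table/column names are hypothetical; requires HiveServer2 and pyhive.
from pyhive import hive

conn = hive.connect(host="hive-server", port=10000, username="etl_user")
cursor = conn.cursor()

# A typical warehouse-style aggregation expressed in SQL; Hive plans this as
# one or more MapReduce stages rather than shipping raw data to the warehouse.
cursor.execute("""
    INSERT OVERWRITE TABLE daily_customer_totals
    SELECT customer_id, to_date(order_ts) AS order_day, SUM(amount) AS total
    FROM raw_orders
    GROUP BY customer_id, to_date(order_ts)
""")
conn.close()
```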

Although these MapReduce conversions work well, performance can be improved by rewriting the intermediate shuffle phase that occurs after the Map and before the Reduce functions. Optimizing the shuffle benefits the sorting, aggregation, hashing, pattern-matching, and other processes that are integral to ETL/ELT.

Because it is quite common with MapReduce to have the output of one job become the input for another, Hadoop effectively makes ETL integral to, and seamless with, data analytics and archival processing. It is this beginning-to-end role in data warehousing that has given impetus to what is Hadoop’s ultimate role as an enterprise data management hub in a multi-platform data analytics environment. Indeed, it is almost as if Hadoop is destined to fulfill this role based on its versatility, scalability, compatibility, and affordability.

Although Hadoop appears perfectly suited for use as an enterprise data management hub, there is (as always) a caveat: some Hadoop distributions and/or configurations lack “enterprise-class” capabilities. As a hub, the Hadoop cluster must offer mission-critical high availability and robust data protection. The former can be achieved by eliminating any single points of failure, the latter by supporting both snapshots for point-in-time data recovery and remote mirroring for disaster recovery.

[Figure: An Enterprise Data Hub. The enterprise data hub combines different data sources (sensor data, clickstreams, location, social media, SCM, billing, sales, public data, Web logs, production data, and CRM), minimizes data movement, and uses one platform for analytics.]

Conclusion

The data deluge—with its three equally challenging dimensions of variety, volume, and velocity—has made it impossible for any single platform to meet all of an organization’s data warehousing needs. Hadoop will not replace relational databases or traditional data warehouse platforms, but its superior price/performance ratio will give organizations an option to lower costs while maintaining their existing applications and reporting infrastructure.

So get started with Hadoop at the front end with ETL, at the back end with an Active Archive, or get started in between by supplementing existing technologies with Hadoop’s parallel processing prowess for both structured and unstructured data—depending on your greatest need. For those still reluctant to make the investment at this time, consider getting started in the cloud, where Hadoop is now available as an “on-demand” service.

However your organization gets started, be prepared to become a believer in the new multi-platform data warehousing paradigm, in general, and in Hadoop as a potential and powerful enterprise data management hub.

Jack Norris is the chief marketing officer of MapR Technologies and leads the company’s worldwide marketing efforts. Jack has over 20 years of enterprise software marketing and product management experience in defining and delivering analytics, storage, and information delivery products. Jack has also held senior executive roles with EMC, Rainfinity, Brio Technology, SQRIBE, and Bain and Company. Jack earned an MBA from UCLA Anderson and a BA in economics with honors and distinction from Stanford University.

Hadoop and Data Management: A Close Relationship

Hadoop doesn’t require you to rewire your existing data management best practices, only revise them.

Contrary to what you might have heard, Hadoop doesn’t rewrite the data management (DM) rulebook.

Not in whole, and not necessarily in (large) part. In fact, Hadoop and traditional data management are highly complementary.

From a DM perspective, Hadoop enables fundamentally new practices and excels in contexts in which the data warehouse (DW), that linchpin of traditional DM, founders. For this reason, Hadoop and the DW are not mutually antagonistic. Far from it: many of the new (or non-traditional) use cases that Hadoop enables in turn enable new—and non-traditional—use cases in DM.

This doesn’t so much require rewriting as it does revising existing data management best practices.

In most cases, “revising” the DM rulebook means incorporating changes as new additions to the standard text. This is what it means to “extend” traditional DM with Hadoop: it’s a question of identifying new and existing use cases and best practices that leverage Hadoop’s strengths in a complementary capacity. “Hadoop can be a powerful platform that enables scalability and handles diverse data types for certain components of your DW architecture,” notes Philip Russom, TDWI Research director for data management.

For example, Hadoop can function in any of several roles as a long-term repository in which to store, persist, and manage data. Some of these roles are by now well known (the Hadoop “landing zone”), while others are relatively new. From a DM perspective, all of these roles tend to be highly complementary: e.g., it simply is not cost-effective to use an RDBMS or an MPP DBMS platform to implement a landing zone- or data lake-like scheme. However, just because Hadoop is complementary doesn’t mean that it doesn’t also constitute a significant enhancement or extension of—a departure from—existing technology and practices. Hadoop enhances and extends traditional data management practices and enables entirely new practices or use cases. It is both continuous with traditional DM and at the same time much bigger: it enables an altogether new kind of “big data management”—a category that subsumes traditional DM—centering on the Hadoop environment itself.

Russom describes a use case in which Hadoop simultaneously functions as both a low-cost replacement for—and, in the form of a massive online data archive, as an extension of—the otherwise-indispensable ODS. “To free up capacity on a data warehouse, many organizations manage detailed source data on an ODS consisting of a standalone hardware server running a DBMS instance. There’s a need for ODS platforms that cost-effectively handle massive data volumes and more diverse data, which Hadoop can do,” he points out.

“The source data stored long term in ODSs can approach petabyte scale. Examples include call detail records in telco, sessionized clickstreams in e-commerce, and customer data in financial services. To cope with large volumes, some data is archived offline, which puts it beyond the reach of analytics. Hadoop can keep data online for constant access.”

DM Best Practices, Hadoop Style

The data warehouse was designed in part to address a data access challenge: the DW facilitates query access to business information by centralizing data, by imposing strict constraints on data types, and by organizing data in a strict, rigidly defined schema. In effect, the DW model brings the data to the analytic logic. Hadoop and big data effectively upend this model, Russom notes: in the Hadoop model, after all, both data and analytic logic are (or can be made to be) in the same place.

“For decades, most analytic tools required that data be transformed to a special model and moved to a special database or file prior to analysis,” he explains. “Given the volumes of today’s big data, this is no longer feasible. However, Hadoop was designed for processing data in place. Think of how MapReduce and Hive access and process Hadoop data without moving or remodeling it first.”

That said, many DM best practices can and should be extended to Hadoop. For example, if an information system is identified as a valuable source for analytics, it, too, should be managed and curated. As with a data warehouse, this means preloading its data into Hadoop.

“In the long run, this is faster than retrieving the data prior to each run of an analytic process, especially if the data is voluminous,” writes Russom. Here, too, DM best practices apply: a DW isn’t just preloaded with data, after all; data must be updated or refreshed over time. This means synchronization, which effectively means changed data capture (CDC) capabilities: “Preloading data into HDFS means you must devise processes that keep Hadoop data up to date. Look for changed data capture functionality in data integration tools that interface with HDFS.”
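The following minimal sketch illustrates the CDC idea in its simplest batch form: compare a prior snapshot of key-and-hash pairs against the latest extract and emit only the deltas to apply to the Hadoop copy. File names and formats are hypothetical; production environments would rely on the CDC features of their data integration tooling or on log-based capture rather than full-extract comparison.

```python
# Minimal sketch of changed data capture (CDC) for keeping a preloaded HDFS copy
# in sync: compare a prior snapshot of primary-key -> row-hash pairs with the
# latest source extract and emit only the deltas. File layout is hypothetical.
import csv
import hashlib

def load_snapshot(path):
    """Return {primary_key: row_hash} for a CSV extract whose first column is the key."""
    snapshot = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            snapshot[row[0]] = hashlib.md5(",".join(row).encode()).hexdigest()
    return snapshot

previous = load_snapshot("customers_yesterday.csv")
current = load_snapshot("customers_today.csv")

inserts = [k for k in current if k not in previous]
updates = [k for k in current if k in previous and current[k] != previous[k]]
deletes = [k for k in previous if k not in current]

# Only these deltas need to be applied to the Hadoop copy (e.g., via a merge job),
# not the full source table.
print(f"{len(inserts)} inserts, {len(updates)} updates, {len(deletes)} deletes")
```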

In this model—as in most such integration or extension schemes—Hadoop isn’t a replacement for the data warehouse. For example, one loudly trumpeted use case casts Hadoop as an analytic or data integration sandbox: i.e., as a place in which (respectively) to consolidate and test extremely large data sets or in which to test joins, aggregations, and transformational logic. Even though the scope of both activities is open-ended or experimental, the ultimate aim is to produce stable or persistent structures: for example, tested and refined analytic insights or data integration artifacts.

“It’s ironic that data analysts, data scientists, and similar users scan gigantic volumes of data to understand a business problem—e.g., what’s the root cause of the latest form of churn?—or opportunity,” he indicates. “[T]hey typically boil it all down to a relatively small data set [that is] expressed in a model that represents their epiphany. Too often, analysts share the epiphany with a few peers and managers, then move on to the next analytic assignment. Instead, analysts should always take the outcome of analytics to the BI and DW team in case the team sees the need to operationalize in reports what was initially discovered via analysis.”

One of Hadoop’s biggest strengths is that it’s able to accommodate types and volumes of data that traditional RDBMSs or even MPP DBMSs cannot. This isn’t a categorical “cannot,” however: it’s possible to scale an MPP warehouse into the double-digit petabyte range, albeit at a cost that’s orders of magnitude higher than that of a comparable Hadoop deployment.

To paraphrase the subtitle of Stanley Kubrick’s Dr. Strangelove, DM practitioners must learn to stop worrying and love Hadoop. This means looking for opportunities to offload (to Hadoop) data or processes for which the DW itself is fundamentally unsuited.

“This includes data types that few DWs were designed for, such as detailed source data and any file-based data—[e.g.] logs, XML, text documents, [and] unstructured data,” Russom writes. “It includes most ETL and data integration processes, especially those that must run at massive scale—e.g., aggregating tens of terabytes, sorting millions of call detail records. Hadoop is designed for these data types and operations, and Hadoop capacity is far less expensive than DW capacity.”

Russom sees offloading as a win-win for both the DW and Hadoop: “[O]ffloading allows the DW to do what it does best: provide squeaky clean, well-modeled data with a well-documented audit trail for standard reports, dashboards, performance management, and OLAP.”

Using Hadoop as a landing zone, sandbox, or staging area for storing and managing new data—or for accommodating experimental analytic and data integration workloads—has other benefits, too.

For one thing, doing so makes it much easier to quickly expose new data sources: there’s simply less risk of something going awry. Data integration or analytic kinks can be worked out in Hadoop; squeaky-clean data or data structures can thereafter be moved into the data warehouse.


Russom cites the relatively new idea of using Hadoop as a “data lake”—i.e., as an inexhaustible pool for most of the raw data that a business ultimately uses to feed its DW analytic apps. This would be a nonstarter in a data warehouse environment; there’s a reason data is conformed and transformed before it’s loaded into the warehouse, after all. By the same token, the “data lake” concept wouldn’t be cost-effective or (as a function of scaling issues) practicable with an ODS.

Yet certain kinds of analysis, from BI discovery to advanced analytics, can make use of this raw data. What’s more, information from other non-traditional sources—such as mainframes, which generate detailed transaction information, or machines and sensors, which generate log and event data—can conceivably be pooled in a Hadoop data lake, too, Russom argues.

“DW professionals are often hesitant when it comes to integrating data from a new source into the warehouse because it takes time to model new data structures and design ETL jobs. In addition, disaggregating poor-quality or untrustworthy data from the DW’s calculated values, time series, and dimensional structures is so difficult as to be impossible,” he points out. “With a data-lake approach to HDFS, modeling and ETL are not required, and disaggregation can be as simple as altering virtual views or analytic algorithms so they ignore files containing questionable data.”
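A toy sketch of that disaggregation idea: because the lake holds raw files, excluding questionable data can be as simple as filtering the list of paths an analytic job reads. The paths and the quarantine list below are hypothetical.

```python
# Sketch of "disaggregation" in a data lake: rather than surgically removing rows
# from warehouse structures, simply filter the file list an analysis job reads.
# All paths and the quarantine list are hypothetical.
QUARANTINED = {
    "/lake/clickstream/2014-02-26.json",     # flagged as untrustworthy by data stewards
    "/lake/clickstream/vendor_feed_bad.json",
}

candidate_files = [
    "/lake/clickstream/2014-02-26.json",
    "/lake/clickstream/2014-02-27.json",
    "/lake/clickstream/2014-02-28.json",
    "/lake/clickstream/vendor_feed_bad.json",
]

analysis_inputs = [p for p in candidate_files if p not in QUARANTINED]
print("Analyzing:", analysis_inputs)
# These paths would then be passed to the analytic job (MapReduce input dirs,
# a Hive external table location list, etc.); the questionable files never enter the run.
```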

DM Mainstays Such as ETL Are More Important than Ever

This doesn’t mean that data modeling and ETL requirements will go away. Far from it: if or when analytic artifacts are shifted from Hadoop to the data warehouse, modeling and ETL will be huge factors. The practical effect of using Hadoop to land and stage new data; inexpensively parallelize ETL workloads; identify, test, and refine analytic insights; or perfect data integration workloads is that it becomes possible to more quickly instantiate all of these artifacts in the data warehouse.

In the traditional DW model, there can be a huge time lag between when a feature or change is requested and when it’s actually delivered. This is in part because the DW model presupposes a comparatively static world—so much so that making changes is a nontrivial task. Incorporating Hadoop into DM as a versatile test and development, analytic discovery, or data processing platform can help mitigate this issue. “The issue ... is whether a single-platform data warehouse can be designed and optimized such that all workloads run optimally, even when concurrent. More and more DW teams are concluding that a single-platform DW is no longer desirable,” Russom says.

“Instead, they maintain a core DW platform for traditional workloads—reports, performance management, and OLAP—but offload other workloads to other platforms,” he concludes. “The DW is not going away; it’s just being complemented by additional data platforms tuned to workloads that can and should be offloaded from the core warehouse.”


© 2014 by TDWI (The Data Warehousing Institute™), a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail requests or feedback to [email protected].

Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

About Syncsort

www.syncsort.com

Syncsort provides fast, secure, enterprise-grade software spanning Big Data solutions in Hadoop to Big Iron on mainframes. We help customers around the world to collect, process, and distribute more data in less time, with fewer resources and lower costs. Eighty-seven of the Fortune 100 companies are Syncsort customers, and Syncsort’s products are used in more than 85 countries to offload expensive and inefficient legacy data workloads, speed data warehouse and mainframe processing, and optimize cloud data integration. Experience Syncsort at www.syncsort.com.

To learn more about Syncsort solutions for Hadoop—and try them for yourself: www.syncsort.com/hadoop www.syncsort.com/try

tdwi.org

TDWI, a division of 1105 Media, Inc., is the premier provider of in-depth, high-quality education and research in the business intelligence and data warehousing industry. TDWI is dedicated to educating business and information technology professionals about the best practices, strategies, techniques, and tools required to successfully design, build, maintain, and enhance business intelligence and data warehousing solutions. TDWI also fosters the advancement of business intelligence and data warehousing research and contributes to knowledge transfer and the professional development of its members. TDWI offers a worldwide membership program, five major educational conferences, topical educational seminars, role-based training, on-site courses, certification, solution provider partnerships, an awards program for best practices, live Webinars, resourceful publications, an in-depth research program, and a comprehensive website, tdwi.org.