
THE CONTEXTUAL DATA LAKE

MAXIMIZING DATA LAKE VALUE VIA HYBRID ENVIRONMENTS THAT PROVIDE COMPLETENESS,

CONTEXT, AND ACCELERATED ANALYTICS CAPABILITY

Page 2 The Contextual Data Lake Copyright © 2015 SAP, Inc.

The Contextual Data Lake

SUMMARY

Businesses are increasingly turning to data lakes as a means of addressing the challenges associated with managing

big data. These organizations face a certain amount of confusion and ambiguity when setting out to implement a

data lake solution because there is no single definitive “data lake” model, but rather a variety of options around

how this component of the enterprise data fabric can be architected and implemented. Much of the current

discussion about data lakes centers on Hadoop, which is – without question – a core big data technology. However,

an exclusive focus on Hadoop is misdirected, as would be an exclusive focus on traditional data warehousing

technology. The false dichotomy between these two approaches tends to obscure the fact that a hybrid

environment offers many advantages over an exclusive focus and is currently the most likely option for most

organizations. SAP offers solutions for real-time operations, data warehousing, and managing big data to support

a wide range of options for implementing and managing a hybrid data lake environment. Properly managed, a

hybrid environment enables the implementation of a true contextual data lake, an evolutionary step up from the

non-contextual data lake (the data swamp) to the real-time, virtual data lake environments to come.

INTRODUCTION

Rethinking Data Architecture

In a world that is increasingly information-driven, data management is no longer merely a reflection of your

organization’s administrative competency, but rather a unique strategic differentiator that can mean the

difference between success and failure in the marketplace. The companies that realize success and growth in this

new era will be those that can adapt to this challenge with a core strategy that uses big data to transform their

businesses. Such transformation requires recognizing that big data is more than just a few new concepts and

technologies; it is a whole new paradigm.1

With this new paradigm come new challenges. Primary among these is the fact that traditional data architectures

are inadequate to deal with the new demands placed upon them by a massive influx of new information.

Large social media companies like Facebook deal with hundreds of terabytes of information each day. A fleet of

commercial jets can create similar amounts of data from a single day of operations. Traffic cameras,

environmental sensors, and cell phones create and use untold billions of pieces of data that range from simple

numerical data to voice, text, and video information.

All of that data is moving at speeds that make velocity a critical concern for data management. Whether it is click

imprints on an online ad, data exchange between machines, online gaming information, just-in-time inventory

updates, or the constant tracking of activity in the world’s stock exchanges, the speed at which data is created and

moves is dizzying. Increasingly, the world of big data is also the world of real-time data. Businesses no longer have

the luxury of choosing whether they want to go big or they want to go fast. They must do both.

Moreover, the variety of data types poses a unique challenge. Gone are the days when the enterprise data

architecture could assume that all needed data would be structured or that it would fit easily into a conventional

enterprise data model. Today’s environments encompass more than just text, numbers, and the occasional image,

and more than just transactions from a traditional OLTP. They involve complex audio, video, and 3D image files;

telemetry, log files, and other machine data; and a whole host of social media and other user- or customer-

generated data.

All of these changes add up to the need for a new way of looking at data architecture.

1 Beyer, Mark (June 27, 2011). “Gartner Says Solving 'Big Data' Challenge Involves More Than Just Managing

Volumes of Data.” Gartner. http://www.gartner.com/newsroom/id/1731916

The Enterprise Data Warehouse vs. The Data Lake

Introduced as a means of managing growing volumes of data and consolidating enterprise data from multiple

sources, the enterprise data warehouse (EDW) is, in many ways, the original “big data” solution. The EDW supports

reporting and analysis on (usually) highly structured data, playing a critical role for many organizations in enabling

both internal and external reporting – ensuring, among other things, SLA (service-level agreement) and regulatory

compliance – as well as more strategic analysis of business processes, market performance, etc. The information

managed by these environments complies with a defined schema, which optimizes searches, reporting, and

analysis. Large subsets of detailed information can be easily searched and analyzed, enabling both data mining and

more advanced predictive analysis.

What the EDW cannot do, however, is effectively manage much of the new data variety that makes up a growing

share of the typical big data environment. While most EDW architectures are scalable, their reliance on a pre-

defined schema limits their flexibility, and therefore applicability, as repositories for the full enterprise dataset in

the era of big data.

The data lake concept has attracted significant recent attention as an alternative to the EDW. A data lake does not

require incoming data to conform to a pre-defined schema. Rather, it stores data in its original format. Most often

relying on Hadoop (or, less frequently, a NoSQL database2) as the enterprise repository, and implemented on

commodity hardware – or via the cloud – the data lake addresses the challenges stemming from massive growth of

data volumes as well as those arising from widespread disparity and incompatibility among data types.

For example, many hospitals have discovered that data lakes are an ideal way to manage the millions of patient

records they maintain – records that can range in format from x-rays to physicians’ notes to lab results. With a

data lake, the hospital stores all of that disparate data in its original format, calling upon specific types of record

when needed, converting the data into uniform structures only when the situation calls for it. Because each record

remains in its original format, the data lake supports a variety of contextual possibilities that a standard database

structure cannot offer.

Of course, the data lake model is not without associated risks and limitations. Notwithstanding those limitations,

some organizations are experimenting with simply leaving data in the data lake while they wait to see how they

might eventually use it. But without an organized system for managing all that information, it is easy to lose track

of it. Additionally, relying on late binding techniques rather than well-defined metadata protocols can make data

less accessible to those who need direct query or analytical access to the data.

The potential contextual capabilities of the data lake are extremely valuable, but a data lake deployed on its own

carries drawbacks that add a level of risk most businesses would find unacceptable.

2 Brantner, Matthias (2015). "Filling in the Gaps in NoSQL Document Stores and Data Lakes." PWC.

When viewed as alternatives, both the traditional EDW model and the data lake have their strong points and their

weaknesses. In trying to choose between them, organizations often have to make difficult choices. As Figure 1,

below, demonstrates, organizational requirements often put the two models at odds with one another. A business

that needs to deal with very large volumes of data and keep server and storage costs down while meeting

numerous compliance and reporting mandates will often be confused and frustrated by this dichotomy.

Figure 1: Viewed as alternatives, the data lake and EDW models are incomplete

when it comes to addressing the kinds of challenges businesses currently face

The answer may lie with a hybrid of the two models. By combining the best features of an enterprise data

warehouse with the greater storage flexibility and contextual value provided by a data lake, your company can

more efficiently manage growing data volumes and complexity. Executed properly, a hybrid environment can

provide the ease of analysis you need today, while maintaining a repository of all data in its original format. Such

an approach can address analytical needs while ensuring that context is preserved for both immediate and future

needs.

Figure 2: A hybrid solution can fill the gaps

The following sections provide an overview of some of the more typical deployment options of a data lake within a

larger data management framework.

DATA LAKE AS A TECHNOLOGY SOLUTION

As data volumes grow, and the variety and complexity of data types increase, traditional data management tools

are pushed to their limits. In particular, the EDW – with its extensive data preparation and cleansing requirements

– is unable to keep up. A data lake can offer significant assistance to an environment relying solely on an EDW and

facing such pressures.

Typically, a data lake is proposed as primarily a technology solution, albeit one that addresses certain

organizational concerns. In this scenario, the data lake is portrayed as the ideal way to advance your organization’s

big data strategy. Advocates of this approach argue that the data lake is a project that you can simply “turn over”

to your IT department, along with a deadline and an appropriate commitment from your budget. Once it is

developed, you will have a repository for all the disparate pieces of data that your organization collects. Some

would even argue that a data lake provides the ideal way to break the pattern of tension that routinely exists

between IT and the business. IT has traditionally worked to drive information into centralized EDW architecture

solutions (including datamarts), even as the business tends towards less centralized solutions such as shadow IT 3

efforts and “spreadmarts,” 4 Microsoft Excel-based quasi-data mart solutions that tend to proliferate throughout

organizations, providing incomplete and often conflicting sets of analyses.

Increasing demand for access to data throughout the organization has only increased that tension in recent years.

The need for business analysts, managers, and other decision-makers to have hands-on access to data and analysis

is growing. A data lake enables more localized solutions and broader access to the data – while maintaining a

consistent, IT-managed repository – all incorporated without incurring significant costs. As a technology solution,

the data lake is intended to serve as an effective supplement to your EDW architecture, decreasing reliance on

centralized solutions and offering new flexibility. Once built, the idea is that IT can leave the analysis to those who

have need of it. However, the business analysts who need access to the data typically lack the broad set of skills

required to perform analysis on such a wide variety of data types. Business users naturally tend towards solutions

that they find intuitive and easy to use, which is why spreadmarts remain a problem.

Meeting Big Data Challenges

Probably the most important benefit of creating a central repository of an organization’s data is the context that it

provides. Context can transform raw data into usable knowledge to better inform both day-to-day decisions and

long-term strategies. The data lake provides the whole picture; nothing is left out because it doesn’t fit the

schema.

Consider the sensor and other machine information that Internet of Things (IoT) environments must manage. Daily

or hourly (or even considerably more frequent) readings on location, temperature, or any of thousands of other

3 Guest, Vawns and Bolger, Patrick. "Managing shadow IT." ComputerWeekly.com. http://goo.gl/gyBBvH

4 Eckerson, Wayne (July 2002). "Taming Spreadsheet Jockeys". TDWI Case Studies and Solutions. TDWI.

variables can factor into major business decisions. But that is a massive amount of data to try to wedge into a

conventional EDW schema, and those data points are only relevant for a certain class of queries and analysis.

Often, the ETL requirements alone preclude putting this kind of information into a conventional data warehouse.

As an example of how a data lake can provide new opportunities for more powerful analysis of data, consider the

data lake system that GE recently implemented. More than two dozen airlines now stream a wide variety of jet

engine performance data directly into a data lake. The data is analyzed by service crews at the airlines so that they

can more easily detect problems with performance. Even the smallest anomalies can be detected with this level of

analysis, by comparing variables such as engine temperature, the engine type, and its overall service records. 5
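
The kind of comparison described above can be sketched in a few lines of Python. This is an illustration only, not GE's method: the readings, the engine type labels, the `find_anomalies` helper, and the threshold are all hypothetical. The sketch groups temperature readings by engine type and flags any reading that lies far from its group's median, measured in units of the median absolute deviation.

```python
from statistics import median

# Hypothetical readings: (engine_type, exhaust_temp_celsius).
readings = [
    ("type_a", 601.0), ("type_a", 598.5), ("type_a", 603.2),
    ("type_a", 599.8), ("type_a", 600.4), ("type_a", 655.0),
    ("type_b", 540.1), ("type_b", 541.7), ("type_b", 539.4),
    ("type_b", 540.9), ("type_b", 538.8), ("type_b", 541.2),
]


def find_anomalies(rows, threshold=5.0):
    """Flag readings far from the median of their own engine type."""
    # Group temperatures by engine type so each type gets its own baseline.
    by_type = {}
    for engine_type, temp in rows:
        by_type.setdefault(engine_type, []).append(temp)

    anomalies = []
    for engine_type, temps in by_type.items():
        center = median(temps)
        # Median absolute deviation: a spread estimate that a single
        # extreme reading cannot inflate the way it inflates a std. dev.
        mad = median(abs(t - center) for t in temps)
        if mad == 0:
            continue  # all readings identical; nothing to compare against
        for temp in temps:
            if abs(temp - center) / mad > threshold:
                anomalies.append((engine_type, temp))
    return anomalies


print(find_anomalies(readings))
```

A median-based baseline is used here because, with only a handful of readings per group, one extreme value can distort a mean and standard deviation enough to hide itself. A real system would also bring in service records and many more variables.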

The Data Annex

None of this suggests, however, that data lakes can simply serve as a replacement for an enterprise data

warehouse. They cannot.6 When deployed as a technology solution, the data lake becomes an annex to the EDW, a

supplemental environment that addresses some requirements the EDW cannot address on its own.

Figure 3: A data annex / supplemental data lake is typically

introduced as a technology-focused initiative

5 "Angling in the Data Lake" (August 10, 2014). GE Reports. http://goo.gl/YWZg9T

6 Elliott, Timo (April, 2014). "No, Hadoop Isn’t Going To Replace Your Data Warehouse." Business Analytics.

http://goo.gl/GtXb1V

[Figure 3 labels: Technology Solutions, Big Data Solutions, Data Annex]

When implemented via Hadoop or a NoSQL database, data lakes function as repositories where disparate types of

information are stored in their native format. The lack of structure is, in one sense, necessary. At present, the

demands of Big Data necessitate such a repository for the storage of the many different varieties of data that are

captured. Analyzing this kind of data, as GE is doing with the jet engine data mentioned above, requires

manipulating metadata. What analysis can be achieved with this approach is severely limited in scope, although –

as noted – it can provide a tremendous business impact.

In addition to a lack of structure, such data has no lineage. It can thus be extremely difficult to determine how and

where the data was generated, as well as other factors that would make it easier to classify and categorize. The

basic assumptions that standard enterprise data management practices have established around data do not

apply. In the appropriate context, and with the metadata tweaked just right, the data lake produces value. In

another context (even potentially a closely related one) the same data may be of substantially lower value, or no

value at all. As explained by Gartner:

Data lakes therefore carry substantial risks. The most important is the inability to determine data quality

or the lineage of findings by other analysts or users that have found value, previously, in using the same

data in the lake. By its definition, a data lake accepts any data, without oversight or governance. Without

descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And

without metadata, every subsequent use of data means analysts start from scratch. 7
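
To make "descriptive metadata and a mechanism to maintain it" concrete, the following is a minimal Python sketch of a catalog that records where each object in the lake came from and what it was derived from. It assumes nothing about any particular product: the `Catalog` and `CatalogEntry` names, the field names, and the example paths are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    path: str            # where the object lives in the lake
    source_system: str   # where the data was generated
    data_format: str     # native format, e.g. "json" or "dicom"
    checksum: str        # fingerprint of the stored bytes
    ingested_at: str     # UTC timestamp of registration
    derived_from: list = field(default_factory=list)  # parent paths


class Catalog:
    """Hypothetical in-memory metadata catalog with lineage tracking."""

    def __init__(self):
        self._entries = {}

    def register(self, path, payload, source_system, data_format,
                 derived_from=()):
        # Refuse lineage links to objects the catalog has never seen.
        for parent in derived_from:
            if parent not in self._entries:
                raise KeyError(f"unknown parent: {parent}")
        entry = CatalogEntry(
            path=path,
            source_system=source_system,
            data_format=data_format,
            checksum=hashlib.sha256(payload).hexdigest(),
            ingested_at=datetime.now(timezone.utc).isoformat(),
            derived_from=list(derived_from),
        )
        self._entries[path] = entry
        return entry

    def lineage(self, path):
        """Return every ancestor of `path`, nearest first."""
        seen, queue = [], list(self._entries[path].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent not in seen:
                seen.append(parent)
                queue.extend(self._entries[parent].derived_from)
        return seen


catalog = Catalog()
catalog.register("raw/engine.log", b"t=601.0", "telemetry", "text")
catalog.register("clean/engine.json", b'{"t": 601.0}', "etl-job", "json",
                 derived_from=["raw/engine.log"])
catalog.register("reports/summary.csv", b"t\n601.0", "analyst", "csv",
                 derived_from=["clean/engine.json"])
print(catalog.lineage("reports/summary.csv"))
```

With entries like these, a later analyst can trace a finding back to its raw source instead of starting from scratch.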

In a standard data warehouse environment, on the other hand, data tends to be broadly applicable across the

widest variety of analytical use cases. Banks, accounting firms, manufacturers, and other data-intensive

organizations can more readily access the information they need and enjoy much greater flexibility in the queries

and types of analysis they apply to it. Where a data lake may be implemented as a technology solution, a data

warehouse is almost always implemented as a business solution. This above all is why – for all the success that GE

has achieved with their jet engine data lake – no one is suggesting that such an environment replaces the data

warehouse(s) such a large company needs to manage day-to-day operations, support broad reporting and analysis

of the overall operation, and maintain regulatory and SLA compliance.

7 "Gartner Says Beware of the Data Lake Fallacy" (July 28, 2014). Gartner. http://goo.gl/vOKEs3

THE DATA LAKE AS A COMPONENT OF THE DATA ARCHITECTURE

Not long ago, there was a flurry of media discussion about whether big data environments will bring about the end

of the data warehouse. While that question is not being asked as frequently today, the tendency to view data lakes

and EDWs as competing or alternative models persists, leading to ongoing misunderstandings, and also obscuring

what is really happening within many organizations.

The reality for many businesses is that there is no choice to be made. Health care, financial services, utilities, and

many other industries are highly regulated. In those settings, the structure and use of the data warehouse

becomes a matter of law. Moreover, even in less regulated industries, when companies commit to complex and

detailed service level agreements, compliance often involves a structured approach to managing data that only an

EDW can provide. Operating outside of the law or violating core business agreements with vital customers or

partners is not an option.

Of necessity, such companies will retain an EDW as part of their overall data architecture. If the data lake is most

often deployed as a technology solution, the data warehouse is most often retained as a business solution.

Whatever value the data lake may bring, it generally cannot deliver core functionality that the business requires.

Figure 4: A business action machine solves specific business problems and is

integrated with the overall enterprise data fabric

However, not every “data warehouse” qualifies as a true enterprise data warehouse. In addition to data marts,

reporting servers, and related systems, many organizations implement data warehousing solutions that are in fact

quite limited in their scope and abilities, and often devoted to a single main purpose – such as managing security

data or monitoring product performance. These “business action machines” often work in parallel with the true

EDW, interacting with it as required depending on the specific function being carried out.

[Figure 4 labels: Business Solutions, EDW Solutions, Business Action Machine]

A business action machine may be an important component of the overall data architecture, but by definition it is

not an alternative to an EDW. In the GE example provided above, the data lake provides exactly this kind of

limited-scope function. The applicability of a data lake architecture to such a system will depend in large part on

what task is being undertaken. As outlined above, many functions can only be reliably performed via a data

warehouse.

Of course, the limitations cited here primarily stem from the use of Hadoop or a NoSQL database. However, a

business can implement what is, effectively, a business-action data lake using any of a number of technologies,

including more traditional database technology. In such an environment, there will still be a clear distinction

between the data repository and the active, governed environment.

THE DATA LAKE AS AN ENTERPRISE CONTEXT MACHINE

Successfully incorporating a data lake into a hybrid architecture requires balancing expectations of the advantages

it can bring with a thorough understanding of its limitations. To understand why, consider some of the more

obvious limitations.

A data lake is not a good option for high-performance analysis of structured data. Because the data in a data lake

remains in its native format, it lacks the structure that would facilitate versatile querying and high-performance

analysis. The big data answer to this problem is the Hadoop principle known as “schema on read,” where the data

structure is applied at the time the query is issued. But this approach is not without risks. Some of the data

included in big data environments can be highly volatile, creating a disconnect between assumed structure of the

data and what is actually there. This creates a situation where it is easy to overlook errors in the data, and

performance inevitably takes a hit. The best way to mitigate these risks involves some level of vetting of

incoming data. This requires reintroducing at least some parts of the Extract, Transform, Load (ETL) process for the EDW

that a big data solution is supposed to eliminate. 8
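
The schema-on-read risk described above can be illustrated with a short Python sketch. It is a simplified stand-in rather than how Hadoop tooling works: the raw lines, the `SCHEMA` mapping, and the `read_with_schema` helper are hypothetical. The schema is applied only when the data is read, and any record that no longer matches the assumed structure is set aside rather than silently dropped.

```python
import json

# Raw lines as they might sit in the lake, stored exactly as received.
RAW_LINES = [
    '{"device": "s1", "temp_c": 21.5}',
    '{"device": "s2", "temp_c": "22.1"}',     # number arrived as a string
    '{"device": "s3", "temperature": 20.9}',  # field renamed upstream
    'not json at all',                         # corrupt record
]

# The structure the query assumes: field name -> casting function.
SCHEMA = {"device": str, "temp_c": float}


def read_with_schema(lines, schema):
    """Apply `schema` at read time; return (rows, rejects)."""
    rows, rejects = [], []
    for line in lines:
        try:
            record = json.loads(line)
            row = {name: cast(record[name]) for name, cast in schema.items()}
        except (ValueError, KeyError, TypeError) as exc:
            # Keep the rejected line and the reason so the gap is visible.
            rejects.append((line, type(exc).__name__))
            continue
        rows.append(row)
    return rows, rejects


rows, rejects = read_with_schema(RAW_LINES, SCHEMA)
print(len(rows), "rows read;", len(rejects), "records did not fit the schema")
```

Half of the sample records fall outside the assumed structure. A reader that skipped them without keeping a reject list would return a plausible but incomplete answer, which is the failure mode that vetting incoming data is meant to prevent.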

A stand-alone data lake is not the best option for real-time analysis for several reasons. The most obvious is that

the data lake model is, by definition, a repository. A business looking to do real-time analysis will need to add real-

time functionality onto the data lake, via Spark streaming or some other interface, to support such a use case.

Additionally, there are all the potential issues around data quality discussed above. The risk that the schema

applied at read time is not going to find all relevant data – that some data is going to slip between the cracks – is a

very real one. If important data is not finding its way out of the data lake, high speed analysis only serves to

accelerate the errors.

Access and Context

As noted above, the primary advantage of a hybrid environment is that it brings context to enterprise analysis and

reporting. In the age of big data, businesses are swimming in a sea of context. When deployed properly, contextual

data can add tremendous business value. Businesses can explore machine- and user-generated data to segment

customers by behavior as well as demographics, and to make surprising connections between seemingly unrelated

factors. A retailer looking at point-of-sale transactions can compare receipts with external data such as searches

trending on social media to better understand why customers are deciding to buy (or not buy). A shipping

company can dig deeper than just seasonal variations and begin forecasting work volume and likely delays by

cross-referencing orders placed with changes in weather. Manufacturing, logistics, hospitality, health care, retail,

entertainment – all industries can benefit from digging deeper into data that sheds unexpected light on their

operations.
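
The shipping example can be sketched as a simple join between structured order data and contextual weather data. The Python below is a minimal illustration under stated assumptions: the order and weather rows, the field names, the 10 cm snowfall cutoff, and both helper functions are invented for the example.

```python
# Structured, warehouse-style rows: daily order counts (hypothetical values).
orders = [
    {"date": "2015-01-05", "region": "north", "orders": 120},
    {"date": "2015-01-06", "region": "north", "orders": 80},
    {"date": "2015-01-07", "region": "north", "orders": 124},
    {"date": "2015-01-08", "region": "north", "orders": 76},
    {"date": "2015-01-09", "region": "north", "orders": 100},
]

# Context drawn from the lake: weather observations (hypothetical values).
# No observation exists for 2015-01-09.
weather = [
    {"date": "2015-01-05", "region": "north", "snow_cm": 0},
    {"date": "2015-01-06", "region": "north", "snow_cm": 15},
    {"date": "2015-01-07", "region": "north", "snow_cm": 0},
    {"date": "2015-01-08", "region": "north", "snow_cm": 20},
]


def attach_context(order_rows, weather_rows):
    """Left-join weather onto orders by (date, region)."""
    weather_by_key = {(w["date"], w["region"]): w for w in weather_rows}
    enriched = []
    for row in order_rows:
        match = weather_by_key.get((row["date"], row["region"]))
        enriched.append({**row, "snow_cm": match["snow_cm"] if match else None})
    return enriched


def mean_orders_by_snow(enriched, heavy_snow_cm=10):
    """Average daily orders on heavy-snow days versus all other days."""
    groups = {"heavy_snow": [], "other": []}
    for row in enriched:
        if row["snow_cm"] is None:
            continue  # no context available; leave the day out
        key = "heavy_snow" if row["snow_cm"] >= heavy_snow_cm else "other"
        groups[key].append(row["orders"])
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}


enriched = attach_context(orders, weather)
print(mean_orders_by_snow(enriched))
```

The join keeps every order row even when no weather observation exists, so gaps in the contextual data stay visible instead of shrinking the structured dataset.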

While context from a data lake can provide tremendous insights, it is of little use if there is no system in place to

deliver reliable answers to those who need it – and deliver it in a way that helps to facilitate the decision-making

8 "Why Hadoop Projects Fail — and How to Make Yours a Success." Venturebeat. http://goo.gl/oGI7yH

process. To accommodate that need, the best option is an architecture that leverages the advantages of both the

enterprise data warehouse and the data lake. The warehouse can still provide the rapid analysis of structured data,

while the data lake can support the warehouse by providing the context that can better inform decision-making at

every level of the enterprise.

Figure 5: The contextual data lake combines the scalability and flexibility of

big data solutions with the reliability and business focus of a traditional EDW

A hybrid environment can serve as a true enterprise context machine, bridging the gap between the power and

scalability of big data technologies and the reliability and business focus of an enterprise data warehouse. In the

next section, we examine some of the technologies that support an enterprise context machine.

SAP SOLUTIONS SUPPORT THE CONTEXTUAL DATA LAKE

SAP provides a full palette of technologies to support the contextual data lake. Proven EDW technology provides

the structure for a true enterprise-grade solution, backed by unmatched knowledge of and integration with end-

to-end business processes. The real-time platform directly integrates the EDW with the business, and can also

serve as a bridge between the business, the data warehouse, and the data lake. High-end traditional data

management technologies round out the ecosystem and provide flexibility in structuring a hybrid environment.

Big Data Platform

SAP’s Big Data Platform is SAP HANA, an in-memory, column-oriented RDBMS (relational database management

system) designed to handle both high transaction rates and complex query processing on the same platform.

HANA radically simplifies data management architectures by combining in-memory processing of transactional

data with the EDW to enable business processes to run 1,000 to 100,000 times faster than in environments relying

on traditional architectures. With an embedded web server and version control repository that can be used for

application development, HANA provides a full real-time computing platform that enables businesses to realize

maximum value from their big data assets.

eBay’s HANA implementation story provides an example of how the platform supports a true contextual data lake

environment. With more than 90 million users worldwide and a system that processes millions of transactions per

day, eBay’s online auction service has accumulated some 50 petabytes of data, and is still growing. In order to

provide actionable intelligence to their users, eBay requires a system that can support analyzing tens of thousands

of variables within that massive collection of data in order to identify shopping patterns and purchasing trends as

they emerge. Because the value of the assets traded on eBay can be highly variable, understanding and leveraging

these trends as they occur is vitally important to the success of the sellers, who seek to maximize the value of each

sale they make.

The 50 petabytes of data is curated in a massive conventional data warehouse. eBay has a team of more than 300

analysts studying the data from the North America marketplace on a full-time basis; these individuals are

responsible for understanding online shopping patterns within specific product categories. With HANA, eBay has

implemented an early pattern detection system that uses predictive analytics to enable these analysts to discover

trends as they emerge in real time. As noted above, a contextual data lake need not be built on a “big data”

platform per se. In this instance, eBay is running real-time predictive analysis on data in a HANA-based data mart,

and dipping into the data lake (that is, the conventional data warehouse) as needed for additional context.

Holidays, sporting events, movie releases, and many other news items and emerging patterns on social media can

drive rapid and substantial changes to the value of specific items.

Analysts can now observe patterns as they emerge and dig deeper to get the full story on what is happening in the

market. For example, noting that shoe sales are spiking is helpful, but observing that the real upward trend is

among athletic shoes helps cue the right sellers to take advantage of the trend. Moreover, linking the spike to a

particular brand and make of shoe that has suddenly become hot because of a sporting event or a tweet from a

star athlete enables the specific sellers who have those to offer to leverage the rapidly changing market. Real-time

transactional data combined with context from the data lake opens up a whole new view of what is happening

within eBay.

EDW

SAP Business Warehouse (BW) is an Enterprise Data Warehouse solution, enabling businesses to integrate,

transform, and consolidate relevant business information both from SAP applications and external data sources.

SAP BW provides businesses with a high-performance infrastructure that enables them to evaluate and interpret

data, fully integrated with an overall SAP ecosystem, and leveraging the unique and comprehensive understanding

of business process. It enables reporting, analysis, and interpretation of business data that is crucial to preserve

and enhance the competitive edge of companies by optimizing processes and enabling them to react quickly to

meet market opportunity. Decision makers can make well-founded decisions and identify target-oriented activities

on the basis of the analyzed data. BW was traditionally implemented on standard RDBMS technology, which it still

supports, and is now frequently deployed on SAP HANA to take advantage of the full integration of operations and

analytics that HANA provides, including real-time capability.

Alliander, a Dutch energy distribution company serving 3.5 million customers, provides a good example of how

businesses can leverage HANA’s real-time potential with SAP Business Warehouse. The environment supports both

real-time analysis and historical / contextual analysis depending on the use case. A critical process for Alliander is

load forecasting. Providing too much power causes waste and has a negative environmental impact; providing too

little causes customer dissatisfaction and puts vital services at risk. Striking a balance between the two requires

intensive analysis. By using BW on HANA, Alliander was able to cut their load forecasting process from 10 weeks to

three days, providing significant savings and greater assurance of forecast accuracy. The company has also

implemented advanced analytics for asset management, using comparative analysis of nearby assets to predict

failure of infrastructure before it occurs.
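
Comparative analysis of nearby assets can be illustrated with a leave-one-out peer comparison. The source does not describe Alliander's actual method, so the following is a generic sketch: the readings, the neighborhood grouping, the `flag_outliers` helper, and the thresholds are all hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical readings (e.g. transformer temperature) and neighborhood assignments.
READINGS = {"T1": 61, "T2": 60, "T3": 62, "T4": 59, "T5": 85,
            "S1": 70, "S2": 71, "S3": 69}
GROUPS = {"T1": "north", "T2": "north", "T3": "north", "T4": "north", "T5": "north",
          "S1": "south", "S2": "south", "S3": "south"}

def flag_outliers(readings, groups, z_threshold=2.0, min_sigma=1.0):
    """Flag assets whose reading sits far from the mean of their neighbors.

    Each asset is compared only with the other assets in its own group
    (leave-one-out), so a single failing unit does not mask itself.
    `min_sigma` is an assumed floor that avoids dividing by a tiny spread.
    """
    flagged = []
    for asset, value in readings.items():
        peers = [v for a, v in readings.items()
                 if a != asset and groups[a] == groups[asset]]
        if len(peers) < 2:
            continue  # not enough neighbors for a meaningful comparison
        sigma = max(pstdev(peers), min_sigma)
        z = abs(value - mean(peers)) / sigma
        if z > z_threshold:
            flagged.append(asset)
    return flagged

print(flag_outliers(READINGS, GROUPS))
```

Here only `T5` is flagged: it runs well above its four northern neighbors, while the southern assets stay within normal spread of one another.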

Data Management

SAP also provides a suite of traditional database management systems which can provide critical infrastructure to

hybrid data lake environments. These include SAP ASE, a full-featured enterprise OLTP database; SAP IQ, a

columnar RDBMS optimized for big data analytics and data warehousing; and SAP SQL Anywhere, an embedded

SQL database management system that supports custom mobile database applications.

HANA EDW Deployment Options

For structured data, SAP HANA Smart Data Access enables businesses to merge data in heterogeneous EDW

landscapes and to access remote data without having to replicate the data to the SAP HANA database first. In

addition to Apache Hadoop, HANA Smart Data Access supports a variety of data sources, including Teradata

database, SAP Sybase ASE, SAP Sybase IQ, and the Intel Distribution for SAP HANA. SAP HANA handles the data like

local tables on the database. Automatic data type conversion makes it possible to map data types from databases

connected via SAP HANA Smart Data Access to SAP HANA data types.
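
The idea of querying remote data alongside local tables, without first copying it, can be shown by analogy with SQLite's `ATTACH DATABASE`. This is not SAP HANA syntax and not the Smart Data Access API; it is a small stand-in using Python's standard `sqlite3` module, with hypothetical table names and data, meant only to show a join that spans two separate databases in one SQL statement.

```python
import sqlite3

def federated_totals():
    """Join a 'local' table with a table in a second, separately attached database."""
    conn = sqlite3.connect(":memory:")                      # stands in for the local database
    conn.execute("ATTACH DATABASE ':memory:' AS remote")    # stands in for a remote source

    conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE remote.orders (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO main.customers VALUES (?, ?)",
                     [(1, "Acme"), (2, "Globex")])
    conn.executemany("INSERT INTO remote.orders VALUES (?, ?)",
                     [(1, 120.0), (1, 80.0), (2, 50.0)])

    # The query addresses both databases as if their tables were local;
    # no rows are copied from remote.orders into main beforehand.
    rows = conn.execute("""
        SELECT c.name, SUM(o.amount)
        FROM main.customers AS c
        JOIN remote.orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
    """).fetchall()
    conn.close()
    return rows

print(federated_totals())
```

In a real federated setup the remote side would be a different engine entirely, which is where automatic data type mapping becomes necessary; the analogy above sidesteps that because both sides are SQLite.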

Figure 6: Hybrid Data Lake environment using structured data and conventional RDBMS / EDW technology

Another option for structured data is SAP HANA Dynamic Tiering, which provides the ability to keep data either in

memory or on disk in a columnar format via SAP IQ, allowing users to assign hot (active) data to in-memory, while

handling warm or cooler data on disk. From the user point of view, HANA tables on disk and in-memory are not

distinguishable and can be queried and modified using standard SQL statements, like any other SAP HANA tables.
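
The hot/warm split can be pictured with a toy model. The sketch below is not the Dynamic Tiering interface; `TieredTable`, its 30-day rule, and the sample rows are hypothetical, and plain Python lists stand in for the in-memory and disk tiers. What it illustrates is the property described above: callers query one table and never need to know which tier holds a row.

```python
from datetime import date

class TieredTable:
    """Toy table that keeps recent rows 'hot' and older rows 'warm'."""

    def __init__(self, hot_days=30):
        self.hot_days = hot_days
        self.hot = []    # stands in for the in-memory tier
        self.warm = []   # stands in for the disk-based columnar tier

    def _is_hot(self, row, today):
        return (today - row["day"]).days <= self.hot_days

    def insert(self, row, today):
        """Place a new row in the tier that matches its age."""
        (self.hot if self._is_hot(row, today) else self.warm).append(row)

    def age_out(self, today):
        """Move rows that are no longer recent from the hot tier to the warm tier."""
        still_hot = [r for r in self.hot if self._is_hot(r, today)]
        self.warm.extend(r for r in self.hot if not self._is_hot(r, today))
        self.hot = still_hot

    def query(self, predicate):
        """Callers see one table; tier placement is invisible to them."""
        return [r for r in self.hot + self.warm if predicate(r)]


table = TieredTable(hot_days=30)
table.insert({"id": 1, "day": date(2015, 1, 1), "amount": 10}, today=date(2015, 3, 1))
table.insert({"id": 2, "day": date(2015, 2, 20), "amount": 20}, today=date(2015, 3, 1))
table.age_out(today=date(2015, 4, 15))
total = sum(r["amount"] for r in table.query(lambda r: True))
print(len(table.hot), len(table.warm), total)
```

A real system would base placement on access patterns or administrator policy rather than a fixed age cutoff; the fixed cutoff is used here only to keep the example short.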

Dynamic Tiering provides a valuable intermediate solution for organizations managing very large sets of structured

data. While in some instances it may not be practicable to put the full dataset into HANA, there is very little

advantage, other than storage costs, to moving such data into Hadoop. Using SAP IQ to offload less active data

enables your business to get the full benefit of a dedicated columnar RDBMS while keeping all administrative and

other processes HANA-centric. Dynamic Tiering provides much faster access to the data in SAP IQ than would be

possible for the same data in Hadoop, and without the administrative overhead of having to perform SQL queries

into Hadoop.

Figure 7: Hybrid Data Lake environment using multi-structured data and Hadoop

HANA Deployments with Hadoop

A hybrid environment enables your business to combine the in-memory processing power of SAP HANA with

Hadoop’s ability to store and process very large datasets without regard to structure. Such a solution can process

massive amounts of data – up to 100 petabytes or more – at a relatively low cost by distributing data processing

via Hadoop to scale across commodity hardware. Combining HANA and Hadoop provides an environment that

leverages the advantages of both technologies, creating a highly dynamic and scalable data ecosystem. Such a

solution can shrink data management costs to a fraction of conventional database total cost of ownership. Perhaps

more importantly, such an environment opens up capability that the business simply did not have before.

The McLaren Group, known for designing and building winning Formula One cars and deploying winning Grand

Prix racing teams, demonstrates such capability with their hybrid HANA and Hadoop environment. The McLaren

race car has over 1,000 sensors, which generate more than 30,000 bytes of data per second. In real time, HANA

enables the driver to evaluate his performance against competitors, and to avoid collisions, while pit mechanics

track component performance and predict failure in order to minimize the number of required pit stops. After the

race, the full set of data stored in the Hadoop data lake enables the team to analyze overall vehicle and driver

performance for improved future performance; that same contextual data enables the manufacturer to perform

sophisticated analysis that drives design changes to subsequent versions of the vehicle.

HANA can connect with Hadoop via Smart Data Access or via SAP Data Services, which provides the option of

pushing processing down to Hadoop via HiveQL or Pig scripts. Via an ETL process, Hadoop data can then be bulk loaded into

an EDW running on HANA.
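
The pattern of pushing work down to the source and then bulk loading the reduced result can be sketched generically. Nothing below uses Hive, Pig, SAP Data Services, or HANA: `RAW_EVENTS` stands in for raw data held in Hadoop, `aggregate_at_source` stands in for a pushed-down query, and an in-memory SQLite database stands in for the warehouse. All names and values are hypothetical.

```python
import sqlite3
from collections import Counter

# Stand-in for raw, fine-grained events stored at the source.
RAW_EVENTS = [
    {"sensor": "s1", "value": 3},
    {"sensor": "s2", "value": 4},
    {"sensor": "s1", "value": 5},
    {"sensor": "s2", "value": 0},
]

def aggregate_at_source(events):
    """Reduce raw events to per-sensor totals where the data lives (the 'pushdown')."""
    totals = Counter()
    for event in events:
        totals[event["sensor"]] += event["value"]
    return sorted(totals.items())

def bulk_load(conn, rows):
    """Load the already-reduced rows into the warehouse table in one batch."""
    conn.execute("CREATE TABLE IF NOT EXISTS sensor_totals (sensor TEXT, total INTEGER)")
    conn.executemany("INSERT INTO sensor_totals VALUES (?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
summary = aggregate_at_source(RAW_EVENTS)
bulk_load(warehouse, summary)
loaded = warehouse.execute(
    "SELECT sensor, total FROM sensor_totals ORDER BY sensor"
).fetchall()
print(f"{len(RAW_EVENTS)} raw events reduced to {len(loaded)} loaded rows: {loaded}")
```

The design point is that only the summary crosses the boundary between systems; the raw events stay where storage is cheap and remain available for later, deeper analysis.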

To bring real-time performance and contextual analysis even closer together, SAP has introduced HANA Vora, a

solution that combines Hadoop with HANA using Spark as a distribution engine, connecting the HANA in-memory

technology with the Hadoop file system. The new solution provides an in-memory, massively distributed data

processing engine within Hadoop to provide simple business-oriented scale-out processing of data. As shown in

Figure 8 below, such an approach will prove particularly effective in the growing number of environments in

which there is no single data lake, but rather a widely distributed set of big data collections. Early testing shows

that this solution greatly improves performance. 9

Figure 8: Vora Integrates HANA and Hadoop

Such a solution is an evolutionary step towards the vision of a fully in-memory solution. Advances that move

contextual analysis and high-performance, high-integrity data management systems closer together serve an

important function. But it is possible that the data lake model, while likely to be with us for a while to come, is not

the end-game.

9 Leukert, Bernd. "Run Simple: Reimagine the Promise at the Heart of Your Business" (May 6, 2015). SAP Sapphire

(Keynote address.) http://goo.gl/r33eu8

Ultimately, SAP HANA has evolved to solve business problems. That is less true for Hadoop, which has evolved

primarily to address technical issues. HANA can integrate every component of the enterprise data ecosystem,

uniting real-time data access and contextual analysis of data in a way that no other solution can. While hybrid data

lake environments are a sound choice for the present and near future, a solution that brings all enterprise data

together in a single real-time environment may prove to be the answer in the long run.

Figure 9: Pure HANA real-time big data environment

CONCLUSION

The dichotomy of the data lake and the enterprise data warehouse reflects an older, deeper rift between the need

to solve business problems and the attraction that new technologies often represent. The importance of Hadoop

to the big data landscape would be difficult to overstate, but it is a mistake to confuse Hadoop with the entire

landscape. An exclusive focus on Hadoop, or on the data lake model, or on any technology or implementation

model, can become a distraction in the face of business challenges.

Ultimately, your organization needs an architecture driven by business need rather than by what technologies you

have (or what new technologies are available). Existing EDW technologies and practices are vital because of the

ongoing problems they address and the integrity they provide for the data ecosystem. Big data technologies and

practices are critically important because of the new opportunities they provide for business insight in an

environment that demands increasingly expanded and accelerated results.

A hybrid approach as outlined in this paper can bridge the gap between the conflicting requirements your

business increasingly faces – e.g., doing more with ever greater volumes of data in ever-diminishing time frames

– and the disparate technologies that provide such capabilities. A structured data warehouse informed by a data

lake can bring transactions and records together with demographic, historical, and other contextual data to

provide new and often completely unexpected insights, allowing you to avoid risks and leverage opportunities that

would have been invisible before. And your business can realize these benefits while maintaining proper data

governance, and while meeting all business and external (e.g. regulatory) requirements. Moreover, when

implemented with the right technologies, a hybrid contextual data lake can help make your business future-ready,

better prepared for the coming merger of the real-time and big data paradigms.