Enterprise Data Management - Data Lake - A Perspective



Enterprise Data Management: A Perspective
From the days of Data Silos and EDW to the present day of Hadoop & the Data Lake

This document discusses the evolution of enterprise data management over the years, the challenges faced by today’s CTOs and chief enterprise architects, and the concept of the Data Lake as a means to tackle those challenges. It also discusses some reference architectures and a recommended toolset in today’s context.

March, 2016

Authors:

Selva Kumar VR

Saurav Mukherjee


Contents

1. The Evolution of Data Management – what led to ‘Data Lake’?
   1.1. Data Silo
   1.2. Enterprise Data Warehouse (EDW)
   1.3. Big Data
   1.4. Hadoop
2. The Challenges of present CTOs
3. Data Lake
   3.1. Key Components of Data Lake
      3.1.1. Storage
      3.1.2. Ingestion
      3.1.3. Inventory & Cataloguing
      3.1.4. Exploration
      3.1.5. Entitlement
      3.1.6. API & User Interface
4. Data Lake – Implementing the Architecture
   4.1. Storage
   4.2. Ingestion
      4.2.1. The Challenges
      4.2.2. Recommendation
   4.3. Inventory, Catalogue & Explore
      4.3.1. Discovery
      4.3.2. Catalog & Visualization
   4.4. Entitlement & Auditing
   4.5. API & User Interface Access
5. Conclusion
6. Bibliography
7. A Few Other Useful References


Figures

Figure 1: Data Management in Silos
Figure 2: Typical EDW Implementation
Figure 3: Typical Data Lake Implementation
Figure 4: Apache Nifi Data Flow View
Figure 5: Apache Nifi Data Provenance View
Figure 6: Nifi - The Power of Provenance
Figure 7: Apache Nifi Stats View

Tables

Table 1: Key Challenges for CTOs/Chief Enterprise Architects of today
Table 2: Data Ingestion Challenges - beyond just the tools


1. The Evolution of Data Management – what led to ‘Data Lake’?

The concept of data management has evolved over the last 30 years around the idea of providing better and timelier analytics to business teams. IT teams have always struggled with the business demand of providing everything ‘in the next minute’ to serve new business ideas.

1.1. Data Silo

Initially, data management systems for analytics were created in silos. This approach helped extract some insights from the organization’s data assets. However, the silos were restricted to individual LOBs (lines of business) and hence were never considered comprehensive. Usually, LOBs sent data to other LOBs as required and requested; in most cases, these were just reports (static & analytical) pulled from application databases.

Figure 1: Data Management in Silos

1.2. Enterprise Data Warehouse (EDW)

To break away from the data silos while still giving LOBs the freedom to create their own data marts, the idea of the Enterprise Data Warehouse (EDW) was adopted widely by the industry. The concept has been researched extensively; a joint paper from HP Labs and Microsoft Research provides a good overview of the concept and approach (Chaudhuri, et al., 1997). All data marts source their data from one central version of the data, thereby maintaining data integrity and consistency at the enterprise level.

Though EDW solved the problem of providing an enterprise-level view of data to all business teams to a certain extent, answering questions or providing the necessary data to business teams within a minute of a new business idea still remained a cherished but elusive dream for IT & business teams. Also, the ‘one version fits all’ idea did not go down well with every group in the organization, and the culture of business analysts downloading data from the EDW into Microsoft Excel spreadsheets or Microsoft Access and merging it with source data continued to be widely followed.

The EDW architecture also posed numerous technical challenges. A few are listed below.

Cost

Licensing cost (Database licenses, ETL tools etc.)

Storage cost

Ridiculously long lead times before database schemas could be created as per standards, followed in turn by long ETL development cycles



Every post-production fix involved a long and repetitive development cycle

Complicated designs

Need for highly skilled labor force

Figure 2: Typical EDW Implementation

1.3. Big Data

In the meanwhile, technology evangelists like Google, Netflix, Amazon, Facebook, Twitter, advanced oil drill equipment manufacturers, space companies etc. injected new types of problems into the data space, around data type and volume. It was no longer a purely structured-data world: it now involved unstructured data like videos, social text streams, sensor data, data streams from IoT devices etc. These data types can neither be accommodated in a traditional database, nor is their scale as easily manageable as that of structured data. In addition to the data volume, the variety and velocity of data flow had to be tackled together to derive business advantage, and faster than the competition. These new-generation companies also created applications that are distributed from the ground up; new distributed file systems, new distributed processing applications etc. were required to handle the volume and the velocity. Papers from companies like Google (Chang, et al., 2006) (Dean, et al., 2004) (Ghemawat, et al., 2003) and Amazon (DeCandia, et al., 2007) offer detailed discussions of this topic. The dimensions of volume, variety and velocity gave birth to what came to be known as ‘Big Data’1.

1 Over time, a couple more V’s – veracity & volatility – got attributed to Big Data.



1.4. Hadoop

Doug Cutting, now Chief Architect at Cloudera, adopted the distributed-systems idea and created Hadoop, inspired by and modeled on Google’s high-volume data processing systems. Hadoop is open source and relies on the concept of bulk commodity hardware. It solves the cost issue (licensing cost, storage cost) and the data variety issue.

Over time, a new ecosystem was created around HDFS (Hadoop Distributed File System). It generated new efficiencies for data architecture through optimization of data processing workloads such as data transformation and integration, while simultaneously lowering the cost of storage. Ideas like flexible ‘schema-on-read’ access to all enterprise data started taking shape, allowing teams to circumvent long database schema design and long ETL development cycles.

Though Hadoop potentially solves the data storage problem, it suffers from high latency for data retrieval (batch processing). The latency issue led to new ways of storing & retrieving data in the form of NoSQL databases, e.g., Apache HBase and Apache Cassandra – inspired by Amazon (DeCandia, et al., 2007) – and to better processing engines like Spark (Zaharia, et al., 2012) (Zaharia, et al., 2010) (Zaharia, et al., 2012) and Flink (Apache Software Foundation, 2015). However, NoSQL databases have their own challenges, like complicated table designs and joins not working as well as in a traditional RDBMS.

This landed the industry at the juncture of a good infrastructure framework, low-cost open source tools (e.g., storage tools like HDFS, NoSQL databases like MongoDB, HBase, Cassandra, Memcached etc., data processing tools like Spark, MapReduce, Pig, Hive, Flink, Nifi etc., message brokering tools like Kafka (Kreps, et al.), RabbitMQ etc.) and, of course, the existing high-cost enterprise toolsets & easy-access storage (i.e., RDBMSs like Oracle, DB2, SQL Server etc., Massively Parallel Processing (MPP) tools like Teradata, Impala etc., processing tools like AbInitio, Informatica, DataStage etc.).

Along the way, the revolution called open source added significant value to the technology community. It facilitated the creation of a lot of start-ups, encouraged new ideas and, of course, added a lot of chaos. Each of these tools (whether low cost or high cost) is focused on solving a specific use case, and every other month new open source products get released. For an enterprise CTO or architect, it is therefore really challenging to identify sustainable open source solutions that also solve multiple use cases instead of specific ones. Here came the open source bundling companies, e.g., Cloudera, Hortonworks, MapR etc. They took ownership of identifying software that is good and sustainable, and of managing tools that go through very frequent releases for improved versions. This solved the basic adoption problem of the open source ecosystem in an enterprise to a good extent. There have been differences in the selection of tools among the open source bundling companies and, of course, it is purely left to the enterprise’s use cases to decide which one to go for.

Once the new ecosystem (based largely on open source solutions) got stabilized, the next challenge was to adopt a suitable methodology for application development and maintenance. Adoption of the open source ecosystem also mandated the replacement of all or some of the well-accepted traditional enterprise software and tools, and such replacement entails its own share of risks.

Also, there are no widely practiced and adopted industry standards for open-source-based enterprise data management solutions. Most advanced business analysts still rely on power tools like SQL, metadata management repositories etc. to infer business insights, and Hadoop lacks the flexibility of data extraction using SQL at a similar speed. On top of that, there have been the challenges of dealing with regulations, preventing data from falling into the wrong hands, auditing etc.

2. The Challenges of present CTOs

The previous section discussed the evolution of data management, the multidimensional challenges it posed, and the difficulty of identifying a proper adoption framework or architecture that can be widely used, standardized and easily adopted by enterprises. CTOs and architects would be better served by a reference architecture or framework that minimizes the risks involved; the few exceptional use cases that do not fit well into such a framework or architecture can be handled separately.

Before delving deep into the adoption framework or architecture, here is a quick summary of the critical challenges from an enterprise data management perspective, as an evolution from the EDW era.

# Description

1 Provide low-cost storage and processing. Accommodate any data type.

2 Provide a consolidated view of enterprise data to empower business teams to pull all required information the minute a new business idea pops up.

3 Provide a consolidated view of enterprise data and the flexibility of ad hoc reporting on any data element in the enterprise to business analysts.

4 Provide metadata cataloguing and a search facility for metadata.

5 Store data in its original raw form to guarantee data fidelity.

6 Provide entitlement management features that take care of regulation, authorization, authentication, encryption, data masking, auditing etc.

7 Leverage existing licensed tools for use cases / problems which open source systems cannot solve.

8 Maintain existing good features like fast data extraction using SQL for analysis, and add new features that significantly reduce the latency in creating advanced analytical applications like machine learning.

9 Provide data access to external & internal teams based on entitlement.

10 Provide enterprise data elements in raw form to a new category of analysts, called data scientists.

11 Select technologies that minimize tool replacement costs and keep up with technology trends to keep the enterprise competitive.

12 Integrate data profiling and data quality results into the metadata management framework.

Table 1: Key Challenges for CTOs/Chief Enterprise Architects of today


3. Data Lake

The ‘Data Lake’ came across as the next key concept in the data management area and was primarily conceptualized to tackle the challenges mentioned in the section above – The Challenges of present CTOs. It is more of an architectural concept and may be defined as a “repository of enterprise-wide, large quantities and variety of data elements, both structured and unstructured, in raw form.”

This definition is based purely on the insights from multiple data management implementations in Hadoop environments, identifying their challenges and coming up with an architecture to solve them. However, a repository alone will not suffice to meet the challenges mentioned in Table 1; it requires supporting components to deliver the benefits.

3.1. Key Components of Data Lake

The Data Lake architecture involves some mandatory components (listed below) to make an implementation successful.

3.1.1. Storage

Low cost

Store raw data from different input sources

Support any data type

High durability

3.1.2. Ingestion

Facilitate both batch & streaming ingestion frameworks

Offer low latency

3.1.3. Inventory & Cataloguing

Discover metadata and generate tags

Discover lineage information

Manage tags

3.1.4. Exploration

Browse / Search Inventory

Inspect Data Quality

Tag Data Quality attributes

Auditing

3.1.5. Entitlement

Identity & Access Management

o Authentication, Authorization, Encryption, Quotas, Data Masking

3.1.6. API & User Interface

Expose search API

Expose Data Lake to customers using API & SQL interfaces, based on entitlements and access rights


4. Data Lake – Implementing the Architecture

The components shown in Figure 3 (below) are the minimal requirements for implementing a Data Lake. Hadoop (HDFS) can accommodate application storage as well, and such applications can also leverage the Data Lake’s built-in framework components like Catalogue, Data Quality, Search & Entitlements.

4.1. Storage

The primary requirements for storage are low cost, the ability to accommodate high volumes, and high durability. Storage should also be able to accommodate any data type. Current technology trends suggest that HDFS, MapR-FS and Amazon S3 suit the need; even though they have different underlying implementations, they all adhere to Hadoop standards (i.e., they can be addressed through the Hadoop FileSystem API).
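Because of that shared abstraction, a landing or utility job can be written once against the FileSystem API and pointed at HDFS, MapR-FS or S3 purely through configuration. A minimal Java sketch follows; the lake paths, sample content and default URI are illustrative assumptions, not part of any product:

```java
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LandRawFile {
    public static void main(String[] args) throws Exception {
        // The same code works against "hdfs://...", "maprfs://..." or "s3a://..."
        // URIs; only the URI scheme (and connector configuration) changes.
        // "file:///" is used here so the sketch runs without a cluster.
        URI lakeUri = URI.create(args.length > 0 ? args[0] : "file:///");
        Configuration conf = new Configuration();

        FileSystem fs = FileSystem.get(lakeUri, conf);
        Path target = new Path("/tmp/datalake/raw/lob1/orders/orders-2016-03-01.csv");

        // Land a raw file in the lake's landing zone, untouched and uninterpreted
        try (OutputStream out = fs.create(target)) {
            out.write("order_id,customer_id,amount\n".getBytes("UTF-8"));
        }
    }
}
```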

Along with storing data in a distributed file system, it is a good idea to identify suitable storage formats for the different data types, as below.

Unstructured data

o Store in native file formats (logs, dump files, videos etc.)

o Compress with a streaming codec (LZO, Snappy)

Semi-structured data – JSON, XML files

o Good to store in schema-aware formats, e.g., Avro. Avro allows versioning & extensibility, like adding new fields.

Structured data

o Flat records (CSV or some other field-separated format)

o Avro or columnar storage (Parquet)
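As a sketch of how such formats might be produced (using the Spark Java API; all paths are hypothetical), raw JSON landed in the lake can be converted into a columnar Parquet copy for analytical access, while the raw files stay untouched for fidelity:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class RawToParquet {
    public static void main(String[] args) {
        // Hypothetical landing and curated zone paths inside the Data Lake
        String rawJsonPath = "hdfs:///datalake/raw/lob1/events/";
        String parquetPath = "hdfs:///datalake/curated/lob1/events/";

        SparkSession spark = SparkSession.builder()
                .appName("raw-json-to-parquet")
                .getOrCreate();

        // Schema-on-read: Spark infers the schema from the semi-structured JSON
        Dataset<Row> events = spark.read().json(rawJsonPath);

        // Persist a columnar copy for analytical access; the raw files stay as-is
        events.write().mode(SaveMode.Append).parquet(parquetPath);

        spark.stop();
    }
}
```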

Figure 3: Typical Data Lake Implementation – LOB data sources feed batch and streaming ingestion into the Data Lake; automated inventory, catalogue & tagging, data quality tagging, search & explore, and entitlement frameworks surround it; access is provided through RDBMS/NoSQL, search (Solr/Elasticsearch) and API layers.


A storage life cycle policy can also be defined. There are many open source tools, like Apache Falcon (Apache Software Foundation, 2016), that operate based on pre-defined policies. A data directory structure can be defined to segregate data based on the life cycle policy – e.g., the latest data, data up to 7 years old as required by regulations, data older than 7 years etc.

4.2. Ingestion

Ingestion is the first piece of the puzzle that needs to be put in place after setting up the storage. This involves setting up an ingestion framework that handles both batch and streaming data. Looking at the current trends in data processing tools, the next generation of technologies may well treat batch processing as legacy: better processing tools (e.g., Spark (near real-time), Flink (real-time) etc.) are promoting batch as a special case of streams. The complexity of good stream processing depends on the use cases; O’Reilly offers an in-depth discussion of streaming and going beyond batch (Akidau, 2015) (Akidau, 2016).
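To make the ‘batch as streams’ idea concrete, here is a minimal micro-batching sketch using the Spark Streaming Java API; the source host, port, batch interval and target path are placeholder assumptions, not recommendations:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MicroBatchIngest {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("micro-batch-ingest");

        // Every 10-second window is handled as a small batch (micro-batching),
        // which is how "batch as streams" engines blur the batch/stream line.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Placeholder source; a production flow would typically read from Kafka,
        // Flume, NiFi site-to-site, etc.
        JavaDStream<String> lines = jssc.socketTextStream("ingest-host", 9999);

        // Land each non-empty micro-batch in the raw zone, one directory per interval
        lines.foreachRDD((rdd, time) -> {
            if (!rdd.isEmpty()) {
                rdd.saveAsTextFile("/datalake/raw/stream/events-" + time.milliseconds());
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```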

4.2.1. The Challenges

However, having advanced processing tools alone is not enough to ensure proper ingestion. The following table summarizes a few further challenges that need to be circumvented.

# Description

1 Making use of advanced processing tools requires highly skilled resources in good numbers.

2 Traditional data processing engineers widely use GUI-based ETL tools (like AbInitio, Informatica etc.) that use data-flow programming techniques. For those engineers, coding applications with open source processing tools (like Spark, Flink etc.) still takes considerable development and testing time.

3 Due to the nature of the open source ecosystem, there will always be a new processing tool that outruns the benefits of the current toolset and offers a business advantage over the enterprise’s competition. This requires easy and quick adoptability, which may be a big challenge.

4 Data processing tools are good at processing data. However, an ingestion framework also needs to go beyond that and solve challenges like:

Low latency & guaranteed data delivery

Handling back pressure

Data provenance (tracking data all the way from the data source)

Customizability

Quick implementation and a better UI for the operations team

Supporting the wide variety of protocols used for sending/receiving data (e.g., SSL, SSH, HTTPS, other encrypted content etc.)

Loading data into a wide number of destinations (HDFS, Spark, MapR-FS, S3, RDBMS, NoSQL etc.)

5 From an enterprise perspective, it is often desired to have the same tools used across the enterprise for any application that requires data push/pull. However, zeroing in on ‘the one toolset’ is always challenging.

Table 2: Data Ingestion Challenges - beyond just the tools

4.2.2. Recommendation

Based on tool evaluation research, two tools may be recommended to handle the ingestion problems and quick adoptability challenges:


4.2.2.1. Apache Nifi

Apache Nifi (Apache Software Foundation, 2015) is one of the best open source data-flow programming tools, and it fits the bill for most data push/pull use cases. Just to get the uninitiated excited about it, here are a few Nifi snapshots:

Figure 4: Apache Nifi Data Flow View

Figure 5: Apache Nifi Data Provenance View

Figure 6: Nifi - The Power of Provenance


Figure 7: Apache Nifi Stats View

Nifi can be used as a full-fledged ETL tool and does support most ETL features. However, Nifi still positions itself as a simple event processing and data provenance tool. With continued open source support, Nifi may well be transformed into a full-fledged ETL tool.

4.2.2.2. Cascading

To deal with the quick adaptability part, it is a good idea to have wrapper technologies. They allow the code to be written once while the processing engine underneath is changed based on the latest trends or best fit. Our research recommends Cascading (Driven, Inc., 2015) as a good candidate here. At present, Cascading supports multiple processing engines underneath (Spark, MapReduce, Flink etc.).

Cascading supports development in Java and Scala, and it allows the business logic to be developed separately from the integration logic. Complete applications may be developed, and unit tests written, without touching a single Hadoop API. This provides the degrees of freedom to move easily through the application development life cycle and to deal separately with integrating existing systems. Cascading provides a rich API that allows thinking in terms of data and business problems, with capabilities such as sort, average, filter, merge etc. The computation engine and process planner convert the business logic into efficient parallel jobs, delivering the optimal plan at run time to the computation fabric of choice.

In simple terms, Cascading may be considered the plumbing for building pipelines: it provides sources, sinks, traps, connections etc., and it is just a matter of plugging them together to build business logic without bothering about whether the code will run on MapReduce, Spark or Flink. This style is famously known as a pattern language. Developers can develop all the way up to unit testing without touching Hadoop or any processing engine. From a technology category perspective, it is middleware for designing workflows.
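As a flavour of what this looks like, below is a minimal Cascading sketch (Java, Cascading 2.x-style API; the paths, field names and filter are illustrative assumptions). The business logic is expressed purely in terms of pipes and fields, and only the connector at the end binds it to a particular computation fabric:

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class ErrorEventFilter {
    public static void main(String[] args) {
        // Hypothetical source and sink locations in the Data Lake
        Tap source = new Hfs(new TextDelimited(new Fields("ts", "level", "message"), "\t"),
                             "hdfs:///datalake/raw/lob1/app-logs/");
        Tap sink   = new Hfs(new TextDelimited(new Fields("ts", "level", "message"), "\t"),
                             "hdfs:///datalake/curated/lob1/error-events/");

        // Business logic: keep only ERROR records; no Hadoop API is touched here
        Pipe pipe = new Pipe("error-events");
        pipe = new Each(pipe, new Fields("level"), new RegexFilter("ERROR"));

        FlowDef flowDef = FlowDef.flowDef()
                .setName("error-event-filter")
                .addSource(pipe, source)
                .addTailSink(pipe, sink);

        Properties properties = new Properties();
        AppProps.setApplicationJarClass(properties, ErrorEventFilter.class);

        // Only this connector ties the pipe assembly to a computation fabric
        Flow flow = new HadoopFlowConnector(properties).connect(flowDef);
        flow.complete();
    }
}
```

Retargeting the same flow at a different fabric would, in principle, only mean swapping the flow connector and its planner dependency; the pipe assembly itself stays untouched.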


4.3. Inventory, Catalogue & Explore

The Data Lake storage and ingestion ideas mentioned above solve a few of the challenges from Table 1, such as low-cost storage, low-cost processing, near real-time sync with the data sources and keeping data in raw form to maintain fidelity. Enterprise data pushed into the Data Lake in raw form gives business analysts and data scientists the flexibility to pull any enterprise data element as required, without waiting for long ETL development and data modeling exercises to complete, while streaming ingestion keeps the Data Lake in sync with the data sources in as near real time as possible.

However, enterprise data in raw format can be huge, and finding anything in it would be like looking for a needle in a haystack for a data scientist or any other user. This mandates a self-data-service framework to be built for data discovery (Inventory), data preparation (Catalogue) and data visualization (Explore).

4.3.1. Discovery

The first step in data discovery is to provide a metadata framework (a sub-component of the self-data-service framework) to capture business metadata, technical metadata and operational metadata. This process needs to be automated to handle the sheer volume of files loaded into the Data Lake. Even though in theory the Data Lake talks about data availability to everyone, there are constraints in the form of entitlements which need to be put in place for data governance purposes.

The metadata framework should also have features to create the important data lineage information as part of the ingestion frameworks. This enables lineage all the way from the data source to the Data Lake.
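Purely as an illustration of what such a framework captures (a hypothetical sketch; none of these field names come from any particular product), a catalogue entry for one ingested dataset might carry the three metadata categories plus lineage:

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical catalogue entry for one dataset landed in the Data Lake.
 * Field names are illustrative only, not tied to any particular product.
 */
public class CatalogEntry {
    // Technical metadata, typically harvested automatically at ingestion time
    String path;                 // e.g. hdfs:///datalake/raw/lob1/events/
    String format;               // e.g. avro, parquet, csv
    long sizeBytes;
    Instant ingestedAt;

    // Business metadata, usually supplied or confirmed by data stewards
    String businessName;         // e.g. "Customer click events"
    String owningLob;
    List<String> tags;           // searchable tags, incl. data quality tags

    // Operational metadata and lineage back to the source system
    String sourceSystem;
    List<String> upstreamDatasets;       // datasets this one was derived from
    Map<String, String> qualityMetrics;  // e.g. null rate per column
}
```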

4.3.2. Catalog & Visualization

Once metadata (business, technical & operational) has been captured for the raw data provided by the data sources, it may be used as a catalog, and a UI may then be used to explore this metadata. Along with the metadata, data profiling abilities & data quality metrics for all data pushed into the Data Lake are really valuable and desirable in this context.

Most of the available frameworks are tag based: they identify and mark metadata, profiling metrics & quality metrics, and they come with built-in CRUD, Query or Analytics APIs for handling metadata management.

This area is fairly new to the industry, and only a few vendors provide a data self-service framework. Below is a list of such vendors and their products.

Cloudera (Cloudera, 2016) – Cloudera Navigator (not open source, license based).

Waterline Data (Waterline Data, Inc., 2016) – Independent organization; integrates with any Hadoop distribution.

Hortonworks (Hortonworks Inc., 2016) – Apache Atlas, still in incubation. However, a limited-feature version has been added to the HDP 2.3 release. Hortonworks has also actively partnered with Waterline Data.

4.4. Entitlement & Auditing

Entitlement is one of the primary pieces of data governance. Generally, data governance has a few mandatory components: Data Profiling, Data Quality, Entitlement and Auditing. The main goal of governance is to facilitate easy & secured data accessibility along with reliability of the data (profiling & data quality measures). The previous section discussed profiling & data quality; this section focuses on entitlements and auditing.

Entitlement & auditing cover a wide range of activities, like:

o Authentication

o Authorization

o Encryption

o Auditing

o Data Masking

o Data Field Level Authorization

Almost all Hadoop distribution vendors use Kerberos as the authentication protocol. MapR uses a proprietary authentication tool, which follows a similar approach to Kerberos.

For authorization, data masking & data field level authorization, the Hadoop distribution vendors use different toolsets. Cloudera uses Sentry & Cloudera Navigator. Hortonworks uses Apache Ranger & Apache Knox. MapR uses its proprietary ACE (Access Control Expressions), which provides better flexibility than ACLs (Access Control Lists); ACLs themselves are supported by all vendors.

All vendors offer encryption for at-rest and in-transit data. The approaches taken for management of the keys used for encryption/decryption are quite proprietary.

There are multiple open source projects in the Hadoop security area. A few such projects are listed below.

Apache Knox (Apache Software Foundation, 2016): A REST API gateway that provides a single access point for all REST interactions with Hadoop clusters.

Apache Sentry (Apache Incubator): A modular system for providing role-based authorization for both data and metadata stored in HDFS. Sentry project is primarily led by Cloudera, one of the best-known Hadoop distributors.

Apache Ranger (Hortonworks, Inc., 2016): A centralized environment for administering and managing security policies across the Hadoop ecosystem. This project is led by Hortonworks, another well-known Hadoop distributor, and includes technology that it gained when it acquired XA Secure in mid-2014 (Hortonworks, Inc., 2014).

Apache Falcon (Apache Software Foundation, 2016): A data governance engine that allows administrators to define and schedule data management and governance policies across the Hadoop environment. Section 4.1 also discusses this.

Project Rhino (Williams, 2013): Creates encryption and key management capabilities and a common authorization framework across Hadoop projects and subprojects (TechTarget). This project is led by Intel.

Most of these security tools come built in with the different Hadoop bundling vendors’ distributions.

4.5. API & User Interface Access

To provide easy and secure access, it is recommended to allow controlled access to the Data Lake either through an API or through interactive SQL. This in turn enforces the built-in entitlements discussed in the sections above.


A wide range of tools is available for API management and SQL (Spark SQL, Flink SQL, Impala etc.). Even with all these tools available, data access might not be as fast as with RDBMS tools; this is a case in point for leveraging existing enterprise tools.
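As a small illustration of the interactive SQL style of access, the sketch below uses the Spark SQL Java API; the view name, path and columns are hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LakeSqlAccess {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("lake-sql-access")
                .getOrCreate();

        // Hypothetical curated zone holding columnar (Parquet) data
        Dataset<Row> events = spark.read().parquet("hdfs:///datalake/curated/lob1/events/");
        events.createOrReplaceTempView("events");

        // Ad hoc analyst query; entitlement checks would sit in front of this layer
        Dataset<Row> errorsPerDay = spark.sql(
                "SELECT to_date(ts) AS day, count(*) AS errors " +
                "FROM events WHERE level = 'ERROR' GROUP BY to_date(ts) ORDER BY day");
        errorsPerDay.show();

        spark.stop();
    }
}
```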

As mentioned earlier, most of the framework set up for the Data Lake can be reused for other use cases. If data cleansing & standardization has to be done, it can be run in the Hadoop environment using data processing tools like MapReduce, Cascading, Spark, Flink etc., and the HDFS environment can be segmented to hold cleansed, standardized and aggregated information. The standardized version of the data may also be pushed to the existing EDW. This approach moves the complete ETL from the EDW to the Hadoop environment, thus minimizing processing and licensing costs; keeping the highly granular data in Hadoop also reduces RDBMS storage cost.

5. Conclusion

The Data Lake provides an architectural approach with an embedded governance model. It helps data management teams implement a variety of solutions using cost-effective storage, efficient processing engines and self-data-service features. Teams implementing a Data Lake need to pay close attention to defining metadata for all types of data objects ingested into the Data Lake: metadata plays the key role in exposing self-data-service flexibility to analysts, data scientists and other users, and it is a key component for defining entitlements.


6. Bibliography

Akidau, Tyler. The world beyond batch: Streaming 101 [Online]. O'Reilly Media, Aug 05, 2015. Accessed Mar 08, 2016. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101.

Akidau, Tyler. The world beyond batch: Streaming 102 [Online]. O'Reilly Media, Jan 20, 2016. Accessed Mar 08, 2016. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102.

Apache Incubator. Apache Sentry (incubating) [Online]. Apache Software Foundation. Accessed Mar 09, 2016. https://sentry.incubator.apache.org/.

Apache Software Foundation. Apache Flink: Scalable Batch and Stream Data Processing [Online]. 2015. Accessed Mar 07, 2016. http://flink.apache.org/.

Apache Software Foundation. Apache Nifi [Online]. 2015. Accessed Mar 08, 2016. https://nifi.apache.org/.

Apache Software Foundation. Falcon – Feed Management & Data Processing Platform [Online]. Feb 15, 2016. Accessed Mar 08, 2016. https://falcon.apache.org/.

Apache Software Foundation. Knox Gateway – REST API Gateway for the Hadoop Ecosystem [Online]. Mar 01, 2016. Accessed Mar 09, 2016. https://knox.apache.org/.

Chang, Fay, et al. Bigtable: A Distributed Storage System for Structured Data [Online]. Google, 2006. Accessed Mar 15, 2016. http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf.

Chaudhuri, Surajit and Dayal, Umeshwar. An Overview of Data Warehousing and OLAP Technology [Online]. Microsoft Research, Mar 1997. Accessed Mar 03, 2016. http://research.microsoft.com/pubs/76058/sigrecord.pdf.

Cloudera. Cloudera [Online]. 2016. Accessed Mar 08, 2016. https://cloudera.com/.

Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters [Online]. Google, 2004. Accessed Mar 15, 2016. http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf.

DeCandia, Giuseppe, et al. Dynamo: Amazon’s Highly Available Key-value Store [Online]. Amazon, 2007. Accessed Mar 15, 2016. http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf.

Driven, Inc. Cascading | Application Platform for Enterprise Big Data [Online]. Sep 2015. Accessed Mar 08, 2016. http://www.cascading.org/.

Ghemawat, Sanjay, Gobioff, Howard and Leung, Shun-Tak. The Google File System [Online]. Google, 2003. Accessed Mar 15, 2016. http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf.

Hortonworks Inc. Hortonworks: Open and Connected Data Platforms [Online]. 2016. Accessed Mar 08, 2016. http://hortonworks.com/.

Hortonworks, Inc. Apache Ranger [Online]. 2016. Accessed Mar 09, 2016. http://hortonworks.com/hadoop/ranger/.

Hortonworks, Inc. Hortonworks Acquires XA Secure [Online]. May 15, 2014. Accessed Mar 09, 2016. http://hortonworks.com/press-releases/hortonworks-acquires-xa-secure/.

Kreps, Jay, Narkhede, Neha and Rao, Jun. Kafka: a Distributed Messaging System for Log Processing [Online]. LinkedIn Corp. Accessed Mar 07, 2016. http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf.

TechTarget. Managing Hadoop projects: What you need to know to succeed [Online]. Accessed Mar 09, 2016. http://searchdatamanagement.techtarget.com/essentialguide/Managing-Hadoop-projects-What-you-need-to-know-to-succeed.

Waterline Data, Inc. Waterline Data | Find, understand, and govern data in Hadoop [Online]. 2016. Accessed Mar 09, 2016. http://www.waterlinedata.com/.

Williams, Alex. Intel Launches Hadoop Distribution And Project Rhino, An Effort To Bring Better Security To Big Data [Online]. TechCrunch, Feb 26, 2013. Accessed Mar 09, 2016. http://techcrunch.com/2013/02/26/intel-launches-hadoop-distribution-and-project-rhino-an-effort-to-bring-better-security-to-big-data/.

Zaharia, Matei, et al. Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters [Online]. University of California, Berkeley, 2012. Accessed Mar 07, 2016. https://people.csail.mit.edu/matei/papers/2012/hotcloud_spark_streaming.pdf.

Zaharia, Matei, et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing [Online]. University of California, Berkeley, 2012. Accessed Mar 15, 2016. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf.

Zaharia, Matei, et al. Spark: Cluster Computing with Working Sets [Online]. University of California, Berkeley, 2010. Accessed Mar 15, 2016. http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf.


7. A Few Other Useful References

Data Lake References

Hortonworks & Teradata Paper - Data Lake

Amazon’s experience on Data Lake - Data Lake Implementation Guidelines

Knowledgent Reference - Data Lake Design

Waterline Data - Self Data Service

Flink: A new breed in processing tool

Flink Streaming & Batching in One Engine

Data Security

Cloudera Security – Paper on Hadoop Security

Cloudera reference on Hadoop Encryption - Encryption in Cloudera

Hortonworks - Data Governance