Redefine Big Data: EMC Data Lake in Action
Andrea Prosperi – Systems Engineer

© Copyright 2014 EMC Corporation. All rights reserved.

Agenda

• Data Analytics Today

• Big data

• Hadoop & HDFS

• Different types of analytics

• Data lakes

• EMC Solutions for Data Lakes

The world before big data

Data warehousing. Research and the definition of dimensions and facts started in the 1960s. Things really got going in the 1980s.

So what changed?

Big data rocked up to the party.

Traditional solutions struggled

• Too much data

• No Real Time analysis

• No Data Exploration

• More expensive hardware to go faster and deeper

• Overnight batch not good enough

• Not just structured data in a star schema

Thankfully we had Google

Cue Doug Cutting’s son and his elephant, Hadoop…

• Computation Tier uses a framework called MapReduce

• Storage is provided via a distributed filesystem called HDFS

• Hadoop runs on commodity hardware
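
To make the MapReduce bullet above concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word counting as the job. It is plain Python rather than the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are illustrative assumptions, and a real cluster would run the map and reduce steps in parallel over HDFS blocks.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence (hypothetical driver)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Each string stands in for one block of input data.
    print(word_count(["big data big lake", "data lake"]))
```

Because each map call sees only its own line and each reduce call sees only one key, both steps can be spread across many machines, which is what lets the approach run on commodity hardware.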

All analytics aren’t equal

Descriptive, Predictive and Prescriptive. There is also Diagnostic.

[Figure: analytics questions arranged by Degree of Complexity and Competitive Advantage, in three bands: Descriptive, Predictive, Prescriptive]

• What exactly is the problem?
• How many, how often, where?
• What happened?
• What will happen next if?
• What if these trends continue?
• What could happen?
• What actions are needed?
• How can we achieve the best outcome including the effects of variability?
• How can we achieve the best outcome?

Source: Based on "Competing on Analytics," Davenport and Harris
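
As a rough illustration of how the three kinds of analytics differ, the sketch below applies each to the same small series. Everything in it is an assumption made for the example: the sales figures, the function names, the least-squares trend used for prediction, and the cost weights used for the prescriptive step. It is not taken from the slides or from Davenport and Harris.

```python
from statistics import mean


def describe(history):
    """Descriptive: summarise what happened."""
    return {"total": sum(history), "average": mean(history)}


def predict_next(history):
    """Predictive: extend a least-squares linear trend by one period.

    Needs at least two observations.
    """
    n = len(history)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(history)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, history))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar + slope * (n - x_bar)


def prescribe(forecast, stock_options, shortage_cost=3.0, overstock_cost=1.0):
    """Prescriptive: choose the stock level with the lowest expected cost."""
    def cost(stock):
        return (shortage_cost * max(0, forecast - stock)
                + overstock_cost * max(0, stock - forecast))
    return min(stock_options, key=cost)


if __name__ == "__main__":
    sales = [100, 110, 120, 130]          # hypothetical monthly sales
    print(describe(sales))
    forecast = predict_next(sales)
    print(forecast)
    print(prescribe(forecast, [120, 140, 160]))
```

The descriptive step only looks back, the predictive step extrapolates, and the prescriptive step turns the forecast into a recommended action, which is where the added complexity comes from.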

Descriptive Analytics

Predictive Analytics

Prescriptive Analytics

Data lakes

Today, think of it in terms of co-existence with the Enterprise DWH. Both environments are valid.

[Figure: Enterprise DWH alongside a Hadoop Based Data Lake. Labels: Structured Data; Semi-structured & Unstructured Data; CRM, ERP, OLTP DB; ETL/ELT; Data Transformation; Data Security, Backup; Analyze & Report; Client/Portal Devices]

What is a Data Lake?

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. *James Dixon, coiner of “Data Lake” term

Pragmatic approach to Data Lake

Identify Domain

Be Pragmatic/Start Small

Build Lake infrastructure

Fill Lake

Build Fishing Poles, exploration, extract value, then expand

Data Lake Interaction

3 main levels of interaction:
• Real Time: for fast analysis and correlation
• Interactive: for transactional processing
• Batch: for large dataset analysis

EMC Solutions for Data Lake Infrastructure

[Figure: Lake Infrastructure. Labels: BATCH, INTERACTIVE, REAL-TIME; Isilon, VNX, Commodity, ECS, DSSD; ViPR Controller, ViPR Services; EMC Big Data Storage]

Build Lake Infrastructure

• Be Fast
– Reuse your current infrastructure to build an HDFS repository
• Reduce risk
– Reduce CAPEX investment required to perform analytics
– Maintain data protection, compliance at array level
• Reduce cost and complexity of dedicated clusters
– Reduce need for new vendor nodes and storage capacity

Use General Purpose Arrays/Commodity Disks As Data Lake Store

[Figure labels: Commodity, 3rd Party, VNX, ViPR Data Services]

Build Lake Infrastructure

• ViPR Object & ViPR HDFS access on the same data
– S3, Swift, Atmos API via the Object head
– File protocols in development
• Use your preferred Hadoop distribution

Object, File And HDFS Operations On The Same Data

[Figure labels: Virtual Array, Commodity, Object, HDFS, Object & HDFS]

Build Lake Infrastructure

Use Specialized Arrays As Data Lake Store

ECS Appliance
• Hyper-scale:
– ECS supports unlimited applications and users on a single, scale-out architecture
– Start at 360 TB and scale to multiple petabytes or even exabytes
• Pre-Engineered and Pre-Built
• Commodity Hardware
• Structured and Unstructured Content

[Figure label: 3rd platform applications]

Build Lake Infrastructure

• Accelerate the benefits of Hadoop for the enterprise
– Proven Hadoop solution, faster implementation
– Greater interoperability with enterprise applications and Hadoop analytics through multi-protocol parallel access from any client
• Enterprise data protection
– Fast snapshots, backup, and recovery
– Simple, reliable data replication for disaster recovery
• Ultimate flexibility
– Scale compute and storage resources separately
– Supports physical and virtualized server environments

Use Specialized Arrays As Data Lake Store

EMC/Pivotal Solutions for Data Lake Software

[Figure: Lake Software]
• BATCH
• INTERACTIVE: HAWQ, Greenplum DB
• REAL-TIME: GemFire XD
• Unlimited Pivotal HD

Pivotal HD Architecture - Apache

[Figure: Apache components]
• HDFS
• HBase
• Pig, Hive, Mahout
• MapReduce
• Sqoop, Flume
• Resource Management & Workflow: YARN, ZooKeeper

HAWQ - Full ANSI SQL Engine on Hadoop

[Figure: Apache and Pivotal components]

Apache
• HDFS
• HBase
• Pig, Hive, Mahout
• MapReduce
• Sqoop, Flume
• Resource Management & Workflow: YARN, ZooKeeper

Pivotal
• Command Center: Configure, Deploy, Monitor, Manage
• Data Loader
• Spring
• Unified Storage Service
• Hadoop Virtualization Extension
• HAWQ – Advanced Database Services: Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining, ANSI SQL + Analytics
• MADlib Algorithms

GemFire - Real-Time Data Service

[Figure: Apache and Pivotal components, with GemFire XD added]

Apache
• HDFS
• HBase
• Pig, Hive, Mahout
• MapReduce
• Sqoop, Flume
• Resource Management & Workflow: YARN, ZooKeeper

Pivotal
• Command Center: Configure, Deploy, Monitor, Manage
• Data Loader
• Spring
• Unified Storage Service
• Hadoop Virtualization Extension
• HAWQ – Advanced Database Services: Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining, ANSI SQL + Analytics
• GemFire XD – Real-Time Database Services: Distributed In-memory Store, Query, Transactions, Ingestion, Processing, Hadoop Driver – Parallel with Compaction, ANSI SQL + In-Memory
• MADlib Algorithms

A Reference Architecture

Standardized, on-demand services are layered around shared data repositories & processing capabilities to form the data lake.

Data Sources
• Existing structured data.
• Unstructured or semi-structured data sources.
• Machine generated data such as logs and sensor data.
• External data sources.

Ingest and data capture
• Scheduled, batch data ingest to capture bulk data sources.
• Micro-batch ingest capturing small quantities of data.
• Low-latency and real-time ingest of data.
• Real-time routing of data to complex event processing and persistent storage.

Shared storage and re-use
• Isilon and ViPR provide shared access to new and existing data sources through HDFS.
• Minimize data copies.
• Smart De-dupe for Hadoop.
• Kerberos Authentication.

Data Analytics
• In-memory performance (GemFire).
• MPP Processing (Pivotal HD).
• High performance SQL access to HDFS data (HAWQ).

Applications and integration
• CloudFoundry on vSphere.
• Build interactive, data-driven applications using modern frameworks and approaches.
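
The ingest modes listed under "Ingest and data capture" differ mainly in how many events are grouped before they are handed on. The sketch below shows micro-batching over an arbitrary event stream in plain Python. The `micro_batches` helper and the sample events are hypothetical and not part of any EMC or Pivotal product; a production pipeline would usually also flush on a timer rather than on count alone.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield lists of up to batch_size events from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


if __name__ == "__main__":
    sensor_readings = range(7)  # stand-in for log lines or sensor data
    for batch in micro_batches(sensor_readings, 3):
        # A real pipeline would write each batch to shared storage here.
        print(batch)
```

Setting `batch_size` to 1 approximates per-event, low-latency ingest, while a very large value behaves like a scheduled bulk load, so one code path can cover the spectrum at the cost of tuning the batch size.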

What about services?

Data Science + Data Engineering