denodo datafest 2016: logical data lakes: performance considerations

Post on 22-Jan-2017

18 Views

Category:

Data & Analytics

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

O C T O B E R 1 8 , 2 0 1 6 S A N F R A N C I S C O B A Y A R E A , C A

#DenodoDataFest

RAPID, AGILE DATA STRATEGIESFor Accelerating Analytics, Cloud, and Big Data Initiatives.

Logical Data Lakes: Performance Considerations

Dr. Alberto Pan

Denodo, CTO

Agenda1. Logical Architectures for Analytics: The Role of Data Virtualization

2. Example Scenario: The Numbers

3. Performance: The “Move Processing to the Data” Paradigm

4. Performance: Cost-based Optimization in Logical Architectures

Logical Architectures for Analytics

The Role of Data Virtualization

5

The Logical Data Warehouse

6

Logical Data LakeReal-TimeDecision

Management

Alerts

ScorecardsDashboards

Reporting

Data DiscoverySelf-Service

Search

Predictive Analytics

Statistical Analytics (R)

Text Analytics

Data Mining

Data Warehouse

Sensor Data

Machine Data (Logs)

Social Data

Clickstream Data

Internet Data

Image and Video

Enterprise Content (Unstructured)

Big Data

Enterprise Applications

Traditional Enterprise

Data

Cloud

Cloud Applications

Metadata Management, Data Governance, Data Security

NoSQL

EDWIn-Memory

(SAP Hana, …)Analytical

Appliances

Cloud DW(Redshift,..)

ODS

Big Data ETL

CDC

Sqoop

(Flume, Kafka, …)

Real-Time Data Access (On-Demand / Streaming)

Batch

YARN / Workload Management

HDFS

HiveSparkDrill

Impala

Storm HBase SolrHunk

DW Streams NoSQL SearchSQL

Hadoop

TezMapRed.

7

The Logical Data Warehouse

8

The Logical Data Warehouse: Data Virtualization

9

Logical Data LakeReal-TimeDecision

Management

Alerts

ScorecardsDashboards

Reporting

Data DiscoverySelf-Service

Search

Predictive Analytics

Statistical Analytics (R)

Text Analytics

Data Mining

Data Warehouse

Sensor Data

Machine Data (Logs)

Social Data

Clickstream Data

Internet Data

Image and Video

Enterprise Content (Unstructured)

Big Data

Enterprise Applications

Traditional Enterprise

Data

Cloud

Cloud Applications

Metadata Management, Data Governance, Data Security

NoSQL

EDWIn-Memory

(SAP Hana, …)Analytical

Appliances

Cloud DW(Redshift,..)

ODS

Big Data ETL

CDC

Sqoop

(Flume, Kafka, …)

Real-Time Data Access (On-Demand / Streaming)

Batch

YARN / Workload Management

HDFS

HiveSparkDrill

Impala

Storm HBase SolrHunk

DW Streams NoSQL SearchSQL

Hadoop

TezMapRed.

10

Logical Data Lake: Data Virtualization

Real-TimeDecision

Management

Alerts

ScorecardsDashboards

Reporting

Data DiscoverySelf-Service

Search

Predictive Analytics

Statistical Analytics (R)

Text Analytics

Data MiningData Warehouse

Sensor Data

Machine Data (Logs)

Social Data

Clickstream Data

Internet Data

Image and Video

Enterprise Content (Unstructured)

Big Data

Enterprise Applications

Traditional Enterprise

Data

Cloud

Cloud Applications

NoSQL

EDWIn-Memory

(SAP Hana, …)Analytical

Appliances

Cloud DW(Redshift,..)

ODS

Big Data ETL

CDC

Sqoop

(Flume, Kafka, …)

Data Virtualization

Real-Time Data Access (On-Demand / Streaming)

Data Caching

Data

Serv

ices

Data Search & Discovery

Governance

Security

Optimization

Data

Abstr

action

Data

Tra

nsfo

rmation

Data

Federa

tionBatch

YARN / Workload Management

HDFS

HiveSparkDrill

Impala

Storm HBase SolrHunk

DW Streams NoSQL SearchSQL

Hadoop

TezMapRed.

11

The Role of Data Virtualization

Combines data from several systems and publish it to the desired format with a few clicks

Denodo Data Virtualization is the only option verifying:

12

Example: Distributed Report

Demo Scenario of LDW

Sales Data (TPC-DS)(280M)

Customer (TPC-DS)(2M)

MDM

12

SQL on Hadoop

Total Sales by Customer

13

The Role of Data Virtualization

Combines data from several systems and publish it to the desired format with a few clicks

Abstracts applications from changes in the underlying infrastructure

Expose different logical views over the same data

Single entry point to apply Security and Governance

Denodo Data Virtualization is the only option verifying:

14

DW + MDM Dimensions

Common LDW Patterns

TimeDimension

Fact table(sales) Product

Dimension

Retailer Dimension

EDW MDM

14

DW Historical Offloading (Cold Data Storage)

Common LDW Patterns

Time Dimension Fact table(sales) Product Dimension

Retailer Dimension

EDW Hadoop

Current Sales Historical Sales

15

DW + Cloud Dimensional Data

Common LDW Patterns

Time Dimension Fact table(sales) Product Dimension

Customer Dimension

EDW CRM

SFDC Customer

16

Data Warehouse Federation

Common LDW Patterns

Time Dimensi

on

Fact table(sales)

CustomerDimension

Region

EDW

City

EDW

Fact table(Fidelity)

Customer Dimension

17

Example Scenario: The Numbers

19

Denodo has done extensive testing using queries from the standard benchmarking test

TPC-DS* and the following scenario

Compares the performance of a federated approach in Denodo with an MPP system where

all the data has been replicated via ETL

Customer Dim.2 M rows

Sales Facts290 M rows

Items Dim.400 K rows

* TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.

vs.Sales Facts290 M rows

Items Dim.400 K rows

Customer Dim.2 M rows

Denodo 6.0 ArchitecturePerformance Comparison – Logical Data Warehouse vs. Physical Data Warehouse

20

Denodo 6.0 Architecture

Query DescriptionReturned

RowsTime Netezza

Time Denodo (Federated Oracle,

Netezza & SQL Server)

Optimization Technique (automatically selected)

Total sales by customer 1,99 M 20.9 sec. 21.4 sec. Full aggregation push-down

Total sales by customer and year between 2000 and 2004

5,51 M 52.3 sec. 59.0 sec Full aggregation push-down

Total sales by item brand 31,35 K 4.7 sec. 5.0 sec. Partial aggregation push-down

Total sales by item where sale price less than current

list price17,05 K 3.5 sec. 5.2 sec On the fly data movement

Performance Comparison – Logical Data Warehouse vs. Physical Data Warehouse

Performance

The “Move Processing to the Data” Paradigm

22

Move Processing to the Data

Process the data where it resides

Process the data locally where it resides

DV System combines partial results

Minimizes network traffic

Leverages specialized data sources

23

Move Processing to the Data: Example 1

Obtain Total Sales By Product (Naive Strategy)

Naive Strategy: 350M rows moved through the network

24

Move Processing to the Data: Example 1

Obtain Total Sales By Product (Move Processing to the Data)

Denodo Strategy: 30k rows moved through the network

25

Move Processing to the Data: Example 1 (Alternative)

Obtain Total Sales By Product (Dimension Table Replicated)

Denodo Strategy: Dimension Table Replicated

Denodo Strategy: 20k rows moved through the network

26

Move Processing to the Data: Example 2

Execution Strategy: Full aggregation pushdown not possible Two possible techniques: - On-the-fly data movement- Partial aggregation pushdown

Maximum Sales Discount By Product in The last year

Sales Discount: list_price (Product) – sale_price (Sales)

27

Move Processing to the Data: Example 2

Maximum Sales Discount By Product in the last year: On-the-fly Data Movement

Move Products Data to a Temp table in the DW : 20K rows moved through the network + 10K

rows inserted in the DW

Execute full query on the DW: 10k rows through the network

28

Move Processing to the Data: Example 2

Maximum Sales Discount By Product in the last year: Partial aggregation Pushdown

Products DB: 10K rows through the network

Data Warehouse: #rows through the network = 10K * average

#sale_prices_per_product

Performance

Cost-based Optimization in Logical Architectures

30

How to Choose the Best Execution Plan?Cost-Based Optimization in Data Virtualization

Data statistics to estimate size of intermediate result sets

Data Source Indexes (and other physical structures)

Execution Model of data sources: e.g. Parallel Databases VS Hadoop clusters VS Relational Databases

Features of data sources (e.g. number of processing cores in parallel database or Hadoop Cluster)

Data Transfer rate

Must take into account:

Q&AFind more details at: datavirtualization.bloghttp://www.datavirtualizationblog.com/myths-in-data-virtualization-performance/

Thank you!

© Copyright Denodo Technologies. All rights reservedUnless otherwise specified, no part of this PDF file may be reproduced or utilized in any for or by any means, electronic or mechanical, including photocopying and microfilm, without prior the written authorization from Denodo Technologies.

O C T O B E R 1 8 , 2 0 1 6 S A N F R A N C I S C O B A Y A R E A , C A

#DenodoDataFest

top related