data ninja webinar series: data virtualization as the enterprise data fabric
TRANSCRIPT
Data Virtualization as the Enterprise Data Fabric
Data Ninja Webinar Series
Sessions covering data virtualization solutions for driving business value
Data Ninja Webinars
Five webinars over the next few months…
Speakers
Pablo Alvarez, Senior Engineer
Agenda
1. The Data Fabric
2. Evolution of the Data Fabric: A Historical Perspective
3. Benefits
4. Performance and Scalability
5. Going Beyond
6. Q&A
In computing, a Fabric is a system of interconnected nodes
that looks like a "weave" when viewed collectively from a
distance.
In this context, a Data Fabric is a system that allows global
access to all your data assets, and leverages storage and
processing power from multiple heterogeneous nodes.
Data Virtualization as the Data Fabric
Offers a common access point for consumers
Allows specialized data stores to be used for what
they are best at
With other approaches, like Data Lakes, that are
based on replication to a single large target system,
this ability is lost.
Data virtualization’s architecture is based on using the underlying sources whenever possible. This can be seen as a network of different specialized processing and storage nodes that form the Data Fabric under the umbrella of a common virtual data model.
Successful Customer Use Cases
AGILE BUSINESS INTELLIGENCE
Replaced traditional BI with the Logical Data Warehouse that integrates multiple sources around a central EDW
360 VIEW APPLICATIONS
‘Unified Desktop’ that provides integrated customer information
CLOUD INTEGRATION
Virtual layer to abstract access to SaaS applications and enable integration with data center
DATA SERVICES
Services Layer (REST, OData) on top of Denodo’s data model with access to any data
Evolution of the Data Fabric: A Historical Perspective
The Old Days: EDW Reporting
Simple WYSIWYG reporting tools
One-to-one reporting on top of a tailor-made Data Warehouse and Data Marts
Problems:
Poor reusability
Reports built on top of Data Mart
data model
Excessive replication
[Diagram: Operational Data, Staging, EDW, Data Mart, SQL]
The Dawn: Reporting with Semantic Layers
[Diagram: Operational Data, Staging, EDW, SQL]
More advanced reporting tools with a built-in semantic layer for easier use and better reusability
One-to-one reporting on top of a tailor-made Data Warehouse
Problems:
Limited to a single source
Limited to a single reporting tool
Reporting with Federation
[Diagram: Operational Data, Staging, EDW, Other RDBMS, SQL]
Reporting tools add a built-in federation engine that allows for multi-source reporting
Problems:
Bad performance
Limited cross-source security
Limited to a single reporting tool
Early Data Virtualization
[Diagram: Operational Data, Staging, EDW, Other RDBMS, Other Sources, Integrated Security, Cache, SQL]
Data Virtualization as an independent semantic abstraction layer
Reusable semantic model can be used by multiple reporting tools
Engine specialized in federation (optimizer, caching, etc.)
Integrated security
Mature Data Virtualization
[Diagram: Operational Data, EDW, Big Data, SaaS, Other Sources; Integrated Security, Cache, In-memory Fabric; SQL, REST, OData; Catalog & Data Exploration; Monitoring; Auditing]
Benefits
Data Virtualization as the Enterprise Data Fabric
Abstracts access to disparate data sources
• Homogeneous data access regardless of back-end technology
• No need to deal with new languages and APIs: access to SFDC, Excel,
Redshift, Oracle, Hadoop, other SaaS APIs, etc.
Acts as a single semantic repository
• Definition of a consistent business data model across all consumers and
reporting tools
• Combination of data regardless of locations and nature
• Avoids unnecessary replication
Centralized security layer
• Role-based authorization to all tables in the virtual layer
• Integration with AD/LDAP and Kerberos
• Security is moved outside the reporting layer to avoid security bypasses
• Centralized access point simplifies operations and auditing
Real-time fabric execution model
• Advanced optimizer designed specifically for virtualization
• Execution push-down to leverage source computing capabilities
• Data comes straight from the sources
• Cache layer to improve performance when needed
Performance & Scalability
A mature virtualization engine like Denodo offers
results comparable with single source executions.
Let’s see how this is possible…
Performance
Denodo’s unique query optimizer
Denodo’s optimizer borrows many techniques from traditional RDBMSs
Cost-based query plans based on statistics and indexes
Multiple JOIN methods
Query rewriting to generate more optimal SQL
However, given the distributed execution of a query in a processing
fabric, Denodo has designed unique techniques to maximize
performance in this environment
Dynamic rewriting focused on maximizing execution at source and reduction of
network traffic
Cost estimates also factor in:
Processing power of the sources (e.g. number of nodes in a Hadoop cluster)
Network and transfer rates
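To make the idea of source-aware cost estimates concrete, here is a minimal Python sketch. It is not Denodo's cost model: the `Source` fields, the formula, and the throughput numbers are all hypothetical, chosen only to illustrate how a source's processing power and the network transfer rate could both enter a plan's estimated cost. The row counts (290 M sales rows, 2 M customers) are borrowed from the benchmark scenario later in this session.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical description of a data source (illustrative fields only)."""
    name: str
    rows_per_sec: float           # assumed processing throughput of the source
    transfer_rows_per_sec: float  # assumed network rate from source to virtual layer


def plan_cost(source: Source, rows_scanned: int, rows_returned: int) -> float:
    """Estimated seconds: time to process at the source plus time to ship results."""
    processing = rows_scanned / source.rows_per_sec
    transfer = rows_returned / source.transfer_rows_per_sec
    return processing + transfer


# Made-up throughput figures for a single sales source.
sales = Source("sales_dw", rows_per_sec=10_000_000, transfer_rows_per_sec=500_000)

# Plan A: ship every sales row to the virtual layer and aggregate there.
full_scan = plan_cost(sales, rows_scanned=290_000_000, rows_returned=290_000_000)

# Plan B: aggregate at the source and return one row per customer.
pushed_down = plan_cost(sales, rows_scanned=290_000_000, rows_returned=2_000_000)

best = min(
    [("full scan", full_scan), ("aggregation push-down", pushed_down)],
    key=lambda plan: plan[1],
)
print(best[0], round(best[1], 1))
```

Under these assumed rates the transfer term dominates the full-scan plan, which is why a rewrite that reduces the rows leaving the source wins in this toy model.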
Performance
DV Overhead: Direct vs Denodo with single source
[Chart: TPC-DS benchmark tests using JDBC, with IBM Netezza as the data source over a 10 Gbps LAN; results in seconds]
When queries hit only a single source, the data virtualization layer pushes the processing completely down to the source, with minimal overhead.
Note: since data needs to flow through the DV layer, the network between the sources and the DV layer should have enough bandwidth to avoid bottlenecks.
Performance
Benchmarks: Federating large data sets
Denodo has done extensive testing using queries from the standard benchmarking test TPC-DS* and the following scenario, which compares the performance of a federated approach in Denodo with an MPP system where all the data has been replicated via ETL.
[Diagram: federated scenario with Customer Dim. (2 M rows), Sales Facts (290 M rows), and Items Dim. (400 K rows), vs. a single system holding Sales Facts (290 M rows), Items Dim. (400 K rows), and Customer Dim. (2 M rows)]
* TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
Performance
Benchmarks: Federating large data sets
Query Description | Returned Rows | Netezza Time | Denodo Time (Federated Oracle, Netezza & SQL Server) | Denodo Optimization Technique (automatically selected)
Total sales by customer | 1.99 M | 20.9 sec. | 21.4 sec. | Full aggregation push-down
Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec. | 59.0 sec. | Full aggregation push-down
Total sales by item brand | 31.35 K | 4.7 sec. | 5.0 sec. | Partial aggregation push-down
Total sales by item where sale price less than current list price | 17.05 K | 3.5 sec. | 5.2 sec. | On the fly data movement
Execution times are comparable with single-source executions, based only on automatic optimizer decisions
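The benchmark figures above can be turned into explicit overhead numbers. The short Python sketch below only does arithmetic on the published timings. The `overhead` helper is a name introduced here for illustration, and one-decimal rounding is an arbitrary choice.

```python
# (query, Netezza seconds, Denodo federated seconds), copied from the benchmark figures.
results = [
    ("Total sales by customer", 20.9, 21.4),
    ("Total sales by customer and year between 2000 and 2004", 52.3, 59.0),
    ("Total sales by item brand", 4.7, 5.0),
    ("Total sales by item where sale price less than current list price", 3.5, 5.2),
]


def overhead(single_source: float, federated: float) -> tuple[float, float]:
    """Return (absolute overhead in seconds, relative overhead in percent)."""
    delta = federated - single_source
    return round(delta, 1), round(100 * delta / single_source, 1)


overheads = {query: overhead(netezza, denodo) for query, netezza, denodo in results}
for query, (seconds, percent) in overheads.items():
    print(f"{query}: +{seconds} s ({percent}%)")
```

On these four queries the federated runs add between 0.3 and 6.7 seconds, or roughly 2.4% to 48.6% in relative terms. The largest relative gap is on the shortest query.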
Performance
SELECT c.id, SUM(s.amount) as total
FROM customer c JOIN sales s
ON c.id = s.customer_id
GROUP BY c.id
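Reporting tools are not optimized for federation across sources
System | Execution Time | Data Transferred | Optimization Technique (automatically selected)
Denodo | 9 sec. | 4 M | Aggregation push-down
Tableau | 125 sec. | 292 M | None: full scan
[Diagram: two query plans. Without push-down, Sales (290 M rows) and Customer (2 M rows) are joined and then grouped. With aggregation push-down, Sales is grouped first (2 M rows) and then joined with Customer (2 M rows).]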
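The effect of aggregation push-down on data transfer can be reproduced at toy scale. The following Python sketch uses two separate in-memory SQLite databases as stand-ins for the two sources. The table contents and the `naive_plan` and `pushdown_plan` helpers are made up for illustration and say nothing about how Denodo or any reporting tool is implemented. It assumes `customer.id` is unique, which is what makes grouping before the join equivalent to grouping after it.

```python
import sqlite3

# Two independent in-memory databases stand in for two separate sources.
customer_db = sqlite3.connect(":memory:")
sales_db = sqlite3.connect(":memory:")

customer_db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY)")
customer_db.executemany("INSERT INTO customer VALUES (?)", [(i,) for i in range(1, 4)])

sales_db.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
sales_db.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5), (3, 1.0), (3, 2.0), (3, 3.0)],
)


def naive_plan():
    """Fetch every row from both sources, then join and group in the middle layer."""
    customers = customer_db.execute("SELECT id FROM customer").fetchall()
    sales = sales_db.execute("SELECT customer_id, amount FROM sales").fetchall()
    customer_ids = {cid for (cid,) in customers}
    totals = {}
    for cid, amount in sales:
        if cid in customer_ids:  # inner join on c.id = s.customer_id
            totals[cid] = totals.get(cid, 0.0) + amount
    return totals, len(customers) + len(sales)


def pushdown_plan():
    """Ask the sales source for pre-aggregated rows, then join with customers."""
    customers = customer_db.execute("SELECT id FROM customer").fetchall()
    partial = sales_db.execute(
        "SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id"
    ).fetchall()
    customer_ids = {cid for (cid,) in customers}
    totals = {cid: total for cid, total in partial if cid in customer_ids}
    return totals, len(customers) + len(partial)


naive_totals, naive_rows = naive_plan()
pushed_totals, pushed_rows = pushdown_plan()
print(naive_rows, pushed_rows)
```

Both plans return the same totals, but the push-down plan moves one row per customer from the sales source instead of one row per sale. That is the same pattern as the 292 M vs. 4 M rows reported on the slide.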
Scalability
[Diagram: a load balancer exposes a virtual server (SQL cluster: 192.168.0.10:9999; web container cluster: 192.168.0.10:9090) in front of four nodes. SQL cluster: Denodo1:9999, Denodo2:9999, Denodo3:9999, Denodo4:9999. Web container cluster: Denodo1:9090, Denodo2:9090, Denodo3:9090, Denodo4:9090. Also shown: a shared cache server.]
Denodo can be deployed in a cluster for HA and horizontal scaling
“Shared-nothing” execution engine ensures linear scalability
Based on the use of an external load balancer
Supports auto-scaling for cloud deployments (like AWS)
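As a rough illustration of the external load balancer's role, the Python sketch below spreads incoming connections across the four SQL nodes named on the slide. The round-robin policy and the `RoundRobinBalancer` class are assumptions made for this example. The slide does not say which balancing policy or product is used.

```python
from itertools import cycle

# Node addresses as shown on the slide; the round-robin policy is an assumption.
SQL_NODES = ["Denodo1:9999", "Denodo2:9999", "Denodo3:9999", "Denodo4:9999"]


class RoundRobinBalancer:
    """Hypothetical stand-in for the external load balancer."""

    def __init__(self, nodes):
        self._nodes = cycle(nodes)

    def route(self) -> str:
        """Return the node that should receive the next connection."""
        return next(self._nodes)


balancer = RoundRobinBalancer(SQL_NODES)
assignments = [balancer.route() for _ in range(8)]
print(assignments)
```

Because the nodes share nothing in the execution engine, any node can serve any connection, so a stateless policy like this one is enough for the sketch.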
Going Beyond
What’s cooking in the virtualization space
Holistic Operations Console
• Common operations web console to orchestrate monitoring,
notifications, diagnosis, auditing, migration, license management, etc.
Web-based Self Service
• Advanced catalog enables a centralized “data marketplace”
• Keyword-based search
• Collaboration (tags, comments, request for access, etc.)
Next-gen “Fabric” Execution Engine
• Tight integration with in-memory and data grids to move processing
from the virtual layer to specialized execution engines
Q&A
Next Steps
Get Started!
Download Denodo Express: www.denodoexpress.com
Access Denodo Platform on AWS: www.denodo.com/en/denodo-platform/denodo-platform-for-aws
Denodo Platform 6.0 Whitepaper
Download & Read: http://www.denodo.com/en/document/whitepaper/denodo-platform-60-whitepaper
Data Virtualization for Data Services
Visit: http://www.denodo.com/en/solutions/horizontal-solutions/data-services
Data Ninja Webinar Series
Next Session: Realizing the Promise of Data Lakes
Thursday, December 15th, 2016
Thanks!
www.denodo.com [email protected]
© Copyright Denodo Technologies. All rights reserved. Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization of Denodo Technologies.