TRANSCRIPT
1
Ajay Shriwastava, Sachin Ghai
Impetus Technologies Inc.
Logical Data Warehouse: Building a virtual data services layer
Hadoop Summit – San Jose – 11 June 2015
4
EVOLUTION OF THE DATA WAREHOUSE
If this slide looks inverted to you, it actually is. The data warehouse as we knew it has had its concepts inverted by the emergence of BIG DATA.
5
EVOLUTION OF THE DATA WAREHOUSE
Enterprise Data Warehouse (EDW):
• Pre-determined input schema
• Extensive data governance
• ANSI SQL compliance
• IT teams ownership
• Concurrent users
• Pre-canned BI

Big Data Warehouse (BDW):
• Low cost storage / archive
• Non-SQL access
• Exploratory analysis
• Machine learning / graph
• Data discovery
• Self-service BI / analytics
6
CO-EXISTENCE OF EDW AND BDW – A REALITY
• Organizations are still in the initial phase of their Big Data journey.
• Existing ETL jobs feed the EDW.
• BI and downstream applications use ANSI SQL for querying data in the EDW.
• Business use cases for big data are emerging.
7
EMERGENCE OF LOGICAL DATA WAREHOUSE
Logical Data Warehouse
In response to emerging forces like Big Data, the evolution of data warehousing practices led to the emergence of the “Logical Data Warehouse”.
Key components include:
Repository management
Data virtualization
Distributed processing
Auditing statistics and performance evaluation services
SLA management
Taxonomy/Ontology resolution
Metadata management
First proposed in May 2009; published in Gartner research in August 2011.
8
NEW PARADIGMS
DATA LAKE – Repositories are no longer only the Enterprise Data Warehouse or data marts: HDFS has emerged as the “data lake”, alongside NoSQL data stores.

DISTRIBUTED PARALLEL PROCESSING – Distributed processing has become synonymous with MapReduce on files or databases. With Spark, more distributed operators are becoming commonplace.

VIRTUALIZATION – Gaining favor as an access mechanism where transient consolidation is required for a use case.

OFFLOAD – Offloading to newer repositories and processing engines now requires more accurate science and process.
9
VIRTUALIZATION – AND ITS MANY DELINEATIONS
• Data Virtualization – a simplified, unified, and integrated view of data.
• Data Federation – a subset of data virtualization, enhanced with query optimization strategies for specific sources.
• Data Blending – involves actual data movement and a ‘write’ to a repository, rather than just a ‘read’, for a transient use case.
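The distinction above can be sketched in a few lines of Python. This is a toy illustration only, not a product API: the source names, schemas, and records are hypothetical stand-ins for e.g. a relational table and a NoSQL store.

```python
# Minimal sketch of a virtual (read-only, on-demand) view over two sources.
# All source names and data below are hypothetical.

def oracle_accounts():
    # Stand-in for a relational source: rows of (account_id, client)
    yield from [(1, "Jerry Mayfield"), (2, "Linda Mays")]

def mongo_accounts():
    # Stand-in for a NoSQL source: documents with the same logical fields
    yield from [{"account_id": 3, "client": "Paul Robinson"}]

def unified_view():
    """A virtual view: rows are produced on demand, nothing is copied."""
    for account_id, client in oracle_accounts():
        yield {"account_id": account_id, "client": client}
    for doc in mongo_accounts():
        yield {"account_id": doc["account_id"], "client": doc["client"]}

# Data blending, by contrast, materializes (writes) the combined rows:
blended = list(unified_view())
```

Federation would add per-source query optimization (pushing filters down to each backend) on top of the virtual view.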
10
BRIDGING THE CHANNEL WITH VIRTUALIZATION
VIRTUALIZATION bridges source and target systems.

SOURCE SYSTEMS:
• Relational – Oracle, DB2…
• NoSQL – Cassandra, MongoDB…
• File Systems – HDFS, GPFS…
• MPP – Teradata, Netezza…
• Hadoop-based Warehouse – Hive, Tajo…

TARGET SYSTEMS:
• Users – Enterprise Web, Mobile
• Applications – Enterprise, ESB…
• BI – Reporting, Visualization…
• Data Science – Machine Learning, Graph…
• Data Management – MDM, Discovery…
11
WEALTH MANAGEMENT – USE CASE
Big data can transform client- and account-centric wealth management into personalized, goal-based wealth management.
Unfortunately, the information is spread across many different lines of business using separate data sources and platforms.
12
GOAL BASED WEALTH MANAGEMENT
• From an account- or client-centric view to a household and relationship view.
• A comprehensive approach to understanding the long-term financial goals of a client.
• Facilitate financial security during life-changing events – marriage, college, job changes, retirement, inter-generational wealth transfer.
13
BUILDING RELATIONSHIPS
Household View / Business Network View / Hierarchy View

• Collect the data and process it on a common platform – different LOBs, IVR logs / web logs.
• Discover relationships.
• Build a unified view over existing data in client and account systems.
• Integrate with social data.
• Implement governance.
• Implement the corporate hierarchy over client data.
[Example relationship graph: root node Jack Kline, with household relations (self, spouse, father, son, daughter-1, daughter-2) and network relations (friends, golf buddy, college alumni, neighbors) linking Jerry Mayfield, Paul Robinson, Andrew Madura, and Linda Mays.]
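The relationship-discovery step above can be sketched as a small labeled graph. The names come from the slide's example figure; the specific edges and the `household_view` helper are illustrative, not taken from any product.

```python
# Tiny sketch of "discover relationships": a labeled, undirected graph over
# clients, with household vs. business-network edge types. Edges are made up.
from collections import defaultdict

graph = defaultdict(list)

def add_relation(a, b, label, kind):
    # Store the relation in both directions so either party can be queried.
    graph[a].append((b, label, kind))
    graph[b].append((a, label, kind))

add_relation("Jack Kline", "Linda Mays", "spouse", "household")
add_relation("Jack Kline", "Jerry Mayfield", "golf buddy", "network")
add_relation("Jack Kline", "Paul Robinson", "college alumni", "network")

def household_view(person):
    """All household relations of a person (the 'household view')."""
    return [(other, label) for other, label, kind in graph[person]
            if kind == "household"]
```

The household, business-network, and hierarchy views then become simple filters over the same underlying graph.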
14
PERSONALIZED SERVICES CROSS/UP SELL
Household View:
• 401K / IRA
• 529
• College loans
• Student credit cards
• Gift cards
• Estate planning
• Alimonies

Business Network View:
• Company loans
• Asset standing
• Business prospects

Hierarchy View:
• Corporate accounts
• Corporate discounts
• Group-based services
• Prospects identification
• Goal-based services
[Chart: Assets distribution – goals vs. actuals (0–60) and liabilities (0%–80%), projected Jan-15 through Jan-19.]
15
TRADITIONAL DATA INTEGRATION
• Data is embedded in silos
• Time-consuming and resource-intensive ETL processes
• Creates data duplication
• Governance is an inhibitor, not an enabler
• Inability to handle the V’s of big data
16
LOGICAL DATA WAREHOUSE REFERENCE ARCHITECTURE

[Architecture diagram, summarized by layer:]
• Governance – security, audit, lineage (Big Data Governance)
• Monitoring and cluster management – Ankush, Jumbune
• Stream ingestion – Producer A / Producer B → Distributed Messaging Layer (Kafka) → Spark Streaming, StreamAnalytix
• Batch ingestion – batch data ingestion via Sqoop / Talend etc.
• Storage and processing – Hadoop Cluster / YARN (HDFS / Hive / Pig), in-memory data layer (Spark), EDW (Teradata / Netezza etc.), OLTP/RDBMS; EDW migration via the SQL Offload Solution
• Analytics / virtualization – Data Virtualization Layer with centralized schema, Kyvos Engine, Kundera, custom layer for universal connectivity; MLlib, Mahout, R algorithms, Storm, data quality
• Data access – Query API (SQL), REST API, Search API (ES, Solr, NLP), JDBC, proprietary connectors / ODBC drivers
• BI tools – MicroStrategy, Tableau, Kyvos…
Legend: Impetus offerings (details: Appendix 1); recommended platform; platform requirements.
17
UNIFIED VIEW - ADVANTAGES
• Fast, real-time data integration without creating expensive copies of data.
• Significant savings in the time and resources required for ETL.
• Facility to create a final composite schema.
• Information management capability.
• Meet stringent service-level agreements.
18
OFFLOAD
Offload cold data and exploratory analysis workloads to a commodity-hardware Hadoop cluster – saving cost and resources.
19
OFFLOAD CONCEPT
An Enterprise Data Warehouse (EDW) to Big Data Warehouse (BDW) offload essentially involves tables and code migration.

[Offload flow, summarized:]
• EDW side: tables, SQL scripts, static SQL, procedures / PL-SQL, proprietary scripts (roadmap).
• A SQL Script Parser converts these into Hive queries and Java code with Hive queries.
• BDW side: the generated Hive queries run on Hadoop.
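To make the SQL-Script-Parser step concrete, here is a toy Python sketch of the translation idea. It is not the actual Impetus parser: a real offload tool is a full SQL parser, while these regex rules only cover a few Teradata-to-Hive patterns visible on the sample-query slide (quoted aliases, TRIM(TRAILING …), CURRENT_DATE). `UDF_TRIM` is assumed to be a user-supplied Hive UDF, as on that slide.

```python
import re

# Toy illustration of the SQL-script-parser idea: rewrite a few Teradata
# constructs into Hive equivalents. Real offload requires a full parser.
RULES = [
    # Teradata quoted column aliases ("30 days") -> Hive-safe identifiers
    (re.compile(r'"([^"]+)"'),
     lambda m: m.group(1).replace(" ", "_").replace("-", "_")),
    # TRIM(TRAILING '_' FROM col) -> call to an assumed custom Hive UDF
    (re.compile(r"TRIM\(TRAILING\s+'(.)'\s+FROM\s+(\w+)\)", re.I),
     lambda m: f"UDF_TRIM('TRAILING', '{m.group(1)}', {m.group(2)})"),
    # Bare CURRENT_DATE keyword -> Hive's CURRENT_DATE() function call
    (re.compile(r"\bCURRENT_DATE\b(?!\()"), lambda m: "CURRENT_DATE()"),
]

def teradata_to_hive(sql: str) -> str:
    """Apply each rewrite rule in order over the whole script."""
    for pattern, repl in RULES:
        sql = pattern.sub(repl, sql)
    return sql
```

Each rule mirrors one transformation seen in the auto-transformed sample query on the next slide.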
20
KEY CHALLENGES IN OFFLOAD
• Varied input sources
• Validating complete schema and data offload
• ANSI SQL incompatibility
• User-defined functions unavailable in the target system
• Lack of a unified view and UI
• Missing data quality checks
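When a source function such as Teradata's TRIM(TRAILING … FROM …) has no built-in Hive equivalent, its behavior must be re-implemented, typically as a custom Hive UDF (like the UDF_TRIM seen on the sample-query slide, which is written in Java). A Python sketch of the behavior such a UDF must reproduce:

```python
def trim_trailing(char: str, value: str) -> str:
    """Reference behavior for Teradata's TRIM(TRAILING <char> FROM <value>):
    strip all trailing occurrences of <char>, and only trailing ones."""
    return value.rstrip(char) if char else value
```

Porting and validating such semantics, one function at a time, is a large part of why UDF unavailability is listed as a key challenge.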
21
HOW WE BUILT THE OFFLOAD SOLUTION
Step 1: Identification
Step 2: Schema and Data Migration
Step 3: Logic Migration
Step 4: Data Quality Enhancement
Step 5: Transformed Code Execution

Reduced Time! Reduced Risk! Automation!
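One common check behind Step 4 is reconciling the offloaded data against the source. A minimal sketch, where `edw_counts` and `bdw_counts` are hypothetical stand-ins for per-table SELECT COUNT(*) results from each system:

```python
# Minimal post-offload reconciliation sketch (Step 4, data quality):
# compare per-table row counts between the EDW source and the BDW target.
def reconcile(edw_counts: dict, bdw_counts: dict) -> list:
    """Return (table, edw_rows, bdw_rows) for every mismatched table."""
    mismatches = []
    for table, edw_rows in edw_counts.items():
        bdw_rows = bdw_counts.get(table)  # None if the table never arrived
        if bdw_rows != edw_rows:
            mismatches.append((table, edw_rows, bdw_rows))
    return mismatches

# Illustrative counts: one table matches, one is missing on the BDW side.
edw = {"catalog_sales": 1_441_548, "warehouse": 5}
bdw = {"catalog_sales": 1_441_548}
```

Production checks would add column-level checksums or sampled value comparisons on top of raw counts.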
22
SAMPLE QUERY – AUTO TRANSFORMED
Teradata Query:
INSERT INTO month_wise_ship_agg
SELECT d_month_seq,
  SUBSTR(w_warehouse_name, 1, 20),
  TRIM(TRAILING '_' FROM sm_type),
  cc_name,
  SUM(CASE WHEN (cs_ship_date_sk - cs_sold_date_sk <= 30) THEN 1 ELSE 0 END) AS "30 days",
  SUM(CASE WHEN (cs_ship_date_sk - cs_sold_date_sk > 30) AND (cs_ship_date_sk - cs_sold_date_sk <= 60) THEN 1 ELSE 0 END) AS "31-60 days"
FROM catalog_sales, warehouse, ship_mode, call_center C, date_dim
WHERE cs_ship_date_sk = d_date_sk
  AND EXTRACT(YEAR FROM d_date) IN (2000, 2001)
  AND DAYNUMBER_OF_MONTH(d_date) > 1
  AND TD_QUARTER_BEGIN(d_date) <= CURRENT_DATE
  AND cs_warehouse_sk = w_warehouse_sk
  AND cs_ship_mode_sk = sm_ship_mode_sk
  AND cs_call_center_sk = cc_call_center_sk
  AND (C.cust_id, C.address) LIKE ANY (SELECT C1.cust_id, C1.address FROM Customer C1)
GROUP BY SUBSTR(w_warehouse_name, 1, 20), sm_type, cc_name, d_month_seq;

Hive Query:
INSERT INTO TABLE month_wise_ship_agg
SELECT d_month_seq,
  SUBSTR(w_warehouse_name, 1, 20) AS auto_c01,
  UDF_TRIM('TRAILING', '_', sm_type),
  cc_name,
  SUM(CASE WHEN (cs_ship_date_sk - cs_sold_date_sk <= 30) THEN 1 ELSE 0 END) AS 30_days,
  SUM(CASE WHEN (cs_ship_date_sk - cs_sold_date_sk > 30) AND (cs_ship_date_sk - cs_sold_date_sk <= 60) THEN 1 ELSE 0 END) AS 31_60_days
FROM catalog_sales, warehouse, ship_mode, call_center C, date_dim
WHERE cs_ship_date_sk = d_date_sk
  AND EXTRACT('YEAR', d_date) IN (2000, 2001)
  AND DAYNUMBER_OF_MONTH(d_date) > 1
  AND TD_QUARTER_BEGIN(d_date) <= CURRENT_DATE()
  AND cs_warehouse_sk = w_warehouse_sk
  AND cs_ship_mode_sk = sm_ship_mode_sk
  AND cs_call_center_sk = cc_call_center_sk
  AND EXISTS (SELECT * FROM Customer C1 WHERE C.cust_id LIKE C1.cust_id AND C.address LIKE C1.address)
GROUP BY SUBSTR(w_warehouse_name, 1, 20), sm_type, cc_name, d_month_seq;
23
OFFLOAD - KEY ADVANTAGES
• Optimize MPP and relational database resources for workloads
• Re-use millions of lines of code and $
• Avoid the learning curve and re-code, re-test cycles
• Seamless integration of downstream / upstream apps and reports
24
SUMMARY
To summarize, the following steps are recommended for creating a logical data warehouse:
• Establish your data strategy.
• Identify key components: Hadoop, MPP, Spark, NoSQL, etc.
• Segregate your workloads.
• Offload to low cost Hadoop where required.
• Leverage virtualization for key use cases.
• Establish data quality, SLAs, semantics, and metadata as key supporting pillars.
25
SIGN-OFF QUOTE
“In the end, there can be only 2 types of data warehouses: the logical data warehouse and the illogical data warehouse…”
27
APPENDIX 1
Product – URL
StreamAnalytix – http://streamanalytix.com/
Kyvos – http://www.kyvosinsights.com/
Kundera – http://bigdata.impetus.com/open_source_kundera
SQL Offload Solution – http://www.impetus.com/sites/impetus.com/impetus/brochures/ETL_Offloading_Datasheet.pdf
Ankush – http://bigdata.impetus.com/ankush
Jumbune – http://www.jumbune.org/