TRANSCRIPT
1
Ajay Shriwastava, Sachin Ghai
Impetus Technologies Inc.
Logical Data Warehouse: Building a virtual data services layer
Hadoop Summit – San Jose – 11 June 2015
4
EVOLUTION OF THE DATA WAREHOUSE
If this slide looks inverted to you, it actually is. The data warehouse as we knew it has had its concepts inverted by the emergence of BIG DATA.
5
EVOLUTION OF THE DATA WAREHOUSE
Enterprise Data Warehouse (EDW):
• Pre-determined input schema
• Extensive data governance
• ANSI SQL compliance
• IT teams ownership
• Concurrent users
• Pre-canned BI

Big Data Warehouse (BDW):
• Low cost storage / archive
• Non-SQL access
• Exploratory analysis
• Machine learning / graph
• Data discovery
• Self-service BI / analytics
6
CO-EXISTENCE OF EDW AND BDW – A REALITY
• Organizations are still in the initial phase of their Big Data journey.
• Existing ETL jobs feed the EDW.
• BI and downstream applications use ANSI SQL for querying data in the EDW.
• Business use cases for big data are emerging.
7
EMERGENCE OF LOGICAL DATA WAREHOUSE
Logical Data Warehouse
In response to emerging forces like Big Data, the evolution of data warehousing practices led to the emergence of the “Logical Data Warehouse”.
Key components include:
Repository management
Data virtualization
Distributed processing
Auditing statistics and performance evaluation services
SLA management
Taxonomy/Ontology resolution
Metadata management
First proposed in May 2009; published in Gartner research in August 2011.
8
NEW PARADIGMS
DATA LAKE – Repositories are no longer only the Enterprise Data Warehouse or data marts: HDFS has emerged as the “data lake”, alongside NoSQL data stores.

DISTRIBUTED PARALLEL PROCESSING – Distributed processing has become synonymous with MapReduce on files or databases. With Spark, more distributed operators are becoming commonplace.

VIRTUALIZATION – Gaining favor as an access mechanism where transient consolidation is required for a use case.

OFFLOAD – Offloading to newer repositories and processing engines now requires more accurate science and process.
9
VIRTUALIZATION – AND ITS MANY DELINEATIONS
• Data Virtualization – a simplified, unified, and integrated view of data.
• Data Federation – a subset of data virtualization, enhanced with query optimization strategies for specific sources.
• Data Blending – involves actual data movement and a ‘write’ to a repository, rather than just a ‘read’, for a transient use case.
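The distinction above can be sketched in a few lines of Python. This is a toy illustration only, not a product API: the source names, schemas, and records are hypothetical stand-ins for e.g. a relational table and a NoSQL store.

```python
# Minimal sketch of a virtual (read-only, on-demand) view over two sources.
# All source names and data below are hypothetical.

def oracle_accounts():
    # Stand-in for a relational source: rows of (account_id, client)
    yield from [(1, "Jerry Mayfield"), (2, "Linda Mays")]

def mongo_accounts():
    # Stand-in for a NoSQL source: documents with the same logical fields
    yield from [{"account_id": 3, "client": "Paul Robinson"}]

def unified_view():
    """A virtual view: rows are produced on demand, nothing is copied."""
    for account_id, client in oracle_accounts():
        yield {"account_id": account_id, "client": client}
    for doc in mongo_accounts():
        yield {"account_id": doc["account_id"], "client": doc["client"]}

# Data blending, by contrast, materializes (writes) the combined rows:
blended = list(unified_view())
```

Federation would add per-source query optimization (pushing filters down to each backend) on top of the virtual view.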
10
BRIDGING THE CHANNEL WITH VIRTUALIZATION
VIRTUALIZATION bridges source and target systems.

SOURCE SYSTEMS:
• Relational – Oracle, DB2…
• NoSQL – Cassandra, MongoDB…
• File Systems – HDFS, GPFS…
• MPP – Teradata, Netezza…
• Hadoop-based Warehouse – Hive, Tajo…

TARGET SYSTEMS:
• Users – Enterprise Web, Mobile
• Applications – Enterprise, ESB…
• BI – Reporting, Visualization…
• Data Science – Machine Learning, Graph…
• Data Management – MDM, Discovery…
11
WEALTH MANAGEMENT – USE CASE
Big data can transform client- and account-centric wealth management into personalized, goal-based wealth management.
Unfortunately, the information is spread across many different lines of business using separate data sources and platforms.
12
GOAL BASED WEALTH MANAGEMENT
• From an account- or client-centric view to a household and relationship view.
• A comprehensive approach to understanding the long-term financial goals of a client.
• Facilitate financial security during life-changing events – marriage, college, job changes, retirement, inter-generational wealth transfer.
13
BUILDING RELATIONSHIPS
Household View / Business Network View / Hierarchy View

• Collect the data and process it on a common platform – different LOBs, IVR logs / web logs.
• Discover relationships.
• Build a unified view over existing data in client and account systems.
• Integrate with social data.
• Implement governance.
• Implement the corporate hierarchy over client data.
[Example relationship graph: root node Jack Kline, with household relations (self, spouse, father, son, daughter-1, daughter-2) and network relations (friends, golf buddy, college alumni, neighbors) linking Jerry Mayfield, Paul Robinson, Andrew Madura, and Linda Mays.]
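The relationship-discovery step above can be sketched as a small labeled graph. The names come from the slide's example figure; the specific edges and the `household_view` helper are illustrative, not taken from any product.

```python
# Tiny sketch of "discover relationships": a labeled, undirected graph over
# clients, with household vs. business-network edge types. Edges are made up.
from collections import defaultdict

graph = defaultdict(list)

def add_relation(a, b, label, kind):
    # Store the relation in both directions so either party can be queried.
    graph[a].append((b, label, kind))
    graph[b].append((a, label, kind))

add_relation("Jack Kline", "Linda Mays", "spouse", "household")
add_relation("Jack Kline", "Jerry Mayfield", "golf buddy", "network")
add_relation("Jack Kline", "Paul Robinson", "college alumni", "network")

def household_view(person):
    """All household relations of a person (the 'household view')."""
    return [(other, label) for other, label, kind in graph[person]
            if kind == "household"]
```

The household, business-network, and hierarchy views then become simple filters over the same underlying graph.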
14
PERSONALIZED SERVICES CROSS/UP SELL
Household View:
• 401K / IRA
• 529
• College loans
• Student credit cards
• Gift cards
• Estate planning
• Alimonies

Business Network View:
• Company loans
• Asset standing
• Business prospects

Hierarchy View:
• Corporate accounts
• Corporate discounts
• Group-based services
• Prospects identification
• Goal-based services
[Chart: Assets distribution – goals vs. actuals (0–60) and liabilities (0%–80%), projected Jan-15 through Jan-19.]
15
TRADITIONAL DATA INTEGRATION
• Data is embedded in silos
• Time-consuming and resource-intensive ETL processes
• Creates data duplication
• Governance is an inhibitor, not an enabler
• Inability to handle the V’s of big data
16
LOGICAL DATA WAREHOUSE REFERENCE ARCHITECTURE

[Architecture diagram, summarized by layer:]
• Governance – security, audit, lineage (Big Data Governance)
• Monitoring and cluster management – Ankush, Jumbune
• Stream ingestion – Producer A / Producer B → Distributed Messaging Layer (Kafka) → Spark Streaming, StreamAnalytix
• Batch ingestion – batch data ingestion via Sqoop / Talend etc.
• Storage and processing – Hadoop Cluster / YARN (HDFS / Hive / Pig), in-memory data layer (Spark), EDW (Teradata / Netezza etc.), OLTP/RDBMS; EDW migration via the SQL Offload Solution
• Analytics / virtualization – Data Virtualization Layer with centralized schema, Kyvos Engine, Kundera, custom layer for universal connectivity; MLlib, Mahout, R algorithms, Storm, data quality
• Data access – Query API (SQL), REST API, Search API (ES, Solr, NLP), JDBC, proprietary connectors / ODBC drivers
• BI tools – MicroStrategy, Tableau, Kyvos…
Legend: Impetus offerings (details: Appendix 1); recommended platform; platform requirements.
17
UNIFIED VIEW - ADVANTAGES
• Fast, real-time data integration without creating expensive copies of data.
• Significant savings in the time and resources required for ETL.
• Facility to create a final composite schema.
• Information management capability.
• Meet stringent service-level agreements.
18
OFFLOAD
Offload cold data and exploratory analysis workloads to a commodity-hardware Hadoop cluster – saving cost and resources.
19
OFFLOAD CONCEPT
An Enterprise Data Warehouse (EDW) to Big Data Warehouse (BDW) offload essentially involves tables and code migration.

[Offload flow, summarized:]
• EDW side: tables, SQL scripts, static SQL, procedures / PL-SQL, proprietary scripts (roadmap).
• A SQL Script Parser converts these into Hive queries and Java code with Hive queries.
• BDW side: the generated Hive queries run on Hadoop.
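To make the SQL-Script-Parser step concrete, here is a toy Python sketch of the translation idea. It is not the actual Impetus parser: a real offload tool is a full SQL parser, while these regex rules only cover a few Teradata-to-Hive patterns visible on the sample-query slide (quoted aliases, TRIM(TRAILING …), CURRENT_DATE). `UDF_TRIM` is assumed to be a user-supplied Hive UDF, as on that slide.

```python
import re

# Toy illustration of the SQL-script-parser idea: rewrite a few Teradata
# constructs into Hive equivalents. Real offload requires a full parser.
RULES = [
    # Teradata quoted column aliases ("30 days") -> Hive-safe identifiers
    (re.compile(r'"([^"]+)"'),
     lambda m: m.group(1).replace(" ", "_").replace("-", "_")),
    # TRIM(TRAILING '_' FROM col) -> call to an assumed custom Hive UDF
    (re.compile(r"TRIM\(TRAILING\s+'(.)'\s+FROM\s+(\w+)\)", re.I),
     lambda m: f"UDF_TRIM('TRAILING', '{m.group(1)}', {m.group(2)})"),
    # Bare CURRENT_DATE keyword -> Hive's CURRENT_DATE() function call
    (re.compile(r"\bCURRENT_DATE\b(?!\()"), lambda m: "CURRENT_DATE()"),
]

def teradata_to_hive(sql: str) -> str:
    """Apply each rewrite rule in order over the whole script."""
    for pattern, repl in RULES:
        sql = pattern.sub(repl, sql)
    return sql
```

Each rule mirrors one transformation seen in the auto-transformed sample query on the next slide.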
20
KEY CHALLENGES IN OFFLOAD
• Varied input sources
• Validating complete schema and data offload
• ANSI SQL incompatibility
• User-defined functions unavailable in the target system
• Lack of a unified view and UI
• Missing data quality checks
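When a source function such as Teradata's TRIM(TRAILING … FROM …) has no built-in Hive equivalent, its behavior must be re-implemented, typically as a custom Hive UDF (like the UDF_TRIM seen on the sample-query slide, which is written in Java). A Python sketch of the behavior such a UDF must reproduce:

```python
def trim_trailing(char: str, value: str) -> str:
    """Reference behavior for Teradata's TRIM(TRAILING <char> FROM <value>):
    strip all trailing occurrences of <char>, and only trailing ones."""
    return value.rstrip(char) if char else value
```

Porting and validating such semantics, one function at a time, is a large part of why UDF unavailability is listed as a key challenge.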
21
HOW WE BUILT THE OFFLOAD SOLUTION
Step 1: Identification
Step 2: Schema and Data Migration
Step 3: Logic Migration
Step 4: Data Quality Enhancement
Step 5: Transformed Code Execution

Reduced Time! Reduced Risk! Automation!
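One common check behind Step 4 is reconciling the offloaded data against the source. A minimal sketch, where `edw_counts` and `bdw_counts` are hypothetical stand-ins for per-table SELECT COUNT(*) results from each system:

```python
# Minimal post-offload reconciliation sketch (Step 4, data quality):
# compare per-table row counts between the EDW source and the BDW target.
def reconcile(edw_counts: dict, bdw_counts: dict) -> list:
    """Return (table, edw_rows, bdw_rows) for every mismatched table."""
    mismatches = []
    for table, edw_rows in edw_counts.items():
        bdw_rows = bdw_counts.get(table)  # None if the table never arrived
        if bdw_rows != edw_rows:
            mismatches.append((table, edw_rows, bdw_rows))
    return mismatches

# Illustrative counts: one table matches, one is missing on the BDW side.
edw = {"catalog_sales": 1_441_548, "warehouse": 5}
bdw = {"catalog_sales": 1_441_548}
```

Production checks would add column-level checksums or sampled value comparisons on top of raw counts.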
22
SAMPLE QUERY – AUTO TRANSFORMED
Teradata Query:
INSERT INTO month_wise_ship_agg
SELECT d_month_seq,
  SUBSTR(w_warehouse_name, 1, 20),
  TRIM(TRAILING '_' FROM sm_type),
  cc_name,
  SUM(CASE WHEN (cs_ship_date_sk - cs_sold_date_sk <= 30) THEN 1 ELSE 0 END) AS "30 days",
  SUM(CASE WHEN (cs_ship_date_sk - cs_sold_date_sk > 30) AND (cs_ship_date_sk - cs_sold_date_sk <= 60) THEN 1 ELSE 0 END) AS "31-60 days"
FROM catalog_sales, warehouse, ship_mode, call_center C, date_dim
WHERE cs_ship_date_sk = d_date_sk
  AND EXTRACT(YEAR FROM d_date) IN (2000, 2001)
  AND DAYNUMBER_OF_MONTH(d_date) > 1
  AND TD_QUARTER_BEGIN(d_date) <= CURRENT_DATE
  AND cs_warehouse_sk = w_warehouse_sk
  AND cs_ship_mode_sk = sm_ship_mode_sk
  AND cs_call_center_sk = cc_call_center_sk
  AND (C.cust_id, C.address) LIKE ANY (SELECT C1.cust_id, C1.address FROM Customer C1)
GROUP BY SUBSTR(w_warehouse_name, 1, 20), sm_type, cc_name, d_month_seq;

Hive Query:
INSERT INTO TABLE month_wise_ship_agg
SELECT d_month_seq,
  SUBSTR(w_warehouse_name, 1, 20) AS auto_c01,
  UDF_TRIM('TRAILING', '_', sm_type),
  cc_name,
  SUM(CASE WHEN (cs_ship_date_sk - cs_sold_date_sk <= 30) THEN 1 ELSE 0 END) AS 30_days,
  SUM(CASE WHEN (cs_ship_date_sk - cs_sold_date_sk > 30) AND (cs_ship_date_sk - cs_sold_date_sk <= 60) THEN 1 ELSE 0 END) AS 31_60_days
FROM catalog_sales, warehouse, ship_mode, call_center C, date_dim
WHERE cs_ship_date_sk = d_date_sk
  AND EXTRACT('YEAR', d_date) IN (2000, 2001)
  AND DAYNUMBER_OF_MONTH(d_date) > 1
  AND TD_QUARTER_BEGIN(d_date) <= CURRENT_DATE()
  AND cs_warehouse_sk = w_warehouse_sk
  AND cs_ship_mode_sk = sm_ship_mode_sk
  AND cs_call_center_sk = cc_call_center_sk
  AND EXISTS (SELECT * FROM Customer C1 WHERE C.cust_id LIKE C1.cust_id AND C.address LIKE C1.address)
GROUP BY SUBSTR(w_warehouse_name, 1, 20), sm_type, cc_name, d_month_seq;
23
OFFLOAD - KEY ADVANTAGES
• Optimize MPP and relational database resources for workloads
• Re-use millions of lines of code and $
• Avoid the learning curve and re-code, re-test cycles
• Seamless integration of downstream / upstream apps and reports
24
SUMMARY
To summarize, the following steps are recommended for creating a logical data warehouse:
• Establish your data strategy.
• Identify key components: Hadoop, MPP, Spark, NoSQL, etc.
• Segregate your workloads.
• Offload to low cost Hadoop where required.
• Leverage virtualization for key use cases.
• Establish data quality, SLAs, semantics, and metadata as key supporting pillars.
25
SIGN-OFF QUOTE
“In the end, there can be only 2 types of data warehouses: the logical data warehouse and the illogical data warehouse…”
27
APPENDIX 1
Product – URL
StreamAnalytix – http://streamanalytix.com/
Kyvos – http://www.kyvosinsights.com/
Kundera – http://bigdata.impetus.com/open_source_kundera
SQL Offload Solution – http://www.impetus.com/sites/impetus.com/impetus/brochures/ETL_Offloading_Datasheet.pdf
Ankush – http://bigdata.impetus.com/ankush
Jumbune – http://www.jumbune.org/