© 2005 BearingPoint, Inc.
DW Introduction
Data Warehouse Introduction
Kenneth Domantay - Senior Manager
Data and Knowledge Management
Table of Contents
DW Introduction
DW Architectures
DW Implementation Considerations
“The Data Warehouse”
- Origin to Architecture -
Definition:
Data Warehousing is a process, not a product.
It is an approach to “properly” assemble, validate, consolidate and manage data from various sources. It allows business questions to be answered that could not be answered before.
It evolves through ‘iterations’; it is not a one-time process.
Characteristics of a Data Warehouse
A Data Warehouse:
• Is Subject vs. Application oriented
• Contains Integrated Data
• Is Nonvolatile - limited if any Updates / Deletes
• Contains Detail and Summary Data
• Contains Current / Historical Data
• Is Time variant - Range of Time periods
Contrasting Environments
Attribute | OLTP | Data Warehousing
Business Need | Transactional | Analytical
Query | Simple | Complex
Timeframe | Point-in-Time | Historical
Business Question | Known | Unknown
Environment | Static | Dynamic
Usage | Structured | Unstructured
Cost Justification Examples for a DW
Business Drivers
• Improved management information
– spreads & margins, asset utilization, asset quality, overhead control
• Improved customer quality
– profit and risk
• Improved marketing effectiveness
– focussed, efficient, proactive, planned
• Etc.
Technology Drivers
• Enables offloading of volumes from mainframes
– reduced costs, improved responsiveness
• Reduced maintenance effort and costs
– reduced development costs
• Etc.
Data Warehouse Architectures
- Enterprise
- Data Mart
- ODS
- Active DW
Enterprise - DATA WAREHOUSE ARCHITECTURE
[Diagram: data flows through four layers: SOURCE DATA LAYER, DATA ACQUISITION LAYER, DATA MANAGEMENT LAYER, USER ACCESS LAYER.
Components shown: Internal Data, External Data, Data Entry (ASCII, Excel, etc.), MOM, Staging Area or ODS, Data Warehouse, Datamarts; Reports, OLAP, Data Mining, Knowledge Discovery, etc. for Business Users.
Roles shown: Source System Analyst; Data Acquisition Developer; Business Analyst, Data Modeler, DBA, OLAP Developer.]
Data Marts
[Diagram: the same four-layer data warehouse architecture as on the Enterprise slide, repeated here for the Data Marts topic.]
Data Marts - Characteristics
Departmental use
Normally point-solution based (e.g. profitability)
Normally not enterprise level (i.e. few subject areas)
Focus on one or a few problems
Easiest to build
Often confused with Multi Dimensional Database (MDDB)
Departmental Data Marts
Advantages
• Intuitive Data Navigation Functionality
• Faster Query Performance
Issues
• No Single Version of the Truth
• Lacks Enterprise Model
• Limited Cross-Application Analysis
• Little or No Data Transformation
• Limited Data Set
Independent versus Dependent Data Mart
An independent data mart is an isolated copy of existing data from operational and/or external systems, specially organized to serve a specific purpose. It typically services a department or specific group of users.
• No DW involved
A dependent data mart is an integral subset of a data warehouse, organized by subject area, to enhance access and performance. It sources data from the Enterprise Data Warehouse, so its data remains consistent with the data warehouse to which it is connected.
• DW involved
• Hub & Spoke
A success story becomes a failure
Easy → Harder → 4+ subject areas = Problems
3+ years later = Re-architect / Redo
Clients may spend millions of dollars to recreate current/past problems
* Data Mart Consolidation Sales Opportunity
Why Data Marts Fail - General
• You need to know the questions and answers in advance. New or changed requirements make the mart hard to change or extend
• Little to no data transformation
– Dirty data is the biggest challenge
– How do you know how dirty your data is?
– Issues are left for later and for someone else (9 to 12 months later)
• Usually rely on tools to hide the problems, until it’s too late
• Architecture does not support long-term corporate goals (non-enterprise solution)
• Higher total cost of ownership to maintain, change and fix
Operational Data Store (ODS)
[Diagram: the same four-layer data warehouse architecture as on the Enterprise slide, repeated here for the ODS topic.]
ODS Vs. DW - High level
Characteristic | Operational Data Store | Data Warehouse
Orientation | subject oriented (normally) | subject oriented
Integration | integrated | integrated
Operational data | may contain operational data | should not contain operational data
Volatility | volatile (update is normal) | nonvolatile (update is not normal)
History | limited history data | historical data (1+ years)
Level of detail | only detailed data | detailed and summary data
At first glance it may appear that the data warehouse and the ODS are the same thing. They are decidedly not the same thing.
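The volatile versus nonvolatile distinction can be illustrated with a minimal Python sketch. The class names `OperationalDataStore` and `DataWarehouse`, the account key and the sample dates are hypothetical and do not come from the slides; the only point is that an ODS overwrites the current value while a warehouse appends dated history.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class OperationalDataStore:
    """Volatile: holds only the current value per key; updates overwrite."""
    current: dict = field(default_factory=dict)

    def apply(self, account_id: str, balance: float) -> None:
        self.current[account_id] = balance  # update in place is normal


@dataclass
class DataWarehouse:
    """Nonvolatile: every load appends a dated row; nothing is overwritten."""
    rows: list = field(default_factory=list)

    def load(self, as_of: date, account_id: str, balance: float) -> None:
        self.rows.append((as_of, account_id, balance))

    def history(self, account_id: str) -> list:
        return [(d, b) for d, a, b in self.rows if a == account_id]


ods = OperationalDataStore()
dw = DataWarehouse()
for as_of, balance in [(date(2005, 1, 31), 100.0), (date(2005, 2, 28), 250.0)]:
    ods.apply("ACC-1", balance)
    dw.load(as_of, "ACC-1", balance)

print(ods.current["ACC-1"])  # only the latest balance survives
print(dw.history("ACC-1"))   # both month-end balances are retained
```

After two loads the ODS can answer only "what is the balance now", while the warehouse can also answer "what was it at the end of January".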
Active or Near-Real-Time Data Warehouse
(Next Generation)
What is an Active Data Warehouse?
• Active Data Warehousing
– Also for in-the-field / “tactical” decision makers
– Day-to-day decision making
– Tactical focus with strategic implications
– Real to near-real-time ETL and access (Fact/fiction = cutting corners)
• Traditional Data Warehousing
– More for “strategic” decision makers
– Long-term decision making
– Strategic focus
Business needs both strategic and tactical decision support capabilities.
Active Vs. Traditional DW
Traditional DW (Static)
• Strategic decisions focus
• Highly parameterized reporting, often using pre-built summary tables or data marts
• Limited feedback loop or event based usage
• Results sometimes hard to measure
• Daily, weekly, monthly data currency is acceptable; summaries often appropriate
• Power users, knowledge workers, internal users
Active DW (Dynamic)
• Also drives tactical decisions
• Complex data mining to discover new hypotheses vs. confirming prior ones
• High feedback loop and event based activity
• Results measured with operations
• Within minutes; only comprehensive detailed data is acceptable
• Operational staffs, call centers, external users
Active Data Warehouse Process/Flow
[Diagram: Operational Systems send continuous data feeds (event triggers, etc.) through Extraction & Transformation and Load into the Active Data Warehouse. Other labels shown: Finance, EIS, Web Site.
Analysis on the warehouse: automated data analysis, pre-defined queries, ad hoc analysis, data mining.
Actions are triggered automatically from event definitions and pre-defined rules, leading to tactical decisions.]
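One way to picture actions triggered from event definitions and pre-defined rules is the small Python sketch below. The `Rule` class, the two rules and the sample events are hypothetical illustrations, not part of the original deck or of any particular product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """A pre-defined rule: when `condition` holds for an event, emit `action`."""
    name: str
    condition: Callable[[dict], bool]
    action: str


def process_feed(events: list, rules: list) -> list:
    """Evaluate every incoming event against every rule, in arrival order."""
    triggered = []
    for event in events:
        for rule in rules:
            if rule.condition(event):
                triggered.append((event["id"], rule.action))
    return triggered


# Hypothetical rules and events, for illustration only.
rules = [
    Rule("low_stock",
         lambda e: e["type"] == "sale" and e["stock_left"] < 10,
         "reorder"),
    Rule("large_withdrawal",
         lambda e: e["type"] == "withdrawal" and e["amount"] > 5000,
         "alert_call_center"),
]
events = [
    {"id": 1, "type": "sale", "stock_left": 42},
    {"id": 2, "type": "sale", "stock_left": 3},
    {"id": 3, "type": "withdrawal", "amount": 9000},
]

actions = process_feed(events, rules)
print(actions)
```

Each rule checks the event type first, so a rule never reads a field that the event does not carry.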
DW Implementation Considerations
Implementation Methodology – MIKE
1 – Discovery – Req. Identification / Gathering
2 – Data Modeling
3 – Data Base Design
4 – ETL
5 – Information Reporting / Access / OLAP
6 – Data Mining
7 – Data Quality
8 – Metadata
9 – Infrastructure
MIKE Methodology Overview
Requirements Gathering / Identification
1 – Discovery
Business Discovery – High level
Business Discovery is a process through which organizations determine and validate:
• Key business objectives
• Key issues which influence the achievement of the business objectives
• Applications
– Business need for the application
– Potential for payback
• Implementation list of priorities: which CSFs, KPIs and reports to start with and end with
Information Discovery – Mid level
Information Discovery is a process through which organizations clarify scope and gain direction:
• Review business requirements results / priorities
• Identify information requirements for pain points
• Assess and validate data requirements, issues, gaps
• Design high-level business model
• Discuss project constraints and issues
• Determine which pain points to pursue for the initial project
Functional specifications can be driven from the results
2 - Data Modeling
What is a Logical Data Model
• A picture of the organization's data and its relationships (boxes, attributes & lines)
• A process to break down the complexity of an organization's data into manageable portions
• A tool used by “Modelers” to collect, discuss and validate data and relationships with “Business Users”
• One of the “first steps” in the creation of anything related to a Data Warehouse
• The blueprint for the construction of a database (Physical Data Model = PDM)
[Diagram: example logical data model with entities Customer, Loan Facility, Collateral, Loan Drawdown, Credit Check, Application, Transaction, Accounts, Loan Delinquent History.]
Why create an LDM
Portable method to identify, validate and integrate requirements
Helps resolve problems when integrating data from multiple systems:
• Lack of common corporate definitions/standards (what does it mean?)
• Data redundancy between different systems - one fact in one place (redundancy causes inaccurate and inconsistent reporting)
• Identify if data is supposed to exist - relationship validation process (nulls)
• Referential integrity (optional vs. mandatory & 0:1:M)
• Inadequate or nonexistent metadata
• Single view of customer and related information
• Full vs. partial view of requirements (current / future planning)
• ………
Bonus = Decreased development and maintenance time and cost (cheap to add/change - upstream vs. downstream in SDLC)
Logical Data Modeling - LDM
[Diagram: two modeling styles side by side.
3NF - General (Common): Application CIF, Extension CIF, Facility Account, Application Facility Account, Customer Account Relationship, Application Transaction, Application Monthly Summary, Application Daily Summary.
Star Schema - Customized (Complex): FACTS (Sales $, Revenue, Volume) surrounded by DIMENSIONS (Time, Geography, Product, Currency).]
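A star schema of the kind named on this slide can be sketched with Python's built-in `sqlite3` module. The table and column names (`sales_fact`, `time_dim`, and so on) and all sample rows are assumptions made for illustration; the Currency dimension is left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive context for the measures.
CREATE TABLE time_dim      (time_id INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE product_dim   (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE geography_dim (geo_id INTEGER PRIMARY KEY, region TEXT);
-- Fact table: one row per combination of dimension keys, holding the measures.
CREATE TABLE sales_fact (
    time_id      INTEGER REFERENCES time_dim(time_id),
    product_id   INTEGER REFERENCES product_dim(product_id),
    geo_id       INTEGER REFERENCES geography_dim(geo_id),
    sales_amount REAL,
    volume       INTEGER
);
INSERT INTO time_dim VALUES (1, '2005-01'), (2, '2005-02');
INSERT INTO product_dim VALUES (1, 'Loan'), (2, 'Card');
INSERT INTO geography_dim VALUES (1, 'North'), (2, 'South');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 100.0, 2),
    (1, 2, 1,  50.0, 1),
    (2, 1, 2, 200.0, 4),
    (2, 1, 1,  75.0, 1);
""")

# Typical star query: join the fact table to the dimensions of interest
# and aggregate the measures.
rows = conn.execute("""
    SELECT p.product_name, t.month, SUM(f.sales_amount), SUM(f.volume)
    FROM sales_fact f
    JOIN product_dim p ON p.product_id = f.product_id
    JOIN time_dim t    ON t.time_id = f.time_id
    GROUP BY p.product_name, t.month
    ORDER BY p.product_name, t.month
""").fetchall()
for row in rows:
    print(row)
```

The query never touches `geography_dim`, which shows why the design suits analysis: each question joins only the dimensions it needs.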
3 - Data Base Design
Transpose LDM to PDM
• Create the PDM design by translating the logical data model
• Identify and design database tables, columns and indexes
• Identify and design views (pre-canned SQL statements)
• Pursue capacity and “performance” assessment (later)
• Optimize PDM design for specific / actual reasons (later)
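As a rough illustration of these steps, the sketch below turns two logical entities into physical tables, an index and a view using Python's `sqlite3` module. The entity names, columns and the `customer_balance` view are hypothetical; a real PDM would follow the project's own model and target database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Logical entity "Customer" becomes a table with typed columns.
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
-- Logical entity "Account" with a mandatory relationship to Customer.
CREATE TABLE account (
    account_id  INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    balance     REAL NOT NULL DEFAULT 0
);
-- Index chosen for an expected access path (lookup by customer).
CREATE INDEX idx_account_customer ON account(customer_id);
-- View: a pre-canned SQL statement exposed under a name.
CREATE VIEW customer_balance AS
    SELECT c.customer_id, c.name, SUM(a.balance) AS total_balance
    FROM customer c JOIN account a ON a.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name;
INSERT INTO customer VALUES (1, 'David Jones');
INSERT INTO account VALUES (10, 1, 100.0), (11, 1, 40.5);
""")

result = conn.execute("SELECT name, total_balance FROM customer_balance").fetchall()
objects = set(conn.execute("SELECT type, name FROM sqlite_master").fetchall())
print(result)
```

Users query the view by name and never see the join, which is the sense in which a view is a pre-canned statement.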
4 - Extraction, Transformation, Load (ETL)
DW ETL Flow
[Diagram: Operational Systems → Extract → Transform → Load → Data Warehouse (subject-oriented, historical) → Informational Processing]
Transforming Legacy Data to Data Warehouse Data:
• Extract
• Integrate
• Summarize
• Filter
• Convert
• Set default values
• Restructure
• Reformat
• Establish time variance
• Create consistency
60-80% of time and effort is spent on ETL
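Several of the transformation steps listed above can be sketched in a few lines of Python. The field names, the `COUNTRY_CODES` lookup, the `TEST` filter and the sample rows are all hypothetical; a list stands in for the warehouse table.

```python
from collections import defaultdict
from datetime import date

# Hypothetical legacy extract: inconsistent codes, a missing value, a test row.
legacy_rows = [
    {"cust": "C1", "country": "H.K.", "amount": "100.50"},
    {"cust": "C1", "country": "HK", "amount": "20"},
    {"cust": "C2", "country": None, "amount": "5.25"},
    {"cust": "TEST", "country": "HK", "amount": "0"},
]
COUNTRY_CODES = {"HK": "HK", "H.K.": "HK", "Hong Kong": "HK"}


def transform(rows, load_date):
    out = []
    for row in rows:
        if row["cust"] == "TEST":            # Filter: drop non-business rows
            continue
        out.append({
            "customer_id": row["cust"],      # Restructure: rename to target column
            # Create consistency, and set a default when the code is unknown
            "country": COUNTRY_CODES.get(row["country"], "UNKNOWN"),
            "amount": float(row["amount"]),  # Convert: text to number
            "load_date": load_date,          # Establish time variance
        })
    return out


def summarize(rows):
    """Summarize: total amount per customer and load date."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["customer_id"], row["load_date"])] += row["amount"]
    return dict(totals)


warehouse = []                               # Load target
detail = transform(legacy_rows, date(2005, 1, 31))
warehouse.extend(detail)                     # Load detail rows
summary = summarize(detail)
print(summary)
```

Even in this toy version most of the code is transformation rather than extract or load, which is consistent with the effort estimate on the slide.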
Transformation challenges
Data Integrity
• No time basis for data
• Inconsistent extract criteria
• Missing / inconsistent data
• No common source of data
• Uncontrolled use of external data
Reduced Productivity
• Extended time to do analysis
• Customized extract programs
• Tedious activities for IT staff
5a - Information Reporting
Decision Making Evolution Process
STAGE 1: WHAT happened? (Pre-defined Reports)
– “Show Nokia Hand Phone models 20% or more below plan”
– Ad Hoc Queries Provide the Value
STAGE 2: WHY did it happen? (Ad Hoc Queries)
– “Show Nokia Hand Phone models 20% or more below plan having zero inventory”
– Evolving Queries Are More Complex
STAGE 3: WHAT WILL happen? (Analytical Modeling)
– Data mining, Forecasting is Evolutionary
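The step from the first example question to the second can be sketched in Python. The model names and figures below are invented for illustration and are not real product data; the second function simply narrows the result of the first.

```python
# Hypothetical data: units sold vs. plan and stock on hand per handset model.
models = [
    {"model": "A100", "plan": 1000, "actual": 750, "inventory": 0},
    {"model": "B200", "plan": 500, "actual": 480, "inventory": 120},
    {"model": "C300", "plan": 800, "actual": 600, "inventory": 35},
]


def below_plan(rows, threshold=0.20):
    """WHAT happened? Models at least `threshold` below plan."""
    return [r for r in rows if (r["plan"] - r["actual"]) / r["plan"] >= threshold]


def below_plan_no_stock(rows, threshold=0.20):
    """WHY did it happen? The same set, narrowed to zero inventory."""
    return [r for r in below_plan(rows, threshold) if r["inventory"] == 0]


stage1 = [r["model"] for r in below_plan(models)]
stage2 = [r["model"] for r in below_plan_no_stock(models)]
print(stage1, stage2)
```

The second question reuses the first and adds one condition, which is how ad hoc queries tend to evolve once the initial report raises a follow-up.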
Categorizations - Information Reporting
Organization level | Reporting category
Strategic | EIS
Tactical | DSS - OLAP
Operational | Application/Operational Reporting
EIS Perspective
• Performance against strategic objectives is measured and reported:
– By how the executive works
– Involves pre-defined requirements and formats
– “Minimal” hands-on computer activity is required
– Graphical performance indicators
– Performance trend information
– Measurement data and bar charts
DSS - Perspective
• Performance against strategic objectives
– Minimal pre-defined format
– Report writing capabilities
– Analytic features
– Ad hoc / query functionality
– Cross-dimensional calculations
– Filtering
– Hands-on computer activity required!
5b - Information Access
Information Access Tools - Categories
Information Access Tools can be classified into the following types:
1 - Reporting & Query tools
– Multidimensional
– Spreadsheets
– Data Visualization
– Development (Visual Basic, Powerbuilder)
– EIS
2 - Decision Support Systems (DSS) with:
– Multidimensional OLAP - MOLAP
– Relational OLAP - ROLAP
– Database OLAP - DOLAP
– Hybrid OLAP - HOLAP
3 - Executive Information Systems (EIS)
“Business Intelligence tools help organizations provide end users with improved access to data, enhancing their decision making ability.”
Information Access - Performance Ladder
[Diagram: performance ladder of access tools, in the order shown: EIS (OLAP) MDDB; DSS (OLAP) MDDB; DSS (OLAP) Relational; Report (SQL) Access Tool.
Axes: Tool Efficiency against Data Access Activity “within Tool” (Queries / Reporting).
Static = higher performance; Dynamic = low to medium performance.]
5c – OLAP (On-Line Analytical Processing)
OLAP - General Characteristics
• Allow users to view data in multiple dimensions (multidimensional analysis)
• Data is logically organized as multidimensional arrays (cubes)
• Architected to quickly manipulate and display data in different combinations using an OLAP engine
[Diagram: example cube with dimensions Customer, Product and Month. Labels shown: UM, LR, R/3, R/2, BW, APO; Smith, Miller, C&Y, KDS; 01.’98, 02.’98, 03.’98. Accompanied by sample monthly charts (Jan to Nov, Series1).]
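The idea of data organized as a cube and displayed in different combinations can be sketched in plain Python. The cell values are invented, and the assignment of the slide's labels to the Customer, Product and Month dimensions is a guess made for illustration.

```python
from collections import defaultdict

# Hypothetical cells of a 3-dimensional cube: (customer, product, month) -> sales.
cube = {
    ("Smith", "R/3", "01.98"): 10,
    ("Smith", "BW", "01.98"): 5,
    ("Miller", "R/3", "01.98"): 7,
    ("Miller", "R/3", "02.98"): 3,
    ("Smith", "APO", "02.98"): 4,
}
DIMENSIONS = ("customer", "product", "month")


def roll_up(cells, keep):
    """Aggregate the cube over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[i] for i in idx)] += value
    return dict(totals)


def slice_(cells, dimension, member):
    """Fix one dimension to a single member (a 'slice' of the cube)."""
    i = DIMENSIONS.index(dimension)
    return {k: v for k, v in cells.items() if k[i] == member}


by_customer = roll_up(cube, ["customer"])
by_product_month = roll_up(cube, ["product", "month"])
jan_by_customer = roll_up(slice_(cube, "month", "01.98"), ["customer"])
print(by_customer)
```

The same cells answer three different questions without any change to the stored data; an OLAP engine does this at scale, typically with pre-computed aggregates.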
How does OLAP fit ?
Data Warehouse / Data Mart
OLAP
e-intelligence
Balanced Score Card
6 - Data Mining
Data Mining
Data Mining = Concept and/or Technology
Provides insight & understanding
• Identify patterns, relationships, rules
Predictive analysis
• Build forecasting models from generated rules
Example rule: when Age > 35 & married & ... then
• buys gold card - 40%
• needs schools plan - 32%
• needs family a/c - 63%
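A percentage attached to a rule of this kind can be read as the share of matching customers who show the outcome. The Python sketch below computes that share for one rule; the customer records are invented and deliberately contrived so that the result is 40%, and the function name `rule_confidence` is hypothetical.

```python
# Hypothetical customer records; the attributes and values are invented.
customers = [
    {"age": 40, "married": True, "gold_card": True},
    {"age": 52, "married": True, "gold_card": False},
    {"age": 38, "married": True, "gold_card": True},
    {"age": 45, "married": True, "gold_card": False},
    {"age": 29, "married": True, "gold_card": True},
    {"age": 61, "married": False, "gold_card": False},
    {"age": 36, "married": True, "gold_card": False},
]


def rule_confidence(rows, antecedent, consequent):
    """Share of rows matching `antecedent` that also satisfy `consequent`."""
    matched = [r for r in rows if antecedent(r)]
    if not matched:
        return None  # the rule never fires on this data
    return sum(1 for r in matched if consequent(r)) / len(matched)


confidence = rule_confidence(
    customers,
    antecedent=lambda r: r["age"] > 35 and r["married"],
    consequent=lambda r: r["gold_card"],
)
print(confidence)
```

Data mining tools search for such rules automatically; this sketch only scores a rule that has already been proposed.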
Data Mining - Considerations
Data mining
• Get answers for difficult to “visualize” questions
• Can be manual (using OLAP tools) or use purpose-built tools (data mining tools)
• Solutions, e.g. market segmentation, decision tree (understanding decision process)
• Various techniques available (statistical, neural network, genetic algorithms, etc.)
Issues
• Availability of clean data is the key
• Normally considered “after” the Data Warehouse is mature
7 - Data Quality
Problems with Legacy Data
• Data fragmented across multiple systems and platforms
• Extensive data redundancy between different application systems
• Lack of corporate data standards
• Inadequate or nonexistent metadata
• User perception of data quality not based on facts
• User perception that a warehouse will fix data problems
• Missing data from operational systems
• Data integrity - inconsistent, incorrect, incompatible, etc.
Data Integrity Problems - Examples
Same HK ID number, different name spellings
• David Jones; David Johns; David G. Jones etc.
Use of old (non-standard) address codes
• HK, H.K., Hong Kong, SARHK, etc.
Multiple ways to denote company name
• BP, BP Ltd, Bearing Point
Different account numbers generated by different applications for the same customer
Invalid product codes collected at point of sale
• Manual entry leads to mistakes
• “In case of a problem use 999999999”
Required fields left blank
• No enforcement of data collection rules
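Checks for some of these problems can be sketched in Python. The record layout, the `REGION_SYNONYMS` lookup and the sample rows are hypothetical; the placeholder code 999999999 is taken from the slide's own example.

```python
from collections import defaultdict

# Hypothetical source records showing the kinds of problems listed above.
records = [
    {"hk_id": "A123", "name": "David Jones", "region": "HK", "product": "100234567"},
    {"hk_id": "A123", "name": "David Johns", "region": "H.K.", "product": "999999999"},
    {"hk_id": "B456", "name": "Mary Chan", "region": "Hong Kong", "product": ""},
]
REGION_SYNONYMS = {"HK": "HK", "H.K.": "HK", "HONG KONG": "HK", "SARHK": "HK"}
SENTINEL_PRODUCT = "999999999"


def conflicting_names(rows):
    """IDs that appear with more than one spelling of the name."""
    names = defaultdict(set)
    for r in rows:
        names[r["hk_id"]].add(r["name"])
    return {k: sorted(v) for k, v in names.items() if len(v) > 1}


def standardize_region(value):
    """Map known variants to one code; return None for unrecognized values."""
    return REGION_SYNONYMS.get(value.strip().upper())


def product_issues(rows):
    """Row indexes whose product code is blank or the 'problem' placeholder."""
    return [i for i, r in enumerate(rows) if r["product"] in ("", SENTINEL_PRODUCT)]


name_conflicts = conflicting_names(records)
regions = [standardize_region(r["region"]) for r in records]
bad_products = product_issues(records)
print(name_conflicts, regions, bad_products)
```

Checks like these only detect the problem; deciding which spelling or code is correct usually needs a business rule or a data steward.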
8 - Metadata
Metadata
• Data about data (134584 = data; Customer ID definition = metadata)
• Defines data structures, definition of measures, transformation rules, etc.
• Used to understand how and what data is stored
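The difference between a stored value and the metadata that explains it can be shown with a small Python sketch. The `ColumnMetadata` class and its field values are hypothetical; only the value 134584 and the idea of a Customer ID definition come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMetadata:
    """Describes a stored column: what it means and where it came from."""
    name: str
    data_type: str
    business_definition: str
    source_system: str
    transformation_rule: str


# Hypothetical metadata entry; the value 134584 below is the data it describes.
customer_id_meta = ColumnMetadata(
    name="customer_id",
    data_type="INTEGER",
    business_definition="Unique identifier assigned to each customer",
    source_system="CIF",
    transformation_rule="copied unchanged from the source key",
)
row = {"customer_id": 134584}


def describe(record, catalog):
    """Pair each stored value with the metadata that explains it."""
    return {col: (value, catalog[col].business_definition)
            for col, value in record.items()}


catalog = {customer_id_meta.name: customer_id_meta}
described = describe(row, catalog)
print(described)
```

Without the catalog entry, 134584 is just a number; with it, a user knows what the value means and how it reached the warehouse.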
Metadata – Many Islands
[Diagram: each tool keeps its own island of metadata (M): ETL development activity (automated ETL tool), data mining, DSS, EIS, MDDB, data model tool, and data stores (DB2, Oracle, etc.).]
Meta Data - Views
[Diagram: metadata repository seen through a Business View and a Technical View.
Elements shown: Business Subject Areas, Business Items, Keywords, Synonyms, Stewards, Locations; DSS Queries, DSS Reports, DSS Access Tools, DSS Views; Data Store (DW / Data Mart / MDDB), Data Store Columns; Construction Environment with Source and Target Tables, Columns, Mapping Groups, Transformations.
Activities shown: modeling, physical design, mapping, DDL, DML, transformation, technically enabled.
Base: ETL Automated Tools / Metadata Repository.]
9 - Infrastructure (Ongoing)
System Management & Operations
Organisational issues
• Roles, ops procedures, security, changes, etc.
Systems Management
• Resources management/utilisation
• Configuration & Change management
• Distributed data warehouse
• Operations management (archive/purge, backup, etc.)
Policies
• SLAs & Charge-back
• Data Flowback policy
• Time-lag on updates (OLTP vs DSS)
• Responsibilities and roles
The END !
PCCW - KSD