introduction to data vault - dama oregon 2012
DESCRIPTION
DAMA, Oregon Chapter, 2012 presentation - an introduction to Data Vault modeling. I will be covering parts of the methodology, comparison and contrast of issues in general for the EDW space. Followed by a brief technical introduction of the Data Vault modeling method.After the presentation i I will be providing a demonstration of the ETL loading layers, LIVE!You can find more on-line training at: http://LearnDataVault.com/trainingTRANSCRIPT
1
IntroductionData Vault Model &
Methodology© Dan Linstedt, 2011-2012 all rights reserved
Prepared for: DAMA Oregon, July 2012
2
Who’s Using It?
3
The Experts Say…“The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework.” Bill Inmon
“The Data Vault is foundationally strong and exceptionally scalable architecture.”
Stephen Brobst
“The Data Vault is a technique which some industry experts have predicted may spark a revolution as the next big thing in data modeling for enterprise warehousing....” Doug Laney
4
More Notables…
“This enables organizations to take control of their data warehousing destiny, supporting better and more relevant data warehouses in less time than before.”
Howard Dresner
“[The Data Vault] captures a practical body of knowledge for data warehouse development which both agile and traditional practitioners will benefit
from..”Scott Ambler
5
Agenda• Introduce Yourselves…• What is a Data Vault? Where does it come from?• Pros & Cons of Data Modeling for EDW• Current EDW Issues & Pains• Consequences of Implementing the Pains…• How do we “Fix” This?• Keys to Success• When “NOT” to use a Data Vault• Ontologies and Data Vault• A Working Example• Query Performance (PIT & Bridge)• Conclusion (break)• Live Demo
6
Introduce Yourselves• Your Expectations?• Your Questions?• Your Background?• Areas of Interest?• What are the top 3 pains your
EDW/BI solution is experiencing?
• About Me…o http://www.LinkedIn.com/dlinstedt
• Learn More Data Vault on-line at:o http://LearnDataVault.com/training
7
Where did it come from? What is it?
Defining the Data Vault Space
8
Data Warehousing Time Line
The Data Vault Model & Methodology took 10 years of R&D to become consistent, flexible, and scalable.
9
What IS a Data Vault? (Business
Definition)
• Data Vault Modelo Detail orientedo Historical traceabilityo Uniquely linked set of
normalized tableso Supports one or more
functional areas of business
• Data Vault Methodology– CMMI, Project Plan– Risk, Governance, Versioning– Peer Reviews, Release Cycles– Repeatable, Consistent,
Optimized– Complete with Best Practices
for BI/DW
• Data Vault Architecture– 3 Tier Architecture (for
including Batch & Unstructured Data)
– 2 Tier Architecture (for Real-Time only)
10
The Data Vault Model
Customer
Sat
Sat
Sat
F(x)
Customer
Product
Sat
Sat
Sat
F(x)
Product
Order
Sat
Sat
Sat
F(x)
Order
Elements:•Hub•Link•Satellite
Link
F(x)
Sat
Records a history of the interaction
Hub = List of Unique Business KeysLink = List of Relationships, AssociationsSatellites = Descriptive Data
11
Data Vault Methodology
1
2
3
4
5
Process unpredictable and poorly controlled
Risk assessments / analysis, managed processes, basic alignment efforts
Defined Business Processes, Defined Goals, Defined Objectives
Metrics, Estimates vs Actuals, Function Point Analysis, Identification of broken processes
Optimized business processes, repeatable, scalable, fault-tolerant. Automatable (generatable)
Follows: SEI/CMMI Level 5, PMP, Six Sigma, TQM, and Agile elements
12
Data Vault Architecture
Sales
Finance
Contracts
StagingEDW
(Data Vault)
StarSchemas
ErrorMarts
ReportCollections
Enterprise BI SolutionSOA
(real-time)
(batch)
(batch)
ComplexBusiness
Rules
The business rules are moved closer to the business, improving IT reaction time, reducing cost and minimizing
impacts to the enterprise data warehouse (EDW)
• Repeatable• Consistent• Fault-tolerant• Supports phased release
• Scalable• Auditable
FUNDAMENTAL GOALS
UnstructuredData
(Hadoop NoSQL)
13
Star Schemas, 3NF, Data Vault:
Pros & ConsDefining the Data Vault Space
Why NOT use Star Schemas as an EDW?Why NOT use 3NF as an EDW?
Why NOT use Data Vault as a Data Delivery Model?
14
Star Schema Pros/Cons as an EDW
PROS• Good for multi-dimensional
analysis• Subject oriented answers• Excellent for aggregation points• Rapid development /
deployment• Great for some historical
storage
CONS• Not cross-business functional• Use of junk / helper tables• Trouble with VLDW• Unable to provide integrated
enterprise information• Can’t handle ODS or
exploration warehouse requirements
• Trouble with data explosion in near-real-time environments
• Trouble with updates to type 2 dimension primary keys
• Trouble with late arriving data in dimensions to support real-time arriving transactions
• Not granular enough information to support real-time data integration
15
3nf Pros/Cons as an EDWPROS• Many to many linkages• Handle lots of information• Tightly integrated information• Highly structured• Conducive to near-real time
loads• Relatively easy to extend
CONS• Time driven PK issues• Parent-child complexities• Cascading change impacts• Difficult to load• Not conducive to BI tools• Not conducive to drill-down• Difficult to architect for an
enterprise• Not conducive to spiral/scope
controlled implementation• Physical design usually
doesn’t follow business processes
16
Data Vault Pros/Cons as an EDW
PROS• Supports near-real time and
batch feeds• Supports functional business
linking• Extensible / flexible• Provides rapid build / delivery
of star schema’s• Supports VLDB / VLDW• Designed for EDW• Supports data mining and AI• Provides granular detail• Incrementally built
CONS• Not conducive to OLAP
processing• Requires business
analysis to be firm• Introduces many join
operations
17
Analogy: The Porsche, the SUV and the Big Rig
• Which would you use to win a race?• Which would you use to move a house?• Would you adapt the truck and enter a race with Porches and expect to
win?
18
Current EDW Issues and Pains
Business Rule Processing, Lack of Agility, and Future proofing your new solution
19
Current EDW Project IssuesThis is NOT what you want happening to your project!
THE GAP!!
20
2 Tier EDW Architecture
• Quality routines• Cross-system dependencies• Source data filtering• In-process data manipulation
• High risk of incorrect data aggregation• Larger system = increased impact• Often re-engineered at the SOURCE• History can be destroyed (completely re-computed)
Sales
Finance
Contracts
Staging(EDW)
StarSchemas
Enterprise BI Solution
(batch)
Conformed DimensionsJunk Tables
Helper TablesFactless Facts
ComplexBusiness
Rules+Dependencies
Complex Business Rules #2
Staging + History
21
#1 Cause of BI Initiative Failure
Re-EngineeringFor
Every Change!Let’s take a look at one example…
22
Re-Engineering
Customer
CustomerTransactions
Sales
Finance
Current Sources
Source
Join
BusinessRules
Data Flow (Mapping)
CustomerPurchases
** NEW SYSTEM**
IMPACT!!
23
Federated Star Schema Inhibiting
Agility
Time
Effort& Cost
High
Low
Start MaintenanceCycle Begins
Changing and Adjusting conformed dimensions causes an exponential rise in the cost curve over time
RESULT: Business builds their own Data Marts!
Data Mart 1
Data Mart 2
Data Mart 3
The main driver for this is the maintenance costs, and re-engineering of the existing system which occurs for each new “federated/conformed” effort. This increases delivery time, difficulty, and maintenance costs.
24
What are the ROOT Causes?
The root causes of RE-ENGINEERING are:
1. Putting business rules up-stream of the EDW
2. Conforming Dimensions
25
Consequences of Implementing the
Pains…Business rules up-stream of your EDW and Conforming
Dimensions to store ALL history
26
Deformed Dimensions• Deformity: The URGE to continue “slamming data” into an existing
conformed dimension until it simply cannot sustain any further changes, the result: a deformed dimension and a HUGE re-engineering cost / nightmare.
Re-Engineering the Load Processes EACH
TIME!
…………………………………… ………………… ………………… ………………… ………………… ………………… …………………
V1Comple
xLoad
90 days, $125k
Business Change
………………………………………………………………………………………………………………………………………………………………………………………………………………………………
V2
Complex
Load
120 days, $200k
Business Change
………………… ………………… ………………… ………………… ………………… ………………… ………………… ………………… ………………… ………………… ………………… ………………… ………………… ………………… ………………… ………………… ………………… ………………… ………………… ………………… ………………… ……………………………………
V3
Complex
Load
180 days, $275k
Business Change
Business Wants a Change!Business said: Just add that to the existing Dimension, it will be easy right?
27
Dimension-itis• DimensionItis: Incurable Disease, the symptoms are the creation
of new dimensions because the cost and time to conform existing dimensions with new attributes rises beyond the business ability to pay…
…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………... …………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………... …………………...
…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...…………………...
…………………...…………………...…………………...…………………...…………………...…………………...…………………...
Business Says: Avoid the re-engineering costs, just “copy” the dimensions and create a new one for OUR department…
What happens when we (IT) give in to this? …
28
Result: Silo Data Junkyards!
• Business Says: Take the dimension you have, copy it, and change it… This should be cheap, and easy right?
Customer_IDCustomer_NameCustomer_AddrCustomer_Addr1Customer_CityCustomer_StateCustomer_ZipCustomer_PhoneCustomer_TagCustomer_ScoreCustomer_RegionCustomer_StatsCustomer_PhoneCustomer_Type
First Star
Customer_IDCustomer_NameCustomer_AddrCustomer_Addr1Customer_CityCustomer_StateCustomer_ZipCustomer_PhoneFact_ABCFact_DEFFact_PDQFact_MYFACT
Customer_IDCustomer_NameCustomer_AddrCustomer_Addr1Customer_CityCustomer_StateCustomer_ZipCustomer_PhoneCustomer_TagCustomer_ScoreCustomer_RegionCustomer_StatsCustomer_PhoneCustomer_Type
Customer_IDCustomer_NameCustomer_AddrCustomer_Addr1Customer_CityCustomer_StateCustomer_ZipCustomer_PhoneCustomer_TagCustomer_ScoreCustomer_RegionCustomer_StatsCustomer_PhoneCustomer_Type
Customer_IDCustomer_NameCustomer_AddrCustomer_Addr1Customer_CityCustomer_StateCustomer_ZipCustomer_PhoneCustomer_TagCustomer_ScoreCustomer_RegionCustomer_StatsCustomer_PhoneCustomer_Type
We built our own because IT costs too much…
SALES
We built our own because IT took too long…
FINANCE
We built our own because we needed customized dimension data…
MARKETING
Business ChangeTo Modify Existing Star = 180 days, $275k
29
Accountability In Question?Corporate Fraud Accountability Title XI consists of seven sections. Section 1101 recommends a name for this title as “Corporate Fraud Accountability Act of 2002”. It identifies corporate fraud and records tampering as criminal offenses and joins those offenses to specific penalties. It also revises sentencing guidelines and strengthens their penalties. This enables the SEC to temporarily freeze large or unusual payments.
Source 1
Source 2
Source 3
Business RulesChangeData!
Staging
HR Mart
Sales Mart
Finance Mart
Are changes to data ON THE WAY IN to the EDW equivalent to records tampering?
30
How do we “fix” this?Answer: Move the business rules downstream, AND no-
longer be forced to conform dimensions.
31
It’s Not Just a Data Model…
SUCCESS!
Model Methodology
32
Move the Business Rules
Downstream• No “Conforming” of Dimensions on the way in to the EDW• Hold on… We do distinguish between HARD and SOFT
business rules…
33
Hard & Soft Business RulesHard Business Rules Soft Business Rules
• Data Domain Alignment (Data Type Matching)
• Normalization (where necessary)
• System Column Computation
• Any requirement the business user states, that, when applied, CHANGES the data or CHANGES the meaning of the data (the grain or interpretation)
• Simple example that will knock the socks off your feet!
34
Progressive Agility and Responsiveness of
IT
Time
Effort& Cost
High
Low
Start MaintenanceCycle Begins
Foundational Base Built
New Functional Areas AddedInitial DV Build Out
Re-Engineering does NOT occur with a Data Vault Model. This keeps costs down, and maintenance easy. It also reduces complexity of the existing architecture.
35
NO Re-Engineering
Customer
CustomerTransactions
Sales
Finance
Current Sources
StageCopy
StageCopy
HubCustome
r
HubAcct
HubProduc
t
Link Transacti
on
Data Vault
CustomerPurchases
** NEW SYSTEM**
StageCopy
IMPACT!!
NO IMPACT!!!NO RE-ENGINEERING!
36
Keys to SuccessBringing the Data Vault to Your Project
37
Key: Flexibility
Adding new components to the EDW has NEAR ZERO impact to:• Existing Loading Processes• Existing Data Model• Existing Reporting & BI Functions• Existing Source Systems• Existing Star Schemas and Data Marts
No Re-
Engineeri
ng!
38
Case In Point:
Result of flexibility of the Data Vault Model allowed them to merge 3 companies in 90 days – that is ALL systems, ALL DATA!
39
Key: Scalability in Architecture
Scaling is easy, its based on the following principles• Hub and spoke design• MPP Shared-Nothing Architecture• Scale Free Networks
40
Case In Point:
Result of scalability was to produce a Data Vault model that scaled to 3 Petabytes in size, and is still growing today!
41
Key: Scalability in Team Size
You should be able to SCALE your TEAM as well!With the Data Vault methodology, you can:
Scale your team when desired, at different points in the project!
42
Case In Point:(Dutch Tax Authority)
Result of scalability was to increase ETL developers for each new source system, and reassign them when the system was completely loaded to the Data Vault
43
Key: Productivity
Increasing Productivity requires a reduction in complexity.The Data Vault Model simplifies all of the following:• ETL Loading Routines• Real-Time Ingestion of Data• Data Modeling for the EDW• Enhancing and Adapting for Change to the Model• Ease of Monitoring, managing and optimizing
processes
44
Case in Point:Result of Productivity was: 2 people in 2 weeks merged 3 systems, built a full Data Vault EDW, 5 star schemas and 3 reports.
These individuals generated:• 90% of the ETL code for moving the data
set• 100% of the Staging Data Model• 75% of the finished EDW data Model• 75% of the star schema data model
45
The Competing Bid?
The competition bid this with 15 people and 3 months to completion, at a cost of $250k! (they bid a Very complex system)
Our total cost? $30k and 2 weeks!
46
Results?
Changing the direction of the river takes less effort than stopping the flow
of water
47
Now you know WHY to use a Data Vault….
But are there times when you shouldn’t use a Data Vault?
48
When NOT to use the Data Vault
A review of some reasons why not to use a Data Vault Model
49
When NOT to Use the Data Vault
• You have:o a small set of point solution requirementso a very short time-frame for deliveryo To use the data one-time, then throw it awayo a single source system, single source applicationo A single business analyst in the entire company
• You do NOT have:o audit requirements forcing you to keep historyo multiple data center consolidation effortso near-real-time to worry abouto massive batch data to integrateo External data feeds outside your controlo Requirements to do trend analysis of all your datao Pain – that forces you to reengineer every time you ask for a
change to your current data warehousing systems
50
Ontologies & Data VaultHub, Link, Satellite - Definitions
51
Business Keys = Ontology
Business Keys should be arranged in an ontology
In order to learn the dependencies of the data
set
Drug Label Code
Product Number
Firm Name
NDA Application #
Drug Listing
Patent Use Code
Patent Number
Dose Form Code
NOTE: Different Ontologies represent different business
views of the data!
52
Associations = Ontological Hooks
Business Keys are associated by many linking factors, these links comprise the associations in the
hierarchy.
Product Number
Firm Name
NDA Application #
Drug ListingFirms Generate
Product Listings
Firms Manufacture
Products
Listings for Products are in NDA Applications
53
Descriptors = Context
Descriptors provide the context at a specific point in time – they are the warehousing portion of
the Data Vault
Firm Name Firm Locations
Drug ListingFirms Generate
Product Listings
Listing Formulatio
n
Product NumberFirms
Manufacture Products
ProductIngredientsStart & End of
manufacturing
54
A working ExampleNational Drug Codes + Orange Book of Drug Patent
Applications
http://www.accessdata.fda.gov/scripts/cder/ndc/default.cfm http://www.fda.gov/Drugs/InformationOnDrugs/ucm129662.htm
55
Hub Table Structures
SQN = Sequence (insertion order)LDTS = Load Date (when the Warehouse first sees the data)
RSRC = Record Source (System + App where the data ORIGINATED)
56
Link Table Structures
Note: A Link is really no different than a factless fact!
57
Satellite Table Structures
SQN = Sequence (parent identity number)LDTS = Load Date (when the Warehouse first sees the data)
LEDTS = End of lifecycle for superseded recordRSRC = Record Source (System + App where the data ORIGINATED)
58
In Review…• Data Vault is…
o A Data Warehouse Model & Methodologyo Hub and Spoke Designo Simple, Easy, Repeatable Structureso Comprised of Standards, Rules & Procedureso Made up of Ontological Metadatao AUTOMATABLE!!!
• Hubs = Business Keys• Links = Associations / Transactions• Satellites = Descriptors
59
Why do we build Links this way?
History Teaches Us…If we model for ONE relationship in the EDW, we BREAK the
others!
60
Portfolio
Customer
M
M
5 yearsFrom now X
Portfolio
Customer
M
1
10 Years ago X
Portfolio
Customer
1
MToday:
Hub Portfolio
Hub Customer
1
M
The EDW is designed to handle TODAY’S relationship, as soon as history is loaded, it breaks the model!
This situation forces re-engineering of the model, load routines, and queries!
61
History Teaches Us…If we model with a LINK table, we can handle ALL the
requirements!
Portfolio
Customer
M
M
5 years from now
Portfolio
Customer
1
MToday:
Portfolio
Customer
M
1
10 Years ago This design is flexible, handles past, present, and future relationship changes with NO RE-ENGINEERING!
Hub Portfolio
Hub Customer
1
M
LNKCust-Port
M
1
Base EDW Created in CorporateFinancials in USA
HubHub
SatSatSatSat
HubHub
SatSatSatSat
LinkLink
SatSatSatSat
Applying the Data Vault to Global DW
HubHub
SatSatSatSatLinkLink
Manufacturing EDW in China
HubHub
SatSatSatSat
Planning in Brazil
LinkLink
HubHub
SatSatSatSatLinkLink
62
63
Query PerformancePoint-in-time and Bridge Tables, overcoming query issues
64
PIT Table Architecture
Hub Custome
r
HubOrder
Hub Product
Link Line Item
SatelliteLine Item
Sat 1
Sat 2
Sat 3
Sat 4
PIT Sat
Sat 1
Sat 2
Sat 3
Sat 4
PIT Sat
Sat 1
Sat 2
PARENT SEQUENCELOAD DATE{Satellite 1 Load Date}{Satellite 2 Load Date}{Satellite 3 Load Date}{…}{Satellite N Load Date}
Satellite: Point In Time
PrimaryKey
65
PIT Table Example
SQN LOAD_DTS NAME1 10-14-2000 Dan L1 11-01-2000 Dan Linedt1 12-31-2000 Dan Linstedt
SAT_CUST_CONTACT_NAMESQN LOAD_DTS CELL1 10-14-2000 999-555-12121 10-15-2000 999-111-12341 10-16-2000 999-252-28341 10-17-2000 999.257-28371 10-18-2000 999-273-5555
SAT_CUST_CONTACT_CELLSQN LOAD_DTS ADDR1 08-01-200026 Prospect1 09-29-2000 26 Prosp St.1 12-17-2000 28 November1 01-01-2001 26 Prospect St
SAT_CUST_CONTACT_ADDR
SQN LOAD_DTS SAT_NAME_LDTS SAT_CELL_LDTSSAT_ADDR_LDTS1 08-01-2000 NULL NULL 08-01-20001 09-01-2000 NULL NULL 08-01-20001 10-01-2000 NULL NULL 09-29-20001 11-01-2000 11-01-2000 10-18-2000 09-29-20001 12-01-2000 11-01-2000 10-18-2000 09-29-20001 01-01-2001 12-31-2000 10-18-2000 01-01-2001
Snapshot Date
66
BridgeTable Architecture
Hub Seller
Hub Product
Link
Satellite
Sat 1
Sat 2
Sat 3
Sat 4
Bridge
Hub Parts
Link
Satellite
UNIQUE SEQUENCELOAD DATE{Hub 1 Sequence #}{Hub 2 Sequence #}{Hub 3 Sequence #}{Link 1 Sequence #}{Link 2 Sequence #}{…}{Link N Sequence #}{Hub 1 Business Key}{Hub 2 Business Key}{…}{Hub N Business Key}
Satellite: BridgePrimary
Key
67
Bridge Table Data Example
SQN LOAD_DTS SELL_SQN SELL_ID PROD_SQN PROD_NUMPART_SQN PART_NUM1 08-01-2000 15 NY*1 2756 ABC-123-9K 525 JK*2*42 09-01-2000 16 CO*24 2654 DEF-847-0L 324 MN*5-23 10-01-2000 16 CO*24 82374 PPA-252-2A 9938 DD*2*34 11-01-2000 24 AZ*25 25222 UIF-525-88 7 UF*9*05 12-01-2000 99 NM*5 81 DAN-347-7F 16 KI*9-26 01-01-2001 99 NM*5 81 DAN-347-7F 24 DL*0-5
Snapshot Date
Bridge Table: Seller by Product by Part
68
Conclusions
69
Where To Learn More• The Technical Modeling Book:
http://LearnDataVault.com/
• On-Line Training direct from me:http://LearnDataVault.com/training
• The Discussion Forums: & eventshttp://LinkedIn.com – Data Vault Discussions
• Contact me:http://DanLinstedt.com - web [email protected] - email
70
LIVEDEMONSTRATION
Physical Demonstration, Loading Processes and Execution