venugopal krishnan flexible dw models 2014 jul_ieg

Information Excellence informationexcellence.wordpress.com Harvesting Information Excellence Information Excellence 2014 Jul Knowledge Share Session Venugopal Krishnan, Sr. Consultant, TEG, TCS Flexible (Data warehousing) data models Hosted by

Upload: information-excellence

Post on 23-Jan-2015


Category:

Data & Analytics



DESCRIPTION

Flexible Data Warehouse Data Models

TRANSCRIPT

Page 1: Venugopal krishnan flexible dw models 2014 jul_ieg

Information Excellence informationexcellence.wordpress.com

Harvesting Information Excellence

Information Excellence2014 Jul Knowledge Share Session

Venugopal Krishnan, Sr. Consultant, TEG, TCS

Flexible (Data warehousing) data models

Hosted by

Page 2: Venugopal krishnan flexible dw models 2014 jul_ieg

Flexible (Data warehousing) data models

• Introduction to Temporal data
• Basic Concepts & Definitions
• Temporal & Bi-Temporal Representations
• Temporal Databases
• Temporal representations in DW
• Modeling Temporal data
• Anchor Modeling
• Table Elimination
• Data Vault Modeling
• Other flexible models
• NOSQL Data Models

Venugopal Krishnan: IEG Session 2014 Jul

Page 3: Venugopal krishnan flexible dw models 2014 jul_ieg

Venugopal Krishnan


Venugopal Krishnan is Senior Consultant with the Technology Excellence Group of Insurance & Healthcare Services Division, Tata Consultancy Services. Ltd., Bangalore.

Venu has a Master of Science in Mathematics from Mahatma Gandhi University, Kerala followed by a Post Graduate Certification In Computer & Software Engineering from S.E.R.C (Supercomputer Education & Research Center), Indian Institute of Science, Bangalore.

During his 24+ years of overall industry experience, Venu worked earlier with Flytxt Pvt. Ltd., Cognizant (USA & India), Oracle (USA), Emirates Airlines (Dubai, UAE), Tata Unisys Ltd., and Patni Computers in various capacities as Group Manager/Director, Project/Delivery/Program Manager, Senior Principal Consultant, Lead Analyst etc. for software services & implementation, product development and technology management related activities.

Venu’s primary focus areas are Database, Data management, Data architecture, and Data warehousing. For the past 16+ years, Venu has been architecting, designing and developing data warehouse & BI platforms for customers across the world. He has extensive experience and expertise in Oracle & related tools, data warehousing and data architecture.

Apart from work, Venu is interested in participating in technical forums and conducting training sessions in Oracle, Data warehousing and Data management related areas. He is associated with Oracle Users India Group, Information Excellence Group etc., and has been an active volunteer for many technical events/symposiums organized by the Information Excellence Group. On a personal note, Venu likes travelling and listening to classical music.

Venu is a core member of the Information Excellence Volunteer Team for the past three years, with significant commitment and contributions to the growth of the IEG community.

Venugopal Krishnan, Senior Consultant, Technology Excellence Group, Insurance & Healthcare Services Division, TCS

Page 4: Venugopal krishnan flexible dw models 2014 jul_ieg

FLEXIBLE DATA MODELS FOR DW (Data Architecture) By VENUGOPAL KRISHNAN Senior Consultant, TCS

Page 5: Venugopal krishnan flexible dw models 2014 jul_ieg

AGENDA
• Introduction to Temporal data
• Basic Concepts & Definitions
• Temporal & Bi-Temporal Representations
• Temporal Databases
• Temporal representations in Data warehousing
• Temporal vs SCD
• Modeling Temporal data
• Anchor Modeling
• Table Elimination
• Data Vault Modeling
• Other flexible models
• NOSQL Data Models

Page 6: Venugopal krishnan flexible dw models 2014 jul_ieg

Introduction to Temporal Data History data is important for Analysis. How to manage History data? How do we answer the following questions?

When were things like our data says they were? When did our data say that things were like that!?

Page 7: Venugopal krishnan flexible dw models 2014 jul_ieg

Introduction contd..

Temporal data Data that represents a state in time, such as the land-use patterns of Hong Kong in 1990, or total rainfall in Honolulu on July 1, 2009. Temporal data has a time period associated with it.

Temporal Data Collection Examples : 1.0 Data regarding the change of cropland worldwide from 1700 To 1992. The percentage changes over time.

2.0 Sea surface temperature changes with each successive month from 1997 to 2000.

3.0 Oil and Gas production rates changing over 1994.

Page 8: Venugopal krishnan flexible dw models 2014 jul_ieg

1994 time stamp of the oil and gas production of a production field in Wyoming in ArcMap. When visualized over time, the pie charts on the map indicate the changing oil and gas production rates from each producing well (red is gas in barrels of oil equivalent, and green is oil in barrels). The graph shows production through time for the entire field: gas (red), oil (green), and water (blue).

Temporal Data Collection Example: Oil and Gas production rate changes

Page 9: Venugopal krishnan flexible dw models 2014 jul_ieg

Healthcare: Patient histories need to be maintained

Insurance: Claims and accident histories are required

Finance: Stock price histories need to be maintained.

Personnel management: Salary and position history need to be maintained

Banking: Credit histories

Examples of Temporal data in different industries

Page 10: Venugopal krishnan flexible dw models 2014 jul_ieg

How does a temporal implementation differ from SCD?
The SCDs (types 1, 2 and 3) proposed by Kimball can be described as a poor man’s solution to the historization of dimensions.
• While SCDs are simple to understand and provide good response time, a change in a dimensional attribute effectively changes the context for all facts captured prior to the change.
• This can only be tracked by using temporal structures.
• The actual time of change is not captured in dimensions.
• Checking when it is OK to refer to which DWH IDs is not possible.
• Only temporal structures can efficiently handle early and late arriving facts.

Page 11: Venugopal krishnan flexible dw models 2014 jul_ieg

Basic Concepts
• Temporal data changes over time. When data changes over time in the real world, it is said to change from a real-world, business, or valid perspective.
• Changes can also be independent of the real-world and business perspective, e.g. data changes in paper records, computer files or databases. These are called changes from a transactional perspective.
• Data could change in the real world and not be changed in the database.
• Data may be changed in the database when it has not changed in the real world. So the two perspectives are orthogonal.
• The data in the database may be changed at the same time that it changes in the real world, but there are no guarantees!

Page 12: Venugopal krishnan flexible dw models 2014 jul_ieg

Basic Concepts Contd..

Questions:

Who were current clients on last May 1st? (Valid Time)

On last May 1st, who were listed as current clients? (Transaction Time)

The above are two different questions.

Valid Time – When were things like our data says they were!

Transaction Time – When did our data say that things were like that!

Page 13: Venugopal krishnan flexible dw models 2014 jul_ieg

Definitions
Temporal Data = Data which changes over time.
Temporal Data Structure = Data structure which stores a history of how data changed over time.
Valid Temporal Data = Data which changes over time from a real-world or business perspective.
Valid Temporal Data Structure = Data structure which stores a history of how data changed from a real-world or business perspective.
Transaction Temporal Data = Data which changes over time from a data storage device (or database, for convenience) perspective.
Transaction Temporal Data Structure = Data structure which stores a history of how data changed from a data storage device perspective.
Non-temporal Data = Data which does not change over time.
Non-temporal Data Structure = Data structure which does not store a history of how data changed from any perspective.

Page 14: Venugopal krishnan flexible dw models 2014 jul_ieg

Definitions Contd..... Bitemporal data is:

• Data which changes both from a real world or business perspective and from a database(transactional) perspective.

• Bitemporal Data = data which changes over two dimensions of time independently.

• Bitemporal Data Structure = data structure which stores a history of how data changed from two independent perspectives.

• The real world or business time is termed as VALID TIME.

• The database time is termed as TRANSACTION TIME.

Bitemporal data:

• Is the only way to have a complete audit trail of what you knew and when you knew it.

• Gives you a reproducible history of data from a business perspective.

• Provides very accurate data with full support for different types of corrections.

• Can alleviate the need for complex, convoluted, and subjective database design techniques as well as eliminate the need for redundant “snapshot” data stores.

Page 15: Venugopal krishnan flexible dw models 2014 jul_ieg

Example:

Consider the biography of John Laker (the addresses where John stayed from 1975 till 2001):
- Born on April 3, 1975 in the Kids Hospital, Medicine County.
- Son of Jack Laker and Jane Laker.
- Born in Smallville.
- Birth registration done on April 4, 1975.
- After graduation, started to live in Bigtown from August 26, 1994.
- Registered the address change on December 27, 1994.
- Passed away on April 1, 2001.
- Death reported and registered on the same day.
In a non-temporal model, we store the name and address in a table T(name, address) with name as the primary key. This model cannot store or handle the history of address changes.

Page 16: Venugopal krishnan flexible dw models 2014 jul_ieg

Non-Temporal Example contd...

Date | Real world status | Database information
April 3, 1975 | John is born | Nothing
April 4, 1975 | John's father officially reports the birth | John's information is inserted into the database (John lives in Smallville)
August 26, 1994 | After graduation, John moves to Bigtown but forgets to register his new address | John lives in Smallville
December 27, 1994 | John registers his new address | John's address is updated (John lives in Bigtown)
April 1, 2001 | John dies | Information is deleted (there is no person called John Laker)

Page 17: Venugopal krishnan flexible dw models 2014 jul_ieg

Temporal representation
The record has two additional fields, valid_from and valid_to. Valid_from is John's date of birth; valid_to is not yet known and may change in the future:
Person(John Laker, Smallville, 3-Apr-1975, ∞).

When John registers his move to Bigtown (on December 27, 1994), a new entry is made in the database, valid from 27-Aug-1994:
Person(John Laker, Bigtown, 27-Aug-1994, ∞).
The earlier record is updated with valid_to set to 26-Aug-1994:
Person(John Laker, Smallville, 3-Apr-1975, 26-Aug-1994).
When John dies, the database is updated again:
Person(John Laker, Bigtown, 27-Aug-1994, 1-Apr-2001).
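The valid-time bookkeeping described above (close the open row, append a new open row) can be sketched in a few lines of Python. This is only an illustration: the `PersonRow` class, the helper names, and the use of `date.max` for "∞" are assumptions, and intervals are treated as closed, matching the 26-Aug / 27-Aug dates in the records.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta

FOREVER = date.max  # assumed stand-in for the open-ended "∞" valid_to


@dataclass(frozen=True)
class PersonRow:
    name: str
    address: str
    valid_from: date
    valid_to: date


def change_address(history, name, new_address, effective):
    """Close the currently open row the day before `effective`, then add a new open row."""
    updated = []
    for row in history:
        if row.name == name and row.valid_to == FOREVER:
            row = replace(row, valid_to=effective - timedelta(days=1))
        updated.append(row)
    updated.append(PersonRow(name, new_address, effective, FOREVER))
    return updated


def address_on(history, name, day):
    """Return the address valid on `day` (closed interval), or None if no row covers it."""
    for row in history:
        if row.name == name and row.valid_from <= day <= row.valid_to:
            return row.address
    return None


history = [PersonRow("John Laker", "Smallville", date(1975, 4, 3), FOREVER)]
history = change_address(history, "John Laker", "Bigtown", date(1994, 8, 27))
```

After the change, `address_on(history, "John Laker", date(1990, 1, 1))` finds the Smallville row and a date in 2000 finds the Bigtown row, so earlier addresses stay queryable instead of being overwritten.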

Page 18: Venugopal krishnan flexible dw models 2014 jul_ieg

Bitemporal Representation
The temporal representation only depicts the business valid time, not the time the information was recorded in the database. Bitemporal representations also provide the transaction (recorded) time through two additional fields: Transaction_From and Transaction_To. The following records, in the form Person(name, address, Valid_From, Valid_To, Transaction_From, Transaction_To), illustrate the bitemporal representation:
Person(John Laker, Smallville, 3-Apr-1975, ∞, 4-Apr-1975, 27-Dec-1994).
Person(John Laker, Smallville, 3-Apr-1975, 26-Aug-1994, 27-Dec-1994, ∞).
Person(John Laker, Bigtown, 27-Aug-1994, ∞, 27-Dec-1994, 2-Feb-2001).
Person(John Laker, Bigtown, 27-Aug-1994, 1-Jun-1995, 2-Feb-2001, ∞).
Person(John Laker, Beachy, 1-Jun-1995, 3-Sep-2000, 2-Feb-2001, ∞).
Person(John Laker, Bigtown, 3-Sep-2000, ∞, 2-Feb-2001, 1-Apr-2001).
Person(John Laker, Bigtown, 3-Sep-2000, 1-Apr-2001, 1-Apr-2001, ∞).
The records with a transaction time starting on 2-Feb-2001 reflect a correction entered on that date: John had lived in Beachy from 1-Jun-1995 to 3-Sep-2000.
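To see how the two time axes answer different questions, here is a minimal sketch using Python's built-in `sqlite3` module and only the first three records above. The table and column names, the `'9999-12-31'` stand-in for "∞", and the half-open interval comparison (`from <= t < to`) are assumptions made for the illustration, not part of the slides.

```python
import sqlite3

INF = "9999-12-31"  # assumed stand-in for "∞"

# First three bitemporal records from the example, as ISO date strings.
records = [
    ("John Laker", "Smallville", "1975-04-03", INF, "1975-04-04", "1994-12-27"),
    ("John Laker", "Smallville", "1975-04-03", "1994-08-26", "1994-12-27", INF),
    ("John Laker", "Bigtown", "1994-08-27", INF, "1994-12-27", "2001-02-02"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (name TEXT, address TEXT, "
    "valid_from TEXT, valid_to TEXT, tx_from TEXT, tx_to TEXT)"
)
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?, ?, ?)", records)


def address(name, valid_on, known_on):
    """Addresses valid on `valid_on` according to what the database said on `known_on`."""
    cur = conn.execute(
        "SELECT address FROM person "
        "WHERE name = ? "
        "AND valid_from <= ? AND ? < valid_to "  # valid-time axis
        "AND tx_from <= ? AND ? < tx_to",        # transaction-time axis
        (name, valid_on, valid_on, known_on, known_on),
    )
    return [row[0] for row in cur]


as_recorded_then = address("John Laker", "1994-10-01", "1994-10-01")
as_recorded_later = address("John Laker", "1994-10-01", "1995-01-01")
```

Both calls ask about the same real-world day (1 October 1994). The first asks what the database said on that day (Smallville, because the move was not yet registered); the second asks what it said after the registration on 27 December 1994 (Bigtown).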

Page 19: Venugopal krishnan flexible dw models 2014 jul_ieg

Bitemporal implementation in Databases (SQL:2011 std)
1.0 Oracle 12c – Has a new feature called Temporal Validity. Uses a new PERIOD FOR clause.
e.g. ALTER table Dept ADD (v_start DATE, v_end DATE, PERIOD FOR vt(v_start,v_end)); (or PERIOD FOR <column>)
• vt is the period and is a hidden column.
• The details of the period are stored in the dictionary table SYS_FBA_PERIOD.
• Supports only conventional DMLs.
• Supports TEMPORAL FLASHBACK QUERY.
e.g: SELECT * FROM dept AS OF PERIOD FOR vt TO_DATE('2015-01-01', 'YYYY-MM-DD') ORDER BY deptno;
(or VERSIONS PERIOD FOR vt BETWEEN <start> AND <end> for a range)
• Temporal flashback queries are not enabled in a multitenant configuration.
• Oracle 12c does not support temporal joins and temporal aggregations.

Oracle 11g Workspace Manager - Version-enabled tables, valid time support
EXECUTE DBMS_WM.EnableVersioning ('employees', 'VIEW_WO_OVERWRITE', FALSE, TRUE); - Version-enables the table
CREATE TYPE WM_PERIOD AS OBJECT (validFrom TIMESTAMP WITH TIME ZONE, validTill TIMESTAMP WITH TIME ZONE); -- WM_PERIOD can be used to specify a valid time range for a version-enabled table.

Page 20: Venugopal krishnan flexible dw models 2014 jul_ieg

Bitemporal implementation in Databases
2.0 DB2 10

CREATE TABLE policy_info (
  policy_id CHAR(4) NOT NULL,
  coverage INT NOT NULL,
  bus_start DATE NOT NULL,
  bus_end DATE NOT NULL,
  sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
  sys_end TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
  create_id TIMESTAMP(12) GENERATED ALWAYS AS TRANSACTION START ID,
  PERIOD BUSINESS_TIME(bus_start, bus_end),
  PERIOD SYSTEM_TIME(sys_start, sys_end));

3.0 Teradata 13.0
CREATE MULTISET TABLE Prop_Owner (
  customer_number INTEGER,
  property_number INTEGER,
  property_VT PERIOD(DATE) NOT NULL AS VALIDTIME,
  property_TT PERIOD(TIMESTAMP(6) WITH TIME ZONE) NOT NULL AS TRANSACTIONTIME);

4.0 TimesDB (Oracle)
Supports Valid Time and Transaction Time fields in ANSI SQL

Page 21: Venugopal krishnan flexible dw models 2014 jul_ieg

Multi Temporal models
Tri-Temporal Data
• Adds a decision time to the valid and transaction times.
• Decision time describes the date and time a decision was made.
E.g: Scott becomes Manager. The decision to change the job description from “Analyst” to “Manager” is made on June 24, 2014. For the decision time, it is irrelevant when this change is entered into the system and also irrelevant when Scott officially becomes a Manager.

Page 22: Venugopal krishnan flexible dw models 2014 jul_ieg

Tri-Temporal Data Model

VT - Valid Time (with Temporal Validity)
DT - Decision Time (with Temporal Validity)
TT - Transaction Time (with Flashback Data Archive, FDA) (versions_startscn, versions_endscn)

Oracle 12c supports the tri-temporal feature.

Page 23: Venugopal krishnan flexible dw models 2014 jul_ieg

Temporal representations in Data warehousing
Rows in a dimension table are not associated with time. New rows are simply added. Changes in values of dimension rows with known source identifiers are either simply overwritten, or a new row with a new surrogate key (and the old source system ID) is added, based on the slowly changing dimensions concept. For some kinds of analysis, dimensions should also be historized, particularly for comparison of measures across different time periods.
Example: How did buying habits of customers change over the last 5 years based on where they live? (History of addresses of customers will need to be kept!)

Page 24: Venugopal krishnan flexible dw models 2014 jul_ieg

Temporal representations in Data warehouses
Typical Star Schema
Policy_Fact: <foreign keys>, PREMIUM_AMT, LOSS_AMT, EXPENSE_AMT, PROFIT_AMT
Time
Product: PROD_ID, …..
Prof_Center: PC_ID, PC_NAME, DIV_ID, DIV_NAME, ……….
CUSTOMER: CLIENT_ID, CLIENT_NAME, CLIENT_RATING, ..............
Compare profits over the years:
- Grouped by business divisions
- Grouped by client ratings

Page 25: Venugopal krishnan flexible dw models 2014 jul_ieg

Temporal representations in Data warehouses
What happens, over time, when:
• Business divisions change (e.g. profit centers are shifted)?
• Ratings of clients change?
• Two clients merge (e.g., primary insurers in the reinsurance business)?
• Geography changes (merges, splits, inactivations etc.)?
Let us suppose that the dimension hierarchies are:
- Product (LOB hierarchy)
- Profit Center -> Division -> Group
- Customer -> Country -> Continent -> etc…
Let us see how a temporal representation handles the changes and historization efficiently, taking COUNTRY as an example:

Page 26: Venugopal krishnan flexible dw models 2014 jul_ieg

Temporal representations in Data warehouses
Possible changes to the COUNTRY dimension:
• New value addition
• Old value replaced by new value
• Invalidation (value no longer to be used)
• Merge (n values merged into a new value)
• Split (old value divided into n values)
• Move (position change in hierarchy)
Principle
1.0 Add valid begin and end times in dimensions using an object table (Country) and a single-property table (CountryNames).
2.0 Have foreign keys in fact tables refer to the unchanging IDs in object tables.
3.0 Use the basics of the 6th normal form to arrive at an efficient model for temporal data representation.

Page 27: Venugopal krishnan flexible dw models 2014 jul_ieg

Temporal Representations in Data warehouses
Modified Star Schema design (Sample for Customer (Country))
Country: CountryID, VTimeBeg, VTimeEnd
CountryNames: CountryID, VTimeBeg, VTimeEnd, CountryName
CountrySuccession: ID – Original ID, SuccID – Direct successor, CurrID – Ultimate successor
Population: CountryID, Year
Time
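A small sketch of the principle, using Python's built-in `sqlite3` module: the fact table refers only to the unchanging CountryID, and the name that was valid at the time of each fact is picked up from CountryNames. The `Sales_Fact` table, the country names, the dates and the half-open interval comparison are all hypothetical values chosen for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Country (CountryID INTEGER, VTimeBeg TEXT, VTimeEnd TEXT);
CREATE TABLE CountryNames (CountryID INTEGER, VTimeBeg TEXT, VTimeEnd TEXT, CountryName TEXT);
-- Hypothetical fact table: it stores only the unchanging CountryID.
CREATE TABLE Sales_Fact (CountryID INTEGER, SaleDate TEXT, Amount REAL);

INSERT INTO Country VALUES (1, '1900-01-01', '9999-12-31');
-- The same country object is renamed in 1990; its ID does not change.
INSERT INTO CountryNames VALUES (1, '1900-01-01', '1990-01-01', 'Oldland');
INSERT INTO CountryNames VALUES (1, '1990-01-01', '9999-12-31', 'Newland');

INSERT INTO Sales_Fact VALUES (1, '1985-03-01', 100.0);
INSERT INTO Sales_Fact VALUES (1, '1995-03-01', 250.0);
""")

# Each fact is labelled with the name that was valid on the sale date.
rows = conn.execute("""
    SELECT n.CountryName, f.Amount
    FROM Sales_Fact f
    JOIN CountryNames n
      ON n.CountryID = f.CountryID
     AND f.SaleDate >= n.VTimeBeg
     AND f.SaleDate <  n.VTimeEnd
    ORDER BY f.SaleDate
""").fetchall()

# Totals across the rename still work because the ID never changed.
total = conn.execute(
    "SELECT SUM(Amount) FROM Sales_Fact WHERE CountryID = 1"
).fetchone()[0]
```

The rename adds one row to the property table and touches neither the fact table nor the object table, which is the behaviour the principle aims for.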

Page 28: Venugopal krishnan flexible dw models 2014 jul_ieg

Modeling Temporal data

Anchor Modeling – An agile modeling technique using sixth normal form for structurally and temporally evolving data.
Flexibility of Anchor Models:
• Historization
• Null handling – Eliminates NULL
• Orphans – Early arriving facts
• Separation of Concerns (start with a small common base and gradually develop into an EDW)
• Prototyping
Components of Anchor Model:
• Anchors
• Knots
• Attributes
• Ties
Anchor Model is based on Sixth Normal Form.

Page 29: Venugopal krishnan flexible dw models 2014 jul_ieg

6NF means that every relation consists of a candidate key plus no more than one other (non-key) attribute.

Examples:

Item {ProductCode, Eff_start_date, Eff_end_date}

ItemName {ProductCode*, Name}

ItemDesc {ProductCode*, Description}

ItemPrice {ProductCode*, Price}

Sixth Normal Form
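As an illustration of the decomposition, the following Python sketch splits a wide item record into one relation per non-key attribute. The function name `to_6nf`, the sample data and the generated relation names (entity name plus column name) are assumptions made for the example.

```python
def to_6nf(entity, key, rows):
    """Split wide rows into one {key, attribute} relation per non-key attribute.

    Missing values (None) produce no row at all, which is how this style of
    model avoids storing NULLs.
    """
    tables = {}
    for row in rows:
        for column, value in row.items():
            if column == key or value is None:
                continue
            tables.setdefault(f"{entity}{column}", []).append(
                {key: row[key], column: value}
            )
    return tables


# Hypothetical wide records; P2 has no description.
items = [
    {"ProductCode": "P1", "Name": "Widget", "Description": "Small widget", "Price": 9.5},
    {"ProductCode": "P2", "Name": "Gadget", "Description": None, "Price": 20.0},
]

tables = to_6nf("Item", "ProductCode", items)
```

Here `tables` holds three relations (ItemName, ItemDescription, ItemPrice), and the missing description for P2 simply has no row rather than a NULL.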

Page 30: Venugopal krishnan flexible dw models 2014 jul_ieg

Anchors - Entities
• Primarily the surrogate key of the entity.
• Has metacolumns that contain:
  • Batch information
  • File information
• Metacolumns should answer the questions WHEN? WHERE? HOW?

e.g: Customer (Customer_ID) <#42>

Page 31: Venugopal krishnan flexible dw models 2014 jul_ieg

KNOTS – Shared Properties
• Shared attributes of the Anchor which are more or less static.
• Contains a surrogate key for the knotted entity.
• Contains an attribute value representing the type of the knot.
• Contains metacolumns.
e.g: The gender of a person <#1, ‘Male’>
<#42, #1> - KNOTted attribute for Customer.
Representation of KNOTS: KNOTS are represented as follows in an Anchor diagram.

Page 32: Venugopal krishnan flexible dw models 2014 jul_ieg

Attributes – Properties

Contains:
• The foreign key of the Anchor it belongs to
• An attribute value
• Historization columns
• Metacolumns
E.g: Surname of a person <#42, ‘Unknown’, 2004-06-19>
Representing Attributes

Page 33: Venugopal krishnan flexible dw models 2014 jul_ieg

TIES – Relationships

Contains:

• Foreign Keys of the related Anchors (which may be an n-tuple) • Historization columns • Metacolumns

e.g: Children of a Person <#42, #4711> Representation of TIES.

Page 34: Venugopal krishnan flexible dw models 2014 jul_ieg

Anchor Modeling Example 1
The source system supplies the requested information in two separate source files according to the structure presented below. Analysis has determined each attribute’s ability to change and categorised each attribute into business keys, slowly changing attributes, rapidly changing attributes and meta data.
» File 1:
* Business Key - Customer Number
* Slowly Changing Attribute – Name
* Slowly Changing Attribute – Birth Date
* Slowly Changing Attribute – Marital Status
* Rapidly Changing Attribute – Income
* Meta Data – Changed Date
* Meta Data - From Date
» File 2:
* Business Key - Customer Number
* Slowly Changing Attribute – Tax Zone
* Rapidly Changing Attribute – Loyalty Value
* Meta Data - From Date

Page 35: Venugopal krishnan flexible dw models 2014 jul_ieg

Anchor Modeling Example 1 Contd…. An anchor model is created as follows (without defining views)

» Business keys are loaded into anchors

» Each attribute is placed in its own attribute table together with technical meta data.

» Attributes with more constant content (such as codes) are created as knots, with historic ties holding the status for a specific anchor. This reduces overlapping information, minimizes data volumes and provides multi-purpose tables.

» Temporal views hold the complete entity provided to subscribers.

» New additional data (regardless of whether the information comes in a new or an extended file) is added to a new attribute table and completed by extending all views with that attribute.

» New historical data can be added without affecting any current information in an instant!

Page 36: Venugopal krishnan flexible dw models 2014 jul_ieg

Anchor Modeling Example 1 contd....

The model will look like the following:

Marital Status_ID ------------------------ Marital_Status Created_Dt

Customer_ID ----------------------------- Customer_No Valid_From_Dt Valid_To_Dt Created_Dt

Customer_ID Marital_Status_ID Valid_from_Dt Valid_To_Dt Created_Dt

Tax_Zone_ID ------------------------------ Tax_Zone Created_Dt

Customer_ID Tax_Zone_ID Valid_from_Dt Valid_To_Dt Created_Dt

Customer_ID --------------------- Birth_Date Valid_From_Dt Valid_To_Dt Created_Dt

Customer_ID --------------------- Name Valid_from_Dt Valid_To_Dt Created_DT

Customer_ID Valid_from_Dt Valid_To_Dt ---------------------- Income Created_Dt

Page 37: Venugopal krishnan flexible dw models 2014 jul_ieg

Example 2

Anchor - CU_Customer (CU_ID)
Knot - Gen_Gender (Gen_ID, Gen_Gender_Name)
Attributes - CUDOB_Customerdateofbirth (CU_ID, Customerdateofbirth)
Ties - CUHH_Customer_Household (CU_ID, HH_ID, HOW_ID, CUHH_Fromdate)
Sample values in each object:
CU_Customer = (#42, #43, #44)
Gen_Gender = (#1, ‘Male’, #2, ‘Female’)
CUDOB_Customerdateofbirth = (#42, 1963-08-13, #43, 1970-09-24, #44, 1958-12-10)
CUGEN_Gender (Knotted attribute) = (#42, #1, #43, #1, #44, #2)
CUHH_Customer_Household = (#42, #43, #11, 1984-11-20, #42, #44, #18, 1990-04-12)
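The anchor, knot and attribute objects of this example can be tried out with Python's built-in `sqlite3` module. The sketch below loads the sample values listed above and joins them back into one row per customer; the column types, constraints and the final query are assumptions, and the household tie is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CU_Customer (CU_ID INTEGER PRIMARY KEY);                 -- anchor
CREATE TABLE GEN_Gender (GEN_ID INTEGER PRIMARY KEY,
                         GEN_Gender_Name TEXT NOT NULL);              -- knot
CREATE TABLE CUDOB_CustomerDateOfBirth (
    CU_ID INTEGER PRIMARY KEY REFERENCES CU_Customer(CU_ID),
    CustomerDateOfBirth TEXT NOT NULL);                               -- attribute
CREATE TABLE CUGEN_Gender (
    CU_ID INTEGER PRIMARY KEY REFERENCES CU_Customer(CU_ID),
    GEN_ID INTEGER NOT NULL REFERENCES GEN_Gender(GEN_ID));           -- knotted attribute

INSERT INTO CU_Customer VALUES (42), (43), (44);
INSERT INTO GEN_Gender VALUES (1, 'Male'), (2, 'Female');
INSERT INTO CUDOB_CustomerDateOfBirth VALUES
    (42, '1963-08-13'), (43, '1970-09-24'), (44, '1958-12-10');
INSERT INTO CUGEN_Gender VALUES (42, 1), (43, 1), (44, 2);
""")

# Reassemble one row per customer by left-joining from the anchor outwards.
customers = conn.execute("""
    SELECT cu.CU_ID, dob.CustomerDateOfBirth, g.GEN_Gender_Name
    FROM CU_Customer cu
    LEFT JOIN CUDOB_CustomerDateOfBirth dob ON dob.CU_ID = cu.CU_ID
    LEFT JOIN CUGEN_Gender cg ON cg.CU_ID = cu.CU_ID
    LEFT JOIN GEN_Gender g ON g.GEN_ID = cg.GEN_ID
    ORDER BY cu.CU_ID
""").fetchall()
```

Each gender value is stored once in the knot and referenced by ID, so the knotted attribute table stays narrow even when many customers share the same value.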

Page 38: Venugopal krishnan flexible dw models 2014 jul_ieg

Anchor Modeling Complete Example
Anchors: Customer, Store, Purchase, Item, PriceList, Inventory
Knots: Gender, CustomerClass, HouseholdOwner, VisitingFrequencyInterval
Attributes: CustomerDateOfBirth, CustomerNumber, CustomerName, CustomerGender
Ties: Customer_Address, Customer_Household, Card_Customer

Page 39: Venugopal krishnan flexible dw models 2014 jul_ieg

Anchor Modeling Complete Example Contd..

Select top 5 * from CU_Customer;
CU_ID
------------
1
2
3
4
5

Select top 5 * from GEN_gender;
GEN_ID  GEN_Gender
------  ----------
1       Male
2       Female

Select top 5 * from CUDOB_CustomerDateofBirth;
CU_ID  CUDOB_CustomerDateOfBirth
-----  -------------------------
1      1905-03-02
2      1905-07-02
3      1908-09-14
4      1910-02-03
5      1912-04-01

Select top 5 * from CUHH_Customer_Household;
CU_ID  HH_ID  HOW_ID  CUHH_FromDate
-----  -----  ------  -------------
1      1      1       2009-02-13
1      895    0       2009-09-21
2      2      1       2006-10-17
3      3      1       2002-08-20
4      4      1       1993-08-29

Page 40: Venugopal krishnan flexible dw models 2014 jul_ieg

Model Evolution

[Diagram: CU, CUGEN, CUNAM, CUDOB, GEN, CUSAL]

Page 41: Venugopal krishnan flexible dw models 2014 jul_ieg

Typical Anchor Model Example

[Diagram legend: Anchor, Knot, Historized Attribute, Static Attribute, Static Tie, Historized Tie]

Page 42: Venugopal krishnan flexible dw models 2014 jul_ieg

Physical Implementation
An abstraction layer of views and functions is created to reduce the complexity caused by the large number of tables.
• Complete View: Denormalization of an anchor table along with its attributes. Constructed using an outer join of the anchor table with all its attributes.
• Latest View: A view based on the complete view, where only the latest values for historized attributes are included. (Uses a sub-select.)
• Point-in-Time Function: A function for an anchor with a time point as an argument, returning a data set. It is based on the complete view, where the latest value of each attribute before or at the time point is included. (A sub-select with the condition that the historization time is the latest one that is not later than the time point.)
• Interval Function: A function using 2 time points to return a data set from the anchor.
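The latest view and the point-in-time function can be sketched with Python's built-in `sqlite3` module as follows. The table names, the `ChangedAt` historization column, the view name `lCU_Customer` and the sample rows are hypothetical; a real anchor model would have one such join per attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CU_Customer (CU_ID INTEGER PRIMARY KEY);
-- Historized attribute: one row per name change.
CREATE TABLE CUNAM_CustomerName (
    CU_ID INTEGER, Name TEXT, ChangedAt TEXT,
    PRIMARY KEY (CU_ID, ChangedAt));

INSERT INTO CU_Customer VALUES (1), (2);
INSERT INTO CUNAM_CustomerName VALUES
    (1, 'A. Smith', '2001-01-01'),
    (1, 'A. Jones', '2010-06-15'),
    (2, 'B. Brown', '2005-03-20');

-- Latest view: anchor left-joined to the most recent attribute row.
CREATE VIEW lCU_Customer AS
SELECT cu.CU_ID, nam.Name
FROM CU_Customer cu
LEFT JOIN CUNAM_CustomerName nam
  ON nam.CU_ID = cu.CU_ID
 AND nam.ChangedAt = (SELECT MAX(s.ChangedAt)
                      FROM CUNAM_CustomerName s
                      WHERE s.CU_ID = cu.CU_ID);
""")


def point_in_time(as_of):
    """Same shape as the latest view, restricted to rows recorded at or before `as_of`."""
    return conn.execute("""
        SELECT cu.CU_ID, nam.Name
        FROM CU_Customer cu
        LEFT JOIN CUNAM_CustomerName nam
          ON nam.CU_ID = cu.CU_ID
         AND nam.ChangedAt = (SELECT MAX(s.ChangedAt)
                              FROM CUNAM_CustomerName s
                              WHERE s.CU_ID = cu.CU_ID
                                AND s.ChangedAt <= ?)
        ORDER BY cu.CU_ID
    """, (as_of,)).fetchall()


latest = conn.execute("SELECT * FROM lCU_Customer ORDER BY CU_ID").fetchall()
early_2004 = point_in_time("2004-01-01")
```

In this sketch the point-in-time result for early 2004 shows the older name for customer 1 and no name yet for customer 2, because the left join keeps the anchor row even when no attribute row qualifies.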

Page 43: Venugopal krishnan flexible dw models 2014 jul_ieg

Table Elimination
Utilized by modern query optimizers to improve query performance. Tables that do not contribute selected attributes are automatically eliminated from the execution plan. A table T can be eliminated if:
• No column from T is selected, AND
• The number of rows returned is not affected by the join with T.
The views and functions defined earlier are created to take advantage of table elimination.
• Use the anchor table as the left table in the join for the view.
• Attributes must be left outer joined. The left join ensures that the number of rows returned is at least as many as in the anchor table.

Page 44: Venugopal krishnan flexible dw models 2014 jul_ieg

Table Elimination Example
The Oracle optimizer provides the table elimination feature starting with 10gR2. There are 2 cases when Oracle will eliminate a redundant table:
1.0 The optimizer eliminates tables that are redundant due to primary-foreign key constraints.
e.g.
create table jobs ( job_id NUMBER PRIMARY KEY, job_title VARCHAR2(35) NOT NULL, min_salary NUMBER, max_salary NUMBER );
create table departments ( department_id NUMBER PRIMARY KEY, department_name VARCHAR2(50) );
create table employees ( employee_id NUMBER PRIMARY KEY, employee_name VARCHAR2(50), department_id NUMBER REFERENCES departments(department_id), job_id NUMBER REFERENCES jobs(job_id) );
select e.employee_name from employees e, departments d where e.department_id = d.department_id;

In the above query the join to departments is redundant: no column of departments is selected, and the foreign key guarantees that every non-null department_id has exactly one match. The optimizer rewrites the query as follows:
select e.employee_name from employees e where e.department_id is not null;

The Oracle 11g optimizer also eliminates tables that are anti-joined or semi-joined.
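The reasoning behind this rewrite can be checked on a toy data set. The sketch below uses Python's built-in `sqlite3` module rather than Oracle, so it does not show the optimizer at work; it only confirms that the joined query and the rewritten query return the same rows when the foreign key holds. The sample rows and the adapted column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the REFERENCES constraint
conn.executescript("""
CREATE TABLE departments (department_id INTEGER PRIMARY KEY, department_name TEXT);
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    employee_name TEXT,
    department_id INTEGER REFERENCES departments(department_id));

INSERT INTO departments VALUES (10, 'Claims'), (20, 'Underwriting');
-- One employee has no department, so the inner join drops that row.
INSERT INTO employees VALUES (1, 'Asha', 10), (2, 'Ravi', 20), (3, 'Meera', NULL);
""")

joined = conn.execute("""
    SELECT e.employee_name
    FROM employees e, departments d
    WHERE e.department_id = d.department_id
    ORDER BY e.employee_name
""").fetchall()

rewritten = conn.execute("""
    SELECT e.employee_name
    FROM employees e
    WHERE e.department_id IS NOT NULL
    ORDER BY e.employee_name
""").fetchall()
```

Both queries return the two employees that have a department; the `IS NOT NULL` filter in the rewrite is what preserves the inner join's behaviour of dropping the employee without one.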

Page 45: Venugopal krishnan flexible dw models 2014 jul_ieg

Table Elimination Contd…..
2.0 Outer Join Table Elimination
e.g:
create table projects ( project_id NUMBER UNIQUE, deadline DATE, priority NUMBER );
alter table employees add project_id number;
select e.employee_name, e.project_id from employees e, projects p where e.project_id = p.project_id (+);

Since the outer join guarantees that every row in employees occurs at least once, and the unique constraint on projects.project_id guarantees that every row of employees matches at most one row in projects, the projects table is redundant and the optimizer will eliminate it from the outer join. The optimizer will rewrite the query as follows:
select e.employee_name, e.project_id from employees e;
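The same kind of check works for the outer-join case. Again this is a `sqlite3` sketch in Python with made-up rows, using `LEFT JOIN` in place of Oracle's `(+)` notation; it demonstrates only that the two queries are equivalent, not how any particular optimizer performs the elimination.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (project_id INTEGER UNIQUE, deadline TEXT, priority INTEGER);
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    employee_name TEXT,
    project_id INTEGER);

INSERT INTO projects VALUES (100, '2014-12-31', 1), (200, '2015-06-30', 2);
-- Ravi points at a project that does not exist; Meera has no project.
INSERT INTO employees VALUES (1, 'Asha', 100), (2, 'Ravi', 999), (3, 'Meera', NULL);
""")

outer_joined = conn.execute("""
    SELECT e.employee_name, e.project_id
    FROM employees e
    LEFT JOIN projects p ON e.project_id = p.project_id
    ORDER BY e.employee_id
""").fetchall()

without_join = conn.execute("""
    SELECT e.employee_name, e.project_id
    FROM employees e
    ORDER BY e.employee_id
""").fetchall()
```

The outer join keeps unmatched employees and the unique constraint prevents any employee from being duplicated, so the row count is the same with or without `projects`.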

Page 46: Venugopal krishnan flexible dw models 2014 jul_ieg

Advantages of Anchor Modeling

1.0 Ease of modeling
• Expressive concepts and notation – Constructed using a small number of expressive concepts.
• Historization by design – Managing different versions is simpler.
• Agile development – Facilitates iterative and flexible modeling.
• Reusability and automation.

2.0 Simplified database maintenance
• Ease of attribute changes
• Absence of NULL values
• Simple index design – clustered/B-tree indexes.
• No updates, only inserts!

3.0 High performance databases
• High run-time performance – few columns per table, table elimination.
• Efficient storage – Smaller size than normalized databases.
• Less index space needed.
• Reduced deadlock issues – Only inserts!

Page 47: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Modeling
• A flexible data modeling technique built for data warehousing, especially when implemented on MPP environments.
• Removes any need for multiple data stores, as it stores information as it is delivered to the data warehouse, thereby automatically supporting compliance requirements. (Basically, we divide the information into chunks regarding a specific business entity or, more precisely, a business key.)

Created by Dan Linstedt, the definition is as follows: “A detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3NF and Star Schemas. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.”

Page 48: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Modeling

The Data Vault model comprises three basic types of tables:
HUB - Contains a list of unique business keys, each having its own surrogate key.
LINK - Establishes relationships between business keys (typically hubs, but links can link to other links).
SATELLITE - Holds descriptive attributes that can change over time (similar to a Kimball Type II slowly changing dimension).
Data Vault Steps
A simplified process can be described in a few steps: take a source, define the business key, and separate your target model into 4 types per source.
* Business keys - Hub
* Slowly Changing Attributes - Satellite
* Rapidly Changing Attributes – Satellite
* Business key relationships – Links

Page 49: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Modeling

Hubs and Satellites Customer Hub

Customer_Name satellite

Page 50: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Modeling
Data Vault Benefits
• Scalable and flexible architecture
• Iterative/Agile/Adaptive data warehousing
• Near-real-time loads
• History data re-loads
• In-database data mining
• Terabytes to petabytes of information (Big Data)
• Incremental build-out
• Seamless integration of unstructured data
• Dynamic model adaptation – self healing
• Business rule changes (with ease)

Page 51: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Modeling

Data Vault Issues: Business Perspective

• Data in the DV is not “cleansed or quality checked”.

• Using a DV forces examination of source data processes, and source business processes.

• Businesses believe their existing operational reports are “right”, the DV architecture proves this is not always the case.

• Business Users from different units MUST agree on the elements (scope) they need in the Data Vault before parts of it can be built.

• Currently there is only one source of information exchange, there are no books on the Data Vault (yet).

• Some businesses fight the idea of implementing a new architecture, they claim it is yet unproven.

Page 52: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Modeling
Data Vault Issues: Technical Perspective
• The Data Vault model introduces many joins.

• Data Vault model is based on MPP computing, not SMP computing, and is not necessarily a clustered architecture.

• Data Vault contains all deltas, only houses deletes and updates as status flags on the data itself.

• Data must be made into information BEFORE delivering to the business.

• Stand-alone tables for calendar, geography, and sometimes codes and descriptions are acceptable.

• 60% to 80% of source data typically is not tracked by change, forcing a re-load and delta comparison on the way into the DV.

• Businesses must define the metadata on a column based level in order to make sense of the Data Vault storage paradigm.
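The delta comparison mentioned above (re-loading a full extract and keeping only what changed) can be sketched as below. Hashing the descriptive attributes is one common way to do this comparison and is an assumption here, as are the function names row_hash and detect_deltas and the sample data.

```python
import hashlib

def row_hash(attributes: dict) -> str:
    """Hash the descriptive attributes in a stable (sorted) column order."""
    payload = "|".join(f"{name}={attributes[name]}" for name in sorted(attributes))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def detect_deltas(latest_hashes: dict, full_extract: dict) -> dict:
    """Return only the rows that are new or changed since the last load.

    latest_hashes maps business key -> hash of the last stored version and
    is updated in place, so it can be reused for the next load.
    """
    deltas = {}
    for business_key, attributes in full_extract.items():
        new_hash = row_hash(attributes)
        if latest_hashes.get(business_key) != new_hash:
            deltas[business_key] = attributes
            latest_hashes[business_key] = new_hash
    return deltas

stored = {}
first_load = detect_deltas(stored, {
    "C1": {"name": "Asha", "zone": "N"},
    "C2": {"name": "Ravi", "zone": "S"},
})
# The second full extract repeats C1 unchanged and changes C2's zone.
second_load = detect_deltas(stored, {
    "C1": {"name": "Asha", "zone": "N"},
    "C2": {"name": "Ravi", "zone": "E"},
})
```

Only the rows returned by detect_deltas would be written as new satellite rows; unchanged rows are dropped on the way in.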

Page 53: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Modeling

Steps for Data Vault modeling

Step 1: Establish the Business Keys, Hubs

Step 2: Establish the relationships between the Business Keys, Links

Step 3: Establish description around the Business Keys, Satellites

Step 4: Add Standalone components like Calendars and codes or descriptions for decoding in Data Marts

Step 5: Tune for query optimization, add performance tables such as Bridge tables and Point-In-Time structures
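As a rough sketch of what a Point-In-Time structure from Step 5 might hold: for each hub key and snapshot date, it records which satellite row (by load date) was in effect, so that later queries can use simple equality joins. The layout, the function names as_of and build_pit, and the sample histories are assumptions for illustration only.

```python
from bisect import bisect_right

# Hypothetical satellite histories per hub key: (load_date, value), sorted by load_date.
sat_name = {1: [("2014-01-01", "A. Kumar"), ("2014-03-10", "Anil Kumar")]}
sat_income = {1: [("2014-01-01", 50000), ("2014-02-01", 52000), ("2014-04-01", 55000)]}

def as_of(history, snapshot_date):
    """Return the (load_date, value) row in effect on snapshot_date, or None."""
    load_dates = [load_date for load_date, _ in history]
    position = bisect_right(load_dates, snapshot_date)
    return history[position - 1] if position else None

def build_pit(hub_keys, satellites, snapshot_dates):
    """Build one PIT row per (hub key, snapshot date)."""
    pit = []
    for key in hub_keys:
        for snapshot in snapshot_dates:
            row = {"hub_key": key, "snapshot_date": snapshot}
            for sat, histories in satellites.items():
                hit = as_of(histories.get(key, []), snapshot)
                row[sat + "_load_date"] = hit[0] if hit else None
            pit.append(row)
    return pit

pit_rows = build_pit(
    [1],
    {"name": sat_name, "income": sat_income},
    ["2014-02-15", "2014-04-30"],
)
```

ISO-formatted date strings are used so that plain string comparison orders them correctly.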

Page 54: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Modeling Example 1

Let’s assume we have a case to integrate customer data into our data warehouse. The source system supplies the required information in two separate source files according to the schema presented below. Analysis has determined each attribute’s ability to change and categorised each attribute as a business key, slowly changing attribute, rapidly changing attribute or meta data.

» File 1:
* Business Key - Customer Number
* Slowly Changing Attribute - Name
* Slowly Changing Attribute - Birth Date
* Slowly Changing Attribute - Marital Status
* Rapidly Changing Attribute - Income
* Meta Data - Changed Date
* Meta Data - From Date

» File 2:
* Business Key - Customer Number
* Slowly Changing Attribute - Tax Zone
* Rapidly Changing Attribute - Loyalty Value
* Meta Data - From Date

Page 55: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Example Contd.

A data vault model is created as follows:

» Business keys are inserted into the Customer Hub, including load date and source information from both files.

» Slowly changing attributes from file 1 are inserted into a specific satellite for that information only.

» Rapidly changing attributes are split out one by one into their own satellites, separately for each file.

» Slowly changing attributes from file 2 are inserted into a specific satellite for that information only.

» No synchronisation is needed at load time since relationships are created on the fly using valid from dates.

» New additional data (regardless of whether the information comes in a new or an extended file) is added to new satellites according to the principle of rapidly or slowly changing attributes.
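The loading rules above can be sketched in a few lines. The satellite names, the in-memory dictionaries standing in for tables, and the sample rows are all assumptions for illustration; the attribute split follows the File 1 / File 2 classification of the example.

```python
from collections import defaultdict

# Hypothetical sample rows following the two file layouts of the example.
file1 = [
    {"customer_number": "C1", "name": "Asha", "birth_date": "1980-02-01",
     "marital_status": "Single", "income": 40000, "from_date": "2014-01-01"},
    {"customer_number": "C2", "name": "Ravi", "birth_date": "1975-09-12",
     "marital_status": "Married", "income": 65000, "from_date": "2014-01-01"},
]
file2 = [
    {"customer_number": "C1", "tax_zone": "North", "loyalty_value": 12,
     "from_date": "2014-01-05"},
]

# One satellite for the slowly changing attributes of each file, and one
# satellite per rapidly changing attribute.
SATELLITES = {
    "file1": {
        "sat_file1_slow": ["name", "birth_date", "marital_status"],
        "sat_file1_income": ["income"],
    },
    "file2": {
        "sat_file2_slow": ["tax_zone"],
        "sat_file2_loyalty": ["loyalty_value"],
    },
}

hub_customer = {}               # business key -> load metadata
satellites = defaultdict(list)  # satellite name -> list of rows

def load(source, rows, load_date):
    for row in rows:
        key = row["customer_number"]
        # The hub keeps each business key once; the first source seen is recorded.
        hub_customer.setdefault(key, {"load_date": load_date, "record_source": source})
        for sat, columns in SATELLITES[source].items():
            record = {"customer_number": key, "from_date": row["from_date"]}
            record.update({column: row[column] for column in columns})
            satellites[sat].append(record)

# The two files are loaded independently; no synchronisation between them.
load("file1", file1, "2014-07-01")
load("file2", file2, "2014-07-01")
```

Adding a third file later would only mean adding new entries to SATELLITES; the existing hub and satellites stay untouched.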

Page 56: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Example Contd.

The data vault model for the example would be as follows:

Page 57: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Example 2

3NF to Data Vault conversion:

Consider the following 3NF:

SK fields are Surrogate Keys, BK fields are Business Keys

Page 58: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Example -2

3NF to Data Vault conversion:

The data vault model looks like the following:

1) Instead of each master table in 3NF, we add a hub and a satellite.
2) Instead of the transactional table, we add a Link table and a Satellite.
3) Instead of the joins between master tables, we add Link tables.

Page 59: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Example -2

3NF to Data Vault conversion:

Adding attributes/entities into the data vault model is very easy:

Attributes like customer demographics, and a new table named Delivery, can be added without any changes to existing tables.

Page 60: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Modeling example - 3 3NF model

Page 61: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Modeling example - 3 Data Vault Hub Design

Page 62: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Modeling example - 3 Data Vault Hubs, Links and Satellites Design

Page 63: Venugopal krishnan flexible dw models 2014 jul_ieg

Data Vault Modeling example - 3 Completed Design

Page 64: Venugopal krishnan flexible dw models 2014 jul_ieg

Other Flexible Data models

The other flexible models are based on the Decomposition Storage Model (DSM).

One of the popular DSM-based models is the Index Table Model, which is used primarily in SaaS environments.

DSM Structure:

• Records stored as set of binary relations.

• Each relation corresponds to a single attribute and holds <key, value> pairs.

• Each relation is stored twice: one copy clustered (indexed) by key and the other clustered by value.

Example (NSM):

ACCT   TYPE       OVERDRAWN?   MIN BAL
335    NULL       NULL         NULL
690    Checking   N            NULL
122    Saving     NULL         100

Page 65: Venugopal krishnan flexible dw models 2014 jul_ieg

Other Flexible Data models

DSM structure (the same data stored as DSM):

ACCT
335
690
122

ACCT   OVERDRAWN?
690    N

ACCT   MIN BAL
122    100

ACCT   TYPE
690    Checking
122    Saving
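The NSM-to-DSM decomposition of the account example can be sketched as follows. The function names decompose and reconstruct are hypothetical, and only the key-ordered copy of each binary relation is shown (the value-clustered copy is omitted).

```python
# The account example in NSM form; None marks a NULL.
nsm_rows = [
    {"ACCT": 335, "TYPE": None, "OVERDRAWN": None, "MIN_BAL": None},
    {"ACCT": 690, "TYPE": "Checking", "OVERDRAWN": "N", "MIN_BAL": None},
    {"ACCT": 122, "TYPE": "Saving", "OVERDRAWN": None, "MIN_BAL": 100},
]

def decompose(rows, key):
    """Split NSM rows into a key list plus one <key, value> relation per attribute."""
    keys = [row[key] for row in rows]
    relations = {}
    for row in rows:
        for attribute, value in row.items():
            if attribute != key and value is not None:  # NULLs are simply not stored
                relations.setdefault(attribute, []).append((row[key], value))
    return keys, relations

def reconstruct(keys, relations, key):
    """Rebuild NSM rows by joining the binary relations on the key."""
    rows = {k: {key: k, **{attribute: None for attribute in relations}} for k in keys}
    for attribute, pairs in relations.items():
        for k, value in pairs:
            rows[k][attribute] = value
    return [rows[k] for k in keys]

keys, dsm = decompose(nsm_rows, "ACCT")
```

The reconstruct step is the join that DSM queries pay for when several attributes of the same record are projected.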

Page 66: Venugopal krishnan flexible dw models 2014 jul_ieg

Other Flexible Data models

Example: Distributed relations

NSM

R1:
SS#           NAME     DOB
123-45-6789   Lara     6/11/76
987-56-3488   Nicole   3/30/79

R2:
SS#           NAME     DOB
987-56-3488   Nicole   3/30/79
346-09-0227   Amber    9/17/80

DSM

R1.SS#
123-45-6789
987-56-3488

R2.SS#
987-56-3488
346-09-0227

SS#           NAME
123-45-6789   Lara
987-56-3488   Nicole
346-09-0227   Amber

SS#           DOB
123-45-6789   6/11/76
987-56-3488   3/30/79
346-09-0227   9/17/80

Note: R1 and R2 are in different distributed databases.

Page 67: Venugopal krishnan flexible dw models 2014 jul_ieg

Advantages of DSM:

• Eliminates Null values

• Supports distributed relations (very useful in cloud environments).

• Managing delta is easier.

• Simple storage structure.

• Uniform access method (key based and attribute based access only).

• Basis for Columnar and NOSQL data models

Drawbacks of DSM:

• DSM uses more storage (between 1 and 4 times that of NSM).

• Modification of an attribute requires 3 disk writes (2 for the record, 1 for the index); an Insert requires 2 disk writes.

• Retrieval query performance depends on the following:

• Number of projected attributes

• Size of intermediate results (due to joins)

• Number of records to be retrieved.

Other Flexible Data models

Page 68: Venugopal krishnan flexible dw models 2014 jul_ieg

Index Table Model

• Primarily used in SaaS environments.

• Comprises a base table and a number of supporting tables. The base table contains all columns common to all individual tenant tables, with an additional column called Index.

• Each supporting table has 2 columns, one for index and the other for a column which is not common among all tenants.

• If there are “n” non-common columns among the private tables, then this model will have “n” supporting tables apart from the base table.

• Reduces the sparsity among the tables.

• Index provides better access to the required information than other methods.

• This model is based on Decomposition Storage Model(DSM).

Other Flexible Data models

Page 69: Venugopal krishnan flexible dw models 2014 jul_ieg

Index Table Model contd..

Example:

Original Table (ACCOUNT)

AID   NAME   HOSPITAL   NO.OF BEDS
1     ACME   ST.MARY    135
2     GUMP   STATE      1042

Index Table model

Base Table

Index   Tenant_ID   AID   Name
1       17          1     ACME
2       17          2     GUMM
3       42          1     BANNER
4       35          1     BALE

Supporting tables

Index   Hospital
1       St.Mary
2       Manipal

Index   No_Beds
1       135
2       1045
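A small sketch of how a tenant's private table could be reassembled from the base table and the supporting tables. The values are those of the index tables shown on the slide; the Python structures and the function name tenant_view are assumptions for illustration.

```python
# Base table: columns common to all tenants plus the Index column.
base_table = [
    {"index": 1, "tenant_id": 17, "aid": 1, "name": "ACME"},
    {"index": 2, "tenant_id": 17, "aid": 2, "name": "GUMM"},
    {"index": 3, "tenant_id": 42, "aid": 1, "name": "BANNER"},
    {"index": 4, "tenant_id": 35, "aid": 1, "name": "BALE"},
]

# One supporting table per non-common column, keyed by Index.
supporting = {
    "hospital": {1: "St.Mary", 2: "Manipal"},
    "no_beds": {1: 135, 2: 1045},
}

def tenant_view(tenant_id, columns):
    """Rebuild a tenant's rows by joining the base table to its supporting tables."""
    rows = []
    for base in base_table:
        if base["tenant_id"] != tenant_id:
            continue
        row = {"aid": base["aid"], "name": base["name"]}
        for column in columns:
            row[column] = supporting[column].get(base["index"])
        rows.append(row)
    return rows

tenant_17 = tenant_view(17, ["hospital", "no_beds"])
tenant_42 = tenant_view(42, [])
```

Tenants 42 and 35 have no rows in the supporting tables, so no NULL-filled columns are stored for them, which is how the model reduces sparsity.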

Other Flexible Data models

Page 70: Venugopal krishnan flexible dw models 2014 jul_ieg

NOSQL Data Models

Page 71: Venugopal krishnan flexible dw models 2014 jul_ieg

NOSQL data modeling! Do we need it?

• Schema-less, yet the data structures still need to be designed around the application's data access paths (one data access path per data structure).

• Data modeling ends up in the code of the application (No change required for physical data structure).

• Data architect involvement is crucial in NOSQL implementation.

NOSQL Data Models

Page 72: Venugopal krishnan flexible dw models 2014 jul_ieg

Typical Scenario

Webinar Recording information.

Structure: Device IP Address, Program and Date (together forming the primary key), plus other related information such as duration, content, etc.

Sample data : 10.30.20.15,Hadoop,20140521230000

Query types:

• SELECT * FROM recording WHERE device_ip = '10.30.20.15';

• SELECT program, COUNT(*) FROM recording GROUP BY program;

• SELECT date, COUNT(*) FROM recording GROUP BY date;

The above are possible in an RDBMS, how about in a NOSQL database?

NOSQL Data Models

Page 73: Venugopal krishnan flexible dw models 2014 jul_ieg

NOSQL System Families

• Key-Value pair model is the simplest, yet powerful model. One of the drawbacks of this model is the inability to support key range processing.

• Ordered Key-Value overcomes this limitation and improves aggregation capabilities. It does not provide value modeling.

• Big table (Column Family) model supports value modeling through modeling map-of-maps-of-maps namely column families, columns, and timestamped versions.

• Document databases handle values of arbitrary complexity and support database-managed indexes (indexed by field names).

• Graph model has evolved from Ordered Key-Value models with additional support for hierarchical modeling. (A graph is an abstract representation of a set of objects (nodes), some of which are connected by links (relationships).)

NOSQL Data Models

Page 74: Venugopal krishnan flexible dw models 2014 jul_ieg

Examples of NOSQL Databases

Key-Value stores : Oracle NOSQL, Redis, Kyoto

BigTable (Column Family): Apache HBase, Apache Cassandra, Google Spanner/F1

Document : MongoDB, CouchDB

Graph : Neo4j, FlockDB

NOSQL Data Models

Page 75: Venugopal krishnan flexible dw models 2014 jul_ieg

NOSQL Data Models

1.0 Data Denormalization (Applicable to Key-Value stores, Document databases, BigTable databases)

2.0 Aggregation (Applicable to Key-Value stores, Document databases, BigTable databases)

3.0 Application Side Joins (Applicable to Key-Value stores, Document databases, BigTable databases and Graph databases)

4.0 Enumerable Keys (Applicable to Key-Value stores)

5.0 Dimensionality reduction

6.0 Index Table (Applicable to BigTable Databases)

Data Modeling Techniques

Page 76: Venugopal krishnan flexible dw models 2014 jul_ieg

NOSQL Data Models

Normalization & Aggregation

Page 77: Venugopal krishnan flexible dw models 2014 jul_ieg

Application side Joins

NOSQL Data Models

Page 78: Venugopal krishnan flexible dw models 2014 jul_ieg

Enumerable Keys

• Use Ordered keys to traverse data

Example: By creating a sequence id for messageID, the composite key userID_messageID will enable traversing the previous and succeeding messages for any given messageID.

• Group data into buckets based on the ordered attribute.

Example: Create bucket based on time(day). Using this, mail box can be traversed forward or backward starting from any date.
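Both enumerable-key ideas can be sketched with a plain dictionary standing in for a key-value store. The key formats, the function names and the sample messages are assumptions for illustration.

```python
from datetime import date, timedelta

messages = {}  # stands in for a key-value store
mailbox = {}   # stands in for a second store holding daily buckets

def message_key(user_id, message_seq):
    """Composite key userID_messageID built from a sequence id."""
    return f"{user_id}_{message_seq:08d}"

def put_message(user_id, message_seq, body):
    messages[message_key(user_id, message_seq)] = body

def neighbours(user_id, message_seq):
    """Previous and next message, found by computing their keys directly."""
    return (messages.get(message_key(user_id, message_seq - 1)),
            messages.get(message_key(user_id, message_seq + 1)))

def bucket_key(user_id, day):
    return f"{user_id}_{day.isoformat()}"

def add_to_bucket(user_id, day, body):
    mailbox.setdefault(bucket_key(user_id, day), []).append(body)

def walk_back(user_id, start_day, days):
    """Traverse the mailbox backward one daily bucket at a time."""
    result = []
    for offset in range(days):
        day = start_day - timedelta(days=offset)
        result.append((day, mailbox.get(bucket_key(user_id, day), [])))
    return result

for seq, body in enumerate(["hello", "how are you", "bye"], start=1):
    put_message("u42", seq, body)
before, after = neighbours("u42", 2)

add_to_bucket("u42", date(2014, 7, 1), "mail A")
add_to_bucket("u42", date(2014, 7, 3), "mail B")
history = walk_back("u42", date(2014, 7, 3), 3)
```

In both cases the neighbouring keys are computed, not searched for, so the traversal needs only point lookups.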

Dimensionality Reduction

Map multidimensional data to a Key-Value model or to a non-multidimensional model using Dimensionality reduction methods.

Example: Geohash
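As an illustration of the Geohash example, the sketch below encodes a latitude/longitude pair into a single string key by interleaving longitude and latitude bits and writing them in the Geohash base-32 alphabet. The function name geohash_encode and the sample coordinates are chosen for this example; nearby points tend to share a key prefix, which is what lets a one-dimensional key-value store serve two-dimensional lookups.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=9):
    """Encode a (lat, lon) pair as a Geohash string of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True  # bits alternate, starting with longitude
    while len(bits) < precision * 5:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
    chars = []
    for start in range(0, len(bits), 5):
        number = 0
        for bit in bits[start:start + 5]:
            number = (number << 1) | bit
        chars.append(BASE32[number])
    return "".join(chars)

key_a = geohash_encode(57.64911, 10.40744, precision=5)
key_b = geohash_encode(57.64912, 10.40745, precision=5)
```

A prefix scan over such keys (in a store that supports ordered keys) then approximates a spatial range query.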

NOSQL Data Models

Page 80: Venugopal krishnan flexible dw models 2014 jul_ieg

Hierarchy Modeling Techniques

1.0 Tree Aggregation (Key-Value stores, Document databases)

2.0 Adjacency Lists

3.0 Materialized Paths

4.0 Nested Sets

5.0 Batch Graph Processing

NOSQL Data Models

Page 81: Venugopal krishnan flexible dw models 2014 jul_ieg

Tree Aggregation

NOSQL Data Modeling Techniques

Efficient when the entire tree is accessed once. Search, direct access and updates could be inefficient.

Page 82: Venugopal krishnan flexible dw models 2014 jul_ieg

Adjacency lists

• Simple way of graph modeling.

• Each node is modeled as an independent record and contains arrays of direct ancestors and descendants.

• Enables traversing the graph by parents or children.

Inefficient for deep or wide traversals.

Inefficient for accessing an entire tree for a given node.
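A minimal sketch of an adjacency list, with a dictionary standing in for the record store. The node names and the lookup counter are assumptions added to make the cost of fetching a whole subtree visible.

```python
# Each node is an independent record holding its direct parents and children.
nodes = {
    "root": {"parents": [], "children": ["a", "b"]},
    "a": {"parents": ["root"], "children": ["a1", "a2"]},
    "b": {"parents": ["root"], "children": []},
    "a1": {"parents": ["a"], "children": []},
    "a2": {"parents": ["a"], "children": []},
}

lookups = 0

def get(node_id):
    """Fetch one record; counts lookups to show the traversal cost."""
    global lookups
    lookups += 1
    return nodes[node_id]

def subtree(node_id):
    """Collect a node and all of its descendants, one record fetch per node."""
    result = [node_id]
    for child in get(node_id)["children"]:
        result.extend(subtree(child))
    return result

whole_tree = subtree("root")
```

Reading the whole tree costs one lookup per node, which is why this layout suits single-step navigation better than deep or wide traversals.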

NOSQL Data Models

Page 83: Venugopal krishnan flexible dw models 2014 jul_ieg

NOSQL Data Models

Materialized paths

• Attribute each node with the identifiers of all its parents or children.
• Avoids recursive traversals of tree-like structures.
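A minimal sketch of materialized paths, assuming each record stores the full path of identifiers from the root down to itself. The category names and the function names are hypothetical.

```python
# Each record carries the identifiers of all its ancestors as a path string.
docs = {
    "electronics": {"path": "/electronics"},
    "cameras": {"path": "/electronics/cameras"},
    "dslr": {"path": "/electronics/cameras/dslr"},
    "phones": {"path": "/electronics/phones"},
    "books": {"path": "/books"},
}

def descendants(node_id):
    """All descendants share the node's path as a prefix; no recursion needed."""
    prefix = docs[node_id]["path"] + "/"
    return sorted(name for name, doc in docs.items() if doc["path"].startswith(prefix))

def ancestors(node_id):
    """Ancestors are read straight off the node's own path."""
    parts = docs[node_id]["path"].strip("/").split("/")
    return parts[:-1]

under_electronics = descendants("electronics")
above_dslr = ancestors("dslr")
```

In a store with ordered keys or a full-text index, the prefix match would be served by the database instead of the scan used here.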

Page 84: Venugopal krishnan flexible dw models 2014 jul_ieg

NOSQL Data Models

Nested Sets

Store the leaves of the tree in an array, and map each non-leaf node to a range of leaves.
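A minimal sketch of nested sets. The leaf names are hypothetical, and the half-open (start, end) index convention is an assumption made for the example.

```python
# Leaves stored in one array, in tree order.
leaves = ["dslr", "compact", "android", "feature_phone", "novel"]

# Each non-leaf node maps to a half-open [start, end) range of leaf positions.
ranges = {
    "cameras": (0, 2),
    "phones": (2, 4),
    "electronics": (0, 4),
    "books": (4, 5),
    "catalog": (0, 5),
}

def leaves_under(node):
    """All leaves below a node come from a single array slice."""
    start, end = ranges[node]
    return leaves[start:end]

def is_descendant(inner, outer):
    """A node lies below another if its range is strictly contained in the other's."""
    inner_start, inner_end = ranges[inner]
    outer_start, outer_end = ranges[outer]
    contained = outer_start <= inner_start and inner_end <= outer_end
    return contained and (inner_start, inner_end) != (outer_start, outer_end)

electronics_leaves = leaves_under("electronics")
```

Reads are cheap because a subtree is one contiguous slice, while inserting a leaf shifts positions and forces the affected ranges to be updated.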

Page 85: Venugopal krishnan flexible dw models 2014 jul_ieg

Nested Documents Flattening

NOSQL Data Models

Example:

Name: John
Math: Excellent
Poetry: Poor
......

Approach 1: Name: John  Skill: Math, Poetry, ....  Level: Excellent, Poor, ...
Query: Skill:Poetry AND Level:Excellent

Approach 2: Name: John  Skill_1: Math  Level_1: Excellent  Skill_2: Poetry  Level_2: Poor ..
Query: OR (skill_i:Poetry AND level_i:Excellent)

Approach 3: Name: John  SkillAndLevel: Math Excellent Poetry Poor .....
Query: SkillAndLevel: Distance(Excellent Poetry)=0
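The difference between Approach 1 and Approach 2 can be shown with a short sketch. The function names are hypothetical and the matching is done in plain Python, not by a search engine; the point is only that flattening into two multi-valued fields loses the pairing between a skill and its level, while numbered field pairs keep it.

```python
person = {"name": "John", "skills": [("Math", "Excellent"), ("Poetry", "Poor")]}

def flatten_multivalue(doc):
    """Approach 1: one multi-valued Skill field and one multi-valued Level field."""
    return {
        "name": doc["name"],
        "skill": [skill for skill, _ in doc["skills"]],
        "level": [level for _, level in doc["skills"]],
    }

def flatten_numbered(doc):
    """Approach 2: one numbered Skill_i / Level_i field pair per nested entry."""
    flat = {"name": doc["name"]}
    for i, (skill, level) in enumerate(doc["skills"], start=1):
        flat[f"skill_{i}"] = skill
        flat[f"level_{i}"] = level
    return flat

def match_multivalue(flat, skill, level):
    # Skill:<skill> AND Level:<level>
    return skill in flat["skill"] and level in flat["level"]

def match_numbered(flat, skill, level):
    # OR over i of (skill_i:<skill> AND level_i:<level>)
    i = 1
    while f"skill_{i}" in flat:
        if flat[f"skill_{i}"] == skill and flat[f"level_{i}"] == level:
            return True
        i += 1
    return False

approach_1 = match_multivalue(flatten_multivalue(person), "Poetry", "Excellent")
approach_2 = match_numbered(flatten_numbered(person), "Poetry", "Excellent")
```

Under these assumptions Approach 1 reports John as excellent at poetry, because "Poetry" and "Excellent" both occur somewhere in the document, whereas Approach 2 does not.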

Page 86: Venugopal krishnan flexible dw models 2014 jul_ieg

Typical Scenario using NOSQL (From Slide 69)

• Data storage structures created based on all anticipated data access paths

• Each of the data structures support a single data access path.

• Example using a Column Family structure:

Additional access paths can be supported by

• Creating secondary indexes (available in latest versions).

• Creating additional column families with different key combinations.
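A sketch of the "one structure per access path" idea for the webinar recording scenario, with dictionaries standing in for column families. The structure names, the duration_min attribute and the rows other than the slide's sample record are assumptions for illustration.

```python
from collections import defaultdict

# One structure per anticipated access path.
recordings_by_device = defaultdict(dict)  # row key: device IP
count_by_program = defaultdict(int)       # row key: program
count_by_date = defaultdict(int)          # row key: day (YYYYMMDD)

def record(device_ip, program, timestamp, **details):
    """Write one recording into every structure that serves a query."""
    recordings_by_device[device_ip][(program, timestamp)] = details
    count_by_program[program] += 1
    count_by_date[timestamp[:8]] += 1

record("10.30.20.15", "Hadoop", "20140521230000", duration_min=60)
record("10.30.20.15", "Spark", "20140522210000", duration_min=45)
record("10.30.20.16", "Hadoop", "20140521230000", duration_min=60)

# Each of the three query types becomes a single key lookup.
device_rows = recordings_by_device["10.30.20.15"]
hadoop_count = count_by_program["Hadoop"]
day_count = count_by_date["20140521"]
```

The grouping work that an RDBMS does at query time is moved to write time, so supporting a new access path means adding (and back-filling) another structure.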

NOSQL Data Models

Page 87: Venugopal krishnan flexible dw models 2014 jul_ieg

Relational to NoSQL

Example:

NOSQL Data Models

Relational Model (figure)

Typical Queries:

• Get user by user id
• Get item by item id
• Get all the items that a particular user likes
• Get all the users who like a particular item

Page 89: Venugopal krishnan flexible dw models 2014 jul_ieg

Approaches:

1.0 Normalized entities.

• Cannot support join queries.

2.0 Normalized entities with custom indexes.

• Supports join operations, but cannot get the details of all attributes.

3.0 Normalized entities with denormalized indexes.

• Supports all the queries mentioned.

4.0 Partially denormalized indexes.

• Super columns are hard to maintain and it becomes messy.

NOSQL Data Models

Page 90: Venugopal krishnan flexible dw models 2014 jul_ieg

NOSQL Data Models

Typical Data Model for a Column Family database (Approach 3)

Title and Name are de-normalized in User_By_Item and in Item_By_User.
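Approach 3 can be sketched with dictionaries standing in for the column families. User, Item, Item_By_User and User_By_Item follow the slide's naming; the sample users, items and the like function are assumptions for illustration.

```python
# Normalized entities.
users = {"u1": {"name": "Asha"}, "u2": {"name": "Ravi"}}
items = {"i1": {"title": "Data Vault Basics"}, "i2": {"title": "Anchor Modeling"}}

# Denormalized indexes: each entry copies the attribute needed for display.
item_by_user = {}  # user id -> {item id: item title}
user_by_item = {}  # item id -> {user id: user name}

def like(user_id, item_id):
    """Record a like in both indexes, copying Title and Name into them."""
    item_by_user.setdefault(user_id, {})[item_id] = items[item_id]["title"]
    user_by_item.setdefault(item_id, {})[user_id] = users[user_id]["name"]

like("u1", "i1")
like("u1", "i2")
like("u2", "i1")

# All four typical queries are now single-key lookups.
user = users["u1"]
item = items["i1"]
liked_by_u1 = item_by_user["u1"]
fans_of_i1 = user_by_item["i1"]
```

The price of the copies is that a change to a user's name or an item's title has to be propagated to every index row that holds it.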

Page 91: Venugopal krishnan flexible dw models 2014 jul_ieg

NOSQL Data Models

Typical Data Model for a Column Family database (Approach 3 with Timestamp)

The above model supports time based queries (e.g. Most Recent) in addition.

Page 92: Venugopal krishnan flexible dw models 2014 jul_ieg

Questions??

Page 93: Venugopal krishnan flexible dw models 2014 jul_ieg

THANK YOU

Page 94: Venugopal krishnan flexible dw models 2014 jul_ieg

Community Focused

Volunteer Driven

Knowledge Share

Accelerated Learning

Collective Excellence

Distilled Knowledge

Shared, Non Conflicting Goals

Validation / Brainstorm platform

Mentor, Guide, Coach

Satisfied, Empowered Professional

Richer Industry and Academia

About Information Excellence Group

Progress Information Excellence

Towards an Enriched Profession, Business and Society

Page 95: Venugopal krishnan flexible dw models 2014 jul_ieg

About Information Excellence Group

Reach us at:

blog: http://informationexcellence.wordpress.com/

presentations: http://www.slideshare.net/informationexcellence

linked in:http://www.linkedin.com/groups/Information-Excellence-3893869

Facebook:http://www.facebook.com/pages/Information-excellence-group/171892096247159

Google+: https://plus.google.com/u/0/communities/102316155996060621595

twitter: #infoexcel

email: [email protected]

[email protected]

Have you enriched yourself by contributing to the community Knowledge Share..