
EDW Data Model Storming for

Integration of NoSQL with RDBMS

SQL Saturday #497, April 2, 2016

Daniel Upton

DW-BI Architect, Data Modeler

DecisionLab.Net Serving Orange County and San Diego County since 2007

[email protected] blog: www.decisionlab.net

linkedin.com/in/DanielUpton

__________________________________________________________________________________________________________________________________________________________________________________

Page 2 of 20

Open Questions

o With DW-BI now a mainstream I.T. career specialization with an established set of best-practices, why do many real-world implementations still fall short of satisfying business stakeholder expectations?
o What influence have Lean and Agile thinking had on DW-BI?
o What parts of DW implementation have been most resistant to Agile?
o Are established DW data modeling methods an asset or a liability?
o What factors are driving change in data modeling for business intelligence?
o What is Data Model Storming?
o What challenges does NoSQL introduce to data modeling intended for integration with RDBMS data?
o What do we mean by Integration?
o What does End-to-End Model Storming mean?

Objectives:

o Describe a data modeling method and demonstrate how it differs from both dimensional modeling and 3rd Normal Form according to…
o Agile: quickly and iteratively deliver minimally viable products (MVPs) to users.
o Lean: design in loose coupling to minimize or eliminate functional dependencies.
o PMBOK: break down work (including design) into small-yet-cohesive chunks.
o Review BEAM Dimensional Model Storming (Corr and Stagnitto).
o Demonstrate some best-practice NoSQL data models as major variations from 3rd Normal Form.
o Introduce and perform EDW Model Storming with a simple use case involving unpredictable, last-minute changes to business rules.
o Extend the Model Storm with a last-minute requirement for NoSQL integration.

__________________________________________________________________________________________________________________________________________________________________________________

Page 3 of 20

Traditional Data Modeling Methods

o 3rd Normal Form: OLTP and EDW
o Dimensional Warehouse / Mart: Star Schema w/ Facts and Dimensions

__________________________________________________________________________________________________________________________________________________________________________________

Page 4 of 20

3rd Normal OLTP Source to Data Vault (Aliases: Lean DW, Hyper-Normal Model)

o One Hub and all of its dependent Satellites are known as an Ensemble: a stand-alone set of tables that always has zero functional dependencies on other Ensembles.
o Hubs store business keys (unique identifiers well known to non-techies and enterprise-wide).
o Satellites store and historize all attribute fields.

__________________________________________________________________________________________________________________________________________________________________________________

Page 5 of 20

o Links store all relationships as associations
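The Hub / Satellite / Link pattern above can be sketched as plain DDL. The following minimal illustration uses Python's sqlite3 with a hypothetical Customer and Order pair of entities (all table and column names here are invented for the sketch, not taken from the deck):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hub: one row per business key, with a surrogate PK
CREATE TABLE Hub_Customer (
    Hub_Customer_SQN INTEGER PRIMARY KEY,
    CustomerNumber   TEXT NOT NULL UNIQUE,   -- the business key
    Load_DTS         TEXT NOT NULL
);

-- Satellite: historized attributes; composite PK of Hub FK + load timestamp
CREATE TABLE Sat_Customer (
    Hub_Customer_SQN INTEGER NOT NULL REFERENCES Hub_Customer,
    Load_DTS         TEXT NOT NULL,
    Load_End_DTS     TEXT,
    CustomerName     TEXT,
    PRIMARY KEY (Hub_Customer_SQN, Load_DTS)
);

CREATE TABLE Hub_Order (
    Hub_Order_SQN INTEGER PRIMARY KEY,
    OrderNumber   TEXT NOT NULL UNIQUE,
    Load_DTS      TEXT NOT NULL
);

-- Link: relationships stored as associations, accommodating any cardinality
CREATE TABLE Link_Customer_Order (
    Link_Customer_Order_SQN INTEGER PRIMARY KEY,
    Hub_Customer_SQN INTEGER NOT NULL REFERENCES Hub_Customer,
    Hub_Order_SQN    INTEGER NOT NULL REFERENCES Hub_Order,
    Load_DTS         TEXT NOT NULL,
    UNIQUE (Hub_Customer_SQN, Hub_Order_SQN)
);
""")
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

Note that every join in this Ensemble is a single-field equi-join on a surrogate key, which is exactly the referential-integrity constraint the method imposes.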

__________________________________________________________________________________________________________________________________________________________________________________

Page 6 of 20

BEAM Model Storming (Corr and Stagnitto)

o Accelerates agile dimensional design with a great shorthand notation on eye-friendly visual information displays, enabling real-time dimensional design during requirements meetings with business stakeholders.
o Begins with a user information story.
o Ends with artifacts that capture the business requirement while also specifying the logic for a star schema.
o One such artifact is an event matrix (minimal example).
o Includes source data column profiling at column/record level; ignores source data structure.

__________________________________________________________________________________________________________________________________________________________________________________

Page 7 of 20

Best-Practice NoSQL (Wide-Table, No-Joins) Data Model: Why not in 3rd Normal Form?

o Fields are duplicated and/or pivoted to balance join minimization against redundant storage.
o Just an example, not to be integrated in our example...
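To make the wide-table shape concrete (field names here are invented for the sketch): in 3rd Normal Form the customer's city would live in one row of a Customer table and be reached by a join; in the no-joins model it is duplicated into every order document so each record is readable on its own.

```python
# One self-contained "wide" record per order: customer attributes are
# duplicated into each document so no join is ever needed at read time.
orders = [
    {"order_id": 1, "customer_id": "C42", "customer_city": "Carlsbad", "total": 120.0},
    {"order_id": 2, "customer_id": "C42", "customer_city": "Carlsbad", "total":  75.5},
]

# The price of join-free reads: redundant storage, and a city change
# must be rewritten into every document that carries it.
cities = {o["customer_city"] for o in orders if o["customer_id"] == "C42"}
print(cities)
```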

__________________________________________________________________________________________________________________________________________________________________________________

Page 8 of 20

More on Lean Data Warehouse (Hyper-Normal / Data Vault)

Objectives
o Fully enforced, simple referential integrity (single-field equi-joins only).
o Identify a business key and store its values as unique records in a Hub table; a surrogate PK removes all functional dependencies (tight couplings) to this identifier FROM other tables' FKs.
o Store the history of value changes to all attributes in a child table using LoadDTS and LoadEndDTS.
o Store all table relationships via an associative join table, to accommodate any current or future real-world cardinality (1-to-1, 1-to-M, and M-to-M).

Why
o While preserving all actual relationships between records in related tables, all DW table relationships are now abstracted as Hub_PK, related to Link_FK, related to Hub_PK.
o For Satellite identifier fields that, in source, were used as foreign keys (thus tightly coupled), remove these functional dependencies TO other DV Ensembles.

Benefits
o Zero functional dependencies between DW Ensembles, so small increments may be designed, loaded, and released based only on the definition of a Minimally Viable Product (MVP), rather than forcing larger, slower, more functionally intertwined releases.
o When a directly related data subject area is later added in, this is accomplished with zero refactoring of the existing Ensembles.
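The LoadDTS / LoadEndDTS historization objective above can be sketched as a small routine. This is a deliberately simplified illustration (row shape and field names are assumptions; real Data Vault loaders also handle hashing, batching, and late-arriving data):

```python
OPEN_END = "9999-12-31 00:00:00"  # common convention marking the current record

def historize(sat_rows, hub_sqn, new_attrs, load_dts):
    """Insert a new Satellite version only if the attributes changed,
    end-dating (LoadEndDTS) the previously current record."""
    current = [r for r in sat_rows
               if r["hub_sqn"] == hub_sqn and r["load_end_dts"] == OPEN_END]
    if current and current[0]["attrs"] == new_attrs:
        return False  # no change in source: nothing to load
    if current:
        current[0]["load_end_dts"] = load_dts  # expire the old version
    sat_rows.append({"hub_sqn": hub_sqn, "attrs": new_attrs,
                     "load_dts": load_dts, "load_end_dts": OPEN_END})
    return True

sat = []
historize(sat, 1, {"city": "Carlsbad"},  "2016-04-01 00:00:00")
historize(sat, 1, {"city": "Carlsbad"},  "2016-04-02 00:00:00")  # unchanged: no-op
historize(sat, 1, {"city": "San Diego"}, "2016-04-03 00:00:00")
print(len(sat))  # two versions: the first now end-dated, the second current
```

This is how old values that get overwritten in the source system survive in the warehouse: each change becomes a new version row rather than an update in place.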

__________________________________________________________________________________________________________________________________________________________________________________

Page 9 of 20

Mindset for Lean DW ModelStorm Design:

o K.I.S.S.: once a source table is determined in-scope, include all fields and records, so you never have to add them later.
o Other than creating Hubs, Satellites, and Links, perform no other transformations in this layer: no calculations, aggregations, or business rules (yet).
o As such, we are NOT, or at least NOT YET, attempting to define a single version of the truth (SVOT), nor a data presentation / reporting RDBMS layer.
o Instead, we are…
o Loosely integrating data from multiple data sources
o Aligning it around business keys
o Tracking the history of attributes whose old values may be overwritten in source systems
o Supporting all actual (intended and otherwise) relationships among records in related tables
o Doing all of the above while enforcing simple referential integrity exclusively with single-field equi-joins

__________________________________________________________________________________________________________________________________________________________________________________

Page 10 of 20

DW ModelStorm Design Steps:

o Begin where BEAM ModelStorming ends. From there…
o Define business keys
o Identify in-scope source tables
o Reverse-engineer in-scope tables into a data modeling tool
o Identify and define cardinality of physical and logical (non-instantiated) relationships
o Classify each source table as a bona fide Entity or merely an Association

__________________________________________________________________________________________________________________________________________________________________________________

Page 11 of 20

__________________________________________________________________________________________________________________________________________________________________________________

Page 12 of 20

Now, group the source tables into distinct Subject Areas

o Make copies of all the above tables and place them into a new submodel

__________________________________________________________________________________________________________________________________________________________________________________

Page 13 of 20

Next, for each new table-copy…

o Remove all (source-based) foreign key relationships without removing the underlying identifier fields.
o Remove the primary key constraint.
o Add the following control / metadata fields:
  DWLoadBatchID_SourceSys
  DW_Load_DTS
  DW_Load_Expire_DTS
  Placeholder_SurrogateKey (explained later)
o Create a new composite primary key: Placeholder_SurrogateKey + Load_DTS.
o Satellite-splitting: if a subset of fields is updated in source much more frequently than the others, and the table will be large enough that ETL processing of the more frequent updates would result in excessive loading time, split the table into two or more subsets.
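The Satellite-splitting decision above can be sketched as a simple partition of columns by observed change frequency. The column names, update counts, and threshold below are all invented for the sketch:

```python
# Hypothetical change-frequency profile for one wide source table:
# number of updates observed per column over some profiling window.
update_counts = {
    "CustomerName": 3, "Address": 5, "CreditLimit": 4,
    "LastLoginDTS": 9500, "LoyaltyPoints": 8700,
}

FAST_THRESHOLD = 1000  # assumed cutoff separating "hot" from "cold" columns

# Hot columns go to their own frequently loaded Satellite; cold columns
# stay together, so the heavy ETL churn touches a much narrower table.
fast = sorted(c for c, n in update_counts.items() if n >= FAST_THRESHOLD)
slow = sorted(c for c, n in update_counts.items() if n < FAST_THRESHOLD)
print(fast)
print(slow)
```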

__________________________________________________________________________________________________________________________________________________________________________________

Page 14 of 20

__________________________________________________________________________________________________________________________________________________________________________________

Page 15 of 20

Then, starting with the tables classified earlier as bona fide Entities…

o In the new submodel, rename the Placeholder_SurrogateKey field to Hub_[EntityName]_SQN (or …_HashId) for all tables split from the source entity table.
o Copy one of these tables again.
o In the newest table-copy, delete all fields except the new PK, the new control fields, AND the Business Key.
o Rename the table as “Hub_[EntityName]”.
o Remove Load_DTS from the Primary Key.
o Add a unique constraint to the Business Key.
o Rename each of the corresponding tables as “Sat_[EntityName_Something]”.
o Create a defining relationship between the Hub (parent / 1) and each “Sat_[EntityName_Something]” so that the child table’s FK is also part of its PK.
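On the …_HashId variant mentioned above: a hash-style surrogate is typically derived from the business key itself (MD5 is the common choice in Data Vault 2.0), so Hubs loaded in parallel, even on different platforms, agree on keys without a shared sequence generator. A minimal sketch; the trim-and-uppercase normalization shown is one common convention, assumed here rather than taken from the deck:

```python
import hashlib

def hub_hash_id(business_key: str) -> str:
    """Deterministic surrogate key: normalize the business key, then MD5 it."""
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same business key always yields the same HashId, independent of
# load order or platform -- unlike a sequence-based ..._SQN.
print(hub_hash_id("c42 ") == hub_hash_id("C42"))
```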

Once all Entity tables are converted into Hub–Satellite sets, start on the mere-Association tables…

o Still in the new submodel, repeat the above steps to add control fields.
o Add a new Link_[AssocName]_SQN (or …_HashId) field.
o As above, set the PK as …_SQN + Load_DTS.
o Rename the table to “Sat_Link_[AssocName]”.
o Create another copy of the table, and rename it “Link_[AssocName]”.
o Follow the same remaining steps as with Hubs, except that no Business Key remains in the Link.
o Create a defining relationship from the Link (child) to each directly related Hub (parent), so that Hub_[ParentHub]_SQN is included in the Link.
o Create a unique key on the composite of the Hub_[ParentHub]_SQN fields.
o Create a defining relationship from the Link (parent) to the LinkSat (child).

__________________________________________________________________________________________________________________________________________________________________________________

Page 16 of 20

When all Hubs, Links, and Satellites are done, our example looks like this…

__________________________________________________________________________________________________________________________________________________________________________________

Page 17 of 20

At this time, in the 11th hour before our release, a new requirement is announced…

o With a truly elegant display of back-pedaling and dissembling by our primary business stakeholder, standing alongside the organization’s new data scientist.
o Remember that ‘not to be integrated’ NoSQL example? Well, it does need to integrate after all, and, oops, before the release.
o For what it’s worth, the data scientist assures us that, with his astonishing coding skills, he neither needs nor wants a data presentation layer or SVOT.

__________________________________________________________________________________________________________________________________________________________________________________

Page 18 of 20

Your team huddles privately afterwards…

Amid the grumbling, the PM politely asks, “How long will this take to design and load?”

You smile and answer: 1 to 2 days. An hour later, you show these model additions…

__________________________________________________________________________________________________________________________________________________________________________________

Page 19 of 20

Questions:

Does Lean Data Warehouse (Data Vault / Hyper Normal) extend to complex data models with many source systems?

__________________________________________________________________________________________________________________________________________________________________________________

Page 20 of 20

DecisionLab.Net

_____________________________________________________________________

Data Warehouse / Business Intelligence envisioning, implementation, oversight, and assessment
________________________________________________________________________________________________________________

This slide deck available now at… slideshare.net/DanielUpton/

_______________________________________________________________________________________________________________

Daniel Upton [email protected] Carlsbad, CA blog: http://www.decisionlab.net phone 760.525.3268