asper database presentation - data modeling topics

Upload: terry-bunio

Post on 14-Jun-2015

Page 1: Asper database presentation - Data Modeling Topics

@tbunio [email protected]

agilevoyageur.wordpress.com

www.protegra.com

Page 2: Asper database presentation - Data Modeling Topics

Agenda

• Data Modeling

• The Project

• Hot DB topics

• Relational vs Dimensional

• Dimensional concepts

– Facts

– Dimensions

• Complex Concept Introduction

Page 3: Asper database presentation - Data Modeling Topics

What is Data Modeling?

Page 4: Asper database presentation - Data Modeling Topics

Definition

• “A database model is a specification

describing how a database is structured and used” – Wikipedia

Page 5: Asper database presentation - Data Modeling Topics

Definition

• “A data model describes how the

data entities are related to each other

in the real world” - Terry

Page 6: Asper database presentation - Data Modeling Topics

Data Model Characteristics

• Organize/Structure like Data Elements

• Define relationships between Data

Entities

• Highly Cohesive

• Loosely Coupled

Page 7: Asper database presentation - Data Modeling Topics

The Project

• Major Health Service provider is

switching their claims system to SAP

• As part of this, they are totally redeveloping their entire Data

Warehouse solution

Page 8: Asper database presentation - Data Modeling Topics

The Project

• 3+ years duration

• 100+ Integration Projects

• 200+ Systems Personnel

Page 9: Asper database presentation - Data Modeling Topics

Integration Project Streams

• Client Administration – Policy Systems

• Data Warehouse

• Legacy – Conversion from Legacy

• Queries – Queries for internal and

external

• Web – External Web Applications

Page 10: Asper database presentation - Data Modeling Topics

Data Warehouse Team

• Terry – Data Architect/Modeler and

PM

• Hanaa – Lead Analyst and Data Analyst

• Kevin – Lead Data Migration

Developer

• Lisa – Lead Report Analyst

• Les – Lead Report Developer

Page 11: Asper database presentation - Data Modeling Topics

Current State

• Sybase Data Warehouse

– Combination of Normalized and

Dimensional design

• Data Migration

– Series of SQL Scripts that move data from Legacy (COBOL) and Java Applications

• Impromptu

– 1000+ Reports

Page 12: Asper database presentation - Data Modeling Topics

Target State

• SQL Server 2012

• SQL Server Integration Services for

Data Migration

• SQL Server Reporting Services for Report Development

• SharePoint for Report Portal

Page 13: Asper database presentation - Data Modeling Topics

Target Solution

• Initial load moves 2.5 Terabytes of data

• Initial load runs once

• Incremental load runs every hour

Page 14: Asper database presentation - Data Modeling Topics

Target Solution

• Operational Data Store – Normalized

– 400+ tables

• Data Warehouse – Dimensional

– 60+ tables

• Why both?

– ODS does not have history (Just Transactions)

Page 15: Asper database presentation - Data Modeling Topics

Our #1 Challenge

• We needed to be Agile like the other projects!

– We are now on revision 3500

• We spent weeks planning how to be flexible

• Instead of spending time on up-front planning, we spent it planning how we could quickly change and adapt

• This also meant we created a new automated test framework

Page 16: Asper database presentation - Data Modeling Topics
Page 17: Asper database presentation - Data Modeling Topics

Beef?

• Where are the hot topics like

– Big Data

– NoSQL

– MySQL

– Data Warehouse Appliances

– Cloud

– Open Source Databases

Page 18: Asper database presentation - Data Modeling Topics

Big Data

• “Commercial Databases” have come

a long way in handling large data volumes

• Big Data is still important but probably

is not required for the vast majority of

databases

– But it is applicable to the Facebooks and Amazons out there

Page 19: Asper database presentation - Data Modeling Topics

Big Data

• For example, many of the Big Data

solutions featured ColumnStore Indexing

• Now almost all commercial databases

offer ColumnStore Indexes

Page 20: Asper database presentation - Data Modeling Topics

NoSQL

• NoSQL was heralded a few years ago as the death of structured databases

• Mainly promoted by the developer community

• Seems to have found a niche supporting mainly unstructured and dynamic data

• Traditional databases are still the most efficient for structured data

Page 21: Asper database presentation - Data Modeling Topics

MySQL

• MySQL was also promoted as a great

lightweight, high-performance option

• We actually investigated it as an option for the project

• Great example of never trusting what

you hear

Page 22: Asper database presentation - Data Modeling Topics

MySQL

• All of the great MySQL benchmarks use the simplest database engine, which has no ACID compliance

– MySQL has the option to use different engines with different features

• Once you use the ACID-compliant engine, the performance is equivalent to (or worse than) that of SQL Server and PostgreSQL

Page 23: Asper database presentation - Data Modeling Topics

Data Warehouse Appliances

• “marketing term for an integrated set

of servers, storage, operating system(s), DBMS and software

specifically pre-installed and pre-optimized for data warehousing”

Page 24: Asper database presentation - Data Modeling Topics

Data Warehouse Appliances

• Recently in the Data Warehouse

Industry, there has been a rise of Data Warehouse appliances

• These appliances are a one-stop

solution that builds in Big Data

capabilities

Page 25: Asper database presentation - Data Modeling Topics

Data Warehouse Appliances

• Cool Names like:

– Teradata

– Greenplum

– Netezza

– InfoSphere

– EMC

• Like Big Data, these solutions are valuable if you need to play in the Big Data/Big Analysis arena

• Most solutions don’t require them

Page 26: Asper database presentation - Data Modeling Topics

Cloud

• Great to store pictures and music

– The concept still makes businesses nervous

– Also, regulatory requirements sometimes prevent it

• Business is starting to become more comfortable

– Still a ways to go

• Very few businesses go to the Cloud unless they have to

– Amazon/Microsoft are changing this with their services

Page 27: Asper database presentation - Data Modeling Topics

Open Source Databases

• We investigated Open Source

databases for our solution. We looked at:

– MySQL

– PostgreSQL

– others

Page 28: Asper database presentation - Data Modeling Topics

Open Source Databases

• We were surprised to learn that once

you factor in all the things you get from SQL Server, it actually is cheaper over

10 years than Open Source!!

• So we selected SQL Server

Page 29: Asper database presentation - Data Modeling Topics

Foundational DB Practices

Page 30: Asper database presentation - Data Modeling Topics

Two design methods

• Relational

– “Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.”

Page 31: Asper database presentation - Data Modeling Topics

Two design methods

• Dimensional

– “Dimensional modeling always uses the concepts of facts (measures), and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts.”

Page 32: Asper database presentation - Data Modeling Topics

Relational

Page 33: Asper database presentation - Data Modeling Topics

Relational

• Relational Analysis

– Database design is usually in Third Normal

Form

– Database is optimized for transaction processing (OLTP)

– Normalized tables are optimized for modification rather than retrieval

Page 34: Asper database presentation - Data Modeling Topics

Normal forms

• 1st - Under first normal form, all occurrences of a

record type must contain the same number of fields.

• 2nd - Second normal form is violated when a non-key

field is a fact about a subset of a key. It is only relevant when the key is composite.

• 3rd - Third normal form is violated when a non-key

field is a fact about another non-key field

Source: William Kent - 1982
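
The third normal form rule can be sketched with Python's built-in sqlite3 module. This is only an illustration: the tables (employee_flat, department, employee) and the sample rows are assumptions, not part of the project described here. The department's location is a fact about the department, a non-key field, so the flat table violates third normal form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Violates 3NF: dept_location is a fact about dept, a non-key field.
CREATE TABLE employee_flat (
    employee TEXT PRIMARY KEY,
    dept TEXT,
    dept_location TEXT
);
INSERT INTO employee_flat VALUES
    ('ann', 'claims', 'City A'),
    ('bob', 'claims', 'City A');

-- 3NF: the location is stored once, keyed by department.
CREATE TABLE department (dept TEXT PRIMARY KEY, location TEXT);
CREATE TABLE employee (
    employee TEXT PRIMARY KEY,
    dept TEXT REFERENCES department(dept)
);
INSERT INTO department VALUES ('claims', 'City A');
INSERT INTO employee VALUES ('ann', 'claims'), ('bob', 'claims');
""")

# Moving the department is a one-row update in the normalized design...
normalized_rows = conn.execute(
    "UPDATE department SET location = 'City B' WHERE dept = 'claims'"
).rowcount
# ...but touches every employee row in the flat design.
flat_rows = conn.execute(
    "UPDATE employee_flat SET dept_location = 'City B' WHERE dept = 'claims'"
).rowcount
```

The difference in touched rows is the "optimized for modification" property of normalized tables in miniature.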

Page 35: Asper database presentation - Data Modeling Topics

Dimensional

Page 36: Asper database presentation - Data Modeling Topics

Dimensional

• Dimensional Analysis

– Star Schema/Snowflake

– Database is optimized for analytical processing (OLAP)

– Facts and Dimensions optimized for retrieval

• Facts – Business events

– Transactions

• Dimensions – context for Transactions

– People

– Accounts

– Products

– Date
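
A small star schema along these lines might look like the sketch below, again using sqlite3. All table names, column names and rows are hypothetical and chosen only to mirror the Facts and Dimensions listed on this slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_person  (person_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER);
-- The fact row holds the measure plus one key per dimension.
CREATE TABLE fact_transaction (
    person_key  INTEGER REFERENCES dim_person(person_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    amount      REAL
);
INSERT INTO dim_person  VALUES (1, 'Person A'), (2, 'Person B');
INSERT INTO dim_product VALUES (1, 'Dental'), (2, 'Vision');
INSERT INTO dim_date    VALUES (20150410, 2015), (20160410, 2016);
INSERT INTO fact_transaction VALUES
    (1, 1, 20150410, 100.0),
    (1, 2, 20150410, 40.0),
    (2, 1, 20160410, 60.0);
""")

# Dimensions filter and label; the fact supplies the number to aggregate.
totals_by_product = conn.execute("""
    SELECT p.product, SUM(f.amount)
    FROM fact_transaction AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date    AS d ON d.date_key = f.date_key
    WHERE d.year = 2015
    GROUP BY p.product
    ORDER BY p.product
""").fetchall()
```

Every reporting query follows the same shape: start at the Fact, join out one hop to each Dimension needed.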

Page 37: Asper database presentation - Data Modeling Topics

Relational

• 3 Dimensions

• Spatial Model

– No historical components except for transactional tables

• Relational – Models the one truth of the data

– One account ‘11’

– One person ‘Terry Bunio’

– One transaction of ‘$100.00’ on April 10th

Page 38: Asper database presentation - Data Modeling Topics

Dimensional

• 4 Dimensions

• Temporal Model

– All tables have a time component

• Dimensional – Models the one truth of the data at a point in time

– Multiple versions of Accounts over time

– Multiple versions of people over time

– One transaction

• Transactions are already temporal

Page 39: Asper database presentation - Data Modeling Topics

Fact Tables

• Contains the measurements or facts about a business process

• Are thin and deep

• Usually one of:

– Business transaction

– Business Event

• The grain of a Fact table is the level of detail of the data recorded

– Order, Invoice, Invoice Item
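
To make grain concrete, here is a rough Python sketch with made-up invoice data (the field names are assumptions). Facts recorded at Invoice Item grain can be rolled up to Invoice grain, but not the other way around.

```python
from collections import defaultdict

# Hypothetical fact rows at invoice-item grain: one row per line item.
invoice_item_facts = [
    {"invoice": "INV-1", "item": 1, "amount": 30.0},
    {"invoice": "INV-1", "item": 2, "amount": 20.0},
    {"invoice": "INV-2", "item": 1, "amount": 15.0},
]

# A finer grain can always be rolled up to a coarser one...
invoice_facts = defaultdict(float)
for row in invoice_item_facts:
    invoice_facts[row["invoice"]] += row["amount"]
# ...but invoice-grain totals cannot be split back into line items.
```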

Page 40: Asper database presentation - Data Modeling Topics

Special Fact Tables

• Degenerate Dimensions

– Degenerate Dimensions are Dimensions

that can typically provide additional context about a Fact

• For example, flags that describe a transaction

• Degenerate Dimensions can either be a separate Dimension table or be

collapsed onto the Fact table

– My preference is the latter

Page 41: Asper database presentation - Data Modeling Topics

Dimension Tables

• Unlike fact tables, dimension tables

contain descriptive attributes that are typically textual fields

• These attributes are designed to serve

two critical purposes:

– query constraining and/or filtering

– query result set labeling.

Source: Wikipedia

Page 42: Asper database presentation - Data Modeling Topics

Dimension Tables

• Shallow and Wide

• Usually corresponds to entities that the

business interacts with

– People

– Locations

– Products

– Accounts

Page 43: Asper database presentation - Data Modeling Topics

Time Dimension

• All Dimensional Models need a time

component

• This is either a:

– Separate Time Dimension (recommended)

– Time attributes on each Fact Table
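
A separate Time Dimension is usually generated rather than loaded from a source system. The function below is one possible sketch; the name build_date_dimension and the chosen attributes are assumptions, and a real Dimension would likely carry more (fiscal periods, holidays and so on).

```python
from datetime import date, timedelta

def build_date_dimension(start, end):
    """Return one row per calendar day from start to end inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_key": int(current.strftime("%Y%m%d")),  # e.g. 20150410
            "full_date": current,
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "day_of_week": current.isoweekday(),  # 1 = Monday .. 7 = Sunday
        })
        current += timedelta(days=1)
    return rows

date_dim = build_date_dimension(date(2015, 1, 1), date(2015, 12, 31))
```

Fact rows then store only the date_key, and reports group by whichever attribute they need.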

Page 44: Asper database presentation - Data Modeling Topics

Mini-Dimensions

Page 45: Asper database presentation - Data Modeling Topics

Mini-Dimensions

• Splitting a Dimension up based on how

often a set of attributes changes

• Helps to reduce the growth of the

Dimension table

Page 46: Asper database presentation - Data Modeling Topics

Slowly Changing Dimensions

• Type 1 – Overwrite the row with the

new values and update the effective date

– Pre-existing Facts now refer to the updated Dimension

– May cause inconsistent reports
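
A Type 1 change, sketched with sqlite3. The dim_account and fact_sale tables and their columns are hypothetical. The point is that the same historical Fact is reported under a different region after the overwrite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_account (
    account_key INTEGER PRIMARY KEY,
    account_no TEXT, region TEXT, effective_date TEXT
);
CREATE TABLE fact_sale (account_key INTEGER, amount REAL);
INSERT INTO dim_account VALUES (1, '10123', 'West', '2014-01-01');
INSERT INTO fact_sale VALUES (1, 100.0);
""")

def region_totals():
    return conn.execute("""
        SELECT d.region, SUM(f.amount) FROM fact_sale f
        JOIN dim_account d USING (account_key) GROUP BY d.region
    """).fetchall()

before = region_totals()
# Type 1: overwrite in place; no new row, so the old value is lost.
conn.execute(
    "UPDATE dim_account SET region = 'East', effective_date = '2015-04-10' "
    "WHERE account_no = '10123'"
)
after = region_totals()
```

A report run before the change and rerun after it will disagree about where the sale belongs, which is the inconsistency mentioned above.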

Page 47: Asper database presentation - Data Modeling Topics

Slowly Changing Dimensions

• Type 2 – Insert a new Dimension row with

the new data and new effective date

– Update the expiry date on the prior row

• Don’t update old Facts that refer to the old

row

– Only new Facts will refer to this new Dimension

row

• Type 2 Slowly Changing Dimension

maintains the historical context of the data
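
The Type 2 steps can be sketched as follows. The schema, the apply_type2_change helper and the convention that a NULL expiry date marks the current row are all assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_account (
    account_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    account_no TEXT,                                -- natural key
    region TEXT,
    effective_date TEXT,
    expiry_date TEXT                                -- NULL = current row
);
CREATE TABLE fact_sale (account_key INTEGER, sale_date TEXT, amount REAL);
INSERT INTO dim_account (account_no, region, effective_date)
    VALUES ('10123', 'West', '2014-01-01');
INSERT INTO fact_sale VALUES (1, '2014-06-01', 100.0);
""")

def apply_type2_change(account_no, new_region, change_date):
    """Expire the current row, then insert a new version of it."""
    conn.execute(
        "UPDATE dim_account SET expiry_date = ? "
        "WHERE account_no = ? AND expiry_date IS NULL",
        (change_date, account_no),
    )
    cur = conn.execute(
        "INSERT INTO dim_account (account_no, region, effective_date) "
        "VALUES (?, ?, ?)",
        (account_no, new_region, change_date),
    )
    return cur.lastrowid  # surrogate key for new facts to reference

new_key = apply_type2_change("10123", "East", "2015-04-10")
conn.execute("INSERT INTO fact_sale VALUES (?, '2015-05-01', 60.0)", (new_key,))

# The old fact still joins to the 'West' version; only the new one is 'East'.
totals = conn.execute("""
    SELECT d.region, SUM(f.amount) FROM fact_sale f
    JOIN dim_account d USING (account_key)
    GROUP BY d.region ORDER BY d.region
""").fetchall()
```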

Page 48: Asper database presentation - Data Modeling Topics

Slowly Changing Dimensions

• No longer do I have one row to

represent:

– Account 10123

– Terry Bunio

– Sales Representative 11092

• This changes the mindset and query syntax to retrieve data
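
As a sketch of what the changed query syntax looks like, the two helpers below read a Type 2 Dimension either for the current version or as of a given date. The table layout and function names are hypothetical, and the expiry date is treated as exclusive by assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_account (
    account_key INTEGER PRIMARY KEY, account_no TEXT, region TEXT,
    effective_date TEXT, expiry_date TEXT
);
INSERT INTO dim_account VALUES
    (1, '10123', 'West', '2014-01-01', '2015-04-10'),
    (2, '10123', 'East', '2015-04-10', NULL);
""")

def current_version(account_no):
    # The natural key alone is no longer enough; ask for the open row.
    return conn.execute(
        "SELECT region FROM dim_account "
        "WHERE account_no = ? AND expiry_date IS NULL",
        (account_no,),
    ).fetchone()[0]

def version_as_of(account_no, as_of):
    # ISO dates compare correctly as text; the expiry date is exclusive.
    return conn.execute(
        "SELECT region FROM dim_account "
        "WHERE account_no = ? AND effective_date <= ? "
        "AND (expiry_date IS NULL OR expiry_date > ?)",
        (account_no, as_of, as_of),
    ).fetchone()[0]
```

Queries that forget the date predicate return one row per version, which is the usual first surprise with Type 2.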

Page 49: Asper database presentation - Data Modeling Topics

Slowly Changing Dimensions

• Type 3 – The Dimension stores multiple versions for the attribute in question

• This usually involves a current and previous value for the attribute

• When a change occurs, no rows are added but both the current and previous attributes are updated

• Like Type 1, Type 3 does not retain full historical context
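
A Type 3 change in miniature, using a plain Python dataclass rather than a database table. The class and field names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AccountDimRow:
    account_no: str
    current_region: str
    previous_region: Optional[str] = None

def apply_type3_change(row, new_region):
    """Shift current into previous, then overwrite current."""
    row.previous_region = row.current_region
    row.current_region = new_region

row = AccountDimRow("10123", "West")
apply_type3_change(row, "East")
apply_type3_change(row, "North")  # 'West' is no longer recoverable
```

After the second change only 'North' and 'East' remain, which is why Type 3 does not retain full history.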

Page 50: Asper database presentation - Data Modeling Topics

Complexity

• Most textbooks stop here and only show

the simplest Dimensional Models

• Unfortunately, I’ve never run into a Dimensional Model like that

Page 51: Asper database presentation - Data Modeling Topics

Complex Concept Introduction

• Snowflake vs Star Schema

• Multi-Valued Dimensions and Bridges

• Recursive Hierarchies

Page 52: Asper database presentation - Data Modeling Topics

Snowflake vs Star Schema

Page 53: Asper database presentation - Data Modeling Topics

Snowflake vs Star Schema

Page 54: Asper database presentation - Data Modeling Topics

Dimensional

Page 55: Asper database presentation - Data Modeling Topics

Snowflake vs Star Schema

• These extra tables are termed outriggers

• They are used to address real world complexities with the data

– Excessive row length

– Repeating groups of data within the Dimension

• I will use outriggers in a limited way for repeating data

Page 56: Asper database presentation - Data Modeling Topics

Multi-Valued Dimensions

• A Multi-Valued Dimension arises when a

Fact needs to connect more than once to a Dimension

– Primary Sales Representative

– Secondary Sales Representative

Page 57: Asper database presentation - Data Modeling Topics

Multi-Valued Dimensions

• Two possible solutions

– Create copies of the Dimensions for each

role

– Create a Bridge table to resolve the many to many relationship
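
The first option can be sketched as below. Note one substitution: instead of physical copies of the Dimension, this sketch uses one view per role over a single table, which is an assumption on my part and not necessarily what the project did. All names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_sales_rep (rep_key INTEGER PRIMARY KEY, name TEXT);
-- One view per role stands in for a physical copy of the dimension.
CREATE VIEW dim_primary_rep   AS SELECT * FROM dim_sales_rep;
CREATE VIEW dim_secondary_rep AS SELECT * FROM dim_sales_rep;
CREATE TABLE fact_sale (
    primary_rep_key INTEGER, secondary_rep_key INTEGER, amount REAL
);
INSERT INTO dim_sales_rep VALUES (1, 'Rep A'), (2, 'Rep B');
INSERT INTO fact_sale VALUES (1, 2, 100.0);
""")

# Each role joins through its own key column on the fact.
sale = conn.execute("""
    SELECT p.name, s.name, f.amount
    FROM fact_sale f
    JOIN dim_primary_rep   p ON p.rep_key = f.primary_rep_key
    JOIN dim_secondary_rep s ON s.rep_key = f.secondary_rep_key
""").fetchone()
```

This works when the number of roles is small and fixed; an open-ended number of representatives per Fact points toward the Bridge table instead.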

Page 58: Asper database presentation - Data Modeling Topics

Multi-Valued Dimensions

Page 59: Asper database presentation - Data Modeling Topics

Bridge Tables

Page 60: Asper database presentation - Data Modeling Topics

Bridge Tables

• Bridge Tables can be used to resolve any

many to many relationships

• This is frequently required with more

complex data areas

• These bridge tables need to be

considered a Dimension and they need

to use the same Slowly Changing

Dimension Design as the base Dimension

– My Recommendation
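
A Bridge table sketch, with effective and expiry dates on the bridge rows in the spirit of the recommendation above. The schema is hypothetical. The last two queries also show a common side effect of joining through a bridge: summing a measure across it counts the Fact once per bridged row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_sales_rep (rep_key INTEGER PRIMARY KEY, name TEXT);
-- The bridge resolves many-to-many: one group key maps to many reps.
-- It carries its own effective/expiry dates, like a Type 2 Dimension.
CREATE TABLE bridge_rep_group (
    rep_group_key INTEGER, rep_key INTEGER,
    effective_date TEXT, expiry_date TEXT
);
CREATE TABLE fact_sale (sale_id INTEGER, rep_group_key INTEGER, amount REAL);
INSERT INTO dim_sales_rep VALUES (1, 'Rep A'), (2, 'Rep B'), (3, 'Rep C');
INSERT INTO bridge_rep_group VALUES
    (10, 1, '2014-01-01', NULL),
    (10, 2, '2014-01-01', NULL),
    (20, 3, '2014-01-01', NULL);
INSERT INTO fact_sale VALUES (1, 10, 100.0), (2, 20, 60.0);
""")

reps_per_sale = conn.execute("""
    SELECT f.sale_id, r.name
    FROM fact_sale f
    JOIN bridge_rep_group b ON b.rep_group_key = f.rep_group_key
    JOIN dim_sales_rep r ON r.rep_key = b.rep_key
    WHERE b.expiry_date IS NULL
    ORDER BY f.sale_id, r.name
""").fetchall()

# Summing through the bridge repeats sale 1 for each of its two reps.
naive_total = conn.execute("""
    SELECT SUM(f.amount) FROM fact_sale f
    JOIN bridge_rep_group b ON b.rep_group_key = f.rep_group_key
""").fetchone()[0]
true_total = conn.execute("SELECT SUM(amount) FROM fact_sale").fetchone()[0]
```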

Page 61: Asper database presentation - Data Modeling Topics

Hierarchies and Recursive Hierarchies

Page 62: Asper database presentation - Data Modeling Topics

Why?

• Why Dimensional Model?

• Allows for a concise representation of

data for reporting. This is especially important for Self-Service Reporting

– We reduced from 400+ tables in our Operational Data Store to 60+ tables in our Data Warehouse

– Aligns with real world business concepts

Page 63: Asper database presentation - Data Modeling Topics

Why?

• The most important reason

– Requires detailed understanding of the data

– Validates the solution

– Uncovers inconsistencies and errors in the Normalized Model

• Easy for inconsistencies and errors to hide in 400+ tables

• No place to hide when those tables are reduced down

Page 64: Asper database presentation - Data Modeling Topics

Why?

• Ultimately there must be a business

requirement for a temporal data model and not just a spatial one.

• Although you could go through the

exercise to validate your

understanding and not implement the Dimensional Data Model

Page 65: Asper database presentation - Data Modeling Topics

How?

• Start with your simplest Dimension and Fact tables and define the Natural Keys for them

– e.g. People, Product, Transaction, Time

• De-Normalize Reference tables to Dimensions (And possibly Facts based on how large the Fact tables will be)

– I place both codes and descriptions on the Dimension and Fact tables

• Look to De-normalize other tables with the same Cardinality into one Dimension

– Validate the Natural Keys still define one row
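
Checking that the Natural Keys still define one row is easy to automate. The helper below is a sketch; duplicate_natural_keys and the sample Person rows are made up for illustration.

```python
from collections import Counter

def duplicate_natural_keys(rows, key_fields):
    """Return natural-key values that identify more than one row."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)

# Hypothetical de-normalized Person rows; field names are illustrative.
person_rows = [
    {"person_no": "P1", "address_type": "home", "city": "City A"},
    {"person_no": "P1", "address_type": "work", "city": "City A"},
    {"person_no": "P2", "address_type": "home", "city": "City B"},
]
```

Here person_no alone stops being unique once addresses are folded in, so either the Natural Key has to grow to include address_type or the tables did not really share a cardinality.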

Page 66: Asper database presentation - Data Modeling Topics

How?

• Don’t force entities onto the same Dimension

– Tempting but you will find it doesn’t represent the data and will cause issues for loading or retrieval

– Bridge tables or mini-snowflakes are not bad

• I don’t like a deep snowflake, but shallow snowflakes can be appropriate

• Don’t fall into the Star-Schema/Snowflake Holy War

– Let your data define the solution

Page 67: Asper database presentation - Data Modeling Topics

How?

• Iterate, Iterate, Iterate

– Your initial solution will be wrong

– Create it and start to define the load

process and reports

– You will learn more by using the data than from months of analysis spent trying to get the model right

Page 68: Asper database presentation - Data Modeling Topics

Two things to Ponder

Page 69: Asper database presentation - Data Modeling Topics

Two things to Ponder

• In the Information Age ahead,

databases will be used more for analysis than for operations

– More Dimensional Models and analytical processes

Page 70: Asper database presentation - Data Modeling Topics

Two things to Ponder

• Critical skills going forward will be:

– Data Modeling/Data Architecture

– Data Migration

• There is a whole subject area here for a subsequent presentation. More of an art than science

– Data Verbalization

• Again a real art form to take a huge amount of data and present it in a readable form

Page 71: Asper database presentation - Data Modeling Topics

Whew! Questions?