Building an Effective Data Warehouse Architecture James Serra, Big Data Evangelist Microsoft May 7-9, 2014 | San Jose, CA

Building an Effective Data Warehouse Architecture

James Serra, Big Data Evangelist

Microsoft

May 7-9, 2014 | San Jose, CA

Other Presentations

Building an Effective Data Warehouse Architecture: Reasons for building a DW and the various approaches and DW concepts (Kimball vs Inmon)

Building a Big Data Solution (Building an Effective Data Warehouse Architecture with Hadoop, the cloud and MPP): Explains what Big Data is, its benefits including use cases, and how Hadoop, the cloud, and MPP fit in

Finding business value in Big Data (What exactly is Big Data and why should I care?): Very similar to “Building a Big Data Solution” but target audience is business users/CxO instead of architects

How does Microsoft solve Big Data?: Covers the Microsoft products that can be used to create a Big Data solution

Modern Data Warehousing with the Microsoft Analytics Platform System: The next step in data warehouse performance is APS, an MPP appliance

Power BI, Azure ML, Azure HDInsight, Azure Data Factory, etc.: Deep dives into the various Microsoft Big Data related products

About Me

Business Intelligence Consultant, in IT for 28 years

Microsoft, Big Data Evangelist

Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW developer

Been perm, contractor, consultant, business owner

Presenter at PASS Business Analytics Conference and PASS Summit

MCSE for SQL Server 2012: Data Platform and BI

Blog at JamesSerra.com

SQL Server MVP

Author of book “Reporting with Microsoft SQL Server 2012”

I tried to build a data warehouse on my own…

And ended up passed-out drunk in a Denny’s parking lot

Let’s prevent that from happening…

Agenda

What a Data Warehouse is not

What is a Data Warehouse and why use one?

Fast Track Data Warehouse (FTDW)

Appliances

Data Warehouse vs Data Mart

Kimball and Inmon Methodologies

Populating a Data Warehouse

ETL vs ELT

Surrogate Keys

SSAS Cubes

Modern Data Warehouse

What a Data Warehouse is not

• A data warehouse is not a copy of a source database with the name prefixed with “DW”

• It is not a copy of multiple tables (e.g. customer) from various source systems unioned together in a view

• It is not a dumping ground for tables from various sources with little design put into it

Data Warehouse Maturity Model

Courtesy of Wayne Eckerson

What is a Data Warehouse and why use one?

A data warehouse is where you store data from multiple data sources to be used for historical and trend analysis reporting. It acts as a central repository for many subject areas and contains the "single version of truth". It is NOT to be used for OLTP applications.

Reasons for a data warehouse:

Reduce stress on production system

Optimized for read access, sequential disk scans

Integrate many sources of data

Keep historical records (no need to save hardcopy reports)

Restructure/rename tables and fields, model data

Protect against source system upgrades

Use Master Data Management, including hierarchies

No IT involvement needed for users to create reports

Improve data quality and plug holes in source systems

One version of the truth

Easy to create BI solutions on top of it (e.g. SSAS cubes)

Why use a Data Warehouse?

Legacy applications + databases = chaos

[Diagram: separate legacy systems] Production Control, MRP, Inventory Control, Parts Management, Logistics, Shipping, Raw Goods, Order Control, Purchasing, Marketing, Finance, Sales, Accounting, Management Reporting, Engineering, Actuarial, Human Resources

[Diagram labels] Continuity, Consolidation, Control, Compliance, Collaboration

Enterprise data warehouse = order

Single version of the truth

Every question = decision

Two purposes of a data warehouse: 1) save time building reports; 2) slice and dice in ways you could not do before

Hardware Solutions

Fast Track Data Warehouse - A reference configuration optimized for data warehousing. This saves an organization from having to commit resources to configure and build the server hardware. Fast Track Data Warehouse hardware is tested for data warehousing, which eliminates guesswork and is designed to save you months of configuration, setup, testing and tuning. You just need to install the OS and SQL Server.

Appliances - Microsoft has made available SQL Server appliances (SMP and MPP) that allow customers to deploy data warehouse (DW), business intelligence (BI) and database consolidation solutions in a very short time, with all the components pre-configured and pre-optimized. These appliances include all the hardware, software and services for a complete, ready-to-run, out-of-the-box, high-performance, energy-efficient solution.

Data Warehouse Fast Track for SQL Server 2014

Hardware system design

• Tight specifications for servers, storage, and networking

• Resource balanced and validated

• Latest-generation servers and storage, including solid-state disks (SSDs)

Database configuration

• Workload-specific

• Database architecture

• SQL Server settings

• Windows Server settings

• Performance guidance

Software

• SQL Server 2014 Enterprise

• Windows Server 2012 R2

[Diagram labels: Processors, Networking, Servers, Storage, Windows Server 2012 R2, SQL Server 2014]

Options for data warehouse solutions

Balancing flexibility and choice

By yourself

• Existing or procured hardware and support

• Procured software and support

• Offerings: SQL Server 2014, Windows Server 2012 R2, System Center 2012 SP1

With a reference architecture

• Existing or procured hardware and support

• Procured software and support

• Offerings: Private Cloud Fast Track, Data Warehouse Fast Track

With an appliance

• Procured appliance and support

• Offerings: Data Warehouse Fast Track, Analytics Platform System

[Diagram: work shown per option: Installation, Configuration, Tuning and optimization. Axes: Time to solution (HIGH to LOW); Price (HIGH). Note: Optional, if you have hardware already]

Data Warehouse Fast Track advantages

Flexibility and choice

Reduced risk

Faster deployment

Vendors with 2014 Fast Track Appliances

Dell

EMC

HP/SanDisk

Lenovo

NEC

Tegile

Data Warehouse vs Data Mart

Data Warehouse: A single organizational repository of enterprise-wide data across many or all subject areas

Holds multiple subject areas

Holds very detailed information

Works to integrate all data sources

Feeds data marts

Data Mart: Subset of the data warehouse that is usually oriented to a specific subject (finance, marketing, sales)

• The logical combination of all the data marts is a data warehouse

In short, a data warehouse contains many subject areas, and a data mart contains just one of those subject areas

Kimball and Inmon Methodologies

Two approaches for building data warehouses

Kimball and Inmon Myths

Myth: Kimball is a bottom-up approach without enterprise focus

Really top-down: BUS matrix (choose business processes/data sources), conformed dimensions, MDM

Myth: Inmon requires a ton of up-front design that takes a long time

Inmon says to build the DW iteratively, not with a big-bang approach (p. 91 BDW, p. 21 Imhoff)

Myth: Star schema data marts are not allowed in Inmon’s model

Inmon says they are good for direct end-user access of data (p. 365 BDW) and good for data marts (p. 12 TTA)

Myth: Very few companies use the Inmon method

Survey had 39% Inmon vs 26% Kimball. Many have an EDW

Myth: The Kimball and Inmon architectures are incompatible

They can work together to provide a better solution

Kimball and Inmon Methodologies

Relational (Inmon) vs Dimensional (Kimball)

Relational Modeling:

Entity-Relationship (ER) model

Normalization rules

Many tables using joins

History tables, natural keys

Good for indirect end-user access of data

Dimensional Modeling:

Facts and dimensions, star schema

Fewer tables, but with duplicate data (de-normalized)

Easier for user to understand (but strange for IT people used to relational)

Slowly changing dimensions, surrogate keys

Good for direct end-user access of data
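
To make the "facts and dimensions, star schema" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server. The table names, columns, and sample rows are hypothetical and are not taken from the slides.

```python
import sqlite3

# In-memory SQLite database stands in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, one row per member.
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,   -- surrogate key
    product_name TEXT,
    category     TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,       -- e.g. 20140507
    year     INTEGER,
    month    INTEGER
);
-- Fact table: measures plus foreign keys to each dimension.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount      REAL
);
INSERT INTO dim_product VALUES (1, 'Road Bike', 'Bikes'), (2, 'Helmet', 'Accessories');
INSERT INTO dim_date VALUES (20140507, 2014, 5), (20140608, 2014, 6);
INSERT INTO fact_sales VALUES
    (20140507, 1, 2, 1800.0),
    (20140507, 2, 5, 150.0),
    (20140608, 1, 1, 900.0);
""")

# A typical star-schema query: join the fact to its dimensions, then group.
sales_by_category = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount) AS total_amount
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date    AS d ON d.date_key    = f.date_key
    GROUP BY p.category, d.year
    ORDER BY p.category
""").fetchall()

print(sales_by_category)
```

Every business question follows the same shape (one fact table joined to a handful of dimensions), which is a large part of why this layout tends to be easier for end users than a fully normalized model.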

Relational Model vs Dimensional Model

[Diagrams: Relational Model, Dimensional Model]

If you are a business user, which model is easier to use?

Kimball and Inmon Methodologies

• Kimball:

• Logical data warehouse (BUS), made up of subject areas (data marts)

• Business driven, users have active participation

• Decentralized data marts (not required to be a separate physical data store)

• Independent dimensional data marts optimized for reporting/analytics

• Integrated via Conformed Dimensions (provides consistency across data sources)

• 2-tier (data mart, cube), less ETL, no data duplication

• Inmon:

• Enterprise data model (CIF) that is an enterprise data warehouse (EDW)

• IT driven, users have passive participation

• Centralized atomic normalized tables (off limits to end users)

• Later create dependent data marts that are separate physical subsets of data and can be used for multiple purposes

• Integration via enterprise data model

• 3-tier (data warehouse, data mart, cube), duplication of data

Kimball Model

Why staging: Limit source contention (ELT), Recoverability, Backup, Auditing

Inmon Model

Reasons to add an Enterprise Data Warehouse

Single version of the truth

May make building dimensions easier using lightly denormalized tables in the EDW instead of going directly from the OLTP source

A normalized EDW results in enterprise-wide consistency, which makes it easier to spawn off the data marts, at the expense of duplicated data

Less daily ETL refresh and reconciliation if you have many sources and many data marts in multiple databases

One place to control data (no duplication of effort and data)

Reason not to: If you have only a few sources that need reporting quickly

Which model to use?

The models are not that different, having become similar over the years, and can complement each other

It boils down to this: Inmon creates a normalized DW before creating a dimensional data mart, and Kimball skips the normalized DW

With tweaks to each model, they look very similar (adding a normalized EDW to Kimball, dimensionally structured data marts to Inmon)

Bottom line: Understand both approaches and pick parts from both for your situation; there is no need to choose just one approach

BUT, no solution will be effective unless you possess soft skills (leadership, communication, planning, and interpersonal relationships)

Hybrid Model

Advice: Use SQL Server Views to interface between each level in the model

In the DW Bus Architecture, each data mart could be a schema (broken out by business process subject areas), all in one database. Another option is to have each data mart in its own database with all databases on one server or spread among multiple servers. Also, the staging areas, CIF, and DW Bus can all be on the same powerful server (MPP)
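
The advice about using views between levels can be sketched as follows. This uses Python's built-in sqlite3 module rather than SQL Server, and every table, view, and column name is a hypothetical placeholder. The point it illustrates is that the next layer reads only the view, so the underlying table can be restructured without changing downstream queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse server (assumption)
conn.executescript("""
CREATE TABLE stg_customer (cust_id TEXT, cust_nm TEXT);
INSERT INTO stg_customer VALUES ('C1', ' alice '), ('C2', 'BOB');

-- The view is the contract that the next layer reads.
CREATE VIEW v_customer AS
SELECT cust_id       AS customer_id,
       TRIM(cust_nm) AS customer_name
FROM stg_customer;
""")

# Downstream code only ever references the view.
DOWNSTREAM_QUERY = "SELECT customer_id, customer_name FROM v_customer ORDER BY customer_id"
rows_before = conn.execute(DOWNSTREAM_QUERY).fetchall()

# Later the staging layout changes; only the view definition is updated.
conn.executescript("""
DROP VIEW v_customer;
CREATE TABLE stg_customer_v2 (id TEXT, first_name TEXT, last_name TEXT);
INSERT INTO stg_customer_v2 VALUES ('C1', 'alice', 'smith');
CREATE VIEW v_customer AS
SELECT id                             AS customer_id,
       first_name || ' ' || last_name AS customer_name
FROM stg_customer_v2;
""")

# The same downstream query still works against the new layout.
rows_after = conn.execute(DOWNSTREAM_QUERY).fetchall()
print(rows_before, rows_after)
```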

Kimball Methodology

From: Kimball’s The Microsoft Data Warehouse Toolkit

Kimball defines a development lifecycle, whereas Inmon covers only the data warehouse itself (not how it is used)

Populating a Data Warehouse

Determine frequency of data pull (daily, weekly, etc.)

Full Extraction – All data (usually dimension tables)

Incremental Extraction – Only data changed since the last run (fact tables)

How to determine data that has changed

Timestamp - Last Updated

Change Data Capture (CDC)

Partitioning by date

Triggers on tables

MERGE SQL Statement

Column DEFAULT value populated with date

Online Extraction – Data from source. First create copy of source:

Replication

Database Snapshot

Availability Groups

Offline Extraction – Data from flat file
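
As an illustration of the "Timestamp - Last Updated" option for incremental extraction, the following sketch keeps a watermark (the latest timestamp already extracted) and pulls only rows modified after it. It uses Python's built-in sqlite3 module as a stand-in for the source system; the `orders` table, its columns, and the `extract_changed` helper are hypothetical.

```python
import sqlite3

source = sqlite3.connect(":memory:")  # stand-in for the source system (assumption)
source.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, last_updated TEXT);
INSERT INTO orders VALUES
    (1, 10.0, '2014-05-01 08:00:00'),
    (2, 20.0, '2014-05-02 09:30:00'),
    (3, 30.0, '2014-05-03 11:15:00');
""")

def extract_changed(conn, watermark):
    """Return rows modified after `watermark` and the new watermark to store."""
    rows = conn.execute(
        "SELECT order_id, amount, last_updated FROM orders "
        "WHERE last_updated > ? ORDER BY last_updated",
        (watermark,),
    ).fetchall()
    # ISO-formatted timestamps sort correctly as plain strings.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# First run: everything after the previously stored watermark.
first_batch, watermark = extract_changed(source, "2014-05-01 23:59:59")

# The source updates an old row; the next run picks up only that row.
source.execute(
    "UPDATE orders SET amount = 15.0, last_updated = '2014-05-04 07:00:00' "
    "WHERE order_id = 1"
)
second_batch, watermark = extract_changed(source, watermark)
print(first_batch, second_batch, watermark)
```

This approach depends on the source reliably maintaining the last-updated column, and it does not see deleted rows, which is one reason the other options on the slide (such as CDC) exist.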

ETL vs ELT

• Extract, Transform, and Load (ETL)

• Transform while hitting source system

• No staging tables

• Processing done by ETL tools (SSIS)

• Extract, Load, Transform (ELT)

• Uses staging tables

• Processing done by target database engine (SSIS: Execute T-SQL Statement task instead of Data Flow Transform tasks)

• Use for big volumes of data

• Use when source and target databases are the same

• Use with the Analytics Platform System (APS)

ELT is better since the database engine is more efficient than SSIS

• Best use of database engine: Transformations

• Best use of SSIS: Data pipeline and workflow management
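
A minimal sketch of the ELT pattern described above: raw rows are loaded unchanged into a staging table, and the transformation is then done by the database engine in a single set-based SQL statement. Python's built-in sqlite3 module stands in for the target database here; the table names, sample rows, and the simple price-format filter are assumptions made for illustration.

```python
import sqlite3

# Raw extract, e.g. rows read from a flat file (hypothetical data).
raw_rows = [
    ("1001", " widget ", "19.5"),
    ("1002", "GADGET", "5.0"),
    ("1003", "gizmo", "not-a-number"),
]

dw = sqlite3.connect(":memory:")  # stand-in for the target database (assumption)
dw.executescript("""
CREATE TABLE stg_product (product_code TEXT, product_name TEXT, list_price TEXT);
CREATE TABLE dim_product (product_code INTEGER, product_name TEXT, list_price REAL);
""")

# Extract + Load: land the data as-is in staging, no row-by-row logic.
dw.executemany("INSERT INTO stg_product VALUES (?, ?, ?)", raw_rows)

# Transform: the engine cleans, casts, and filters in one set-based statement.
dw.execute("""
    INSERT INTO dim_product (product_code, product_name, list_price)
    SELECT CAST(product_code AS INTEGER),
           UPPER(TRIM(product_name)),
           CAST(list_price AS REAL)
    FROM stg_product
    WHERE list_price GLOB '[0-9]*.[0-9]*'   -- keep rows whose price looks numeric
""")

loaded = dw.execute(
    "SELECT product_code, product_name, list_price FROM dim_product ORDER BY product_code"
).fetchall()
rejected = dw.execute(
    "SELECT COUNT(*) FROM stg_product WHERE list_price NOT GLOB '[0-9]*.[0-9]*'"
).fetchone()[0]
print(loaded, rejected)
```

In this sketch the orchestration code only moves data and issues statements, which mirrors the slide's split between the pipeline tool and the database engine.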

Surrogate Keys

Surrogate Keys – Unique identifier not derived from source system

• Embedded in fact tables as foreign keys to dimension tables

• Allows integrating data from multiple source systems

• Protects from source key changes in the source system

• Allows for slowly changing dimensions

• Allows you to create rows in the dimension that don’t exist in the source (-1 in fact table for unassigned)

• Improves performance (joins) and database size by using integer type instead of text

• Implemented via identity column on dimension tables
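
The surrogate-key points above can be sketched as follows, again with Python's built-in sqlite3 module standing in for SQL Server. SQLite's auto-assigned INTEGER PRIMARY KEY plays the role of an identity column here; the table names, source-system labels, and sample rows are hypothetical.

```python
import sqlite3

dw = sqlite3.connect(":memory:")  # stand-in for the warehouse (assumption)
dw.executescript("""
CREATE TABLE dim_customer (
    customer_key       INTEGER PRIMARY KEY,  -- surrogate key, auto-assigned
    source_system      TEXT,                 -- which system the row came from
    source_customer_id TEXT,                 -- natural key from that system
    customer_name      TEXT
);
-- Two source systems with unrelated key formats share one dimension.
INSERT INTO dim_customer (source_system, source_customer_id, customer_name) VALUES
    ('CRM', 'A-17', 'Contoso'),
    ('ERP', '17',   'Fabrikam');
-- A row that does not exist in any source: the "unassigned" member.
INSERT INTO dim_customer VALUES (-1, 'n/a', 'n/a', 'Unassigned');

CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    amount       REAL
);
""")

# Fact load: look up the surrogate key by natural key, falling back to -1.
incoming = [("CRM", "A-17", 100.0), ("ERP", "17", 250.0), ("ERP", "999", 40.0)]
for source_system, source_id, amount in incoming:
    dw.execute(
        """
        INSERT INTO fact_sales (customer_key, amount)
        SELECT COALESCE(
                   (SELECT customer_key FROM dim_customer
                    WHERE source_system = ? AND source_customer_id = ?),
                   -1),
               ?
        """,
        (source_system, source_id, amount),
    )

report = dw.execute("""
    SELECT c.customer_name, f.amount
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_key = f.customer_key
    ORDER BY f.amount
""").fetchall()
print(report)
```

The fact table holds only small integer keys, and the unmatched ERP customer still joins cleanly to the "Unassigned" row instead of dropping out of reports.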

SSAS Cubes

Reasons to report off cubes instead of the data warehouse:

Aggregating (summarizing) the data for performance

Multidimensional analysis – slice, dice, drilldown, show details

Can store hierarchies

Built-in support for KPIs

Security: You can use the security settings to give end users access to only those parts (slices) of the cube relevant to them

Built-in advanced time calculations, e.g. 12-month rolling average

Easily use Excel to view data via Pivot Tables

Automatically handles Slowly Changing Dimensions (SCD)

Required for PerformancePoint, Power View, and SSAS data mining
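
For readers unfamiliar with the time calculation mentioned above, this plain-Python sketch shows what a 12-month rolling average computes. It does not use SSAS or MDX; the `rolling_average` helper and the sample figures are hypothetical.

```python
from collections import deque

def rolling_average(values, window=12):
    """Trailing average over the last `window` values; None until the window is full."""
    recent = deque(maxlen=window)   # automatically drops the oldest value
    result = []
    for value in values:
        recent.append(value)
        if len(recent) == window:
            result.append(sum(recent) / window)
        else:
            result.append(None)
    return result

# Fourteen months of hypothetical sales: 100, 110, ..., 230.
monthly_sales = list(range(100, 240, 10))
rolling = rolling_average(monthly_sales)
print(rolling)
```

A cube evaluates this kind of calculation for any slice the user picks, whereas against the relational warehouse each report would have to re-implement it.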

Data Warehouse Architecture

Modern Data Warehouse

Think about future needs:

• Increasing data volumes

• Real-time performance

• New data sources and types (Hadoop)

• Cloud-born data

Solution – Microsoft Analytics Platform System:

• Scalable

• MPP architecture

• HDInsight

• PolyBase

Follow-on presentation: “Building a Big Data Solution (Building an Effective Data Warehouse Architecture with Hadoop, the cloud, and MPP)”

Resources

Data Warehouse Architecture – Kimball and Inmon methodologies: http://bit.ly/SrzNHy

SQL Server 2012: Multidimensional vs tabular: http://bit.ly/SrzX1x

Data Warehouse vs Data Mart: http://bit.ly/SrAi4p

Fast Track Data Warehouse Reference Architecture for SQL Server 2014: http://bit.ly/1xuX9m6

Complex reporting off a SSAS cube: http://bit.ly/SrAEYw

Surrogate Keys: http://bit.ly/SrAIrp

Normalizing Your Database: http://bit.ly/SrAHnc

Difference between ETL and ELT: http://bit.ly/SrAKQa

Microsoft’s Data Warehouse offerings: http://bit.ly/xAZy9h

Microsoft SQL Server Reference Architecture and Appliances: http://bit.ly/y7bXY5

Methods for populating a data warehouse: http://bit.ly/SrARuZ

Great white paper: Microsoft EDW Architecture, Guidance and Deployment Best Practices: http://bit.ly/SrAZug

End-User Microsoft BI Tools – Clearing up the confusion: http://bit.ly/SrBMLT

Microsoft Appliances: http://bit.ly/YQIXzM

Why You Need a Data Warehouse: http://bit.ly/1fwEq0j

Data Warehouse Maturity Model: http://bit.ly/xl4mGM

Operational Data Store (ODS) Defined: http://bit.ly/1H6wnE7

The Modern Data Warehouse: http://bit.ly/1xuX4Py

Q & A ?

James Serra, Big Data Evangelist

Email me at: [email protected]

Follow me at: @JamesSerra

Link to me at: www.linkedin.com/in/JamesSerra

Visit my blog at: JamesSerra.com (where this slide deck will be posted under the “Presentations” tab)