Building the Data Warehouse - Chapter 03: The Data Warehouse and Design

Upload: bondaigia

Post on 03-Apr-2018


  • 7/29/2019 Building the Data WareHouse - Chapter 03 The Data Warehouse and Design

    Building Data Warehouse

    By Inmon

    Chapter 3: The Data Warehouse and Design

    http://it-slideshares.blogspot.com/

    3.0 Introduction

    There are two major components to building a data warehouse:

    The design of the interface from operational systems.

    The design of the data warehouse itself.


    3.1 Beginning with Operational Data

    (Encoding Transformation)

    3.2 Process and Data Models and the Architected Environment

    Data models are discussed in depth in the following section.

    1. Functional decomposition

    2. Context-level zero diagram

    3. Data flow diagram

    4. Structure chart

    5. State transition diagram

    6. HIPO chart

    7. Pseudo code


    3.3 The Data Warehouse and Data Models


    3.3.1 The Data Warehouse Data Model


    3.3.2 The Midlevel Data Model

    Of particular interest is the case where a grouping of data has two types of lines emanating from it, as shown in Figure 3-17. The two lines leading to the right indicate that there are two types of criteria. One type of criteria is by activity type: either a deposit or a withdrawal. The other line indicates another activity type: either an ATM activity or a teller activity. Collectively, the two types of activity encompass the following transactions:

    ATM deposit

    ATM withdrawal

    Teller deposit

    Teller withdrawal
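
    The four transactions above are simply every combination of the two criteria. A minimal sketch of that enumeration, with hypothetical variable names chosen only for illustration:

```python
from itertools import product

# Two independent classification criteria for a banking activity.
channels = ["ATM", "Teller"]
activity_types = ["deposit", "withdrawal"]

# Each transaction belongs to exactly one (channel, activity type) pair.
transactions = [
    f"{channel} {activity}"
    for channel, activity in product(channels, activity_types)
]
print(transactions)
```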

    The physical table entries that resulted came from the following two transactions:

    An ATM withdrawal that occurred at 1:31 p.m. on January 2

    A teller deposit that occurred at 3:15 p.m. on January 5


    3.3.3 The Physical Data Model

    Note: This is not an issue of blindly transferring a large number of records from DASD to main storage. Instead, it is a more sophisticated issue of transferring a bulk of records that have a high probability of being accessed.

    3.4 The Data Model and Iterative Development

    Why is iterative development important?

    The industry track record of success strongly suggests it.

    The end user is unable to articulate many requirements until the first iteration is done.

    Management will not make a full commitment until at least a few actual results are tangible and obvious.

    Visible results must be seen quickly.

    3.5 Normalization and Denormalization (ERD: Entity Relationship Diagram)

    3.5 Normalization and Denormalization (hash algorithm: better search ability)

    3.5 Normalization and Denormalization (Use of redundant data for search performance)

    3.5 Normalization and Denormalization (Separation of data & access probability)

    3.5 Normalization and Denormalization (Derived Data: What is it?)

    3.5 Normalization and Denormalization (Data Indexing vs. Profiles: Why?)

    3.5 Normalization and Denormalization (Referential Data Integrity)

    3.5.1 Snapshots in the Data Warehouse (Primary Data)

    3.5.1 Snapshots in the Data Warehouse (Primary & Secondary Data)

    3.6.1 Managing Reference Tables in a Data Warehouse

    3.7 Cyclicity of Data: The Wrinkle of Time

    3.8 Complexity of Transformation and Integration

    As data passes from the operational, legacy environment to the data warehouse environment, it requires transformation and/or a change in technologies:

    Extraction of data from different source systems

    Transformation of encoding rules and data types

    Loading into the new environment
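
    The three steps above can be sketched as a tiny pipeline. Everything named here (the two source extracts, the gender encodings, the `ENCODING_RULES` table) is a hypothetical illustration of unifying inconsistent encodings, not a description of any particular tool:

```python
# Hypothetical extracts: each operational system encodes gender differently.
SOURCE_A = [{"cust": "A-1", "gender": "m"}, {"cust": "A-2", "gender": "f"}]
SOURCE_B = [{"cust": "B-7", "gender": "1"}, {"cust": "B-8", "gender": "0"}]

# Assumed per-source rules that map each encoding to one warehouse convention.
ENCODING_RULES = {
    "A": {"m": "M", "f": "F"},
    "B": {"1": "M", "0": "F"},
}

def extract():
    """Yield (source name, record) pairs from every source system."""
    for name, rows in (("A", SOURCE_A), ("B", SOURCE_B)):
        for row in rows:
            yield name, row

def transform(source, row):
    """Apply the source-specific encoding rule; unknown codes raise KeyError."""
    return {"cust": row["cust"], "gender": ENCODING_RULES[source][row["gender"]]}

def load(rows):
    """Stand-in for loading: collect the transformed rows in a list."""
    return list(rows)

warehouse = load(transform(src, row) for src, row in extract())
print(warehouse)
```

    Letting an unknown code raise an error, rather than passing it through, is one way to keep unmapped encodings out of the warehouse.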

    Data is cleansed as it passes from the operational environment to the data warehouse environment.

    Multiple input sources of data exist and must be merged as they pass into the data warehouse.

    When there are multiple input files, key resolution must be done before the files can be merged.

    With multiple input files, the sequence of the files may not be the same or even compatible.

    Multiple outputs may result. Data may be produced at different levels of summarization by the same data warehouse creation program.
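
    Key resolution before a merge can be illustrated with two small input files whose keys are formatted differently and arrive in different sequences. The file contents and both resolution rules below are assumptions made up for this sketch:

```python
# Hypothetical extracts keyed differently in each operational system.
billing = [
    {"acct": "000123", "balance": 250.0},
    {"acct": "000456", "balance": 75.5},
]
crm = [
    {"customer_no": "C-456", "segment": "retail"},
    {"customer_no": "C-123", "segment": "premium"},
]

def resolve_billing_key(acct):
    """Zero-padded account number -> integer warehouse key (assumed rule)."""
    return int(acct)

def resolve_crm_key(customer_no):
    """'C-123' style identifier -> integer warehouse key (assumed rule)."""
    return int(customer_no.split("-")[1])

# Merge on the resolved key, so the original key formats no longer matter.
merged = {}
for row in billing:
    merged.setdefault(resolve_billing_key(row["acct"]), {})["balance"] = row["balance"]
for row in crm:
    merged.setdefault(resolve_crm_key(row["customer_no"]), {})["segment"] = row["segment"]

# Sorting by the resolved key removes any dependence on input file sequence.
for key in sorted(merged):
    print(key, merged[key])
```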

    The input record types must be converted:

    Fixed-length records

    Variable-length records

    Occurs depending on

    Occurs clause

    The semantic (logical meaning) data relationships of the old systems must be understood.

    Data format conversion must be done. EBCDIC to ASCII (or vice versa) must be spelled out.

    Massive volumes of input must be accounted for.

    The design of the data warehouse must conform to a corporate data model.
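
    As an illustration of the format conversion mentioned above, the sketch below splits a fixed-length EBCDIC record into fields and re-encodes it as ASCII. The record layout (6-byte id, 10-byte name, 4-byte code) is an assumption, and `cp037` is just one of several EBCDIC code pages; a real extract would need the code page its source system actually uses:

```python
import struct

# Assumed layout of a fixed-length record: 6-byte id, 10-byte name, 4-byte code.
RECORD_LAYOUT = struct.Struct("6s10s4s")

def parse_record(raw: bytes) -> dict:
    """Split a fixed-length EBCDIC record into fields and decode them to text."""
    fields = RECORD_LAYOUT.unpack(raw)
    cust_id, name, code = (f.decode("cp037").strip() for f in fields)
    return {"cust_id": cust_id, "name": name, "code": code}

# Build a sample EBCDIC record from text (stands in for a mainframe extract).
sample = ("000123" + "SMITH".ljust(10) + "ATM".ljust(4)).encode("cp037")

record = parse_record(sample)
print(record)

# The same fields re-encoded as ASCII for the warehouse side.
ascii_bytes = "|".join(record.values()).encode("ascii")
print(ascii_bytes)
```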

    3.9 Triggering the Data Warehouse Record

    The basic business interaction that populates the data warehouse is called an event-snapshot interaction.

    3.9.2 Components of the Snapshot

    The snapshot placed in the data warehouse normally contains several components:

    The unit of time that marks the occurrence of the event.

    The key that identifies the snapshot.

    The primary (nonkey) data that relates to the key.

    An artifact of the relationship (secondary data that has been incidentally captured as of the moment of the taking of the snapshot and placed in the snapshot).
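
    One way to picture the four components is as the fields of a record that an event appends to the warehouse. The class name, field names, and sample event below (including the year and the branch) are hypothetical, a sketch rather than a prescribed layout:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass(frozen=True)
class Snapshot:
    moment: datetime   # unit of time marking the occurrence of the event
    key: str           # key that identifies the snapshot
    primary: dict      # primary (nonkey) data that relates to the key
    secondary: dict = field(default_factory=dict)  # incidentally captured artifacts

warehouse = []

def on_event(moment, key, primary, secondary=None):
    """An event triggers one snapshot; snapshots are appended, never updated."""
    snap = Snapshot(moment, key, primary, secondary or {})
    warehouse.append(snap)
    return snap

# Hypothetical event: an ATM withdrawal, with the branch captured as secondary data.
on_event(
    datetime(2024, 1, 2, 13, 31),
    "ACCT-123",
    {"activity": "ATM withdrawal", "amount": 100.0},
    {"branch": "Downtown"},
)
print(warehouse[0])
```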

    3.10 Profile Records

    3.12 Creating Multiple Profile Records

    Individual call records can be used to create:

    A customer profile record

    A district traffic profile record

    A line analysis profile record

    And so forth.
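
    A minimal sketch of how the same detailed call records could feed several profile records, assuming a made-up record layout of (customer, district, line, minutes):

```python
from collections import defaultdict

# Hypothetical call detail records: (customer, district, line, minutes).
calls = [
    ("cust-1", "north", "line-A", 5),
    ("cust-1", "south", "line-B", 12),
    ("cust-2", "north", "line-A", 3),
]

def build_profile(records, key_index):
    """Aggregate call count and total minutes by the chosen field."""
    profile = defaultdict(lambda: {"calls": 0, "minutes": 0})
    for rec in records:
        entry = profile[rec[key_index]]
        entry["calls"] += 1
        entry["minutes"] += rec[3]
    return dict(profile)

# The same detail records yield three different profile records.
customer_profile = build_profile(calls, 0)
district_profile = build_profile(calls, 1)
line_profile = build_profile(calls, 2)
print(customer_profile)
print(district_profile)
print(line_profile)
```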

    3.13 Going from the Data Warehouse to the Operational Environment

    3.15.1 An Airline Commission Calculation System

    3.15.2 A Retail Personalization System

    The retail sales representative could find out some other information about the customer:

    The last type of purchase made

    The market segment or segments in which the customer belongs

    While engaging the customer in conversation, the sales representative may say things such as:

    "I see it's been since February that we last heard from you."

    "How was that blue sweater you purchased?"

    "Did the problems you had with the pants get resolved?"

    Periodically, the analysis program spins off a file to the operational environment that contains such information as the following:

    Last purchase date

    Last purchase type

    Market analysis/segmenting

    3.15.3 Credit Scoring (based on demographics data)

    The background check relies on the data warehouse. In truth, the check is an eclectic one, in which many aspects of the customer are investigated, such as the following:

    Past payback history

    Home/property ownership

    Financial management

    Net worth

    Gross income

    Gross expenses

    Other intangibles

    The analysis program is run periodically and produces a prequalified file for use in the operational environment. In addition to other data, the prequalified file includes the following:

    Customer identification

    Approved credit limit

    Special approval limit
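
    The sketch below shows the general shape of such an analysis program: it reads warehouse history and writes a small prequalified file with the three fields listed above. The scoring rule, the input fields, and the sample data are invented for illustration and are not a real credit model:

```python
import csv
import io

# Hypothetical warehouse-side history, one row per customer.
history = [
    {"customer_id": "C-1", "late_payments": 0, "gross_income": 90000, "gross_expenses": 40000},
    {"customer_id": "C-2", "late_payments": 4, "gross_income": 50000, "gross_expenses": 48000},
]

FIELDS = ["customer_id", "approved_credit_limit", "special_approval_limit"]

def prequalify(row):
    """Toy rule (an assumption, not a real scoring model)."""
    disposable = row["gross_income"] - row["gross_expenses"]
    penalty = 0.5 ** row["late_payments"]  # halve the limit per late payment
    approved = round(0.2 * disposable * penalty, 2)
    return {
        "customer_id": row["customer_id"],
        "approved_credit_limit": approved,
        "special_approval_limit": round(approved * 1.5, 2),
    }

def write_prequalified_file(rows, out):
    """Write one prequalified row per customer to a CSV stream."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(prequalify(r) for r in rows)

buffer = io.StringIO()  # stands in for the file shipped to the operational side
write_prequalified_file(history, buffer)
print(buffer.getvalue())
```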

    3.16 Indirect Use of Data Warehouse Data

    The online pre-analyzed data file:

    Contains only a small amount of data per unit of data

    May contain collectively a large amount of data (because there may be many units of data)

    Contains precisely what the online clerk needs

    Is not updated, but is periodically refreshed on a wholesale basis

    Is part of the online high-performance environment

    Is efficient to access

    Is geared for access of individual units of data, not massive sweeps of data
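
    Two of these properties, wholesale refresh and access by individual unit, can be sketched with an in-memory stand-in. The class and its method names are hypothetical:

```python
class PreAnalyzedFile:
    """In-memory stand-in for the online pre-analyzed data file."""

    def __init__(self):
        self._units = {}

    def refresh(self, units):
        """Replace the whole content at once; single units are never updated."""
        self._units = dict(units)

    def lookup(self, key):
        """Fetch one unit of data by key, as an online clerk would."""
        return self._units.get(key)

online = PreAnalyzedFile()
online.refresh({"C-1": {"credit_limit": 10000}, "C-2": {"credit_limit": 25}})
print(online.lookup("C-1"))

# A later analysis run replaces everything wholesale; C-2 is no longer present.
online.refresh({"C-1": {"credit_limit": 12000}})
print(online.lookup("C-2"))
```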

    3.17 Star Joins

    There are several very good reasons why normalization and a relational approach produce the optimal design for a data warehouse:

    It produces flexibility.

    It fits well with very granular data.

    It is not optimized for any given set of processing requirements.

    It fits very nicely with the data model.


    3.18 Supporting the ODS

    In general, there are four classes of ODS:

    Class I: In a class I ODS, updates of data from the operational environment to the ODS are synchronous.

    Class II: In a class II ODS, the updates between the operational environment and the ODS occur within a two-to-three-hour time frame.

    Class III: In a class III ODS, the synchronization of updates between the operational environment and the ODS occurs overnight.

    Class IV: In a class IV ODS, updates into the ODS from the data warehouse are unscheduled. Figure 3-56 shows this support.
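
    The four classes can be written down as a small enumeration. The helper that maps a feed's lag to a class uses cut-off values that are assumptions for this sketch (in particular the few-seconds tolerance for "synchronous"); class IV is fed from the data warehouse, so the helper never returns it:

```python
from datetime import timedelta
from enum import Enum

class ODSClass(Enum):
    I = "synchronous updates from the operational environment"
    II = "updates within a two-to-three-hour time frame"
    III = "overnight synchronization"
    IV = "unscheduled updates from the data warehouse"

def classify_operational_feed(lag: timedelta) -> ODSClass:
    """Map the lag of an operational-to-ODS feed to class I, II, or III.

    The cut-offs are illustrative assumptions, not fixed definitions.
    """
    if lag <= timedelta(seconds=5):
        return ODSClass.I
    if lag <= timedelta(hours=3):
        return ODSClass.II
    return ODSClass.III

print(classify_operational_feed(timedelta(hours=2)).name)
```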

    The customer has been active for several years. The analysis of the transactions in the data warehouse is used to produce the following profile information about a single customer:

    Customer name and ID

    Customer volume: high/low

    Customer profitability: high/low

    Customer frequency of activity: very frequent/very infrequent

    Customer likes/dislikes (fast cars, single malt scotch)

    3.19 Requirements and the Zachman Framework

    Summary

    Design of data warehouse

    Corporate data model

    Operational data model

    Iterative approach, since requirements are not known a priori

    Different SDLC approach

    Data warehouse construction considerations

    Data volume (large size)

    Data latency (late arrival of data set)

    Requires transformation and an understanding of legacy systems

    Data models (granularities)

    Low level

    Mid level

    High level

    Structure of a typical record in the data warehouse

    Time stamp, a surrogate key, direct data, secondary data

    Reference tables must be managed in a time-variant manner

    Data latency: wrinkle of time

    Data transformation is complex

    Different architectures

    Different technologies

    Different encoding rules and complex logic

    Creation of a data warehouse record is triggered by an event (activity)

    A profile record is a composite representation of data (historical activities)

    Star join is a preferred database design technique