
    DW Chap 5: Data Modeling

    In this chapter, we will design the data stores for the AE case study. We will use the NDS+DDS architecture.

    The game plan is as follows:

    a. Begin by looking at the business requirements and design the dimensional data store (DDS) accordingly.

    b. Define the meaning of the facts and the dimensional attributes.

    c. Define the data hierarchies.

    d. Map the data in the DDS to the source systems; that is, define the source of each column in the fact and dimension tables. In some cases one column in the DDS is populated from several tables in the source system (for example, from column 1 of table A and column 1 of table B), and sometimes from more than one source system.

    e. Define the transformations (formulas, calculation logic, or lookups) required to populate the target columns.

    f. Design the normalized data store (NDS) by normalizing the dimensional data store and by examining the data from the source systems we have mapped. The normalization rules and the first, second, and third normal forms are described in the appendix. We will use these rules to design the NDS.

    Designing the Dimensional Data Store

    The users will be using the DW to do analysis in six business areas:

    a. Product Sales

    b. Subscription Sales

    c. Subscriber Profitability

    d. Supplier Performance

    e. CRM campaign segmentation

    f. CRM campaign results

    What we need to do is analyze each business area one by one to model the business process in order to create the data model. The first business area that we will look at is:

    1. Product Sales: an order-item data mart in the retail industry is a classic example of data warehousing.

    2. A product sales event happens when a customer buys a product, rather than subscribing to a package. The roles (who, what, where) in this event are the customer, a product, and a store. The levels (or, in dimensional modeling terms, the measures) are quantity, unit price, value, direct unit cost, and indirect unit cost. We get these levels from the business requirements in chapter 4. In this case, these are what users need in order to be able to perform their tasks.

    3. We put the measures in the fact tables and the roles in the dimension tables.

    4. The business event now becomes the Fact Table row.

    Quantity, unit price, and unit cost measures are derived from the source system, but the other three measures (sales value, sales cost, and margin) are calculated. They are defined as follows:

    Sales value = unit price x quantity

    Sales cost = unit cost x quantity

    Margin = sales value - sales cost
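    As a rough illustration, the three sourced measures and the three calculated measures could be derived in the ETL with a simple projection like the SQL sketch below. The table and column names (for example, stage_order_line) are assumptions for illustration, not the actual AE schema.

        -- Sketch: deriving the calculated measures from the sourced measures
        -- while loading the Product Sales fact table.
        SELECT
            quantity,
            unit_price,
            unit_cost,
            unit_price * quantity AS sales_value,
            unit_cost  * quantity AS sales_cost,
            (unit_price * quantity) - (unit_cost * quantity) AS margin
        FROM stage_order_line;   -- hypothetical staging table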

    5. The four keys in the Product Sales fact table link the fact table with the four dimensions. According to Ralph Kimball, it is important to declare the grain of the fact table. The grain is the smallest unit of occurrence of the business event at which the event is measured. In other words, declaring the grain means completing this sentence: "One row in the fact table corresponds to ...". In this case the grain is each item sold: one row in the Product Sales fact table corresponds to one item sold (a simple check for this is sketched below).
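    One practical way to verify the declared grain is to check that the identifying columns really are unique in the fact table. A minimal sketch, assuming the fact table carries the order ID and line number discussed later in this chapter:

        -- Sketch: any rows returned here would violate the declared grain
        -- of one fact row per item sold (order line).
        SELECT order_id, line_number, COUNT(*) AS row_count
        FROM fact_product_sales              -- hypothetical table name
        GROUP BY order_id, line_number
        HAVING COUNT(*) > 1;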

    6. The general rule for dealing with complex events and exceptions is to always look at the source system. We have to replicate or mimic the source system logic. This is because the output of the DW must agree with the source system.

    7. It is important to get the data model and business logic correct to ensure that the output of the DW reflects the correct business conditions.

    8. Customer profitability is where you calculate the profit you are making for each customer for a certain period.

    9. Realize that order ID and line number are called degenerate dimensions. A degenerate dimension is a dimension with only one attribute, and therefore the attribute is put in the fact table. Order ID and line number are identifiers of the order line in the source system.

    10. It is also always a good idea to put timestamp columns in the fact table: one recording when the record was loaded into the fact table and one recording when the fact row was last modified. These two timestamp columns are in addition to the transactional date/timestamp column.

    11. The transaction timestamp explains when the order (or order line) was created, shipped, canceled, or returned, but it does not explain when the record was loaded into the DW or when it was last modified.

    12. The next step in the fact table design is to determine which column combination uniquely identifies a fact table row.

    This is important because it is required for both logical and physical database design in order to determine the primary key(s).

    The Concept of a Data Mart

    A collection of a fact table and its dimension tables is called a data mart. Remember, this concept of a data mart is applicable only if the DW is in a dimensional model; when we are talking about normalized data stores, there are no data marts. A data mart is a group of related fact tables and their corresponding dimension tables containing the measurements of business events, categorized by their dimensions. Data marts exist in dimensional data stores.

    1. We then define the data type for each column.

    2. All key columns have integer data types because they are surrogate keys, that is, simple integer values incremented by one.

    3. The three timestamp columns have datetime data types. The source system code is an integer because it contains only the code; the description is stored in the source system table in the metadata database.
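    Putting these points together, a minimal sketch of the Product Sales fact table might look like the following. The column names, precisions, and the choice of (order_id, line_number) as the identifying column combination are assumptions for illustration and may differ from the actual AE design.

        -- Sketch of the Product Sales fact table (illustrative names and types).
        CREATE TABLE fact_product_sales
        (
            -- surrogate keys linking to the four dimensions
            date_key               INT NOT NULL,
            customer_key           INT NOT NULL,
            product_key            INT NOT NULL,
            store_key              INT NOT NULL,
            -- degenerate dimensions: order line identifiers from the source system
            order_id               INT NOT NULL,
            line_number            INT NOT NULL,
            -- measures
            quantity               INT,
            unit_price             DECIMAL(18,4),
            unit_cost              DECIMAL(18,4),
            sales_value            DECIMAL(18,4),
            sales_cost             DECIMAL(18,4),
            margin                 DECIMAL(18,4),
            -- source system code (decoded in the metadata database)
            source_system_code     INT,
            -- transaction timestamp plus load and last-modified timestamps
            order_timestamp        DATETIME,
            created_timestamp      DATETIME,
            last_updated_timestamp DATETIME,
            -- the column combination that uniquely identifies a fact row
            CONSTRAINT pk_fact_product_sales PRIMARY KEY (order_id, line_number)
        );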

    Dimension Tables

    4. Now that we have discussed the fact table, let's discuss the dimension tables. A dimension table is a table that contains various attributes explaining the dimension key in the fact table. As explained earlier, the fact table stores business events. The attributes explain the condition of the entity at the time the business event happened.

    5. Notice that the customer dimension table is linked to the fact table using the customer_key column. The customer_key column is the primary key in the customer dimension table and a foreign key in the fact table. This is known in the database world as referential integrity.

    6. Referential integrity is the process of establishing a parent-child relationship between two tables, with the purpose of ensuring that every row in the child table has a corresponding parent entry in the parent table.

    7. The customer dimension contains columns that describe the condition of the customer who made the purchase, including data about the customer's name, address, telephone number, date of birth, e-mail, gender, interest, occupation, and so on.
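    For illustration, the customer dimension and its referential-integrity link to the fact table could be declared as in the sketch below (column list abbreviated; names and types are assumptions):

        -- Sketch: customer dimension table and the foreign key from the fact table.
        CREATE TABLE dim_customer
        (
            customer_key     INT PRIMARY KEY,   -- surrogate key
            customer_name    VARCHAR(100),
            address          VARCHAR(200),
            telephone_number VARCHAR(30),
            date_of_birth    DATETIME,
            email            VARCHAR(100),
            gender           CHAR(1),
            interest         VARCHAR(100),
            occupation       VARCHAR(100)
        );

        -- customer_key in the fact table is a foreign key to the parent (dimension) table.
        ALTER TABLE fact_product_sales
            ADD CONSTRAINT fk_fact_product_sales_customer
            FOREIGN KEY (customer_key) REFERENCES dim_customer (customer_key);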

    Source System Mapping

    In this section we shall discuss why and how to do source system mapping. Source system mapping is an exercise in mapping the dimensional data store to the source systems.

    1. Now that we have completed the DDS design, the next step is to map every column in the DDS to the source systems so that we know where to get the data from when populating those columns.

    2. When doing this, we need to determine the transformations or calculations required to get the source columns into the target.

    3. This is necessary to understand the functionality that the ETL logic must perform when populating each column in the DDS tables. Bear in mind that a DDS column may come from more than one table in the source system or even from more than one source system, because the data warehouse integrates data from multiple source systems. This is where the source_system_code column becomes useful, because it enables us to understand which system the data is coming from. There will be an upcoming example using the Product Sales data mart.
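    For example, to see which source system a given fact row came from, the code can be decoded against the source system table in the metadata database. The table and column names below (meta_source_system, source_system_name) are assumptions:

        -- Sketch: decoding source_system_code using the metadata database.
        SELECT f.order_id,
               f.line_number,
               s.source_system_name      -- e.g. 'WebTower9' or 'Jade'
        FROM   fact_product_sales f
        JOIN   meta_source_system s
               ON f.source_system_code = s.source_system_code;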

    Key Points in doing Source System Mapping

    a. The first step is to find out from which tables the columns (that are changing) are coming.

    b. When we designed the DDS, we created the DDS columns (fact table measures and dimensional attributes) to fulfill the business requirements. Now we need to find out where we can get the data from by looking at the source system tables.

    c. We write the source tables and their abbreviations in brackets so we can use them later in the mapping table.

    d. Then we write the join conditions between these source tables so that we know how to write the query later when we develop the ETL.

    The following list specifies the target table, the source tables, and the join conditions. It shows where the Product Sales fact table is populated from and how the source tables are joined.

    a. Target table in DDS: Product Sales fact table

    b. Source: WebTower9 sales_order_header [woh], WebTower9 sales_order_detail [wod], Jade order_header [joh], Jade order_detail [jod], Jupiter item_master [jim], Jupiter currency_rate table [jcr]

    c. Join condition/criteria: woh.order_id = wod.order_id, joh.order_id = jod.order_id, wod.product_code = jim.product_code, jod.product_code = jim.product_code.

    In this case study, the inventory is managed in Jupiter, and the sales transactions are stored in the two front-office systems, WebTower9 and Jade, which is why we have a link between the Jupiter inventory master table and the order detail tables in WebTower9 and Jade.
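    To make the mapping concrete, the extraction queries implied by these join conditions might look roughly like the sketch below. The selected columns, the source_system_code values (1 for WebTower9, 2 for Jade), and the omitted currency conversion via the Jupiter currency_rate table are assumptions for illustration; in practice each source system would be extracted separately rather than in one statement.

        -- Sketch: extracting Product Sales rows from WebTower9 and Jade,
        -- joined to the Jupiter item master, using the aliases defined above.
        SELECT woh.order_id, wod.line_number, wod.quantity, wod.unit_price,
               jim.product_code, 1 AS source_system_code      -- WebTower9
        FROM   sales_order_header woh
        JOIN   sales_order_detail wod ON woh.order_id = wod.order_id
        JOIN   item_master        jim ON wod.product_code = jim.product_code
        UNION ALL
        SELECT joh.order_id, jod.line_number, jod.quantity, jod.unit_price,
               jim.product_code, 2 AS source_system_code      -- Jade
        FROM   order_header joh
        JOIN   order_detail jod ON joh.order_id = jod.order_id
        JOIN   item_master  jim ON jod.product_code = jim.product_code;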

    The key thing to take from this whole section is knowing where every single column will be sourced from, including the transformations.

    Designing the Normalized Data Store

    Now that we have designed the dimensional data store and mapped every column in every table to the source systems, we are ready to design the normalized data store, which is a normalized database that sits in between the stage and the DDS.

    The NDS is a master data store containing the complete data sets, including all historical transaction data and all historical versions of master data. The NDS contains master tables and transaction tables.

    A transaction table is a table that contains a business transaction or business event.

    A master table is a table that contains the persons or objects involved in the business event.

    In this section we will learn how to design the data model for the NDS.

    a. First we list all the entities based on the source tables and based on the fact and dimension attributes in the DDS.

    b. We list all the tables in the source system that we have identified during the source system mapping exercise in the previous section.

    c. We then normalize the DDS fact and dimension tables into a list of separate normalized tables.

    d. We then arrange the entities according to their relationships to enable us to establish the referential integrity between the entities. We do this by connecting the parent table to the child table. A child table has a column containing the primary key of the parent table. The DDS fact tables become child (transaction) tables in the NDS, and the DDS dimension tables become the parent (master) tables in the NDS (a small sketch follows this list).

    e.
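    As a rough illustration of step d, a dimension from the DDS becomes a master table in the NDS and a fact becomes a transaction table referencing it through a foreign key. The table and column names below are assumptions:

        -- Sketch: parent (master) and child (transaction) tables in the NDS.
        CREATE TABLE customer
        (
            customer_key  INT PRIMARY KEY,      -- master table: persons/objects in the event
            customer_name VARCHAR(100),
            date_of_birth DATETIME
        );

        CREATE TABLE order_header
        (
            order_id      INT PRIMARY KEY,      -- transaction table: the business event
            order_date    DATETIME,
            customer_key  INT,
            -- the child table carries the primary key of its parent table
            CONSTRAINT fk_order_header_customer
                FOREIGN KEY (customer_key) REFERENCES customer (customer_key)
        );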