
Chapter 8: Populating the Data Warehouse

Now that we have extracted the data from the source system, we will populate the NDS and the DDS with the data we have extracted. In this chapter, we will look at the five main subjects regarding data warehouse population, in the sequence they occur in a data warehouse system:

1. Loading the stage: We load the source system data into the stage. Usually, the focus is to extract the data as soon as possible without doing too much transformation; in other words, the structure of the stage tables is similar to the source system tables. In the previous chapter we discussed the extraction, and in this chapter we will discuss the loading.

2. Creating the data firewall: We check the quality of the data when it is loaded from the stage into the NDS or ODS. The check is done using predefined rules that determine what action to take: reject the data, allow the data, or fix the data. A small SQL sketch of such a rule follows this list.

3. Populating a normalized data store: We load the data from the stage into the NDS or ODS after it passes through the data firewall. Both are normalized data stores consisting of entities with minimal data redundancy. Here we deal with data normalization and key management.

4. Populating dimension tables: We load the NDS or ODS data into the DDS dimension tables after we have populated the normalized data store. The DDS is a dimensional store where the data is denormalized, so when populating dimension tables we deal with issues such as denormalization and slowly changing dimensions.

5. Populating fact tables: This is the last step in populating the data warehouse, done after we have populated the dimension tables in the DDS. The data from the NDS or ODS is loaded into the DDS fact tables. In this process we deal with surrogate key lookups and late-arriving fact rows.
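To make step 2 concrete, here is a minimal sketch of a data firewall rule, assuming T-SQL syntax. The table and column names (stage_customer, dq_reject, customer_id, customer_name) are hypothetical, not from the source; the rule rejects stage rows whose business key is null and lets the rest through:

    -- Hypothetical data firewall rule: rows that break the rule are copied
    -- to a reject table for reporting; only the rows that pass go to the NDS.
    INSERT INTO dq_reject (table_name, rule_name, rejected_at, customer_name)
    SELECT 'stage_customer', 'customer_id must not be null',
           GETDATE(), customer_name
    FROM stage_customer
    WHERE customer_id IS NULL;

    -- These rows pass the rule and continue into the NDS.
    SELECT customer_id, customer_name
    FROM stage_customer
    WHERE customer_id IS NOT NULL;

In practice a "fix the data" rule would UPDATE the offending rows instead of copying them to the reject table, but the reject/allow split above is the core pattern.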

Stage Loading 

If your stage is a database, which is common, it is better not to put any indexes or constraints (such as not-null, primary key, or check constraints) in the stage database. The main reason for this is not performance; it is because we want to capture and report the "bad data" in the data quality process. We want to allow bad data, such as nulls and duplicate primary keys, into the stage. If we restricted our stage tables or rejected nulls and duplicate primary keys, then the data quality process that sits between the stage and the NDS would not be able to capture these DQ issues and report them for correction in the source system.
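As an illustration, a stage table might be declared with every column nullable and no keys at all. The table and column names below are hypothetical:

    -- A stage table deliberately created without constraints or indexes.
    CREATE TABLE stage_customer
    (
        customer_id   INT          NULL,  -- no primary key, so duplicates get in
        customer_name VARCHAR(100) NULL,  -- every column nullable
        date_of_birth DATETIME     NULL
        -- no NOT NULL, PRIMARY KEY, CHECK, or FOREIGN KEY constraints, so
        -- "bad data" reaches the stage and can be caught by the data firewall
    );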

Indexing stage tables is generally not necessary because of the way the stage tables are structured and selected: the best approach is to load the data into empty stage tables without any indexes and then select all records from those tables when loading into the NDS. If you plan to keep, say, five days of stage data in case you need them in the event of failures, it is better not to keep the previous days' data in the same table. We keep stage data rather than going back to the source system to restage for two reasons: first, the source system data may have changed; second, performance. In other words, it's quicker to reload the NDS from the stage than to retrieve the data from the source system again.

Approach 1 keeps the previous day's data in the same table: This approach is simple to implement because you have only one table. You record when each row was loaded into the stage by having a datetime column called loaded_timestamp in the stage table.
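A minimal sketch of approach 1, assuming T-SQL syntax; the table and column names other than loaded_timestamp are hypothetical:

    -- Approach 1: one stage table, each row stamped with its load time.
    ALTER TABLE stage_customer ADD loaded_timestamp DATETIME NULL;

    -- When loading into the NDS, pick up only today's rows.
    SELECT customer_id, customer_name
    FROM stage_customer
    WHERE loaded_timestamp >= CAST(GETDATE() AS DATE);

    -- Purge rows older than the five days we chose to keep.
    DELETE FROM stage_customer
    WHERE loaded_timestamp < DATEADD(DAY, -5, GETDATE());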

Approach 2 keeps each day in a separate table: Even though it is more complex than approach 1, approach 2 gives better performance because you load into an empty table and select all records when retrieving data. When implementing approach 2, you create the Today table before the loading begins and drop the Day 1 table after the loading has completed successfully.
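A sketch of approach 2, again with hypothetical names; today's load goes into a freshly created table, and the oldest of the daily tables is dropped once the load succeeds:

    -- Before the load: create an empty Today table with no indexes,
    -- so the load is as fast as possible.
    CREATE TABLE stage_customer_today
    (
        customer_id   INT          NULL,
        customer_name VARCHAR(100) NULL
    );

    -- Loading into the NDS is then a plain full select of the Today table.
    SELECT customer_id, customer_name FROM stage_customer_today;

    -- After the load completes successfully: drop the oldest daily table.
    DROP TABLE stage_customer_day1;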

Approach 3 is to have just one table and truncate it every time before loading: This approach is even simpler than approach 1. We don't keep the previous day's data in the stage database; if we need to restore stage data retrieved a few days ago, we have to restore it from a database backup. The number of days of stage database backup to keep differs for each company; it is determined by the extraction frequency (from the source system into the stage) and the loading frequency (from the stage to the NDS, ODS, or DDS).
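Approach 3 reduces to two statements per load. The names are hypothetical, and the SELECT stands in for whatever extraction mechanism actually feeds the stage:

    -- Approach 3: a single stage table, emptied before every load.
    TRUNCATE TABLE stage_customer;

    INSERT INTO stage_customer (customer_id, customer_name)
    SELECT customer_id, customer_name
    FROM source_customer;   -- hypothetical source-system extract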