Chapter 8: Populating the Data Warehouse
Now that we have extracted the data from the source system, we will populate the NDS and the DDS with
the data we have extracted. In this chapter, we will look at the five main subjects regarding data warehouse
population in the sequence they occur in a data warehouse system:
1. Loading the stage: We load the source system data into the stage. Usually, the focus is to extract
the data as quickly as possible without doing much transformation, so the structure
of the stage tables is similar to that of the source system tables. In the previous chapter, we discussed the
extraction, and in this chapter we will discuss the loading.
2. Creating the data firewall: We check the quality when the data is loaded from the stage into the
NDS or ODS. The check is done using predefined rules that define what action to take: reject the
data, allow the data, or fix the data.
3. Populating a normalized data store: This is when we load the data from the stage into the NDS or
ODS, after the data passes through the data firewall. Both are normalized data stores consisting of
entities with minimal data redundancy. Here we deal with data normalization and key management.
4. Populating dimension tables: This is when we load the NDS or ODS data into the DDS dimension
tables. This is done after we populate the normalized data store. The DDS is a dimensional store where
the data is denormalized, so when populating dimension tables, we deal with issues such as
denormalization and slowly changing dimensions.
5. Populating fact tables: This is the last step in populating the data warehouse. It is done after we
populate the dimension tables in the DDS. The data from the NDS or ODS is loaded into the DDS fact tables. In
this process we deal with surrogate key lookup and late-arriving fact rows.
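To make step 2 more concrete, here is a minimal sketch of a data firewall that applies predefined rules
and decides whether to reject, allow, or fix each stage row. The two rules, the column names
(customer_id, country), and the function names are assumptions made up for illustration; they are not
taken from the chapter.

```python
# Hypothetical rule actions; the names are illustrative only.
ALLOW, FIX, REJECT = "allow", "fix", "reject"


def check_customer_row(row):
    """Apply two made-up data quality rules to one stage row.

    Returns (action, row_to_load); row_to_load is None when the row is rejected.
    """
    # Rule 1 (reject): a row without a natural key cannot be loaded.
    if row.get("customer_id") is None:
        return REJECT, None
    # Rule 2 (fix): country codes are trimmed and upper-cased.
    country = row.get("country")
    if country is not None and country != country.strip().upper():
        return FIX, dict(row, country=country.strip().upper())
    # No rule fired: the row passes unchanged.
    return ALLOW, row


def run_firewall(stage_rows):
    """Split stage rows into rows to load and rows to report back."""
    passed, rejected = [], []
    for row in stage_rows:
        action, result = check_customer_row(row)
        if action == REJECT:
            rejected.append(row)   # reported for correction in the source system
        else:
            passed.append(result)  # allowed as-is or fixed
    return passed, rejected
```

In a real system the rules would more likely live in metadata tables than in code, so that they can be
changed without redeploying the ETL process.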
Stage Loading
If your stage is a database, which is common, it is better not to put any indexes or constraints (such as
not null, primary key, or check constraints) in the stage database. The main reason for this is not
performance; it is that we want to capture and report the “bad data” in the data quality process.
We want to allow bad data such as nulls and duplicate primary keys into the stage. If we put constraints on
the stage tables and reject nulls and duplicate primary keys, then the data quality process that sits between
the stage and the NDS will not be able to capture these DQ issues and report them for correction in the
source system.
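The following sketch shows why an unconstrained stage table is useful: rows with null or duplicate keys
land in the stage, and the data quality process can then find and report them. It uses an in-memory SQLite
database for convenience; the table stage_customer and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# No PRIMARY KEY, NOT NULL, or CHECK constraints: bad rows must be able to land.
conn.execute("CREATE TABLE stage_customer (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO stage_customer VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (2, "Bob again"), (None, "No key")],
)

# DQ check 1: rows whose key is null.
null_keys = conn.execute(
    "SELECT name FROM stage_customer WHERE customer_id IS NULL"
).fetchall()

# DQ check 2: key values that occur more than once, with their row counts.
duplicate_keys = conn.execute(
    "SELECT customer_id, COUNT(*) FROM stage_customer "
    "WHERE customer_id IS NOT NULL "
    "GROUP BY customer_id HAVING COUNT(*) > 1"
).fetchall()
```

Had customer_id been declared as a primary key, the third insert would have failed and the duplicate
would never have reached the data quality process.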
Indexing stage tables is generally not necessary because of the way the stage tables are loaded and read.
The best approach is to load the data into empty stage tables without any indexes and then select all
records from those tables for loading into the NDS. If you plan to keep, say, five days of stage data in
case you need it after a failure, it is better not to keep the previous days’ data in the same table. We
keep this history in the stage, rather than going back to the source system to restage, for two reasons.
The first is that the source system data may have changed. The second is performance: it is quicker to
reload the NDS from the stage than to retrieve the data from the source system again. There are three
approaches to storing the stage data.
Approach 1 keeps the previous days’ data in the same table: This approach is simple to implement because
you have only one table. You store the time each record was loaded into the stage table in a
column called loaded_timestamp. The loaded_timestamp column is a datetime column
that records when the row was loaded into the stage.
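A minimal sketch of approach 1 follows, using SQLite and ISO-formatted text timestamps. Only the
loaded_timestamp column name comes from the text; the table stage_order, its other columns, and the helper
functions are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stage_order (order_id INTEGER, amount REAL, loaded_timestamp TEXT)"
)


def load_batch(rows, loaded_at):
    """Append one batch, stamping every row with the same load time."""
    conn.executemany(
        "INSERT INTO stage_order VALUES (?, ?, ?)",
        [(order_id, amount, loaded_at) for order_id, amount in rows],
    )


def rows_loaded_on(day):
    """Select only the rows staged on the given day (YYYY-MM-DD)."""
    return conn.execute(
        "SELECT order_id, amount FROM stage_order "
        "WHERE substr(loaded_timestamp, 1, 10) = ? ORDER BY order_id",
        (day,),
    ).fetchall()


def purge_older_than(cutoff):
    """Delete rows staged before the cutoff; ISO timestamps sort as text."""
    conn.execute("DELETE FROM stage_order WHERE loaded_timestamp < ?", (cutoff,))


load_batch([(1, 10.0)], "2024-01-01 02:00:00")
load_batch([(2, 20.0), (3, 5.5)], "2024-01-02 02:00:00")
todays_rows = rows_loaded_on("2024-01-02")
```

Note that every read has to filter on loaded_timestamp and old rows have to be deleted explicitly, which
is the extra work the single-table design brings with it.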
Approach 2 keeps each day in a separate table: Even though it is more complex than approach 1, approach 2
gives better performance because you load into an empty table and select all records when retrieving data.
When implementing approach 2, you create the Today table before the loading begins and drop the Day 1
table after the loading has completed successfully.
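Approach 2 could be sketched as follows. The date-suffixed table names, the five-day retention constant,
and the function names are assumptions for illustration; the text only says that a new table is created
before loading and an old one is dropped after a successful load.

```python
import sqlite3
from datetime import date, timedelta

KEEP_DAYS = 5  # assumed retention period, matching the five-day example


def table_for(day):
    """Build a per-day table name from a date object, e.g. stage_order_20240107."""
    return f"stage_order_{day:%Y%m%d}"


def load_day(conn, day, rows):
    """Create today's empty table, load it, then drop the table that aged out."""
    today_table = table_for(day)
    conn.execute(f"CREATE TABLE {today_table} (order_id INTEGER, amount REAL)")
    conn.executemany(f"INSERT INTO {today_table} VALUES (?, ?)", rows)
    # Reached only if the load above did not raise an exception.
    expired_table = table_for(day - timedelta(days=KEEP_DAYS))
    conn.execute(f"DROP TABLE IF EXISTS {expired_table}")
    conn.commit()


conn = sqlite3.connect(":memory:")
first_day = date(2024, 1, 1)
for offset in range(7):
    load_day(conn, first_day + timedelta(days=offset), [(offset, 1.0)])

remaining_tables = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
```

Because each table holds exactly one day, reading a day back is a plain select of all rows with no
timestamp filter.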
Approach 3 is to have just one table and truncate the table every time before loading: This approach is even
simpler than approach 1. We don’t keep the previous days’ data in the stage database. If we need to
restore the stage data retrieved a few days ago, we have to restore it from a database backup. The
number of days of stage database backups to keep is different for each company. It is determined based
on the extraction frequency (from the source system into the stage) and the loading frequency (from the
stage to the NDS, ODS, or DDS).
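As a rough sketch of approach 3, the function below empties the single stage table and reloads it in one
transaction. SQLite has no TRUNCATE TABLE statement, so an unqualified DELETE stands in for it here; the
table stage_product and the function name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stage_product (product_id INTEGER, name TEXT)")


def reload_stage(rows):
    """Empty the stage table, then load the new extract."""
    # "with conn" commits on success and rolls back if an exception is raised.
    with conn:
        conn.execute("DELETE FROM stage_product")  # stand-in for TRUNCATE TABLE
        conn.executemany("INSERT INTO stage_product VALUES (?, ?)", rows)


reload_stage([(1, "Widget"), (2, "Gadget")])  # first day's extract
reload_stage([(3, "Gizmo")])                  # next day's extract replaces it
remaining = conn.execute("SELECT product_id, name FROM stage_product").fetchall()
```

After the second load only the latest extract is left in the stage, which is why earlier days can be
recovered only from a database backup.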