5.data warehouse

Upload: rmram234

Post on 04-Apr-2018

220 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/29/2019 5.Data Warehouse

    1/19

    DATAWAREHOUSE[Type the document subtitle]

    [Type the abstract of the document here. The abstract is typically a short

    summary of the contents of the document. Type the abstract of the document

    here. The abstract is typically a short summary of the contents of the document.]

    2013

    [Type the author name]

    [Type the company name]

    2/6/2013

  • 7/29/2019 5.Data Warehouse

    2/19

    2

    What is data warehouse?

    A data warehouse is a electronic storage of an Organization's historical data for the purpose of

    analysis and reporting. According to Bill Inmon, a datawarehouse should be subject-oriented,

    non-volatile, integrated and time-variant.

    Explanation on the classic definition of data warehouse

    Subject Oriented

    This means a data warehouse has a defined scope and it only stores data under that scope. So

    for example, if the sales team of your company is creating a data warehouse - the data

    warehouse by definition is required to contain data related to sales (and not the data related to

    production management for example)

    Non-volatile

    This means that data once stored in the data warehouse are not removed or deleted from it and

    always stay there no matter what.

    Integrated

    This means that the data stored in a data warehouse make sense. Fact and figures are related to

    each other and they are integrable and projects a single point of truth.

    Time variant

    This means that data is not constant, as new and new data gets loaded in the warehouse, data

    warehouse also grows in size.

    What is the benefits of data warehouse?

    A data warehouse helps to integrate data (seeData integration) and store them historically so

    that we can analyze different aspects of business including, performance analysis, trend,

    prediction etc. over a given time frame and use the result of our analysis to improve the

    efficiency of business processes.

    Why Data Warehouse is used?

    http://www.dwbiconcepts.com/data-warehousing/18-dwbi-basic-concepts/92-data-integration.htmlhttp://www.dwbiconcepts.com/data-warehousing/18-dwbi-basic-concepts/92-data-integration.htmlhttp://www.dwbiconcepts.com/data-warehousing/18-dwbi-basic-concepts/92-data-integration.htmlhttp://www.dwbiconcepts.com/data-warehousing/18-dwbi-basic-concepts/92-data-integration.html
  • 7/29/2019 5.Data Warehouse

    3/19

    3

    For a long time in the past and also even today, Data warehouses are built to facilitate reporting

    on different key business processes of an organization, known as KPI. Data warehouses also

    help to integrate data from different sources and show a single-point-of-truth values about the

    business measures.

    Data warehouse can be further used for data mining which helps trend prediction, forecasts,

    pattern recognition etc.

    What is the difference between OLTP and OLAP?

    OLTP is the transaction system that collects business data. Whereas OLAP is the reporting and

    analysis system on that data.

    OLTP systems are optimized for INSERT, UPDATE operations and therefore highly normalized.

    On the other hand, OLAP systems are deliberately denormalized for fast data retrieval through

    SELECT operations.

    Explanatory Note:

    In a departmental shop, when we pay the prices at the check-out counter, the sales

    person at the counter keys-in all the data into a "Point-Of-Sales" machine. That data is

    transaction data and the related system is a OLTP system.

    On the other hand, the manager of the store might want to view a report on out-of-

    stock materials, so that he can place purchase order for them. Such report will come

    out from OLAP system

    What is data mart?

    Data marts are generally designed for a single subject area. An organization may have data

    pertaining to different departments like Finance, HR, Marketting etc. stored in data warehouse

    and each department may have separate data marts. These data marts can be built on top of

    the data warehouse.

    What is ER model?

    ER model or entity-relationship model is a particular methodology of data modeling wherein

    the goal of modeling is to normalize the data by reducing redundancy. This is different than

    dimensional modeling where the main goal is to improve the data retrieval mechanism.

    What is dimensional modeling?

  • 7/29/2019 5.Data Warehouse

    4/19

    4

    Dimensional model consists of dimension and fact tables. Fact tables store different

    transactional measurements and the foreign keys from dimension tables that qualifies the data.

    The goal of Dimensional model is not to achive high degree of normalization but to facilitateeasy and faster data retrieval.

    Ralph Kimball is one of the strongest proponents of this very popular data modeling technique

    which is often used in many enterprise level data warehouses.

    What is dimension?

    A dimension is something that qualifies a quantity (measure).

    For an example, consider this: If I just say 20kg, it does not mean anything. But if I say,

    "20kg of Rice (Product) is sold to Ramesh (customer) on 5th April (date)", then that gives a

    meaningful sense. These product, customerand datesare some dimension that qualified the

    measure - 20kg.

    Dimensions are mutually independent. Technically speaking, a dimension is a data element that

    categorizes each item in a data set into non-overlapping regions.

    What is Fact?

    A fact is something that is quantifiable (Or measurable). Facts are typically (but not always)

    numerical values that can be aggregated.

    What are additive, semi-additive and non-additive measures?

    Non-additive Measures

    Non-additive measures are those which can not be used inside any numeric aggregation

    function (e.g. SUM(), AVG() etc.). One example of non-additive fact is any kind of ratio or

    percentage. Example, 5% profit margin, revenue to asset ratio etc. A non-numerical data can

    also be a non-additive measure when that data is stored in fact tables, e.g. some kind of

    varchar flags in the fact table.

    Semi Additive Measures

    Semi-additive measures are those where only a subset of aggregation function can be applied.

    Lets say account balance. A sum() function on balance does not give a useful result but max()

    or min() balance might be useful. Consider price rate or currency rate. Sum is meaningless on

    rate; however, average function might be useful.

  • 7/29/2019 5.Data Warehouse

    5/19

    5

    Additive Measures

    Additive measures can be used with any aggregation function like Sum(), Avg() etc. Example is

    Sales Quantity etc.

    What is Star-schema?

    This schema is used in data warehouse models where one centralized fact table references

    number of dimension tables so as the keys (primary key) from all the dimension tables flow into

    the fact table (as foreign key) where measures are stored. This entity-relationship diagram

    looks like a star, hence the name.

    Consider a fact table that stores sales quantity for each product and customer on a certain time.

    Sales quantity will be the measure here and keys from customer, product and time dimension

    tables will flow into the fact table.

    What is snow-flake schema?

    This is another logical arrangement of tables in dimensional modeling where a centralized fact

    table references number of other dimension tables; however, those dimension tables are furthernormalized into multiple related tables.

    Consider a fact table that stores sales quantity for each product and customer on a certain time.

    Sales quantity will be the measure here and keys from customer, product and time dimension

    tables will flow into the fact table. Additionally all the products can be further grouped under

    different product families stored in a different table so that primary key of product family tables

    http://png.dwbiconcepts.com/images/tutorial/dwbi_interview/star.png
  • 7/29/2019 5.Data Warehouse

    6/19

    6

    also goes into the product table as a foreign key. Such construct will be called a snow-flake

    schema as product table is further snow-flaked into product family.

    NoteSnow-flake increases degree of normalization in the design.

    What are the different types of dimension?

    In a data warehouse model, dimension can be of following types,

    1. Conformed Dimension2. Junk Dimension3. Degenerated Dimension4. Role Playing DimensionBased on how frequently the data inside a dimension changes, we can further classify

    dimension as

    1. Unchanging or static dimension (UCD)2. Slowly changing dimension (SCD)3. Rapidly changing Dimension (RCD)

    What is a 'Conformed Dimension'?

    http://png.dwbiconcepts.com/images/tutorial/dwbi_interview/snowflake.png
  • 7/29/2019 5.Data Warehouse

    7/19

    7

    A conformed dimension is the dimension that is shared across multiple subject area. Consider

    'Customer' dimension. Both marketing and sales department may use the same customer

    dimension table in their reports. Similarly, a 'Time' or 'Date' dimension will be shared by

    different subject areas. These dimensions are conformed dimension.

    Theoretically, two dimensions which are either identical or strict mathematical subsets of one

    another are said to be conformed.

    What is degenerated dimension?

    A degenerated dimension is a dimension that is derived from fact table and does not have its

    own dimension table.

    A dimension key, such as transaction number, receipt number, Invoice number etc. does not

    have any more associated attributes and hence can not be designed as a dimension table.

    What is junk dimension?

    A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.) so

    that those can be removed from other tables and can be junked into an abstract dimension

    table.

    These junk dimension attributes might not be related. The only purpose of this table is to store

    all the combinations of the dimensional attributes which you could not fit into the different

    dimension tables otherwise. Junk dimensions are often used to implementRapidly Changing

    Dimensionsin data warehouse.

    What is a role-playing dimension?

    Dimensions are often reused for multiple applications within the same database with different

    contextual meaning. For instance, a "Date" dimension can be used for "Date of Sale", as well as

    "Date of Delivery", or "Date of Hire". This is often referred to as a 'role-playing dimension'

    What is SCD?

    SCD stands for slowly changing dimension, i.e. the dimensions where data is slowly changing.

    These can be of many types, e.g. Type 0, Type 1, Type 2, Type 3 and Type 6, although Type 1,

    2 and 3 are most common. Readthisarticle to gather in-depth knowledge on various SCD

    tables.

    What is rapidly changing dimension?

    http://www.dwbiconcepts.com/data-warehousing/12-data-modelling/139-implementing-rapidly-changing-dimension.htmlhttp://www.dwbiconcepts.com/data-warehousing/12-data-modelling/139-implementing-rapidly-changing-dimension.htmlhttp://www.dwbiconcepts.com/data-warehousing/12-data-modelling/139-implementing-rapidly-changing-dimension.htmlhttp://www.dwbiconcepts.com/data-warehousing/12-data-modelling/139-implementing-rapidly-changing-dimension.htmlhttp://www.dwbiconcepts.com/data-warehousing/12-data-modelling/133-modeling-for-various-slowly-changing-dimension.htmlhttp://www.dwbiconcepts.com/data-warehousing/12-data-modelling/133-modeling-for-various-slowly-changing-dimension.htmlhttp://www.dwbiconcepts.com/data-warehousing/12-data-modelling/133-modeling-for-various-slowly-changing-dimension.htmlhttp://www.dwbiconcepts.com/data-warehousing/12-data-modelling/133-modeling-for-various-slowly-changing-dimension.htmlhttp://www.dwbiconcepts.com/data-warehousing/12-data-modelling/139-implementing-rapidly-changing-dimension.htmlhttp://www.dwbiconcepts.com/data-warehousing/12-data-modelling/139-implementing-rapidly-changing-dimension.html
  • 7/29/2019 5.Data Warehouse

    8/19

    8

    This is a dimension where data changes rapidly. Readthisarticle to know how to implement

    RCD.

    Describe different types of slowly changing Dimension (SCD)

    Type 0:

    A Type 0 dimension is where dimensional changes are not considered. This does not mean that

    the attributes of the dimension do not change in actual business situation. It just means that,

    even if the value of the attributes change, history is not kept and the table holds all the

    previous data.

    Type 1:

    A type 1 dimension is where history is not maintained and the table always shows the recent

    data. This effectively means that such dimension table is always updated with recent datawhenever there is a change, and because of this update, we lose the previous values.

    Type 2:

    A type 2 dimension table tracks the historical changes by creating separate rows in the table

    with different surrogate keys. Consider there is a customer C1 under group G1 first and later on

    the customer is changed to group G2. Then there will be two separate records in dimension

    table like below,

    Key Customer Group Start Date End Date

    1 C1 G1 1st Jan 2000 31st Dec 2005

    2 C1 G2 1st Jan 2006 NULL

    Note that separate surrogate keys are generated for the two records. NULL end date in the

    second row denotes that the record is the current record. Also note that, instead of start and

    end dates, one could also keep version number column (1, 2 etc.) to denote different

    versions of the record.

    Type 3:

    A type 3 dimension stored the history in a separate column instead of separate rows. So unlike

    a type 2 dimension which is vertically growing, a type 3 dimension is horizontally growing. See

    the example below,

    Key Customer Previous Group Current Group

    http://www.dwbiconcepts.com/data-warehousing/12-data-modelling/139-implementing-rapidly-changing-dimension.htmlhttp://www.dwbiconcepts.com/data-warehousing/12-data-modelling/139-implementing-rapidly-changing-dimension.htmlhttp://www.dwbiconcepts.com/data-warehousing/12-data-modelling/139-implementing-rapidly-changing-dimension.htmlhttp://www.dwbiconcepts.com/data-warehousing/12-data-modelling/139-implementing-rapidly-changing-dimension.html
  • 7/29/2019 5.Data Warehouse

    9/19

    9

    1 C1 G1 G2

    This is only good when you need not store many consecutive histories and when date of change

    is not required to be stored.

    Type 6:

    A type 6 dimension is a hybrid of type 1, 2 and 3 (1+2+3) which acts very similar to type 2, but

    only you add one extra column to denote which record is the current record.

    Key Customer Group Start Date End Date Current Flag

    1 C1 G1 1st Jan 2000 31st Dec 2005 N

    2 C1 G2 1st Jan 2006 NULL Y

    What is a mini dimension?

    Mini dimensions can be used to handle rapidly changing dimension scenario. If a dimension has

    a huge number of rapidly changing attributes it is better to separate those attributes in

    different table called mini dimension. This is done because if the main dimension table is

    designed as SCD type 2, the table will soon outgrow in size and create performance issues. It is

    better to segregate the rapidly changing members in different table thereby keeping the main

    dimension table small and performing.

    What is a fact-less-fact?

    A fact table that does not contain any measure is called a fact-less fact. This table will only

    contain keys from different dimension tables. This is often used to resolve a many-to-many

    cardinality issue.

    Explanatory Note:Consider a school, where a single student may be taught by many teachers and a single teacher

    may have many students. To model this situation in dimensional model, one might introduce a

    fact-less-fact table joining teacher and student keys. Such a fact table will then be able to

    answer queries like,

    1. Who are the students taught by a specific teacher.2. Which teacher teaches maximum students.3. Which student has highest number of teachers.etc. etc.

    What is a coverage fact?

  • 7/29/2019 5.Data Warehouse

    10/19

    10

    A fact-less-fact table can only answer 'optimistic' queries (positive query) but can not answer a

    negative query. Again consider the illustration in the above example. A fact-less fact containing

    the keys of tutors and students can not answer a query like below,

    1. Which teacher did not teach any student?2. Which student was not taught by any teacher?Why not? Because fact-less fact table only stores the positive scenarios (like student being

    taught by a tutor) but if there is a student who is not being taught by a teacher, then thatstudent's key does not appear in this table, thereby reducing the coverage of the table.

    Coverage fact table attempts to answer this - often by adding an extra flag column. Flag = 0

    indicates a negative condition and flag = 1 indicates a positive condition. To understand this

    better, let's consider a class where there are 100 students and 5 teachers. So coverage fact

    table will ideally store 100 X 5 = 500 records (all combinations) and if a certain teacher is not

    teaching a certain student, the corresponding flag for that record will be 0.

    What are incident and snapshot facts

    A fact table stores some kind of measurements. Usually these measurements are stored (or

    captured) against a specific time and these measurements vary with respect to time. Now it

    might so happen that the business might not able to capture all of its measures always for

    every point in time. Then those unavailable measurements can be kept empty (Null) or can be

    filled up with the last available measurements. The first case is the example of incident fact and

    the second one is the example of snapshot fact.

    What is aggregation and what is the benefit of aggregation?

    A data warehouse usually captures data with same degree of details as available in source. The

    "degree of detail" is termed as granularity. But all reporting requirements from that data

    warehouse do not need the same degree of details.

    To understand this, let's consider an example from retail business. A certain retail chain has

    500 shops accross Europe. All the shops record detail level transactions regarding the products

    they sale and those data are captured in a data warehouse.

    Each shop manager can access the data warehouse and they can see which products are sold by

    whom and in what quantity on any given date. Thus the data warehouse helps the shop

    managers with the detail level data that can be used for inventory management, trend

    prediction etc.

    Now think about the CEO of that retail chain. He does not really care about which certain sales

    girl in London sold the highest number of chopsticks or which shop is the best seller of 'brown

  • 7/29/2019 5.Data Warehouse

    11/19

    11

    breads'. All he is interested is, perhaps to check the percentage increase of his revenue margin

    accross Europe. Or may be year to year sales growth on eastern Europe. Such data is

    aggregated in nature. Because Sales of goods in East Europe is derived by summing up the

    individual sales data from each shop in East Europe.

    Therefore, to support different levels of data warehouse users, data aggregation is needed.

    What is slicing-dicing?

    Slicing means showing the slice of a data, given a certain set of dimension (e.g. Product) and

    value (e.g. Brown Bread) and measures (e.g. sales).

    Dicing means viewing the slice with respect to different dimensions and in different level of

    aggregations.

    Slicing and dicing operations are part of pivoting.

    What is drill-through?

    Drill through is the process of going to the detail level data from summary data.

    Consider the above example on retail shops. If the CEO finds out that sales in East Europe has

    declined this year compared to last year, he then might want to know the root cause of the

    decrease. For this, he may start drilling through his report to more detail level and eventually

    find out that even though individual shop sales has actually increased, the overall sales figure

    has decreased because a certain shop in Turkey has stopped operating the business. The detaillevel of data, which CEO was not much interested on earlier, has this time helped him to pin

    point the root cause of declined sales. And the method he has followed to obtain the details

    from the aggregated data is called drill through.

    What is Normalization ? Why should we use it?

    Normalization is a database design technique which organizes tables in a manner that

    reduces redundancy and dependency of data.

    It divides larger tables to smaller tables and link them using relationships.

    The inventor of the relational model Edgar Codd proposed the theory of normalization with the

    introduction of FirstNormal Form and he continued to extend theory with Second and Third

    Normal Form. Later he joined with Raymond F. Boyce to develop the theory of Boyce-Codd

    Normal Form.

  • 7/29/2019 5.Data Warehouse

    12/19

    12

    Theory of Normalization is still being developed further. For example there are discussions

    even on 6th Normal Form. But in most practical applications normalization achieves its best in3rd Normal Form. The evolution of Normalization theories is illustrated below-

    Lets learn Normalization with practical example -

    Assume a video library maintains a database of movies rented out. Without any normalizationall information is stored in one table as shown below.

    Table 1

    Here you seeMovies Rented column has multiple values.Now lets move in to 1stNormal Form

    1NFRules

    Each table cell should contain single value.Each record needs to be unique.

    The above table in 1NF-

  • 7/29/2019 5.Data Warehouse

    13/19

    13

    Table 1 : In 1NF Form

    Before we proceed lets understand a few things --

    What is a KEY ?

    A KEY is a value used to uniquely identify a record in a table. A KEY could be a singlecolumn or combination of multiple columns

    Note: Columns in a table that are NOT used to uniquely identify a record are called non-keycolumns.

    What is a primary Key?

    A primary is a single column values used to uniquely identify adatabase record.

    It has following attributes

    A primary key cannot be NULLA primary key value must be unique

    The primary key values can not be changedThe primary key must be given a value when a new record is

    inserted.

    What is a composite Key?

    A composite key is a primary key composed of multiple columns used to identify a recorduniquely

    In our database , we have two people with the same name Robert Phil but they live at

    different places.

  • 7/29/2019 5.Data Warehouse

    14/19

    14

    Hence we require both Full Name and Address to uniquely identify a record. This is a

    composite key.Lets move into 2NF

    2NF Rules

    Rule 1- Be in 1NF Rule 2- Single Column Primary Key

    It is clear that we cant move forward to make our simple database in 2 nd Normalizationform unless we partition the table above.

    Table 1

    Table 2

    We have divided our 1NF table into two tables viz. Table 1 and Table2. Table 1 containsmember information. Table 2 contains information on movies rented.

    We have introduced a new column called Membership_id which is the primary key for table1. Records can be uniquely identified in Table 1 using membership id

    Introducing Foreign Key!

    In Table 2, Membership_ID is the foreign Key

  • 7/29/2019 5.Data Warehouse

    15/19

    15

    Foreign Key references primary key of another Table!It

    helps connect your Tables

    A foreign key can have a different name from its primarykey

    It ensures rows in one table have corresponding rows in

    anotherUnlike Primary key they do not have to be unique. Mostoften they arent

    Foreign keys can be null even though primary keys can

    not

  • 7/29/2019 5.Data Warehouse

    16/19

    16

    Why do you need a foreign key ?

    Suppose an idiot inserts a record in Table B such as

    You will only be able to insert values into your foreign key that exist in the unique key in theparent table. This helps in referential integrity.

  • 7/29/2019 5.Data Warehouse

    17/19

    17

    The above problem can be overcome by declaring membership id from Table2 as foreign

    key of membership id from Table1Now , if somebody tries to insert a value in the membership id field that does not exist inthe parent table , an error will be shown!

    What is a transitive functional dependencies?

    A transitive functional dependency is when changing a non-key column , might cause any ofthe other non-key columns to changeConsider the table 1. Changing the non-key column Full Name , may change Salutation.

    Lets move ito 3NF

    3NF Rules

    Rule 1- Be in 2NF Rule 2- Has no transitive functional dependencies

    To move our 2NF table into 3NF we again need to need divide our table.

    TABLE 1

  • 7/29/2019 5.Data Warehouse

    18/19

    18

    Table 2

    Table 3

    We have again divided our tables and created a new table which stores Salutations.

    There are no transitive functional dependencies and hence our table is in 3NFIn Table 3 Salutation ID is primary key and in Table 1 Salutation ID is foreign to primary

    key in Table 3

    Now our little example is in a level that cannot further be decomposed to attain higher

    forms of normalization. In fact it is already in higher normalization forms. Separate effortsfor moving in to next levels of normalization are normally needed in complex databases.

    However we will be discussing about next levels of normalizations in brief in the following.

    Boyce-Codd Normal Form (BCNF)

    Even when a database is in 3rd Normal Form, still there would be anomalies resulted if it hasmore than oneCandidate Key.

    Sometimes is BCNF is also referred as 3.5 Normal Form.

    4th Normal Form

    If no database table instance contains two or more, independent and multivalued data

    describing the relevant entity , then it is in 4th Normal Form.

    5th Normal Form

    A table is in 5th Normal Form only if it is in 4NF and it cannot be decomposed in to anynumber of smaller tables without loss of data.

    6th Normal Form

    6th Normal Form is not standardized yet however it is being discussed by database experts

    for some time. Hopefully we would have clear standardized definition for 6th Normal Form innear future.

  • 7/29/2019 5.Data Warehouse

    19/19

    19