Data warehousing - introduction. Prof.dr.ing. Florin Radulescu, Universitatea Politehnica din Bucureşti


Post on 13-Sep-2019


Page 1: Universitatea Politehnica din Bucure şti - UPBcursuri.cs.pub.ro/~radulescu/dmdw/dmdw-nou/DMDW10.pdf · Data warehousing - introduction Prof.dr.ing. Florin Radulescu Universitatea

Data warehousing - introduction

Prof.dr.ing. Florin Radulescu

Universitatea Politehnica din Bucureşti

Road Map

• What is a data warehouse
• Operational data stores
• Data Warehouse Architecture
• Summary

Florin Radulescu, Course notes

DMDW-10


Foreword

• The goal of this lesson is to present a comprehensive introduction to data warehousing, with definitions of the main terms used.
• The lesson is a summary of the scientific literature of the domain, based mainly on the books published by two authors:
  – W.H. Inmon, the originator of the term "data warehousing"
  – R. Kimball, who developed the dimensional methodology (also known as the Kimball methodology), which has become a standard in the area of decision support.

Definitions

Wikipedia:
• A data warehouse is a repository of an organization's electronically stored data.
• Data warehouses are designed to facilitate reporting and analysis.
• A data warehouse houses a standardized, consistent, clean and integrated form of data sourced from various operational systems in use in the organization, structured in a way to specifically address the reporting and analytic requirements.

Definitions

R. Kimball (see [Kimball, Ross, 2002]):
• A data warehouse is a copy of transactional data specifically structured for querying and analysis.
• According to this definition:
  – The form of the stored data (RDBMS, flat file) is not linked with the definition of a data warehouse.
  – Data warehousing is not linked exclusively with "decision makers", nor is it used only in the process of decision making.

Definitions

W.H. Inmon (see [Inmon 2002]):
• A data warehouse is a:
  – subject-oriented,
  – integrated,
  – nonvolatile,
  – time-variant
  collection of data in support of management’s decisions.
• The data warehouse contains granular corporate data.

Definition explained

• The definition provided by W.H. Inmon is the widely accepted definition of a data warehouse: a subject-oriented, integrated, non-volatile, time-variant collection of data for supporting management decisions in a company.
• The significance of each component of this definition is the following:

Subject-oriented

• The operational data systems of a company are organized around its main activities, so they are activity-oriented and not subject-oriented.
• A classical example in the literature is an insurance company where the main activities are auto, health, life and casualty insurance.

[Figure: Subject-oriented]

Subject-oriented

• For each activity there may be a separate software system managing data on the main subject areas (policies, customers, claims and premiums), so there may be four separate databases, one for each activity, with similar but not identical structures.
• When loading data into the company data warehouse, the data must first be restructured around these major subject areas, integrating data on customers, policies, claims and premiums from each activity (as in the previous figure).

Subject-oriented

• Other examples of major subject areas:
  – in a production company: product, order, vendor, bill of material, and raw goods.
  – in a retail company: product, stock keeping units (SKUs), sale, vendor, etc.

Integrated

• When preparing data for loading into the data warehouse, one of the most important activities is integration. Data is extracted from operational sources and must be converted, summarized, re-keyed, etc., before it is loaded into the data warehouse.
• The next figure illustrates some of the best-known actions performed for data integration:
  – Combine multiple encodings into a single one. For example, gender may be encoded as (0, 1), (m, f) or (male, female) in separate operational systems. If (m, f) is chosen as the data warehouse encoding, all data encoded using another convention must be converted.

[Figure: Integrated]

Integrated

• Actions performed for data integration (cont.):
  – Choose a single unit of measure for each piece of information. For example, if length is measured in cm, inches, yards and meters in different operational systems, one unit must be chosen for the data warehouse and all other values must be converted.
  – If the same object has different values for the same attribute in different data sources (e.g. description, name, features, etc.), these must be combined into a single one.
  – If the same object has different keys in the source systems, it must be re-keyed to have a single key in the data warehouse.
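The integration actions listed above (unifying encodings, unifying units and re-keying) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout, the mapping of 0/1 to m/f, the choice of centimeters as the warehouse unit, and the use of a hypothetical `tax_id` field to recognize the same customer in two source systems.

```python
# Assumed mapping: the slides do not say which of 0/1 means male or female.
GENDER_MAP = {"0": "m", "1": "f", "m": "m", "f": "f", "male": "m", "female": "f"}

# Conversion factors to the unit chosen for the warehouse (centimeters, by assumption).
TO_CM = {"cm": 1.0, "m": 100.0, "inch": 2.54, "yard": 91.44}


def unify_gender(value):
    """Convert any of the source encodings to the warehouse encoding (m, f)."""
    return GENDER_MAP[str(value).strip().lower()]


def to_cm(value, unit):
    """Convert a length expressed in a source unit to centimeters."""
    return value * TO_CM[unit]


class KeyRegistry:
    """Hands out one warehouse key per object, whatever key each source used."""

    def __init__(self):
        self._next_key = 1
        self._keys = {}

    def warehouse_key(self, natural_id):
        if natural_id not in self._keys:
            self._keys[natural_id] = self._next_key
            self._next_key += 1
        return self._keys[natural_id]


# The same customer as seen by two hypothetical operational systems.
source_rows = [
    {"source": "auto", "key": "A-17", "tax_id": "RO123", "gender": 0, "height": 2, "unit": "m"},
    {"source": "health", "key": 9001, "tax_id": "RO123", "gender": "male", "height": 200, "unit": "cm"},
]

registry = KeyRegistry()
integrated = [
    {
        "customer_key": registry.warehouse_key(row["tax_id"]),
        "gender": unify_gender(row["gender"]),
        "height_cm": to_cm(row["height"], row["unit"]),
    }
    for row in source_rows
]
```

After integration both rows carry the same `customer_key`, the same gender code and the same unit, even though the source systems disagreed on all three.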

Non-volatile

• In usual operational systems data is updated or deleted to reflect the current values. In a data warehouse data is never updated or deleted: after data is loaded, it stays there for future reporting, like a snapshot reflecting the situation at a certain moment.
• Subsequent load operations, instead of changing the old snapshots, add new snapshots, so the data warehouse is a sequence of such snapshots that coexist.


Non-volatile

• In this way the data warehouse contains not the operational data at a given moment but the whole history of operational data.
• Because of this lack of change, data in a data warehouse, once loaded, may be considered read-only.
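The snapshot behaviour described above can be sketched as an append-only store. This is only an illustration: the class name `SnapshotWarehouse`, its methods and the stock example are hypothetical, and a real warehouse would rely on database tables rather than in-memory tuples.

```python
from datetime import date
from types import MappingProxyType


class SnapshotWarehouse:
    """Append-only store: each load adds a snapshot, nothing is updated or deleted."""

    def __init__(self):
        self._snapshots = []  # (load_date, rows) pairs, in load order

    def load(self, load_date, rows):
        if self._snapshots and load_date <= self._snapshots[-1][0]:
            raise ValueError("snapshots must be loaded in chronological order")
        # Copy every row and wrap it read-only, so later source updates cannot leak in.
        frozen = tuple(MappingProxyType(dict(row)) for row in rows)
        self._snapshots.append((load_date, frozen))

    def snapshot_dates(self):
        return [load_date for load_date, _ in self._snapshots]

    def snapshot(self, load_date):
        for stored_date, rows in self._snapshots:
            if stored_date == load_date:
                return rows
        raise KeyError(load_date)


operational_stock = [{"product": "Coffee", "stock": 10}]

dw = SnapshotWarehouse()
dw.load(date(2024, 1, 31), operational_stock)

operational_stock[0]["stock"] = 4  # the operational system updates in place
dw.load(date(2024, 2, 29), operational_stock)
```

The January snapshot still reports a stock of 10 after the operational value has changed to 4, and both snapshots coexist in the store.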

Time variant

• As described above, a data warehouse contains a sequence of snapshots, each snapshot being current at a given moment in time.
• Because a DW contains the whole history of a company, it is possible to retrieve information over a time horizon of 5-10 years or even more.
• Each unit of information is stamped or linked with the moment during which that information was accurate.

[Figure: Time variant]

Time variant

• In an operational system only the current data is kept. For example, if a customer changes address, in the operational system the old address is replaced (updated) with the new one.
• In the data warehouse all successive addresses of a customer are kept.
• Because date and time are very important in analyzing data and reporting, the key structure usually contains the date and sometimes the time.
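The address example can be sketched with a key that contains a date. The layout below, a dictionary keyed by `(customer_id, valid_from)`, and the helper `address_as_of` are assumptions made for illustration; they are not taken from the cited books.

```python
from datetime import date

# The date is part of the key, so successive addresses of one customer coexist.
address_history = {
    (42, date(2015, 3, 1)): "12 Old Street",
    (42, date(2019, 7, 15)): "7 New Avenue",
    (43, date(2018, 1, 1)): "1 Main Square",
}


def address_as_of(history, customer_id, when):
    """Return the address that was valid for the customer on the given day."""
    candidates = [
        valid_from
        for (cid, valid_from) in history
        if cid == customer_id and valid_from <= when
    ]
    if not candidates:
        return None  # no address recorded yet at that moment
    return history[(customer_id, max(candidates))]
```

A report about 2017 would call `address_as_of(address_history, 42, date(2017, 1, 1))` and get the old address, while an operational system would only know the current one.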

Why build a DW?

[Kimball, Ross, 2002] gives a list of reasons for a company to build its own data warehouse:
• “We have mountains of data in this company, but we can’t access it.”
• “We need to slice and dice the data every which way.”
• “You’ve got to make it easy for business people to get at the data directly.”
• “Just show me what is important.”
• “It drives me crazy to have two people present the same business metrics at a meeting, but with different numbers.”
• “We want people to use information to support more fact-based decision making.”

Requirements for a DW

• [Kimball, Ross, 2002] also lists the demands that a data warehouse must meet in order to be productive and to return the investment.
• (Building a DW in a company is not always a cheap operation.)

Information must be easily accessible

• DW content must be understandable.
• DW content must be intuitive or obvious to non-database specialists, because they are the key users of the system.
• Names must be meaningful (for data categories, features, attributes and so on), so that the structure of the DW is understandable to a non-specialist user.
• The DW must provide options for combining data in the DW, a process known as slicing and dicing.
• The methods and tools for accessing data in the data warehouse must be simple and easy to use, and answers must be returned in a short time.

Information must be consistent

• The process of feeding a data warehouse with data contains a preprocessing step, where data is assembled from many sources, cleansed and quality assured. Data is released (published) to the users only when it is fit for use.
• As described earlier, an integration step is performed when data is loaded from operational sources, unifying encodings, units of measure, keys, names, common values/features, etc.
• Common definitions for the contents of the data warehouse must be available to DW users.

Flexibility

• A data warehouse must be designed to be flexible, considering the inevitable changes in computer science and engineering. Its content must be structured in such a way that changes in the software and hardware platform are possible.
• Adding new data, reports and queries must be possible and must not interfere with existing ones.

Security

• Because of its confidential content, the data warehouse must have the means for rejecting unauthorized access.
• Potential leaks of content may be harmful for the company if competitors gain access to the data in the DW.

Decision support

• The primary goal of implementing a data warehouse in an organization is decision support.
• The ultimate output from a DW is the set of decisions based on its content, analyzed and presented in different ways to the decision makers.
• The original label for a data warehouse and the tools around it was ‘decision support system’.

Acceptance

• The ultimate test of success in implementing a data warehouse is the acceptance test.
• If the business community does not continue to use it in the first six months after training, then the system has failed the acceptance test, no matter how brilliant the technical solution is.
• Users can simply ignore it, because decisions can also be made without a decision support system.
• The key point in user acceptance is simplicity and user friendliness.

Road Map

• What is a data warehouse
• Operational data stores
• Data Warehouse Architecture
• Summary

ODS

• The concept of Operational Data Store (ODS) was also introduced by W.H. Inmon and its definition, found in [Inmon 98], is the following:
• An ODS is an integrated, subject-oriented, volatile (including update), current-valued structure designed to serve operational users as they do high performance integrated processing.
• We can compare an ODS with a database integrating data from multiple sources. Its goal is to help analysis and reporting.

[Figure: ODS vs. DW. Source: [Inmon 98]]

ODS features

According to Inmon, the main features of an ODS are:
• enablement of integrated, collective on-line processing.
• delivers consistent high transaction performance (two to three seconds).
• supports on-line update.
• is integrated across many applications.
• provides a foundation for collective, up-to-the-second views of the enterprise.
• supports decision support processing.

Similarities DW - ODS

Subject-oriented data:
• Before data is loaded into the ODS, it must first be restructured around major subject areas (as in the insurance company example: integrating data on customers, policies, claims and premiums from each activity).

Integrated content:
• Data is sourced from multiple operational systems (sources), and the integration step includes, as in the DW case, cleaning, unifying encodings, re-keying, removing redundancies, preserving integrity, etc.

Dissimilarities DW - ODS

Its content is volatile (or updateable):
• In an ODS data is updated, as in a transaction processing system. Limited or no history is maintained.

Its content is not time-variant (or current):
• An ODS is designed to contain limited history, containing ‘real time’ or ‘near real time’ data.

Road Map

• What is a data warehouse
• Operational data stores
• Data Warehouse Architecture
• Summary

DW architecture

The basic elements of a data warehouse environment are:
• Operational Source Systems. These are the source of the data in the DW, and are placed outside of the data warehouse.
• Data Staging Area. Here data is prepared (transformed) for loading into the presentation area. This area is not accessible to regular users.
• Data Presentation. This part is what regular users see and consider to be the DW.
• Data Access Tools. These tools are used for analysis and reporting. They provide the interface between the user and the DW.

[Figure: DW architecture]

Data staging area

• The data staging area (DSA) of a data warehouse is compared in [Kimball, Ross, 2002] with the kitchen of a restaurant. It is:
  – A storage area, and
  – A set of processes performing the so-called Extract-Transform-Load (ETL) operation:
      – Extract: extracting data from the Operational Source Systems
      – Transform: integrating data from all sources, as described below
      – Load: publishing data for users, meaning loading data into the Data Presentation area
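The three ETL steps can be sketched as three small Python functions chained together. The source names, field names and cleaning rules below are hypothetical and chosen only to keep the example short; real ETL tools work against databases and files rather than in-memory lists.

```python
def extract(sources):
    """Extract: pull raw rows out of every operational source system."""
    for source_name, rows in sources.items():
        for row in rows:
            yield source_name, row


def transform(tagged_rows):
    """Transform: bring the rows of all sources to one common shape."""
    for source_name, row in tagged_rows:
        yield {
            "source": source_name,
            "customer": row["customer"].strip().title(),
            "amount": round(float(row["amount"]), 2),
        }


def load(rows, presentation_area):
    """Load: publish the prepared rows in the presentation area."""
    presentation_area.extend(rows)
    return len(presentation_area)


# Two hypothetical operational systems with slightly different conventions.
sources = {
    "auto": [{"customer": " ana pop ", "amount": "120.50"}],
    "health": [{"customer": "ION RADU", "amount": 80}],
}

presentation_area = []
loaded = load(transform(extract(sources)), presentation_area)
```

Users would only ever query `presentation_area`; the generators in between play the role of the staging area, which regular users never see.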

Integration tasks

• Dealing with synonyms: same data with different names in different operational systems
• Dealing with homonyms: same name for different data
• Unifying keys from different sources
• Unifying encodings
• Unifying units of measure and levels of detail
• Dealing with different software platforms
• Dealing with missing data
• Dealing with different value ranges, etc.
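Two of the tasks above, synonyms and missing data, can be sketched in a few lines of Python. The column names, the per-source mapping tables and the default values are all assumptions for illustration; how missing data should actually be filled is a business decision, not something the slides prescribe.

```python
# Synonyms: each hypothetical source names the same data differently.
SYNONYMS = {
    "auto": {"cust_name": "customer_name", "dob": "birth_date"},
    "health": {"client": "customer_name", "date_of_birth": "birth_date"},
}

# Missing data: explicit defaults for every column of the common layout.
DEFAULTS = {"customer_name": "UNKNOWN", "birth_date": None}


def to_common_names(source, row):
    """Rename source columns to the warehouse names and fill in missing values."""
    mapping = SYNONYMS[source]
    renamed = {mapping.get(column, column): value for column, value in row.items()}
    result = {}
    for column, default in DEFAULTS.items():
        value = renamed.get(column)
        result[column] = default if value in ("", None) else value
    return result
```

For example, `to_common_names("health", {"client": ""})` yields a row with `customer_name` set to `"UNKNOWN"` instead of an empty string.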

Data staging area

• The DSA contains everything between the operational source systems and the data presentation area.
• As stated earlier, this area is not accessible to the regular users of the data warehouse.

Main approaches

• Storing data in a DW (and so also in the DSA) may be done following two main approaches:
  1. The normalized approach (supported by the work of W.H. Inmon, see [Inmon 2002])
  2. The dimensional approach (supported by the work of Ralph Kimball, see [Kimball, Ross, 2002])
• These approaches are not mutually exclusive, and there are other approaches.
• Dimensional approaches can involve normalizing data to a degree.
• This lesson is based on the dimensional approach.

Normalized approach

• In the normalized approach, data is stored following database normalization rules.
• Tables are grouped by subject areas (data on customers, policies, claims and premiums, for example).
• The main advantage of this approach is that loading data is straightforward, because the philosophy of structuring data is the same for the operational source systems and the data warehouse.

Normalized approach

• The main disadvantage of this approach is the number of joins needed to obtain meaningful information.
• A regular user also needs good knowledge of the data in the DW, as well as a training period in obtaining de-normalized tables from normalized ones.
• Missing a join condition when performing a query may lead to Cartesian products instead of joins. In other words, regular users may need assistance from a database specialist to perform usual operations.
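The effect of a missing join condition can be seen on a tiny example. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the `customer` and `policy` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE policy (policy_id INTEGER PRIMARY KEY, customer_id INTEGER, kind TEXT);
    INSERT INTO customer VALUES (1, 'Ana'), (2, 'Ion'), (3, 'Maria');
    INSERT INTO policy VALUES (10, 1, 'auto'), (11, 1, 'life'), (12, 2, 'health'), (13, 3, 'auto');
""")

# Correct query: every policy is paired with its own customer.
with_join = conn.execute(
    "SELECT COUNT(*) FROM customer c JOIN policy p ON p.customer_id = c.customer_id"
).fetchone()[0]

# Join condition forgotten: every customer is paired with every policy.
without_join = conn.execute(
    "SELECT COUNT(*) FROM customer c, policy p"
).fetchone()[0]
```

With 3 customers and 4 policies the correct join returns 4 rows, while the query without the condition returns 3 × 4 = 12 rows. On real warehouse tables the same mistake multiplies millions of rows.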

Dimensional approach

• In a dimensional approach, data is partitioned into two main categories:
  – Facts (numeric transaction data). In a retail example, the fact table contains quantity sold, total price, total cost, total gross profit.
  – Dimensions (standardized contexts for facts). In a retail example, dimensions may be: product, date, time, location, customer, salesperson, etc.

Dimensional approach

• Advantages of the dimensional approach are:
  – Data is easy to understand and easy to use, there is no need for assistance from a database specialist, and queries are answered quickly.
  – Because data is de-normalized (or partially de-normalized), the number of joins needed to perform a query is lower than in the normalized approach.
  – Joins between the fact table and its dimensions are easy to perform because the fact table contains surrogate keys for all involved dimension tables.
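A minimal star schema makes the role of the surrogate keys concrete. The sketch below uses Python's standard `sqlite3` module; the table names (`fact_sales`, `dim_product`, `dim_date`), the columns and the data are hypothetical and reduced to two dimensions for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    -- The fact table holds numeric measures plus one surrogate key per dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        quantity_sold INTEGER,
        total_price REAL
    );
    INSERT INTO dim_product VALUES (1, 'Coffee'), (2, 'Tea');
    INSERT INTO dim_date VALUES (1, 2024, 1), (2, 2024, 2);
    INSERT INTO fact_sales VALUES (1, 1, 10, 50.0), (1, 2, 4, 20.0), (2, 1, 3, 9.0);
""")

# Every join follows the same pattern: fact surrogate key = dimension primary key.
rows = conn.execute("""
    SELECT p.name, d.month, SUM(f.quantity_sold), SUM(f.total_price)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.name, d.month
    ORDER BY p.name, d.month
""").fetchall()
```

Whatever the question, the query has the same shape: start from the fact table and join each needed dimension on its key, which is what keeps such queries approachable for non-specialists.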

Dimensional approach

• Disadvantages of the dimensional approach:
  – The ETL process is harder to perform because of the different philosophy in structuring data in the operational systems and the data warehouse: the transform and load steps are more complicated than in the normalized approach.
  – A second disadvantage is that it is more difficult to modify the data warehouse schema when the company changes its way of doing business.

Data presentation area

• At the end of the ETL process, the prepared data is loaded into the Data Presentation Area (DPA).
• From that moment, data is available to users for querying, reporting and other analytical applications.
• Because regular users have access only to that area, they may consider the presentation area as being the data warehouse.
• This area is structured as a series of integrated data marts, each presenting the data from a single business process.

Data marts – Definition 1

[SQLServer 2005]:
• A data mart is defined as a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers.
• Data may derive from an enterprise-wide database or data warehouse or be more specialized.
• The emphasis of a data mart is on meeting the specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and ease-of-use.

Data marts – Definition 2

• [Wikipedia] defines a data mart as a subset of an organizational data store, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs.
• Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization.

Data marts

• Data marts are often derived from subsets of data in a data warehouse.
• Conversely, the data warehouse may be created from the union of organizational data marts.
• In the DPA data is stored, presented, and accessed in dimensional schemas.
• We can imagine a hypercube with edges labeled with the dimensions, e.g. customer, product and time.
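The hypercube image can be made concrete with a small dictionary whose keys are coordinates along the three dimensions. The data, the `DIMENSIONS` tuple and the two helper functions are hypothetical; they only illustrate what fixing or summing along an edge of the cube means.

```python
# Each cell of the cube is addressed by (customer, product, month); values are sales.
DIMENSIONS = ("customer", "product", "month")

cube = {
    ("Ana", "Coffee", "2024-01"): 50.0,
    ("Ana", "Tea", "2024-01"): 9.0,
    ("Ion", "Coffee", "2024-01"): 20.0,
    ("Ion", "Coffee", "2024-02"): 30.0,
}


def slice_cube(cells, **fixed):
    """Keep only the cells whose coordinates match the fixed dimension values."""
    positions = {DIMENSIONS.index(name): value for name, value in fixed.items()}
    return {
        key: value
        for key, value in cells.items()
        if all(key[pos] == wanted for pos, wanted in positions.items())
    }


def roll_up(cells, dimension):
    """Total the cells per value of one dimension, summing over the others."""
    pos = DIMENSIONS.index(dimension)
    totals = {}
    for key, value in cells.items():
        totals[key[pos]] = totals.get(key[pos], 0.0) + value
    return totals
```

Calling `slice_cube(cube, product="Coffee", month="2024-01")` fixes two edges of the cube and leaves the customers, while `roll_up(cube, "customer")` collapses product and time into one total per customer.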

[Figure: Hypercube. Source: [Rainardi 2008]]

Other data mart features

• The data contained is detailed, atomic data.
• This is necessary for evaluating ad hoc user queries that are not covered by the pre-defined queries or other options of the tools used to access data.
• Data marts also contain summary data, obtained via aggregation; these data are used for performance (speed) enhancement.
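The relation between atomic and summary data can be sketched as follows. The sales rows and the monthly grain of the summary are assumptions for illustration; in a real data mart the summary would be a table maintained by the load process.

```python
from collections import defaultdict

# Atomic data: one row per individual sale (hypothetical values).
atomic_sales = [
    {"date": "2024-01-03", "product": "Coffee", "quantity": 2},
    {"date": "2024-01-17", "product": "Coffee", "quantity": 5},
    {"date": "2024-02-02", "product": "Coffee", "quantity": 1},
    {"date": "2024-01-09", "product": "Tea", "quantity": 4},
]


def build_monthly_summary(rows):
    """Aggregate atomic rows into one total per (month, product)."""
    summary = defaultdict(int)
    for row in rows:
        month = row["date"][:7]  # "YYYY-MM"
        summary[(month, row["product"])] += row["quantity"]
    return dict(summary)


# Computed once at load time; frequent monthly reports read it directly.
monthly_summary = build_monthly_summary(atomic_sales)

# An ad hoc question the summary cannot answer: sales from 10 January onwards.
late_sales = sum(r["quantity"] for r in atomic_sales if r["date"] >= "2024-01-10")
```

The monthly report is answered from `monthly_summary` without scanning the detail rows, while the ad hoc question still needs the atomic data, which is why both are kept.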

Page 53: Universitatea Politehnica din Bucure şti - UPBcursuri.cs.pub.ro/~radulescu/dmdw/dmdw-nou/DMDW10.pdf · Data warehousing - introduction Prof.dr.ing. Florin Radulescu Universitatea

Other data mart features

• Data marts use common dimensions and facts.

• Kimball refers to them as 'conformed'.

• This means, for example, that the same date dimension is used in all data marts, and in all star schemas of the DW, provided its meaning is the same in all cases.

• Because data marts use conformed dimensions and facts, they can be combined and used together.
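
A minimal sketch of why conformed dimensions allow data marts to be combined: two hypothetical marts (sales and returns) share one date dimension, so their results can be lined up by month. All table names, columns and values here are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One conformed date dimension shared by both (hypothetical) data marts.
    CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, month TEXT);
    INSERT INTO date_dim VALUES (1, '2012-01'), (2, '2012-01'), (3, '2012-02');

    -- Sales data mart.
    CREATE TABLE sales_fact (date_key INTEGER, amount REAL);
    INSERT INTO sales_fact VALUES (1, 100.0), (2, 50.0), (3, 70.0);

    -- Returns data mart.
    CREATE TABLE returns_fact (date_key INTEGER, amount REAL);
    INSERT INTO returns_fact VALUES (2, 20.0), (3, 5.0);
""")

def total_by_month(fact_table):
    """Sum a fact table's amounts per month of the shared date dimension."""
    # fact_table is one of the fixed table names above, never user input.
    rows = conn.execute(
        f"SELECT d.month, SUM(f.amount) FROM {fact_table} AS f "
        "JOIN date_dim AS d ON d.date_key = f.date_key GROUP BY d.month"
    )
    return dict(rows.fetchall())

sales = total_by_month("sales_fact")
returns = total_by_month("returns_fact")

# Both results are keyed by the same conformed attribute, so they line up.
net_sales = {month: sales[month] - returns.get(month, 0.0) for month in sorted(sales)}
```

If each mart had its own, differently defined date dimension, the two result sets would have no reliable common key to combine on.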

Examples

• A large enterprise data warehouse will consist of 20 or more very similar-looking data marts, with similar dimensional models.

• Each data mart may contain several fact tables, each with 5 to 15 dimension tables.

• Many of these dimension tables will be shared between several fact tables.

Data access tools

• Almost all regular DW users (80% to 90%) will access the data via prebuilt, parameter-driven analytic applications.

• Generally a user has four channels to interact with a DW:
  • Ad hoc query tools
  • Report writers
  • Analytic applications
  • Modeling tools

Ad hoc query tools

• Through this channel the user obtains raw data satisfying the conditions specified in the ad hoc query.

• To use this channel the user must have a good knowledge of the DW structure and of the query language used.

• This channel is for specialists and experienced users.

• Sometimes there are some pre-built queries that may be used.
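
To show why this channel requires knowledge of both the DW structure and the query language, here is a small sketch of an ad hoc query against a star schema. The schema (`sales_fact`, `customer_dim`, `product_dim`) and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE product_dim  (product_key  INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE sales_fact   (customer_key INTEGER, product_key INTEGER, amount REAL);

    INSERT INTO customer_dim VALUES (1, 'Ann', 'Bucharest'), (2, 'Bob', 'Cluj');
    INSERT INTO product_dim  VALUES (10, 'laptop', 'hardware'), (11, 'antivirus', 'software');
    INSERT INTO sales_fact   VALUES (1, 10, 1200.0), (1, 11, 40.0), (2, 10, 1100.0);
""")

# The ad hoc query: raw fact rows for hardware bought by customers in Bucharest.
# Writing it requires knowing the table names, the keys and the join paths.
ad_hoc_query = """
    SELECT c.name, p.name, f.amount
    FROM sales_fact AS f
    JOIN customer_dim AS c ON c.customer_key = f.customer_key
    JOIN product_dim  AS p ON p.product_key  = f.product_key
    WHERE c.city = 'Bucharest' AND p.category = 'hardware'
"""
rows = conn.execute(ad_hoc_query).fetchall()
```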

Report writers

• This channel is at the same level as the first one.

• Raw data is presented as a report.

• Usually there are several pre-built reports that the user may run without knowledge of the DW structure or the query language.

• Building new reports may require additional skills.
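
A pre-built report can be pictured as a fixed query wrapped in a function whose only input is a parameter chosen by the user. The sketch below assumes a hypothetical `sales_fact` table and a made-up report layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (month TEXT, region TEXT, amount REAL);
    INSERT INTO sales_fact VALUES
        ('2012-01', 'North', 100.0),
        ('2012-01', 'South',  60.0),
        ('2012-02', 'North',  80.0);
""")

def monthly_sales_report(month):
    """Pre-built report: the query is fixed, the user supplies only the month."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales_fact "
        "WHERE month = ? GROUP BY region ORDER BY region",
        (month,),
    ).fetchall()
    lines = [f"Sales report for {month}"]
    lines += [f"  {region}: {total:.2f}" for region, total in rows]
    return "\n".join(lines)

report = monthly_sales_report("2012-01")
```

The user of `monthly_sales_report` never sees the SQL or the table layout; changing the query itself is the task that needs the extra skills.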

Analytic applications

• This category includes:
  • interactive reports,
  • dashboards,
  • scorecards, and
  • other reporting tools that allow users to access and analyze data in the DW.

Modeling tools

• This category includes data mining products and forecasting and scoring tools.

• At this level the result is not only a sophisticated report on existing data but also newly extracted knowledge, forecasting models and other outputs that provide new knowledge to the user.
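
As a very small example of an output that goes beyond reporting existing data, the sketch below fits a least-squares linear trend to a series of monthly totals and extrapolates one month ahead. The data and the function name `fit_linear_trend` are assumptions; real modeling tools use far richer techniques.

```python
def fit_linear_trend(values):
    """Least-squares line through the points (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical monthly sales totals taken from the DW.
monthly_sales = [100.0, 110.0, 120.0, 130.0]
slope, intercept = fit_linear_trend(monthly_sales)

# The model yields a value that is not stored anywhere in the DW.
forecast_next_month = intercept + slope * len(monthly_sales)
```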

Summary

• This course presented:
  • Some definitions of a data warehouse and a detailed discussion based on Inmon's definition, explaining the meaning of the four features of a DW: subject-oriented, integrated, non-volatile and time-variant. Some reasons for building a data warehouse were also discussed.
  • A definition of the concept of Operational Data Store, with a parallel between ODS and DW.
  • A discussion of the architecture of a DW, presenting its main parts: the Data Staging Area, the Data Presentation Area and the Data Access Tools.

• Next week: Dimensional modeling

References

[Inmon 2002] W.H. Inmon - Building the Data Warehouse, Third Edition, Wiley & Sons, 2002
[Kimball, Ross, 2002] Ralph Kimball, Margy Ross - The Data Warehouse Toolkit, Second Edition, Wiley & Sons, 2002
[CS680, 2004] Introduction to Data Warehouses, Drexel Univ. CS 680 course notes, 2004 (page https://www.cs.drexel.edu/~dvista/cs680/2.DW.Overview.ppt, visited 2010)
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org, visited 2009
[SQLServer 2005] Dan Gallagher, Tim D. Nelson, and Steve Proctor - Data mart, Nov. 2005, http://searchsqlserver.techtarget.com/definition/data-mart, visited June 20, 2012
[Inmon, 98] W.H. Inmon - The Operational Data Store, July 1, 1998, http://www.information-management.com/issues/19980701/469-1.html, visited June 20, 2012
[Rainardi, 2008] Vincent Rainardi - Building a Data Warehouse with Examples in SQL Server, Springer, 2008