unit 4. data warehousing data warehousing...

15
UNIT 4. DATA WAREHOUSING Data Warehousing Components -Multi Dimensional Data Model- Data Warehouse Architecture-Data Warehouse Implementation- -Mapping the Data Warehouse to Multiprocessor Architecture- OLAP.-Need- Categorization of OLAP Tools. 4.1 Data Warehousing Components 4.1.1 Definition:” A data warehouse is a subject- oriented, integrated, time-Variant and non-volatile collection of data in support of management decision making process.” •Subject-Oriented: Stored data targets specific subjects. Example:It may store data regarding total Sales, Number of Customers, etc. and not general data on everyday operations. •Integrated:Data may be distributed across heterogeneous sources which haveto be integrated. Example: Sales data may be on RDB, Customer information on Flatfiles, etc. •Time Variant:Data stored may not be current but varies with time and data have an element of time. Example: Data of sales in last 5 years, etc. •Non-Volatile:It is separate from the Enterprise Operational Database and hence is not subject to frequent modification. It generally has only 2operations performed on it: Loading of dataand Access of data. 4.1.2 Features of a Warehouse: It is separate from Operational Database. •Integrates data from heterogeneous systems. •Stores HUGE amount of data, more historical than current data. •Does not require data to be highly accurate. •Queries are generally complex. •Goal is to execute statistical queries and provide results whichcan influence decision making in favor of the Enterprise.

Upload: vuduong

Post on 28-Apr-2018

242 views

Category:

Documents


5 download

TRANSCRIPT

UNIT 4. DATA WAREHOUSING

Data Warehousing Components -Multi Dimensional Data Model- Data Warehouse Architecture-Data Warehouse Implementation- -Mapping the Data Warehouse to Multiprocessor Architecture- OLAP.-Need- Categorization of OLAP Tools.

4.1 Data Warehousing Components

4.1.1 Definition:” A data warehouse is a subject- oriented, integrated, time-Variant and non-volatile collection of data in support of management decision making process.”

•Subject-Oriented: Stored data targets specific subjects.

Example:It may store data regarding total Sales, Number of Customers, etc. and not general data on everyday operations.

•Integrated:Data may be distributed across heterogeneous sources which haveto be integrated.

Example: Sales data may be on RDB, Customer information on Flatfiles, etc.

•Time Variant:Data stored may not be current but varies with time and data have an element of time.

Example: Data of sales in last 5 years, etc.

•Non-Volatile:It is separate from the Enterprise Operational Database and hence is not subject to frequent modification. It generally has only 2operations performed on it: Loading of dataand Access of data.

4.1.2 Features of a Warehouse:

It is separate from Operational Database.

•Integrates data from heterogeneous systems.

•Stores HUGE amount of data, more historical than current data.

•Does not require data to be highly accurate.

•Queries are generally complex.

•Goal is to execute statistical queries and provide results whichcan influence decision making in favor of the Enterprise.

•These systems are thus called Online Analytical Processing Systems (OLAP).

4.1.3 Three Main reasons for having a separate data base

1. OLTP systems require high concurrency, reliability, locking which provide good performance for short and simple OLTP queries. An OLAP query is very complex and does not require these properties. Use of OLAP query on OLTP system degrades its performance.

2. An OLAP query reads HUGE amount of data and generates the required result. The query is very complex too. Thus special primitives have to provide to support this kind of data access.

3. OLAP systems access historical data and not current volatile data while OLTP systems access current up-to-date data and do not need historical data.

4.1.4 Difference between OLTP and OLAP

Feature OLTP OLAPusers clerk, IT professional knowledge workerfunction day to day operations decision supportDB design application-oriented subject-orienteddata current, up-to-date

detailed, flat relationalisolated

historical, summarized, multidimensionalintegrated, consolidated

usage repetitive ad-hocaccess read/write

index/hash on prim. keylots of scans

unit of work short, simple transaction complex query# records accessed tens millions#users thousands hundredsDB size 100MB-GB 100GB-TBmetric transaction throughput query throughput, response

4.1.5 How do we represent this data???

This multi-dimensional data can be represented using a data cube as shown below.

Or

This figure shows a 3-Dimensional dataModel.X –Dimension : Item typeY –Dimension : Time/PeriodZ –Dimension : Location

Each cell represents the items sold of type ‘x’, in location ‘z’during the quarter‘y’. •Data cube is thus a n -dimensional data model

4.1.6 Schemas used for Multidimensional model

Star schema

There is a central large Fact table with no redundancy

•Each tuple in the fact table has a foreign key to a dimension table which describes the details of that dimension

Problem: Redundancy•Values of city, province_or_state and country would be repeated for two streets in the same city.Thus we can normalize the table by splitting location into sub tables. (Snowflake Schema)

Advantage: PerformanceAs less number of joins required

Snowflake schemaSome of the dimension tables are normalized thus splitting data into additional tables.

Problem: Performance•Too many joins required to form the result.

Thus Snowflake schema is not as popular as the Star schema.

Fact Constellation schema

Two or more fact tables share dimension tables. In the figure below the ‘Sales’ fact table and ‘Shipping’ fact table Share the dimension tables As

As the multiple fact tables are linked to each other by dimension tables, its called as fact Constellation schema

Data Mining Query Language (DMQL)

Syntax:

define cube <cube name>[<dimension list>]:<measure list>

define dimension <dimension name> as (<attribute list>)

Star Schema Example:

Fact Table:define cube sales_star [time,item,branch,location]:dollars_sold=sum(sales_in_dollars),units_sold=count(*)Dimensions:define dimension time as (time_key,day,day_of_week,month,quarter,year)

define dimension item as (item_key,item_name,brand,type)

define dimension branch as (branch_key,branch_name,branch_type)

define dimension location as (location_key,street,city,province,country)

Defining a hierarchy of dimension tables for snowflake schema.

define dimension location as (location_key,street,city(city_key,city,province,country))

Defining a shared dimension table for Fact Constellation Schema

define dimension time as (time_key,day,day_of_week,month,quarter,year)define dimension time as time in cube sales

4.2 Data warehousing architecture

Approaches for data warehousing

1. Top-Down Approach�Starts with overall design and planning�Useful where the technology is mature and well known�Useful where the business problems to be solved are clear and well understood

2. Bottom-Up Approach�Starts with experiment and prototypes�Allows an organization to move forward at considerably less expense�Allows to evaluate the benefits of the technology before making significant commitments

3. Combined Approach�Exploits the planned and strategic nature of the top-down approach�Retains the rapid implementation and opportunistic application of the bottom-up approach

4.2.1 Four views of Data warehouse

Top-down, bottom-up approaches or a combination of both Top-down : Starts with overall design and planning (mature) Bottom-up : Starts with experiments and prototypes (rapid)

From software engineering point of view Waterfal l: structured and systematic analysis at each step before

proceeding to the next Spiral : rapid generation of increasingly functional systems, short turn

around time, quick turn around Typical data warehouse design process

Choose a business process to model, e.g., orders, invoices, etc. Choose the grain (atomic level of data) of the business process Choose the dimensions that will apply to each fact table record

Choose the measure that will populate each fact table record

4.2.2 Three tier architecture

Data Mart

Contains a subset of corporate-wide data that is of value to a specific group of users. Scope is confined to specific selected subjects

�Implemented on low-cost departmental servers that are UNIX or Windows NT based. Implementation cycle is measured in weeks rather than months or years

�They are further classified as:

Independent•Sourced from data captured from one or more operational systems or external information providers•Sourced from data generated locally within a particular department or geographic area

Dependent•Sourced directly from enterprise data warehouses

Virtual Warehouse

�It is a Set of Views over operational databases

�For efficient query processing; only some of the possible summary views may be materialized

�It is easy to build but requires excess capacity on operational database servers

August 17, 2010 Data Mining: Concepts and Techniques 16

Data Warehouse Development: A Recommended Approach

Define a high -level corporate data model

Data Mart

Data Mart

Distributed Data Marts

Multi -Tier Data Warehouse

Enterprise Data Warehouse

Model refinementModel refinement

4.2.3 Data Warehouse Back-End Tools and Utilities

Data extractionget data from multiple, heterogeneous, and external sources

Data cleaningdetect errors in the data and rectify them when possible

Data transformationconvert data from legacy or host format to warehouse format

Loadsort, summarize, consolidate, compute views, check integrity, and build indicies and partitions

Refreshpropagate the updates from the data sources to the warehouse

4.2.4 Metadata Repository

Meta data is the data defining warehouse objects. It stores: Description of the structure of the data warehouse

schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents

Operational meta-datadata lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)

The algorithms used for summarization The mapping from operational environment to the data warehouse Data related to system performance

Warehouse schema, view and derived data definitions Business data

Business terms and definitions, ownership of data, charging policies

4.2.5 OLAP Server Architectures

Relational OLAP (ROLAP) Use relational or extended-relational DBMS to store and manage

warehouse data and OLAP middle ware Include optimization of DBMS backend, implementation of aggregation

navigation logic, and additional tools and services Greater scalability

Multidimensional OLAP (MOLAP) Sparse array-based multidimensional storage engine Fast indexing to pre-computed summarized data

Hybrid OLAP (HOLAP ) (e.g., Microsoft SQLServer) Flexibility, e.g., low level: relational, high-level: array

Specialized SQL servers (e.g., Redbricks) Specialized support for SQL queries over star/snowflake schemas

4.2.6 OLAP operations

1. Roll-up

Performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction2. Drill-downCan be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions3.Slice and DiceSlice performs a selection on one dimension of the given cube, resulting in a sub cubeDice defines a subcube by performing a selection on two or more dimensions

4.Pivot (rotate)

It’s a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data

4.2.7 Concept hierarchies

It is a sequence of mappings from a set of low-level concepts to higher-level, more general concepts

Data warehouse Implementation

Cube: A Lattice of Cuboids

Review

Key terms

OLAP Rollup Concept hierarchies

all

time item location supplier

time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,supplier

time,location,supplier

item,location,supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

Data Warehouse

Meta Data

MDDB

OLAMEngine

OLAPEngine

User GUI API

Data Cube API

Database API

Data cleaning

Data integration

Layer3OLAP/OLAM

Layer2MDDB

Layer1Data Repository

Layer4User Interface

Filtering&Integration Filtering

Databases

Mining query Mining result

ROLAP Drill down Metadata Repository Data cube pivot Data martcuboid slice Dice

Multiple choice questions

1) _____ routines attempt to fill in missing values (a)Data cleaning (b)Data transformation (c)data integration (d)None of the above

2) Which method is the best for filling missing values? (a)Ignore the tuple (b) Manual filling (c) by global constant (d) use attribute mean

3) What are the methods for handling noisy data? (a)Binning (b)Regression (c)clustering (d)All of the above

4) Rules for examining the data are (a)unique rule (b)consequetive rule (c)null rule (d)All of the above

5) commercial tools for discrepancy detection (a)data scrubbing tools (b)data mining tools (c)data auditing tools (d)All of the

6) Redundancy can be detected by (a)correlation analysis (b)Entity identification problem (c)min-max normalization (d)All of the above

7) Attribute data are scaled to fall within a specified range 0.0. to 1.0 is called as (a)Integration (b)Normalization (c)generalization (d)None of the above

8) Concept hierarchy is also called as (a)Generalization (b)Normalization (c)Integration (d)None of the above

9) smoothing can be performed by (a)Binning (b)clusering (c)regression (d)All of the above

10) Suppose minimum and maximum values are given,we can use ____ normalization (a)Min - max (b)Z- score (c)decimal (d)All of the above 11) Suppose mean and standard deviation of the values are given, we can use __

normalization (a)Min - max (b)Z- score (c)Decimal scaling (d)All of the above

12) suppose recorded values given in a range, we can use ___ normalization. (a)Min - max (b)Z-score (c)Decimal scaling (d)All of the above

13) Attribute construction is also known as (a)Feature construction (b) Feature selection (c) attribute selection (d) none of the above

14) In _______ irrelevant attributes can be detected and removed. (a) Attribute selection (b) Data transformation (c) data integration (d) data cleaning

15) _________ methods can also be applied for data reduction. (a)Data cleaning (b) Data transformation (c) data smoothing (d) All of the above

Review questions

PART – A

1. Define data warehouse2. Define the characteristics of data warehouse3. Define a Data mart?4. Define enterprise data warehouse5. What are the schemas used for multidimensional model?6. What is the disadvantage of star schema?7. What is the disadvantage of snowflake schema?8. What is data warehouse performance issue?9. What is Data Inconsistency Cleaning?10.Merits of Data warehouse.11. What are the characteristics of metadata repository?12.List some of the data warehouse tools.13.Define concept hierarchy.14.Explain OLAP.15.Explain ROLAP.16.Explain MOLAP.17.Explain HOLAP.18.Define specialized SQL server.

PART – B

1. Explain about data warehousing architecture.2. Discuss in detail. OLAP operations.3. Describe about concept hierarchies.4. What are Schemas used for Multidimensional model?5. Discuss about data warehouse implementation.

REFERENCES

1. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", MorganKaufmann Publishers, 2002.2. Alex Berson,Stephen J. Smith, “Data Warehousing, Data Mining,& OLAP”, TataMcGraw- Hill, 2004.3. Usama M.Fayyad, Gregory Piatetsky - Shapiro, Padhrai Smyth And RamasamyUthurusamy, "Advances In Knowledge Discovery And Data Mining", The M.I.TPress, 1996.4. Ralph Kimball, "The Data Warehouse Life Cycle Toolkit", John Wiley & Sons Inc.,1998.5. Sean Kelly, "Data Warehousing In Action", John Wiley & Sons Inc., 1997.

For Further references

http://www.cs.wmich.edu/~yang/teach/cs595/han/ch3.pdf

http://www.slideshare.net/idnats/data-warehousing-and-data-mining-presentation-725476

http://onlineassociate.net/ppt/Data-Warehouse-and-OLAP-Technology-Slide-1/