Introduction to Data Warehousing and OLAP
Outline
• Part I – Introduction
  – OLAP vs OLTP
  – Data Cleaning and Integration
• Part II – Data Models and Warehouse Design
• Part III – Index Structures for Data Warehouses
Types of Data
• Operational Data (OLTP applications)
  – Data that “works”
  – Frequent updates and queries
  – Normalized for efficient search and updates (minimize update anomalies)
  – Fragmented, with local relevance
  – Point queries: queries accessing individual tuples
Types of Data
• Historical Data (OLAP applications)
  – Data that “tells”
  – Very infrequent updates
  – Integrated data set with global relevance
  – Analytical queries that require huge amounts of aggregation
  – Performance issues mainly in query response time (not in updates)
Example OLTP Queries
• What is the salary of Mr. Ali?
• What is the address and phone number of the person in charge of the Supplies department?
• How many employees have received an “excellent” credential in the latest appraisal?
Example OLAP Queries
• How has employee attrition been changing over the years across the company?
• Is there a correlation between the geographical location of a company unit and excellent employee appraisals?
• Is it financially viable to continue our manufacturing unit in Taiwan?
A Data Warehouse
• An infrastructure to manage historical data
• Designed to support OLAP queries involving extensive use of aggregation
• Post-retrieval processing (reporting) is just as complex as the retrieval itself, if not more so
Warehousing Data
[Figure: Several OLTP units supply operational data to a Data Cleaning and Integration stage, which feeds the Data Warehouse.]
Data Marts
• A data warehouse can be seen as a collection of “data marts” – historical data about each OLTP segment that feeds into the warehouse
• Data marts can also be seen as small warehouses supporting OLAP activities within a given segment
Data Cleaning and Integration
[Figure: OLTP databases feed the warehouse through a data cleaning unit (DCU) and a data integration unit (DIU); cleaned data is back-flushed to the OLTP databases as updates/feedback.]
Dirty Data
• Lack of Standardization
  – Multiple encodings, locales, languages…
  – Spurious abbreviations: “Allama Iqbal Road” and “A.I. Road” are the same…
  – Semantic equivalence: “Rawalpindi” is the same as “Pindi”…
  – Multiple standards: 1.6 kilometres is the same as 1 mile…
Dirty Data
• Missing, spurious and duplicate data
  – Missing age field for an employee
  – Spurious (incorrectly entered) sales values
  – Duplication of data sets across OLTP units
  – Semantic duplication (M. A. Khan appearing in another data set as Khan Muhammad Ali)
Dirty Data
• Inconsistencies
  – Incorrect use of codes (use of M/F in addition to 0/1 for gender)
  – Codes with inconsistent or outdated meaning (travel eligibility “C” denoting eligibility to travel only by III class sleeper, which no longer exists)
  – Inconsistent duplicate data (two data sets are found to belong to the same person, but carry two different addresses)
Dirty Data
• Inconsistencies (contd.)
  – Inconsistent associations (sales figures provided by the marketing department do not add up to the total sales figures from the retail units)
  – Semantic inconsistencies (Feb 31st)
  – Referential inconsistency (Rs. 10 lakhs in sales reported from a unit that has been closed down)
Issues in Data Cleaning
• Cannot be fully automated
• GIGO (Garbage In, Garbage Out)
• Requires considerable knowledge that is tacit and beyond the purview of the warehouse (metrics, geography, govt. policies, etc.)
• Complexity increases (usually geometrically) with the number of data sources
• Complexity increases with the history span taken up for cleaning
Steps in Data Cleaning (Rahm and Do [1])
1. Data Analysis: analyze the data set to obtain meta-data and detect dirty data
2. Definition of transformation rules: transform data from its current “dirty” form to the required “clean” form; transformation can be at either the schema level or the data level
3. Rule Verification: verify the transformation rules on test data sets
4. Transformation: execute the transformation rules on the data set
5. Backflow: re-populate the data sources with cleaned data
Data Analysis Techniques (Refs [1],[2])
Problem to be Detected          Meta-data Used
Illegal values                  (max, min), (mean, deviation), cardinality
Spelling mistakes               Hashing, N-gram outliers
Lack of standards               Column comparison (compare value sets from a given column across tables)
Duplicate and missing values    Compare cardinality with #rows, detect nulls, use rules to predict incorrect or missing values
Transformation Algorithms
• Hash-Merge for duplicate elimination (see the sketch below)
  1. Hash tuples into buckets based on a given column (tuples with duplicate values hash into the same bucket)
  2. Merge tuples within each bucket separately
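A minimal Python sketch of the hash-merge idea follows, assuming rows are dicts. The merge rule used here (keep one row per matching name and address, taken from the sorted-neighborhood example later in the deck) is illustrative; the slides leave the exact rule open.

```python
# Minimal sketch of hash-merge duplicate elimination (rows as dicts).
from collections import defaultdict

def hash_merge(rows, column):
    buckets = defaultdict(list)
    for row in rows:
        # Step 1: duplicates on `column` hash into the same bucket.
        buckets[hash(row[column])].append(row)
    merged = []
    for bucket in buckets.values():
        # Step 2: merge within each bucket separately
        # (assumed rule: rows with the same name and address are duplicates).
        seen = {}
        for row in bucket:
            seen.setdefault((row["Name"], row["Address"]), row)
        merged.extend(seen.values())
    return merged

rows = [
    {"Name": "M.A. Khan", "Address": "50, Lvl Rd.", "Dept": "Sales"},
    {"Name": "Saleem",    "Address": "25, LB Rd.",  "Dept": "R&D"},
    {"Name": "M.A. Khan", "Address": "50, Lvl Rd.", "Dept": "PR"},
]
print(hash_merge(rows, "Name"))   # the two M.A. Khan rows collapse into one
```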
Transformation Algorithms
Input table (tuples are hashed on Name; both “M.A. Khan” rows land in the same bucket):

Name        Address        Dept
M.A. Khan   50, Lvl Rd.    Sales
Saleem      25, LB Rd.     R&D
M.A. Khan   50, Lvl Rd.    PR
Rahim       30, Snky Rd.   Products

[Figure: Each Name value is hashed to a key that selects one of the hash buckets.]
Transformation Algorithms
• Sorted Neighborhood Technique for misspelling integration (a sketch follows the example below)
  1. Identify a set of data values within a given row as the key
  2. Sort the table based on the key
  3. Slide a window of n rows over the sorted table and merge data values based on rules (e.g., merge names if all other values like age, address, dept, etc. match)
  4. Make multiple passes until there are no more merges of records
Transformation Algorithms
Unsorted table:

Name        Address        Dept
M.A. Khan   50, Lvl Rd.    Sales
Saleem      25, LB Rd.     R&D
M.A. Khan   50, Lvl Rd.    PR
Rahim       30, Snky Rd.   Products

Sorted on Name:

Name        Address        Dept
M.A. Khan   50, Lvl Rd.    Sales
M.A. Khan   50, Lvl Rd.    PR
Rahim       30, Snky Rd.   Products
Saleem      25, LB Rd.     R&D

Rule: merge rows if name and address match. Window size n = 3.
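The following is a minimal Python sketch of the technique on the table above, assuming dict rows and the slide's rule (merge rows whose name and address match). The "merge" here simply drops the later duplicate; a real implementation would combine field values.

```python
# Minimal sketch of the sorted neighborhood technique.
def sorted_neighborhood(rows, sort_key, n=3):
    rows = sorted(rows, key=sort_key)            # step 2: sort on the key
    merged_something = True
    while merged_something:                      # step 4: repeat until stable
        merged_something = False
        kept = []
        for row in rows:
            window = kept[-(n - 1):]             # step 3: window of n rows
            duplicate = any(k["Name"] == row["Name"] and
                            k["Address"] == row["Address"] for k in window)
            if duplicate:
                merged_something = True          # row merged into earlier one
            else:
                kept.append(row)
        rows = kept
    return rows

rows = [
    {"Name": "M.A. Khan", "Address": "50, Lvl Rd.",  "Dept": "Sales"},
    {"Name": "Saleem",    "Address": "25, LB Rd.",   "Dept": "R&D"},
    {"Name": "M.A. Khan", "Address": "50, Lvl Rd.",  "Dept": "PR"},
    {"Name": "Rahim",     "Address": "30, Snky Rd.", "Dept": "Products"},
]
print(sorted_neighborhood(rows, sort_key=lambda r: r["Name"]))
```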
Transformation Algorithms (Monge-Elkan ’97, [3])

• Graph-based transitive closure to reduce the number of passes (a union-find sketch follows the figure below)
  1. Use the sorted neighborhood technique and sort records based on the identified keys
  2. Create an undirected graph in which nodes correspond to records and edges correspond to the “is a duplicate of” relationship
  3. Records R1 and R2 need not be compared in any pass if they belong to the same connected component
Transformation Algorithms

[Figure: Records R1–R5 as nodes; edges mark “is a duplicate of” relationships, forming connected components.]

Naïve sliding window:            Slide 0: R1, R2, R3   Slide 1: R2, R3, R4   Slide 2: R3, R4, R5
Graph-based transitive closure:  Slide 0: R1, R2, R3   Slide 1: R3, R4, R5

No need to compare R1/R2 with R4/R5.
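The slide describes the graph in terms of connected components; a standard way to maintain these incrementally is union-find, used in the sketch below as an implementation choice (reference [3] does not mandate this particular structure).

```python
# Minimal union-find sketch of the transitive-closure idea: once two records
# are known (directly or transitively) to be duplicates, they fall in the
# same connected component and never need to be compared again.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]      # path halving keeps trees shallow
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)              # record edge "x is a duplicate of y"

def needs_comparison(r1, r2):
    return find(r1) != find(r2)            # same component => skip comparison

# Duplicate pairs discovered during earlier window passes:
for a, b in [("R1", "R2"), ("R2", "R3")]:
    union(a, b)

print(needs_comparison("R1", "R3"))        # False: linked transitively via R2
print(needs_comparison("R1", "R4"))        # True: different components
```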
Integration
• Combining disparate data sources into a single schematic structure
  – Schema Integration: forming an integrated schematic structure from the disparate data sources
  – Data Integration: cleaning and merging data from different sources
Schema Integration
• Consider the following schemata [4]:
Cars (serialNo, model, colour, stereo, glasstint, …)
and
Autos(serienNr, modelle, farbe)
Optionen(serienNr, stereo, glastint, …)
Schema Integration
• Challenges
  – Naming differences
  – Structural differences
  – Data type differences
  – Missing fields
  – Semantic differences
Schema Integration: Generic Architecture of an Integrator

[Figure: Each data source sits behind its own Wrapper / Extractor; the wrappers feed a common Mediator / Constructor.]
Integration
• Wrapper / Extractor
  – Creates a common view across all data sources
  – Bridges differences in naming, type and schema structure
  – Wrappers do not physically extract data from the data sources
• Mediator / Constructor
  – Constructs an integrated schematic structure
  – Performs data integration and populates the data warehouse
Tools for Data Cleaning and Integration
• dfPower
  – From DataFlux Corporation (http://www.dataflux.com/)
  – De-duplication engine
  – Analyzes data based on values and number of occurrences
  – Does not support detection of semantic duplicates based on user-specified rules
  – Permits duplicates to be grouped or merged
Tools for Data Cleaning and Integration
• ETI* Data Cleanse
  – From Evolutionary Technologies Int’l (http://www.evtech.com/)
  – Table-driven data cleaning, matching and quality review, duplicate matching, imprecise spelling correction
  – Supports meta-data repositories to store schemas, transformation rules, interrelationships, etc.
Tools for Data Cleaning and Integration
• SSA Name/Data Clustering Engine
  – From Search Software America (http://www.searchsoftware.co.uk/)
  – Addresses errors in spelling, typing, transcription, nicknames, synonyms, abbreviations, prefix/suffix variations, punctuation, casing, etc.
  – Supports user-specified transformation rules
  – Scales up to 500 million records
Summary
• OLAP versus OLTP
• Characteristics of OLAP queries
• Data Warehousing systems
• Data Cleaning Issues
  – Dirty Data
  – Cleaning Algorithms
• Integration of Data and Schema
Introduction to Data Warehousing and OLAP
Part II: Data Models and Warehouse Design
Example OLAP Queries
• How has employee attrition been changing over the years across the company?
• Is there a correlation between the geographical location of a company unit and excellent employee appraisals?
• Is it financially viable to continue our manufacturing unit in Taiwan?
OLAP Query Characteristics
• Aggregation and summarization over large data sets
• Clustering
• Trend detection
• Multi-dimensional projections
A Typical Warehouse
[Figure: A hypercube core surrounded by materialized views.]
A Typical Warehouse
• Hypercube Core
  – Manages the atomic data elements
  – Global schematic structure for the entire warehouse
  – Based on the multi-dimensional data model
• Materialized Views
  – Physical views for faster aggregate-query answering
  – De-normalization of the core
The Sales Hyper Cube
[Figure: A cube whose axes are the Branch, Product and Week dimensions.]
The Sales Hyper Cube
• “Sales” is the fact
• “Branch”, “Product” and “Week” are dimensions
Operations on Hyper Cubes
• Pivoting: choosing (by rotating the cube on a pivot) a set of dimensions for display
• Slicing-dicing: selecting a subset of the cube
• Roll-up: aggregating a dimension into a coarser one (e.g., roll up the weeks dimension into months)
• Drill-down: opening an aggregated dimension to reveal details (e.g., open up months to reveal week-by-week information)
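A minimal pandas sketch of these operations on a toy sales cube follows. The data values and the week-to-month mapping are illustrative assumptions, not taken from the slides.

```python
# Minimal pandas sketch of hypercube operations on a toy sales table.
import pandas as pd

sales = pd.DataFrame({
    "Branch":  ["Karachi", "Karachi", "Lahore", "Lahore"],
    "Product": ["Tea", "Soap", "Tea", "Soap"],
    "Week":    [1, 5, 1, 5],
    "Sales":   [100, 80, 120, 90],
})
sales["Month"] = (sales["Week"] - 1) // 4 + 1      # roll weeks up into months

rollup = sales.groupby(["Branch", "Product", "Month"])["Sales"].sum()  # roll-up
sliced = sales[sales["Branch"] == "Karachi"]                           # slice
pivot  = sales.pivot_table(index="Branch", columns="Product",
                           values="Sales", aggfunc="sum")              # pivot
print(rollup, sliced, pivot, sep="\n\n")
```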
Implementation of Hyper Cubes
• Multi-dimensional to relational mapping (ROLAP)
  – Map hypercube queries to relational queries and maintain the data cube in a set of RDBMS tables
  – Ex: True Relational OLAP from MicroStrategy Inc. (http://www.microstrategy.com/)
• Native multidimensional model (MOLAP)
  – Use a separate storage model for multidimensional data
  – Ex: Arbor Essbase (http://www.arborsoft.com/)
Physical models: Star
[Figure: Star schema. A central fact table (Brnch, Prod, Wk, Sales) is linked to the Branch, Product and Week dimension tables.]
Star Schema
• Features
  – Central fact table
  – Set of supporting dimension tables
  – Denormalized data storage
• Advantages
  – Simple to comprehend and design
  – Small meta-data
  – Quick query responses
• Limitations
  – Not robust to changes (changes in dimension tables)
  – Enormous amount of redundancy in dimension-table data
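As a concrete illustration, here is a minimal sqlite3 sketch of a star schema like the one in the figure above. All table and column names are illustrative assumptions, not taken from the slides.

```python
# Minimal sqlite3 sketch of a star schema: one fact table, three dimensions.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE branch_dim  (branch_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE week_dim    (week_id INTEGER PRIMARY KEY, month INTEGER, year INTEGER);
CREATE TABLE sales_fact (                    -- central fact table
    branch_id  INTEGER REFERENCES branch_dim(branch_id),
    product_id INTEGER REFERENCES product_dim(product_id),
    week_id    INTEGER REFERENCES week_dim(week_id),
    sales      REAL
);
""")
# A typical OLAP query: total sales per region per month.
print(db.execute("""
    SELECT b.region, w.month, SUM(f.sales)
    FROM sales_fact f
    JOIN branch_dim b ON f.branch_id = b.branch_id
    JOIN week_dim   w ON f.week_id   = w.week_id
    GROUP BY b.region, w.month
""").fetchall())
```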
Physical models: Snowflake
[Figure: Snowflake schema. The fact table (Brnch, Prod, Wk, Sales) links to Branch, Product and Week dimension tables, which are further normalized into Division, Unit, Options and Scheme tables.]
Snowflake Schema
• Features
  – Central fact table
  – Normalized dimension tables storing atomic data units
• Advantages
  – Faster query responses
  – Easy updation
• Limitations
  – Large amount of meta-data
  – May result in too many tables
  – Harder to comprehend manually
Physical models: Constellation
[Figure: Constellation schema. A Sales fact table (Brnch, Prod, Wk, Sales) and a Discounts fact table (Wk, Prod, Sch, Dist) share the Branch, Product and Week dimension tables; the Discounts fact table also links to a Scheme dimension table.]
Constellation
• Most commonly used architecture
• Used when multiple fact tables are needed
• Usually has a “main” fact table and several “auxiliary” fact tables, which are summary tables or materialized views over the main fact table
• Helps answer frequently asked queries faster
• Costlier to update than a snowflake
Issues in Data Cubes

• Curse of high dimensionality
  – Currently known index structures degrade to linear search when the number of dimensions becomes high
• Categorical dimensions
  – In order to run certain algorithms, like clustering, dimensions should belong to ordinal classes
  – Categorical dimensions are difficult to index
• Ordinal changes during aggregation
  – Certain dimensions may change their ordinal property when aggregated, and should be indexed at several levels
  – Ex: student names are ordered lexicographically, but when aggregated into classes, are ordered on their graduation year
The Time Dimension
• Mandatory in most warehouse applications
• Has several meanings and roll-up techniques depending on application context
  – Simple calendar-based rollup
  – Fiscal calendar-based rollup
  – Academic calendar-based rollup …
• Need to separately index special dates like releases, events, etc.
• Order of traversal of the time dimension is important
Materialized Views
• Summary tables that create physical views of the fact table
• Trade-off between faster query answering and increased complexity during updates
• When to materialize? Use the result-to-search-space (RSS) ratio (see the sketch below):
  – RSS = (#rows returned / #rows scanned) for the query
  – Summarize if the RSS ratio is very small and the query is very frequent
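A minimal sketch of this heuristic follows. The threshold values are illustrative assumptions; in practice they would be tuned per workload.

```python
# Minimal sketch of the RSS materialization heuristic.
def should_materialize(rows_returned, rows_scanned, queries_per_day,
                       rss_threshold=0.01, freq_threshold=100):
    rss = rows_returned / rows_scanned      # result-to-search-space ratio
    return rss < rss_threshold and queries_per_day > freq_threshold

# A monthly-totals query returning 12 rows from a 10-million-row fact table,
# asked 500 times a day, is a good candidate for a summary table:
print(should_materialize(12, 10_000_000, 500))   # True
```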
Revision History Table(s)
• Manages data that is revised over time
• Queries select appropriate value based on relevant version
• Usually required for most warehousing applications
Id   Val (turnover per employee)   Revised
1    110,050                       01-01-2000
1    130,045                       01-06-2000
1    140,011                       01-01-2001
Designing a Data Warehouse
[Figure: Design flow. The Enterprise Model is refined into the DW Logical Model (by end-users and the DBA), which is refined into the DW Physical Model (by the DBA with automated tools).]
Enterprise to Warehouse
• Some rules of thumb:
  – The warehouse logical model closely resembles the enterprise model
  – Some transformation is usually necessary from the enterprise model to the warehouse model
  – The warehouse logical model should depict denormalized data sets implicit in the enterprise model
  – Special planning is required for managing the time dimension and revision histories
OLTP to Warehouse Models
• OLTP databases usually organized around the enterprise model
• OLTP schemas provide a good starting point for designing OLAP logical models
OLTP to Warehouse Models
• Some rules of thumb when converting OLTP schemas into OLAP schemas:
  – Look for operational data fields and remove them (Ex: a counter-sales table containing register number and cashier Emp_id)
  – Add a time element (and version elements if necessary) to data sets before populating the warehouse
  – Decide on derived data and summary tables at design time itself
  – Iterate between transformation rule specification, integration and schema design
  – Add the commonly required summary value “ALL” to every domain
Introduction to Data Warehousing and OLAP
Part III: Index structures and query processing
Classes of Dimensions
• Categorical
  – {cats, dogs, sheep, cows, bulls, buffaloes}
• Ordinal
  – Totally ordered (integers) or partially ordered (credentials of a candidate)
• Sparse
  – Small number of data points per value
• Dense
  – Large number of data points per value
Multi-dimensional indexes
• Usually based around ordinal classes
• Different kinds of indexes for sparse and dense data sets
• Performance may depend on the storage structure of the data set
Representing Multi-dimensional Data
• Multi-level sorting
  – Sorts data based on different dimensions, one after the other
  – Simple to implement
  – Searching is fast if the dominant attribute is part of the query
  – Search becomes fragmented if the dominant attribute is omitted from the query

Dim 1   Dim 2   Dim 3
1       34      2
1       34      5
1       34      10
1       56      20
2       45      9
2       49      10
3       69      20
3       69      30
4       23      29
4       23      48
4       40      50
Representing Multi-dimensional Data
• Space filling curves (see the sketch below)
  – Sorted on all attributes at once
  – Location of a data point is easily computable
  – Suffers with an increase in the number of dimensions

[Figure: A space-filling curve threading through a 2-d grid of cells.]
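One common space-filling curve is the Z-order (Morton) curve, sketched below; the slide's figure may depict a different curve, so treat this as an illustrative example. Interleaving the bits of the two coordinates yields a single sort key from which the point's location is directly computable.

```python
# Minimal sketch of a Z-order (Morton) curve for two dimensions.
def z_order(x, y, bits=16):
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x occupies the even bits
        key |= ((y >> i) & 1) << (2 * i + 1)   # y occupies the odd bits
    return key                                 # location directly computable

points = [(2, 0), (0, 1), (1, 1), (0, 0), (1, 0)]
print(sorted(points, key=lambda p: z_order(*p)))
# -> [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0)]
```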
Multi-dimensional Indexes
• Ordered index on multiple attributes
  – Considers a composite key as a tuple of simple keys (k1, k2, …, kn)
  – Ordered index files are maintained by ordering each key in sequence
Multi-dimensional Indexes
• Partitioned Hashing (see the sketch below)
  – Given a composite key (k1, k2, …, kn), partitioned hashing computes n separate hash values, one per component key
  – The hash bucket is determined by concatenating the n values
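A minimal sketch follows, assuming each component key contributes a fixed number of bits to the final bucket number; the bit widths are illustrative.

```python
# Minimal sketch of partitioned hashing on a composite key.
def partitioned_hash(keys, bits_per_key=3):
    bucket = 0
    for k in keys:
        part = hash(k) % (1 << bits_per_key)   # one small hash per component
        bucket = (bucket << bits_per_key) | part
    return bucket

# All records with composite key ("Sales", "A") land in the same bucket, and
# a query fixing only the first key still narrows the search to 2**3 buckets.
print(partitioned_hash(("Sales", "A")))
```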
Multi-dimensional Indexes
• Grid Files
  – Partition the range of key values for each key into several buckets
  – The combinations of buckets, one per key, form a “grid”
  – A grid file stores the grid like any other multi-dimensional data set
Grid Files
[Figure: A grid over (Roll No., Grade). Grades A–D partition one axis; roll numbers partition the other into five ranges: 1: 001–025, 2: 026–050, 3: 051–075, 4: 076–100, 5: 101–125. Each grid cell points into a bucket pool.]
Multi-dimensional Indexes
• Bit-map indexes
  – Used on sparse fields (i.e., fields that have only a small number of distinct values; e.g., gender, grade)
  – A bit vector enumerates all possible values and sets the corresponding bit for each data element
  – Much more compact than other index structures
  – Useful for efficiently answering composite queries over multiple bit-vectored fields
  – Can be integrated with tree indexes
Multi-dimensional Indexes
Encoding bit-map indexes:

Grade = {A, B, C, D, E, F}        Subject = {DB, AI, PDS}
A = 000001                        DB  = 001
B = 000010                        AI  = 010
C = 000100                        PDS = 100
D = 001000                        no value = 000
E = 010000
F = 100000
no value = 000000

Ex: students who scored an A in DB and in AI are found by AND-ing the corresponding bit vectors.
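The following is a minimal sketch of answering a composite query with one bitwise AND, using a Python integer as the bit vector for each attribute value (bit i set means row i carries that value). The (grade, subject) rows are hypothetical.

```python
# Minimal sketch of bitmap-index querying with integer bit vectors.
rows = [("A", "DB"), ("B", "AI"), ("A", "AI"), ("C", "DB"), ("A", "DB")]

bitmaps = {}
for i, (grade, subject) in enumerate(rows):
    bitmaps[("Grade", grade)]     = bitmaps.get(("Grade", grade), 0)     | (1 << i)
    bitmaps[("Subject", subject)] = bitmaps.get(("Subject", subject), 0) | (1 << i)

# Composite query: rows with Grade = A and Subject = DB, via one bitwise AND.
hits = bitmaps[("Grade", "A")] & bitmaps[("Subject", "DB")]
print([i for i in range(len(rows)) if hits >> i & 1])   # -> [0, 4]
```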
Multidimensional Tree Indexes
• KD Trees
  – A binary tree structure that can store n-dimensional data points
  – Each dimension is compared at the appropriate level
  – Useful for point queries
KD Trees

Let data be represented as 2-dimensional points of the form (x, y) representing (salary, age).

Example data set:
(2500, 20), (5000, 32), (4500, 28), (2000, 23), (4800, 25), (1800, 18), (6500, 27)
KD Trees

[Figure: The KD tree obtained by inserting (2500, 20), (5000, 32), (4500, 28), (2000, 23), (4800, 25), (1800, 18) and (6500, 27) in order, splitting alternately on salary and age.]
KD Trees
• Each point divides search space along one of the dimensions
• Structure of the tree (and hence its performance) sensitive to the order of insertion of data points
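A minimal KD-tree sketch for the (salary, age) points above follows. It assumes ties (equal coordinate values) go to the right subtree; the slides do not specify tie-breaking.

```python
# Minimal KD-tree sketch: alternate splitting dimensions level by level.
class Node:
    def __init__(self, point):
        self.point, self.left, self.right = point, None, None

def insert(node, point, depth=0):
    if node is None:
        return Node(point)
    d = depth % 2                                   # 0: salary, 1: age
    if point[d] < node.point[d]:
        node.left = insert(node.left, point, depth + 1)
    else:
        node.right = insert(node.right, point, depth + 1)
    return node

def search(node, point, depth=0):                   # point (exact-match) query
    if node is None or node.point == point:
        return node is not None
    d = depth % 2
    child = node.left if point[d] < node.point[d] else node.right
    return search(child, point, depth + 1)

root = None
for p in [(2500, 20), (5000, 32), (4500, 28), (2000, 23),
          (4800, 25), (1800, 18), (6500, 27)]:
    root = insert(root, p)
print(search(root, (4800, 25)))   # True
```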
Quad Trees
• Initially, the index contains only one bucket, representing the entire space
• If the number of data points in any bucket exceeds the maximum limit, the bucket is split in two along each dimension, and the resulting buckets are added as children of the larger bucket
• When the number of dimensions = 2, splitting results in a quad
R Trees
• Manages regions
• Leaf nodes represent data regions; non-leaf nodes represent virtual (non-data) regions
• A node is split when it contains too many regions
• Insertion of a region begins at the root node and proceeds until the smallest accommodating region is found (possibly by splitting one or more regions)
• Sibling regions may overlap but may not subsume one another
R Trees
[Figure: Nested rectangles showing data regions enclosed within virtual regions.]
R Trees
• Suitable for range, neighborhood and nearness searches
• Tree structure and performance are sensitive to the order in which data regions are added
• Suffers from the curse of high dimensionality
Indexing Categorical Data (Ref [7])
• Categorical Data
  – Have no ordinal relationship
  – Cannot be compared, except for equality
  – Can be represented as sets in many cases
  – Example categorical attributes: team members of a given project, ingredients for a given recipe, products manufactured by a unit, etc.
  – Comparison operators on sets: equality, membership, superset, subset
Signatures
• Represent a set as a bitmap where each bit corresponds to an object in a larger universe of discourse (UoD)

UoD = {set of all ingredients}
S, T ⊆ UoD : ingredients for two recipes
s, t : corresponding bit maps of S and T

Queries:  S ⊆ T  iff  s ∧ ¬t = 0
          S ⊇ T  iff  t ∧ ¬s = 0
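A minimal sketch of these signature tests follows, using one Python integer per set; the ingredient universe is a made-up example.

```python
# Minimal sketch of set signatures and the subset test s AND NOT t == 0.
universe = ["salt", "sugar", "flour", "milk", "eggs", "butter"]
bit = {u: 1 << i for i, u in enumerate(universe)}

def signature(items):
    sig = 0
    for item in items:
        sig |= bit[item]          # one bit per object in the UoD
    return sig

s = signature({"flour", "milk"})
t = signature({"flour", "milk", "eggs"})
print(s & ~t == 0)   # True:  S is a subset of T
print(t & ~s == 0)   # False: T is not a subset of S
```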
Signature Trees
• Leaf nodes contain (signature, data pointer) pairs
• Non-leaf nodes are formed by bit-wise OR-ing of their children
• Traverse the tree by AND-ing the query signature with the node signature

            1111
           /    \
        1100    1011
        /  \    /  \
     1000 0100 1001 0011
Extensible Signature Hashing
• Hash tables are constructed based on the d most significant bits of the signature
• Hash levels are extended by increasing d whenever an overflow occurs
Extensible Signature Hashing
[Figure: Extensible signature hashing with global depth n = 3. Directory entries 000–111 point to four buckets: one for hash values starting with 00 (local depth d = 2), one for values starting with 010 (d = 3), one for values starting with 011 (d = 3), and one for values starting with 1 (d = 1).]
Summary
• The OLAP Hypercube
• Materialized views
• ROLAP and MOLAP implementations
• Star, Snowflake and Constellation schemas
• Time dimensions and revision tables
• Rules of thumb for OLAP design
• Multi-dimensional index structures
Furthermore…
• Topics not addressed for reasons of brevity:
  – Query language constructs
  – Data mining over warehouses
  – Handling semi-structured data in warehouses
  – Performance tuning
  – Maintenance of materialized views
  – Browsing and visualization
  – …
Thank You
References
1. E. Rahm, H. H. Do. Data Cleaning: Problems and Current Approaches. Bulletin of the Technical Committee on Data Engineering, IEEE Computer Society, Vol. 23, No. 4, Dec 2000.
2. V. T. Raisinghani. Cleaning Methods in Data Warehousing. PhD seminar report, IIT Bombay, Dec 1999.
3. A. Monge, C. Elkan. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. In Proceedings of the SIGMOD 1997 Workshop on Data Mining and Knowledge Discovery, May 1997. http://citeseer.ist.psu.edu/monge97efficient.html
References
4. H. Garcia-Molina, J. D. Ullman, J. Widom. Database Systems: The Complete Book. Pearson Education, 2004.
5. R. Agrawal, A. Gupta, S. Sarawagi. Modeling Multidimensional Databases. ICDE, 1997.
6. O. Guenther. Data Warehouses and Data Mining. Course notes, Humboldt University, Berlin. http://www.wiwi.hu-berlin.de/~guenther/DW/dw_ws03.html
7. S. Helmer, G. Moerkotte. A Study of Four Index Structures for Set-Valued Attributes of Low Cardinality. Reihe Informatik 2, University of Mannheim, 1999.
Conferences and Workshops
• DaWaK: Data Warehousing and Knowledge Discovery (http://www.dexa.org/)
• VLDB: Very Large Databases (http://www.vldb.org/)
• EDBT: Extending Database Technology (http://www.edbt.org/)
• DOLAP: ACM International Workshop on Data Warehousing and OLAP (http://www.cis.drexel.edu/faculty/song/dolap.html)
Some WWW Links

• DW Infocenter (http://www.dwinfocenter.org/)
• Data.com (http://www.data.com/)
• The Data Warehousing Institute (http://www.dw-institute.com/)
• KDNuggets, a comprehensive portal on knowledge discovery (http://www.kdnuggets.com/)
• Oracle Data Warehousing Tutorial (registration required) (http://www.oracle.com/technology/idevelop/online/courses/oln/how_to04.html)
• Data Warehouse: Online Resources (http://www.dci.com/news/datawarehouse/articles/1998/05/links.htm)
• Data Warehousing and OLAP bibliography (http://www.ondelette.com/OLAP/dwbib.html)