Introduction to Data Warehousing and OLAP
Outline
• Part I – Introduction
  – OLAP vs OLTP
  – Data Cleaning and Integration
• Part II – Data Models and Warehouse Design
• Part III – Index Structures for Data Warehouses
Types of Data
• Operational Data (OLTP applications)
  – Data that “works”
  – Frequent updates and queries
  – Normalized for efficient search and updates (minimize update anomalies)
  – Fragmented, with local relevance
  – Point queries: queries accessing individual tuples
Types of Data
• Historical Data (OLAP applications)
  – Data that “tells”
  – Very infrequent updates
  – Integrated data set with global relevance
  – Analytical queries that require huge amounts of aggregation
  – Performance issues mainly in query response time (not in updates)
Example OLTP Queries
• What is the salary of Mr. Ali?
• What is the address and phone number of the person in charge of the Supplies department?
• How many employees have received an “excellent” credential in the latest appraisal?
Example OLAP Queries
• How has employee attrition been changing over the years across the company?
• Is there a correlation between the geographical location of a company unit and excellent employee appraisals?
• Is it financially viable to continue our manufacturing unit in Taiwan?
A Data Warehouse
• An infrastructure to manage historical data
• Designed to support OLAP queries involving extensive use of aggregation
• Post-retrieval processing (reporting) is just as complex as the retrieval itself, if not more so
Warehousing Data
[Figure: Several OLTP units supply operational data to a Data Cleaning and Integration stage, which feeds the Data Warehouse.]
Data Marts
• A data warehouse can be seen as a collection of “data marts” – historical data about each OLTP segment that feeds into the warehouse
• Data marts can also be seen as small warehouses supporting OLAP activities within a given segment
Data Cleaning and Integration
[Figure: OLTP databases feed the warehouse through a data cleaning unit (DCU) and a data integration unit (DIU); cleaned data is back-flushed to the OLTP databases as updates/feedback.]
Dirty Data
• Lack of Standardization
  – Multiple encodings, locales, languages…
  – Spurious abbreviations: “Allama Iqbal Road” and “A.I. Road” are the same…
  – Semantic equivalence: “Rawalpindi” is the same as “Pindi”…
  – Multiple standards: 1.6 kilometres is the same as 1 mile…
Dirty Data
• Missing, spurious and duplicate data
  – Missing age field for an employee
  – Spurious (incorrectly entered) sales values
  – Duplication of data sets across OLTP units
  – Semantic duplication (M. A. Khan appearing in another data set as Khan Muhammad Ali)
Dirty Data
• Inconsistencies
  – Incorrect use of codes (use of M/F in addition to 0/1 for gender)
  – Codes with inconsistent or outdated meaning (travel eligibility “C” denoting eligibility to travel only by III class sleeper, which no longer exists)
  – Inconsistent duplicate data (two data sets are found to belong to the same person, but carry two different addresses)
Dirty Data
• Inconsistencies (contd.)
  – Inconsistent associations (sales figures provided by the marketing department do not add up to the total sales figures from the retail units)
  – Semantic inconsistencies (Feb 31st)
  – Referential inconsistency (Rs. 10 lakhs in sales reported from a unit that has been closed down)
Issues in Data Cleaning
• Cannot be fully automated
• GIGO (Garbage In, Garbage Out)
• Requires considerable knowledge that is tacit and beyond the purview of the warehouse (metrics, geography, govt. policies, etc.)
• Complexity increases (usually geometrically) with the number of data sources
• Complexity increases with the history span taken up for cleaning
Steps in Data Cleaning (Rahm and Do [1])
1. Data Analysis: analyze the data set to obtain meta-data and detect dirty data
2. Definition of transformation rules: transform data from its current “dirty” form to the required “clean” form; transformation can be at either the schema level or the data level
3. Rule Verification: verify the transformation rules on test data sets
4. Transformation: execute the transformation rules on the data set
5. Backflow: re-populate the data sources with cleaned data
Data Analysis Techniques (Refs [1],[2])
Problem to be Detected          Meta-data Used
Illegal values                  (max, min), (mean, deviation), cardinality
Spelling mistakes               Hashing, N-gram outliers
Lack of standards               Column comparison (compare value sets from a given column across tables)
Duplicate and missing values    Compare cardinality with #rows, detect nulls, use rules to predict incorrect or missing values
Transformation Algorithms
• Hash-Merge for duplicate elimination (see the sketch below)
  1. Hash tuples into buckets based on a given column (tuples with duplicate values hash into the same bucket)
  2. Merge tuples within each bucket separately
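A minimal Python sketch of the hash-merge idea follows, assuming rows are dicts. The merge rule used here (keep one row per matching name and address, taken from the sorted-neighborhood example later in the deck) is illustrative; the slides leave the exact rule open.

```python
# Minimal sketch of hash-merge duplicate elimination (rows as dicts).
from collections import defaultdict

def hash_merge(rows, column):
    buckets = defaultdict(list)
    for row in rows:
        # Step 1: duplicates on `column` hash into the same bucket.
        buckets[hash(row[column])].append(row)
    merged = []
    for bucket in buckets.values():
        # Step 2: merge within each bucket separately
        # (assumed rule: rows with the same name and address are duplicates).
        seen = {}
        for row in bucket:
            seen.setdefault((row["Name"], row["Address"]), row)
        merged.extend(seen.values())
    return merged

rows = [
    {"Name": "M.A. Khan", "Address": "50, Lvl Rd.", "Dept": "Sales"},
    {"Name": "Saleem",    "Address": "25, LB Rd.",  "Dept": "R&D"},
    {"Name": "M.A. Khan", "Address": "50, Lvl Rd.", "Dept": "PR"},
]
print(hash_merge(rows, "Name"))   # the two M.A. Khan rows collapse into one
```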
Transformation Algorithms
Input table (tuples are hashed on Name; both “M.A. Khan” rows land in the same bucket):

Name        Address        Dept
M.A. Khan   50, Lvl Rd.    Sales
Saleem      25, LB Rd.     R&D
M.A. Khan   50, Lvl Rd.    PR
Rahim       30, Snky Rd.   Products

[Figure: Each Name value is hashed to a key that selects one of the hash buckets.]
Transformation Algorithms
• Sorted Neighborhood Technique for misspelling integration (a sketch follows the example below)
  1. Identify a set of data values within a given row as the key
  2. Sort the table based on the key
  3. Slide a window of n rows over the sorted table and merge data values based on rules (e.g., merge names if all other values like age, address, dept, etc. match)
  4. Make multiple passes until there are no more merges of records
Transformation Algorithms
Unsorted table:

Name        Address        Dept
M.A. Khan   50, Lvl Rd.    Sales
Saleem      25, LB Rd.     R&D
M.A. Khan   50, Lvl Rd.    PR
Rahim       30, Snky Rd.   Products

Sorted on Name:

Name        Address        Dept
M.A. Khan   50, Lvl Rd.    Sales
M.A. Khan   50, Lvl Rd.    PR
Rahim       30, Snky Rd.   Products
Saleem      25, LB Rd.     R&D

Rule: merge rows if name and address match. Window size n = 3.
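The following is a minimal Python sketch of the technique on the table above, assuming dict rows and the slide's rule (merge rows whose name and address match). The "merge" here simply drops the later duplicate; a real implementation would combine field values.

```python
# Minimal sketch of the sorted neighborhood technique.
def sorted_neighborhood(rows, sort_key, n=3):
    rows = sorted(rows, key=sort_key)            # step 2: sort on the key
    merged_something = True
    while merged_something:                      # step 4: repeat until stable
        merged_something = False
        kept = []
        for row in rows:
            window = kept[-(n - 1):]             # step 3: window of n rows
            duplicate = any(k["Name"] == row["Name"] and
                            k["Address"] == row["Address"] for k in window)
            if duplicate:
                merged_something = True          # row merged into earlier one
            else:
                kept.append(row)
        rows = kept
    return rows

rows = [
    {"Name": "M.A. Khan", "Address": "50, Lvl Rd.",  "Dept": "Sales"},
    {"Name": "Saleem",    "Address": "25, LB Rd.",   "Dept": "R&D"},
    {"Name": "M.A. Khan", "Address": "50, Lvl Rd.",  "Dept": "PR"},
    {"Name": "Rahim",     "Address": "30, Snky Rd.", "Dept": "Products"},
]
print(sorted_neighborhood(rows, sort_key=lambda r: r["Name"]))
```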
Transformation Algorithms (Monge-Elkan ’97, [3])

• Graph-based transitive closure to reduce the number of passes (a union-find sketch follows the figure below)
  1. Use the sorted neighborhood technique and sort records based on the identified keys
  2. Create an undirected graph in which nodes correspond to records and edges correspond to the “is a duplicate of” relationship
  3. Records R1 and R2 need not be compared in any pass if they belong to the same connected component
Transformation Algorithms

[Figure: Records R1–R5 as nodes; edges mark “is a duplicate of” relationships, forming connected components.]

Naïve sliding window:            Slide 0: R1, R2, R3   Slide 1: R2, R3, R4   Slide 2: R3, R4, R5
Graph-based transitive closure:  Slide 0: R1, R2, R3   Slide 1: R3, R4, R5

No need to compare R1/R2 with R4/R5.
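The slide describes the graph in terms of connected components; a standard way to maintain these incrementally is union-find, used in the sketch below as an implementation choice (reference [3] does not mandate this particular structure).

```python
# Minimal union-find sketch of the transitive-closure idea: once two records
# are known (directly or transitively) to be duplicates, they fall in the
# same connected component and never need to be compared again.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]      # path halving keeps trees shallow
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)              # record edge "x is a duplicate of y"

def needs_comparison(r1, r2):
    return find(r1) != find(r2)            # same component => skip comparison

# Duplicate pairs discovered during earlier window passes:
for a, b in [("R1", "R2"), ("R2", "R3")]:
    union(a, b)

print(needs_comparison("R1", "R3"))        # False: linked transitively via R2
print(needs_comparison("R1", "R4"))        # True: different components
```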
Integration
• Combining disparate data sources into a single schematic structure
  – Schema Integration: forming an integrated schematic structure from the disparate data sources
  – Data Integration: cleaning and merging data from different sources
Schema Integration
• Consider the following schemata [4]:
Cars (serialNo, model, colour, stereo, glasstint, …)
and
Autos(serienNr, modelle, farbe)
Optionen(serienNr, stereo, glastint, …)
Schema Integration
• Challenges
  – Naming differences
  – Structural differences
  – Data type differences
  – Missing fields
  – Semantic differences
Schema Integration: Generic Architecture of an Integrator

[Figure: Each data source sits behind its own Wrapper / Extractor; the wrappers feed a common Mediator / Constructor.]
Integration
• Wrapper / Extractor
  – Creates a common view across all data sources
  – Bridges differences in naming, type and schema structure
  – Wrappers do not physically extract data from the data sources
• Mediator / Constructor
  – Constructs an integrated schematic structure
  – Performs data integration and populates the data warehouse
Tools for Data Cleaning and Integration
• dfPower
  – From DataFlux Corporation (http://www.dataflux.com/)
  – De-duplication engine
  – Analyzes data based on values and number of occurrences
  – Does not support detection of semantic duplicates based on user-specified rules
  – Permits duplicates to be grouped or merged
Tools for Data Cleaning and Integration
• ETI* Data Cleanse
  – From Evolutionary Technologies Int’l (http://www.evtech.com/)
  – Table-driven data cleaning, matching and quality review, duplicate matching, imprecise spelling correction
  – Supports meta-data repositories to store schemas, transformation rules, interrelationships, etc.
Tools for Data Cleaning and Integration
• SSA Name/Data Clustering Engine
  – From Search Software America (http://www.searchsoftware.co.uk/)
  – Addresses errors in spelling, typing, transcription, nicknames, synonyms, abbreviations, prefix/suffix variations, punctuation, casing, etc.
  – Supports user-specified transformation rules
  – Scales up to 500 million records
Summary
• OLAP versus OLTP
• Characteristics of OLAP queries
• Data Warehousing systems
• Data Cleaning Issues
  – Dirty Data
  – Cleaning Algorithms
• Integration of Data and Schema
Introduction to Data Warehousing and OLAP
Part II: Data Models and Warehouse Design
Example OLAP Queries
• How has employee attrition been changing over the years across the company?
• Is there a correlation between the geographical location of a company unit and excellent employee appraisals?
• Is it financially viable to continue our manufacturing unit in Taiwan?
OLAP Query Characteristics
• Aggregation and summarization over large data sets
• Clustering
• Trend detection
• Multi-dimensional projections
A Typical Warehouse
[Figure: A hypercube core surrounded by materialized views.]
A Typical Warehouse
• Hypercube Core
  – Manages the atomic data elements
  – Global schematic structure for the entire warehouse
  – Based on the multi-dimensional data model
• Materialized Views
  – Physical views for faster aggregate-query answering
  – De-normalization of the core
The Sales Hyper Cube
[Figure: A cube whose axes are the Branch, Product and Week dimensions.]
The Sales Hyper Cube
• “Sales” is the fact
• “Branch”, “Product” and “Week” are dimensions
Operations on Hyper Cubes
• Pivoting: choosing (by rotating the cube on a pivot) a set of dimensions for display
• Slicing-dicing: selecting a subset of the cube
• Roll-up: aggregating a dimension into a coarser one (e.g., roll up the weeks dimension into months)
• Drill-down: opening an aggregated dimension to reveal details (e.g., open up months to reveal week-by-week information)
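A minimal pandas sketch of these operations on a toy sales cube follows. The data values and the week-to-month mapping are illustrative assumptions, not taken from the slides.

```python
# Minimal pandas sketch of hypercube operations on a toy sales table.
import pandas as pd

sales = pd.DataFrame({
    "Branch":  ["Karachi", "Karachi", "Lahore", "Lahore"],
    "Product": ["Tea", "Soap", "Tea", "Soap"],
    "Week":    [1, 5, 1, 5],
    "Sales":   [100, 80, 120, 90],
})
sales["Month"] = (sales["Week"] - 1) // 4 + 1      # roll weeks up into months

rollup = sales.groupby(["Branch", "Product", "Month"])["Sales"].sum()  # roll-up
sliced = sales[sales["Branch"] == "Karachi"]                           # slice
pivot  = sales.pivot_table(index="Branch", columns="Product",
                           values="Sales", aggfunc="sum")              # pivot
print(rollup, sliced, pivot, sep="\n\n")
```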
Implementation of Hyper Cubes
• Multi-dimensional to relational mapping (ROLAP)
  – Map hypercube queries to relational queries and maintain the data cube in a set of RDBMS tables
  – Ex: True Relational OLAP from MicroStrategy Inc. (http://www.microstrategy.com/)
• Native multidimensional model (MOLAP)
  – Use a separate storage model for multidimensional data
  – Ex: Arbor Essbase (http://www.arborsoft.com/)
Physical models: Star
[Figure: Star schema. A central fact table (Brnch, Prod, Wk, Sales) is linked to the Branch, Product and Week dimension tables.]
Star Schema
• Features
  – Central fact table
  – Set of supporting dimension tables
  – Denormalized data storage
• Advantages
  – Simple to comprehend and design
  – Small meta-data
  – Quick query responses
• Limitations
  – Not robust to changes (changes in dimension tables)
  – Enormous amount of redundancy in dimension-table data
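As a concrete illustration, here is a minimal sqlite3 sketch of a star schema like the one in the figure above. All table and column names are illustrative assumptions, not taken from the slides.

```python
# Minimal sqlite3 sketch of a star schema: one fact table, three dimensions.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE branch_dim  (branch_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE week_dim    (week_id INTEGER PRIMARY KEY, month INTEGER, year INTEGER);
CREATE TABLE sales_fact (                    -- central fact table
    branch_id  INTEGER REFERENCES branch_dim(branch_id),
    product_id INTEGER REFERENCES product_dim(product_id),
    week_id    INTEGER REFERENCES week_dim(week_id),
    sales      REAL
);
""")
# A typical OLAP query: total sales per region per month.
print(db.execute("""
    SELECT b.region, w.month, SUM(f.sales)
    FROM sales_fact f
    JOIN branch_dim b ON f.branch_id = b.branch_id
    JOIN week_dim   w ON f.week_id   = w.week_id
    GROUP BY b.region, w.month
""").fetchall())
```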
Physical models: Snowflake
[Figure: Snowflake schema. The fact table (Brnch, Prod, Wk, Sales) links to Branch, Product and Week dimension tables, which are further normalized into Division, Unit, Options and Scheme tables.]
Snowflake Schema
• Features
  – Central fact table
  – Normalized dimension tables storing atomic data units
• Advantages
  – Faster query responses
  – Easy updation
• Limitations
  – Large amount of meta-data
  – May result in too many tables
  – Harder to comprehend manually
Physical models: Constellation
[Figure: Constellation schema. A Sales fact table (Brnch, Prod, Wk, Sales) and a Discounts fact table (Wk, Prod, Sch, Dist) share the Branch, Product and Week dimension tables; the Discounts fact table also links to a Scheme dimension table.]
Constellation
• Most commonly used architecture
• Used when multiple fact tables are needed
• Usually has a “main” fact table and several “auxiliary” fact tables, which are summary tables or materialized views over the main fact table
• Helps answer frequently asked queries faster
• Costlier to update than a snowflake
Issues in Data Cubes

• Curse of high dimensionality
  – Currently known index structures degrade to linear search when the number of dimensions becomes high
• Categorical dimensions
  – In order to run certain algorithms, like clustering, dimensions should belong to ordinal classes
  – Categorical dimensions are difficult to index
• Ordinal changes during aggregation
  – Certain dimensions may change their ordinal property when aggregated, and should be indexed at several levels
  – Ex: student names are ordered lexicographically, but when aggregated into classes, are ordered on their graduation year
The Time Dimension
• Mandatory in most warehouse applications
• Has several meanings and roll-up techniques depending on application context
  – Simple calendar-based rollup
  – Fiscal calendar-based rollup
  – Academic calendar-based rollup …
• Need to separately index special dates like releases, events, etc.
• Order of traversal of the time dimension is important
Materialized Views
• Summary tables that create physical views of the fact table
• Trade-off between faster query answering and increased complexity during updates
• When to materialize? Use the result-to-search-space (RSS) ratio (see the sketch below):
  – RSS = (#rows returned / #rows scanned) for the query
  – Summarize if the RSS ratio is very small and the query is very frequent
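A minimal sketch of this heuristic follows. The threshold values are illustrative assumptions; in practice they would be tuned per workload.

```python
# Minimal sketch of the RSS materialization heuristic.
def should_materialize(rows_returned, rows_scanned, queries_per_day,
                       rss_threshold=0.01, freq_threshold=100):
    rss = rows_returned / rows_scanned      # result-to-search-space ratio
    return rss < rss_threshold and queries_per_day > freq_threshold

# A monthly-totals query returning 12 rows from a 10-million-row fact table,
# asked 500 times a day, is a good candidate for a summary table:
print(should_materialize(12, 10_000_000, 500))   # True
```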
Revision History Table(s)
• Manages data that is revised over time
• Queries select appropriate value based on relevant version
• Usually required for most warehousing applications
Id   Val (turnover per employee)   Revised
1    110,050                       01-01-2000
1    130,045                       01-06-2000
1    140,011                       01-01-2001
Designing a Data Warehouse
[Figure: Design flow. The Enterprise Model is refined into the DW Logical Model (by end-users and the DBA), which is refined into the DW Physical Model (by the DBA with automated tools).]
Enterprise to Warehouse
• Some rules of thumb:
  – The warehouse logical model closely resembles the enterprise model
  – Some transformation is usually necessary from the enterprise model to the warehouse model
  – The warehouse logical model should depict denormalized data sets implicit in the enterprise model
  – Special planning is required for managing the time dimension and revision histories
OLTP to Warehouse Models
• OLTP databases usually organized around the enterprise model
• OLTP schemas provide a good starting point for designing OLAP logical models
OLTP to Warehouse Models
• Some rules of thumb when converting OLTP schemas into OLAP schemas:
  – Look for operational data fields and remove them (Ex: a counter-sales table containing register number and cashier Emp_id)
  – Add a time element (and version elements if necessary) to data sets before populating the warehouse
  – Decide on derived data and summary tables at design time itself
  – Iterate between transformation rule specification, integration and schema design
  – Add the commonly required summary value “ALL” to every domain
Introduction to Data Warehousing and OLAP
Part III: Index structures and query processing
Classes of Dimensions
• Categorical
  – {cats, dogs, sheep, cows, bulls, buffaloes}
• Ordinal
  – Totally ordered (integers) or partially ordered (credentials of a candidate)
• Sparse
  – Small number of data points per value
• Dense
  – Large number of data points per value
Multi-dimensional indexes
• Usually based around ordinal classes
• Different kinds of indexes for sparse and dense data sets
• Performance may depend on the storage structure of the data set
Representing Multi-dimensional Data
• Multi-level sorting
  – Sorts data based on different dimensions, one after the other
  – Simple to implement
  – Searching is fast if the dominant attribute is part of the query
  – Search becomes fragmented if the dominant attribute is omitted from the query

Dim 1   Dim 2   Dim 3
1       34      2
1       34      5
1       34      10
1       56      20
2       45      9
2       49      10
3       69      20
3       69      30
4       23      29
4       23      48
4       40      50
Representing Multi-dimensional Data
• Space filling curves (see the sketch below)
  – Sorted on all attributes at once
  – Location of a data point is easily computable
  – Suffers with an increase in the number of dimensions

[Figure: A space-filling curve threading through a 2-d grid of cells.]
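One common space-filling curve is the Z-order (Morton) curve, sketched below; the slide's figure may depict a different curve, so treat this as an illustrative example. Interleaving the bits of the two coordinates yields a single sort key from which the point's location is directly computable.

```python
# Minimal sketch of a Z-order (Morton) curve for two dimensions.
def z_order(x, y, bits=16):
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x occupies the even bits
        key |= ((y >> i) & 1) << (2 * i + 1)   # y occupies the odd bits
    return key                                 # location directly computable

points = [(2, 0), (0, 1), (1, 1), (0, 0), (1, 0)]
print(sorted(points, key=lambda p: z_order(*p)))
# -> [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0)]
```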
Multi-dimensional Indexes
• Ordered index on multiple attributes
  – Considers a composite key as a tuple of simple keys (k1, k2, …, kn)
  – Ordered index files are maintained by ordering each key in sequence
Multi-dimensional Indexes
• Partitioned Hashing (see the sketch below)
  – Given a composite key (k1, k2, …, kn), partitioned hashing computes n separate hash values, one per component key
  – The hash bucket is determined by concatenating the n values
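A minimal sketch follows, assuming each component key contributes a fixed number of bits to the final bucket number; the bit widths are illustrative.

```python
# Minimal sketch of partitioned hashing on a composite key.
def partitioned_hash(keys, bits_per_key=3):
    bucket = 0
    for k in keys:
        part = hash(k) % (1 << bits_per_key)   # one small hash per component
        bucket = (bucket << bits_per_key) | part
    return bucket

# All records with composite key ("Sales", "A") land in the same bucket, and
# a query fixing only the first key still narrows the search to 2**3 buckets.
print(partitioned_hash(("Sales", "A")))
```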
Multi-dimensional Indexes
• Grid Files
  – Partition the range of key values for each key into several buckets
  – The combinations of buckets, one per key, form a “grid”
  – A grid file stores the grid like any other multi-dimensional data set
Grid Files
[Figure: A grid over (Roll No., Grade). Grades A–D partition one axis; roll numbers partition the other into five ranges: 1: 001–025, 2: 026–050, 3: 051–075, 4: 076–100, 5: 101–125. Each grid cell points into a bucket pool.]
Multi-dimensional Indexes
• Bit-map indexes
  – Used on sparse fields (i.e., fields that have only a small number of distinct values; e.g., gender, grade)
  – A bit vector enumerates all possible values and sets the corresponding bit for each data element
  – Much more compact than other index structures
  – Useful for efficiently answering composite queries over multiple bit-vectored fields
  – Can be integrated with tree indexes
Multi-dimensional Indexes
Encoding bit-map indexes:

Grade = {A, B, C, D, E, F}        Subject = {DB, AI, PDS}
A = 000001                        DB  = 001
B = 000010                        AI  = 010
C = 000100                        PDS = 100
D = 001000                        no value = 000
E = 010000
F = 100000
no value = 000000

Ex: students who scored an A in DB and in AI are found by AND-ing the corresponding bit vectors.
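The following is a minimal sketch of answering a composite query with one bitwise AND, using a Python integer as the bit vector for each attribute value (bit i set means row i carries that value). The (grade, subject) rows are hypothetical.

```python
# Minimal sketch of bitmap-index querying with integer bit vectors.
rows = [("A", "DB"), ("B", "AI"), ("A", "AI"), ("C", "DB"), ("A", "DB")]

bitmaps = {}
for i, (grade, subject) in enumerate(rows):
    bitmaps[("Grade", grade)]     = bitmaps.get(("Grade", grade), 0)     | (1 << i)
    bitmaps[("Subject", subject)] = bitmaps.get(("Subject", subject), 0) | (1 << i)

# Composite query: rows with Grade = A and Subject = DB, via one bitwise AND.
hits = bitmaps[("Grade", "A")] & bitmaps[("Subject", "DB")]
print([i for i in range(len(rows)) if hits >> i & 1])   # -> [0, 4]
```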
Multidimensional Tree Indexes
• KD Trees
  – A binary tree structure that can store n-dimensional data points
  – Each dimension is compared at the appropriate level
  – Useful for point queries
KD Trees

Let data be represented as 2-dimensional points of the form (x, y) representing (salary, age).

Example data set:
(2500, 20), (5000, 32), (4500, 28), (2000, 23), (4800, 25), (1800, 18), (6500, 27)
KD Trees

[Figure: The KD tree obtained by inserting (2500, 20), (5000, 32), (4500, 28), (2000, 23), (4800, 25), (1800, 18) and (6500, 27) in order, splitting alternately on salary and age.]
KD Trees
• Each point divides search space along one of the dimensions
• Structure of the tree (and hence its performance) sensitive to the order of insertion of data points
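A minimal KD-tree sketch for the (salary, age) points above follows. It assumes ties (equal coordinate values) go to the right subtree; the slides do not specify tie-breaking.

```python
# Minimal KD-tree sketch: alternate splitting dimensions level by level.
class Node:
    def __init__(self, point):
        self.point, self.left, self.right = point, None, None

def insert(node, point, depth=0):
    if node is None:
        return Node(point)
    d = depth % 2                                   # 0: salary, 1: age
    if point[d] < node.point[d]:
        node.left = insert(node.left, point, depth + 1)
    else:
        node.right = insert(node.right, point, depth + 1)
    return node

def search(node, point, depth=0):                   # point (exact-match) query
    if node is None or node.point == point:
        return node is not None
    d = depth % 2
    child = node.left if point[d] < node.point[d] else node.right
    return search(child, point, depth + 1)

root = None
for p in [(2500, 20), (5000, 32), (4500, 28), (2000, 23),
          (4800, 25), (1800, 18), (6500, 27)]:
    root = insert(root, p)
print(search(root, (4800, 25)))   # True
```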
Quad Trees
• Initially, the index contains only one bucket, representing the entire space
• If the number of data points in any bucket exceeds the maximum limit, the bucket is split in two along each dimension, and the resulting buckets are added as children of the larger bucket
• When the number of dimensions = 2, splitting results in a quad
R Trees
• Manages regions
• Leaf nodes represent data regions; non-leaf nodes represent virtual (non-data) regions
• A node is split when it contains too many regions
• Insertion of a region begins at the root node and proceeds until the smallest accommodating region is found (possibly by splitting one or more regions)
• Sibling regions may overlap but may not subsume one another
R Trees
[Figure: Nested rectangles showing data regions enclosed within virtual regions.]
R Trees
• Suitable for range, neighborhood and nearness searches
• Tree structure and performance are sensitive to the order in which data regions are added
• Suffers from the curse of high dimensionality
Indexing Categorical Data (Ref [7])
• Categorical Data
  – Have no ordinal relationship
  – Cannot be compared, except for equality
  – Can be represented as sets in many cases
  – Example categorical attributes: team members of a given project, ingredients for a given recipe, products manufactured by a unit, etc.
  – Comparison operators on sets: equality, membership, superset, subset
Signatures
• Represent a set as a bitmap where each bit corresponds to an object in a larger universe of discourse (UoD)

UoD = {set of all ingredients}
S, T ⊆ UoD : ingredients for two recipes
s, t : corresponding bit maps of S and T

Queries:  S ⊆ T  iff  s ∧ ¬t = 0
          S ⊇ T  iff  t ∧ ¬s = 0
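A minimal sketch of these signature tests follows, using one Python integer per set; the ingredient universe is a made-up example.

```python
# Minimal sketch of set signatures and the subset test s AND NOT t == 0.
universe = ["salt", "sugar", "flour", "milk", "eggs", "butter"]
bit = {u: 1 << i for i, u in enumerate(universe)}

def signature(items):
    sig = 0
    for item in items:
        sig |= bit[item]          # one bit per object in the UoD
    return sig

s = signature({"flour", "milk"})
t = signature({"flour", "milk", "eggs"})
print(s & ~t == 0)   # True:  S is a subset of T
print(t & ~s == 0)   # False: T is not a subset of S
```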
Signature Trees
• Leaf nodes contain (signature, data pointer) pairs
• Non-leaf nodes are formed by bit-wise OR-ing of their children
• Traverse the tree by AND-ing the query signature with the node signature

            1111
           /    \
        1100    1011
        /  \    /  \
     1000 0100 1001 0011
Extensible Signature Hashing
• Hash tables are constructed based on the d most significant bits of the signature
• Hash levels are extended by increasing d whenever an overflow occurs
Extensible Signature Hashing
[Figure: Extensible signature hashing with global depth n = 3. Directory entries 000–111 point to four buckets: one for hash values starting with 00 (local depth d = 2), one for values starting with 010 (d = 3), one for values starting with 011 (d = 3), and one for values starting with 1 (d = 1).]
Summary
• The OLAP Hypercube
• Materialized views
• ROLAP and MOLAP implementations
• Star, Snowflake and Constellation schemas
• Time dimensions and revision tables
• Rules of thumb for OLAP design
• Multi-dimensional index structures
Furthermore…
• Topics not addressed for reasons of brevity:
  – Query language constructs
  – Data mining over warehouses
  – Handling semi-structured data in warehouses
  – Performance tuning
  – Maintenance of materialized views
  – Browsing and visualization
  – …
Thank You
References
1. E. Rahm, H. H. Do. Data Cleaning: Problems and Current Approaches. Bulletin of the Technical Committee on Data Engineering, IEEE Computer Society, Vol. 23, No. 4, Dec 2000.
2. V. T. Raisinghani. Cleaning Methods in Data Warehousing. PhD seminar report, IIT Bombay, Dec 1999.
3. A. Monge, C. Elkan. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. In Proceedings of the SIGMOD 1997 Workshop on Data Mining and Knowledge Discovery, May 1997. http://citeseer.ist.psu.edu/monge97efficient.html
References
4. H. Garcia-Molina, J. D. Ullman, J. Widom. Database Systems: The Complete Book. Pearson Education, 2004.
5. R. Agrawal, A. Gupta, S. Sarawagi. Modeling Multidimensional Databases. ICDE, 1997.
6. O. Guenther. Data Warehouses and Data Mining. Course notes, Humboldt University, Berlin. http://www.wiwi.hu-berlin.de/~guenther/DW/dw_ws03.html
7. S. Helmer, G. Moerkotte. A Study of Four Index Structures for Set-Valued Attributes of Low Cardinality. Reihe Informatik 2, University of Mannheim, 1999.
Conferences and Workshops
• DaWaK: Data Warehousing and Knowledge Discovery (http://www.dexa.org/)
• VLDB: Very Large Databases (http://www.vldb.org/)
• EDBT: Extending Database Technology (http://www.edbt.org/)
• DOLAP: ACM International Workshop on Data Warehousing and OLAP (http://www.cis.drexel.edu/faculty/song/dolap.html)
Some WWW Links

• DW Infocenter (http://www.dwinfocenter.org/)
• Data.com (http://www.data.com/)
• The Data Warehousing Institute (http://www.dw-institute.com/)
• KDNuggets, a comprehensive portal on knowledge discovery (http://www.kdnuggets.com/)
• Oracle Data Warehousing Tutorial (registration required) (http://www.oracle.com/technology/idevelop/online/courses/oln/how_to04.html)
• Data Warehouse: Online Resources (http://www.dci.com/news/datawarehouse/articles/1998/05/links.htm)
• Data Warehousing and OLAP bibliography (http://www.ondelette.com/OLAP/dwbib.html)