itcs 6163 data warehousing xintao wu. history 60s c. bachman ge network data model late 60s ibm ims...
TRANSCRIPT
ITCS 6163 Data Warehousing
Xintao Wu
History
60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model 80s SQL IBM R transaction J. Gray Late 80s-90s DB2, Oracle, informix, sybase 90s- DW, internet Turing award and Turing test?
Evolution of Database Technology(See Fig. 1.1)
1960s: Data collection, database creation, IMS and network
DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s—2000s: Data mining and data warehousing, multimedia
databases, and Web databases
Can You Easily Answer These Questions?
What are Personnel
Services costs across all
departments for all funding sources?
What are the effects of
outsourcing specific services?
What is the correlation between
expenditures and collection of
delinquent taxes?
What is the impact on revenues and
expenditures of changing the operating
hours of the Dept. of Motor Vehicles?
What is the economic impact of the small business initiative in
our district?
Overview: Data Warehousing and OLAP Technology for Data Mining
What is a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
What is a Warehouse?
Collection of diverse data subject oriented aimed at executive, decision maker often a copy of operational data with value-added data (e.g., summaries, history) integrated time-varying non-volatile
more
What is a Warehouse?
Collection of tools gathering data cleansing, integrating, ... querying, reporting, analysis data mining monitoring, administering warehouse
What is a Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from the organization’s operational database
Support information processing by providing a solid platform of consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. InmonData warehousing:
The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
Organized around major subjects, such as
customer, product, sales.
Focusing on the modeling and analysis of data
for decision makers, not on daily operations or
transaction processing.
Provide a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process.
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous data sources relational databases, flat files, on-line
transaction records
Data cleaning and data integration techniques are applied. Ensure consistency in naming conventions,
encoding structures, attribute measures, etc. among different data sources E.g., Hotel price: currency, tax, breakfast covered,
etc. When data is moved to the warehouse, it is
converted.
Data Warehouse—Time Variant
The time horizon for the data warehouse is significantly longer than that of operational systems. Operational database: current value data. Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse Contains an element of time, explicitly or implicitly But the key of operational data may or may not
contain “time element”.
Data Warehouse—Non-Volatile
A physically separate store of data transformed
from the operational environment.
Operational update of data does not occur in
the data warehouse environment. Does not require transaction processing, recovery,
and concurrency control mechanisms
Requires only two operations in data accessing:
initial loading of data and access of data.
Warehouse is specialized DB
Mostly updatesMany small transactionsMb-Tb of dataCurrent snapshotRaw dataClerical users
Mostly readsQueries are long, complexGb-Tb of dataHistorySummarized, consolidated dataDecision-makers, analysts as users
Standard DB Warehouse
Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration: Build wrappers/mediators on top of heterogeneous databases Query driven approach
When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set
Complex information filtering, compete for resources
Data warehouse: update-driven, high performance
Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing) Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing) Major task of data warehouse system Data analysis and decision making
Distinct features (OLTP vs. OLAP): User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated Database design: ER + application vs. star + subject View: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date detailed, flat relational isolated
historical, summarized, multidimensional integrated, consolidated
usage repetitive ad-hoc
access read/write index/hash on prim. key
lots of scans
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response
Overview: Data Warehousing and OLAP Technology for Data Mining
What a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
Why Separate Data Warehouse?
High performance for both systems DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery Warehouse—tuned for OLAP: complex OLAP
queries, multidimensional view, consolidation.
Different functions and different data: missing data: Decision support requires historical
data which operational DBs do not typically maintain
data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Warehouse Architecture
Client Client
Warehouse
Source Source Source
Query & Analysis
Integration
Metadata
Why a Warehouse?
Two Approaches: Query-Driven (Eager) Warehouse (Lazy)
Source Source
?
Query-Driven Approach
Client Client
Wrapper Wrapper Wrapper
Mediator
Source Source Source
Advantages of Warehousing
High query performanceQueries not visible outside warehouseLocal processing at sources unaffectedCan operate when sources unavailableCan query data not stored in a DBMSExtra information at warehouse Modify, summarize (store aggregates) Add historical information
Advantages of Query-Driven
No need to copy data less storage no need to purchase data
More up-to-date dataQuery needs can be unknownOnly query interface needed at sources
Overview: Data Warehousing and OLAP Technology for Data Mining
What a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
Modeling OLTP SystemsGoal -- Update as many transactions as possible in the shortest period of timeApproachModel to 3rd Normal Form (3NF)Minimize redundancy to optimize update
ResultCreate many (hundreds) of tablesDifficult for business users to understand and useRetrieval requires many JOINs = lousy performance
Modeling the Data Warehouse
Tuning the relational modelDenormalize
– Reduces the number of tables
– Improves usability
– Improves performanceAdd aggregate data (typically separate tables)
– Improves performance
– Degrades usability
Modeling the Data Warehouse
“Entity relation data models are a disaster for querying because they cannot be understood by users and they cannot be navigated usefully by DBMS software. Entity relation models cannot be used as the basis for enterprise data warehouses.”
Ralph Kimball, The Data Warehouse Toolkit,
1996, John Wiley & Sons, Inc., ISBN 0-471-15337-0
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
Cube: A Lattice of Cuboids
all
time item location supplier
time,item time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,location
time,item,supplier
time,location,supplier
item,location,supplier
time, item, location, supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions &
measures Star schema: A fact table in the middle connected to a set
of dimension tables Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation
Example of Star Schema
time_keydayday_of_the_weekmonthquarteryear
time
location_keystreetcityprovince_or_streetcountry
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_keyitem_namebrandtypesupplier_type
item
branch_keybranch_namebranch_type
branch
Example of Snowflake Schema
time_keydayday_of_the_weekmonthquarteryear
time
location_keystreetcity_key
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_keyitem_namebrandtypesupplier_key
item
branch_keybranch_namebranch_type
branch
supplier_keysupplier_type
supplier
city_keycityprovince_or_streetcountry
city
Example of Fact Constellation
time_keydayday_of_the_weekmonthquarteryear
time
location_keystreetcityprovince_or_streetcountry
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_keyitem_namebrandtypesupplier_type
item
branch_keybranch_namebranch_type
branch
Shipping Fact Table
time_key
item_key
shipper_key
from_location
to_location
dollars_cost
units_shipped
shipper_keyshipper_namelocation_keyshipper_type
shipper
Multidimensional DataSales volume as a function of product, month, and region
Pro
duct
Regio
n
Month
Dimensions: Product, Location, TimeHierarchical summarization paths
Industry Region Year
Category Country Quarter
Product City Month Week
Office Day
A Sample Data Cube
Total annual salesof TV in U.S.A.Date
Produ
ct
Cou
ntr
ysum
sum TV
VCRPC
1Qtr 2Qtr 3Qtr 4Qtr
U.S.A
Canada
Mexico
sum
Cuboids Corresponding to the Cube
all
product date country
product,date product,country date, country
product, date, country
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D(base) cuboid
Typical OLAP Operations
Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up from higher level summary to lower level summary or
detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes.
Other operations drill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
Relational Operators
SelectProjectJoin
Aggregates
sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4
• Add up amounts by day• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
ans date sum1 812 48
Another Example
sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4
• Add up amounts by day, product• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId
sale prodId date amtp1 1 62p2 1 19p1 2 48
drill-down
rollup
Aggregates
Operators: sum, count, max, min, median, avg
Type Distributive Algebraic holistic
“Having” clauseUsing dimension hierarchy average by region (within store) maximum by month (within date)
Cube Aggregation
day 2c1 c2 c3
p1 44 4p2 c1 c2 c3
p1 12 50p2 11 8
day 1
c1 c2 c3p1 56 4 50p2 11 8
c1 c2 c3p1 67 12 50
c1p1 110p2 19
129
. . .
drill-down
rollup
Example: computing sums
Aggregation Using Hierarchies
day 2c1 c2 c3
p1 44 4p2 c1 c2 c3
p1 12 50p2 11 8
day 1
region A region Bp1 56 54p2 11 8
customer
region
country
(customer c1 in Region A;customers c2, c3 in Region B)
Pivoting
sale prodId storeId date amtp1 c1 1 12p2 c1 1 11p1 c3 1 50p2 c2 1 8p1 c1 2 44p1 c2 2 4
day 2c1 c2 c3
p1 44 4p2 c1 c2 c3
p1 12 50p2 11 8
day 1
Multi-dimensional cube:Fact table view:
c1 c2 c3p1 56 4 50p2 11 8
Overview: Data Warehousing and OLAP Technology for Data Mining
What a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
Design of a Data Warehouse: A Business Analysis Framework
Four views regarding the design of a data warehouse
Top-down view allows selection of the relevant information necessary
for the data warehouse Data source view
exposes the information being captured, stored, and managed by operational systems
Data warehouse view consists of fact tables and dimension tables
Business query view sees the perspectives of data in the warehouse from
the view of end-user
Data Warehouse Design Process
Top-down, bottom-up approaches or a combination of both
Top-down: Starts with overall design and planning (mature) Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view Waterfall: structured and systematic analysis at each step
before proceeding to the next Spiral: rapid generation of increasingly functional systems,
short turn around time, quick turn around
Typical data warehouse design process Choose a business process to model, e.g., orders, invoices, etc. Choose the grain (atomic level of data) of the business process Choose the dimensions that will apply to each fact table record Choose the measure that will populate each fact table record
Multi-Tiered ArchitectureMulti-Tiered Architecture
DataWarehouse
ExtractTransformLoadRefresh
OLAP Engine
AnalysisQueryReportsData mining
Monitor&
IntegratorMetadata
Data Sources Front-End Tools
Serve
Data Marts
Operational DBs
other
sources
Data Storage
OLAP Server
Three Data Warehouse Models
Enterprise warehouse collects all of the information about subjects spanning
the entire organization
Data Mart a subset of corporate-wide data that is of value to a
specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse A set of views over operational databases Only some of the possible summary views may be
materialized
What is a Data Mart ?
A data mart is a small-scale data warehouse that is focused on a single department or single subject area to provide a subset of data warehouse data to address specific reporting and analysis requirements.
Budget
Finance
HR
Purch-asing
AssetMgmt.
Info.Tech.
Smaller warehouses Spans part of organization Do not require enterprise-wide consensus
but long term integration problems?
Data Warehouse Development: A Recommended Approach
Define a high-level corporate data model
Data Mart
Data Mart
Distributed Data Marts
Multi-Tier Data Warehouse
Enterprise Data Warehouse
Model refinementModel refinement
OLAP Server Architectures
Relational OLAP (ROLAP) ROLAP - provides a Multi-dimensional view of a relational DB (e.g.
MicroStrategy) Use relational or extended-relational DBMS to store and manage warehouse data
and OLAP middle ware to support missing pieces Include optimization of DBMS backend, implementation of aggregation
navigation logic, and additional tools and services greater scalability
Multidimensional OLAP (MOLAP) Array-based multidimensional storage engine (sparse matrix techniques) fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP) User flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers specialized support for SQL queries over star/snowflake schemas
MOLAP Databases Data is stored using a proprietary
format(MOLAP) Accessible only through the DB vendor’s tools Suitable only for summarized data Data may be summarized in advance or real-time Examples:
PowerPlay Holos Essbase
RDBMS: Indexing Strategies
Select columns to be indexed: Choose combinations of columns most often used to
constrain queries (“where …” clause) Queries must use constraining columns in the same
order as the columns in the index
Unique more efficient than non-unique.
More indexes means faster query performance, but also longer transformation/load times.
Types of Indexes: B-tree -- many possible values (e.g., invoice number) Bitmap -- few possible values (e.g., M/F, S/M/D/W)
MOLAP versus ROLAP
MOLAPMultidimensional OLAPData stored in multi-dimensional cubeTransformation requiredData retrieved directly from cube for analysisFaster analytical processingCube size limitations
ROLAPRelational OLAPData stored in relational database as virtual cubeNo transformation neededData retrieved via SQL from database for analysisSlower analytical processingNo size limitations
Data Warehouse Usage
Three kinds of data warehouse applications Information processing
supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
Analytical processing multidimensional analysis of data warehouse data supports basic OLAP operations, slice-dice, drilling,
pivoting Data mining
knowledge discovery from hidden patterns supports associations, constructing analytical models,
performing classification and prediction, and presenting the mining results using visualization tools.
Differences among the three tasks
Data Mining & Forecasting
Mining the Warehouse Choose data population Select mining technique Segment data into
groups Identify data patterns
Forecasting Data Select trend data Choose forecast model Run forecast Display predictions
Accessing & Analyzing Data
Query & Reporting … retrieving data directly from the warehouse and preparing it for presentationOnline Analytical Processing (OLAP) … analyzing aggregated data from a variety of perspectivesData Mining & Forecasting … analyzing and predicting data using mathematical models
Query & ReportingQuery the Data ... Select & filter data Retrieve results
Report the Results ... Sort & group data Format & present data
Save or Export Data Save queries & reports Export to other tools Publish HTML pages
Query & Reporting Tools
Cognos ImpromptuBusiness ObjectsCrystal InfoBrioQueryIQGQLSAS
Online Analytical Processing
Slice and Dice ... Select dimensions Choose measures Filter by dimensions
Drill Down ... Drill down
hierarchies Drill through to
details
Present the Results Present as
spreadsheet Display graphically
OLAP Tools
Cognos PowerPlayBusiness AnalyzerHolosBrioAnalyzerMicrostrategyOracle ExpressSASArbor Essbase
Data Mining & Forecasting
Mining the Warehouse Choose data population Select mining technique Segment data into
groups Identify data patterns
Forecasting Data Select trend data Choose forecast model Run forecast Display predictions
Data Mining
tran1 cust33 p2, p5, p8tran2 cust45 p5, p8, p11tran3 cust12 p1, p9tran4 cust40 p5, p8, p11tran5 cust12 p2, p9tran6 cust12 p9
transactio
n
id custo
mer
id products
bought
salesrecords:
• Trend: Products p5, p8 often bough together• Trend: Customer 12 likes product p9
Mining and Forecasting Tools
Scenario4ThoughtBusiness MinerClementineDarwinHolosSAS
Data Warehouse Back-End Tools and UtilitiesData extraction: get data from multiple, heterogeneous, and external
sourcesData cleaning: detect errors in the data and rectify them when
possibleData transformation: convert data from legacy or host format to
warehouse formatLoad: sort, summarize, consolidate, compute views, check
integrity, and build indicies and partitionsRefresh propagate the updates from the data sources to the
warehouse
Data Cleaning
• Of primordial interest in the warehouse creation
• One of the biggest problems• Difficult to achieve• Probability of one or many of the
sources containing “dirty data” is high.
• Lots of manual intervention
Data Cleaning Problems
Data quality problems
Single Source Multi-source
Schema level Instance level
Schema level Instance level
(poor schema design) (data entry errors) (heterogeneity) (overlapping contradicting data)
.Uniqueness .Misspellings Naming conflictsInconsistent
aggregation
Multisource problems
All the previous problems +Schema differences (translation and integration) E.g.: EmpID, CID, Sex= M/F, Sex=0/1
Instance level conflicts Duplicate records, contradicting records Different measures ($, Euros) Different aggregation levels (weeks,
quarters)
Overview: Data Warehousing and OLAP Technology for Data Mining
What a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehouse to data mining
Data Mining: A KDD Process
Data mining: the core of knowledge discovery process.
Data Cleaning
Data Integration
Databases
Data Warehouse
Task-relevant Data
Selection
Data Mining
Pattern Evaluation
Steps of a KDD Process
Learning the application domain: relevant prior knowledge and goals of application
Creating a target data set: data selectionData cleaning and preprocessing: (may take 60% of effort!)Data reduction and transformation:
Find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining summarization, classification, regression, association,
clustering.Choosing the mining algorithm(s)Data mining: search for patterns of interestPattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
Data Mining and Business Intelligence
Increasing potentialto supportbusiness decisions End User
Business Analyst
DataAnalyst
DBA
MakingDecisions
Data Presentation
Visualization Techniques
Data MiningInformation Discovery
Data Exploration
OLAP, MDA
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
Data SourcesPaper, Files, Information Providers, Database Systems, OLTP
From OLAP to On Line Analytical Mining (OLAM)
Why online analytical mining? High quality of data in data warehouses
DW contains integrated, consistent, cleaned data Available information processing structure surrounding
data warehouses ODBC, OLEDB, Web accessing, service facilities,
reporting and OLAP tools OLAP-based exploratory data analysis
mining with drilling, dicing, pivoting, etc. On-line selection of data mining functions
integration and swapping of multiple mining functions, algorithms, and tasks.
Architecture of OLAM
An OLAM Architecture
Data Warehouse
Meta Data
MDDB
OLAMEngine
OLAPEngine
User GUI API
Data Cube API
Database API
Data cleaning
Data integration
Layer3
OLAP/OLAM
Layer2
MDDB
Layer1
Data Repository
Layer4
User Interface
Filtering&Integration Filtering
Databases
Mining query Mining result
Summary
Data warehouse A subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-making process
A multi-dimensional model of a data warehouse
Star schema, snowflake schema, fact constellations A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivotingOLAP servers: ROLAP, MOLAP, HOLAPFrom OLAP to OLAM