![Page 1: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/1.jpg)
ITCS 6163 Data Warehousing
Xintao Wu
History
60s: C. Bachman, GE, network data model
Late 60s: IBM IMS, hierarchical data model
70: E. Codd, relational model
80s: SQL, IBM System R, transactions (J. Gray)
Late 80s-90s: DB2, Oracle, Informix, Sybase
90s-: DW, internet
Turing award and Turing test?
Evolution of Database Technology (See Fig. 1.1)
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s-2000s: Data mining and data warehousing, multimedia databases, and Web databases
Can You Easily Answer These Questions?
What are Personnel Services costs across all departments for all funding sources?
What are the effects of outsourcing specific services?
What is the correlation between expenditures and collection of delinquent taxes?
What is the impact on revenues and expenditures of changing the operating hours of the Dept. of Motor Vehicles?
What is the economic impact of the small business initiative in our district?
Overview: Data Warehousing and OLAP Technology for Data Mining
What is a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
What is a Warehouse?
Collection of diverse data
– subject oriented
– aimed at executive, decision maker
– often a copy of operational data
– with value-added data (e.g., summaries, history)
– integrated
– time-varying
– non-volatile
What is a Warehouse?
Collection of tools
– gathering data
– cleansing, integrating, ...
– querying, reporting, analysis
– data mining
– monitoring, administering warehouse
What is a Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from the organization’s operational database
Support information processing by providing a solid platform of consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
Data warehousing: the process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
Organized around major subjects, such as customer, product, sales.
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous data sources
– relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
– Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
– E.g., hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
– Operational database: current value data.
– Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain “time element”.
Data Warehouse—Non-Volatile
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
– Does not require transaction processing, recovery, and concurrency control mechanisms
– Requires only two operations in data accessing: initial loading of data and access of data.
Warehouse is specialized DB
| Standard DB | Warehouse |
|---|---|
| Mostly updates | Mostly reads |
| Many small transactions | Queries are long, complex |
| MB-TB of data | GB-TB of data |
| Current snapshot | History |
| Raw data | Summarized, consolidated data |
| Clerical users | Decision-makers, analysts as users |

Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration:
– Build wrappers/mediators on top of heterogeneous databases
– Query-driven approach: when a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
– Complex information filtering, compete for resources
Data warehouse: update-driven, high performance
– Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
– Major task of data warehouse system
– Data analysis and decision making
Distinct features (OLTP vs. OLAP):
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP

| | OLTP | OLAP |
|---|---|---|
| users | clerk, IT professional | knowledge worker |
| function | day to day operations | decision support |
| DB design | application-oriented | subject-oriented |
| data | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated |
| usage | repetitive | ad-hoc |
| access | read/write, index/hash on prim. key | lots of scans |
| unit of work | short, simple transaction | complex query |
| # records accessed | tens | millions |
| # users | thousands | hundreds |
| DB size | 100MB-GB | 100GB-TB |
| metric | transaction throughput | query throughput, response |

Overview: Data Warehousing and OLAP Technology for Data Mining
What is a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
Why Separate Data Warehouse?
High performance for both systems
– DBMS, tuned for OLTP: access methods, indexing, concurrency control, recovery
– Warehouse, tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
Different functions and different data:
– missing data: Decision support requires historical data which operational DBs do not typically maintain
– data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
– data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Warehouse Architecture
[Figure: Clients at the top reach the Warehouse (with its Metadata) through a Query & Analysis layer; an Integration layer below the Warehouse draws from several Sources.]
Why a Warehouse?
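Two Approaches:
– Query-Driven (Lazy): data is fetched from the sources only when a query arrives
– Warehouse (Eager): data is integrated in advance, before queries arrive
[Figure: two Sources and a question mark standing for how the client reaches them.]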
Query-Driven Approach
[Figure: Clients query a Mediator, which reaches each Source through its own Wrapper.]
Advantages of Warehousing
High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse
– Modify, summarize (store aggregates)
– Add historical information
Advantages of Query-Driven
No need to copy data
– less storage
– no need to purchase data
More up-to-date data
Query needs can be unknown
Only query interface needed at sources
Overview: Data Warehousing and OLAP Technology for Data Mining
What is a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
Modeling OLTP Systems
Goal: update as many transactions as possible in the shortest period of time
Approach
– Model to 3rd Normal Form (3NF)
– Minimize redundancy to optimize update
Result
– Creates many (hundreds of) tables
– Difficult for business users to understand and use
– Retrieval requires many JOINs = lousy performance
Modeling the Data Warehouse
Tuning the relational model
Denormalize
– Reduces the number of tables
– Improves usability
– Improves performance
Add aggregate data (typically separate tables)
– Improves performance
– Degrades usability
Modeling the Data Warehouse
“Entity relation data models are a disaster for querying because they cannot be understood by users and they cannot be navigated usefully by DBMS software. Entity relation models cannot be used as the basis for enterprise data warehouses.”
Ralph Kimball, The Data Warehouse Toolkit,
1996, John Wiley & Sons, Inc., ISBN 0-471-15337-0
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
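The lattice can be enumerated mechanically: ignoring concept hierarchies, every subset of the dimensions is one cuboid, so n dimensions give 2^n cuboids. The following is a minimal sketch of that enumeration; the function name `cuboids` is a hypothetical helper introduced only for illustration.

```python
from itertools import combinations


def cuboids(dimensions):
    """Group every subset of the dimensions by its size (the cuboid's dimensionality)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


dims = ["time", "item", "location", "supplier"]
lattice = cuboids(dims)

for k, group in lattice.items():
    print(f"{k}-D: {len(group)} cuboid(s)")

# Total number of cuboids in the lattice (2 ** 4 for four dimensions).
total = sum(len(group) for group in lattice.values())
print("total:", total)
```

The empty tuple at level 0 plays the role of the apex cuboid ("all"), and the single 4-tuple at level 4 is the base cuboid.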
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
– Star schema: A fact table in the middle connected to a set of dimension tables
– Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
– Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Example of Star Schema
Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
Dimension tables
– time (time_key, day, day_of_the_week, month, quarter, year)
– item (item_key, item_name, brand, type, supplier_type)
– branch (branch_key, branch_name, branch_type)
– location (location_key, street, city, province_or_state, country)
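A star schema like this one can be tried out in any relational engine. The sketch below uses Python's built-in `sqlite3` module with an in-memory database and, for brevity, only two of the four dimensions. The table names `time_dim`, `item_dim`, and `sales_fact`, the reduced column lists, and all sample rows are assumptions made for illustration, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, quarter TEXT, year INTEGER
);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT
);
-- The fact table holds foreign keys to each dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 5, 1, 'Q1', 2000);
INSERT INTO time_dim VALUES (2, 9, 4, 'Q2', 2000);
INSERT INTO item_dim VALUES (10, 'TV-A', 'Acme', 'TV');
INSERT INTO item_dim VALUES (11, 'PC-B', 'Acme', 'PC');
INSERT INTO sales_fact VALUES (1, 10, 2, 800.0);
INSERT INTO sales_fact VALUES (1, 11, 1, 1200.0);
INSERT INTO sales_fact VALUES (2, 10, 3, 1200.0);
""")

# A typical star join: one join per dimension, then group by dimension attributes.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    JOIN item_dim AS i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()

for row in rows:
    print(row)
```

Each query touches the fact table once and joins outward to the dimensions it needs, which is the access pattern the star layout is meant to keep simple.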
Example of Snowflake Schema
Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
Dimension tables
– time (time_key, day, day_of_the_week, month, quarter, year)
– item (item_key, item_name, brand, type, supplier_key)
– supplier (supplier_key, supplier_type)
– branch (branch_key, branch_name, branch_type)
– location (location_key, street, city_key)
– city (city_key, city, province_or_state, country)
Example of Fact Constellation
Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
– Keys: time_key, item_key, shipper_key, from_location, to_location
– Measures: dollars_cost, units_shipped
Dimension tables
– time (time_key, day, day_of_the_week, month, quarter, year)
– item (item_key, item_name, brand, type, supplier_type)
– branch (branch_key, branch_name, branch_type)
– location (location_key, street, city, province_or_state, country)
– shipper (shipper_key, shipper_name, location_key, shipper_type)
Multidimensional Data
Sales volume as a function of product, month, and region
[Figure: cube with axes Product, Region, Month]
Dimensions: Product, Location, Time
Hierarchical summarization paths
– Product: Industry, Category, Product
– Location: Region, Country, City, Office
– Time: Year, Quarter, Month / Week, Day
A Sample Data Cube
[Figure: 3-D cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A, Canada, Mexico, sum); the highlighted cell is the total annual sales of TV in U.S.A.]
Cuboids Corresponding to the Cube
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: (product, date); (product, country); (date, country)
3-D (base) cuboid: (product, date, country)
Typical OLAP Operations
Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
– from higher level summary to lower level summary or detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes.
Other operations
– drill across: involving (across) more than one fact table
– drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
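Slice and dice can be sketched as plain selections over fact rows. This is a minimal illustration, not a real OLAP engine: the helper names `dice` and `slice_`, the dimension names, and the toy rows are all assumptions introduced here.

```python
# Toy fact rows: (product, customer, day, amount). Values are illustrative only.
FACTS = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]
DIMS = ("product", "customer", "day")


def dice(facts, **allowed):
    """Keep rows whose value on each named dimension is in the allowed set."""
    checks = [(DIMS.index(name), set(values)) for name, values in allowed.items()]
    return [row for row in facts if all(row[i] in vals for i, vals in checks)]


def slice_(facts, dim, value):
    """A slice fixes one dimension to a single value."""
    return dice(facts, **{dim: [value]})


day_two = slice_(FACTS, "day", 2)
sub_cube = dice(FACTS, product=["p1"], customer=["c1", "c3"])
print(day_two)
print(sub_cube)
```

A slice is treated here as the special case of a dice with one dimension restricted to one value, which matches the "select" reading on the slide.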
Relational Operators
Select
Project
Join
Aggregates
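sale:

| prodId | storeId | date | amt |
|---|---|---|---|
| p1 | c1 | 1 | 12 |
| p2 | c1 | 1 | 11 |
| p1 | c3 | 1 | 50 |
| p2 | c2 | 1 | 8 |
| p1 | c1 | 2 | 44 |
| p1 | c2 | 2 | 4 |

• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

ans:

| date | sum |
|---|---|
| 1 | 81 |
| 2 | 48 |
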
Another Example
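sale:

| prodId | storeId | date | amt |
|---|---|---|---|
| p1 | c1 | 1 | 12 |
| p2 | c1 | 1 | 11 |
| p1 | c3 | 1 | 50 |
| p2 | c2 | 1 | 8 |
| p1 | c1 | 2 | 44 |
| p1 | c2 | 2 | 4 |

• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

Result:

| prodId | date | amt |
|---|---|---|
| p1 | 1 | 62 |
| p2 | 1 | 19 |
| p1 | 2 | 48 |

Going from the per-day totals to this per-day, per-product result is a drill-down; going back is a rollup.
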
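Both GROUP BY queries on the sale table can be checked with a short script. This is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database; the choice of SQLite and the column types are assumptions, and the second query also selects prodId so that each output row identifies its product.

```python
import sqlite3

# Rows of the sale table: (prodId, storeId, date, amt).
SALE_ROWS = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", SALE_ROWS)

# Coarser grouping: one total per day.
by_date = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Finer grouping (drill-down): one total per day and product.
by_date_prod = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

print(by_date)
print(by_date_prod)
```

Summing the finer result over prodId reproduces the coarser one, which is the rollup direction.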
Aggregates
Operators: sum, count, max, min, median, avg
Type
– Distributive
– Algebraic
– Holistic
“Having” clause
Using dimension hierarchy
– average by region (within store)
– maximum by month (within date)
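The three aggregate types differ in whether a result can be assembled from per-partition results. The sketch below illustrates this on two partitions of amounts (the numbers are illustrative): sum is distributive, avg is algebraic because it follows from sum and count, and median is holistic because partial medians do not combine into the overall median.

```python
from statistics import median

# Two partitions of the data, e.g. the amounts recorded on two different days.
partitions = [[12, 11, 50, 8], [44, 4]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of per-partition sums is the overall sum.
total = sum(sum(part) for part in partitions)

# Algebraic: avg is derived from two distributive aggregates (sum and count).
count = sum(len(part) for part in partitions)
average = total / count

# Holistic: combining per-partition medians does not give the overall median.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(total, average, median_of_medians, true_median)
```

This is why pre-computed cuboids are straightforward to reuse for sum, count, min, max, and avg, while median generally needs the underlying detail rows.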
Cube Aggregation
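Example: computing sums

day 1:

| | c1 | c2 | c3 |
|---|---|---|---|
| p1 | 12 | | 50 |
| p2 | 11 | 8 | |

day 2:

| | c1 | c2 | c3 |
|---|---|---|---|
| p1 | 44 | 4 | |
| p2 | | | |

Rollup over day:

| | c1 | c2 | c3 |
|---|---|---|---|
| p1 | 56 | 4 | 50 |
| p2 | 11 | 8 | |

Rollup over day and product:

| c1 | c2 | c3 |
|---|---|---|
| 67 | 12 | 50 |

Rollup over day and the c1-c3 dimension:

| | sum |
|---|---|
| p1 | 110 |
| p2 | 19 |

Rollup over everything: 129

Each step toward the total is a rollup; each step back toward the per-day tables is a drill-down.
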
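The sums in this example can be reproduced by aggregating the fact rows while keeping only a chosen subset of dimensions. This is a minimal sketch; the helper name `rollup` and the dimension label "customer" for the c1-c3 column (which the sale table calls storeId) are assumptions.

```python
from collections import defaultdict

# Fact rows: (product, customer, day, amount).
FACTS = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]
DIMS = ("product", "customer", "day")


def rollup(facts, keep):
    """Sum amounts, keeping only the dimensions named in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for row in facts:
        out[tuple(row[i] for i in idx)] += row[-1]
    return dict(out)


by_prod_cust = rollup(FACTS, ("product", "customer"))  # day rolled up
by_cust = rollup(FACTS, ("customer",))                 # day and product rolled up
by_prod = rollup(FACTS, ("product",))                  # day and customer rolled up
grand_total = rollup(FACTS, ())                        # apex: everything rolled up

print(by_prod_cust)
print(by_cust)
print(by_prod)
print(grand_total)
```

Because sum is distributive, each coarser result could also be computed from the next finer one instead of from the raw rows.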
Aggregation Using Hierarchies
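day 1:

| | c1 | c2 | c3 |
|---|---|---|---|
| p1 | 12 | | 50 |
| p2 | 11 | 8 | |

day 2:

| | c1 | c2 | c3 |
|---|---|---|---|
| p1 | 44 | 4 | |
| p2 | | | |

Rolled up to region (both days):

| | region A | region B |
|---|---|---|
| p1 | 56 | 54 |
| p2 | 11 | 8 |

Hierarchy: customer, region, country
(customer c1 in Region A; customers c2, c3 in Region B)
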
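Climbing the customer hierarchy amounts to mapping each customer to its region before summing. A minimal sketch, assuming the customer-to-region assignment stated on the slide; the name `REGION_OF` is introduced here for illustration.

```python
from collections import defaultdict

# Fact rows: (product, customer, day, amount).
FACTS = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

# One level of the hierarchy: customer -> region.
REGION_OF = {"c1": "A", "c2": "B", "c3": "B"}

by_prod_region = defaultdict(int)
for product, customer, _day, amount in FACTS:
    by_prod_region[(product, REGION_OF[customer])] += amount

by_prod_region = dict(by_prod_region)
print(by_prod_region)
```

A further level (region to country) would be one more mapping applied the same way.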
Pivoting
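Fact table view:

| prodId | storeId | date | amt |
|---|---|---|---|
| p1 | c1 | 1 | 12 |
| p2 | c1 | 1 | 11 |
| p1 | c3 | 1 | 50 |
| p2 | c2 | 1 | 8 |
| p1 | c1 | 2 | 44 |
| p1 | c2 | 2 | 4 |

Multi-dimensional cube:

day 1:

| | c1 | c2 | c3 |
|---|---|---|---|
| p1 | 12 | | 50 |
| p2 | 11 | 8 | |

day 2:

| | c1 | c2 | c3 |
|---|---|---|---|
| p1 | 44 | 4 | |
| p2 | | | |

Summed over days:

| | c1 | c2 | c3 |
|---|---|---|---|
| p1 | 56 | 4 | 50 |
| p2 | 11 | 8 | |
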
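Moving from the fact table view to one 2-D plane of the cube is a small reshaping step. The sketch below builds the plane for a given day as a nested dictionary; the helper name `pivot` is hypothetical, and it assumes at most one fact row per (product, customer, day) combination, as in the sample data.

```python
# Fact rows: (product, customer, day, amount).
FACTS = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]


def pivot(facts, day):
    """Return {product: {customer: amount}} for one day (one 2-D plane of the cube)."""
    plane = {}
    for product, customer, d, amount in facts:
        if d == day:
            plane.setdefault(product, {})[customer] = amount
    return plane


day1 = pivot(FACTS, 1)
day2 = pivot(FACTS, 2)
print(day1)
print(day2)
```

Missing keys correspond to the empty cells of the cube, which is why sparse storage matters for multidimensional engines.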
Overview: Data Warehousing and OLAP Technology for Data Mining
What is a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
Design of a Data Warehouse: A Business Analysis Framework
Four views regarding the design of a data warehouse
Top-down view
– allows selection of the relevant information necessary for the data warehouse
Data source view
– exposes the information being captured, stored, and managed by operational systems
Data warehouse view
– consists of fact tables and dimension tables
Business query view
– sees the perspectives of data in the warehouse from the view of end-user
Data Warehouse Design Process
Top-down, bottom-up approaches or a combination of both
– Top-down: Starts with overall design and planning (mature)
– Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view
– Waterfall: structured and systematic analysis at each step before proceeding to the next
– Spiral: rapid generation of increasingly functional systems, short turnaround time
Typical data warehouse design process
– Choose a business process to model, e.g., orders, invoices, etc.
– Choose the grain (atomic level of data) of the business process
– Choose the dimensions that will apply to each fact table record
– Choose the measure that will populate each fact table record
Multi-Tiered Architecture
[Figure: four tiers from left to right. Data Sources: operational DBs and other sources. Data Storage: Extract / Transform / Load / Refresh feeds the Data Warehouse and Data Marts, with a Monitor & Integrator and Metadata. OLAP Engine: the OLAP Server serves the stored data. Front-End Tools: analysis, query, reports, data mining.]
Three Data Warehouse Models
Enterprise warehouse
– collects all of the information about subjects spanning the entire organization
Data Mart
– a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as marketing data mart
– Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse
– A set of views over operational databases
– Only some of the possible summary views may be materialized
What is a Data Mart ?
A data mart is a small-scale data warehouse that is focused on a single department or single subject area to provide a subset of data warehouse data to address specific reporting and analysis requirements.
[Figure: example departmental data marts: Budget, Finance, HR, Purchasing, Asset Mgmt., Info. Tech.]
Smaller warehouses
Spans part of organization
Do not require enterprise-wide consensus, but long term integration problems?
Data Warehouse Development: A Recommended Approach
[Figure: starting from "Define a high-level corporate data model", two model-refinement paths lead to Data Marts (then Distributed Data Marts) and to the Enterprise Data Warehouse; both feed the Multi-Tier Data Warehouse.]
OLAP Server Architectures
Relational OLAP (ROLAP)
– Provides a multi-dimensional view of a relational DB (e.g., MicroStrategy)
– Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware to support missing pieces
– Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
– Greater scalability
Multidimensional OLAP (MOLAP)
– Array-based multidimensional storage engine (sparse matrix techniques)
– Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)
– User flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers
– Specialized support for SQL queries over star/snowflake schemas
MOLAP Databases
Data is stored using a proprietary format (MOLAP)
Accessible only through the DB vendor’s tools
Suitable only for summarized data
Data may be summarized in advance or real-time
Examples: PowerPlay, Holos, Essbase
RDBMS: Indexing Strategies
Select columns to be indexed:
  Choose combinations of columns most often used to constrain queries (“where …” clause)
  A multi-column index helps only when the query constrains its leading columns, so queries must use the constraining columns in the same order as the columns in the index
Unique indexes are more efficient than non-unique ones.
More indexes mean faster query performance, but also longer transformation/load times.
Types of indexes:
  B-tree: many possible values (e.g., invoice number)
  Bitmap: few possible values (e.g., M/F, S/M/D/W)
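To make the bitmap idea concrete, here is a minimal sketch, assuming a tiny in-memory table. The column names, the sample rows, and the helper functions `build_bitmap_index` and `rows_from_bitmap` are hypothetical; a real DBMS stores compressed bitmaps on disk, which this does not attempt to model.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def rows_from_bitmap(bitmap):
    """Return the row numbers whose bits are set in the bitmap."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical low-cardinality columns (few possible values each).
gender = ["M", "F", "F", "M", "F"]
marital = ["S", "M", "D", "M", "W"]

gender_index = build_bitmap_index(gender)
marital_index = build_bitmap_index(marital)

# WHERE gender = 'F' AND marital IN ('M', 'D') becomes bitwise AND / OR.
matches = gender_index["F"] & (marital_index["M"] | marital_index["D"])
print(rows_from_bitmap(matches))
```

Combining predicates with cheap bitwise operations is the reason bitmaps suit columns with few distinct values; a column such as invoice number would need one bitmap per invoice, which is where a B-tree is the usual choice.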
![Page 55: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/55.jpg)
MOLAP versus ROLAP
MOLAP (Multidimensional OLAP)
  Data stored in multi-dimensional cube
  Transformation required
  Data retrieved directly from cube for analysis
  Faster analytical processing
  Cube size limitations
ROLAP (Relational OLAP)
  Data stored in relational database as virtual cube
  No transformation needed
  Data retrieved via SQL from database for analysis
  Slower analytical processing
  No size limitations
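The contrast can be sketched in a few lines, assuming a toy fact table held in memory. The table name `sales`, the sample rows, and the dense nested-list "cube" are illustrative assumptions, not how any particular OLAP product stores data.

```python
import sqlite3

# Hypothetical fact rows: (product, region, sales amount).
facts = [
    ("p1", "east", 10),
    ("p1", "west", 20),
    ("p2", "east", 5),
    ("p2", "east", 7),
]

# ROLAP style: rows stay in a relational table; the aggregate is computed by SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", facts)
rolap_total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE product = ? AND region = ?",
    ("p2", "east"),
).fetchone()[0]
conn.close()

# MOLAP style: a transformation step loads the rows into a dense array cube,
# after which a cell is read directly by its coordinates.
products = ["p1", "p2"]
regions = ["east", "west"]
cube = [[0 for _ in regions] for _ in products]
for product, region, amount in facts:
    cube[products.index(product)][regions.index(region)] += amount

molap_total = cube[products.index("p2")][regions.index("east")]
print(rolap_total, molap_total)
```

Both paths give the same answer. The array pays an up-front transformation cost and reserves a cell for every combination, including empty ones such as (p2, west), which hints at why cube size becomes a limit.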
![Page 56: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/56.jpg)
Data Warehouse Usage
Three kinds of data warehouse applications
  Information processing
    supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
  Analytical processing
    multidimensional analysis of data warehouse data
    supports basic OLAP operations: slice-dice, drilling, pivoting
  Data mining
    knowledge discovery from hidden patterns
    supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
Differences among the three tasks
![Page 57: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/57.jpg)
Data Mining & Forecasting
Mining the Warehouse
  Choose data population
  Select mining technique
  Segment data into groups
  Identify data patterns
Forecasting Data
  Select trend data
  Choose forecast model
  Run forecast
  Display predictions
![Page 58: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/58.jpg)
Accessing & Analyzing Data
Query & Reporting: retrieving data directly from the warehouse and preparing it for presentation
Online Analytical Processing (OLAP): analyzing aggregated data from a variety of perspectives
Data Mining & Forecasting: analyzing and predicting data using mathematical models
![Page 59: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/59.jpg)
Query & Reporting
Query the Data
  Select & filter data
  Retrieve results
Report the Results
  Sort & group data
  Format & present data
Save or Export Data
  Save queries & reports
  Export to other tools
  Publish HTML pages
![Page 60: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/60.jpg)
Query & Reporting Tools
Cognos Impromptu
Business Objects
Crystal Info
BrioQuery
IQ
GQL
SAS
![Page 61: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/61.jpg)
Online Analytical Processing
Slice and Dice
  Select dimensions
  Choose measures
  Filter by dimensions
Drill Down
  Drill down hierarchies
  Drill through to details
Present the Results
  Present as spreadsheet
  Display graphically
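The operations named on this slide can be sketched over a small list of fact rows. The dimensions, the sample figures, and the helpers `select` and `aggregate` are hypothetical and only illustrate the idea; an OLAP tool would run these against a cube or a warehouse.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, product, sales).
facts = [
    (2001, "Q1", "east", "p1", 10),
    (2001, "Q1", "west", "p1", 4),
    (2001, "Q2", "east", "p2", 6),
    (2002, "Q1", "east", "p1", 8),
    (2002, "Q1", "west", "p2", 3),
]
DIMS = ("year", "quarter", "region", "product")


def select(rows, **conditions):
    """Filter rows; each condition maps a dimension name to a set of allowed values."""
    return [
        row
        for row in rows
        if all(row[DIMS.index(d)] in allowed for d, allowed in conditions.items())
    ]


def aggregate(rows, group_by):
    """Sum the sales measure, grouped by the named dimensions."""
    positions = [DIMS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[p] for p in positions)] += row[-1]
    return dict(totals)


# Slice: fix one dimension to a single value.
slice_2001 = select(facts, year={2001})
# Dice: restrict several dimensions at once.
dice = select(facts, region={"east"}, product={"p1"})
# Start from a yearly summary, then drill down the time hierarchy to quarters.
by_year = aggregate(facts, ["year"])
by_quarter = aggregate(facts, ["year", "quarter"])

print(by_year)
print(by_quarter)
```

Drilling through to details corresponds to returning the underlying rows (for example `slice_2001`) rather than an aggregate.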
![Page 62: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/62.jpg)
OLAP Tools
Cognos PowerPlay
Business Analyzer
Holos
BrioAnalyzer
MicroStrategy
Oracle Express
SAS
Arbor Essbase
![Page 63: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/63.jpg)
Data Mining & Forecasting
Mining the Warehouse
  Choose data population
  Select mining technique
  Segment data into groups
  Identify data patterns
Forecasting Data
  Select trend data
  Choose forecast model
  Run forecast
  Display predictions
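As one possible instance of the forecasting steps, the sketch below assumes the chosen forecast model is a straight-line trend fitted by least squares. The monthly figures and the function names `fit_linear_trend` and `forecast` are hypothetical; real tools offer many other models.

```python
def fit_linear_trend(values):
    """Least-squares fit of y = intercept + slope * t for t = 0, 1, 2, ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values, periods):
    """Extend the fitted line for the given number of future periods."""
    slope, intercept = fit_linear_trend(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(periods)]


# Select trend data (hypothetical monthly sales), run the forecast, display it.
monthly_sales = [100, 110, 120, 130]
predictions = forecast(monthly_sales, 2)
print(predictions)
```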
![Page 64: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/64.jpg)
Data Mining
Sales records:

| transaction id | customer id | products bought |
| --- | --- | --- |
| tran1 | cust33 | p2, p5, p8 |
| tran2 | cust45 | p5, p8, p11 |
| tran3 | cust12 | p1, p9 |
| tran4 | cust40 | p5, p8, p11 |
| tran5 | cust12 | p2, p9 |
| tran6 | cust12 | p9 |

• Trend: Products p5, p8 often bought together
• Trend: Customer 12 likes product p9
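Both trends can be recovered by simple counting over the six sales records. The sketch below is only an illustration: the minimum-support threshold of 3 is an assumption, and counting every product pair is a brute-force stand-in for a real association-mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Sales records from the slide: (transaction id, customer id, products bought).
sales = [
    ("tran1", "cust33", ["p2", "p5", "p8"]),
    ("tran2", "cust45", ["p5", "p8", "p11"]),
    ("tran3", "cust12", ["p1", "p9"]),
    ("tran4", "cust40", ["p5", "p8", "p11"]),
    ("tran5", "cust12", ["p2", "p9"]),
    ("tran6", "cust12", ["p9"]),
]

pair_counts = Counter()
customer_product_counts = Counter()
for _tran, customer, products in sales:
    # Sort so that each pair is counted under one canonical ordering.
    pair_counts.update(combinations(sorted(products), 2))
    customer_product_counts.update((customer, p) for p in products)

MIN_SUPPORT = 3  # assumed threshold: appears in at least 3 transactions

frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= MIN_SUPPORT}
loyal_purchases = {
    key: n for key, n in customer_product_counts.items() if n >= MIN_SUPPORT
}

print(frequent_pairs)   # product pairs bought together
print(loyal_purchases)  # (customer, product) combinations that repeat
```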
![Page 65: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/65.jpg)
Mining and Forecasting Tools
Scenario
4Thought
Business Miner
Clementine
Darwin
Holos
SAS
![Page 66: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/66.jpg)
Data Warehouse Back-End Tools and Utilities
Data extraction: get data from multiple, heterogeneous, and external sources
Data cleaning: detect errors in the data and rectify them when possible
Data transformation: convert data from legacy or host format to warehouse format
Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
Refresh: propagate the updates from the data sources to the warehouse
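A minimal sketch of how these back-end steps chain together, assuming two small in-memory sources and a dictionary standing in for the warehouse. The record layouts, field names, and function names are all hypothetical.

```python
# Hypothetical source systems delivering records in a legacy text format.
source_a = [
    {"id": 1, "name": " Alice ", "amount": "10.50"},
    {"id": 2, "name": "Bob", "amount": "n/a"},
]
source_b = [{"id": 3, "name": "carol", "amount": "7"}]


def extract(*sources):
    """Extraction: pull records from multiple sources."""
    for source in sources:
        yield from source


def clean(records):
    """Cleaning: detect bad amounts; set aside records that cannot be rectified."""
    good, rejected = [], []
    for record in records:
        try:
            amount = float(record["amount"])
        except ValueError:
            rejected.append(record)
            continue
        good.append({**record, "amount": amount})
    return good, rejected


def transform(records):
    """Transformation: convert the source format to the (assumed) warehouse format."""
    return [
        {
            "customer_id": r["id"],
            "customer_name": r["name"].strip().title(),
            "amount_cents": round(r["amount"] * 100),
        }
        for r in records
    ]


def load(warehouse, records):
    """Load: store rows keyed by customer_id, which doubles as a simple index."""
    for record in records:
        warehouse[record["customer_id"]] = record


warehouse = {}
good, rejected = clean(extract(source_a, source_b))
load(warehouse, transform(good))

# Refresh: propagate a later update from a source to the warehouse.
update, _ = clean([{"id": 1, "name": "Alice", "amount": "12"}])
load(warehouse, transform(update))

print(sorted(warehouse), rejected)
```

Rejected records are kept rather than discarded so they can go to manual review, since cleaning can rectify errors only "when possible".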
![Page 67: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/67.jpg)
Data Cleaning
• Of primary importance in warehouse creation
• One of the biggest problems
• Difficult to achieve
• Probability of one or many of the sources containing “dirty data” is high
• Lots of manual intervention
![Page 68: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/68.jpg)
Data Cleaning Problems
Data quality problems
  Single source
    Schema level (poor schema design): uniqueness
    Instance level (data entry errors): misspellings
  Multi-source
    Schema level (heterogeneity): naming conflicts
    Instance level (overlapping, contradicting data): inconsistent aggregation
![Page 69: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/69.jpg)
Multisource problems
All the previous problems, plus:
Schema differences (translation and integration)
  E.g.: EmpID, CID; Sex = M/F, Sex = 0/1
Instance-level conflicts
  Duplicate records, contradicting records
  Different measures ($, Euros)
  Different aggregation levels (weeks, quarters)
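The sketch below shows one way such conflicts might be reconciled when merging two sources. Everything specific in it is an assumption made for illustration: that EmpID and CID identify the same entity, that 0 means male and 1 means female, the exchange rate, and the field names.

```python
# Assumed code mapping: one source uses M/F, the other 0/1 (meaning of 0/1 is a guess).
SEX_CODES = {"M": "M", "F": "F", 0: "M", 1: "F"}
USD_PER_EUR = 1.25  # made-up rate, for illustration only

source_hr = [{"EmpID": 7, "Sex": "F", "salary": 1000.0, "currency": "USD"}]
source_payroll = [
    {"CID": 7, "Sex": 1, "salary": 1000.0, "currency": "USD"},
    {"CID": 9, "Sex": 0, "salary": 2000.0, "currency": "EUR"},
]


def unify(record, id_field):
    """Translate one source record into a common schema and a common currency."""
    amount = record["salary"]
    if record["currency"] == "EUR":
        amount *= USD_PER_EUR
    return {
        "id": record[id_field],
        "sex": SEX_CODES[record["Sex"]],
        "salary_usd": amount,
    }


unified = [unify(r, "EmpID") for r in source_hr]
unified += [unify(r, "CID") for r in source_payroll]

merged, conflicts = {}, []
for record in unified:
    seen = merged.get(record["id"])
    if seen is None:
        merged[record["id"]] = record
    elif seen != record:
        # Contradicting records cannot be resolved automatically here.
        conflicts.append((seen, record))
    # Exact duplicates fall through and are dropped.

print(merged, conflicts)
```

Only after translation do the two records for id 7 compare equal, so the duplicate is detected at the instance level; a contradicting pair would land in `conflicts` for manual intervention.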
![Page 70: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/70.jpg)
Overview: Data Warehousing and OLAP Technology for Data Mining
What is a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehouse to data mining
![Page 71: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/71.jpg)
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: KDD process flow. Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation.]
![Page 72: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/72.jpg)
Steps of a KDD Process
Learning the application domain: relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining: summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
![Page 73: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/73.jpg)
Data Mining and Business Intelligence
[Figure: pyramid of layers; potential to support business decisions increases toward the top.]
Layers, top to bottom:
  Making Decisions
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Analysis, Querying and Reporting
  Data Warehouses / Data Marts: OLAP, MDA
  Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
![Page 74: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/74.jpg)
From OLAP to On Line Analytical Mining (OLAM)
Why online analytical mining?
  High quality of data in data warehouses
    DW contains integrated, consistent, cleaned data
  Available information processing structure surrounding data warehouses
    ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
  OLAP-based exploratory data analysis
    mining with drilling, dicing, pivoting, etc.
  On-line selection of data mining functions
    integration and swapping of multiple mining functions, algorithms, and tasks
Architecture of OLAM
![Page 75: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/75.jpg)
An OLAM Architecture
[Figure: OLAM architecture in four layers, top to bottom.]
Layer 4, User Interface: User GUI API (mining query in, mining result out)
Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine, above the Data Cube API
Layer 2, MDDB: MDDB and Meta Data, above the Database API (filtering and integration)
Layer 1, Data Repository: Databases and Data Warehouse (data cleaning, data integration, filtering)
![Page 76: ITCS 6163 Data Warehousing Xintao Wu. History 60s C. Bachman GE network data model Late 60s IBM IMS hierarchical data model 70 E.Codd relational model](https://reader036.vdocuments.mx/reader036/viewer/2022070406/56649dde5503460f94ad6e37/html5/thumbnails/76.jpg)
Summary
Data warehouse
  A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process
A multi-dimensional model of a data warehouse
  Star schema, snowflake schema, fact constellations
  A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivoting
OLAP servers: ROLAP, MOLAP, HOLAP
From OLAP to OLAM