1 data wharehousing, olap and data mining. 2 acknowledgments a. balachandran anand deshpande sunita...
TRANSCRIPT
![Page 1: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/1.jpg)
1
Data Wharehousing, OLAP and Data Mining
![Page 2: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/2.jpg)
2
Acknowledgments
A. Balachandran Anand DeshpandeSunita Sarawagi
S. Seshadri
![Page 3: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/3.jpg)
3
Overview
Part 1: Data Warehouses Part 2: OLAPPart 3: Data MiningPart 4: Query Processing and
Optimization
![Page 4: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/4.jpg)
4
Part 1: Data Warehouses
![Page 5: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/5.jpg)
5
Data, Data everywhereyet ...
I can’t find the data I need data is scattered over the network many versions, subtle differences
I can’t get the data I need need an expert to get the data
I can’t understand the data I found available data poorly documented
I can’t use the data I found results are unexpected data needs to be transformed
from one form to other
![Page 6: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/6.jpg)
6
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context.
[Barry Devlin]
![Page 7: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/7.jpg)
7
Which are our lowest/highest margin
customers ?
Which are our lowest/highest margin
customers ?
Who are my customers and what products are they buying?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition ?
Which customers are most likely to go to the competition ?
What impact will new products/services
have on revenue and margins?
What impact will new products/services
have on revenue and margins?
What product prom--otions have the biggest
impact on revenue?
What product prom--otions have the biggest
impact on revenue?
What is the most effective distribution
channel?
What is the most effective distribution
channel?
Why Data Warehousing?
![Page 8: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/8.jpg)
8
Decision Support
Used to manage and control businessData is historical or point-in-timeOptimized for inquiry rather than updateUse of the system is loosely defined and
can be ad-hocUsed by managers and end-users to
understand the business and make judgements
![Page 9: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/9.jpg)
9
Evolution of Decision Support
60’s: Batch reports hard to find and analyze information inflexible and expensive, reprogram every
request
70’s: Terminal based DSS and EIS80’s: Desktop data access and analysis
tools query tools, spreadsheets, GUIs easy to use, but access only operational db
90’s: Data warehousing with integrated OLAP engines and tools
![Page 10: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/10.jpg)
10
What are the users saying...
Data should be integrated across the enterprise
Summary data had a real value to the organization
Historical data held the key to understanding data over time
What-if capabilities are required
![Page 11: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/11.jpg)
11
Data Warehousing -- It is a process
Technique for assembling and managing data from various sources for the purpose of answering business questions. Thus making decisions that were not previous possible
A decision support database maintained separately from the organization’s operational database
![Page 12: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/12.jpg)
12
Traditional RDBMS used for OLTP
Database Systems have been used traditionally for OLTP clerical data processing tasks detailed, up to date data structured repetitive tasks read/update a few records isolation, recovery and integrity are critical
Will call these operational systems
![Page 13: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/13.jpg)
13
OLTP vs Data Warehouse
OLTP Application Oriented Used to run business Clerical User Detailed data Current up to date Isolated Data Repetitive access by
small transactions Read/Update access
Warehouse (DSS) Subject Oriented Used to analyze business Manager/Analyst Summarized and refined Snapshot data Integrated Data Ad-hoc access using large
queries Mostly read access (batch
update)
![Page 14: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/14.jpg)
14
Data Warehouse Architecture
RelationalDatabases
LegacyData
Purchased Data
Data Warehouse Engine
Optimized Loader
ExtractionCleansing
AnalyzeQuery
Metadata Repository
![Page 15: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/15.jpg)
15
From the Data Warehouse to Data Marts
DepartmentallyStructured
IndividuallyStructured
Data WarehouseOrganizationallyStructured
Less
More
HistoryNormalizedDetailed
Data
Information
![Page 16: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/16.jpg)
16
Users have different views of Data
Organizationallystructured
OLAP
Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
Farmers: Harvest informationfrom known access paths
Tourists: Browse information harvestedby farmers
![Page 17: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/17.jpg)
17
Wal*Mart Case Study
Founded by Sam WaltonOne the largest Super Market Chains in
the US
Wal*Mart: 2000+ Retail Stores SAM's Clubs 100+Wholesalers Stores
This case study is from Felipe Carino’s (NCR Teradata) presentation made at Stanford Database Seminar
![Page 18: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/18.jpg)
18
Old Retail Paradigm
Wal*Mart Inventory Management Merchandise Accounts
Payable Purchasing Supplier Promotions:
National, Region, Store Level
Suppliers Accept Orders Promote Products Provide special
Incentives Monitor and Track
The Incentives Bill and Collect
Receivables Estimate Retailer
Demands
![Page 19: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/19.jpg)
19
New (Just-In-Time) Retail Paradigm
No more deals Shelf-Pass Through (POS Application)
One Unit Price Suppliers paid once a week on ACTUAL items sold
Wal*Mart Manager Daily Inventory Restock Suppliers (sometimes SameDay) ship to Wal*Mart
Warehouse-Pass Through Stock some Large Items
Delivery may come from supplier Distribution Center
Supplier’s merchandise unloaded directly onto Wal*Mart Trucks
![Page 20: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/20.jpg)
20
Information as a Strategic Weapon
Daily Summary of all Sales InformationRegional Analysis of all Stores in a logical
areaSpecific Product SalesSpecific Supplies SalesTrend Analysis, etc.Wal*Mart uses information when
negotiating with Suppliers Advertisers etc.
![Page 21: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/21.jpg)
21
Schema Design
Database organization must look like business must be recognizable by business user approachable by business user Must be simple
Schema Types Star Schema Fact Constellation Schema Snowflake schema
![Page 22: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/22.jpg)
22
Star Schema
A single fact table and for each dimension one dimension table
Does not capture hierarchies directly
T im
e
prod
cust
city
fact
date, custno, prodno, cityname, sales
![Page 23: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/23.jpg)
23
Dimension Tables
Dimension tables Define business in terms already familiar to
users Wide rows with lots of descriptive text Small tables (about a million rows) Joined to fact table by a foreign key heavily indexed typical dimensions
time periods, geographic region (markets, cities), products, customers, salesperson, etc.
![Page 24: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/24.jpg)
24
Fact Table
Central table Typical example: individual sales
records mostly raw numeric items narrow rows, a few columns at most large number of rows (millions to a
billion) Access via dimensions
![Page 25: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/25.jpg)
25
Snowflake schema
Represent dimensional hierarchy directly by normalizing tables.
Easy to maintain and saves storage
T im
e
prod
cust
city
fact
date, custno, prodno, cityname, ...
region
![Page 26: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/26.jpg)
26
Fact Constellation
Fact Constellation Multiple fact tables that share many
dimension tables Booking and Checkout may share many
dimension tables in the hotel industryHotels
Travel Agents
Promotion
Room Type
Customer
Booking
Checkout
![Page 27: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/27.jpg)
27
Data Granularity in Warehouse
Summarized data stored reduce storage costs reduce cpu usage increases performance since smaller
number of records to be processed design around traditional high level
reporting needs tradeoff with volume of data to be stored
and detailed usage of data
![Page 28: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/28.jpg)
28
Granularity in Warehouse
Solution is to have dual level of granularity Store summary data on disks
95% of DSS processing done against this data
Store detail on tapes5% of DSS processing against this data
![Page 29: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/29.jpg)
29
Levels of Granularity
Operational
60 days ofactivity
account activity date amount teller location account bal
accountmonth # trans withdrawals deposits average bal
amountactivity date amount account bal
monthly accountregister -- up to 10 years
Not all fieldsneed be archived
Banking Example
![Page 30: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/30.jpg)
30
Data Integration Across Sources
Trust Credit cardSavings Loans
Same data different name
Different data Same name
Data found here nowhere else
Different keyssame data
![Page 31: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/31.jpg)
31
Data Transformation
Data transformation is the foundation for achieving single version of the truth
Major concern for ITData warehouse can fail if appropriate
data transformation strategy is not developed
Sequential Legacy Relational ExternalOperational/Source Data
Data Transformation
Accessing Capturing Extracting Householding FilteringReconciling Conditioning Loading Validating Scoring
![Page 32: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/32.jpg)
32
Data Integrity Problems
Same person, different spellings Agarwal, Agrawal, Aggarwal etc...
Multiple ways to denote company name Persistent Systems, PSPL, Persistent Pvt. LTD.
Use of different names mumbai, bombay
Different account numbers generated by different applications for the same customer
Required fields left blank Invalid product codes collected at point of sale
manual entry leads to mistakes “in case of a problem use 9999999”
![Page 33: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/33.jpg)
33
Data Transformation Terms
ExtractingConditioningScrubbingMergingHouseholding
EnrichmentScoringLoadingValidatingDelta Updating
![Page 34: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/34.jpg)
34
Data Transformation Terms
Householding Identifying all members of a household
(living at the same address) Ensures only one mail is sent to a
household Can result in substantial savings: 1
million catalogues at $50 each costs $50 million . A 2% savings would save $1 million
![Page 35: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/35.jpg)
35
Refresh
Propagate updates on source data to the warehouse
Issues: when to refresh how to refresh -- incremental refresh
techniques
![Page 36: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/36.jpg)
36
When to Refresh?
periodically (e.g., every night, every week) or after significant events
on every update: not warranted unless warehouse data require current data (up to the minute stock quotes)
refresh policy set by administrator based on user needs and traffic
possibly different policies for different sources
![Page 37: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/37.jpg)
37
Refresh techniques
Incremental techniques detect changes on base tables: replication
servers (e.g., Sybase, Oracle, IBM Data Propagator)snapshots (Oracle)transaction shipping (Sybase)
compute changes to derived and summary tables
maintain transactional correctness for incremental load
![Page 38: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/38.jpg)
38
How To Detect Changes
Create a snapshot log table to record ids of updated rows of source data and timestamp
Detect changes by: Defining after row triggers to update
snapshot log when source table changes Using regular transaction log to detect
changes to source data
![Page 39: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/39.jpg)
39
Querying Data Warehouses
SQL ExtensionsMultidimensional modeling of data
OLAP More on OLAP later …
![Page 40: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/40.jpg)
40
SQL Extensions
Extended family of aggregate functions rank (top 10 customers) percentile (top 30% of customers) median, mode Object Relational Systems allow addition
of new aggregate functionsReporting features
running total, cumulative totals
![Page 41: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/41.jpg)
41
Reporting Tools
Andyne Computing -- GQL Brio -- BrioQuery Business Objects -- Business Objects Cognos -- Impromptu Information Builders Inc. -- Focus for Windows Oracle -- Discoverer2000 Platinum Technology -- SQL*Assist, ProReports PowerSoft -- InfoMaker SAS Institute -- SAS/Assist Software AG -- Esperant Sterling Software -- VISION:Data
![Page 42: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/42.jpg)
42
Bombay branch Delhi branch Calcutta branchCensusdata
Operational data
Detailed transactionaldata
Data warehouseMergeCleanSummarize
DirectQuery
Reportingtools
MiningtoolsOLAP
Decision support tools
Oracle SAS
RelationalDBMS+e.g. Redbrick
IMS
Crystal reports Essbase Intelligent Miner
GISdata
![Page 43: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/43.jpg)
43
Deploying Data Warehouses
What business information keeps you in business today? What business information can put you out of business tomorrow?
What business information should be a mouse click away?
What business conditions are the driving the need for business information?
![Page 44: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/44.jpg)
44
Cultural Considerations
Not just a technology project
New way of using information to support daily activities and decision making
Care must be taken to prepare organization for change
Must have organizational backing and support
![Page 45: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/45.jpg)
45
User Training
Users must have a higher level of IT proficiency than for operational systems
Training to help users analyze data in the warehouse effectively
![Page 46: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/46.jpg)
46
Warehouse ProductsComputer Associates -- CA-Ingres Hewlett-Packard -- Allbase/SQL Informix -- Informix, Informix XPSMicrosoft -- SQL Server Oracle – OracleRed Brick -- Red Brick Warehouse SAS Institute -- SAS Software AG -- ADABAS Sybase -- SQL Server, IQ, MPP
![Page 47: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/47.jpg)
47
Part 2: OLAP
![Page 48: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/48.jpg)
48
Nature of OLAP Analysis
Aggregation -- (total sales, percent-to-total)
Comparison -- Budget vs. ExpensesRanking -- Top 10, quartile analysisAccess to detailed and aggregate dataComplex criteria specificationVisualizationNeed interactive response to aggregate queries
![Page 49: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/49.jpg)
49MonthMonth
1 1 22 3 3 4 4 776 6 5 5
Pro
du
ctP
rod
uct
Toothpaste Toothpaste
JuiceJuiceColaColaMilk Milk
CreamCream
Soap Soap
Regio
n
Regio
n
WWS S
N N
DimensionsDimensions: : Product, Region, TimeProduct, Region, TimeHierarchical summarization pathsHierarchical summarization paths
Product Product Region Region TimeTimeIndustry Country YearIndustry Country Year
Category Region Quarter Category Region Quarter
Product City Month weekProduct City Month week
Office DayOffice Day
Multi-dimensional Data
Measure - sales (actual, plan, variance)
![Page 50: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/50.jpg)
50
Conceptual Model for OLAP
Numeric measures to be analyzed e.g. Sales (Rs), sales (volume), budget,
revenue, inventoryDimensions
other attributes of data, define the space e.g., store, product, date-of-sale hierarchies on dimensions
e.g. branch -> city -> state
![Page 51: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/51.jpg)
51
Operations
Rollup: summarize data e.g., given sales data, summarize sales
for last year by product category and region
Drill down: get more details e.g., given summarized sales as above,
find breakup of sales by city within each region, or within the Andhra region
![Page 52: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/52.jpg)
52
More Cube Operations
Slice and dice: select and project e.g.: Sales of soft-drinks in Andhra over
the last quarterPivot: change the view of data
Q1 Q2 Total L S TotalL RedS BlueTotal Total
22 33 5515 44 59
37 77 114
14 07 2141 52 93
55 59 114
![Page 53: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/53.jpg)
53
More OLAP Operations
Hypothesis driven search: E.g. factors affecting defaulters view defaulting rate on age aggregated over
other dimensions for particular age segment detail along
profession
Need interactive response to aggregate queries => precompute various aggregates
![Page 54: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/54.jpg)
54
MOLAP vs ROLAP
MOLAP: Multidimensional array OLAP
ROLAP: Relational OLAP Type Size Colour AmountShirt S Blue 10Shirt L Blue 25
Shirt ALL Blue 35Shirt S Red 3Shirt L Red 7Shirt ALL Red 10Shirt ALL ALL 45… … … …ALL ALL ALL 1290
![Page 55: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/55.jpg)
55
SQL Extensions
Cube operator group by on all subsets of a set of
attributes (month,city) redundant scan and sorting of data can
be avoidedVarious other non-standard SQL
extensions by vendors
![Page 56: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/56.jpg)
56
OLAP: 3 Tier DSSData Warehouse
Database Layer
Store atomic data in industry standard Data Warehouse.
OLAP Engine
Application Logic Layer
Generate SQL execution plans in the OLAP engine to obtain OLAP functionality.
Decision Support Client
Presentation Layer
Obtain multi-dimensional reports from the DSS Client.
![Page 57: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/57.jpg)
57
Strengths of OLAP
It is a powerful visualization tool
It provides fast, interactive response times
It is good for analyzing time series
It can be useful to find some clusters and outliners
Many vendors offer OLAP tools
![Page 58: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/58.jpg)
58
Brief History
Express and System W DSS Online Analytical Processing - coined by
EF Codd in 1994 - white paper by Arbor Software
Generally synonymous with earlier terms such as Decisions Support, Business Intelligence, Executive Information System
MOLAP: Multidimensional OLAP (Hyperion (Arbor Essbase), Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
![Page 59: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/59.jpg)
59
OLAP and Executive Information Systems
Andyne Computing -- Pablo
Arbor Software -- Essbase Cognos -- PowerPlay Comshare -- Commander
OLAP Holistic Systems -- Holos Information Advantage --
AXSYS, WebOLAP Informix -- Metacube Microstrategies
--DSS/Agent
Oracle -- Express Pilot -- LightShip Planning Sciences --
Gentium Platinum Technology
-- ProdeaBeacon, Forest & Trees
SAS Institute -- SAS/EIS, OLAP++
Speedware -- Media
![Page 60: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/60.jpg)
60
Microsoft OLAP strategy
Plato: OLAP server: powerful, integrating various operational sources
OLE-DB for OLAP: emerging industry standard based on MDX --> extension of SQL for OLAP
Pivot-table services: integrate with Office 2000 Every desktop will have OLAP capability.
Client side caching and calculationsPartitioned and virtual cubeHybrid relational and multidimensional storage
![Page 61: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/61.jpg)
61
Part 3: Data Mining
![Page 62: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/62.jpg)
62
Why Data Mining Credit ratings/targeted marketing:
Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Identify likely responders to sales promotions
Fraud detection Which types of transactions are likely to be fraudulent, given
the demographics and transactional history of a particular customer?
Customer relationship management: Which of my customers are likely to be the most loyal, and
which are most likely to leave for a competitor? :
Data Mining helps extract such information
![Page 63: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/63.jpg)
63
Data mining
Process of semi-automatically analyzing large databases to find interesting and useful patterns
Overlaps with machine learning, statistics, artificial intelligence and databases but more scalable in number of features and
instances more automated to handle
heterogeneous data
![Page 64: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/64.jpg)
64
Some basic operations
Predictive: Regression Classification
Descriptive: Clustering / similarity matching Association rules and variants Deviation detection
![Page 65: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/65.jpg)
65
Classification
Given old data about customers and payments, predict new applicant’s loan eligibility.
AgeSalaryProfessionLocationCustomer type
Previous customers Classifier Decision rules
Salary > 5 L
Prof. = Exec
New applicant’s data
Good/bad
![Page 66: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/66.jpg)
66
Classification methods
Goal: Predict class Ci = f(x1, x2, .. Xn)Regression: (linear or any other polynomial)
a*x1 + b*x2 + c = Ci. Nearest neighourDecision tree classifier: divide decision
space into piecewise constant regions.Probabilistic/generative modelsNeural networks: partition by non-linear
boundaries
![Page 67: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/67.jpg)
67
Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
Decision trees
Salary < 1 M
Prof = teacher
Good
Age < 30
BadBad Good
![Page 68: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/68.jpg)
68
Pros and Cons of decision trees
• Cons– Cannot handle complicated relationship between features– simple decision boundaries– problems with lots of missing data
• Pros+ Reasonable training time+ Fast application+ Easy to interpret+ Easy to implement+ Can handle large number of features
More information: http://www.stat.wisc.edu/~limt/treeprogs.html
![Page 69: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/69.jpg)
69
Neural networkSet of nodes connected by directed
weighted edges
Hidden nodes
Output nodes
x1
x2
x3
x1
x2
x3
w1
w2
w3
y
n
iii
ey
xwo
1
1)(
)(1
Basic NN unitA more typical NN
![Page 70: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/70.jpg)
70
Pros and Cons of Neural Network
• Cons– Slow training time– Hard to interpret – Hard to implement: trial and error for choosing number of nodes
• Pros+ Can learn more complicated class boundaries+ Fast application+ Can handle large number of features
Conclusion: Use neural nets only if decision trees/NN fail.
![Page 71: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/71.jpg)
71
Bayesian learning
Assume a probability model on generation of data. Apply bayes theorem to find most likely class as:
Naïve bayes: Assume attributes conditionally independent given class value
Easy to learn probabilities by counting, Useful in some domains e.g. text
)(
)()|(max)|(max :class predicted
dp
cpcdpdcpc jj
cj
c jj
n
iji
j
ccap
dp
cpc
j 1
)|()(
)(max
![Page 72: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/72.jpg)
72
Clustering
Unsupervised learning when old data with class labels not available e.g. when introducing a new product.
Group/cluster existing customers based on time series of payment history such that similar customers in same cluster.
Key requirement: Need a good measure of similarity between instances.
Identify micro-markets and develop policies for each
![Page 73: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/73.jpg)
73
Association rules
Given set T of groups of itemsExample: set of item sets
purchased Goal: find all rules on itemsets of
the form a-->b such that support of a and b > user threshold s conditional probability (confidence) of
b given a > user threshold c
Example: Milk --> breadPurchase of product A --> service B
Milk, cerealTea, milk
Tea, rice, bread
cereal
T
![Page 74: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/74.jpg)
74
Variants
High confidence may not imply high correlation
Use correlations. Find expected support and large departures from that interesting.. see statistical literature on contingency
tables.Still too many rules, need to prune...
![Page 75: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/75.jpg)
75
Prevalent Interesting
Analysts already know about prevalent rules
Interesting rules are those that deviate from prior expectation
Mining’s payoff is in finding surprising phenomena
1995
1998
Milk andcereal selltogether!
Zzzz... Milk andcereal selltogether!
![Page 76: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/76.jpg)
76
What makes a rule surprising?
Does not match prior expectation Correlation
between milk and cereal remains roughly constant over time
Cannot be trivially derived from simpler rules Milk 10%, cereal 10% Milk and cereal 10%
… surprising Eggs 10% Milk, cereal and eggs
0.1% … surprising! Expected 1%
![Page 77: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/77.jpg)
77
Application Areas
Industry ApplicationFinance Credit Card AnalysisInsurance Claims, Fraud Analysis
Telecommunication Call record analysisTransport Logistics managementConsumer goods promotion analysisData Service providersValue added dataUtilities Power usage analysis
![Page 78: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/78.jpg)
78
Data Mining in Use
The US Government uses Data Mining to track fraud
A Supermarket becomes an information brokerBasketball teams use it to track game strategyCross SellingTarget MarketingHolding on to Good CustomersWeeding out Bad Customers
![Page 79: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/79.jpg)
79
Why Now?
Data is being producedData is being warehousedThe computing power is availableThe computing power is affordableThe competitive pressures are strongCommercial products are available
![Page 80: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/80.jpg)
80
Data Mining works with Warehouse Data
Data Warehousing provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence
![Page 81: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/81.jpg)
81
Mining market
Around 20 to 30 mining tool vendorsMajor players:
Clementine, IBM’s Intelligent Miner, SGI’s MineSet, SAS’s Enterprise Miner.
All pretty much the same set of toolsMany embedded products: fraud detection,
electronic commerce applications
![Page 82: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/82.jpg)
82
OLAP Mining integrationOLAP (On Line Analytical Processing)
Fast interactive exploration of multidim. aggregates.
Heavy reliance on manual operations for analysis:
Tedious and error-prone on large multidimensional data
Ideal platform for vertical integration of mining but needs to be interactive instead of batch.
![Page 83: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/83.jpg)
83
State of art in mining OLAP
integration
Decision trees [Information discovery, Cognos] find factors influencing high profits
Clustering [Pilot software] segment customers to define hierarchy on that
dimension
Time series analysis: [Seagate’s Holos] Query for various shapes along time: eg. spikes, outliers
etc
Multi-level Associations [Han et al.] find association between members of dimensions
![Page 84: 1 Data Wharehousing, OLAP and Data Mining. 2 Acknowledgments A. Balachandran Anand Deshpande Sunita Sarawagi S. Seshadri](https://reader036.vdocuments.mx/reader036/viewer/2022062407/56649d9f5503460f94a8a1ad/html5/thumbnails/84.jpg)
84
Vertical integration: Mining on the web
Web log analysis for site design: what are popular pages, what links are hard to find.
Electronic stores sales enhancements: recommendations, advertisement: Collaborative filtering: Net perception,
Wisewire
Inventory control: what was a shopper looking for and could not find..