advanced db chapter 18 data analysis & mining
TRANSCRIPT
Advanced DB
CHAPTER 18 DATA ANALYSIS & MINING
Chapter 18: Data Analysis and Mining
  Decision Support Systems
  Data Analysis and OLAP
  Data Warehousing
  Data Mining
Original Slides:© Silberschatz, Korth and Sudarshan
Intro to DB (2008-1)Copyright © 2006 - 2008 by S.-g. Lee Chap 18 - 2
Online Transaction Processing
Transaction processing activities are needed to capture and process data involved in transactions
Online transaction processing systems
  Real-time systems that capture and process transactions immediately
  Can help provide superior service to customers and other trading partners
  An integral part of the everyday business of an enterprise
    Order processing system of an electronics distributor
    Banking system of a bank
    Order control and billing system of a hospital
    Switches and billing system of a telephone company
Decision Support Systems
Decision-support systems are used to make business decisions
  What items to stock?
  What insurance premium to change?
  To whom to send advertisements?
Examples of data used for making decisions
  Retail sales transaction details
  Customer profiles (income, age, gender, etc.)
Identifying patterns in customer behavior and using the patterns to make business decisions
  A sudden spurt in purchases of flannel shirts
  Most small sports cars are bought by young women with annual salaries above $50,000
EIS (Executive Information System)
  A reporting subsystem for decision makers of an enterprise
  Summary, trend, comparative, integrated information
Multidimensional View
Data that can be modeled as dimension attributes and measure attributes
  Example relation: sales(item-name, color, size, number)
Measure attributes
  Example: number
  Measure some value
  Can be aggregated upon
Dimension attributes
  Example: item-name, color, size
  Define the dimensions on which measure attributes, and summaries of measure attributes, are viewed
[Figure: Sales data cube with axes Item (A, B, C), Month (1, 2, 3), and city (Seoul, Busan, Kwangju)]
Online Analytical Processing (OLAP)
Interactive analysis of summary information
Statistical analysis often requires grouping on multiple attributes
Although complex statistical analysis is best left to statistics packages, databases should support simple, commonly used forms of data analysis
Several SQL extensions have been developed to support OLAP tools
Cross Tabulation of sales by item-name and color
An example of a cross-tabulation (cross-tab), also referred to as a pivot-table
  Values for one of the dimension attributes form the row headers
  Values for another dimension attribute form the column headers
  Other dimension attributes are listed on top
  Values in individual cells are (aggregates of) the measure attribute for the dimension-attribute values that specify the cell
Relational Representation of Cross-tabs
Cross-tabs can be represented as relations
  The value all is used to represent aggregates
  The SQL:1999 standard actually uses null values in place of all, despite confusion with regular null values
Data Cube
A data cube is a multidimensional generalization of a cross-tab
Can have n dimensions; we show 3 below
Cross-tabs can be used as views on a data cube
[Figure: three-dimensional data cube]
Online Analytical Processing Operations
Pivoting: changing the dimensions used in a cross-tab
Slicing: creating a cross-tab for fixed values only
  Sometimes called dicing, particularly when values for multiple dimensions are fixed
Rollup: moving from finer-granularity data to a coarser granularity
Drill down: the opposite operation, that of moving from coarser-granularity data to finer-granularity data
Hierarchies on Dimensions
Hierarchy on dimension attributes: lets dimensions be viewed at different levels of detail
  E.g. the dimension DateTime can be used to aggregate by hour of day, date, day of week, month, quarter or year
Cross Tabulation With Hierarchy
Cross-tabs can be easily extended to deal with hierarchies
  Can drill down or roll up on a hierarchy
OLAP Implementation
MOLAP (multidimensional OLAP)
  Multidimensional arrays in memory to store data cubes (earliest OLAP systems)
ROLAP (relational OLAP)
  Data stored in a relational database
HOLAP (hybrid OLAP)
  Store some summaries in memory
  Store the base data and other summaries in a relational database
Many OLAP systems are implemented as client-server systems
OLAP Implementation (Cont.)
Precompute all possible aggregates in order to provide online response
  2^n combinations of group by (for n dimension attributes)
Precompute some aggregates, and compute others on demand from one of the precomputed aggregates
  Can compute aggregate on (item-name, color) from an aggregate on (item-name, color, size)
Several optimizations available for computing multiple aggregates
  Can compute aggregate on (item-name, color) from an aggregate on (item-name, color, size)
  Can compute aggregates on (item-name, color, size), (item-name, color), and (item-name) using a single sorting of the base data
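Computing a coarser aggregate from a finer precomputed one can be sketched in a few lines. This is only an illustration: the sample totals and the helper name roll_up are assumptions, not part of the slides, and the approach works as written for sum (an average would need both sum and count to be carried along).

```python
from collections import defaultdict

# Hypothetical precomputed aggregate on (item-name, color, size): sum(number)
by_item_color_size = {
    ("skirt", "dark", "small"): 2,
    ("skirt", "dark", "medium"): 6,
    ("skirt", "pastel", "small"): 11,
    ("dress", "dark", "small"): 4,
    ("dress", "dark", "large"): 16,
}


def roll_up(aggregate, keep):
    """Sum a finer aggregate down to the key positions listed in keep."""
    coarser = defaultdict(int)
    for key, total in aggregate.items():
        coarser[tuple(key[i] for i in keep)] += total
    return dict(coarser)


# (item-name, color) is derived from (item-name, color, size),
# and (item-name) from (item-name, color); the base data is never rescanned.
by_item_color = roll_up(by_item_color_size, keep=(0, 1))
by_item = roll_up(by_item_color, keep=(0,))
```

The point of the design is that each coarser level reads only the (smaller) level above it rather than the base relation.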
Extended Aggregation in SQL:1999
The cube operation computes the union of group by's on every subset of the specified attributes
E.g. consider the query
    select item-name, color, size, sum(number)
    from sales
    group by cube(item-name, color, size)
This computes the union of eight different groupings of the sales relation:
    { (item-name, color, size), (item-name, color), (item-name, size),
      (color, size), (item-name), (color), (size), ( ) }
where ( ) denotes an empty group by list.
For each grouping, the result contains the null value for attributes not present in the grouping.
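The effect of the cube operation can be imitated outside SQL to see where the eight groupings come from. The sketch below is an illustration under assumptions: the sample rows and the function name cube are invented here, and None plays the role of the null value that marks an attribute absent from a grouping.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("item_name", "color", "size")

# Hypothetical sample rows of sales(item-name, color, size, number)
sales = [
    ("skirt", "dark", "small", 2),
    ("skirt", "dark", "medium", 6),
    ("dress", "dark", "small", 4),
    ("dress", "pastel", "small", 3),
]


def cube(rows, n_dims):
    """Return {key: sum(number)}, with None for attributes not in the grouping."""
    result = defaultdict(int)
    # Every subset of the dimension positions is one group-by list.
    for size in range(n_dims, -1, -1):
        for kept in combinations(range(n_dims), size):
            for row in rows:
                key = tuple(row[i] if i in kept else None for i in range(n_dims))
                result[key] += row[-1]
    return dict(result)


cube_result = cube(sales, len(DIMENSIONS))
```

With three dimensions there are 2^3 = 8 subsets, so every input row contributes to eight result keys; the key (None, None, None) corresponds to the empty group by list and holds the grand total.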
Extended Aggregation (Cont.)
Relational representation of cross-tab of figure 18.2 can be computed by
    select item-name, color, sum(number)
    from sales
    group by cube(item-name, color)
Data Warehouse - Motivation
Decision Support Problems in conventional systems (W. H. Inmon)
  Credibility of data: no time basis, algorithmic difference, no common source
  Productivity: multiple sources in multiple environments, insufficient information, repeated work
  Data into information: data from multiple applications, consistency, past history
Conventional DSS Cycle
[Figure: conventional DSS cycle between the executive and IS / IT (request, modifications), with a Sales report of Item (A, B, C) by Month (1, 2, 3)]
Solution 1: Rebuild System
Pros
  A new & consistent integrated data model
  Opportunity to adopt new HW, SW, & methodology
Cons
  Large long-term project
  Change in management/organization/operations during project
  Transition risks
  Still operations (OLTP) oriented
    DSS is second priority at best
Solution 2: DW-based Restructuring
[Figure: legacy OLTP systems (data processing, transactions) feed the DW through Extract, Transform, Load; DSS, EIS, and reports draw on the DW (information processing)]
Data Warehouse – Features
A repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site
  Once gathered, the data are stored for a long time, permitting access to historical data
  Provides the user a single consolidated interface to data, making decision-support queries easier to write
  Online transaction-processing systems are not affected by the decision-support workload
Key Features
  Subject oriented
  Integrated
  Time-variant, historical
  Nonvolatile
Data Warehouse – Features
Subject oriented
  Data in an OLTP system are designed/managed/used with a process/function-oriented view
  Data in a DW are designed/managed/used with a subject-oriented view
    Focus of a DSS is on the subjects of an organization (not the processes)
  Not all data from OLTP need to be moved to the DW
[Figure: Operational side (loan processing, deposit processing, credit review operations) vs. Data Warehouse side (customer, account)]
Data Warehouse – Features
Integrated
  Names: table names and column names
  Units, values and codes
  Aggregation and consolidation
  A consistent source of information
[Figure: transformation examples. Column names balance, amount, and curr. amount are unified under one name; length in cm (Appl. A) and in inch (Appl. B) is converted to cm; gender codes 남/여 (Korean for male/female), m/f, and 0/1 are converted to m/f]
Data Warehouse – Features
Time Variant
  OLTP: represents the current state of the domain world
  DW: needs to log the history of a subject
    Monthly call charges for K. Kim for last year
    Weekly sales for each brand (or item)
  Time element must be present in the key
[Figure: K. Kim moved from Seoul to Busan on 2008-4-5. OLTP updates the record in place (addr:Seoul becomes addr:Busan); the DW appends a new record (addr:Busan, date:2008-4-5) next to the old one (addr:Seoul, date:2002-7-15)]
Data Warehouse – Features
Nonvolatile
  Few (or no) updates
  Most operations are new inserts (load)
  Normalization is not important
[Figure: OLTP supports insert, update, and delete; the DW supports load and access]
Data Modeling
DB Design for OLTP
  Normalization: minimize update anomalies by eliminating redundancy
  Transactions usually access a (relatively) small number of records
  Node = table / Link = join path
  Ad hoc join paths: not easy to predict access patterns
DB Design for Data warehouse
  Historical and summary oriented
  Analytical "views"
  Non-volatile
Dimensional model
  Fact tables and dimension tables
Data Modeling – OLTP
[Figure: OLTP entity-relationship schema with entities such as Ship Type, Shipper, Ship To, Product, Product Line, Product Group, District Credit, Order, Order Item, Sales Order, Contact, Contact Loc, Cust Loc, Customer, Contract, Contract Type, Sales Rep, Sales District, Sales Division, Sales Region]
Q: The top selling brands on holidays in Q1?
Multi-dimensional Modeling – Fact table
Store all transactions for factual data about a business
Have a multi-part key (composite key)
  Each element of the key is a FK to a single dimension table
The remaining fields in the fact table are known as Facts
  Facts are almost always numeric: the target of aggregation (count, sum, average, …)
Fact tables are much larger than dimension tables
Multi-dimensional Modeling – Dimension table
Dimensions are attributes of facts
Store hierarchical relationships between different groupings or characteristics of those transactions
Dimension hierarchy
  Time: Year, Quarter, Week, Day
  Product: Group, Subgroup, Department, Class, Item
  Territory: Division, Country, Region, District
  Unit of drill-down & roll-up
Have a single key that is joined to a Fact table
The remaining fields are called attributes
Dimension attributes
  Textual, discrete
  Source of constraints and grouping columns
Hierarchies on Dimensions
The different levels of detail for an attribute can be organized into a hierarchy
Data Modeling Example
Requirements
  A convenience store chain
  Sale results by date, item, and store
  Sales by area, month, item or brand
  Performance of promotions
Fact table
  Fact event: sale transaction
  Atomic unit: item, day, store, …
  Fact table attributes: sales quantity, amount, cost
Dimension tables
  time: key, date, month, quarter, year, day-of-week, holiday-flag, season
  item: key, name, category, package-type, size, weight, color
  store: key, name, street, city, zipcode, area, manager
  promotion: key, name, type, discount rate, display-type, cost
Data Modeling – Multi-dim. Star Schema
[Figure: star schema with one fact table surrounded by four dimension tables]
  Fact table: timeKey, item#, store#, promotion#, items_sold, amount_sold, cost
  Item dimension: item#, item name, item class, brand, package, size, weight, color
  Time dimension: timeKey, date, dayofWeek, month, quarter, year, holiday, season
  Promotion dimension: promotion#, promo_name, promo_type, discount%, displays, promo_cost, promo_dates
  Store dimension: store#, store_name, address, city, zipcode, area, manager
Q: The top selling brands on holidays in Q1?
Example Queries
For the past 3 months, compare holiday and workday sales for each brand
  Time: restriction (month) & grouping (holiday)
  Item: grouping (brand)
  Measure: sum(amount_sold)
Last quarter profitability per promotion type
  Time: restriction (quarter)
  Promotion: grouping (promo_type)
  Measure: (sum(amount_sold) - sum(cost)) / sum(promo_cost)
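The first example query can be tried against a toy star schema. What follows is a minimal sketch using Python's built-in sqlite3 module; the table names (sales_fact, time_dim, item_dim), the column item_no standing in for item#, the sample rows, and the simplification of "past 3 months" to months 1 to 3 are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (timeKey INTEGER PRIMARY KEY, month INTEGER, holiday INTEGER);
CREATE TABLE item_dim (item_no INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE sales_fact (timeKey INTEGER, item_no INTEGER, amount_sold REAL);
INSERT INTO time_dim VALUES (1, 1, 1), (2, 1, 0), (3, 2, 0), (4, 7, 1);
INSERT INTO item_dim VALUES (10, 'Acme'), (20, 'Zenith');
INSERT INTO sales_fact VALUES
    (1, 10, 100.0), (2, 10, 40.0), (3, 10, 60.0),
    (1, 20, 30.0), (3, 20, 70.0), (4, 20, 999.0);
""")

# Time dimension: restriction (month) and grouping (holiday).
# Item dimension: grouping (brand). Measure: sum(amount_sold).
rows = conn.execute("""
    SELECT i.brand, t.holiday, SUM(f.amount_sold)
    FROM sales_fact f
    JOIN time_dim t ON f.timeKey = t.timeKey
    JOIN item_dim i ON f.item_no = i.item_no
    WHERE t.month BETWEEN 1 AND 3
    GROUP BY i.brand, t.holiday
    ORDER BY i.brand, t.holiday
""").fetchall()
```

Each dimension table joins to the fact table on a single key, so the query shape stays the same whichever dimensions supply the restriction and the grouping columns.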
Data Warehouse Tuning
Characteristics
  Data warehouses are bigger: terabytes to petabytes
  They change less often
  Queries use aggregation or complex WHERE clauses
Implications
  Scanning all data is very slow
  We can keep redundant data
  Indexes are important
Tuning techniques
  Bitmap index
  Materialized view
  Approximate answer
Bitmaps
Order of magnitude improvement compared to scan.
Bitmaps are best suited for multiple conditions on several attributes, each having a low selectivity.
[Figure: example bitmaps on color (R, B, Y) and holiday_flag (Y, N)]
[Figure: two charts, Response Time (sec) and Throughput (Queries/sec); legible labels: linear scan, bitmap, bitmap - two attributes, r-tree, point query, range query]
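Why bitmaps suit conjunctions of conditions on several attributes can be seen in a small sketch. It is an illustration only: the rows, the helper names build_bitmaps and matching_rows, and the use of Python integers as bit vectors are assumptions, not how any particular DBMS stores its bitmap indexes.

```python
# Hypothetical rows: (color, holiday_flag)
rows = [("R", "Y"), ("B", "N"), ("Y", "Y"), ("R", "N"), ("R", "Y"), ("B", "Y")]


def build_bitmaps(rows, column):
    """One integer per distinct value; bit i is set when row i has that value."""
    bitmaps = {}
    for i, row in enumerate(rows):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << i)
    return bitmaps


def matching_rows(bitmap):
    """Row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


color_index = build_bitmaps(rows, 0)
holiday_index = build_bitmaps(rows, 1)

# color = 'R' AND holiday_flag = 'Y' becomes a single bitwise AND.
red_holiday = matching_rows(color_index["R"] & holiday_index["Y"])
```

Each additional condition costs one more bitwise operation over compact bit vectors, and only the rows that survive all conditions are fetched.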
Materialized Views
[Figure: levels of data in the warehouse, loaded by ETL from the OLTP system, with metadata alongside]
  Highly summarized: monthly sales per item-type for past 10 years
  Lightly summarized: weekly sales for past 5 years
  Current detail: sales detail of current year
  Older detail: sales detail for past 00 years
Materialized Views
Graph:
  Oracle9i on Linux
  Total sale by vendor is materialized
Trade-off between query speed-up and view maintenance:
  The impact of incremental maintenance on performance is significant.
  Rebuild maintenance achieves a good throughput.
  A static data warehouse offers a good trade-off.
[Figure: Response Time (sec), Materialized View vs. No Materialized View]
[Figure: Throughput (statements/sec) for Materialized View (fast on commit), Materialized View (complete on demand), and No Materialized View]
Approximations
Ideal application parameters: broad, dynamic, large data
Rippling
  When computing the average of all salaries
  If the average value changes little as the sample size increases, then it is probably accurate for all data (within some error bound)
  Can be extended to multiple joins
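The salary-average idea can be sketched as a running average over a growing random sample that stops once the estimate settles. The function name approximate_mean, the batch size, the tolerance, and the synthetic salary data are all assumptions chosen for illustration; a stable running average suggests, but does not prove, that the estimate is close to the true value.

```python
import random


def approximate_mean(values, batch=1000, tolerance=0.001, seed=0):
    """Average a growing random sample until the estimate changes little."""
    order = list(values)
    if not order:
        raise ValueError("values must not be empty")
    random.Random(seed).shuffle(order)
    total, count, previous = 0.0, 0, None
    for start in range(0, len(order), batch):
        chunk = order[start:start + batch]
        total += sum(chunk)
        count += len(chunk)
        estimate = total / count
        # Stop when one more batch moves the estimate by less than the tolerance.
        if previous is not None and abs(estimate - previous) <= tolerance * abs(previous):
            return estimate, count
        previous = estimate
    return total / count, count


# Synthetic salaries, generated only to exercise the function.
salary_rng = random.Random(42)
salaries = [salary_rng.uniform(30_000, 90_000) for _ in range(200_000)]
estimate, rows_read = approximate_mean(salaries)
true_mean = sum(salaries) / len(salaries)
```

The function returns both the estimate and the number of rows it read, so the saving relative to a full scan is visible to the caller.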
Approximations (cont.)
Good approximation for query Q1 on lineitem
The aggregated values obtained on a query with a 6-way join are significantly different from the actual values; for some applications they may still be good enough.
[Figure: TPC-H Query Q1 (lineitem): approximated aggregate values (% of base aggregate value) for a 1% sample and a 10% sample, by aggregate value 1 to 8]
[Figure: TPC-H Query Q5 (6-way join): approximated aggregated values (% of base aggregate value) for a 1% sample and a 10% sample, by group returned in the query 1 to 5]
[Figure: response time (sec) on TPC-H queries Q1 and Q5 for base relations, 1% sample, and 10% sample]
Other Issues
When and how to gather data
  Source-driven architecture
  Destination-driven architecture
  Data warehouses typically have slightly out-of-date data
What schema to use
  Usually an integration of multiple sources
  Must decide on a unified (consolidated) view of the data
Data cleansing
  The process of correcting and preprocessing data
  Merge-purge operation
  Householding
How to propagate updates
  View-maintenance problem
What data to summarize
  Maintain only summary data obtained by aggregation on a relation
  How do we decide on the unit granularity?
Data Mining
Data mining is the process of semi-automatically analyzing large databases to find useful patterns
  Knowledge Discovery in (Large) Databases (KDD)
  Also called "knowledge extraction", "data dredging", "data archaeology", "data/pattern analysis", etc.
Prediction based on past history
  Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history
  Predict if a pattern of phone calling card usage is likely to be fraudulent
Data Mining (Cont.)
Some examples of prediction mechanisms:
  Classification
    Given a new item whose class is unknown, predict to which class it belongs
  Regression formulae
    Given a set of mappings for an unknown function, predict the function result for a new parameter value
Associations
  Find books that are often bought by "similar" customers. If a new such customer buys one such book, suggest the others too.
  Associations may be used as a first step in detecting causation
    E.g. association between exposure to chemical X and cancer
Clusters
  E.g. typhoid cases were clustered in an area surrounding a contaminated well
  Detection of clusters remains important in detecting epidemics
Classification Rules
Classification rules help assign new objects to classes.
  E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
Classification rules for the above example could use a variety of data, such as educational level, salary, age, etc.
    ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
    ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
Rules are not necessarily exact: there may be some misclassifications
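The two example rules translate directly into a function. This is a sketch for illustration: the function name classify_credit and the choice to return None when no rule applies are assumptions, since the slides do not say what happens to a person matched by neither rule.

```python
def classify_credit(degree, income):
    """Apply the two example rules; return None when neither rule fires."""
    if degree == "masters" and income > 75_000:
        return "excellent"
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "good"
    return None


applicant_credit = classify_credit("masters", 80_000)
```

A full classifier would need rules (or a default class) that cover every combination of degree and income.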
Decision Tree
Classification rules can be shown compactly as a decision tree.
Other Types of Classifiers
Neural net classifiers are studied in artificial intelligence and are not covered here
Bayesian classifiers use Bayes' theorem, which says
    $p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$
where
    $p(c_j \mid d)$ = probability of instance $d$ being in class $c_j$
    $p(d \mid c_j)$ = probability of generating instance $d$ given class $c_j$
    $p(c_j)$ = probability of occurrence of class $c_j$
    $p(d)$ = probability of instance $d$ occurring
Naïve Bayesian Classifiers
Bayesian classifiers require
  computation of $p(d \mid c_j)$
  precomputation of $p(c_j)$
  $p(d)$ can be ignored since it is the same for all classes
To simplify the task, assume attributes have independent distributions:
    $p(d \mid c_j) = p(d_1 \mid c_j) \cdot p(d_2 \mid c_j) \cdots p(d_n \mid c_j)$
Each of the $p(d_i \mid c_j)$ can be estimated from a histogram on $d_i$ values for each class $c_j$
  the histogram is computed from the training instances
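The histogram-based estimate can be sketched as follows. The training data, attribute names, and the functions train and classify are hypothetical, and no smoothing is applied, so an attribute value never seen with a class drives that class's score to zero.

```python
from collections import Counter, defaultdict


def train(instances):
    """instances: list of (attribute tuple, class label)."""
    class_counts = Counter(label for _, label in instances)
    # value_counts[label][i][value]: histogram of the i-th attribute per class
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for attributes, label in instances:
        for i, value in enumerate(attributes):
            value_counts[label][i][value] += 1
    return class_counts, value_counts


def classify(model, attributes):
    """Return the most probable class and the unnormalized score per class."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # p(c_j)
        for i, value in enumerate(attributes):
            score *= value_counts[label][i][value] / n  # p(d_i | c_j)
        scores[label] = score  # proportional to p(c_j | d); p(d) is ignored
    return max(scores, key=scores.get), scores


# Hypothetical training instances: (degree, income band) -> credit class
training = [
    (("masters", "high"), "excellent"),
    (("masters", "high"), "excellent"),
    (("bachelors", "high"), "excellent"),
    (("bachelors", "medium"), "good"),
    (("bachelors", "medium"), "good"),
    (("masters", "medium"), "good"),
]
model = train(training)
label, scores = classify(model, ("masters", "high"))
```

Because p(d) is the same for every class, comparing the unnormalized products is enough to pick the most probable class.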
Association Rules
Examples
  Someone who buys bread is quite likely also to buy milk
  A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts
Association rules:
    bread ⇒ milk
    DB-Concepts, OS-Concepts ⇒ Networks
  Left hand side: antecedent, right hand side: consequent
  An association rule must have an associated population; the population consists of a set of instances
    E.g. each transaction (sale) at a shop is an instance, and the set of all transactions is the population
Association Rules (Cont.)
Rules have an associated support, as well as an associated confidence.
Support
  P(X ∩ Y): how often does this pattern occur
  What fraction of the population satisfies both the antecedent and the consequent of the rule
  E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low.
Confidence
  P(Y | X): how prevalent is the pattern given X
  How often the consequent is true when the antecedent is true
  E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.
Finding Association Rules
To discover association rules of the form (i1, i2, …, in) ⇒ i0
  Find large itemsets
    large itemset S: set of items with sufficient support
  Find rule (S – s) ⇒ s for every subset s ⊂ S where (S – s) ⇒ s has sufficient confidence
Example baskets
  B1 = {m, c, b}    B2 = {m, p, j}
  B3 = {m, b}       B4 = {c, j}
  B5 = {m, p, b}    B6 = {m, c, b, j}
  B7 = {c, b, j}    B8 = {b, c}
An association rule: {m, b} ⇒ c
  {m, b} appears in B1, B3, B5, and B6; of these, B1 and B6 also contain c
  Confidence = 2/4 = 50%
  Support = 2/8 = 25%
A typical question is "find all association rules with support >= s and confidence >= c."
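The support and confidence figures for the rule {m, b} ⇒ c can be checked mechanically against the eight baskets B1 to B8. The helper names support and confidence below are illustrative assumptions; the basket contents are taken from the slide.

```python
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """P(consequent | antecedent), estimated from the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"m", "b", "c"}, baskets)          # 2 of 8 baskets
rule_confidence = confidence({"m", "b"}, {"c"}, baskets)  # 2 of the 4 with {m, b}
```

Confidence is computed here as support(X ∪ Y) / support(X), which is the same ratio as the 2/4 on the slide.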
Finding Association Rules (cont.)
Sample Database
  1. A supermarket sells 10,000 items
  2. The average basket has 10 items
  3. There are 10,000,000 baskets in the database
Table structure
  Baskets_long(BID, item1, item2, …, item10)
  Baskets(BID, item)
    10 × 10,000,000 = 100,000,000 records
The support threshold is 1% of the baskets
  i.e., X ⇒ Y is interesting only if X, Y are purchased together at least 1% of the time (100,000 baskets)
Frequent Pairs in SQL
    SELECT b1.item, b2.item
    FROM Baskets b1, Baskets b2    -- look for two Baskets tuples
    WHERE b1.BID = b2.BID          -- with the same basket and different items
      AND b1.item < b2.item        -- first item must precede second, so we don't count the same pair twice
    GROUP BY b1.item, b2.item      -- create a group for each pair of items that appears in at least one basket
    HAVING COUNT(*) >= 100000;     -- throw away pairs of items that do not appear at least s times
A-Priori Algorithm
The naïve way (query on the previous page)
  Join complexity: 100,000,000 × 100,000,000
  Each basket contributes C(10, 2) = 45 pairs
  Result of join has 45 × 10,000,000 records
  Need to perform Group by and Having
The A-Priori algorithm
  Key idea: if item i does not appear in s baskets, then no pair including i can appear in s baskets
  So, do not consider items that appear in fewer than s baskets
A-Priori Algorithm (cont.)
The new query
    INSERT INTO Baskets1(BID, item)
    SELECT * FROM Baskets
    WHERE item IN ( SELECT item FROM Baskets
                    GROUP BY item
                    HAVING COUNT(*) >= 100000 );
Perform previous query with Baskets1
Join computation
  Computing Baskets1 is cheap since there is no join
    Or could have built index from beginning
  At most 1,000 items can be frequent (100,000,000 records / 100,000 occurrences each)
  In practice Baskets1 is less than ¼ of Baskets
  Join complexity is N squared
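The same two-pass idea can be sketched in memory. The function name frequent_pairs and the small basket list (reused from the earlier {m, b} ⇒ c example) are assumptions for illustration; the threshold s = 3 stands in for the 100,000 used with the sample database.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, s):
    """Return {pair: count} for item pairs that appear in at least s baskets."""
    # Pass 1: count single items and keep only the frequent ones.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, n in item_counts.items() if n >= s}
    # Pass 2: count pairs built only from frequent items.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= s}


baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
pairs = frequent_pairs(baskets, s=3)
```

Item p appears in only two baskets, so it is dropped in the first pass and no pair containing p is ever counted, which mirrors the role of Baskets1 in the SQL version.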
Other Types of Mining
Text mining: application of data mining to textual documents
  cluster Web pages to find related pages
  cluster pages a user has visited to organize their visit history
  classify Web pages automatically into a Web directory
Data visualization systems help users examine large volumes of data and detect patterns visually
  Can visually encode large amounts of information on a single screen
  Humans are very good at detecting visual patterns
END OF CHAPTER 18