carnegie mellon univ. dept. of computer science 15-415/615 – db applications

72
CMU SCS Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications Lecture # 24: Data Warehousing / Data Mining (R&G, ch 25 and 26)

Upload: biana

Post on 11-Jan-2016

17 views

Category:

Documents


0 download

DESCRIPTION

Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications. Lecture # 24: Data Warehousing / Data Mining (R&G, ch 25 and 26). Data mining - detailed outline. Problem Getting the data: Data Warehouses, DataCubes, OLAP Supervised learning: decision trees - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Carnegie Mellon Univ.Dept. of Computer Science

15-415/615 – DB Applications

Lecture # 24: Data Warehousing / Data Mining

(R&G, ch 25 and 26)

Page 2: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 2

Data mining - detailed outline• Problem• Getting the data: Data Warehouses, DataCubes,

OLAP• Supervised learning: decision trees• Unsupervised learning

– association rules– (clustering)

Page 3: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 3

Problem

Given: multiple data sourcesFind: patterns (classifiers, rules, clusters, outliers...)

sales(p-id, c-id, date, $price)

customers( c-id, age, income, ...)

NY

SF

???

PGH

Page 4: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 4

Data Ware-housing

First step: collect the data, in a single place (= Data Warehouse)

How?How often?How about discrepancies / non-

homegeneities?

Page 5: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 5

Data Ware-housing

First step: collect the data, in a single place (= Data Warehouse)

How? A: Triggers/Materialized viewsHow often? A: [Art!]How about discrepancies / non-

homegeneities? A: Wrappers/Mediators

Page 6: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 6

Data Ware-housing

Step 2: collect counts. (DataCubes/OLAP) Eg.:

Page 7: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 7

OLAP

Problem: “is it true that shirts in large sizes sell better in dark colors?”

ci-d p-id Size Color $

C10 Shirt L Blue 30

C10 Pants XL Red 50

C20 Shirt XL White 20

sales

...

C / S S M L TOT

Red 20 3 5 28

Blue 3 3 8 14

Gray 0 0 5 5

TOT 23 6 18 47

Page 8: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 8

DataCubes

‘color’, ‘size’: DIMENSIONS‘count’: MEASURE

C / S S M L TOT

Red 20 3 5 28

Blue 3 3 8 14

Gray 0 0 5 5

TOT 23 6 18 47

colorsize

color; size

Page 9: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 9

DataCubes

‘color’, ‘size’: DIMENSIONS‘count’: MEASURE

C / S S M L TOT

Red 20 3 5 28

Blue 3 3 8 14

Gray 0 0 5 5

TOT 23 6 18 47

colorsize

color; size

Page 10: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 10

DataCubes

‘color’, ‘size’: DIMENSIONS‘count’: MEASURE

C / S S M L TOT

Red 20 3 5 28

Blue 3 3 8 14

Gray 0 0 5 5

TOT 23 6 18 47

colorsize

color; size

Page 11: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 11

DataCubes

‘color’, ‘size’: DIMENSIONS‘count’: MEASURE

C / S S M L TOT

Red 20 3 5 28

Blue 3 3 8 14

Gray 0 0 5 5

TOT 23 6 18 47

colorsize

color; size

Page 12: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 12

DataCubes

‘color’, ‘size’: DIMENSIONS‘count’: MEASURE

C / S S M L TOT

Red 20 3 5 28

Blue 3 3 8 14

Gray 0 0 5 5

TOT 23 6 18 47

colorsize

color; size

Page 13: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 13

DataCubes

‘color’, ‘size’: DIMENSIONS‘count’: MEASURE

C / S S M L TOT

Red 20 3 5 28

Blue 3 3 8 14

Gray 0 0 5 5

TOT 23 6 18 47

colorsize

color; size

DataCube

Page 14: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 14

DataCubes

SQL query to generate DataCube:• Naively (and painfully:)

select size, color, count(*)from sales where p-id = ‘shirt’group by size, color

select size, count(*)from sales where p-id = ‘shirt’group by size...

Page 15: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 15

DataCubes

SQL query to generate DataCube:• with ‘cube by’ keyword:

select size, color, count(*)from saleswhere p-id = ‘shirt’cube by size, color

Page 16: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 16

DataCubes

DataCube issues:Q1: How to store them (and/or materialize

portions on demand)Q2: Which operations to allow

Page 17: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 17

DataCubes

DataCube issues:Q1: How to store them (and/or materialize

portions on demand) A: ROLAP/MOLAPQ2: Which operations to allow A: roll-up,

drill down, slice, dice[More details: book by Han+Kamber]

Page 18: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 18

DataCubes

Q1: How to store a dataCube?

C / S S M L TOT

Red 20 3 5 28

Blue 3 3 8 14

Gray 0 0 5 5

TOT 23 6 18 47

Page 19: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 19

DataCubes

Q1: How to store a dataCube?A1: Relational (R-OLAP)

C / S S M L TOT

Red 20 3 5 28

Blue 3 3 8 14

Gray 0 0 5 5

TOT 23 6 18 47

Color Size count

'all' 'all' 47

Blue 'all' 14

Blue M 3

Page 20: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 20

DataCubes

Q1: How to store a dataCube?A2: Multi-dimensional (M-OLAP)A3: Hybrid (H-OLAP) C / S S M L TOT

Red 20 3 5 28

Blue 3 3 8 14

Gray 0 0 5 5

TOT 23 6 18 47

Page 21: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 21

DataCubes

Pros/Cons:ROLAP strong points: (DSS, Metacube)

Page 22: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 22

DataCubes

Pros/Cons:ROLAP strong points: (DSS, Metacube)• use existing RDBMS technology• scale up better with dimensionality

Page 23: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 23

DataCubes

Pros/Cons:MOLAP strong points: (EssBase/hyperion.com)• faster indexing(careful with: high-dimensionality; sparseness)

HOLAP: (MS SQL server OLAP services)• detail data in ROLAP; summaries in MOLAP

Page 24: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 24

DataCubes

Q1: How to store a dataCubeQ2: What operations should we support?

Page 25: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 25

DataCubes

Q2: What operations should we support?

C / S S M L TOT

Red 20 3 5 28

Blue 3 3 8 14

Gray 0 0 5 5

TOT 23 6 18 47

colorsize

color; size

Page 26: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 26

DataCubes

Q2: What operations should we support?Roll-up

C / S S M L TOT

Red 20 3 5 28

Blue 3 3 8 14

Gray 0 0 5 5

TOT 23 6 18 47

colorsize

color; size

Page 27: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 27

DataCubes

Q2: What operations should we support?Drill-down

C / S S M L TOT

Red 20 3 5 28

Blue 3 3 8 14

Gray 0 0 5 5

TOT 23 6 18 47

colorsize

color; size

Page 28: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 28

DataCubes

Q2: What operations should we support?Slice

C / S S M L TOT

Red 20 3 5 28

Blue 3 3 8 14

Gray 0 0 5 5

TOT 23 6 18 47

colorsize

color; size

Page 29: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 29

DataCubes

Q2: What operations should we support?Dice

C / S S M L TOT

Red 20 3 5 28

Blue 3 3 8 14

Gray 0 0 5 5

TOT 23 6 18 47

colorsize

color; size

Page 30: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 30

DataCubes

Q2: What operations should we support?• Roll-up• Drill-down• Slice• Dice• (Pivot/rotate; drill-across; drill-through• top N• moving averages, etc)

Page 31: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 31

D/W - OLAP - Conclusions

• D/W: copy (summarized) data + analyze• OLAP - concepts:

– DataCube– R/M/H-OLAP servers– ‘dimensions’; ‘measures’

Page 32: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 32

Outline• Problem• Getting the data: Data Warehouses, DataCubes,

OLAP• Supervised learning: decision trees• Unsupervised learning

– association rules– (clustering)

Page 33: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 33

Decision trees - ProblemAge Chol-level Gender … CLASS-ID

30 150 M +

-

??

Page 34: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 34

Decision trees

• Pictorially, we have

num. attr#1 (eg., ‘age’)

num. attr#2(eg., chol-level)

+

-++ +

++

+

-

--

--

Page 35: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 35

Decision trees

• and we want to label ‘?’

num. attr#1 (eg., ‘age’)

num. attr#2(eg., chol-level)

+

-++ +

++

+

-

--

--

?

Page 36: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 36

Decision trees

• so we build a decision tree:

num. attr#1 (eg., ‘age’)

num. attr#2(eg., chol-level)

+

-++ +

++

+

-

--

--

?

50

40

Page 37: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 37

Decision trees

• so we build a decision tree:

age<50

Y

+ chol. <40

N

- ...

Y N

Page 38: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 38

Outline• Problem• Getting the data: Data Warehouses, DataCubes,

OLAP• Supervised learning: decision trees

– problem– approach– scalability enhancements

• Unsupervised learning– association rules– (clustering)

Page 39: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 39

Decision trees

• Typically, two steps:– tree building– tree pruning (for over-training/over-fitting)

Page 40: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 40

Tree building

• How?

num. attr#1 (eg., ‘age’)

num. attr#2(eg., chol-level)

+

-++ +

++

+

-

--

--

Page 41: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 41

Tree building

• How?

• A: Partition, recursively - pseudocode:Partition ( Dataset S)

if all points in S have same label

then return

evaluate splits along each attribute A

pick best split, to divide S into S1 and S2

Partition(S1); Partition(S2)

Page 42: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 42

Tree building

• Q1: how to introduce splits along attribute Ai

• Q2: how to evaluate a split?

Not In Exam=N.I.E.

Page 43: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 43

Tree building

• Q1: how to introduce splits along attribute Ai

• A1:– for num. attributes:

• binary split, or

• multiple split

– for categorical attributes:• compute all subsets (expensive!), or

• use a greedy algo

N.I.E.

Page 44: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 44

Tree building

• Q1: how to introduce splits along attribute Ai

• Q2: how to evaluate a split?

N.I.E.

Page 45: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 45

Tree building

• Q1: how to introduce splits along attribute Ai

• Q2: how to evaluate a split?

• A: by how close to uniform each subset is - ie., we need a measure of uniformity:

N.I.E.

Page 46: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 46

Tree building

entropy: H(p+, p-)

p+10

0

1

0.5

Any other measure?

N.I.E.

Page 47: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 47

Tree building

entropy: H(p+, p-)

p+10

0

1

0.5

‘gini’ index: 1-p+2 - p-

2

p+10

0

1

0.5

N.I.E.

Page 48: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 48

Tree building

entropy: H(p+, p-) ‘gini’ index: 1-p+2 - p-

2

(How about multiple labels?)

N.I.E.

Page 49: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 49

Tree building

Intuition:

• entropy: #bits to encode the class label

• gini: classification error, if we randomly guess ‘+’ with prob. p+

N.I.E.

Page 50: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 50

Tree building

Thus, we choose the split that reduces entropy/classification-error the most: Eg.:

num. attr#1 (eg., ‘age’)

num. attr#2(eg., chol-level)

+

-++ +

++

+

-

----

N.I.E.

Page 51: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 51

Tree building

• Before split: we need(n+ + n-) * H( p+, p-) = (7+6) * H(7/13, 6/13)

bits total, to encode all the class labels

• After the split we need:0 bits for the first half and

(2+6) * H(2/8, 6/8) bits for the second half

N.I.E.

Page 52: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 52

Tree pruning• What for?

num. attr#1 (eg., ‘age’)

num. attr#2(eg., chol-level)

+

-++ +

++

+

-

----

...

N.I.E.

Page 53: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 53

Tree pruningShortcut for scalability: DYNAMIC pruning:• stop expanding the tree, if a node is

‘reasonably’ homogeneous– ad hoc threshold [Agrawal+, vldb92]– ( Minimum Description Language (MDL)

criterion (SLIQ) [Mehta+, edbt96] )

N.I.E.

Page 54: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 54

Tree pruning• Q: How to do it?• A1: use a ‘training’ and a ‘testing’ set -

prune nodes that improve classification in the ‘testing’ set. (Drawbacks?)

• (A2: or, rely on MDL (= Minimum Description Language) )

N.I.E.

Page 55: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 55

Outline• Problem• Getting the data: Data Warehouses, DataCubes,

OLAP• Supervised learning: decision trees

– problem– approach– scalability enhancements

• Unsupervised learning– association rules– (clustering)

N.I.E.

Page 56: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 56

Scalability enhancements• Interval Classifier [Agrawal+,vldb92]:

dynamic pruning• SLIQ: dynamic pruning with MDL; vertical

partitioning of the file (but label column has to fit in core)

• SPRINT: even more clever partitioning

N.I.E.

Page 57: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 57

Conclusions for classifiers• Classification through trees• Building phase - splitting policies• Pruning phase (to avoid over-fitting)• For scalability:

– dynamic pruning– clever data partitioning

Page 58: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 58

Outline• Problem• Getting the data: Data Warehouses, DataCubes,

OLAP• Supervised learning: decision trees

– problem– approach– scalability enhancements

• Unsupervised learning– association rules– (clustering)

Page 59: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 59

Association rules - idea

[Agrawal+SIGMOD93]• Consider ‘market basket’ case:

(milk, bread)(milk)(milk, chocolate)(milk, bread)

• Find ‘interesting things’, eg., rules of the form: milk, bread -> chocolate | 90%

Page 60: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 60

Association rules - idea

In general, for a given ruleIj, Ik, ... Im -> Ix | c

‘c’ = ‘confidence’ (how often people by Ix, given that they have bought Ij, ... Im

‘s’ = support: how often people buy Ij, ... Im, Ix

Page 61: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 61

Association rules - idea

Problem definition:• given

– a set of ‘market baskets’ (=binary matrix, of N rows/baskets and M columns/products)

– min-support ‘s’ and– min-confidence ‘c’

• find– all the rules with higher support and confidence

Page 62: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 62

Association rules - idea

Closely related concept: “large itemset”Ij, Ik, ... Im, Ix

is a ‘large itemset’, if it appears more than ‘min-support’ times

Observation: once we have a ‘large itemset’, we can find out the qualifying rules easily (how?)

Thus, let’s focus on how to find ‘large itemsets’

Page 63: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 63

Association rules - idea

Naive solution: scan database once; keep 2**|I| counters

Drawback?Improvement?

Page 64: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 64

Association rules - idea

Naive solution: scan database once; keep 2**|I| counters

Drawback? 2**1000 is prohibitive...Improvement? scan the db |I| times, looking for 1-,

2-, etc itemsets

Eg., for |I|=3 items only (A, B, C), we have

Page 65: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 65

Association rules - idea

A B Cfirst pass

100 200 2

min-sup:10

Page 66: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 66

Association rules - idea

A B Cfirst pass

100 200 2

min-sup:10

A,B A,C B,C

Page 67: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 67

Association rules - idea

Anti-monotonicity property:if an itemset fails to be ‘large’, so will every superset

of it (hence all supersets can be pruned)

Sketch of the (famous!) ‘a-priori’ algorithmLet L(i-1) be the set of large itemsets with i-1

elementsLet C(i) be the set of candidate itemsets (of size i)

Page 68: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 68

Association rules - idea

Compute L(1), by scanning the database.repeat, for i=2,3...,

‘join’ L(i-1) with itself, to generate C(i)two itemset can be joined, if they agree on their first i-2 elements

prune the itemsets of C(i) (how?)scan the db, finding the counts of the C(i) itemsets - set

this to be L(i)unless L(i) is empty, repeat the loop

Page 69: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 69

Association rules - Conclusions

Association rules: a great tool to find patterns• easy to understand its output• fine-tuned algorithms exist

Page 70: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 70

Overall Conclusions• Data Mining = ``Big Data’’ Analytics = Business

Intelligence:– of high commercial, government and research interest

• DM = DB+ ML+ Stat+Sys

• Data warehousing / OLAP: to get the data• Tree classifiers (SLIQ, SPRINT)• Association Rules - ‘a-priori’ algorithm• (clustering: BIRCH, CURE, OPTICS)

Page 71: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 71

Reading material

• Agrawal, R., T. Imielinski, A. Swami, ‘Mining Association Rules between Sets of Items in Large Databases’, SIGMOD 1993.

• M. Mehta, R. Agrawal and J. Rissanen, `SLIQ: A Fast Scalable Classifier for Data Mining', Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996

Page 72: Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications

CMU SCS

Faloutsos & Pavlo CMU SCS 15-415/615 72

Additional references

• Agrawal, R., S. Ghosh, et al. (Aug. 23-27, 1992). An Interval Classifier for Database Mining Applications. VLDB Conf. Proc., Vancouver, BC, Canada.

• Jiawei Han and Micheline Kamber, Data Mining , Morgan Kaufman, 2001, chapters 2.2-2.3, 6.1-6.2, 7.3.5