Ranking Interesting Subgroups

Stefan Rüping, Fraunhofer IAIS, [email protected]



Page 1: Ranking Interesting Subgroups

Stefan Rüping, Fraunhofer IAIS, [email protected]

Ranking Interesting Subgroups

Page 2: Ranking Interesting Subgroups

Fraunhofer Web-Projekt, kick-off on 17 July 2008

Motivation

1. name_score >= 1 & geoscore >= 1 & housing >= 5 → p = 41.6%

2. Income_score >= 5 & name_score >= 5 & housing >= 5 → p = 36.0%

3. Active_housholds >= 3 & queries_per_household >= 1 & housing >= 5 → p = 43.8%

4. Families == 0 & name_score >= 1 & housing == 0 → p = 28.9%

5. Financial_status == 0 & name_score >= 3 & housing <= 5 → p = 66.1%


Page 4: Ranking Interesting Subgroups

Motivation

• Applying ranking to complex data: subgroup models
• Optimization of data mining models for non-expert users

Page 5: Ranking Interesting Subgroups

Overview

Introduction to Subgroup Discovery
Interesting Patterns
Ranking Subgroups
• Representation
• Ranking SVMs
• Iterative algorithm
Experiments
Conclusions

Page 6: Ranking Interesting Subgroups

Subgroup Discovery

Input
• X defined by nominal attributes $A_1, \ldots, A_d$
• Data $(x_1, y_1), \ldots, (x_n, y_n) \in X \times \{0, 1\}$

Subgroup language
• Propositional formula $A_{i_1} = v_{j_1} \wedge A_{i_2} = v_{j_2} \wedge \ldots$

For a subgroup S let
• $g(S) = \#\{x_i \in S\} / n$ (subgroup size)
• $p(S) = \#\{x_i \in S \mid y_i = 1\} / \#\{x_i \in S\}$ (class probability), $p_0 = \#\{y_i = 1\} / n$
• $q(S) = g(S)^a \, (p(S) - p_0)$ (subgroup quality = significance of pattern; a = 0.5: t-test)

Task
• Find k subgroups with highest significance (maximal quality q)
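The task statement above can be made concrete with a brute-force sketch. It is only an illustration under assumptions: the toy data, the function names `quality` and `top_k_subgroups`, and the `max_depth` limit are hypothetical and not part of the talk, and a real subgroup discovery tool would prune the search rather than enumerate everything.

```python
from itertools import combinations, product

def quality(rows, labels, condition, a=0.5):
    """q(S) = g(S)^a * (p(S) - p0) for the conjunction `condition`.

    `condition` maps attribute name -> required value.
    Returns None for an empty subgroup, where p(S) is undefined.
    """
    n = len(rows)
    covered = [y for x, y in zip(rows, labels)
               if all(x[attr] == val for attr, val in condition.items())]
    if not covered:
        return None
    g = len(covered) / n              # subgroup size
    p = sum(covered) / len(covered)   # class probability inside the subgroup
    p0 = sum(labels) / n              # class probability overall
    return g ** a * (p - p0)

def top_k_subgroups(rows, labels, k, max_depth=2, a=0.5):
    """Score every conjunction of up to `max_depth` attribute tests, keep the best k."""
    attrs = sorted(rows[0])
    values = {attr: sorted({x[attr] for x in rows}) for attr in attrs}
    scored = []
    for depth in range(1, max_depth + 1):
        for chosen in combinations(attrs, depth):
            for vals in product(*(values[attr] for attr in chosen)):
                cond = dict(zip(chosen, vals))
                q = quality(rows, labels, cond, a)
                if q is not None:
                    scored.append((q, cond))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

# Hypothetical toy data: the label depends only on "color".
rows = [
    {"color": "red", "size": "big"},
    {"color": "red", "size": "small"},
    {"color": "blue", "size": "big"},
    {"color": "blue", "size": "small"},
]
labels = [1, 1, 0, 0]
top = top_k_subgroups(rows, labels, k=2)
```

On this toy data the single test `color = red` scores highest, because refining it with `size` halves g(S) without raising p(S).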

Page 7: Ranking Interesting Subgroups

Subgroup Discovery: Example

Weather   Advertised   Ice Cream Sales
good      yes          high
good      no           high
good      no           high
good      no           high
bad       no           low
bad       yes          high
bad       no           low
bad       no           low

Page 8: Ranking Interesting Subgroups

Subgroup Discovery: Example

S1: Weather = good → sales = high
• $g(S) = 4/8$
• $p(S) = 4/4$
• $q(S) = (4/8)^{0.5} \, (4/4 - 5/8) = 0.265$

Page 9: Ranking Interesting Subgroups

Subgroup Discovery: Example

S2: Advertised = yes → sales = high
• $g(S) = 2/8$
• $p(S) = 2/2$
• $q(S) = (2/8)^{0.5} \, (2/2 - 5/8) = 0.1875$

Page 10: Ranking Interesting Subgroups

Subgroup Discovery: Example

Significance ≠ Interestingness
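The arithmetic of the ice cream example can be re-checked with a few lines of Python. This is a minimal sketch: the data is transcribed from the example table, while the helper name `subgroup_stats` and the encoding of "high" as 1 are choices made here for illustration.

```python
# The eight rows of the ice cream table, column by column.
weather    = ["good", "good", "good", "good", "bad", "bad", "bad", "bad"]
advertised = ["yes", "no", "no", "no", "no", "yes", "no", "no"]
sales_high = [1, 1, 1, 1, 0, 1, 0, 0]  # 1 = high sales, 0 = low sales

def subgroup_stats(member, labels, a=0.5):
    """Return (g, p, q) for the subgroup marked True in `member`."""
    n = len(labels)
    covered = [y for m, y in zip(member, labels) if m]
    g = len(covered) / n
    p = sum(covered) / len(covered)
    p0 = sum(labels) / n
    return g, p, g ** a * (p - p0)

s1 = [w == "good" for w in weather]        # S1: Weather = good
s2 = [ad == "yes" for ad in advertised]    # S2: Advertised = yes

for name, member in [("S1", s1), ("S2", s2)]:
    g, p, q = subgroup_stats(member, sales_high)
    print(f"{name}: g={g:.3f} p={p:.3f} q={q:.4f}")
```

Both subgroups are pure (p = 1), so the ordering comes entirely from the size term: with a = 0 the two would tie at 0.375.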

Page 11: Ranking Interesting Subgroups

Interesting Patterns

What makes a pattern interesting to the user?
Depends on prior knowledge, but heuristics exist

Attributes
• Actionability
• Acquaintedness

Sub-space
• Novelty

Complexity
• Not too complex
• Not too simple

Page 12: Ranking Interesting Subgroups

Overview: Ranking Interesting Subgroups

[Diagram: Data, Subgroup Discovery, Subgroup Representation, Ranking SVM, Task Modification; user feedback enters as "S1 > S2"]

Page 13: Ranking Interesting Subgroups

Subgroup Representation (1/3)

Subgroups become examples of the ranking learner!

Notation
• $A_i$ = original attribute
• $r(S)$ = representation of subgroup S

Remember: important properties of subgroups
• Attributes
• Examples
• Complexity

Representing complexity
• $r(S)$ includes $g(S)$ and $p(S) - p_0$

Page 14: Ranking Interesting Subgroups

Subgroup Representation (2/3)

Representing attributes
• For each attribute $A_i$ of the original examples, include into the subgroup representation the attribute

$$r_i(S) = \begin{cases} 1 & \text{iff } S \text{ contains } A_i \\ 0 & \text{else} \end{cases}$$

• Observation: a TF/IDF-like representation performs even better

$$r_i^{\mathrm{TFIDF}}(S) = \frac{r_i(S)}{1 + \sum_j r_i(S_j)}$$
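A small sketch of the two attribute representations follows. It assumes the TF/IDF-like variant divides the binary indicator by one plus the number of subgroups that use the attribute, which is how the slide's formula is read here; the function names are hypothetical, and the attribute names are borrowed from the first two Motivation rules.

```python
def binary_representation(subgroup_attrs, all_attrs):
    """r_i(S) = 1 if subgroup S tests attribute A_i, else 0."""
    return [1 if attr in subgroup_attrs else 0 for attr in all_attrs]

def tfidf_like(subgroups, all_attrs):
    """r_i(S) / (1 + sum_j r_i(S_j)), under the reading stated above.

    Attributes that appear in many subgroups get a smaller weight.
    """
    binary = [binary_representation(s, all_attrs) for s in subgroups]
    usage = [sum(column) for column in zip(*binary)]  # sum_j r_i(S_j) per attribute
    return [[r / (1 + u) for r, u in zip(row, usage)] for row in binary]

all_attrs = ["name_score", "geoscore", "housing", "income_score"]
subgroups = [
    {"name_score", "geoscore", "housing"},      # attributes tested by rule 1
    {"income_score", "name_score", "housing"},  # attributes tested by rule 2
]
weighted = tfidf_like(subgroups, all_attrs)
```

Here `name_score` and `housing` occur in both subgroups and are weighted 1/3, while `geoscore` and `income_score` occur once and are weighted 1/2.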

Page 15: Ranking Interesting Subgroups

Subgroup Representation (3/3)

Representing examples
• User may be more interested in a subset of examples
• Construct list of known relevant and irrelevant subgroups from user feedback
• For each subgroup S and each known relevant/irrelevant subgroup T define $r_T(S)$ = relatedness of S to known subgroup T, computed from the overlap $|S \cap T|$ and the sizes $|S|$ and $|T|$

Page 16: Ranking Interesting Subgroups

Ranking Optimization Problem

Rationale
• Subgroup discovery gives quality $q(S) = g(S)^a \, (p(S) - p_0)$
• User defines ranking by pairs "S1 > S2" (S1 is better than S2)
• Find true ranking q* such that $S_1 > S_2 \Leftrightarrow q^*(S_1) > q^*(S_2)$

Assumption

$$q^*(S) = g(S)^a \, (p(S) - p_0) \, e^{\sum_{i \ge 3} w_i r_i(S)}$$

(justified by assuming hidden labels of interestingness of examples)

Define linear ranking function $\log q^*(S) = (a, 1, w) \cdot r(S)$, where the first two components of $r(S)$ are $\log g(S)$ and $\log(p(S) - p_0)$
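The step from the product form of q* to the linear ranking function can be checked numerically. The sketch below assumes that r(S) starts with log g(S) and log(p(S) − p0), followed by the remaining features; this is what makes the weight vector (a, 1, w) reproduce the product form. The function names and feature values are hypothetical, and the logarithm requires p(S) > p0.

```python
import math

def r_vector(g, p, p0, extra):
    """r(S) = (log g(S), log(p(S) - p0), r_3(S), ...). Requires p > p0."""
    return [math.log(g), math.log(p - p0)] + list(extra)

def log_q_star(r, a, w):
    """Linear ranking function log q*(S) = (a, 1, w) . r(S)."""
    weights = [a, 1.0] + list(w)
    return sum(weight * value for weight, value in zip(weights, r))

def q_star_direct(g, p, p0, extra, a, w):
    """Product form q*(S) = g^a * (p - p0) * exp(sum_i w_i r_i(S))."""
    return g ** a * (p - p0) * math.exp(sum(wi * ri for wi, ri in zip(w, extra)))

# Subgroup S1 of the ice cream example plus two made-up extra features.
g, p, p0 = 0.5, 1.0, 0.625
extra, a, w = [1.0, 0.0], 0.5, [0.3, -0.2]

linear = math.exp(log_q_star(r_vector(g, p, p0, extra), a, w))
direct = q_star_direct(g, p, p0, extra, a, w)
```

With w = 0 the function reduces to the original quality q(S), so the learned weights act as a correction on top of the significance score.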

Page 17: Ranking Interesting Subgroups

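Ranking Optimization Problem (2/2)

Solution similar to ranking SVM

Optimization problem:

$$\min_{a, w, \xi} \; \tfrac{1}{2} \, \|(a - a_0, 1, w)\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad \log q^*(S_{i,1}) \ge \log q^*(S_{i,2}) - \xi_i, \quad \xi_i \ge 0$$

Equivalent problem:

$$\min_{a, w, \xi} \; \tfrac{1}{2} \, \|w\|^2 + \tfrac{1}{2} \, (a - a_0)^2 + C \sum_i \xi_i \quad \text{s.t.} \quad (a, 1, w) \cdot z_i \ge -\xi_i, \quad \xi_i \ge 0$$

where $z_i = r(S_{i,1}) - r(S_{i,2})$. Remember $\log q^*(S) = (a, 1, w) \cdot r(S)$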
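To make the optimization problem tangible, here is a minimal sketch that minimizes the unconstrained hinge-loss form ½‖w‖² + ½(a − a0)² + C Σ max(0, margin − (a, 1, w)·z_i) by subgradient descent. Several things are assumptions rather than the talk's method: the optimizer (a ranking SVM would normally be solved as a quadratic program), the `margin` parameter and its default of 0, the step size, and the two toy preference pairs.

```python
def score_diff(z, a, w):
    """(a, 1, w) . z for one preference pair, z = r(S_i1) - r(S_i2)."""
    return a * z[0] + z[1] + sum(wj * zj for wj, zj in zip(w, z[2:]))

def objective(pairs, a, w, a0, C, margin):
    """Regularizer on w, penalty on deviation from a0, hinge loss on the pairs."""
    hinge = sum(max(0.0, margin - score_diff(z, a, w)) for z in pairs)
    return 0.5 * sum(wj * wj for wj in w) + 0.5 * (a - a0) ** 2 + C * hinge

def fit_ranking(pairs, a0=0.5, C=10.0, margin=0.0, lr=0.01, steps=2000):
    """Subgradient descent; returns the (a, w) with the lowest objective seen."""
    n_extra = len(pairs[0]) - 2
    a, w = a0, [0.0] * n_extra
    best = (objective(pairs, a, w, a0, C, margin), a, list(w))
    for _ in range(steps):
        violated = [z for z in pairs if score_diff(z, a, w) < margin]
        grad_a = (a - a0) - C * sum(z[0] for z in violated)
        grad_w = [wj - C * sum(z[2 + j] for z in violated)
                  for j, wj in enumerate(w)]
        a -= lr * grad_a
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        value = objective(pairs, a, w, a0, C, margin)
        if value < best[0]:
            best = (value, a, list(w))
    return best[1], best[2]

# Hypothetical feedback: in both pairs the preferred subgroup has a lower
# log(p - p0) but uses an attribute (third component) the user cares about.
pairs = [(0.0, -0.5, 1.0), (-0.2, -0.3, 1.0)]
a_hat, w_hat = fit_ranking(pairs)
```

On this toy input the fit keeps a close to a0 and raises the attribute weight to about 0.5, just enough to reverse the order that the original quality function would give.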

Page 18: Ranking Interesting Subgroups

Ranking Optimization Problem (2/2)

The objective penalizes the deviation of a from the parameter a0 used in subgroup discovery.

Page 19: Ranking Interesting Subgroups

Ranking Optimization Problem (2/2)

The constant weight 1 in (a, 1, w) defines the margin.

Page 20: Ranking Interesting Subgroups

Iterative Procedure

Why?
• Google: ~10^12 web pages
• Same number of possible subgroups on a 12-dimensional data set with 9 distinct values per attribute
• Cannot compute all subgroups for single-step ranking

Approach
• Optimization problem gives new estimate of a
• Transform weights of subgroup features into weights for original examples
• Idea: replace binary y with numeric value. Appropriate offset guarantees that the subgroup quality q approximates the optimized q*

[Diagram: loop between subgroup search and ranking]
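The count behind the comparison with Google can be reproduced in one line, assuming a subgroup either leaves an attribute unconstrained or fixes it to exactly one of its values. The function name `count_subgroups` is hypothetical.

```python
def count_subgroups(num_attributes, values_per_attribute):
    """Each attribute is either unconstrained or fixed to one of its values.

    The count includes the trivial subgroup that constrains nothing.
    """
    return (values_per_attribute + 1) ** num_attributes

print(count_subgroups(12, 9))  # 1000000000000
```

With 12 attributes of 9 values each this gives (9 + 1)^12 = 10^12 candidate descriptions, which is why the ranking has to be interleaved with the search.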

Page 21: Ranking Interesting Subgroups

Experiments

Simulation on UCI data
• Replace true label with most correlated attribute
• Use true label to simulate user
• Measure correspondence of algorithm's ranking with subgroups found on true label
• Tests ability of approach to flexibly adapt to correlated patterns

Performance measure
• Area under the curve (AUC): retrieval of true top 100 subgroups
• Kendall's τ: internal consistency of returned ranking
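Both performance measures are standard and short to write down. The sketch below gives generic textbook versions (Kendall's tau-a, and AUC as the probability that a relevant item is scored above an irrelevant one); exactly which rankings and relevance sets the experiments compared is not spelled out on the slide, so the example inputs are made up.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs

def auc(scores, relevant):
    """Fraction of (relevant, irrelevant) pairs ordered correctly; ties count 1/2."""
    pos = [s for s, r in zip(scores, relevant) if r]
    neg = [s for s, r in zip(scores, relevant) if not r]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores for four subgroups; the first and third are "truly relevant".
tau = kendall_tau([1, 2, 3], [1, 3, 2])
area = auc([0.9, 0.8, 0.3, 0.1], [True, False, True, False])
```

A τ of 1 means two rankings agree on every pair and −1 means they are reversed; an AUC of 0.5 corresponds to a random ordering of relevant and irrelevant subgroups.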

Page 22: Ranking Interesting Subgroups

Results

Data set        AUC      Kendall's τ
Diabetes        0.256    0.008
Breast-w        0.759    0.120
Vote            0.664    0.051
Segment         0.596    0.601
Vehicle         0.053    0.500
Heart-c         0.180    0.036
Primary-tumor   0.739    0.532
Hypothyroid     0.729    0.307
Ionosphere      0.227    0.708
Credit-a        0.050    0.241
Credit-g        0.019    0.285
Colic           1.9E-4   0.213
Anneal          0.030    0.329
Soybean         1.9E-4   0.040
Mushroom        0.542    0.320
mean            0.323    0.286

Wilcoxon signed rank test confirms significance

The 3 data sets with minimal AUC are exactly the ones with minimal correlation between true and proxy label!

Page 23: Ranking Interesting Subgroups

Conclusions

Example of ranking on complex, knowledge-rich data
Interestingness of subgroup patterns can be significantly increased with an interactive ranking-based method
Step toward automating machine learning for end-users

Future work:
• Validation with true users
• Active learning approach