

Page 1: Facilitating Interactive Mining of Global and Local ... · Department of Computer Science, Worcester Polytechnic Institute, MA, USA. *Samsung Research America, CA, USA. Xmdvtool is

Facilitating Interactive Mining of Global and Local Association Rules

Abhishek Mukherji* Elke A. Rundensteiner Matthew O. Ward

Department of Computer Science, Worcester Polytechnic Institute, MA, USA.

*Samsung Research America, CA, USA.

XmdvTool is an open-source multivariate visual analytics tool developed at WPI with a series of NSF grants over the past 20 years (http://sourceforge.net/projects/xmdvtool/).

This PhD research work was partly supported by NSF under grants IIS-0812027, CCF-0811510 and IIS-1117139.

Page 2

Era of Big Data …. And we are DRIVING!

11/03/2014 2

1. Where’s the Data in the Big Data Wave? Gerhard Weikum, Res. Director at Max Planck Institute, http://wp.sigmod.org/?p=786.

2. Analytic DB Technology for the Data Enthusiast. Pat Hanrahan, Stanford & Tableau, SIGMOD’12 Keynote Talk.

[Figure labels: Volume, Veracity, Variety, Velocity]

Page 3

XmdvTool’s Efforts Towards This Paradigm Shift

Visualize Static Data → I. Visualize Stream & Sensor Data: SNIFTool & FireStream, ViStream*

Visualize Data Records → II. Visualize Mined Results: PARAS/FIRE, COLARM

*Di Yang et al., Interactive visual exploration of neighbor-based patterns in data streams, ACM SIGMOD’10 Demo.

Page 4

Summary of Graduate Research Works

I. Stream & Sensor Data Processing

1. SNIFTool/FireStream: Discover Patterns in Live Stream [CIKM ’08, ICDE Demo ’07]

2. JAQPOT: High Velocity Streams MJoin Exec. [BNCOD ’11]

II. Interactive Mining

1. PARAS/FIRE [VLDB’13, SIGMOD’13, CIKM’13]

2. COLARM [EDBT’14]

III. Scalable Nugget-guided Hypothesis Testing

1. SPHINX: Evidence-Hypotheses Explor. [CIKM’13]

2. Iterative Multi-Evidence-Hypotheses Model

CAPE*, XMDVTool^

*http://davis.wpi.edu/dsrg/PROJECTS/CAPE/index.html

^http://davis.wpi.edu/xmdv/index.html

Page 5

PARAS/FIRE: Interactive Visual Support for Parameter Space-Driven Mining of Global Rules [PVLDB 2013, SIGMOD 2013, CIKM 2013]

Joint work with

Xika Lin, Christopher Ryan Botaish, Jason Whitehouse,

Elke A. Rundensteiner, Matthew O. Ward

Department of Computer Science, Worcester Polytechnic Institute (WPI), MA, USA.

Page 6

Association Rule Mining (ARM) Basics

<Age: 30..39> and <Married: Yes> ⇒ <NumCars: 2>

Support = 40%, Confidence = 100%

RecordID  Age  Married  NumCars
100       23   No       1
200       25   Yes      1
300       29   No       0
400       34   Yes      2
500       38   Yes      2

R. Agrawal and R. Srikant, Fast algorithms for mining association rules in large databases, VLDB’94.

R. Srikant and R. Agrawal, Mining quantitative association rules in large relational tables, SIGMOD’96.

Which customers to target for multi-car discount promos?
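The support and confidence values quoted for the example rule can be checked directly against the five-record table. The following is a minimal sketch in Python; the helper names (`antecedent`, `consequent`, `support_confidence`) are illustrative and do not come from the talk.

```python
# The five records from the slide's example table.
records = [
    {"RecordID": 100, "Age": 23, "Married": "No", "NumCars": 1},
    {"RecordID": 200, "Age": 25, "Married": "Yes", "NumCars": 1},
    {"RecordID": 300, "Age": 29, "Married": "No", "NumCars": 0},
    {"RecordID": 400, "Age": 34, "Married": "Yes", "NumCars": 2},
    {"RecordID": 500, "Age": 38, "Married": "Yes", "NumCars": 2},
]


def antecedent(r):
    """<Age: 30..39> and <Married: Yes>"""
    return 30 <= r["Age"] <= 39 and r["Married"] == "Yes"


def consequent(r):
    """<NumCars: 2>"""
    return r["NumCars"] == 2


def support_confidence(records, antecedent, consequent):
    """Support = share of all records matching both sides of the rule;
    confidence = share of antecedent-matching records that also match
    the consequent."""
    n_ante = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    support = n_both / len(records)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


support, confidence = support_confidence(records, antecedent, consequent)
print(f"Support = {support:.0%}, Confidence = {confidence:.0%}")
# prints: Support = 40%, Confidence = 100%
```

Records 400 and 500 are the only ones matching the antecedent, and both own two cars, which gives 2/5 support and 2/2 confidence.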

Page 7

Motivation for Interactive Mining

Data Analyst → (minsupp, minconf) → Data Miner → {ARs}

Limitations

Unacceptably long response time.

Trial-and-error iterations.

Forced to rerun for each subset.

Research Goals

Improve turnaround times of mining queries.

Provide parameter recommendations.

Preprocess data to enable fast interactive mining experience.

C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.

C. Hidber. Online Association Rule Mining, SIGMOD’99.

B. Nag, P. M. Deshpande, and D. J. DeWitt, Using a knowledge cache for interactive discovery of association rules, SIGKDD’99.

M. Kubat et al., Itemset trees for targeted association querying, IEEE TKDE’03.

M. Kaya and R. Alhajj. Online mining of fuzzy multidimensional weighted association rules. Applied Intelligence’08.

Page 8

The State-of-the-art in Online Rule Mining

I. Frequent Itemset Generation: Offline

II. Rule Generation: Online

Itemset lattice (support counts):

{}: 100
X: 80, Y: 60, Z: 40
XY: 40, XZ: 20, YZ: 20
XYZ: 10

Assumptions

1. Cost(Freq. Itemset Generation) >> Cost(Rule Generation),

2. Count(Itemsets) << Count(Rules).

^Cost per GB of RAM: $1000 (in 2000) → $25 (in 2012).

C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.

M. Kubat et al., Itemset trees for targeted association querying, IEEE TKDE’03.

M. Kaya and R. Alhajj. Online mining of fuzzy multidimensional weighted association rules. Applied Intelligence’08.

^ http://www.jcmit.com/memoryprice.htm
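The two-phase scheme on this slide (itemset counts prepared offline, rules derived online for each query) can be illustrated with the slide's own lattice counts. This is a minimal sketch under that reading, not the implementation of any cited system; `COUNTS` and `generate_rules` are hypothetical names.

```python
from itertools import combinations

# Offline phase (assumed already done): support counts of every itemset,
# copied from the slide's lattice; the empty set holds the transaction total.
COUNTS = {
    frozenset(): 100,
    frozenset("X"): 80, frozenset("Y"): 60, frozenset("Z"): 40,
    frozenset("XY"): 40, frozenset("XZ"): 20, frozenset("YZ"): 20,
    frozenset("XYZ"): 10,
}


def generate_rules(counts, minsupp, minconf):
    """Online phase: enumerate antecedent => consequent splits of each stored
    itemset and keep those meeting both thresholds."""
    total = counts[frozenset()]
    rules = []
    for itemset, count in counts.items():
        support = count / total
        if len(itemset) < 2 or support < minsupp:
            continue
        for k in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), k):
                ante = frozenset(ante)
                confidence = count / counts[ante]
                if confidence >= minconf:
                    rules.append((
                        "".join(sorted(ante)),
                        "".join(sorted(itemset - ante)),
                        support,
                        confidence,
                    ))
    return sorted(rules)


for ante, cons, supp, conf in generate_rules(COUNTS, minsupp=0.2, minconf=0.5):
    print(f"{ante} => {cons}  (support={supp:.2f}, confidence={conf:.2f})")
```

Every new (minsupp, minconf) query repeats the rule enumeration over the stored itemsets, which is the online step that the research challenges on a later slide ask to avoid.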

Page 9

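Adjacency Lattice and Redundancy

Itemset lattice (support counts): {}: 100; X: 80, Y: 60, Z: 40; XY: 40, XZ: 20, YZ: 20; XYZ: 10.

Simple Redundancy: both rules come from the same itemset, (A ∪ C) = XYZ, and one antecedent contains the other; e.g., if X ⇒ YZ qualifies, XY ⇒ Z is redundant.

Strict Redundancy: the itemset of one rule contains the other's, (A ∪ C) = XYZ ⊃ (A ∪ C) = XY; e.g., if X ⇒ YZ qualifies, X ⇒ Y is redundant.

Starting with maximal ancestors of XYZ, i.e., X, Y and Z: if X ⇒ YZ qualifies, then skip XY and XZ as antecedent (simple) or consequent (strict).

C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.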
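The two redundancy notions can be expressed as set tests on a rule's antecedent and consequent. The sketch below assumes the definitions from the cited Aggarwal and Yu paper as commonly stated: a rule is simply redundant with respect to another rule over the same itemset whose antecedent is a proper subset of its own, and strictly redundant with respect to a rule over a proper superset itemset whose antecedent is contained in its own. The function names and the `rule` helper are hypothetical.

```python
def rule(antecedent, consequent):
    """Build a rule as a pair of frozensets, e.g. rule("X", "YZ") for X => YZ."""
    return frozenset(antecedent), frozenset(consequent)


def is_simple_redundant(candidate, reference):
    """True if `candidate` is simply redundant w.r.t. `reference`:
    same overall itemset, and reference's antecedent is a proper subset
    of candidate's antecedent."""
    (c_ante, c_cons), (r_ante, r_cons) = candidate, reference
    return (c_ante | c_cons) == (r_ante | r_cons) and r_ante < c_ante


def is_strict_redundant(candidate, reference):
    """True if `candidate` is strictly redundant w.r.t. `reference`:
    candidate's itemset is a proper subset of reference's itemset, and
    reference's antecedent is contained in candidate's antecedent."""
    (c_ante, c_cons), (r_ante, r_cons) = candidate, reference
    return (c_ante | c_cons) < (r_ante | r_cons) and r_ante <= c_ante


reference = rule("X", "YZ")
print(is_simple_redundant(rule("XY", "Z"), reference))  # same itemset XYZ
print(is_strict_redundant(rule("X", "Y"), reference))   # itemset XY inside XYZ
```

In both cases the redundant rule has support and confidence at least as high as the reference rule, so once the reference rule qualifies for a query the redundant one qualifies too and adds no new information.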

Page 10

Research Challenges

PARAS: Preprocessing/Computational aspects

1. Instead of the itemsets, can we pre-store association rules to altogether avoid the online rule generation step?

2. Instead of the itemset index, can we have a direct look-up using (minsupp, minconf)?

3. How to handle Redundancy Relationships in the context of the parameterized Index?

FIRE: Visualization aspects

1. How should we visually present these mining results to the users?

2. How can we leverage these results to support interactive rule exploration?

3. Can we utilize some data visualization techniques to help users better understand these mined results?

Page 11

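Parameter Space Model (PARAS)

[Figure: the itemset lattice ({}: 100; X: 80, Y: 60, Z: 40; XY: 40, XZ: 20, YZ: 20; XYZ: 10) mapped into a two-dimensional parameter space with Support on the x-axis and Confidence on the y-axis (ticks at 0.2, 0.5, 0.8 and 1). Rules appear as points (support, confidence), e.g., X ⇒ YZ at (0.1, 0.125) and Y ⇒ X at (0.4, 0.67); other labeled rules are XY ⇒ Z and Z ⇒ XY; XZ ⇒ Y and YZ ⇒ X; Z ⇒ X and Z ⇒ Y; X ⇒ Z; X ⇒ Y. Further labels: Stable Regions S1 and S2, lines l1, l2 and l3, and the points (0, 0.5), (0.2, 0), (0.4, 0.5) and (0.4, 0.67).]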
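The mapping from rules to points in the (support, confidence) plane can be reproduced from the lattice counts. The sketch below is an illustration only and not the PARAS index itself: it assumes that a stable region is a set of parameter settings that all return the same rules, and approximates such regions by the grid cells between consecutive distinct support and confidence values, which may be finer than the true regions. All names (`rule_locations`, `region_key`, `query`) are hypothetical.

```python
from bisect import bisect_left
from itertools import combinations

# Support counts copied from the slide's lattice.
COUNTS = {
    frozenset(): 100,
    frozenset("X"): 80, frozenset("Y"): 60, frozenset("Z"): 40,
    frozenset("XY"): 40, frozenset("XZ"): 20, frozenset("YZ"): 20,
    frozenset("XYZ"): 10,
}


def rule_locations(counts):
    """Map each (support, confidence) point to the rules located there."""
    total = counts[frozenset()]
    locations = {}
    for itemset, count in counts.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), k):
                ante = frozenset(ante)
                point = (count / total, count / counts[ante])
                label = f"{''.join(sorted(ante))} => {''.join(sorted(itemset - ante))}"
                locations.setdefault(point, []).append(label)
    return locations


def region_key(locations, minsupp, minconf):
    """Grid cell of a query: two queries with the same key see the same rules."""
    supports = sorted({s for s, _ in locations})
    confidences = sorted({c for _, c in locations})
    return bisect_left(supports, minsupp), bisect_left(confidences, minconf)


def query(locations, minsupp, minconf):
    """All rules whose location dominates the query point."""
    return sorted(
        label
        for (supp, conf), labels in locations.items()
        if supp >= minsupp and conf >= minconf
        for label in labels
    )


locations = rule_locations(COUNTS)
for point in sorted(locations):
    print(point, locations[point])
```

For example, the settings (0.25, 0.4) and (0.3, 0.45) fall into the same grid cell and both return only X ⇒ Y and Y ⇒ X, so moving between them changes nothing for the analyst.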

Page 12

Stable Regions {S } w/ Neighbors + Rules +

Further, re-examining the redundancy definitions, we observed certain properties that enabled us to optimize computation and storage of redundancy information with respect to the parameter space.*

[Figure: stable regions S1 and S2]

PARAS: Parameter Space Framework for Online Association Mining, VLDB 2013.

* Xika Lin was a major contributor to the redundancy results.

Page 13

Framework for Interactive Rule Exploration (FIRE)

[Screenshots: Mushroom dataset, Chess dataset]

Page 14

All rules versus unique rules view

Page 15

Unique + non-redundant rules view

Page 16

Two-region Comparison

Page 17

Rule Glyph View

[Figure panels: Lined Glyph*, Filled Glyph, Filled Glyph MDS layout]

Labeled itemsets: {poisonous? = edible}; {gill-attachment = free, veil-type = partial, veil-color = white}

*M. O. Ward, A taxonomy of glyph placement strategies for multidimensional data visualization, Information Visualization 2002.

Page 18

PARAS: Experimental Evaluation

Data sets

Synthetic*: IBM Quest Generator (T10I4D100k and T10I4D5000k). In TxIyDz, x = avg # of items per transaction, y × 1k = total # of items, z = # of transactions.

Real: Chess~, Mushroom~, Webdocs`.

* R. Agrawal and R. Srikant, Fast algorithms for mining association rules in large databases, VLDB’94.

~ A. Asuncion and D. Newman, UCI ML repository. http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.

` C. Lucchese and S. Orlando and R. Perego and F. Silvestri, Webdocs: a real-life huge transactional dataset. FIMI’04.

Page 19

PARAS: Experimental Evaluation

Tested Algorithms^

w/o redundancy resolution

1. Apriori

2. Eclat

3. FP-Growth

4. PARAS

w/ redundancy resolution

1. Apriori_RR

2. Eclat_RR

3. FP-Growth_RR

4. AdjLattice_RR

5. PARAS_RR

^ C. Borgelt, Efficient apriori, eclat & fp-growth, http://www.borgelt.net.

Page 20

PARAS: Experimental Methodologies

1. Average Online Processing Times (w/ and w/o RR).

Varying minsupp, fixed minconf

Fixed minsupp, varying minconf

2. Offline Preprocessing Times (AdjLattice_RR vs. PARAS).

Page 21

1. Average Online Processing times (T5000k)

[Plots: w/ RR and w/o RR]

For a large diversity of online queries, PARAS consistently outperforms the state-of-the-art competitors from the literature by 2 to 5 orders of magnitude over the tested datasets.

Page 22

2. Pre-processing Times

C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.

B. Nag, P. M. Deshpande, and D. J. DeWitt, Using a knowledge cache for interactive discovery of association rules, SIGKDD’99.

PARAS requires ~10% extra offline preprocessing time compared with AdjLattice_RR.

Rule Generation: T5000k = 4 sec, Webdocs = 220 sec.

Confirmed: Cost(Freq. Itemset Generation) >> Cost(Rule Generation).

Page 23

FIRE: User Study

Questions

Stable Region Usage Tests

T1: What are the most prominent rules by support and confidence?

T2: Which setting (out of a choice of 4) returns a different set of rules?

T3: Find the common and unique rules for two distinct parameter settings.

Filter/Redundancy Test

T4: Find the most frequent characteristics of edible and poisonous mushrooms.

Skyline View Test

T5: Find the parameter settings that produce top-k rules in the dataset, where k = 20, 50, 100.

Setup

22 subjects

Mushroom and chess datasets

Cached Rule Miner (CRM) versus FIRE

Randomization to eliminate pre-knowledge

Page 24

Mushroom Dataset: Tasks 1, 2 and 3

Overall, FIRE outperforms the competing CRM approach: users achieve similar or better accuracy while spending significantly less time on the tasks.

Page 25

Tasks 4 and 5

Overall, FIRE outperforms the competing CRM approach: users achieve similar or better accuracy while spending significantly less time on the tasks.

Page 26

Conclusion

Gains of several orders of magnitude when using PARAS for online processing outweigh the one-time minimal offline preprocessing time and storage requirements.

We proposed a novel parameter space model, developed optimal algorithms and designed effective visualizations to facilitate interactive rule exploration by tackling challenges related to both computational and visualization aspects of online rule mining.

Our user study establishes usability and effectiveness of the proposed features and interactions of the FIRE system in facilitating interactive rule mining.

Page 27

Recent works at Samsung Research America

MobileMiner: Mining Your Frequent Behavior Patterns On Your Phone

V Srinivasan et al., ACM UbiComp 2014 (Best Paper Nominee), HotMobile 2013.

Mobile Sequence Miner: Adding Intelligence to Your Mobile Device via On-Device Sequential Pattern Mining

A Mukherji et al., ACM MCSS Workshop in UbiComp 2014.

User Behavior Analysis via On-device Mobile Sensing

Unobtrusively learn sequential patterns of mobile users

“Typically, when I am home on Sunday nights, I call my parents”

Association rule mining over multi-modal mobile context data

Page 28

Thanks

Contact me with questions:

Abhishek Mukherji

Samsung Research America

[email protected]
