Facilitating Interactive Mining of
Global and Local Association Rules
Abhishek Mukherji* Elke A. Rundensteiner Matthew O. Ward
Department of Computer Science, Worcester Polytechnic Institute, MA, USA.
*Samsung Research America, CA, USA.
XmdvTool is an open-source multivariate visual analytics tool developed at WPI with support from a series of NSF grants over the
past 20 years (http://sourceforge.net/projects/xmdvtool/).
This PhD research work was partly supported by NSF under grants IIS-0812027, CCF-0811510 and IIS-1117139.
Era of Big Data …. And we are DRIVING!
11/03/2014 2
1. Where’s the Data in the Big Data Wave? Gerhard Weikum, Res. Director at Max Planck Institute, http://wp.sigmod.org/?p=786.
2. Analytic DB Technology for the Data Enthusiast. Pat Hanrahan, Stanford & Tableau, SIGMOD‘12 Keynote Talk.
Volume, Velocity, Variety, Veracity
XmdvTool’s Efforts Towards This Paradigm Shift
I. From visualizing static data to visualizing stream & sensor data: SNIFTool & FireStream, ViStream*
II. From visualizing data records to visualizing mined results: PARAS/FIRE, COLARM
*Di Yang et al., Interactive visual exploration of neighbor-based patterns in data streams, ACM SIGMOD’10 Demo.
Summary of Graduate Research Works
CAPE*, XmdvTool^
I. Stream & Sensor Data Processing
1. SNIFTool/FireStream: Discover Patterns in Live Stream [CIKM ’08, ICDE Demo ’07]
2. JAQPOT: High Velocity Streams MJoin Execution [BNCOD ’11]
II. Interactive Mining
1. PARAS/FIRE [VLDB’13, SIGMOD’13, CIKM’13]
2. COLARM [EDBT’14]
III. Scalable Nugget-guided Hypothesis Testing
1. SPHINX: Evidence-Hypotheses Exploration [CIKM’13]
2. Iterative Multi-Evidence-Hypotheses Model
*http://davis.wpi.edu/dsrg/PROJECTS/CAPE/index.html
^http://davis.wpi.edu/xmdv/index.html
PARAS/FIRE: Interactive Visual Support for
Parameter Space-Driven Mining of Global Rules [PVLDB 2013, SIGMOD 2013, CIKM 2013]
Joint work with
Xika Lin, Christopher Ryan Botaish, Jason Whitehouse,
Elke A. Rundensteiner, Matthew O. Ward
Department of Computer Science, Worcester Polytechnic Institute (WPI), MA, USA.
Association Rule Mining (ARM) Basics
<Age: 30..39> and <Married: Yes> ⇒ <NumCars: 2>
Support = 40%, Confidence = 100%
RecordID  Age  Married  NumCars
100       23   No       1
200       25   Yes      1
300       29   No       0
400       34   Yes      2
500       38   Yes      2
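As a quick check of the numbers above, the support and confidence of the example rule can be recomputed from the five records. This is a minimal sketch: the helper names (`antecedent`, `consequent`, `support_and_confidence`) are illustrative and do not come from the slides.

```python
# Toy table from the slide: (RecordID, Age, Married, NumCars)
records = [
    (100, 23, "No", 1),
    (200, 25, "Yes", 1),
    (300, 29, "No", 0),
    (400, 34, "Yes", 2),
    (500, 38, "Yes", 2),
]

def antecedent(rec):
    """<Age: 30..39> and <Married: Yes>"""
    _, age, married, _ = rec
    return 30 <= age <= 39 and married == "Yes"

def consequent(rec):
    """<NumCars: 2>"""
    return rec[3] == 2

def support_and_confidence(rows, lhs, rhs):
    # support    = fraction of all records matching both sides of the rule
    # confidence = fraction of antecedent-matching records that also match the consequent
    n_lhs = sum(1 for r in rows if lhs(r))
    n_both = sum(1 for r in rows if lhs(r) and rhs(r))
    support = n_both / len(rows)
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence

support, confidence = support_and_confidence(records, antecedent, consequent)
print(f"support={support:.0%}, confidence={confidence:.0%}")  # support=40%, confidence=100%
```

Records 400 and 500 are the only ones matching the antecedent, and both own two cars, which gives 2/5 support and 2/2 confidence.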
R. Agrawal and R. Srikant, Fast algorithms for mining association rules in large databases, VLDB’94.
R. Srikant and R. Agrawal, Mining quantitative association rules in large relational tables, SIGMOD’96.
Which customers to target for
multi-car discount promos?
Motivation for Interactive Mining
The Data Analyst submits (minsupp, minconf) to the Data Miner, which returns the association rules {ARs}.
Limitations
1. Unacceptably long response time.
2. Trial-and-error iterations.
3. Forced to rerun for each subset.
Research Goals
1. Improve turnaround times of mining queries.
2. Provide parameter recommendations.
3. Preprocess data to enable a fast interactive mining experience.
C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.
C. Hidber. Online Association Rule Mining, SIGMOD’99.
B. Nag, P. M. Deshpande, and D. J. DeWitt, Using a knowledge cache for interactive discovery of association rules, SIGKDD’99.
M. Kubat et al., Itemset trees for targeted association querying, IEEE TKDE’03.
M. Kaya and R. Alhajj. Online mining of fuzzy multidimensional weighted association rules. Applied Intelligence’08.
The State-of-the-art in Online Rule Mining
[Figure: itemset lattice with support counts out of 100 transactions. {}: 100; X: 80, Y: 60, Z: 40; XY: 40, XZ: 20, YZ: 20; XYZ: 10. Highlighted nodes: XYZ and X.]
I. Frequent Itemset Generation (Offline)
II. Rule Generation (Online)
Assumptions
1. Cost(Freq. Itemset Generation) >> Cost(Rule Generation),
2. Count(Itemsets) << Count(Rules).
^Cost per GB of RAM: $1000 (in 2000) to $25 (in 2012).
C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.
M. Kubat et al., Itemset trees for targeted association querying, IEEE TKDE’03.
M. Kaya and R. Alhajj. Online mining of fuzzy multidimensional weighted association rules. Applied Intelligence’08.
^ http://www.jcmit.com/memoryprice.htm
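The two-phase split on this slide (itemsets stored offline, rules generated online per query) can be sketched as follows. The support counts are my reading of the flattened lattice figure ({} = 100, X = 80, Y = 60, Z = 40, XY = 40, XZ = 20, YZ = 20, XYZ = 10). The function name `generate_rules` is hypothetical, and this is a brute-force illustration rather than the adjacency-lattice algorithm of the cited work.

```python
from itertools import combinations

# Phase I (offline): itemset support counts, assumed from the lattice figure.
N = 100  # total number of transactions
itemset_counts = {
    frozenset("X"): 80, frozenset("Y"): 60, frozenset("Z"): 40,
    frozenset("XY"): 40, frozenset("XZ"): 20, frozenset("YZ"): 20,
    frozenset("XYZ"): 10,
}

def generate_rules(counts, n, minsupp, minconf):
    """Phase II (online): derive rules A => C from the pre-stored itemsets."""
    rules = []
    for itemset, count in counts.items():
        support = count / n
        if len(itemset) < 2 or support < minsupp:
            continue
        # Every non-empty proper subset of the itemset is a candidate antecedent.
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                confidence = count / counts[lhs]
                if confidence >= minconf:
                    rules.append(("".join(sorted(lhs)),
                                  "".join(sorted(itemset - lhs)),
                                  support, round(confidence, 3)))
    return sorted(rules)

rules = generate_rules(itemset_counts, N, minsupp=0.2, minconf=0.5)
for lhs, rhs, s, c in rules:
    print(f"{lhs} => {rhs}  support={s:.2f}  confidence={c:.3f}")
```

With these counts, the query (minsupp = 0.2, minconf = 0.5) yields X => Y, Y => X, Z => X and Z => Y. Every new parameter setting repeats the online loop, which is the cost the PARAS research questions aim to remove.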
Adjacency Lattice and Redundancy
[Figure: the same itemset lattice, with nodes XYZ and X highlighted.]
Strict Redundancy [(A∪C) of X⇒YZ ⊃ (A∪C) of X⇒Y]
Simple Redundancy [(A∪C) = XYZ, (A) of XY⇒Z ⊃ (A) of X⇒YZ]
Starting with maximal ancestors of XYZ, i.e., X, Y and Z.
If (X⇒YZ qualifies)
Then skip XY and XZ as antecedent (simple) or consequent (strict).
C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.
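The two redundancy notions can be expressed as set tests on a rule's antecedent A and consequent C. The sketch below follows my reading of the definitions in the cited Aggarwal and Yu paper (simple: same itemset, larger antecedent; strict: smaller itemset, antecedent at least as large). The function names are illustrative, not taken from the slides.

```python
def make_rule(lhs, rhs):
    """Represent a rule A => C as a pair of frozensets of single-letter items."""
    return frozenset(lhs), frozenset(rhs)

def simple_redundant(rule, wrt):
    """Same itemset (A ∪ C), and the antecedent strictly contains that of `wrt`."""
    (a2, c2), (a1, c1) = rule, wrt
    return (a2 | c2) == (a1 | c1) and a1 < a2

def strict_redundant(rule, wrt):
    """Itemset is a proper subset of that of `wrt`; antecedent contains that of `wrt`."""
    (a2, c2), (a1, c1) = rule, wrt
    return (a2 | c2) < (a1 | c1) and a1 <= a2

base = make_rule("X", "YZ")                             # X => YZ
print(simple_redundant(make_rule("XY", "Z"), base))     # True:  XY => Z
print(strict_redundant(make_rule("X", "Y"), base))      # True:  X => Y
print(simple_redundant(make_rule("Y", "XZ"), base))     # False: Y => XZ
```

Both tests are safe to apply once X => YZ qualifies: XY => Z has the same support and at least the same confidence, and X => Y has at least the same support and confidence, so neither needs to be reported separately.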
Research Challenges
PARAS: Preprocessing/Computational aspects
1. Instead of the itemsets, can we pre-store association rules to altogether
avoid the online rule generation step?
2. Instead of the itemset index, can we have a direct look-up using
(minsupp, minconf)?
3. How to handle Redundancy Relationships in the context of the
parameterized Index?
FIRE: Visualization aspects
1. How should we visually present these mining results to the users?
2. How can we leverage these results to support interactive rule exploration?
3. Can we utilize some data visualization techniques to help users better
understand these mined results?
Parameter Space Model (PARAS)
[Figure: the itemset lattice next to the rules it generates, plotted in the parameter space with Support on the x-axis and Confidence on the y-axis (ticks at 0.2, 0.5, 0.8 and 1). Rule labels: X⇒YZ, Z⇒XY, XZ⇒Y, YZ⇒X, Y⇒X at (0.4, 0.67), Z⇒X, Z⇒Y. Other coordinates marked: (0.1, 0.125), (0, 0.5), (0.2, 0), (0.4, 0.5). Lines l1, l2 and l3 delimit the Stable Regions S1 and S2.]
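The idea behind stable regions can be illustrated with a small sketch: each rule is a point (support, confidence), and any two parameter settings that fall between the same rule coordinates return the same rule set. The points for Y => X and X => YZ are read from the figure; the others are computed from the lattice counts and are therefore assumptions. `query` and `region_key` are hypothetical names, and the actual PARAS index is not reproduced here.

```python
from bisect import bisect_left

# Rule locations in the (support, confidence) parameter space.
rule_points = {
    "X=>Y": (0.4, 0.5),
    "Y=>X": (0.4, 0.67),
    "Z=>X": (0.2, 0.5),    # computed from the lattice: 20/100, 20/40
    "Z=>Y": (0.2, 0.5),    # computed from the lattice: 20/100, 20/40
    "X=>YZ": (0.1, 0.125),
}

# Distinct coordinates act as cut lines that partition the space into a grid.
supports = sorted({s for s, _ in rule_points.values()})
confidences = sorted({c for _, c in rule_points.values()})

def query(minsupp, minconf):
    """Rules whose support and confidence both meet the thresholds."""
    return frozenset(r for r, (s, c) in rule_points.items()
                     if s >= minsupp and c >= minconf)

def region_key(minsupp, minconf):
    """Grid cell of a parameter setting; the same cell implies the same rule set."""
    return bisect_left(supports, minsupp), bisect_left(confidences, minconf)

print(region_key(0.25, 0.55), sorted(query(0.25, 0.55)))  # (2, 2) ['Y=>X']
print(region_key(0.35, 0.60), sorted(query(0.35, 0.60)))  # (2, 2) ['Y=>X']
print(region_key(0.15, 0.40), sorted(query(0.15, 0.40)))
```

Because the answer depends only on the cell, the rule set for each cell could be precomputed once and then looked up directly by (minsupp, minconf) instead of being regenerated per query.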
Stable Regions {S } w/ Neighbors + Rules +
Further, re-examining the redundancy definitions, we observed certain
properties that enabled us to optimize computation and storage of
redundancy information with respect to the parameter space.*
PARAS: Parameter Space Framework for Online Association Mining, VLDB 2013.
* Xika Lin was the major contributor to the redundancy results.
Framework for Interactive Rule Exploration (FIRE)
[Figure: FIRE views of the Mushroom dataset and the Chess dataset.]
All rules versus unique rules view
Unique + non-redundant rules view
Two-region Comparison
Rule Glyph View
[Figure: rule glyphs rendered as Lined Glyph*, Filled Glyph, and Filled Glyph with MDS layout. Itemsets shown: {poisonous? = edible} and {gill-attachment = free, veil-type = partial, veil-color = white}.]
*M. O. Ward, A taxonomy of glyph placement strategies for multidimensional data visualization, Information Visualization 2002.
PARAS: Experimental Evaluation
Data sets
Synthetic*: IBM Quest Generator (T10I4D100k and T10I4D5000k). In TxIyDz, x = average number of items per transaction, y × 1k = total number of items, z = number of transactions.
Real: Chess~, Mushroom~, Webdocs`.
* R. Agrawal and R. Srikant, Fast algorithms for mining association rules in large databases, VLDB’94.
~ A. Asuncion and D. Newman, UCI ML repository. http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.
` C. Lucchese and S. Orlando and R. Perego and F. Silvestri, Webdocs: a real-life huge transactional dataset. FIMI’04.
PARAS: Experimental Evaluation
Tested Algorithms^
w/o redundancy resolution
1. Apriori
2. Eclat
3. FP-Growth
4. PARAS
w/ redundancy resolution
1. Apriori_RR
2. Eclat_RR
3. FP-Growth_RR
4. AdjLattice_RR
5. PARAS_RR
^ C. Borgelt, Efficient apriori, eclat & fp-growth, http://www.borgelt.net.
PARAS: Experimental Methodologies
1. Average Online Processing Times (w/ and w/o RR).
Varying minsupp, fixed minconf
Fixed minsupp, varying minconf
2. Offline Preprocessing Times (AdjLattice_RR vs. PARAS)
1. Average Online Processing times (T5000k)
[Figure: two panels, w/ RR and w/o RR.]
For a large diversity of online queries, PARAS consistently outperforms the state-of-the-art
competitors from the literature by 2 to 5 orders of magnitude over the tested datasets.
2. Pre-processing Times
C.C. Aggarwal and P.S. Yu, A new approach to online generation of association rules, IEEE TKDE’01.
B. Nag, P. M. Deshpande, and D. J. DeWitt, Using a knowledge cache for interactive discovery of association rules, SIGKDD’99.
PARAS requires ~10% extra offline preprocessing time compared with AdjLattice_RR.
Rule Generation: T5000k = 4 sec, Webdocs = 220 sec.
Confirmed: Cost(Freq. Itemset Generation) >> Cost(Rule Generation).
FIRE: User Study
22 subjects
Mushroom and chess datasets
Cached Rule Miner (CRM) versus FIRE
Randomization to eliminate pre-knowledge
Questions
Stable Region Usage Tests
T1: What are the most prominent rules by support and confidence?
T2: Which setting (out of a choice of 4) returns a different set of rules?
T3: Find the common and unique rules for two distinct parameter settings.
Filter/Redundancy Test
T4: Find the most frequent characteristics of edible and poisonous mushrooms.
Skyline View Test
T5: Find the parameter settings that produce the top-k rules in the dataset, where k = 20, 50, 100.
Mushroom Dataset: Tasks 1, 2 and 3
Overall, FIRE outperforms the competing CRM approach: users achieve similar or better
accuracy while spending significantly less time on the tasks.
Tasks 4 and 5
Overall, FIRE outperforms the competing CRM approach: users achieve similar or better
accuracy while spending significantly less time on the tasks.
Conclusion
Gains of several orders of magnitude when using PARAS for online processing outweigh
the one-time minimal offline preprocessing time and storage requirements.
We proposed a novel parameter space model, developed optimal algorithms and
designed effective visualizations to facilitate interactive rule exploration by tackling
challenges related to both computational and visualization aspects of online rule mining.
Our user study establishes the usability and effectiveness of the proposed features and
interactions of the FIRE system in facilitating interactive rule mining.
Recent works at Samsung Research America
User Behavior Analysis via On-device Mobile Sensing
MobileMiner: Mining Your Frequent Behavior Patterns On Your Phone
V Srinivasan et al., ACM UbiComp 2014 (Best Paper Nominee), HotMobile 2013.
Association rule mining over multi-modal mobile context data
Mobile Sequence Miner: Adding Intelligence to Your Mobile Device via On-Device Sequential Pattern Mining
A Mukherji et al., ACM MCSS Workshop in UbiComp 2014.
Unobtrusively learn sequential patterns of mobile users
“Typically, when I am home on Sunday nights, I call my parents”
Thanks
Contact me with questions:
Abhishek Mukherji
Samsung Research America