A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce

Fabio Fumarola and Donato Malerba
L.A.C.A.M. KDDE, Department of Computer Science, University of Bari “Aldo Moro”, via Orabona, 4, I-70125 Bari, Italy

Upload: fabio-fumarola

Post on 11-May-2015


DESCRIPTION

Recently, several algorithms based on the MapReduce framework have been proposed for frequent pattern mining in Big Data. However, the proposed solutions come with their own technical challenges, such as inter-communication costs, in-process synchronizations, balanced data distribution and input parameters tuning, which negatively affect the computation time. In this paper we present MrAdam, a novel parallel, distributed algorithm which addresses these problems. The key principle underlying the design of MrAdam is that one can make reasonable decisions in the absence of perfect answers. Indeed, given the classical threshold for minimum support and a user-specified error bound, MrAdam exploits the Chernoff bound to mine “approximate” frequent itemsets with statistical error guarantees on their actual supports. These itemsets are generated in parallel and independently from subsets of the input dataset, by exploiting the MapReduce parallel computation framework. The result collections of frequent itemsets from each subset are aggregated and filtered by using a novel technique to provide a single collection in output. MrAdam can scale well on gigabytes of data and tens of machines, as experimentally proven on real datasets. In the experiments we also show that the proposed algorithm returns a good statistically bounded approximation of the exact results.

TRANSCRIPT

Page 1: A Parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce



Outline
• Frequent Pattern Mining
• Frequent Itemset Mining using MapReduce
  – Issues related to data distribution
• Discover “Approximate” frequent itemsets with a statistical error guarantee
• MrAdam
  – Chernoff Bound
  – Local Model Discovery
  – Global Combination and Interpolation
• Experiments
• Conclusions

Fumarola and Malerba – A parallel Algorithm for Approximate Frequent Itemset Mining using MapReduce


Frequent Pattern Analysis
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a dataset
• First proposed by Agrawal et al. in the context of frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
  – What products were often purchased together?
  – What kinds of DNA are sensitive to this new drug?
• Application: Basket data analysis, cross-marketing, catalog design, Web log analysis, and DNA sequence analysis


Why Is Frequent Pattern Mining Important?

Foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential, structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
– Classification: discriminative, frequent pattern analysis
– Cluster analysis: frequent pattern-based clustering


Basic Concepts: Frequent Patterns

• itemset: A set of one or more items
• k-itemset X = {x1, …, xk}
• (absolute) support, or support count, of X: frequency or number of occurrences of an itemset X
• (relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X’s support is no less than a minsup threshold

[Figure: Venn diagram of customers who buy diaper, customers who buy beer, and customers who buy both]

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk
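The definitions of absolute support, relative support, and frequency can be checked directly on the five transactions listed on this slide. The following is a minimal Python sketch; the function names and the minsup value of 0.5 used in the usage note are illustrative assumptions, not taken from the slides.

```python
# Transactions copied from the slide's table (Tid -> items bought).
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}


def support_count(itemset, db):
    """Absolute support: number of transactions containing every item of itemset."""
    return sum(1 for items in db.values() if itemset <= items)


def relative_support(itemset, db):
    """Relative support: fraction of transactions containing itemset."""
    return support_count(itemset, db) / len(db)


def is_frequent(itemset, db, minsup):
    """An itemset is frequent if its relative support is no less than minsup."""
    return relative_support(itemset, db) >= minsup
```

With an assumed minsup of 0.5, `is_frequent({"Beer", "Diaper"}, transactions, 0.5)` holds because the pair appears in three of the five transactions, while `{"Nuts", "Eggs"}` appears in only two and is not frequent.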


FIM & MapReduce

• Several algorithms have been proposed for Frequent Itemset Mining using MapReduce.
• However, their computation time is negatively affected by:
  – Inter-communication costs,
  – In-process synchronizations,
  – Balanced data distribution, and
  – Input parameters tuning.


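Problem

• Frequent Itemset Mining is not Map-Reducible.

Partition 1: {a,b,c}, {a,b}, {a,b,c}, {a,c,d}
Partition 2: {a,b,c}, {a}, {a,b,c}
Partition 3: {a,c}, {a}, {a,b,c}

minsup >= 0.5

Local supports: {a,b} = 3/4 (partition 1), {a,b} = 2/3 (partition 2), {a,b} = 1/3 (partition 3)

Combining only the locally frequent results: {a,b} = (3/4 + 2/3 + 0)/3 < 0.5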
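The arithmetic on this slide can be reproduced with a short Python sketch. The grouping of the ten transactions into three partitions follows the transcript's reading order, which is an assumption, although it reproduces the stated local supports of 3/4, 2/3 and 1/3. Variable names are illustrative.

```python
from fractions import Fraction

# Assumed partitioning: transactions in transcript order, sizes 4, 3 and 3.
partitions = [
    [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"a", "c", "d"}],
    [{"a", "b", "c"}, {"a"}, {"a", "b", "c"}],
    [{"a", "c"}, {"a"}, {"a", "b", "c"}],
]
MINSUP = Fraction(1, 2)
X = {"a", "b"}


def support(itemset, txs):
    """Exact relative support of itemset in a list of transactions."""
    return Fraction(sum(1 for t in txs if itemset <= t), len(txs))


# Support of X inside each partition: 3/4, 2/3, 1/3.
local = [support(X, p) for p in partitions]

# A partition reports X only when X is locally frequent; otherwise it contributes 0.
reported = [s if s >= MINSUP else Fraction(0) for s in local]
naive = sum(reported) / len(partitions)

# Support of X over the whole dataset, for comparison.
exact = support(X, [t for p in partitions for t in p])
```

Here `naive` is 17/36, below the threshold, while `exact` is 3/5: the itemset is globally frequent but is lost because the third partition never reports it.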


Research Goals

Can we still make reasonable decisions in the absence of perfect answers? (Yes…)

If we introduce an error, is the result still acceptable? (Not always)


Contribution

• To mine “approximate” frequent itemsets from Big Data with statistical error guarantees.
• We want to avoid:
  – Inter-communication costs,
  – In-process synchronizations,
  – Additional input parameters tuning.


Literature (DDM)

• In 1996, Agrawal and Shafer proposed approaches based on: Count Distribution, Data Distribution, and Candidate Distribution
• In 2002, Orlando et al. proposed the Partition Algorithm
• In 2004, Ashrafi et al. proposed ODAM, where the mining process is synchronized via message passing
• Still, all the proposed solutions are based on the Apriori approach
  – Issues: synchronization, data balancing


Literature: MapReduce
• However, Google introduced
  – 2003 - Google File System,
  – 2004 - MapReduce,
• In 2008: Li et al. proposed Parallel FP-Growth (PFP)
  1. Parallel and distributed counting of frequent items
  2. MapReduce round to generate group-dependent transactions
  3. FP-growth applied to each group
• VLDB 2010, DBKA 2010, KDD 2011 evolutions
  Issues: data replication, inter-communication and synchronizations


Literature: MapReduce

• Riondato et al. [CIKM 2012] proposed an approach, which:
  – Creates samples of the dataset
  – Extracts frequent itemsets with support ≥ (minsup – ε/2) from the samples
  – Combines the frequent itemsets
• Parameters: number of transactions, allowed error ε, probability δ, replication parameter Φ, and the type of Mapper to be used (Partition, Binomial, Count Flipper, Input Sample)
• Matlab script for parameters tuning
• Composed of 2 map-reduce rounds


MrAdam
• Given a dataset stored on HDFS, MrAdam takes:
  – Input: 2 parameters, minsup and a value for the reliability parameter δ.
  – Output: a (1 – δ) approximation of the exact set of frequent itemsets
• MrAdam combines:
  – A statistical approach based on the Chernoff bound
  – A MapReduce based algorithm
  – The SE-Tree to combine intermediate results
  – An interpolation function to estimate the support of an itemset


Chernoff bound

• It bounds the probability that a random variable with Binomial distribution deviates from its expected value.
• We can use it to express the allowed error ε in terms of δ.
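The formula on this slide is not preserved in the transcript. As a hedged illustration of how an error ε can be expressed in terms of δ, the sketch below uses one standard additive form (the Hoeffding variant of the Chernoff bound), 2·exp(−2nε²) ≤ δ, solved for ε. This form, the function name `max_error`, and the parameter `n` (number of transactions in a partition) are assumptions and may differ from what MrAdam actually uses.

```python
import math


def max_error(n, delta):
    """Additive error eps such that P(|observed - true| >= eps) <= delta.

    Assumes the relative support is estimated from n independent
    transactions and applies the bound 2 * exp(-2 * n * eps**2) <= delta.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(math.log(2 / delta) / (2 * n))
```

Under this assumed form, the error shrinks as the partition grows and widens as δ gets smaller, so a stricter reliability requirement lowers the local threshold minsup – ε further.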


MapReduce

• We modeled the computation into Map and Reduce functions.

1. map(key: LongWritable, value: Text, context: Context)

2. reduce(key: Text, vals: Iterable[LongWritable], context: Context)
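To illustrate the shape of these two functions without a Hadoop cluster, here is a small in-memory Python sketch. It mirrors the signatures only loosely: the map key is a line offset, the map value is one transaction as text, and the reducer receives a text key with a list of counts. The function bodies (simple item counting) and the helper `run_job` are illustrative assumptions and are not MrAdam's actual map and reduce logic.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: offset of the line (unused); value: one transaction as text.

    Emits (item, 1) for every distinct item in the transaction.
    """
    for item in sorted(set(value.split())):
        yield item, 1


def reduce_fn(key, vals):
    """key: an item; vals: all counts emitted for that item. Emits the total."""
    yield key, sum(vals)


def run_job(lines):
    """In-memory stand-in for the shuffle step between map and reduce."""
    groups = defaultdict(list)
    offset = 0
    for line in lines:
        for k, v in map_fn(offset, line):
            groups[k].append(v)
        offset += len(line) + 1  # offset of the next line, as with text input
    output = {}
    for k in sorted(groups):
        for out_key, out_val in reduce_fn(k, groups[k]):
            output[out_key] = out_val
    return output
```

For example, `run_job(["a b c", "a b", "a c d"])` groups the emitted pairs by item and sums them, giving a count of 3 for `a`, 2 for `b`, 2 for `c` and 1 for `d`.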


1. Local Model Discovery

Input: minsup, δ, HDFS folder
1. It uses the Chernoff bound to compute the maximum acceptable error ε
2. It executes a Map-step with FP-growth as routine and returns the collection of frequent itemsets with support at least (minsup – ε)
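The effect of the lowered local threshold can be sketched in Python. This sketch swaps FP-growth for a brute-force, level-wise enumeration, which is only practical for tiny inputs; the function name `mine_local` and its parameters are illustrative assumptions, and ε is passed in rather than derived from δ.

```python
from itertools import combinations


def mine_local(transactions, minsup, eps):
    """Return {itemset: relative support} for itemsets with support >= minsup - eps.

    Brute-force enumeration stands in for FP-growth. Itemsets are sorted tuples.
    """
    n = len(transactions)
    if n == 0:
        return {}
    threshold = minsup - eps
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            cand_set = set(cand)
            sup = sum(1 for t in transactions if cand_set <= t) / n
            if sup >= threshold:
                result[cand] = sup
                found = True
        if not found:
            break  # no frequent k-itemset, so no larger itemset can be frequent
    return result
```

On a partition such as `[{"a", "c"}, {"a"}, {"a", "b", "c"}]` with minsup 0.5, the itemset `("a", "b")` has local support 1/3 and is dropped when ε is 0, but it is kept when ε is 0.2, so it can still contribute to the global result.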


2. Global Combination

• The partial results are aggregated using an SE-Tree
• The SE-Tree enumerates the ordered collection of the discovered frequent itemsets
• Baseline Approach: the support is computed by summing the values in the SE-Tree nodes
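A minimal Python sketch of the baseline aggregation follows. It models the SE-Tree as a prefix tree over items in sorted order, where each node accumulates the counts reported for the itemset spelled by its path. The class and function names, and the choice to store absolute counts rather than relative supports, are assumptions; the slides do not specify the node contents.

```python
class SENode:
    """One node of a set-enumeration tree; the path from the root spells an itemset."""

    def __init__(self):
        self.children = {}  # item -> SENode, items along a path are in sorted order
        self.count = 0      # sum of the local counts reported for this itemset
        self.seen = 0       # number of partitions that reported this itemset


def insert(root, itemset, count):
    """Add one partition's count for itemset to the tree."""
    node = root
    for item in sorted(itemset):
        node = node.children.setdefault(item, SENode())
    node.count += count
    node.seen += 1


def combine(local_results):
    """Build the tree from a list of {itemset: count} dicts, one per partition."""
    root = SENode()
    for result in local_results:
        for itemset, count in result.items():
            insert(root, itemset, count)
    return root


def enumerate_itemsets(node, prefix=()):
    """Yield (itemset, summed count) in set-enumeration order."""
    for item in sorted(node.children):
        child = node.children[item]
        path = prefix + (item,)
        if child.seen:
            yield path, child.count
        yield from enumerate_itemsets(child, path)
```

If one partition does not report an itemset because it is locally infrequent there, the summed count underestimates the true support. That gap is what an interpolation step would need to fill.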


3. Structural Interpolated Support

Let S be an SE-Tree, X a candidate k-itemset infrequent on Di, and P(X) the set of (k-1)-subitemsets of X
• The structural interpolated support of X is accepted if:
  1. None of the (k-1)-subitemsets of X is marked as approximated
  2. [second condition not preserved in the transcript]

Method: MrAdam-Chernoff


EXPERIMENTS


Experiment: Goal

1. Measure the runtime overhead of MrAdam on Hadoop w.r.t. FP-Growth

2. Evaluate the performance of MrAdam w.r.t. the literature.


Experimental Settings

• MrAdam implemented using Hadoop 2.2
• Private cluster composed of 8 VMs
  – Intel Xeon 2.2GHz
  – 8GB of RAM
• One VM equipped with 32GB of RAM


Mushroom


Pumsb


Accident


LARGE DATASETS


Mushroom Large


Pumsb


Scalability


Conclusions

We presented the MrAdam algorithm, which:
– Does not require any time-consuming communication and synchronization activity,
– Generates in parallel the sets of locally frequent itemsets, and
– Returns a (1-δ) approximation of the collection of the globally frequent itemsets by aggregation and interpolation.
– Experiments show that MrAdam is from 2 to 100 times faster than PFP.
