parallel association rule mining


Parallel Association Rule Mining

Presented by: Ramoza Ahsan and Xiao Qin

November 5th, 2013

Outline
- Background of Association Rule Mining
- Apriori Algorithm
- Parallel Association Rule Mining
  - Count Distribution
  - Data Distribution
  - Candidate Distribution
- FP-tree Mining and Growth
- Fast Parallel Association Rule Mining Without Candidate Generation
- More Readings

Association Rule Mining
- Association rule mining finds interesting patterns in data; analysis of past transaction data can provide valuable information on customer buying behavior.
- A record usually contains the transaction date and the items bought.
- The literature has focused mostly on serial mining.
- Support and confidence are the parameters of association rule mining.

Association Rule Mining Parameters
- The support supp(X) of an itemset X is the proportion of transactions in the data set that contain X.
- The confidence of a rule X -> Y is the fraction of transactions containing X that also contain Y, i.e. supp(X U Y) / supp(X).
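As a concrete illustration of the two definitions above, here is a minimal Python sketch; the function names and the toy transactions are illustrative (they mirror the milk/bread/egg example on the next slide), not code from the presentation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, x, y):
    """conf(X -> Y) = supp(X u Y) / supp(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, set(x))

transactions = [
    {"milk", "bread"},
    {"egg"},
    {"juice"},
    {"milk", "bread", "egg"},
    {"bread"},
]

print(support(transactions, {"milk", "bread", "egg"}))        # 0.2
print(confidence(transactions, {"milk", "bread"}, {"egg"}))   # 0.5
```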

Transaction ID | Milk | Bread | Egg | Juice
1              | 1    | 1     | 0   | 0
2              | 0    | 0     | 1   | 0
3              | 0    | 0     | 0   | 1
4              | 1    | 1     | 1   | 0
5              | 0    | 1     | 0   | 0

supp({milk, bread, egg}) = 1/5, and the rule {milk, bread} -> {egg} has confidence supp({milk, bread, egg}) / supp({milk, bread}) = (1/5) / (2/5) = 0.5.

Apriori Algorithm
- Apriori runs in two steps:
  1. Generation of candidate itemsets.
  2. Pruning of the itemsets that are infrequent.
- Frequent itemsets are generated level-wise.
- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.

Apriori Algorithm for Generating Frequent Itemsets (minimum support = 2)
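The level-wise candidate-generation and pruning loop can be sketched as follows. This is a compact illustration of the idea, not the paper's implementation; the function name is ours.

```python
from itertools import combinations
from collections import Counter

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining: generate candidates, prune
    with the Apriori principle, then count support in one data pass."""
    # L1: count single items and keep the frequent ones.
    counts = Counter(frozenset([i]) for t in transactions for i in t)
    frequent = {s for s, c in counts.items() if c >= min_support}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Count the surviving candidates with one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent
```

On the milk/bread/egg example with minimum support 2, this yields {milk}, {bread}, {egg}, and {milk, bread}.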

Parallel Association Rule Mining
- The paper presents parallel algorithms for generating frequent itemsets.
- Each of the N processors has private memory and disk.
- Data is distributed evenly across the disks of the processors.
- Count Distribution focuses on minimizing communication.
- Data Distribution uses aggregate memory efficiently.
- Candidate Distribution reduces synchronization between processors.

Algorithm 1: Count Distribution
- Data is partitioned across processors.
- Each processor generates the complete candidate set Ck from the complete frequent itemset Lk-1.
- Each processor traverses its local data partition and develops local support counts.
- The counts are exchanged with the other processors to develop global counts; synchronization is needed here.
- Each processor computes Lk from Ck.
- Each processor independently decides whether to continue or stop.

Algorithm 2: Data Distribution
- Partition the data set into N small chunks.
- Partition the set of candidate k-itemsets into N exclusive subsets, one per node.
- Each node counts the frequency of its itemset subset over one chunk at a time until it has counted through all the chunks.
- Aggregate the counts (synchronize).

Algorithm 3: Candidate Distribution
- If the workload is not balanced, every processor can end up waiting for whichever processor finishes last in every pass.
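One pass of the Count Distribution scheme above can be sketched in Python; the "processors" are simulated sequentially here, and the function name is illustrative.

```python
from collections import Counter

def count_distribution_pass(partitions, candidates):
    """One pass of Count Distribution: every processor counts the full
    candidate set on its own data partition, then the local counts are
    summed into global counts (the exchange/synchronization step)."""
    local_counts = []
    for part in partitions:            # each "processor" scans only its partition
        c = Counter()
        for t in part:
            for cand in candidates:
                if cand <= t:
                    c[cand] += 1
        local_counts.append(c)
    global_counts = Counter()
    for c in local_counts:             # exchange + sum: the synchronization point
        global_counts.update(c)
    return global_counts
```

Because every node holds the complete candidate set, only the (small) count vectors are communicated, which is why the scheme minimizes communication.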

- The Candidate Distribution algorithm tries to remove these dependencies by partitioning both the data and the candidates.
- (Diagram: the data set is split into Data_1 ... Data_5 and Lk-1 into Lk-1_1 ... Lk-1_5; each node i then generates and counts its own candidate set Ck_i.)

Data Partition and L Partition
- Data: each pass, every node grabs the necessary tuples from the data set.
- L: let L3 = {ABC, ABD, ABE, ACD, ACE}. The items in each itemset are lexicographically ordered, and the itemsets are partitioned based on common (k-1)-long prefixes.

Rule Generation
- Example: frequent itemsets {ABCDE} and {AB}.
- The rule that can be generated from this pair is AB => CDE.
- Support: supp(ABCDE).
- Confidence: supp(ABCDE) / supp(AB).

FP-Tree Algorithm
- Allows frequent itemset discovery without candidate itemset generation.
- Step 1: build a compact data structure called the FP-tree, using two passes over the data set.
- Step 2: extract frequent itemsets directly from the FP-tree.
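The two-pass FP-tree construction described above can be sketched as follows. The node layout and helper names are our own illustration, not the slides' code.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count items and keep the frequent ones, ordered by
    # descending frequency (this ordering is the header-table order).
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    # Pass 2: insert each transaction with its items reordered by
    # frequency, so transactions sharing a prefix share a path.
    root = FPNode(None, None)
    for t in transactions:
        items = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, order
```

Sharing prefixes is what makes the tree compact: a path's count is the number of transactions whose frequency-ordered prefix runs through it.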

FP-Tree & FP-Growth Example (min supp = 3)

Fast Parallel Association Rule Mining Without Candidate Generation
Phase 1:
- Each processor is given an equal number of transactions.
- Each processor locally counts the items.
- The local counts are summed to obtain global counts.
- Infrequent items are pruned; frequent items are stored in a header table in descending order of frequency.
- A parallel frequent-pattern tree is then constructed on each processor.
Phase 2:
- Each FP-tree is mined as in the FP-growth algorithm, using the global counts in the header table.
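The Phase 1 steps above can be sketched as follows; the function name is illustrative, and the test data below is the nine-transaction example from the next slide.

```python
from collections import Counter

def global_header_table(partitions, min_support):
    """Phase 1 of the parallel scheme: each processor counts its local
    partition, local counts are summed into global counts, infrequent
    items are pruned, and the survivors form the shared header table
    in descending order of global frequency."""
    local = [Counter(i for t in part for i in t) for part in partitions]
    global_count = Counter()
    for c in local:                      # sum local counts into global counts
        global_count.update(c)
    header = [(i, c) for i, c in global_count.items() if c >= min_support]
    header.sort(key=lambda ic: (-ic[1], ic[0]))  # descending frequency
    return header
```

On the slides' example with min supp = 4 this produces the header table B:8, A:7, D:7, F:6, G:6.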

Example with min supp = 4

TID | Items       | Processor
1   | A,B,C,D,E   | P0
2   | F,B,D,E,G   | P0
3   | B,D,A,E,G   | P0
4   | A,B,F,G,D   | P1
5   | B,F,D,G,K   | P1
6   | A,B,F,D,G,K | P1
7   | A,R,M,K,O   | P2
8   | B,F,G,A,D   | P2
9   | A,B,F,M,O   | P2

Step 1: local counts per processor, e.g. item A: P0 = 2, P1 = 2, P2 = 3; item B: P0 = 3, P1 = 3, P2 = 2; item C: P0 = 1, P1 = 0, P2 = 0; ...

The global counters are then accumulated, with each processor summing the counts for a subset of the items (e.g. P0 for A, B, C, D).

Step 4: after pruning the infrequent items, the header table is:

Item | Global Count
B    | 8
A    | 7
D    | 7
F    | 6
G    | 6

Construction of the Local FP-Tree for P0

TID | Items     | Reordered Transaction
1   | A,B,C,D,E | B,A,D
2   | F,B,D,E,G | B,D,F,G
3   | B,D,A,E,G | B,A,D,G

Inserting the reordered transactions grows P0's local tree: after T1, the path B:1 - A:1 - D:1; after T2, B:2 with a new branch D:1 - F:1 - G:1; after T3, B:3 - A:2 - D:2 with child G:1.

Conditional Pattern Bases

Working up from the bottom of the header table, the conditional pattern base of each item is read off the tree (built up item by item on the slides):

Item | Conditional Pattern Base
G    | {D:1, A:1, B:1}, {F:1, D:1, B:1}
F    | {D:1, B:1}
D    | {A:2, B:2}, {B:1}
A    | {B:2}
B    | {}

Frequent Pattern Strings
- All frequent-pattern trees are shared by all processors.
- Each processor generates the conditional pattern bases for its respective items in the header table.
- Merging all conditional pattern bases of the same item yields the frequent string.
- If the support of an item is less than the threshold, it is not added to the final frequent string.

More Readings


FP-Growth on Hadoop

Three MapReduce passes.

FP-Growth on Hadoop

Core

Thank You!