
    Data Mining Lab Report

    Lab 1

Apriori and FP-growth

    Submitted by

    Redowan Mahmud

Roll: 16

    Submitted to

    Md. Samiullah


    Tasks to perform

Implement the Apriori and FP-growth algorithms for mining frequent item sets.

    Run the simulation program on some given datasets.

Derive Minimum Support vs. Time, Minimum Support vs. Memory, Dataset Size vs. Time, and Dataset Size vs. Memory graphs for the two algorithms.

* System configuration should be mentioned.

    Basic Knowledge

    Apriori Algorithm:

As is common in association rule mining, given a set of item sets, the algorithm attempts to find subsets which are common to at least a minimum number C of the item sets. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.

Apriori uses breadth-first search and a tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k-1. Then it prunes the candidates which have an infrequent sub-pattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine frequent item sets among the candidates.
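To make the join and prune steps concrete, here is a minimal Java sketch assuming integer-coded items; the class and method names are illustrative assumptions and are not taken from the submitted source code.

    import java.util.*;

    public class AprioriCandidateGen {

        // Join step: two frequent (k-1)-item sets that agree on their
        // first k-2 items are merged into one k-item candidate.
        static Set<Set<Integer>> generateCandidates(Set<Set<Integer>> frequentKMinus1, int k) {
            List<List<Integer>> sorted = new ArrayList<>();
            for (Set<Integer> s : frequentKMinus1) {
                List<Integer> l = new ArrayList<>(s);
                Collections.sort(l);
                sorted.add(l);
            }
            Set<Set<Integer>> candidates = new HashSet<>();
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    List<Integer> a = sorted.get(i), b = sorted.get(j);
                    if (!a.subList(0, k - 2).equals(b.subList(0, k - 2))) continue;
                    Set<Integer> cand = new TreeSet<>(a);
                    cand.addAll(b);
                    // Prune step (downward closure): every (k-1)-subset of a
                    // frequent k-item set must itself be frequent.
                    if (cand.size() == k && allSubsetsFrequent(cand, frequentKMinus1)) {
                        candidates.add(cand);
                    }
                }
            }
            return candidates;
        }

        static boolean allSubsetsFrequent(Set<Integer> cand, Set<Set<Integer>> frequentKMinus1) {
            for (Integer item : cand) {
                Set<Integer> subset = new TreeSet<>(cand);
                subset.remove(item);
                if (!frequentKMinus1.contains(subset)) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            Set<Set<Integer>> f2 = new HashSet<>();
            f2.add(new TreeSet<>(Arrays.asList(1, 2)));
            f2.add(new TreeSet<>(Arrays.asList(1, 3)));
            f2.add(new TreeSet<>(Arrays.asList(2, 3)));
            System.out.println(generateCandidates(f2, 3)); // prints [[1, 2, 3]]
        }
    }

The prune step is the downward closure lemma in action: a candidate with any infrequent (k-1)-subset cannot be frequent, so it is discarded before the database scan.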

Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs, which have spawned other algorithms. Candidate generation produces large numbers of subsets (the algorithm attempts to load the candidate set with as many as possible before each scan). Bottom-up subset exploration (essentially a breadth-first traversal of the subset lattice) finds any maximal subset S only after considering all 2^|S| - 1 of its proper subsets; for a maximal item set of size 10, that means examining 1023 smaller item sets first.


FP-growth Algorithm:

The FP-growth algorithm is an alternative way to find frequent item sets without using candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is the usage of a special data structure named frequent-pattern tree (FP-tree), which retains the item set association information.

In simple words, this algorithm works as follows: first it compresses the input database, creating an FP-tree instance to represent frequent items. After this first step it divides the compressed database into a set of conditional databases, each one associated with one frequent pattern. Finally, each such database is mined separately. Using this strategy, FP-growth reduces the search cost by looking for short patterns recursively and then concatenating them into long frequent patterns, offering good selectivity.

In large databases, it is not possible to hold the FP-tree in main memory. A strategy to cope with this problem is to first partition the database into a set of smaller databases (called projected databases), and then construct an FP-tree from each of these smaller databases.
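The compression step above hinges on the FP-tree. The following minimal Java sketch shows one plausible shape for the tree and its insertion routine; the class layout is an assumption for illustration, not the submitted code.

    import java.util.*;

    public class FPTree {

        // One tree node: an item, its count, a parent link (for walking
        // prefix paths upward), and children keyed by item.
        static class Node {
            final int item;
            int count;
            final Node parent;
            final Map<Integer, Node> children = new HashMap<>();

            Node(int item, Node parent) {
                this.item = item;
                this.parent = parent;
            }
        }

        final Node root = new Node(-1, null);

        // Header table: for every item, all nodes where it occurs; mining
        // follows these links to extract the conditional databases.
        final Map<Integer, List<Node>> header = new HashMap<>();

        // Insert one transaction whose items are already filtered to the
        // frequent ones and sorted by descending global frequency.
        void insert(List<Integer> transaction) {
            Node current = root;
            for (int item : transaction) {
                Node child = current.children.get(item);
                if (child == null) {
                    child = new Node(item, current);
                    current.children.put(item, child);
                    header.computeIfAbsent(item, x -> new ArrayList<>()).add(child);
                }
                child.count++; // shared prefixes collapse into a single path
                current = child;
            }
        }
    }

Because transactions are inserted in a fixed frequency order, transactions sharing a prefix share a path, which is what compresses the database; the header table then gives direct access to every occurrence of an item when its conditional database is built.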

    Implementation

Both algorithms have been developed in the Java programming language. Some built-in Java data structures are used in both source codes (a minimal sketch of these structures follows the list):

Java Vector to handle the candidate item sets (Apriori algorithm) and the projected conditional databases (FP-growth algorithm).

Java file I/O to read the input datasets and write the output.

A Java object to keep each item's name and its occurrence count.
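As a rough sketch of the structures just listed (the names here are hypothetical, since the actual source code is not reproduced in this report), the item object and a Vector-based candidate store might look like this:

    import java.util.Vector;

    // Mirrors the "Java object to keep each item's name and its
    // occurrence count" described above.
    class ItemCount {
        final String name;
        int occurrences;

        ItemCount(String name) {
            this.name = name;
        }

        void increment() {
            occurrences++;
        }
    }

    // Candidate item sets (Apriori) or projected conditional transactions
    // (FP-growth) kept in Java Vectors, one Vector<String> per item set.
    class CandidateStore {
        final Vector<Vector<String>> itemSets = new Vector<>();

        void add(Vector<String> itemSet) {
            itemSets.add(itemSet);
        }
    }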

    Simulation

Both programs have been simulated on some real-life datasets (chess, mushroom). The outputs were compared against a third-party reference, and the results matched exactly.


    Graphs

1. Dataset: Chess; Name: Minimum Support (x) vs. Time (y); Constant: Dataset Size (3196)

2. Dataset: Chess; Name: Minimum Support (x) vs. Memory (y); Constant: Dataset Size (3196)

3. Dataset: Chess; Name: Dataset Size (x) vs. Time (y); Constant: Minimum Support (87%)

4. Dataset: Chess; Name: Dataset Size (x) vs. Memory (y); Constant: Minimum Support (87%)


5. Dataset: Mushroom; Name: Minimum Support (x) vs. Time (y); Constant: Dataset Size (8124)

6. Dataset: Mushroom; Name: Minimum Support (x) vs. Memory (y); Constant: Dataset Size (8124)

7. Dataset: Mushroom; Name: Dataset Size (x) vs. Time (y); Constant: Minimum Support (47%)

8. Dataset: Mushroom; Name: Dataset Size (x) vs. Memory (y); Constant: Minimum Support (47%)


    System Configuration

