
    Data Mining Lab Report

    Lab 1

Apriori and FP-growth

    Submitted by

    Redowan Mahmud

Roll: 16

    Submitted to

    Md. Samiullah


    Tasks to perform

Implement the Apriori and FP-growth algorithms for mining frequent item sets.

    Run the simulation program on some given datasets.

Derive Minimum Support vs. Time, Minimum Support vs. Memory, Dataset Size vs. Time, and Dataset Size vs. Memory graphs for the two algorithms.

* System configuration should be mentioned.

    Basic Knowledge

    Apriori Algorithm:

As is common in association rule mining, given a set of item sets, the algorithm attempts to find subsets which are common to at least a minimum number C of the item sets. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.

Apriori uses breadth-first search and a tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k-1. Then it prunes the candidates which have an infrequent sub-pattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine frequent item sets among the candidates.
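To make the join and prune steps concrete, here is a minimal Java sketch assuming integer-coded items; the class and method names are illustrative assumptions and are not taken from the submitted source code.

    import java.util.*;

    public class AprioriCandidateGen {

        // Join step: two frequent (k-1)-item sets that agree on their
        // first k-2 items are merged into one k-item candidate.
        static Set<Set<Integer>> generateCandidates(Set<Set<Integer>> frequentKMinus1, int k) {
            List<List<Integer>> sorted = new ArrayList<>();
            for (Set<Integer> s : frequentKMinus1) {
                List<Integer> l = new ArrayList<>(s);
                Collections.sort(l);
                sorted.add(l);
            }
            Set<Set<Integer>> candidates = new HashSet<>();
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    List<Integer> a = sorted.get(i), b = sorted.get(j);
                    if (!a.subList(0, k - 2).equals(b.subList(0, k - 2))) continue;
                    Set<Integer> cand = new TreeSet<>(a);
                    cand.addAll(b);
                    // Prune step (downward closure): every (k-1)-subset of a
                    // frequent k-item set must itself be frequent.
                    if (cand.size() == k && allSubsetsFrequent(cand, frequentKMinus1)) {
                        candidates.add(cand);
                    }
                }
            }
            return candidates;
        }

        static boolean allSubsetsFrequent(Set<Integer> cand, Set<Set<Integer>> frequentKMinus1) {
            for (Integer item : cand) {
                Set<Integer> subset = new TreeSet<>(cand);
                subset.remove(item);
                if (!frequentKMinus1.contains(subset)) return false;
            }
            return true;
        }

        public static void main(String[] args) {
            Set<Set<Integer>> f2 = new HashSet<>();
            f2.add(new TreeSet<>(Arrays.asList(1, 2)));
            f2.add(new TreeSet<>(Arrays.asList(1, 3)));
            f2.add(new TreeSet<>(Arrays.asList(2, 3)));
            System.out.println(generateCandidates(f2, 3)); // prints [[1, 2, 3]]
        }
    }

The prune step is the downward closure lemma in action: a candidate with any infrequent (k-1)-subset cannot be frequent, so it is discarded before the database scan.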

Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs, which have spawned other algorithms. Candidate generation produces large numbers of subsets (the algorithm attempts to load the candidate set with as many as possible before each scan). Bottom-up subset exploration (essentially a breadth-first traversal of the subset lattice) finds any maximal subset S only after considering all 2^|S| - 1 of its proper subsets; for a maximal item set of size 10, that means examining 1023 smaller item sets first.


FP-growth Algorithm:

The FP-growth algorithm is an alternative way to find frequent item sets without using candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is the usage of a special data structure named frequent-pattern tree (FP-tree), which retains the item set association information.

In simple words, this algorithm works as follows: first it compresses the input database, creating an FP-tree instance to represent frequent items. After this first step it divides the compressed database into a set of conditional databases, each one associated with one frequent pattern. Finally, each such database is mined separately. Using this strategy, FP-growth reduces the search cost by looking for short patterns recursively and then concatenating them into long frequent patterns, offering good selectivity.

In large databases, it is not possible to hold the FP-tree in main memory. A strategy to cope with this problem is to first partition the database into a set of smaller databases (called projected databases), and then construct an FP-tree from each of these smaller databases.
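The compression step above hinges on the FP-tree. The following minimal Java sketch shows one plausible shape for the tree and its insertion routine; the class layout is an assumption for illustration, not the submitted code.

    import java.util.*;

    public class FPTree {

        // One tree node: an item, its count, a parent link (for walking
        // prefix paths upward), and children keyed by item.
        static class Node {
            final int item;
            int count;
            final Node parent;
            final Map<Integer, Node> children = new HashMap<>();

            Node(int item, Node parent) {
                this.item = item;
                this.parent = parent;
            }
        }

        final Node root = new Node(-1, null);

        // Header table: for every item, all nodes where it occurs; mining
        // follows these links to extract the conditional databases.
        final Map<Integer, List<Node>> header = new HashMap<>();

        // Insert one transaction whose items are already filtered to the
        // frequent ones and sorted by descending global frequency.
        void insert(List<Integer> transaction) {
            Node current = root;
            for (int item : transaction) {
                Node child = current.children.get(item);
                if (child == null) {
                    child = new Node(item, current);
                    current.children.put(item, child);
                    header.computeIfAbsent(item, x -> new ArrayList<>()).add(child);
                }
                child.count++; // shared prefixes collapse into a single path
                current = child;
            }
        }
    }

Because transactions are inserted in a fixed frequency order, transactions sharing a prefix share a path, which is what compresses the database; the header table then gives direct access to every occurrence of an item when its conditional database is built.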

    Implementation

Both algorithms have been developed in the Java programming language. Some built-in Java data structures are used in both source codes (a minimal sketch of these structures follows the list):

Java Vector to handle the candidate item sets (Apriori algorithm) and the projected conditional databases (FP-growth algorithm).

Java file I/O to read the input datasets and write the output.

A Java object to keep each item's name and its occurrence count.
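As a rough sketch of the structures just listed (the names here are hypothetical, since the actual source code is not reproduced in this report), the item object and a Vector-based candidate store might look like this:

    import java.util.Vector;

    // Mirrors the "Java object to keep each item's name and its
    // occurrence count" described above.
    class ItemCount {
        final String name;
        int occurrences;

        ItemCount(String name) {
            this.name = name;
        }

        void increment() {
            occurrences++;
        }
    }

    // Candidate item sets (Apriori) or projected conditional transactions
    // (FP-growth) kept in Java Vectors, one Vector<String> per item set.
    class CandidateStore {
        final Vector<Vector<String>> itemSets = new Vector<>();

        void add(Vector<String> itemSet) {
            itemSets.add(itemSet);
        }
    }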

    Simulation

Both programs have been simulated on some real-life datasets (chess, mushroom). The outputs were compared against a third-party reference, and the results matched exactly.


    Graphs

1. Dataset: Chess; Name: Minimum Support (x) vs. Time (y); Constant: Dataset Size (3196)

2. Dataset: Chess; Name: Minimum Support (x) vs. Memory (y); Constant: Dataset Size (3196)

3. Dataset: Chess; Name: Dataset Size (x) vs. Time (y); Constant: Minimum Support (87%)

4. Dataset: Chess; Name: Dataset Size (x) vs. Memory (y); Constant: Minimum Support (87%)


5. Dataset: Mushroom; Name: Minimum Support (x) vs. Time (y); Constant: Dataset Size (8124)

6. Dataset: Mushroom; Name: Minimum Support (x) vs. Memory (y); Constant: Dataset Size (8124)

7. Dataset: Mushroom; Name: Dataset Size (x) vs. Time (y); Constant: Minimum Support (47%)

8. Dataset: Mushroom; Name: Dataset Size (x) vs. Memory (y); Constant: Minimum Support (47%)


    System Configuration

