
    International Journal of Engineering Research (ISSN : 2319-6890)

    Volume No.2, Issue No.1, pp : 07-12 01 Jan. 2013

    IJER@2013 Page 7

    A Recent Review on XML data mining and FFP

    Amit Kumar Mishra, Hitesh Gupta

    Patel College of Science and Technology, Bhopal, M.P. India


    Abstract

    The goal of data mining is to extract, or "mine", knowledge from large amounts of data. Emerging technologies for semi-structured data have attracted wide attention in networking, e-commerce, information retrieval, and databases. XML has become very popular for representing semi-structured data and is a standard for data exchange over the web, so mining XML data from the web is becoming increasingly important. However, the structure of XML data can be complex and irregular. Association rule mining plays a key role in mining data for frequent patterns. Frequent Pattern-growth (FP-growth) mines the complete set of frequent patterns by pattern fragment growth. FP-tree based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets, and it uses a partition-based, divide-and-conquer method. This paper presents a review of XML data mining using Fast Frequent Pattern (FFP) mining in various domains.

    Keywords: Data mining, semi-structured data mining, Association mining, XML

    I. INTRODUCTION

    The World Wide Web is one of the fastest growing areas of intelligence gathering. Data mining [1], also referred to as knowledge discovery in databases, is the nontrivial extraction of implicit, previously unknown, and potentially useful information (such as knowledge rules, constraints, and regularities) from data in databases. Other terms for data mining include knowledge mining from databases, knowledge extraction, data archaeology, data dredging, and data analysis. Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity for major revenues. The discovered knowledge can be applied to information management, query processing, decision making [1], process control, and many other applications. Researchers in many different fields, including database systems, knowledge-base systems, artificial intelligence, machine learning, knowledge acquisition, statistics, and spatial databases, have shown great interest in data mining. Nowadays a huge amount of data is stored in every institute, university, and company. However, data cannot tell us anything without processing: we are often flooded with data but lack information. Data mining has attracted increasing interest in recent years as a way to find the underlying models or patterns in the data and to make use of the models and patterns found [2].

    Data mining has been used in a broad range of applications [3]. More and more leading-edge organizations are realizing that data mining gives them the ability to reach their goals in customer relationship management, risk management, fraud and abuse detection, e-business, and so on.

    Frequent pattern mining [3] and association rule mining (ARM) [3] play a very important role in data mining. Semi-structured data refers to a set of data in which there is some implicit structure that is generally followed, but not enough of a regular structure to qualify for the kinds of management and automation usually applied to structured data. Examples include the World Wide Web, bioinformatics databases, and data warehousing.

    Currently many websites are built with HTML tags. One problem with retrieving data from HTML documents is that they are not structured the way traditional databases are: Web pages created using HTML are semi-structured, which makes querying more difficult than with a well-formed database containing


    schemas and attributes with defined domains. XML has made this task more convenient.

    Association rule mining is the process of mining, from a transactional database, the set of rules that satisfy the minimum support and confidence. It involves picking out unknown interdependences in the data and finding the rules between the items. Association rule mining plays a key role in mining data for frequent patterns. A frequent pattern is a pattern (i.e., a set of items, substructures, subsequences, etc.) that occurs frequently in a dataset. Frequent itemsets play an essential role in many data mining tasks that try to find interesting patterns in databases, such as association rules, correlations, sequences, episodes, classifiers, and clusters, of which the mining of association rules is one of the most popular problems.

    We start this review with a discussion on mining in

    semi-structured data and explore the various challenges

    and look into the algorithms for mining semi-structured

    data.

    II. BACKGROUND

    Association rule mining is the process of mining, from a transactional database [4], the set of rules that satisfy the minimum support and confidence.

    Support: The rule $A \Rightarrow B$ holds in the transaction set $D$ with support [4] $s$, where $s$ is the percentage of transactions in $D$ containing $A \cup B$ (that is, both $A$ and $B$).

    $$\mathrm{Support}(A \Rightarrow B) = P(A \cup B)$$

    Confidence: The rule $A \Rightarrow B$ has confidence [4] $c$ in the transaction set $D$, where $c$ is the percentage of transactions in $D$ containing $A$ that also contain $B$.

    $$\mathrm{Confidence}(A \Rightarrow B) = P(B \mid A)$$
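The two measures can be illustrated with a minimal sketch. The function names and the toy transaction set below are hypothetical and are not taken from the paper; confidence is estimated as the support of the combined itemset divided by the support of the antecedent, which assumes the antecedent occurs at least once.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy transaction set D (illustration only).
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(D, {"bread", "milk"})       # 2 of the 4 transactions
c = confidence(D, {"bread"}, {"milk"})  # 2 of the 3 transactions containing bread
```

Here the rule bread => milk has support 0.5 and confidence 2/3.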

    In general, association rule mining can be viewed as a two-step process: (i) generating all itemsets whose support is greater than or equal to the user-specified minimum support; (ii) generating all rules whose confidence is greater than or equal to the user-specified minimum confidence.
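The two steps can be sketched as follows. This is a brute-force illustration that enumerates every possible itemset, which is only practical for tiny examples and is not one of the algorithms surveyed here; the function names, thresholds, and toy data are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Step (i): enumerate all itemsets and keep those meeting min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = sum(1 for t in transactions if set(combo) <= t)
            if count / n >= min_support:
                result[frozenset(combo)] = count / n
    return result


def rules(freq, min_confidence):
    """Step (ii): split each frequent itemset into antecedent => consequent."""
    out = []
    for itemset, sup in freq.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent, so freq[a] exists.
                conf = sup / freq[a]
                if conf >= min_confidence:
                    out.append((a, itemset - a, sup, conf))
    return out


# Hypothetical toy data and thresholds.
D = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
freq = frequent_itemsets(D, 0.5)
found = rules(freq, 0.6)
```

Each entry of `found` is a tuple (antecedent, consequent, support, confidence).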

    There are normally two categories of data mining: (i) descriptive data mining and (ii) predictive data mining. Descriptive data mining is used for summarization or generalization, whereas predictive data mining is used for inference or prediction. Association rule mining falls under the descriptive category. Association rules aim to extract important correlations among the data items in databases.

    (A) Semi-Structured Data Mining Techniques

    In this section, we briefly describe key semi-structured data mining algorithms that have been developed to handle data that does not have a rigid structure.

    a) Association Rules: Association rule mining is the process of mining, from a transactional database, the set of rules that satisfy the minimum support and confidence. Algorithms for mining association rules from relational data are well developed, and several query languages have been proposed to assist association rule mining. Mining association rules from semi-structured data such as XML has received little attention, as the data mining community has focused on developing techniques for extracting common structure from heterogeneous XML data. The straightforward approach to association rule mining from XML data is to map the XML documents to the relational data model and store them in a relational database. This allows the standard tools for rule mining from relational databases to be applied.
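A minimal sketch of this mapping, using Python's standard `xml.etree.ElementTree` module. The document layout (`orders`, `order`, `item`) and the function name are hypothetical; a real mapping would depend on the schema of the XML source, and flattening discards the nesting information that distinguishes XML from relational data.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document (illustration only).
XML_DOC = """
<orders>
  <order id="1"><item>pen</item><item>ink</item></order>
  <order id="2"><item>pen</item><item>paper</item></order>
  <order id="3"><item>pen</item><item>ink</item><item>paper</item></order>
</orders>
"""


def xml_to_transactions(xml_text, record_tag="order", item_tag="item"):
    """Flatten each record element into a (record id, set of item values) row."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(record_tag):
        items = {el.text.strip() for el in record.iter(item_tag) if el.text}
        rows.append((record.get("id"), items))
    return rows


rows = xml_to_transactions(XML_DOC)
```

Each row can then be stored as a transaction in a relational table and passed to a conventional association rule miner.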

    b) Clustering: Clustering groups the objects of a database into multiple clusters (classes) so that objects in the same group are highly similar and objects in different groups are highly dissimilar. Clustering is a useful data mining technique for discovering interesting data distributions and patterns in the underlying data. It is also helpful for categorizing WWW documents, grouping genes and proteins that have similar functions, or detecting seismic faults by grouping the entries in an earthquake catalog.

    The algorithms used for frequent pattern mining are divided into two categories: Apriori-based algorithms and tree-structure based algorithms. Apriori-based algorithms use a generate-and-test strategy, i.e., they find frequent patterns by constructing candidate itemsets and checking their support counts or frequencies against the transactional database.
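The generate-and-test strategy can be sketched as a level-wise loop. This is a simplified illustration of the Apriori idea under stated assumptions (absolute support counts, a naive join step, toy data), not an optimized implementation.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Level-wise generate-and-test: build candidates, then count them."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        # Generate: join frequent (k-1)-itemsets into k-item candidates.
        prev = list(level)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Test: count each surviving candidate against the database.
        level = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_count:
                level[c] = n
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical toy database and threshold.
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(D, 3)
```

In this example the candidate {a, b, c} is generated and counted but rejected, because it occurs in only two transactions. That is the kind of wasted work the tree-based methods try to avoid.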

    The tree-structure based algorithms follow a test-only approach, i.e., there is no need to generate candidate


    items; only the support counts or frequencies are tested. Examples are the FP-tree with FP-growth, the CAN-tree, etc.

    The eXtensible Markup Language (XML) has become a standard language for data representation and exchange. XML is a standard, flexible syntax for exchanging both regular, structured data (database content of all kinds, such as inventory, billing, and orders, with small typed values) and irregular, unstructured text (documents of all kinds, such as transcripts, books, and legal briefs). With the continuous growth in XML data sources, the ability to manage collections of XML documents and discover knowledge from them for decision support becomes increasingly important. Mining XML documents differs significantly from structured data mining and text mining. XML allows the representation of semi-structured and hierarchical data containing not only the values of individual items but also the relationships between data items. Element tags and their nesting dictate the structure of an XML document.

    III. LITERATURE SURVEY

    (A) FP-growth algorithm

    In 2000, Han et al. proposed the FP-growth [5] algorithm, the first pattern-growth algorithm. FP-growth constructs an FP-tree structure and mines frequent patterns by traversing the constructed FP-tree. The FP-tree is an extended prefix-tree structure that holds crucial, condensed information about frequent patterns.

    FP-tree structure: The FP-tree structure has sufficient information to mine the complete set of frequent patterns. It consists of a prefix tree of frequent 1-itemsets and a frequent-item header table. Each node in the prefix tree has three fields: item-name, count, and node-link.

    Construction of FP-tree: FP-growth has to scan the TDB (transactional database) twice to construct an FP-tree. The first scan of the TDB retrieves the set of frequent items. The retrieved frequent items are then ordered by descending support; the ordered list is called the F-list. In the second scan, a tree T is created whose root node R is labeled null. Then the following steps are applied to every transaction in the TDB. Here, let a transaction be represented as [p|P], where p is the first item of the transaction and P is the list of remaining items. In each transaction, infrequent items are discarded, and the remaining frequent items are sorted in F-list order.
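The first scan and the per-transaction preprocessing of the second scan can be sketched as follows. The function names, the alphabetical tie-break between items of equal support, and the toy database are assumptions made for illustration.

```python
from collections import Counter


def build_f_list(tdb, min_count):
    """First scan: count items, keep the frequent ones, order by descending support."""
    counts = Counter(item for t in tdb for item in set(t))
    frequent = [(item, n) for item, n in counts.items() if n >= min_count]
    # Ties are broken alphabetically; this tie-break is an assumption of the sketch.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in frequent]


def reorder(transaction, f_list):
    """Second scan, per transaction: drop infrequent items, sort in F-list order."""
    rank = {item: i for i, item in enumerate(f_list)}
    return sorted((i for i in set(transaction) if i in rank), key=rank.__getitem__)


# Hypothetical toy TDB with a minimum support count of 3.
TDB = [["f", "a", "c", "d"], ["a", "b", "c", "f"], ["b", "f"], ["b", "c"], ["a", "f", "c", "e"]]
f_list = build_f_list(TDB, 3)
ordered = [reorder(t, f_list) for t in TDB]
```

With this data the infrequent items d and e are discarded, and every transaction is rewritten in the order c, f, a, b before it is inserted into the tree.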

    Advantages and Disadvantages

    This method is advantageous because it does not generate any candidate items. It is disadvantageous because it suffers from poor spatial and temporal locality.

    (B) FP-Tree (Frequent Pattern Tree)

    An FP-tree is a tree structure in which all items are arranged in descending order of their frequency or support count. After the tree is constructed, the frequent items can be mined using FP-growth.

    (a) Creation of FP-Tree

    First Iteration: Consider a transactional database consisting of a set of transactions, each with a transaction id and a list of items. Scan the entire database and collect the count of each item present in it. Then sort the items in decreasing order of frequency (number of occurrences).

    (b) Second Iteration

    Now scan the transactional database once again. The FP-tree is constructed as follows. Start with an empty root node. Add the transactions one after another as prefix subtrees of the root node. Repeat this process until all the transactions have been included in the FP-tree. Then construct a header table consisting of the items, their counts, and their head-of-node links.
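The two iterations can be sketched as below. This is an illustrative construction, not the authors' code: the `Node` class, the header layout (item mapped to a count and a head-of-node link), the minimum-count filter, and the toy database are assumptions, and the header table is filled while the tree is built rather than afterwards.

```python
from collections import Counter


class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}   # item -> child Node
        self.link = None     # node-link to the next node holding the same item


def build_fp_tree(tdb, min_count):
    # First iteration: count items and fix the descending-frequency order.
    counts = Counter(i for t in tdb for i in set(t))
    order = sorted((i for i in counts if counts[i] >= min_count),
                   key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None, None)
    header = {item: [counts[item], None] for item in order}  # item -> [count, head link]
    # Second iteration: insert every reordered transaction as a prefix path.
    for t in tdb:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.__getitem__):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                # Thread the new node onto the front of this item's node-link chain.
                child.link, header[item][1] = header[item][1], child
            child.count += 1
            node = child
    return root, header


def chain_counts(header, item):
    """Follow the node-links for `item` and collect the count stored in each node."""
    out, node = [], header[item][1]
    while node is not None:
        out.append(node.count)
        node = node.link
    return out


# Hypothetical toy TDB with a minimum support count of 3.
TDB = [["f", "a", "c", "d"], ["a", "b", "c", "f"], ["b", "f"], ["b", "c"], ["a", "f", "c", "e"]]
root, header = build_fp_tree(TDB, 3)
```

Summing the counts along an item's node-link chain gives back that item's total support, which is what the mining step relies on.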

    (c) Finding Frequent Patterns from FP-Tree

    After the FP-tree is constructed, the frequent patterns can be mined using an iterative approach, FP-growth. This approach looks up the header table and selects the items that meet the minimum support. For each such item, it removes the infrequent items from the prefix paths of the item's nodes; the remaining items are considered the frequent itemsets of the specified item.
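A compact sketch of this mining step, assuming the usual conditional-pattern-base formulation of FP-growth. The names `build_tree` and `fp_growth`, the dictionary-based node layout, and the toy database are hypothetical; plain per-item node lists stand in for the node-link chains of the header table.

```python
from collections import defaultdict


def build_tree(weighted, min_count):
    """Build an FP-tree from (items, weight) pairs; return per-item node lists and counts."""
    counts = defaultdict(int)
    for items, w in weighted:
        for i in set(items):
            counts[i] += w
    keep = {i for i, n in counts.items() if n >= min_count}
    root = {"item": None, "count": 0, "parent": None, "children": {}}
    nodes = defaultdict(list)   # item -> every tree node holding that item
    for items, w in weighted:
        node = root
        ordered = sorted((i for i in set(items) if i in keep),
                         key=lambda i: (-counts[i], i))
        for i in ordered:
            child = node["children"].get(i)
            if child is None:
                child = {"item": i, "count": 0, "parent": node, "children": {}}
                node["children"][i] = child
                nodes[i].append(child)
            child["count"] += w
            node = child
    return nodes, {i: counts[i] for i in keep}


def fp_growth(weighted, min_count, suffix=frozenset()):
    """Return {frequent itemset: support count}, grown one suffix item at a time."""
    nodes, counts = build_tree(weighted, min_count)
    patterns = {}
    for item, total in counts.items():
        pattern = suffix | {item}
        patterns[pattern] = total
        # Conditional pattern base: the prefix path of every node holding `item`.
        base = []
        for node in nodes[item]:
            path, p = [], node["parent"]
            while p["item"] is not None:
                path.append(p["item"])
                p = p["parent"]
            if path:
                base.append((path, node["count"]))
        # Infrequent items in the prefix paths are dropped when the conditional tree is built.
        patterns.update(fp_growth(base, min_count, pattern))
    return patterns


# Hypothetical toy TDB with a minimum support count of 3.
TDB = [["f", "a", "c", "d"], ["a", "b", "c", "f"], ["b", "f"], ["b", "c"], ["a", "f", "c", "e"]]
patterns = fp_growth([(t, 1) for t in TDB], 3)
```

No candidate itemsets are generated and tested against the database here; each pattern is grown from a suffix item using only the prefix paths stored in the tree.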

    Advantages and Disadvantages

    This method is advantageous because it does not generate any candidate items. It is disadvantageous because it suffers from poor spatial and temporal locality.


    (C) CAN-Tree (CANonical Tree)

    A CAN-tree is a tree structure that arranges the nodes of the tree in some canonical order. It follows a tree-based incremental mining approach. Unlike the FP-tree approach, there is no need to rescan the transactional database when it is updated. Because the canonical order is followed, frequency changes (if any) due to incremental updates such as insertion, deletion, and modification of transactions do not affect the ordering of the nodes in the CAN-tree [3]. After the CAN-tree is constructed, the frequent patterns can be mined from it.
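A small sketch of why a canonical order makes incremental updates cheap. Lexicographic order is used as the canonical order here, which is an assumption; the class and function names and the toy transactions are hypothetical.

```python
class CanNode:
    def __init__(self):
        self.count = 0
        self.children = {}   # item -> CanNode


def insert(root, transaction, delta=1):
    """Add (delta=+1) or delete (delta=-1) one transaction in canonical order."""
    node = root
    for item in sorted(set(transaction)):   # canonical order: lexicographic (assumed)
        node = node.children.setdefault(item, CanNode())
        node.count += delta


def paths(node, prefix=()):
    """List every (path, count) in the tree, skipping nodes whose count dropped to 0."""
    out = []
    for item in sorted(node.children):
        child = node.children[item]
        if child.count > 0:
            out.append((prefix + (item,), child.count))
            out.extend(paths(child, prefix + (item,)))
    return out


root = CanNode()
for t in [["b", "a"], ["c", "a"], ["a", "b", "c"]]:
    insert(root, t)
before = paths(root)

insert(root, ["c", "b"])        # incremental insertion, no rescan of earlier data
insert(root, ["c", "a"], -1)    # incremental deletion of an earlier transaction
after = paths(root)
```

Because the item order never depends on frequencies, the existing paths keep their shape after each update and only the counts along one path change.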

    Finding Frequent Patterns from CAN-Tree: After the CAN-tree is constructed, the frequent patterns are mined by traversing the tree in an upward direction. This can be done in a way similar to FP-growth, by constructing a header table and finding only the frequent items.

    Advantages and Disadvantages

    This method is advantageous because it supports incremental updates without any major changes to the tree.

    (D) COFI-Tree

    COFI [6] is reported to be much faster than FP-growth and to require significantly less memory. The idea of COFI is to build projections from the FP-tree, each corresponding to the sub-transactions of items co-occurring with a given frequent item. These trees are built and efficiently mined one at a time, which keeps the memory footprint small. The COFI algorithm generates candidates using a top-down approach, and its performance is severely affected when mining databases that have potentially long candidate patterns that turn out not to be frequent, because COFI needs to generate candidate sub-patterns for all of its candidate patterns. The revised algorithm builds upon the COFI approach to find the set of frequent patterns while avoiding the generation of useless candidates.

    Creation of COFI-Tree: The basic idea of the new algorithm is simple and is based on the notion of maximal frequent patterns. A frequent itemset $X$ is said to be maximal if there is no frequent itemset $X'$ such that $X \subset X'$.

    Maximal frequent patterns are a relatively small subset of all frequent itemsets. In other words, each maximal frequent itemset is a superset of some frequent itemsets. Assume that we have an oracle that knows all the maximal frequent itemsets in a transactional database. Deriving all frequent itemsets then becomes trivial: all there is to do is count them, and there is no need to generate candidates that are doomed to be infrequent. The oracle obviously does not exist, but a pseudo-oracle is proposed that discovers these maximal patterns using the COFI-trees [4], and all itemsets are derived from them.
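The relationship between maximal and frequent itemsets can be illustrated with a short sketch. It does not implement COFI-trees; it only shows the "oracle" argument, using a hypothetical hand-written list of frequent itemsets and a toy database.

```python
from itertools import combinations


def maximal(frequent):
    """Keep only the frequent itemsets that have no frequent proper superset."""
    return {x for x in frequent if not any(x < y for y in frequent)}


def derive_all(maximal_sets, transactions):
    """Enumerate every non-empty subset of each maximal set and count its support."""
    out = {}
    for m in maximal_sets:
        for k in range(1, len(m) + 1):
            for combo in combinations(sorted(m), k):
                s = frozenset(combo)
                if s not in out:
                    out[s] = sum(1 for t in transactions if s <= t)
    return out


# Hypothetical toy database; the list below holds its frequent itemsets
# for a minimum support count of 3.
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent = [frozenset(s) for s in ("a", "b", "c", "ab", "ac", "bc")]

max_sets = maximal(frequent)
derived = derive_all(max_sets, D)
```

Every subset of a maximal frequent itemset is itself frequent, so the derivation only counts itemsets and never tests a candidate that could fail.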

    The CATS tree [10] and its algorithms enable frequent pattern mining with different supports without rebuilding the tree structure. They allow mining with a single pass over the database, as well as efficient insertion or deletion of transactions at any time. Once a CATS tree is built, it can be used for multiple frequent pattern mining runs with different supports. The CATS tree and its algorithms also support transaction stream mining.

    Finding Frequent Patterns from CATS-Tree: After the CATS tree is constructed, the frequent patterns can be found by following the frequency of an item along both its upward and downward paths.

    Advantages and Disadvantages

    This method is advantageous because it requires only one scan of the database and the tree is ordered according to the local frequency of items in the paths. It is disadvantageous because a lot of computation is required to build the tree.

    (E) Improved Frequent Pattern Tree (FP-Tree)

    The FP-tree was proposed to overcome the weaknesses of some traditional data mining algorithms, for example Apriori [7]. It compresses a large data set into a structured and compact data structure, known as the FP-tree. FP-tree mining follows the divide-and-conquer methodology: it decomposes the data mining task into smaller parts. The root of the FP-tree is labeled NULL, and a set of items is defined as the children of the root. Traditionally, an FP-tree node consists of three fields: item name, node link, and count. To help tree traversal, a header table is used.


    The main idea is to construct an efficient FP-tree and a mining frequent itemset algorithm (MFI algorithm). Because the FP-tree is highly condensed, it is smaller than the main database (for example, the transaction database described in Table 2), which reduces the expensive database scans in the mining method. After the FP-tree is formed, the frequency of each item in the FP-tree (F) is used as an input to the MFI [12] algorithm, which generates the frequent itemsets. These frequent itemsets are then used to derive strong association rules.

    Algorithm 1: Construct improved FP-tree

    Input: Transaction database.

    Output: Improved FP-tree, frequency of each item in the FP-tree, and the Stable count.

    Step 1: Sort the items in the transaction database in descending order of their number of occurrences in the transactions.

    Step 2: Create the root of the tree, R. Since it is a prefix tree, R = NULL.

    Step 3: For each transaction in the transaction database, do the following.

    Step 4: Let a descending-ordered transaction be represented by [p|Q], where p is the first item and Q is the rest of the items in the transaction.

    Step 5: If p is the most frequent item, do Step 6 and Step 7. Otherwise go to Step 8.

    Step 6: If the root R has a direct child node M such that M's item_name = p's item_name, increase the frequency of the item M, denoted G.M, by 1. Transfer the root from R to p.

    Step 7: For each item in Q, do the following until Q is empty.

    [a] Create a new node from the root for each new item. Increase the frequency of the new item, G.new_item, by 1. Transfer the root to the new node.

    [b] If an item Qi in Q already exists in the FP-tree but has no single edge from the current root to its node, then Qi is a spare item. Store Qi with frequency 1 in the spare table (Stable).

    Step 8: For each item in transaction [p|Q], store the item in Stable with frequency 1.

    Step 9: Calculate the total frequency of each item in Stable, called the Stable count.

    Step 10: Output the FP-tree, the frequency of each item in the FP-tree, and the total frequency of the Stable count.

    The algorithm above focuses on the most frequent item, because most of the relations involve that item. When descending-ordered transactions are used, the most frequent item comes first and the tree is formed around it. The root of the tree changes with the transactions (Step 6 of Algorithm 1); because the root changes with the incoming items, the correlation is more efficiently visible. Algorithm 1 compares items and classifies them (Step 5). This classification decides which items are spare items (case 1 in Definition 1) and enhances the efficiency of the FP-tree. After one node is created, the other nodes are created according to the frequency of occurrence of each item (Step 7 of Algorithm 1). If the same element occurs again, the frequency of that item is increased by 1 (Step 6 of Algorithm 1).

    Algorithm 2: Mining frequent itemsets

    1: Mining Frequent Item set (FP-tree, S, C, F, R)
    2: for each item i in FP-tree where (i != R) do
    3:     if i.S = i.F then    /* i.S is the support of the item and i.F is the frequency of the item in the FP-tree */
    4:         frequency of the frequent itemset, K = i.F
    5:         generate itemset P = (i) with the frequency value of the tree; the frequent itemset is written in {P: K} format
    6:     else if i.S > i.F then
    7:         frequency of the frequent itemset, K = i.F + C
    8:         generate itemset P = (i)
    9:     else frequency of the frequent itemset, K = i.F
    10:        generate itemset P = (i)
    11: end for

    When the database is large, the above algorithm may suffer from memory scarcity. The algorithm would improve if it solved the memory problem without any preprocessing or postprocessing. Another possible improvement is to speed up the bitwise AND operation on the binary string, or to replace it with a more effective technique.

    To date, the well-known Apriori algorithm has been implemented to mine any XML document for association rules without any preprocessing or postprocessing. But the algorithm can only mine sets of items for which a path expression can be written. However, the structure


    of the XML data can be more complex and irregular than that. Consequently, it is difficult to identify the mining context. Frequent Pattern-growth (FP-growth) [9] mines the complete set of frequent patterns by pattern fragment growth. FP-tree based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets, and a partition-based, divide-and-conquer method can be used.

    IV. CONCLUSION

    Data mining is also referred to as knowledge discovery in databases. It deals with issues such as representation schemes for the concept or pattern to be discovered and the design of appropriate functions and algorithms to find patterns. However, data on the web and in bioinformatics databases often lacks such a regular structure; such data is called semi-structured. This survey gives a brief overview of XML data mining using association rules and fast frequent pattern mining in various fields, the modifications made to the association rules according to the applications in which they were used, and their results. Association rules have thus proved themselves to be the most effective technique for frequent pattern matching for over a decade. XML has become very popular for representing semi-structured data and is a standard for data exchange over the web, so mining XML data from the web is becoming increasingly important. However, the structure of XML data can be complex and irregular. Association rule mining plays a key role in mining data for frequent patterns. Frequent Pattern-growth (FP-growth) mines the complete set of frequent patterns by pattern fragment growth. FP-tree based mining adopts a pattern fragment growth method to avoid the costly generation of a large number of candidate sets, and it uses a partition-based, divide-and-conquer method. This paper presents a review of XML data mining using Fast Frequent Pattern mining in various domains.

    References

    i. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Publishers.

    ii. Jayalakshmi. S, Dr K. Nageswara Rao, Mining Association rule for Large Transactions using New Support and Confidence Measures, Journal of Theoretical and Applied Information Technology, 2005.

    iii. Muthaimenul Adnan and Reda Alhajj, A Bounded and Adaptive Memory-Based Approach to Mine Frequent Patterns From Very Large Databases, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 41, No. 1, February 2011.

    iv. W. Cheung and O. R. Zaiane, Incremental mining of frequent patterns without candidate generation or support constraint, in Proc. IEEE Int. Conf. Database Eng. Appl., Los Alamitos, CA, 2003, pp. 111-116.

    v. C. K.-S. Leung, Q. I. Khan, and T. Hoque, CanTree: A tree structure for efficient incremental mining of frequent patterns, in Proc. IEEE Int. Conf. Data Mining, Los Alamitos, CA, 2005, pp. 274-281.

    vi. M. El-Hajj and O. R. Zaiane, COFI approach for mining frequent itemsets revisited, in Proc. ACM SIGMOD Workshop Res. Issues Data Mining Knowl. Discovery, New York, 2004, pp. 70-75.

    vii. A. Savasere, E. Omiecinski, and S. B. Navathe, An efficient algorithm for mining association rules in large databases, in Proc. Int. Conf. Very Large Data Bases, 1995, pp. 432-444.

    viii. C. K.-S. Leung, Interactive constrained frequent-pattern mining system, in Proc. IDEAS 2004, pp. 49-58.

    ix. J. Han, J. Pei, Y. Yin, and R. Mao, Mining frequent patterns without candidate generation: a frequent-pattern tree approach, Data Mining and Knowledge Discovery, 8(1), pp. 53-87, Jan. 2004.

    x. Jacky W.W. Wan, Gillian Dobbie, Mining association rules from XML data using XQuery.

    xi. D. Braga, A. Campi, M. Klemettinen, and P. L. Lanzi, Mining association rules from XML data, in Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery, France, 2002.

    xii. A.B.M. Rezbaul Islam, Tae-Sun Chung, An Improved Frequent Pattern Tree Based Association Rule Mining Technique, IEEE, 2011.

    xiii. R. Agrawal, T. Imielinski, and A. Swami, Mining association rules between sets of items in large databases, in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216, Washington, DC, May 26-28, 1993.