p 07-12 amit kumar mishra
TRANSCRIPT
-
7/27/2019 p 07-12 Amit Kumar Mishra
1/6
International Journal of Engineering Research (ISSN : 2319-6890)
Volume No.2, Issue No.1, pp : 07-12 01 Jan. 2013
IJER@2013 Page 7
A Recent Review on XML data mining and FFP
Amit Kumar Mishra, Hitesh GuptaPatel College of Science and Technology, Bhopal, M.P. India
[email protected], [email protected]
Abstract
The goal of data mining is to extract or mine"
knowledge fr om large amounts of data. Emerging
technologies of semi-structured data have attracted
wide attenti on of networks, e-commerce, inf ormation
retri eval and databases.XM L has become very popular
for r epresenti ng semi structured data and a standard
for data exchange over the web. Mining XML data
from the web is becoming increasingly important.
However, the structure of the XML data can be more
complex and irregular than that. Association Rule
M in ing plays a key role in the process of mining data
for frequent pattern matching. Fir st Frequent Pattern -
growth, for mining the complete set of frequent
patterns by pattern f ragment growth. Fir st F requent
Pattern- tree based mining adopts a pattern f ragment
growth method to avoid the costly generati on of a large
number of candidate sets and a parti tion-based, divide-
and-conquer method is used. This paper shows a
complete review of XML data mining using Fast
Frequent Pattern min ing in various domains.
Keywords: Data mining, semi-structured data mining,
Association mi ning, XM L
I. INTRODUCTIONThe World Wide Web is one of the fastest growing areas
of intelligence gathering. Data mining[1], which is also
referred to as knowledge discovery in databases, means a
process of nontrivial extraction of implicit, previously
unknown and potentially useful information (such asknowledge rules, constraints, regularities) from data in
databases. Other terms for data mining are knowledge
mining from databases, knowledge extraction, data
archaeology, data dredging, data analysis, etc. Mining
information and knowledge from large databases has
been recognized by many researchers as a key research
topic in database systems and machine learning and by
many industrial companies as an important area with an
opportunity of major revenues. The discovered
knowledge can be applied to information management
query processing, decision making [1],
process control, and many other applications. Re
searchers in many different fields, including database
systems, knowledge-base systems, artificial intelligence
machine learning, knowledge acquisition, statistics, and
spatial databases have shown great interest in data
mining. Nowadays, we have huge amount of data storedin each institute, university, and company. However
data could not tell us anything without processing. We
often flooded
by data but lack of the information. Data mining has
attracted increasing interests in recent years trying to
find the underlying models or patterns of the data, and
making use of the found models and patterns [2].
Data mining has been used in a broad range of
applications [3]. More and more leading-edge
organizations are realizing that data mining provide them
the ability to reach their goals in customer relationship
management, risk management, fraud and abuse
detection, and e-business etc.
Frequent pattern mining [3] and Association rule mining
(ARM) [3] plays a very important role in data mining
Semi-structured data refers to set of data in which there
is some implicit structure that is generally followed, but
no enough of a regular structure to qualify for the kinds
of management and automation usually applied to
structured data. Examples include the World Wide Webbioinformatics databases and data ware housing.
Currently many websites are built with HTML tags, one
problem associated with retrieval of data from web
documents of HTML is that they are not structured in
traditional databases because the Web pages created
using HTML are semi structured thus making querying
more difficult than with well-formed database containing
mailto:[email protected]:[email protected] -
7/27/2019 p 07-12 Amit Kumar Mishra
2/6
International Journal of Engineering Research (ISSN : 2319-6890)
Volume No.2, Issue No.1, pp : 07-12 01 Jan. 2013
IJER@2013 Page 8
schemas and attributes with defined domains. The
concepts of XML have brought convenience for it.
Association rule mining is a process of mining data as a
set of rules from a transactional database which support
the minimum support and confidence. Association rule
of data mining involves picking out the unknown inter-
dependence of the data and finding out the rules between
those items. Association Rule Mining plays a key role in
the process of mining data for frequent pattern matching.
A frequent pattern is a pattern (ie. a set of items,
substructures, subsequences etc.) that occurs frequently
in a dataset. Frequent item sets play an essential role in
many data mining tasks that try to find interesting
patterns from databases, such as association rules,
correlations, sequences, episodes, classifiers, clusters
and many more of which the mining of association rulesis one of the most popular problems.
We start this review with a discussion on mining in
semi-structured data and explore the various challenges
and look into the algorithms for mining semi-structured
data.
II. BACKGROUND
Association rule mining is a process of mining data as a
set of rules from a transactional database[4] which
support the minimum support and confidence.
Support:The rule (A=>B) holds in the transaction set D
with support[4] s, where s is the percentage of
transactions in D containing A D.
Support(A=>B) = P(A D)
Confidence:The rule (A=>B) has confidence[4] c in the
transaction set D, where c is the percentage of
transactions in D containing A that also contains B.
Confidence (A=>B) = P( B | A)
In general, association rule mining can be viewed as a
two-step process. (i) Generating all item sets having
support factor greater than or equal to, the user specifiedminimum support. (ii) Generating all rules having the
confidence factor greater than or equal to the user
specified minimum confidence.
Normally there exist two categories of data mining. (i)
Descriptive data mining (ii) Predictive data mining. For
carrying out summarizations or generalizations
descriptive data mining is used. Whereas for finding out
the inference or predictions, Predictive data mining is
used. Association rule mining falls under the descriptive
category. Association rules aims in extracting important
correlation among the data items in the databases.
(A)Semi-Structured Data Mining TechniquesIn this section, we briefly describe key semi-structured
data mining algorithms that have been developed to
handle data which does not have rigid structure.
a) Association Rules Association rule mining is a
process of mining data as a set of rules from a
transactional database which support the minimum
support and confidence. Algorithms for mining
association rules from relational data have been well
developed. Several query languages have been proposed
to assist association rule mining. Mining semi-structured
data, for example XML data has received little attentionas the data mining community has focused on the
development of the techniques for extracting common
structure from heterogeneous XML data. The straight
forward approach for association rule mining from XML
data is to map the XML documents to relational data
model and to store them in a relational database. This
allows us to apply the standard tools that are in use to
perform the rule mining from relational databases.
b) Clustering Clustering (clustering) is to group objects
of a database into multiple clusters or classes (cluster) so
that objects in the same group have a large similarity
(similarity) and objects in different groups have a large
dissimilarity (dissimilarity). Clustering in data mining is
a useful technique for discovering interesting data
distributions and patterns in the underlying data
Clustering is also helpful for categorizing www
documents, grouping genes and proteins that have
similar functions or the detection of seismic faults by
grouping the entries in an earthquake catalog.
The algorithms used for frequent pattern mining is
divided into two categories: Apriori based algorithmsand tree-structure based algorithms. The apriori-based
algorithms use a generate-and-test strategy (i.e.) they
finds frequent patterns by constructing candidate items
and checking their support counts or frequencies against
the transactional database.
The tree-structure based algorithms follows a test only
approach (i.e.) there is no need to generate candidate
-
7/27/2019 p 07-12 Amit Kumar Mishra
3/6
International Journal of Engineering Research (ISSN : 2319-6890)
Volume No.2, Issue No.1, pp : 07-12 01 Jan. 2013
IJER@2013 Page 9
items and tests only the support counts or frequencies.
Examples are FP-tree and FP-growth, CAN-tree etc.
The eXtensible Markup Language (XML) has become a
standard language for data representation and exchange.
XML is a Standard, flexible syntaxfor data exchanging
Regular, structured data. Database content of all kinds:
Inventory, billing, orders etc. It has small typed values
and irregular, unstructured text. It can consist of
documents of all kinds: Transcripts, books, legal briefs
etc. With the continuous growth in XML data sources,
the ability to manage collections of XML documents and
discover knowledge from them for decision support
becomes increasingly important. Mining of XML
documents significantly differs from structured data
mining and text mining. XML allows the representation
of semi-structured and hierarchal data containing notonly the values of individual items but also the
relationships between data items. Element tags and their
nesting therein dictate the structure of an XML
document.
II. LITERATURE SURVEYIII.
(A)FP-growth algorithmIn 2000, Han et al. proposed the FP-growth[5]
algorithmthe first pattern-growth concept algorithm.
FP-growth constructs an FP-tree structure and mines
frequent patterns by traversing the constructed FPtree.
The FP-tree structure is an extended prefix-tree structure
involving crucial condensed information of frequent
patterns.
FP-tree structure: The FP-tree structure has sufficient
information to mine complete frequent patterns. It
consists of a prefix tree of frequent 1-itemset and a
frequent-item header table. Each node in the prefix-tree
has three fields: item-name, count, and node-link.
Construction of FP-tree: FP-growth has to scan the TDB(Transactional Database) twice to construct an FP-tree.
The first scan of TDB retrieves a set of frequent items
from the TDB . Then, the retrieved frequent items are
ordered by descending order of their supports. The
ordered list is called an F-list. In the second scan, a tree
T whose root node R labeled with null is created.
Then, the following steps are applied to every
transaction in the TDB. Here, let a transaction represent
[p\P] wherep is the first item of the transaction and Pis
the remaining items. In each transaction, infrequent
items are discarded. Then, only the frequent items are
sorted by the same order of F-list.
Advantages and Disadvantages
This method is advantageous because, it doesn
generate any candidate items. It is disadvantageous
because, it suffers from the issues of special and
temporal locality issues.
(B)FP-Tree (Frequent Pattern Tree)A tree structure in which all items are arranged in
descending order of their frequency or support count
After constructing the tree, the frequent items can be
mined using FP-growth.(a) Creation of FP-Tree
First Iteration: Consider a transactional database which
consists of set of transactions with their transaction id
and list of items in the transaction. Then scan the entire
database. Collect the count of the items present in the
database. Then sort the items in decreasing order based
on their frequencies (no. of occurrences).
(b)Second Iteration
Now, once again scan the transactional database. The
FP-tree is constructed as follows. Start with an empty
root node. Add the transactions one after another as
prefix subtrees of the root node. Repeat this process unti
all the transactions have been included in the FP-tree
Then construct a header table which consists of the
items, counts and their head-of-node links.
(c)Finding Frequent Patterns from FP-Tree
After the construction of FP-tree, the frequent patterns
can be mined using an iterative approach FP-growth
This approach looks up the header table and selects the
items that support the minimum support. It removes the
infrequent items from the prefix-path of an existing nodeand the remaining items are considered as the frequent
itemsets of the specified item.
Advantages and Disadvantages
This method is advantageous because, it doesnt
generate any candidate items. It is disadvantageous
because, it suffers from the issues of special and
temporal locality issues.
-
7/27/2019 p 07-12 Amit Kumar Mishra
4/6
International Journal of Engineering Research (ISSN : 2319-6890)
Volume No.2, Issue No.1, pp : 07-12 01 Jan. 2013
IJER@2013 Page 10
(C)CAN-Tree (CANonical Tree)A tree structure that arranges or orders the nodes of a
tree in some canonical order. It follows a tree-based
incremental mining approach. Like FP-tree approach,
there is no need to rescan the transactional database
when it is updated. Because of following the canonical
order, frequency changes (if any) due to incremental
updates like insertion, deletion and modification of the
transactions will not affect the ordering of the nodes in
CAN tree[3]. After constructing the CAN tree, we can
mine the frequent patterns from the tree.
Finding Frequent Patterns from CAN-Tree: After
constructing the CAN-tree, we have to mine the frequent
patterns by traversing the tree in a upward direction.
This can be done similar to FP-growth by constructing aheader table and finding only the frequent items.
3.3. Advantages and Disadvantages
This method is advantageous because it supports
incremental updates without any major changes in the
tree.
(D)COFI-TreeCOFI[6] is much faster than FP-Growth and requires
significantly less memory. The idea of COFI is to build
projections from the FP-tree each corresponding to sub-
transactions of items co-occurring with a given frequent
item. These trees are built and efficiently mined one at a
time making the footprint in memory significantly small.
The COFI algorithm generates candidates using a top-
down approach, where its performance shows to be
severely affected while mining databases that has
potentially long candidate patterns that turns to be not
frequent, as COFI needs to generate candidate sub-
patterns for all its candidates patterns and also build
upon the COFI approach to find the set of frequent
patterns but after avoiding generating useless candidates.
Creation of COFI-Tree: The basic idea of our newalgorithm is simple and is and is based on the notion of
maximal frequent patterns. A frequent item set X is said
to be maximal if there is no frequent item set X such
that X X.
Frequent maximal patterns are a relatively small subset
of all frequent item sets. In other words, each maximal
frequent item set is a superset of some frequent item
sets. Let us assume that we have an Oracle that knows
all the maximal frequent item sets in a transactional
database. Deriving all frequent item sets becomes trivial
All there is to do is counting them, and there is no need
to generate candidates that are doomed infrequent. The
oracle obviously does not exist, but we propose a
pseudo-oracle that discovers this maximal pattern using
the COFI-trees[4] and we derive all item sets from them.
CATS[10] algorithms enable frequent pattern mining
with different supports without rebuilding the tree
structure. The algorithms allow mining with a single
pass over the database as well as efficient insertion or
deletion of transactions at any time.
CATS Tree and CATS Tree algorithms. Once CATS
Tree is built, it can be used for multiple frequent pattern
mining with different supports. CATS Tree and CATSTree algorithms allow single pass frequent pattern
mining and transaction stream mining. In addition
transactions can be added to or removed from the tree at
any time.
Finding Frequent Patterns from CATS-Tree: After
constructing the CATS tree, the frequent patterns can be
found by following the frequency of an item by
considering its both upward and downward paths.
Advantages and Disadvantages
This method is advantageous because it requires only
one scan of the database and the trees are ordered
according to their local frequency in the paths.
It is disadvantageous because lot of computation is
required to build the tree.
(E)IMPROVED FREQUENT PATTERN TREE(FP -TREE)
It is proposed to recover the weakness of some
traditional data mining algorithm. For example
Apriori[7]. It compresses a large data set into a
structured and compact data structure, known as FP-
Tree. FP Trees follows the divide and conquersmethodology. The root of the FP-tree is labeled as
NULL. A set of item is defined as a child of the root
Traditionally a FP tree consists of three fields- Item
name, node link and count. To help the tree traversal, a
header table is used. It decomposes the data mining task
in to smaller parts.
-
7/27/2019 p 07-12 Amit Kumar Mishra
5/6
International Journal of Engineering Research (ISSN : 2319-6890)
Volume No.2, Issue No.1, pp : 07-12 01 Jan. 2013
IJER@2013 Page 11
The main idea is, to construct an efficient FPtree and
mining frequent item set algorithm (MFI algorithm).
This is because the FP-tree is highly condensed and
surely it is smaller than the main database (for example,
the transaction database as described as table 2) .It will
reduce the expensive database scan in mining method.
After forming a FP tree, the frequency of the FP-tree (F)
is used as an input for MFI[12] algorithm. It will
generate the frequent item sets. These frequent item sets
will be used to get strong association rules.
Algorithm 1: Construct improve FP-Tree
Input : Transaction database.
Output : Improved FP-tree, Frequency of each item in
FPtree, and the Stable count.
Step1 : Sort the items in the transaction database in
descending order, according to the number of occurrencein transaction.
Step 2 : Create the root of the tree R. Since it is a prefix
tree, R= NULL.
Step 3 : For each transaction in transaction database do
Step 4 : Let a descending ordered transaction is
represented by [p|Q].where p is the first item and Q is
rest of the items in transaction.
Step 5 : If p= the most frequent item, do the following
step 6 and step7.Otherwise go to step 8.
Step 6 : If the root R has a direct child node M, such that
Ms item_name = ps item_name, Increase the frequency
of the item M, denoted as G.M by 1. Transfer the root
from R to p.
Step 7 : For each item in Q do the following steps up to
Q is empty.
[a] Create a new node for each new item from root.
Increase the frequency of the new item, G. new item by
1. Transfer the root to new node.
[b] If an item, Qi has no single edge from the current
root to its node, Where Qi is an item in Q and already
exists in FP-tree. Then Qi is a spare item.Store Qi with frequency 1 in spare table (Stable)
Step 8 : For each item in transaction [p|Q]
Store all items in Stable with frequency 1.
Step 9 : Calculate the total frequency of each item in
stable, called Stable count.
Step 10: Output the FP-tree, frequency of each item in
FP-tree and total frequency of Stable count
In above algorithm, maximum attention on the most
frequent item, because most of the relations are related
with that particular item. So when use the descending
ordered transactions, then the most frequent item come
first and the tree is formed based on that. The root of the
tree is changed with the transactions (step 6 of algorithm
1).As the root is changing with the coming items, so, the
correlation is more efficiently visible. Algorithm 1
compares items and classify them (step 5).This
classification decides which is spare item (case 1 in
definition 1).It enhances the efficiency of the FPtree
After one node is created the other nodes will be created
according to the frequency of occurrence of each item
(step 7 of algorithm 1). If the same element occurs then
the frequency of that item is increased by 1(step 6 of
algorithm 1).
Algorithm 2 For Frequent Item set
1:Mining Frequent Item set(FP-tree, S, C, F, R)
2: for each item i in FP-tree where (i! = R) do
3: ifi.S = i.F then \* i.S is the support of the item
And i.F is frequency of the item in FP tree */
4: frequency of the frequent item set, K = i.F
5: Generate item set, P = (i ) with the
frequency value of the tree.
Frequent item set is written in {P: K} format.
6: else ifi.S > i. F then
7: Frequency of frequent item set K = i.F + C
8: Generate item set, P = (i )
9: else Frequency of the frequent item set K = i.F
10: Generate item set, P = (i )
11: end of for
When the database is large, the above algorithm may
suffer the problem of memory scarcity. The algorithm
will improve if it solves the memory problem without
any preprocessing or postprocessing. And another
method is to improve Bitwise And Operation on thebinary string, or replace it by some more effective
techniques.
To date, the famous Apriori algorithm to mine any XML
document for association rules without any pre
processing or post-processing has been implemented
But the algorithm only can mine the set of items that can
be written a path expression for. However, the structure
-
7/27/2019 p 07-12 Amit Kumar Mishra
6/6
International Journal of Engineering Research (ISSN : 2319-6890)
Volume No.2, Issue No.1, pp : 07-12 01 Jan. 2013
IJER@2013 Page 12
of the XML data can be more complex and irregular than
that. Consequently, it is difficult to identify the mining
context. First Frequent Pattern-growth[9], for mining the
complete set of frequent patterns by pattern fragment
growth. First Frequent Pattern-tree based mining adopts
a pattern fragment growth method to avoid the costly
generation of a large number of candidate sets and a
partition-based, divide-and-conquer method can be used.
IV. CONCLUSION
Data Mining is referred to as Knowledge Discovery in
Databases. It deals with issues such as representation
schemes for the concept or pattern to be discovered,
design of appropriate functions and algorithms to find
patterns. However data on the web and bioinformatics
databases often lack such a regular structure called semi-structured. This survey papers gives a brief survey of
XML data mining using association rules and fast
frequent pattern in various fields, the modifications
made to the association rules according to the
applications they were used and its effective results.
Thus association rules prove themselves to be the most
effective technique for frequent pattern matching over a
decade. XML has become very popular for representing
semi structured data and a standard for data exchange
over the web. Mining XML data from the web is
becoming increasingly important. However, the structure
of the XML data can be more complex and irregular than
that. Association Rule Mining plays a key role in the
process of mining data for frequent pattern matching.
First Frequent Pattern-growth, for mining the complete
set of frequent patterns by pattern fragment growth. First
Frequent Pattern-tree based mining adopts a pattern
fragment growth method to avoid the costly generation
of a large number of candidate sets and a partition-based,
divide-and-conquer method is used. This paper shows a
review of XML data mining using Fast Frequent Patternmining in various domains.
Referencesi. Jaiwei Han and Micheline Kamber, Data Mining Concepts and
Techniques, Second Edition , Morgan Kaufmann Publishers.
ii. Jayalakshmi.S, Dr k. Nageswara Rao, Mining Association rulefor Large Transactions using New Support and Confidence
Measures, Journal of Theoretical and applied Information
Technology, 2005.
iii. Muthaimenul Adnan and Reda Alhajj, A Bounded and AdaptiveMemory-Based Approach to Mine Frequent Patterns From Very
Large Databases IEEE Transactions on Systems Management and
Cybernetics- Vol.41,No. 1,February 2011.
iv. W. Cheung and O. R. Zaiane, Incremental mining of frequenpatterns without candidate generation or support constraint, in
Proc. IEEE Int.Conf. Database Eng. Appl., Los Alamitos, CA
2003, pp. 111116.
v. C. K.-S. Leung, Q. I. Khan, and T. Hoque, Cantree: A treestructure for efficient incremental mining of frequent patterns, in
Proc. IEEE Int.Conf. Data Mining, Los Alamitos, CA, 2005, pp
274281.
vi. M. El-Hajj and O. R. Zaiane, COFI approach for mining frequenitemsets revisited, in Proc. ACM SIGMOD Workshop Res. Issue
Data Mining Knowl. Discovery, New York, 2004, pp. 7075.
vii. Savasere, E. Omiecinski, and S. B. Navathe, An efficienalgorithm for mining association rules in large databases, in Proc
Int. Conf. Very Large Data Bases, 1995, pp. 432444.
viii. C.K.-S. Leung. Interactive constrained frequent-pattern miningsystem.In Proc.IDEAS 2004, pp. 4958.
ix. J. Han, J. Pei, Y. Yin, and R. Mao . Mining frequent patternswithout candidate generation: a frequent-pattern tree approach
Data Mining and Knowledge Discovery, 8(1), pp. 5387, Jan
2004.
x. Jacky W.W. Wan , Gillian Dobbie, Mining association rules fromXML data using XQUERY.
xi. D. Braga, A. Campi, M. Klemettinen and P.L Lanzi. Miningassociation rules from XML data. In Proceedings of the 4 t
International conference on data warehousing and knowledge
discovery. France 2002.
xii. A.B.M.Rezbaul Islam, Tae-Sun Chung, An Improved FrequenPattern Tree Based Association Rule Mining Technique, IEEE
2011 R. Agrawal, T. Imielinski, and A. Swami.. Mining
association rules between sets of items in large databases. InProceedings of the 1993 ACM SIGMOD International Conference
on Management of Data, pages 207-216, Washington, DC, May
26-281993.