ODAM: An Optimized Distributed Association Rule Mining Algorithm

(Synopsis)

INTRODUCTION

Data mining, the extraction of hidden predictive information

from large databases, is a powerful new technology with great

potential to help companies focus on the most important information in

their data warehouses. Data mining tools predict future trends and

behaviors, allowing businesses to make proactive, knowledge-driven

decisions. The automated, prospective analyses offered by data mining

move beyond the analyses of past events provided by retrospective

tools typical of decision support systems. Data mining tools can answer

business questions that traditionally were too time consuming to

resolve. They scour databases for hidden patterns, finding predictive

information that experts may miss because it lies outside their

expectations.

Most companies already collect and refine massive quantities of

data. Data mining techniques can be implemented rapidly on existing

software and hardware platforms to enhance the value of existing

information resources, and can be integrated with new products and

systems as they are brought on-line. When implemented on high

performance client/server or parallel processing computers, data

mining tools can analyze massive databases to deliver answers to

questions such as, "Which clients are most likely to respond to my next

promotional mailing, and why?"

Data mining (DM), also called Knowledge-Discovery in

Databases (KDD) or Knowledge-Discovery and Data Mining, is the

process of automatically searching large volumes of data for patterns

using tools such as classification, association rule mining, and

clustering. Data mining is a complex topic that draws on several core

fields, notably computer science, and builds on established

computational techniques from statistics, information retrieval,

machine learning, and pattern recognition.

Data mining techniques are the result of a long process of research

and product development. This evolution began when business data

was first stored on computers, continued with improvements in data

access, and more recently, generated technologies that allow users to

navigate through their data in real time. Data mining takes this

evolutionary process beyond retrospective data access and navigation

to prospective and proactive information delivery. Data mining is ready

for application in the business community because it is supported by

three technologies that are now sufficiently mature:

o Massive data collection

o Powerful multiprocessor computers

o Data mining algorithms

Commercial databases are growing at unprecedented rates. A recent

META Group survey of data warehouse projects found that 19% of

respondents are beyond the 50 gigabyte level, while 59% expect to be

there by the second quarter of 1996 [1]. In some industries, such as retail,

these numbers can be much larger. The accompanying need for

improved computational engines can now be met in a cost-effective

manner with parallel multiprocessor computer technology. Data mining

algorithms embody techniques that have existed for at least 10 years,

but have only recently been implemented as mature, reliable,

understandable tools that consistently outperform older statistical

methods.

With the explosive growth of information sources available on

the World Wide Web, it has become increasingly necessary for users to

utilize automated tools to find the desired information resources, and

to track and analyze their usage patterns. These factors give rise to

the necessity of creating server-side and client-side intelligent systems

that can effectively mine for knowledge. Web mining can be broadly

defined as the discovery and analysis of useful information from the

World Wide Web. This describes the automatic search of information

resources available online, i.e., Web content mining, and the

discovery of user access patterns from Web servers, i.e., Web usage

mining.

Web Mining is the extraction of interesting and potentially

useful patterns and implicit information from artifacts or activity

related to the World Wide Web. There are roughly three knowledge

discovery domains that pertain to web mining: Web Content Mining,

Web Structure Mining, and Web Usage Mining. Web content mining is

the process of extracting knowledge from the content of documents or

their descriptions. Web document text mining, resource discovery

based on concept indexing or agent-based technology may also fall in

this category. Web structure mining is the process of inferring

knowledge from the World Wide Web organization and links between

references and referents in the Web. Finally, web usage mining, also

known as Web Log Mining, is the process of extracting interesting

patterns in web access logs.

Web Content Mining

Web content mining is an automatic process that goes beyond

keyword extraction. Since the content of a text document

presents no machine-readable semantics, some approaches have

suggested restructuring the document content into a

representation that could be exploited by machines. The usual

approach to exploit known structure in documents is to use

wrappers to map documents to some data model. Techniques

using lexicons for content interpretation are yet to come.

There are two groups of web content mining strategies: Those

that directly mine the content of documents and those that

improve on the content search of other tools like search engines.

Web Structure Mining

The World Wide Web can reveal more information than just the

information contained in documents. For example, links pointing

to a document indicate the popularity of the document, while

links coming out of a document indicate the richness or perhaps

the variety of topics covered in the document. This can be

compared to bibliographical citations. When a paper is cited

often, it ought to be important. The PageRank and CLEVER

methods take advantage of this information conveyed by the

links to find pertinent web pages. By means of counters, higher

levels cumulate the number of artifacts subsumed by the

concepts they hold. Counters of hyperlinks, into and out of

documents, retrace the structure of the web artifacts

summarized.
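
The hyperlink counters described above can be illustrated with a minimal sketch. It only tallies incoming and outgoing links per page; it is not PageRank or CLEVER, and the page names and the `link_counters` helper are hypothetical, introduced here for illustration.

```python
from collections import Counter


def link_counters(links):
    """Count incoming and outgoing hyperlinks per page.

    `links` is an iterable of (source, target) pairs. The page names used
    below are made up for this example.
    """
    in_links = Counter()
    out_links = Counter()
    for source, target in links:
        out_links[source] += 1   # link coming out of `source`
        in_links[target] += 1    # link pointing to `target`
    return in_links, out_links


links = [
    ("a.html", "c.html"),
    ("b.html", "c.html"),
    ("c.html", "a.html"),
]
in_links, out_links = link_counters(links)

# The page with the most incoming links, by analogy with a frequently cited paper.
most_cited = max(in_links, key=in_links.get)
```

A real structure-mining method would weight links rather than simply count them; the raw counts are only the starting point the text refers to.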

Web Usage Mining

Web servers record and accumulate data about user interactions

whenever requests for resources are received. Analyzing the web

access logs of different web sites

can help understand the user behaviour and the web structure,

thereby improving the design of this colossal collection of resources.

There are two main tendencies in Web Usage Mining driven by the

applications of the discoveries: General Access Pattern Tracking and

Customized Usage Tracking.

General access pattern tracking analyzes the web logs to

understand access patterns and trends. These analyses can shed light

on better structure and grouping of resource providers. Many web

analysis tools exist, but they are limited and usually unsatisfactory.

We have designed a web log data mining tool, WebLogMiner, and

proposed techniques for using data mining and On-Line Analytical

Processing (OLAP) on treated and transformed web access files.

Applying data mining techniques on access logs unveils interesting

access patterns that can be used to restructure sites in a more efficient

grouping, pinpoint effective advertising locations, and target specific

users for specific selling ads.

Customized usage tracking analyzes individual trends. Its purpose is to

customize web sites to users. The information displayed, the depth of

the site structure and the format of the resources can all be

dynamically customized for each user over time based on their access

patterns.

While it is encouraging and exciting to see the various potential

applications of web log file analysis, it is important to know that the

success of such applications depends on what and how much valid and

reliable knowledge one can discover from the large raw log data.

Current web servers store limited information about the accesses.

Some scripts custom-tailored for particular sites may store additional

information. However, effective web usage mining may require a substantial

cleaning and data transformation step before analysis.
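
A cleaning step of this kind might look like the sketch below. It assumes the access log uses the Common Log Format, which this synopsis does not specify, and the filtering rules (keep only successful requests, drop images and style sheets) are illustrative choices rather than the rules used by WebLogMiner. The sample log lines are invented.

```python
import re

# Assumed layout: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Requests for embedded resources say little about what the user wanted to read.
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def clean_log(lines):
    """Return (host, time, path) tuples for successful page requests only."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # malformed line
        record = match.groupdict()
        if not record["status"].startswith("2"):
            continue  # failed or redirected request
        if record["path"].lower().endswith(IGNORED_SUFFIXES):
            continue  # image, style sheet, or script
        records.append((record["host"], record["time"], record["path"]))
    return records


raw_lines = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:55:37 -0700] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET /missing.html HTTP/1.0" 404 -',
    'not a log line',
]
cleaned = clean_log(raw_lines)
```

Only the first sample line survives the cleaning; the other three are an image request, a failed request, and a malformed line.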

Abstract

With the explosive growth of information sources available on

the World Wide Web, it has become increasingly necessary for users to

utilize automated tools to find the desired information resources, and

to track and analyze their usage patterns.

Association rule mining is an active data mining research area.

However, most ARM algorithms cater to a centralized environment. In

contrast to previous ARM algorithms, ODAM is a distributed algorithm

for geographically distributed data sets that reduces communication

costs. Recently, as the need to mine patterns across distributed

databases has grown, Distributed Association Rule Mining (D-ARM)

algorithms have been developed. These algorithms, however, assume

that the databases are either horizontally or vertically distributed. In

the special case of databases populated from information extracted

from textual data, existing D-ARM algorithms cannot discover rules

based on higher-order associations between items in distributed

textual documents that are neither vertically nor horizontally

distributed, but rather a hybrid of the two.

Modern organizations are geographically distributed. Typically,

each site locally stores its ever increasing amount of day-to-day data.

Using centralized data mining to discover useful patterns in such

organizations' data isn't always feasible because merging data sets

from different sites into a centralized site incurs huge network

communication costs. Data from these organizations are not only

distributed over various locations but also vertically fragmented,

making it difficult if not impossible to combine them in a central

location. Distributed data mining has thus emerged as an active

subarea of data mining research.

A significant area of data mining research is association rule

mining. Unfortunately, most ARM algorithms focus on a sequential or

centralized environment where no external communication is required.

Distributed ARM algorithms, on the other hand, aim to generate rules

from different data sets spread over various geographical sites; hence,

they require external communications throughout the entire process.

DARM algorithms must reduce communication costs so that generating

global association rules costs less than combining the participating

sites' data sets into a centralized site. However, most DARM algorithms

don't have an efficient message optimization technique, so they

exchange numerous messages during the mining process. We have

developed a distributed algorithm, called Optimized Distributed

Association Mining, for geographically distributed data sets. ODAM

generates support counts of candidate itemsets more quickly than other

DARM algorithms and reduces the size of average transactions, data

sets, and message exchanges.

Description of Problem

Since the advent of computers, data have become available in enormous

quantities; data mining is the process of discovering knowledge from

such raw collections of data. Likewise, a vast number of Web documents

reside online. The Web is a repository of information on a wide variety

of subjects, such as technology, science, history, geography, sports,

and politics. Anyone who wants to know about a particular topic uses a

search engine, which is expected to satisfy the user by returning all

the information related to that topic.

We can categorize parallel ARM algorithms as data-parallelism or

task-parallelism algorithms. In the former, the algorithms

partition the data sets among different nodes; in the latter, each site

performs the task independently but must access the entire data set.

The Count Distribution (CD) algorithm is a simple data-parallelism

algorithm [2]. It uses the sequential Apriori algorithm in a parallel

environment and assumes data sets are horizontally partitioned among

different sites.
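
The idea behind Count Distribution can be sketched as follows. This is a simplified, single-process illustration, assuming each "site" is just a list of transactions and that the exchange of counts is a plain summation; the function names, the sample data, and the absolute `min_support` threshold are assumptions made for the example, not details taken from the CD paper.

```python
from collections import Counter
from itertools import combinations


def local_counts(transactions, candidates):
    """Support counts of the candidate itemsets within one site's partition."""
    counts = Counter()
    for transaction in transactions:
        items = set(transaction)
        for candidate in candidates:
            if candidate <= items:
                counts[candidate] += 1
    return counts


def count_distribution_pass(sites, candidates, min_support):
    """One pass: every site counts locally, then the counts are summed globally."""
    global_counts = Counter()
    for transactions in sites:  # stands in for the exchange of counts between sites
        global_counts.update(local_counts(transactions, candidates))
    return {c: n for c, n in global_counts.items() if n >= min_support}


def mine(sites, min_support):
    """Apriori-style level-wise search over horizontally partitioned data."""
    items = sorted({item for site in sites for t in site for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        level = count_distribution_pass(sites, candidates, min_support)
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into k-itemsets, then keep only those
        # whose (k-1)-subsets are all frequent.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


site_a = [["bread", "milk"], ["bread", "butter"], ["bread", "milk", "butter"]]
site_b = [["milk", "butter"], ["bread", "milk"]]
frequent = mine([site_a, site_b], min_support=3)
```

With these five transactions and a minimum support of 3, the three single items and the pair {bread, milk} are globally frequent, even though no site holds all the evidence on its own.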

DARM discovers rules from various geographically distributed data

sets. However, the network connection between those data sets isn't

as fast as in a parallel environment, so distributed mining usually aims

to minimize communication costs.

Existing Method

Data mining algorithms can be categorized as follows:

o Association algorithms

o Classification algorithms

o Clustering algorithms

Classification:

The process of dividing a dataset into mutually exclusive groups

such that the members of each group are as "close" as possible to one

another, and different groups are as "far" as possible from one

another, where distance is measured with respect to specific

variable(s) you are trying to predict. For example, a typical

classification problem is to divide a database of companies into groups

that are as homogeneous as possible with respect to a

creditworthiness variable with values "Good" and "Bad."

Clustering:

The process of dividing a dataset into mutually exclusive groups

such that the members of each group are as "close" as possible to one

another, and different groups are as "far" as possible from one

another, where distance is measured with respect to all available

variables.

Given databases of sufficient size and quality, data mining technology

can generate new business opportunities by providing these

capabilities:

Automated prediction of trends and behaviors. Data mining

automates the process of finding predictive information in large

databases. Questions that traditionally required extensive hands-

on analysis can now be answered directly from the data —

quickly. A typical example of a predictive problem is targeted

marketing. Data mining uses data on past promotional mailings

to identify the targets most likely to maximize return on

investment in future mailings. Other predictive problems include

forecasting bankruptcy and other forms of default, and

identifying segments of a population likely to respond similarly to

given events.

Automated discovery of previously unknown patterns.

Data mining tools sweep through databases and identify

previously hidden patterns in one step. An example of pattern

discovery is the analysis of retail sales data to identify seemingly

unrelated products that are often purchased together. Other

pattern discovery problems include detecting fraudulent credit

card transactions and identifying anomalous data that could

represent data entry keying errors.


Proposed System

Unlike other algorithms, ODAM offers better performance by

minimizing candidate itemset generation costs. It achieves this by

focusing on two major DARM issues: communication and

synchronization. Communication is one of the most important DARM

objectives. DARM algorithms will perform better if we can reduce

communication (for example, message exchange size) costs.

Synchronization forces

each participating site to wait a certain period until globally frequent

itemset generation completes. Each site will wait longer if computing

support counts takes more time. Hence, we reduce the computation

time of candidate itemsets' support counts.
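
This synopsis does not spell out how the computation time is reduced. One plausible approach, sketched here purely as an assumption, is to drop items already known to be infrequent from every transaction and to merge transactions that become identical, so that later passes scan fewer and shorter records. The helper names and sample data are hypothetical.

```python
from collections import Counter


def reduce_transactions(transactions, frequent_items):
    """Drop infrequent items and merge transactions that become identical.

    Returns a Counter mapping each distinct reduced transaction to the
    number of original transactions it represents.
    """
    reduced = Counter()
    for transaction in transactions:
        kept = frozenset(item for item in transaction if item in frequent_items)
        if kept:  # a transaction with no frequent item cannot support any candidate
            reduced[kept] += 1
    return reduced


def support(reduced, itemset):
    """Support count of `itemset`, computed over the reduced transactions."""
    itemset = frozenset(itemset)
    return sum(n for transaction, n in reduced.items() if itemset <= transaction)


transactions = [
    ["bread", "milk", "gum"],
    ["bread", "milk", "pen"],
    ["bread", "butter"],
    ["gum"],
]
# Assumed outcome of an earlier pass over single items.
frequent_items = {"bread", "milk", "butter"}
reduced = reduce_transactions(transactions, frequent_items)
```

Here four transactions shrink to two distinct records, yet the support counts of itemsets built from frequent items are unchanged.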

To reduce communication costs, we highlight several message

optimization techniques. Based on the ARM algorithm and on the message

exchange method, we can divide the message optimization techniques

into two methods: direct and indirect support count exchange. Each

method has different aims, expectations, advantages, and

disadvantages. For example, the first method exchanges each

candidate itemset's support count to generate globally frequent

itemsets of that pass (CD and FDM are examples of this approach). All

sites share a common globally frequent itemset with identical support

counts, so rules that are generated at different participating sites have

identical confidence. This approach focuses on a rule's exactness and

correctness.
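
The claim that direct exchange yields identical confidence at every site can be checked with a small sketch. It assumes each site reports its local support counts as a dictionary and that the global count is their sum; the counts, item names, and helper functions are invented for illustration.

```python
from collections import Counter


def merge_counts(per_site_counts):
    """Sum the support counts reported by every site."""
    total = Counter()
    for counts in per_site_counts:
        total.update(counts)
    return total


def confidence(counts, antecedent, consequent):
    """Confidence of the rule antecedent => consequent from support counts."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    return counts[both] / counts[antecedent]


bread = frozenset({"bread"})
bread_milk = frozenset({"bread", "milk"})

# Hypothetical local support counts at two sites.
site_a = {bread: 3, bread_milk: 2}
site_b = {bread: 1, bread_milk: 1}

global_counts = merge_counts([site_a, site_b])
global_conf = confidence(global_counts, ["bread"], ["milk"])

# Confidence computed from purely local counts differs from site to site.
local_conf_a = confidence(Counter(site_a), ["bread"], ["milk"])
local_conf_b = confidence(Counter(site_b), ["bread"], ["milk"])
```

Because every site ends up holding the same `global_counts`, any of them derives the same confidence for bread => milk, whereas the local values disagree.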

System Requirement

Hardware specifications:

Processor : Intel Processor IV

RAM : 128 MB

Hard disk : 20 GB

CD drive : 40 x Samsung

Floppy drive : 1.44 MB

Monitor : 15" Samtron color

Keyboard : 108 mercury keyboard

Mouse : Logitech mouse

Software specifications:

Operating System : Windows XP/2000

Language used : J2sdk1.4.0, JCreator
