Implementation Of C-Trend For Commercial Application

Prof. Nilesh Shah #1, Latika Chaudhary #2, Aditya Rajmane #3, Neha Patil #4, Komal Khatke #5

Department of Computers, Padmabhushan Vasantdada Patil Pratishthan’s College of Engineering, Mumbai

[email protected](8879309384), [email protected](9768975739), [email protected](9870669255), [email protected](9664683567)

Abstract: Data mining has made broad and significant progress since its early beginnings. Today data mining is used in a vast array of areas, and numerous commercial data mining systems and research prototypes are available to choose from. When selecting a data mining product that is appropriate for one’s task, it is important to consider the various features of data mining systems from a multidimensional point of view. Researchers have also been striving to build theoretical foundations for data mining. Various clustering techniques have been used for identifying and visualizing trends in multi-attribute transactional data. This paper introduces Cluster-based Temporal Representation of EveNt Data (C-TREND), a system that implements the temporal cluster graph construct, which maps multi-attribute temporal data to a two-dimensional directed graph that identifies trends in dominant data types over time.

Keywords: Clustering, Data Visualization, Trend Analysis, Multiattribute Data

I. INTRODUCTION

Since the inception of information storage, the ability to sift through and analyze huge amounts of information has been a goal pursued in many different ways. With the advent of electronic and magnetic data storage, relational databases emerged as one of the most efficient and widely used methods of storing data. Data stored in such large databases is not always comprehensible to humans; it needs to be filtered and analyzed first. Stored records are raw data, poor in information: the data is not only large and seemingly irrelevant but also continuously increasing, updating, and changing [8].

Organizations and firms are increasingly capturing transactional data that contains multiple attributes and some measure of time; for example, e-commerce firms capture the clickstream and purchasing behavior of their customers through their websites [2]. Real-life transactional data often poses challenges such as very large size and high dimensionality. Consider the problem of technology forecasting for a firm. Technologies possess many features that change over time, and understanding how a technology evolves requires trend analysis of multiple attributes at once. Similar issues arise in the trend analysis of consumer purchasing behavior and in many other business intelligence applications [3]. Business intelligence applications help firms gather and analyze information about their performance, customers, competitors, and business environment. Knowledge representation and data visualization tools constitute one form of business intelligence technique that presents information to users in a manner that supports business decision-making processes [5].

Identifying temporal relationships or trends in data is an important problem in many business and academic settings, and the data mining literature has provided analytical techniques for some specialized types of temporal data, such as time series analysis and sequence analysis techniques. However, temporal data can take many forms, the most common being general multi-attribute transactional data with a timestamp, for which time series or sequence analysis methods are not particularly well suited. The ability to identify trends in such general temporal data can provide significant benefits, including a competitive advantage for a firm performing forecasts or making decisions on future investments and strategies. In this paper, we present C-TREND (Cluster-based Temporal Representation of EveNt Data), a new method for discovering and visualizing trends and temporal patterns in transactional attribute-value data that builds upon standard data mining clustering techniques.

II. RELATED WORK

Data mining and visualization are knowledge discovery tools used for the automated analysis of data stored in large data sets. Data mining is the analysis step of the knowledge discovery in databases (KDD) process. Identifying and visualizing temporal relationships (e.g., trends) in data is an important problem in many business, scientific, and academic settings [5]. Large data sets cannot feasibly be analyzed manually; mining tools and visualization provide automated means to comprehend them. In this section, we provide a brief review of related research in the temporal data mining and visualization streams.

© 2013 IJEA. ALL RIGHTS RESERVED 27

International Journal of Engineering Associates (2320 – 0804) / #27 / Volume 2 Issue 6

Fig. 1. What is Data Mining?

1. Temporal Data Mining

Temporal data mining is a single step in the process of knowledge discovery in temporal databases: the step that enumerates structures (temporal patterns or models) over the temporal data. It is concerned with analyzing temporal data and finding temporal patterns and regularities in sets of temporal data. Temporal data mining techniques also allow for computer-driven, automatic exploration of the data.

2. Clustering

Clustering, in the data mining sense used throughout this paper, groups data records so that records in the same cluster are more similar to each other than to records in other clusters. This is distinct from a computer cluster, which is a group of linked computers, commonly but not always connected through fast local area networks, that work together closely so that in many respects they form a single computer. Computer clusters are usually deployed to improve performance and availability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.

3. Trend Analysis

The term "trend analysis" refers to collecting information and attempting to spot a pattern, or trend, in that information. In some fields of study, the term has more formally defined meanings. For time-series data, trend analysis decomposes the series into trend movements, cyclic movements, seasonal movements, and irregular movements. The technique used in this paper is instead a clustering approach for discovering temporal patterns that builds on temporal clustering methods: it uses clusters identified in multiple time periods and identifies trends based on similarities between clusters over time.

4. Data Visualization

Visual data mining integrates data mining and data visualization to discover implicit and useful knowledge from large data sets. It includes data visualization, data mining result visualization, data mining process visualization, and interactive visual data mining. Data visualization is the process of presenting data in some visual form and allowing the user to interact with the data.

i. Hierarchical clustering: Hierarchical clustering is one of the visualization techniques that partition all dimensions into subsets; the subsets are then visualized in a hierarchical manner.

ii. Dendrogram: A dendrogram is a tree-structured graph, often used in heat maps, that visualizes the result of a hierarchical clustering calculation. The result of the clustering is presented either as the distance or as the similarity between the clustered rows or columns, depending on the selected distance measure.

iii. Temporal cluster graph: To build a temporal cluster graph, the transactional data set is first partitioned with respect to time, and a clustering technique is then applied to each partition. The temporal cluster graph is a directed graph that consists of a set of nodes and a set of directed edges.

Fig. 2. Reducing multi-attribute temporal complexity by partitioning data into time periods and producing a temporal cluster graph
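The time-partitioning step that precedes clustering can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each record is a (timestamp, attribute vector) pair and that the time period is one calendar year; the function name `partition_by_year` and the sample records are hypothetical and do not come from the C-TREND system.

```python
from collections import defaultdict
from datetime import date


def partition_by_year(records):
    """Group (timestamp, attributes) records into partitions keyed by year.

    The result is ordered by year, so neighbouring entries are adjacent periods.
    The yearly granularity is an assumption made for this sketch.
    """
    partitions = defaultdict(list)
    for timestamp, attributes in records:
        partitions[timestamp.year].append(attributes)
    return dict(sorted(partitions.items()))


# Hypothetical transactional attribute-value records: (date, feature vector).
records = [
    (date(2011, 3, 1), (1.0, 0.0)),
    (date(2012, 7, 9), (2.0, 1.0)),
    (date(2011, 11, 20), (1.5, 0.5)),
]
partitions = partition_by_year(records)
```

Each resulting partition would then be clustered independently, and the clusters found in each period become the nodes of the temporal cluster graph.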

III. THE C-TREND TECHNIQUE

1. C-Trend Overview

C-TREND is designed to work with what we term transactional attribute-value data. Specifically, transactional attribute-value data is a general form of temporal data that consists of a collection of records, each with a timestamp and described by a set of attributes. Examples of this type of data include shopping cart data with a sale timestamp and numerical values representing the number of products purchased in certain categories, as well as product description data that includes a release date and a set of indicators representing the presence and/or quantity of specific product features.

Fig. 3. Overview Of C-Trend Process

C-TREND is the system implementation of the temporal cluster-graph-based trend identification and visualization technique; it provides an end user with the ability to generate graphs from data and adjust the graph parameters. C-TREND consists of two main phases: 1) offline preprocessing of the data and 2) online interactive analysis and graph rendering. In the preprocessing phase, the data set is partitioned based on time periods, and each partition is clustered using one of many traditional clustering techniques, such as a hierarchical approach. The results of the clustering for each partition are used to generate two data structures: the node list and the edge list.
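As a rough illustration of how the node list and edge list might be derived from the per-partition clustering results, consider the following Python sketch. It rests on several assumptions that are not stated in the text: clusters are given as lists of numeric attribute vectors, a node records its cluster center and size, edges connect every cluster in one period to every cluster in the next period, and the edge weight is the Euclidean distance between centers. All names are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length numeric vectors."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def build_node_and_edge_lists(clustered_partitions):
    """Build node and edge lists from clustered time partitions.

    clustered_partitions[t] is the list of clusters found in period t;
    each cluster is a list of attribute vectors.
    """
    # Node list: one entry per (period, cluster) with its center and size.
    nodes = {}
    for t, clusters in enumerate(clustered_partitions):
        for c, points in enumerate(clusters):
            nodes[(t, c)] = {"center": centroid(points), "size": len(points)}

    # Edge list: all candidate edges between adjacent periods (assumed weighting).
    edges = []
    for (t, c), src in nodes.items():
        for (u, d), dst in nodes.items():
            if u == t + 1:  # only link a period to the one that follows it
                weight = math.dist(src["center"], dst["center"])
                edges.append(((t, c), (u, d), weight))
    return nodes, edges


# Two periods: one cluster in the first, two clusters in the second.
clustered = [
    [[(0.0, 0.0), (2.0, 0.0)]],
    [[(1.0, 0.0)], [(4.0, 4.0)]],
]
nodes, edges = build_node_and_edge_lists(clustered)
```

Precomputing every candidate edge is what lets the interactive phase show or hide edges without recomputing any distances.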

Fig. 4. The C-Trend Process

Creating these lists in the preprocessing phase allows for more effective (real-time) visualization updates of the C-TREND output graphs. Based on these data structures, graph entities (nodes and edges) are generated and rendered as a temporal cluster graph in the system output window. In the interactive analysis phase, C-TREND allows the user to modify the number of clusters k_i (i = 1, …, t) for each partition, the node-size threshold α, and the edge-weight threshold β on demand in real time and, as a result, update the view of the temporal cluster graph. Note that in this initial implementation, the time partition size is set exogenously by the user and stays constant throughout preprocessing and online interactive analysis. We followed this approach because of the domain-specific nature of time granularity. For example, for analyzing technology evolution, the desired time granularity could be a year, whereas financial market trend analysis may require a much smaller time window. For this reason, we decided to rely on the domain expert to specify the most appropriate time granularity for a given application. However, an important future extension of this research would be to provide the ability to adjust the time granularity interactively in real time.

2. Data Preprocessing

An important requirement for real-time graph customization in C-TREND is the precomputation of multiple clustering solutions from the initial data set. Depending on the type of clustering algorithm employed, the cluster solutions can be stored in a way that maximizes the efficiency of output graph customization. C-TREND can be implemented with multiple standard clustering algorithms (e.g., agglomerative or divisive hierarchical clustering, or partition-based clustering). For hierarchical clustering algorithms, C-TREND utilizes optimized dendrogram data structures for storing and extracting cluster solutions. C-TREND produces a dendrogram for each data partition and utilizes a global input value N that represents the maximum-sized cluster solution maintained for each data partition. For all practical purposes, a useful solution will consist of N << n clusters (where n is the number of data points in partition i). Because a binary dendrogram with N leaf clusters has N - 1 internal nodes, C-TREND has to store only 2N - 1 nodes per partition. We have found that maintaining a maximum solution size of N = 50 clusters is more than sufficient for many practical applications of data visualization. The value of N can also be set by the user before the preprocessing phase.

Fig. 5. Dendrogram example with n = 10

A dendrogram data structure allows for quick extraction of any specific clustering solution for each data partition when the user changes the partition zoom level k_i. To obtain a specific clustering solution from the data structure for data partition D_i, C-TREND uses the DENDRO_EXTRACT algorithm (Algorithm 1), which takes the desired number of clusters in the solution, k_i, as an input and returns the set CurrCl containing the clusters corresponding to the k_i-sized solution. Cluster attributes such as center and size are then accessible from the corresponding dendrogram data structure by referencing the clusters in CurrCl.

DENDRO_EXTRACT starts at the root of the dendrogram and traverses it by splitting the highest-numbered node in the current set of clusters until k_i clusters are included in the set. MaxCl represents the highest-numbered element in the current cluster set CurrCl. Because of the specific dendrogram structure, it is always the case that MaxCl = DendrogramRoot_i - |CurrCl| + 1. Furthermore, the dendrogram data array maintains the successive levels of the hierarchical solution in order; therefore, replacing MaxCl by its children MaxCl.Left (left child) and MaxCl.Right (right child) is sufficient for identifying the next solution level in the dendrogram. DENDRO_EXTRACT has linear time complexity, O(k_i), which provides for the real-time extraction of cluster solutions.
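The extraction procedure described above can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' code. It assumes the dendrogram is stored as an array in which leaves come first and internal nodes follow in merge order, so that the root is the last element; the array layout, the 0-based indexing, and the name `children` are assumptions made for this sketch.

```python
def dendro_extract(children, k):
    """Return the set of node indices forming the k-cluster solution.

    children[i] is None for a leaf, or a (left, right) pair of node indices
    for an internal node. Internal nodes are assumed to be numbered in merge
    order, so the root is the highest-numbered node.
    """
    root = len(children) - 1
    n_leaves = (len(children) + 1) // 2
    if not 1 <= k <= n_leaves:
        raise ValueError("k must be between 1 and the number of leaves")

    curr_cl = {root}
    while len(curr_cl) < k:
        # Because of the merge ordering, the highest-numbered cluster in the
        # current set is always root - |CurrCl| + 1.
        max_cl = root - len(curr_cl) + 1
        left, right = children[max_cl]
        curr_cl.remove(max_cl)
        curr_cl.update((left, right))
    return curr_cl


# Small hypothetical dendrogram: leaves 0..3, then merges 4=(0,1), 5=(2,3), 6=(4,5).
children = [None, None, None, None, (0, 1), (2, 3), (4, 5)]
three_clusters = dendro_extract(children, 3)
```

Each loop iteration replaces one cluster with its two children, so a k-cluster solution takes k - 1 iterations, which matches the linear cost noted above.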

3. Interactive Data Analysis

C-TREND utilizes a series of validation flags to maintain and update the displayed state of the output trend graph. Combinations of the validation flags determine whether each possible node and edge should be displayed in the graph, and as these flags change, the displayed components of the graph change with them. Each cluster in the node list possesses two flags: k_i-pass and α-pass. These flags indicate whether the cluster should be included in the output graph based on the k_i value and the α value, respectively. Specifically, when k_i is changed, the dendrogram data structure is updated so that only the clusters that should be extracted for the clustering solution of size k_i have a valid k_i-pass flag. Similarly, when α is changed, the dendrogram data structure is updated so that only the clusters that are large enough to pass the node filter based on α are assigned a valid α-pass flag. The nodes that have both valid k_i-pass and α-pass flags are both large enough and in the desired clustering solution, and are therefore included in the output graph.

In our implementation, a list of all possible edges and their weights is generated during preprocessing. Each edge in the list possesses a β-pass flag. When β is changed, all edges with a passing weight (based on β) are assigned a valid β-pass flag, and all others are assigned an invalid flag. Only edges that have a valid β-pass flag and are incident to two valid nodes (nodes with valid k_i-pass and α-pass flags) are included in the output graph. Using the implementation described above, C-TREND can update output graphs very efficiently in response to user changes to the k_i, α, and β parameters. Changing any one parameter requires only one operation to update the corresponding flag in the data structure for a given node or edge [5].
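The flag mechanism can be illustrated with a small Python sketch. The class and function names are hypothetical, and two details are assumptions made for illustration rather than statements from the text: a node passes the α filter when its size is at least α, and an edge passes the β filter when its weight is at most β (treating the weight as a distance).

```python
from dataclasses import dataclass


@dataclass
class Node:
    size: int
    k_pass: bool = False      # in the currently selected k_i-sized solution
    alpha_pass: bool = False  # large enough for the node filter


@dataclass
class Edge:
    src: str
    dst: str
    weight: float
    beta_pass: bool = False   # weight passes the edge filter


def set_k(nodes, selected):
    """Mark the nodes that belong to the extracted clustering solution."""
    for key, node in nodes.items():
        node.k_pass = key in selected


def set_alpha(nodes, alpha):
    """Assumed rule: a node passes when its size is at least alpha."""
    for node in nodes.values():
        node.alpha_pass = node.size >= alpha


def set_beta(edges, beta):
    """Assumed rule: an edge passes when its weight is at most beta."""
    for edge in edges:
        edge.beta_pass = edge.weight <= beta


def visible(nodes, edges):
    """Nodes need both flags; edges need their flag and two visible endpoints."""
    shown = {key for key, n in nodes.items() if n.k_pass and n.alpha_pass}
    shown_edges = [e for e in edges
                   if e.beta_pass and e.src in shown and e.dst in shown]
    return shown, shown_edges


nodes = {"a": Node(10), "b": Node(2), "c": Node(8)}
edges = [Edge("a", "c", 1.0), Edge("a", "b", 0.5), Edge("b", "c", 9.0)]
set_k(nodes, {"a", "b", "c"})
set_alpha(nodes, 5)
set_beta(edges, 2.0)
shown_nodes, shown_edges = visible(nodes, edges)
```

Changing one parameter only rewrites one kind of flag; the stored centers, sizes, and weights from preprocessing are never recomputed.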

Fig. 6. Implementation Of C-Trend

IV. ADVANTAGES

C-TREND provides three advantages over existing techniques. First, C-TREND presents temporal data in a unique and intuitive manner that emphasizes trends between dominant transaction types over time, and its output graphs resemble evolutionary diagrams and naturally portray the changes in data characteristics over time. Second, C-TREND is a meta-analysis tool for data mining results and, therefore, is designed to provide the domain expert with substantial control over the data presentation. In particular, C-TREND provides the user with the ability to adjust all key parameters for creating output trend graphs, which allows a domain expert to visualize the data in a manner that provides the most value. Third, C-TREND presents a set of graph statistics which, in our future work, will provide a means for developing new trend metrics and a framework for performing hypothesis testing on the existence and characteristics of trends.

VI. CONCLUSION

By harnessing computational techniques of data mining, we have developed a temporal clustering technique for discovering, analyzing, and visualizing trends in multi-attribute temporal data. The proposed technique is versatile and gives significant data representation power to the user: domain experts have the ability to adjust parameters and clustering mechanisms to fine-tune trend graphs. It is also scalable: the time required to adjust trend parameters is quite low even for larger data sets, which provides for real-time visualization capabilities. The proposed technique is applicable in many data analysis contexts and can provide insights for analysts performing historical analyses and generating forecasts.

VIII. REFERENCES

[1] Simmi Bagga and G. N. Singh, "Application of Data Mining," International Journal for Science and Emerging Technologies with Latest Trends, 1(1):19-23.

[2] B. Ratnamala and P. M. Kiran, "Temporal Cluster Graphs for Visualizing Trends," National Conference on Advances in Computer Science and Applications with International Journal of Computer Applications (NCACSA 2012), Proceedings published in International Journal of Computer Applications (IJCA).

[3] D. Radha Rani, A. Vini Bharati, P. Lakshmi Durga Madhuri, M. Phaneendra Babu, and A. Sravani, "Analysis of Dendrogram Tree for Identifying and Visualizing Trends in Multi-attribute Transactional Data," International Journal of Engineering Trends and Technology, Volume 3, Issue 1, 2012, pp. 14-18.

[4] Arna Prabha Jena and Annan Naidu, "A Review of C-TREND Using Complete-Link Clustering for Transactional Data," International Journal of Computer Science & Engineering Technology (IJCSET), Vol. 4, No. 07, Jul 2013, pp. 850-854.

[5] Gediminas Adomavicius and Jesse Bockstedt, "C-TREND: Temporal Cluster Graphs for Identifying and Visualizing Trends in Multiattribute Transactional Data," IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 6, June 2008, pp. 721-735.

[6] Sotiris Kotsiantis and Dimitris Kanellopoulos, "Association Rules Mining: A Recent Overview," GESTS International Transactions on Computer Science and Engineering, Vol. 32 (1), 2006, pp. 71-82.

[7] Jeffrey Hsu, "Data Mining Trends and Developments: The Key Data Mining Technologies and Applications for the 21st Century."

[8] AbdulRahman R. Alazmi and AbdulAziz R. Alazmi, "Data Mining and Visualization of Large Databases," International Journal of Computer Science and Security (IJCSS), Volume 6, Issue 5, 2012, pp. 295-314.
