a new clustering tool of data mining rapid miner

A new clustering tool of Data Mining

RAPID MINER

Introduction To Clustering

Unsupervised learning applies when old data with class labels is not available, e.g. when introducing a new product.

Group (cluster) existing customers based on the time series of their payment history, so that similar customers fall in the same cluster.

Key requirement: Need a good measure of similarity between instances.
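As one illustration of such a similarity measure, the sketch below computes the Euclidean distance between two payment-history series of equal length. The slides do not say which measure the tool uses, so the function name and the sample figures are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Distance between two equal-length numeric series; smaller means more similar."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical monthly payments for three customers (not data from the slides).
alice = [100, 100, 100, 100]
bob = [100, 90, 110, 100]
carol = [0, 0, 400, 0]

print(euclidean_distance(alice, bob))    # small: similar payment behaviour
print(euclidean_distance(alice, carol))  # large: dissimilar payment behaviour
```

With a measure like this, alice and bob would tend to land in the same cluster and carol in a different one.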

Identify micro-markets and develop policies for each

About The Project

The aim of this project is to devise a new clustering algorithm for data mining.

The main functionalities to be implemented in the system are preprocessing and clustering.

In preprocessing, an input .xls file is chosen. Any null values present in the input file are removed to avoid faulty results in the output data sets. Redundant or duplicate data sets are removed as well.

In clustering, the data is distributed into groups so that the degree of association is strong between members of the same cluster and weak between members of different clusters.

Present Tool: Weka

Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand.

The Explorer interface features several panels providing access to the main components of the workbench:

The Preprocess panel has facilities for importing data from a database, a CSV file, etc., and for preprocessing this data using a so-called filtering algorithm. These filters can be used to transform the data (e.g., turning numeric attributes into discrete ones) and make it possible to delete instances and attributes according to specific criteria.

The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-means algorithm. There is also an implementation of the expectation maximization algorithm for learning a mixture of normal distributions.


Our Tool

In the data preprocessing phase, an MS-Excel file is taken as input; CSV or ARFF files are not required. Excel was chosen because it is well known and comfortably handled by non-technical people, whereas CSV and ARFF files require familiarity with their formats. Excel support was added by importing the ‘jxl.jar’ library into the project.

The input file is first cleaned by removing null data sets. A null data set is one that contains no information, or fewer values than a threshold (the minimum number of values for the required attributes). The number of null data sets removed is reported to the user. Next, redundant (duplicate) data sets are removed. A data set is redundant if all of its attribute values equal those of some other data set. These data sets are excluded from further processing, and their number is also reported to the user.
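The cleaning described here can be sketched as follows. This is a minimal illustration in Python, not the tool's Java/jxl code: it treats each "data set" as one row, and the name `clean_rows`, the `min_values` threshold parameter and the sample rows are all hypothetical.

```python
def clean_rows(rows, min_values):
    """Drop null rows, then duplicate rows.

    A row counts as null when it has fewer than `min_values` non-empty cells.
    Returns (kept_rows, null_count, duplicate_count) so both counts can be
    reported to the user.
    """
    kept = []
    seen = set()
    null_count = 0
    duplicate_count = 0
    for row in rows:
        filled = sum(1 for cell in row if cell not in (None, ""))
        if filled < min_values:
            null_count += 1
            continue
        key = tuple(row)
        if key in seen:  # every attribute value matches an earlier row
            duplicate_count += 1
            continue
        seen.add(key)
        kept.append(row)
    return kept, null_count, duplicate_count

# Hypothetical rows, as they might be read from a spreadsheet.
rows = [
    ["sunny", 85, "no"],
    ["rainy", 70, "yes"],
    ["sunny", 85, "no"],       # duplicate of the first row
    [None, None, ""],          # no information at all
    ["overcast", None, None],  # below the threshold of 2 values
]
kept, nulls, duplicates = clean_rows(rows, min_values=2)
print(len(kept), "rows kept;", nulls, "null;", duplicates, "duplicate")
```

Null rows are filtered before the duplicate check so that two empty rows are counted as null rather than as duplicates of each other.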

KD Trees

K-dimensional trees

Space-partitioning data structure

Splitting planes perpendicular to the coordinate axes

Reduces the time complexity of a nearest-neighbour search to O(log n) on average
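A minimal sketch of the idea, assuming points are stored as tuples: the tree is built by splitting at the median along one coordinate axis per level, and a nearest-neighbour query skips any subtree that lies entirely beyond the current best distance. The function names and the sample points are illustrative, not taken from the tool.

```python
def build_kdtree(points, depth=0):
    """Recursively split points at the median along one coordinate axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, target, best=None):
    """Return the stored point closest to target, pruning subtrees that cannot win."""
    if node is None:
        return best
    if best is None or squared_distance(node["point"], target) < squared_distance(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Cross the splitting plane only if a closer point could lie on the other side.
    if diff ** 2 < squared_distance(best, target):
        best = nearest(far, target, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))  # (8, 1)
```

The logarithmic query time holds on average for a balanced tree in low dimensions; in the worst case a query may still visit every node.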

Clustering

Our clustering algorithm uses a KD tree extensively to improve its time complexity.

Our algorithm differs from the existing approach in how the nearest centers are computed.

Efficiency is achieved because the data points do not vary throughout the computation and, hence, this data structure does not need to be recomputed at each stage.
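The following sketch shows one way such a scheme can work, modelled loosely on the filtering algorithm of Kanungo et al. and not on the tool's actual code. Each tree node stores a bounding box, a point count and coordinate sums. During a k-means step, centers that cannot be nearest to any point in a box are pruned, and when a single candidate remains the whole subtree is assigned at once. All names and sample values are hypothetical.

```python
def build(points):
    """kd-tree node with bounding box, point count and per-coordinate sums."""
    dims = len(points[0])
    lo = [min(p[i] for p in points) for i in range(dims)]
    hi = [max(p[i] for p in points) for i in range(dims)]
    node = {"lo": lo, "hi": hi, "count": len(points),
            "sum": [sum(p[i] for p in points) for i in range(dims)],
            "left": None, "right": None}
    if len(points) > 1 and lo != hi:  # otherwise the box is a single location
        axis = max(range(dims), key=lambda i: hi[i] - lo[i])
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        node["left"], node["right"] = build(points[:mid]), build(points[mid:])
    return node

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def is_farther(z, z_star, lo, hi):
    """True if no location in the box [lo, hi] is closer to z than to z_star."""
    # Test the box corner that lies furthest in the direction z - z_star.
    corner = [h if zc > sc else l for zc, sc, l, h in zip(z, z_star, lo, hi)]
    return sqdist(z, corner) >= sqdist(z_star, corner)

def filter_step(node, candidates, centers, sums, counts):
    """Assign the points under node to their nearest centers, pruning candidates."""
    mid = [(l + h) / 2 for l, h in zip(node["lo"], node["hi"])]
    best = min(candidates, key=lambda j: sqdist(centers[j], mid))
    if node["left"] is not None:
        candidates = [j for j in candidates if j == best or
                      not is_farther(centers[j], centers[best], node["lo"], node["hi"])]
    if node["left"] is None or len(candidates) == 1:
        # Leaf, or one candidate left: the whole subtree belongs to `best`.
        sums[best] = [s + t for s, t in zip(sums[best], node["sum"])]
        counts[best] += node["count"]
        return
    filter_step(node["left"], candidates, centers, sums, counts)
    filter_step(node["right"], candidates, centers, sums, counts)

def kmeans_iteration(root, centers):
    """One k-means step; a center with no assigned points keeps its position."""
    dims = len(centers[0])
    sums = [[0.0] * dims for _ in centers]
    counts = [0] * len(centers)
    filter_step(root, list(range(len(centers))), centers, sums, counts)
    return [tuple(s / c for s in sm) if c else ctr
            for sm, c, ctr in zip(sums, counts, centers)]

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
root = build(points)  # built once, because the data points never change
centers = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(5):
    centers = kmeans_iteration(root, centers)  # only the centers move
print(centers)
```

The tree is built a single time outside the loop and reused in every iteration, which is the saving the slide describes.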

K-means Clustering

Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.
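To give a feel for the product, here is a worked example with purely hypothetical values (n = 1000 points, K = 5 clusters, I = 20 iterations, d = 3 attributes), counting one coordinate comparison per point, cluster, iteration and attribute:

```latex
n \cdot K \cdot I \cdot d = 1000 \cdot 5 \cdot 20 \cdot 3 = 300\,000
```

Doubling any one of the four factors doubles the total, which is why reducing the per-iteration cost of finding nearest centers matters.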

K-Means

The K-Means methodology is a commonly used clustering technique. In this analysis the user starts with a collection of samples and attempts to group them into ‘k’ clusters based on specific distance measurements. The prominent steps of the K-Means clustering algorithm are given below.

1. The algorithm starts by creating ‘k’ different clusters. The given sample set is first randomly distributed among these ‘k’ clusters.

2. Next, the distance from each sample within a given cluster to its cluster centroid is calculated.

3. Each sample is then moved to the cluster k′ whose centroid is at the shortest distance from that sample.

As a first step of the cluster analysis, the user decides on the number of clusters ‘k’. This parameter takes integer values with a lower bound of 1 (in practice, 2 is the smallest relevant number of clusters) and an upper bound equal to the total number of samples.

The K-Means algorithm is repeated a number of times to obtain an optimal clustering solution, every time starting with a random set of initial clusters.
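The steps and the restart strategy can be sketched as below. This is plain k-means under stated assumptions, not the tool's implementation: squared Euclidean distance is used as the distance measure, an empty cluster is reseeded with a random sample, and the run with the lowest total within-cluster distance is kept. The names `kmeans_once`, `kmeans`, `restarts` and `max_iter` are hypothetical.

```python
import random

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans_once(samples, k, rng, max_iter=100):
    """One run: random initial assignment, then reassign until nothing changes."""
    # Step 1: distribute the samples randomly among k clusters.
    labels = [rng.randrange(k) for _ in samples]
    centers = []
    for _ in range(max_iter):
        # Step 2: compute each cluster's centroid
        # (assumption: an empty cluster is reseeded with a random sample).
        centers = []
        for j in range(k):
            members = [s for s, label in zip(samples, labels) if label == j]
            centers.append(centroid(members) if members else rng.choice(samples))
        # Step 3: move each sample to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda j: sqdist(s, centers[j]))
                      for s in samples]
        if new_labels == labels:
            break
        labels = new_labels
    cost = sum(sqdist(s, centers[label]) for s, label in zip(samples, labels))
    return labels, centers, cost

def kmeans(samples, k, restarts=10, seed=0):
    """Repeat k-means from different random starts and keep the lowest-cost run."""
    rng = random.Random(seed)
    runs = (kmeans_once(samples, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[2])

samples = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers, cost = kmeans(samples, k=2)
print(labels)
```

Restarting helps because a single run only reaches a local optimum that depends on the random initial assignment.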

COMPARISON OF OUR TOOL WITH WEKA

A data set with the following statistics was run on both WEKA and our tool:

Relation = weather

No. of attributes = 3

No. of instances (including redundant/duplicate and null instances) = 17

Limitations

This tool does not provide protection from:

Shared storage failures.

Network service failures.

Operational errors.

Site disasters (unless a geographically dispersed clustering solution has been implemented).

In the near future…

Market analysis: marketing strategies, advertisement

Risk analysis and management: finance and finance investments, manufacturing and production

Fraud detection and detection of unusual patterns (outliers): telecommunication, financial transactions, anti-terrorism (!!!)

CONCLUSION

We devised a new algorithm for clustering by considering the following variations:

MS-Excel files are successfully read, handled and processed by the system with the help of the ‘jxl.jar’ library. Working with this library also introduced us to new features and functionality for handling Excel documents.

Null data sets were removed comfortably. Along with this, redundant and duplicate data sets were also removed.

The algorithm chooses better starting clusters, i.e. better initial values (or “seeds”) for the clustering algorithm.

A filtering algorithm is included, which uses KD trees to speed up each k-means step.

The initial centers are chosen by this algorithm, whereas K-means itself does not specify how they are to be selected.

An inappropriate choice of the number of clusters can yield poor results. That is why the number of clusters is determined carefully for the data set.


Questions/comments…?
