

DATA MINING SOFTWARE

ORANGE

Orange is based on C++ and contains a large number of algorithms for machine learning and data mining. In addition, it has modules for data input and for processing that data.

Orange is component-based software. Thanks to this component-based design, we can create and use components of our own alongside the existing ones.

Features of Orange

Data input/output: Orange can read and write C4.5 files, and it also supports several other formats.

Preprocessing: feature subset selection, and the generation of estimates that are useful for prediction tasks.

Predictive modeling: classification trees, the naive Bayes classifier, logistic regression, and rule-based classifiers (e.g., CN2).

Data description methods: various visualizations, hierarchical clustering, multidimensional scaling.

Visual Programming: Orange Widgets

Orange's visual programming interface is called "Orange Widgets". By creating a connection between two components, we can easily pass the data coming from one component to another using Orange Widgets. The Orange application that hosts this is called "Orange Canvas".

Page 2: Veri Madenciligi Yazilimlari

Widgets are the icons for the operations we can perform in Orange, created to make those operations convenient for the user. There are currently about 40 widgets, and the number grows with each new release as the program develops.

If the widgets that ship with the program by default do not fully meet the user's needs, the user can write a widget of their own, tailored to those needs. Orange makes this very easy.

Orange is fully compatible with Python, a large and flexible programming language. We can control Orange from Python, so we can very easily write our own widgets in Python and integrate them into the program.

A few examples written in Python for Orange:

A data file is read.

A naive Bayesian classifier is built.

The output is printed.

Two Orange modules are imported.

The data is read.

Two classifiers are compared using crossValidation.
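
The code listings themselves were not preserved in this copy of the document. As a rough sketch of what such scripts looked like, here is a minimal example in the legacy Orange 2.x scripting API; the orange, orngTest, and orngStat modules are parts of that API, but the exact calls and the voting.tab dataset name are illustrative assumptions rather than the original listing.

    # Sketch in the legacy Orange 2.x scripting style (not the original listing).
    import orange, orngTest, orngStat

    # Read a data file and build a naive Bayesian classifier.
    data = orange.ExampleTable("voting")      # reads voting.tab
    bayes = orange.BayesLearner(data)
    for example in data[:5]:
        print(bayes(example))                 # predicted class for each example

    # Import two Orange modules, read the data, and compare two
    # classifiers using cross-validation.
    learners = [orange.BayesLearner(), orange.TreeLearner()]
    results = orngTest.crossValidation(learners, data, folds=10)
    print(orngStat.CA(results))               # one classification accuracy per learner

orngStat.CA returns one accuracy figure per learner, which is how the two classifiers are compared.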

Page 3: Veri Madenciligi Yazilimlari

Orange Widget Examples:

In this example we see the iris data displayed as a 2-D classification tree. As is well known, the iris flower has three species (Iris setosa, Iris versicolor, and Iris virginica). Here we see the classification tree built over these three species according to certain conditions.

Now let us look at the ScatterPlot widget for the same example. Here the data for the three species are plotted by the petal width and petal length of the flower.
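
The screenshots themselves are not reproduced in this copy. A scripted counterpart of the tree example, again as a sketch under the assumption of the legacy Orange 2.x API (the orngTree module), would be:

    # Sketch: build and print a classification tree on the iris data
    # (assumes legacy Orange 2.x with the orange and orngTree modules).
    import orange, orngTree

    data = orange.ExampleTable("iris")        # the three iris species
    tree = orngTree.TreeLearner(data)
    orngTree.printTxt(tree)                   # text rendering of the induced tree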

Page 4: Veri Madenciligi Yazilimlari

Widget Communication

All widgets receive their data through widget communication. Communication takes place over the channels the widgets have, through which data is passed in and out.

Orange Canvas is the interface built for this communication. In Orange Canvas, communication between widgets is easily set up with drag and drop.

As seen in the picture above, widgets are placed onto Orange Canvas by drag and drop; the connections between the widgets are then made, and we can perform whatever operation we want on the data.

All widgets have communication channels, and a single widget may have more than one channel. For two widgets to communicate, their channels must match.

Page 5: Veri Madenciligi Yazilimlari

In the example above, the data coming from the Classification Tree widget will pass to the 2-D classification tree viewer. Both widgets have more than one communication channel; the channel to select here is the one that is the same in both, the "Classification Tree" channel.

In Orange Canvas, channels can be active or passive. The reason for making channels passive here is that, in a diagram that splits into two branches, we can follow a single branch to reach the result we want without losing time, and we can make sure that any changes affect only that one branch.

Some Orange Modules

orngAssoc

Some functionality related to association rules.

orngBayes

Tunes the probability estimates of Orange's naive Bayesian learner; it also lets us print out the model.

Page 6: Veri Madenciligi Yazilimlari

orngC45

Lets us print out the constructed model in C4.5 format.

orngCI

Constructive induction (HINT, Kramer's constructive induction method).

orngCN2

A set of classes and functions for rule learning (based on CN2).

orngFSS

Feature subset selection.

orngLookup

Functions for working with classifiers with stored tables of examples.

orngMDS

Multidimensional scaling.

orngMisc

Miscellaneous functions, including various counters and selections of optimal

objects in a sequence.

orngMySQL

Interface to MySQL.

orngOutlier

Simple outlier detection.

orngReinforcement

Reinforcement learning.

orngSOM

Self-organizing maps.

orngCA

Class for calculating correspondence analysis.
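
As an illustration of how these modules are driven from a script, here is a minimal association rule example in the legacy API; the lenses.tab sample dataset and the support threshold are assumptions for the sketch.

    # Sketch: induce association rules with legacy Orange 2.x.
    import orange

    data = orange.ExampleTable("lenses")      # a small discrete dataset
    rules = orange.AssociationRulesInducer(data, support=0.3)
    for r in rules[:5]:
        # each induced rule carries its support and confidence
        print("%.3f  %.3f  %s" % (r.support, r.confidence, r))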


WEKA

WHAT IS WEKA?

The Weka workbench is a collection of state-of-the-art machine learning algorithms and data preprocessing tools. It is designed so that you can quickly try out existing methods on new datasets in flexible ways. It provides extensive support for the whole process of experimental data mining, including preparing the input data, evaluating learning schemes statistically, and visualizing the input data and the result of learning. As well as a wide variety of learning algorithms, it includes a wide range of preprocessing tools. This diverse and comprehensive toolkit is accessed through a common interface so that its users can compare different methods and identify those that are most appropriate for the problem at hand.

Weka was developed at the University of Waikato in New Zealand, and the name stands for Waikato Environment for Knowledge Analysis. Outside the university the weka, pronounced to rhyme with Mecca, is a flightless bird with an inquisitive nature found only on the islands of New Zealand. The system is written in Java and distributed under the terms of the GNU General Public License. It runs on almost any platform and has been tested under the Linux, Windows, and Macintosh operating systems, and even on a personal digital assistant. It provides a uniform interface to many different learning algorithms, along with methods for pre- and postprocessing and for evaluating the result of learning schemes on any given dataset.

One way of using Weka is to apply a learning method to a dataset and analyze its output to learn more about the data. Another is to use learned models to generate predictions on new instances. A third is to apply several different learners and compare their performance in order to choose one for prediction. The learning methods are called classifiers, and in the interactive Weka interface you select the one you want from a menu. Many classifiers have tunable parameters, which you access through a property sheet or object editor. A common evaluation module is used to measure the performance of all classifiers.

Implementations of actual learning schemes are the most valuable resource that Weka provides. But tools for preprocessing the data, called filters, come a close second. Like classifiers, you select filters from a menu and tailor them to your requirements.

USER INTERFACES OF WEKA

EXPLORER

The easiest way to use Weka is through a graphical user interface called the Explorer. This gives access to all of its facilities using menu selection and form filling. For example, you can quickly read in a dataset from an ARFF file (or spreadsheet) and build a decision tree from it. But learning decision trees is just the beginning: there are many other algorithms to explore. The Explorer interface helps you do just that. It guides you by presenting choices as menus, by forcing you to work in an appropriate order by graying out options until they are applicable, and by presenting options as forms to be filled out. Helpful tool tips pop up as the mouse passes over items on the screen to explain what they do. Sensible default values ensure that you can obtain results with a minimum of effort, but you will have to think about what you are doing to understand what the results mean.

The Explorer interface.
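
Weka itself is written in Java, but the Explorer step just described (read an ARFF file, build a decision tree) can also be sketched from Python through the third-party python-weka-wrapper3 package; the package and the iris.arff file name are assumptions of this sketch, not part of the original text.

    # Sketch: load an ARFF file and build a J48 decision tree via the
    # third-party python-weka-wrapper3 package (an assumption).
    import weka.core.jvm as jvm
    from weka.core.converters import Loader
    from weka.classifiers import Classifier

    jvm.start()                               # the wrapper runs Weka inside a JVM
    loader = Loader(classname="weka.core.converters.ArffLoader")
    data = loader.load_file("iris.arff")      # hypothetical path to a dataset
    data.class_is_last()                      # the last attribute is the class
    tree = Classifier(classname="weka.classifiers.trees.J48")
    tree.build_classifier(data)
    print(tree)                               # textual form of the decision tree
    jvm.stop()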

KNOWLEDGE FLOW

The Knowledge Flow interface allows you to design configurations for streamed data processing. A fundamental disadvantage of the Explorer is that it holds everything in main memory: when you open a dataset, it immediately loads it all in. This means that it can only be applied to small to medium-sized problems. However, Weka contains some incremental algorithms that can be used to process very large datasets. The Knowledge Flow interface lets you drag boxes representing learning algorithms and data sources around the screen and join them together into the configuration you want. It enables you to specify a data stream by connecting components representing data sources, preprocessing tools, learning algorithms, evaluation methods, and visualization modules. If the filters and learning algorithms are capable of incremental learning, data will be loaded and processed incrementally.

Knowledge Flow interface
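
As a sketch of the incremental processing described above, and under the same python-weka-wrapper3 assumption, an updateable classifier such as NaiveBayesUpdateable can be trained one instance at a time:

    # Sketch: incremental learning in the spirit of the Knowledge Flow
    # (third-party python-weka-wrapper3 assumed; the file name is hypothetical).
    import weka.core.jvm as jvm
    from weka.core.converters import Loader
    from weka.classifiers import Classifier

    jvm.start()
    loader = Loader(classname="weka.core.converters.ArffLoader")
    data = loader.load_file("large.arff", incremental=True)
    data.class_is_last()
    cls = Classifier(classname="weka.classifiers.bayes.NaiveBayesUpdateable")
    cls.build_classifier(data)                # initialize from the dataset header
    for inst in loader:                       # instances are read one at a time
        cls.update_classifier(inst)
    jvm.stop()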

EXPERIMENTER

Weka's third interface, the Experimenter, is designed to help you answer a basic practical question when applying classification and regression techniques: which methods and parameter values work best for the given problem? There is usually no way to answer this question a priori, and one reason we developed the workbench was to provide an environment that enables Weka users to compare a variety of learning techniques. This can be done interactively using the Explorer. However, the Experimenter allows you to automate the process by making it easy to run classifiers and filters with different parameter settings on a corpus of datasets, collect performance statistics, and perform significance tests. Advanced users can employ the Experimenter to distribute the computing load across multiple machines using Java remote method invocation (RMI). In this way you can set up large-scale statistical experiments and leave them to run.

FILTERING ALGORITHMS

There are two kinds of filter: unsupervised and supervised. This seemingly innocuous distinction masks a rather fundamental issue. Filters are often applied to a training dataset and then also applied to the test file. If the filter is supervised (for example, if it uses class values to derive good intervals for discretization), applying it to the test data will bias the results. It is the discretization intervals derived from the training data that must be applied to the test data.

Unsupervised Attribute Filters

Add inserts an attribute at a given position, whose value is declared to be missing for all instances. Use the generic object editor to specify the attribute's name, where it will appear in the list of attributes, and its possible values (for nominal attributes).

Copy copies existing attributes so that you can preserve them when experimenting with filters that overwrite attribute values. Several attributes can be copied together using an expression such as 1-3 for the first three attributes, or first-3,5,9-last for attributes 1, 2, 3, 5, 9, 10, 11, 12, . . . . The selection can be inverted, affecting all attributes except those specified. These features are shared by many filters.

Remove has already been described. Similar filters are RemoveType, which deletes all attributes of a given type (nominal, numeric, string, or date), and RemoveUseless, which deletes constant attributes and nominal attributes whose values are different for almost all instances. You can decide how much variation is tolerated before an attribute is deleted by specifying the number of distinct values as a percentage of the total number of values. Some unsupervised attribute filters behave differently if the menu in the Preprocess panel has been used to set a class attribute. For example, RemoveType and RemoveUseless both skip the class attribute.

AddCluster applies a clustering algorithm to the data before filtering it. You use the object editor to choose the clustering algorithm. Clusterers are configured just as filters are. The AddCluster object editor contains its own Choose button for the clusterer, and you configure the clusterer by clicking its line and getting another object editor panel, which must be filled in before returning to the AddCluster object editor. This is probably easier to understand when you do it in practice than when you read about it in a book! At any rate, once you have chosen a clusterer, AddCluster uses it to assign a cluster number to each instance, as a new attribute. The object editor also allows you to ignore certain attributes when clustering, specified as described previously for Copy.

ClusterMembership uses a clusterer, again specified in the filter's object editor, to generate membership values. A new version of each instance is created whose attributes are these values. The class attribute, if set, is left unaltered.

AddExpression creates a new attribute by applying a mathematical function to numeric attributes. The expression can contain attribute references and constants; the arithmetic operators +, -, *, /, and ^; the functions log and exp, abs and sqrt, floor, ceil and rint, and sin, cos, and tan; and parentheses. Attributes are specified by the prefix a; for example, a7 is the seventh attribute. An example expression is

    a1^2 * a5 / log(a7 * 4.0)

There is a debug option that replaces the new attribute's value with a postfix parse of the supplied expression.

Whereas AddExpression applies mathematical functions, NumericTransform performs an arbitrary transformation by applying a given Java function to selected numeric attributes. The function can be anything that takes a double as its argument and returns another double, for example, sqrt() in java.lang.Math. One parameter is the name of the Java class that implements the function (which must be a fully qualified name); another is the name of the transformation method itself.

Normalize scales all numeric values in the dataset to lie between 0 and 1. Standardize transforms them to have zero mean and unit variance. Both skip the class attribute, if set.


SwapValues swaps the positions of two values of a nominal attribute. The order of values is entirely cosmetic (it does not affect learning at all), but if the class is selected, changing the order affects the layout of the confusion matrix.

MergeTwoValues merges values of a nominal attribute into a single category. The new value's name is a concatenation of the two original ones, and every occurrence of either of the original values is replaced by the new one. The index of the new value is the smaller of the original indices. For example, if you merge the first two values of the outlook attribute in the weather data (in which there are five sunny, four overcast, and five rainy instances), the new outlook attribute will have values sunny_overcast and rainy; there will be nine sunny_overcast instances and the original five rainy ones.

One way of dealing with missing values is to replace them globally before applying a learning scheme. ReplaceMissingValues replaces each missing value with the mean for numeric attributes and the mode for nominal ones. If a class is set, missing values of that attribute are not replaced.
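
As a sketch of how such a filter is applied programmatically, again under the third-party python-weka-wrapper3 assumption:

    # Sketch: apply the unsupervised ReplaceMissingValues filter
    # (python-weka-wrapper3 assumed; the file name is hypothetical).
    import weka.core.jvm as jvm
    from weka.core.converters import Loader
    from weka.filters import Filter

    jvm.start()
    loader = Loader(classname="weka.core.converters.ArffLoader")
    data = loader.load_file("train.arff")
    data.class_is_last()
    flt = Filter(classname="weka.filters.unsupervised.attribute.ReplaceMissingValues")
    flt.inputformat(data)                     # determine the output format from the data
    filtered = flt.filter(data)               # missing values become means/modes
    jvm.stop()

As the introduction to this section warns, a supervised filter would have to be fitted on the training data only and then reapplied, unchanged, to the test data.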

Unsupervised Instance Filters

You can Randomize the order of instances in the dataset. Normalize treats all numeric attributes (excluding the class) as a vector and normalizes it to a given length. You can specify the vector length and the norm to be used.

There are various ways of generating subsets of the data. Use Resample to produce a random sample by sampling with replacement, or RemoveFolds to split it into a given number of cross-validation folds and reduce it to just one of them. If a random number seed is provided, the dataset will be shuffled before the subset is extracted. RemovePercentage removes a given percentage of instances, and RemoveRange removes a certain range of instance numbers. To remove all instances that have certain values for nominal attributes, or numeric values above or below a certain threshold, use RemoveWithValues. By default all instances are deleted that exhibit one of a given set of nominal attribute values (if the specified attribute is nominal) or a numeric value below a given threshold (if it is numeric). However, the matching criterion can be inverted.

You can remove outliers by applying a classification method to the dataset (specifying it just as the clustering method was specified previously for AddCluster) and use RemoveMisclassified to delete the instances that it misclassifies.

Supervised Attribute Filters


Supervised Instance Filters

CLASSIFIERS

Bayesian classifiers

NaiveBayes implements the probabilistic Naïve Bayes classifier. NaiveBayesSimple uses the normal distribution to model numeric attributes. NaiveBayes can use kernel density estimators, which improves performance if the normality assumption is grossly incorrect; it can also handle numeric attributes using supervised discretization. NaiveBayesUpdateable is an incremental version that processes one instance at a time; it can use a kernel estimator but not discretization. NaiveBayesMultinomial implements the multinomial Bayes classifier (Section 4.2, page 95). ComplementNaiveBayes builds a Complement Naïve Bayes classifier as described by Rennie et al. (2003).

AODE (averaged, one-dependence estimators) is a Bayesian method that averages over a space of alternative Bayesian models that have weaker independence assumptions than Naïve Bayes (Webb et al., 2005). The algorithm may yield more accurate classification than Naïve Bayes on datasets with nonindependent attributes.

BayesNet learns Bayesian networks under the assumptions made in Section 6.7: nominal attributes (numeric ones are prediscretized) and no missing values (any such values are replaced globally). There are two different algorithms for estimating the conditional probability tables of the network. Search is done using K2 or the TAN algorithm, or more sophisticated methods based on hill-climbing, simulated annealing, tabu search, and genetic algorithms. Optionally, search speed can be improved using AD trees. There is also an algorithm that uses conditional independence tests to learn the structure of the network; alternatively, the network structure can be loaded from an XML (extensible markup language) file. More details on the implementation of Bayesian networks in Weka can be found in Bouckaert (2004). You can observe the network structure by right-clicking the history item and selecting Visualize graph.

TREES

DecisionStump, designed for use with the boosting methods described later, builds one-level binary decision trees for datasets with a categorical or numeric class, dealing with missing values by treating them as a separate value and extending a third branch from the stump.

Trees built by RandomTree choose a test based on a given number of random features at each node, performing no pruning. RandomForest constructs random forests by bagging ensembles of random trees.

REPTree builds a decision or regression tree using information gain/variance reduction and prunes it using reduced-error pruning. Optimized for speed, it only sorts values for numeric attributes once. It deals with missing values by splitting instances into pieces, as C4.5 does. You can set the minimum number of instances per leaf, maximum tree depth (useful when boosting trees), minimum proportion of training set variance for a split (numeric classes only), and number of folds for pruning.

NBTree is a hybrid between decision trees and Naïve Bayes. It creates trees whose leaves are Naïve Bayes classifiers for the instances that reach the leaf. When constructing the tree, cross-validation is used to decide whether a node should be split further or a Naïve Bayes model should be used instead (Kohavi 1996).

M5P is the model tree learner.

LMT builds logistic model trees. LMT can deal with binary and multiclass target variables, numeric and nominal attributes, and missing values. When fitting the logistic regression functions at a node, it uses cross-validation to determine how many iterations to run just once and employs the same number throughout the tree instead of cross-validating at every node. This heuristic (which you can switch off) improves the run time considerably, with little effect on accuracy. Alternatively, you can set the number of boosting iterations to be used throughout the tree. Normally, it is the misclassification error that cross-validation minimizes, but the root mean-squared error of the probabilities can be chosen instead. The splitting criterion can be based on C4.5's information gain (the default) or on the LogitBoost residuals, striving to improve the purity of the residuals.

ADTree builds an alternating decision tree using boosting and is optimized for two-class problems. The number of boosting iterations is a parameter that can be tuned to suit the dataset and the desired complexity-accuracy tradeoff. Each iteration adds three nodes to the tree (one split node and two prediction nodes) unless nodes can be merged. The default search method is exhaustive search (Expand all paths); the others are heuristics and are much faster. You can determine whether to save instance data for visualization.

RULES

DecisionTable builds a decision table majority classifier. It evaluates feature subsets using best-first search and can use cross-validation for evaluation (Kohavi 1995b). An option uses the nearest-neighbor method to determine the class for each instance that is not covered by a decision table entry, instead of the table's global majority, based on the same set of features.

OneR is the 1R classifier with one parameter: the minimum bucket size for discretization.

ConjunctiveRule learns a single rule that predicts either a numeric or a nominal class value. Uncovered test instances are assigned the default class value (or distribution) of the uncovered training instances. The information gain (nominal class) or variance reduction (numeric class) of each antecedent is computed, and rules are pruned using reduced-error pruning.

ZeroR is even simpler: it predicts the test data's majority class (if nominal) or average value (if numeric). Prism implements the elementary covering algorithm for rules.

Part obtains rules from partial decision trees. It builds the tree using C4.5's heuristics with the same user-defined parameters as J4.8. M5Rules obtains regression rules from model trees built using M5′. Ridor learns rules with exceptions (Section 6.2, pages 210-213) by generating the default rule, using incremental reduced-error pruning to find exceptions with the smallest error rate, finding the best exceptions for each exception, and iterating. JRip implements RIPPER, including heuristic global optimization of the rule set (Cohen 1995). Nnge is a nearest-neighbor method for generating rules using nonnested generalized exemplars.
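
A sketch of how several of these classifiers might be compared by ten-fold cross-validation, under the same python-weka-wrapper3 assumption:

    # Sketch: estimate and compare classifier accuracy by cross-validation
    # (python-weka-wrapper3 assumed; the file name is hypothetical).
    import weka.core.jvm as jvm
    from weka.core.converters import Loader
    from weka.classifiers import Classifier, Evaluation
    from weka.core.classes import Random

    jvm.start()
    loader = Loader(classname="weka.core.converters.ArffLoader")
    data = loader.load_file("iris.arff")
    data.class_is_last()
    for name in ["weka.classifiers.bayes.NaiveBayes",
                 "weka.classifiers.trees.J48",
                 "weka.classifiers.rules.JRip"]:
        cls = Classifier(classname=name)
        evl = Evaluation(data)
        evl.crossvalidate_model(cls, data, 10, Random(1))
        print(name, "%.2f%%" % evl.percent_correct)
    jvm.stop()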

OTHER CLASSIFIERS


TANAGRA

WHAT IS TANAGRA?

TANAGRA is free data mining software for academic and research purposes. It offers several data mining methods from exploratory data analysis, statistical learning, machine learning, and the database area.

The project is the successor of SIPINA, which implements various supervised learning algorithms, in particular the interactive and visual construction of decision trees. TANAGRA is more powerful: it contains some supervised learning methods but also other paradigms such as clustering, factorial analysis, parametric and nonparametric statistics, association rules, and feature selection and construction algorithms.

TANAGRA is an "open source project": every researcher can access the source code and add their own algorithms, as long as they agree to and comply with the software distribution license.

The main purpose of the Tanagra project is to give researchers and students easy-to-use data mining software that conforms to the current norms of software development in this domain (especially in the design of its GUI and the way it is used) and allows the analysis of either real or synthetic data.

The second purpose of TANAGRA is to offer researchers an architecture that lets them easily add their own data mining methods and compare their performance. TANAGRA acts more as an experimental platform, letting them concentrate on the essence of their work and sparing them the unpleasant part of programming this kind of tool: data management.


The third and last purpose, aimed at novice developers, is to disseminate a possible methodology for building this kind of software. They can take advantage of free access to the source code to see how this sort of software is built, which problems to avoid, what the main steps of such a project are, and which tools and code libraries to use. In this way, Tanagra can be considered a pedagogical tool for learning programming techniques.

TANAGRA does not currently include what gives the commercial software in this domain its strength: a wide set of data sources, direct access to data warehouses and databases, data cleansing, and interactive use.

GUI

Since its introduction by SPAD (DECISIA) at the beginning of the 1990s, modeling a chain of processing steps as a "stream diagram" has been adopted by many software packages. Consciously or not, most of the "big" companies in the data mining industry have reused this notion of visual programming to describe the successive operations applied to data.

TANAGRA follows the same approach, so the interface is classically composed of three parts: the description of the stream diagram (a "treeview" in this project); the set of nodes (operators, components) in the bottom frame; and finally the results report, in HTML format.


Figure 1. A view of the graphical user interface.

Operators (Components)

The bottom frame contains the data mining operators (also referred to as icons, nodes, or components). All of them take data as input, perform analyses, and produce results; nevertheless, only a few of them make predictions. In that case, one or more variables are added to the data, which are then transmitted to the following operator.

Operators are arranged in categories. Some of these categories are commonly accepted (description / structuring / explanation-prediction / association, for example); others are more debatable. In fact, one underlying constraint was not to have too many categories of methods.

Figure 2. View of the operators interface.


Data mining diagram

As with other software in this domain, the data miner defines an analysis by starting with the data and adding operators one after another. To experiment with various assumptions and compare the results obtained, the user can explore several branches of the diagram.

Choosing a tree structure (treeview) facilitates the management of diagrams, at the programming level as much as at the end-user level. Complex analyses can thus be easily represented and carried out. On the other hand, it is not possible to merge branches in the analysis diagram, as it is in other graphical software; for example, it is not possible to automatically combine several data sources.

Figure 3. A view of the data mining diagram used in TANAGRA.

Results

Most often, Tanagra operators produce output in HTML format, so it is simple to export the results to an editing program, such as EXCEL, for subsequent processing.

Generally, the output is composed of two parts: the description of the analysis parameters, and the results.

Choosing the HTML format also has the advantage that results can easily be exported and viewed after closing the software, or even printed.

When necessary, results can be displayed in a window with which the user can interact. This is the case for the "X-Y graph" operator: the user can use the mouse to select the variables used for the horizontal and vertical axes, in order to better understand how the points are distributed.


Figure 4. Results section

The Concept of the Stream Diagram

Stream diagram

Introduced by SPAD for data analysis in the early 1990s (at that time people did not yet talk about data mining), the stream diagram represents the sequence of operations applied to data as a graph in which (1) the nodes (an operator, a component, etc.) symbolize the analyses performed on the data, and (2) the links between nodes represent the flow of processed data.

The main advantage of this representation is its clarity; it also lies in the ability to easily chain operations on data generated by other methods: for example, applying clustering to the factorial axes produced by a multiple correspondence analysis. Of course, it is possible to do the same thing using the scripting facilities of some packages, but does the person carrying out the study have the time and the will to learn a new language? Some people consider the use of stream diagrams a form of visual programming; this is a little grandiose since, apart from the succession of operations, no usual algorithmic structure is used (loops, conditions, ...). In any case, the stream diagram representation has become an unavoidable paradigm, adopted by most data mining software vendors (cf. STATISTICA DATA MINER, INSIGHTFUL MINER, SAS EM, SPSS CLEMENTINE, etc.).


In TANAGRA, the graph has been replaced with a simpler form: a tree. Only one source can provide data within a given diagram, so the user must prepare the data before importing it. This choice has two main consequences: for users, it is easier to read which operations have actually been performed; for developers, the classes are simpler, with fewer integrity checks on the data.

The tree structure makes it possible to run several concurrent analyses on the same data in parallel. This can be useful if, for example, we want to compare the performance of several prediction algorithms.

Figure 5. Stream diagram representation in TANAGRA

Operators

The operator (component) is a key element, as it represents an operation performed on the data. The first operator is always a connection to a dataset, a set of records and attributes. The import wizard automatically places the connection at the top of the diagram.

Four types of results can be expected when adding operators, each of which has parameters: (a) analytical results that describe or model the data; (b) a restriction or enlargement of the set of active examples used for the analyses; (c) a restriction or enlargement of the set of attributes used for the analyses; (d) the production of new attributes, added to the dataset.


Figure 6. Setting the parameters of an operator.

Operator compatibility

Commonly, operators are linked to form a sequence in a diagram; there are two levels of checks: when adding an operator, and when executing the sequence.

Two categories of operators behave somewhat unusually to accommodate the specifics of their methods: supervised learning and meta supervised learning. They work as follows: first a "meta" component is added to the diagram, which acts as a kind of host for single methods or even arcing ones; then the "learning" operator (discriminant analysis, decision tree induction, etc.) is included in it. This approach multiplies the possible combinations; for example, one can apply boosting to a multilayer perceptron, which is not really recommended, but possible nonetheless.

Cihan KABRAN
İsa KURU
Uğur YAMAN