computational intelligence for data mining


Post on 11-Feb-2016






Computational Intelligence for Data Mining. Włodzisław Duch, Department of Informatics, Nicholas Copernicus University, Toruń, Poland. With help from R. Adamczak, K. Grąbczewski, K. Grudziński, N. Jankowski, A. Naud.


  • Computational Intelligence for Data Mining

    Włodzisław Duch, Department of Informatics, Nicholas Copernicus University, Toruń, Poland

    With help from R. Adamczak, K. Grąbczewski, K. Grudziński, N. Jankowski, A. Naud

    2002, Honolulu, HI

  • Group members

  • Plan

    What is this tutorial about? How to discover knowledge in data; how to create comprehensible models of data; how to evaluate new data.

    AI, CI & Data Mining
    Forms of useful knowledge
    GhostMiner philosophy
    Exploration & Visualization
    Rule-based data analysis
    Neurofuzzy models
    Neural models
    Similarity-based models
    Committees of models

  • AI, CI & DM

    Artificial Intelligence: symbolic models of knowledge. Higher-level cognition: reasoning, problem solving, planning, heuristic search for solutions. Machine learning, inductive, rule-based methods. Technology: expert systems.

    Computational Intelligence, Soft Computing: methods inspired by many sources: biology (evolutionary, immune, neural computing); statistics, pattern recognition; probability (Bayesian networks); logic (fuzzy, rough). Perception, object recognition.

    Data Mining, Knowledge Discovery in Databases: discovery of interesting patterns, rules, knowledge; building predictive data models.

  • Forms of useful knowledge

    AI/Machine Learning camp: Neural nets are black boxes. Unacceptable! Symbolic rules forever.

    But ... knowledge accessible to humans is in: symbols, similarity to prototypes, images, visual representations.

    What type of explanation is satisfactory? Interesting question for cognitive scientists. Different answers in different fields.

  • Forms of knowledge

    Humans remember examples of each category and refer to such examples, as similarity-based or nearest-neighbor methods do.

    Humans create prototypes out of many examples, as Gaussian classifiers, RBF networks and neurofuzzy systems do.

    Logical rules are the highest form of summarization of knowledge.

    Types of explanation:
    exemplar-based: prototypes and similarity;
    logic-based: symbols and rules;
    visualization-based: maps, diagrams, relations ...

  • GhostMiner Philosophy

    GhostMiner: data mining tools from our lab.

    Separate the process of model building and knowledge discovery from model use => GhostMiner Developer & GhostMiner Analyzer.

    There is no free lunch: provide different types of tools for knowledge discovery. Decision tree, neural, neurofuzzy, similarity-based, committees.

    Provide tools for visualization of data.

    Support the process of knowledge discovery/model building and evaluating, organizing it into projects.

  • Wine data example

    Chemical analysis of wine from grapes grown in the same region in Italy, but derived from three different cultivars. Task: recognize the source of a wine sample.

    13 quantities measured, continuous features:
    alcohol content
    ash content
    magnesium content
    flavanoids content
    proanthocyanins phenols content
    OD280/OD315 of diluted wines
    malic acid content
    alkalinity of ash
    total phenols content
    nonflavanoid phenols content
    color intensity
    hue
    proline

  • Exploration and visualization

    General info about the data.

  • Exploration: data

    Inspect the data.

  • Exploration: data statistics

    Distribution of feature values. Proline has very large values, so the data should be standardized before further processing.

  • Exploration: data standardized

    Standardized data: zero mean and unit standard deviation for each feature; about 2/3 of all data should fall within [mean − std, mean + std].

    Other options: normalize to fit in [-1, +1], or normalize after rejecting some extreme values.
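The two rescaling options mentioned on this slide can be sketched in a few lines. This is only an illustration using the Python standard library; the function names and the sample values are hypothetical and are not taken from GhostMiner or from the Wine data.

```python
import statistics

def standardize(values):
    """Rescale a feature to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mean) / std for v in values]

def rescale_to_unit_range(values):
    """Alternative: map the feature linearly onto [-1, +1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

# Hypothetical proline-like values, for illustration only.
proline = [1065.0, 1050.0, 520.0, 680.0, 450.0, 630.0, 1480.0, 735.0]
z = standardize(proline)
u = rescale_to_unit_range(proline)
```

After standardization a feature with large raw values, such as proline, no longer dominates distance computations.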

  • Exploration: 1D histograms

    Distribution of feature values in classes. Some features are more useful than others.

  • Exploration: 1D/3D histograms

    Distribution of feature values in classes, 3D.

  • Exploration: 2D projections

    Projections on selected 2D subspaces.

  • Visualize data

    Hard to imagine relations in more than 3D. SOM mappings: popular for visualization, but rather inaccurate, with no measure of distortions.

    Measure of topographical distortions: map all points $X_i$ from $R^n$ to points $x_i$ in $R^m$, $m < n$, and ask: how well are the distances $R_{ij} = D(X_i, X_j)$ reproduced by the distances $r_{ij} = d(x_i, x_j)$? Use $m = 2$ for visualization, use higher $m$ for dimensionality reduction.
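One way to make the distortion question concrete is to compare all pairwise distances before and after the mapping. The sketch below assumes Euclidean distances and a normalized sum of squared differences; the slide does not say which normalization is used, so that choice and the function names are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def stress(high, low):
    """Normalized sum of squared differences between R_ij and r_ij.

    high: points X_i in R^n, low: their images x_i in R^m.
    The normalization by sum(R_ij^2) is an assumption of this sketch.
    """
    num = 0.0
    den = 0.0
    n = len(high)
    for i in range(n):
        for j in range(i + 1, n):
            R = euclidean(high[i], high[j])
            r = euclidean(low[i], low[j])
            num += (R - r) ** 2
            den += R ** 2
    return num / den if den > 0 else 0.0

# Hypothetical 3D points and a naive 2D map that drops the last coordinate.
X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
projection = [p[:2] for p in X]
distortion = stress(X, projection)
```

A value of zero means all distances are reproduced exactly; larger values mean the low-dimensional picture is less faithful.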

  • Visualize data: MDS

    Multidimensional scaling: invented in psychometrics by Torgerson (1952), re-invented by Sammon (1969) and myself (1994).

    Minimize the measure of topographical distortions by moving the x coordinates.
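A minimal sketch of the idea "minimize the distortion by moving the x coordinates" is given below. It uses plain gradient descent on the unnormalized squared-difference stress and halves the step whenever a move would not reduce the stress. This is an illustrative scheme chosen for simplicity, not the algorithm of Torgerson, Sammon or GhostMiner; all names and parameter values are assumptions.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def raw_stress(D, Y):
    """Sum over pairs of (R_ij - r_ij)^2 for target distances D and map Y."""
    n = len(Y)
    return sum((D[i][j] - dist(Y[i], Y[j])) ** 2
               for i in range(n) for j in range(i + 1, n))

def mds(X, m=2, iters=200, lr=0.05, seed=0):
    rng = random.Random(seed)
    n = len(X)
    D = [[dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    Y = [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(n)]
    best = raw_stress(D, Y)
    history = [best]
    for _ in range(iters):
        # Gradient of the raw stress with respect to every mapped point.
        grad = [[0.0] * m for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                r = dist(Y[i], Y[j])
                if r == 0.0:
                    continue  # coincident points give no direction
                coef = -2.0 * (D[i][j] - r) / r
                for k in range(m):
                    grad[i][k] += coef * (Y[i][k] - Y[j][k])
        trial = [[Y[i][k] - lr * grad[i][k] for k in range(m)]
                 for i in range(n)]
        s = raw_stress(D, trial)
        if s < best:
            Y, best = trial, s   # accept the move
        else:
            lr *= 0.5            # reject the move and shrink the step
        history.append(best)
    return Y, history

# Hypothetical 3D points mapped to 2D.
X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
     (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
Y, history = mds(X)
```

Because moves are accepted only when they lower the stress, the recorded stress never increases, although the result may still be a local minimum.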

  • Visualize data: Wine

    3 clusters are clearly distinguished; 2D is fine. The green outlier can be identified easily.

  • Decision trees

    Simplest things first: use a decision tree to find logical rules.

    Test a single attribute, find a good point to split the data, separating vectors from different classes.

    DT advantages: fast, simple, easy to understand, easy to program, many good algorithms.

  • Decision borders

    Univariate trees: test the value of a single attribute, x < a.

    Multivariate trees: test on combinations of attributes.

    Result: the feature space is divided into hyperrectangular areas.

  • SSV decision tree

    Separability Split Value tree: based on the separability criterion.

    Define the left and right sides of the splits.

    SSV criterion: separate as many pairs of vectors from different classes as possible; minimize the number of separated pairs from the same class.
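The criterion can be illustrated for a single continuous feature. The exact formula did not survive in this transcript, so the sketch below assumes one common form: twice the number of separated pairs from different classes, minus, for each class, the smaller of its counts on the two sides. The weighting, the midpoint candidates and the function names are assumptions.

```python
from collections import Counter

def ssv(values, labels, split):
    """Separability of a split 'value < split' on one feature (assumed form)."""
    left = Counter(c for v, c in zip(values, labels) if v < split)
    right = Counter(c for v, c in zip(values, labels) if v >= split)
    n_right = sum(right.values())
    classes = set(labels)
    # Pairs from different classes that end up on opposite sides.
    separated = sum(left[c] * (n_right - right[c]) for c in classes)
    # Penalty for splitting vectors of the same class.
    broken = sum(min(left[c], right[c]) for c in classes)
    return 2 * separated - broken

def best_split(values, labels):
    """Try midpoints between consecutive distinct values, keep the best."""
    pts = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    return max(candidates, key=lambda s: ssv(values, labels, s))

# Hypothetical one-feature data with two well separated classes.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
split = best_split(values, labels)
```

A tree would apply the same search to every feature and recurse on both sides of the winning split.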

  • SSV complex tree

    Trees can always be grown until they reach 100% accuracy on the training data, but then very few vectors are left in the leaves!

  • SSV simplest tree

    Pruning finds the nodes that should be removed to increase generalization accuracy on unseen data.

    Tree with 7 nodes left: 15 errors/178 vectors.

  • SSV logical rules

    Trees may be converted to logical rules. The simplest tree leads to 4 logical rules:

    if proline > 719 and flavanoids > 2.3 then class 1
    if proline < 719 and OD280 > 2.115 then class 2
    if proline > 719 and flavanoids < 2.3 then class 3
    if proline < 719 and OD280 < 2.115 then class 3

    How accurate are such rules? Not the 15/178 errors (91.6% accuracy) seen on the training data! Run 10-fold CV and average the results: 85±10%? Run it 10 times!
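The four rules translate directly into a small function. The thresholds are the ones on the slide; the function name, the argument names and the handling of values exactly equal to a threshold (which the slide leaves open) are assumptions of this sketch.

```python
def classify_wine(proline, flavanoids, od280):
    """Apply the four crisp rules of the simplest SSV tree.

    Values exactly equal to a threshold fall into the 'else' branch here;
    the slide uses strict inequalities on both sides and does not say.
    """
    if proline > 719:
        return 1 if flavanoids > 2.3 else 3
    return 2 if od280 > 2.115 else 3

# Hypothetical samples, for illustration only.
samples = [
    (1000.0, 3.0, 3.0),
    (500.0, 1.0, 3.0),
    (1000.0, 1.0, 3.0),
    (500.0, 1.0, 1.5),
]
predictions = [classify_wine(*s) for s in samples]
```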

  • SSV optimal trees/rules

    Optimal: estimate how well the rules will generalize. Use stratified crossvalidation for training; use beam search for better results.

    if OD280/OD315 > 2.505 and proline > 726.5 then class 1
    if OD280/OD315 < 2.505 and hue > 0.875 and malic-acid < 2.82 then class 2
    if OD280/OD315 > 2.505 and proline < 726.5 then class 2
    if OD280/OD315 < 2.505 and hue > 0.875 and malic-acid > 2.82 then class 3
    if OD280/OD315 < 2.505 and hue < 0.875 then class 3

    Not the 6/178 errors (96.6% accuracy) seen on the training data! Run 10-fold CV: results are 90.4±6.1%? Run it 10 times!
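Stratified crossvalidation keeps the class proportions roughly equal in every fold. The sketch below shows one simple way to build such folds and to report a result as mean ± standard deviation; it is not GhostMiner's implementation. The class sizes 59, 71 and 48 are those of the public UCI Wine data, an outside fact not stated on the slides, and the accuracy values are made up for illustration.

```python
import random
import statistics
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Deal the indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    folds = [[] for _ in range(k)]
    pos = 0
    for c in sorted(by_class):
        members = by_class[c]
        rng.shuffle(members)
        for idx in members:
            folds[pos % k].append(idx)
            pos += 1
    return folds

def summarize(accuracies):
    """Mean and sample standard deviation, reported as 'mean ± std'."""
    return statistics.fmean(accuracies), statistics.stdev(accuracies)

labels = [1] * 59 + [2] * 71 + [3] * 48   # 178 vectors in three classes
folds = stratified_folds(labels, k=10)

# Made-up per-fold accuracies, only to show the reporting step.
mean_acc, std_acc = summarize([0.8, 0.9, 1.0])
```

Repeating the whole procedure with different seeds ("run 10 times") shows how much the estimate depends on the particular partition.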

  • Logical rules

    Crisp logic rules: for continuous x use linguistic variables (predicate functions).

    $s_k(x) \equiv \mathrm{True}\,[X_k \le x \le X'_k]$, for example:

    small(x) = True{x | x < 1}
    medium(x) = True{x | x ∈ [1,2]}
    large(x) = True{x | x > 2}

    Linguistic variables are used in crisp (propositional, Boolean) logic rules:

    IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...
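Linguistic variables are simply Boolean predicates over intervals, so the example can be written as plain functions. The thresholds follow the slide; the function names and the way the Brownie rule takes its arguments are illustrative assumptions.

```python
def small(x):
    return x < 1

def medium(x):
    return 1 <= x <= 2

def large(x):
    return x > 2

def is_brownie(height, has_hat, has_beard):
    """IF small-height(X) AND has-hat(X) AND has-beard(X) THEN Brownie."""
    return small(height) and has_hat and has_beard

# Exactly one of the three predicates is true for any value of x.
memberships = [(small(x), medium(x), large(x)) for x in (0.5, 1.5, 2.5)]
```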

  • Crisp logic decisions

    Crisp logic is based on rectangular membership functions: True/False values jump from 0 to 1. Step functions are used for partitioning of the feature space. Very simple hyper-rectangular decision borders.

    Severe limitation on the expressive power of crisp logical rules!

  • Logical rules - advantages

    Logical rules, if simple enough, are preferable.

    Rules may expose limitations of black box solutions.
    Only relevant features are used in rules.
    Rules may sometimes be more accurate than NN and other CI methods.
    Overfitting is easy to control; rules usually have a small number of parameters.

    Rules forever!? A logical rule about logical rules is:

  • Logical rules - limitations

    Logical rules are preferred but ...

    Only one class is predicted, $p(C_i|X,M) = 0$ or $1$; such a black-and-white picture may be inappropriate in many applications.
    Discontinuous cost functions allow only non-gradient optimization.
    Sets of rules are unstable: a small change in the dataset leads to a large change in the structure of complex sets of rules.
    Reliable crisp rules may reject some cases as unclassified.
    Interpretation of crisp rules may be misleading.

    Fuzzy rules are not so comprehensible.

  • How to use logical rules?

    Data has been measured with unknown error. Assume a Gaussian distribution: x becomes a fuzzy number $G_x$ with a Gaussian membership function.

    A set of logical rules R is used for fuzzy input vectors: Monte Carlo simulations for an arbitrary system => $p(C_i|X)$.

    Analytical evaluation of $p(C|X)$ is based on the cumulative distribution function (cumulant); the error function is approximated by the logistic function with an error < 0.02.
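For a single threshold rule x > a, the Monte Carlo route and the analytical route can be compared directly. The sketch below assumes a Gaussian measurement error of width s centered on the measured value. The slope 2.4/(√2·s) of the logistic approximation is an assumed value chosen to match the quoted accuracy; the slide itself does not state it, and the numbers used are hypothetical.

```python
import math
import random

def rule(x, a):
    """Crisp rule R_a(x): x > a."""
    return x > a

def p_monte_carlo(x, s, a, n=20000, seed=0):
    """Fraction of noisy copies of x that satisfy the crisp rule."""
    rng = random.Random(seed)
    hits = sum(rule(rng.gauss(x, s), a) for _ in range(n))
    return hits / n

def p_exact(x, s, a):
    """Gaussian cumulative probability that a value around x exceeds a."""
    return 0.5 * (1.0 + math.erf((x - a) / (s * math.sqrt(2.0))))

def p_logistic(x, s, a):
    """Logistic approximation; the slope 2.4/(sqrt(2)*s) is an assumption."""
    beta = 2.4 / (math.sqrt(2.0) * s)
    return 1.0 / (1.0 + math.exp(-beta * (x - a)))

# Hypothetical measurement close to a threshold.
x, s, a = 2.4, 0.2, 2.3
print(p_monte_carlo(x, s, a), p_exact(x, s, a), p_logistic(x, s, a))
```

Monte Carlo works for any rule system, while the closed form gives smooth probabilities that can be optimized with gradient methods.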

  • Rules - choices

    Simplicity vs. accuracy. Confidence vs. rejection rate.

    $p_{++}$ is a hit; $p_{-+}$ is a false alarm; $p_{+-}$ is a miss.

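  • Rules: error functions

    The overall accuracy is equal to a combination of sensitivity and specificity weighted by the a priori probabilities:

    $A(M) = p_+ S_+(M) + p_- S_-(M)$

    Optimization of rules for the $C_+$ class; large $\gamma$ means no errors but a high rejection rate:

    $E(M_+;\gamma) = \gamma L(M_+) - A(M_+) = \gamma (p_{+-} + p_{-+}) - (p_{++} + p_{--})$

    $\min_M E(M;\gamma) \Leftrightarrow \min_M \{(1+\gamma) L(M) + R(M)\}$

    where $L(M)$ is the error rate and $R(M)$ the rejection rate, with $A(M) + L(M) + R(M) = 1$.

    Optimization with different costs of errors:

    $\min_M E(M;\alpha) = \min_M \{p_{+-} + \alpha p_{-+}\} = \min_M \{p_+ (1 - S_+(M)) - p_{+r}(M) + \alpha [p_- (1 - S_-(M)) - p_{-r}(M)]\}$

    ROC (Receiver Operating Characteristic) curve: $p_{++}$ as a function of $p_{-+}$, that is, hit rate versus false-alarm rate.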
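The quantities on this slide can be computed from a table of joint probabilities p[true class, decision], where the decision may also be a rejection "r". The sketch below is an illustration with made-up numbers; it reads the first index as the true class and the second as the decision, consistent with "p-+ is a false alarm", and all names are assumptions.

```python
def rule_metrics(p, gamma=1.0, alpha=1.0):
    """Accuracy, error, rejection and cost functions from joint probabilities."""
    p_plus = p["+", "+"] + p["+", "-"] + p["+", "r"]    # a priori p_+
    p_minus = p["-", "+"] + p["-", "-"] + p["-", "r"]   # a priori p_-
    S_plus = p["+", "+"] / p_plus      # sensitivity
    S_minus = p["-", "-"] / p_minus    # specificity
    A = p["+", "+"] + p["-", "-"]      # accuracy
    L = p["+", "-"] + p["-", "+"]      # error rate
    R = p["+", "r"] + p["-", "r"]      # rejection rate
    return {
        "p_plus": p_plus, "p_minus": p_minus,
        "S_plus": S_plus, "S_minus": S_minus,
        "A": A, "L": L, "R": R,
        "E_gamma": gamma * L - A,                    # E(M; gamma)
        "E_alpha": p["+", "-"] + alpha * p["-", "+"],  # E(M; alpha)
    }

# Made-up joint probabilities that sum to 1.
p = {
    ("+", "+"): 0.40, ("+", "-"): 0.05, ("+", "r"): 0.05,
    ("-", "+"): 0.10, ("-", "-"): 0.35, ("-", "r"): 0.05,
}
m = rule_metrics(p, gamma=2.0, alpha=3.0)
```

Since A + L + R = 1, E(M;γ) differs from (1+γ)L + R only by a constant, so both have the same minimum.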

  • Fuzzification of rules

    Rule $R_a(x) = \{x > a\}$ is fulfilled by $G_x$ with probability:

    The error function is approximated by the logistic function; assuming the error distribution $\sigma(x)(1 - \sigma(x))$, which for $s^2 = 1.7$ approximates a Gaussian to within 3.5%.

    Rule $R_{ab}(x) = \{b > x > a\}$ is fulfilled by $G_x$ with probability:
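The formulas for these probabilities are not preserved in this transcript. Under the Gaussian-error assumption, the probability for the interval rule is a difference of two cumulative distributions, and replacing each by a logistic function gives a difference of two sigmoids. The sketch below compares the two; the slope 2.4/(√2·s), the function names and the example values are assumptions.

```python
import math

def sigma(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def p_interval_exact(x, s, a, b):
    """P(a < y < b) for y Gaussian with mean x and standard deviation s."""
    def cdf(t):
        return 0.5 * (1.0 + math.erf((t - x) / (s * math.sqrt(2.0))))
    return cdf(b) - cdf(a)

def p_interval_logistic(x, s, a, b):
    """Difference of two sigmoids; the slope is an assumed value."""
    beta = 2.4 / (math.sqrt(2.0) * s)
    return sigma(beta * (x - a)) - sigma(beta * (x - b))

# Hypothetical rule 1 < x < 2 with measurement uncertainty s = 0.2.
a, b, s = 1.0, 2.0, 0.2
grid = [i * 0.01 for i in range(0, 301)]
worst = max(abs(p_interval_exact(x, s, a, b) - p_interval_logistic(x, s, a, b))
            for x in grid)
```

The resulting membership is close to 1 well inside the interval and falls smoothly to 0 outside it, instead of jumping as a crisp rule does.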

  • Soft trapezoids and NN

    The difference between two sigmoids mak

