Computational intelligence for data understanding
Computational intelligence for data understanding
Włodzisław Duch
Department of Informatics, Nicolaus Copernicus University, Toruń, Poland
Google: W. Duch
Best Summer Course '08
Plan
What is this tutorial about? How to discover knowledge in data; how to create comprehensible models of data; how to evaluate new data; how to understand what CI methods do.
AI, CI & Data Mining.
Forms of useful knowledge.
Integration of different methods in GhostMiner.
Exploration & visualization.
Rule-based data analysis.
Neurofuzzy models.
Neural models, understanding what they do.
Similarity-based models, prototype rules.
Case studies.
DM future: k-separability and meta-learning.
From data to expert systems.
AI, CI & DM
Artificial Intelligence: symbolic models of knowledge.
Higher-level cognition: reasoning, problem solving, planning, heuristic search for solutions.
Machine learning, inductive, rule-based methods.
Technology: expert systems.
Computational Intelligence, Soft Computing: methods inspired by many sources:
biology: evolutionary, immune, neural computing;
statistics: pattern recognition;
probability: Bayesian networks;
logic: fuzzy, rough.
Perception, object recognition.
Data Mining, Knowledge Discovery in Databases:
discovery of interesting patterns, rules, knowledge;
building predictive data models.
CI definition
Computational Intelligence. An International Journal (1984), plus 10 other journals with "Computational Intelligence" in the title; D. Poole, A. Mackworth & R. Goebel, Computational Intelligence: A Logical Approach (OUP 1998) is a GOFAI book about logic and reasoning.
CI should: be problem-oriented, not method-oriented; cover all that the CI community is doing now and is likely to do in the future; include AI (they also think they are CI ...).
CI: the science of solving (effectively) non-algorithmizable problems. A problem-oriented definition, firmly anchored in computer science/engineering.
AI: focused on problems requiring higher-level cognition; the rest of CI is more focused on problems related to perception/action/control.
What can we learn?
A good part of CI is about learning. What can we learn? Neural networks are universal approximators and evolutionary algorithms solve global optimization problems, so everything can be learned? Not quite ...
Duda, Hart & Stork, Ch. 9, No Free Lunch + Ugly Duckling Theorems:
Uniformly averaged over all target functions the expected error for all learning algorithms [predictions by economists] is the same. Averaged over all target functions no learning algorithm yields generalization error that is superior to any other. There is no problem-independent or best set of features.
Experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems.
What is there to learn?
Brains ... what is in EEG? What happens in the brain?
Industry: what happens?
Genetics, proteins ...
Forms of useful knowledge
AI/Machine Learning camp: Neural nets are black boxes. Unacceptable! Symbolic rules forever.
But ... knowledge accessible to humans is in: symbols, similarity to prototypes (intuition), images, visual representations.
What type of explanation is satisfactory?
An interesting question for cognitive scientists, but in different fields the answers are different!
Forms of knowledge
Three types of explanation are presented here:
exemplar-based: prototypes and similarity;
logic-based: symbols and rules;
visualization-based: maps, diagrams, relations ...
Humans remember examples of each category and refer to such examples, as similarity-based or nearest-neighbor methods do. Humans create prototypes out of many examples, the same as Gaussian classifiers, RBF networks, and neurofuzzy systems do. Logical rules are the highest form of summarization of knowledge, but require good linguistic variables.
GhostMiner Philosophy
There is no free lunch, so provide different types of tools for knowledge discovery: decision trees, neural, neurofuzzy, similarity-based, SVM, committees.
Provide tools for visualization of data.
Support the process of knowledge discovery/model building and evaluation, organizing it into projects.
Many other interesting DM packages of this sort exist: Weka, Yale, Orange, Knime ... 168 packages on the-data-mine.com list!
GhostMiner, data mining tools from our lab + Fujitsu: http://www.fqs.pl/ghostminer/
Separate the process of model building (hackers) and knowledge discovery from model use (lamers) => GhostMiner Developer & GhostMiner Analyzer.
Wine data example
Chemical analysis of wine from grapes grown in the same region in Italy, but derived from three different cultivars. Task: recognize the source of a wine sample. 13 quantities measured, all features continuous:
alcohol content, malic acid content, ash content, alkalinity of ash, magnesium content, total phenols content, flavanoids content, nonanthocyanins phenols content, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, proline.
Exploration and visualization
General info about the data.
Exploration: data
Inspect the data.
Exploration: data statistics
Distribution of feature values. Proline has very large values; most methods will benefit from data standardization before further processing.
Exploration: data standardized
Standardized data: unit standard deviation; about 2/3 of all data should fall within [mean-std, mean+std].
Other options: normalize to [-1,+1], or normalize rejecting p% of extreme values.
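The standardization step above can be sketched in a few lines. This is a minimal illustration, not GhostMiner's implementation; the `clip_pct` option mimicking "reject p% of extreme values" is a hypothetical addition.

```python
import numpy as np

def standardize(X, clip_pct=0.0):
    """Zero-mean, unit-variance scaling of each column (feature).

    clip_pct (hypothetical option): values beyond the given
    percentiles are clipped before standardizing, mimicking the
    'reject p% of extreme values' idea from the slide.
    """
    X = np.asarray(X, dtype=float)
    if clip_pct > 0:
        lo = np.percentile(X, clip_pct, axis=0)
        hi = np.percentile(X, 100 - clip_pct, axis=0)
        X = np.clip(X, lo, hi)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # keep constant features finite instead of dividing by zero
    return (X - mean) / std
```

For a feature like proline, whose raw values are orders of magnitude larger than the others, this puts all features on an equal footing before distance-based methods are applied.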
Exploration: 1D histograms
Distribution of feature values in classes. Some features are more useful than others.
Exploration: 1D/3D histograms
Distribution of feature values in classes, 3D.
Exploration: 2D projections
Projections on selected 2D subspaces.
Visualize data
Hard to imagine relations in more than 3D. Use parallel coordinates and other methods.
Linear methods: PCA, FDA, PP ... use input combinations.
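A minimal sketch of one such linear method, PCA via eigendecomposition of the feature covariance matrix. This illustrates the idea only; it is not the implementation used in GhostMiner.

```python
import numpy as np

def pca_project(X, m=2):
    """Project the rows of X onto the top-m principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)             # center each feature
    cov = np.cov(Xc, rowvar=False)      # feature covariance matrix
    vals, vecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:m]  # indices of the m largest eigenvalues
    return Xc @ vecs[:, order]          # new coordinates, variance-ordered
```

With m = 2 the result can be scatter-plotted directly, which is how the 2D projections of the wine data are obtained in tools of this kind.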
SOM mappings: popular for visualization, but rather inaccurate; there is no measure of distortions.
Measure of topographical distortions: map all Xi points from Rn to xi points in Rm, m < n, and ask: how well are the distances Rij = D(Xi, Xj) reproduced by the distances rij = d(xi, xj)? Use m = 2 for visualization, higher m for dimensionality reduction.
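Such a distortion measure can be illustrated with a Sammon-style stress function. The slide does not fix a particular formula, so the 1/Rij weighting used below (which emphasizes local structure) is an assumption, one common choice among several.

```python
import numpy as np

def sammon_stress(X, x):
    """Topographical distortion: compare distances R_ij in the original
    space with r_ij in the reduced space. The 1/R_ij weight (Sammon's
    choice, assumed here) makes small distances count as much as large ones."""
    X, x = np.asarray(X, dtype=float), np.asarray(x, dtype=float)
    n = len(X)
    num, norm = 0.0, 0.0
    for i in range(n):
        for j in range(i + 1, n):
            R = np.linalg.norm(X[i] - X[j])   # distance in R^n
            r = np.linalg.norm(x[i] - x[j])   # distance in R^m
            if R > 0:
                num += (R - r) ** 2 / R
                norm += R
    return num / norm
```

A perfect embedding gives stress 0; MDS-style methods minimize this quantity by moving the low-dimensional coordinates x.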
Sequences of the Globin family
226 protein sequences of the Globin family; the similarity matrix S(protein_i, protein_j) shows high similarity values (dark spots) within subgroups; MDS shows the cluster structure of the data (from Klock & Buhmann 1997). A vector representation of proteins is not easy.
Visualize data: MDS
Multidimensional scaling: invented in psychometry by Torgerson (1952), re-invented by Sammon (1969) and myself (1994).
Minimize the measure of topographical distortions by moving the x coordinates. Large distances dominate simple measures, but local structure is as important as the large scale.
Visualize data: Wine
3 clusters are clearly distinguished; 2D is fine. The green outlier can be identified easily.
Decision trees
Simplest things should be done first: use a decision tree to find logical rules.
Test a single attribute, find a good point to split the data, separating vectors from different classes.
DT advantages: fast, simple, easy to understand, easy to program, many good algorithms.
Tree for 3 kinds of iris flowers, petal and sepal leaves measured in cm.
Decision borders
Univariate trees: test the value of a single attribute, x < a, or for nominal features select a subset of values.
Multivariate trees: test combinations of attributes, W·X < a.
Result: feature space is divided into large hyperrectangular areas with decision borders perpendicular to the axes.
Splitting criteria
Most popular: information gain, used in C4.5 and other trees.
CART trees use the Gini index of node purity (Rényi quadratic entropy).
Which attribute is better? Which should be at the top of the tree?
Look at entropy reduction, or information gain index.
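Both criteria can be computed directly from class counts; a small self-contained sketch:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a class-label sequence: 1 - sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left/right children."""
    n = len(parent)
    return entropy(parent) - (len(left) / n * entropy(left)
                              + len(right) / n * entropy(right))
```

Ranking candidate attributes by `information_gain` (or by the analogous reduction in `gini`) answers the question of which attribute should sit at the top of the tree.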
Non-Bayesian selection
Bayesian MAP selection: choose the maximum a posteriori P(C|X) = P(C,X)/P(X).
Problem: for binary features non-optimal decisions are taken!

          A=0      A=1
P(C,A1)   0.0100   0.4900   P(C0)=0.5
          0.0900   0.4100   P(C1)=0.5
P(C,A2)   0.0300   0.4700
          0.1300   0.3700

MAP is here equivalent to a majority classifier (MC): given A=x, choose maxC P(C, A=x).
MC(A1) = 0.58, S+ = 0.98, S- = 0.18, AUC = 0.58, MI = 0.058
MC(A2) = 0.60, S+ = 0.94, S- = 0.26, AUC = 0.60, MI = 0.057
MC(A1)
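The MC accuracies quoted on the slide can be reproduced directly from the joint-probability tables: for each value of A pick the class with the larger joint probability and sum those probabilities.

```python
def mc_accuracy(joint):
    """Accuracy of the majority classifier for a joint distribution P(C, A),
    given as {a_value: [P(C0, A=a), P(C1, A=a)]}: for each value of A,
    the predicted class is the one with the larger joint probability."""
    return sum(max(probs) for probs in joint.values())

# Joint probabilities from the slide's table
PA1 = {0: [0.0100, 0.0900], 1: [0.4900, 0.4100]}
PA2 = {0: [0.0300, 0.1300], 1: [0.4700, 0.3700]}
```

For A1 the classifier picks C1 when A=0 and C0 when A=1, giving 0.09 + 0.49 = 0.58; for A2 it gives 0.13 + 0.47 = 0.60, matching the MC values above.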
SSV decision tree
Separability Split Value tree: based on the separability criterion.
SSV criterion: separate as many pairs of vectors from different classes as possible; minimize the number of separated pairs from the same class.
Define subsets of the data D using a binary test f(X, s) to split the data into left and right subsets, D = LS ∪ RS.
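A simplified sketch of the idea behind the criterion for a single numeric feature, counting separated pairs exactly as described above. This illustrates the principle only; the exact SSV formula is not reproduced here.

```python
from collections import Counter

def ssv_score(values, labels, threshold):
    """Simplified separability score for the split f(x) < threshold:
    reward pairs of different-class vectors that land on opposite sides,
    penalize separated pairs from the same class.
    (A sketch of the SSV idea, not the exact published formula.)"""
    left = Counter(l for v, l in zip(values, labels) if v < threshold)
    right = Counter(l for v, l in zip(values, labels) if v >= threshold)
    classes = set(labels)
    separated_diff = sum(left[a] * right[b]
                         for a in classes for b in classes if a != b)
    separated_same = sum(left[c] * right[c] for c in classes)
    return separated_diff - separated_same
```

Scanning candidate thresholds and keeping the one with the highest score yields the split value s for the test f(X, s).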
SSV complex tree
Trees may always learn to achieve 100% accuracy. Very few vectors are left in the leaves, so the splits are not reliable and will overfit the data!
SSV simplest tree
Pruning finds the nodes that should be removed to increase generalization accuracy on unseen data.
Trees with 7 nodes left: 15 errors/178 vectors.
SSV logical rules
Trees may be converted to logical rules. The simplest tree leads to 4 logical rules:
if proline > 719 and flavanoids > 2.3 then class 1
if proline < 719 and OD280 > 2.115 then class 2
if proline > 719 and flavanoids < 2.3 then class 3
if proline < 719 and OD280 < 2.115 then class 3
How acc
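The four rules can be transcribed directly as a function. Boundary values exactly equal to a threshold are not covered by the rules as stated, so this sketch sends them to the nearest else-branch; that tie-breaking choice is an assumption.

```python
def classify_wine(proline, flavanoids, od280):
    """Direct transcription of the four SSV rules from the slide.
    Values exactly equal to a threshold are not covered by the rules
    as written; here they fall through to the else-branches (assumption)."""
    if proline > 719:
        return 1 if flavanoids > 2.3 else 3
    else:
        return 2 if od280 > 2.115 else 3
```

Because the rule conditions on proline are mutually exclusive and each branch is exhaustive, every sample receives exactly one class label.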