towards science of dm włodzisław duch department of informatics, nicolaus copernicus university,...

Towards Science of DM Towards Science of DM Towards Science of DM Towards Science of DM

Włodzisław Duch

Department of Informatics, Nicolaus Copernicus University, Toruń, Poland

Google: W. DuchWCCI’08 Panel Discussion

What can we learn?What can we learn?What can we learn?What can we learn?Good part of CI is about learning.What can we learn?

Neural networks are universal approximators and evolutionary algorithms solve global optimization problems – so everything can be learned? Not quite ...

Duda, Hart & Stork, Ch. 9, No Free Lunch + Ugly Duckling Theorems:

• Uniformly averaged over all target functions the expected error for all learning algorithms [predictions by economists] is the same. • Averaged over all target functions no learning algorithm yields generalization error that is superior to any other. • There is no problem-independent or “best” set of features.

“Experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems.”

Data mining packagesData mining packagesData mining packagesData mining packages

• DM packages: Weka, Yale, RapidMiner, Orange, Knime ... >180 packages on the-data-mine.com list!

Hundreds of components ... thousands of combinations ... Our treasure box is full, although computer vision, BCI

and other problems are not solved.

• We can data mine forever … and publish forever!

• Neural networks are universal approximators and evolutionary algorithms solve global optimization problems – so everything can be learned? Not quite ...

GhostMiner, data mining tools from our lab + Fujitsu: http://www.fqspl.com.pl/ghostminer/

http://www.fqspl.com.pl/ghostminer/

Are we really Are we really so good?so good?

Surprise!

Almost nothing can be learned using such tools!

What have we tried: SBMWhat have we tried: SBMWhat have we tried: SBMWhat have we tried: SBMSimilarity-Based Methods (SBMs) organized in a framework: p(Ci|X;M) posterior classification probability or y(X;M) approximators,models M are parameterized in increasingly sophisticated way.

Why? (Dis)similarity: • more general than feature-based description, • no need for vector spaces (structured objects), • more general than fuzzy approach (F-rules are reduced to P-rules), • includes kNN, MLPs, RBFs, separable function networks, SVMs,

kernel methods and many others!

Components => Models; systematic search selects optimal combination of parameters and procedures, opening different types of optimization channels, trying to discover appropriate bias for a given problem.

Start from kNN, k=1, all data & features, Euclidean distance, end with a model that is a novel combination of procedures and parameterizations.

Transformation-based frameworkTransformation-based frameworkTransformation-based frameworkTransformation-based framework

Extend SBM adding fine granulation of methods and relations between them to enable meta-learning by search in the model space.

For example, transformations (layers) frequently do:

•linear projection: unsupervised - PCA, ICA … or supervised – FDA, LDA, linear SVM generate useful linear components;

•non-linear preprocessing transformation, ex. MLP;

•feature selector, based on information filter;

•matching pursuit network for signal decomposition;

•logical rules to handle unusual situations;

•evaluate similarity (RBF).

DM requires more transformations!

Taxonomy Taxonomy - TF- TFTaxonomy Taxonomy - TF- TF

Heterogeneous everythingHeterogeneous everythingHeterogeneous everythingHeterogeneous everythingHomogenous systems: one type of “building blocks”, same type of decision borders, ex: neural networks, SVMs, decision trees, kNNs

Committees combine many models together, but lead to complex models that are difficult to understand.

Ockham razor: simpler systems are better. Discovering simplest class structures, inductive bias of the data, requires Heterogeneous Adaptive Systems (HAS).

HAS examples:NN with different types of neuron transfer functions.k-NN with different distance functions for each prototype.Decision Trees with different types of test criteria.

1. Start from large networks, use regularization to prune.2. Construct network adding nodes selected from a candidate pool.3. Use very flexible functions, force them to specialize.

RBF for XORRBF for XORRBF for XORRBF for XORIs RBF solution with 2 hidden Gaussians nodes possible?Typical architecture: 2 input – 2 Gaussians – 1 linear output, ML training

50% errors, but there is perfect separation - not a linear separation! Network knows the answer, but cannot say it ...

Single Gaussian output node may solve the problem. Output weights provide reference hyperplanes (red and green lines), not the separating hyperplanes like in case of MLP.

More meta-learningMore meta-learningMore meta-learningMore meta-learning

Meta-learning: learning how to learn, replace experts who search for best models making a lot of experiments.Search space of models is too large to explore it exhaustively, design system architecture to support knowledge-based search.

• Abstract view, uniform I/O, uniform results management.

• Directed acyclic graphs (DAG) of boxes representing scheme

• placeholders and particular models, interconnected through I/O.

• Configuration level for meta-schemes, expanded at runtime level.

An exercise in software engineering for data mining!

Intemi, Intelligent MinerIntemi, Intelligent MinerIntemi, Intelligent MinerIntemi, Intelligent MinerMeta-schemes: templates with placeholders

• May be nested; the role decided by the input/output types.

• Machine learning generators based on meta-schemes.

• Granulation level allows to create novel methods.

• Complexity control: Length + log(time)

• A unified meta-parameters description ...

• InteMi, intelligent miner, coming “soon”.

How much can we learn?How much can we learn?Linearly separable or almost separable problems are relatively simple – deform or add dimensions to make data separable.

How to define “slightly non-separable”? There is only separable and the vast realm of the rest.

Boolean functionsBoolean functionsBoolean functionsBoolean functionsn=2, 16 functions, 12 separable, 4 not separable.

n=3, 256 f, 104 separable (41%), 152 not separable.

n=4, 64K=65536, only 1880 separable (3%)

n=5, 4G, but << 1% separable ... bad news!

Existing methods may learn some non-separable functions, but most functions cannot be learned !

Example: n-bit parity problem; many papers in top journals.No off-the-shelf systems are able to solve such problems.

For parity problems SVM may go below base rate! Such problems are solved only by special neural architectures or special classifiers – if the type of function is known.

But parity is still trivial ... solved by 1

cosn

ii

y b

kD casekD casekD casekD case3-bit functions: X=[b1b2b3], from [0,0,0] to [1,1,1]

f(b1,b2,b3) and f(b1,b2,b3) are symmetric (color change)

8 cube vertices, 28=256 Boolean functions.

0 to 8 red vertices: 1, 8, 28, 56, 70, 56, 28, 8, 1 functions.

For arbitrary direction W index projection W.X gives:

k=1 in 2 cases, all 8 vectors in 1 cluster (all black or all white)

k=2 in 14 cases, 8 vectors in 2 clusters (linearly separable)

k=3 in 42 cases, clusters B R B or W R W

k=4 in 70 cases, clusters R W R W or W R W R

Symmetrically, k=5-8 for 70, 42, 14, 2.

Most logical functions have 4 or 5-separable projections.

Learning = find best projection for each function. Number of k=1 to 4-separable functions is: 2, 102, 126 and 26126 of all functions may be learned using 3-separability.

3-bit parity in 2D and 3D3-bit parity in 2D and 3D3-bit parity in 2D and 3D3-bit parity in 2D and 3DOutput is mixed, errors are at base level (50%), but in the hidden space ...

Conclusion: separability in the hidden space is perhaps too much to desire ... inspection of clusters is sufficient for perfect classification; add second Gaussian layer to capture this activity; train second RBF on the data (stacking), reducing number of clusters.

Spying on networksSpying on networksSpying on networksSpying on networksAfter initial transformation, what still needs to be done?

Conclusion: separability in the hidden space is perhaps too much to desire ... rules, similarity or linear separation, depending on the case.

Parity n=9Parity n=9Parity n=9Parity n=9

Simple gradient learning; quality index shown below.

Biological justificationBiological justificationBiological justificationBiological justification• Cortical columns may learn to respond to stimuli with complex logic

resonating in different way.

• The second column will learn without problems that such different reactions have the same meaning: inputs xi and training targets yj. are same => Hebbian learning Wij ~ xi yj => identical weights.

• Effect: same line y=W.X projection, but inhibition turns off one perceptron when the other is active.

• Simplest solution: oscillators based on combination of two neurons (W.X-b) – (W.X-b’) give localized projections!

• We have used them in MLP2LN architecture for extraction of logical rules from data.

• Note: k-sep. learning is not a multistep output neuron, targets are not known, same class vectors may appear in different intervals!

• We need to learn how to find intervals and how to assign them to classes; new algorithms are needed to learn it!

towards science of dm włodzisław duch department of informatics, nicolaus copernicus university,...

Documents

neural networks

data features

global optimization

data mining tools

linear svm

linear projection

evolutionary algorithms

universal approximators