
Machine Learning: Neural Networks, Support Vector Machines

Georg Dorffner

Section for Artificial Intelligence and Decision Support

CeMSIIS – Medical University of Vienna

2

Neural networks: The simple mathematical model

• Propagation rule:

– Weighted sum

– Euclidean distance

• Transfer function f:

– Threshold function (McCulloch & Pitts)

– Linear fct.

– Sigmoid fct.

[Figure: a single unit (neuron) with inputs $x_i$, weights $w_1, w_2, \dots, w_i$, net input $y_j$ and activation/output $x_j = f(y_j)$]

Net input (weighted sum): $y_j = \sum_{i=1}^{n} w_{ij}\, x_i$

Activation: $x_j = f(y_j) = f\left(\sum_{i=1}^{n} w_{ij}\, x_i\right)$
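A minimal numerical sketch of this unit model (NumPy; the weights, inputs and the choice of transfer function are purely illustrative):

```python
import numpy as np

def unit_output(x, w, transfer="sigmoid"):
    """One unit: net input = weighted sum, activation = f(net input)."""
    net = np.dot(w, x)                     # y_j = sum_i w_ij * x_i
    if transfer == "threshold":            # McCulloch & Pitts
        return float(net > 0)
    if transfer == "linear":
        return net
    return 1.0 / (1.0 + np.exp(-net))      # sigmoid

x = np.array([0.5, -1.0, 2.0])             # example inputs (arbitrary)
w = np.array([0.1, 0.4, -0.3])             # example weights (arbitrary)
print(unit_output(x, w))                   # activation of the unit
```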

3

Perceptron as neural network

• Inputs are random "feature" detectors

• Binary codes

• Perceptron learns classification

• Learning rule = weight adaptation

• Model of perception / object recognition

• But can solve only linearly separable problems

[Figure: perceptron with random feature detectors; source: Neuron.eng.wayne.edu]

Learning rule ($t_j$ = "target"):

$\Delta w_{ij} = \begin{cases} x_i & \text{if } x_j = 0 \text{ and } t_j = 1 \\ -x_i & \text{if } x_j = 1 \text{ and } t_j = 0 \\ 0 & \text{otherwise} \end{cases}$
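A small sketch of this rule, assuming binary 0/1 outputs and targets and a learning rate eta; the AND data set is only an example of a linearly separable problem:

```python
import numpy as np

def perceptron_train(X, t, eta=0.1, epochs=20):
    """Threshold unit with bias; weights change only on misclassified patterns."""
    Xb = np.hstack([X, np.ones((len(X), 1))])    # append constant input 1 for the bias
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x_i, t_i in zip(Xb, t):
            out = float(np.dot(w, x_i) > 0)      # threshold transfer function
            w += eta * (t_i - out) * x_i         # +eta*x_i, -eta*x_i or 0, as in the rule above
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # logical AND: linearly separable
t = np.array([0, 0, 0, 1])
print(perceptron_train(X, t))
```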

4

Multilayer perceptron (MLP)

• 2 (or more) layers (= connections)

Input units

Hidden units (typically sigmoid)

Output units (typically linear)

5

Learning rule (weight adaptation): Backpropagation

• Generalised delta rule

Weight update: $\Delta w_{ij} = \eta\, \delta_j\, x_i$

Output units: $\delta_j^{\mathrm{out}} = f'(y_j^{\mathrm{out}})\,(t_j - x_j^{\mathrm{out}})$

Hidden units: $\delta_j^{\mathrm{hid}} = f'(y_j^{\mathrm{hid}}) \sum_{k=1}^{n} \delta_k^{\mathrm{out}}\, w_{kj}^{\mathrm{out}}$

[Figure: two-layer network with weights $W^{\mathrm{hid}}$ and $W^{\mathrm{out}}$, hidden values $y^{\mathrm{hid}}, x^{\mathrm{hid}}$ and output values $y^{\mathrm{out}}, x^{\mathrm{out}}$]

• Error is being propagated back

• "Pseudo-error" for the hidden units
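A sketch of one such update for a single training pattern, assuming sigmoid hidden units and linear output units as in the MLP slide; the network sizes, weights and learning rate are illustrative:

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def backprop_step(x, t, W_hid, W_out, eta=0.05):
    """One generalized-delta-rule update for a 2-layer MLP (sigmoid hidden, linear output)."""
    # forward pass
    y_hid = W_hid @ x                                        # net inputs of hidden units
    x_hid = sigmoid(y_hid)                                   # hidden activations
    x_out = W_out @ x_hid                                    # linear outputs: f(y) = y
    # backward pass: error and "pseudo-error"
    delta_out = (t - x_out)                                  # f'(y) = 1 for linear outputs
    delta_hid = x_hid * (1 - x_hid) * (W_out.T @ delta_out)  # sigmoid derivative via x_hid
    # weight updates: Delta w_ij = eta * delta_j * x_i
    W_out += eta * np.outer(delta_out, x_hid)
    W_hid += eta * np.outer(delta_hid, x)
    return W_hid, W_out

rng = np.random.default_rng(0)
W_hid = rng.normal(scale=0.5, size=(3, 2))   # 2 inputs, 3 hidden units
W_out = rng.normal(scale=0.5, size=(1, 3))   # 1 output unit
backprop_step(np.array([0.2, -0.7]), np.array([1.0]), W_hid, W_out)
```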

6

Backpropagation as gradient descent

• Define (quadratic) error (for pattern l):

• Minimize error

• Change weights in the direction of the gradient

• Chain rule leads to backpropagation

$E_l = \sum_{k=1}^{m} \left(t_k - x_k^{\mathrm{out}}\right)^2$

$\Delta w_{ij} = -\eta\, \frac{\partial E_l}{\partial w_{ij}}$ (partial derivative by weight)
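One way to see that backpropagation really is gradient descent on this error is a finite-difference check of the gradient; the single linear unit and the data below are arbitrary stand-ins:

```python
import numpy as np

def quad_error(w, x, t):
    """E = sum_k (t_k - x_k_out)^2 for a single linear unit (illustration only)."""
    return float(np.sum((t - w @ x) ** 2))

def numerical_gradient(w, x, t, eps=1e-6):
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        grad.flat[i] = (quad_error(w_plus, x, t) - quad_error(w_minus, x, t)) / (2 * eps)
    return grad

w = np.array([[0.3, -0.2]])
x = np.array([1.0, 2.0])
t = np.array([0.5])
analytic = -2 * np.outer(t - w @ x, x)   # chain rule: dE/dw_ij
print(np.allclose(numerical_gradient(w, x, t), analytic, atol=1e-5))   # True
```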

7

Limits of backpropagation

• Gradient descent can get stuck in a local minimum (depends on the initial values)

• It is not guaranteed that backpropagation finds an existing solution

• Further problems: slow, can oscillate

• Solution: conjugate gradient, quasi-Newton

8

The power of NN: Arbitrary classifications

• Each hidden unit separates space into 2 halves (perceptron)

• Output units work like “AND”

• Sigmoids: smooth transitions


9

General approach: discriminant analysis

• Linear discriminant function:

$g(\mathbf{x}) = \sum_{i=1}^{n} w_i x_i + w_0$

corresponds to the perceptron with 1 output unit per class

• Quadratic, still linear in the parameters:

$g(\mathbf{x}) = \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{p} \sum_{j=1}^{p} v_{ij} x_i x_j + w_0$

corresponds to a "preprocessing" of the data; the parameters (w, v) still enter linearly


10

The step to the neural network

• General linear model:

$g(\mathbf{x}) = \sum_{i=1}^{p} w_i\, y_i(\mathbf{x}) + w_0$

arbitrary preprocessing functions $y_i$, combined linearly

• Neural network:

$y_i = f\left(\mathbf{w}_i^T \mathbf{x}\right)$ (sigmoids … MLP)

$y_i = f\left(\lVert\mathbf{x} - \mathbf{w}_i\rVert\right)$ (Gaussians … RBFN)

the NN implements adaptive preprocessing, nonlinear in the parameters (w)
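A small sketch contrasting the two kinds of hidden unit named above: a sigmoid of a projection (MLP) versus a Gaussian of a distance (RBFN); the vectors and the width are arbitrary:

```python
import numpy as np

def mlp_basis(x, w):
    """Sigmoid of an inner product: y_i = f(w_i^T x)."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def rbf_basis(x, w, width=1.0):
    """Gaussian of a distance: y_i = f(||x - w_i||)."""
    return np.exp(-np.sum((x - w) ** 2) / (2 * width ** 2))

x = np.array([0.5, 1.5])
w = np.array([1.0, -0.5])          # weight vector / RBF centre (illustrative)
print(mlp_basis(x, w), rbf_basis(x, w))
```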

11

MLP to produce probabilities

• MLP can approximate the Bayes posterior

• Activation function: Softmax

• Prior probabilities: distribution in the training set

Softmax: $x_j = \frac{\exp(y_j)}{\sum_{i=1}^{k} \exp(y_i)}$

Bayes posterior: $P(c_j|\mathbf{x}) = \frac{p(\mathbf{x}|c_j)\, P(c_j)}{p(\mathbf{x})}$
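A sketch of the softmax activation (with the usual subtraction of the maximum for numerical stability, which leaves the result unchanged); the net inputs are arbitrary:

```python
import numpy as np

def softmax(y):
    """x_j = exp(y_j) / sum_i exp(y_i); the outputs sum to 1."""
    e = np.exp(y - np.max(y))          # numerical stability only
    return e / np.sum(e)

y_out = np.array([2.0, 0.5, -1.0])     # net inputs of the output units (illustrative)
print(softmax(y_out), softmax(y_out).sum())
```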


12

Regression

• To model the data generator: estimate joint distribution

• Likelihood:

$p(\mathbf{x}, t) = p(t|\mathbf{x})\, p(\mathbf{x})$

$L = \prod_{i=1}^{n} p(t_i|\mathbf{x}_i)\, p(\mathbf{x}_i)$

distribution with expected value $f(\mathbf{x}_i)$

13

Gaussian noise

• Likelihood:

$L = \prod_{i=1}^{n} p(t_i|\mathbf{x}_i) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(t_i - f(\mathbf{x}_i;\mathbf{W})\right)^2}{2\sigma^2}\right)$

• Maximizing L = minimizing -log L (constant terms can be dropped, incl. p(x))

• Corresponds to the quadratic error (see backpropagation):

$E = \sum_{i=1}^{n} \left(t_i - f(\mathbf{x}_i;\mathbf{W})\right)^2$
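The sketch below checks numerically that the negative log-likelihood under Gaussian noise and the quadratic error prefer the same model (they differ only by constants and a positive factor); targets and predictions are made up:

```python
import numpy as np

def neg_log_likelihood(t, f, sigma=1.0):
    """-log L for Gaussian noise with constant variance sigma."""
    n = len(t)
    return n * np.log(np.sqrt(2 * np.pi) * sigma) + np.sum((t - f) ** 2) / (2 * sigma ** 2)

def quadratic_error(t, f):
    return np.sum((t - f) ** 2)

t = np.array([1.0, 0.5, -0.2])          # targets (illustrative)
f_good = np.array([0.9, 0.6, -0.1])      # predictions of a better model
f_bad = np.array([0.2, 1.5, 0.8])        # predictions of a worse model
print(neg_log_likelihood(t, f_good) < neg_log_likelihood(t, f_bad),
      quadratic_error(t, f_good) < quadratic_error(t, f_bad))   # both True
```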


14

Training as maximum likelihood

• Minimizing the quadratic error is maximum likelihood under the assumptions:

– the error at each point is normally distributed, ~N(0, σ)

– the variance of this distribution is constant

• Variance of the error (of the noise):

$\sigma^2_{\mathrm{opt}} = \frac{1}{n} \sum_{i=1}^{n} \left(t_i - f(\mathbf{x}_i;\mathbf{W})\right)^2 = \frac{1}{n} E_{\min}$ (remaining normalized error)

• But: this does not have to hold! Extensions are possible (noise model)


15

Classification as regression

• The MLP should approximate the posterior: $x^{\mathrm{out}} = P(c|\mathbf{x}^{\mathrm{in}})$

• The distribution of the targets is not a normal distribution

• Bernoulli distribution:

$L = \prod_{i=1}^{n} \left(x_i^{\mathrm{out}}\right)^{t_i} \left(1 - x_i^{\mathrm{out}}\right)^{1 - t_i}$

• Negative log-likelihood:

$E = -\sum_{i=1}^{n} \left[t_i \log x_i^{\mathrm{out}} + (1 - t_i) \log\left(1 - x_i^{\mathrm{out}}\right)\right]$

• "Cross-entropy error" (for 2 classes; generalizable to n classes)
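A sketch of this error for the two-class case, with the outputs clipped away from 0 and 1 to avoid log(0); labels and outputs are illustrative:

```python
import numpy as np

def cross_entropy(t, x_out, eps=1e-12):
    """E = -sum_i [t_i log x_i + (1 - t_i) log(1 - x_i)], with x_out = P(c | x_in)."""
    x = np.clip(x_out, eps, 1 - eps)         # avoid log(0)
    return -np.sum(t * np.log(x) + (1 - t) * np.log(1 - x))

t = np.array([1, 0, 1, 1])                   # class labels
x_out = np.array([0.9, 0.2, 0.7, 0.6])       # sigmoid outputs (illustrative)
print(cross_entropy(t, x_out))
```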


16

Optimal pairings: transfer function (at the output) + error function

• Regression:

– Linear + summed quadratic error

• Classification (discriminant function):

– Linear + summed quadratic error

• Classification (Bayes posterior):

– Softmax + cross-entropy error

– 2 classes, 1 output: sigmoid + cross-entropy

Gradient of the error function

• Backpropagation (after Bishop 1995): efficient computation of the gradient (contribution of the network): O(W) instead of O(W²), see p. 146f

• It is independent of the chosen error function:

$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial x^{\mathrm{out}}} \cdot \frac{\partial x^{\mathrm{out}}}{\partial w_i}$

(contribution of the error function · contribution of the network)

• Optimization is based on gradient information: good minimization methods

• Gradient descent ("backprop") vs. conjugate gradient [figure]

18

19

MLP as universal function approximator

• E.g: 1 Input, 1 Output, 5 Hidden

• MLP can approximate arbitrary functions (Hornik et al. 1990)

• Through superposition of sigmoids

• Complexity by combining simple elements

$x_k^{\mathrm{out}} = g_k(\mathbf{x}^{\mathrm{in}}) = \sum_{j=1}^{n} w_{jk}^{\mathrm{out}} f\left(\sum_{i=1}^{m} w_{ij}^{\mathrm{hid}} x_i^{\mathrm{in}} + w_{0j}^{\mathrm{hid}}\right) + w_{0k}^{\mathrm{out}}$

[Figure: the bias moves the sigmoid, the weights stretch and mirror it]
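A sketch of this formula for 1 input, 5 hidden units and 1 output: the output is a weighted superposition of shifted and stretched sigmoids; all weights are random and only illustrate the shape of such a function:

```python
import numpy as np

def mlp_1_5_1(x_in, w_hid, b_hid, w_out, b_out):
    """x_out = sum_j w_out_j * sigmoid(w_hid_j * x_in + b_hid_j) + b_out."""
    hidden = 1.0 / (1.0 + np.exp(-(w_hid * x_in + b_hid)))   # 5 sigmoid hidden units
    return np.dot(w_out, hidden) + b_out

rng = np.random.default_rng(1)
w_hid, b_hid = rng.normal(size=5), rng.normal(size=5)   # stretch/mirror and shift
w_out, b_out = rng.normal(size=5), rng.normal()
xs = np.linspace(-3, 3, 7)
print([round(mlp_1_5_1(x, w_hid, b_hid, w_out, b_out), 3) for x in xs])
```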

20

Overfitting

• If there are too few training data, the NN tries to model the noise

• Overfitting: worse performance on new data (the quadratic error becomes larger)

[Figure: fit with 50 samples and 15 hidden units]

21

Avoiding overfitting

• As much data as possible(good coverage of distribution)

• Model (network) as small as possible

• More generally: regularisation (= limit the effective number of degrees of freedom):

– Several training runs, average

– Penalty for large networks, e.g.:

– "Pruning" (remove connections)

– Early stopping

$E' = E + \nu \sum_{i=1}^{N} w_i^2$
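A sketch of this penalty (weight decay): the regularized error adds a coefficient (called nu here, following the formula above) times the sum of squared weights to the data error; the numbers are arbitrary:

```python
import numpy as np

def regularized_error(data_error, weights, nu=0.01):
    """E' = E + nu * sum_i w_i**2  (penalizes large weights / large networks)."""
    return data_error + nu * np.sum(weights ** 2)

weights = np.array([0.5, -2.0, 1.5, 0.1])
print(regularized_error(data_error=0.8, weights=weights))
```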

22

The important steps in practice

Owing to their power and characteristics, neural networks require a sound and careful strategy:

1. Data inspection (visualisation)

2. Data preprocessing (e.g. normalization to zero mean and unit variance)

3. Feature selection

4. Model selection (pick best network size)

5. Comparison with simpler methods

6. Testing on independent data

7. Interpretation of results

23

Model selection

• Strategy for the optimal choice of model complexity (see the sketch after this list):

– Start small (e.g. 1 or 2 hidden units)

– n-fold cross-validation

– Add hidden units one by one

– Accept as long as there is a significant improvement (test)

• No regularization necessary: overfitting is captured by cross-validation (averaging)

• Too many hidden units → too large a variance → no statistical significance

• The same method can also be used for feature selection (“wrapper”)
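A sketch of this selection loop, using scikit-learn's MLPClassifier and cross_val_score as stand-ins (the library choice, the synthetic data and the acceptance threshold are assumptions, not the author's setup): hidden units are added one by one and a larger network is kept only while the cross-validated score clearly improves:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

best_h, best_score = 1, -np.inf
for h in range(1, 11):                                   # add hidden units one by one
    model = MLPClassifier(hidden_layer_sizes=(h,), max_iter=2000, random_state=0)
    score = cross_val_score(model, X, y, cv=5).mean()    # n-fold cross-validation
    if score > best_score + 0.005:                       # crude stand-in for a significance test
        best_h, best_score = h, score
    else:
        break                                            # stop when there is no clear improvement
print(best_h, round(best_score, 3))
```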

24

Support Vector Machines: Returning to the perceptron

• Advantage of the (linear) perceptron:

– Global solution guaranteed (no local minima)

– Easy to solve / optimize

• Disadvantage:

– Restricted to linear separability

• Idea:

– Transformation of the data to a high-dimensional space, such that the problem becomes linearly separable

25

Mathematical formulation of perceptron learning rule

• Perceptron (1 Output):

• ti = +1/-1:

• Data is described in terms of inner products („dual form“)

Perceptron: $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$

Weights as a sum over the data: $\mathbf{w} = \sum_i \alpha_i t_i \mathbf{x}_i$

Dual form: $f(\mathbf{x}) = \sum_i \alpha_i t_i\, \mathbf{x}_i^T \mathbf{x} + w_0$ (inner product / dot product)

26

Kernels

• The goal is a transformation $\mathbf{x}_i \to \Phi(\mathbf{x}_i)$, such that the problem becomes linearly separable (can be high-dimensional)

• Kernel: a function that can be written as an inner product of Φs:

• Φ does not have to be explicitly known

$K(\mathbf{x}_1, \mathbf{x}_2) = \Phi(\mathbf{x}_1)^T\, \Phi(\mathbf{x}_2)$

$f(\mathbf{x}) = \sum_i \alpha_i t_i\, K(\mathbf{x}_i, \mathbf{x}) + w_0$

27

Example: polynomial kernel

• 2 dimensions:

• The kernel is indeed an inner product of vectors after a transformation ("preprocessing")

$K(\mathbf{x}, \mathbf{z}) = \left(\mathbf{x}^T \mathbf{z}\right)^2$

$\left(x_1 z_1 + x_2 z_2\right)^2 = x_1^2 z_1^2 + 2\, x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = \left(x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2\right) \left(z_1^2,\ \sqrt{2}\, z_1 z_2,\ z_2^2\right)^T = \Phi(\mathbf{x})^T\, \Phi(\mathbf{z})$
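A numerical check of this identity: for two 2-dimensional vectors, (x^T z)^2 equals the inner product of the explicitly transformed vectors; the vectors are arbitrary:

```python
import numpy as np

def phi(v):
    """Explicit feature map for the 2-D polynomial kernel of degree 2."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def poly_kernel(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z), np.dot(phi(x), phi(z)))   # both give the same value
```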

28

The effect of the "kernel trick"

• Use of the kernel, e.g:

• 16×16 pixel images, i.e. 256-dimensional vectors; 5th-degree polynomial: dimension ≈ 10^10

– inner product of two 10,000,000,000-dimensional vectors

• The calculation is done in the low-dimensional space:

– inner product of two 256-dimensional vectors

– to the power of 5

$f(\mathbf{x}) = \sum_i \alpha_i t_i\, K(\mathbf{x}_i, \mathbf{x}) + w_0 = \sum_i \alpha_i t_i \left(\mathbf{x}_i^T \mathbf{x}\right)^5 + w_0$
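A sketch of the cost argument: the degree-5 polynomial kernel on 256-dimensional vectors is one 256-dimensional inner product followed by a 5th power, never forming the ~10^10-dimensional feature vectors; the random vectors stand in for pixel images:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=256)     # stands in for a 16x16 pixel image
z = rng.normal(size=256)

k = np.dot(x, z) ** 5        # kernel trick: cheap inner product, then a power of 5
print(k)
```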

29

Large margin classifier

• High-dimensional space: overfitting is easily possible

• Solution: search for the decision border (hyperplane) with the largest distance to the closest points

[Figure: separating hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ with maximal distance to the closest points; margin hyperplanes $\mathbf{w}^T\mathbf{x} + b = +1$ and $\mathbf{w}^T\mathbf{x} + b = -1$; margin width $d = \frac{2}{\lVert\mathbf{w}\rVert}$]

• Optimization: minimize $\lVert\mathbf{w}\rVert^2$ (maximize $d = \frac{2}{\lVert\mathbf{w}\rVert}$)

Boundary condition: $t_i\left(\mathbf{w}^T\mathbf{x}_i + b\right) - 1 \geq 0$

30

Optimization of large margin classifier

• Quadratic optimization problem, Lagrange multiplier approach, leads to:

• "Dual" form

• Important: Data is again denoted in terms of inner products

• Kernel trick can be used again

$L_D = \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j t_i t_j\, \mathbf{x}_i^T \mathbf{x}_j - \sum_i \alpha_i \rightarrow \min$

$L_D = \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j t_i t_j\, K(\mathbf{x}_i, \mathbf{x}_j) - \sum_i \alpha_i \rightarrow \min$

31

Support Vectors

• Support vectors: points at the margin (closest to the decision border)

• They determine the solution; all other points could be omitted

[Figure: nonlinear decision border in input space produced by the kernel function; back-projection of the support vectors]

32

Summary

• Neural networks are powerful machine learners for numerical features, initially inspired by neurophysiology

• Nonlinearity through interplay of simpler learners (perceptrons)

• Statistical/probabilistic framework most appropriate

• Learning = maximum likelihood, minimizing an error function with an efficient gradient-based method (e.g. conjugate gradient)

• Power comes with downsides (overfitting) -> careful validation necessary

• Support vector machines are an interesting alternative; they simplify the learning problem through the "kernel trick"
