
Page 1: Machine Learning

Machine Learning

Central Problem of Pattern Recognition: Supervised and Unsupervised Learning

Classification
Bayesian Decision Theory
Perceptrons and SVMs
Clustering

Visual Computing: Joachim M. Buhmann — Machine Learning 143/196

Page 2: Machine Learning

Machine Learning – What is the Challenge? Find optimal structure in data and validate it!

March 2006, Joachim M. Buhmann / Institute for Computational Science

Concept for Robust Data Analysis

(Diagram: Data (vectors, relations, images, ...) feed a structure definition (costs, risk, ...); structure optimization (multiscale analysis, stochastic approximation) and structure validation (statistical learning theory) are coupled by feedback loops; quantization of the solution space (information/rate distortion theory) and regularization of statistical and computational complexity support the process.)

Visual Computing: Joachim M. Buhmann — Machine Learning 144/196

Page 3: Machine Learning

The Problem of Pattern Recognition

Machine Learning (as statistics) addresses a number of challenging inference problems in pattern recognition which span the range from statistical modeling to efficient algorithmics. Approximative methods which yield good performance on average are particularly important.

• Representation of objects ⇒ Data representation

• What is a pattern? Definition/modeling of structure.

• Optimization: search for preferred structures

• Validation: are the structures indeed in the data, or are they explained by fluctuations?

Visual Computing: Joachim M. Buhmann — Machine Learning 145/196

Page 4: Machine Learning

Literature

• Richard O. Duda, Peter E. Hart & David G. Stork, Pattern Classification. Wiley & Sons (2001)

• Trevor Hastie, Robert Tibshirani & Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag (2001)

• Luc Devroye, László Györfi & Gábor Lugosi, A Probabilistic Theory of Pattern Recognition. Springer Verlag (1996)

• Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data. Springer Verlag (1983); The Nature of Statistical Learning Theory. Springer Verlag (1995)

• Larry Wasserman, All of Statistics. (1st ed. 2004, corr. 2nd printing, ISBN 0-387-40272-1) Springer Verlag (2004)

Visual Computing: Joachim M. Buhmann — Machine Learning 146/196

Page 5: Machine Learning

The Classification Problem

Visual Computing: Joachim M. Buhmann — Machine Learning 147/196

Page 6: Machine Learning

Visual Computing: Joachim M. Buhmann — Machine Learning 148/196

Page 7: Machine Learning

Classification as a Pattern Recognition Problem

Problem: We look for a partition of the object space O (fish in the previous example) which corresponds to the classification examples. Distinguish conceptually between “objects” o ∈ O and “data” x ∈ X!

Data: pairs of feature vectors and class labels, Z = {(xi, yi) : 1 ≤ i ≤ n}, xi ∈ R^d, yi ∈ {1, . . . , k}

Definitions: feature space X with xi ∈ X ⊂ R^d

class labels yi ∈ {1, . . . , k}

Classifier: mapping c : X → {1, . . . , k}

k-class problem: What is y_{n+1} ∈ {1, . . . , k} for x_{n+1} ∈ R^d?
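To make the definitions above concrete, here is a minimal sketch (not from the lecture) of a classifier c : X → {1, . . . , k}; the nearest-class-mean rule and the toy Gaussian data are illustrative assumptions only.

```python
import numpy as np

# Minimal sketch: a nearest-class-mean classifier c : X -> {1, ..., k}
# fit on labeled pairs Z = {(x_i, y_i)}.
class NearestMeanClassifier:
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # assign each x_{n+1} to the class with the closest class mean
        d = np.linalg.norm(X[:, None, :] - self.means_[None, :, :], axis=2)
        return self.classes_[np.argmin(d, axis=1)]

# toy usage with two Gaussian classes in R^2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([1] * 50 + [2] * 50)
clf = NearestMeanClassifier().fit(X, y)
print(clf.predict(np.array([[0.2, -0.1], [2.8, 3.1]])))  # -> [1 2]
```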

Visual Computing: Joachim M. Buhmann — Machine Learning 149/196

Page 8: Machine Learning

Example of Classification

Visual Computing: Joachim M. Buhmann — Machine Learning 150/196

Page 9: Machine Learning

Histograms of Length Values

FIGURE 1.2. Histograms for the length feature for the two categories. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l∗ will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
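The claim that l∗ minimizes the average number of errors can be checked numerically. The sketch below (my own illustration with made-up Gaussian length samples, not the book's data) scans candidate thresholds and keeps the one with the fewest training errors.

```python
import numpy as np

# Illustrative only: synthetic length samples for the two fish classes.
rng = np.random.default_rng(1)
salmon = rng.normal(5.0, 1.5, 1000)     # assumed salmon lengths
sea_bass = rng.normal(10.0, 2.5, 1000)  # assumed sea bass lengths

# Decision rule "sea bass if length > l": count errors for each candidate l.
candidates = np.linspace(0, 20, 401)
errors = [np.sum(salmon > l) + np.sum(sea_bass <= l) for l in candidates]
l_star = candidates[int(np.argmin(errors))]
print(f"l* ≈ {l_star:.2f}, training errors: {min(errors)} of {salmon.size + sea_bass.size}")
```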

Visual Computing: Joachim M. Buhmann — Machine Learning 151/196

Page 10: Machine Learning

Histograms of Skin Brightness Values

FIGURE 1.3. Histograms for the lightness feature for the two categories. No single threshold value x∗ (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x∗ marked will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Visual Computing: Joachim M. Buhmann — Machine Learning 152/196

Page 11: Machine Learning

Linear Classification

FIGURE 1.4. The two features of lightness and width for sea bass and salmon. The dark line could serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Visual Computing: Joachim M. Buhmann — Machine Learning 153/196

Page 12: Machine Learning

Overfitting

FIGURE 1.5. Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be classified as a sea bass. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Visual Computing: Joachim M. Buhmann — Machine Learning 154/196

Page 13: Machine Learning

Optimized Non-Linear Classification

FIGURE 1.6. The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier, thereby giving the highest accuracy on new patterns. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Occam’s razor argument: Entia non sunt multiplicanda praeter necessitatem! (Entities must not be multiplied beyond necessity.)

Visual Computing: Joachim M. Buhmann — Machine Learning 155/196

Page 14: Machine Learning

Regression (see Introduction to Machine Learning)

Question: Given a feature (vector) xi and a corresponding noisy measurement of a function value yi = f(xi) + noise, what is the unknown function f(·) in a hypothesis class H?

Data: Z = {(xi, yi) ∈ R^d × R : 1 ≤ i ≤ n}

Modeling choice: What is an adequate hypothesis class and a good noise model? Fitting with linear/nonlinear functions?

Visual Computing: Joachim M. Buhmann — Machine Learning 156/196

Page 15: Machine Learning

The Regression Function

Questions: (i) What is the statistically optimal estimate of a function f : R^d → R, and (ii) which algorithm achieves this goal most efficiently?

Solution to (i): the regression function

$$y(x) = \mathbb{E}[y \mid X = x] = \int_\Omega y \, p(y \mid X = x)\, dy$$

Figure: nonlinear regression of a sinc function, sinc(x) := sin(x)/x (gray), with a regression fit (black) based on 50 noisy data points.
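As a rough illustration of such a fit (the slide does not specify the method behind its figure), the sketch below smooths noisy sinc samples with a Nadaraya-Watson kernel regressor; the bandwidth, noise level and sample size are assumptions.

```python
import numpy as np

# Illustrative Nadaraya-Watson kernel regression on noisy sinc data.
rng = np.random.default_rng(2)
x_train = rng.uniform(-10, 10, 50)
y_train = np.sinc(x_train / np.pi) + rng.normal(0, 0.1, 50)  # np.sinc(t) = sin(pi t)/(pi t)

def kernel_regression(x_query, x_train, y_train, bandwidth=1.0):
    # weighted average of targets with Gaussian kernel weights
    w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w @ y_train) / w.sum(axis=1)

x_grid = np.linspace(-10, 10, 200)
y_hat = kernel_regression(x_grid, x_train, y_train)
print(y_hat[:5])  # estimated regression function values on the grid
```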

Visual Computing: Joachim M. Buhmann — Machine Learning 157/196

Page 16: Machine Learning

Examples of linear and nonlinear regression

(Figure panels: linear regression on the left, nonlinear regression on the right.)

How should we measure the deviations?

(Figure panels: vertical offsets on the left, perpendicular offsets on the right.)
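The two offset notions lead to different estimators. As a hedged illustration (my own, not from the slides), the sketch below fits a line by ordinary least squares, which minimizes vertical offsets, and by total least squares via an SVD, which minimizes perpendicular offsets.

```python
import numpy as np

# Fit y ≈ a*x + b two ways on the same noisy data.
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 100)

# Ordinary least squares: minimizes vertical offsets.
a_ols, b_ols = np.polyfit(x, y, 1)

# Total least squares: minimizes perpendicular offsets
# (the smallest principal direction of the centered data is the line normal).
X = np.column_stack([x - x.mean(), y - y.mean()])
_, _, Vt = np.linalg.svd(X, full_matrices=False)
nx, ny = Vt[-1]                 # normal vector of the fitted line
a_tls = -nx / ny
b_tls = y.mean() - a_tls * x.mean()

print(f"OLS: a={a_ols:.3f}, b={b_ols:.3f};  TLS: a={a_tls:.3f}, b={b_tls:.3f}")
```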

Visual Computing: Joachim M. Buhmann — Machine Learning 158/196

Page 17: Machine Learning

Core Questions of Pattern Recognition: Unsupervised Learning

No teacher signal is available for the learning algorithm; learning is guided by a general cost/risk function.

Examples for unsupervised learning

1. data clustering, vector quantization: as in classification we search for a partitioning of objects into groups, but explicit labelings are not available (see the sketch below).

2. hierarchical data analysis; search for tree structures in data

3. visualisation, dimension reduction

Semisupervised learning: some of the data are labeled, most of them are unlabeled.
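For item 1, here is a minimal k-means sketch (a Lloyd-style iteration on toy 2-D data; my own illustration, not an algorithm prescribed by the lecture).

```python
import numpy as np

# Minimal k-means sketch for clustering / vector quantization.
def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest prototype (vector quantization step)
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 0.5, (100, 2)) for m in [(0, 0), (4, 0), (2, 3)]])
labels, centers = kmeans(X, k=3)
print(centers)
```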

Visual Computing: Joachim M. Buhmann — Machine Learning 159/196

Page 18: Machine Learning

Modes of Learning

Reinforcement Learning: weakly supervised learning. Action chains are evaluated at the end. Backgammon: the neural network TD-Gammon gained the world championship! Quite popular in robotics.

Active Learning: data are selected according to their expected information gain. Information filtering.

Inductive Learning: the learning algorithm extracts logical rules from the data. Inductive Logic Programming is a popular subarea of Artificial Intelligence.

Visual Computing: Joachim M. Buhmann — Machine Learning 160/196

Page 19: Machine Learning

Vectorial Data

(Scatter plot: two-dimensional projection of the data; the points of the 20 sources are labeled A through T.)

Data of 20 Gaussian sources in R^20, projected onto two dimensions with Principal Component Analysis.
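A hedged sketch of this kind of projection (synthetic sources with assumed means and unit variances, not the original data), computing PCA from the SVD of the centered data matrix:

```python
import numpy as np

# Illustrative PCA projection: 20 Gaussian sources in R^20 projected to 2-D.
rng = np.random.default_rng(5)
k, d, n_per = 20, 20, 10
means = rng.normal(0, 3, (k, d))
X = np.vstack([rng.normal(means[j], 1.0, (n_per, d)) for j in range(k)])

# PCA via SVD of the centered data matrix; keep the two leading components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T          # n x 2 projection onto the principal axes
print(X2.shape)             # (200, 2)
```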

Visual Computing: Joachim M. Buhmann — Machine Learning 161/196

Page 20: Machine Learning

Relational Data

Pairwise dissimilarity of 145 globins which have been selected from 4 classes: α-globins, β-globins, myoglobins, and globins of insects and plants.

Visual Computing: Joachim M. Buhmann — Machine Learning 162/196

Page 21: Machine Learning

Scales for Data

Nominal or categorical scale: qualitative, but without quantitative measurements, e.g. a binary scale F = {0, 1} (presence or absence of properties like “kosher”) or the taste categories “sweet, sour, salty and bitter”.

Ordinal scale: measurement values are meaningful only with respect to other measurements, i.e., the rank order of measurements carries the information, not the numerical differences (e.g. information on the ranking of different marathon races).

Visual Computing: Joachim M. Buhmann — Machine Learning 163/196

Page 22: Machine Learning

Quantitative scale:

• Interval scale: the relation of numerical differences carries the information; invariance w.r.t. translation and scaling (Fahrenheit scale of temperature).

• Ratio scale: the zero value of the scale carries information, but not the measurement unit (Kelvin scale).

• Absolute scale: absolute values are meaningful (grades of final exams).

Visual Computing: Joachim M. Buhmann — Machine Learning 164/196

Page 23: Machine Learning

Machine Learning: Topic Chart

• Core problems of pattern recognition

• Bayesian decision theory

• Perceptrons and Support vector machines

• Data clustering

Visual Computing: Joachim M. Buhmann — Machine Learning 165/196

Page 24: Machine Learning

Bayesian Decision Theory: The Problem of Statistical Decisions

Task: n objects have to be partitioned into the classes 1, . . . , k, the doubt class D and the outlier class O.

D: doubt class (→ new measurements required)
O: outlier class, definitively none of the classes 1, 2, . . . , k

Objects are characterized by feature vectors X ∈ X, X ∼ P(X), with the probability P(X = x) of feature values x.

Statistical modeling: objects represented by data X and classes Y are considered to be random variables, i.e., (X, Y) ∼ P(X, Y). Conceptually, it is not mandatory to consider class labels as random since they might be induced by legal considerations or conventions.

Visual Computing: Joachim M. Buhmann — Machine Learning 166/196

Page 25: Machine Learning

Structure of the feature space X

• X ⊂ Rd

• X = X1 × X2 × · · · × Xd with Xi ⊆ R or Xi finite.

Remark: in most situations we can define the feature space as a subset of R^d or as tuples of real, categorical (B = {0, 1}) or ordinal numbers. Sometimes we have more complicated data spaces composed of lists, trees or graphs.

Class density / likelihood: py(x) := P(X = x | Y = y) is the probability of a feature value x given a class y.

Parametric statistics: estimate the parameters of the class densities py(x).

Non-parametric statistics: minimize the empirical risk.

Visual Computing: Joachim M. Buhmann — Machine Learning 167/196

Page 26: Machine Learning

Motivation of Classification

Given are labeled data Z = {(xi, yi) : i ≤ n}.

Questions:

1. What are the class boundaries?

2. What are the class-specific densities py(x)?

3. How many modes or parameters do we need to model py(x)?

4. ...

Figure: quadratic SVM classifier for five classes. White areas are ambiguous regions.

Visual Computing: Joachim M. Buhmann — Machine Learning 168/196

Page 27: Machine Learning

Thomas Bayes and his Terminology

The State of Nature is modelled as a random variable!

prior: P(model)

likelihood: P(data | model)

posterior: P(model | data)

evidence: P(data)

Bayes rule:

$$P(\text{model} \mid \text{data}) = \frac{P(\text{data} \mid \text{model})\, P(\text{model})}{P(\text{data})}$$
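A tiny numeric illustration of this rule (the priors and likelihoods below are toy values of my own choosing, not numbers from the lecture):

```python
# Bayes' rule for two candidate models and one observed data item.
priors = {"model_1": 0.7, "model_2": 0.3}          # P(model)
likelihoods = {"model_1": 0.2, "model_2": 0.9}     # P(data | model)

evidence = sum(priors[m] * likelihoods[m] for m in priors)            # P(data)
posterior = {m: priors[m] * likelihoods[m] / evidence for m in priors}
print(posterior)   # {'model_1': 0.341..., 'model_2': 0.658...}
```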

Visual Computing: Joachim M. Buhmann — Machine Learning 169/196

Page 28: Machine Learning

Ronald A. Fisher and Frequentism

Fisher, Ronald Aylmer (1890–1962): founder of frequentist statistics together with Jerzy Neyman & Karl Pearson.

British mathematician and biologist who invented revolutionary techniques for applying statistics to the natural sciences.

Maximum likelihood method

Fisher information: a measure for the information content of densities.

Sampling theory

Hypothesis testing

Visual Computing: Joachim M. Buhmann — Machine Learning 170/196

Page 29: Machine Learning

Bayesianism vs. Frequentist Inference 1

Bayesianism is the philosophical tenet that the mathematical theory of probability applies to the degree of plausibility of statements, or to the degree of belief of rational agents in the truth of statements; together with Bayes' theorem, it becomes Bayesian inference. The Bayesian interpretation of probability allows probabilities to be assigned to random events, but also allows the assignment of probabilities to any other kind of statement.

Bayesians assign probabilities to any statement, even when no random process is involved, as a way to represent its plausibility. As such, the scope of Bayesian inquiries includes the scope of frequentist inquiries.

The limiting relative frequency of an event over a long series of trials is the conceptual foundation of the frequency interpretation of probability.

Frequentism rejects degree-of-belief interpretations of mathematical probability as in Bayesianism, and assigns probabilities only to random events according to their relative frequencies of occurrence.

1 See http://encyclopedia.thefreedictionary.com/

Visual Computing: Joachim M. Buhmann — Machine Learning 171/196

Page 30: Machine Learning

Bayes Rule for Known Densities and Parameters

Assume that we know how the features are distributed for the different classes, i.e., the class conditional densities and their parameters are known. What is the best classification strategy in this situation?

Classifier:

c : X → {1, . . . , k, D}

The assignment function c maps the feature space X to the set of classes {1, . . . , k, D}. (Outliers are neglected.)

Quality of a classifier: whenever a classifier returns a label which differs from the correct class Y = y, it has made a mistake.

Visual Computing: Joachim M. Buhmann — Machine Learning 172/196

Page 31: Machine Learning

Error count: The indicator function

$$\sum_{x \in X} \mathbb{I}\{c(x) \neq y\}$$

counts the classifier mistakes. Note that this error count is a random variable!

Expected errors, also called the expected risk, define the quality of a classifier:

$$R(c) = \sum_{y \le k} P(y)\, \mathbb{E}_{P(x)}\big[\, \mathbb{I}\{c(x) \neq y\} \mid Y = y \,\big] + \text{terms from } D$$

Remark: The rationale behind this choice comes from gambling. If we bet on a particular outcome of our experiment and our gain is measured by how often we assign the measurements to the correct class, then the classifier with minimal expected risk will win on average against any other classification rule (“Dutch books”)!

Visual Computing: Joachim M. Buhmann — Machine Learning 173/196

Page 32: Machine Learning

The Loss Function

Weighted mistakes are introduced when classification errors are not equally costly; e.g. in medical diagnosis, some disease classes might be harmless and others might be lethal despite similar symptoms.

⇒ We introduce a loss function L(y, z) which denotes the loss for the decision z if class y is correct.

0-1 loss: all classes are treated the same!

$$L_{0-1}(y, z) = \begin{cases} 0 & \text{if } z = y \text{ (correct decision)} \\ 1 & \text{if } z \neq y \text{ and } z \neq D \text{ (wrong decision)} \\ d & \text{if } z = D \text{ (no decision)} \end{cases}$$

Visual Computing: Joachim M. Buhmann — Machine Learning 174/196

Page 33: Machine Learning

• Weighted classification costs L(y, z) ∈ R+ are frequently used, e.g. in medicine; classification costs can also be asymmetric, that means L(y, z) ≠ L(z, y), e.g. (z, y) ∼ (pancreas cancer, gastritis).

Conditional risk function of the classifier is the expected loss for class y:

$$\begin{aligned} R(c, y) &= \mathbb{E}_x\big[ L(y, c(x)) \mid Y = y \big] \\ &= \sum_{z \le k} L(y, z)\, P\{c(x) = z \mid Y = y\} + L(y, D)\, P\{c(x) = D \mid Y = y\} \\ &= \underbrace{P\{c(x) \neq y \wedge c(x) \neq D \mid Y = y\}}_{p_{mc}(y)\ \text{probability of misclassification}} + d \cdot \underbrace{P\{c(x) = D \mid Y = y\}}_{p_d(y)\ \text{probability of doubt}} \end{aligned}$$

Visual Computing: Joachim M. Buhmann — Machine Learning 175/196

Page 34: Machine Learning

Total risk of the classifier (with πy := P(Y = y)):

$$R(c) = \sum_{z \le k} \pi_z\, p_{mc}(z) + d \sum_{z \le k} \pi_z\, p_d(z) = \mathbb{E}_C\big[ R(c, C) \big]$$

Asymptotic average loss:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{j \le n} L(c_j, c(x_j)) = R(c),$$

where {(xj, cj) | 1 ≤ j ≤ n} is a random sample set of size n. This formula can be interpreted as the expected loss with the empirical distribution as probability model.

Visual Computing: Joachim M. Buhmann — Machine Learning 176/196

Page 35: Machine Learning

Posterior class probability

Posterior: Let

$$p(y \mid x) \equiv P\{Y = y \mid X = x\} = \frac{\pi_y\, p_y(x)}{\sum_z \pi_z\, p_z(x)}$$

be the posterior of the class y given X = x.

(The “partition of one” πy py(x) / ∑z πz pz(x) results from the normalization ∑z p(z | x) = 1.)

Likelihood: The class conditional density py(x) is the probability of observing data X = x given class Y = y.

Prior: πy is the probability of class Y = y.

Visual Computing: Joachim M. Buhmann — Machine Learning 177/196

Page 36: Machine Learning

Bayes Optimal Classifier

Theorem 1. The classification rule which minimizes the total risk for 0-1 loss is

$$c(x) = \begin{cases} y & \text{if } p(y \mid x) = \max_{z \le k} p(z \mid x) > 1 - d, \\ D & \text{if } p(y \mid x) \le 1 - d \ \ \forall y. \end{cases}$$

Generalization to arbitrary loss functions:

$$c(x) = \begin{cases} y & \text{if } \sum_z L(z, y)\, p(z \mid x) = \min_{\rho \le k} \sum_z L(z, \rho)\, p(z \mid x) \le d, \\ D & \text{else.} \end{cases}$$

Bayes classifier: Select the class with the highest πy py(x) value if it exceeds the cost for not making a decision, i.e., πy py(x) > (1 − d) p(x).
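A hedged sketch of this 0-1-loss rule with doubt cost d, assuming the posteriors p(y|x) are already available as an array (the numbers below are toy values):

```python
import numpy as np

DOUBT = -1  # label used here for the doubt class D

def bayes_classify(posteriors, d):
    """0-1 loss with doubt: pick argmax_y p(y|x) if it exceeds 1 - d, else doubt."""
    best = np.argmax(posteriors, axis=1)
    best_prob = posteriors[np.arange(len(posteriors)), best]
    return np.where(best_prob > 1.0 - d, best, DOUBT)

# toy posteriors p(y|x) for three feature values and k = 3 classes
p = np.array([[0.80, 0.15, 0.05],
              [0.40, 0.35, 0.25],
              [0.34, 0.33, 0.33]])
print(bayes_classify(p, d=0.3))   # -> [0 -1 -1]: decide class 0, doubt, doubt
```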

Visual Computing: Joachim M. Buhmann — Machine Learning 178/196

Page 37: Machine Learning

Proof: Calculate the total expected loss R(c):

$$R(c) = \mathbb{E}_X\Big[ \mathbb{E}_Y\big[ L_{0-1}(Y, c(x)) \mid X = x \big] \Big] = \int_{\mathcal{X}} \mathbb{E}_Y\big[ L_{0-1}(Y, c(x)) \mid X = x \big]\, p(x)\, dx \quad \text{with}\quad p(x) = \sum_{z \le k} \pi_z\, p_z(x)$$

Minimize the conditional expectation value since it alone depends on c:

$$\begin{aligned} c(x) &= \arg\min_{c \in \{1,\ldots,k,D\}} \mathbb{E}\big[ L_{0-1}(Y, c) \mid X = x \big] = \arg\min_{c \in \{1,\ldots,k,D\}} \sum_{z \le k} L_{0-1}(z, c)\, p(z \mid x) \\ &= \begin{cases} \arg\min_{c \in \{1,\ldots,k\}} \big(1 - p(c \mid x)\big) & \text{if } d > \min_c \big(1 - p(c \mid x)\big) \\ D & \text{else} \end{cases} \\ &= \begin{cases} \arg\max_{c \in \{1,\ldots,k\}} p(c \mid x) & \text{if } 1 - d < \max_c p(c \mid x) \\ D & \text{else} \end{cases} \end{aligned}$$

Visual Computing: Joachim M. Buhmann — Machine Learning 179/196

Page 38: Machine Learning

Outliers

• Modeling by an outlier class with prior πO and density pO(x).

• “Novelty detection”: classify a measurement as an outlier if

$$\pi_O\, p_O(x) \ge \max\Big\{ (1 - d)\, p(x),\ \max_z \pi_z\, p_z(x) \Big\}$$

• The outlier concept causes conceptual problems and does not fit into statistical decision theory, since outliers indicate an erroneous or incomplete specification of the statistical model!

• The outlier class is often modeled by a uniform distribution. Attention: a normalized uniform distribution does not exist in many feature spaces!

⇒ Limit the support of the measurement space or put a (Gaussian) measure on it!
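A hedged sketch of this novelty-detection rule; the 1-D Gaussian class densities, the uniform outlier density on a bounded support, and all parameter values are my own toy assumptions.

```python
import numpy as np

priors = np.array([0.5, 0.4])           # pi_z for two regular classes
means, sigmas = np.array([0.0, 4.0]), np.array([1.0, 1.0])
pi_out, p_out = 0.1, 1.0 / 20.0         # uniform outlier density on [-10, 10]
d = 0.2                                  # doubt cost

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def is_outlier(x):
    class_terms = priors * gauss(x, means, sigmas)       # pi_z * p_z(x)
    p_x = class_terms.sum() + pi_out * p_out             # total density p(x)
    return pi_out * p_out >= max((1 - d) * p_x, class_terms.max())

print(is_outlier(0.1), is_outlier(9.0))   # near class 1: False; far from both: True
```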

Visual Computing: Joachim M. Buhmann — Machine Learning 180/196

Page 39: Machine Learning

Class Conditional Densities and Posteriors for 2 Classes

Left: class-conditional probability density functions p(x|ωi). Right: posterior probabilities for the priors P(y1) = 2/3, P(y2) = 1/3.

FIGURE 2.1. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ωi. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

FIGURE 2.2. Posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω2 is roughly 0.08, and that it is in ω1 is 0.92. At every x, the posteriors sum to 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Visual Computing: Joachim M. Buhmann — Machine Learning 181/196

Page 40: Machine Learning

Likelihood Ratio for 2 Class Example

FIGURE 2.3. The likelihood ratio p(x|ω1)/p(x|ω2) for the distributions shown in Fig. 2.1. If we employ a zero-one or classification loss, our decision boundaries are determined by the threshold θa. If our loss function penalizes miscategorizing ω2 as ω1 patterns more than the converse, we get the larger threshold θb, and hence R1 becomes smaller. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
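For completeness (this step is not spelled out on the slide), the likelihood-ratio form of the two-class Bayes rule without a doubt option, assuming L(1, 2) > L(1, 1) and L(2, 1) > L(2, 2), is

$$\text{decide } \omega_1 \quad \text{if} \quad \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{\pi_2\, \big( L(2, 1) - L(2, 2) \big)}{\pi_1\, \big( L(1, 2) - L(1, 1) \big)},$$

so for 0-1 loss the threshold is θa = π2/π1, and penalizing the misclassification of ω2 more heavily raises it to the larger θb.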

Visual Computing: Joachim M. Buhmann — Machine Learning 182/196

Page 41: Machine Learning

Discriminant Functions g_l

FIGURE 2.5. The functional structure of a general statistical pattern classifier which includes d inputs and c discriminant functions gi(x). A subsequent step determines which of the discriminant values is the maximum, and categorizes the input pattern accordingly. The arrows show the direction of the flow of information, though frequently the arrows are omitted when the direction of flow is self-evident. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

• Discriminant function: g_y(x) = P{Y = y | X = x}

• Class decision: g_y(x) > g_z(x) ∀ z ≠ y ⇒ class y.

• Different discriminant functions can yield the same decision, e.g. g_y(x) = log P{x | y} + log πy; such a choice can minimize implementation problems!

Visual Computing: Joachim M. Buhmann — Machine Learning 183/196

Page 42: Machine Learning

Example for Discriminant Functions

FIGURE 2.6. In this two-dimensional two-category classifier, the probability densities are Gaussian, the decision boundary consists of two hyperbolas, and thus the decision region R2 is not simply connected. The ellipses mark where the density is 1/e times that at the peak of the distribution. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Visual Computing: Joachim M. Buhmann — Machine Learning 184/196

Page 43: Machine Learning

Adaptation of Discriminant Functions g_l

(Diagram: the inputs x1, . . . , xd feed the discriminant functions g1(x), . . . , gc(x); a MAX unit selects the action, e.g. classification, and the output is compared with a teacher signal.)

The red connections (weights) are adapted in such a way that the teacher signal is imitated by the discriminant function.

Visual Computing: Joachim M. Buhmann — Machine Learning 185/196

Page 44: Machine Learning

Example Discriminant Functions: Normal Distributions

The likelihood of class y is Gaussian distributed:

$$p_y(x) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_y|}} \exp\Big( -\tfrac{1}{2} (x - \mu_y)^T \Sigma_y^{-1} (x - \mu_y) \Big)$$

Special case: Σy = σ²I

$$g_y(x) = \log p_y(x) + \log \pi_y = -\frac{1}{2\sigma^2} \|x - \mu_y\|^2 + \log \pi_y + \text{const.}$$

Visual Computing: Joachim M. Buhmann — Machine Learning 186/196

Page 45: Machine Learning

⇒ Decision surface between classes z and y:

$$-\frac{1}{2\sigma^2} \|x - \mu_z\|^2 + \log \pi_z = -\frac{1}{2\sigma^2} \|x - \mu_y\|^2 + \log \pi_y$$

$$-\|x\|^2 + 2\, x \cdot \mu_z - \|\mu_z\|^2 + 2\sigma^2 \log \pi_z = -\|x\|^2 + 2\, x \cdot \mu_y - \|\mu_y\|^2 + 2\sigma^2 \log \pi_y$$

$$\Rightarrow\quad 2\, x \cdot (\mu_z - \mu_y) - \|\mu_z\|^2 + \|\mu_y\|^2 + 2\sigma^2 \log \frac{\pi_z}{\pi_y} = 0$$

Linear decision rule: w^T (x − x0) = 0 with

$$w = \mu_z - \mu_y, \qquad x_0 = \frac{1}{2} (\mu_z + \mu_y) - \frac{\sigma^2 (\mu_z - \mu_y)}{\|\mu_z - \mu_y\|^2} \log \frac{\pi_z}{\pi_y}$$
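A hedged numeric check of this linear rule; the means, σ and priors below are toy assumptions.

```python
import numpy as np

sigma = 1.0
mu_z, mu_y = np.array([2.0, 0.0]), np.array([0.0, 0.0])
pi_z, pi_y = 0.3, 0.7

w = mu_z - mu_y
x0 = 0.5 * (mu_z + mu_y) - sigma**2 * (mu_z - mu_y) / np.sum((mu_z - mu_y)**2) \
     * np.log(pi_z / pi_y)

def decide(x):
    # w^T (x - x0) > 0  <=>  g_z(x) > g_y(x)  =>  class z, else class y
    return "z" if w @ (x - x0) > 0 else "y"

print(x0, decide(np.array([1.8, 0.0])), decide(np.array([0.2, 0.0])))
```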

Visual Computing: Joachim M. Buhmann — Machine Learning 187/196

Page 46: Machine Learning

Decision Surface for Gaussians in 1, 2, 3 Dimensions

FIGURE 2.10. If the covariance matrices for two distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating the means. In these one-, two-, and three-dimensional examples, we indicate p(x|ωi) and the boundaries for the case P(ω1) = P(ω2). In the three-dimensional case, the grid plane separates R1 from R2. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Visual Computing: Joachim M. Buhmann — Machine Learning 188/196

Page 47: Machine Learning

FIGURE 2.11. As the priors are changed, the decision boundary shifts; for sufficiently disparate priors the boundary will not lie between the means of these one-, two- and three-dimensional spherical Gaussian distributions. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Visual Computing: Joachim M. Buhmann — Machine Learning 189/196

Page 48: Machine Learning

Multi Class Case

FIGURE 2.16. The decision regions for four normal distributions. Even with such a low number of categories, the shapes of the boundary regions can be rather complex. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Decision regions for four Gaussian distributions. Even for such a small number of classes the discriminant functions show a complex form.

Visual Computing: Joachim M. Buhmann — Machine Learning 190/196

Page 49: Machine Learning

Example: Gene Expression Data

The expression of genes is measured for various patients. The expression profiles provide information on the metabolic state of the cells, meaning that they could be used as indicators for disease classes. Each patient is represented as a vector in a high-dimensional (≈ 10000) space with Gaussian class distribution.

(Figure: gene expression heatmap; the samples belong to the classes ALL B-Cell, AML and ALL T-Cell, the columns are genes, and the true and predicted class labels are indicated.)

Visual Computing: Joachim M. Buhmann — Machine Learning 191/196

Page 50: Machine Learning

Parametric Models for Class Densities

If we knew the prior probabilities and the class conditional probabilities, then we could calculate the optimal classifier. But we don't!

Task: Estimate p(y|x; θ) from samples Z = {(x1, y1), . . . , (xn, yn)} for classification.

Data are sorted according to their classes: X_y = {X_{1y}, . . . , X_{n_y,y}}, where X_{iy} ∼ P{X | Y = y; θ_y}.

Question: How can we use the information in the samples to estimate θ_y?

Assumption: classes can be separated and treated independently! X_y is not informative w.r.t. θ_z for z ≠ y.

Visual Computing: Joachim M. Buhmann — Machine Learning 192/196

Page 51: Machine Learning

Maximum Likelihood Estimation Theory

Likelihood of the data set:

$$P\{X_y \mid \theta_y\} = \prod_{i \le n_y} p(x_{iy} \mid \theta_y)$$

Estimation principle: Select the parameters θ_y which maximize the likelihood, that means

$$\hat\theta_y = \arg\max_{\theta_y} P\{X_y \mid \theta_y\}$$

Procedure: Find the extremal value of the log-likelihood function:

$$\nabla_{\theta_y} \log P\{X_y \mid \theta_y\} = 0 \qquad\Longleftrightarrow\qquad \partial_{\theta_y} \sum_{i \le n_y} \log p(x_{iy} \mid \theta_y) = 0$$

Visual Computing: Joachim M. Buhmann — Machine Learning 193/196

Page 52: Machine Learning

Remark

Bias of an estimator: $\operatorname{bias}(\hat\theta_n) = \mathbb{E}[\hat\theta_n] - \theta$.

Consistent estimator: A point estimator $\hat\theta_n$ of a parameter $\theta$ is consistent if $\hat\theta_n \xrightarrow{P} \theta$.

Asymptotic normality of maximum likelihood estimates: $(\hat\theta_n - \theta)/\sqrt{\mathbb{V}[\hat\theta_n]} \rightsquigarrow \mathcal{N}(0, 1)$.

Alternative to ML class density estimation: discriminative learning by maximizing the a posteriori distribution $P\{\theta_y \mid X_y\}$ (details of the density do not have to be modelled since they might not influence the posterior).

Visual Computing: Joachim M. Buhmann — Machine Learning 194/196

Page 53: Machine Learning

Example: Multivariate Normal Distribution

Expectation values of a normal distribution and their estimators; the class index has been omitted for legibility reasons (θ_y → θ).

$$\log p(x_i \mid \theta) = -\frac{1}{2} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) - \frac{d}{2} \log 2\pi - \frac{1}{2} \log |\Sigma|$$

$$\partial_\mu \sum_{i \le n} \log p(x_i \mid \theta) = \frac{1}{2} \sum_{i \le n} \Sigma^{-1} (x_i - \mu) + \frac{1}{2} \sum_{i \le n} \big( (x_i - \mu)^T \Sigma^{-1} \big)^T = 0$$

$$\Sigma^{-1} \sum_{i \le n} (x_i - \mu) = 0 \quad\Rightarrow\quad \hat\mu_n = \frac{1}{n} \sum_i x_i \quad \text{(estimator for } \mu\text{)}$$

The average value formula results from the quadratic form.

Unbiasedness: $\mathbb{E}[\hat\mu_n] = \frac{1}{n} \sum_{i \le n} \mathbb{E}[x_i] = \mathbb{E}[x] = \mu$

Visual Computing: Joachim M. Buhmann — Machine Learning 195/196

Page 54: Machine Learning

ML estimation of the variance (1d case)

$$\frac{\partial}{\partial \sigma^2} \sum_{i \le n} \log p(x_i \mid \theta) = -\frac{\partial}{\partial \sigma^2} \left[ \sum_{i \le n} \frac{1}{2\sigma^2} \|x_i - \mu\|^2 + \frac{n}{2} \log(2\pi\sigma^2) \right] = \frac{1}{2} \sum_{i \le n} \sigma^{-4} \|x_i - \mu\|^2 - \frac{n}{2}\, \sigma^{-2} = 0$$

$$\Rightarrow\quad \hat\sigma_n^2 = \frac{1}{n} \sum_{i \le n} \|x_i - \mu\|^2$$

Multivariate case:

$$\hat\Sigma_n = \frac{1}{n} \sum_{i \le n} (x_i - \hat\mu_n)(x_i - \hat\mu_n)^T$$

$\hat\Sigma_n$ is biased, i.e., $\mathbb{E}[\hat\Sigma_n] \neq \Sigma$, if $\mu$ is unknown and has to be estimated by $\hat\mu_n$.
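A hedged numeric check of these estimators in the 1-D case (toy parameters of my own choosing); the bias factor (n-1)/n of the ML variance estimate shows up in the simulated average.

```python
import numpy as np

# ML estimates mu_hat = mean(x) and sigma2_hat = mean((x - mu_hat)^2),
# repeated over many trials to expose the bias E[sigma2_hat] = (n-1)/n * sigma^2.
rng = np.random.default_rng(6)
mu_true, sigma2_true, n, trials = 1.0, 4.0, 5, 200_000

x = rng.normal(mu_true, np.sqrt(sigma2_true), (trials, n))
mu_hat = x.mean(axis=1, keepdims=True)                # ML estimate of mu per trial
sigma2_hat = ((x - mu_hat) ** 2).mean(axis=1)         # ML estimate of sigma^2 per trial

print(mu_hat.mean(), sigma2_hat.mean())   # ≈ 1.0 and ≈ (n-1)/n * 4 = 3.2
```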

Visual Computing: Joachim M. Buhmann — Machine Learning 196/196