Machine Learning
Central Problem of Pattern Recognition: Supervised and Unsupervised Learning
• Classification
• Bayesian Decision Theory
• Perceptrons and SVMs
• Clustering
Visual Computing: Joachim M. Buhmann — Machine Learning 143/196
Machine Learning – What is the Challenge?
Find optimal structure in data and validate it!
March 2006, Joachim M. Buhmann / Institute for Computational Science
Concept for Robust Data Analysis
(Diagram: feedback loops connect the following stages.)
• Data: vectors, relations, images, ...
• Structure definition (costs, risk, ...)
• Structure optimization: multiscale analysis, stochastic approximation
• Quantization of solution space: information/rate distortion theory
• Regularization of statistical & computational complexity
• Structure validation: statistical learning theory
The Problem of Pattern Recognition
Machine Learning (as statistics) addresses a number of challenging inference problems in pattern recognition which span the range from statistical modeling to efficient algorithmics. Approximative methods which yield good performance on average are particularly important.
• Representation of objects ⇒ data representation
• What is a pattern? Definition/modeling of structure.
• Optimization: search for preferred structures
• Validation: are the structures indeed in the data, or are they explained by fluctuations?
Literature
• Richard O. Duda, Peter E. Hart & David G. Stork, Pattern Classification. Wiley & Sons (2001)
• Trevor Hastie, Robert Tibshirani & Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag (2001)
• Luc Devroye, László Györfi & Gábor Lugosi, A Probabilistic Theory of Pattern Recognition. Springer Verlag (1996)
• Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data. Springer Verlag (1983); The Nature of Statistical Learning Theory. Springer Verlag (1995)
• Larry Wasserman, All of Statistics. Springer Verlag (2004; corr. 2nd printing, ISBN 0-387-40272-1)
The Classification Problem
Classification as a Pattern Recognition Problem
Problem: We look for a partition of the object space O (fish in the previous example) which corresponds to classification examples. Distinguish conceptually between “objects” o ∈ O and “data” x ∈ X!
Data: pairs of feature vectors and class labels
Z = {(x_i, y_i) : 1 ≤ i ≤ n}, x_i ∈ R^d, y_i ∈ {1, …, k}
Definitions: feature space X with x_i ∈ X ⊂ R^d,
class labels y_i ∈ {1, …, k}
Classifier: mapping c : X → {1, …, k}
k-class problem: What is y_{n+1} ∈ {1, …, k} for x_{n+1} ∈ R^d?
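As a minimal sketch of such a mapping c : X → {1, …, k} (illustrative only: the nearest-centroid rule, function names and toy data below are not from the lecture):

```python
import numpy as np

def fit_centroids(X, y):
    """Estimate one centroid per class from labeled data Z = {(x_i, y_i)}."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def classify(x, centroids):
    """A classifier c : X -> {1, ..., k}: pick the class with the nearest centroid."""
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))

# two toy classes in R^2
X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y = np.array([1, 1, 2, 2])
c = fit_centroids(X, y)
print(classify(np.array([0.1, 0.0]), c))  # -> 1
print(classify(np.array([2.9, 3.2]), c))  # -> 2
```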
Example of Classification
Histograms of Length Values
FIGURE 1.2. Histograms for the length feature for the two categories. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l* will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Histograms of Skin Brightness Values
FIGURE 1.3. Histograms for the lightness feature for the two categories. No single threshold value x* (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x* marked will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Linear Classification
FIGURE 1.4. The two features of lightness and width for sea bass and salmon. The dark line could serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Overfitting
FIGURE 1.5. Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be classified as a sea bass. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Optimized Non-Linear Classification
FIGURE 1.6. The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier, thereby giving the highest accuracy on new patterns. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Occam’s razor argument: Entia non sunt multiplicanda praeter necessitatem! (Entities must not be multiplied beyond necessity.)
Regression (see Introduction to Machine Learning)
Question: Given a feature (vector) x_i and a corresponding noisy measurement of a function value y_i = f(x_i) + noise, what is the unknown function f(·) in a hypothesis class H?
Data: Z = {(x_i, y_i) ∈ R^d × R : 1 ≤ i ≤ n}
Modeling choice: What is an adequate hypothesis class and a good noise model? Fitting with linear/nonlinear functions?
The Regression Function
Questions: (i) What is the statistically optimal estimate of a function f : R^d → R, and (ii) which algorithm achieves this goal most efficiently?
Solution to (i): the regression function

y(x) = E[y | X = x] = ∫_Ω y p(y | X = x) dy

Nonlinear regression of a sinc function, sinc(x) := sin(x)/x (gray), with a regression fit (black) based on 50 noisy data points.
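The regression function E[y | X = x] can be approximated from noisy sinc data in many ways; the slides do not name the fit method, so the sketch below uses a Nadaraya-Watson kernel estimate as a hypothetical stand-in (bandwidth h and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-10, 10, 50)
# np.sinc(t) = sin(pi t)/(pi t), so np.sinc(x/pi) = sin(x)/x
y_train = np.sinc(x_train / np.pi) + rng.normal(0, 0.1, 50)

def nw_regress(x, xs, ys, h=0.8):
    """Nadaraya-Watson kernel estimate of the regression function E[y | X = x]."""
    w = np.exp(-0.5 * ((x - xs) / h) ** 2)   # Gaussian kernel weights
    return np.sum(w * ys) / np.sum(w)

f_hat = nw_regress(0.0, x_train, y_train)
print(f_hat)  # should lie near sinc(0) = 1
```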
Examples of linear and nonlinear regression
(Figure panels: linear regression; nonlinear regression.)
How should we measure the deviations?
(Figure panels: vertical offsets; perpendicular offsets.)
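The two deviation measures can be contrasted in code: ordinary least squares minimizes vertical offsets, while total least squares (computed here via SVD, a standard choice not prescribed by the slides) minimizes perpendicular offsets:

```python
import numpy as np

def fit_vertical(x, y):
    """Ordinary least squares: minimizes the vertical offsets sum_i (y_i - a x_i - b)^2."""
    a, b = np.polyfit(x, y, 1)
    return a, b

def fit_perpendicular(x, y):
    """Total least squares: minimizes perpendicular distances to the fitted line
    (assumes the line is not vertical)."""
    pts = np.column_stack([x, y])
    centered = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    dx, dy = vt[0]                       # principal direction of the point cloud
    a = dy / dx
    b = pts[:, 1].mean() - a * pts[:, 0].mean()
    return a, b

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                        # noise-free line: both fits recover a=2, b=1
print(fit_vertical(x, y))
print(fit_perpendicular(x, y))
```

On noise-free collinear data both fits agree; they differ once noise acts on both coordinates.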
Core Questions of Pattern Recognition: Unsupervised Learning
No teacher signal is available for the learning algorithm; learning is guided by a general cost/risk function.
Examples for unsupervised learning:
1. Data clustering, vector quantization: as in classification we search for a partitioning of objects into groups, but explicit labelings are not available.
2. Hierarchical data analysis: search for tree structures in data.
3. Visualisation, dimension reduction.
Semisupervised learning: some of the data are labeled, most of them are unlabeled.
Modes of Learning
Reinforcement Learning: weakly supervised learning; action chains are evaluated at the end. Example: Backgammon; the neural network TD-Gammon gained the world championship! Quite popular in robotics.
Active Learning: data are selected according to their expected information gain. Example: information filtering.
Inductive Learning: the learning algorithm extracts logical rules from the data. Inductive Logic Programming is a popular subarea of Artificial Intelligence.
Vectorial Data
Data of 20 Gaussian sources in R^20, projected onto two dimensions with Principal Component Analysis.
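A projection of that kind can be sketched as follows (the source counts, 20 sources in R^20, follow the caption; the per-source sample size and spreads are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_sources, n_per = 20, 20, 10         # 20 Gaussian sources in R^20 (cf. the figure)

means = rng.normal(0.0, 5.0, (n_sources, d))
X = np.vstack([rng.normal(m, 1.0, (n_per, d)) for m in means])

# PCA: project the centered data onto the two leading principal directions
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
projection = Xc @ vt[:2].T               # one 2d coordinate pair per sample
print(projection.shape)                  # -> (200, 2)
```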
Relational Data
Pairwise dissimilarity of 145 globins which have been selected from 4 classes: α-globins, β-globins, myoglobins, and globins of insects and plants.
Scales for Data
Nominal or categorical scale: qualitative, but without quantitative measurements, e.g. the binary scale F = {0, 1} (presence or absence of properties like “kosher”) or the taste categories “sweet, sour, salty and bitter”.
Ordinal scale: measurement values are meaningful only with respect to other measurements, i.e., the rank order of measurements carries the information, not the numerical differences (e.g. information on the ranking of different marathon races!?)
Quantitative scale:
• Interval scale: the relation of numerical differences carries the information; invariance w.r.t. translation and scaling (Fahrenheit scale of temperature).
• Ratio scale: the zero value of the scale carries information but not the measurement unit (Kelvin scale).
• Absolute scale: absolute values are meaningful (grades of final exams).
Machine Learning: Topic Chart
• Core problems of pattern recognition
• Bayesian decision theory
• Perceptrons and Support vector machines
• Data clustering
Bayesian Decision Theory
The Problem of Statistical Decisions
Task: n objects have to be partitioned into the classes {1, …, k}, the doubt class D and the outlier class O.
D: doubt class (→ new measurements required)
O: outlier class, definitively none of the classes 1, 2, …, k
Objects are characterized by feature vectors X ∈ X, X ∼ P(X), with the probability P(X = x) of feature values x.
Statistical modeling: objects represented by data X and classes Y are considered to be random variables, i.e., (X, Y) ∼ P(X, Y).
Conceptually, it is not mandatory to consider class labels as random, since they might be induced by legal considerations or conventions.
Structure of the feature space X:
• X ⊂ R^d
• X = X_1 × X_2 × · · · × X_d with X_i ⊆ R or X_i finite.
Remark: in most situations we can define the feature space as subsets of R^d or as tuples of real, categorical (B = {0, 1}) or ordinal numbers. Sometimes we have more complicated data spaces composed of lists, trees or graphs.
Class density / likelihood: p_y(x) := P(X = x | Y = y) is the probability of a feature value x given a class y.
Parametric statistics: estimate the parameters of the class densities p_y(x).
Non-parametric statistics: minimize the empirical risk.
Motivation of Classification
Given are labeled data Z = {(x_i, y_i) : i ≤ n}.
Questions:
1. What are the class boundaries?
2. What are the class-specific densities p_y(x)?
3. How many modes or parameters do we need to model p_y(x)?
4. ...
Figure: quadratic SVM classifier for five classes. White areas are ambiguous regions.
Thomas Bayes and his Terminology
The State of Nature is modelled as a random variable!
prior: P{model}
likelihood: P{data | model}
posterior: P{model | data}
evidence: P{data}

Bayes Rule: P{model | data} = P{data | model} P{model} / P{data}
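A minimal numeric sketch of Bayes rule, with two hypothetical models m1 and m2 (all numbers invented for illustration):

```python
prior = {"m1": 0.5, "m2": 0.5}                 # P{model}
likelihood = {"m1": 0.8, "m2": 0.2}            # P{data | model}

evidence = sum(prior[m] * likelihood[m] for m in prior)             # P{data}
posterior = {m: prior[m] * likelihood[m] / evidence for m in prior}
print(posterior)  # -> {'m1': 0.8, 'm2': 0.2}
```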
Ronald A. Fisher and Frequentism
Fisher, Ronald Aylmer (1890–1962): founder of frequentist statistics together with Jerzy Neyman & Karl Pearson.
British mathematician and biologist who invented revolutionary techniques for applying statistics to the natural sciences.
Maximum likelihood method
Fisher information: a measure for the information content of densities.
Sampling theory
Hypothesis testing
Bayesianism vs. Frequentist Inference¹
Bayesianism is the philosophical tenet that the mathematical theory of probability applies to the degree of plausibility of statements, or to the degree of belief of rational agents in the truth of statements; together with Bayes theorem, it becomes Bayesian inference. The Bayesian interpretation of probability allows probabilities to be assigned to random events, but also allows the assignment of probabilities to any other kind of statement.
Bayesians assign probabilities to any statement, even when no random process is involved, as a way to represent its plausibility. As such, the scope of Bayesian inquiries includes the scope of frequentist inquiries.
The limiting relative frequency of an event over a long series of trials is the conceptual foundation of the frequency interpretation of probability.
Frequentism rejects degree-of-belief interpretations of mathematical probability as in Bayesianism, and assigns probabilities only to random events according to their relative frequencies of occurrence.
¹ See http://encyclopedia.thefreedictionary.com/
Bayes Rule for Known Densities and Parameters
Assume that we know how the features are distributed for the different classes, i.e., the class-conditional densities and their parameters are known. What is the best classification strategy in this situation?

Classifier:

c : X → {1, …, k, D}

The assignment function c maps the feature space X to the set of classes {1, …, k, D}. (Outliers are neglected.)

Quality of a classifier: whenever a classifier returns a label which differs from the correct class Y = y, then it has made a mistake.
Error count: The indicator function

∑_{x∈X} I_{c(x)≠y}

counts the classifier mistakes. Note that this error count is a random variable!

Expected errors, also called expected risk, define the quality of a classifier:

R(c) = ∑_{y≤k} P(y) E_{P(x)}[I_{c(x)≠y} | Y = y] + terms from D
Remark: The rationale behind this choice comes from gambling. If we bet on a particular outcome of our experiment, and our gain is measured by how often we assign the measurements to the correct class, then the classifier with minimal expected risk will win on average against any other classification rule (“Dutch books”)!
The Loss Function
Weighted mistakes are introduced when classification errors are not equally costly; e.g. in medical diagnosis, some disease classes might be harmless and others might be lethal despite similar symptoms.
⇒ We introduce a loss function L(y, z) which denotes the loss for the decision z if class y is correct.
0-1 loss: all classes are treated the same!

L_{0-1}(y, z) =
  0 if z = y (correct decision)
  1 if z ≠ y and z ≠ D (wrong decision)
  d if z = D (no decision)
• Weighted classification costs L(y, z) ∈ R+ are frequently used, e.g. in medicine; classification costs can also be asymmetric, that means L(y, z) ≠ L(z, y), e.g. (z, y) = (pancreas cancer, gastritis).
Conditional risk of the classifier is the expected loss of class y:

R(c, y) = E_x[L(y, c(x)) | Y = y]
        = ∑_{z≤k} L(y, z) P{c(x) = z | Y = y} + L(y, D) P{c(x) = D | Y = y}
        = P{c(x) ≠ y ∧ c(x) ≠ D | Y = y}   (= p_mc(y), probability of misclassification)
          + d · P{c(x) = D | Y = y}        (= p_d(y), probability of doubt)
Total risk of the classifier (π_y := P(Y = y)):

R(c) = ∑_{z≤k} π_z p_mc(z) + d ∑_{z≤k} π_z p_d(z) = E_C[R(c, C)]
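A small numeric sketch of this total risk (all priors and per-class error probabilities below are invented for illustration):

```python
import numpy as np

d = 0.3                                  # loss for answering "doubt"
pi = np.array([0.5, 0.3, 0.2])           # class priors pi_z
p_mc = np.array([0.05, 0.10, 0.20])      # P{misclassification | Y = z}
p_d = np.array([0.10, 0.05, 0.10])       # P{doubt | Y = z}

# R(c) = sum_z pi_z p_mc(z) + d * sum_z pi_z p_d(z)
total_risk = pi @ p_mc + d * (pi @ p_d)
print(total_risk)                        # 0.095 + 0.3 * 0.085 = 0.1205
```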
Asymptotic average loss:

lim_{n→∞} (1/n) ∑_{j≤n} L(c_j, c(x_j)) = R(c),

where {(x_j, c_j) | 1 ≤ j ≤ n} is a random sample set of size n. This formula can be interpreted as the expected loss with the empirical distribution as probability model.
Posterior class probability

Posterior: Let

p(y|x) ≡ P{Y = y | X = x} = π_y p_y(x) / ∑_z π_z p_z(x)

be the posterior of the class y given X = x.
(The “partition of one” π_y p_y(x) / ∑_z π_z p_z(x) results from the normalization ∑_z p(z|x) = 1.)

Likelihood: The class-conditional density p_y(x) is the probability of observing data X = x given class Y = y.

Prior: π_y is the probability of class Y = y.
Bayes Optimal Classifier
Theorem 1. The classification rule which minimizes the total risk for 0-1 loss is

c(x) = y if p(y|x) = max_{z≤k} p(z|x) > 1 − d,
       D if p(y|x) ≤ 1 − d for all y.

Generalization to arbitrary loss functions:

c(x) = y if ∑_z L(z, y) p(z|x) = min_{ρ≤k} ∑_z L(z, ρ) p(z|x) ≤ d,
       D else.

Bayes classifier: select the class with the highest π_y p_y(x) value if it exceeds the cost of not making a decision, i.e., π_y p_y(x) > (1 − d) p(x).
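The rule of Theorem 1 for 0-1 loss with doubt cost d can be sketched directly (function name and numbers are illustrative):

```python
import numpy as np

def bayes_classify(posterior, d):
    """Theorem 1 for 0-1 loss: answer argmax_y p(y|x) if it exceeds 1 - d,
    otherwise return the doubt class 'D'."""
    y = int(np.argmax(posterior))
    return y if posterior[y] > 1 - d else "D"

print(bayes_classify(np.array([0.9, 0.05, 0.05]), d=0.2))  # 0.9 > 0.8 -> class 0
print(bayes_classify(np.array([0.4, 0.35, 0.25]), d=0.2))  # 0.4 <= 0.8 -> 'D'
```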
Proof: Calculate the total expected loss R(c):

R(c) = E_X[E_Y[L_{0-1}(Y, c(x)) | X = x]]
     = ∫_X E_Y[L_{0-1}(Y, c(x)) | X = x] p(x) dx,  with p(x) = ∑_{z≤k} π_z p_z(x).

Minimize the conditional expectation value, since it depends only on c:

c(x) = argmin_{c∈{1,…,k,D}} E[L_{0-1}(Y, c) | X = x]
     = argmin_{c∈{1,…,k,D}} ∑_{z≤k} L_{0-1}(z, c) p(z|x)
     = argmin_{c∈{1,…,k}} (1 − p(c|x)) if d > min_c (1 − p(c|x)), D else
     = argmax_{c∈{1,…,k}} p(c|x) if 1 − d < max_c p(c|x), D else
Outliers
• Modeling by an outlier class π_O with p_O(x).
• “Novelty detection”: classify a measurement as an outlier if

π_O p_O(x) ≥ max{(1 − d) p(x), max_z π_z p_z(x)}

• The outlier concept causes conceptual problems, and it does not fit into statistical decision theory, since outliers indicate an erroneous or incomplete specification of the statistical model!
• The outlier class is often modeled by a uniform distribution. Attention: the normalization of a uniform distribution does not exist in many feature spaces!
⇒ Limit the support of the measurement space or put a (Gaussian) measure on it!
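The novelty-detection rule above can be sketched as follows; treating the evidence p(x) as including the outlier class is an assumption of this sketch, and all numbers are invented:

```python
import numpy as np

def is_outlier(class_probs, pi, pi_o, p_o, d):
    """Flag x as an outlier if pi_O p_O(x) >= max{(1 - d) p(x), max_z pi_z p_z(x)}.
    class_probs[z] = p_z(x); here p(x) is taken to include the outlier class."""
    joint = pi * class_probs                 # pi_z p_z(x)
    p_x = joint.sum() + pi_o * p_o           # evidence p(x)
    return pi_o * p_o >= max((1 - d) * p_x, joint.max())

pi = np.array([0.45, 0.45])                  # priors of the regular classes
print(is_outlier(np.array([0.001, 0.002]), pi, pi_o=0.1, p_o=0.05, d=0.3))  # -> True
print(is_outlier(np.array([0.5, 0.1]), pi, pi_o=0.1, p_o=0.05, d=0.3))      # -> False
```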
Class-Conditional Densities and Posteriors for 2 Classes
Class-conditional probability density functions; posterior probabilities for the priors P(y_1) = 2/3, P(y_2) = 1/3.
FIGURE 2.1. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ωi. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
FIGURE 2.2. Posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω2 is roughly 0.08, and that it is in ω1 is 0.92. At every x, the posteriors sum to 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Likelihood Ratio for 2 Class Example
FIGURE 2.3. The likelihood ratio p(x|ω1)/p(x|ω2) for the distributions shown in Fig. 2.1. If we employ a zero-one or classification loss, our decision boundaries are determined by the threshold θa. If our loss function penalizes miscategorizing ω2 as ω1 patterns more than the converse, we get the larger threshold θb, and hence R1 becomes smaller. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Discriminant Functions g_l
FIGURE 2.5. The functional structure of a general statistical pattern classifier which includes d inputs and c discriminant functions gi(x). A subsequent step determines which of the discriminant values is the maximum, and categorizes the input pattern accordingly. The arrows show the direction of the flow of information, though frequently the arrows are omitted when the direction of flow is self-evident. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
• Discriminant function: g_y(x) = P{Y = y | X = x}
• Class decision: g_y(x) > g_z(x) ∀z ≠ y ⇒ class y.
• Different discriminant functions can yield the same decision: e.g. g_y(x) = log P{x|y} + log π_y; minimize implementation problems!
Example for Discriminant Functions
FIGURE 2.6. In this two-dimensional two-category classifier, the probability densities are Gaussian, the decision boundary consists of two hyperbolas, and thus the decision region R2 is not simply connected. The ellipses mark where the density is 1/e times that at the peak of the distribution. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Adaptation of Discriminant Functions g_l
The red connections (weights) are adapted in such a way that the teacher signal is imitated by the discriminant function.
Example Discriminant Functions: Normal Distributions

The likelihood of class y is Gaussian distributed:

p_y(x) = 1/√((2π)^d |Σ_y|) · exp(−½ (x − µ_y)^T Σ_y^{−1} (x − µ_y))

Special case: Σ_y = σ²I

g_y(x) = log p_y(x) + log π_y = −(1/(2σ²)) ‖x − µ_y‖² + log π_y + const.
⇒ Decision surface between classes z and y:

−(1/(2σ²)) ‖x − µ_z‖² + log π_z = −(1/(2σ²)) ‖x − µ_y‖² + log π_y
−‖x‖² + 2x·µ_z − ‖µ_z‖² + 2σ² log π_z = −‖x‖² + 2x·µ_y − ‖µ_y‖² + 2σ² log π_y
⇒ 2x·(µ_z − µ_y) − ‖µ_z‖² + ‖µ_y‖² + 2σ² log(π_z/π_y) = 0

Linear decision rule: w^T(x − x_0) = 0

with w = µ_z − µ_y,  x_0 = ½(µ_z + µ_y) − σ²(µ_z − µ_y)/‖µ_z − µ_y‖² · log(π_z/π_y)
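A quick numeric check of the linear rule w^T(x − x_0) = 0 for two spherical Gaussians (means, σ² and priors chosen for illustration; with equal priors x_0 reduces to the midpoint of the means):

```python
import numpy as np

sigma2 = 1.0
mu_z, mu_y = np.array([2.0, 0.0]), np.array([0.0, 0.0])
pi_z, pi_y = 0.5, 0.5

w = mu_z - mu_y
x0 = (0.5 * (mu_z + mu_y)
      - sigma2 * (mu_z - mu_y) / np.sum((mu_z - mu_y) ** 2) * np.log(pi_z / pi_y))

def decide(x):
    """Sign of w^T (x - x0) picks the side of the hyperplane."""
    return "z" if w @ (x - x0) > 0 else "y"

print(x0)                            # equal priors -> midpoint [1. 0.]
print(decide(np.array([1.8, 0.3])))  # -> 'z'
print(decide(np.array([0.2, 0.0])))  # -> 'y'
```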
Decision Surface for Gaussians in 1, 2, 3 Dimensions
FIGURE 2.10. If the covariance matrices for two distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating the means. In these one-, two-, and three-dimensional examples, we indicate p(x|ωi) and the boundaries for the case P(ω1) = P(ω2). In the three-dimensional case, the grid plane separates R1 from R2. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
FIGURE 2.11. As the priors are changed, the decision boundary shifts; for sufficiently disparate priors the boundary will not lie between the means of these one-, two- and three-dimensional spherical Gaussian distributions. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Multi Class Case
FIGURE 2.16. The decision regions for four normal distributions. Even with such a low number of categories, the shapes of the boundary regions can be rather complex. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Decision regions for four Gaussian distributions. Even for such a small number of classes the discriminant functions show a complex form.
Example: Gene Expression Data
The expression of genes is measured for various patients. The expression profiles provide information on the metabolic state of the cells, meaning that they could be used as indicators for disease classes. Each patient is represented as a vector in a high-dimensional (≈ 10000) space with Gaussian class distribution.
(Figure: gene expression matrix, genes × samples, for the classes ALL B-cell, ALL T-cell and AML, with true and predicted class labels.)
Parametric Models for Class Densities
If we knew the prior probabilities and the class-conditional probabilities, then we could calculate the optimal classifier. But we don't!

Task: Estimate p(y|x; θ) from samples Z = {(x_1, y_1), …, (x_n, y_n)} for classification.

Data are sorted according to their classes: X_y = {X_{1y}, …, X_{n_y y}} where X_{iy} ∼ P{X | Y = y; θ_y}.

Question: How can we use the information in the samples to estimate θ_y?

Assumption: classes can be separated and treated independently! X_y is not informative w.r.t. θ_z, z ≠ y.
Maximum Likelihood Estimation Theory
Likelihood of the data set: P{X_y | θ_y} = ∏_{i≤n_y} p(x_{iy} | θ_y)

Estimation principle: select the parameters θ̂_y which maximize the likelihood, that means

θ̂_y = argmax_{θ_y} P{X_y | θ_y}

Procedure: find the extreme value of the log-likelihood function:

∇_{θ_y} log P{X_y | θ_y} = 0  ⇔  ∂/∂θ_y ∑_{i≤n} log p(x_i | θ_y) = 0
Remark
Bias of an estimator: bias(θ̂_n) = E[θ̂_n] − θ.

Consistent estimator: a point estimator θ̂_n of a parameter θ is consistent if θ̂_n → θ in probability.

Asymptotic normality of maximum likelihood estimates:

(θ̂_n − θ)/√V[θ̂_n] ⇝ N(0, 1).

Alternative to ML class density estimation: discriminative learning by maximizing the a posteriori distribution P{θ_y | X_y} (details of the density do not have to be modelled since they might not influence the posterior).
Example: Multivariate Normal Distribution

Expectation values of a normal distribution and their estimation; the class index has been omitted for legibility (θ_y → θ).

log p(x_i|θ) = −½ (x_i − µ)^T Σ^{−1} (x_i − µ) − (d/2) log 2π − ½ log |Σ|

∂/∂µ ∑_{i≤n} log p(x_i|θ) = ½ ∑_{i≤n} Σ^{−1}(x_i − µ) + ½ ∑_{i≤n} ((x_i − µ)^T Σ^{−1})^T = 0

Σ^{−1} ∑_{i≤n} (x_i − µ) = 0  ⇒  µ̂_n = (1/n) ∑_i x_i  (estimator for µ)

The average value formula results from the quadratic form.

Unbiasedness: E[µ̂_n] = (1/n) ∑_{i≤n} E[x_i] = E[x] = µ
ML estimation of the variance (1d case):

∂/∂σ² ∑_{i≤n} log p(x_i|θ) = ∂/∂σ² (−∑_{i≤n} (1/(2σ²)) ‖x_i − µ‖² − (n/2) log(2πσ²))
                            = ½ ∑_{i≤n} σ^{−4} ‖x_i − µ‖² − (n/2) σ^{−2} = 0
⇒ σ̂²_n = (1/n) ∑_{i≤n} ‖x_i − µ‖²

Multivariate case: Σ̂_n = (1/n) ∑_{i≤n} (x_i − µ)(x_i − µ)^T

Σ̂_n is biased, i.e., E[Σ̂_n] ≠ Σ, if µ is unknown.
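The bias can be checked by simulation in the 1d case: with µ estimated from the sample, the ML variance estimator has expectation (n−1)/n · σ², e.g. 0.8 σ² for n = 5 (sample counts below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, sigma2 = 5, 200_000, 1.0

samples = rng.normal(0.0, np.sqrt(sigma2), (trials, n))
mu_hat = samples.mean(axis=1, keepdims=True)          # estimated mean per trial
sigma2_ml = np.mean((samples - mu_hat) ** 2, axis=1)  # ML estimator (divides by n)

# average over many trials approximates E[sigma2_ml] = (n-1)/n * sigma2 = 0.8
print(sigma2_ml.mean())
```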