
Page 1:

2. Bayes Decision Theory

Prof. A.L. Yuille

Stat 231. Fall 2004.

Page 2:

Decisions with Uncertainty

• Bayes Decision Theory is a theory for how to make decisions in the presence of uncertainty.

• Input data x.

• Salmon: y = +1, Sea Bass: y = -1.

• Learn a decision rule f(x) taking values in $\{+1, -1\}$.

Page 3:

Decision Rule for Fish.

• Classify fish as Salmon or Sea Bass by decision rule f(x).

Page 4:

Basic Ingredients.

• Assume there are probability distributions for generating the data.

• P(x|y=1) and P(x|y=-1).

• Loss function L(f(x),y) specifies the loss of making decision f(x) when true state is y.

• Distribution P(y). Prior probability on y.

• Joint Distribution P(x,y) = P(x|y) P(y).
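To make these ingredients concrete, here is a minimal Python sketch for the fish example; the Gaussian class-conditionals, the prior values, and the 0-1 loss are illustrative assumptions, not from the lecture:

```python
import math

# Hypothetical class-conditional densities P(x|y): one-dimensional
# Gaussians over a measurement x (e.g. fish length); the means and
# s.d. below are invented for illustration.
def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def p_x_given_y(x, y):
    return gaussian_pdf(x, 10.0, 2.0) if y == +1 else gaussian_pdf(x, 14.0, 2.0)

# Assumed prior P(y): 60% salmon (y=+1), 40% sea bass (y=-1).
def p_y(y):
    return 0.6 if y == +1 else 0.4

# 0-1 loss: pay 1 for a misclassification, 0 otherwise.
def loss(decision, y):
    return 0.0 if decision == y else 1.0

# Joint distribution P(x,y) = P(x|y) P(y).
def p_x_y(x, y):
    return p_x_given_y(x, y) * p_y(y)
```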

Page 5:

Minimize the Risk

• The risk of a decision rule f(x) is: $R(f) = \sum_{x,y} L(f(x), y)\, P(x, y)$ (a sum for discrete x, an integral for continuous x).

• Bayes Decision Rule $f^*(x)$: the rule that minimizes the risk, $f^* = \arg\min_f R(f)$.

• The Bayes Risk: $R(f^*)$, the smallest achievable risk.

Page 6:

Minimize the Risk.

• Write P(x,y) = P(y|x) P(x).

• Then we can write the Risk as: $R(f) = \sum_x P(x) \sum_y L(f(x), y)\, P(y|x)$.

• The best decision for each input x is $f^*(x) = \arg\min_{a \in \{+1,-1\}} \sum_y L(a, y)\, P(y|x)$: minimize the expected loss conditioned on x.
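A sketch of this pointwise rule, continuing the Python example from Page 4 (the common factor P(x) cancels inside the argmin, so unnormalized posteriors suffice):

```python
# Posterior P(y|x) up to the common factor P(x), which cancels
# when comparing decisions for a fixed x.
def unnormalized_posterior(y, x):
    return p_x_given_y(x, y) * p_y(y)

# Bayes decision rule: for each x, pick the decision with the
# smallest expected loss under P(y|x).
def f_star(x):
    def expected_loss(decision):
        return sum(loss(decision, y) * unnormalized_posterior(y, x)
                   for y in (+1, -1))
    return min((+1, -1), key=expected_loss)

print(f_star(9.0))   # near the salmon mean -> +1
print(f_star(15.0))  # near the sea bass mean -> -1
```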

Page 7:

Bayes Rule.

• Posterior distribution P(y|x): $P(y|x) = \dfrac{P(x|y)\, P(y)}{P(x)}$.

• Likelihood function P(x|y).

• Prior P(y).

• Bayes Rule has been controversial (historically) because of the Prior P(y) (subjective?).

• But in Bayes Decision Theory, everything starts from the joint distribution P(x,y).

Page 8:

Risk.

• The Risk is based on averaging over all possible x & y. Average Loss.

• Alternatively, can try to minimize the worst risk over x & y. Minimax Criterion.

• This course uses the Risk, or average loss.
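In symbols (following the slide's wording, where the minimax criterion takes the worst case over pairs (x, y)):

```latex
\text{Average loss (Risk):}\quad R(f) = \mathbb{E}_{(x,y)\sim P}\big[\,L(f(x),y)\,\big]
\qquad
\text{Minimax:}\quad \min_{f}\,\max_{x,y}\, L(f(x),y)
```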

Page 9:

Generative & Discriminative.

• Generative methods aim to determine probability models P(x|y) & P(y).

• Discriminative methods aim directly at estimating the decision rule f(x).

• Vapnik argues for Discriminative Methods: Don’t solve a harder problem than you need to. Only care about the probabilities near the decision boundaries.

Page 10:

Discriminant Functions.

• For the two-category case the Bayes decision rule depends on the discriminant function, the log-likelihood ratio: $g(x) = \log \dfrac{P(x|y=1)}{P(x|y=-1)}$.

• The Bayes decision rule is of the form: decide $y = 1$ if $g(x) > T$, else $y = -1$.

• Where T is a threshold, which is determined by the loss function (and by the prior P(y)).
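A sketch of the threshold rule, continuing the earlier Python example; the choice of T shown in the comment is the standard one for 0-1 loss:

```python
import math

# Log-likelihood-ratio discriminant g(x) = log P(x|y=1)/P(x|y=-1).
def discriminant(x):
    return math.log(p_x_given_y(x, +1)) - math.log(p_x_given_y(x, -1))

# Threshold rule: decide +1 if g(x) > T. Under 0-1 loss,
# T = log(P(y=-1)/P(y=+1)); with equal priors, T = 0.
def decide(x, T=0.0):
    return +1 if discriminant(x) > T else -1
```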

Page 11:

Two-State Case

• Detect “target” or “non-target”.

• Let the loss function pay a penalty of 1 for a misclassification, and 0 otherwise.

• Risk becomes Error. Bayes Risk becomes Bayes Error.

• Error is the sum of false positives F+ (non-targets classified as targets) and false negatives F- (targets classified as non-targets).
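Written out (taking y=+1 as "target" and $\Omega_+$ as the region where the rule decides "target"):

```latex
F^{+} = \int_{\Omega_{+}} P(x \mid y=-1)\,P(y=-1)\,dx, \qquad
F^{-} = \int_{\Omega \setminus \Omega_{+}} P(x \mid y=+1)\,P(y=+1)\,dx, \qquad
E = F^{+} + F^{-}.
```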

Page 12:

Gaussian Example: 1

• Is a bright light flashing?

• n is the number of photons emitted by the dim or bright light.

Page 13:

Gaussian Example: 2

• $P(n|\text{dim})$ and $P(n|\text{bright})$ are Gaussians with means $\mu_d, \mu_b$ and s.d. $\sigma_d, \sigma_b$.

• The Bayes decision rule selects "dim" if the log-likelihood ratio $\log \dfrac{P(n|\text{dim})}{P(n|\text{bright})}$ exceeds the threshold T;

• Errors: the false-positive and false-negative rates are Gaussian tail integrals on either side of the decision threshold.
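A numeric sketch of this example using only the Python standard library; the means, s.d., and equal priors are invented, and with equal s.d. and equal priors the threshold sits at the midpoint of the means:

```python
from statistics import NormalDist

# Illustrative parameters: photon counts for the two lights modeled
# as Gaussians with equal s.d.
mu_dim, mu_bright, sigma = 50.0, 70.0, 10.0
dim = NormalDist(mu_dim, sigma)
bright = NormalDist(mu_bright, sigma)

# Equal priors + equal s.d. => Bayes rule: decide "dim" if n < t,
# where t is the midpoint of the two means.
t = (mu_dim + mu_bright) / 2.0

p_false_dim = bright.cdf(t)          # bright light classified as "dim"
p_false_bright = 1.0 - dim.cdf(t)    # dim light classified as "bright"
bayes_error = 0.5 * (p_false_dim + p_false_bright)  # equal priors
print(t, bayes_error)
```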

Page 14:

Example: Multidimensional Gaussian Distributions.

• Suppose the two classes have Gaussian distributions for P(x|y).

• Different means but the same covariance.

• The discriminant function is a plane: for $P(x|y=\pm 1) = N(\mu_{\pm}, \Sigma)$, $g(x) = (\mu_{+} - \mu_{-})^T \Sigma^{-1} x + c$ is linear in x, so the decision boundary is a hyperplane.

• Alternatively, seek a planar decision rule without attempting to model the distributions.

• Only care about the data near the decision boundary.
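A sketch of the generative computation for this slide (numpy; the means and shared covariance are placeholders):

```python
import numpy as np

# Placeholder parameters for the two Gaussian classes (shared covariance).
mu1 = np.array([1.0, 2.0])
mu2 = np.array([3.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Linear discriminant g(x) = w.x + b with w = Sigma^{-1}(mu1 - mu2);
# the boundary g(x) = T is a plane.
w = np.linalg.solve(Sigma, mu1 - mu2)
b = -0.5 * (mu1 @ np.linalg.solve(Sigma, mu1)
            - mu2 @ np.linalg.solve(Sigma, mu2))

def g(x):
    return w @ x + b  # decide class 1 if g(x) > T (T from loss & prior)

print(g(np.array([2.0, 1.0])))
```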

Page 15:

Generative vs. Discriminative.

• The Generative approach will attempt to estimate the Gaussian distributions from data – and then derive the decision rule.

• The Discriminative approach will seek to estimate the decision rule directly by learning the discriminant plane.

• In practice, we will not know the form of the distributions or the form of the discriminant.

Page 16:

Gaussian.

• Gaussian case with unequal covariances: the discriminant function is quadratic in x, so the decision boundary is a curved (quadric) surface rather than a plane.

Page 17:

Discriminative Models & Features.

• In practice, Discriminative methods are usually defined on features extracted from the data (e.g. the length and brightness of a fish).

• Calculate features z=h(x).

• Bayes Decision Theory says that this throws away information in general (unless the features are a sufficient statistic).

• Restrict to a sub-class of possible decision rules – those that can be expressed in terms of features z=h(x).
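A toy sketch of a rule restricted to a feature class; the feature map, weights, and threshold are all invented for illustration:

```python
# Hypothetical feature map z = h(x) for a fish record x:
# z = (length, brightness). The decision rule sees only z.
def h(x):
    return (x["length"], x["brightness"])

# One member of the restricted rule class: a linear threshold on z.
def f_feat(x, w=(1.0, -2.0), T=0.5):
    z = h(x)
    return +1 if w[0] * z[0] + w[1] * z[1] > T else -1

print(f_feat({"length": 3.0, "brightness": 0.8}))
```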

Page 18:

Bayes Decision Rule and Learning.

• Bayes Decision Theory assumes that we know, or can learn, the distributions P(x|y).

• This is often not practical, or extremely difficult.

• In real problems, you have a set of classified data.

• You can attempt to learn P(x|y=+1) & P(x|y=-1) from these (next few lectures).

• Parametric & Non-parametric approaches.

• Question: when do you have enough data to learn these probabilities accurately?

• Depends on the complexity of the model.
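As a preview of the parametric case, a minimal sketch that fits a one-dimensional Gaussian to each class by maximum likelihood; the labeled data are invented:

```python
from statistics import mean, pstdev

# Hypothetical classified data: (x, y) pairs with y in {+1, -1}.
data = [(9.1, +1), (10.4, +1), (11.0, +1),
        (13.8, -1), (14.5, -1), (15.2, -1)]

# Parametric approach: maximum-likelihood Gaussian fit per class
# (sample mean and population s.d.).
def fit_class(label):
    xs = [x for x, y in data if y == label]
    return mean(xs), pstdev(xs)

print("P(x|y=+1) fitted mean, s.d.:", fit_class(+1))
print("P(x|y=-1) fitted mean, s.d.:", fit_class(-1))
```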

Page 19:

Machine Learning.

• Replace the Risk by the Empirical Risk over training samples $(x_i, y_i)$: $R_{\text{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i)$.

• How does minimizing the empirical risk relate to minimizing the true risk?

• Key Issue: when can we generalize, i.e. be confident that the decision rule we have learnt on the training data will yield good results on unseen data?
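The empirical risk in code, reusing loss, f_star, and data from the earlier sketches:

```python
# Empirical risk: average loss of a rule f over a finite labeled sample.
def empirical_risk(f, samples):
    return sum(loss(f(x), y) for x, y in samples) / len(samples)

print(empirical_risk(f_star, data))
```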

Page 20:

Machine Learning

• Vapnik’s theory gives a mathematically elegant way of addressing these questions.

• It assumes that the data is sampled from an unknown distribution.

• Vapnik’s theory gives bounds for when we can generalize.

• Unfortunately these bounds are very conservative.

• In practice, train on part of dataset and test on other part(s).

Page 21:

Extensions to Multiple Classes

The decision rule partitions the feature space $\Omega$ into k subspaces: $\Omega = \bigcup_{i=1}^{k} \Omega_i$, with $\Omega_i \cap \Omega_j = \emptyset$ for $i \neq j$.

(Figure: the feature space divided into numbered decision regions.)

Conceptually straightforward – see Duda, Hart & Stork.
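Under 0-1 loss the multi-class Bayes rule simply picks the class maximizing the posterior; a sketch with invented per-class densities and priors:

```python
from statistics import NormalDist

# Multi-class Bayes rule under 0-1 loss: pick the class y maximizing
# P(x|y) P(y) (equivalently, the posterior P(y|x)).
def bayes_multiclass(x, class_densities, priors):
    return max(priors, key=lambda y: class_densities[y](x) * priors[y])

# Illustrative 3-class example with 1-D Gaussian class-conditionals.
densities = {1: NormalDist(0, 1).pdf,
             2: NormalDist(3, 1).pdf,
             3: NormalDist(6, 1).pdf}
priors = {1: 0.5, 2: 0.3, 3: 0.2}
print(bayes_multiclass(2.0, densities, priors))
```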