Chapter 20 Classification and Estimation




• 20.2 Classification

– 20.2.1 Feature selection

• Good features have four characteristics:

– Discrimination. Features should take on significantly different values for objects belonging to different classes.

– Reliability. Features should take on similar values for all objects of the same class.

– Independence. The various features used should be uncorrelated with each other.

– Small numbers. The number of features should be small because the complexity of a pattern recognition system increases rapidly with the dimensionality of the system.


• Classifier design

– Classifier design consists of establishing the logical structure of the classifier and the mathematical basis of the classification rule.

• Classifier training

– A group of known objects is used to train a classifier to determine its threshold values.


– The training set is a collection of objects from each class that have been previously identified by some accurate method.

– Training rule: minimize an error function or a cost function.

– Pitfalls: an unrepresentative or biased training set produces a poorly trained classifier.


– 20.2.4 Measurement of performance

• A classifier's accuracy can be estimated directly by classifying a known test set of objects.

• An alternative is to use a test set of known objects to estimate the PDFs of the features for objects belonging to each group.

• Using a test set different from the training set is the better way to evaluate a classifier.
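To make the training/evaluation distinction concrete, here is a minimal sketch (not from the chapter; data and labels are made up) that trains a one-feature threshold classifier by minimizing training-set error and then estimates its accuracy on a separate test set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-feature training and test sets for two classes.
train_x = np.concatenate([rng.normal(2.0, 0.5, 100), rng.normal(4.0, 0.5, 100)])
train_y = np.concatenate([np.zeros(100), np.ones(100)])   # known class labels
test_x = np.concatenate([rng.normal(2.0, 0.5, 50), rng.normal(4.0, 0.5, 50)])
test_y = np.concatenate([np.zeros(50), np.ones(50)])

# Training: pick the threshold that minimizes the error on the training set.
candidates = np.sort(train_x)
errors = [np.mean((train_x > t) != train_y) for t in candidates]
threshold = candidates[int(np.argmin(errors))]

# Evaluation: accuracy is estimated on the separate test set, not the training set.
accuracy = np.mean((test_x > threshold) == test_y)
print(f"threshold = {threshold:.2f}, test accuracy = {accuracy:.2%}")
```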


20.3 Feature selection

• Feature selection is the process of eliminating some features and combining others that are related, until the feature set becomes manageable and performance is still adequate.

• The brute-force approach to feature selection evaluates every possible subset of features and keeps the best-performing one; it becomes impractical as the number of features grows.
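As a sketch of the brute-force approach, assuming some hypothetical scoring function `evaluate(subset)` that trains and tests a classifier on the chosen features, one can score every nonempty subset and keep the best:

```python
from itertools import combinations

def best_subset(features, evaluate):
    """Exhaustively score every nonempty subset of features.

    `evaluate` is assumed to return a classifier performance score
    (higher is better) for a given tuple of feature names.
    """
    best_score, best = float("-inf"), None
    for r in range(1, len(features) + 1):
        for subset in combinations(features, r):
            score = evaluate(subset)
            if score > best_score:
                best_score, best = score, subset
    return best, best_score

# With n features this loop scores 2**n - 1 subsets, which is why the
# brute-force approach is only practical for small feature sets.
```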


• For a training set containing objects from M different classes, let $N_j$ be the number of objects in class j, and let $x_{ij}$ and $y_{ij}$ be two features measured on the ith object of class j. The mean value of each feature within class j is

$$\hat{\mu}_{xj} = \frac{1}{N_j} \sum_{i=1}^{N_j} x_{ij} \qquad\qquad \hat{\mu}_{yj} = \frac{1}{N_j} \sum_{i=1}^{N_j} y_{ij}$$


• 20.3.1 Feature variance

– All objects within the same class should take on similar feature values. The variance of the features within class j is

$$\hat{\sigma}_{xj}^{2} = \frac{1}{N_j} \sum_{i=1}^{N_j} \left( x_{ij} - \hat{\mu}_{xj} \right)^{2} \qquad\qquad \hat{\sigma}_{yj}^{2} = \frac{1}{N_j} \sum_{i=1}^{N_j} \left( y_{ij} - \hat{\mu}_{yj} \right)^{2}$$
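The within-class mean and variance defined above translate directly into code; a minimal numpy sketch (feature values are made up):

```python
import numpy as np

def class_mean_and_variance(x_j):
    """Within-class mean and variance of one feature, as defined above."""
    n_j = len(x_j)
    mu = np.sum(x_j) / n_j                 # (1/N_j) * sum of x_ij
    var = np.sum((x_j - mu) ** 2) / n_j    # (1/N_j) * sum of (x_ij - mu)^2
    return mu, var

x_j = np.array([1.9, 2.1, 2.0, 2.2, 1.8])  # made-up feature values for class j
print(class_mean_and_variance(x_j))
```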


• 20.3.2 Feature correlation

– The correlation of the features x and y within class j is

$$\hat{\rho}_{xyj} = \frac{\frac{1}{N_j} \sum_{i=1}^{N_j} \left( x_{ij} - \hat{\mu}_{xj} \right) \left( y_{ij} - \hat{\mu}_{yj} \right)}{\hat{\sigma}_{xj}\, \hat{\sigma}_{yj}}$$

– A value of zero indicates that the two features are uncorrelated, while a magnitude near 1 implies a high degree of correlation.
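A sketch of the within-class correlation computation, reusing the class means and standard deviations defined above (inputs are assumed to be numpy arrays of the two feature values for class j):

```python
import numpy as np

def class_correlation(x_j, y_j):
    """Correlation of features x and y within class j, as defined above."""
    n_j = len(x_j)
    mu_x, mu_y = x_j.mean(), y_j.mean()
    sigma_x = np.sqrt(np.sum((x_j - mu_x) ** 2) / n_j)
    sigma_y = np.sqrt(np.sum((y_j - mu_y) ** 2) / n_j)
    cov = np.sum((x_j - mu_x) * (y_j - mu_y)) / n_j
    # 0 means uncorrelated; magnitude near 1 means highly correlated.
    return cov / (sigma_x * sigma_y)
```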


• 20.3.3 Class separation distance

– The variance-normalized distance between two classes j and k is

$$\hat{D}_{xjk} = \frac{\left| \hat{\mu}_{xj} - \hat{\mu}_{xk} \right|}{\sqrt{\hat{\sigma}_{xj}^{2} + \hat{\sigma}_{xk}^{2}}}$$

– The greater the distance, the better the feature separates the two classes.
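The variance-normalized distance is likewise a one-liner once the class statistics are available; a sketch with made-up statistics:

```python
import numpy as np

def separation_distance(mu_j, var_j, mu_k, var_k):
    """Variance-normalized distance between classes j and k for one feature."""
    return abs(mu_j - mu_k) / np.sqrt(var_j + var_k)

# Example with made-up statistics: well-separated classes give a large D.
print(separation_distance(2.0, 0.25, 4.0, 0.25))   # -> about 2.83
```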


• 20.3.4 Dimension reduction

– Many features can be combined to form a smaller number of features.

– Linear combination. Two features x and y can produce a new feature z by

$$z = ax + by$$

which can be reduced to

$$z = x \cos(\theta) + y \sin(\theta)$$
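As an illustrative sketch, the angle θ can be chosen by searching for the projection that maximizes the class separation distance of the new feature z; this search criterion is an assumption for illustration, since the text only defines the combination itself:

```python
import numpy as np

def project(x, y, theta):
    """New feature z as the projection of (x, y) onto a line at angle theta."""
    return x * np.cos(theta) + y * np.sin(theta)

def best_angle(x1, y1, x2, y2, steps=180):
    """Pick theta maximizing the variance-normalized separation of z."""
    best_theta, best_d = 0.0, -np.inf
    for theta in np.linspace(0.0, np.pi, steps):
        z1, z2 = project(x1, y1, theta), project(x2, y2, theta)
        d = abs(z1.mean() - z2.mean()) / np.sqrt(z1.var() + z2.var())
        if d > best_d:
            best_theta, best_d = theta, d
    return best_theta
```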


– This is a projection of the (x, y) plane onto the line z.

[Figure: two clusters, Class 1 and Class 2, in the (x, y) plane, projected onto a line z.]


20.4 Statistical Classification

• 20.4.1 Statistical decision theory

– An approach that performs classification by statistical methods. The PDFs of the features are assumed to be known.

– The PDF of a feature may be estimated by measuring a large number of objects and plotting a histogram of the feature.
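A sketch of the histogram approach with made-up sample data, using numpy's density-normalized histogram as the PDF estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.normal(3.0, 1.0, 10_000)   # many measurements of one feature

# density=True scales bin counts so the histogram integrates to 1,
# giving a piecewise-constant estimate of the feature's PDF.
pdf, edges = np.histogram(samples, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
```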


– 20.4.1.1 A priori probabilities

• The a priori probabilities represent our knowledge about an object before it has been measured.

• The conditional probability $P(E_1|E_2)$ is the probability of event $E_1$, given that event $E_2$ occurs.


• 20.4.1.2 Bayes' theorem

– The a posteriori probability is the conditional probability $P(C_i|x)$, the probability that the object belongs to class $C_i$, given that the feature value x occurs.

– Bayes' theorem (two classes):

$$P(C_i|x) = \frac{p(x|C_i)\, P(C_i)}{p(x)} = \frac{p(x|C_i)\, P(C_i)}{\sum_{i=1}^{2} p(x|C_i)\, P(C_i)}$$


• Bayes' theorem may be applied to pattern classification. For example, when there are only two classes, an object is assigned to class 1 if

$$P(C_1|x) > P(C_2|x)$$

• This is equivalent to

$$p(x|C_1)\, P(C_1) > p(x|C_2)\, P(C_2)$$

• The classifier defined by this decision rule is called a maximum-likelihood classifier.
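A sketch of this two-class decision rule, assuming normal conditional PDFs with illustrative parameters (the normal assumption and all numbers are mine, not the chapter's):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def classify(x, priors=(0.5, 0.5), params=((2.0, 0.5), (4.0, 0.5))):
    """Assign x to class 1 if p(x|C1)P(C1) > p(x|C2)P(C2), else class 2."""
    s1 = normal_pdf(x, *params[0]) * priors[0]
    s2 = normal_pdf(x, *params[1]) * priors[1]
    return 1 if s1 > s2 else 2

print(classify(2.8))   # -> 1 (closer to the class-1 mean)
print(classify(3.4))   # -> 2
```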


– If there is more than one feature, the feature vector is $x = [x_1, x_2, \ldots, x_n]^T$, and if there are m classes, Bayes' theorem becomes

$$p(C_i|x_1, x_2, \ldots, x_n) = \frac{p(x_1, x_2, \ldots, x_n|C_i)\, p(C_i)}{\sum_{i=1}^{m} p(x_1, x_2, \ldots, x_n|C_i)\, p(C_i)}$$

• Bayes' risk. The conditional risk is

$$R(C_i|x_1, x_2, \ldots, x_n) = \sum_{j=1}^{m} l_{ij}\, p(C_j|x_1, x_2, \ldots, x_n)$$

where $l_{ij}$ is the cost (loss) of assigning an object to class i when it really belongs in class j.
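A sketch of the minimum-risk decision: given the a posteriori probabilities for m classes and a loss matrix $l_{ij}$ (all values made up), compute each conditional risk and pick the class that minimizes it:

```python
import numpy as np

# Hypothetical a posteriori probabilities p(C_j | x) for m = 3 classes.
posterior = np.array([0.2, 0.5, 0.3])

# loss[i, j]: cost of deciding class i when the true class is j.
loss = np.array([[0.0, 1.0, 4.0],
                 [1.0, 0.0, 2.0],
                 [2.0, 1.0, 0.0]])

risk = loss @ posterior          # R(C_i|x) = sum_j l_ij * p(C_j|x)
decision = int(np.argmin(risk))  # Bayes' decision rule: minimum conditional risk
print(risk, "-> assign class", decision + 1)
```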


– Bayes' decision rule. Each object should be assigned to the class that produces the minimum conditional risk. The Bayes risk, the expected value of the minimum conditional risk $R_m$ over the feature space, is

$$R = \int \cdots \int R_m(x_1, \ldots, x_n)\, p(x_1, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n$$

– Parametric and nonparametric classifiers

• If the functional form of the conditional PDFs is known, but some parameters are unknown, the classifier is called parametric.

• If the functional form of some or all of the conditional PDFs is unknown, the classifier is called nonparametric.


20.4.3 Parameter estimation and classifier training

• The process of estimating the conditional PDFs or their parameters is referred to as training the classifier.

• Supervised and unsupervised training

– Supervised training. The class to which each object in the training set belongs is known.

– Unsupervised training. The conditional PDFs are estimated using samples whose class is unknown.


• Maximum-likelihood estimation

– The maximum-likelihood estimation approach assumes that the parameters to be estimated are fixed but unknown.

– The maximum-likelihood estimate of a parameter is the value that makes the occurrence of the observed training set most likely.

– The maximum-likelihood estimates of the mean and standard deviation of a normal distribution are the sample mean and sample standard deviation, respectively.
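A minimal sketch with made-up normal samples; numpy's default `mean` and `std` (with ddof = 0) are exactly the maximum-likelihood estimates:

```python
import numpy as np

rng = np.random.default_rng(2)
samples = rng.normal(5.0, 2.0, 1_000)   # training samples from one class

mu_hat = samples.mean()                 # ML estimate of the mean
sigma_hat = samples.std()               # ML estimate of the standard deviation
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```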


20.4.3.3 Bayesian Estimation

– Bayesian estimation treats the unknown parameter as a random variable that has a known a priori PDF before any samples are taken.

– After the training set has been measured, Bayes' theorem is used to update the a priori PDF, which results in an a posteriori PDF of the unknown parameter value.

– Ideally, the a posteriori PDF has a single narrow peak centered on the true value of the parameter.


• An example of Bayesian estimation

– Estimate the mean $\mu$ of a normal distribution with known variance. The a priori PDF is $p(\mu)$.

– The functional form of the PDF conditioned on the unknown mean, $p(x|\mu)$, is assumed known; this means that, given a value for $\mu$, we know $p(x)$.

– Suppose $X$ represents the set of sample values obtained by measuring the training set.


– Bayes' theorem gives the a posteriori PDF

$$p(\mu|X) = \frac{p(X|\mu)\, p(\mu)}{\int p(X|\mu)\, p(\mu)\, d\mu}$$

– What we really want is

$$p(x|X) = \int p(x, \mu|X)\, d\mu = \int p(x|\mu)\, p(\mu|X)\, d\mu$$

– For example, if $p(\mu|X)$ has a single sharp peak at $\mu_0$, it can be approximated as an impulse, $p(\mu|X) \approx \delta(\mu - \mu_0)$.


– Then

$$p(x|X) = \int p(x|\mu)\, \delta(\mu - \mu_0)\, d\mu = p(x|\mu_0)$$

This means that $\mu_0$ is the best estimate of the unknown mean.

– If $p(\mu|X)$ has a relatively broad peak, then $p(x|X)$ becomes a weighted average of many PDFs.

– Both maximum-likelihood and Bayesian estimation place the unknown mean at the mean of a large training set.


• Steps of Bayesian estimation

– 1. Assume an a priori PDF for the unknown parameter.

– 2. Collect sample values from the population by measuring the training set.

– 3. Use Bayes' theorem to refine the a priori PDF into the a posteriori PDF.

– 4. Form the joint density of x and the unknown parameter, and integrate out the latter to leave the desired estimate of the PDF.
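For the running example (normal data with known variance and a normal a priori PDF on the mean), the a posteriori PDF has a closed form, so the four steps reduce to a few lines. This conjugate-update sketch uses made-up numbers:

```python
import numpy as np

# Step 1: a priori PDF of the unknown mean, assumed normal(mu0, tau0^2).
mu0, tau0 = 0.0, 10.0          # broad prior: weak ideas about the mean
sigma = 2.0                    # known standard deviation of the data

# Step 2: sample values X measured from the training set (made up here).
rng = np.random.default_rng(3)
X = rng.normal(5.0, sigma, 50)

# Step 3: Bayes' theorem -> a posteriori PDF is normal(mu_n, tau_n^2).
n = len(X)
tau_n2 = 1.0 / (1.0 / tau0**2 + n / sigma**2)
mu_n = tau_n2 * (mu0 / tau0**2 + X.sum() / sigma**2)

# Step 4: with a sharply peaked posterior, p(x|X) ~ p(x|mu_n),
# i.e. mu_n serves as the estimate of the unknown mean.
print(f"posterior mean = {mu_n:.3f}, posterior std = {np.sqrt(tau_n2):.3f}")
```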


– If we have strong ideas about the probable values of the unknown parameter, we may assume a narrow a priori PDF; otherwise, we should assume a relatively broad one.


20.5 Neural Networks

• Layered feedforward neural networks

– Each neuron k forms a weighted sum of its inputs and passes it through an activation function:

$$S_k = \sum_{j} w_{kj}\, x_j \qquad\qquad x_k = g(S_k)$$

where the activation function $g[\cdot]$ is usually a sigmoidal function.

– For a single neuron with inputs $x_1, \ldots, x_N$ and weights $w_1, \ldots, w_N$, the output is

$$O = g\!\left[\sum_{i=1}^{N} x_i\, w_i\right] = g\!\left[X^T W\right] = g[S]$$

[Figure: a single neuron with inputs $x_1, x_2, \ldots, x_N$, weights $w_{k1}, w_{k2}, \ldots, w_{kN}$, and output y.]
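A sketch of one feedforward layer implementing the equations above, with random weights and a made-up input; a real network would stack such layers and train the weights:

```python
import numpy as np

def sigmoid(s):
    """Sigmoidal activation function g[.]"""
    return 1.0 / (1.0 + np.exp(-s))

def layer(x, W):
    """One feedforward layer: S_k = sum_j w_kj * x_j, output x_k = g(S_k)."""
    return sigmoid(W @ x)

rng = np.random.default_rng(4)
x = np.array([0.5, -1.2, 3.0])   # input feature vector
W = rng.normal(size=(2, 3))      # weights w_kj for 2 neurons, 3 inputs
print(layer(x, W))               # outputs of the 2 neurons
```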
