Lecture 1


  • Pattern Recognition

    P.S. Sastry
    [email protected]

    Dept. of Electrical Engineering, Indian Institute of Science, Bangalore

  • Reference Books

    R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, John Wiley, 2002.

    C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

    C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Indian Edition, 2003.

    R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley, 1973.

  • Pattern Recognition

    A basic attribute of people: categorisation of sensory input.

    Pattern → PR System → Class label

    Examples of Pattern Recognition tasks: reading facial expressions, recognising speech, reading a document, identifying a person by fingerprints, diagnosis from medical images, wine tasting.

  • Machine Recognition of Patterns

    pattern → feature extractor → X → classifier → class label

    The feature extractor makes some measurements on the input pattern.

    X is called the Feature Vector. Often, X is a vector of real numbers (X ∈ ℝ^d).

  • Some Examples of PR Tasks

    Character Recognition. Pattern: image. Class: identity of the character. Features: binary image, projections (e.g., row and column sums), moments etc.

  • Some Examples of PR Tasks

    Speech Recognition. Pattern: 1-D signal (or its sampled version). Class: identity of speech units. Features: LPC model of chunks of speech, spectral info, cepstrum etc. The pattern can become a sequence of feature vectors.

  • Examples contd...

    Fingerprint-based identity verification. Pattern: image plus an identity claim. Class: Yes / No. Features: position of minutiae, orientation field of ridge lines etc.

  • Examples contd...

    Video-based Surveillance. Pattern: video sequence. Class: e.g., level of alertness. Features: motion trajectories, parameters of a prefixed model etc.

  • Examples contd...

    Credit Screening. Pattern: details of an applicant (for, e.g., a credit card). Class: Yes / No. Features: income, job history, level of credit, credit history etc.

  • Examples contd...

    Imposter detection (of, e.g., a credit card). Pattern: a sequence of transactions. Class: Yes / No. Features: amount of money, locations of transactions, times between transactions etc.

  • Examples contd...

    Document Classification. Pattern: a document and a query. Class: relevant or not (in general, a rank). Features: word occurrence counts, word context etc. Related tasks: spam filtering, diagnostics of machinery etc.

  • Design of Pattern Recognition Systems

    Features depend on the problem. Measure relevant quantities.

    Some techniques are available to extract more relevant quantities from the initial measurements (e.g., PCA; a small sketch follows).

    After feature extraction each pattern is a vector. The classifier is a function that maps such vectors into class labels.

    Many general techniques of classifier design are available. Need to test and validate the final system.
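
    PCA, mentioned above, derives new features as linear combinations of the original measurements that capture most of their variance. A minimal sketch of the idea; the function and the toy data below are illustrative assumptions, not from the lecture:

```python
import numpy as np

def pca_project(X, k):
    """Project each row of X (one feature vector per row) onto the
    top-k principal components of the data."""
    Xc = X - X.mean(axis=0)                   # centre the data
    cov = np.cov(Xc, rowvar=False)            # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k directions
    return Xc @ top                           # reduced (n_samples, k) features

# toy usage: 100 random 5-dimensional patterns reduced to 2 features
X = np.random.randn(100, 5)
print(pca_project(X, 2).shape)   # (100, 2)
```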

  • Some notation

    Feature Space, X: the set of all possible feature vectors.

    Classifier: a decision rule or a function, h : X → {1, ..., M}, where M is the number of classes.

    Often, X = ℝ^d, the space of d-dimensional real vectors.

  • We first consider the 2-class problem.

    Can handle the M > 2 case if we know how to handle the 2-class problem.

    Simplest alternative: design M 2-class classifiers, One vs Rest (see the sketch below).

    There are other possibilities: e.g., tree-structured classifiers.

    The 2-class problem is the basic problem.

    We will also look at M-class classifiers.
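
    One vs Rest combines M two-class decisions into one M-class decision. A minimal sketch of that combination step, with made-up linear scorers standing in for trained 2-class classifiers:

```python
from typing import Callable, Sequence

def one_vs_rest_predict(x, scorers: Sequence[Callable]) -> int:
    """scorers[m](x) returns a real-valued score for 'class m vs rest';
    predict the class whose 2-class scorer is most confident about x."""
    scores = [s(x) for s in scorers]
    return max(range(len(scores)), key=lambda m: scores[m])

# toy usage: three hand-made linear scorers on a 2-d feature vector
scorers = [
    lambda x: x[0] - x[1],       # class 0 vs rest
    lambda x: x[1] - x[0],       # class 1 vs rest
    lambda x: -(x[0] + x[1]),    # class 2 vs rest
]
print(one_vs_rest_predict((2.0, 0.5), scorers))   # -> 0
```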

  • A simple PR problem

    Problem: spot the right candidate. Features:
      x1: marks based on academic record
      x2: marks in the interview

    A Classifier: a x1 + b x2 > c ⇒ Good (a small sketch of this rule follows below).
    Another Classifier: x1 x2 > c ⇒ Good (or (x1 + a)(x2 + b) > c).

    Design of classifier: we have to choose a specific form for the classifier. What values to use for parameters such as a, b, c?
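
    A minimal sketch of the first classifier above; the parameter values a, b, c are whatever the designer chooses, and the numbers used here are made up:

```python
def is_good_candidate(x1: float, x2: float,
                      a: float = 1.0, b: float = 2.0, c: float = 150.0) -> bool:
    """Classifier of the form a*x1 + b*x2 > c  =>  'Good'.

    x1: marks based on academic record, x2: marks in the interview.
    """
    return a * x1 + b * x2 > c

print(is_good_candidate(70, 50))   # 70 + 100 = 170 > 150 -> True ('Good')
print(is_good_candidate(40, 30))   # 40 + 60  = 100 > 150 -> False
```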

  • Designing Classifiers

    Need to decide how feature vector values determine the class. (How do different marks reflect the goodness of a candidate?)

    In most applications, it is not possible to design the classifier from the physics of the problem.

    The difficulties are:
      Lot of variability in patterns of a single class
      Variability in feature vector values
      Feature vectors of patterns from different classes can be arbitrarily close
      Noise in measurements

  • Designing Classifiers contd...

    Often the only information available for the design is a training set of example patterns.

    Training set: {(X_i, y_i), i = 1, ..., ℓ}. Here X_i is an example feature vector of class y_i.

    Generation of the training set: take representative patterns of known category (data collection) and obtain their feature vectors (choice of feature measurements).

    Now learn an appropriate function h as the classifier (model choice).

    Test and validate the classifier on more data (a small sketch of this follows).
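
    The "test and validate on more data" step is often done by holding out part of the collected examples. A minimal sketch, assuming a list of (feature vector, label) pairs and using a made-up rule in place of a learned classifier h:

```python
import random

def train_test_split(data, test_fraction=0.25, seed=0):
    """Randomly split a list of (X, y) examples into train and test parts."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(h, examples):
    """Fraction of examples on which classifier h predicts the given label."""
    return sum(h(X) == y for X, y in examples) / len(examples)

# toy data: ((x1, x2), label) pairs and a hand-made classifier h
data = [((70, 50), 1), ((40, 30), 0), ((80, 60), 1), ((35, 45), 0),
        ((65, 55), 1), ((30, 20), 0), ((75, 40), 1), ((45, 35), 0)]
h = lambda X: int(X[0] + 2 * X[1] > 150)
train, test = train_test_split(data)
print(accuracy(h, train), accuracy(h, test))
```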

  • A simple PR problem

    Problem: spot the right candidate. Features:
      x1: marks based on academic record
      x2: marks in the interview

    A Classifier: a x1 + b x2 > c ⇒ Good. We have chosen a specific form for the classifier.

    Design of classifier: what values to use for a, b, c? Information available: experience / history of past candidates.

  • Training Set

    (Figure: the training examples plotted in the x1–x2 plane.)

  • Another example problem

    Problem: recognize persons of medium build. Features: height and weight.

    The classifier is nonlinear here (an illustrative sketch follows).
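
    For instance, a classifier for this problem might accept (height, weight) pairs lying near a nominal "medium build" point, which is a nonlinear rule in the two features. This is only an illustrative sketch; the nominal values and threshold are made up, not from the lecture:

```python
def is_medium_build(height_cm: float, weight_kg: float,
                    h0: float = 170.0, w0: float = 70.0, r: float = 1.5) -> bool:
    """Nonlinear rule: accept if (height, weight) lies inside an ellipse
    around a nominal medium-build point (h0, w0).  All numbers are made up."""
    return ((height_cm - h0) / 10.0) ** 2 + ((weight_kg - w0) / 8.0) ** 2 < r ** 2

print(is_medium_build(172, 68))   # close to the nominal point -> True
print(is_medium_build(195, 110))  # far from it -> False
```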

  • Learning from Examples

    Designing a classifier is a typical problem of learning from examples (also called learning with a teacher).

    The nature of the feedback from the teacher determines the difficulty of the learning problem.

  • In the context of Pattern Recognition

    feature vector → Classifier → class label, with feedback from a Teacher

    Supervised learning – the teacher gives the true class label for each feature vector.

    Reinforcement learning – noisy assessment of performance, e.g., correct/incorrect.

    Unsupervised learning – no teacher input (clustering problems).

  • When the class labels of training patterns as given by the teacher are noisy, we consider it as supervised learning with label noise or classification noise.

    Many classifier design algorithms do supervised learning.

  • Function Learning

    A closely related problem. The output is continuous-valued rather than discrete as in classifiers.

    Here the training set examples could be {(X_i, y_i), i = 1, ..., ℓ}, X_i ∈ X, y_i ∈ ℝ.

  • Examples of Function Learning

    Time series prediction: given a series x1, x2, ..., find a function to predict x_n.

    Based on past values: find a best function x_n = h(x_{n-1}, x_{n-2}, ..., x_{n-p}). Predict stock prices, exchange rates etc. The linear prediction model is used in speech analysis (a least-squares sketch follows).

    More general predictors can use other variables also. Predict rainfall based on measurements and (possibly) previous years' data. In general, System Identification. (An application: smart sensors.)
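
    As an illustration of the linear prediction model above, here is a small least-squares fit of x_n ≈ a_1 x_{n-1} + ... + a_p x_{n-p}. The fitting procedure and the toy series are illustrative assumptions, not the lecture's prescription:

```python
import numpy as np

def fit_linear_predictor(x, p):
    """Least-squares fit of x[n] ~ a[0]*x[n-1] + ... + a[p-1]*x[n-p]."""
    x = np.asarray(x, dtype=float)
    A = np.array([x[n - p:n][::-1] for n in range(p, len(x))])  # rows of past values
    b = x[p:]                                                   # values to be predicted
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a

def predict_next(x, a):
    """One-step prediction from the last p values of the series."""
    past = np.asarray(x[-len(a):], dtype=float)[::-1]   # x[n-1], ..., x[n-p]
    return float(a @ past)

# toy series generated by x[n] = 0.5*x[n-1] + 0.3*x[n-2]
series = [1.0, 0.8]
for _ in range(60):
    series.append(0.5 * series[-1] + 0.3 * series[-2])
a = fit_linear_predictor(series, p=2)
print(a)                       # recovers roughly [0.5, 0.3]
print(predict_next(series, a))
```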

  • Examples contd... : Equaliser

    Tx → x(k) → channel → Z(k) → filter → y(k) → Rx

    We want y(k) = x(k). Design (or adapt) the filter to achieve this. We can choose a filter of the form

        y(k) = Σ_{i=1}^{T} a_i Z(k − i)

    Finding the best a_i is a function learning problem (an adaptive sketch follows).
    Training set: {(x(k), Z(k)), k = 1, 2, ..., N}.
    How do we know x(k) at the receiver end? Prior agreements (protocols).
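
    The course outline later mentions LMS; as one illustrative way to "adapt the filter", here is a minimal LMS-style sketch. It is an assumption-laden sketch, not the lecture's algorithm: the sum here starts at i = 0 (so the current received sample Z(k) is also used), and the channel, step size and training symbols are made up.

```python
import numpy as np

def lms_equaliser(x, Z, T, mu=0.05):
    """Adapt FIR taps a(0..T-1) so that y(k) = sum_i a[i]*Z(k-i) tracks x(k).

    x : known transmitted training symbols, Z : received samples, T : number
    of taps, mu : LMS step size.  (The current sample Z(k) is included.)"""
    x, Z = np.asarray(x, float), np.asarray(Z, float)
    a = np.zeros(T)
    for k in range(T - 1, len(x)):
        past = Z[k - T + 1:k + 1][::-1]   # Z(k), Z(k-1), ..., Z(k-T+1)
        e = x[k] - a @ past               # error against the known training symbol
        a += mu * e * past                # LMS update of the tap weights
    return a

# toy usage: a 2-tap channel and random +/-1 training symbols
rng = np.random.default_rng(0)
x = rng.choice([-1.0, 1.0], size=500)
Z = 0.8 * x + 0.4 * np.concatenate(([0.0], x[:-1]))   # Z(k) = 0.8 x(k) + 0.4 x(k-1)
print(lms_equaliser(x, Z, T=3))   # taps roughly approximating the channel inverse
```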

  • Learning from examples

    Learning from examples is basic to both classification and regression.

    Training set: {(X_1, y_1), ..., (X_ℓ, y_ℓ)}. We essentially want to fit a best function: y = f(X).

  • Learning from examples – Generalization

    To obtain a classifier (or a regression function) we use the training set.

    We know the class labels of patterns (or the values of the prediction variable) in the training set.

    Errors on the training set do not necessarily tell how good the classifier is.

    Any classifier that amounts to only storing the training set is useless.

    We are interested in the generalization abilities – how does our classifier perform on unseen or new patterns?

  • Design of Classifiers

    The classifier should perform well in spite of the inherent variability of patterns and noise in feature extraction and/or in the class labels as given in the training set.

    Statistical Pattern Recognition – an approach where the variabilities are captured through probabilistic models.

    There are other approaches, e.g., syntactic pattern recognition, fuzzy-set based methods etc.

    In this course we consider classification and regression (function learning) problems in the statistical framework.

  • Statistical Pattern Recognition

    X is the feature space. (We take X = ℝ^d.)

    The feature vectors of class-i are modelled as random, distributed according to a probability density f_i(X) over X, called the class conditional density for class-i.

  • Class conditional densities model the variability in the feature values.

    For example, the two classes can be uniformly distributed in the two regions as shown. (The two classes are separable here.)

    (Figure: two non-overlapping regions, labelled Class 1 and Class 2.)

  • When class regions are separable, an important special case is linear separability.

    (Figure: two panels, each showing regions labelled Class 1 and Class 2.)

    The classes in the left panel above are linearly separable (can be separated by a line) while those in the right panel are not linearly separable (though separable).

  • In general, the two class conditional densities can overlap. (The same value of the feature vector can arise from different classes, with different probabilities.)

    (Figure: two overlapping class conditional densities, labelled Class 1 and Class 2.)

  • The statistical viewpoint gives us one way of looking for an optimal classifier.

    We can say we want a classifier that has the least probability of misclassifying a random pattern (drawn from the underlying distributions).

  • Let q_i(X) = Prob[class = i | X], i = 0, 1.

    q_i is called the posterior probability (function) for class-i.

    Consider the classifier

        h(X) = 0 if q0(X) > q1(X)
             = 1 otherwise

    q0(X) > q1(X) would imply that the feature vector X is more likely to have come from class-0 than from class-1.

    Hence, intuitively, such a classifier should minimize the probability of error in classification.

  • Statistical Pattern Recognition

    X (= ℝ^d) is the feature space. A random pattern gives a random feature vector X; its true class label is denoted y(X).

    This lets us rate a classifier h by how often h(X) differs from y(X).

  • For example, we can rate different classifiers by

        F(h) = Prob[h(X) ≠ y(X)]

    F(h) is the probability that h misclassifies a random X.

    The optimal classifier would be one with the lowest value of F.

    Given an h, we can calculate F(h) only if we know the probability distributions of the classes.

    Minimizing F is not a straightforward optimization problem.

    Here we are treating all errors as the same. We will generalize the framework later. (A small simulation sketch of estimating F(h) follows.)
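
    If the class distributions are known (or can be sampled), F(h) can be estimated by simulation. A minimal sketch, assuming for illustration two Gaussian class conditional densities and equal priors; none of these specific choices come from the lecture:

```python
import numpy as np

def estimate_error(h, sample_class0, sample_class1, p0=0.5, n=50000, seed=0):
    """Monte Carlo estimate of F(h) = Prob[h(X) != y(X)].

    sample_class0/1(rng) draws one feature value from the corresponding
    class conditional density; p0 is the prior probability of class 0."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(n):
        y = 0 if rng.random() < p0 else 1                   # draw the true class
        X = sample_class0(rng) if y == 0 else sample_class1(rng)
        errors += (h(X) != y)
    return errors / n

# toy setup: class 0 ~ N(0,1), class 1 ~ N(2,1), equal priors,
# and the threshold classifier h(X) = 1 iff X > 1
h = lambda X: int(X > 1.0)
print(estimate_error(h,
                     lambda rng: rng.normal(0.0, 1.0),
                     lambda rng: rng.normal(2.0, 1.0)))     # close to 0.159
```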

  • Statistical PR contd.

    Recall that f_i(X) denotes the probability density function of the feature vectors of class-i (the class conditional densities).

    Let p_i = Prob[y(X) = i]. These are called the prior probabilities.

    Recall the posterior probabilities, q_i(X) = Prob[y(X) = i | X].

    Now, by Bayes theorem,

        q_i(X) = f_i(X) p_i / Z

    where Z = f0(X) p0 + f1(X) p1 is the normalising constant.
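
    As a purely illustrative numerical check of the formula (the numbers are made up, not from the lecture): if p0 = 0.6, p1 = 0.4 and, at some X, f0(X) = 0.2 and f1(X) = 0.5, then Z = 0.6 × 0.2 + 0.4 × 0.5 = 0.32, so q0(X) = 0.12/0.32 = 0.375 and q1(X) = 0.20/0.32 = 0.625. Note that q0(X) + q1(X) = 1, and that here the posteriors favour class-1 even though the priors favour class-0.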

  • Bayes Classifier

    Consider the classifier given by

        h(X) = 0 if q0(X)/q1(X) > 1
             = 1 otherwise

    This is called the Bayes classifier. Given our statistical framework, this is the optimal classifier.

    q0(X) > q1(X) is the same as p0 f0(X) > p1 f1(X).

    We will prove the optimality of the Bayes classifier later. (A small sketch of the rule for one concrete choice of densities follows.)
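
    A minimal sketch of this rule for one concrete, made-up choice: one-dimensional Gaussian class conditional densities with known means, variances and priors. The densities and numbers are illustrative assumptions, not from the lecture:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def bayes_classify(x, p0, f0, p1, f1):
    """Bayes classifier: class 0 iff q0(x) > q1(x), i.e. p0*f0(x) > p1*f1(x)."""
    return 0 if p0 * f0(x) > p1 * f1(x) else 1

# made-up class conditional densities: class 0 ~ N(0,1), class 1 ~ N(2,1)
f0 = lambda x: gaussian_pdf(x, 0.0, 1.0)
f1 = lambda x: gaussian_pdf(x, 2.0, 1.0)
p0, p1 = 0.5, 0.5

print(bayes_classify(0.4, p0, f0, p1, f1))   # -> 0 (closer to the class-0 mean)
print(bayes_classify(1.7, p0, f0, p1, f1))   # -> 1 (closer to the class-1 mean)
```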

  • story so far

    We consider PR as a two-step process: feature measurement/extraction and classification.

    A classifier maps feature vectors to class labels.

    The main difficulty in designing classifiers is the large variability in feature vector values.

    The main information we have for the design is a training set of examples.

    Function learning is a closely related problem.

    In both cases we need to learn from (training) examples.

  • story so far

    The statistical view is: model the variation in feature vectors from a given class through probability models.

    One objective can be: find a classifier that has the least probability of error.

    For example, the Bayes classifier is the optimal one if we know the class conditional densities.

  • Organization of the course

    Overview of classifier learning strategies
    Nearest Neighbour classification rule
    Bayes Classifier for minimizing risk
    Some variations on this theme (e.g., Neyman-Pearson Classifier)
    Techniques for estimating class conditional densities to implement the Bayes classifier

  • Organization of the course

    Linear discriminant functions
    Learning linear discriminant functions (Perceptron, LMS, logistic regression)
    Linear least-squares regression
    Simple overview of statistical learning theory
    Empirical risk minimization and VC theory

  • Organization of the course

    Learning nonlinear classifiers
    Neural Networks (Backpropagation, RBF networks)
    Support Vector Machines
    Kernel-based methods
    Feature extraction and dimensionality reduction (PCA)
    Boosting and ensemble classifiers (AdaBoost)

  • Thank You!

