Lecture 1


  • Pattern Recognition

    P.S. Sastry
    [email protected]

    Dept. of Electrical Engineering, Indian Institute of Science, Bangalore

  • Reference Books

    R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, John Wiley, 2002.

    C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

    C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Indian Edition, 2003.

    R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley, 1973.

  • Pattern Recognition

    A basic attribute of people: categorisation of sensory input.

    Pattern → PR System → Class label

    Examples of Pattern Recognition tasks: reading facial expressions, recognising speech, reading a document, identifying a person by fingerprints, diagnosis from medical images, wine tasting.

  • Machine Recognition of Patterns

    pattern → feature extractor → X → classifier → class label

    The feature extractor makes some measurements on the input pattern.

    X is called the Feature Vector. Often, X is a vector of real numbers (X ∈ ℝ^d).

  • Some Examples of PR Tasks

    Character Recognition. Pattern: image. Class: identity of the character. Features: binary image, projections (e.g., row and column sums), moments etc.

  • Some Examples of PR Tasks

    Speech Recognition. Pattern: 1-D signal (or its sampled version). Class: identity of speech units. Features: LPC model of chunks of speech, spectral info, cepstrum etc. The pattern can become a sequence of feature vectors.

  • Examples contd...

    Fingerprint-based identity verification. Pattern: image plus an identity claim. Class: Yes / No. Features: position of minutiae, orientation field of ridge lines etc.

  • Examples contd...

    Video-based Surveillance. Pattern: video sequence. Class: e.g., level of alertness. Features: motion trajectories, parameters of a prefixed model etc.

  • Examples contd...

    Credit Screening. Pattern: details of an applicant (for, e.g., a credit card). Class: Yes / No. Features: income, job history, level of credit, credit history etc.

  • Examples contd...

    Imposter detection (of, e.g., a credit card). Pattern: a sequence of transactions. Class: Yes / No. Features: amount of money, locations of transactions, times between transactions etc.

  • Examples contd...

    Document Classification. Pattern: a document and a query. Class: relevant or not (in general, a rank). Features: word occurrence counts, word context etc. Related tasks: spam filtering, diagnostics of machinery etc.

  • Design of Pattern Recognition Systems

    Features depend on the problem. Measure relevant quantities.

    Some techniques are available to extract more relevant quantities from the initial measurements (e.g., PCA; a small sketch follows).

    After feature extraction each pattern is a vector. The classifier is a function that maps such vectors into class labels.

    Many general techniques of classifier design are available. Need to test and validate the final system.
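
    PCA, mentioned above, derives new features as linear combinations of the original measurements that capture most of their variance. A minimal sketch of the idea; the function and the toy data below are illustrative assumptions, not from the lecture:

```python
import numpy as np

def pca_project(X, k):
    """Project each row of X (one feature vector per row) onto the
    top-k principal components of the data."""
    Xc = X - X.mean(axis=0)                   # centre the data
    cov = np.cov(Xc, rowvar=False)            # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k directions
    return Xc @ top                           # reduced (n_samples, k) features

# toy usage: 100 random 5-dimensional patterns reduced to 2 features
X = np.random.randn(100, 5)
print(pca_project(X, 2).shape)   # (100, 2)
```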

  • Some notation

    Feature Space, X: the set of all possible feature vectors.

    Classifier: a decision rule or a function, h : X → {1, ..., M}, where M is the number of classes.

    Often, X = ℝ^d, the space of d-dimensional real vectors.

  • We first consider the 2-class problem.

    Can handle the M > 2 case if we know how to handle the 2-class problem.

    Simplest alternative: design M 2-class classifiers, One vs Rest (see the sketch below).

    There are other possibilities: e.g., tree-structured classifiers.

    The 2-class problem is the basic problem.

    We will also look at M-class classifiers.
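
    One vs Rest combines M two-class decisions into one M-class decision. A minimal sketch of that combination step, with made-up linear scorers standing in for trained 2-class classifiers:

```python
from typing import Callable, Sequence

def one_vs_rest_predict(x, scorers: Sequence[Callable]) -> int:
    """scorers[m](x) returns a real-valued score for 'class m vs rest';
    predict the class whose 2-class scorer is most confident about x."""
    scores = [s(x) for s in scorers]
    return max(range(len(scores)), key=lambda m: scores[m])

# toy usage: three hand-made linear scorers on a 2-d feature vector
scorers = [
    lambda x: x[0] - x[1],       # class 0 vs rest
    lambda x: x[1] - x[0],       # class 1 vs rest
    lambda x: -(x[0] + x[1]),    # class 2 vs rest
]
print(one_vs_rest_predict((2.0, 0.5), scorers))   # -> 0
```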

  • A simple PR problem

    Problem: spot the right candidate. Features:
      x1: marks based on academic record
      x2: marks in the interview

    A Classifier: a x1 + b x2 > c ⇒ Good (a small sketch of this rule follows below).
    Another Classifier: x1 x2 > c ⇒ Good (or (x1 + a)(x2 + b) > c).

    Design of classifier: we have to choose a specific form for the classifier. What values to use for parameters such as a, b, c?
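
    A minimal sketch of the first classifier above; the parameter values a, b, c are whatever the designer chooses, and the numbers used here are made up:

```python
def is_good_candidate(x1: float, x2: float,
                      a: float = 1.0, b: float = 2.0, c: float = 150.0) -> bool:
    """Classifier of the form a*x1 + b*x2 > c  =>  'Good'.

    x1: marks based on academic record, x2: marks in the interview.
    """
    return a * x1 + b * x2 > c

print(is_good_candidate(70, 50))   # 70 + 100 = 170 > 150 -> True ('Good')
print(is_good_candidate(40, 30))   # 40 + 60  = 100 > 150 -> False
```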

  • Designing Classifiers

    Need to decide how feature vector values determine the class. (How do different marks reflect the goodness of a candidate?)

    In most applications, it is not possible to design the classifier from the physics of the problem.

    The difficulties are:
      Lot of variability in patterns of a single class
      Variability in feature vector values
      Feature vectors of patterns from different classes can be arbitrarily close
      Noise in measurements

  • Designing Classifiers contd...

    Often the only information available for the design is a training set of example patterns.

    Training set: {(X_i, y_i), i = 1, ..., ℓ}. Here X_i is an example feature vector of class y_i.

    Generation of the training set: take representative patterns of known category (data collection) and obtain their feature vectors (choice of feature measurements).

    Now learn an appropriate function h as the classifier (model choice).

    Test and validate the classifier on more data (a small sketch of this follows).
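
    The "test and validate on more data" step is often done by holding out part of the collected examples. A minimal sketch, assuming a list of (feature vector, label) pairs and using a made-up rule in place of a learned classifier h:

```python
import random

def train_test_split(data, test_fraction=0.25, seed=0):
    """Randomly split a list of (X, y) examples into train and test parts."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(h, examples):
    """Fraction of examples on which classifier h predicts the given label."""
    return sum(h(X) == y for X, y in examples) / len(examples)

# toy data: ((x1, x2), label) pairs and a hand-made classifier h
data = [((70, 50), 1), ((40, 30), 0), ((80, 60), 1), ((35, 45), 0),
        ((65, 55), 1), ((30, 20), 0), ((75, 40), 1), ((45, 35), 0)]
h = lambda X: int(X[0] + 2 * X[1] > 150)
train, test = train_test_split(data)
print(accuracy(h, train), accuracy(h, test))
```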

  • A simple PR problem

    Problem: spot the right candidate. Features:
      x1: marks based on academic record
      x2: marks in the interview

    A Classifier: a x1 + b x2 > c ⇒ Good. We have chosen a specific form for the classifier.

    Design of classifier: what values to use for a, b, c? Information available: experience / history of past candidates.

  • Training Set

    (Figure: the training examples plotted in the x1–x2 plane.)

  • Another example problem

    Problem: recognize persons of medium build. Features: height and weight.

    The classifier is nonlinear here (an illustrative sketch follows).
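
    For instance, a classifier for this problem might accept (height, weight) pairs lying near a nominal "medium build" point, which is a nonlinear rule in the two features. This is only an illustrative sketch; the nominal values and threshold are made up, not from the lecture:

```python
def is_medium_build(height_cm: float, weight_kg: float,
                    h0: float = 170.0, w0: float = 70.0, r: float = 1.5) -> bool:
    """Nonlinear rule: accept if (height, weight) lies inside an ellipse
    around a nominal medium-build point (h0, w0).  All numbers are made up."""
    return ((height_cm - h0) / 10.0) ** 2 + ((weight_kg - w0) / 8.0) ** 2 < r ** 2

print(is_medium_build(172, 68))   # close to the nominal point -> True
print(is_medium_build(195, 110))  # far from it -> False
```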

  • Learning from Examples

    Designing a classifier is a typical problem of learning from examples (also called learning with a teacher).

    The nature of the feedback from the teacher determines the difficulty of the learning problem.

  • In the context of Pattern Recognition

    feature vector → Classifier → class label, with feedback from a Teacher

    Supervised learning – the teacher gives the true class label for each feature vector.

    Reinforcement learning – noisy assessment of performance, e.g., correct/incorrect.

    Unsupervised learning – no teacher input (clustering problems).

  • When the class labels of training patterns as given by the teacher are noisy, we consider it as supervised learning with label noise or classification noise.

    Many classifier design algorithms do supervised learning.

  • Function Learning

    A closely related problem. The output is continuous-valued rather than discrete as in classifiers.

    Here the training set examples could be {(X_i, y_i), i = 1, ..., ℓ}, X_i ∈ X, y_i ∈ ℝ.

  • Examples of Function Learning

    Time series prediction: given a series x1, x2, ..., find a function to predict x_n.

    Based on past values: find a best function x_n = h(x_{n-1}, x_{n-2}, ..., x_{n-p}). Predict stock prices, exchange rates etc. The linear prediction model is used in speech analysis (a least-squares sketch follows).

    More general predictors can use other variables also. Predict rainfall based on measurements and (possibly) previous years' data. In general, System Identification. (An application: smart sensors.)
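
    As an illustration of the linear prediction model above, here is a small least-squares fit of x_n ≈ a_1 x_{n-1} + ... + a_p x_{n-p}. The fitting procedure and the toy series are illustrative assumptions, not the lecture's prescription:

```python
import numpy as np

def fit_linear_predictor(x, p):
    """Least-squares fit of x[n] ~ a[0]*x[n-1] + ... + a[p-1]*x[n-p]."""
    x = np.asarray(x, dtype=float)
    A = np.array([x[n - p:n][::-1] for n in range(p, len(x))])  # rows of past values
    b = x[p:]                                                   # values to be predicted
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a

def predict_next(x, a):
    """One-step prediction from the last p values of the series."""
    past = np.asarray(x[-len(a):], dtype=float)[::-1]   # x[n-1], ..., x[n-p]
    return float(a @ past)

# toy series generated by x[n] = 0.5*x[n-1] + 0.3*x[n-2]
series = [1.0, 0.8]
for _ in range(60):
    series.append(0.5 * series[-1] + 0.3 * series[-2])
a = fit_linear_predictor(series, p=2)
print(a)                       # recovers roughly [0.5, 0.3]
print(predict_next(series, a))
```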

  • Examples contd... : Equaliser

    Tx → x(k) → channel → Z(k) → filter → y(k) → Rx

    We want y(k) = x(k). Design (or adapt) the filter to achieve this. We can choose a filter of the form

        y(k) = Σ_{i=1}^{T} a_i Z(k − i)

    Finding the best a_i is a function learning problem (an adaptive sketch follows).
    Training set: {(x(k), Z(k)), k = 1, 2, ..., N}.
    How do we know x(k) at the receiver end? Prior agreements (protocols).
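
    The course outline later mentions LMS; as one illustrative way to "adapt the filter", here is a minimal LMS-style sketch. It is an assumption-laden sketch, not the lecture's algorithm: the sum here starts at i = 0 (so the current received sample Z(k) is also used), and the channel, step size and training symbols are made up.

```python
import numpy as np

def lms_equaliser(x, Z, T, mu=0.05):
    """Adapt FIR taps a(0..T-1) so that y(k) = sum_i a[i]*Z(k-i) tracks x(k).

    x : known transmitted training symbols, Z : received samples, T : number
    of taps, mu : LMS step size.  (The current sample Z(k) is included.)"""
    x, Z = np.asarray(x, float), np.asarray(Z, float)
    a = np.zeros(T)
    for k in range(T - 1, len(x)):
        past = Z[k - T + 1:k + 1][::-1]   # Z(k), Z(k-1), ..., Z(k-T+1)
        e = x[k] - a @ past               # error against the known training symbol
        a += mu * e * past                # LMS update of the tap weights
    return a

# toy usage: a 2-tap channel and random +/-1 training symbols
rng = np.random.default_rng(0)
x = rng.choice([-1.0, 1.0], size=500)
Z = 0.8 * x + 0.4 * np.concatenate(([0.0], x[:-1]))   # Z(k) = 0.8 x(k) + 0.4 x(k-1)
print(lms_equaliser(x, Z, T=3))   # taps roughly approximating the channel inverse
```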

  • Learning from examples

    Learning from examples is basic to both classification and regression.

    Training set: {(X_1, y_1), ..., (X_ℓ, y_ℓ)}. We essentially want to fit a best function: y = f(X).

  • Learning from examples – Generalization

    To obtain a classifier (or a regression function) we use the training set.

    We know the class labels of patterns (or the values of the prediction variable) in the training set.

    Errors on the training set do not necessarily tell how good the classifier is.

    Any classifier that amounts to only storing the training set is useless.

    We are interested in the generalization abilities – how does our classifier perform on unseen or new patterns?

  • Design of Classifiers

    The classifier should perform well in spite of the inherent variability of patterns and noise in feature extraction and/or in the class labels as given in the training set.

    Statistical Pattern Recognition – an approach where the variabilities are captured through probabilistic models.

    There are other approaches, e.g., syntactic pattern recognition, fuzzy-set based methods etc.

    In this course we consider classification and regression (function learning) problems in the statistical framework.

  • Statistical Pattern Recognition

    X is the feature space. (We take X = ℝ^d.)

    The feature vectors of class-i are modelled as random, distributed according to a probability density f_i(X) over X, called the class conditional density for class-i.

  • Class conditional densities model the variability in the feature values.

    For example, the two classes can be uniformly distributed in the two regions as shown. (The two classes are separable here.)

    (Figure: two non-overlapping regions, labelled Class 1 and Class 2.)

  • When class regions are separable, an important special case is linear separability.

    (Figure: two panels, each showing regions labelled Class 1 and Class 2.)

    The classes in the left panel above are linearly separable (can be separated by a line) while those in the right panel are not linearly separable (though separable).

  • In general, the two class conditional densities can overlap. (The same value of the feature vector can arise from different classes, with different probabilities.)

    (Figure: two overlapping class conditional densities, labelled Class 1 and Class 2.)

  • The statistical viewpoint gives us one way of looking for an optimal classifier.

    We can say we want a classifier that has the least probability of misclassifying a random pattern (drawn from the underlying distributions).

  • Let q_i(X) = Prob[class = i | X], i = 0, 1.

    q_i is called the posterior probability (function) for class-i.

    Consider the classifier

        h(X) = 0 if q0(X) > q1(X)
             = 1 otherwise

    q0(X) > q1(X) would imply that the feature vector X is more likely to have come from class-0 than from class-1.

    Hence, intuitively, such a classifier should minimize the probability of error in classification.

  • Statistical Pattern Recognition

    X (= ℝ^d) is the feature space. A random pattern gives a random feature vector X; its true class label is denoted y(X).

    This lets us rate a classifier h by how often h(X) differs from y(X).

  • For example, we can rate different classifiers by

        F(h) = Prob[h(X) ≠ y(X)]

    F(h) is the probability that h misclassifies a random X.

    The optimal classifier would be one with the lowest value of F.

    Given an h, we can calculate F(h) only if we know the probability distributions of the classes.

    Minimizing F is not a straightforward optimization problem.

    Here we are treating all errors as the same. We will generalize the framework later. (A small simulation sketch of estimating F(h) follows.)
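
    If the class distributions are known (or can be sampled), F(h) can be estimated by simulation. A minimal sketch, assuming for illustration two Gaussian class conditional densities and equal priors; none of these specific choices come from the lecture:

```python
import numpy as np

def estimate_error(h, sample_class0, sample_class1, p0=0.5, n=50000, seed=0):
    """Monte Carlo estimate of F(h) = Prob[h(X) != y(X)].

    sample_class0/1(rng) draws one feature value from the corresponding
    class conditional density; p0 is the prior probability of class 0."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(n):
        y = 0 if rng.random() < p0 else 1                   # draw the true class
        X = sample_class0(rng) if y == 0 else sample_class1(rng)
        errors += (h(X) != y)
    return errors / n

# toy setup: class 0 ~ N(0,1), class 1 ~ N(2,1), equal priors,
# and the threshold classifier h(X) = 1 iff X > 1
h = lambda X: int(X > 1.0)
print(estimate_error(h,
                     lambda rng: rng.normal(0.0, 1.0),
                     lambda rng: rng.normal(2.0, 1.0)))     # close to 0.159
```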

  • Statistical PR contd.

    Recall that f_i(X) denotes the probability density function of the feature vectors of class-i (the class conditional densities).

    Let p_i = Prob[y(X) = i]. These are called the prior probabilities.

    Recall the posterior probabilities, q_i(X) = Prob[y(X) = i | X].

    Now, by Bayes theorem,

        q_i(X) = f_i(X) p_i / Z

    where Z = f0(X) p0 + f1(X) p1 is the normalising constant.
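
    As a purely illustrative numerical check of the formula (the numbers are made up, not from the lecture): if p0 = 0.6, p1 = 0.4 and, at some X, f0(X) = 0.2 and f1(X) = 0.5, then Z = 0.6 × 0.2 + 0.4 × 0.5 = 0.32, so q0(X) = 0.12/0.32 = 0.375 and q1(X) = 0.20/0.32 = 0.625. Note that q0(X) + q1(X) = 1, and that here the posteriors favour class-1 even though the priors favour class-0.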

  • Bayes Classifier

    Consider the classifier given by

        h(X) = 0 if q0(X)/q1(X) > 1
             = 1 otherwise

    This is called the Bayes classifier. Given our statistical framework, this is the optimal classifier.

    q0(X) > q1(X) is the same as p0 f0(X) > p1 f1(X).

    We will prove the optimality of the Bayes classifier later. (A small sketch of the rule for one concrete choice of densities follows.)
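
    A minimal sketch of this rule for one concrete, made-up choice: one-dimensional Gaussian class conditional densities with known means, variances and priors. The densities and numbers are illustrative assumptions, not from the lecture:

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def bayes_classify(x, p0, f0, p1, f1):
    """Bayes classifier: class 0 iff q0(x) > q1(x), i.e. p0*f0(x) > p1*f1(x)."""
    return 0 if p0 * f0(x) > p1 * f1(x) else 1

# made-up class conditional densities: class 0 ~ N(0,1), class 1 ~ N(2,1)
f0 = lambda x: gaussian_pdf(x, 0.0, 1.0)
f1 = lambda x: gaussian_pdf(x, 2.0, 1.0)
p0, p1 = 0.5, 0.5

print(bayes_classify(0.4, p0, f0, p1, f1))   # -> 0 (closer to the class-0 mean)
print(bayes_classify(1.7, p0, f0, p1, f1))   # -> 1 (closer to the class-1 mean)
```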

  • story so far

    We consider PR as a two-step process: feature measurement/extraction and classification.

    A classifier maps feature vectors to class labels.

    The main difficulty in designing classifiers is the large variability in feature vector values.

    The main information we have for the design is a training set of examples.

    Function learning is a closely related problem.

    In both cases we need to learn from (training) examples.

  • story so far

    The statistical view is: model the variation in feature vectors from a given class through probability models.

    One objective can be: find a classifier that has the least probability of error.

    For example, the Bayes classifier is the optimal one if we know the class conditional densities.

  • Organization of the course

    Overview of classifier learning strategies
    Nearest Neighbour classification rule
    Bayes Classifier for minimizing risk
    Some variations on this theme (e.g., Neyman-Pearson Classifier)
    Techniques for estimating class conditional densities to implement the Bayes classifier

  • Organization of the course

    Linear discriminant functions
    Learning linear discriminant functions (Perceptron, LMS, logistic regression)
    Linear least-squares regression
    Simple overview of statistical learning theory
    Empirical risk minimization and VC theory

  • Organization of the course

    Learning nonlinear classifiers
    Neural Networks (Backpropagation, RBF networks)
    Support Vector Machines
    Kernel-based methods
    Feature extraction and dimensionality reduction (PCA)
    Boosting and ensemble classifiers (AdaBoost)

  • Thank You!

