EEL 851: Biometrics
An Overview of Statistical Pattern Recognition
Outline
Introduction → Pattern, Feature, Noise
Example → Problem Analysis: Segmentation, Feature Extraction, Classification
Design Cycle
Bayesian Decision Theory
Discriminant Functions, Decision Regions
Minimum Distance Classification
Nearest Neighbor Classification
Parameter Estimation
Decision Boundaries
Introduction
What is a Pattern?
An object, process, or event consisting of both deterministic and stochastic components
A record of dynamic occurrences influenced by both deterministic and stochastic factors
Examples → voice, image, characters
Kinds of Patterns
Visual patterns, temporal patterns, logical patterns
What is a Pattern Class?
A set of patterns sharing a set of common attributes (or features), usually originating from the same source
Feature
Relevant characteristics that set patterns apart from each other
Data extractable through measurements or processing
Examples
Patterns → speech waveforms, crystals, textures, weather patterns
Features → age, color, height, width
Classification
Assigning patterns to classes based on features
Noise
Distortions associated with pattern processing and/or training samples that affect the classification performance of the system
Example
Automation in a fish packing plant
Sort incoming fish on a conveyor according to species using optical sensing
Species to sort → sea bass, salmon
Problem Analysis
Set up a camera and take some sample images
Features?
Suggested features → length, lightness, width, number and shape of fins, position of mouth, etc.
Explore the suggested features!
Preprocessing
Segmentation
Isolate fish from the background
Isolate fish from one another
Feature extraction
Reduce the data by measuring certain features
Classifier
Evaluates the evidence presented → Makes the final decision
Classification
Selecting the length of the fish → Possible feature for classification
Length is a poor feature!
Select lightness as a possible feature
Threshold Decision Boundary
Cost
Consequences of our decision → Are they equal?
Task of decision theory
Move the decision boundary toward smaller values of lightness → Cost ↓
Reduce the number of sea bass classified as salmon
Classification
Adopt the lightness and the width of the fish
Fish x = [x1, x2] → x1 = lightness, x2 = width
Classification
Our satisfaction → Premature
Central aim → Correctly classify novel input
Issue of generalization!
Classification
Syntactic Pattern Recognition → Recursive description using methods of formal languages
Structural Pattern Recognition → Derivation of descriptions using formal rules
Pattern Recognition Systems
Invariant Features → Translation, Rotation, Scale
Design Cycle
Data Collection
Feature Choice
Model Choice
Training
Evaluation
Computational Complexity
Types of Learning
Supervised Learning
Teacher → Category label or cost for each pattern in the training set
Desired input → Desired output
Goal → Produce the correct output given a new input
Goals of Supervised Learning
Classification → Desired output is a discrete class label
Regression → Desired output is continuous valued
Unsupervised Learning
The system forms clusters or natural groupings of the input patterns
Goal → Build a model or find useful representations of the data
Usage → Reasoning, finding clusters, dimensionality reduction, decision making, prediction, etc.
Data compression, outlier detection, helping classification
Bayesian Decision Theory
Assumptions
Problem posed in probabilistic terms
Knowledge of all relevant probabilities
Sea bass/salmon example
State of nature → Random variable
ω = ω1 (sea bass)
ω = ω2 (salmon)
Prior probability
P(ω1) → sea bass; P(ω2) → salmon
P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)
Decision Rule
With only prior information (we cannot see the fish!) → Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
In most cases more information is available
Class-Conditional Information
Class-conditional pdf → p(x|ω)
p(x|ω1) and p(x|ω2) describe the difference in lightness between the populations of sea bass and salmon
Class-conditional pdf
x → Lightness of fish
Normalized densities
Bayes Formula
Given the lightness x of a fish, which class?
P(ωj | x) = p(x | ωj) P(ωj) / p(x)
where, in the case of two categories,
p(x) = Σ_{j=1,2} p(x | ωj) P(ωj)
Posterior = (Likelihood × Prior) / Evidence
p(x | ωj) → Likelihood
p(x) → Evidence (scale factor)
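As a concrete illustration, here is a minimal Python sketch of the formula for the two-class fish problem, assuming Gaussian class-conditional densities for lightness; the means and standard deviations below are invented for illustration, while the priors are the ones used on the next slide.

import math

# Class-conditional densities p(x|w): lightness modeled as univariate
# Gaussians; the (mu, sigma) values are made up for illustration.
def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

PRIORS = {"sea bass": 2.0 / 3.0, "salmon": 1.0 / 3.0}     # P(w1), P(w2)
PARAMS = {"sea bass": (11.0, 1.5), "salmon": (8.0, 1.5)}  # (mu, sigma), assumed

def posteriors(x):
    # Numerator of Bayes formula: likelihood * prior for each class
    joint = {c: gaussian_pdf(x, *PARAMS[c]) * PRIORS[c] for c in PRIORS}
    evidence = sum(joint.values())           # p(x) = sum_j p(x|wj) P(wj)
    return {c: v / evidence for c, v in joint.items()}

post = posteriors(9.5)
print(post, "-> decide", max(post, key=post.get))

Note that the evidence p(x) only rescales the posteriors so they sum to one; the decision would be the same without it.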
Posterior Probabilities
Prior probabilities
P(ω1) = 2/3
P(ω2) = 1/3
Bayes Decision Rule
Decision given the posterior probabilities
Observation → x
If P(ω1 | x) > P(ω2 | x) → True state of nature = ω1
If P(ω1 | x) < P(ω2 | x) → True state of nature = ω2
Probability of error
P(error | x) = P(ω1 | x) if we decide ω2
P(error | x) = P(ω2 | x) if we decide ω1
Minimize the probability of error
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2
Bayes decision
P(error | x) = min [P(ω1 | x), P(ω2 | x)]
Bayesian Classifier
Generalization of the preceding ideas
More than one feature
More than two states of nature
Actions beyond merely deciding the state of nature
Loss function → More general than the probability of error
Bayes Decision Rule
An input pattern with feature vector x is assigned to the class ωk that maximizes the posterior probability:
P(ωk | x) ≥ P(ωj | x) for all j ≠ k
Likelihoods p(x | ωk) → Estimated from training data measurements
Prior probabilities P(ωk) → Assumed known within the given population of sample patterns
p(x) is the same for all class alternatives → Ignored (normalization factor)
Loss Function
Certain classification errors may be more costly than others
A false negative may be much more costly than a false positive
Examples → Fire alarm, medical diagnosis
Ω = {ω1, ω2, …, ωn} → Possible states of nature/classes
∆ = {α1, α2, …, αn} → Possible decisions/actions
Loss function λ(αi | ωj)
Loss incurred by taking action αi when the true state of nature is ωj
Conditional risk R(αi | x)
Expected loss of taking action αi when the observed evidence is x
R(αi | x) = Σ_j λ(αi | ωj) P(ωj | x)
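A sketch of the conditional-risk computation follows; the loss matrix and posteriors are hypothetical, chosen so that a false negative is costlier than a false positive.

# Hypothetical loss matrix lambda(a_i | w_j): rows are actions, columns
# are true classes. Deciding w2 when w1 is true (false negative) costs more.
LOSS = [
    [0.0, 1.0],   # action a1 (decide w1): loss 1 if the truth is w2
    [5.0, 0.0],   # action a2 (decide w2): loss 5 if the truth is w1
]

def conditional_risks(posteriors):
    # R(a_i | x) = sum_j lambda(a_i | w_j) P(w_j | x)
    return [sum(l * p for l, p in zip(row, posteriors)) for row in LOSS]

post = [0.7, 0.3]                 # P(w1|x), P(w2|x) for some observation x
risks = conditional_risks(post)
best = min(range(len(risks)), key=risks.__getitem__)
print(risks, "-> take action", best + 1)   # Bayes: minimize conditional risk

With this loss matrix the classifier decides ω1 even for observations where P(ω2 | x) is somewhat larger, because errors against ω1 are five times as costly.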
Minimum Error Rate Classification
Action αi is associated with class ωi
Classification decision αi is correct only if the state of nature is ωi
All errors are equally costly → Zero-one loss (symmetrical function):
λ(αi | ωj) = 0 if i = j, 1 if i ≠ j
Conditional risk
R(αi | x) = Σ_j λ(αi | ωj) P(ωj | x) = 1 − P(ωi | x)
Minimize risk
Select the class maximizing the posterior probability
Suggested by the Bayes decision rule → Minimum error rate classification
Neyman-Pearson Criterion
Minimize the overall risk subject to a constraint → ∫ R(αi | x) dx < constant
Misclassification of one class limited to a fixed frequency
Fish example → Do not misclassify more than 1% of salmon as sea bass
Minimize the chance of classifying sea bass as salmon
Discriminant Functions
Classifier representation
Functions employed for discriminating among classes → gi(x), i = 1, …, k
Classifier → Assign x to class ωi if
gi(x) ≥ gj(x) for all j ≠ i
Possible choices
gi(x) = P(ωi | x)
gi(x) = p(x | ωi) P(ωi)
gi(x) = ln p(x | ωi) + ln P(ωi)
Decision Regions
Effect of a decision rule → Divides the feature space into decision regions
Decision boundaries
Surfaces in feature space where ties occur among the largest discriminant functions
Normal Density
Univariate density
Analytically tractable, maximum entropy
Continuous-valued density
Many processes are asymptotically Gaussian → Central Limit Theorem
Aggregate effect of independent random disturbances → Gaussian
Many patterns → Prototype corrupted by a large number of random processes
p(x) = (1 / (√(2π) σ)) exp[ −(1/2) ((x − µ) / σ)² ]
where
µ = mean (or expected value) of x
σ² = expected squared deviation, or variance
Normal Density
Multivariate density
The multivariate normal density in d dimensions is:
p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ −(1/2) (x − µ)t Σ⁻¹ (x − µ) ]
where
x = (x1, x2, …, xd)t → feature vector
µ = (µ1, µ2, …, µd)t → mean vector
Σ = d×d covariance matrix
|Σ| and Σ⁻¹ → its determinant and inverse, respectively
Σ → Shape of the Gaussian curve
(x − µ)t Σ⁻¹ (x − µ) → Mahalanobis distance
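A sketch evaluating this density with NumPy follows; the parameter values are borrowed from the decision-boundary example later in the lecture.

import numpy as np

def multivariate_normal_pdf(x, mu, cov):
    # p(x) = exp(-1/2 (x-mu)^t Sigma^-1 (x-mu)) / ((2 pi)^(d/2) |Sigma|^(1/2))
    d = mu.shape[0]
    diff = x - mu
    maha = diff @ np.linalg.inv(cov) @ diff     # squared Mahalanobis distance
    norm = (2.0 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

# mu and Sigma taken from the worked example on a later slide
mu = np.array([3.0, 6.0])
cov = np.diag([0.5, 2.0])
print(multivariate_normal_pdf(np.array([3.0, 6.0]), mu, cov))  # density at the mean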
Discriminant Functions
Minimum error rate classification
gi(x) = ln p(x | ωi) + ln P(ωi)
For a Gaussian p(x | ωi) this becomes
gi(x) = −(1/2)(x − µi)t Σi⁻¹(x − µi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
Case 1 → Σi = σ²I
Features are statistically independent, each with the same variance σ²
Diagonal covariance matrix
Linear machines
Decision surfaces → Pieces of hyperplanes defined by gi(x) = gj(x)
In this case the discriminant reduces to the linear form
gi(x) = wit x + wi0
where wi = µi / σ² and wi0 = −µit µi / (2σ²) + ln P(ωi)
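A minimal sketch of this linear machine for two classes in two dimensions follows; the shared variance, class means, and priors are assumed values.

import numpy as np

SIGMA2 = 1.0                                             # shared variance (assumed)
MEANS = [np.array([3.0, 6.0]), np.array([3.0, -2.0])]    # class means (assumed)
PRIORS = [2.0 / 3.0, 1.0 / 3.0]

def g(i, x):
    # g_i(x) = w_i^t x + w_i0 with w_i = mu_i / sigma^2 and
    # w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(w_i)
    w = MEANS[i] / SIGMA2
    w0 = -MEANS[i] @ MEANS[i] / (2.0 * SIGMA2) + np.log(PRIORS[i])
    return w @ x + w0

x = np.array([2.0, 1.0])
print("decide class", max(range(len(MEANS)), key=lambda i: g(i, x)) + 1)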
Case 2 → Σi = Σ
Covariance matrices of all classes → Identical but arbitrary
gi(x) = −(1/2)(x − µi)t Σ⁻¹(x − µi) + ln P(ωi)
The hyperplane separating Ri and Rj passes through the point
x0 = (1/2)(µi + µj) − [ln(P(ωi) / P(ωj)) / ((µi − µj)t Σ⁻¹(µi − µj))] (µi − µj)
and is generally not orthogonal to the line between the means
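A sketch computing this boundary follows, using the standard form of the separating hyperplane, {x : wt(x − x0) = 0} with w = Σ⁻¹(µi − µj); the means, shared covariance, and priors are assumed values for illustration.

import numpy as np

mu_i, mu_j = np.array([3.0, 6.0]), np.array([3.0, -2.0])  # assumed means
Sigma = np.diag([2.0, 2.0])                                # assumed shared covariance
P_i, P_j = 2.0 / 3.0, 1.0 / 3.0                            # assumed priors

Sigma_inv = np.linalg.inv(Sigma)
diff = mu_i - mu_j
w = Sigma_inv @ diff                    # normal vector of the hyperplane
x0 = 0.5 * (mu_i + mu_j) - (np.log(P_i / P_j) / (diff @ Sigma_inv @ diff)) * diff
# w is generally not parallel to (mu_i - mu_j), which is why the plane
# need not be orthogonal to the line between the means.
print("w =", w, " x0 =", x0)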
Case 3 → Σi arbitrary
The covariance matrices are different for each category
gi(x) = xt Wi x + wit x + wi0
where
Wi = −(1/2) Σi⁻¹
wi = Σi⁻¹ µi
wi0 = −(1/2) µit Σi⁻¹ µi − (1/2) ln |Σi| + ln P(ωi)
The decision surfaces are hyperquadrics → Hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids
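A sketch of this quadratic discriminant follows, instantiated with the parameters from the example on the next slide; equal priors are an assumption here.

import numpy as np

def make_g(mu, cov, prior):
    # Builds g_i(x) = x^t W_i x + w_i^t x + w_i0 for one class
    cov_inv = np.linalg.inv(cov)
    W = -0.5 * cov_inv
    w = cov_inv @ mu
    w0 = (-0.5 * mu @ cov_inv @ mu
          - 0.5 * np.log(np.linalg.det(cov))
          + np.log(prior))
    return lambda x: x @ W @ x + w @ x + w0

g1 = make_g(np.array([3.0, 6.0]), np.diag([0.5, 2.0]), 0.5)
g2 = make_g(np.array([3.0, -2.0]), np.diag([2.0, 2.0]), 0.5)
x = np.array([3.0, 2.0])
print("decide w1" if g1(x) > g2(x) else "decide w2")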
Decision Boundary - Example
Compute the Bayes decision boundary for two Gaussian distributions
Means and covariances:
µ1 = (3, 6)t, Σ1 = diag(1/2, 2)
µ2 = (3, −2)t, Σ2 = diag(2, 2)
Resulting boundary:
x2 = 3.514 − 1.125 x1 + 0.1875 x1²
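A sketch that checks the quoted boundary numerically, assuming equal priors P(ω1) = P(ω2) = 1/2: on the parabola, the two Gaussian discriminants should agree.

import numpy as np

def g(x, mu, cov, prior):
    # Gaussian discriminant: -1/2 (x-mu)^t S^-1 (x-mu) - 1/2 ln|S| + ln P(w)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(cov) @ diff
            - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior))

MU1, COV1 = np.array([3.0, 6.0]), np.diag([0.5, 2.0])
MU2, COV2 = np.array([3.0, -2.0]), np.diag([2.0, 2.0])

for x1 in (0.0, 2.0, 4.0):
    x2 = 3.514 - 1.125 * x1 + 0.1875 * x1 ** 2     # quoted boundary curve
    x = np.array([x1, x2])
    # On the true boundary g1 - g2 vanishes (up to the rounding in 3.514)
    print(x1, g(x, MU1, COV1, 0.5) - g(x, MU2, COV2, 0.5))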
Supervised Learning
Parametric approaches
Bayesian parameter estimation
Maximum likelihood estimation
Estimation problem
Estimate P(ωj)
Estimate p(x | ωj) → Tough: high-dimensional feature spaces, small number of training samples
Simplifying assumptions
Feature independence
Independently drawn samples → i.i.d. model
Assume that p(x | ωj) is Gaussian
Estimation problem → Reduced to the parameters of a normal density
Estimation Methods
Bayesian (MAP) estimation
Distribution parameters are random variables that follow a known (e.g. Gaussian) distribution
The behavior of the training data helps revise the parameter values
Large training samples → Better chances of refining the posterior probabilities (parameter peaking)
Maximum-Likelihood (ML) estimation
Parameters of the probability distributions are fixed but unknown values
Parameters → Unknown constants, identified using training data
Best estimates → Parameter values that maximize the class-conditional probabilities over the available samples
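A sketch of ML estimation for a Gaussian class-conditional density follows; the training samples are synthetic, drawn from a known distribution so the estimates can be checked against the truth.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic training samples for one class, drawn from a known Gaussian
samples = rng.multivariate_normal([3.0, 6.0], np.diag([0.5, 2.0]), size=500)

# ML estimates for a Gaussian: the sample mean and the (biased,
# 1/n-normalized) sample covariance
mu_hat = samples.mean(axis=0)
centered = samples - mu_hat
cov_hat = centered.T @ centered / len(samples)

print("mu_hat =", mu_hat)
print("cov_hat =", cov_hat)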
Parametric Approaches
Curse of dimensionality
Estimate the parameters of a known distribution form
Smaller number of samples
Some prior information
Non-Parametric Approaches
Parzen window pdf estimation (KDE)
Estimate p(x | ωj) directly from sample patterns
kn-nearest-neighbor
Directly construct the decision boundary based on training data
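A minimal one-dimensional Parzen-window sketch with a Gaussian kernel follows; the sample data and the window width h are arbitrary choices for illustration.

import numpy as np

def parzen_estimate(x, samples, h):
    # p_hat(x) = (1/n) sum_k (1/h) phi((x - x_k)/h) with a Gaussian kernel phi
    u = (x - samples) / h
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.mean() / h

# 1-D training samples from one class; h is the window width (a free choice)
samples = np.random.default_rng(1).normal(0.0, 1.0, size=200)
print(parzen_estimate(0.0, samples, h=0.5))   # estimate near the true mode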
Nearest-Neighbor Methods
Statistical, nearest-neighbor, or memory-based methods
k-nearest-neighbor
New pattern category → Plurality vote of its k closest neighbors
Large k → Decreases the chance of undue influence by noisy training patterns
Small k → Reduces the acuity of the method
Distance metric → Usually Euclidean
In practice, features are scaled (each feature j by a factor aj); see the sketch below
Advantage → Does not require training
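A sketch of the k-nearest-neighbor rule with Euclidean distance follows; the toy training set is invented, and in practice the features would be scaled first, as noted above.

import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)     # Euclidean distances
    nearest = np.argsort(dists)[:k]                 # indices of k closest patterns
    votes = Counter(y_train[i] for i in nearest)    # plurality vote
    return votes.most_common(1)[0][0]

# Toy training set (lightness, width); labels and values are invented
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y_train = ["salmon", "salmon", "sea bass", "sea bass"]
print(knn_classify(np.array([1.1, 0.9]), X_train, y_train, k=3))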
Nearest-Neighbor Decision - Example
Classes of training patterns → 4
References
Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, 2nd edition, John Wiley & Sons, 2001 (most contents)
Nils J. Nilsson, Introduction to Machine Learning, http://ai.stanford.edu/people/nilsson/mlbook.html
http://www.rii.ricoh.com/%7Estork/DHS.html
A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical Pattern Recognition: A Review," IEEE Trans. PAMI, pp. 4-37, Jan. 2000
Other Useful References
Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York, NY, 1972
Therrien, Decision Estimation and Classification, John Wiley & Sons, New York, NY, 1989
Hertz, Krogh, and Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, Reading, MA, 1991
Bose and Liang, Neural Network Fundamentals with Graphs, Algorithms, and Applications, McGraw-Hill, New York, NY, 1996