Perception
Vision, Sections 24.1 - 24.3
Speech, Section 24.7
Computer Vision
“the process by which descriptions of physical scenes are inferred from images of them.” -- S. Zucker
“produces from images of the external 3D world a description that is useful to the viewer and not cluttered by irrelevant information” -- D. Marr
Typical Applications
Medical image analysis
Aerial photo interpretation
Material handling
Inspection
Navigation
Multimedia Applications
Image compression
Video teleconferencing
Virtual classrooms
Image pixelation
Pixel values
How to recognize faces?
Problem Background
M training images
Each image is N x N pixels
Each image is normalized for face position, orientation, scale, and brightness
There are several pictures of each face, showing different “moods”
Your Task
Determine if the test image contains a face
If it contains a face, is it a face of a person in our database?
If it is a person in our database, which one?
Also, what is the probability that it is Jim?
Image Space
An N x N image can be thought of as a point in an N²-dimensional image space
Each pixel is a feature with a gray scale value.
Example: a 512 x 512 image, where each pixel can be 0 (black) to 255 (white)
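A minimal NumPy sketch of this idea, using the 512 x 512 size from the example above (the pixel values are random stand-ins for a real image):

```python
import numpy as np

# A grayscale image is an N x N array of pixel values 0..255.
N = 512
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(N, N), dtype=np.uint8)

# Flattened, it is a single point in an N^2-dimensional image space.
point = image.reshape(-1)
print(point.shape)  # (262144,)
```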
Nearest Neighbor
The most likely match is the nearest neighbor
But that would take too much processing
Since all images are faces, they will have very high similarity
Face Space
Lower dimensionality to both simplify the storage and generalize the answer
Use eigenvectors to distill the 20 most distinctive metrics
Make a 20-item array for each face that contains the values of 20 features that most distinguish faces.
Now each face can be stored in 20 words
The average face
Training images are I1, I2, . . ., IM
Average image is A = (1/M) * Σ_{i=1..M} I_i
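The average face is just the pixel-wise mean of the training images. A sketch with a tiny hypothetical training set (5 flattened "images" of 16 random values, standing in for real N x N faces):

```python
import numpy as np

# Hypothetical training set: M flattened face images.
rng = np.random.default_rng(1)
M, D = 5, 16
images = rng.random((M, D))

# A = (1/M) * sum of the I_i: the average face, pixel by pixel.
A = images.mean(axis=0)
print(A.shape)  # (16,)
```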
Weight of an image in each feature
For k=1, . . ., 20 features, compute the similarity between the Input image, I, and the kth eigenvector, Ek
W_k = E_k^T * (I - A)
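With the eigenfaces stacked as rows of a matrix, all 20 weights come from one matrix-vector product. The eigenfaces, average face, and image here are random stand-ins; real ones come from the PCA step described later:

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 64, 20                 # D = N*N pixels, K = 20 eigenfaces
E = rng.random((K, D))        # row k is eigenface E_k
A = rng.random(D)             # average face
I = rng.random(D)             # flattened input image

# W_k = E_k^T * (I - A), computed for every k at once
W = E @ (I - A)
print(W.shape)  # (20,)
```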
Image in Face Space
“Only” 20 dimensional space
W = [w1, w2, . . ., w20], a column vector of weights that indicate the contribution of each of the 20 eigenfaces in I
Each image is projected from a point in high dimensional space into face space
20 features * 32 bits = 640 bits per image
Reconstructing image I
If M’ < M, we can only approximate I
Good enough for recognizing faces
ApproxI = A + Σ_{i=1..M'} w_i * E_i
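Reconstruction is the projection run in reverse: start from the average face and add back each eigenface scaled by its stored weight. Again the arrays are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
D, K = 64, 20
E = rng.random((K, D))        # the M' = 20 eigenfaces kept
A = rng.random(D)             # average face
w = rng.random(K)             # the 20 stored weights for one face

# ApproxI = A + sum_i w_i * E_i
approx_I = A + w @ E
print(approx_I.shape)  # (64,)
```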
Picking the 20 Eigenfaces
Principal Component Analysis (also called Karhunen-Loeve transform)
Create 20 images that maximize the information content in eigenspace
Normalize by subtracting the average face
Compute the covariance matrix, C
Find the eigenvectors of C that have the 20 largest eigenvalues
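The three steps above can be sketched directly in NumPy. The training set is random and tiny so the example runs fast; a real system would use actual face images and, for efficiency, the smaller M x M covariance trick rather than the full D x D matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
M, D, K = 50, 64, 20
faces = rng.random((M, D))

A = faces.mean(axis=0)
X = faces - A                     # normalize by subtracting the average face
C = (X.T @ X) / M                 # covariance matrix, D x D
vals, vecs = np.linalg.eigh(C)    # eigh handles symmetric matrices like C
top = np.argsort(vals)[::-1][:K]  # indices of the K largest eigenvalues
E = vecs[:, top].T                # rows are the 20 eigenfaces
print(E.shape)  # (20, 64)
```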
Build a database of faces
Given a training set of face images, compute the 20 largest eigenvectors, E1, E2, . . ., E20
Offline, because it is slow
For each face in the training set, compute the point in eigenspace, W = [w1, w2, . . ., w20]
Offline, because it is big
Categorizing a test face
Given a test image, Itest, project it into the 20-space by computing Wtest
Find the closest face in the database to the test face:
d = min_k || W_test - W_k ||
where W_k is the point in facespace associated with the kth person, and || * || denotes the Euclidean distance in facespace
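Nearest-neighbor search over the 20-dimensional facespace points, with a hypothetical random database standing in for real projected faces:

```python
import numpy as np

rng = np.random.default_rng(5)
K, people = 20, 30
W_db = rng.random((people, K))    # one facespace point per known person
W_test = rng.random(K)            # projection of the test image

# d = min over k of || W_test - W_k ||
dists = np.linalg.norm(W_db - W_test, axis=1)
k = int(np.argmin(dists))
d = float(dists[k])
print(k, d >= 0)
```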
Distance from facespace
Find the distance of the test image from eigenspace
Y = I_test - A
Y_f = Σ_{i=1..20} w_{test,i} * E_i
dffs = || Y - Y_f ||
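The distance-from-facespace computation, sketched with random stand-ins for the eigenfaces, average face, and test image:

```python
import numpy as np

rng = np.random.default_rng(6)
D, K = 64, 20
E = rng.random((K, D))
A = rng.random(D)
I_test = rng.random(D)

Y = I_test - A                         # mean-subtracted test image
w = E @ Y                              # its 20 facespace weights
Y_f = w @ E                            # Y_f = sum_i w_i * E_i
dffs = float(np.linalg.norm(Y - Y_f))  # distance from facespace
print(dffs >= 0)  # True
```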
Is this a face?
If dffs < threshold1 then
  if d < threshold2 then
    the test image is a face that is very close to the nearest neighbor; classify it as that person
  else
    the image is a face, but not one we recognize
else
  the image probably does not contain a face
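The decision rule as code. The threshold values below are hypothetical placeholders; in practice they would be tuned on validation data:

```python
def classify(dffs, d, nearest_person, threshold1=10.0, threshold2=5.0):
    """Two-threshold decision rule; threshold values are hypothetical."""
    if dffs < threshold1:
        if d < threshold2:
            return nearest_person        # a face we recognize
        return "unknown face"            # a face, but not in the database
    return "not a face"                  # probably not a face at all

print(classify(3.0, 1.0, "Jim"))    # Jim
print(classify(3.0, 9.0, "Jim"))    # unknown face
print(classify(50.0, 1.0, "Jim"))   # not a face
```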
Face Recognition Accuracy
Using 20-dimensional facespace resulted in about 95% correct classification on a database of 7500 images of 3000 people
If there are several images per person, the average W for that person helps improve accuracy
Edge Detection
Finding simple descriptions of objects in complex images
find edges
interrelate edges
Causes of edges
Depth discontinuity: one surface occludes another
Surface orientation discontinuity: the edge of a block
Reflectance discontinuity: texture or color changes
Illumination discontinuity: shadows
Examples of edges
Finding Edges
Image intensity along a line
First derivative of intensity
Smoothed via convolving with a Gaussian
Pixels on edges
Edges found
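The smooth-then-differentiate pipeline above can be sketched in one dimension: a step in intensity, blurred with a small Gaussian, shows up as a peak in the first derivative. The profile and kernel size here are illustrative:

```python
import numpy as np

# A 1-D intensity profile with a step edge at index 50.
intensity = np.concatenate([np.full(50, 20.0), np.full(50, 200.0)])

# Smooth by convolving with a small Gaussian, then take the first
# derivative; the edge appears as a peak in the derivative's magnitude.
x = np.arange(-3, 4)
g = np.exp(-x**2 / 2.0)
g /= g.sum()
smoothed = np.convolve(intensity, g, mode="same")
deriv = np.diff(smoothed)
edge = int(np.argmax(np.abs(deriv)))
print(edge)  # 49: the step between pixels 49 and 50
```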
Human-Computer Interfaces
Handwriting recognition
Optical Character Recognition
Gesture recognition
Gaze tracking
Face recognition
Vision Conclusion
Machine Vision is so much fun, we have a full semester course in it
Current research in vision modeling is very active
More breakthroughs are needed
Speech Recognition
Section 24.7
Speech recognition goal
Find a sequence of words that maximizes P(words | signal)
Signal Processing
“Toll quality” was the Bell Labs definition of digitized speech good enough for long distance calls (“toll” calls)
Sampling rate: 8000 samples per second
Quantization factor: 8 bits per sample
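These two numbers fix the raw data rate, which is why the stream is too big to analyze directly:

```python
# Data rate implied by the toll-quality sampling parameters.
samples_per_second = 8000
bits_per_sample = 8
bytes_per_minute = samples_per_second * bits_per_sample // 8 * 60
print(bytes_per_minute)  # 480000 -- roughly 500 KB per minute
```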
Too much data to analyze to find utterances directly
Computational Linguistics
Human speech is limited to a repertoire of about 40 to 50 sounds, called phones
Our problem:
What speech sounds did the speaker utter?
What words did the speaker intend?
What meaning did the speaker intend?
Finding features
Vector Quantization
The 255 most common clusters of feature values are labeled C1, …, C255
Send only the 8 bit label
One byte per frame (a 100-fold improvement over the 500 KB/minute)
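A sketch of the quantization step: each frame's feature vector is replaced by the index of its nearest codebook centroid. The codebook and frame here are random stand-ins; a real codebook comes from clustering training speech:

```python
import numpy as np

# Hypothetical codebook: 255 common clusters of feature values, C1..C255.
rng = np.random.default_rng(7)
centroids = rng.random((255, 10))
frame = rng.random(10)            # one frame's feature vector

# Vector quantization: transmit only the index of the nearest centroid.
label = int(np.argmin(np.linalg.norm(centroids - frame, axis=1)))
print(0 <= label < 255)  # True
```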
How to Wreck a Nice Beach
By Bayes' rule,
P(words | signal) = P(signal | words) * P(words) / P(signal)
where P(signal) is a constant (it is the signal we received)
So we want
argmax over words of P(signal | words) * P(words)
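The argmax in practice: score each candidate transcription by acoustic likelihood times language-model prior and keep the best. All the probabilities below are made-up numbers purely for illustration:

```python
# Two candidate transcriptions scored as P(signal | words) * P(words).
hypotheses = {
    "recognize speech":   (7e-4, 1e-6),    # (P(signal|words), P(words))
    "wreck a nice beach": (8e-4, 1e-10),   # fits the signal, but rarer words
}
best = max(hypotheses, key=lambda w: hypotheses[w][0] * hypotheses[w][1])
print(best)  # recognize speech
```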
Unigram Frequency
Word frequency
Even though his handwriting was sloppy, Woody Allen's bank hold-up note probably should not have been interpreted as “I have a gub”
The word “gun” is common
The word “gub” is unlikely
Language model
Use the language model to compare P(“wreck a nice beach”) P(“recognize speech”)
Use naïve Bayes to assess the likelihood, for each word, that it will appear in this context
Bigram model
want P(wi | w1, w2, …, wi-1); approximate it by P(wi | wi-1)
Easy to train
Simply count the number of times each word pair occurs
“I has” is unlikely, “I have” is likely
“an gun” is unlikely, “a gun” is likely
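Counting word pairs is all the training there is. A toy corpus makes the estimate concrete (a real bigram model is trained on millions of words):

```python
from collections import Counter

corpus = "i have a gun i have a dog i have a gun".split()
pairs = Counter(zip(corpus, corpus[1:]))   # count each adjacent word pair
unigrams = Counter(corpus[:-1])            # count each first word of a pair

def bigram_p(w_prev, w):
    return pairs[(w_prev, w)] / unigrams[w_prev]

# "a gun" followed "a" in 2 of its 3 occurrences
print(bigram_p("a", "gun"))
```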
Trigram
Some trigrams are very common
only track the most common trigrams
Use a weighted sum of unigram, bigram, and trigram
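The weighted sum interpolates the three estimates for the same next word. The weights below are hypothetical; in practice they are tuned on held-out data:

```python
# Interpolation weights (hypothetical; sum to 1).
l1, l2, l3 = 0.1, 0.3, 0.6

def interp_p(p_unigram, p_bigram, p_trigram):
    # Weighted sum of the three language-model estimates.
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram

print(round(interp_p(0.001, 0.01, 0.2), 4))  # 0.1231
```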
Near the end of the semester
Time flies like an arrow
Fruit flies like a banana
It is currently hard to incorporate parts of speech and sentence grammar into the probability calculation
lots of ambiguity
but humans seem to do it
Conclusion
Speech recognition technology is changing very quickly
Highly parallel
Amenable to hardware implementations