Perception
Vision, Sections 24.1 - 24.3
Speech, Section 24.7
Computer Vision
“the process by which descriptions of physical scenes are inferred from images of them.” -- S. Zucker
“produces from images of the external 3D world a description that is useful to the viewer and not cluttered by irrelevant information” -- D. Marr
Typical Applications
Medical image analysis
Aerial photo interpretation
Material handling
Inspection
Navigation
Multimedia Applications
Image compression
Video teleconferencing
Virtual classrooms
Image pixelation
Pixel values
How to recognize faces?
Problem Background
M training images
Each image is N x N pixels
Each image is normalized for face position, orientation, scale, and brightness
There are several pictures of each face, showing different “moods”
Your Task
Determine if the test image contains a face
If it contains a face, is it a face of a person in our database?
If it is a person in our database, which one?
Also, what is the probability that it is Jim?
Image Space
An N x N image can be thought of as a point in an N²-dimensional image space
Each pixel is a feature with a gray scale value.
Example: a 512 x 512 image, where each pixel can be 0 (black) to 255 (white)
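A minimal NumPy sketch of this idea, using the 512 x 512 size from the example above (the pixel values are random stand-ins for a real image):

```python
import numpy as np

# A grayscale image is an N x N array of pixel values 0..255.
N = 512
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(N, N), dtype=np.uint8)

# Flattened, it is a single point in an N^2-dimensional image space.
point = image.reshape(-1)
print(point.shape)  # (262144,)
```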
Nearest Neighbor
The most likely match is the nearest neighbor
But that would take too much processing
Since all images are faces, they will have very high similarity
Face Space
Lower dimensionality to both simplify the storage and generalize the answer
Use eigenvectors to distill the 20 most distinctive metrics
Make a 20-item array for each face that contains the values of 20 features that most distinguish faces.
Now each face can be stored in 20 words
The average face
Training images are I1, I2, . . ., IM
Average image is A = (1/M) * Σ_{i=1..M} I_i
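The average face is just the pixel-wise mean of the training images. A sketch with a tiny hypothetical training set (5 flattened "images" of 16 random values, standing in for real N x N faces):

```python
import numpy as np

# Hypothetical training set: M flattened face images.
rng = np.random.default_rng(1)
M, D = 5, 16
images = rng.random((M, D))

# A = (1/M) * sum of the I_i: the average face, pixel by pixel.
A = images.mean(axis=0)
print(A.shape)  # (16,)
```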
Weight of an image in each feature
For k=1, . . ., 20 features, compute the similarity between the Input image, I, and the kth eigenvector, Ek
W_k = E_k^T * (I - A)
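With the eigenfaces stacked as rows of a matrix, all 20 weights come from one matrix-vector product. The eigenfaces, average face, and image here are random stand-ins; real ones come from the PCA step described later:

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 64, 20                 # D = N*N pixels, K = 20 eigenfaces
E = rng.random((K, D))        # row k is eigenface E_k
A = rng.random(D)             # average face
I = rng.random(D)             # flattened input image

# W_k = E_k^T * (I - A), computed for every k at once
W = E @ (I - A)
print(W.shape)  # (20,)
```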
Image in Face Space
“Only” 20 dimensional space
W = [w1, w2, . . ., w20], a column vector of weights that indicate the contribution of each of the 20 eigenfaces in I
Each image is projected from a point in high dimensional space into face space
20 features * 32 bits = 640 bits per image
Reconstructing image I
If M’ < M, we can only approximate I
Good enough for recognizing faces
ApproxI = A + Σ_{i=1..M'} w_i * E_i
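Reconstruction is the projection run in reverse: start from the average face and add back each eigenface scaled by its stored weight. Again the arrays are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
D, K = 64, 20
E = rng.random((K, D))        # the M' = 20 eigenfaces kept
A = rng.random(D)             # average face
w = rng.random(K)             # the 20 stored weights for one face

# ApproxI = A + sum_i w_i * E_i
approx_I = A + w @ E
print(approx_I.shape)  # (64,)
```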
Picking the 20 Eigenfaces
Principal Component Analysis (also called Karhunen-Loeve transform)
Create 20 images that maximize the information content in eigenspace
Normalize by subtracting the average face
Compute the covariance matrix, C
Find the eigenvectors of C that have the 20 largest eigenvalues
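The three steps above can be sketched directly in NumPy. The training set is random and tiny so the example runs fast; a real system would use actual face images and, for efficiency, the smaller M x M covariance trick rather than the full D x D matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
M, D, K = 50, 64, 20
faces = rng.random((M, D))

A = faces.mean(axis=0)
X = faces - A                     # normalize by subtracting the average face
C = (X.T @ X) / M                 # covariance matrix, D x D
vals, vecs = np.linalg.eigh(C)    # eigh handles symmetric matrices like C
top = np.argsort(vals)[::-1][:K]  # indices of the K largest eigenvalues
E = vecs[:, top].T                # rows are the 20 eigenfaces
print(E.shape)  # (20, 64)
```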
Build a database of faces
Given a training set of face images, compute the 20 largest eigenvectors, E1, E2, . . ., E20
Offline, because it is slow
For each face in the training set, compute the point in eigenspace, W = [w1, w2, . . ., w20]
Offline, because it is big
Categorizing a test face
Given a test image, Itest, project it into the 20-space by computing Wtest
Find the closest face in the database to the test face:
d = min_k || W_test - W_k ||
where W_k is the point in facespace associated with the kth person, and || * || denotes the Euclidean distance in facespace
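Nearest-neighbor search over the 20-dimensional facespace points, with a hypothetical random database standing in for real projected faces:

```python
import numpy as np

rng = np.random.default_rng(5)
K, people = 20, 30
W_db = rng.random((people, K))    # one facespace point per known person
W_test = rng.random(K)            # projection of the test image

# d = min over k of || W_test - W_k ||
dists = np.linalg.norm(W_db - W_test, axis=1)
k = int(np.argmin(dists))
d = float(dists[k])
print(k, d >= 0)
```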
Distance from facespace
Find the distance of the test image from eigenspace
Y = I_test - A
Y_f = Σ_{i=1..20} w_{test,i} * E_i
dffs = || Y - Y_f ||
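The distance-from-facespace computation, sketched with random stand-ins for the eigenfaces, average face, and test image:

```python
import numpy as np

rng = np.random.default_rng(6)
D, K = 64, 20
E = rng.random((K, D))
A = rng.random(D)
I_test = rng.random(D)

Y = I_test - A                         # mean-subtracted test image
w = E @ Y                              # its 20 facespace weights
Y_f = w @ E                            # Y_f = sum_i w_i * E_i
dffs = float(np.linalg.norm(Y - Y_f))  # distance from facespace
print(dffs >= 0)  # True
```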
Is this a face?
If dffs < threshold1 then
  if d < threshold2 then
    the test image is a face that is very close to the nearest neighbor; classify it as that person
  else
    the image is a face, but not one we recognize
else
  the image probably does not contain a face
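The decision rule as code. The threshold values below are hypothetical placeholders; in practice they would be tuned on validation data:

```python
def classify(dffs, d, nearest_person, threshold1=10.0, threshold2=5.0):
    """Two-threshold decision rule; threshold values are hypothetical."""
    if dffs < threshold1:
        if d < threshold2:
            return nearest_person        # a face we recognize
        return "unknown face"            # a face, but not in the database
    return "not a face"                  # probably not a face at all

print(classify(3.0, 1.0, "Jim"))    # Jim
print(classify(3.0, 9.0, "Jim"))    # unknown face
print(classify(50.0, 1.0, "Jim"))   # not a face
```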
Face Recognition Accuracy
Using 20-dimensional facespace resulted in about 95% correct classification on a database of 7500 images of 3000 people
If there are several images per person, the average W for that person helps improve accuracy
Edge Detection
Finding simple descriptions of objects in complex images
find edges
interrelate edges
Causes of edges
Depth discontinuity: one surface occludes another
Surface orientation discontinuity: the edge of a block
Reflectance discontinuity: texture or color changes
Illumination discontinuity: shadows
Examples of edges
Finding Edges
Image intensity along a line
First derivative of intensity
Smoothed via convolving with a Gaussian
Pixels on edges
Edges found
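The smooth-then-differentiate pipeline above can be sketched in one dimension: a step in intensity, blurred with a small Gaussian, shows up as a peak in the first derivative. The profile and kernel size here are illustrative:

```python
import numpy as np

# A 1-D intensity profile with a step edge at index 50.
intensity = np.concatenate([np.full(50, 20.0), np.full(50, 200.0)])

# Smooth by convolving with a small Gaussian, then take the first
# derivative; the edge appears as a peak in the derivative's magnitude.
x = np.arange(-3, 4)
g = np.exp(-x**2 / 2.0)
g /= g.sum()
smoothed = np.convolve(intensity, g, mode="same")
deriv = np.diff(smoothed)
edge = int(np.argmax(np.abs(deriv)))
print(edge)  # 49: the step between pixels 49 and 50
```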
Human-Computer Interfaces
Handwriting recognition
Optical Character Recognition
Gesture recognition
Gaze tracking
Face recognition
Vision Conclusion
Machine Vision is so much fun, we have a full semester course in it
Current research in vision modeling is very active
More breakthroughs are needed
Speech Recognition
Section 24.7
Speech recognition goal
Find a sequence of words that maximizes P(words | signal)
Signal Processing
“Toll quality” was the Bell Labs definition of digitized speech good enough for long distance calls (“toll” calls)
Sampling rate: 8000 samples per second
Quantization factor: 8 bits per sample
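These two numbers fix the raw data rate, which is why the stream is too big to analyze directly:

```python
# Data rate implied by the toll-quality sampling parameters.
samples_per_second = 8000
bits_per_sample = 8
bytes_per_minute = samples_per_second * bits_per_sample // 8 * 60
print(bytes_per_minute)  # 480000 -- roughly 500 KB per minute
```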
Too much data to analyze to find utterances directly
Computational Linguistics
Human speech is limited to a repertoire of about 40 to 50 sounds, called phones
Our problem:
What speech sounds did the speaker utter?
What words did the speaker intend?
What meaning did the speaker intend?
Finding features
Vector Quantization
The 255 most common clusters of feature values are labeled C1, …, C255
Send only the 8 bit label
One byte per frame (a 100-fold improvement over the 500 KB/minute)
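A sketch of the quantization step: each frame's feature vector is replaced by the index of its nearest codebook centroid. The codebook and frame here are random stand-ins; a real codebook comes from clustering training speech:

```python
import numpy as np

# Hypothetical codebook: 255 common clusters of feature values, C1..C255.
rng = np.random.default_rng(7)
centroids = rng.random((255, 10))
frame = rng.random(10)            # one frame's feature vector

# Vector quantization: transmit only the index of the nearest centroid.
label = int(np.argmin(np.linalg.norm(centroids - frame, axis=1)))
print(0 <= label < 255)  # True
```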
How to Wreck a Nice Beach
By Bayes' rule,
P(words | signal) = P(signal | words) * P(words) / P(signal)
where P(signal) is a constant (it is the signal we received)
So we want
argmax over words of P(signal | words) * P(words)
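The argmax in practice: score each candidate transcription by acoustic likelihood times language-model prior and keep the best. All the probabilities below are made-up numbers purely for illustration:

```python
# Two candidate transcriptions scored as P(signal | words) * P(words).
hypotheses = {
    "recognize speech":   (7e-4, 1e-6),    # (P(signal|words), P(words))
    "wreck a nice beach": (8e-4, 1e-10),   # fits the signal, but rarer words
}
best = max(hypotheses, key=lambda w: hypotheses[w][0] * hypotheses[w][1])
print(best)  # recognize speech
```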
Unigram Frequency
Word frequency
Even though his handwriting was sloppy, Woody Allen's bank hold-up note probably should not have been interpreted as “I have a gub”
The word “gun” is common
The word “gub” is unlikely
Language model
Use the language model to compare P(“wreck a nice beach”) P(“recognize speech”)
Use naïve Bayes to assess the likelihood, for each word, that it will appear in this context
Bigram model
want P(wi | w1, w2, …, wi-1); approximate it by P(wi | wi-1)
Easy to train
Simply count the number of times each word pair occurs
“I has” is unlikely, “I have” is likely
“an gun” is unlikely, “a gun” is likely
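Counting word pairs is all the training there is. A toy corpus makes the estimate concrete (a real bigram model is trained on millions of words):

```python
from collections import Counter

corpus = "i have a gun i have a dog i have a gun".split()
pairs = Counter(zip(corpus, corpus[1:]))   # count each adjacent word pair
unigrams = Counter(corpus[:-1])            # count each first word of a pair

def bigram_p(w_prev, w):
    return pairs[(w_prev, w)] / unigrams[w_prev]

# "a gun" followed "a" in 2 of its 3 occurrences
print(bigram_p("a", "gun"))
```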
Trigram
Some trigrams are very common
only track the most common trigrams
Use a weighted sum of unigram, bigram, and trigram
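The weighted sum interpolates the three estimates for the same next word. The weights below are hypothetical; in practice they are tuned on held-out data:

```python
# Interpolation weights (hypothetical; sum to 1).
l1, l2, l3 = 0.1, 0.3, 0.6

def interp_p(p_unigram, p_bigram, p_trigram):
    # Weighted sum of the three language-model estimates.
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram

print(round(interp_p(0.001, 0.01, 0.2), 4))  # 0.1231
```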
Near the end of the semester
Time flies like an arrow
Fruit flies like a banana
It is currently hard to incorporate parts of speech and sentence grammar into the probability calculation
lots of ambiguity
but humans seem to do it
Conclusion
Speech recognition technology is changing very quickly
Highly parallel
Amenable to hardware implementations