Perception: Vision, Sections 24.1 - 24.3; Speech, Section 24.7


Page 1

Perception

Vision, Sections 24.1 - 24.3

Speech, Section 24.7

Page 2

Computer Vision

“the process by which descriptions of physical scenes are inferred from images of them.” -- S. Zucker

“produces from images of the external 3D world a description that is useful to the viewer and not cluttered by irrelevant information”

Page 3

Typical Applications

Medical image analysis
Aerial photo interpretation
Material handling
Inspection
Navigation

Page 4

Multimedia Applications

Image compression
Video teleconferencing
Virtual classrooms

Page 5

Image pixelation

Page 6

Pixel values

Page 7

How to recognize faces?

Page 8

Problem Background

M training images
Each image is N x N pixels
Each image is normalized for face position, orientation, scale, and brightness
There are several pictures of each face, in different “moods”

Page 9

Your Task

Determine if the test image contains a face
If it contains a face, is it a face of a person in our database?
If it is a person in our database, which one?
Also, what is the probability that it is Jim?

Page 10

Image Space

An N x N image can be thought of as a point in an N^2-dimensional image space

Each pixel is a feature with a grayscale value.

Example: a 512 x 512 image, where each pixel can be 0 (black) to 255 (white)
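
As a rough illustration of this idea (a sketch assuming NumPy; the random array below just stands in for a real grayscale image):

    import numpy as np

    # Stand-in for a 512 x 512 grayscale image: 0 (black) to 255 (white)
    N = 512
    image = np.random.randint(0, 256, size=(N, N), dtype=np.uint8)

    # The image viewed as a single point in N^2-dimensional image space
    point = image.astype(float).reshape(-1)
    print(point.shape)        # (262144,) -- one feature per pixel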

Page 11

Nearest Neighbor

The most likely match is the nearest neighbor
But that would take too much processing
Since all images are faces, they will have very high similarity

Page 12

Face Space

Lower the dimensionality to both simplify storage and generalize the answer

Use eigenvectors to distill the 20 most distinctive metrics

Make a 20-item array for each face that contains the values of 20 features that most distinguish faces.

Now each face can be stored in 20 words

Page 13

The average face

Training images are I_1, I_2, . . ., I_M

Average image is A:

A = \frac{1}{M} \sum_{i=1}^{M} I_i
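
A minimal sketch of this step, assuming NumPy and that the M training images have already been flattened into the rows of one array (the random data is a stand-in):

    import numpy as np

    # Stand-in for M flattened N x N training images, one per row
    M, N = 50, 64
    training = np.random.rand(M, N * N)

    # Average face: A = (1/M) * sum of the M training images
    A = training.mean(axis=0)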

Page 14

Weight of an image in each feature

For k = 1, . . ., 20 features, compute the similarity between the input image, I, and the kth eigenvector, E_k:

W_k = E_k^T (I - A)
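
A sketch of this projection under the same assumptions (NumPy, flattened images; the eigenfaces here are random stand-ins rather than real ones):

    import numpy as np

    # Stand-ins: average face A, 20 eigenfaces E_k (one per row), input image I
    N = 64
    A = np.random.rand(N * N)
    E = np.random.rand(20, N * N)
    I = np.random.rand(N * N)

    # W_k = E_k^T (I - A): one weight per eigenface for this image
    W = E @ (I - A)           # shape (20,)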

Page 15

Image in Face Space

“Only” a 20-dimensional space

W = [w_1, w_2, . . ., w_20], a column vector of weights that indicate the contribution of each of the 20 eigenfaces in I

Each image is projected from a point in high dimensional space into face space

20 features * 32 bits = 640 bits per image

Page 16

Reconstructing image I

If M’ < M, we can only approximate I

Good enough for recognizing faces

ApproxI = A + \sum_{i=1}^{M'} w_i E_i
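
A sketch of the reconstruction, again with random stand-ins for the average face, the M' = 20 eigenfaces, and one image's weights:

    import numpy as np

    N = 64
    A = np.random.rand(N * N)             # average face
    E = np.random.rand(20, N * N)         # eigenfaces E_1 .. E_20, one per row
    W = np.random.rand(20)                # weights w_1 .. w_20 of some image

    # ApproxI = A + sum_i w_i * E_i
    approx_I = A + W @ E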

Page 17

Picking the 20 Eigenfaces

Principal Component Analysis (also called Karhunen-Loeve transform)

Create 20 images that maximize the information content in eigenspace

Normalize by subtracting the average face
Compute the covariance matrix, C
Find the eigenvectors of C that have the 20 largest eigenvalues
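
A minimal sketch of these three steps, assuming NumPy and flattened training faces. Rather than forming the full N^2 x N^2 covariance matrix, the sketch takes an SVD of the mean-centered data, whose right singular vectors are the same eigenvectors:

    import numpy as np

    # Stand-in training set: M flattened face images, one per row
    M, N = 100, 64
    faces = np.random.rand(M, N * N)

    # 1. Normalize by subtracting the average face
    A = faces.mean(axis=0)
    centered = faces - A

    # 2-3. Eigenvectors of the covariance matrix C with the 20 largest
    #      eigenvalues = the top right singular vectors of the centered data
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    eigenfaces = Vt[:20]                  # E_1 .. E_20, one per row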

Page 18

Build a database of faces

Given a training set of face images, compute the eigenvectors with the 20 largest eigenvalues, E_1, E_2, . . ., E_20

Offline, because it is slow

For each face in the training set, compute the point in eigenspace, W = [w_1, w_2, . . ., w_20]

Offline, because it is big
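
A sketch of the offline database build, reusing the ideas above (the faces, labels, average face, and eigenfaces below are all stand-ins):

    import numpy as np

    M, N = 100, 64
    faces = np.random.rand(M, N * N)                     # flattened training faces
    labels = ["person_%d" % (i % 10) for i in range(M)]  # who is in each image
    A = faces.mean(axis=0)
    _, _, Vt = np.linalg.svd(faces - A, full_matrices=False)
    eigenfaces = Vt[:20]

    # Offline: store each training face as its 20 face-space weights
    database = {}
    for face, name in zip(faces, labels):
        W = eigenfaces @ (face - A)                      # W = [w_1, ..., w_20]
        database.setdefault(name, []).append(W)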

Page 19

Categorizing a test face

Given a test image, I_test, project it into the 20-space by computing W_test

Find the closest face in the database to the test face:

d = \min_k \| W_{test} - W_k \|

where W_k is the point in facespace associated with the kth person, and \| \cdot \| denotes the Euclidean distance in facespace
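
A sketch of the nearest-neighbor search in face space, with random stand-ins for the stored points W_k and the test projection W_test:

    import numpy as np

    stored_W = np.random.rand(30, 20)     # W_k for 30 known people
    W_test = np.random.rand(20)           # projection of the test image

    # d = min_k || W_test - W_k ||  (Euclidean distance in facespace)
    distances = np.linalg.norm(stored_W - W_test, axis=1)
    k_best = int(np.argmin(distances))
    d = distances[k_best]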

Page 20

Distance from facespace

Find the distance of the test image from eigenspace

Y = I_{test} - A

Y_f = \sum_{i=1}^{20} w_{test,i} E_i

dffs = \| Y - Y_f \|
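
A sketch of the distance-from-face-space computation, with the usual stand-ins for A, the eigenfaces, and the test image:

    import numpy as np

    N = 64
    A = np.random.rand(N * N)             # average face
    E = np.random.rand(20, N * N)         # eigenfaces, one per row
    I_test = np.random.rand(N * N)        # flattened test image

    Y = I_test - A                        # mean-subtracted test image
    w_test = E @ Y                        # its 20 face-space weights
    Y_f = w_test @ E                      # reconstruction from face space
    dffs = np.linalg.norm(Y - Y_f)        # distance from face space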

Page 21

Is this a face?

If dffs < threshold1 then
    if d < threshold2
        • the test image is a face that is very close to the nearest neighbor; classify it as that person
    else
        • the image is a face, but not one we recognize
else
    • the image probably does not contain a face
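
The decision rule as a short function; the two threshold values are made-up placeholders, and dffs, d, and k_best come from the sketches above:

    THRESHOLD_1 = 2500.0   # assumed value: max acceptable distance from face space
    THRESHOLD_2 = 10.0     # assumed value: max acceptable distance to a known face

    def classify(dffs, d, k_best):
        if dffs < THRESHOLD_1:
            if d < THRESHOLD_2:
                return "face of known person %d" % k_best
            return "a face, but not one we recognize"
        return "probably not a face"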

Page 22

Face Recognition Accuracy

Using 20-dimensional facespace resulted in about 95% correct classification on a database of 7500 images of 3000 people

If there are several images per person, the average W for that person helps improve accuracy

Page 23

Edge Detection

Finding simple descriptions of objects in complex images:
find edges
interrelate edges

Page 24

Causes of edges

Depth discontinuity: one surface occludes another

Surface orientation discontinuity: the edge of a block

Reflectance discontinuity: texture or color changes

Illumination discontinuity: shadows

Page 25

Examples of edges

Page 26

Finding Edges

Image intensity along a line

First derivative of intensity

Smoothed via convolving with a Gaussian
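
A one-dimensional sketch of this edge-finding recipe (NumPy; the synthetic scan line below has a step edge at x = 50):

    import numpy as np

    # Synthetic intensity profile along a scan line: dark, then bright, plus noise
    intensity = np.concatenate([np.full(50, 30.0), np.full(50, 200.0)])
    intensity += np.random.normal(0.0, 2.0, size=intensity.shape)

    # Smooth by convolving with a Gaussian, then take the first derivative;
    # the edge appears as a peak in the derivative's magnitude
    x = np.arange(-10, 11)
    gaussian = np.exp(-x**2 / (2 * 3.0**2))
    gaussian /= gaussian.sum()
    smoothed = np.convolve(intensity, gaussian, mode="same")
    derivative = np.gradient(smoothed)
    edge_position = int(np.argmax(np.abs(derivative)))    # close to 50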

Page 27

Pixels on edges

Page 28

Edges found

Page 29

Human-Computer Interfaces

Handwriting recognition
Optical character recognition
Gesture recognition
Gaze tracking
Face recognition

Page 30

Vision Conclusion

Machine Vision is so much fun, we have a full semester course in it

Current research in vision modeling is very active
More breakthroughs are needed

Page 31

Speech Recognition

Section 24.7

Page 32

Speech recognition goal

Find a sequence of words that maximizes P(words | signal)

Page 33

Signal Processing

“Toll quality” was the Bell Labs definition of digitized speech good enough for long-distance calls (“toll” calls)
Sampling rate: 8000 samples per second
Quantization factor: 8 bits per sample

Too much data to analyze to find utterances directly

Page 34

Computational Linguistics

Human speech is limited to a repertoire of about 40 to 50 sounds, called phones

Our problem:
What speech sounds did the speaker utter?
What words did the speaker intend?
What meaning did the speaker intend?

Page 35

Finding features

Page 36

Vector Quantization

The 255 most common clusters of feature values are labeled C1, …, C255

Send only the 8-bit label
One byte per frame (a 100-fold improvement over the 500 KB/minute)
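
A sketch of the labeling step, assuming the 255 cluster centers (the codebook) have already been learned from training frames; the arrays here are random stand-ins:

    import numpy as np

    codebook = np.random.rand(255, 12)    # cluster centers C_1 .. C_255
    frame = np.random.rand(12)            # one frame's feature vector

    # Replace the whole frame by the 8-bit label of its nearest cluster center
    label = int(np.argmin(np.linalg.norm(codebook - frame, axis=1)))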

Page 37

How to Wreck a Nice Beach

P(words \mid signal) = \frac{P(signal \mid words) \, P(words)}{P(signal)}

where P(signal) is a constant (it is the signal we received)

So we want

\arg\max_{words} P(signal \mid words) \, P(words)
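
A toy illustration of the argmax, with made-up acoustic and language-model probabilities for two competing transcriptions of the same signal:

    import math

    acoustic = {"recognize speech": 0.0008,        # P(signal | words), hypothetical
                "wreck a nice beach": 0.0010}
    language = {"recognize speech": 0.0200,        # P(words), hypothetical
                "wreck a nice beach": 0.0001}

    best = max(acoustic, key=lambda w: math.log(acoustic[w]) + math.log(language[w]))
    print(best)    # "recognize speech": the language model outweighs the acoustics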

Page 38

Unigram Frequency

Word frequency
Even though his handwriting was sloppy, Woody Allen’s bank hold-up note probably should not have been interpreted as “I have a gub”
The word “gun” is common
The word “gub” is unlikely

Page 39

Language model

Use the language model to compare P(“wreck a nice beach”) with P(“recognize speech”)

Use naïve Bayes to assess the likelihood, for each word, that it will appear in this context

Page 40

Bigram model

We want P(w_i | w_1, w_2, …, w_{i-1}); approximate it by P(w_i | w_{i-1})

Easy to train: simply count the number of times each word pair occurs
“I has” is unlikely, “I have” is likely
“an gun” is unlikely, “a gun” is likely
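
A toy bigram model trained by counting word pairs in a tiny made-up corpus:

    from collections import Counter

    corpus = "i have a gun . i have a dog . she has a dog .".split()

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def bigram_prob(prev, word):
        # P(word | prev), estimated from the pair counts
        return bigrams[(prev, word)] / unigrams[prev]

    print(bigram_prob("i", "have"))   # 1.0 -- "I have" is likely
    print(bigram_prob("i", "has"))    # 0.0 -- "I has" is unlikely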

Page 41

Trigram

Some trigrams are very common; only track the most common trigrams

Use a weighted sum of the unigram, bigram, and trigram estimates
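
A sketch of the weighted sum, with made-up interpolation weights (they just need to sum to 1) and made-up component estimates:

    L1, L2, L3 = 0.1, 0.3, 0.6          # assumed unigram/bigram/trigram weights

    def interpolated_prob(p_unigram, p_bigram, p_trigram):
        # Weighted sum of the unigram, bigram, and trigram estimates
        return L1 * p_unigram + L2 * p_bigram + L3 * p_trigram

    print(interpolated_prob(0.001, 0.02, 0.15))    # 0.0961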

Page 42

Near the end of the semester

Time flies like an arrow
Fruit flies like a banana

It is currently hard to incorporate parts of speech and sentence grammar into the probability calculation: there is lots of ambiguity, but humans seem to do it

Page 43

Conclusion

Speech recognition technology is changing very quickly

Highly parallel
Amenable to hardware implementations