An Automatic Lip-reading Method Based on Polynomial Fitting

Meng LI
Supervisor: Dr. Yiu-ming CHEUNG

Department of Computer Science
Hong Kong Baptist University

Content

Introduction

Lip segmentation

Visual speech recognition

Experiment

Conclusion and future work

Introduction

[Diagram: the audio channel and the video channel both feed perception.]

Speech perception is multimodal: it involves information from at least two sensory modalities.

Introduction

[Bar charts: silent environment vs. noisy environment]

Condition            Visual only   Audio only   Visual-audio
Silent environment   73%           91%          97%
Noisy environment    73%           47%          87%

Introduction

The most active research direction in lip-reading is visual speech recognition, either combined with audio information or visual only.

[Pie chart of research directions. Legend: Others; Identification; Speech recognition in noisy environment; Visual-only speech recognition. Slice values: 1%, 5%, 31%, 63%.]

Introduction

The basic structure of a typical AVSR (Automatic Visual-Speech Recognition) system:

Stage                Audio                      Video
Preprocessing        Acoustic processing        Lip capturing
Feature extraction   Audio feature extraction   Visual feature extraction
AV fusion            Fusion and recognition (shared by both channels)

Introduction

Pixel based: uses all pixels in the lip region as the feature.

Motion based: captures the motion of all or part of the lip during pronunciation.

Shape based: extracts the boundary of the lip as the feature.

Model based: assumes a lip model, matches the lip shape to the model, and uses a few model parameters to represent the shape of the lip.

Introduction


Pixel based

Advantages: All information is utilized. Highest recognition rate under ideal illumination conditions.

Disadvantages: Sensitive to illumination conditions. Sensitive to rotation and scale transformations. Speaker dependent. High dimension of feature data.

Introduction

Motion based

Advantages: Represents the motion of the lip directly and completely.

Introduction

Model based / Shape based

Advantages: Low dimension of feature data. Robust to rotation and scale transformations. If the model is appropriate, speaker independence can be achieved. Convenient to match with classical methods (e.g. HMM).

Disadvantages: High computation complexity.

Introduction

Tip

So far, model-based feature extraction is the most common method.

Introduction

Lip segmentation under gray-level

Based on gray-level images.

Locates the minimum enclosing rectangle of the mouth.

High processing speed. Low computation complexity.

The rest of this presentation.

Introduction

Lip segmentation in colour space

Based on the RGB, HSV and L*a*b* colour spaces.

Can extract the outer boundary of the lip.

High accuracy.

High computation complexity.


Introduction

Visual-only speech recognition

Based on polynomial fitting.

High processing speed. Suitable for real-time systems. Performs well with a limited training set.


Lip segmentation (1)


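First, we transform the source image from RGB color space into L*a*b* space.

In the a* channel, negative values indicate green while positive values indicate magenta, so this channel helps to distinguish the lip region from the skin.

$$a^* = \frac{377\,(14503R - 22218G + 7714B)}{2^{24}} + 128$$

$$a^*_{norm} = \frac{a^* - a^*_{min}}{a^*_{max} - a^*_{min}}$$

[Two further equations on this slide, one defining $I_{mask}$ from $a^*_{norm}$ and $G$ and one defining a ratio image from $R$ and $G$, are not recoverable from the extracted text.]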
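The a* computation and its normalisation can be sketched in a few lines of Python. This is a minimal illustration only: it assumes 8-bit R, G, B inputs and the integer-style a* approximation with coefficients 14503, 22218 and 7714, and the function names and sample pixels are made up for the example.

```python
def a_star(r, g, b):
    """Approximate a* channel for 8-bit R, G, B values (assumed formula)."""
    return 377 * (14503 * r - 22218 * g + 7714 * b) / 2**24 + 128


def normalise(values):
    """Min-max normalise a flat list of a* values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up sample pixels: reddish (lip-like), skin-like, greenish.
pixels = [(200, 80, 90), (180, 140, 120), (90, 160, 80)]
a_values = [a_star(r, g, b) for r, g, b in pixels]
a_norm = normalise(a_values)
```

With these samples the reddish pixel receives the largest normalised value and the greenish pixel the smallest, which is the behaviour the segmentation step relies on.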


In the source image, we take the pixels located in the non-black area and transform them into HSV color space.

Then we obtain a vector for each pixel as follows:

$$I_i = (s_i \cos(2\pi h_i),\; s_i \sin(2\pi h_i))$$

We assume the data follow a normal distribution, and estimate the mean and variance via ML:

$$\hat{\mu} = \frac{\sum_{i=1}^{n} I_i}{n}$$

$$\hat{\Sigma} = \frac{1}{n-1} \sum_{i=1}^{n} (I_i - \hat{\mu})(I_i - \hat{\mu})^T$$
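A small sketch of the hue-saturation vector and of the mean and covariance estimate described above. It is an assumption-laden illustration, not the authors' code: it uses `colorsys.rgb_to_hsv` from the Python standard library (hue in [0, 1)), takes 8-bit RGB pixels as input, and scales the covariance by 1/(n-1); the function names are hypothetical.

```python
import colorsys
import math


def hue_sat_vector(r, g, b):
    """Map an 8-bit RGB pixel to (s*cos(2*pi*h), s*sin(2*pi*h))."""
    h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return (s * math.cos(2 * math.pi * h), s * math.sin(2 * math.pi * h))


def mean_and_covariance(vectors):
    """Sample mean and 2x2 covariance (assumed 1/(n-1) scaling) of 2-D vectors."""
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two vectors")
    mx = sum(v[0] for v in vectors) / n
    my = sum(v[1] for v in vectors) / n
    sxx = sum((v[0] - mx) ** 2 for v in vectors) / (n - 1)
    syy = sum((v[1] - my) ** 2 for v in vectors) / (n - 1)
    sxy = sum((v[0] - mx) * (v[1] - my) for v in vectors) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))
```

Mapping hue onto the unit circle avoids the discontinuity between hue values near 0 and near 1, which both correspond to red and therefore matter for lip colour.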


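We then transform the whole source image into HSV color space and obtain the vector for every pixel as follows:

$$I_i^{global} = (s_i \cos(2\pi h_i),\; s_i \sin(2\pi h_i))$$

Then we can get a new image:

$$I_{seg} = 255 \cdot \frac{1}{2\pi\sqrt{|\hat{\Sigma}|}} \exp\!\left(-\frac{(I_i^{global} - \hat{\mu})^T\, \hat{\Sigma}^{-1}\, (I_i^{global} - \hat{\mu})}{2}\right)$$

A lighter pixel is more similar to the lip region in color space.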
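The similarity image can be sketched as a per-pixel evaluation of a two-dimensional normal density. The code below is a hedged reading of the slide: it assumes the pixel value is 255 times the density of the pixel's hue-saturation vector under the estimated mean and covariance, and the function name is hypothetical.

```python
import math


def gaussian_similarity(vec, mean, cov):
    """255 times the 2-D normal density of `vec` under (mean, cov)."""
    (sxx, sxy), (_syx, syy) = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx, dy = vec[0] - mean[0], vec[1] - mean[1]
    # Quadratic form d^T * inverse(cov) * d, using the closed-form 2x2 inverse.
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return 255.0 * math.exp(-0.5 * maha) / (2 * math.pi * math.sqrt(det))
```

Vectors close to the estimated lip-colour mean give larger values, so lip-like pixels appear lighter. In practice the result would probably need clipping or rescaling to fit an 8-bit image, which the slide does not specify.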


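$$g_x = \frac{\sum_{i=1}^{row} \sum_{j=1}^{col} j \cdot I_{seg}(i,j)}{\sum_{i=1}^{row} \sum_{j=1}^{col} I_{seg}(i,j)}$$

$$g_y = \frac{\sum_{i=1}^{row} \sum_{j=1}^{col} i \cdot I_{seg}(i,j)}{\sum_{i=1}^{row} \sum_{j=1}^{col} I_{seg}(i,j)}$$

We select the block in $I_{mask}$ that contains the “gravity center” $(g_x, g_y)$ as the lip region.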
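The gravity center is an intensity-weighted centroid. A minimal sketch, assuming the image is given as a list of rows of non-negative pixel values and that rows and columns are numbered from 1 (the function name is hypothetical):

```python
def gravity_center(image):
    """Intensity-weighted centroid (g_x, g_y) of a 2-D list of pixel values.

    Rows are indexed by i and columns by j, both starting at 1.
    """
    total = sum(sum(row) for row in image)
    if total == 0:
        raise ValueError("image has no non-zero pixels")
    gx = sum(j * v for row in image for j, v in enumerate(row, start=1)) / total
    gy = sum(i * v for i, row in enumerate(image, start=1) for v in row) / total
    return gx, gy
```

Because bright pixels dominate the sums, the centroid is pulled towards the region that most resembles lip colour.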


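For each utterance, we obtain two curves, corresponding to the change in the width and in the height of the lip, respectively.

We employ least-squares estimation (LSE) to construct two polynomials that fit the two curves.

$$P = \sum_{k=1}^{n} a_k x^k$$

$$I = \sum_{i=0}^{m} \left( \sum_{k=1}^{n} a_k x_i^k - y_i \right)^2$$

$$\frac{\partial I}{\partial a_k} = 0, \quad k = 1, \dots, n$$

Here $(x_i, y_i)$, $i = 0, \dots, m$, are the sampled points of the curve being fitted.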
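The least-squares fit can be sketched with the normal equations obtained by setting each partial derivative of the squared error to zero. This is an illustrative sketch, not the authors' implementation: it assumes the polynomial has no constant term (powers 1 to n, as the formula on the slide appears to read), solves the small linear system by Gaussian elimination, and uses made-up sample data. A fit that includes a constant term would be the more common variant.

```python
def fit_polynomial(xs, ys, degree=3):
    """Least-squares coefficients a_1..a_degree of P(x) = sum_k a_k * x**k.

    Assumes no constant term (k starts at 1).
    """
    powers = range(1, degree + 1)
    # Normal equations A a = b obtained by setting dI/da_k = 0 for every k.
    A = [[sum(x ** (p + q) for x in xs) for q in powers] for p in powers]
    b = [sum(y * x ** p for x, y in zip(xs, ys)) for p in powers]
    n = degree
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular system; need more distinct x values")
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for c in range(col, n):
                A[row][c] -= factor * A[col][c]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(A[row][c] * coeffs[c] for c in range(row + 1, n))
        coeffs[row] = (b[row] - tail) / A[row][row]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate P(x) = a_1*x + a_2*x**2 + ... for coeffs = [a_1, a_2, ...]."""
    return sum(a * x ** k for k, a in enumerate(coeffs, start=1))


frames = [0.0, 0.25, 0.5, 0.75, 1.0]    # normalised frame positions (made up)
widths = [0.0, 0.33, 0.38, 0.23, 0.0]   # made-up, offset-removed lip widths
coeffs = fit_polynomial(frames, widths, degree=3)
```

The same function would be called twice per utterance, once for the width curve and once for the height curve.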


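In this work, we set n = 3. The maximum, the minimum and the right-most point of each fitted curve are recorded as the feature vector:

$$F_w = [x_w^{min},\; y_w^{min},\; x_w^{max},\; y_w^{max},\; y_w^{bound}]^T$$

$$F_h = [x_h^{min},\; y_h^{min},\; x_h^{max},\; y_h^{max},\; y_h^{bound}]^T$$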
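One way to read the feature vector is sketched below. It assumes the fitted curve has already been evaluated at the frame positions, that the minimum and maximum are taken over those samples, and that the bound entry is the curve value at the right-most sample; these choices and the function name are assumptions made for illustration.

```python
def curve_features(xs, fitted):
    """Return [x_min, y_min, x_max, y_max, y_bound] for a sampled fitted curve.

    `xs` are frame positions in increasing order and `fitted` holds the
    polynomial values at those positions.
    """
    i_min = min(range(len(fitted)), key=fitted.__getitem__)
    i_max = max(range(len(fitted)), key=fitted.__getitem__)
    return [xs[i_min], fitted[i_min], xs[i_max], fitted[i_max], fitted[-1]]
```

Each utterance then yields two five-dimensional vectors, one from the width curve and one from the height curve.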

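Each utterance is assigned a label “j”, and we use the following equations to train, where $i$ indexes the training samples:

$$T_{w,j}^{(i)} = \frac{T_{w,j}^{(i-1)} + F_w^{(i)}}{2}$$

$$T_{h,j}^{(i)} = \frac{T_{h,j}^{(i-1)} + F_h^{(i)}}{2}$$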

We use the following equation to test (F is the input feature vector, and T is the trained template feature vector):

$$J = \arg\min_j \left( \|F_w - T_{w,j}\| + \|F_h - T_{h,j}\| \right)$$
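Training and testing can be sketched as template averaging followed by a nearest-template decision. The code is a sketch under assumptions: the training rule is read as a running average of the template with each new feature vector, the first sample is assumed to initialise the template, the norm is taken to be Euclidean, and all names are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math


def update_template(template, feature):
    """Running average T <- (T + F) / 2; the first sample initialises T."""
    if template is None:
        return list(feature)
    return [(t + f) / 2 for t, f in zip(template, feature)]


def classify(f_w, f_h, templates):
    """Return the label j minimising ||F_w - T_w,j|| + ||F_h - T_h,j||.

    `templates` maps each label to a pair (T_w, T_h).
    """
    def cost(label):
        t_w, t_h = templates[label]
        return math.dist(f_w, t_w) + math.dist(f_h, t_h)
    return min(templates, key=cost)
```

With ten digit labels, classification costs twenty short distance computations per utterance, which is consistent with the claim that the method suits real-time systems.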

Experiment

The illumination source is an 18 W fluorescent lamp, the camera resolution is 320×240 at 30 frames per second, and the overall setup is shown below.

Our task is to recognize ten isolated digits (0 to 9) in Mandarin Chinese.

Five speakers (4 male and 1 female) took part in the experiment. For each digit, each speaker was asked to repeat the utterance 10 times to train the system and 50 times to test it.

Experiment

The experimental results are shown below:

Digit   Accuracy   Digit   Accuracy
0       0.972      5       0.912
1       0.952      6       0.964
2       0.976      7       0.744
3       0.964      8       0.952
4       0.788      9       0.932
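As a quick arithmetic check on the per-digit results, the short script below averages the ten accuracies and converts them into counts. It assumes the test set has 5 speakers times 50 repetitions, i.e. 250 utterances per digit, as described in the experimental setup; the variable names are illustrative.

```python
accuracy = {0: 0.972, 1: 0.952, 2: 0.976, 3: 0.964, 4: 0.788,
            5: 0.912, 6: 0.964, 7: 0.744, 8: 0.952, 9: 0.932}

# Mean accuracy over the ten digits.
mean_accuracy = sum(accuracy.values()) / len(accuracy)

# 5 speakers x 50 test repetitions = 250 test utterances per digit.
tests_per_digit = 5 * 50
correct = {d: round(a * tests_per_digit) for d, a in accuracy.items()}

# Digit with the lowest accuracy.
hardest = min(accuracy, key=accuracy.get)
```

The mean comes out at 0.9156, which matches the overall figure reported for the proposed approach, and every accuracy corresponds to a whole number of correct utterances out of 250. Digits 7 and 4 are the weakest, at 186 and 197 correct utterances respectively.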

Experiment

Comparison with some existing approaches that also use the width and height of the lip as visual features:

Method         Accuracy
1              0.8127
2              0.7741
3              0.9149
4              0.7720
Our approach   0.9156

Methods 1, 2 and 3: S. L. Wang, W. H. Lau, A. W. C. Liew, and S. H. Leung. Automatic lipreading with limited training data. In Proc. ICPR 2006, pp. 881-884, 2006.

Method 4: A. R. Baig, R. Seguier, and G. Vaucher. Image sequence analysis using a spatio-temporal coding for automatic lipreading. In Proc. ICIAP 1999, pp. 544-549, 1999.

Conclusion & Future work

In this paper, we have proposed a new approach to automatic lip reading based on polynomial fitting. The feature vector of our approach has low dimension, and the approach needs only a small training data set. Experiments show promising results for the proposed approach in comparison with existing methods.

However, more difficult tasks, e.g. recognizing words or sentences, require an appropriate model. This is the focus of the next stage of research.

Thank you!

31-08-2009
