An Automatic Lip-reading Method Based on Polynomial Fitting
Meng LI
Supervisor: Dr. Yiu-ming CHEUNG
Department of Computer Science
Hong Kong Baptist University
Content
Introduction
Lip segmentation
Visual speech recognition
Experiment
Conclusion and future work
Introduction
Audio channel
Video channel
Perception
Speech perception is multimodal: it involves information from at least two sensory modalities.
Introduction
Figure: two bar charts (scale 0% to 100%) comparing three conditions.
Silent environment: visual only 73%, audio only 91%, visual-audio 97%.
Noisy environment: visual only 73%, audio only 47%, visual-audio 87%.
Introduction
The hottest research direction in lip-reading is visual-speech recognition (with audio information, or visual only)
Figure: chart of research directions. Visual-only speech recognition 63%, speech recognition in noisy environment 31%, identification 5%, others 1%.
Introduction
The basic structure of a typical AVSR (Automatic Visual-Speech Recognition) system, in three stages (preprocessing, feature extraction, AV fusion):
Audio: acoustic processing, then audio feature extraction.
Video: lip capturing, then visual feature extraction.
Both streams: fusion and recognition.
Introduction
Pixel based: use all pixels in the lip region as the feature.
Motion based: capture the motion of all or part of the lip during pronunciation.
Shape based: extract the boundary of the lip as the feature.
Model based: assume a lip model, match the lip shape to the model, and use a few parameters to represent the shape of the lip.
Introduction
Pixel based
Advantages: All information is utilized. Highest recognition rate under ideal illumination conditions.
Disadvantages: Sensitive to illumination conditions. Sensitive to rotation and scale transforms. Speaker dependent. High-dimensional feature data.
Introduction
Motion based
Advantages: Represents the motion of the lip directly and completely.
Disadvantages: Sensitive to illumination conditions. Sensitive to rotation and scale transforms. Speaker dependent. High-dimensional feature data.
Introduction
Model based and shape based
Advantages: Low-dimensional feature data. Robust to rotation and scale transforms. If the model is appropriate, speaker independence can be achieved. Convenient to match with classical methods (e.g. HMM).
Disadvantages: High computational complexity.
Introduction
Tip: So far, model-based feature extraction is the most common method.
Introduction
The rest of this presentation:
Lip segmentation under gray-level
Based on the gray-level image.
Locates the minimum enclosing rectangle of the mouth.
High processing speed. Low computational complexity.
Introduction
Lip segmentation in colour space
Based on the RGB, HSV and L*a*b* colour spaces.
Can extract the outer boundary of the lip.
High accuracy.
High computational complexity.
Introduction
Visual-only speech recognition
Based on polynomial fitting.
High processing speed. Suitable for real-time systems. Performs well with a limited training set.
Lip segmentation (1)
Lip segmentation (2)
First, we transform the source image from RGB color space into L*a*b* space.
In the a* channel, negative values indicate green while positive values indicate magenta.
This channel therefore helps to separate the lip region from the skin.
Lip segmentation (2)
$$a^* = \frac{377\,(14503R - 22218G + 7714B)}{2^{24}} + 128$$
$$a^*_{norm} = \frac{a^* - a^*_{min}}{a^*_{max} - a^*_{min}}$$
2552 * GaI normmask
GRGRrg R
GI 125610/
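The a* computation and its min-max normalization can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the three sample pixels are made up, and the signs of the coefficients (R and B positive, G negative) are an assumption, chosen so that a gray pixel maps to about 128.

```python
def a_star(r, g, b):
    """Approximate a* for one 8-bit RGB pixel (integer-coefficient formula).

    Assumption: R and B coefficients are positive and G is negative,
    so that gray pixels land close to 128.
    """
    return 377 * (14503 * r - 22218 * g + 7714 * b) / 2 ** 24 + 128


def normalize(values):
    """Min-max normalize a list of values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical pixels: lip-like red, skin-like, neutral gray.
pixels = [(200, 80, 90), (210, 170, 150), (128, 128, 128)]
a_values = [a_star(*p) for p in pixels]
a_norm = normalize(a_values)
print(a_norm)
```

With these sample pixels the reddish pixel gets the largest normalized value and the gray pixel the smallest, which is the contrast the a* channel is used for.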
Lip segmentation (2)
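From the source image, we take the pixels located in the non-black area and transform them into HSV color space.
For each such pixel i, with hue h_i and saturation s_i, we get the (row) vector:
$$I_i = (s_i \cos(2\pi h_i),\; s_i \sin(2\pi h_i))$$
We assume the data follow a normal distribution, and estimate the mean and covariance via ML:
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} I_i$$
$$\hat{\Sigma} = \frac{1}{n-1} \sum_{i=1}^{n} (I_i - \hat{\mu})^T (I_i - \hat{\mu})$$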
Lip segmentation (2)
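We can transform the whole source image into HSV color space, and get for each pixel the vector:
$$I_i^{global} = (s_i \cos(2\pi h_i),\; s_i \sin(2\pi h_i))$$
Then, with the estimated mean $\hat{\mu}$ and covariance $\hat{\Sigma}$, we can get a new image:
$$I_{seg} = 255 \cdot \frac{1}{2\pi\sqrt{|\hat{\Sigma}|}} \exp\left(-\frac{(I_i^{global} - \hat{\mu})\,\hat{\Sigma}^{-1}\,(I_i^{global} - \hat{\mu})^T}{2}\right)$$
A lighter pixel is more similar to the lip region in color space.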
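A rough sketch of this colour-similarity step, under stated assumptions: hue and saturation are taken to lie in [0, 1], the sample values are invented, and the Gaussian is left unnormalized (the scale factor in front of the exponential is dropped) so that the output stays in 0 to 255. All function names are illustrative.

```python
import math


def hs_vector(h, s):
    """Map hue and saturation (both assumed in [0, 1]) to a 2-D point."""
    angle = 2 * math.pi * h
    return (s * math.cos(angle), s * math.sin(angle))


def mean_and_cov(points):
    """Sample mean and 2x2 covariance (divisor n - 1) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def similarity(point, mean, cov):
    """255 * exp(-d^2 / 2), where d is the Mahalanobis distance to the mean."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Closed-form inverse of the 2x2 covariance matrix.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return 255 * math.exp(-0.5 * d2)


# Hypothetical (hue, saturation) samples from the candidate lip area.
lip_samples = [(0.98, 0.50), (0.00, 0.60), (0.02, 0.55), (0.01, 0.45)]
lip_points = [hs_vector(h, s) for h, s in lip_samples]
mean, cov = mean_and_cov(lip_points)

lip_score = similarity(lip_points[0], mean, cov)
green_score = similarity(hs_vector(0.33, 0.60), mean, cov)
print(lip_score, green_score)
```

Using cosine and sine of the hue keeps reds on both sides of the hue wrap-around (0.98 and 0.02 here) close together, which a plain hue difference would not.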
Lip segmentation (2)
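$$g_x = \frac{\sum_{i=1}^{row} \sum_{j=1}^{col} j \cdot I_{seg}(i,j)}{\sum_{i=1}^{row} \sum_{j=1}^{col} I_{seg}(i,j)}$$
$$g_y = \frac{\sum_{i=1}^{row} \sum_{j=1}^{col} i \cdot I_{seg}(i,j)}{\sum_{i=1}^{row} \sum_{j=1}^{col} I_{seg}(i,j)}$$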
We select the block that contains the “gravity center” as the lip region.
Figure: $I_{mask}$
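The gravity center is an intensity-weighted average of the pixel coordinates. A minimal sketch, assuming the similarity image is given as a list of rows and that indices start at 1 as in the sums above (the function name and the toy image are illustrative):

```python
def gravity_center(image):
    """Return (g_x, g_y): intensity-weighted mean column and row, 1-based."""
    total = sum(sum(row) for row in image)
    if total == 0:
        raise ValueError("image has no non-zero pixels")
    g_x = sum(j * v for row in image for j, v in enumerate(row, start=1)) / total
    g_y = sum(i * v for i, row in enumerate(image, start=1) for v in row) / total
    return g_x, g_y


# Toy similarity image with a bright 2x2 block in rows 2-3, columns 2-3.
i_seg = [
    [0, 0, 0, 0],
    [0, 10, 10, 0],
    [0, 10, 10, 0],
]
print(gravity_center(i_seg))  # (2.5, 2.5)
```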
Visual speech recognition
For each utterance, we get two curves that describe how the width and the height of the lip change during the utterance.
We employ least-squares estimation (LSE) to construct two polynomials that fit the two curves:
$$P(x) = \sum_{k=1}^{n} a_k x^k$$
$$I = \sum_{i=0}^{n} \left( \sum_{k=1}^{n} a_k x_i^k - y_i \right)^2$$
$$\frac{\partial I}{\partial a_i} = 0$$
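Setting every partial derivative of the squared error to zero gives a linear system (the normal equations) in the coefficients. The sketch below solves it in plain Python. It is an illustration only: the function names are invented, the sample curve is synthetic, and it includes a constant term a_0, which is an assumption.

```python
def polyfit_lse(xs, ys, degree=3):
    """Least-squares coefficients [a_0, ..., a_degree] of sum_k a_k * x**k."""
    m = degree + 1
    # Normal equations A a = b, obtained from dI/da_k = 0 for every k.
    A = [[sum(x ** (r + c) for x in xs) for c in range(m)] for r in range(m)]
    b = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(A[r][c] * coeffs[c] for c in range(r + 1, m))
        coeffs[r] = (b[r] - s) / A[r][r]
    return coeffs


def polyval(coeffs, x):
    """Evaluate sum_k coeffs[k] * x**k."""
    return sum(a * x ** k for k, a in enumerate(coeffs))


# Synthetic "lip width" curve sampled from a known cubic.
true_coeffs = [1.0, 2.0, -0.5, 0.1]
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [polyval(true_coeffs, x) for x in xs]
fitted = polyfit_lse(xs, ys, degree=3)
print(fitted)
```

Because the samples lie exactly on a cubic, the fit recovers the generating coefficients up to rounding error. Real width and height curves are noisy, and the fit then smooths them.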
Visual speech recognition
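In this work, we set n = 3. For each fitted curve, the maximum, the minimum and the right-most point are recorded as the feature vector:
$$F_w = [x_{min}^{w},\; y_{min}^{w},\; x_{max}^{w},\; y_{max}^{w},\; y_{bound}^{w}]^T$$
$$F_h = [x_{min}^{h},\; y_{min}^{h},\; x_{max}^{h},\; y_{max}^{h},\; y_{bound}^{h}]^T$$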
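One way to read these five features off a fitted cubic is to sample it densely over the utterance interval. This is a sketch under assumptions: the extrema are found by sampling rather than analytically, the element order follows the vectors above, and the function names, interval and example coefficients are hypothetical.

```python
def polyval(coeffs, x):
    """Evaluate sum_k coeffs[k] * x**k."""
    return sum(a * x ** k for k, a in enumerate(coeffs))


def curve_features(coeffs, x_start, x_end, steps=300):
    """Return [x_min, y_min, x_max, y_max, y_bound] for the sampled curve."""
    xs = [x_start + (x_end - x_start) * i / steps for i in range(steps + 1)]
    ys = [polyval(coeffs, x) for x in xs]
    i_min = min(range(len(ys)), key=ys.__getitem__)
    i_max = max(range(len(ys)), key=ys.__getitem__)
    # y_bound is the curve value at the right-most point of the interval.
    return [xs[i_min], ys[i_min], xs[i_max], ys[i_max], ys[-1]]


# Example cubic y = x^3 - 6x^2 + 9x, with a peak at x = 1 and a dip at x = 3.
coeffs = [0.0, 9.0, -6.0, 1.0]
features = curve_features(coeffs, 0.5, 3.5)
print(features)
```

Each utterance is reduced to two such five-element vectors (one for width, one for height), which keeps the feature dimension low.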
Each utterance is assigned a label “j”, and we use the following equations to train:
21,
wiwi
jw
FTT
21,
hihi
jh
FTT
We use the following equation to test (F is the input feature vector, and T is the trained template feature vector):
$$J = \arg\min_{j} \left( \| F_w - T_{w,j} \| + \| F_h - T_{h,j} \| \right)$$
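The test rule is a nearest-template search over the labels. A minimal sketch, assuming the Euclidean norm (the slides do not say which norm is used) and using made-up templates for two labels. How the templates are trained is not covered here.

```python
import math


def distance(u, v):
    """Euclidean distance between two equal-length vectors (assumed norm)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(f_w, f_h, templates):
    """Return the label j minimizing ||F_w - T_wj|| + ||F_h - T_hj||.

    templates maps each label j to a pair (T_wj, T_hj).
    """
    return min(
        templates,
        key=lambda j: distance(f_w, templates[j][0]) + distance(f_h, templates[j][1]),
    )


# Hypothetical width and height templates for two digits.
templates = {
    0: ([1.0, 0.2, 3.0, 0.9, 0.5], [1.5, 0.1, 2.5, 0.7, 0.3]),
    7: ([0.5, 0.4, 2.0, 0.6, 0.6], [1.0, 0.3, 3.5, 0.5, 0.4]),
}
f_w = [0.6, 0.4, 2.1, 0.6, 0.6]
f_h = [1.1, 0.3, 3.4, 0.5, 0.4]
print(classify(f_w, f_h, templates))  # 7
```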
Experiment
The illumination source is an 18 W fluorescent lamp, the camera resolution is 320×240 at 30 frames per second, and the entire environment is shown below.
Our task is to recognize 10 isolated digits (0 to 9) spoken in Mandarin Chinese.
Five speakers (4 male and 1 female) took part in the experiment. For each digit, each speaker repeated the utterance 10 times to train the system, and 50 times to test it.
Experiment
The experimental results are shown below:
Digit  Accuracy    Digit  Accuracy
0      0.972       5      0.912
1      0.952       6      0.964
2      0.976       7      0.744
3      0.964       8      0.952
4      0.788       9      0.932
Experiment
Comparison with existing approaches that also use the width and height of the lip as visual features:
Method Accuracy
1 0.8127
2 0.7741
3 0.9149
4 0.7720
Our approach 0.9156
Methods 1, 2 and 3: S. L. Wang, W. H. Lau, A. W. C. Liew, and S. H. Leung. Automatic lipreading with limited training data. In Proc. ICPR 2006, pp. 881-884, 2006.
Method 4: A. R. Baig, R. Seguier, and G. Vaucher. Image sequence analysis using a spatio-temporal coding for automatic lipreading. In Proc. ICIAP 1999, pp. 544-549, 1999.
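As a quick consistency check, the overall accuracy listed for our approach appears to be the plain average of the ten per-digit accuracies. The few lines below just redo that arithmetic with the values from the results table.

```python
# Per-digit accuracies for digits 0 to 9, copied from the results table.
per_digit = [0.972, 0.952, 0.976, 0.964, 0.788,
             0.912, 0.964, 0.744, 0.952, 0.932]

mean_accuracy = sum(per_digit) / len(per_digit)
print(round(mean_accuracy, 4))  # 0.9156
```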
Conclusion & Future work
In this paper, we have proposed a new approach to automatic lip reading based on polynomial fitting. The feature vector of our approach has low dimension, and the approach needs only a small training data set. Experiments have shown promising results for the proposed approach in comparison with existing methods.
However, more difficult tasks, e.g. recognizing words or sentences, require an appropriate model. This is the focus of the next stage of our research.
Thank you!
31-08-2009