a study on the video scene retrieving system

A Study on the Video Scene Retrieving System with a Speech Recognizer 2013. 5. 14 Yoshika OSAWA Kohno Lab.

Upload: yoshika-osawa

Post on 06-Jul-2015


DESCRIPTION

Recently, a wide variety of video data has been generated, stored, and accessed thanks to advances in computer technology and the Internet. To find a video, or a scene within a video, quickly in this data, an efficient and effective technique is needed. I therefore propose a video scene retrieval system based on speech recognition using HMMs (Hidden Markov Models). The proposed system was applied to scene retrieval experiments that evaluated the recognition rate for 457 short words. The experimental results show an average detection accuracy of 68%.

TRANSCRIPT

Page 1: A Study on the Video Scene Retrieving System

A Study on the Video Scene Retrieving System

with a Speech Recognizer

2013. 5. 14

Yoshika OSAWA

Kohno Lab.

Page 2: A Study on the Video Scene Retrieving System

Outline

1. Introduction

2. Aim of Study

3. Composition of System

i. Voice Divide Section

ii. Speech Recognize Section

iii. Scene Retrieve Section

4. Evaluation Experiment

5. Conclusion

Page 3: A Study on the Video Scene Retrieving System

1. Introduction

• A wide variety of video data is being generated, stored, and accessed as the Internet advances.

• To find a video scene quickly in this data, an efficient technique is needed.

Page 4: A Study on the Video Scene Retrieving System

1. Introduction

• Multimedia Annotations

o Nagao (2001)

Page 5: A Study on the Video Scene Retrieving System

1. Introduction

• A Subtitling System for Broadcast Programs with a Speech Recognizer

o Ando et al. (2001)

Page 6: A Study on the Video Scene Retrieving System

1. Introduction

• Extracting voices from the video.

• The advantages of voice:

o Easy to convert to text.

o Simple association.

Apply speech recognition to scene retrieval.

Page 8: A Study on the Video Scene Retrieving System

2. Aim of Study

Implement a scene retrieving system, then verify its accuracy and check its operation.

Make annotations automatically with speech recognition.

Page 10: A Study on the Video Scene Retrieving System

3. Composition of System

[Flowchart: Start → Select a Video → Voice Divide Section → Speech Recognize Section → Input a Keyword → Scene Retrieve Section → Output the result → End]

Page 11: A Study on the Video Scene Retrieving System

i. Voice Divide Section

• Focus on the amplitude

o Use the signal while it exceeds the amplitude threshold.

o Reject segments that are too short to be recognized.

o Derive the thresholds by experiment.

Thresholds by axis:

Amplitude: 10 %

Time: 1000 ms
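
The segmentation rule on this slide can be sketched roughly as follows. This is a minimal sketch, not the author's implementation. It assumes samples normalized to [-1, 1], reads the 10 % threshold as a fraction of full scale, and reads the 1000 ms value as the minimum length of a segment that is kept. The function name `split_voiced`, its parameters, and the 10 ms frame size are hypothetical.

```python
def split_voiced(samples, rate, amp_threshold=0.10, min_ms=1000, frame_ms=10):
    """Return (start_ms, end_ms) pairs where the frame peak exceeds the threshold.

    Assumptions (not from the slides): samples are floats in [-1, 1],
    the peak is measured per frame of frame_ms, and segments shorter
    than min_ms are rejected.
    """
    frame_len = max(1, rate * frame_ms // 1000)
    min_frames = min_ms // frame_ms
    n_frames = len(samples) // frame_len
    segments, start = [], None
    for f in range(n_frames + 1):
        if f < n_frames:
            frame = samples[f * frame_len:(f + 1) * frame_len]
            loud = max(abs(s) for s in frame) > amp_threshold
        else:
            loud = False  # sentinel frame closes a segment that runs to the end
        if loud and start is None:
            start = f  # a voiced segment begins
        elif not loud and start is not None:
            if f - start >= min_frames:  # keep only segments that are long enough
                segments.append((start * frame_ms, f * frame_ms))
            start = None
    return segments
```

Measuring the peak per frame rather than per sample avoids cutting a segment every time the waveform passes through zero.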

Page 12: A Study on the Video Scene Retrieving System

ii. Speech Recognize Section

Page 13: A Study on the Video Scene Retrieving System

(1) Pre-Processing Unit

• Digitization

o Sampling frequency: 16 kHz

o Quantization: 16 bit

• Noise Reduction

o Additive noise: subtract the component measured during silence

o Multiplicative noise: subtract on the log axis

[Figure: Microphone characteristics of SM57]
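
The two noise reduction ideas can be illustrated with a small sketch. It assumes the additive case means subtracting a noise magnitude spectrum estimated from a silent interval, and the multiplicative case means removing a constant channel response (such as a microphone characteristic) by subtracting the per-bin time average in the log domain. Both readings, and the function names, are assumptions rather than details given on the slide.

```python
def subtract_additive(spectrum, noise_spectrum, floor=0.0):
    """Subtract an estimated noise magnitude spectrum, clamping at `floor`.

    noise_spectrum is assumed to have been measured during silence.
    """
    return [max(s - n, floor) for s, n in zip(spectrum, noise_spectrum)]


def subtract_multiplicative(log_frames):
    """Subtract the per-bin mean over time from log-spectrum frames.

    A constant multiplicative factor h becomes an additive offset log(h)
    in the log domain, so removing the time average removes it.
    """
    n = len(log_frames)
    bins = len(log_frames[0])
    mean = [sum(frame[b] for frame in log_frames) / n for b in range(bins)]
    return [[frame[b] - mean[b] for b in range(bins)] for frame in log_frames]
```

The log-domain step works because y = x·h implies log y = log x + log h, so a fixed h contributes only a constant offset in each frequency bin.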

Page 14: A Study on the Video Scene Retrieving System

(2) Feature Extraction Unit

Resonant frequency is effective as a feature value

Page 15: A Study on the Video Scene Retrieving System

(2) Feature Extraction Unit

• Resolution of human hearing

o Higher sensitivity at lower frequencies

• Filter that matches human hearing

o Mel-frequency
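
As an illustration of the mel-frequency axis, the sketch below uses one widely used mel formula, m = 2595·log10(1 + f/700). The slides do not say which variant was used, so the formula and the helper names are assumptions. The 8000 Hz upper limit in the usage note is the Nyquist frequency for 16 kHz sampling.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (assumed formula, see lead-in)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min, f_max):
    """Centre frequencies in Hz of filters spaced evenly on the mel axis."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]
```

Calling `mel_center_frequencies(10, 0.0, 8000.0)` gives centres that are close together at low frequencies and spread out at high frequencies, which matches the higher sensitivity of hearing at low frequencies.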

Page 16: A Study on the Video Scene Retrieving System

(2) Feature Extraction Unit

• Inverse Fourier transform on the mel-frequency axis

o New axis: cepstrum

o Separates the voice pitch from the resonance frequencies

• MFCC (Mel-Frequency Cepstral Coefficients)

o Information about vowels

• ΔMFCC

o Information about consonants

• Feature vector

o (Average power, MFCC, ΔMFCC)
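
The feature vector (average power, MFCC, ΔMFCC) can be assembled as in the sketch below. It assumes the MFCCs are already computed per frame and uses a common regression formula for the delta over ±2 frames. The window width, the edge padding, and the function names are assumptions, not details from the slides.

```python
def delta(frames, width=2):
    """Regression-based delta of each coefficient over +/- width frames.

    Frames beyond either end are replaced by the first or last frame.
    """
    n = len(frames)
    dims = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dims):
            num = 0.0
            for k in range(1, width + 1):
                nxt = frames[min(t + k, n - 1)][d]
                prv = frames[max(t - k, 0)][d]
                num += k * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out


def feature_vectors(power, mfcc):
    """Concatenate (average power, MFCC, delta-MFCC) for every frame."""
    d = delta(mfcc)
    return [[p] + m + dm for p, m, dm in zip(power, mfcc, d)]
```

The delta describes how the spectrum changes over time, which is why it carries information about short, rapidly changing sounds such as consonants.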

Page 17: A Study on the Video Scene Retrieving System

(3) Identification Unit

From Bayes' theorem
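
The equation on this slide is not present in the transcript. The usual formulation of speech recognition through Bayes' theorem is shown here as an assumption, with W for a candidate word sequence and X for the observed feature vectors (both symbols are introduced for illustration):

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)}
        = \arg\max_{W} P(X \mid W)\, P(W)
```

P(X) does not depend on W, so it can be dropped from the maximization. P(X | W) is the acoustic model and P(W) is the language model.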

Page 18: A Study on the Video Scene Retrieving System

(3) Identification Unit

Speech waveform: observable

Character information: not directly observable

Estimate the character information from the waveform by using HMMs (Hidden Markov Models)

Maximum likelihood calculation: Viterbi algorithm

Machine learning: Baum-Welch algorithm
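
The Viterbi step can be illustrated on a toy discrete HMM. This is a generic textbook sketch, not the decoder used in the system. The states, observation symbols, and probabilities in the usage note are made up for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for a discrete HMM (log domain)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[t][s]: best log probability of any path that ends in state s at time t
    score = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        score.append({})
        back.append({})
        for s in states:
            prev, best = max(
                ((r, score[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            score[t][s] = best + log(emit_p[s][observations[t]])
            back[t][s] = prev  # remember the best predecessor for backtracking
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

A usage example with two hypothetical hidden states and three hypothetical observation symbols:

```python
states = ("a", "i")
start_p = {"a": 0.6, "i": 0.4}
trans_p = {"a": {"a": 0.7, "i": 0.3}, "i": {"a": 0.4, "i": 0.6}}
emit_p = {"a": {"x": 0.5, "y": 0.4, "z": 0.1}, "i": {"x": 0.1, "y": 0.3, "z": 0.6}}
print(viterbi(["x", "y", "z"], states, start_p, trans_p, emit_p))  # ['a', 'a', 'i']
```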

Page 19: A Study on the Video Scene Retrieving System

iii. Scene Retrieve Section

• Matching keyword and text

1. Input a keyword

2. Match the keyword against the recognized text by string search

3. Extract the scenes in which the keyword was spoken

4. Output a thumbnail
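
Steps 1 to 3 can be sketched as a plain substring search over the recognized text. The sketch assumes each voice segment is stored together with its start and end time. The data layout and the name `find_scenes` are hypothetical, and thumbnail output is left out.

```python
def find_scenes(transcript, keyword):
    """Return (start_ms, end_ms) of every segment whose text contains the keyword.

    transcript: list of (start_ms, end_ms, text), one entry per recognized
    voice segment (an assumed layout, not taken from the slides).
    """
    return [(start, end) for start, end, text in transcript if keyword in text]
```

For example, with the made-up segments `[(0, 1500, "good evening"), (1500, 4000, "weather forecast for tokyo")]`, searching for `"tokyo"` returns `[(1500, 4000)]`, which is the position a player would seek to.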

Page 21: A Study on the Video Scene Retrieving System

4. Evaluation Experiment

1. Compare the recognition result with the words I heard

2. Calculate the recognition rate

3. Evaluate it for each number of characters

Sample data

Video: NHK news

Time: 3 minutes

Number: 30 videos

Words: 457 words

Engine: Julius
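
The three evaluation steps can be sketched as below. The sketch assumes a word counts as recognized only on an exact match with the word that was heard, and measures length with `len`, which counts Unicode code points. Both choices and the function name are assumptions.

```python
from collections import defaultdict


def recognition_rates(pairs):
    """Return ({character_count: rate}, overall_rate).

    pairs: iterable of (heard_word, recognized_word).
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    for heard, recognized in pairs:
        total[len(heard)] += 1
        if heard == recognized:  # exact match counts as correct
            correct[len(heard)] += 1
    by_length = {n: correct[n] / total[n] for n in sorted(total)}
    overall = sum(correct.values()) / sum(total.values())
    return by_length, overall
```

The overall rate is weighted by how many words fall into each length group, so it is not the plain mean of the per-length rates.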

Page 22: A Study on the Video Scene Retrieving System

4. Evaluation Experiment

The overall average rate is 68%.

[Bar chart: Recognition Rate (axis 0% to 80%) for 1 to 6 words; bar labels as extracted: 67%, 73%, 69%, 46%, 45%, 40%]

Page 23: A Study on the Video Scene Retrieving System

4. Evaluation Experiment

• Verify the correspondence between the keyword and the seek destination

o Select a thumbnail and play from the scene

o Check whether the keyword was spoken

Page 24: A Study on the Video Scene Retrieving System

4. Evaluation Experiment

• The recognition rate decreases as the number of characters increases.

• The retrieved scenes correspond to the keyword.

• Recognition errors occur in weak consonant parts

o The Voice Divide Section needs improvement

o The recognition accuracy must also be improved

Page 26: A Study on the Video Scene Retrieving System

5. Conclusion

• System for watching video efficiently

o Uses speech recognition

o Makes annotations automatically

• Future work

o Adopt the zero-crossing count in the Voice Divide Section

o Adopt the latest speech recognition technology

o Incorporate image recognition
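
The zero-crossing count named under future work can be sketched as follows. This is only an illustration of the measure itself. How it would be combined with the amplitude threshold is not described in the slides, and the function name is hypothetical.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples in one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
```

Weak consonants such as fricatives tend to have low amplitude but many zero crossings, so this count can flag frames that an amplitude threshold alone would discard.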

Page 27: A Study on the Video Scene Retrieving System

Thank you for your attention!