Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition
Qi Li, Senior Member, IEEE, Jinsong Zheng, Augustine Tsai, and Qiru Zhou, Member, IEEE
Presented by Chen Hung_Bin
Outline
• Introduction to endpoint detection
• Endpoint detection methods
• Endpoint detection (filter)
• State transition
• Experiment
Introduction
• The detection of the presence of speech embedded in various types of nonspeech events and background noise is called endpoint detection, speech detection, or speech activity detection.
• This paper addresses endpoint detection by sequential and batch-mode processes to support real-time recognition.
– sequential: automatic speech recognition (ASR)
– batch-mode: utterances are usually as short as a few seconds and the delay in response is usually small.
Introduction
• Endpoint detection methods include:
– energy threshold
– pitch detection
– spectrum analysis
– cepstral analysis
– zero-crossing rate
– periodicity measure
– chi-square test
– entropy
– hybrid detection
Introduction
• energy
$$\text{magnitude} = \sum_{i=1}^{N} |x[i]|$$
$$\text{energy} = \sum_{i=1}^{N} x[i]^2$$
But frequently the value of e(t) is measured in dB:
$$\text{dB} = 10 \log_{10} \sum_{i=1}^{N} x[i]^2$$
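The three frame measures on this slide (magnitude, energy, and energy in dB) can be sketched in a few lines of Python. This is only an illustration, assuming a frame is a plain list of samples; the function names and the small `eps` guard against taking the log of zero are assumptions, not part of the slides.

```python
import math

def frame_magnitude(x):
    """Sum of absolute sample values over one frame."""
    return sum(abs(s) for s in x)

def frame_energy(x):
    """Sum of squared sample values over one frame."""
    return sum(s * s for s in x)

def frame_energy_db(x, eps=1e-12):
    """Frame energy in dB; eps (an assumption) avoids log10(0) on all-zero frames."""
    return 10.0 * math.log10(frame_energy(x) + eps)
```

Working in dB compresses the large dynamic range of raw energy, which is why the later slides use the log energy as the single feature.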
Introduction
• A Mandarin digit “eight.”
• spectrum
Introduction
• zero-crossing rate
Introduction
• The chi-square test is given by
$$\chi^2 = \sum_{i=1}^{N} \frac{(o_i - e_i)^2}{e_i}$$
• The hypothesis test can thus be written as
$$\chi^2 \underset{H_0}{\overset{H_1}{\gtrless}} \text{threshold}$$
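A small sketch of the chi-square statistic and the threshold test may help here. It assumes `observed` and `expected` are equal-length lists with positive expected values; the function names are illustrative, and the slide does not say which hypothesis corresponds to speech, so the sketch only reports `"H1"` or `"H0"`.

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum over i of (o_i - e_i)^2 / e_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def hypothesis_test(observed, expected, threshold):
    """Return 'H1' when the statistic exceeds the threshold, otherwise 'H0'."""
    return "H1" if chi_square(observed, expected) > threshold else "H0"
```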
Introduction
• entropy
$$H(x) = \sum_{k} P(x_k) \log \frac{1}{P(x_k)}$$
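As a quick illustration of the entropy measure, the sketch below evaluates it for a discrete probability distribution. The base of the logarithm is not given on the slide; base 2 is an assumption here, and zero-probability terms are skipped because they contribute nothing to the sum.

```python
import math

def entropy(probs):
    """Entropy of a discrete distribution: sum of P(x_k) * log2(1 / P(x_k))."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)
```

A flat distribution gives the highest entropy and a peaked one gives a low value, which is the property an entropy-based detector relies on.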
Introduction
• Endpoint detection is crucial to both accuracy and speed, for several reasons:
– It is hard to model noise and silence accurately in changing environments.
– If silence frames can be removed prior to recognition, the accumulated utterance likelihood scores will focus more on the speech.
– Cepstral mean subtraction (CMS), a popular algorithm for robust speech recognition, needs accurate endpoints to compute the mean of speech frames precisely in order to improve recognition accuracy.
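To make the CMS point concrete, here is a minimal sketch in which the cepstral mean is computed only over the frames between the detected endpoints and then subtracted from every frame. The function name, the list-of-lists frame layout, and the inclusive `begin`/`end` frame indices are assumptions for illustration.

```python
def cepstral_mean_subtraction(frames, begin, end):
    """Subtract the mean of speech frames (begin..end inclusive) from all frames."""
    speech = frames[begin:end + 1]
    dim = len(frames[0])
    # Mean of each cepstral coefficient over speech frames only.
    mean = [sum(f[d] for f in speech) / len(speech) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]
```

If the endpoints are wrong, silence frames leak into `speech` and bias the mean, which is the dependency the bullet above describes.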
Introduction
• This study points out:
– The more accurately we can detect endpoints, the better we can do on real-time energy normalization.
• Requirements:
– Accurate location of detected endpoints
– Robust detection at various noise levels
– Low computational complexity
– Fast response time
– Simple implementation
Endpoint Detection (Filter)
• First, we need a detector (filter) that meets the following general requirements:
– 1) invariant outputs at various background energy levels;
– 2) capability of detecting both beginning and ending points;
– 3) short time delay or look-ahead;
– 4) limited response level;
– 5) maximum output signal-to-noise ratio (SNR) at endpoints;
– 6) accurate location of detected endpoints;
– 7) maximum suppression of false detection.
Endpoint Detection (Filter)
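The log energy feature (dB) is
$$g(t) = 10 \log_{10} \sum_{j=n_t}^{n_t+I-1} o(j)^2$$
The filter is defined by
$$f(x) = e^{Ax}\left[K_1 \sin(Ax) + K_2 \cos(Ax)\right] + e^{-Ax}\left[K_3 \sin(Ax) + K_4 \cos(Ax)\right] + K_5 + K_6 e^{sx}$$
$$h(i) = \{-f(-w \le i \le 0),\ f(1 \le i \le w)\}$$
where A, s, and K_i are parameters, i is an integer, and w is the half width of the filter.
The filter can then be operated as a moving-average filter:
$$F(t) = \sum_{i=-w}^{w} h(i)\, g(t+i)$$
where t is the current frame number.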
Filter for Both Beginning- and Ending-Edge Detection
• Choose the filter size:
– W = 13
– s = 0.5385
– A = 0.2208
– [K_1 … K_6] = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56]
• Let H(i) = h(i-13); then the filter has 25 points in total with a 24-frame look-ahead, since both H(1) and H(25) are zeros.
$$F(t) = \sum_{i=2}^{24} H(i)\, g(t+i-2)$$
Count 30
Less than 25 points
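The following sketch builds the filter from the parameters on this slide and applies it as a moving average over an energy contour. It is an illustration under stated assumptions, not the authors' implementation: f(x) is evaluated only for -W ≤ x ≤ 0, and the right half of h is assumed to mirror the left half with opposite sign, so h(i) = -f(i) for i ≤ 0 and h(i) = f(-i) for i ≥ 1. The overall sign of the output depends on that convention. The helper names are hypothetical.

```python
import math

# Parameters as listed on the slide.
W = 13
S = 0.5385
A = 0.2208
K1, K2, K3, K4, K5, K6 = 1.583, 1.468, -0.078, -0.036, -0.872, -0.56

def f(x):
    """Filter profile f(x); used here only for -W <= x <= 0."""
    return (math.exp(A * x) * (K1 * math.sin(A * x) + K2 * math.cos(A * x))
            + math.exp(-A * x) * (K3 * math.sin(A * x) + K4 * math.cos(A * x))
            + K5 + K6 * math.exp(S * x))

def build_filter():
    """Return h(-W..W) as a list, where h[i + W] holds h(i).

    Assumption: antisymmetric filter, h(i) = -f(i) for i <= 0
    and h(i) = f(-i) for i >= 1.
    """
    left = [-f(i) for i in range(-W, 1)]
    right = [f(-i) for i in range(1, W + 1)]
    return left + right

def apply_filter(g, h):
    """F(t) = sum over i of h(i) * g(t + i); frames without a full window get 0.0."""
    out = [0.0] * len(g)
    for t in range(W, len(g) - W):
        out[t] = sum(h[i + W] * g[t + i] for i in range(-W, W + 1))
    return out
```

Because the coefficients are antisymmetric, they sum to zero, so a constant background level produces no output and adding an offset to the whole energy contour leaves F(t) unchanged. That is how this kind of filter can meet the requirement of invariant outputs at various background energy levels. The response is largest in magnitude where the energy steps up or down.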
Filter for Both Beginning- and Ending-Edge Detection
• Filter sizes chosen in this paper:
Shape of the optimal filter for beginning-edge detection, plotted as h(t), with W = 7 and s = 1
Shape of the optimal filter for ending-edge detection, plotted as h(t), with W = 35 and s = 0.2
Batch-mode Endpoint Detection
Lines E, F, G, and H indicate the locations of two pairs of beginning and ending points.
Output of the beginning-edge filter (solid line) and ending-edge filter (dashed line)
State Transition Diagram
• A three-state transition diagram is used to make final decisions.
– The states are silence, in-speech, and leaving-speech.
8 kHz sampling rate
State transition diagram for endpoint decision. (a) energy contour of digit “4” (b) filter outputs and state transitions.
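A three-state decision process of this kind can be sketched as a small state machine driven by the filter output. The slide names only the three states, so the transition rules below are assumptions for illustration: an upper threshold `t_upper` starts speech, a lower threshold `t_lower` marks a possible ending, and `gap` frames without a new rise confirm the ending. All names and parameters are hypothetical.

```python
from enum import Enum

class State(Enum):
    SILENCE = "silence"
    IN_SPEECH = "in-speech"
    LEAVING_SPEECH = "leaving-speech"

def detect_endpoints(filter_out, t_upper, t_lower, gap):
    """Return a list of (begin, end) frame indices found in the filter output."""
    state = State.SILENCE
    begin = end = None
    count = 0
    segments = []
    for t, value in enumerate(filter_out):
        if state is State.SILENCE:
            if value >= t_upper:              # rising edge: speech begins
                state, begin = State.IN_SPEECH, t
        elif state is State.IN_SPEECH:
            if value <= t_lower:              # falling edge: candidate ending
                state, end, count = State.LEAVING_SPEECH, t, 0
        else:                                 # LEAVING_SPEECH
            if value >= t_upper:              # speech resumed after a short pause
                state = State.IN_SPEECH
            else:
                count += 1
                if count > gap:               # pause long enough: confirm ending
                    segments.append((begin, end))
                    state = State.SILENCE
    if state is State.LEAVING_SPEECH:         # utterance ended during the gap count
        segments.append((begin, end))
    elif state is State.IN_SPEECH:            # no falling edge before input ended
        segments.append((begin, len(filter_out) - 1))
    return segments
```

The leaving-speech state is what lets short pauses inside an utterance be absorbed instead of splitting it into several segments.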
Real-Time Energy Normalization
• The purpose of energy normalization is to normalize the utterance energy g(t) such that the largest value of energy is close to zero:
$$\tilde{g}(t) = g(t) - g_{\max}$$
• How to estimate g_max (the look-ahead window is from M to N, as shown in the figure):
$$\hat{g}_{\max}(t) = \max\{g(t);\ M \le t \le M + 2W\}$$
• Update ĝ_max(t) as
$$\hat{g}_{\max}(t) = \max\{g(t + 2W),\ \hat{g}_{\max}(t-1)\};\quad \forall t > M$$
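The running-maximum estimate can be sketched as follows. The code assumes `g` is a list of frame energies in dB, `M` is the frame index where normalization starts, and `W` is the filter half width, so the look-ahead covers 2W frames. The function name is hypothetical. Near the end of the utterance, where no look-ahead frame exists, the last estimate is simply kept, which is an assumption of this sketch.

```python
def normalize_energy(g, M, W):
    """Return g(t) minus the running estimate of g_max, for t = M .. len(g) - 1."""
    # Initial estimate over the look-ahead window M <= t <= M + 2W.
    g_max = max(g[M:M + 2 * W + 1])
    normalized = []
    for t in range(M, len(g)):
        # For t > M, fold in the newest look-ahead frame g(t + 2W) if it exists.
        if t > M and t + 2 * W < len(g):
            g_max = max(g[t + 2 * W], g_max)
        normalized.append(g[t] - g_max)
    return normalized
```

Because the estimate never decreases and always includes the current look-ahead window, every normalized value is at most zero, and the loudest frame seen so far maps to a value close to zero.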
Real-Time Energy Normalization
$$\bar{g}(t) = E\{g(t);\ M \le t \le M + 2W\} \ge g_m$$
But we need g_m to be a pre-selected threshold to ensure that the new g_max is not from a single click.
Real-Time Energy Normalization
• example
(a) Energy contours of “4-327-631-Z214” from original utterance (bottom, 20 dB SNR) and after adding car noise (top, 5 dB SNR).
(b) Filter outputs for 5 dB (dashed line) and 20 dB (solid line) SNR cases. (c) Detected endpoints and normalized energy for the 20 dB SNR case and (d) for the 5 dB SNR case.
Database Evaluation
• The proposed algorithm was compared with a baseline endpoint detection algorithm on one noisy database and several telephone databases.
• Baseline Endpoint Detection:
– A six-state transition diagram is used, with initializing, silence, rising, energy, fell-rising, and fell states.
– In total, eight counters and 24 hard-limit thresholds are used for the decisions of state transition.
Database Evaluation
• Noisy Database Evaluation:
– In this experiment, a database was first recorded from a desktop computer at a 16 kHz sampling rate, then down-sampled to 8 kHz.
– Car and other background noises were artificially added to the original database at SNR levels of 5, 10, 15, and 20 dB.
– The original database has 39 utterances and 1738 digits in total.
– LPC features and the short-term energy were used, with hidden Markov models (HMMs) for recognition.
Database Evaluation
Comparisons on real-time connected digit recognition
(a) utterance in DB5: “1 Z 4 O 5 8 2.”
(b) baseline, recognized as “1 Z 4 O 5 8.”
(c) proposed, recognized as “1 Z 4 O 5 8 2.”
(d) filter output
Database Evaluation
• Telephone Database Evaluation:
– The proposed algorithm was further evaluated on 11 databases collected from telephone networks at an 8 kHz sampling rate in various acoustic environments.
– DB1 to DB5 contain digits, alphabet and word strings.
– DB6 to DB11 contain pure digit strings.
– In the proposed system, we set the parameters as
30)( and 0.3,6.3,60,800 countCapTTgg LUm
Database Evaluation
digits, alphabet and word strings
pure digit strings
CONCLUSIONS
• Since the entire algorithm only uses a 1-D energy feature, it has low complexity and is very fast in computation.