DTW for Speech Recognition
J.-S. Roger Jang ( 張智星 )
http://www.cs.nthu.edu.tw/~jang
MIR Lab ( 多媒體資訊檢索實驗室 )
CS, Tsing Hua Univ. ( 清華大學 資工系 )
Dynamic Time Warping (DTW)
Characteristics:
- Pattern-matching-based approach
- Requires less memory and computation
- Suitable for speaker-dependent recognition
- Suitable for small to medium vocabularies
- Suitable for microprocessor/chip implementation
Applications:
- Speaker identification and verification for surveillance
- Voice commands for mobile phones and toys
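Dynamic Time Warping: Type 1
- t: input MFCC matrix (each column is a frame's feature vector)
- r: reference MFCC matrix
- Local paths: 27-45-63 degrees
DTW recurrence:
$$D(i,j) = |t(i) - r(j)| + \min\left\{ \begin{array}{l} D(i-1,\,j-2) \\ D(i-1,\,j-1) \\ D(i-2,\,j-1) \end{array} \right\}$$
[Figure: DTW grid with axes i (input frames t(i-1), t(i)) and j (reference frames r(j-1), r(j)), showing the cell D(i,j).]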
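The type-1 recurrence can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's implementation: the function names `frame_dist` and `dtw_type1` are hypothetical, frames are plain lists of floats, the local distance |t(i) - r(j)| is assumed to be Euclidean, and both corners of the path are assumed fixed.

```python
import math

INF = float("inf")

def frame_dist(a, b):
    """Euclidean distance between two feature vectors (assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_type1(t, r):
    """Type-1 DTW (27-45-63 degree local paths) with both corners fixed.

    t, r: lists of frames; each frame is a list of floats.
    Returns the accumulated distance at the end corner, or inf when no
    admissible path joins the two corners.
    """
    m, n = len(t), len(r)
    # Two layers of inf padding: cell (i, j) is stored at D[i + 2][j + 2],
    # so the predecessors at i-2 and j-2 never need a bounds check.
    D = [[INF] * (n + 2) for _ in range(m + 2)]
    for i in range(m):
        for j in range(n):
            d = frame_dist(t[i], r[j])
            if i == 0 and j == 0:
                D[2][2] = d  # start corner has no predecessor
                continue
            best = min(
                D[i + 1][j],      # D(i-1, j-2)
                D[i + 1][j + 1],  # D(i-1, j-1)
                D[i][j + 1],      # D(i-2, j-1)
            )
            D[i + 2][j + 2] = d + best
    return D[m + 1][n + 1]

if __name__ == "__main__":
    # The second reference frame is skipped by the (i-1, j-2) step.
    print(dtw_type1([[0.0], [2.0]], [[0.0], [1.0], [2.0]]))
```

Because every step advances both indices, a type-1 path can skip frames but never repeats one, which is why some length combinations have no admissible path at all.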
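Dynamic Time Warping: Type 2
- t: input MFCC matrix (each row is a frame's feature vector)
- r: reference MFCC matrix
- Local paths: 0-45-90 degrees
DTW recurrence:
$$D(i,j) = |t(i) - r(j)| + \min\left\{ \begin{array}{l} D(i-1,\,j) \\ D(i-1,\,j-1) \\ D(i,\,j-1) \end{array} \right\}$$
[Figure: DTW grid with axes i (input frames t(i-1), t(i)) and j (reference frames r(j-1), r(j)), showing the cell D(i,j).]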
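A minimal Python sketch of the type-2 recurrence follows. It is an illustration under stated assumptions, not the author's code: `frame_dist` and `dtw_type2` are hypothetical names, frames are lists of floats, the local distance is assumed to be Euclidean, and the path is assumed to run from corner to corner.

```python
import math

INF = float("inf")

def frame_dist(a, b):
    """Euclidean distance between two feature vectors (assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_type2(t, r):
    """Type-2 DTW (0-45-90 degree local paths) with both corners fixed."""
    m, n = len(t), len(r)
    # One layer of inf padding: row 0 and column 0 are virtual cells.
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0  # virtual origin, so that D(1,1) equals the first local distance
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = frame_dist(t[i - 1], r[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[m][n]

if __name__ == "__main__":
    # The extra reference frame 1.0 must be absorbed by a neighbour: cost 1.0.
    print(dtw_type2([[0.0], [3.0]], [[0.0], [1.0], [3.0]]))
```

Unlike type 1, the 0- and 90-degree steps let one frame be matched to several frames of the other sequence, so a corner-to-corner path exists for any pair of non-empty inputs.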
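Local Path Constraints
Type 1: 27-45-63 local paths
$$D(i,j) = |t(i) - r(j)| + \min\left\{ \begin{array}{l} D(i-1,\,j-2) \\ D(i-1,\,j-1) \\ D(i-2,\,j-1) \end{array} \right\}$$
Type 2: 0-45-90 local paths
$$D(i,j) = |t(i) - r(j)| + \min\left\{ \begin{array}{l} D(i-1,\,j) \\ D(i-1,\,j-1) \\ D(i,\,j-1) \end{array} \right\}$$
[Figure: two grids showing the predecessors of D(i,j). Type 1: D(i-1,j-2), D(i-1,j-1), D(i-2,j-1). Type 2: D(i,j-1), D(i-1,j-1), D(i-1,j).]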
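Path Penalty for Type-1 DTW
Path penalty:
- No penalty for the 45-degree path
- Some penalty for paths that deviate from 45 degrees
$$D(i,j) = |t(i) - r(j)| + \min\left\{ \begin{array}{l} D(i-1,\,j-2) + \eta \\ D(i-1,\,j-1) \\ D(i-2,\,j-1) + \eta \end{array} \right\}$$
Here $\eta > 0$ denotes the penalty added on the 27- and 63-degree paths; the 45-degree path carries a penalty of 0.
[Figure: cell D(i,j) with its three predecessors D(i-1,j-2), D(i-2,j-1), and D(i-1,j-1); the 45-degree path is labeled with penalty 0.]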
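The effect of the path penalty can be sketched as follows. This is a hedged illustration rather than the author's implementation: `dtw_type1_penalized` and its `penalty` parameter are hypothetical names, a single penalty value is assumed for both off-diagonal paths, and the local distance is assumed to be Euclidean.

```python
import math

INF = float("inf")

def frame_dist(a, b):
    """Euclidean distance between two feature vectors (assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_type1_penalized(t, r, penalty=0.0):
    """Type-1 DTW where the 27- and 63-degree steps cost an extra `penalty`."""
    m, n = len(t), len(r)
    # Two layers of inf padding: cell (i, j) is stored at D[i + 2][j + 2].
    D = [[INF] * (n + 2) for _ in range(m + 2)]
    for i in range(m):
        for j in range(n):
            d = frame_dist(t[i], r[j])
            if i == 0 and j == 0:
                D[2][2] = d  # start corner
                continue
            best = min(
                D[i + 1][j] + penalty,  # from (i-1, j-2): 63-degree path
                D[i + 1][j + 1],        # from (i-1, j-1): 45-degree path, no penalty
                D[i][j + 1] + penalty,  # from (i-2, j-1): 27-degree path
            )
            D[i + 2][j + 2] = d + best
    return D[m + 1][n + 1]

if __name__ == "__main__":
    t = [[0.0], [2.0]]
    r = [[0.0], [1.0], [2.0]]
    print(dtw_type1_penalized(t, r))               # skipping a frame is free
    print(dtw_type1_penalized(t, r, penalty=0.5))  # the same skip now costs 0.5
```

With a positive penalty, two sequences of equal length that match frame by frame keep a distance of 0, while any alignment that needs to skip frames is charged once per off-diagonal step.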
DTW Paths of “Match Corners”
- We assume the speed of the user's acoustic input is between 1/2 and 2 times that of the intended sentence.
- Both corners of the path are fixed, so endpoint detection is critical.
- Suitable for voice command applications
[Figure: DTW path on the i-j grid joining the two corners.]
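The speed assumption translates into a simple length check when the type-1 local paths (steps of (1,1), (1,2), and (2,1)) are used: a corner-to-corner path over m input frames and n reference frames exists only if neither sequence needs more than twice as many steps as the other. The helper below is a hypothetical sketch of that check, assuming type-1 local paths; it is not taken from the slides.

```python
def corners_reachable(m, n):
    """Return True if a type-1 DTW path can join (1, 1) and (m, n).

    The path must cover m - 1 steps along i and n - 1 steps along j, and each
    local step advances one axis by 1 and the other by 1 or 2.
    """
    if m < 1 or n < 1:
        return False
    di, dj = m - 1, n - 1
    return dj <= 2 * di and di <= 2 * dj

if __name__ == "__main__":
    print(corners_reachable(50, 90))   # input roughly 1.8x slower than reference
    print(corners_reachable(50, 120))  # more than 2x slower: no admissible path
```

Running this check before filling the table lets an implementation reject badly mismatched utterances without computing any distances.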
DTW Paths of “Match Anywhere”
- No fixed anchor positions
- Suitable for retrieval of personal spoken documents
[Figure: DTW paths on the i-j grid with no fixed start or end position.]
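One common way to realize "match anywhere" is to let the path start and end at any frame of the longer sequence. The sketch below assumes this interpretation, uses the type-2 local paths for brevity, and takes the short sequence as the query; the names `dtw_match_anywhere`, `query`, and `doc` are hypothetical, and the Euclidean local distance is an assumption.

```python
import math

INF = float("inf")

def frame_dist(a, b):
    """Euclidean distance between two feature vectors (assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_match_anywhere(query, doc):
    """Align the whole query against any contiguous part of doc.

    Returns (distance, end), where end is the 0-based index of the doc frame
    at which the best-matching segment ends.
    """
    m, n = len(query), len(doc)
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        D[0][j] = 0.0  # free start: the path may begin at any doc frame
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = frame_dist(query[i - 1], doc[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    # Free end: pick the doc frame where the full query is matched most cheaply.
    end = min(range(1, n + 1), key=lambda j: D[m][j])
    return D[m][end], end - 1

if __name__ == "__main__":
    doc = [[0.0], [0.0], [5.0], [6.0], [0.0]]
    print(dtw_match_anywhere([[5.0], [6.0]], doc))
```

In the usage example the query occurs verbatim at doc frames 2 and 3, so the best segment ends at index 3 with zero distance.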
Other Variants
Local constraints
Start/ending area
Implementation Issues
To save memory:
- Use a 2-column table for type-1 DTW
- Use a 1-column table for type-2 DTW
To avoid too many if-then statements:
- Pad the type-1 DTW table with two layers of padding
- Pad the type-2 DTW table with one layer of padding
To find a suitable path:
- Minimize the total distance
- Minimize the average distance
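The memory-saving idea for type-2 DTW can be sketched with a single column that is overwritten in place, combined with one layer of padding so the inner loop needs no boundary tests. This is an illustrative sketch only: `dtw_type2_one_column` is a hypothetical name, the local distance is assumed to be Euclidean, and the total (not average) distance is returned.

```python
import math

INF = float("inf")

def frame_dist(a, b):
    """Euclidean distance between two feature vectors (assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_type2_one_column(t, r):
    """Type-2 DTW total distance using a single column of length len(r) + 1."""
    n = len(r)
    # col[j] holds D(i-1, j) until it is overwritten with D(i, j).
    col = [INF] * (n + 1)
    col[0] = 0.0  # virtual origin D(0, 0); index 0 is the padding layer
    for frame in t:
        diag = col[0]  # D(i-1, 0)
        col[0] = INF   # D(i, 0) is padding for every real input frame
        for j in range(1, n + 1):
            up = col[j]  # D(i-1, j), saved before being overwritten
            col[j] = frame_dist(frame, r[j - 1]) + min(up, diag, col[j - 1])
            diag = up    # becomes D(i-1, j-1) for the next j
    return col[n]

if __name__ == "__main__":
    print(dtw_type2_one_column([[0.0], [3.0]], [[0.0], [1.0], [3.0]]))
```

The trade-off is that the optimal path can no longer be traced back, so this layout fits applications that need only the distance, such as comparing an utterance against stored templates.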
DTW Path of “Match Corners”
DTW Path of “Match Anywhere”
DTW Path of “Match Anywhere”
[Figure: DTW path aligning the spoken query 清華大學 ("Tsing Hua University") with the longer utterance 我今天很高興來到清華大學進行演講 ("I am very happy to come to Tsing Hua University today to give a talk"). Both axes are frame indices. DTW total distance = 304.957.]
DTW for Spoken Document Retrieval
Applications:
- Voice-based audio/video retrieval
Issues in spoken document retrieval (SDR) using DTW:
- Speaker normalization
  - Vocal tract length normalization (VTLN)
  - Frequency warping
- Efficiency
DTW for Speaker-independent Voice Command Recognition
Applications:
- Digit recognition
Technical highlights:
- Extensive recordings
- Clustering within each command
- Some indexing methods for DTW
- Suitable for small-vocabulary applications
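As a rough illustration of how DTW serves as a pattern matcher for voice commands, the sketch below labels an utterance with the command whose stored template has the smallest DTW distance. Everything here is an assumption made for illustration: the names `recognize` and `templates`, the use of type-2 local paths, the Euclidean local distance, and the toy one-dimensional frames. Several templates per command stand in for the cluster representatives mentioned above; the clustering and indexing steps themselves are not shown.

```python
import math

INF = float("inf")

def frame_dist(a, b):
    """Euclidean distance between two feature vectors (assumed local distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_type2(t, r):
    """Type-2 DTW total distance with both corners fixed."""
    m, n = len(t), len(r)
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0  # virtual origin in the one-layer padding
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = frame_dist(t[i - 1], r[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[m][n]

def recognize(utterance, templates):
    """Return (label, distance) of the closest template.

    templates: dict mapping a command label to a list of reference
    frame sequences (for example, one per cluster of recordings).
    """
    best_label, best_dist = None, INF
    for label, refs in templates.items():
        for ref in refs:
            dist = dtw_type2(utterance, ref)
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label, best_dist

if __name__ == "__main__":
    templates = {
        "one": [[[0.0], [1.0], [2.0]]],
        "two": [[[5.0], [5.0], [5.0]]],
    }
    print(recognize([[0.0], [1.0], [1.0], [2.0]], templates))
```

The cost of this exhaustive search grows with the number of templates, which is one reason the approach is described as suitable for small vocabularies.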