Selected topics from 40 years of research on speech and speaker recognition
Sadaoki Furui
Tokyo Institute of Technology, Department of Computer Science
[email protected]
Generations of ASR technology (1950–2010)
• Prehistory (–1952)
• 1G (1952–1968): Heuristic approaches (analog filter bank + logic circuits)
• 2G (1968–1980): Pattern matching (LPC, FFT, DTW)
• 3G (1980–1990): Statistical framework (HMM, n-gram, neural net)
• 3.5G (1990–): Discriminative approaches, robust training, normalization, adaptation, spontaneous speech, rich transcription
• 4G (?): Extended knowledge processing
Our research: NTT Labs (+ Bell Labs), Tokyo Tech; collaboration with other labs
1970s
• Speaker recognition by statistical features
• Speaker recognition by cepstral features
Speaker recognition by long-term averaged spectrum
[Figure: Recognition accuracy (%) of Methods I–IV as a function of the time interval between training and test (2–3 days, 3 weeks, 6 weeks, 3 months). (a) Speaker identification; (b) speaker verification.]
Speaker recognition by using LPC features
[Figure: Recognition accuracy (%) using one word vs. four words, as a function of the time interval (2–3 days to 10 months). (a) Speaker identification; (b) speaker verification.]
The amount of spectral variation as a function of time interval
[Figure: Spectral distance as a function of time interval. Top: log-area-ratio parameters (0–33 months); bottom: long-term averaged spectrum (0–15 months).]
Variation of the long-time averaged spectrum from four sessions over eight months, and corresponding spectral envelopes derived from cepstrum coefficients weighted by the square root of inverse variances
[Figure: (a) Long-time averaged spectra; (b) envelopes by weighted cepstrum, for subjects W and T.]
Speaker recognition by using LPC features (effectiveness of inverse filtering)
[Figure: Error rate (%) with no inverse filter, a 1st-order filter, and a 2nd-order CD filter, as a function of the time interval (2–3 days to 10 months). (a) Two words; (b) one word.]
Speaker verification using cepstrum features
[Block diagram: sample utterance → end-point detection → cepstrum coefficients through LPC analysis → long-time average → normalization by average cepstrum → expansion by polynomial function → feature selection → dynamic time warping (with weights) against a reference template → overall distance computation → comparison with a stored threshold → accept or reject. The identity claim retrieves the speaker's stored reference data.]
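To make the matching step concrete, here is a minimal DTW sketch in Python. It illustrates the general technique only, not the original Bell Labs implementation, and it omits the cepstral weighting and polynomial expansion shown in the diagram; verify() and its threshold argument are hypothetical names.

import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between feature sequences
    x (T1 x D) and y (T2 x D), using a symmetric step pattern."""
    t1, t2 = len(x), len(y)
    # Local distances: Euclidean distance between every frame pair.
    local = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((t1 + 1, t2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j],      # insertion
                acc[i, j - 1],      # deletion
                acc[i - 1, j - 1],  # match
            )
    # Normalize by path length so utterances of different lengths compare fairly.
    return acc[t1, t2] / (t1 + t2)

def verify(sample, template, threshold):
    """Accept the identity claim if the warped distance is small enough."""
    return dtw_distance(sample, template) <= threshold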
On-line speaker verification experiments using 120 Bell Labs employees
User: "We were away a year ago."
System: "Stand by for analysis."
System: "Your identity has been verified. Thank you."
1980s
• Spectral dynamics in speech perception and recognition
• Speaker recognition by HMM/GMM
Analysis of relationships between spectral dynamics and syllable perception
[Figure: Speech energy with initial and final truncation points, the corresponding spectral dynamics, and an identification test.]
Relationship between truncated CV syllable identification scores and truncation position relative to the perceptual critical point
[Figure: Identification score (%) vs. truncation position (ms, −10 to 20) relative to the perceptual critical point, for syllables, vowels, and consonants. (a) Initial truncation; (b) final truncation.]
Relationship between spectral transition and syllable identification scores as a function of the truncation position for the syllable /njo/
[Figure: For the syllable /njo/ (/n/, /j/, /o/), energy, spectral transition, and syllable identification score (%) as functions of the truncation position (0–250 ms), for initial and final truncation.]
Distribution of the difference between the perceptual critical point and the maximum spectral transition position for all 100 syllables
[Histogram: Frequency (%) vs. relative position (ms, −60 to +60). Initial truncation: μ = 4.0 ms, σ = 14.2 ms; final truncation: μ = −2.6 ms, σ = 17.9 ms. Positive values mean that the maximum transition is located after the critical point.]
Relationship between truncation position and identification scores for the truncated CV syllables
[Figure: Identification score (%) vs. truncation position (ms, −30 to 20, relative to the critical point for final truncation) for vowels, consonants, and syllables; a 50 ms period is marked.]
Ti, Tf: perceptual critical points for initial and final truncation; Tm: maximum spectral transition position
Experimental results
• "Perceptual critical points" (Ti, Tf) are related to maximum spectral transition positions (Tm).
• The 10 ms period including Tm bears the most important information for consonant and syllable perception.
• Crucial information for both consonant and vowel identification is contained in the same transitional part of each syllable.
• The spectral transition is more crucial than unvoiced and buzz-bar periods for consonant (syllable) perception.
Role of spectral transition for speech perception
[Figure: Spectrogram (power, frequency in kHz) of /ka/, showing periods that can be truncated; the maximum spectral change period is essential for syllable perception.]
Cepstrum and delta-cepstrum coefficients
[Figure: Parameter (vector) trajectory, showing the instantaneous vector (cepstrum) and the transitional (velocity) vector (delta-cepstrum).]
Instantaneous and dynamic cepstrum features
Speech → FFT → spectrum → log → DCT → cepstrum → Δ, Δ² → acoustic vector
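Delta-cepstrum coefficients are conventionally computed as regression slopes of the cepstral trajectory over a short window. The sketch below shows one standard formulation; the window half-width K = 2 is a typical choice, not a value taken from the slide.

import numpy as np

def delta(cep, K=2):
    """Delta (velocity) coefficients of a cepstrum sequence cep (T x D):
    regression slope over a window of +/- K frames."""
    T = len(cep)
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    padded = np.pad(cep, ((K, K), (0, 0)), mode="edge")  # repeat edge frames
    d = np.zeros_like(cep, dtype=float)
    for k in range(1, K + 1):
        d += k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
    return d / denom

def acoustic_vectors(cep):
    """Acoustic vector: cepstrum + delta + delta-delta, as in the diagram above."""
    d1 = delta(cep)
    d2 = delta(d1)
    return np.concatenate([cep, d1, d2], axis=1)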
A five-state ergodic HMM for text-independent speaker recognition
[Figure: A five-state ergodic HMM with states S1–S5 and a full set of transition probabilities aij.]
Speaker identification rates as a function of the number of states and mixtures in ergodic HMMs
[Figure: Speaker identification rate (%) vs. total number of mixtures (number of states × number of mixtures, 0–70) for 1-, 2-, 3-, 4-, and 8-state ergodic HMMs; configurations such as 1s64m, 4s16m, and 8s8m (s = states, m = mixtures) cluster near the top.]
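In the single-state case an ergodic HMM with Gaussian mixtures reduces to one GMM per speaker, so text-independent identification can be sketched as follows; scikit-learn is used for brevity, and the model sizes are illustrative rather than taken from the figure.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_per_speaker, n_mix=64):
    """Fit one GMM per speaker on that speaker's cepstral frames (N x D)."""
    return {
        spk: GaussianMixture(n_components=n_mix, covariance_type="diag").fit(X)
        for spk, X in features_per_speaker.items()
    }

def identify(models, X):
    """Pick the speaker whose model gives the highest average frame log-likelihood."""
    return max(models, key=lambda spk: models[spk].score(X))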
1990s
• Japanese LVCSR using a newspaper corpus and broadcast news
• Robust ASR
• Text-prompted speaker recognition
Japanese LVCSR using a newspaper corpus and broadcast news

Comparison of lexica and LM training corpora for different languages

| | Nikkei (Japanese) | WSJ (English) | Le Monde (French) | Frankfurter Rundschau (German) | Sole 24 (Italian) |
| Training text size [words] | 180M | 37.2M | 37.7M | 36M | 25.7M |
| Distinct words | 623k | 165k | 280k | 650k | 200k |
| 5k coverage | 88.0% | 90.6% | 85.2% | 82.9% | 88.3% |
| 20k coverage | 96.2% | 97.5% | 94.7% | 90.0% | 96.3% |
| 40k coverage | 98.2% | 99.2% | 97.6% | – | 98.9% |
| 65k coverage | 99.0% | 99.6% | 98.3% | 95.1% | 99.0% |
| 20k OOV rate | 3.8% | 2.5% | 5.3% | 10.0% | 3.7% |
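Coverage and OOV figures like those above can be reproduced with a few lines; here is a sketch assuming pre-tokenized word lists for the training and test text.

from collections import Counter

def coverage(train_tokens, test_tokens, vocab_size):
    """Coverage of a test set by the most frequent `vocab_size` training words."""
    vocab = {w for w, _ in Counter(train_tokens).most_common(vocab_size)}
    covered = sum(1 for w in test_tokens if w in vocab)
    return covered / len(test_tokens)

# The OOV rate is the complement of coverage, e.g.:
# oov_rate_20k = 1.0 - coverage(train, test, 20_000)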
LM units for Japanese: morphemes
Entry items corresponding to the number of homophone classes with k graphemic forms in the class

| Lexicon | Rate in corpus | k = 1 | k = 2 | k = 3 | k > 4 |
| Nikkei (30k) | 20% | 24.1k | 2438 | 706 | 565 |
| BREF (10k) | 35% | 6686 | 1329 | 215 | 73 |
| BREF (40k) | 45% | 23.7k | 5361 | 1264 | 1039 |
| WSJ (9k) | 6% | 8453 | 237 | 22 | 1 |
| WSJ (65k) | 15% | 60.4k | 3689 | 890 | 291 |
| FR (64k) | 10% | 58.1k | 2769 | 221 | 57 |
| So24 (10k) | 1.7% | 9872 | 85 | 3 | 0 |
Relationship between perplexity (bigram) and word error rate for a read-newspaper task
[Figure: Word error rate (%) vs. perplexity (45–70); mean word error rate by trigram: 9.5%.]
Relationship between perplexity (bigram) and word error rate for a broadcast-news task
[Figure: Word error rate (%) vs. perplexity (100–500), plotted separately for anchors and others; mean word error rate by trigram: 19.7% (anchors).]
Robust ASR (supervised/unsupervised acoustic model adaptation)
• Hierarchical spectral clustering-based unsupervised adaptation
• MAP+MCE (minimum classification error) training-based supervised adaptation
• N-best-based unsupervised adaptation
Hierarchical codebook adaptation algorithm maintaining continuity between adjacent clusters
[Figure: A speaker-independent codebook is adapted in four steps (1)–(4) using training utterances, splitting clusters hierarchically while maintaining continuity between adjacent clusters.]
Cepstral distortion between input speech and reference templates resulting from hierarchical codebook adaptation
[Figure: Relative spectral distortion (0.8–1.0) vs. number of clusters (NA, 1, 2, 4, 8, 16, 32, 64, 128; NA = no adaptation), comparing progressive adaptation with direct adaptation.]
Text-prompted speaker recognition method
(Training) Training data (speech and text) → speaker-specific phoneme model creation, starting from speaker-independent phoneme models.
(Recognition) Prompted text → speaker-specific phoneme model concatenation → likelihood calculation against the input speech → text confirmation and speaker verification.
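One common way to realize the final decision step, stated here as an assumption rather than a detail read off the slide, is a duration-normalized log-likelihood ratio between the claimed speaker's concatenated phoneme models and speaker-independent models:

(1/T) [ log P(O | λ_claimed, text) − log P(O | λ_SI, text) ] ≥ θ

where O is the input speech of T frames; the claim is accepted when the normalized ratio exceeds the threshold θ, and the same likelihoods serve to confirm that the prompted text was actually spoken.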
2000s (1)
• Spontaneous speech recognition project and the CSJ corpus
• Spectral reduction in spontaneous speech
• Automatic speech summarization and evaluation
CSJ corpus construction
[Diagram: Data collection in various forms (academic presentations, extemporaneous presentations, and others such as interviews and dialogues) → speech files with noise description → transcribing text (text and reading) → sentence segmentation by pauses → automatic morphological analysis, yielding the Corpus of Spontaneous Japanese (CSJ, 7M words). The Core (500k words) additionally receives manual morphological labeling, prosody labeling, and segmental labeling. The data flow is managed in XML, giving information for research.]
Contents of the CSJ

| Type of speech | # Speakers | # Files | Monologue/Dialogue | Spontaneous/Read | Hours |
| Academic presentations (AP) | 838 | 1006 | Monolog | Spont. | 299.5 |
| Extemporaneous presentations (EP) | 580 | 1715 | Monolog | Spont. | 327.5 |
| Interview on AP * | (10) | 10 | Dialog | Spont. | 2.1 |
| Interview on EP * | (16) | 16 | Dialog | Spont. | 3.4 |
| Task-oriented dialogue * | (16) | 16 | Dialog | Spont. | 3.1 |
| Free dialogue * | (16) | 16 | Dialog | Spont. | 3.6 |
| Reading text * | (244) | 491 | Monolog | Read | 14.1 |
| Reading transcriptions * | (16) | 16 | Monolog | Read | 5.5 |

* Counted as the speakers of AP or EP. Total: 658.8 hours.
Out-of-vocabulary (OOV) rate, word error rate (WER), and adjusted test-set perplexity (APP) as a function of the size of language-model training data (8/8 = 6.84M words)
Linear regression models of the word accuracy (%) with the six presentation attributes

Speaker-independent recognition:
Acc = 0.12AL − 0.88SR − 0.020PP − 2.2OR + 0.32FR − 3.0RR + 95

Speaker-adaptive recognition:
Acc = 0.024AL − 1.3SR − 0.014PP − 2.1OR + 0.32FR − 3.2RR + 99

Acc: word accuracy, SR: speaking rate, PP: word perplexity, OR: out-of-vocabulary rate, FR: filled-pause rate, RR: repair rate
The reduction ratio of the vector norm between each phoneme and the phoneme center in spontaneous speech relative to that in read speech
[Figure: reduction ratios for academic presentations and for extemporaneous presentations.]
Equation for estimating word recognition accuracy
(Acoustic variation only): Accuracy ≈ ax + b
(Acoustic and linguistic variations): Accuracy ≈ ax − by + c
x: mean Mahalanobis distance between phonemes, y: test-set perplexity, a, b, c: constants
Speech summarization by sentence extraction and compaction
[Diagram: Spontaneous speech → speech recognition (acoustic model trained on a speech corpus; language model trained on a language-model corpus) → recognition results with word posterior probabilities → sentence segmentation → sentence extraction (using word frequencies from a large-scale text corpus and a summarization language model from a summary corpus) → sentence compaction (using word dependency probabilities from a manually parsed corpus) → summary, delivered as records, minutes, captions, or indexes.]
Sentence clustering using SVD
The target matrix A (M content words × N sentences), in which column n carries the information of sentence n and row m the information of word m, is decomposed as A = UΣV^T, where U is the left singular vector matrix, Σ = diag(σ1, σ2, ..., σN) the singular value matrix, and V the right singular vector matrix. SVD semantically clusters content words and sentences, deriving a latent semantic structure from a presentation speech represented by the matrix A.

Element a_mn of the matrix A: a_mn = f_mn · log(F_A / F_m)
f_mn: number of occurrences of content word m in sentence n
F_m: number of occurrences of content word m in a large corpus
LSA-based sentence extraction
Dimension reduction by SVD: sentence i, originally the column vector A_i = (a_1i, a_2i, ..., a_Mi)^T, is represented after SVD by (σ1 v_i1, σ2 v_i2, ..., σN v_iN)^T and, after dimension reduction, by the weighted singular-value vector
ψ_i = (σ1 v_i1, σ2 v_i2, ..., σK v_iK)^T.
Each sentence is thus represented by a weighted singular-value vector. To evaluate each sentence, its score is calculated as the norm in the K-dimensional space:
Score(ψ_i) = √( Σ_{k=1..K} (σ_k v_ik)² )
A fixed number of sentences having relatively large sentence scores in the reduced dimensional space are selected.
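The two preceding slides condense into a short NumPy sketch; corpus_counts and total_corpus_words stand in for the large-corpus statistics F_m and F_A (the names are illustrative).

import numpy as np

def lsa_extract(sentences_words, corpus_counts, total_corpus_words,
                K=10, n_select=5):
    """sentences_words: list of word lists, one per sentence.
    Builds A with a_mn = f_mn * log(F_A / F_m), then scores each sentence
    by the norm of its weighted singular-value vector in K dimensions."""
    vocab = sorted({w for s in sentences_words for w in s})
    index = {w: m for m, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences_words)))
    for n, sent in enumerate(sentences_words):
        for w in sent:
            A[index[w], n] += 1.0  # f_mn: word counts per sentence
    # Global weighting: words that are rare in the large corpus get more weight.
    weights = np.log(total_corpus_words /
                     np.array([corpus_counts.get(w, 1) for w in vocab]))
    A *= weights[:, None]
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    K = min(K, len(s))
    # Score of sentence i: || (sigma_1 v_i1, ..., sigma_K v_iK) ||
    scores = np.sqrt(((s[:K, None] * Vt[:K, :]) ** 2).sum(axis=0))
    return np.argsort(scores)[::-1][:n_select]  # indices of extracted sentences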
Word extraction score
For a summarized sentence with M words, V = v1, v2, ..., vM:
S(V_M) = Σ_{m=1..M} [ L(v_m | ... v_{m−1}) + λ_I I(v_m) + λ_C C(v_m) + λ_T Tr(v_m) ]
• L: linguistic score — linguistic correctness (bigram/trigram)
• I: significance (topic) score — important information extraction (amount of information)
• C: confidence score — recognition error exclusion (acoustic and linguistic reliability)
• Tr: word concatenation score — semantic correctness (word dependency probability)
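In the log domain the total is a plain sum over the selected words; here is a minimal sketch with the four component scorers supplied by the caller (all function names are hypothetical placeholders, not the paper's code).

def summarization_score(words, lm_score, sig_score, conf_score, dep_score,
                        lam_I=1.0, lam_C=1.0, lam_T=1.0):
    """S(V_M) = sum_m [ L(v_m | history) + lam_I*I(v_m)
                        + lam_C*C(v_m) + lam_T*Tr(v_m) ].
    lm_score(history, word), sig_score(word), conf_score(word), and
    dep_score(word) are assumed to return log-domain values."""
    total = 0.0
    for m, w in enumerate(words):
        total += lm_score(words[:m], w)   # linguistic score
        total += lam_I * sig_score(w)     # significance (topic) score
        total += lam_C * conf_score(w)    # recognition confidence score
        total += lam_T * dep_score(w)     # word concatenation score
    return total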
Word concatenation score
A penalty for word concatenations that have no dependency in the original sentence. In "the beautiful cherry blossoms in Japan" (phrase 1: "the beautiful cherry blossoms"; phrase 2: "in Japan"), dependencies are both intra-phrase and inter-phrase; extracting "the beautiful Japan" is grammatically correct but incorrect as a summary.
Correlation between subjective and objective evaluation scores (averaged over presentations)
In the subjective evaluation, the summaries were rated on five levels for ease of understanding and appropriateness as summaries.
2000s (2)
• Development of a WFST-based decoder and its applications
• Unsupervised cross-validation and aggregated adaptation methods
WFST (Weighted Finite State Transducer)-based "T3 decoder"
Speech → H ∘ C ∘ L ∘ G → word sequence
H: HMM (HMM states → context-dependent phones); C: triphone context (context-dependent → context-independent phones); L: lexicon (phones → words); G: N-gram (word sequences).
Problems with full composition: large memory requirement and little flexibility (difficult to change partial models). Remedies: on-the-fly composition and parallel decoding.
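For intuition, composition itself can be sketched as a product construction. The fragment below composes two epsilon-free weighted transducers in the tropical semiring; the T3 decoder's on-the-fly composition and epsilon handling are considerably more involved than this toy version.

from collections import defaultdict

# A transducer is (arcs, start, finals): arcs maps a state to a list of
# (ilabel, olabel, weight, next_state); finals maps final states to weights.
def compose(A, B):
    """Epsilon-free composition of transducers A and B in the tropical
    semiring (weights add along a path; the best path has minimum weight)."""
    arcs_a, start_a, finals_a = A
    arcs_b, start_b, finals_b = B
    start = (start_a, start_b)
    arcs = defaultdict(list)
    finals = {}
    stack, seen = [start], {start}
    while stack:
        qa, qb = q = stack.pop()
        if qa in finals_a and qb in finals_b:
            finals[q] = finals_a[qa] + finals_b[qb]
        for il, ol, w1, na in arcs_a.get(qa, []):
            for il2, ol2, w2, nb in arcs_b.get(qb, []):
                if ol == il2:  # match A's output label with B's input label
                    nxt = (na, nb)
                    arcs[q].append((il, ol2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return dict(arcs), start, finals

# Composing a toy lexicon L (phones -> words) with a grammar G (word costs)
# illustrates the L . G step of the H . C . L . G cascade.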
T3 decoder performance
[Figure: Word accuracy (%) vs. real-time factor (RTF, 0–1.2); CSJ accuracy ranges over roughly 77–80.5%, JNAS over roughly 92–96%.]
• Spontaneous speech: Corpus of Spontaneous Japanese (CSJ); test set of 10 lectures; 128 Gaussians per mixture; 65K-word vocabulary.
• Read speech: Japanese Newspaper Article Sentences (JNAS); test set of 200 utterances; 16 Gaussians per mixture; 465k-word vocabulary.
Icelandic speech recognition using an English corpus
• The Jupiter corpus (a weather-information corpus developed by MIT) was used as the English rich corpus.
• RT: rich text (60K sentences, weather-information domain); ST: sparse text (1.5K sentences).
Icelandic speech recognition using an English LM (English output)
Traditional format and WFST format: Icelandic speech → English text.
• P(O|W): Icelandic acoustic model
• P(W|T): English-to-Icelandic translation model
• P(T): English language model
Icelandic speech recognition using an English LM (Icelandic output)
Traditional format and WFST format: Icelandic speech → Icelandic text.
• P(O|W): Icelandic acoustic model
• P(W|T): English-to-Icelandic translation model
• P(T): English language model
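Read as a noisy-channel cascade, both configurations share one decoding objective, assuming the standard factorization (the slides do not spell it out):

T̂, Ŵ = argmax_{T,W} P(O | W) · P(W | T) · P(T)

The English-output system reports the English side T̂, while the Icelandic-output system reports the Icelandic side Ŵ.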
Keyword accuracy rate (%)
• Optimized, English output: 91.0%
• Optimized, Icelandic output: 89.8%
• Baseline Icelandic: 87.6%
Unsupervised cross-validation (CV) adaptation
[Diagram: The initial model M is copied into K models M(1), ..., M(K). The evaluation speech data are split into folds D(1), ..., D(K); each fold D(k) is decoded by speech recognition into a hypothesis T(k), and model M(k) is then updated on the hypotheses of all folds except k. Decoding and model update are iterated, producing the final recognition results.]
Reducing the influence of recognition errors by separating the data used for the decoding step and the model-update step.
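A schematic of the CV loop follows, with decode and reestimate left as caller-supplied functions; the signatures and the restart-from-initial update schedule are assumptions for illustration, not the paper's code.

import copy

def cv_adaptation(initial_model, data_folds, decode, reestimate, n_iter=3):
    """Unsupervised cross-validation adaptation: fold k is always decoded by a
    model whose update never used fold k's own hypotheses."""
    K = len(data_folds)
    models = [copy.deepcopy(initial_model) for _ in range(K)]
    for _ in range(n_iter):
        # Decode each fold with its current held-out model.
        hyps = [decode(models[k], data_folds[k]) for k in range(K)]
        # Update model k on all folds except k (one plausible schedule:
        # re-estimate from the initial model each iteration).
        models = [
            reestimate(initial_model,
                       [(data_folds[j], hyps[j]) for j in range(K) if j != k])
            for k in range(K)
        ]
    # Final recognition results, one hypothesis list per fold.
    return [decode(models[k], data_folds[k]) for k in range(K)]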
Future
• Increasing flexibility and robustness against various sources of variation
• Spoken language comprehension
A communication-theoretic view of speech generation and recognition
Message source (P(M)) → linguistic channel (P(W|M)) → W → acoustic channel (P(X|W)) → X → speech recognizer → Ŵ
Sources of variation — linguistic channel: language, vocabulary, grammar, semantics, context, habits; acoustic channel: speaker characteristics, reverberation, noise, transmission characteristics, microphone.
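In this view the recognizer inverts the acoustic channel by Bayes' rule:

Ŵ = argmax_W P(W | X) = argmax_W P(X | W) · P(W)

with P(X|W) the acoustic model and P(W) the language model.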
Knowledge sources for speech recognition
Human speech recognition is a matching process whereby an audio signal is matched to existing knowledge (comprehension).
• Knowledge (meta-data): domain and topics, context, semantics, speakers, environment, etc.
• Recognition maps speech (data) to a transcription (information); knowledge is built up by generalization and abstraction into meta-data.
• Systematization of various related knowledge is crucial.
• How to incorporate knowledge sources into the statistical ASR framework?
Generations of ASR technology
ASR prehistory (~1920) → 1G (~1952) → 2G (~1968) → 3G (~1980) → 3.5G (~1990) → 4G (~2009)
[Timeline marking the span of our research.] 4G: extended knowledge processing, "Speech and Intelligence".
Future works
• Grand challenge 1: flexibility and robustness against various acoustic as well as linguistic variations
• Grand challenge 2: spoken language comprehension
• A much greater understanding of the human speech process is required before automatic speech recognition systems can approach human performance.
• Significant advances will come from extended knowledge processing in the framework of statistical pattern recognition.