
Selected topics from 40 years of research on speech and speaker recognition

Sadaoki Furui
Tokyo Institute of Technology, Department of Computer Science
[email protected]

Generations of ASR technology (1950-2010)

• Prehistory (until 1952)
• 1G (1952-1968): Heuristic approaches (analog filter bank + logic circuits)
• 2G (1968-1980): Pattern matching (LPC, FFT, DTW)
• 3G (1980-1990): Statistical framework (HMM, n-gram, neural net)
• 3.5G (1990-): Discriminative approaches, robust training, normalization, adaptation, spontaneous speech, rich transcription
• 4G (?): Extended knowledge processing

Our research: NTT Labs (+ Bell Labs) and Tokyo Tech, with collaboration with other labs

Japanese traditional cuisine "Kaiseki-ryori"

1970s

• Speaker recognition by statistical features

• Speaker recognition by cepstral features

Speaker recognition by long-term averaged spectrum

[Figure: recognition accuracy (%) vs. time interval between training and test utterances (2-3 days, 3 weeks, 6 weeks, 3 months) for Methods I-IV; (a) speaker identification, (b) speaker verification]

Speaker recognition by using LPC features

[Figure: recognition accuracy (%) vs. time interval (2-3 days, 3 weeks, 6 weeks, 3 months, 10 months) for 1-word and 4-word inputs; (a) speaker identification, (b) speaker verification]

The amount of spectral variation as a function of time interval

[Figure: spectral distance vs. interval for log-area-ratio parameters (0-33 months) and for the long-term averaged spectrum (0-15 months)]

Variation of the long-time averaged spectrum from four sessions over eight months, and corresponding spectral envelopes derived from cepstrum coefficients weighted by the square root of inverse variances

[Figure: (a) long-time averaged spectra and (b) envelopes by weighted cepstrum, for subjects W and T]

Speaker recognition by using LPC features (effectiveness of inverse filtering)

[Figure: error rate (%) vs. time interval (2-3 days to 10 months) with no inverse filter, a 1st-order filter, and a 2nd-order CD filter; (a) two words, (b) one word]

Research at Bell Laboratories, Murray Hill, from 1978 to 1979

Speaker verification using cepstrum features: system block diagram

• End-point detection of the sample utterance
• Cepstrum coefficient extraction through LPC analysis
• Long-time averaging and normalization by the average cepstrum
• Expansion by polynomial functions, feature selection, and feature weighting
• Retrieval of the reference data (reference template and threshold) for the claimed identity
• Dynamic time warping against the reference template and overall distance computation
• Comparison with the threshold: accept or reject
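The pipeline above can be read, loosely, as a distance-based decision on time-warped, mean-normalized cepstral features. A minimal sketch under that reading (not the original Bell Labs implementation; the polynomial expansion, feature selection, and weighting are omitted, and the function names are illustrative):

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between two cepstral sequences
    (frames x coefficients), using Euclidean frame distances."""
    n, m = len(x), len(y)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m] / (n + m)

def verify(sample_cep, reference_cep, threshold):
    """Accept the identity claim if the normalized, time-warped cepstral
    distance to the claimed speaker's reference template is small enough."""
    # Normalization by the long-time average cepstrum of each utterance.
    sample = sample_cep - sample_cep.mean(axis=0)
    reference = reference_cep - reference_cep.mean(axis=0)
    return dtw_distance(sample, reference) <= threshold
```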

On-line speaker verification experiments using 120 Bell Labs employees

User: "We were away a year ago."
System: "Stand by for analysis."
System: "Your identity has been verified. Thank you."

1980s

• Spectral dynamics in speech perception and recognition

• Speaker recognition by HMM/GMM

Analysis of relationships between spectral dynamics and syllable perception

[Figure: speech energy and spectral dynamics of a syllable, with initial and final truncation points used in the identification test]

Relationship between truncated CV syllable identification scores and truncation position relative to the perceptual critical point

[Figure: identification score (%) vs. truncation position (-10 to 20 ms) relative to the critical point, for syllables, vowels, and consonants; (a) initial truncation, (b) final truncation]

Relationship between spectral transition and syllable identification scores as a function of the truncation position for the syllable /njo/

[Figure: energy, spectral transition, and syllable identification score (%) over time (0-250 ms) across the /n/ /j/ /o/ segments, for initial and final truncation]

Distribution of the difference between the perceptual critical point and the maximum spectral transition position for all 100 syllables

[Figure: frequency (%) vs. relative position (-60 to 60 ms); initial truncation: μ = 4.0 ms, σ = 14.2 ms; final truncation: μ = -2.6 ms, σ = 17.9 ms. Positive values mean that the maximum transition is located after the critical point.]

Relationship between truncation position and identification scores for the truncated CV syllables

[Figure: identification score (%) vs. truncation position (-30 to 20 ms) relative to the critical point for final truncation, with initial truncation (Ti), final truncation (Tf), the maximum spectral transition position (Tm), and the 50 ms period marked; curves for vowels, consonants, and syllables]

Ti, Tf: perceptual critical points for initial and final truncation; Tm: maximum spectral transition position

Experimental results

• "Perceptual critical points" (Ti, Tf) are related to maximum spectral transition positions (Tm).
• The 10 ms period including Tm bears the most important information for consonant and syllable perception.
• Crucial information for both consonant and vowel identification is contained in the same transitional part of each syllable.
• The spectral transition is more crucial than unvoiced and buzz-bar periods for consonant (syllable) perception.

Role of spectral transition for speech perception

[Figure: spectrogram (power; frequency in kHz) of the syllable /ka/, indicating the periods that can be truncated and the maximum spectral change period, which is essential for syllable perception]

Cepstrum and delta-cepstrum coefficients

[Figure: parameter (vector) trajectory in cepstral space, showing the instantaneous vector (cepstrum) and the transitional (velocity) vector (delta-cepstrum) at each point]

Instantaneous and dynamic cepstrum features

Speech → FFT → Log → DCT → cepstrum; the cepstrum and its Δ and Δ² (delta and delta-delta) coefficients are concatenated into the acoustic vector.
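As a rough illustration of this pipeline (a sketch only; frame length, window, lifter, and regression-window choices here are assumptions, not the analysis conditions of the original work), the cepstrum is obtained from the log spectrum by a DCT, and delta coefficients by a linear regression over neighboring frames:

```python
import numpy as np
from scipy.fftpack import dct

def cepstrum_frames(signal, frame_len=400, hop=160, n_ceps=12):
    """FFT -> log magnitude spectrum -> DCT, one cepstral vector per frame."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        log_spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
        frames.append(dct(log_spec, type=2, norm='ortho')[:n_ceps])
    return np.array(frames)

def delta(features, k=2):
    """Delta (velocity) coefficients by linear regression over +/- k frames."""
    padded = np.pad(features, ((k, k), (0, 0)), mode='edge')
    denom = 2 * sum(i * i for i in range(1, k + 1))
    return np.array([
        sum(i * (padded[t + k + i] - padded[t + k - i]) for i in range(1, k + 1)) / denom
        for t in range(len(features))
    ])

# Acoustic vector: cepstrum, delta, and delta-delta concatenated per frame, e.g.
# ceps = cepstrum_frames(signal); feats = np.hstack([ceps, delta(ceps), delta(delta(ceps))])
```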

A five-state ergodic HMM for text-independent speaker recognition

[Figure: five-state ergodic HMM with states S1-S5 and a full set of transition probabilities a_ij between every pair of states]

Speaker identification rates as a function of the number of states and mixtures in ergodic HMMs

[Figure: speaker identification rate (%, roughly 84-100) vs. total number of mixtures (number of states x number of mixtures per state, up to about 70), with curves for 1-, 2-, 3-, 4-, and 8-state models; points labeled by configuration, e.g. 8s1m and 1s8m at the low end up to 1s64m, 4s16m, and 8s8m at the top]
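A 1-state ergodic HMM reduces to a Gaussian mixture model, so this style of text-independent speaker identification can be sketched by training one GMM per speaker on cepstral feature frames and choosing the speaker whose model scores the test utterance best. A minimal sketch using scikit-learn (an illustration, not the original implementation; data variables are hypothetical):

```python
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_by_speaker, n_mixtures=16):
    """Fit one diagonal-covariance GMM per speaker on training feature frames."""
    models = {}
    for speaker, feats in features_by_speaker.items():
        gmm = GaussianMixture(n_components=n_mixtures, covariance_type='diag')
        models[speaker] = gmm.fit(feats)
    return models

def identify(models, test_features):
    """Return the speaker whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda spk: models[spk].score(test_features))
```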

1990s

• Japanese LVCSR using a newspaper corpus and broadcast news
• Robust ASR
• Text-prompted speaker recognition

Japanese LVCSR using a newspaper corpus and broadcast news

Comparison of lexica and LM training corpora for different languages

                             Nikkei      WSJ        Le Monde   Frankfurter       Sole 24
                             (Japanese)  (English)  (French)   Rundschau (Ger.)  (Italian)
Training text size [words]   180M        37.2M      37.7M      36M               25.7M
Distinct words               623k        165k       280k       650k              200k
5k coverage                  88.0%       90.6%      85.2%      82.9%             88.3%
20k coverage                 96.2%       97.5%      94.7%      90.0%             96.3%
40k coverage                 98.2%       99.2%      97.6%      -                 98.9%
65k coverage                 99.0%       99.6%      98.3%      95.1%             99.0%
20k OOV rate                 3.8%        2.5%       5.3%       10.0%             3.7%
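Vocabulary coverage and OOV rate of the kind tabulated above can be estimated directly from word-frequency counts: take the k most frequent words as the lexicon and measure the fraction of corpus tokens it covers. A minimal sketch (the corpus variable is hypothetical):

```python
from collections import Counter

def coverage_and_oov(tokens, lexicon_size):
    """Token coverage of the top-k most frequent words, and the resulting OOV rate."""
    counts = Counter(tokens)
    lexicon = {w for w, _ in counts.most_common(lexicon_size)}
    covered = sum(1 for t in tokens if t in lexicon)
    coverage = covered / len(tokens)
    return coverage, 1.0 - coverage

# Example (hypothetical corpus): cov, oov = coverage_and_oov(corpus_tokens, 20000)
```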

LM units for Japanese: morphemes

Entry items corresponding to the number of homophone classes with k graphemic forms in the class

Lexicon        Rate in corpus   Homophone class size (k)
                                k=1      k=2     k=3     k>=4
Nikkei (30k)        20%         24.1k    2438    706     565
BREF (10k)          35%         6686     1329    215      73
BREF (40k)          45%         23.7k    5361    1264    1039
WSJ (9k)             6%         8453      237     22       1
WSJ (65k)           15%         60.4k    3689    890     291
FR (64k)            10%         58.1k    2769    221      57
So24 (10k)          1.7%        9872       85      3       0

Relationship between perplexity (bigram) and word error rate for a read newspaper task

[Figure: word error rate (%, 5-30) vs. bigram perplexity (45-70); mean word error rate with a trigram LM: 9.5%]

Relationship between perplexity (bigram) and word error rate for a broadcast-news task

[Figure: word error rate (%, up to 60) vs. bigram perplexity (100-500) for anchor and other speakers; mean word error rate with a trigram LM: 19.7% (anchor)]

Robust ASR (supervised/unsupervised acoustic model adaptation)

• Hierarchical spectral clustering-based unsupervised adaptation
• MAP+MCE (minimum classification error) training-based supervised adaptation
• N-best-based unsupervised adaptation


Hierarchical codebook adaptation algorithm maintaining continuity between adjacent clusters

[Figure: four-step procedure (1)-(4) in which a speaker-independent codebook is adapted with training utterances; codewords c_i are shifted by adaptation vectors (u, v, p) shared across neighboring clusters so that adjacent clusters remain continuous]

Cepstral distortion between input speech and reference templates resulting from hierarchical codebook adaptation

[Figure: relative spectral distortion (0.8-1.0) vs. number of clusters (NA, 1, 2, 4, 8, 16, 32, 64, 128), comparing progressive adaptation with direct adaptation; NA: no adaptation]

Text-prompted speaker recognition method

(Training) Speaker-specific phoneme models are created from each speaker's training data (speech and text) together with speaker-independent phoneme models.

(Recognition) The prompted text is used to concatenate the claimed speaker's phoneme models; the likelihood of the input speech against the concatenated models is calculated, and text confirmation and speaker verification are performed.
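The verification part of this decision can be sketched as a normalized log-likelihood test: score the prompted utterance against the claimed speaker's concatenated phoneme models and against speaker-independent (background) models of the same prompt, and accept only if the difference is large enough. A minimal sketch under those assumptions (the `log_likelihood` method stands in for HMM scoring and is hypothetical; text confirmation is not shown):

```python
def sequence_log_likelihood(phoneme_models, phoneme_sequence, speech_segments):
    """Sum per-phoneme log-likelihoods along the prompted phoneme sequence.
    phoneme_models[p].log_likelihood(segment) stands in for HMM scoring."""
    return sum(
        phoneme_models[p].log_likelihood(seg)
        for p, seg in zip(phoneme_sequence, speech_segments)
    )

def verify_text_prompted(speaker_models, background_models,
                         prompt_phonemes, speech_segments, threshold):
    """Accept if the claimed speaker's models explain the prompted utterance
    sufficiently better than the speaker-independent models."""
    speaker_score = sequence_log_likelihood(speaker_models, prompt_phonemes, speech_segments)
    background_score = sequence_log_likelihood(background_models, prompt_phonemes, speech_segments)
    return (speaker_score - background_score) >= threshold
```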

2000s (1)

• Spontaneous speech recognition project and CSJ corpus
• Spectral reduction in spontaneous speech
• Automatic speech summarization and evaluation

CSJ corpus construction

[Figure: data flow for building the Corpus of Spontaneous Japanese (CSJ). Speech is collected in various settings (academic presentations, extemporaneous presentations, and others such as interviews and dialogues) and stored as speech files with noise descriptions. Transcription (text and reading) with sentence segmentation by pauses feeds automatic morphological analysis for the full corpus (7M words); the Core (500k words) additionally receives manual morphological analysis, prosody labeling, and segmental labeling. The corpus is provided in XML for research use.]

Contents of the CSJ

Type of speech                      # Speakers  # Files  Monolog/Dialog  Spont./Read  Hours
Academic presentations (AP)         838         1006     Monolog         Spont.       299.5
Extemporaneous presentations (EP)   580         1715     Monolog         Spont.       327.5
Interview on AP *                   (10)        10       Dialog          Spont.       2.1
Interview on EP *                   (16)        16       Dialog          Spont.       3.4
Task-oriented dialogue *            (16)        16       Dialog          Spont.       3.1
Free dialogue *                     (16)        16       Dialog          Spont.       3.6
Reading text *                      (244)       491      Dialog          Read         14.1
Reading transcriptions *            (16)        16       Monolog         Read         5.5
* Counted as the speakers of AP or EP                    Total hours                  658.8

Out-of-vocabulary (OOV) rate, word error rate (WER), and adjusted test-set perplexity (APP) as a function of the size of language model training data (8/8 = 6.84M words)

Word error rate (WER) as a function of the size of acoustic model training data (8/8 = 510 hours)

Linear regression models of the word accuracy (%) with the six presentation attributes

Speaker-independent recognition:
Acc = 0.12AL − 0.88SR − 0.020PP − 2.2OR + 0.32FR − 3.0RR + 95

Speaker-adaptive recognition:
Acc = 0.024AL − 1.3SR − 0.014PP − 2.1OR + 0.32FR − 3.2RR + 99

Acc: word accuracy, SR: speaking rate, PP: word perplexity, OR: out-of-vocabulary rate, FR: filled-pause rate, RR: repair rate
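The regression models above can be evaluated directly once a presentation's six attribute values are known. A trivial sketch (attribute values and units are whatever the original study used; AL is the sixth attribute, whose expansion is not given on this slide):

```python
def predicted_word_accuracy(AL, SR, PP, OR, FR, RR, speaker_adaptive=False):
    """Evaluate the linear regression models above for one presentation."""
    if speaker_adaptive:
        return 0.024 * AL - 1.3 * SR - 0.014 * PP - 2.1 * OR + 0.32 * FR - 3.2 * RR + 99
    return 0.12 * AL - 0.88 * SR - 0.020 * PP - 2.2 * OR + 0.32 * FR - 3.0 * RR + 95
```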

The reduction ratio of the vector norm between each phoneme and the phoneme center in the spontaneous speech to that in the read speech (academic presentations; extemporaneous presentations)

Mean reduction ratios of vowels and consonants for each speaking style

Distribution of distances between phonemes (R: read speech, D: dialogue)

Relationship between phoneme distances and phoneme recognition accuracy

Relationship between test-set perplexity and word recognition accuracy (%) (NC: news commentary)

Equation for estimating word recognition accuracy

Accuracy ≈ ax − by + c

x: mean Mahalanobis distance between phonemes (acoustic variation)
y: test-set perplexity (linguistic variation)
a, b, c: constants

(The acoustic-variation-only model uses x alone; the combined model uses both x and y.)

Speech summarization by sentence extraction and compaction

[Figure: spontaneous speech is decoded by a speech recognizer (acoustic model from a speech corpus, language model from a text corpus), yielding recognition results with word posterior probabilities. After sentence segmentation, summarization proceeds in two stages: sentence extraction, driven by word frequency from a large-scale text corpus and a summarization language model from a summary corpus, and sentence compaction, driven by word dependency probabilities from a manually parsed corpus. The resulting summaries serve as records, minutes, captions, and indexes.]

Sentence clustering using SVD

The presentation is represented by an M x N target matrix A (M content words x N sentences); column i carries the information of sentence i and row j the information of word j. SVD factorizes A = U Σ V^T, where U is the left singular vector matrix, Σ the singular value matrix (σ1, σ2, ..., σN), and V the right singular vector matrix. SVD semantically clusters content words and sentences, deriving a latent semantic structure from the presentation speech represented by the matrix A.

Element a_mn of the matrix A: a_mn = f_mn · log(F_A / F_m)
f_mn: number of occurrences of content word (m) in the sentence (n)
F_m: number of occurrences of content word (m) in a large corpus

LSA-based sentence extraction

Dimension reduction by SVD: each sentence, originally the column vector a_i = (a_1i, a_2i, ..., a_Mi)^T of A, is represented by a weighted singular-value vector in the reduced K-dimensional space:

ψ_i = (σ1·v_i1, σ2·v_i2, ..., σK·v_iK)^T

To evaluate each sentence, its score is the norm of ψ_i in the K-dimensional space:

Score(i) = ||ψ_i|| = sqrt( Σ_{k=1..K} (σk·v_ik)² )

A fixed number of sentences having relatively large sentence scores in the reduced dimensional space are selected.
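A compact numpy sketch of this scoring (illustrative only; it assumes the word-by-sentence matrix A has already been built, e.g. with the a_mn weighting defined above):

```python
import numpy as np

def lsa_sentence_scores(A, K):
    """Score sentences (columns of the word-by-sentence matrix A) by the norm
    of their weighted singular-value vectors in the K-dimensional LSA space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Row i of Vt.T holds (v_i1, ..., v_iN); weight its first K entries by the singular values.
    psi = Vt.T[:, :K] * s[:K]
    return np.linalg.norm(psi, axis=1)

def extract_sentences(A, K, n_extract):
    """Return indices of the n_extract highest-scoring sentences."""
    scores = lsa_sentence_scores(A, K)
    return np.argsort(scores)[::-1][:n_extract]
```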

Word extraction score

For a summarized sentence with M words, V = v1, v2, ..., vM:

S(V_M) = Σ_{m=1..M} [ L(v_m | ... v_{m-1}) + λI·I(v_m) + λC·C(v_m) + λT·Tr(v_m) ]

L: linguistic score, for linguistic correctness (bigram/trigram)
I: significance (topic) score, for important-information extraction (amount of information)
C: confidence score, for recognition-error exclusion (acoustic and linguistic reliability)
Tr: word concatenation score, for semantic correctness (word dependency probability)
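A minimal sketch of how such a combined score might be computed for a candidate word sequence (the four component scorers and the λ weights are placeholders for the actual models, which are not shown here):

```python
def word_extraction_score(words, linguistic, significance, confidence, concatenation,
                          lam_I=1.0, lam_C=1.0, lam_T=1.0):
    """Sum the linguistic, significance, confidence, and concatenation scores
    over a candidate summarized sentence, as in S(V_M) above."""
    total = 0.0
    for m, word in enumerate(words):
        history = words[:m]
        total += linguistic(word, history)             # L(v_m | ... v_{m-1})
        total += lam_I * significance(word)            # λI · I(v_m)
        total += lam_C * confidence(word)              # λC · C(v_m)
        total += lam_T * concatenation(word, history)  # λT · Tr(v_m)
    return total
```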

Word concatenation score: a penalty for word concatenations that have no dependency in the original sentence.

Example: in "the beautiful cherry blossoms in Japan", dependencies hold within Phrase 1 ("the beautiful cherry blossoms") and Phrase 2 ("in Japan") as well as between the phrases. The compaction "the beautiful Japan" is grammatically correct but incorrect as a summary, so such a concatenation is penalized.

Correlation between subjective and objective evaluation scores (averaged over presentations)

In the subjective evaluation, the summaries were evaluated in terms of ease of understanding and appropriateness as summaries on five levels.

Correlation between subjective and objective evaluation scores (each presentation)

2000s (2)

• Development of a WFST-based decoder and its application
• Unsupervised cross-validation and aggregated adaptation methods

WFST (Weighted Finite State Transducer)-based "T3 decoder"

The decoder composes four transducers, H ∘ C ∘ L ∘ G, mapping speech to a word sequence:
• H: HMM (speech → context-dependent phones)
• C: triphone context (context-dependent phones → context-independent phones)
• L: lexicon (context-independent phones → words)
• G: N-gram (words → word sequence)

Problems with full static composition:
• Large memory requirement
• Small flexibility (difficult to change partial models)
→ On-the-fly composition and parallel decoding
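The core WFST operation here is composition. A toy, epsilon-free composition in the tropical semiring (weights add along paths) is sketched below; this is a self-contained illustration, not the T3 decoder, which additionally handles epsilon transitions, optimization, and on-the-fly composition:

```python
def compose(t1, t2):
    """Compose two epsilon-free WFSTs in the tropical semiring.
    Each WFST is {'start': s, 'finals': set, 'arcs': {state: [(in, out, w, next), ...]}}."""
    start = (t1['start'], t2['start'])
    arcs, stack, seen = {}, [start], {start}
    while stack:
        q1, q2 = q = stack.pop()
        arcs[q] = []
        for a, b, w1, n1 in t1['arcs'].get(q1, []):
            for b2, c, w2, n2 in t2['arcs'].get(q2, []):
                if b == b2:  # output of the first must match input of the second
                    nxt = (n1, n2)
                    arcs[q].append((a, c, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    finals = {(f1, f2) for f1 in t1['finals'] for f2 in t2['finals'] if (f1, f2) in seen}
    return {'start': start, 'finals': finals, 'arcs': arcs}
```

In this representation, H ∘ C ∘ L ∘ G would be built by repeated calls to compose, which is exactly what makes the static result large and motivates on-the-fly composition.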

Structure of the T3 decoder

T3 decoder performance

[Figure: word accuracy (%) vs. real-time factor (RTF, 0-1.2); roughly 77-80.5% on CSJ and 92-96% on JNAS]

Spontaneous speech, "Corpus of Spontaneous Japanese (CSJ)":
• Test set of 10 lectures
• 128 Gaussians per mixture
• 65K word vocabulary

Read speech, "Japanese Newspaper Article Sentences (JNAS)":
• Test set of 200 utterances
• 16 Gaussians per mixture
• 465k word vocabulary

Icelandic speech recognition using an English corpus

• The Jupiter corpus (a weather-information corpus developed by MIT) was used as the English rich corpus.

RT: rich text (60K sentences, weather-information domain)
ST: sparse text (1.5K sentences)

Icelandic speech recognition using an English LM (English output)

[Figure: traditional format vs. WFST format; Icelandic speech is decoded into English text]

• P(O|W): Icelandic acoustic model
• P(W|T): English-to-Icelandic translation model
• P(T): English language model

Icelandic speech recognition using an English LM (Icelandic output)

[Figure: traditional format vs. WFST format; Icelandic speech is decoded into Icelandic text]

• P(O|W): Icelandic acoustic model
• P(W|T): English-to-Icelandic translation model
• P(T): English language model
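One way to read these diagrams (an interpretation of the slides, not a formula stated on them) is as noisy-channel decoding in which the Icelandic observation O, Icelandic word sequence W, and English word sequence T are linked by the three models above:

T^ = argmax_T max_W P(O|W) · P(W|T) · P(T)   (English output)
W^ = argmax_W max_T P(O|W) · P(W|T) · P(T)   (Icelandic output)

In the WFST format this amounts to composing the acoustic, translation, and English language model transducers, in the same spirit as H ∘ C ∘ L ∘ G above.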

Recognition results (keyword accuracy rate):
• Optimized, English output: 91.0%
• Optimized, Icelandic output: 89.8%
• Baseline Icelandic: 87.6%

(Each configuration was specified by λST, the output language, and the vocabulary.)

Unsupervised cross-validation (CV) adaptation

[Figure: the initial model M is copied to K models M(1)...M(K); the evaluation speech data are split into K subsets D(1)...D(K); each subset D(k) is recognized with its model M(k) to produce hypotheses T(k); each model M(k) is then updated using the hypotheses from the other subsets, and the recognition/update cycle is iterated.]

Reducing the influence of recognition errors by separating the data used for the decoding step and the model update step.
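A schematic sketch of that loop (the recognize and adapt functions are placeholders; the exact update used in the original work is not shown):

```python
import copy

def cross_validation_adaptation(initial_model, data_subsets, recognize, adapt, iterations=2):
    """Unsupervised CV adaptation: each model M(k) is updated only with hypotheses
    from the other subsets, so no subset is decoded by a model adapted on its own
    (possibly erroneous) hypotheses."""
    K = len(data_subsets)
    models = [copy.deepcopy(initial_model) for _ in range(K)]
    hypotheses = [None] * K
    for _ in range(iterations):
        # Decoding step: recognize each subset with its own model.
        for k in range(K):
            hypotheses[k] = recognize(models[k], data_subsets[k])
        # Update step: adapt M(k) on all subsets except k.
        for k in range(K):
            held_out_data = [d for j, d in enumerate(data_subsets) if j != k]
            held_out_hyps = [h for j, h in enumerate(hypotheses) if j != k]
            models[k] = adapt(initial_model, held_out_data, held_out_hyps)
    return models, hypotheses
```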

Future

• Increasing flexibility and robustness against various sources of variations
• Spoken language comprehension

A communication-theoretic view of speech generation & recognition

[Figure: the message source, P(M), produces a message M; the linguistic channel, P(W|M), turns it into a word sequence W; the acoustic channel, P(X|W), turns W into the speech signal X; the speech recognizer, drawing on knowledge sources, recovers an estimate of W from X.]

Sources of variations:
• Linguistic channel: language, vocabulary, grammar, semantics, context, habits
• Acoustic channel: speaker, reverberation, noise, transmission characteristics, microphone
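Under this view, statistical ASR solves the familiar maximum a posteriori decoding problem (the standard formulation; the slide itself only shows the channel diagram):

W^ = argmax_W P(W|X) = argmax_W P(X|W) · P(W)

where P(X|W) models the acoustic channel (acoustic model) and P(W) the linguistic channel and message source (language model).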

Knowledge sources for speech recognition

Human speech recognition is a matching process whereby an audio signal is matched to existing knowledge (comprehension).

• Knowledge (meta-data): domain and topics, context, semantics, speakers, environment, etc.
• Recognition maps speech (data) to transcription (information).
• Generalization and abstraction: systematization of various related knowledge is crucial.
• How to incorporate knowledge sources into the statistical ASR framework?

Generations of ASR technology

[Figure: timeline of ASR generations, from ASR prehistory (~1920) through 1G (~1952), 2G (~1968), 3G (~1980), 3.5G (~1990), and 4G (~2009), with the span of our research marked; 4G is characterized as extended knowledge processing, "Speech and Intelligence"]

Future works

• Grand challenge 1: flexibility and robustness against various acoustic as well as linguistic variations
• Grand challenge 2: spoken language comprehension

• A much greater understanding of the human speech process is required before automatic speech recognition systems can approach human performance.

• Significant advances will come from extended knowledge processing in the framework of statistical pattern recognition.

Thanks to all our present and past colleagues and students at NTT Labs and Tokyo Tech!