hidden markov model-based speech...

70
Hidden Markov Model-based speech synthesis Junichi Yamagishi, Korin Richmond, Simon King and many others Centre for Speech Technology Research University of Edinburgh, UK www.cstr.ed.ac.uk

Upload: others

Post on 04-Apr-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Hidden Markov Model-based speech synthesis

Junichi Yamagishi, Korin Richmond, Simon King and many othersCentre for Speech Technology ResearchUniversity of Edinburgh, UK

www.cstr.ed.ac.uk

Page 2: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Note

• I did not invent HMM-based speech synthesis!

• Core idea: Tokuda (Nagoya Institute of Technology, Japan)

• Developments: many other people

• Speaker adaptation: Junichi Yamagishi (Edinburgh) and colleagues

Page 3: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Background

Page 4: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Speech synthesis mini-tutorial

• Text to speech

• input: text

• output: a waveform that can be listened to

• Two main components

• front end: analyses text and converts to linguistic specification

• waveform generation: converts linguistic specification to speech

Page 5: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Speech synthesis mini-tutorial

• Text to speech

• input: text

• output: a waveform that can be listened to

• Two main components

• front end: analyses text and converts to linguistic specification

• waveform generation: converts linguistic specification to speech

Page 6: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

From words to linguistic specification

"the cat sat"

Page 7: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

From words to linguistic specification

DET NN VB

"the cat sat"

Page 8: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

From words to linguistic specification

((the cat) sat)

DET NN VB

"the cat sat"

Page 9: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

From words to linguistic specification

sil dh ax k ae t s ae t sil

((the cat) sat)

DET NN VB

"the cat sat"

Page 10: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

From words to linguistic specification

sil dh ax k ae t s ae t sil

((the cat) sat)

DET NN VB

phrase finalphrase initialpitch accent

"the cat sat"

Page 11: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

From words to linguistic specification

sil^dh-ax+k=ae, "phrase initial", "unstressed syllable", ...

sil dh ax k ae t s ae t sil

((the cat) sat)

DET NN VB

phrase finalphrase initialpitch accent

"the cat sat"

Page 12: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Full context models used in synthesis

aa^b-l+ax=s@1_3/A:1_1_3/B:0-0-3@2-1&3-3#2-2$2-3!1- .....

Page 13: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

phonetic

Full context models used in synthesis

aa^b-l+ax=s@1_3/A:1_1_3/B:0-0-3@2-1&3-3#2-2$2-3!1- .....

Page 14: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

phonetic

Full context models used in synthesis

aa^b-l+ax=s@1_3/A:1_1_3/B:0-0-3@2-1&3-3#2-2$2-3!1- .....

prosodic

Page 15: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Example linguistic specification

pau^pau-pau+ao=th@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$.....pau^pau-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$.....pau^ao-th+er=ah@2_1/A:0_0_0/B:1-1-2@1-2&1-7#1-4$.....ao^th-er+ah=v@1_1/A:1_1_2/B:0-0-1@2-1&2-6#1-4$.....th^er-ah+v=dh@1_2/A:0_0_1/B:1-0-2@1-1&3-5#1-3$.....er^ah-v+dh=ax@2_1/A:0_0_1/B:1-0-2@1-1&3-5#1-3$.....ah^v-dh+ax=d@1_2/A:1_0_2/B:0-0-2@1-1&4-4#2-3$.....v^dh-ax+d=ey@2_1/A:1_0_2/B:0-0-2@1-1&4-4#2-3$.....

“Author of the ...”

Page 16: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

From linguistic specification to speech

• Two possible methods

• Concatenate small pieces of pre-recorded speech

• Generate speech from a model

Page 17: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

From linguistic specification to speech

• Two possible methods

• Concatenate small pieces of pre-recorded speech

• Generate speech from a model

Page 18: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

HMM mini-tutorial

• HMMs are models of sequences

• speech signals

• gene sequences

• etc

Page 19: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

HMMs

• a HMM consists of

• sequence model: a weighted finite state network of states and transitions

• observation model: multivariate Gaussian distribution in each state

• can generate from the model

• can also use for pattern recognition (e.g., automatic speech recognition)

Page 20: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

HMMs are generative models

Page 21: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

HMMs are generative models

Page 22: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

HMMs are generative models

Page 23: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

HMM-based speech synthesis mini-tutorial

• HMMs are used to generate sequences of speech (in a parameterised form)

• From the parameterised form, we can generate a waveform

• The parameterised form contains sufficient information to generate speech:

• spectral envelope

• fundamental frequency (F0) - sometimes called ‘pitch’

• aperiodic (noise-like) components (e.g. for sounds like ‘sh’ and ‘f’)

Page 24: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Trajectory HMMs

• Using an HMM to generate speech parameters

• because of the Markov assumption, the most likely output is the sequence of the means of the Gaussians in the states visited

• this is piecewise constant, and ignores important dynamic properties of speech

• Trajectory HMM algorithm (Tokuda and colleagues)

• solves this problem, by correctly using statistics of the dynamic properties during the generation process

Page 25: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Generation

• Generate the most likely observation sequence from the HMM

• but take the statistics of not only the static coefficients, but also the delta and delta-delta too

• Maximum Likelihood Parameter Generation Algorithm

Page 26: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Trajectory HMMs

Page 27: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Trajectory HMMs

Page 28: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Trajectory HMMs

Page 29: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

time

sp

eech p

ara

mete

r

Trajectory HMMs

Page 30: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

time

sp

eech p

ara

mete

r

Trajectory HMMs

Page 31: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

time

sp

eech p

ara

mete

r

Trajectory HMMs

Page 32: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

time

sp

eech p

ara

mete

r

Trajectory HMMs

Page 33: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

time

sp

eech p

ara

mete

r

Trajectory HMMs

Page 34: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

time

sp

eech p

ara

mete

r

Trajectory HMMs

Page 35: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

time

sp

eech p

ara

mete

r

Trajectory HMMs

Page 36: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Constructing the HMM

• Linguistic specification (from the front end) is a sequence of phonemes, annotated with contextual information

• There is one 5-state HMM for each phoneme, in every required context

• To synthesise a given sentence,

• use front end to predict the linguistic specification

• concatenate the corresponding HMMs

• generate from the HMM

Page 37: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Constructing the HMM

• Linguistic specification (from the front end) is a sequence of phonemes, annotated with contextual information

• There is one 5-state HMM for each phoneme, in every required context

• To synthesise a given sentence,

• use front end to predict the linguistic specification

• concatenate the corresponding HMMs

• generate from the HMM

Sparsi

ty pr

oblem

!

Page 38: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Example linguistic specification

pau^pau-pau+ao=th@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$.....pau^pau-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$.....pau^ao-th+er=ah@2_1/A:0_0_0/B:1-1-2@1-2&1-7#1-4$.....ao^th-er+ah=v@1_1/A:1_1_2/B:0-0-1@2-1&2-6#1-4$.....th^er-ah+v=dh@1_2/A:0_0_1/B:1-0-2@1-1&3-5#1-3$.....er^ah-v+dh=ax@2_1/A:0_0_1/B:1-0-2@1-1&3-5#1-3$.....ah^v-dh+ax=d@1_2/A:1_0_2/B:0-0-2@1-1&4-4#2-3$.....v^dh-ax+d=ey@2_1/A:1_0_2/B:0-0-2@1-1&4-4#2-3$.....

“Author of the ...”

Page 39: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

HMM-based speech synthesis

• Differences from automatic speech recognition include

• Synthesis uses a much richer model set, with a lot more context

• For speech recognition: triphone models

• For speech synthesis: “full context” models

• “Full context” = both phonetic and prosodic factors

• Observation vector for HMMs contains the necessary parameters to generate speech, such as spectral envelope + F0 + multi-band noise amplitudes

Page 40: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Sparsity

• In practically all speech or language applications, sparsity is a problem

• Distribution of classes is usually long-tailed (Zipf-like)

• We also ‘create’ even more sparsity by using context-dependent models

• thus, most models have no training data at all

• Common solution is to merge classes or contexts

• i.e., use the same model for several classes or contexts

• for HMMs, we call this ‘parameter tying’

Page 41: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Decision-tree-based clustering

Clustering

Context Dependent HMMs

Yes No

NoYes

Voiced?

Vowel?Description length for

State occupancy probability for node

Covariance matrix for node

Dimension

20

Page 42: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Model parameter estimation from ‘labelled’ data

• Actually, we only have word labels for the training data

• Convert these to full linguistic specification using the front end of our text-to-speech system (text processing, pronunciation, prosody)

• these labels will not exactly match the speech signal (we do a few tricks to try to make the match closer, but it’s never perfect)

• We still only know the model sequence, but no information about the state alignment

• So, we use EM (we could call this ‘semi-supervised’ learning)

Page 43: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Model adaptation

• Training the models needs 1000+ sentences of data from one speaker

• What if we have insufficient data for this target speaker?

• Adaptation:

• Train the model on lots of data from other speakers

• Adapt the trained model’s parameters using a small amount of target speaker data

• estimate linear transforms to maximise the likelihood (MLLR)

• also in combination with MAP

Page 44: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Training, adaptation, synthesis

Page 45: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

speech labels

awb

clb

rms

...Train

awb

clb

rms

...

Training, adaptation, synthesis

Page 46: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

speech labels

awb

clb

rms

...Train

awb

clb

rms

...

Training, adaptation, synthesis

Page 47: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

speech labels

Average

voice model

awb

clb

rms

...Train

awb

clb

rms

...

Training, adaptation, synthesis

Page 48: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

speech labels

Average

voice model

Training, adaptation, synthesis

Page 49: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Adaptbdl bdl

speech labels

Average

voice model

Training, adaptation, synthesis

Page 50: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Adaptbdl bdl

speech labels

Average

voice model

Training, adaptation, synthesis

Page 51: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Adaptbdl bdl

speech labels

Average

voice model

Transforms

Training, adaptation, synthesis

Page 52: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Adaptbdl bdl

speech labels

Average

voice model

Recognise

Transforms

Training, adaptation, synthesis

Page 53: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Adaptbdl bdl

speech labels

Average

voice model

Transforms

Training, adaptation, synthesis

Page 54: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

speech labels

Average

voice model

Transforms

Training, adaptation, synthesis

Page 55: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Synthesise

Test

sentence

labels

Synthetic

speech

speech labels

Average

voice model

Transforms

Training, adaptation, synthesis

Page 56: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Synthesise

Test

sentence

labels

Synthetic

speech

speech labels

Average

voice model

Transforms

Training, adaptation, synthesis

Page 57: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Synthesise

Test

sentence

labels

Synthetic

speech

speech labels

Average

voice model

Transforms

Training, adaptation, synthesis

Page 58: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Synthesise

Test

sentence

labels

Synthetic

speech

speech labels

Average

voice model

Transforms

Training, adaptation, synthesis

Page 59: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Synthesise

Test

sentence

labels

Synthetic

speech

speech labels

Average

voice model

Transforms

Training, adaptation, synthesis

Page 60: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Evaluation

• Objective measures that compare synthetic speech with a natural example (e.g., spectral distortion) have their uses, but don’t necessarily correlate with human perception

• main problem: there is more than one ‘correct answer’ in speech synthesis

• a single natural example does not capture this

• So, we mainly rely on playing examples to listeners

• opinion scores for quality & naturalness, typically on 5 point scales

• objective measures of intelligibility (type-in tests)

Page 61: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

A J S B O V M C L E G Q T F H D R I N0

2040

6080

100

198 198 198 198 198 199 199 198 199 198 198 198 198 199 198 198 199 198 199n

Word error rate for voice B (All listeners)

System

WER

(%)

Intelligibility (WER), English

A natural speechB Festival benchmarkC HTS 2005 benchmarkV HTS 2008 (aka HTS 2007’)

A J S K B P O V M C L E G Q T F H D R I N

020

4060

8010

0

245 248 245 245 245 246 247 246 245 246 246 245 246 248 245 248 245 246 246 246 248n

Word error rate for voice A (All listeners)

System

WER

(%)

Page 62: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

A J S B O V M C L E G Q T F H D R I N0

2040

6080

100

198 198 198 198 198 199 199 198 199 198 198 198 198 199 198 198 199 198 199n

Word error rate for voice B (All listeners)

System

WER

(%)

Intelligibility (WER), English

A natural speechB Festival benchmarkC HTS 2005 benchmarkV HTS 2008 (aka HTS 2007’)

A J S K B P O V M C L E G Q T F H D R I N

020

4060

8010

0

245 248 245 245 245 246 247 246 245 246 246 245 246 248 245 248 245 246 246 246 248n

Word error rate for voice A (All listeners)

System

WER

(%)

No significant difference between

A, V and T

Page 63: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

A J S B O V M C L E G Q T F H D R I N0

2040

6080

100

198 198 198 198 198 199 199 198 199 198 198 198 198 199 198 198 199 198 199n

Word error rate for voice B (All listeners)

System

WER

(%)

Intelligibility (WER), English

A natural speechB Festival benchmarkC HTS 2005 benchmarkV HTS 2008 (aka HTS 2007’)

A J S K B P O V M C L E G Q T F H D R I N

020

4060

8010

0

245 248 245 245 245 246 247 246 245 246 246 245 246 248 245 248 245 246 246 246 248n

Word error rate for voice A (All listeners)

System

WER

(%)

No significant difference between

A, V and T

No significant difference between

A, C, V and T

Page 64: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

A J S B O V M C L E G Q T F H D R I N0

2040

6080

100

198 198 198 198 198 199 199 198 199 198 198 198 198 199 198 198 199 198 199n

Word error rate for voice B (All listeners)

System

WER

(%)

Intelligibility (WER), English

A natural speechB Festival benchmarkC HTS 2005 benchmarkV HTS 2008 (aka HTS 2007’)

A J S K B P O V M C L E G Q T F H D R I N

020

4060

8010

0

245 248 245 245 245 246 247 246 245 246 246 245 246 248 245 248 245 246 246 246 248n

Word error rate for voice A (All listeners)

System

WER

(%)

No significant difference between

A, V and T

No significant difference between

A, C, V and T

HTS is as intelligible as human speech

Page 65: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Recent extensions

Page 66: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Articulatory-controllable HMM-based speech synthesis

• can manipulate articulator positions explicitly

• ability to synthesise new phonemes, not seen in training data

• requires parallel articulatory+acoustic corpus, which we have in CSTR

Page 67: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Articulatory-controllable HMM-based speech synthesis

+1.5

+1.0

+0.5

default

-0.5

-1.0

-1.5

Tong

ue h

eigh

t (cm

)

Page 68: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Articulatory-controllable HMM-based speech synthesis

+1.5

+1.0

+0.5

default

-0.5

-1.0

-1.5

set

Tong

ue h

eigh

t (cm

)

Page 69: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Dirichlet process HMMs

a e i o u

010

020

030

040

050

0

Japanese vowel

Dura

tion [

ms]

aa ae ah ao aw ax ay eh el em en er ey ih iy ow oy uh uw

010

020

030

040

050

0

English vowel

Dura

tion [

ms]

a ai an ang ao e ei en eng er i ia ian iang iao ic ich ie in ing iong iu o ong ou u ua uai uan uang ui un uo v van ve vn

010

020

030

040

050

0

Mandarin final

Dura

tion [

ms]

• Fixed number of states may not be optimal

• Cross-validation, information criteria (AIC, BIC, or MDL) or variational Bayes can be used for determining the number of states

• Or use Dirichlet process (HDP-HMM or infinite HMM)

Page 70: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,

Summary

• HMM-based speech synthesis has many opportunities for using machine learning:

• learning the model from data

• parameters (alternatives to maximum likelihood such as minimum generation error)

• model complexity (context clustering, number of mixture components, number of states, ...)

• semi-supervised and unsupervised learning (labels for data are unreliable or missing)

• adapting the model, given limited new data

• generation algorithms