Discrete/Continuous Modelling of Speaking Style in HMM-based Speech Synthesis:
Design and Evaluation
Nicolas Obin 1,2, Pierre Lanchantin 1
Anne Lacheret 2, Xavier Rodet 1
1 Analysis-Synthesis Team, IRCAM, Paris, France
2 Modyco Lab., University of Paris Ouest - La Defense, Nanterre, France
[email protected], [email protected], [email protected], [email protected]
Abstract
This paper assesses the ability of an HMM-based speech synthesis system to model the speech characteristics of various speaking styles¹. A discrete/continuous HMM is presented to model the symbolic and acoustic speech characteristics of a speaking style. The proposed model is used to model the average characteristics of a speaking style that are shared among various speakers, depending on specific situations of speech communication. The evaluation consists of an identification experiment on 4 speaking styles based on delexicalized speech, compared to a similar experiment on natural speech. The comparison is discussed and reveals that the discrete/continuous HMM consistently models the speech characteristics of a speaking style.
Index Terms: speaking style, speech synthesis, speech prosody, average modelling.
1. Introduction
Each speaker has his own speaking style, which constitutes his vocal signature and a part of his identity. Nevertheless, a speaker continuously adapts his speaking style according to specific communication situations and to his emotional state. In particular, each situational context determines a specific mode of production associated with it - a genre - which is defined by a set of conventions of form and content that are shared among all of its productions [1]. In particular, a specific discourse genre (DG) relates to a specific speaking style. Consequently, a speaker adapts his speaking style to each specific situation depending on the formal conventions that are associated with the situation, his a priori knowledge of these conventions, and his competence to adapt his speaking style. Thus, each communication act instantiates a style which is composed of a style that depends on the speaker identity and a conventional speaking style that is conditioned by a specific situation.
In speech synthesis, methods have been proposed to model and adapt the symbolic [2, 3] and acoustic speech characteristics of a speaking style, with application to emotional speech synthesis [4]. However, no study exists on the joint modelling of the symbolic and acoustic characteristics of speaking style, and speaking style acoustic modelling is generally limited to the modelling of emotion, with rare extensions to other sources of speaking style variation [5].
¹This study was partially funded by “La Fondation Des Treilles”, and supported by ANR Rhapsodie 07 Corp-030-01, reference prosody corpus of spoken French, French National Agency of Research, 2008-2012.
This paper presents an average discrete/continuous HMM which is applied to the speaking style modelling of various discourse genres in speech synthesis, and assesses whether the model adequately captures the speech prosody characteristics of a speaking style. Incidentally, the robustness of HMM-based speech synthesis is evaluated in the conditions of real-world applications. The paper is organized as follows: the speaking style corpus design is described in section 2; the average discrete/continuous HMM model is presented in section 3; the evaluation is presented and discussed in sections 4 and 5.
2. Speech & Text Material
2.1. Corpus Design
For the purpose of speaking style speech synthesis, a 4-hour multi-speaker speech database was designed. The speech database consists of four different DGs: catholic mass ceremony, political, journalistic, and sport commentary. In order to reduce the DG intra-variability, the different DGs were restricted to specific situational contexts (see list below) and to male speakers only.
[Figure 1: Prosodic description of the speaking styles depending on the speaker: mean and variance of f0 and speech rate (syllables per second), plotted as log f0 against log syllable duration; one point per speaker (M1-M7: mass, P1-P5: political, J1-J5: journal, S1-S6: sport commentary).]
The following is a description of the four selected DG’s:
• mass: Christian church sermon (pilgrimage and Sunday high-mass sermons); single-speaker monologue; no interaction.
• political: New Year’s French president speech; single-speaker monologue; no interaction.
• journal: radio review (press review; political, economic, and technological chronicles); almost single-speaker monologue with a few interactions with a lead journalist.
• sport commentary: soccer; two speakers engaged in monologues, with speech overlapping during intense soccer sequences and speech-turn changes; almost no interaction.
The speech database consists of natural multi-media audio contents with strongly variable audio quality (background noise: crowd, audience, recording noise, and reverberation). The speech prosody characteristics of the speech database are illustrated in figure 1.
3. Speaking Style Model
A speaking style model λ^(style) is composed of discrete/continuous context-dependent HMMs that model the symbolic/acoustic speech characteristics of a speaking style.
λ^(style) = ( λ^(style)_symbolic , λ^(style)_acoustic )    (1)
During training, the discrete/continuous context-dependent HMMs are estimated separately. During synthesis, the symbolic/acoustic parameters are generated in cascade, from the symbolic representation to the acoustic variations. Additionally, a rich linguistic description of the text characteristics is automatically extracted using a linguistic processing chain [6] and used to refine the context-dependent HMM modelling (see [7] and [8] for a detailed description of the enriched linguistic contexts).
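The two-stage cascade at synthesis time can be sketched as follows. This is our illustration, not the authors' implementation: the function and model names are hypothetical, and plain callables stand in for the trained discrete and continuous HMMs.

```python
# Schematic sketch of the symbolic -> acoustic cascade: the discrete
# model maps linguistic contexts to prosodic labels, and the acoustic
# model is conditioned on both the contexts and the generated labels.

def run_cascade(contexts, symbolic_model, acoustic_model):
    """contexts: one linguistic-context dict per unit.
    symbolic_model, acoustic_model: callables standing in for the
    trained discrete and continuous HMMs."""
    labels = [symbolic_model(c) for c in contexts]                    # stage 1
    return [acoustic_model(c, l) for c, l in zip(contexts, labels)]   # stage 2

# Toy stand-ins for the two models (purely illustrative):
symbolic = lambda ctx: "FM" if ctx.get("sentence_final") else "-"
acoustic = lambda ctx, lab: {"label": lab, "pause_after": lab == "FM"}
```

The point of the cascade is that the acoustic stage never sees raw text: it only sees the linguistic contexts augmented with the symbolic labels produced by the first stage.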
3.1. Training of the Discrete/Continuous Models
3.1.1. Discrete HMM
For each speaking style, an average context-dependent discrete HMM λ^(style)_symbolic is estimated from the pooled speakers associated with the speaking style.
The prosodic grammar consists of a hierarchical prosodic representation that was investigated as an alternative to ToBI [9] for French prosody labelling [10]. The prosodic grammar is composed of major prosodic boundaries (FM, a boundary which is right-bounded by a pause), minor prosodic boundaries (Fm, an intermediate boundary), and prosodic prominences (P).
Let R be the number of speakers from which an average model λ^(style)_symbolic is to be estimated. Let l = (l^(1), ..., l^(R)) be the total set of prosodic symbolic observations, and l^(r) = [l^(r)(1), ..., l^(r)(N_r)] the prosodic symbolic sequence associated with speaker r, where l^(r)(n) is the prosodic label associated with the n-th syllable. Let q = (q^(1), ..., q^(R)) be the total set of linguistic context observations, and q^(r) = [q^(r)(1), ..., q^(r)(N_r)] the linguistic context sequence associated with speaker r, where q^(r)(n) = [q^(r)_1(n), ..., q^(r)_L(n)]^T is the (L×1) linguistic context vector which describes the linguistic characteristics associated with the n-th syllable.
An average context-dependent discrete HMM λ^(style)_symbolic is estimated from the pooled speaker observations. Firstly, an average context-dependent tree T^(style)_symbolic is derived so as to minimize the information entropy of the prosodic symbolic labels l conditionally to the linguistic contexts q. Then, a context-dependent HMM λ^(style)_symbolic is estimated for each terminal node of the context-dependent tree T^(style)_symbolic.
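The entropy criterion used to grow the tree can be illustrated with a minimal greedy split step. This is our sketch under stated assumptions: the binary (feature, value) question set and the data layout are hypothetical, not the paper's implementation.

```python
# Minimal sketch of entropy-based context clustering for discrete
# prosodic labels: each candidate question splits the data in two,
# and we greedily pick the split that most reduces label entropy.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a discrete label sequence."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(samples, questions):
    """samples: list of (context_dict, label) pairs.
    questions: list of (feature, value) binary questions.
    Returns (best_question, entropy_gain) for the split that
    minimizes the weighted entropy of the two child nodes."""
    labels = [lab for _, lab in samples]
    base = entropy(labels)
    best, best_gain = None, 0.0
    for feat, val in questions:
        yes = [lab for ctx, lab in samples if ctx.get(feat) == val]
        no = [lab for ctx, lab in samples if ctx.get(feat) != val]
        if not yes or not no:       # degenerate split: skip
            continue
        w = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(samples)
        if base - w > best_gain:
            best, best_gain = (feat, val), base - w
    return best, best_gain
```

Recursing on each child with the remaining questions, until no split yields a positive gain (or a stopping criterion fires), produces the kind of context-clustering tree described above.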
3.1.2. Continuous HMM
For each speaking style, an average acoustic model λ^(style)_acoustic that includes source/filter variations, f0 variations, and state durations is estimated from the pooled speakers associated with the speaking style, based on the conventional HTS system [11].
Let R be the number of speakers from which an average model is to be estimated. Let o = (o^(1), ..., o^(R)) be the total set of observations, and o^(r) = [o^(r)(1), ..., o^(r)(T_r)] the observation sequence associated with speaker r, where o^(r)(t) = [o^(r)_1(t), ..., o^(r)_D(t)]^T is the (D×1) observation vector which describes the acoustic properties at time t. Let q = (q^(1), ..., q^(R)) be the total set of linguistic context observations, and q^(r) = [q^(r)(1), ..., q^(r)(T_r)] the linguistic context sequence associated with speaker r, where q^(r)(t) = [q^(r)_1(t), ..., q^(r)_L(t)]^T is the (L×1) linguistic context vector which describes the linguistic properties at time t.
An average context-dependent HMM acoustic model λ^(style)_acoustic is estimated from the pooled speaker observations. Firstly, a context-dependent HMM is estimated for each of the linguistic contexts. Then, an average context-dependent tree T^(style)_acoustic is derived so as to minimize the description length of the context-dependent HMM λ^(style)_acoustic.
The acoustic module simultaneously models source/filter variations, f0 variations, and the temporal structure associated with a speaking style. Speaker f0 was normalized with respect to the speaking style prior to modelling. Source, filter, and normalized f0 observation vectors and their dynamic vectors are used to estimate the context-dependent HMMs λ^(style)_acoustic. Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context clustering (ML-MDL [11]). Multi-Space probability Distributions (MSD) [12] are used to model the continuous/discrete f0 parameter sequence so as to manage voiced/unvoiced regions properly. Each context-dependent HMM is modelled with state-duration probability density functions (PDFs) to account for the temporal structure of speech [13]. Finally, speech dynamics are modelled according to the trajectory model and the global variance (GV), which model local and global speech variations over time [14].
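The MSD treatment of f0 can be illustrated schematically: each state mixes a discrete space (unvoiced) with a continuous space (a Gaussian over log f0 for voiced frames). This is a simplified single-Gaussian sketch of one state; the class and parameter names are ours, not from HTS.

```python
# Illustrative sketch of a Multi-Space probability Distribution (MSD)
# observation score for f0 in one HMM state: a voiced/unvoiced weight
# plus a Gaussian over log f0 in the voiced space.
from math import exp, pi, sqrt

class MSDState:
    def __init__(self, w_voiced, mean, var):
        self.w_voiced = w_voiced      # P(voiced) for this state
        self.mean, self.var = mean, var

    def pdf(self, logf0):
        """Unvoiced frames (logf0 is None) are scored with the
        discrete-space weight; voiced frames with the weighted
        Gaussian density over log f0."""
        if logf0 is None:             # unvoiced frame
            return 1.0 - self.w_voiced
        g = exp(-(logf0 - self.mean) ** 2 / (2 * self.var))
        g /= sqrt(2 * pi * self.var)
        return self.w_voiced * g
```

The key property is that voiced and unvoiced frames are scored by one consistent distribution, so Viterbi alignment and re-estimation need no separate voicing model.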
3.2. Generation of the Speech Parameters
During synthesis, the text is first converted into a concatenated sequence of context-dependent HMMs λ^(style)_symbolic associated with the linguistic context sequence q = [q(1), ..., q(N)], where q(n) = [q_1(n), ..., q_L(n)]^T denotes the (L×1) linguistic context vector associated with the n-th phoneme.
Firstly, the prosodic symbolic sequence l̂ is determined so as to maximize the likelihood of the prosodic symbolic sequence l conditionally to the linguistic context sequence q and the model λ^(style)_symbolic:

l̂ = argmax_l p(l | q, λ^(style)_symbolic)    (2)
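A minimal sketch of how eq. (2) can be realized with the clustered tree: each syllable's context is routed to a leaf, and the leaf's most probable label is emitted. Treating labels as conditionally independent given their contexts is a simplification of the model, and the tree layout here is a hypothetical illustration.

```python
# Route each context to a leaf of a clustered question tree and take
# the argmax of the leaf's label distribution.
# Internal nodes: {"question": (feature, value), "yes": ..., "no": ...}
# Leaves:         {"dist": {label: probability}}

def predict_labels(contexts, tree):
    out = []
    for ctx in contexts:
        node = tree
        while "question" in node:                  # walk to a leaf
            feat, val = node["question"]
            node = node["yes"] if ctx.get(feat) == val else node["no"]
        dist = node["dist"]                        # leaf label distribution
        out.append(max(dist, key=dist.get))        # most likely label
    return out
```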
Then, the linguistic context sequence q, augmented with the prosodic symbolic sequence l̂, is converted into a concatenated sequence of context-dependent models λ^(style)_acoustic. The acoustic sequence ô is inferred so as to maximize the likelihood of the acoustic sequence o conditionally to the model λ^(style)_acoustic:
ô = argmax_o max_q p(o | q, λ^(style)_acoustic) p(q | λ^(style)_acoustic)    (3)
First, the state sequence q̂ is determined so as to maximize the likelihood of the state sequence conditionally to the model λ^(style)_acoustic. Then, the observation sequence ĉ is determined so as to maximize the likelihood of the observation sequence conditionally to the state sequence q̂ and the model λ^(style)_acoustic, under the dynamic constraint o = Wc:
R_q̂ ĉ = r_q̂    (4)

where:

R_q̂ = W^T Σ_q̂^{-1} W    (5)
r_q̂ = W^T Σ_q̂^{-1} μ_q̂    (6)

and Σ_q̂ and μ_q̂ are respectively the covariance matrix and the mean vector for the state sequence q̂.
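Equations (4)-(6) can be checked numerically with a small one-dimensional example. This is a toy sketch, not the HTS implementation: it assumes a central-difference delta window and diagonal covariances, and stacks statics before deltas.

```python
# Toy maximum-likelihood parameter generation: with static + delta
# observations o = W c, the most likely static trajectory solves
# (W' Sigma^{-1} W) c = W' Sigma^{-1} mu  -- i.e. eqs. (4)-(6).
import numpy as np

def mlpg(mu, var, T):
    """mu, var: length-2T arrays of per-frame means/variances,
    statics first (0..T-1), then deltas (T..2T-1).
    Returns the length-T static trajectory c."""
    W = np.zeros((2 * T, T))
    W[:T, :] = np.eye(T)                        # static rows: o_t = c_t
    for t in range(1, T - 1):                   # delta rows: central diff
        W[T + t, t - 1], W[T + t, t + 1] = -0.5, 0.5
    P = np.diag(1.0 / np.asarray(var, float))   # Sigma^{-1} (diagonal)
    R = W.T @ P @ W                             # eq. (5)
    r = W.T @ P @ mu                            # eq. (6)
    return np.linalg.solve(R, r)                # eq. (4)
```

With unit variances and static means [0, 1, 0] and zero delta means, the solution is pulled back toward the statics; a nonzero delta mean at the middle frame instead induces a slope, which is exactly the smoothing effect the dynamic constraint is meant to provide.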
CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS: STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifier is used to merge the sequence of phonemes into a sequence of syllables. At the prosodic level, pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model infers a sequence of prosodic labels that are associated with relevant prosodic events.
sentence            Longtemps , je me suis couché de bonne heure .
                    ⇓
prosodic structure  FM: * *    Fm: * * *    P: * * * *
syllable            Long- temps ## je me suis cou- ché de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the one hand, expert approaches attempt to elaborate formal models that account for the observed prosodic variations with respect to linguistic, para-linguistic, and extra-linguistic constraints. On the other hand, statistical methods attempt to elaborate a statistical model which accounts for the prosodic variations from the observation of statistical regularities on large speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure modelling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996, Delais-Roussarie, 2000, Mertens, 2004b] for French; [Barbosa, 2006] for some other languages). Expert models assume that a prosodic structure results from the integration of various and potentially conflicting constraints, in particular syntactic and rhythmic [?, Dell, 1984] constraints.
The linguistic module mostly concerns the extraction of prominent syntactic boundaries from deep syntactic parsing, based on syntactic constituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-side Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
3. SPEAKING STYLE MODEL
A speaking style model λ(style) is composed of discrete/continuouscontext-dependent HMMs that model the symbolic/acoustic speechcharacteristics of a speaking style.
λ(style) =“λ
(style)symbolic, λ
(style)acoustic
”(1)
During the training, the discrete/continuous context-dependentHMMs are estimated separately. During the synthesis, the sym-bolic/acoustic parameters are generated in cascade, from the sym-bolic representation to the acoustic variations. Additionally, a richlinguistic description of the text characteristics is automatically ex-tracted using a linguistic processing chain [9] and used to refine thecontext-dependent HMM modelling (see [10] and [11] for a detaileddescription of the enriched linguistic contexts).
3.1. Training of the Discrete/Continuous Models
3.1.1. Discrete HMM
For each speaking style, an average context-dependent discrete HMM λ^(style)_symbolic is estimated from the pooled speakers associated with the speaking style.
The prosodic grammar consists of a hierarchical prosodic representation that was proposed as an alternative to ToBI [12] for French prosody labelling [13]. The prosodic grammar is composed of major prosodic boundaries (FM, a boundary which is right-bounded by a pause), minor prosodic boundaries (Fm, an intermediate boundary), and prosodic prominences (P).
Let R be the number of speakers from which an average model λ^(style)_symbolic is to be estimated. Let q_proso = (q^(1)_proso, ..., q^(R)_proso) be the total set of prosodic symbolic observations, where q^(r)_proso = [q^(r)_proso(1), ..., q^(r)_proso(N_r)] is the prosodic symbolic sequence associated with speaker r, and q^(r)_proso(n) is the prosodic label associated with syllable n. Let q = (q^(1), ..., q^(R)) be the total set of linguistic context observations, where q^(r) = [q^(r)(1), ..., q^(r)(N_r)] is the linguistic context sequence associated with speaker r, and q^(r)(n) = [q^(r)_1(n), ..., q^(r)_L(n)]ᵀ is the (L×1) linguistic context vector which describes the linguistic characteristics associated with syllable n.
An average context-dependent discrete HMM λ^(style)_symbolic is estimated from the pooled speaker observations. Firstly, an average context-dependent tree T^(style)_symbolic is derived so as to minimize the information entropy of the prosodic symbolic labels q_proso conditionally to the linguistic contexts q. Then, a context-dependent HMM λ^(style)_symbolic is estimated for each terminal node of the context-dependent tree T^(style)_symbolic.
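The entropy criterion used to grow the symbolic tree can be illustrated with a minimal sketch on toy data. The context question and the samples below are invented for illustration; a real system evaluates a large set of context questions and recurses on the best split.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of prosodic labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def split_gain(samples, question):
    """Entropy reduction obtained by splitting a tree node with a yes/no
    context question; samples is a list of (context, label) pairs."""
    labels = [lab for _, lab in samples]
    yes = [lab for ctx, lab in samples if question(ctx)]
    no = [lab for ctx, lab in samples if not question(ctx)]
    if not yes or not no:
        return 0.0
    w_yes, w_no = len(yes) / len(samples), len(no) / len(samples)
    return entropy(labels) - (w_yes * entropy(yes) + w_no * entropy(no))

# Toy samples: syllables right before punctuation tend to carry a major
# boundary (FM); the context feature and data are invented for illustration.
samples = [({"before_punct": True}, "FM"), ({"before_punct": True}, "FM"),
           ({"before_punct": False}, "P"), ({"before_punct": False}, "-")]
gain = split_gain(samples, lambda ctx: ctx["before_punct"])
print(round(gain, 3))  # -> 1.0
```

Minimizing the conditional entropy of the labels given the contexts is exactly choosing, at each node, the question with the largest gain.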
3.1.2. Continuous HMM
For each speaking style, an average acoustic model λ^(style)_acoustic that includes source/filter variations, f0 variations, and state durations is estimated from the pooled speakers associated with the speaking style, based on the conventional HTS system [14].
Let R be the number of speakers from which an average model is to be estimated. Let o = (o^(1), ..., o^(R)) be the total set of observations, where o^(r) = [o^(r)(1), ..., o^(r)(T_r)] is the observation sequence associated with speaker r, and o^(r)(t) = [o^(r)_t(1), ..., o^(r)_t(D)]ᵀ is the (D×1) observation vector which describes the acoustic properties at time t. Let q = (q^(1), ..., q^(R)) be the total set of linguistic context observations, where q^(r) = [q^(r)(1), ..., q^(r)(T_r)] is the linguistic context sequence associated with speaker r, and q^(r)(t) = [q^(r)_1(t), ..., q^(r)_{L'}(t)]ᵀ is the (L'×1) augmented linguistic context vector which describes the linguistic properties at time t.
An average context-dependent acoustic HMM λ^(style)_acoustic is estimated from the pooled speaker observations. Firstly, a context-dependent HMM is estimated for each of the linguistic contexts. Then, an average context-dependent tree T^(style)_acoustic is derived so as to minimize the description length of the context-dependent model λ^(style)_acoustic.
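The description-length criterion trades likelihood against model size. The sketch below shows the general two-part MDL comparison made when deciding whether to keep a candidate split; the numbers are toy values, and this follows the generic MDL idea rather than the exact HTS formulation.

```python
import math

def description_length(loglik, n_params, n_frames):
    """Two-part MDL score: negative log-likelihood plus a model-size
    penalty of (k/2) * log(N) for k parameters and N training frames."""
    return -loglik + 0.5 * n_params * math.log(n_frames)

# A candidate split doubles the number of parameters; it is kept only if
# the likelihood improvement outweighs the size penalty. Toy numbers.
before = description_length(loglik=-5000.0, n_params=10, n_frames=10000)
after = description_length(loglik=-4950.0, n_params=20, n_frames=10000)
print(after < before)  # -> True (the split reduces the description length)
```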
The acoustic module simultaneously models source/filter variations, f0 variations, and the temporal structure associated with a speaking style. Speaker f0 values were normalized with respect to the speaking style prior to modelling. Source, filter, and normalized f0 observation vectors and their dynamic counterparts are used to estimate the context-dependent HMMs λ^(style)_acoustic. Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context clustering (ML-MDL [15]). Multi-Space probability Distributions (MSD) [16] are used to model the continuous/discrete f0 parameter sequence so as to handle voiced/unvoiced regions properly. Each context-dependent HMM is modelled with state-duration probability density functions (PDFs) to account for the temporal structure of speech [17]. Finally, speech dynamics are modelled according to the trajectory model and the global variance (GV), which model local and global speech variations over time [?].
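The MSD treatment of f0 can be sketched as a two-space density: a discrete probability mass for unvoiced frames and a one-dimensional Gaussian over log-f0 for voiced frames. The parameters and frame values below are toy examples, purely illustrative of the mechanism.

```python
import math

def msd_loglik(obs, w_voiced, mean, var):
    """Log-likelihood of one f0 observation under a two-space MSD:
    obs is None for an unvoiced frame (discrete space) and a log-f0
    value for a voiced frame (1-D Gaussian space)."""
    if obs is None:
        return math.log(1.0 - w_voiced)
    g = math.exp(-(obs - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return math.log(w_voiced * g)

# Toy sequence: two voiced frames near 110 Hz and one unvoiced frame,
# scored against a state with a toy mean log-f0 of log(110 Hz).
frames = [math.log(110.0), None, math.log(112.0)]
total = sum(msd_loglik(f, w_voiced=0.8, mean=math.log(110.0), var=0.01)
            for f in frames)
print(round(total, 3))
```

The key property is that voiced and unvoiced frames contribute to one likelihood without forcing an f0 value onto unvoiced regions.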
3.2. Generation of the Speech Parameters
During the synthesis, the text is first converted into a concatenated sequence of context-dependent HMMs λ^(style)_symbolic associated with the linguistic context sequence q = [q_1, ..., q_N], where q_n = [q_1, ..., q_L]ᵀ denotes the (L×1) linguistic context vector associated with linguistic unit n.
Firstly, the prosodic symbolic sequence q̂_proso is inferred so as to maximize the log-likelihood of the prosodic symbolic sequence q_proso conditionally to the linguistic context sequence q and the model λ^(style)_symbolic:

q̂_proso = argmax_{q_proso} P( q_proso | q, λ^(style)_symbolic )    (2)
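Under the tree-structured discrete model, equation (2) can be approximated per unit: each linguistic context vector is routed to a leaf of the symbolic tree, and the most probable prosodic label at that leaf is emitted. The sketch below is a simplification that ignores dependencies between successive labels; the tree and its leaf distributions are toy examples.

```python
# Per-unit approximation of equation (2): route each linguistic context
# vector to a leaf of the symbolic decision tree and emit the label that
# maximizes the leaf probability. All data below are toy examples.

def route_to_leaf(tree, ctx):
    """Walk a binary tree of (question, yes_subtree, no_subtree) nodes;
    a leaf is a dict mapping prosodic labels to probabilities."""
    node = tree
    while isinstance(node, tuple):
        question, yes, no = node
        node = yes if question(ctx) else no
    return node

def decode(tree, contexts):
    labels = []
    for ctx in contexts:
        dist = route_to_leaf(tree, ctx)
        labels.append(max(dist, key=dist.get))
    return labels

# Toy tree with a single question on a pause-related context feature.
tree = (lambda ctx: ctx["before_pause"],
        {"FM": 0.7, "Fm": 0.2, "P": 0.1},   # leaf reached on "yes"
        {"-": 0.6, "P": 0.3, "Fm": 0.1})    # leaf reached on "no"

contexts = [{"before_pause": False}, {"before_pause": False},
            {"before_pause": True}]
print(decode(tree, contexts))  # -> ['-', '-', 'FM']
```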
CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS: STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifier is used to merge the sequence of phonemes into a sequence of syllables. At the prosodic level, pauses are identified to
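The front-end chain just described (text → phonemes → syllables) can be sketched with a toy rule set. The two-word lexicon and the naive syllabification rule below are illustrative stand-ins, not a real French phonetizer.

```python
# Toy front-end: phonetize words with a tiny lexicon (SAMPA-like symbols),
# then group the phoneme string into syllables with a naive rule.
# The lexicon entries are illustrative, not a real pronunciation dictionary.

LEXICON = {"bonne": ["b", "O", "n"], "heure": ["9", "R"]}
VOWELS = {"O", "9"}

def phonetize(words):
    phonemes = []
    for w in words:
        phonemes.extend(LEXICON[w])
    return phonemes

def syllabify(phonemes):
    """Naive rule: a syllable is a maximal run of consonants plus one
    vowel; trailing consonants attach to the last syllable as a coda."""
    syllables, current = [], []
    for p in phonemes:
        current.append(p)
        if p in VOWELS:
            syllables.append(current)
            current = []
    if current:                      # leftover consonants -> coda
        if syllables:
            syllables[-1].extend(current)
        else:
            syllables.append(current)
    return syllables

print(syllabify(phonetize(["bonne", "heure"])))
# -> [['b', 'O'], ['n', '9', 'R']]
```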
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model aims to infer a sequence of prosodic labels that are associated with relevant prosodic events.
sentence   Longtemps , je me suis couché de bonne heure .
⇓
prosodic structure:
  FM  * *
  Fm  * * *
  P   * * * *
syllable   Long- temps ## je me suis cou- ché de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.
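The tiers of Table 4.2 can be stored as parallel sequences, one label per syllable. The sketch below fills in only the FM tier, which follows directly from the definition of FM as a boundary right-bounded by a pause; the Fm and P tiers would require the full annotation.

```python
# Parallel-tier encoding of a prosodic structure: one label per syllable.
# Only the FM tier is derived here, from the rule that a syllable
# immediately followed by a pause (##) carries a major boundary.
# Pauses themselves are left unlabelled.

syllables = ["Long-", "temps", "##", "je", "me", "suis",
             "cou-", "ché", "de", "bonne", "heure", "##"]

def fm_tier(sylls):
    tier = []
    for i, s in enumerate(sylls):
        if s == "##":
            continue
        nxt = sylls[i + 1] if i + 1 < len(sylls) else None
        tier.append("FM" if nxt == "##" else "-")
    return tier

print(fm_tier(syllables))
# -> ['-', 'FM', '-', '-', '-', '-', '-', '-', '-', 'FM']
```

The two FM labels recovered this way match the two major boundaries shown in Table 4.2.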
Two main approaches can be distinguished in prosodic structure modelling: on the one hand, expert approaches attempt to elaborate formal models that account for the observed prosodic variations with respect to linguistic, para-linguistic, and extra-linguistic constraints. On the other hand, statistical methods attempt to elaborate a statistical model which accounts for the prosodic variations from the observation of statistical regularities on large speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure modelling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996, Delais-Roussarie, 2000, Mertens, 2004b] for French; [Barbosa, 2006] for some other languages). Expert models assume that a prosodic structure results from the integration of various and potentially conflicting constraints, in particular syntactic and rhythmic [?, Dell, 1984] constraints. The linguistic module mostly concerns the extraction of prominent syntactic boundaries from deep syntactic parsing, based on syntactic constituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-side Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
[Figure: speech waveform (amplitude vs. time) and f0 contour (40-120 Hz) aligned with the phoneme sequence of "Longtemps, je me suis couché de bonne heure".]
CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS: STATE-OF-THE-ART

A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifier is used to merge the sequence of phonemes into a sequence of syllables. At the prosodic level, pauses are identified to
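The phonetizer/syllabifier front-end described above can be sketched as follows. This is an illustrative toy only: the lexicon, vowel set, and the crude nucleus-based syllabification heuristic are assumptions, not the components actually used in the chapter.

```python
# Toy front-end: phonetize words via a small lexicon, then group
# phonemes into syllables around vowel nuclei.

TOY_LEXICON = {"bonne": ["b", "O", "n"], "heure": ["9", "R"]}

VOWELS = {"O", "9", "a", "e", "i", "u", "@", "2"}

def phonetize(words):
    """Map each word to its phoneme sequence via the toy lexicon."""
    phonemes = []
    for w in words:
        phonemes.extend(TOY_LEXICON[w])
    return phonemes

def syllabify(phonemes):
    """Group phonemes into syllables: each vowel closes a nucleus;
    trailing consonants attach to the last syllable as a coda."""
    syllables, current = [], []
    for p in phonemes:
        current.append(p)
        if p in VOWELS:
            syllables.append(current)
            current = []
    if current:  # leftover coda consonants
        if syllables:
            syllables[-1].extend(current)
        else:
            syllables.append(current)
    return syllables
```

A real syllabifier would also apply language-specific onset-maximization rules; this sketch only shows the phoneme-to-syllable grouping step.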
4.6 Abstract Model: Text To Prosodic Structure

The prosodic structure model infers a sequence of prosodic labels that are associated with relevant prosodic events.

sentence:   Longtemps, je me suis couché de bonne heure.
⇓ prosodic structure
FM:  * *
Fm:  * * *
P:   * * * *
syllable:   Long- temps ## je me suis cou- ché de bonne heure ##

Table 4.2: Illustration of the text-to-prosodic-structure conversion.
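The output of the conversion illustrated in Table 4.2 can be represented as a syllable-aligned label sequence. In this sketch the specific label placement is hypothetical: the asterisk alignment of the extracted table could not be recovered, so the FM/P positions below are illustrative only (FM = major boundary, Fm = minor boundary, P = prominence, None = unlabelled).

```python
# Syllable sequence of the example sentence; "##" marks a pause.
syllables = ["Long-", "temps", "##", "je", "me", "suis",
             "cou-", "ché", "de", "bonne", "heure", "##"]

# Hypothetical labelling, keyed by syllable index: an FM before each
# pause and a P on two assumed prominences.
labels = {1: "FM", 10: "FM", 0: "P", 9: "P"}

# Pair each syllable with its (possibly absent) prosodic label.
prosodic_structure = [(syl, labels.get(i)) for i, syl in enumerate(syllables)]
```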
Two main approaches can be distinguished in prosodic structure modelling: on the one hand, expert approaches attempt to elaborate formal models that account for the observed prosodic variations with respect to linguistic, para-linguistic, and extra-linguistic constraints. On the other hand, statistical methods attempt to elaborate a statistical model which accounts for prosodic variations from the observation of statistical regularities in large speech corpora.

4.6.1 Expert Models

Expert approaches mostly concern hierarchical prosodic structure modelling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996, Delais-Roussarie, 2000, Mertens, 2004b] for French; [Barbosa, 2006] for some other languages). Expert models assume that a prosodic structure results from the integration of various and potentially conflicting constraints, in particular syntactic and rhythmic [?, Dell, 1984] constraints.

The linguistic module mostly concerns the extraction of prominent syntactic boundaries from deep syntactic parsing, based on syntactic constituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-side Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
[Figure: speech waveform (amplitude, −0.4 to 0.4) and f0 contour (40–120 Hz) as a function of time [s], aligned with the phoneme sequence l o t a Z @ m @ s H i k u S e d 2 b O n 9 R ("Longtemps, je me suis couché de bonne heure").]
• journal: radio review (press review; political, economic, and technological chronicles); almost single-speaker monologue, with a few interactions with a lead journalist.
• sport commentary: soccer; two speakers engaged in monologues, with speech overlapping during intense soccer sequences and speech-turn changes; almost no interactions.

The speech database consists of natural multi-media audio contents with strongly variable audio quality (background noise: crowd, audience, recording noise, and reverberation). The speech prosody characteristics of the speech database are illustrated in Figure 1.
3. SPEAKING STYLE MODEL

A speaking style model λ(style) is composed of discrete/continuous context-dependent HMMs that model the symbolic/acoustic speech characteristics of a speaking style.

λ(style) = (λ(style)_symbolic, λ(style)_acoustic)    (1)

During training, the discrete/continuous context-dependent HMMs are estimated separately. During synthesis, the symbolic/acoustic parameters are generated in cascade, from the symbolic representation to the acoustic variations. Additionally, a rich linguistic description of the text characteristics is automatically extracted using a linguistic processing chain [9] and used to refine the context-dependent HMM modelling (see [10] and [11] for a detailed description of the enriched linguistic contexts).
3.1. Training of the Discrete/Continuous Models

3.1.1. Discrete HMM

For each speaking style, an average context-dependent discrete HMM λ(style)_symbolic is estimated from the pooled speakers associated with the speaking style.

The prosodic grammar consists of a hierarchical prosodic representation that was proposed as an alternative to ToBI [12] for French prosody labelling [13]. The prosodic grammar is composed of major prosodic boundaries (FM, a boundary which is right-bounded by a pause), minor prosodic boundaries (Fm, an intermediate boundary), and prosodic prominences (P).
Let R be the number of speakers from which an average model λ(style)_symbolic is to be estimated. Let q_proso = (q(1)_proso, ..., q(R)_proso) be the total set of prosodic symbolic observations, where q(r)_proso = [q(r)_proso(1), ..., q(r)_proso(N_r)] is the prosodic symbolic sequence associated with speaker r, and q(r)_proso(n) is the prosodic label associated with syllable n. Let q = (q(1), ..., q(R)) be the total set of linguistic context observations, where q(r) = [q(r)(1), ..., q(r)(N_r)] is the linguistic context sequence associated with speaker r, and q(r)(n) = [q(r)_1(n), ..., q(r)_L(n)]^T is the (L×1) linguistic context vector which describes the linguistic characteristics associated with syllable n.
An average context-dependent discrete HMM λ(style)_symbolic is estimated from the pooled speaker observations. Firstly, an average context-dependent tree T(style)_symbolic is derived so as to minimize the information entropy of the prosodic symbolic labels q_proso conditionally on the linguistic contexts q. Then, a context-dependent HMM model λ(style)_symbolic is estimated for each terminal node of the context-dependent tree T(style)_symbolic.
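The entropy criterion used to grow the symbolic tree can be sketched as a greedy node split: among candidate binary questions on the linguistic context, select the one that minimizes the conditional entropy of the prosodic labels. The question set and data format are assumptions for illustration; the actual tree-growing procedure of the paper is not reproduced here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of prosodic labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(samples, question):
    """Weighted entropy of the labels after splitting (context, label)
    pairs on a binary context question."""
    yes = [lab for ctx, lab in samples if question(ctx)]
    no = [lab for ctx, lab in samples if not question(ctx)]
    n = len(samples)
    return (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)

def best_question(samples, questions):
    """Greedy node split: the question with minimal conditional entropy."""
    return min(questions, key=lambda q: split_entropy(samples, q))
```

Applied recursively to each node, this yields a context tree whose leaves carry low-entropy (i.e. predictable) prosodic label distributions.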
3.1.2. Continuous HMM

For each speaking style, an average acoustic model λ(style)_acoustic that includes source/filter variations, f0 variations, and state durations is estimated from the pooled speakers associated with the speaking style, based on the conventional HTS system [14].

Let R be the number of speakers from which an average model is to be estimated. Let o = (o(1), ..., o(R)) be the total set of observations, where o(r) = [o(r)(1), ..., o(r)(T_r)] is the observation sequence associated with speaker r, and o(r)(t) = [o(r)_t(1), ..., o(r)_t(D)]^T is the (D×1) observation vector which describes the acoustic properties at time t. Let q = (q(1), ..., q(R)) be the total set of linguistic context observations, where q(r) = [q(r)(1), ..., q(r)(T_r)] is the linguistic context sequence associated with speaker r, and q(r)(t) = [q(r)_1(t), ..., q(r)_{L'}(t)]^T is the (L'×1) augmented linguistic context vector which describes the linguistic properties at time t.
An average context-dependent acoustic HMM model λ(style)_acoustic is estimated from the pooled speaker observations. Firstly, a context-dependent HMM model is estimated for each of the linguistic contexts. Then, an average context-dependent tree T(style)_acoustic is derived so as to minimize the description length of the context-dependent HMM model λ(style)_acoustic.
The acoustic module simultaneously models the source/filter variations, f0 variations, and temporal structure associated with a speaking style. Speakers' f0 were normalized with respect to the speaking style prior to modelling. Source, filter, and normalized f0 observation vectors and their dynamic vectors are used to estimate the context-dependent HMM models λ(style)_acoustic. Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context clustering (ML-MDL [15]). Multi-Space probability Distributions (MSD) [16] are used to model the continuous/discrete f0 parameter sequence, so as to manage voiced/unvoiced regions properly. Each context-dependent HMM is modelled with a state-duration probability density function (PDF) to account for the temporal structure of speech [17]. Finally, speech dynamics are modelled according to the trajectory model and the global variance (GV), which model local and global speech variations over time [?].
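The multi-space idea behind the MSD f0 stream can be sketched as follows: each frame is either a discrete "unvoiced" symbol or a continuous f0 value, and the continuous sub-space observations are what the Gaussian components are fitted on. The encoding convention (0.0 marking unvoiced frames) is an assumption for illustration, not the MSD-HMM formulation itself.

```python
def encode_f0(f0_track):
    """Split a raw f0 track (0.0 = unvoiced) into an MSD-style stream:
    a voicing flag plus a continuous value defined only when voiced."""
    stream = []
    for f0 in f0_track:
        if f0 > 0.0:
            stream.append(("voiced", f0))
        else:
            stream.append(("unvoiced", None))
    return stream

def voiced_values(stream):
    """Continuous sub-space observations, used to fit the Gaussian part."""
    return [v for flag, v in stream if flag == "voiced"]
```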
3.2. Generation of the Speech Parameters

During synthesis, the text is first converted into a concatenated sequence of context-dependent HMM models λ(style)_symbolic associated with the linguistic context sequence q = [q_1, ..., q_N], where q_n = [q_1, ..., q_L]^T denotes the (L×1) linguistic context vector associated with linguistic unit n.

Firstly, the prosodic symbolic sequence q_proso is inferred so as to maximize its log-likelihood conditionally on the linguistic context sequence q and the model λ(style)_symbolic:

q̂_proso = argmax_{q_proso} P(q_proso | q, λ(style)_symbolic)    (2)
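Equation (2) can be sketched as an argmax decode: each linguistic context is routed to a leaf of the symbolic tree, and the label maximizing the leaf's probability is selected. The leaf distributions below are hypothetical, and the per-syllable independence assumption is a simplification of the sequence-level maximization in Eq. (2).

```python
from math import log

# Hypothetical leaf distributions P(label | leaf) of the symbolic tree.
LEAF_DISTS = {
    "leaf_pre_pause": {"FM": 0.7, "Fm": 0.2, "P": 0.1},
    "leaf_default":   {"FM": 0.1, "Fm": 0.3, "P": 0.6},
}

def decode_proso(leaves):
    """Maximize sum_n log P(q_proso(n) | leaf_n), assuming per-syllable
    independence; returns the label sequence and its log-likelihood."""
    labels, loglik = [], 0.0
    for leaf in leaves:
        dist = LEAF_DISTS[leaf]
        best = max(dist, key=dist.get)
        labels.append(best)
        loglik += log(dist[best])
    return labels, loglik
```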
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-basedtime [s]
ampl
itud
e
!0.4
!0.2
0
0.2
0.4
40
60
80
100
120
f 0[H
z]
l
o
ta Z
@m@
sH ik
uS
ed2b O n 9 R
time [s]
ampl
itude
!0.4
!0.2
0
0.2
0.4
40
60
80
100
120
f 0[H
z]
l
o
ta Z
@m@
sH ik
uS
ed2b O n 9 R
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure modelling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996, Delais-Roussarie, 2000, Mertens, 2004b] for French; [Barbosa, 2006] for some other languages). Expert models assume that a prosodic structure results from the integration of various and potentially conflicting constraints, in particular syntactic and rhythmic [?, Dell, 1984] constraints. The linguistic module mostly concerns the extraction of prominent syntactic boundaries from deep syntactic parsing, based on syntactic constituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-side Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
• journal: radio review (press review; political, economical, technological chronicles); almost single-speaker monologue with a few interactions with a lead journalist.
• sport commentary: soccer; two speakers engaged in monologues, with speech overlapping during intense soccer sequences and speech turn changes; almost no interactions.
The speech database consists of natural-speech multi-media audio contents with strongly variable audio quality (background noise: crowd, audience, recording noise, and reverberation). The speech prosody characteristics of the speech database are illustrated in figure 1.
3. SPEAKING STYLE MODEL
A speaking style model λ^(style) is composed of discrete/continuous context-dependent HMMs that model the symbolic/acoustic speech characteristics of a speaking style:

λ^(style) = ( λ^(style)_symbolic , λ^(style)_acoustic )    (1)
During the training, the discrete/continuous context-dependent HMMs are estimated separately. During the synthesis, the symbolic/acoustic parameters are generated in cascade, from the symbolic representation to the acoustic variations. Additionally, a rich linguistic description of the text characteristics is automatically extracted using a linguistic processing chain [9] and used to refine the context-dependent HMM modelling (see [10] and [11] for a detailed description of the enriched linguistic contexts).
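As a rough sketch, the two-stage cascade can be pictured as follows. All names here (SymbolicModel, AcousticModel, synthesize) and the toy rules inside them are hypothetical stand-ins for illustration, not the paper's implementation:

```python
class SymbolicModel:
    """Hypothetical stand-in for the discrete context-dependent HMM."""
    def infer(self, contexts):
        # Toy rule: prominence (P) everywhere, a major boundary (FM) at the end.
        return ["P"] * (len(contexts) - 1) + ["FM"]

class AcousticModel:
    """Hypothetical stand-in for the continuous context-dependent HMM."""
    def generate(self, contexts):
        # Toy f0 contour: lower f0 on major-boundary syllables.
        return [120.0 if c["proso"] == "FM" else 180.0 for c in contexts]

def synthesize(contexts, symbolic_model, acoustic_model):
    """Two-stage cascade: infer symbolic labels first, then generate
    acoustics from the contexts augmented with the inferred labels."""
    labels = symbolic_model.infer(contexts)                      # stage 1
    augmented = [dict(c, proso=l) for c, l in zip(contexts, labels)]
    return acoustic_model.generate(augmented)                    # stage 2

contexts = [{"syllable": s} for s in ["Long-", "temps", "je", "me"]]
f0 = synthesize(contexts, SymbolicModel(), AcousticModel())
```

The point of the sketch is only the data flow: the symbolic stage augments the linguistic contexts with prosodic labels, and the acoustic stage is conditioned on the augmented contexts.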
3.1. Training of the Discrete/Continuous Models
3.1.1. Discrete HMM
For each speaking style, an average context-dependent discrete HMM λ^(style)_symbolic is estimated from the pooled speakers associated with the speaking style.
The prosodic grammar consists of a hierarchical prosodic representation that was investigated as an alternative to ToBI [12] for French prosody labelling [13]. The prosodic grammar is composed of major prosodic boundaries (FM, a boundary which is right-bounded by a pause), minor prosodic boundaries (Fm, an intermediate boundary), and prosodic prominences (P).
Let R be the number of speakers from which an average model λ^(style)_symbolic is to be estimated. Let q_proso = (q^(1)_proso, …, q^(R)_proso) denote the total set of prosodic symbolic observations, where q^(r)_proso = [q^(r)_proso(1), …, q^(r)_proso(N_r)] is the prosodic symbolic sequence associated with speaker r, and q^(r)_proso(n) is the prosodic label associated with syllable n. Let q = (q^(1), …, q^(R)) denote the total set of linguistic context observations, where q^(r) = [q^(r)(1), …, q^(r)(N_r)] is the linguistic context sequence associated with speaker r, and q^(r)(n) = [q^(r)_1(n), …, q^(r)_L(n)]ᵀ is the (L×1) linguistic context vector which describes the linguistic characteristics associated with syllable n.
An average context-dependent discrete HMM λ^(style)_symbolic is estimated from the pooled speaker observations. Firstly, an average context-dependent tree T^(style)_symbolic is derived so as to minimize the information entropy of the prosodic symbolic labels q_proso conditionally to the linguistic contexts q. Then, a context-dependent HMM model λ^(style)_symbolic is estimated for each terminal node of the context-dependent tree T^(style)_symbolic.
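The entropy criterion that drives the symbolic tree growth can be illustrated with a toy split selection. The context features, questions, and data below are fabricated for illustration; the actual system clusters over rich linguistic contexts:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of prosodic labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_question(samples, questions):
    """Pick the binary context question whose yes/no split minimizes the
    weighted conditional entropy of the prosodic labels."""
    n = len(samples)
    best, best_h = None, float("inf")
    for q in questions:
        yes = [lab for ctx, lab in samples if q(ctx)]
        no = [lab for ctx, lab in samples if not q(ctx)]
        if not yes or not no:            # skip degenerate splits
            continue
        h = (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)
        if h < best_h:
            best, best_h = q, h
    return best, best_h

# Toy data: (linguistic context, prosodic label). Punctuation perfectly
# predicts a major boundary in this fabricated example.
samples = [
    ({"punct": True, "pos": "N"}, "FM"),
    ({"punct": True, "pos": "V"}, "FM"),
    ({"punct": False, "pos": "N"}, "P"),
    ({"punct": False, "pos": "V"}, "none"),
]
questions = [lambda c: c["punct"], lambda c: c["pos"] == "N"]
q, h = best_question(samples, questions)
```

Here the punctuation question yields a pure "FM" leaf (entropy 0), so it is selected over the part-of-speech question; growing the tree repeats this selection recursively at each node.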
3.1.2. Continuous HMM
For each speaking style, an average acoustic model λ^(style)_acoustic, which includes source/filter variations, f0 variations, and state durations, is estimated from the pooled speakers associated with the speaking style, based on the conventional HTS system [14].
Let R be the number of speakers from which an average model is to be estimated. Let o = (o^(1), …, o^(R)) denote the total set of observations, where o^(r) = [o^(r)(1), …, o^(r)(T_r)] is the observation sequence associated with speaker r, and o^(r)(t) = [o^(r)_t(1), …, o^(r)_t(D)]ᵀ is the (D×1) observation vector which describes the acoustic properties at time t. Let q = (q^(1), …, q^(R)) denote the total set of linguistic context observations, where q^(r) = [q^(r)(1), …, q^(r)(T_r)] is the linguistic context sequence associated with speaker r, and q^(r)(t) = [q^(r)_1(t), …, q^(r)_L′(t)]ᵀ is the (L′×1) augmented linguistic context vector which describes the linguistic properties at time t.
An average context-dependent HMM acoustic model λ^(style)_acoustic is estimated from the pooled speaker observations. Firstly, a context-dependent HMM model is estimated for each of the linguistic contexts. Then, an average context-dependent tree T^(style)_acoustic is derived so as to minimize the description length of the context-dependent HMM model λ^(style)_acoustic.
The acoustic module simultaneously models the source/filter variations, f0 variations, and temporal structure associated with a speaking style. Speakers' f0 were normalized with respect to the speaking style prior to modelling. Source, filter, and normalized f0 observation vectors and their dynamic vectors are used to estimate the context-dependent HMM models λ^(style)_acoustic. Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context clustering (ML-MDL [15]). Multi-Space probability Distributions (MSD) [16] are used to model the continuous/discrete f0 parameter sequence so as to manage voiced/unvoiced regions properly. Each context-dependent HMM is modelled with state duration probability density functions (PDFs) to account for the temporal structure of speech [17]. Finally, speech dynamics are modelled according to the trajectory model and the global variance (GV), which model local and global speech variations over time [?].
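The MSD idea for f0 — a zero-dimensional unvoiced space plus a continuous voiced space — can be sketched for a single one-dimensional, single-mixture stream. This is a simplification of the multi-stream, multi-mixture models used in HTS; the function name and parameter values are illustrative only:

```python
import math

def msd_likelihood(obs, w_voiced, mean, var):
    """Multi-Space Distribution likelihood for one f0 observation.

    `obs` is None for an unvoiced frame (discrete, zero-dimensional space)
    or a log-f0 value for a voiced frame (1-D Gaussian space).
    `w_voiced` is the prior weight of the voiced space."""
    if obs is None:                       # unvoiced space: weight only
        return 1.0 - w_voiced
    gauss = math.exp(-0.5 * (obs - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)
    return w_voiced * gauss               # voiced space: weighted Gaussian

unvoiced = msd_likelihood(None, w_voiced=0.8, mean=5.0, var=0.04)
voiced = msd_likelihood(5.0, w_voiced=0.8, mean=5.0, var=0.04)
```

The single scalar likelihood lets the usual Baum-Welch and Viterbi machinery run unchanged over frames that alternate between voiced and unvoiced regions.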
3.2. Generation of the Speech Parameters
During the synthesis, the text is first converted into a concatenated sequence of context-dependent HMM models λ^(style)_symbolic associated with the linguistic context sequence q = [q_1, …, q_N], where q_n = [q_1, …, q_L]ᵀ denotes the (L×1) linguistic context vector associated with linguistic unit n.
Firstly, the prosodic symbolic sequence q̂_proso is inferred so as to maximize the log-likelihood of the prosodic symbolic sequence q_proso conditionally to the linguistic context sequence q and the model λ^(style)_symbolic:

q̂_proso = argmax_{q_proso} P(q_proso | q, λ^(style)_symbolic)    (2)
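A toy decode of this maximization under a discrete model with label bigram transitions might look as follows. The two-label set, emission table, and uniform transitions are fabricated for illustration; the real model reads P(label | context) from the leaves of the context tree:

```python
import math

LABELS = ["none", "FM"]

def viterbi(emissions, trans):
    """emissions[n][label] ~ P(label | context of syllable n);
    trans[(a, b)] ~ P(b | a). Returns the max log-likelihood label path."""
    n = len(emissions)
    delta = [{l: math.log(emissions[0][l]) for l in LABELS}]
    back = []
    for t in range(1, n):
        d, b = {}, {}
        for l in LABELS:
            prev = max(LABELS, key=lambda p: delta[t - 1][p] + math.log(trans[(p, l)]))
            b[l] = prev
            d[l] = delta[t - 1][prev] + math.log(trans[(prev, l)]) + math.log(emissions[t][l])
        delta.append(d)
        back.append(b)
    last = max(LABELS, key=lambda l: delta[-1][l])
    path = [last]
    for b in reversed(back):      # trace predecessors back to syllable 1
        path.append(b[path[-1]])
    return list(reversed(path))

emissions = [
    {"none": 0.9, "FM": 0.1},
    {"none": 0.6, "FM": 0.4},
    {"none": 0.2, "FM": 0.8},
]
trans = {(a, b): 0.5 for a in LABELS for b in LABELS}
path = viterbi(emissions, trans)
```

With uniform transitions the decode reduces to a per-syllable argmax; non-uniform transitions would let the model penalize, say, two adjacent major boundaries.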
Then, the linguistic context sequence q, augmented with the inferred prosodic label sequence q̂_proso, is converted into a concatenated sequence of context-dependent models λ^(style)_acoustic.
The acoustic sequence ô is inferred so as to maximize the log-likelihood of the acoustic sequence o conditionally to the model λ^(style)_acoustic and the sequence length T:

ô = argmax_o max_q P(o | q, λ^(style)_acoustic, T) P(q | λ^(style)_acoustic, T)    (3)
First, the state sequence q̂ is determined so as to maximize the log-likelihood of the state sequence conditionally to the model λ^(style)_acoustic and the sequence length T. Then, the observation sequence c is determined so as to maximize the log-likelihood of the observation sequence conditionally to the state sequence q̂ and the model λ^(style)_acoustic under the dynamic constraint o = Wc, which leads to the linear system:

R_q c = r_q    (4)

where:

R_q = Wᵀ Σ_q⁻¹ W    (5)

r_q = Wᵀ Σ_q⁻¹ μ_q    (6)

and Σ_q and μ_q are respectively the covariance matrix and the mean vector for the state sequence q.
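A minimal numeric instance of Eqs. (4)-(6), assuming a one-dimensional static stream with a single backward-difference delta window and diagonal covariances (a simplification of the full HTS parameter generation):

```python
import numpy as np

def build_W(T):
    """Stack static and delta windows: per frame t, o_t = [c_t; c_t - c_{t-1}]."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                  # static window
        W[2 * t + 1, t] = 1.0              # delta window: c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

def generate(mu, sigma2):
    """Solve R_q c = r_q with R_q = Wᵀ Σ⁻¹ W and r_q = Wᵀ Σ⁻¹ μ."""
    T = len(mu) // 2
    W = build_W(T)
    Sinv = np.diag(1.0 / sigma2)           # diagonal covariance assumption
    R = W.T @ Sinv @ W
    r = W.T @ Sinv @ mu
    return np.linalg.solve(R, r)

# Toy state means: static targets [1, 3, 3] interleaved with delta targets
# [0, 2, 0]; unit variances everywhere.
mu = np.array([1.0, 0.0, 3.0, 2.0, 3.0, 0.0])
sigma2 = np.ones(6)
c = generate(mu, sigma2)
```

The solution trades off the static targets against the delta targets, yielding a smooth trajectory rather than a piecewise-constant sequence of state means; in practice R_q is banded and is solved with a Cholesky rather than a dense solver.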
4. EVALUATION
The proposed model has been evaluated on the basis of a speaking style identification perceptual experiment, and compared to a speaking style identification experiment with natural speech [18]. For the purpose of such a comparison, it was necessary to provide a single evaluation scheme for both experiments. In particular, it was not possible to control the linguistic content of natural speech utterances, which provides evident cues for DG identification (a single keyword would be sufficient to identify a DG). Thus, such a comparison required removing lexical access so as to focus on the prosodic dimension only.
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
• journal: radio review (press review; political, economical, tech-nological chronicles); almost single speaker monologue with afew interactions with a lead journalist.
• sport commentary: soccer; two speakers engaged in mono-logues with speech overlapping during intense soccer sequencesand speech turn changes; almost no interactions.
The speech database consists of natural speech multi-media audiocontents with strongly variable audio quality (background noise:crowd, audience, recording noise, and reverberation). The speechprosody characteristics of the speech databased are illustrated in fig-ure 1.
3. SPEAKING STYLE MODEL
A speaking style model λ(style) is composed of discrete/continuouscontext-dependent HMMs that model the symbolic/acoustic speechcharacteristics of a speaking style.
λ(style) =“λ
(style)symbolic, λ
(style)acoustic
”(1)
During the training, the discrete/continuous context-dependentHMMs are estimated separately. During the synthesis, the sym-bolic/acoustic parameters are generated in cascade, from the sym-bolic representation to the acoustic variations. Additionally, a richlinguistic description of the text characteristics is automatically ex-tracted using a linguistic processing chain [9] and used to refine thecontext-dependent HMM modelling (see [10] and [11] for a detaileddescription of the enriched linguistic contexts).
3.1. Training of the Discrete/Continuous Models
3.1.1. Discrete HMM
For each speaking style, an average context-dependent discreteHMM λ
(style)symbolic is estimated from the pooled speakers associated
with the speaking style.
The prosodic grammar consists of a hierarchical prosodic repre-sentation that was experimented as an alternative to TOBI [12]for French prosody labelling [13]. The prosodic grammar iscomposed of major prosodic boundaries (FM, a boundary whichis right bounded by a pause), minor prosodic boundaries (Fm, anintermediate boundary), and prosodic prominences (P).
Let R be the number of speakers from which an average modelλ
(style)symbolic is to be estimated. Let qproso = (q
(1)proso, . . . ,q
(R)proso)
the total set of prosodic symbolic observations, andq
(r)proso = [q
(r)proso(1), . . . , q
(r)proso(Nr)] is the prosodic sym-
bolic sequence associated with speaker r, where q(r)proso(n)
is the prosodic label associated with syllable n. Letq = (q(1), . . . ,q(R)) the total set of linguistic contextsobservations, and q(r) = [q(r)(1), . . . ,q(r)(Nr)] is the lin-guistic context sequence associated with speaker r, whereq(r)(n) = [q
(r)1 (n), . . . , q
(r)L (n)]� is the (Lx1) linguistic context
vector which describes the linguistic characteristics associated withsyllable n.
An average context-dependent discrete HMM λ(style)symbolic is estimated
from the pooled speakers observations. Firstly, an average context-dependent tree T(style)
symbolic is derived so as to minimize the infor-
mation entropy of the prosodic symbolic labels qproso condition-ally to the linguistic contexts q . Then, a context-dependent HMMmodel λ
(style)symbolic is estimated for each terminal node of the context-
dependent tree T(style)symbolic.
3.1.2. Continuous HMM
For each speaking style, an average acoustic model λ(style)acoustic that
includes source/filter variations, f0 variations, and state-durations,is estimated from the pooled speakers associated with the speakingstyle based on the conventional HTS system ([14]).
Let R be the number of speakers from which an average model is tobe estimated. Let o = (o(1), . . . ,o(R)) the total set of observations,and o(r) = [o(r)(1), . . . ,o(r)(Tr)] is the observation sequencesassociated with speaker r, where o(r)(t) = [o
(r)t (1), . . . , o
(r)t (D)]�
is the (Dx1) observation vector which describes the acoustical prop-erty at time t. Let q = (q(1), . . . ,q(R)) the total set of linguisticcontexts observations, and q(r) = [q(r)(1), . . . ,q(r)(Tr)] isthe linguistic context sequence associated with speaker r, whereq(r)(t) = [q
(r)1 (t), . . . , q
(r)L (t)]� is the (L’x1) augmented linguistic
context vector which describes the linguistic properties at time t.
An average context-dependent HMM acoustic model λ(style)symbolic
is estimated from the pooled speakers observations. Firstly, acontext-dependent HMM model is estimated for each of thelinguistic contexts. Then, an average context-dependent treeT(style)
acoustic is derived so as to minimize the description length of thecontext-dependent HMM model λ
(style)acoustic.
The acoustic module simultaneously models source/filter variations, $f_0$ variations, and the temporal structure associated with a speaking style. Speakers' $f_0$ was normalized with respect to the speaking style prior to modelling. Source, filter, and normalized $f_0$ observation vectors, together with their dynamic vectors, are used to estimate the context-dependent HMM models $\lambda^{(style)}_{acoustic}$. Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context clustering (ML-MDL [15]). Multi-Space probability Distributions (MSD) [16] are used to model the continuous/discrete $f_0$ parameter sequence so as to manage voiced/unvoiced regions properly. Each context-dependent HMM is modelled with state-duration probability density functions (PDFs) to account for the temporal structure of speech [17]. Finally, speech dynamics are modelled according to the trajectory model and the global variance (GV), which model local and global speech variations over time [?].
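The Multi-Space Distribution idea for the continuous/discrete $f_0$ sequence can be illustrated with a toy sketch. This is a didactic simplification, not the HTS MSD-HMM: a single state, one voiced Gaussian space over log-$f_0$, and a hand-set voiced-space weight.

```python
import math

class MSDState:
    """Toy Multi-Space Distribution over f0: a discrete 'unvoiced' space plus
    a continuous Gaussian space over log-f0 for voiced frames."""
    def __init__(self, w_voiced, mean, var):
        self.w_voiced = w_voiced          # prior weight of the voiced space
        self.mean, self.var = mean, var   # Gaussian over log-f0

    def log_prob(self, frame):
        """frame is None for an unvoiced frame, else a log-f0 value."""
        if frame is None:
            return math.log(1.0 - self.w_voiced)
        g = -0.5 * (math.log(2 * math.pi * self.var)
                    + (frame - self.mean) ** 2 / self.var)
        return math.log(self.w_voiced) + g

# A state expecting ~120 Hz voiced speech, unvoiced 20% of the time.
state = MSDState(w_voiced=0.8, mean=math.log(120.0), var=0.05)
seq = [math.log(118.0), math.log(125.0), None, None, math.log(130.0)]
total = sum(state.log_prob(f) for f in seq)
```

The point is that voiced and unvoiced frames are scored within one distribution, so no separate voicing classifier or interpolated $f_0$ is needed.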
3.2. Generation of the Speech Parameters
During the synthesis, the text is first converted into a concatenated sequence of context-dependent HMM models $\lambda^{(style)}_{symbolic}$ associated with the linguistic context sequence $\mathbf{q} = [\mathbf{q}_1, \ldots, \mathbf{q}_N]$, where $\mathbf{q}_n = [q_1, \ldots, q_L]^\top$ denotes the $(L \times 1)$ linguistic context vector associated with linguistic unit $n$.
Firstly, the prosodic symbolic sequence $\mathbf{q}_{proso}$ is inferred so as to maximize its log-likelihood conditionally to the linguistic context sequence $\mathbf{q}$ and the model $\lambda^{(style)}_{symbolic}$:

$$\hat{\mathbf{q}}_{proso} = \operatorname*{argmax}_{\mathbf{q}_{proso}} \; P\left(\mathbf{q}_{proso} \mid \mathbf{q}, \lambda^{(style)}_{symbolic}\right) \quad (2)$$
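Equation (2) can be illustrated with a deliberately simplified sketch. Assuming, for illustration only, that the tree renders the prosodic labels conditionally independent across syllables given their contexts, decoding reduces to routing each context to a leaf and taking the most probable label; the leaf distributions, routing rules, and feature names below are all hypothetical.

```python
# Leaf distributions P(label | leaf) for a toy context tree (values hypothetical).
leaves = {
    "content-word-final": {"FM": 0.1, "Fm": 0.3, "P": 0.5, "none": 0.1},
    "function-word":      {"FM": 0.0, "Fm": 0.1, "P": 0.1, "none": 0.8},
    "phrase-final":       {"FM": 0.6, "Fm": 0.3, "P": 0.05, "none": 0.05},
}

def route(context):
    """Stand-in for the context-dependent tree: maps a context to a leaf."""
    if context["phrase_final"]:
        return "phrase-final"
    if context["pos"] in {"DET", "PRO", "PRE"}:
        return "function-word"
    return "content-word-final"

def decode_prosody(contexts):
    """argmax over labels, syllable by syllable, under the independence sketch."""
    return [max(leaves[route(c)], key=leaves[route(c)].get) for c in contexts]

contexts = [{"pos": "DET", "phrase_final": False},
            {"pos": "NOUN", "phrase_final": False},
            {"pos": "NOUN", "phrase_final": True}]
labels = decode_prosody(contexts)  # ['none', 'P', 'FM']
```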
CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS: STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifier is used to merge the sequence of phonemes into a sequence of syllables. At the prosodic level, pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is used to infer a sequence of prosodic labels that are associated with relevant prosodic events.
sentence: Longtemps , je me suis couché de bonne heure .
⇓
prosodic structure:
    FM  * *
    Fm  * * *
    P   * * * *
syllable: Long- temps ## je me suis cou- ché de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
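The conversion illustrated in Table 4.2 can be represented as a simple syllable/label alignment structure. The sketch below is hypothetical: the exact placement of the FM/Fm/P marks is not fully recoverable from the flattened table, so the label positions shown are illustrative.

```python
# Each syllable is paired with its set of prosodic labels; "##" marks a pause,
# FM/Fm are major/minor boundaries, P a prominence. Label placement is
# illustrative, not the exact star pattern of Table 4.2.
utterance = [
    ("Long-", {"P"}),
    ("temps", {"FM"}),
    ("##",    set()),
    ("je",    set()),
    ("me",    set()),
    ("suis",  set()),
    ("cou-",  set()),
    ("ché",   {"P", "Fm"}),
    ("de",    set()),
    ("bonne", {"P"}),
    ("heure", {"FM"}),
    ("##",    set()),
]

def boundaries(utt):
    """Syllables carrying a prosodic boundary label (major or minor)."""
    return [syl for syl, labs in utt if labs & {"FM", "Fm"}]
```

A structure of this kind is what a text-to-prosodic-structure model has to predict from the linguistic context of each syllable.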
Two main approaches can be distinguished in prosodic structure modelling: on the one hand, expert approaches attempt to elaborate formal models that account for the observed prosodic variations with respect to linguistic, para-linguistic, and extra-linguistic constraints. On the other hand, statistical methods attempt to elaborate a statistical model which accounts for the prosodic variations from the observation of statistical regularities in large speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure modelling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980, Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004] for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996, Delais-Roussarie, 2000, Mertens, 2004b] for French; [Barbosa, 2006] for some other languages). Expert models assume that a prosodic structure results from the integration of various and potentially conflicting constraints, in particular syntactic and rhythmic [?, Dell, 1984] constraints. The linguistic module mostly concerns the extraction of prominent syntactic boundaries from deep syntactic parsing, based on syntactic constituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases [Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-side Boundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
[Figure: speech waveform (amplitude vs. time [s]) and f0 contour (40–120 Hz), shown with the phoneme-level alignment of the utterance "Longtemps, je me suis couché de bonne heure".]
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
• journal: radio review (press review; political, economical, tech-nological chronicles); almost single speaker monologue with afew interactions with a lead journalist.
• sport commentary: soccer; two speakers engaged in mono-logues with speech overlapping during intense soccer sequencesand speech turn changes; almost no interactions.
The speech database consists of natural speech multi-media audiocontents with strongly variable audio quality (background noise:crowd, audience, recording noise, and reverberation). The speechprosody characteristics of the speech databased are illustrated in fig-ure 1.
3. SPEAKING STYLE MODEL
A speaking style model λ(style) is composed of discrete/continuouscontext-dependent HMMs that model the symbolic/acoustic speechcharacteristics of a speaking style.
λ(style) =“λ
(style)symbolic, λ
(style)acoustic
”(1)
During the training, the discrete/continuous context-dependentHMMs are estimated separately. During the synthesis, the sym-bolic/acoustic parameters are generated in cascade, from the sym-bolic representation to the acoustic variations. Additionally, a richlinguistic description of the text characteristics is automatically ex-tracted using a linguistic processing chain [9] and used to refine thecontext-dependent HMM modelling (see [10] and [11] for a detaileddescription of the enriched linguistic contexts).
3.1. Training of the Discrete/Continuous Models
3.1.1. Discrete HMM
For each speaking style, an average context-dependent discreteHMM λ
(style)symbolic is estimated from the pooled speakers associated
with the speaking style.
The prosodic grammar consists of a hierarchical prosodic repre-sentation that was experimented as an alternative to TOBI [12]for French prosody labelling [13]. The prosodic grammar iscomposed of major prosodic boundaries (FM, a boundary whichis right bounded by a pause), minor prosodic boundaries (Fm, anintermediate boundary), and prosodic prominences (P).
Let R be the number of speakers from which an average modelλ
(style)symbolic is to be estimated. Let qproso = (q
(1)proso, . . . ,q
(R)proso)
the total set of prosodic symbolic observations, andq
(r)proso = [q
(r)proso(1), . . . , q
(r)proso(Nr)] is the prosodic sym-
bolic sequence associated with speaker r, where q(r)proso(n)
is the prosodic label associated with syllable n. Letq = (q(1), . . . ,q(R)) the total set of linguistic contextsobservations, and q(r) = [q(r)(1), . . . ,q(r)(Nr)] is the lin-guistic context sequence associated with speaker r, whereq(r)(n) = [q
(r)1 (n), . . . , q
(r)L (n)]� is the (Lx1) linguistic context
vector which describes the linguistic characteristics associated withsyllable n.
An average context-dependent discrete HMM λ(style)symbolic is estimated
from the pooled speakers observations. Firstly, an average context-dependent tree T(style)
symbolic is derived so as to minimize the infor-
mation entropy of the prosodic symbolic labels qproso condition-ally to the linguistic contexts q . Then, a context-dependent HMMmodel λ
(style)symbolic is estimated for each terminal node of the context-
dependent tree T(style)symbolic.
3.1.2. Continuous HMM
For each speaking style, an average acoustic model λ(style)acoustic that
includes source/filter variations, f0 variations, and state-durations,is estimated from the pooled speakers associated with the speakingstyle based on the conventional HTS system ([14]).
Let R be the number of speakers from which an average model is tobe estimated. Let o = (o(1), . . . ,o(R)) the total set of observations,and o(r) = [o(r)(1), . . . ,o(r)(Tr)] is the observation sequencesassociated with speaker r, where o(r)(t) = [o
(r)t (1), . . . , o
(r)t (D)]�
is the (Dx1) observation vector which describes the acoustical prop-erty at time t. Let q = (q(1), . . . ,q(R)) the total set of linguisticcontexts observations, and q(r) = [q(r)(1), . . . ,q(r)(Tr)] isthe linguistic context sequence associated with speaker r, whereq(r)(t) = [q
(r)1 (t), . . . , q
(r)L (t)]� is the (L’x1) augmented linguistic
context vector which describes the linguistic properties at time t.
An average context-dependent HMM acoustic model λ(style)symbolic
is estimated from the pooled speakers observations. Firstly, acontext-dependent HMM model is estimated for each of thelinguistic contexts. Then, an average context-dependent treeT(style)
acoustic is derived so as to minimize the description length of thecontext-dependent HMM model λ
(style)acoustic.
The acoustic module models simultaneously source/filter variations,f0 variations, and the temporal symbolic associated with a speak-ing style. Speakers f0 were normalized with respect to the speakingstyle prior to modelling. Source, filter, and normalized f0 observa-tion vectors and their dynamic vectors are used to estimate context-dependent HMM models λ
(style)acoustic. Context-dependent HMMs are
clustered into acoustically similar models using decision-tree-basedcontext-clustering (ML-MDL [15]). Multi-Space probability Distri-butions (MSD) [16] are used to model continuous/discrete parame-ter f0 sequence to manage voiced/unvoiced regions properly. Eachcontext-dependent HMM is modelled with a state duration probabil-ity density functions (PDFs) to account for the temporal structure ofspeech [17]. Finally, speech dynamic is modelled according to thetrajectory model and the global variance (GV) that model local andglobal speech variations over time [?].
3.2. Generation of the Speech Parameters
During the synthesis, the text is first converted into a concatenatedsequence of context-dependent HMM models λ
(style)symbolic associated
with the linguistic context sequence q = [q1, . . . ,qN ], whereqn = [q1, . . . , qL]� denotes the (Lx1) linguistic context vectorassociated with linguistic unit n.
Firstly, the prosodic symbolic qproso is inferred so as to maximizethe log-likelihood of the prosodic symbolic sequence qproso condi-tionally to the linguistic context sequence q and the model λ(style)
symbolic.
qproso = argmaxqproso
P(qproso|q, λ(style)symbolic) (2)
• journal: radio review (press review; political, economical, tech-nological chronicles); almost single speaker monologue with afew interactions with a lead journalist.
• sport commentary: soccer; two speakers engaged in mono-logues with speech overlapping during intense soccer sequencesand speech turn changes; almost no interactions.
The speech database consists of natural speech multi-media audiocontents with strongly variable audio quality (background noise:crowd, audience, recording noise, and reverberation). The speechprosody characteristics of the speech databased are illustrated in fig-ure 1.
3. SPEAKING STYLE MODEL
A speaking style model λ(style) is composed of discrete/continuouscontext-dependent HMMs that model the symbolic/acoustic speechcharacteristics of a speaking style.
λ(style) =“λ
(style)symbolic, λ
(style)acoustic
”(1)
During the training, the discrete/continuous context-dependentHMMs are estimated separately. During the synthesis, the sym-bolic/acoustic parameters are generated in cascade, from the sym-bolic representation to the acoustic variations. Additionally, a richlinguistic description of the text characteristics is automatically ex-tracted using a linguistic processing chain [9] and used to refine thecontext-dependent HMM modelling (see [10] and [11] for a detaileddescription of the enriched linguistic contexts).
3.1. Training of the Discrete/Continuous Models
3.1.1. Discrete HMM
For each speaking style, an average context-dependent discreteHMM λ
(style)symbolic is estimated from the pooled speakers associated
with the speaking style.
The prosodic grammar consists of a hierarchical prosodic repre-sentation that was experimented as an alternative to TOBI [12]for French prosody labelling [13]. The prosodic grammar iscomposed of major prosodic boundaries (FM, a boundary whichis right bounded by a pause), minor prosodic boundaries (Fm, anintermediate boundary), and prosodic prominences (P).
Let R be the number of speakers from which an average modelλ
(style)symbolic is to be estimated. Let qproso = (q
(1)proso, . . . ,q
(R)proso)
the total set of prosodic symbolic observations, andq
(r)proso = [q
(r)proso(1), . . . , q
(r)proso(Nr)] is the prosodic sym-
bolic sequence associated with speaker r, where q(r)proso(n)
is the prosodic label associated with syllable n. Letq = (q(1), . . . ,q(R)) the total set of linguistic contextsobservations, and q(r) = [q(r)(1), . . . ,q(r)(Nr)] is the lin-guistic context sequence associated with speaker r, whereq(r)(n) = [q
(r)1 (n), . . . , q
(r)L (n)]� is the (Lx1) linguistic context
vector which describes the linguistic characteristics associated withsyllable n.
An average context-dependent discrete HMM λ(style)symbolic is estimated
from the pooled speakers observations. Firstly, an average context-dependent tree T(style)
symbolic is derived so as to minimize the infor-
mation entropy of the prosodic symbolic labels qproso condition-ally to the linguistic contexts q . Then, a context-dependent HMMmodel λ
(style)symbolic is estimated for each terminal node of the context-
dependent tree T(style)symbolic.
3.1.2. Continuous HMM
For each speaking style, an average acoustic model λ(style)acoustic that
includes source/filter variations, f0 variations, and state-durations,is estimated from the pooled speakers associated with the speakingstyle based on the conventional HTS system ([14]).
Let R be the number of speakers from which an average model is tobe estimated. Let o = (o(1), . . . ,o(R)) the total set of observations,and o(r) = [o(r)(1), . . . ,o(r)(Tr)] is the observation sequencesassociated with speaker r, where o(r)(t) = [o
(r)t (1), . . . , o
(r)t (D)]�
is the (Dx1) observation vector which describes the acoustical prop-erty at time t. Let q = (q(1), . . . ,q(R)) the total set of linguisticcontexts observations, and q(r) = [q(r)(1), . . . ,q(r)(Tr)] isthe linguistic context sequence associated with speaker r, whereq(r)(t) = [q
(r)1 (t), . . . , q
(r)L (t)]� is the (L’x1) augmented linguistic
context vector which describes the linguistic properties at time t.
An average context-dependent HMM acoustic model λ(style)symbolic
is estimated from the pooled speakers observations. Firstly, acontext-dependent HMM model is estimated for each of thelinguistic contexts. Then, an average context-dependent treeT(style)
acoustic is derived so as to minimize the description length of thecontext-dependent HMM model λ
(style)acoustic.
The acoustic module models simultaneously source/filter variations,f0 variations, and the temporal symbolic associated with a speak-ing style. Speakers f0 were normalized with respect to the speakingstyle prior to modelling. Source, filter, and normalized f0 observa-tion vectors and their dynamic vectors are used to estimate context-dependent HMM models λ
(style)acoustic. Context-dependent HMMs are
clustered into acoustically similar models using decision-tree-basedcontext-clustering (ML-MDL [15]). Multi-Space probability Distri-butions (MSD) [16] are used to model continuous/discrete parame-ter f0 sequence to manage voiced/unvoiced regions properly. Eachcontext-dependent HMM is modelled with a state duration probabil-ity density functions (PDFs) to account for the temporal structure ofspeech [17]. Finally, speech dynamic is modelled according to thetrajectory model and the global variance (GV) that model local andglobal speech variations over time [?].
3.2. Generation of the Speech Parameters
During the synthesis, the text is first converted into a concatenatedsequence of context-dependent HMM models λ
(style)symbolic associated
with the linguistic context sequence q = [q1, . . . ,qN ], whereqn = [q1, . . . , qL]� denotes the (Lx1) linguistic context vectorassociated with linguistic unit n.
Firstly, the prosodic symbolic qproso is inferred so as to maximizethe log-likelihood of the prosodic symbolic sequence qproso condi-tionally to the linguistic context sequence q and the model λ(style)
symbolic.
qproso = argmaxqproso
P(qproso|q, λ(style)symbolic) (2)
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
[Figure: speech waveform (amplitude vs. time [s]) and f0 contour (40–120 Hz), aligned with the phoneme sequence of « Longtemps, je me suis couché de bonne heure ».]
CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS: STATE-OF-THE-ART
• journal: radio review (press review; political, economic, technological chronicles); almost single-speaker monologue with a few interactions with a lead journalist.
• sport commentary: soccer; two speakers engaged in monologues, with speech overlapping during intense soccer sequences and speech turn changes; almost no interactions.
The speech database consists of natural multimedia speech content with strongly variable audio quality (background noise: crowd, audience, recording noise, and reverberation). The speech prosody characteristics of the speech database are illustrated in figure 1.
3. SPEAKING STYLE MODEL
A speaking style model λ^{(style)} is composed of discrete/continuous context-dependent HMMs that model the symbolic/acoustic speech characteristics of a speaking style.

λ^{(style)} = ( λ^{(style)}_{symbolic} , λ^{(style)}_{acoustic} )     (1)
During the training, the discrete/continuous context-dependent HMMs are estimated separately. During the synthesis, the symbolic/acoustic parameters are generated in cascade, from the symbolic representation to the acoustic variations. Additionally, a rich linguistic description of the text characteristics is automatically extracted using a linguistic processing chain [9] and used to refine the context-dependent HMM modelling (see [10] and [11] for a detailed description of the enriched linguistic contexts).
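The two-stage cascade can be sketched as follows. This is a minimal illustration with invented placeholder models, not the actual system: the symbolic stage maps linguistic contexts to prosodic labels, and the acoustic stage is then conditioned on both the contexts and the generated labels.

```python
# Minimal sketch of the discrete/continuous cascade (placeholder models).

def symbolic_model(contexts):
    """Placeholder for the discrete HMM: label a syllable FM if it is
    sentence-final, Fm if phrase-final, '-' otherwise."""
    labels = []
    for ctx in contexts:
        if ctx['sentence_final']:
            labels.append('FM')
        elif ctx['phrase_final']:
            labels.append('Fm')
        else:
            labels.append('-')
    return labels

def acoustic_model(contexts, labels):
    """Placeholder for the continuous HMM: a schematic f0 target per
    syllable (Hz), lowered at major prosodic boundaries."""
    return [80.0 if lab == 'FM' else 110.0 for lab in labels]

contexts = [{'sentence_final': False, 'phrase_final': False},
            {'sentence_final': False, 'phrase_final': True},
            {'sentence_final': True,  'phrase_final': False}]
labels = symbolic_model(contexts)      # ['-', 'Fm', 'FM']
f0 = acoustic_model(contexts, labels)  # [110.0, 110.0, 80.0]
print(labels, f0)
```

The point of the cascade is that the acoustic stage never sees raw text: it only sees the linguistic contexts plus the symbolic prosodic decisions already made upstream.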
3.1. Training of the Discrete/Continuous Models
3.1.1. Discrete HMM
For each speaking style, an average context-dependent discrete HMM λ^{(style)}_{symbolic} is estimated from the pooled speakers associated with the speaking style.
The prosodic grammar consists of a hierarchical prosodic representation that was experimented with as an alternative to ToBI [12] for French prosody labelling [13]. The prosodic grammar is composed of major prosodic boundaries (FM, a boundary which is right-bounded by a pause), minor prosodic boundaries (Fm, an intermediate boundary), and prosodic prominences (P).
Let R be the number of speakers from which an average model λ^{(style)}_{symbolic} is to be estimated. Let q_{proso} = (q^{(1)}_{proso}, ..., q^{(R)}_{proso}) be the total set of prosodic symbolic observations, where q^{(r)}_{proso} = [q^{(r)}_{proso}(1), ..., q^{(r)}_{proso}(N_r)] is the prosodic symbolic sequence associated with speaker r, and q^{(r)}_{proso}(n) is the prosodic label associated with syllable n. Let q = (q^{(1)}, ..., q^{(R)}) be the total set of linguistic context observations, where q^{(r)} = [q^{(r)}(1), ..., q^{(r)}(N_r)] is the linguistic context sequence associated with speaker r, and q^{(r)}(n) = [q^{(r)}_1(n), ..., q^{(r)}_L(n)]^⊤ is the (L×1) linguistic context vector which describes the linguistic characteristics associated with syllable n.
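Concretely, the (L×1) context vector is just an ordered tuple of linguistic features per syllable. The feature names below are invented for illustration; the actual enriched contexts are described in the cited references [10, 11].

```python
# Illustrative only: assembling a linguistic context vector q^(r)(n) for one
# syllable from a fixed inventory of L features (hypothetical feature names).

FEATURES = ['pos_in_word', 'pos_in_phrase', 'word_pos_tag', 'syntactic_depth']

def context_vector(syllable_info):
    """Return the context vector as an ordered list of L feature values."""
    return [syllable_info[f] for f in FEATURES]

q_n = context_vector({'pos_in_word': 1, 'pos_in_phrase': 3,
                      'word_pos_tag': 'NOUN', 'syntactic_depth': 2})
print(q_n)  # [1, 3, 'NOUN', 2]
```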
An average context-dependent discrete HMM λ^{(style)}_{symbolic} is estimated from the pooled speakers' observations. Firstly, an average context-dependent tree T^{(style)}_{symbolic} is derived so as to minimize the information entropy of the prosodic symbolic labels q_{proso} conditionally on the linguistic contexts q. Then, a context-dependent HMM model λ^{(style)}_{symbolic} is estimated for each terminal node of the context-dependent tree T^{(style)}_{symbolic}.
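The entropy criterion used to grow the symbolic tree can be sketched as follows: among candidate binary context questions, pick the one that minimizes the weighted entropy of the prosodic labels in the two resulting partitions. The data and questions below are hypothetical, for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of discrete labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(data, question):
    """Weighted label entropy after splitting (context, label) pairs
    on a yes/no question over the context."""
    yes = [lab for ctx, lab in data if question(ctx)]
    no = [lab for ctx, lab in data if not question(ctx)]
    if not yes or not no:  # degenerate split: no information gained
        return entropy([lab for _, lab in data])
    n = len(data)
    return (len(yes) * entropy(yes) + len(no) * entropy(no)) / n

def best_question(data, questions):
    """Pick the question name whose split minimizes the entropy."""
    return min(questions, key=lambda name: split_entropy(data, questions[name]))

# Phrase-final syllables tend to carry boundary labels (Fm / FM):
data = [({'phrase_final': True}, 'Fm'), ({'phrase_final': True}, 'FM'),
        ({'phrase_final': False}, '-'), ({'phrase_final': False}, '-'),
        ({'phrase_final': False}, 'P')]
questions = {'is_phrase_final': lambda ctx: ctx['phrase_final'],
             'always_true': lambda ctx: True}
print(best_question(data, questions))  # 'is_phrase_final'
```

In the actual system the same selection is applied recursively, yielding the tree whose terminal nodes each receive a context-dependent discrete model.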
3.1.2. Continuous HMM
For each speaking style, an average acoustic model λ^{(style)}_{acoustic} that includes source/filter variations, f0 variations, and state durations is estimated from the pooled speakers associated with the speaking style, based on the conventional HTS system [14].
Let R be the number of speakers from which an average model is to be estimated. Let o = (o^{(1)}, ..., o^{(R)}) be the total set of observations, where o^{(r)} = [o^{(r)}(1), ..., o^{(r)}(T_r)] is the observation sequence associated with speaker r, and o^{(r)}(t) = [o^{(r)}_t(1), ..., o^{(r)}_t(D)]^⊤ is the (D×1) observation vector which describes the acoustic properties at time t. Let q = (q^{(1)}, ..., q^{(R)}) be the total set of linguistic context observations, where q^{(r)} = [q^{(r)}(1), ..., q^{(r)}(T_r)] is the linguistic context sequence associated with speaker r, and q^{(r)}(t) = [q^{(r)}_1(t), ..., q^{(r)}_{L'}(t)]^⊤ is the (L'×1) augmented linguistic context vector which describes the linguistic properties at time t.
An average context-dependent HMM acoustic model λ^{(style)}_{acoustic} is estimated from the pooled speakers' observations. Firstly, a context-dependent HMM model is estimated for each of the linguistic contexts. Then, an average context-dependent tree T^{(style)}_{acoustic} is derived so as to minimize the description length of the context-dependent HMM model λ^{(style)}_{acoustic}.
The acoustic module simultaneously models the source/filter variations, the f0 variations, and the temporal structure associated with a speaking style. Speakers' f0 were normalized with respect to the speaking style prior to modelling. Source, filter, and normalized f0 observation vectors and their dynamic vectors are used to estimate the context-dependent HMM models λ^{(style)}_{acoustic}. Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context clustering (ML-MDL [15]). Multi-Space probability Distributions (MSD) [16] are used to model the continuous/discrete f0 parameter sequence, so as to manage voiced/unvoiced regions properly. Each context-dependent HMM is modelled with state duration probability density functions (PDFs) to account for the temporal structure of speech [17]. Finally, speech dynamics are modelled according to the trajectory model and the global variance (GV), which model local and global speech variations over time [?].
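The MSD idea for f0 can be illustrated with a toy single-Gaussian version (an assumption for illustration; the actual MSD-HMM uses mixtures per state): each frame is either a discrete "unvoiced" symbol or a continuous f0 value, and a state mixes a discrete voicing probability with a continuous density over voiced f0.

```python
import math

class MSDState:
    """Toy Multi-Space Distribution state: a discrete unvoiced mass plus a
    Gaussian over voiced f0 values (Hz)."""

    def __init__(self, p_voiced, mean, var):
        self.p_voiced, self.mean, self.var = p_voiced, mean, var

    def likelihood(self, obs):
        """obs is None for an unvoiced frame, or an f0 value in Hz."""
        if obs is None:
            return 1.0 - self.p_voiced  # discrete space: unvoiced mass
        gauss = (math.exp(-(obs - self.mean) ** 2 / (2 * self.var))
                 / math.sqrt(2 * math.pi * self.var))
        return self.p_voiced * gauss    # continuous space: voiced density

state = MSDState(p_voiced=0.8, mean=110.0, var=100.0)
print(state.likelihood(None))                              # ~0.2 (unvoiced)
print(state.likelihood(110.0) > state.likelihood(150.0))   # True
```

This is why no artificial f0 value has to be imputed in unvoiced regions: the unvoiced symbol carries its own probability mass.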
3.2. Generation of the Speech Parameters
During the synthesis, the text is first converted into a concatenated sequence of context-dependent HMM models λ^{(style)}_{symbolic} associated with the linguistic context sequence q = [q_1, ..., q_N], where q_n = [q_1, ..., q_L]^⊤ denotes the (L×1) linguistic context vector associated with linguistic unit n.
Firstly, the prosodic symbolic sequence q_{proso} is inferred so as to maximize the log-likelihood of the prosodic symbolic sequence q_{proso} conditionally on the linguistic context sequence q and the model λ^{(style)}_{symbolic}:

q_{proso} = argmax_{q_{proso}} P( q_{proso} | q, λ^{(style)}_{symbolic} )     (2)
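The argmax in Eq. (2) can be sketched with a bare conditional table P(label | context class) standing in for the context-clustered discrete HMM (the table values are hypothetical):

```python
import math

MODEL = {  # hypothetical P(label | context class)
    'phrase_final':  {'FM': 0.5, 'Fm': 0.4, '-': 0.1},
    'phrase_medial': {'FM': 0.05, 'Fm': 0.15, '-': 0.8},
}

def infer_labels(context_classes):
    """Greedy per-unit argmax of P(label | context); returns the label
    sequence and its total log-likelihood under the model."""
    total_logp = 0.0
    labels = []
    for ctx in context_classes:
        dist = MODEL[ctx]
        lab = max(dist, key=dist.get)  # argmax over prosodic labels
        labels.append(lab)
        total_logp += math.log(dist[lab])
    return labels, total_logp

labels, logp = infer_labels(['phrase_medial', 'phrase_medial', 'phrase_final'])
print(labels)  # ['-', '-', 'FM']
```

With per-unit conditional independence the greedy argmax equals the sequence-level argmax; a model with label-to-label transitions would require Viterbi decoding instead.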
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-basedtime [s]
am
plitude
!0.4
!0.2
0
0.2
0.4
40
60
80
100
120
f0
[Hz]
l
o
ta Z
@m@
sH ik
uS
ed2b O n 9 R
time [s]
am
plitu
de
!0.4
!0.2
0
0.2
0.4
40
60
80
100
120
f0
[Hz]
l
o
ta Z
@m@
sH ik
uS
ed2b O n 9 R
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
52CHAPTER 4. SPEECH PROSODY MODELLING & SYNTHESIS:
STATE-OF-THE-ART
A phonetizer is used to convert the input text into a sequence of phonemes. A syllabifieris used to merge the sequence phonemes into a sequence of syllables. At the prosodic level,pauses are identified to
4.6 Abstract Model: Text To Prosodic Structure
The prosodic structure model is to infer a sequence of prosodic labels that are associatedwith relevant prosodic events.
sentence Longtemps , je me suis couche de bonne heure .
⇓prosodicstructureFM * *Fm * * *P * * * *
syllable Long- temps ## je me suis cou- che de bonne heure ##
Table 4.2: Illustration of the text-to-prosodic-structure conversion.
Two main approaches can be distinguished in prosodic structure modelling: on the onehand, expert approaches attempt at elaborating formal models that account for the observedprosodic variations with respect to linguistic, para-linguistic, and extra-linguistic con-straints. On the other hand, statistical methods attempt at elaborating a statistical modelwhich accounts for the prosodic variations from the observation of statistical regularities onlarge speech corpora.
4.6.1 Expert Models
Expert approaches mostly concern hierarchical prosodic structure mod-elling, and in particular prosodic frontiers ([Cooper and Paccia-Cooper, 1980,Gee and Grosjean, 1983, Ferreira, 1988, Abney, 1992, Watson and Gibson, 2004]for English; [Dell, 1984, Bailly, 1989, Monnin and Grosjean, 1993, Ladd, 1996,Delais-Roussarie, 2000, Mertens, 2004b] for French, [Barbosa, 2006] for some otherlanguages). Expert models assume that a prosodic structure results from the integrationof various and potentially conflictual constraints, in particular syntactic and rhythmic[?, Dell, 1984] constraints.The linguistic module mostly concerns the extraction of prominent syn-tactic boundaries from deep syntactic parsing, based on syntactic con-stituency (Constituent-Depth [Cooper and Paccia-Cooper, 1980], φ-phrases[Gee and Grosjean, 1983, Delais-Roussarie, 2000], Left-hand-side / Right-hand-sideBoundary [Watson and Gibson, 2004]), syntactic dependency (Dependency-Grammar-based
• journal: radio review (press review; political, economical, tech-nological chronicles); almost single speaker monologue with afew interactions with a lead journalist.
• sport commentary: soccer; two speakers engaged in mono-logues with speech overlapping during intense soccer sequencesand speech turn changes; almost no interactions.
The speech database consists of natural speech multi-media audiocontents with strongly variable audio quality (background noise:crowd, audience, recording noise, and reverberation). The speechprosody characteristics of the speech databased are illustrated in fig-ure 1.
3. SPEAKING STYLE MODEL
A speaking style model λ(style) is composed of discrete/continuous context-dependent HMMs that model the symbolic/acoustic speech characteristics of a speaking style.

λ(style) = ( λ(style)_symbolic , λ(style)_acoustic )   (1)
During training, the discrete/continuous context-dependent HMMs are estimated separately. During synthesis, the symbolic/acoustic parameters are generated in cascade, from the symbolic representation to the acoustic variations. Additionally, a rich linguistic description of the text characteristics is automatically extracted using a linguistic processing chain [9] and used to refine the context-dependent HMM modelling (see [10] and [11] for a detailed description of the enriched linguistic contexts).
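The two-stage cascade described above can be sketched as a minimal pipeline: a symbolic model first predicts prosodic labels from linguistic contexts, and an acoustic model then consumes contexts augmented with those labels. All class names, rule tables, and values below are illustrative assumptions, not the paper's implementation:

```python
class SymbolicModel:
    """Maps a linguistic context to a prosodic label (FM / Fm / P / none)."""
    def __init__(self, rules):
        self.rules = rules  # context tuple -> prosodic label

    def predict(self, contexts):
        return [self.rules.get(c, "none") for c in contexts]

class AcousticModel:
    """Maps an augmented context (context + prosodic label) to acoustic parameters."""
    def __init__(self, table):
        self.table = table  # (context, label) -> (f0_mean_hz, duration_ms)

    def generate(self, contexts, labels):
        return [self.table[(c, l)] for c, l in zip(contexts, labels)]

def synthesize(text_contexts, symbolic, acoustic):
    labels = symbolic.predict(text_contexts)         # stage 1: symbolic generation
    return acoustic.generate(text_contexts, labels)  # stage 2: acoustic generation

sym = SymbolicModel({("noun", "final"): "FM"})
aco = AcousticModel({(("noun", "final"), "FM"): (110.0, 250),
                     (("det", "initial"), "none"): (120.0, 80)})
params = synthesize([("det", "initial"), ("noun", "final")], sym, aco)
```

The point of the sketch is only the data flow: the acoustic stage never sees raw text, only linguistic contexts enriched with the symbolic stage's output.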
3.1. Training of the Discrete/Continuous Models
3.1.1. Discrete HMM
For each speaking style, an average context-dependent discrete HMM λ(style)_symbolic is estimated from the pooled speakers associated with the speaking style.
The prosodic grammar consists of a hierarchical prosodic representation that was investigated as an alternative to ToBI [12] for French prosody labelling [13]. The prosodic grammar is composed of major prosodic boundaries (FM, a boundary which is right-bounded by a pause), minor prosodic boundaries (Fm, an intermediate boundary), and prosodic prominences (P).
Let R be the number of speakers from which an average model λ(style)_symbolic is to be estimated. Let q_proso = (q_proso^(1), . . . , q_proso^(R)) be the total set of prosodic symbolic observations, where q_proso^(r) = [q_proso^(r)(1), . . . , q_proso^(r)(N_r)] is the prosodic symbolic sequence associated with speaker r, and q_proso^(r)(n) is the prosodic label associated with syllable n. Let q = (q^(1), . . . , q^(R)) be the total set of linguistic context observations, where q^(r) = [q^(r)(1), . . . , q^(r)(N_r)] is the linguistic context sequence associated with speaker r, and q^(r)(n) = [q_1^(r)(n), . . . , q_L^(r)(n)]^T is the (L x 1) linguistic context vector which describes the linguistic characteristics associated with syllable n.
An average context-dependent discrete HMM λ(style)_symbolic is estimated from the pooled speaker observations. Firstly, an average context-dependent tree T(style)_symbolic is derived so as to minimize the information entropy of the prosodic symbolic labels q_proso conditionally to the linguistic contexts q. Then, a context-dependent HMM model λ(style)_symbolic is estimated for each terminal node of the context-dependent tree T(style)_symbolic.
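The entropy criterion used to grow such a tree can be sketched as follows: each candidate yes/no context question splits the data, and the question minimizing the weighted conditional entropy of the prosodic labels is selected. The toy contexts, labels, and question set are assumptions for illustration, not the paper's actual feature set:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of prosodic labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(pairs, question):
    """Weighted entropy of the labels after splitting on a yes/no context question."""
    yes = [lab for ctx, lab in pairs if question(ctx)]
    no = [lab for ctx, lab in pairs if not question(ctx)]
    n = len(pairs)
    return sum(len(side) / n * entropy(side) for side in (yes, no) if side)

def best_question(pairs, questions):
    """Select the question that leaves the labels most predictable."""
    return min(questions, key=lambda q: conditional_entropy(pairs, q))

# toy data: (linguistic context, prosodic label)
data = [({"pos": "noun", "final": True}, "FM"),
        ({"pos": "noun", "final": False}, "P"),
        ({"pos": "det", "final": False}, "none"),
        ({"pos": "verb", "final": True}, "FM")]
questions = {"is_final": lambda c: c["final"],
             "is_noun": lambda c: c["pos"] == "noun"}
best = best_question(data, list(questions.values()))
```

On this toy data the phrase-finality question is selected, since it makes the labels far more predictable than the part-of-speech question; the same criterion, applied recursively, yields the tree T(style)_symbolic.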
3.1.2. Continuous HMM
For each speaking style, an average acoustic model λ(style)_acoustic that includes source/filter variations, f0 variations, and state durations is estimated from the pooled speakers associated with the speaking style, based on the conventional HTS system [14].
Let R be the number of speakers from which an average model is to be estimated. Let o = (o^(1), . . . , o^(R)) be the total set of observations, where o^(r) = [o^(r)(1), . . . , o^(r)(T_r)] is the observation sequence associated with speaker r, and o^(r)(t) = [o_t^(r)(1), . . . , o_t^(r)(D)]^T is the (D x 1) observation vector which describes the acoustic properties at time t. Let q = (q^(1), . . . , q^(R)) be the total set of linguistic context observations, where q^(r) = [q^(r)(1), . . . , q^(r)(T_r)] is the linguistic context sequence associated with speaker r, and q^(r)(t) = [q_1^(r)(t), . . . , q_L'^(r)(t)]^T is the (L' x 1) augmented linguistic context vector which describes the linguistic properties at time t.
An average context-dependent HMM acoustic model λ(style)_acoustic is estimated from the pooled speaker observations. Firstly, a context-dependent HMM model is estimated for each of the linguistic contexts. Then, an average context-dependent tree T(style)_acoustic is derived so as to minimize the description length of the context-dependent HMM model λ(style)_acoustic.
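The description-length criterion for this kind of tree derivation can be sketched as a per-split test: a split is kept only if its log-likelihood gain exceeds the MDL penalty for the extra parameters it introduces. The parameter count and the numbers below are illustrative assumptions, not values from the paper:

```python
import math

def mdl_gain(loglik_parent, loglik_children, extra_params, n_frames):
    """MDL change for one split: likelihood gain minus description-length penalty.
    A positive value means the split shortens the description length and is kept."""
    penalty = 0.5 * extra_params * math.log(n_frames)
    return (loglik_children - loglik_parent) - penalty

# toy numbers: splitting one Gaussian state into two adds one mean/variance pair
keep = mdl_gain(loglik_parent=-1200.0, loglik_children=-1150.0,
                extra_params=2, n_frames=1000) > 0
```

Unlike a pure maximum-likelihood criterion, which always favours more splits, the penalty term makes the tree size self-limiting.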
The acoustic module simultaneously models the source/filter variations, the f0 variations, and the temporal structure associated with a speaking style. Speakers' f0 were normalized with respect to the speaking style prior to modelling. Source, filter, and normalized f0 observation vectors and their dynamic vectors are used to estimate context-dependent HMM models λ(style)_acoustic. Context-dependent HMMs are clustered into acoustically similar models using decision-tree-based context clustering (ML-MDL [15]). Multi-Space probability Distributions (MSD) [16] are used to model the continuous/discrete f0 parameter sequence so as to manage voiced/unvoiced regions properly. Each context-dependent HMM is modelled with a state-duration probability density function (PDF) to account for the temporal structure of speech [17]. Finally, speech dynamics are modelled according to the trajectory model and the global variance (GV), which model local and global speech variations over time [?].
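The f0 observation construction can be sketched roughly as follows: a voiced frame carries a static value plus a dynamic (delta) feature, while an unvoiced frame carries no continuous value at all, which is the situation MSD modelling handles with a zero-dimensional space. The central-difference delta below is a simplification of the fixed regression windows used in HTS:

```python
UNVOICED = None  # an unvoiced frame has no continuous f0 value

def delta(seq, i):
    """Central-difference dynamic feature, clamped at the track edges."""
    prev = seq[max(i - 1, 0)]
    nxt = seq[min(i + 1, len(seq) - 1)]
    if prev is UNVOICED or nxt is UNVOICED:
        return 0.0
    return 0.5 * (nxt - prev)

def msd_observations(f0_track):
    """Tag each frame with its space: ('voiced', (f0, delta-f0)) or ('unvoiced', ())."""
    obs = []
    for i, f0 in enumerate(f0_track):
        if f0 is UNVOICED:
            obs.append(("unvoiced", ()))   # zero-dimensional space
        else:
            obs.append(("voiced", (f0, delta(f0_track, i))))
    return obs

track = [UNVOICED, 110.0, 112.0, 114.0, UNVOICED]
obs = msd_observations(track)
```

The key point is that unvoiced frames are not forced to carry a dummy f0 value; the MSD formulation assigns them to a separate space with its own prior weight.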
3.2. Generation of the Speech Parameters
During synthesis, the text is first converted into a concatenated sequence of context-dependent HMM models λ(style)_symbolic associated with the linguistic context sequence q = [q_1, . . . , q_N], where q_n = [q_1, . . . , q_L]^T denotes the (L x 1) linguistic context vector associated with linguistic unit n.
Firstly, the prosodic symbolic sequence q_proso is inferred so as to maximize the log-likelihood of the prosodic symbolic sequence q_proso conditionally to the linguistic context sequence q and the model λ(style)_symbolic:

q_proso = argmax_{q_proso} P(q_proso | q, λ(style)_symbolic)   (2)
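Under a simplifying per-syllable independence assumption, Eq. (2) reduces to a per-unit argmax over the label distribution of the context's tree leaf. The probability table below is a hypothetical stand-in for the trained tree-clustered model:

```python
def generate_labels(contexts, p_label_given_context):
    """Per-unit maximization of P(label | context), a simplified form of Eq. (2)."""
    labels = []
    for ctx in contexts:
        dist = p_label_given_context[ctx]   # {prosodic label: probability}
        labels.append(max(dist, key=dist.get))
    return labels

# hypothetical leaf distributions for two linguistic contexts
model = {"phrase_final": {"FM": 0.7, "Fm": 0.2, "P": 0.1},
         "phrase_medial": {"FM": 0.05, "Fm": 0.25, "none": 0.7}}
labels = generate_labels(["phrase_medial", "phrase_final"], model)
```

A full sequence-level argmax would additionally score label transitions; the per-unit version only illustrates how the symbolic stage turns contexts into a prosodic label sequence.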
44 CHAPTER 4. PROSODY EXTRACTION
4.5 Prosodic Parameters Estimation
4.5.1 Fundamental Frequency (f0)
The fundamental frequency f0 and the periodicity are estimated using the STRAIGHT algorithm [Kawahara et al., 1999], a frequency-based fundamental frequency estimation method based on instantaneous frequency estimation and fixed-point analysis.
The analysis was performed using a 50 ms Blackman window and a 5 ms frame rate. The f0 boundaries set for the analysis were manually adapted depending on the characteristics of the speaker. The voiced/unvoiced regions were decided using the aperiodicity measure.
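A rough stand-in for this kind of frame-based f0 analysis is sketched below, using autocorrelation peak picking rather than STRAIGHT's instantaneous-frequency method, with the same 50 ms window and 5 ms hop; the search range and test signal are illustrative:

```python
import math

def autocorr_f0(frame, sr, fmin=70.0, fmax=400.0):
    """Very rough f0 estimate for one frame via the autocorrelation peak.
    This is only a stand-in for STRAIGHT, which works quite differently."""
    lo, hi = int(sr / fmax), int(sr / fmin)
    best_lag, best_r = 0, 0.0
    for lag in range(lo, min(hi, len(frame) - 1) + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return sr / best_lag if best_lag else 0.0

def f0_track(signal, sr, win_s=0.050, hop_s=0.005):
    """Frame the signal (50 ms window, 5 ms hop) and estimate f0 per frame."""
    win, hop = int(sr * win_s), int(sr * hop_s)
    return [autocorr_f0(signal[i:i + win], sr)
            for i in range(0, len(signal) - win + 1, hop)]

sr = 8000
# 0.2 s of a pure 200 Hz tone as a sanity check
sig = [math.sin(2 * math.pi * 200.0 * n / sr) for n in range(2 * sr // 10)]
track = f0_track(sig, sr)
```

A real system would add the voiced/unvoiced decision (here, the aperiodicity measure) before trusting any frame's estimate.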
Estimation of f0 variations (in blue, superimposed on the spectrogram) for the utterance: "Longtemps, je me suis couché de bonne heure." ("For a long time I used to go to bed early").
4.5.2 Syllable duration
Figure 2: Generation of discrete/continuous speech parameters for the sentence: "Longtemps, je me suis couché de bonne heure" ("For a long time I used to go to bed early").
4. Evaluation
The proposed model has been evaluated based on a speaking style identification perceptual experiment, and compared to a speaking style identification experiment with natural speech [15]. For the purpose of such a comparison, it was necessary to provide a single evaluation scheme for both experiments. In particular, it was not possible to control the linguistic content of natural speech utterances, which provides evident cues for DG identification (a single keyword would be sufficient to identify a DG). Thus, such a comparison required removing lexical access so as to focus on the prosodic dimension only.
4.1. Experimental Setup
40 speech utterances (10 per DG) were selected in the speaking style corpus and removed from the training set. Lexical access was removed using a band-pass filter that ensured that the lowest frequency of the fundamental frequency and the highest frequency of its first harmonic were included.
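The delexicalization step can be sketched with a naive DFT band-pass: all frequency bins outside the band covering the f0 and its first harmonic are zeroed and the signal is inverted. The cut-off frequencies and test signal are illustrative, and a real implementation would use a proper filter design rather than bin zeroing:

```python
import cmath
import math

def bandpass_dft(signal, sr, f_lo, f_hi):
    """Naive O(N^2) DFT band-pass: zero every bin (and its mirror) whose
    frequency falls outside [f_lo, f_hi], then invert the transform."""
    n = len(signal)
    spectrum = [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)) for k in range(n)]
    for k in range(n):
        freq = min(k, n - k) * sr / n  # frequency of bin k (mirror-aware)
        if not (f_lo <= freq <= f_hi):
            spectrum[k] = 0.0
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]

sr, n = 2000, 200
# a low "f0-like" tone at 100 Hz plus a high "formant-like" tone at 700 Hz
sig = [math.sin(2 * math.pi * 100 * t / sr) + math.sin(2 * math.pi * 700 * t / sr)
       for t in range(n)]
# keep roughly 80-450 Hz: an assumed f0 of 100 Hz and its first harmonic at 200 Hz
out = bandpass_dft(sig, sr, 80.0, 450.0)
```

After filtering, the 700 Hz component (a proxy for the spectral detail that carries lexical information) is removed while the low-frequency melodic content survives.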
4.2. Subjective Evaluation
The evaluation consists of a multiple-choice identification task based on speech prosody perception. The evaluation was conducted according to a crowd-sourcing technique using social networks. 50 subjects (including 25 native French speakers, 15 non-native French speakers, and 10 non-French speakers; 34 expert and 16 naïve listeners) participated in this experiment. Participants were given a brief description of the different speaking styles. Then, they were asked to associate a speaking style with each of the speech utterances. For this purpose, participants were given three options:
• total confidence: select only one speaking style when certain of the choice;
• confusion: select two different speaking styles when two speaking styles are possible;
• total indecision: select "indecision" when completely unsure. Subjects were asked to use this possibility only as a very last resort.
Additional information was gleaned from the participants: speech expertise (expert, naïve), language (native French speaker, non-native French speaker, non-French speaker), age, and listening condition (headphones or not). Expert participants actually came from various domains (speech and audio technologies, linguistics, music). Participants were encouraged to use headphones.
5. Results & Discussion
Identification performance was estimated using a measure based on Cohen's Kappa statistic [16]. Cohen's Kappa statistic measures the proportion of agreement between two raters with correction for random agreement. Our measure monitors the agreement between the ratings of the participants and the ground truth. The measure varies from -1 to 1: -1 is perfect disagreement; 0 is chance; 1 is perfect agreement. Confusion ratings were considered as equally possible ratings. Total indecision ratings were relatively rare (3% of the total ratings) and removed. Figure 3 presents the identification confusion matrix.
The overall score reveals fair identification performance (κ = 0.38 ± 0.04), which is comparable to that observed for identification from natural speech (κ_natural = 0.45 ± 0.03). The identification performance significantly depends on the speaking style (figure 4): sport commentary is substantially identified (κ = 0.68 ± 0.05), journal fairly identified (κ = 0.50 ± 0.06), political discourse moderately identified (κ = 0.28 ± 0.07), and mass only slightly identified (κ = 0.12 ± 0.06). In comparison with identification from natural speech, the identification is comparable in the case of the sport commentary and the journal speaking styles (κ_natural = 0.70 ± 0.03 and κ_natural = 0.54 ± 0.05, respectively). However, there is a drop in identification for the political and the mass speaking styles, which is especially significant for the mass style (κ_natural = 0.34 ± 0.05 and κ_natural = 0.38 ± 0.04, respectively). This indicates that the model somehow failed to capture the relevant cues of the corresponding speaking style. Nevertheless, a large confusion
(a) natural speech

            MASS  POLITICAL  JOURNAL  SPORT
MASS         390        237       28     43
POLITICAL    166        357      116     38
JOURNAL       83         64      460     73
SPORT          7          2       47    470

(b) synthetic speech

            MASS  POLITICAL  JOURNAL  SPORT
MASS         165        209       18     53
POLITICAL    154        245       87     32
JOURNAL      131         54      348     41
SPORT         53          3       28    365

Figure 3: Identification confusion matrices. Rows represent synthesized speaking style. Columns represent identified speaking style.
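Kappa values of this kind can be recomputed from a confusion matrix; the sketch below implements the standard, unweighted Cohen's Kappa and does not reproduce the paper's specific handling of confusion and indecision ratings:

```python
def cohens_kappa(confusion):
    """Cohen's Kappa from a square confusion matrix (rows: true class,
    columns: rated class). 1 = perfect agreement, 0 = chance level."""
    total = sum(sum(row) for row in confusion)
    # observed agreement: diagonal mass
    p_obs = sum(confusion[i][i] for i in range(len(confusion))) / total
    # chance agreement: product of marginal row/column proportions
    col_sums = [sum(row[j] for row in confusion) for j in range(len(confusion))]
    p_chance = sum(sum(row) * col_sums[i]
                   for i, row in enumerate(confusion)) / total ** 2
    return (p_obs - p_chance) / (1 - p_chance)

# natural-speech confusion matrix from Figure 3(a)
natural = [[390, 237, 28, 43],
           [166, 357, 116, 38],
           [83, 64, 460, 73],
           [7, 2, 47, 470]]
kappa = cohens_kappa(natural)
```

The value obtained this way differs from the reported per-style scores, since those also account for double-choice (confusion) ratings and are computed per speaking style.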
exists between the political and the mass speech, which is inherent to a similarity in the speaking style and the formal situation in which the speech occurs. Additionally, the conventional HMM-based speech synthesis system failed to adequately model the breathiness and the creakiness that are specific to the political speaking style, especially within unvoiced segments.
An ANOVA analysis was conducted to assess whether the identification performance depends on the language of the participants. The analysis reveals a significant effect of the language (F(2, 59) = 15, p < 0.001; F(48, 2) = 5.9, p = 0.005), and confirms results obtained for natural speech. This confirms evidence that there exist variations of a speaking style depending on the language and/or cultural background.
Finally, an informal evaluation of the quality of the synthesized speech suggests that the speaking style modelling is robust to the large variety of audio quality.
Figure 4: Mean identification scores (Cohen's Kappa) and 95% confidence intervals obtained for natural speech (MASS: 0.38, POLITICAL: 0.34, JOURNAL: 0.54, SPORT: 0.70) and synthesized speech (MASS: 0.12, POLITICAL: 0.28, JOURNAL: 0.51, SPORT: 0.68).

6. Conclusion
In this study, the ability and the robustness of a HMM-based speech synthesis system to model the speech characteristics of various speaking styles were assessed. A discrete/continuous HMM was presented to model the symbolic and acoustic speech characteristics of a speaking style, and used to model the average characteristics of a speaking style that is shared among various speakers, depending on specific situations of speech communication. The evaluation consisted of an identification experiment of 4 speaking styles based on delexicalized speech, compared with a similar experiment on natural speech. The evaluation showed that the discrete/continuous HMM consistently models the speech characteristics of a speaking style, and is robust to the differences in audio quality. This provides evidence that the discrete/continuous HMM speech synthesis system successfully models the speech characteristics of a speaking style in the conditions of real-world applications.
7. References
[1] A.-C. Simon, A. Auchlin, M. Avanzi, and J.-P. Goldman, Les voix des Français. Peter Lang, 2009, ch. Les phonostyles: une description prosodique des styles de parole en français.
[2] H. Schmid and M. Atterer, "New statistical methods for phrase break prediction," in International Conference on Computational Linguistics, Geneva, Switzerland, 2004, pp. 659–665.
[3] P. Bell, T. Burrows, and P. Taylor, "Adaptation of prosodic phrasing models," in Speech Prosody, Dresden, Germany, 2006.
[4] J. Yamagishi, T. Masuko, and T. Kobayashi, "HMM-based expressive speech synthesis - Towards TTS with arbitrary speaking styles and emotions," in Special Workshop in Maui, Maui, Hawaii, 2004.
[5] S. Krstulovic, A. Hunecke, and M. Schröder, "An HMM-based speech synthesis system applied to German and its adaptation to a limited set of expressive football announcements," in Interspeech, 2007.
[6] E. Villemonte de La Clergerie, "From metagrammars to factorized TAG/TIG parsers," in International Workshop on Parsing Technology, Vancouver, Canada, Oct. 2005, pp. 190–191.
[7] N. Obin, P. Lanchantin, A. Lacheret, and X. Rodet, "Towards improved HMM-based speech synthesis using high-level syntactical features," in Speech Prosody, Chicago, U.S.A., 2010.
[8] A. Lacheret, N. Obin, and M. Avanzi, "Design and evaluation of shared prosodic annotation for spontaneous French speech: From expert knowledge to non-expert annotation," in Linguistic Annotation Workshop, Uppsala, Sweden, 2010.
[9] K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg, "ToBI: a standard for labeling English prosody," in International Conference on Spoken Language Processing, Banff, Canada, 1992, pp. 867–870.
[10] N. Obin, V. Dellwo, A. Lacheret, and X. Rodet, "Expectations for discourse genre identification: a prosodic study," in Interspeech, Makuhari, Japan, 2010, pp. 3070–3073.
[11] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," in European Conference on Speech Communication and Technology, Budapest, Hungary, 1999, pp. 2347–2350.
[12] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, "Hidden Markov models based on multi-space probability distribution for pitch pattern modeling," in International Conference on Acoustics, Speech, and Signal Processing, Phoenix, Arizona, 1999, pp. 229–232.
[13] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Hidden semi-Markov model based speech synthesis," in International Conference on Speech and Language Processing, Jeju Island, Korea, 2004, pp. 1397–1400.
[14] T. Toda and K. Tokuda, "A speech parameter generation algorithm considering global variance for HMM-based speech synthesis," IEICE Transactions on Information and Systems, vol. 90, no. 5, pp. 816–824, 2007.
[15] N. Obin, A. Lacheret, and X. Rodet, "HMM-based prosodic structure model using rich linguistic context," in Interspeech, Makuhari, Japan, 2010, pp. 1133–1136.
[16] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.