Towards optimal TTS corpora
CADIC Didier, BOIDIN Cedric, D'ALESSANDRO Christophe
2 Towards optimal TTS corpora France Telecom Group restricted
Unit-selection TTS
This is an example.
Linguistic modules
Unit selection
Unit concatenation
Speaker database
How to prepare the recording script?
Preparation of the recording script
Criterion = diphone and triphone coverage
Algorithm = greedy, corpus condensation
Classic optimization approach
The link between diphone or triphone coverage and the final TTS quality is not clear
The process is constrained by the limited combinations encountered in the finite reference corpus
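The classic approach above (greedy condensation driven by diphone coverage) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the phoneme representation (lists of symbols) and the toy corpus are all assumptions made for the example.

```python
def diphones(phonemes):
    """Return the set of adjacent phoneme pairs found in one sentence."""
    return {(a, b) for a, b in zip(phonemes, phonemes[1:])}


def greedy_condensation(corpus, max_sentences):
    """Greedily pick the sentences that add the most not-yet-covered diphones.

    corpus: list of sentences, each given as a list of phoneme symbols.
    Returns the indices of the selected sentences and the covered diphones.
    """
    covered = set()
    remaining = list(range(len(corpus)))
    script = []
    while remaining and len(script) < max_sentences:
        # Sentence with the largest coverage increment (first one wins ties).
        best = max(remaining, key=lambda i: len(diphones(corpus[i]) - covered))
        gain = diphones(corpus[best]) - covered
        if not gain:
            break  # nothing left to gain from the reference corpus
        covered |= gain
        script.append(best)
        remaining.remove(best)
    return script, covered


# Toy reference corpus (invented phoneme strings, for illustration only).
corpus = [
    ["#", "b", "a", "#"],
    ["#", "b", "a", "t", "o", "#"],
    ["#", "t", "o", "#"],
]
script, covered = greedy_condensation(corpus, max_sentences=2)
```

Whatever the budget, the selected script can only cover diphones that occur somewhere in the reference corpus, which is the second limitation listed above.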
Criterion = vocalic sandwich coverage
Algorithm = greedy, sentence construction
Our optimization approach
Vocalic sandwiches (Cadic et al., Interspeech 2009)
Sentence construction
Finite State Transducers compute "optimal" sequences of sandwiches, so that:
- the coverage increment is maximized (greedy approach)
- only sandwich transitions observed in a reference corpus are allowed
Neither syntactic nor semantic considerations are taken into account, so generated sequences are likely to be nonsense
Towards optimality
Towards readability
Development of a semi-automatic tool, allowing an operator to iteratively correct generated sequences, in order to build an acceptable and almost optimal sentence.
Example of an iterative construction (English glosses of the successive sequences):
(I don't the week of the six.)
(I don't…)
(I don't take it out…)
(I don't take it out the weeks…)
(I don't take it out the weeks like you.)
(I don't take it out the black weeks,)
The procedure is time-consuming (around 3 min – 50 steps – to build a plausible sentence)
Most built sentences lack semantic coherence (redundancy is minimized at the price of semantics)
Built scripts are much denser than with corpus condensation
Density increase of 30 to 40% compared to condensation
[Figure: sandwich coverage rate (%)]
Conclusion
For the creation of unit-selection TTS recording scripts:
• We suggested using the Vocalic Sandwiches Coverage Rate as the optimization criterion (since it is a convenient symbolic approximation of the selection cost).
• We presented a novel corpus building technique, based on sentence construction rather than sentence selection. The procedure is time-consuming and built sentences tend to lack semantic coherence, but a density increase of 30 to 40% can be obtained.
Recent work (SSW7 submission)
• Extensive evaluation of the vocalic sandwiches as optimization criterion
• Construction of full recording scripts. Density estimations seem to be confirmed. However, semantic limitations had significant repercussions on the reading stage.
Database constitution: two ways
Rushes from DVD, websites…
- The only way to obtain otherwise inaccessible voices
- Expensive process, poor TTS quality
OR
Dedicated recordings (script read by a speaker)
- Control of the content, hence best TTS quality
Vocalic sandwiches (Cadic et al., Interspeech 2009)
Given an input sentence, the selection module searches the database for units presenting:
- Maximum adequacy to the target sequence (target cost)
- Minimum distortion between consecutive units (concatenation cost)
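The search described above, which balances target cost against concatenation cost, is commonly written as a Viterbi-style dynamic program. The sketch below assumes that formulation; the function `select_units`, the two cost functions and the toy units (label, database position, pitch) are hypothetical and only serve the illustration.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Return the candidate sequence with the lowest total cost, and that cost.

    targets: list of target unit specifications.
    candidates: candidates[i] lists the database units usable for targets[i].
    """
    # best[i][j] = (cheapest cost of a path ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)
    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path)), total


# Toy setup (invented values): a unit is (label, position in database, pitch).
def toy_target_cost(target, unit):
    return abs(target[1] - unit[2]) / 10.0  # pitch mismatch


def toy_concat_cost(prev, unit):
    return 0.0 if unit[1] == prev[1] + 1 else 2.0  # free if contiguous in the database


targets = [("a", 100), ("b", 100)]
candidates = [
    [("a", 0, 100), ("a", 5, 110)],
    [("b", 6, 100), ("b", 9, 100)],
]
path, total = select_units(targets, candidates, toy_target_cost, toy_concat_cost)
```

In this toy case the search prefers a slightly mismatched "a" because it is contiguous with a matching "b" in the database, so the concatenation cost outweighs the small target cost.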
Correlations of coverage rates with the selection cost:
Vocalic sandwiches -0.78
Diphones -0.44
Triphones -0.64
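One way such a correlation could be computed is sketched below. It assumes that the coverage rate of a text is the percentage of its units found in the database and that the correlation is a Pearson coefficient over a set of test sentences; neither assumption is stated on the slide, and the numbers in the example are invented.

```python
from math import sqrt


def coverage_rate(units_needed, units_in_database):
    """Percentage of the units required by a text that the database contains."""
    needed = list(units_needed)
    hits = sum(1 for u in needed if u in units_in_database)
    return 100.0 * hits / len(needed)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


# Invented per-sentence values, for illustration only.
rates = [40.0, 60.0, 80.0, 100.0]  # coverage rate of each test sentence (%)
costs = [9.0, 7.5, 4.0, 2.0]       # selection cost obtained for the same sentence
r = pearson(rates, costs)
```

A negative coefficient means that sentences with better coverage tend to be synthesized at a lower selection cost; the closer to -1, the better the coverage rate predicts that cost.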
Sentence construction
Finite State Transducers compute "optimal" sequences of sandwiches, so that:
- the coverage increment is maximized (greedy approach)
- only sandwich transitions observed in a reference corpus are allowed
Coverage increment is averaged over the sequence length
15 FSTs give 15 optimal sandwich sequences, one for each length ≦ 15
Optimal sequence of length 1
Optimal sequence of length 2
Optimal sequence of length 3
Optimal sequence of length 4
…
Optimal sequence of length 15
#_b_i_z_#
#__ _t_e_#
#_i_l_p_a_ _t_e_#
#_i_l_p_a_ _a_s_j__#
#_i_l_p_a_ _f__ _p_u_ _d_ _m_ _d_ _p_u__v_w_a_ _d_ _l_a_p_a_s_s__t_k_o_m_ _#
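The per-length search can be illustrated with a small sketch in plain Python. It does not use finite state transducers: it swaps in a simple dynamic-programming heuristic that keeps one best sequence per final sandwich, so unlike the FSTs of the talk it is not guaranteed to be optimal. Sandwiches are treated as opaque labels, and `best_sequences`, the labels and the transition list are all hypothetical.

```python
def best_sequences(transitions, covered, max_len):
    """For each length 1..max_len, return (average coverage increment, sequence).

    transitions: list of (sandwich, next_sandwich) pairs seen in a reference corpus.
    covered: set of sandwiches already covered by the script built so far.
    """
    successors = {}
    for a, b in transitions:
        successors.setdefault(a, []).append(b)
        successors.setdefault(b, [])

    def gain(seq):
        # Coverage increment = number of distinct, not yet covered sandwiches.
        return len(set(seq) - covered)

    # paths[s] = best sequence found so far that ends with sandwich s
    paths = {s: (s,) for s in sorted(successors)}
    results = {}
    for length in range(1, max_len + 1):
        if not paths:
            break  # no allowed sequence of this length exists
        best = max(paths.values(), key=gain)
        results[length] = (gain(best) / length, best)
        extended = {}
        for seq in paths.values():
            for nxt in successors[seq[-1]]:  # only observed transitions
                cand = seq + (nxt,)
                if nxt not in extended or gain(cand) > gain(extended[nxt]):
                    extended[nxt] = cand
        paths = extended
    return results


# Toy data (invented labels, for illustration only).
transitions = [("s1", "s2"), ("s2", "s3"), ("s2", "s4"), ("s3", "s1")]
results = best_sequences(transitions, covered={"s1"}, max_len=3)
```

Dividing the increment by the length makes candidates of different lengths comparable, so the next piece of the sentence can be chosen as the entry of `results` with the highest average.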