Takeshi SAITOU 1, Masataka GOTO 1, Masashi UNOKI 2 and Masato AKAGI 2
1 National Institute of Advanced Industrial Science and Technology (AIST) 2 Japan Advanced Institute of Science and Technology (JAIST)
Our research approach focuses not on text-to-singing (lyric-to-singing) synthesis, but on speech-to-singing synthesis (vocal conversion).
⇒ Clarifying acoustic differences between singing and speaking.
⇒ Developing novel applications for computer music production.
speech → singing ♪
The vocal conversion system
- is based on the speech manipulation system STRAIGHT (Kawahara et al., 1998), and
- comprises three types of model: an F0 control model, a duration control model, and a spectral control model.
Speaking voice: reading the lyrics of a song.
Musical score
Synchronization information (segment labels in the figure: c v v c c c cv v)
Musical score, musical notes
F0 control model: Adding four types of F0 fluctuation to the musical notes.
Melody contour → F0 contour of singing voice
Vibrato: Quasi-periodic frequency modulation at 4 to 7 Hz.
Preparation: A deflection in the direction opposite to the note change, observed just before the note change.
Fine fluctuation: Irregular fluctuations above 10 Hz throughout the contour.
Overshoot: A deflection that exceeds the target note just after the note change.
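The four fluctuations can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the underdamped second-order smoothing, the way preparation is injected, the moving-average high-pass, and every numeric value (`omega`, `zeta`, depths, the 5.5 Hz vibrato rate) are hypothetical choices made for this example.

```python
import math
import random

def step_contour(notes, frame_rate=200):
    """Step-wise melody contour in semitones (MIDI note numbers), one value per frame."""
    contour = []
    for midi_note, duration_sec in notes:
        contour.extend([float(midi_note)] * int(round(duration_sec * frame_rate)))
    return contour

def add_preparation(target, depth=0.3, prep_frames=10):
    """Preparation: push the target opposite to each upcoming note change."""
    out = list(target)
    for n in range(1, len(target)):
        step = target[n] - target[n - 1]
        if step != 0.0:
            sign = 1.0 if step > 0 else -1.0
            for m in range(max(0, n - prep_frames), n):
                out[m] -= sign * depth
    return out

def second_order_smooth(target, frame_rate=200, omega=35.0, zeta=0.55):
    """Overshoot: an underdamped (zeta < 1) second-order system exceeds each new target."""
    dt = 1.0 / frame_rate
    y, v = target[0], 0.0
    out = []
    for x in target:
        accel = omega * omega * (x - y) - 2.0 * zeta * omega * v
        v += accel * dt  # semi-implicit Euler step
        y += v * dt
        out.append(y)
    return out

def remove_slow_component(x, window):
    """Crude high-pass: subtract a centred moving average of about `window` frames."""
    half = window // 2
    out = []
    for n in range(len(x)):
        lo, hi = max(0, n - half), min(len(x), n + half + 1)
        out.append(x[n] - sum(x[lo:hi]) / (hi - lo))
    return out

def synthesize_f0(notes, frame_rate=200, vibrato_hz=5.5, vibrato_depth=0.3,
                  fine_depth=0.05, seed=0):
    """Return an F0 contour in Hz containing all four fluctuations."""
    rng = random.Random(seed)
    target = add_preparation(step_contour(notes, frame_rate))
    smooth = second_order_smooth(target, frame_rate)
    noise = [rng.gauss(0.0, fine_depth) for _ in smooth]
    # Fine fluctuation: keep roughly the components above 10 Hz.
    fine = remove_slow_component(noise, frame_rate // 10)
    f0 = []
    for n, semitone in enumerate(smooth):
        t = n / frame_rate
        vibrato = vibrato_depth * math.sin(2.0 * math.pi * vibrato_hz * t)
        value = semitone + vibrato + fine[n]
        f0.append(440.0 * 2.0 ** ((value - 69.0) / 12.0))  # MIDI note to Hz
    return f0
```

For example, `synthesize_f0([(60, 0.5), (64, 0.5)])` returns 200 frames for two half-second notes. Working in semitones keeps the fluctuation depths independent of the absolute pitch.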
Speaking voice
STRAIGHT (analysis part)
Spectral sequence, aperiodicity index (AP) sequence
Duration control model:
- The consonant part is lengthened according to a fixed rate.
- The boundary part (the transition from consonant to vowel) is not lengthened.
- The vowel part is lengthened so that the duration of the whole combination corresponds to the note duration.
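The duration rule reduces to simple arithmetic, sketched below. The segment names (consonant, boundary, vowel), the function name, and the example durations are assumptions for illustration; only the rule itself comes from the slide.

```python
def lengthen_segments(consonant, boundary, vowel, note_duration, consonant_rate):
    """Return new (consonant, boundary, vowel) durations in seconds and the vowel stretch ratio.

    consonant_rate is the fixed lengthening rate for the consonant part.
    """
    new_consonant = consonant * consonant_rate      # lengthened by a fixed rate
    new_boundary = boundary                         # not lengthened
    new_vowel = note_duration - new_consonant - new_boundary  # fills the rest of the note
    if new_vowel <= 0.0:
        raise ValueError("note is too short for the lengthened consonant and boundary")
    return new_consonant, new_boundary, new_vowel, new_vowel / vowel
```

With hypothetical inputs of a 50 ms consonant, a 40 ms boundary, a 100 ms vowel, a 0.5 s note, and a rate of 1.28, the vowel must be stretched to about 396 ms so that the three parts sum to the note duration.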
Lengthened spectral and AP sequence
Spectral envelope and AP of the vowel part
Modified spectral envelope and AP
Spectral control model 1: Adding the singing formant by emphasizing the peak of the spectral envelope and the dip of the AP.
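One way to picture this emphasis is a bell-shaped gain around 3 kHz applied to one frame of the spectral envelope, with the opposite gain applied to the AP. The Gaussian shape, the 500 Hz width, and the 12 dB gain below are hypothetical values chosen for this sketch; the slide does not give the actual weighting.

```python
import math

def emphasis_db(freq_hz, center_hz=3000.0, width_hz=500.0, gain_db=12.0):
    """Bell-shaped gain in dB centred on the singing-formant region."""
    return gain_db * math.exp(-0.5 * ((freq_hz - center_hz) / width_hz) ** 2)

def add_singing_formant(envelope, aperiodicity, sample_rate):
    """Raise the envelope and deepen the AP dip around 3 kHz for one frame.

    envelope and aperiodicity are linear-amplitude lists covering 0 Hz to
    sample_rate / 2 on an evenly spaced frequency grid.
    """
    bins = len(envelope)
    new_envelope, new_aperiodicity = [], []
    for k in range(bins):
        freq_hz = k * (sample_rate / 2.0) / (bins - 1)
        gain = 10.0 ** (emphasis_db(freq_hz) / 20.0)
        new_envelope.append(envelope[k] * gain)          # emphasize the peak
        new_aperiodicity.append(aperiodicity[k] / gain)  # emphasize the dip
    return new_envelope, new_aperiodicity
```

In a real system the modification would be applied only to frames in the vowel part, as the slide indicates.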
Modified spectral envelope and AP, generated F0 contour
STRAIGHT (synthesis)
Synthesized singing voice
Spectral control model 2: Adding an amplitude modulation (AM) of formants synchronized with vibrato, by applying AM to the amplitude envelope of the synthesized singing voice during vibrato.
Synthesized singing voice (final version)
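As a rough illustration of this last step, the sketch below multiplies the waveform by a sinusoidal gain over a vibrato interval. It assumes the AM shares the vibrato rate and starts in phase with it; the function name, the modulation depth, and the interval boundaries are hypothetical.

```python
import math

def apply_vibrato_am(samples, sample_rate, vibrato_hz, depth, start_sec, end_sec):
    """Apply sinusoidal AM to the samples that fall inside one vibrato interval."""
    out = list(samples)
    first = int(round(start_sec * sample_rate))
    last = min(len(samples), int(round(end_sec * sample_rate)))
    for n in range(first, last):
        t = (n - first) / sample_rate  # time since vibrato onset
        gain = 1.0 + depth * math.sin(2.0 * math.pi * vibrato_hz * t)
        out[n] = samples[n] * gain
    return out
```

Samples outside the interval are returned unchanged, so the function can be called once per vibrato section.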
♪ Speaking voice (input): (male → female)
♪ Synthesized singing voice: (male → female → chorus)
Thank you!!
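$$
H(s) = \frac{K}{s^{2} + 2\zeta\omega s + \omega^{2}}
$$

$$
h(t) =
\begin{cases}
\dfrac{K}{2\omega\sqrt{\zeta^{2}-1}}\left(\exp\!\left(\left(-\zeta+\sqrt{\zeta^{2}-1}\right)\omega t\right) - \exp\!\left(\left(-\zeta-\sqrt{\zeta^{2}-1}\right)\omega t\right)\right), & \zeta > 1 \\[2ex]
\dfrac{K}{\omega\sqrt{1-\zeta^{2}}}\,\exp(-\zeta\omega t)\,\sin\!\left(\sqrt{1-\zeta^{2}}\,\omega t\right), & 0 < \zeta < 1 \\[2ex]
K\,t\,\exp(-\omega t), & \zeta = 1 \\[1ex]
\dfrac{K}{\omega}\,\sin(\omega t), & \zeta = 0
\end{cases}
$$

where $K$ is the gain, $\zeta$ the damping coefficient, and $\omega$ the natural angular frequency of the second-order system.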
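For reference, the impulse response of a second-order system with transfer function H(s) = K / (s² + 2ζωs + ω²) can be evaluated directly. The sketch below uses the standard textbook closed forms for the four damping cases; the function and parameter names are chosen for this example, and the normalization may differ from the one printed on the slide.

```python
import math

def impulse_response(t, gain, zeta, omega):
    """h(t) of H(s) = gain / (s^2 + 2*zeta*omega*s + omega^2), for zeta >= 0."""
    if t < 0.0:
        return 0.0
    if zeta == 0.0:
        # Undamped: pure oscillation.
        return gain / omega * math.sin(omega * t)
    if zeta < 1.0:
        # Underdamped: decaying oscillation (the case that produces overshoot).
        root = math.sqrt(1.0 - zeta * zeta)
        return (gain / (omega * root) * math.exp(-zeta * omega * t)
                * math.sin(omega * root * t))
    if zeta == 1.0:
        # Critically damped.
        return gain * t * math.exp(-omega * t)
    # Overdamped: difference of two decaying exponentials.
    root = math.sqrt(zeta * zeta - 1.0)
    return gain / (2.0 * omega * root) * (
        math.exp((-zeta + root) * omega * t) - math.exp((-zeta - root) * omega * t))
```

The underdamped and overdamped branches both converge to the critically damped form as ζ approaches 1, which is a quick sanity check on the expressions.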
♪ Ratios of the duration of each consonant in singing voices to that in read speech
Columns: place of articulation (lips, teeth・alveolar arch, palate, glottis), voiced / unvoiced. Rows: manner of articulation.
fricative: /z/ 1.37, /s/ 1.18, /h/ 1.28
plosive: /d/ 1.00, /t/ 1.09, /g/ 1.14, /k/ 0.97
semivowel: /w/ 2.61, /r/ 2.12
nasal: /m/ 1.35, /n/ 1.50
We can control phoneme duration by articulation manner rather than articulation position: fricative 1.28, plosive 1.00, semivowel 2.37, nasal 1.43, /y/ 1.22.
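Controlling duration by manner amounts to a two-step lookup, sketched below. The rates are the per-manner values from the slide; the phoneme-to-manner mapping covers only the consonants listed in the table, and the dictionary and function names are hypothetical.

```python
# Lengthening rate per manner of articulation (values from the slide).
MANNER_RATE = {"fricative": 1.28, "plosive": 1.00, "semivowel": 2.37, "nasal": 1.43}

# Manner of each consonant that appears in the table.
PHONEME_MANNER = {
    "z": "fricative", "s": "fricative", "h": "fricative",
    "d": "plosive", "t": "plosive", "g": "plosive", "k": "plosive",
    "w": "semivowel", "r": "semivowel",
    "m": "nasal", "n": "nasal",
}

# /y/ is listed on the slide with its own rate.
PHONEME_RATE = {"y": 1.22}

def consonant_rate(phoneme):
    """Return the lengthening rate for a consonant; raises KeyError if it is not listed."""
    if phoneme in PHONEME_RATE:
        return PHONEME_RATE[phoneme]
    return MANNER_RATE[PHONEME_MANNER[phoneme]]
```

A spoken /s/ of 50 ms would then be lengthened to `0.05 * consonant_rate("s")`, about 64 ms, regardless of where in the vocal tract it is articulated.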
♪ Singer's formant: a remarkable spectral peak at around 3 kHz (Sundberg, 1974).
♪ Amplitude modulation of formants synchronized with vibrato (Hirano, 1985).
Both features are prominent in professional singing voices.
Spectral control 1: Singing formant, a remarkable spectral peak at around 3 kHz.
Spectral control 2: Amplitude modulation of formants synchronized with the vibrato in F0.