A Method of Speech Waveform Synthesis Based on WaveNet Considering the Speech Generation Process
SP
WaveNet
Feed-forward [Zen et al., 13], LSTM-RNN [Zen et al., 15], WaveNet [van den Oord et al., 16]
WaveNet
[Fant, 60]
[Figure: vocoder-based synthesis, processed frame by frame with overlap/shift]
WaveNet
WaveNet [van den Oord et al., 16]
Causal dilated convolution, residual, skip-connection, softmax
Causal dilated convolution
[Figure: stack of causal convolutions with increasing dilation. Layers 1 to 3: dilation = 1, 2, 4; final layer: dilation = 8]
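The dilation pattern in the figure can be illustrated with a small sketch. It assumes a kernel size of 2 (as in the original WaveNet paper; the slides do not state it) and uses hypothetical helper names `causal_dilated_conv` and `receptive_field`. An impulse is pushed through four layers with dilations 1, 2, 4, 8 to show which output samples can depend on it.

```python
def causal_dilated_conv(x, w, dilation):
    """Kernel-size-2 causal convolution: y[t] = w[0]*x[t - dilation] + w[1]*x[t].

    Samples before the start of the signal are treated as zero (left padding),
    so y[t] never depends on future inputs.
    """
    y = []
    for t in range(len(x)):
        past = x[t - dilation] if t - dilation >= 0 else 0.0
        y.append(w[0] * past + w[1] * x[t])
    return y


def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)


dilations = [1, 2, 4, 8]

# Feed a unit impulse through the stack with all-ones weights (illustrative
# values only): the output is non-zero exactly where the impulse is visible.
signal = [1.0] + [0.0] * 19
for d in dilations:
    signal = causal_dilated_conv(signal, [1.0, 1.0], d)

reach = [t for t, v in enumerate(signal) if v != 0.0]
print(receptive_field(dilations))  # 16
print(reach[0], reach[-1])         # 0 15
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the number of weights grows only linearly.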
[Figure: WaveNet architecture. Input -> causal conv. -> stacked residual blocks (dilated conv. -> tanh and sigmoid branches -> 1x1 conv., each with a residual connection and a skip connection) -> summed skip connections -> ReLU -> 1x1 conv. -> ReLU -> 1x1 conv. -> softmax -> output]
[Figure: the same WaveNet architecture with its components labeled: gated activation, causal dilated conv., residual, skip-connection]
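As a rough illustration of the labeled components, the sketch below follows the gated activation from the WaveNet paper, z = tanh(filter) * sigmoid(gate), and then forms the residual and skip outputs. It is a single-channel simplification: the 1x1 convolutions reduce to scalar multiplications, and `w_res` and `w_skip` are hypothetical stand-ins for their weights.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_activation(filter_out, gate_out):
    """Element-wise tanh(filter) * sigmoid(gate)."""
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_out, gate_out)]


def residual_block(x, filter_out, gate_out, w_res=1.0, w_skip=1.0):
    """Single-channel sketch of one residual block.

    filter_out and gate_out are the outputs of the two dilated convolutions
    applied to x. With one channel, each 1x1 convolution is a scalar multiply.
    """
    z = gated_activation(filter_out, gate_out)
    skip = [w_skip * v for v in z]                      # goes to the output stack
    residual = [xi + w_res * v for xi, v in zip(x, z)]  # goes to the next block
    return residual, skip


x = [0.5, -0.2, 0.1]
res, skip = residual_block(x, filter_out=[1.0, 0.0, -1.0], gate_out=[0.0, 2.0, 5.0])
print(res)
print(skip)
```

The sigmoid branch acts as a gate in (0, 1) that scales how much of the tanh branch passes through at each time step.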
Softmax output: 16-bit audio has 65,536 classes, so the waveform is quantized from 16 bit to 8 bit (256 classes)
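One way to get from 16 bit to 8 bit is the mu-law companding used in the original WaveNet paper. Whether these slides use exactly this transform is an assumption, since the accompanying text is lost. The sketch below uses hypothetical function names and maps a sample normalized to [-1, 1] onto one of 256 classes and back.

```python
import math

MU = 255  # 8-bit output: class indices 0..255


def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1.0) / 2.0 * mu + 0.5)


def mu_law_decode(c, mu=MU):
    """Map a class index in [0, mu] back to a sample in [-1, 1]."""
    y = 2.0 * c / mu - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


print(2 ** 16, 2 ** 8)  # 65536 256

sample_16bit = 16384                 # an example 16-bit PCM value
x = sample_16bit / 32768.0           # normalize to [-1, 1)
c = mu_law_encode(x)
print(c, mu_law_decode(c))
```

The logarithmic mapping spends more of the 256 classes on small amplitudes, where speech samples are concentrated.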
Softmax
WaveNet
WaveNet, Softmax
20
[Figure: stacked residual blocks #1 to #4; inside residual block #1, layers 1 to 3 use dilation = 1, 2, 4, followed by dilation = 8]
Sample-by-Sample Frame-by-Frame
WaveNet
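Database: CMU-ARCTIC SLT
Training / test utterances: 1082 / 50
Sampling rate: 16 kHz
Frame shift / window length: 5 ms / 25 ms
Mel-cepstrum order: 0 to 24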
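The arithmetic behind these conditions can be sketched as follows, reading the 5 ms and 25 ms values as frame shift and window length (an assumption, since the labels are lost). Frame-level features such as mel-cepstra must be aligned with the 16 kHz waveform before they can condition a sample-by-sample model. Simple repetition is used here purely for illustration, and the slides do not say which upsampling method was used.

```python
SAMPLE_RATE = 16000   # Hz
FRAME_SHIFT_MS = 5
WINDOW_MS = 25

samples_per_shift = SAMPLE_RATE * FRAME_SHIFT_MS // 1000
samples_per_window = SAMPLE_RATE * WINDOW_MS // 1000
print(samples_per_shift, samples_per_window)  # 80 400


def upsample_by_repetition(frames, hop):
    """Repeat each frame-level feature vector `hop` times (one per sample)."""
    return [frame for frame in frames for _ in range(hop)]


# Three dummy frames of 25 coefficients each (orders 0 to 24).
frames = [[float(i)] * 25 for i in range(3)]
per_sample = upsample_by_repetition(frames, samples_per_shift)
print(len(per_sample))  # 240
```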
Optimizer: Adam
Dilation: 1, 2, ..., 512, repeated 3 times (30 causal dilated convolution layers)
[Figure: WaveNet architecture as configured for the experiment. Channel counts marked in the figure: 256ch and 256ch at the gated activation; 2048ch, 2048ch, 256ch, 2048ch on the remaining connections]
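To get a feel for this configuration, the sketch below expands the dilation schedule 1, 2, ..., 512 over 3 repetitions and estimates the receptive field of the dilated stack. The kernel size of 2 is an assumption taken from the original WaveNet paper, and the initial causal convolution is not counted.

```python
SAMPLE_RATE = 16000  # Hz, from the experimental conditions
KERNEL_SIZE = 2      # assumption: not stated in the slides

# 1, 2, 4, ..., 512 is ten layers; three repetitions give 30 layers.
dilations = [2 ** i for i in range(10)] * 3

receptive_field = 1 + (KERNEL_SIZE - 1) * sum(dilations)
receptive_field_ms = 1000.0 * receptive_field / SAMPLE_RATE

print(len(dilations))      # 30
print(receptive_field)     # 3070
print(receptive_field_ms)  # 191.875
```

Under these assumptions, each predicted sample can depend on roughly 0.19 s of past waveform.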
SNR
SDR
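The formulas on this slide did not survive extraction. As a hedged illustration only, the sketch below computes SNR with the common textbook definition, 10 log10 of signal energy over error energy, between a reference waveform and a synthesized one. The slides' exact definition may differ, and SDR is not reproduced here because its definition cannot be recovered.

```python
import math


def snr_db(reference, estimate):
    """Waveform SNR in dB: 10 * log10(sum(s^2) / sum((s - s_hat)^2))."""
    signal_energy = sum(r * r for r in reference)
    noise_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise_energy == 0.0:
        return math.inf  # identical waveforms
    return 10.0 * math.log10(signal_energy / noise_energy)


reference = [1.0, 1.0, 1.0, 1.0]
estimate = [0.9, 0.9, 0.9, 0.9]
print(snr_db(reference, estimate))  # about 20 dB: error amplitude is 1/10 of the signal
```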
Conditioning features: Nothing, Mcep, Mcep + F0
[Figure: SNR and SDR (dB; 5%) for Nothing, Mcep, and Mcep + F0. Other labels in the figure: WaveNet, FFT, Raw, Mcep, Test]
Plain-MLSA: FFT, MLSA
STRAIGHT-MLSA: STRAIGHT [1], MLSA [2]
Plain-WaveNet: FFT, WaveNet
STRAIGHT-WaveNet: STRAIGHT, WaveNet
[1] STRAIGHT
[2] MLSA
[Results figures: SNR and SDR. Labels in the figures: STRAIGHT-MLSA, STRAIGHT-WaveNet, Raw]
WaveNet, SNR, SDR, STRAIGHT