Singing Style Transfer Using Cycle-Consistent Boundary Equilibrium Generative Adversarial Networks

Cheng-Wei Wu 1 2 Jen-Yu Liu 1 2 Yi-Hsuan Yang 2 Jyh-Shing R. Jang 1

Abstract

Can we make a famous rap singer like Eminem sing whatever song we like? Singing style transfer attempts to make this possible by replacing the vocal of a song from the source singer with that of the target singer. This paper presents a method that learns from unpaired data for singing style transfer using generative adversarial networks.

1. Introduction

Figure 1 shows a two-stage framework combining singing voice separation with singing style transfer. The audio from a source singer is first separated into accompaniment and vocal, and then the singing style of the separated vocal is changed to that of a target singer. Finally, the separated accompaniment is integrated with the style-transferred vocal. Our focus here is on the singing style transfer model. We assume that the input to this model is clean, meaning that it has been perfectly separated from the source audio.

We evaluate the proposed model through a listening test under the inside test scenario, meaning that the training and test sets overlap. We show audio results for both inside and outside tests on our project website: http://mirlab.org/users/haley.wu/cybegan/.

2. Methodology

The method proposed by Gatys et al. (2016) led to the earliest successful examples of image style transfer. The method has been extended to perform audio style transfer, either to transfer sound texture (Ulyanov & Lebedev, 2016) or to stylize an a cappella cover to match the original vocal (Bohan, 2017). However, the method requires the availability of paired data for training (i.e., the source and target singers need to have sung the same songs), which is not readily available for arbitrary target singers.

1 National Taiwan University; 2 Academia Sinica, Taiwan. Correspondence to: Cheng-Wei Wu <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Figure 1. System framework for singing style transfer.

In this paper, we propose to use a CycleGAN-CNN (Zhu et al., 2017) based model due to its success in image style transfer without using paired data. We use the log-magnitude spectrogram as the model input, computed with an STFT using 1,024-sample windows with 1/4 overlap, for songs sampled at 44.1 kHz. To capture information in the spectra, we view a T × F 2D spectrogram as a T × 1 image with F channels, and use 1D convolution (Liu & Yang, 2018) for all the layers in our model. Since a song may have variable length, we adopt a fully-convolutional design (Long et al., 2015; Liu & Yang, 2016) and use no pooling layers. To get the time-domain audio signal from the modified spectrogram, we use the classic Griffin-Lim algorithm (Griffin & Lim, 1984).
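For concreteness, the following sketch (using librosa) computes a log-magnitude STFT of the kind described above and resynthesizes audio with Griffin-Lim. The hop length, the number of Griffin-Lim iterations, and the log1p compression are illustrative assumptions, not settings reported by the authors.

import librosa
import numpy as np

SR = 44100       # sampling rate stated in the paper
N_FFT = 1024     # 1,024-sample STFT windows
HOP = 256        # assumed hop size; the paper only states "1/4 overlap"

def audio_to_logmag(path):
    # Load a vocal track and return a (T, F) log-magnitude spectrogram.
    y, _ = librosa.load(path, sr=SR, mono=True)
    S = librosa.stft(y, n_fft=N_FFT, hop_length=HOP)   # complex, shape (F, T)
    return np.log1p(np.abs(S)).T                        # real, shape (T, F)

def logmag_to_audio(logmag, n_iter=100):
    # Invert a (possibly style-transferred) log-magnitude spectrogram with Griffin-Lim.
    mag = np.expm1(logmag).T                             # back to linear magnitude, (F, T)
    return librosa.griffinlim(mag, n_iter=n_iter, hop_length=HOP, win_length=N_FFT)

The resulting (T, F) array is then fed to the model as a length-T sequence with F channels, matching the 1D-convolution view described above.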

We experiment with a few extensions of this basic model. First, we use a U-Net architecture (Ronneberger et al., 2015) and add symmetric skip connections between the encoding and decoding layers to preserve the details of the given source audio and to increase the sharpness of the vocal. Second, we use BEGAN (Berthelot et al., 2017) instead of a standard GAN in an attempt to stabilize the training process. Moreover, in this CycleBEGAN design, we replace the original PatchGAN-based discriminator (Li & Wand, 2016; Ledig et al., 2017; Isola et al., 2017) with an auto-encoder having the same architecture as the generator. Finally, inspired by Liu & Yang (2018), we add a recurrent layer (using a GRU (Cho et al., 2014)) before the output convolution layer to account for the temporal dependency in music.
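To make the generator design concrete, below is a minimal PyTorch sketch of a fully-convolutional 1D encoder-decoder with symmetric skip connections and a GRU placed before the output convolution, treating the spectrogram as a length-T sequence with F channels. The layer counts, channel widths, and activations are illustrative assumptions; they are not the authors' exact architecture.

import torch
import torch.nn as nn

class Generator1D(nn.Module):
    # Sketch: 1D fully-convolutional U-Net-style generator with a GRU before
    # the output convolution. n_freq is the number of frequency bins (channels).
    def __init__(self, n_freq=513, hidden=256):
        super().__init__()
        # Encoder: 1D convolutions over time only; no pooling (fully convolutional).
        self.enc1 = nn.Sequential(nn.Conv1d(n_freq, hidden, 3, padding=1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv1d(hidden, hidden, 3, padding=1), nn.LeakyReLU(0.2))
        self.mid = nn.Sequential(nn.Conv1d(hidden, hidden, 3, padding=1), nn.LeakyReLU(0.2))
        # Decoder mirrors the encoder; skip connections are concatenated channel-wise.
        self.dec2 = nn.Sequential(nn.Conv1d(hidden * 2, hidden, 3, padding=1), nn.LeakyReLU(0.2))
        self.dec1 = nn.Sequential(nn.Conv1d(hidden * 2, hidden, 3, padding=1), nn.LeakyReLU(0.2))
        # Recurrent layer over time to model temporal dependency in music.
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Conv1d(hidden, n_freq, 3, padding=1)

    def forward(self, x):
        # x: (batch, F, T) -- a T x F spectrogram viewed as a T x 1 "image" with F channels.
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        m = self.mid(e2)
        d2 = self.dec2(torch.cat([m, e2], dim=1))   # symmetric skip from enc2
        d1 = self.dec1(torch.cat([d2, e1], dim=1))  # symmetric skip from enc1
        h, _ = self.gru(d1.transpose(1, 2))         # GRU over the time axis
        return self.out(h.transpose(1, 2))          # back to (batch, F, T)

Since the BEGAN discriminator is itself an auto-encoder with the same architecture as the generator, a module of this shape could also serve as the discriminator, scoring inputs by reconstruction error.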

3. Experiments

We train our model on the iKala dataset (Chan et al., 2015), which comes with 252 30-second excerpts of clean vocal recordings. We cut the excerpts into 5-second clips with 4-second overlaps. We randomly select 2,800 segments from each gender as the training set, and 100 segments from each gender as the test set. Since part of each test clip may have been observed in the training set, this is an inside test.
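As an illustration of this segmentation, the snippet below cuts a waveform into 5-second clips with a 1-second hop (i.e., 4-second overlaps). The function name and the choice to segment the waveform rather than the spectrogram are assumptions made for the example.

import numpy as np

SR = 44100     # sampling rate used in the paper
CLIP_SEC = 5   # clip length in seconds
HOP_SEC = 1    # 5-second clips with 4-second overlap => 1-second hop

def cut_into_clips(y, sr=SR, clip_sec=CLIP_SEC, hop_sec=HOP_SEC):
    # Cut a waveform into overlapping fixed-length clips.
    clip_len, hop_len = clip_sec * sr, hop_sec * sr
    clips = [y[start:start + clip_len]
             for start in range(0, len(y) - clip_len + 1, hop_len)]
    return np.stack(clips) if clips else np.empty((0, clip_len))

# A 30-second iKala excerpt yields 26 clips, starting at 0, 1, ..., 25 seconds.
excerpt = np.zeros(30 * SR, dtype=np.float32)
print(cut_into_clips(excerpt).shape)   # (26, 220500)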

Figure 2. Results of the subjective listening test: mean opinion scores (MOS) on six metrics (sharpness, lyrics, pitch, naturalness, gender, and overall) for the five models CycleGAN-CNN (m1), CycleGAN-CNN+skip (m2), CycleBEGAN-CNN (m3), CycleBEGAN-CNN+skip (m4), and CycleBEGAN-CNN+skip+recurrent (m5).

To simplify the problem, we train our model to perform "gender transfer", i.e., either male-to-female or female-to-male. We implement and compare in total five different models (denoted as m1–m5 in Figure 2), including the basic 'CycleGAN-CNN' and the most sophisticated 'CycleBEGAN-CNN+skip+recurrent.'

We conduct a subjective listening test to evaluate the performance of style transfer. The subjects are asked to listen to the source and transferred audio for six 15-second clips, including 3 clips for male-to-female and 3 clips for female-to-male transfer. A subject has to compare the results of two randomly chosen models for each set, and evaluate the results in terms of the following six metrics, using a five-point Likert scale:

• Sharpness: the sharpness of the transferred singing voice.
• Lyrics: whether the lyrics (the main content of the song) are intelligible and consistent with those in the source.
• Pitch: the perceived pitch accuracy (relative pitch) of the transferred singing voice.
• Naturalness: whether the transferred singing voice sounds as if it were produced by a human rather than a machine.
• Gender: whether the gender of the singer has been changed.
• Overall: whether the transferred voice achieves the goal of style transfer, and its overall perceptual quality.

Figure 2 shows the mean opinion scores (MOS) on these six metrics from 69 adults (19 females). The following observations are made. First, in terms of sharpness and lyrics, all the models with symmetric skip connections (m2, m4, m5) outperform those without them (m1, m3). This shows that the audio details propagated via the skip connections from the encoder to the decoder help generate high-resolution audio. Second, in terms of pitch, the CycleBEGANs (m3, m4, m5) outperform the simple CycleGAN model (m1), but not the CycleGAN with skip connections (m2). By listening to the results, we found that m2 works like a normal auto-encoder which only reconstructs the input without transferring the style. Therefore, it preserves the pitch well but performs very poorly in gender.

Figure 3. The spectrograms of (a) the source audio sung by a male singer and (b) the style-transferred audio (male-to-female) of an example song, using CycleBEGAN-CNN+skip+recurrent (m5). We can see that the pitches in (b) are higher than those in (a).

Next, in terms of naturalness, the outputs of the models with skip connections (m2, m4, m5) score higher than the others (m1, m3), as expected. Similar to pitch, the outputs of m2 sound more natural than those of the other models, except for m5. Finally, the scores in gender are consistent with the results for overall performance. This is expected, because the goal is to transfer the singing style male→female or female→male. The CycleBEGAN models (m3, m4, m5) again outperform the CycleGAN ones (m1, m2), which verifies that the integration of CycleGAN with BEGAN effectively improves the result of style transfer. The most sophisticated model, m5, outperforms the other models in all the metrics, reaching scores that are close to or over 4 on the five-point Likert scale. Figure 3 shows an example of the transferred spectrograms produced by m5.
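To clarify what the BEGAN training strategy contributes, the sketch below shows the boundary-equilibrium update of Berthelot et al. (2017): the discriminator is an auto-encoder scored by reconstruction error, and a balance variable k keeps the generator and discriminator losses in equilibrium. The hyperparameters gamma and lambda_k are illustrative values; how these terms are combined with the cycle-consistency losses in our CycleBEGAN is not detailed in this short paper.

import torch

def recon_loss(autoencoder, x):
    # BEGAN scores a sample by how well the auto-encoder discriminator reconstructs it (L1 error).
    return torch.mean(torch.abs(autoencoder(x) - x))

gamma, lambda_k = 0.5, 0.001   # illustrative values from Berthelot et al. (2017)

def began_losses(D, real, fake, k):
    # One BEGAN loss computation plus the update of the balance term k.
    loss_real = recon_loss(D, real)
    loss_fake = recon_loss(D, fake)            # pass fake.detach() when optimizing D
    d_loss = loss_real - k * loss_fake         # discriminator objective
    g_loss = loss_fake                         # generator objective
    k = min(max(k + lambda_k * (gamma * loss_real.item() - loss_fake.item()), 0.0), 1.0)
    return d_loss, g_loss, k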

4. Conclusion and Future Work

We have proposed a new approach for singing style transfer without paired data by combining CycleGAN with the training strategy of BEGAN. We have shown that symmetric skip connections increase the sharpness of the result, and that BEGAN plays an important role in transferring the singer properties. In future work, we will transfer singer identities instead of only gender, and include a singing voice separation model (Jansson et al., 2017; Liu & Yang, 2018) to deal with real songs instead of clean vocals.

References

Berthelot, David, Schumm, Tom, and Metz, Luke. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.

Bohan, O. B. Singing style transfer, 2017. URL http://madebyoll.in/posts/singing_style_transfer/.

Chan, Tak-Shing, Yeh, Tzu-Chun, Fan, Zhe-Cheng, Chen, Hung-Wei, Su, Li, Yang, Yi-Hsuan, and Jang, Roger. Vocal activity informed singing voice separation with the iKala dataset. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, pp. 718–722, 2015.

Cho, Kyunghyun, van Merrienboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. Conf. Empirical Methods in Natural Language Processing, pp. 1724–1734, 2014.

Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolutional neural networks. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 2414–2423, 2016.

Griffin, Daniel and Lim, Jae. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.

Isola, Phillip, Zhu, Jun-Yan, Zhou, Tinghui, and Efros, Alexei A. Image-to-image translation with conditional adversarial networks. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2017.

Jansson, Andreas, Humphrey, Eric J., Montecchio, Nicola, Bittner, Rachel, Kumar, Aparna, and Weyde, Tillman. Singing voice separation with deep U-Net convolutional networks. In Proc. Int. Soc. Music Information Retrieval Conf., pp. 745–751, 2017.

Ledig, Christian, Theis, Lucas, Huszar, Ferenc, Caballero, Jose, Cunningham, Andrew, Acosta, Alejandro, Aitken, Andrew, Tejani, Alykhan, Totz, Johannes, Wang, Zehan, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 4681–4690, 2017.

Li, Chuan and Wand, Michael. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In Proc. European Conf. Computer Vision, pp. 702–716. Springer, 2016.

Liu, Jen-Yu and Yang, Yi-Hsuan. Event localization in music auto-tagging. In Proc. ACM Multimedia, pp. 1048–1057, 2016.

Liu, Jen-Yu and Yang, Yi-Hsuan. Denoising auto-encoder with recurrent skip connections and residual regression for music source separation. arXiv preprint arXiv:1807.01898, 2018.

Long, Jonathan, Shelhamer, Evan, and Darrell, Trevor. Fully convolutional networks for semantic segmentation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.

Ronneberger, Olaf, Fischer, Philipp, and Brox, Thomas. U-Net: Convolutional networks for biomedical image segmentation. In Proc. Int. Conf. Medical Image Computing and Computer-assisted Intervention, pp. 234–241. Springer, 2015.

Ulyanov, D. and Lebedev, V. Singing style transfer, 2016. URL https://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer/.

Zhu, Jun-Yan, Park, Taesung, Isola, Phillip, and Efros, Alexei A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. IEEE Int. Conf. Computer Vision, 2017.