
iComposer: An Automatic Songwriting System for Chinese Popular Music

Proceedings of NAACL-HLT 2019: Demonstrations, pages 84–88, Minneapolis, Minnesota, June 2 – June 7, 2019. ©2019 Association for Computational Linguistics

Hsin-Pei Lee
Institute of Information Science, Academia Sinica
hhpslily@iis.sinica.edu.tw

Jhih-Sheng Fan
Institute of Information Science, Academia Sinica
fann1993814@iis.sinica.edu.tw

Wei-Yun Ma
Institute of Information Science, Academia Sinica
ma@iis.sinica.edu.tw

Abstract

In this paper, we introduce iComposer, an interactive web-based songwriting system designed to assist human creators by greatly simplifying music production. iComposer automatically creates melodies to accompany any given text. It also enables users to generate a set of lyrics given arbitrary melodies. iComposer is based on three sequence-to-sequence models, which are used to predict melody, rhythm, and lyrics, respectively. Songs generated by iComposer are compared with human-composed and randomly-generated ones in a subjective test, the experimental results of which demonstrate the capability of the proposed system to write pleasing melodies and meaningful lyrics at a level similar to that of humans.

1 Introduction

Music is a universal language. There is no human society that does not, at some point, have a musical heritage and tradition. Over the past few years, many computer science researchers have worked on music information retrieval, focusing on genre recognition, symbolic melodic similarity, melody extraction, mood classification, etc.

With respect to computational creativity, tasks such as automatic music composition or lyrics generation have been discussed for decades. However, a relatively new topic, the relationship between lyrics and melody, remains somewhat mysterious. It is difficult to explain how music and spoken words fit together and why the pairing of these two creates an emotionally compelling experience; we believe this merits a deeper investigation.

Inspired by work on text-to-image synthesis and image caption generation, we propose iComposer, a simple and effective bi-directional songwriting system that automatically generates melody from text and vice versa. We believe that iComposer has value in capturing relationships between lyrics and melody. Moreover, iComposer makes it possible for more people, both professional and amateur musicians, to enjoy and benefit from this creative activity.

An LSTM (long short-term memory) based model is especially suitable for this task, as it is structured to use historical information to predict the next value in a sequence. Our architecture consists of three subnetworks, each of which is a sequence-to-sequence model. These three subnetworks generate the pitch, duration, and lyrics for each note, and jointly learn the structure of Chinese popular music.

Choi, Fazekas, and Sandler (2016) use text-based LSTMs for automatic music composition; Potash, Romanov, and Rumshisky (2015) demonstrate the effectiveness of LSTMs on rap lyrics generation; and Watanabe et al. (2018) propose an RNN-based lyrics language model conditioned on melodies of Japanese songs. Bao et al. (2018) propose a sequence-to-sequence neural network model that composes melody from lyrics. However, to the best of our knowledge, iComposer is the first songwriting system that utilizes a sequence-to-sequence model to generate both melody and Chinese lyrics that match each other.

In addition to self-evaluation, we designed an experiment to evaluate the quality of the generated songs. Thirty subjects were asked to subjectively rate the aesthetic value of 10 selections, a mixture of original songs and computer-generated songs. The experimental results indicate that iComposer composes compelling songs that are quite similar to those composed by humans, and much better than those generated by the baseline method.

2 Methodology

In this section, we briefly introduce our data gathering and pre-processing techniques, and then describe the proposed network architecture and provide a system overview.

2.1 Dataset

For the purpose of this study, music sheets with vocal lines and their accompanying lyrics are necessary. Moreover, files with a single instrument corresponding to the melody are favored for better performance in the LSTM model. However, as far as we know, no such dataset is available, and most state-of-the-art melody extraction tools are still unsatisfactory.

We collected 1000 Chinese popular music pieces in MIDI format and extracted feature information using pretty_midi, a Python module for creating, manipulating, and analyzing MIDI files (Raffel and Ellis, 2014). Musical notes are represented by MIDI note numbers from 0 to 127, where 60 is defined as middle C. Note durations are represented by numbers in the range from 0.1 to 3.0 seconds.
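As an illustration, this kind of feature extraction can be sketched with pretty_midi as follows. Treating the first instrument track as the melody is our simplifying assumption here; as described next, the paper's melodies were actually isolated manually.

```python
# Sketch: extracting note numbers and durations from a MIDI file with
# pretty_midi. We assume the melody is the first instrument track;
# in the paper, melodies were isolated manually by annotators.
import pretty_midi

pm = pretty_midi.PrettyMIDI("song.mid")
melody = pm.instruments[0]
notes = [(n.pitch, n.end - n.start) for n in melody.notes]
# e.g. [(60, 0.5), (62, 0.25), ...] -- MIDI note number, duration in seconds
```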

We then recruited 20 people with at least five years of experience playing instruments to isolate the main melodies from polyphonic music and align the lyrics to their corresponding notes, as shown in Figure 1. The dataset consists of approximately 300,000 character-note pairs, of which we partitioned 80% as the training set and used the rest for testing. Note that the correspondence between notes and characters is not always one-to-one: a single sung character can be composed of multiple notes (i.e., one-to-many alignment).

Figure 1: Results of character-note alignment. Every character is bound to its corresponding note number and note duration.
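For concreteness, one aligned example like those shown in Figure 1 might be stored in a form like the following; the exact storage format is our assumption, not the authors'.

```python
# Hypothetical representation of one aligned lyric fragment: each sung
# character maps to one or more (note number, duration) pairs,
# allowing the one-to-many alignment described above.
aligned = [
    ("我", [(67, 0.5)]),               # one character, one note
    ("想", [(69, 0.25), (67, 0.25)]),  # one character sung over two notes
]
```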

2.2 Network Architecture

LSTM networks, introduced by Hochreiter and Schmidhuber (1997), have been shown effective for a wide variety of sequence-based tasks, including machine translation and speech recognition. They consist of a chain of memory cells that store state through multiple time steps to mitigate the vanishing gradient problem of recurrent neural networks.

The neural network proposed in this paper is a sequence-to-sequence model with a single layer and 100 nodes in both the encoder and decoder LSTM. Since music and lyrics can both be represented as a sequence of events, the generating process can be thought of as estimating the conditional probability

$$p(y_1, \ldots, y_n \mid x_1, \ldots, x_m) = \prod_{t=1}^{n} p(y_t \mid v, y_1, \ldots, y_{t-1})$$

where $x_1, \ldots, x_m$ is the input sequence, $y_1, \ldots, y_n$ is the output sequence, and $v$ is the fixed-dimensional representation of the input sequence produced by the encoder.
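For concreteness, one such subnetwork could be laid out in PyTorch roughly as below. This is a minimal sketch, assuming a single-layer encoder-decoder LSTM with 100 hidden units as described above; the class structure, embedding dimension, and teacher-forcing interface are our assumptions (the authors' actual code is linked in Section 5).

```python
# A minimal sketch of one iComposer subnetwork: a single-layer
# sequence-to-sequence LSTM with 100 hidden units. Names and layout
# are illustrative, not the authors' implementation.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=100, hidden=100):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, num_layers=1, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, num_layers=1, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # Encode the input sequence into a fixed-dimensional state v.
        _, (h, c) = self.encoder(self.src_emb(src))
        # Decode conditioned on v; tgt is the shifted target sequence
        # (teacher forcing during training).
        dec_out, _ = self.decoder(self.tgt_emb(tgt), (h, c))
        return self.out(dec_out)  # logits over the target vocabulary
```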

All the code for this system was written in Python using PyTorch as a backend, and the models were trained using stochastic gradient descent (batch size = 4) with the Adam optimizer and a learning rate of 0.001. Weights were randomly initialized in the range between -0.1 and 0.1.
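A sketch of this training configuration, reusing the Seq2Seq module above, might look as follows; the cross-entropy loss, vocabulary sizes, and batching interface are our assumptions.

```python
# Sketch of the training setup described above: Adam with learning rate
# 0.001, mini-batches of 4, and weights initialized uniformly in
# [-0.1, 0.1]. The loss choice and vocabulary sizes are assumptions.
import torch
import torch.nn as nn

model = Seq2Seq(src_vocab=3000, tgt_vocab=128)  # e.g. characters -> pitches
for p in model.parameters():
    nn.init.uniform_(p, -0.1, 0.1)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

def train_step(src, tgt_in, tgt_out):
    # src: (4, src_len); tgt_in/tgt_out: (4, tgt_len) -- batch size 4
    optimizer.zero_grad()
    logits = model(src, tgt_in)
    loss = criterion(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```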

2.2.1 Melody Generation

The melody is generated using two sequence-to-sequence models. The first predicts the note number, and the second predicts the note duration. As both models take the same text as input, it makes no difference which is run first. Once the pitch and duration of the notes are generated, we synthesize the data as audio using pretty_midi.
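The synthesis step can be sketched with pretty_midi as below; the instrument program and note velocity are our assumptions.

```python
# Sketch: turning predicted (pitch, duration) pairs into audio with
# pretty_midi. Instrument program and velocity are assumptions.
import pretty_midi

def to_audio(notes, fs=22050):
    pm = pretty_midi.PrettyMIDI()
    inst = pretty_midi.Instrument(program=0)  # acoustic grand piano
    t = 0.0
    for pitch, dur in notes:  # e.g. [(60, 0.5), (62, 0.25), ...]
        inst.notes.append(pretty_midi.Note(velocity=100, pitch=pitch,
                                           start=t, end=t + dur))
        t += dur
    pm.instruments.append(inst)
    return pm.synthesize(fs=fs)  # waveform as a NumPy array
```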

2.2.2 Lyrics Generation

In contrast to the melody generation process mentioned above, generating lyrics requires only one sequence-to-sequence model. We first extract note numbers from the given MIDI file, and then feed the sequence of pitches into the LSTM encoder. The lyrics are then automatically generated by the LSTM decoder.
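Decoding from the pitch sequence might look like the following greedy sketch, again using the hypothetical Seq2Seq module above (here with pitches as the source vocabulary); the BOS/EOS token indices and the greedy strategy are our assumptions.

```python
# Sketch of lyrics generation: encode the pitch sequence, then decode
# character indices greedily. BOS/EOS indices are assumptions.
import torch

@torch.no_grad()
def generate_lyrics(model, pitches, bos=1, eos=2, max_len=50):
    src = torch.tensor([pitches])            # (1, src_len) pitch sequence
    _, state = model.encoder(model.src_emb(src))
    tokens = [bos]
    for _ in range(max_len):
        inp = torch.tensor([[tokens[-1]]])
        out, state = model.decoder(model.tgt_emb(inp), state)
        nxt = model.out(out[:, -1]).argmax(-1).item()  # greedy choice
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens[1:]  # generated character indices
```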

3 Evaluation

In this section, we illustrate the behavior of the proposed system with an analysis of some generated lyrics and melodies. After this, we provide and discuss details about the subjective test.


Figure 2: Melody generation model, taking the Chinese lyrics "I thought that although love..." as an example.

Figure 3: Lyrics generation model, taking the Chinese lyrics "I thought that although love..." as an example.

3.1 Analysis

The quality of the generated lyrics can be evaluated based on the variety of words. Here, we focus on this trait to show that our model is learning and improving in a general sense.

According to our statistics on all the generated lyrics, only 1154 distinct words are used at epoch 10. At epoch 50, the vocabulary increases to 1386 words, and at epoch 100 there are 1646 different words in total. As the number of epochs increases, we consistently see a noticeable improvement in the lyrics.
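This diversity measure is simply the number of distinct words used across all generated lyrics at a given epoch, e.g.:

```python
# Vocabulary-diversity metric used above: the number of distinct words
# across all generated lyrics (each lyric as a list of words).
def vocab_size(lyrics):
    return len({w for line in lyrics for w in line})
```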

Figure 4 presents some examples of lyrics generated from a given melody. We can see that many words are repeated within the sequence at epoch 10. However, after further epochs of training, the variety of words becomes greater; at epoch 100, no adjacent repeated words occur.

Figure 4: Lyrics generated from the same melody at different epochs. (Sentences in parentheses are English translations produced by Google Translate.)

3.2 Subjective Test

Since it is challenging to objectively evaluate the quality of songs, we evaluate the success of iComposer with experiments conducted with 30 human participants. All the subjects were college students aged 18 to 25, as the primary consumers of popular music are in this age group. In addition, we required our subjects to have at least five years of instrument-playing or songwriting experience to ensure the reliability of our survey feedback.

The questionnaire contained 10 question groups, each with 3 melody clips or 3 sets of lyrics generated by humans, iComposer, or the baseline model. These three types of songs were arranged in random order, and each clip was approximately 15 seconds in length. None of the thirty chosen melodies and lyric sets appeared in the training set, and they were thus unseen by the model. In each case, an attempt was made to find less commonly known songs, so that the generated melodies and lyrics could be compared more fairly with the originals.

For our baseline method, random notes were selected from MIDI note numbers 60–80, as pitches appear most frequently in this range according to the note distribution of our dataset, shown in Figure 5. The randomly generated lyrics were chosen from the 1000 most frequently used words in the dataset.

Figure 5: Frequency distribution of notes in the dataset
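For illustration, this baseline amounts to the following sketch; the word-frequency table is assumed to be precomputed from the dataset.

```python
# Sketch of the random baseline: notes drawn uniformly from MIDI
# numbers 60-80; lyrics drawn from the 1000 most frequent words in
# the dataset (assumed precomputed here as `top_words`).
import random

def baseline_melody(n_notes):
    return [random.randint(60, 80) for _ in range(n_notes)]

def baseline_lyrics(n_words, top_words):
    return [random.choice(top_words) for _ in range(n_words)]
```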

Survey participants were asked to listen to a mixture of melodies and lyrics generated by humans, iComposer, and our baseline model. It was not until the completion of the experiment that subjects were informed that some selections were computer-generated. During the experiment, subjects were asked to subjectively rate the melodies and lyrics from 1 to 10 in terms of the following standards (larger scores indicate better quality):

• Melody: How smooth are the melodies?

• Lyrics: How meaningful are the lyrics?

• Overall: How well do the melodies fit with the lyrics?

Table 1: Subjective Test Results

Model      Melody  Lyrics  Overall
baseline   3.12    2.43    3.24
iComposer  6.23    6.38    5.72
human      7.55    6.90    7.69

Table 1 shows the average responses to each question on the questionnaire mentioned above. According to the results, iComposer outperforms the baseline model on all three metrics, indicating its ability to generate songs in a more natural way. Moreover, melodies and lyrics generated by iComposer are rated only slightly lower than human creations, which further suggests its effectiveness in songwriting. We can also observe from Table 1 that subjects gave significantly lower scores to the baseline model on the Lyrics metric. This may be due to the obvious grammatical errors made by the baseline, which make the other two systems look much better by comparison. In contrast, an unexpected melody can still demonstrate an acceptable level of melodic pleasantness and singability; therefore, the gap between randomly generated melodies and those created by iComposer is not as large.

We also provide the average score of each question group as rated by the 30 subjects. As shown in Figure 6, our model is capable of creating songs that are close to human creations in most cases. The model even produces better lyrics for lyric sets 1 and 5.

Figure 6: Average aesthetic score of 15 melody clips and 15 sets of lyrics as rated by 30 human participants. (a) Average score of each melody; (b) average score of each lyrics segment.


4 Discussion

Our goal in this preliminary study is to build an artificial artist. The survey feedback illustrates that iComposer is promising.

Learning the structure of Chinese popular music is merely our first attempt: many related issues remain that merit further exploration. For example, in Chinese the same syllable can be pronounced in four different tones: high level, rising, falling then rising, and falling. Therefore, we could rate the fluidity of a song by taking into account the conformity between its melodic pitch contours and the lexical tones of its lyrics. Another novel task that interests us is the distinction between verse and chorus, namely, what particular features distinguish the two. Taking these issues into account would dramatically improve our current work.

5 User Interface

iComposer is accessible via a web interface that allows users either to enter text data or to upload MIDI files. Once the text or melody has been given, users click the submit button to automatically compose a song.

Figure 7: The web interface, which allows users either to enter text data or to upload MIDI files.

After submission, users are directed to the music player interface. The lyrics shown on the page scroll down automatically in sync with the music; the system plays the music and highlights the lyrics being sung. If users want to save the song, iComposer provides it in MIDI format for download.

Figure 8: Music player interface, which plays the music and highlights the lyrics being sung.

Our iComposer interface is accessible at: http://ckip.iis.sinica.edu.tw/service/iComposer

The demonstration video is posted on YouTube at: https://www.youtube.com/watch?v=Gstzqls2f4A

The source code is available at: https://github.com/hhpslily/iComposer

References

Hangbo Bao, Shaohan Huang, Furu Wei, Lei Cui, Yu Wu, Chuanqi Tan, Songhao Piao, and Ming Zhou. 2018. Neural melody composition from lyrics. arXiv preprint arXiv:1809.04318.

Keunwoo Choi, George Fazekas, and Mark Sandler. 2016. Text-based LSTM networks for automatic music composition. In Proceedings of the 1st Conference on Computer Simulation of Musical Creativity.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Peter Potash, Alexey Romanov, and Anna Rumshisky. 2015. GhostWriter: Using an LSTM for automatic rap lyric generation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1919–1924.

Colin Raffel and Daniel P. W. Ellis. 2014. Intuitive analysis, creation and manipulation of MIDI data with pretty_midi. In Proceedings of the 15th International Society for Music Information Retrieval Conference Late Breaking and Demo Papers, pages 84–93.

Kento Watanabe, Yuichiroh Matsubayashi, Satoru Fukayama, Masataka Goto, Kentaro Inui, and Tomoyasu Nakano. 2018. A melody-conditioned lyrics language model. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.