

    978-1-4673-7478-1/15/$31.00 © 2015 IEEE


    Chia-Hao Chung and Homer Chen

    National Taiwan University

    Emails: {b99505003, homer}


    The flow of emotion expressed by music through time is a

    useful feature for music information indexing and

    retrieval. In this paper, we propose a novel vector representation of emotion flow for popular music. It

    exploits the repetitive verse-chorus structure of popular

    music and connects a verse (represented by a point) and

    its corresponding chorus (another point) in the valence-

    arousal emotion plane. The proposed vector representation

    visually gives users a snapshot of the emotion flow of a

    popular song in an intuitive and instant manner and is more

    effective than the point and curve representations of music

    emotion flow. Because many other genres also have

    repetitive music structure, the vector representation has a

    wide range of applications.

    Index Terms—Affective content, emotion flow,

    music emotion representation, music structure.


    It is commonly agreed that music listening is an appealing

    experience for most people because music evokes emotion

    in listeners. As emotion conveyed by music is important

    to music listening, there is a strong need for effective

    extraction and representation of music emotion from the

    music organization and retrieval perspective. This paper focuses on music emotion representation.

    A typical approach to music emotion representation

    condenses the entire emotion flow of a song to a single

    emotion. This approach is adopted by most music emotion

    recognition (MER) systems [1]–[3]. It works by selecting

    a certain segment from the song and mapping the musical

    features extracted from the segment to a single emotion.

    The emotion representation is either a label, such as

    happy, angry, sad, or relaxed, or the coordinates of a point

    in, for example, the valence-arousal (VA) emotion plane

    [4]. The former is a categorical representation, while the

    latter is a dimensional representation [5]. A user can query songs through either form of single-point music emotion

    representation, and a music retrieval system responds to

    the query with songs that match the emotion specified by

    the user [6], [7].
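
    As a minimal sketch of query by a single-point dimensional representation, the snippet below ranks songs by Euclidean distance to a query point in the VA plane. The catalogue, its values, the function name `query_by_emotion`, and the choice of Euclidean distance are all assumptions made for illustration; the retrieval systems cited above may match emotions differently.

```python
import math

# Hypothetical catalogue: song id -> (valence, arousal), each assumed in [-1, 1].
# The values are invented for illustration and do not come from the paper.
CATALOGUE = {
    "song_a": (0.8, 0.6),
    "song_b": (-0.7, 0.5),
    "song_c": (-0.6, -0.4),
    "song_d": (0.5, -0.5),
}


def query_by_emotion(valence, arousal, catalogue, top_k=2):
    """Return the ids of the top_k songs closest to the query point."""

    def distance(item):
        v, a = item[1]
        return math.hypot(v - valence, a - arousal)

    ranked = sorted(catalogue.items(), key=distance)
    return [song_id for song_id, _ in ranked[:top_k]]
```

    For example, `query_by_emotion(0.7, 0.7, CATALOGUE)` ranks the song with high valence and high arousal first. A categorical query could be handled the same way by first mapping each label to a representative point, which is again an assumption rather than something the text prescribes.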

    However, the emotion of a music piece varies as it

    unrolls in time [8]. This dynamic nature has not been fully

    explored for music emotion representation, perhaps

    because the emotion flow of music is difficult to qualify

    or quantify during data collection and model training [1]. The

    closest related work is music emotion tracking

    [9]–[12], which generates a sequence of points at regular intervals to form an affect curve in the emotion plane [13].

    Four examples are shown in Fig. 1, where each curve is

    generated by dividing a full song into 30-second segments

    with 10-second hop size and by predicting the VA values

    of all segments. Each curve depicts the emotion of a song

    from the beginning to the end. We can see that the

    variation of music emotion can be quite complex and that

    a point representation cannot properly capture the

    dynamics of music emotion.
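
    The windowing described above (30-second segments with a 10-second hop) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the VA predictor is passed in as a placeholder callable, and dropping a trailing partial window is our choice because the text does not say how the end of a song is handled.

```python
def segment_windows(duration, window=30.0, hop=10.0):
    """Return (start, end) times in seconds of every full window in the song.

    Assumption: a trailing window shorter than `window` is dropped.
    """
    windows = []
    start = 0.0
    while start + window <= duration:
        windows.append((start, start + window))
        start += hop
    return windows


def affect_curve(duration, predict_va, window=30.0, hop=10.0):
    """Build an affect curve as a list of (valence, arousal) points.

    `predict_va(start, end)` is a hypothetical stand-in for a trained
    emotion recognition model applied to one segment of the song.
    """
    return [predict_va(s, e) for s, e in segment_windows(duration, window, hop)]
```

    With these defaults, a 60-second excerpt yields four overlapping windows starting at 0, 10, 20, and 30 seconds, so the resulting curve has four points.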

    The representation of emotion flow for music should

    be easy to visualize, yet sufficiently informative to convey

    the dynamics of music emotion. The conventional point representation of music emotion is the simplest one;

    however, it does not contain any dynamic information about

    music emotion. On the other hand, the affect curve can

    show the dynamics of music emotion fairly well, but it is too

    complex for users to specify. Clearly, simplicity and

    informativeness are two competing criteria, and a certain

    degree of tradeoff between them is necessary in practice.

    Fig. 1. Affect curves of four songs in the VA plane where diamonds indicate the beginning and circles indicate the end of the songs. The black curve is Smells Like Teen Spirit by Nirvana. The blue curve is Are We the Waiting by Green Day. The green curve is Dying in the sun by The Cranberries. The red curve is Barriers by Aereogramme.

    It has been reported that the emotion expressed by a

    music piece has to do with music structure. Schubert et al.

    [14] showed that music emotion flow can be attributed to

    the changes of music structure. Yang et al. [15] reported that the boundaries between contrasting segments of a

    music piece have rapid changes of VA values. Wang et al.

    [16] showed that exploiting the music structure of popular

    music for segment selection improves the performance of

    an MER system. For popular music, the music structure

    usually consists of a number of repetitive musical sections

    [17]. Each musical section refers to a song segment that

    has its own musical role such as verse or chorus. As

    shown in Fig. 2, popular music typically has a repetitive

    verse-chorus structure and its emotion flow changes

    significantly during the transition between verse and

    chorus sections.

    The burgeoning evidence of the strong relation between music structure and emotion flow motivates us to

    develop an effective representation of emotion flow for

    music retrieval. The proposed emotion flow representation

    of a song is a vector in the VA emotion plane, pointing

    from the emotion of a verse to the emotion of its

    corresponding chorus. This representation is simple and

    intuitive, which is made possible by exploiting the

    repetitive property of music structure of popular music.

    We focus on popular music in this paper because it has

    perhaps the largest user base on a daily basis and because

    its structure is normally within a finite set of well-known patterns [18]–[22].
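
    The geometry of the proposed representation can be illustrated with a short sketch. It assumes that the emotion of a verse and of its corresponding chorus are already available as (valence, arousal) points; how those points are estimated is not covered here. The class name `EmotionVector` and its properties are hypothetical names introduced for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionVector:
    """Vector in the VA plane pointing from a verse to its chorus."""

    verse: tuple   # (valence, arousal) of the verse
    chorus: tuple  # (valence, arousal) of the corresponding chorus

    @property
    def displacement(self):
        """Change in (valence, arousal) from verse to chorus."""
        return (self.chorus[0] - self.verse[0],
                self.chorus[1] - self.verse[1])

    @property
    def magnitude(self):
        """Length of the vector, i.e. how far the emotion moves."""
        return math.hypot(*self.displacement)

    @property
    def direction_deg(self):
        """Angle in degrees, counterclockwise from the valence axis."""
        d_valence, d_arousal = self.displacement
        return math.degrees(math.atan2(d_arousal, d_valence))
```

    A song whose chorus is more aroused than its verse at unchanged valence, for instance `EmotionVector(verse=(0.0, 0.0), chorus=(0.0, 0.5))`, gives a vector pointing straight up the arousal axis. Storing the two endpoints rather than only the displacement keeps the starting emotion available, which a displacement alone would discard.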

    In summary, the primary contributions of this paper are as follows:


    • A study on the music structure of popular music,

    such as pop, R&B, and rock songs, is conducted to

    demonstrate the repetitive property of the music

    structure of popular music (Section 2).

    • A novel vector representation of emotion flow for

    popular music is proposed. A comprehensive

    comparison of the proposed vector representation

    with the point and curve representations is presented (Sections 3 and 4).

    • A performance study is conducted to demonstrate

    the accuracy and effectiveness of the vector

    representation in capturing the emotion flow of a

    song (Section 5).


    Music is an art form of organized sounds. A popular song

    can be divided into a number of musical sections, such as

    introduction (intro), verse, chorus, bridge, instrumental

    solo, and ending (outro) [18]. Such sections are arranged (possibly with repetition) in a particular pattern referred to as

    musical form. Recovering the musical form is called

    music structure analysis and can be considered a segmentation process that detects the temporal position

    and duration of each segment [19]. Here, we briefly

    review the common musical sections and their musical roles.


    Intro and outro indicate the beginning and the ending

    sections, respectively, of a song and usually only contain

    instrumental sounds without singing voice and lyrics.

    However, not every song has an intro or outro. For example,

    composers may place a verse or a chorus in the beginning

    or at the end of a song to make the song sound special.

    The sections corresponding to verse or chorus normally

    express a flow of emotion as the music unfolds. The verse usually has low energy, and it is the place where the story

    of the song is narrated. Compared to the verse, the chorus is

    more emotive and leaves a significant impression on listeners

    [20]. Other structural elements, such as bridge and

    instrumental solo, are optional and function as transitional

    sections to avoid monotonous composition and to make

    the song colorful. A bridge is a transition between other

    types of sections, and an instrumental solo is a

    transitional section consisting predominantly of instrumental sounds.

    To investigate music structure, we conduct an analysis

    of NTUMIR-60, which is a dataset consisting of 60 English popular songs [23]. Because the state-of-the-art

    automatic music structure analysis is not as accurate as

    expected [19], [21], we perform the analysis manually.

    The results are shown in Table 1. We can see that verse

    and chorus indeed make up a large portion of a song and, on

    average, appear 3.13 and 2.37 times, respectively, per

    song. This is consistent with the findings by musicologists

    that the alternation of verse and chorus is a widely used musical form (known as

    the verse-chorus form) among songwriters of popular music

    [20]. This also suggests that verse and chorus are the most

    memorable sections of a song [22] and represent the main

    affect of the song. The corresponding emotion flow gives listeners an affective sensation.
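
    The per-song averages reported above could be computed from manual annotations along the following lines. This is only a sketch: the annotation format (one ordered list of section labels per song) and the function name `mean_occurrences` are assumptions, and the two toy songs in the usage note are invented rather than taken from NTUMIR-60.

```python
from collections import Counter


def mean_occurrences(annotations):
    """Average number of times each section label appears per song.

    `annotations` is a list of songs, each given as an ordered list of
    section labels such as "intro", "verse", "chorus", "bridge", "outro".
    """
    if not annotations:
        return {}
    totals = Counter()
    for sections in annotations:
        totals.update(sections)
    n_songs = len(annotations)
    return {label: count / n_songs for label, count in totals.items()}
```

    For two toy songs annotated as `["intro", "verse", "chorus", "verse", "chorus", "outro"]` and `["verse", "chorus", "verse", "chorus", "bridge", "chorus"]`, the function reports that verse appears 2.0 times and chorus 2.5 times per song on average.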