Luís Carlos de Almeida Espírito Santo - ULisboa
Automatically Generating Novel and Epic Music Tracks
Exploring Computational Creativity using Deep Structures against Music
Luís Carlos de Almeida Espírito Santo
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisors: Prof. Helena Sofia Andrade Nunes Pereira Pinto, Prof. David Manuel Martins de Matos
Examination Committee
Chairperson: Prof. Francisco João Duarte Cordeiro Correia dos Santos
Supervisor: Prof. Helena Sofia Andrade Nunes Pereira Pinto
Members of the Committee: Prof. Fernando Amílcar Cardoso
June 2019
Acknowledgments
I would like to thank everyone who assisted my continuous learning process over all these years.
I would like to thank my supervisor Sofia Pinto and co-supervisor David Matos, not only for the guidance both provided during this work, but also for helping me build and improve my knowledge during these last years. In the same way, I would like to thank all the other teachers who shaped me into who I am today.
I would like to thank everyone that answered my survey and all those people that in one way or
another demonstrated interest in my project. In addition, I wish to acknowledge the help provided by all
those that contributed directly with some feedback or improvement proposals.
I would also like to write a word of gratitude to all my friends and colleagues, from all different contexts and areas, for providing me with a variety of safe places where I could relax, laugh and learn something different every day, which helped me grow as a person and allowed me to develop the ideas in this work.
Finally, I express my very great appreciation to my parents and my sister for teaching me friendship, tolerance, balance, patience and care, as well as confidence and resilience, throughout all my years of existence. I would also like to extend my gratitude to the rest of my family for their support and friendship throughout all these years.
Last but not least, I would like to offer my special thanks to my girlfriend, for sharing so many years with me, for teaching me how to manage every aspect of myself, through the good and bad times in my life, so as to always become a better version of myself, for always being there for me through thick and thin, and without whom this project would not have been possible.
To each and every one of you – Thank you.
Abstract
Computational Creativity is an applied field of study focused on algorithms that allow a better understanding of creative processes or that simply perform tasks usually considered creative. Among these models we can find some Deep Learning models, such as Restricted Boltzmann Machines and Generative Adversarial Networks, also widely studied outside the scope of Computational Creativity. In addition, we can distinguish different application areas within Computational Creativity, such as music or the visual arts. With the purpose of exploring the capability of these models to work with musical dynamics, this work focuses on the application of neural models to multitrack epic music generation, following a general approach and a vision consistent with the field of Computational Creativity. Three different models were developed, adapted and compared: the HRBMM, the MuseGAN and the MuCyG. After conducting a survey and analyzing the results obtained, we conclude that none of the computational models consistently outperformed the others. The results also indicate that the methodology used led to problems of mode collapse and possibly prevented the models from producing products capable of causing an effect similar to that of human-composed epic samples.
Keywords
Music; Deep Learning; Creativity; Epic; Generative Models.
Resumo
A Criatividade Computacional é uma área aplicada que estuda algoritmos que permitem compreender a criatividade ou que desempenham tarefas usualmente consideradas criativas. Entre estes modelos encontram-se alguns modelos de Deep Learning, nomeadamente as Restricted Boltzmann Machines e as Generative Adversarial Networks, também vastamente estudados fora da área de Criatividade Computacional. Também dentro desta área podemos distinguir diferentes áreas de aplicação, como a música ou as artes visuais. Com o propósito de explorar a capacidade destes modelos trabalharem com dinâmicas musicais, este trabalho pretende focar-se na aplicação de modelos neuronais à tarefa de geração de música multitrack épica, seguindo uma abordagem geral e uma visão concordante com a área da Criatividade Computacional. Três modelos foram desenvolvidos, adaptados e posteriormente comparados: o HRBMM, o MuseGAN e o MuCyG. Depois de conduzir um questionário e de analisar os resultados obtidos, concluímos que nenhum destes modelos obteve avaliações consistentemente melhores que os outros. Os resultados também indicam que a metodologia usada conduziu a problemas de mode collapsing e que os produtos gerados não foram capazes de afetar o ouvinte da mesma forma que excertos épicos compostos por humanos.
Palavras-Chave
Música; Redes Neuronais; Criatividade; Épico; Modelos Gerativos.
Contents
1 Introduction 1
1.1 Terminology Concerns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Document Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Related Work 7
2.1 Creativity Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Models for Human Creativity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Models for Computational Creativity . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Deep Learning Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 General Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) . . . 19
2.2.3 Generative Deep Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.4 Cyclical Generative Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Automatic Music Composition Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.1 General Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.2 Using Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 Dataset 31
3.1 Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Datasets Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Models 43
4.1 Environment and Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 HRBMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 MuseGAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 MuCyG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.5 Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5 Results and Evaluation 55
5.1 Final Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2.1 Word Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.2 Impacts on Creativity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2.3 Evaluating Confronting Pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6 Conclusion 75
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7 Epic Dataset Reference List 85
8 Confrontations Results 93
9 Survey Example in English 114
List of Figures
2.1 Artificial neuron scheme, adapted from [1] . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Commonly used activation functions’ plots . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Convolution filter operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Variational Auto-Encoder architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Generative Adversarial Networks architecture . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Cyclical models common architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1 Most commonly used rhythmic figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Schematic illustration of representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Evolution of the volume along an average epic song in the new epic dataset . . . . . . . . 40
4.1 Hierarchical Restricted Boltzmann Musical Machine (HRBMM) architecture . . . . . . . . 47
4.2 Convolutional network architecture used in MuseGAN and MuCyG for the epic dataset . . 51
4.3 Deconvolutional network architecture used in MuseGAN and MuCyG for the epic dataset 51
4.4 Resulting plots for learning rate study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.1 Pianoroll representation of Human composed samples used in survey . . . . . . . . . . . 58
5.2 Pianoroll representation of HRBMM’s samples used in survey . . . . . . . . . . . . . . . . 59
5.3 Pianoroll representation of MuseGAN’s samples used in survey . . . . . . . . . . . . . . . 60
5.4 Pianoroll representation of Musical CycleGAN (MuCyG)’s samples used in survey . . . . 61
5.5 Resulting Directed Acyclic Graphs (DAGs) from analysis of confrontation graphs on ”Creativity” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.6 Resulting DAG’s from analysis of confrontation graphs on ”Inspiring” . . . . . . . . . . . . 69
5.7 Resulting DAG’s from analysis of confrontation graphs on ”Novelty” . . . . . . . . . . . . . 70
5.8 Resulting DAG’s from analysis of confrontation graphs on ”Epic” . . . . . . . . . . . . . . 71
5.9 Resulting DAG’s from analysis of confrontation graphs on ”Cinematography” . . . . . . . . 72
8.1 Number of confrontations between each pair of samples on ”Creativity” . . . . . . . . . . 94
8.2 Number of confrontations between each pair of samples on ”Inspiring” . . . . . . . . . . . 95
8.3 Number of confrontations between each pair of samples on ”Novelty” . . . . . . . . . . . . 96
8.4 Number of confrontations between each pair of samples on ”Epic” . . . . . . . . . . . . . 97
8.5 Number of confrontations between each pair of samples on ”Cinematography” . . . . . . 98
8.6 Number of won and lost confrontations for each pair of samples on ”Creativity” . . . . . . 99
8.7 Number of won and lost confrontations for each pair of samples on ”Inspiring” . . . . . . . 100
8.8 Number of won and lost confrontations for each pair of samples on ”Novelty” . . . . . . 101
8.9 Number of won and lost confrontations for each pair of samples on ”Epic” . . . . . . . . . 102
8.10 Number of won and lost confrontations for each pair of samples on ”Cinematography” . . 103
8.11 Percentage of won and lost confrontations for each pair of samples on ”Creativity” . . . . 104
8.12 Percentage of won and lost confrontations for each pair of samples on ”Inspiring” . . . . . 105
8.13 Percentage of won and lost confrontations for each pair of samples on ”Novelty” . . . . . 106
8.14 Percentage of won and lost confrontations for each pair of samples on ”Epic” . . . . . . . 107
8.15 Percentage of won and lost confrontations for each pair of samples on ”Cinematography” 108
8.16 Number of tied confrontations for each pair of samples on ”Creativity” . . . . . . . . . . . 109
8.17 Number of tied confrontations for each pair of samples on ”Inspiring” . . . . . . . . . . . . 110
8.18 Number of tied confrontations for each pair of samples on ”Novelty” . . . . . . . . . . . . 111
8.19 Number of tied confrontations for each pair of samples on ”Epic” . . . . . . . . . . . . . . 112
8.20 Number of tied confrontations for each pair of samples on ”Cinematography” . . . . . . . 113
List of Tables
3.1 Characterization of the new datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1 Comparison between tools for Machine Learning (ML) models development . . . . . . . . 46
5.1 Summary of time spent in music related hobbies . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Summary of our sample’s age, knowledge on the project and relationship with music . . . 63
5.3 Four most frequently used words per model . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4 Summary of the results about the impact on creativity . . . . . . . . . . . . . . . . . . . . 65
5.5 Confrontations results summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.6 Confrontations ranking based on percentage of won games . . . . . . . . . . . . . . . . 67
5.7 Confrontation ranking based on DAG’s topological order . . . . . . . . . . . . . . . . . . 67
7.1 Full list of epic music samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Acronyms
Adam Adaptive Moment Estimation
AI Artificial Intelligence
AMG Automatic Music Generation
ANN Artificial Neural Networks
BVSR Blind Variation and Selective Retention
BPM Beats per Minute
CC Computational Creativity
CNN Convolutional Neural Networks
CT Convergent Thinking
DAG Directed Acyclic Graph
DT Divergent Thinking
DL Deep Learning
Emmy Experiments in Musical Intelligence
GA Genetic Algorithm
GAN Generative Adversarial Networks
GSN Generative Stochastic Networks
GPU Graphics Processing Unit
GUI Graphical User Interface
HMM Hidden Markov Models
HRBMM Hierarchical Restricted Boltzmann Musical Machine
INESC-ID Instituto de Engenharia de Sistemas e Computadores - Investigação e Desenvolvimento
LSTM Long Short-Term Memory
MIDI Musical Instrument Digital Interface
MIR Music Information Research
MuCyG Musical CycleGAN
ML Machine Learning
LReLU Leaky Rectified Linear Unit
ReLU Rectified Linear Unit
RNN Recurrent Neural Networks
RBM Restricted Boltzmann Machines
SGD Stochastic Gradient Descent
SVM Support Vector Machine
VAE Variational Auto-Encoder
WGAN-GP Wasserstein Generative Adversarial Networks with Gradient Penalty
1 Introduction
Contents
1.1 Terminology Concerns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Document Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
In recent years, we have witnessed the emergence of start-ups focusing on Automatic Music Generation (AMG), such as Alysia1, Amper2, Hexachords3 and Jukedeck4. Some big companies have also started to demonstrate interest and to invest in these technologies by creating, financing and showing projects and partnerships, such as Google’s Magenta5, which presented BachDoodle6 on 21 March 2019, IBM’s Watson Beat7, Sony’s Flow Machines8, Aiva9 with NVIDIA, and the TransProse10 project’s involvement with Accenture, to name a few.
Nowadays, most of these projects on ”machine-made” songs are used only for advertising, and most of them are either based on human-specified rules or subjected to a strong human reviewing process. Moreover, scientific research in this area usually focuses on a specific task, such as melody generation [2,3], or on a specific scope of music, such as Baroque music [4], Jazz [5,6] or Pop-Rock [7,8], in order to scale down the music generation problem. We consider that the strong dependence of the quality of machine-generated music products on human intervention, together with the over-specialization generally found in research on music generation, faithfully illustrates the current landscape of AMG and helps to clarify how far we are from creating a fully automatic composer. However, the emergence of new data-driven technologies is dramatically changing this landscape, promising important developments in generative models in the near future.
Generative Adversarial Networks (GAN) [9] are a Deep Learning (DL) model, presented in 2014, that has been generating very interesting products, mostly in the visual field [10–12]. Since this model was presented, adaptations to the music domain have been proposed [13–15], but many different aspects of musical creativity are yet to be explored. Widmer, in his Con Espressione Manifesto of 2016, points out that “[m]usic is expressive and affects us”, so good computer-generated music products should affect people as well. In order to explore the potential of these models to affect people, we focused our study on one style of music that deeply relies on the effect it causes: epic music.
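To make the adversarial idea behind GAN concrete, the original minimax objective from [9] can be written as two complementary losses, one per network. The sketch below is only an illustration of that objective (the helper names are ours, not taken from any specific library or from the models described later):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # D tries to maximize log D(x) + log(1 - D(G(z))),
    # i.e. assign high scores to real samples and low scores to fakes
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    # non-saturating variant: G maximizes log D(G(z)),
    # i.e. tries to make the discriminator accept its fakes
    return -np.mean(np.log(d_fake))

# example: a near-perfect discriminator yields a low D loss, high G loss
d_real = np.array([0.99, 0.98])   # D's scores for real samples
d_fake = np.array([0.01, 0.02])   # D's scores for generated samples
print(discriminator_loss(d_real, d_fake))  # close to 0
print(generator_loss(d_fake))              # large (around 4)
```

Training alternates gradient steps on these two losses, which is what drives the generator toward producing samples the discriminator cannot distinguish from the data.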
As pointed out by van Elferen, “the ’epic’ soundtrack idiom is based on the recognizable idiom of classic orchestral film scoring” [16]. This definition highlights two different qualities of this style: the importance of symphonic timbres (especially strings, brass and percussion), as well as the recurrent usage of the symbolism and semiotics defined and reproduced over and over again in multimedia content. However, this point of view can be very restrictive, and as Hans Zimmer, a well-known composer widely regarded as epic, states:
1 https://www.withalysia.com
2 https://www.ampermusic.com
3 https://www.hexachords.com
4 https://www.jukedeck.com/
5 https://magenta.tensorflow.org
6 https://www.google.com/doodles/celebrating-johann-sebastian-bach
7 https://www.ibm.com/case-studies/ibm-watson-beat
8 https://www.flow-machines.com
9 https://www.aiva.ai
10 https://www.musicfromtext.com
It’s usually not the size of the orchestra or the production that makes things sound epic, it’s
usually the commitment of the players. A great string quartet can sound louder when they
play with fire and heart, than a boring orchestra, and a single note by [rock guitarist and
collaborator] Jeff Beck can slice right through your heart.
(Hans Zimmer, 2013, in [17])
This opinion breaks the previous strict bond between epic music and the symphonic orchestra by stating that even an electric guitar can play epic music. In addition, both musical dynamics and the expression of feelings are referred to as important aspects of epic music. We can summarize this style as commonly characterized by repetitive rhythmic movements, as well as decisive variations of harmony, intensity and tension, capable of conveying emotions to listeners who are familiar with the musical symbols and signs commonly used in multimedia content.
With this definition in mind, we looked for fresh insights into DL’s capability of modeling music in multi-instrument symbolic representations by exploring the generation capacities of three different models, trained against an original epic music dataset: the Hierarchical Restricted Boltzmann Musical Machine (HRBMM), MuseGAN [8] and the Musical CycleGAN (MuCyG).
Thus, the main aim of this work is to explore DL models and their capacity to create new epic music, based on small examples of epic music and possibly inspired by one or more melodic lines. A good model must verify some properties: the generated products must be considered creative (both novel and useful [18]) epic music excerpts and, at the same time, the model should, in theory, represent a reusable methodology for different musical categories (recursively enumerable sets of musical content). As a secondary goal, this work aims at increasing our understanding of DL in general and of how DL models can be used to study human creativity.
The models were evaluated using an online survey encompassing three different kinds of questions: a single-word description question; a question where products were confronted against each other in a two-”player” game arbitrated by the user (representing an audience), who decides the winner; and a question about the impact of knowledge on the perception of musical creativity.
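To give an idea of how such pairwise confrontations can be summarized, the following sketch tallies wins and games per sample and ranks samples by the percentage of won games. This is a simplified illustration with hypothetical names; the actual analysis, including the DAG-based ranking, is described in Chapter 5:

```python
from collections import defaultdict

def rank_by_wins(confrontations):
    """confrontations: list of (sample_a, sample_b, winner) tuples,
    where winner is sample_a, sample_b, or None for a tie."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for a, b, winner in confrontations:
        games[a] += 1
        games[b] += 1
        if winner is not None:
            wins[winner] += 1
    # rank by percentage of won games, highest first
    return sorted(games, key=lambda s: wins[s] / games[s], reverse=True)

results = [
    ("human", "HRBMM", "human"),
    ("human", "MuseGAN", None),       # tie
    ("MuseGAN", "HRBMM", "MuseGAN"),
]
print(rank_by_wins(results))  # ['human', 'MuseGAN', 'HRBMM']
```

Because Python's sort is stable, samples with equal win percentages keep their first-seen order, which is one simple way to break ties.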
The results showed that the models were not able to express sentiment through the generated products, but also that a randomly selected human-composed sample was not able to consistently outperform the products of the models.
1.1 Terminology Concerns
When something is created, it is said to have come into being as a new whole entity. However, a new entity may be similar to other entities that already exist, and in this case the new entity is considered not novel. The disparate usage of the terms “new” and “novel” as words with different meanings requires special attention when talking about creativity, and will be taken into account in this document.
Moreover, when a new, novel and different entity does not fulfill the expectations that fostered its own creation, it will be considered not ”useful”. Authors on creativity propose different terminologies for this dimension of creative artifacts, although the term ”value” is the most frequently used when referring to music and other arts. Yet this term has the inconvenience that its common usage is strongly and positively connected to the artifact’s novelty. Therefore, in order to better reflect the independence of these orthogonal and possibly antagonistic components of creative artifacts (”novelty” and ”utility”), we prefer the term ”utility” to refer to every aspect that contributes to the ”value” of an artifact apart from its ”novelty”. In addition, although “all art is quite useless” [19] in the common sense of ”utility”, as pointed out by Oscar Wilde, we may consider that art’s ”utility” is to fulfill one or more criteria of aesthetics or beauty. So, in this work, we will accept that music and art are useful in some way.
In this document, we also terminologically differentiate between music produced by a computer and by a human by adopting distinct verbs: a musical artifact is ”composed” by a human, while it is ”generated” by a computer. This distinction does not aim to separate human composition from computer generation procedures in what concerns creativity; it serves only to improve text clarity and provide a better understanding of this work.
1.2 Document Structure
This thesis is organized as follows:
• Chapter 2 overviews three distinct but intersecting areas: Section 2.1 presents a summary of creativity models for human and computational creative tasks, Section 2.2 goes from basic knowledge on DL to recently presented state-of-the-art DL models, and Section 2.3 presents some general concepts on music generation systems, focusing on DL models;
• Chapter 3 describes the processes of acquiring, processing and storing data for the new datasets;
• Chapter 4 presents, in detail, the development environment, the training methodology and the three models implemented and explored: HRBMM, MuseGAN and MuCyG;
• Chapter 5 summarizes and discusses the results;
• finally, Chapter 6 systematizes this work and presents possible future developments.
2 Related Work
Contents
2.1 Creativity Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Deep Learning Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Automatic Music Composition Related Work . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1 Creativity Related Work
2.1.1 Models for Human Creativity
Since the dawn of time, humans have created tools and concepts to help modify and understand the real world. However, the first usage of different words for the concepts “to make” and “to create” is nowadays traced back to Ancient Roman roots, with the words “facere” and “creare” [7]. The concept “to create” has suffered several shifts in meaning over time: it was divine until the Renaissance, it was considered purely innate in the 19th century, and in the 20th century it gained even more meanings. The use of the noun “creativity” became popular only in the 1920s, replacing the expression “creative imagination”. Further, when, in 1950, Guilford [20] announced creativity as a “human” capacity liable to be measured, a chase for new studies about this new ”power” began. Creativity has been explained through different metaphors and using different concepts [21]: from madness and possession to evolution and organism, from incubation and illumination to divergence and algorithms, from investment to democratic attunement... Despite this long journey through different and conflicting points of view, creativity continues to have antagonistic meanings for different people. Therefore, we are forced to conclude, using d’Inverno and Still’s words, that “Creativity needs Creativity to explain itself” [22].
Moving away from the search for the origin of creativity, and focusing instead on what is needed to consider a manifestation or product creative, different authors have proposed several important dimensions of creative artifacts, such as novelty, utility, value, beauty and surprise. Nowadays, according to Mumford, “we seem to have reached a general agreement that creativity involves the production of novel, useful products” [18].
One of the main problems that emerges when we talk about creativity is that this term may be used to refer to many different degrees of novelty. It seems absurd to compare Salvador Dalí’s work with a child’s drawing, but both may be considered creative works. In order to separate these different kinds of creative acts, Beghetto and Kaufman [23] distinguish four types of creativity and propose “the four C model of creativity”:
• Big-C: this class includes most of the cases in which people casually talk about creativity. It refers to creative acts that are historically recognized as creative and represent an important change in people’s mentality. Beghetto and Kaufman were not the first to identify this class, which corresponds to Boden’s H-creativity concept.
• Little-C: corresponds to creative acts that happen in everyday life. This category is similar to Boden’s P-creativity and was also not first distinguished by these two authors.
• Mini-C: this new class, proposed by Beghetto and Kaufman in 2007, contains every subjective experience of creativity when we learn something new.
• Pro-C: described by the same authors only in 2009, this degree tries to recognize the creative acts between little-c and Big-C, when professionals develop creative ideas that never get to be historically remembered.
Since Guilford [20], most creativity studies have focused exclusively on creative products. These kinds of approaches have been criticized for being unable to explore the full picture of creativity. D’Inverno and Still [22] propose to model creativity by building a closed system where autonomous agents act on their social world and where creativity emerges from this interaction, without actively trying to produce tangible products. When we try to analyze creativity beyond the creative product itself, we may find that in a creative act, according to Rhodes [24], there are four main components that interact with each other, known as “the four P’s of creativity”:
• the Product: also known as the “creation”, which may be a concept, an idea, a story, a joke, a song, a performance, a cooking recipe, or simply a phrase;
• the Person: the entity behind the creative act;
• the Press: this term refers to the environment where the creative act took place, including cultural and social factors;
• the Process: the method by which the Person achieved the Product.
As far as the creative process is concerned, one of the most studied aspects of creativity, countless models have been proposed to explain it. Graham Wallas [25], in the early 20th century, proposed a four-stage model for the creative process:
1. Preparation: the creative entity consciously starts to explore and understand the problem and its
properties;
2. Incubation: the unconscious process where the problem is extensively explored;
3. Insight: action by which the solution found in incubation jumps into the individual’s consciousness;
4. Verification: corresponds to the final examination of this new idea.
Wallas was an economist, and this is one of the reasons why this model has been considered mainly focused on the economic value of creative products. After almost a century, this model was expanded by Sawyer [26] into eight stages:
1. Ask: in this stage, one finds the problems that the creative entity will try to solve;
2. Learn: in this stage, relevant knowledge related to the problem is acquired;
3. Look: in this stage, it is important to stay aware of new results and information about the problem;
4. Play: in this stage, this information is used in informal activities such as games;
5. Think: in this stage, a large variety of possibilities is generated;
6. Fuse: in this stage, these ideas are combined in unexpected ways;
7. Choose: in this stage, the best ideas are selected from all those generated;
8. Make: in the final stage, the best solution is externalized.
One topic about creativity that we consider of the utmost importance is the discussion about the role of freedom and constraints in the creative process. Naturally, we may think that total freedom is necessary in order to achieve a pure creative act, but Partridge and Rowe [27] consider that the need for constraints in order to create something creative is “one of the paradoxes of creativity”. These authors argue that in a free domain it is easy to create novel, complexity-increasing ideas, while when there are restrictions it is much more difficult to find simple novel products, making the latter more likely to be considered creative. The authors refer to five criteria for classifying these constraints:
• Sharp vs Blurry: the former is precisely checkable while the latter is not;
• Explicit vs Implicit: the first is a conscious restriction and the second is implicitly provided;
• Strong vs Weak: the latter may be violated, while the former must not be disrupted;
• External vs Self-imposed: depending on the constraint’s origin, it is considered self-imposed if it came from the creative entity itself, and external otherwise;
• Elastic vs Rigid: the latter corresponds to a predicate, which means it has only two states, fulfilled or not, while an elastic constraint has a spectrum between broken and completely satisfied.
Dalí, in an interview in 1974, said that “[f]reedom of any kind is the worst thing for creativity” [28]. Whether he was being ironic or not, we cannot tell and, according to the painter, neither could he, but this discussion about freedom and its role in creative acts does not seem likely to end any time soon. We believe that a broad study of the relationship between constraints and creativity would bring a new perspective to this kind of theory, since constraints play a very important role in Convergent Thinking (CT).
In the context of creativity, CT corresponds to the intellectual methodology used to trim a huge number of different possibilities down to a single solution, while Divergent Thinking (DT) refers to the different mental processes by which one is able to generate different hypotheses going in different directions. DT, by definition, requires freedom, because we need different possibilities to explore. Constraints are much more important for CT, since they help to confine our possibilities.
There are many other theories about creative processes, focusing on various aspects. An
exhaustive exploration of all these models would not be a realistic goal for this work. However, we
consider it important to provide some pointers for future research on the topic: in the beginning of the 1960s,
based on DT and CT theory, Campbell presents the Blind Variation and Selective Retention (BVSR) model
[29]; in 1964, Koestler publishes a matrix-based bisociation theory to explain creativity [30]; in 1990,
Boden publishes an exploratory vision of creativity based on conceptual spaces [31]; Sternberg and
Lubart, in 1991, publish the beginning of an investment theory of creativity [32]; Finke, Ward and Smith
present the Geneplore Model in 1992 [33]; Turner and Fauconnier, building on Koestler’s work, develop
the initial notion of Conceptual Blending in 1995 [34]; Csikszentmihalyi expands Wallas’s model to five
steps in 1997 [35]; Propulsion Theory is presented in 1999 by Sternberg [36]; and, in 2006, Wiggins [37]
proposes a formalization of Boden’s exploratory creativity model.
Summing up, creativity is a complex, subjective and difficult-to-define concept, and that is the main
reason why there are so many different models trying to explain the different aspects of this capacity.
Creativity happens in several different contexts, from science to art, and it relates different abstract concepts
regarding consciousness and the human mind. That is one of the reasons why creativity is so complex
and so difficult to study. In this summary and reflection on creativity, we have intentionally left out
several other topics, such as: the relationship between creativity and emotions; the role
of intention in creativity; the correlation between motivation and creativity; the author versus tool duality; the
value of the creator’s capacity to explain its own work; and the relevance of time in creativity. We hope that, with
the creation of new tools, the scientific community will be able to address these kinds of questions in the
near future.
2.1.2 Models for Computational Creativity
Nowadays we have a steady collection of different means to study human cognition: neuroimaging
techniques, deep brain stimulation, psychoanalysis, stem cells and organoids, artificial intelligence and
robotics, autopsy, self-reflection, among others. Some of these tools study the brain functionally, while
others focus on the brain’s structure. Some of them study the brain by dissecting it, others by analyzing
how it reacts to different stimuli, and others by trying to imitate how the brain behaves. Artificial
Intelligence (AI) techniques might help us to understand the brain functionally. The main hypothesis behind
this kind of approach is that, if we replicate human behavior using a computer, we may
assume that the computational process which originates those results and the mental process in
the brain responsible for that behavior share some similarities. This kind of approach is known as
analysis by synthesis, and the studies on computer techniques that may be used to specifically simulate
creative behavior are usually included in the area known as Computational Creativity (CC). CC is an
interdisciplinary area where art, philosophy and cognitive sciences meet computer science and AI, and
that aims at studying the relationship between computers and creativity, trying to discover whether it is possible to
endow computers with creative capacities, whether human creativity has an algorithmic nature, or even whether we can
enhance human creativity using computers.
There are different classifications of the techniques and models used in CC, using different criteria.
The first classification we consider relevant distinguishes between discriminative models and generative
models. The former support most AI applications nowadays and correspond to a classification
problem, while the latter focus on the production of new data. Both are extremely important for the CC
area and are used to complement each other, similarly to art analysis and art production.
This classification helps to organize these techniques, but it does not take into account other
properties of the algorithms. A complete and general algorithm-based classification of these approaches
would have numerous advantages: it would allow explaining similar approaches in a systematic way
and teaching CC more easily; it would allow comparing these generalized approaches with creativity
models, advancing the state of the art of creativity psychology and the knowledge of the human mind;
and it would also allow implementing and expanding similar systems, making it possible to create
increasingly complex systems with possibly better results.
In 2017, Ackerman et al. [38], focusing on pedagogical issues, presented what they call the CC-continuum,
a spectrum between two opposite views on the CC area. On one side of the spectrum, we have what
the authors consider a more engineering-oriented approach, where the main purpose of a system is to
simulate creative behavior. On the other side, there is a more theoretical and cognitively focused
approach, where the system is only used to verify the quality of a creativity model. In the engineering
approach, the creation of a system is usually motivated by the final products the system will create,
while in the cognitive approach the major motivation is the initial model. The authors argue that all CC
systems can be located in this continuum, allowing the comparison of different systems.
Besides this classification, Ackerman et al. [38] also propose an arrangement of the different CC
approaches into five categories: state space search, Markov chains, knowledge-based systems, genetic
algorithms, and learning or adapted systems.
State space search
The idea of creativity as a search problem is not recent. According to D’Inverno and Still [22], Hobbes
and Leibniz, as early as the 17th century, tried to explain the creation process using search metaphors
and a combinatorial search space of possibilities. Only in 2006 did Wiggins [37] formalize Boden’s
conceptual-space exploration model, which complies directly with this idea but is far from being
limited to computational purposes. He defines the trajectory of a creative agent through the conceptual
space using four different components: a universe, U, of possibilities to explore; a set of rules,
R, that defines the acceptable conceptual space; a set of evaluation rules, E, that assigns a value to
each concept in U; and, finally, a set of rules, T_{R,E}, that defines the strategy to explore U, taking into
consideration R and E.
This point of view on creativity is simple, mechanical and demystifying. We see artists trying
and failing, which is concordant with this theory and emphasizes the iterative nature of the artistic creation
process. However, this model does not provide hints on what the best strategy T_{R,E} for searching
the possibility space might be.
Markov chains
These models use probabilistic approaches and are usually implemented as probabilistic state machines
where the Markov property is assumed. This property, named after the Russian mathematician
Andrey Markov, refers to memoryless sequences of random variables. A sequence of related random
variables, X_1, X_2, ..., X_n, ..., is said to verify the Markov property if the value of one random variable
X_n is enough to characterize the behavior of the next random variable X_{n+1}; in that case, we may simply
ignore the rest of the sequence, as represented in Equation 2.1.

P(X_{n+1} = x | X_n = x_n, X_{n−1} = x_{n−1}, ..., X_1 = x_1) = P(X_{n+1} = x | X_n = x_n) (2.1)
These systems have been quite popular in text and music generation but, as Widmer [39] argues,
music has many long-term relationships, and a memoryless process such as a Markov process is
not able to remember and recreate these relations. This is the reason why the usage of these techniques
has been criticized and considered inappropriate for music composition. One early example of these
models is Analogique, developed by Iannis Xenakis in 1958, which used Markov chains.
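To make the Markov property concrete, the following sketch builds a first-order transition table from a toy note sequence and samples a new sequence from it. The note names and the seed melody are purely illustrative, not taken from any real corpus:

```python
import random

def build_transitions(sequence):
    """Count first-order transitions between consecutive symbols."""
    transitions = {}
    for current, nxt in zip(sequence, sequence[1:]):
        transitions.setdefault(current, []).append(nxt)
    return transitions

def generate(transitions, start, length, rng=random.Random(0)):
    """Sample a sequence where the next symbol depends only on the current one."""
    result = [start]
    for _ in range(length - 1):
        followers = transitions.get(result[-1])
        if not followers:          # dead end: symbol never had a successor
            break
        result.append(rng.choice(followers))
    return result

melody = ["C", "E", "G", "E", "C", "G", "C", "E"]
table = build_transitions(melody)
new_melody = generate(table, "C", 8)
```

As the text notes, such a generator only reproduces local, pairwise relations; any long-term structure of the seed melody is lost.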
Knowledge-based systems
In this approach we use mechanisms of reasoning, known as inference engines, and a knowledge
representation structure: the knowledge base. The former allows us to deduce indirect information from
the knowledge base, while the latter is a symbolic representation of the state of the world. Rule systems,
frame-based systems and even constraint satisfaction systems are examples of this kind of system. One
big disadvantage of these systems is that the knowledge needs to be acquired from experts, and
then standardized, which might be a very long process.
Genetic algorithms
The Genetic Algorithm (GA), also known as evolutionary algorithm, is a search algorithm inspired
by the theories of evolution and genetics. In short, it takes some random samples, selects the best
artifacts and tries to obtain more, somewhat similar, samples by crossing the good ones. We believe that
Ackerman et al. [38] considered that this algorithm deserved a distinct class due to its importance for
the CC area: it has many parallels with different creativity theories and it has been used in different
domains, achieving good results. According to Floreano et al. [40], this algorithm requires seven steps:
choose a genetic representation; build a population; design a fitness function; choose a selection
operator; choose a recombination operator; choose a mutation operator; and, at the end, devise a data
analysis procedure. These steps involve four different components that interact with each other: a
representation, a population, some operators (recombination and mutation) and the fitness function.
The overall algorithm consists of modifying the population of represented elements using the operators
and selecting the best elements, using the fitness function to score them.
Both the representation and the operators can constrain or expand the search
space, exploring fewer or more possibilities in exchange for time. The representation defines
what kind of possibilities are acceptable, and a representation is said to be more flexible if it can
represent more possibilities. The operators receive one or more possibilities from the population and
return one new possibility. If these operators have domain-specific knowledge, they will only produce
plausible possibilities. To clarify the distinction between blind and domain-specific operators, let
us compare the latter to a math expert and the former to a child, both solving an
equation. While the math expert turns the equation into another well-formed formula by applying known
operations, the child will randomly play with the symbols, only possibly reaching the solution.
A GA has many different places where randomness can be added: the operators may contain
stochastic processes; we may choose random elements of the population to apply the operators to; and
the fitness function may define a probability of survival. Since the fitness function measures the utility
of the possibilities and acts like a heuristic, the GA is able to mix randomness and heuristically
directed search in a very elegant way.
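The four components described above (representation, population, operators and fitness function) can be sketched on a toy problem. Here the representation is a bit string, the fitness counts matches against an illustrative all-ones target, and selection keeps the top half of the population; every numeric parameter is an arbitrary placeholder:

```python
import random

rng = random.Random(42)
TARGET = [1, 1, 1, 1, 1, 1, 1, 1]  # toy goal: an all-ones bit string

def fitness(ind):
    """Score an individual by how many genes match the target."""
    return sum(1 for a, b in zip(ind, TARGET) if a == b)

def crossover(a, b):
    """One-point recombination of two parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(ind, rate=0.1):
    """Flip each gene independently with a small probability."""
    return [1 - g if rng.random() < rate else g for g in ind]

population = [[rng.randint(0, 1) for _ in TARGET] for _ in range(20)]
for _ in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                       # selection (elitism)
    children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                for _ in range(10)]                 # recombination + mutation
    population = parents + children

best = max(population, key=fitness)
```

The stochastic mutation and the random choice of parents illustrate two of the places where randomness enters the algorithm, while the fitness-based sorting plays the heuristic role discussed above.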
Learning and adapted systems
Learning systems are systems that use some kind of ML technique that aims to define a function
from examples, and are considered data-driven approaches, as opposed to model-driven approaches.
ML encompasses an enormous diversity of algorithms, but one of the most famous, controversial, and
recently considered promising families of techniques is Deep Learning (DL). Therefore, although Ackerman
et al. [38] presented this wide class of techniques without emphasizing any of them specifically, in
the next section we will focus exclusively on presenting DL concepts.
Figure 2.1: Artificial neuron scheme, adapted from [1]
2.2 Deep Learning Related Work
2.2.1 General Concepts
During the last 70 years, different words, such as cybernetics and connectionism, have been used
to refer to the approaches that are nowadays included in DL. According to Goodfellow et al. [41], “DL
enables the computer to build complex concepts out of simpler concepts” by “introducing representations
that are expressed in terms of other” representations. This representation learning process is usually implemented
using structures famously known as neural networks. Neural networks were originally inspired by the
structure of the nervous system and process data by connecting artificial neurons. Similarly to the plasticity
of the brain’s synapses, this structure learns by adapting the strength of the links between neurons, and thus
the knowledge is encoded in these links.
In 1943, McCulloch and Pitts [42] presented a simplified model of the neuron, the “Threshold Logic
Unit”, that supported the implementation of what we may consider the first artificial neurons, at the
time with adjustable but not learned weights. These neurons received signals, or inputs, X = [x_0, x_1, ..., x_t],
through weighted links, with weights W = [w_0, w_1, ..., w_t], respectively; a threshold function was applied to the
weighted sum of the signals plus a bias, φ(Σ_{i=0}^{t} w_i x_i + b); finally, the resulting signal, or output, y,
was propagated to the next neurons. This process is schematically represented in Figure 2.1. In 1957,
Frank Rosenblatt presented the Perceptron, which was used for linear classification tasks and where the
values of the weights, bias and threshold were learned from a dataset. This model was so surprising that
newspapers at the time described it as “the embryo of an electronic computer [. . . ] that [. . . ] will be able
to walk, talk, see, write, reproduce itself and be conscious of its existence” [43].
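The computation φ(Σ w_i x_i + b) of a single threshold unit can be sketched in a few lines. With hand-picked weights (the values below are illustrative), such a unit implements a logical AND gate:

```python
def step(x):
    """Threshold activation: fires 1 when the total input reaches 0."""
    return 1 if x >= 0 else 0

def neuron(inputs, weights, bias, activation=step):
    """y = phi(sum_i w_i * x_i + b): a single artificial neuron."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(total)

# An AND gate as one threshold unit (weights chosen by hand, not learned):
and_out = [neuron([a, b], [1, 1], -1.5) for a, b in
           [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

In the Perceptron, the difference is precisely that these weights and the bias are learned from data rather than set by hand.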
From a first, simple analysis of this system, we understand that we are applying a function (the
activation function) to a linear operator. Since these first models did not use non-linear functions, at the
end of the 1960s researchers proved that it was theoretically impossible for the Perceptron to learn non-linear
functions, even simple ones such as the XOR function, which led to an alienation of research in
neural networks.

Figure 2.2: Commonly used activation functions: (a) sigmoid, (b) hyperbolic tangent, (c) ReLU, (d) LReLU

Therefore, in order to model real data (usually non-linear), models grew, started
connecting more and more neurons and included non-linear activation functions, also known as non-linearities,
in the structure of neural networks. Different activation functions are currently used, such as the
sigmoid function, σ(x) = 1/(1 + e^{−x}) (Figure 2.2(a)); the hyperbolic tangent, tanh(x) = 2/(1 + e^{−2x}) − 1
(Figure 2.2(b)); the Rectified Linear Unit (ReLU), R(x) = max(0, x) (Figure 2.2(c)); or the Leaky Rectified Linear
Unit (LReLU), LR(x) = max(0.1x, x) (Figure 2.2(d)), to name the most usual ones.
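The four activation functions listed above translate directly into code, using exactly the formulas of Figure 2.2:

```python
import math

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^-x), squashes to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """tanh(x) = 2 / (1 + e^-2x) - 1, squashes to (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def relu(x):
    """R(x) = max(0, x): zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def leaky_relu(x):
    """LR(x) = max(0.1x, x): a small slope survives for negative inputs."""
    return max(0.1 * x, x)
```

The leaky variant avoids the completely flat negative region of ReLU, which can otherwise stop gradients from flowing through inactive neurons.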
In Feedforward Neural Networks, these neurons are organized in layers, and neurons
only link to neurons in the immediately following layer. There are two special layers: the input layer, the first one
to receive the data, and the output layer, which exports the final result. All other layers that might exist
between these two are named hidden layers. The fact that in these networks the information only moves
in one direction (without cycles) explains their name.
The core concept of this technique is the Back-Propagation algorithm, proposed in 1986 by Rumelhart,
Hinton and Williams [44], which is responsible for the learning process. This algorithm receives a network and a
series of examples and returns the neural network with the weights updated according to the given examples.
This process is referred to as “training the network”. For each pair of input (x) and expected output
(d), two steps take place in the learning process: the forward phase and the backward phase. In the forward
phase, the input signal x is propagated through the fixed network until we get the output value calculated
by the network, y. At the end, we calculate an error measure, L(d, y), usually called the loss
function or cost function, which represents a critical part of the entire process.
During the backward phase, this loss is used to correct the network’s weights and biases, starting at the
end of the network and propagating the error back to its beginning. Using the properties of derivatives, the
algorithm calculates the way each weight in the network influences the loss measure, ∂L/∂w_t, in order to
minimize it. In mathematical terms, we calculate the gradient of the loss function, ∇L(w_0, w_1, ..., w_n), and,
since we want to minimize the loss function, we update the weights by taking a small step, defined by a
learning rate η, in the opposite direction of this gradient: W^{t+1} = W^t − η∇L(W^t). This iterative process of
moving against the gradient is commonly referred to as gradient descent. The literature usually also refers to
some kinetics-inspired meta-parameters that might have a great impact on the convergence and efficiency
of the algorithm:
• Batch size: if we update the weights taking into account all the instances in the dataset, then the error
variation may be monotonic, but it may not be computationally tractable when we only have access to
bounded memory or time resources. To overcome these limitations, we may update the weights
using only a fixed number, n, of instances (the so-called mini-batch gradient descent with batch
size n) or even, when n = 1, the well-known Stochastic Gradient Descent (SGD),
which might lead to a loss measure with high variance;
• Learning rate: this represents how much we learn from each batch; it may be constant,
or vary in different ways: from iteration to iteration (e.g., exponential decay), from input to input, or
even from weight to weight (e.g., AdaGrad, explained in [41]);
• Momentum: similarly to what happens in physical systems, we can endow our training process
with an inertial property and update the weights also taking into account the direction in which the
weights have been evolving during the last iterations;
• Other: different models and techniques can make use of additional hyperparameters, such as the
threshold c used for weight clipping, for instance.
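The update rule and the batch-size and learning-rate meta-parameters above can be illustrated with mini-batch gradient descent on a one-weight toy model, y = w·x, fitted to data generated from the hypothetical true weight 3 (all parameter values are illustrative):

```python
import random

rng = random.Random(0)
# Toy dataset generated from d = 3x (the "true" weight is 3).
data = [(x, 3.0 * x) for x in range(-10, 11)]

w = 0.0           # single weight to learn
eta = 0.005       # learning rate
batch_size = 4    # mini-batch gradient descent with n = 4

for epoch in range(200):
    rng.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Gradient of the mean squared error L = mean((w*x - d)^2) w.r.t. w
        grad = sum(2 * (w * x - d) * x for x, d in batch) / len(batch)
        w -= eta * grad   # small step against the gradient
```

Setting batch_size to 1 would turn this into SGD, with noisier updates; setting it to len(data) gives full-batch gradient descent with a smooth but more expensive loss curve.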
One optimization technique worth pointing out is Adaptive Moment Estimation (Adam) [45], a method
that uses exponentially decaying moment estimates and parameter-specific learning rates, making it
suitable for both sparse and noisy data, and it is widely used for different purposes. Nevertheless, the
best optimization technique can depend heavily on the architecture of the network.
Figure 2.3: Convolution filter operation
2.2.2 CNN and RNN
From the second half of the 1980s to the appearance of Support Vector Machines (SVM) in 1995, research
in neural networks expanded and different models were proposed, such as Recurrent Neural Networks
(RNN) and Convolutional Neural Networks (CNN).
RNN are a class of networks with recurrent connections, which means that part of the output
of some component is injected, together with the input, into another iteration. The first RNN date from the
1980s and had learning problems caused by exploding and vanishing gradients but, in 1997, Hochreiter
and Schmidhuber [46] presented the Long Short-Term Memory (LSTM), where, in addition to the output
of the last iteration and the input, there is a hidden state that is passed across iterations. This
state is updated using three gates that control how much of the past value to forget, how
much of the new value should be remembered, and what should be the influence of this value on the
output. There are many different adaptations of these memory cells, but we do not consider it relevant to
expose them in this introductory approach to deep learning.
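The gating mechanism described above can be sketched with a scalar LSTM-style cell. All weights below are hypothetical placeholders, not learned values; the point is only to show how the gates combine the input, the previous output and the carried state:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate names to (w_x, w_h, b)."""
    def gate(name, squash):
        w_x, w_h, b = p[name]
        return squash(w_x * x + w_h * h_prev + b)
    f = gate("forget", sigmoid)        # how much of the old state to keep
    i = gate("input", sigmoid)         # how much of the new value to store
    g = gate("cell", math.tanh)        # candidate value
    o = gate("output", sigmoid)        # influence of the state on the output
    c = f * c_prev + i * g             # updated cell state
    h = o * math.tanh(c)               # output passed to the next iteration
    return h, c

# Identical placeholder parameters for every gate, purely for illustration:
params = {k: (0.5, 0.5, 0.0) for k in ("forget", "input", "cell", "output")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:             # the state is carried across iterations
    h, c = lstm_step(x, h, c, params)
```

Because f is a sigmoid rather than a hard switch, the cell can partially retain information over many steps, which is what mitigates the vanishing-gradient problem of plain RNN.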
CNN were first used by Yann LeCun et al. for digit recognition in 1989 [47]; they mimic the
visual cortex and take advantage of local correlations to compress data. These networks use different
kinds of layers based on filters, also named kernels: windows that slide through the different
dimensions of the input, applying an operation several times to different regions of it, as illustrated in
Figure 2.3.
In order to define a set of filters, two values are mandatory: the number of filters (which corresponds
to the number of output channels) and the size of the filter (input channels, height, width and, occasionally, depth
when dealing with 3D data). However, there are two additional ways we can modify the filter’s behaviour:
the stride and the padding. The stride controls the size of the shift when the filter is sliding, while the
padding indicates whether we should add extra volume around the input in order to preserve the dimensions of
the data. In Figure 2.3, the 8 × 8 input with a single channel is filtered by a single 3 × 3 convolution
filter using a stride and a padding both set to 1 × 1, resulting in an 8 × 8 output. We can use the formula
in Equation 2.2 to calculate the output size O of one dimension with input size I, filter size K, stride S
and symmetric padding P, assuming that the fraction bar represents integer division.
O = (I − K + 2P)/S + 1 (2.2)
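Equation 2.2 is a one-line function, which also lets us verify the 8 × 8 example of Figure 2.3:

```python
def conv_output_size(i, k, p, s):
    """O = (I - K + 2P) // S + 1, using integer (floor) division."""
    return (i - k + 2 * p) // s + 1

# The example of Figure 2.3: 8x8 input, 3x3 filter, padding 1, stride 1.
same_size = conv_output_size(8, 3, 1, 1)   # dimension is preserved
# Without padding, the output shrinks by k - 1:
no_padding = conv_output_size(8, 3, 0, 1)
```

The padding value p = (k − 1)/2 is exactly what makes the output dimension equal the input dimension for stride 1, which is why odd filter sizes are so common.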
There are different operations we can perform with kernels, but the most common are pooling operations
and the convolution operation. Pooling operations are non-linear down-sampling techniques that usually
use strides that yield non-overlapping sub-regions. Among these operations, max pooling is the
most commonly used: it uses a max filter, choosing the highest value of the sub-region. On the other
hand, convolution operations commonly keep the dimensions of the input by using a stride of 1 and
making use of padding. In a convolution operation over a two-dimensional input, a matrix M (m × m),
using a filter K (k × k), the result is a new matrix N (m × m), where Equation 2.3 is used to calculate
the value in position i, j, with v = k/2 using integer division.
N_{i,j} = M_{i−v,j−v}·K_{0,0} + M_{i−v,j−v+1}·K_{0,1} + . . . +
M_{i−v+1,j−v}·K_{1,0} + . . . +
M_{i−v+k−1,j−v+k−1}·K_{k−1,k−1}
(2.3)
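Equation 2.3 can be implemented directly as nested loops, with zero padding handled by skipping out-of-bounds positions (a plain-Python sketch, deliberately unoptimized):

```python
def convolve2d(m, kernel):
    """Apply a k x k kernel to a square matrix m, stride 1, zero padding v = k // 2."""
    size, k = len(m), len(kernel)
    v = k // 2
    out = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    r, c = i - v + a, j - v + b
                    if 0 <= r < size and 0 <= c < size:  # zero padding
                        acc += m[r][c] * kernel[a][b]
            out[i][j] = acc
    return out

# The identity kernel leaves the input unchanged:
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
result = convolve2d([[1, 2], [3, 4]], identity)
```

The same weights are reused at every position i, j, which is the weight sharing that lets CNN exploit local correlations with far fewer parameters than a fully connected layer.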
2.2.3 Generative Deep Models
With the dawn of the 21st century, the software and hardware developments in parallel computation,
the easy access to big data and to gradient-computation platforms (e.g., Theano (http://deeplearning.net/software/theano)
and TensorFlow (https://www.tensorflow.org)) and some new training ideas (unsupervised pre-training, the
Adam optimizer [45], dropout [48], as well as other optimizers and regularization techniques) allowed the
blossoming of a new era of research in neural networks, now with access to quite complex and deep structures.
Deep Artificial Neural Networks, networks with many hidden layers, are mostly used as black boxes because,
despite their simplicity in terms of model and implementation, it turns out they are very challenging to understand.
It is especially difficult to keep track of what is happening in the hidden layers.
One of the most formidable advantages of these models is that, besides their already known capacity
to evaluate data, they have recently been discovered to achieve very good results in generative
approaches. Many generative models and frameworks using Artificial Neural Networks (ANN) have
been presented in recent years: the PixelCNN and PixelRNN [49]; the Generative Stochastic Networks
(GSN) [50]; and the interest in Restricted Boltzmann Machines (RBM) [51] reappeared.
RBM were among the first deep generative models and, according to Goodfellow et al. [41], were
presented under the name “Harmonium” by Smolensky [51] during the 1980s. They are Boltzmann
Machines, binary stochastic undirected graph-based models where neurons are organized in two layers,
the visible and the hidden one, and where, unlike in common Boltzmann Machines, intra-layer connections are
not allowed. These machines use stochastic neurons and are energy-based models, meaning that we
define the probability of each state of these neurons using an energy function, E(v, h).
For a machine with n_v visible neurons, V, and n_h hidden ones, H, with weights defined by the matrix
W (n_v × n_h) and biases b_v and b_h for the visible and hidden neurons respectively, we can calculate the energy, E,
and probability, P, functions for one state (v, h) using the formulas defined in Equations 2.4 and 2.5,
respectively. In these equations, V and H are random vectors, V_i represents the i-th random variable in
V, v and h are concrete states, and V* and H* represent the sets of all possible states for each
of the random vectors. Moreover, W is an n_v × n_h matrix, and b_v and b_h are vectors of sizes n_v and n_h. In
Equation 2.6, we show that, when dealing with binary data and thanks to the special restriction of RBM
that makes neurons in the same layer conditionally independent given the full state of the other layer, we
can calculate the activation probability of one neuron in a layer given the full state of the opposite one using
the sigmoid function, σ, shown in Figure 2.2(a).
E(v, h) = −v^T W h − b_v^T v − b_h^T h (2.4)

P(V = v, H = h) = e^{−E(v,h)} / Σ_{v'∈V*} Σ_{h'∈H*} e^{−E(v',h')} (2.5)

P(V_i = 1 | H = h) = P(V_i = 1, H = h) / P(H = h)
 = (Σ_{v∈V*, v_i=1} e^{−E(v,h)}) / (Σ_{v∈V*} e^{−E(v,h)})
 = e^{b_{v_i} + W_i h} / (e^{b_{v_i} + W_i h} + 1)
 = 1 / (1 + e^{−(b_{v_i} + W_i h)})
 = σ(b_{v_i} + W_i h) (2.6)
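The closed form of Equation 2.6 is cheap to compute: the conditional activation of a visible neuron needs only its bias, its row of W and the hidden state. The machine dimensions and weights below are hypothetical, chosen just to exercise the formula:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def visible_activation(i, h, W, b_v):
    """P(V_i = 1 | H = h) = sigmoid(b_v[i] + W[i] . h), as in Equation 2.6."""
    return sigmoid(b_v[i] + sum(W[i][j] * h[j] for j in range(len(h))))

# A hypothetical 2-visible x 3-hidden machine:
W = [[0.5, -0.2, 0.1],
     [0.3, 0.4, -0.6]]
b_v = [0.0, 0.1]
h = [1, 0, 1]
p = visible_activation(0, h, W, b_v)   # b_v[0] + W[0].h = 0.6
```

It is this conditional independence within a layer that makes block Gibbs sampling between the visible and hidden layers practical for RBM.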
Figure 2.4: Variational Auto-Encoder architecture
Figure 2.5: Generative Adversarial Networks architecture
In 2013 and 2014, two very promising generative frameworks using deep structures were
introduced: the Generative Adversarial Networks (GAN) [9] and the Variational Auto-Encoder (VAE) [52].
The classic auto-encoder model starts with an encoder that is responsible for turning an instance
into a code (also known as the latent variable), which represents this instance in a different space.
The decoder, the next stage, receives this code and creates a new instance. The learning process
consists of comparing the instance created by the decoder with the initial instance and propagating the
errors through the network. In a VAE [52], the result of the encoder is the description of a Gaussian
distribution, a pair of mean value and standard deviation, (µ, σ). In this case, the input for the decoder
is a random sample from this distribution, as represented in Figure 2.4.
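The data flow of Figure 2.4 can be sketched with stand-in encoder and decoder functions (both are illustrative placeholders, not trained networks), showing where the random sample from N(µ, σ) enters the pipeline:

```python
import random

rng = random.Random(0)

def encoder(x):
    """Stand-in encoder: maps x to the parameters (mu, sigma) of a Gaussian."""
    return 0.5 * x, 0.1          # illustrative values, not learned

def decoder(z):
    """Stand-in decoder: maps a latent sample back to data space."""
    return 2.0 * z

x = 4.0
mu, sigma = encoder(x)
z = mu + sigma * rng.gauss(0.0, 1.0)    # sample from N(mu, sigma)
reconstruction = decoder(z)
loss = (x - reconstruction) ** 2        # reconstruction error to propagate
```

Writing the sample as z = µ + σ·ε, with ε drawn from a standard Gaussian, is the usual reparameterization that keeps the sampling step differentiable with respect to µ and σ.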
GAN [9] is a generative framework where two networks, the generator and the discriminator, compete
against each other and evolve together. While the role of the generator is to produce instances that
trick the discriminator, the latter must distinguish between fake and real instances. A schematic
representation of the components of a GAN is presented in Figure 2.5. The generator G takes random
noise (usually Gaussian), z ∼ p_random, and turns it into potentially good instances G(z) that are evaluated
by the discriminator D, which also evaluates real data, x ∼ p_real, to continually improve. In this zero-sum,
non-cooperative game between the discriminator and the generator, we are looking for a state where
the generator is so good at generating data that the discriminator is not able to find a way to distinguish
between real and generated data. Equation 2.7 presents the minimax equation we want to optimize.
The model converges when neither of the players can achieve a better score by locally improving its
strategy, i.e., when they reach a Nash equilibrium point, in terms of game theory, or when both gradients
are very small.
min_G max_D V(D, G) = E_{x∼p_real}[log D(x)] + E_{z∼p_random}[log(1 − D(G(z)))] (2.7)
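For a single real sample and a single generated one, the value function of Equation 2.7 reduces to log D(x) + log(1 − D(G(z))), which we can evaluate directly (the discriminator outputs below are illustrative probabilities, not produced by a real network):

```python
import math

def gan_value(d_real, d_fake):
    """V(D, G) for one real and one generated sample:
    log D(x) + log(1 - D(G(z)))."""
    return math.log(d_real) + math.log(1.0 - d_fake)

# A confident, correct discriminator scores higher than one at equilibrium,
# where it outputs 0.5 for both real and generated samples:
equilibrium = gan_value(0.5, 0.5)      # = 2 * log(0.5) = -log 4
confident = gan_value(0.9, 0.1)
```

The value −log 4 at D = 0.5 is exactly the score at the equilibrium the text describes: the discriminator can do no better than guessing, so neither player can locally improve.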
However, there are several major problems that we may run into during the training process:
• Non-convergence: when using gradient descent, there are no guarantees the model will converge;
it can oscillate around some stable point(s) and never converge;
• Mode collapse: this happens when we get an over-specialized generator that only generates
a small number of examples, causing a generation with low variability and an output that is
completely independent from the seed, z;
• Vanishing gradient: deeply related to the exploding gradient problem, it is a very common
problem in other deep models as well, including RNN, and is characterized by an accentuated
decrease in the gradient’s magnitude, resulting in a very slow training process;
• Overfitting: when the generator and discriminator overfit, we end up with low variability of
results, with the additional problem that the collapsing points are points from the real data, i.e., no
new data is generated.
All these problems are the target of numerous studies nowadays and, thanks to these studies, all
of them already have some possible solutions and some insights on how to solve them. Furthermore,
all of them are suspected to be mostly caused by sensitive and inappropriate hyperparameter values,
unsuitable loss functions, meager datasets or unbalanced training processes that side with one of the
components, giving it some unwanted advantage. Despite the fact that currently no general procedure
to solve all these problems is known, some commonly used strategies include adding noise to the training
process, using more robust cost functions such as the Wasserstein distance with gradient penalties
(WGAN-GP) [53], searching for new hyperparameter values, using dynamic and complex hyperparameters
or component-specific hyperparameters (e.g., different learning rates for discriminator and generator),
normalizing or even clipping weights and/or results along the network (batch normalization [54],
spectral normalization [55] and weight clipping), or even pre-training some components.
In short, training GAN is a non-trivial task and still a heuristic-guided process that usually
involves a lot of empirical experimentation. In fact, these training problems represent the biggest
drawback of this approach. Yet GAN have been successfully used for generation and style transfer
in visual data, recently providing sharp, high-quality results [11, 12].
Figure 2.6: Cyclical models common architecture
2.2.4 Cyclical Generative Models
Besides all the problems mentioned before, GAN were not originally designed to provide control over
the features of the generated objects. Therefore, some new ways of mixing these training frameworks
have been introduced, such as CycleGAN [56] and DiscoGAN [57], that make use of different loss
functions to find one-to-one mappings between two domains A and B both defined by representative
datasets of non-paired samples. These have been studied and applied for style transferring, achieving
great results in image to image translation. Conversely, to the best of our knowledge, applications of
these models to non-visual data are limited and cross-domain, for example visual to audio, use cases
are almost nonexistent.
Figure 2.6 represents the general components we may find in a cyclical model: two discriminators
(DA, DB) and two generators, one that maps instances of one domain A into instances of B, GAB, and
one that does the opposite, GBA. In one of the two streams, the network maps an instance a ∈ A into
an intermediate representation b̂ and afterwards decodes it back into â, while trying to ensure that these
transitional codes fool the discriminator DB. The other stream is responsible for the corresponding
process starting from an instance b ∈ B. The adversarial, also called classical-GAN, loss, L^A_advers,
works precisely as it does in a usual GAN, pushing b̂ into domain B. At the same time, minimizing
the reconstruction or cycle-consistency loss, L^A_recons, which measures the difference between
the original instance a and the recovered one â, forces the relevant information to flow into
b̂. The way these loss functions are implemented and used during the training process can vary from
model to model.
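As a minimal sketch, the two loss terms of the A → B → A stream can be written down with toy linear generators and a toy discriminator; the weights, dimensions and λ value below are illustrative assumptions, not the actual CycleGAN or DiscoGAN architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "generators" and "discriminator" standing in for deep networks.
W_ab = rng.normal(size=(4, 4)) * 0.1   # G_AB: domain A -> domain B
W_ba = rng.normal(size=(4, 4)) * 0.1   # G_BA: domain B -> domain A
w_d = rng.normal(size=4) * 0.1         # D_B: discriminator for domain B

def G_ab(a):
    return np.tanh(a @ W_ab)

def G_ba(b):
    return np.tanh(b @ W_ba)

def D_b(b):
    # Probability that b looks like a real sample from domain B
    return 1.0 / (1.0 + np.exp(-(b @ w_d)))

def stream_a_losses(a, lam=10.0):
    """Losses for the A -> B -> A stream of Figure 2.6."""
    b_hat = G_ab(a)            # intermediate representation b^
    a_hat = G_ba(b_hat)        # reconstruction a^
    l_adv = -np.log(D_b(b_hat) + 1e-8).mean()   # fool D_B: push b^ into B
    l_rec = np.abs(a - a_hat).mean()            # cycle-consistency (L1)
    return l_adv + lam * l_rec, l_adv, l_rec

total, l_adv, l_rec = stream_a_losses(rng.normal(size=(8, 4)))
```

In an actual model, both streams contribute their adversarial and reconstruction terms to the objective, and the generators and discriminators are updated in alternation, as in standard GAN training.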
2.3 Automatic Music Composition Related Work
2.3.1 General Concepts
The marriage between technology and music was desired even before the emergence of computers
as we know them today. As early as 1843, Ada Lovelace3 wrote about the usage of the Analytical Engine
for music composition:
Supposing, for instance, that the fundamental relations of pitched sounds in the science of
harmony and of musical composition were susceptible of such expressions and adaptations,
the engine might compose elaborate and scientific pieces of music of any degree of complexity
or extent.
(Ada Lovelace, 1843, Note A, p. 696)
Earlier, in the 18th century, Wolfgang Amadeus Mozart had used an algorithmic composition tech-
nique in his Musikalisches Würfelspiel, where pieces were created by rolling dice to randomize the order
of a set of already composed parts. In the middle of the 20th century, the dawn of the computer brought
different composers, such as Iannis Xenakis, Karlheinz Stockhausen and John Cage, to name just a few,
to use these new sound technologies in their artwork. At the beginning of the 1980s, David Cope started
developing the Experiments in Musical Intelligence (Emmy) and, at the end of the same decade, according
to Eck and Schmidhuber [59], Todd published one attempt to generate music using Recurrent Neural
Networks (RNN), a technique later explored in the CONCERT system presented by Mozer in 1994.
Since then, several authors have been developing music-related systems, and new events (conferences,
concerts and workshops, among others) focused specifically on this area have been organized.
The usage of generative models to create musical products is known as Automatic Music Generation
(AMG) or Algorithmic Composition. Since both partial and total automation of the compositional
process are considered in this area, organizing these systems into categories has turned out to be a
challenging task. In 2013, Eigenfeldt et al. [60] proposed a taxonomy based on the relationship between
the system and the human user, and on its relation with musical gestures:
Level 0 - Not Metacreative Systems: systems that cannot be considered metacreative or independent
are placed at this level.
Level 1 - Independence: the systems in this category are simple systems that expand the composer/performer's
musical gestures beyond their control.
Level 2 - Compositionality: these systems determine relationships between musical gestures.
Level 3 - Generativity: the generation of musical gestures is what characterizes this type of system.
Level 4 - Proactivity: these are systems that are able to initiate their own musical gestures, and may
already be considered agents.
Level 5 - Adaptability: agents that may influence each other or behave in different ways over time are
known as adaptable.
Level 6 - Versatility: here we consider agents that can determine their own content with almost no
stylistic limits.
Level 7 - Volition: finally, these agents decide when, what and how to compose/perform; they are
considered totally autonomous.
3In translator's notes for Menabrea's article [58] on Babbage's Analytical Engine
It is important to clarify that this taxonomy does not aim to rank systems by creativity or complexity:
it is a scale of autonomy. A system that plays random sounds at random times may be at the top of
this taxonomy and yet not seem particularly complex or creative. The authors argue that only once a
system is placed in one of these categories is it possible to discuss its complexity and/or musicality, by
comparing it with others at the same level.
Music is, indeed, different from almost all other areas of creativity (visual arts, humor, sculpture or
even science). It needs to take the time dimension into account; it has several more or less independent
layers of complexity (tracks); and it is, most of the time, performed. The area of AMG began in isolation,
by looking for techniques in other areas that could generate music. Due to the small number of projects
and the isolation of these early, small research communities, some problems have been pointed out
in these early works. The poor specification of practical and theoretical aims, the lack of a methodology
to achieve those aims and the usage of inappropriate evaluation methods are some common problems
we may find in most of these early AMG systems.
Merz [61] considers that, nowadays, most automatic music systems try to get the best results
without taking into account the purity of the algorithm and/or approach. The author argues that this is
appropriate when the main goal is to obtain musical products. However, when we want to study the
creative process that allows the creation of music, we should try not to include what is designated as
“ad hoc” elements. Ad hoc modifications are alterations concerned with one specific case: domain-
dependent changes that are not appropriate for other areas. Three aspects are taken into account by
Merz [61] to decide whether a change to the “pure” algorithm may be considered an ad hoc modification:
• In order to operate with musical information, it is unavoidable to have some kind of non-general
change that defines our working representation.
• An alteration may be considered an ad hoc modification in one context and not ad hoc when
applied in a different context.
• Most “pure” algorithms do not have a single, unique definition. Most of the time, they are
extensible.
Ad hoc modification analysis is one way to study and measure how far a solution may be generalized
to different kinds of music or different areas of CC. Merz mentions that methods with
too many ad hoc modifications “are used to model a specific task rather than the general functioning
of the brain” [61]. In addition to this analysis, we must try to find the limitations of these systems, such
as content that may be interesting but cannot be generated, or how different the generated musical
products are from each other. In the end, the author questions the need for algorithmic “purity” in
this area, arguing that the reason behind the common usage of ad hoc modifications is that music is
social and intrinsically tied to culture and tradition, perhaps making it impossible to obtain good results
without these modifications.
In 2016, Widmer [39] presented what he considers to be six well-known facts about Western music
that are being “ignored” by the area of Music Information Research (MIR), including AMG:
1. Music is time dependent; therefore, approaches based on bag-of-frames, where the frame order is
ignored, should be dropped and temporal models should be used more often.
2. Music is fundamentally non-Markovian, meaning that music usually does not have the Markov
property: it is filled with long-term dependencies not captured by most temporal models, such as
Hidden Markov Models (HMM) or even RNN.
3. Music's main goal is to be perceived by human listeners; therefore, besides the digital representation,
the emotional effect, tension and anticipation in complex musical structures need to be explored in
AMG systems.
4. Music perception and appreciation are learned, a strong argument for using unsupervised artificial
learning systems in AMG and for creating good-quality data corpora to train these systems.
5. Music is usually performed, and there are several different creative choices that are the performer's
responsibility. These aspects have been neglected in the AMG area, and the Con Espressione
Project [39] tries to change this.
6. Music is expressive; it affects us. Most music systems do not take into account any of the three
identified levels of expressiveness: basic, intrinsic and associative.
In short, as a recent and applied area, the main goals of the AMG community should be: to find a
general methodological approach that would provide better analytical and comparison tools; to focus on
new emerging technologies to overcome old obstacles and unlock new possibilities; to create open
resources in order to expand the community; and, finally, to explore the merging of completely different
techniques in order to get the best of each one.
2.3.2 Using Deep Learning
The number of deep learning articles in MIR has increased in recent years, which reflects a new interest
in these techniques, according to Choi et al. in their introductory article on deep learning in MIR [62].
In September 2017, Briot et al. [63], from the Flow Machines project, presented a survey on music
generation systems using deep learning methods. In this study, the authors propose an analytical
methodology based on four dimensions that are not entirely orthogonal:
• Objective: different AMG systems aim at different music generation objectives. According to
Briot et al. [63], the creation of a melody (monophonic or polyphonic) must be considered a
different task from multi-track generation or the generation of a harmony for a given melody.
The autonomy of the system must also be taken into consideration when analyzing deep AMG
systems.
• Representation: when dealing with generative systems, we must consider the training and generating
phases separately; the training input, the generating input and the generating output representations
might be different. The authors divide the different representations into signal representations (e.g.,
waveform, audio spectrum), which represent the sound waves more directly, and symbolic
representations (e.g., Musical Instrument Digital Interface (MIDI), pianoroll, text, chords, lead
sheet), much closer to a score or even to the act of playing an instrument. They also discuss two
different encodings: one-hot encoding and value encoding. The first is suited for finite discrete
dimensions, while the latter is usually used for continuous dimensions that may be defined as a
function of the other dimensions.
• Architecture: in this dimension we explore: the number of layers; the number of neurons in each
layer; which nonlinearities should be used; how the artificial neurons should be connected; whether
we should use attention layers; and whether we should use some already well-known deep structures
such as CNN, RNN, RBM or GAN.
• Strategy: one architecture can be used in different ways, providing different outputs and solving
different tasks. One direct way to use the model is to feed it the beginning of
a song and predict the rest. However, many other strategies are possible: sampling
from the generated distribution, manipulating the input, making networks play against each
other, concatenating cherry-picked results of different models, or even any combination of these
strategies.
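The first of these strategies, feeding the model the beginning of a piece and predicting the rest, can be sketched generically; the toy next-step model below (which simply moves the last pitch up one semitone) is purely an illustrative assumption, not any system from the survey:

```python
def continue_sequence(model, prime, n_steps):
    """Autoregressive continuation: prime the model with the start of a
    piece, then repeatedly append its next-step prediction."""
    seq = list(prime)
    for _ in range(n_steps):
        seq.append(model(seq))
    return seq

# Toy "model": predicts the next pitch as the last pitch plus one semitone.
toy_model = lambda seq: seq[-1] + 1

# Prime with three MIDI pitches and generate four more steps.
generated = continue_sequence(toy_model, prime=[60, 62, 64], n_steps=4)
# generated == [60, 62, 64, 65, 66, 67, 68]
```

In a real system, `model` would be a trained network returning a sample from its predicted distribution over the next timestep rather than a deterministic rule.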
2.4 Summary
Creativity is a complex, subjective and difficult-to-define concept. Usually, a creative act involves
four different components, the “four P's of creativity”: a creative person, inserted in some creative
environment (press), creates a creative product through a creative process. All these components are
currently the target of several studies and have several complementary and sometimes apparently
contradictory theories. However, nowadays, researchers agree that both novelty and utility play a major
role in creative tasks.
Computational Creativity (CC) is an interdisciplinary field that aims at exploring the relationship be-
tween creativity and algorithms. In this area, researchers develop computational systems to perform
creative tasks or simulate the mental processes that occur in the brain during a creative task, using a
varied range of algorithms. In this work we focused on Deep Learning (DL) algorithms.
DL is a family of ML algorithms that use structures with several layers, commonly known as neural
networks, to extract features from data. Nowadays there is a set of commonly used network architectures,
such as Convolutional Neural Networks (CNN), which pay special attention to features related
to spatial locality, and Recurrent Neural Networks (RNN) or Long Short-Term Memory (LSTM) networks,
suitable for sequence processing and time-related features. Among generative models, we can point out
the RBM, an energy-based stochastic model, the Variational Auto-Encoder (VAE) and Generative
Adversarial Networks (GAN). In GAN, a generator network tries to fool a discriminator network that
distinguishes between real and fake examples. Both networks are trained against each other, which can
lead to some well-known but not easily solvable problems, such as mode collapse and vanishing
gradients. Cyclical generative models, such as CycleGAN [56] and DiscoGAN [57], use the GAN loss and
an additional reconstruction loss function in order to learn a one-to-one mapping between two domains
defined by representative non-paired datasets. With this kind of model, two generators and
two discriminators can be trained to find, for instance, a translation between two styles of music.
Regarding the Automatic Music Generation (AMG) task, Widmer [39] criticized, in 2016, the way some
well-known facts have been ignored by the area of Music Information Research (MIR), such as the fact
that music is usually performed for humans to listen to, and that those humans have socially learned how
to appreciate its expressiveness. In 2017, Briot et al. [63] presented four main components of any
deep-learning-based automatic music generation solution: the objective, the representation, the
architecture and the strategy.
3 Dataset
Contents
3.1 Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Datasets Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1 Representations
Although nowadays the advantages of data-driven approaches are clear, one must also take into
account all their disadvantages. Data-driven approaches are only possible because there is data to
drive our model and, although nowadays we have easy access to a massive amount of data, more than
at any other time thanks to the internet, data retrieving, cleaning, processing, converting and all the
other data management tasks are not as easy as they may seem in theory.
Searching and gathering data is one of the first tasks in a modern data-driven approach and one of
the most complex. This task encompasses three sub tasks which we are going to present separately: to
choose one representation; to search and gather data from data sources; to preprocess data.
In this work we used DL to learn how to generate epic music. In order to achieve that, we used a
representative dataset, i.e., a dataset of examples, to represent our definition of epic music, which was
used to teach our model the common characteristics of epic music. One distinct representative dataset
exclusively dedicated to melodies was also built and used in our experiments.
As exposed before, Briot et al. [63] consider representation an important component of any DL
approach to music composition. Actually, this is an important aspect of any data-driven approach. If we
think only of the size of the search space generated by the representation, an overly flexible
representation can generate such a vast search space that any search process becomes inefficient,
whereas a too rigid representation may end up excluding too many potentially interesting artifacts.
Therefore, trivial problems may seem very difficult if one uses the wrong representation.
In section 2.3.2, we explained that music representations can be classified as signal representations,
such as waveforms, or symbolic representations, such as MIDI and pianoroll. The pianoroll
representation receives its name from the homonymous storage medium used in music boxes and old
automatic pianos.
We used the pianoroll representation as the output representation (generated songs) as well as the
training and input representation, both for melodies and for epic songs. The translation from pianoroll to
audio allows us to efficiently use different sound libraries (or even human players) to render different
audio results, while the translation from MIDI (commonly used online) to pianoroll is also very efficient.
In addition, the similarities between this representation and some visual representations commonly used
in DL make it suitable for the use of CNN and GAN models.
In our pianoroll representation we have time, pitch, track and velocity (musical intensity) dimensions.
While the first three dimensions use one-hot encoding, i.e., as explained in section 2.3.2, there is one
cell for each 3-tuple (timestep, pitch, track), the last one is value-encoded, which means it is
represented by a value in each of those cells. We can visualize this structure as a third-order
tensor (or a cube of cells) where cell c_{s,p,t} stores the velocity of track t at timestep s for the
note p.
Figure 3.1: Most commonly used rhythmic figures
To fully understand our representation, we present a deeper analysis of each one of our musical
dimensions:
• Intensity: in our approach, it is represented by a real value in a range from 0, which means not
playing at all, up to 1, which means playing as loud as possible. Usually, this dimension is
represented on a musical score using Italian words, such as piano and forte, which mean soft and loud
respectively, while in MIDI it is represented by an integer value between 0 and 127.
• Pitch: a scale is a musical structure that defines both the set of different pitches and the relationships
between them. Different musical styles use different scales; for instance, Eastern
music, renaissance music and microtonal music all use different pitch scales. The chromatic scale
is the standard for Western music, because it includes all the most used scales in this culture, and
it is used in MIDI, where each pitch maps to one integer between 0 and 127. In this mapping, for
example, an A3 corresponds to 57. We also used a 128-sized array to represent the 128 chromatic
notes used in MIDI, while allowing multiple notes to be played at the same time.
• Timesteps: the resolution of time (also referred to as tick) can be absolute, if it represents a fixed-
duration time interval, or relative, if it is measured in relation to a symbolic figure and needs a tempo
value to be converted into an absolute time. Regarding relative time, although beat and quarter
note are different concepts, the expressions “beat resolution” and “quarter resolution” have been
frequently used interchangeably, which may become misleading in cases not limited to simple time
signatures, i.e., when a beat does not correspond to a quarter note. For simplicity, in our
approach, we chose a beat resolution of 24 and a fixed tempo of 120 Beats per Minute (BPM).
This resolution means that 4 bars of 4 beats each (assuming a quaternary time signature)
total 4 × 4 × 24 = 384 timesteps. These values are commonly used in pianoroll representations
and allow us to represent the most common rhythmic figures, shown in Figure 3.1, from
the whole note, on the left, to a set of sixteenth-note triplets, on the right. The value 24 was
obtained by finding the least common multiple of 1, 2, 3, 4 and 6, representing the full, half, third,
quarter and sixth of a beat, and then doubling it to make sure we can always represent the end of
one rhythmic figure by inserting an empty timestep.
Figure 3.2: Schematic illustration of representation
• Tracks: tracks usually represent instruments or groups of instruments. MIDI represents instruments
using programs, once again using integer values between 0 and 127, and allows dynamic
instrumentation, i.e., changing the program of one track in the middle of the piece. Our
implementation, on the other hand, uses static instrumentation: a fixed program for each track. We
decided to use only one track for the melody representation (usually rendered with a piano sound). For
epic songs, we used a fixed set of 8 tracks representing different groups of instruments: woodwinds
(rendered using the clarinet), brass (rendered using the French horn), percussion set, timpani,
pitched percussion (rendered using tubular bells), voices, strings and keyboards (using piano).
These groups were chosen based on: organological knowledge, also reflected in MIDI's program
mapping; the instrumentation of the epic examples we collected; and the classical symphonic
orchestra configuration.
With one of these cubes or blocks of cells, we can represent one segment of epic music by providing
the intensity of each of the 128 chromatic pitches at each of the 384 time intervals for each of our
8 instruments. The limited, fixed time scope is a particularity of this representation which, at first glance,
may seem a disadvantage but actually simplifies our deep learning architecture thanks to its fixed
dimensions. Moreover, we can use a sequence of these blocks to represent longer music tracks,
knowing that with this approach we may lose some structural information.
On the whole, as illustrated in Figure 3.2, the final representation of one epic song is a finite sequence
of 128 × 384 × 8 sized blocks (notes, timesteps and tracks), with values between 0 and 1, i.e., a
fourth-order tensor T^4 with values in [0, 1].
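A minimal sketch of one such block, including the derivation of the beat resolution described above; the strings track index and the example note are illustrative assumptions:

```python
import numpy as np
from math import lcm

# Beat resolution as derived in the text: lcm of the beat divisions, doubled.
BEAT_RES = 2 * lcm(1, 2, 3, 4, 6)          # = 24 timesteps per beat
TIMESTEPS = 4 * 4 * BEAT_RES               # 4 bars of 4 beats = 384
PITCHES, TRACKS = 128, 8
STRINGS = 6                                # assumed index of the strings track

# One block: a 128 x 384 x 8 cube of velocities in [0, 1].
block = np.zeros((PITCHES, TIMESTEPS, TRACKS), dtype=np.float32)

# Write one A3 quarter note (MIDI pitch 57) on the strings track,
# starting at beat 0, at about half intensity (MIDI velocity 64 / 127).
a3, start, dur = 57, 0, BEAT_RES           # a quarter note lasts one beat
block[a3, start:start + dur, STRINGS] = 64 / 127
```

A full song is then a sequence of such blocks along a fourth dimension.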
3.2 Data Sources
A very natural way to define a concept is through stating examples. However, this practice is very
prone to bias, all the more so when dealing with subjective concepts such as “epic”. There are
many different ways to create a representative dataset, but when dealing with subjective concepts most
of them include two steps: gathering data and labeling it. During the analysis of the different possible
approaches, we identified two dimensions that may impact both the time it takes to create the dataset
and its overall quality.
The first of these dimensions is the temporal relationship between the processes of gathering and
labeling data. By gathering pre-labeled data we are able to spend less time labeling data afterwards.
However, finding pre-labeled data often means using costly expert knowledge or conducting a time-
consuming search process. On the other hand, conducting a post-labeling process on a heterogeneous
amount of gathered data provides the flexibility and control we may need in some contexts, in return
for a much lengthier labeling process. Hybrid approaches that join the advantages of both are
also a possibility and were considered as well.
The second dimension we should consider is the format of the gathered data. When dealing
with real data, we cannot expect to receive it from the source already formatted in the
intended final representation. A preprocessing phase is often needed and, depending on the source
representation, this phase can become very time consuming or even produce low-quality results. For
instance, nowadays, decoding signal data into a symbolic representation is a complex and fault-prone
task while, conversely, the translation between symbolic representations is most of the time simple to
program. This means that extracting pianorolls from MIDI files is easy, whereas extracting
them from WAV files is much more complex. However, while the latter are relatively easy to find
online, the former are not, and most of the time they are not pre-labeled with the terms we desire, nor
heterogeneous enough to allow us to apply the desired labels afterwards.
Besides these two, there are many other dimensions we can pay attention to when defining criteria
for choosing the best data source and methodology. As guidelines for our search for data sources,
we focused on pre-labeled symbolic data sources, in order to prevent great losses during preprocessing
and to avoid the time-consuming task of post-labeling the data.
After analyzing the pros and cons of some of these approaches, we included in our representative
dataset of epic music some samples available in an open score library created and managed by the
company responsible for MuseScore1, an open-source score editing software. From this library of
original and adapted pieces, available for download in MIDI format, we considered exclusively those
samples that were provided in a specific group uniquely dedicated to “Epic Orchestral Music”2.
1https://musescore.com
2https://musescore.com/groups/epicorchestralmusic
We
used the assumption that the content (orchestral pieces) present in this group could represent the
concept of epic music well and, in this sense, this group constitutes a pre-labeled MIDI data source which
supported a simple way to retrieve data for our representative dataset. Moreover, the high availability of
the service and the existence of other groups dedicated to other musical styles provide a free and easy
way to create new and distinct datasets. On the other hand, there are also some disadvantages. One
is copyright, which may partially or entirely forbid us from freely distributing our dataset and sharing it with
the rest of the scientific community. Besides that, we must take into account that most of this content
was added to the “Epic Orchestral Music” group by content creators, which may have some impact on
the overall quality of the dataset. Summing up, we considered that this solution achieved a good balance
between dataset quality and the time spent on data management and processing.
For our dataset of melodies, we opted for a very practical and brief approach. Since “melody”
is a much less subjective concept, we used an automatic melody generator3 available online to create
a dataset of heterogeneous MIDI melodies. This generator uses parameters such as the tonality
factor, which regulates how tonal the melody should be; the proximity factor, which fosters smaller intervals;
or even a repeated-notes parameter that allows the repetition of notes.
3.3 Preprocessing
Music-dedicated tools are usually not as popular as those used in other domains. When searching
for Python libraries for MIDI and pianoroll preprocessing and visualization, although we were able to
find some, they were very dispersed and not uniform. Consequently, during the whole development
of this work, we used different libraries and different representations, such as midi.Pattern4,
pretty_midi.PrettyMIDI5, pypianoroll.Multitrack6 and mido.MidiFile7. All these representations
were finally integrated in one unified library, the Gmidi8 (General MIDI) library, along with a new
one, midiarray.MidiArray, and some new functionalities. This new library works as a facade for all the
representations, transparently translating from one to another when we want to use specific methods or
access certain attributes. Inside the Gmidi library, we also included some new methods for visualizing,
storing and preprocessing MIDI data, such as chopping a MIDI file into blocks or re-orchestrating
a MIDI file.
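The facade idea behind such a library can be sketched as follows; all class, method and loader names here are hypothetical illustrations of the pattern, not the actual Gmidi API:

```python
# Stub loaders standing in for the real backend libraries, so the sketch
# is self-contained and runnable.
def load_pypianoroll(path):
    return {"backend": "pypianoroll", "path": path}

def load_mido(path):
    return {"backend": "mido", "path": path}

class MidiFacade:
    """A single entry point that lazily translates a MIDI file into
    whichever backend representation a method call requires."""

    def __init__(self, path):
        self.path = path
        self._cache = {}          # backend name -> loaded representation

    def _as(self, backend, loader):
        # Translate to the requested backend only once, on demand.
        if backend not in self._cache:
            self._cache[backend] = loader(self.path)
        return self._cache[backend]

    def pianoroll(self):
        # Delegate to a pypianoroll-style loader for matrix operations.
        return self._as("pypianoroll", load_pypianoroll)

    def messages(self):
        # Delegate to a mido-style loader for event-level access.
        return self._as("mido", load_mido)
```

The caching keeps repeated translations cheap, which matters when one file is accessed through several backends during preprocessing.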
Our preprocessing procedure is summarized in the following steps:
1. Re-orchestrate: the gathered MIDI files had different numbers of tracks and used different sets
of instruments; therefore, in order to uniformize them, we mapped each MIDI program into one of
our 8 tracks, assigned a destination track to each of the original tracks based on this mapping, and
finally copied every note from each original track to its corresponding destination track;
2. Translate into pianoroll: the MIDI files are transformed into pianorolls by walking through the list of
events and filling all the cells of the matrix that correspond to the timesteps between the start and
stop note events;
3. Chop: after considering several chopping techniques, we opted for a non-informed chopping
technique that chops one song into contiguous, non-overlapping, fixed-size blocks;
4. Transpose: thanks to equal temperament, i.e., all semitones being equal, we can move the pitch
of each note n semitones up or down without damaging the relationships between them; thus
we transposed each piece to each of the pitches between −6 and +5 semitones,
in an attempt to create more variety in our dataset;
5. Thresholds: we discarded notes that were played too softly or too loudly;
6. Normalize: all value-encoded velocities were mapped into a value between 0 and 1.
3https://www.link.cs.cmu.edu/melody-generator
4https://github.com/vishnubob/python-midi
5https://github.com/craffel/pretty-midi
6https://salu133445.github.io/pypianoroll
7https://mido.readthedocs.io
8https://github.com/LESSSE/gmidi
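The chop, transpose and threshold/normalize steps can be sketched as follows; the threshold values and the wrap-around behavior of the transposition are illustrative assumptions of this sketch:

```python
import numpy as np

def chop(pianoroll, block_len=384):
    """Step 3: non-informed chopping into contiguous, non-overlapping,
    fixed-size blocks along the time axis (pitch x time x track)."""
    n_blocks = pianoroll.shape[1] // block_len   # trailing remainder dropped
    return [pianoroll[:, i * block_len:(i + 1) * block_len, :]
            for i in range(n_blocks)]

def transpose(block, semitones):
    """Step 4: shift every pitch by n semitones. np.roll wraps pitches
    at the extremes; with shifts of -6..+5 and sparse extreme registers
    this is assumed harmless."""
    return np.roll(block, semitones, axis=0)

def threshold_normalize(block, low=0.05, high=1.0):
    """Steps 5-6: silence velocities outside [low, high] (the exact
    thresholds here are assumptions) and keep values in [0, 1]."""
    out = block.copy()
    out[(out < low) | (out > high)] = 0.0
    return out

# A toy "song": 800 timesteps of random velocities already in [0, 1].
song = np.random.default_rng(1).random((128, 800, 8)).astype(np.float32)
blocks = [threshold_normalize(transpose(b, +2)) for b in chop(song)]
```

In the actual pipeline, these steps run after re-orchestration and the MIDI-to-pianoroll translation.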
Data augmentation techniques are methods by which one can expand the size of a dataset by
exploiting known symmetries (or other kinds of properties) of the data. During the creation of the dataset,
several augmentation techniques were considered: transposition (an epic piece in a different key still
sounds epic), doubling or halving rhythmic values (the sonority of a song does not depend on the
relative durations of the notes), and track switching (swapping the parts of two instruments should not
make an epic song less epic, though it can make it very difficult for human musicians to play). Some
initial experiments were conducted using data augmentation by transposition, following an adaptation of
the SGD process in which all the instances of a batch were augmented from the same original instance,
a technique we called Augmented Stochastic Gradient Descent. However, in the end, no augmentation
method was used, since we verified that it added no clear value to the overall approach.
During early development, we verified that storing all the data as fully preprocessed matrices
would require more memory than we had available. Assuming a 32-bit floating-point format
to store the 12 transpositions of one block with the dimensions defined in section 3.1 (128 pitches,
384 timesteps and 8 tracks), a single block of one epic song would occupy 128 × 384 × 8 × 4 × 12
bytes ≈ 18 MB. Knowing that the 335 songs gathered in our dataset have on average 17 blocks, we would
need more than 335 × 17 × 18 MB > 100 GB of storage space just for one of the datasets.
MIDI files are a much more efficient way to store our data, with an average file size of 20.17 KB and
a maximum of 118.91 KB, so we first adopted an online preprocessing method, where each instance is
entirely preprocessed just before being used for training, and the results of preprocessing are never
stored. However, this approach proved to be very time-inefficient. Finally, to
address this last issue, we returned to an offline approach and developed a sparse representation, also
included in the Gmidi library, that takes advantage of the sparsity of our data, allowing us to store
already preprocessed data with an average size of 1.20 MB per file, while freeing the training phase
from all preprocessing overhead.
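The storage arithmetic above, and a coordinate-list sparse encoding in the spirit of the one described (the actual Gmidi format may differ; this is a sketch), can be checked as follows:

```python
import numpy as np

# Dense cost per block, as in the text: 12 transpositions of a
# 128 x 384 x 8 float32 block.
dense_bytes = 128 * 384 * 8 * 4 * 12              # ~18 MB per block
dataset_bytes = 335 * 17 * dense_bytes            # > 100 GB for the dataset

def to_sparse(block):
    """Store only the coordinates and values of nonzero cells."""
    idx = np.nonzero(block)
    return np.stack(idx, axis=1).astype(np.int16), block[idx]

def to_dense(shape, coords, values):
    """Rebuild the dense block from the coordinate list."""
    block = np.zeros(shape, dtype=np.float32)
    block[tuple(coords.T)] = values
    return block

# One quarter note (24 timesteps) in an otherwise empty block.
block = np.zeros((128, 384, 8), dtype=np.float32)
block[57, :24, 6] = 0.5
coords, values = to_sparse(block)
restored = to_dense(block.shape, coords, values)
```

Since typical blocks are mostly zeros, the coordinate list is orders of magnitude smaller than the dense matrix, which is what brings the average file size down to about 1.20 MB.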
3.4 Datasets Characterization
From the 561 songs available in the “Epic Orchestral Music” group on the 30th of May 2018, we filtered
out those that used strange programs our representation could not support, used no strings or were
composed for solo instruments, or were too big to load into memory as a matrix, keeping only 335 of
those songs in the final version of the Epic Dataset. The complete list of references for the resulting set
of epic songs is available in Appendix 7.
The Melody Dataset includes 300 melodies in various keys, generated using different
values for the proximity and tonality factors, yielding a heterogeneous set of melodies. Table 3.1 compiles
some statistics and metrics that characterize our new datasets: the Epic Dataset and the Melody Dataset.
In the first part of the table we have some statistics about the users that uploaded content
included in the final Epic Dataset, including those that contributed the most. As expected in any scale-
free network, the number of songs is not uniformly distributed over the users; it follows a power law, i.e.,
there are a few authors contributing a lot to the dataset while there are many composers contributing
only one piece. For instance, the user 10712571 is responsible for uploading more than one third of the
blocks included in the dataset (2332 / (16.91 × 335) ≈ 0.4), and the four users that contributed the most
cover more than 50% of the dataset.
The second and third parts of the table show information about the dimensions and musical features of the songs and melodies in our datasets. The methods used to calculate these musical metrics were adapted from the MuseGAN project [8] and were included in the Gmidi library. The data shows that although the strings dominate in epic songs, meaning that, on average, they have more volume than any other instrument, one epic song usually uses 6 of the 8 tracks in our representation. Both epic songs and melodies have few silent spots and usually use no more than 6 of the 12 possible pitch classes (all the C notes belong to the same class, regardless of the octave). The diatonic, harmonic and pentatonic metrics measure how well the pitch classes used in a sample correspond to the diatonic, harmonic and pentatonic scales, while the harmonicity metric represents the tonal relationship between different tracks.
Figure 3.3: Evolution of the volume along an average epic song in the new epic dataset
In Figure 3.3 we can see that both the average and the standard deviation of the sound volume increase along an epic song. This graph was obtained by aligning the beginnings and endings of the songs and using linear interpolation to align the remaining points between these two extremes. This sound volume measure is affected not only by how loud the instruments are playing but also by how many of them are playing at each moment, and it is calculated by summing all the velocity values and dividing by the number of timesteps in the block.
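The volume measure described above can be sketched in a few lines; the block shape convention and the toy values are illustrative assumptions.

```python
import numpy as np

def block_volume(block):
    """Average sound volume of a pianoroll block, computed as in the text:
    the sum of all velocity values divided by the number of timesteps.
    `block` is assumed to have shape (timesteps, pitches, tracks)."""
    timesteps = block.shape[0]
    return block.sum() / timesteps

# Toy example: 384 timesteps, 128 pitches, 8 tracks, one note playing.
block = np.zeros((384, 128, 8), dtype=np.float32)
block[:, 60, 0] = 0.5   # one string note held for the whole block
print(block_volume(block))  # 0.5
```

Note that a second simultaneous note of the same velocity would double the result, which is how the measure also reflects how many instruments are playing at each moment.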
3.5 Summary
In conclusion, this chapter presented the two new datasets created for this work: the Epic Dataset and the Melody Dataset. The first contains samples of epic songs originally obtained from an online group exclusively dedicated to orchestral epic music, preprocessed and translated into sequences of pianoroll blocks with 8 tracks, 128 pitches and 384 timesteps. The data in our Melody Dataset was created by applying the same preprocessing procedure to MIDI files produced by an automatic melody generator. During the creation of the datasets we developed the Gmidi library in order to gather, compile and integrate all the tools we used.
Table 3.1: Characterization of the new datasets

                                               Epic Dataset     Melody Dataset
Number of songs                                335 songs        300 songs

Users and Contributions
Distinct Users                                 78 users         -
Users with only one song                       45 users         -
User with Greatest Contribution                user 10712571    -
Greatest Contribution (Songs)                  100 songs        -
Greatest Contribution (Blocks)                 2332 blocks      -
Second User with Greatest Contribution         user 2544941     -
Second Greatest Contribution (Songs)           18 songs         -
Second Greatest Contribution (Blocks)          274 blocks       -

Dimensions
Average Number of Original Tracks per Song     24.6 tracks      1 track
Average MIDI File Size per Song                20.14KB          3.11KB
Average Number of Blocks per Song              16.91 blocks     12.38 blocks
Average Number of Ticks per Song               6781.33 ticks    4755.20 ticks
Average Number of Notes per Song               1876.48 notes    347.50 notes
Average Duration per Song                      02m57s           01m43s
Total MIDI Size                                6.59MB           933.07KB
Total Number of Blocks                         5665 blocks      3715 blocks
Total Duration                                 16h30m56s        08h35m52s

Musical Metrics
Statistical Mode of Dominant Instrument                         Strings    Piano
Statistical Mode of Number of Instruments Used per Song         6          1
Average Empty Timesteps Ratio per Song                          0.042263   0.097024
Average Volume per Sounding Tick                                4.134      0.503
Statistical Mode of Number of Pitch Classes Used                6          5
Statistical Mode of Pitch Extension                             65         34, 36
Highest Pitch Used                                              111        78
Lowest Pitch Used                                               0          42
Average Reused Notes Index                                      43.216     25.288
Average Ratio of Qualified Notes (≥ a sixteenth note triplet)   0.920664   0.999812
Average Ratio of Long Notes (≥ an eighth note)                  0.607416   0.534424
Average Ratio of Long Notes (≥ a quarter note)                  0.338370   0.159075
Average Diatonic Similarity Metric                              0.582159   0.627548
Average Harmonic Similarity Metric                              0.545730   0.584224
Average Pentatonic Similarity Metric                            0.369851   0.298396
Average Harmonicity Metric                                      1.403273   -
Average Polyphonic Ratio per Song (> 2 notes)                   0.774234   -
Average Polyphonic Ratio per Song (> 3 notes)                   0.618742   -
4 Models
Contents
4.1 Environment and Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 HRBMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 MuseGAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 MuCyG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.5 Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Environment and Tools
Choosing the right tools to tackle a problem is a decisive issue that can define the success or failure of any project. Therefore, before developing any piece of software, one should always survey the available tools and resources. To find the right tools, we need to balance the learning curve, the time spent configuring and maintaining the development environment and the time spent developing the final software, while taking into account other issues such as fault recovery capability, code reviewing and the flexibility of the tools.
Nowadays, there is a wide range of tools for developing ML models, providing various degrees of abstraction and different levels of flexibility: some allow creating new models by writing the code of the different stages of the model, while others provide drag and drop interfaces to build the data flow; some were developed for on-premise development, while others run on the cloud for scalability; some are for general use, while others focus on specific branches of ML. In Table 4.1 we compare some basic machine learning development tools.
We decided that using a Python 3 based tool was the best choice due to its simplicity, flexibility and our familiarity with it. Python is a simple, flexible and easy to learn object-oriented language that, in recent years, has been widely used for all kinds of applications, including DL, thanks to the great availability of DL related libraries such as Tensorflow. Tensorflow is a powerful and flexible dataflow programming framework widely used for general ML. In this dataflow programming paradigm, the programmer writes code to create and manipulate a set of objects representing data operations. These objects define a graph of operations through which the data will flow. Tensorflow has a big, very active and helpful online support community. There is also a large online collection of models already implemented in Tensorflow on Python, including some of the models we wanted to adapt.
We worked on a local Windows machine, from which we managed and accessed all the other development environments. First, using Vagrant, we created a local Linux virtual machine that, using Ansible, automatically configured a Jupyter Notebook server and all the Python libraries we needed for quickly testing small pieces of code, visualizing results and semi-automating reporting. This environment allowed us to test new ideas and to start working on the models, even without any internet connection; in case of a problem with the local machine, it could easily be set up on a different machine with virtualization tools and Vagrant installed, by cloning a git repository available on GitHub. However, the models grew and it became practically impossible to run them on a local machine. Therefore, we used servers provided by the L2F group from Instituto de Engenharia de Sistemas e Computadores
- Investigação e Desenvolvimento (INESC-ID), in Lisbon. These machines provided enough Graphics Processing Unit (GPU) power and memory to successfully and efficiently run our models.

Python: https://www.python.org; Vagrant: https://www.vagrantup.com; Ansible: https://www.ansible.com; Jupyter: https://jupyter.org; repository: https://github.com/LESSSE/music-cage-machine

Table 4.1: Comparison between tools for ML model development

Framework        Code vs GUI   Local vs Cloud   Generic vs Specific
Tensorflow       Code          Both             Generic
Keras            Code          Both             DL specific
Deep Cognition   Both          Cloud            DL specific
Azure ML Studio  GUI           Cloud            Generic
Lastly, regarding development tools, to avoid additional tool configuration tasks, no Integrated Development Environment (IDE) was used. Aside from the Jupyter Notebooks, we ended up using a simple text editor, vim, and discarded any debugging, refactoring and automated testing tools, which in the end may have slowed down the development process.
4.2 HRBMM
A Boltzmann Machine is an energy-based generative model commonly used to generate discrete data, in which each node is sampled from a Bernoulli distribution with parameter p, the probability of success, that depends on the values, 0 or 1, of the other nodes. The dependencies between nodes are codified in a set of weighted arcs and the learning process consists in iteratively modifying these weights, lowering or raising the p value of each node according to the given data instances, expecting to converge to a state where the arcs codify the important relationships between the nodes. After the training phase, we can stochastically generate new data by providing random values for part of the nodes, propagating the weights to calculate the p value for the remaining nodes and sampling them. Restricted Boltzmann Machines (RBM), briefly presented in 2.2.3, usually have two layers of nodes, the visible and the hidden layers, where connections between nodes in the same layer are not allowed, as represented four different times in Figure 4.1.
The first architecture we used, which we named Hierarchical Restricted Boltzmann Musical Machine (HRBMM), uses several RBM, each identified by a natural number, to create multitrack musical excerpts. It uses a different machine for each of the N tracks, RBM 1 to N, and one additional machine, identified by the number 0, that computes on the concatenation of the hidden states of the tracks' RBM. We can see these machines as a way to codify a visible state into a smaller representation, the hidden state. With these machines, it is also possible to go the other way around, trying to recover the visible state from the code, but since the model is stochastic this operation commonly yields a different visible state.
Since this model usually computes only binary data, the first step is to binarize our data using a threshold value. Usually, any intensity value higher than 0 is considered a playing note, but different values can be used to eliminate noisy notes. The training phase starts by inserting the binary vectorized
Figure 4.1: HRBMM architecture
pianoroll (an operation also called flattening) of each of the 8 tracks into the visible layer of the respective machine. With this in mind, one can easily conclude that the visible layer of each of these 8 machines has 128 × 384 = 49 152 nodes, the number of pitches times the number of timesteps in our representation. These are connected to 128 hidden nodes using (128 × 384) × 128 arcs plus one extra bias arc for each of the nodes, resulting in a total of (128 × 384) × 128 + (128 × 384) + 128 = 6 340 736 arc weights per track machine that we need to optimize. The hidden states of RBM 1 to 8 are concatenated into a 128 × 8 = 1024 sized visible state for machine number 0, which uses a new hidden state, also with 128 nodes. This last machine adds (1024 × 128) + 1024 + 128 = 132 224 arcs to train, including the arcs between the layers and the bias arcs. Counting everything, we end up with a total of 8 × 6 340 736 + 132 224 = 50 858 112 parameters to optimize during the training process.
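The parameter arithmetic above can be checked with a few lines of Python:

```python
# Parameter counts for the HRBMM, reproducing the arithmetic above.
pitches, timesteps, tracks, hidden = 128, 384, 8, 128

visible = pitches * timesteps                      # 49 152 visible nodes per track
track_rbm = visible * hidden + visible + hidden    # weights + visible/hidden biases
assert track_rbm == 6_340_736

top_visible = hidden * tracks                      # 1 024 concatenated hidden states
top_rbm = top_visible * hidden + top_visible + hidden
assert top_rbm == 132_224

total = tracks * track_rbm + top_rbm
print(total)  # 50858112
```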
During the training process, after inserting the input in machine i, we propagate the visible values v_i to the hidden nodes by multiplying them by the weight matrix W_i, adding the hidden bias b_h, applying the sigmoid function σ to this sum, as shown in Equation 2.6, and finally sampling using the resulting value to parameterize a Bernoulli experiment. After getting all the hidden states for machines 1 to N, we concatenate them, inject the result in the visible layer of machine number 0 and repeat the same step to propagate the values to the hidden layer.
Once we arrive at this state, we go backwards using the transposed matrices to sample new visible states for each of the machines, i.e., 8 new segments, one for each of the 8 tracks. The process that goes from one visible state to another is called a Gibbs step and a Gibbs sample uses one or more Gibbs steps. Due to time efficiency constraints, it is common to perform only one Gibbs step during the training process. After having one sample of the hidden and visible states, we can use the differences between these states and the states first propagated from our data to update the biases and weights. The sampling process is similar, using only one Gibbs step on an all-zero initial visible state, and a full song is created by concatenating several sampled blocks.
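A minimal sketch of one Gibbs step follows, using toy dimensions instead of the 49 152 × 128 machines described above; the weight initialization and variable names are illustrative assumptions, not the ones used in our implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_bernoulli(p):
    # Each node is 1 with probability p, as in a Bernoulli experiment.
    return (rng.random(p.shape) < p).astype(np.float32)

def gibbs_step(v, W, b_h, b_v):
    """One Gibbs step of an RBM, as described above: propagate the visible
    state to the hidden layer and sample it, then go backwards with the
    transposed weight matrix to sample a new visible state."""
    h = sample_bernoulli(sigmoid(v @ W + b_h))
    v_new = sample_bernoulli(sigmoid(h @ W.T + b_v))
    return v_new, h

# Toy dimensions for illustration.
n_visible, n_hidden = 16, 4
W = rng.normal(0, 0.1, size=(n_visible, n_hidden)).astype(np.float32)
b_h = np.zeros(n_hidden, dtype=np.float32)
b_v = np.zeros(n_visible, dtype=np.float32)

v0 = sample_bernoulli(np.full(n_visible, 0.5))
v1, h1 = gibbs_step(v0, W, b_h, b_v)
print(v1.shape, h1.shape)  # (16,) (4,)
```

The weight update would then use the difference between the statistics of (v0, h) and (v1, h1), following the contrastive scheme described in the text.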
After a quick analysis of this model, one of the first things that caught our attention was that it only generates binary data, which leads to music without any dynamics. Knowing that epic music takes advantage of dynamics to create an impact on the listener, this approach may not fit our goals. Another problem with this approach is that it is not easy to control the output, i.e., to include features from one input in the final product, which makes it unclear how the melody inspiration mechanism could or should be included. Still, it was a very straightforward and easy to implement model with the capability of providing great results, especially in what concerns novelty.
4.3 MuseGAN
MuseGAN, first presented in November 2017 by Dong et al. [8], was, to the best of our knowledge, the first application of GAN, discussed in 2.2.3, to the task of multitrack symbolic music generation. This project also used pianorolls to represent the training and output samples, was also implemented in Tensorflow over Python, and explores three different ways of generating tracks, inspired by different contexts of music creation:
• the jamming model: where each track uses an independent generator and discriminator;
• the composer model: which uses a single generator and a single discriminator for all the tracks;
• a hybrid model: which combines features of both, having one private generator for each track but only one discriminator that evaluates the credibility of the joined tracks.
A basic sampling strategy would be to concatenate several results from the bar generator(s), but, since these bars are totally independent, this strategy does not model some important temporal relations in music, possibly resulting in a very incoherent final product. To address this problem, the authors include a vector that represents the temporal structure, a sequence of codes that can be used to generate a sequence of coherent bars, which are afterwards complemented with per-track and per-bar specific random seeds to promote variety in the results.
In the end, to simplify the task, Dong et al. [8] decided to generate binary pianorolls (ignoring note velocity), and to make the overall process more stable the authors used different techniques:
• Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP) [53] as the cost function;
(MuseGAN source code: https://github.com/salu133445/musegan)
• batch normalization [54] in the generator, which learns the mean and variance of the values in each layer and uses these values to normalize the hidden states;
• LReLU as the activation function;
• an unbalanced training, updating the generators only once every five updates of the discriminator.
We adapted the hybrid model to work against our Epic Dataset and limited the modifications to those that were imperative for working with this new data. Since one instance now consists of 4 bars of 8 tracks with 128 different pitches, these changes consisted mostly of changing the size of some variables and of some convolution filters. In the training process, we only chose a different learning rate, explained in 4.5, and dismissed the unbalancing factor, making both the discriminator and the generator update at the same time, keeping everything else.
Both the generators and discriminators use CNN, introduced in 2.2.2, to compress each of the blocks used to represent one song, presented in section 3.1, into a smaller representation. In the discriminator, the model uses the convolutional network represented in Figure 4.2. Starting with one block of shape (384, 128, 8), the first step is to divide it into 4 different bars of 96 timesteps and extend the pitch dimension by padding the data to a multiple of 12 semitones, i.e., an integer number of octaves (recall that a perfect octave has 12 semitones). The block is then fed into four different streams:
• the timestep-first stream: where, using convolutional filters over non-overlapping areas of the data, we reduce the dimensions of the data, starting with the timesteps;
• the pitch-first stream: which performs the same operations as the timestep-first stream but in a different order, compressing the pitch dimension first;
• the onset stream: which first identifies the beginning of each note and applies convolutional filters over the result;
• the chroma stream: which starts by folding the pianoroll into a single octave representation and then performs convolutions over it.
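The folding performed by the chroma stream can be sketched as follows; this is a minimal NumPy version, and the actual implementation in MuseGAN may differ.

```python
import numpy as np

def chroma(pianoroll):
    """Fold a (timesteps, 128) pianoroll into the 12 pitch classes:
    every C maps to class 0 regardless of the octave, every C# to
    class 1, and so on, as in the chroma stream."""
    timesteps, pitches = pianoroll.shape
    folded = np.zeros((timesteps, 12), dtype=pianoroll.dtype)
    for pitch in range(pitches):
        folded[:, pitch % 12] += pianoroll[:, pitch]
    return folded

roll = np.zeros((96, 128), dtype=np.float32)
roll[:, 60] = 1.0  # middle C
roll[:, 72] = 1.0  # the C one octave above
print(chroma(roll)[0, 0])  # 2.0: both octaves fold into pitch class 0
```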
After these streams, the resulting tensors are concatenated and reduced to a 512-dimensional vector. On the other hand, the generator uses a deconvolutional network which expands a new block from a compact representation, as shown in Figure 4.3. In this network we have two streams that are similar to the pitch-first and timestep-first streams but that expand the dimensions instead of reducing them.
Contrary to the binary sampling method originally used in the paper, we decided not to discretize the results, using the values to define the velocities of the notes. This way we can create dynamic fluctuations along the music. Therefore, when compared to HRBMM, this model has the advantage of making use of dynamics, and the usage of convolutional networks allows a smaller number of weights to optimize, dropping the total number of parameters to 9 821 081. However, as we have seen in section 2.2.3, GAN are complex models that are really difficult to train, that are not easy to optimize and that may suffer from many different problems for which there is not yet a general solution.
4.4 MuCyG
We present the Musical CycleGAN (MuCyG), which is, to the best of our knowledge, the first implementation of CycleGAN applied to multitrack symbolic music that takes velocities into account. With our main objectives in mind, we conceived what we considered a general model, inspired by the process of musical composition with inspiration mechanisms, that uses cyclical models, which have been achieving great results in one-to-one mappings across visual domains.
As in any generic cyclical model, such as the one presented in section 2.2.4, MuCyG uses two discriminators and two generators. Taking into account the main aims of this work, we attempt to use the model to translate melodies into epic music and the other way around. To minimize the complexity of the project, we used generators and discriminators based on the architecture used in MuseGAN, using CNN. Figures 4.2 and 4.3 present detailed schematic representations of the operations and the different states our 8-track data flows through inside our convolutional and deconvolutional architectures, respectively.
When compared to other CNN, these networks present two peculiarities: pooling layers are not used; and, when a convolutional filter is applied, there are usually no overlapping areas, i.e., most of the time the stride corresponds to the size of the filter.
In our convolutional network the data is compressed into a 512-sized representation. The first step is to expand the pitch dimension to the next multiple of 12 greater than 128, in order to always represent complete octaves (each one including twelve semitones). After that, different streams compress the data in different ways, exploring different features, but always converging into a concatenation step. Conversely, the deconvolutional network expands the representation, generating a new instance. At the end of the process we drop the pitches that were synthetically added to the data.
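The pitch padding step can be sketched as follows, assuming the (timesteps, pitches, tracks) layout used for our blocks:

```python
import numpy as np

def pad_to_octaves(block):
    """Pad the pitch dimension up to the next multiple of 12, so the
    convolutions always see whole octaves; for 128 pitches this adds
    4 rows of silence, giving 132 pitches, i.e. 11 complete octaves."""
    timesteps, pitches, tracks = block.shape
    target = -(-pitches // 12) * 12          # ceil to a multiple of 12
    pad = target - pitches
    return np.pad(block, ((0, 0), (0, pad), (0, 0)))

block = np.zeros((384, 128, 8), dtype=np.float32)
print(pad_to_octaves(block).shape)  # (384, 132, 8)
```

Dropping the synthetic pitches at the end of the pipeline is just the inverse slice, `padded[:, :128, :]`.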
Both these networks are used to build our generators, each consisting of a convolutional network followed by a deconvolutional network. The discriminators are made up of one convolutional network that compresses our blocks into 512-sized representations and feeds them to an LSTM, which is responsible for evaluating the pattern structure of these compressed codes. As adversarial loss function, we used WGAN without gradient penalties, instead of the WGAN-GP used in MuseGAN, because the latter was not supported by the Tensorflow version we were using. To measure the reconstruction loss we used a simple mean difference.
Figure 4.2: Convolutional network architecture used in MuseGAN and MuCyG for the epic dataset

Figure 4.3: Deconvolutional network architecture used in MuseGAN and MuCyG for the epic dataset
When training simultaneously with two non-aligned datasets of different sizes, some questions arise: should we balance the datasets? How do we make both datasets the same size? What is an epoch when the sizes are different? Should we train one of the domains with more examples? In our definition, one epoch corresponds to training the model with all the data once and only once. This way, in one epoch no instance is repeated or wasted, thanks to the implemented mechanism that only optimizes what is possible with the given data. Still, very unbalanced datasets lead to an inappropriate training process, which may cause problems such as mode collapse or overfitting.
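Our definition of epoch can be sketched with a simple pairing loop; the helper below is illustrative, and the actual mechanism in MuCaGEx may differ.

```python
from itertools import zip_longest

def epoch_batches(epic, melodies):
    """One epoch in the sense defined above: every instance from both
    datasets is used exactly once, with no repetition and no waste.
    When the shorter dataset runs out, the remaining steps carry only
    one domain, and the training loop is assumed to update only the
    losses that can be computed from the data it receives."""
    for e, m in zip_longest(epic, melodies, fillvalue=None):
        yield e, m

epic = [f"epic_{i}" for i in range(5)]
melodies = [f"mel_{i}" for i in range(3)]
steps = list(epoch_batches(epic, melodies))
print(len(steps))   # 5: one epoch has max(len(epic), len(melodies)) steps
print(steps[-1])    # ('epic_4', None): only the epic-side losses apply here
```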
The sampling process consists in feeding one melody to the epic generator and collecting its output.
This model aims at overcoming the lack of an inspiration mechanism. However, it is an experimental model of high complexity, which can take a very long time to optimize, possibly beyond our time constraints. For example, one particularly difficult point to manage is the balance between the adversarial losses and the reconstruction losses. To the best of our knowledge, this problem has not been addressed yet, probably because it only becomes visible when dealing with unbalanced loss measurements, such as the mean difference.
4.5 Tuning
Using these models and environments, we conducted several experiments on music generation in order to tune some aspects of the models. We chose the name MuCaGEx (Music Categories Generation Experiments) to identify all the software we developed to conduct these experiments; the code is fully available at https://github.com/LESSSE/public_MuCaGEx. In order to compare the three models fairly, we needed to make sure that all of them were in comparable states, since the choice of the right hyperparameters can have a huge impact on performance.
During the initial training experiments, the results of our models converged to completely empty tensors, all-zero matrices, representing completely silent samples. After checking for semantic bugs, we decided to focus our study on one of the most important hyperparameters: the learning rate. The learning rate is "reliably one of the most difficult to set hyperparameters because it significantly affects model performance", as noted by Goodfellow et al. [41]. However, the same authors also state that "[to choose the right learning rate] is more of an art than a science", making clear that, currently, there is no good general way to tune this value.
In our method to tune the learning rate, we trained each model on three random small subsets of the Epic Dataset with an exponentially growing learning rate, and plotted the difference between the densities of the original o and the generated instance g, as defined in Eq. 4.1. In Figure 4.4 we can see the resulting plots and verify that the three models behave differently when set with the same learning rate.
density(T_{m×n×t}) = (∑_{i=0}^{m} ∑_{j=0}^{n} ∑_{k=0}^{t} T_{ijk}) / (n × m × t)
loss = density(o) − density(g)                                          (4.1)
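Eq. 4.1 can be implemented directly; in the example, the all-zero generated block mirrors the empty tensors observed during the initial experiments.

```python
import numpy as np

def density(tensor):
    """Eq. 4.1: the sum of all entries of an m x n x t block divided by
    the number of cells, i.e. simply the mean activation."""
    return tensor.sum() / tensor.size

def density_loss(original, generated):
    return density(original) - density(generated)

o = np.full((128, 384, 8), 0.25, dtype=np.float32)   # original block
g = np.zeros_like(o)                                  # an all-silent sample
print(density_loss(o, g))  # 0.25
```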
Choosing a learning rate value associated with high variance should allow the training to move quickly towards the goal. In practice, it also allows overly big steps, resulting in an erratic training trajectory. We considered that this erratic movement had some creative potential, so we searched for spikes in the standard deviation and in the first derivative and used the corresponding learning rates. The final values used for the learning rates of each model are: 1 × 10−1 for HRBMM, 2.51 × 10−4 for MuseGAN and 5 × 10−4 for MuCyG.
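The sweep procedure can be sketched as follows; the base, the range and the simulated loss curve are illustrative assumptions, not the values recorded in our experiments.

```python
import numpy as np

lrs = np.logspace(-6, 0, 100)   # exponentially growing learning rate sweep

# Stand-in for the density losses recorded during the sweep (one per step);
# in the real experiments these come from training runs, not a formula.
rng = np.random.default_rng(0)
losses = np.tanh(np.log10(lrs) + 4) + rng.normal(0, 0.02, lrs.size)

# Look for spikes in the first derivative of the loss curve and pick the
# corresponding learning rate, as described in the text.
spike = np.argmax(np.abs(np.diff(losses)))
chosen_lr = lrs[spike]
print(chosen_lr)
```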
4.6 Summary
During this work, three models were adapted or implemented in Tensorflow to work against our Epic Dataset. MuCaGEx is the name of the repository that includes all the code we used to conduct several experiments on automatic music generation using these three models.
The HRBMM uses one dedicated RBM for each of the tracks and an additional one that computes on the concatenation of the hidden states of the track-dedicated machines. Since it is a binary stochastic model, it did not allow us to explore dynamics.
We adapted the hybrid MuseGAN model, presented by Dong et al. [8], which uses dedicated generators for each of the tracks that play against a single discriminator in a GAN-like environment. This model was able to model dynamics and some long term structure.
The last model, MuCyG, is an original model based on the idea of cycle consistency, explored in other models such as CycleGAN [56] and DiscoGAN [57]. This model is intended to translate melodies into epic music excerpts and was the only one of the three that used both the Epic Dataset and the Melody Dataset. Using two streams, two generators, two discriminators and two types of loss functions, it generates a product from a melody and generates a second melody from this product, making sure that the generated product is indistinguishable from real epic music samples and that the second melody is the same as the original one. In the second stream, the process starts with an epic music excerpt that is translated into a melody.
We tuned the learning rates of our models by studying the variation in density of the generated samples when the models were trained with an exponentially increasing learning rate. After this study, we chose learning rate values associated with big variations in density.
5 Results and Evaluation
Contents
5.1 Final Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1 Final Experiment
The final experiments used a methodology inspired by the way humans learn music composition: in the beginning, one studies some "classic" examples in detail, and only then does one gradually introduce more and more complex examples, taking less time to learn in each iteration. The training process adopted the following steps:
1. first, using a subset of only 15 elements of each of the datasets, we trained each model for one full day of CPU time;
2. after this, we took our first 32 second long sample, which corresponds to 4 blocks;
3. then, we trained the models one more time using sub-datasets of 30 random samples for only half a day;
4. finally, we sampled a second excerpt, also with a duration of 32 seconds.
The resulting 8 samples (including two samples composed by humans, randomly selected from the dataset), hereinafter referred to as Human 1, Human 2, HRBMM 1, HRBMM 2, MuseGAN 1, MuseGAN 2, MuCyG 1 and MuCyG 2, have their pianorolls represented in Figures 5.1(a), 5.1(b), 5.2(a), 5.2(b), 5.3(a), 5.3(b), 5.4(a) and 5.4(b), respectively. From these figures, we can draw some conclusions about the characteristics of our samples.
HRBMM's samples look very chaotic, without any kind of order or long notes. However, we can see that the strings, represented in orange, dominate in HRBMM 1, Figure 5.2(a), while they share the dominance with the brass instruments in HRBMM 2, Figure 5.2(b).
In MuseGAN's samples, represented in Figures 5.3(a) and 5.3(b), we can see a pattern that is repeated 16 times, with some variations, corresponding to the 16 bars in our samples. We can also see patterns repeated 4 times inside each bar, 64 times in total along the sample, representing the 4 beats of the quaternary time signature. According to Simchy-Gross and Margulis, "repetition musicalizes [...] sounds" [64]. This effect makes the repetition provide musical coherence and makes the samples sound like minimal music. On the other hand, this not completely identical repetition is a symptom of a bigger problem in our model: mode collapse. Besides this, we can see longer notes and a dominance of the strings in both samples.
Finally, we can see that similar repetition effects and symptoms are present in MuCyG's samples, represented in Figures 5.4(a) and 5.4(b). Possibly also due to the mode collapse problem, the melodies generated by MuCyG were empty, and all the epic songs generated with a fixed set of learned parameters sounded identical regardless of the input. These samples also make strong use of the strings but, possibly due to the additional reconstruction restriction, both feature fewer notes.
The samples are available at http://web.tecnico.ulisboa.pt/~ist178303/mucagex/Final_Samples/.
(a) Human 1
(b) Human 2
Figure 5.1: Pianoroll representation of Human composed samples used in survey
(a) MuseGAN 1
(b) MuseGAN 2
Figure 5.3: Pianoroll representation of MuseGAN’s samples used in survey
5.2 Survey
In order to compare the results of the experiments using our three models, we conducted an online survey. In Appendix 9, we provide one example of the survey presented to the users, with the difference that there we present the names of the samples below each audio controller. This responsive web page was implemented using PHP and SurveyJS, a JavaScript library, and the final results, after being stored in JSON files, were analyzed using the Pandas and NetworkX libraries and visualized using the Matplotlib library for Python in a Jupyter Notebook.
The final samples were compared in three different contexts:
• word description, one open question where the respondent could insert up to three words to de-
scribe each one of the excerpts presented;
• one exercise where the sentence "This epic music is creative.", referring to two specific excerpts, namely MuCyG 1 and Human 2, is evaluated on a Likert scale from 1 to 10, several times, as the respondent is given new information about the inspiration, explanation and nature of the excerpt;
• one relative direct confrontation, where excerpts were pitted against each other in an evaluation focused on one of 5 specific characteristics, and where the user explicitly chose the winner.
The information retrieved from the 100 responses we received characterizes a population as presented
in Table 5.1 and Table 5.2. As we can see, most respondents were between 18 and 34 years old and had
no prior information about the project and its main objectives. The sample was very rich, with a wide
variety of relationships with music: performers, musicologists and composers, and even some producers,
music teachers and conductors. 13 people who answered the questionnaire had some knowledge of music
technologies, and only 5 people had no relationship with music at all. On a weekly basis, most respondents
said they spend roughly 1 to 6 hours watching films, playing games and watching videos, and 6 to 12 hours
listening to music. Moreover, while 77% spend less than 1 hour at music concerts, only 7% spend less
than 1 hour listening to music. The survey was provided in both Portuguese and English, seeking some
richness and variety of nationalities in our sample.
2 http://web.tecnico.ulisboa.pt/~ist178303/mucagex
3 https://surveyjs.io
4 https://pandas.pydata.org
5 https://networkx.github.io
6 https://matplotlib.org
Table 5.1: Summary of weekly time spent on music-related hobbies (frequency and proportion, N = 100)

                     Listening     Watching     Playing       Attending       Watching
                     to music      films        video games   live concerts   videos
None                  1 (0.010)     2 (0.020)   33 (0.330)    33 (0.330)       7 (0.070)
Less than 1 hour      6 (0.060)    12 (0.120)   17 (0.170)    44 (0.440)      12 (0.120)
1-6 hours            17 (0.170)    42 (0.420)   27 (0.270)    19 (0.190)      39 (0.390)
6-12 hours           27 (0.270)    23 (0.230)   14 (0.140)     0 (0.000)      17 (0.170)
12-18 hours          17 (0.170)     8 (0.080)    2 (0.020)     1 (0.010)       9 (0.090)
18-24 hours          14 (0.140)     7 (0.070)    3 (0.030)     0 (0.000)       9 (0.090)
More than 24 hours   17 (0.170)     4 (0.040)    2 (0.020)     0 (0.000)       5 (0.050)
Table 5.2: Summary of our sample's age, knowledge about the project and relationship with music

                                   Frequency   Proportion
Total                                    100        1.000

Age group
  Under 18 years old                      10        0.100
  18-34 years old                         68        0.680
  35-54 years old                         18        0.180
  55-74 years or older                     4        0.040

Knowledge about the project
  Never heard of it                       77        0.770
  I don't know its goals                  12        0.120
  I know the main goals                    8        0.080
  I know implementation details            3        0.030

Music relationship
  Business                                 2        0.020
  Composition                             17        0.170
  Conducting                               5        0.050
  Critic                                   2        0.020
  Instruments sale                         1        0.010
  Musicology                              26        0.260
  None                                     5        0.050
  Other (Cinema)                           1        0.010
  Performing                              41        0.410
  Production                              13        0.130
  Teaching                                15        0.150
  Technology                              12        0.120
  Therapy                                  2        0.020
Table 5.3: Four most used words per model

Human                  HRBMM           MuseGAN              MuCyG
EPIC 13                CONFUSION 34    REPETITIVE 23        REPETITIVE 29
CINEMATOGRAPHIC 11     CHAOS 16        SUSPENSE 18          BELLS 12
HAPPINESS 6            RANDOM 8        CINEMATOGRAPHIC 8    MYSTERIOUS 6
ELECTRONIC 6           NOISE 8         TENSION 7            MONOTONY 6
5.2.1 Word Description
To analyze the results of this question, we needed to deal with some common issues in the natural
language processing field: mistyping; translation, since our data contained words from two distinct
languages; verb tense and (in Portuguese) gender; and different word classes (nouns, verbs, adjectives)
referring to the same root. Therefore, an effective counting of concepts, i.e., joining similar words into
a single concept, is a hard task.
However, since our data was entirely composed of meaningful isolated words, we could use a very
simple and basic processing methodology to aggregate similar words. We joined together into one unique
concept all the words that matched in more than 2/3 of the characters of the shorter of the two words.
We are sure that this technique, when applied to more complex cases, will not achieve great performance
metrics; however, in this practical case, it achieved acceptable results.
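The matching heuristic above can be sketched as follows. Since the exact string comparison is not specified, the longest common prefix is used here as one plausible reading of "matched characters"; it groups inflections sharing a root, as described.

```python
# Sketch of the aggregation heuristic: join two words into one concept when they
# match in more than 2/3 of the shorter word's characters. The comparison used
# here (longest common prefix) is an assumption, one plausible reading of the text.
from os.path import commonprefix

def similar(a: str, b: str) -> bool:
    a, b = a.lower(), b.lower()
    shorter = min(len(a), len(b))
    return len(commonprefix([a, b])) > (2 / 3) * shorter

def aggregate(words):
    """Greedily group words into concepts keyed by their first representative."""
    concepts = {}  # representative word -> count
    for w in words:
        for rep in concepts:
            if similar(w, rep):
                concepts[rep] += 1
                break
        else:
            concepts[w] = 1
    return concepts

print(aggregate(["repetitive", "repetition", "bells", "bell", "chaos"]))
# {'repetitive': 2, 'bells': 2, 'chaos': 1}
```

As the text notes, such a prefix rule would over- or under-merge on harder data, but for isolated descriptive words it is serviceable.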
The songs were grouped by model. In Table 5.3, we present the four most used terms to describe the
samples of each one of the models. As we can see in this table, the characteristics we identified in the
graphical representation of the excerpts are also perceived when listening to them: for instance, while
HRBMM is described as chaotic, random and messy, both MuseGAN and MuCyG are classified as mostly
repetitive. According to this data, only the human excerpts are perceived as epic, while creativity or
creative are not among the most frequent words used to describe the excerpts of any model.
We can find two different kinds of words in this table: descriptive words, those that corroborate the
characteristics we previously identified in the graphical representation of the excerpts, such as confusion,
chaos, random, noise, repetitive, bells and monotony; and effect words, which express characteristics
much more easily detected, or only detectable, when we listen to the samples, such as epic,
cinematographic, happiness, suspense, tension or mysterious. Samples described more frequently with
effect words affected the listener more than those described with descriptive words. With this analysis, we
can conclude that HRBMM was not able to create an effect on the listener, while the human examples
were very good at it. We can also conclude that MuseGAN was better at affecting the listener than
MuCyG.
Table 5.4: Summary of the results about the impact on creativity

                            Human 2           MuCyG 1
Impact                      mean     std      mean     std
Base                        6.03     2.52     5.77     2.23
Melody                      6.40     2.61     5.92     2.41
Explanation                 6.28     2.52     5.92     2.44
Computer                    6.40     2.54     5.84     2.48
Melody - Base               0.37     1.49     0.15     1.13
Explanation - Melody       -0.12     1.12     0.00     0.94
Computer - Explanation      0.12     0.77    -0.08     0.76
5.2.2 Impacts on Creativity
We believe that some knowledge about a product can deeply influence the perception of creativity.
Therefore, in this question we aimed to study the impact of three different aspects:
• the first one is the impact of knowing that a song was inspired by a melody, hereinafter called the
melody factor;
• the second aspect we wanted to study is how explaining the creative product using external concepts
impacts the way our audience evaluates the overall creativity of the product, which we named the
explanation factor;
• the last and most important aim of this question, intrinsically related to our main goal in this work, is
to study the existence of bias in favor of or against computer-generated music, a factor which we named
the computer factor.
As mentioned before, only two samples participated in this study: Human 2 and MuCyG 1, which
appeared in a random order during the survey, which included the exact same questions for both samples.
In Table 5.4, we present the means and standard deviations of the answers for each one of the questions,
as well as the differences, i.e., the impacts of each one of the pieces of information.
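For illustration, each impact row in Table 5.4 is a paired difference between two consecutive rounds of ratings. A minimal sketch with invented ratings (not the survey data):

```python
# How a per-factor impact is derived: the paired, per-respondent difference
# between consecutive Likert ratings. The ratings below are illustrative only.
from statistics import mean, stdev

base   = [6, 5, 8, 4, 7]   # rating with no extra information
melody = [7, 5, 9, 4, 7]   # rating after revealing the melody inspiration

impact = [m - b for m, b in zip(melody, base)]   # paired differences
print(round(mean(impact), 2), round(stdev(impact), 2))   # 0.4 0.55
```

The "Melody - Base", "Explanation - Melody" and "Computer - Explanation" rows of Table 5.4 are exactly such paired means and standard deviations.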
From our analysis, we could quickly see that Human 2 is consistently evaluated as more creative than
MuCyG 1. It is also important to notice that, while for both examples the melody factor caused an
increase in the perception of creativity, the explanation factor provoked a slight decrease. Regarding the
automatic nature of the product, it had a different effect on each example. In order to identify a bias
factor, we expected to observe the computer factor causing a consistent increase or decrease. Yet, we can
interpret this differing effect as a sign that the MuCyG model was easily spotted by the users. Possibly,
the respondents were surprised by being told that Human 2 was generated by a computer, which
contributed to a higher value of creativity, while this information had the opposite effect on the MuCyG
sample.
5.2.3 Evaluating Confronting Pairs
A very natural way to evaluate subjective concepts is to use a jury in a binary confrontation between two
parties. This idea is commonly used in different contexts in day-to-day life. With these questions, instead
of asking for an absolute score or asking to choose between dichotomic states (for instance, epic versus
not epic), we asked for a choice between two opposing samples, the picked one representing a better
example than the other one of some provided characteristic.
We explored 5 different characteristics of our products, related to the words: creative, inspiring, novel,
epic and cinematographic. The first aim of this question was to order our samples based on the results, to
have an idea of which sample is the best in each one of these dimensions. A secondary aim was to verify
whether there is a direct correlation between the concepts of novelty and creativity, as well as between
epicness and creativity. We used two different approaches to order the samples. In Appendix 8, we fully
describe all the confrontations gathered in this sample and the respective results.
Table 5.5 presents the number of games played, as well as the percentages of wins, losses and ties. This is
one possible way to order our samples: under the hypothesis that the game is fair, meaning that the
probability of confrontation is the same for any pair of samples, we can assume that better samples win
more often. Therefore, in Table 5.6, we present, for each one of the dimensions, the order that results
from this reasoning process.
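This ranking rule can be sketched in a few lines; the (played, wins) pairs below are reconstructed from the creativity column of Table 5.5 for four of the samples.

```python
# Win-percentage ranking used for Table 5.6: samples are sorted by
# wins / games played, descending. Counts reconstructed from Table 5.5.
results = {
    "MuseGAN 1": (30, 23),   # 23/30 = 0.7667
    "HRBMM 2":   (21, 12),   # 12/21 = 0.5714
    "Human 2":   (30, 13),   # 13/30 = 0.4333
    "MuCyG 1":   (29,  9),   #  9/29 = 0.3103
}

ranking = sorted(results, key=lambda s: results[s][1] / results[s][0], reverse=True)
print(ranking)   # ['MuseGAN 1', 'HRBMM 2', 'Human 2', 'MuCyG 1']
```

This reproduces the relative order of these four samples in the creativity column of Table 5.6.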
If we suppose an unbalanced way of choosing opponents, some models could benefit from this last
ordering approach. For instance, in case some advantaged sample confronts a weak sample many times,
the percentage of winning games of the former can overcome that of all the other fairly playing samples.
Therefore, another completely different idea, and our first idea, was to create a DAG based on won
games.
To create our DAGs, as a first step, if sample A won a confrontation with sample B, the edge (B, A) is
inserted into the directed graph G, meaning that A is potentially better than B. Since we want an acyclic
graph, we introduced a mechanism to cancel opposite edges: if A beats B only once and after a while B
beats A, also only once, then neither (A, B) nor (B, A) will be present in the final graph G. After this
step, for each pair of nodes A and B, G will have at most one edge between them, pointing to the one
most likely to be better, and the weight of the arc corresponds to how many more times the better node
won. For instance, supposing that A beats B 4 times while B wins only once, after this step graph G will
contain the edge (B, A) with weight w(B, A) = 3.
However, this mechanism only prevents direct cycles, i.e., cycles of two nodes. To destroy bigger cycles,
we use two different heuristics: the edge with the smallest weight first, and the edge present in the most
cycles first. After identifying a cycle, we check whether there is one edge with a weight below the weights
of all the other edges in the cycle and remove it if it exists. If there is more than one, we choose, from
those, the edge that takes part in the largest number of cycles, destroying as many cycles as we can by removing
Table 5.5: Confrontations results summary (win, lost and tie as proportions of games played)

                    Human 1                             Human 2
                played  win    lost   tie       played  win    lost   tie
Creativity        19  0.4211 0.4211 0.1579        30  0.4333 0.5000 0.0667
Inspiring         24  0.0417 0.9583 0.0000        27  0.2963 0.6296 0.0741
Novelty           19  0.7895 0.1579 0.0526        26  0.4231 0.5000 0.0769
Epicness          18  0.1667 0.8333 0.0000        30  0.1667 0.8000 0.0333
Cinematography    26  0.2692 0.6923 0.0385        25  0.1600 0.8000 0.0400

                    HRBMM 1                             HRBMM 2
                played  win    lost   tie       played  win    lost   tie
Creativity        27  0.3704 0.5185 0.1111        21  0.5714 0.3810 0.0476
Inspiring         20  0.5500 0.2000 0.2500        29  0.5517 0.2414 0.2069
Novelty           29  0.4138 0.4138 0.1724        28  0.3214 0.5357 0.1429
Epicness          26  0.7308 0.1154 0.1538        28  0.6429 0.1429 0.2143
Cinematography    18  0.5000 0.3889 0.1111        30  0.6667 0.3333 0.0000

                    MuseGAN 1                           MuseGAN 2
                played  win    lost   tie       played  win    lost   tie
Creativity        30  0.7667 0.1333 0.1000        20  0.3000 0.5000 0.2000
Inspiring         29  0.4483 0.3448 0.2069        28  0.5714 0.2143 0.2143
Novelty           25  0.6400 0.2400 0.1200        29  0.3448 0.6207 0.0345
Epicness          25  0.3200 0.5200 0.1600        23  0.6087 0.1739 0.2174
Cinematography    28  0.3571 0.6071 0.0357        24  0.8333 0.0833 0.0833

                    MuCyG 1                             MuCyG 2
                played  win    lost   tie       played  win    lost   tie
Creativity        29  0.3103 0.6552 0.0345        24  0.3750 0.5000 0.1250
Inspiring         20  0.5500 0.4000 0.0500        23  0.3913 0.4348 0.1739
Novelty           20  0.4000 0.3500 0.2500        24  0.3333 0.6250 0.0417
Epicness          24  0.4167 0.5000 0.0833        26  0.3846 0.4615 0.1538
Cinematography    24  0.4167 0.5417 0.0417        25  0.6400 0.3600 0.0000
Table 5.6: Confrontations ranking based on percentage of games won

     Creativity   Inspiring   Novelty     Epicness    Cinematography
1º   MuseGAN 1    MuseGAN 2   Human 1     HRBMM 1     MuseGAN 2
2º   HRBMM 2      HRBMM 2     MuseGAN 1   HRBMM 2     HRBMM 2
3º   Human 2      MuCyG 1     Human 2     MuseGAN 2   MuCyG 2
4º   Human 1      HRBMM 1     HRBMM 1     MuCyG 1     HRBMM 1
5º   MuCyG 2      MuseGAN 1   MuCyG 1     MuCyG 2     MuCyG 1
6º   HRBMM 1      MuCyG 2     MuseGAN 2   MuseGAN 1   MuseGAN 1
7º   MuCyG 1      Human 2     MuCyG 2     Human 1     Human 1
8º   MuseGAN 2    Human 1     HRBMM 2     Human 2     Human 2
Table 5.7: Confrontation ranking based on the DAGs' topological order

     Creativity   Inspiring   Novelty     Epicness    Cinematography
1º   MuseGAN 1    HRBMM 1     MuseGAN 1   HRBMM 1     MuseGAN 2
2º   HRBMM 2      HRBMM 2     Human 1     HRBMM 2     MuCyG 2
3º   Human 1      MuCyG 2     Human 2     MuseGAN 2   HRBMM 1
4º   MuCyG 2      MuseGAN 2   HRBMM 1     MuCyG 1     HRBMM 2
5º   Human 2      MuseGAN 1   MuCyG 1     MuseGAN 1   MuCyG 1
6º   MuseGAN 2    MuCyG 1     MuseGAN 2   MuCyG 2     Human 1
7º   MuCyG 1      Human 2     MuCyG 2     Human 2     MuseGAN 1
8º   HRBMM 1      Human 1     HRBMM 2     Human 1     Human 2
only one edge. If, in one cycle, there are several edges with the same minimal weight that take part in the
same number of cycles, then we choose one of them randomly and remove it from the graph G. After
breaking all cycles, we are left with a DAG. In Figures 5.5, 5.6, 5.7, 5.8 and 5.9 we may see the generated
DAGs, and Table 5.7 shows the order calculated based on them.
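The whole procedure above can be sketched as follows. The thesis's implementation used NetworkX; this self-contained sketch uses plain dictionaries, and simplifies the tie-break (the "most cycles" heuristic is reduced to taking the first lightest edge on the detected cycle).

```python
# Sketch of the DAG construction: net-win edges with opposite-edge cancellation,
# then repeated removal of the lightest edge on any remaining cycle.
from collections import defaultdict

def build_dag(confrontations):
    """confrontations: list of (loser, winner) pairs, one per game."""
    net = defaultdict(int)  # net[(a, b)] = net wins of b over a
    for loser, winner in confrontations:
        if net[(winner, loser)] > 0:      # cancel an opposite edge first
            net[(winner, loser)] -= 1
        else:
            net[(loser, winner)] += 1
    # at most one weighted edge per pair, pointing at the likely-better sample
    edges = {e: w for e, w in net.items() if w > 0}
    # break longer cycles: repeatedly drop the lightest edge on a cycle
    while True:
        cycle = find_cycle(edges)
        if cycle is None:
            return edges
        lightest = min(cycle, key=lambda e: edges[e])
        del edges[lightest]

def find_cycle(edges):
    """DFS for a directed cycle; returns its edge list, or None if acyclic."""
    graph = defaultdict(list)
    for a, b in edges:
        graph[a].append(b)
    state, stack = {}, []  # state: 1 = on current path, 2 = done
    def dfs(u):
        state[u] = 1
        for v in graph[u]:
            stack.append((u, v))
            if state.get(v) == 1:  # back edge: cycle found
                i = next(i for i, e in enumerate(stack) if e[0] == v)
                return stack[i:]
            if state.get(v, 0) == 0:
                found = dfs(v)
                if found:
                    return found
            stack.pop()
        state[u] = 2
        return None
    for node in list(graph):
        if state.get(node, 0) == 0:
            found = dfs(node)
            if found:
                return found
    return None

games = [("B", "A")] * 4 + [("A", "B")]  # A beats B 4 times, B beats A once
print(build_dag(games))                   # {('B', 'A'): 3}
```

The example reproduces the w(B, A) = 3 case from the text; the topological order of the resulting DAG then yields rankings like those in Table 5.7.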
Our results suggest that, in this context, novelty and epicness might be inversely related, while epic and
cinematographic may be positively related. Considering this and our definition of creativity as products
that are both novel and epic, we expected the samples that better balance epicness and novelty to score
better in creativity, but the data does not reflect this definition. Also, we were not able to identify any
clear relation between creativity and either novelty or epicness. In our interpretation of these results, this
data reflects that this simplistic definition of creativity does not comply, in general, with the common use
of the word in this context.
Another interesting aspect to notice is that, even though the human samples were the only ones described
as epic in the open question, in this question the human samples were evaluated as neither epic nor
inspiring, but as novel. Also according to our results, different samples ranked best in different categories:
MuseGAN 1 was considered the most creative sample; HRBMM 1 was the most epic; MuseGAN 2 was
the most cinematographic; while MuseGAN 1 and Human 1 were both considered novel. Moreover, no
model consistently produced the best results for all the categories.
5.3 Summary
Using a process with two sequential training phases, we gathered two different 32-second samples from
each one of the models. From the graphical representation of these pianorolls, we could conclude that the
RBM-based model's products were very chaotic and featured only very short notes, while both
GAN-based models ended up suffering from the mode collapse problem, which opportunely provided
some repetitive musical coherence to the final products. In addition to these 6 samples, we randomly
picked two 32-second epic excerpts composed by humans, and all of these 8 samples were evaluated in
three different types of questions, using an online survey.
In the first question, an open question where the respondent could insert up to three words to describe
each one of the excerpts presented, we used a very basic way to aggregate words with similar meanings:
we considered that words matching in more than 2/3 of the characters of the shorter word had similar
meanings. With this strategy we obtained the list of the most frequent terms used to describe the samples
of each one of the models. These words corroborate the conclusions we visually inferred from the
pianorolls and allowed us to order our models by the impact their products caused on the listener. From
the greatest impact down to no impact, the models followed the order: Human, MuseGAN, MuCyG and
HRBMM.
In a second question, which evaluated only MuCyG 1 and Human 2, we used Likert scales from 1 to 10 to
evaluate the impact certain factors may have on the perception of creativity. Our results point out that
knowing that an excerpt is based on a melody will make the listener consider it more creative, while the
explanation factor will impact this perception negatively. In our results, we were not able to detect any
kind of bias, either in favor of or against automatically generated music.
Our last question consisted of a game where two excerpts confronted each other and the user explicitly
chose the winner while focusing on one specific characteristic. We studied 5 characteristics: creative,
inspiring, novel, epic and cinematographic; and we used two different strategies to order our samples:
the percentage of winning games, and DAGs based on the matches won between each pair of samples. In
the end, we conclude that no model was able to consistently outperform all the other ones in every
category, but some randomly picked human-composed samples were also not able to outperform all the
models in all the categories. In addition, the data did not reflect any observable relationship between
creativity and novelty or creativity and epicness.
6 Conclusion
Contents
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.1 Conclusions
Our project consisted of exploring the CC field using DL technologies to generate symbolic epic music.
We used a pianoroll representation and created two new representative datasets in this representation:
the Epic Dataset, dedicated to epic music, and the Modely Dataset, of melodies; and we explored three
different DL models: HRBMM, MuseGAN and MuCyG.
Two different software products resulted from the development process: the Gmidi library1, which
includes a set of tools to process music, and MuCaGEx2, a collection of deep learning models and
complementary libraries for performing experiments on music generation.
Training deep learning models is still a very empirical process and, currently, expert knowledge and
previous experience are the best guides; building such experience is one of the main contributions of this
work. Defining our task, choosing the best representation, choosing the best architecture, choosing the
tools, balancing generality and performance, tuning hyperparameters: all these are very hard tasks that
need informed testing and prototyping in order to achieve the best results.
According to our results, none of the models consistently outperformed the others. However, according
to the survey answers, the human creations also did not overcome the models. The generated final
products revealed, both visually and audibly, that our GAN-based models suffered from the mode
collapse problem, probably caused by some choices in the architecture or in the procedure, such as a
learning rate associated with high variance and a training method using small subsets. The computational
models were worse than humans at affecting the listener, but were considered more epic when directly
confronted with the human samples. Finally, our results also do not comply with the definition of
creativity as utility and novelty.
6.2 Future Work
Since the beginning of this thesis, many new developments in this area have been published which we
were not able to follow up on, in order to focus on the practical development and implementation of this
project. The datasets can be cleaned up, the preprocessing needs to be reviewed, and we would like to
obtain the authorization to distribute the datasets we created.
Regarding the representation, the one we used is limited to a fixed number of tracks and a fixed number
of timesteps, and does not capture other musical dimensions, such as staccato, tenuto, tempo and time
signature. Those musical aspects, and many others, may be very important in order to obtain good
products in epic music.
1 https://github.com/LESSSE/gmidi
2 https://github.com/LESSSE/public_MuCaGEx
In HRBMM, we should test different sampling techniques, such as performing Gibbs steps until the
sample stabilizes. We should also try different sizes for the hidden states and compare this hierarchical
version with a simple full RBM. There are versions of the RBM that are able to handle real values, so
adapting one of these models could bring some value to this approach.
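The "Gibbs steps until the sample stabilizes" idea can be sketched for a generic binary RBM (not the thesis's HRBMM; the stabilization criterion below, an unchanged visible state between consecutive steps, is only one possible reading):

```python
# Hedged sketch: alternate conditional sampling of hidden and visible units of a
# generic binary RBM until the visible state stops changing (or a step cap hits).
# Weights and sizes are toy values for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_until_stable(v, W, b_h, b_v, max_steps=100):
    for step in range(max_steps):
        h = (rng.random(b_h.shape) < sigmoid(v @ W + b_h)).astype(float)
        v_new = (rng.random(b_v.shape) < sigmoid(h @ W.T + b_v)).astype(float)
        if np.array_equal(v_new, v):   # stabilization criterion (one reading)
            return v_new, step + 1
        v = v_new
    return v, max_steps

W = rng.normal(size=(6, 4))            # toy RBM: 6 visible units, 4 hidden units
v0 = rng.random(6).round()             # random binary starting state
v_final, steps = gibbs_until_stable(v0, W, np.zeros(4), np.zeros(6))
print(v_final.shape, steps)
```

In practice a fixed step cap is still needed, since the chain may keep mixing without ever repeating the same visible state.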
To improve MuCyG, several approaches were considered but not performed due to time constraints. Both
MuCyG and MuseGAN might benefit from a review of the architecture to include some overlapping
convolutions. Self-attention mechanisms have been shown to achieve good results in the visual domain,
and maybe this could improve the way the MuCyG model generates melodies. Other improvements
include implementing WGAN-GP for RNNs, and adding bi-directional LSTMs to model the final parts of
the structure and generate temporal structure. On a technical level, we can migrate the models to the new
version of TensorFlow, and the code should be refactored, while joining new efforts to solve the loss
balancing and mode collapse problems.
Bibliography
[1] S. J. Russell and P. Norvig, Artificial Intelligence - A Modern Approach (3. internat. ed.). Pearson
Education, 2010.
[2] Magenta: A recurrent neural network music generation tutorial. [accessed at: 2017-12-11]. [Online].
Available: https://magenta.tensorflow.org/2016/06/10/recurrent-neural-network-generation-tutorial
[3] Magenta: Generating long-term structure in songs and stories. [accessed at: 2017-12-11].
[Online]. Available: https://magenta.tensorflow.org/2016/07/15/lookback-rnn-attention-rnn
[4] D. Cope, Virtual Music: Computer Synthesis of Musical Style. The MIT Press, 2004.
[5] J. A. Biles, “Genjam: A genetic algorithm for generating jazz solos,” in Proceedings of the 1994
International Computer Music Conference, ICMC, 1994.
[6] G. Bickerman, S. Bosley, P. Swire, and R. Keller, “Learning to create jazz melodies using deep
belief nets,” Proceedings of the International Conference on Computational Creativity, ICCC-10,
Jan 2010.
[7] J. Teixeira, “Cross domain analogy: From image to music,” Master’s thesis, Instituto Superior
Tencico, Universidade de Lisboa, Lisbon, Portugal, May 2017.
[8] H. W. Dong, W. Y. Hsiao, L. C. Yang, and Y. H. Yang, “Musegan: Multi-track sequential generative
adversarial networks for symbolic music generation and accompaniment,” ArXiv e-prints, Sep 2017.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and
Y. Bengio, “Generative adversarial networks,” CoRR, vol. abs/1406.2661, 2014.
[10] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep
convolutional generative adversarial networks,” 2015, cite arxiv:1511.06434 Comment: Under
review as a conference paper at ICLR 2016. [Online]. Available: http://arxiv.org/abs/1511.06434
[11] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved
quality, stability, and variation,” CoRR, vol. abs/1710.10196, 2017. [Online]. Available:
http://arxiv.org/abs/1710.10196
79
[12] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial
networks,” in Proceedings of the 36th International Conference on Machine Learning, ser.
Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97.
Long Beach, California, USA: PMLR, 09–15 Jun 2019, pp. 7354–7363. [Online]. Available:
http://proceedings.mlr.press/v97/zhang19d.html
[13] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner,
A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” CoRR, vol.
abs/1609.03499, 2016.
[14] O. Mogren, “C-RNN-GAN: continuous recurrent neural networks with adversarial training,” CoRR,
vol. abs/1611.09904, 2016. [Online]. Available: http://arxiv.org/abs/1611.09904
[15] C. Donahue, J. McAuley, and M. Puckette, “Synthesizing audio with generative adversarial
networks,” CoRR, vol. abs/1802.04208, 2018. [Online]. Available: http://arxiv.org/abs/1802.04208
[16] I. van Elferen, “Fantasy music: Epic soundtracks, magical instruments, musical metaphysics,”
Journal of the Fantastic in the Arts, vol. 24, no. 1 (87), pp. 4–24, 2013. [Online]. Available:
http://www.jstor.org/stable/24352902
[17] S. Meyer, Music in Epic Film: Listening to Spectacle. Taylor & Francis, 2016. [Online]. Available:
https://books.google.fr/books?id=JVH0DAAAQBAJ
[18] M. D. Mumford, “Where have we been, where are we going? taking stock in creativity research,”
Creativity Research Journal, vol. 15, no. 2-3, pp. 107–120, 2003.
[19] O. Wilde, The Picture of Dorian Gray. Floating Press, 2009. [Online]. Available: https:
//books.google.cz/books?id=J9cnJ21pKNgC
[20] J. P. Guilford, “Creativity,” American Psychologist, vol. 5, no. 9, pp. 444–454, 1950.
[21] A. McKerracher, “Understanding creativity, one metaphor at a time.” Creativity Research Journal,
vol. 28, no. 4, pp. 417–425, 2016.
[22] M. d’Inverno and A. Still, “A history of creativity for future AI research,” in Proceedings of the Seventh
International Conference on Computational Creativity (ICCC 2016). Sony CSL Paris, France, 2016,
pp. 147–154.
[23] J. C. Kaufman and R. A. Beghetto, “Beyond big and little: The four c model of creativity,” Review of
General Psychology, vol. 13, no. 1, pp. 1–12, 2009.
[24] M. Rhodes, “An analysis of creativity,” Phi Delta Kappan, vol. 42, no. 7, pp. 305–310, 1961.
80
[25] G. Wallas, The Art of Thought. Harcourt, Brace, 1926.
[26] R. Sawyer, Explaining Creativity: The Science of Human Innovation. Oxford University Press,
USA, 2012.
[27] D. Partridge and J. Rowe, Computers and Creativity, ser. Intellect Books. Intellect, 1994.
[28] Salvador Dali - Quotes. [accessed: 2019-05-30]. [Online]. Available: https://en.wikiquote.org/wiki/
Salvador Dal%C3%AD
[29] D. T. Campbell, “Blind variation and selective retention in creative thought as in other knowledge
processes,” Psychological Review, vol. 67, no. 6, pp. 380–400, 1960.
[30] A. Koestler, The Act of Creation. Arkana, 1964.
[31] M. A. Boden, The Creative Mind: Myths and Mechanisms. New York, NY, USA: Basic Books, Inc.,
1991.
[32] R. J. Sternberg and T. I. Lubart, “An investment theory of creativity and its development,” Human
Development, vol. 34, no. 1, pp. 1–31, 1991.
[33] R. A. Finke, T. B. Ward, and S. M. Smith, Creative Cognition: Theory, Research, and Application
(Bradford Books). The MIT Press, 1992.
[34] M. Turner and G. Fauconnier, “Conceptual integration and formal expression,” Journal of Metaphor
and Symbolic Activity, vol. 10, pp. 183–204, 1995.
[35] M. Csikszentmihalyi, Creativity: Flow and the Psychology of Discovery and, ser. Harper Perennial
Modern Classics. HarperCollins, 2009.
[36] R. J. Sternberg, The Propulsion Theory of Creative Contributions. Cambridge University Press,
2003, pp. 124–144.
[37] G. A. Wiggins, “Searching for computational creativity,” New Generation Computing, vol. 24, no. 3,
pp. 209–222, Sep 2006.
[38] M. Ackerman, A. Goel, C. G. Johnson, A. Jordanous, C. Leon, R. P. y Perez, H. Toivonen, and
D. Ventura, “Teaching computational creativity,” in Proceedings of the Eigth International Confer-
ence on Computational Creativity (ICCC 2017). Sony CSL Paris, France, 2017, pp. 9–16.
[39] G. Widmer, “Getting closer to the essence of music: The con espressione manifesto,” CoRR, vol.
abs/1611.09733, 2016.
[40] D. Floreano and C. Mattiussi, Bio-Inspired Artificial Intelligence: Theories, Methods, and Technolo-
gies (Intelligent Robotics and Autonomous Agents series). The MIT Press, 2008.
81
[41] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available:
http://www.deeplearningbook.org
[42] W. Mcculloch and W. Pitts, “A logical calculus of ideas immanent in nervous activity,” Bulletin of
Mathematical Biophysics, vol. 5, pp. 127–147, 1943.
[43] “The navy revealed the embryo of an electronic computer today that it expects will be
able to walk, talk, see, write, reproduce itself and be conscious of its existence,” The
New York Times, Jul 1986. [Online]. Available: https://www.nytimes.com/1958/07/08/archives/
new-navy-device-learns-by-doing-psychologist-shows-embryo-of.html
[44] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating
errors,” Nature, vol. 323, Oct 1986.
[45] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, cite
arxiv:1412.6980Comment: Published as a conference paper at the 3rd International Conference
for Learning Representations, San Diego, 2015. [Online]. Available: http://arxiv.org/abs/1412.6980
[46] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, pp. 1735–
80, Dec 1997.
[47] Y. Lecun, B. Boser, J. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. Jackel, “Backprop-
agation applied to handwritten zip code recognition,” Neural Computation, vol. 1, pp. 541–551, Dec
1989.
[48] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple
way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp.
1929–1958, Jan 2014. [Online]. Available: http://dl.acm.org/citation.cfm?id=2627435.2670313
[49] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” CoRR,
vol. abs/1601.06759, 2016.
[50] G. Alain, Y. Bengio, L. Yao, J. Yosinski, E. Thibodeau-Laufer, S. Zhang, and P. Vincent, “Gsns :
Generative stochastic networks,” CoRR, vol. abs/1503.05571, 2015.
[51] P. Smolensky, “Information processing in dynamical systems: Foundations of harmony theory,”
in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D. E.
Rumelhart, J. L. McClelland, and C. PDP Research Group, Eds. MIT Press, 1986, pp. 194–281.
[52] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” CoRR, vol. abs/1312.6114, 2013.
82
[53] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved
training of wasserstein gans,” CoRR, vol. abs/1704.00028, 2017. [Online]. Available: http:
//arxiv.org/abs/1704.00028
[54] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by
reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015. [Online]. Available:
http://arxiv.org/abs/1502.03167
[55] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial
networks,” CoRR, vol. abs/1802.05957, 2018. [Online]. Available: http://arxiv.org/abs/1802.05957
[56] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using
cycle-consistent adversarial networks,” CoRR, vol. abs/1703.10593, 2017. [Online]. Available:
http://arxiv.org/abs/1703.10593
[57] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to discover cross-domain relations
with generative adversarial networks,” CoRR, vol. abs/1703.05192, 2017. [Online]. Available:
http://arxiv.org/abs/1703.05192
[58] L. F. Menabrea, “Sketch of the analytical engine invented by charles babbage,” in Scientific mem-
oirs: selected from the transactions of foreign Academies of Science and learned societies, and
from foreign journals. Richard and John E. Taylor, London, 1842, vol. 3, pp. 666–731.
[59] D. Eck and J. Schmidhuber, “A first look at music composition using lstm recurrent neural networks,” Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, Tech. Rep., 2002.
[60] A. Eigenfeldt, O. Bown, P. Pasquier, and A. Martin, “Towards a taxonomy of musical metacreation:
Reflections on the first musical metacreation weekend,” in Proceedings of the Second International
Workshop on Musical Metacreation (MUME 2013). The AAAI Press, Palo Alto, California, 2013,
pp. 40–47.
[61] E. X. Merz, “Implications of ad hoc artificial intelligence in music,” in Proceedings of the Third International Workshop on Musical Metacreation (MUME 2014), P. Pasquier, A. Eigenfeldt, and O. Bown, Eds. The AAAI Press, Palo Alto, California, 2014, pp. 35–39.
[62] K. Choi, G. Fazekas, K. Cho, and M. B. Sandler, “A tutorial on deep learning for music information
retrieval,” CoRR, vol. abs/1709.04396, 2017. [Online]. Available: http://arxiv.org/abs/1709.04396
[63] J. Briot, G. Hadjeres, and F. Pachet, “Deep learning techniques for music generation - A survey,”
CoRR, vol. abs/1709.01620, 2017.
[64] R. Simchy-Gross and E. H. Margulis, “The sound-to-music illusion: Repetition can musicalize
nonspeech sounds,” Music & Science, vol. 1, p. 2059204317731992, 2018. [Online]. Available:
https://doi.org/10.1177/2059204317731992
7 Epic Dataset Reference List
Table 7.1: Full list of epic music samples
Sample Name Username Duration (s) Tracks Blocks Size
0 001-Epic 3510006 76.82 16 8 7.24KB
1 002-Epic 8bitlp 136.82 21 14 22.43KB
2 003-Epic 3749941 110.69 18 15 15.20KB
3 004-Epic 504736 38.01 13 7 3.28KB
4 006-Epic 6279851 237.04 73 29 48.19KB
5 007-Epic 76173 309.33 15 29 35.94KB
6 008-Epic 14917511 46.82 8 4 2.82KB
7 009-Epic 11311916 141.80 16 10 4.03KB
8 010-Epic owlman142 157.48 24 8 7.33KB
9 014-Epic elliot-butler 181.74 20 15 42.08KB
10 015-Epic elliot-butler 152.33 25 20 65.86KB
11 016-Epic elliot-butler 368.74 30 26 51.96KB
12 024-Epic 27529630 116.49 25 16 20.31KB
13 027-Epic theepiccomposer 310.76 28 30 39.33KB
14 028-Epic ros 114.39 34 25 31.46KB
15 029-Epic theepiccomposer 320.84 16 32 30.94KB
16 032-Epic jg-77 2 81.09 18 9 8.13KB
17 034-Epic echo 316.90 34 14 16.99KB
18 035-Epic solacedescending 218.75 23 25 28.28KB
19 038-Epic 8003276 225.14 36 22 66.71KB
20 039-Epic 14917511 75.32 14 7 4.23KB
21 040-Epic lizzapie 232.99 51 15 40.26KB
22 041-Epic maja-pechanach 125.63 25 9 8.24KB
23 042-Epic 1793111 131.26 28 15 27.85KB
24 043-Epic 2391956 113.36 23 12 17.69KB
25 044-Epic 9283941 194.00 31 20 67.66KB
26 047-Epic 4776246 185.34 18 13 14.89KB
27 048-Epic mollymawk 42.02 29 5 10.37KB
28 049-Epic 167892 42.02 29 5 10.37KB
29 051-Epic 2903486 450.90 16 23 7.94KB
30 052-Epic bntibbetts 94.22 29 9 15.29KB
31 053-Epic rookwizard 234.03 20 19 8.00KB
32 054-Epic rookwizard 216.03 33 22 35.97KB
33 055-Epic rookwizard 242.16 31 6 52.80KB
34 056-Epic 4918891 223.22 34 11 27.40KB
35 058-Epic rookwizard 249.05 17 13 5.67KB
36 059-Epic kalle-edh 222.84 24 18 18.89KB
37 061-Epic kalle-edh 196.39 21 13 15.04KB
38 063-Epic fabiolaw 239.42 12 24 20.00KB
39 065-Epic kalle-edh 188.93 19 16 14.57KB
40 066-Epic kalle-edh 234.45 28 16 40.57KB
41 067-Epic kalle-edh 271.96 32 27 35.50KB
42 068-Epic kalle-edh 162.62 24 8 12.85KB
43 069-Epic kalle-edh 164.28 25 11 24.03KB
44 070-Epic robin m butler 70.19 19 4 2.51KB
45 072-Epic-Prelude-for elliot-butler 126.44 15 7 2.35KB
46 073-Epic-Pastoralia-mid 18361371 441.94 27 29 21.24KB
47 074-Epic-Medieval-Times joshuaai 153.85 41 19 57.16KB
48 076-Epic-Blackheart-Full 28264345 266.21 56 14 61.71KB
49 077-Epic-March-for kalle-edh 260.52 29 32 42.22KB
50 079-Epic-Celthyan-In 2544941 124.91 36 14 33.16KB
51 080-Epic-To-a elliot-butler 190.09 16 7 11.92KB
52 081-Epic-Love-theme kalle-edh 136.52 17 9 5.04KB
53 084-Epic-Yeomen-Of kalle-edh 138.63 30 18 34.09KB
54 085-Epic-Intermezzo-Summer robin m butler 104.04 15 6 1.01KB
55 087-Epic-Adagio-A robin m butler 258.77 26 12 18.13KB
56 089-Epic-Feuillemort-HQ rookwizard 268.08 18 8 2.45KB
57 090-Epic-Running-Away robin m butler 92.76 20 7 14.37KB
58 093-Epic-Tian-Shan kalle-edh 162.50 22 11 18.87KB
59 095-Epic-INTERGALACTIC-Theme robin m butler 126.04 21 7 2.96KB
60 096-Epic-Superhero-theme kalle-edh 101.07 37 6 37.33KB
61 097-Epic-In-The lizzapie 53.39 12 0 4.00KB
62 098-Epic-Left-to 10712571 226.70 12 17 12.77KB
63 099-Epic-Two-Steps 2544941 167.76 36 20 26.91KB
64 101-Epic-An-adventure kalle-edh 123.02 40 12 28.53KB
65 102-Epic-Anton-Coladecci 10712571 306.69 23 28 38.43KB
66 104-Epic-Kingdom-of kalle-edh 219.56 29 19 22.85KB
67 105-Epic-The-Haunting kalle-edh 182.54 28 17 24.70KB
68 106-Epic-A-life kalle-edh 183.12 37 14 15.27KB
69 107-Epic-Celthyan-I 2544941 303.54 41 23 79.73KB
70 108-Epic-Zap-Nick 10712571 198.42 10 31 30.07KB
71 109-Epic-A-Virtual 18361371 384.79 12 7 31.30KB
72 110-Epic-Celthyan-Ode 2544941 215.51 20 12 15.45KB
73 111-Epic-Artist-Rendition 10712571 132.02 8 16 7.47KB
74 112-Epic-Jim-Saves lizzapie 149.65 48 22 26.79KB
75 114-Epic-Duel-of cherylthegoat 160.61 30 25 24.98KB
76 115-Epic-Theme-of 10712571 230.04 8 14 4.01KB
77 117-Epic-We-Run tristanwillcox 63.27 50 10 17.53KB
78 118-Epic-The-Circus 10712571 264.54 18 35 20.27KB
79 120-Epic-O-HOLY owlman142 104.79 54 7 13.35KB
80 122-Epic-Klap-Sahali 123002 180.57 27 17 24.53KB
81 123-Epic-O-Holy 1332076 97.88 39 7 21.94KB
82 124-Epic-Overture-in 5270096 137.17 20 18 19.75KB
83 125-Epic-John-Adams johnwd 53.30 73 8 38.71KB
84 126-Epic-Samurai-of robin m butler 67.40 31 9 2.73KB
85 127-Epic-Planet-Of robin m butler 72.02 27 9 4.76KB
86 128-Epic-Tempo-di rookwizard 136.51 13 25 10.61KB
87 131-Epic-Perspective-A johnwd 87.64 35 11 32.48KB
88 133-Epic-Celthyan-You 2544941 84.80 17 5 2.66KB
89 134-Epic-Ghost-The ronaldspotomusic 132.22 28 7 7.55KB
90 135-Epic-Phantom-Of ronaldspotomusic 120.89 16 4 13.89KB
91 137-Epic-Dreams-of 10712571 146.11 4 4 5.88KB
92 138-Epic-Ethereal-Sci lizzapie 189.19 62 10 20.26KB
93 139-Epic-Bergan-Village 10712571 453.39 10 43 24.35KB
94 140-Epic-Lycia-Kingdom 10712571 258.69 9 24 5.72KB
95 141-Epic-Music-of johnwd 85.39 44 13 47.26KB
96 142-Epic-Take-What 10712571 224.03 14 21 10.05KB
97 145-Epic-The-Freedom 10712571 147.09 16 13 13.60KB
98 146-Epic-Tears-remain 18361371 379.34 10 0 18.41KB
99 148-Epic-Incedendo-Epic johnwd 240.66 58 38 61.47KB
100 149-Epic-Opening-Titles johnwd 143.70 37 13 33.17KB
101 151-Epic-Boss-Battle 10712571 273.50 19 28 23.79KB
102 152-Epic-Space-Movie johnwd 156.68 39 12 21.90KB
103 154-Epic-Iris-The 10712571 144.87 14 10 8.68KB
104 155-Epic-The-Wild 9528871 189.03 21 15 17.78KB
105 157-Epic-King-Norrix 10712571 160.00 16 12 7.52KB
106 158-Epic-Jungle-Ruins lizzapie 119.77 32 9 10.42KB
107 162-Epic-BlackheartGame-of owlman142 125.31 18 22 27.55KB
108 163-Epic-The-Life lizzapie 285.13 46 21 15.45KB
109 164-Epic-Ninja-s 10712571 266.02 16 33 20.34KB
110 165-Epic-Hymn-mid dun-ought 212.50 22 13 3.96KB
111 166-Epic-Rhapsody-by 18361371 665.21 8 0 23.25KB
112 167-Epic-Drava-The 10712571 240.02 12 25 9.20KB
113 168-Epic-ZeroBlade-Run 107032 129.58 24 24 25.68KB
114 169-Epic-The-King 10712571 240.02 13 24 9.85KB
115 170-Epic-AOR-Battle 10712571 163.22 14 17 10.61KB
116 171-Epic-Midnight-600 rookwizard 204.05 30 10 22.42KB
117 172-Epic-AOR-World 10712571 268.82 10 28 17.88KB
118 173-Epic-Dreams-Of 10712571 182.42 5 16 1.66KB
119 174-Epic-Leonard-Cohen owlman142 135.06 44 7 15.82KB
120 177-Epic-The-Time lizzapie 276.59 45 26 29.22KB
121 178-Epic-Dark-Depths 107032 30.03 23 1 2.86KB
122 179-Epic-How-To tristanwillcox 67.78 67 5 12.07KB
123 180-Epic-Lulu-and cherylthegoat 141.59 33 19 27.70KB
124 181-Epic-Celthyan-A 2544941 210.89 53 13 55.80KB
125 182-Epic-HikariSimple-and piesafety 191.78 46 15 76.26KB
126 183-Epic-Electric-Deity 10712571 256.02 18 40 27.33KB
127 184-Epic-Main-Theme cherylthegoat 140.58 32 17 22.13KB
128 185-Epic-Sonic-The 10712571 281.48 18 32 33.50KB
129 188-Epic-End-Credits rookwizard 43.75 18 5 5.83KB
130 189-Epic-Quiet-Town 10712571 208.03 15 19 11.28KB
131 191-Epic-We-Will 10712571 245.27 15 25 26.34KB
132 194-Epic-The-Vault 4118671 166.39 20 26 15.17KB
133 195-Epic-By-the rookwizard 87.45 33 12 12.05KB
134 197-Epic-Celthyan-Downcast 2544941 151.90 47 14 28.21KB
135 198-Epic-FLOWER-TIME 10712571 118.17 8 16 10.00KB
136 201-Epic-Celthyan-First 2544941 126.45 52 15 31.01KB
137 202-Epic-The-Story lizzapie 179.53 40 16 14.29KB
138 203-Epic-A-Room rookwizard 143.12 34 11 10.52KB
139 204-Epic-Zues-Lord 10712571 169.86 19 23 13.61KB
140 205-Epic-Medusa-The 10712571 289.82 13 30 24.46KB
141 206-Epic-Rondo-Purcell 10712571 234.02 16 29 12.92KB
142 207-Epic-AOR-Cutscene 10712571 204.82 12 28 20.96KB
143 208-Epic-Central-Intelligence 10712571 260.42 14 24 20.00KB
144 210-Epic-A-New lizzapie 228.00 43 16 29.21KB
145 212-Epic-Theme-Of 10712571 290.61 9 21 18.56KB
146 213-Epic-Celthyan-Unseen 2544941 170.28 18 20 10.55KB
147 214-Epic-The-Land lizzapie 232.42 41 23 27.92KB
148 215-Epic-Fairy-Sprites 10712571 218.77 14 18 16.90KB
149 216-Epic-The-Battle lizzapie 187.36 52 14 26.80KB
150 217-Epic-Morning-Star 4118671 220.68 4 34 11.45KB
151 218-Epic-Fairy-Sprites 10712571 248.75 16 28 28.27KB
152 219-Epic-Fairy-Sprites 10712571 150.03 6 12 5.37KB
153 220-Epic-The-Voyage ronaldspotomusic 177.47 19 20 14.46KB
154 221-Epic-Celthyan-The 2544941 159.01 45 18 70.79KB
155 222-Epic-The-Old 10712571 167.13 8 16 2.95KB
156 223-Epic-Rise-of 10712571 194.64 18 32 14.27KB
157 224-Epic-Dark-Castle lizzapie 109.30 62 8 9.75KB
158 225-Epic-Fantasia-1 lizzapie 181.71 38 15 36.15KB
159 226-Epic-Beyond-The lizzapie 227.83 38 20 20.45KB
160 227-Epic-e-Minor 1388021 169.22 24 17 20.14KB
161 229-Epic-Chinchila-s 10712571 326.42 15 34 25.16KB
162 230-Epic-Promenade-in 18361371 577.75 11 38 31.37KB
163 231-Epic-Naru-s 10712571 220.52 12 27 14.21KB
164 232-Epic-Dira-The 10712571 198.02 8 24 13.95KB
165 233-Epic-The-Night rookwizard 96.17 29 10 17.04KB
166 234-Epic-Celthyan-I 2544941 135.07 47 18 39.75KB
167 235-Epic-Zoosters-Breakout ronaldspotomusic 93.51 37 15 18.42KB
168 236-Epic-Owain-Glyndwr 5270096 388.89 26 33 30.92KB
169 237-Epic-Last-Stop 10712571 192.02 18 28 5.83KB
170 238-Epic-Forever-HQ rookwizard 165.22 32 13 34.93KB
171 239-Epic-Cari-the 10712571 117.36 6 11 4.85KB
172 240-Epic-Remember-ANZAC rookwizard 211.06 43 16 29.90KB
173 241-Epic-Suite-for rookwizard 155.20 18 6 7.69KB
174 242-Epic-Fanfare-for 19173711 170.78 37 16 20.14KB
175 243-Epic-Cari-s 10712571 297.02 8 34 8.48KB
176 245-Epic-Celthyan-Pearl 2544941 328.81 26 22 17.21KB
177 246-Epic-Caris-Castle 10712571 123.02 19 15 4.65KB
178 247-Epic-Celthyan-From 2544941 105.55 54 6 25.54KB
179 249-Epic-War-on 10712571 264.02 21 41 26.27KB
180 251-Epic-The-First rookwizard 89.38 32 13 19.29KB
181 252-Epic-Star-Trek robin m butler 113.98 33 13 27.12KB
182 253-Epic-Intensity-Remix 10712571 163.66 19 27 28.93KB
183 254-Epic-The-Last jmusic1600 125.36 37 11 31.21KB
184 255-Epic-Temple-of 10712571 122.42 14 19 10.27KB
185 256-Epic-Temple-of 10712571 150.94 11 20 6.01KB
186 257-Epic-Temple-of 10712571 276.03 5 23 8.06KB
187 259-Epic-Epic-2 pandorasbox123 179.98 38 15 40.43KB
188 261-Epic-Naru-s 10712571 88.02 5 11 5.38KB
189 262-Epic-Cari-s 10712571 153.62 7 16 6.81KB
190 263-Epic-Blizzerd-Battle 10712571 238.17 14 32 10.09KB
191 264-Epic-I-Cant 10712571 213.37 13 8 2.79KB
192 265-Epic-V-S 10712571 162.73 18 24 19.08KB
193 266-Epic-Dira-s 10712571 156.82 14 24 6.23KB
194 267-Epic-Plains-Shop 10712571 172.82 4 27 18.03KB
195 268-Epic-Plains-Lv 10712571 228.02 17 23 9.26KB
196 269-Epic-Boss-Forest 10712571 264.08 12 33 22.67KB
197 270-Epic-Boss-Caves 10712571 288.03 10 27 18.60KB
198 271-Epic-Caves-Lv 10712571 293.36 4 27 9.80KB
199 272-Epic-Forest-Lv 10712571 171.80 5 34 11.71KB
200 273-Epic-Slow-and 10712571 185.19 7 18 5.47KB
201 274-Epic-The-Dictators knightsofarrethtrae 117.09 23 11 13.59KB
202 275-Epic-Shop-Mart 10712571 145.36 6 13 11.35KB
203 277-Epic-Ninja-Skills 10712571 266.42 5 27 18.29KB
204 278-Epic-Rigged-School 10712571 163.25 10 8 7.70KB
205 279-Epic-The-Treasure lizzapie 173.16 43 19 29.95KB
206 280-Epic-The-United 10712571 222.59 12 15 18.49KB
207 281-Epic-World-Theme 10712571 220.03 12 20 3.57KB
208 282-Epic-Naru-s 10712571 132.02 7 16 5.24KB
209 283-Epic-Car-Chase 10712571 187.22 14 28 36.22KB
210 284-Epic-The-United 10712571 317.56 21 43 13.81KB
211 286-Epic-The-White 10712571 320.03 18 30 16.54KB
212 287-Epic-Tails-Of 10712571 240.04 9 11 6.10KB
213 288-Epic-Queen-of 10712571 240.02 13 37 13.70KB
214 289-Epic-Bribing-The 10712571 250.69 6 12 13.66KB
215 290-Epic-The-String 10712571 142.01 9 16 5.72KB
216 291-Epic-March-of 10712571 316.85 10 16 3.58KB
217 292-Epic-Meadows-Film rookwizard 78.06 14 3 2.66KB
218 293-Epic-The-US 10712571 224.82 13 35 11.18KB
219 294-Epic-Battle-of robin m butler 38.79 32 5 9.86KB
220 295-Epic-Coronation-mid sapphirefloutist 141.28 38 23 45.41KB
221 296-Epic-Agent-Diezo 10712571 124.02 19 15 9.23KB
222 297-Epic-Against-The 10712571 184.02 10 23 6.75KB
223 298-Epic-The-Castle 10712571 144.03 15 13 4.94KB
224 299-Epic-March-of 10712571 264.03 15 24 7.90KB
225 300-Epic-The-Jungle 10712571 123.02 10 15 6.08KB
226 301-Epic-Two-Steps 2544941 265.09 63 14 55.65KB
227 302-Epic-Castle-Theme 10712571 614.42 19 64 70.26KB
228 304-Epic-To-the origamidos 128.22 16 12 9.44KB
229 307-Epic-Electroman-Theme robin m butler 101.35 30 12 26.17KB
230 309-Epic-Strings-Esemble 10712571 319.67 12 35 9.56KB
231 310-Epic-Building-of robin m butler 166.79 32 19 21.09KB
232 311-Epic-The-Encampment robin m butler 68.02 38 8 24.36KB
233 312-Epic-T-e robin m butler 94.43 18 7 5.37KB
234 313-Epic-Epic-Movie robin m butler 32.66 22 4 9.35KB
235 314-Epic-Newt-Says cherylthegoat 195.16 20 13 8.32KB
236 316-Epic-Lawrence-s 10712571 277.71 25 35 42.46KB
237 317-Epic-Starship-Explorer robin m butler 91.83 33 9 31.37KB
238 320-Epic-Leo-s 10712571 186.82 6 14 21.88KB
239 322-Epic-THE-MIGHTY robin m butler 67.99 27 5 11.78KB
240 323-Epic-Irons-Theme 10712571 174.02 15 23 7.86KB
241 324-Epic-Broken-Hero 10712571 144.03 19 13 13.54KB
242 325-Epic-Alyssa-s 10712571 166.01 18 27 9.54KB
243 326-Epic-Thinking-Music 10712571 189.02 12 23 8.91KB
244 327-Epic-The-Agency 10712571 132.24 12 16 4.43KB
245 328-Epic-Icy-Wastland 10712571 149.36 9 14 4.50KB
246 329-Epic-Battle-Theme 10712571 336.02 14 52 19.09KB
247 330-Epic-Make-The 10712571 268.63 23 35 21.32KB
248 332-Epic-Under-Attack 10712571 195.57 17 29 29.49KB
249 334-Epic-Imprisoned-mid 10712571 192.16 12 19 6.47KB
250 335-Epic-Token-God 10712571 120.02 12 15 3.09KB
251 336-Epic-Insert-creative origamidos 58.02 14 7 5.64KB
252 337-Epic-Light-in 10712571 379.07 26 27 6.53KB
253 339-Epic-Kill-The 10712571 124.85 8 6 3.92KB
254 341-Epic-Adestes-Fidelis owlman142 123.16 45 14 23.74KB
255 342-Epic-Please-help 6877881 159.90 29 24 17.08KB
256 345-Epic-Celthyan-In 2544941 195.00 39 13 68.41KB
257 346-Epic-Celthyan-Conquest 2544941 128.45 68 9 110.73KB
258 347-Epic-The-Colony origamidos 78.30 26 10 18.14KB
259 348-Epic-BATTLE-OF robin m butler 82.21 32 6 12.52KB
260 350-Epic-Celthyan-A 2544941 238.48 54 22 56.81KB
261 351-Epic-The-Last origamidos 99.92 29 10 17.72KB
262 352-Epic-Also-Untitled 6877881 230.02 25 28 8.98KB
263 353-Epic-The-End origamidos 68.04 14 3 3.47KB
264 354-Epic-The-Chariots 4118671 64.63 12 8 5.96KB
265 355-Epic-Super-Hero jmusic1600 47.83 38 8 22.10KB
266 357-Epic-Russian-March 142190 181.18 28 17 32.39KB
267 358-Epic-The-Trial origamidos 84.02 21 12 12.87KB
268 360-Epic-Eitz-Chayim isaacweiss 96.26 8 3 2.58KB
269 361-Epic-Oceans-Twilight 4118671 144.04 15 10 1.92KB
270 362-Epic-Carol-of 12165716 172.45 36 16 64.97KB
271 363-Epic-Papyruss-Mansion origamidos 252.98 14 41 27.04KB
272 364-Epic-Blue-Team austin harning 307.27 28 22 13.10KB
273 365-Epic-Time-Warp origamidos 110.23 25 11 45.46KB
274 366-Epic-Undertale-Spider piesafety 201.17 47 22 118.91KB
275 367-Epic-The-Closing origamidos 159.20 26 6 19.98KB
276 369-Epic-A-dark qqqant 177.32 47 21 32.85KB
277 370-Epic-Mammoths-mid 2644126 81.50 46 8 13.49KB
278 371-Epic-SSE-7 owlman142 121.04 24 11 11.55KB
279 372-Epic-Final-Boss thepopstardude 201.31 16 32 31.57KB
280 374-Epic-Dedication-200 owlman142 114.08 24 14 18.00KB
281 375-Epic-Sword-Valley thepopstardude 246.21 25 39 26.73KB
282 377-Epic-Release-From 11742266 222.20 14 20 3.00KB
283 378-Epic-Fugue-VI rpbouman 186.95 27 20 23.10KB
284 379-Epic-Simple-Wartime tristanwillcox 34.69 28 3 10.77KB
285 380-Epic-Lament-in rpbouman 176.39 27 11 12.52KB
286 381-Epic-Blue-Team austin harning 268.28 25 21 10.22KB
287 384-Epic-Encanto-Gitano 3857556 114.98 45 12 8.74KB
288 385-Epic-Gravity-Falls origamidos 54.00 40 3 12.41KB
289 387-Epic-Users-Documents origamidos 93.37 43 7 10.64KB
290 388-Epic-Siman-TovChassen isaacweiss 147.04 12 18 10.08KB
291 389-Epic-SSE-6 owlman142 119.38 26 15 18.14KB
292 391-Epic-I-Dont 6278966 212.44 37 14 14.76KB
293 393-Epic-Celthyan-Towards 2544941 274.22 43 16 87.87KB
294 394-Epic-Sun-For tristanwillcox 154.21 32 10 27.25KB
295 396-Epic-Trojan-Wildfire johnwd 163.46 33 23 56.81KB
296 397-Epic-Jupiter-the 6278966 177.90 40 11 14.43KB
297 398-Epic-F-a solacedescending 178.23 9 21 7.60KB
298 399-Epic-Not-for 2644126 196.25 42 33 52.02KB
299 402-Epic-SSE-4 owlman142 120.04 24 15 23.57KB
300 404-Epic-Revelation-EPIC tristanwillcox 185.61 43 13 18.51KB
301 406-Epic-Jabba-the 5549196 40.03 13 3 1.67KB
302 407-Epic-The-Southern thenightreader 202.48 34 23 29.22KB
303 408-Epic-SSE-3 owlman142 92.02 24 11 10.58KB
304 409-Epic-Mesa-Shewie owlman142 84.02 22 10 7.10KB
305 413-Epic-Expressivo-mid 4118671 115.15 2 0 9.63KB
306 414-Epic-Victoria-Altissimi tehdoctorr 186.70 46 15 28.85KB
307 415-Epic-Pokemon-Super 7135916 91.75 61 14 18.66KB
308 417-Epic-The-Majestic 4118671 108.70 18 10 6.18KB
309 418-Epic-Pot-O 4118671 108.43 22 13 20.57KB
310 419-Epic-Star-Fox vgoscore 39.39 22 5 15.92KB
311 420-Epic-From-the 5104821 191.27 24 11 24.88KB
312 421-Epic-The-Cascadant 4118671 259.80 18 20 12.58KB
313 422-Epic-Undertale-Medley vgoscore 60.72 18 6 8.38KB
314 425-Epic-Bravery-Honor rodcosta 25.44 18 2 3.09KB
315 427-Epic-Stand-mid 1388021 125.28 23 26 26.74KB
316 429-Epic-Nyan-Cat vgoscore 14.56 18 2 6.52KB
317 430-Epic-SimCity-Theme vgoscore 46.17 22 5 6.55KB
318 433-Epic-Marks-in 4118671 56.04 15 3 2.23KB
319 435-Epic-Muave-and 4118671 80.67 17 9 5.69KB
320 437-Epic-Dreams-mid 4118671 114.06 5 4 1.83KB
321 440-Epic-Liberty-Shield 5104821 132.57 17 17 13.79KB
322 441-Epic-Heart-of vgoscore 48.52 31 6 9.82KB
323 443-Epic-I-Dont 6278966 225.85 36 15 15.41KB
324 444-Epic-The-Beauty 4118671 113.07 17 11 5.46KB
325 446-Epic-Joshs-OcarinaWIP brosenvall2 102.06 38 4 5.10KB
326 447-Epic-Dumbledores-Farewell cherylthegoat 110.03 18 11 2.10KB
327 448-Epic-Android-7 4118671 423.20 5 54 27.35KB
328 449-Epic-Invasion-of pandorasbox123 126.82 24 15 15.52KB
329 450-Epic-Mearnas-Past lizzapie 104.38 63 7 31.68KB
330 451-Epic-Heros-Homecoming 28678260 74.93 25 8 5.86KB
331 452-Epic-Scald-Lizard 10712571 116.33 12 11 9.45KB
332 453-Epic-Catle-Theme 10712571 172.82 18 18 16.90KB
333 454-Epic-The-Winter lizzapie 147.82 70 15 34.50KB
334 455-Epic-Sakura-A wind e 204.43 41 16 23.06KB
Figure 8.15: Percentage of won and lost confrontations for each pair of samples on “Cinematography”