jacek wołkowicz - dalhousie university
TRANSCRIPT
WARSAW UNIVERSITY OF TECHNOLOGY (POLITECHNIKA WARSZAWSKA)
FACULTY OF ELECTRONICS AND INFORMATION TECHNOLOGY
INSTITUTE OF COMPUTER SCIENCE
Academic year 2006/2007
MASTER'S THESIS
Jacek Wołkowicz
N-gram-based approach to automatic composer recognition
Grade: .................................
............................................ Signature of the Chairman
of the Diploma Examination Committee
Thesis supervisor: prof. nzw. dr hab. inż. Zbigniew Kulka
Specialization: Information Systems Engineering
Date of birth: November 8, 1983
Studies begun: October 1, 2002
Curriculum vitae
My name is Jacek Wołkowicz. I was born on November 8, 1983 in Warsaw. I began my education at the Ryszard Suski Primary School No. 106 in Warsaw. I then went on to the Stanisław Staszic XIV High School in Warsaw, where for four years I studied in a mathematics and physics class with an extended computer science curriculum. During high school I took part in the International Young Physicists' Tournaments, where I received a second prize (Helsinki 2001) and a first prize (Odessa 2002). After achieving a very good result in the entrance examination to the Warsaw University of Technology (PW), I began computer science studies at the Faculty of Electronics and Information Technology. In the academic year 2005/2006 I took part in an international exchange between PW and Dalhousie University in Canada. Music has always accompanied my education. Apart from regular schooling, I attended the Witold Lutosławski first-degree Music School in Warsaw and the Józef Elsner second-degree Music School in Warsaw in the piano class. During my studies I participated in extracurricular courses on acoustics.
Diploma examination
Examination taken on .................................................................................................... 2007
With the result ................................................................................................................................
Overall result of studies ...............................................................................................................
Additional conclusions and remarks of the committee ..........................................................................
.....................................................................................................................................................
.....................................................................................................................................................
Abstract
The methods of Natural Language Processing (NLP) can be successfully applied to symbolic musical content, since music can be treated not as an artificial language but as a natural one. Showing that some NLP methods work on music leads to the point where well-known techniques, such as clustering, plagiarism detection or information retrieval, can be applied to musical content. A method of converting complex musical structure into features corresponding to the words of a text is introduced, and a mutual correspondence between both representations is shown. As far as composer recognition is concerned, given that a successful authorship-attribution method based on the statistical analysis of n-grams has already been reported for text, one can assume that this method will also work for composer attribution. The aim of this work is to create such a tool. The effectiveness obtained with the method is very high.
Key words: Statistical Computing; Content Analysis and Indexing – Linguistic processing; Sound and Music Computing - Methodologies and Techniques; Natural Language Processing
Streszczenie (Summary)
Zastosowanie N-gramowej analizy informacji do automatycznego rozpoznawania kompozytorów
(Application of n-gram information analysis to automatic composer recognition)
Natural Language Processing techniques can be successfully applied to musical data, provided that music is treated as a natural language. Showing that these techniques can be applied to music leads to solutions of clustering, plagiarism detection and information retrieval problems for musical data. Considering the problem of composer recognition, and bearing in mind that successful solutions to the twin problem for text have already been proposed using the statistical analysis of n-grams, one may expect that this method will also work for composer recognition. A method of converting complex musical notation into features corresponding to the words of a text is proposed, and the mutual unambiguity of both forms of knowledge representation is shown. A high effectiveness of the proposed method was obtained.
Keywords: music processing, natural language processing, artificial intelligence, music information retrieval.
EXTENDED SUMMARY:
Music processing is becoming an important issue nowadays. Ever larger amounts of musical information are collected, both in private collections and in libraries available to everyone over the WWW. For this reason, tools for effective searching and automatic identification of musical pieces are more and more desirable. This thesis observes that musical information exhibits many features similar to text and points out the possibility of applying already known methods of natural language processing and information retrieval to the analysis of music. Basic information from the field of psychoacoustics is also presented, together with various approaches to storing digital musical information: direct digital recording of the acoustic signal, perceptual coding, the MIDI protocol, and computer formats for musical notation.
The second chapter describes the MIDI standard, since MIDI files were used as the subject of research for the detailed implementation and evaluation of the tested algorithms and regularities. MIDI files are a digital record of a MIDI protocol session (a set of MIDI messages) and consist of chunks containing events, i.e. MIDI messages with timing information, or additional control information. It is pointed out that, as far as extracting musical information is concerned, the interesting elements are the MThd chunk (representing the file), the MTrk chunks (representing tracks) and the Note-On, Note-Off and Tempo events. From these, all the information about the timing and pitch of the played notes can be extracted. Using the knowledge of the MIDI file structure, a parser was implemented that extracts the information about note sequences for a whole piece, which is then taken as the basis for further processing. The concept of uni-grams as the smallest units of musical information in this representation is presented. The notion of n-grams as the basic feature corresponding to the words of a text is also introduced, along with the possibility of a compact notation for their simplest forms. This idea is compared with existing proposals, mainly from the field of MIR (Music Information Retrieval).
Various examples of the analysis of the musical corpus are presented. Piano pieces by five classical composers, collected from many different public web sites, were used for the research. It is shown that, in the adopted representation, music obeys Zipf's law. Using the notion of information entropy, the positions of keywords in musical documents were located. A singular value decomposition of the matrix of the collected documents was performed, but in this case the results did not prove helpful for the recognition problem.
Next, the concept of the composer recognition algorithm is presented. The method is based on building a profile for each composer and then comparing these profiles with the profile of the document being classified. The final result comes from an overall assessment of the similarity of rhythm, melody, and the way rhythm is combined with melody. No fixed values of the system parameters were assumed; instead, the n-gram length, the aging factor used when training the classifier, normalization, profile size limits and the profile comparison methods were left as variables.
As part of this work, a composer recognition system was created. It was implemented in Perl using the Tk graphics library. A file format was proposed for storing the created composer profiles. In the structure of the system, the application logic layer (the Engine subsystem), the presentation layer (the UI subsystem), event handling (the UI::cmd.pm module) and a library of auxiliary tools independent of the main task of the application (the Utils subsystem) were clearly separated.
The system was tested for many possible parameter configurations. A careful interpretation of the obtained results indicates accurate and intuitively correct behaviour. Even when the program's decision is uncertain, a deeper analysis of the given composer's output reveals the reason for this state of affairs. A detailed analysis of the algorithm shows that, with suitable parameters, the accuracy of the system on the collected data reaches almost 90%. The optimal n-gram length was determined to be 6 or 7.
Acknowledgements
I would like to thank my supervisor prof. dr hab. inż. Zbigniew Kulka for his insight and helpful
comments on acoustic matters.
I would like to thank my Canadian advisor, Dr. Vlado Keselj, for his help in shaping the idea of how to approach composer recognition.
This work would not have taken this shape without the invaluable help of the best of friends: my beloved sis Ola Kontkiewicz and my best crony Kuba Gawryjołek.
Copyright
by
Jacek Wołkowicz
2007
CONTENTS:
1 INTRODUCTION ............................................................................................. 14
1.1 Aim of the work .................................................................................... 14
1.2 Music as a natural language – basic information about NLP................ 15
1.3 Psychoacoustic foundations .................................................................. 17
1.4 Music storage approaches ..................................................................... 17
1.4.1 Waveform...................................................................................... 18
1.4.2 Perceptual audio coding ................................................................ 20
1.4.3 MIDI.............................................................................................. 21
1.4.4 Symbolic notations ........................................................................ 23
1.5 Musical data vs. textual data. ................................................................ 23
2 MUSIC REPRESENTATION IN ALGORITHMS .................................................. 25
2.1 MIDI on computers ............................................................................... 25
2.2 MIDI parsing ......................................................................................... 26
2.2.1 File structure.................................................................................. 26
2.2.2 Events ............................................................................................ 28
2.2.3 Parser implementation................................................................... 29
2.3 N-grams extraction ................................................................................ 30
2.3.1 Uni-grams...................................................................................... 30
2.3.2 N-grams......................................................................................... 32
2.3.3 Compression of N-grams representation....................................... 34
2.4 Related work ......................................................................................... 35
2.4.1 Musical Information Retrieval ...................................................... 35
2.4.2 Existing approaches....................................................................... 36
3 CORPUS AND ITS FEATURES .......................................................................... 38
3.1 Building a musical data corpus ............................................................. 38
3.2 N-gram features..................................................................................... 39
3.3 Zipf’s law for music .............................................................................. 41
3.4 Entropy analysis .................................................................................... 42
3.4.1 Information entropy....................................................................... 42
3.4.2 Term ranking ................................................................................. 44
3.5 Singular value decomposition ............................................................... 46
4 THE ALGORITHM FOR COMPOSER ATTRIBUTION......................................... 50
4.1 Related work ......................................................................................... 50
4.2 Algorithm .............................................................................................. 51
4.2.1 Testing and training set ................................................................. 52
4.2.2 Building profiles............................................................................ 53
4.2.3 Building piece representation........................................................ 54
4.2.4 Profiles comparison....................................................................... 54
4.2.5 Final judgment............................................................................... 56
4.3 Algorithm details................................................................................... 56
4.3.1 N-gram length ............................................................................... 56
4.3.2 Aging factor................................................................................... 57
4.3.3 Normalization................................................................................ 57
4.3.4 Profiles size ................................................................................... 58
5 COMPOSER RECOGNITION SYSTEM .............................................................. 59
5.1 Functionality.......................................................................................... 59
5.2 Project.................................................................................................... 60
5.2.1 CDB file ........................................................................................ 61
5.2.2 Importing MIDI files ..................................................................... 61
5.3 Implementation...................................................................................... 62
5.3.1 Packages overview ........................................................................ 62
5.3.2 Engine subsystem.......................................................................... 63
5.3.3 Utils subsystem ............................................................................. 63
5.3.4 UI subsystem ................................................................................. 64
5.3.5 Running script – Run.pl ................................................................ 69
5.3.6 Testing plug-in .............................................................................. 69
6 ANALYSIS OF THE RESULTS .......................................................................... 70
6.1 Results interpretation............................................................................. 70
6.1.1 Proper judgment ............................................................................ 71
6.1.2 Wrong judgment............................................................................ 71
6.1.3 Unseen composers......................................................................... 72
6.2 Algorithm evaluation............................................................................. 73
6.2.1 Profiles comparison....................................................................... 73
6.2.2 Normalization................................................................................ 73
6.2.3 N-gram length ............................................................................... 75
6.2.4 Profile’s sizes ................................................................................ 75
6.2.5 Aging factor................................................................................... 75
6.2.6 Representative data ....................................................................... 75
6.2.7 Key-words-based classification..................................................... 76
7 CONCLUSIONS ............................................................................................... 78
A. APPENDIX – MUSIC NOTATION ...................................................................... 80
I. Western music system........................................................................... 80
II. Staff system ........................................................................................... 81
III. Temporal information ........................................................................... 82
BIBLIOGRAPHY ..................................................................................................... 84
LIST OF FIGURES:
Figure 1.1 Spectral Analysis of c-moll prelude BWV 846, J.S. Bach................. 20
Figure 1.2 Spectral Analysis of Etude c-moll, op. 25 no 12, F. Chopin.............. 20
Figure 1.3 NLP and Music processing domains.................................................. 24
Figure 2.1 Sample MThd chunk header .............................................................. 27
Figure 2.2 Sample MTrk chunk header ............................................................... 27
Figure 2.3 Sample Tempo event.......................................................................... 28
Figure 2.4 Sample Note-On – Note-Off sequence .............................................. 29
Figure 2.5 Solving the problem of parallelism.................................................... 30
Figure 2.6 Unigrams extraction ........................................................................... 32
Figure 2.7 Gliding window.................................................................................. 33
Figure 2.8 The sample of Thai document............................................................ 34
Figure 2.9 Two sample melodies......................................................................... 34
Figure 3.1 Composers Timeline .......................................................................... 39
Figure 3.2 Zipf’s law for text............................................................................... 41
Figure 3.3 Zipf’s law for music corpus ............................................................... 42
Figure 3.4 Entropy dist ........................................................................................ 45
Figure 3.5 Eigenvalues for the corpus ................................................................. 47
Figure 3.6 SVD dimensions: 1, 2, and 3.............................................................. 48
Figure 3.7 SVD dimensions: 1, 4, and 6.............................................................. 48
Figure 3.8 SVD dimensions: 4, 7, and 8.............................................................. 49
Figure 4.1 Building profiles ................................................................................ 53
Figure 4.2 Measure components for different n-grams values............................ 55
Figure 4.3 Aging example ................................................................................... 57
Figure 5.1 System scheme................................................................................... 60
Figure 5.2 System structure ................................................................................. 62
Figure 5.3 Adding to database window............................................................... 65
Figure 5.4 Adding composer window ................................................................. 65
Figure 5.5 Adding composer window ................................................................. 66
Figure 5.6 Application main window.................................................................. 67
Figure 5.7 Application: recognizing window...................................................... 68
Figure 6.1 Normalization’s influence to the results ............................................ 74
Figure 6.2 Accuracy of sieved and full profiles .................................................. 77
Figure A.1 Clefs ................................................................................................... 81
Figure A.2 Time related symbols ......................................................................... 82
Figure A.3 Staff layout ......................................................................................... 83
Figure A.4 Plethora of music notation potpourri.................................................. 83
LIST OF TABLES:
Table 1.1 Levels of NLP. Text vs. music........................................................... 15
Table 2.1 Variable Length Quantities ................................................................ 28
Table 2.2 Pitch and rhythm quantization ........................................................... 37
Table 3.1 Composer Corpus............................................................................... 38
Table 3.2 Sample document-term matrix........................................................... 43
Table 3.3 Class entropies calculations ............................................................... 43
Table 3.4 Sample term ranking .......................................................................... 44
Table 4.1 Training and testing set split .............................................................. 52
Table 6.1 Evaluation of the Frederic Chopin prelude Op. 28 No. 22 ................ 71
Table 6.2 Evaluation of the Ludwig van Beethoven Sonata Op. 49 No. 2 ........ 71
Table 6.3 Evaluation of the Franz Liszt Concert Etude No. 3 ‘Un sospiro’ ...... 72
Table 6.4 Unknown Composers assignments .................................................... 72
Table 6.5 Algorithm results of aging 0.96 ......................................................... 73
Table 6.6 Algorithm results of aging 0.96 with profiles normalization............. 74
Table 6.7 Maximal accuracies for different aging factors ................................. 75
Table 6.8 Results for representative data ........................................................... 76
Table A.1 Pitches ................................................................................................ 81
Table A.2 Notes and Rests .................................................................................. 82
1 Introduction
1.1 Aim of the work
People store large amounts of music on their computers nowadays. They listen to it almost all the time in the background, or sometimes they even treat computers as mini sound studios that provide them with great aural relaxation. These facts make the problem of processing musical data more and more important. Musical data are still treated as unstructured binary data, left on the same shelf as images, movies and programs, in contrast to textual data, which are easy to process, search and index and are supported by a wide range of computer-aided techniques provided by NLP (Natural Language Processing), IR (Information Retrieval) or TDM (Text Data Mining), such as classification, analysis, generation, summarization, searching and much more.
The aim of this work is to show that music can be treated as a natural language; to this end, an automatic composer recognition system has been developed. The system was based on the solution of the same problem for text provided by Keselj, Peng, Cercone and Thomas [29]. The program was implemented in Perl, a language designed for text processing. In order to show the accuracy of the algorithm, a corpus of MIDI files containing piano pieces by various classical composers has been built. Since an NLP algorithm was to be applied, it has been shown how one can obtain equivalents of characters and words for music [14] and apply the comparison algorithm as it is. The system works, and one can see that there is a lot to be done in this area, from sound processing (in order to manage personal music libraries), through MIR (music IR), to advanced musical semantic analysis (musical NLP). Music recognition software tools can be very important nowadays, since there are a lot of web music repositories and so far all of them have had to be indexed manually. With automatic, content-based tools one can build more sophisticated systems, for instance one similar to Google for text [32].
One can think that music is in the same situation as other fine arts, like painting, dance or sculpture, so why not treat the other arts as natural languages too? In my opinion, it is not possible. There is a big difference between music and writing on the one hand and the other fine arts mentioned above on the other. Both music and writing use a kind of symbolic notation in order to easily exchange these artworks and preserve them for future generations, which none of the other fine arts do.
1.2 Music as a natural language – basic information about NLP
In order to treat music as a natural language, one has to show that music processing works on the same classes of problems as NLP does. One can distinguish certain levels of text processing, listed in Table 1.1. NLP, as well as music processing, tries to work its way through all the levels, from recording (a voice, speech) to understanding (the meaning of a discourse). Of course, there is no tool that does everything at once, i.e. understands the meaning and extracts knowledge directly from a raw waveform. In fact, NLP tools concentrate on a certain level, trying to move the problem up to the next level.
Table 1.1 Levels of NLP. Text vs. music.
Level         Text processing               Music processing
phonetics     Recorded voice                Recording
phonology     Phonemes of the language      Separated notes
morphology    Word structure                Notes in the score
syntax        Word order                    N-grams, note order
semantics     Word meaning, POS             Harmonic functions
pragmatics    The meaning of a sentence     Phrase structure
discourse     Context of a text             Interpretation of a piece
Music, similarly to natural language, can be recorded and presented primarily as a waveform. At the 'phonetics' level one tries to investigate the structure of a sound, to separate and distinguish between notes and instruments. This task, combined with note recognition, is a well-known problem to contemporary sound engineers, even if they do not know that they are involved in NLP tasks. The corresponding task is a major one in NLP, and many different approaches to it have proved successful. Nevertheless, music is much more complex, and sound recognition tasks for musical pieces are still in their infancy. A simple explanation of this fact, with an example, will be given later in this chapter.
The second very important similarity results from the fact that music, like text, has a symbolic representation. The first writing system, cuneiform, was invented in the ancient world of Mesopotamia by the Sumerians about 3200 BC [51]. The origins of music writing date back to the 8th century and the Carolingian Empire, when the neumatic system, the first notation intended for music only, was invented, while the first inscription that may be treated as a basic music notation dates back to 2000 BC [58]. It is true that these two facts are far apart in time, but music and writing are the only human activities that have a symbolic representation. This fact allows and encourages us to think about music content analysis as the next step of this so-called MLP (Music Language Processing); similarly to text, music can be analyzed on the semantic and syntactic levels. A music score also consists of characters, which are called notes. Similarly to NLP's morphology and syntax, music has a hidden, grammar-like structure with hidden rules; in this case it is called harmony. It determines how to put words (notes) together and how to build well-formed phrases using them. It also governs the musical meaning of a piece, which is the order of chords. In the first case – notes – we may talk about the syntax of music, while in the second case – chords, harmonic functions – about the semantics of a certain chord or the pragmatics of a phrase. This is very similar in its form to one of the main problems of NLP nowadays, which is grammar analysis. A method of detecting probabilistic harmony was introduced by Bod [6], who based his investigation on the Essen Folksong Collection (a collection of folk tunes, an equivalent of the Penn Treebank).
The second main course of action in NLP, namely statistical NLP, can be applied to musical data as well. MIR (Music Information Retrieval), which is now a heavily exploited domain of research, is an example of this approach. The other problem with music is that there are no word boundaries and phrasing is driven by harmony, so one has to figure out the structure of a piece as well as its harmonic representation in order to successfully retrieve the musical meaning. However, there are methods of partitioning pieces into smaller themes [55], [57]. A similar problem can be found in some natural languages that do not use whitespace, such as Thai, a language of almost 65 million people.
The highest levels of NLP (pragmatics and discourse) are also present in music. Pieces can be positive (major) or negative (minor). They may represent human ideas, desires or aspirations (romantic music) as well as depict real situations and actions (program music). Paul Dukas's The Sorcerer's Apprentice is a very good example of program music; it was brought to the screen in 1940 in Walt Disney's animated film Fantasia, in which Mickey Mouse plays the role of the apprentice.
1.3 Psychoacoustic foundations
While talking about music as a natural language, especially at those low-level
analyses (phonetics, phonology), one has to point out some basic information from
psychoacoustics of hearing and human aural perception.
Sounds are disturbances of pressure that propagate from the source of a sound
through matter (air) as a longitudinal wave. They are perceived by the eardrum,
transmitted through the middle ear into cochlea where mechanical energy from sound is
converted to neural signals and then carried to the brain in order to create a sound scene
(sound picture).
The nature of the sound results from physical features of air and human aural
system. The shape of ear, its dimensions make us more sensitive for the frequencies
from 1 kHz to 4 kHz. The length of the cochlea in the internal aural system limits
human perception to the sounds form the range of 20 to 20 kHz. This information was
implicitly and unwittingly used by people in building and inventing musical
instruments as well as in developing contemporary musical systems. Nowadays they
are also used more consciously in the matter of sound processing and music storage. An
attention will be paid to some other facts form psychoacoustics in all the sections of
this thesis. Nevertheless, they all ensue from the foundations of human sound
perception.
1.4 Music storage approaches
Music can be represented digitally in various ways. However, there are mainly
two types of storage approaches:
- 18 -
1. Raw (Waveform) – the sound recorded by microphones representing nothing
but the motion of the speaker’s (or microphone’s) membrane. The data are kind of
snapshot of a real recording. It generally does not matter whether it is compressed
(mp3, ogg and more other well known formats) or stored explicité (pcm, wav or
aiff format).
2. Symbolic representation – score notations (mus, sib, abc, xml) and MIDI
protocol, which store information about musical events rather then about the actual
sound.
People got used to raw representations because they like to hear the "real" artist's performance, not a symbolic version, which is played differently on every machine. The other reason for this situation is the fact that not everyone understands music in the way they read text. Musical education and studying scores are not as common a pastime as they used to be in the past. The ease of off-line listening to music comes from the rapid spread of vinyl long-play records, followed by compact cassettes (audio tapes) and finally, in 1982, audio compact discs (CD). These technologies made music available to everyone, but as a result fewer people need to be involved in actively creating music. The progress in compression methods and the rapid development of personal computers and the Internet allow music to be shared over the web. People are surrounded by music while knowing nothing about it or its content. These formats will be briefly described in the following sections.
1.4.1 Waveform
Waveform is an audio format in which music is stored as a digital audio signal.
Analog sound signal is a variation of acoustic pressure usually represented by a
continuous-time voltage signal at the output of microphone, which is then low-pass
filtered, sampled, quantized and binary coded. As an output of this process, a digital
PCM (Pulse Code Modulation) signal is then stored in a file. There are plenty of
possible configurations of sampling rate and quantization depth, however, only one
became more popular then others. The human ear is sensible to the sounds up to 20
kHz. According to Shannon-Kotielnikow theorem, in order to encode a signal with a
maximum component frequency of 20 kHz one has to sample it with the frequency
greater than 40 kHz. Then the information will not be lost and the analog signal will be
able to be reconstructed. In order to leave a safety margin it was decided to sample
- 19 -
sound with a standard of 44.1 kHz. The second problem is the quantization level.
According to other research on human sound perception it was shown that dynamic
range of a human ear (i.e. the ratio of the loudest sound to the quietest one) is about 120
dB. Each bit more in the quantified sample gives 6 dB in the resulting dynamic range,
thus the most convenient quantization depth is 16 bits (96 dB) or 24 bits (144 dB).
The standardized parameters for audio (CD Audio) are 44.1 kHz sampling rate
and 16 bits quantization (96 dB dynamic range and more than 20 kHz maximal
frequency) and these are the most common settings for waveform files.
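Written out explicitly, both CD-audio parameters follow from the two constraints above; a short worked calculation, using the common approximation of about 6 dB of dynamic range per bit:

f_s > 2 f_{\max} = 2 \cdot 20\,\mathrm{kHz} = 40\,\mathrm{kHz} \;\Rightarrow\; f_s = 44.1\,\mathrm{kHz}
\mathrm{DR} \approx 6\,\mathrm{dB} \cdot n \;\Rightarrow\; 6 \cdot 16 = 96\,\mathrm{dB}, \qquad 6 \cdot 24 = 144\,\mathrm{dB}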
Two examples of piano music, with additional graphical information for two different excerpts, are shown in Figure 1.1 and Figure 1.2. Sound, in the basic approach, can be represented as a digital function of voltage over time. This representation is called a waveform and it is shown just above the notes in both figures. There are some simple approaches to waveform analysis, such as various zero-crossing parameters or envelope shape analysis [33], but these methods can be used only in a few simple tasks, such as speech/music discrimination. The second possible approach is to calculate a spectral representation of the waveform. The spectral representation is a two-dimensional function of sound energy depending on time and frequency. It shows the distribution of sound components of various frequencies. Both spectral images in Figure 1.1 and Figure 1.2 were calculated using a Kaiser (180 dB) gliding windowing function with 8192 bands (~2.5 Hz frequency resolution and ~50 ms time resolution).
Spectral analysis (i.e. the analysis of this kind of function) is now the only approach to distinguishing and recognizing notes from pure sound data. Sound recognition comes down to catching all the maxima in the spectral view and classifying them, by their shape and position, into different groups (note recognition, instrument identification). Much research has been done in this area ([3], [5], [18], [33], [35], [36] and [59]). However, the results are still very poor. We are able to recognize notes in simple examples (Figure 1.1), but more complex and messy ones (e.g. Figure 1.2) can still cause too many errors.
Figure 1.1 Spectral Analysis of c-moll prelude BWV 846, J.S. Bach
Figure 1.2 Spectral Analysis of Etude c-moll, op. 25 no 12, F. Chopin
1.4.2 Perceptual audio coding
A one very important use of the spectral approach to music identification is the
perceptual audio coding, where several psychoacoustics phenomena are incorporated in
- 21 -
the codecs design in order to remove the redundancy caused by the irrelevant
information. Many various lossy compression algorithms of storing audio data were
invented and many music file formats are in use. They leave only the information about
events in spectral view that is audible for the human ear (about tones and noise bands).
The psychoacoustic phenomena, such as the threshold in quiet derived from the equal-
loudness contours obtained for sine tones accompanied with temporal and frequency
masking, is used in this case. For instance, each frequency peak and noise band are able
to mask other sounds that occur in the close vicinity of the event thus, despite the fact
that they were registered by the microphone and probably will appear during playback,
will not be noticed by human ear. Masking phenomenon occurs in both, frequency and
time dimension, and may be both forward and backward. All additional information
that will not be perceived is not stored, therefore the compression is lossy. Analyzing
this kind of musical data may be simpler than analyzing pure recording, because all the
important features that represent maxima on spectral map are already extracted. The
use of perceptual audio coding in music was analyzed, for instance, in Fraunhofer IIS,
Germany where mp3 and aac formats were invented [19].
1.4.3 MIDI
MIDI files represent different approach to storing audio data. Apart from storing
exact information about a piece, i.e. recording, one can store information about the
piece itself. MIDI stores information about musical events, such as pressing or
releasing a key. Perceptually coded audio files contain the same type of information;
however, they are automatically obtained from the original recording (storing about
1/10 original data keeping whole audible information) without any human semantic
feedback while MIDI files are sequenced by people thus the content may be described
as semantic. The actual midi size is usually about 1/1000 those of original recording, so
one can see a difference.
MIDI stands for Musical Instrument Digital Interface and, according to Wikipedia, it is an industry-standard electronic communications protocol that enables electronic musical instruments, computers and other equipment to communicate, control and synchronize with each other in real time. MIDI does not transmit an audio signal or media – it simply transmits digital "event messages" such as the pitch and intensity of musical notes to play, control signals for parameters such as volume, vibrato and panning, and cues and clock signals to set the tempo [39]. For instance, RFC 4695 [31] describes a standard for network communication using the MIDI protocol and MIDI infrastructure. As an electronic protocol, MIDI is notable for its success, both in its widespread adoption throughout the industry and in remaining essentially unchanged in the face of technological developments since its introduction in 1983.
MIDI files are binary files that consist of concurrent channels and tracks ([23],
[24] and [37]). Each channel is a container for events – MIDI messages. There are three
categories of these messages:
1. Channel (voice) messages – essential for MIDI usage, representing things that
may happen during music generation, such as: Note-On (pressing a key), Note-Off
(releasing a key), Key Pressure, etc.,
2. System Real-time messages – messages controlling real-time events that may
happen during music performance, regarding the time control and flow of other
sync messages such as: Clock, Tick, Start, Stop, Continue, Reset,
3. System Common messages – additional information about performance of the
piece such as: System Exclusive messages (various, usually textual data) and
playlist controllers (Song Select, Song Request).
Both channel and real-time messages depend on time. The time of events in a Standard MIDI File is counted in ticks, which can be converted to real time using the tempo information, so it is easy to determine the exact moment of an event. While time information remains essentially real-valued (as it is in waveform data), pitch information becomes symbolic. Instead of a frequency value, a key number on a virtual instrument is used. There are 128 possible notes on a MIDI device, numbered from 0 to 127. Middle C (about 261 Hz) is note number 60 and, as in the equal-tempered scale system, the frequency of each note is 2^(1/12) times that of the previous one. This solution, however, has a drawback, because it ties MIDI to Western music, which uses the chromatic, twelve-pitch scale. It is useless for other systems, such as pentatonic, Pythagorean or just (natural) intonation systems, but it makes MIDI events easy to interpret. It will be explained in the following chapters how one can use this information.
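To illustrate the pitch mapping just described, here is a minimal Perl sketch (a hypothetical helper, not part of the thesis system) that converts a MIDI key number into the corresponding equal-tempered frequency, using A4 = 440 Hz (key 69) as the reference:

use strict;
use warnings;

# Convert a MIDI key number (0-127) to frequency in Hz.
# Assumes equal temperament with A4 = 440 Hz at key 69.
sub midi_to_freq {
    my ($key) = @_;
    return 440 * 2 ** (($key - 69) / 12);
}

printf "Middle C (key 60): %.1f Hz\n", midi_to_freq(60);   # ~261.6 Hz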
The MIDI format contains information about the music structure only. It does not preserve any information about the exact performance (except the timing schedule), so MIDI files are played back by special programs: synthesizers, where each note is generated using certain algorithms, or samplers, which use fixed sets of previously recorded notes.
1.4.4 Symbolic notations
There is also another way of storing musical data that stores information about
musical score only. There are lots of formats that fulfill this paradigm but,
unfortunately, there in no such standard as for MIDI. They are usually designed for
different score editors (so called ‘Scorewriters’) and are not widely used by now. There
is a short list of some of these formats:
1. Finale file format [16] – binary file, commercial (MakeMusic, Inc.) but used
widely by composers,
2. Chris Walshaw’s ABCMusic notation [62] - designed to notate music, tunes
and lyrics, in ASCII format under various licenses, both commercial and open
source,
3. LilyPond notation [34] – text scripting notation for engraving sheet music.
Unlike some commercial proprietary programs such as Finale and Sibelius,
LilyPond does not contain its own graphical user interface for the creation of
scores. It is developed under GNU Public License,
4. MusicXML [40] – XML standard for storing musical data. It is very interesting
as far as music information retrieval or music as a natural language processing is
concerned because of the standardization of XML formats, but useless unless it is
widely used.
These formats could be very good for analyzing but they are not used over the
Internet as widely as MIDI is and there are not enough resources to build a reasonable
music corpus for researches.
1.5 Musical data vs. textual data.
Two main approaches to storing musical data, symbolic and actual one were
described in this chapter. One can show the same relationship for texts. Speech can also
be recorded and as it has been shown, there are already well known text processing
tools that may involve both text and speech processing. The resemblance between NLP
and music processing is not accidental and cannot be neglected. Representing speech as
text seems natural for us; however, on the other hand we can store human speech as
- 24 -
waveform data. We will have then the original author’s voice. But people are used to
represent text rather symbolically. This representation is easy to store, edit (using
computer keyboards), process in the way that it is flat – words occur one after another,
there is no concurrency in the text. Second problem is that almost everyone can read,
but rarely play the music and almost nobody understands music. The reason for this
situation lies in contemporary education model – music is simply not needed nowadays.
That is why people got used to mp3 files.
The similarities connected with the matter (types of data that are being
processed) of both domains (all the content of this diagram was described in previous
sections) have been summarized in Figure 1.3.
Music and natural languages were both used and created by people in order to
communicate, exchange thoughts, sensations. They have been used by people since the
very beginning of human existence. They both seem to have free structures. In fact,
they both are created using a complex set of rules called grammar (or harmony), and
were evaluating among centuries. Both, text processing and music processing, remain a
key problem, where its solution may help us understand human nature and the nature of
human thinking.
Figure 1.3 NLP and Music processing domains
2 Music representation in algorithms
2.1 MIDI on computers
MIDI files are binary records of MIDI messages. They are prepared for
playback on computers and other similar machines. There are two types of software
sequencers that perform MIDI files: synthesizers and samplers. They are both virtual
instruments that may be operated through MIDI protocol, while MIDI files are
practically records of real, usually faked performances. Synthesizers generate musical
sounds mathematically (algorithmically) so the sound is totally artificial. They are more
popular and they are usually distributed with operating systems or with the software
provided with sound cards. As they need much less resources (disc and processor), they
exist from the beginning of PCs development. Samplers contain a set of the original
sound samples, for many instruments, recorded with different pitches, sound pressures
and durations (with regard to different transients). Samplers require much more disc
space (several megabytes per instrument) and processor time (it should run smoothly
on a 1 GHz processor. One such system is a part of Finale Music, commercial
distribution of a score-scripting system [16]. There are also some hardware
implementations that use DSP. The main advantage of samplers is the real, not artificial
sound. It is also possible to record samples of a valuable, unique instrument and use
these samples to generate faked pieces. There are a lot of free and commercial sound
libraries available online (search for DLS files [20]).
As was mentioned in the introduction, the MIDI protocol itself was not designed for any single specific purpose. In order to standardize MIDI devices, the MIDI Manufacturers Association (MMA [37]) provides a specification for synthesizers which imposes several requirements beyond the more abstract MIDI standard. While MIDI itself provides the protocol which ensures that different instruments can interoperate at a fundamental level (e.g. that pressing keys on a MIDI keyboard will cause an attached MIDI sound module to play musical notes), General MIDI (or GM) goes further in two ways: it requires that all GM-compatible instruments meet a certain minimal set of features, such as being able to play at least 24 notes simultaneously (polyphony), and it attaches certain interpretations to many parameters and control messages which were left unspecified in MIDI, such as defining instrument sounds for each of the 128 program numbers [22]. Next, MIDI messages (along with timing information) can be collected and stored as a file in a computer file system, in what is commonly called a MIDI file or, more formally, a Standard MIDI File (SMF). The SMF specification was developed and is also maintained by the MMA [39].
2.2 MIDI parsing
2.2.1 File structure
Data parsing tool is an intrinsic task to be done if one is talking about
processing the data with complex structure and Standard MIDI File format has such a
complex structure. According to SMF specification, MIDI file can contain any number
of tracks and every track may contain up to 16 independent channels.
Data in a MIDI files are organized in chunks and there can be many chunks
inside a file. Each chunk can have a different size so the information about the size of
the data is always stored in the chunk header. Each chunk contains the chunk ID (four
bytes) that identifies the chunk type and 32-bit length of chunk data. The 4 bytes that
make up the length are stored in the (Motorola) "Big Endian" byte order, not in the
(Intel) "Little Endian" reverse byte order and this has to be taken into consideration,
especially on PCs.
MThd chunk defines primary MIDI features. MThd header contains 6 bytes –
16-bits format, 16-bits Number of tracks and 16-bits Division (tempo information). The
sample MThd header is shown in Figure 2.1:
Figure 2.1 Sample MThd chunk header
There are actually 3 different formats of MIDI files. Type '0' means that the file contains one single track containing MIDI data on possibly all 16 MIDI channels. Type '1' means that the file contains one or more simultaneous tracks (i.e., all starting from the assumed time of 0), perhaps each on a single MIDI channel. This is the most common type nowadays, and an example of it is shown in Figure 2.1 (bytes 8 and 9). Type '2' means that the file contains one or more sequentially independent single-track pieces.
The second pair of bytes in the MThd chunk is the number of tracks in the MIDI file. It should have a value of 1 for format '0'. In the example it equals 5.
The third pair of bytes describes timing information. If it is positive, it gives the number of pulses per quarter note (PPQN) (in the example given above it is 192); if it is negative, the first byte defines an SMPTE (Society of Motion Picture and Television Engineers) 'frame rate' of the piece (-24, -25, -29 or -30 fps) and the second byte the number of subframes per frame. So if the division is E7 28 (-25, 40), it gives 25*40 = 1000 ticks per second. A deeper explanation of these tempo markings is available in [25].
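A hedged Perl sketch of how the division word can be interpreted (an illustration only; the field layout is as described above, the function name is mine):

use strict;
use warnings;

# Interpret the 16-bit division field from an MThd chunk.
# Positive value: pulses per quarter note (PPQN).
# Negative value (top bit set): high byte is a negative SMPTE frame rate,
# low byte is the number of subframes (ticks) per frame.
sub interpret_division {
    my ($division) = @_;           # unsigned 16-bit value as read from the file
    if ($division & 0x8000) {
        my $fps      = 256 - (($division >> 8) & 0xFF);   # e.g. 0xE7 -> 25 fps
        my $subframe = $division & 0xFF;                  # e.g. 0x28 -> 40
        return sprintf "%d ticks per second", $fps * $subframe;
    }
    return sprintf "%d pulses per quarter note", $division;
}

print interpret_division(0xE728), "\n";   # "1000 ticks per second"
print interpret_division(192),    "\n";   # "192 pulses per quarter note"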
There are one or more MTrk (track) chunks after the MThd chunk. Each MTrk chunk contains the chunk ID (the 4 bytes 'MTrk') and the chunk data length (4 bytes). No additional information is stored in the MTrk chunk header:
Figure 2.2 Sample MTrk chunk header
The MTrk chunk is a container for all MIDI messages. Each message is preceded by its delta time, stored as a 'Variable Length Quantity' (VLQ). If the value of a byte is less than 128, that byte value is the final value. If the MSB (most significant bit) is set, it means that the seven remaining bits of the byte are the leading bits of the value and the rest follows in subsequent bytes. This continues until a byte whose MSB is not set (i.e. whose value is less than 128) is reached; then the seven-bit groups from all the previous bytes and the current byte, concatenated, form the resulting value. Examples of VLQs are shown in Table 2.1, and a short decoding sketch follows the table. Each VLQ value in an SMF describes a delta time (the number of ticks that have elapsed since the previous event).
Table 2.1 Variable Length Quantities
Quantity VLQ representation
0x0 00
0x10 10
0x7F 7F
0x80 81 00
0x1000 A0 00
0x3FFF FF 7F
0x4000 81 80 00
0x100000 C0 80 00
0x1FFFFF FF FF 7F
0x200000 81 80 80 00
0x1000000 88 80 80 00
0xFFFFFFF FF FF FF 7F
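The following minimal Perl sketch decodes a VLQ from a byte string (an illustration of the rule described above, not the parser actually used in the thesis); it returns the value and the number of bytes consumed:

use strict;
use warnings;

# Decode one Variable Length Quantity starting at $offset in $data.
# Returns (value, number of bytes consumed).
sub read_vlq {
    my ($data, $offset) = @_;
    my ($value, $consumed) = (0, 0);
    while (1) {
        my $byte = ord(substr($data, $offset + $consumed, 1));
        $consumed++;
        $value = ($value << 7) | ($byte & 0x7F);   # append the 7 payload bits
        last unless $byte & 0x80;                  # MSB clear -> last byte
    }
    return ($value, $consumed);
}

my ($v, $n) = read_vlq("\x81\x80\x00", 0);
printf "0x%X (%d bytes)\n", $v, $n;   # 0x4000 (3 bytes), as in Table 2.1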
2.2.2 Events
MTrk chunks are containers for MIDI events (MIDI messages plus delta-time
information). There are numerous event types in SMF specification, but in this case
only a few of them remain interesting:
1. Tempo,
2. Note-Off,
3. Note-On.
Tempo is a non-MIDI (meta) event. It has the structure 'delta-time FF 51 03 xx xx xx', where the last three bytes are the new tempo. It expresses tempo as "the amount of time (i.e., microseconds) per quarter note". The default tempo is 500,000 (0x07A120), i.e. 120 BPM (beats per minute). In the example in Figure 2.3 it is changed to 315,789 (190 BPM) and the delta time is 0 (this is the very beginning of the piece). Tempo defines how fast ticks are triggered, and this gives the actual time (in ms) of other events.
Figure 2.3 Sample Tempo event
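A small Perl sketch (helper names are hypothetical) of the conversions implied above: the tempo value gives microseconds per quarter note, which together with the PPQN division turns delta ticks into milliseconds and also yields the BPM figures quoted in the text:

use strict;
use warnings;

# Tempo is stored as microseconds per quarter note.
sub tempo_to_bpm {
    my ($us_per_quarter) = @_;
    return 60_000_000 / $us_per_quarter;
}

# Convert a delta time in ticks to milliseconds,
# given the tempo and the PPQN division from the MThd chunk.
sub ticks_to_ms {
    my ($ticks, $us_per_quarter, $ppqn) = @_;
    return $ticks * $us_per_quarter / $ppqn / 1000;
}

printf "%.0f BPM\n", tempo_to_bpm(500_000);            # 120 BPM (default tempo)
printf "%.1f BPM\n", tempo_to_bpm(315_789);            # ~190 BPM (Figure 2.3)
printf "%.1f ms\n",  ticks_to_ms(192, 500_000, 192);   # one quarter note = 500 ms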
Note-Off and Note-On both have the same structure 'delta-time XY ww zz', where X defines the event type ('8' stands for Note-Off and '9' for Note-On), Y is the channel number (with a value from 0x0 to 0xF), 'ww' gives the key number (pitch) and 'zz' gives the velocity (volume level). The information about velocity will not be needed, except for the situation where a Note-On event has velocity 0; in this situation it is identical to a Note-Off event.
As before, an excerpt from a MIDI file showing how MIDI Note-On and Note-Off events work is presented in Figure 2.4:
Figure 2.4 Sample Note-On – Note-Off sequence
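For completeness, a hedged Perl sketch of decoding a single note event of the form described above (status byte plus two data bytes); the velocity-0 convention is applied here, and the function is an illustration rather than the thesis parser itself:

use strict;
use warnings;

# Decode a 3-byte channel voice message (status, key, velocity).
# Returns (type, channel, key, velocity), or an empty list for other events.
sub decode_note_event {
    my ($status, $key, $velocity) = @_;
    my $kind    = ($status >> 4) & 0xF;   # 0x8 = Note-Off, 0x9 = Note-On
    my $channel = $status & 0xF;
    return unless $kind == 0x8 || $kind == 0x9;
    # A Note-On with velocity 0 is treated as a Note-Off.
    my $type = ($kind == 0x9 && $velocity > 0) ? 'Note-On' : 'Note-Off';
    return ($type, $channel, $key, $velocity);
}

my @ev = decode_note_event(0x90, 60, 100);   # Note-On, channel 0, middle C
print "@ev\n";                               # Note-On 0 60 100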
2.2.3 Parser implementation
Keeping all the information about MIDI files in mind, a MIDI parser was
implemented. It is based on MIDI Package which is a part of an Open Source project,
abcMIDI, distributed under GNU Public License. Original software resources are
available at [62]. It provides a framework for analyzing MIDI files. The program
collects Note-On, Note-Off events for each channel separately. Each event is assigned a
time value (in ms). In the next step all pairs of Note-On – Note-Off events are merged
into one note. Each note is characterized by onset time, pitch and duration. However,
that is not all. As it was mentioned above, music differs from text in the way that text is
flat but music is not. In this approach one has to find linear order of notes from this
concurrency. One can treat every channel separately if only every channel represents
each hand. Then on each channel the problem of parallelism is to be solved. According
to basic psychoacoustic knowledge, one can show that people concentrate on the
highest currently played note (Figure 2.5):
Figure 2.5 Solving the problem of parallelism
This was also shown by Uitdenbogerd and Zobel [61], who tried to find the best heuristic model for reducing polyphonic music to a monophonic representation. In their investigations with human listeners, a heuristic that keeps the highest currently played note in chords or concurrencies performed best. One can conclude that this is what people really perceive, so it may hold the key to understanding human music perception.
In the next step, the set of notes in each channel is sorted in ascending order of onset time and, secondarily, by descending pitch. Then a sequence of the highest currently played notes is created for each channel. The output is a list of channels, each with its channel number, the number of notes it contains and a list of notes (pitch and duration information):
channel 0 <number of notes>
<pitch duration>
<pitch duration>
...
<pitch duration>
channel 1 <number of notes>
<pitch duration>
...
<pitch duration>
...
channel <n> <number of notes>
...
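The core of this reduction step could look roughly like the following Perl sketch (a simplified illustration of the highest-note heuristic, not the actual abcMIDI-based parser; it resolves chords with identical onsets only, while overlapping held notes would need the full heuristic):

use strict;
use warnings;

# Reduce a channel to a monophonic note sequence by keeping, for each onset
# time, only the highest note of the chord sounding at that moment.
# Each note is a hash ref: { onset => ms, pitch => MIDI key, dur => ms }.
sub highest_note_sequence {
    my (@notes) = @_;
    # Sort by onset ascending, then by pitch descending, so the first note
    # seen for each onset time is the highest one.
    my @sorted = sort { $a->{onset} <=> $b->{onset}
                        || $b->{pitch} <=> $a->{pitch} } @notes;
    my (@melody, %seen_onset);
    for my $n (@sorted) {
        next if $seen_onset{ $n->{onset} }++;   # keep only the top note of a chord
        push @melody, $n;
    }
    return @melody;
}

my @melody = highest_note_sequence(
    { onset => 0,   pitch => 60, dur => 500 },
    { onset => 0,   pitch => 64, dur => 500 },   # same chord, higher -> kept
    { onset => 500, pitch => 62, dur => 500 },
);
print join(" ", map { $_->{pitch} } @melody), "\n";   # 64 62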
2.3 N-grams extraction
2.3.1 Uni-grams
Unigrams are the elemental units that are taken into consideration during data processing. In NLP, we can talk about characters or words; it depends on the task whether one of these choices gives better or worse results, and it should always be checked for every task. With music, we do not have such a choice. Of course, there are 'phrases' in music that may be assigned the 'word' meaning, but there are no whitespaces or any other delimiters in music that would simply distinguish between phrases. We could separate them using barlines as delimiters, but it is unlikely that this would work well.
The simplest approach to the unigram extraction task is simply taking the duration or pitch as the basic feature, but this cannot bring good results. Pieces can be played at different speeds and can be transposed to any key. The features one needs should be key-independent, so it is not the pitch itself that is important, but the pitch relative to the other notes. This is important because the key does not tell us anything about a particular work; e.g., J. S. Bach wrote two sets of preludes and fugues, with a fugue in every existing key of the well-tempered scale, so if one conducts a pitch distribution analysis, one obtains a flat, normalized result. The second significant requirement for musical n-grams is that they should be tempo-independent. Duration is not given symbolically in MIDI files – as quarter notes, eighth notes or half notes – but directly, in a way that can be mapped to milliseconds. Every MIDI file representing the same piece but sequenced by different people (or programs) will look slightly different. That is why a decision was made to use relative, not absolute, duration counting. Each difference is expressed on a logarithmic scale and is rounded to absorb random tempo fluctuations. The number '1' means that the following note lasts twice as long as the previous one, '2' means 4 times as long, '0' means the same duration, and '-1' means twice as fast. The procedure of extracting n-grams is shown in Figure 2.6 and the formula applied to each pair of notes is given as follows:
(P_i, T_i) = \left( p_{i+1} - p_i,\; \mathrm{round}\!\left( \log_2 \frac{t_{i+1}}{t_i} \right) \right)    (2.1)
where P_i and T_i denote the resulting relative values, p_i is the pitch of the i-th note represented as a MIDI value and t_i is the duration (in seconds) of the i-th note. A rounding precision of 0.2 was chosen; after this smoothing the changes were assessed to be imperceptible compared with the original performance.
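As a sketch of formula (2.1), the following Perl fragment computes the relative pitch and the rounded logarithmic duration ratio for consecutive notes (the pitch and duration values below are made up for the example):

    use strict;
    use warnings;

    my @pitches   = (60, 62, 59);      # MIDI pitch values
    my @durations = (0.5, 1.0, 0.33);  # durations in seconds

    for my $i (0 .. $#pitches - 1) {
        my $p = $pitches[$i + 1] - $pitches[$i];                      # relative pitch (semitones)
        my $t = log($durations[$i + 1] / $durations[$i]) / log(2);    # log2 of the duration ratio
        $t = 0.2 * int($t / 0.2 + ($t >= 0 ? 0.5 : -0.5));            # round to a precision of 0.2
        printf "pitch interval %d, duration change %.1f\n", $p, $t;
    }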
Figure 2.6 Unigrams extraction
In fact, all the preprocessing tasks described above, which lead to obtaining unigrams, are implemented in the preprocessor (the MIDI parser), so a sample (although very short) output of the parser can look like this:
channel 0 2
1.0 -1
-1.6 3
One has to keep in mind that this example describes an excerpt containing three notes, not two, because each unigram describes relative quantities and thus represents a pair of notes.
One thing that may strike us is that these files produced by the parser are a kind of textual data for music processing algorithms, similar to documents from text corpora. They contain 'letters' and can easily be analyzed. That is why they are given the 'mtxt' suffix (for musical txt), to emphasize their similarity to textual data.
2.3.2 N-grams
N-grams are simply n consecutive tokens [60]. In the case of text, one can distinguish character n-grams and word n-grams. The task here is to retrieve n-grams from the musical data. In this solution a sequence of tuples (relative pitch, relative duration) is obtained from the preprocessor, and three types of n-grams can be extracted from it: one can consider the rhythm only, the melody only, or a combination of both features. N-grams are collected using a gliding window which is shifted along each channel (Figure 2.7):
Figure 2.7 Gliding window
Almost every unigram is taken n times during processing, so the resulting number of n-grams does not depend much on n. What is essential when choosing the n-gram length is that larger values of n increase the number of possible n-grams exponentially, so the number of distinct n-grams grows considerably. The optimal n-gram length is chosen and evaluated in the following sections.
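The window itself is straightforward; a minimal Perl sketch, assuming a channel is already a list of (relative pitch, relative duration) unigrams, could count the three types of n-grams like this:

    use strict;
    use warnings;

    my $n = 3;                                                        # n-gram length
    my @channel = ([4, -1], [3, 0], [5, 1.6], [-1, -1.6], [-2, 1]);   # (pitch, duration) unigrams

    my (%melody, %rhythm, %combined);
    for my $i (0 .. $#channel - $n + 1) {
        my @window = @channel[$i .. $i + $n - 1];                     # gliding window of n unigrams
        $melody{   join ',', map { $_->[0] } @window }++;             # pitch intervals only
        $rhythm{   join ',', map { $_->[1] } @window }++;             # duration ratios only
        $combined{ join ',', map { "$_->[0]/$_->[1]" } @window }++;   # both features together
    }
    print scalar(keys %melody), " distinct melodic ${n}-grams\n";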
Since it is not easy to delimit word equivalents in music, one can assign the meaning of a 'word' to n-grams. In NLP, character n-grams are treated in the same way as single words and are usually used in tasks where word separation is not evident, such as OCR, or when a language without clear word boundaries is being processed. Thai is one of these languages. The sample of Thai shown in Figure 2.8 was taken from the webpage of Mahidol University, Bangkok [26]:
Figure 2.8 The sample of Thai document
This language has the same feature as the unigram representation of music: it consists of sequences of atoms (letters) that are not separated by additional symbols such as whitespaces. N-gram analysis is one of the techniques that work in this situation [52].
2.3.3 Compression of N-grams representation
N-grams for larger values of n may need more space to be stored. However, the most frequent n-grams are those whose internal structure is very simple in this representation. Consider the following situation: two melodies, a simple and a complex one, are given in Figure 2.9:
Figure 2.9 Two sample melodies
The logical representations as 7-grams of both items are as follows:
1. (4,3,5,-1,-2,-4,-3;-1,0,1.6,-1.6,1,0,1),
2. (0,0,0,0,0,0,0;0,0,0,0,0,0,0).
One can see that the second example gives a very simple n-gram representation and, importantly, this kind of n-gram is the most common one. That is why one may try to compress this representation in order to simplify further processing. The following steps were proposed to compress these strings:
1. Replacing delimiters by ‘#’:
1) (4,3,5,-1,-2,-4,-3;-1,0,1.6,-1.6,1,0,1) => 4#3#5#-1#-2#-4#-3#-1#0#1.6#-1.6#1#0#1,
2) (0,0,0,0,0,0,0;0,0,0,0,0,0,0) => 0#0#0#0#0#0#0#0#0#0#0#0#0#0.
2. Removing ‘zeros’:
1) 4#3#5#-1#-2#-4#-3#-1#0#1.6#-1.6#1#0#1 => 4#3#5#-1#-2#-4#-3#-1##1.6#-1.6#1##1,
2) 0#0#0#0#0#0#0#0#0#0#0#0#0#0 => #############.
3. Choosing some additional symbols and successively replacing each pair of symbols: '##'=>'$', '$$'=>'@', '@@'=>'%', '%%'=>'&', so the above sequences are compressed as follows:
1) 4#3#5#-1#-2#-4#-3#-1##1.6#-1.6#1##1 => 4#3#5#-1#-2#-4#-3#-1$1.6#-1.6#1$1,
2) ############# => $$$$$$# => @@@# => %@#.
So the first, complex sequence was not compressed much, but the more frequent pattern was compressed to the short '%@#'. Of course, this algorithm can also be applied in a single pass, but presenting it in steps makes it easier to follow. Applying the algorithm backwards restores the original n-gram representation.
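A possible Perl sketch of this packing step and of its inverse (assuming plain numeric tokens, with zero values becoming empty fields) is given below; it reproduces the '%@#' result from the example above:

    use strict;
    use warnings;

    sub compress_ngram {
        my @values = @_;
        my $s = join '#', map { $_ == 0 ? '' : $_ } @values;   # '#' as delimiter, zeros removed
        $s =~ s/##/\$/g;                                        # successive pair substitutions
        $s =~ s/\$\$/\@/g;
        $s =~ s/\@\@/%/g;
        $s =~ s/%%/&/g;
        return $s;
    }

    sub decompress_ngram {
        my ($s) = @_;
        $s =~ s/&/%%/g;                                         # undo the substitutions in reverse order
        $s =~ s/%/\@\@/g;
        $s =~ s/\@/\$\$/g;
        $s =~ s/\$/##/g;
        return map { $_ eq '' ? 0 : $_ } split /#/, $s, -1;     # empty fields become zeros again
    }

    print compress_ngram((0) x 14), "\n";                       # prints: %@#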
2.4 Related work
2.4.1 Musical Information Retrieval
Since the number of music documents has been rapidly increasing with the development of computers and networking, handling these collections has become a serious problem. Music Information Retrieval (MIR) grew out of Information Retrieval (IR), the field concerned with the structure, analysis, organization, storage, searching and retrieval of relevant information from large textual databases. In the early days of IR (the 1940s) the problem was to manage the huge scientific literature stored in textual documents. The duty of IR is to provide mechanisms for retrieving documents or texts whose information content is relevant to the user's needs [53].
However, along with the development of multimedia technology, the information content that needs to be made searchable has changed in nature, from pure textual data to multimedia content (text, images, video and audio). MIR is nowadays a growing international community drawing upon multidisciplinary expertise from computer science, sound engineering, library science, information science, cognitive science, musicology and music theory [15]. The MIR systems that are operational or in widespread use have been developed using metadata such as filenames, titles, textual references and other non-musical information provided with a piece. Now researchers and developers need to face creating content-based MIR systems. The most advanced waveform-oriented content-based systems currently rely on the idea of a musical fingerprint: a small set of features that can easily be extracted from a piece, with retrieval based on these features [41].
The research most relevant to this thesis is the work done in the field of symbolic music representation. With the pitch and rhythm dimensions quite easily obtainable from such data, one can build a textual string representation of the music and then apply text-based techniques to solve MIR tasks. The main problem is to define the relation between the pitch and rhythm information and the textual representation.
2.4.2 Existing approaches
Various music representations have already been proposed. Buzzanca [8] proposed using symbolic note names, i.e. pitches like c', d', c'' and durations like 'quarter note' or 'half note', instead of absolute values for pitch and duration. However, the task addressed there was the classification of highly prepared themes representing the same type of music. Moreover, these features were then fed into a neural network, so one does not know what was really taken into consideration. This is the main drawback of neural networks: the network gives no feedback on whether our ideas and assumptions are good or bad. Thom ([55], [56]) suggests splitting the piece into bars. She contends that using a fixed-length gliding window could make the problem sparse. This is true; however, the experiments conducted in this work show that modern computers can handle even such a sparse problem with good results and performance. The next example is the
Essen Folksong Collection. It provides a large sample of (mostly) European folksongs
that have been collected and encoded under the supervision of Helmut Schaffrath at the
University of Essen (see [47], [48], [50]). Each of the 6,251 folksongs in the Essen
Folksong Collection is annotated with the Essen Associative Code (ESAC) which
includes pitch and duration information ([6], [7]). In this approach the pitch is given explicitly, while the timing information is more flexible, because durations are expressed relative to the first (or shortest) note of the passage. Another approach was presented in [17]; its authors use the original MIDI pitch representation and absolute time values with a 20 ms resolution.
In contrast to the approaches presented above, MIR researchers prefer an approach similar to the one presented in this work. The first such approach was introduced by Downie [14]. In his work only the pitch was encoded, as an interval between two consecutive notes, and then mapped to letters as follows:
- '@' stands for 'no difference' (perfect unison),
- lowercase letters stand for descending intervals: 'a' is a minor second, 'b' a major second, ..., 'g' a perfect fifth, 'l' a perfect octave,
- uppercase letters stand for ascending intervals: 'A' is a minor second, 'B' a major second, ..., 'G' a perfect fifth, 'L' a perfect octave.
In this approach no information about time (duration) is stored. However, Downie claims that this is sufficient to treat such sequences as text and to perform successful n-gram based retrieval. A more precise approach was presented by Doraisamy [13]. She encoded both the pitch (as an interval to the previous note) and the duration ratio (the ratio of the durations of two consecutive notes); however, she did not take its logarithm. In the work on theme classification by Pollastri and Simoncelli [43], relative pitch and relative duration were also used; however, both dimensions were quantized as shown in Table 2.2:
Table 2.2 Pitch and rhythm quantization
  Pitch                                      Rhythm
  'much higher'   Interval ∈ [3, ∞]
  'higher'        Interval ∈ [1, 2]          'faster'   Ratio < 1
  'same'          Interval = 0               'same'     Ratio = 1
  'lower'         Interval ∈ [-1, -2]        'slower'   Ratio > 1
  'much lower'    Interval ∈ [-3, -∞]
3 Corpus and its features
3.1 Building a musical data corpus
A set of MIDI files by different composers was collected. For better compatibility, only piano works were included in the corpus. Moreover, each piece in the set needs to be well-sequenced, i.e., each channel has to represent only one staff or hand. The aim of this work is to classify musical pieces in terms of authorship, and one has to keep in mind that while composing a piece, the composer thinks about one hand or voice at a time. Choosing the pieces that satisfy this criterion was the only manual preprocessing carried out on the data. The reason for this requirement is that it is very easy to produce a MIDI sequence that sounds good but is a mess inside, so if one is looking for score-like music, it has to be checked whether the MIDI file is well-sequenced in this sense.
Table 3.1 Composer Corpus
Composer Number of pieces total size
Johann Sebastian Bach 109 963kB
Ludwig van Beethoven 44 1399kB
Frederic Chopin 58 1052kB
Wolfgang Amadeus Mozart 17 448kB
Franz Schubert 23 1116kB
A corpus of works by the following classical piano composers was set up: Johann Sebastian Bach, Ludwig van Beethoven, Frederic Chopin, Wolfgang Amadeus Mozart and Franz Schubert. The numbers of pieces and total sizes are given in Table 3.1 above.
When considering music files, one has to point out that there are big disproportions between pieces. Some miniatures are quite tiny, while other forms, like concertos, are very large. It is therefore better to describe the volume of a corpus in bytes rather than in the number of pieces. The second important point is that the differences between composers vary depending on their backgrounds and lifetimes; e.g., the difference between Schubert and Bach is greater than between Schubert and Chopin, since the latter two both lived in the 19th century. The composers' lifetimes are shown in Figure 3.1 to illustrate the relations between them:
Figure 3.1 Composers Timeline
3.2 N-gram features
As shown in the previous sections, one can retrieve musical words from a MIDI corpus and then build the document-term matrix well known from IR. It is large, it is sparse, and it contains all the information about the documents in the corpus; the only problem is to extract the knowledge from it. A document-term matrix is a table whose columns represent documents (each column is one piece) and whose rows represent terms (each row contains information about one n-gram across all documents). The value in a cell describes the relation between a term and a document. Several such relations are in use: in a binary matrix the value is positive if the corresponding document contains the corresponding term, while in the majority of applications the value is the number of occurrences of the n-gram in that document. There are more sophisticated measures, such as tf-idf, but they are beyond the scope of this work, although using more refined measures might give better results.
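In an implementation, such a sparse matrix can simply be a nested hash; a minimal Perl sketch (the piece names and n-gram counts below are made up) is:

    use strict;
    use warnings;

    # Sparse document-term matrix: $dtm{$ngram}{$piece} = number of occurrences.
    my %dtm;
    my %piece_ngrams = (
        'bach_fugue_1.mid'   => ['2,2,1', '2,2,1', '-1,3,0'],
        'chopin_prelude.mid' => ['0,0,0', '2,2,1'],
    );

    while (my ($piece, $ngrams) = each %piece_ngrams) {
        $dtm{$_}{$piece}++ for @$ngrams;                 # count every extracted n-gram per document
    }

    my $row = $dtm{'2,2,1'};                             # one row: the term '2,2,1' in every document
    print "$_: $row->{$_}\n" for sort keys %$row;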
Apart from the n-gram to piece affinity, two types of values are bound to n-grams. The first, quite straightforward one is the probability of the n-gram itself across all documents, which is simply the frequency of the n-gram in the corpus:
    P(w_{k-n+1} \ldots w_k) = \mathrm{Freq}(w_{k-n+1} \ldots w_k)    (3.1)
where P(w_{k-n+1}...w_k) is the probability of the n-gram, Freq(w_{k-n+1}...w_k) is the frequency of the given n-gram in the corpus and w_i is the i-th (pitch, duration) pair.
This approach can be applied in recognition tasks: one can obtain the statistical distribution of n-grams in the corpus and carry out some reasoning based on this information. The second approach is to calculate the conditional probability of the last unigram given the tail (the previous n-1 unigrams). This is also known as the Markov assumption:
    P(w_k \mid w_0 \ldots w_{k-1}) \approx P(w_k \mid w_{k-n+1} \ldots w_{k-1})    (3.2)
It says that the probability of the unigram at the k-th position does not depend on the whole distribution of unigrams in the document, but only on the n-1 preceding tokens. According to Bayes' rule, the conditional probability of each n-gram is as follows:
    P(w_k \mid w_{k-n+1} \ldots w_{k-1}) = \frac{P(w_{k-n+1} \ldots w_k)}{P(w_{k-n+1} \ldots w_{k-1})} = \frac{\mathrm{Freq}(w_{k-n+1} \ldots w_k)}{\mathrm{Freq}(w_{k-n+1} \ldots w_{k-1})} \cdot \frac{\mathrm{Count}_{n-1}}{\mathrm{Count}_n}    (3.3)
where P(w_{k-n+1}...w_k) denotes the probability of the n-gram, P(w_{k-n+1}...w_{k-1}) denotes the probability of the (n-1)-gram, Count_{n-1} is the total number of (n-1)-grams and Count_n is the total number of n-grams. Noticing that the ratio Count_{n-1}/Count_n is the same for all n-grams, one can simplify the relation to:
    P(w_k \mid w_{k-n+1} \ldots w_{k-1}) \approx \frac{\mathrm{Freq}(w_{k-n+1} \ldots w_k)}{\mathrm{Freq}(w_{k-n+1} \ldots w_{k-1})}    (3.4)
This formula can be applied to a music generation task, as shown by Ponsford, Wiggins and Mellish [44] as well as by Rivasseau [46]. They built a system that statistically learns harmonic movements from a given corpus and then generates sample pieces that satisfy this statistical harmony, a task equivalent to PCFG (Probabilistic Context-Free Grammar) generation known from NLP.
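A sketch of the estimate in (3.4), assuming the corpus has already been reduced to a flat list of unigram tokens (the tokens below are placeholders), could look as follows:

    use strict;
    use warnings;

    my $n = 3;
    my @tokens = qw(a b a b c a b a b c a);                       # placeholder unigram tokens

    my (%freq_n, %freq_hist);
    for my $i (0 .. $#tokens - $n + 1) {
        $freq_n{    join ' ', @tokens[$i .. $i + $n - 1] }++;     # n-gram counts
        $freq_hist{ join ' ', @tokens[$i .. $i + $n - 2] }++;     # (n-1)-gram (history) counts
    }

    # P(w_k | history) ~ Freq(history, w_k) / Freq(history)
    sub cond_prob {
        my ($history, $word) = @_;
        return 0 unless $freq_hist{$history};
        return ($freq_n{"$history $word"} || 0) / $freq_hist{$history};
    }

    printf "P(c | a b) = %.2f\n", cond_prob('a b', 'c');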
The main problem in both approaches is that if the value of n is too small, the method cannot tell us anything about the problem, while if n is too large, we get a very large problem on sparse data, requiring a lot of computation, storage and memory.
3.3 Zipf’s law for music
The statistical investigations carried out on the musical data, represented as n-grams, show great convergence with fundamental NLP and IR theories. The n-grams obtained from the corpus satisfy Zipf's law, which constitutes one of the foundations of Information Retrieval. According to this law, the frequency of any term is roughly inversely proportional to its rank; if both dimensions are plotted on a logarithmic scale, the relation should be linear. How this law is satisfied for the Polish poem "Pan Tadeusz" is shown in Figure 3.2:
Figure 3.2 Zipf’s law for text
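For reference, the law can be stated compactly in its standard form (this formulation is general knowledge, not a result of this work), with f(r) the frequency of the term of rank r, C a corpus-dependent constant and s an exponent close to 1 in the classical version:

    f(r) \approx \frac{C}{r^{s}}, \qquad \log f(r) \approx \log C - s \log r

so in log-log coordinates the points lie approximately on a straight line of slope -s, which is what Figures 3.2 and 3.3 display.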
According to the investigations conducted in this work, Zipf's law is also satisfied for the n-gram music representation, as presented in Figure 3.3. The investigation was conducted independently for the three types of n-grams defined in chapter 2:
Figure 3.3 Zipf’s law for music corpus
In the case of music, the rhythmic and melodic curves behave slightly differently, i.e., rhythm has a steeper slope; however, the regularity is preserved.
3.4 Entropy analysis
3.4.1 Information entropy
The second investigation carried out on this data was to identify the position of the most important 'words' of musical content. It turned out that the situation is similar to the one for textual data. The most frequent n-grams behave like stop words: they occur in every piece with almost the same probability. The least frequent n-grams, which occur only a few times, make up the majority of the lexicon and do not have any positive effect on tasks such as retrieval or classification. The most important n-grams lie in the middle of the rank axis.
A tool that sieves the profiles, extracting only those n-grams that give more information about a certain class (or classes) than the others do, can play an important role in showing the correspondence between text and music. These 'key words' could be separated from both 'noise' and 'stop' words using data-mining tools, the same as those used in text processing, such as information gain or similar methods. In order to find them, the following experiment was conducted; first, however, the term 'entropy' should be explained.
Information entropy is a measure of the uncertainty associated with a random variable. It can be interpreted as the shortest average message length, in bits, that can be sent to communicate the true value of the random variable to a recipient, so entropy can also be interpreted as an amount of information. The information entropy of a set X containing n events that occur {x1...xn} times in the sample is defined as:
    H(X) = \sum_{i=1}^{n} p(x_i) \log_2 \frac{1}{p(x_i)} = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)    (3.5)
where the probability of each event is given by:
    p(x_i) = \frac{x_i}{\sum_{k=1}^{n} x_k}    (3.6)
The value of the entropy is higher if the distribution of values in the set is flatter. The expression is undefined for p(x_i)=0, but in this case it is assumed that the corresponding element of the sum is 0. The entropy of an empty set (i.e., one containing no events) is also assigned the value 0.
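A direct Perl sketch of formulas (3.5) and (3.6), taking a list of occurrence counts and returning the entropy in bits, with zero counts and the empty set contributing 0 as stated above:

    use strict;
    use warnings;
    use List::Util qw(sum);

    sub entropy {
        my @counts = grep { $_ > 0 } @_;           # zero counts contribute nothing
        my $total = sum(@counts) or return 0;      # empty set => entropy 0
        my $h = 0;
        for my $x (@counts) {
            my $p = $x / $total;                   # p(x_i) = x_i / sum_k x_k
            $h -= $p * log($p) / log(2);           # -sum p(x_i) log2 p(x_i)
        }
        return $h;
    }

    printf "%.2f\n", entropy(5, 4, 5);             # about 1.58, as for term1 in class 2 of Table 3.3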
Entropy analysis is conducted on the data arranged in the document-term matrix described above. Table 3.2 contains examples of various types of terms, to make the reasoning in the next paragraphs easier to follow. It is a sample document-term matrix containing n documents that belong to N classes and k terms:
Table 3.2 Sample document-term matrix
           class 1       class 2       class 3 ... class N-1   class N
           d1  d2  d3    d4  d5  d6    d7   ...   dn-3         dn-2  dn-1  dn
 term1      0   0   1     5   4   5        ...0...                2     0    0
 term2      2   3   1     4   3   3      ..2..5..4..              2     3    2
 term3      0   0   1     1   1   0        ...0...                0     0    1
 ...       ... ... ...   ... ... ...        ...                  ...   ...  ...
 termk      1   0   0     0   2   0        ...0...                0     0    0
Table 3.3 Class entropies calculations
           class 1   class 2   class 3 ... class N-1   class N   Entropy (N=5)
 term1       0        1.58          ...0...               0           0
 term2       1.46     1.57     ..1.26..1.12..1..          1.56        2.20
 term3       0        1             ...0...               0           0
 ...        ...       ...            ...                  ...         ...
 termk       0        0             ...0...               0           0
Table 3.3 contains the precomputed entropies within each class, with additional information (last column) giving the entropy of the entropy values in a given row (assuming the number of classes equals 5).
3.4.2 Term ranking
A feature that is a good discriminator between classes needs to occur consistently across the documents of a certain class (i.e., the entropy of this term within that class should be high), but has to be rare in the documents that do not belong to that class (i.e., the entropy of the per-class entropies should be as small as possible). Thus, since the maximum entropy over N elements equals log2 N, the rank of each term is defined as:
    R(i) = \max_{k \in 1..N} \bigl( H(i,k) \bigr) \cdot \bigl( \log_2 N - H(i) \bigr)    (3.7)
where H(i,k) is the entropy of the i-th term inside the k-th class and H(i) is the entropy of all its per-class entropies (the last column in Table 3.3). The rank should be large if the term is a good discriminator between classes and low if it does not discriminate well. The limiting value for being a key word is log2 N, where N is the number of classes (for N=5, log2 5=2.32); it corresponds to the case where there are only two occurrences of the term and they happen to fall into the same class. The probability of this event is 1/N.
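The rank of formula (3.7) can be sketched in Perl as follows (each class is passed as a list of per-document counts; the example values correspond to term1 of Table 3.2):

    use strict;
    use warnings;
    use List::Util qw(sum max);

    sub entropy {                                    # H from (3.5)/(3.6); zero counts and empty sets give 0
        my @x = grep { $_ > 0 } @_;
        my $total = sum(@x) or return 0;
        return sum(map { my $p = $_ / $total; -$p * log($p) / log(2) } @x);
    }

    # R(i) = max_k H(i,k) * (log2 N - H(i)), where H(i) is the entropy of the per-class entropies
    sub rank {
        my @classes = @_;                            # one array reference of counts per class
        my @h = map { entropy(@$_) } @classes;
        return max(@h) * (log(scalar @classes) / log(2) - entropy(@h));
    }

    printf "%.2f\n", rank([0,0,1], [5,4,5], [0,0,0], [0,0,0], [0,0,0]);   # about 3.7, cf. term1 in Table 3.4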
Table 3.4 Sample term ranking
  ...        ...
  term1      3.67    } Key-words
  ...        ...
  term3      2.32    } Random pairs
  ...        ...
  term2      0.19    } Stop-words
  ...        ...
  termk      0       } Noise-words
Listing all the terms sorted by rank in descending order (Table 3.4) reveals the following groups:
1) R(i) > log2 N ('key words'),
2) R(i) = log2 N ('random pairs'),
3) R(i) < log2 N ('stop words'),
4) R(i) = 0 ('noise words').
The first group contains words that bring the most information about their classes. The second one is the random-pairs group described above; it limits the analysis. The terms of the third group bring less information than random words: these are stop words, which occur equally frequently in every group and confuse the classification more than they help it. The fourth group represents words that occur at most once in every group. These words are void, but they form the most numerous group in the corpus, so discarding them saves computational time and storage. The distribution of these groups is shown in Figure 3.4 (the vertical axis shows the proportion of each group, the horizontal axis the log rank of each term assigned when calculating Zipf's law). One can see that, as expected, key words occur in the middle of the scale (each group is labelled with the number corresponding to its position in the list shown above):
Figure 3.4 Entropy distribution
This method can be used to sieve out noise in classification tasks. Experiments conducted on text classification show that the error rate decreased 3-4 times when this method was used to sieve the training data [21]. The method may also work for music classification [63].
3.5 Singular value decomposition
SVD (Singular Value Decomposition) analysis can also be performed in order to find the most frequent patterns in the piece classes (composer groups) and sift out the noise. This is an unsupervised method of analyzing the data, so no additional information about membership in a particular composer's group is needed.
Singular value decomposition, closely related to PCA (Principal Component Analysis) in computer science, consists in transforming a matrix to a new basis so that the values along successive dimensions are as small as possible. In general, having almost 200,000 features (distinct n-grams) and 250 pieces, 250 dimensions are enough to distinguish precisely between those pieces (the simplest solution being that each dimension corresponds to one document vector), but using SVD we can simplify this situation. In other words, the aim of the method is to reach a state where each consecutive dimension carries less and less information, so that one can drop some dimensions while losing only insignificant information. What remains is a few dimensions that can be visualized and analyzed. This is an unsupervised method, so no information about a piece's composer needs to be given. This has the advantage that, if it succeeds, one may assume that after adding more pieces and more composers the results will still be satisfactory, i.e. every new composer should be detected and recognized. The main drawback of the method is that one does not provide any information to the program, so we have to rely on its insight (the same problem as with the neural networks described in the introduction).
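In standard notation (stated here only to fix ideas; it is textbook material rather than a result of this work), the document-term matrix A is factored and then truncated to its k largest singular values:

    A = U \Sigma V^{T}, \qquad A_k = U_k \Sigma_k V_k^{T}

where the columns of U and V are orthonormal, \Sigma is diagonal with the singular values in decreasing order, and A_k is the best rank-k approximation of A in the least-squares sense; the k retained dimensions are the ones examined below.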
A program that calculates the SVD of the document-term matrix built from the corpus was implemented, based on the algorithm presented in [45]. The visualization was written using OpenGL.
The results produced by the program were not satisfying. Figure 3.5 shows the eigenvalues of the resulting dimensions:
Figure 3.5 Eigenvalues for the corpus
The first k values, which carry half of the useful information, are taken into consideration. One can see that the first 20 dimensions contain more than half of the information; by using them we might therefore be able to do successful classification, but still half of the information is lost and we do not actually know what the dimensions really indicate. In order to find information about each composer's contribution, the pieces were visualized in 3D space in an attempt to find the dimensions responsible for composer attribution. The investigation shows that there is a certain regularity, but it is not significant, so one might conclude that an unsupervised approach is not what we are looking for in the composer recognition task. Moreover, twenty dimensions are still too many for a 3D visualization, which requires three-element vectors; however, one can visualize different triples of dimensions. The following colors are used in the figures below: Bach – red, Beethoven – green, Chopin – blue, Mozart – yellow, Schubert – cyan. The first screen (Figure 3.6) shows dimensions 1, 2 and 3:
Figure 3.6 SVD dimensions: 1, 2, and 3
Apart from the grouping of red dots on the right (Bach), no particular information about composership is stored in the most 'informative' dimensions.
Figure 3.7 SVD dimensions: 1, 4, and 6
In Figure 3.7 (dimensions 1, 4, and 6) one can observe a distinction between Bach (red dots) and Beethoven (green dots).
Figure 3.8 SVD dimensions: 4, 7, and 8
In Figure 3.8 (dimensions 4, 7, and 8), Bach is visibly grouped in the center of the screen.
It turned out that composership is not a primary feature of a piece, because the very first dimensions do not distinguish between composer groups. On second thought this may be obvious: the most salient features of a piece are mode, length, tempo and form, and these are what the analysis brings out. Some groups are revealed by deeper analysis, but they are observed only after assigning colors to the pieces, not before, which defeats the purpose of an unsupervised method.
4 The algorithm for composer attribution
4.1 Related work
A system for authorship attribution of texts was built by Keselj, Peng, Cercone and Thomas [29]. They reported that a successful authorship attribution method can be applied to text using an n-gram based statistical approach from natural language processing, with accuracy reaching 100%. The method is conceptually very simple and might be successfully applied in other fields, such as music.
Pollastri and Simoncelli [43] built a theme recognition system using Hidden Markov Models and report 42% accuracy among 5 composers. This is not a satisfactory figure; however, they claimed, citing other psychological research, that the human ability to recognize themes is about 40% even for professionals. They also used n-grams, as described in the previous section, and their research was done on monophonic themes only.
Ponsford et al. [44] and Rivasseau [46] investigated unsupervised learning of harmony in order to produce a set of grammar rules and generate random harmonic structures. Their reasoning was clear, although similar results could be obtained by applying hand-written probabilistic context-free harmonic grammar rules, which of course would not be an unsupervised approach. Nevertheless, their results show that music can be investigated using statistical analysis.
A lot of work has been done on recognizing various aspects of waveform data using different methods ([3], [18], [35]), but this field is still not investigated enough and the results are quite poor. The main problem is that we still cannot interpret waveform data well, and without this insight such work remains a groping in the dark.
Kranenburg and Backer [30] conducted interesting research on classifying the composer using a fixed set of features and machine learning methods, but they did not show the accuracy of their system clearly. The main drawback of their approach is that it relies on direct expert knowledge to run the system. Nowadays, most systems try to learn by themselves, because the amount of data is too large to be processed by people, yet fortunately large enough for computers to extract useful knowledge and patterns from it.
A lot of work has also been done in the field of psychology and psychoacoustics ([11], [64]). It shows that people are not especially good at recognizing music in general, but are very fast at recognizing well-known pieces.
A successful style recognition system was built by Buzzanca [8]. He used neural networks and reports 97% accuracy, but "highly prepared data" were used in this solution, meaning that themes were selected from pieces rather than whole pieces being classified. With that in mind, the solution is not fully automated, because it involves long expert preprocessing, which is not the case in this thesis. A second issue is that the use of neural networks cannot explain such behavior and results: it gives no insight into the features that distinguish different composers. The system may work, but it will not increase human knowledge in this area. In the n-gram based approach one assumes that the order of notes matters, and afterwards one can inspect the profiles and check which features (sequences of notes) characterize a composer's contribution.
4.2 Algorithm
According to Wikipedia, document classification or categorization is a problem in information science: the task is to assign an electronic document to one or more categories based on its contents. Document classification tasks can be divided into two sorts: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification of documents, and unsupervised document classification, where the classification must be done entirely without reference to external information [12]. The problem of assigning composership to musical pieces is such a task. Musical pieces are electronic documents that can be processed, and useful information can be obtained from them; the previous sections described how this can be done. It was also shown that several regularities that occur in texts hold for music as well, so NLP techniques may be applied to musical content.
In the previous section it was shown that an unsupervised approach may not work for composer recognition. Unsupervised methods are usually less effective than supervised ones, which is why the following supervised method was tried.
In order to classify musical documents, the following steps are performed:
1. splitting the corpus into two sets, training and testing,
2. building profiles for each composer using information from the training corpus,
3. building a representation of each piece from the testing corpus,
4. comparing each of the testing documents to each profile,
5. passing judgment on the resulting assignments.
In the following sections each of these steps is briefly described.
4.2.1 Testing and training set
The corpus was split into two sets. If a composer's subset was big enough, 10 items were drawn out of it; for smaller subcorpora, the number of test items was chosen so that the remaining training part was still big enough to train the classifier. The result of the split is presented in Table 4.1:
Table 4.1 Training and testing set split
Composer Training Set Testing Set
J. S. Bach 99 items, 890kB 10 items, 73kB
L. van Beethoven 34 items, 1029kB 10 items, 370kB
F. Chopin 48 items, 870kB 10 items, 182kB
W. A. Mozart 15 items, 357kB 2 items, 91kB
F. Schubert 18 items, 863kB 5 items, 253kB
The resulting number of testing items is not large enough for very precise algorithm evaluation; however, it is enough to show whether the method works. Building the corpus is a tough problem because, as mentioned in the previous chapter, one has to choose only those MIDI files that are well-sequenced, and only a minority of the MIDI files available on the web satisfy this requirement.
4.2.2 Building profiles
In the next step, each item from the training corpus is used to build the composers' profiles. Each document is preprocessed and three sets of n-grams are extracted: melodic, rhythmic and combined. Then the n-grams in each group are counted, and each n-gram is kept together with its number of occurrences. The process of building such profiles is illustrated in Figure 4.1:
Figure 4.1 Building profiles
Trigrams are used in the example. In this case each n-gram occurs once; however, when whole pieces are concerned, some n-grams are more frequent than others. Each profile is a hash table where n-grams are keys and numbers of occurrences are values. As a result, one obtains three independent profiles, which are analyzed separately in the next steps.
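Joining the per-piece hashes into a composer profile then amounts to summing the values key by key; a minimal Perl sketch (the n-gram keys and counts below are invented, and aging is left to section 4.3.2) is:

    use strict;
    use warnings;

    # Add one piece's n-gram counts into a composer's profile (both passed as hash references).
    sub add_to_profile {
        my ($profile, $piece) = @_;
        $profile->{$_} += $piece->{$_} for keys %$piece;
    }

    my %chopin_melodic;                                            # composer profile being trained
    add_to_profile(\%chopin_melodic, { '2,2,1' => 3, '-1,3,0' => 1 });
    add_to_profile(\%chopin_melodic, { '2,2,1' => 1 });
    print "$_ => $chopin_melodic{$_}\n" for sort keys %chopin_melodic;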
Then, three profiles are created for each composer. Each profile is a vector of features (n-grams) obtained by joining all profiles of the same type from all pieces by that composer; the process of joining profiles is described in the algorithm details section. As already mentioned, some n-grams are quite frequent in all documents regardless of composership, while others occur only a few times. There are also n-grams that are true composer indicators, and these are what make the algorithm work as intended.
Building composers’ profiles is the last training step. At this point the entire
knowledge is stored in profiles and one can sell the system, if it is a commercial project
or share if it is a free distribution.
4.2.3 Building piece representation
The next step of the algorithm is logically the first step of the end-user part. It works similarly to the previous, profile-building part: the piece being recognized is converted into the same form as the original profiles, i.e. it is also represented as a vector (or hash table) of n-gram occurrences, and these representations are then compared.
4.2.4 Profiles comparison
As noted in the previous sections, both the composer profiles and the profile of each analyzed piece have the same structure: a multidimensional vector. There are many methods of comparing such representations [9]; they are widely developed in NLP and IR, since documents are represented in the same way, as vectors. The most popular measure used in IR is the cosine similarity. It is simply the cosine of the angle between the vectors representing the document and the composer's profile. Following the definition of the cosine, it returns 1 if the vectors are parallel (i.e. equal up to scale), 0 if the vectors are orthogonal (the documents contain disjoint sets of words), and a value in between otherwise, according to the following formula:
    \mathrm{CosSim}(\vec{x}, \vec{y}) = \frac{\sum_i x_i \cdot y_i}{\sqrt{\sum_i x_i^2} \cdot \sqrt{\sum_i y_i^2}}    (4.1)
where \vec{x} and \vec{y} are document vectors and x_i and y_i are their i-th components. Of course, both vectors must have the same dimensionality, so if a feature does not exist in a profile, i.e. there was no such term in the document, the value 0 is assumed.
Cosine similarity has many advantages. It is sensitive to vectors that contain values in the same proportions, not only the same values, which is a frequent situation when comparing single documents against profiles of whole subcorpora. However, it behaves unfairly if some dimensions are incomparably larger (i.e. contain larger values) than others: in that situation the cosine similarity measure is biased by the most frequent terms. This is undesirable, because the most frequent terms, known as stop words, do not contain useful knowledge and mislead the assessment.
Another method of comparing profiles is therefore proposed here. It is a modification of the measure described by Keselj, Peng, Cercone and Thomas [29], which was used for comparing the profiles of text authors:
    \mathrm{Sim}(\vec{x}, \vec{y}) = \sum_i \left( 4 - \left( \frac{2\,(x_i - y_i)}{x_i + y_i} \right)^{2} \right)    (4.2)
This similarity measure consists in computing, for each dimension separately, the difference of the values (x_i - y_i) relative to their mean (x_i + y_i)/2, and then summing the resulting components.
This method avoids the situation in which incomplete profiles bias the verdict in their favor. Every n-gram may increase the final score by a small value between 0, if one of the features is missing from a vector, and 4, if the values are equal. Graphs showing the possible component values depending on the feature values are shown in Figure 4.2, where the maximum value of 4 is marked by a dashed line. The measure is, however, susceptible to the situation where the vectors are identical up to scale, which happens when a piece contains far fewer n-grams than a profile. Still, when comparing the same piece against different profiles, the results scale with the piece size but remain comparable across all profiles (provided the profiles are reasonably balanced).
Figure 4.2 Measure components for different n-grams values
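A Perl sketch of both measures over sparse hash profiles (formulas (4.1) and (4.2)), with missing keys treated as 0 as described above and with invented example counts:

    use strict;
    use warnings;

    sub all_keys {                                     # union of the keys of several hashes
        my %k;
        $k{$_} = 1 for map { keys %$_ } @_;
        return keys %k;
    }

    sub cosine_sim {                                   # formula (4.1)
        my ($x, $y) = @_;
        my ($dot, $nx, $ny) = (0, 0, 0);
        $dot += ($x->{$_} || 0) * ($y->{$_} || 0) for all_keys($x, $y);
        $nx  += $_ ** 2 for values %$x;
        $ny  += $_ ** 2 for values %$y;
        return ($nx && $ny) ? $dot / (sqrt($nx) * sqrt($ny)) : 0;
    }

    sub profile_sim {                                  # formula (4.2), the modified Keselj-style measure
        my ($x, $y) = @_;
        my $sim = 0;
        for my $k (all_keys($x, $y)) {
            my ($a, $b) = ($x->{$k} || 0, $y->{$k} || 0);
            next unless $a + $b;
            $sim += 4 - (2 * ($a - $b) / ($a + $b)) ** 2;   # 0 if one value is missing, 4 if equal
        }
        return $sim;
    }

    my %piece   = ('2,2,1' => 2,  '0,0,0' => 5);
    my %profile = ('2,2,1' => 40, '0,0,0' => 90, '-1,3,0' => 7);
    printf "cosine: %.3f  proposed: %.2f\n",
        cosine_sim(\%piece, \%profile), profile_sim(\%piece, \%profile);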
4.2.5 Final judgment
For each piece one obtains 3n values, where n stands for the number of analyzed composers. There are many possible judgment algorithms that could be applied in order to find the most appropriate choice; since this concerns classification in general rather than composer classification specifically, it was decided not to tune this parameter. The following steps are applied:
1. sum up all the similarities for the profiles of each composer,
2. sort the sums in descending order,
3. take the topmost composer (or composers) as the result.
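A sketch of these three steps in Perl, with the per-profile similarity values stored in a nested hash (the numbers are taken from three of the rows of Table 6.1 purely as an illustration):

    use strict;
    use warnings;
    use List::Util qw(sum);

    my %sim = (                                        # similarity of one piece to each profile type
        Chopin    => { melodic => 86.8, rhythmic => 25.1, combined => 10.9 },
        Bach      => { melodic => 62.4, rhythmic =>  8.2, combined =>  6.4 },
        Beethoven => { melodic => 43.2, rhythmic => 17.2, combined => 11.0 },
    );

    my %total   = map { $_ => sum(values %{ $sim{$_} }) } keys %sim;   # step 1: sum per composer
    my @ranking = sort { $total{$b} <=> $total{$a} } keys %total;      # step 2: sort descending
    print "verdict: $ranking[0]\n";                                    # step 3: topmost composer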
Sample calculations based on a real example are shown in Table 6.1 (on page 71). The example presented there also shows another aspect of this algorithm: by construction it evaluates the rhythmic and melodic affinity of each piece to each composer. Analyses of the algorithm's behavior are presented in the evaluation chapter.
4.3 Algorithm details
Before applying the algorithm one has to specify several other parameters that are not as essential to the algorithm as, say, the comparison measure, but are inherent in its 'runtime mode' (as opposed to its logical, paper specification).
4.3.1 N-gram length
As described in the n-gram features section, the use of n-grams follows from the Markov assumption. N determines the 'thought horizon' of the algorithm: if it is too small, the n-grams contain little information about the context of a note; if it is large, the algorithm has to be fed with a larger and larger training corpus, otherwise the profiles suffer from lack of data, most n-grams occur only a few times and the recognition system is unlikely to work. The n-gram size was not limited in this work; however, it turned out during the investigations that the n-gram length should not exceed several.
4.3.2 Aging factor
The aging factor consists in scaling down the profile values during learning; this operation is repeated before each training document is added. Aging may lose some information coming from the very first pieces, but it helps generalize the information stored in the training pieces. This is a useful mechanism when corrupted data coming from some pieces could damage the accuracy of the profiles. To understand it better, consider the following example. Assume there are two types of incoming n-grams: a good but weak one, occurring twice in every piece, and a misleading one, occurring five times less frequently but with counts five times larger. The training process on these data is shown in Figure 4.3:
Figure 4.3 Aging example
It turns out that the stronger but less frequent event is remembered with an almost four times smaller weight. Without aging, both symptoms would end up with the same value of 20.
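In code, aging is just a multiplication of the whole profile by the aging factor before each new piece is added. A minimal Perl sketch of the mechanism follows; the factor and the toy counts are chosen arbitrarily, so the final numbers are not those of Figure 4.3:

    use strict;
    use warnings;

    my $aging = 0.9;                            # illustrative aging factor
    my %profile;

    sub train_piece {
        my (%piece) = @_;
        $_ *= $aging for values %profile;       # age the existing profile first
        $profile{$_} += $piece{$_} for keys %piece;
    }

    # 'good' n-gram: 2 occurrences in every piece; 'misleading': 10 occurrences in every 5th piece
    for my $i (1 .. 10) {
        train_piece(good => 2, ($i % 5 == 0 ? (misleading => 10) : ()));
    }
    printf "good: %.1f  misleading: %.1f\n", $profile{good}, $profile{misleading};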
4.3.3 Normalization
Normalization is the process of equalizing the values in the profiles to ensure that no composer gains an advantage from the fact that his profiles are more complete than others. Whether to normalize was also a question examined in this research. One could also consider normalizing the piece profiles that are compared against the composers' profiles; however, it was decided not to do so. As a result, smaller pieces have smaller cumulative similarities but, as already mentioned, the ratios between composers remain the same for a given piece.
4.3.4 Profiles size
The size of the profiles reflects the non-functional limitations of the system. The number of distinct n-grams for n=6 reaches 200,000 even on this not-so-large corpus, and it turns out that many of them occur only a few times. Current machines can easily handle such data, so this was not treated as a problem in itself; the parameter was introduced mainly to observe the problems and regularities that may occur when applying such a system to much larger datasets, and to observe how the profile size affects other parameters such as n-gram length, normalization or the aging factor. The final project does not limit the size of the profiles. Looking ahead, the final release of the system uses less than 100 MB of RAM and the processing delays are acceptable. However, limiting the profiles with a more sophisticated method, such as entropy sieving, could reduce the resource requirements of the system without a significant drop in its performance.
5 Composer recognition system
5.1 Functionality
The implementation schema is given in Figure 5.1, where squares stand for the units that analyze the data. The preprocessor is implemented in C++ and converts input MIDI files into the unigram representation, denoted as Musical TXT (MTXT) files because their linear representation makes them correspond to written text. The trainer creates profiles from the training data, and the profiles are then stored in external files. The classifier classifies every testing file in three ways, using the three types of n-grams (rhythmic, melodic and combined) independently, and then merges the results into a final verdict, as explained in the previous section.
The system should also be able to store profiles in external files and to save and load them, so that they can be used in different locations and at different times. It should also provide the ability to create composers' profiles for the various system parameters described in the previous chapter.
Figure 5.1 System scheme
5.2 Project
Some additional system features should be analyzed before implementation. The following sections briefly describe how the composers' profiles and the MIDI corpus are stored. Other aspects, such as the system structure, are described in the implementation section.
5.2.1 CDB file
In order to store the information about a single set-up, a composer database (CDB) file type was defined. All the required information is stored in a zip archive (renamed to .cdb) containing the following files:
1. settings.dat – file containing information about database settings, n-gram
length, aging factor and profiles’ size,
2. composers.dat – file with a list of composers containing composers names and
their IDs,
3. the following files are stored for each $composer from composers.dat:
3.1. $composer.piecelist – list of trained pieces with various additional
meta-information,
3.2. $composer/* – MIDI files associated with certain composer. No
additional information about MTXT files is stored since it is quick and easy
to obtain n-gram’s representation directly from the MIDI file,
3.3. $composer.$type.profile – profile hashes for each $type from the
following list: rhythm, melody and both, containing information about each
profile type for each composer.
5.2.2 Importing MIDI files
One way of managing external files is to leave them alone and make users keep the files in their own filesystem; a user would then have to browse for them each time they want to add them to a database, classify them or remove them from a database. This is a simple approach, but maintaining data resources this way might not be convenient. The other approach is to store the MIDI files inside the program structures for easy management. It was decided to make the user import MIDI files into the program before they are used for training or testing; the MIDI files are then copied into the system's structures at import time. This requires the system to store and maintain the whole set of MIDI files; however, MIDI files are small, so the system does not suffer from large storage (disk) requirements. Moreover, the user is freed from the problems of moving pieces in and out of profiles and across different databases, since the files are available in the program all the time.
5.3 Implementation
The composer recognition system is implemented in the Perl scripting language with the Tk GUI distributed with the Perl package. I used Perl version 5.8.8.820 distributed by ActiveState [1] under the Artistic License [2]. Additionally, the MIDI parser is implemented in ANSI C, so it can be compiled with many popular compilers; I compiled it with the compiler provided with Visual Studio 2005 on Win32 and with gcc on UNIX (for tests). I used the zipping package provided by Info-ZIP [28], distributed freely for any purpose.
5.3.1 Packages overview
The system is divided into three subsystems and a runtime script. Their structure is shown in Figure 5.2:
Figure 5.2 System structure
1. Engine – subsystem that provides the fundamental system functionality, such as creating profiles, managing databases and passing judgments,
2. UI – subsystem responsible for graphical user interaction, containing business procedures for all functionalities provided by the Engine API,
3. Utils – subsystem containing functionalities that are not inherent to the system but vital for it, such as the OS API, packager API and preprocessor API,
4. Run.pl – the runtime script.
All subsystems are described in the following sections.
5.3.2 Engine subsystem
Engine subsystem contains the following modules (Perl packages):
5.3.2.1 DataConnectionMgr.pm
This package provides an API for loading, saving and creating new CDB (composers' database) files. While loading and creating a database, it informs the user about the progress and refreshes the information in other packages.
5.3.2.2 ListMgr.pm
This package handles external MIDI files, which can then be added to profiles or classified against existing profiles. It was assumed that external MIDI files should be imported into the system even if they are not included in any profile; this enables easy transfer of files between different profiles without accessing the local filesystem. The ListMgr package is responsible for adding, removing and loading these files and provides functions for accessing them.
5.3.2.3 ProfileMgr.pm
This package handles information about the current configuration, such as composer information, created profiles and system parameters. It loads the current configuration from CDB files and saves it, and allows adding and removing composers as well as adding and removing files from profiles. The ProfileMgr package also contains functions that compare profiles using the comparison algorithm presented in the previous chapter.
5.3.2.4 MidiProcessor.pm
It contains a set of functions that retrieve n-grams from MIDI files for other packages. It also implements the n-gram packing algorithm.
5.3.3 Utils subsystem
Utils subsystem contains the following modules (Perl Packages and programs):
5.3.3.1 color.pm
This package implements color operations, generates colors of desired
parameters for all UI packages.
5.3.3.2 debug.pm
This package handles information about the environment of the application as
well as provides some useful information about system performance and status.
5.3.3.3 hashop.pm
It provides basic operations on the hashes that represent document vectors and composers' profiles. These include adding and subtracting two vectors (for adding and removing files from profiles), multiplying a vector by a scalar (for aging) and various vector-limiting functions (for limiting profile sizes).
5.3.3.4 refresh.pm
It provides a functionality, which ensures that all application windows are
refreshed.
5.3.3.5 settingsMgr.pm
It stores information about the opened database between launches of the system. It also provides a function that creates a composer ID unique within the system.
5.3.3.6 system.pm
This package contains functions that access the operating system; the actions it takes depend on the operating system the program is running on. It provides various file-listing functions, storing and loading document vectors on disk, archiving and restoring data from zip files, copying files and managing folders.
5.3.3.7 zip.exe and unzip.exe
Tools that enable packing and unpacking data for storage purposes when the program is run on Windows.
5.3.3.8 preprocessor.cpp and preprocessor.exe
Preprocessing programs for retrieving unigrams described in chapter 2.
5.3.4 UI subsystem
UI subsystem contains these modules, usually containing widget creation
methods:
5.3.4.1 AddToDatabaseWindow.pm (Figure 5.3)
This package creates the dialog window for adding external MIDI files to the profiles specified in the Composer field. The Save source option indicates whether to include the original MIDI files in the CDB file. Choosing this option results in a larger file size but allows retrieving the MIDI files later or performing conversion operations on composers' profiles. Pieces are added in the same order as they appear on the list; buttons on the left allow changing the order, which matters if aging is applied.
Figure 5.3 Adding to database window
5.3.4.2 ComposerWindow.pm (Figure 5.4)
This is a small modal window that allows adding new composers to the database. After typing the composer's name, one can press Enter to accept the new composer, or Tab to add this one and then another – an empty Add new composer window will then pop up.
Figure 5.4 Adding composer window
5.3.4.3 DBSettingsWindow.pm (Figure 5.5)
This window pops up when the New database action is fired. It allows choosing the various system parameters, such as the n-gram length, profile sizes or the aging factor.
Figure 5.5 New database settings window
5.3.4.4 MainWindow.pm (Figure 5.6)
This package builds the main window of the application, which is split into two main parts. The left part contains the current database view with composers and already trained pieces (with information about the age of each piece). Colors indicate piece status: red is used when no MIDI source is available for a piece, green when the MIDI piece is stored within the CDB file. The right side lists the MIDI resources, i.e. the files imported into the system; for each piece, information about the channels containing notes, together with the number of n-grams in each channel, may be shown. Each list is followed by a set of function buttons.
Figure 5.6 Application main window
5.3.4.5 Message.pm
This package is responsible for popping up messages to the user.
5.3.4.6 RecognizeWindow.pm (Figure 5.7)
This package creates the recognition window. It performs recognition tasks on the selected pieces and presents the results to the user. Each piece is assigned similarity values to the composers, followed by a final verdict. The user is informed about the progress by progress bars at the bottom of the window.
Figure 5.7 Application: recognizing window
5.3.4.7 DBTree.pm
It holds database list (left pane on main window). It creates the list widget and
manages list operations inside the list such as refreshing, adding and removing
composers and files, retrieving selected items.
5.3.4.8 FileTree.pm
The purpose of this package is the same as DBTree.pm but it is used for
managing external files list, on the right side of main window.
5.3.4.9 Menu.pm
This package creates a menu and binds actions to menu items.
5.3.4.10 progress.pm
This package is responsible for informing the user about the progress of actions
connected with CDB operations such as loading, saving databases. It creates progress
bar at the bottom of main window.
- 69 -
5.3.4.11 WindowList.pm
It manages the proper refreshing of windows. It allows Utils::refresh package to
properly refresh all currently opened windows while various time-consuming
operations.
5.3.4.12 graphicd.pm and icons folder
It is responsible for loading all required icons and provides easy access to them.
5.3.4.13 cmd.pm
This package contains all the business procedures of the application. All actions in the menu, the application main window and other windows point to functions in this package. The cmd.pm package contains business scenarios for the following actions:
1. creating and removing composers,
2. creating, opening, saving and 'saving as' databases,
3. adding a file, adding a folder, and removing items from the external MIDI files list,
4. moving files into and out of the database from the MIDI files list,
5. the recognition action,
6. closing the application.
5.3.5 Running script – Run.pl
It runs the application. If run on Win32, it releases the console and lowers the program's priority class (for better overall system responsiveness during complex and time-consuming operations). It sets the application's environmental settings, creates the main window of the application and loads the most recently used database.
5.3.6 Testing plug-in
The system was also provided with a script for algorithm testing. It was designed to allow applying different parameters to the same data, running the tests automatically and writing the results to an external file, since deep analyses of the system are time-consuming.
6 Analysis of the results
A set of tests of the system was conducted using the corpus described in the previous chapter, for various parameters. The tested parameter space includes:
1. normalization: with or without,
2. similarity measure: cosine similarity or the proposed similarity,
3. n-gram length: 2, 3, 4, 5, 6, 7, 9, 12,
4. profile size: 100, 250, 500, 1000, 2500, 5000, 10000,
5. aging factor: 0.7, 0.8, 0.85, 0.9, 0.96, 0.99 and without aging.
The accuracy, i.e. the ratio of the number of correctly assigned pieces to the total number of tested files, was used as the evaluation measure. The small number of available testing files does not allow applying more sophisticated measures, such as precision or recall, because they would not be reliable enough; nevertheless, the accuracy is good enough to show whether the algorithm works.
Testing consists in repeating all the steps (training, classifying) for various parameter combinations. Since I decided to test the result space exhaustively, these tests sometimes took many days, even though they were run on a 4-processor Sparc Solaris machine.
6.1 Results interpretation
Running the experiments with a program that outputs all intermediate results allows looking inside the algorithm's behavior. The judgment is clear-cut, because assigning pieces to groups is well defined, i.e. one knows for certain who the author of a piece is; many other problems, for instance text classification by topic, do not have this certainty. Of course, when talking about a certain composer we usually have his style in mind: Bach wrote his compositions in Bach's style and F. Chopin in Chopin's style, but one also talks about influences between composers. These influences can be observed by looking at some examples:
6.1.1 Proper judgment
An example of a successful judgment is shown in Table 6.1:
Table 6.1 Evaluation of the Frederic Chopin prelude Op. 28 No. 22
 Composer     melodic   rhythmic   combined   Total   Verdict
 Beethoven      43.2      17.2       11.0       71       3
 Mozart         49.2      11.4        6.4       67       4
 Bach           62.4       8.2        6.4       77       2
 Schubert       19.3      13.2        5.9       38       5
 Chopin         86.8      25.1       10.9      122       1
According to this example, the prelude shows a strong Chopin affinity, although it manifests a Bach-like melody carried on a typically Romantic rhythmic structure (high marks for Beethoven and Schubert). It is a well-known fact that Chopin was fascinated by Bach's compositions: before playing a concert he would shut himself up and play, not Chopin, but Bach, always Bach [27]. These facts emerge from the composer recognition system.
6.1.2 Wrong judgment
Table 6.2 shows the results for one of Beethoven’s sonatas:
Table 6.2 Evaluation of the Ludwig van Beethoven Sonata Op. 49 No. 2
Composers \ Profiles   melodic   rhythmic   combined   Total   Verdict
Beethoven              303.7     208.8      109.7       622    4
Mozart                 319.2     201.7      124.4       645    2
Bach                   366.1     263.0       83.7       712    1
Schubert               315.6     201.8      119.1       636    3
Chopin                 296.5     127.3       79.0       502    5
This composition is not a typical Beethoven work; however, it was drawn as a member of the testing corpus. Since it is a simple piece in the style of his predecessors, Bach and Mozart, those composers were ranked at the top. One can also see how low Chopin was ranked in terms of rhythmic structure; this piece has a very simple rhythm, which is unlikely for Chopin.
6.1.3 Unseen composers
The algorithm behaves surprisingly well even for unseen composers. A result for one of Liszt’s pieces is shown in Table 6.3. F. Liszt was a contemporary of F. Chopin (almost the same birth date). The system did not know Liszt’s compositions; nevertheless, it ranked the known composers roughly according to how close their lifetimes were to Liszt’s (sic!).
Table 6.3 Evaluation of the Franz Liszt Concert Etude No. 3 ‘Un sospiro’
Composers \ Profiles   melodic   rhythmic   combined   Total   Verdict
Beethoven              203.8     115.9       76.6       396    2
Mozart                 198.8      60.7       45.5       305    4
Bach                   178.8      42.8       36.0       257    5
Schubert               192.0      93.6       82.6       368    3
Chopin                 309.0     109.0       88.5       506    1
A small set of pieces by other composers was collected separately. The assignments made by the system, marked as plausible (√) or not (×), are presented in Table 6.4. Only professionals are entitled to pass such judgments, but these results look good.
Table 6.4 Unknown composers’ assignments
Piece                                  Assignment
L. Boccherini, Badinerie               Chopin      ×
A. Borodin, Prince Igor                Schubert    √
C. Debussy, Golliwogs Cakewalk         Chopin      √
C. Debussy, Petit Negre                Beethoven   ×
L. Delibes, Lakme                      Bach        ×
E. Grieg, Wedding March                Schubert    √
J. Haydn, Arietta                      Bach        √
J. Haydn, Capriccio                    Schubert    ×
S. Joplin, Entertainer                 Bach        ×
F. Liszt, Un Sospiro                   Chopin      √
F. Mendelssohn, Characteristic Piece   Chopin      √
F. Mendelssohn, Christmas Piece        Chopin      √
J. Pachelbel, Canon                    Mozart      √
6.2 Algorithm evaluation
The best results for the regular (randomly drawn) test pieces are given in Table 6.5. They were obtained without normalization, using the similarity measure proposed in this thesis. With large profiles and an n-gram length of about 6, the system reaches about 85% accuracy. Columns represent the profile size, rows indicate the n-gram length, and the results are shown for the aging factor 0.96:
Table 6.5 Algorithm results for aging factor 0.96 (accuracy in %)
n \ size  100  250  500  1k  2.5k  5k  10k
2 41 38 38 35 32 43 43
3 46 54 59 62 59 51 43
4 62 70 65 73 73 78 86
5 54 62 70 78 78 81 81
6 54 59 68 68 84 78 84
7 46 49 68 68 68 70 84
9 46 57 49 51 57 68 76
12 41 46 41 41 41 46 49
An important thing to point out is that a random classifier would have an accuracy of 20% (five classes), so a result over 80% is very good and shows that the system really works. The second observation is that some pieces were written by a composer in a style different from his usual one, and even people who do not know a given piece find it really hard to assign it to the proper class.
6.2.1 Profiles comparison
I have tested the algorithm with both the cosine similarity measure and the one proposed in this thesis. Cosine similarity behaves quite well; however, for the best parameters its accuracy reaches 65%, which is still less than with the proposed method. This probably results from the fact that under cosine similarity the most frequent words drive most of the final verdict, while it was shown that the most important words lie in the middle of the frequency ranking.
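For reference, the baseline cosine similarity between two n-gram profiles (represented here, purely for illustration, as hash references mapping n-grams to frequencies) can be computed as below; this is the standard textbook formula, not the proposed measure.

    use strict;
    use warnings;

    # Standard cosine similarity between two n-gram profiles, each a hash
    # reference mapping an n-gram string to its frequency.
    sub cosine_similarity {
        my ($p, $q) = @_;
        my ($dot, $norm_p, $norm_q) = (0, 0, 0);
        for my $ngram (keys %$p) {
            $norm_p += $p->{$ngram} ** 2;
            $dot    += $p->{$ngram} * $q->{$ngram} if exists $q->{$ngram};
        }
        $norm_q += $_ ** 2 for values %$q;
        return 0 unless $norm_p && $norm_q;
        return $dot / (sqrt($norm_p) * sqrt($norm_q));
    }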
6.2.2 Normalization
One may think that the answer to the question whether to normalize profiles is an obvious ‘yes’. The investigations show that this is not quite true, or rather that it is true only outside the area of our interest. It turned out that small profiles and small n-gram lengths do require normalization, but with larger parameters normalization makes the results slightly worse. A certain normalization of the results is already provided by the aging mechanism, which damps incidental growths. The results for normalized profiles, with the other parameters the same as in Table 6.5, are shown in Table 6.6, while the differences between the results are presented in Figure 6.1. The ‘∆’ value describes the accuracy advantage of normalized profiles over regular profiles:
Table 6.6 Algorithm results for aging factor 0.96 with profile normalization (accuracy in %)
n \ size  100  250  500  1k  2.5k  5k  10k
2 62 68 73 57 62 62 54
3 59 68 68 62 76 62 54
4 41 65 65 70 73 76 76
5 43 62 70 76 70 78 78
6 46 59 59 68 76 78 81
7 32 57 68 68 65 78 84
9 43 43 54 54 43 57 73
12 19 32 46 41 32 46 41
Figure 6.1 Normalization’s influence on the results
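The normalization step itself can be sketched as follows, under the assumption that normalizing a profile means dividing each n-gram count by the profile’s total count; the actual scheme used by the system may differ, so this is only an illustration.

    # Sketch of profile normalization, assuming it means dividing each
    # n-gram count by the total number of n-grams in the profile.
    sub normalize_profile {
        my ($profile) = @_;
        my $total = 0;
        $total += $_ for values %$profile;
        return {} unless $total;
        my %normalized;
        $normalized{$_} = $profile->{$_} / $total for keys %$profile;
        return \%normalized;
    }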
6.2.3 N-gram length
The experiments show that the best results were obtained for n-gram lengths of about 6-7. One can infer that average phrases contain 7-8 notes (a unigram describes two notes, not one). This is the value that was expected, and similar values were obtained by Downie [14]. It can be another voice in the discussion about the average length of a musical phrase.
6.2.4 Profile’s sizes
It turned out that the profile size may be limited for better system performance. The difference between the accuracies for unlimited and large profiles is imperceptible, especially while lacking a large testing and training corpus. One has to notice that, as the profile size increases, the best n-gram length grows as well.
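Limiting a profile simply means keeping its most frequent n-grams; a minimal sketch, again assuming hash-based profiles:

    # Keep only the $limit most frequent n-grams of a profile.
    sub truncate_profile {
        my ($profile, $limit) = @_;
        my @top = (sort { $profile->{$b} <=> $profile->{$a} } keys %$profile)[0 .. $limit - 1];
        my %short;
        for my $ngram (grep { defined } @top) {
            $short{$ngram} = $profile->{$ngram};
        }
        return \%short;
    }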
6.2.5 Aging factor
We can see that the best results are obtained with an aging factor of 0.96, large profiles and an n-gram size of about 6-7 (7-8 notes). The problem with the highest aging factor (1.0, i.e. no aging) is that some pieces have misleading contents, and without aging these misleading features are preserved for the final classification. The aging factor thus represents the generalization ability of the classifier. The maximal accuracies obtained for different aging factors are shown in Table 6.7. According to these results one can conclude that the aging factor should be as high as possible, but not higher.
Table 6.7 Maximal accuracies for different aging factors
Aging factor 0.7 0.8 0.85 0.9 0.96 0.99 1.00
Best accuracy 69% 75% 75% 84% 86% 86% 81%
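One plausible reading of the aging mechanism (an assumption rather than the exact formula used by the system) is that the existing profile counts are multiplied by the aging factor before the n-grams of the next training piece are added, so incidental spikes contributed by single pieces gradually decay:

    # Assumed aging scheme: decay the existing profile counts by the aging
    # factor before adding the n-gram counts of the next training piece.
    sub add_piece_with_aging {
        my ($profile, $piece_counts, $aging) = @_;
        $profile->{$_} *= $aging for keys %$profile;
        for my $ngram (keys %$piece_counts) {
            $profile->{$ngram} = ($profile->{$ngram} || 0) + $piece_counts->{$ngram};
        }
        return $profile;
    }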
6.2.6 Representative data
The algorithm has also been tested on a testing set chosen from the dataset in such a way that the pieces in it are representative for each composer, rather than chosen randomly. Very good results were obtained in this case (Table 6.8), because the tests were conducted without the misleading contents described above. With these data an accuracy of 100% can be reached, and over 90% was easy to obtain. This shows that the reason why the results for the random test set are near 80% rather than 100% is that each composer wrote some pieces not in his own style.
Table 6.8 Results for representative data (accuracy in %)
n \ size  100  250  500  1k  2.5k  5k  10k
2 38 31 25 38 38 38 38
3 50 69 69 62 75 44 38
4 69 81 81 88 81 94 100
5 62 81 81 81 88 100 94
7 56 75 81 75 81 81 88
9 44 75 75 69 69 88 88
12 44 62 69 69 56 69 56
This conclusion leads to the next one: the classifier recognizes the style as much as the “hand” of a composer. This is both good and bad, because it shows that the “hand” of a composer is not the only indicator; possibly, with this method one can recognize the style and genre of music even better than the authorship. The classifier also behaves very well when tested on the training data; in this case the aging factor does not matter and the results are best for an aging factor equal to 1.0. However, the classifier then does not recognize a style, but rather tries to remember the piece.
6.2.7 Key-words-based classification
One can also investigate whether the algorithm works better when only key-word n-grams are used. In this case only a small part of each profile is taken into consideration: entropy sieving is applied to the profiles (page 44) and only key-words are left, usually from 5 to 15 percent of the content of the full profile, depending on its initial size. The experiment measuring the accuracy of the system with and without sieving shows that the results are quite poor compared to those obtained for subject-based text classification [21] (Figure 6.2). The reason is probably that a lot of useful information lies between the keywords (or key n-grams), and this information matters more for complex classifications such as composer attribution than for plainer ones such as subject categorization. Nevertheless, since the sieved profiles are about 10 times smaller than the original ones, the classification process requires about 10 times less memory and processor time.
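The exact sieve is defined earlier in the thesis (page 44); purely as an illustration, the sketch below assumes that an n-gram is kept as a key n-gram when the entropy of its distribution across composer profiles is low, i.e. when it is concentrated in a few composers:

    # Illustrative entropy sieve (an assumption about the exact criterion):
    # keep n-grams whose counts are concentrated in few composer profiles.
    sub sieve_by_entropy {
        my ($profiles, $threshold) = @_;   # { composer => { ngram => count } }
        my (%by_ngram, %keep);
        for my $composer (keys %$profiles) {
            for my $ngram (keys %{ $profiles->{$composer} }) {
                $by_ngram{$ngram}{$composer} = $profiles->{$composer}{$ngram};
            }
        }
        for my $ngram (keys %by_ngram) {
            my @counts = values %{ $by_ngram{$ngram} };
            my $total  = 0;
            $total += $_ for @counts;
            my $entropy = 0;
            for my $c (grep { $_ > 0 } @counts) {
                my $p = $c / $total;
                $entropy -= $p * log($p) / log(2);
            }
            $keep{$ngram} = 1 if $entropy <= $threshold;
        }
        return \%keep;
    }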
Figure 6.2 Accuracy of sieved and full profiles
7 Conclusions
The analysis shows that music can be treated as a natural language and is thus amenable to NLP and IR tools. Converting music to a flat, textual representation enables applying these methods directly, which was exploited in this work. The full process of obtaining an n-gram representation of music from MIDI files has been shown: from a brief introduction to the MIDI file specification, through the description of the MIDI parser, to the method of retrieving n-grams.
The n-gram representation of musical data demonstrates great convergence with text, since a number of classic NLP and IR regularities, such as Zipf’s law or the keyword distribution, hold for musical n-grams as well. A small corpus was collected to carry out these investigations; however, it should be pointed out that it was a toy database. In order to conduct detailed analyses of algorithms in the field of music processing, a broad community project needs to be set up to create a large, freely distributed, well-tagged corpus of musical data. Only then can one talk about comparing the performance of various algorithms.
An algorithm for composer recognition was proposed and a system based on this approach has been developed. A system accuracy of over 80% on the corpus of five classical composers was reported. The method can still be improved, as various features and details of the algorithm may be changed or replaced by more sophisticated solutions. On the other hand, the steps of the algorithm presented in this work may be applied to many different tasks. The most important common value of these methods is that doing research on music data is also doing research on the achievements of human thought, which may lead to finding the answer to the question of what distinguishes mankind from apes.
A. Appendix – music notation
I. Western music system
The Western music system is based on the standard western chromatic scale [49]. Each doubling of the frequency of a sound is called an octave. Assuming that humans perceive sound in the range of 20 Hz – 20 kHz, there are about 10 audible octaves (since 2^10 ≈ 1000). In the western chromatic scale, each octave consists of twelve semitones, which are the smallest integral units. The distribution of tones in the octave is uniform on the logarithmic scale, i.e., the ratio of the frequencies of every pair of adjacent notes remains the same. This is also described as equal temperament. With twelve notes in the octave, it turns out that each note pitch depends on the previous one according to the following rule:
p_i = p_{i-1} \cdot 2^{1/12}    (A.1)
The modern musical system refers to the note called A4, which has exactly 440 Hz. The number that follows the note letter indicates the octave number. Following the rule presented above, the middle ‘C’ note (C4) has 261.6 Hz, while 880 Hz is called A5.
The MIDI system assumes that the middle ‘C’ (C4) has the value of ‘60’, where the values represent the semitone order. Since the reference note A4 has an order of ‘69’, the following mapping may be applied to all MIDI pitches:
p = 69 + 12 \cdot \log_2(f / 440)    (A.2)
where p is the MIDI pitch of a note and f is its frequency.
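As a quick check of equation (A.2), both conversions can be written down directly; the values printed below match the ones quoted in the text (C4 ≈ 261.6 Hz, and 880 Hz maps to pitch 81, i.e. A5):

    use strict;
    use warnings;

    # MIDI pitch <-> frequency conversion following equation (A.2).
    sub freq_to_pitch { my ($f) = @_; return 69 + 12 * log($f / 440) / log(2) }
    sub pitch_to_freq { my ($p) = @_; return 440 * 2 ** (($p - 69) / 12) }

    printf "C4 (pitch 60) = %.1f Hz\n", pitch_to_freq(60);    # 261.6 Hz
    printf "880 Hz = pitch %.0f (A5)\n", freq_to_pitch(880);  # 81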
II. Staff system
The staff is the fundamental latticework of a musical score upon which all musical symbols, such as notes, are placed. Each of the five staff lines and the intervening spaces corresponds to the seven repeating pitches of the diatonic scale [38]. The diatonic scale contains seven notes chosen out of the twelve of the chromatic scale; they are written directly on the staff, while the remaining five notes need to be altered using flats, which lower the pitch by a semitone, or sharps, which raise it by a semitone (Figure A.4, 1a-1d, page 83). Using multiple alteration signs allows changing the pitch further. Each transition between a line and the adjacent space describes one diatonic step.
Each staff is characterized by a clef, which shows the reference point of the diatonic scale. There are three types of clefs (Figure A.1):
Figure A.1 Clefs
1. Treble clef (G clef) – its reference pitch (G4) is indicated by the center of the spiral,
2. Alto clef (C clef) – the middle point of the clef defines the C4 pitch line,
3. Bass clef (F clef) – its reference point (the place between the dots) defines F3.
Knowing the staff system and the clefs, one can look at examples of different notes, whose detailed description is shown in Table A.1:
Table A.1 Pitches
Name        C1   C2   C3   F3   C4   G4   A4   C5    C6    C7
Freq. (Hz)  33   66   131  175  262  392  440  523   1046  2093
MIDI pitch  24   36   48   53   60   67   69   72    84    96
III. Temporal information
Notes carry duration information. Each note starts at the point where the previous note stopped, so the flow of notes is contiguous. One may use rests in order to produce periods of silence. Note and rest values are not defined absolutely, but are proportional in duration to all other note and rest values. This proportion is indicated by the shape of the rest, the shape of the note stem, or the note filling. In Table A.2 the semibreve (a ‘whole note’) was used as the reference value.
Table A.2 Notes and Rests (the note and rest symbols themselves are graphical and not reproduced here)
Name                 a.k.a.          Duration
Breve                double whole    2
Semibreve            whole           1
Minim                half            1/2
Crotchet             quarter         1/4
Quaver               eighth          1/8
Semiquaver           sixteenth       1/16
Demisemiquaver       thirty-second   1/32
Hemidemisemiquaver   sixty-fourth    1/64
Notes with flags may be merged into groups using beams instead of flags (Figure A.2a). Notes may also be given additional symbols, such as dots (Figure A.2b) or fermatas (Figure A.2c), in order to change their length.
Figure A.2 Time-related symbols
Notes with pitch and duration information are organized into bars (Figure A.3a), separated by various barlines (Figure A.3b). Each bar contains a certain number of given time units, defined by the time signature (Figure A.3c). For instance, 4/4 means four quarters per bar and 12/8 means twelve eighths per bar; the 4/4 meter may also be written with the special ‘common time’ symbol (C) and 2/2 with the ‘alla breve’ symbol (¢). The exact duration of the reference note (i.e., the tempo) should be chosen by the performer, supported by verbal information about the speed of the piece and the tempo marking (Figure A.3d), placed at the beginning of the piece.
Figure A.3 Staff layout
There are many auxiliary musical symbols that enable a precise description of a performance. Some of them are involved in events that are important for content-based music processing and retrieval, such as note alterations (modifications of note pitches) (Figure A.4, group 1), but others affect volatile elements of the performance that are important for the player yet may be negligible for music analysts, such as note relationship markings, dynamics markings, articulation markings or ornamentation. Examples of all these groups are assembled in Figure A.4:
Figure A.4 Plethora of music notation potpourri
Bibliography
[1] ActivePerl 5.8.8.820 Perl distribution for Windows. ActiveState. Online resource. http://www.activestate.com/products/activeperl/. retrieved 1.05.2007.
[2] ActivePerl License Agreement. ActiveState. Online resource. http://www.activestate.com/Products/ActivePerl/license_agreement.plex. Retrieved 1.05.2007.
[3] Allamanche, E., Herre, J., Hellmuth, O., Fröba, B., Kastner, T., Cremer, M. (2001). Content-based Identification of Audio Material Using MPEG-7 Low Level Description. In Proceedings of the International Symposium of Music Information Retrieval. http://ismir2001.ismir.net/pdf/allamanche.pdf
[4] Baumann, S. (1995). A Simplified Attributed Graph Grammar for High-Level Music Recognition. In Proceedings of the Third International Conference on Document Analysis and Recognition. http://ieeexplore.ieee.org/iel3/4755/13256/00602096.pdf
[5] Berenzwieg, A., Logan, B., Ellis, D.P.W., Whitman, B. (2004). A Large-Scale Evaluation of Acoustic and Subjective Music Similarity Measures. Computer Music Journal, 28:2, pp. 63–76. http://www.ee.columbia.edu/~dpwe/pubs/ismir03-sim-draft.pdf
[6] Bod, R. (2001). Probabilistic Grammars for Music. In proceedings of the Belgian-Dutch Conference on Artificial Intelligence. http://staff.science.uva.nl/~rens/bnaic01.pdf
[7] Bod, R. (2002). A unified Model of Structural Organization in Language and Music. Journal of Artificial Intelligence Research 17 (2002), 289-308. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/jair/OldFiles/OldFiles/pub/volume17/bod02a.pdf
[8] Buzzanca, G. (1997). “A Supervised Learning Approach to Musical Style Recognition”. In Proceedings of International Computer Music Conference. http://www.conservatoriopiccinni.it/~g.buzzanca/A_Supervised_Learning_Approach.PDF
[9] Chapman, S. String Similarity Metrics for Information Integration. Online resource. http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. Retrieved 1.05.2007.
[10] Conklin, D. (2003). Music Generation from Statistical Models. In Proceedings of the AISB 2003 Symposium on Artificial Intelligence and Creativity in the Arts and Sciences, Aberystwyth, Wales, p. 30–35. http://www.soi.city.ac.uk/~conklin/papers/AISB/paper.pdf
[11] Damiani, A., Olivetti Belardinelli, M. (2003). Recognition of Composer’s Style from Musical Fragments. In Proceedings of the 5th Triennial ESCOM Conference. p. 254-256. http://www.epos.uos.de/music/books/k/klww003/pdfs/228_Damiani_Proc.pdf
[12] Document classification. Wikimedia Foundation, Inc. Online Resource. http://en.wikipedia.org/wiki/Document_classification. Retrieved 1.05.2007
[13] Doraisamy, S. (2004). Polyphonic Music Retrieval: The N-gram Approach. Ph.D. Thesis, University of London.
[14] Downie, S. (1999). Evaluating a simple approach to music information retrieval: Conceiving melodic n-grams as text. Ph.D. Thesis, University of Western Ontario.
[15] Downie, S. (2003). Music Information Retrieval. Annual Review of Information Science and Technology 37, 295-340. http://www.music-ir.org/downie_mir_arist37.pdf
[16] Finale, composing and score-writing tool. MakeMusic, Inc. Online resource. http://www.finalemusic.com/finale/. Retrieved 1.05.2007.
[17] Francu, C., Nevill-Manning, C. G. (2000). Distance Metrics and Indexing Strategies for a Digital Library of Popular Music. IEEE International Conference on Multimedia and Expo (II). http://cristian.francu.com/Papers/icme00.pdf
[18] Franklin, D. R, Chicharo, J. F. (1999). Paganini – A Music Analysis and Recognition Program. Fifth International Symposium on Signal Processing and its Applications, Brisbane. p. 107-110 vol. 1. http://ieeexplore.ieee.org/iel5/6605/17735/00818124.pdf
[19] Fraunhofer Institut Integrierte Schaltungen. Online resource. http://www.iis.fraunhofer.de/. Retrieved 22.05.2007.
[20] Frojdh, P., Lindgren, U., Westerlund, M. (2006). Media Type Registration for Downloadable Sounds for Musical Instrument Digital Interface. RFC 4613. IETF. Online resource. http://www.apps.ietf.org/rfc/rfc4613.txt. Retrieved 1.05.2007
[21] Gawryjołek, J. (2007). Analiza i wizualizacja wpisów w serwisach typu wiki (Analysis and visualization of entries in wiki-type services; in Polish). B.Sc. Thesis, Warsaw University of Technology.
[22] General MIDI. Wikimedia Foundation, Inc. Online Resource. http://en.wikipedia.org/wiki/General_MIDI. Retrieved 1.05.2007
[23] Glatt, J. (2004a). MIDI Specification. Online resource. http://www.borg.com/~jglatt/tech/midispec.htm. Retrieved 1.05.2007
[24] Glatt, J. (2004b). MIDI File Format. Online resource. http://www.borg.com/~jglatt/tech/midifile.htm. Retrieved 1.05.2007
[25] Glatt, J. (2004c). MIDI File Format: Tempo and Timebase. Online resource. http://www.borg.com/~jglatt/tech/midifile/ppqn.htm. Retrieved 1.05.2007
[26] History of Mahidol University (in Thai). Online Resource. http://www.mahidol.ac.th/muthai/history/history.htm. Retrieved 1.05.2007.
[27] Huneker, J. (1900). Chopin: The Man and His music. New York. ISBN 1-603-03588-5
[28] Info-Zip 2.32. GNU Project. Online resource. http://www.info-zip.org/. Retrieved 1.05.2007
[29] Keselj, V., Peng, F., Cercone, N., Thomas, C. (2003). N-gram-based Author Profiles for Authorship Attribution. In Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING’03, pages 255–264, Dalhousie University, Halifax, Nova Scotia, Canada. http://users.cs.dal.ca/~vlado/papers/pacling03.pdf
[30] Kranenburg, P. van, Backer, E. (2004). Musical style recognition – a quantitative approach. In Proceedings of the Conference on Interdisciplinary Musicology. http://gewi.uni-graz.at/~cim04/CIM04_paper_pdf/Kranenburg_Backer_CIM04_proceedings.pdf
[31] Lazzaro, J., Wawrzynek, J. (2006). RTP Payload Format for MIDI. RFC 4695. IETF. Online resource. http://www.apps.ietf.org/rfc/rfc4695.txt. Retrieved 1.05.2007
[32] Lemstrom, K. (2000). String matching Techniques for Music Retrieval. Ph.D. Thesis, University of Helsinki. http://www.cs.helsinki.fi/~klemstroz/THESIS/thesis-gzipped.pdf/lemstr00string.pdf
[33] Li, T., Ogihara, M., Li, Q. (2003). A Comparative Study on Content-Based Music Genre Classification. Proceedings of the 26th Annual International ACM Conference on Research and Development in Information Retrieval. http://portal.acm.org/citation.cfm?id=860435.860487.
[34] LilyPond, automated engraving system. GNU Project. Online resource. http://lilypond.org/. Retrieved 1.05.2007.
[35] Martin, K. D. (1999). Sound-Source Recognition: A Theory and Computational Model. Ph.D. Thesis, Massachusetts Institute of Technology. http://xenia.media.mit.edu/~kdm/research/papers/kdm-phdthesis.pdf
[36] Martin, K. D., Kim, Y. E. (1998). 2pMU9. Musical instrument identification: A pattern-recognition approach. In Proceedings of the 136th meeting of the Acoustical Society of America. http://sound.media.mit.edu/Papers/kdm-asa98.pdf
[37] MIDI specification. MIDI Manufacturer Association. Online resource. http://www.midi.org/. Retrieved 1.05.2007.
[38] Modern Musical Symbols. Wikimedia Foundation, Inc. Online Resource. http://en.wikipedia.org/wiki/Modern_musical_symbols. Retrieved 1.05.2007
[39] Musical Instrument Digital Interface. Wikimedia Foundation, Inc. Online Resource. http://en.wikipedia.org/wiki/MIDI. Retrieved 1.05.2007
[40] MusicXML Definition. Recordare LLC. Online resource. http://www.musicxml.org/xml.html. Retrieved 1.05.2007.
[41] Pardo, B. (2006). Finding Structure in Audio for Music Information Retrieval. IEEE Signal Processing Magazine. Vol. 23 Issue 4, 126-132. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?tp=&arnumber=1628889&isnumber=34166
[42] Perez-Sancho, C., Inesta, J.M., Calera-Rubio, J. (2004). Style Recognition through Statistical Event Models. In Proceedings of the Sound and Music Computing Conference, SMC ’04. http://smc04.ircam.fr/scm04actes/P23.pdf
[43] Pollastri, E., Simoncelli, G. (2001). Classification of Melodies by Composer with Hidden Markov Models. In Proceedings of the First International Conference on Web Delivering of Music, p. 88-95. http://ieeexplore.ieee.org/iel5/7752/21299/00990162.pdf
[44] Ponsford, D., Wiggins, G., Mellish, C. (1999). Statistical learning of harmonic movement. Journal of New Music Research, 28(2). http://www.doc.gold.ac.uk/~mas02gw/papers/JNMR97.pdf
[45] Press, W.H. (1992). Numerical recipes in C: the art of scientific computing. Cambridge University Press. ISBN 0-521-43720-2.
[46] Rivasseau, J.-N. (2004). Learning harmonic changes for musical style modeling. Project report, University of British Columbia. http://www.elvanor.net/files/learning_harmonic_changes.pdf
[47] Schaffrath, H. (1993). Repräsentation einstimmiger Melodien: computerunterstützte Analyse und Musikdatenbanken (Representation of monophonic melodies: computer-aided analysis and music databases; in German). In B. Enders and S. Hanheide (eds.) Neue Musiktechnologie, 277-300, Mainz, B. Schott’s Söhne.
[48] Schaffrath, H., Huron, D (ed). (1995). The Essen Folksong Collection in the Humdrum Kern Format. Menlo Park, CA. CCARH.
[49] Scientific pitch notation. Wikimedia Foundation, Inc. Online Resource. http://en.wikipedia.org/wiki/Scientific_pitch_notation. Retrieved 1.05.2007
[50] Selfridge-Field, E. (1995). The Essen Musical Data Package. Menlo Park, California. CCARH.
[51] Senner, W.M. (1991). The Origins of Writing. University of Nebraska Press. ISBN 0-80-329167-1
[52] Sornlertlamvanich, V., Potipiti, T., Charoenporn, T. (2000). Automatic Corpus-based Thai Word Extraction with the C4.5 Learning Algorithm. Proceedings of the 18th conference on Computational Linguistics - Volume 2, 802-807. http://acl.ldc.upenn.edu/C/C00/C00-2116.pdf
[53] Spärck Jones, K., Willett, P. (1997). Reading in Information Retrieval. Overall Introduction, 1-7. Morgan Kaufmann.
[54] Spevak, C., Thom, B., Hothker, K. (2002). Evaluating Melodic Segmentation. In Proceedings of the 2nd Conference on Music and Artificial Intelligence, ICMAI’02, Edinburgh, Scotland. http://www.cs.cmu.edu/~bthom/PAPERS/icmai02.pdf
[55] Thom, B. (2000a). Unsupervised Learning and Interactive Jazz/Blues Improvisation. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, p. 652–657. http://www.cs.cmu.edu/~bthom/PAPERS/aaai2k.pdf
[56] Thom, B. (2000b). BoB: an Interactive Improvisational Music Companion. In Proceedings of the Fourth International Conference on Autonomous Agents
(Agents-2000), Barcelona, Spain. http://www.cs.cmu.edu/~bthom/PAPERS/agents2k.pdf
[57] Thom, B. (2001). Machine Learning Techniques for Real-time Improvisation Solo Trading. In Proceedings of the 2001 International Computer Music Conference, Havana, Cuba. http://www.cs.cmu.edu/~bthom/PAPERS/icmc01.pdf
[58] Treitler, L. (2000). With Voice and Pen. Oxford University Press. ISBN 0-19-816644-3
[59] Truong, B. (2002). Trancedence: An artificial life approach to the synthesis of music. http://www.informatics.susx.ac.uk/easy/Publications/Online/MSc2002/bt20.pdf
[60] Tseng, Y. (1999). Content-based retrieval for music collections. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp 176-182. http://blue.lins.fju.edu.tw/~tseng/papers/p176-tseng.pdf
[61] Uitdenbogerd, A., Zobel, J. (1999). Melodic matching techniques for large database. In Proceedings of the seventh ACM international conference on Multimedia, pp 57-66. http://pi3.informatik.uni-mannheim.de/~helmer/m1.pdf.
[62] Walshaw, C. Abc Music Notation Software Package. Online resource. http://staffweb.cms.gre.ac.uk/~c.walshaw/abc/ http://abc.sourceforge.net/abcMIDI/original/
[63] Wołkowicz, J. (2006). Analysis of piano pieces of various composers stored in MIDI files (in Polish). Project Report. Warsaw University of Technology. http://torch.cs.dal.ca/~jacek/papers/projects/classification_enhancement.pdf
[64] Zdrahal-Urbanek, J., Vitouch, O. (2003). Recognize the tune? A Study on Rapid Recognition of Classical Music. In Proceedings of the 5th Triennial ESCOM Conference, p. 257-260. http://www.musicpsychology.net/vitouch/vitouch_2003b.pdf