Automatic assessment technologies to support drums practicing
Amorós Cortiella, Macià
Academic year 2020-2021
Bachelor's Thesis (Treball de Fi de Grau)
Director: Xavier Serra
Bachelor's Degree in Audiovisual Systems Engineering
Universitat Pompeu Fabra

Macià Amorós i Cortiella
Supervisor: Xavier Serra
May 2021
Contents

1 Introduction
1.1 Motivation
1.2 Existing solutions
1.3 Identified challenges
1.3.1 Guitar vs drums
1.3.2 Dataset creation
1.3.3 Signal quality
1.4 Objectives
1.5 Project overview
2 State of the art
2.1 Signal processing
2.1.1 Feature extraction
2.1.2 Data augmentation
2.2 Sound event classification
2.2.1 Drums event classification
2.3 Digital sheet music
2.4 Software tools
2.4.1 Essentia
2.4.2 Scikit-learn
2.4.3 Lilypond
2.4.4 Pysimmusic
2.4.5 Music Critic
2.5 Summary
3 The 40kSamples Drums Dataset
3.1 Existing datasets
3.1.1 MDB Drums
3.1.2 IDMT Drums
3.2 Created datasets
3.2.1 Music school
3.2.2 Studio recordings
3.3 Data augmentation
3.4 Drums events trim
3.5 Summary
4 Methodology
4.1 Problem definition
4.2 Drums event classifier
4.2.1 Feature extraction
4.2.2 Training and validating
4.2.3 Testing
4.3 Music performance assessment
4.3.1 Visualization
4.3.2 Files used
5 Results
5.1 Tempo limitations
5.2 Saturation limitations
5.3 Evaluation of the assessment
6 Discussion and conclusions
6.1 Discussion of results
6.2 Further work
6.3 Work reproducibility
6.4 Conclusions
List of Figures
List of Tables
Bibliography
A Studio recording media
B Extra results
Acknowledgement

I would like to express my sincere gratitude to:

• Xavier Serra, for supervising this project.
• Vsevolod Eremenko, Eduard Vergés and the MTG, for helping me whenever I have needed it.
• Sepe Martínez and the Stereodosis team, for helping me record a drums dataset.
Abstract

This project is focused on the development of automated assessment tools for the support of musical instrument learning, specifically drums learning. The state of the art on music performance assessment has advanced in recent years, and the demand for online learning resources has increased. The goal is to develop software that listens to a student reading a drums music sheet and then evaluates the audio recording, giving feedback on tempo and reading accuracy. First, a review of the previous work on the different topics is made. Then, a drums event dataset is created, as well as a classifier, and a performance assessment pipeline has been developed, focused on giving feedback through a visualization of the waveform in parallel with the exercise score. Results suggest that the classifier can generalize to new audios; despite this, there is a large margin to improve the classification results. Furthermore, there are some limitations in terms of the maximum tempo at which events can be classified correctly and the amount of saturation that the system can afford while still ensuring a correct prediction. Finally, a discussion introduces the idea of a teacher that follows the same process as the proposed pipeline, considering whether this is a correct approach to the problem and whether this is a fair way to evaluate the system in a task that we humans do differently, with more information than a single audio.

Keywords: Automatic assessment; Signal processing; Music information retrieval; Drums; Machine learning
Chapter 1
Introduction
The field of sound event classification, and specifically drums event classification, has improved its results during the last years, which allows us to use this kind of technologies for more concrete applications like education support [1]. The development of automated assessment tools for the support of musical instrument learning has been a field of study in the Music Technology Group (MTG) [2], concretely on guitar performances, implemented in the Pysimmusic project [3]. One of the open paths proposed by Eremenko et al. [2] is to implement it with different instruments, and this is what I have done.
1.1 Motivation
The aim of the project is to implement a tool to evaluate musical performances, specifically the reading of scores with drums. One possible real application may be to support the evaluation in a music school, allowing the teacher to focus on other items such as attitude, posture and more subtle nuances of performance. If high accuracy is achieved in the automatic assessment of tempo and reading, a fair assessment of these aspects can be ensured. In addition, a collaboration between a music school and the MTG allows the use of a specific corpus of data from the educational institution's program; this corpus is formed by a set of music sheets and the recordings of the performances.
Besides this, I have been studying drums for fourteen years, and a personal motivation emerges from this fact. Learning an instrument is a process that does not only rely on going to class; there is an important load of individual practice apart from the class itself. Having a tool to assess yourself when practicing would be a nice way to check your own progress.
1.2 Existing solutions
In terms of music interpretation assessment, there are already some software tools that support assessing several instruments. Applications such as Yousician1 or SmartMusic2 offer everything from the most basic notions of playing an instrument to a syllabus of themes to be played. These applications return to the students an evaluation that tells which notes are correctly played and which are not, but they do not give information about tempo consistency or dynamics, and even less generate a rubric as a teacher could develop during a class.

There are specific applications that support drums learning, but in those the feature of automatic assessment disappears. There are some options to get online drums lessons, such as Drumeo3 or Drum School4, but they only offer a list of videos imparting lessons on improving stylistic vocabulary, feel, improvisation or technique. These applications also offer personal feedback from professional drummers and a personalized studying plan, but the specific feature of automatic performance assessment is not implemented.

As mentioned in the Introduction, automatic music assessment has been a field of research at the MTG. With the development of Music Critic5, an assessment workflow is proposed and implemented. This is useful as it can be adapted to the drums assessment task.
1 https://yousician.com
2 https://www.smartmusic.com
3 https://www.drumeo.com
4 https://drumschoolapp.com
5 https://musiccritic.upf.edu
1.3 Identified challenges
As mentioned in [2], there are still improvements to be made in the field of music assessment, especially when analyzing expressivity and advanced-level performances. Taking into account the scope of this project, and having as a base case the guitar assessment exercise from Music Critic, some specific challenges are described below.
1.3.1 Guitar vs drums
As defined in [4], a drumset is a collection of percussion instruments, mainly cymbals and drums, even though some genres may need to add cowbells, tambourines, pailas or other instruments with specific timbres. Moreover, percussion instruments are split into two families: membranophones and idiophones. Membranophones produce sound primarily by hitting a stretched membrane; by tuning the membrane tension, different pitches can be obtained [5]. Differently, idiophones produce sound by the vibration of the instrument itself, and the pitch and timbre are defined by its own construction [5]. The vibration aforementioned is produced by hitting the instruments; generally this hit is made with a wooden stick, but some genres may need to use brushes, hotrods or mallets to excite specific modes of the instruments. With all this in mind, and as stated in [1], it is clear that transcribing drums, having to take into account all its variants and nuances, is a hard task even for a professional drummer. With this said, there is a need to simplify the problem and to limit the instruments of the drumset to be transcribed.

Returning to the assessment task: guitars play tuned notes and chords, so the way to check if a music sheet has been read correctly is to look for the pitch information and compare it to the expected one. Differently, the instruments that form a drumset are mainly unpitched (except toms, which are tuned using different scales and tuning paradigms), so the differences among drums events are in the timbre. A different approach has to be defined in order to check which instrument is being played for each detected event; the first idea is to apply machine learning for sound event classification.
Along the project we will refer to the different instruments that conform a drumkit with abbreviations. In Table 1 the legend used is shown; the combination of 2 or more instruments is represented with a '+' symbol between the different tags.
Instrument    Kick drum  Snare drum  Floor tom  Mid tom  High tom
Abbreviation  kd         sd          ft         mt       ht

Instrument    Hi-hat  Ride cymbal  Crash cymbal
Abbreviation  hh      cy           cr

Table 1: Abbreviations' legend
1.3.2 Dataset creation
Keeping in mind the last idea of the previous section: if a machine learning approach has to be implemented, there is a basic need to obtain audio data of drums. Apart from the audio data, proper annotations of the drums interpretations are needed in order to slice them correctly and extract musical features of the different events.

The process of gathering data should take into account the different possibilities that a drumset offers in terms of timbre, loudness and tone. Several datasets should be combined, as well as additional recordings with different drumsets, in order to have a balanced and representative dataset. Moreover, to evaluate the assessment task, a set of exercises has to be recorded with different levels of skill.

There is also the need to capture those sounds with several frequency responses, in order to make the model independent of the microphone. Also, those samples could be processed to get variations of each of them with data augmentation processes.
1.3.3 Signal quality
Regarding the assignment, we have to take into account that a student will not be able to record their interpretations with a setup like the one used in a studio recording; most of the time the recordings will be done using a laptop or mobile phone microphone. This fact has to be taken into account when training the event classifier, in order to do data augmentation and introduce these transformations to the dataset, e.g. introducing noise to the samples or amplifying them to get overload distortion.
1.4 Objectives
The main objective of this project is to develop a tool to assess drums interpretations of a proposed music sheet. This objective has to be split into the different steps of the pipeline:

• Generate a correctly annotated drums dataset, which means a collection of audio drums recordings and their annotations, all equally formatted.
• Implement a drums event sound classifier.
• Find a way to properly visualize drums sheets and their assessment.
• Propose a list of exercises to evaluate the technology.
In addition, having the code published in a public Github6 repository and uploading the created dataset to Freesound7 and Zenodo8 will be a good way to share this work.
1.5 Project overview
The next chapters are structured as follows. In chapter 2 the state of the art is reviewed, focusing on signal processing algorithms and ways to implement sound event classification, and ending with music sheet technologies and the software tools available nowadays. In chapter 3 the creation of a drums dataset is described, presenting the use of already available datasets and how new data has been recorded and annotated. In chapter 4 the methodology of the project is detailed: which algorithms are used for training the classifier, as well as how new submissions are processed to assess them. In chapter 5 an evaluation of the results is done, pointing out the limitations and the achievements. Chapter 6 concludes with a discussion on the methods used, the work done and further work.
6 https://github.com
7 https://freesound.org
8 https://zenodo.org
Chapter 2
State of the art
In this chapter the concepts and technologies used in the project are explained, covering algorithm references and existing implementations. First, signal processing techniques for onset detection and feature extraction are reviewed; then the sound event classification field is presented, along with its relationship with drums event classification. Also, the principal music sheet technologies and codecs are presented. Finally, specific software tools are listed.
2.1 Signal processing
2.1.1 Feature extraction
In the following sections sound event classification will be explained; most of these methods are based on training models using features extracted from the audio, rather than the raw audio chunks themselves [6]. In this section, signal processing methods to get those features are presented.
Onset detection
In an audio signal, an onset is the beginning of a new event; it can be either a single note, a chord or, in the case of the drums, the sound produced by hitting one or more instruments of the drumset. It is necessary to have a reliable algorithm that properly detects all the onsets of a drums interpretation. With the onsets information (a list of timestamps), the audio can be sliced to analyze each chunk separately and to assess the tempo consistency.
It is important to address the challenge in a psychoacoustical way, as the objective is to detect the musical events as a human would do. In [7] the idea of perceptual onset for percussive instruments is defined as a time interval between the physical onset and the moment that the maximum level is reached. In [8] many methods are reviewed, focusing on the differences in performance depending on the signal: Non-Pitched Percussive instruments are better detected with temporal methods or high-frequency content methods, while Pitched Non-Percussive instruments may need to take into account changes of energy in the spectrum distribution, as the onset may represent a different note.
The sound generated by the drums is mainly percussive (discarding brushes' slow patterns or mallets' build-ups on the cymbals), which means that it is formed by a short transient followed by a short decay; there is no sustain. As the transient is a fast change of energy, it implies high-frequency content, because the changes happen in a very short frame of time. As recommended in [9], the HFC method will be used.
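As an illustration, a minimal sketch of HFC-based onset detection following Essentia's documented recipe is shown below; the filename and the frame/hop sizes are placeholders, not the project's actual settings.

```python
import essentia
import essentia.standard as es

# load a mono recording (filename is illustrative)
audio = es.MonoLoader(filename='drums_take.wav')()

windowing = es.Windowing(type='hann')
fft = es.FFT()
c2p = es.CartesianToPolar()
onset_hfc = es.OnsetDetection(method='hfc')

# build the HFC onset detection function frame by frame
pool = essentia.Pool()
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    magnitude, phase = c2p(fft(windowing(frame)))
    pool.add('odf.hfc', onset_hfc(magnitude, phase))

# peak-pick the detection function; the result is a list of onset times in seconds
onsets = es.Onsets()(essentia.array([pool['odf.hfc']]), [1])
print(onsets)
```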
Timbre features
As described in [10], a feature denotes in some way a quantity or a value. Features extracted by processing the audio stream, or transformations of it (i.e. FFT), are called low-level descriptors; these features carry no relevant information from a human point of view, but are useful for computational processes [11].

Some low-level descriptors are computed from the temporal information: for instance, the zero-crossing rate tells the number of times the signal crosses the zero axis per second, the attack time is the duration of the transient, and the temporal centroid describes the energy distribution of an event over time. Other well-known features are the root mean square of the signal or the high-frequency content mentioned in section 2.1.1.
Besides temporal features, low-level descriptors can also be computed from the frequency domain. Some of them are spectral flatness, spectral roll-off, spectral slope and spectral flux, among others.

Nowadays, Essentia's library offers a collection of algorithms that reliably extract the low-level descriptors aforementioned; the function that encompasses all the extractors is called Music extractor1.
2.1.2 Data augmentation
Data augmentation processes refer to the optimization of the statistical representation of the datasets, in terms of improving the generalization of the resultant models. These methods are based on the introduction of unobserved data or latent variables that may not be captured during the dataset creation [12].

Regarding this technique applied to audio data, signal processing algorithms are proposed in [13] and [14] that introduce changes to the signals in both the time and frequency domains. In these articles the goal is to improve accuracy on speech and animal sound recognition, although this could apply to drums event classification. The processes that led to the best results in [13] and [14] were related to time-domain transformations, for instance time shifting and stretching, adding noise or harmonic distortion, or compressing into a given dynamic range, among others. Other proposed processes were focused on the spectrogram of the signal, applying transformations such as shifting the matrix representation, setting some areas to 0, or adding spectrograms of different samples of the same class.

Presently, some Python2 libraries are developed and maintained to do audio data augmentation tasks, for instance audiomentations3 and its GPU version, torch-audiomentations4.
1 https://essentia.upf.edu/streaming_extractor_music.html
2 https://www.python.org
3 https://pypi.org/project/audiomentations/0.6.0
4 https://pypi.org/project/torch-audiomentations
2.2 Sound event classification
Sound event classification is the task of detecting and recognizing sound events in an audio stream [15]. As described in [10], this task can be approached from two sides: on one hand, the perceptual approach tries to extract the timbre similarity to cluster sounds as we perceive them; on the other hand, the taxonomic approach is determined to label sound events as they are defined in cultural or user-biased taxonomies. In this project the focus is on the second approach, as the task is to classify sound events in the drums taxonomy (i.e. kick drum, snare drum, hi-hat).

Also, in [] many classification methods are proposed, concretely, in the taxonomic approach, machine learning algorithms such as K-Nearest Neighbors, Support Vector Machines or Neural Networks, all of them using features extracted from the audio data as explained in section 2.1.1.
2.2.1 Drums event classification
This section is divided into two parts: first presenting the state-of-the-art methods for drums event classification, and then the most relevant existing datasets. This section is mainly based on the article [1], as it is a review of the topic and encompasses the core concepts of the project.
Methods
Focusing on taxonomic drums event classification, this field has been studied over the last years; in the Music Information Retrieval Evaluation eXchange5 (MIREX) it has been a proposed challenge since 20056. In [1] a review of the main methods that have been investigated is done. The authors collect different approaches, such as Recurrent Neural Networks, proposed in [16], Non-Negative Matrix Factorization, proposed in [17], and others working in real time using MaxMSP7, as described in [18].
5 https://www.music-ir.org/mirex/wiki/MIREX_HOME
6 https://www.music-ir.org/mirex/wiki/2005:Audio_Drum_Detection_Results
7 https://cycling74.com/products/max
It should be mentioned that the proposed methods are focused on Automatic Drum Transcription (ADT) of drumsets formed only by the kick drum, snare drum and hi-hat. The ADT field is intended to transcribe audio, but in our case we have to check whether an audio event is or is not the expected event; this particularity can be used in our favor, as some assumptions can be made about the audio that has to be analyzed.
Datasets
In addition to the methods and their combinations, the data used to train the system plays a crucial role. As a result, the dataset may have a big impact on the generalization capabilities of the models. In this section some existing datasets are described:
• IDMT-SMT-Drums [19]: Consists of real drums recordings containing only kick drum, snare drum and hi-hat events. Each recording has its transcription in xml format, and it is publicly available to download8.
• MDB Drums [20]: Consists of real drums recordings of a wide range of genres, drumsets and styles. Each recording has two txt transcriptions, for the classes and subclasses defined in [20] (e.g. class: Hi-hat; subclasses: closed hi-hat, open hi-hat, pedal hi-hat). It is publicly available to download9.
• ENST-Drums [21]: Consists of real drums audio and video recordings of different drummers and drumsets. Each recording has its transcription, and some of them include accompaniment audio. It is publicly available to download10.
• DREANSS [22]: Differently, this dataset is a collection of drums recordings datasets that have been annotated a posteriori. It is publicly available to download11.
Electronic drums datasets have not been considered, as the student assignment is supposed to be recorded with a real drumset.

8 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
9 https://github.com/CarlSouthall/MDBDrums
10 https://perso.telecom-paristech.fr/grichard/ENST-drums
11 https://www.upf.edu/web/mtg/dreanss
2.3 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter programs from the 80s. Proprietary software such as Finale12 and Sibelius13, or open-source software such as MuseScore14 and LilyPond15, are some options that can be used nowadays to write music sheets with a computer.

In terms of file format, Sibelius has its own encrypted format that can only be read and written with the software; it can also write and read MusicXML16 files, which are not encrypted and are similar to an HTML file, as they contain tags that define the bars and notes of the music sheet. This format is the standard for exchanging digital music sheets.

Within Music Critic's framework, the technology used to display the evaluated score is LilyPond: it can be called from the command line and allows adding macros that change the size or color of the notes. The other particularity is that it uses its own file format (.ly), so scores that are in MusicXML format have to be converted and reviewed.
2.4 Software tools
Many of the concepts and algorithms aforementioned are already developed as software libraries. This project has been developed with Python, and in this section the libraries that have been used are presented. Some of them are open and public, and some others are private, like Pysimmusic, which has been shared with us so we can use and consult it. In addition, all the code has been developed using a tool from Google called Colaboratory17; it allows writing code in a jupyter notebook18 format that is agile to use and execute interactively.
12 https://www.finalemusic.com
13 https://www.avid.com/sibelius
14 https://musescore.org
15 https://lilypond.org
16 https://www.musicxml.com
17 https://colab.research.google.com
18 https://jupyter.org
2.4.1 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis, description and synthesis [23]; it can also be installed as a Python-based library, with the pip19 command in Linux or by compiling with certain flags in MacOS20. This library includes a collection of MIR algorithms; it is not a framework, so it is in the user's hands how to use these processes. Some of the algorithms used in this project are music feature extraction, onset detection and audio file I/O.
2.4.2 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine learning algorithms for regression, classification and clustering, as well as pre-processing and dimensionality reduction functions. It is based on NumPy22 and SciPy23, so its algorithms are easy to adapt to the most common data structures used in Python. It also allows saving and loading trained models to do inference tasks with new data.
2.4.3 Lilypond
As described in section 2.3, LilyPond is an open-source scorewriter software with its own file format and language. It can produce visual renders of music sheets in PNG, SVG and PDF formats, as well as MIDI files to listen to the compositions. LilyPond works on the command line and allows us to introduce macros to modify visual aspects of the score, such as color or size.

It is the digital sheet music technology used within Music Critic's framework, as it allows embedding an image in the music sheet, generating a parallel representation of the music sheet and a student's interpretation.
19 https://pypi.org/project/pip
20 https://essentia.upf.edu/installing.html
21 https://scikit-learn.org
22 https://numpy.org
23 https://www.scipy.org/scipylib/index.html
2.4.4 Pysimmusic
Pysimmusic is a private Python library developed at the MTG. It offers tools to analyze the similarity of musical performances and uses libraries such as Essentia, LilyPond and FFmpeg24, among others. Pysimmusic contains onset detection algorithms and a collection of audio descriptors and evaluation algorithms. It is currently the main evaluation software used in Music Critic to compare a submitted recording with the reference.
2.4.5 Music Critic
Music Critic is a project from the MTG intended to support technologies for online music education, facilitating the assessment of student performances25.

The proposed workflow starts with a student submitting a recording of the proposed exercise. Then the submission is sent to the Music Critic's server, where it is analyzed and assessed. Finally, the student receives the evaluation jointly with the feedback from the server.
2.5 Summary
Music information retrieval and machine learning have been popular fields of study. This has led to a large development of methods and algorithms that will be crucial for this project. Most of them are free and open-source, and fortunately the private ones have been shared by the UPF research team, which is a great base to start the development.
24 https://www.ffmpeg.org
25 https://www.upf.edu/web/mtg/tech-transfer/-/asset_publisher/pYHc0mUhUQ0G/content/id/229860881/maximized#YJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in section 1.3.2, having a well-annotated and balanced dataset is crucial to get proper results. In this section the 40kSamples Drums Dataset creation process is explained: first, focusing on how to process existing datasets such as those mentioned in section 2.2.1; secondly, introducing the process of creating new datasets from a music school corpus and a collection of recordings made in a recording studio; and finally, describing the data augmentation procedure and how the audio samples are sliced into individual drums events. In Figure 1 we can see the different procedures to unify the annotations of the different datasets, while the audio does not need any specific modification.
3.1 Existing datasets
Each of the existing datasets has a different annotation format; in this section the process of unifying them is explained, as well as its implementation (see notebook Dataset_formatUnification.ipynb1). As the events to take into account can be single instruments or combinations of them, the annotations have to be formatted to show those events properly. None of the annotations has this approach, so we have written a function that filters the list and joins the events with a small difference of time, meaning that they are played simultaneously.

1 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
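A possible shape of that joining function is sketched below; the function name, the (timestamp, label) tuple structure and the 30 ms tolerance are illustrative assumptions, not the exact implementation in the notebook.

```python
def merge_simultaneous_events(events, tolerance=0.03):
    """Join events whose onsets differ by less than `tolerance` seconds
    into one combined-class event, e.g. 'hh' + 'kd' -> 'hh+kd'.
    `events` is a time-sorted list of (timestamp, label) tuples."""
    merged = []
    for time, label in events:
        if merged and time - merged[-1][0] < tolerance:
            prev_time, prev_label = merged[-1]
            combined = '+'.join(sorted(prev_label.split('+') + [label]))
            merged[-1] = (prev_time, combined)
        else:
            merged.append((time, label))
    return merged
```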
[Figure 1: Datasets pre-processing. Diagram showing, for each source (Music school, Studio REC, IDMT Drums, MDB Drums), the steps to produce unified annotations and audio: Sibelius to MusicXML, MusicXML parser to txt, write annotations.]
3.1.1 MDB Drums
This dataset was the first we worked with; the annotation format in txt was a key factor, as it was easy to read and understand. As the dataset is available on Github2, there is no need to download it nor process it from a local drive. As shown in the first cells of Dataset_formatUnification.ipynb, data from the repository can be retrieved with a Python wrapper of the Github API3.

This dataset has two annotation files, depending on how deep the taxonomy used is [20]. In this case the generic class taxonomy is used, as there is no need to differentiate styles when playing a given instrument (i.e. single stroke, flam, drag, ghost note).
3.1.2 IDMT Drums
Differently from the previous dataset, this one is only available by downloading a zip file4. It also differs in the annotation file format, which is xml. Using the Python package xmltodict5, in the second part of Dataset_formatUnification.ipynb the xml files are loaded as a Python dictionary and converted to txt format.

2 https://github.com/CarlSouthall/MDBDrums
3 https://pypi.org/project/githubpy
4 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
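The conversion can be sketched as follows; note that the tag names inside the IDMT annotation files are assumptions here, and only xmltodict.parse() itself is taken from the library as-is.

```python
import xmltodict

# load the xml annotation as nested Python dictionaries
with open('annotation.xml') as f:
    doc = xmltodict.parse(f.read())

# tag names below are assumptions about the IDMT annotation layout
with open('annotation.txt', 'w') as out:
    for event in doc['instrumentRecording']['transcription']['event']:
        # one "<onset in seconds><tab><instrument label>" line per event
        out.write(f"{event['onsetSec']}\t{event['instrument']}\n")
```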
3.2 Created datasets
In order to expand the dataset with more variety of samples, other methods to get data have been explored. On one hand, with audio data that has partial annotations or some representation that is not data-driven, such as a music sheet, which contains a visual representation of the music but not a logic annotation, as mentioned in the previous section. On the other hand, generating simple annotations is an easy task, so drums samples can be recorded standalone to create data in a controlled environment. In the next two sections these methods are described.
3.2.1 Music school
A music school has shared its teaching material with the MTG for research purposes, i.e. audio demos, books in pdf format and music sheets in Sibelius format. As we can see in Figure 1, the annotations from the music school corpus are in Sibelius format; this is an encrypted representation of the music sheet that can only be opened with the Sibelius software. The MTG has shared an AVID license, which includes the Sibelius software, so we were able to convert the .sib files to MusicXML. MusicXML is not encrypted and can be opened and read, so a parser has been developed to convert the MusicXML files to a symbolic representation of the music sheet. This representation has been inspired by [24], which proposes a system to represent chords.
MusicXML parser
As mentioned in section 2.3, the MusicXML format is based on ordering the visual information with tags, creating a tree structure of nested dictionaries. In the first cell of XML_parser.ipynb6, two functions are defined. ConvertXML2Annotation reads the musicxml file and gets the general information of the song (i.e. tempo, time measure, title); then a loop runs throughout all the bars of the music sheet, checking whether the given bar is self-defined, the repetition of the previous one, or the beginning or end of a repetition in the song (see Figure 2). In the self-defined bar case, the bar itself is passed to an auxiliary function, which parses it, getting the aforementioned symbolic representation.

5 https://pypi.org/project/xmltodict
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
Figure 2: Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three others have a symbol that means 'repetition of the previous bar'; moreover, the bar lines at the beginning and the end represent that these four bars have to be repeated. Therefore, this line in the music score represents an interpretation of eight bars, repeating the first one.

The symbolic representation that we propose, based on [24], defines each bar with a string; this string contains the representations of the events in the bar, separated with blank spaces. Each of the events has two dots (:) to separate the figure (i.e. quarter note, half note, whole note) from the note or notes of the event, which are separated by a dot (.). For instance, the symbolic representation of the first bar in Figure 2 is: F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4

In addition to this conversion, in the parse_one_measure function from the XML_parser notebook, each measure is checked to ensure that it fully represents the bar. This means that the sum of the figures of the bar has to be equal to the one defined in the time measure: the sum of the events in a 4/4 bar has to be equal to four quarter notes.
Symbolic notation to unified annotation format
As we can see in Figure 1, once the music scores are converted to the symbolic representation, the last step is to unify the annotations with those used in section 3.1. This process is made in the last cells of the Dataset_formatUnification7 notebook. A dictionary with the translation of the notes to drums instruments is defined, so the note is directly converted. Differently, the timestamp of each event has to be computed based on the tempo of the song and the figure of each event; this process is made with the function get_time_steps_from_annotations8, which reads the interpretation in symbolic notation and accumulates the duration of each event based on the figure and the tempo.
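A simplified sketch of that accumulation logic is shown below, assuming the notes:figure symbolic format described above and ignoring dotted figures and bar repetitions; it is an illustration, not the function from the repository.

```python
def get_time_steps_from_annotations(bars, tempo):
    """`bars` holds strings like 'F4.A4:4 F4.A4:4 ...' (notes:figure);
    returns (timestamp_in_seconds, notes) tuples for every event."""
    quarter = 60.0 / tempo          # duration of a quarter note in seconds
    current_time = 0.0
    annotations = []
    for bar in bars:
        for event in bar.split():
            notes, figure = event.split(':')
            annotations.append((current_time, notes))
            # a figure of 4 lasts one quarter, 8 lasts half a quarter, etc.
            current_time += quarter * 4.0 / float(figure)
    return annotations
```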
3.2.2 Studio recordings
At this point of the dataset creation, we realized that the already existing data was very unbalanced in terms of instances per class: some classes had around two thousand samples while others had only ten. This situation was the reason to record a personalized dataset, to balance the overall distribution of classes as well as to get exercises read with different accuracy, simulating students with different skill levels.

The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants, Barcelona); the first day was intended to mount the drumset and the microphones, which are listed in Table 2. In Figure 3 the microphone setup is shown; differently from the standard setup, in which each instrument of the set has its own microphone, this distribution of the microphones was intended to record the whole drumset with different frequency responses.

The recording process was divided into two phases: first, creating samples to balance the dataset used to train the drums event classifier (called train set); then, recording the students' assignment simulation to test the whole system (called test set).

7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
9 https://www.stereodosis.com
Microphone            Transducer principle
Beyerdynamic TG D70   Dynamic
Shure PG52            Dynamic
Shure SM57            Dynamic
Sennheiser e945       Dynamic
AKG C314              Condenser
AKG C414              Condenser
Shure PG81            Condenser
Samson C03            Condenser

Table 2: Microphones used
Figure 3: Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only the classes that appear in the music school subset; this decision was motivated by the idea of assessing the songs from the books, so only the classes in the collection of songs were needed to train the classifier. In Figure 4 the distribution of the selected classes before the recordings is shown; note that it is in logarithmic scale, so there is a large difference among classes.
Figure 4: Number of samples before Train set recording
To organize the recording process, we designed 3 different routines to record; depending on the class and the number of samples already existing, a different routine was recorded. These routines were designed trying to represent the different speeds, dynamics and interactions between instruments of a real interpretation. In Appendix A the routine scores are shown; to write a generic routine, a two-line stave is used, where the bottom line represents the class to be recorded and the top line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound remains over a long period of time and whose tail is mixed with the subsequent sound events.

• Routine 1 (Fig. 31): This routine is intended for the classes that do not include a crash or ride cymbal and have a small number of samples (i.e. <500).
• Routine 2 (Fig. 32): This routine does not include auxiliary events, as it is intended for classes that include a crash or ride cymbal, whose interaction with itself is intrinsic.
• Routine 3 (Fig. 33): This is a short version of routine 1, which only repeats each bar two times instead of four; it is intended for classes that do not include a crash or ride cymbal and have a large number of samples (i.e. >500).
Routines 1 and 3 were recorded only once, as we had only one instrument for each of the classes; differently, routine 2 was recorded twice for each cymbal, as we were able to use more instances of them. The different cymbal configurations used can be seen in Appendix A, in Figures 34, 35 and 36.

After the Train set recording, the number of samples was a little more balanced; as shown in Figure 5, all the classes have at least 1500 samples.
[Figure 5: Number of samples after Train set recording, per class (ht+kd, kd+mt, ht, mt, ft+sd, ft+kd+sd, cr+sd, ft, cr+kd, cr, ft+kd, hh+kd+sd, kd+sd, cy+sd, cy, cy+kd, sd, kd, hh+sd, hh+kd, hh), split into 'recorded' and 'before record'.]
Test set
The test set recording tried to simulate different students performing the same song on the same drumset; to do that, we recorded each song of the music school Drums Grade Initial and Grade 1, playing it correctly and then making mistakes in both reading and rhythmic ways. After testing with these recordings, we realized that we were not able to test the limits of the assessment system in terms of tempo or with different rhythmic measures. So we proposed two exercises of groove reading, in 4/4 and in 12/8, to be performed at different tempos; these recordings have been done in my study room with my laptop's microphone.
3.3 Data augmentation
As described in section 2.1.2, data augmentation aims to introduce changes to the signals to optimize the statistical representation of the dataset. To implement this task, the aforementioned Python library audiomentations is used.

The library audiomentations has a class called Compose which allows collecting different processing functions, assigning a probability to each of them. Then the Compose instance can be called several times with the same audio file, and each time the resulting audio will be processed differently because of the probabilities. In data_augmentation.ipynb10 a possible implementation is shown, as well as some plots of the original sample with different results of applying the created Compose to the same sample; an example of the results can be listened to in Freesound11.

The processing functions introduced in the Compose class are based on those proposed in [13] and [14]; their parameters are described below, and a sketch of such a Compose follows the list.
• Add gaussian noise, with 70% probability.
• Time stretch between 0.8 and 1.25, with 50% probability.
• Time shift forward a maximum of 25% of the duration, with 50% probability.
• Pitch shift ±2 semitones, with 50% probability.
• Apply mp3 compression, with 50% probability.
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213
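A possible Compose mirroring the list above is sketched below; exact class and argument names vary between audiomentations versions (Mp3Compression, for instance, appears in releases later than 0.6), so treat the parameters as indicative rather than the notebook's exact code.

```python
import numpy as np
from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                             PitchShift, Shift, TimeStretch)

# probabilities and ranges follow the list above; argument names are those
# of recent audiomentations releases and may differ in older versions
augment = Compose([
    AddGaussianNoise(p=0.7),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    Shift(min_fraction=0.0, max_fraction=0.25, p=0.5),   # forward shift only
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    Mp3Compression(p=0.5),
])

# stand-in for a real drums sample: 1 second of noise at 44.1 kHz
samples = np.random.uniform(-1.0, 1.0, 44100).astype(np.float32)

# each call rolls the probabilities again, so outputs differ between calls
augmented = augment(samples=samples, sample_rate=44100)
```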
3.4 Drums events trim
As will be explained in section 4.2.1, the dataset has to be trimmed into individual files to analyze them and extract the low-level descriptors. In the Dataset_featureExtraction.ipynb12 notebook this process has been implemented, slicing all the audios with their annotations, each dataset separately, to sight-check all the resultant samples and better detect which annotations were not correct.
3.5 Summary
To summarize, a drums samples dataset has been created; the one used in this project will be called the 40k Samples Drums Dataset. Nonetheless, to share this dataset we have to ensure that we fully own the data, which means that the samples that come from the IDMT, MDBDrums and MusicSchool datasets cannot be shared in another dataset. Alternatively, we will share the 29k Samples Drums Dataset, formed only by the samples recorded in the studio. This dataset will be available in Zenodo13, to download the whole dataset at once, and in Freesound some selected samples are uploaded in a pack14 to show the differences among microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592#.YMmNXW4p5TZ
14 https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is inspired by [2]. Each box of the diagram refers to a section in this chapter, so the diagram might be helpful to get a general idea of the problem when explaining each process.

The system is divided into two main processes. First, the top boxes correspond to the training process of the model, using the dataset created in the previous chapter. Secondly, the bottom row shows how a student submission is processed to generate some feedback. This feedback is the output of the system and should give some indications to the student on how they have performed and how they can improve.
4.1 Problem definition
To check if a student reads a music sheet correctly, we need some tool to tag which instruments of the drumset are being played for each detected event. This leads us to develop and train a drums event classifier; if this tool ensures a good accuracy when classifying (i.e. >95%), we will be able to properly assess a student's recording. If the classifier does not have enough accuracy, the system will not be useful, as we will not be able to differentiate between errors from the student and errors from the classifier.
[Figure 6: Proposed pipeline for a drums performance assessment system, inspired by [2]. Training path: music scores, students' performances and assessments feed a dataset of annotations and audio recordings, used for feature extraction and for training the drums event classifier and the performance assessment model. Inference path: a new student's recording goes through feature extraction and performance assessment inference to produce a visualization and performance feedback.]
For this reason, the project has been mainly focused on developing the aforementioned drums event classifier and a proper dataset. Thus, developing a properly assessed dataset of drums interpretations has not been possible, nor has the performance assessment training. Despite this, the feedback visualization has been developed, as it is a nice way to close the pipeline and get some understandable results; moreover, the performance feedback could be focused on deterministic aspects, such as telling the student if they are rushing or slowing relative to a given tempo.
4.2 Drums event classifier
As already mentioned, this section has been the main load of work for this project, because a reliable assessment depends on a correct automatic transcription. The process has been divided into 3 main parts: extracting the musical features, training and validating the model in an iterative process, and finally testing the model with totally new data.
4.2.1 Feature extraction
The feature extraction concept has been explained in section 2.1.1, and it has been implemented using the MusicExtractor()1 method from Essentia's library.

The MusicExtractor() method has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()2 has been implemented to loop over a list of files and analyze each of them, adding the extracted features to a csv file jointly with the class of each drums event. At this point all the low-level features were extracted; both the mean and standard deviation were computed across all the frames of the given audio filename. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.

As mentioned in section 3.4, the fact that the MusicExtractor() method has to be called with a filename, not an audio stream, forced us to create another version of the dataset, which had each event annotated in a different audio file with the correspondent class label as a filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples), and the features were saved in a different csv for each dataset3. Adding the number of instances in all the csv files, we get 40228 instances with 84 features and 1 label.
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices/features
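A condensed sketch of this extraction loop could look as follows; the frame and hop sizes, the helper name and the feature-filtering criterion are illustrative assumptions, while MusicExtractor() and its lowlevel parameters come from Essentia's documentation.

```python
import essentia.standard as es

# frame and hop sizes are illustrative; stats match the text (mean + stdev)
extractor = es.MusicExtractor(lowlevelFrameSize=2048,
                              lowlevelHopSize=1024,
                              lowlevelStats=['mean', 'stdev'])

def extract_row(filename, label):
    # MusicExtractor returns an aggregated pool and a per-frame pool
    features, _ = extractor(filename)
    names = sorted(n for n in features.descriptorNames()
                   if n.startswith('lowlevel') and isinstance(features[n], float))
    # one csv row: all scalar low-level statistics plus the class label
    return [features[n] for n in names] + [label]
```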
4.2.2 Training and validating
As mentioned in section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) to do sound event classification; also, some authors have developed more complex methods for drums event classification. The complexity of these last methods made me choose the generic ones, also to try whether they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.

The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40k Drums samples dataset. The first times we tried the models, we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula (4.1), which does not take into account the unbalance in the dataset. The accuracy computation was around 92%, but the correct predictions were mainly on the large classes; as shown in Figure 7, some classes had very low accuracy (even 0, as some classes had 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances affects the accuracy computation less.
$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} 1(\hat{y}_i = y_i) \qquad (4.1)$$
Otherwise, the proper way to compute the accuracy in this kind of datasets is the balanced accuracy: it computes the accuracy for each class and then averages the accuracy along all the classes, as in formula (4.2), where $w_i$ represents the weight of each class in the dataset. This computation lowered the result to 79%, which was not a good result.
$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \qquad (4.2)$$

$$\text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i$$
Figure 7: Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score, which combines the precision and the recall of the model in one measure, as in formula (4.3). Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of predictions that should be correct for a given class.
$$F\text{-measure} = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (4.3)$$
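All three metrics are available in scikit-learn, which makes the gap between raw and balanced accuracy easy to reproduce on a toy example such as the following (the labels below are illustrative, not the project's data).

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# toy labels: 'hh' dominates, so raw accuracy hides the errors on 'sd'
y_true = ['hh', 'hh', 'hh', 'hh', 'hh', 'hh', 'sd', 'sd', 'kd']
y_pred = ['hh', 'hh', 'hh', 'hh', 'hh', 'hh', 'hh', 'hh', 'kd']

print(accuracy_score(y_true, y_pred))           # 0.78: looks acceptable
print(balanced_accuracy_score(y_true, y_pred))  # 0.67: per-class average is worse
print(f1_score(y_true, y_pred, average='macro'))
```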
Having these results led us to the process of recording a personalized dataset to extend the already existing one (see section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as did the balanced accuracy and f-score (both 89%). Until this point we were using both KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8: Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine refers to the regularization; this technique is intended to make a model less sensitive to the data noise and the outliers that may not represent the class properly. When increasing this value to 10, the results improved among all the classes, as shown in Figure 9, as did the accuracy and f-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was somehow a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with the classes that include only the three mentioned instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and f-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9: Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10: Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the training and validating iterative process has been developed in the Classifier_training.ipynb4 notebook. First, the csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered, and to remove redundant features, the ones with a very low standard deviation are deleted (i.e. std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to move all the data distributions around 0, ensuring a standard deviation of 1.

In the next cells, the dataset is split into train and validation sets, and the training method from the SVM of sklearn is called to perform the training; when the models are trained, the parameters are dumped into a file to load the model a posteriori and be able to apply the learned knowledge to new data. This process was very slow on my computer, so we decided to upload the csv files to Google Drive and open the notebook with Google Colaboratory, as it was faster, which is a key feature to avoid long waiting times during the iterative train-validate process. In the last cells, inference is made with the validation set, and the accuracy is computed, as well as the confusion matrix plotted, to get an idea of which classes are performing better.
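Condensed, the train/validate loop described above amounts to standard scikit-learn code along these lines; the csv name, the split ratio and the dump file name are illustrative, while the variance threshold and C = 10 follow the text.

```python
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = pd.read_csv('features.csv')                 # illustrative file name
X, y = data.drop(columns=['label']), data['label']
X = X.loc[:, X.std() >= 0.00001]                   # drop near-constant features

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  stratify=y)

scaler = StandardScaler().fit(X_train)             # zero mean, unit variance
model = SVC(C=10).fit(scaler.transform(X_train), y_train)

# dump the trained scaler and model for later inference
joblib.dump((scaler, model), 'svm_drums.joblib')
print(model.score(scaler.transform(X_val), y_val))
```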
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection5 does both tasks. As explained in section 2.1.1, there are many methods to do onset detection, and each of them is better for a different application. In the case of drums, we have tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works pretty well, but when the tempo increases there are some onsets that are not correctly detected; for this reason we finally implemented the onset detection with the HFC method.

4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
This method computes the HFC for each window as in equation (4.4); note that high-frequency bins (index k) weigh more in the final value of the HFC.

$$HFC(n) = \sum_k |X_k[n]|^2 \cdot k \qquad (4.4)$$
Moreover, the function plots the audio waveform jointly with the onsets detected, to check if it has worked correctly after each test. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
Figure 11: Onsets detected in a 60 bpm drums interpretation
Figure 12: Onsets detected in a 220 bpm drums interpretation
With the onsets information, the audio can be trimmed into the different events; the order is maintained with the name of the file, so when comparing with the expected events they can be mapped easily. The audios are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a csv.
To predict which event each slice is, the models already trained are loaded in this new environment, and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row in the data. The described process is implemented in the first part of Assessment.ipynb6; the second part is intended to execute the visualization functions described in the next section.
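A hedged sketch of this inference step, with illustrative file names and assuming the scaler and model were dumped together at training time, could be:

```python
import joblib
import pandas as pd

# restore the scaler and SVM saved at training time (file name is illustrative)
scaler, model = joblib.load('svm_drums.joblib')

# one row of features per detected onset, in playing order
slices = pd.read_csv('submission_features.csv')
predicted_events = model.predict(scaler.transform(slices))
print(list(predicted_events))  # e.g. ['hh+kd', 'hh', 'hh+sd', ...]
```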
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving visual feedback on the interpretation to the student. As the drums classifier has taken so much time, the creation of a dataset with interpretations and their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet, simulating different levels of skill, but grading them and doing all the process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed ones.

So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Having as a base implementation the solution of my colleague Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb.

First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.

6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (this comes from the lilypond file's macro). To properly plot the deviations, in the code we are assuming that the interpretation starts four beats after the beginning of the audio.

In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (green good, red bad).
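The traffic-light shading can be sketched with matplotlib as follows; the function and argument names only loosely mirror save_bar_plot(), and the maximum-deviation normalization is an assumption of this sketch.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_onset_deviations(waveform, sr, expected, detected, max_dev=0.1):
    """Shade the gap between each expected/detected onset pair on a
    green-to-red scale (0 = perfect alignment, max_dev = worst)."""
    times = np.arange(len(waveform)) / sr
    plt.plot(times, waveform, color='gray', alpha=0.6)
    for exp, det in zip(expected, detected):
        dev = min(abs(det - exp) / max_dev, 1.0)
        plt.axvline(det, ymax=0.15, color='black')            # detected onset
        plt.axvline(exp, ymin=0.45, ymax=0.55, color='cyan')  # expected onset
        if det != exp:
            plt.axvspan(min(exp, det), max(exp, det),
                        color=(dev, 1.0 - dev, 0.0), alpha=0.4)
    plt.show()
```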
Figure 13: Onset deviation plot of a good tempo submission
Figure 14: Onset deviation plot of a bad tempo submission
Once the waveform plot is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes()10, the expected and predicted events are compared: a list of booleans is created, with 0 at the False indices and 1 at the True ones. Then the resulting list is iterated and the 0 indices are checked, because most of the classification errors fail in one of the instruments to be predicted (i.e. instead of hh+sd it predicts sd). These cases are considered partially correct, as the system has to take into account its own errors: at the indices in which one of the instruments is correctly predicted and it is not a hi-hat (we are considering it more important to get right the snare and kick reading than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct and red means incorrect.

9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15: Example of coloured notes
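The partial-credit logic just described can be sketched as follows; the helper name mirrors assess_notes(), but the body is an illustrative reconstruction, not the repository's code.

```python
def assess_notes(expected, predicted):
    """Score each event: 1.0 for an exact match, 0.75 for a partial match
    on a non-hi-hat instrument, 0.0 otherwise."""
    scores = []
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            scores.append(1.0)
        else:
            shared = set(exp.split('+')) & set(pred.split('+'))
            # partially correct only if a non-hi-hat instrument was recovered
            scores.append(0.75 if shared - {'hh'} else 0.0)
    return scores

# e.g. expected 'hh+sd', predicted 'sd' -> 0.75 (the snare was recovered)
```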
With the waveform, the notes assessed and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the waveform previously created; this is done with the LilyPond macros. On one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the notes assessed); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends, in this case from 0 to 16, which in a 4/4 rhythm is 4 bars; the other number is the scale of the waveform and allows fitting the plot better with the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps; these are found in the txt file already mentioned in section 3.1.1. Then, the LilyPond file: this is the template, written in the LilyPond language, that defines the resultant music sheet; the macros to change color and size and to add the waveform are defined there. When extracting the musical features, each submission creates its csv file to store the information. And finally we need, of course, the audio files with the recorded submission to be assessed.
11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can do an evaluation of the results, to check if it works correctly and is useful for a student to learn, and also to test what the limits are regarding the audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB until an accumulated +30 dB.

In this chapter and Appendix B all the resultant feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.

As the High Frequency Content method works perfectly, there are no limitations nor errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without detecting any false positive.
1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops when the tempo increases. Taking as reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one 0.5 and an incorrect one 0; the total value is the mean of the weighted predictions.
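This weighted mean is straightforward to reproduce; for instance, the 60 bpm row of Table 3:

    def weighted_accuracy(n_correct, n_partial, n_incorrect):
        """Score used in the result tables: correct = 1.0, partially
        correct = 0.5, incorrect = 0.0, averaged over all events."""
        total = n_correct + n_partial + n_incorrect
        return (1.0 * n_correct + 0.5 * n_partial) / total

    print(round(weighted_accuracy(25, 7, 0), 2))  # 0.89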
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of the classifier decreases. This may be because increasing the tempo reduces the spacing between events, and consequently the duration of each event, which leads to fewer
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to calculate the mean and standard deviation when extracting the timbre characteristics. As stated in the law of large numbers [25], the larger the sample, the closer the sample mean is to the population mean. In this case, having fewer values in the calculation creates more outliers in the distribution, which tends to scatter.
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48
Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But in 12/8 at 100 bpm (dotted-quarter beats) the event rate is 300 eighth notes per minute, similar to 140 bpm in 4/4, whose rate is 280 eighth notes per minute. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875
Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figure 23 and Figure 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that the resulting high-frequency content is predicted as a hi-hat in all the cases.
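The saturated test signals can be generated by amplifying the recording and letting it clip; a minimal sketch, assuming float audio in [-1, 1]:

    import numpy as np

    def add_gain_and_clip(audio, gain_db):
        """Amplify by gain_db and hard-clip, as done to create the
        +6 dB ... +30 dB versions of the good submissions."""
        gain = 10 ** (gain_db / 20.0)
        return np.clip(audio * gain, -1.0, 1.0)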
Level   Correct   Partially OK   Incorrect   Total
+0dB    25        7              0           0.89
+6dB    23        9              0           0.86
+12dB   23        9              0           0.86
+18dB   24        7              1           0.86
+24dB   18        5              9           0.64
+30dB   13        5              14          0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level   Correct   Partially OK   Incorrect   Total
+0dB    12        7              13          0.48
+6dB    13        10             9           0.56
+12dB   10        8              14          0.5
+18dB   9         2              21          0.31
+24dB   8         0              24          0.25
+30dB   9         0              23          0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of results has been focused on the accuracy of the drums event classifier, but we think it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat or some of the beats are not read, the system can still map the rest of the events to the expected ones at the corresponding onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar and that the rest of the beats have to come after this interval.
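A sketch of this kind of tolerant mapping (the function name and the 0.25 s tolerance are assumptions, not the project's exact implementation):

    def map_onsets_to_expected(detected, expected, tolerance=0.25):
        """Assign each expected onset the closest detected onset within a
        tolerance in seconds, so a skipped beat leaves a gap (None)
        instead of shifting the rest of the mapping."""
        mapping = []
        for e in expected:
            near = [d for d in detected if abs(d - e) <= tolerance]
            mapping.append(min(near, key=lambda d: abs(d - e)) if near else None)
        return mapping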
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'Bad reading, good tempo' ones shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starts at 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68
Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9
Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs the assessment based on the predictions pretty well. This can help a student to know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help the student recognize if they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the corresponding expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. Also, a set of further improvements is given, together with a final thought on my work and my apprenticeship. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained throughout this document, we can now list them, defining their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of these interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on the context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained to do is very hard for a human: differentiating drums events in an individual drum sample without any context is almost impossible, even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on the visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the studio recordings of the dataset. It would be valuable to get recordings in different spaces and with different drumsets and microphones to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments and even having different people play. This could introduce more variance into the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it is clear that it needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods from [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org
Also, in terms of the classes in the drumset, there is still a long path to cover. There are no solutions that robustly transcribe a whole set, including the toms and the different kinds of cymbals. Here we think that a proper approach would be to work with professional musicians, which would help researchers better understand the instrument and create datasets covering different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained on assessed drums exercises to give a mark to each student. On this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and other researchers and students can execute them, obtaining the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, are available in a Zenodo repository3, and some samples in Freesound4. This is the 29kDrumsSamplesDataset: as not all the 40k samples used for training are our property, we are not able to share them all under our full authorship; despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although work remains in terms of robustness and reliability. A first approximation has been presented, and several paths of improvement have been proposed.
Moreover, several fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also by investigating methods and gathering already existing experiments and results.
Regarding my relationship with computers, I have improved my fluency with git and its web-based counterpart, GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which was a tough path to take. In a more advanced phase of the project, I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that the system produces, and I hope that more people get interested in this field of research so that better tools become available in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1, starts at 60 bpm and adds 60 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018)
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020)
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019)
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013)
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001)
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005)
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981)
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005)
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021)
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010)
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014)
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012)
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020)
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015)
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019)
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017)
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012)
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013)
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014)
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017)
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006)
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013)
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings - 14th International Society for Music Information Retrieval Conference (2013)
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005)
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008)
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
Figure 31 Recording routine 1
Figure 32 Recording routine 2
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
improved the results during the last years and allows us to use this kind of technolo-
gies for a more concrete application like education support [1] The development of
automated assessment tools for the support of musical instrument learning has been
a field of study in the Music Technology Group (MTG) [2] concretely on guitar
performances implemented in the Pysimmusic project [3] One of the open paths
that proposes Eremenko et al [2] is to implement it with different instruments
and this is what I have done
11 Motivation
The aim of the project is to implement a tool to evaluate musical performances
specifically reading scores with drums One possible real application may be to sup-
port the evaluation in a music school allowing the teacher to focus on other items
such as attitude posture and more subtle nuances of performance If high accuracy
is achieved in the automatic assessment of tempo and reading a fair assessment of
these aspects can be ensured In addition a collaboration between a music school
and the MTG allows the use of a specific corpus of data from the educational insti-
tution program this corpus is formed by a set of music sheets and the recordings of
the performances
1
2 Chapter 1 Introduction
Besides this I have been studying drums for fourteen years and a personal motivation
emerges from this fact Learning an instrument is a process that does not only rely
on going to class there is an important load of individual practice apart of the class
indeed Having a tool to assess yourself when practicing would be a nice way to
check your own progress
12 Existing solutions
In terms of music interpretation assessment there are already some software tools
that support assessing several instruments Applications such as Yousician1 or
SmartMusic2 offer from the most basic notions of playing an instrument to a syllabus
of themes to be played These applications return to the students an evaluation that
tells which notes are correctly played and which are not but do not give information
about tempo consistency or dynamics and even less generates a rubric as a teacher
could develop during a class
There are specific applications that support drums learning but in those the feature
of automatic assessment disappears There are some options to get online drums
lessons such as Drumeo3 or Drum School4 but they only offer a list of videos impart-
ing lessons on improving stylistic vocabulary feel improvisation or technique These
applications also offer personal feedback from professional drummers and a person-
alized studying plan but the specific feature of automatic performance assessment
is not implemented
As mentioned in the Introduction automatic music assessment has been a field
of research at the MTG With the development of Music Critic5 an assessment
workflow is proposed and implemented This is useful as can be adapted to the
drums assessment task
1httpsyousiciancom2httpswwwsmartmusiccom3httpswwwdrumeocom4httpsdrumschoolappcom5httpsmusiccriticupfedu
13 Identified challenges 3
13 Identified challenges
As mentioned in [2] there are still improvements to do in the field of music as-
sessment especially analyzing expressivity and with advanced level performances
Taking into account the scope of this project and having as a base case the guitar
assessment exercise from Music Critic some specific challenges are described below
131 Guitar vs drums
As defined in [4] a drumset is a collection of percussion instruments Mainly cym-
bals and drums even though some genres may need to add cowbells tambourines
pailas or other instruments with specific timbres Moreover percussion instruments
are splitted in two families membranophones and idiophones membranophones
produces sound primarily by hitting a stretched membrane tuning the membrane
tension different pitches can be obtained [5] differently idiophones produces sound
by the vibration of the instrument itself and the pitch and timbre are defined by
its own construction [5] The vibration aforementioned is produced by hitting the
instruments generally this hit is made with a wood stick but some genres may need
to use brushes hotrods or mallets to excite specific modes of the instruments With
all this in mind and as stated in [1] it is clear that transcribing drums having to
take into account all its variants and nuances is a hard task even for a professional
drummer With this said there is a need to simplify the problem and to limit the
instruments of the drumset to be transcribed
Returning to the assessment task guitars play notes and chords tuned so the way
to check if a music sheet has been read correctly is looking for the pitch information
and comparing it to the expected one Differently instruments that form a drumset
are mainly unpitched (except toms which are tuned using different scales and tuning
paradigms) so the differences among drums events are on the timbre A different
approach has to be defined in order to check which instrument is being played for
each detected event the first idea is to apply machine learning for sound event
classification
4 Chapter 1 Introduction
Along the project we will refer to the different instruments that conform a drumkit
with abbreviations In Table 1 the legend used is shown the combination of 2 or
more instruments is represented with a rsquo+rsquo symbol between the different tags
Instrument Kick Drum Snare Drum Floor tom Mid tom High tomAbbreviation kd sd ft mt ht
Instrument Hi-hat Ride cymbal Crash cymbalAbbreviation hh cy cr
Table 1 Abbreviationsrsquo legend
132 Dataset creation
Keeping in mind the last idea of the previous section if a machine learning approach
has to be implemented there is a basic need to obtain audio data of drums Apart
from the audio data proper annotations of the drums interpretations are needed in
order to slice them correctly and extract musical features of the different events
The process of gathering data should take into account the different possibilities that
offers a drumset in terms of timbre loudness and tone Several datasets should be
combined as well as additional recordings with different drumsets in order to have
a balanced and representative dataset Moreover to evaluate the assessment task
a set of exercises has to be recorded with different levels of skill
There is also the need to capture those sounds with several frequency responses in
order to make the model independent of the microphone Also those samples could
be processed to get variations of each of them with data augmentation processes
133 Signal quality
Regarding the assignment we have to take into account that a student will not be
able to record its interpretations with a setup as the used in a studio recording most
of the time the recordings will be done using the laptop or mobile phone microphone
This fact has to be taken into account when training the event classifier in order
to do data augmentation and introduce these transformations to the dataset eg
introducing noise to the samples or amplifying to get overload distortion
14 Objectives 5
14 Objectives
The main objective of this project is to develop a tool to assess drums interpretations
of a proposed music sheet This objective has to be split into the different steps of
the pipeline
bull Generate a correctly annotated drums dataset which means a collection of
audio drums recordings and its annotations all equally formatted
bull Implement a drums event sound classifier
bull Find a way to properly visualize drums sheets and their assessment
bull Propose a list of exercises to evaluate the technology
In addition having the code published in a public Github6 repository and uploading
the created dataset to Freesound7 and Zenodo8 will be a good way to share this work
15 Project overview
The next chapters will be developed as follows In chapter 2 the state of the art is
reviewed Focusing on signal processing algorithms and ways to implement sound
event classification ending with music sheet technologies and software tools available
nowadays In chapter 3 the creation of a drums dataset is described Presenting the
use of already available datasets and how new data has been recorded and annotated
In chapter 4 the methodology of the project is detailed which are the algorithms
used for training the classifier as well as how new submissions are processed to assess
them In chapter 5 an evaluation of the results is done pointing out the limitations
and the achievements Chapter 6 concludes with a discussion on the methods used
the work done and further work
6httpsgithubcom7httpsfreesoundorg8httpszenodoorg
Chapter 2
State of the art
In this chapter the concepts and technologies used in the project are explained
covering algorithm references and existing implementations First signal process-
ing techniques on onset detection and feature extraction are reviewed then sound
event classification field is presented and its relationship with drums event classifica-
tion Also the principal music sheet technologies and codecs are presented Finally
specific software tools are listed
21 Signal processing
211 Feature extraction
In the following sections sound event classification will be explained most of these
methods are based on training models using features extracted from the audio not
with the audio chunks indeed [6] In this section signal processing methods to get
those features are presented
Onset detection
In an audio signal an onset is the beginning of a new event it can be either a
single note a chord or in the case of the drums the sound produced by hitting one
or more instruments of the drumset It is necessary to have a reliable algorithm
6
21 Signal processing 7
that properly detects all the onsets of a drums interpretation With the onsets
information (a list of timestamps) the audio can be sliced to analyze each chunk
separately and to assess the tempo consistency
It is important to address the challenge in a psychoacoustical way as the objective
is to detect the musical events as a human will do In [7] the idea of perceptual
onset for percussive instruments is defined as a time interval between the physical
onset and the moment that the maximum level is reached In [8] many methods are
reviewed focusing on the differences of performance depending on the signal Non
Pitched Percussive instruments are better detected with temporal methods or high-
frequency content methods while Pitched Non Percussive instruments may need to
take into account changes of energy in the spectrum distribution as the onset may
represent a different note
The sound generated by the drums is mainly percussive (discarding brushesrsquo slow
patterns or malletrsquos build-ups on the cymbals) which means that is formed by a
short transient followed by a short decay there is no sustain As the transient is a
fast change of energy it implies a high-frequency content because changes happen
in a very little frame of time As recommended in [9] HFC method will be used
Timbre features
As described in [10] a feature denotes in some way a quantity or a value Features
extracted by processing the audio stream or transformations of that (ie FFT)
are called low-level descriptors these features have no relevant information from a
human point of view but are useful for computational processes [11]
Some low-level descriptors are computed from the temporal information for in-
stance the zero-crossing rate tells the number of times the signal crosses the zero
axis per second the attack time is the duration of the transient and temporal cen-
troid the energy distribution of an event during the time Other well known features
are the root median square of the signal or the high-frequency content mentioned
in section 211
8 Chapter 2 State of the art
Besides temporal features low-level descriptors can also be computed from the fre-
quency domain Some of them are spectral flatness spectral roll-off spectral slope
spectral flux ia
Nowadays Essentialsquos library offers a collection of algorithms that reliably extracts
the low-level descriptors aforementioned the function that englobes all the extrac-
tors is called Music extractor1
212 Data augmentation
Data augmentation processes refer to the optimization of the statistical representa-
tion of the datasets in terms of improving the generalization of the resultant models
These methods are based on the introduction of unobserved data or latent variables
that may not be captured during the dataset creation [12]
Regarding this technique applied to audio data signal processing algorithms are
proposed in [13] and [14] that introduces changes to the signals in both time and
frequency domains In these articles the goal is to improve accuracy on speech and
animal sound recognition although this could apply to drums event classification
The processes that lead best results in [13] and [14] were related to time-domain
transformations for instance time-shifting and stretching adding noise or harmonic
distortion compressing in a given dynamic range ia Other processes proposed
were focused on the spectrogram of the signal applying transformations such as
shifting the matrix representation setting to 0 some areas or adding spectrograms
of different samples of the same class
Presently some Python2 libraries are developed and maintained in order to do audio
data augmentation tasks For instance audiomentations3 and the GPU version
torch-audiomentations4
1httpsessentiaupfedustreaming_extractor_musichtml2httpswwwpythonorg3httpspypiorgprojectaudiomentations0604httpspypiorgprojecttorch-audiomentations
22 Sound event classification 9
22 Sound event classification
Sound Event Classification is the task of detecting and recognizing sound events in
an audio stream [15] As described in [10] this task can be approached from two
sides on one hand the perceptual approach tries to extract the timbre similarity to
cluster sounds as how we perceive them on the other hand the taxonomic approach
is determined to label sound events as they are defined in the cultural or user biased
taxonomies In this project the focus is on the second approach as the task is to
classify sound events in the drums taxonomy (ie kick drum snare drum hi-hat)
Also in [] many classification methods are proposed Concretely in the taxonomy
approach machine learning algorithms such as K-Nearest Neighbors Support Vector
Machines or Neural Networks All of them using features extracted from the audio
data as explained in section 211
221 Drums event classification
This section is divided into two parts first presenting the state-of-the-art methods
for drum event classification and then the most relevant existing datasets This
section is mainly based on the article [1] as it is a review of the topic and encompasses
the core concepts of the project
Methods
Focusing on the taxonomic drums events classification this field has been studied for
the last years as in the Music Information Retrieval Evaluation eXchange5 (MIREX)
has been a proposed challenge since 20056 In [1] a review of the main methods
that have been investigated is done The authors collect different approaches such
as Recurrent Neural Networks proposed in [16] Non-Negative matrix factorization
proposed in [17] and others real-time based using MaxMSP7 as described in [18]
5httpswwwmusic-irorgmirexwikiMIREX_HOME6httpswwwmusic-irorgmirexwiki2005Audio_Drum_Detection_Results7httpscycling74comproductsmax
10 Chapter 2 State of the art
It is needed to mention that the proposed methods are focused on Automatic Drum
Transcription (ADT) of drumsets formed only by the kick drum snare drum and
hi-hat ADT field is intended to transcribe audio but in our case we have to check
if an audio event is or not the expected event this particularity can be used in our
favor as some assumptions can be made about the audio that has to be analyzed
Datasets
In addition to the methods and their combinations the data used to train the
system plays a crucial role As a result the dataset may have a big impact on the
generalization capabilities of the models In this section some existing datasets are
described
bull IDMT-SMT-Drums [19] Consists of real drum recordings containing only
kick drum snare drum and hi-hat events Each recording has its transcription
in xml format and is publicly avaliable to download8
bull MDB Drums [20] Consists of real drums recordings of a wide range of genres
drumsets and styles Each recording has two txt transcriptions for the classes
and subclasses defined in [20] (eg class Hi-hat Subclasses Closed hi-hat
open hi-hat pedal hi-hat) It is publicly avaliable to download9
bull ENST-Drums [21] Consists of real drum audio and video recordings of dif-
ferent drummers and drumsets Each recording has its transcription and some
of them include accompaniment audio It is publicly available to download10
bull DREANSS [22] Differently this dataset is a collection of drum recordings
datasets that have been annotated a posteriori It is publicly available to
download11
Electronic drums datasets have not been considered as the student assignment is
supposed to be recorded with a real drumset8httpswwwidmtfraunhoferdeenbusiness_unitsm2dsmtdrumshtml9httpsgithubcomCarlSouthallMDBDrums
10httpspersotelecom-paristechfrgrichardENST-drums11httpswwwupfeduwebmtgdreanss
23 Digital sheet music 11
23 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter
programs from the 80s Proprietary softwares as Finale12 and Sibelius13 or open-
source software as MuseScore14 and LilyPond15 are some options that can be used
nowadays to write music sheets with a computer
In terms of file format Sibelius has its encrypted version that can only be read and
written with the software it can also write and read MusicXML16 files which are
not encrypted and are similar to an HTML file as it contains tags that define the
bars and notes of the music sheet this format is the standard for exchanging digital
music sheet
Within Music Criticrsquos framework the technology used to display the evaluated score
is LilyPond it can be called from the command line and allows adding macros that
change the size or color of the notes The other particularity is that it uses its own
file format (ly) and scores that are in MusicXML format have to be converted and
reviewed
24 Software tools
Many of the concepts and algorithms aforementioned are already developed as soft-
ware libraries this project has been developed with Python and in this section the
libraries that have been used are presented Some of them are open and public and
some others are private as pysimmusic that has been shared with us so we can use
and consult it In addition all the code has been developed using a tool from Google
called Collaboratory17 it allows to write code in a jupyter notebook18 format that
is agile to use and execute interactively
12httpswwwfinalemusiccom13httpswwwavidcomsibelius14httpsmusescoreorg15httpslilypondorg16httpswwwmusicxmlcom17httpscolabresearchgooglecom18httpsjupyterorg
12 Chapter 2 State of the art
241 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis
description and synthesis [23] it can also be installed as a Python-based library
with the pip19 command in Linux or compiling with certain flags in MacOS20 This
library includes a collection of MIR algorithms it is not a framework so it is in the
userrsquos hands how to use these processes Some of the algorithms used in this project
are music feature extraction onset detection and audio file IO
242 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine learning
algorithms for regression classification and clustering as well as pre-processing and
dimensionality reduction functions Based on NumPy22 and SciPy23 so its algorithms
are easy to adapt to the most common data structures used in Python It also allows
to save and load trained models to do inference tasks with new data
243 Lilypond
As described in section 23 LilyPond is an open-source songwriter software with
its file format and language It can produce visual renders of musical sheets in
PNG SVG and PDF formats as well as MIDI files to listen to the compositions
LilyPond works on the command line and allows us to introduce macros to modify
visual aspects of the score such as color or size
It is the digital sheet music technology used within Music Criticrsquos framework as
allows to embed an image in the music sheet generating a parallel representation of
the music sheet and a studentrsquos interpretation
19httpspypiorgprojectpip20httpsessentiaupfeduinstallinghtml21httpsscikit-learnorg22httpsnumpyorg23httpswwwscipyorgscipylibindexhtml
25 Summary 13
244 Pysimmusic
Pysimmusic is a private python library developed at the MTG It offers tools to
analyze the similarity of musical performances and uses libraries such as Essentia
LilyPond FFmpeg24 ia Pysimmusic contains onset detection algorithms and a
collection of audio descriptors and evaluation algorithms By now is the main eval-
uation software used in Music Critic to compare the recording submitted with the
reference
245 Music Critic
Music Critic is a project from the MTG intended to support technologies for online
music education facilitating the assessment of student performances25
The proposed workflow starts with a student submitting a recording playing the
proposed exercise Then the submission is sent to the Music Criticrsquos server where
is analyzed and assessed Finally the student receives the evaluation jointly with
the feedback from the server
25 Summary
Music information retrieval and machine learning have been popular fields of study
This has led to a large development of methods and algorithms that will be crucial
for this project Most of them are free and open-source and fortunately the private
ones have been shared by the UPF research team which is a great base to start the
development
24httpswwwffmpegorg25httpswwwupfeduwebmtgtech-transfer-asset_publisherpYHc0mUhUQ0G
contentid229860881maximizedYJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in section 132 having a well-annotated and balanced dataset is crucial to
get proper results In this section the 40kSamples Drums Dataset creation process is
explained first focusing on how to process existing datasets such as the mentioned
in 221 Secondly introducing the process of creating new datasets with a music
school corpus and a collection of recordings made in a recording studio Finally
describing the data augmentation procedure and how the audio samples are sliced
in individual drums events In Figure 1 we can see the different procedures to unify
the annotations of the different datasets while the audio does not need any specific
modification
31 Existing datasets
Each of the existing datasets has a different annotation format in this section the
process of unifying them will be explained as well as its implementation (see note-
book Dataset_formatUnificationipynb1) As the events to take into account
can be single instruments or combinations of them the annotations have to be for-
matted to show that events properly None of the annotations has this approach
so we have written a function that filters the list and joins the events with a small
1httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_formatUnificationipynb
14
31 Existing datasets 15
difference of time meaning that they are played simultaneously
Music school Studio REC IDMT Drums MDB Drums
audio + txt
Sibelius to MusicXML
MusicXML parser to txt
Write annotations
AnnotationsAudio
Figure 1 Datasets pre-processing
311 MDB Drums
This dataset was the first we worked with the annotation format in txt was a key
factor as it was easy to read and understand As the dataset is available in Github2
there is no need to download it neither process it from a local drive As shown in
the first cells of Dataset_formatUnificationipynb data from the repository can
be retrieved with a Python wrapper of the Github API3
This dataset has two annotations files depending on how deep the taxonomy used
is [20] In this case the generic class taxonomy is used as there is no need to
differentiate styles when playing a given instrument (ie single stroke flam drag
ghost note)
312 IDMT Drums
Differently to the previous dataset this one is only available downloading a zip
file4 It also differs in the annotation file format which is xml Using the Python
2httpsgithubcomCarlSouthallMDBDrums3httpspypiorgprojectgithubpy4httpswwwidmtfraunhoferdeenbusiness_unitsm2dsmtdrumshtml
16 Chapter 3 The 40kSamples Drums Dataset
package xmltodict5 in the second part of Dataset_formatUnificationipynb the
xml files are loaded as a Python dictionary and converted to txt format
32 Created datasets
In order to expand the dataset with more variety of samples other methods to get
data have been explored On one hand with audio data that has partial annotations
or some representation that is not data-driven such as a music sheet that contains
a visual representation of the music but not a logic annotation as mentioned in
the previous section On the other hand generating simple annotations is an easy
task so drums samples can be recorded standalone to create data in a controlled
environment In the next two sections these methods are described
321 Music school
A music school has shared its docent material with the MTG for research purposes
ie audio demos books in pdf format music sheet in Sibelius format As we can
see in Figure 1 the annotations from the music school corpus are in Sibelius format
this is an encrypted representation of the music sheet that can only be opened with
the Sibelius software The MTG has shared an AVID license which includes the
Sibelius software so we were able to convert the sib files to musicxml MusicXML
is not encrypted and allows to open it and read so a parser has been developed to
convert the MusicXML files to a symbolic representation of the music sheet This
representation has been inspired by [24] which proposes a system to represent chords
MusicXML parser
As mentioned in section 23 MusicXML format is based on ordering the visual
information with tags creating a tree structure of nested dictionaries In the first cell
of XML_parseripynb6 two functions are defined ConvertXML2Annotation reads
the musicxml file and gets the general information of the song (ie tempo time
5httpspypiorgprojectxmltodict6httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterXML_parseripynb
32 Created datasets 17
measure title) then a for loops throughout all the bars of the music sheet checking
whereas the given bar is self-defined the repetition of the previous one or the begin
or end of a repetition in the song (see Figure 2) in the self-defined bar case the bar
indeed is passed to an auxiliar function which parses it getting the aforementioned
symbolic representation
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three
others have a symbol that means rsquorepetition of the previous barrsquo moreover the
bar lines at the beginning and the end represents that these four bars have to be
repeated therefore this line in the music score represents an interpretation of eight
bars repeating the first one
The symbolic representation that we propose is based in [24] defines each bar with
a string this string contains the representations of the events in the bar separated
with blank spaces Each of the events has two dots () to separate the figure (ie
quarter note half note whole note) from the note or notes of the event which
are separated by a dot () For instance the symbolic representation of the first bar
in Figure 2 is F4A44 F4A44 F4A44 F4A44
In addition to this conversion in parse_one_measure function from XML_parser
notebook each measure is checked to ensure that fully represents the bar This
means that the sum of the figures of the bar has to be equal to the defined in the
time measure the sum of the events in a 44 bar has to be equal to four quarter
notes
Symbolic notation to unified annotation format
As we can see in Figure 1 once the music scores are converted to the symbolic
representation the last step is to unify the annotations with the used in sections 31
18 Chapter 3 The 40kSamples Drums Dataset
This process is made in the last cells of Dataset_formatUnification7 notebook
A dictionary with the translation of the notes to drums instrument is defined so
the note is already converted Differently the timestamp of each event has to be
computed based on the tempo of the song and the figure of each event this process
is made with the function get_time_steps_from_annotations8 which reads the
interpretation in symbolic notation and accumulates the duration of each event
based on the figure and the tempo
322 Studio recordings
At this point of the dataset creation we realized that the already existing data
was so unbalanced in terms of instances per class some classes had around two
thousand samples while others had only ten This situation was the reason to
record a personalized dataset to balance the overall distribution of classes as well
as exercises with different accuracy when reading simulating students with different
skill levels
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants
Barcelona) the first day was intended to mount the drumset and the microphones
which are listed in Table 2 in Figure 3 the microphone setup is shown differently
to the standard setup in which each instrument of the set has its microphone this
distribution of the microphones was intended to record the whole drumset with
different frequency responses
The recording process was divide into two phases first creating samples to balance
the dataset used to train the drums event classifier (called train set) Then recording
the studentsrsquo assignment simulation to test the whole system (called test set)
7httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_formatUnificationipynb
8httpsgithubcomMaciACtfg_DrumsAssessmentblobe81be958101be005cda805146d3287eec1a2d5a4scriptsdrumspyL9
9httpswwwstereodosiscom
Microphone            Transducer principle
Beyerdynamic TG D70   Dynamic
Shure PG52            Dynamic
Shure SM57            Dynamic
Sennheiser e945       Dynamic
AKG C314              Condenser
AKG C414              Condenser
Shure PG81            Condenser
Samson C03            Condenser

Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only the classes that appear in the music school subset; this decision was motivated by the idea of assessing the songs from the books, so only the classes of the collection of songs were needed to train the classifier. In Figure 4 the distribution of the selected classes before the recordings is shown; note that it is in logarithmic scale, so there is a large difference among classes.
Figure 4 Number of samples before Train set recording
To organize the recording process we designed three different routines; depending on the class and the number of samples already existing, a different routine was recorded. These routines were designed trying to represent the different speeds, dynamics and interactions between instruments of a real interpretation. The routines' scores are shown in Appendix A; to write a generic routine a two-line stave is used, where the bottom line represents the class to be recorded and the top line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound lasts a long period of time and whose tail mixes with the subsequent sound events:
• Routine 1 (Fig. 31): intended for the classes that do not include a crash or ride cymbal and have a small number of samples (i.e. <500).
• Routine 2 (Fig. 32): does not include auxiliary events, as it is intended for the classes that include a crash or ride cymbal, whose interaction with itself is intrinsic.
• Routine 3 (Fig. 33): a short version of routine 1 which repeats each bar two times instead of four; it is intended for the classes that do not include a crash or ride cymbal and have a large number of samples (i.e. >500).
Routines 1 and 3 were recorded only once, as we had only one instrument of each of those classes; differently, routine 2 was recorded twice for each cymbal, as we were able to use more instances of them. The different cymbal configurations used can be seen in Appendix A, in Figures 34, 35 and 36.
After the Train set recording the number of samples was more balanced: as shown in Figure 5, all the classes have at least 1500 samples.
Figure 5 Number of samples after Train set recording (bar chart comparing, per class, the samples recorded before and during the session)
Test set
The test set recording tried to simulate different students performing the same song on the same drumset; to do that, we recorded each song of the music school Drums Grade Initial and Grade 1, playing it correctly and then making mistakes in both reading and rhythm. After testing with these recordings, we realized that we were not able to test the limits of the assessment system in terms of tempo or with different time signatures. So we proposed two groove-reading exercises, in 4/4 and in 12/8, to be performed at different tempos; these recordings have been done in my study room with my laptop's microphone.
3.3 Data augmentation
As described in section 2.1.2, data augmentation aims to introduce changes to the signals to improve the statistical representation of the dataset. To implement this task the aforementioned Python library audiomentations is used.
The library audiomentations has a class called Compose which allows collecting different processing functions, assigning a probability to each of them. The Compose instance can then be called several times with the same audio file, and each time the resulting audio will be processed differently because of the probabilities. In data_augmentation.ipynb¹⁰ a possible implementation is shown, as well as some plots of the original sample next to different results of applying the created Compose to the same sample; an example of the results can be listened to in Freesound¹¹.
The processing functions introduced in the Compose are based on those proposed in [13] and [14]; their parameters are described below, followed by a sketch of the implementation:
• Add Gaussian noise, with 70% probability.
• Time stretch between 0.8 and 1.25, with 50% probability.
• Time shift forward a maximum of 25% of the duration, with 50% probability.
• Pitch shift of ±2 semitones, with 50% probability.
• Apply mp3 compression, with 50% probability.
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213
3.4 Drums events trim
As will be explained in section 4.2.1, the dataset has to be trimmed into individual files in order to analyze them and extract the low-level descriptors. In the Dataset_featureExtraction.ipynb¹² notebook this process has been implemented, slicing all the audios with their annotations, each dataset separately, in order to sight-check all the resultant samples and better detect which annotations were not correct.
3.5 Summary
To summarize, a drums samples dataset has been created; the one used in this project will be called the 40k Samples Drums Dataset. Nonetheless, to share this dataset we have to ensure that we fully own the data, which means that the samples that come from the IDMT, MDB Drums and music school datasets cannot be shared in another dataset. Alternatively, we will share the 29k Samples Drums Dataset, formed only by the samples recorded in the studio. This dataset will be available in Zenodo¹³, to download the whole dataset at once, and in Freesound some selected samples are uploaded in a pack¹⁴ to show the differences among microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592#.YMmNXW4p5TZ
14 https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is inspired by [2]. Each box of the diagram refers to a section of this chapter, so the diagram might be helpful to get a general idea of the problem while each process is explained.
The system is divided into two main processes. First, the top boxes correspond to the training process of the model, using the dataset created in the previous chapter. Secondly, the bottom row shows how a student submission is processed to generate some feedback. This feedback is the output of the system and should give the student some indications on how they have performed and how they can improve.
4.1 Problem definition
To check whether a student reads a music sheet correctly, we need a tool that tags, for each detected event, which instruments of the drumset are being played. This leads us to develop and train a drums event classifier: if this tool ensures a good classification accuracy (i.e. above 95%), we will be able to properly assess a student's recording. If the classifier is not accurate enough, the system will not be useful, as we will not be able to differentiate between errors of the student and errors of the classifier.
[Diagram: a dataset (music scores, students' performances, annotations, audio recordings, assessments) feeds feature extraction, drums event classifier training and performance assessment training on top; on the bottom, a new student's recording goes through feature extraction, performance assessment inference and visualization to produce the performance feedback.]
Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason the project has been mainly focused on developing the aforementioned drums event classifier and a proper dataset. Consequently, developing a properly assessed dataset of drums interpretations has not been possible, nor has the performance assessment training. Despite this, the feedback visualization has been developed, as it is a nice way to close the pipeline and get some understandable results; moreover, the performance feedback can focus on deterministic aspects, such as telling students whether they are rushing or slowing down in relation to a given tempo.
4.2 Drums event classifier
As already mentioned, this section has been the main workload of this project, since a reliable assessment depends on a correct automatic transcription. The process has been divided into three main parts: extracting the musical features, training and validating the model in an iterative process, and finally testing the model with totally new data.
4.2.1 Feature extraction
The feature extraction concept has been explained in section 2.1.1, and it has been implemented using the MusicExtractor()¹ method from the Essentia library. MusicExtractor() has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()² has been implemented to loop over a list of files and analyze each of them, adding the extracted features to a csv file jointly with the class of each drum event. At this point all the low-level features were extracted; both the mean and the standard deviation were computed across all the frames of the given audio file. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.
As mentioned in section 3.4, the fact that MusicExtractor() has to be called with a filename, not with an audio stream, forced us to create another version of the dataset in which each annotated event lives in a different audio file, with the correspondent class label as filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples) and the features were saved in different csv files, one per dataset³. Adding up the number of instances in all the csv files, we get 40228 instances with 84 features and 1 label.
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices_features
4.2.2 Training and validating
As mentioned in section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) for sound event classification, while other authors have developed more complex methods for drums event classification. The complexity of these last methods made me choose the generic ones, also to test whether they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40k Samples Drums Dataset. The first times we trained the models we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula (4.1), which does not take the unbalance of the dataset into account. The accuracy computation was around 92%, but the correct predictions were mainly on the large classes; as shown in Figure 7, some classes had very low accuracy (even 0%: some classes had 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances affects the accuracy computation less.
\[
\text{accuracy}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} 1(\hat{y}_i = y_i) \tag{4.1}
\]
Instead, the proper way to compute the accuracy on this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages it over all the classes, as in formula (4.2), where w_i represents the weight of each class in the dataset. This computation lowered the result to 79%, which was not a good result.
\[
\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \tag{4.2}
\]
\[
\text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i
\]
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the F-score, which combines the precision and the recall of the model in one measure, as in formula (4.3). Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of instances of the given class.
\[
F = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{4.3}
\]
These results led us to record a personalized dataset to extend the already existing data (see section 3.2.2). With this new distribution the results improved, as shown in Figure 8, and so did the balanced accuracy and the F-score (both 89%). Until this point we were using both the KNN and the SVM models to compare results; the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter of a support vector machine controls the regularization; this technique is intended to make the model less sensitive to data noise and to outliers that may not represent the class properly. When increasing this value to 10 the results improved among all the classes, as shown in Figure 9, and so did the accuracy and the F-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was somewhat of a problem, as it is one of the most used instruments of the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with the classes that include only the three mentioned instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and F-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The training and validating iterative process has been implemented in the Classifier_training.ipynb⁴ notebook. First, the csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered and, to remove redundant features, those with a very low standard deviation are deleted (i.e. std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to center all the feature distributions around 0 while ensuring a standard deviation of 1.
In the next cells the dataset is split into train and validation sets, and the training method of sklearn's SVM is called to perform the training; once the models are trained, their parameters are dumped to a file, so the model can be loaded a posteriori to apply the learned knowledge to new data. This process was very slow on my computer, so we decided to upload the csv files to Google Drive and open the notebook with Google Colaboratory, which was faster, a key point to avoid long waiting times during the iterative train-validate process. In the last cells the inference is made with the validation set and the accuracy is computed, as well as the confusion matrix plotted, to get an idea of which classes perform better.
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection⁵ does both tasks. As explained in section 2.1.1, there are many methods for onset detection and each of them suits a different application. In the case of drums we tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works pretty well; but when the tempo increases some onsets are not correctly detected, so we finally implemented the onset detection with the HFC method.
4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
This method computes, for each window, the HFC as in equation (4.4); note that higher frequency bins (the k index) weigh more in the final value of the HFC.
\[
\text{HFC}(n) = \sum_k |X_k[n]|^2 \cdot k \tag{4.4}
\]
Moreover, the function plots the audio waveform jointly with the detected onsets, to check after each test whether the detection has worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs. A sketch of this detection follows the figures.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
With the onsets information the audio can be trimmed into the different events; the order is kept in the name of each file, so the events can easily be mapped to the expected ones when comparing. The audios are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a csv file.
To predict which event each slice contains, the already trained models are loaded in this new environment and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row of the data, as sketched below. The described process is implemented in the first part of Assessment.ipynb⁶; the second part executes the visualization functions described in the next section.
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving the student visual feedback on the interpretation. As the drums classifier has taken so much time, the creation of a dataset with interpretations and their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different levels of skill, but grading them and doing the whole process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed ones.
So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Having as a base the implementation of my colleague Eduard Vergés⁷, and thanks to the help of Vsevolod Eremenko⁸, the visualization is done in the last cell of the notebook Assessment.ipynb.
First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()⁹ is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (this comes from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio.
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (green good, red bad), as sketched after the figures.
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
Once the waveform plot is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes()¹⁰ the expected and predicted events are compared: a list of booleans is created, with 0 at the indices where they differ and 1 where they match; then the resulting list is iterated and the 0 indices are checked, because most of the classification errors fail in only one of the instruments to be predicted (i.e. instead of hh+sd it predicts sd). These cases are considered partially correct, as the system has to take its own errors into account: at the indices where one of the instruments is correctly predicted and it is not a hi-hat (we consider it more important to get the snare and kick reading right than a hi-hat, which is present in almost all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green partially correct, and red incorrect. A sketch of this scoring follows the figure.
9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
With the waveform, the assessed notes and the LilyPond template, the function score_image()¹¹ can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On one hand, before each note on the staff, the keywords color() and size() determine that the color and size of the note depend on an external variable (the assessed notes); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beats the waveform starts and ends being displayed (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars), while the other number is the scale of the waveform and allows the plot to fit better with the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps; these are found in the txt file already mentioned in section 3.1.1. Then the LilyPond file: this is the template, written in LilyPond language, that defines the resultant music sheet; the macros to change color and size and to add the waveform are defined there. When extracting the musical features, each submission creates its csv file to store the information. And finally we need, of course, the audio files with the recorded submission to be assessed.
11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB at a time, up to an accumulated +30 dB.
In this chapter and Appendix B all the resultant feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack¹ has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high frequency content method works perfectly, there are no limitations nor errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positive.
1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops when the tempo increases. Having as a reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one 0.5 and an incorrect one 0; the total value is the mean of the weighted predictions. For example, at 60 bpm: (25 + 0.5 · 7) / 32 = 0.89.
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of the classifier decreases. This may be because increasing the tempo reduces the spacing between events, and consequently the duration of each event, which leaves fewer
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to calculate the mean and standard deviation when extracting the timbre characteristics. As stated in the law of large numbers [25], the larger the sample, the closer its mean is to the population mean; in this case, having fewer values in the calculation creates more outliers in the distribution, which tends to scatter.
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48

Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But at 100 bpm in 12/8 the subdivision runs at 300 eighth notes per minute, similar to 140 bpm in 4/4, where the eighth notes run at 280 per minute. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875

Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that it produces high-frequency content which is predicted as a hi-hat in all the cases.
Level    Correct   Partially OK   Incorrect   Total
+0dB     25        7              0           0.89
+6dB     23        9              0           0.86
+12dB    23        9              0           0.86
+18dB    24        7              1           0.86
+24dB    18        5              9           0.64
+30dB    13        5              14          0.48

Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0dB     12        7              13          0.48
+6dB     13        10             9           0.56
+12dB    10        8              14          0.5
+18dB    9         2              21          0.31
+24dB    8         0              24          0.25
+30dB    9         0              23          0.28

Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of results has been focused on the accuracy of the drums event classifier, but we think it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in the previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized; it works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system performs the prediction-based assessment pretty well. This can be helpful for a student to know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help students recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (the black lines at the bottom of the waveform) are mostly behind the correspondent expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not, a set of further improvements is given, and a final thought on my work and my apprenticeship is shared. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained along this document, we can now list them, stating their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of such interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on the context might be used, such as the ones proposed in [16] and [17].
It is important to take into account that the task the model is trained for is very hard for a human being: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on the visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the 29k Samples Drums Dataset. It would be good to get recordings in different spaces, with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even having different people play; this would introduce more variance into the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 2021¹. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods of [16], [17] and [26] into the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org
Also, in terms of the classes of the drumset, there is a long path still to cover: there are no solutions that robustly transcribe a whole set, including the toms and the different kinds of cymbals. In this sense, we think that a proper approach would be to work with professional musicians, who can help researchers better understand the instrument and create datasets covering different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises to give each student a mark. In this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and to develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences a work is reproducible if the code and data are available and other researchers or students can execute them, obtaining the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository², as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository³, and some samples are in Freesound⁴. This is the 29k Samples Drums Dataset: as not all the 40k samples used for training are our property, we are not able to share them under our full authorship; despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although the work still lacks robustness and reliability; a first approximation has been presented, and several paths of improvement have been proposed.
Moreover, several fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also by investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web counterpart GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which was a tough path to take. In a more advanced phase of the project I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that the system produces, and I hope that more people get interested in this field of research so that better tools arrive in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system, inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings – 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
Figure 31 Recording routine 1
Figure 32 Recording routine 2
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
improved the results during the last years and allows us to use this kind of technolo-
gies for a more concrete application like education support [1] The development of
automated assessment tools for the support of musical instrument learning has been
a field of study in the Music Technology Group (MTG) [2] concretely on guitar
performances implemented in the Pysimmusic project [3] One of the open paths
that proposes Eremenko et al [2] is to implement it with different instruments
and this is what I have done
11 Motivation
The aim of the project is to implement a tool to evaluate musical performances
specifically reading scores with drums One possible real application may be to sup-
port the evaluation in a music school allowing the teacher to focus on other items
such as attitude posture and more subtle nuances of performance If high accuracy
is achieved in the automatic assessment of tempo and reading a fair assessment of
these aspects can be ensured In addition a collaboration between a music school
and the MTG allows the use of a specific corpus of data from the educational insti-
tution program this corpus is formed by a set of music sheets and the recordings of
the performances
1
2 Chapter 1 Introduction
Besides this I have been studying drums for fourteen years and a personal motivation
emerges from this fact Learning an instrument is a process that does not only rely
on going to class there is an important load of individual practice apart of the class
indeed Having a tool to assess yourself when practicing would be a nice way to
check your own progress
12 Existing solutions
In terms of music interpretation assessment there are already some software tools
that support assessing several instruments Applications such as Yousician1 or
SmartMusic2 offer from the most basic notions of playing an instrument to a syllabus
of themes to be played These applications return to the students an evaluation that
tells which notes are correctly played and which are not but do not give information
about tempo consistency or dynamics and even less generates a rubric as a teacher
could develop during a class
There are specific applications that support drums learning but in those the feature
of automatic assessment disappears There are some options to get online drums
lessons such as Drumeo3 or Drum School4 but they only offer a list of videos impart-
ing lessons on improving stylistic vocabulary feel improvisation or technique These
applications also offer personal feedback from professional drummers and a person-
alized studying plan but the specific feature of automatic performance assessment
is not implemented
As mentioned in the Introduction automatic music assessment has been a field
of research at the MTG With the development of Music Critic5 an assessment
workflow is proposed and implemented This is useful as can be adapted to the
drums assessment task
1httpsyousiciancom2httpswwwsmartmusiccom3httpswwwdrumeocom4httpsdrumschoolappcom5httpsmusiccriticupfedu
13 Identified challenges 3
13 Identified challenges
As mentioned in [2] there are still improvements to do in the field of music as-
sessment especially analyzing expressivity and with advanced level performances
Taking into account the scope of this project and having as a base case the guitar
assessment exercise from Music Critic some specific challenges are described below
131 Guitar vs drums
As defined in [4] a drumset is a collection of percussion instruments Mainly cym-
bals and drums even though some genres may need to add cowbells tambourines
pailas or other instruments with specific timbres Moreover percussion instruments
are splitted in two families membranophones and idiophones membranophones
produces sound primarily by hitting a stretched membrane tuning the membrane
tension different pitches can be obtained [5] differently idiophones produces sound
by the vibration of the instrument itself and the pitch and timbre are defined by
its own construction [5] The vibration aforementioned is produced by hitting the
instruments generally this hit is made with a wood stick but some genres may need
to use brushes hotrods or mallets to excite specific modes of the instruments With
all this in mind and as stated in [1] it is clear that transcribing drums having to
take into account all its variants and nuances is a hard task even for a professional
drummer With this said there is a need to simplify the problem and to limit the
instruments of the drumset to be transcribed
Returning to the assessment task guitars play notes and chords tuned so the way
to check if a music sheet has been read correctly is looking for the pitch information
and comparing it to the expected one Differently instruments that form a drumset
are mainly unpitched (except toms which are tuned using different scales and tuning
paradigms) so the differences among drums events are on the timbre A different
approach has to be defined in order to check which instrument is being played for
each detected event the first idea is to apply machine learning for sound event
classification
4 Chapter 1 Introduction
Along the project we will refer to the different instruments that conform a drumkit
with abbreviations In Table 1 the legend used is shown the combination of 2 or
more instruments is represented with a rsquo+rsquo symbol between the different tags
Instrument Kick Drum Snare Drum Floor tom Mid tom High tomAbbreviation kd sd ft mt ht
Instrument Hi-hat Ride cymbal Crash cymbalAbbreviation hh cy cr
Table 1 Abbreviationsrsquo legend
132 Dataset creation
Keeping in mind the last idea of the previous section if a machine learning approach
has to be implemented there is a basic need to obtain audio data of drums Apart
from the audio data proper annotations of the drums interpretations are needed in
order to slice them correctly and extract musical features of the different events
The process of gathering data should take into account the different possibilities that
offers a drumset in terms of timbre loudness and tone Several datasets should be
combined as well as additional recordings with different drumsets in order to have
a balanced and representative dataset Moreover to evaluate the assessment task
a set of exercises has to be recorded with different levels of skill
There is also the need to capture those sounds with several frequency responses in
order to make the model independent of the microphone Also those samples could
be processed to get variations of each of them with data augmentation processes
133 Signal quality
Regarding the assignment we have to take into account that a student will not be
able to record its interpretations with a setup as the used in a studio recording most
of the time the recordings will be done using the laptop or mobile phone microphone
This fact has to be taken into account when training the event classifier in order
to do data augmentation and introduce these transformations to the dataset eg
introducing noise to the samples or amplifying to get overload distortion
14 Objectives 5
14 Objectives
The main objective of this project is to develop a tool to assess drums
interpretations of a proposed music sheet. This objective has to be split into
the different steps of the pipeline:
• Generate a correctly annotated drums dataset, that is, a collection of drums
audio recordings and their annotations, all equally formatted.
• Implement a drums event sound classifier.
• Find a way to properly visualize drums sheets and their assessment.
• Propose a list of exercises to evaluate the technology.
In addition, publishing the code in a public GitHub6 repository and uploading
the created dataset to Freesound7 and Zenodo8 will be a good way to share this work.
1.5 Project overview
The next chapters are organized as follows. Chapter 2 reviews the state of the
art, focusing on signal processing algorithms and ways to implement sound event
classification, and ending with music sheet technologies and the software tools
available nowadays. Chapter 3 describes the creation of a drums dataset,
presenting the use of already available datasets and how new data has been
recorded and annotated. Chapter 4 details the methodology of the project: the
algorithms used for training the classifier, as well as how new submissions are
processed to assess them. Chapter 5 evaluates the results, pointing out the
limitations and the achievements. Chapter 6 concludes with a discussion of the
methods used, the work done and further work.
6 https://github.com
7 https://freesound.org
8 https://zenodo.org
Chapter 2
State of the art
In this chapter, the concepts and technologies used in the project are
explained, covering algorithm references and existing implementations. First,
signal processing techniques for onset detection and feature extraction are
reviewed; then the sound event classification field is presented, together with
its relationship to drums event classification. The principal music sheet
technologies and file formats are also presented. Finally, specific software
tools are listed.
2.1 Signal processing
2.1.1 Feature extraction
In the following sections, sound event classification will be explained; most of
these methods are based on training models with features extracted from the
audio, not with the raw audio chunks themselves [6]. In this section, signal
processing methods to obtain those features are presented.
Onset detection
In an audio signal, an onset is the beginning of a new event; it can be a single
note, a chord or, in the case of drums, the sound produced by hitting one or
more instruments of the drumset. It is necessary to have a reliable algorithm
that properly detects all the onsets of a drums interpretation. With the onset
information (a list of timestamps), the audio can be sliced to analyze each
chunk separately and to assess the tempo consistency.
It is important to address the challenge in a psychoacoustical way, as the
objective is to detect the musical events as a human would. In [7], the idea of
a perceptual onset for percussive instruments is defined as a time interval
between the physical onset and the moment the maximum level is reached. In [8],
many methods are reviewed, focusing on how their performance depends on the
signal: non-pitched percussive instruments are better detected with temporal
methods or high-frequency content methods, while pitched non-percussive
instruments may need to take into account changes of energy in the spectral
distribution, as the onset may represent a different note.
The sound generated by the drums is mainly percussive (discarding brushes' slow
patterns or mallet build-ups on the cymbals), which means it is formed by a
short transient followed by a short decay; there is no sustain. As the transient
is a fast change of energy, it implies high-frequency content, because the
change happens in a very short frame of time. As recommended in [9], the HFC
method will be used.
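As an illustration, a minimal sketch of HFC-based onset detection with
Essentia's Python bindings follows; it mirrors the library's documented usage,
with an illustrative filename and frame sizes:

import essentia
import essentia.standard as es

audio = es.MonoLoader(filename="interpretation.wav")()  # placeholder filename

od = es.OnsetDetection(method="hfc")  # high-frequency content method
w = es.Windowing(type="hann")
fft = es.FFT()
c2p = es.CartesianToPolar()

pool = essentia.Pool()
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    magnitude, phase = c2p(fft(w(frame)))
    pool.add("odf.hfc", od(magnitude, phase))

# Peak-picking over the detection function returns onset times in seconds.
onsets = es.Onsets()(essentia.array([pool["odf.hfc"]]), [1])
print(onsets)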
Timbre features
As described in [10], a feature denotes, in some way, a quantity or a value.
Features extracted by processing the audio stream, or transformations of it
(i.e., the FFT), are called low-level descriptors; these features carry no
relevant information from a human point of view, but are useful for
computational processes [11].
Some low-level descriptors are computed from the temporal information: for
instance, the zero-crossing rate counts the number of times the signal crosses
the zero axis per second, the attack time is the duration of the transient, and
the temporal centroid describes the energy distribution of an event over time.
Other well-known features are the root mean square of the signal or the
high-frequency content mentioned in section 2.1.1.
Besides temporal features, low-level descriptors can also be computed from the
frequency domain. Some of them are spectral flatness, spectral roll-off,
spectral slope and spectral flux, among others.
Nowadays, Essentia's library offers a collection of algorithms that reliably
extract the aforementioned low-level descriptors; the function that wraps all
the extractors is called MusicExtractor1.
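A minimal usage sketch (the filename is a placeholder; the printed descriptor
key is one of the many returned by the extractor):

import essentia.standard as es

# Mean and standard deviation of the low-level descriptors of one file.
extractor = es.MusicExtractor(lowlevelStats=["mean", "stdev"])
features, features_frames = extractor("kick_sample.wav")

print(sorted(features.descriptorNames())[:5])   # available descriptor keys
print(features["lowlevel.spectral_flux.mean"])  # one aggregated value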
2.1.2 Data augmentation
Data augmentation refers to optimizing the statistical representation of a
dataset in order to improve the generalization of the resulting models. These
methods are based on the introduction of unobserved data or latent variables
that may not be captured during the dataset creation [12].
Regarding the application of this technique to audio data, signal processing
algorithms that introduce changes to the signals in both the time and frequency
domains are proposed in [13] and [14]. In these articles, the goal is to improve
accuracy on speech and animal sound recognition, although the same ideas apply
to drums event classification.
The processes that yielded the best results in [13] and [14] were related to
time-domain transformations, for instance time shifting and stretching, adding
noise or harmonic distortion, and compressing into a given dynamic range, among
others. Other proposed processes focused on the spectrogram of the signal,
applying transformations such as shifting the matrix representation, setting
some areas to 0, or adding spectrograms of different samples of the same class.
Presently, some Python2 libraries are developed and maintained for audio data
augmentation tasks, for instance audiomentations3 and its GPU version,
torch-audiomentations4.
1 https://essentia.upf.edu/streaming_extractor_music.html
2 https://www.python.org
3 https://pypi.org/project/audiomentations/0.6.0/
4 https://pypi.org/project/torch-audiomentations/
2.2 Sound event classification
Sound event classification is the task of detecting and recognizing sound events
in an audio stream [15]. As described in [10], this task can be approached from
two sides: on one hand, the perceptual approach tries to extract timbre
similarity to cluster sounds as we perceive them; on the other hand, the
taxonomic approach aims to label sound events as they are defined in cultural or
user-biased taxonomies. In this project the focus is on the second approach, as
the task is to classify sound events within the drums taxonomy (i.e., kick drum,
snare drum, hi-hat).
Also in [10], many classification methods are proposed; concretely, within the
taxonomic approach, machine learning algorithms such as K-Nearest Neighbors,
Support Vector Machines or Neural Networks, all of them using features extracted
from the audio data, as explained in section 2.1.1.
2.2.1 Drums event classification
This section is divided into two parts: first, the state-of-the-art methods for
drums event classification are presented, and then the most relevant existing
datasets. This section is mainly based on [1], as it is a review of the topic
and encompasses the core concepts of the project.
Methods
Focusing on taxonomic drums event classification, this field has been studied
for years; in the Music Information Retrieval Evaluation eXchange5 (MIREX) it
has been a proposed challenge since 20056. In [1], a review of the main methods
investigated is presented. The authors collect different approaches, such as
Recurrent Neural Networks, proposed in [16], Non-negative Matrix Factorization,
proposed in [17], and other real-time approaches based on Max/MSP7, as described
in [18].
5 https://www.music-ir.org/mirex/wiki/MIREX_HOME
6 https://www.music-ir.org/mirex/wiki/2005:Audio_Drum_Detection_Results
7 https://cycling74.com/products/max
It should be mentioned that the proposed methods focus on Automatic Drum
Transcription (ADT) of drumsets formed only by the kick drum, snare drum and
hi-hat. The ADT field is intended to transcribe audio, but in our case we have
to check whether an audio event is the expected event or not; this particularity
can be used in our favor, as some assumptions can be made about the audio to be
analyzed.
Datasets
In addition to the methods and their combinations, the data used to train the
system plays a crucial role; as a result, the dataset may have a big impact on
the generalization capabilities of the models. In this section, some existing
datasets are described.
• IDMT-SMT-Drums [19]: consists of real drum recordings containing only kick
drum, snare drum and hi-hat events. Each recording has its transcription in XML
format, and it is publicly available to download8.
• MDB Drums [20]: consists of real drums recordings covering a wide range of
genres, drumsets and styles. Each recording has two txt transcriptions, for the
classes and subclasses defined in [20] (e.g., class: hi-hat; subclasses: closed
hi-hat, open hi-hat, pedal hi-hat). It is publicly available to download9.
• ENST-Drums [21]: consists of real drum audio and video recordings of different
drummers and drumsets. Each recording has its transcription, and some of them
include accompaniment audio. It is publicly available to download10.
• DREANSS [22]: differently, this dataset is a collection of drum recording
datasets that have been annotated a posteriori. It is publicly available to
download11.
Electronic drums datasets have not been considered, as the student assignment is
supposed to be recorded with a real drumset.
8 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
9 https://github.com/CarlSouthall/MDBDrums
10 https://perso.telecom-paristech.fr/grichard/ENST-drums/
11 https://www.upf.edu/web/mtg/dreanss
2.3 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter
programs of the 80s. Proprietary software such as Finale12 and Sibelius13, or
open-source software such as MuseScore14 and LilyPond15, are some options that
can be used nowadays to write music sheets with a computer.
In terms of file format, Sibelius has its encrypted version, which can only be
read and written with the software; it can also write and read MusicXML16 files,
which are not encrypted and are similar to an HTML file, as they contain tags
that define the bars and notes of the music sheet. This format is the standard
for exchanging digital sheet music.
Within Music Critic's framework, the technology used to display the evaluated
score is LilyPond; it can be called from the command line and allows adding
macros that change the size or color of the notes. The other particularity is
that it uses its own file format (.ly), so scores in MusicXML format have to be
converted and reviewed.
2.4 Software tools
Many of the concepts and algorithms mentioned above are already implemented as
software libraries. This project has been developed in Python, and in this
section the libraries that have been used are presented. Some of them are open
and public, while others are private, such as pysimmusic, which has been shared
with us so we can use and consult it. In addition, all the code has been
developed using a tool from Google called Colaboratory17; it allows writing code
in the Jupyter notebook18 format, which is agile to use and execute
interactively.
12 https://www.finalemusic.com
13 https://www.avid.com/sibelius
14 https://musescore.org
15 https://lilypond.org
16 https://www.musicxml.com
17 https://colab.research.google.com
18 https://jupyter.org
2.4.1 Essentia
Essentia is an open-source C++ library of algorithms for audio and music
analysis, description and synthesis [23]; it can also be installed as a Python
library with the pip19 command in Linux, or by compiling with certain flags in
macOS20. This library includes a collection of MIR algorithms; it is not a
framework, so it is in the user's hands how to use these processes. Some of the
algorithms used in this project are music feature extraction, onset detection
and audio file I/O.
2.4.2 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine
learning algorithms for regression, classification and clustering, as well as
pre-processing and dimensionality reduction functions. It is based on NumPy22
and SciPy23, so its algorithms are easy to adapt to the most common data
structures used in Python. It also allows saving and loading trained models to
do inference on new data.
2.4.3 LilyPond
As described in section 2.3, LilyPond is an open-source scorewriter software
with its own file format and language. It can produce visual renders of music
sheets in PNG, SVG and PDF formats, as well as MIDI files to listen to the
compositions. LilyPond works on the command line and allows us to introduce
macros to modify visual aspects of the score, such as color or size.
It is the digital sheet music technology used within Music Critic's framework,
as it allows embedding an image in the music sheet, generating a parallel
representation of the music sheet and a student's interpretation.
19 https://pypi.org/project/pip/
20 https://essentia.upf.edu/installing.html
21 https://scikit-learn.org
22 https://numpy.org
23 https://www.scipy.org/scipylib/index.html
2.4.4 Pysimmusic
Pysimmusic is a private Python library developed at the MTG. It offers tools to
analyze the similarity of musical performances, and uses libraries such as
Essentia, LilyPond and FFmpeg24, among others. Pysimmusic contains onset
detection algorithms and a collection of audio descriptors and evaluation
algorithms. By now, it is the main evaluation software used in Music Critic to
compare the submitted recording with the reference.
2.4.5 Music Critic
Music Critic is a project from the MTG intended to support technologies for
online music education, facilitating the assessment of student performances25.
The proposed workflow starts with a student submitting a recording of the
proposed exercise. The submission is then sent to Music Critic's server, where
it is analyzed and assessed. Finally, the student receives the evaluation
jointly with the feedback from the server.
2.5 Summary
Music information retrieval and machine learning have been popular fields of
study. This has led to a large development of methods and algorithms that will
be crucial for this project. Most of them are free and open-source, and
fortunately the private ones have been shared by the UPF research team, which is
a great base to start the development.
24 https://www.ffmpeg.org
25 https://www.upf.edu/web/mtg/tech-transfer/-/asset_publisher/pYHc0mUhUQ0G/content/id/229860881/maximized
Chapter 3
The 40kSamples Drums Dataset
As stated in section 1.3.2, having a well-annotated and balanced dataset is
crucial to get proper results. In this chapter, the 40kSamples Drums Dataset
creation process is explained: first, focusing on how to process existing
datasets, such as those mentioned in section 2.2.1; secondly, introducing the
process of creating new datasets from a music school corpus and a collection of
recordings made in a recording studio; finally, describing the data augmentation
procedure and how the audio samples are sliced into individual drums events. In
Figure 1 we can see the different procedures used to unify the annotations of
the different datasets, while the audio does not need any specific modification.
3.1 Existing datasets
Each of the existing datasets has a different annotation format; in this section
the process of unifying them is explained, as well as its implementation (see
notebook Dataset_formatUnification.ipynb1). As the events to take into account
can be single instruments or combinations of them, the annotations have to be
formatted to represent these events properly. None of the annotations follows
this approach, so we have written a function that filters the list and joins the
events separated by a very small time difference, meaning that they are played
simultaneously.
1 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
[Diagram: the four data sources (Music school, Studio REC, IDMT Drums, MDB
Drums) are unified. Sibelius scores from the music school are converted to
MusicXML and parsed to txt; the studio recordings come as audio plus txt; the
resulting annotations are written in a common format jointly with the audio.]
Figure 1 Datasets pre-processing
3.1.1 MDB Drums
This dataset was the first we worked with; its txt annotation format was a key
factor, as it was easy to read and understand. As the dataset is available on
GitHub2, there is no need to download it or process it from a local drive. As
shown in the first cells of Dataset_formatUnification.ipynb, the data from the
repository can be retrieved with a Python wrapper of the GitHub API3.
This dataset has two annotation files, depending on how deep the taxonomy used
is [20]. In this case, the generic class taxonomy is used, as there is no need
to differentiate playing styles for a given instrument (i.e., single stroke,
flam, drag, ghost note).
3.1.2 IDMT Drums
Unlike the previous dataset, this one is only available by downloading a zip
file4. It also differs in the annotation file format, which is XML.
2 https://github.com/CarlSouthall/MDBDrums
3 https://pypi.org/project/githubpy/
4 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
Using the Python package xmltodict5, in the second part of
Dataset_formatUnification.ipynb the XML files are loaded as Python dictionaries
and converted to txt format.
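A minimal sketch of that loading step (the path is illustrative, and the exact
tag names depend on the IDMT annotation schema, so they are not hard-coded here):

import xmltodict

# Parse an IDMT annotation file into nested dictionaries.
with open("WaveDrum02_01.xml") as f:
    doc = xmltodict.parse(f.read())

# XML tags become dictionary keys, so the annotated events can be iterated
# and rewritten as txt lines of the form "<onset>\t<label>".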
3.2 Created datasets
In order to expand the dataset with a greater variety of samples, other methods
to get data have been explored. On one hand, there is audio data that has
partial annotations, or some representation that is not data-driven, such as a
music sheet, which contains a visual representation of the music but not a logic
annotation, as mentioned in the previous section. On the other hand, generating
simple annotations is an easy task, so drums samples can be recorded standalone
to create data in a controlled environment. These methods are described in the
next two sections.
3.2.1 Music school
A music school has shared its teaching material with the MTG for research
purposes, i.e., audio demos, books in PDF format and music sheets in Sibelius
format. As we can see in Figure 1, the annotations from the music school corpus
are in Sibelius format; this is an encrypted representation of the music sheet
that can only be opened with the Sibelius software. The MTG has shared an AVID
license, which includes the Sibelius software, so we were able to convert the
.sib files to MusicXML. MusicXML is not encrypted and can be opened and read, so
a parser has been developed to convert the MusicXML files to a symbolic
representation of the music sheet. This representation has been inspired by
[24], which proposes a system to represent chords.
MusicXML parser
As mentioned in section 2.3, the MusicXML format is based on ordering the visual
information with tags, creating a tree structure of nested dictionaries. In the
first cell of XML_parser.ipynb6, two functions are defined.
5 https://pypi.org/project/xmltodict/
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
ConvertXML2Annotation() reads the MusicXML file and gets the general information
of the song (i.e., tempo, time measure, title); then a for loop runs through all
the bars of the music sheet, checking whether each bar is self-defined, a
repetition of the previous one, or the beginning or end of a repetition in the
song (see Figure 2). In the self-defined case, the bar itself is passed to an
auxiliary function that parses it, obtaining the aforementioned symbolic
representation.
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the
three others carry a symbol that means 'repetition of the previous bar';
moreover, the bar lines at the beginning and the end indicate that these four
bars have to be repeated. Therefore, this line in the music score represents an
interpretation of eight bars, repeating the first one.
The symbolic representation that we propose, based on [24], defines each bar
with a string; this string contains the representations of the events in the
bar, separated by blank spaces. Each event uses a colon (:) to separate the
figure (i.e., quarter note, half note, whole note) from the note or notes of the
event, which are separated by a dot (.). For instance, the symbolic
representation of the first bar in Figure 2 is 'F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4'.
In addition to this conversion, in the parse_one_measure() function from the
XML_parser notebook each measure is checked to ensure that it fully represents
the bar. This means that the sum of the figures of the bar has to be equal to
the one defined in the time measure: the sum of the events in a 4/4 bar has to
be equal to four quarter notes.
Symbolic notation to unified annotation format
As we can see in Figure 1, once the music scores are converted to the symbolic
representation, the last step is to unify the annotations with the format used
in section 3.1.
This process is implemented in the last cells of the Dataset_formatUnification7
notebook. A dictionary with the translation from notes to drums instruments is
defined, so converting the notes is direct. Differently, the timestamp of each
event has to be computed from the tempo of the song and the figure of each
event; this is done with the function get_time_steps_from_annotations8, which
reads the interpretation in symbolic notation and accumulates the duration of
each event based on its figure and the tempo.
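A sketch of this accumulation, assuming the 'notes:figure' event format
described above, where the figure is the note-value denominator (the function
name and details are ours, not the project's exact implementation):

def get_time_steps(symbolic_bars, bpm):
    """Accumulate onset times (seconds) from events like 'F4.A4:4'.

    The figure is the denominator of the note value: 4 = quarter note,
    8 = eighth note. At `bpm` quarter notes per minute, a quarter note
    lasts 60/bpm seconds.
    """
    beat = 60.0 / bpm
    t, onsets = 0.0, []
    for bar in symbolic_bars:
        for event in bar.split():
            notes, figure = event.split(":")
            onsets.append((t, notes.split(".")))
            t += beat * 4.0 / float(figure)  # duration of this figure
    return onsets

print(get_time_steps(["F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4"], bpm=60))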
3.2.2 Studio recordings
At this point of the dataset creation, we realized that the already existing
data was very unbalanced in terms of instances per class: some classes had
around two thousand samples, while others had only ten. This situation was the
reason to record a personalized dataset, both to balance the overall class
distribution and to obtain exercises read with different accuracy, simulating
students with different skill levels.
The recording process took place on April 16 and 17 at Stereodosis Estudio9
(Sants, Barcelona); the first day was devoted to mounting the drumset and the
microphones, which are listed in Table 2. In Figure 3 the microphone setup is
shown; differently from the standard setup, in which each instrument of the set
has its own microphone, this distribution of the microphones was intended to
record the whole drumset with different frequency responses.
The recording process was divided into two phases: first, creating samples to
balance the dataset used to train the drums event classifier (called the train
set); then, recording the students' assignment simulation to test the whole
system (called the test set).
7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
9 https://www.stereodosis.com
Microphone            Transducer principle
Beyerdynamic TG D70   Dynamic
Shure PG52            Dynamic
Shure SM57            Dynamic
Sennheiser e945       Dynamic
AKG C314              Condenser
AKG C414              Condenser
Shure PG81            Condenser
Samson C03            Condenser

Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only the classes
that appear in the music school subset; this decision was motivated by the idea
of assessing the songs from the books, so only the classes present in the
collection of songs were needed to train the classifier. In Figure 4 the
distribution of the selected classes before the recordings is shown; note that
it is in logarithmic scale, so there are large differences among classes.
Figure 4 Number of samples before Train set recording
To organize the recording process, we designed three different routines to
record; depending on the class and the number of samples already existing, a
different routine was recorded. These routines were designed to represent the
different speeds, dynamics and interactions between instruments of a real
interpretation. In Appendix A the routine scores are shown; to write a generic
routine, a two-line stave is used: the bottom line represents the class to be
recorded and the top line an auxiliary one. The auxiliary classes are cymbals,
concretely crashes and rides, whose sound lasts a long period of time, so its
tail is mixed with the subsequent sound events.
• Routine 1 (Fig. 31): intended for the classes that do not include a crash or
ride cymbal and have a small number of samples (i.e., <500).
• Routine 2 (Fig. 32): does not include auxiliary events, as it is intended for
the classes that include a crash or ride cymbal, whose interaction with itself
is intrinsic.
• Routine 3 (Fig. 33): a short version of routine 1 that repeats each bar twice
instead of four times; it is intended for the classes that do not include a
crash or ride cymbal and have a large number of samples (i.e., >500).
Routines 1 and 3 were recorded only once, as we had only one instrument for each
of those classes; differently, routine 2 was recorded twice for each cymbal, as
we were able to use more instances of them. The different cymbal configurations
used can be seen in Appendix A, in Figures 34, 35 and 36.
After the train set recording, the number of samples was more balanced; as shown
in Figure 5, all the classes have at least 1500 samples.
[Bar chart, counts from 0 to 3000 per class (ht+kd, kd+mt, ht, mt, ft+sd,
ft+kd+sd, cr+sd, ft, cr+kd, cr, ft+kd, hh+kd+sd, kd+sd, cy+sd, cy, cy+kd, sd,
kd, hh+sd, hh+kd, hh), comparing the samples recorded in the studio with those
existing before the recording.]
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same
song on the same drumset; to do that, we recorded each song of the music school
Drums Grade Initial and Grade 1, playing it correctly and then making mistakes
in both reading and rhythm. After testing with these recordings, we realized
that we were not able to test the limits of the assessment system in terms of
tempo or with different time signatures. So we proposed two groove-reading
exercises, in 4/4 and in 12/8, to be performed at different tempos; these
recordings have been done in my study room with my laptop's microphone.
3.3 Data augmentation
As described in section 2.1.2, data augmentation aims to introduce changes to
the signals to optimize the statistical representation of the dataset. To
implement this task, the aforementioned Python library audiomentations is used.
The audiomentations library has a class called Compose that allows collecting
different processing functions, assigning a probability to each of them. The
Compose instance can then be called several times with the same audio file, and
each time the resulting audio will be processed differently because of the
probabilities. In data_augmentation.ipynb10, a possible implementation is shown,
as well as some plots of the original sample with different results of applying
the created Compose to the same sample; an example of the results can be
listened to in Freesound11.
The processing functions introduced in the Compose class are based on those
proposed in [13] and [14]; their parameters are described below, followed by a
sketch of the resulting Compose:
• Add Gaussian noise, with 70% probability.
• Time stretch between 0.8 and 1.25, with 50% probability.
• Time shift forward a maximum of 25% of the duration, with 50% probability.
• Pitch shift of ±2 semitones, with 50% probability.
• Apply MP3 compression, with 50% probability.
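The sketch below builds such a Compose; it assumes a recent audiomentations
release (parameter names have varied slightly across versions), and the input
signal here is a random stand-in for a real drums sample:

import numpy as np
from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                             PitchShift, Shift, TimeStretch)

augment = Compose([
    AddGaussianNoise(p=0.7),                               # noise, 70%
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),       # stretch, 50%
    Shift(min_fraction=0.0, max_fraction=0.25, p=0.5),     # forward shift, 50%
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),  # +-2 semitones, 50%
    Mp3Compression(p=0.5),                                 # mp3 artifacts, 50%
])

samples = np.random.uniform(-1, 1, 44100).astype(np.float32)  # stand-in audio
augmented = augment(samples=samples, sample_rate=44100)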
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213
3.4 Drums events trim
As will be explained in section 4.2.1, the dataset has to be trimmed into
individual files in order to analyze them and extract the low-level descriptors.
This process has been implemented in the Dataset_featureExtraction.ipynb12
notebook, slicing all the audios with their annotations, each dataset
separately, to sight-check all the resulting samples and better detect which
annotations were not correct. A sketch of this slicing step is shown below.
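This sketch uses soundfile for audio I/O and assumes the unified 'onset label'
txt annotations; the function name, paths and the fixed slice length are ours:

import soundfile as sf

def slice_events(audio_path, annotation_path, out_dir, length=0.25):
    """Cut a fixed-length chunk (in seconds) starting at each annotated onset."""
    audio, sr = sf.read(audio_path)
    with open(annotation_path) as f:
        for i, line in enumerate(f):
            onset, label = line.split()          # e.g. "1.500 hh+kd"
            start = int(float(onset) * sr)
            sf.write(f"{out_dir}/{label}_{i}.wav",
                     audio[start:start + int(length * sr)], sr)

slice_events("take.wav", "take.txt", "slices")   # illustrative paths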
3.5 Summary
To summarize, a drums samples dataset has been created; the one used in this
project will be called the 40k Samples Drums Dataset. Nonetheless, to share this
dataset we have to ensure that we fully own the data, which means that the
samples that come from the IDMT, MDB Drums and Music School datasets cannot be
shared in another dataset. Alternatively, we will share the 29k Samples Drums
Dataset, formed only by the samples recorded in the studio. This dataset will be
available in Zenodo13, to download the whole dataset at once, and in Freesound,
where some selected samples are uploaded in a pack14 to show the differences
among microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592#.YMmNXW4p5TZ
14 https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter, the methodologies followed in the development of the assessment
pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it
is inspired by [2]. Each box of the diagram refers to a section in this chapter,
so the diagram may be helpful to get a general idea of the problem while each
process is explained.
The system is divided into two main processes. First, the top boxes correspond
to the training process of the model, using the dataset created in the previous
chapter. Secondly, the bottom row shows how a student submission is processed to
generate feedback. This feedback is the output of the system and should give the
student some indication of how they have performed and how they can improve.
4.1 Problem definition
To check whether a student reads a music sheet correctly, we need some tool to
tag which instruments of the drumset are being played for each detected event.
This leads us to develop and train a drums event classifier; if this tool
ensures a good classification accuracy (i.e., >95%), we will be able to properly
assess a student's recording. If the classifier is not accurate enough, the
system will not be useful, as we will not be able to differentiate between
errors from the student and errors from the classifier.
[Diagram: music scores, students' performances, assessments, annotations and
audio recordings form the dataset; feature extraction feeds both the drums event
classifier training and the performance assessment training; a new student's
recording goes through feature extraction and performance assessment inference,
producing a visualization and performance feedback.]
Figure 6 Proposed pipeline for a drums performance assessment system, inspired
by [2]
tioned drums event classifier and a proper dataset So developing a properly as-
sessed dataset of drums interpretations has not been possible nor the performance
assessment training Despite this the feedback visualization has been developed as
it is a nice way to close the pipeline and get some understandable results moreover
the performance feedback could be focused on deterministic aspects as telling the
student if is rushing or slowing in relation to a given tempo
4.2 Drums event classifier
As already mentioned, this section has been the main workload of this project,
because a reliable assessment depends on a correct automatic transcription. The
process has been divided into three main parts: extracting
the musical features, training and validating the model in an iterative process,
and finally testing the model with totally new data.
4.2.1 Feature extraction
The feature extraction concept has been explained in section 2.1.1; it has been
implemented using the MusicExtractor()1 method from the Essentia library.
MusicExtractor() has to be called passing as parameters the window and hop sizes
that will be used to perform the analysis, as well as the filename of the event
to be analyzed. The function extract_MusicalFeatures()2 has been implemented to
loop over a list of files, analyze each of them, and add the extracted features
to a CSV file jointly with the class of each drums event. At this point, all the
low-level features were extracted; both the mean and the standard deviation were
computed across all the frames of the given audio file. The reason was that we
wanted to check which features were redundant or meaningful when training the
classifier. A sketch of this extraction loop is shown below.
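This condensed sketch assumes the frame sizes and the CSV layout; the names are
illustrative and the project's extract_MusicalFeatures() may differ in details:

import csv
import essentia.standard as es

def extract_features(filenames, labels, csv_path):
    """Analyze each sliced event and store its low-level stats plus label."""
    extractor = es.MusicExtractor(lowlevelFrameSize=2048, lowlevelHopSize=1024,
                                  lowlevelStats=["mean", "stdev"])
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for filename, label in zip(filenames, labels):
            features, _ = extractor(filename)
            # Keep only the scalar low-level descriptors, in a stable order.
            keys = sorted(k for k in features.descriptorNames()
                          if k.startswith("lowlevel")
                          and isinstance(features[k], float))
            writer.writerow([features[k] for k in keys] + [label])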
As mentioned in section 3.4, the fact that MusicExtractor() has to be called
with a filename, not an audio stream, forced us to create another version of the
dataset in which each event is annotated in a different audio file, with the
corresponding class label as filename. Once all the datasets were properly
sliced and sight-checked, the last cell of the notebook was executed with the
corresponding folder names (which contain all the sliced samples), and the
features were saved in different CSV files, one per dataset3. Adding up the
number of instances in all the CSV files, we get 40,228 instances with 84
features and 1 label.
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices/features
4.2.2 Training and validating
As mentioned in section 2.2, some authors have proposed machine learning
algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN)
for sound event classification, while other authors have developed more complex
methods for drums event classification. The complexity of these last methods
made us choose the generic ones, also to check whether they were a good way to
approach the problem, as there is no literature specifically on drums event
classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has
been the main reference when designing the 40k Drums Samples Dataset. The first
times we trained the models, we were working with the class distribution of
Figure 4; as commented, this was a very unbalanced dataset, and we were
evaluating the classification inference with the accuracy formula (4.1), which
does not take into account the unbalance of the dataset. The accuracy was around
92%, but the correct predictions were mainly on the large classes; as shown in
Figure 7, some classes had very low accuracy (even 0, as some classes have 10
samples, 7 used to train and 3 to validate, all of them badly predicted), but
having a small number of instances affects the accuracy computation less.
\[ \mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} 1(\hat{y}_i = y_i) \tag{4.1} \]
Instead, the proper way to compute the accuracy on this kind of dataset is the
balanced accuracy: it computes the accuracy for each class and then averages it
along all the classes, as in formula (4.2), where w_i represents the weight of
each class in the dataset. This computation lowered the result to 79%, which was
not a good result.
\[ \hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \]
\[ \mathrm{balanced\text{-}accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i \tag{4.2} \]
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used indicator for classification models is the F-score, which
combines the precision and the recall of the model in one measure, as in formula
(4.3). Precision is computed as the number of correct predictions divided by the
total number of predictions, and recall is the number of correct predictions
divided by the total number of events that should have been predicted for a
given class.
\[ F\text{-measure} = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{4.3} \]
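These three indicators are available directly in scikit-learn; a minimal example
with string labels (the label values here are illustrative):

from sklearn.metrics import (balanced_accuracy_score, confusion_matrix,
                             f1_score)

y_true = ["kd", "sd", "hh", "kd", "hh+sd"]
y_pred = ["kd", "sd", "hh", "sd", "hh+sd"]

print(balanced_accuracy_score(y_true, y_pred))       # average of per-class accuracies
print(f1_score(y_true, y_pred, average="weighted"))  # combines precision and recall
print(confusion_matrix(y_true, y_pred))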
These results led us to record a personalized dataset to extend the existing one
(see section 3.2.2). With this new distribution the results improved, as shown
in Figure 8, as did the balanced accuracy and F-score (both 89%). Until this
point we were using both KNN and SVM models to compare results, and the SVM
always performed at least 10% better, so we decided to focus on the SVM and its
hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1
The C parameter in a support vector machine controls the regularization; this
technique is intended to make a model less sensitive to data noise and to
outliers that may not represent the class properly. When increasing this value
to 10, the results improved across all the classes, as shown in Figure 9, as did
the accuracy and F-score (both 95%).
At that point the accuracy of the model was quite good, but the 88% on the snare
drum class was a problem, as it is one of the most used instruments in the
drumset, jointly with the hi-hat and the kick drum. So we tried the same process
with the classes that include only the three mentioned instruments (i.e., hh,
kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes
improved the overall accuracy and F-score to 97.7%, and concretely the sd
accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10, using only the hh, sd and kd classes
The training and validating iterative process has been implemented in the
Classifier_training.ipynb4 notebook. First, the CSV files with the features
extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on
which subset of classes will be used, the corresponding instances are filtered,
and, to remove redundant features, the ones with a very low standard deviation
are deleted (i.e., std_dev < 0.00001). As the SVM works better when data is
normalized, the standard scaler is used to center all the feature distributions
around 0 and ensure a standard deviation of 1.
In the next cells, the dataset is split into train and validation sets, and the
training method of the sklearn SVM is called to perform the training; once the
models are trained, their parameters are dumped into a file, so the model can be
loaded a posteriori and the learned knowledge applied to new data. This process
was very slow on my computer, so we decided to upload the CSV files to Google
Drive and open the notebook with Google Colaboratory, as it was faster, which is
key to avoid long waiting times during the iterative train-validate process. In
the last cells, inference is done on the validation set and the accuracy is
computed, and the confusion matrix is plotted to get an idea of which classes
perform better. A condensed sketch of this process is shown below.
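In this sketch, the feature matrix is synthetic (in the notebook, X and y come
from the CSV files), and the dump filename is ours:

import joblib
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in feature matrix: 84 descriptors per event, one label per row.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 84))
y = rng.choice(["kd", "sd", "hh", "hh+kd"], size=300)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)            # zero mean, unit variance
clf = SVC(C=10).fit(scaler.transform(X_train), y_train)

print(balanced_accuracy_score(y_val, clf.predict(scaler.transform(X_val))))

# Dump both objects so the exact same pre-processing is reused at inference.
joblib.dump((scaler, clf), "svm_drums.joblib")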
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now, all the
slices have been created using the annotations, but to assess a new submission
from a student we need to detect the onsets and then slice the events. The
function SliceDrums_BeatDetection5 does both tasks. As explained in section
2.1.1, there are many onset detection methods, and each of them suits a
different application. In the case of drums, we tested the 'complex' method,
which finds changes in the frequency domain in terms of energy and phase and
works quite well; however, when the tempo increases, some onsets are not
correctly detected, so we finally implemented the onset detection with the HFC
method.
4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
This method computes the HFC of each window as in equation (4.4); note that
high-frequency bins (index k) weigh more in the final value of the HFC:
\[ \mathrm{HFC}(n) = \sum_k k \cdot |X_k[n]|^2 \tag{4.4} \]
Moreover, the function plots the audio waveform jointly with the detected
onsets, to check after each test whether it has worked correctly. In Figures 11
and 12 we can see two examples of the same music sheet played at 60 and 220 bpm;
in both cases all the onsets are correctly detected and no false detections
occur.
Figure 11 Onsets detected in a 60 bpm drums interpretation
Figure 12 Onsets detected in a 220 bpm drums interpretation
With the onset information, the audio can be trimmed into the different events;
the order is kept in the name of each file, so when comparing with the expected
events they can be mapped easily. The audio slices are passed to the
extract_MusicalFeatures() function, which saves the musical features of each
slice in a CSV file.
To predict which event each slice contains, the already trained models are
loaded in this new environment, and the data is pre-processed using the same
pipeline as when training. After that, the data is passed to the classifier
method predict(), which returns the predicted event for each row of the data.
The described process is implemented in the first part of Assessment.ipynb6; the
second part executes the visualization functions described in the next section.
A minimal inference sketch is shown below.
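This sketch expects the svm_drums.joblib file from the training sketch above;
the feature matrix here is a stand-in for the descriptors of the sliced events:

import joblib
import numpy as np

# Load the scaler and the SVM dumped during training.
scaler, clf = joblib.load("svm_drums.joblib")

# One row of descriptors per sliced event, in onset order (stand-in values).
X_new = np.random.default_rng(1).normal(size=(32, 84))
predicted_events = clf.predict(scaler.transform(X_new))
print(predicted_events[:4])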
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving
the student visual feedback on the interpretation. As the drums classifier has
taken so much time, the creation of a dataset of interpretations with their
grades has not been feasible. A first approximation was to record different
interpretations of the same music sheet, simulating different skill levels, but
grading them and doing the whole process by ourselves was not easy; apart from
that, we tended to play the fragments either well or badly, and it was difficult
to simulate intermediate levels and be consistent with the proposed ones.
So the implemented solution generates an image that shows the student whether
the notes of the music sheet were correctly read and whether the onsets are
aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation
has to be returned. Taking as a base implementation the solution of my colleague
Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization
is done in the last cell of the notebook Assessment.ipynb. First, the LilyPond
file paths are defined; then, for each submission, the audio is loaded to
generate the waveform plot.
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected
and expected onsets, the waveform, and the start and end of the waveform (these
come from the LilyPond file's macro). To properly plot the deviations, the code
assumes that the interpretation starts four beats after the beginning of the
audio. In Figures 13 and 14 the result of save_bar_plot() for two different
submissions is shown. The black lines at the bottom of the waveform are the
detected onsets, while the cyan lines in the middle are the expected ones; when
the difference between the two values increases, the area between them is
colored with a traffic-light code (from green, good, to red, bad). A simplified
sketch of this plot is shown after the figures.
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
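The following matplotlib sketch illustrates the traffic-light idea; it is a
simplified stand-in, not the project's save_bar_plot() (the 250 ms deviation
normalization and the plotting details are our assumptions):

import matplotlib.pyplot as plt
import numpy as np

def plot_onset_deviation(waveform, sr, detected, expected, out="deviation.png"):
    """Waveform with detected (black) and expected (cyan) onset marks; the
    span between each pair is colored green (small deviation) to red (large)."""
    t = np.arange(len(waveform)) / sr
    fig, ax = plt.subplots(figsize=(12, 2))
    ax.plot(t, waveform, color="0.7")
    for d, e in zip(detected, expected):
        dev = min(abs(d - e) / 0.25, 1.0)  # 0 = aligned, 1 = 250 ms or more
        ax.axvspan(min(d, e), max(d, e), color=plt.cm.RdYlGn_r(dev), alpha=0.5)
        ax.vlines(d, -1.0, -0.5, color="black")  # detected onset
        ax.vlines(e, -0.25, 0.25, color="cyan")  # expected onset
    fig.savefig(out)

# Example: slightly rushed and dragged onsets against a 1-second grid.
sr = 22050
wave = np.random.default_rng(2).normal(scale=0.1, size=4 * sr)
plot_onset_deviation(wave, sr, [0.02, 1.05, 2.2, 3.0], [0.0, 1.0, 2.0, 3.0])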
Once the waveform is created, it is embedded in a lambda function that is called
from the LilyPond render. Before calling LilyPond to render, though, the
assessment of the notes has to be done. In the function assess_notes()10, the
expected and predicted events are compared; a list is created with 1 at the
matching indices and 0 at the mismatching ones. The resulting list is then
iterated and the 0 indices are checked, because most classification errors miss
only one of the instruments to be predicted (i.e., instead of hh+sd, the model
predicts sd). These cases are considered partially correct, as the system has to
take its own errors into account: at the indices where one of the instruments is
correctly predicted and it is not a hi-hat (we are
9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
considering it more important to read the snare and kick correctly than the
hi-hat, which is present in all the events), the value is set to 0.75 (light
green in the color scale). In Figure 15 the different feedback options are
shown: green notes mean correct, light green partially correct, and red
incorrect. A sketch of this partial-credit logic is shown after the figure.
Figure 15 Example of coloured notes
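A minimal sketch of this logic, assuming events are represented as sets of
instrument tags (the function body is ours; the real assess_notes() may differ):

def assess_notes(expected, predicted):
    """Return a score per note: 1.0 correct, 0.75 partially, 0.0 incorrect."""
    scores = []
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            scores.append(1.0)
        elif (exp & pred) - {"hh"}:  # some non-hi-hat instrument was right
            scores.append(0.75)
        else:
            scores.append(0.0)
    return scores

expected = [{"hh", "kd"}, {"hh", "sd"}, {"hh"}]
predicted = [{"hh", "kd"}, {"sd"}, {"kd"}]
print(assess_notes(expected, predicted))  # -> [1.0, 0.75, 0.0]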
With the waveform, the assessed notes and the LilyPond template, the function
score_image()11 can be called. This function renders the LilyPond template
jointly with the previously created waveform; this is done with the LilyPond
macros. On one hand, before each note on the staff, the keyword color() size()
determines that the color and size of the note depend on an external variable
(the assessed notes); on the other hand, after the first note of the staff, the
keyword eps(1150 16) indicates on which beat the waveform starts to be displayed
and on which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4
bars); the other number is the scale of the waveform and allows fitting the plot
to the staff.
4.3.2 Files used
The assessment process of an exercise needs several files: first, the
annotations of the expected events and their timesteps, found in the txt file
already mentioned in section 3.1.1; then the LilyPond file, the template written
in the LilyPond language that defines the resulting music sheet, where the
macros to change color and size and to add the waveform are defined; when
extracting the musical features, each submission creates its own CSV file to
store the information; and finally, of course, the audio files with the recorded
submission to be assessed.
11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we
can evaluate the results to check whether it works correctly and is useful for a
student to learn, and also to test the limits regarding audio signal quality and
tempo. The tests have been done with two different exercises, recorded with a
computer microphone and played at different tempos, starting at 60 bpm and
adding 40 bpm until 220 bpm. The recordings with good tempo and good reading
have been processed adding 6 dB until an accumulated +30 dB.
In this chapter and Appendix B, all the resulting feedback visualizations are
shown. The audio files can be listened to in Freesound, where a pack1 has been
created. Some of them will be commented on and referenced in further sections;
the rest are extra results.
As the high frequency content method works perfectly, there are no limitations
or errors in terms of onset detection: all the tests have an f-measure of 1,
detecting all the expected events without any false positives.
1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy
drops when the tempo increases. Taking as reference the figures that show a good
reading, in which all notes should be green or light green (i.e., Figures 16,
17, 18, 19, 20, 21 and 22), we can count how many are correct or partially
correct. To score each case, a correct prediction weighs 1.0, a partially
correct one 0.5, and an incorrect one 0; the total value is the mean of the
weighted predictions.
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that increasing the tempo of exercise 1 decreases the
accuracy of the classifier. This may be because increasing the tempo decreases
the spacing between events, and consequently the duration of each event, which
leads to fewer
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to compute the mean and standard deviation when extracting the timbre
characteristics. As stated in the law of large numbers [25], the larger the
sample, the closer its mean is to the population mean; in this case, having
fewer values in the computation creates more outliers in the distribution, which
tends to scatter.
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48
Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record
faster than 100 bpm. But 12/8 at 100 bpm corresponds to 300 eighth notes per
minute, similar to 140 bpm in 4/4, which corresponds to 280 eighth notes per
minute. The results in 12/8 (Table 4) are also better, because there are more
'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875
Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal.
Listening to the submissions, the hi-hat events are recorded with less amplitude
than the snare and kick events; for this reason, we think that the classifier
starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting
scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is
the last waveform to saturate, and at this gain level the overall waveform is so
clipped that it yields high-frequency content that is predicted as a hi-hat in
all cases.
Level   Correct   Partially OK   Incorrect   Total
+0dB    25        7              0           0.89
+6dB    23        9              0           0.86
+12dB   23        9              0           0.86
+18dB   24        7              1           0.86
+24dB   18        5              9           0.64
+30dB   13        5              14          0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level   Correct   Partially OK   Incorrect   Total
+0dB    12        7              13          0.48
+6dB    13        10             9           0.56
+12dB   10        8              14          0.5
+18dB   9         2              21          0.31
+24dB   8         0              24          0.25
+30dB   9         0              23          0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo, Ex 1, 60 bpm, accumulating +6dB at each
new staff
Figure 24 Good reading and good tempo, Ex 1, 220 bpm, accumulating +6dB at each
new staff
5.3 Evaluation of the assessment
Until now, the evaluation of results has been focused on the accuracy of the
drums event classifier, but we think it is also important to evaluate whether
the system can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or
some of the beats are not read, the system can still map the rest of the events
to the expected ones at the corresponding onset time steps. This is due to a
check done in the assessment, which assumes that before the first beat there is
a count-in of one bar, and that the rest of the beats have to come after this
interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in previous sections, counting the
number of correct predictions, but now in terms of assessment. The analyzed
results will be the 'bad reading, good tempo' ones, shown in Figures 27, 28
and 29.
Figure 27 Bad reading and good tempo, Ex 1, starting at 60 bpm and adding 60 bpm
at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a
correct assessment if the note is green or light green and the event is the one
in the music score, or if the note is red and the event is not the one in the
music score. The rest of the cases are counted as incorrect assessments. The
total value is the number of correct assessments over the total number of
events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7 Assessment results of a bad reading at different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8 Assessment results of a bad reading at different tempos, 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system
performs the assessment based on the predictions quite well. This can be helpful
for a student to know which parts of the music sheet are well read and which are
not. The tempo visualization can also help the student recognize whether they
are slowing down or rushing when reading the score: as can be seen in Figure 30,
the detected onsets (black lines at the bottom of the waveform) are mostly
behind the corresponding expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point, all the work of the project has been done and the results have
been analyzed. In this chapter, a discussion is developed about which objectives
have been accomplished and which have not. A set of further improvements is also
given, together with a final thought on my work and my apprenticeship. The
chapter ends with an analysis of how reusable and reproducible this work is.
6.1 Discussion of results
With all the concepts explained along this document in mind, we can now list
them, stating their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly
available, downloadable from Freesound and Zenodo. Apart from being used in this
project, this dataset might be useful to other researchers and students in their
projects. The dataset is indeed useful for balancing drums datasets based on
real interpretations, as the class distribution of these interpretations is very
unbalanced, as explained with the IDMT and MDB Drums datasets.
Secondly, a drums event classifier with a machine learning approach has been
proposed and trained with the aforementioned dataset. One of the reasons for
using this approach to predict the events was that there was no literature
focused on
classifying drums events in this manner. As the results have shown, more complex
methods based on the context might be used, such as the ones proposed in [16]
and [17]. It is important to take into account that the task the model is
trained on is very hard for a human being: differentiating drums events in an
individual drum sample, without any context, is almost impossible even for a
trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as
well as the development of a MusicXML parser. This part took around one month to
develop and, from my point of view, it was a great way to understand how these
file formats work and how they could be improved, as they are mostly focused on
the visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to
demonstrate the functionality of the system, and tests of these exercises have
been recorded in a different environment than the 29k Samples Drums Dataset. It
would be good to get recordings in different spaces and with different drumsets
and microphones to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with
different drumsets, tuning each drumset differently, using different sticks to
hit the instruments, and even different people playing; this would introduce
more variance in the drums sample dataset. Moreover, on June 9th, 2021, a paper
about a large drums dataset with MIDI data was presented [26] at ICASSP 20211.
This new dataset could be included in the training process, as the authors state
that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the
overall robustness of the system. It would be appropriate to introduce the
aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org
Also, in terms of drumset classes, there is a long path to cover: there are no
solutions that robustly transcribe a whole set, including the toms and different
kinds of cymbals. In this sense, we think a proper approach would be to work
with professional musicians, which would help researchers better understand the
instrument and create datasets with different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises to give a mark to each student. On this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step would be omitted. A sketch of this idea is shown below.
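As an illustration only, the following is a minimal scikit-learn sketch of such a regression step; the per-submission descriptors (mean and spread of onset deviations, fraction of correctly read events) and the teacher marks are hypothetical placeholders.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical descriptors per submission:
# [mean |onset deviation| (s), std of onset deviations (s), fraction of events read correctly]
X = np.array([[0.01, 0.01, 0.98],
              [0.08, 0.05, 0.70],
              [0.03, 0.02, 0.90],
              [0.12, 0.09, 0.55]] * 10)
y = np.array([9.5, 5.0, 8.0, 3.0] * 10)  # hypothetical teacher marks over 10

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), SVR(C=10))  # same scaling and SVM family as the event classifier
model.fit(X_train, y_train)
print(mean_absolute_error(y_test, model.predict(X_test)))
```

SVR is chosen here only to stay close to the classifier already used; any regressor could take its place once a properly assessed dataset of interpretations exists.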
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and other researchers and students can execute them, obtaining the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository², as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository³ and some samples in Freesound⁴. This is the 29kSamples Drums Dataset: not all of the 40k samples used for training are our property, so we are not able to share them under our full authorship. Despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397
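As an illustration of what reproducing the training step could look like, the sketch below assumes the csv layout produced by the feature extraction stage (one label column plus the extracted descriptors) and a hypothetical file name; fixing the random seed makes the split, and hence the reported scores, repeatable.

```python
import pandas as pd
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("features/studio_recordings.csv")  # hypothetical path to a features csv
X = df.drop(columns=["label"]).values
y = df["label"].values

# A fixed random_state gives every researcher/student the same train/validation split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
scaler = StandardScaler().fit(X_train)
clf = SVC(C=10).fit(scaler.transform(X_train), y_train)
print(balanced_accuracy_score(y_test, clf.predict(scaler.transform(X_test))))
```

Running the notebooks on Colaboratory with the published data should then reproduce the reported confusion matrices and accuracies, up to this fixed seed.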
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although the work still falls short in terms of robustness and reliability. Still, a first approximation has been presented, as well as several paths of improvement proposed.
Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also in investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web-based service GitHub. Also, at the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which has been a tough path to take and accomplish. In a more advanced phase of the project, I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a lot of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me. And being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that results from the system, and I hope that more people get interested in this field of research so that better tools arrive in the future.
List of Figures

1 Datasets pre-processing
2 Sample drums score from music school drums grade 1
3 Microphone setup for drums recording
4 Number of samples before Train set recording
5 Number of samples after Train set recording
6 Proposed pipeline for a drums performance assessment system inspired by [2]
7 Confusion matrix after training with the dataset in Figure 4
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
11 Onsets detected in a 60 bpm drums interpretation
12 Onsets detected in a 220 bpm drums interpretation
13 Onset deviation plot of a good tempo submission
14 Onset deviation plot of a bad tempo submission
15 Example of coloured notes
16 Good reading and good tempo, Ex. 1, 60 bpm
17 Good reading and good tempo, Ex. 1, 100 bpm
18 Good reading and good tempo, Ex. 1, 140 bpm
19 Good reading and good tempo, Ex. 1, 180 bpm
20 Good reading and good tempo, Ex. 1, 220 bpm
21 Good reading and good tempo, Ex. 2, 60 bpm
22 Good reading and good tempo, Ex. 2, 100 bpm
23 Good reading and good tempo, Ex. 1, 60 bpm, accumulating +6 dB at each new staff
24 Good reading and good tempo, Ex. 1, 220 bpm, accumulating +6 dB at each new staff
25 Bad reading and bad tempo, Ex. 1, 100 bpm
26 Bad reading and bad tempo, Ex. 1, 180 bpm
27 Bad reading and good tempo, Ex. 1, starts at 60 bpm and adds 60 bpm at each new staff
28 Bad reading and good tempo, Ex. 2, 60 bpm
29 Bad reading and good tempo, Ex. 2, 100 bpm
30 Good reading and bad tempo, Ex. 1, 100 bpm
31 Recording routine 1
32 Recording routine 2
33 Recording routine 3
34 Drumset configuration 1
35 Drumset configuration 2
36 Drumset configuration 3
37 Good reading and bad tempo, Ex. 1, 60 bpm
38 Bad reading and bad tempo, Ex. 1, 60 bpm
39 Good reading and bad tempo, Ex. 1, 140 bpm
40 Bad reading and bad tempo, Ex. 1, 140 bpm
41 Good reading and bad tempo, Ex. 1, 180 bpm
42 Good reading and bad tempo, Ex. 1, 220 bpm
43 Bad reading and bad tempo, Ex. 1, 220 bpm
44 Good reading and bad tempo, Ex. 2, 60 bpm
45 Bad reading and bad tempo, Ex. 2, 60 bpm
46 Good reading and bad tempo, Ex. 2, 100 bpm
47 Bad reading and bad tempo, Ex. 2, 100 bpm

List of Tables

1 Abbreviations' legend
2 Microphones used
3 Results of exercise 1 with different tempos
4 Results of exercise 2 with different tempos
5 Results of exercise 1 at 60 bpm with different amplification levels
6 Results of exercise 1 at 220 bpm with different amplification levels
7 Assessment result of a bad reading with different tempos, 4/4 exercise
8 Assessment result of a bad reading with different tempos, 12/8 exercise
Bibliography

[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K. J., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2003).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2001).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
Figure 31 Recording routine 1 (score, ♩ = 60)
Figure 32 Recording routine 2 (score, ♩ = 60)
Figure 33 Recording routine 3 (score, ♩ = 60)
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo, Ex. 1, 60 bpm
Figure 38 Bad reading and bad tempo, Ex. 1, 60 bpm
Figure 39 Good reading and bad tempo, Ex. 1, 140 bpm
Figure 40 Bad reading and bad tempo, Ex. 1, 140 bpm
Figure 41 Good reading and bad tempo, Ex. 1, 180 bpm
Figure 42 Good reading and bad tempo, Ex. 1, 220 bpm
Figure 43 Bad reading and bad tempo, Ex. 1, 220 bpm
Figure 44 Good reading and bad tempo, Ex. 2, 60 bpm
Figure 45 Bad reading and bad tempo, Ex. 2, 60 bpm
Figure 46 Good reading and bad tempo, Ex. 2, 100 bpm
Figure 47 Bad reading and bad tempo, Ex. 2, 100 bpm
241 Essentia 12
242 Scikit-learn 12
243 Lilypond 12
244 Pysimmusic 13
245 Music Critic 13
25 Summary 13
3 The 40kSamples Drums Dataset 14
31 Existing datasets 14
311 MDB Drums 15
312 IDMT Drums 15
32 Created datasets 16
321 Music school 16
322 Studio recordings 18
33 Data augmentation 22
34 Drums events trim 23
35 Summary 23
4 Methodology 24
41 Problem definition 24
42 Drums event classifier 25
421 Feature extraction 26
422 Training and validating 27
423 Testing 31
43 Music performance assessment 33
431 Visualization 33
432 Files used 35
5 Results 36
51 Tempo limitations 37
52 Saturation limitations 40
53 Evaluation of the assessment 43
6 Discussion and conclusions 47
61 Discussion of results 47
62 Further work 48
63 Work reproducibility 49
64 Conclusions 50
List of Figures 51
List of Tables 53
Bibliography 54
A Studio recording media 57
B Extra results 60
Acknowledgement
I would like to express my sincere gratitude to
bull Xavier Serra for supervising this project
bull Vsevolod Eremenko Eduard Vergeacutes and the MTG for helping me whenever I
have needed it
bull Sepe Martiacutenez and the Stereodosis team for helping me record a drums dataset
Abstract
This project is focused in the development of automated assessment tools for the
support of musical instrument learning concretely drums learning The state of
the art on music performance assessment has advanced in the last years and the
demand for online learning resources has increased The goal is to develop software
that listens to a student reading a music sheet with drums then evaluates the audio
recording giving some feedback on the tempo and reading accuracy First a review
of the previous work on different topics is made Then a drums event dataset
is created as well as a classifier and a performance assessment pipeline has been
developed focused on giving feedback through a visualization of the waveform in
parallel with the exercise score Results suggests that the classifier can generalize
with new audios despite this there is a large margin to improve the classification
results Furthermore there are some limitations in terms of maximum tempo to
classify correctly the events and the amount of saturation that the system can afford
to ensure a correct prediction Finally a discussion introduces the idea of a teacher
that follows the same process as the pipeline proposed thinking if this is a correct
approach to the problem and if this is a fair way to evaluate the system in a task
that we do differently with more information than one audio
Keywords Automatic assessment Signal processing Music information retrieval
Drums Machine learning
Chapter 1
Introduction
The field of sound event classification and concretely drums events classification has
improved the results during the last years and allows us to use this kind of technolo-
gies for a more concrete application like education support [1] The development of
automated assessment tools for the support of musical instrument learning has been
a field of study in the Music Technology Group (MTG) [2] concretely on guitar
performances implemented in the Pysimmusic project [3] One of the open paths
that proposes Eremenko et al [2] is to implement it with different instruments
and this is what I have done
11 Motivation
The aim of the project is to implement a tool to evaluate musical performances
specifically reading scores with drums One possible real application may be to sup-
port the evaluation in a music school allowing the teacher to focus on other items
such as attitude posture and more subtle nuances of performance If high accuracy
is achieved in the automatic assessment of tempo and reading a fair assessment of
these aspects can be ensured In addition a collaboration between a music school
and the MTG allows the use of a specific corpus of data from the educational insti-
tution program this corpus is formed by a set of music sheets and the recordings of
the performances
1
2 Chapter 1 Introduction
Besides this I have been studying drums for fourteen years and a personal motivation
emerges from this fact Learning an instrument is a process that does not only rely
on going to class there is an important load of individual practice apart of the class
indeed Having a tool to assess yourself when practicing would be a nice way to
check your own progress
12 Existing solutions
In terms of music interpretation assessment there are already some software tools
that support assessing several instruments Applications such as Yousician1 or
SmartMusic2 offer from the most basic notions of playing an instrument to a syllabus
of themes to be played These applications return to the students an evaluation that
tells which notes are correctly played and which are not but do not give information
about tempo consistency or dynamics and even less generates a rubric as a teacher
could develop during a class
There are specific applications that support drums learning but in those the feature
of automatic assessment disappears There are some options to get online drums
lessons such as Drumeo3 or Drum School4 but they only offer a list of videos impart-
ing lessons on improving stylistic vocabulary feel improvisation or technique These
applications also offer personal feedback from professional drummers and a person-
alized studying plan but the specific feature of automatic performance assessment
is not implemented
As mentioned in the Introduction automatic music assessment has been a field
of research at the MTG With the development of Music Critic5 an assessment
workflow is proposed and implemented This is useful as can be adapted to the
drums assessment task
1httpsyousiciancom2httpswwwsmartmusiccom3httpswwwdrumeocom4httpsdrumschoolappcom5httpsmusiccriticupfedu
13 Identified challenges 3
13 Identified challenges
As mentioned in [2] there are still improvements to do in the field of music as-
sessment especially analyzing expressivity and with advanced level performances
Taking into account the scope of this project and having as a base case the guitar
assessment exercise from Music Critic some specific challenges are described below
131 Guitar vs drums
As defined in [4] a drumset is a collection of percussion instruments Mainly cym-
bals and drums even though some genres may need to add cowbells tambourines
pailas or other instruments with specific timbres Moreover percussion instruments
are splitted in two families membranophones and idiophones membranophones
produces sound primarily by hitting a stretched membrane tuning the membrane
tension different pitches can be obtained [5] differently idiophones produces sound
by the vibration of the instrument itself and the pitch and timbre are defined by
its own construction [5] The vibration aforementioned is produced by hitting the
instruments generally this hit is made with a wood stick but some genres may need
to use brushes hotrods or mallets to excite specific modes of the instruments With
all this in mind and as stated in [1] it is clear that transcribing drums having to
take into account all its variants and nuances is a hard task even for a professional
drummer With this said there is a need to simplify the problem and to limit the
instruments of the drumset to be transcribed
Returning to the assessment task guitars play notes and chords tuned so the way
to check if a music sheet has been read correctly is looking for the pitch information
and comparing it to the expected one Differently instruments that form a drumset
are mainly unpitched (except toms which are tuned using different scales and tuning
paradigms) so the differences among drums events are on the timbre A different
approach has to be defined in order to check which instrument is being played for
each detected event the first idea is to apply machine learning for sound event
classification
4 Chapter 1 Introduction
Along the project we will refer to the different instruments that conform a drumkit
with abbreviations In Table 1 the legend used is shown the combination of 2 or
more instruments is represented with a rsquo+rsquo symbol between the different tags
Instrument Kick Drum Snare Drum Floor tom Mid tom High tomAbbreviation kd sd ft mt ht
Instrument Hi-hat Ride cymbal Crash cymbalAbbreviation hh cy cr
Table 1 Abbreviationsrsquo legend
132 Dataset creation
Keeping in mind the last idea of the previous section if a machine learning approach
has to be implemented there is a basic need to obtain audio data of drums Apart
from the audio data proper annotations of the drums interpretations are needed in
order to slice them correctly and extract musical features of the different events
The process of gathering data should take into account the different possibilities that
offers a drumset in terms of timbre loudness and tone Several datasets should be
combined as well as additional recordings with different drumsets in order to have
a balanced and representative dataset Moreover to evaluate the assessment task
a set of exercises has to be recorded with different levels of skill
There is also the need to capture those sounds with several frequency responses in
order to make the model independent of the microphone Also those samples could
be processed to get variations of each of them with data augmentation processes
133 Signal quality
Regarding the assignment we have to take into account that a student will not be
able to record its interpretations with a setup as the used in a studio recording most
of the time the recordings will be done using the laptop or mobile phone microphone
This fact has to be taken into account when training the event classifier in order
to do data augmentation and introduce these transformations to the dataset eg
introducing noise to the samples or amplifying to get overload distortion
14 Objectives 5
14 Objectives
The main objective of this project is to develop a tool to assess drums interpretations
of a proposed music sheet This objective has to be split into the different steps of
the pipeline
bull Generate a correctly annotated drums dataset which means a collection of
audio drums recordings and its annotations all equally formatted
bull Implement a drums event sound classifier
bull Find a way to properly visualize drums sheets and their assessment
bull Propose a list of exercises to evaluate the technology
In addition having the code published in a public Github6 repository and uploading
the created dataset to Freesound7 and Zenodo8 will be a good way to share this work
15 Project overview
The next chapters will be developed as follows In chapter 2 the state of the art is
reviewed Focusing on signal processing algorithms and ways to implement sound
event classification ending with music sheet technologies and software tools available
nowadays In chapter 3 the creation of a drums dataset is described Presenting the
use of already available datasets and how new data has been recorded and annotated
In chapter 4 the methodology of the project is detailed which are the algorithms
used for training the classifier as well as how new submissions are processed to assess
them In chapter 5 an evaluation of the results is done pointing out the limitations
and the achievements Chapter 6 concludes with a discussion on the methods used
the work done and further work
6httpsgithubcom7httpsfreesoundorg8httpszenodoorg
Chapter 2
State of the art
In this chapter the concepts and technologies used in the project are explained
covering algorithm references and existing implementations First signal process-
ing techniques on onset detection and feature extraction are reviewed then sound
event classification field is presented and its relationship with drums event classifica-
tion Also the principal music sheet technologies and codecs are presented Finally
specific software tools are listed
21 Signal processing
211 Feature extraction
In the following sections sound event classification will be explained most of these
methods are based on training models using features extracted from the audio not
with the audio chunks indeed [6] In this section signal processing methods to get
those features are presented
Onset detection
In an audio signal an onset is the beginning of a new event it can be either a
single note a chord or in the case of the drums the sound produced by hitting one
or more instruments of the drumset It is necessary to have a reliable algorithm
6
21 Signal processing 7
that properly detects all the onsets of a drums interpretation With the onsets
information (a list of timestamps) the audio can be sliced to analyze each chunk
separately and to assess the tempo consistency
It is important to address the challenge in a psychoacoustical way as the objective
is to detect the musical events as a human will do In [7] the idea of perceptual
onset for percussive instruments is defined as a time interval between the physical
onset and the moment that the maximum level is reached In [8] many methods are
reviewed focusing on the differences of performance depending on the signal Non
Pitched Percussive instruments are better detected with temporal methods or high-
frequency content methods while Pitched Non Percussive instruments may need to
take into account changes of energy in the spectrum distribution as the onset may
represent a different note
The sound generated by the drums is mainly percussive (discarding brushesrsquo slow
patterns or malletrsquos build-ups on the cymbals) which means that is formed by a
short transient followed by a short decay there is no sustain As the transient is a
fast change of energy it implies a high-frequency content because changes happen
in a very little frame of time As recommended in [9] HFC method will be used
Timbre features
As described in [10] a feature denotes in some way a quantity or a value Features
extracted by processing the audio stream or transformations of that (ie FFT)
are called low-level descriptors these features have no relevant information from a
human point of view but are useful for computational processes [11]
Some low-level descriptors are computed from the temporal information for in-
stance the zero-crossing rate tells the number of times the signal crosses the zero
axis per second the attack time is the duration of the transient and temporal cen-
troid the energy distribution of an event during the time Other well known features
are the root median square of the signal or the high-frequency content mentioned
in section 211
8 Chapter 2 State of the art
Besides temporal features low-level descriptors can also be computed from the fre-
quency domain Some of them are spectral flatness spectral roll-off spectral slope
spectral flux ia
Nowadays Essentialsquos library offers a collection of algorithms that reliably extracts
the low-level descriptors aforementioned the function that englobes all the extrac-
tors is called Music extractor1
212 Data augmentation
Data augmentation processes refer to the optimization of the statistical representa-
tion of the datasets in terms of improving the generalization of the resultant models
These methods are based on the introduction of unobserved data or latent variables
that may not be captured during the dataset creation [12]
Regarding this technique applied to audio data signal processing algorithms are
proposed in [13] and [14] that introduces changes to the signals in both time and
frequency domains In these articles the goal is to improve accuracy on speech and
animal sound recognition although this could apply to drums event classification
The processes that lead best results in [13] and [14] were related to time-domain
transformations for instance time-shifting and stretching adding noise or harmonic
distortion compressing in a given dynamic range ia Other processes proposed
were focused on the spectrogram of the signal applying transformations such as
shifting the matrix representation setting to 0 some areas or adding spectrograms
of different samples of the same class
Presently some Python2 libraries are developed and maintained in order to do audio
data augmentation tasks For instance audiomentations3 and the GPU version
torch-audiomentations4
1httpsessentiaupfedustreaming_extractor_musichtml2httpswwwpythonorg3httpspypiorgprojectaudiomentations0604httpspypiorgprojecttorch-audiomentations
22 Sound event classification 9
22 Sound event classification
Sound Event Classification is the task of detecting and recognizing sound events in
an audio stream [15] As described in [10] this task can be approached from two
sides on one hand the perceptual approach tries to extract the timbre similarity to
cluster sounds as how we perceive them on the other hand the taxonomic approach
is determined to label sound events as they are defined in the cultural or user biased
taxonomies In this project the focus is on the second approach as the task is to
classify sound events in the drums taxonomy (ie kick drum snare drum hi-hat)
Also in [] many classification methods are proposed Concretely in the taxonomy
approach machine learning algorithms such as K-Nearest Neighbors Support Vector
Machines or Neural Networks All of them using features extracted from the audio
data as explained in section 211
221 Drums event classification
This section is divided into two parts first presenting the state-of-the-art methods
for drum event classification and then the most relevant existing datasets This
section is mainly based on the article [1] as it is a review of the topic and encompasses
the core concepts of the project
Methods
Focusing on the taxonomic drums events classification this field has been studied for
the last years as in the Music Information Retrieval Evaluation eXchange5 (MIREX)
has been a proposed challenge since 20056 In [1] a review of the main methods
that have been investigated is done The authors collect different approaches such
as Recurrent Neural Networks proposed in [16] Non-Negative matrix factorization
proposed in [17] and others real-time based using MaxMSP7 as described in [18]
5httpswwwmusic-irorgmirexwikiMIREX_HOME6httpswwwmusic-irorgmirexwiki2005Audio_Drum_Detection_Results7httpscycling74comproductsmax
10 Chapter 2 State of the art
It is needed to mention that the proposed methods are focused on Automatic Drum
Transcription (ADT) of drumsets formed only by the kick drum snare drum and
hi-hat ADT field is intended to transcribe audio but in our case we have to check
if an audio event is or not the expected event this particularity can be used in our
favor as some assumptions can be made about the audio that has to be analyzed
Datasets
In addition to the methods and their combinations the data used to train the
system plays a crucial role As a result the dataset may have a big impact on the
generalization capabilities of the models In this section some existing datasets are
described
bull IDMT-SMT-Drums [19] Consists of real drum recordings containing only
kick drum snare drum and hi-hat events Each recording has its transcription
in xml format and is publicly avaliable to download8
bull MDB Drums [20] Consists of real drums recordings of a wide range of genres
drumsets and styles Each recording has two txt transcriptions for the classes
and subclasses defined in [20] (eg class Hi-hat Subclasses Closed hi-hat
open hi-hat pedal hi-hat) It is publicly avaliable to download9
bull ENST-Drums [21] Consists of real drum audio and video recordings of dif-
ferent drummers and drumsets Each recording has its transcription and some
of them include accompaniment audio It is publicly available to download10
bull DREANSS [22] Differently this dataset is a collection of drum recordings
datasets that have been annotated a posteriori It is publicly available to
download11
Electronic drums datasets have not been considered as the student assignment is
supposed to be recorded with a real drumset8httpswwwidmtfraunhoferdeenbusiness_unitsm2dsmtdrumshtml9httpsgithubcomCarlSouthallMDBDrums
10httpspersotelecom-paristechfrgrichardENST-drums11httpswwwupfeduwebmtgdreanss
23 Digital sheet music 11
23 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter
programs from the 80s Proprietary softwares as Finale12 and Sibelius13 or open-
source software as MuseScore14 and LilyPond15 are some options that can be used
nowadays to write music sheets with a computer
In terms of file format Sibelius has its encrypted version that can only be read and
written with the software it can also write and read MusicXML16 files which are
not encrypted and are similar to an HTML file as it contains tags that define the
bars and notes of the music sheet this format is the standard for exchanging digital
music sheet
Within Music Criticrsquos framework the technology used to display the evaluated score
is LilyPond it can be called from the command line and allows adding macros that
change the size or color of the notes The other particularity is that it uses its own
file format (ly) and scores that are in MusicXML format have to be converted and
reviewed
24 Software tools
Many of the concepts and algorithms aforementioned are already developed as soft-
ware libraries this project has been developed with Python and in this section the
libraries that have been used are presented Some of them are open and public and
some others are private as pysimmusic that has been shared with us so we can use
and consult it In addition all the code has been developed using a tool from Google
called Collaboratory17 it allows to write code in a jupyter notebook18 format that
is agile to use and execute interactively
12httpswwwfinalemusiccom13httpswwwavidcomsibelius14httpsmusescoreorg15httpslilypondorg16httpswwwmusicxmlcom17httpscolabresearchgooglecom18httpsjupyterorg
12 Chapter 2 State of the art
241 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis
description and synthesis [23] it can also be installed as a Python-based library
with the pip19 command in Linux or compiling with certain flags in MacOS20 This
library includes a collection of MIR algorithms it is not a framework so it is in the
userrsquos hands how to use these processes Some of the algorithms used in this project
are music feature extraction onset detection and audio file IO
242 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine learning
algorithms for regression classification and clustering as well as pre-processing and
dimensionality reduction functions Based on NumPy22 and SciPy23 so its algorithms
are easy to adapt to the most common data structures used in Python It also allows
to save and load trained models to do inference tasks with new data
243 Lilypond
As described in section 23 LilyPond is an open-source songwriter software with
its file format and language It can produce visual renders of musical sheets in
PNG SVG and PDF formats as well as MIDI files to listen to the compositions
LilyPond works on the command line and allows us to introduce macros to modify
visual aspects of the score such as color or size
It is the digital sheet music technology used within Music Criticrsquos framework as
allows to embed an image in the music sheet generating a parallel representation of
the music sheet and a studentrsquos interpretation
19httpspypiorgprojectpip20httpsessentiaupfeduinstallinghtml21httpsscikit-learnorg22httpsnumpyorg23httpswwwscipyorgscipylibindexhtml
25 Summary 13
244 Pysimmusic
Pysimmusic is a private python library developed at the MTG It offers tools to
analyze the similarity of musical performances and uses libraries such as Essentia
LilyPond FFmpeg24 ia Pysimmusic contains onset detection algorithms and a
collection of audio descriptors and evaluation algorithms By now is the main eval-
uation software used in Music Critic to compare the recording submitted with the
reference
245 Music Critic
Music Critic is a project from the MTG intended to support technologies for online
music education facilitating the assessment of student performances25
The proposed workflow starts with a student submitting a recording playing the
proposed exercise Then the submission is sent to the Music Criticrsquos server where
is analyzed and assessed Finally the student receives the evaluation jointly with
the feedback from the server
25 Summary
Music information retrieval and machine learning have been popular fields of study
This has led to a large development of methods and algorithms that will be crucial
for this project Most of them are free and open-source and fortunately the private
ones have been shared by the UPF research team which is a great base to start the
development
24httpswwwffmpegorg25httpswwwupfeduwebmtgtech-transfer-asset_publisherpYHc0mUhUQ0G
contentid229860881maximizedYJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in section 132 having a well-annotated and balanced dataset is crucial to
get proper results In this section the 40kSamples Drums Dataset creation process is
explained first focusing on how to process existing datasets such as the mentioned
in 221 Secondly introducing the process of creating new datasets with a music
school corpus and a collection of recordings made in a recording studio Finally
describing the data augmentation procedure and how the audio samples are sliced
in individual drums events In Figure 1 we can see the different procedures to unify
the annotations of the different datasets while the audio does not need any specific
modification
31 Existing datasets
Each of the existing datasets has a different annotation format in this section the
process of unifying them will be explained as well as its implementation (see note-
book Dataset_formatUnificationipynb1) As the events to take into account
can be single instruments or combinations of them the annotations have to be for-
matted to show that events properly None of the annotations has this approach
so we have written a function that filters the list and joins the events with a small
1httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_formatUnificationipynb
14
31 Existing datasets 15
difference of time meaning that they are played simultaneously
Music school Studio REC IDMT Drums MDB Drums
audio + txt
Sibelius to MusicXML
MusicXML parser to txt
Write annotations
AnnotationsAudio
Figure 1 Datasets pre-processing
311 MDB Drums
This dataset was the first we worked with the annotation format in txt was a key
factor as it was easy to read and understand As the dataset is available in Github2
there is no need to download it neither process it from a local drive As shown in
the first cells of Dataset_formatUnificationipynb data from the repository can
be retrieved with a Python wrapper of the Github API3
This dataset has two annotations files depending on how deep the taxonomy used
is [20] In this case the generic class taxonomy is used as there is no need to
differentiate styles when playing a given instrument (ie single stroke flam drag
ghost note)
312 IDMT Drums
Differently to the previous dataset this one is only available downloading a zip
file4 It also differs in the annotation file format which is xml Using the Python
2httpsgithubcomCarlSouthallMDBDrums3httpspypiorgprojectgithubpy4httpswwwidmtfraunhoferdeenbusiness_unitsm2dsmtdrumshtml
16 Chapter 3 The 40kSamples Drums Dataset
package xmltodict5 in the second part of Dataset_formatUnificationipynb the
xml files are loaded as a Python dictionary and converted to txt format
32 Created datasets
In order to expand the dataset with more variety of samples other methods to get
data have been explored On one hand with audio data that has partial annotations
or some representation that is not data-driven such as a music sheet that contains
a visual representation of the music but not a logic annotation as mentioned in
the previous section On the other hand generating simple annotations is an easy
task so drums samples can be recorded standalone to create data in a controlled
environment In the next two sections these methods are described
321 Music school
A music school has shared its docent material with the MTG for research purposes
ie audio demos books in pdf format music sheet in Sibelius format As we can
see in Figure 1 the annotations from the music school corpus are in Sibelius format
this is an encrypted representation of the music sheet that can only be opened with
the Sibelius software The MTG has shared an AVID license which includes the
Sibelius software so we were able to convert the sib files to musicxml MusicXML
is not encrypted and allows to open it and read so a parser has been developed to
convert the MusicXML files to a symbolic representation of the music sheet This
representation has been inspired by [24] which proposes a system to represent chords
MusicXML parser
As mentioned in section 23 MusicXML format is based on ordering the visual
information with tags creating a tree structure of nested dictionaries In the first cell
of XML_parseripynb6 two functions are defined ConvertXML2Annotation reads
the musicxml file and gets the general information of the song (ie tempo time
5httpspypiorgprojectxmltodict6httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterXML_parseripynb
32 Created datasets 17
measure title) then a for loops throughout all the bars of the music sheet checking
whereas the given bar is self-defined the repetition of the previous one or the begin
or end of a repetition in the song (see Figure 2) in the self-defined bar case the bar
indeed is passed to an auxiliar function which parses it getting the aforementioned
symbolic representation
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three
others have a symbol that means rsquorepetition of the previous barrsquo moreover the
bar lines at the beginning and the end represents that these four bars have to be
repeated therefore this line in the music score represents an interpretation of eight
bars repeating the first one
The symbolic representation that we propose is based in [24] defines each bar with
a string this string contains the representations of the events in the bar separated
with blank spaces Each of the events has two dots () to separate the figure (ie
quarter note half note whole note) from the note or notes of the event which
are separated by a dot () For instance the symbolic representation of the first bar
in Figure 2 is F4A44 F4A44 F4A44 F4A44
In addition to this conversion in parse_one_measure function from XML_parser
notebook each measure is checked to ensure that fully represents the bar This
means that the sum of the figures of the bar has to be equal to the defined in the
time measure the sum of the events in a 44 bar has to be equal to four quarter
notes
Symbolic notation to unified annotation format
As we can see in Figure 1 once the music scores are converted to the symbolic
representation the last step is to unify the annotations with the used in sections 31
18 Chapter 3 The 40kSamples Drums Dataset
This process is made in the last cells of Dataset_formatUnification7 notebook
A dictionary with the translation of the notes to drums instrument is defined so
the note is already converted Differently the timestamp of each event has to be
computed based on the tempo of the song and the figure of each event this process
is made with the function get_time_steps_from_annotations8 which reads the
interpretation in symbolic notation and accumulates the duration of each event
based on the figure and the tempo
322 Studio recordings
At this point of the dataset creation we realized that the already existing data
was so unbalanced in terms of instances per class some classes had around two
thousand samples while others had only ten This situation was the reason to
record a personalized dataset to balance the overall distribution of classes as well
as exercises with different accuracy when reading simulating students with different
skill levels
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants
Barcelona) the first day was intended to mount the drumset and the microphones
which are listed in Table 2 in Figure 3 the microphone setup is shown differently
to the standard setup in which each instrument of the set has its microphone this
distribution of the microphones was intended to record the whole drumset with
different frequency responses
The recording process was divide into two phases first creating samples to balance
the dataset used to train the drums event classifier (called train set) Then recording
the studentsrsquo assignment simulation to test the whole system (called test set)
7httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_formatUnificationipynb
8httpsgithubcomMaciACtfg_DrumsAssessmentblobe81be958101be005cda805146d3287eec1a2d5a4scriptsdrumspyL9
9httpswwwstereodosiscom
32 Created datasets 19
Microphone Transducer principleBeyerdynamic TG D70 Dynamic
Shure PG52 DynamicShure SM57 Dynamic
Sennheiser e945 DynamicAKG C314 CondenserAKG C414 CondenserShure PG81 CondenserSamson C03 Condenser
Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes we decided to take into account only the classes
that appear in the music school subset this decision was motivated by the idea of
assessing the songs from the books so only classes of the collection of songs were
needed to train the classifier In Figure 4 the distribution of the selected classes
before the recordings is shown note that is in logarithmic scale so there is a large
difference among classes
20 Chapter 3 The 40kSamples Drums Dataset
Figure 4 Number of samples before Train set recording
To organize the recording process we designed 3 different routines to record depend-
ing on the class and the number of samples already existing a different routine was
recorded These routines were designed trying to represent the different speeds dy-
namics and interactions between instruments of a real interpretation In Appendix
A the routines scores are shown to write a generic routine a two lines stave is used
the bottom line represents the class to be recorded and the top line an auxiliary
one The auxiliary classes are cymbals concretely crashes and rides whose sound
remains a long period of time and its tail is mixed with the subsequent sound events
bull Routine 1 (Fig 31) This routine is intended for the classes that do not include
a crash or ride cymbal and has a small number of classes (ie lt500)
bull Routine 2 (Fig 32) This routine does not include auxiliary events as it is
intended for classes that include crash or ride cymbal whose interaction with
itself is intrinsic
bull Routine 3 (Fig 33) This is a short version of routine 1 which only repeats
each bar two times instead of four is intended for classes which not include a
crash or ride cymbal and has a large number of classes (ie gt500)
32 Created datasets 21
Routines 1 and 3 were recorded only one time as we had only one instrument of each
of the classes differently routine 2 was recorded two times for each cymbal as we
was able to use more instances of them different cymbals configurations used can
be seen in Appendix A in Figures 34 35 and 36
After the Train set recording the number of samples was a little more balanced as
shown in Figure 5 all the classes have at least 1500 samples
0
1000
2000
3000
ht+kd
kd+m
t
ht
mt
ft+sd
ft+kd
+sd
cr+sd
ft
cr+kd
cr
ft+kd
hh+k
d+sd
kd+s
d
cy+s
d
cy
cy+k
d sd
kd
hh+s
d
hh+k
d
hh
recorded before record
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song
in the same drumset to do that we recorded each song of the music school Drums
Grade Initial and Grade 1 playing it correctly and then making mistakes in both
reading and rhythmic ways After testing with these recordings we realized that we
were not able to test the limits of the assessment system in terms of tempo or with
different rhythmic measures So we proposed two exercises of groove reading in 44
and in 128 to be performed with different tempos these recordings have been done
in my study room with my laptoprsquos microphone
22 Chapter 3 The 40kSamples Drums Dataset
33 Data augmentation
As described in section 212 data augmentation aims to introduce changes to the
signals to optimize the statistical representation of the dataset To implement this
task the aforementioned Python library audiomentations is used
The library Audiomentations has a class called Compose which allows collecting
different processing functions assigning a probability to each of them Then the
Compose instance can be called several times with the same audio file and each time
the resulting audio will be processed differently because of the probabilities In
data_augmentationipynb10 a possible implementation is shown as well as some
plots of the original sample with different results of applying the created Compose
to the same sample an example of the results can be listened in Freesound11
The processing functions introduced in the Compose class are based in the proposed
in [13] and [14] its parameters are described
bull Add gaussian noise with 70 of probability
bull Time stretch between 08 and 125 with a 50 of probability
bull Time shift forward a maximum of 25 of the duration with 50 of probability
bull Pitch shift plusmn2 semitones with a 50 of probability
bull Apply mp3 compression with a 50 of probability
10httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterdata_augmentationipynb
11httpsfreesoundorgpeopleMaciaACpacks32213
34 Drums events trim 23
34 Drums events trim
As will be explained in section 421 the dataset has to be trimmed into individual
files to analyze them and extract the low-level descriptors In the Dataset_feature
Extractionipynb12 notebook this process has been implemented slicing all the
audios with its annotations each dataset separately to sight-check all the resultant
samples and detect better which annotations were not correct
35 Summary
To summarize a drums samples dataset has been created the one used in this
project will be called the 40k Samples Drums Dataset Nonetheless to share this
dataset we have to ensure that we are fully proprietary of the data which means
that the samples that come from IDMT MDBDrums and MusicSchool datasets
cannot be shared in another dataset Alternatively we will share the 29k Samples
Drums Dataset formed only by the samples recorded in the studio This dataset will
be available in Zenodo13 to download the whole dataset at once and in Freesound
some selected samples are uploaded in a pack14 to show the differences among mi-
crophones
12httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_featureExtractionipynb
13httpszenodoorgrecord4958592YMmNXW4p5TZ14httpsfreesoundorgpeopleMaciaACpacks32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment
pipeline are explained In Figure 6 the proposed pipeline diagram is shown it is
inspired by [2] Each box of the diagram refers to a section in this chapter so the
diagram might be helpful to get a general idea of the problem when explaining each
process
The system is divided into two main processes First the top boxes correspond to
the training process of the model using the dataset created in the previous chapter
Secondly the bottom row shows how a student submission is processed to generate
some feedback This feedback is the output of the system and should give some
indications to the student on how has performed and how can improve
41 Problem definition
To check if a student reads correctly a music sheet we need some tool to tag which
instruments of the drumset is playing for each detected event This leads us to
develop and train a Drums events classifier if this tool ensures a good accuracy
when classifying (ie lt95) we will be able to properly assess a studentrsquos recording
If the classifier has not enough accuracy the system will not be useful as we will not
be able to differentiate among errors from the student and errors from the classifier
24
42 Drums event classifier 25
Assessments
Music Scores
Studentsrsquo performances
Annotations
Audio recordings
Dataset
Feature extraction
Drums event classifier training
Performanceassessment
training
Feature extraction
Performanceassessment
inference
New studentrsquos recording
Visualization Performancefeedback
Figure 6 Proposed pipeline for a drums performance assessment system inspiredby [2]
For this reason the project has been mainly focused on developing the aforemen-
tioned drums event classifier and a proper dataset So developing a properly as-
sessed dataset of drums interpretations has not been possible nor the performance
assessment training Despite this the feedback visualization has been developed as
it is a nice way to close the pipeline and get some understandable results moreover
the performance feedback could be focused on deterministic aspects as telling the
student if is rushing or slowing in relation to a given tempo
42 Drums event classifier
As already mentioned this section has been the main load of work for this project
because of the dependence of a correct automatic transcription in order to do a
reliable assessment The process has been divided into 3 main parts extracting
26 Chapter 4 Methodology
the musical features training and validating the model in an iterative process and
finally validating the model with totally new data
421 Feature extraction
The feature extraction concept has been explained in Section 211 and has been
implemented using the MusicExtractor()1 method from Essentiarsquos library
MusicExtractor() method has to be called passing as a parameter the window and
hope sizes that will be used to perform the analysis as well as a filename of the event
to be analyzed The function extract_MusicalFeatures()2 has been implemented
to loop a list of files and analyze each of them to add the extracted features to a
csv file jointly with the class of each drum event At this point all the low-level
features were extracted both mean and standard deviation were computed across
all the frames of the given audio filename The reason was that we wanted to check
which features were redundant or meaningful when training the classifier
As mentioned in section 34 the fact that MusicExtractor() method has to be
called with a filename not an audio stream forced us to create another version of
the dataset which had each event annotated in a different audio file with the corre-
spondent class label as a filename Once all the datasets were properly sliced and
sight-checked the last cell of the notebook were executed with the correspondent
folder names (which contains all the sliced samples) and the features saved in differ-
ent csv one for each dataset3 Adding the number of instances in all the csv files
we get 40228 instances with 84 features and 1 label
1httpsessentiaupfedureferencestd_MusicExtractorhtml2httpsgithubcomMaciACtfg_DrumsAssessmentblobe81be958101be005cda805146d3287eec1a2d5a4
scriptsfeature_extractionpyL63httpsgithubcomMaciACtfg_DrumsAssessmenttreemasterdataslices
features
42 Drums event classifier 27
422 Training and validating
As mentioned in section 22 some authors have proposed machine learning algo-
rithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN)
to do sound event classification also some authors have developed more complex
methods for drums event classification The complexity of these last methods made
me choose the generic ones also to try if it were a good way to approach the problem
as there is no literature concretely on drums event classification with SVM or KNN
The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40k Drums Samples Dataset. The first times we trained the models we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy of formula 4.1, which does not take the unbalance of the dataset into account. The computed accuracy was around 92%, but the correct predictions were concentrated in the large classes, as shown in Figure 7: some classes had very low accuracy (even 0%; some classes have only 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances barely affects the overall accuracy computation.
\[ \text{accuracy}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} 1(\hat{y}_i = y_i) \tag{4.1} \]
Instead, the proper way to compute the accuracy on this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages it across all the classes, as in formula 4.2, where \(w_i\) represents the weight of each sample, normalized so that each class has the same total weight. This computation lowered the result to 79%, which was not a good result.

\[ \hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \tag{4.2} \]

\[ \text{balanced-accuracy}(y, \hat{y}, \hat{w}) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i \]
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used indicator for classification models is the F-score, which combines the precision and the recall of the model in one measure, as in formula 4.3. Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of instances that actually belong to the given class.
\[ F\text{-measure} = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{4.3} \]
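For reference, the three metrics above are implemented directly in scikit-learn, the library used for training (Section 2.4.2); the toy labels below are only illustrative:

    from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                                 f1_score)

    # toy labels, only to illustrate the metrics on an unbalanced class set
    y_true = ['hh', 'hh', 'hh', 'hh', 'sd', 'kd']
    y_pred = ['hh', 'hh', 'hh', 'hh', 'hh', 'kd']

    print(accuracy_score(y_true, y_pred))             # 5/6 ~ 0.83  (formula 4.1)
    print(balanced_accuracy_score(y_true, y_pred))    # (1+0+1)/3 ~ 0.67  (formula 4.2)
    print(f1_score(y_true, y_pred, average='macro'))  # per-class F, then mean  (formula 4.3)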
Having these results led us to record a personalized dataset to extend the already existing ones (see Section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as did the balanced accuracy and F-score (both 89%). Until this point we were using both the KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine controls the regularization; this technique is intended to make the model less sensitive to noise in the data and to outliers that may not represent their class properly. When increasing this value to 10 the results improved across all the classes, as shown in Figure 9, as did the balanced accuracy and F-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with only the classes that involve these three instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall balanced accuracy and F-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The training and validating iterative process has been implemented in the Classifier_training.ipynb4 notebook. First, the CSV files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered, and to remove redundant features, the ones with a very low standard deviation (i.e. std_dev < 0.00001) are deleted. As the SVM works better when data is normalized, the standard scaler is used to center all the feature distributions around 0 and ensure a standard deviation of 1.
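A sketch of this pre-processing, assuming a pandas/scikit-learn setup and an illustrative CSV layout (features in all columns but the last, class label in the last one); the dumped scaler file name is an assumption:

    import joblib
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    def preprocess(csv_path, std_threshold=0.00001):
        data = pd.read_csv(csv_path)
        X, y = data.iloc[:, :-1], data.iloc[:, -1]   # features | class label
        X = X.loc[:, X.std() > std_threshold]        # drop near-constant features
        scaler = StandardScaler()                    # zero mean, unit variance
        X_scaled = scaler.fit_transform(X)
        joblib.dump(scaler, 'scaler.joblib')         # reused at inference time
        return X_scaled, y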
In the next cells the dataset is split into train and validation sets, and the training method of sklearn's SVM is called to perform the training; once the models are trained, their parameters are dumped to a file so that the model can be loaded a posteriori and the learned knowledge applied to new data. This process was very slow on my computer, so we decided to upload the CSV files to Google Drive and open the notebook with Google Colaboratory; it was faster, which is key to avoid long waiting times during the iterative train-validate process. In the last cells the inference is made with the validation set and the accuracy is computed, as well as the confusion matrix plotted, to get an idea of which classes perform better.
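Condensed, the train/validate/dump steps look roughly like this; the split ratio, stratification and model file name are assumptions, not the notebook's exact values:

    import joblib
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import balanced_accuracy_score, confusion_matrix

    def train_and_validate(X, y, C=10):
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=0)
        model = SVC(C=C).fit(X_train, y_train)
        joblib.dump(model, 'svm_drums.joblib')       # reload later for inference
        y_pred = model.predict(X_val)
        print(balanced_accuracy_score(y_val, y_pred))
        print(confusion_matrix(y_val, y_pred))       # which classes perform better
        return model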
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection()5 does both tasks. As explained in Section 2.1.1, there are many onset detection methods and each of them is better suited to a different application. In the case of drums we tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works pretty well; but when the tempo increases, some onsets are not correctly detected, so we finally implemented the onset detection with the HFC method. This method computes, for each window, the HFC as in equation 4.4; note that the high-frequency bins (index k) weigh more in the final value:

\[ \mathrm{HFC}(n) = \sum_k k \cdot |X_k[n]|^2 \tag{4.4} \]

4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
Moreover, the function plots the audio waveform jointly with the detected onsets, to check after each test whether the detection has worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60 bpm drums interpretation
Figure 12 Onsets detected in a 220 bpm drums interpretation
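The HFC onset detection can be sketched with Essentia's standard-mode algorithms as below; the frame and hop sizes are illustrative, and the project's SliceDrums_BeatDetection() additionally slices the audio and plots the result:

    import essentia
    from essentia.standard import (MonoLoader, FrameGenerator, Windowing, FFT,
                                   CartesianToPolar, OnsetDetection, Onsets)

    def detect_onsets_hfc(filename, frame_size=1024, hop_size=512):
        audio = MonoLoader(filename=filename)()
        window, fft, c2p = Windowing(type='hann'), FFT(), CartesianToPolar()
        hfc = OnsetDetection(method='hfc')
        detection = []
        for frame in FrameGenerator(audio, frameSize=frame_size, hopSize=hop_size):
            magnitude, phase = c2p(fft(window(frame)))
            detection.append(hfc(magnitude, phase))
        # Onsets() peak-picks the detection function(s); here a single function
        # with weight 1. It returns the onset times in seconds.
        return Onsets()(essentia.array([detection]), [1])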
With the onsets information the audio can be trimmed into the different events; the playing order is kept in the name of each file, so the slices can easily be mapped to the expected events when comparing. The audio slices are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a CSV file.
To predict which event each slice contains, the already trained models are loaded in this new environment, and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row of the data. The described process is implemented in the first part of Assessment.ipynb6; the second part executes the visualization functions described in the next section.
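A minimal sketch of this inference step; the model and scaler file names are assumptions carried over from the training sketch above, and the CSV is assumed to keep the training-time feature columns:

    import joblib
    import pandas as pd

    def predict_events(features_csv):
        X = pd.read_csv(features_csv)               # one row per sliced event
        scaler = joblib.load('scaler.joblib')       # same pipeline as training
        model = joblib.load('svm_drums.joblib')     # the trained SVM
        return model.predict(scaler.transform(X))  # one predicted label per slice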
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving the student visual feedback on the interpretation. As the drums classifier has taken so much time, creating a dataset of graded interpretations has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different skill levels, but grading them and doing all the process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and stay consistent with the proposed ones.
So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Taking as a base implementation the solution of my colleague Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb.

First the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.

6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (these come from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio. In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (green good, red bad).
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
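A rough matplotlib sketch of this kind of plot follows; the color scale, line heights and the 0.25-second deviation normalization are illustrative choices, not the exact ones of save_bar_plot():

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_onset_deviation(waveform, sr, detected, expected, out_png):
        t = np.arange(len(waveform)) / sr
        fig, ax = plt.subplots(figsize=(12, 2))
        ax.plot(t, waveform, color='grey', linewidth=0.5)
        cmap = plt.get_cmap('RdYlGn_r')              # green (good) -> red (bad)
        for d, e in zip(detected, expected):
            ax.vlines(d, -1.0, -0.6, color='black')  # detected onset (bottom)
            ax.vlines(e, -0.2, 0.2, color='cyan')    # expected onset (middle)
            deviation = min(abs(d - e) / 0.25, 1.0)  # map the error onto [0, 1]
            ax.axvspan(min(d, e), max(d, e), color=cmap(deviation), alpha=0.4)
        fig.savefig(out_png, bbox_inches='tight')
        plt.close(fig)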
Once the waveform image is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes()10 the expected and predicted events are compared: a list is created with 1 at the indices where they match and 0 where they do not. The resulting list is then iterated and the 0 indices are checked, because most of the classification errors miss only one of the instruments of the event (i.e. predicting sd instead of hh+sd). These cases are considered partially correct, as the system has to take its own errors into account: at the indices where one of the instruments is correctly predicted and it is not a hi-hat (we consider it more important to read the snare and kick correctly than the hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct and red means incorrect.

9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
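The rule just described can be sketched as a small function; event labels are combination strings such as 'hh+sd', and the project's assess_notes() may differ in details:

    def assess_notes(expected, predicted):
        # one score per note: 1.0 -> green, 0.75 -> light green, 0.0 -> red
        scores = []
        for exp, pred in zip(expected, predicted):
            if pred == exp:
                scores.append(1.0)                   # fully correct
                continue
            shared = set(exp.split('+')) & set(pred.split('+'))
            # partially correct only if a non-hi-hat instrument was caught
            scores.append(0.75 if shared - {'hh'} else 0.0)
        return scores

    # e.g. expected ['hh+sd', 'kd'], predicted ['sd', 'hh'] -> [0.75, 0.0]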
With the waveform, the assessed notes and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with LilyPond macros. On the one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the assessed notes); on the other hand, after the first note of the staff, the keyword eps(11.50 16) indicates on which beat the waveform display starts and on which it ends, in this case from 0 to 16, which in a 4/4 rhythm is 4 bars; the other number is the scale of the waveform and allows fitting the plot to the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps: these are found in the txt file already mentioned in Section 3.1.1. Then the LilyPond file: this is the template, written in the LilyPond language, that defines the resulting music sheet; the macros to change color and size and to add the waveform are defined in it. When extracting the musical features, each submission creates its own CSV file to store the information. And finally we need, of course, the audio files with the recorded submission to be assessed.

11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until reaching 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB until an accumulated +30 dB.

In this chapter and in Appendix B all the resultant feedback visualizations are shown. The audio files can be listened to on Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.

As the high-frequency content method works perfectly, there are no limitations or errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positives.

1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops when the tempo increases. Taking as a reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one weighs 0.5 and an incorrect one 0; the total value is the mean of the weighted predictions.
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that increasing the tempo of exercise 1 decreases the accuracy of the classifier. This may be because increasing the tempo reduces the spacing between events, and consequently the duration of each event, which leaves fewer values with which to calculate the mean and standard deviation when extracting the timbre features. As stated in the law of large numbers [25], the larger the sample, the closer the sample mean is to the population mean; here, having fewer values in the calculation makes the aggregated features more sensitive to outliers, so the distributions tend to scatter.

Figure 19 Good reading and good tempo Ex 1 180 bpm

Figure 20 Good reading and good tempo Ex 1 220 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48

Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But in 12/8 at 100 bpm the quarter-note equivalent tempo is 300 quarter notes per minute, similar to 140 bpm in 4/4, whose quarter-note equivalent is 280 per minute. The results on 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875

Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that it produces a high-frequency content that is predicted as a hi-hat in all the cases.
Level   Correct   Partially OK   Incorrect   Total
+0dB    25        7              0           0.89
+6dB    23        9              0           0.86
+12dB   23        9              0           0.86
+18dB   24        7              1           0.86
+24dB   18        5              9           0.64
+30dB   13        5              14          0.48

Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level   Correct   Partially OK   Incorrect   Total
+0dB    12        7              13          0.48
+6dB    13        10             9           0.56
+12dB   10        8              14          0.5
+18dB   9         2              21          0.31
+24dB   8         0              24          0.25
+30dB   9         0              23          0.28

Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of results has been focused on the accuracy of the drums event classifier, but we think it is also important to evaluate whether the system can properly assess a student's submission. As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar, and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in the previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
Tables 7 and 8 summarize the counting, which works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system performs the prediction-based assessment pretty well. This can help a student know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help the student recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (black lines at the bottom of the waveform) are mostly behind the correspondent expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not; a set of further improvements is also given, together with a final thought on my work and my apprenticeship. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained throughout this document, we can now list them, stating their completeness and our contributions.

Firstly, the 29k Samples Drums Dataset has been created and is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of these interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier based on a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex, context-based methods might be needed, such as the ones proposed in [16] and [17]. It is also important to take into account that the task the model is trained to do is very hard for a human being: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on visualization rather than on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the 40k Samples Drums Dataset. It would be good to get recordings in different spaces and with different drumsets and microphones to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger: it could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even recording different people playing. This would introduce more variance into the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods from [16], [17] and [26] into the ADT part of the pipeline.

1 https://2021.ieeeicassp.org/
Also, in terms of the classes in the drumset, there is still a long way to go: there are no solutions that robustly transcribe a whole kit, including the toms and the different kinds of cymbals. Here we think a proper approach would be to work with professional musicians, who can help researchers better understand the instrument and create datasets covering different playing techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained on assessed drums exercises to give each student a mark. In this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
Regarding the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if the code and data are available and other researchers and students can execute them, obtaining the same results.

All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository3, and some samples are in Freesound4. This is the 29k Drums Samples Dataset; as not all of the 40k samples used for training are our property, we are not able to share them under our full authorship. Despite this, the other datasets used in this project are available individually.

2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although work remains in terms of robustness and reliability; still, a first approximation has been presented and several paths of improvement have been proposed.

Moreover, several fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also investigating methods and gathering already existing experiments and results.
Regarding my relationship with computers, I have improved my fluency with git and its web counterpart GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which was a tough path to take. In a more advanced phase of the project I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me, and being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization the system produces, and I hope more people get interested in this field of research so that better tools appear in the future.
List of Figures

1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system, inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes 30
11 Onsets detected in a 60 bpm drums interpretation 32
12 Onsets detected in a 220 bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff 41
24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables

1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography

[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
[Engraved drum score (LilyPond), tempo 60 bpm]
Figure 31 Recording routine 1
[Engraved drum score (LilyPond), tempo 60 bpm]
Figure 32 Recording routine 2
[Engraved drum score (LilyPond), tempo 60 bpm]
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
improved the results during the last years and allows us to use this kind of technolo-
gies for a more concrete application like education support [1] The development of
automated assessment tools for the support of musical instrument learning has been
a field of study in the Music Technology Group (MTG) [2] concretely on guitar
performances implemented in the Pysimmusic project [3] One of the open paths
that proposes Eremenko et al [2] is to implement it with different instruments
and this is what I have done
11 Motivation
The aim of the project is to implement a tool to evaluate musical performances
specifically reading scores with drums One possible real application may be to sup-
port the evaluation in a music school allowing the teacher to focus on other items
such as attitude posture and more subtle nuances of performance If high accuracy
is achieved in the automatic assessment of tempo and reading a fair assessment of
these aspects can be ensured In addition a collaboration between a music school
and the MTG allows the use of a specific corpus of data from the educational insti-
tution program this corpus is formed by a set of music sheets and the recordings of
the performances
1
2 Chapter 1 Introduction
Besides this I have been studying drums for fourteen years and a personal motivation
emerges from this fact Learning an instrument is a process that does not only rely
on going to class there is an important load of individual practice apart of the class
indeed Having a tool to assess yourself when practicing would be a nice way to
check your own progress
12 Existing solutions
In terms of music interpretation assessment there are already some software tools
that support assessing several instruments Applications such as Yousician1 or
SmartMusic2 offer from the most basic notions of playing an instrument to a syllabus
of themes to be played These applications return to the students an evaluation that
tells which notes are correctly played and which are not but do not give information
about tempo consistency or dynamics and even less generates a rubric as a teacher
could develop during a class
There are specific applications that support drums learning but in those the feature
of automatic assessment disappears There are some options to get online drums
lessons such as Drumeo3 or Drum School4 but they only offer a list of videos impart-
ing lessons on improving stylistic vocabulary feel improvisation or technique These
applications also offer personal feedback from professional drummers and a person-
alized studying plan but the specific feature of automatic performance assessment
is not implemented
As mentioned in the Introduction automatic music assessment has been a field
of research at the MTG With the development of Music Critic5 an assessment
workflow is proposed and implemented This is useful as can be adapted to the
drums assessment task
1httpsyousiciancom2httpswwwsmartmusiccom3httpswwwdrumeocom4httpsdrumschoolappcom5httpsmusiccriticupfedu
13 Identified challenges 3
13 Identified challenges
As mentioned in [2] there are still improvements to do in the field of music as-
sessment especially analyzing expressivity and with advanced level performances
Taking into account the scope of this project and having as a base case the guitar
assessment exercise from Music Critic some specific challenges are described below
131 Guitar vs drums
As defined in [4] a drumset is a collection of percussion instruments Mainly cym-
bals and drums even though some genres may need to add cowbells tambourines
pailas or other instruments with specific timbres Moreover percussion instruments
are splitted in two families membranophones and idiophones membranophones
produces sound primarily by hitting a stretched membrane tuning the membrane
tension different pitches can be obtained [5] differently idiophones produces sound
by the vibration of the instrument itself and the pitch and timbre are defined by
its own construction [5] The vibration aforementioned is produced by hitting the
instruments generally this hit is made with a wood stick but some genres may need
to use brushes hotrods or mallets to excite specific modes of the instruments With
all this in mind and as stated in [1] it is clear that transcribing drums having to
take into account all its variants and nuances is a hard task even for a professional
drummer With this said there is a need to simplify the problem and to limit the
instruments of the drumset to be transcribed
Returning to the assessment task guitars play notes and chords tuned so the way
to check if a music sheet has been read correctly is looking for the pitch information
and comparing it to the expected one Differently instruments that form a drumset
are mainly unpitched (except toms which are tuned using different scales and tuning
paradigms) so the differences among drums events are on the timbre A different
approach has to be defined in order to check which instrument is being played for
each detected event the first idea is to apply machine learning for sound event
classification
4 Chapter 1 Introduction
Along the project we will refer to the different instruments that conform a drumkit
with abbreviations In Table 1 the legend used is shown the combination of 2 or
more instruments is represented with a rsquo+rsquo symbol between the different tags
Instrument Kick Drum Snare Drum Floor tom Mid tom High tomAbbreviation kd sd ft mt ht
Instrument Hi-hat Ride cymbal Crash cymbalAbbreviation hh cy cr
Table 1 Abbreviationsrsquo legend
132 Dataset creation
Keeping in mind the last idea of the previous section if a machine learning approach
has to be implemented there is a basic need to obtain audio data of drums Apart
from the audio data proper annotations of the drums interpretations are needed in
order to slice them correctly and extract musical features of the different events
The process of gathering data should take into account the different possibilities that
offers a drumset in terms of timbre loudness and tone Several datasets should be
combined as well as additional recordings with different drumsets in order to have
a balanced and representative dataset Moreover to evaluate the assessment task
a set of exercises has to be recorded with different levels of skill
There is also the need to capture those sounds with several frequency responses in
order to make the model independent of the microphone Also those samples could
be processed to get variations of each of them with data augmentation processes
133 Signal quality
Regarding the assignment we have to take into account that a student will not be
able to record its interpretations with a setup as the used in a studio recording most
of the time the recordings will be done using the laptop or mobile phone microphone
This fact has to be taken into account when training the event classifier in order
to do data augmentation and introduce these transformations to the dataset eg
introducing noise to the samples or amplifying to get overload distortion
14 Objectives 5
14 Objectives
The main objective of this project is to develop a tool to assess drums interpretations
of a proposed music sheet This objective has to be split into the different steps of
the pipeline
bull Generate a correctly annotated drums dataset which means a collection of
audio drums recordings and its annotations all equally formatted
bull Implement a drums event sound classifier
bull Find a way to properly visualize drums sheets and their assessment
bull Propose a list of exercises to evaluate the technology
In addition having the code published in a public Github6 repository and uploading
the created dataset to Freesound7 and Zenodo8 will be a good way to share this work
15 Project overview
The next chapters will be developed as follows In chapter 2 the state of the art is
reviewed Focusing on signal processing algorithms and ways to implement sound
event classification ending with music sheet technologies and software tools available
nowadays In chapter 3 the creation of a drums dataset is described Presenting the
use of already available datasets and how new data has been recorded and annotated
In chapter 4 the methodology of the project is detailed which are the algorithms
used for training the classifier as well as how new submissions are processed to assess
them In chapter 5 an evaluation of the results is done pointing out the limitations
and the achievements Chapter 6 concludes with a discussion on the methods used
the work done and further work
6httpsgithubcom7httpsfreesoundorg8httpszenodoorg
Chapter 2
State of the art
In this chapter the concepts and technologies used in the project are explained
covering algorithm references and existing implementations First signal process-
ing techniques on onset detection and feature extraction are reviewed then sound
event classification field is presented and its relationship with drums event classifica-
tion Also the principal music sheet technologies and codecs are presented Finally
specific software tools are listed
21 Signal processing
211 Feature extraction
In the following sections sound event classification will be explained most of these
methods are based on training models using features extracted from the audio not
with the audio chunks indeed [6] In this section signal processing methods to get
those features are presented
Onset detection
In an audio signal an onset is the beginning of a new event it can be either a
single note a chord or in the case of the drums the sound produced by hitting one
or more instruments of the drumset It is necessary to have a reliable algorithm
6
21 Signal processing 7
that properly detects all the onsets of a drums interpretation With the onsets
information (a list of timestamps) the audio can be sliced to analyze each chunk
separately and to assess the tempo consistency
It is important to address the challenge in a psychoacoustical way as the objective
is to detect the musical events as a human will do In [7] the idea of perceptual
onset for percussive instruments is defined as a time interval between the physical
onset and the moment that the maximum level is reached In [8] many methods are
reviewed focusing on the differences of performance depending on the signal Non
Pitched Percussive instruments are better detected with temporal methods or high-
frequency content methods while Pitched Non Percussive instruments may need to
take into account changes of energy in the spectrum distribution as the onset may
represent a different note
The sound generated by the drums is mainly percussive (discarding brushesrsquo slow
patterns or malletrsquos build-ups on the cymbals) which means that is formed by a
short transient followed by a short decay there is no sustain As the transient is a
fast change of energy it implies a high-frequency content because changes happen
in a very little frame of time As recommended in [9] HFC method will be used
Timbre features
As described in [10] a feature denotes in some way a quantity or a value Features
extracted by processing the audio stream or transformations of that (ie FFT)
are called low-level descriptors these features have no relevant information from a
human point of view but are useful for computational processes [11]
Some low-level descriptors are computed from the temporal information for in-
stance the zero-crossing rate tells the number of times the signal crosses the zero
axis per second the attack time is the duration of the transient and temporal cen-
troid the energy distribution of an event during the time Other well known features
are the root median square of the signal or the high-frequency content mentioned
in section 211
8 Chapter 2 State of the art
Besides temporal features low-level descriptors can also be computed from the fre-
quency domain Some of them are spectral flatness spectral roll-off spectral slope
spectral flux ia
Nowadays Essentialsquos library offers a collection of algorithms that reliably extracts
the low-level descriptors aforementioned the function that englobes all the extrac-
tors is called Music extractor1
212 Data augmentation
Data augmentation processes refer to the optimization of the statistical representa-
tion of the datasets in terms of improving the generalization of the resultant models
These methods are based on the introduction of unobserved data or latent variables
that may not be captured during the dataset creation [12]
Regarding this technique applied to audio data signal processing algorithms are
proposed in [13] and [14] that introduces changes to the signals in both time and
frequency domains In these articles the goal is to improve accuracy on speech and
animal sound recognition although this could apply to drums event classification
The processes that lead best results in [13] and [14] were related to time-domain
transformations for instance time-shifting and stretching adding noise or harmonic
distortion compressing in a given dynamic range ia Other processes proposed
were focused on the spectrogram of the signal applying transformations such as
shifting the matrix representation setting to 0 some areas or adding spectrograms
of different samples of the same class
Presently some Python2 libraries are developed and maintained in order to do audio
data augmentation tasks For instance audiomentations3 and the GPU version
torch-audiomentations4
1httpsessentiaupfedustreaming_extractor_musichtml2httpswwwpythonorg3httpspypiorgprojectaudiomentations0604httpspypiorgprojecttorch-audiomentations
22 Sound event classification 9
22 Sound event classification
Sound Event Classification is the task of detecting and recognizing sound events in
an audio stream [15] As described in [10] this task can be approached from two
sides on one hand the perceptual approach tries to extract the timbre similarity to
cluster sounds as how we perceive them on the other hand the taxonomic approach
is determined to label sound events as they are defined in the cultural or user biased
taxonomies In this project the focus is on the second approach as the task is to
classify sound events in the drums taxonomy (ie kick drum snare drum hi-hat)
Also in [] many classification methods are proposed Concretely in the taxonomy
approach machine learning algorithms such as K-Nearest Neighbors Support Vector
Machines or Neural Networks All of them using features extracted from the audio
data as explained in section 211
221 Drums event classification
This section is divided into two parts first presenting the state-of-the-art methods
for drum event classification and then the most relevant existing datasets This
section is mainly based on the article [1] as it is a review of the topic and encompasses
the core concepts of the project
Methods
Focusing on the taxonomic drums events classification this field has been studied for
the last years as in the Music Information Retrieval Evaluation eXchange5 (MIREX)
has been a proposed challenge since 20056 In [1] a review of the main methods
that have been investigated is done The authors collect different approaches such
as Recurrent Neural Networks proposed in [16] Non-Negative matrix factorization
proposed in [17] and others real-time based using MaxMSP7 as described in [18]
5httpswwwmusic-irorgmirexwikiMIREX_HOME6httpswwwmusic-irorgmirexwiki2005Audio_Drum_Detection_Results7httpscycling74comproductsmax
10 Chapter 2 State of the art
It is needed to mention that the proposed methods are focused on Automatic Drum
Transcription (ADT) of drumsets formed only by the kick drum snare drum and
hi-hat ADT field is intended to transcribe audio but in our case we have to check
if an audio event is or not the expected event this particularity can be used in our
favor as some assumptions can be made about the audio that has to be analyzed
Datasets
In addition to the methods and their combinations the data used to train the
system plays a crucial role As a result the dataset may have a big impact on the
generalization capabilities of the models In this section some existing datasets are
described
bull IDMT-SMT-Drums [19] Consists of real drum recordings containing only
kick drum snare drum and hi-hat events Each recording has its transcription
in xml format and is publicly avaliable to download8
bull MDB Drums [20] Consists of real drums recordings of a wide range of genres
drumsets and styles Each recording has two txt transcriptions for the classes
and subclasses defined in [20] (eg class Hi-hat Subclasses Closed hi-hat
open hi-hat pedal hi-hat) It is publicly avaliable to download9
bull ENST-Drums [21] Consists of real drum audio and video recordings of dif-
ferent drummers and drumsets Each recording has its transcription and some
of them include accompaniment audio It is publicly available to download10
bull DREANSS [22] Differently this dataset is a collection of drum recordings
datasets that have been annotated a posteriori It is publicly available to
download11
Electronic drums datasets have not been considered as the student assignment is
supposed to be recorded with a real drumset8httpswwwidmtfraunhoferdeenbusiness_unitsm2dsmtdrumshtml9httpsgithubcomCarlSouthallMDBDrums
10httpspersotelecom-paristechfrgrichardENST-drums11httpswwwupfeduwebmtgdreanss
23 Digital sheet music 11
23 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter
programs from the 80s Proprietary softwares as Finale12 and Sibelius13 or open-
source software as MuseScore14 and LilyPond15 are some options that can be used
nowadays to write music sheets with a computer
In terms of file format Sibelius has its encrypted version that can only be read and
written with the software it can also write and read MusicXML16 files which are
not encrypted and are similar to an HTML file as it contains tags that define the
bars and notes of the music sheet this format is the standard for exchanging digital
music sheet
Within Music Criticrsquos framework the technology used to display the evaluated score
is LilyPond it can be called from the command line and allows adding macros that
change the size or color of the notes The other particularity is that it uses its own
file format (ly) and scores that are in MusicXML format have to be converted and
reviewed
24 Software tools
Many of the concepts and algorithms aforementioned are already developed as soft-
ware libraries this project has been developed with Python and in this section the
libraries that have been used are presented Some of them are open and public and
some others are private as pysimmusic that has been shared with us so we can use
and consult it In addition all the code has been developed using a tool from Google
called Collaboratory17 it allows to write code in a jupyter notebook18 format that
is agile to use and execute interactively
12httpswwwfinalemusiccom13httpswwwavidcomsibelius14httpsmusescoreorg15httpslilypondorg16httpswwwmusicxmlcom17httpscolabresearchgooglecom18httpsjupyterorg
12 Chapter 2 State of the art
241 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis
description and synthesis [23] it can also be installed as a Python-based library
with the pip19 command in Linux or compiling with certain flags in MacOS20 This
library includes a collection of MIR algorithms it is not a framework so it is in the
userrsquos hands how to use these processes Some of the algorithms used in this project
are music feature extraction onset detection and audio file IO
2.4.2 Scikit-learn

Scikit-learn21 is an open-source library for Python that integrates machine learning
algorithms for regression, classification and clustering, as well as pre-processing and
dimensionality reduction functions. It is based on NumPy22 and SciPy23, so its
algorithms are easy to adapt to the most common data structures used in Python.
It also allows saving and loading trained models to do inference tasks with new data.
2.4.3 LilyPond

As described in section 2.3, LilyPond is an open-source scorewriter software with
its own file format and language. It can produce visual renders of music sheets in
PNG, SVG and PDF formats, as well as MIDI files to listen to the compositions.
LilyPond works on the command line and allows us to introduce macros to modify
visual aspects of the score, such as color or size.

It is the digital sheet music technology used within Music Critic's framework, as it
allows embedding an image in the music sheet, generating a parallel representation
of the music sheet and a student's interpretation.
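Since LilyPond is driven from the command line, the render step can be scripted.
The following is a minimal sketch, assuming LilyPond is installed and on the PATH;
the file names are placeholders:

import subprocess

def render_score(ly_path: str, out_basename: str) -> None:
    # Call the LilyPond CLI to render a .ly file as a PNG image.
    # --png selects the output format, -o sets the output basename.
    subprocess.run(["lilypond", "--png", "-o", out_basename, ly_path], check=True)

render_score("exercise.ly", "exercise_render")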
19 https://pypi.org/project/pip/
20 https://essentia.upf.edu/installing.html
21 https://scikit-learn.org
22 https://numpy.org
23 https://www.scipy.org/scipylib/index.html
2.4.4 Pysimmusic

Pysimmusic is a private Python library developed at the MTG. It offers tools to
analyze the similarity of musical performances, and uses libraries such as Essentia,
LilyPond and FFmpeg24, among others. Pysimmusic contains onset detection
algorithms and a collection of audio descriptors and evaluation algorithms. By now
it is the main evaluation software used in Music Critic to compare the submitted
recording with the reference.
2.4.5 Music Critic

Music Critic is a project from the MTG intended to support technologies for online
music education, facilitating the assessment of student performances.25

The proposed workflow starts with a student submitting a recording of the proposed
exercise. Then the submission is sent to Music Critic's server, where it is analyzed
and assessed. Finally, the student receives the evaluation jointly with the feedback
from the server.
2.5 Summary

Music information retrieval and machine learning have been popular fields of study.
This has led to a large development of methods and algorithms that will be crucial
for this project. Most of them are free and open-source, and fortunately the private
ones have been shared by the UPF research team, which is a great base to start the
development.
24 https://www.ffmpeg.org
25 https://www.upf.edu/web/mtg/tech-transfer/-/asset_publisher/pYHc0mUhUQ0G/content/id/229860881/maximized#YJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in section 1.3.2, having a well-annotated and balanced dataset is crucial
to get proper results. In this chapter the 40kSamples Drums Dataset creation
process is explained: first, focusing on how to process existing datasets such as the
ones mentioned in 2.2.1; secondly, introducing the process of creating new datasets
with a music school corpus and a collection of recordings made in a recording
studio; finally, describing the data augmentation procedure and how the audio
samples are sliced into individual drums events. In Figure 1 we can see the different
procedures to unify the annotations of the different datasets, while the audio does
not need any specific modification.
3.1 Existing datasets

Each of the existing datasets has a different annotation format. In this section the
process of unifying them is explained, as well as its implementation (see the
notebook Dataset_formatUnification.ipynb1). As the events to take into account
can be single instruments or combinations of them, the annotations have to be
formatted to represent those events properly. None of the annotations has this
approach, so we have written a function that filters the list and joins the events
with a small difference of time, meaning that they are played simultaneously.

1 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
[Diagram: pre-processing of the four sources (Music school, Studio REC, IDMT
Drums, MDB Drums); depending on the source, the steps audio + txt, Sibelius to
MusicXML, MusicXML parser to txt, and write annotations lead to the unified
annotations and audio.]

Figure 1 Datasets pre-processing
3.1.1 MDB Drums

This dataset was the first we worked with; the annotation format in txt was a key
factor, as it was easy to read and understand. As the dataset is available on GitHub2,
there is no need to download it nor process it from a local drive. As shown in
the first cells of Dataset_formatUnification.ipynb, data from the repository can
be retrieved with a Python wrapper of the GitHub API3.
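As an illustration of this retrieval step, the following sketch fetches one annotation
file directly over HTTP with requests instead of the GitHub API wrapper used in
the notebook; the file path inside the repository is hypothetical:

import requests

RAW = "https://raw.githubusercontent.com/CarlSouthall/MDBDrums/master"
path = "annotations/class/MusicDelta_Rock_class.txt"  # hypothetical path

text = requests.get(f"{RAW}/{path}").text
for line in text.strip().splitlines():
    onset, label = line.split()   # each row: onset time in seconds + class label
    print(float(onset), label)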
This dataset has two annotation files, depending on how deep the used taxonomy
is [20]. In this case the generic class taxonomy is used, as there is no need to
differentiate playing styles for a given instrument (i.e. single stroke, flam, drag,
ghost note).
3.1.2 IDMT Drums

Differently to the previous dataset, this one is only available by downloading a zip
file4. It also differs in the annotation file format, which is XML. Using the Python
package xmltodict5, in the second part of Dataset_formatUnification.ipynb the
XML files are loaded as a Python dictionary and converted to txt format.

2 https://github.com/CarlSouthall/MDBDrums
3 https://pypi.org/project/githubpy/
4 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
5 https://pypi.org/project/xmltodict/
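A minimal sketch of this conversion is shown below; the tag and key names are
assumptions about the XML layout, not verified against the dataset:

import xmltodict

# Load one XML annotation and rewrite it as onset/label rows in txt format.
with open("annotation.xml") as f:
    doc = xmltodict.parse(f.read())

events = doc["instrumentRecording"]["transcription"]["event"]  # assumed tags
with open("annotation.txt", "w") as out:
    for ev in events:
        out.write(f'{ev["onsetSec"]}\t{ev["instrument"]}\n')   # assumed keys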
3.2 Created datasets

In order to expand the dataset with more variety of samples, other methods to get
data have been explored. On one hand, with audio data that has partial annotations
or some representation that is not data-driven, such as a music sheet that contains
a visual representation of the music but not a logic annotation, as mentioned in
the previous section. On the other hand, generating simple annotations is an easy
task, so drums samples can be recorded standalone to create data in a controlled
environment. In the next two sections these methods are described.
3.2.1 Music school

A music school has shared its teaching material with the MTG for research purposes,
i.e. audio demos, books in PDF format and music sheets in Sibelius format. As we
can see in Figure 1, the annotations from the music school corpus are in Sibelius
format; this is an encrypted representation of the music sheet that can only be
opened with the Sibelius software. The MTG has shared an AVID license which
includes the Sibelius software, so we were able to convert the .sib files to MusicXML.
MusicXML is not encrypted and can be opened and read, so a parser has been
developed to convert the MusicXML files to a symbolic representation of the music
sheet. This representation has been inspired by [24], which proposes a system to
represent chords.
MusicXML parser

As mentioned in section 2.3, the MusicXML format is based on ordering the visual
information with tags, creating a tree structure of nested dictionaries. In the first
cell of XML_parser.ipynb6 two functions are defined: ConvertXML2Annotation reads
the .musicxml file and gets the general information of the song (i.e. tempo, time
signature, title); then a for loop runs over all the bars of the music sheet, checking
whether the given bar is self-defined, a repetition of the previous one, or the
beginning or end of a repetition in the song (see Figure 2). In the self-defined case,
the bar is passed to an auxiliary function which parses it, producing the
aforementioned symbolic representation.

6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three
others have a symbol that means 'repetition of the previous bar'; moreover, the
bar lines at the beginning and the end represent that these four bars have to be
repeated. Therefore, this line in the music score represents an interpretation of
eight bars, repeating the first one.
The symbolic representation that we propose, based on [24], defines each bar with
a string; this string contains the representations of the events in the bar, separated
by blank spaces. Each of the events uses a colon (:) to separate the figure (i.e.
quarter note, half note, whole note) from the note or notes of the event, which are
separated by a dot (.). For instance, the symbolic representation of the first bar in
Figure 2 is F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4.
In addition to this conversion, in the parse_one_measure function from the
XML_parser notebook each measure is checked to ensure that it fully represents the
bar. This means that the sum of the figures of the bar has to be equal to the one
defined in the time signature: the sum of the events in a 4/4 bar has to be equal
to four quarter notes.
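A sketch of this completeness check is shown below, assuming the notes:figure
encoding of the symbolic representation described above (with the figure stored as
the denominator of the note value, e.g. 4 for a quarter note); dotted figures are
ignored for simplicity:

from fractions import Fraction

def bar_is_complete(bar: str, numerator: int = 4, denominator: int = 4) -> bool:
    # Every event is "notes:figure", e.g. "F4.A4:4" is a quarter-note event
    # with the notes F4 and A4; a 4/4 bar must sum to four quarter notes.
    total = sum(Fraction(1, int(event.split(":")[1])) for event in bar.split())
    return total == Fraction(numerator, denominator)

print(bar_is_complete("F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4"))  # True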
Symbolic notation to unified annotation format

As we can see in Figure 1, once the music scores are converted to the symbolic
representation, the last step is to unify the annotations with the ones used in
section 3.1. This process is made in the last cells of the Dataset_formatUnification7
notebook. A dictionary with the translation of the notes to drums instruments is
defined, so the note conversion is direct. Differently, the timestamp of each event
has to be computed based on the tempo of the song and the figure of each event;
this process is made with the function get_time_steps_from_annotations8, which
reads the interpretation in symbolic notation and accumulates the duration of each
event based on the figure and the tempo.
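A simplified re-implementation of this idea (not the notebook's code) could look
as follows:

def get_time_steps(events, tempo_bpm):
    # Accumulate each event's duration, derived from its figure at the given
    # tempo; a quarter note (figure 4) lasts exactly one beat.
    beat = 60.0 / tempo_bpm
    t, timestamps = 0.0, []
    for event in events:              # event is "notes:figure", e.g. "F4.A4:4"
        figure = int(event.split(":")[1])
        timestamps.append(t)
        t += beat * 4.0 / figure      # e.g. an eighth note (8) lasts half a beat
    return timestamps

print(get_time_steps("F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4".split(), 60))
# [0.0, 1.0, 2.0, 3.0]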
3.2.2 Studio recordings

At this point of the dataset creation we realized that the already existing data
was very unbalanced in terms of instances per class: some classes had around two
thousand samples while others had only ten. This situation was the reason to
record a personalized dataset, to balance the overall distribution of classes as well
as to record exercises with different reading accuracy, simulating students with
different skill levels.

The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants,
Barcelona). The first day was intended to mount the drumset and the microphones,
which are listed in Table 2; in Figure 3 the microphone setup is shown. Differently
to the standard setup, in which each instrument of the set has its microphone, this
distribution of the microphones was intended to record the whole drumset with
different frequency responses.

The recording process was divided into two phases: first, creating samples to
balance the dataset used to train the drums event classifier (called train set); then,
recording the students' assignment simulation to test the whole system (called
test set).
7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
9 https://www.stereodosis.com
Microphone            Transducer principle
Beyerdynamic TG D70   Dynamic
Shure PG52            Dynamic
Shure SM57            Dynamic
Sennheiser e945       Dynamic
AKG C314              Condenser
AKG C414              Condenser
Shure PG81            Condenser
Samson C03            Condenser

Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set

To limit the number of classes, we decided to take into account only the classes
that appear in the music school subset. This decision was motivated by the idea of
assessing the songs from the books, so only classes of the collection of songs were
needed to train the classifier. In Figure 4 the distribution of the selected classes
before the recordings is shown; note that it is in logarithmic scale, so there is a
large difference among classes.
Figure 4 Number of samples before Train set recording
To organize the recording process we designed three different routines to record:
depending on the class and the number of samples already existing, a different
routine was recorded. These routines were designed trying to represent the different
speeds, dynamics and interactions between instruments of a real interpretation. In
Appendix A the routine scores are shown; to write a generic routine, a two-line
stave is used, where the bottom line represents the class to be recorded and the top
line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and
rides, whose sound lasts a long period of time and whose tail is mixed with the
subsequent sound events.
• Routine 1 (Fig. 31): This routine is intended for the classes that do not include
a crash or ride cymbal and have a small number of samples (i.e. <500).

• Routine 2 (Fig. 32): This routine does not include auxiliary events, as it is
intended for classes that include a crash or ride cymbal, whose interaction with
itself is intrinsic.

• Routine 3 (Fig. 33): This is a short version of routine 1 which only repeats
each bar two times instead of four; it is intended for classes which do not include
a crash or ride cymbal and have a large number of samples (i.e. >500).
Routines 1 and 3 were recorded only once, as we had only one instrument for each
of those classes; differently, routine 2 was recorded twice for each cymbal, as we
were able to use more instances of them. The different cymbal configurations used
can be seen in Appendix A, in Figures 34, 35 and 36.

After the Train set recording the number of samples was a little more balanced: as
shown in Figure 5, all the classes have at least 1500 samples.
[Bar chart: number of samples per class (scale 0 to 3000) for the classes ht+kd,
kd+mt, ht, mt, ft+sd, ft+kd+sd, cr+sd, ft, cr+kd, cr, ft+kd, hh+kd+sd, kd+sd,
cy+sd, cy, cy+kd, sd, kd, hh+sd, hh+kd and hh; the bars distinguish samples
existing before the recording from newly recorded ones.]
Figure 5 Number of samples after Train set recording
Test set

The test set recording tried to simulate different students performing the same song
on the same drumset. To do that, we recorded each song of the music school Drums
Grade Initial and Grade 1, playing it correctly and then making mistakes in both
reading and rhythmic ways. After testing with these recordings, we realized that
we were not able to test the limits of the assessment system in terms of tempo or
with different rhythmic measures. So we proposed two exercises of groove reading,
in 4/4 and in 12/8, to be performed at different tempos; these recordings have been
done in my study room with my laptop's microphone.
3.3 Data augmentation

As described in section 2.1.2, data augmentation aims to introduce changes to the
signals to optimize the statistical representation of the dataset. To implement this
task, the aforementioned Python library audiomentations is used.

The library audiomentations has a class called Compose, which allows collecting
different processing functions, assigning a probability to each of them. Then the
Compose instance can be called several times with the same audio file, and each
time the resulting audio will be processed differently because of the probabilities.
In data_augmentation.ipynb10 a possible implementation is shown, as well as some
plots of the original sample together with different results of applying the created
Compose to the same sample; an example of the results can be listened to in
Freesound11.

The processing functions introduced in the Compose are based on the ones proposed
in [13] and [14]; their parameters are described below, and a sketch follows the list.
• Add Gaussian noise, with 70% probability.

• Time stretch between 0.8 and 1.25, with 50% probability.

• Time shift forward a maximum of 25% of the duration, with 50% probability.

• Pitch shift ±2 semitones, with 50% probability.

• Apply MP3 compression, with 50% probability.
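A possible Compose with these parameters could look like the sketch below; the
parameter names follow the audiomentations versions available around 2021, and
the noise amplitudes are the library defaults:

import numpy as np
from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                             PitchShift, Shift, TimeStretch)

augmenter = Compose([
    AddGaussianNoise(p=0.7),                               # noise, 70%
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),       # stretch, 50%
    Shift(min_fraction=0.0, max_fraction=0.25, p=0.5),     # forward shift, 50%
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),  # +-2 semitones, 50%
    Mp3Compression(p=0.5),                                 # mp3 artifacts, 50%
])

audio = np.random.uniform(-1, 1, 44100).astype(np.float32)  # stand-in sample
augmented = augmenter(samples=audio, sample_rate=44100)     # differs per call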
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213/
3.4 Drums events trim

As will be explained in section 4.2.1, the dataset has to be trimmed into individual
files in order to analyze them and extract the low-level descriptors. In the
Dataset_featureExtraction.ipynb12 notebook this process has been implemented,
slicing all the audios with their annotations, each dataset separately, to sight-check
all the resulting samples and better detect which annotations were not correct.
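A minimal sketch of the slicing idea, assuming annotations given as (onset, label)
pairs and a fixed slice length (the actual notebook may slice differently):

import soundfile as sf

def trim_events(audio_path, annotations, out_dir):
    # annotations: list of (onset_seconds, class_label) pairs, as in the
    # unified txt format; the fixed 250 ms slice length is an assumption.
    audio, sr = sf.read(audio_path)
    for i, (onset, label) in enumerate(annotations):
        start = int(onset * sr)
        end = min(start + int(0.25 * sr), len(audio))
        sf.write(f"{out_dir}/{label}_{i}.wav", audio[start:end], sr)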
3.5 Summary

To summarize, a drums samples dataset has been created; the one used in this
project will be called the 40k Samples Drums Dataset. Nonetheless, to share this
dataset we have to ensure that we are the full proprietaries of the data, which means
that the samples that come from the IDMT, MDB Drums and music school datasets
cannot be shared in another dataset. Alternatively, we will share the 29k Samples
Drums Dataset, formed only by the samples recorded in the studio. This dataset
will be available in Zenodo13, to download the whole dataset at once, and in
Freesound, where some selected samples are uploaded in a pack14 to show the
differences among microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592#.YMmNXW4p5TZ
14 https://freesound.org/people/MaciaAC/packs/32397/
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment
pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is
inspired by [2]. Each box of the diagram refers to a section in this chapter, so the
diagram might be helpful to get a general idea of the problem when explaining
each process.

The system is divided into two main processes. First, the top boxes correspond to
the training process of the model, using the dataset created in the previous chapter.
Secondly, the bottom row shows how a student submission is processed to generate
some feedback. This feedback is the output of the system, and should give some
indications to the student on how they have performed and how they can improve.
4.1 Problem definition

To check if a student reads a music sheet correctly, we need some tool to tag which
instruments of the drumset are playing for each detected event. This leads us to
develop and train a drums event classifier; if this tool ensures a good accuracy when
classifying (i.e. >95%), we will be able to properly assess a student's recording. If
the classifier does not have enough accuracy, the system will not be useful, as we
will not be able to differentiate between errors from the student and errors from
the classifier.
[Diagram: the training path goes from assessments, music scores and students'
performances through annotations and audio recordings into the dataset, then
through feature extraction to drums event classifier training and performance
assessment training; the inference path goes from a new student's recording
through feature extraction and performance assessment inference to a visualization
and performance feedback.]

Figure 6 Proposed pipeline for a drums performance assessment system, inspired
by [2]
For this reason, the project has been mainly focused on developing the
aforementioned drums event classifier and a proper dataset. Hence, developing a
properly assessed dataset of drums interpretations has not been possible, nor has
the performance assessment training. Despite this, the feedback visualization has
been developed, as it is a nice way to close the pipeline and get some understandable
results; moreover, the performance feedback could be focused on deterministic
aspects, such as telling the student if they are rushing or slowing in relation to a
given tempo.
4.2 Drums event classifier

As already mentioned, this section has been the main load of work for this project,
because a reliable assessment depends on a correct automatic transcription. The
process has been divided into three main parts: extracting the musical features,
training and validating the model in an iterative process, and finally testing the
model with totally new data.
4.2.1 Feature extraction

The feature extraction concept has been explained in section 2.1.1 and has been
implemented using the MusicExtractor()1 method from the Essentia library.

The MusicExtractor() method has to be called passing as parameters the window
and hop sizes that will be used to perform the analysis, as well as the filename of
the event to be analyzed. The function extract_MusicalFeatures()2 has been
implemented to loop over a list of files and analyze each of them, adding the
extracted features to a csv file jointly with the class of each drum event. At this
point all the low-level features were extracted; both mean and standard deviation
were computed across all the frames of the given audio file. The reason was that
we wanted to check which features were redundant or meaningful when training
the classifier.

As mentioned in section 3.4, the fact that the MusicExtractor() method has to be
called with a filename, not an audio stream, forced us to create another version of
the dataset, which had each event annotated in a different audio file with the
corresponding class label as filename. Once all the datasets were properly sliced
and sight-checked, the last cell of the notebook was executed with the corresponding
folder names (which contain all the sliced samples) and the features saved in
different csv files, one for each dataset3. Adding the number of instances in all the
csv files, we get 40228 instances with 84 features and 1 label.
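A sketch of this extraction loop is shown below; the frame and hop sizes are
placeholders and the csv layout is an assumption:

import csv
import essentia.standard as es

def extract_features(filenames, labels, csv_path, frame_size=2048, hop_size=1024):
    # Run MusicExtractor once per sliced event and append the low-level
    # mean/stdev statistics plus the class label as one csv row.
    extractor = es.MusicExtractor(lowlevelFrameSize=frame_size,
                                  lowlevelHopSize=hop_size,
                                  lowlevelStats=["mean", "stdev"])
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for filename, label in zip(filenames, labels):
            features, _ = extractor(filename)
            names = [n for n in features.descriptorNames()
                     if n.startswith("lowlevel")]
            writer.writerow([features[n] for n in names] + [label])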
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices/features
4.2.2 Training and validating

As mentioned in section 2.2, some authors have proposed machine learning
algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours
(KNN) for sound event classification; also, some authors have developed more
complex methods for drums event classification. The complexity of these latter
methods made me choose the generic ones, also to see if they were a good way to
approach the problem, as there is no literature concretely on drums event
classification with SVM or KNN.

The iterative process of training and validating the aforementioned methods has
been the main reference when designing the 40k Drums samples dataset. The first
times we tried the models we were working with the class distribution of Figure 4;
as commented, this was a very unbalanced dataset, and we were evaluating the
classification inference with the accuracy formula 4.1, which does not take into
account the unbalance in the dataset. The accuracy computation was around 92%,
but the correct predictions were mainly on the large classes; as shown in Figure 7,
some classes had very low accuracy (even 0%, as some classes have 10 samples, 7
used to train and 3 to validate, which are all badly predicted), but having a small
number of instances affects the accuracy computation less.
$$\text{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i) \tag{4.1}$$
Otherwise, the proper way to compute the accuracy on this kind of dataset is the
balanced accuracy: it computes the accuracy for each class and then averages the
accuracy over all the classes, as in formula 4.2, where $w_i$ represents the weight
of each class in the dataset. This computation lowered the result to 79%, which
was not a good result.
$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \qquad
\text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i \tag{4.2}$$
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the F-score,
which combines the precision and the recall of the model in one measure, as in
formula 4.3. Precision is computed as the number of correct predictions divided by
the total number of predictions, and recall is the number of correct predictions
divided by the total number of instances that should be predicted as a given class.
$$F\text{-measure} = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{4.3}$$
Having these results led us to the process of recording a personalized dataset to
extend the already existing one (see section 3.2.2). With this new distribution the
results improved, as shown in Figure 8, as well as the balanced accuracy and
F-score (both 89%). Until this point we were using both KNN and SVM models to
compare results, and the SVM always performed at least 10% better, so we decided
to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine refers to the regularization; this
technique is intended to make a model less sensitive to the data noise and the
outliers that may not represent the class properly. When increasing this value to
10 the results improved among all the classes, as shown in Figure 9, as well as the
accuracy and F-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare
drum class was somehow a problem, as it is one of the most used instruments in
the drumset, jointly with the hi-hat and the kick drum. So I tried the same process
with the classes that include only the three mentioned instruments (i.e. hh, kd, sd,
hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved
the overall accuracy and F-score to 97.7%, and concretely the sd accuracy to 96%,
as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the training and validating iterative process has been
developed in the Classifier_training.ipynb4 notebook. First, the csv files with
the features extracted in Dataset_featureExtraction.ipynb are loaded; then,
depending on which subset of classes will be used, the corresponding instances are
filtered; and to remove redundant features, the ones with a very low standard
deviation are deleted (i.e. std_dev < 0.00001). As the SVM works better when
data is normalized, the standard scaler is used to move all the data distributions
around 0, ensuring a standard deviation of 1.
In the next cells the dataset is split into train and validation sets, and the training
method from the SVM of sklearn is called to perform the training. When the
models are trained, the parameters are dumped in a file to load the model a
posteriori and be able to apply the knowledge learned to new data. This process
was very slow on my computer, so we decided to upload the csv files to Google
Drive and open the notebook with Google Colaboratory, as it was faster; this is key
to avoid long waiting times during the iterative train-validate process. In the last
cells the inference is made with the validation set and the accuracy is computed,
as well as the confusion matrix plotted, to get an idea of which classes are
performing better.
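The whole iteration can be summarized with the following sketch; the file and
column names are assumptions:

import joblib
import pandas as pd
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load the extracted features; "label" as the class column is an assumption.
data = pd.read_csv("features.csv")
X, y = data.drop(columns=["label"]), data["label"]
X = X.loc[:, X.std() >= 1e-5]          # drop near-constant (redundant) features

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y)
scaler = StandardScaler().fit(X_tr)    # zero mean, unit standard deviation
clf = SVC(C=10).fit(scaler.transform(X_tr), y_tr)

print(balanced_accuracy_score(y_val, clf.predict(scaler.transform(X_val))))
joblib.dump((scaler, clf), "svm_drums.joblib")   # persist for later inference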
4.2.3 Testing

Testing the model introduces the concept of onset detection: until now all the
slices have been created using the annotations, but to assess a new submission
from a student we need to detect the onsets and then slice the events. The function
SliceDrums_BeatDetection5 does both tasks. As explained in section 2.1.1, there
are many methods to do onset detection, and each of them is better suited to a
different application. In the case of drums we have tested the 'complex' method,
which finds changes in the frequency domain in terms of energy and phase and
works pretty well; but when the tempo increases, some onsets are not correctly
detected. For this reason we finally implemented the onset detection with the
HFC method.

4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
This method computes the HFC for each window as in equation 4.4; note that
high-frequency bins (index k) weight more in the final value of the HFC.

$$\mathrm{HFC}(n) = \sum_{k} k\, |X_k[n]|^2 \tag{4.4}$$
Moreover, the function plots the audio waveform jointly with the detected onsets,
to check after each test that it has worked correctly. In Figures 11 and 12 we can
see two examples of the same music sheet played at 60 and 220 bpm; in both cases
all the onsets are correctly detected and no false detections occur.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
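The HFC-based detection follows Essentia's standard onset workflow; a minimal
sketch, assuming a mono recording:

import essentia
import essentia.standard as es

audio = es.MonoLoader(filename="submission.wav")()  # assumed input file

# Compute an HFC curve frame by frame, then peak-pick it with Onsets().
onset_func = es.OnsetDetection(method="hfc")
windowing, fft, c2p = es.Windowing(type="hann"), es.FFT(), es.CartesianToPolar()

pool = essentia.Pool()
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    magnitude, phase = c2p(fft(windowing(frame)))
    pool.add("features.hfc", onset_func(magnitude, phase))

onset_times = es.Onsets()(essentia.array([pool["features.hfc"]]), [1])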
With the onset information, the audio can be trimmed into the different events; the
order is maintained within the name of each file, so when comparing with the
expected events they can be mapped easily. The audios are passed to the
extract_MusicalFeatures() function, which saves the musical features of each
slice in a csv file.
To predict which event each slice is, the models already trained are loaded in this
new environment and the data is pre-processed using the same pipeline as when
training. After that, the data is passed to the classifier method predict(), which
returns, for each row in the data, the predicted event. The described process is
implemented in the first part of Assessment.ipynb6; the second part is intended to
execute the visualization functions described in the next section.
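A sketch of this inference step, reusing the objects persisted during training (file
names are assumptions):

import joblib
import pandas as pd

# The same pre-processing as in training (same feature columns, same scaler)
# must be applied to the new slices before predicting.
scaler, clf = joblib.load("svm_drums.joblib")
slices = pd.read_csv("submission_features.csv")   # one row per detected event
predicted_events = clf.predict(scaler.transform(slices))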
4.3 Music performance assessment

Finally, as already commented, the assessment part has been focused on giving
visual feedback on the interpretation to the student. As the drums classifier has
taken so much time, the creation of a dataset with interpretations and their grades
has not been feasible. A first approximation was to record different interpretations
of the same music sheet simulating different levels of skills, but grading them and
doing all the process by ourselves was not easy; apart from that, we tended to play
the fragments either well or badly, and it was difficult to simulate intermediate
levels and be consistent with the proposed grades.

So the implemented solution generates an image that shows the student whether
the notes of the music sheet are correctly read and whether the onsets are aligned
with the expected ones.
4.3.1 Visualization

With the data gathered in the testing section, feedback on the interpretation has
to be returned. Having as a base implementation the solution of my companion
Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization
is done in the last cell of the notebook Assessment.ipynb.

First, the LilyPond file paths are defined. Then, for each of the submissions, the
audio is loaded to generate the waveform plot.

6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and
expected onsets, the waveform, and the start and end of the waveform (this comes
from the LilyPond file's macro). To properly plot the deviations, in the code we
are assuming that the interpretation starts four beats after the beginning of the
audio.

In Figures 13 and 14 the result of save_bar_plot() for two different submissions
is shown. The black lines at the bottom of the waveform are the detected onsets,
while the cyan lines in the middle are the expected onsets; when the difference
between the two values increases, the area between them is colored with a
traffic-light code (green good to red bad).
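A simplified sketch of such a deviation plot is given below; the traffic-light mapping
(100 ms treated as fully red) is an assumption, not the value used in save_bar_plot():

import matplotlib.pyplot as plt
import numpy as np

def plot_deviations(audio, sr, detected, expected, path):
    # Waveform with detected onsets (black), expected onsets (cyan) and a
    # colored band between each pair: green = aligned, red = far apart.
    t = np.arange(len(audio)) / sr
    fig, ax = plt.subplots(figsize=(12, 3))
    ax.plot(t, audio, color="0.7")
    for d, e in zip(detected, expected):
        dev = min(abs(d - e) / 0.1, 1.0)        # assumed: 100 ms = fully red
        ax.axvspan(min(d, e), max(d, e), color=(dev, 1.0 - dev, 0.0), alpha=0.5)
        ax.vlines(d, -1.0, -0.5, color="black")
        ax.vlines(e, -0.25, 0.25, color="cyan")
    fig.savefig(path)
    plt.close(fig)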
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
Once the waveform is created, it is embedded in a lambda function that is called
from the LilyPond render. But before calling LilyPond to render, the assessment
of the notes has to be done. In the function assess_notes()10 the expected and
predicted events are compared: a list with 1 at the indices where they match and 0
where they do not is created; then the resulting list is iterated and the 0 indices
are checked, because most of the classification errors fail in only one of the
instruments to be predicted (i.e. instead of hh+sd it is predicting sd). These cases
are considered partially correct, as the system has to take into account its own
errors: at the indices in which one of the instruments is correctly predicted and it
is not a hi-hat (we consider it more important to get right the snare and kick
reading than a hi-hat, which is present in all the events), the value is turned to
0.75 (light green in the color scale). In Figure 15 the different feedback options are
shown: green notes mean correct, light green means partially correct, and red
means incorrect.

9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
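The partial-credit logic can be sketched as follows; this is a re-implementation of
the idea, not the code of assess_notes():

def assess_notes(expected, predicted):
    # 1.0 = correct (green), 0.75 = partially correct (light green),
    # 0.0 = incorrect (red).
    scores = []
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            scores.append(1.0)
        else:
            # partial credit when a non-hi-hat instrument of the event matches
            shared = set(exp.split("+")) & set(pred.split("+"))
            scores.append(0.75 if shared - {"hh"} else 0.0)
    return scores

print(assess_notes(["hh+sd", "kd"], ["sd", "hh"]))  # [0.75, 0.0]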
With the waveform, the notes assessed and the LilyPond template, the function
score_image()11 can be called. This function renders the LilyPond template jointly
with the waveform previously created; this is done with the LilyPond macros.

On one hand, before each note on the staff, the keywords color() and size()
determine that the color and size of the note depend on an external variable (the
notes assessed); on the other hand, after the first note of the staff, the keyword
eps(1150 16) indicates on which beat the waveform starts to be displayed and on
which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars); the other
number is the scale of the waveform and allows fitting the plot better to the staff.
4.3.2 Files used

The assessment process of an exercise needs several files. First, the annotations of
the expected events and their timesteps; these are found in the txt file already
mentioned in section 3.1.1. Then, the LilyPond file; this is the template, written in
the LilyPond language, which defines the resultant music sheet, and where the
macros to change color and size and to add the waveform are defined. When
extracting the musical features, each submission creates its csv file to store the
information. And finally, we need, of course, the audio files with the recorded
submission to be assessed.

11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can
evaluate the results to check if the system works correctly and is useful for a
student to learn, and also to test its limits regarding audio signal quality and
tempo. The tests have been done with two different exercises, recorded with a
computer microphone and played at different tempos, starting at 60 bpm and
adding 40 bpm until 220 bpm. The recordings with good tempo and good reading
have been processed adding 6 dB until an accumulated gain of +30 dB.

In this chapter and Appendix B all the resultant feedback visualizations are shown.
The audio files can be listened to in Freesound, where a pack1 has been created.
Some of them will be commented on and referenced in further sections; the rest
are extra results.
As the high frequency content method works perfectly, there are no limitations
nor errors in terms of onset detection: all the tests have an F-measure of 1,
detecting all the expected events without any false positives.
1 https://freesound.org/people/MaciaAC/packs/32350/
5.1 Tempo limitations

One of the limitations of the system is the tempo of the exercise: the accuracy
drops when the tempo increases. Having as a reference the figures that show good
reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18,
19, 20, 21 and 22), we can count how many are correct or partially correct to score
each case: a correct prediction weights 1.0, a partially correct one weights 0.5, and
an incorrect one 0; the total value is the mean of the weighted results of the
predictions.
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
Figure 19 Good reading and good tempo Ex 1 180 bpm

Figure 20 Good reading and good tempo Ex 1 220 bpm

In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of
the classifier decreases. This may be because increasing the tempo decreases the
spacing between events and consequently the duration of each event, which leads
to fewer values for calculating the mean and standard deviation when extracting
the timbre characteristics. As stated in the law of large numbers [25], the larger
the sample, the closer the mean is to the total population mean. In this case,
having fewer values in the calculation creates more outliers in the distribution,
which tends to scatter.
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48

Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster
than 100 bpm. But in 12/8 at 100 bpm the subdivision runs at 300 eighth notes
per minute, similar to 140 bpm in 4/4, which corresponds to 280 eighth notes per
minute. The results in 12/8 (Table 4) are also better because there are more
'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875

Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations

Another limitation of the system is the saturation of the submitted signal. Listening
to the submissions, the hi-hat events are recorded with less amplitude than the
snare and kick events; for this reason we think that the classifier starts to fail at
+18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the
previous section is applied to Figures 23 and 24. The hi-hat is the last waveform
to saturate, and at this gain level the overall waveform is so clipped that it leads
to high-frequency content that is predicted as a hi-hat in all the cases.
Level    Correct   Partially OK   Incorrect   Total
+0dB     25        7              0           0.89
+6dB     23        9              0           0.86
+12dB    23        9              0           0.86
+18dB    24        7              1           0.86
+24dB    18        5              9           0.64
+30dB    13        5              14          0.48

Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0dB     12        7              13          0.48
+6dB     13        10             9           0.56
+12dB    10        8              14          0.5
+18dB    9         2              21          0.31
+24dB    8         0              24          0.25
+30dB    9         0              23          0.28

Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
5.3 Evaluation of the assessment

Until now the evaluation of results has been focused on the drums event classifier
accuracy, but we think that it is also important to evaluate if the system can
properly assess a student's submission.

As shown in Figures 25 and 26, if the student does not play the first beat or some
of the beats are not read, the system can still map the rest of the events to the
expected ones at the corresponding onset time steps. This is due to a check done
in the assessment, which assumes that before the first beat there is a count-in of
one bar, and that the rest of the beats have to come after this interval.
To evaluate the assessment we will proceed as in previous sections, counting the
number of correct predictions, but now in terms of assessment. The analyzed
results will be the 'bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized; it works as follows: we count a
correct assessment if the note is green or light green and the event is the one in
the music score, or if the note is red and the event is not the one in the music
score. The rest of the cases are counted as incorrect assessments. The total value
is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7 Assessment results of a bad reading at different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8 Assessment results of a bad reading at different tempos, 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs
the assessment based on the predictions pretty well. This can be helpful for a
student to know which parts of the music sheet are well read and which are not.
Also, the tempo visualization can help the student recognize if they are slowing
down or rushing when reading the score: as can be seen in Figure 30, the onsets
detected (black lines in the bottom part of the waveform) are mostly behind the
corresponding expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed. In this chapter a discussion is developed about which objectives have
been accomplished and which have not. Also, a set of further improvements is
given, and a final thought on my work and my apprenticeship. The chapter ends
with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results

Having in mind all the concepts explained throughout this document, we can now
list them, defining their completeness and our contributions.

Firstly, the created 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo. Apart from being used in this project,
this dataset might be useful to other researchers and students in their projects.
The dataset is indeed useful in order to balance drums datasets based on real
interpretations, as the class distribution of these interpretations is very unbalanced,
as explained with the IDMT and MDB Drums datasets.

Secondly, a drums event classifier with a machine learning approach has been
proposed and trained with the aforementioned dataset. One of the reasons for
using this approach to predict the events was that there was no literature focused
on
classifying drums events in this manner. As the results have shown, more complex
methods based on the context might be used, such as the ones proposed in [16]
and [17]. It is important to take into account that the task the model is trained to
do is very hard for a human: being able to differentiate drums events in an
individual drum sample without any context is almost impossible, even for a
trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well
as the development of a MusicXML parser. This part took around one month to
be developed, and from my point of view it was a great way to understand how
these file formats work and how they can be improved, as they are mostly focused
on the visualization, not the symbolic representation of events and timesteps.

Finally, two exercises in different time signatures have been proposed to demonstrate
the functionality of the system, and tests of these exercises have been recorded in
a different environment than the 29k Samples Drums Dataset. It would be good
to get recordings in different spaces and with different drumsets and microphones
to test the system more exhaustively.
6.2 Further work

In terms of the dataset created, it could be larger. It could be expanded with
different drumsets, tuning each drumset differently, using different sticks to hit the
instruments, and even different people playing. This could introduce more variance
in the drums sample dataset. Moreover, on June 9th 2021 a paper about a large
drums dataset with MIDI data was presented [26] at ICASSP 20211. This new
dataset could be included in the training process, as the authors state that having
a large-scale dataset improves the results of the existing models.

Regarding the classification model, it is clear that it needs improvements to ensure
the overall system robustness. It would be appropriate to introduce the
aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.

1 https://www.2021.ieeeicassp.org
Also, in terms of classes in the drumset, there is a long path to cover. There are
no solutions that transcribe in a robust way a whole set, including the toms and
different kinds of cymbals. In this sense, we think that a proper approach would
be to work with professional musicians, who can help researchers better understand
the instrument and create datasets with different techniques.

With respect to the assessment step, apart from the feedback visualization of the
tempo deviations and the reading accuracy, a regression model could be trained
with assessed drums exercises to give a mark to each student. On this path,
introducing an electronic drumset with MIDI output would make things a lot
easier, as the drums classifier step would be omitted.

About the implementation, a good contribution would be to introduce the models
and algorithms into the Pysimmusic workflow and develop a demo web app like
Music Critic's. But better results and more robustness are needed before taking
this step.
6.3 Work reproducibility

In computational sciences, a work is reproducible if code and data are available
and other researchers or students can execute them getting the same results.

All the code has been developed in Python, a widely known general-purpose
programming language. It is available in my GitHub repository2, as well as the
data used to test the system and the classification models.

The data created, i.e. the studio recordings, is available in a Zenodo repository3,
and some samples in Freesound4. This is the 29kDrumsSamplesDataset: as not all
the 40k samples used to train are our property, we are not able to share them
under our full authorship; despite this, the other datasets used in this project are
available individually.

2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397/
6.4 Conclusions

This project has been developed over one year. At this point, with the work
described, the goal of supporting drums learning has been accomplished, although
the work still falls short in terms of robustness and reliability. But a first
approximation has been presented, as well as several paths of improvement
proposed.

Moreover, some fields of engineering and computer science have been covered,
such as signal processing, music information retrieval and machine learning; not
only in terms of implementation, but also investigating methods and gathering
already existing experiments and results.

About my relationship with computers, I have improved my fluency with git and
its web version, GitHub. Also, at the beginning of the project I wanted to execute
everything on my local computer, having to install and compile libraries that
could not be installed on macOS via the pip command (i.e. Essentia), which has
been a tough path to take and accomplish. In a more advanced phase of the
project, I realized that the LilyPond tools could not be installed and used fluently
on my local machine, so I moved all the code to my Google Drive to execute the
notebook on a Colaboratory machine. Developing code in this environment has
its own quirks, which I have had to learn. In summary, I have spent a bunch of
time looking for the ideal way to develop the project, and the process indeed has
been fruitful in terms of knowledge gained.

In my personal opinion, developing this project has been a nice way to close my
Bachelor's degree, as I reviewed some of the concepts of most personal interest.
And being able to relate the project with music and drums helped me keep my
motivation and focus. I am quite satisfied with the feedback visualization that
results from the system, and I hope that more people get interested in this field of
research to get better tools in the future.
List of Figures

1 Datasets pre-processing
2 Sample drums score from music school drums grade 1
3 Microphone setup for drums recording
4 Number of samples before Train set recording
5 Number of samples after Train set recording
6 Proposed pipeline for a drums performance assessment system, inspired by [2]
7 Confusion matrix after training with the dataset in Figure 4
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
11 Onsets detected in a 60 bpm drums interpretation
12 Onsets detected in a 220 bpm drums interpretation
13 Onset deviation plot of a good tempo submission
14 Onset deviation plot of a bad tempo submission
15 Example of coloured notes
16 Good reading and good tempo Ex 1 60 bpm
17 Good reading and good tempo Ex 1 100 bpm
18 Good reading and good tempo Ex 1 140 bpm
19 Good reading and good tempo Ex 1 180 bpm
20 Good reading and good tempo Ex 1 220 bpm
21 Good reading and good tempo Ex 2 60 bpm
22 Good reading and good tempo Ex 2 100 bpm
23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
25 Bad reading and bad tempo Ex 1 100 bpm
26 Bad reading and bad tempo Ex 1 180 bpm
27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff
28 Bad reading and good tempo Ex 2 60 bpm
29 Bad reading and good tempo Ex 2 100 bpm
30 Good reading and bad tempo Ex 1 100 bpm
31 Recording routine 1
32 Recording routine 2
33 Recording routine 3
34 Drumset configuration 1
35 Drumset configuration 2
36 Drumset configuration 3
37 Good reading and bad tempo Ex 1 60 bpm
38 Bad reading and bad tempo Ex 1 60 bpm
39 Good reading and bad tempo Ex 1 140 bpm
40 Bad reading and bad tempo Ex 1 140 bpm
41 Good reading and bad tempo Ex 1 180 bpm
42 Good reading and bad tempo Ex 1 220 bpm
43 Bad reading and bad tempo Ex 1 220 bpm
44 Good reading and bad tempo Ex 2 60 bpm
45 Bad reading and bad tempo Ex 2 60 bpm
46 Good reading and bad tempo Ex 2 100 bpm
47 Bad reading and bad tempo Ex 2 100 bpm
List of Tables

1 Abbreviations' legend
2 Microphones used
3 Results of exercise 1 with different tempos
4 Results of exercise 2 with different tempos
5 Results of exercise 1 at 60 bpm with different amplification levels
6 Results of exercise 1 at 220 bpm with different amplification levels
7 Assessment results of a bad reading at different tempos, 4/4 exercise
8 Assessment results of a bad reading at different tempos, 12/8 exercise
Bibliography

[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).

[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).

[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).

[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).

[5] Wachsmann, K., Kartomi, M. J., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).

[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).

[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).

[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).

[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).

[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2003).

[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).

[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2001).

[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).

[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).

[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).

[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).

[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).

[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).

[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).

[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).

[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).

[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).

[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings - 14th International Society for Music Information Retrieval Conference (2013).

[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).

[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).

[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
Figure 31 Recording routine 1
Figure 32 Recording routine 2
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
Chapter 1
Introduction
The field of sound event classification, and concretely drum event classification, has improved its results over the last years, which allows us to use this kind of technology for more concrete applications such as education support [1]. The development of automated assessment tools to support musical instrument learning has been a field of study at the Music Technology Group (MTG) [2], concretely on guitar performances, implemented in the Pysimmusic project [3]. One of the open paths proposed by Eremenko et al. [2] is to apply it to different instruments, and this is what I have done.
1.1 Motivation
The aim of the project is to implement a tool to evaluate musical performances, specifically reading scores with drums. One possible real application may be to support the evaluation in a music school, allowing the teacher to focus on other items such as attitude, posture and more subtle nuances of the performance. If high accuracy is achieved in the automatic assessment of tempo and reading, a fair assessment of these aspects can be ensured. In addition, a collaboration between a music school and the MTG allows the use of a specific corpus of data from the educational institution's program; this corpus is formed by a set of music sheets and the recordings of the performances.
Besides this, I have been studying drums for fourteen years, and a personal motivation emerges from this fact. Learning an instrument is a process that does not only rely on going to class; there is an important load of individual practice apart from the lessons. Having a tool to assess yourself when practicing would be a nice way to check your own progress.
1.2 Existing solutions
In terms of music interpretation assessment, there are already some software tools that support assessing several instruments. Applications such as Yousician1 or SmartMusic2 cover from the most basic notions of playing an instrument to a syllabus of themes to be played. These applications return to the students an evaluation that tells which notes are correctly played and which are not, but they do not give information about tempo consistency or dynamics, and even less generate a rubric as a teacher could develop during a class.
There are specific applications that support drums learning, but in those the feature of automatic assessment disappears. There are some options to get online drums lessons, such as Drumeo3 or Drum School4, but they only offer a list of videos imparting lessons on improving stylistic vocabulary, feel, improvisation or technique. These applications also offer personal feedback from professional drummers and a personalized studying plan, but the specific feature of automatic performance assessment is not implemented.
As mentioned in the Introduction, automatic music assessment has been a field of research at the MTG. With the development of Music Critic5, an assessment workflow is proposed and implemented. This is useful as it can be adapted to the drums assessment task.
1 https://yousician.com
2 https://www.smartmusic.com
3 https://www.drumeo.com
4 https://drumschoolapp.com
5 https://musiccritic.upf.edu
1.3 Identified challenges
As mentioned in [2], there are still improvements to make in the field of music assessment, especially when analyzing expressivity and advanced-level performances. Taking into account the scope of this project, and having as a base case the guitar assessment exercise from Music Critic, some specific challenges are described below.
1.3.1 Guitar vs drums
As defined in [4], a drumset is a collection of percussion instruments, mainly cymbals and drums, even though some genres may need to add cowbells, tambourines, pailas or other instruments with specific timbres. Moreover, percussion instruments are split into two families: membranophones and idiophones. Membranophones produce sound primarily by hitting a stretched membrane; by tuning the membrane tension, different pitches can be obtained [5]. Differently, idiophones produce sound by the vibration of the instrument itself, and the pitch and timbre are defined by its own construction [5]. The vibration aforementioned is produced by hitting the instruments; generally this hit is made with a wooden stick, but some genres may need to use brushes, hotrods or mallets to excite specific modes of the instruments. With all this in mind, and as stated in [1], it is clear that transcribing drums, having to take into account all its variants and nuances, is a hard task, even for a professional drummer. With this said, there is a need to simplify the problem and to limit the instruments of the drumset to be transcribed.
Returning to the assessment task: guitars play notes and chords, which are tuned, so the way to check if a music sheet has been read correctly is looking for the pitch information and comparing it to the expected one. Differently, the instruments that form a drumset are mainly unpitched (except toms, which are tuned using different scales and tuning paradigms), so the differences among drum events are in the timbre. A different approach has to be defined in order to check which instrument is being played for each detected event; the first idea is to apply machine learning for sound event classification.
Along the project we will refer to the different instruments that conform a drumkit with abbreviations. In Table 1 the legend used is shown; the combination of two or more instruments is represented with a '+' symbol between the different tags.

Instrument   | Kick drum | Snare drum | Floor tom | Mid tom | High tom
Abbreviation | kd        | sd         | ft        | mt      | ht

Instrument   | Hi-hat | Ride cymbal | Crash cymbal
Abbreviation | hh     | cy          | cr

Table 1: Abbreviations' legend
1.3.2 Dataset creation
Keeping in mind the last idea of the previous section, if a machine learning approach has to be implemented, there is a basic need to obtain audio data of drums. Apart from the audio data, proper annotations of the drums interpretations are needed in order to slice them correctly and extract musical features of the different events.
The process of gathering data should take into account the different possibilities that a drumset offers in terms of timbre, loudness and tone. Several datasets should be combined, as well as additional recordings with different drumsets, in order to have a balanced and representative dataset. Moreover, to evaluate the assessment task, a set of exercises has to be recorded with different levels of skill.
There is also the need to capture those sounds with several frequency responses in order to make the model independent of the microphone. Also, those samples could be processed to get variations of each of them with data augmentation processes.
1.3.3 Signal quality
Regarding the assignment, we have to take into account that a student will not be able to record their interpretations with a setup like the one used in a studio recording; most of the time the recordings will be done using the laptop or mobile phone microphone. This fact has to be taken into account when training the event classifier, in order to do data augmentation and introduce these transformations to the dataset, e.g. introducing noise to the samples or amplifying them to get overload distortion.
1.4 Objectives
The main objective of this project is to develop a tool to assess drums interpretations of a proposed music sheet. This objective has to be split into the different steps of the pipeline:
• Generate a correctly annotated drums dataset, which means a collection of audio drums recordings and their annotations, all equally formatted.
• Implement a drums event sound classifier.
• Find a way to properly visualize drums sheets and their assessment.
• Propose a list of exercises to evaluate the technology.
In addition, having the code published in a public Github6 repository and uploading the created dataset to Freesound7 and Zenodo8 will be a good way to share this work.
1.5 Project overview
The next chapters are organized as follows. In chapter 2 the state of the art is reviewed, focusing on signal processing algorithms and ways to implement sound event classification, and ending with music sheet technologies and the software tools available nowadays. In chapter 3 the creation of a drums dataset is described, presenting the use of already available datasets and how new data has been recorded and annotated. In chapter 4 the methodology of the project is detailed: which algorithms are used for training the classifier, as well as how new submissions are processed to assess them. In chapter 5 an evaluation of the results is done, pointing out the limitations and the achievements. Chapter 6 concludes with a discussion on the methods used, the work done and further work.
6 https://github.com
7 https://freesound.org
8 https://zenodo.org
Chapter 2
State of the art
In this chapter the concepts and technologies used in the project are explained, covering algorithm references and existing implementations. First, signal processing techniques on onset detection and feature extraction are reviewed; then the sound event classification field is presented and its relationship with drums event classification. Also, the principal music sheet technologies and codecs are presented. Finally, specific software tools are listed.
2.1 Signal processing
2.1.1 Feature extraction
In the following sections sound event classification will be explained; most of these methods are based on training models using features extracted from the audio, not with the audio chunks themselves [6]. In this section, signal processing methods to get those features are presented.
Onset detection
In an audio signal, an onset is the beginning of a new event; it can be either a single note, a chord or, in the case of the drums, the sound produced by hitting one or more instruments of the drumset. It is necessary to have a reliable algorithm
that properly detects all the onsets of a drums interpretation. With the onset information (a list of timestamps), the audio can be sliced to analyze each chunk separately and to assess the tempo consistency.
It is important to address the challenge in a psychoacoustical way, as the objective is to detect the musical events as a human would do. In [7] the idea of perceptual onset for percussive instruments is defined as a time interval between the physical onset and the moment that the maximum level is reached. In [8] many methods are reviewed, focusing on the differences in performance depending on the signal: Non-Pitched Percussive instruments are better detected with temporal methods or high-frequency content methods, while Pitched Non-Percussive instruments may need to take into account changes of energy in the spectrum distribution, as the onset may represent a different note.
The sound generated by the drums is mainly percussive (discarding brushes' slow patterns or mallets' build-ups on the cymbals), which means that it is formed by a short transient followed by a short decay; there is no sustain. As the transient is a fast change of energy, it implies high-frequency content, because changes happen in a very short frame of time. As recommended in [9], the HFC method will be used.
Timbre features
As described in [10], a feature denotes in some way a quantity or a value. Features extracted by processing the audio stream or transformations of it (i.e. FFT) are called low-level descriptors; these features have no relevant information from a human point of view but are useful for computational processes [11].
Some low-level descriptors are computed from the temporal information: for instance, the zero-crossing rate tells the number of times the signal crosses the zero axis per second, the attack time is the duration of the transient, and the temporal centroid describes the energy distribution of an event over time. Other well-known features are the root mean square of the signal or the high-frequency content mentioned in section 2.1.1.
Besides temporal features, low-level descriptors can also be computed from the frequency domain. Some of them are spectral flatness, spectral roll-off, spectral slope and spectral flux, among others.
Nowadays, Essentia's library offers a collection of algorithms that reliably extract the low-level descriptors aforementioned; the function that englobes all the extractors is called MusicExtractor1.
2.1.2 Data augmentation
Data augmentation processes refer to the optimization of the statistical representation of the datasets, in terms of improving the generalization of the resultant models. These methods are based on the introduction of unobserved data or latent variables that may not be captured during the dataset creation [12].
Regarding this technique applied to audio data, signal processing algorithms are proposed in [13] and [14] that introduce changes to the signals in both the time and frequency domains. In these articles the goal is to improve accuracy on speech and animal sound recognition, although this could apply to drums event classification as well.
The processes that led to the best results in [13] and [14] were related to time-domain transformations, for instance time shifting and stretching, adding noise or harmonic distortion, or compressing in a given dynamic range, among others. Other proposed processes were focused on the spectrogram of the signal, applying transformations such as shifting the matrix representation, setting some areas to 0, or adding spectrograms of different samples of the same class.
Presently, some Python2 libraries are developed and maintained in order to do audio data augmentation tasks, for instance audiomentations3 and the GPU version torch-audiomentations4.
1 https://essentia.upf.edu/streaming_extractor_music.html
2 https://www.python.org
3 https://pypi.org/project/audiomentations/0.6.0
4 https://pypi.org/project/torch-audiomentations
2.2 Sound event classification
Sound event classification is the task of detecting and recognizing sound events in an audio stream [15]. As described in [10], this task can be approached from two sides: on one hand, the perceptual approach tries to extract the timbre similarity to cluster sounds as we perceive them; on the other hand, the taxonomic approach is determined to label sound events as they are defined in cultural or user-biased taxonomies. In this project the focus is on the second approach, as the task is to classify sound events in the drums taxonomy (i.e. kick drum, snare drum, hi-hat).
Also, in the literature many classification methods are proposed, concretely, in the taxonomy approach, machine learning algorithms such as K-Nearest Neighbors, Support Vector Machines or Neural Networks, all of them using features extracted from the audio data, as explained in section 2.1.1.
2.2.1 Drums event classification
This section is divided into two parts: first presenting the state-of-the-art methods for drum event classification, and then the most relevant existing datasets. This section is mainly based on the article [1], as it is a review of the topic and encompasses the core concepts of the project.
Methods
Focusing on taxonomic drums event classification, this field has been studied during the last years, as in the Music Information Retrieval Evaluation eXchange5 (MIREX) it has been a proposed challenge since 20056. In [1] a review of the main methods that have been investigated is done. The authors collect different approaches, such as Recurrent Neural Networks, proposed in [16], Non-negative Matrix Factorization, proposed in [17], and other real-time based methods using MaxMSP7, as described in [18].
5 https://www.music-ir.org/mirex/wiki/MIREX_HOME
6 https://www.music-ir.org/mirex/wiki/2005:Audio_Drum_Detection_Results
7 https://cycling74.com/products/max
It is worth mentioning that the proposed methods are focused on Automatic Drum Transcription (ADT) of drumsets formed only by the kick drum, snare drum and hi-hat. The ADT field is intended to transcribe audio, but in our case we have to check whether an audio event is the expected event or not; this particularity can be used in our favor, as some assumptions can be made about the audio that has to be analyzed.
Datasets
In addition to the methods and their combinations, the data used to train the system plays a crucial role. As a result, the dataset may have a big impact on the generalization capabilities of the models. In this section some existing datasets are described.
• IDMT-SMT-Drums [19]: Consists of real drum recordings containing only kick drum, snare drum and hi-hat events. Each recording has its transcription in xml format, and it is publicly available to download8.
• MDB Drums [20]: Consists of real drums recordings of a wide range of genres, drumsets and styles. Each recording has two txt transcriptions, for the classes and subclasses defined in [20] (e.g. class: Hi-hat; subclasses: closed hi-hat, open hi-hat, pedal hi-hat). It is publicly available to download9.
• ENST-Drums [21]: Consists of real drum audio and video recordings of different drummers and drumsets. Each recording has its transcription, and some of them include accompaniment audio. It is publicly available to download10.
• DREANSS [22]: Differently, this dataset is a collection of drum recording datasets that have been annotated a posteriori. It is publicly available to download11.
Electronic drums datasets have not been considered, as the student assignment is supposed to be recorded with a real drumset.
8 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
9 https://github.com/CarlSouthall/MDBDrums
10 https://perso.telecom-paristech.fr/grichard/ENST-drums
11 https://www.upf.edu/web/mtg/dreanss
2.3 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter programs from the 80s. Proprietary software such as Finale12 and Sibelius13, or open-source software such as MuseScore14 and LilyPond15, are some options that can be used nowadays to write music sheets with a computer.
In terms of file format, Sibelius has its encrypted version that can only be read and written with the software; it can also write and read MusicXML16 files, which are not encrypted and are similar to an HTML file, as they contain tags that define the bars and notes of the music sheet. This format is the standard for exchanging digital music sheets.
Within Music Critic's framework, the technology used to display the evaluated score is LilyPond; it can be called from the command line and allows adding macros that change the size or color of the notes. The other particularity is that it uses its own file format (ly), and scores that are in MusicXML format have to be converted and reviewed.
2.4 Software tools
Many of the concepts and algorithms aforementioned are already developed as software libraries. This project has been developed with Python, and in this section the libraries that have been used are presented. Some of them are open and public, and some others are private, such as pysimmusic, which has been shared with us so we can use and consult it. In addition, all the code has been developed using a tool from Google called Colaboratory17; it allows writing code in a jupyter notebook18 format that is agile to use and execute interactively.
12 https://www.finalemusic.com
13 https://www.avid.com/sibelius
14 https://musescore.org
15 https://lilypond.org
16 https://www.musicxml.com
17 https://colab.research.google.com
18 https://jupyter.org
2.4.1 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis, description and synthesis [23]; it can also be installed as a Python-based library with the pip19 command in Linux, or compiled with certain flags in MacOS20. This library includes a collection of MIR algorithms; it is not a framework, so it is in the user's hands how to use these processes. Some of the algorithms used in this project are music feature extraction, onset detection and audio file I/O.
2.4.2 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine learning algorithms for regression, classification and clustering, as well as pre-processing and dimensionality reduction functions. It is based on NumPy22 and SciPy23, so its algorithms are easy to adapt to the most common data structures used in Python. It also allows saving and loading trained models to do inference tasks with new data.
2.4.3 LilyPond
As described in section 2.3, LilyPond is open-source scorewriter software with its own file format and language. It can produce visual renders of music sheets in PNG, SVG and PDF formats, as well as MIDI files to listen to the compositions. LilyPond works on the command line and allows us to introduce macros to modify visual aspects of the score, such as color or size.
It is the digital sheet music technology used within Music Critic's framework, as it allows embedding an image in the music sheet, generating a parallel representation of the music sheet and a student's interpretation.
19 https://pypi.org/project/pip
20 https://essentia.upf.edu/installing.html
21 https://scikit-learn.org
22 https://numpy.org
23 https://www.scipy.org/scipylib/index.html
2.4.4 Pysimmusic
Pysimmusic is a private Python library developed at the MTG. It offers tools to analyze the similarity of musical performances and uses libraries such as Essentia, LilyPond and FFmpeg24, among others. Pysimmusic contains onset detection algorithms and a collection of audio descriptors and evaluation algorithms. By now it is the main evaluation software used in Music Critic to compare the submitted recording with the reference.
2.4.5 Music Critic
Music Critic is a project from the MTG intended to support technologies for online music education, facilitating the assessment of student performances25.
The proposed workflow starts with a student submitting a recording playing the proposed exercise. Then the submission is sent to the Music Critic's server, where it is analyzed and assessed. Finally, the student receives the evaluation jointly with the feedback from the server.
2.5 Summary
Music information retrieval and machine learning have been popular fields of study. This has led to a large development of methods and algorithms that will be crucial for this project. Most of them are free and open-source, and fortunately the private ones have been shared by the UPF research team, which is a great base to start the development.
24 https://www.ffmpeg.org
25 https://www.upf.edu/web/mtg/tech-transfer/-/asset_publisher/pYHc0mUhUQ0G/content/id/229860881/maximized#YJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in section 1.3.2, having a well-annotated and balanced dataset is crucial to get proper results. In this section the 40kSamples Drums Dataset creation process is explained: first, focusing on how to process existing datasets, such as the ones mentioned in 2.2.1; secondly, introducing the process of creating new datasets with a music school corpus and a collection of recordings made in a recording studio; finally, describing the data augmentation procedure and how the audio samples are sliced into individual drums events. In Figure 1 we can see the different procedures to unify the annotations of the different datasets, while the audio does not need any specific modification.
3.1 Existing datasets
Each of the existing datasets has a different annotation format; in this section the process of unifying them will be explained, as well as its implementation (see notebook Dataset_formatUnification.ipynb1). As the events to take into account can be single instruments or combinations of them, the annotations have to be formatted to show those events properly. None of the annotations has this approach, so we have written a function that filters the list and joins the events with a small
1 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
difference of time, meaning that they are played simultaneously.
[Diagram: the four sources (Music school, Studio REC, IDMT Drums, MDB Drums) and their conversion steps — audio + txt, Sibelius to MusicXML, MusicXML parser to txt, write annotations — converging into unified annotations and audio]
Figure 1 Datasets pre-processing
3.1.1 MDB Drums
This dataset was the first we worked with; the annotation format in txt was a key factor, as it was easy to read and understand. As the dataset is available on Github2, there is no need to download it nor process it from a local drive. As shown in the first cells of Dataset_formatUnification.ipynb, data from the repository can be retrieved with a Python wrapper of the Github API3.
This dataset has two annotation files, depending on how deep the taxonomy used is [20]. In this case the generic class taxonomy is used, as there is no need to differentiate styles when playing a given instrument (i.e. single stroke, flam, drag, ghost note).
3.1.2 IDMT Drums
Differently to the previous dataset, this one is only available by downloading a zip file4. It also differs in the annotation file format, which is xml. Using the Python package xmltodict5, in the second part of Dataset_formatUnification.ipynb the xml files are loaded as a Python dictionary and converted to txt format.
2 https://github.com/CarlSouthall/MDBDrums
3 https://pypi.org/project/githubpy
4 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
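As a small illustration, parsing one of these xml annotation files with xmltodict might look like the sketch below; the file name and tag names used here are assumptions for illustration, not necessarily the dataset's exact schema:

import xmltodict

# Load the xml annotation into nested Python dictionaries.
with open("WaveDrum02_01.xml") as f:          # hypothetical IDMT annotation file
    doc = xmltodict.parse(f.read())

# Walk the (assumed) event list and write the unified txt format: "time<TAB>label".
with open("annotations.txt", "w") as out:
    for event in doc["instrumentRecording"]["transcription"]["event"]:
        out.write(f'{event["onsetSec"]}\t{event["instrument"]}\n')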
3.2 Created datasets
In order to expand the dataset with more variety of samples, other methods to get data have been explored. On one hand, with audio data that has partial annotations or some representation that is not data-driven, such as a music sheet that contains a visual representation of the music but not a logic annotation, as mentioned in the previous section. On the other hand, generating simple annotations is an easy task, so drums samples can be recorded standalone to create data in a controlled environment. In the next two sections these methods are described.
3.2.1 Music school
A music school has shared its teaching material with the MTG for research purposes, i.e. audio demos, books in pdf format, and music sheets in Sibelius format. As we can see in Figure 1, the annotations from the music school corpus are in Sibelius format; this is an encrypted representation of the music sheet that can only be opened with the Sibelius software. The MTG has shared an AVID license which includes the Sibelius software, so we were able to convert the sib files to MusicXML. MusicXML is not encrypted and can be opened and read, so a parser has been developed to convert the MusicXML files to a symbolic representation of the music sheet. This representation has been inspired by [24], which proposes a system to represent chords.
MusicXML parser
As mentioned in section 2.3, the MusicXML format is based on ordering the visual information with tags, creating a tree structure of nested dictionaries. In the first cell of XML_parser.ipynb6 two functions are defined. ConvertXML2Annotation reads the musicxml file and gets the general information of the song (i.e. tempo, time measure, title); then a for loop runs through all the bars of the music sheet, checking whether the given bar is self-defined, the repetition of the previous one, or the beginning or end of a repetition in the song (see Figure 2). In the self-defined bar case, the bar is passed to an auxiliary function which parses it, getting the aforementioned symbolic representation.
5 https://pypi.org/project/xmltodict
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three others have a symbol that means 'repetition of the previous bar'; moreover, the bar lines at the beginning and the end represent that these four bars have to be repeated. Therefore, this line in the music score represents an interpretation of eight bars, repeating the first one.
The symbolic representation that we propose, based on [24], defines each bar with a string; this string contains the representations of the events in the bar, separated by blank spaces. Each of the events has a colon (:) to separate the figure (i.e. quarter note, half note, whole note) from the note or notes of the event, which are separated by a dot (.). For instance, the symbolic representation of the first bar in Figure 2 is F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4.
In addition to this conversion, in the parse_one_measure function from the XML_parser notebook each measure is checked to ensure that it fully represents the bar. This means that the sum of the figures of the bar has to be equal to the one defined in the time measure: the sum of the events in a 4/4 bar has to be equal to four quarter notes.
Symbolic notation to unified annotation format
As we can see in Figure 1, once the music scores are converted to the symbolic representation, the last step is to unify the annotations with the ones used in section 3.1. This process is made in the last cells of the Dataset_formatUnification7 notebook. A dictionary with the translation of the notes to drums instruments is defined, so the note is directly converted. Differently, the timestamp of each event has to be computed based on the tempo of the song and the figure of each event; this process is made with the function get_time_steps_from_annotations8, which reads the interpretation in symbolic notation and accumulates the duration of each event based on the figure and the tempo.
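As an illustration only, a minimal sketch of this duration accumulation, assuming the simplified event format described above (notes:figure, where 4 is a quarter note) and a hypothetical helper name; the real implementation lives in get_time_steps_from_annotations:

# Hedged sketch: timestamps from symbolic events, assuming "notes:figure" strings
# (e.g. "F4.A4:4" is a quarter-note event) and a tempo in quarter notes per minute.
def get_timestamps(events, tempo_bpm):
    quarter_dur = 60.0 / tempo_bpm      # seconds per quarter note
    timestamps, current = [], 0.0
    for event in events:
        _, figure = event.split(":")    # figure: 4 = quarter, 8 = eighth, ...
        timestamps.append(current)
        current += quarter_dur * 4.0 / float(figure)
    return timestamps

# Example: the first bar of Figure 2 at 60 bpm -> [0.0, 1.0, 2.0, 3.0]
print(get_timestamps(["F4.A4:4"] * 4, 60))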
3.2.2 Studio recordings
At this point of the dataset creation we realized that the already existing data was very unbalanced in terms of instances per class: some classes had around two thousand samples while others had only ten. This situation was the reason to record a personalized dataset, to balance the overall distribution of classes, as well as to record exercises read with different accuracy, simulating students with different skill levels.
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants, Barcelona). The first day was intended to mount the drumset and the microphones, which are listed in Table 2; in Figure 3 the microphone setup is shown. Differently to the standard setup, in which each instrument of the set has its own microphone, this distribution of the microphones was intended to record the whole drumset with different frequency responses.
The recording process was divided into two phases: first, creating samples to balance the dataset used to train the drums event classifier (called train set); then, recording the students' assignment simulation to test the whole system (called test set).
7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
9 https://www.stereodosis.com
Microphone          | Transducer principle
Beyerdynamic TG D70 | Dynamic
Shure PG52          | Dynamic
Shure SM57          | Dynamic
Sennheiser e945     | Dynamic
AKG C314            | Condenser
AKG C414            | Condenser
Shure PG81          | Condenser
Samson C03          | Condenser

Table 2: Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only the classes that appear in the music school subset. This decision was motivated by the idea of assessing the songs from the books, so only the classes of the collection of songs were needed to train the classifier. In Figure 4 the distribution of the selected classes before the recordings is shown; note that it is in logarithmic scale, so there is a large difference among classes.
Figure 4 Number of samples before Train set recording
To organize the recording process, we designed 3 different routines to record; depending on the class and the number of samples already existing, a different routine was recorded. These routines were designed trying to represent the different speeds, dynamics and interactions between instruments of a real interpretation. In Appendix A the routine scores are shown; to write a generic routine, a two-line stave is used: the bottom line represents the class to be recorded and the top line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound remains for a long period of time and whose tail is mixed with the subsequent sound events.
• Routine 1 (Fig 31): This routine is intended for the classes that do not include a crash or ride cymbal and have a small number of samples (i.e. <500).
• Routine 2 (Fig 32): This routine does not include auxiliary events, as it is intended for classes that include a crash or ride cymbal, whose interaction with itself is intrinsic.
• Routine 3 (Fig 33): This is a short version of routine 1, which only repeats each bar two times instead of four; it is intended for classes which do not include a crash or ride cymbal and have a large number of samples (i.e. >500).
Routines 1 and 3 were recorded only one time, as we had only one instrument for each of the classes; differently, routine 2 was recorded two times for each cymbal, as we were able to use more instances of them. The different cymbal configurations used can be seen in Appendix A, in Figures 34, 35 and 36.
After the Train set recording the number of samples was more balanced; as shown in Figure 5, all the classes have at least 1500 samples.
[Bar plot: number of samples per class (0-3000) for all the selected classes, distinguishing previously existing samples from newly recorded ones]
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song on the same drumset. To do that, we recorded each song of the music school Drums Grade Initial and Grade 1, playing it correctly and then making mistakes in both reading and rhythmic ways. After testing with these recordings, we realized that we were not able to test the limits of the assessment system in terms of tempo or with different rhythmic measures. So we proposed two exercises of groove reading, in 4/4 and in 12/8, to be performed at different tempos; these recordings have been done in my study room with my laptop's microphone.
3.3 Data augmentation
As described in section 2.1.2, data augmentation aims to introduce changes to the signals to optimize the statistical representation of the dataset. To implement this task, the aforementioned Python library audiomentations is used.
The library audiomentations has a class called Compose which allows collecting different processing functions, assigning a probability to each of them. Then the Compose instance can be called several times with the same audio file, and each time the resulting audio will be processed differently because of the probabilities. In data_augmentation.ipynb10 a possible implementation is shown, as well as some plots of the original sample with different results of applying the created Compose to the same sample; an example of the results can be listened to in Freesound11.
The processing functions introduced in the Compose class are based on the ones proposed in [13] and [14]; their parameters are described below (a sketch of such a Compose follows the list):
• Add gaussian noise with 70% probability.
• Time stretch between 0.8 and 1.25 with 50% probability.
• Time shift forward a maximum of 25% of the duration with 50% probability.
• Pitch shift ±2 semitones with 50% probability.
• Apply mp3 compression with 50% probability.
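A minimal sketch of such a Compose, assuming a recent audiomentations API; transform and argument names differ slightly between versions (older releases use min_fraction/max_fraction for Shift, and Mp3Compression may not exist in the 0.6.0 release cited below), so this is an illustration rather than the notebook's literal code:

import soundfile as sf
from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                             PitchShift, Shift, TimeStretch)

# Probabilities and ranges mirror the list above.
augment = Compose([
    AddGaussianNoise(p=0.7),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    Shift(min_shift=0.0, max_shift=0.25, p=0.5),   # forward time shift, up to 25%
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    Mp3Compression(p=0.5),
])

samples, sample_rate = sf.read("kd_0001.wav")      # hypothetical mono sample file
samples = samples.astype("float32")                # audiomentations expects float32
# Each call draws new random parameters, so repeated calls yield different audio.
augmented = augment(samples=samples, sample_rate=sample_rate)
sf.write("kd_0001_aug.wav", augmented, sample_rate)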
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213
3.4 Drums events trim
As will be explained in section 4.2.1, the dataset has to be trimmed into individual files to analyze them and extract the low-level descriptors. In the Dataset_featureExtraction.ipynb12 notebook this process has been implemented, slicing all the audios with their annotations, each dataset separately, in order to sight-check all the resultant samples and better detect which annotations were not correct. A sketch of this slicing step is shown below.
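As an illustration only, a minimal slicing sketch, assuming annotations as (timestamp, label) pairs and a fixed slice length; the notebook's actual parameters and file layout may differ:

import soundfile as sf

def slice_events(audio_path, annotations, out_dir, slice_dur=0.25):
    """Cut one recording into per-event files named after their class label."""
    audio, sr = sf.read(audio_path)
    length = int(slice_dur * sr)
    for i, (onset_sec, label) in enumerate(annotations):
        start = int(onset_sec * sr)
        event = audio[start:start + length]
        sf.write(f"{out_dir}/{label}_{i:05d}.wav", event, sr)

# Hypothetical usage: annotations read from the unified txt format.
slice_events("routine1.wav", [(0.0, "kd"), (0.5, "hh+sd")], "slices")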
3.5 Summary
To summarize, a drums samples dataset has been created; the one used in this project will be called the 40kSamples Drums Dataset. Nonetheless, to share this dataset we have to ensure that we are full proprietaries of the data, which means that the samples that come from the IDMT, MDBDrums and MusicSchool datasets cannot be shared in another dataset. Alternatively, we will share the 29kSamples Drums Dataset, formed only by the samples recorded in the studio. This dataset will be available in Zenodo13, to download the whole dataset at once, and in Freesound, where some selected samples are uploaded in a pack14 to show the differences among microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592#.YMmNXW4p5TZ
14 https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is inspired by [2]. Each box of the diagram refers to a section in this chapter, so the diagram might be helpful to get a general idea of the problem when explaining each process.
The system is divided into two main processes. First, the top boxes correspond to the training process of the model, using the dataset created in the previous chapter. Secondly, the bottom row shows how a student submission is processed to generate some feedback. This feedback is the output of the system and should give some indications to the student on how they have performed and how they can improve.
4.1 Problem definition
To check if a student reads a music sheet correctly, we need some tool that tags which instruments of the drumset are playing for each detected event. This leads us to develop and train a drums event classifier; if this tool ensures a good accuracy when classifying (i.e. at least 95%), we will be able to properly assess a student's recording. If the classifier does not have enough accuracy, the system will not be useful, as we will not be able to differentiate between errors from the student and errors from the classifier.
[Diagram: proposed pipeline. Training (top): music scores and students' performances provide annotations, audio recordings and assessments that form the dataset, followed by feature extraction, drums event classifier training and performance assessment training. Inference (bottom): a new student's recording goes through feature extraction and performance assessment inference, producing a visualization and performance feedback.]
Figure 6: Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason, the project has been mainly focused on developing the aforementioned drums event classifier and a proper dataset. Thus, developing a properly assessed dataset of drums interpretations has not been possible, nor has the performance assessment training. Despite this, the feedback visualization has been developed, as it is a nice way to close the pipeline and get some understandable results; moreover, the performance feedback could be focused on deterministic aspects, such as telling the student if they are rushing or slowing down in relation to a given tempo.
4.2 Drums event classifier
As already mentioned, this section has been the main load of work for this project, because a reliable assessment depends on a correct automatic transcription. The process has been divided into 3 main parts: extracting the musical features, training and validating the model in an iterative process, and finally testing the model with totally new data.
4.2.1 Feature extraction
The feature extraction concept has been explained in section 2.1.1; it has been implemented using the MusicExtractor()1 method from Essentia's library.
The MusicExtractor() method has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()2 has been implemented to loop over a list of files and analyze each of them, adding the extracted features to a csv file jointly with the class of each drum event. At this point all the low-level features were extracted; both the mean and the standard deviation were computed across all the frames of the given audio file. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.
As mentioned in section 3.4, the fact that the MusicExtractor() method has to be called with a filename, not an audio stream, forced us to create another version of the dataset, which had each event annotated in a different audio file with the correspondent class label as filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples) and the features were saved in different csv files, one for each dataset3. Adding the number of instances in all the csv files, we get 40228 instances with 84 features and 1 label.
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices/features
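A minimal sketch of this per-file extraction with Essentia's MusicExtractor; the frame and hop sizes and the file names shown here are illustrative assumptions, not necessarily the values used in the notebook:

import csv
import essentia.standard as es

# Aggregate each low-level descriptor as mean and standard deviation across frames.
extractor = es.MusicExtractor(lowlevelStats=["mean", "stdev"],
                              lowlevelFrameSize=2048, lowlevelHopSize=1024)

def extract_row(filename, label):
    features, _ = extractor(filename)          # returns a Pool of descriptors
    names = [n for n in features.descriptorNames()
             if n.startswith("lowlevel") and isinstance(features[n], float)]
    return [features[n] for n in sorted(names)] + [label]

with open("features.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(extract_row("slices/kd_00001.wav", "kd"))  # hypothetical file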
4.2.2 Training and validating
As mentioned in section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) to do sound event classification; also, some authors have developed more complex methods for drums event classification. The complexity of these last methods made me choose the generic ones, also to see if they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40kSamples Drums Dataset. The first times we tried the models, we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula 4.1, which does not take into account the unbalance in the dataset. The accuracy computation was around 92%, but the correct predictions were mainly on the large classes; as shown in Figure 7, some classes had very low accuracy (even 0, as some classes had 10 samples, 7 used to train and 3 to validate, which were all badly predicted), but having a little number of instances affects the accuracy computation less.
\[ \mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} 1(\hat{y}_i = y_i) \tag{4.1} \]
Otherwise, the proper way to compute the accuracy in this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages the accuracy along all the classes, as in formula 4.2, where $w_i$ represents the weight of each class in the dataset. This computation lowered the result to 79%, which was not a good result.

\[ \hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \]
\[ \text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i \tag{4.2} \]
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score, which combines the precision and the recall of the model in one measure, as in formula 4.3. Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of predictions that should be correct for a given class.

\[ F\text{-measure} = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{4.3} \]
Having these results led us to the process of recording a personalized dataset to extend the already existing one (see section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as well as the balanced accuracy and f-score (both 89%). Until this point we were using both KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8: Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine refers to the regularization; this technique is intended to make a model less sensitive to the data noise and to the outliers that may not represent the class properly. When increasing this value to 10, the results improved among all the classes, as shown in Figure 9, as well as the accuracy and f-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was somehow a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with the classes that include only the three mentioned instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and f-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9: Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10: Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the training and validating iterative process has been developed in the Classifier_training.ipynb4 notebook. First, the csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered, and to remove redundant features the ones with a very low standard deviation are deleted (i.e. std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to move all the data distributions around 0, ensuring a standard deviation of 1.
In the next cells the dataset is split into train and validation sets, and the training method from the SVM of sklearn is called to perform the training; when the models are trained, the parameters are dumped in a file, to load the model a posteriori and be able to apply the learned knowledge to new data. This process was very slow on my computer, so we decided to upload the csv files to Google Drive and open the notebook with Google Colaboratory, as it was faster, which is key to avoid long waiting times during the iterative train-validate process. In the last cells the inference is made with the validation set and the accuracy is computed, as well as the confusion matrix plotted, to get an idea of which classes are performing better.
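A minimal sketch of this training loop with scikit-learn, under assumed file and column names (a merged features.csv with a final "label" column); the notebook's actual code may differ in details:

import pandas as pd
from joblib import dump
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = pd.read_csv("features.csv")                  # hypothetical merged csv
X, y = data.drop(columns="label"), data["label"]
X = X.loc[:, X.std() > 1e-5]                        # drop near-constant features

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)              # zero mean, unit variance
clf = SVC(C=10).fit(scaler.transform(X_train), y_train)

print(balanced_accuracy_score(y_val, clf.predict(scaler.transform(X_val))))
dump((scaler, clf), "svm_drums.joblib")             # reload later for inference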
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection5 does both tasks. As explained in section 2.1.1, there are many methods to do onset detection, and each of them is better for a different application. In the case of drums, we have tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase, and it works pretty well; but when the tempo increases there are some onsets that are not correctly detected, and for this reason we finally implemented the onset detection with the HFC
4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
method. This method computes the HFC for each window, as in equation 4.4; note that high-frequency bins (index $k$) weigh more in the final value of the HFC.

\[ \mathrm{HFC}(n) = \sum_k |X_k[n]|^2 \cdot k \tag{4.4} \]
Moreover, the function plots the audio waveform jointly with the onsets detected, to check if it has worked correctly after each test. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
With the onset information, the audio can be trimmed into the different events; the order is maintained with the name of the file, so when comparing with the expected events they can be mapped easily. The audios are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a csv.
To predict which event each slice is, the models already trained are loaded in this new environment, and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row in the data. The described process is implemented in the first part of Assessment.ipynb6; the second part is intended to execute the visualization functions described in the next section. A sketch of the inference step is shown below.
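A minimal inference sketch matching the training sketch above (same assumed artifact names; not the notebook's literal code):

import pandas as pd
from joblib import load

scaler, clf = load("svm_drums.joblib")          # artifacts from the training sketch

slices = pd.read_csv("submission_features.csv") # hypothetical features of the slices
predicted = clf.predict(scaler.transform(slices))
print(list(predicted))                          # e.g. ['hh+kd', 'hh', 'hh+sd', ...]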
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving visual feedback on the interpretation to the student. As the drums classifier has taken so much time, the creation of a dataset with interpretations and their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different levels of skill, but grading them and doing all the process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed ones.
So the implemented solution generates an image that shows the student if the notes of the music sheet are correctly read and if the onsets are aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Having as a base implementation the solution of my companion Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb.
First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (this comes from the LilyPond file's macro). To properly plot the deviations, in the code we are assuming that the interpretation starts four beats after the beginning of the audio.
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (green good to red bad), as in the sketch after the figures.
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
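A rough sketch of such a deviation plot with matplotlib, assuming onset lists in seconds and a maximum deviation for the color scale; the actual save_bar_plot() also handles the LilyPond cropping details:

import matplotlib.pyplot as plt
import numpy as np

def plot_deviation(audio, sr, detected, expected, max_dev=0.2):
    t = np.arange(len(audio)) / sr
    plt.plot(t, audio, color="lightgray")
    for d, e in zip(detected, expected):
        severity = min(abs(d - e) / max_dev, 1.0)       # 0 = on time, 1 = bad
        plt.axvspan(min(d, e), max(d, e), color=(severity, 1 - severity, 0, 0.6))
        plt.vlines(d, -1.0, -0.6, color="black")        # detected onset marker
        plt.vlines(e, -0.2, 0.2, color="cyan")          # expected onset marker
    plt.xlabel("time (s)")
    plt.show()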
Once the waveform is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes()10 the expected and predicted events are compared: a list of booleans is created, with 0 in the False indices and 1 in the True ones. Then the resulting list is iterated and the 0 indices are re-checked, because most of the classification errors fail in only one of the instruments to be predicted (i.e. instead of hh+sd it predicts sd). These cases are considered partially correct, as the system has to take into account its own errors: in the indices where one of the instruments is correctly predicted and it is not a hi-hat (we are considering it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct and red means incorrect.
9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
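A hedged sketch of this partial-credit scoring, with assumed event strings like 'hh+sd'; the real assess_notes() may differ in details:

def assess_notes(expected, predicted):
    """Score each event: 1.0 correct, 0.75 partially correct, 0.0 incorrect."""
    scores = []
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            scores.append(1.0)
            continue
        shared = set(exp.split("+")) & set(pred.split("+"))
        # Partial credit only when a non-hi-hat instrument was still caught.
        scores.append(0.75 if shared - {"hh"} else 0.0)
    return scores

print(assess_notes(["hh+sd", "hh+kd", "sd"], ["sd", "hh", "sd"]))
# -> [0.75, 0.0, 1.0]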
With the waveform, the notes assessed and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the waveform previously created; this is done with the LilyPond macros.
On one hand, before each note on the staff, the keywords color() and size() determine that the color and size of the note depend on an external variable (the notes assessed); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends, in this case from 0 to 16, which in a 4/4 rhythm is 4 bars; the other number is the scale of the waveform and allows fitting the plot better to the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps; these are found in the txt file already mentioned in section 3.1.1. Then the LilyPond file: this is the template, written in the LilyPond language, that defines the resultant music sheet; the macros to change color and size and to add the waveform are defined there. When extracting the musical features, each submission creates its csv file to store the information. And finally we need, of course, the audio files with the recorded submission to be assessed.
11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can do an evaluation of the results, to check if the system works correctly and is useful for a student to learn, and also to test which are its limits regarding the audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB until an accumulated gain of +30 dB.
In this chapter and Appendix B all the resultant feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high-frequency content method works perfectly, there are no limitations nor errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without detecting any false positives.
1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops when the tempo increases. Having as a reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one weighs 0.5 and an incorrect one 0; the total value is the mean of the weighted results of the predictions. For instance, the 60 bpm row of Table 3 gives (25 + 0.5 · 7 + 0) / 32 = 0.89.
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of the classifier decreases. This may be because increasing the tempo decreases the spacing between events, and consequently the duration of each event, which leads to fewer
Figure 19: Good reading and good tempo, Ex. 1, 180 bpm
Figure 20: Good reading and good tempo, Ex. 1, 220 bpm
values to calculate the mean and standard deviation when extracting the timbre
characteristics. As stated in the law of large numbers [25], the larger the sample,
the closer the sample mean is to the population mean. In this case, having fewer
values in the calculation creates more outliers in the distribution, which tends to scatter.
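As a rough illustration of this effect (our numbers, not measured in the experiments), the number of analysis frames available per event shrinks as the tempo rises:

```python
SR = 44100   # sample rate in Hz (assumed)
HOP = 512    # hop size between analysis frames (assumed)

def frames_per_event(bpm, events_per_beat=2):
    """Frames available for one event, assuming consecutive events are
    spaced one eighth note apart (two events per quarter-note beat)."""
    event_seconds = 60.0 / (bpm * events_per_beat)
    return int(event_seconds * SR / HOP)

# frames_per_event(60) -> 43 frames; frames_per_event(220) -> 11 frames
```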
Tempo (bpm) | Correct | Partially OK | Incorrect | Total
60          | 25      | 7            | 0         | 0.89
100         | 24      | 8            | 0         | 0.875
140         | 24      | 7            | 1         | 0.86
180         | 15      | 9            | 8         | 0.61
220         | 12      | 7            | 13        | 0.48

Table 3: Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record it faster
than 100 bpm. But in 12/8 at 100 bpm the equivalent eighth-note rate is 300 eighth
notes per minute, similar to 140 bpm in 4/4, whose eighth-note rate is 280 (see the
short computation below). The results in 12/8 (Table 4) are also better because there
are more 'only hi-hat' events, which are better predicted.
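The tempo equivalence can be checked with a one-line computation, assuming the 12/8 metronome mark refers to the dotted quarter note:

```python
# eighth-note rates behind the 12/8 vs 4/4 comparison
eighths_per_minute_12_8 = 100 * 3  # 100 dotted quarters/min, 3 eighths each = 300
eighths_per_minute_4_4 = 140 * 2   # 140 quarters/min, 2 eighths each = 280
```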
Figure 21: Good reading and good tempo, Ex. 2, 60 bpm
Figure 22: Good reading and good tempo, Ex. 2, 100 bpm
Tempo (bpm) | Correct | Partially OK | Incorrect | Total
60          | 39      | 8            | 1         | 0.89
100         | 37      | 10           | 1         | 0.875

Table 4: Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening
to the submissions, the hi-hat events are recorded with less amplitude than the snare
and kick events; for this reason, we think that the classifier starts to fail at +18 dB.
As can be seen in Tables 5 and 6, the same counting scheme as in the previous section
is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate,
and at this gain level the overall waveform is so clipped that it leads to high-frequency
content that is predicted as a hi-hat in all cases.
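The amplified test signals can be reproduced with a simple gain-and-clip operation; this is our illustration of the effect, not the exact tooling used for the experiments:

```python
import numpy as np

def amplify(x, gain_db):
    """Apply gain and hard-clip to [-1, 1], as happens when a recording
    overloads; the clipping adds the high-frequency content that tends
    to be predicted as a hi-hat."""
    gain = 10.0 ** (gain_db / 20.0)
    return np.clip(x * gain, -1.0, 1.0)

# e.g. the most extreme condition tested: amplify(audio, 30.0)
```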
Level  | Correct | Partially OK | Incorrect | Total
+0 dB  | 25      | 7            | 0         | 0.89
+6 dB  | 23      | 9            | 0         | 0.86
+12 dB | 23      | 9            | 0         | 0.86
+18 dB | 24      | 7            | 1         | 0.86
+24 dB | 18      | 5            | 9         | 0.64
+30 dB | 13      | 5            | 14        | 0.48

Table 5: Results of exercise 1 at 60 bpm with different amplification levels
Level  | Correct | Partially OK | Incorrect | Total
+0 dB  | 12      | 7            | 13        | 0.48
+6 dB  | 13      | 10           | 9         | 0.56
+12 dB | 10      | 8            | 14        | 0.5
+18 dB | 9       | 2            | 21        | 0.31
+24 dB | 8       | 0            | 24        | 0.25
+30 dB | 9       | 0            | 23        | 0.28

Table 6: Results of exercise 1 at 220 bpm with different amplification levels
Figure 23: Good reading and good tempo, Ex. 1, 60 bpm, accumulating +6 dB at each new staff
Figure 24: Good reading and good tempo, Ex. 1, 220 bpm, accumulating +6 dB at each new staff
5.3 Evaluation of the assessment
Until now, the evaluation of results has been focused on the accuracy of the drums
event classifier, but we think it is also important to evaluate whether the system can
properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of
the beats are not read, the system can still map the rest of the events to the expected
ones at the corresponding onset time steps. This is due to a check done in the
assessment, which assumes that before the first beat there is a count-in of one bar,
and that the rest of the beats have to come after this interval.
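A sketch of this mapping is shown below; the function name, the tolerance and the nearest-match strategy are our assumptions, not the actual assessment code:

```python
def map_onsets(detected, expected, count_in, tol=0.1):
    """Map each expected onset (in seconds, relative to the start of the
    exercise) to the closest detected onset, assuming a one-bar count-in
    before the first beat; unplayed events map to None."""
    mapped = []
    for t_exp in expected:
        target = count_in + t_exp
        close = [t for t in detected if abs(t - target) < tol]
        mapped.append(min(close, key=lambda t: abs(t - target)) if close else None)
    return mapped
```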
Figure 25: Bad reading and bad tempo, Ex. 1, 100 bpm
Figure 26: Bad reading and bad tempo, Ex. 1, 180 bpm
To evaluate the assessment, we will proceed as in the previous sections, counting the
number of correct predictions, but now in terms of assessment. The analyzed results
will be the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27: Bad reading and good tempo, Ex. 1, starts at 60 bpm and adds 60 bpm at each new staff
Figure 28: Bad reading and good tempo, Ex. 2, 60 bpm
Figure 29: Bad reading and good tempo, Ex. 2, 100 bpm
In Tables 7 and 8 the counting is summarized; it works as follows: we count a
correct assessment if the note is green or light green and the event is the one in the
music score, or if the note is red and the event is not the one in the music score.
The rest of the cases are counted as incorrect assessments. The total value is
the number of correct assessments over the total number of events.
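This rule can be written compactly as follows; the colour labels are illustrative:

```python
def correct_assessment(note_colour, played_matches_score):
    """True when the feedback agrees with what was actually played:
    green or light green on a matching event, or red on a
    non-matching one; anything else is an incorrect assessment."""
    if note_colour in ("green", "light_green"):
        return played_matches_score
    return not played_matches_score
```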
Tempo (bpm) | Correct assessment | Incorrect assessment | Total
60          | 32                 | 0                    | 1
100         | 32                 | 0                    | 1
140         | 32                 | 0                    | 1
180         | 25                 | 7                    | 0.78
220         | 22                 | 10                   | 0.68

Table 7: Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo (bpm) | Correct assessment | Incorrect assessment | Total
60          | 47                 | 1                    | 0.98
100         | 45                 | 3                    | 0.9

Table 8: Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs
the assessment based on the predictions quite well. This can be helpful for a student
to know which parts of the music sheet are read well and which are not. Also, the
tempo visualization can help the student recognize whether they are slowing down or
rushing when reading the score: as can be seen in Figure 30, the detected onsets
(black lines in the bottom part of the waveform) are mostly behind the corresponding
expected onsets.
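The deviations visualized in Figure 30 amount to a simple difference between paired onsets, sketched here under the assumption that detected and expected onsets have already been matched one-to-one:

```python
def onset_deviations(detected, expected):
    """Per-event deviation in seconds; positive values mean the player
    is behind the expected onset (dragging), negative means rushing."""
    return [d - e for d, e in zip(detected, expected)]
```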
Figure 30: Good reading and bad tempo, Ex. 1, 100 bpm
Chapter 6
Discussion and conclusions
At this point, all the work of the project has been done and the results have been
analyzed. In this chapter, a discussion is developed about which objectives have been
accomplished and which have not. Also, a set of further improvements is given, and a
final thought on my work and what I have learned. The chapter ends with an analysis
of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained throughout this document, we can now list
them, stating how complete each one is and what our contributions are.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly available
and downloadable from Freesound and Zenodo. Apart from being used in this project,
this dataset might be useful to other researchers and students in their projects.
The dataset is indeed useful for balancing drums datasets based on real
interpretations, as the class distribution of these interpretations is very unbalanced,
as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset. One of the reasons for using
this approach to predict the events was that there was no literature focused on
classifying drums events in this manner. As the results have shown, more complex
methods based on the context might be used, such as the ones proposed in [16] and [17].
It is important to take into account that the task the model is trained to do
is very hard for a human: being able to differentiate drums events in an individual
drum sample, without any context, is almost impossible even for a trained ear such as
my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as
the development of a MusicXML parser. This part took around one month to
develop and, from my point of view, it was a great way to understand how these
file formats work and how they can be improved, as they are mostly focused on the
visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate
the functionality of the system, and tests of these exercises have been recorded
in a different environment than the 40k Samples Drums Dataset. It would be good
to get recordings in different spaces and with different drumsets and microphones,
to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with
different drumsets, tuning each drumset differently, using different sticks to hit the
instruments, and even having different people play. This would introduce more variance
into the drums sample dataset. Moreover, on June 9th 2021, a paper about a large
drums dataset with MIDI data was presented [26] at ICASSP 2021
(https://www.2021.ieeeicassp.org). This new dataset could be included in the
training process, as the authors state that having a large-scale dataset improves the
results of the existing models.
Regarding the classification model, it is clear that it needs improvements to ensure the
overall robustness of the system. It would be appropriate to introduce the aforementioned
methods of [16], [17] and [26] in the ADT part of the pipeline.
Also, in terms of the classes in the drumset, there is still a long way to go: there are
no solutions that robustly transcribe a whole set, including the toms and the different
kinds of cymbals. In this regard, we think that a proper approach would be to work
with professional musicians, who can help researchers better understand the
instrument and create datasets covering different techniques.
Regarding the assessment step, apart from the feedback visualization of the tempo
deviations and the reading accuracy, a regression model could be trained with assessed
drums exercises in order to give a mark to each student. On this path, introducing an
electronic drumset with MIDI output would make things a lot easier, as the drums
classifier step could be omitted.
About the implementation, a good contribution would be to introduce the models
and algorithms into the Pysimmusic workflow and to develop a demo web app like
Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and
other researchers and students can execute them, obtaining the same results.
All the code has been developed in Python, a widely known general-purpose pro-
gramming language. It is available in my GitHub repository2, as well as the data
used to test the system and the classification models.
The data created, i.e. the studio recordings, are available in a Zenodo repository3,
and some samples in Freesound4. This is the 29k Samples Drums Dataset: as not all
of the 40k samples used for training are our property, we are not able to share them
under our full authorship; despite this, the other datasets used in this project are
available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work de-
scribed, the goal of supporting drums learning has been accomplished. Work remains
in terms of robustness and reliability, but a first approximation has been presented,
as well as several paths of improvement proposed.
Moreover, several fields of engineering and computer science have been covered, such
as signal processing, music information retrieval and machine learning; not only
in terms of implementation, but also by investigating methods and gathering already
existing experiments and results.
About my relationship with computers, I have improved my fluency with git and
its web version, GitHub. Also, at the beginning of the project I wanted to execute
everything on my local computer, having to install and compile libraries that could
not be installed on macOS via the pip command (i.e. Essentia), which has been
a tough path to take. In a more advanced phase of the project, I realized that the
LilyPond tools could not be installed and used fluently on my local machine, so I
moved all the code to my Google Drive to execute the notebook on a Collaboratory
machine. Developing code in this environment also has its quirks, which I have had
to learn. In summary, I have spent a good amount of time looking for the ideal way
to develop the project, and the process has indeed been fruitful in terms of knowledge
gained.
In my personal opinion, developing this project has been a nice way to close my
Bachelor's degree, as I reviewed some of the concepts of most personal interest.
Being able to relate the project to music and drums helped me keep my motivation
and focus. I am quite satisfied with the feedback visualization that results from the
system, and I hope that more people get interested in this field of research, so that
better tools are available in the future.
List of Figures
1 Datasets pre-processing
2 Sample drums score from music school drums grade 1
3 Microphone setup for drums recording
4 Number of samples before Train set recording
5 Number of samples after Train set recording
6 Proposed pipeline for a drums performance assessment system, inspired by [2]
7 Confusion matrix after training with the dataset in Figure 4
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
11 Onsets detected in a 60 bpm drums interpretation
12 Onsets detected in a 220 bpm drums interpretation
13 Onset deviation plot of a good tempo submission
14 Onset deviation plot of a bad tempo submission
15 Example of coloured notes
16 Good reading and good tempo, Ex. 1, 60 bpm
17 Good reading and good tempo, Ex. 1, 100 bpm
18 Good reading and good tempo, Ex. 1, 140 bpm
19 Good reading and good tempo, Ex. 1, 180 bpm
20 Good reading and good tempo, Ex. 1, 220 bpm
21 Good reading and good tempo, Ex. 2, 60 bpm
22 Good reading and good tempo, Ex. 2, 100 bpm
23 Good reading and good tempo, Ex. 1, 60 bpm, accumulating +6 dB at each new staff
24 Good reading and good tempo, Ex. 1, 220 bpm, accumulating +6 dB at each new staff
25 Bad reading and bad tempo, Ex. 1, 100 bpm
26 Bad reading and bad tempo, Ex. 1, 180 bpm
27 Bad reading and good tempo, Ex. 1, starts at 60 bpm and adds 60 bpm at each new staff
28 Bad reading and good tempo, Ex. 2, 60 bpm
29 Bad reading and good tempo, Ex. 2, 100 bpm
30 Good reading and bad tempo, Ex. 1, 100 bpm
31 Recording routine 1
32 Recording routine 2
33 Recording routine 3
34 Drumset configuration 1
35 Drumset configuration 2
36 Drumset configuration 3
37 Good reading and bad tempo, Ex. 1, 60 bpm
38 Bad reading and bad tempo, Ex. 1, 60 bpm
39 Good reading and bad tempo, Ex. 1, 140 bpm
40 Bad reading and bad tempo, Ex. 1, 140 bpm
41 Good reading and bad tempo, Ex. 1, 180 bpm
42 Good reading and bad tempo, Ex. 1, 220 bpm
43 Bad reading and bad tempo, Ex. 1, 220 bpm
44 Good reading and bad tempo, Ex. 2, 60 bpm
45 Bad reading and bad tempo, Ex. 2, 60 bpm
46 Good reading and bad tempo, Ex. 2, 100 bpm
47 Bad reading and bad tempo, Ex. 2, 100 bpm
List of Tables
1 Abbreviations' legend
2 Microphones used
3 Results of exercise 1 with different tempos
4 Results of exercise 2 with different tempos
5 Results of exercise 1 at 60 bpm with different amplification levels
6 Results of exercise 1 at 220 bpm with different amplification levels
7 Assessment result of a bad reading with different tempos, 4/4 exercise
8 Assessment result of a bad reading with different tempos, 12/8 exercise
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K. J., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Machine Learning 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2020).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects (DAFx 2012) Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2012).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums - an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2010).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
Figure 31: Recording routine 1
Figure 32: Recording routine 2
Figure 33: Recording routine 3
Figure 34: Drumset configuration 1
Figure 35: Drumset configuration 2
Figure 36: Drumset configuration 3
Appendix B
Extra results
Figure 37: Good reading and bad tempo, Ex. 1, 60 bpm
Figure 38: Bad reading and bad tempo, Ex. 1, 60 bpm
Figure 39: Good reading and bad tempo, Ex. 1, 140 bpm
Figure 40: Bad reading and bad tempo, Ex. 1, 140 bpm
Figure 41: Good reading and bad tempo, Ex. 1, 180 bpm
Figure 42: Good reading and bad tempo, Ex. 1, 220 bpm
Figure 43: Bad reading and bad tempo, Ex. 1, 220 bpm
Figure 44: Good reading and bad tempo, Ex. 2, 60 bpm
Figure 45: Bad reading and bad tempo, Ex. 2, 60 bpm
Figure 46: Good reading and bad tempo, Ex. 2, 100 bpm
Figure 47: Bad reading and bad tempo, Ex. 2, 100 bpm
Chapter 1
Introduction
The field of sound event classification and concretely drums events classification has
improved the results during the last years and allows us to use this kind of technolo-
gies for a more concrete application like education support [1] The development of
automated assessment tools for the support of musical instrument learning has been
a field of study in the Music Technology Group (MTG) [2] concretely on guitar
performances implemented in the Pysimmusic project [3] One of the open paths
that proposes Eremenko et al [2] is to implement it with different instruments
and this is what I have done
11 Motivation
The aim of the project is to implement a tool to evaluate musical performances
specifically reading scores with drums One possible real application may be to sup-
port the evaluation in a music school allowing the teacher to focus on other items
such as attitude posture and more subtle nuances of performance If high accuracy
is achieved in the automatic assessment of tempo and reading a fair assessment of
these aspects can be ensured In addition a collaboration between a music school
and the MTG allows the use of a specific corpus of data from the educational insti-
tution program this corpus is formed by a set of music sheets and the recordings of
the performances
1
2 Chapter 1 Introduction
Besides this I have been studying drums for fourteen years and a personal motivation
emerges from this fact Learning an instrument is a process that does not only rely
on going to class there is an important load of individual practice apart of the class
indeed Having a tool to assess yourself when practicing would be a nice way to
check your own progress
12 Existing solutions
In terms of music interpretation assessment there are already some software tools
that support assessing several instruments Applications such as Yousician1 or
SmartMusic2 offer from the most basic notions of playing an instrument to a syllabus
of themes to be played These applications return to the students an evaluation that
tells which notes are correctly played and which are not but do not give information
about tempo consistency or dynamics and even less generates a rubric as a teacher
could develop during a class
There are specific applications that support drums learning but in those the feature
of automatic assessment disappears There are some options to get online drums
lessons such as Drumeo3 or Drum School4 but they only offer a list of videos impart-
ing lessons on improving stylistic vocabulary feel improvisation or technique These
applications also offer personal feedback from professional drummers and a person-
alized studying plan but the specific feature of automatic performance assessment
is not implemented
As mentioned in the Introduction automatic music assessment has been a field
of research at the MTG With the development of Music Critic5 an assessment
workflow is proposed and implemented This is useful as can be adapted to the
drums assessment task
1httpsyousiciancom2httpswwwsmartmusiccom3httpswwwdrumeocom4httpsdrumschoolappcom5httpsmusiccriticupfedu
13 Identified challenges 3
13 Identified challenges
As mentioned in [2] there are still improvements to do in the field of music as-
sessment especially analyzing expressivity and with advanced level performances
Taking into account the scope of this project and having as a base case the guitar
assessment exercise from Music Critic some specific challenges are described below
131 Guitar vs drums
As defined in [4] a drumset is a collection of percussion instruments Mainly cym-
bals and drums even though some genres may need to add cowbells tambourines
pailas or other instruments with specific timbres Moreover percussion instruments
are splitted in two families membranophones and idiophones membranophones
produces sound primarily by hitting a stretched membrane tuning the membrane
tension different pitches can be obtained [5] differently idiophones produces sound
by the vibration of the instrument itself and the pitch and timbre are defined by
its own construction [5] The vibration aforementioned is produced by hitting the
instruments generally this hit is made with a wood stick but some genres may need
to use brushes hotrods or mallets to excite specific modes of the instruments With
all this in mind and as stated in [1] it is clear that transcribing drums having to
take into account all its variants and nuances is a hard task even for a professional
drummer With this said there is a need to simplify the problem and to limit the
instruments of the drumset to be transcribed
Returning to the assessment task guitars play notes and chords tuned so the way
to check if a music sheet has been read correctly is looking for the pitch information
and comparing it to the expected one Differently instruments that form a drumset
are mainly unpitched (except toms which are tuned using different scales and tuning
paradigms) so the differences among drums events are on the timbre A different
approach has to be defined in order to check which instrument is being played for
each detected event the first idea is to apply machine learning for sound event
classification
4 Chapter 1 Introduction
Along the project we will refer to the different instruments that conform a drumkit
with abbreviations In Table 1 the legend used is shown the combination of 2 or
more instruments is represented with a rsquo+rsquo symbol between the different tags
Instrument Kick Drum Snare Drum Floor tom Mid tom High tomAbbreviation kd sd ft mt ht
Instrument Hi-hat Ride cymbal Crash cymbalAbbreviation hh cy cr
Table 1 Abbreviationsrsquo legend
132 Dataset creation
Keeping in mind the last idea of the previous section if a machine learning approach
has to be implemented there is a basic need to obtain audio data of drums Apart
from the audio data proper annotations of the drums interpretations are needed in
order to slice them correctly and extract musical features of the different events
The process of gathering data should take into account the different possibilities that
offers a drumset in terms of timbre loudness and tone Several datasets should be
combined as well as additional recordings with different drumsets in order to have
a balanced and representative dataset Moreover to evaluate the assessment task
a set of exercises has to be recorded with different levels of skill
There is also the need to capture those sounds with several frequency responses in
order to make the model independent of the microphone Also those samples could
be processed to get variations of each of them with data augmentation processes
133 Signal quality
Regarding the assignment we have to take into account that a student will not be
able to record its interpretations with a setup as the used in a studio recording most
of the time the recordings will be done using the laptop or mobile phone microphone
This fact has to be taken into account when training the event classifier in order
to do data augmentation and introduce these transformations to the dataset eg
introducing noise to the samples or amplifying to get overload distortion
14 Objectives 5
14 Objectives
The main objective of this project is to develop a tool to assess drums interpretations
of a proposed music sheet This objective has to be split into the different steps of
the pipeline
bull Generate a correctly annotated drums dataset which means a collection of
audio drums recordings and its annotations all equally formatted
bull Implement a drums event sound classifier
bull Find a way to properly visualize drums sheets and their assessment
bull Propose a list of exercises to evaluate the technology
In addition having the code published in a public Github6 repository and uploading
the created dataset to Freesound7 and Zenodo8 will be a good way to share this work
15 Project overview
The next chapters will be developed as follows In chapter 2 the state of the art is
reviewed Focusing on signal processing algorithms and ways to implement sound
event classification ending with music sheet technologies and software tools available
nowadays In chapter 3 the creation of a drums dataset is described Presenting the
use of already available datasets and how new data has been recorded and annotated
In chapter 4 the methodology of the project is detailed which are the algorithms
used for training the classifier as well as how new submissions are processed to assess
them In chapter 5 an evaluation of the results is done pointing out the limitations
and the achievements Chapter 6 concludes with a discussion on the methods used
the work done and further work
6httpsgithubcom7httpsfreesoundorg8httpszenodoorg
Chapter 2
State of the art
In this chapter the concepts and technologies used in the project are explained
covering algorithm references and existing implementations First signal process-
ing techniques on onset detection and feature extraction are reviewed then sound
event classification field is presented and its relationship with drums event classifica-
tion Also the principal music sheet technologies and codecs are presented Finally
specific software tools are listed
21 Signal processing
211 Feature extraction
In the following sections sound event classification will be explained most of these
methods are based on training models using features extracted from the audio not
with the audio chunks indeed [6] In this section signal processing methods to get
those features are presented
Onset detection
In an audio signal an onset is the beginning of a new event it can be either a
single note a chord or in the case of the drums the sound produced by hitting one
or more instruments of the drumset It is necessary to have a reliable algorithm
6
21 Signal processing 7
that properly detects all the onsets of a drums interpretation With the onsets
information (a list of timestamps) the audio can be sliced to analyze each chunk
separately and to assess the tempo consistency
It is important to address the challenge in a psychoacoustical way as the objective
is to detect the musical events as a human will do In [7] the idea of perceptual
onset for percussive instruments is defined as a time interval between the physical
onset and the moment that the maximum level is reached In [8] many methods are
reviewed focusing on the differences of performance depending on the signal Non
Pitched Percussive instruments are better detected with temporal methods or high-
frequency content methods while Pitched Non Percussive instruments may need to
take into account changes of energy in the spectrum distribution as the onset may
represent a different note
The sound generated by the drums is mainly percussive (discarding brushesrsquo slow
patterns or malletrsquos build-ups on the cymbals) which means that is formed by a
short transient followed by a short decay there is no sustain As the transient is a
fast change of energy it implies a high-frequency content because changes happen
in a very little frame of time As recommended in [9] HFC method will be used
Timbre features
As described in [10] a feature denotes in some way a quantity or a value Features
extracted by processing the audio stream or transformations of that (ie FFT)
are called low-level descriptors these features have no relevant information from a
human point of view but are useful for computational processes [11]
Some low-level descriptors are computed from the temporal information for in-
stance the zero-crossing rate tells the number of times the signal crosses the zero
axis per second the attack time is the duration of the transient and temporal cen-
troid the energy distribution of an event during the time Other well known features
are the root median square of the signal or the high-frequency content mentioned
in section 211
8 Chapter 2 State of the art
Besides temporal features low-level descriptors can also be computed from the fre-
quency domain Some of them are spectral flatness spectral roll-off spectral slope
spectral flux ia
Nowadays Essentialsquos library offers a collection of algorithms that reliably extracts
the low-level descriptors aforementioned the function that englobes all the extrac-
tors is called Music extractor1
212 Data augmentation
Data augmentation processes refer to the optimization of the statistical representa-
tion of the datasets in terms of improving the generalization of the resultant models
These methods are based on the introduction of unobserved data or latent variables
that may not be captured during the dataset creation [12]
Regarding this technique applied to audio data signal processing algorithms are
proposed in [13] and [14] that introduces changes to the signals in both time and
frequency domains In these articles the goal is to improve accuracy on speech and
animal sound recognition although this could apply to drums event classification
The processes that lead best results in [13] and [14] were related to time-domain
transformations for instance time-shifting and stretching adding noise or harmonic
distortion compressing in a given dynamic range ia Other processes proposed
were focused on the spectrogram of the signal applying transformations such as
shifting the matrix representation setting to 0 some areas or adding spectrograms
of different samples of the same class
Presently some Python2 libraries are developed and maintained in order to do audio
data augmentation tasks For instance audiomentations3 and the GPU version
torch-audiomentations4
1httpsessentiaupfedustreaming_extractor_musichtml2httpswwwpythonorg3httpspypiorgprojectaudiomentations0604httpspypiorgprojecttorch-audiomentations
22 Sound event classification 9
22 Sound event classification
Sound Event Classification is the task of detecting and recognizing sound events in
an audio stream [15] As described in [10] this task can be approached from two
sides on one hand the perceptual approach tries to extract the timbre similarity to
cluster sounds as how we perceive them on the other hand the taxonomic approach
is determined to label sound events as they are defined in the cultural or user biased
taxonomies In this project the focus is on the second approach as the task is to
classify sound events in the drums taxonomy (ie kick drum snare drum hi-hat)
Also in [] many classification methods are proposed Concretely in the taxonomy
approach machine learning algorithms such as K-Nearest Neighbors Support Vector
Machines or Neural Networks All of them using features extracted from the audio
data as explained in section 211
221 Drums event classification
This section is divided into two parts first presenting the state-of-the-art methods
for drum event classification and then the most relevant existing datasets This
section is mainly based on the article [1] as it is a review of the topic and encompasses
the core concepts of the project
Methods
Focusing on the taxonomic drums events classification this field has been studied for
the last years as in the Music Information Retrieval Evaluation eXchange5 (MIREX)
has been a proposed challenge since 20056 In [1] a review of the main methods
that have been investigated is done The authors collect different approaches such
as Recurrent Neural Networks proposed in [16] Non-Negative matrix factorization
proposed in [17] and others real-time based using MaxMSP7 as described in [18]
5httpswwwmusic-irorgmirexwikiMIREX_HOME6httpswwwmusic-irorgmirexwiki2005Audio_Drum_Detection_Results7httpscycling74comproductsmax
10 Chapter 2 State of the art
It is needed to mention that the proposed methods are focused on Automatic Drum
Transcription (ADT) of drumsets formed only by the kick drum snare drum and
hi-hat ADT field is intended to transcribe audio but in our case we have to check
if an audio event is or not the expected event this particularity can be used in our
favor as some assumptions can be made about the audio that has to be analyzed
Datasets
In addition to the methods and their combinations the data used to train the
system plays a crucial role As a result the dataset may have a big impact on the
generalization capabilities of the models In this section some existing datasets are
described
bull IDMT-SMT-Drums [19] Consists of real drum recordings containing only
kick drum snare drum and hi-hat events Each recording has its transcription
in xml format and is publicly avaliable to download8
bull MDB Drums [20] Consists of real drums recordings of a wide range of genres
drumsets and styles Each recording has two txt transcriptions for the classes
and subclasses defined in [20] (eg class Hi-hat Subclasses Closed hi-hat
open hi-hat pedal hi-hat) It is publicly avaliable to download9
bull ENST-Drums [21] Consists of real drum audio and video recordings of dif-
ferent drummers and drumsets Each recording has its transcription and some
of them include accompaniment audio It is publicly available to download10
bull DREANSS [22] Differently this dataset is a collection of drum recordings
datasets that have been annotated a posteriori It is publicly available to
download11
Electronic drums datasets have not been considered as the student assignment is
supposed to be recorded with a real drumset8httpswwwidmtfraunhoferdeenbusiness_unitsm2dsmtdrumshtml9httpsgithubcomCarlSouthallMDBDrums
10httpspersotelecom-paristechfrgrichardENST-drums11httpswwwupfeduwebmtgdreanss
23 Digital sheet music 11
23 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter
programs from the 80s Proprietary softwares as Finale12 and Sibelius13 or open-
source software as MuseScore14 and LilyPond15 are some options that can be used
nowadays to write music sheets with a computer
In terms of file format Sibelius has its encrypted version that can only be read and
written with the software it can also write and read MusicXML16 files which are
not encrypted and are similar to an HTML file as it contains tags that define the
bars and notes of the music sheet this format is the standard for exchanging digital
music sheet
Within Music Criticrsquos framework the technology used to display the evaluated score
is LilyPond it can be called from the command line and allows adding macros that
change the size or color of the notes The other particularity is that it uses its own
file format (ly) and scores that are in MusicXML format have to be converted and
reviewed
24 Software tools
Many of the concepts and algorithms aforementioned are already developed as soft-
ware libraries this project has been developed with Python and in this section the
libraries that have been used are presented Some of them are open and public and
some others are private as pysimmusic that has been shared with us so we can use
and consult it In addition all the code has been developed using a tool from Google
called Collaboratory17 it allows to write code in a jupyter notebook18 format that
is agile to use and execute interactively
12httpswwwfinalemusiccom13httpswwwavidcomsibelius14httpsmusescoreorg15httpslilypondorg16httpswwwmusicxmlcom17httpscolabresearchgooglecom18httpsjupyterorg
12 Chapter 2 State of the art
241 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis
description and synthesis [23] it can also be installed as a Python-based library
with the pip19 command in Linux or compiling with certain flags in MacOS20 This
library includes a collection of MIR algorithms it is not a framework so it is in the
userrsquos hands how to use these processes Some of the algorithms used in this project
are music feature extraction onset detection and audio file IO
242 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine learning
algorithms for regression classification and clustering as well as pre-processing and
dimensionality reduction functions Based on NumPy22 and SciPy23 so its algorithms
are easy to adapt to the most common data structures used in Python It also allows
to save and load trained models to do inference tasks with new data
243 Lilypond
As described in section 23 LilyPond is an open-source songwriter software with
its file format and language It can produce visual renders of musical sheets in
PNG SVG and PDF formats as well as MIDI files to listen to the compositions
LilyPond works on the command line and allows us to introduce macros to modify
visual aspects of the score such as color or size
It is the digital sheet music technology used within Music Criticrsquos framework as
allows to embed an image in the music sheet generating a parallel representation of
the music sheet and a studentrsquos interpretation
19httpspypiorgprojectpip20httpsessentiaupfeduinstallinghtml21httpsscikit-learnorg22httpsnumpyorg23httpswwwscipyorgscipylibindexhtml
25 Summary 13
244 Pysimmusic
Pysimmusic is a private python library developed at the MTG It offers tools to
analyze the similarity of musical performances and uses libraries such as Essentia
LilyPond FFmpeg24 ia Pysimmusic contains onset detection algorithms and a
collection of audio descriptors and evaluation algorithms By now is the main eval-
uation software used in Music Critic to compare the recording submitted with the
reference
245 Music Critic
Music Critic is a project from the MTG intended to support technologies for online
music education facilitating the assessment of student performances25
The proposed workflow starts with a student submitting a recording playing the
proposed exercise Then the submission is sent to the Music Criticrsquos server where
is analyzed and assessed Finally the student receives the evaluation jointly with
the feedback from the server
25 Summary
Music information retrieval and machine learning have been popular fields of study
This has led to a large development of methods and algorithms that will be crucial
for this project Most of them are free and open-source and fortunately the private
ones have been shared by the UPF research team which is a great base to start the
development
24httpswwwffmpegorg25httpswwwupfeduwebmtgtech-transfer-asset_publisherpYHc0mUhUQ0G
contentid229860881maximizedYJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in section 132 having a well-annotated and balanced dataset is crucial to
get proper results In this section the 40kSamples Drums Dataset creation process is
explained first focusing on how to process existing datasets such as the mentioned
in 221 Secondly introducing the process of creating new datasets with a music
school corpus and a collection of recordings made in a recording studio Finally
describing the data augmentation procedure and how the audio samples are sliced
in individual drums events In Figure 1 we can see the different procedures to unify
the annotations of the different datasets while the audio does not need any specific
modification
31 Existing datasets
Each of the existing datasets has a different annotation format in this section the
process of unifying them will be explained as well as its implementation (see note-
book Dataset_formatUnificationipynb1) As the events to take into account
can be single instruments or combinations of them the annotations have to be for-
matted to show that events properly None of the annotations has this approach
so we have written a function that filters the list and joins the events with a small
1httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_formatUnificationipynb
14
31 Existing datasets 15
difference of time meaning that they are played simultaneously
Music school Studio REC IDMT Drums MDB Drums
audio + txt
Sibelius to MusicXML
MusicXML parser to txt
Write annotations
AnnotationsAudio
Figure 1 Datasets pre-processing
311 MDB Drums
This dataset was the first we worked with the annotation format in txt was a key
factor as it was easy to read and understand As the dataset is available in Github2
there is no need to download it neither process it from a local drive As shown in
the first cells of Dataset_formatUnificationipynb data from the repository can
be retrieved with a Python wrapper of the Github API3
This dataset has two annotations files depending on how deep the taxonomy used
is [20] In this case the generic class taxonomy is used as there is no need to
differentiate styles when playing a given instrument (ie single stroke flam drag
ghost note)
312 IDMT Drums
Differently to the previous dataset this one is only available downloading a zip
file4 It also differs in the annotation file format which is xml Using the Python
2httpsgithubcomCarlSouthallMDBDrums3httpspypiorgprojectgithubpy4httpswwwidmtfraunhoferdeenbusiness_unitsm2dsmtdrumshtml
16 Chapter 3 The 40kSamples Drums Dataset
package xmltodict5 in the second part of Dataset_formatUnificationipynb the
xml files are loaded as a Python dictionary and converted to txt format
32 Created datasets
In order to expand the dataset with more variety of samples other methods to get
data have been explored On one hand with audio data that has partial annotations
or some representation that is not data-driven such as a music sheet that contains
a visual representation of the music but not a logic annotation as mentioned in
the previous section On the other hand generating simple annotations is an easy
task so drums samples can be recorded standalone to create data in a controlled
environment In the next two sections these methods are described
321 Music school
A music school has shared its docent material with the MTG for research purposes
ie audio demos books in pdf format music sheet in Sibelius format As we can
see in Figure 1 the annotations from the music school corpus are in Sibelius format
this is an encrypted representation of the music sheet that can only be opened with
the Sibelius software The MTG has shared an AVID license which includes the
Sibelius software so we were able to convert the sib files to musicxml MusicXML
is not encrypted and allows to open it and read so a parser has been developed to
convert the MusicXML files to a symbolic representation of the music sheet This
representation has been inspired by [24] which proposes a system to represent chords
MusicXML parser
As mentioned in section 23 MusicXML format is based on ordering the visual
information with tags creating a tree structure of nested dictionaries In the first cell
of XML_parseripynb6 two functions are defined ConvertXML2Annotation reads
the musicxml file and gets the general information of the song (ie tempo time
5httpspypiorgprojectxmltodict6httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterXML_parseripynb
32 Created datasets 17
measure title) then a for loops throughout all the bars of the music sheet checking
whereas the given bar is self-defined the repetition of the previous one or the begin
or end of a repetition in the song (see Figure 2) in the self-defined bar case the bar
indeed is passed to an auxiliar function which parses it getting the aforementioned
symbolic representation
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three
others have a symbol that means rsquorepetition of the previous barrsquo moreover the
bar lines at the beginning and the end represents that these four bars have to be
repeated therefore this line in the music score represents an interpretation of eight
bars repeating the first one
The symbolic representation that we propose is based in [24] defines each bar with
a string this string contains the representations of the events in the bar separated
with blank spaces Each of the events has two dots () to separate the figure (ie
quarter note half note whole note) from the note or notes of the event which
are separated by a dot () For instance the symbolic representation of the first bar
in Figure 2 is F4A44 F4A44 F4A44 F4A44
In addition to this conversion in parse_one_measure function from XML_parser
notebook each measure is checked to ensure that fully represents the bar This
means that the sum of the figures of the bar has to be equal to the defined in the
time measure the sum of the events in a 44 bar has to be equal to four quarter
notes
Symbolic notation to unified annotation format
As we can see in Figure 1 once the music scores are converted to the symbolic
representation the last step is to unify the annotations with the used in sections 31
18 Chapter 3 The 40kSamples Drums Dataset
This process is made in the last cells of Dataset_formatUnification7 notebook
A dictionary with the translation of the notes to drums instrument is defined so
the note is already converted Differently the timestamp of each event has to be
computed based on the tempo of the song and the figure of each event this process
is made with the function get_time_steps_from_annotations8 which reads the
interpretation in symbolic notation and accumulates the duration of each event
based on the figure and the tempo
322 Studio recordings
At this point of the dataset creation we realized that the already existing data
was so unbalanced in terms of instances per class some classes had around two
thousand samples while others had only ten This situation was the reason to
record a personalized dataset to balance the overall distribution of classes as well
as exercises with different accuracy when reading simulating students with different
skill levels
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants
Barcelona) the first day was intended to mount the drumset and the microphones
which are listed in Table 2 in Figure 3 the microphone setup is shown differently
to the standard setup in which each instrument of the set has its microphone this
distribution of the microphones was intended to record the whole drumset with
different frequency responses
The recording process was divide into two phases first creating samples to balance
the dataset used to train the drums event classifier (called train set) Then recording
the studentsrsquo assignment simulation to test the whole system (called test set)
7httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_formatUnificationipynb
8httpsgithubcomMaciACtfg_DrumsAssessmentblobe81be958101be005cda805146d3287eec1a2d5a4scriptsdrumspyL9
9httpswwwstereodosiscom
32 Created datasets 19
Microphone Transducer principleBeyerdynamic TG D70 Dynamic
Shure PG52 DynamicShure SM57 Dynamic
Sennheiser e945 DynamicAKG C314 CondenserAKG C414 CondenserShure PG81 CondenserSamson C03 Condenser
Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes we decided to take into account only the classes
that appear in the music school subset this decision was motivated by the idea of
assessing the songs from the books so only classes of the collection of songs were
needed to train the classifier In Figure 4 the distribution of the selected classes
before the recordings is shown note that is in logarithmic scale so there is a large
difference among classes
20 Chapter 3 The 40kSamples Drums Dataset
Figure 4 Number of samples before Train set recording
To organize the recording process we designed 3 different routines to record depend-
ing on the class and the number of samples already existing a different routine was
recorded These routines were designed trying to represent the different speeds dy-
namics and interactions between instruments of a real interpretation In Appendix
A the routines scores are shown to write a generic routine a two lines stave is used
the bottom line represents the class to be recorded and the top line an auxiliary
one The auxiliary classes are cymbals concretely crashes and rides whose sound
remains a long period of time and its tail is mixed with the subsequent sound events
bull Routine 1 (Fig 31) This routine is intended for the classes that do not include
a crash or ride cymbal and has a small number of classes (ie lt500)
bull Routine 2 (Fig 32) This routine does not include auxiliary events as it is
intended for classes that include crash or ride cymbal whose interaction with
itself is intrinsic
bull Routine 3 (Fig 33) This is a short version of routine 1 which only repeats
each bar two times instead of four is intended for classes which not include a
crash or ride cymbal and has a large number of classes (ie gt500)
32 Created datasets 21
Routines 1 and 3 were recorded only one time as we had only one instrument of each
of the classes differently routine 2 was recorded two times for each cymbal as we
was able to use more instances of them different cymbals configurations used can
be seen in Appendix A in Figures 34 35 and 36
After the Train set recording the number of samples was a little more balanced as
shown in Figure 5 all the classes have at least 1500 samples
0
1000
2000
3000
ht+kd
kd+m
t
ht
mt
ft+sd
ft+kd
+sd
cr+sd
ft
cr+kd
cr
ft+kd
hh+k
d+sd
kd+s
d
cy+s
d
cy
cy+k
d sd
kd
hh+s
d
hh+k
d
hh
recorded before record
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song
in the same drumset to do that we recorded each song of the music school Drums
Grade Initial and Grade 1 playing it correctly and then making mistakes in both
reading and rhythmic ways After testing with these recordings we realized that we
were not able to test the limits of the assessment system in terms of tempo or with
different rhythmic measures So we proposed two exercises of groove reading in 44
and in 128 to be performed with different tempos these recordings have been done
in my study room with my laptoprsquos microphone
22 Chapter 3 The 40kSamples Drums Dataset
3.3 Data augmentation

As described in section 2.1.2, data augmentation aims to introduce changes to the
signals to optimize the statistical representation of the dataset. To implement this
task, the aforementioned Python library audiomentations is used.
The library audiomentations has a class called Compose which allows collecting
different processing functions, assigning a probability to each of them. Then the
Compose instance can be called several times with the same audio file, and each time
the resulting audio will be processed differently because of the probabilities. In
data_augmentation.ipynb10 a possible implementation is shown, as well as some
plots of the original sample with different results of applying the created Compose
to the same sample; an example of the results can be listened to in Freesound11.
The processing functions introduced in the Compose class are based on those proposed
in [13] and [14]; their parameters are described below:
• Add Gaussian noise, with 70% probability.
• Time stretch between 0.8 and 1.25, with 50% probability.
• Time shift forward a maximum of 25% of the duration, with 50% probability.
• Pitch shift ±2 semitones, with 50% probability.
• Apply MP3 compression, with 50% probability.
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213/
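A minimal sketch of such a Compose is shown below, assuming a recent audiomentations release (transform and parameter names vary slightly between versions, and the noise amplitudes are illustrative, not the notebook's exact values):

import numpy as np
from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                             PitchShift, Shift, TimeStretch)

# Probabilities follow the list above; noise and shift ranges are assumptions.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.7),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    Shift(min_fraction=0.0, max_fraction=0.25, p=0.5),  # forward shift only
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    Mp3Compression(p=0.5),
])

samples = np.random.uniform(-1.0, 1.0, 44100).astype(np.float32)  # stand-in audio
augmented = augment(samples=samples, sample_rate=44100)  # differs on each call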
3.4 Drums events trim

As will be explained in section 4.2.1, the dataset has to be trimmed into individual
files in order to analyze them and extract the low-level descriptors. In the
Dataset_featureExtraction.ipynb12 notebook this process has been implemented, slicing all the
audios with their annotations, each dataset separately, to sight-check all the resultant
samples and better detect which annotations were not correct.
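A sketch of how this slicing might look, assuming annotations stored as (onset in seconds, label) pairs in a text file and a fixed event duration; the notebook's actual file layout may differ:

import os
import soundfile as sf

def slice_events(audio_path, annotation_path, out_dir, event_dur=0.25):
    # Read the full take and its (onset, label) annotations, then write
    # one short audio file per annotated drum event, named by its class.
    audio, sr = sf.read(audio_path)
    os.makedirs(out_dir, exist_ok=True)
    with open(annotation_path) as f:
        for i, line in enumerate(f):
            onset_s, label = line.split()
            start = int(float(onset_s) * sr)
            end = start + int(event_dur * sr)
            sf.write(os.path.join(out_dir, f"{label}_{i}.wav"),
                     audio[start:end], sr)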
3.5 Summary

To summarize, a drums samples dataset has been created; the one used in this
project will be called the 40k Samples Drums Dataset. Nonetheless, to share this
dataset we have to ensure that we fully own the data, which means that the samples
that come from the IDMT, MDB Drums and Music School datasets cannot be shared
in another dataset. Alternatively, we will share the 29k Samples Drums Dataset,
formed only by the samples recorded in the studio. This dataset will be available in
Zenodo13, allowing the whole dataset to be downloaded at once, and in Freesound
some selected samples are uploaded in a pack14 to show the differences among
microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592
14 https://freesound.org/people/MaciaAC/packs/32397/
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment
pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is
inspired by [2]. Each box of the diagram refers to a section in this chapter, so the
diagram might be helpful to get a general idea of the problem when explaining each
process.
The system is divided into two main processes. First, the top boxes correspond to
the training process of the model, using the dataset created in the previous chapter.
Secondly, the bottom row shows how a student submission is processed to generate
some feedback. This feedback is the output of the system and should give some
indications to the student on how they have performed and how they can improve.
4.1 Problem definition

To check if a student reads a music sheet correctly, we need some tool to tag which
instruments of the drumset are playing for each detected event. This leads us to
develop and train a drums event classifier: if this tool ensures a good accuracy
when classifying (i.e. >95%), we will be able to properly assess a student's recording.
If the classifier does not have enough accuracy, the system will not be useful, as we will
not be able to differentiate between errors from the student and errors from the classifier.
[Diagram: the dataset (music scores, students' performances, annotations, audio recordings, assessments) feeds feature extraction and the drums event classifier training, together with performance assessment training; a new student's recording goes through feature extraction and performance assessment inference to produce a visualization and performance feedback]
Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason, the project has been mainly focused on developing the aforemen-
tioned drums event classifier and a proper dataset. Thus, developing a properly
assessed dataset of drums interpretations has not been possible, nor has the performance
assessment training. Despite this, the feedback visualization has been developed, as
it is a nice way to close the pipeline and get some understandable results; moreover,
the performance feedback could be focused on deterministic aspects, such as telling the
student if they are rushing or slowing in relation to a given tempo.
4.2 Drums event classifier

As already mentioned, this section has been the main load of work for this project,
because a reliable assessment depends on a correct automatic transcription. The
process has been divided into 3 main parts: extracting the musical features, training
and validating the model in an iterative process, and finally testing the model with
totally new data.
4.2.1 Feature extraction

The feature extraction concept has been explained in Section 2.1.1, and has been
implemented using the MusicExtractor()1 method from the Essentia library.
MusicExtractor() has to be called passing as parameters the window and
hop sizes that will be used to perform the analysis, as well as the filename of the event
to be analyzed. The function extract_MusicalFeatures()2 has been implemented
to loop over a list of files and analyze each of them, adding the extracted features to a
csv file jointly with the class of each drum event. At this point all the low-level
features were extracted; both mean and standard deviation were computed across
all the frames of the given audio file. The reason was that we wanted to check
which features were redundant or meaningful when training the classifier.
As mentioned in section 3.4, the fact that the MusicExtractor() method has to be
called with a filename, not an audio stream, forced us to create another version of
the dataset, which had each event annotated in a different audio file with the corre-
spondent class label as filename. Once all the datasets were properly sliced and
sight-checked, the last cell of the notebook was executed with the correspondent
folder names (which contain all the sliced samples) and the features saved in differ-
ent csv files, one for each dataset3. Adding the number of instances in all the csv files,
we get 40228 instances with 84 features and 1 label.
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices_features
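A minimal sketch of such an extraction loop with Essentia's Python bindings; the frame and hop sizes and the CSV layout are assumptions, not the exact values used in extract_MusicalFeatures():

import csv
import essentia.standard as es

def extract_features_to_csv(file_label_pairs, csv_path,
                            frame_size=2048, hop_size=1024):
    # MusicExtractor analyzes a file on disk and returns a pool of descriptors;
    # lowlevelStats keeps the frame-wise mean and stdev of each feature.
    extractor = es.MusicExtractor(lowlevelFrameSize=frame_size,
                                  lowlevelHopSize=hop_size,
                                  lowlevelStats=['mean', 'stdev'])
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        for filename, label in file_label_pairs:
            pool, _ = extractor(filename)
            # Keep scalar low-level descriptors only, one csv row per event.
            row = [pool[name] for name in pool.descriptorNames()
                   if name.startswith('lowlevel') and isinstance(pool[name], float)]
            writer.writerow(row + [label])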
4.2.2 Training and validating

As mentioned in section 2.2, some authors have proposed machine learning algo-
rithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN)
to do sound event classification; also, some authors have developed more complex
methods for drums event classification. The complexity of these last methods made
me choose the generic ones, also to see whether they were a good way to approach the
problem, as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has
been the main reference when designing the 40k Samples Drums Dataset. The first
times we tried the models we were working with the class distribution of Figure
4; as commented, this was a very unbalanced dataset, and we were evaluating the
classification inference with the accuracy formula (4.1), which does not take into account
the unbalance in the dataset. The accuracy computation was around 92%, but the
correct predictions were mainly on the large classes; as shown in Figure 7, some
classes had very low accuracy (even 0%, as some classes have 10 samples, 7 used to
train and 3 to validate, all of them badly predicted), but having a small number of
instances affects the accuracy computation less.
\text{accuracy}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} 1(\hat{y}_i = y_i) \qquad (4.1)
Instead, the proper way to compute the accuracy on this kind of dataset is the
balanced accuracy: it computes the accuracy for each class and then averages the
accuracy over all the classes, as in formula (4.2), where w_i represents the weight
of each class in the dataset. This computation lowered the result to 79%, which was
not a good result.
\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \qquad (4.2)

\text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i
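Both metrics are available in scikit-learn; a toy example with an unbalanced label set illustrates how plain accuracy can mask a per-class failure:

from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 8 samples of a majority class and 2 of a minority class; the classifier
# gets the majority class right and the minority class always wrong.
y_true = ['hh', 'hh', 'hh', 'hh', 'hh', 'hh', 'hh', 'hh', 'sd', 'sd']
y_pred = ['hh', 'hh', 'hh', 'hh', 'hh', 'hh', 'hh', 'hh', 'hh', 'hh']

print(accuracy_score(y_true, y_pred))           # 0.8
print(balanced_accuracy_score(y_true, y_pred))  # 0.5, the mean of 1.0 and 0.0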
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the F-score, which
combines the precision and the recall of the model in one measure, as in formula
(4.3). Precision is computed as the number of correct predictions divided by the total
number of predictions, and recall is the number of correct predictions divided by the
total number of instances that should be predicted as a given class.

F\text{-measure} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (4.3)
These results led us to the process of recording a personalized dataset to
extend the already existing ones (see section 3.2.2). With this new distribution the
results improved, as shown in Figure 8, as did the balanced accuracy and F-
score (both 89%). Until this point we were using both KNN and SVM models to
compare results, and the SVM always performed at least 10% better, so we decided
to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine refers to the regularization; this
technique is intended to make a model less sensitive to the data noise and the
outliers that may not represent the class properly. When increasing this value to
10, the results improved among all the classes, as shown in Figure 9, as did the
accuracy and F-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare
drum class was somehow a problem, as it is one of the most used instruments in the
drumset, jointly with the hi-hat and the kick drum. So I tried the same process
with the classes that include only the three mentioned instruments (i.e. hh, kd, sd,
hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved
the overall accuracy and F-score to 97.7%, and concretely the sd accuracy to 96%, as
shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the training and validating iterative process has been de-
veloped in the Classifier_training.ipynb4 notebook. First, the csv files with the
features extracted in Dataset_featureExtraction.ipynb are loaded; then, depend-
ing on which subset of classes will be used, the correspondent instances are filtered,
and to remove redundant features the ones with a very low standard deviation are
deleted (i.e. std_dev < 0.00001). As the SVM works better when data is normalized,
the standard scaler is used to move all the data distributions around 0, ensuring
a standard deviation of 1.
In the next cells the dataset is split into train and validation sets, and the training
method from the SVM of sklearn is called to perform the training; when the models
are trained, the parameters are dumped in a file to load the model a posteriori and
be able to apply the knowledge learned to new data. This process was very slow on
my computer, so we decided to upload the csv files to Google Drive and open the
notebook with Google Colaboratory, as it was faster; this is a key feature to avoid
long waiting times during the iterative train-validate process. In the last cells the
inference is made with the validation set and the accuracy is computed, as well as
the confusion matrix plotted, to get an idea of which classes are performing better.
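A condensed sketch of that train-validate loop with scikit-learn; the file names, split ratio and feature-filtering details are assumptions based on the description above:

import joblib
import pandas as pd
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv('features.csv')   # hypothetical merged csv: 84 features + 'label'
X, y = df.drop(columns='label'), df['label']
X = X.loc[:, X.std() > 1e-5]       # drop near-constant (redundant) features

X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.3)
scaler = StandardScaler().fit(X_train)  # zero mean, unit variance

clf = SVC(C=10).fit(scaler.transform(X_train), y_train)
y_pred = clf.predict(scaler.transform(X_val))

print(balanced_accuracy_score(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
joblib.dump((scaler, clf), 'svm_drums.joblib')  # reload later for inference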
4.2.3 Testing

Testing the model introduces the concept of onset detection: until now all the slices
have been created using the annotations, but to assess a new submission from
a student we need to detect the onsets and then slice the events. The function
SliceDrums_BeatDetection5 does both tasks. As explained in section 2.1.1, there
are many methods to do onset detection, and each of them is better suited to a different
application. In the case of drums we have tested the 'complex' method, which finds
changes in the frequency domain in terms of energy and phase, and works pretty
well; but when the tempo increases, some onsets are not correctly detected, so we
finally implemented the onset detection with the HFC method.

4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
This method computes for each window the HFC as in equation (4.4); note
that high-frequency bins (index k) weigh more in the final value of the HFC:

\text{HFC}(n) = \sum_k |X_k[n]|^2 \cdot k \qquad (4.4)
Moreover, the function plots the audio waveform jointly with the onsets detected, to
check if it has worked correctly after each test. In Figures 11 and 12 we can see two
examples of the same music sheet played at 60 and 220 bpm; in both cases all the
onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
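For reference, HFC-based onset detection can be written following Essentia's standard recipe; this is the library's documented pattern, not necessarily the exact code of SliceDrums_BeatDetection():

import essentia
import essentia.standard as es

def detect_onsets_hfc(audio_path, frame_size=1024, hop_size=512):
    # Compute an HFC onset detection function frame by frame,
    # then peak-pick it with the Onsets algorithm.
    audio = es.MonoLoader(filename=audio_path)()
    window = es.Windowing(type='hann')
    fft = es.FFT()
    c2p = es.CartesianToPolar()
    odf = es.OnsetDetection(method='hfc')
    pool = essentia.Pool()
    for frame in es.FrameGenerator(audio, frameSize=frame_size, hopSize=hop_size):
        magnitude, phase = c2p(fft(window(frame)))
        pool.add('odf.hfc', odf(magnitude, phase))
    # Onsets expects a matrix of detection functions plus one weight per function.
    return es.Onsets()(essentia.array([pool['odf.hfc']]), [1])  # times in seconds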
With the onsets information the audio can be trimmed into the different events; the
order is maintained in the name of each file, so when comparing with the expected
events they can be mapped easily. The audios are passed to the extract_MusicalFeatures()
function, which saves the musical features of each slice in a csv.
To predict which event each slice is, the models already trained are loaded in this new
environment, and the data is pre-processed using the same pipeline as when training.
After that, the data is passed to the classifier method predict(), which returns the
predicted event for each row in the data. The described process is implemented in the first
part of Assessment.ipynb6; the second part is intended to execute the visualization
functions described in the next section.
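Continuing the training sketch above, inference on a new submission might look as follows (the joblib file name is the assumption made earlier):

import joblib
import pandas as pd

# Reuse the scaler fitted at training time so new data follows the same pipeline.
scaler, clf = joblib.load('svm_drums.joblib')

slices = pd.read_csv('submission_features.csv')   # one row per sliced event
predicted_events = clf.predict(scaler.transform(slices))
# predicted_events[i] is the class label (e.g. 'hh+sd') of the i-th onset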
4.3 Music performance assessment

Finally, as already commented, the assessment part has been focused on giving visual
feedback of the interpretation to the student. As the drums classifier has taken so
much time, the creation of a dataset with interpretations and their grades has not been
feasible. A first approximation was to record different interpretations of the same
music sheet simulating different levels of skill, but grading them and doing all the
process by ourselves was not easy; apart from that, we tended to play the fragments
either well or badly, and it was difficult to simulate intermediate levels and be consistent
with the proposed ones.
So the implemented solution generates an image that shows the student whether the
notes of the music sheet are correctly read and whether the onsets are aligned with the
expected ones.
4.3.1 Visualization

With the data gathered in the testing section, feedback of the interpretation has
to be returned. Having as a base implementation the solution of my companion
Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization
is done in the last cell of the notebook Assessment.ipynb.
First, the LilyPond file paths are defined. Then, for each of the submissions, the
audio is loaded to generate the waveform plot.
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and
expected onsets, the waveform, and the start and end of the waveform (this comes
from the LilyPond file's macro). To properly plot the deviations, in the code we are
assuming that the interpretation starts four beats after the beginning of the audio.
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is
shown. The black lines at the bottom of the waveform are the detected onsets, while
the cyan lines in the middle are the expected onsets; when the difference between
the two values increases, the area between them is colored with a traffic-light code
(green good, to red bad).
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
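A simplified sketch of such a deviation plot with matplotlib; save_bar_plot() takes more parameters, so this only illustrates the traffic-light idea, with an assumed 100 ms worst-case deviation:

import matplotlib.pyplot as plt
import numpy as np

def plot_onset_deviations(waveform, sr, detected, expected, out_path, worst=0.1):
    # Draw the waveform, mark expected (cyan) and detected (black) onsets,
    # and shade the gap between each pair on a green-to-red scale.
    t = np.arange(len(waveform)) / sr
    fig, ax = plt.subplots(figsize=(12, 2))
    ax.plot(t, waveform, color='0.7')
    for det, exp in zip(detected, expected):
        ax.vlines(exp, -0.2, 0.2, color='cyan')
        ax.vlines(det, -1.0, -0.6, color='black')
        badness = min(abs(det - exp) / worst, 1.0)   # 0 = on time, 1 = worst
        ax.axvspan(min(det, exp), max(det, exp),
                   color=(badness, 1.0 - badness, 0.0), alpha=0.4)
    fig.savefig(out_path, bbox_inches='tight')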
Once the waveform is created, it is embedded in a lambda function that is called from
the LilyPond render. But before calling LilyPond to render, the assessment of
the notes has to be done. In the function assess_notes()10 the expected and predicted
events are compared: a list of booleans is created, with 0 at the False indices and
1 at the True ones; then the resulting list is iterated and the 0 indices are checked,
because most of the classification errors fail in one of the instruments to be predicted
(i.e. instead of hh+sd the model predicts sd). These cases are considered partially
correct, as the system has to take its own errors into account: at the indices in which
one of the instruments is correctly predicted and it is not a hi-hat (we are considering
it more important to get the snare and kick reading right than a hi-hat, which is
present in all the events), the value is turned to 0.75 (light green in the color
scale). In Figure 15 the different feedback options are shown: green notes mean
correct, light green means partially correct, and red means incorrect.

9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
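A hedged sketch of that scoring rule; the real assess_notes() may differ in details such as how combined labels are encoded (here they are '+'-joined strings):

def assess_notes(expected, predicted):
    # 1.0 = exact match, 0.75 = partial (a shared non-hi-hat instrument),
    # 0.0 = incorrect; one score per event, later mapped to note colors.
    scores = []
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            scores.append(1.0)
            continue
        shared = set(exp.split('+')) & set(pred.split('+'))
        scores.append(0.75 if shared - {'hh'} else 0.0)
    return scores

# e.g. expected hh+sd but predicted sd is partial; hh+kd vs sd is incorrect
assert assess_notes(['hh+sd', 'hh+kd'], ['sd', 'sd']) == [0.75, 0.0]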
With the waveform, the notes assessed and the LilyPond template, the function
score_image()11 can be called. This function renders the LilyPond template jointly
with the waveform previously created; this is done with the LilyPond macros.
On one hand, before each note on the staff, the keyword color() size()
determines that the color and size of the note depend on an external variable (the
notes assessed); on the other hand, after the first note of the staff, the keyword
eps(1150 16) indicates on which beat the waveform starts to be displayed and
on which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars); the other
number is the scale of the waveform and allows fitting the plot better to the staff.
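The render itself amounts to invoking the LilyPond binary on the filled-in template; a minimal sketch (file names are hypothetical):

import subprocess

def render_score(ly_path, out_basename):
    # Engrave the annotated LilyPond template as a PNG image.
    subprocess.run(['lilypond', '--png', '-o', out_basename, ly_path], check=True)

render_score('assessed_exercise.ly', 'feedback')   # writes feedback.png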
4.3.2 Files used

The assessment process of an exercise needs several files. First, the annotations of the
expected events and their timesteps; these are found in the txt file already mentioned
in section 3.1.1. Then the LilyPond file; this one is the template written in the
LilyPond language that defines the resultant music sheet, where the macros to change
color and size and to add the waveform are defined. When extracting the musical
features, each submission creates its csv file to store the information. And finally,
we need, of course, the audio files with the recorded submission to be assessed.
11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can
evaluate the results to check if the system works correctly and is useful for a student to
learn, and also to test its limits regarding audio signal quality and tempo.
The tests have been done with two different exercises, recorded with a computer
microphone and played at different tempos, starting at 60 bpm and adding 40
bpm until 220 bpm. The recordings with good tempo and good reading have been
processed adding 6 dB until an accumulated gain of +30 dB.
In this chapter and Appendix B all the resultant feedback visualizations are shown.
The audio files can be listened to in Freesound, where a pack1 has been created. Some of
them will be commented on and referenced in further sections; the rest are extra
results.
As the high frequency content method works perfectly, there are no limitations nor
errors in terms of onset detection: all the tests have an f-measure of 1, detecting all
the expected events without any false positive.
1 https://freesound.org/people/MaciaAC/packs/32350/
5.1 Tempo limitations

One of the limitations of the system is the tempo of the exercise: the accuracy drops
when the tempo increases. Having as a reference the figures that show a good reading,
in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20,
21 and 22), we can count how many are correct or partially correct to score
each case: a correct prediction weighs 1.0, a partially correct one 0.5 and an
incorrect one 0; the total value is the mean of the weighted predictions. For instance,
at 60 bpm in Table 3 this gives (25 · 1.0 + 7 · 0.5 + 0 · 0) / 32 = 28.5 / 32 ≈ 0.89.
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm

In Table 3 we can see that by increasing the tempo of exercise 1 the accuracy of the
classifier decreases. This may be because increasing the tempo decreases the spacing
between events, and consequently the duration of each event, which leads to fewer
values for calculating the mean and standard deviation when extracting the timbre
characteristics. As stated in the law of large numbers [25], the larger the sample,
the closer its mean is to the population mean; in this case, having fewer values
in the calculation creates more outliers in the distribution, which tends to scatter.
Tempo  Correct  Partially OK  Incorrect  Total
60     25       7             0          0.89
100    24       8             0          0.875
140    24       7             1          0.86
180    15       9             8          0.61
220    12       7             13         0.48
Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster
than 100 bpm. But in 12/8 at 100 bpm the subdivision rate is 300 pulses per minute,
similar to 140 bpm in 4/4, whose subdivision rate is 280 pulses per minute. The
results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events,
which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo  Correct  Partially OK  Incorrect  Total
60     39       8             1          0.89
100    37       10            1          0.875
Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations

Another limitation of the system is the saturation of the submitted signal. Listening to
the submissions, the hi-hat events are recorded with less amplitude than the snare
and kick events; for this reason we think that the classifier starts to fail at +18 dB.
As can be seen in Tables 5 and 6, the same counting scheme as in the previous section
is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate,
and at this gain level the overall waveform is so clipped that it leads to a high-frequency
content that is predicted as a hi-hat in all the cases.
Level   Correct  Partially OK  Incorrect  Total
+0dB    25       7             0          0.89
+6dB    23       9             0          0.86
+12dB   23       9             0          0.86
+18dB   24       7             1          0.86
+24dB   18       5             9          0.64
+30dB   13       5             14         0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level   Correct  Partially OK  Incorrect  Total
+0dB    12       7             13         0.48
+6dB    13       10            9          0.56
+12dB   10       8             14         0.50
+18dB   9        2             21         0.31
+24dB   8        0             24         0.25
+30dB   9        0             23         0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
5.3 Evaluation of the assessment

Until now the evaluation of results has been focused on the drums event classifier
accuracy, but we think it is also important to evaluate whether the system can properly
assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of
the beats are not read, the system can still map the rest of the events to the expected
ones at the correspondent onset time steps. This is due to a check done in the assessment,
which assumes that before the first beat there is a count-in of one bar
and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections, counting the
number of correct predictions, but now in terms of assessment. The analyzed results
will be the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized; it works as follows: we count a
correct assessment if the note is green or light green and the event is the one in the
music score, or if the note is red and the event is not the one in the music score.
The rest of the cases are counted as incorrect assessments. The total value is
the number of correct assessments over the total number of events.
Tempo  Correct assessment  Incorrect assessment  Total
60     32                  0                     1
100    32                  0                     1
140    32                  0                     1
180    25                  7                     0.78
220    22                  10                    0.68
Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo  Correct assessment  Incorrect assessment  Total
60     47                  1                     0.98
100    45                  3                     0.9
Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs
the assessment based on the predictions pretty well. This can be helpful for a student
to know which parts of the music sheet are well read and which are not. Also, the
tempo visualization can help the student to recognize if they are slowing down or rushing
when reading the score; as can be seen in Figure 30, the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed. In this chapter a discussion is developed about which objectives have been
accomplished and which have not. Also, a set of further improvements is given, as well
as a final thought on my work and my apprenticeship. The chapter ends with an analysis
of how reusable and reproducible my work is.
6.1 Discussion of results

Having in mind all the concepts explained throughout this document, we can now list
them, stating their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo. Apart from being used in this project,
this dataset might be useful to other researchers and students in their projects.
The dataset is indeed useful for balancing drums datasets based on real
interpretations, as the class distribution of these interpretations is very unbalanced,
as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset. One of the reasons for using
this approach to predict the events was that there was no literature focused on
classifying drums events in this manner. As the results have shown, more complex
methods based on the context might be used, such as the ones proposed in [16] and [17].
It is important to take into account that the task the model is trained to do
is very hard for a human: being able to differentiate drums events in an individual
drum sample, without any context, is almost impossible even for a trained ear, such as
my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as
the development of a MusicXML parser. This part took around one month to
develop and, from my point of view, it was a great way to understand how these
file formats work and how they can be improved, as they are mostly focused on the
visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate
the functionality of the system, and tests of these exercises have been recorded
in a different environment than the 29k Samples Drums Dataset. It would be good
to get recordings in different spaces, with different drumsets and microphones,
to test the system more exhaustively.
6.2 Further work

In terms of the dataset created, it could be larger. It could be expanded with
different drumsets, tuning each drumset differently, using different sticks to hit the
instruments, and even different people playing. This could introduce more variance
into the drums sample dataset. Moreover, on June 9th 2021 a paper about a large
drums dataset with MIDI data was presented [26] at ICASSP 20211. This new
dataset could be included in the training process, as the authors state that having a
large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the
overall system robustness. It would be appropriate to introduce the aforementioned
methods from [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org/
Also, in terms of classes in the drumset, there is a long path to cover: there are no
solutions that robustly transcribe a whole drumset, including the toms and the
different kinds of cymbals. In this sense, we think that a proper approach would
be to work with professional musicians, who can help researchers better understand
the instrument and create datasets with different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo
deviations and the reading accuracy, a regression model could be trained with assessed
drums exercises to give a mark to each student. On this path, introducing an
electronic drumset with MIDI output would make things a lot easier, as the drums
classifier step would be omitted.
About the implementation, a good contribution would be to introduce the models
and algorithms into the Pysimmusic workflow and develop a demo web app like
Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility

In computational sciences a work is reproducible if code and data are available and
other researchers or students can execute them, obtaining the same results.
All the code has been developed in Python, a widely known general-purpose pro-
gramming language. It is available in my GitHub repository2, as well as the data
used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository3,
and some samples in Freesound4. This is the 29k Drums Samples Dataset; as not all
of the 40k samples used to train are our property, we are not able to share them
under our full authorship. Despite this, the other datasets used in this project are
available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588
4 https://freesound.org/people/MaciaAC/packs/32397/
6.4 Conclusions

This project has been developed over one year. At this point, with the work de-
scribed, the goal of supporting drums learning has been accomplished, although work
remains in terms of robustness and reliability; a first approximation has been
presented, as well as several paths of improvement proposed.
Moreover, some fields of engineering and computer science have been covered, such
as signal processing, music information retrieval and machine learning; not only
in terms of implementation, but also investigating methods and gathering already
existing experiments and results.
About my relationship with computers, I have improved my fluency with git and
its web version, GitHub. Also, at the beginning of the project I wanted to execute
everything on my local computer, having to install and compile libraries that could
not be installed on macOS via the pip command (i.e. Essentia), which has been
a tough path to take. In a more advanced phase of the project
I realized that the LilyPond tools could not be installed and used fluently on
my local machine, so I moved all the code to my Google Drive to execute the
notebook on a Colaboratory machine. Developing code in this environment also has
its quirks, which I have had to learn. In summary, I have spent a lot of time
looking for the ideal way to develop the project, and the process has indeed been
fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my
Bachelor's degree, as I reviewed some of the concepts of most personal interest.
Being able to relate the project to music and drums helped me to keep
my motivation and focus. I am quite satisfied with the feedback visualization that
the system produces, and I hope that more people get interested in this field of
research to build better tools in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system, inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2001).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings - 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
Figure 31 Recording routine 1
Figure 32 Recording routine 2
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
Chapter 1
Introduction
The field of sound event classification and concretely drums events classification has
improved the results during the last years and allows us to use this kind of technolo-
gies for a more concrete application like education support [1] The development of
automated assessment tools for the support of musical instrument learning has been
a field of study in the Music Technology Group (MTG) [2] concretely on guitar
performances implemented in the Pysimmusic project [3] One of the open paths
that proposes Eremenko et al [2] is to implement it with different instruments
and this is what I have done
11 Motivation
The aim of the project is to implement a tool to evaluate musical performances
specifically reading scores with drums One possible real application may be to sup-
port the evaluation in a music school allowing the teacher to focus on other items
such as attitude posture and more subtle nuances of performance If high accuracy
is achieved in the automatic assessment of tempo and reading a fair assessment of
these aspects can be ensured In addition a collaboration between a music school
and the MTG allows the use of a specific corpus of data from the educational insti-
tution program this corpus is formed by a set of music sheets and the recordings of
the performances
1
2 Chapter 1 Introduction
Besides this I have been studying drums for fourteen years and a personal motivation
emerges from this fact Learning an instrument is a process that does not only rely
on going to class there is an important load of individual practice apart of the class
indeed Having a tool to assess yourself when practicing would be a nice way to
check your own progress
12 Existing solutions
In terms of music interpretation assessment there are already some software tools
that support assessing several instruments Applications such as Yousician1 or
SmartMusic2 offer from the most basic notions of playing an instrument to a syllabus
of themes to be played These applications return to the students an evaluation that
tells which notes are correctly played and which are not but do not give information
about tempo consistency or dynamics and even less generates a rubric as a teacher
could develop during a class
There are specific applications that support drums learning but in those the feature
of automatic assessment disappears There are some options to get online drums
lessons such as Drumeo3 or Drum School4 but they only offer a list of videos impart-
ing lessons on improving stylistic vocabulary feel improvisation or technique These
applications also offer personal feedback from professional drummers and a person-
alized studying plan but the specific feature of automatic performance assessment
is not implemented
As mentioned in the Introduction automatic music assessment has been a field
of research at the MTG With the development of Music Critic5 an assessment
workflow is proposed and implemented This is useful as can be adapted to the
drums assessment task
1httpsyousiciancom2httpswwwsmartmusiccom3httpswwwdrumeocom4httpsdrumschoolappcom5httpsmusiccriticupfedu
13 Identified challenges 3
13 Identified challenges
As mentioned in [2] there are still improvements to do in the field of music as-
sessment especially analyzing expressivity and with advanced level performances
Taking into account the scope of this project and having as a base case the guitar
assessment exercise from Music Critic some specific challenges are described below
131 Guitar vs drums
As defined in [4] a drumset is a collection of percussion instruments Mainly cym-
bals and drums even though some genres may need to add cowbells tambourines
pailas or other instruments with specific timbres Moreover percussion instruments
are splitted in two families membranophones and idiophones membranophones
produces sound primarily by hitting a stretched membrane tuning the membrane
tension different pitches can be obtained [5] differently idiophones produces sound
by the vibration of the instrument itself and the pitch and timbre are defined by
its own construction [5] The vibration aforementioned is produced by hitting the
instruments generally this hit is made with a wood stick but some genres may need
to use brushes hotrods or mallets to excite specific modes of the instruments With
all this in mind and as stated in [1] it is clear that transcribing drums having to
take into account all its variants and nuances is a hard task even for a professional
drummer With this said there is a need to simplify the problem and to limit the
instruments of the drumset to be transcribed
Returning to the assessment task guitars play notes and chords tuned so the way
to check if a music sheet has been read correctly is looking for the pitch information
and comparing it to the expected one Differently instruments that form a drumset
are mainly unpitched (except toms which are tuned using different scales and tuning
paradigms) so the differences among drums events are on the timbre A different
approach has to be defined in order to check which instrument is being played for
each detected event the first idea is to apply machine learning for sound event
classification
4 Chapter 1 Introduction
Along the project we will refer to the different instruments that conform a drumkit
with abbreviations In Table 1 the legend used is shown the combination of 2 or
more instruments is represented with a rsquo+rsquo symbol between the different tags
Instrument Kick Drum Snare Drum Floor tom Mid tom High tomAbbreviation kd sd ft mt ht
Instrument Hi-hat Ride cymbal Crash cymbalAbbreviation hh cy cr
Table 1 Abbreviationsrsquo legend
132 Dataset creation
Keeping in mind the last idea of the previous section if a machine learning approach
has to be implemented there is a basic need to obtain audio data of drums Apart
from the audio data proper annotations of the drums interpretations are needed in
order to slice them correctly and extract musical features of the different events
The process of gathering data should take into account the different possibilities that
offers a drumset in terms of timbre loudness and tone Several datasets should be
combined as well as additional recordings with different drumsets in order to have
a balanced and representative dataset Moreover to evaluate the assessment task
a set of exercises has to be recorded with different levels of skill
There is also the need to capture those sounds with several frequency responses in
order to make the model independent of the microphone Also those samples could
be processed to get variations of each of them with data augmentation processes
133 Signal quality
Regarding the assignment we have to take into account that a student will not be
able to record its interpretations with a setup as the used in a studio recording most
of the time the recordings will be done using the laptop or mobile phone microphone
This fact has to be taken into account when training the event classifier in order
to do data augmentation and introduce these transformations to the dataset eg
introducing noise to the samples or amplifying to get overload distortion
14 Objectives 5
14 Objectives
The main objective of this project is to develop a tool to assess drums interpretations
of a proposed music sheet This objective has to be split into the different steps of
the pipeline
bull Generate a correctly annotated drums dataset which means a collection of
audio drums recordings and its annotations all equally formatted
bull Implement a drums event sound classifier
bull Find a way to properly visualize drums sheets and their assessment
bull Propose a list of exercises to evaluate the technology
In addition having the code published in a public Github6 repository and uploading
the created dataset to Freesound7 and Zenodo8 will be a good way to share this work
15 Project overview
The next chapters will be developed as follows In chapter 2 the state of the art is
reviewed Focusing on signal processing algorithms and ways to implement sound
event classification ending with music sheet technologies and software tools available
nowadays In chapter 3 the creation of a drums dataset is described Presenting the
use of already available datasets and how new data has been recorded and annotated
In chapter 4 the methodology of the project is detailed which are the algorithms
used for training the classifier as well as how new submissions are processed to assess
them In chapter 5 an evaluation of the results is done pointing out the limitations
and the achievements Chapter 6 concludes with a discussion on the methods used
the work done and further work
6httpsgithubcom7httpsfreesoundorg8httpszenodoorg
Chapter 2
State of the art
In this chapter the concepts and technologies used in the project are explained
covering algorithm references and existing implementations First signal process-
ing techniques on onset detection and feature extraction are reviewed then sound
event classification field is presented and its relationship with drums event classifica-
tion Also the principal music sheet technologies and codecs are presented Finally
specific software tools are listed
21 Signal processing
211 Feature extraction
In the following sections sound event classification will be explained most of these
methods are based on training models using features extracted from the audio not
with the audio chunks indeed [6] In this section signal processing methods to get
those features are presented
Onset detection
In an audio signal an onset is the beginning of a new event it can be either a
single note a chord or in the case of the drums the sound produced by hitting one
or more instruments of the drumset It is necessary to have a reliable algorithm
6
21 Signal processing 7
that properly detects all the onsets of a drums interpretation With the onsets
information (a list of timestamps) the audio can be sliced to analyze each chunk
separately and to assess the tempo consistency
It is important to address the challenge in a psychoacoustical way as the objective
is to detect the musical events as a human will do In [7] the idea of perceptual
onset for percussive instruments is defined as a time interval between the physical
onset and the moment that the maximum level is reached In [8] many methods are
reviewed focusing on the differences of performance depending on the signal Non
Pitched Percussive instruments are better detected with temporal methods or high-
frequency content methods while Pitched Non Percussive instruments may need to
take into account changes of energy in the spectrum distribution as the onset may
represent a different note
The sound generated by the drums is mainly percussive (discarding brushesrsquo slow
patterns or malletrsquos build-ups on the cymbals) which means that is formed by a
short transient followed by a short decay there is no sustain As the transient is a
fast change of energy it implies a high-frequency content because changes happen
in a very little frame of time As recommended in [9] HFC method will be used
Timbre features
As described in [10] a feature denotes in some way a quantity or a value Features
extracted by processing the audio stream or transformations of that (ie FFT)
are called low-level descriptors these features have no relevant information from a
human point of view but are useful for computational processes [11]
Some low-level descriptors are computed from the temporal information for in-
stance the zero-crossing rate tells the number of times the signal crosses the zero
axis per second the attack time is the duration of the transient and temporal cen-
troid the energy distribution of an event during the time Other well known features
are the root median square of the signal or the high-frequency content mentioned
in section 211
8 Chapter 2 State of the art
Besides temporal features low-level descriptors can also be computed from the fre-
quency domain Some of them are spectral flatness spectral roll-off spectral slope
spectral flux ia
Nowadays Essentialsquos library offers a collection of algorithms that reliably extracts
the low-level descriptors aforementioned the function that englobes all the extrac-
tors is called Music extractor1
212 Data augmentation
Data augmentation processes refer to the optimization of the statistical representa-
tion of the datasets in terms of improving the generalization of the resultant models
These methods are based on the introduction of unobserved data or latent variables
that may not be captured during the dataset creation [12]
Regarding this technique applied to audio data signal processing algorithms are
proposed in [13] and [14] that introduces changes to the signals in both time and
frequency domains In these articles the goal is to improve accuracy on speech and
animal sound recognition although this could apply to drums event classification
The processes that lead best results in [13] and [14] were related to time-domain
transformations for instance time-shifting and stretching adding noise or harmonic
distortion compressing in a given dynamic range ia Other processes proposed
were focused on the spectrogram of the signal applying transformations such as
shifting the matrix representation setting to 0 some areas or adding spectrograms
of different samples of the same class
Presently some Python2 libraries are developed and maintained in order to do audio
data augmentation tasks For instance audiomentations3 and the GPU version
torch-audiomentations4
1httpsessentiaupfedustreaming_extractor_musichtml2httpswwwpythonorg3httpspypiorgprojectaudiomentations0604httpspypiorgprojecttorch-audiomentations
22 Sound event classification 9
2.2 Sound event classification

Sound event classification is the task of detecting and recognizing sound events in an audio stream [15]. As described in [10], this task can be approached from two sides: on one hand, the perceptual approach tries to extract the timbre similarity to cluster sounds as how we perceive them; on the other hand, the taxonomic approach aims to label sound events as they are defined in cultural or user-biased taxonomies. In this project the focus is on the second approach, as the task is to classify sound events within the drums taxonomy (i.e., kick drum, snare drum, hi-hat).

Also in [10] many classification methods are proposed; concretely, for the taxonomic approach, machine learning algorithms such as K-Nearest Neighbors, Support Vector Machines, or Neural Networks, all of them using features extracted from the audio data, as explained in section 2.1.1.
2.2.1 Drums event classification

This section is divided into two parts, first presenting the state-of-the-art methods for drums event classification and then the most relevant existing datasets. This section is mainly based on the article [1], as it is a review of the topic and encompasses the core concepts of the project.
Methods
Focusing on taxonomic drums event classification, this field has been studied over the last years; for instance, in the Music Information Retrieval Evaluation eXchange5 (MIREX) it has been a proposed challenge since 2005.6 In [1] a review of the main methods that have been investigated is done. The authors collect different approaches, such as Recurrent Neural Networks, proposed in [16], Non-negative Matrix Factorization, proposed in [17], and other real-time approaches using Max/MSP7, as described in [18].

5 https://www.music-ir.org/mirex/wiki/MIREX_HOME
6 https://www.music-ir.org/mirex/wiki/2005:Audio_Drum_Detection_Results
7 https://cycling74.com/products/max
It should be mentioned that the proposed methods are focused on Automatic Drum Transcription (ADT) of drumsets formed only by the kick drum, snare drum, and hi-hat. The ADT field is intended to transcribe audio, but in our case we have to check whether an audio event is the expected event or not; this particularity can be used in our favor, as some assumptions can be made about the audio that has to be analyzed.
Datasets
In addition to the methods and their combinations, the data used to train the system plays a crucial role. As a result, the dataset may have a big impact on the generalization capabilities of the models. In this section some existing datasets are described.
• IDMT-SMT-Drums [19]: consists of real drums recordings containing only kick drum, snare drum, and hi-hat events. Each recording has its transcription in xml format, and it is publicly available to download.8

• MDB Drums [20]: consists of real drums recordings of a wide range of genres, drumsets, and styles. Each recording has two txt transcriptions, for the classes and subclasses defined in [20] (e.g., class: hi-hat; subclasses: closed hi-hat, open hi-hat, pedal hi-hat). It is publicly available to download.9

• ENST-Drums [21]: consists of real drums audio and video recordings of different drummers and drumsets. Each recording has its transcription, and some of them include accompaniment audio. It is publicly available to download.10

• DREANSS [22]: differently, this dataset is a collection of drums recordings datasets that have been annotated a posteriori. It is publicly available to download.11

Electronic drums datasets have not been considered, as the student assignment is supposed to be recorded with a real drumset.

8 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
9 https://github.com/CarlSouthall/MDBDrums
10 https://perso.telecom-paristech.fr/grichard/ENST-drums
11 https://www.upf.edu/web/mtg/dreanss
2.3 Digital sheet music

Several music sheet technologies have been developed since the first scorewriter programs from the 80s. Proprietary software such as Finale12 and Sibelius13, or open-source software such as MuseScore14 and LilyPond15, are some options that can be used nowadays to write music sheets with a computer.

In terms of file format, Sibelius has its encrypted version that can only be read and written with the software; it can also write and read MusicXML16 files, which are not encrypted and are similar to an HTML file, as they contain tags that define the bars and notes of the music sheet. This format is the standard for exchanging digital music sheets.

Within Music Critic's framework, the technology used to display the evaluated score is LilyPond: it can be called from the command line and allows adding macros that change the size or color of the notes. The other particularity is that it uses its own file format (.ly), so scores that are in MusicXML format have to be converted and reviewed.
2.4 Software tools

Many of the concepts and algorithms aforementioned are already developed as software libraries. This project has been developed with Python, and in this section the libraries that have been used are presented. Some of them are open and public, and some others are private, such as pysimmusic, which has been shared with us so we can use and consult it. In addition, all the code has been developed using a tool from Google called Colaboratory17; it allows writing code in a Jupyter notebook18 format that is agile to use and execute interactively.

12 https://www.finalemusic.com
13 https://www.avid.com/sibelius
14 https://musescore.org
15 https://lilypond.org
16 https://www.musicxml.com
17 https://colab.research.google.com
18 https://jupyter.org
2.4.1 Essentia

Essentia is an open-source C++ library of algorithms for audio and music analysis, description, and synthesis [23]; it can also be installed as a Python library with the pip19 command in Linux, or compiled with certain flags in macOS.20 This library includes a collection of MIR algorithms; it is not a framework, so it is in the user's hands how to use these processes. Some of the algorithms used in this project are music feature extraction, onset detection, and audio file I/O.
2.4.2 Scikit-learn

Scikit-learn21 is an open-source library for Python that integrates machine learning algorithms for regression, classification, and clustering, as well as pre-processing and dimensionality reduction functions. It is based on NumPy22 and SciPy23, so its algorithms are easy to adapt to the most common data structures used in Python. It also allows saving and loading trained models to do inference tasks with new data.
2.4.3 LilyPond

As described in section 2.3, LilyPond is an open-source scorewriter software with its own file format and language. It can produce visual renders of music sheets in PNG, SVG, and PDF formats, as well as MIDI files to listen to the compositions. LilyPond works on the command line and allows us to introduce macros to modify visual aspects of the score, such as color or size.

It is the digital sheet music technology used within Music Critic's framework, as it allows embedding an image in the music sheet, generating a parallel representation of the music sheet and a student's interpretation.

19 https://pypi.org/project/pip
20 https://essentia.upf.edu/installing.html
21 https://scikit-learn.org
22 https://numpy.org
23 https://www.scipy.org/scipylib/index.html
2.4.4 Pysimmusic

Pysimmusic is a private Python library developed at the MTG. It offers tools to analyze the similarity of musical performances and uses libraries such as Essentia, LilyPond, and FFmpeg24, among others. Pysimmusic contains onset detection algorithms and a collection of audio descriptors and evaluation algorithms. By now, it is the main evaluation software used in Music Critic to compare the submitted recording with the reference.
2.4.5 Music Critic

Music Critic is a project from the MTG intended to support technologies for online music education, facilitating the assessment of student performances.25

The proposed workflow starts with a student submitting a recording of the proposed exercise. Then the submission is sent to Music Critic's server, where it is analyzed and assessed. Finally, the student receives the evaluation jointly with the feedback from the server.
2.5 Summary

Music information retrieval and machine learning have been popular fields of study. This has led to a large development of methods and algorithms that will be crucial for this project. Most of them are free and open-source, and fortunately the private ones have been shared by the UPF research team, which is a great base to start the development.

24 https://www.ffmpeg.org
25 https://www.upf.edu/web/mtg/tech-transfer/-/asset_publisher/pYHc0mUhUQ0G/content/id/229860881/maximized#YJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in section 1.3.2, having a well-annotated and balanced dataset is crucial to get proper results. In this chapter the 40kSamples Drums Dataset creation process is explained: first, focusing on how to process existing datasets, such as the ones mentioned in section 2.2.1; secondly, introducing the process of creating new datasets with a music school corpus and a collection of recordings made in a recording studio; finally, describing the data augmentation procedure and how the audio samples are sliced into individual drums events. In Figure 1 we can see the different procedures to unify the annotations of the different datasets, while the audio does not need any specific modification.
3.1 Existing datasets

Each of the existing datasets has a different annotation format; in this section the process of unifying them is explained, as well as its implementation (see notebook Dataset_formatUnification.ipynb1). As the events to take into account can be single instruments or combinations of them, the annotations have to be formatted to show these events properly. None of the annotations has this approach, so we have written a function that filters the list and joins the events separated by a small difference of time, meaning that they are played simultaneously.

1 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
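The exact time threshold is an implementation detail of the notebook; the sketch below assumes a hypothetical 30 ms window just to illustrate the joining idea.

    # Sketch: join annotated events closer than `threshold` seconds (threshold assumed)
    def join_simultaneous(events, threshold=0.03):
        """events: list of (time_in_seconds, label) tuples, sorted by time."""
        joined = []
        for time, label in events:
            if joined and time - joined[-1][0] < threshold:
                # Same hit played on several instruments: merge the labels
                joined[-1] = (joined[-1][0], joined[-1][1] + '+' + label)
            else:
                joined.append((time, label))
        return joined

    # e.g. [(0.50, 'hh'), (0.505, 'kd'), (1.00, 'sd')] -> [(0.50, 'hh+kd'), (1.00, 'sd')]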
Figure 1 Datasets pre-processing (Music school, Studio REC, IDMT Drums, and MDB Drums sources are unified: Sibelius scores are converted to MusicXML and parsed to txt, and the resulting annotations are written jointly with the audio)
3.1.1 MDB Drums

This dataset was the first we worked with; the annotation format in txt was a key factor, as it was easy to read and understand. As the dataset is available on GitHub2, there is no need to download it or process it from a local drive. As shown in the first cells of Dataset_formatUnification.ipynb, the data from the repository can be retrieved with a Python wrapper of the GitHub API3.

This dataset has two annotation files, depending on how deep the taxonomy used is [20]. In this case the generic class taxonomy is used, as there is no need to differentiate playing styles of a given instrument (i.e., single stroke, flam, drag, ghost note).
3.1.2 IDMT Drums

Differently from the previous dataset, this one is only available by downloading a zip file4. It also differs in the annotation file format, which is xml. Using the Python package xmltodict5, in the second part of Dataset_formatUnification.ipynb the xml files are loaded as a Python dictionary and converted to txt format.

2 https://github.com/CarlSouthall/MDBDrums
3 https://pypi.org/project/githubpy
4 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
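For illustration, a minimal sketch of this xml-to-txt conversion follows; the tag names are hypothetical, as they depend on the actual IDMT annotation schema.

    import xmltodict

    # Parse an annotation file into nested Python dictionaries
    with open('annotation.xml') as f:                    # hypothetical filename
        doc = xmltodict.parse(f.read())

    # Hypothetical tag names: write one line per event, onset time plus instrument
    with open('annotation.txt', 'w') as out:
        for event in doc['annotation']['event']:
            out.write(f"{event['onsetSec']}\t{event['instrument']}\n")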
3.2 Created datasets

In order to expand the dataset with more variety of samples, other methods to get data have been explored. On one hand, using audio data that has partial annotations or some representation that is not data-driven, such as a music sheet, which contains a visual representation of the music but not a logic annotation, as mentioned in the previous section. On the other hand, since generating simple annotations is an easy task, drums samples can be recorded standalone to create data in a controlled environment. In the next two sections these methods are described.
3.2.1 Music school

A music school has shared its teaching material with the MTG for research purposes, i.e., audio demos, books in pdf format, and music sheets in Sibelius format. As we can see in Figure 1, the annotations from the music school corpus are in Sibelius format; this is an encrypted representation of the music sheet that can only be opened with the Sibelius software. The MTG has shared an AVID license which includes the Sibelius software, so we were able to convert the sib files to MusicXML. MusicXML is not encrypted and can be opened and read, so a parser has been developed to convert the MusicXML files into a symbolic representation of the music sheet. This representation has been inspired by [24], which proposes a system to represent chords.
MusicXML parser
As mentioned in section 2.3, the MusicXML format is based on ordering the visual information with tags, creating a tree structure of nested dictionaries. In the first cell of XML_parser.ipynb6 two functions are defined. ConvertXML2Annotation() reads the musicxml file and gets the general information of the song (i.e., tempo, time measure, title); then a for loop runs through all the bars of the music sheet, checking whether the given bar is self-defined, a repetition of the previous one, or the beginning or end of a repetition in the song (see Figure 2); in the self-defined case, the bar is passed to an auxiliary function which parses it, producing the aforementioned symbolic representation.

5 https://pypi.org/project/xmltodict
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three others have a symbol that means 'repetition of the previous bar'; moreover, the bar lines at the beginning and the end represent that these four bars have to be repeated. Therefore, this line in the music score represents an interpretation of eight bars, repeating the first one.

The symbolic representation that we propose, based on [24], defines each bar with a string; this string contains the representations of the events in the bar, separated by blank spaces. Each of the events has two dots (:) to separate the figure (i.e., quarter note, half note, whole note) from the note or notes of the event, which are separated by a dot (.). For instance, the symbolic representation of the first bar in Figure 2 is F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4.

In addition to this conversion, in the parse_one_measure() function from the XML_parser notebook each measure is checked to ensure that it fully represents the bar. This means that the sum of the figures of the bar has to be equal to the one defined in the time measure: the sum of the events in a 4/4 bar has to be equal to four quarter notes.
Symbolic notation to unified annotation format
As we can see in Figure 1, once the music scores are converted to the symbolic representation, the last step is to unify the annotations with the format used in section 3.1. This process is made in the last cells of the Dataset_formatUnification7 notebook. A dictionary with the translation of the notes to drums instruments is defined, so each note is directly converted. Differently, the timestamp of each event has to be computed based on the tempo of the song and the figure of each event; this process is made with the function get_time_steps_from_annotations()8, which reads the interpretation in symbolic notation and accumulates the duration of each event based on the figure and the tempo.
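A minimal sketch of this accumulation idea is shown below; it is not the project's actual implementation, and it assumes the figure-after-the-colon convention described above with plain figures only (no dotted notes or tuplets).

    # Sketch: convert bars such as "F4.A4:4 F4.A4:4 ..." into (timestamp, notes)
    # annotations, given the tempo in quarter notes per minute
    def symbolic_to_timestamps(bars, tempo, start=0.0):
        quarter = 60.0 / tempo              # duration of a quarter note, in seconds
        t = start
        annotations = []
        for bar in bars:
            for event in bar.split():
                notes, figure = event.split(':')
                annotations.append((round(t, 3), notes.split('.')))
                t += quarter * 4.0 / int(figure)  # whole=1, half=2, quarter=4, eighth=8
        return annotations

    # symbolic_to_timestamps(["F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4"], tempo=60)
    # -> [(0.0, ['F4', 'A4']), (1.0, ...), (2.0, ...), (3.0, ...)]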
3.2.2 Studio recordings

At this point of the dataset creation we realized that the already existing data was very unbalanced in terms of instances per class: some classes had around two thousand samples while others had only ten. This situation was the reason to record a personalized dataset, both to balance the overall distribution of classes and to obtain exercises read with different accuracy, simulating students with different skill levels.
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants, Barcelona); the first day was intended to mount the drumset and the microphones, which are listed in Table 2. In Figure 3 the microphone setup is shown; differently from the standard setup, in which each instrument of the set has its own microphone, this distribution of the microphones was intended to record the whole drumset with different frequency responses.

The recording process was divided into two phases: first, creating samples to balance the dataset used to train the drums event classifier (called train set); then, recording the students' assignment simulation to test the whole system (called test set).

7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
9 https://www.stereodosis.com
Microphone            Transducer principle
Beyerdynamic TG D70   Dynamic
Shure PG52            Dynamic
Shure SM57            Dynamic
Sennheiser e945       Dynamic
AKG C314              Condenser
AKG C414              Condenser
Shure PG81            Condenser
Samson C03            Condenser

Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only the classes that appear in the music school subset; this decision was motivated by the idea of assessing the songs from the books, so only the classes of the collection of songs were needed to train the classifier. In Figure 4 the distribution of the selected classes before the recordings is shown; note that it is in logarithmic scale, so there is a large difference among classes.
Figure 4 Number of samples before Train set recording
To organize the recording process we designed 3 different routines to record; depending on the class and the number of samples already existing, a different routine was recorded. These routines were designed trying to represent the different speeds, dynamics, and interactions between instruments of a real interpretation. In Appendix A the routines' scores are shown; to write a generic routine, a two-line stave is used: the bottom line represents the class to be recorded and the top line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound lasts a long period of time, so its tail is mixed with the subsequent sound events.

• Routine 1 (Fig. 31): this routine is intended for the classes that do not include a crash or ride cymbal and have a small number of samples (i.e., <500).

• Routine 2 (Fig. 32): this routine does not include auxiliary events, as it is intended for classes that include a crash or ride cymbal, whose interaction with itself is intrinsic.

• Routine 3 (Fig. 33): this is a short version of routine 1 which only repeats each bar two times instead of four; it is intended for classes that do not include a crash or ride cymbal and have a large number of samples (i.e., >500).
Routines 1 and 3 were recorded only one time, as we had only one instrument for each of the classes; differently, routine 2 was recorded two times for each cymbal, as we were able to use more instances of them. The different cymbal configurations used can be seen in Appendix A, in Figures 34, 35, and 36.

After the Train set recording the number of samples was a little more balanced: as shown in Figure 5, all the classes have at least 1500 samples.
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song on the same drumset; to do that, we recorded each song of the music school Drums Grade Initial and Grade 1, playing it correctly and then making mistakes in both reading and rhythmic ways. After testing with these recordings, we realized that we were not able to test the limits of the assessment system in terms of tempo or with different rhythmic measures. So we proposed two groove reading exercises, in 4/4 and in 12/8, to be performed at different tempos; these recordings have been done in my study room with my laptop's microphone.
3.3 Data augmentation

As described in section 2.1.2, data augmentation aims to introduce changes to the signals to optimize the statistical representation of the dataset. To implement this task, the aforementioned Python library audiomentations is used.
The audiomentations library has a class called Compose which allows collecting different processing functions, assigning a probability to each of them. Then the Compose instance can be called several times with the same audio file, and each time the resulting audio will be processed differently because of the probabilities. In data_augmentation.ipynb10 a possible implementation is shown, as well as some plots of the original sample with different results of applying the created Compose to the same sample; an example of the results can be listened to in Freesound11.
The processing functions introduced in the Compose class are based on the ones proposed in [13] and [14]; their parameters are described below (a minimal sketch follows the list):

• Add Gaussian noise, with 70% probability.

• Time stretch between 0.8 and 1.25, with 50% probability.

• Time shift forward, a maximum of 25% of the duration, with 50% probability.

• Pitch shift of ±2 semitones, with 50% probability.

• Apply mp3 compression, with 50% probability.
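A minimal sketch of such a Compose with the probabilities listed above might look as follows; parameter names can differ between audiomentations versions, so they should be checked against the installed release.

    import numpy as np
    from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                                 PitchShift, Shift, TimeStretch)

    augment = Compose([
        AddGaussianNoise(p=0.7),                               # noise, 70%
        TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),       # stretch, 50%
        Shift(min_fraction=0.0, max_fraction=0.25, p=0.5),     # forward shift, 50%
        PitchShift(min_semitones=-2, max_semitones=2, p=0.5),  # ±2 semitones, 50%
        Mp3Compression(p=0.5),                                 # mp3 artifacts, 50%
    ])

    samples = np.random.uniform(-0.5, 0.5, 44100).astype(np.float32)  # dummy audio
    augmented = augment(samples=samples, sample_rate=44100)  # differs at each call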
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213
3.4 Drums events trim

As will be explained in section 4.2.1, the dataset has to be trimmed into individual files in order to analyze them and extract the low-level descriptors. In the Dataset_featureExtraction.ipynb12 notebook this process has been implemented, slicing all the audios with their annotations, each dataset separately, so as to sight-check all the resultant samples and better detect which annotations were not correct.
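The following sketch illustrates the slicing idea with the soundfile library (not necessarily the one used in the notebook); the paths and the annotation format are assumptions.

    import soundfile as sf

    # Sketch: write one file per annotated event, using the next onset as the end
    def slice_events(audio_path, annotations, out_dir):
        """annotations: list of (onset_seconds, label) tuples, sorted by onset."""
        audio, sr = sf.read(audio_path)
        for i, (onset, label) in enumerate(annotations):
            start = int(onset * sr)
            end = int(annotations[i + 1][0] * sr) if i + 1 < len(annotations) else len(audio)
            # The class label goes into the filename, so it survives feature extraction
            sf.write(f"{out_dir}/{label}_{i:05d}.wav", audio[start:end], sr)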
3.5 Summary

To summarize, a drums samples dataset has been created; the one used in this project will be called the 40kSamples Drums Dataset. Nonetheless, to share this dataset we have to ensure that we fully own the data, which means that the samples that come from the IDMT, MDB Drums, and Music School datasets cannot be shared in another dataset. Alternatively, we will share the 29kSamples Drums Dataset, formed only by the samples recorded in the studio. This dataset will be available in Zenodo13, to download the whole dataset at once, and in Freesound, where some selected samples are uploaded in a pack14 to show the differences among microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592
14 https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is inspired by [2]. Each box of the diagram refers to a section in this chapter, so the diagram might be helpful to get a general idea of the problem when explaining each process.

The system is divided into two main processes. First, the top boxes correspond to the training process of the model, using the dataset created in the previous chapter. Secondly, the bottom row shows how a student submission is processed to generate some feedback. This feedback is the output of the system, and it should give some indications to the student on how they have performed and how they can improve.
4.1 Problem definition

To check if a student reads a music sheet correctly, we need some tool to tag which instruments of the drumset are playing for each detected event. This leads us to develop and train a drums event classifier; if this tool ensures a good accuracy when classifying (i.e., >95%), we will be able to properly assess a student's recording. If the classifier does not have enough accuracy, the system will not be useful, as we will not be able to differentiate between errors from the student and errors from the classifier.
Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2] (diagram: annotations and audio recordings form the dataset; extracted features feed the drums event classifier training and the performance assessment training; a new student's recording goes through feature extraction and performance assessment inference to produce a visualization and performance feedback)
For this reason, the project has been mainly focused on developing the aforementioned drums event classifier and a proper dataset. Consequently, developing a properly assessed dataset of drums interpretations has not been possible, nor has the performance assessment training. Despite this, the feedback visualization has been developed, as it is a nice way to close the pipeline and get some understandable results; moreover, the performance feedback can be focused on deterministic aspects, such as telling the student if they are rushing or slowing in relation to a given tempo.
4.2 Drums event classifier

As already mentioned, this section has been the main load of work for this project, because a reliable assessment depends on a correct automatic transcription. The process has been divided into 3 main parts: extracting the musical features, training and validating the model in an iterative process, and finally testing the model with totally new data.
4.2.1 Feature extraction

The feature extraction concept has been explained in section 2.1.1 and has been implemented using the MusicExtractor()1 method from Essentia's library.

MusicExtractor() has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()2 has been implemented to loop over a list of files and analyze each of them, adding the extracted features to a csv file jointly with the class of each drums event. At this point all the low-level features were extracted, and both the mean and the standard deviation were computed across all the frames of the given audio file. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.
As mentioned in section 3.4, the fact that MusicExtractor() has to be called with a filename, not an audio stream, forced us to create another version of the dataset, in which each event is annotated in a different audio file with the correspondent class label as filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples), and the features were saved in different csv files, one for each dataset3. Adding up the number of instances in all the csv files, we get 40228 instances with 84 features and 1 label.

1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices_features
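A minimal usage sketch under these constraints could be as follows; the statistics, frame sizes, and the slice path are illustrative.

    import numpy as np
    import essentia.standard as es

    extractor = es.MusicExtractor(lowlevelStats=['mean', 'stdev'],
                                  lowlevelFrameSize=2048,
                                  lowlevelHopSize=1024)

    # MusicExtractor works on filenames, hence one audio file per drums event
    features, _ = extractor('slices/hh+kd_00001.wav')  # hypothetical slice path

    # Keep the scalar low-level descriptors and attach the class label
    row = {name: float(features[name]) for name in features.descriptorNames()
           if name.startswith('lowlevel') and np.isscalar(features[name])}
    row['label'] = 'hh+kd'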
4.2.2 Training and validating

As mentioned in section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) to do sound event classification, and some authors have developed more complex methods for drums event classification. The complexity of these last methods made me choose the generic ones, also to try whether they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40kSamples Drums Dataset. The first times we trained the models we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula 4.1, which does not take into account the unbalance of the dataset. The accuracy computation was around 92%, but the correct predictions were mainly on the large classes; as shown in Figure 7, some classes had very low accuracy (even 0%, as some classes have 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances affects the accuracy computation less.

accuracy(y, \hat{y}) = \frac{1}{n_\mathrm{samples}} \sum_{i=0}^{n_\mathrm{samples}-1} 1(\hat{y}_i = y_i)    (4.1)
Otherwise, the proper way to compute the accuracy in this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages the accuracy over all the classes, as in formula 4.2, where w_i represents the weight of each class in the dataset. This computation lowered the result to 79%, which was not a good result.

\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j}    (4.2)

\mathrm{balanced\text{-}accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the F-score, which combines the precision and the recall of the model in one measure, as in formula 4.3. Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of instances that should be predicted as a given class.

F\text{-measure} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}    (4.3)
Having these results led us to the process of recording a personalized dataset to extend the already existing one (see section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as well as the balanced accuracy and F-score (both 89%). Until this point we were using both the KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine refers to the regularization; this technique is intended to make a model less sensitive to the data noise and to the outliers that may not represent the class properly. When increasing this value to 10, the results improved among all the classes, as shown in Figure 9, as well as the accuracy and F-score (both 95%).

At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was somewhat of a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with the classes that include only the three mentioned instruments (i.e., hh, kd, sd, hh+kd, hh+sd, kd+sd, and hh+kd+sd). Reducing the number of classes improved the overall accuracy and F-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10

Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd, and kd classes
The implementation of the training and validating iterative process has been developed in the Classifier_training.ipynb4 notebook. First, the csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered, and, to remove redundant features, the ones with a very low standard deviation are deleted (i.e., std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to move all the data distributions around 0, ensuring a standard deviation of 1.

In the next cells the dataset is split into train and validation sets, and the training method from the SVM of sklearn is called to perform the training; when the models are trained, the parameters are dumped to a file, so the model can be loaded a posteriori to apply the learned knowledge to new data. This process was very slow on my computer, so we decided to upload the csv files to Google Drive and open the notebook with Google Colaboratory, as it was faster, which is key to avoid long waiting times during the iterative train-validate process. In the last cells the inference is made with the validation set, the accuracy is computed, and the confusion matrix is plotted to get an idea of which classes are performing better.
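A condensed sketch of this train-validate loop is shown below; the csv path and column names are illustrative, while the C value and the standard deviation threshold follow the text above.

    import joblib
    import pandas as pd
    from sklearn.metrics import balanced_accuracy_score, confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    data = pd.read_csv('features/studio.csv')        # hypothetical csv path
    X, y = data.drop(columns=['label']), data['label']
    X = X.loc[:, X.std() > 0.00001]                  # drop near-constant features

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    scaler = StandardScaler().fit(X_train)           # zero mean, unit variance

    model = SVC(C=10).fit(scaler.transform(X_train), y_train)
    joblib.dump((scaler, model), 'svm_drums.joblib') # reload later for inference

    y_pred = model.predict(scaler.transform(X_val))
    print(balanced_accuracy_score(y_val, y_pred))
    print(confusion_matrix(y_val, y_pred))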
4.2.3 Testing

Testing the model introduces the concept of onset detection: until now all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection()5 does both tasks. As explained in section 2.1.1, there are many methods to do onset detection, and each of them is better for a different application. In the case of drums we have tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works pretty well; but when the tempo increases, some onsets are not correctly detected, so we finally implemented the onset detection with the HFC method. This method computes, for each window, the HFC as in equation 4.4; note that the high-frequency bins (k index) weigh more in the final value of the HFC:

\mathrm{HFC}(n) = \sum_k k \cdot |X_k[n]|^2    (4.4)

4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
Moreover, the function plots the audio waveform jointly with the detected onsets, to check that it has worked correctly after each test. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
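A sketch of this HFC-based detection, following the standard Essentia recipe, is shown below; the filename and frame sizes are illustrative.

    import essentia
    import essentia.standard as es

    audio = es.MonoLoader(filename='submission.wav')()  # hypothetical file

    windowing = es.Windowing(type='hann')
    fft = es.FFT()
    c2p = es.CartesianToPolar()
    onset_hfc = es.OnsetDetection(method='hfc')

    pool = essentia.Pool()
    for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
        magnitude, phase = c2p(fft(windowing(frame)))
        pool.add('odf.hfc', onset_hfc(magnitude, phase))

    # Peak-pick the detection function to obtain the onset times, in seconds
    onsets = es.Onsets()(essentia.array([pool['odf.hfc']]), [1])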
With the onsets information, the audio can be trimmed into the different events; the order is maintained within the name of each file, so when comparing with the expected events they can be mapped easily. The audio slices are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a csv file.
To predict which event each slice is, the already trained models are loaded in this new environment and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns, for each row in the data, the predicted event. The described process is implemented in the first part of Assessment.ipynb6; the second part is intended to execute the visualization functions described in the next section.
4.3 Music performance assessment

Finally, as already commented, the assessment part has been focused on giving visual feedback of the interpretation to the student. As the drums classifier has taken so much time, the creation of a dataset with interpretations and their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different skill levels, but grading them and doing all the process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed ones.

So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization

With the data gathered in the testing section, feedback on the interpretation has to be returned. Having as a base implementation the solution of my companion Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb.

First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.

6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (this comes from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio.

In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (green good, red bad).
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
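A hedged re-implementation sketch of such a plot follows; the 0.25 s worst-case deviation driving the color scale is an assumption, not the value used in the project.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_onset_deviations(audio, sr, detected, expected, worst=0.25):
        """Sketch of a save_bar_plot()-like view: waveform plus onset deviations."""
        t = np.arange(len(audio)) / sr
        plt.plot(t, audio, color='gray', linewidth=0.5)
        for d, e in zip(detected, expected):
            # Traffic-light color: no deviation -> green, >= `worst` seconds -> red
            severity = min(abs(d - e) / worst, 1.0)
            plt.axvspan(min(d, e), max(d, e),
                        color=plt.cm.RdYlGn(1 - severity), alpha=0.6)
            plt.vlines(d, -1.0, -0.8, color='black')  # detected onset, bottom
            plt.vlines(e, -0.2, 0.2, color='cyan')    # expected onset, middle
        plt.xlabel('time (s)')
        plt.savefig('deviation_plot.png')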
Once the waveform plot is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes()10 the expected and predicted events are compared: a list is created with 1 at the indices where they match and 0 where they do not; then the resulting list is iterated and the 0 indices are checked, because most of the classification errors fail in only one of the instruments to be predicted (i.e., instead of hh+sd, only sd is predicted). These cases are considered partially correct, as the system has to take into account its own errors: at the indices in which one of the instruments is correctly predicted and it is not a hi-hat (we consider it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct, and red means incorrect.

9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
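A simplified sketch of this grading logic could be as follows; the instrument-overlap rule is paraphrased from the description above.

    # Sketch of assess_notes()-like grading: 1.0 correct, 0.75 partially, 0.0 wrong
    def assess_notes(expected, predicted):
        """expected/predicted: lists of events such as 'hh+sd' (instruments joined by '+')."""
        scores = []
        for exp, pred in zip(expected, predicted):
            if exp == pred:
                scores.append(1.0)
            else:
                shared = set(exp.split('+')) & set(pred.split('+'))
                # Partially correct only when a non-hi-hat instrument was caught
                scores.append(0.75 if shared - {'hh'} else 0.0)
        return scores

    # assess_notes(['hh+sd', 'kd'], ['sd', 'hh']) -> [0.75, 0.0]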
With the waveform, the notes assessed, and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the waveform previously created; this is done with the LilyPond macros. On one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the notes assessed); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars), while the other number is the scale of the waveform and allows fitting the plot better with the staff.
4.3.2 Files used

The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps; these are found in the txt file already mentioned in section 3.1.1. Then, the LilyPond file; this is the template, written in the LilyPond language, that defines the resultant music sheet, with the macros to change color and size and to add the waveform. When extracting the musical features, each submission creates its csv file to store the information. And finally we need, of course, the audio files with the recorded submission to be assessed.

11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding the audio signal quality and the tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB at a time, up to an accumulated gain of +30 dB.

In this chapter and in Appendix B all the resultant feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.

As the high-frequency content method works perfectly, there are no limitations or errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positive.

1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations

One of the limitations of the system is the tempo of the exercise: the accuracy drops when the tempo increases. Having as a reference the figures that show a good reading, in which all notes should be green or light green (i.e., Figures 16, 17, 18, 19, 20, 21, and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one weighs 0.5, and an incorrect one 0; the total value is the mean of the weighted results of the predictions. For instance, at 60 bpm in Table 3 the total is (25 + 0.5 · 7)/32 ≈ 0.89.
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of the classifier decreases. This may be because increasing the tempo decreases the spacing between events, and consequently the duration of each event, which leads to fewer
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to calculate the mean and standard deviation when extracting the timbre characteristics. As stated in the law of large numbers [25], the larger the sample, the closer the mean is to the total population mean. In this case, having fewer values in the calculation creates more outliers in the distribution, which tends to scatter.

Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48

Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But in 12/8 the equivalent tempo is 300 quarter notes per minute, similar to 140 bpm in 4/4, whose equivalent tempo is 280 quarter notes per minute. The results on 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875

Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations

Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason, we think that the classifier starts to fail at +18dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figure 23 and Figure 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that it leads to a high-frequency content that is predicted as a hi-hat in all the cases.
Level   Correct   Partially OK   Incorrect   Total
+0dB    25        7              0           0.89
+6dB    23        9              0           0.86
+12dB   23        9              0           0.86
+18dB   24        7              1           0.86
+24dB   18        5              9           0.64
+30dB   13        5              14          0.48

Table 5 Results of exercise 1 at 60 bpm with different amplification levels

Level   Correct   Partially OK   Incorrect   Total
+0dB    12        7              13          0.48
+6dB    13        10             9           0.56
+12dB   10        8              14          0.5
+18dB   9         2              21          0.31
+24dB   8         0              24          0.25
+30dB   9         0              23          0.28

Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
5.3 Evaluation of the assessment

Until now the evaluation of the results has been focused on the drums event classifier accuracy, but we think it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-back of one bar and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'bad reading, good tempo' ones, shown in Figures 27, 28, and 29.
Figure 27 Bad reading and good tempo Ex 1, starting at 60 bpm and adding 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise

Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs the assessment based on the predictions pretty well. This can be helpful for a student to know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help students recognize whether they are slowing down or rushing when reading the score; as can be seen in Figure 30, the detected onsets (black lines at the bottom part of the waveform) are mostly behind the correspondent expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. Also, a set of further improvements is given, and a final thought on my work and my apprenticeship. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results

Having in mind all the concepts explained throughout this document, we can now list them, defining their completeness and our contributions.

Firstly, the 29kSamples Drums Dataset has been created and is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful to balance drums datasets based on real interpretations, as the class distribution of these interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on the context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task that the model is trained to do is very hard for a human: being able to differentiate drums events in an individual drum sample without any context is almost impossible, even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on the visualization, not on the symbolic representation of events and timesteps.

Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment from the 29kSamples Drums Dataset. It would be good to get recordings in different spaces, with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work

In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even with different people playing. This could introduce more variance into the drums sample dataset. Moreover, on June 9th, 2021, a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 2021.1 This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.

Regarding the classification model, it is clear that it needs improvements to ensure the overall system robustness. It would be appropriate to introduce the aforementioned methods of [16], [17], and [26] in the ADT part of the pipeline.

1 https://www.2021.ieeeicassp.org
Also, in terms of the classes of the drumset, there is a long path to cover: there are no solutions that robustly transcribe a whole set, including the toms and the different kinds of cymbals. In this sense, we think that a proper approach would be to work with professional musicians, who can help researchers better understand the instrument and create datasets with different techniques.

With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises to give a mark to each student. In this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step would be omitted.

About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and to develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility

In computational sciences, a work is reproducible if the code and data are available and other researchers and students can execute them, getting the same results.

All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.

The data created, i.e., the studio recordings, is available in a Zenodo repository3, and some samples in Freesound4. This is the 29kSamples Drums Dataset: not all the 40k samples used for training are our property, so we are not able to share them under our full authorship; despite this, the other datasets used in this project are available individually.

2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions

This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished; nevertheless, work remains in terms of robustness and reliability. A first approximation has been presented, as well as several paths of improvement proposed.

Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval, and machine learning; not only in terms of implementation, but also investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web version, GitHub. Also, at the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e., Essentia), which has been a tough path to take and accomplish. In a more advanced phase of the project, I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a bunch of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.

In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of more personal interest. And being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that results from the system, and I hope that more people get interested in this field of research so that we get better tools in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).

[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).

[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).

[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).

[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).

[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).

[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).

[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).

[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).

[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).

[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).

[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).

[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).

[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2020).

[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).

[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).

[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).

[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max/MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2012).

[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).

[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).

[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).

[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).

[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings – 14th International Society for Music Information Retrieval Conference (2010).

[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).

[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).

[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
[Score image: recording routine played at 60 bpm; engraved with LilyPond 2.18.2, www.lilypond.org]

Figure 31: Recording routine 1
[Score image: recording routine played at 60 bpm; engraved with LilyPond 2.18.2, www.lilypond.org]

Figure 32: Recording routine 2
[Score image: recording routine played at 60 bpm; engraved with LilyPond 2.18.2, www.lilypond.org]

Figure 33: Recording routine 3
Figure 34: Drumset configuration 1

Figure 35: Drumset configuration 2

Figure 36: Drumset configuration 3
Appendix B
Extra results
Figure 37: Good reading and bad tempo, Ex. 1, 60 bpm

Figure 38: Bad reading and bad tempo, Ex. 1, 60 bpm

Figure 39: Good reading and bad tempo, Ex. 1, 140 bpm

Figure 40: Bad reading and bad tempo, Ex. 1, 140 bpm

Figure 41: Good reading and bad tempo, Ex. 1, 180 bpm

Figure 42: Good reading and bad tempo, Ex. 1, 220 bpm

Figure 43: Bad reading and bad tempo, Ex. 1, 220 bpm

Figure 44: Good reading and bad tempo, Ex. 2, 60 bpm

Figure 45: Bad reading and bad tempo, Ex. 2, 60 bpm

Figure 46: Good reading and bad tempo, Ex. 2, 100 bpm

Figure 47: Bad reading and bad tempo, Ex. 2, 100 bpm
Besides this, I have been studying drums for fourteen years, and a personal motivation emerges from this fact. Learning an instrument is a process that does not only rely on going to class: there is an important load of individual practice apart from the lessons. Having a tool to assess yourself when practicing would be a nice way to check your own progress.
1.2 Existing solutions

In terms of music interpretation assessment, there are already some software tools that support assessing several instruments. Applications such as Yousician1 or SmartMusic2 offer everything from the most basic notions of playing an instrument to a syllabus of themes to be played. These applications return to the students an evaluation that tells which notes are correctly played and which are not, but they do not give information about tempo consistency or dynamics, let alone generate a rubric as a teacher could develop during a class.
There are specific applications that support drums learning, but in those the feature of automatic assessment disappears. There are some options to get online drums lessons, such as Drumeo3 or Drum School4, but they only offer a list of videos imparting lessons on improving stylistic vocabulary, feel, improvisation or technique. These applications also offer personal feedback from professional drummers and a personalized studying plan, but the specific feature of automatic performance assessment is not implemented.
As mentioned in the Introduction, automatic music assessment has been a field of research at the MTG. With the development of Music Critic5, an assessment workflow is proposed and implemented. This is useful as it can be adapted to the drums assessment task.
1 https://yousician.com
2 https://www.smartmusic.com
3 https://www.drumeo.com
4 https://drumschoolapp.com
5 https://musiccritic.upf.edu
1.3 Identified challenges

As mentioned in [2], there are still improvements to make in the field of music assessment, especially when analyzing expressivity and advanced-level performances. Taking into account the scope of this project, and having as a base case the guitar assessment exercise from Music Critic, some specific challenges are described below.
1.3.1 Guitar vs drums

As defined in [4], a drumset is a collection of percussion instruments: mainly cymbals and drums, even though some genres may need to add cowbells, tambourines, pailas or other instruments with specific timbres. Moreover, percussion instruments are split into two families: membranophones and idiophones. Membranophones produce sound primarily by hitting a stretched membrane; by tuning the membrane tension, different pitches can be obtained [5]. Differently, idiophones produce sound by the vibration of the instrument itself, and the pitch and timbre are defined by its own construction [5]. The vibration aforementioned is produced by hitting the instruments; generally this hit is made with a wooden stick, but some genres may need to use brushes, hotrods or mallets to excite specific modes of the instruments. With all this in mind, and as stated in [1], it is clear that transcribing drums, having to take into account all its variants and nuances, is a hard task even for a professional drummer. With this said, there is a need to simplify the problem and to limit the instruments of the drumset to be transcribed.

Returning to the assessment task: guitars play notes and chords that are tuned, so the way to check if a music sheet has been read correctly is looking for the pitch information and comparing it to the expected one. Differently, the instruments that form a drumset are mainly unpitched (except toms, which are tuned using different scales and tuning paradigms), so the differences among drums events are in the timbre. A different approach has to be defined in order to check which instrument is being played for each detected event; the first idea is to apply machine learning for sound event classification.
Along the project we will refer to the different instruments that form a drumkit with abbreviations. In Table 1 the legend used is shown; the combination of 2 or more instruments is represented with a '+' symbol between the different tags.
Instrument      Abbreviation
Kick Drum       kd
Snare Drum      sd
Floor tom       ft
Mid tom         mt
High tom        ht
Hi-hat          hh
Ride cymbal     cy
Crash cymbal    cr

Table 1: Abbreviations' legend
1.3.2 Dataset creation

Keeping in mind the last idea of the previous section, if a machine learning approach has to be implemented, there is a basic need to obtain audio data of drums. Apart from the audio data, proper annotations of the drums interpretations are needed in order to slice them correctly and extract musical features of the different events.

The process of gathering data should take into account the different possibilities that a drumset offers in terms of timbre, loudness and tone. Several datasets should be combined, as well as additional recordings with different drumsets, in order to have a balanced and representative dataset. Moreover, to evaluate the assessment task, a set of exercises has to be recorded with different levels of skill.

There is also the need to capture those sounds with several frequency responses in order to make the model independent of the microphone. Also, those samples could be processed to get variations of each of them with data augmentation processes.
1.3.3 Signal quality

Regarding the assignment, we have to take into account that a student will not be able to record their interpretations with a setup like the one used in a studio recording; most of the time the recordings will be done using the laptop or mobile phone microphone. This fact has to be taken into account when training the event classifier, in order to do data augmentation and introduce these transformations to the dataset, e.g. introducing noise to the samples or amplifying them to get overload distortion.
1.4 Objectives

The main objective of this project is to develop a tool to assess drums interpretations of a proposed music sheet. This objective has to be split into the different steps of the pipeline:

• Generate a correctly annotated drums dataset, which means a collection of audio drums recordings and their annotations, all equally formatted.

• Implement a drums event sound classifier.

• Find a way to properly visualize drums sheets and their assessment.

• Propose a list of exercises to evaluate the technology.

In addition, having the code published in a public Github6 repository and uploading the created dataset to Freesound7 and Zenodo8 will be a good way to share this work.
1.5 Project overview

The next chapters are structured as follows. In chapter 2 the state of the art is reviewed, focusing on signal processing algorithms and ways to implement sound event classification, and ending with music sheet technologies and the software tools available nowadays. In chapter 3 the creation of a drums dataset is described, presenting the use of already available datasets and how new data has been recorded and annotated. In chapter 4 the methodology of the project is detailed: which algorithms are used for training the classifier, as well as how new submissions are processed to assess them. In chapter 5 an evaluation of the results is done, pointing out the limitations and the achievements. Chapter 6 concludes with a discussion on the methods used, the work done and further work.
6 https://github.com
7 https://freesound.org
8 https://zenodo.org
Chapter 2
State of the art
In this chapter the concepts and technologies used in the project are explained, covering algorithm references and existing implementations. First, signal processing techniques for onset detection and feature extraction are reviewed; then the sound event classification field is presented, together with its relationship with drums event classification. Also, the principal music sheet technologies and codecs are presented. Finally, specific software tools are listed.
2.1 Signal processing

2.1.1 Feature extraction
In the following sections sound event classification will be explained; most of these methods are based on training models using features extracted from the audio, rather than the raw audio chunks themselves [6]. In this section, signal processing methods to obtain those features are presented.
Onset detection

In an audio signal, an onset is the beginning of a new event; it can be either a single note, a chord or, in the case of the drums, the sound produced by hitting one or more instruments of the drumset. It is necessary to have a reliable algorithm that properly detects all the onsets of a drums interpretation. With the onsets information (a list of timestamps), the audio can be sliced to analyze each chunk separately and to assess the tempo consistency.
It is important to address the challenge in a psychoacoustical way, as the objective is to detect the musical events as a human would. In [7] the idea of perceptual onset for percussive instruments is defined as a time interval between the physical onset and the moment that the maximum level is reached. In [8] many methods are reviewed, focusing on the differences in performance depending on the signal: Non-Pitched Percussive instruments are better detected with temporal methods or high-frequency content methods, while Pitched Non-Percussive instruments may need to take into account changes of energy in the spectrum distribution, as the onset may represent a different note.

The sound generated by the drums is mainly percussive (discarding brushes' slow patterns or mallets' build-ups on the cymbals), which means that it is formed by a short transient followed by a short decay; there is no sustain. As the transient is a fast change of energy, it implies high-frequency content, because the changes happen in a very short frame of time. As recommended in [9], the HFC method will be used.
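As an illustration, the sketch below computes an HFC onset detection function with Essentia's Python bindings and peak-picks it to obtain onset times; the filename and the 1024/512 frame and hop sizes are placeholder assumptions, not values prescribed by [9].

# Minimal HFC onset detection sketch with Essentia; filename and
# frame/hop sizes are illustrative.
import essentia
import essentia.standard as es

audio = es.MonoLoader(filename='interpretation.wav', sampleRate=44100)()

w = es.Windowing(type='hann')
fft = es.FFT()
c2p = es.CartesianToPolar()
od_hfc = es.OnsetDetection(method='hfc')

# Compute the detection function frame by frame
pool = essentia.Pool()
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    magnitude, phase = c2p(fft(w(frame)))
    pool.add('odf.hfc', od_hfc(magnitude, phase))

# Peak-pick the detection function to get onset times in seconds
onset_times = es.Onsets()(essentia.array([pool['odf.hfc']]), [1])
print(onset_times)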
Timbre features

As described in [10], a feature denotes in some way a quantity or a value. Features extracted by processing the audio stream, or transformations of it (i.e. FFT), are called low-level descriptors; these features carry no relevant information from a human point of view, but are useful for computational processes [11].

Some low-level descriptors are computed from the temporal information: for instance, the zero-crossing rate tells the number of times the signal crosses the zero axis per second, the attack time is the duration of the transient, and the temporal centroid describes the energy distribution of an event over time. Other well-known features are the root mean square of the signal or the high-frequency content mentioned in section 2.1.1.
Besides temporal features, low-level descriptors can also be computed from the frequency domain. Some of them are spectral flatness, spectral roll-off, spectral slope and spectral flux, among others.

Nowadays, Essentia's library offers a collection of algorithms that reliably extract the low-level descriptors aforementioned; the function that encompasses all the extractors is called Music extractor1.
2.1.2 Data augmentation

Data augmentation processes refer to the optimization of the statistical representation of the datasets, in terms of improving the generalization of the resultant models. These methods are based on the introduction of unobserved data or latent variables that may not be captured during the dataset creation [12].
Regarding this technique applied to audio data, signal processing algorithms are proposed in [13] and [14] that introduce changes to the signals in both the time and frequency domains. In these articles the goal is to improve accuracy on speech and animal sound recognition, although this could also apply to drums event classification. The processes that led to the best results in [13] and [14] were related to time-domain transformations: for instance time-shifting and stretching, adding noise or harmonic distortion, or compressing into a given dynamic range, among others. Other proposed processes were focused on the spectrogram of the signal, applying transformations such as shifting the matrix representation, setting some areas to 0, or adding spectrograms of different samples of the same class.

Presently, some Python2 libraries are developed and maintained to perform audio data augmentation tasks, for instance audiomentations3 and its GPU version, torch-audiomentations4.
1 https://essentia.upf.edu/streaming_extractor_music.html
2 https://www.python.org
3 https://pypi.org/project/audiomentations/0.6.0/
4 https://pypi.org/project/torch-audiomentations
2.2 Sound event classification

Sound event classification is the task of detecting and recognizing sound events in an audio stream [15]. As described in [10], this task can be approached from two sides: on one hand, the perceptual approach tries to extract the timbre similarity to cluster sounds as we perceive them; on the other hand, the taxonomic approach is determined to label sound events as they are defined in cultural or user-biased taxonomies. In this project the focus is on the second approach, as the task is to classify sound events in the drums taxonomy (i.e. kick drum, snare drum, hi-hat).
Also, in [10] many classification methods are proposed; concretely, for the taxonomic approach, machine learning algorithms such as K-Nearest Neighbors, Support Vector Machines or Neural Networks, all of them using features extracted from the audio data, as explained in section 2.1.1.
2.2.1 Drums event classification

This section is divided into two parts: first presenting the state-of-the-art methods for drum event classification, and then the most relevant existing datasets. This section is mainly based on the article [1], as it is a review of the topic and encompasses the core concepts of the project.
Methods

Focusing on taxonomic drums event classification, this field has been studied over the last years, as it has been a proposed challenge in the Music Information Retrieval Evaluation eXchange5 (MIREX) since 20056. In [1] a review of the main methods that have been investigated is done. The authors collect different approaches, such as Recurrent Neural Networks, proposed in [16], Non-Negative Matrix Factorization, proposed in [17], and other real-time approaches using Max/MSP7, as described in [18].
5 https://www.music-ir.org/mirex/wiki/MIREX_HOME
6 https://www.music-ir.org/mirex/wiki/2005:Audio_Drum_Detection_Results
7 https://cycling74.com/products/max
It should be mentioned that the proposed methods are focused on Automatic Drum Transcription (ADT) of drumsets formed only by the kick drum, snare drum and hi-hat. The ADT field is intended to transcribe audio, but in our case we have to check whether an audio event is or is not the expected event; this particularity can be used in our favor, as some assumptions can be made about the audio that has to be analyzed.
Datasets

In addition to the methods and their combinations, the data used to train the system plays a crucial role. As a result, the dataset may have a big impact on the generalization capabilities of the models. In this section some existing datasets are described:
• IDMT-SMT-Drums [19]: Consists of real drum recordings containing only kick drum, snare drum and hi-hat events. Each recording has its transcription in xml format and is publicly available to download8.

• MDB Drums [20]: Consists of real drums recordings of a wide range of genres, drumsets and styles. Each recording has two txt transcriptions, for the classes and subclasses defined in [20] (e.g. class: Hi-hat; subclasses: closed hi-hat, open hi-hat, pedal hi-hat). It is publicly available to download9.

• ENST-Drums [21]: Consists of real drum audio and video recordings of different drummers and drumsets. Each recording has its transcription, and some of them include accompaniment audio. It is publicly available to download10.

• DREANSS [22]: Differently, this dataset is a collection of drum recordings datasets that have been annotated a posteriori. It is publicly available to download11.
Electronic drums datasets have not been considered, as the student assignment is supposed to be recorded with a real drumset.

8 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
9 https://github.com/CarlSouthall/MDBDrums
10 https://perso.telecom-paristech.fr/grichard/ENST-drums
11 https://www.upf.edu/web/mtg/dreanss
2.3 Digital sheet music

Several music sheet technologies have been developed since the first scorewriter programs of the 80s. Proprietary software such as Finale12 and Sibelius13, or open-source software such as MuseScore14 and LilyPond15, are some options that can be used nowadays to write music sheets with a computer.

In terms of file format, Sibelius has its encrypted version that can only be read and written with the software; it can also write and read MusicXML16 files, which are not encrypted and are similar to an HTML file, as they contain tags that define the bars and notes of the music sheet. This format is the standard for exchanging digital music sheets.
Within Music Critic's framework, the technology used to display the evaluated score is LilyPond: it can be called from the command line and allows adding macros that change the size or color of the notes. The other particularity is that it uses its own file format (.ly), so scores that are in MusicXML format have to be converted and reviewed.
2.4 Software tools

Many of the concepts and algorithms aforementioned are already available as software libraries. This project has been developed with Python, and in this section the libraries that have been used are presented. Some of them are open and public, and some others are private, such as Pysimmusic, which has been shared with us so we can use and consult it. In addition, all the code has been developed using a tool from Google called Colaboratory17; it allows writing code in a Jupyter notebook18 format that is agile to use and execute interactively.
12 https://www.finalemusic.com
13 https://www.avid.com/sibelius
14 https://musescore.org
15 https://lilypond.org
16 https://www.musicxml.com
17 https://colab.research.google.com
18 https://jupyter.org
2.4.1 Essentia

Essentia is an open-source C++ library of algorithms for audio and music analysis, description and synthesis [23]; it can also be installed as a Python-based library with the pip19 command in Linux, or by compiling with certain flags in macOS20. This library includes a collection of MIR algorithms; it is not a framework, so it is in the user's hands how to use these processes. Some of the algorithms used in this project are music feature extraction, onset detection and audio file I/O.
2.4.2 Scikit-learn

Scikit-learn21 is an open-source library for Python that integrates machine learning algorithms for regression, classification and clustering, as well as pre-processing and dimensionality reduction functions. It is based on NumPy22 and SciPy23, so its algorithms are easy to adapt to the most common data structures used in Python. It also allows saving and loading trained models to perform inference tasks with new data.
2.4.3 Lilypond

As described in section 2.3, LilyPond is an open-source scorewriter software with its own file format and language. It can produce visual renders of music sheets in PNG, SVG and PDF formats, as well as MIDI files to listen to the compositions. LilyPond works on the command line and allows us to introduce macros to modify visual aspects of the score, such as color or size.

It is the digital sheet music technology used within Music Critic's framework, as it allows embedding an image in the music sheet, generating a parallel representation of the music sheet and a student's interpretation.
19 https://pypi.org/project/pip
20 https://essentia.upf.edu/installing.html
21 https://scikit-learn.org
22 https://numpy.org
23 https://www.scipy.org/scipylib/index.html
2.4.4 Pysimmusic

Pysimmusic is a private Python library developed at the MTG. It offers tools to analyze the similarity of musical performances, and uses libraries such as Essentia, LilyPond and FFmpeg24, among others. Pysimmusic contains onset detection algorithms and a collection of audio descriptors and evaluation algorithms. For now it is the main evaluation software used in Music Critic to compare the submitted recording with the reference.
2.4.5 Music Critic

Music Critic is a project from the MTG intended to support technologies for online music education, facilitating the assessment of student performances25.

The proposed workflow starts with a student submitting a recording of the proposed exercise. The submission is then sent to the Music Critic server, where it is analyzed and assessed. Finally, the student receives the evaluation jointly with the feedback from the server.
2.5 Summary

Music information retrieval and machine learning have been popular fields of study. This has led to a large development of methods and algorithms that will be crucial for this project. Most of them are free and open-source and, fortunately, the private ones have been shared by the UPF research team, which is a great base to start the development.
24 https://www.ffmpeg.org
25 https://www.upf.edu/web/mtg/tech-transfer/-/asset_publisher/pYHc0mUhUQ0G/content/id/229860881/maximized
Chapter 3
The 40kSamples Drums Dataset
As stated in section 1.3.2, having a well-annotated and balanced dataset is crucial to get proper results. In this section the 40kSamples Drums Dataset creation process is explained: first focusing on how to process existing datasets, such as the ones mentioned in 2.2.1; secondly, introducing the process of creating new datasets from a music school corpus and a collection of recordings made in a recording studio; and finally, describing the data augmentation procedure and how the audio samples are sliced into individual drums events. In Figure 1 we can see the different procedures to unify the annotations of the different datasets, while the audio does not need any specific modification.
3.1 Existing datasets

Each of the existing datasets has a different annotation format. In this section the process of unifying them is explained, as well as its implementation (see notebook Dataset_formatUnification.ipynb1). As the events to take into account can be single instruments or combinations of them, the annotations have to be formatted to show those events properly. None of the annotations follows this approach, so we have written a function that filters the list and joins the events separated by a very small difference of time, meaning that they are played simultaneously.

1 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
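A possible shape for that joining function is sketched below; the function name and the 30 ms threshold are illustrative assumptions, not the exact values used in the notebook.

# Sketch: merge annotation events that are close enough in time to be
# considered simultaneous; names and threshold are illustrative.
def join_simultaneous_events(annotations, max_gap=0.03):
    """annotations: list of (timestamp, label) sorted by timestamp.
    Returns a list where events closer than max_gap seconds are merged
    into a single combined event, e.g. 'hh' and 'kd' -> 'hh+kd'."""
    merged = []
    for timestamp, label in annotations:
        if merged and timestamp - merged[-1][0] < max_gap:
            prev_time, prev_label = merged[-1]
            # Combine labels with '+', keeping a canonical alphabetical order
            labels = sorted(set(prev_label.split('+')) | {label})
            merged[-1] = (prev_time, '+'.join(labels))
        else:
            merged.append((timestamp, label))
    return merged

print(join_simultaneous_events([(0.50, 'hh'), (0.51, 'kd'), (1.00, 'sd')]))
# [(0.5, 'hh+kd'), (1.0, 'sd')]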
[Diagram: Music school (Sibelius to MusicXML, MusicXML parser to txt), Studio REC (audio + txt, write annotations), IDMT Drums and MDB Drums all converge into unified annotations plus audio]

Figure 1: Datasets pre-processing
3.1.1 MDB Drums

This dataset was the first we worked with; the annotation format in txt was a key factor, as it was easy to read and understand. As the dataset is available on Github2, there is no need to download it or process it from a local drive. As shown in the first cells of Dataset_formatUnification.ipynb, data from the repository can be retrieved with a Python wrapper of the Github API3.

This dataset has two annotation files, depending on how deep the taxonomy used is [20]. In this case the generic class taxonomy is used, as there is no need to differentiate styles when playing a given instrument (i.e. single stroke, flam, drag, ghost note).
3.1.2 IDMT Drums

Differently from the previous dataset, this one is only available by downloading a zip file4. It also differs in the annotation file format, which is xml. Using the Python package xmltodict5, in the second part of Dataset_formatUnification.ipynb the xml files are loaded as a Python dictionary and converted to txt format.

2 https://github.com/CarlSouthall/MDBDrums
3 https://pypi.org/project/githubpy
4 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
5 https://pypi.org/project/xmltodict
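For illustration, a condensed sketch of that conversion is shown below; the tag names (instrumentRecording, transcription, event, onsetSec, instrument) reflect the general layout of the IDMT annotations but should be treated as assumptions, not the dataset's documented schema.

# Sketch: load an IDMT xml annotation as a nested dict and dump it to
# the unified "timestamp <tab> label" txt format; tag names are assumed.
import xmltodict

with open('annotation.xml') as f:
    doc = xmltodict.parse(f.read())

events = doc['instrumentRecording']['transcription']['event']
with open('annotation.txt', 'w') as out:
    for event in events:
        out.write(f"{event['onsetSec']}\t{event['instrument']}\n")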
3.2 Created datasets

In order to expand the dataset with more variety of samples, other methods to get data have been explored. On one hand, with audio data that has partial annotations or some representation that is not data-driven, such as a music sheet, which contains a visual representation of the music but not a logic annotation, as mentioned in the previous section. On the other hand, generating simple annotations is an easy task, so drums samples can be recorded standalone to create data in a controlled environment. In the next two sections these methods are described.
3.2.1 Music school

A music school has shared its teaching material with the MTG for research purposes, i.e. audio demos, books in pdf format and music sheets in Sibelius format. As we can see in Figure 1, the annotations from the music school corpus are in Sibelius format; this is an encrypted representation of the music sheet that can only be opened with the Sibelius software. The MTG has shared an AVID license, which includes the Sibelius software, so we were able to convert the sib files to MusicXML. MusicXML is not encrypted and can be opened and read, so a parser has been developed to convert the MusicXML files to a symbolic representation of the music sheet. This representation has been inspired by [24], which proposes a system to represent chords.
MusicXML parser

As mentioned in section 2.3, the MusicXML format is based on ordering the visual information with tags, creating a tree structure of nested dictionaries. In the first cell of XML_parser.ipynb6 two functions are defined. ConvertXML2Annotation reads the musicxml file and gets the general information of the song (i.e. tempo, time measure, title); then a for loop goes through all the bars of the music sheet, checking whether the given bar is self-defined, a repetition of the previous one, or the beginning or end of a repetition in the song (see Figure 2). In the self-defined bar case, the bar is passed to an auxiliary function, which parses it to obtain the aforementioned symbolic representation.

6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
Figure 2: Sample drums score from music school drums grade 1

In Figure 2 we can see a staff in which the first bar has been written and the three others have a symbol that means 'repetition of the previous bar'; moreover, the bar lines at the beginning and the end represent that these four bars have to be repeated. Therefore, this line in the music score represents an interpretation of eight bars, repeating the first one.
The symbolic representation that we propose, based on [24], defines each bar with a string; this string contains the representations of the events in the bar, separated by blank spaces. Each of the events uses a colon (:) to separate the figure (i.e. quarter note, half note, whole note) from the note or notes of the event, which are separated by a dot (.). For instance, the symbolic representation of the first bar in Figure 2 is "F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4".

In addition to this conversion, in the parse_one_measure function from the XML_parser notebook, each measure is checked to ensure that it fully represents the bar. This means that the sum of the figures of the bar has to be equal to the one defined in the time measure: the sum of the events in a 4/4 bar has to be equal to four quarter notes.
Symbolic notation to unified annotation format

As we can see in Figure 1, once the music scores are converted to the symbolic representation, the last step is to unify the annotations with the ones used in section 3.1. This process is made in the last cells of the Dataset_formatUnification7 notebook. A dictionary with the translation of the notes to drums instruments is defined, so the note is directly converted. Differently, the timestamp of each event has to be computed based on the tempo of the song and the figure of each event; this process is made with the function get_time_steps_from_annotations8, which reads the interpretation in symbolic notation and accumulates the duration of each event based on the figure and the tempo.
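The accumulation idea can be summarized as in the sketch below; the function name, the note-to-instrument mapping and the assumption that the figure value divides a whole note are illustrative, not the exact implementation.

# Sketch of accumulating event timestamps from the symbolic notation;
# the note-to-instrument mapping shown here is an assumed excerpt.
NOTE_TO_INSTRUMENT = {'F4': 'kd', 'A4': 'hh', 'C5': 'sd'}

def bar_to_timestamps(bar, tempo, start=0.0):
    """bar: string such as 'F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4'.
    Returns (events, end_time), each event being (timestamp, label)."""
    beat = 60.0 / tempo                     # quarter-note duration (s)
    t, events = start, []
    for token in bar.split():
        notes, figure = token.split(':')
        label = '+'.join(NOTE_TO_INSTRUMENT.get(n, n) for n in notes.split('.'))
        events.append((t, label))
        t += beat * 4.0 / float(figure)     # figure 4 = quarter, 2 = half...
    return events, t

events, end = bar_to_timestamps('F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4', tempo=60)
print(events)  # [(0.0, 'kd+hh'), (1.0, 'kd+hh'), (2.0, 'kd+hh'), (3.0, 'kd+hh')]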
3.2.2 Studio recordings

At this point of the dataset creation we realized that the already existing data was very unbalanced in terms of instances per class: some classes had around two thousand samples while others had only ten. This situation was the reason to record a personalized dataset, to balance the overall distribution of classes, as well as exercises read with different accuracy, simulating students with different skill levels.
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants, Barcelona); the first day was intended to mount the drumset and the microphones, which are listed in Table 2. In Figure 3 the microphone setup is shown: differently from the standard setup, in which each instrument of the set has its own microphone, this distribution of the microphones was intended to record the whole drumset with different frequency responses.

The recording process was divided into two phases: first, creating samples to balance the dataset used to train the drums event classifier (called train set); then, recording the students' assignment simulation to test the whole system (called test set).

7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
9 https://www.stereodosis.com
Microphone             Transducer principle
Beyerdynamic TG D70    Dynamic
Shure PG52             Dynamic
Shure SM57             Dynamic
Sennheiser e945        Dynamic
AKG C314               Condenser
AKG C414               Condenser
Shure PG81             Condenser
Samson C03             Condenser

Table 2: Microphones used
Figure 3: Microphone setup for drums recording
Train set

To limit the number of classes, we decided to take into account only the classes that appear in the music school subset. This decision was motivated by the idea of assessing the songs from the books, so only the classes of that collection of songs were needed to train the classifier. In Figure 4 the distribution of the selected classes before the recordings is shown; note that it is in logarithmic scale, so there is a large difference among classes.
Figure 4: Number of samples before Train set recording
To organize the recording process we designed 3 different routines to record; depending on the class and the number of samples already existing, a different routine was recorded. These routines were designed trying to represent the different speeds, dynamics and interactions between instruments of a real interpretation. In Appendix A the routine scores are shown; to write a generic routine, a two-line stave is used: the bottom line represents the class to be recorded, and the top line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound remains for a long period of time, so its tail is mixed with the subsequent sound events.

• Routine 1 (Fig. 31): This routine is intended for the classes that do not include a crash or ride cymbal and have a small number of samples (i.e. <500).

• Routine 2 (Fig. 32): This routine does not include auxiliary events, as it is intended for classes that include a crash or ride cymbal, whose interaction with itself is intrinsic.

• Routine 3 (Fig. 33): This is a short version of routine 1, which only repeats each bar two times instead of four; it is intended for classes that do not include a crash or ride cymbal and have a large number of samples (i.e. >500).
Routines 1 and 3 were recorded only once, as we had only one instrument of each of those classes; differently, routine 2 was recorded twice for each cymbal, as we were able to use more instances of them. The different cymbal configurations used can be seen in Appendix A, in Figures 34, 35 and 36.

After the Train set recording the number of samples was more balanced; as shown in Figure 5, all the classes have at least 1500 samples.
[Bar chart: number of samples per class (hh, hh+kd, hh+sd, kd, sd, kd+sd, cy, cy+kd, cy+sd, cr, cr+kd, cr+sd, ft, ft+kd, ft+sd, ft+kd+sd, hh+kd+sd, mt, kd+mt, ht, ht+kd), y-axis from 0 to 3000, distinguishing samples existing before the recording from those recorded in the session]

Figure 5: Number of samples after Train set recording
Test set

The test set recording tried to simulate different students performing the same song on the same drumset. To do that, we recorded each song of the music school Drums Grade Initial and Grade 1, playing it correctly first and then making mistakes, in both reading and rhythmic ways. After testing with these recordings, we realized that we were not able to test the limits of the assessment system in terms of tempo or with different rhythmic measures. So we proposed two exercises of groove reading, in 4/4 and in 12/8, to be performed at different tempos; these recordings have been done in my study room with my laptop's microphone.
3.3 Data augmentation

As described in section 2.1.2, data augmentation aims to introduce changes to the signals to optimize the statistical representation of the dataset. To implement this task, the aforementioned Python library audiomentations is used.

The audiomentations library has a class called Compose, which allows collecting different processing functions and assigning a probability to each of them. The Compose instance can then be called several times with the same audio file, and each time the resulting audio will be processed differently because of the probabilities. In data_augmentation.ipynb10 a possible implementation is shown, as well as some plots of the original sample with different results of applying the created Compose to the same sample; an example of the results can be listened to in Freesound11.
The processing functions introduced in the Compose class are based on those proposed in [13] and [14]; their parameters are described below, and a minimal usage sketch follows the list.

• Add Gaussian noise, with 70% probability.

• Time stretch between 0.8 and 1.25, with 50% probability.

• Time shift forward a maximum of 25% of the duration, with 50% probability.

• Pitch shift of ±2 semitones, with 50% probability.

• Apply mp3 compression, with 50% probability.
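The sketch below builds such a Compose following the parameters above; the Gaussian-noise amplitude bounds and the soundfile-based I/O are assumptions, and the parameter names follow the audiomentations versions available at the time of writing.

# Sketch of the augmentation Compose; noise amplitude bounds and the
# file I/O are illustrative assumptions.
import soundfile as sf
from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                             PitchShift, Shift, TimeStretch)

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.7),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    Shift(min_fraction=0.0, max_fraction=0.25, p=0.5),  # forward shift only
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    Mp3Compression(p=0.5),
])

samples, sample_rate = sf.read('kd_sample.wav', dtype='float32')
# Each call rolls the probabilities again, so calling it repeatedly on
# the same sample yields differently processed variants.
augmented = augment(samples=samples, sample_rate=sample_rate)
sf.write('kd_sample_aug.wav', augmented, sample_rate)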
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213
3.4 Drums events trim

As will be explained in section 4.2.1, the dataset has to be trimmed into individual files to analyze them and extract the low-level descriptors. In the Dataset_featureExtraction.ipynb12 notebook this process has been implemented, slicing all the audios with their annotations, each dataset separately, in order to sight-check all the resultant samples and better detect which annotations were not correct.
3.5 Summary

To summarize, a drums samples dataset has been created; the one used in this project will be called the 40k Samples Drums Dataset. Nonetheless, to share this dataset we have to ensure that we fully own the data, which means that the samples that come from the IDMT, MDB Drums and Music School datasets cannot be shared in another dataset. Alternatively, we will share the 29k Samples Drums Dataset, formed only by the samples recorded in the studio. This dataset will be available in Zenodo13, to download the whole dataset at once, and in Freesound, where some selected samples are uploaded in a pack14 to show the differences among microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592
14 https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is inspired by [2]. Each box of the diagram refers to a section in this chapter, so the diagram might be helpful to get a general idea of the problem when explaining each process.

The system is divided into two main processes. First, the top boxes correspond to the training process of the model, using the dataset created in the previous chapter. Secondly, the bottom row shows how a student submission is processed to generate some feedback. This feedback is the output of the system and should give some indications to the student on how they have performed and how they can improve.
4.1 Problem definition

To check whether a student reads a music sheet correctly, we need some tool to tag which instruments of the drumset are playing for each detected event. This leads us to develop and train a drums event classifier: if this tool ensures a good accuracy when classifying (i.e. >95%), we will be able to properly assess a student's recording. If the classifier does not have enough accuracy, the system will not be useful, as we will not be able to differentiate between errors from the student and errors from the classifier.
[Diagram: the dataset (music scores, students' performances, annotations, audio recordings and assessments) feeds feature extraction and the training of the drums event classifier and of the performance assessment; a new student's recording goes through feature extraction and performance assessment inference, producing a visualization and performance feedback]

Figure 6: Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason, the project has been mainly focused on developing the aforementioned drums event classifier and a proper dataset. Consequently, developing a properly assessed dataset of drums interpretations has not been possible, nor has the performance assessment training. Despite this, the feedback visualization has been developed, as it is a nice way to close the pipeline and get some understandable results; moreover, the performance feedback can be focused on deterministic aspects, such as telling the student if they are rushing or dragging in relation to a given tempo.
4.2 Drums event classifier

As already mentioned, this section has been the main load of work for this project, because a reliable assessment depends on a correct automatic transcription. The process has been divided into 3 main parts: extracting the musical features, training and validating the model in an iterative process, and finally testing the model with totally new data.
4.2.1 Feature extraction

The feature extraction concept has been explained in section 2.1.1; it has been implemented using the MusicExtractor()1 method from Essentia's library.

The MusicExtractor() method has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()2 has been implemented to loop over a list of files and analyze each of them, adding the extracted features to a csv file jointly with the class of each drum event. At this point all the low-level features were extracted; both the mean and the standard deviation were computed across all the frames of the given audio file. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.
As mentioned in section 3.4, the fact that the MusicExtractor() method has to be called with a filename, not an audio stream, forced us to create another version of the dataset, which had each event annotated in a different audio file with the correspondent class label as filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples) and the features saved in different csv files, one for each dataset3. Adding up the number of instances in all the csv files, we get 40228 instances with 84 features and 1 label.
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices_features
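A condensed sketch of the extraction loop is shown below; the analysis sizes, file names and the trailing label column are illustrative assumptions, and only single-valued low-level statistics are kept.

# Sketch of extract_MusicalFeatures(): analyze each sliced event with
# MusicExtractor and append its low-level statistics plus label to a csv.
import csv
import essentia.standard as es

extractor = es.MusicExtractor(lowlevelStats=['mean', 'stdev'],
                              lowlevelFrameSize=2048, lowlevelHopSize=1024)

def extract_to_csv(filenames, labels, csv_path):
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        for filename, label in zip(filenames, labels):
            features, _ = extractor(filename)
            # Keep scalar low-level descriptors only (skip vector ones)
            names = [n for n in features.descriptorNames()
                     if n.startswith('lowlevel') and isinstance(features[n], float)]
            writer.writerow([features[n] for n in names] + [label])

extract_to_csv(['slices/hh_0001.wav'], ['hh'], 'features_studio.csv')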
4.2.2 Training and validating

As mentioned in section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) to do sound event classification; some authors have also developed more complex methods for drums event classification. The complexity of these last methods made us choose the generic ones, also to see whether they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.

The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40k Drums samples dataset. The first times we trained the models we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula 4.1, which does not take into account the unbalance of the dataset. The accuracy computation was around 92%, but the correct predictions were mainly on the large classes; as shown in Figure 7, some classes had very low accuracy (even 0%, as some classes had 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances affects the accuracy computation less.
$$\text{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i) \qquad (4.1)$$
Otherwise, the proper way to compute the accuracy in this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages the accuracy over all the classes, as in formula 4.2, where $\hat{w}_i$ represents the weight of each class in the dataset. This computation lowered the result to 79%, which was not a good result.

$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \qquad (4.2)$$

$$\text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i$$
Figure 7: Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score, which combines the precision and the recall of the model in one measure, as in formula 4.3. Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of predictions that should be correct for a given class.

$$F\text{-measure} = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (4.3)$$
Having these results led us to the process of recording a personalized dataset to extend the already existing one (see section 3.2.2). With this new distribution the results improved, as shown in Figure 8, with better balanced accuracy and f-score (both 89%). Until this point we were using both KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8: Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine refers to the regularization; this technique is intended to make a model less sensitive to data noise and to the outliers that may not represent the class properly. When increasing this value to 10, the results improved across all the classes, as shown in Figure 9, as well as the accuracy and f-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was somewhat of a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So we tried the same process with the classes that include only the three mentioned instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and f-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9: Confusion matrix after training with the dataset in Figure 5 and parameter C = 10

Figure 10: Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The training and validating iterative process has been implemented in the Classifier_training.ipynb4 notebook. First, the csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered, and to remove redundant features the ones with a very low standard deviation are deleted (i.e. std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to center all the feature distributions around 0, ensuring a standard deviation of 1.
In the next cells the dataset is split into train and validation sets, and the training method of the sklearn SVM is called to perform the training; when the models are trained, the parameters are dumped to a file, to load the model a posteriori and be able to apply the learned knowledge to new data. This process was very slow on my computer, so we decided to upload the csv files to Google Drive and open the notebook with Google Colaboratory, as it was faster, which is a key feature to avoid long waiting times during the iterative train-validate process. In the last cells the inference is made with the validation set, the accuracy is computed, and the confusion matrix is plotted to get an idea of which classes are performing better.
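The essence of that notebook can be sketched as follows; the csv file names, the 'label' column and the 80/20 split are assumptions made for the example, not the notebook's exact settings.

# Sketch of the train/validate loop: load features, drop near-constant
# columns, normalize, train an SVM with C=10 and persist it.
import pandas as pd
from joblib import dump
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = pd.concat([pd.read_csv(f) for f in ('features_studio.csv',
                                           'features_mdb.csv')])
X, y = data.drop(columns=['label']), data['label']
X = X.loc[:, X.std() > 1e-5]          # remove redundant features

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)

clf = SVC(C=10).fit(scaler.transform(X_train), y_train)
y_pred = clf.predict(scaler.transform(X_val))
print('balanced accuracy:', balanced_accuracy_score(y_val, y_pred))
print('f-score (macro):', f1_score(y_val, y_pred, average='macro'))

# Persist model and scaler so the assessment notebook can reuse them
dump(clf, 'svm_model.joblib')
dump(scaler, 'scaler.joblib')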
4.2.3 Testing

Testing the model introduces the concept of onset detection: until now all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection5 does both tasks. As explained in section 2.1.1, there are many methods to do onset detection, and each of them is better suited to a different application. In the case of drums, we have tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works pretty well; but when the tempo increases, some onsets are not correctly detected. For this reason, we finally implemented the onset detection with the HFC method. This method computes the HFC for each window as in equation 4.4; note that high-frequency bins (index k) weigh more in the final value of the HFC.

$$HFC(n) = \sum_k |X_k[n]|^2 \cdot k \qquad (4.4)$$

4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
Moreover, the function plots the audio waveform jointly with the onsets detected, to check that it has worked correctly after each test. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.

Figure 11: Onsets detected in a 60 bpm drums interpretation

Figure 12: Onsets detected in a 220 bpm drums interpretation
With the onsets information the audio can be trimmed into the different events; the order is maintained in the file names, so when comparing with the expected events they can be mapped easily. The audios are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a csv file.
To predict which event each slice is, the already trained models are loaded in this new environment and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns, for each row in the data, the predicted event. The described process is implemented in the first part of Assessment.ipynb6; the second part is intended to execute the visualization functions described in the next section.
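A sketch of that inference step, under the same assumed file names as the training sketch above:

# Sketch: load the persisted scaler and SVM, then classify the features
# extracted from the sliced submission (one csv row per detected onset).
import pandas as pd
from joblib import load

scaler = load('scaler.joblib')
clf = load('svm_model.joblib')

slices = pd.read_csv('submission_features.csv')
predicted_events = clf.predict(scaler.transform(slices))
# Row order follows onset order, so each prediction can be compared
# one-to-one with the event expected from the score annotation.
print(list(predicted_events))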
4.3 Music performance assessment

Finally, as already commented, the assessment part has been focused on giving visual feedback on the interpretation to the student. As the drums classifier has taken so much time, the creation of a dataset with interpretations and their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different levels of skill, but grading them and doing the whole process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed ones.
So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization

With the data gathered in the testing section, feedback on the interpretation has to be returned. Having as a base implementation the solution of my companion Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb.

First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.

6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (this comes from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio.

In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (from green, good, to red, bad).
Figure 13: Onset deviation plot of a good tempo submission

Figure 14: Onset deviation plot of a bad tempo submission
Once the waveform is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes()10, the expected and predicted events are compared, creating a list with 0 for False and 1 for True; then the resulting list is iterated and the 0 indices are checked, because most of the classification errors fail in only one of the instruments to be predicted (i.e. instead of hh+sd it predicts sd). These cases are considered partially correct, as the system has to take its own errors into account: for the indices in which one of the instruments is correctly predicted and it is not a hi-hat (we are considering it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is set to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct, and red means incorrect.

9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15: Example of coloured notes
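The partial-credit rule of assess_notes() can be condensed as in the sketch below; the exact implementation in the repository may differ.

# Sketch of the assess_notes() partial-credit logic: 1.0 correct,
# 0.75 partially correct (a non-hi-hat instrument matched), 0.0 incorrect.
def assess_notes(expected, predicted):
    scores = []
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            scores.append(1.0)
            continue
        shared = set(exp.split('+')) & set(pred.split('+'))
        # Credit only when a matched instrument is not the ubiquitous hi-hat
        scores.append(0.75 if shared - {'hh'} else 0.0)
    return scores

print(assess_notes(['hh+sd', 'hh+kd', 'sd'], ['sd', 'hh', 'sd']))
# [0.75, 0.0, 1.0]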
With the waveform, the notes assessed and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the waveform previously created; this is done with the LilyPond macros. On one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the notes assessed); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends, in this case from 0 to 16, which in a 4/4 rhythm is 4 bars; the other number is the scale of the waveform, which allows fitting the plot better with the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps, found in the txt file already mentioned in section 3.1.1. Then the LilyPond file: the template, written in the LilyPond language, that defines the resulting music sheet; the macros to change color and size and to add the waveform are defined in it. When extracting the musical features, each submission creates its csv file to store the information. And finally, of course, we need the audio files with the recorded submission to be assessed.
11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding audio signal quality and tempo.
The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB until an accumulated +30 dB.
In this chapter and in Appendix B all the resulting feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high-frequency content method works perfectly on these recordings, there are no limitations or errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positives.
1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops as the tempo increases. Taking as a reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one weighs 0.5, and an incorrect one 0; the total value is the mean of the weighted results of the predictions.
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that by increasing the tempo of exercise 1, the accuracy of the classifier decreases. This may be because increasing the tempo decreases the spacing between events, and consequently the duration of each event, which leads to fewer values to calculate the mean and standard deviation when extracting the timbre characteristics. As stated in the Law of Large Numbers [25], the larger the sample, the closer the sample mean is to the population mean. In this case, having fewer values in the calculation produces more outliers in the distribution, which tends to scatter.
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48
Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But at 100 bpm in 12/8 the equivalent tempo is 300 quarter notes per minute, similar to 140 bpm in 4/4, whose equivalent tempo is 280 quarter notes per minute. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875
Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that it produces a high-frequency content that is predicted as a hi-hat in all cases.
Level    Correct   Partially OK   Incorrect   Total
+0 dB    25        7              0           0.89
+6 dB    23        9              0           0.86
+12 dB   23        9              0           0.86
+18 dB   24        7              1           0.86
+24 dB   18        5              9           0.64
+30 dB   13        5              14          0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0 dB    12        7              13          0.48
+6 dB    13        10             9           0.56
+12 dB   10        8              14          0.5
+18 dB   9         2              21          0.31
+24 dB   8         0              24          0.25
+30 dB   9         0              23          0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6 dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6 dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of the results has been focused on the accuracy of the drums event classifier, but we think it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the corresponding onset time step. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar, and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in the previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starts at 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68
Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9
Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system performs the assessment based on the predictions quite well. This can be helpful for a student to know which parts of the music sheet are read correctly and which are not. Also, the tempo visualization can help the student recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the corresponding expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. A set of further improvements is also given, together with a final thought on my work and my learning. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Keeping in mind all the concepts explained throughout this document, we can now list them, defining their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their own projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of such interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained for is very hard for a human being: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the 40k Samples Drums Dataset. It would be good to get recordings in different spaces and with different drumsets and microphones to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even different people playing. This would introduce more variance in the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods from [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org
Also, in terms of the classes in the drumset, there is still a long way to go. There are no solutions that robustly transcribe a whole drumset, including the toms and the different kinds of cymbals. In this sense, we think a proper approach would be to work with professional musicians, who can help researchers better understand the instrument and create datasets covering different techniques.
Regarding the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained on assessed drums exercises to give each student a mark. On this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and other researchers and students can execute them, obtaining the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository3, and some samples in Freesound4. This is the 29k Samples Drums Dataset: not all of the 40k samples used for training are our property, so we are not able to share them under our full authorship; despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although work remains in terms of robustness and reliability; a first approximation has been presented, as well as several proposed paths of improvement.
Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web counterpart, GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which was a tough path to take. In a more advanced phase of the project, I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebook on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that results from the system, and I hope that more people get interested in this field of research, to build better tools in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Machine Learning 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2003).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2001).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects (DAFx 2012) Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
[Score image at 60 bpm; music engraving by LilyPond 2.18.2 (www.lilypond.org)]
Figure 31 Recording routine 1
[Score image at 60 bpm; music engraving by LilyPond 2.18.2 (www.lilypond.org)]
Figure 32 Recording routine 2
[Score image at 60 bpm; music engraving by LilyPond 2.18.2 (www.lilypond.org)]
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
1.3 Identified challenges
As mentioned in [2], there are still improvements to be made in the field of music assessment, especially when analyzing expressivity and advanced-level performances. Taking into account the scope of this project, and having as a base case the guitar assessment exercise from Music Critic, some specific challenges are described below.
1.3.1 Guitar vs drums
As defined in [4], a drumset is a collection of percussion instruments, mainly cymbals and drums, even though some genres may need to add cowbells, tambourines, pailas or other instruments with specific timbres. Moreover, percussion instruments are split into two families: membranophones and idiophones. Membranophones produce sound primarily by hitting a stretched membrane; by tuning the membrane tension, different pitches can be obtained [5]. Differently, idiophones produce sound by the vibration of the instrument itself, and the pitch and timbre are defined by its own construction [5]. The aforementioned vibration is produced by hitting the instruments; generally this hit is made with a wooden stick, but some genres may use brushes, hotrods or mallets to excite specific modes of the instruments. With all this in mind, and as stated in [1], it is clear that transcribing drums, taking into account all their variants and nuances, is a hard task even for a professional drummer. Therefore, there is a need to simplify the problem and to limit the instruments of the drumset to be transcribed.
Returning to the assessment task: guitars play tuned notes and chords, so the way to check whether a music sheet has been read correctly is to look for the pitch information and compare it to the expected one. Differently, the instruments that form a drumset are mainly unpitched (except the toms, which are tuned using different scales and tuning paradigms), so the differences among drums events lie in the timbre. A different approach has to be defined in order to check which instrument is being played for each detected event; the first idea is to apply machine learning for sound event classification.
Throughout the project we will refer to the different instruments that form a drumkit with abbreviations. In Table 1 the legend used is shown; the combination of 2 or more instruments is represented with a '+' symbol between the different tags.
Instrument     Kick Drum   Snare Drum   Floor tom   Mid tom   High tom
Abbreviation   kd          sd           ft          mt        ht

Instrument     Hi-hat   Ride cymbal   Crash cymbal
Abbreviation   hh       cy            cr
Table 1 Abbreviations' legend
1.3.2 Dataset creation
Keeping in mind the idea from the previous section, if a machine learning approach is to be implemented, there is a basic need to obtain drums audio data. Apart from the audio data, proper annotations of the drums interpretations are needed in order to slice them correctly and extract musical features of the different events.
The process of gathering data should take into account the different possibilities that a drumset offers in terms of timbre, loudness and tone. Several datasets should be combined, as well as additional recordings with different drumsets, in order to have a balanced and representative dataset. Moreover, to evaluate the assessment task, a set of exercises has to be recorded with different levels of skill.
There is also the need to capture those sounds with several frequency responses, in order to make the model independent of the microphone. Those samples could also be processed to get variations of each of them through data augmentation processes.
1.3.3 Signal quality
Regarding the assignment, we have to take into account that a student will not be able to record their interpretations with a setup like the one used in a studio recording; most of the time the recordings will be made with a laptop or mobile phone microphone. This fact has to be taken into account when training the event classifier, in order to do data augmentation and introduce these transformations into the dataset, e.g. adding noise to the samples or amplifying them to get overload distortion.
1.4 Objectives
The main objective of this project is to develop a tool to assess drums interpretations of a proposed music sheet. This objective has to be split into the different steps of the pipeline:
• Generate a correctly annotated drums dataset, meaning a collection of audio drums recordings and their annotations, all equally formatted.
• Implement a drums event sound classifier.
• Find a way to properly visualize drums sheets and their assessment.
• Propose a list of exercises to evaluate the technology.
In addition, publishing the code in a public GitHub6 repository and uploading the created dataset to Freesound7 and Zenodo8 will be a good way to share this work.
1.5 Project overview
The next chapters are developed as follows. In chapter 2 the state of the art is reviewed, focusing on signal processing algorithms and ways to implement sound event classification, ending with music sheet technologies and the software tools available nowadays. In chapter 3 the creation of a drums dataset is described, presenting the use of already available datasets and how new data has been recorded and annotated. In chapter 4 the methodology of the project is detailed: the algorithms used for training the classifier, as well as how new submissions are processed to assess them. In chapter 5 an evaluation of the results is done, pointing out the limitations and the achievements. Chapter 6 concludes with a discussion on the methods used, the work done, and further work.
6 https://github.com
7 https://freesound.org
8 https://zenodo.org
Chapter 2
State of the art
In this chapter the concepts and technologies used in the project are explained, covering algorithm references and existing implementations. First, signal processing techniques for onset detection and feature extraction are reviewed; then the sound event classification field is presented, along with its relationship to drums event classification. The principal music sheet technologies and codecs are also presented. Finally, specific software tools are listed.
2.1 Signal processing
2.1.1 Feature extraction
In the following sections sound event classification will be explained; most of these methods are based on training models using features extracted from the audio, not on the audio chunks themselves [6]. In this section, signal processing methods to obtain those features are presented.
Onset detection
In an audio signal, an onset is the beginning of a new event; it can be a single note, a chord or, in the case of the drums, the sound produced by hitting one or more instruments of the drumset. It is necessary to have a reliable algorithm that properly detects all the onsets of a drums interpretation. With the onset information (a list of timestamps), the audio can be sliced to analyze each chunk separately and to assess the tempo consistency.
It is important to address the challenge in a psychoacoustical way, as the objective is to detect the musical events as a human would. In [7] the idea of a perceptual onset for percussive instruments is defined as a time interval between the physical onset and the moment the maximum level is reached. In [8] many methods are reviewed, focusing on the differences in performance depending on the signal: Non-Pitched Percussive instruments are better detected with temporal methods or high-frequency content methods, while Pitched Non-Percussive instruments may need to take into account changes of energy in the spectrum distribution, as the onset may represent a different note.
The sound generated by the drums is mainly percussive (discarding brushes' slow patterns or mallet build-ups on the cymbals), which means that it is formed by a short transient followed by a short decay; there is no sustain. As the transient is a fast change of energy, it implies high-frequency content, because the changes happen in a very small frame of time. As recommended in [9], the HFC method will be used.
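A minimal sketch of HFC-based onset detection with Essentia, following the OnsetDetection/Onsets workflow documented by the library, is shown below; the audio filename is an illustrative placeholder.

    import essentia
    import essentia.standard as es

    audio = es.MonoLoader(filename='student_take.wav')()  # hypothetical file

    onset_func = es.OnsetDetection(method='hfc')  # high-frequency content
    windowing = es.Windowing(type='hann')
    fft = es.FFT()
    c2p = es.CartesianToPolar()

    # Compute the HFC detection function frame by frame
    hfc_curve = []
    for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
        mag, phase = c2p(fft(windowing(frame)))
        hfc_curve.append(onset_func(mag, phase))

    # Peak-pick the detection function to obtain onset times in seconds
    onset_times = es.Onsets()(essentia.array([hfc_curve]), [1])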
Timbre features
As described in [10], a feature denotes in some way a quantity or a value. Features extracted by processing the audio stream, or transformations of it (i.e. the FFT), are called low-level descriptors; these features carry no relevant information from a human point of view, but are useful for computational processes [11].
Some low-level descriptors are computed from the temporal information: for instance, the zero-crossing rate tells the number of times the signal crosses the zero axis per second, the attack time is the duration of the transient, and the temporal centroid describes the energy distribution of an event over time. Other well-known features are the root mean square of the signal or the high-frequency content mentioned in section 2.1.1.
Besides temporal features, low-level descriptors can also be computed from the frequency domain. Some of them are spectral flatness, spectral roll-off, spectral slope, spectral flux, i.a.
Nowadays, Essentia's library offers a collection of algorithms that reliably extract the aforementioned low-level descriptors; the function that encompasses all the extractors is called MusicExtractor1.
2.1.2 Data augmentation
Data augmentation processes refer to the optimization of the statistical representation of the datasets, in terms of improving the generalization of the resulting models. These methods are based on the introduction of unobserved data or latent variables that may not be captured during the dataset creation [12].
Regarding this technique applied to audio data, signal processing algorithms that introduce changes to the signals in both the time and frequency domains are proposed in [13] and [14]. In these articles the goal is to improve accuracy in speech and animal sound recognition, although this could also apply to drums event classification.
The processes that led to the best results in [13] and [14] were related to time-domain transformations, for instance time shifting and stretching, adding noise or harmonic distortion, compressing into a given dynamic range, i.a. Other proposed processes were focused on the spectrogram of the signal, applying transformations such as shifting the matrix representation, setting some areas to 0, or adding spectrograms of different samples of the same class.
Presently, some Python2 libraries are developed and maintained to perform audio data augmentation tasks, for instance audiomentations3 and its GPU version, torch-audiomentations4.
1 https://essentia.upf.edu/streaming_extractor_music.html
2 https://www.python.org
3 https://pypi.org/project/audiomentations/0.6.0
4 https://pypi.org/project/torch-audiomentations
2.2 Sound event classification
Sound event classification is the task of detecting and recognizing sound events in an audio stream [15]. As described in [10], this task can be approached from two sides: on one hand, the perceptual approach tries to extract timbre similarity to cluster sounds by how we perceive them; on the other hand, the taxonomic approach aims to label sound events as they are defined in cultural or user-biased taxonomies. In this project the focus is on the second approach, as the task is to classify sound events within the drums taxonomy (i.e. kick drum, snare drum, hi-hat).
Also in [10] many classification methods are proposed; concretely, for the taxonomic approach, machine learning algorithms such as K-Nearest Neighbors, Support Vector Machines or Neural Networks, all of them using features extracted from the audio data, as explained in section 2.1.1.
2.2.1 Drums event classification
This section is divided into two parts: first presenting the state-of-the-art methods for drums event classification, and then the most relevant existing datasets. This section is mainly based on the article [1], as it is a review of the topic and encompasses the core concepts of the project.
Methods
Focusing on taxonomic drums event classification, this field has been studied over the last years; in the Music Information Retrieval Evaluation eXchange5 (MIREX) it has been a proposed challenge since 20056. In [1] a review of the main methods that have been investigated is done. The authors collect different approaches, such as Recurrent Neural Networks, proposed in [16], Non-Negative Matrix Factorization, proposed in [17], and others working in real time using Max/MSP7, as described in [18].
5 https://www.music-ir.org/mirex/wiki/MIREX_HOME
6 https://www.music-ir.org/mirex/wiki/2005:Audio_Drum_Detection_Results
7 https://cycling74.com/products/max
It needs to be mentioned that the proposed methods are focused on Automatic Drum Transcription (ADT) of drumsets formed only by the kick drum, snare drum and hi-hat. The ADT field is intended to transcribe audio, but in our case we have to check whether an audio event is the expected event or not; this particularity can be used in our favor, as some assumptions can be made about the audio that has to be analyzed.
Datasets
In addition to the methods and their combinations, the data used to train the system plays a crucial role; as a result, the dataset may have a big impact on the generalization capabilities of the models. In this section some existing datasets are described.
• IDMT-SMT-Drums [19]: Consists of real drum recordings containing only kick drum, snare drum and hi-hat events. Each recording has its transcription in xml format, and it is publicly available to download8.
• MDB Drums [20]: Consists of real drums recordings covering a wide range of genres, drumsets and styles. Each recording has two txt transcriptions, for the classes and subclasses defined in [20] (e.g. class: hi-hat; subclasses: closed hi-hat, open hi-hat, pedal hi-hat). It is publicly available to download9.
• ENST-Drums [21]: Consists of real drum audio and video recordings of different drummers and drumsets. Each recording has its transcription, and some of them include accompaniment audio. It is publicly available to download10.
• DREANSS [22]: Differently, this dataset is a collection of drum recordings datasets that have been annotated a posteriori. It is publicly available to download11.
Electronic drums datasets have not been considered, as the student assignment is supposed to be recorded with a real drumset.
8 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
9 https://github.com/CarlSouthall/MDBDrums
10 https://perso.telecom-paristech.fr/grichard/ENST-drums
11 https://www.upf.edu/web/mtg/dreanss
2.3 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter programs of the 80s. Proprietary software such as Finale12 and Sibelius13, or open-source software such as MuseScore14 and LilyPond15, are some options that can be used nowadays to write music sheets with a computer.
In terms of file format, Sibelius has its encrypted version that can only be read and written with the software; it can also write and read MusicXML16 files, which are not encrypted and are similar to an HTML file, as they contain tags that define the bars and notes of the music sheet. This format is the standard for exchanging digital music sheets.
Within Music Critic's framework, the technology used to display the evaluated score is LilyPond; it can be called from the command line and allows adding macros that change the size or color of the notes. The other particularity is that it uses its own file format (ly), and scores in MusicXML format have to be converted and reviewed.
2.4 Software tools
Many of the concepts and algorithms mentioned above are already available as software libraries. This project has been developed in Python, and in this section the libraries that have been used are presented. Some of them are open and public, and some others are private, such as Pysimmusic, which has been shared with us so we can use and consult it. In addition, all the code has been developed using a tool from Google called Colaboratory17; it allows writing code in the Jupyter notebook18 format, which is agile to use and execute interactively.
12 https://www.finalemusic.com
13 https://www.avid.com/sibelius
14 https://musescore.org
15 https://lilypond.org
16 https://www.musicxml.com
17 https://colab.research.google.com
18 https://jupyter.org
2.4.1 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis, description and synthesis [23]; it can also be installed as a Python library with the pip19 command in Linux, or by compiling with certain flags in macOS20. This library includes a collection of MIR algorithms; it is not a framework, so it is in the user's hands how to use these processes. Some of the algorithms used in this project are music feature extraction, onset detection and audio file I/O.
2.4.2 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine learning algorithms for regression, classification and clustering, as well as pre-processing and dimensionality reduction functions. It is based on NumPy22 and SciPy23, so its algorithms are easy to adapt to the most common data structures used in Python. It also allows saving and loading trained models to do inference tasks with new data.
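As a minimal sketch of the workflow described above, the snippet below fits an SVM classifier on pre-extracted feature vectors and persists it with joblib; the data, the C value and the file name are illustrative placeholders, not the project's actual configuration.

    import joblib
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X = np.random.rand(200, 24)  # 200 events x 24 timbre descriptors (dummy data)
    y = np.random.choice(['kd', 'sd', 'hh'], size=200)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model = make_pipeline(StandardScaler(), SVC(C=10))
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))

    joblib.dump(model, 'drums_event_svc.joblib')  # save for later inference
    model = joblib.load('drums_event_svc.joblib')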
2.4.3 Lilypond
As described in section 2.3, LilyPond is an open-source scorewriter software with its own file format and language. It can produce visual renders of music sheets in PNG, SVG and PDF formats, as well as MIDI files to listen to the compositions. LilyPond works on the command line and allows us to introduce macros that modify visual aspects of the score, such as color or size.
It is the digital sheet music technology used within Music Critic's framework, as it allows embedding an image in the music sheet, generating a parallel representation of the music sheet and a student's interpretation.
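A minimal sketch of driving LilyPond from Python, as needed when rendering the assessed score, could look like the following; the file names are illustrative.

    import subprocess

    # Render score_template.ly to output.png using the LilyPond command line
    subprocess.run(['lilypond', '--png', '-o', 'output', 'score_template.ly'],
                   check=True)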
19 https://pypi.org/project/pip
20 https://essentia.upf.edu/installing.html
21 https://scikit-learn.org
22 https://numpy.org
23 https://www.scipy.org/scipylib/index.html
2.4.4 Pysimmusic
Pysimmusic is a private Python library developed at the MTG. It offers tools to analyze the similarity of musical performances and uses libraries such as Essentia, LilyPond, FFmpeg24, i.a. Pysimmusic contains onset detection algorithms and a collection of audio descriptors and evaluation algorithms. By now it is the main evaluation software used in Music Critic to compare the submitted recording with the reference.
2.4.5 Music Critic
Music Critic is a project from the MTG intended to support technologies for online music education, facilitating the assessment of student performances25.
The proposed workflow starts with a student submitting a recording of the proposed exercise. The submission is then sent to the Music Critic server, where it is analyzed and assessed. Finally, the student receives the evaluation jointly with the feedback from the server.
2.5 Summary
Music information retrieval and machine learning have been popular fields of study. This has led to a large development of methods and algorithms that are crucial for this project. Most of them are free and open-source and, fortunately, the private ones have been shared by the UPF research team, which is a great base on which to start the development.
24 https://www.ffmpeg.org
25 https://www.upf.edu/web/mtg/tech-transfer/-/asset_publisher/pYHc0mUhUQ0G/content/id/229860881/maximized#.YJrB-usp7YV
Chapter 3
The 40k Samples Drums Dataset
As stated in section 1.3.2, having a well-annotated and balanced dataset is crucial to get proper results. In this section the creation process of the 40k Samples Drums Dataset is explained: first, focusing on how to process existing datasets, such as those mentioned in section 2.2.1; secondly, introducing the process of creating new datasets from a music school corpus and a collection of recordings made in a recording studio; and finally, describing the data augmentation procedure and how the audio samples are sliced into individual drums events. In Figure 1 we can see the different procedures used to unify the annotations of the different datasets, while the audio does not need any specific modification.
3.1 Existing datasets
Each of the existing datasets has a different annotation format; in this section the process of unifying them is explained, as well as its implementation (see notebook Dataset_formatUnification.ipynb1). As the events to take into account can be single instruments or combinations of them, the annotations have to be formatted to represent those events properly. None of the annotations has this approach, so we have written a function that filters the list and joins the events separated by a small time difference, meaning that they are played simultaneously.
1 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
[Diagram removed in transcription: the Music school corpus (Sibelius to MusicXML, MusicXML parser to txt), the Studio REC corpus (audio + txt) and the IDMT Drums and MDB Drums datasets all converge into written annotations, unified as Annotations + Audio]
Figure 1 Datasets pre-processing
3.1.1 MDB Drums
This dataset was the first we worked with; its txt annotation format was a key factor, as it was easy to read and understand. As the dataset is available on Github2, there is no need to download it or process it from a local drive. As shown in the first cells of Dataset_formatUnification.ipynb, data from the repository can be retrieved with a Python wrapper of the Github API3.
This dataset has two annotation files, depending on how deep the taxonomy used is [20]. In this case the generic class taxonomy is used, as there is no need to differentiate styles when playing a given instrument (i.e. single stroke, flam, drag, ghost note).
3.1.2 IDMT Drums
Differently from the previous dataset, this one is only available by downloading a zip file4. It also differs in the annotation file format, which is xml. Using the Python package xmltodict5, in the second part of Dataset_formatUnification.ipynb the xml files are loaded as a Python dictionary and converted to txt format.
2 https://github.com/CarlSouthall/MDBDrums
3 https://pypi.org/project/githubpy
4 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
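A minimal sketch of this conversion is shown below; the tag names inside the parsed dictionary are hypothetical, as the exact structure depends on the IDMT annotation schema.

    import xmltodict

    with open('annotation.xml') as f:
        doc = xmltodict.parse(f.read())

    # Hypothetical structure: a list of events with an onset time and a label
    events = doc['annotation']['event']
    with open('annotation.txt', 'w') as out:
        for ev in events:
            out.write(ev['onsetSec'] + '\t' + ev['instrument'] + '\n')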
3.2 Created datasets
In order to expand the dataset with a greater variety of samples, other methods to get data have been explored. On one hand, there is audio data that has partial annotations, or some representation that is not data-driven, such as a music sheet, which contains a visual representation of the music but not a logic annotation, as mentioned in the previous section. On the other hand, generating simple annotations is an easy task, so drums samples can be recorded standalone to create data in a controlled environment. These methods are described in the next two sections.
3.2.1 Music school
A music school has shared its teaching material with the MTG for research purposes: audio demos, books in pdf format, and music sheets in Sibelius format. As we can see in Figure 1, the annotations from the music school corpus are in Sibelius format; this is an encrypted representation of the music sheet that can only be opened with the Sibelius software. The MTG has shared an AVID license, which includes the Sibelius software, so we were able to convert the sib files to MusicXML. MusicXML is not encrypted and can be opened and read, so a parser has been developed to convert the MusicXML files to a symbolic representation of the music sheet. This representation has been inspired by [24], which proposes a system to represent chords.
MusicXML parser
As mentioned in section 2.3, the MusicXML format is based on ordering the visual information with tags, creating a tree structure of nested dictionaries. In the first cell of XML_parser.ipynb6, two functions are defined. ConvertXML2Annotation reads the musicxml file and gets the general information of the song (i.e. tempo, time measure, title); then a for loop runs through all the bars of the music sheet, checking whether the given bar is self-defined, a repetition of the previous one, or the beginning or end of a repetition in the song (see Figure 2); in the self-defined case, the bar is passed to an auxiliary function, which parses it to obtain the aforementioned symbolic representation.
5 https://pypi.org/project/xmltodict
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three others have a symbol that means 'repetition of the previous bar'; moreover, the bar lines at the beginning and the end represent that these four bars have to be repeated. Therefore, this line in the music score represents an interpretation of eight bars, repeating the first one.
The symbolic representation that we propose, based on [24], defines each bar with a string; this string contains the representations of the events in the bar, separated by blank spaces. Each of the events has two dots (:) separating the figure (i.e. quarter note, half note, whole note) from the note or notes of the event, which are separated by a dot (.). For instance, the symbolic representation of the first bar in Figure 2 is F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4.
In addition to this conversion, in the parse_one_measure function from the XML_parser notebook, each measure is checked to ensure that it fully represents the bar. This means that the sum of the figures of the bar has to be equal to the one defined in the time measure: the sum of the events in a 4/4 bar has to be equal to four quarter notes.
Symbolic notation to unified annotation format
As we can see in Figure 1, once the music scores are converted to the symbolic representation, the last step is to unify the annotations with the format used in section 3.1. This process is done in the last cells of the Dataset_formatUnification7 notebook. A dictionary with the translation of notes to drums instruments is defined, so the note is directly converted. Differently, the timestamp of each event has to be computed from the tempo of the song and the figure of each event; this process is done with the function get_time_steps_from_annotations8, which reads the interpretation in symbolic notation and accumulates the duration of each event based on its figure and the tempo.
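The accumulation it performs can be sketched as follows; this is an illustration of the idea, not the actual get_time_steps_from_annotations.

    def get_time_steps(figures, bpm):
        # figures: figure denominator per event (4 = quarter note, 8 = eighth note, ...)
        quarter_sec = 60.0 / bpm  # duration of one quarter note in seconds
        t, times = 0.0, []
        for f in figures:
            times.append(t)             # onset of the current event
            t += quarter_sec * 4.0 / f  # advance by the event's duration
        return times

    print(get_time_steps([4, 4, 8, 8, 4], bpm=60))  # [0.0, 1.0, 2.0, 2.5, 3.0]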
3.2.2 Studio recordings
At this point of the dataset creation we realized that the already existing data was very unbalanced in terms of instances per class: some classes had around two thousand samples while others had only ten. This situation was the reason to record a personalized dataset to balance the overall class distribution, as well as exercises read with different accuracy, simulating students with different skill levels.
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants, Barcelona); the first day was dedicated to mounting the drumset and the microphones, which are listed in Table 2. In Figure 3 the microphone setup is shown; differently from the standard setup, in which each instrument of the set has its own microphone, this distribution of the microphones was intended to record the whole drumset with different frequency responses.
The recording process was divided into two phases: first, creating samples to balance the dataset used to train the drums event classifier (called train set); then, recording the students' assignment simulations to test the whole system (called test set).
7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
9 https://www.stereodosis.com
Microphone            Transducer principle
Beyerdynamic TG D70   Dynamic
Shure PG52            Dynamic
Shure SM57            Dynamic
Sennheiser e945       Dynamic
AKG C314              Condenser
AKG C414              Condenser
Shure PG81            Condenser
Samson C03            Condenser
Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only the classes that appear in the music school subset; this decision was motivated by the idea of assessing the songs from the books, so only the classes present in the collection of songs were needed to train the classifier. In Figure 4 the distribution of the selected classes before the recordings is shown; note that it is in logarithmic scale, so there is a large difference among classes.
Figure 4 Number of samples before Train set recording
To organize the recording process, we designed 3 different routines; depending on the class and the number of samples already existing, a different routine was recorded. These routines were designed trying to represent the different speeds, dynamics and interactions between instruments of a real interpretation. In Appendix A the routine scores are shown; to write a generic routine, a two-line stave is used: the bottom line represents the class to be recorded, and the top line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound lasts a long period of time and whose tail mixes with the subsequent sound events.
• Routine 1 (Fig. 31): intended for the classes that do not include a crash or ride cymbal and have a small number of samples (i.e. <500).
• Routine 2 (Fig. 32): does not include auxiliary events, as it is intended for classes that include a crash or ride cymbal, whose interaction with itself is intrinsic.
• Routine 3 (Fig. 33): a short version of routine 1, which only repeats each bar two times instead of four; intended for the classes that do not include a crash or ride cymbal and have a large number of samples (i.e. >500).
Routines 1 and 3 were recorded only once, as we had only one instrument for each of those classes; differently, routine 2 was recorded twice for each cymbal, as we were able to use more instances of them. The different cymbal configurations used can be seen in Appendix A, Figures 34, 35 and 36.
After the Train set recording, the number of samples was somewhat more balanced; as shown in Figure 5, all the classes have at least 1500 samples.
[Bar chart removed in transcription: number of samples per class, on a scale from 0 to 3000, for the classes ht+kd, kd+mt, ht, mt, ft+sd, ft+kd+sd, cr+sd, ft, cr+kd, cr, ft+kd, hh+kd+sd, kd+sd, cy+sd, cy, cy+kd, sd, kd, hh+sd, hh+kd and hh, distinguishing the samples recorded in the studio from those existing before the recording]
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song on the same drumset; to do that, we recorded each song of the music school Drums Grade Initial and Grade 1, first playing it correctly and then making mistakes in both reading and rhythm. After testing with these recordings, we realized that we were not able to test the limits of the assessment system in terms of tempo or with different rhythmic measures, so we proposed two groove-reading exercises, in 4/4 and in 12/8, to be performed at different tempos; these recordings have been done in my study room with my laptop's microphone.
3.3 Data augmentation
As described in section 2.1.2, data augmentation aims to introduce changes to the signals to optimize the statistical representation of the dataset. To implement this task, the aforementioned Python library audiomentations is used.
The library offers a class called Compose, which allows collecting different processing functions and assigning a probability to each of them. The Compose instance can then be called several times with the same audio file, and each time the resulting audio will be processed differently because of the probabilities. In data_augmentation.ipynb10 a possible implementation is shown, as well as plots of the original sample next to different results of applying the created Compose to it; an example of the results can be listened to in Freesound11.
The processing functions introduced in the Compose class are based on those proposed in [13] and [14]; their parameters are described below, and a code sketch follows the list.
• Add Gaussian noise, with 70% probability.
• Time stretch between 0.8 and 1.25, with 50% probability.
• Time shift forward a maximum of 25% of the duration, with 50% probability.
• Pitch shift ±2 semitones, with 50% probability.
• Apply mp3 compression, with 50% probability.
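A minimal sketch of such a Compose, mirroring the listed transforms and probabilities, is shown below. Argument names follow recent audiomentations releases and may differ from the 0.6.0 version pinned in the footnote; Mp3Compression, in particular, appeared in later releases.

    import numpy as np
    from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                                 PitchShift, Shift, TimeStretch)

    augment = Compose([
        AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.7),
        TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
        Shift(min_fraction=0.0, max_fraction=0.25, p=0.5),  # forward shift only
        PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
        Mp3Compression(p=0.5),
    ])

    samples = np.random.uniform(-1, 1, 44100).astype(np.float32)  # dummy 1 s signal
    augmented = augment(samples=samples, sample_rate=44100)  # differs on each call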
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213
3.4 Drums events trim
As will be explained in section 4.2.1, the dataset has to be trimmed into individual files in order to analyze them and extract the low-level descriptors. In the Dataset_featureExtraction.ipynb12 notebook this process has been implemented, slicing all the audios with their annotations, each dataset separately, in order to sight-check all the resulting samples and better detect which annotations were incorrect.
3.5 Summary
To summarize, a drums samples dataset has been created; the one used in this project will be called the 40k Samples Drums Dataset. Nonetheless, to share this dataset we have to ensure that we fully own the data, which means that the samples coming from the IDMT, MDB Drums and Music School datasets cannot be shared in another dataset. Alternatively, we will share the 29k Samples Drums Dataset, formed only by the samples recorded in the studio. This dataset will be available in Zenodo13, to download the whole dataset at once, and in Freesound, where some selected samples are uploaded in a pack14 to show the differences among microphones.
12httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_featureExtractionipynb
13httpszenodoorgrecord4958592YMmNXW4p5TZ14httpsfreesoundorgpeopleMaciaACpacks32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is inspired by [2]. Each box of the diagram refers to a section in this chapter, so the diagram might be helpful to get a general idea of the problem while each process is explained.
The system is divided into two main processes. First, the top boxes correspond to the training process of the model, using the dataset created in the previous chapter. Secondly, the bottom row shows how a student submission is processed to generate some feedback. This feedback is the output of the system, and it should give the student some indications on how they have performed and how they can improve.
4.1 Problem definition
To check whether a student reads a music sheet correctly, we need some tool to tag which instruments of the drumset are playing for each detected event. This leads us to develop and train a drums event classifier: if this tool ensures a good accuracy when classifying (i.e. above 95%), we will be able to properly assess a student's recording. If the classifier does not have enough accuracy, the system will not be useful, as we will not be able to differentiate between errors made by the student and errors made by the classifier.
[Diagram: top row, training — music scores, students' performances, assessments, annotations and audio recordings form the dataset, which feeds feature extraction, drums event classifier training and performance assessment training; bottom row, inference — a new student's recording goes through feature extraction and performance assessment inference to produce a visualization and performance feedback]
Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason, the project has mainly focused on developing the aforementioned drums event classifier and a proper dataset. Thus, creating a properly assessed dataset of drums interpretations has not been possible, nor has the performance assessment training. Despite this, the feedback visualization has been developed, as it is a nice way to close the pipeline and get some understandable results; moreover, the performance feedback could focus on deterministic aspects, such as telling the student whether they are rushing or dragging in relation to a given tempo.
4.2 Drums event classifier
As already mentioned, this section has been the main load of work for this project, because a reliable assessment depends on a correct automatic transcription. The process has been divided into three main parts: extracting the musical features, training and validating the model in an iterative process, and finally testing the model with totally new data.
4.2.1 Feature extraction
The feature extraction concept has been explained in Section 2.1.1; it has been implemented using the MusicExtractor()1 method from the Essentia library. MusicExtractor() has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()2 has been implemented to loop over a list of files and analyze each of them, adding the extracted features to a csv file jointly with the class of each drum event. At this point all the low-level features were extracted, and both the mean and the standard deviation were computed across all the frames of the given audio file. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.
As mentioned in section 3.4, the fact that MusicExtractor() has to be called with a filename, not an audio stream, forced us to create another version of the dataset, with each event annotated in a different audio file and the correspondent class label as filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples) and the features were saved in different csv files, one for each dataset3. Adding up the number of instances in all the csv files, we get 40,228 instances with 84 features and 1 label.
1. https://essentia.upf.edu/reference/std_MusicExtractor.html
2. https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3. https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices_features
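As an illustration, a loop like extract_MusicalFeatures() could be sketched with Essentia's Python bindings as below; the window and hop sizes and the scalar-descriptor filtering are assumptions for the sketch, not the exact thesis implementation.

```python
import csv
import essentia.standard as es
import numpy as np

def extract_MusicalFeatures(file_list, labels, out_csv):
    """Append the scalar low-level statistics (mean and stdev across frames)
    of each sliced drum event, plus its class label, to a csv file."""
    extractor = es.MusicExtractor(lowlevelFrameSize=2048,  # window size (assumed)
                                  lowlevelHopSize=1024,    # hop size (assumed)
                                  lowlevelStats=['mean', 'stdev'])
    with open(out_csv, 'w', newline='') as f:
        writer = csv.writer(f)
        for filename, label in zip(file_list, labels):
            features, _ = extractor(filename)  # needs a path, not an audio stream
            names = sorted(n for n in features.descriptorNames()
                           if n.startswith('lowlevel') and np.isscalar(features[n]))
            writer.writerow([features[n] for n in names] + [label])
```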
4.2.2 Training and validating
As mentioned in section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) to do sound event classification, and other authors have developed more complex methods for drums event classification. The complexity of these last methods made me choose the generic ones, also to try whether they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40kSamples Drums Dataset. The first times we tried the models we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula (4.1), which does not take into account the imbalance of the dataset. The accuracy computation was around 92%, but the correct predictions were mainly on the large classes; as shown in Figure 7, some classes had very low accuracy (even 0%, as some classes have 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances affects the accuracy computation less.
$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_\mathrm{samples}} \sum_{i=0}^{n_\mathrm{samples}-1} 1(\hat{y}_i = y_i) \qquad (4.1)$$
The proper way to compute the accuracy in this kind of dataset is instead the balanced accuracy: it computes the accuracy for each class and then averages it across all the classes, as in formula (4.2), where $w_i$ represents the weight of each class in the dataset. This computation lowered the result to 79%, which was not a good result.
$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \qquad (4.2)$$

$$\mathrm{balanced\text{-}accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i$$
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the F-score, which combines the precision and the recall of the model in one measure, as in formula (4.3). Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of predictions that should be correct for a given class.

$$F\text{-}\mathrm{measure} = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (4.3)$$
Having these results led us to the process of recording a personalized dataset to extend the already existing ones (see section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as did the balanced accuracy and F-score (both 89%). Until this point we were using both KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine refers to the regularization; this technique is intended to make a model less sensitive to data noise and to outliers that may not represent the class properly. When increasing this value to 10, the results improved across all the classes, as shown in Figure 9, as did the accuracy and F-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was somewhat of a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with the classes that include only the three mentioned instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and F-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the training and validating iterative process has been developed in the Classifier_training.ipynb4 notebook. First, the csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered, and, to remove redundant features, the ones with a very low standard deviation are deleted (i.e. std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to center all the feature distributions around 0 and ensure a standard deviation of 1.
In the next cells, the dataset is split into train and validation sets, and the training method from the SVM of sklearn is called to perform the training; when the models are trained, the parameters are dumped to a file, so the model can be loaded a posteriori and the learned knowledge applied to new data. This process was very slow on my computer, so we decided to upload the csv files to Google Drive and open the notebook with Google Colaboratory, as it was faster; this is a key feature to avoid long waiting times during the iterative train-validate process. In the last cells, the inference is made with the validation set, and the accuracy is computed as well as the confusion matrix plotted, to get an idea of which classes perform better. A condensed sketch of this loop is shown below.
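The following sketch follows the description above with scikit-learn; the csv layout, file names, split ratio and C value are illustrative, not the notebook's exact code.

```python
import joblib
import pandas as pd
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("features.csv")              # hypothetical csv from the extraction step
X, y = df.drop(columns="label"), df["label"]
X = X.loc[:, X.std() > 1e-5]                  # drop near-constant (redundant) features

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y)
scaler = StandardScaler().fit(X_train)        # zero mean, unit variance
model = SVC(C=10).fit(scaler.transform(X_train), y_train)

y_pred = model.predict(scaler.transform(X_val))
print(balanced_accuracy_score(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
joblib.dump((scaler, model), "svm_drums.joblib")  # reload later for inference
```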
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection()5 does both tasks. As explained in section 2.1.1, there are many methods to do onset detection and each of them is better for a different application. In the case of drums, we tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase; it works pretty well, but when the tempo increases some onsets are not correctly detected. For this reason, we finally implemented the onset detection with the HFC method. This method computes the HFC for each window as in equation (4.4); note that higher-frequency bins (index k) weigh more in the final value of the HFC:

$$\mathrm{HFC}(n) = \sum_k |X_k[n]|^2 \cdot k \qquad (4.4)$$

4. https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5. https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
Moreover, the function plots the audio waveform jointly with the detected onsets, to check after each test whether the detection has worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs. A sketch of the detection chain follows Figures 11 and 12.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
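A minimal sketch of HFC-based onset detection with Essentia's standard-mode bindings, following the library's documented pattern (the file name, frame size and hop size are illustrative):

```python
import essentia.standard as es
import numpy as np

audio = es.MonoLoader(filename="submission.wav")()  # hypothetical submission

od_hfc = es.OnsetDetection(method="hfc")            # high-frequency content method
w, fft, c2p = es.Windowing(type="hann"), es.FFT(), es.CartesianToPolar()

odf = []                                            # onset detection function, per frame
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    mag, phase = c2p(fft(w(frame)))
    odf.append(od_hfc(mag, phase))

# Onsets() takes a matrix of detection functions plus one weight per function
# and returns the onset times in seconds.
onset_times = es.Onsets()(np.array([odf], dtype=np.float32), [1.0])
```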
With the onset information, the audio can be trimmed into the different events; the order is maintained in the name of each file, so the slices can be easily mapped to the expected events when comparing. The audios are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a csv.
To predict which event each slice is, the already trained models are loaded in this new environment and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row in the data. The described process is implemented in the first part of Assessment.ipynb6; the second part is intended to execute the visualization functions described in the next section.
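The load-and-predict step could look like this sketch (file names hypothetical, mirroring the training sketch above):

```python
import joblib
import pandas as pd

scaler, model = joblib.load("svm_drums.joblib")     # parameters dumped at training time
features = pd.read_csv("submission_features.csv")   # one row per sliced event, same columns as training
predicted_events = model.predict(scaler.transform(features))
```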
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving the student visual feedback on the interpretation. As the drums classifier has taken so much time, the creation of a dataset with interpretations and their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet, simulating different levels of skill, but grading them and doing all the process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed ones.
So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Having as a base implementation the solution of my companion Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb.
First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.
6. https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7. https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8. https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (this comes from the LilyPond file's macro). To properly plot the deviations, in the code we are assuming that the interpretation starts four beats after the beginning of the audio.
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (from green, good, to red, bad). A sketch of this plotting logic follows Figures 13 and 14.
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
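The plotting logic can be approximated with matplotlib as in the following sketch; the traffic-light color mapping and the 100 ms saturation point are assumptions, not the thesis' exact values.

```python
import matplotlib.pyplot as plt
import numpy as np

def save_bar_plot(detected, expected, audio, sr, out_png):
    """Waveform with detected onsets (black, bottom) and expected onsets (cyan,
    middle); the gap between each pair is shaded on a green-to-red scale."""
    t = np.arange(len(audio)) / sr
    fig, ax = plt.subplots(figsize=(12, 2))
    ax.plot(t, audio, color="0.7")
    for d, e in zip(detected, expected):
        dev = min(abs(d - e) / 0.1, 1.0)          # 100 ms deviation -> fully red (assumed)
        ax.axvspan(min(d, e), max(d, e), color=(dev, 1.0 - dev, 0.0), alpha=0.6)
        ax.vlines(d, -1.0, -0.5, color="black")   # detected onset marker
        ax.vlines(e, -0.25, 0.25, color="cyan")   # expected onset marker
    fig.savefig(out_png, bbox_inches="tight")
    plt.close(fig)
```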
Once the waveform is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes()10, the expected and predicted events are compared and a list of booleans is created, with 0 at the False and 1 at the True indices; the resulting list is then iterated and the 0 indices are checked, because most of the classification errors fail in only one of the instruments to be predicted (i.e. instead of hh+sd, sd is predicted). These cases are considered partially correct, as the system has to take its own errors into account: at the indices in which one of the instruments is correctly predicted and it is not a hi-hat (we consider it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct, and red means incorrect. A sketch of this comparison logic follows Figure 15.
9. https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10. https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
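A hedged sketch of that comparison logic, simplifying assess_notes() as described above (the event-string format is the one used throughout the project):

```python
def assess_notes(expected, predicted):
    """Per-note scores: 1.0 correct, 0.75 partially correct, 0.0 incorrect.
    A prediction counts as partially correct when it shares a non-hi-hat
    instrument with the expected event (e.g. 'sd' predicted for 'hh+sd')."""
    scores = []
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            scores.append(1.0)
        else:
            shared = set(exp.split('+')) & set(pred.split('+'))
            scores.append(0.75 if shared - {'hh'} else 0.0)
    return scores

# assess_notes(['hh+sd', 'hh', 'kd'], ['sd', 'hh', 'sd'])  ->  [0.75, 1.0, 0.0]
```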
With the waveform, the notes assessed and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the waveform previously created; this is done with the LilyPond macros. On the one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the notes assessed); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends, in this case from 0 to 16, which in a 4/4 rhythm is 4 bars; the other number is the scale of the waveform and allows fitting the plot better to the staff.
4.3.2 Files used
The assessment process of an exercise needs several files: first, the annotations of the expected events and their timesteps, found in the txt file already mentioned in section 3.1.1; then the LilyPond file, which is the template written in the LilyPond language that defines the resulting music sheet (the macros to change color and size and to add the waveform are defined there); when extracting the musical features, each submission creates its csv file to store the information; and finally we need, of course, the audio files with the recorded submission to be assessed.
11. https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB at a time, up to an accumulated +30 dB.
In this chapter and Appendix B all the resulting feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high frequency content method works perfectly, there are no limitations or errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positive.
1. https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
51 Tempo limitations
One of the limitations of the system is the tempo of the exercise the accuracy drops
when the tempo increases Having as a reference the Figures that show good reading
which all notes should be in green or light green (ie Figures 16 17 18 19 20
21 and 22) we can count how many are correct or partially correct to punctuate
each case a correct prediction weights 10 a partially correct weights 05 and an
incorrect 0 the total value is the mean of the weighted result of the predictions
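For example, the 60 bpm row of Table 3 scores (25 · 1.0 + 7 · 0.5 + 0 · 0)/32 ≈ 0.89; a tiny helper making the computation explicit (the function name is ours):

```python
def exercise_score(correct, partially_ok, incorrect):
    """Weighted score used in Tables 3-8: correct = 1.0, partially correct = 0.5."""
    total = correct + partially_ok + incorrect
    return (correct + 0.5 * partially_ok) / total

exercise_score(25, 7, 0)   # -> 0.890625, the 0.89 of Table 3 at 60 bpm
```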
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of the classifier decreases. This may be because increasing the tempo reduces the spacing between events, and consequently the duration of each event, which leads to fewer values for calculating the mean and standard deviation when extracting the timbre characteristics. As stated in the law of large numbers [25], the larger the sample, the closer the sample mean is to the population mean; in this case, having fewer values in the calculation creates more outliers in the distribution, which tends to scatter.
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48

Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm; but at 100 bpm in 12/8 the equivalent tempo is 300 quarter notes per minute, similar to 140 bpm in 4/4, whose quarter-note-per-minute tempo is 280. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875

Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that it leads to a high-frequency content that is predicted as a hi-hat in all the cases.
Level    Correct   Partially OK   Incorrect   Total
+0dB     25        7              0           0.89
+6dB     23        9              0           0.86
+12dB    23        9              0           0.86
+18dB    24        7              1           0.86
+24dB    18        5              9           0.64
+30dB    13        5              14          0.48

Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0dB     12        7              13          0.48
+6dB     13        10             9           0.56
+12dB    10        8              14          0.5
+18dB    9         2              21          0.31
+24dB    8         0              24          0.25
+30dB    9         0              23          0.28

Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier accuracy, but we think that it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar, and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results will be the 'bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise

Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs the prediction-based assessment pretty well. This can be helpful for a student to know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help the student recognize whether they are slowing down or rushing when reading the score; as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the correspondent expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. A set of further improvements is also given, together with a final thought on my work and my apprenticeship. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained throughout this document, we can now list them, defining their completeness and our contributions.
Firstly, the 29kSamples Drums Dataset has been created and is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful to balance drums datasets based on real interpretations, as the class distribution of these interpretations is very unbalanced, as explained with the IDMT and MDB Drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on the context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained to do is very hard for a human: being able to differentiate drums events in an individual drum sample, without any context, is almost impossible even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they could be improved, as they are mostly focused on the visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the dataset's studio recordings. It would be good to get recordings in different spaces and with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even different people playing. This could introduce more variance into the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods from [16], [17] and [26] in the ADT part of the pipeline.
1. https://www.2021.ieeeicassp.org
Also, in terms of classes in the drumset, there is a long path to cover. There are no solutions that robustly transcribe a whole set, including the toms and different kinds of cymbals. In this regard, we think that a proper approach would be to work with professional musicians, which would help researchers to better understand the instrument and create datasets with different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises to give a mark to each student. On this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step would be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed to take this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and other researchers or students can execute them, getting the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository3, and some samples in Freesound4. This is the 29kSamples Drums Dataset: not all the 40k samples used to train are our property, and we are not able to share them under our full authorship; despite this, the other datasets used in this project are available individually.
2. https://github.com/MaciAC/tfg_DrumsAssessment
3. https://zenodo.org/record/4923588
4. https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished. Nevertheless, this work still falls short in terms of robustness and reliability; but a first approximation has been presented, and several paths of improvement have been proposed.
Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web version, GitHub. Also, at the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which has been a tough path to take and accomplish. In a more advanced phase of the project, I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a lot of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest. Being able to relate the project with music and drums helped me to keep my motivation and focus. I am quite satisfied with the feedback visualization that results from the system, and I hope that more people get interested in this field of research, to build better tools in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff 41
24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings – 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
Figure 31 Recording routine 1
Figure 32 Recording routine 2
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
Throughout the project we will refer to the different instruments that make up a drumkit with abbreviations. In Table 1 the legend used is shown; the combination of two or more instruments is represented with a '+' symbol between the different tags.

Instrument     Kick drum   Snare drum   Floor tom   Mid tom   High tom
Abbreviation   kd          sd           ft          mt        ht

Instrument     Hi-hat   Ride cymbal   Crash cymbal
Abbreviation   hh       cy            cr

Table 1 Abbreviations' legend
1.3.2 Dataset creation
Keeping in mind the last idea of the previous section, if a machine learning approach has to be implemented, there is a basic need to obtain drums audio data. Apart from the audio data, proper annotations of the drums interpretations are needed in order to slice them correctly and extract musical features of the different events.
The process of gathering data should take into account the different possibilities that a drumset offers in terms of timbre, loudness and tone. Several datasets should be combined, as well as additional recordings with different drumsets, in order to have a balanced and representative dataset. Moreover, to evaluate the assessment task, a set of exercises has to be recorded with different levels of skill.
There is also the need to capture those sounds with several frequency responses, in order to make the model independent of the microphone. Those samples could also be processed to get variations of each of them with data augmentation processes.
1.3.3 Signal quality
Regarding the assignment, we have to take into account that a student will not be able to record their interpretations with a setup like the one used in a studio recording; most of the time the recordings will be done using a laptop or mobile phone microphone. This fact has to be taken into account when training the event classifier, in order to do data augmentation and introduce these transformations into the dataset, e.g. introducing noise to the samples or amplifying them to get overload distortion.
1.4 Objectives
The main objective of this project is to develop a tool to assess drums interpretations of a proposed music sheet. This objective can be split into the different steps of the pipeline:
• Generate a correctly annotated drums dataset, which means a collection of audio drums recordings and their annotations, all equally formatted.
• Implement a drums event sound classifier.
• Find a way to properly visualize drums sheets and their assessment.
• Propose a list of exercises to evaluate the technology.
In addition, having the code published in a public GitHub6 repository and uploading the created dataset to Freesound7 and Zenodo8 will be a good way to share this work.
1.5 Project overview
The next chapters are structured as follows. In chapter 2 the state of the art is reviewed, focusing on signal processing algorithms and ways to implement sound event classification, and ending with music sheet technologies and the software tools available nowadays. In chapter 3 the creation of a drums dataset is described, presenting the use of already available datasets and how new data has been recorded and annotated. In chapter 4 the methodology of the project is detailed: the algorithms used for training the classifier, as well as how new submissions are processed to assess them. In chapter 5 an evaluation of the results is done, pointing out the limitations and the achievements. Chapter 6 concludes with a discussion on the methods used, the work done and further work.
6. https://github.com
7. https://freesound.org
8. https://zenodo.org
Chapter 2
State of the art
In this chapter the concepts and technologies used in the project are explained, covering algorithm references and existing implementations. First, signal processing techniques for onset detection and feature extraction are reviewed; then the sound event classification field is presented, together with its relationship to drums event classification. The principal music sheet technologies and codecs are also presented. Finally, specific software tools are listed.
2.1 Signal processing
2.1.1 Feature extraction
In the following sections sound event classification will be explained; most of these methods are based on training models using features extracted from the audio, not with the audio chunks themselves [6]. In this section, signal processing methods to obtain those features are presented.
Onset detection
In an audio signal, an onset is the beginning of a new event; it can be either a single note, a chord or, in the case of the drums, the sound produced by hitting one or more instruments of the drumset. It is necessary to have a reliable algorithm that properly detects all the onsets of a drums interpretation. With the onset information (a list of timestamps), the audio can be sliced to analyze each chunk separately and to assess the tempo consistency.
It is important to address the challenge in a psychoacoustical way, as the objective is to detect the musical events as a human would. In [7] the idea of perceptual onset for percussive instruments is defined as a time interval between the physical onset and the moment that the maximum level is reached. In [8] many methods are reviewed, focusing on the differences in performance depending on the signal: Non-Pitched Percussive instruments are better detected with temporal methods or high-frequency content methods, while Pitched Non-Percussive instruments may need to take into account changes of energy in the spectrum distribution, as the onset may represent a different note.
The sound generated by the drums is mainly percussive (discarding brushes' slow patterns or mallet build-ups on the cymbals), which means that it is formed by a short transient followed by a short decay; there is no sustain. As the transient is a fast change of energy, it implies high-frequency content, because the changes happen in a very small frame of time. As recommended in [9], the HFC method will be used.
Timbre features
As described in [10], a feature denotes in some way a quantity or a value. Features extracted by processing the audio stream, or transformations of it (i.e. FFT), are called low-level descriptors; these features have no relevant information from a human point of view, but are useful for computational processes [11].
Some low-level descriptors are computed from the temporal information: for instance, the zero-crossing rate tells the number of times the signal crosses the zero axis per second, the attack time is the duration of the transient, and the temporal centroid describes the energy distribution of an event over time. Other well known features are the root mean square of the signal or the high-frequency content mentioned in section 2.1.1.
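As a toy illustration (a simple sketch, not Essentia's implementation), the zero-crossing rate of one frame can be computed as:

```python
import numpy as np

def zero_crossing_rate(frame, sr):
    """Zero-axis crossings per second within one audio frame."""
    crossings = np.count_nonzero(np.diff(np.signbit(frame).astype(np.int8)))
    return crossings * sr / len(frame)
```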
Besides temporal features, low-level descriptors can also be computed from the frequency domain. Some of them are the spectral flatness, spectral roll-off, spectral slope and spectral flux, among others.
Nowadays, the Essentia library offers a collection of algorithms that reliably extract the aforementioned low-level descriptors; the function that englobes all the extractors is called MusicExtractor1.
2.1.2 Data augmentation
Data augmentation processes refer to the optimization of the statistical representation of the datasets, in terms of improving the generalization of the resultant models. These methods are based on the introduction of unobserved data or latent variables that may not be captured during the dataset creation [12].
Regarding this technique applied to audio data, signal processing algorithms are proposed in [13] and [14] that introduce changes to the signals in both the time and frequency domains. In these articles the goal is to improve accuracy on speech and animal sound recognition, although this could apply to drums event classification. The processes that led to the best results in [13] and [14] were related to time-domain transformations: for instance, time shifting and stretching, adding noise or harmonic distortion, and compressing in a given dynamic range, among others. Other proposed processes were focused on the spectrogram of the signal, applying transformations such as shifting the matrix representation, setting some areas to 0, or adding spectrograms of different samples of the same class.
Presently, some Python2 libraries are developed and maintained to do audio data augmentation tasks, for instance audiomentations3 and its GPU version, torch-audiomentations4.
1. https://essentia.upf.edu/streaming_extractor_music.html
2. https://www.python.org
3. https://pypi.org/project/audiomentations/0.6.0
4. https://pypi.org/project/torch-audiomentations
2.2 Sound event classification
Sound event classification is the task of detecting and recognizing sound events in an audio stream [15]. As described in [10], this task can be approached from two sides: on one hand, the perceptual approach tries to extract timbre similarity to cluster sounds by how we perceive them; on the other hand, the taxonomic approach is determined to label sound events as they are defined in cultural or user-biased taxonomies. In this project the focus is on the second approach, as the task is to classify sound events in the drums taxonomy (i.e. kick drum, snare drum, hi-hat).
Also in [] many classification methods are proposed; concretely, in the taxonomy approach, machine learning algorithms such as K-Nearest Neighbours, Support Vector Machines or Neural Networks, all of them using features extracted from the audio data, as explained in section 2.1.1.
2.2.1 Drums event classification
This section is divided into two parts: first presenting the state-of-the-art methods for drums event classification, and then the most relevant existing datasets. This section is mainly based on the article [1], as it is a review of the topic and encompasses the core concepts of the project.
Methods
Focusing on taxonomic drums event classification, this field has been studied over the last years, and it has been a proposed challenge in the Music Information Retrieval Evaluation eXchange5 (MIREX) since 20056. In [1] a review of the main investigated methods is done: the authors collect different approaches, such as Recurrent Neural Networks, proposed in [16], Non-negative Matrix Factorization, proposed in [17], and other real-time methods based on Max/MSP7, as described in [18].
5. https://www.music-ir.org/mirex/wiki/MIREX_HOME
6. https://www.music-ir.org/mirex/wiki/2005:Audio_Drum_Detection_Results
7. https://cycling74.com/products/max
It is worth mentioning that the proposed methods focus on Automatic Drum Transcription (ADT) of drumsets formed only by the kick drum, snare drum and hi-hat. The ADT field is intended to transcribe audio; but in our case we have to check whether an audio event is the expected event or not, and this particularity can be used in our favor, as some assumptions can be made about the audio to be analyzed.
Datasets
In addition to the methods and their combinations, the data used to train the system plays a crucial role. As a result, the dataset may have a big impact on the generalization capabilities of the models. In this section some existing datasets are described:
• IDMT-SMT-Drums [19]: Consists of real drum recordings containing only kick drum, snare drum and hi-hat events. Each recording has its transcription in xml format, and it is publicly available to download8.
• MDB Drums [20]: Consists of real drums recordings of a wide range of genres, drumsets and styles. Each recording has two txt transcriptions, for the classes and subclasses defined in [20] (e.g. class: hi-hat; subclasses: closed hi-hat, open hi-hat, pedal hi-hat). It is publicly available to download9.
• ENST-Drums [21]: Consists of real drum audio and video recordings of different drummers and drumsets. Each recording has its transcription, and some of them include accompaniment audio. It is publicly available to download10.
• DREANSS [22]: Differently, this dataset is a collection of drum recording datasets that have been annotated a posteriori. It is publicly available to download11.
Electronic drums datasets have not been considered, as the student assignment is supposed to be recorded with a real drumset.
8. https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
9. https://github.com/CarlSouthall/MDBDrums
10. https://perso.telecom-paristech.fr/grichard/ENST-drums
11. https://www.upf.edu/web/mtg/dreanss
2.3 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter programs of the 80s. Proprietary software such as Finale12 and Sibelius13, or open-source software such as MuseScore14 and LilyPond15, are some options that can be used nowadays to write music sheets with a computer.
In terms of file format, Sibelius has its encrypted version that can only be read and written with the software; it can also write and read MusicXML16 files, which are not encrypted and are similar to an HTML file, as they contain tags that define the bars and notes of the music sheet. This format is the standard for exchanging digital music sheets.
Within Music Critic's framework, the technology used to display the evaluated score is LilyPond: it can be called from the command line and allows adding macros that change the size or color of the notes. The other particularity is that it uses its own file format (ly), and scores that are in MusicXML format have to be converted and reviewed.
2.4 Software tools
Many of the concepts and algorithms aforementioned are already available as software libraries. This project has been developed in Python, and in this section the libraries that have been used are presented. Some of them are open and public, and some others are private, such as pysimmusic, which has been shared with us so we can use and consult it. In addition, all the code has been developed using a tool from Google called Colaboratory17: it allows writing code in the jupyter notebook18 format, which is agile to use and execute interactively.
12. https://www.finalemusic.com
13. https://www.avid.com/sibelius
14. https://musescore.org
15. https://lilypond.org
16. https://www.musicxml.com
17. https://colab.research.google.com
18. https://jupyter.org
2.4.1 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis, description and synthesis [23]; it can also be installed as a Python library with the pip19 command in Linux, or by compiling with certain flags in macOS20. This library includes a collection of MIR algorithms; it is not a framework, so it is in the user's hands how to use these processes. Some of the algorithms used in this project are music feature extraction, onset detection and audio file I/O.
2.4.2 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine learning algorithms for regression, classification and clustering, as well as pre-processing and dimensionality reduction functions. It is based on NumPy22 and SciPy23, so its algorithms are easy to adapt to the most common data structures used in Python. It also allows saving and loading trained models to do inference tasks with new data.
2.4.3 Lilypond
As described in section 2.3, LilyPond is an open-source scorewriter software with its own file format and language. It can produce visual renders of music sheets in PNG, SVG and PDF formats, as well as MIDI files to listen to the compositions. LilyPond works on the command line and allows us to introduce macros to modify visual aspects of the score, such as color or size.
It is the digital sheet music technology used within Music Critic's framework, as it allows embedding an image in the music sheet, generating a parallel representation of the music sheet and a student's interpretation.
19. https://pypi.org/project/pip
20. https://essentia.upf.edu/installing.html
21. https://scikit-learn.org
22. https://numpy.org
23. https://www.scipy.org/scipylib/index.html
2.4.4 Pysimmusic
Pysimmusic is a private Python library developed at the MTG. It offers tools to analyze the similarity of musical performances, and uses libraries such as Essentia, LilyPond and FFmpeg24, among others. Pysimmusic contains onset detection algorithms and a collection of audio descriptors and evaluation algorithms. By now it is the main evaluation software used in Music Critic to compare the submitted recording with the reference.
2.4.5 Music Critic
Music Critic is a project from the MTG intended to support technologies for online
music education, facilitating the assessment of student performances25.
The proposed workflow starts with a student submitting a recording of the proposed
exercise. The submission is then sent to the Music Critic server, where it is analyzed
and assessed. Finally, the student receives the evaluation jointly with the feedback
from the server.
2.5 Summary
Music information retrieval and machine learning have been popular fields of study.
This has led to a large development of methods and algorithms that will be crucial
for this project. Most of them are free and open-source and, fortunately, the private
ones have been shared by the UPF research team, which is a great base to start the
development.
24. https://www.ffmpeg.org
25. https://www.upf.edu/web/mtg/tech-transfer/-/asset_publisher/pYHc0mUhUQ0G/content/id/229860881/maximized
Chapter 3
The 40kSamples Drums Dataset
As stated in section 1.3.2, having a well-annotated and balanced dataset is crucial to
get proper results. In this chapter, the creation process of the 40kSamples Drums
Dataset is explained: first, focusing on how to process existing datasets, such as the
ones mentioned in section 2.2.1; secondly, introducing the process of creating new
datasets with a music school corpus and a collection of recordings made in a
recording studio; and finally, describing the data augmentation procedure and how
the audio samples are sliced into individual drums events. In Figure 1 we can see
the different procedures used to unify the annotations of the different datasets,
while the audio does not need any specific modification.
3.1 Existing datasets
Each of the existing datasets has a different annotation format; in this section, the
process of unifying them is explained, as well as its implementation (see notebook
Dataset_formatUnification.ipynb1). As the events to take into account can be
single instruments or combinations of them, the annotations have to be formatted
to represent those events properly. None of the annotations has this approach, so
we have written a function that filters the list and joins the events separated by a
very small difference of time, meaning that they are played simultaneously.

1. https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
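Since none of the source formats marks simultaneous hits explicitly, this joining step can be sketched as follows. This is a minimal illustration under our own assumptions: the function name, the 20 ms threshold and the '+' label convention are illustrative, not the notebook's actual code.

from typing import List, Tuple

def join_simultaneous_events(events: List[Tuple[float, str]],
                             max_gap: float = 0.02) -> List[Tuple[float, str]]:
    # Merge events closer than max_gap seconds into a combined class label,
    # e.g. 'hh' and 'kd' played together become 'hh+kd'.
    events = sorted(events)
    merged: List[Tuple[float, str]] = []
    for time, label in events:
        if merged and time - merged[-1][0] <= max_gap:
            prev_time, prev_label = merged[-1]
            labels = sorted(set(prev_label.split('+') + [label]))
            merged[-1] = (prev_time, '+'.join(labels))
        else:
            merged.append((time, label))
    return merged

# [(0.000, 'hh'), (0.005, 'kd'), (0.500, 'sd')] -> [(0.0, 'hh+kd'), (0.5, 'sd')]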
[Diagram: the four sources (Music school, Studio REC, IDMT Drums, MDB Drums) provide audio plus annotations; the music school scores go from Sibelius to MusicXML and through the MusicXML parser to txt, and all sources converge in the 'write annotations' step.]
Figure 1 Datasets pre-processing
3.1.1 MDB Drums
This dataset was the first we worked with; the annotation format in txt was a key
factor, as it was easy to read and understand. As the dataset is available on GitHub2,
there is no need to download it nor process it from a local drive. As shown in the
first cells of Dataset_formatUnification.ipynb, data from the repository can be
retrieved with a Python wrapper of the GitHub API3.
This dataset has two annotation files, depending on how deep the taxonomy used
is [20]. In this case, the generic class taxonomy is used, as there is no need to
differentiate playing styles on a given instrument (i.e. single stroke, flam, drag,
ghost note).
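As an illustration of this kind of remote retrieval, the same result can be obtained with the generic requests package instead of the GitHub API wrapper used in the notebook; the annotation path inside the repository is a guess for illustration purposes only.

import requests

# Fetch one MDB Drums class annotation directly from the repository;
# the exact path inside the repo is an assumption.
RAW = "https://raw.githubusercontent.com/CarlSouthall/MDBDrums/master/"
resp = requests.get(RAW + "MDB%20Drums/annotations/class/MusicDelta_Rock_class.txt")
resp.raise_for_status()

# Each line of the txt annotation holds a timestamp and a class label
for line in resp.text.splitlines()[:5]:
    timestamp, label = line.split()
    print(float(timestamp), label)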
3.1.2 IDMT Drums
Differently from the previous dataset, this one is only available by downloading a
zip file4. It also differs in the annotation file format, which is XML. Using the
Python package xmltodict5, in the second part of Dataset_formatUnification.ipynb
the XML files are loaded as Python dictionaries and converted to txt format.

2. https://github.com/CarlSouthall/MDBDrums
3. https://pypi.org/project/githubpy
4. https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
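A minimal sketch of that conversion could look as follows; the tag names are assumptions about the IDMT schema, and the real notebook walks the actual structure of the files.

import xmltodict

# Parse one IDMT annotation file into nested dictionaries
with open("WaveDrum02_01.xml") as f:          # filename is illustrative
    doc = xmltodict.parse(f.read())

# Write the events in the unified txt format: timestamp <tab> class label.
# The keys below are assumed, not verified against the real files.
with open("WaveDrum02_01.txt", "w") as out:
    for event in doc["instrumentRecording"]["transcription"]["event"]:
        out.write(f'{event["onsetSec"]}\t{event["instrument"]}\n')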
3.2 Created datasets
In order to expand the dataset with a greater variety of samples, other methods to
get data have been explored. On one hand, with audio data that has partial
annotations or some representation that is not data-driven, such as a music sheet,
which contains a visual representation of the music but not a logic annotation, as
mentioned in the previous section. On the other hand, generating simple
annotations is an easy task, so drums samples can be recorded standalone to create
data in a controlled environment. In the next two sections these methods are
described.
3.2.1 Music school
A music school has shared its teaching material with the MTG for research purposes,
i.e. audio demos, books in PDF format and music sheets in Sibelius format. As we
can see in Figure 1, the annotations from the music school corpus are in Sibelius
format; this is an encrypted representation of the music sheet that can only be
opened with the Sibelius software. The MTG has shared an AVID license, which
includes the Sibelius software, so we were able to convert the .sib files to MusicXML.
MusicXML is not encrypted and can be opened and read, so a parser has been
developed to convert the MusicXML files into a symbolic representation of the music
sheet. This representation has been inspired by [24], which proposes a system to
represent chords.
MusicXML parser
As mentioned in section 2.3, the MusicXML format is based on ordering the visual
information with tags, creating a tree structure of nested dictionaries. In the first
cell of XML_parser.ipynb6 two functions are defined: ConvertXML2Annotation()
reads the MusicXML file and gets the general information of the song (i.e. tempo,
time signature, title); then a for loop runs through all the bars of the music sheet,
checking whether the given bar is self-defined, a repetition of the previous one, or
the beginning or end of a repetition in the song (see Figure 2). In the self-defined
case, the bar is passed to an auxiliary function which parses it, obtaining the
aforementioned symbolic representation.

5. https://pypi.org/project/xmltodict
6. https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three
others have a symbol that means 'repetition of the previous bar'; moreover, the bar
lines at the beginning and the end indicate that these four bars have to be repeated.
Therefore, this line in the music score represents an interpretation of eight bars,
repeating the first one.
The symbolic representation that we propose, based on [24], defines each bar with
a string; this string contains the representations of the events in the bar, separated
by blank spaces. Each of the events uses two dots (..) to separate the figure (i.e.
quarter note, half note, whole note) from the note or notes of the event, which are
separated by a dot (.). For instance, the symbolic representation of the first bar in
Figure 2 is F4.A4..4 F4.A4..4 F4.A4..4 F4.A4..4.
In addition to this conversion, in the parse_one_measure() function from the
XML_parser notebook, each measure is checked to ensure that it fully represents
the bar. This means that the sum of the figures of the bar has to be equal to the
duration defined in the time signature: the sum of the events in a 4/4 bar has to be
equal to four quarter notes.
Symbolic notation to unified annotation format
As we can see in Figure 1, once the music scores are converted to the symbolic
representation, the last step is to unify the annotations with the format used in
section 3.1. This process is made in the last cells of the Dataset_formatUnification7
notebook. A dictionary with the translation of the notes to drums instruments is
defined, so the note is directly converted. Differently, the timestamp of each event
has to be computed based on the tempo of the song and the figure of each event;
this process is made with the function get_time_steps_from_annotations8, which
reads the interpretation in symbolic notation and accumulates the duration of each
event based on the figure and the tempo.
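A hypothetical re-implementation of this accumulation could look as follows; the figure codes and the function name are illustrative, not the project's actual code.

# Map figure codes to durations expressed in quarter-note beats
FIGURE_TO_BEATS = {"1": 4.0, "2": 2.0, "4": 1.0, "8": 0.5, "16": 0.25}

def get_time_steps(bars, tempo_bpm):
    # bars: list of bar strings such as 'F4.A4..4 F4.A4..4 F4.A4..4 F4.A4..4'
    seconds_per_beat = 60.0 / tempo_bpm
    current_time, annotations = 0.0, []
    for bar in bars:
        for event in bar.split():
            notes, figure = event.split("..")        # 'F4.A4' and '4'
            annotations.append((current_time, notes.split(".")))
            current_time += FIGURE_TO_BEATS[figure] * seconds_per_beat
    return annotations

# At 60 bpm each quarter note lasts one second, so events fall on 0, 1, 2, 3 s
print(get_time_steps(["F4.A4..4 F4.A4..4 F4.A4..4 F4.A4..4"], 60))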
3.2.2 Studio recordings
At this point of the dataset creation we realized that the already existing data was
very unbalanced in terms of instances per class: some classes had around two
thousand samples while others had only ten. This situation was the reason to record
a personalized dataset, to balance the overall distribution of classes as well as to
record exercises read with different accuracy, simulating students with different skill
levels.
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants,
Barcelona). The first day was dedicated to mounting the drumset and the
microphones, which are listed in Table 2. In Figure 3 the microphone setup is shown;
differently from the standard setup, in which each instrument of the set has its own
microphone, this distribution of the microphones was intended to record the whole
drumset with different frequency responses.
The recording process was divided into two phases: first, creating samples to balance
the dataset used to train the drums event classifier (called train set); then, recording
the students' assignment simulation to test the whole system (called test set).

7. https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8. https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
9. https://www.stereodosis.com
Microphone            Transducer principle
Beyerdynamic TG D70   Dynamic
Shure PG52            Dynamic
Shure SM57            Dynamic
Sennheiser e945       Dynamic
AKG C314              Condenser
AKG C414              Condenser
Shure PG81            Condenser
Samson C03            Condenser
Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only the classes
that appear in the music school subset. This decision was motivated by the idea of
assessing the songs from the books, so only the classes present in the collection of
songs were needed to train the classifier. In Figure 4 the distribution of the selected
classes before the recordings is shown; note that it is in logarithmic scale, so there
is a large difference among classes.
Figure 4 Number of samples before Train set recording
To organize the recording process we designed 3 different routines to record;
depending on the class and the number of samples already existing, a different
routine was recorded. These routines were designed trying to represent the different
speeds, dynamics and interactions between instruments of a real interpretation. In
Appendix A the routine scores are shown; to write a generic routine, a two-line
stave is used: the bottom line represents the class to be recorded, and the top line
an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides,
whose sound lasts a long period of time and whose tail is mixed with the subsequent
sound events.
• Routine 1 (Fig. 31): intended for the classes that do not include a crash or ride
cymbal and have a small number of samples (i.e. <500).
• Routine 2 (Fig. 32): does not include auxiliary events, as it is intended for classes
that include a crash or ride cymbal, whose interaction with itself is intrinsic.
• Routine 3 (Fig. 33): a short version of routine 1 that only repeats each bar two
times instead of four; intended for classes that do not include a crash or ride cymbal
and have a large number of samples (i.e. >500).
Routines 1 and 3 were recorded only one time, as we had only one instrument for
each of those classes; differently, routine 2 was recorded two times for each cymbal,
as we were able to use more instances of them. The different cymbal configurations
used can be seen in Appendix A, in Figures 34, 35 and 36.
After the train set recording the number of samples was much more balanced: as
shown in Figure 5, all the classes have at least 1,500 samples.
[Bar chart: number of samples per class (hh, sd, kd, cy, cr, ft, mt, ht and their combinations), on a 0 to 3000 scale, distinguishing the samples existing before the session from the newly recorded ones.]
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song
on the same drumset. To do that, we recorded each song of the music school Drums
Grade Initial and Grade 1, playing it correctly and then making mistakes in both
reading and rhythm. After testing with these recordings, we realized that we were
not able to test the limits of the assessment system in terms of tempo or with
different time signatures. So we proposed two groove-reading exercises, in 4/4 and
in 12/8, to be performed at different tempos; these recordings have been done in
my study room with my laptop's microphone.
3.3 Data augmentation
As described in section 2.1.2, data augmentation aims to introduce changes to the
signals to optimize the statistical representation of the dataset. To implement this
task, the aforementioned Python library audiomentations is used.
audiomentations has a class called Compose which allows collecting different
processing functions, assigning a probability to each of them. The Compose instance
can then be called several times with the same audio file, and each time the resulting
audio will be processed differently because of those probabilities. In
data_augmentation.ipynb10 a possible implementation is shown, as well as some
plots of the original sample next to different results of applying the created Compose
to it; an example of the results can be listened to in Freesound11.
The processing functions introduced in the Compose are based on the ones proposed
in [13] and [14]; their parameters are listed below, and a minimal sketch follows the list:
• Add Gaussian noise, with 70% probability.
• Time stretch between 0.8 and 1.25, with 50% probability.
• Time shift forward a maximum of 25% of the duration, with 50% probability.
• Pitch shift of ±2 semitones, with 50% probability.
• Apply MP3 compression, with 50% probability.
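The sketch below builds such a Compose with audiomentations' transform classes; the numeric ranges mirror the list above, but the exact parameter names can differ slightly between library versions.

import numpy as np
from audiomentations import (Compose, AddGaussianNoise, TimeStretch,
                             Shift, PitchShift, Mp3Compression)

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.7),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    Shift(min_fraction=0.0, max_fraction=0.25, p=0.5),   # forward time shift
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    Mp3Compression(p=0.5),
])

samples = np.random.uniform(-0.5, 0.5, 44100).astype(np.float32)  # stand-in audio
# Every call draws new random parameters, so each variant sounds different
variants = [augment(samples=samples, sample_rate=44100) for _ in range(4)]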
10. https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11. https://freesound.org/people/MaciaAC/packs/32213
3.4 Drums events trim
As will be explained in section 4.2.1, the dataset has to be trimmed into individual
files in order to analyze them and extract the low-level descriptors. In the
Dataset_featureExtraction.ipynb12 notebook this process has been implemented,
slicing all the audios with their annotations, each dataset separately, to sight-check
all the resultant samples and better detect which annotations were not correct.
3.5 Summary
To summarize, a drums samples dataset has been created; the one used in this
project will be called the 40kSamples Drums Dataset. Nonetheless, to share this
dataset we have to ensure that we fully own the data, which means that the samples
that come from the IDMT, MDB Drums and music school datasets cannot be shared
in another dataset. Alternatively, we will share the 29kSamples Drums Dataset,
formed only by the samples recorded in the studio. This dataset will be available in
Zenodo13, to download the whole dataset at once, and in Freesound, where some
selected samples are uploaded in a pack14 to show the differences among
microphones.

12. https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13. https://zenodo.org/record/4958592
14. https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter, the methodologies followed in the development of the assessment
pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is
inspired by [2]. Each box of the diagram refers to a section in this chapter, so the
diagram might be helpful to get a general idea of the problem when explaining each
process.
The system is divided into two main processes. First, the top boxes correspond to
the training process of the model, using the dataset created in the previous chapter.
Secondly, the bottom row shows how a student submission is processed to generate
some feedback. This feedback is the output of the system and should give some
indications to the student on how they have performed and how they can improve.
4.1 Problem definition
To check whether a student reads a music sheet correctly, we need a tool that tags
which instruments of the drumset are playing for each detected event. This leads us
to develop and train a drums event classifier: if this tool ensures a good accuracy
when classifying (i.e. ≥95%), we will be able to properly assess a student's recording.
If the classifier does not reach enough accuracy, the system will not be useful, as we
will not be able to differentiate between errors from the student and errors from the
classifier.
[Diagram: in the training path, music scores, students' performances and assessments form the dataset (annotations plus audio recordings); feature extraction feeds both the drums event classifier training and the performance assessment training. In the inference path, a new student's recording goes through feature extraction and performance assessment inference, producing a visualization and performance feedback.]
Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason, the project has been mainly focused on developing the
aforementioned drums event classifier and a proper dataset. Hence, developing a
properly assessed dataset of drums interpretations has not been possible, nor the
performance assessment training. Despite this, the feedback visualization has been
developed, as it is a nice way to close the pipeline and get some understandable
results; moreover, the performance feedback can focus on deterministic aspects, such
as telling the student if they are rushing or slowing down in relation to a given
tempo.
4.2 Drums event classifier
As already mentioned, this section has been the main workload of this project,
because a reliable assessment depends on a correct automatic transcription. The
process has been divided into 3 main parts: extracting the musical features, training
and validating the model in an iterative process, and finally testing the model with
totally new data.
4.2.1 Feature extraction
The feature extraction concept has been explained in section 2.1.1 and has been
implemented using the MusicExtractor()1 algorithm from the Essentia library.
MusicExtractor() has to be instantiated with the window and hop sizes that will be
used to perform the analysis, and called with the filename of the event to be
analyzed. The function extract_MusicalFeatures()2 has been implemented to loop
over a list of files and analyze each of them, adding the extracted features to a CSV
file jointly with the class of each drums event. At this point all the low-level features
were extracted; both the mean and the standard deviation were computed across
all the frames of the given audio file. The reason was that we wanted to check which
features were redundant or meaningful when training the classifier.
As mentioned in section 3.4, the fact that MusicExtractor() has to be called with a
filename, not an audio stream, forced us to create another version of the dataset,
which had each event annotated in a different audio file with the correspondent
class label as filename. Once all the datasets were properly sliced and sight-checked,
the last cell of the notebook was executed with the correspondent folder names
(which contain all the sliced samples), and the features were saved in a different
CSV file for each dataset3. Adding the number of instances in all the CSV files, we
get 40,228 instances with 84 features and 1 label.

1. https://essentia.upf.edu/reference/std_MusicExtractor.html
2. https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3. https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices_features
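A minimal sketch of this extraction step, assuming Essentia's Python bindings are installed; the analysis sizes and the file path are illustrative, not the project's actual values.

import essentia.standard as es

# Configure window/hop sizes and the statistics aggregated over frames
extractor = es.MusicExtractor(lowlevelFrameSize=2048,
                              lowlevelHopSize=1024,
                              lowlevelStats=['mean', 'stdev'])

features, frames = extractor('slices/hh+kd/0001.wav')    # path is illustrative

# 'features' is an Essentia Pool; the scalar low-level descriptors become
# one row of the CSV, alongside the class label taken from the filename
row = {name: features[name]
       for name in features.descriptorNames()
       if name.startswith('lowlevel') and isinstance(features[name], float)}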
42 Drums event classifier 27
4.2.2 Training and validating
As mentioned in section 2.2, some authors have proposed machine learning
algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours
(KNN) to do sound event classification, and some authors have developed more
complex methods for drums event classification. The complexity of these latter
methods made me choose the generic ones, also to see whether they were a good
way to approach the problem, as there is no literature concretely on drums event
classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has
been the main reference when designing the 40kSamples Drums Dataset. The first
times we tried the models we were working with the class distribution of Figure 4;
as commented, this was a very unbalanced dataset, and we were evaluating the
classification inference with the plain accuracy of formula (4.1), which does not take
the imbalance of the dataset into account. The accuracy computation was around
92%, but the correct predictions were mainly on the large classes; as shown in Figure
7, some classes had very low accuracy (even 0%, as some classes have 10 samples,
7 used to train and 3 to validate, all of them badly predicted), but having a small
number of instances affects the accuracy computation less.

$$\text{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i) \qquad (4.1)$$
Instead, the proper way to compute the accuracy on this kind of dataset is the
balanced accuracy: it computes the accuracy for each class and then averages it
across all the classes, as in formula (4.2), where $w_i$ represents the weight of each
sample and $\hat{w}_i$ its class-normalized version. This computation lowered the
result to 79%, which was not a good result.

$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \qquad (4.2)$$

$$\text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i$$
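Both metrics are implemented in scikit-learn; the toy example below (with invented labels) shows how plain accuracy can be misleading on unbalanced data:

from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = ['hh'] * 90 + ['sd'] * 10      # unbalanced ground truth
y_pred = ['hh'] * 100                   # a model that only predicts 'hh'

print(accuracy_score(y_true, y_pred))            # 0.90, looks good
print(balanced_accuracy_score(y_true, y_pred))   # 0.50, reveals the problem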
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used indicator for classification models is the F-score, which
combines the precision and the recall of the model in one measure, as in formula
(4.3). Precision is computed as the number of correct predictions divided by the
total number of predictions, and recall as the number of correct predictions divided
by the total number of instances that should have been predicted for a given class.

$$F\text{-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (4.3)$$
These results led us to the process of recording a personalized dataset to extend the
already existing one (see section 3.2.2). With this new distribution the results
improved, as shown in Figure 8, as well as the balanced accuracy and F-score (both
89%). Until this point we were using both KNN and SVM models to compare
results, and the SVM always performed at least 10% better, so we decided to focus
on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine controls the regularization; this
technique is intended to make a model less sensitive to data noise and to the outliers
that may not represent the class properly. When increasing this value to 10, the
results improved across all the classes, as shown in Figure 9, as well as the accuracy
and F-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare
drum class was somewhat of a problem, as it is one of the most used instruments in
the drumset, jointly with the hi-hat and the kick drum. So I tried the same process
with the classes that include only the three mentioned instruments (i.e. hh, kd, sd,
hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved
the overall accuracy and F-score to 97.7%, and concretely the sd accuracy to 96%,
as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the training and validating iterative process has been
developed in the Classifier_training.ipynb4 notebook. First, the CSV files with the
features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending
on which subset of classes will be used, the correspondent instances are filtered and,
to remove redundant features, the ones with a very low standard deviation are
deleted (i.e. std_dev < 0.00001). As the SVM works better when data is normalized,
the standard scaler is used to move all the data distributions around 0, ensuring a
standard deviation of 1.
In the next cells, the dataset is split into train and validation sets and the fit method
of the sklearn SVM is called to perform the training; when the models are trained,
their parameters are dumped into a file to load the model a posteriori and be able
to apply the learned knowledge to new data. This process was very slow on my
computer, so we decided to upload the CSV files to Google Drive and open the
notebook with Google Colaboratory, as it was faster, which is key to avoid long
waiting times during the iterative train-validate process. In the last cells, inference
is made with the validation set and the accuracy is computed, and the confusion
matrix is plotted to get an idea of which classes perform better.
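The whole iteration can be condensed into a short sketch; the file name and column layout are assumptions, and the real notebook adds the class filtering and the confusion matrix plot:

import pandas as pd
from joblib import dump
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score

data = pd.read_csv('features/studio.csv')           # illustrative file name
X, y = data.drop(columns='label'), data['label']
X = X.loc[:, X.std() > 0.00001]                     # drop near-constant features

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=0)
scaler = StandardScaler().fit(X_tr)                 # zero mean, unit variance
clf = SVC(C=10).fit(scaler.transform(X_tr), y_tr)

print(balanced_accuracy_score(y_val, clf.predict(scaler.transform(X_val))))
dump((scaler, clf), 'svm_drums.joblib')             # persist for later inference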
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now, all the slices
have been created using the annotations, but to assess a new submission from a
student we need to detect the onsets and then slice the events. The function
SliceDrums_BeatDetection()5 does both tasks. As explained in section 2.1.1, there
are many methods to do onset detection and each of them is better suited for a
different application. In the case of drums, we tested the 'complex' method, which
finds changes in the frequency domain in terms of energy and phase and works
pretty well; but when the tempo increases, some onsets are not correctly detected.
For this reason, we finally implemented the onset detection with the HFC method.
This method computes for each window the HFC as in equation (4.4); note that
high-frequency bins (index k) weigh more in the final value of the HFC.

$$HFC(n) = \sum_k k \cdot |X_k[n]|^2 \qquad (4.4)$$

4. https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5. https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
Moreover, the function plots the audio waveform jointly with the detected onsets,
to check after each test that it has worked correctly. In Figures 11 and 12 we can
see two examples of the same music sheet played at 60 and 220 bpm; in both cases
all the onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
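The HFC-based detection can be sketched with Essentia's standard mode as follows; the frame and hop sizes and the filename are illustrative, and the project's SliceDrums_BeatDetection() additionally slices the audio and plots the result.

import essentia
import essentia.standard as es

audio = es.MonoLoader(filename='submission.wav')()   # filename is illustrative

od_hfc = es.OnsetDetection(method='hfc')
windowing, fft, c2p = es.Windowing(type='hann'), es.FFT(), es.CartesianToPolar()

# Compute the HFC onset detection function frame by frame
pool = essentia.Pool()
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    magnitude, phase = c2p(fft(windowing(frame)))
    pool.add('features.hfc', od_hfc(magnitude, phase))

# Peak-pick the detection function to obtain onset times in seconds
onsets = es.Onsets()(essentia.array([pool['features.hfc']]), [1])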
With the onsets information, the audio can be trimmed into the different events;
the order is kept in the name of each file, so when comparing with the expected
events they can be mapped easily. The audio slices are passed to the
extract_MusicalFeatures() function, which saves the musical features of each slice
in a CSV file.
To predict which event each slice contains, the already trained models are loaded
in this new environment and the data is pre-processed using the same pipeline as
when training. After that, the data is passed to the classifier method predict(),
which returns the predicted event for each row in the data. The described process
is implemented in the first part of Assessment.ipynb6; the second part executes the
visualization functions described in the next section.
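A minimal sketch of this inference step, reusing the objects persisted at training time (file and column names are assumptions):

import pandas as pd
from joblib import load

scaler, clf = load('svm_drums.joblib')               # trained scaler and SVM

slices = pd.read_csv('features/submission.csv')      # one row per sliced event
slices = slices[scaler.feature_names_in_]            # same columns, same order
predicted_events = clf.predict(scaler.transform(slices))
# predicted_events follows the onset order, e.g. ['hh+kd', 'hh', 'sd', ...]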
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving
visual feedback of the interpretation to the student. As the drums classifier has
taken so much time, the creation of a dataset with interpretations and their grades
has not been feasible. A first approximation was to record different interpretations
of the same music sheet simulating different skill levels, but grading them and doing
all the process by ourselves was not easy; apart from that, we tended to play the
fragments either well or badly, and it was difficult to simulate intermediate levels
and be consistent with the proposed grades.
So the implemented solution generates an image that shows the student whether
the notes of the music sheet are correctly read and whether the onsets are aligned
with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to
be returned. Taking as a base the implementation of my colleague Eduard Vergés7,
and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last
cell of the notebook Assessment.ipynb.
First, the LilyPond file paths are defined. Then, for each of the submissions, the
audio is loaded to generate the waveform plot.

6. https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7. https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8. https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and
expected onsets, the waveform, and the start and end of the waveform (this comes
from the LilyPond file's macro). To properly plot the deviations, the code assumes
that the interpretation starts four beats after the beginning of the audio.
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is
shown. The black lines at the bottom of the waveform are the detected onsets, while
the cyan lines in the middle are the expected ones; when the difference between the
two values increases, the area between them is colored with a traffic-light code (from
green, good, to red, bad).
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
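A simplified re-creation of the idea behind save_bar_plot() could look like this; the color mapping, the 100 ms worst-case deviation and the styling are assumptions, not the project's actual values.

import numpy as np
import matplotlib.pyplot as plt

def deviation_color(delta, max_dev=0.1):
    # Map an onset deviation in seconds to a green-to-red color
    x = min(abs(delta) / max_dev, 1.0)
    return (x, 1.0 - x, 0.0)

def plot_deviations(waveform, sr, detected, expected, path='deviation.png'):
    t = np.arange(len(waveform)) / sr
    plt.plot(t, waveform, color='0.7')
    for d, e in zip(detected, expected):
        plt.axvline(d, ymax=0.2, color='k')              # detected, bottom
        plt.axvline(e, ymin=0.4, ymax=0.6, color='c')    # expected, middle
        plt.axvspan(min(d, e), max(d, e),
                    color=deviation_color(d - e), alpha=0.4)
    plt.savefig(path)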
Once the waveform plot is created, it is embedded in a lambda function that is
called from the LilyPond render. But before calling LilyPond to render, the
assessment of the notes has to be done. In the function assess_notes()10, the
expected and predicted events are compared, creating a list of booleans encoded as
0 for False and 1 for True. The resulting list is then iterated and the 0 indices are
checked, because most of the classification errors fail in only one of the instruments
to be predicted (i.e. predicting sd instead of hh+sd). These cases are considered
partially correct, as the system has to take its own errors into account: at the indices
in which one of the instruments is correctly predicted and it is not a hi-hat (we
consider it more important to read correctly the snare and kick than a hi-hat, which
is present in all the events), the value is turned to 0.75 (light green in the color
scale). In Figure 15 the different feedback options are shown: green notes mean
correct, light green means partially correct and red means incorrect.

9. https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10. https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
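The scoring logic just described can be summarized in a small hypothetical re-implementation (the real assess_notes() lives in scripts/drums.py and may differ in detail):

def assess_notes(expected, predicted):
    # 1.0 correct, 0.75 partially correct (a non-hi-hat instrument matches),
    # 0.0 incorrect; labels use the 'hh+sd'-style class names
    scores = []
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            scores.append(1.0)
        elif (set(exp.split('+')) & set(pred.split('+'))) - {'hh'}:
            scores.append(0.75)
        else:
            scores.append(0.0)
    return scores

print(assess_notes(['hh+sd', 'hh+kd', 'hh'], ['sd', 'hh', 'hh']))
# [0.75, 0.0, 1.0]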
With the waveform, the notes assessed and the LilyPond template, the function
score_image()11 can be called. This function renders the LilyPond template jointly
with the previously created waveform; this is done with the LilyPond macros. On
one hand, before each note on the staff, the keywords color() and size() determine
that the color and size of the note depend on an external variable (the notes
assessed); on the other hand, after the first note of the staff, the keyword
eps(1.150, 16) indicates on which beats the waveform is displayed, in this case from
0 to 16, which in a 4/4 rhythm is 4 bars, while the other number is the scale of the
waveform and allows fitting the plot better to the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of
the expected events and their timesteps; these are found in the txt file already
mentioned in section 3.1.1. Then, the LilyPond file: this is the template written in
the LilyPond language that defines the resultant music sheet, where the macros to
change color and size and to add the waveform are defined. When extracting the
musical features, each submission creates its CSV file to store the information. And
finally, we need, of course, the audio files with the recorded submission to be
assessed.

11. https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can
evaluate the results to check whether the system works correctly and is useful for a
student to learn, and also to test its limits regarding audio signal quality and tempo.
The tests have been done with two different exercises, recorded with a computer
microphone and played at different tempos, starting at 60 bpm and adding 40 bpm
until reaching 220 bpm. The recordings with good tempo and good reading have
been processed adding 6 dB at a time, until an accumulated gain of +30 dB.
In this chapter and in Appendix B all the resulting feedback visualizations are
shown. The audio files can be listened to in Freesound, where a pack1 has been
created. Some of them will be commented on and referenced in further sections;
the rest are extra results.
As the high frequency content method works perfectly, there are no limitations nor
errors in terms of onset detection: all the tests have an f-measure of 1, detecting all
the expected events without any false positive.

1. https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops
when the tempo increases. Taking as reference the figures that show a good reading,
in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21
and 22), we can count how many are correct or partially correct to score each case:
a correct prediction weighs 1.0, a partially correct one 0.5 and an incorrect one 0;
the total value is the mean of the weighted results of the predictions.
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of
the classifier decreases. This may be because increasing the tempo reduces the
spacing between events and consequently the duration of each event, which leads to fewer
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to calculate the mean and standard deviation when extracting the timbre
characteristics. As stated in the law of large numbers [25], the larger the sample,
the closer its mean is to the population mean. In this case, having fewer values in
the calculation creates more outliers in the distribution, which tends to scatter.
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48
Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster
than 100 bpm. But in 12/8 at 100 bpm the subdivision rate is equivalent to 300
eighth notes per minute, similar to exercise 1 at 140 bpm in 4/4, whose rate is 280
eighth notes per minute. The results in 12/8 (Table 4) are also better because there
are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875
Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening
to the submissions, the hi-hat events are recorded with less amplitude than the
snare and kick events; for this reason, we think that the classifier starts to fail at
+18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the
previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to
saturate, and at that gain level the overall waveform is so clipped that the resulting
high-frequency content is predicted as a hi-hat in all the cases.
Level    Correct   Partially OK   Incorrect   Total
+0 dB    25        7              0           0.89
+6 dB    23        9              0           0.86
+12 dB   23        9              0           0.86
+18 dB   24        7              1           0.86
+24 dB   18        5              9           0.64
+30 dB   13        5              14          0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0 dB    12        7              13          0.48
+6 dB    13        10             9           0.56
+12 dB   10        8              14          0.5
+18 dB   9         2              21          0.31
+24 dB   8         0              24          0.25
+30 dB   9         0              23          0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6 dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6 dB at each new staff
5.3 Evaluation of the assessment
Until now, the evaluation of results has been focused on the accuracy of the drums
event classifier, but we think that it is also important to evaluate whether the system
can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some
of the beats are not read, the system can still map the rest of the events to the
expected ones at the correspondent onset time steps. This is due to a check done
in the assessment, which assumes that before the first beat there is a count-in of
one bar and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in previous sections, counting the number
of correct predictions, but now in terms of assessment. The analyzed results are the
'bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a
correct assessment if the note is green or light green and the event is the one in the
music score, or if the note is red and the event is not the one in the music score.
The rest of the cases are counted as incorrect assessments. The total value is the
number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that, for a controlled environment and low tempos, the system performs
the assessment based on the predictions pretty well. This can be helpful for a student
to know which parts of the music sheet are well read and which are not. Also, the
tempo visualization can help the student recognize whether they are slowing down
or rushing when reading the score: as can be seen in Figure 30, the detected onsets
(black lines at the bottom part of the waveform) are mostly behind the
correspondent expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point, all the work of the project has been done and the results have been
analyzed. In this chapter, a discussion is developed about which objectives have
been accomplished and which have not. Also, a set of further improvements is given,
and a final thought on my work and my apprenticeship. The chapter ends with an
analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained along this document, we can now list
them, defining their completeness and our contributions.
Firstly, the 29kSamples Drums Dataset has been created and is now publicly
available and downloadable from Freesound and Zenodo. Apart from being used in
this project, this dataset might be useful to other researchers and students in their
own projects. The dataset is indeed useful to balance drums datasets based on real
interpretations, as the class distribution of such interpretations is very unbalanced,
as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been
proposed and trained with the aforementioned dataset. One of the reasons for using
this approach to predict the events was that there was no literature focused on
classifying drums events in this manner. As the results have shown, more complex
methods based on the context might be used, such as the ones proposed in [16] and
[17]. It is important to take into account that the task the model is trained to do
is very hard for a human being: differentiating drums events in an individual drums
sample, without any context, is almost impossible even for a trained ear such as my
drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as
the development of a MusicXML parser. This part took around one month to be
developed and, from my point of view, it was a great way to understand how these
file formats work and how they can be improved, as they are mostly focused on the
visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to
demonstrate the functionality of the system, and tests of these exercises have been
recorded in a different environment from the 29kSamples Drums Dataset. It would
be good to get recordings in different spaces and with different drumsets and
microphones to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with
different drumsets, tuning each drumset differently, using different sticks to hit the
instruments, and even having different people playing. This would introduce more
variance in the drums sample dataset. Moreover, on June 9th 2021, a paper about
a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This
new dataset could be included in the training process, as its authors state that
having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the
overall robustness of the system. It would be appropriate to introduce the
aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
1. https://www.2021.ieeeicassp.org
Also, in terms of the classes in the drumset, there is a long path to cover. There
are no solutions that transcribe in a robust way a whole set, including the toms and
the different kinds of cymbals. In this sense, we think that a proper approach would
be to work with professional musicians, who can help researchers better understand
the instrument and create datasets covering different techniques.
With respect to the assessment step, apart from the feedback visualization of the
tempo deviations and the reading accuracy, a regression model could be trained
with assessed drums exercises in order to give a mark to each student. On this path,
introducing an electronic drumset with MIDI output would make things a lot easier,
as the drums classifier step would be omitted.
About the implementation, a good contribution would be to introduce the models
and algorithms into the Pysimmusic workflow and develop a demo web app like
Music Critic's. But better results and more robustness are needed before taking
this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if its code and data are available
and other researchers and students can execute them getting the same results.
All the code has been developed in Python, a widely known general-purpose
programming language. It is available in my GitHub repository2, as well as the
data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository3
and some samples in Freesound4. This is the 29kSamples Drums Dataset, as not all
the 40k samples used for training are our property and we cannot share them under
our full authorship; despite this, the other datasets used in this project are available
individually.
2. https://github.com/MaciAC/tfg_DrumsAssessment
3. https://zenodo.org/record/4923588
4. https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work
described, the goal of supporting drums learning has been accomplished.
Nevertheless, work remains in terms of robustness and reliability; still, a first
approximation has been presented, and several paths of improvement have been
proposed.
Moreover, some fields of engineering and computer science have been covered, such
as signal processing, music information retrieval and machine learning; not only in
terms of implementation, but also by investigating methods and gathering already
existing experiments and results.
About my relationship with computers, I have improved my fluency with git and
its web version, GitHub. At the beginning of the project I wanted to execute
everything on my local computer, having to install and compile libraries that could
not be installed on macOS via the pip command (i.e. Essentia), which has been a
tough path to take. In a more advanced phase of the project, I realized that the
LilyPond tools could not be installed and used fluently on my local machine, so I
moved all the code to my Google Drive to execute the notebooks on a Colaboratory
machine. Developing code in this environment has its own peculiarities, which I
have had to learn. In summary, I have spent a good amount of time looking for the
ideal way to develop the project, and the process has indeed been fruitful in terms
of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my
Bachelor's degree, as I reviewed some of the concepts of most personal interest.
Being able to relate the project with music and drums helped me keep my
motivation and focus. I am quite satisfied with the feedback visualization that the
system produces, and I hope that more people get interested in this field of research
so that better tools appear in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Machine Learning 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: an audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2013).
[24] Harte, C., Sandler, M., Abdallah, S. & Gómez, E. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. In A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
Figure 31 Recording routine 1
Figure 32 Recording routine 2
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
1.4 Objectives
The main objective of this project is to develop a tool to assess drums
interpretations of a proposed music sheet. This objective can be split into the
different steps of the pipeline:
• Generate a correctly annotated drums dataset, which means a collection of audio
drums recordings and their annotations, all equally formatted.
• Implement a drums event sound classifier.
• Find a way to properly visualize drums sheets and their assessment.
• Propose a list of exercises to evaluate the technology.
In addition, publishing the code in a public GitHub6 repository and uploading the
created dataset to Freesound7 and Zenodo8 will be a good way to share this work.
1.5 Project overview
The next chapters will be developed as follows In chapter 2 the state of the art is
reviewed Focusing on signal processing algorithms and ways to implement sound
event classification ending with music sheet technologies and software tools available
nowadays In chapter 3 the creation of a drums dataset is described Presenting the
use of already available datasets and how new data has been recorded and annotated
In chapter 4 the methodology of the project is detailed which are the algorithms
used for training the classifier as well as how new submissions are processed to assess
them In chapter 5 an evaluation of the results is done pointing out the limitations
and the achievements Chapter 6 concludes with a discussion on the methods used
the work done and further work
6 https://github.com
7 https://freesound.org
8 https://zenodo.org
Chapter 2
State of the art
In this chapter the concepts and technologies used in the project are explained, covering algorithm references and existing implementations. First, signal processing techniques for onset detection and feature extraction are reviewed; then the sound event classification field is presented, along with its relationship with drums event classification. The principal music sheet technologies and codecs are also presented. Finally, specific software tools are listed.
2.1 Signal processing
2.1.1 Feature extraction
In the following sections sound event classification will be explained; most of these methods are based on training models with features extracted from the audio, not with the raw audio chunks themselves [6]. In this section, signal processing methods to obtain those features are presented.
Onset detection
In an audio signal, an onset is the beginning of a new event; it can be either a single note, a chord or, in the case of the drums, the sound produced by hitting one or more instruments of the drumset. It is necessary to have a reliable algorithm that properly detects all the onsets of a drums interpretation. With the onset information (a list of timestamps), the audio can be sliced to analyze each chunk separately and to assess the tempo consistency.
It is important to address the challenge in a psychoacoustical way, as the objective is to detect the musical events as a human would. In [7] the idea of perceptual onset for percussive instruments is defined as a time interval between the physical onset and the moment that the maximum level is reached. In [8] many methods are reviewed, focusing on the differences in performance depending on the signal: Non-Pitched Percussive instruments are better detected with temporal methods or high-frequency content methods, while Pitched Non-Percussive instruments may need to take into account changes of energy in the spectrum distribution, as the onset may represent a different note.
The sound generated by the drums is mainly percussive (discarding brushes' slow patterns or mallets' build-ups on the cymbals), which means that it is formed by a short transient followed by a short decay; there is no sustain. As the transient is a fast change of energy, it implies high-frequency content, because the changes happen in a very short frame of time. As recommended in [9], the HFC method will be used.
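To make this concrete, a minimal sketch of frame-wise HFC onset detection with Essentia's standard Python module is shown below, following the library's documented workflow; the filename and the frame/hop sizes are illustrative assumptions, not values fixed by this project:

import essentia
import essentia.standard as es

# Compute an HFC-based onset detection function frame by frame,
# then peak-pick the onset times with Onsets().
audio = es.MonoLoader(filename='drums_take.wav')()   # hypothetical recording
od_hfc = es.OnsetDetection(method='hfc')
windowing = es.Windowing(type='hann')
fft = es.FFT()
c2p = es.CartesianToPolar()

pool = essentia.Pool()
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    magnitude, phase = c2p(fft(windowing(frame)))
    pool.add('odf.hfc', od_hfc(magnitude, phase))

# Onsets() takes a matrix of detection functions plus one weight per function,
# and returns the onset timestamps in seconds.
onset_times = es.Onsets()(essentia.array([pool['odf.hfc']]), [1])
print(onset_times)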
Timbre features
As described in [10], a feature denotes in some way a quantity or a value. Features extracted by processing the audio stream, or transformations of it (i.e. FFT), are called low-level descriptors; these features carry no relevant information from a human point of view, but are useful for computational processes [11].
Some low-level descriptors are computed from the temporal information: for instance, the zero-crossing rate tells the number of times the signal crosses the zero axis per second, the attack time is the duration of the transient, and the temporal centroid describes the energy distribution of an event over time. Other well known features are the root mean square of the signal or the high-frequency content mentioned in section 2.1.1.
Besides temporal features, low-level descriptors can also be computed from the frequency domain. Some of them are spectral flatness, spectral roll-off, spectral slope and spectral flux, among others.
Nowadays Essentia's library offers a collection of algorithms that reliably extract the aforementioned low-level descriptors; the function that encompasses all the extractors is called MusicExtractor1.
2.1.2 Data augmentation
Data augmentation processes refer to the optimization of the statistical representation of the datasets, in terms of improving the generalization of the resultant models. These methods are based on the introduction of unobserved data or latent variables that may not be captured during the dataset creation [12].
Regarding this technique applied to audio data, signal processing algorithms are proposed in [13] and [14] that introduce changes to the signals in both the time and frequency domains. In these articles the goal is to improve accuracy on speech and animal sound recognition, although this could apply to drums event classification. The processes that led to the best results in [13] and [14] were related to time-domain transformations, for instance time-shifting and stretching, adding noise or harmonic distortion, or compressing into a given dynamic range, among others. Other proposed processes were focused on the spectrogram of the signal, applying transformations such as shifting the matrix representation, setting some areas to 0, or adding spectrograms of different samples of the same class.
Presently, some Python2 libraries are developed and maintained to perform audio data augmentation tasks, for instance audiomentations3 and its GPU version, torch-audiomentations4.
1 https://essentia.upf.edu/streaming_extractor_music.html
2 https://www.python.org
3 https://pypi.org/project/audiomentations/0.6.0
4 https://pypi.org/project/torch-audiomentations
2.2 Sound event classification
Sound Event Classification is the task of detecting and recognizing sound events in an audio stream [15]. As described in [10], this task can be approached from two sides: on the one hand, the perceptual approach tries to extract the timbre similarity to cluster sounds as we perceive them; on the other hand, the taxonomic approach aims to label sound events as they are defined in cultural or user-biased taxonomies. In this project the focus is on the second approach, as the task is to classify sound events in the drums taxonomy (i.e. kick drum, snare drum, hi-hat).
Also in [] many classification methods are proposed; concretely, for the taxonomic approach, machine learning algorithms such as K-Nearest Neighbors, Support Vector Machines or Neural Networks, all of them using features extracted from the audio data, as explained in section 2.1.1.
2.2.1 Drums event classification
This section is divided into two parts, first presenting the state-of-the-art methods for drums event classification and then the most relevant existing datasets. This section is mainly based on the article [1], as it is a review of the topic that encompasses the core concepts of the project.
Methods
Focusing on taxonomic drums event classification, this field has been studied over the last years; in the Music Information Retrieval Evaluation eXchange5 (MIREX) it has been a proposed challenge since 20056. In [1] a review of the main methods that have been investigated is done. The authors collect different approaches, such as Recurrent Neural Networks, proposed in [16], Non-Negative Matrix Factorization, proposed in [17], and other real-time approaches using Max/MSP7, as described in [18].
5 https://www.music-ir.org/mirex/wiki/MIREX_HOME
6 https://www.music-ir.org/mirex/wiki/2005:Audio_Drum_Detection_Results
7 https://cycling74.com/products/max
It should be mentioned that the proposed methods are focused on Automatic Drum Transcription (ADT) of drumsets formed only by the kick drum, snare drum and hi-hat. The ADT field is intended to transcribe audio, but in our case we have to check whether an audio event is the expected event or not; this particularity can be used in our favor, as some assumptions can be made about the audio that has to be analyzed.
Datasets
In addition to the methods and their combinations, the data used to train the system plays a crucial role. As a result, the dataset may have a big impact on the generalization capabilities of the models. In this section some existing datasets are described:
• IDMT-SMT-Drums [19]: Consists of real drum recordings containing only kick drum, snare drum and hi-hat events. Each recording has its transcription in xml format, and it is publicly available to download8.
• MDB Drums [20]: Consists of real drums recordings of a wide range of genres, drumsets and styles. Each recording has two txt transcriptions, for the classes and subclasses defined in [20] (e.g. class: Hi-hat; subclasses: closed hi-hat, open hi-hat, pedal hi-hat). It is publicly available to download9.
• ENST-Drums [21]: Consists of real drum audio and video recordings of different drummers and drumsets. Each recording has its transcription, and some of them include accompaniment audio. It is publicly available to download10.
• DREANSS [22]: Differently from the others, this dataset is a collection of drum recording datasets that have been annotated a posteriori. It is publicly available to download11.
Electronic drums datasets have not been considered, as the student assignment is supposed to be recorded with a real drumset.
8 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
9 https://github.com/CarlSouthall/MDBDrums
10 https://perso.telecom-paristech.fr/grichard/ENST-drums
11 https://www.upf.edu/web/mtg/dreanss
2.3 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter programs of the 80s. Proprietary software such as Finale12 and Sibelius13, or open-source software such as MuseScore14 and LilyPond15, are some options that can be used nowadays to write music sheets with a computer.
In terms of file format, Sibelius has its encrypted version that can only be read and written with the software; it can also write and read MusicXML16 files, which are not encrypted and are similar to an HTML file, as they contain tags that define the bars and notes of the music sheet. This format is the standard for exchanging digital sheet music.
Within Music Critic's framework, the technology used to display the evaluated score is LilyPond: it can be called from the command line and allows adding macros that change the size or color of the notes. The other particularity is that it uses its own file format (.ly), so scores that are in MusicXML format have to be converted and reviewed.
2.4 Software tools
Many of the concepts and algorithms aforementioned are already available as software libraries. This project has been developed with Python, and in this section the libraries that have been used are presented. Some of them are open and public, and some others are private, such as pysimmusic, which has been shared with us so we can use and consult it. In addition, all the code has been developed using a tool from Google called Colaboratory17; it allows writing code in a jupyter notebook18 format that is agile to use and execute interactively.
12 https://www.finalemusic.com
13 https://www.avid.com/sibelius
14 https://musescore.org
15 https://lilypond.org
16 https://www.musicxml.com
17 https://colab.research.google.com
18 https://jupyter.org
2.4.1 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis, description and synthesis [23]; it can also be installed as a Python-based library with the pip19 command in Linux, or by compiling with certain flags in macOS20. This library includes a collection of MIR algorithms; it is not a framework, so it is in the user's hands how to use these processes. Some of the algorithms used in this project are music feature extraction, onset detection and audio file I/O.
2.4.2 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine learning algorithms for regression, classification and clustering, as well as pre-processing and dimensionality reduction functions. It is based on NumPy22 and SciPy23, so its algorithms are easy to adapt to the most common data structures used in Python. It also allows saving and loading trained models to do inference tasks with new data.
2.4.3 LilyPond
As described in section 2.3, LilyPond is an open-source scorewriter software with its own file format and language. It can produce visual renders of musical sheets in PNG, SVG and PDF formats, as well as MIDI files to listen to the compositions. LilyPond works on the command line and allows us to introduce macros to modify visual aspects of the score, such as color or size.
It is the digital sheet music technology used within Music Critic's framework, as it allows embedding an image in the music sheet, generating a parallel representation of the music sheet and a student's interpretation.
19 https://pypi.org/project/pip
20 https://essentia.upf.edu/installing.html
21 https://scikit-learn.org
22 https://numpy.org
23 https://www.scipy.org/scipylib/index.html
2.4.4 Pysimmusic
Pysimmusic is a private Python library developed at the MTG. It offers tools to analyze the similarity of musical performances and uses libraries such as Essentia, LilyPond and FFmpeg24, among others. Pysimmusic contains onset detection algorithms and a collection of audio descriptors and evaluation algorithms. It is currently the main evaluation software used in Music Critic to compare the submitted recording with the reference.
2.4.5 Music Critic
Music Critic is a project from the MTG intended to support technologies for online music education, facilitating the assessment of student performances25.
The proposed workflow starts with a student submitting a recording playing the proposed exercise. Then the submission is sent to the Music Critic server, where it is analyzed and assessed. Finally, the student receives the evaluation jointly with the feedback from the server.
2.5 Summary
Music information retrieval and machine learning have been popular fields of study. This has led to a large development of methods and algorithms that will be crucial for this project. Most of them are free and open-source and, fortunately, the private ones have been shared by the UPF research team, which is a great base to start the development.
24 https://www.ffmpeg.org
25 https://www.upf.edu/web/mtg/tech-transfer/-/asset_publisher/pYHc0mUhUQ0G/content/id/229860881/maximized#YJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in section 1.3.2, having a well-annotated and balanced dataset is crucial to get proper results. In this section the 40kSamples Drums Dataset creation process is explained: first focusing on how to process existing datasets, such as those mentioned in section 2.2.1; secondly, introducing the process of creating new datasets, with a music school corpus and a collection of recordings made in a recording studio; and finally, describing the data augmentation procedure and how the audio samples are sliced into individual drums events. In Figure 1 we can see the different procedures used to unify the annotations of the different datasets, while the audio does not need any specific modification.
3.1 Existing datasets
Each of the existing datasets has a different annotation format. In this section the process of unifying them is explained, as well as its implementation (see notebook Dataset_formatUnification.ipynb1). As the events to take into account can be single instruments or combinations of them, the annotations have to be formatted to represent such events properly. None of the annotations has this approach, so we have written a function that filters the list and joins the events played with a small difference of time, meaning that they are played simultaneously.
1 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
[Diagram: annotation unification pipelines. Music school: Sibelius to MusicXML, then MusicXML parser to txt; Studio REC: annotations written directly; IDMT Drums and MDB Drums: audio + txt. All converge into unified annotations and audio.]
Figure 1 Datasets pre-processing
3.1.1 MDB Drums
This dataset was the first we worked with; its annotation format in txt was a key factor, as it was easy to read and understand. As the dataset is available on GitHub2, there is no need to download it nor to process it from a local drive. As shown in the first cells of Dataset_formatUnification.ipynb, data from the repository can be retrieved with a Python wrapper of the GitHub API3.
This dataset has two annotation files, depending on how deep the taxonomy used is [20]. In this case the generic class taxonomy is used, as there is no need to differentiate styles when playing a given instrument (i.e. single stroke, flam, drag, ghost note).
3.1.2 IDMT Drums
Differently from the previous dataset, this one is only available by downloading a zip file4. It also differs in the annotation file format, which is xml. Using the Python package xmltodict5, in the second part of Dataset_formatUnification.ipynb the xml files are loaded as Python dictionaries and converted to txt format.
2 https://github.com/CarlSouthall/MDBDrums
3 https://pypi.org/project/githubpy
4 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
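For illustration, a minimal xmltodict sketch of this conversion is shown below; the tag names in the toy XML are assumptions for demonstration, not the IDMT schema:

import xmltodict

# Toy annotation in the spirit of the dataset; the real tag names differ.
xml = '''
<transcription>
  <event><onsetSec>0.50</onsetSec><instrument>KD</instrument></event>
  <event><onsetSec>1.00</onsetSec><instrument>SD</instrument></event>
</transcription>
'''
doc = xmltodict.parse(xml)   # nested dictionaries mirroring the XML tree
for event in doc['transcription']['event']:
    print(f"{event['onsetSec']}\t{event['instrument']}")   # one txt-style line per event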
3.2 Created datasets
In order to expand the dataset with more variety of samples, other methods to get data have been explored. On one hand, with audio data that has partial annotations or some representation that is not data-driven, such as a music sheet, which contains a visual representation of the music but not a logic annotation, as mentioned in the previous section. On the other hand, generating simple annotations is an easy task, so drums samples can be recorded standalone to create data in a controlled environment. In the next two sections these methods are described.
3.2.1 Music school
A music school has shared its teaching material with the MTG for research purposes, i.e. audio demos, books in PDF format and music sheets in Sibelius format. As we can see in Figure 1, the annotations from the music school corpus are in Sibelius format; this is an encrypted representation of the music sheet that can only be opened with the Sibelius software. The MTG has shared an AVID license, which includes the Sibelius software, so we were able to convert the .sib files to MusicXML. MusicXML is not encrypted and can be opened and read, so a parser has been developed to convert the MusicXML files to a symbolic representation of the music sheet. This representation has been inspired by [24], which proposes a system to represent chords.
MusicXML parser
As mentioned in section 2.3, the MusicXML format is based on ordering the visual information with tags, creating a tree structure of nested dictionaries. In the first cell of XML_parser.ipynb6, two functions are defined. ConvertXML2Annotation reads the musicxml file and gets the general information of the song (i.e. tempo, time measure, title); then a loop iterates over all the bars of the music sheet, checking whether the given bar is self-defined, a repetition of the previous one, or the begin or end of a repetition in the song (see Figure 2). In the self-defined case, the bar is passed to an auxiliary function, which parses it to obtain the aforementioned symbolic representation.
5 https://pypi.org/project/xmltodict
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three others have a symbol that means 'repetition of the previous bar'; moreover, the bar lines at the beginning and the end represent that these four bars have to be repeated. Therefore, this line in the music score represents an interpretation of eight bars, repeating the first one.
The symbolic representation that we propose, based on [24], defines each bar with a string; this string contains the representations of the events in the bar, separated by blank spaces. Each of the events uses a colon (:) to separate the figure (i.e. quarter note, half note, whole note) from the note or notes of the event, which are separated by a dot (.). For instance, the symbolic representation of the first bar in Figure 2 is F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4.
In addition to this conversion, in the parse_one_measure function from the XML_parser notebook each measure is checked to ensure that it fully represents the bar. This means that the sum of the figures of the bar has to be equal to the one defined in the time measure: the sum of the events in a 4/4 bar has to be equal to four quarter notes.
Symbolic notation to unified annotation format
As we can see in Figure 1, once the music scores are converted to the symbolic representation, the last step is to unify the annotations with the format used in section 3.1. This process is made in the last cells of the Dataset_formatUnification7 notebook. A dictionary with the translation of the notes to drums instruments is defined, so the note is directly converted. Differently, the timestamp of each event has to be computed based on the tempo of the song and the figure of each event; this process is made with the function get_time_steps_from_annotations8, which reads the interpretation in symbolic notation and accumulates the duration of each event based on the figure and the tempo.
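A minimal sketch of this accumulation logic is shown below; the NOTE_TO_DRUM mapping and the function name are illustrative assumptions, not the repository's exact code:

# Sketch: turn symbolic bars such as "F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4"
# into (timestamp, event) annotations by accumulating event durations.
NOTE_TO_DRUM = {"F4": "kd", "A4": "hh", "C5": "sd"}   # hypothetical subset of the real mapping

def symbolic_to_annotations(bars, tempo_bpm):
    seconds_per_quarter = 60.0 / tempo_bpm
    t, annotations = 0.0, []
    for bar in bars:
        for event in bar.split():
            notes, figure = event.split(":")                 # "F4.A4", "4"
            labels = [NOTE_TO_DRUM.get(n, n) for n in notes.split(".")]
            annotations.append((round(t, 3), "+".join(sorted(labels))))
            t += seconds_per_quarter * 4.0 / float(figure)   # figure 4 = quarter note
    return annotations

print(symbolic_to_annotations(["F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4"], tempo_bpm=60))
# [(0.0, 'hh+kd'), (1.0, 'hh+kd'), (2.0, 'hh+kd'), (3.0, 'hh+kd')]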
3.2.2 Studio recordings
At this point of the dataset creation, we realized that the already existing data was very unbalanced in terms of instances per class: some classes had around two thousand samples while others had only ten. This situation was the reason to record a personalized dataset, to balance the overall distribution of classes, as well as exercises read with different accuracy, simulating students with different skill levels.
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants, Barcelona). The first day was intended to set up the drumset and the microphones, which are listed in Table 2. In Figure 3 the microphone setup is shown; differently from the standard setup, in which each instrument of the set has its own microphone, this distribution of the microphones was intended to record the whole drumset with different frequency responses.
The recording process was divided into two phases: first, creating samples to balance the dataset used to train the drums event classifier (called train set); then, recording the students' assignment simulation to test the whole system (called test set).
7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
9 https://www.stereodosis.com
Microphone            Transducer principle
Beyerdynamic TG D70   Dynamic
Shure PG52            Dynamic
Shure SM57            Dynamic
Sennheiser e945       Dynamic
AKG C314              Condenser
AKG C414              Condenser
Shure PG81            Condenser
Samson C03            Condenser
Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only the classes that appear in the music school subset; this decision was motivated by the idea of assessing the songs from the books, so only the classes of the collection of songs were needed to train the classifier. In Figure 4 the distribution of the selected classes before the recordings is shown; note that it is in logarithmic scale, so there is a large difference among classes.
Figure 4 Number of samples before Train set recording
To organize the recording process, we designed 3 different routines to record; depending on the class and the number of samples already existing, a different routine was recorded. These routines were designed trying to represent the different speeds, dynamics and interactions between instruments of a real interpretation. In Appendix A the routine scores are shown; to write a generic routine, a two-line stave is used: the bottom line represents the class to be recorded and the top line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound sustains for a long period of time and whose tail mixes with the subsequent sound events.
• Routine 1 (Fig. 31): This routine is intended for the classes that do not include a crash or ride cymbal and have a small number of samples (i.e. <500).
• Routine 2 (Fig. 32): This routine does not include auxiliary events, as it is intended for classes that include a crash or ride cymbal, whose interaction with itself is intrinsic.
• Routine 3 (Fig. 33): This is a short version of routine 1, which only repeats each bar two times instead of four; it is intended for classes which do not include a crash or ride cymbal and have a large number of samples (i.e. >500).
Routines 1 and 3 were recorded only one time, as we had only one instrument for each of the classes; differently, routine 2 was recorded two times for each cymbal, as we were able to use more instances of them. The different cymbal configurations used can be seen in Appendix A, in Figures 34, 35 and 36.
After the Train set recording, the number of samples was a little more balanced; as shown in Figure 5, all the classes have at least 1500 samples.
[Bar chart: number of samples per class (0 to 3000) for classes ht+kd, kd+mt, ht, mt, ft+sd, ft+kd+sd, cr+sd, ft, cr+kd, cr, ft+kd, hh+kd+sd, kd+sd, cy+sd, cy, cy+kd, sd, kd, hh+sd, hh+kd and hh; legend: recorded vs. before recording.]
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song on the same drumset. To do that, we recorded each song of the music school Drums Grade Initial and Grade 1, playing it correctly and then making mistakes in both reading and rhythm. After testing with these recordings, we realized that we were not able to test the limits of the assessment system in terms of tempo or with different rhythmic measures. So we proposed two exercises of groove reading, in 4/4 and in 12/8, to be performed at different tempos; these recordings have been done in my study room with my laptop's microphone.
3.3 Data augmentation
As described in section 2.1.2, data augmentation aims to introduce changes to the signals to optimize the statistical representation of the dataset. To implement this task, the aforementioned Python library audiomentations is used.
The audiomentations library has a class called Compose, which allows collecting different processing functions, assigning a probability to each of them. Then the Compose instance can be called several times with the same audio file, and each time the resulting audio will be processed differently because of the probabilities. In data_augmentation.ipynb10 a possible implementation is shown, as well as some plots of the original sample with different results of applying the created Compose to the same sample; an example of the results can be listened to in Freesound11.
The processing functions introduced in the Compose class are based on those proposed in [13] and [14]; their parameters are described below, and a sketch of such a Compose follows the list:
• Add Gaussian noise, with 70% probability.
• Time stretch between 0.8 and 1.25, with 50% probability.
• Time shift forward a maximum of 25% of the duration, with 50% probability.
• Pitch shift ±2 semitones, with 50% probability.
• Apply MP3 compression, with 50% probability.
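A minimal sketch of such a Compose is shown here, assuming a 0.2x-era audiomentations API (parameter names such as min_fraction/max_fraction were renamed in later releases, so this should be adapted to the installed version):

import numpy as np
from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                             PitchShift, Shift, TimeStretch)

# One transform per bullet above, each with its probability p.
augment = Compose([
    AddGaussianNoise(p=0.7),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    Shift(min_fraction=0.0, max_fraction=0.25, p=0.5),    # forward shift only
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    Mp3Compression(p=0.5),
])

samples = np.random.uniform(-0.5, 0.5, 44100).astype(np.float32)  # stand-in audio
augmented = augment(samples=samples, sample_rate=44100)           # differs on each call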
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213
3.4 Drums events trim
As will be explained in section 4.2.1, the dataset has to be trimmed into individual files to analyze them and extract the low-level descriptors. In the Dataset_featureExtraction.ipynb12 notebook this process has been implemented, slicing all the audios with their annotations, each dataset separately, to sight-check all the resulting samples and better detect which annotations were not correct.
3.5 Summary
To summarize, a drums samples dataset has been created; the one used in this project will be called the 40k Samples Drums Dataset. Nonetheless, to share this dataset we have to ensure that we fully own the data, which means that the samples that come from the IDMT, MDBDrums and MusicSchool datasets cannot be shared in another dataset. Alternatively, we will share the 29k Samples Drums Dataset, formed only by the samples recorded in the studio. This dataset will be available in Zenodo13, to download the whole dataset at once, and in Freesound, where some selected samples are uploaded in a pack14 to show the differences among microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592#.YMmNXW4p5TZ
14 https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is inspired by [2]. Each box of the diagram refers to a section in this chapter, so the diagram might be helpful to get a general idea of the problem when explaining each process.
The system is divided into two main processes. First, the top boxes correspond to the training process of the model, using the dataset created in the previous chapter. Secondly, the bottom row shows how a student submission is processed to generate some feedback. This feedback is the output of the system and should give some indications to the student on how they have performed and how they can improve.
4.1 Problem definition
To check if a student reads a music sheet correctly, we need some tool to tag which instruments of the drumset are playing for each detected event. This leads us to develop and train a drums event classifier: if this tool ensures a good accuracy when classifying (i.e. >95%), we will be able to properly assess a student's recording. If the classifier does not have enough accuracy, the system will not be useful, as we will not be able to differentiate between errors from the student and errors from the classifier.
[Diagram: training path: music scores, students' performances and assessments feed annotations and audio recordings into the dataset, followed by feature extraction, drums event classifier training and performance assessment training; inference path: a new student's recording goes through feature extraction and performance assessment inference, ending in a visualization that provides performance feedback.]
Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason, the project has been mainly focused on developing the aforementioned drums event classifier and a proper dataset. Consequently, developing a properly assessed dataset of drums interpretations has not been possible, nor has the performance assessment training. Despite this, the feedback visualization has been developed, as it is a nice way to close the pipeline and get some understandable results; moreover, the performance feedback could be focused on deterministic aspects, such as telling the student if they are rushing or slowing down in relation to a given tempo.
4.2 Drums event classifier
As already mentioned, this section has been the main workload of this project, because a reliable assessment depends on a correct automatic transcription. The process has been divided into 3 main parts: extracting the musical features, training and validating the model in an iterative process, and finally testing the model with totally new data.
4.2.1 Feature extraction
The feature extraction concept has been explained in section 2.1.1 and has been implemented using the MusicExtractor()1 method from Essentia's library.
MusicExtractor() has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()2 has been implemented to loop over a list of files, analyze each of them, and add the extracted features to a CSV file jointly with the class of each drums event. At this point all the low-level features were extracted; both the mean and the standard deviation were computed across all the frames of the given audio file. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.
As mentioned in section 3.4, the fact that MusicExtractor() has to be called with a filename, not an audio stream, forced us to create another version of the dataset, which had each event annotated in a different audio file with the correspondent class label as the filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples) and the features were saved in different CSV files, one for each dataset3. Adding up the number of instances in all the CSV files, we get 40228 instances with 84 features and 1 label.
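For illustration, a condensed sketch of such a loop is shown below; the filename-encodes-label convention and the helper name are assumptions based on the description above, not the repository's exact code:

import csv
import essentia.standard as es

# MusicExtractor aggregates per-frame low-level descriptors (mean and stdev here).
extractor = es.MusicExtractor(lowlevelFrameSize=2048, lowlevelHopSize=1024,
                              lowlevelStats=['mean', 'stdev'])

def extract_musical_features(filenames, csv_path):
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        for path in filenames:
            features, _ = extractor(path)   # aggregated pool, per-frame pool
            names = sorted(n for n in features.descriptorNames()
                           if n.startswith('lowlevel.') and isinstance(features[n], float))
            label = path.split('/')[-1].split('_')[0]   # e.g. "hh+sd_0001.wav" -> "hh+sd"
            writer.writerow([features[n] for n in names] + [label])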
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices/features
4.2.2 Training and validating
As mentioned in section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) to do sound event classification; other authors have developed more complex methods for drums event classification. The complexity of these latter methods made us choose the generic ones, also to see whether they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40k Samples Drums Dataset. The first times we tried the models, we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula (4.1), which does not take the unbalance of the dataset into account. The accuracy computation was around 92%, but the correct predictions were mainly on the large classes; as shown in Figure 7, some classes had very low accuracy (even 0%, as some classes have 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances affects the accuracy computation less.
$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} 1(\hat{y}_i = y_i) \qquad (4.1)$$
Instead, the proper way to compute the accuracy on this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages the accuracy across all the classes, as in formula (4.2), where $w_i$ represents the weight of each class in the dataset. This computation lowered the result to 79%, which was not a good result.

$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j}, \qquad \mathrm{balanced\text{-}accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i \qquad (4.2)$$
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score, which combines the precision and the recall of the model in one measure, as in formula (4.3). Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of predictions that should be correct for a given class.
$$F\text{-}\mathrm{measure} = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (4.3)$$
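All three metrics are available in scikit-learn; the toy example below (with made-up labels) illustrates why plain accuracy is misleading on unbalanced data:

from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Hypothetical unbalanced case: 90 kick-drum events, 10 crash events,
# and a classifier that always predicts 'kd'.
y_true = ['kd'] * 90 + ['cr'] * 10
y_pred = ['kd'] * 100

print(accuracy_score(y_true, y_pred))             # 0.90 -- looks fine
print(balanced_accuracy_score(y_true, y_pred))    # 0.50 -- exposes the imbalance
print(f1_score(y_true, y_pred, average='macro'))  # macro-averaged f-score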
These results led us to the process of recording a personalized dataset to extend the already existing one (see section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as well as the balanced accuracy and f-score (both 89%). Until this point we were using both KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine refers to the regularization; this technique is intended to make a model less sensitive to the data noise and to the outliers that may not represent the class properly. When increasing this value to 10, the results improved across all the classes, as shown in Figure 9, as well as the accuracy and f-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was somewhat of a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So we tried the same process with the classes that include only the three mentioned instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and f-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The training and validating iterative process has been implemented in the Classifier_training.ipynb4 notebook. First, the CSV files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered, and, to remove redundant features, the ones with a very low standard deviation are deleted (i.e. std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to center all the feature distributions around 0, ensuring a standard deviation of 1.
In the next cells the dataset is split into train and validation sets, and the training method of sklearn's SVM is called to perform the training; when the models are trained, the parameters are dumped to a file, so the model can be loaded a posteriori and the learned knowledge applied to new data. This process was very slow on my computer, so we decided to upload the CSV files to Google Drive and open the notebook with Google Colaboratory, as it was faster, which is key to avoid long waiting times during the iterative train-validate process. In the last cells the inference is made with the validation set and the accuracy is computed, as well as the confusion matrix plotted, to get an idea of which classes are performing better.
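The sketch below condenses these steps with scikit-learn; the file names, the 'label' column and the 80/20 split are assumptions for illustration, not the notebook's exact values:

import joblib
import pandas as pd
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One CSV of extracted features per dataset; each row: 84 features + a label column.
df = pd.concat(pd.read_csv(p) for p in ['mdb.csv', 'idmt.csv', 'school.csv', 'studio.csv'])
X, y = df.drop(columns=['label']), df['label']
X = X.loc[:, X.std() > 1e-5]   # drop redundant, near-constant features

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
scaler = StandardScaler().fit(X_tr)          # zero mean, unit variance
clf = SVC(C=10).fit(scaler.transform(X_tr), y_tr)

print(balanced_accuracy_score(y_val, clf.predict(scaler.transform(X_val))))
joblib.dump((scaler, clf), 'svm_drums.joblib')   # persisted for later inference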
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now, all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection5 does both tasks. As explained in section 2.1.1, there are many methods to do onset detection, and each of them is better for a different application. In the case of drums, we tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase; it works pretty well, but when the tempo increases some onsets are not correctly detected. For this reason, we finally implemented the onset detection with the HFC method. This method computes the HFC for each window, as in equation (4.4); note that high-frequency bins (index k) weigh more in the final value of the HFC.

$$\mathrm{HFC}(n) = \sum_k |X_k[n]|^2 \cdot k \qquad (4.4)$$

4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
Moreover, the function plots the audio waveform jointly with the detected onsets, to check after each test whether it has worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60 bpm drums interpretation
Figure 12 Onsets detected in a 220 bpm drums interpretation
With the onset information, the audio can be trimmed into the different events; the order is maintained in the file names, so the events can easily be mapped to the expected ones when comparing. The audios are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a CSV file.
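A sketch of this trimming step, assuming a mono signal, onset times in seconds, and the soundfile package for writing the slices (index-based naming preserves the order):

import soundfile as sf  # assumption: soundfile used here just to write the slices

def slice_drum_events(audio, sr, onset_times, out_prefix='event'):
    """Cut the signal at each onset; each chunk runs to the next onset (or the end)."""
    bounds = [int(t * sr) for t in onset_times] + [len(audio)]
    for i, (start, end) in enumerate(zip(bounds[:-1], bounds[1:])):
        sf.write(f'{out_prefix}_{i:03d}.wav', audio[start:end], sr)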
To predict which event each slice is, the already trained models are loaded in this new environment, and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row of the data. The described process is implemented in the first part of Assessment.ipynb6; the second part is intended to execute the visualization functions described in the next section.
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving the student visual feedback of the interpretation. As the drums classifier has taken so much time, the creation of a dataset with interpretations and their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different levels of skill, but grading them and doing all the process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed ones.
So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Having as a base implementation the solution of my colleague Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb.
First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (this comes from the LilyPond file's macro). To properly plot the deviations, in the code we assume that the interpretation starts four beats after the beginning of the audio.
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (from green, good, to red, bad).
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
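The traffic-light idea can be sketched with matplotlib as below; this is an illustrative reimplementation under stated assumptions (closest-onset matching, a 120 ms worst-case deviation), not the repository's save_bar_plot() itself:

import matplotlib.pyplot as plt
import numpy as np

def plot_onset_deviations(waveform, sr, expected, detected, max_dev=0.12):
    detected = np.asarray(detected)
    t = np.arange(len(waveform)) / sr
    plt.plot(t, waveform, color='0.7')
    for e in expected:
        d = detected[np.argmin(np.abs(detected - e))]  # closest detected onset
        severity = min(abs(d - e) / max_dev, 1.0)      # 0 = on time, 1 = worst
        plt.axvspan(min(e, d), max(e, d), color=(severity, 1 - severity, 0), alpha=0.6)
        plt.axvline(e, color='c')   # expected onset (cyan)
        plt.axvline(d, color='k')   # detected onset (black)
    plt.xlabel('time (s)')
    plt.show()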
Once the waveform is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes()10, the expected and predicted events are compared, creating a list of booleans stored as 0 for False and 1 for True; then the resulting list is iterated and the 0 indices are checked, because most of the classification errors fail in only one of the instruments to be predicted (i.e. instead of hh+sd, it predicts sd). These cases are considered partially correct, as the system has to take its own errors into account: at the indices where one of the instruments is correctly predicted and it is not a hi-hat (we consider it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct and red means incorrect.
9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
With the waveform, the notes assessed and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On one hand, before each note on the staff, the keywords color() and size() determine that the color and size of the note depend on an external variable (the notes assessed); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends, in this case from 0 to 16, which in a 4/4 rhythm is 4 bars; the other number is the scale of the waveform and allows the plot to fit better with the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps; these are found in the txt file already mentioned in section 3.1.1. Then, the LilyPond file: this is the template, written in the LilyPond language, that defines the resultant music sheet; the macros to change color and size and to add the waveform are defined there. When extracting the musical features, each submission creates its CSV file to store the information. And finally we need, of course, the audio files with the recorded submission to be assessed.
11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results, to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB until an accumulated +30 dB.
In this chapter and Appendix B all the resultant feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high-frequency content method works perfectly, there are no limitations nor errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without detecting any false positive.
1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops when the tempo increases. Having as a reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct. To score each case, a correct prediction weighs 1.0, a partially correct one weighs 0.5 and an incorrect one 0; the total value is the mean of the weighted predictions.
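In code, this scoring reduces to the small helper below (hypothetical name), reproducing for example the 60 bpm row of Table 3:

def weighted_score(correct, partially_ok, incorrect):
    """Mean of weighted predictions: correct = 1.0, partial = 0.5, incorrect = 0."""
    total = correct + partially_ok + incorrect
    return (correct + 0.5 * partially_ok) / total

print(round(weighted_score(25, 7, 0), 2))   # 0.89, as in the 60 bpm row of Table 3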
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of the classifier decreases. This may be because increasing the tempo decreases the spacing between events, and consequently the duration of each event, which leads to fewer values with which to calculate the mean and standard deviation when extracting the timbre characteristics. As stated in the Law of Large Numbers [25], the larger the sample, the closer the mean is to the total population mean. In this case, having fewer values in the calculation creates more outliers in the distribution, which tends to scatter.
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48
Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But in 12/8 the equivalent tempo is 300 quarter notes per minute, similar to the 140 bpm case in 4/4, whose quarter-note tempo is 280. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875
Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason, we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that it leads to high-frequency content that is predicted as a hi-hat in all cases.
Level    Correct   Partially OK   Incorrect   Total
+0dB     25        7              0           0.89
+6dB     23        9              0           0.86
+12dB    23        9              0           0.86
+18dB    24        7              1           0.86
+24dB    18        5              9           0.64
+30dB    13        5              14          0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0dB     12        7              13          0.48
+6dB     13        10             9           0.56
+12dB    10        8              14          0.5
+18dB    9         2              21          0.31
+24dB    8         0              24          0.25
+30dB    9         0              23          0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier accuracy, but we think it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results will be the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starting at 60 bpm and adding 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68
Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9
Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system performs the prediction-based assessment pretty well. This can be helpful for a student to know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help students recognize whether they are slowing down or rushing when reading the score; as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the correspondent expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter, a discussion is developed about which objectives have been accomplished and which have not. Also, a set of further improvements is given, along with a final thought on my work and what I have learned. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained throughout this document, we can now list them, assessing their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful to balance drums datasets based on real interpretations, as the class distribution of these interpretations is very unbalanced, as explained with the IDMT and MDB Drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex, context-based methods might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained to do is very hard even for a human: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear, such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on the visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the 29k Samples Drums Dataset. It would be good to get recordings in different spaces, with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even different people playing. This could introduce more variance into the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it is clear that it needs improvements to ensure the overall system robustness. It would be appropriate to introduce the aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org
Also, in terms of classes in the drumset, there is a large path to cover: there are no solutions that robustly transcribe a whole set, including the toms and different kinds of cymbals. In this sense, we think that a proper approach would be to work with professional musicians, which would help researchers to better understand the instrument and create datasets with different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises to give a mark to each student. In this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step would be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and other researchers or students can execute them, getting the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository3 and some samples in Freesound4. This is the 29kDrumsSamplesDataset, as not all the 40k samples used to train are our property and we are not able to share them under our full authorship; despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397
50 Chapter 6 Discussion and conclusions
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished. That said, this work still falls short in terms of robustness and reliability, but a first approximation has been presented, as well as several paths of improvement proposed.
Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web version, GitHub. Also, at the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed in macOS via the pip command (i.e. Essentia), which has been a tough path to take and accomplish. In a more advanced phase of the project, I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Collaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of more personal interest. And being able to relate the project with music and drums helped me to keep my motivation and focus. I am quite satisfied with the feedback visualization that the system produces, and I hope that more people get interested in this field of research to get better tools in the future.
List of Figures
1 Datasets pre-processing
2 Sample drums score from music school drums grade 1
3 Microphone setup for drums recording
4 Number of samples before Train set recording
5 Number of samples after Train set recording
6 Proposed pipeline for a drums performance assessment system, inspired by [2]
7 Confusion matrix after training with the dataset in Figure 4
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
11 Onsets detected in a 60 bpm drums interpretation
12 Onsets detected in a 220 bpm drums interpretation
13 Onset deviation plot of a good tempo submission
14 Onset deviation plot of a bad tempo submission
15 Example of coloured notes
16 Good reading and good tempo, Ex. 1, 60 bpm
17 Good reading and good tempo, Ex. 1, 100 bpm
18 Good reading and good tempo, Ex. 1, 140 bpm
19 Good reading and good tempo, Ex. 1, 180 bpm
20 Good reading and good tempo, Ex. 1, 220 bpm
21 Good reading and good tempo, Ex. 2, 60 bpm
22 Good reading and good tempo, Ex. 2, 100 bpm
23 Good reading and good tempo, Ex. 1, 60 bpm, accumulating +6dB at each new staff
24 Good reading and good tempo, Ex. 1, 220 bpm, accumulating +6dB at each new staff
25 Bad reading and bad tempo, Ex. 1, 100 bpm
26 Bad reading and bad tempo, Ex. 1, 180 bpm
27 Bad reading and good tempo, Ex. 1, starts at 60 bpm and adds 60 bpm at each new staff
28 Bad reading and good tempo, Ex. 2, 60 bpm
29 Bad reading and good tempo, Ex. 2, 100 bpm
30 Good reading and bad tempo, Ex. 1, 100 bpm
31 Recording routine 1
32 Recording routine 2
33 Recording routine 3
34 Drumset configuration 1
35 Drumset configuration 2
36 Drumset configuration 3
37 Good reading and bad tempo, Ex. 1, 60 bpm
38 Bad reading and bad tempo, Ex. 1, 60 bpm
39 Good reading and bad tempo, Ex. 1, 140 bpm
40 Bad reading and bad tempo, Ex. 1, 140 bpm
41 Good reading and bad tempo, Ex. 1, 180 bpm
42 Good reading and bad tempo, Ex. 1, 220 bpm
43 Bad reading and bad tempo, Ex. 1, 220 bpm
44 Good reading and bad tempo, Ex. 2, 60 bpm
45 Bad reading and bad tempo, Ex. 2, 60 bpm
46 Good reading and bad tempo, Ex. 2, 100 bpm
47 Bad reading and bad tempo, Ex. 2, 100 bpm
List of Tables
1 Abbreviations' legend
2 Microphones used
3 Results of exercise 1 with different tempos
4 Results of exercise 2 with different tempos
5 Results of exercise 1 at 60 bpm with different amplification levels
6 Results of exercise 1 at 220 bpm with different amplification levels
7 Assessment result of a bad reading with different tempos, 4/4 exercise
8 Assessment result of a bad reading with different tempos, 12/8 exercise
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K. J., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2020).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2012).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: an audio analysis library for music information retrieval. Proceedings - 14th International Society for Music Information Retrieval Conference (2010).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: a proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-midi aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
[Score: drums recording routine 1, quarter note = 60; engraved with LilyPond 2.18.2 — www.lilypond.org]
Figure 31: Recording routine 1
[Score: drums recording routine 2, quarter note = 60; engraved with LilyPond 2.18.2 — www.lilypond.org]
Figure 32: Recording routine 2
[Score: drums recording routine 3, quarter note = 60; engraved with LilyPond 2.18.2 — www.lilypond.org]
Figure 33: Recording routine 3
Figure 34: Drumset configuration 1
Figure 35: Drumset configuration 2
Figure 36: Drumset configuration 3
Appendix B
Extra results
Figure 37: Good reading and bad tempo, Ex. 1, 60 bpm
Figure 38: Bad reading and bad tempo, Ex. 1, 60 bpm
Figure 39: Good reading and bad tempo, Ex. 1, 140 bpm
Figure 40: Bad reading and bad tempo, Ex. 1, 140 bpm
Figure 41: Good reading and bad tempo, Ex. 1, 180 bpm
Figure 42: Good reading and bad tempo, Ex. 1, 220 bpm
Figure 43: Bad reading and bad tempo, Ex. 1, 220 bpm
Figure 44: Good reading and bad tempo, Ex. 2, 60 bpm
Figure 45: Bad reading and bad tempo, Ex. 2, 60 bpm
Figure 46: Good reading and bad tempo, Ex. 2, 100 bpm
Figure 47: Bad reading and bad tempo, Ex. 2, 100 bpm
Chapter 2
State of the art
In this chapter, the concepts and technologies used in the project are explained, covering algorithm references and existing implementations. First, signal processing techniques for onset detection and feature extraction are reviewed; then the sound event classification field is presented, together with its relationship with drums event classification. Also, the principal music sheet technologies and codecs are presented. Finally, specific software tools are listed.
2.1 Signal processing
2.1.1 Feature extraction
In the following sections, sound event classification will be explained; most of these methods are based on training models using features extracted from the audio, not on the raw audio chunks themselves [6]. In this section, signal processing methods to obtain those features are presented.
Onset detection
In an audio signal, an onset is the beginning of a new event; it can be a single note, a chord or, in the case of the drums, the sound produced by hitting one or more instruments of the drumset. It is necessary to have a reliable algorithm that properly detects all the onsets of a drums interpretation. With the onsets information (a list of timestamps), the audio can be sliced to analyze each chunk separately and to assess the tempo consistency.
It is important to address the challenge in a psychoacoustical way, as the objective is to detect the musical events as a human would. In [7], the idea of perceptual onset for percussive instruments is defined as a time interval between the physical onset and the moment that the maximum level is reached. In [8], many methods are reviewed, focusing on the differences in performance depending on the signal: Non-Pitched Percussive instruments are better detected with temporal methods or high-frequency content methods, while Pitched Non-Percussive instruments may need to take into account changes of energy in the spectrum distribution, as the onset may represent a different note.
The sound generated by the drums is mainly percussive (discarding brushes' slow patterns or mallets' build-ups on the cymbals), which means that it is formed by a short transient followed by a short decay; there is no sustain. As the transient is a fast change of energy, it implies high-frequency content, because the changes happen in a very short frame of time. As recommended in [9], the HFC method will be used; a minimal sketch of this approach is shown below.
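As an illustration, the following sketch follows the OnsetDetection workflow documented by Essentia in standard mode; the filename is a placeholder and the frame/hop sizes are common defaults, not values prescribed by this project.

    import essentia
    import essentia.standard as es

    # Load the recording as a mono signal
    audio = es.MonoLoader(filename='drums_take.wav')()

    od_hfc = es.OnsetDetection(method='hfc')   # HFC onset detection function
    window = es.Windowing(type='hann')
    fft = es.FFT()
    c2p = es.CartesianToPolar()

    pool = essentia.Pool()
    for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
        magnitude, phase = c2p(fft(window(frame)))
        pool.add('odf.hfc', od_hfc(magnitude, phase))

    # Peak-pick the detection function to get onset timestamps in seconds
    onsets = es.Onsets()(essentia.array([pool['odf.hfc']]), [1])
    print(onsets)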
Timbre features
As described in [10], a feature denotes in some way a quantity or a value. Features extracted by processing the audio stream or transformations of it (i.e. FFT) are called low-level descriptors; these features carry no relevant information from a human point of view, but are useful for computational processes [11].
Some low-level descriptors are computed from the temporal information: for instance, the zero-crossing rate tells the number of times the signal crosses the zero axis per second, the attack time is the duration of the transient, and the temporal centroid describes the energy distribution of an event over time. Other well-known features are the root mean square of the signal or the high-frequency content mentioned in section 2.1.1.
Besides temporal features, low-level descriptors can also be computed from the frequency domain. Some of them are spectral flatness, spectral roll-off, spectral slope and spectral flux, among others.
Nowadays, Essentia's library offers a collection of algorithms that reliably extract the low-level descriptors aforementioned; the function that groups all the extractors is called MusicExtractor.1
2.1.2 Data augmentation
Data augmentation processes refer to the optimization of the statistical representation of the datasets, in terms of improving the generalization of the resultant models. These methods are based on the introduction of unobserved data or latent variables that may not be captured during the dataset creation [12].
Regarding this technique applied to audio data, signal processing algorithms are proposed in [13] and [14] that introduce changes to the signals in both the time and frequency domains. In these articles, the goal is to improve accuracy on speech and animal sound recognition, although this could apply to drums event classification. The processes that led to the best results in [13] and [14] were related to time-domain transformations: for instance, time-shifting and stretching, adding noise or harmonic distortion, compressing into a given dynamic range, among others. Other processes proposed were focused on the spectrogram of the signal, applying transformations such as shifting the matrix representation, setting some areas to 0, or adding spectrograms of different samples of the same class.
Presently, some Python2 libraries are developed and maintained to perform audio data augmentation tasks, for instance audiomentations3 and its GPU version, torch-audiomentations.4
1 https://essentia.upf.edu/streaming_extractor_music.html
2 https://www.python.org
3 https://pypi.org/project/audiomentations/0.6.0
4 https://pypi.org/project/torch-audiomentations
2.2 Sound event classification
Sound Event Classification is the task of detecting and recognizing sound events in an audio stream [15]. As described in [10], this task can be approached from two sides: on one hand, the perceptual approach tries to extract the timbre similarity to cluster sounds as we perceive them; on the other hand, the taxonomic approach aims to label sound events as they are defined in cultural or user-biased taxonomies. In this project the focus is on the second approach, as the task is to classify sound events in the drums taxonomy (i.e. kick drum, snare drum, hi-hat).
Also, in [] many classification methods are proposed; concretely, in the taxonomic approach, machine learning algorithms such as K-Nearest Neighbors, Support Vector Machines or Neural Networks, all of them using features extracted from the audio data, as explained in section 2.1.1.
2.2.1 Drums event classification
This section is divided into two parts: first presenting the state-of-the-art methods for drums event classification, and then the most relevant existing datasets. This section is mainly based on the article [1], as it is a review of the topic and encompasses the core concepts of the project.
Methods
Focusing on taxonomic drums events classification, this field has been studied over the last years; in the Music Information Retrieval Evaluation eXchange5 (MIREX) it has been a proposed challenge since 2005.6 In [1], a review of the main methods that have been investigated is done. The authors collect different approaches, such as Recurrent Neural Networks proposed in [16], Non-Negative matrix factorization proposed in [17], and other real-time approaches based on Max/MSP,7 as described in [18].
5 https://www.music-ir.org/mirex/wiki/MIREX_HOME
6 https://www.music-ir.org/mirex/wiki/2005:Audio_Drum_Detection_Results
7 https://cycling74.com/products/max
It is worth mentioning that the proposed methods are focused on Automatic Drum Transcription (ADT) of drumsets formed only by the kick drum, snare drum and hi-hat. The ADT field is intended to transcribe audio, but in our case we have to check whether an audio event is or is not the expected event; this particularity can be used in our favor, as some assumptions can be made about the audio that has to be analyzed.
Datasets
In addition to the methods and their combinations, the data used to train the system plays a crucial role. As a result, the dataset may have a big impact on the generalization capabilities of the models. In this section, some existing datasets are described:
• IDMT-SMT-Drums [19]: Consists of real drum recordings containing only kick drum, snare drum and hi-hat events. Each recording has its transcription in xml format and is publicly available to download.8
• MDB Drums [20]: Consists of real drums recordings of a wide range of genres, drumsets and styles. Each recording has two txt transcriptions, for the classes and subclasses defined in [20] (e.g. class: hi-hat; subclasses: closed hi-hat, open hi-hat, pedal hi-hat). It is publicly available to download.9
• ENST-Drums [21]: Consists of real drum audio and video recordings of different drummers and drumsets. Each recording has its transcription, and some of them include accompaniment audio. It is publicly available to download.10
• DREANSS [22]: Differently, this dataset is a collection of drum recording datasets that have been annotated a posteriori. It is publicly available to download.11
Electronic drums datasets have not been considered, as the student assignment is supposed to be recorded with a real drumset.
8 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
9 https://github.com/CarlSouthall/MDBDrums
10 https://perso.telecom-paristech.fr/grichard/ENST-drums
11 https://www.upf.edu/web/mtg/dreanss
2.3 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter programs from the 80s. Proprietary software such as Finale12 and Sibelius,13 or open-source software such as MuseScore14 and LilyPond,15 are some options that can be used nowadays to write music sheets with a computer.
In terms of file format, Sibelius has its encrypted version that can only be read and written with the software; it can also write and read MusicXML16 files, which are not encrypted and are similar to an HTML file, as they contain tags that define the bars and notes of the music sheet. This format is the standard for exchanging digital sheet music.
Within Music Critic's framework, the technology used to display the evaluated score is LilyPond: it can be called from the command line and allows adding macros that change the size or color of the notes. The other particularity is that it uses its own file format (.ly), so scores in MusicXML format have to be converted and reviewed.
2.4 Software tools
Many of the concepts and algorithms aforementioned are already available as software libraries. This project has been developed with Python, and in this section the libraries that have been used are presented. Some of them are open and public, and some others are private, such as Pysimmusic, which has been shared with us so we can use and consult it. In addition, all the code has been developed using a tool from Google called Collaboratory:17 it allows writing code in a jupyter notebook18 format that is agile to use and execute interactively.
12 https://www.finalemusic.com
13 https://www.avid.com/sibelius
14 https://musescore.org
15 https://lilypond.org
16 https://www.musicxml.com
17 https://colab.research.google.com
18 https://jupyter.org
2.4.1 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis, description and synthesis [23]; it can also be installed as a Python-based library with the pip19 command in Linux, or compiling with certain flags in macOS.20 This library includes a collection of MIR algorithms; it is not a framework, so it is in the user's hands how to use these processes. Some of the algorithms used in this project are music feature extraction, onset detection and audio file I/O.
2.4.2 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine learning algorithms for regression, classification and clustering, as well as pre-processing and dimensionality reduction functions. It is based on NumPy22 and SciPy,23 so its algorithms are easy to adapt to the most common data structures used in Python. It also allows saving and loading trained models to do inference tasks with new data.
2.4.3 Lilypond
As described in section 2.3, LilyPond is an open-source scorewriter software with its own file format and language. It can produce visual renders of music sheets in PNG, SVG and PDF formats, as well as MIDI files to listen to the compositions. LilyPond works on the command line and allows us to introduce macros to modify visual aspects of the score, such as color or size.
It is the digital sheet music technology used within Music Critic's framework, as it allows embedding an image in the music sheet, generating a parallel representation of the music sheet and a student's interpretation.
19 https://pypi.org/project/pip
20 https://essentia.upf.edu/installing.html
21 https://scikit-learn.org
22 https://numpy.org
23 https://www.scipy.org/scipylib/index.html
2.4.4 Pysimmusic
Pysimmusic is a private Python library developed at the MTG. It offers tools to analyze the similarity of musical performances and uses libraries such as Essentia, LilyPond and FFmpeg,24 among others. Pysimmusic contains onset detection algorithms and a collection of audio descriptors and evaluation algorithms. By now, it is the main evaluation software used in Music Critic to compare the submitted recording with the reference.
2.4.5 Music Critic
Music Critic is a project from the MTG intended to support technologies for online music education, facilitating the assessment of student performances.25
The proposed workflow starts with a student submitting a recording of the proposed exercise. Then the submission is sent to the Music Critic server, where it is analyzed and assessed. Finally, the student receives the evaluation jointly with the feedback from the server.
2.5 Summary
Music information retrieval and machine learning have been popular fields of study. This has led to a large development of methods and algorithms that will be crucial for this project. Most of them are free and open-source and, fortunately, the private ones have been shared by the UPF research team, which is a great base to start the development.
24 https://www.ffmpeg.org
25 https://www.upf.edu/web/mtg/tech-transfer/-/asset_publisher/pYHc0mUhUQ0G/content/id/229860881/maximized#.YJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in section 1.3.2, having a well-annotated and balanced dataset is crucial to get proper results. In this chapter, the 40kSamples Drums Dataset creation process is explained: first focusing on how to process existing datasets, such as those mentioned in 2.2.1; secondly, introducing the process of creating new datasets with a music school corpus and a collection of recordings made in a recording studio; finally, describing the data augmentation procedure and how the audio samples are sliced into individual drums events. In Figure 1 we can see the different procedures to unify the annotations of the different datasets, while the audio does not need any specific modification.
3.1 Existing datasets
Each of the existing datasets has a different annotation format; in this section, the process of unifying them is explained, as well as its implementation (see notebook Dataset_formatUnification.ipynb1). As the events to take into account can be single instruments or combinations of them, the annotations have to be formatted to show those events properly. None of the annotations has this approach, so we have written a function that filters the list and joins the events separated by a small difference of time, meaning that they are played simultaneously.
1 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
[Diagram: the four sources (Music school, Studio REC, IDMT Drums, MDB Drums) are pre-processed into unified annotations plus audio; the music school corpus goes from Sibelius to MusicXML and then through the MusicXML parser to txt, while the other datasets only need their annotations written in the unified format]
Figure 1: Datasets pre-processing
3.1.1 MDB Drums
This dataset was the first we worked with; the annotation format in txt was a key factor, as it was easy to read and understand. As the dataset is available on GitHub,2 there is no need to download it nor process it from a local drive. As shown in the first cells of Dataset_formatUnification.ipynb, data from the repository can be retrieved with a Python wrapper of the GitHub API.3
This dataset has two annotation files, depending on how deep the taxonomy used is [20]. In this case, the generic class taxonomy is used, as there is no need to differentiate styles when playing a given instrument (i.e. single stroke, flam, drag, ghost note).
3.1.2 IDMT Drums
Differently from the previous dataset, this one is only available by downloading a zip file.4 It also differs in the annotation file format, which is xml. Using the Python package xmltodict,5 in the second part of Dataset_formatUnification.ipynb the xml files are loaded as a Python dictionary and converted to txt format; a sketch of this idea is shown below.
2 https://github.com/CarlSouthall/MDBDrums
3 https://pypi.org/project/githubpy
4 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
5 https://pypi.org/project/xmltodict
3.2 Created datasets
In order to expand the dataset with more variety of samples, other methods to get data have been explored. On one hand, with audio data that has partial annotations or some representation that is not data-driven, such as a music sheet, which contains a visual representation of the music but not a logic annotation, as mentioned in the previous section. On the other hand, generating simple annotations is an easy task, so drums samples can be recorded standalone to create data in a controlled environment. In the next two sections, these methods are described.
3.2.1 Music school
A music school has shared its teaching material with the MTG for research purposes, i.e. audio demos, books in pdf format, music sheets in Sibelius format. As we can see in Figure 1, the annotations from the music school corpus are in Sibelius format; this is an encrypted representation of the music sheet that can only be opened with the Sibelius software. The MTG has shared an AVID license, which includes the Sibelius software, so we were able to convert the sib files to MusicXML. MusicXML is not encrypted and can be opened and read, so a parser has been developed to convert the MusicXML files into a symbolic representation of the music sheet. This representation has been inspired by [24], which proposes a system to represent chords.
MusicXML parser
As mentioned in section 2.3, the MusicXML format is based on ordering the visual information with tags, creating a tree structure of nested dictionaries. In the first cell of XML_parser.ipynb,6 two functions are defined: ConvertXML2Annotation reads the musicxml file and gets the general information of the song (i.e. tempo, time measure, title); then a for loop iterates throughout all the bars of the music sheet, checking whether the given bar is self-defined, the repetition of the previous one, or the beginning or end of a repetition in the song (see Figure 2); in the self-defined bar case, the bar itself is passed to an auxiliary function, which parses it to get the aforementioned symbolic representation.
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
Figure 2: Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three others have a symbol that means 'repetition of the previous bar'; moreover, the bar lines at the beginning and the end indicate that these four bars have to be repeated. Therefore, this line in the music score represents an interpretation of eight bars repeating the first one.
The symbolic representation that we propose, based on [24], defines each bar with a string; this string contains the representations of the events in the bar, separated by blank spaces. Each of the events uses two dots (:) to separate the figure (i.e. quarter note, half note, whole note) from the note or notes of the event, which are separated by a dot (.). For instance, the symbolic representation of the first bar in Figure 2 is F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4
In addition to this conversion, in the parse_one_measure function from the XML_parser notebook, each measure is checked to ensure that it fully represents the bar. This means that the sum of the figures of the bar has to be equal to the duration defined by the time measure; e.g. the sum of the events in a 4/4 bar has to be equal to four quarter notes.
Symbolic notation to unified annotation format
As we can see in Figure 1, once the music scores are converted to the symbolic representation, the last step is to unify the annotations with the format used in section 3.1. This process is made in the last cells of the Dataset_formatUnification7 notebook. A dictionary with the translation of the notes to drums instruments is defined, so each note is directly converted. Differently, the timestamp of each event has to be computed based on the tempo of the song and the figure of each event; this process is made with the function get_time_steps_from_annotations,8 which reads the interpretation in symbolic notation and accumulates the duration of each event based on the figure and the tempo; a sketch of this idea is shown below.
7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
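A simplified sketch of that accumulation logic, assuming the F4.A4:4-style notation described above; the note-to-instrument mapping is illustrative, not the real dictionary.

    # Assumed, illustrative mapping from MusicXML note names to drums classes
    NOTE2INSTR = {'F4': 'kd', 'A4': 'hh', 'C5': 'sd'}

    def annotation_from_symbolic(bars, tempo_bpm):
        beat_dur = 60.0 / tempo_bpm      # duration of a quarter note in seconds
        t = 0.0
        rows = []
        for bar in bars:
            for event in bar.split():
                notes, figure = event.split(':')
                instruments = '+'.join(NOTE2INSTR[n] for n in notes.split('.'))
                rows.append((round(t, 3), instruments))
                t += beat_dur * 4.0 / float(figure)   # figure 4 = quarter, 8 = eighth
        return rows

    # First bar of Figure 2 at 60 bpm: four kick+hi-hat events, one per second
    print(annotation_from_symbolic(['F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4'], 60))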
3.2.2 Studio recordings
At this point of the dataset creation, we realized that the already existing data was very unbalanced in terms of instances per class: some classes had around two thousand samples, while others had only ten. This situation was the reason to record a personalized dataset, to balance the overall distribution of classes, as well as exercises with different reading accuracy, simulating students with different skill levels.
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants, Barcelona); the first day was dedicated to mounting the drumset and the microphones, which are listed in Table 2. In Figure 3 the microphone setup is shown; differently from the standard setup, in which each instrument of the set has its own microphone, this distribution of the microphones was intended to record the whole drumset with different frequency responses.
The recording process was divided into two phases: first, creating samples to balance the dataset used to train the drums event classifier (called train set); then, recording the students' assignment simulation to test the whole system (called test set).
9 https://www.stereodosis.com
Microphone            Transducer principle
Beyerdynamic TG D70   Dynamic
Shure PG52            Dynamic
Shure SM57            Dynamic
Sennheiser e945       Dynamic
AKG C314              Condenser
AKG C414              Condenser
Shure PG81            Condenser
Samson C03            Condenser
Table 2: Microphones used
Figure 3: Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only the classes that appear in the music school subset; this decision was motivated by the idea of assessing the songs from the books, so only classes from the collection of songs were needed to train the classifier. In Figure 4, the distribution of the selected classes before the recordings is shown; note that it is in logarithmic scale, so there is a large difference among classes.
Figure 4: Number of samples before Train set recording
To organize the recording process, we designed 3 different routines; depending on the class and the number of samples already existing, a different routine was recorded. These routines were designed trying to represent the different speeds, dynamics and interactions between instruments of a real interpretation. In Appendix A the routine scores are shown; to write a generic routine, a two-line stave is used: the bottom line represents the class to be recorded and the top line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound remains for a long period of time and whose tail is mixed with the subsequent sound events.
• Routine 1 (Fig. 31): This routine is intended for the classes that do not include a crash or ride cymbal and have a small number of samples (i.e. <500).
• Routine 2 (Fig. 32): This routine does not include auxiliary events, as it is intended for classes that include a crash or ride cymbal, whose interaction with itself is intrinsic.
• Routine 3 (Fig. 33): This is a short version of routine 1, which only repeats each bar two times instead of four; it is intended for classes that do not include a crash or ride cymbal and have a large number of samples (i.e. >500).
Routines 1 and 3 were recorded only one time, as we had only one instrument of each of those classes; differently, routine 2 was recorded two times for each cymbal, as we were able to use more instances of them. The different cymbal configurations used can be seen in Appendix A, in Figures 34, 35 and 36.
After the Train set recording, the number of samples was a little more balanced; as shown in Figure 5, all the classes have at least 1500 samples.
[Bar chart: number of samples per class (ht+kd, kd+mt, ht, mt, ft+sd, ft+kd+sd, cr+sd, ft, cr+kd, cr, ft+kd, hh+kd+sd, kd+sd, cy+sd, cy, cy+kd, sd, kd, hh+sd, hh+kd, hh), split into samples existing before the session and samples recorded; y-axis from 0 to 3000]
Figure 5: Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song on the same drumset. To do that, we recorded each song of the music school Drums Grade Initial and Grade 1, playing it correctly and then making mistakes in both reading and rhythmic ways. After testing with these recordings, we realized that we were not able to test the limits of the assessment system in terms of tempo or with different rhythmic measures, so we proposed two exercises of groove reading, in 4/4 and in 12/8, to be performed at different tempos; these recordings have been done in my study room with my laptop's microphone.
3.3 Data augmentation
As described in section 2.1.2, data augmentation aims to introduce changes to the signals to optimize the statistical representation of the dataset. To implement this task, the aforementioned Python library audiomentations is used.
The audiomentations library has a class called Compose, which allows collecting different processing functions, assigning a probability to each of them. Then the Compose instance can be called several times with the same audio file and, each time, the resulting audio will be processed differently because of the probabilities. In data_augmentation.ipynb10 a possible implementation is shown (a sketch follows the list below), as well as some plots of the original sample with different results of applying the created Compose to the same sample; an example of the results can be listened to in Freesound.11
The processing functions introduced in the Compose class are based on those proposed in [13] and [14]; their parameters are:
• Add gaussian noise, with 70% probability
• Time stretch between 0.8 and 1.25, with 50% probability
• Time shift forward a maximum of 25% of the duration, with 50% probability
• Pitch shift ±2 semitones, with 50% probability
• Apply mp3 compression, with 50% probability
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213
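A sketch of such a Compose, mirroring the probabilities listed above; exact parameter names vary across audiomentations versions, so this is illustrative rather than the project's exact configuration.

    import numpy as np
    from audiomentations import (Compose, AddGaussianNoise, TimeStretch,
                                 PitchShift, Shift, Mp3Compression)

    augment = Compose([
        AddGaussianNoise(p=0.7),                               # gaussian noise, 70%
        TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),       # stretch 0.8-1.25, 50%
        Shift(min_fraction=0.0, max_fraction=0.25, p=0.5),     # forward shift up to 25%, 50%
        PitchShift(min_semitones=-2, max_semitones=2, p=0.5),  # +-2 semitones, 50%
        Mp3Compression(p=0.5),                                 # mp3 compression, 50%
    ])

    samples = np.random.uniform(-1, 1, 44100).astype(np.float32)  # stand-in audio
    augmented = augment(samples=samples, sample_rate=44100)       # differs on each call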
3.4 Drums events trim
As will be explained in section 4.2.1, the dataset has to be trimmed into individual files to analyze them and extract the low-level descriptors. In the Dataset_featureExtraction.ipynb12 notebook this process has been implemented, slicing all the audios with their annotations, each dataset separately, to sight-check all the resultant samples and better detect which annotations were not correct.
3.5 Summary
To summarize, a drums samples dataset has been created; the one used in this project will be called the 40kSamples Drums Dataset. Nonetheless, to share this dataset we have to ensure that we are the full proprietors of the data, which means that the samples that come from the IDMT, MDB Drums and Music School datasets cannot be shared in another dataset. Alternatively, we will share the 29kSamples Drums Dataset, formed only by the samples recorded in the studio. This dataset will be available in Zenodo13 to download the whole dataset at once, and in Freesound some selected samples are uploaded in a pack14 to show the differences among microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592#.YMmNXW4p5TZ
14 https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter, the methodologies followed in the development of the assessment pipeline are explained. In Figure 6, the proposed pipeline diagram is shown; it is inspired by [2]. Each box of the diagram refers to a section in this chapter, so the diagram might be helpful to get a general idea of the problem when explaining each process.
The system is divided into two main processes. First, the top boxes correspond to the training process of the model, using the dataset created in the previous chapter. Secondly, the bottom row shows how a student submission is processed to generate some feedback. This feedback is the output of the system and should give some indications to the student on how they have performed and how they can improve.
4.1 Problem definition
To check if a student reads a music sheet correctly, we need some tool to tag which instruments of the drumset are playing for each detected event. This leads us to develop and train a drums event classifier; if this tool ensures a good accuracy when classifying (i.e. >95%), we will be able to properly assess a student's recording. If the classifier does not have enough accuracy, the system will not be useful, as we will not be able to differentiate between errors from the student and errors from the classifier.
[Diagram: training path — music scores and students' performances provide assessments, annotations and audio recordings that form the dataset, which feeds feature extraction, drums event classifier training and performance assessment training; inference path — a new student's recording goes through feature extraction and performance assessment inference to produce a visualization and performance feedback]
Figure 6: Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason, the project has been mainly focused on developing the aforementioned drums event classifier and a proper dataset. Hence, developing a properly assessed dataset of drums interpretations has not been possible, nor the performance assessment training. Despite this, the feedback visualization has been developed, as it is a nice way to close the pipeline and get some understandable results; moreover, the performance feedback can be focused on deterministic aspects, such as telling the student if they are rushing or slowing down in relation to a given tempo.
4.2 Drums event classifier
As already mentioned, this section has been the main load of work for this project, because a reliable assessment depends on a correct automatic transcription. The process has been divided into 3 main parts: extracting the musical features, training and validating the model in an iterative process, and finally validating the model with totally new data.
4.2.1 Feature extraction
The feature extraction concept has been explained in section 2.1.1 and has been implemented using the MusicExtractor()1 method from Essentia's library.
The MusicExtractor() method has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()2 has been implemented to loop over a list of files and analyze each of them, adding the extracted features to a csv file jointly with the class of each drums event. At this point, all the low-level features were extracted; both mean and standard deviation were computed across all the frames of the given audio filename. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.
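A minimal sketch of this per-slice extraction, in the spirit of extract_MusicalFeatures(); the file list and CSV layout are illustrative, not the notebook's exact code.

    import csv
    import essentia.standard as es

    extractor = es.MusicExtractor(lowlevelFrameSize=2048, lowlevelHopSize=1024,
                                  lowlevelStats=['mean', 'stdev'])

    files = [('slices/hh_0001.wav', 'hh'), ('slices/kd_0001.wav', 'kd')]  # placeholders
    with open('features.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        for filename, label in files:
            features, _ = extractor(filename)   # a Pool of computed descriptors
            # keep the scalar low-level descriptors, in a stable order
            names = [n for n in sorted(features.descriptorNames())
                     if n.startswith('lowlevel') and isinstance(features[n], float)]
            writer.writerow([features[n] for n in names] + [label])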
As mentioned in section 3.4, the fact that the MusicExtractor() method has to be called with a filename, not an audio stream, forced us to create another version of the dataset, which had each event annotated in a different audio file with the correspondent class label as filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples), and the features were saved in different csv files, one for each dataset.3 Adding the number of instances in all the csv files, we get 40228 instances with 84 features and 1 label.
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices/features
4.2.2 Training and validating
As mentioned in section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) to do sound event classification; other authors have developed more complex methods for drums event classification. The complexity of these last methods made me choose the generic ones, also to try whether they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40kSamples Drums Dataset. The first times we tried the models, we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula 4.1, which does not take into account the unbalance of the dataset. The accuracy computation was around 92%, but the correct predictions were mainly on the large classes; as shown in Figure 7, some classes had very low accuracy (even 0%, as some classes had 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances affects the accuracy computation less.
$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_\mathrm{samples}} \sum_{i=0}^{n_\mathrm{samples}-1} 1(\hat{y}_i = y_i) \tag{4.1}$$
Instead, the proper way to compute the accuracy on this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages the accuracy over all the classes, as in formula 4.2, where $w_i$ represents the weight of each class in the dataset. This computation lowered the result to 79%, which was not a good result.
$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \qquad \mathrm{balanced\ accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i \tag{4.2}$$
Figure 7: Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score, which combines the precision and the recall of the model in one measure, as in formula 4.3. Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of predictions that should be correct for a given class.
$$F\text{-}measure = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{4.3}$$
Having these results led us to the process of recording a personalized dataset to extend the already existing one (see section 3.2.2). With this new distribution, the results improved, as shown in Figure 8, as well as the balanced accuracy and f-score (both 89%). Until this point, we were using both KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8: Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine refers to the regularization; this technique is intended to make the model less sensitive to the data noise and the outliers that may not represent the class properly. When increasing this value to 10, the results improved among all the classes, as shown in Figure 9, as well as the accuracy and f-score (both 95%).
At that point, the accuracy of the model was pretty good, but the 88% on the snare drum class was somehow a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with the classes that include only the three mentioned instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and f-score to 97.7% and, concretely, the sd accuracy to 96%, as shown in Figure 10.
Figure 9: Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10: Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the training and validating iterative process has been developed in the Classifier_training.ipynb4 notebook. First, the csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered and, to remove redundant features, the ones with a very low standard deviation are deleted (i.e. std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to center all the data distributions around 0, ensuring a standard deviation of 1.
In the next cells, the dataset is split into train and validation sets, and the training method of the SVM from sklearn is called to perform the training; when the models are trained, the parameters are dumped to a file to load the model a posteriori and be able to apply the knowledge learned to new data. This process was very slow on my computer, so we decided to upload the csv files to Google Drive and open the notebook with Google Collaboratory; it was faster, which is key to avoid long waiting times during the iterative train-validate process. In the last cells, the inference is made with the validation set and the accuracy is computed, as well as the confusion matrix plotted, to get an idea of which classes are performing better. A condensed sketch of this loop is shown below.
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now, all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection5 does both tasks. As explained in section 2.1.1, there are many methods to do onset detection, and each of them is better for a different application. In the case of drums, we have tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works pretty well; but when the tempo increases, some onsets are not correctly detected. For this reason, we finally implemented the onset detection with the HFC
4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
method. This method computes, for each window, the HFC as in equation 4.4; note that high-frequency bins (index $k$) weigh more in the final value of the HFC.
$$HFC(n) = \sum_k |X_k[n]|^2 \cdot k \tag{4.4}$$
Moreover, the function plots the audio waveform jointly with the onsets detected, to check that it has worked correctly after each test. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases, all the onsets are correctly detected and no false detection occurs.
Figure 11: Onsets detected in a 60 bpm drums interpretation
Figure 12: Onsets detected in a 220 bpm drums interpretation
With the onsets information, the audio can be trimmed into the different events; the order is maintained in the name of each file, so when comparing with the expected events they can be mapped easily. The audios are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a csv.
To predict which event each slice is, the models already trained are loaded in this new environment, and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row in the data; a sketch of this step is shown below. The described process is implemented in the first part of Assessment.ipynb;6 the second part is intended to execute the visualization functions described in the next section.
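A minimal sketch of that inference step, loading the scaler and SVM persisted during training; file names are illustrative, and the feature columns must match the training-time order.

    import pandas as pd
    from joblib import load

    scaler, clf = load('svm_drums.joblib')             # artifacts from training
    slices = pd.read_csv('submission_features.csv')    # one row per detected onset
    predicted_events = clf.predict(scaler.transform(slices))
    print(list(predicted_events))                      # e.g. ['hh+kd', 'hh', 'hh+sd', ...]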
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving visual feedback of the interpretation to the student. As the drums classifier has taken so much time, the creation of a dataset with interpretations and their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet, simulating different levels of skill, but grading them and doing all the process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed ones.
So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Having as a base the implementation of my companion Eduard Vergés,7 and thanks to the help of Vsevolod Eremenko,8 the visualization is done in the last cell of the notebook Assessment.ipynb.
First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (this comes from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio.
In Figures 13 and 14, the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (green good, red bad).
Figure 13: Onset deviation plot of a good tempo submission
Figure 14: Onset deviation plot of a bad tempo submission
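An illustrative matplotlib sketch of this waveform-plus-deviations idea (not the project's save_bar_plot() itself; the 0.25 s worst-case deviation is an assumed scale):

    import numpy as np
    import matplotlib.pyplot as plt

    def deviation_plot(audio, sr, detected, expected, max_dev=0.25):
        t = np.arange(len(audio)) / sr
        fig, ax = plt.subplots(figsize=(12, 3))
        ax.plot(t, audio, color='0.7')                 # waveform in light grey
        for d, e in zip(detected, expected):
            dev = min(abs(d - e) / max_dev, 1.0)       # 0 = on time, 1 = worst
            # traffic-light span between the pair: green -> red as deviation grows
            ax.axvspan(min(d, e), max(d, e), color=(dev, 1.0 - dev, 0.0), alpha=0.6)
            ax.vlines(d, -1.0, -0.5, color='black')    # detected onset marker
            ax.vlines(e, -0.25, 0.25, color='cyan')    # expected onset marker
        fig.savefig('deviation_plot.png', bbox_inches='tight')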
Once the waveform is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes(),10 the expected and predicted events are compared: a list is created with 1 at the indices where they match and 0 where they differ; then the resulting list is iterated and the 0 indices are checked, because most of the classification errors fail in only one of the instruments to be predicted (i.e. instead of hh+sd, it predicts sd). These cases are considered partially correct, as the system has to take into account its own errors: at the indices in which one of the instruments is correctly predicted and it is not a hi-hat (we consider it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct, and red means incorrect. A simplified sketch of this logic follows the figure.
9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15: Example of coloured notes
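A simplified sketch of that scoring rule, assuming events encoded as 'hh+sd'-style strings as elsewhere in this document; the real assess_notes() operates on the lists described above.

    def assess_notes(expected, predicted):
        """Return 1.0 (correct), 0.75 (partially correct) or 0.0 per event."""
        scores = []
        for exp, pred in zip(expected, predicted):
            if exp == pred:
                scores.append(1.0)
                continue
            shared = set(exp.split('+')) & set(pred.split('+'))
            # partially correct if an instrument other than the hi-hat matches
            scores.append(0.75 if shared - {'hh'} else 0.0)
        return scores

    print(assess_notes(['hh+sd', 'hh', 'kd'], ['sd', 'hh', 'hh']))  # [0.75, 1.0, 0.0]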
With the waveform, the notes assessed and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On one hand, before each note on the staff, the keywords color() and size() determine that the color and size of the note depend on an external variable (the notes assessed); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars), while the other number is the scale of the waveform, which allows fitting the plot better to the staff.
4.3.2 Files used
The assessment process of an exercise needs several files: first, the annotations of the expected events and their timesteps, found in the txt file already mentioned in section 3.1.1; then the LilyPond file, which is the template written in the LilyPond language that defines the resultant music sheet (the macros to change color and size and to add the waveform are defined there); when extracting the musical features, each submission creates its csv file to store the information; and finally, we need, of course, the audio files with the recorded submission to be assessed.
11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding audio signal quality and tempo.
The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and increasing in steps of 40 bpm up to 220 bpm. The recordings with good tempo and good reading have been processed by repeatedly adding 6 dB, up to an accumulated +30 dB.
In this chapter and in Appendix B all the resulting feedback visualizations are shown. The audio files can be listened to on Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high-frequency content method works perfectly, there are no limitations or errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positives.
1 https://freesound.org/people/MaciaAC/packs/32350
51 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops as the tempo increases. Taking as a reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one 0.5 and an incorrect one 0; the total value is the mean of the weighted results of the predictions.
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of the classifier decreases; this may be because increasing the tempo decreases the spacing between events, and consequently the duration of each event, which leads to fewer
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to calculate the mean and standard deviation when extracting the timbre characteristics. As stated in the Law of Large Numbers [25], the larger the sample, the closer the sample mean is to the population mean. In this case, having fewer values in the calculation creates more outliers in the distribution, which tends to scatter.
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48
Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But 100 bpm in 12/8 is equivalent to 300 eighth notes per minute, similar to 140 bpm in 4/4, which corresponds to 280 eighth notes per minute. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875
Table 4 Results of exercise 2 with different tempos
52 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason, we think the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that it leads to high-frequency content that is predicted as a hi-hat in all the cases.
Level    Correct   Partially OK   Incorrect   Total
+0dB     25        7              0           0.89
+6dB     23        9              0           0.86
+12dB    23        9              0           0.86
+18dB    24        7              1           0.86
+24dB    18        5              9           0.64
+30dB    13        5              14          0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0dB     12        7              13          0.48
+6dB     13        10             9           0.56
+12dB    10        8              14          0.5
+18dB    9         2              21          0.31
+24dB    8         0              24          0.25
+30dB    9         0              23          0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB at each new staff
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the accuracy of the drums event classifier, but we think it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar, and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized; it works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68
Table 7 Assessment result of a bad reading with different tempos 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.94
Table 8 Assessment result of a bad reading with different tempos 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system performs the assessment based on the predictions pretty well. This can be helpful for a student to know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help the student recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the onsets detected (black lines in the bottom part of the waveform) are mostly behind the correspondent expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. Also, a set of further improvements is given, and a final thought on my work and my apprenticeship. The chapter ends with an analysis of how reusable and reproducible my work is.
61 Discussion of results
Having in mind all the concepts explained along this document, we can now list them, defining their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of these interpretations is very unbalanced, as explained with the IDMT and MDB Drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on
classifying drums events in this manner. As the results have shown, more complex methods based on context might be needed, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained on is very hard for a human: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment from the studio-recorded dataset. It would be good to get recordings in different spaces and with different drumsets and microphones to test the system more exhaustively.
62 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even different people playing; this would introduce more variance in the drums sample dataset. Moreover, on June 9th, 2021, a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it is clear that it needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods from [16], [17] and [26] into the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org
Also, in terms of the classes in the drumset, there is a long path to cover. There are no solutions that robustly transcribe a whole drumset, including the toms and the different kinds of cymbals. In this regard, we think a proper approach would be to work with professional musicians, which would help researchers better understand the instrument and create datasets with different techniques.
Regarding the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises to give a mark to each student. On this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
63 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and other researchers and students can execute them, getting the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository3, and some samples on Freesound4. This is the 29k Drums Samples Dataset: not all the 40k samples used for training are our property, so we are not able to share them under our full authorship; despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397
64 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished; nevertheless, work remains in terms of robustness and reliability. A first approximation has been presented, as well as several paths of improvement proposed.
Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web version, GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which was a tough path to take and accomplish. In a more advanced phase of the project, I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest. And being able to relate the project with music and drums helped me to keep my motivation and focus. I am quite satisfied with the feedback visualization that results from the system, and I hope that more people get interested in this field of research so that we get better tools in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 4/4 exercise 46
8 Assessment result of a bad reading with different tempos 12/8 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018)
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020)
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019)
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013)
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001)
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005)
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981)
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005)
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021)
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010)
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014)
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2001)
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020)
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015)
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019)
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017)
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012)
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013)
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014)
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017)
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006)
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013)
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings – 14th International Society for Music Information Retrieval Conference (2013)
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005)
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008)
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
Figure 31 Recording routine 1
Figure 32 Recording routine 2
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
that properly detects all the onsets of a drums interpretation. With the onset information (a list of timestamps), the audio can be sliced to analyze each chunk separately and to assess the tempo consistency.
It is important to address the challenge in a psychoacoustical way, as the objective is to detect the musical events as a human would. In [7], the idea of perceptual onset for percussive instruments is defined as a time interval between the physical onset and the moment the maximum level is reached. In [8], many methods are reviewed, focusing on the differences in performance depending on the signal: Non-Pitched Percussive instruments are better detected with temporal methods or high-frequency content methods, while Pitched Non-Percussive instruments may need to take into account changes of energy in the spectrum distribution, as the onset may represent a different note.
The sound generated by the drums is mainly percussive (discarding brushes' slow patterns or mallet build-ups on the cymbals), which means that it is formed by a short transient followed by a short decay; there is no sustain. As the transient is a fast change of energy, it implies high-frequency content, because the changes happen in a very small frame of time. As recommended in [9], the HFC method will be used.
Timbre features
As described in [10], a feature denotes in some way a quantity or a value. Features extracted by processing the audio stream, or transformations of it (i.e. the FFT), are called low-level descriptors; these features carry no relevant information from a human point of view, but are useful for computational processes [11].
Some low-level descriptors are computed from the temporal information: for instance, the zero-crossing rate tells the number of times the signal crosses the zero axis per second, the attack time is the duration of the transient, and the temporal centroid describes the energy distribution of an event over time. Other well-known features are the root mean square of the signal or the high-frequency content mentioned in section 2.1.1.
Besides temporal features, low-level descriptors can also be computed from the frequency domain. Some of them are spectral flatness, spectral roll-off, spectral slope and spectral flux, i.a.
Nowadays, Essentia's library offers a collection of algorithms that reliably extract the aforementioned low-level descriptors; the function that englobes all the extractors is called MusicExtractor1.
212 Data augmentation
Data augmentation refers to optimizing the statistical representation of a dataset so as to improve the generalization of the resulting models. These methods are based on the introduction of unobserved data or latent variables that may not be captured during the dataset creation [12].
Regarding this technique applied to audio data, signal processing algorithms are proposed in [13] and [14] that introduce changes to the signals in both the time and frequency domains. In those articles the goal is to improve accuracy in speech and animal sound recognition, although this could apply to drums event classification as well.
The processes that yielded the best results in [13] and [14] were related to time-domain transformations, for instance time shifting and stretching, adding noise or harmonic distortion, or compressing into a given dynamic range, i.a. Other proposed processes were focused on the spectrogram of the signal, applying transformations such as shifting the matrix representation, setting some areas to 0, or adding spectrograms of different samples of the same class.
Presently, some Python2 libraries are developed and maintained for audio data augmentation tasks, for instance audiomentations3 and its GPU version, torch-audiomentations4.
1 https://essentia.upf.edu/streaming_extractor_music.html
2 https://www.python.org
3 https://pypi.org/project/audiomentations/0.6.0
4 https://pypi.org/project/torch-audiomentations
22 Sound event classification
Sound event classification is the task of detecting and recognizing sound events in an audio stream [15]. As described in [10], this task can be approached from two sides: on one hand, the perceptual approach tries to extract the timbre similarity to cluster sounds by how we perceive them; on the other hand, the taxonomic approach aims to label sound events as they are defined in cultural or user-biased taxonomies. In this project the focus is on the second approach, as the task is to classify sound events within the drums taxonomy (i.e. kick drum, snare drum, hi-hat).
Also in [10], many classification methods are proposed; concretely, for the taxonomic approach, machine learning algorithms such as K-Nearest Neighbors, Support Vector Machines or Neural Networks, all of them using features extracted from the audio data as explained in section 2.1.1.
221 Drums event classification
This section is divided into two parts: first presenting the state-of-the-art methods for drum event classification, and then the most relevant existing datasets. This section is mainly based on the article [1], as it is a review of the topic and encompasses the core concepts of the project.
Methods
Focusing on taxonomic drums event classification, this field has been studied over the last years; in the Music Information Retrieval Evaluation eXchange5 (MIREX) it has been a proposed challenge since 20056. In [1], a review of the main methods that have been investigated is done. The authors collect different approaches, such as Recurrent Neural Networks, proposed in [16], Non-Negative Matrix Factorization, proposed in [17], and other real-time approaches based on Max/MSP7, as described in [18].
5 https://www.music-ir.org/mirex/wiki/MIREX_HOME
6 https://www.music-ir.org/mirex/wiki/2005:Audio_Drum_Detection_Results
7 https://cycling74.com/products/max
It is worth mentioning that the proposed methods are focused on Automatic Drum Transcription (ADT) of drumsets formed only by the kick drum, snare drum and hi-hat. The ADT field is intended to transcribe audio, but in our case we have to check whether an audio event is the expected event or not; this particularity can be used in our favor, as some assumptions can be made about the audio that has to be analyzed.
Datasets
In addition to the methods and their combinations, the data used to train the system plays a crucial role; as a result, the dataset may have a big impact on the generalization capabilities of the models. In this section some existing datasets are described.
• IDMT-SMT-Drums [19]: Consists of real drum recordings containing only kick drum, snare drum and hi-hat events. Each recording has its transcription in xml format and is publicly available to download8.
• MDB Drums [20]: Consists of real drums recordings of a wide range of genres, drumsets and styles. Each recording has two txt transcriptions, for the classes and subclasses defined in [20] (e.g. class: hi-hat; subclasses: closed hi-hat, open hi-hat, pedal hi-hat). It is publicly available to download9.
• ENST-Drums [21]: Consists of real drum audio and video recordings of different drummers and drumsets. Each recording has its transcription, and some of them include accompaniment audio. It is publicly available to download10.
• DREANSS [22]: Differently, this dataset is a collection of drum recordings datasets that have been annotated a posteriori. It is publicly available to download11.
Electronic drums datasets have not been considered, as the student assignment is supposed to be recorded with a real drumset.
8 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
9 https://github.com/CarlSouthall/MDBDrums
10 https://perso.telecom-paristech.fr/grichard/ENST-drums
11 https://www.upf.edu/web/mtg/dreanss
23 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter programs of the 80s. Proprietary software such as Finale12 and Sibelius13, or open-source software such as MuseScore14 and LilyPond15, are some options that can be used nowadays to write music sheets with a computer.
In terms of file format, Sibelius has its encrypted version that can only be read and written with the software; it can also write and read MusicXML16 files, which are not encrypted and are similar to an HTML file, as they contain tags that define the bars and notes of the music sheet. This format is the standard for exchanging digital music sheets.
Within Music Critic's framework, the technology used to display the evaluated score is LilyPond: it can be called from the command line and allows adding macros that change the size or color of the notes. The other particularity is that it uses its own file format (ly), so scores that are in MusicXML format have to be converted and reviewed.
24 Software tools
Many of the concepts and algorithms aforementioned are already implemented as software libraries. This project has been developed with Python, and in this section the libraries that have been used are presented. Some of them are open and public, and some others are private, such as pysimmusic, which has been shared with us so we can use and consult it. In addition, all the code has been developed using a tool from Google called Colaboratory17; it allows writing code in a jupyter notebook18 format that is agile to use and execute interactively.
12 https://www.finalemusic.com
13 https://www.avid.com/sibelius
14 https://musescore.org
15 https://lilypond.org
16 https://www.musicxml.com
17 https://colab.research.google.com
18 https://jupyter.org
241 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis, description and synthesis [23]. It can also be installed as a Python library, with the pip19 command on Linux or by compiling with certain flags on macOS20. This library includes a collection of MIR algorithms; it is not a framework, so it is in the user's hands how to use these processes. Some of the algorithms used in this project are music feature extraction, onset detection and audio file I/O.
242 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine learning algorithms for regression, classification and clustering, as well as pre-processing and dimensionality-reduction functions. It is based on NumPy22 and SciPy23, so its algorithms are easy to adapt to the most common data structures used in Python. It also allows saving and loading trained models to do inference tasks with new data.
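As a minimal sketch of this save-and-load workflow (the file name and the synthetic data are assumptions, standing in for the 84-feature csv described in chapter 4):

from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import joblib

# Toy stand-in for the drums feature matrix: 84 features per event.
X, y = make_classification(n_samples=200, n_features=84, n_classes=3,
                           n_informative=10, random_state=0)

model = make_pipeline(StandardScaler(), SVC(C=10))
model.fit(X, y)

joblib.dump(model, "svm_drums.joblib")    # persist the trained model
reloaded = joblib.load("svm_drums.joblib")
print(reloaded.predict(X[:5]))            # inference with the reloaded model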
243 Lilypond
As described in section 2.3, LilyPond is an open-source scorewriter software with its own file format and language. It can produce visual renders of music sheets in PNG, SVG and PDF formats, as well as MIDI files to listen to the compositions. LilyPond works on the command line and allows us to introduce macros to modify visual aspects of the score, such as color or size.
It is the digital sheet music technology used within Music Critic's framework, as it allows embedding an image in the music sheet, generating a parallel representation of the music sheet and a student's interpretation.
19 https://pypi.org/project/pip
20 https://essentia.upf.edu/installing.html
21 https://scikit-learn.org
22 https://numpy.org
23 https://www.scipy.org/scipylib/index.html
244 Pysimmusic
Pysimmusic is a private Python library developed at the MTG. It offers tools to analyze the similarity of musical performances and uses libraries such as Essentia, LilyPond and FFmpeg24, i.a. Pysimmusic contains onset detection algorithms and a collection of audio descriptors and evaluation algorithms. By now it is the main evaluation software used in Music Critic to compare the submitted recording with the reference.
245 Music Critic
Music Critic is a project from the MTG intended to support technologies for online music education, facilitating the assessment of student performances25.
The proposed workflow starts with a student submitting a recording of the proposed exercise. The submission is then sent to Music Critic's server, where it is analyzed and assessed. Finally, the student receives the evaluation jointly with the feedback from the server.
25 Summary
Music information retrieval and machine learning have been popular fields of study. This has led to a large development of methods and algorithms that will be crucial for this project. Most of them are free and open-source, and fortunately the private ones have been shared by the UPF research team, which is a great base to start the development.
24 https://www.ffmpeg.org
25 https://www.upf.edu/web/mtg/tech-transfer/-/asset_publisher/pYHc0mUhUQ0G/content/id/229860881/maximized#YJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in section 1.3.2, having a well-annotated and balanced dataset is crucial to get proper results. In this chapter the 40kSamples Drums Dataset creation process is explained: first, focusing on how to process existing datasets, such as those mentioned in section 2.2.1; secondly, introducing the process of creating new datasets from a music school corpus and a collection of recordings made in a recording studio; finally, describing the data augmentation procedure and how the audio samples are sliced into individual drums events. In Figure 1 we can see the different procedures used to unify the annotations of the different datasets, while the audio does not need any specific modification.
31 Existing datasets
Each of the existing datasets has a different annotation format; in this section the process of unifying them is explained, as well as its implementation (see notebook Dataset_formatUnification.ipynb1). As the events to take into account can be single instruments or combinations of them, the annotations have to be formatted to show those events properly. None of the annotations has this approach, so we have written a function that filters the list and joins the events separated by a small difference of time, meaning that they are played simultaneously.
1 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
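A hedged sketch of such a joining function is shown below; the threshold value and the label format (instruments joined with +) are assumptions for illustration, not the exact implementation of the notebook:

def join_simultaneous(events, threshold=0.03):
    """events: list of (timestamp_seconds, label) pairs."""
    merged = []
    for t, label in sorted(events):
        if merged and t - merged[-1][0] < threshold:
            # Close enough in time: merge into one combined event label.
            prev_t, prev_label = merged[-1]
            merged[-1] = (prev_t, "+".join(sorted(prev_label.split("+") + [label])))
        else:
            merged.append((t, label))
    return merged

print(join_simultaneous([(0.500, "hh"), (0.505, "kd"), (1.000, "sd")]))
# [(0.5, 'hh+kd'), (1.0, 'sd')]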
[Figure 1 diagram: the four corpora (Music school, Studio REC, IDMT Drums, MDB Drums) converge into unified annotations and audio; the music school scores go from Sibelius to MusicXML and through the MusicXML parser to txt, while the other corpora's annotations are rewritten directly to the unified format.]
Figure 1 Datasets pre-processing
311 MDB Drums
This dataset was the first we worked with; its annotation format in txt was a key factor, as it was easy to read and understand. As the dataset is available on Github2, there is no need to download it or process it from a local drive. As shown in the first cells of Dataset_formatUnification.ipynb, data from the repository can be retrieved with a Python wrapper of the Github API3.
This dataset has two annotation files, depending on how deep the taxonomy used is [20]. In this case the generic class taxonomy is used, as there is no need to differentiate playing styles for a given instrument (i.e. single stroke, flam, drag, ghost note).
312 IDMT Drums
Differently from the previous dataset, this one is only available by downloading a zip file4. It also differs in the annotation file format, which is xml. Using the Python package xmltodict5, in the second part of Dataset_formatUnification.ipynb the xml files are loaded as Python dictionaries and converted to txt format.
2 https://github.com/CarlSouthall/MDBDrums
3 https://pypi.org/project/githubpy
4 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
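The conversion can be sketched as follows; note that the tag names (instrumentRecording, transcription, event, onsetSec, instrument) are illustrative and may not match the exact IDMT schema:

import xmltodict

with open("annotation.xml") as f:          # hypothetical annotation file
    doc = xmltodict.parse(f.read())

with open("annotation.txt", "w") as out:
    # Walk the nested-dictionary tree and write one "timestamp  class" line per event.
    for ev in doc["instrumentRecording"]["transcription"]["event"]:
        out.write(f'{ev["onsetSec"]}\t{ev["instrument"]}\n')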
32 Created datasets
In order to expand the dataset with a wider variety of samples, other methods to get data have been explored. On one hand, there is audio data with partial annotations or with some representation that is not data-driven, such as a music sheet, which contains a visual representation of the music but not a logic annotation, as mentioned in the previous section. On the other hand, generating simple annotations is an easy task, so drums samples can be recorded standalone to create data in a controlled environment. These methods are described in the next two sections.
321 Music school
A music school has shared its teaching material with the MTG for research purposes, i.e. audio demos, books in pdf format, and music sheets in Sibelius format. As we can see in Figure 1, the annotations from the music school corpus are in Sibelius format; this is an encrypted representation of the music sheet that can only be opened with the Sibelius software. The MTG has shared an AVID license, which includes the Sibelius software, so we were able to convert the sib files to MusicXML. MusicXML is not encrypted and can be opened and read, so a parser has been developed to convert the MusicXML files to a symbolic representation of the music sheet. This representation has been inspired by [24], which proposes a system to represent chords.
MusicXML parser
As mentioned in section 2.3, the MusicXML format is based on ordering the visual information with tags, creating a tree structure of nested dictionaries. In the first cell of XML_parser.ipynb6 two functions are defined: ConvertXML2Annotation reads the musicxml file and gets the general information of the song (i.e. tempo, time
5 https://pypi.org/project/xmltodict
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
measure, title); then a for loop runs through all the bars of the music sheet, checking whether the given bar is self-defined, a repetition of the previous one, or the beginning or end of a repetition in the song (see Figure 2); in the self-defined case, the bar is passed to an auxiliary function which parses it, producing the aforementioned symbolic representation.
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three others have a symbol that means 'repetition of the previous bar'; moreover, the bar lines at the beginning and the end indicate that these four bars have to be repeated. Therefore, this line in the music score represents an interpretation of eight bars, repeating the first one.
The symbolic representation that we propose, based on [24], defines each bar with a string; this string contains the representations of the events in the bar, separated by blank spaces. Each of the events has a colon (:) to separate the figure (i.e. quarter note, half note, whole note) from the note or notes of the event, which are separated by a dot (.). For instance, the symbolic representation of the first bar in Figure 2 is F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4
In addition to this conversion, in the parse_one_measure function from the XML_parser notebook each measure is checked to ensure that it fully represents the bar. This means that the sum of the figures of the bar has to be equal to that defined in the time measure: the sum of the events in a 4/4 bar has to be equal to four quarter notes.
Symbolic notation to unified annotation format
As we can see in Figure 1, once the music scores are converted to the symbolic representation, the last step is to unify the annotations with those used in section 3.1. This process is made in the last cells of the Dataset_formatUnification7 notebook. A dictionary with the translation from notes to drums instruments is defined, so the notes are directly converted. Differently, the timestamp of each event has to be computed based on the tempo of the song and the figure of each event; this is done with the function get_time_steps_from_annotations8, which reads the interpretation in symbolic notation and accumulates the duration of each event based on its figure and the tempo.
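A hedged sketch of this accumulation, using the symbolic notation described above (the function name and rounding are illustrative, not the code of get_time_steps_from_annotations):

def get_timestamps(bars, tempo_bpm):
    seconds_per_quarter = 60.0 / tempo_bpm
    t, out = 0.0, []
    for bar in bars:
        for event in bar.split():
            notes, figure = event.split(":")
            out.append((round(t, 3), notes))
            # figure 4 = quarter note, 8 = eighth note, 2 = half note...
            t += seconds_per_quarter * (4.0 / int(figure))
    return out

print(get_timestamps(["F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4"], tempo_bpm=60))
# [(0.0, 'F4.A4'), (1.0, 'F4.A4'), (2.0, 'F4.A4'), (3.0, 'F4.A4')]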
322 Studio recordings
At this point of the dataset creation, we realized that the already existing data was very unbalanced in terms of instances per class: some classes had around two thousand samples, while others had only ten. This situation was the reason to record a personalized dataset, to balance the overall class distribution as well as to record exercises with different reading accuracies, simulating students with different skill levels.
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants, Barcelona); the first day was devoted to mounting the drumset and the microphones, which are listed in Table 2. In Figure 3 the microphone setup is shown: differently from the standard setup, in which each instrument of the set has its own microphone, this distribution of the microphones was intended to record the whole drumset with different frequency responses.
The recording process was divided into two phases: first, creating samples to balance the dataset used to train the drums event classifier (called the train set); then, recording the simulated students' assignments to test the whole system (called the test set).
7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
9 https://www.stereodosis.com
Microphone            Transducer principle
Beyerdynamic TG D70   Dynamic
Shure PG52            Dynamic
Shure SM57            Dynamic
Sennheiser e945       Dynamic
AKG C314              Condenser
AKG C414              Condenser
Shure PG81            Condenser
Samson C03            Condenser
Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only the classes that appear in the music school subset; this decision was motivated by the idea of assessing the songs from the books, so only the classes in the collection of songs were needed to train the classifier. In Figure 4 the distribution of the selected classes before the recordings is shown; note that it is in logarithmic scale, so there is a large difference among classes.
20 Chapter 3 The 40kSamples Drums Dataset
Figure 4 Number of samples before Train set recording
To organize the recording process, we designed 3 different routines; depending on the class and the number of samples already existing, a different routine was recorded. These routines were designed to represent the different speeds, dynamics and interactions between instruments of a real interpretation. In Appendix A the routine scores are shown; to write a generic routine, a two-line stave is used, where the bottom line represents the class to be recorded and the top line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound lasts a long period of time and whose tail gets mixed with the subsequent sound events.
• Routine 1 (Fig. 31): intended for the classes that do not include a crash or ride cymbal and have a small number of samples (i.e. <500).
• Routine 2 (Fig. 32): does not include auxiliary events, as it is intended for classes that include a crash or ride cymbal, whose interaction with itself is intrinsic.
• Routine 3 (Fig. 33): a short version of routine 1 which repeats each bar only two times instead of four; it is intended for classes that do not include a crash or ride cymbal and have a large number of samples (i.e. >500).
Routines 1 and 3 were recorded only once, as we had only one instrument for each of those classes; differently, routine 2 was recorded twice for each cymbal, as we were able to use more instances of them. The different cymbal configurations used can be seen in Appendix A, in Figures 34, 35 and 36.
After the Train set recording, the number of samples was a little more balanced: as shown in Figure 5, all the classes have at least 1,500 samples.
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song on the same drumset; to do that, we recorded each song of the music school Drums Grade Initial and Grade 1, playing it correctly and then making mistakes in both reading and rhythm. After testing with these recordings, we realized that we were not able to test the limits of the assessment system in terms of tempo or with different rhythmic measures, so we proposed two groove-reading exercises, in 4/4 and in 12/8, to be performed at different tempos; these recordings were done in my study room with my laptop's microphone.
33 Data augmentation
As described in section 2.1.2, data augmentation aims to introduce changes to the signals to optimize the statistical representation of the dataset. To implement this task, the aforementioned Python library audiomentations is used.
The audiomentations library has a class called Compose which allows collecting different processing functions, assigning a probability to each of them. The Compose instance can then be called several times with the same audio file, and each time the resulting audio will be processed differently because of the probabilities. In data_augmentation.ipynb10 a possible implementation is shown, as well as some plots of the original sample and different results of applying the created Compose to the same sample; an example of the results can be listened to on Freesound11.
The processing functions introduced in the Compose class are based on those proposed in [13] and [14]; their parameters are described below, and a sketch of this Compose is given after the footnotes:
• Add gaussian noise, with 70% probability.
• Time stretch between 0.8 and 1.25, with 50% probability.
• Time shift forward a maximum of 25% of the duration, with 50% probability.
• Pitch shift of ±2 semitones, with 50% probability.
• Apply mp3 compression, with 50% probability.
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213
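The sketch below assembles a Compose matching the list above; parameter names follow the audiomentations 0.x API and the exact signatures may differ between versions:

import numpy as np
from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                             PitchShift, Shift, TimeStretch)

augment = Compose([
    AddGaussianNoise(p=0.7),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    Shift(min_fraction=0.0, max_fraction=0.25, p=0.5),    # forward time shift only
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    Mp3Compression(p=0.5),
])

sr = 44100
audio = np.random.uniform(-0.5, 0.5, sr).astype(np.float32)  # stand-in for a drum sample
for _ in range(3):
    variant = augment(samples=audio, sample_rate=sr)  # each call applies a random subset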
34 Drums events trim
As will be explained in section 4.2.1, the dataset has to be trimmed into individual files in order to analyze them and extract the low-level descriptors. In the Dataset_featureExtraction.ipynb12 notebook this process has been implemented, slicing all the audios with their annotations, each dataset separately, in order to sight-check all the resultant samples and better detect which annotations were not correct.
35 Summary
To summarize, a drums samples dataset has been created; the one used in this project will be called the 40k Samples Drums Dataset. Nonetheless, to share this dataset we have to ensure that we are fully proprietary of the data, which means that the samples that come from the IDMT, MDB Drums and Music School datasets cannot be shared in another dataset. Alternatively, we will share the 29k Samples Drums Dataset, formed only by the samples recorded in the studio. This dataset will be available on Zenodo13, to download the whole dataset at once, and on Freesound, where some selected samples are uploaded in a pack14 to show the differences among microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592#.YMmNXW4p5TZ
14 https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is inspired by [2]. Each box of the diagram refers to a section in this chapter, so the diagram might be helpful for getting a general idea of the problem while each process is explained.
The system is divided into two main processes. First, the top boxes correspond to the training process of the model, using the dataset created in the previous chapter. Secondly, the bottom row shows how a student submission is processed to generate feedback. This feedback is the output of the system and should give the student some indications of how they have performed and how they can improve.
41 Problem definition
To check whether a student reads a music sheet correctly, we need a tool to tag which instruments of the drumset are playing for each detected event. This leads us to develop and train a drums event classifier: if this tool ensures a good classification accuracy (i.e. >95%), we will be able to properly assess a student's recording. If the classifier is not accurate enough, the system will not be useful, as we will not be able to differentiate between errors from the student and errors from the classifier.
[Figure 6 diagram: music scores, students' performances, assessments, annotations and audio recordings form the dataset; feature extraction feeds the drums event classifier training and the performance assessment training; a new student's recording goes through feature extraction and performance assessment inference, and a visualization produces the performance feedback.]
Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason, the project has mainly focused on developing the aforementioned drums event classifier and a proper dataset. Developing a properly assessed dataset of drums interpretations has therefore not been possible, nor has the performance assessment training. Despite this, the feedback visualization has been developed, as it is a nice way to close the pipeline and get some understandable results; moreover, the performance feedback can focus on deterministic aspects, such as telling students whether they are rushing or slowing relative to a given tempo.
42 Drums event classifier
As already mentioned, this section has been the main workload of this project, because a reliable assessment depends on a correct automatic transcription. The process has been divided into 3 main parts: extracting
the musical features; training and validating the model in an iterative process; and finally testing the model with totally new data.
421 Feature extraction
The feature extraction concept has been explained in section 2.1.1, and it has been implemented using the MusicExtractor()1 method from Essentia's library.
MusicExtractor() has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()2 has been implemented to loop over a list of files, analyze each of them, and add the extracted features to a csv file jointly with the class of each drum event. At this point all the low-level features were extracted, and both mean and standard deviation were computed across all the frames of the given audio file; the reason was that we wanted to check which features were redundant or meaningful when training the classifier.
As mentioned in section 3.4, the fact that MusicExtractor() has to be called with a filename, not an audio stream, forced us to create another version of the dataset, in which each event is annotated in a different audio file with the correspondent class label as filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples) and the features were saved in a different csv for each dataset3. Adding up the number of instances in all the csv files, we get 40,228 instances with 84 features and 1 label.
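A hedged sketch of this per-file extraction (the slice filename is an assumption); MusicExtractor returns pools of aggregated and per-frame descriptors:

import essentia.standard as es

extractor = es.MusicExtractor(lowlevelStats=["mean", "stdev"])
features, frames = extractor("slices/hh_0001.wav")   # hypothetical sliced event

# Pool keys are descriptor names; e.g. the aggregated spectral flux:
print(features["lowlevel.spectral_flux.mean"],
      features["lowlevel.spectral_flux.stdev"])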
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices/features
422 Training and validating
As mentioned in section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) for sound event classification, and some authors have developed more complex methods specifically for drums event classification. The complexity of these last methods made me choose the generic ones, also to try whether they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40k Drums Samples Dataset. The first times we tried the models, we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula 4.1, which does not take into account the unbalance of the dataset. The accuracy computation was around 92%, but the correct predictions were mainly on the large classes: as shown in Figure 7, some classes had very low accuracy (even 0, as some classes have 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances affects the accuracy computation less.

$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} \mathbf{1}(\hat{y}_i = y_i) \qquad (4.1)$$
Otherwise, the proper way to compute the accuracy on this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages the accuracies along all the classes, as in formula 4.2, where $w_i$ represents the weight of each class in the dataset. This computation lowered the result to 79%, which was not a good result.

$$\hat{w}_i = \frac{w_i}{\sum_j \mathbf{1}(y_j = y_i)\, w_j} \qquad \text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i \mathbf{1}(\hat{y}_i = y_i)\, \hat{w}_i \qquad (4.2)$$
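scikit-learn implements formula 4.2 as balanced_accuracy_score; this toy example, with made-up labels, shows how plain accuracy hides a failure on the minority class:

from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = ["hh"] * 90 + ["sd"] * 10   # 90/10 class unbalance
y_pred = ["hh"] * 100                # the minority class is always missed

print(accuracy_score(y_true, y_pred))           # 0.9 -> looks good
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 -> reveals the failure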
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score, which combines the precision and the recall of the model in one measure, as in formula 4.3. Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of predictions that should be correct for a given class.

$$F\text{-measure} = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (4.3)$$
Having these results led us to record a personalized dataset to extend the already existing one (see section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as well as the balanced accuracy and f-score (both 89%). Until this point we were using both KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter of a support vector machine controls the regularization; this technique is intended to make a model less sensitive to the data noise and to the outliers that may not represent the class properly. When increasing this value to 10, the results improved among all the classes, as shown in Figure 9, as well as the accuracy and f-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was somewhat of a problem, as it is one of the most used instruments of the drumset, jointly with the hi-hat and the kick drum. So we tried the same process with the classes that include only the three mentioned instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and f-score to 97.7% and, concretely, the sd accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 but only hh sd and kd classes
The implementation of the iterative training and validation process has been developed in the Classifier_training.ipynb4 notebook: first the csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the corresponding instances are filtered; and, to remove redundant features, the ones with a very low standard deviation (i.e. std_dev < 0.00001) are deleted. As the SVM works better when the data is normalized, the standard scaler is used to center all the feature distributions around 0, ensuring a standard deviation of 1.
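A minimal sketch of this pre-processing, assuming the features csv has one column per descriptor plus a 'label' column (the file and column names are illustrative):

# Drop near-constant features (std_dev < 0.00001) and standardize the rest.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("features.csv")                 # hypothetical file name
X, y = df.drop(columns=["label"]), df["label"]

X = X.loc[:, X.std() > 0.00001]                  # remove redundant features

scaler = StandardScaler()                        # mean 0, std 1 per feature
X_scaled = scaler.fit_transform(X)               # keep `scaler` for inference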
In the next cells the dataset is split into train and validation sets, and the training method of sklearn's SVM is called to perform the training; once the models are trained, their parameters are dumped to a file so the model can be loaded a posteriori and the learned knowledge applied to new data. This process was very slow on my computer, so we decided to upload the csv files to Google Drive and open the notebook with Google Colaboratory, which was faster; this was key to avoid long waiting times during the iterative train-validate process. In the last cells the inference is made with the validation set, the accuracy is computed, and the confusion matrix is plotted to get an idea of which classes perform better.
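A sketch of these cells under the same assumptions as the previous snippets; the file names are placeholders and the actual notebook may differ in details.

# Split, train and persist the model so it can be loaded a posteriori.
import joblib
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_val, y_train, y_val = train_test_split(
    X_scaled, y, test_size=0.3, stratify=y, random_state=42)

clf = SVC(C=10).fit(X_train, y_train)

joblib.dump(clf, "svm_drums.joblib")     # dump the trained parameters
joblib.dump(scaler, "scaler.joblib")     # the scaler is part of the pipeline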
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection5 does both tasks. As explained in section 2.1.1, there are many methods to do onset detection, each better suited to a different application. In the case of drums we have tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works pretty well, but when the tempo increases some onsets are not correctly detected; for this reason we finally implemented the onset detection with the HFC method. This method computes, for each analysis window, the HFC as in equation 4.4; note that high-frequency bins (index k) weigh more in the final value of the HFC:

$$\mathrm{HFC}(n) = \sum_k |X_k[n]|^2 \cdot k \qquad (4.4)$$

4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
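A sketch of this onset detection, following the standard-mode example in Essentia's documentation; the filename is a placeholder, and the frame and hop sizes are common defaults rather than necessarily the ones used in the project.

import essentia
import essentia.standard as es

audio = es.MonoLoader(filename="submission.wav")()

od = es.OnsetDetection(method="hfc")     # computes equation 4.4 per frame
w, fft, c2p = es.Windowing(type="hann"), es.FFT(), es.CartesianToPolar()

pool = essentia.Pool()
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    mag, phase = c2p(fft(w(frame)))
    pool.add("odf.hfc", od(mag, phase))

# Peak-pick the detection function to obtain onset times in seconds
onsets = es.Onsets()(essentia.array([pool["odf.hfc"]]), [1])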
Moreover, the function plots the audio waveform jointly with the detected onsets, to check after each test that the detection has worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
With the onset information the audio can be trimmed into the different events; the order is kept in the file names, so when comparing with the expected events they can be mapped easily. The audio slices are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a csv file.
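The trimming itself can be sketched as follows, continuing from the audio array and onset times of the previous snippet; the slice bounds and naming scheme are illustrative, not the exact ones in SliceDrums_BeatDetection().

import essentia.standard as es

SR = 44100                               # MonoLoader's default sample rate
for i, t in enumerate(onsets):
    start = int(t * SR)
    end = int(onsets[i + 1] * SR) if i + 1 < len(onsets) else len(audio)
    es.MonoWriter(filename=f"slice_{i:03d}.wav")(audio[start:end])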
To predict which event each slice contains, the already trained models are loaded in this new environment, and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row of the data. The described process is implemented in the first part of Assessment.ipynb6; the second part executes the visualization functions described in the next section.
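A sketch of this inference step, reusing the artifacts dumped during training; the names follow the earlier snippets and are illustrative.

import joblib
import pandas as pd

clf = joblib.load("svm_drums.joblib")
scaler = joblib.load("scaler.joblib")    # same pre-processing as in training

X_new = scaler.transform(pd.read_csv("submission_features.csv"))
predicted_events = clf.predict(X_new)    # one predicted class per slice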
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving the student visual feedback on the interpretation. As the drums classifier has taken so much time, the creation of a dataset of interpretations with their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different skill levels, but grading them and doing the whole process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed ones.
So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Taking as a base the implementation of my colleague Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb.
First the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (these come from the LilyPond file's macro). To plot the deviations properly, the code assumes that the interpretation starts four beats after the beginning of the audio. In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (from green, good, to red, bad).
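The idea behind the plot can be sketched with matplotlib as follows. This is an illustrative re-implementation, not the project's save_bar_plot(); in particular, the 0.25 s scale used to normalize the deviation for the color code is an assumption.

import numpy as np
import matplotlib.pyplot as plt

def plot_deviations(waveform, sr, detected, expected):
    t = np.arange(len(waveform)) / sr
    plt.plot(t, waveform, color="gray", alpha=0.6)
    cmap = plt.get_cmap("RdYlGn_r")              # 0 = green (good), 1 = red (bad)
    for d, e in zip(detected, expected):
        dev = min(abs(d - e) / 0.25, 1.0)        # deviation normalized to [0, 1]
        plt.axvspan(min(d, e), max(d, e), color=cmap(dev), alpha=0.5)
        plt.vlines(d, -1.0, -0.5, color="black") # detected onset (bottom)
        plt.vlines(e, -0.25, 0.25, color="cyan") # expected onset (middle)
    plt.xlabel("time (s)")
    plt.show()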
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
Once the waveform is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes()10 the expected and predicted events are compared, producing a list of values with 1 at the indices where they match and 0 where they differ; the resulting list is then iterated and the 0 indices are checked, because most of the classification errors fail in only one of the instruments to be predicted (i.e. predicting sd instead of hh+sd). These cases are considered partially correct, as the system has to take its own errors into account: at the indices where one of the instruments is correctly predicted and it is not a hi-hat (we consider getting the snare and kick reading right more important than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green partially correct, and red incorrect.
9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
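The partial-credit rule can be sketched like this; the event strings follow the hh/sd/kd naming used above, and the function is illustrative rather than the exact assess_notes().

def assess_notes(expected, predicted):
    scores = []
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            scores.append(1.0)                   # green: fully correct
        else:
            shared = set(exp.split("+")) & set(pred.split("+"))
            # partially correct only if a non-hi-hat instrument matches
            scores.append(0.75 if shared - {"hh"} else 0.0)
    return scores

print(assess_notes(["hh+sd", "kd", "hh"], ["sd", "kd", "sd"]))  # [0.75, 1.0, 0.0]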
Figure 15 Example of coloured notes
With the waveform, the notes assessed, and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On the one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the notes assessed); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform display starts and on which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars), and the other number is the scale of the waveform, which allows fitting the plot to the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps; these are found in the txt file already mentioned in section 3.1.1. Then the LilyPond file: this is the template, written in the LilyPond language, that defines the resulting music sheet; the macros to change color and size and to add the waveform are defined in it. When extracting the musical features, each submission creates its own csv file to store the information. And finally we need, of course, the audio files with the recorded submission to be assessed.
11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until reaching 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB at a time, up to an accumulated gain of +30 dB.
In this chapter and in Appendix B all the resulting feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high frequency content method works perfectly, there are no limitations or errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positive.
1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops when the tempo increases. Taking as a reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one 0.5, and an incorrect one 0; the total value is the mean of the weighted results of the predictions.
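For instance, the 60 bpm row of Table 3 below (25 correct, 7 partially correct, 0 incorrect out of 32 events) gives (25 + 0.5 * 7) / 32 = 0.89:

def weighted_score(correct, partial, incorrect):
    total = correct + partial + incorrect
    return (1.0 * correct + 0.5 * partial) / total

print(weighted_score(25, 7, 0))   # 0.890625, reported as 0.89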
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of the classifier decreases. This may be because increasing the tempo decreases the spacing between events, and consequently the duration of each event, which leaves fewer values for calculating the mean and standard deviation when extracting the timbre characteristics. As stated in the Law of Large Numbers [25], the larger the sample, the closer its mean is to the population mean; in this case, having fewer values in the calculation makes the distribution more sensitive to outliers, which tends to scatter the extracted features.
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48

Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record it faster than 100 bpm. But in 12/8 at 100 bpm the subdivision rate is 300 eighth notes per minute, similar to 140 bpm in 4/4, whose rate is 280 eighth notes per minute. The results on 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875

Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at that gain level the overall waveform is so clipped that it produces high-frequency content that is predicted as a hi-hat in all the cases.
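The amplification steps can be simulated digitally as follows; each +6 dB roughly doubles the amplitude, and anything beyond full scale clips, producing the spurious high-frequency content mentioned above. This is a sketch, not the exact processing used for the tests.

import numpy as np

def amplify_and_clip(x, gain_db):
    y = x * 10 ** (gain_db / 20.0)    # +6 dB is a factor of ~2 in amplitude
    return np.clip(y, -1.0, 1.0)      # hard clipping at digital full scale

# x: audio as a float array in [-1, 1]
# versions = [amplify_and_clip(x, g) for g in range(0, 36, 6)]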
Level    Correct   Partially OK   Incorrect   Total
+0dB     25        7              0           0.89
+6dB     23        9              0           0.86
+12dB    23        9              0           0.86
+18dB    24        7              1           0.86
+24dB    18        5              9           0.64
+30dB    13        5              14          0.48

Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0dB     12        7              13          0.48
+6dB     13        10             9           0.56
+12dB    10        8              14          0.5
+18dB    9         2              21          0.31
+24dB    8         0              24          0.25
+30dB    9         0              23          0.28

Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of results has been focused on the accuracy of the drums event classifier, but we think it is also important to evaluate whether the system can properly assess a student's submission. As shown in Figures 25 and 26, even if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the corresponding onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1 starts at 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score; the rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs the assessment based on the predictions pretty well. This can be helpful for a student to know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help the student recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (black lines at the bottom of the waveform) are mostly behind the corresponding expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not; a set of further improvements is also given, together with a final thought on my work and my apprenticeship. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained throughout this document, we can now list them, stating their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of such interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on the context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained to do is very hard for a human: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on visualization rather than on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment from the one used for the 40k Samples Drums Dataset. It would be good to get recordings in different spaces and with different drumsets and microphones to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even with different people playing; this would introduce more variance in the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods from [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org
Also, in terms of the classes in the drumset, there is a long path still to cover. There are no solutions that robustly transcribe a whole set including the toms and different kinds of cymbals. In this sense, we think that a proper approach would be to work with professional musicians, who can help researchers to better understand the instrument and create datasets covering different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises in order to give a mark to each student. On this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
Regarding the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and to develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and other researchers and students can execute them, obtaining the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository3, and some samples are in Freesound4. This is the 29k Samples Drums Dataset: not all the 40k samples used for training are our property, so we cannot share them under our full authorship; despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although the work still falls short in terms of robustness and reliability; a first approximation has been presented, and several paths of improvement have been proposed.
Moreover, several fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web version, GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which was a tough path to take. In a more advanced phase of the project I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebook on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization the system produces, and I hope that more people get interested in this field of research so that better tools appear in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts at 60 bpm and adds 60 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M. J., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2003).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2001).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings - 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
Figure 31 Recording routine 1
Figure 32 Recording routine 2
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
8 Chapter 2 State of the art
Besides temporal features low-level descriptors can also be computed from the fre-
quency domain Some of them are spectral flatness spectral roll-off spectral slope
spectral flux ia
Nowadays Essentialsquos library offers a collection of algorithms that reliably extracts
the low-level descriptors aforementioned the function that englobes all the extrac-
tors is called Music extractor1
212 Data augmentation
Data augmentation processes refer to the optimization of the statistical representa-
tion of the datasets in terms of improving the generalization of the resultant models
These methods are based on the introduction of unobserved data or latent variables
that may not be captured during the dataset creation [12]
Regarding this technique applied to audio data signal processing algorithms are
proposed in [13] and [14] that introduces changes to the signals in both time and
frequency domains In these articles the goal is to improve accuracy on speech and
animal sound recognition although this could apply to drums event classification
The processes that lead best results in [13] and [14] were related to time-domain
transformations for instance time-shifting and stretching adding noise or harmonic
distortion compressing in a given dynamic range ia Other processes proposed
were focused on the spectrogram of the signal applying transformations such as
shifting the matrix representation setting to 0 some areas or adding spectrograms
of different samples of the same class
Presently some Python2 libraries are developed and maintained in order to do audio
data augmentation tasks For instance audiomentations3 and the GPU version
torch-audiomentations4
1httpsessentiaupfedustreaming_extractor_musichtml2httpswwwpythonorg3httpspypiorgprojectaudiomentations0604httpspypiorgprojecttorch-audiomentations
22 Sound event classification 9
22 Sound event classification
Sound Event Classification is the task of detecting and recognizing sound events in
an audio stream [15] As described in [10] this task can be approached from two
sides on one hand the perceptual approach tries to extract the timbre similarity to
cluster sounds as how we perceive them on the other hand the taxonomic approach
is determined to label sound events as they are defined in the cultural or user biased
taxonomies In this project the focus is on the second approach as the task is to
classify sound events in the drums taxonomy (ie kick drum snare drum hi-hat)
Also in [] many classification methods are proposed Concretely in the taxonomy
approach machine learning algorithms such as K-Nearest Neighbors Support Vector
Machines or Neural Networks All of them using features extracted from the audio
data as explained in section 211
221 Drums event classification
This section is divided into two parts first presenting the state-of-the-art methods
for drum event classification and then the most relevant existing datasets This
section is mainly based on the article [1] as it is a review of the topic and encompasses
the core concepts of the project
Methods
Focusing on the taxonomic drums events classification this field has been studied for
the last years as in the Music Information Retrieval Evaluation eXchange5 (MIREX)
has been a proposed challenge since 20056 In [1] a review of the main methods
that have been investigated is done The authors collect different approaches such
as Recurrent Neural Networks proposed in [16] Non-Negative matrix factorization
proposed in [17] and others real-time based using MaxMSP7 as described in [18]
5httpswwwmusic-irorgmirexwikiMIREX_HOME6httpswwwmusic-irorgmirexwiki2005Audio_Drum_Detection_Results7httpscycling74comproductsmax
10 Chapter 2 State of the art
It is needed to mention that the proposed methods are focused on Automatic Drum
Transcription (ADT) of drumsets formed only by the kick drum snare drum and
hi-hat ADT field is intended to transcribe audio but in our case we have to check
if an audio event is or not the expected event this particularity can be used in our
favor as some assumptions can be made about the audio that has to be analyzed
Datasets
In addition to the methods and their combinations the data used to train the
system plays a crucial role As a result the dataset may have a big impact on the
generalization capabilities of the models In this section some existing datasets are
described
bull IDMT-SMT-Drums [19] Consists of real drum recordings containing only
kick drum snare drum and hi-hat events Each recording has its transcription
in xml format and is publicly avaliable to download8
bull MDB Drums [20] Consists of real drums recordings of a wide range of genres
drumsets and styles Each recording has two txt transcriptions for the classes
and subclasses defined in [20] (eg class Hi-hat Subclasses Closed hi-hat
open hi-hat pedal hi-hat) It is publicly avaliable to download9
bull ENST-Drums [21] Consists of real drum audio and video recordings of dif-
ferent drummers and drumsets Each recording has its transcription and some
of them include accompaniment audio It is publicly available to download10
bull DREANSS [22] Differently this dataset is a collection of drum recordings
datasets that have been annotated a posteriori It is publicly available to
download11
Electronic drums datasets have not been considered as the student assignment is
supposed to be recorded with a real drumset8httpswwwidmtfraunhoferdeenbusiness_unitsm2dsmtdrumshtml9httpsgithubcomCarlSouthallMDBDrums
10httpspersotelecom-paristechfrgrichardENST-drums11httpswwwupfeduwebmtgdreanss
23 Digital sheet music 11
23 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter
programs from the 80s Proprietary softwares as Finale12 and Sibelius13 or open-
source software as MuseScore14 and LilyPond15 are some options that can be used
nowadays to write music sheets with a computer
In terms of file format Sibelius has its encrypted version that can only be read and
written with the software it can also write and read MusicXML16 files which are
not encrypted and are similar to an HTML file as it contains tags that define the
bars and notes of the music sheet this format is the standard for exchanging digital
music sheet
Within Music Criticrsquos framework the technology used to display the evaluated score
is LilyPond it can be called from the command line and allows adding macros that
change the size or color of the notes The other particularity is that it uses its own
file format (ly) and scores that are in MusicXML format have to be converted and
reviewed
24 Software tools
Many of the concepts and algorithms aforementioned are already developed as soft-
ware libraries this project has been developed with Python and in this section the
libraries that have been used are presented Some of them are open and public and
some others are private as pysimmusic that has been shared with us so we can use
and consult it In addition all the code has been developed using a tool from Google
called Collaboratory17 it allows to write code in a jupyter notebook18 format that
is agile to use and execute interactively
12httpswwwfinalemusiccom13httpswwwavidcomsibelius14httpsmusescoreorg15httpslilypondorg16httpswwwmusicxmlcom17httpscolabresearchgooglecom18httpsjupyterorg
12 Chapter 2 State of the art
241 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis
description and synthesis [23] it can also be installed as a Python-based library
with the pip19 command in Linux or compiling with certain flags in MacOS20 This
library includes a collection of MIR algorithms it is not a framework so it is in the
userrsquos hands how to use these processes Some of the algorithms used in this project
are music feature extraction onset detection and audio file IO
242 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine learning
algorithms for regression classification and clustering as well as pre-processing and
dimensionality reduction functions Based on NumPy22 and SciPy23 so its algorithms
are easy to adapt to the most common data structures used in Python It also allows
to save and load trained models to do inference tasks with new data
243 Lilypond
As described in section 23 LilyPond is an open-source songwriter software with
its file format and language It can produce visual renders of musical sheets in
PNG SVG and PDF formats as well as MIDI files to listen to the compositions
LilyPond works on the command line and allows us to introduce macros to modify
visual aspects of the score such as color or size
It is the digital sheet music technology used within Music Criticrsquos framework as
allows to embed an image in the music sheet generating a parallel representation of
the music sheet and a studentrsquos interpretation
19httpspypiorgprojectpip20httpsessentiaupfeduinstallinghtml21httpsscikit-learnorg22httpsnumpyorg23httpswwwscipyorgscipylibindexhtml
25 Summary 13
244 Pysimmusic
Pysimmusic is a private python library developed at the MTG It offers tools to
analyze the similarity of musical performances and uses libraries such as Essentia
LilyPond FFmpeg24 ia Pysimmusic contains onset detection algorithms and a
collection of audio descriptors and evaluation algorithms By now is the main eval-
uation software used in Music Critic to compare the recording submitted with the
reference
245 Music Critic
Music Critic is a project from the MTG intended to support technologies for online
music education facilitating the assessment of student performances25
The proposed workflow starts with a student submitting a recording playing the
proposed exercise Then the submission is sent to the Music Criticrsquos server where
is analyzed and assessed Finally the student receives the evaluation jointly with
the feedback from the server
25 Summary
Music information retrieval and machine learning have been popular fields of study
This has led to a large development of methods and algorithms that will be crucial
for this project Most of them are free and open-source and fortunately the private
ones have been shared by the UPF research team which is a great base to start the
development
24httpswwwffmpegorg25httpswwwupfeduwebmtgtech-transfer-asset_publisherpYHc0mUhUQ0G
contentid229860881maximizedYJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in section 132 having a well-annotated and balanced dataset is crucial to
get proper results In this section the 40kSamples Drums Dataset creation process is
explained first focusing on how to process existing datasets such as the mentioned
in 221 Secondly introducing the process of creating new datasets with a music
school corpus and a collection of recordings made in a recording studio Finally
describing the data augmentation procedure and how the audio samples are sliced
in individual drums events In Figure 1 we can see the different procedures to unify
the annotations of the different datasets while the audio does not need any specific
modification
31 Existing datasets
Each of the existing datasets has a different annotation format in this section the
process of unifying them will be explained as well as its implementation (see note-
book Dataset_formatUnificationipynb1) As the events to take into account
can be single instruments or combinations of them the annotations have to be for-
matted to show that events properly None of the annotations has this approach
so we have written a function that filters the list and joins the events with a small
1httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_formatUnificationipynb
14
31 Existing datasets 15
difference of time meaning that they are played simultaneously
Music school Studio REC IDMT Drums MDB Drums
audio + txt
Sibelius to MusicXML
MusicXML parser to txt
Write annotations
AnnotationsAudio
Figure 1 Datasets pre-processing
311 MDB Drums
This dataset was the first we worked with the annotation format in txt was a key
factor as it was easy to read and understand As the dataset is available in Github2
there is no need to download it neither process it from a local drive As shown in
the first cells of Dataset_formatUnificationipynb data from the repository can
be retrieved with a Python wrapper of the Github API3
This dataset has two annotations files depending on how deep the taxonomy used
is [20] In this case the generic class taxonomy is used as there is no need to
differentiate styles when playing a given instrument (ie single stroke flam drag
ghost note)
312 IDMT Drums
Differently to the previous dataset this one is only available downloading a zip
file4 It also differs in the annotation file format which is xml Using the Python
2httpsgithubcomCarlSouthallMDBDrums3httpspypiorgprojectgithubpy4httpswwwidmtfraunhoferdeenbusiness_unitsm2dsmtdrumshtml
16 Chapter 3 The 40kSamples Drums Dataset
package xmltodict5 in the second part of Dataset_formatUnificationipynb the
xml files are loaded as a Python dictionary and converted to txt format
32 Created datasets
In order to expand the dataset with more variety of samples other methods to get
data have been explored On one hand with audio data that has partial annotations
or some representation that is not data-driven such as a music sheet that contains
a visual representation of the music but not a logic annotation as mentioned in
the previous section On the other hand generating simple annotations is an easy
task so drums samples can be recorded standalone to create data in a controlled
environment In the next two sections these methods are described
321 Music school
A music school has shared its docent material with the MTG for research purposes
ie audio demos books in pdf format music sheet in Sibelius format As we can
see in Figure 1 the annotations from the music school corpus are in Sibelius format
this is an encrypted representation of the music sheet that can only be opened with
the Sibelius software The MTG has shared an AVID license which includes the
Sibelius software so we were able to convert the sib files to musicxml MusicXML
is not encrypted and allows to open it and read so a parser has been developed to
convert the MusicXML files to a symbolic representation of the music sheet This
representation has been inspired by [24] which proposes a system to represent chords
MusicXML parser
As mentioned in section 23 MusicXML format is based on ordering the visual
information with tags creating a tree structure of nested dictionaries In the first cell
of XML_parseripynb6 two functions are defined ConvertXML2Annotation reads
the musicxml file and gets the general information of the song (ie tempo time
5httpspypiorgprojectxmltodict6httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterXML_parseripynb
32 Created datasets 17
measure title) then a for loops throughout all the bars of the music sheet checking
whereas the given bar is self-defined the repetition of the previous one or the begin
or end of a repetition in the song (see Figure 2) in the self-defined bar case the bar
indeed is passed to an auxiliar function which parses it getting the aforementioned
symbolic representation
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three
others have a symbol that means rsquorepetition of the previous barrsquo moreover the
bar lines at the beginning and the end represents that these four bars have to be
repeated therefore this line in the music score represents an interpretation of eight
bars repeating the first one
The symbolic representation that we propose is based in [24] defines each bar with
a string this string contains the representations of the events in the bar separated
with blank spaces Each of the events has two dots () to separate the figure (ie
quarter note half note whole note) from the note or notes of the event which
are separated by a dot () For instance the symbolic representation of the first bar
in Figure 2 is F4A44 F4A44 F4A44 F4A44
In addition to this conversion in parse_one_measure function from XML_parser
notebook each measure is checked to ensure that fully represents the bar This
means that the sum of the figures of the bar has to be equal to the defined in the
time measure the sum of the events in a 44 bar has to be equal to four quarter
notes
Symbolic notation to unified annotation format
As we can see in Figure 1 once the music scores are converted to the symbolic
representation the last step is to unify the annotations with the used in sections 31
18 Chapter 3 The 40kSamples Drums Dataset
This process is made in the last cells of Dataset_formatUnification7 notebook
A dictionary with the translation of the notes to drums instrument is defined so
the note is already converted Differently the timestamp of each event has to be
computed based on the tempo of the song and the figure of each event this process
is made with the function get_time_steps_from_annotations8 which reads the
interpretation in symbolic notation and accumulates the duration of each event
based on the figure and the tempo
322 Studio recordings
At this point of the dataset creation we realized that the already existing data
was so unbalanced in terms of instances per class some classes had around two
thousand samples while others had only ten This situation was the reason to
record a personalized dataset to balance the overall distribution of classes as well
as exercises with different accuracy when reading simulating students with different
skill levels
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants
Barcelona) the first day was intended to mount the drumset and the microphones
which are listed in Table 2 in Figure 3 the microphone setup is shown differently
to the standard setup in which each instrument of the set has its microphone this
distribution of the microphones was intended to record the whole drumset with
different frequency responses
The recording process was divide into two phases first creating samples to balance
the dataset used to train the drums event classifier (called train set) Then recording
the studentsrsquo assignment simulation to test the whole system (called test set)
7httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_formatUnificationipynb
8httpsgithubcomMaciACtfg_DrumsAssessmentblobe81be958101be005cda805146d3287eec1a2d5a4scriptsdrumspyL9
9httpswwwstereodosiscom
32 Created datasets 19
Microphone Transducer principleBeyerdynamic TG D70 Dynamic
Shure PG52 DynamicShure SM57 Dynamic
Sennheiser e945 DynamicAKG C314 CondenserAKG C414 CondenserShure PG81 CondenserSamson C03 Condenser
Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes we decided to take into account only the classes
that appear in the music school subset this decision was motivated by the idea of
assessing the songs from the books so only classes of the collection of songs were
needed to train the classifier In Figure 4 the distribution of the selected classes
before the recordings is shown note that is in logarithmic scale so there is a large
difference among classes
20 Chapter 3 The 40kSamples Drums Dataset
Figure 4 Number of samples before Train set recording
To organize the recording process we designed 3 different routines to record depend-
ing on the class and the number of samples already existing a different routine was
recorded These routines were designed trying to represent the different speeds dy-
namics and interactions between instruments of a real interpretation In Appendix
A the routines scores are shown to write a generic routine a two lines stave is used
the bottom line represents the class to be recorded and the top line an auxiliary
one The auxiliary classes are cymbals concretely crashes and rides whose sound
remains a long period of time and its tail is mixed with the subsequent sound events
bull Routine 1 (Fig 31) This routine is intended for the classes that do not include
a crash or ride cymbal and has a small number of classes (ie lt500)
bull Routine 2 (Fig 32) This routine does not include auxiliary events as it is
intended for classes that include crash or ride cymbal whose interaction with
itself is intrinsic
bull Routine 3 (Fig 33) This is a short version of routine 1 which only repeats
each bar two times instead of four is intended for classes which not include a
crash or ride cymbal and has a large number of classes (ie gt500)
32 Created datasets 21
Routines 1 and 3 were recorded only one time as we had only one instrument of each
of the classes differently routine 2 was recorded two times for each cymbal as we
was able to use more instances of them different cymbals configurations used can
be seen in Appendix A in Figures 34 35 and 36
After the Train set recording the number of samples was a little more balanced as
shown in Figure 5 all the classes have at least 1500 samples
0
1000
2000
3000
ht+kd
kd+m
t
ht
mt
ft+sd
ft+kd
+sd
cr+sd
ft
cr+kd
cr
ft+kd
hh+k
d+sd
kd+s
d
cy+s
d
cy
cy+k
d sd
kd
hh+s
d
hh+k
d
hh
recorded before record
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song
in the same drumset to do that we recorded each song of the music school Drums
Grade Initial and Grade 1 playing it correctly and then making mistakes in both
reading and rhythmic ways After testing with these recordings we realized that we
were not able to test the limits of the assessment system in terms of tempo or with
different rhythmic measures So we proposed two exercises of groove reading in 44
and in 128 to be performed with different tempos these recordings have been done
in my study room with my laptoprsquos microphone
22 Chapter 3 The 40kSamples Drums Dataset
33 Data augmentation
As described in section 212 data augmentation aims to introduce changes to the
signals to optimize the statistical representation of the dataset To implement this
task the aforementioned Python library audiomentations is used
The library Audiomentations has a class called Compose which allows collecting
different processing functions assigning a probability to each of them Then the
Compose instance can be called several times with the same audio file and each time
the resulting audio will be processed differently because of the probabilities In
data_augmentationipynb10 a possible implementation is shown as well as some
plots of the original sample with different results of applying the created Compose
to the same sample an example of the results can be listened in Freesound11
The processing functions introduced in the Compose class are based in the proposed
in [13] and [14] its parameters are described
bull Add gaussian noise with 70 of probability
bull Time stretch between 08 and 125 with a 50 of probability
bull Time shift forward a maximum of 25 of the duration with 50 of probability
bull Pitch shift plusmn2 semitones with a 50 of probability
bull Apply mp3 compression with a 50 of probability
10httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterdata_augmentationipynb
11httpsfreesoundorgpeopleMaciaACpacks32213
34 Drums events trim 23
34 Drums events trim
As will be explained in section 421 the dataset has to be trimmed into individual
files to analyze them and extract the low-level descriptors In the Dataset_feature
Extractionipynb12 notebook this process has been implemented slicing all the
audios with its annotations each dataset separately to sight-check all the resultant
samples and detect better which annotations were not correct
35 Summary
To summarize, a drums samples dataset has been created; the one used in this
project will be called the 40k Samples Drums Dataset. Nonetheless, to share this
dataset we have to ensure that we fully own the data, which means that the
samples that come from the IDMT, MDBDrums and MusicSchool datasets cannot
be shared in another dataset. Alternatively, we will share the 29k Samples Drums
Dataset, formed only by the samples recorded in the studio. This dataset will be
available in Zenodo13 to download as a whole, and in Freesound some selected
samples are uploaded in a pack14 to show the differences among microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592
14 https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment
pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is
inspired by [2]. Each box of the diagram refers to a section in this chapter, so the
diagram might be helpful to get a general idea of the problem when explaining each
process.
The system is divided into two main processes. First, the top boxes correspond to
the training process of the model, using the dataset created in the previous chapter.
Secondly, the bottom row shows how a student submission is processed to generate
some feedback. This feedback is the output of the system and should give the
student some indication of how they have performed and how they can improve.
41 Problem definition
To check if a student reads a music sheet correctly, we need some tool to tag which
instruments of the drumset are playing for each detected event. This leads us to
develop and train a drums event classifier: if this tool ensures a good accuracy
when classifying (i.e. at least 95%), we will be able to properly assess a student's
recording. If the classifier does not have enough accuracy, the system will not be
useful, as we will not be able to differentiate between errors made by the student
and errors made by the classifier.
[Diagram: music scores, students' performances and assessments provide annotations and audio recordings that form the dataset; feature extraction feeds the drums event classifier training and the performance assessment training (top row); a new student's recording goes through feature extraction and performance assessment inference, producing the visualization and the performance feedback (bottom row)]
Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason the project has been mainly focused on developing the aforemen-
tioned drums event classifier and a proper dataset, so building a properly assessed
dataset of drums interpretations has not been possible, nor has the performance
assessment training. Despite this, the feedback visualization has been developed, as
it is a nice way to close the pipeline and get some understandable results; moreover,
the performance feedback can focus on deterministic aspects, such as telling the
student if they are rushing or slowing down in relation to a given tempo.
42 Drums event classifier
As already mentioned, this section has been the main workload of this project,
because a reliable assessment depends on a correct automatic transcription. The
process has been divided into 3 main parts: extracting
the musical features, training and validating the model in an iterative process, and
finally testing the model with totally new data.
421 Feature extraction
The feature extraction concept has been explained in Section 211 and has been
implemented using the MusicExtractor()1 method from the Essentia library.
MusicExtractor() has to be called passing as parameters the window and hop sizes
that will be used to perform the analysis, as well as the filename of the event to be
analyzed. The function extract_MusicalFeatures()2 has been implemented to loop
over a list of files and analyze each of them, appending the extracted features to a
csv file jointly with the class of each drum event. At this point all the low-level
features were extracted; both mean and standard deviation were computed across
all the frames of the given audio file. The reason was that we wanted to check
which features were redundant or meaningful when training the classifier.
As mentioned in section 34, the fact that MusicExtractor() has to be called with a
filename, not an audio stream, forced us to create another version of the dataset
which had each event annotated in a different audio file, with the corresponding
class label as the filename. Once all the datasets were properly sliced and
sight-checked, the last cell of the notebook was executed with the corresponding
folder names (which contain all the sliced samples) and the features saved in
different csv files, one for each dataset3. Adding up the number of instances in all
the csv files, we get 40228 instances with 84 features and 1 label. A sketch of this
extraction loop is shown below.
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices_features
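A sketch of the loop, assuming Essentia's standard-mode MusicExtractor; the frame and hop sizes and the csv layout are illustrative, not necessarily those used in the notebook:

```python
import csv
import essentia.standard as es

def extract_features(file_list, labels, csv_path):
    extractor = es.MusicExtractor(lowlevelFrameSize=2048, lowlevelHopSize=1024,
                                  lowlevelStats=["mean", "stdev"])
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for filename, label in zip(file_list, labels):
            pool, _ = extractor(filename)   # aggregated stats, per-frame values
            # keep only the scalar low-level statistics (mean/stdev over frames)
            names = sorted(n for n in pool.descriptorNames()
                           if n.startswith("lowlevel")
                           and isinstance(pool[n], float))
            writer.writerow([pool[n] for n in names] + [label])
```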
422 Training and validating
As mentioned in section 22, some authors have proposed machine learning algo-
rithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN)
to do sound event classification; other authors have developed more complex
methods for drums event classification. The complexity of these latter methods
made me choose the generic ones, also to try whether they were a good way to
approach the problem, as there is no literature concretely on drums event
classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has
been the main reference when designing the 40k Drums samples dataset. The first
times we tried the models we were working with the class distribution of Figure 4;
as commented, this was a very unbalanced dataset, and we were evaluating the
classification inference with the accuracy formula 4.1, which does not take into
account the unbalance in the dataset. The accuracy computation was around 92%,
but the correct predictions were mainly on the large classes; as shown in Figure 7,
some classes had very low accuracy (even 0%, as some classes had 10 samples,
7 used to train and 3 to validate, all of them badly predicted), but having a small
number of instances affects the accuracy computation less.
$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} 1(\hat{y}_i = y_i) \qquad (4.1)$$
Instead, the proper way to compute the accuracy on this kind of dataset is the
balanced accuracy: it computes the accuracy for each class and then averages the
accuracy over all the classes, as in formula 4.2, where $\hat{w}_i$ represents the
weight of each sample, normalized within its class. This computation lowered the
result to 79%, which was not a good result.
$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \qquad (4.2)$$

$$\mathrm{balanced\text{-}accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i$$
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the F-score,
which combines the precision and the recall of the model in one measure, as in
formula 4.3. Precision is computed as the number of correct predictions divided by
the total number of predictions, and recall is the number of correct predictions
divided by the number of instances that actually belong to the given class.
$$F\text{-measure} = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (4.3)$$
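The three scores above correspond to metrics available in scikit-learn; a toy sketch with five labels shows how plain accuracy is inflated by a majority class while the balanced accuracy is not:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

y_true = ["hh", "hh", "hh", "sd", "kd"]
y_pred = ["hh", "hh", "hh", "hh", "hh"]   # a classifier that only predicts hh

print(accuracy_score(y_true, y_pred))             # 0.60, hides the failing classes
print(balanced_accuracy_score(y_true, y_pred))    # 0.33, averages per-class recall
print(f1_score(y_true, y_pred, average="macro"))  # 0.25, macro-averaged F-measure
```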
These results led us to the process of recording a personalized dataset to extend
the already existing one (see section 322). With this new distribution the results
improved, as shown in Figure 8, as did the balanced accuracy and F-score (both
89%). Until this point we were using both KNN and SVM models to compare
results, and the SVM always performed at least 10% better, so we decided to focus
on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine controls the regularization; this
technique is intended to make a model less sensitive to the data noise and to
outliers that may not represent the class properly. When increasing this value to
10 the results improved across all the classes, as shown in Figure 9, as did the
accuracy and F-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare
drum class was somewhat of a problem, as it is one of the most used instruments in
the drumset, jointly with the hi-hat and the kick drum. So I tried the same process
with the classes that include only the three mentioned instruments (ie hh, kd, sd,
hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved
the overall accuracy and F-score to 97.7%, and concretely the sd accuracy to 96%,
as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the training and validating iterative process has been de-
veloped in the Classifier_training.ipynb4 notebook. First, the csv files with the
features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending
on which subset of classes will be used, the corresponding instances are filtered,
and to remove redundant features the ones with a very low standard deviation are
deleted (i.e. std_dev < 0.00001). As the SVM works better when data is normalized,
the standard scaler is used to center all the feature distributions around 0 with a
standard deviation of 1.
In the next cells the dataset is split into train and validation sets, and the training
method of sklearn's SVM is called to perform the training. When the models are
trained, the parameters are dumped to a file, so the model can be loaded a
posteriori to apply the knowledge learned to new data. This process was very slow
on my computer, so we decided to upload the csv files to Google Drive and open
the notebook with Google Colaboratory, which was faster, a key feature to avoid
long waiting times during the iterative train-validate process. In the last cells the
inference is made with the validation set and the accuracy is computed, and the
confusion matrix is plotted to get an idea of which classes are performing better.
A condensed sketch of this process is shown below.
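This sketch condenses the notebook under the assumption that the features csv has a label column (the column and file names are illustrative):

```python
import joblib
import pandas as pd
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = pd.read_csv("features.csv")
X, y = data.drop(columns=["label"]), data["label"]
X = X.loc[:, X.std() > 1e-5]          # drop near-constant, redundant features

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

scaler = StandardScaler().fit(X_train)          # zero mean, unit variance
clf = SVC(C=10).fit(scaler.transform(X_train), y_train)

print(balanced_accuracy_score(y_val, clf.predict(scaler.transform(X_val))))
joblib.dump((scaler, clf), "svm_drums.joblib")  # reload later for inference
```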
423 Testing
Testing the model introduces the concept of onset detection: until now all the slices
have been created using the annotations, but to assess a new submission from a
student we need to detect the onsets and then slice the events. The function
SliceDrums_BeatDetection5 does both tasks. As explained in section 211, there
are many methods to do onset detection, and each of them is better for a different
application. In the case of drums we have tested the 'complex' method, which finds
changes in the frequency domain in terms of energy and phase and works pretty
well, but when the tempo increases some onsets are not correctly detected; for this
reason we finally implemented the onset detection with the HFC method.
4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
This method computes, for each window, the HFC as in equation 4.4; note
that high-frequency bins (the k index) weigh more in the final value of the HFC:
$$\mathrm{HFC}(n) = \sum_k k \cdot |X_k[n]|^2 \qquad (4.4)$$
Moreover, the function plots the audio waveform jointly with the onsets detected,
to check after each test whether it has worked correctly. In Figures 11 and 12 we
can see two examples of the same music sheet played at 60 and 220 bpm; in both
cases all the onsets are correctly detected and no false detection occurs. A sketch
of this detection chain is shown after the figures.
Figure 11 Onsets detected in a 60 bpm drums interpretation
Figure 12 Onsets detected in a 220 bpm drums interpretation
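The sketch follows Essentia's standard OnsetDetection/Onsets recipe for HFC-based detection; the frame and hop sizes and the file name are illustrative:

```python
import essentia
import essentia.standard as es

audio = es.MonoLoader(filename="submission.wav")()
od_hfc = es.OnsetDetection(method="hfc")
window, fft, c2p = es.Windowing(type="hann"), es.FFT(), es.CartesianToPolar()

# Build the HFC detection function frame by frame
pool = essentia.Pool()
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    magnitude, phase = c2p(fft(window(frame)))
    pool.add("odf.hfc", od_hfc(magnitude, phase))

# Peak-pick the detection function to obtain onset times in seconds
onsets = es.Onsets()(essentia.array([pool["odf.hfc"]]), [1])
```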
With the onsets information the audio can be trimmed into the different events; the
order is kept in the name of each file, so when comparing with the expected events
they can be mapped easily. The audios are passed to the extract_MusicalFeatures()
function, which saves the musical features of each slice in a csv file.
To predict which event each slice is, the models already trained are loaded in this
new environment and the data is pre-processed using the same pipeline as when
training. After that, the data is passed to the classifier method predict(), which
returns, for each row in the data, the predicted event. The described process is
implemented in the first part of Assessment.ipynb6; the second part is intended to
execute the visualization functions described in the next section. A minimal
inference sketch follows.
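This assumes the scaler and model were dumped together as in the earlier training sketch, and that the slices' features csv keeps the onset order (names are illustrative):

```python
import joblib
import pandas as pd

scaler, clf = joblib.load("svm_drums.joblib")  # same pre-processing as training
slices = pd.read_csv("slices_features.csv")    # one row per detected onset
predicted_events = clf.predict(scaler.transform(slices))
```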
43 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving the
student visual feedback on the interpretation. As the drums classifier has taken so
much time, the creation of a dataset with interpretations and their grades has not
been feasible. A first approximation was to record different interpretations of the
same music sheet simulating different levels of skill, but grading them and doing all
the process by ourselves was not easy; apart from that, we tended to play the
fragments either well or badly, and it was difficult to simulate intermediate levels
and be consistent with the proposed ones.
So the implemented solution generates an image that shows the student whether
the notes of the music sheet are correctly read and whether the onsets are aligned
with the expected ones.
431 Visualization
With the data gathered in the testing section, feedback on the interpretation has
to be returned. Having as a base implementation the solution of my colleague
Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization
is done in the last cell of the notebook Assessment.ipynb.
First the LilyPond file paths are defined. Then, for each of the submissions, the
audio is loaded to generate the waveform plot.
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and
expected onsets, the waveform, and the start and end of the waveform (this comes
from the LilyPond file's macro). To properly plot the deviations, in the code we are
assuming that the interpretation starts four beats after the beginning of the audio.
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is
shown. The black lines at the bottom of the waveform are the detected onsets,
while the cyan lines in the middle are the expected onsets; when the difference
between the two values increases, the area between them is colored with a
traffic-light code (from green, good, to red, bad). A sketch of such a plot is shown
after the figures.
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
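A matplotlib sketch of such a deviation plot; the 0.1 s normalization of the deviation and the line heights are illustrative choices, not the exact values used in save_bar_plot():

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_onset_deviation(audio, sr, detected, expected, path):
    t = np.arange(len(audio)) / sr
    fig, ax = plt.subplots(figsize=(12, 3))
    ax.plot(t, audio, color="lightgray")
    ax.vlines(detected, -1.0, -0.6, color="black")  # detected onsets, bottom
    ax.vlines(expected, -0.2, 0.2, color="cyan")    # expected onsets, middle
    for d, e in zip(detected, expected):
        dev = min(abs(d - e) / 0.1, 1.0)            # map deviation to [0, 1]
        ax.axvspan(min(d, e), max(d, e), color=(dev, 1.0 - dev, 0.0), alpha=0.4)
    fig.savefig(path)
    plt.close(fig)
```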
Once the waveform image is created, it is embedded in a lambda function that is
called from the LilyPond render. But before calling LilyPond to render, the
assessment of the notes has to be done. The function assess_notes()10 receives the
expected and predicted events; comparing them, a list is created with 1 at the
indices where they match and 0 where they do not. Then the resulting list is
iterated and the 0 indices are checked, because most of the classification errors fail
in only one of the instruments of the event (ie instead of hh+sd it predicts sd).
These cases are considered partially correct, as the system has to take its own
errors into account: at the indices where one of the instruments is correctly
predicted and it is not a hi-hat (we consider it more important to get the snare and
kick reading right than a hi-hat, which is present in all the events), the value is
turned to 0.75 (light green in the color scale). In Figure 15 the different feedback
options are shown: green notes mean correct, light green means partially correct
and red means incorrect; a sketch of this logic follows the figure.
9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
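A sketch of that logic, assuming events are named with the dataset's "+"-joined class labels; a partial hit counts 0.75 only when the shared instrument is more than just the hi-hat:

```python
def assess_notes(expected, predicted):
    """Return one score per event: 1 correct, 0.75 partially correct, 0 wrong."""
    scores = []
    for exp, pred in zip(expected, predicted):
        if pred == exp:
            scores.append(1.0)
        else:
            shared = set(exp.split("+")) & set(pred.split("+"))
            # partially correct if a non-hi-hat instrument was still recognized
            scores.append(0.75 if shared - {"hh"} else 0.0)
    return scores
```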
With the waveform, the notes assessed and the LilyPond template, the function
score_image()11 can be called. This function renders the LilyPond template jointly
with the waveform previously created; this is done with the LilyPond macros.
On one hand, before each note on the staff, the keyword color() size()
determines that the color and size of the note depend on an external variable (the
notes assessed); on the other hand, after the first note of the staff, the keyword
eps(1150 16) indicates on which beat the waveform starts to be displayed and on
which it ends, in this case from 0 to 16, which in a 4/4 rhythm is 4 bars; the other
number is the scale of the waveform, which allows fitting the plot to the staff.
432 Files used
The assessment process of an exercise needs several files. First, the annotations of
the expected events and their timesteps; these are found in the txt file already
mentioned in section 311. Then the LilyPond file: this is the template, written in
the LilyPond language, that defines the resulting music sheet; the macros to change
color and size and to add the waveform are defined there. When extracting the
musical features, each submission creates its csv file to store the information. And
finally we need, of course, the audio files with the recorded submission to be assessed.
11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can
evaluate the results to check whether the system works correctly and is useful for a
student to learn with, and also to test its limits regarding audio signal quality and
tempo. The tests have been done with two different exercises, recorded with a
computer microphone and played at different tempos, starting at 60 bpm and
adding 40 bpm until 220 bpm. The recordings with good tempo and good reading
have been processed adding 6 dB at a time, up to an accumulated +30 dB.
In this chapter and Appendix B all the resulting feedback visualizations are shown.
The audio files can be listened to in Freesound, where a pack1 has been created.
Some of them will be commented on and referenced in further sections; the rest are
extra results.
As the high frequency content method works perfectly, there are no limitations nor
errors in terms of onset detection: all the tests have an F-measure of 1, detecting
all the expected events without any false positive.
1 https://freesound.org/people/MaciaAC/packs/32350
51 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy
drops when the tempo increases. Having as a reference the figures that show a good
reading, in which all notes should be green or light green (ie Figures 16, 17, 18, 19,
20, 21 and 22), we can count how many are correct or partially correct to score
each case: a correct prediction weighs 1.0, a partially correct one weighs 0.5 and an
incorrect one 0; the total value is the mean of the weighted predictions, as in the
example below.
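For instance, for the 60 bpm row of Table 3 this weighting gives

$$\frac{25 \cdot 1.0 + 7 \cdot 0.5 + 0 \cdot 0}{32} = \frac{28.5}{32} \approx 0.89$$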
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that increasing the tempo of exercise 1 decreases the accuracy
of the classifier. This may be because increasing the tempo decreases the spacing
between events, and consequently the duration of each event, which leads to fewer
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to calculate the mean and standard deviation when extracting the timbre
characteristics. As stated in the law of large numbers [25], the larger the sample,
the closer the mean is to the total population mean; in this case, having fewer
values in the calculation creates more outliers, and the distribution tends to scatter.
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48

Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster
than 100 bpm. But at 100 bpm in 12/8 the equivalent subdivision rate is 300 eighth
notes per minute, similar to 140 bpm in 4/4, whose rate is 280. The results in 12/8
(Table 4) are also better because there are more 'only hi-hat' events, which are
better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875

Table 4 Results of exercise 2 with different tempos
52 Saturation limitations
Another limitation of the system is the saturation of the submitted signal.
Listening to the submissions, the hi-hat events are recorded with less amplitude
than the snare and kick events; for this reason we think that the classifier starts to
fail at +18dB. As can be seen in Tables 5 and 6, the same counting scheme as in
the previous section is applied to Figures 23 and 24. The hi-hat is the last
waveform to saturate, and at this gain level the overall waveform is so clipped that
the resulting high-frequency content leads to predicting a hi-hat in all the cases.
Level    Correct   Partially OK   Incorrect   Total
+0dB     25        7              0           0.89
+6dB     23        9              0           0.86
+12dB    23        9              0           0.86
+18dB    24        7              1           0.86
+24dB    18        5              9           0.64
+30dB    13        5              14          0.48

Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0dB     12        7              13          0.48
+6dB     13        10             9           0.56
+12dB    10        8              14          0.5
+18dB    9         2              21          0.31
+24dB    8         0              24          0.25
+30dB    9         0              23          0.28

Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo, Ex 1, 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo, Ex 1, 220 bpm, accumulating +6dB at each new staff
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the accuracy of the drums
event classifier, but we think it is also important to evaluate whether the system
can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some
of the beats are not read, the system can still map the rest of the events to the
expected ones at the corresponding onset time step. This is due to a check done in
the assessment, which assumes that before the first beat there is a count-in of one
bar, and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in previous sections, counting the
number of correct predictions, but now in terms of assessment. The analyzed
results are the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo, Ex 1, starting at 60 bpm and adding 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized; it works as follows: we count a
correct assessment if the note is green or light green and the event is the one in the
music score, or if the note is red and the event is not the one in the music score.
The rest of the cases are counted as incorrect assessments. The total value is the
number of correct assessments over the total number of events.
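For example, in the 180 bpm row of Table 7 below, 25 correct assessments out of 32 events give $25/32 \approx 0.78$.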
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7 Assessment results of a bad reading at different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8 Assessment results of a bad reading at different tempos, 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs
the assessment based on the predictions pretty well. This can be helpful for a
student to know which parts of the music sheet are well read and which are not.
Also, the tempo visualization can help the student recognize whether they are
slowing down or rushing when reading the score: as can be seen in Figure 30, the
onsets detected (black lines in the bottom part of the waveform) are mostly behind
the corresponding expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed. In this chapter a discussion is developed about which objectives have
been accomplished and which have not. Also, a set of further improvements is
given, and a final thought on my work and my apprenticeship. The chapter ends
with an analysis of how reusable and reproducible my work is.
61 Discussion of results
Having in mind all the concepts explained throughout this document, we can now
list them, stating their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly
available and downloadable from Freesound and Zenodo. Apart from being used in
this project, this dataset might be useful to other researchers and students in their
projects. The dataset is indeed useful for balancing drums datasets based on real
interpretations, as the class distribution of such interpretations is very unbalanced,
as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset. One of the reasons for using
this approach to predict the events was that there was no literature focused on
classifying drums events in this manner. As the results have shown, more complex
methods based on the context might be needed, such as the ones proposed in [16]
and [17]. It is important to take into account that the task the model is trained to
do is very hard for a human: differentiating drums events in an individual drum
sample without any context is almost impossible, even for a trained ear such as my
drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as
the development of a MusicXML parser. This part took around one month to
develop, and from my point of view it was a great way to understand how these
file formats work and how they can be improved, as they are mostly focused on the
visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to
demonstrate the functionality of the system, and recordings of these exercises have
been made in a different environment than the 29k Samples Drums Dataset. It
would be desirable to get recordings in different spaces and with different drumsets
and microphones, to test the system more exhaustively.
62 Further work
In terms of the dataset created, it could be larger. It could be expanded with
different drumsets, tuning each drumset differently, using different sticks to hit the
instruments, and even different people playing; this would introduce more variance
into the drums sample dataset. Moreover, on June 9th 2021 a paper about a large
drums dataset with MIDI data was presented [26] at ICASSP 20211. This new
dataset could be included in the training process, as the authors state that having
a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the
overall system robustness. It would be appropriate to introduce the aforementioned
methods of [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org
Also, in terms of classes in the drumset, there is a long path still to cover: there
are no solutions that robustly transcribe a whole set, including the toms and
different kinds of cymbals. Here we think a proper approach would be to work
with professional musicians, who can help researchers to better understand the
instrument and create datasets with different techniques.
With respect to the assessment step, apart from the feedback visualization of the
tempo deviations and the reading accuracy, a regression model could be trained
with assessed drums exercises to give a mark to each student. On this path,
introducing an electronic drumset with MIDI output would make things a lot
easier, as the drums classifier step could be omitted.
About the implementation, a good contribution would be to introduce the models
and algorithms into the Pysimmusic workflow and develop a demo web app like
Music Critic's. But better results and more robustness are needed before taking
this step.
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchers and students can execute them, getting the same results.
All the code has been developed in Python, a widely known general-purpose pro-
gramming language. It is available in my GitHub repository2, as well as the data
used to test the system and the classification models.
The data created, ie the studio recordings, is available in a Zenodo repository3,
and some samples in Freesound4. This is the 29k Samples Drums Dataset: not all
of the 40k samples used for training are our property, so we are not able to share
them under our full authorship; despite this, the other datasets used in this project
are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588
4 https://freesound.org/people/MaciaAC/packs/32397
64 Conclusions
This project has been developed over one year. At this point, with the work de-
scribed, the goal of supporting drums learning has been accomplished.
Nevertheless, work remains in terms of robustness and reliability; still, a first
approximation has been presented, and several paths of improvement have been
proposed.
Moreover, some fields of engineering and computer science have been covered, such
as signal processing, music information retrieval and machine learning; not only in
terms of implementation, but also investigating methods and gathering already
existing experiments and results.
About my relationship with computers, I have improved my fluency with git and
its web counterpart, GitHub. Also, at the beginning of the project I wanted to
execute everything on my local computer, having to install and compile libraries
that could not be installed on macOS via the pip command (ie Essentia), which
has been a tough path to take. In a more advanced phase of the project I realized
that the LilyPond tools could not be installed and used fluently on my local
machine, so I moved all the code to my Google Drive to execute the notebook on a
Colaboratory machine. Developing code in this environment also has its quirks,
which I have had to learn. In summary, I have spent a lot of time looking for the
ideal way to develop the project, and the process has indeed been fruitful in terms
of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my
Bachelor's degree, as I reviewed some of the concepts of most personal interest.
Being able to relate the project with music and drums helped me to keep my
motivation and focus. I am quite satisfied with the feedback visualization that the
system produces, and I hope that more people get interested in this field of
research, to get better tools in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment results of a bad reading at different tempos, 4/4 exercise 46
8 Assessment results of a bad reading at different tempos, 12/8 exercise 46
Bibliography
[1] Wu C-W et al. A review of automatic drum transcription. IEEE/ACM Trans.
Audio, Speech and Lang. Proc. 26 (2018)
[2] Eremenko V, Morsi A, Narang J & Serra X. Performance assessment
technologies for the support of musical instrument learning. e-repository UPF
(2020)
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic
[private] (2019)
[4] Kernan T J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music
(2013)
[5] Wachsmann K J, Kartomi M, von Hornbostel E M & Sachs C. Instru-
ments, classification of. Grove Encyclopedia of Music (2001)
[6] Mierswa I & Morik K. Automatic feature extraction for classifying audio data.
Mach. Learn. 58 (2005)
[7] Vos J & Rasch R. The perceptual onset of musical tones. Perception &
Psychophysics 29 (1981)
[8] Bello J P et al. A tutorial on onset detection in music signals. IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/
reference/streaming_OnsetDetection.html (2021)
[10] Herrera P, Peeters G & Dubnov S. Automatic classification of musical
instrument sounds. Journal of New Music Research 32 (2010)
[11] Schedl M, Gómez E & Urbano J. Music information retrieval: Recent de-
velopments and applications. Foundations and Trends in Information Retrieval
8 (2014)
[12] van Dyk D A & Meng X-L. The art of data augmentation. Journal of Com-
putational and Graphical Statistics 10 (2012)
[13] Nanni L, Maguolo G & Paci M. Data augmentation approaches for im-
proving animal audio classification. CoRR (2020)
[14] Ko T, Peddinti V, Povey D & Khudanpur S. Audio augmentation for
speech recognition. INTERSPEECH (2020)
[15] Adavanne S, Fayek H M & Tourbabin V. Sound event classification and
detection with weakly labeled data. DCASE 2019 (2019)
[16] Southall C, Stables R & Hockman J. Automatic drum transcription for
polyphonic recordings using soft attention mechanisms and convolutional neural
networks. ISMIR (2017)
[17] Lindsay-Smith H, McDonald S & Sandler M. Drumkit transcription via
convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx
2012, Proceedings (2012)
[18] Miron M, Davies M E P & Gouyon F. An open-source drum transcription
system for Pure Data and Max MSP. 2013 IEEE International Conference on
Acoustics, Speech and Signal Processing (2012)
[19] Dittmar C & Gärtner D. Real-time transcription and separation of drum
recordings based on NMF decomposition. DAFx (2014)
[20] Southall C, Wu C-W, Lerch A & Hockman J. MDB Drums – an annotated
subset of MedleyDB for automatic drum transcription. ISMIR (2017)
[21] Gillet O & Richard G. ENST-Drums: an extensive audio-visual database for
drum signals processing. ISMIR (2006)
[22] Marxer R & Janer J. Study of regularizations and constraints in NMF-based
drums monaural separation. DAFx (2013)
[23] Bogdanov D et al. Essentia: An audio analysis library for music information
retrieval. Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Gómez E, Harte C, Sandler M & Abdallah S. Symbolic representation of
musical chords: A proposed syntax for text annotations. ISMIR (2005)
[25] Upton G & Cook I. Laws of large numbers. A Dictionary of Statistics (2008)
[26] Wei I-C, Wu C-W & Su L. Improving automatic drum transcription using
large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
Figure 31 Recording routine 1
Figure 32 Recording routine 2
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
22 Sound event classification 9
22 Sound event classification
Sound Event Classification is the task of detecting and recognizing sound events in
an audio stream [15] As described in [10] this task can be approached from two
sides on one hand the perceptual approach tries to extract the timbre similarity to
cluster sounds as how we perceive them on the other hand the taxonomic approach
is determined to label sound events as they are defined in the cultural or user biased
taxonomies In this project the focus is on the second approach as the task is to
classify sound events in the drums taxonomy (ie kick drum snare drum hi-hat)
Also in [] many classification methods are proposed Concretely in the taxonomy
approach machine learning algorithms such as K-Nearest Neighbors Support Vector
Machines or Neural Networks All of them using features extracted from the audio
data as explained in section 211
221 Drums event classification
This section is divided into two parts first presenting the state-of-the-art methods
for drum event classification and then the most relevant existing datasets This
section is mainly based on the article [1] as it is a review of the topic and encompasses
the core concepts of the project
Methods
Focusing on the taxonomic drums events classification this field has been studied for
the last years as in the Music Information Retrieval Evaluation eXchange5 (MIREX)
has been a proposed challenge since 20056 In [1] a review of the main methods
that have been investigated is done The authors collect different approaches such
as Recurrent Neural Networks proposed in [16] Non-Negative matrix factorization
proposed in [17] and others real-time based using MaxMSP7 as described in [18]
5httpswwwmusic-irorgmirexwikiMIREX_HOME6httpswwwmusic-irorgmirexwiki2005Audio_Drum_Detection_Results7httpscycling74comproductsmax
10 Chapter 2 State of the art
It is needed to mention that the proposed methods are focused on Automatic Drum
Transcription (ADT) of drumsets formed only by the kick drum snare drum and
hi-hat ADT field is intended to transcribe audio but in our case we have to check
if an audio event is or not the expected event this particularity can be used in our
favor as some assumptions can be made about the audio that has to be analyzed
Datasets
In addition to the methods and their combinations the data used to train the
system plays a crucial role As a result the dataset may have a big impact on the
generalization capabilities of the models In this section some existing datasets are
described
bull IDMT-SMT-Drums [19] Consists of real drum recordings containing only
kick drum snare drum and hi-hat events Each recording has its transcription
in xml format and is publicly avaliable to download8
bull MDB Drums [20] Consists of real drums recordings of a wide range of genres
drumsets and styles Each recording has two txt transcriptions for the classes
and subclasses defined in [20] (eg class Hi-hat Subclasses Closed hi-hat
open hi-hat pedal hi-hat) It is publicly avaliable to download9
bull ENST-Drums [21] Consists of real drum audio and video recordings of dif-
ferent drummers and drumsets Each recording has its transcription and some
of them include accompaniment audio It is publicly available to download10
bull DREANSS [22] Differently this dataset is a collection of drum recordings
datasets that have been annotated a posteriori It is publicly available to
download11
Electronic drums datasets have not been considered as the student assignment is
supposed to be recorded with a real drumset8httpswwwidmtfraunhoferdeenbusiness_unitsm2dsmtdrumshtml9httpsgithubcomCarlSouthallMDBDrums
10httpspersotelecom-paristechfrgrichardENST-drums11httpswwwupfeduwebmtgdreanss
23 Digital sheet music 11
23 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter
programs from the 80s Proprietary softwares as Finale12 and Sibelius13 or open-
source software as MuseScore14 and LilyPond15 are some options that can be used
nowadays to write music sheets with a computer
In terms of file format Sibelius has its encrypted version that can only be read and
written with the software it can also write and read MusicXML16 files which are
not encrypted and are similar to an HTML file as it contains tags that define the
bars and notes of the music sheet this format is the standard for exchanging digital
music sheet
Within Music Criticrsquos framework the technology used to display the evaluated score
is LilyPond it can be called from the command line and allows adding macros that
change the size or color of the notes The other particularity is that it uses its own
file format (ly) and scores that are in MusicXML format have to be converted and
reviewed
24 Software tools
Many of the concepts and algorithms aforementioned are already developed as soft-
ware libraries this project has been developed with Python and in this section the
libraries that have been used are presented Some of them are open and public and
some others are private as pysimmusic that has been shared with us so we can use
and consult it In addition all the code has been developed using a tool from Google
called Collaboratory17 it allows to write code in a jupyter notebook18 format that
is agile to use and execute interactively
12httpswwwfinalemusiccom13httpswwwavidcomsibelius14httpsmusescoreorg15httpslilypondorg16httpswwwmusicxmlcom17httpscolabresearchgooglecom18httpsjupyterorg
12 Chapter 2 State of the art
241 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis
description and synthesis [23] it can also be installed as a Python-based library
with the pip19 command in Linux or compiling with certain flags in MacOS20 This
library includes a collection of MIR algorithms it is not a framework so it is in the
userrsquos hands how to use these processes Some of the algorithms used in this project
are music feature extraction onset detection and audio file IO
242 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine learning
algorithms for regression classification and clustering as well as pre-processing and
dimensionality reduction functions Based on NumPy22 and SciPy23 so its algorithms
are easy to adapt to the most common data structures used in Python It also allows
to save and load trained models to do inference tasks with new data
243 Lilypond
As described in section 23 LilyPond is an open-source songwriter software with
its file format and language It can produce visual renders of musical sheets in
PNG SVG and PDF formats as well as MIDI files to listen to the compositions
LilyPond works on the command line and allows us to introduce macros to modify
visual aspects of the score such as color or size
It is the digital sheet music technology used within Music Criticrsquos framework as
allows to embed an image in the music sheet generating a parallel representation of
the music sheet and a studentrsquos interpretation
19httpspypiorgprojectpip20httpsessentiaupfeduinstallinghtml21httpsscikit-learnorg22httpsnumpyorg23httpswwwscipyorgscipylibindexhtml
25 Summary 13
244 Pysimmusic
Pysimmusic is a private python library developed at the MTG It offers tools to
analyze the similarity of musical performances and uses libraries such as Essentia
LilyPond FFmpeg24 ia Pysimmusic contains onset detection algorithms and a
collection of audio descriptors and evaluation algorithms By now is the main eval-
uation software used in Music Critic to compare the recording submitted with the
reference
245 Music Critic
Music Critic is a project from the MTG intended to support technologies for online
music education facilitating the assessment of student performances25
The proposed workflow starts with a student submitting a recording playing the
proposed exercise Then the submission is sent to the Music Criticrsquos server where
is analyzed and assessed Finally the student receives the evaluation jointly with
the feedback from the server
25 Summary
Music information retrieval and machine learning have been popular fields of study
This has led to a large development of methods and algorithms that will be crucial
for this project Most of them are free and open-source and fortunately the private
ones have been shared by the UPF research team which is a great base to start the
development
24httpswwwffmpegorg25httpswwwupfeduwebmtgtech-transfer-asset_publisherpYHc0mUhUQ0G
contentid229860881maximizedYJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in section 132 having a well-annotated and balanced dataset is crucial to
get proper results In this section the 40kSamples Drums Dataset creation process is
explained first focusing on how to process existing datasets such as the mentioned
in 221 Secondly introducing the process of creating new datasets with a music
school corpus and a collection of recordings made in a recording studio Finally
describing the data augmentation procedure and how the audio samples are sliced
in individual drums events In Figure 1 we can see the different procedures to unify
the annotations of the different datasets while the audio does not need any specific
modification
31 Existing datasets
Each of the existing datasets has a different annotation format in this section the
process of unifying them will be explained as well as its implementation (see note-
book Dataset_formatUnificationipynb1) As the events to take into account
can be single instruments or combinations of them the annotations have to be for-
matted to show that events properly None of the annotations has this approach
so we have written a function that filters the list and joins the events with a small
1httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_formatUnificationipynb
14
31 Existing datasets 15
difference of time meaning that they are played simultaneously
Music school Studio REC IDMT Drums MDB Drums
audio + txt
Sibelius to MusicXML
MusicXML parser to txt
Write annotations
AnnotationsAudio
Figure 1 Datasets pre-processing
311 MDB Drums
This dataset was the first we worked with the annotation format in txt was a key
factor as it was easy to read and understand As the dataset is available in Github2
there is no need to download it neither process it from a local drive As shown in
the first cells of Dataset_formatUnificationipynb data from the repository can
be retrieved with a Python wrapper of the Github API3
This dataset has two annotations files depending on how deep the taxonomy used
is [20] In this case the generic class taxonomy is used as there is no need to
differentiate styles when playing a given instrument (ie single stroke flam drag
ghost note)
312 IDMT Drums
Differently to the previous dataset this one is only available downloading a zip
file4 It also differs in the annotation file format which is xml Using the Python
2httpsgithubcomCarlSouthallMDBDrums3httpspypiorgprojectgithubpy4httpswwwidmtfraunhoferdeenbusiness_unitsm2dsmtdrumshtml
16 Chapter 3 The 40kSamples Drums Dataset
package xmltodict5 in the second part of Dataset_formatUnificationipynb the
xml files are loaded as a Python dictionary and converted to txt format
32 Created datasets
In order to expand the dataset with more variety of samples other methods to get
data have been explored On one hand with audio data that has partial annotations
or some representation that is not data-driven such as a music sheet that contains
a visual representation of the music but not a logic annotation as mentioned in
the previous section On the other hand generating simple annotations is an easy
task so drums samples can be recorded standalone to create data in a controlled
environment In the next two sections these methods are described
321 Music school
A music school has shared its docent material with the MTG for research purposes
ie audio demos books in pdf format music sheet in Sibelius format As we can
see in Figure 1 the annotations from the music school corpus are in Sibelius format
this is an encrypted representation of the music sheet that can only be opened with
the Sibelius software The MTG has shared an AVID license which includes the
Sibelius software so we were able to convert the sib files to musicxml MusicXML
is not encrypted and allows to open it and read so a parser has been developed to
convert the MusicXML files to a symbolic representation of the music sheet This
representation has been inspired by [24] which proposes a system to represent chords
MusicXML parser
As mentioned in section 23 MusicXML format is based on ordering the visual
information with tags creating a tree structure of nested dictionaries In the first cell
of XML_parseripynb6 two functions are defined ConvertXML2Annotation reads
the musicxml file and gets the general information of the song (ie tempo time
5httpspypiorgprojectxmltodict6httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterXML_parseripynb
32 Created datasets 17
measure title) then a for loops throughout all the bars of the music sheet checking
whereas the given bar is self-defined the repetition of the previous one or the begin
or end of a repetition in the song (see Figure 2) in the self-defined bar case the bar
indeed is passed to an auxiliar function which parses it getting the aforementioned
symbolic representation
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three
others have a symbol that means rsquorepetition of the previous barrsquo moreover the
bar lines at the beginning and the end represents that these four bars have to be
repeated therefore this line in the music score represents an interpretation of eight
bars repeating the first one
The symbolic representation that we propose is based in [24] defines each bar with
a string this string contains the representations of the events in the bar separated
with blank spaces Each of the events has two dots () to separate the figure (ie
quarter note half note whole note) from the note or notes of the event which
are separated by a dot () For instance the symbolic representation of the first bar
in Figure 2 is F4A44 F4A44 F4A44 F4A44
In addition to this conversion in parse_one_measure function from XML_parser
notebook each measure is checked to ensure that fully represents the bar This
means that the sum of the figures of the bar has to be equal to the defined in the
time measure the sum of the events in a 44 bar has to be equal to four quarter
notes
Symbolic notation to unified annotation format
As we can see in Figure 1 once the music scores are converted to the symbolic
representation the last step is to unify the annotations with the used in sections 31
18 Chapter 3 The 40kSamples Drums Dataset
This process is made in the last cells of Dataset_formatUnification7 notebook
A dictionary with the translation of the notes to drums instrument is defined so
the note is already converted Differently the timestamp of each event has to be
computed based on the tempo of the song and the figure of each event this process
is made with the function get_time_steps_from_annotations8 which reads the
interpretation in symbolic notation and accumulates the duration of each event
based on the figure and the tempo
322 Studio recordings
At this point of the dataset creation we realized that the already existing data
was so unbalanced in terms of instances per class some classes had around two
thousand samples while others had only ten This situation was the reason to
record a personalized dataset to balance the overall distribution of classes as well
as exercises with different accuracy when reading simulating students with different
skill levels
The recording process took place on April 16 and 17 at Stereodosis Estudio^9 (Sants,
Barcelona); the first day was devoted to mounting the drumset and the microphones,
which are listed in Table 2. The microphone setup is shown in Figure 3: differently
from the standard setup, in which each instrument of the set has its own microphone,
this distribution of the microphones was intended to record the whole drumset with
different frequency responses.

The recording process was divided into two phases: first, creating samples to
balance the dataset used to train the drums event classifier (called train set);
then, recording the students' assignment simulations to test the whole system
(called test set).
^7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
^8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
^9 https://www.stereodosis.com
Microphone            Transducer principle
Beyerdynamic TG D70   Dynamic
Shure PG52            Dynamic
Shure SM57            Dynamic
Sennheiser e945       Dynamic
AKG C314              Condenser
AKG C414              Condenser
Shure PG81            Condenser
Samson C03            Condenser

Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only those that
appear in the music school subset; this decision was motivated by the idea of
assessing the songs from the books, so only the classes of that collection of songs
were needed to train the classifier. In Figure 4 the distribution of the selected
classes before the recordings is shown; note that the scale is logarithmic, so there
is a large difference among classes.
Figure 4 Number of samples before Train set recording
To organize the recording process we designed three different routines; depending
on the class and the number of samples already existing, a different routine was
recorded. These routines were designed to represent the different speeds, dynamics
and interactions between instruments of a real interpretation. In Appendix A the
routine scores are shown; to write a generic routine a two-line stave is used, where
the bottom line represents the class to be recorded and the top line an auxiliary
one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound
lasts a long time and whose tail mixes with the subsequent sound events:
• Routine 1 (Fig. 31): intended for the classes that do not include a crash or
ride cymbal and have a small number of samples (i.e. <500).
• Routine 2 (Fig. 32): does not include auxiliary events, as it is intended for
classes that include a crash or ride cymbal, whose interaction with itself is
intrinsic.
• Routine 3 (Fig. 33): a short version of routine 1 which repeats each bar only
two times instead of four; it is intended for classes that do not include a crash
or ride cymbal and have a large number of samples (i.e. >500).
Routines 1 and 3 were recorded only once, as we had only one instrument for each of
those classes; routine 2, instead, was recorded twice for each cymbal, as we were
able to use several instances of them. The different cymbal configurations used can
be seen in Appendix A, in Figures 34, 35 and 36.

After the Train set recording the number of samples was considerably more balanced;
as shown in Figure 5, all the classes have at least 1500 samples.
[Bar chart comparing, for each class (ht+kd, kd+mt, ht, mt, ft+sd, ft+kd+sd, cr+sd,
ft, cr+kd, cr, ft+kd, hh+kd+sd, kd+sd, cy+sd, cy, cy+kd, sd, kd, hh+sd, hh+kd, hh),
the samples recorded before with those added by the recording sessions; y-axis from
0 to 3000]
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song
on the same drumset; to do that, we recorded each song of the music school Drums
Grade Initial and Grade 1, playing it correctly and then making mistakes in both
reading and rhythm. After testing with these recordings we realized that we were
not able to test the limits of the assessment system in terms of tempo or with
different time signatures. So we proposed two groove-reading exercises, in 4/4 and
in 12/8, to be performed at different tempos; these recordings were made in my
study room with my laptop's microphone.
3.3 Data augmentation
As described in section 2.1.2, data augmentation aims to introduce changes to the
signals to improve the statistical representation of the dataset. To implement this
task the aforementioned Python library audiomentations is used.

The audiomentations library has a class called Compose which allows chaining
different processing functions, assigning a probability to each of them. The Compose
instance can then be called several times with the same audio file, and each time
the resulting audio will be processed differently because of the probabilities. In
data_augmentation.ipynb^10 a possible implementation is shown, as well as some
plots of the original sample with different results of applying the created Compose
to the same sample; an example of the results can be listened to in Freesound^11.

The processing functions introduced in the Compose are based on those proposed in
[13] and [14]; their parameters are described below, and a code sketch follows the
footnotes:
• Add Gaussian noise, with 70% probability
• Time stretch between 0.8 and 1.25, with 50% probability
• Time shift forward a maximum of 25% of the duration, with 50% probability
• Pitch shift of ±2 semitones, with 50% probability
• Apply mp3 compression, with 50% probability
^10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
^11 https://freesound.org/people/MaciaAC/packs/32213
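A minimal sketch of such a Compose with the probabilities listed above; the noise
amplitudes and file names are assumptions, and audiomentations parameter names may
differ between library versions.

```python
import soundfile as sf
from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                             PitchShift, Shift, TimeStretch)

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.7),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    Shift(min_fraction=0.0, max_fraction=0.25, p=0.5),   # forward shift only
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    Mp3Compression(p=0.5),
])

samples, sr = sf.read("kd_sample.wav", dtype="float32")
for i in range(5):   # each call draws new random parameters
    sf.write(f"kd_sample_aug{i}.wav", augment(samples=samples, sample_rate=sr), sr)
```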
3.4 Drums events trim
As will be explained in section 4.2.1, the dataset has to be trimmed into individual
files in order to analyze them and extract the low-level descriptors. In the
Dataset_featureExtraction.ipynb^12 notebook this process has been implemented,
slicing all the audios with their annotations, each dataset separately, to
sight-check all the resulting samples and better detect which annotations were not
correct.
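As a sketch of what this slicing looks like, assuming annotations given as (onset
in seconds, class label) pairs and a fixed slice length; the helper name, the file
names and the 0.25 s duration are assumptions, not the notebook's actual values.

```python
import os
import soundfile as sf

def slice_events(audio_path, annotations, out_dir, dur=0.25):
    """Write one short wav per annotated event, named after its class label."""
    os.makedirs(out_dir, exist_ok=True)
    y, sr = sf.read(audio_path)
    for i, (onset, label) in enumerate(annotations):
        start = int(onset * sr)
        sf.write(f"{out_dir}/{label}_{i}.wav", y[start:start + int(dur * sr)], sr)

slice_events("take_01.wav", [(0.0, "hh+kd"), (0.5, "sd")], "slices")
```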
3.5 Summary
To summarize, a drums samples dataset has been created; the one used in this project
will be called the 40k Samples Drums Dataset. Nonetheless, to share this dataset we
have to ensure that we fully own the data, which means that the samples that come
from the IDMT, MDB Drums and Music School datasets cannot be redistributed in
another dataset. Alternatively, we will share the 29k Samples Drums Dataset, formed
only by the samples recorded in the studio. This dataset will be available in
Zenodo^13, to download the whole dataset at once, and in Freesound some selected
samples are uploaded in a pack^14 to show the differences among microphones.
^12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
^13 https://zenodo.org/record/4958592#.YMmNXW4p5TZ
^14 https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment
pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is
inspired by [2]. Each box of the diagram refers to a section in this chapter, so the
diagram might be helpful to keep a general idea of the problem in mind while each
process is explained.

The system is divided into two main processes. First, the top boxes correspond to
the training process of the model, using the dataset created in the previous
chapter. Secondly, the bottom row shows how a student submission is processed to
generate some feedback. This feedback is the output of the system and should give
the student some indications of how they have performed and how they can improve.
4.1 Problem definition
To check whether a student reads a music sheet correctly, we need a tool that tags
which instruments of the drumset are playing for each detected event. This leads us
to develop and train a drums event classifier: if this tool ensures a good accuracy
when classifying (i.e. above 95%), we will be able to properly assess a student's
recording. If the classifier is not accurate enough, the system will not be useful,
as we will not be able to differentiate between errors from the student and errors
from the classifier.
[Pipeline diagram: in the top row, music scores, students' performances and their
assessments form a dataset of annotations and audio recordings that feeds feature
extraction, drums event classifier training and performance assessment training;
in the bottom row, a new student's recording goes through feature extraction and
performance assessment inference to produce a visualization and performance
feedback]
Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason the project has been mainly focused on developing the aforementioned
drums event classifier and a proper dataset. Consequently, building a properly
assessed dataset of drums interpretations has not been possible, nor has the
performance assessment training. Despite this, the feedback visualization has been
developed, as it is a nice way to close the pipeline and get some understandable
results; moreover, the performance feedback can be focused on deterministic aspects,
such as telling the student whether they are rushing or dragging in relation to a
given tempo.
4.2 Drums event classifier
As already mentioned, this part has been the main workload of the project, since a
reliable assessment depends on a correct automatic transcription. The process has
been divided into three main parts: extracting the musical features, training and
validating the model in an iterative process, and finally testing the model with
totally new data.
4.2.1 Feature extraction
The feature extraction concept has been explained in section 2.1.1 and has been
implemented using the MusicExtractor()^1 algorithm from the Essentia library.
MusicExtractor() has to be called passing as parameters the window and hop sizes
that will be used to perform the analysis, as well as the filename of the event to
be analyzed. The function extract_MusicalFeatures()^2 has been implemented to loop
over a list of files and analyze each of them, adding the extracted features to a
csv file jointly with the class of each drum event. At this point all the low-level
features were extracted; both mean and standard deviation were computed across all
the frames of the given audio file. The reason was that we wanted to check which
features were redundant or meaningful when training the classifier.
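A minimal sketch of this loop, using Essentia's MusicExtractor; the csv layout and
the restriction to scalar low-level statistics are simplifications of the
repository's extract_MusicalFeatures(), and the file names are assumptions.

```python
import csv
import essentia.standard as es

def extract_features_to_csv(filenames, labels, out_csv):
    """Analyze each sliced event file and append its low-level stats plus label."""
    extractor = es.MusicExtractor(lowlevelStats=["mean", "stdev"],
                                  lowlevelFrameSize=2048, lowlevelHopSize=1024)
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        for filename, label in zip(filenames, labels):
            features, _ = extractor(filename)   # a Pool keyed by descriptor name
            keys = sorted(k for k in features.descriptorNames()
                          if k.startswith("lowlevel")
                          and isinstance(features[k], float))
            writer.writerow([features[k] for k in keys] + [label])

extract_features_to_csv(["slices/sd_0.wav"], ["sd"], "features_subset.csv")
```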
As mentioned in section 3.4, the fact that MusicExtractor() has to be called with a
filename, not an audio stream, forced us to create another version of the dataset in
which each annotated event is stored in a separate audio file with the corresponding
class label in its filename. Once all the datasets were properly sliced and
sight-checked, the last cell of the notebook was executed with the corresponding
folder names (which contain all the sliced samples) and the features were saved in
different csv files, one for each dataset^3. Adding up the number of instances in
all the csv files, we get 40228 instances with 84 features and 1 label.

^1 https://essentia.upf.edu/reference/std_MusicExtractor.html
^2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
^3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices/features
4.2.2 Training and validating
As mentioned in section 2.2, some authors have proposed machine learning algorithms
such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) for sound event
classification, and other authors have developed more complex methods specifically
for drums event classification. The complexity of these last methods made me choose
the generic ones, also to test whether they were a good way to approach the problem,
as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has been
the main reference when designing the 40k Samples Drums Dataset. The first times we
tried the models we were working with the class distribution of Figure 4; as
commented, this was a very unbalanced dataset, and we were evaluating the
classification inference with the accuracy of formula 4.1, which does not take the
unbalance of the dataset into account. The computed accuracy was around 92%, but
the correct predictions were mainly in the large classes; as shown in Figure 7, some
classes had very low accuracy (even 0%, as some classes have 10 samples, 7 used to
train and 3 to validate, all of them badly predicted), but classes with few
instances affect the accuracy computation less.
accuracy(y, \hat{y}) = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} 1(\hat{y}_i = y_i)    (4.1)
Instead, the proper way to compute the accuracy on this kind of dataset is the
balanced accuracy: it computes the accuracy for each class and then averages it
across all the classes, as in formula 4.2, where w_i represents the weight of each
class in the dataset. This computation lowered the result to 79%, which was not a
good result.
\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j}    (4.2)

balanced\text{-}accuracy(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i
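Both metrics of formulas 4.1 and 4.2 are available in scikit-learn. A toy example
with made-up labels shows how plain accuracy hides an unbalanced failure while the
balanced accuracy exposes it:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 90 kick-drum events and 10 crash events; the model predicts 'kd' everywhere
y_true = ["kd"] * 90 + ["cr"] * 10
y_pred = ["kd"] * 100

print(accuracy_score(y_true, y_pred))           # 0.9  -- looks fine
print(balanced_accuracy_score(y_true, y_pred))  # 0.5  -- per-class average
```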
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used indicator for classification models is the F-score, which
combines the precision and the recall of the model in one measure, as in formula
4.3. Precision is computed as the number of correct predictions divided by the total
number of predictions, and recall is the number of correct predictions divided by
the total number of predictions that should be correct for a given class.
F\text{-}measure = \frac{2 \times precision \times recall}{precision + recall}    (4.3)
These results led us to record a personalized dataset to extend the already existing
one (see section 3.2.2). With this new distribution the results improved, as shown
in Figure 8, with better balanced accuracy and F-score (both 89%). Up to this point
we were using both KNN and SVM models to compare results, and the SVM always
performed at least 10% better, so we decided to focus on the SVM and its
hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter of a support vector machine refers to the regularization; this
technique is intended to make a model less sensitive to data noise and to outliers
that may not represent the class properly. When increasing this value to 10 the
results improved across all the classes, as shown in Figure 9, as well as the
balanced accuracy and F-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare
drum class was a problem, as it is one of the most used instruments in the drumset,
jointly with the hi-hat and the kick drum. So I tried the same process with the
classes that involve only the three mentioned instruments (i.e. hh, kd, sd, hh+kd,
hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall
accuracy and F-score to 97.7%, and concretely the sd accuracy to 96%, as shown in
Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10

Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the iterative training and validation process has been
developed in the Classifier_training.ipynb^4 notebook. First, the csv files with the
features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on
which subset of classes will be used, the corresponding instances are filtered, and
to remove redundant features the ones with a very low standard deviation are deleted
(i.e. std. dev. < 0.00001). As the SVM works better when data is normalized, the
standard scaler is used to center all the feature distributions around 0 and ensure
a standard deviation of 1.

In the next cells the dataset is split into train and validation sets, and the
training method of the sklearn SVM is called to perform the training; once the
models are trained, the parameters are dumped to a file so the model can be loaded
a posteriori and the learned knowledge applied to new data. This process was very
slow on my computer, so we decided to upload the csv files to Google Drive and open
the notebook with Google Colaboratory, as it was faster; this is key to avoid long
waiting times during the iterative train-validate process. In the last cells the
inference is made on the validation set and the accuracy is computed, and the
confusion matrix is plotted to get an idea of which classes perform better.
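A condensed sketch of that notebook's flow, assuming a merged features.csv with a
label column; the file and variable names are hypothetical.

```python
import joblib
import pandas as pd
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("features.csv")
X, y = df.drop(columns="label"), df["label"]
X = X.loc[:, X.std() > 1e-5]                  # drop near-constant features

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y)
scaler = StandardScaler().fit(X_tr)           # zero mean, unit variance
model = SVC(C=10).fit(scaler.transform(X_tr), y_tr)

print(balanced_accuracy_score(y_va, model.predict(scaler.transform(X_va))))
joblib.dump((scaler, model), "svm_drums.joblib")   # reload later for inference
```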
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now all the
slices have been created using the annotations, but to assess a new submission from
a student we need to detect the onsets and then slice the events. The function
SliceDrums_BeatDetection^5 does both tasks. As explained in section 2.1.1, there are
many methods for onset detection, and each of them is better suited to a different
application. In the case of drums we tested the 'complex' method, which finds
changes in the frequency domain in terms of energy and phase and works pretty well;
however, when the tempo increases some onsets are not correctly detected, so we
finally implemented the onset detection with the HFC method.

^4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
^5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
This method computes the HFC of each window as in equation 4.4; note that
high-frequency bins (index k) weigh more in the final value of the HFC.
HFC(n) = \sum_k |X_k[n]|^2 \cdot k    (4.4)
Moreover, the function plots the audio waveform jointly with the detected onsets, to
check after each test that the detection has worked correctly. In Figures 11 and 12
we can see two examples of the same music sheet played at 60 and 220 bpm; in both
cases all the onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
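A minimal sketch of HFC-based onset detection with Essentia's standard algorithms,
following the usual frame-by-frame recipe; the input filename is an assumption.

```python
import essentia
import essentia.standard as es

audio = es.MonoLoader(filename="submission.wav")()   # mono, 44100 Hz by default

onset_func = es.OnsetDetection(method="hfc")         # per-frame HFC novelty
window, fft, c2p = es.Windowing(type="hann"), es.FFT(), es.CartesianToPolar()

pool = essentia.Pool()
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    magnitude, phase = c2p(fft(window(frame)))
    pool.add("odf.hfc", onset_func(magnitude, phase))

onset_times = es.Onsets()(essentia.array([pool["odf.hfc"]]), [1])
print(onset_times)                                   # onset times in seconds
```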
With the onset information the audio can be trimmed into the different events; the
order is encoded in the name of each file, so when comparing with the expected
events they can be mapped easily. The audio slices are passed to the
extract_MusicalFeatures() function, which saves the musical features of each slice
in a csv.
To predict which event each slice contains, the already trained models are loaded in
this new environment and the data is pre-processed using the same pipeline as in
training. After that, the data is passed to the classifier method predict(), which
returns the predicted event for each row of the data. The described process is
implemented in the first part of Assessment.ipynb^6; the second part executes the
visualization functions described in the next section.
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving the
student visual feedback on the interpretation. As the drums classifier took so much
time, the creation of a dataset of interpretations with their grades has not been
feasible. A first approximation was to record different interpretations of the same
music sheet simulating different skill levels, but grading them and doing the whole
process by ourselves was not easy; apart from that, we tended to play the fragments
either well or badly, and it was difficult to simulate intermediate levels and be
consistent with the proposed grades.

So the implemented solution generates an image that shows the student whether the
notes of the music sheet are correctly read and whether the onsets are aligned with
the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to
be returned. Taking as a base implementation the solution of my colleague Eduard
Vergés^7, and thanks to the help of Vsevolod Eremenko^8, the visualization is done
in the last cell of the notebook Assessment.ipynb.

First, the LilyPond file paths are defined. Then, for each of the submissions, the
audio is loaded to generate the waveform plot.

^6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
^7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
^8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()^9 is called, passing the lists of detected
and expected onsets, the waveform, and the start and end of the waveform (these come
from the LilyPond file's macro). To plot the deviations properly, the code assumes
that the interpretation starts four beats after the beginning of the audio.

In Figures 13 and 14 the result of save_bar_plot() for two different submissions is
shown: the black lines at the bottom of the waveform are the detected onsets, while
the cyan lines in the middle are the expected onsets; when the difference between
the two values increases, the area between them is colored with a traffic-light code
(green good, red bad).
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
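A hedged matplotlib sketch of such a plot; the function name, the 50 ms tolerance
and the exact colors are assumptions, and the repository's save_bar_plot()
additionally handles the LilyPond macros.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_onset_deviation(y, sr, detected, expected, tol=0.05):
    """Waveform with detected/expected onsets, deviation in traffic-light colors."""
    t = np.arange(len(y)) / sr
    fig, ax = plt.subplots(figsize=(12, 3))
    ax.plot(t, y, color="0.7")
    for d, e in zip(detected, expected):
        dev = min(abs(d - e) / tol, 1.0)                  # 0 on time, 1 badly off
        ax.axvline(d, ymax=0.15, color="black")           # detected onset (bottom)
        ax.axvline(e, ymin=0.45, ymax=0.55, color="cyan") # expected onset (middle)
        ax.axvspan(min(d, e), max(d, e), color=(dev, 1 - dev, 0), alpha=0.4)
    fig.savefig("bar_plot.png")
```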
Once the waveform is created, it is embedded in a lambda function that is called
from the LilyPond render. But before calling LilyPond to render, the assessment of
the notes has to be done. The function assess_notes()^10 receives the expected and
predicted events; comparing them, a list is created with 1 at the indices where they
match and 0 where they do not. The resulting list is then iterated and the 0 indices
are checked, because most of the classification errors fail in only one of the
instruments to be predicted (i.e. instead of hh+sd it predicts sd). These cases are
considered partially correct, as the system has to take its own errors into account:
at the indices in which one of the instruments is correctly predicted and it is not
a hi-hat (we consider it more important to get the snare and kick reading right than
a hi-hat, which is present in all the events), the value is turned to 0.75 (light
green in the color scale). In Figure 15 the different feedback options are shown:
green notes mean correct, light green partially correct and red incorrect.

^9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
^10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
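The scoring just described can be sketched as follows, assuming events are encoded
as strings such as 'hh+sd'; this is an illustration, not the repository's
assess_notes().

```python
def assess_notes(expected, predicted):
    """1.0 exact match, 0.75 if a non-hi-hat instrument is still right, else 0."""
    scores = []
    for exp, pred in zip(expected, predicted):
        exp_set, pred_set = set(exp.split("+")), set(pred.split("+"))
        if exp_set == pred_set:
            scores.append(1.0)
        elif (exp_set & pred_set) - {"hh"}:       # e.g. 'sd' for 'hh+sd'
            scores.append(0.75)
        else:
            scores.append(0.0)
    return scores

print(assess_notes(["hh+sd", "hh+kd", "hh"], ["sd", "hh", "hh"]))  # [0.75, 0.0, 1.0]
```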
With the waveform, the notes assessed and the LilyPond template, the function
score_image()^11 can be called. This function renders the LilyPond template jointly
with the previously created waveform; this is done with the LilyPond macros. On one
hand, before each note on the staff, the keyword color() size() determines that the
color and size of the note depend on an external variable (the assessed notes); on
the other hand, after the first note of the staff, the keyword eps(1150 16)
indicates on which beat the waveform starts to be displayed and on which it ends; in
this case from 0 to 16, which in a 4/4 rhythm is 4 bars, while the other number is
the scale of the waveform, which allows fitting the plot to the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of
the expected events and their timesteps; these are found in the txt file already
mentioned in section 3.1.1. Then the LilyPond file: this is the template, written in
the LilyPond language, that defines the resulting music sheet; the macros to change
color and size and to add the waveform are defined there. When extracting the
musical features, each submission creates its csv file to store the information. And
finally we need, of course, the audio files with the recorded submission to be
assessed.

^11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can
evaluate the results to check whether the system works correctly and is useful for a
student to learn with, and also to test its limits regarding audio signal quality
and tempo. The tests have been done with two different exercises, recorded with a
computer microphone and played at different tempos, starting at 60 bpm and adding
40 bpm until reaching 220 bpm. The recordings with good tempo and good reading have
also been processed adding 6 dB at a time, up to an accumulated +30 dB.

In this chapter and in Appendix B all the resulting feedback visualizations are
shown. The audio files can be listened to in Freesound, where a pack^1 has been
created. Some of them will be commented on and referenced in further sections; the
rest are extra results.
As the high frequency content method works perfectly, there are no limitations or
errors in terms of onset detection: all the tests have an F-measure of 1, detecting
all the expected events without any false positives.

^1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy
drops when the tempo increases. Taking as reference the figures that show a good
reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18,
19, 20, 21 and 22), we can count how many are correct or partially correct to score
each case: a correct prediction weighs 1.0, a partially correct one weighs 0.5 and
an incorrect one 0; the total value is the mean of the weighted results of the
predictions.
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
Figure 19 Good reading and good tempo Ex 1 180 bpm

Figure 20 Good reading and good tempo Ex 1 220 bpm

In Table 3 we can see that increasing the tempo of exercise 1 decreases the accuracy
of the classifier. This may be because increasing the tempo decreases the spacing
between events, and consequently the duration of each event, which leaves fewer
values with which to compute the mean and standard deviation when extracting the
timbre characteristics. As stated in the law of large numbers [25], the larger the
sample, the closer its mean is to the population mean; here, having fewer values in
the computation produces more outliers in the distribution, which tends to scatter.
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48

Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster
than 100 bpm. However, 100 bpm in 12/8 is equivalent to 300 eighth notes per minute,
similar to 140 bpm in 4/4, whose equivalent rate is 280 eighth notes per minute. The
results in 12/8 (Table 4) are also better because there are more 'only hi-hat'
events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875

Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal.
Listening to the submissions, the hi-hat events are recorded with less amplitude
than the snare and kick events; for this reason we think that the classifier starts
to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in
the previous section is applied to Figures 23 and 24. The hi-hat is the last
waveform to saturate, and at this gain level the overall waveform is so clipped that
the resulting high-frequency content is predicted as a hi-hat in all cases.
Level   Correct   Partially OK   Incorrect   Total
+0dB    25        7              0           0.89
+6dB    23        9              0           0.86
+12dB   23        9              0           0.86
+18dB   24        7              1           0.86
+24dB   18        5              9           0.64
+30dB   13        5              14          0.48

Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level   Correct   Partially OK   Incorrect   Total
+0dB    12        7              13          0.48
+6dB    13        10             9           0.56
+12dB   10        8              14          0.5
+18dB   9         2              21          0.31
+24dB   8         0              24          0.25
+30dB   9         0              23          0.28

Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of results has been focused on the accuracy of the drums
event classifier, but we think it is also important to evaluate whether the system
can properly assess a student's submission.

As shown in Figures 25 and 26, if the student does not play the first beat, or some
of the beats are not read, the system can still map the rest of the events to the
expected ones at the corresponding onset time steps. This is due to a check done in
the assessment, which assumes that before the first beat there is a count-in of one
bar and that the rest of the beats must come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
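A sketch of that mapping, assuming a one-bar count-in and nearest-neighbour pairing;
the actual check in the repository may differ.

```python
import numpy as np

def map_onsets(detected, expected, tempo_bpm, beats_per_bar=4):
    """Skip the count-in bar, then pair each expected onset with the nearest
    detected one; when nothing was played, the expected onset maps to None."""
    count_in = beats_per_bar * 60.0 / tempo_bpm
    played = np.asarray([d for d in detected if d >= count_in]) - count_in
    pairs = []
    for e in expected:
        if played.size == 0:
            pairs.append((e, None))
        else:
            pairs.append((e, float(played[np.argmin(np.abs(played - e))])))
    return pairs

print(map_onsets([0.1, 1.05, 2.1, 4.02, 5.1], [0.0, 1.0], tempo_bpm=60))
```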
To evaluate the assessment we proceed as in the previous sections, counting the
number of correct predictions, but now in terms of assessment. The analyzed results
are the 'bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starting at 60 bpm and adding 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized; it works as follows: we count a
correct assessment if the note is green or light green and the event is the one in
the music score, or if the note is red and the event is not the one in the music
score. The rest of the cases are counted as incorrect assessments. The total value
is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system performs
the prediction-based assessment pretty well. This can help a student know which
parts of the music sheet are well read and which are not. Also, the tempo
visualization can help the student recognize whether they are slowing down or
rushing when reading the score: as can be seen in Figure 30, the detected onsets
(black lines in the bottom part of the waveform) are mostly behind the corresponding
expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed. In this chapter a discussion is developed about which objectives have been
accomplished and which have not. A set of further improvements is also given,
together with a final thought on my work and what I have learned. The chapter ends
with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained throughout this document, we can now list
them, assessing their completeness and our contributions.

Firstly, the 29k Samples Drums Dataset has been created and is now publicly
available and downloadable from Freesound and Zenodo. Apart from being used in this
project, this dataset might be useful to other researchers and students in their
projects. The dataset is indeed useful for balancing drums datasets based on real
interpretations, as the class distribution of such interpretations is very
unbalanced, as explained with the IDMT and MDB Drums datasets.
Secondly, a drums event classifier with a machine learning approach has been
proposed and trained with the aforementioned dataset. One of the reasons for using
this approach to predict the events was that there was no literature focused on
classifying drums events in this manner. As the results have shown, more complex
methods based on context might be used, such as the ones proposed in [16] and [17].
It is important to take into account that the task the model is trained to do is
very hard for a human: differentiating drums events in an individual drum sample
without any context is almost impossible, even for a trained ear such as my drums
teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well
as the development of a MusicXML parser. This part took around one month to develop
and, from my point of view, it was a great way to understand how these file formats
work and how they could be improved, as they are mostly focused on the
visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to
demonstrate the functionality of the system, and recordings of these exercises have
been made in a different environment from the 29k Samples Drums Dataset. It would be
good to get recordings in different spaces and with different drumsets and
microphones to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with
different drumsets, tuning each drumset differently, using different sticks to hit
the instruments, and even different people playing. This would introduce more
variance into the drums sample dataset. Moreover, on June 9th 2021 a paper about a
large drums dataset with MIDI data was presented [26] at ICASSP 2021^1. This new
dataset could be included in the training process, as the authors state that having
a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the
robustness of the overall system. It would be appropriate to introduce the
aforementioned methods of [16], [17] and [26] into the ADT part of the pipeline.

^1 https://www.2021.ieeeicassp.org
Also, in terms of drumset classes there is a long path still to cover. There are no
solutions that robustly transcribe a whole set including the toms and different
kinds of cymbals. We think that a proper approach would be to work with professional
musicians, who can help researchers better understand the instrument and create
datasets covering different techniques.
Regarding the assessment step, apart from the feedback visualization of the tempo
deviations and the reading accuracy, a regression model could be trained on assessed
drums exercises to give each student a mark. On this path, introducing an electronic
drumset with MIDI output would make things a lot easier, as the drums classifier
step could be omitted.
About the implementation, a good contribution would be to introduce the models and
algorithms into the Pysimmusic workflow and develop a demo web app like Music
Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences a work is reproducible if the code and data are available
and other researchers or students can execute them, getting the same results.

All the code has been developed in Python, a widely known general-purpose
programming language. It is available in my GitHub repository^2, as well as the data
used to test the system and the classification models.

The data created, i.e. the studio recordings, are available in a Zenodo repository^3
and some samples in Freesound^4. This is the 29k Samples Drums Dataset; not all of
the 40k samples used for training are our property, so we are not able to share them
under our full authorship. Despite this, the other datasets used in this project are
available individually.

^2 https://github.com/MaciAC/tfg_DrumsAssessment
^3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
^4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work
described, the goal of supporting drums learning has been accomplished, although
work remains in terms of robustness and reliability. A first approximation has been
presented, and several paths of improvement have been proposed.

Moreover, several fields of engineering and computer science have been covered, such
as signal processing, music information retrieval and machine learning; not only in
terms of implementation, but also investigating methods and gathering already
existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its
web platform GitHub. At the beginning of the project I wanted to execute everything
on my local computer, having to install and compile libraries that could not be
installed on macOS via the pip command (i.e. Essentia), which was a tough path to
take. In a more advanced phase of the project I realized that the LilyPond tools
could not be installed and used fluently on my local machine, so I moved all the
code to my Google Drive to execute the notebooks on a Colaboratory machine.
Developing code in this environment also has its quirks, which I have had to learn.
In summary, I have spent plenty of time looking for the ideal way to develop the
project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my
Bachelor's degree, as I reviewed some of the concepts of most personal interest.
Being able to relate the project to music and drums helped me keep my motivation and
focus. I am quite satisfied with the feedback visualization that the system
produces, and I hope that more people get interested in this field of research so
that better tools appear in the future.
List of Figures
1 Datasets pre-processing
2 Sample drums score from music school drums grade 1
3 Microphone setup for drums recording
4 Number of samples before Train set recording
5 Number of samples after Train set recording
6 Proposed pipeline for a drums performance assessment system, inspired by [2]
7 Confusion matrix after training with the dataset in Figure 4
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
11 Onsets detected in a 60bpm drums interpretation
12 Onsets detected in a 220bpm drums interpretation
13 Onset deviation plot of a good tempo submission
14 Onset deviation plot of a bad tempo submission
15 Example of coloured notes
16 Good reading and good tempo Ex 1 60 bpm
17 Good reading and good tempo Ex 1 100 bpm
18 Good reading and good tempo Ex 1 140 bpm
19 Good reading and good tempo Ex 1 180 bpm
20 Good reading and good tempo Ex 1 220 bpm
21 Good reading and good tempo Ex 2 60 bpm
22 Good reading and good tempo Ex 2 100 bpm
23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
25 Bad reading and bad tempo Ex 1 100 bpm
26 Bad reading and bad tempo Ex 1 180 bpm
27 Bad reading and good tempo Ex 1, starting at 60 bpm and adding 60 bpm at each new staff
28 Bad reading and good tempo Ex 2 60 bpm
29 Bad reading and good tempo Ex 2 100 bpm
30 Good reading and bad tempo Ex 1 100 bpm
31 Recording routine 1
32 Recording routine 2
33 Recording routine 3
34 Drumset configuration 1
35 Drumset configuration 2
36 Drumset configuration 3
37 Good reading and bad tempo Ex 1 60 bpm
38 Bad reading and bad tempo Ex 1 60 bpm
39 Good reading and bad tempo Ex 1 140 bpm
40 Bad reading and bad tempo Ex 1 140 bpm
41 Good reading and bad tempo Ex 1 180 bpm
42 Good reading and bad tempo Ex 1 220 bpm
43 Bad reading and bad tempo Ex 1 220 bpm
44 Good reading and bad tempo Ex 2 60 bpm
45 Bad reading and bad tempo Ex 2 60 bpm
46 Good reading and bad tempo Ex 2 100 bpm
47 Bad reading and bad tempo Ex 2 100 bpm
List of Tables
1 Abbreviations' legend
2 Microphones used
3 Results of exercise 1 with different tempos
4 Results of exercise 2 with different tempos
5 Results of exercise 1 at 60 bpm with different amplification levels
6 Results of exercise 1 at 220 bpm with different amplification levels
7 Assessment result of a bad reading with different tempos, 4/4 exercise
8 Assessment result of a bad reading with different tempos, 12/8 exercise
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2003).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2001).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings - 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
[Drum score: two-line stave, quarter note = 60]
Figure 31 Recording routine 1
[Drum score: two-line stave, quarter note = 60]
Figure 32 Recording routine 2
[Drum score: two-line stave, quarter note = 60]
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
10 Chapter 2 State of the art
It is needed to mention that the proposed methods are focused on Automatic Drum
Transcription (ADT) of drumsets formed only by the kick drum snare drum and
hi-hat ADT field is intended to transcribe audio but in our case we have to check
if an audio event is or not the expected event this particularity can be used in our
favor as some assumptions can be made about the audio that has to be analyzed
Datasets
In addition to the methods and their combinations the data used to train the
system plays a crucial role As a result the dataset may have a big impact on the
generalization capabilities of the models In this section some existing datasets are
described
bull IDMT-SMT-Drums [19] Consists of real drum recordings containing only
kick drum snare drum and hi-hat events Each recording has its transcription
in xml format and is publicly avaliable to download8
bull MDB Drums [20] Consists of real drums recordings of a wide range of genres
drumsets and styles Each recording has two txt transcriptions for the classes
and subclasses defined in [20] (eg class Hi-hat Subclasses Closed hi-hat
open hi-hat pedal hi-hat) It is publicly avaliable to download9
bull ENST-Drums [21] Consists of real drum audio and video recordings of dif-
ferent drummers and drumsets Each recording has its transcription and some
of them include accompaniment audio It is publicly available to download10
bull DREANSS [22] Differently this dataset is a collection of drum recordings
datasets that have been annotated a posteriori It is publicly available to
download11
Electronic drums datasets have not been considered as the student assignment is
supposed to be recorded with a real drumset8httpswwwidmtfraunhoferdeenbusiness_unitsm2dsmtdrumshtml9httpsgithubcomCarlSouthallMDBDrums
10httpspersotelecom-paristechfrgrichardENST-drums11httpswwwupfeduwebmtgdreanss
23 Digital sheet music 11
23 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter
programs from the 80s Proprietary softwares as Finale12 and Sibelius13 or open-
source software as MuseScore14 and LilyPond15 are some options that can be used
nowadays to write music sheets with a computer
In terms of file format Sibelius has its encrypted version that can only be read and
written with the software it can also write and read MusicXML16 files which are
not encrypted and are similar to an HTML file as it contains tags that define the
bars and notes of the music sheet this format is the standard for exchanging digital
music sheet
Within Music Criticrsquos framework the technology used to display the evaluated score
is LilyPond it can be called from the command line and allows adding macros that
change the size or color of the notes The other particularity is that it uses its own
file format (ly) and scores that are in MusicXML format have to be converted and
reviewed
24 Software tools
Many of the concepts and algorithms aforementioned are already developed as soft-
ware libraries this project has been developed with Python and in this section the
libraries that have been used are presented Some of them are open and public and
some others are private as pysimmusic that has been shared with us so we can use
and consult it In addition all the code has been developed using a tool from Google
called Collaboratory17 it allows to write code in a jupyter notebook18 format that
is agile to use and execute interactively
12httpswwwfinalemusiccom13httpswwwavidcomsibelius14httpsmusescoreorg15httpslilypondorg16httpswwwmusicxmlcom17httpscolabresearchgooglecom18httpsjupyterorg
12 Chapter 2 State of the art
241 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis
description and synthesis [23] it can also be installed as a Python-based library
with the pip19 command in Linux or compiling with certain flags in MacOS20 This
library includes a collection of MIR algorithms it is not a framework so it is in the
userrsquos hands how to use these processes Some of the algorithms used in this project
are music feature extraction onset detection and audio file IO
242 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine learning
algorithms for regression classification and clustering as well as pre-processing and
dimensionality reduction functions Based on NumPy22 and SciPy23 so its algorithms
are easy to adapt to the most common data structures used in Python It also allows
to save and load trained models to do inference tasks with new data
243 Lilypond
As described in section 23 LilyPond is an open-source songwriter software with
its file format and language It can produce visual renders of musical sheets in
PNG SVG and PDF formats as well as MIDI files to listen to the compositions
LilyPond works on the command line and allows us to introduce macros to modify
visual aspects of the score such as color or size
It is the digital sheet music technology used within Music Criticrsquos framework as
allows to embed an image in the music sheet generating a parallel representation of
the music sheet and a studentrsquos interpretation
19httpspypiorgprojectpip20httpsessentiaupfeduinstallinghtml21httpsscikit-learnorg22httpsnumpyorg23httpswwwscipyorgscipylibindexhtml
25 Summary 13
244 Pysimmusic
Pysimmusic is a private python library developed at the MTG It offers tools to
analyze the similarity of musical performances and uses libraries such as Essentia
LilyPond FFmpeg24 ia Pysimmusic contains onset detection algorithms and a
collection of audio descriptors and evaluation algorithms By now is the main eval-
uation software used in Music Critic to compare the recording submitted with the
reference
245 Music Critic
Music Critic is a project from the MTG intended to support technologies for online
music education facilitating the assessment of student performances25
The proposed workflow starts with a student submitting a recording playing the
proposed exercise Then the submission is sent to the Music Criticrsquos server where
is analyzed and assessed Finally the student receives the evaluation jointly with
the feedback from the server
25 Summary
Music information retrieval and machine learning have been popular fields of study
This has led to a large development of methods and algorithms that will be crucial
for this project Most of them are free and open-source and fortunately the private
ones have been shared by the UPF research team which is a great base to start the
development
24httpswwwffmpegorg25httpswwwupfeduwebmtgtech-transfer-asset_publisherpYHc0mUhUQ0G
contentid229860881maximizedYJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in section 132 having a well-annotated and balanced dataset is crucial to
get proper results In this section the 40kSamples Drums Dataset creation process is
explained first focusing on how to process existing datasets such as the mentioned
in 221 Secondly introducing the process of creating new datasets with a music
school corpus and a collection of recordings made in a recording studio Finally
describing the data augmentation procedure and how the audio samples are sliced
in individual drums events In Figure 1 we can see the different procedures to unify
the annotations of the different datasets while the audio does not need any specific
modification
31 Existing datasets
Each of the existing datasets has a different annotation format in this section the
process of unifying them will be explained as well as its implementation (see note-
book Dataset_formatUnificationipynb1) As the events to take into account
can be single instruments or combinations of them the annotations have to be for-
matted to show that events properly None of the annotations has this approach
so we have written a function that filters the list and joins the events with a small
1httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_formatUnificationipynb
14
31 Existing datasets 15
difference of time meaning that they are played simultaneously
Music school Studio REC IDMT Drums MDB Drums
audio + txt
Sibelius to MusicXML
MusicXML parser to txt
Write annotations
AnnotationsAudio
Figure 1 Datasets pre-processing
311 MDB Drums
This dataset was the first we worked with the annotation format in txt was a key
factor as it was easy to read and understand As the dataset is available in Github2
there is no need to download it neither process it from a local drive As shown in
the first cells of Dataset_formatUnificationipynb data from the repository can
be retrieved with a Python wrapper of the Github API3
This dataset has two annotations files depending on how deep the taxonomy used
is [20] In this case the generic class taxonomy is used as there is no need to
differentiate styles when playing a given instrument (ie single stroke flam drag
ghost note)
312 IDMT Drums
Differently to the previous dataset this one is only available downloading a zip
file4 It also differs in the annotation file format which is xml Using the Python
2httpsgithubcomCarlSouthallMDBDrums3httpspypiorgprojectgithubpy4httpswwwidmtfraunhoferdeenbusiness_unitsm2dsmtdrumshtml
16 Chapter 3 The 40kSamples Drums Dataset
package xmltodict5 in the second part of Dataset_formatUnificationipynb the
xml files are loaded as a Python dictionary and converted to txt format
32 Created datasets
In order to expand the dataset with more variety of samples other methods to get
data have been explored On one hand with audio data that has partial annotations
or some representation that is not data-driven such as a music sheet that contains
a visual representation of the music but not a logic annotation as mentioned in
the previous section On the other hand generating simple annotations is an easy
task so drums samples can be recorded standalone to create data in a controlled
environment In the next two sections these methods are described
3.2.1 Music school
A music school has shared its teaching material with the MTG for research purposes, i.e. audio demos, books in pdf format and music sheets in Sibelius format. As we can see in Figure 1, the annotations from the music school corpus are in Sibelius format; this is an encrypted representation of the music sheet that can only be opened with the Sibelius software. The MTG has shared an AVID license which includes the Sibelius software, so we were able to convert the sib files to musicxml. MusicXML is not encrypted and can be opened and read, so a parser has been developed to convert the MusicXML files into a symbolic representation of the music sheet. This representation has been inspired by [24], which proposes a system to represent chords.
MusicXML parser
As mentioned in section 2.3, the MusicXML format is based on ordering the visual information with tags, creating a tree structure of nested dictionaries. In the first cell of XML_parser.ipynb6 two functions are defined. ConvertXML2Annotation reads the musicxml file and gets the general information of the song (i.e. tempo, time signature, title); then a for loop runs through all the bars of the music sheet, checking whether the given bar is self-defined, a repetition of the previous one, or the begin or end of a repetition in the song (see Figure 2). In the self-defined bar case, the bar is passed to an auxiliary function which parses it, producing the aforementioned symbolic representation.
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three others have a symbol that means 'repetition of the previous bar'. Moreover, the bar lines at the beginning and the end indicate that these four bars have to be repeated; therefore this line in the music score represents an interpretation of eight bars, repeating the first one.
The symbolic representation that we propose, based on [24], defines each bar with a string; this string contains the representations of the events in the bar separated with blank spaces. Each of the events has a colon (:) to separate the figure (i.e. quarter note, half note, whole note) from the note or notes of the event, which are separated by a dot (.). For instance, the symbolic representation of the first bar in Figure 2 is F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4.
In addition to this conversion, in the parse_one_measure function from the XML_parser notebook each measure is checked to ensure that it fully represents the bar. This means that the sum of the figures of the bar has to be equal to the duration defined by the time signature; e.g. the sum of the events in a 4/4 bar has to be equal to four quarter notes.
Symbolic notation to unified annotation format
As we can see in Figure 1, once the music scores are converted to the symbolic representation, the last step is to unify the annotations with the format used in section 3.1. This process is made in the last cells of the Dataset_formatUnification7 notebook. A dictionary with the translation of the notes to drums instruments is defined, so the note is directly converted. Differently, the timestamp of each event has to be computed based on the tempo of the song and the figure of each event; this is done with the function get_time_steps_from_annotations8, which reads the interpretation in symbolic notation and accumulates the duration of each event based on the figure and the tempo.
7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
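A minimal sketch of this accumulation logic, assuming the event syntax described above (figure encoded as the note value denominator: 4 = quarter, 8 = eighth) and ignoring rests, dotted figures and repetitions, could be:

```python
# Sketch of get_time_steps_from_annotations: accumulate each event's
# duration from its figure and the tempo. Rests, dotted figures and
# repetitions handled by the real function are omitted here.
def get_time_steps(bars, tempo):
    seconds_per_quarter = 60.0 / tempo
    t, annotations = 0.0, []
    for bar in bars:
        for event in bar.split():
            notes, figure = event.split(':')
            annotations.append((round(t, 3), notes))
            t += seconds_per_quarter * 4.0 / float(figure)
    return annotations

# First bar of Figure 2 at 100 bpm: four quarter-note events, 0.6 s apart
print(get_time_steps(['F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4'], tempo=100))
```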
3.2.2 Studio recordings
At this point of the dataset creation we realized that the already existing data was very unbalanced in terms of instances per class: some classes had around two thousand samples while others had only ten. This situation was the reason to record a personalized dataset, to balance the overall distribution of classes, as well as exercises read with different accuracy, simulating students with different skill levels.
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants, Barcelona). The first day was intended to mount the drumset and the microphones, which are listed in Table 2. In Figure 3 the microphone setup is shown; differently to the standard setup, in which each instrument of the set has its own microphone, this distribution of the microphones was intended to record the whole drumset with different frequency responses.
The recording process was divided into two phases: first, creating samples to balance the dataset used to train the drums event classifier (called train set); then, recording the students' assignment simulation to test the whole system (called test set).
9 https://www.stereodosis.com
Microphone            Transducer principle
Beyerdynamic TG D70   Dynamic
Shure PG52            Dynamic
Shure SM57            Dynamic
Sennheiser e945       Dynamic
AKG C314              Condenser
AKG C414              Condenser
Shure PG81            Condenser
Samson C03            Condenser
Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only the classes that appear in the music school subset. This decision was motivated by the idea of assessing the songs from the books, so only the classes present in the collection of songs were needed to train the classifier. In Figure 4 the distribution of the selected classes before the recordings is shown; note that it is in logarithmic scale, so there is a large difference among classes.
Figure 4 Number of samples before Train set recording
To organize the recording process we designed 3 different routines to record; depending on the class and the number of samples already existing, a different routine was recorded. These routines were designed trying to represent the different speeds, dynamics and interactions between instruments of a real interpretation. In Appendix A the routine scores are shown; to write a generic routine a two-line stave is used, where the bottom line represents the class to be recorded and the top line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound sustains for a long period of time, so its tail mixes with the subsequent sound events.
• Routine 1 (Fig. 31): intended for the classes that do not include a crash or ride cymbal and have a small number of samples (i.e. <500).
• Routine 2 (Fig. 32): does not include auxiliary events, as it is intended for classes that include a crash or ride cymbal, whose interaction with itself is intrinsic.
• Routine 3 (Fig. 33): a short version of routine 1 which only repeats each bar two times instead of four; intended for classes that do not include a crash or ride cymbal and have a large number of samples (i.e. >500).
Routines 1 and 3 were recorded only once, as we had only one instrument for each of those classes; differently, routine 2 was recorded twice for each cymbal, as we were able to use more instances of them. The different cymbal configurations used can be seen in Appendix A, in Figures 34, 35 and 36.
After the Train set recording the number of samples was somewhat more balanced; as shown in Figure 5, all the classes have at least 1500 samples.
[Bar chart: number of samples per class after the Train set recording, comparing the newly recorded counts with those existing before the recording; classes include hh, hh+kd, hh+sd, kd, sd, kd+sd, hh+kd+sd, cy, cy+kd, cy+sd, cr, cr+kd, cr+sd, ft, ft+kd, ft+sd, ft+kd+sd, mt, ht, ht+kd and kd+mt]
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song on the same drumset. To do that, we recorded each song of the music school Drums Grade Initial and Grade 1, playing it correctly and then making mistakes in both reading and rhythm. After testing with these recordings we realized that we were not able to test the limits of the assessment system in terms of tempo or with different time signatures, so we proposed two groove reading exercises, in 4/4 and in 12/8, to be performed at different tempos; these recordings have been done in my study room with my laptop's microphone.
3.3 Data augmentation
As described in section 2.1.2, data augmentation aims to introduce changes to the signals to optimize the statistical representation of the dataset. To implement this task the aforementioned Python library audiomentations is used.
The library audiomentations has a class called Compose which allows collecting different processing functions, assigning a probability to each of them. The Compose instance can then be called several times with the same audio file, and each time the resulting audio will be processed differently because of the probabilities. In data_augmentation.ipynb10 a possible implementation is shown, as well as some plots of the original sample with different results of applying the created Compose to the same sample; an example of the results can be listened to in Freesound11.
The processing functions introduced in the Compose class are based on those proposed in [13] and [14]; their parameters are described below, and a sketch of such a Compose follows the list:
• Add gaussian noise with 70% probability.
• Time stretch between 0.8 and 1.25 with 50% probability.
• Time shift forward a maximum of 25% of the duration with 50% probability.
• Pitch shift of ±2 semitones with 50% probability.
• Apply mp3 compression with 50% probability.
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213
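The sketch below assembles a Compose close to the one described above; the noise amplitudes are assumptions, and the Shift argument names correspond to the 2021-era audiomentations API (they were renamed in later releases):

```python
import soundfile as sf
from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                             PitchShift, Shift, TimeStretch)

# Probabilities follow the list above; amplitude bounds are assumptions.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.7),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    Shift(min_fraction=0.0, max_fraction=0.25, rollover=False, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    Mp3Compression(p=0.5),
])

samples, sr = sf.read('kd_sample.wav', dtype='float32')  # hypothetical mono file
for i in range(5):  # every call draws new random transforms
    sf.write(f'kd_sample_aug{i}.wav', augment(samples=samples, sample_rate=sr), sr)
```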
3.4 Drums events trim
As will be explained in section 4.2.1, the dataset has to be trimmed into individual files in order to analyze them and extract the low-level descriptors. In the Dataset_featureExtraction.ipynb12 notebook this process has been implemented, slicing all the audios with their annotations, each dataset separately, to sight-check all the resultant samples and better detect which annotations were not correct.
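A minimal sketch of this slicing step, with a hypothetical 100 ms window length and file paths, could be:

```python
import soundfile as sf

# Cut a window after each annotated onset and save it as
# "<label>_<index>.wav", so the class label survives in the filename.
# The 100 ms window length is an assumption for illustration.
def slice_events(audio_path, annotations, out_dir, length_s=0.1):
    audio, sr = sf.read(audio_path)
    for i, (onset, label) in enumerate(annotations):
        start = int(onset * sr)
        sf.write(f"{out_dir}/{label}_{i}.wav",
                 audio[start:start + int(length_s * sr)], sr)

slice_events('take_01.wav', [(0.5, 'hh+kd'), (1.0, 'sd')], 'slices')  # hypothetical
```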
3.5 Summary
To summarize, a drums samples dataset has been created; the one used in this project will be called the 40kSamples Drums Dataset. Nonetheless, to share this dataset we have to ensure that we fully own the data, which means that the samples that come from the IDMT, MDB Drums and Music school datasets cannot be shared in another dataset. Alternatively, we will share the 29kSamples Drums Dataset, formed only by the samples recorded in the studio. This dataset will be available in Zenodo13, to download the whole dataset at once, and in Freesound, where some selected samples are uploaded in a pack14 to show the differences among microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592
14 https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is inspired by [2]. Each box of the diagram refers to a section in this chapter, so the diagram might be helpful to get a general idea of the problem when each process is explained.
The system is divided into two main processes. First, the top boxes correspond to the training process of the model, using the dataset created in the previous chapter. Secondly, the bottom row shows how a student submission is processed to generate some feedback. This feedback is the output of the system and should give the student some indications on how they have performed and how they can improve.
4.1 Problem definition
To check if a student reads a music sheet correctly, we need some tool to tag which instruments of the drumset are playing in each detected event. This leads us to develop and train a drums events classifier; if this tool ensures a good accuracy when classifying (i.e. >95%), we will be able to properly assess a student's recording. If the classifier does not have enough accuracy the system will not be useful, as we will not be able to differentiate between errors from the student and errors from the classifier.
[Diagram: in the training path, assessments, music scores and students' performances provide annotations and audio recordings that form the dataset, which goes through feature extraction into drums event classifier training and performance assessment training; in the inference path, a new student's recording goes through feature extraction and performance assessment inference, producing a visualization and performance feedback]
Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason the project has been mainly focused on developing the aforementioned drums event classifier and a proper dataset. Consequently, developing a properly assessed dataset of drums interpretations has not been possible, nor has the performance assessment training. Despite this, the feedback visualization has been developed, as it is a nice way to close the pipeline and get some understandable results; moreover, the performance feedback can be focused on deterministic aspects, such as telling the student if they are rushing or slowing down in relation to a given tempo.
4.2 Drums event classifier
As already mentioned, this section has been the main workload of this project, because a reliable assessment depends on a correct automatic transcription. The process has been divided into 3 main parts: extracting the musical features, training and validating the model in an iterative process, and finally testing the model with totally new data.
4.2.1 Feature extraction
The feature extraction concept has been explained in section 2.1.1 and has been implemented using the MusicExtractor()1 method from the Essentia library.
The MusicExtractor() method has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()2 has been implemented to loop over a list of files and analyze each of them, adding the extracted features to a csv file jointly with the class of each drum event. At this point all the low-level features were extracted; both the mean and the standard deviation were computed across all the frames of the given audio filename. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.
As mentioned in section 3.4, the fact that the MusicExtractor() method has to be called with a filename, not an audio stream, forced us to create another version of the dataset, which had each event annotated in a different audio file with the correspondent class label as filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples) and the features were saved in different csv files, one for each dataset3. Adding the number of instances in all the csv files, we get 40228 instances with 84 features and 1 label.
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices/features
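A simplified sketch of extract_MusicalFeatures() could look as follows; the frame and hop sizes, the file paths and the filename convention are assumptions for illustration:

```python
import csv
import essentia.standard as es

# Run MusicExtractor on every sliced event and append its scalar
# low-level statistics plus the class label (read from the filename)
# to a csv. Frame/hop sizes and paths are assumptions.
extractor = es.MusicExtractor(lowlevelFrameSize=2048, lowlevelHopSize=1024,
                              lowlevelStats=['mean', 'stdev'])

def extract_musical_features(filenames, csv_path):
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        for name in filenames:
            features, _ = extractor(name)
            scalars = sorted(d for d in features.descriptorNames()
                             if d.startswith('lowlevel')
                             and isinstance(features[d], float))
            label = name.rsplit('/', 1)[-1].split('_')[0]  # e.g. 'hh+kd'
            writer.writerow([features[d] for d in scalars] + [label])

extract_musical_features(['slices/hh+kd_0.wav'], 'features.csv')
```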
4.2.2 Training and validating
As mentioned in section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) to do sound event classification; other authors have developed more complex methods for drums event classification. The complexity of these last methods made me choose the generic ones, also to test whether they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40kSamples Drums Dataset. The first times we tried the models we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula (4.1), which does not take into account the unbalance in the dataset. The accuracy computation was around 92%, but the correct predictions were mainly on the large classes; as shown in Figure 7, some classes had very low accuracy (even 0%, as some classes have 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances affects the accuracy computation less.
$$\text{accuracy}(y,\hat{y}) = \frac{1}{n_{\text{samples}}}\sum_{i=0}^{n_{\text{samples}}-1} 1(\hat{y}_i = y_i) \qquad (4.1)$$
Instead, the proper way to compute the accuracy on this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages the accuracy over all the classes, as in formula (4.2), where $w_i$ represents the weight of each class in the dataset. This computation lowered the result to 79%, which was not a good result.
$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \qquad (4.2)$$
$$\text{balanced-accuracy}(y,\hat{y},w) = \frac{1}{\sum_i \hat{w}_i}\sum_i 1(\hat{y}_i = y_i)\,\hat{w}_i$$
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score, which combines the precision and the recall of the model in one measure, as in formula (4.3). Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of instances that should have been predicted as a given class.
$$F\text{-measure} = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (4.3)$$
These results led us to the process of recording a personalized dataset to extend the already existing one (see section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as did the balanced accuracy and f-score (both 89%). Until this point we were using both KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine controls the regularization, whose strength is inversely proportional to C; this technique is intended to make a model less sensitive to data noise and to outliers that may not represent the class properly. When increasing this value to 10, the results improved among all the classes, as shown in Figure 9, as did the accuracy and f-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was somehow a problem, as it is one of the most used instruments of the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with only the classes that involve the three mentioned instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and f-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the training and validating iterative process has been developed in the Classifier_training.ipynb4 notebook. First, the csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered, and to remove redundant features the ones with a very low standard deviation are deleted (i.e. std < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to center all the data distributions around 0, ensuring a standard deviation of 1.
In the next cells the dataset is split into train and validation sets and the training method from the SVM of sklearn is called to perform the training; when the models are trained, the parameters are dumped into a file to load the model a posteriori and be able to apply the learned knowledge to new data. This process was very slow on my computer, so we decided to upload the csv files to Google Drive and open the notebook with Google Colaboratory, as it was faster; this is a key feature to avoid long waiting times during the iterative train-validate process. In the last cells the inference is made with the validation set, the accuracy is computed, and the confusion matrix is plotted to get an idea of which classes are performing better.
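A condensed sketch of this train-validate loop, with assumed column names and file paths, could be:

```python
import pandas as pd
from joblib import dump
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Column names ('label') and csv paths are assumptions for illustration.
data = pd.read_csv('features.csv')
X, y = data.drop(columns='label'), data['label']
X = X.loc[:, X.std() > 1e-5]            # drop near-constant (redundant) features

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

scaler = StandardScaler().fit(X_train)   # zero mean, unit standard deviation
clf = SVC(C=10).fit(scaler.transform(X_train), y_train)

print(balanced_accuracy_score(y_val, clf.predict(scaler.transform(X_val))))
dump((scaler, clf), 'svm_drums.joblib')  # reload later for inference on new data
```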
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection5 does both tasks. As explained in section 2.1.1, there are many methods to do onset detection and each of them is better for a different application. In the case of drums, we have tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works pretty well; but when the tempo increases, some onsets are not correctly detected. For this reason we finally implemented the onset detection with the HFC method. This method computes the HFC of each window as in equation (4.4); note that high-frequency bins (k index) weigh more in the final value of the HFC.
$$HFC(n) = \sum_k |X_k[n]|^2 \cdot k \qquad (4.4)$$
4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
Moreover, the function plots the audio waveform jointly with the detected onsets, to check after each test that the detection has worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60 bpm drums interpretation
Figure 12 Onsets detected in a 220 bpm drums interpretation
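The HFC onset detection follows the standard Essentia recipe; a sketch with common frame sizes (not necessarily the ones used in SliceDrums_BeatDetection, and with a hypothetical input file) is shown below:

```python
import essentia
import essentia.standard as es

# Compute an HFC onset detection function frame by frame, then aggregate
# the per-frame values into onset times with the Onsets algorithm.
audio = es.MonoLoader(filename='submission.wav')()      # hypothetical file
onset_func = es.OnsetDetection(method='hfc')
w, fft, c2p = es.Windowing(type='hann'), es.FFT(), es.CartesianToPolar()

pool = essentia.Pool()
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    mag, phase = c2p(fft(w(frame)))
    pool.add('features.hfc', onset_func(mag, phase))

onsets = es.Onsets()(essentia.array([pool['features.hfc']]), [1])
print(onsets)   # onset times in seconds
```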
With the onsets information the audio can be trimmed into the different events; the order is kept in the name of each file, so when comparing with the expected events they can be mapped easily. The audios are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a csv.
To predict which event each slice is, the models already trained are loaded in this new environment and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row in the data. The described process is implemented in the first part of Assessment.ipynb6; the second part executes the visualization functions described in the next section.
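A minimal sketch of this inference step, with hypothetical file paths, could be:

```python
import pandas as pd
from joblib import load

# Reload the dumped scaler and model and predict the class of every
# sliced event; file paths and csv layout are assumptions.
scaler, clf = load('svm_drums.joblib')
slices = pd.read_csv('submission_features.csv')   # one row per detected event
predicted = clf.predict(scaler.transform(slices))
print(list(predicted))    # e.g. ['hh+kd', 'hh', 'hh+sd', ...]
```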
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving the student visual feedback of the interpretation. As the drums classifier has taken so much time, the creation of a dataset with interpretations and their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different skill levels, but grading them and doing all the process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed grades.
So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Having as a base implementation the solution of my colleague Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb.
First the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (this comes from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio.
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic light code (from green, good, to red, bad).
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
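A simplified version of this plotting idea is sketched below; mapping a 200 ms deviation to fully red is an assumption for illustration, and the macro-driven cropping of the real function is omitted:

```python
import matplotlib.pyplot as plt
import numpy as np

# Waveform with detected onsets (black, bottom), expected onsets (cyan,
# middle) and traffic-light shading for the deviation in between.
def deviation_plot(audio, sr, detected, expected, out='deviation.png'):
    t = np.arange(len(audio)) / sr
    fig, ax = plt.subplots(figsize=(12, 2))
    ax.plot(t, audio, color='0.7')
    ax.vlines(detected, -1.0, -0.3, color='black')
    ax.vlines(expected, -0.3, 0.3, color='cyan')
    for d, e in zip(detected, expected):
        badness = min(abs(d - e) / 0.2, 1.0)          # 0 = green, 1 = red
        ax.axvspan(min(d, e), max(d, e), color=(badness, 1 - badness, 0), alpha=0.4)
    fig.savefig(out, bbox_inches='tight')

deviation_plot(np.zeros(44100), 44100, [0.50, 1.02], [0.5, 1.0])  # toy example
```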
Once the waveform is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes()10 the expected and predicted events are compared, creating a list with 1 at the indices where they match and 0 where they do not. Then the 0 indices are re-checked, because most of the classification errors fail in only one of the instruments to be predicted (i.e. predicting sd instead of hh+sd). These cases are considered partially correct, as the system has to take its own errors into account: at the indices in which one of the instruments is correctly predicted and it is not a hi-hat (we consider it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct and red means incorrect.
9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
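A minimal sketch of this scoring logic, omitting the ordering details of the real function, could be:

```python
# Exact matches score 1; errors in which a non-hi-hat instrument is
# still correctly predicted score 0.75; everything else scores 0.
def assess_notes(expected, predicted):
    scores = []
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            scores.append(1.0)
        else:
            shared = set(exp.split('+')) & set(pred.split('+'))
            scores.append(0.75 if shared - {'hh'} else 0.0)
    return scores

# 'sd' instead of 'hh+sd' is partially correct: the snare is still right
print(assess_notes(['hh+sd', 'hh+kd', 'sd'], ['sd', 'hh', 'kd']))  # [0.75, 0.0, 0.0]
```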
With the waveform, the notes assessed and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the assessed notes). On the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends; in this case from 0 to 16, which in a 4/4 rhythm is 4 bars, while the other number is the scale of the waveform, which allows fitting the plot better with the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps; these are found in the txt file already mentioned in section 3.1.1. Then the LilyPond file: this is the template, written in the LilyPond language, that defines the resultant music sheet; the macros to change color and size and to add the waveform are defined in it. When extracting the musical features, each submission creates its own csv file to store the information. And finally we need, of course, the audio files with the recorded submission to be assessed.
11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm at a time up to 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB at a time, up to an accumulated +30 dB.
In this chapter and in Appendix B all the resultant feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high frequency content method works perfectly, there are no limitations nor errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positive.
1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops when the tempo increases. Taking as a reference the figures that show good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one weighs 0.5 and an incorrect one 0; the total value is the mean of the weighted results of the predictions, as in the snippet below.
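This counting can be written as a small helper, shown here for exercise 1 at 60 bpm:

```python
# Counting used in Tables 3 to 6: correct notes weigh 1.0, partially
# correct ones 0.5, and the total is the mean over all events.
def weighted_score(correct, partial, incorrect):
    total = correct + partial + incorrect
    return (1.0 * correct + 0.5 * partial) / total

print(round(weighted_score(25, 7, 0), 2))   # exercise 1 at 60 bpm -> 0.89
```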
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that increasing the tempo of exercise 1 decreases the accuracy of the classifier. This may be because increasing the tempo decreases the spacing between events, and consequently the duration of each event, which leads to fewer values to calculate the mean and standard deviation when extracting the timbre characteristics. As stated in the law of large numbers [25], the larger the sample, the closer its mean is to the total population mean. In this case, having fewer values in the calculation creates more outliers in the distribution, which tends to scatter.
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
Tempo  Correct  Partially OK  Incorrect  Total
60     25       7             0          0.89
100    24       8             0          0.875
140    24       7             1          0.86
180    15       9             8          0.61
220    12       7             13         0.48
Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But in 12/8 at 100 bpm the equivalent eighth-note rate is 300 eighth notes per minute, similar to 140 bpm in 4/4, whose eighth-note rate is 280. The results on 12/8 (Table 4) are also better, because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo  Correct  Partially OK  Incorrect  Total
60     39       8             1          0.89
100    37       10            1          0.875
Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figure 23 and Figure 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that it leads to a high-frequency content that is predicted as a hi-hat in all the cases.
Level   Correct  Partially OK  Incorrect  Total
+0dB    25       7             0          0.89
+6dB    23       9             0          0.86
+12dB   23       9             0          0.86
+18dB   24       7             1          0.86
+24dB   18       5             9          0.64
+30dB   13       5             14         0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level   Correct  Partially OK  Incorrect  Total
+0dB    12       7             13         0.48
+6dB    13       10            9          0.56
+12dB   10       8             14         0.5
+18dB   9        2             21         0.31
+24dB   8        0             24         0.25
+30dB   9        0             23         0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of the results has been focused on the drums event classifier accuracy, but we think it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar, and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results will be the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo  Correct assessment  Incorrect assessment  Total
60     32                  0                     1
100    32                  0                     1
140    32                  0                     1
180    25                  7                     0.78
220    22                  10                    0.68
Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo  Correct assessment  Incorrect assessment  Total
60     47                  1                     0.98
100    45                  3                     0.9
Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that in a controlled environment and at low tempos the system performs the prediction-based assessment pretty well. This can be helpful for a student to know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help the student recognize whether they are slowing down or rushing when reading the score; as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the correspondent expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. Also, a set of further improvements is given, together with a final thought on my work and on what I have learned. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained along this document, we can now list them, defining their completeness and our contributions.
Firstly, the 29kSamples Drums Dataset has been created and is now publicly available, downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful to balance drums datasets based on real interpretations, as the class distribution of such interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on
classifying drums events in this manner. As the results have shown, more complex methods based on the context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained to do is very hard for a human: being able to differentiate drums events in an individual drum sample without any context is almost impossible, even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they could be improved, as they are mostly focused on the visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the studio recordings of the dataset. It would be desirable to get recordings in different spaces and with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even different people playing. This would introduce more variance in the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it is clear that it needs improvements to ensure the overall system robustness. It would be appropriate to introduce the aforementioned methods in [16], [17] and [26] in the ADT part of the pipeline.
1 https://2021.ieeeicassp.org
Also, in terms of classes in the drumset, there is a long path to cover. There are no solutions that robustly transcribe a whole set, including the toms and different kinds of cymbals. In this sense, we think that a proper approach would be to work with professional musicians, which would help researchers better understand the instrument and create datasets with different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises to give a mark to each student. In this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and other researchers or students can execute them, getting the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository3, and some samples in Freesound4. This is the 29kDrumsSamplesDataset: as not all the 40k samples used to train are our property, we are not able to share them under our full authorship; despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished. Nevertheless, work remains in terms of robustness and reliability; but a first approximation has been presented, as well as several paths of improvement proposed.
Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web-based service GitHub. Also, at the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which has been a tough path to take and accomplish. In a more advanced phase of the project I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of more personal interest to me. And being able to relate the project with music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that results from the system, and I hope that more people get interested in this field of research, to get better tools in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes 30
11 Onsets detected in a 60 bpm drums interpretation 32
12 Onsets detected in a 220 bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff 41
24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography
[1] Wu C-W et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko V, Morsi A, Narang J & Serra X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan T J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann K, Kartomi M, von Hornbostel E M & Sachs C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa I & Morik K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos J & Rasch R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello J P et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera P, Peeters G & Dubnov S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl M, Gómez E & Urbano J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk D A & Meng X-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2001).
[13] Nanni L, Maguolo G & Paci M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko T, Peddinti V, Povey D & Khudanpur S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne S, Fayek H M & Tourbabin V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall C, Stables R & Hockman J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith H, McDonald S & Sandler M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron M, Davies M E P & Gouyon F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar C & Gärtner D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall C, Wu C-W, Lerch A & Hockman J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet O & Richard G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer R & Janer J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov D et al. Essentia: An audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez E, Harte C, Sandler M & Abdallah S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton G & Cook I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei I-C, Wu C-W & Su L. Improving automatic drum transcription using large-scale audio-to-midi aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
[Score at tempo 60 on a two-line stave, alternating the target class with auxiliary cymbal events; engraved with LilyPond 2.18.2]
Figure 31 Recording routine 1
[Score at tempo 60 for the cymbal classes, without auxiliary events; engraved with LilyPond 2.18.2]
Figure 32 Recording routine 2
[Score at tempo 60, a shortened version of routine 1 repeating each bar twice; engraved with LilyPond 2.18.2]
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
23 Digital sheet music 11
23 Digital sheet music
Several music sheet technologies have been developed since the first scorewriter
programs from the 80s Proprietary softwares as Finale12 and Sibelius13 or open-
source software as MuseScore14 and LilyPond15 are some options that can be used
nowadays to write music sheets with a computer
In terms of file format Sibelius has its encrypted version that can only be read and
written with the software it can also write and read MusicXML16 files which are
not encrypted and are similar to an HTML file as it contains tags that define the
bars and notes of the music sheet this format is the standard for exchanging digital
music sheet
Within Music Criticrsquos framework the technology used to display the evaluated score
is LilyPond it can be called from the command line and allows adding macros that
change the size or color of the notes The other particularity is that it uses its own
file format (ly) and scores that are in MusicXML format have to be converted and
reviewed
24 Software tools
Many of the concepts and algorithms aforementioned are already developed as soft-
ware libraries this project has been developed with Python and in this section the
libraries that have been used are presented Some of them are open and public and
some others are private as pysimmusic that has been shared with us so we can use
and consult it In addition all the code has been developed using a tool from Google
called Collaboratory17 it allows to write code in a jupyter notebook18 format that
is agile to use and execute interactively
12httpswwwfinalemusiccom13httpswwwavidcomsibelius14httpsmusescoreorg15httpslilypondorg16httpswwwmusicxmlcom17httpscolabresearchgooglecom18httpsjupyterorg
12 Chapter 2 State of the art
241 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis
description and synthesis [23] it can also be installed as a Python-based library
with the pip19 command in Linux or compiling with certain flags in MacOS20 This
library includes a collection of MIR algorithms it is not a framework so it is in the
userrsquos hands how to use these processes Some of the algorithms used in this project
are music feature extraction onset detection and audio file IO
242 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine learning
algorithms for regression classification and clustering as well as pre-processing and
dimensionality reduction functions Based on NumPy22 and SciPy23 so its algorithms
are easy to adapt to the most common data structures used in Python It also allows
to save and load trained models to do inference tasks with new data
243 Lilypond
As described in section 23 LilyPond is an open-source songwriter software with
its file format and language It can produce visual renders of musical sheets in
PNG SVG and PDF formats as well as MIDI files to listen to the compositions
LilyPond works on the command line and allows us to introduce macros to modify
visual aspects of the score such as color or size
It is the digital sheet music technology used within Music Criticrsquos framework as
allows to embed an image in the music sheet generating a parallel representation of
the music sheet and a studentrsquos interpretation
19httpspypiorgprojectpip20httpsessentiaupfeduinstallinghtml21httpsscikit-learnorg22httpsnumpyorg23httpswwwscipyorgscipylibindexhtml
25 Summary 13
244 Pysimmusic
Pysimmusic is a private python library developed at the MTG It offers tools to
analyze the similarity of musical performances and uses libraries such as Essentia
LilyPond FFmpeg24 ia Pysimmusic contains onset detection algorithms and a
collection of audio descriptors and evaluation algorithms By now is the main eval-
uation software used in Music Critic to compare the recording submitted with the
reference
245 Music Critic
Music Critic is a project from the MTG intended to support technologies for online
music education facilitating the assessment of student performances25
The proposed workflow starts with a student submitting a recording playing the
proposed exercise Then the submission is sent to the Music Criticrsquos server where
is analyzed and assessed Finally the student receives the evaluation jointly with
the feedback from the server
25 Summary
Music information retrieval and machine learning have been popular fields of study
This has led to a large development of methods and algorithms that will be crucial
for this project Most of them are free and open-source and fortunately the private
ones have been shared by the UPF research team which is a great base to start the
development
24httpswwwffmpegorg25httpswwwupfeduwebmtgtech-transfer-asset_publisherpYHc0mUhUQ0G
contentid229860881maximizedYJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in section 132 having a well-annotated and balanced dataset is crucial to
get proper results In this section the 40kSamples Drums Dataset creation process is
explained first focusing on how to process existing datasets such as the mentioned
in 221 Secondly introducing the process of creating new datasets with a music
school corpus and a collection of recordings made in a recording studio Finally
describing the data augmentation procedure and how the audio samples are sliced
in individual drums events In Figure 1 we can see the different procedures to unify
the annotations of the different datasets while the audio does not need any specific
modification
31 Existing datasets
Each of the existing datasets has a different annotation format; in this section the process of unifying them is explained, as well as its implementation (see the notebook Dataset_formatUnification.ipynb1). As the events to take into account can be single instruments or combinations of them, the annotations have to be formatted to show those events properly. None of the annotations has this approach, so we have written a function that filters the list and joins the events with a small difference of time, meaning that they are played simultaneously.

1 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
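A hedged sketch of this joining step (the function name and the threshold value are illustrative, not the project's actual code):

```python
# Hypothetical sketch of the merging step: annotation rows are (time, label)
# pairs; events closer than a small threshold are joined into one combined
# class such as "hh+kd".
def merge_simultaneous(events, threshold=0.01):
    merged = []
    for time, label in sorted(events):
        if merged and time - merged[-1][0] < threshold:
            prev_time, prev_label = merged[-1]
            joined = "+".join(sorted(prev_label.split("+") + [label]))
            merged[-1] = (prev_time, joined)
        else:
            merged.append((time, label))
    return merged

print(merge_simultaneous([(0.500, "hh"), (0.502, "kd"), (1.003, "sd")]))
# -> [(0.5, 'hh+kd'), (1.003, 'sd')]
```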
[Diagram: annotations from the Music school corpus (Sibelius, converted to MusicXML and parsed to txt), Studio REC (audio + txt), IDMT Drums and MDB Drums are all written to a unified Annotations/Audio format]
Figure 1 Datasets pre-processing
311 MDB Drums
This dataset was the first we worked with; the annotation format in txt was a key factor, as it was easy to read and understand. As the dataset is available on Github2, there is no need to download it nor process it from a local drive. As shown in the first cells of Dataset_formatUnification.ipynb, data from the repository can be retrieved with a Python wrapper of the Github API3.
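As an illustration, the same retrieval can be sketched with plain requests against the GitHub API instead of the wrapper; the repository is the public one mentioned above, but the folder layout shown is an assumption:

```python
# Sketch: list and fetch the class-level annotation files of MDB Drums
# straight from GitHub. The "MDB Drums/annotations/class" path is assumed.
import requests

api = ("https://api.github.com/repos/CarlSouthall/MDBDrums"
       "/contents/MDB%20Drums/annotations/class")
for entry in requests.get(api).json():
    if entry["name"].endswith(".txt"):
        text = requests.get(entry["download_url"]).text
        print(entry["name"], "->", text.splitlines()[0])  # first event row
```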
This dataset has two annotation files, depending on how deep the used taxonomy is [20]. In this case the generic class taxonomy is used, as there is no need to differentiate playing styles on a given instrument (i.e. single stroke, flam, drag, ghost note).
312 IDMT Drums
Unlike the previous dataset, this one is only available by downloading a zip file4. It also differs in the annotation file format, which is xml. Using the Python package xmltodict5, in the second part of Dataset_formatUnification.ipynb the xml files are loaded as a Python dictionary and converted to txt format.

2 https://github.com/CarlSouthall/MDBDrums
3 https://pypi.org/project/githubpy/
4 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
5 https://pypi.org/project/xmltodict/
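A minimal sketch of that conversion, assuming illustrative tag names rather than the exact schema of the IDMT files:

```python
# Sketch: parse an XML annotation into nested dicts with xmltodict and dump
# onset/instrument pairs to a txt file. The tag names here are assumptions.
import xmltodict

with open("annotation.xml") as f:
    doc = xmltodict.parse(f.read())

events = doc["instrumentRecording"]["transcription"]["event"]
with open("annotation.txt", "w") as out:
    for event in events:
        out.write(f'{event["onsetSec"]}\t{event["instrument"]}\n')
```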
32 Created datasets
In order to expand the dataset with a greater variety of samples, other methods to get data have been explored. On one hand, with audio data that has partial annotations or some representation that is not data-driven, such as a music sheet, which contains a visual representation of the music but not a logic annotation, as mentioned in the previous section. On the other hand, generating simple annotations is an easy task, so drums samples can be recorded standalone to create data in a controlled environment. In the next two sections these methods are described.
321 Music school
A music school has shared its teaching material with the MTG for research purposes, i.e. audio demos, books in pdf format and music sheets in Sibelius format. As we can see in Figure 1, the annotations from the music school corpus are in Sibelius format; this is an encrypted representation of the music sheet that can only be opened with the Sibelius software. The MTG has shared an AVID license, which includes the Sibelius software, so we were able to convert the sib files to MusicXML. MusicXML is not encrypted and can be opened and read, so a parser has been developed to convert the MusicXML files to a symbolic representation of the music sheet. This representation is inspired by [24], which proposes a system to represent chords.
MusicXML parser
As mentioned in section 2.3, the MusicXML format is based on ordering the visual information with tags, creating a tree structure of nested dictionaries. In the first cell of XML_parser.ipynb6 two functions are defined: ConvertXML2Annotation reads the musicxml file and gets the general information of the song (i.e. tempo, time measure, title); then a for loop iterates over all the bars of the music sheet, checking whether the given bar is self-defined, a repetition of the previous one, or the beginning or end of a repetition in the song (see Figure 2). In the self-defined case, the bar is passed to an auxiliary function which parses it, obtaining the aforementioned symbolic representation.

6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three others have a symbol that means 'repetition of the previous bar'; moreover, the bar lines at the beginning and the end represent that these four bars have to be repeated. Therefore, this line in the music score represents an interpretation of eight bars, repeating the first one.
The symbolic representation that we propose, based on [24], defines each bar with a string; this string contains the representations of the events in the bar, separated with blank spaces. Each of the events has two dots (:) to separate the figure (i.e. quarter note, half note, whole note) from the note or notes of the event, which are separated by a dot (.). For instance, the symbolic representation of the first bar in Figure 2 is F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4.
In addition to this conversion, in the parse_one_measure function from the XML_parser notebook each measure is checked to ensure that it fully represents the bar. This means that the sum of the figures of the bar has to be equal to the one defined in the time measure; e.g. the sum of the events in a 4/4 bar has to be equal to four quarter notes.
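A small sketch of this completeness check, under the notation just described (the helper below is illustrative, not the project's exact code):

```python
# Illustrative check that a bar in the symbolic notation is complete: each
# event is "note(.note):figure", and the figures of a 4/4 bar must add up
# to a whole bar (four quarter notes).
from fractions import Fraction

def bar_is_complete(bar, beats=4, beat_unit=4):
    total = sum(Fraction(1, int(ev.split(":")[1])) for ev in bar.split())
    return total == Fraction(beats, beat_unit)

print(bar_is_complete("F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4"))  # True: 4 quarters
print(bar_is_complete("F4.A4:4 F4.A4:8"))                   # False: incomplete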
Symbolic notation to unified annotation format
As we can see in Figure 1, once the music scores are converted to the symbolic representation, the last step is to unify the annotations with the format used in section 3.1. This process is done in the last cells of the Dataset_formatUnification7 notebook. A dictionary with the translation of the notes to drums instruments is defined, so the note conversion is direct. The timestamp of each event, in contrast, has to be computed based on the tempo of the song and the figure of each event; this is done with the function get_time_steps_from_annotations8, which reads the interpretation in symbolic notation and accumulates the duration of each event based on the figure and the tempo.
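A hedged sketch of this accumulation (the function and variable names are illustrative):

```python
# Each figure lasts (4 / figure) * (60 / tempo) seconds; onset times are
# accumulated event by event across the bars.
def annotation_times(bars, tempo):
    seconds_per_quarter = 60.0 / tempo
    t, rows = 0.0, []
    for bar in bars:
        for event in bar.split():
            notes, figure = event.split(":")
            rows.append((round(t, 3), notes))
            t += (4.0 / int(figure)) * seconds_per_quarter
    return rows

print(annotation_times(["F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4"], tempo=60))
# -> [(0.0, 'F4.A4'), (1.0, 'F4.A4'), (2.0, 'F4.A4'), (3.0, 'F4.A4')]
```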
322 Studio recordings
At this point of the dataset creation we realized that the already existing data was very unbalanced in terms of instances per class: some classes had around two thousand samples while others had only ten. This motivated recording a personalized dataset to balance the overall distribution of classes, as well as exercises read with different accuracy, simulating students with different skill levels.
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants, Barcelona); the first day was dedicated to setting up the drumset and the microphones, which are listed in Table 2. The microphone setup is shown in Figure 3; unlike the standard setup, in which each instrument of the set has its own microphone, this distribution of the microphones was intended to record the whole drumset with different frequency responses.

The recording process was divided into two phases: first, creating samples to balance the dataset used to train the drums event classifier (the train set); then, recording the students' assignment simulations to test the whole system (the test set).

7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
9 https://www.stereodosis.com
Microphone           Transducer principle
Beyerdynamic TG D70  Dynamic
Shure PG52           Dynamic
Shure SM57           Dynamic
Sennheiser e945      Dynamic
AKG C314             Condenser
AKG C414             Condenser
Shure PG81           Condenser
Samson C03           Condenser
Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only those that appear in the music school subset; this decision was motivated by the idea of assessing the songs from the books, so only the classes of the collection of songs were needed to train the classifier. In Figure 4 the distribution of the selected classes before the recordings is shown; note that the scale is logarithmic, so there is a large difference among classes.
Figure 4 Number of samples before Train set recording
To organize the recording process we designed 3 different routines; depending on the class and the number of samples already existing, a different routine was recorded. These routines were designed to represent the different speeds, dynamics and interactions between instruments of a real interpretation. The routines' scores are shown in Appendix A; to write a generic routine, a two-line stave is used, where the bottom line represents the class to be recorded and the top line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound sustains for a long time, so its tail mixes with the subsequent sound events.

• Routine 1 (Fig. 31): intended for the classes that do not include a crash or ride cymbal and have a small number of samples (i.e. < 500).
• Routine 2 (Fig. 32): does not include auxiliary events, as it is intended for classes that include a crash or ride cymbal, whose interaction with itself is intrinsic.
• Routine 3 (Fig. 33): a short version of routine 1 which repeats each bar only two times instead of four; intended for classes that do not include a crash or ride cymbal and have a large number of samples (i.e. > 500).
Routines 1 and 3 were recorded only once, as we had only one instrument for each of those classes; routine 2, in contrast, was recorded twice for each cymbal, as we were able to use more instances of them. The different cymbal configurations used can be seen in Appendix A, in Figures 34, 35 and 36.

After the Train set recording the number of samples was more balanced; as shown in Figure 5, all the classes have at least 1500 samples.
[Bar chart: per-class sample counts, before vs. after the recording, for the classes ht+kd, kd+mt, ht, mt, ft+sd, ft+kd+sd, cr+sd, ft, cr+kd, cr, ft+kd, hh+kd+sd, kd+sd, cy+sd, cy, cy+kd, sd, kd, hh+sd, hh+kd and hh]
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song on the same drumset; to do that, we recorded each song of the music school Drums Grade Initial and Grade 1, playing it correctly and then making mistakes in both reading and rhythm. After testing with these recordings we realized that we were not able to test the limits of the assessment system in terms of tempo or with different time signatures, so we proposed two groove-reading exercises, in 4/4 and in 12/8, to be performed at different tempos; these recordings were made in my study room with my laptop's microphone.
33 Data augmentation
As described in section 2.1.2, data augmentation aims to introduce changes in the signals to optimize the statistical representation of the dataset. To implement this task, the aforementioned Python library audiomentations is used.

The audiomentations library has a class called Compose, which allows collecting different processing functions and assigning a probability to each of them. The Compose instance can then be called several times with the same audio file, and each time the resulting audio will be processed differently because of the probabilities. In data_augmentation.ipynb10 a possible implementation is shown, as well as some plots of the original sample with different results of applying the created Compose to the same sample; an example of the results can be listened to on Freesound11.
The processing functions introduced in the Compose class are based on the ones proposed in [13] and [14]; their parameters are listed below, and a minimal sketch follows the list:
• Add Gaussian noise, with 70% probability.
• Time stretch between 0.8 and 1.25, with 50% probability.
• Time shift forward, a maximum of 25% of the duration, with 50% probability.
• Pitch shift of ±2 semitones, with 50% probability.
• Apply mp3 compression, with 50% probability.
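A minimal sketch of such a Compose, assuming the audiomentations API; the probability and range values mirror the list above, while the file handling is illustrative:

```python
# Sketch of the augmentation chain described above; everything besides the
# listed parameter values (files, sample rate handling) is illustrative.
import soundfile as sf
from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                             PitchShift, Shift, TimeStretch)

augment = Compose([
    AddGaussianNoise(p=0.7),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    Shift(p=0.5),  # time shift; capping it at +25% of duration is assumed
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    Mp3Compression(p=0.5),
])

samples, sr = sf.read("sample.wav")
for i in range(3):  # each call draws a different random combination
    out = augment(samples=samples.astype("float32"), sample_rate=sr)
    sf.write(f"augmented_{i}.wav", out, sr)
```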
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213/
34 Drums events trim
As will be explained in section 4.2.1, the dataset has to be trimmed into individual files in order to analyze them and extract the low-level descriptors. In the Dataset_featureExtraction.ipynb12 notebook this process has been implemented by slicing all the audios with their annotations, each dataset separately, to sight-check all the resulting samples and better detect which annotations were not correct.
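An illustrative version of this slicing step (window length and naming scheme are assumptions):

```python
# Sketch: cut a fixed-length window after each annotated onset and save it
# as "<label>_<n>.wav" so the class label survives in the file name.
import soundfile as sf

def slice_events(audio_path, events, out_dir, length_s=0.25):
    audio, sr = sf.read(audio_path)
    for n, (onset, label) in enumerate(events):
        start = int(onset * sr)
        chunk = audio[start:start + int(length_s * sr)]
        sf.write(f"{out_dir}/{label}_{n}.wav", chunk, sr)

slice_events("take.wav", [(0.5, "hh+kd"), (1.0, "sd")], "slices")
```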
35 Summary
To summarize, a drums samples dataset has been created; the one used in this project will be called the 40k Samples Drums Dataset. Nonetheless, to share this dataset we have to ensure that we fully own the data, which means that the samples that come from the IDMT, MDBDrums and MusicSchool datasets cannot be shared in another dataset. Alternatively, we will share the 29k Samples Drums Dataset, formed only by the samples recorded in the studio. This dataset will be available on Zenodo13, to download the whole dataset at once, and on Freesound, where some selected samples are uploaded in a pack14 to show the differences among microphones.

12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592
14 https://freesound.org/people/MaciaAC/packs/32397/
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is inspired by [2]. Each box of the diagram refers to a section in this chapter, so the diagram might be helpful to get a general idea of the problem when explaining each process.
The system is divided into two main processes. First, the top boxes correspond to the training process of the model, using the dataset created in the previous chapter. Secondly, the bottom row shows how a student submission is processed to generate some feedback. This feedback is the output of the system and should give some indications to the student on how they have performed and how they can improve.
41 Problem definition
To check whether a student reads a music sheet correctly, we need a tool that tags which instruments of the drumset are playing for each detected event. This leads us to develop and train a drums event classifier: if this tool ensures a good classification accuracy (i.e. ≥ 95%), we will be able to properly assess a student's recording. If the classifier is not accurate enough, the system will not be useful, as we will not be able to differentiate between errors from the student and errors from the classifier.
[Diagram: in the training path, assessments, music scores and students' performances provide annotations and audio recordings that form the dataset, which feeds feature extraction, drums event classifier training and performance assessment training; in the inference path, a new student's recording goes through feature extraction and performance assessment inference to produce a visualization and performance feedback]
Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason the project has been mainly focused on developing the aforementioned drums event classifier and a proper dataset. Consequently, creating a properly assessed dataset of drums interpretations has not been possible, nor has the performance assessment training. Despite this, the feedback visualization has been developed, as it is a nice way to close the pipeline and get some understandable results; moreover, the performance feedback can focus on deterministic aspects, such as telling the student whether they are rushing or dragging in relation to a given tempo.
42 Drums event classifier
As already mentioned, this part has been the main workload of the project, because a reliable assessment depends on a correct automatic transcription. The process has been divided into 3 main parts: extracting the musical features, training and validating the model in an iterative process, and finally testing the model with totally new data.
421 Feature extraction
The feature extraction concept has been explained in Section 2.1.1 and has been implemented using the MusicExtractor()1 method from the Essentia library.
The MusicExtractor() method has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()2 has been implemented to loop over a list of files and analyze each of them, adding the extracted features to a csv file jointly with the class of each drum event. At this point all the low-level features were extracted; both mean and standard deviation were computed across all the frames of the given audio file. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.

As mentioned in section 3.4, the fact that the MusicExtractor() method has to be called with a filename, not an audio stream, forced us to create another version of the dataset, with each event in a separate audio file and the correspondent class label in the filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples) and the features were saved in different csv files, one for each dataset3. Adding up the number of instances in all the csv files, we get 40228 instances with 84 features and 1 label.
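A hedged sketch of this extraction loop around Essentia's MusicExtractor; the statistics follow the text, while the frame sizes and csv layout are illustrative:

```python
# Sketch: extract low-level mean/stdev features per sliced event and append
# them to a csv together with the class label.
import csv
import essentia.standard as es

extractor = es.MusicExtractor(lowlevelFrameSize=2048, lowlevelHopSize=1024,
                              lowlevelStats=["mean", "stdev"])

def extract_to_csv(filenames, labels, csv_path):
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for filename, label in zip(filenames, labels):
            features, _ = extractor(filename)          # analyze one event
            row = [features[name] for name in features.descriptorNames()
                   if name.startswith("lowlevel.")
                   and isinstance(features[name], float)]
            writer.writerow(row + [label])
```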
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices_features
422 Training and validating
As mentioned in section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) for sound event classification, and other authors have developed more complex methods for drums event classification. The complexity of these last methods made me choose the generic ones, also to test whether they were a good way to approach the problem, as there is no literature specifically on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40k Samples Drums Dataset. The first times we tried the models we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula 4.1, which does not take into account the unbalance in the dataset. The computed accuracy was around 92%, but the correct predictions were mainly on the large classes; as shown in Figure 7, some classes had very low accuracy (even 0%, as some classes have only 10 samples, 7 used to train and 3 to validate, all of them mispredicted), but having a small number of instances affects the accuracy computation less.
$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} 1(\hat{y}_i = y_i) \tag{4.1}$$
Instead, the proper way to compute the accuracy on this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages it across all the classes, as in formula 4.2, where $w_i$ represents the weight of each class in the dataset. This computation lowered the result to 79%, which was not a good result.
$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \tag{4.2}$$

$$\text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i$$
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score, which combines the precision and the recall of the model in one measure, as in formula 4.3. For a given class, precision is the number of correct predictions divided by the total number of predictions of that class, and recall is the number of correct predictions divided by the number of instances that actually belong to that class.

$$F\text{-measure} = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{4.3}$$
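These three metrics are available directly in scikit-learn; the toy example below (illustrative labels, not project data) shows plain accuracy flattering an unbalanced problem compared to the balanced score:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score)

y_true = ["hh"] * 8 + ["sd", "kd"]
y_pred = ["hh"] * 8 + ["hh", "hh"]          # minority classes always missed

print(accuracy_score(y_true, y_pred))            # 0.8
print(balanced_accuracy_score(y_true, y_pred))   # ~0.33
print(f1_score(y_true, y_pred, average="macro")) # low macro f-score
```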
These results led us to record a personalized dataset to extend the already existing data (see section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as did the balanced accuracy and f-score (both 89%). Until this point we were using both KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine refers to the regularization; this technique is intended to make a model less sensitive to the data noise and to outliers that may not represent the class properly. When increasing this value to 10 the results improved among all the classes, as shown in Figure 9, as well as the accuracy and f-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with the classes that include only the three mentioned instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and f-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10

Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the training and validating iterative process has been developed in the Classifier_training.ipynb4 notebook. First, the csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered, and to remove redundant features the ones with a very low standard deviation are deleted (i.e. std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to center all the feature distributions around 0 and ensure a standard deviation of 1.
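A short sketch of this pre-processing, assuming the features csv has a label column (file and column names are illustrative):

```python
# Sketch: drop near-constant features, then standardize the rest to zero
# mean and unit variance.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("features.csv")
X, y = df.drop(columns=["label"]), df["label"]

X = X.loc[:, X.std() >= 1e-5]        # remove redundant (flat) features
scaler = StandardScaler().fit(X)     # center at 0, unit variance
X_scaled = scaler.transform(X)
```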
In the next cells the dataset is split into train and validation sets, and the fit method of sklearn's SVM is called to perform the training; when the models are trained, the parameters are dumped in a file to load the model a posteriori and apply the learned knowledge to new data. This process was very slow on my computer, so we decided to upload the csv files to Google Drive and open the notebook with Google Colaboratory, as it was faster, which is key to avoid long waiting times during the iterative train-validate process. In the last cells the inference is made on the validation set and the accuracy is computed, and the confusion matrix is plotted to get an idea of which classes perform better.
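Continuing the previous sketch, the train-validate step could look like this (the hyper-parameter mirrors the text; everything else is illustrative):

```python
# Sketch: split, train an SVM with C=10, persist the model and scaler, and
# inspect the validation confusion matrix.
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
import joblib

X_train, X_val, y_train, y_val = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0, stratify=y)

clf = SVC(C=10).fit(X_train, y_train)
joblib.dump(clf, "svm_C10.joblib")     # reload a posteriori for inference
joblib.dump(scaler, "scaler.joblib")   # the scaler must be reused as-is

print(confusion_matrix(y_val, clf.predict(X_val)))
```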
423 Testing
Testing the model introduces the concept of onset detection: until now, all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection5 does both tasks. As explained in section 2.1.1, there are many onset detection methods and each of them suits a different application. In the case of drums we tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works quite well; but when the tempo increases, some onsets are not correctly detected, so we finally implemented the onset detection with the HFC method. This method computes, for each window, the HFC as in equation 4.4; note that high-frequency bins (the k index) weigh more in the final value of the HFC:

$$\mathrm{HFC}(n) = \sum_k |X_k[n]|^2 \cdot k \tag{4.4}$$

4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
Moreover, the function plots the audio waveform jointly with the detected onsets, to check after each test whether it has worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
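A sketch of HFC-based onset detection following Essentia's documented frame-wise pattern (the file name and frame sizes are illustrative):

```python
# Sketch: compute an HFC onset detection function frame by frame, then let
# Essentia's Onsets algorithm pick the onset times (in seconds).
import essentia.standard as es
from essentia import Pool, array

audio = es.MonoLoader(filename="submission.wav")()
windowing = es.Windowing(type="hann")
fft, c2p = es.FFT(), es.CartesianToPolar()
detector = es.OnsetDetection(method="hfc")

pool = Pool()
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    magnitude, phase = c2p(fft(windowing(frame)))
    pool.add("odf.hfc", detector(magnitude, phase))

onset_times = es.Onsets()(array([pool["odf.hfc"]]), [1])
print(onset_times)
```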
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
With the onsets information the audio can be trimmed into the different events; the order is kept in the file names, so the slices can be easily mapped to the expected events when comparing. The audios are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a csv.
To predict which event each slice is, the already trained models are loaded in this new environment and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row in the data. The described process is implemented in the first part of Assessment.ipynb6; the second part executes the visualization functions described in the next section.
43 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving the student visual feedback on the interpretation. As the drums classifier has taken so much time, the creation of a dataset with interpretations and their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different skill levels, but grading them and doing the whole process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed grades.
So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
431 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Taking as a base implementation the solution of my colleague Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb.
First the LilyPond file paths are defined; then, for each of the submissions, the audio is loaded to generate the waveform plot.
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (these come from the LilyPond file's macro). To properly plot the deviations, in the code we are assuming that the interpretation starts four beats after the beginning of the audio.
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (from green, good, to red, bad).
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
Once the waveform is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. The expected and predicted events are passed to the function assess_notes()10; from their comparison a list is created, with 0 at the indices where they differ and 1 where they match. Then the resulting list is iterated and the 0 indices are checked, because most of the classification errors fail in only one of the instruments to be predicted (i.e. instead of hh+sd it predicts sd). These cases are considered partially correct, as the system has to take into account its own errors: at the indices where one of the instruments is correctly predicted and it is not a hi-hat (we are considering it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct and red means incorrect.

9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
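A hypothetical reconstruction of that logic (this mirrors the description above, not the project's exact code):

```python
# Exact match -> 1.0, mismatch -> 0.0; then mismatches where a non-hi-hat
# instrument was still correctly predicted are upgraded to 0.75.
def assess_notes(expected, predicted):
    scores = [1.0 if e == p else 0.0 for e, p in zip(expected, predicted)]
    for i, score in enumerate(scores):
        if score == 0.0:
            shared = set(expected[i].split("+")) & set(predicted[i].split("+"))
            if shared - {"hh"}:   # some instrument besides hi-hat matched
                scores[i] = 0.75
    return scores

print(assess_notes(["hh+sd", "kd", "hh"], ["sd", "kd", "sd"]))
# -> [0.75, 1.0, 0.0]
```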
Figure 15 Example of coloured notes
With the waveform, the assessed notes and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the assessed notes); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends, in this case from 0 to 16, which in a 4/4 rhythm is 4 bars; the other number is the scale of the waveform and allows fitting the plot better to the staff.
432 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps, found in the txt file already mentioned in section 3.1.1. Then, the LilyPond file: this is the template, written in the LilyPond language, that defines the resultant music sheet; the macros to change color and size and to add the waveform are defined there. When extracting the musical features, each submission creates its csv file to store the information. And finally we need, of course, the audio files with the recorded submission to be assessed.

11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until 220 bpm. The recordings with good tempo and good reading have been processed by adding 6 dB repeatedly, up to an accumulated +30 dB.
In this chapter and in Appendix B all the resulting feedback visualizations are shown. The audio files can be listened to on Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.

As the high frequency content method works perfectly, there are no limitations nor errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positives.

1 https://freesound.org/people/MaciaAC/packs/32350/
51 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops when the tempo increases. Taking as a reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one 0.5 and an incorrect one 0; the total value is the mean of the weighted predictions.
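For instance, at 60 bpm exercise 1 has 32 events, and Table 3 reports 25 correct and 7 partially correct predictions: (25 · 1.0 + 7 · 0.5 + 0 · 0) / 32 = 28.5 / 32 ≈ 0.89, the total shown in the first row.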
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that increasing the tempo of exercise 1 decreases the accuracy of the classifier; this may be because increasing the tempo decreases the spacing between events, and consequently the duration of each event, which leads to fewer
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to calculate the mean and standard deviation when extracting the timbre characteristics. As stated by the Law of Large Numbers [25], the larger the sample, the closer its mean is to the population mean. In this case, having fewer values in the calculation creates more outliers in the distribution, which tends to scatter.
Tempo  Correct  Partially OK  Incorrect  Total
60     25       7             0          0.89
100    24       8             0          0.875
140    24       7             1          0.86
180    15       9             8          0.61
220    12       7             13         0.48
Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But at 100 bpm in 12/8 the equivalent subdivision rate is 300 eighth notes per minute, similar to 140 bpm in 4/4, which gives 280 eighth notes per minute. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo  Correct  Partially OK  Incorrect  Total
60     39       8             1          0.89
100    37       10            1          0.875
Table 4 Results of exercise 2 with different tempos
52 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that it produces high-frequency content that is predicted as a hi-hat in all cases.
Level  Correct  Partially OK  Incorrect  Total
+0dB   25       7             0          0.89
+6dB   23       9             0          0.86
+12dB  23       9             0          0.86
+18dB  24       7             1          0.86
+24dB  18       5             9          0.64
+30dB  13       5             14         0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level  Correct  Partially OK  Incorrect  Total
+0dB   12       7             13         0.48
+6dB   13       10            9          0.56
+12dB  10       8             14         0.5
+18dB  9        2             21         0.31
+24dB  8        0             24         0.25
+30dB  9        0             23         0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1, 60 bpm, accumulating +6dB at each new staff

Figure 24 Good reading and good tempo Ex 1, 220 bpm, accumulating +6dB at each new staff
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the accuracy of the drums event classifier, but we think it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a one-bar count-in and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized; it works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo  Correct assessment  Incorrect assessment  Total
60     32                  0                     1
100    32                  0                     1
140    32                  0                     1
180    25                  7                     0.78
220    22                  10                    0.68
Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo  Correct assessment  Incorrect assessment  Total
60     47                  1                     0.98
100    45                  3                     0.9
Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system performs the assessment based on the predictions quite well. This can be helpful for a student to know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help the student recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the correspondent expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. Also, a set of further improvements is given, together with a final thought on my work and my learning. The chapter ends with an analysis of how reusable and reproducible my work is.
61 Discussion of results
Having in mind all the concepts explained throughout this document, we can now list them, stating their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful to balance drums datasets based on real interpretations, as the class distribution of such interpretations is very unbalanced, as explained with the IDMT and MDB Drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on the context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained to do is very hard even for a human being: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear, such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they could be improved, as they are mostly focused on visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and test takes of these exercises have been recorded in a different environment from the studio recordings of the dataset. It would be good to get recordings in different spaces and with different drumsets and microphones to test the system more exhaustively.
62 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments and even different people playing. This would introduce more variance in the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it is clear that it needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods from [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org
Also, in terms of classes in the drumset there is a long path to cover. There are no solutions that robustly transcribe a whole set, including the toms and different kinds of cymbals. In this sense, we think that a proper approach would be to work with professional musicians, who could help researchers better understand the instrument and create datasets with different techniques.
Regarding the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises to give a mark to each student. In this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
63 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and other researchers or students can execute them, getting the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository3, and some samples on Freesound4. This is the 29k Samples Drums Dataset, as not all the 40k samples used to train are our property and we cannot share them under our full authorship; despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588
4 https://freesound.org/people/MaciaAC/packs/32397
64 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished. Nevertheless, work remains in terms of robustness and reliability; but a first approximation has been presented, as well as several paths of improvement proposed.
Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web-based service GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which has been a tough path to take. In a more advanced phase of the project I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment has its own quirks, which I have had to learn. In summary, I have spent a lot of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that the system produces, and I hope that more people get interested in this field of research to build better tools in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system, inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1, 60 bpm, accumulating +6dB at each new staff 41
24 Good reading and good tempo Ex 1, 220 bpm, accumulating +6dB at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography
[1] Wu, C-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018)
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020)
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019)
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013)
[5] Wachsmann, K., Kartomi, M. J., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001)
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005)
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981)
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005)
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021)
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010)
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014)
[12] van Dyk, D. A. & Meng, X-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012)
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020)
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2020)
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019)
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017)
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012)
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013)
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014)
[20] Southall, C., Wu, C-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017)
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006)
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013)
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings – 14th International Society for Music Information Retrieval Conference (2013)
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005)
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008)
[26] Wei, I-C., Wu, C-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
Figure 31 Recording routine 1
Figure 32 Recording routine 2
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
12 Chapter 2 State of the art
241 Essentia
Essentia is an open-source C++ library of algorithms for audio and music analysis
description and synthesis [23] it can also be installed as a Python-based library
with the pip19 command in Linux or compiling with certain flags in MacOS20 This
library includes a collection of MIR algorithms it is not a framework so it is in the
userrsquos hands how to use these processes Some of the algorithms used in this project
are music feature extraction onset detection and audio file IO
242 Scikit-learn
Scikit-learn21 is an open-source library for Python that integrates machine learning
algorithms for regression classification and clustering as well as pre-processing and
dimensionality reduction functions Based on NumPy22 and SciPy23 so its algorithms
are easy to adapt to the most common data structures used in Python It also allows
to save and load trained models to do inference tasks with new data
243 Lilypond
As described in section 23 LilyPond is an open-source songwriter software with
its file format and language It can produce visual renders of musical sheets in
PNG SVG and PDF formats as well as MIDI files to listen to the compositions
LilyPond works on the command line and allows us to introduce macros to modify
visual aspects of the score such as color or size
It is the digital sheet music technology used within Music Criticrsquos framework as
allows to embed an image in the music sheet generating a parallel representation of
the music sheet and a studentrsquos interpretation
19httpspypiorgprojectpip20httpsessentiaupfeduinstallinghtml21httpsscikit-learnorg22httpsnumpyorg23httpswwwscipyorgscipylibindexhtml
25 Summary 13
244 Pysimmusic
Pysimmusic is a private python library developed at the MTG It offers tools to
analyze the similarity of musical performances and uses libraries such as Essentia
LilyPond FFmpeg24 ia Pysimmusic contains onset detection algorithms and a
collection of audio descriptors and evaluation algorithms By now is the main eval-
uation software used in Music Critic to compare the recording submitted with the
reference
245 Music Critic
Music Critic is a project from the MTG intended to support technologies for online
music education facilitating the assessment of student performances25
The proposed workflow starts with a student submitting a recording playing the
proposed exercise Then the submission is sent to the Music Criticrsquos server where
is analyzed and assessed Finally the student receives the evaluation jointly with
the feedback from the server
25 Summary
Music information retrieval and machine learning have been popular fields of study
This has led to a large development of methods and algorithms that will be crucial
for this project Most of them are free and open-source and fortunately the private
ones have been shared by the UPF research team which is a great base to start the
development
24httpswwwffmpegorg25httpswwwupfeduwebmtgtech-transfer-asset_publisherpYHc0mUhUQ0G
contentid229860881maximizedYJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in Section 1.3.2, having a well-annotated and balanced dataset is crucial to get proper results. In this chapter the 40kSamples Drums Dataset creation process is explained: first, focusing on how to process existing datasets, such as the ones mentioned in Section 2.2.1; secondly, introducing the process of creating new datasets with a music school corpus and a collection of recordings made in a recording studio; and finally, describing the data augmentation procedure and how the audio samples are sliced into individual drum events. In Figure 1 we can see the different procedures to unify the annotations of the different datasets, while the audio does not need any specific modification.
3.1 Existing datasets
Each of the existing datasets has a different annotation format; in this section the process of unifying them is explained, as well as its implementation (see notebook Dataset_formatUnification.ipynb1). As the events to take into account can be single instruments or combinations of them, the annotations have to be formatted to represent those events properly. None of the annotations has this approach, so we have written a function that filters the list and joins the events separated by a small difference of time, meaning that they are played simultaneously.

1 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
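A sketch of that joining step (not the project's exact function, and the 30 ms threshold is an assumption): events closer than the threshold are merged into a combined class label.

    # Join annotated events whose onset times differ by less than a threshold,
    # so that e.g. a hi-hat and a kick played together become the class "hh+kd".
    def merge_simultaneous(events, threshold=0.03):
        """events: list of (time_in_seconds, label) tuples, sorted by time."""
        merged = []
        for time, label in events:
            if merged and time - merged[-1][0] < threshold:
                prev_time, prev_label = merged[-1]
                labels = sorted(prev_label.split("+") + [label])
                merged[-1] = (prev_time, "+".join(labels))
            else:
                merged.append((time, label))
        return merged

    print(merge_simultaneous([(0.50, "hh"), (0.51, "kd"), (1.00, "sd")]))
    # [(0.5, 'hh+kd'), (1.0, 'sd')]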
[Diagram omitted: the Music school corpus goes from Sibelius to MusicXML and through the MusicXML parser to txt; Studio REC provides audio + txt; IDMT Drums and MDB Drums annotations are rewritten to the unified format, yielding annotations and audio for each corpus]

Figure 1 Datasets pre-processing
3.1.1 MDB Drums
This dataset was the first we worked with; its txt annotation format was a key factor, as it was easy to read and understand. As the dataset is available on GitHub2, there is no need to download it or process it from a local drive. As shown in the first cells of Dataset_formatUnification.ipynb, data from the repository can be retrieved with a Python wrapper of the GitHub API3.
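As a hedged alternative to the GitHub API wrapper used in the notebook, the same annotation files can be fetched straight from the repository's raw content; the file path and the one-event-per-line "onset label" format below are assumptions for illustration.

    # Fetch one MDB Drums annotation file directly from GitHub raw content.
    import requests

    url = ("https://raw.githubusercontent.com/CarlSouthall/MDBDrums/master/"
           "MDB%20Drums/annotations/class/MusicDelta_Punk_class.txt")  # hypothetical path
    for line in requests.get(url).text.splitlines()[:5]:
        onset, label = line.split()[:2]     # assumed "<onset seconds> <class>" lines
        print(float(onset), label)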
This dataset has two annotation files, depending on how deep the used taxonomy is [20]. In this case the generic class taxonomy is used, as there is no need to differentiate playing styles of a given instrument (i.e. single stroke, flam, drag, ghost note).
3.1.2 IDMT Drums
Differently from the previous dataset, this one is only available by downloading a zip file4. It also differs in the annotation file format, which is xml. Using the Python package xmltodict5, in the second part of Dataset_formatUnification.ipynb, the xml files are loaded as a Python dictionary and converted to txt format.

2 https://github.com/CarlSouthall/MDBDrums
3 https://pypi.org/project/githubpy
4 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
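A minimal sketch of that conversion, assuming an IDMT-style XML annotation with one <event> element per drum hit (the tag names used here are illustrative):

    # Load an XML annotation as nested dictionaries and rewrite it as txt.
    import xmltodict

    with open("annotation.xml") as f:
        doc = xmltodict.parse(f.read())     # XML tree -> nested dictionaries

    events = doc["instrumentRecording"]["transcription"]["event"]  # assumed keys
    with open("annotation.txt", "w") as out:
        for event in events:
            out.write(f'{event["onsetSec"]}\t{event["instrument"]}\n')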
3.2 Created datasets
In order to expand the dataset with a greater variety of samples, other methods to get data have been explored. On one hand, using audio data that has partial annotations or some representation that is not data-driven, such as a music sheet, which contains a visual representation of the music but not a logic annotation, as mentioned in the previous section. On the other hand, generating simple annotations is an easy task, so drum samples can be recorded standalone to create data in a controlled environment. In the next two sections these methods are described.
3.2.1 Music school
A music school has shared its teaching material with the MTG for research purposes, i.e. audio demos, books in pdf format, and music sheets in Sibelius format. As we can see in Figure 1, the annotations from the music school corpus are in Sibelius format; this is an encrypted representation of the music sheet that can only be opened with the Sibelius software. The MTG has shared an AVID license, which includes the Sibelius software, so we were able to convert the sib files to MusicXML. MusicXML is not encrypted and can be opened and read, so a parser has been developed to convert the MusicXML files to a symbolic representation of the music sheet. This representation has been inspired by [24], which proposes a system to represent chords.
MusicXML parser
As mentioned in Section 2.3, the MusicXML format is based on ordering the visual information with tags, creating a tree structure of nested dictionaries. In the first cell of XML_parser.ipynb6, two functions are defined. ConvertXML2Annotation reads the musicxml file and gets the general information of the song (i.e. tempo, time measure, title); then a for loop runs through all the bars of the music sheet, checking whether the given bar is self-defined, a repetition of the previous one, or the beginning or end of a repetition in the song (see Figure 2); in the self-defined case, the bar is passed to an auxiliary function, which parses it to get the aforementioned symbolic representation.

5 https://pypi.org/project/xmltodict
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three others have a symbol that means 'repetition of the previous bar'; moreover, the bar lines at the beginning and the end indicate that these four bars have to be repeated. Therefore, this line in the music score represents an interpretation of eight bars repeating the first one.
The symbolic representation that we propose, based on [24], defines each bar with a string; this string contains the representations of the events in the bar, separated by blank spaces. Each of the events uses a colon (:) to separate the figure (i.e. quarter note, half note, whole note) from the note or notes of the event, which are separated by a dot (.). For instance, the symbolic representation of the first bar in Figure 2 is F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4.
In addition to this conversion, in the parse_one_measure function from the XML_parser notebook, each measure is checked to ensure that it fully represents the bar. This means that the sum of the figures of the bar has to be equal to the one defined in the time measure; e.g. the sum of the events in a 4/4 bar has to be equal to four quarter notes.
Symbolic notation to unified annotation format
As we can see in Figure 1, once the music scores are converted to the symbolic representation, the last step is to unify the annotations with the format used in Section 3.1. This process is made in the last cells of the Dataset_formatUnification7 notebook. A dictionary with the translation of the notes to drum instruments is defined, so the note conversion is direct. Differently, the timestamp of each event has to be computed based on the tempo of the song and the figure of each event; this process is made with the function get_time_steps_from_annotations8, which reads the interpretation in symbolic notation and accumulates the duration of each event based on the figure and the tempo.

7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
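A sketch of that accumulation (not the project's exact function), assuming the "NOTE[.NOTE...]:FIGURE" syntax described above, a tempo in quarter notes per minute, and a hypothetical note-to-instrument dictionary:

    # Convert symbolic bars into (timestamp, instrument combination) annotations.
    NOTE_TO_DRUM = {"F4": "kd", "A4": "hh", "C5": "sd"}   # illustrative mapping

    def get_time_steps(bars, tempo):
        """bars: list of bar strings; tempo: quarter notes per minute."""
        beat = 60.0 / tempo                  # duration of a quarter note, seconds
        t, events = 0.0, []
        for bar in bars:
            for event in bar.split():
                notes, figure = event.split(":")
                labels = [NOTE_TO_DRUM[n] for n in notes.split(".")]
                events.append((round(t, 3), "+".join(labels)))
                t += beat * 4.0 / float(figure)   # figure 4 = quarter, 8 = eighth
        return events

    print(get_time_steps(["F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4"], tempo=60))
    # [(0.0, 'kd+hh'), (1.0, 'kd+hh'), (2.0, 'kd+hh'), (3.0, 'kd+hh')]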
3.2.2 Studio recordings
At this point of the dataset creation, we realized that the already existing data was very unbalanced in terms of instances per class: some classes had around two thousand samples, while others had only ten. This situation was the reason to record a personalized dataset, both to balance the overall distribution of classes and to obtain exercises read with different accuracy, simulating students with different skill levels.
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants, Barcelona); the first day was intended to mount the drumset and the microphones, which are listed in Table 2. In Figure 3 the microphone setup is shown; differently from the standard setup, in which each instrument of the set has its own microphone, this distribution of the microphones was intended to record the whole drumset with different frequency responses.

The recording process was divided into two phases: first, creating samples to balance the dataset used to train the drums event classifier (called train set); then, recording the students' assignment simulation to test the whole system (called test set).

9 https://www.stereodosis.com
Microphone             Transducer principle
Beyerdynamic TG D70    Dynamic
Shure PG52             Dynamic
Shure SM57             Dynamic
Sennheiser e945        Dynamic
AKG C314               Condenser
AKG C414               Condenser
Shure PG81             Condenser
Samson C03             Condenser

Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only the classes that appear in the music school subset; this decision was motivated by the idea of assessing the songs from the books, so only classes of the collection of songs were needed to train the classifier. In Figure 4 the distribution of the selected classes before the recordings is shown; note that it is in logarithmic scale, so there is a large difference among classes.
Figure 4 Number of samples before Train set recording
To organize the recording process, we designed 3 different routines to record; depending on the class and the number of samples already existing, a different routine was recorded. These routines were designed trying to represent the different speeds, dynamics, and interactions between instruments of a real interpretation. In Appendix A the routines' scores are shown; to write a generic routine, a two-line stave is used: the bottom line represents the class to be recorded, and the top line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound remains for a long period of time and whose tail is mixed with the subsequent sound events.
• Routine 1 (Fig. 31): intended for the classes that do not include a crash or ride cymbal and have a small number of samples (i.e. <500).
• Routine 2 (Fig. 32): does not include auxiliary events, as it is intended for classes that include a crash or ride cymbal, whose interaction with itself is intrinsic.
• Routine 3 (Fig. 33): a short version of routine 1 that repeats each bar only two times instead of four; intended for classes that do not include a crash or ride cymbal and have a large number of samples (i.e. >500).
Routines 1 and 3 were recorded only once, as we had only one instrument for each of the classes; differently, routine 2 was recorded twice for each cymbal, as we were able to use more instances of them. The different cymbal configurations used can be seen in Appendix A, in Figures 34, 35 and 36.

After the Train set recording, the number of samples was more balanced; as shown in Figure 5, all the classes have at least 1500 samples.
[Bar chart omitted: number of samples per class (hh, kd, sd, cymbals, toms and their combinations), comparing the counts recorded before and after the Train set recording; every class reaches at least 1500 samples]

Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song on the same drumset; to do that, we recorded each song of the music school Drums Grade Initial and Grade 1, playing it correctly and then making mistakes in both reading and rhythm. After testing with these recordings, we realized that we were not able to test the limits of the assessment system in terms of tempo or with different rhythmic measures. So we proposed two groove-reading exercises, in 4/4 and in 12/8, to be performed at different tempos; these recordings have been done in my study room with my laptop's microphone.
3.3 Data augmentation
As described in Section 2.1.2, data augmentation aims to introduce changes to the signals to optimize the statistical representation of the dataset. To implement this task, the aforementioned Python library audiomentations is used.

The audiomentations library has a class called Compose, which allows collecting different processing functions and assigning a probability to each of them. The Compose instance can then be called several times with the same audio file, and each time the resulting audio will be processed differently because of the probabilities. In data_augmentation.ipynb10 a possible implementation is shown, as well as some plots of the original sample with different results of applying the created Compose to the same sample; an example of the results can be listened to in Freesound11.

The processing functions introduced in the Compose class are based on the ones proposed in [13] and [14]; their parameters are described below, followed by a sketch of such a Compose.
• Add Gaussian noise, with 70% probability.
• Time stretch between 0.8 and 1.25, with 50% probability.
• Time shift forward a maximum of 25% of the duration, with 50% probability.
• Pitch shift ±2 semitones, with 50% probability.
• Apply mp3 compression, with 50% probability.
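A possible Compose mirroring these parameters; argument names follow the audiomentations API, which has changed between versions, so this is a sketch rather than the notebook's exact code.

    # Build a probabilistic augmentation chain and apply it to one signal.
    import numpy as np
    from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                                 PitchShift, Shift, TimeStretch)

    augment = Compose([
        AddGaussianNoise(p=0.7),
        TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
        Shift(min_fraction=0.0, max_fraction=0.25, p=0.5),   # forward time shift
        PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
        Mp3Compression(p=0.5),
    ])

    audio = np.random.uniform(-1, 1, 44100).astype(np.float32)  # stand-in signal
    augmented = augment(samples=audio, sample_rate=44100)       # differs per call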
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213
3.4 Drums events trim
As will be explained in Section 4.2.1, the dataset has to be trimmed into individual files in order to analyze them and extract the low-level descriptors. In the Dataset_featureExtraction.ipynb12 notebook this process has been implemented, slicing all the audios with their annotations, each dataset separately, to sight-check all the resulting samples and better detect which annotations were not correct.
3.5 Summary
To summarize, a drum samples dataset has been created; the one used in this project will be called the 40k Samples Drums Dataset. Nonetheless, to share this dataset we have to ensure that we are the full proprietors of the data, which means that the samples that come from the IDMT, MDBDrums, and MusicSchool datasets cannot be shared in another dataset. Alternatively, we will share the 29k Samples Drums Dataset, formed only by the samples recorded in the studio. This dataset will be available in Zenodo13, to download the whole dataset at once, and in Freesound some selected samples are uploaded in a pack14 to show the differences among microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592#YMmNXW4p5TZ
14 https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is inspired by [2]. Each box of the diagram refers to a section in this chapter, so the diagram might be helpful to get a general idea of the problem when explaining each process.

The system is divided into two main processes. First, the top boxes correspond to the training process of the model, using the dataset created in the previous chapter. Secondly, the bottom row shows how a student submission is processed to generate some feedback. This feedback is the output of the system and should give the student some indications on how they have performed and how they can improve.
4.1 Problem definition
To check whether a student reads a music sheet correctly, we need some tool to tag which instruments of the drumset are playing for each detected event. This leads us to develop and train a drums event classifier; if this tool ensures a good accuracy when classifying (i.e. >95%), we will be able to properly assess a student's recording. If the classifier does not have enough accuracy, the system will not be useful, as we will not be able to differentiate between errors from the student and errors from the classifier.
[Diagram omitted: annotations and audio recordings form the dataset; feature extraction feeds the drums event classifier training and, together with music scores and assessments of students' performances, the performance assessment training; a new student's recording goes through feature extraction and performance assessment inference to produce a visualization and performance feedback]

Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason, the project has been mainly focused on developing the aforementioned drums event classifier and a proper dataset. Consequently, building a properly assessed dataset of drums interpretations has not been possible, nor has the performance assessment training. Despite this, the feedback visualization has been developed, as it is a nice way to close the pipeline and get some understandable results; moreover, the performance feedback can be focused on deterministic aspects, such as telling the student whether they are rushing or slowing in relation to a given tempo.
4.2 Drums event classifier
As already mentioned, this section has been the main load of work for this project, because a reliable assessment depends on a correct automatic transcription. The process has been divided into 3 main parts: extracting the musical features, training and validating the model in an iterative process, and finally testing the model with totally new data.
4.2.1 Feature extraction
The feature extraction concept has been explained in Section 2.1.1 and has been implemented using the MusicExtractor()1 method from the Essentia library.

MusicExtractor() has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()2 has been implemented to loop over a list of files and analyze each of them, adding the extracted features to a csv file jointly with the class of each drum event. At this point all the low-level features were extracted; both mean and standard deviation were computed across all the frames of the given audio file. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.
As mentioned in Section 3.4, the fact that MusicExtractor() has to be called with a filename, not an audio stream, forced us to create another version of the dataset, in which each event is annotated in a different audio file with the correspondent class label as filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples), and the features were saved in a different csv for each dataset3. Adding the number of instances in all the csv files, we get 40228 instances with 84 features and 1 label.

1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices_features
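A minimal sketch of such an extraction loop, assuming Essentia's standard-mode MusicExtractor (the frame and hop sizes shown are illustrative); only scalar low-level statistics are written, one row per drum event file.

    # Extract mean/stdev low-level features per event file and append to a csv.
    import csv
    import essentia.standard as es

    extractor = es.MusicExtractor(lowlevelFrameSize=2048, lowlevelHopSize=1024,
                                  lowlevelStats=["mean", "stdev"])

    def extract_musical_features(filenames, labels, out_csv):
        with open(out_csv, "w", newline="") as f:
            writer = csv.writer(f)
            for filename, label in zip(filenames, labels):
                features, _ = extractor(filename)      # analyses one audio file
                names = sorted(n for n in features.descriptorNames()
                               if n.startswith("lowlevel")
                               and not hasattr(features[n], "__len__"))
                writer.writerow([features[n] for n in names] + [label])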
4.2.2 Training and validating
As mentioned in Section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) to do sound event classification, and some authors have developed more complex methods for drums event classification. The complexity of these latter methods made me choose the generic ones, also to check whether they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40k Drums Samples dataset. The first times we tried the models, we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula (4.1), which does not take into account the unbalance in the dataset. The accuracy computation was around 92%, but the correct predictions were mainly on the large classes; as shown in Figure 7, some classes had very low accuracy (even 0%, as some classes have 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances affects the accuracy computation less.
\[
\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} 1(\hat{y}_i = y_i) \tag{4.1}
\]
Otherwise, the proper way to compute the accuracy on this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages the accuracy over all the classes, as in formula (4.2), where w_i represents the weight of each class in the dataset. This computation lowered the result to 79%, which was not a good result.
\[
\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j}
\]
\[
\text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i \tag{4.2}
\]
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the F-score, which combines the precision and the recall of the model in one measure, as in formula (4.3). Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of instances that should be predicted as a given class.
\[
F\text{-measure} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{4.3}
\]
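The three measures above as computed with scikit-learn; y_true and y_pred below are small stand-ins for the validation labels and the model's predictions, not project data.

    # Compare plain accuracy, balanced accuracy and macro F-score on toy labels.
    from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                                 confusion_matrix, f1_score)

    y_true = ["hh", "hh", "hh", "sd", "kd"]
    y_pred = ["hh", "hh", "hh", "hh", "kd"]

    print(accuracy_score(y_true, y_pred))             # 0.80, dominated by 'hh'
    print(balanced_accuracy_score(y_true, y_pred))    # ~0.67, the missed 'sd' weighs more
    print(f1_score(y_true, y_pred, average="macro"))  # per-class F-measure, averaged
    print(confusion_matrix(y_true, y_pred, labels=["hh", "kd", "sd"]))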
These results led us to record a personalized dataset to extend the already existing one (see Section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as did the balanced accuracy and F-score (both 89%). Until this point we were using both KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine refers to the regularization; this technique is intended to make a model less sensitive to the data noise and to outliers that may not represent the class properly. When increasing this value to 10, the results improved among all the classes, as shown in Figure 9, as did the accuracy and F-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was somewhat of a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with the classes that include only the three mentioned instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and F-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the training and validating iterative process has been developed in the Classifier_training.ipynb4 notebook. First, the csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered, and, to remove redundant features, the ones with a very low standard deviation are deleted (i.e. std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to move all the data distributions around 0, ensuring a standard deviation of 1.

In the next cells the dataset is split into train and validation sets, and the training method from the SVM of sklearn is called to perform the training; when the models are trained, their parameters are dumped to a file, to load the model a posteriori and be able to apply the learned knowledge to new data. This process was very slow on my computer, so we decided to upload the csv files to Google Drive and open the notebook with Google Colaboratory, as it was faster, which is key to avoiding long waiting times during the iterative train-validate process. In the last cells, the inference is made with the validation set and the accuracy is computed, and the confusion matrix is plotted to get an idea of which classes are performing better.
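A condensed sketch of that train-validate process under the stated choices (drop near-constant features, standardize, fit an SVM with C = 10, and dump the fitted objects for later inference); file names are illustrative.

    # Load features, clean and scale them, train the SVM and persist the model.
    import joblib
    import pandas as pd
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    data = pd.read_csv("features.csv")
    X, y = data.drop(columns=["label"]), data["label"]
    X = X.loc[:, X.std() > 1e-5]                  # remove redundant features

    X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)
    scaler = StandardScaler().fit(X_tr)           # zero mean, unit variance
    clf = SVC(C=10).fit(scaler.transform(X_tr), y_tr)

    print(balanced_accuracy_score(y_val, clf.predict(scaler.transform(X_val))))
    joblib.dump((scaler, clf), "svm_drums.joblib")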
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection5 does both tasks. As explained in Section 2.1.1, there are many methods to do onset detection, and each of them is better suited for a different application. In the case of drums, we have tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works pretty well; but when the tempo increases, some onsets are not correctly detected, so we finally implemented the onset detection with the HFC method.

4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
This method computes the HFC of each window as in equation (4.4); note that high-frequency bins (index k) weigh more in the final value of the HFC.
\[
\mathrm{HFC}(n) = \sum_k \lvert X_k[n] \rvert^2 \cdot k \tag{4.4}
\]
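A sketch of HFC-based onset detection following Essentia's standard OnsetDetection/Onsets pattern (frame and hop sizes and the input file name are illustrative):

    # Compute an HFC onset detection function per frame, then pick onset times.
    import essentia
    import essentia.standard as es

    audio = es.MonoLoader(filename="submission.wav")()
    od_hfc = es.OnsetDetection(method="hfc")
    window, fft, c2p = es.Windowing(type="hann"), es.FFT(), es.CartesianToPolar()

    odf = []                                    # one HFC value per analysis frame
    for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
        mag, phase = c2p(fft(window(frame)))
        odf.append(od_hfc(mag, phase))

    onsets = es.Onsets()(essentia.array([odf]), [1])   # onset times in seconds
    print(onsets)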
Moreover, the function plots the audio waveform jointly with the detected onsets, to check after each test whether it has worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
With the onset information, the audio can be trimmed into the different events; the order is maintained in the name of each file, so when comparing with the expected events they can be mapped easily. The audios are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a csv.
To predict which event each slice contains, the already trained models are loaded in this new environment, and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row in the data. The described process is implemented in the first part of Assessment.ipynb6; the second part is intended to execute the visualization functions described in the next section.
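A minimal sketch of this inference step, reusing the dumped scaler and model from the training sketch above (file names are illustrative):

    # Reload the fitted scaler and SVM, and classify each sliced event.
    import joblib
    import pandas as pd

    scaler, clf = joblib.load("svm_drums.joblib")
    slices = pd.read_csv("submission_features.csv")   # one row per detected event
    predicted_events = clf.predict(scaler.transform(slices))
    print(list(predicted_events))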
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving the student visual feedback on the interpretation. As the drums classifier has taken so much time, the creation of a dataset of interpretations with their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different skill levels, but grading them and doing all the process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly: it was difficult to simulate intermediate levels and be consistent with the proposed ones.

So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Having as a base implementation the solution of my companion Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb.

First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.

6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (these come from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio.

In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic light code (green good, red bad).
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
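A simplified stand-in for save_bar_plot() (not the project's exact code; the 150 ms worst-case deviation is an assumption): waveform in grey, expected onsets in cyan, detected onsets in black, and the span between each pair coloured from green (on time) to red (far off).

    # Plot a waveform with expected/detected onsets and traffic-light deviations.
    import numpy as np
    import matplotlib.pyplot as plt

    def plot_onset_deviation(audio, sr, expected, detected, max_dev=0.15):
        t = np.arange(len(audio)) / sr
        plt.plot(t, audio, color="lightgray")
        for e, d in zip(expected, detected):
            severity = min(abs(d - e) / max_dev, 1.0)     # 0 = aligned, 1 = worst
            plt.axvspan(min(e, d), max(e, d),
                        color=(severity, 1.0 - severity, 0.0), alpha=0.6)
            plt.axvline(e, color="cyan")                  # expected onset
            plt.axvline(d, color="black", ymax=0.2)       # detected onset
        plt.show()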
Once the waveform plot is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes()10, the expected and predicted events are compared: a list of booleans is created, with 0 for the False indices and 1 for the True ones; then the resulting list is iterated and the 0 indices are checked, because most of the classification errors fail in only one of the instruments of the event (i.e. instead of hh+sd, the model predicts sd). These cases are considered partially correct, as the system has to take its own errors into account: for the indices in which one of the instruments is correctly predicted and it is not a hi-hat (we consider it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct, and red means incorrect.

9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
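A sketch of that scoring rule (not the project's exact function): 1.0 for an exact match, 0.75 when at least one non-hi-hat instrument of the event is correctly predicted, and 0.0 otherwise.

    # Score each expected/predicted event pair according to the rule above.
    def assess_notes(expected, predicted):
        scores = []
        for exp, pred in zip(expected, predicted):
            if exp == pred:
                scores.append(1.0)
            else:
                shared = set(exp.split("+")) & set(pred.split("+"))
                scores.append(0.75 if shared - {"hh"} else 0.0)
        return scores

    print(assess_notes(["hh+sd", "kd", "hh"], ["sd", "hh", "hh"]))
    # [0.75, 0.0, 1.0]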
With the waveform, the assessed notes, and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the assessed notes); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars), while the other number is the scale of the waveform and allows fitting the plot better with the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps; these are found in the txt file already mentioned in Section 3.1.1. Then, the LilyPond file; this one is the template written in the LilyPond language that defines the resultant music sheet, where the macros to change color and size and to add the waveform are defined. When extracting the musical features, each submission creates its csv file to store the information. And finally, we need, of course, the audio files with the recorded submission to be assessed.

11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding the audio signal quality and the tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB until an accumulated +30 dB.
In this chapter and Appendix B all the resultant feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.

As the high frequency content method works perfectly, there are no limitations nor errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without detecting any false positive.
1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops when the tempo increases. Having as a reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one weighs 0.5, and an incorrect one 0; the total value is the mean of the weighted results of the predictions.
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
Figure 19 Good reading and good tempo Ex 1 180 bpm

Figure 20 Good reading and good tempo Ex 1 220 bpm

In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of the classifier decreases. This may be because increasing the tempo decreases the spacing between events, and consequently the duration of each event, which leads to fewer values for calculating the mean and standard deviation when extracting the timbre characteristics. As stated in the Law of Large Numbers [25], the larger the sample, the closer the mean is to the total population mean. In this case, having fewer values in the calculation creates more outliers in the distribution, which tends to scatter.
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48

Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But in 12/8 the equivalent tempo is 300 quarter notes per minute, similar to 140 bpm in 4/4, whose equivalent tempo is 280 quarter notes per minute. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875

Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figure 23 and Figure 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that it produces high-frequency content which is predicted as a hi-hat in all the cases.
Level   Correct   Partially OK   Incorrect   Total
+0dB    25        7              0           0.89
+6dB    23        9              0           0.86
+12dB   23        9              0           0.86
+18dB   24        7              1           0.86
+24dB   18        5              9           0.64
+30dB   13        5              14          0.48

Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level   Correct   Partially OK   Incorrect   Total
+0dB    12        7              13          0.48
+6dB    13        10             9           0.56
+12dB   10        8              14          0.5
+18dB   9         2              21          0.31
+24dB   8         0              24          0.25
+30dB   9         0              23          0.28

Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier accuracy, but we think it is also important to evaluate whether the system can properly assess a student's submission.

As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time step. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in the previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized; it works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs the prediction-based assessment pretty well. This can help a student know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help the student recognize whether they are slowing down or rushing when reading the score; as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the correspondent expected onset.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. Also, a set of further improvements is given, together with a final thought on my work and my apprenticeship. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained throughout this document, we can now list them, defining their completeness and our contributions.

Firstly, the created 29k Samples Drums Dataset is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of these interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.

Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on the context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained to do is very hard for a human: differentiating drum events in an individual drum sample without any context is almost impossible, even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on the visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the 29k Samples Drums Dataset. It would be good to get recordings in different spaces, with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even different people playing. This could introduce more variance in the drum sample dataset. Moreover, on June 9th, 2021, a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it is clear that it needs improvements to ensure the overall system robustness. It would be appropriate to introduce the aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org
Also, in terms of classes in the drumset, there is a long path to cover. There are no solutions that robustly transcribe a whole set, including the toms and different kinds of cymbals. In this sense, we think that a proper approach would be to work with professional musicians, who can help researchers better understand the instrument and create datasets with different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises to give a mark to each student. In this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step would be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and other researchers and students can execute them, getting the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as is the data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository3, and some samples in Freesound4. This is the 29k Drums Samples Dataset; not all of the 40k samples used for training are our property, so we are not able to share them under our full authorship. Despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished. Nevertheless, this work still falls short in terms of robustness and reliability; but a first approximation has been presented, and several paths of improvement have been proposed.
Moreover, several fields of engineering and computer science have been covered, such as signal processing, music information retrieval, and machine learning; not only in terms of implementation, but also by investigating methods and gathering already existing experiments and results.
Regarding my relationship with computers, I have improved my fluency with git and its web counterpart, GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which was a tough path to take and accomplish. In a more advanced phase of the project, I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebook on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a lot of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of greatest personal interest. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that the system produces, and I hope that more people get interested in this field of research so that better tools appear in the future.
List of Figures

1 Datasets pre-processing
2 Sample drums score from music school Drums Grade 1
3 Microphone setup for drums recording
4 Number of samples before Train set recording
5 Number of samples after Train set recording
6 Proposed pipeline for a drums performance assessment system, inspired by [2]
7 Confusion matrix after training with the dataset in Figure 4
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
11 Onsets detected in a 60 bpm drums interpretation
12 Onsets detected in a 220 bpm drums interpretation
13 Onset deviation plot of a good tempo submission
14 Onset deviation plot of a bad tempo submission
15 Example of coloured notes
16 Good reading and good tempo Ex 1 60 bpm
17 Good reading and good tempo Ex 1 100 bpm
18 Good reading and good tempo Ex 1 140 bpm
19 Good reading and good tempo Ex 1 180 bpm
20 Good reading and good tempo Ex 1 220 bpm
21 Good reading and good tempo Ex 2 60 bpm
22 Good reading and good tempo Ex 2 100 bpm
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at each new staff
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB at each new staff
25 Bad reading and bad tempo Ex 1 100 bpm
26 Bad reading and bad tempo Ex 1 180 bpm
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60 bpm at each new staff
28 Bad reading and good tempo Ex 2 60 bpm
29 Bad reading and good tempo Ex 2 100 bpm
30 Good reading and bad tempo Ex 1 100 bpm
31 Recording routine 1
32 Recording routine 2
33 Recording routine 3
34 Drumset configuration 1
35 Drumset configuration 2
36 Drumset configuration 3
37 Good reading and bad tempo Ex 1 60 bpm
38 Bad reading and bad tempo Ex 1 60 bpm
39 Good reading and bad tempo Ex 1 140 bpm
40 Bad reading and bad tempo Ex 1 140 bpm
41 Good reading and bad tempo Ex 1 180 bpm
42 Good reading and bad tempo Ex 1 220 bpm
43 Bad reading and bad tempo Ex 1 220 bpm
44 Good reading and bad tempo Ex 2 60 bpm
45 Bad reading and bad tempo Ex 2 60 bpm
46 Good reading and bad tempo Ex 2 100 bpm
47 Bad reading and bad tempo Ex 2 100 bpm
List of Tables

1 Abbreviations' legend
2 Microphones used
3 Results of exercise 1 with different tempos
4 Results of exercise 2 with different tempos
5 Results of exercise 1 at 60 bpm with different amplification levels
6 Results of exercise 1 at 220 bpm with different amplification levels
7 Assessment result of a bad reading with different tempos, 4/4 exercise
8 Assessment result of a bad reading with different tempos, 12/8 exercise
Bibliography

[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).

[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).

[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).

[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).

[5] Wachsmann, K., Kartomi, M. J., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).

[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).

[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).

[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).

[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).

[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2003).

[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).

[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2001).

[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).

[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).

[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).

[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).

[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).

[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).

[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).

[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).

[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).

[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).

[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings – 14th International Society for Music Information Retrieval Conference (2013).

[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).

[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).

[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
[Engraved score omitted: recording routine 1, two-line stave, 60 bpm]

Figure 31 Recording routine 1
[Engraved score omitted: recording routine 2, 60 bpm]

Figure 32 Recording routine 2
[Engraved score omitted: recording routine 3, two-line stave, 60 bpm]

Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
25 Summary 13
244 Pysimmusic
Pysimmusic is a private python library developed at the MTG It offers tools to
analyze the similarity of musical performances and uses libraries such as Essentia
LilyPond FFmpeg24 ia Pysimmusic contains onset detection algorithms and a
collection of audio descriptors and evaluation algorithms By now is the main eval-
uation software used in Music Critic to compare the recording submitted with the
reference
245 Music Critic
Music Critic is a project from the MTG intended to support technologies for online
music education facilitating the assessment of student performances25
The proposed workflow starts with a student submitting a recording playing the
proposed exercise Then the submission is sent to the Music Criticrsquos server where
is analyzed and assessed Finally the student receives the evaluation jointly with
the feedback from the server
25 Summary
Music information retrieval and machine learning have been popular fields of study
This has led to a large development of methods and algorithms that will be crucial
for this project Most of them are free and open-source and fortunately the private
ones have been shared by the UPF research team which is a great base to start the
development
24httpswwwffmpegorg25httpswwwupfeduwebmtgtech-transfer-asset_publisherpYHc0mUhUQ0G
contentid229860881maximizedYJrB-usp7YV
Chapter 3
The 40kSamples Drums Dataset
As stated in section 132 having a well-annotated and balanced dataset is crucial to
get proper results In this section the 40kSamples Drums Dataset creation process is
explained first focusing on how to process existing datasets such as the mentioned
in 221 Secondly introducing the process of creating new datasets with a music
school corpus and a collection of recordings made in a recording studio Finally
describing the data augmentation procedure and how the audio samples are sliced
in individual drums events In Figure 1 we can see the different procedures to unify
the annotations of the different datasets while the audio does not need any specific
modification
3.1 Existing datasets

Each of the existing datasets has a different annotation format. In this section the process of unifying them is explained, as well as its implementation (see notebook Dataset_formatUnification.ipynb1). As the events to take into account can be single instruments or combinations of them, the annotations have to be formatted to show such events properly. None of the annotations has this approach, so we have written a function that filters the list and joins the events with a small difference of time, meaning that they are played simultaneously.

1 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
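To illustrate the idea, here is a minimal sketch of such a merging step; the function name, the tuple layout and the 30 ms tolerance are illustrative assumptions, not the notebook's exact code.

```python
def merge_simultaneous_events(annotations, tolerance=0.03):
    """Join drum events whose onsets are closer than `tolerance` seconds.

    `annotations` is a list of (timestamp, instrument) tuples; instruments
    that fall within the tolerance are joined with '+' (e.g. 'hh+kd').
    """
    merged = []
    for time, instrument in sorted(annotations):
        if merged and time - merged[-1][0] < tolerance:
            last_time, label = merged[-1]
            merged[-1] = (last_time, "+".join(sorted(label.split("+") + [instrument])))
        else:
            merged.append((time, instrument))
    return merged

# A hi-hat and a kick 10 ms apart become a single 'hh+kd' event:
print(merge_simultaneous_events([(0.00, "hh"), (0.01, "kd"), (0.50, "sd")]))
# [(0.0, 'hh+kd'), (0.5, 'sd')]
```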
Figure 1: Datasets pre-processing. Music school scores go from Sibelius to MusicXML and are parsed to txt; Studio REC annotations are written directly; IDMT Drums and MDB Drums already provide audio + txt. All sources are unified into annotations plus audio.
3.1.1 MDB Drums

This dataset was the first we worked with; its txt annotation format was a key factor, as it was easy to read and understand. As the dataset is available on Github2, there is no need to download it or process it from a local drive. As shown in the first cells of Dataset_formatUnification.ipynb, data from the repository can be retrieved with a Python wrapper of the Github API3.

This dataset has two annotation files, depending on how deep the taxonomy used is [20]. In this case the generic class taxonomy is used, as there is no need to differentiate playing styles on a given instrument (i.e. single stroke, flam, drag, ghost note).
3.1.2 IDMT Drums

Differently from the previous dataset, this one is only available by downloading a zip file4. It also differs in the annotation file format, which is xml. Using the Python package xmltodict5, in the second part of Dataset_formatUnification.ipynb the xml files are loaded as a Python dictionary and converted to txt format.

2 https://github.com/CarlSouthall/MDBDrums
3 https://pypi.org/project/githubpy/
4 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/drums.html
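A minimal sketch of this conversion is shown below; the tag names are assumptions for illustration, not necessarily the exact IDMT schema.

```python
import xmltodict

# Parse the xml annotation into nested dictionaries.
with open("annotation.xml") as f:
    doc = xmltodict.parse(f.read())

# xmltodict returns a list of dicts for repeated tags.
events = doc["instrumentRecording"]["transcription"]["event"]
with open("annotation.txt", "w") as out:
    for event in events:
        out.write(f"{event['onsetSec']}\t{event['instrument']}\n")
```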
3.2 Created datasets

In order to expand the dataset with a wider variety of samples, other methods to get data have been explored. On one hand, there is audio data with partial annotations, or with a representation that is not data-driven, such as a music sheet, which contains a visual representation of the music but not a logic annotation, as mentioned in the previous section. On the other hand, generating simple annotations is an easy task, so drums samples can be recorded standalone to create data in a controlled environment. These methods are described in the next two sections.
3.2.1 Music school

A music school has shared its teaching material with the MTG for research purposes, i.e. audio demos, books in pdf format and music sheets in Sibelius format. As we can see in Figure 1, the annotations from the music school corpus are in Sibelius format; this is an encrypted representation of the music sheet that can only be opened with the Sibelius software. The MTG has shared an AVID license, which includes the Sibelius software, so we were able to convert the sib files to MusicXML. MusicXML is not encrypted and can be opened and read, so a parser has been developed to convert the MusicXML files to a symbolic representation of the music sheet. This representation has been inspired by [24], which proposes a system to represent chords.
MusicXML parser

As mentioned in section 2.3, the MusicXML format is based on ordering the visual information with tags, creating a tree structure of nested dictionaries. In the first cell of XML_parser.ipynb6 two functions are defined. ConvertXML2Annotation reads the musicxml file and gets the general information of the song (i.e. tempo, time signature, title); then a for loop runs through all the bars of the music sheet, checking whether the given bar is self-defined, a repetition of the previous one, or the beginning or end of a repetition in the song (see Figure 2). In the self-defined case, the bar is passed to an auxiliary function, which parses it to obtain the aforementioned symbolic representation.

5 https://pypi.org/project/xmltodict/
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
Figure 2: Sample drums score from music school drums grade 1

In Figure 2 we can see a staff in which only the first bar has been written; the three others carry a symbol that means 'repetition of the previous bar'. Moreover, the bar lines at the beginning and the end indicate that these four bars have to be repeated; therefore this line of the music score represents an interpretation of eight bars, repeating the first one.
The symbolic representation that we propose, based on [24], defines each bar with a string; this string contains the representations of the events in the bar, separated by blank spaces. Each of the events has two dots (:) to separate the figure (i.e. quarter note, half note, whole note) from the note or notes of the event, which are separated by a dot (.). For instance, the symbolic representation of the first bar in Figure 2 is F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4.

In addition to this conversion, in the parse_one_measure function from the XML_parser notebook each measure is checked to ensure that it fully represents the bar. This means that the sum of the figures of the bar has to be equal to the one defined in the time signature: the sum of the events in a 4/4 bar has to be equal to four quarter notes.
Symbolic notation to unified annotation format

As we can see in Figure 1, once the music scores are converted to the symbolic representation, the last step is to unify the annotations with the format used in section 3.1. This process is made in the last cells of the Dataset_formatUnification7 notebook. A dictionary with the translation of the notes to drums instruments is defined, so each note is directly converted. Differently, the timestamp of each event has to be computed based on the tempo of the song and the figure of each event; this process is made with the function get_time_steps_from_annotations8, which reads the interpretation in symbolic notation and accumulates the duration of each event based on the figure and the tempo, as in the sketch below.
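The following sketch illustrates this accumulation; the 'notes:figure' event layout and all names are assumptions derived from the description above, not the project's exact code.

```python
def get_timestamps(bars, tempo_bpm, note_to_drum):
    """Convert symbolic bars to (timestamp, instruments) pairs.

    Each bar is a string of events like 'F4.A4:4'; the notes before the ':'
    are the joined pitches, the figure after it is the note value
    (4 = quarter note, 8 = eighth note, ...).
    """
    beat = 60.0 / tempo_bpm            # duration of a quarter note in seconds
    t, events = 0.0, []
    for bar in bars:
        for event in bar.split():
            notes, figure = event.split(":")
            drums = "+".join(sorted(note_to_drum[n] for n in notes.split(".")))
            events.append((round(t, 3), drums))
            t += beat * 4 / int(figure)  # e.g. a figure of 8 lasts half a beat
    return events

print(get_timestamps(["F4.A4:4 F4.A4:4"], 120, {"F4": "kd", "A4": "hh"}))
# [(0.0, 'hh+kd'), (0.5, 'hh+kd')]
```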
3.2.2 Studio recordings

At this point of the dataset creation we realized that the existing data was very unbalanced in terms of instances per class: some classes had around two thousand samples while others had only ten. This situation was the reason to record a personalized dataset, both to balance the overall distribution of classes and to obtain exercises read with different accuracy, simulating students with different skill levels.

The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants, Barcelona). The first day was devoted to mounting the drumset and the microphones, which are listed in Table 2. Figure 3 shows the microphone setup: differently from the standard setup, in which each instrument of the set has its own microphone, this distribution of the microphones was intended to record the whole drumset with different frequency responses.

The recording process was divided into two phases: first, creating samples to balance the dataset used to train the drums event classifier (called train set); then, recording the students' assignment simulation to test the whole system (called test set).

7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
9 https://www.stereodosis.com
Microphone             Transducer principle
Beyerdynamic TG D70    Dynamic
Shure PG52             Dynamic
Shure SM57             Dynamic
Sennheiser e945        Dynamic
AKG C314               Condenser
AKG C414               Condenser
Shure PG81             Condenser
Samson C03             Condenser

Table 2: Microphones used

Figure 3: Microphone setup for drums recording
Train set

To limit the number of classes, we decided to take into account only the classes that appear in the music school subset. This decision was motivated by the idea of assessing the songs from the books, so only the classes present in the collection of songs were needed to train the classifier. Figure 4 shows the distribution of the selected classes before the recordings; note that it is in logarithmic scale, so there is a large difference among classes.
Figure 4: Number of samples before Train set recording
To organize the recording process we designed 3 different routines; depending on the class and the number of samples already existing, a different routine was recorded. These routines were designed to represent the different speeds, dynamics and interactions between instruments of a real interpretation. The routine scores are shown in Appendix A: to write a generic routine, a two-line stave is used, where the bottom line represents the class to be recorded and the top line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound lasts a long period of time, so its tail is mixed with the subsequent sound events.

• Routine 1 (Fig. 31): intended for the classes that do not include a crash or ride cymbal and have a small number of samples (i.e. <500).
• Routine 2 (Fig. 32): does not include auxiliary events, as it is intended for classes that include a crash or ride cymbal, whose interaction with itself is intrinsic.
• Routine 3 (Fig. 33): a short version of routine 1, which repeats each bar two times instead of four; intended for classes that do not include a crash or ride cymbal and have a large number of samples (i.e. >500).
Routines 1 and 3 were recorded only once, as we had only one instrument for each of those classes; differently, routine 2 was recorded twice for each cymbal, as we were able to use more instances of them. The different cymbal configurations used can be seen in Appendix A, in Figures 34, 35 and 36.

After the Train set recording the number of samples was more balanced; as shown in Figure 5, all the classes have at least 1500 samples.
Figure 5: Number of samples after Train set recording (per-class bar plot comparing samples existing before with the newly recorded ones; all classes reach at least 1500 samples)
Test set

The test set recording tried to simulate different students performing the same song on the same drumset. To do that, we recorded each song of the music school Drums Grade Initial and Grade 1, first playing it correctly and then making mistakes both in reading and in rhythm. After testing with these recordings, we realized that we were not able to test the limits of the assessment system in terms of tempo or with different time signatures. So we proposed two groove reading exercises, in 4/4 and in 12/8, to be performed at different tempos; these recordings have been done in my study room with my laptop's microphone.
3.3 Data augmentation

As described in section 2.1.2, data augmentation aims to introduce changes to the signals to optimize the statistical representation of the dataset. To implement this task, the aforementioned Python library audiomentations is used.

The audiomentations library has a class called Compose, which allows collecting different processing functions and assigning a probability to each of them. The Compose instance can then be called several times with the same audio file, and each time the resulting audio will be processed differently because of the probabilities. In data_augmentation.ipynb10 a possible implementation is shown, as well as some plots of the original sample with different results of applying the created Compose to the same sample; an example of the results can be listened to in Freesound11.

The processing functions introduced in the Compose class are based on the ones proposed in [13] and [14]; their parameters are described below, followed by a sketch of a possible Compose configuration:
• Add gaussian noise, with 70% probability.
• Time stretch between 0.8 and 1.25, with 50% probability.
• Time shift forward a maximum of 25% of the duration, with 50% probability.
• Pitch shift ±2 semitones, with 50% probability.
• Apply mp3 compression, with 50% probability.
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213
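A possible Compose configuration implementing the list above could look as follows; the argument names follow the audiomentations API available at the time of writing and may differ in newer releases.

```python
import numpy as np
import soundfile as sf
from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                             PitchShift, Shift, TimeStretch)

# One transform per bullet above, with the stated probabilities.
augment = Compose([
    AddGaussianNoise(p=0.7),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    Shift(min_fraction=0.0, max_fraction=0.25, rollover=False, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    Mp3Compression(p=0.5),
])

audio, sr = sf.read("kd_sample.wav")  # hypothetical input sample
for i in range(5):                    # each call draws new random transforms
    sf.write(f"kd_aug_{i}.wav",
             augment(samples=audio.astype(np.float32), sample_rate=sr), sr)
```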
3.4 Drums events trim

As will be explained in section 4.2.1, the dataset has to be trimmed into individual files in order to analyze them and extract the low-level descriptors. This process has been implemented in the Dataset_featureExtraction.ipynb12 notebook, slicing all the audios with their annotations, each dataset separately, to sight-check all the resulting samples and better detect which annotations were not correct.
3.5 Summary

To summarize, a drums samples dataset has been created; the one used in this project will be called the 40kSamples Drums Dataset. Nonetheless, to share this dataset we have to ensure that we fully own the data, which means that the samples that come from the IDMT, MDB Drums and music school datasets cannot be shared in another dataset. Alternatively, we will share the 29kSamples Drums Dataset, formed only by the samples recorded in the studio. This dataset will be available in Zenodo13, to download the whole dataset at once, and in Freesound, where some selected samples are uploaded in a pack14 to show the differences among microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592
14 https://freesound.org/people/MaciaAC/packs/32397
Chapter 4

Methodology

In this chapter the methodologies followed in the development of the assessment pipeline are explained. Figure 6 shows the proposed pipeline diagram, inspired by [2]. Each box of the diagram refers to a section in this chapter, so the diagram might be helpful to get a general idea of the problem while each process is explained.

The system is divided into two main processes. First, the top boxes correspond to the training process of the model, using the dataset created in the previous chapter. Secondly, the bottom row shows how a student submission is processed to generate feedback. This feedback is the output of the system and should give the student some indications on how they have performed and how they can improve.
4.1 Problem definition

To check whether a student reads a music sheet correctly, we need a tool that tags which instruments of the drumset are playing for each detected event. This leads us to develop and train a drums event classifier; if this tool ensures a good accuracy when classifying (i.e. >95%), we will be able to properly assess a student's recording. If the classifier is not accurate enough, the system will not be useful, as we will not be able to differentiate between errors from the student and errors from the classifier.
Figure 6: Proposed pipeline for a drums performance assessment system, inspired by [2]. Top row (training): music scores, students' performances and assessments provide the annotations and audio recordings of the dataset, from which features are extracted to train the drums event classifier and the performance assessment. Bottom row (inference): a new student's recording goes through feature extraction and performance assessment inference to produce a visualization and performance feedback.
For this reason, the project has been mainly focused on developing the aforementioned drums event classifier and a proper dataset. Consequently, building a properly assessed dataset of drums interpretations has not been possible, nor has the performance assessment training. Despite this, the feedback visualization has been developed, as it is a nice way to close the pipeline and obtain some understandable results; moreover, the performance feedback can be focused on deterministic aspects, such as telling the student whether they are rushing or slowing down in relation to a given tempo.
4.2 Drums event classifier

As already mentioned, this part has been the main workload of the project, because a reliable assessment depends on a correct automatic transcription. The process has been divided into 3 main parts: extracting the musical features, training and validating the model in an iterative process, and finally testing the model with totally new data.
4.2.1 Feature extraction

The feature extraction concept has been explained in section 2.1.1 and has been implemented using the MusicExtractor()1 method from the Essentia library. MusicExtractor() has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()2 has been implemented to loop over a list of files and analyze each of them, adding the extracted features to a csv file jointly with the class of each drum event. At this point all the low-level features were extracted; both the mean and the standard deviation were computed across all the frames of the given audio file. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.
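A condensed sketch of this extraction loop follows; the window/hop sizes and the csv layout are assumptions, not the notebook's exact values.

```python
import csv
import essentia.standard as es

def extract_features(filenames, labels, out_csv, frame_size=2048, hop_size=1024):
    """Write per-file low-level descriptors (mean/stdev stats) plus a class label."""
    extractor = es.MusicExtractor(lowlevelStats=["mean", "stdev"],
                                  lowlevelFrameSize=frame_size,
                                  lowlevelHopSize=hop_size)
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        for filename, label in zip(filenames, labels):
            features, _ = extractor(filename)   # MusicExtractor needs a filename
            # Keep only the scalar low-level statistics for the csv row.
            scalars = sorted(d for d in features.descriptorNames()
                             if d.startswith("lowlevel.")
                             and isinstance(features[d], float))
            writer.writerow([features[d] for d in scalars] + [label])
```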
As mentioned in section 3.4, the fact that the MusicExtractor() method has to be called with a filename, not an audio stream, forced us to create another version of the dataset in which each event is annotated in a different audio file, with the correspondent class label as filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples) and the features were saved in different csv files, one for each dataset3. Adding up the number of instances in all the csv files, we get 40228 instances with 84 features and 1 label.

1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices_features
4.2.2 Training and validating

As mentioned in section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) for sound event classification, and some authors have developed more complex methods for drums event classification. The complexity of these last methods made us choose the generic ones, also to check whether they were a good way to approach the problem, as there is no literature specifically on drums event classification with SVM or KNN.

The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40kSamples Drums Dataset. The first times we trained the models we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula 4.1, which does not take into account the unbalance of the dataset. The accuracy computation was around 92%, but the correct predictions were mainly in the large classes: as shown in Figure 7, some classes had very low accuracy (even 0%, as some classes have 10 samples, 7 used to train and 3 to validate, all badly predicted), but having a small number of instances affects the accuracy computation less.
\[ \mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} 1(\hat{y}_i = y_i) \tag{4.1} \]
The proper way to compute the accuracy on this kind of dataset is instead the balanced accuracy: it computes the accuracy for each class and then averages it across all the classes, as in formula 4.2, where the sample weights \(w_i\) are normalized so that each class contributes equally. This computation lowered the result to 79%, which was not a good result.
\[ \hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \tag{4.2} \]
\[ \text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i \]
Figure 7: Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score, which combines the precision and the recall of the model in one measure, as in formula 4.3. Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of instances that should be predicted as a given class.
\[ F\text{-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{4.3} \]
These results led us to record a personalized dataset to extend the existing one (see section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as did the balanced accuracy and f-score (both 89%). Until this point we were using both KNN and SVM models to compare results; the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8: Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine controls the regularization (a smaller C means stronger regularization); this technique is intended to make the model less sensitive to data noise and to outliers that may not represent the class properly. When increasing this value to 10 the results improved across all the classes, as shown in Figure 9, as did the accuracy and f-score (both 95%).

At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was somewhat of a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So we tried the same process with the classes that include only the three mentioned instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and f-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9: Confusion matrix after training with the dataset in Figure 5 and parameter C = 10

Figure 10: Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The training and validating iterative process has been implemented in the Classifier_training.ipynb4 notebook. First, the csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered, and to remove redundant features the ones with a very low standard deviation are deleted (i.e. std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to move all the data distributions around 0, ensuring a standard deviation of 1.

In the next cells the dataset is split into train and validation sets and the training method of the sklearn SVM is called to perform the training. When the models are trained, the parameters are dumped to a file in order to load the model a posteriori and apply the learned knowledge to new data. This process was very slow on my computer, so we decided to upload the csv files to Google Drive and open the notebook with Google Colaboratory, as it was faster, which is a key feature to avoid long waiting times during the iterative train-validate process. In the last cells the inference is made with the validation set, the accuracy is computed and the confusion matrix is plotted to get an idea of which classes perform better.
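The following sketch condenses these steps under assumed file names and column layout; it is not the notebook's exact code.

```python
import joblib
import pandas as pd
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = pd.read_csv("features.csv")                 # hypothetical merged csv
X, y = data.drop(columns=["label"]), data["label"]
X = X.loc[:, X.std() > 1e-5]                       # drop near-constant features

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y)
scaler = StandardScaler().fit(X_tr)                # zero mean, unit variance

model = SVC(C=10).fit(scaler.transform(X_tr), y_tr)
y_pred = model.predict(scaler.transform(X_val))
print(balanced_accuracy_score(y_val, y_pred),
      f1_score(y_val, y_pred, average="macro"))

joblib.dump((scaler, model), "svm_drums.joblib")   # reload later for inference
```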
4.2.3 Testing

Testing the model introduces the concept of onset detection: until now all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection5 does both tasks. As explained in section 2.1.1, there are many onset detection methods and each of them suits a different application. In the case of drums we tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works pretty well; however, when the tempo increases some onsets are not correctly detected, so we finally implemented the onset detection with the HFC method.

4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
This method computes the HFC of each analysis window as in equation 4.4; note that high-frequency bins (larger k) weigh more in the final value of the HFC:

\[ \mathrm{HFC}(n) = \sum_{k} k \, |X_k[n]|^2 \tag{4.4} \]
Moreover, the function plots the audio waveform jointly with the detected onsets, to check after each test that the detection has worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
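Below is a minimal sketch of such an HFC-based detector using Essentia's standard-mode algorithms; the frame and hop sizes are assumptions, not the project's exact values.

```python
import essentia
import essentia.standard as es

def detect_onsets_hfc(filename, frame_size=1024, hop_size=512):
    """Return onset times (in seconds) using Essentia's HFC onset detection."""
    audio = es.MonoLoader(filename=filename)()
    window, fft, c2p = es.Windowing(type="hann"), es.FFT(), es.CartesianToPolar()
    onset_detection = es.OnsetDetection(method="hfc")
    curve = []
    for frame in es.FrameGenerator(audio, frameSize=frame_size, hopSize=hop_size):
        magnitude, phase = c2p(fft(window(frame)))
        curve.append(onset_detection(magnitude, phase))
    # Onsets() peak-picks the detection curve; a single curve gets weight 1.
    return es.Onsets()(essentia.array([curve]), [1])
```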
Figure 11: Onsets detected in a 60 bpm drums interpretation

Figure 12: Onsets detected in a 220 bpm drums interpretation
With the onset information the audio can be trimmed into the different events; the order is maintained through the file names, so the events can easily be mapped to the expected ones when comparing. The audio slices are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a csv file.
To predict which event each slice is, the already trained models are loaded in this new environment and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row in the data, as in the sketch below. The described process is implemented in the first part of Assessment.ipynb6; the second part is intended to execute the visualization functions described in the next section.
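A minimal inference sketch, assuming the scaler and model were dumped together as in the training sketch above (file names are hypothetical):

```python
import joblib
import pandas as pd

# Reload the pre-processing and the trained SVM, then classify the new slices.
scaler, model = joblib.load("svm_drums.joblib")
slices = pd.read_csv("submission_features.csv")   # one row per sliced event
predicted_events = model.predict(scaler.transform(slices))
print(list(predicted_events))                     # e.g. ['hh+kd', 'hh', 'hh+sd', ...]
```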
4.3 Music performance assessment

Finally, as already commented, the assessment part has been focused on giving the student visual feedback of the interpretation. As the drums classifier has taken so much time, the creation of a dataset with interpretations and their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different skill levels, but grading them and doing the whole process ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and stay consistent with the proposed grades.

So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization

With the data gathered in the testing section, feedback on the interpretation has to be returned. Taking as a base implementation the solution of my colleague Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb.

First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.

6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (this comes from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio.

In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; as the difference between the two values increases, the area between them is colored with a traffic-light code (green good, red bad).
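A minimal sketch of such a deviation plot; the exact colour mapping and the 100 ms saturation point are assumptions, not the project's values.

```python
import matplotlib.pyplot as plt
import numpy as np

def save_bar_plot(detected, expected, waveform, sr, out_png):
    """Plot the waveform with detected (black) and expected (cyan) onsets,
    shading the deviation between paired onsets from green to red."""
    t = np.arange(len(waveform)) / sr
    fig, ax = plt.subplots(figsize=(12, 2))
    ax.plot(t, waveform, color="0.8")
    for d, e in zip(detected, expected):
        ax.axvline(d, ymax=0.2, color="black")           # detected, bottom
        ax.axvline(e, ymin=0.4, ymax=0.6, color="cyan")  # expected, middle
        badness = min(abs(d - e) / 0.1, 1.0)             # 100 ms => fully red
        ax.axvspan(min(d, e), max(d, e),
                   color=(badness, 1 - badness, 0), alpha=0.4)
    fig.savefig(out_png, bbox_inches="tight")
    plt.close(fig)
```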
Figure 13: Onset deviation plot of a good tempo submission

Figure 14: Onset deviation plot of a bad tempo submission
Once the waveform plot is created, it is embedded in a lambda function that is called from the LilyPond render. Before calling LilyPond, though, the assessment of the notes has to be done. In the function assess_notes()10 the expected and predicted events are compared, producing a list with 1 at the indices where they match and 0 where they do not. The 0 indices are then re-checked, because most of the classification errors miss only one of the instruments to be predicted (i.e. instead of hh+sd the classifier predicts sd). These cases are considered partially correct, as the system has to take its own errors into account: when one of the instruments is correctly predicted and it is not a hi-hat (we consider getting the snare and kick reading right more important than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green partially correct and red incorrect; a sketch of this grading logic follows the figure.

9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15: Example of coloured notes
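A sketch of this grading logic, assuming events are encoded as '+'-joined instrument labels (the encoding and thresholds mirror the description above, not the exact project code):

```python
def assess_notes(expected, predicted):
    """Return per-event scores: 1.0 correct, 0.75 partially correct
    (a non-hi-hat instrument matches), 0.0 incorrect."""
    scores = []
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            scores.append(1.0)
        else:
            shared = set(exp.split("+")) & set(pred.split("+"))
            scores.append(0.75 if shared - {"hh"} else 0.0)
    return scores

print(assess_notes(["hh+sd", "hh+kd", "sd"], ["sd", "hh", "sd"]))
# [0.75, 0.0, 1.0]
```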
With the waveform, the notes assessed and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the assessed notes); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars); the other number is the scale of the waveform and allows fitting the plot to the staff.
4.3.2 Files used

The assessment process of an exercise needs several files: first, the annotations of the expected events and their timesteps, found in the txt file already mentioned in section 3.1.1; then the LilyPond file, the template written in the LilyPond language that defines the resulting music sheet, where the macros to change color and size and to add the waveform are defined; when extracting the musical features, each submission creates its csv file to store the information; and finally, of course, the audio files with the recorded submission to be assessed.

11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5

Results

At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB at a time, up to an accumulated +30 dB.

In this chapter and Appendix B all the resulting feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.

As the high frequency content method works perfectly, there are no limitations or errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positive.

1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations

One of the limitations of the system is the tempo of the exercise: the accuracy drops as the tempo increases. Taking as a reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one 0.5 and an incorrect one 0; the total value is the mean of the weighted results of the predictions, as in the small example below.
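For clarity, the scoring used in the following tables can be written as:

```python
def exercise_score(correct, partial, incorrect):
    """Weighted score used in Tables 3-8: correct = 1.0, partial = 0.5."""
    total = correct + partial + incorrect
    return (correct + 0.5 * partial) / total

print(round(exercise_score(25, 7, 0), 2))  # 0.89, the 60 bpm row of Table 3
```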
Figure 16: Good reading and good tempo, Ex. 1, 60 bpm

Figure 17: Good reading and good tempo, Ex. 1, 100 bpm

Figure 18: Good reading and good tempo, Ex. 1, 140 bpm
In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of the classifier decreases. This may be because increasing the tempo decreases the spacing between events, and consequently the duration of each event, which leads to fewer values to calculate the mean and standard deviation when extracting the timbre characteristics. As stated in the law of large numbers [25], the larger the sample, the closer its mean is to the population mean; in this case, having fewer values in the computation produces more outliers in the distribution, which tends to scatter.

Figure 19: Good reading and good tempo, Ex. 1, 180 bpm

Figure 20: Good reading and good tempo, Ex. 1, 220 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48

Table 3: Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But at 100 bpm in 12/8 the subdivision rate is 300 eighth notes per minute, similar to the 140 bpm case in 4/4, which corresponds to 280 eighth notes per minute. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.

Figure 21: Good reading and good tempo, Ex. 2, 60 bpm

Figure 22: Good reading and good tempo, Ex. 2, 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875

Table 4: Results of exercise 2 with different tempos
5.2 Saturation limitations

Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at that gain level the overall waveform is so clipped that the resulting high-frequency content is predicted as a hi-hat in all the cases.
Level   Correct   Partially OK   Incorrect   Total
+0dB    25        7              0           0.89
+6dB    23        9              0           0.86
+12dB   23        9              0           0.86
+18dB   24        7              1           0.86
+24dB   18        5              9           0.64
+30dB   13        5              14          0.48

Table 5: Results of exercise 1 at 60 bpm with different amplification levels
Level   Correct   Partially OK   Incorrect   Total
+0dB    12        7              13          0.48
+6dB    13        10             9           0.56
+12dB   10        8              14          0.5
+18dB   9         2              21          0.31
+24dB   8         0              24          0.25
+30dB   9         0              23          0.28

Table 6: Results of exercise 1 at 220 bpm with different amplification levels
Figure 23: Good reading and good tempo, Ex. 1, 60 bpm, accumulating +6dB at each new staff

Figure 24: Good reading and good tempo, Ex. 1, 220 bpm, accumulating +6dB at each new staff
5.3 Evaluation of the assessment

Until now the evaluation of the results has been focused on the drums event classifier accuracy, but we think it is also important to evaluate whether the system can properly assess a student's submission.

As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar, and that the rest of the beats have to come after this interval.

Figure 25: Bad reading and bad tempo, Ex. 1, 100 bpm

Figure 26: Bad reading and bad tempo, Ex. 1, 180 bpm
To evaluate the assessment we proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'bad reading, good tempo' ones, shown in Figures 27, 28 and 29.

Figure 27: Bad reading and good tempo, Ex. 1; starts at 60 bpm and adds 60 bpm at each new staff

Figure 28: Bad reading and good tempo, Ex. 2, 60 bpm

Figure 29: Bad reading and good tempo, Ex. 2, 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7: Assessment result of a bad reading with different tempos, 4/4 exercise

Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8: Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs the prediction-based assessment pretty well. This can be helpful for a student to know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help the student recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (black lines at the bottom of the waveform) are mostly behind the correspondent expected onsets.

Figure 30: Good reading and bad tempo, Ex. 1, 100 bpm
Chapter 6

Discussion and conclusions

At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. Also, a set of further improvements is given, together with a final thought on my work and my apprenticeship. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results

Having in mind all the concepts explained throughout this document, we can now list them, defining their completeness and our contributions.

Firstly, the 29kSamples Drums Dataset has been created and is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of such interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.

Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on
classifying drums events in this manner. As the results have shown, more complex, context-based methods might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained for is very hard for a human being: differentiating drums events in an individual drum sample without any context is almost impossible, even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they could be improved, as they are mostly focused on the visualization, not on the symbolic representation of events and timesteps.

Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the studio-recorded 29kSamples Drums Dataset. It would be good to get recordings in different spaces, with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work

In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even different people playing. This would introduce more variance in the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.

Regarding the classification model, it clearly needs improvements to ensure the overall system robustness. It would be appropriate to introduce the aforementioned methods from [16], [17] and [26] in the ADT part of the pipeline.

1 https://www.2021.ieeeicassp.org
Also, in terms of the classes in the drumset, there is a long path to cover. There are no solutions that robustly transcribe a whole set, including the toms and different kinds of cymbals. In this sense, we think that a proper approach would be to work with professional musicians, who can help researchers better understand the instrument and create datasets with different techniques.

With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises to give each student a mark. On this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step would be omitted.

About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility

In computational sciences a work is reproducible if the code and data are available and other researchers or students can execute them, obtaining the same results.

All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.

The data created, i.e. the studio recordings, is available in a Zenodo repository3 and some samples in Freesound4. This is the 29kSamples Drums Dataset: not all the 40k samples used for training are our property, so we are not able to share them under our full authorship; despite this, the other datasets used in this project are available individually.

2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions

This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although the work still falls short in terms of robustness and reliability. A first approximation has been presented, as well as several proposed paths of improvement.

Moreover, several fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also investigating methods and gathering already existing experiments and results.

About my relationship with computers, I have improved my fluency with git and its web counterpart GitHub. Also, at the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which has been a tough path to take. In a more advanced phase of the project, I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.

In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that results from the system, and I hope that more people get interested in this field of research so that better tools appear in the future.
List of Figures

1 Datasets pre-processing
2 Sample drums score from music school drums grade 1
3 Microphone setup for drums recording
4 Number of samples before Train set recording
5 Number of samples after Train set recording
6 Proposed pipeline for a drums performance assessment system, inspired by [2]
7 Confusion matrix after training with the dataset in Figure 4
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
11 Onsets detected in a 60 bpm drums interpretation
12 Onsets detected in a 220 bpm drums interpretation
13 Onset deviation plot of a good tempo submission
14 Onset deviation plot of a bad tempo submission
15 Example of coloured notes
16 Good reading and good tempo, Ex. 1, 60 bpm
17 Good reading and good tempo, Ex. 1, 100 bpm
18 Good reading and good tempo, Ex. 1, 140 bpm
19 Good reading and good tempo, Ex. 1, 180 bpm
20 Good reading and good tempo, Ex. 1, 220 bpm
21 Good reading and good tempo, Ex. 2, 60 bpm
22 Good reading and good tempo, Ex. 2, 100 bpm
23 Good reading and good tempo, Ex. 1, 60 bpm, accumulating +6dB at each new staff
24 Good reading and good tempo, Ex. 1, 220 bpm, accumulating +6dB at each new staff
25 Bad reading and bad tempo, Ex. 1, 100 bpm
26 Bad reading and bad tempo, Ex. 1, 180 bpm
27 Bad reading and good tempo, Ex. 1; starts at 60 bpm and adds 60 bpm at each new staff
28 Bad reading and good tempo, Ex. 2, 60 bpm
29 Bad reading and good tempo, Ex. 2, 100 bpm
30 Good reading and bad tempo, Ex. 1, 100 bpm
31 Recording routine 1
32 Recording routine 2
33 Recording routine 3
34 Drumset configuration 1
35 Drumset configuration 2
36 Drumset configuration 3
37 Good reading and bad tempo, Ex. 1, 60 bpm
38 Bad reading and bad tempo, Ex. 1, 60 bpm
39 Good reading and bad tempo, Ex. 1, 140 bpm
40 Bad reading and bad tempo, Ex. 1, 140 bpm
41 Good reading and bad tempo, Ex. 1, 180 bpm
42 Good reading and bad tempo, Ex. 1, 220 bpm
43 Bad reading and bad tempo, Ex. 1, 220 bpm
44 Good reading and bad tempo, Ex. 2, 60 bpm
45 Bad reading and bad tempo, Ex. 2, 60 bpm
46 Good reading and bad tempo, Ex. 2, 100 bpm
47 Bad reading and bad tempo, Ex. 2, 100 bpm

List of Tables

1 Abbreviations' legend
2 Microphones used
3 Results of exercise 1 with different tempos
4 Results of exercise 2 with different tempos
5 Results of exercise 1 at 60 bpm with different amplification levels
6 Results of exercise 1 at 220 bpm with different amplification levels
7 Assessment result of a bad reading with different tempos, 4/4 exercise
8 Assessment result of a bad reading with different tempos, 12/8 exercise
Bibliography

[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums — an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A

Studio recording media

Figure 31: Recording routine 1

Figure 32: Recording routine 2

Figure 33: Recording routine 3

Figure 34: Drumset configuration 1

Figure 35: Drumset configuration 2

Figure 36: Drumset configuration 3
Appendix B

Extra results

Figure 37: Good reading and bad tempo, Ex. 1, 60 bpm

Figure 38: Bad reading and bad tempo, Ex. 1, 60 bpm

Figure 39: Good reading and bad tempo, Ex. 1, 140 bpm

Figure 40: Bad reading and bad tempo, Ex. 1, 140 bpm

Figure 41: Good reading and bad tempo, Ex. 1, 180 bpm

Figure 42: Good reading and bad tempo, Ex. 1, 220 bpm

Figure 43: Bad reading and bad tempo, Ex. 1, 220 bpm

Figure 44: Good reading and bad tempo, Ex. 2, 60 bpm

Figure 45: Bad reading and bad tempo, Ex. 2, 60 bpm

Figure 46: Good reading and bad tempo, Ex. 2, 100 bpm

Figure 47: Bad reading and bad tempo, Ex. 2, 100 bpm
Chapter 3
The 40kSamples Drums Dataset
As stated in section 132 having a well-annotated and balanced dataset is crucial to
get proper results In this section the 40kSamples Drums Dataset creation process is
explained first focusing on how to process existing datasets such as the mentioned
in 221 Secondly introducing the process of creating new datasets with a music
school corpus and a collection of recordings made in a recording studio Finally
describing the data augmentation procedure and how the audio samples are sliced
in individual drums events In Figure 1 we can see the different procedures to unify
the annotations of the different datasets while the audio does not need any specific
modification
31 Existing datasets
Each of the existing datasets has a different annotation format in this section the
process of unifying them will be explained as well as its implementation (see note-
book Dataset_formatUnificationipynb1) As the events to take into account
can be single instruments or combinations of them the annotations have to be for-
matted to show that events properly None of the annotations has this approach
so we have written a function that filters the list and joins the events with a small
1httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_formatUnificationipynb
14
31 Existing datasets 15
difference of time meaning that they are played simultaneously
Music school Studio REC IDMT Drums MDB Drums
audio + txt
Sibelius to MusicXML
MusicXML parser to txt
Write annotations
AnnotationsAudio
Figure 1 Datasets pre-processing
311 MDB Drums
This dataset was the first we worked with the annotation format in txt was a key
factor as it was easy to read and understand As the dataset is available in Github2
there is no need to download it neither process it from a local drive As shown in
the first cells of Dataset_formatUnificationipynb data from the repository can
be retrieved with a Python wrapper of the Github API3
This dataset has two annotations files depending on how deep the taxonomy used
is [20] In this case the generic class taxonomy is used as there is no need to
differentiate styles when playing a given instrument (ie single stroke flam drag
ghost note)
312 IDMT Drums
Differently to the previous dataset this one is only available downloading a zip
file4 It also differs in the annotation file format which is xml Using the Python
2httpsgithubcomCarlSouthallMDBDrums3httpspypiorgprojectgithubpy4httpswwwidmtfraunhoferdeenbusiness_unitsm2dsmtdrumshtml
16 Chapter 3 The 40kSamples Drums Dataset
package xmltodict5 in the second part of Dataset_formatUnificationipynb the
xml files are loaded as a Python dictionary and converted to txt format
32 Created datasets
In order to expand the dataset with more variety of samples other methods to get
data have been explored On one hand with audio data that has partial annotations
or some representation that is not data-driven such as a music sheet that contains
a visual representation of the music but not a logic annotation as mentioned in
the previous section On the other hand generating simple annotations is an easy
task so drums samples can be recorded standalone to create data in a controlled
environment In the next two sections these methods are described
3.2.1 Music school

A music school has shared its teaching material with the MTG for research purposes, i.e., audio demos, books in pdf format and music sheets in Sibelius format. As we can see in Figure 1, the annotations from the music school corpus are in Sibelius format; this is an encrypted representation of the music sheet that can only be opened with the Sibelius software. The MTG has shared an AVID license which includes the Sibelius software, so we were able to convert the .sib files to MusicXML. MusicXML is not encrypted and can be opened and read, so a parser has been developed to convert the MusicXML files to a symbolic representation of the music sheet. This representation has been inspired by [24], which proposes a system to represent chords.
MusicXML parser
As mentioned in section 2.3, the MusicXML format is based on ordering the visual information with tags, creating a tree structure of nested dictionaries. In the first cell of XML_parser.ipynb⁶, two functions are defined. ConvertXML2Annotation reads the .musicxml file and gets the general information of the song (i.e., tempo, time signature, title); then a for loop runs throughout all the bars of the music sheet, checking whether the given bar is self-defined, the repetition of the previous one, or the beginning or end of a repetition in the song (see Figure 2). In the self-defined case, the bar is passed to an auxiliary function which parses it, producing the aforementioned symbolic representation.

⁶ https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/XML_parser.ipynb
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three others have a symbol that means 'repetition of the previous bar'; moreover, the bar lines at the beginning and the end indicate that these four bars have to be repeated. Therefore, this line in the music score represents an interpretation of eight bars, repeating the first one.
The symbolic representation that we propose, based on [24], defines each bar with a string; this string contains the representations of the events in the bar, separated by blank spaces. Each of the events uses a colon (:) to separate the note or notes of the event, which are separated by a dot (.), from the figure (i.e., quarter note, half note, whole note). For instance, the symbolic representation of the first bar in Figure 2 is F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4.
In addition to this conversion, in the parse_one_measure function from the XML_parser notebook each measure is checked to ensure that it fully represents the bar. This means that the sum of the figures of the bar has to be equal to the duration defined by the time signature; e.g., the sum of the events in a 4/4 bar has to be equal to four quarter notes.
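A minimal sketch of such a completeness check, assuming the notes:figure event format reconstructed above (4 for a quarter note, 8 for an eighth note, and so on):

```python
from fractions import Fraction

# Check that the events of a bar in the symbolic notation add up to the
# duration declared by the time signature, as parse_one_measure does.
def bar_is_complete(bar: str, numerator: int, denominator: int) -> bool:
    total = Fraction(0)
    for event in bar.split():
        _notes, figure = event.split(":")
        total += Fraction(1, int(figure))
    return total == Fraction(numerator, denominator)

print(bar_is_complete("F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4", 4, 4))  # True
```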
Symbolic notation to unified annotation format
As we can see in Figure 1, once the music scores are converted to the symbolic representation, the last step is to unify the annotations with the format used in section 3.1. This process is made in the last cells of the Dataset_formatUnification notebook⁷. A dictionary with the translation of the notes to drum instruments is defined, so the note conversion is direct. Differently, the timestamp of each event has to be computed based on the tempo of the song and the figure of each event; this process is made with the function get_time_steps_from_annotations⁸, which reads the interpretation in symbolic notation and accumulates the duration of each event based on the figure and the tempo.

⁷ https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
⁸ https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
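The following sketch mirrors that accumulation logic under the same assumed notation; it is not the actual get_time_steps_from_annotations code.

```python
# Accumulate the onset time of each event from the tempo and the figure.
# A figure value f is assumed to last 4/f beats (quarter note = 1 beat).
def time_steps(bars, bpm):
    seconds_per_beat = 60.0 / bpm
    t, onsets = 0.0, []
    for bar in bars:
        for event in bar.split():
            notes, figure = event.split(":")
            onsets.append((t, notes))
            t += (4.0 / int(figure)) * seconds_per_beat
    return onsets

for onset, notes in time_steps(["F4.A4:4 F4.A4:4 F4.A4:4 F4.A4:4"], 60):
    print(f"{onset:.2f}s {notes}")   # events at 0, 1, 2 and 3 seconds
```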
3.2.2 Studio recordings

At this point of the dataset creation we realized that the already existing data was very unbalanced in terms of instances per class: some classes had around two thousand samples while others had only ten. This situation was the reason to record a personalized dataset to balance the overall distribution of classes, as well as exercises read with different accuracy, simulating students with different skill levels.
The recording process took place on April 16 and 17 at Stereodosis Estudio⁹ (Sants, Barcelona); the first day was intended to mount the drumset and the microphones, which are listed in Table 2. In Figure 3 the microphone setup is shown; differently from the standard setup, in which each instrument of the set has its own microphone, this distribution of the microphones was intended to record the whole drumset with different frequency responses.

The recording process was divided into two phases: first, creating samples to balance the dataset used to train the drums event classifier (called train set); then, recording the students' assignment simulation to test the whole system (called test set).

⁹ https://www.stereodosis.com
Microphone            Transducer principle
Beyerdynamic TG D70   Dynamic
Shure PG52            Dynamic
Shure SM57            Dynamic
Sennheiser e945       Dynamic
AKG C314              Condenser
AKG C414              Condenser
Shure PG81            Condenser
Samson C03            Condenser

Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only the classes that appear in the music school subset. This decision was motivated by the idea of assessing the songs from the books, so only classes present in the collection of songs were needed to train the classifier. In Figure 4 the distribution of the selected classes before the recordings is shown; note that it is in logarithmic scale, so there is a large difference among classes.
Figure 4 Number of samples before Train set recording
To organize the recording process we designed three different routines to record; depending on the class and the number of samples already existing, a different routine was recorded. These routines were designed trying to represent the different speeds, dynamics and interactions between instruments of a real interpretation. In Appendix A the routines' scores are shown. To write a generic routine, a two-line stave is used: the bottom line represents the class to be recorded and the top line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound lasts a long period of time and whose tail mixes with the subsequent sound events.
• Routine 1 (Fig. 31). This routine is intended for the classes that do not include a crash or ride cymbal and have a small number of samples (i.e., <500).

• Routine 2 (Fig. 32). This routine does not include auxiliary events, as it is intended for classes that include a crash or ride cymbal, whose interaction with itself is intrinsic.

• Routine 3 (Fig. 33). This is a short version of routine 1 which only repeats each bar two times instead of four; it is intended for classes that do not include a crash or ride cymbal and have a large number of samples (i.e., >500).
Routines 1 and 3 were recorded only one time, as we had only one instrument for each of the classes; differently, routine 2 was recorded two times for each cymbal, as we were able to use more instances of them. The different cymbal configurations used can be seen in Appendix A, in Figures 34, 35 and 36.

After the Train set recording the number of samples was more balanced: as shown in Figure 5, all the classes have at least 1500 samples.
[Bar chart, one bar per class (hh, hh+kd, hh+sd, hh+kd+sd, kd, sd, kd+sd, cy, cy+kd, cy+sd, cr, cr+kd, cr+sd, ft, ft+kd, ft+sd, ft+kd+sd, ht, ht+kd, mt, kd+mt), comparing the number of samples recorded before and during the Train set recording; counts range from 0 to about 3000 per class]
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song on the same drumset. To do that, we recorded each song of the music school Drums Grade Initial and Grade 1, playing it correctly and then making mistakes in both reading and rhythmic ways. After testing with these recordings we realized that we were not able to test the limits of the assessment system in terms of tempo or with different rhythmic measures. So we proposed two groove reading exercises, in 4/4 and in 12/8, to be performed at different tempos; these recordings have been done in my study room with my laptop's microphone.
3.3 Data augmentation

As described in section 2.1.2, data augmentation aims to introduce changes to the signals to optimize the statistical representation of the dataset. To implement this task, the aforementioned Python library audiomentations is used.

The library audiomentations has a class called Compose which allows collecting different processing functions, assigning a probability to each of them. Then the Compose instance can be called several times with the same audio file, and each time the resulting audio will be processed differently because of the probabilities. In data_augmentation.ipynb¹⁰ a possible implementation is shown, as well as some plots of the original sample with different results of applying the created Compose to the same sample; an example of the results can be listened to in Freesound¹¹.

The processing functions introduced in the Compose class are based on the ones proposed in [13] and [14]; their parameters are described below (a minimal sketch follows the list):
• Add gaussian noise, with 70% probability.

• Time stretch between 0.8 and 1.25, with 50% probability.

• Time shift forward a maximum of 25% of the duration, with 50% probability.

• Pitch shift ±2 semitones, with 50% probability.

• Apply mp3 compression, with 50% probability.
¹⁰ https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
¹¹ https://freesound.org/people/MaciaAC/packs/32213
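The sketch below shows one way to build such a Compose with the parameters listed above. Transform names follow the audiomentations 0.x API (newer releases rename Shift's min_fraction/max_fraction to min_shift/max_shift), and the file names are placeholders.

```python
import soundfile as sf
from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                             PitchShift, Shift, TimeStretch)

# Each transform fires independently with its own probability, so every
# call produces a differently processed version of the same signal.
augment = Compose([
    AddGaussianNoise(p=0.7),                               # gaussian noise
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),       # 0.8x to 1.25x
    Shift(min_fraction=0.0, max_fraction=0.25, p=0.5),     # forward shift
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),  # +-2 semitones
    Mp3Compression(p=0.5),                                 # mp3 artifacts
])

samples, sr = sf.read("kd_0001.wav", dtype="float32")      # placeholder file
for i in range(5):  # five random variants of the same drum hit
    sf.write(f"kd_0001_aug{i}.wav",
             augment(samples=samples, sample_rate=sr), sr)
```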
3.4 Drums events trim

As will be explained in section 4.2.1, the dataset has to be trimmed into individual files in order to analyze them and extract the low-level descriptors. In the Dataset_featureExtraction.ipynb¹² notebook this process has been implemented, slicing all the audios with their annotations, each dataset separately, to sight-check all the resultant samples and better detect which annotations were not correct.
3.5 Summary

To summarize, a drums samples dataset has been created; the one used in this project will be called the 40k Samples Drums Dataset. Nonetheless, to share this dataset we have to ensure that we fully own the data, which means that the samples that come from the IDMT, MDB Drums and Music School datasets cannot be shared in another dataset. Alternatively, we will share the 29k Samples Drums Dataset, formed only by the samples recorded in the studio. This dataset will be available in Zenodo¹³, to download the whole dataset at once, and in Freesound, where some selected samples are uploaded in a pack¹⁴ to show the differences among microphones.

¹² https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
¹³ https://zenodo.org/record/4958592
¹⁴ https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is inspired by [2]. Each box of the diagram refers to a section in this chapter, so the diagram might be helpful to get a general idea of the problem when explaining each process.

The system is divided into two main processes. First, the top boxes correspond to the training process of the model, using the dataset created in the previous chapter. Secondly, the bottom row shows how a student submission is processed to generate some feedback. This feedback is the output of the system and should give some indications to the student on how they have performed and how they can improve.
4.1 Problem definition

To check if a student reads a music sheet correctly, we need some tool to tag which instruments of the drumset are playing for each detected event. This leads us to develop and train a drums event classifier: if this tool ensures a good accuracy when classifying (i.e., above 95%), we will be able to properly assess a student's recording. If the classifier does not have enough accuracy, the system will not be useful, as we will not be able to differentiate between errors from the student and errors from the classifier.
[Diagram: training path: Music Scores, Assessments and Students' performances provide Annotations and Audio recordings that form the Dataset, which feeds Feature extraction, Drums event classifier training and Performance assessment training; inference path: a New student's recording goes through Feature extraction and Performance assessment inference to produce a Visualization and Performance feedback]

Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason, the project has been mainly focused on developing the aforementioned drums event classifier and a proper dataset. Thus, developing a properly assessed dataset of drums interpretations has not been possible, nor has the performance assessment training. Despite this, the feedback visualization has been developed, as it is a nice way to close the pipeline and get some understandable results; moreover, the performance feedback can focus on deterministic aspects, such as telling the student if they are rushing or slowing in relation to a given tempo.
4.2 Drums event classifier

As already mentioned, this section has been the main load of work for this project, because a reliable assessment depends on a correct automatic transcription. The process has been divided into three main parts: extracting the musical features, training and validating the model in an iterative process, and finally testing the model with totally new data.
4.2.1 Feature extraction

The feature extraction concept has been explained in section 2.1.1 and has been implemented using the MusicExtractor()¹ method from the Essentia library. MusicExtractor() has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()² has been implemented to loop over a list of files and analyze each of them, adding the extracted features to a csv file jointly with the class of each drum event. At this point all the low-level features were extracted; both the mean and the standard deviation were computed across all the frames of the given audio file. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.

As mentioned in section 3.4, the fact that the MusicExtractor() method has to be called with a filename, not an audio stream, forced us to create another version of the dataset which had each event annotated in a different audio file, with the correspondent class label as filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples), and the features were saved in different csv files, one for each dataset³. Adding the number of instances in all the csv files, we get 40228 instances with 84 features and 1 label.

¹ https://essentia.upf.edu/reference/std_MusicExtractor.html
² https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
³ https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices_features
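A minimal sketch of this per-file analysis, with illustrative window and hop sizes and a placeholder filename:

```python
import essentia.standard as es

# Analyze one sliced drum event, keeping the mean and standard deviation
# of every low-level frame descriptor, as extract_MusicalFeatures() does.
extractor = es.MusicExtractor(lowlevelStats=["mean", "stdev"],
                              lowlevelFrameSize=2048,
                              lowlevelHopSize=1024)
features, _frames = extractor("slices/hh+kd_0001.wav")

# Keep only the scalar low-level descriptors; these become one csv row.
row = {name: features[name]
       for name in features.descriptorNames()
       if name.startswith("lowlevel.") and isinstance(features[name], float)}
print(len(row), "scalar low-level features")
```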
4.2.2 Training and validating

As mentioned in section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) to do sound event classification; also, some authors have developed more complex methods for drums event classification. The complexity of these last methods made me choose the generic ones, also to try whether they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.

The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40k Samples Drums Dataset. The first times we tried the models we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula 4.1, which does not take into account the unbalance in the dataset. The accuracy computation was around 92%, but the correct predictions were mainly on the large classes: as shown in Figure 7, some classes had very low accuracy (even 0%, as some classes had 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances affects the accuracy computation less.
$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} 1(\hat{y}_i = y_i) \qquad (4.1)$$
Otherwise, the proper way to compute the accuracy on this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages the accuracy across all the classes, as in formula 4.2, where $w_i$ represents the weight of each class in the dataset. This computation lowered the result to 79%, which was not a good result.

$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \qquad (4.2)$$

$$\text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i$$
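Both metrics are available in scikit-learn; the toy example below shows why the plain accuracy is misleading on an unbalanced set.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy unbalanced case: 9 'kd' instances and 1 'cr'. A model that always
# answers 'kd' looks good on plain accuracy, but the balanced accuracy,
# which averages the per-class recalls, exposes the failure.
y_true = ["kd"] * 9 + ["cr"]
y_pred = ["kd"] * 10

print(accuracy_score(y_true, y_pred))           # 0.9
print(balanced_accuracy_score(y_true, y_pred))  # 0.5
```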
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the F-score, which combines the precision and the recall of the model in one measure, as in formula 4.3. Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of instances that should be predicted for a given class.

$$F\text{-measure} = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (4.3)$$
These results led us to the process of recording a personalized dataset to extend the already existing one (see section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as well as the balanced accuracy and F-score (both 89%). Until this point we were using both KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine refers to the regularization; this technique is intended to make a model less sensitive to the data noise and the outliers that may not represent the class properly. When increasing this value to 10 the results improved among all the classes, as shown in Figure 9, as well as the accuracy and F-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was somehow a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with the classes that include only the three mentioned instruments (i.e., hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and F-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10

Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the training and validating iterative process has been developed in the Classifier_training.ipynb⁴ notebook. First, the csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered, and to remove redundant features the ones with a very low standard deviation are deleted (i.e., std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to move all the data distributions around 0, ensuring a standard deviation of 1.

In the next cells the dataset is split into train and validation sets, and the training method of the SVM from sklearn is called to perform the training. When the models are trained, the parameters are dumped in a file to load the model a posteriori and be able to apply the learned knowledge to new data. This process was very slow on my computer, so we decided to upload the csv files to Google Drive and open the notebook with Google Colaboratory, as it was faster; this is a key feature to avoid long waiting times during the iterative train-validate process. In the last cells the inference is made with the validation set and the accuracy is computed, as well as the confusion matrix plotted to get an idea of which classes are performing better. A minimal sketch of this training pipeline is shown below.

⁴ https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
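A condensed sketch of this train-validate loop, with placeholder file and column names:

```python
import joblib
import pandas as pd
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("features/studio.csv")          # csv from feature extraction
X, y = df.drop(columns=["label"]), df["label"]
X = X.loc[:, X.std() > 1e-5]                     # drop near-constant features

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
model = make_pipeline(StandardScaler(), SVC(C=10))   # scale, then SVM
model.fit(X_tr, y_tr)
print(balanced_accuracy_score(y_va, model.predict(X_va)))

joblib.dump(model, "svm_drums.joblib")           # reload later for inference
```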
4.2.3 Testing

Testing the model introduces the concept of onset detection: until now, all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection⁵ does both tasks. As explained in section 2.1.1, there are many methods to do onset detection and each of them is better for a different application. In the case of drums, we tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works pretty well; but when the tempo increases, some onsets are not correctly detected. For this reason we finally implemented the onset detection with the HFC method.

⁵ https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
This method computes, for each window, the HFC as in equation 4.4; note that high-frequency bins (the k index) weigh more in the final value of the HFC.

$$\mathrm{HFC}(n) = \sum_k |X_k[n]|^2 \cdot k \qquad (4.4)$$
Moreover, the function plots the audio waveform jointly with the onsets detected, to check after each test whether it has worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs. A sketch of this detection recipe follows the figures.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
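The standard Essentia recipe for this HFC-based detection looks roughly as follows; the filename is a placeholder.

```python
import numpy as np
import essentia.standard as es

# Compute the HFC detection function frame by frame, then peak-pick the
# onset times with the Onsets algorithm.
audio = es.MonoLoader(filename="submission.wav")()
od = es.OnsetDetection(method="hfc")
w, fft, c2p = es.Windowing(type="hann"), es.FFT(), es.CartesianToPolar()

curve = []
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    mag, phase = c2p(fft(w(frame)))
    curve.append(od(mag, phase))

onsets = es.Onsets()(np.array([curve], dtype=np.float32), [1])
print(onsets)   # onset times in seconds, used to slice the events
```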
With the onset information, the audio can be trimmed into the different events; the order is maintained through the name of each file, so when comparing with the expected events they can be mapped easily. The audios are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a csv.
To predict which event each slice is, the models already trained are loaded in this new environment and the data is pre-processed using the same pipeline as when training. After that, data is passed to the classifier method predict(), which returns the predicted event for each row in the data (see the sketch below). The described process is implemented in the first part of Assessment.ipynb⁶; the second part is intended to execute the visualization functions described in the next section.
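A sketch of this inference step, assuming the model was dumped with joblib and the features csv keeps a filename column to preserve the slice order:

```python
import joblib
import pandas as pd

model = joblib.load("svm_drums.joblib")          # scaler + SVM pipeline
slices = pd.read_csv("features/submission.csv")  # one row per sliced event

predicted = model.predict(slices.drop(columns=["filename"]))
for name, event in zip(slices["filename"], predicted):
    print(name, "->", event)                     # e.g. onset_003.wav -> hh+sd
```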
4.3 Music performance assessment

Finally, as already commented, the assessment part has been focused on giving visual feedback of the interpretation to the student. As the drums classifier has taken so much time, the creation of a dataset with interpretations and their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different levels of skill, but grading them and doing all the process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed ones.

So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization

With the data gathered in the testing section, feedback of the interpretation has to be returned. Having as a base implementation the solution of my colleague Eduard Vergés⁷, and thanks to the help of Vsevolod Eremenko⁸, the visualization is done in the last cell of the notebook Assessment.ipynb.

First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.

⁶ https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
⁷ https://github.com/EduardVergesFranch/U151202_VA_FinalProject
⁸ https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()⁹ is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (this comes from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio.

In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (green good, red bad). A sketch of this kind of plot is given after the figures.

Figure 13 Onset deviation plot of a good tempo submission

Figure 14 Onset deviation plot of a bad tempo submission
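A rough matplotlib sketch of this kind of plot; it is not the actual save_bar_plot code, and the 120 ms deviation ceiling is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

def bar_plot(waveform, sr, expected, detected, max_dev=0.12):
    """Waveform with expected (cyan) and detected (black) onsets; the span
    between each pair goes from green to red as the deviation grows."""
    t = np.arange(len(waveform)) / sr
    fig, ax = plt.subplots(figsize=(12, 3))
    ax.plot(t, waveform, linewidth=0.5)
    cmap = plt.get_cmap("RdYlGn_r")          # 0 -> green, 1 -> red
    for exp, det in zip(expected, detected):
        dev = min(abs(det - exp) / max_dev, 1.0)
        ax.axvspan(min(exp, det), max(exp, det), color=cmap(dev), alpha=0.6)
        ax.vlines(exp, -0.5, 0.5, color="cyan")
        ax.vlines(det, -1.0, -0.6, color="black")
    fig.savefig("bar_plot.png", bbox_inches="tight")
```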
Once the waveform is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes()¹⁰, the expected and predicted events are compared: a list is created with 0 at the indices where they differ and 1 where they match; then the resulting list is iterated and the 0 indices are checked, because most of the classification errors fail in one of the instruments to be predicted (i.e., instead of hh+sd it predicts sd). These cases are considered partially correct, as the system has to take into account its own errors: at the indices in which one of the instruments is correctly predicted and it is not a hi-hat (we consider it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct and red means incorrect. A sketch of this rule follows the figure.

⁹ https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
¹⁰ https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
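The rule can be summarized in a few lines; this is our reading of the description above, not the actual assess_notes code. Class labels are '+'-joined, as in the dataset.

```python
def assess_notes(expected, predicted):
    """1.0 for an exact match, 0.75 when part of the event is right and the
    matching part is not just the hi-hat, 0.0 otherwise."""
    scores = []
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            scores.append(1.0)
        else:
            shared = set(exp.split("+")) & set(pred.split("+"))
            scores.append(0.75 if shared - {"hh"} else 0.0)
    return scores

print(assess_notes(["hh+sd", "hh+kd", "sd"], ["sd", "hh", "kd"]))
# [0.75, 0.0, 0.0]
```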
With the waveform, the notes assessed and the LilyPond template, the function score_image()¹¹ can be called. This function renders the LilyPond template jointly with the waveform previously created; this is done with the LilyPond macros. On one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the notes assessed); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars), while the other number is the scale of the waveform, which allows fitting the plot with the staff.
4.3.2 Files used

The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps; these are found in the txt file already mentioned in section 3.1.1. Then the LilyPond file: this is the template, written in the LilyPond language, that defines the resultant music sheet; the macros to change color and size and to add the waveform are defined there. When extracting the musical features, each submission creates its csv file to store the information. And finally we need, of course, the audio files with the recorded submission to be assessed.

¹¹ https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB until an accumulated +30 dB.

In this chapter and Appendix B all the resultant feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack¹ has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.

As the high frequency content method works perfectly, there are no limitations nor errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positive.

¹ https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations

One of the limitations of the system is the tempo of the exercise: the accuracy drops when the tempo increases. Taking as reference the figures that show a good reading, in which all notes should be green or light green (i.e., Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one weighs 0.5 and an incorrect one 0; the total value is the mean of the weighted predictions.
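For instance, the 60 bpm row of Table 3 contains 25 correct and 7 partially correct predictions out of 32 events, so the total is (25 · 1.0 + 7 · 0.5) / 32 = 28.5 / 32 ≈ 0.89.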
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
Figure 19 Good reading and good tempo Ex 1 180 bpm

Figure 20 Good reading and good tempo Ex 1 220 bpm

In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of the classifier decreases. This may be because increasing the tempo decreases the spacing between events, and consequently the duration of each event, which leads to fewer values for calculating the mean and standard deviation when extracting the timbre characteristics. As stated in the law of large numbers [25], the larger the sample, the closer its mean is to the total population mean; in this case, having fewer values in the calculation creates more outliers in the distribution, which tends to scatter.
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48

Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But 100 bpm in 12/8 is equivalent to 300 eighth notes per minute, similar to 140 bpm in 4/4, which corresponds to 280 eighth notes per minute. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875

Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations

Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that it leads to high-frequency content that is predicted as a hi-hat in all the cases.
Level    Correct   Partially OK   Incorrect   Total
+0 dB    25        7              0           0.89
+6 dB    23        9              0           0.86
+12 dB   23        9              0           0.86
+18 dB   24        7              1           0.86
+24 dB   18        5              9           0.64
+30 dB   13        5              14          0.48

Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0 dB    12        7              13          0.48
+6 dB    13        10             9           0.56
+12 dB   10        8              14          0.5
+18 dB   9         2              21          0.31
+24 dB   8         0              24          0.25
+30 dB   9         0              23          0.28

Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6 dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6 dB at each new staff
5.3 Evaluation of the assessment

Until now the evaluation of results has been focused on the drums event classifier accuracy, but we think that it is also important to evaluate whether the system can properly assess a student's submission.

As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time step. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system performs the prediction-based assessment pretty well. This can be helpful for a student to know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help the student recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the correspondent expected onset.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. Also, a set of further improvements is given, together with a final thought on my work and my apprenticeship. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results

Having in mind all the concepts explained throughout this document, we can now list them, defining their completeness and our contributions.

Firstly, the 29k Samples Drums Dataset has been created and is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful to balance drums datasets based on real interpretations, as the class distribution of these interpretations is very unbalanced, as explained with the IDMT and MDB Drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on the context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained to do is very hard for a human being: differentiating drums events in an individual drum sample without any context is almost impossible, even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on the visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the 29k Samples Drums Dataset. It would be good to get recordings in different spaces and with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work

In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even different people playing. This would introduce more variance in the drum samples dataset. Moreover, on June 9th, 2021, a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 2021¹. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.

Regarding the classification model, it is clear that it needs improvements to ensure the overall system robustness. It would be appropriate to introduce the aforementioned methods in [16], [17] and [26] in the ADT part of the pipeline.

¹ https://2021.ieeeicassp.org
Also, in terms of classes in the drumset, there is a long path to cover. There are no solutions that transcribe a whole set in a robust way, including the toms and different kinds of cymbals. In this sense, we think that a proper approach would be to work with professional musicians, who can help researchers better understand the instrument and create datasets with different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises to give a mark to each student. In this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step would be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like MusicCritic's. But better results and more robustness are needed to take this step.
6.3 Work reproducibility

In computational sciences, a work is reproducible if code and data are available and other researchers or students can execute them, getting the same results.

All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository², as well as the data used to test the system and the classification models.

The data created, i.e., the studio recordings, is available in a Zenodo repository³ and some samples in Freesound⁴. This is the 29k Samples Drums Dataset: not all of the 40k samples used to train are our property, and we are not able to share them under our full authorship; despite this, the other datasets used in this project are available individually.

² https://github.com/MaciAC/tfg_DrumsAssessment
³ https://zenodo.org/record/4923588
⁴ https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions

This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although the work still falls short in terms of robustness and reliability. A first approximation has been presented, and several paths of improvement have been proposed.

Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web counterpart GitHub. Also, at the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e., Essentia), which has been a tough path to take and accomplish. In a more advanced phase of the project, I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest. And being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that the system produces, and I hope that more people get interested in this field of research to build better tools in the future.
List of Figures

1 Datasets pre-processing
2 Sample drums score from music school drums grade 1
3 Microphone setup for drums recording
4 Number of samples before Train set recording
5 Number of samples after Train set recording
6 Proposed pipeline for a drums performance assessment system, inspired by [2]
7 Confusion matrix after training with the dataset in Figure 4
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
11 Onsets detected in a 60 bpm drums interpretation
12 Onsets detected in a 220 bpm drums interpretation
13 Onset deviation plot of a good tempo submission
14 Onset deviation plot of a bad tempo submission
15 Example of coloured notes
16 Good reading and good tempo Ex 1 60 bpm
17 Good reading and good tempo Ex 1 100 bpm
18 Good reading and good tempo Ex 1 140 bpm
19 Good reading and good tempo Ex 1 180 bpm
20 Good reading and good tempo Ex 1 220 bpm
21 Good reading and good tempo Ex 2 60 bpm
22 Good reading and good tempo Ex 2 100 bpm
23 Good reading and good tempo Ex 1 60 bpm, accumulating +6 dB at each new staff
24 Good reading and good tempo Ex 1 220 bpm, accumulating +6 dB at each new staff
25 Bad reading and bad tempo Ex 1 100 bpm
26 Bad reading and bad tempo Ex 1 180 bpm
27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff
28 Bad reading and good tempo Ex 2 60 bpm
29 Bad reading and good tempo Ex 2 100 bpm
30 Good reading and bad tempo Ex 1 100 bpm
31 Recording routine 1
32 Recording routine 2
33 Recording routine 3
34 Drumset configuration 1
35 Drumset configuration 2
36 Drumset configuration 3
37 Good reading and bad tempo Ex 1 60 bpm
38 Bad reading and bad tempo Ex 1 60 bpm
39 Good reading and bad tempo Ex 1 140 bpm
40 Bad reading and bad tempo Ex 1 140 bpm
41 Good reading and bad tempo Ex 1 180 bpm
42 Good reading and bad tempo Ex 1 220 bpm
43 Bad reading and bad tempo Ex 1 220 bpm
44 Good reading and bad tempo Ex 2 60 bpm
45 Bad reading and bad tempo Ex 2 60 bpm
46 Good reading and bad tempo Ex 2 100 bpm
47 Bad reading and bad tempo Ex 2 100 bpm
List of Tables

1 Abbreviations' legend
2 Microphones used
3 Results of exercise 1 with different tempos
4 Results of exercise 2 with different tempos
5 Results of exercise 1 at 60 bpm with different amplification levels
6 Results of exercise 1 at 220 bpm with different amplification levels
7 Assessment result of a bad reading with different tempos, 4/4 exercise
8 Assessment result of a bad reading with different tempos, 12/8 exercise
Bibliography

[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M. J., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2001).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums: an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
Figure 31 Recording routine 1
Figure 32 Recording routine 2
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
31 Existing datasets 15
difference of time meaning that they are played simultaneously
Music school Studio REC IDMT Drums MDB Drums
audio + txt
Sibelius to MusicXML
MusicXML parser to txt
Write annotations
AnnotationsAudio
Figure 1 Datasets pre-processing
311 MDB Drums
This dataset was the first we worked with the annotation format in txt was a key
factor as it was easy to read and understand As the dataset is available in Github2
there is no need to download it neither process it from a local drive As shown in
the first cells of Dataset_formatUnificationipynb data from the repository can
be retrieved with a Python wrapper of the Github API3
This dataset has two annotations files depending on how deep the taxonomy used
is [20] In this case the generic class taxonomy is used as there is no need to
differentiate styles when playing a given instrument (ie single stroke flam drag
ghost note)
312 IDMT Drums
Differently to the previous dataset this one is only available downloading a zip
file4 It also differs in the annotation file format which is xml Using the Python
2httpsgithubcomCarlSouthallMDBDrums3httpspypiorgprojectgithubpy4httpswwwidmtfraunhoferdeenbusiness_unitsm2dsmtdrumshtml
16 Chapter 3 The 40kSamples Drums Dataset
package xmltodict5 in the second part of Dataset_formatUnificationipynb the
xml files are loaded as a Python dictionary and converted to txt format
32 Created datasets
In order to expand the dataset with more variety of samples other methods to get
data have been explored On one hand with audio data that has partial annotations
or some representation that is not data-driven such as a music sheet that contains
a visual representation of the music but not a logic annotation as mentioned in
the previous section On the other hand generating simple annotations is an easy
task so drums samples can be recorded standalone to create data in a controlled
environment In the next two sections these methods are described
321 Music school
A music school has shared its docent material with the MTG for research purposes
ie audio demos books in pdf format music sheet in Sibelius format As we can
see in Figure 1 the annotations from the music school corpus are in Sibelius format
this is an encrypted representation of the music sheet that can only be opened with
the Sibelius software The MTG has shared an AVID license which includes the
Sibelius software so we were able to convert the sib files to musicxml MusicXML
is not encrypted and allows to open it and read so a parser has been developed to
convert the MusicXML files to a symbolic representation of the music sheet This
representation has been inspired by [24] which proposes a system to represent chords
MusicXML parser
As mentioned in section 23 MusicXML format is based on ordering the visual
information with tags creating a tree structure of nested dictionaries In the first cell
of XML_parseripynb6 two functions are defined ConvertXML2Annotation reads
the musicxml file and gets the general information of the song (ie tempo time
5httpspypiorgprojectxmltodict6httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterXML_parseripynb
32 Created datasets 17
measure title) then a for loops throughout all the bars of the music sheet checking
whereas the given bar is self-defined the repetition of the previous one or the begin
or end of a repetition in the song (see Figure 2) in the self-defined bar case the bar
indeed is passed to an auxiliar function which parses it getting the aforementioned
symbolic representation
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three
others have a symbol that means rsquorepetition of the previous barrsquo moreover the
bar lines at the beginning and the end represents that these four bars have to be
repeated therefore this line in the music score represents an interpretation of eight
bars repeating the first one
The symbolic representation that we propose is based in [24] defines each bar with
a string this string contains the representations of the events in the bar separated
with blank spaces Each of the events has two dots () to separate the figure (ie
quarter note half note whole note) from the note or notes of the event which
are separated by a dot () For instance the symbolic representation of the first bar
in Figure 2 is F4A44 F4A44 F4A44 F4A44
In addition to this conversion in parse_one_measure function from XML_parser
notebook each measure is checked to ensure that fully represents the bar This
means that the sum of the figures of the bar has to be equal to the defined in the
time measure the sum of the events in a 44 bar has to be equal to four quarter
notes
Symbolic notation to unified annotation format
As we can see in Figure 1 once the music scores are converted to the symbolic
representation the last step is to unify the annotations with the used in sections 31
18 Chapter 3 The 40kSamples Drums Dataset
This process is made in the last cells of Dataset_formatUnification7 notebook
A dictionary with the translation of the notes to drums instrument is defined so
the note is already converted Differently the timestamp of each event has to be
computed based on the tempo of the song and the figure of each event this process
is made with the function get_time_steps_from_annotations8 which reads the
interpretation in symbolic notation and accumulates the duration of each event
based on the figure and the tempo
3.2.2 Studio recordings

At this point of the dataset creation, we realized that the already existing data was very unbalanced in terms of instances per class: some classes had around two thousand samples while others had only ten. This situation was the reason to record a custom dataset, both to balance the overall distribution of classes and to obtain exercises read with different accuracy, simulating students with different skill levels.
The recording process took place on April 16 and 17 at Stereodosis Estudio⁹ (Sants, Barcelona). The first day was devoted to mounting the drumset and the microphones, which are listed in Table 2. The microphone setup is shown in Figure 3; differently from the standard setup, in which each instrument of the set has its own microphone, this distribution of the microphones was intended to record the whole drumset with different frequency responses.

The recording process was divided into two phases: first, creating samples to balance the dataset used to train the drums event classifier (called train set); then, recording the students' assignment simulation to test the whole system (called test set).
7 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_formatUnification.ipynb
8 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/drums.py#L9
9 https://www.stereodosis.com
Microphone            Transducer principle
Beyerdynamic TG D70   Dynamic
Shure PG52            Dynamic
Shure SM57            Dynamic
Sennheiser e945       Dynamic
AKG C314              Condenser
AKG C414              Condenser
Shure PG81            Condenser
Samson C03            Condenser
Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes, we decided to take into account only the classes that appear in the music school subset; this decision was motivated by the idea of assessing the songs from the books, so only the classes appearing in the collection of songs were needed to train the classifier. In Figure 4 the distribution of the selected classes before the recordings is shown; note that it is in logarithmic scale, so there is a large difference among classes.
Figure 4 Number of samples before Train set recording
To organize the recording process, we designed 3 different routines; depending on the class and the number of samples already existing, a different routine was recorded. These routines were designed to represent the different speeds, dynamics and interactions between instruments of a real interpretation. The routine scores are shown in Appendix A. To write a generic routine, a two-line stave is used: the bottom line represents the class to be recorded and the top line an auxiliary one. The auxiliary classes are cymbals, concretely crashes and rides, whose sound sustains for a long period of time so that its tail mixes with the subsequent sound events.
• Routine 1 (Fig. 31): intended for the classes that do not include a crash or ride cymbal and have a small number of samples (i.e. <500).
• Routine 2 (Fig. 32): does not include auxiliary events, as it is intended for classes that include a crash or ride cymbal, whose interaction with itself is intrinsic.
• Routine 3 (Fig. 33): a short version of routine 1 which repeats each bar only two times instead of four; intended for classes that do not include a crash or ride cymbal and have a large number of samples (i.e. >500).
Routines 1 and 3 were recorded only once, as we had only one instrument for each of the classes; routine 2, in contrast, was recorded twice for each cymbal, as we were able to use more instances of them. The different cymbal configurations used can be seen in Appendix A, in Figures 34, 35 and 36.

After the Train set recording, the number of samples was somewhat more balanced; as shown in Figure 5, all the classes have at least 1500 samples.
[Bar chart: number of samples per class (0–3000), legend: recorded / before record]

Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song on the same drumset. To do that, we recorded each song of the music school Drums Grade Initial and Grade 1, playing it correctly and then making mistakes in both reading and rhythm. After testing with these recordings, we realized that we were not able to test the limits of the assessment system in terms of tempo or with different time signatures. So we proposed two exercises of groove reading, in 4/4 and in 12/8, to be performed at different tempos; these recordings were done in my study room with my laptop's microphone.
3.3 Data augmentation

As described in section 2.1.2, data augmentation aims to introduce changes to the signals to improve the statistical representation of the dataset. To implement this task, the aforementioned Python library audiomentations is used.
The audiomentations library has a class called Compose, which allows collecting different processing functions, assigning a probability to each of them. The Compose instance can then be called several times with the same audio file, and each time the resulting audio will be processed differently because of the probabilities. In data_augmentation.ipynb¹⁰ a possible implementation is shown, as well as some plots of the original sample with different results of applying the created Compose to the same sample; an example of the results can be listened to in Freesound¹¹.

The processing functions introduced in the Compose instance are based on those proposed in [13] and [14]; their parameters are described below (a sketch follows the list):
• Add gaussian noise, with 70% probability.
• Time stretch between 0.8 and 1.25, with 50% probability.
• Time shift forward a maximum of 25% of the duration, with 50% probability.
• Pitch shift ±2 semitones, with 50% probability.
• Apply mp3 compression, with 50% probability.
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/data_augmentation.ipynb
11 https://freesound.org/people/MaciaAC/packs/32213
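As a sketch, the Compose described above could look as follows; the parameter names follow the audiomentations API at the time of writing (newer versions rename some arguments, e.g. those of Shift), so this is indicative rather than the exact code of the notebook.

```python
from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                             PitchShift, Shift, TimeStretch)

# One augmentation chain with the probabilities listed above; calling it
# repeatedly on the same audio yields differently processed versions.
augment = Compose([
    AddGaussianNoise(p=0.7),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    Shift(min_fraction=0.0, max_fraction=0.25, p=0.5),    # forward shift up to 25%
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    Mp3Compression(p=0.5),
])

# augmented = augment(samples=audio_array, sample_rate=44100)
```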
3.4 Drums events trim

As will be explained in section 4.2.1, the dataset has to be trimmed into individual files in order to analyze them and extract the low-level descriptors. This process has been implemented in the Dataset_featureExtraction.ipynb¹² notebook, slicing all the audios using their annotations, each dataset separately, so that all the resulting samples could be sight-checked and incorrect annotations detected more easily.
3.5 Summary

To summarize, a drums samples dataset has been created; the one used in this project will be called the 40k Samples Drums Dataset. Nonetheless, to share this dataset we have to ensure that we fully own the data, which means that the samples that come from the IDMT, MDBDrums and MusicSchool datasets cannot be shared in another dataset. Alternatively, we will share the 29k Samples Drums Dataset, formed only by the samples recorded in the studio. This dataset will be available in Zenodo¹³, to download the whole dataset at once, and in Freesound, where some selected samples are uploaded in a pack¹⁴ to show the differences among microphones.
12 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Dataset_featureExtraction.ipynb
13 https://zenodo.org/record/4958592
14 https://freesound.org/people/MaciaAC/packs/32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is inspired by [2]. Each box of the diagram refers to a section in this chapter, so the diagram might be helpful to get a general idea of the problem when explaining each process.

The system is divided into two main processes. First, the top boxes correspond to the training process of the model, using the dataset created in the previous chapter. Secondly, the bottom row shows how a student submission is processed to generate some feedback. This feedback is the output of the system and should give the student some indications on how they have performed and how they can improve.
4.1 Problem definition

To check if a student reads a music sheet correctly, we need some tool to tag which instruments of the drumset are playing in each detected event. This leads us to develop and train a drums events classifier: if this tool ensures a good accuracy when classifying (i.e. >95%), we will be able to properly assess a student's recording. If the classifier does not have enough accuracy, the system will not be useful, as we will not be able to differentiate between errors made by the student and errors made by the classifier.
[Diagram: annotated datasets (assessments, music scores, students' performances, annotations, audio recordings) feed feature extraction and the training of the drums event classifier and the performance assessment; a new student's recording goes through feature extraction and performance assessment inference to produce a visualization and performance feedback]

Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason, the project has been mainly focused on developing the aforementioned drums event classifier and a proper dataset. Consequently, developing a properly assessed dataset of drums interpretations has not been possible, nor has the performance assessment training. Despite this, the feedback visualization has been developed, as it is a nice way to close the pipeline and get some understandable results; moreover, the performance feedback can be focused on deterministic aspects, such as telling the student whether they are rushing or dragging in relation to a given tempo.
4.2 Drums event classifier

As already mentioned, this section has been the main workload of the project, because a reliable assessment depends on a correct automatic transcription. The process has been divided into 3 main parts: extracting the musical features, training and validating the model in an iterative process, and finally testing the model with totally new data.
4.2.1 Feature extraction

The feature extraction concept has been explained in section 2.1.1 and has been implemented using the MusicExtractor()¹ method from the Essentia library. MusicExtractor() has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()² has been implemented to loop over a list of files and analyze each of them, adding the extracted features to a csv file jointly with the class of each drum event. At this point all the low-level features were extracted; both mean and standard deviation were computed across all the frames of the given audio file. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.

As mentioned in section 3.4, the fact that MusicExtractor() has to be called with a filename, not an audio stream, forced us to create another version of the dataset, which had each event annotated in a different audio file with the correspondent class label as filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples), and the features were saved in different csv files, one for each dataset³. Adding up the number of instances in all the csv files, we get 40228 instances with 84 features and 1 label.
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices_features
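A minimal sketch of this extraction loop is shown below; the descriptor filtering is simplified with respect to the real extract_MusicalFeatures(), and the function name and csv layout are illustrative.

```python
import csv
import numpy as np
import essentia.standard as es

def extract_features_to_csv(filenames, labels, out_csv,
                            frame_size=2048, hop_size=1024):
    # MusicExtractor aggregates mean and stdev of the low-level
    # descriptors across all frames of the given file
    extractor = es.MusicExtractor(lowlevelStats=['mean', 'stdev'],
                                  lowlevelFrameSize=frame_size,
                                  lowlevelHopSize=hop_size)
    with open(out_csv, 'w', newline='') as f:
        writer = csv.writer(f)
        for filename, label in zip(filenames, labels):
            pool, _ = extractor(filename)
            # keep only scalar low-level descriptors, then append the class label
            names = sorted(n for n in pool.descriptorNames()
                           if n.startswith('lowlevel') and np.isscalar(pool[n]))
            writer.writerow([pool[n] for n in names] + [label])
```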
4.2.2 Training and validating

As mentioned in section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) for sound event classification; other authors have developed more complex methods for drums event classification. The complexity of these latter methods made us choose the generic ones, also to test whether they were a good way to approach the problem, as there is no literature specifically on drums event classification with SVM or KNN.

The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40k Samples Drums Dataset. The first times we trained the models, we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula 4.1, which does not take the unbalance of the dataset into account. The computed accuracy was around 92%, but the correct predictions were concentrated in the large classes; as shown in Figure 7, some classes had very low accuracy (even 0%, as some classes had 10 samples, 7 used to train and 3 to validate, all badly predicted), but having a small number of instances affects the accuracy computation less.
$$\text{accuracy}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} 1(\hat{y}_i = y_i) \qquad (4.1)$$
A more appropriate way to compute the accuracy on this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages over all the classes, as in formula 4.2, where $w_i$ is the weight of sample $i$ and $\hat{w}_i$ its value normalized within each class. This computation lowered the result to 79%, which was not a good result.

$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j}, \qquad \text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i \qquad (4.2)$$
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used indicator for classification models is the F-score, which combines the precision and the recall of the model in one measure, as in formula 4.3. Precision is computed as the number of correct predictions divided by the total number of predictions for a given class, and recall is the number of correct predictions divided by the total number of instances that actually belong to that class.

$$F\text{-measure} = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (4.3)$$
These results led us to record a personalized dataset to extend the already existing one (see section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as did the balanced accuracy and F-score (both 89%). Until this point we were using both KNN and SVM models to compare results; the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine controls the regularization: lower values of C mean stronger regularization, a technique intended to make the model less sensitive to data noise and to outliers that may not represent their class properly. When increasing this value to 10, the results improved across all the classes, as shown in Figure 9, as did the accuracy and F-score (both 95%).
At that point the accuracy of the model was quite good, but the 88% on the snare drum class was somewhat of a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with the classes that include only the three mentioned instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and F-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The training and validating iterative process has been implemented in the Classifier_training.ipynb⁴ notebook. First, the csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered and, to remove redundant features, the ones with a very low standard deviation are deleted (i.e. std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to center all the feature distributions around 0 and ensure a standard deviation of 1.

In the next cells the dataset is split into train and validation sets, and the fit method of the sklearn SVM is called to perform the training; once the models are trained, their parameters are dumped to a file, so the model can be loaded a posteriori and the learned knowledge applied to new data. This process was very slow on my computer, so we decided to upload the csv files to Google Drive and open the notebook with Google Colaboratory; it was faster, which is key to avoid long waiting times during the iterative train-validate process. In the last cells, the inference is made with the validation set and the accuracy is computed, as well as the confusion matrix plotted, to get an idea of which classes perform better.
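A condensed sketch of this process is given below; the file names are assumptions, the feature filtering is simplified, and C = 10 follows the tuning described above.

```python
import joblib
import pandas as pd
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv('features.csv')            # assumed merged csv of all datasets
X, y = df.drop(columns=['label']), df['label']
X = X.loc[:, X.std() > 1e-5]                # drop near-constant (redundant) features

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  stratify=y, random_state=0)

scaler = StandardScaler().fit(X_train)      # zero mean, unit variance
clf = SVC(C=10).fit(scaler.transform(X_train), y_train)

y_pred = clf.predict(scaler.transform(X_val))
print('balanced accuracy:', balanced_accuracy_score(y_val, y_pred))

joblib.dump((scaler, clf), 'svm_drums.joblib')  # reload later for inference
```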
4.2.3 Testing

Testing the model introduces the concept of onset detection: until now, all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection⁵ does both tasks. As explained in section 2.1.1, there are many methods to do onset detection, and each of them is better suited to a different application. In the case of drums, we tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works quite well; however, when the tempo increases, some onsets are not correctly detected. For this reason we finally implemented the onset detection with the HFC method. This method computes the HFC for each window as in equation 4.4; note that high-frequency bins (the k index) weigh more in the final value of the HFC.

$$HFC(n) = \sum_k |X_k[n]|^2 \cdot k \qquad (4.4)$$

4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
Moreover, the function plots the audio waveform jointly with the detected onsets, to check after each test whether the detection has worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
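For reference, the standard Essentia recipe for HFC-based onset detection looks roughly like the sketch below; the file name is an assumption, and the frame and hop sizes follow common defaults rather than the exact values used in the project.

```python
import numpy as np
import essentia.standard as es

audio = es.MonoLoader(filename='submission.wav')()  # assumed file name

onset_hfc = es.OnsetDetection(method='hfc')
windowing, fft, c2p = es.Windowing(type='hann'), es.FFT(), es.CartesianToPolar()

# compute the HFC-based detection function frame by frame
detection = []
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    magnitude, phase = c2p(fft(windowing(frame)))
    detection.append(onset_hfc(magnitude, phase))

# Onsets() peak-picks the weighted detection curves into onset times in seconds
onset_times = es.Onsets()(np.array([detection], dtype=np.float32), [1])
```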
With the onsets information, the audio can be trimmed into the different events; their order is encoded in the file names, so the events can easily be mapped to the expected ones when comparing. The audios are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a csv file.
To predict which event each slice contains, the already trained models are loaded in this new environment and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row in the data. The described process is implemented in the first part of Assessment.ipynb⁶; the second part executes the visualization functions described in the next section.
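Assuming the scaler and model were dumped together as in the training sketch of section 4.2.2, the inference step reduces to something like the following (file names are illustrative):

```python
import joblib
import pandas as pd

# load the pre-processing and the trained SVM dumped during training
scaler, clf = joblib.load('svm_drums.joblib')

# one row of features per detected onset, in onset order (assumed file name)
slices = pd.read_csv('submission_features.csv')
predicted_events = clf.predict(scaler.transform(slices))
print(predicted_events)  # e.g. ['hh+kd' 'sd' 'hh' ...]
```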
4.3 Music performance assessment

Finally, as already commented, the assessment part has been focused on giving the student visual feedback on the interpretation. As the drums classifier has taken so much time, the creation of a dataset of interpretations with their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different skill levels, but grading them and doing all the process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed grades.
So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization

With the data gathered in the testing section, feedback on the interpretation has to be returned. Taking as a base implementation the solution of my colleague Eduard Vergés⁷, and thanks to the help of Vsevolod Eremenko⁸, the visualization is done in the last cell of the notebook Assessment.ipynb.

First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.

6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()⁹ is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (these come from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio.

In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (from green, good, to red, bad).
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
Once the waveform plot is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes()¹⁰, the expected and predicted events are compared, producing a list with 1 at the indices where they match and 0 where they do not. The resulting list is then iterated and the 0 indices are checked, because most of the classification errors miss only one of the instruments in the event (i.e. predicting sd instead of hh+sd), and these cases are considered partially correct, as the system has to take its own errors into account. At the indices where one of the instruments is correctly predicted and it is not a hi-hat (we consider getting the snare and kick reading right more important than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green partially correct, and red incorrect.

9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
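A minimal sketch of this scoring logic is shown below, with illustrative names; the event strings and the 0.75 partial score follow the text above, but the exact implementation in the repository may differ.

```python
def assess_notes(expected, predicted):
    """Score each event: 1.0 correct, 0.75 partially correct, 0.0 incorrect."""
    scores = []
    for exp, pred in zip(expected, predicted):
        if pred == exp:
            scores.append(1.0)                  # green
        elif (set(exp.split('+')) & set(pred.split('+'))) - {'hh'}:
            scores.append(0.75)                 # a non-hi-hat instrument matches
        else:
            scores.append(0.0)                  # red
    return scores

print(assess_notes(['hh+sd', 'kd', 'hh'], ['sd', 'hh', 'hh']))  # [0.75, 0.0, 1.0]
```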
With the waveform, the notes assessed and the LilyPond template, the function score_image()¹¹ can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On the one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the assessed notes); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars); the other number is the scale of the waveform and allows fitting the plot better to the staff.
4.3.2 Files used

The assessment process of an exercise needs several files: first, the annotations of the expected events and their timesteps, found in the txt file already mentioned in section 3.1.1; then, the LilyPond file, i.e. the template written in the LilyPond language that defines the resulting music sheet, where the macros to change color and size and to add the waveform are defined; in addition, when extracting the musical features, each submission creates its csv file to store the information; and finally, we need of course the audio files with the recorded submission to be assessed.
11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and increasing by 40 bpm up to 220 bpm. The recordings with good tempo and good reading have been processed by repeatedly adding 6 dB, up to an accumulated +30 dB.

In this chapter and Appendix B all the resulting feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack¹ has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high frequency content method works perfectly, there are no limitations nor errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positive.

1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations

One of the limitations of the system is the tempo of the exercise: the accuracy drops when the tempo increases. Taking as a reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct. To score each case, a correct prediction weighs 1.0, a partially correct one 0.5 and an incorrect one 0; the total value is the mean of the weighted predictions. For example, at 60 bpm: (25 × 1.0 + 7 × 0.5) / 32 ≈ 0.89.
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
Figure 19 Good reading and good tempo Ex 1 180 bpm

Figure 20 Good reading and good tempo Ex 1 220 bpm

In Table 3 we can see that increasing the tempo of exercise 1 decreases the accuracy of the classifier. This may be because increasing the tempo decreases the spacing between events, and consequently the duration of each event, which leads to fewer values for calculating the mean and standard deviation when extracting the timbre characteristics. As stated in the Law of Large Numbers [25], the larger the sample, the closer its mean is to the population mean; in this case, having fewer values in the calculation yields more scattered estimates that are more sensitive to outliers.
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48

Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But in 12/8 at 100 bpm the subdivision rate is 300 eighth notes per minute, similar to 140 bpm in 4/4, whose eighth-note rate is 280 per minute. The results on 12/8 (Table 4) are also somewhat better, because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875

Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations

Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figure 23 and Figure 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that the resulting high-frequency content is predicted as a hi-hat in all cases.
Level    Correct   Partially OK   Incorrect   Total
+0dB     25        7              0           0.89
+6dB     23        9              0           0.86
+12dB    23        9              0           0.86
+18dB    24        7              1           0.86
+24dB    18        5              9           0.64
+30dB    13        5              14          0.48

Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0dB     12        7              13          0.48
+6dB     13        10             9           0.56
+12dB    10        8              14          0.5
+18dB    9         2              21          0.31
+24dB    8         0              24          0.25
+30dB    9         0              23          0.28

Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
5.3 Evaluation of the assessment

Until now the evaluation of results has been focused on the drums event classifier accuracy, but we think it is also important to evaluate whether the system can properly assess a student's submission.

As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar, so the rest of the beats have to fall after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starts at 60 bpm and adds 40 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized; it works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs the prediction-based assessment quite well. This can help a student know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help the student recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (black lines at the bottom of the waveform) are mostly behind the correspondent expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. Also, a set of further improvements is given, and a final thought on my work and my learning process. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results

Having in mind all the concepts explained along this document, we can now list them, noting their completeness and our contributions.

Firstly, the 29k Samples Drums Dataset is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of such interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on the context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained to do is very hard for a human: being able to differentiate drums events in an individual drum sample, without any context, is almost impossible even for a trained ear, such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they could be improved, as they are mostly focused on visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the 29k Samples Drums Dataset. It would be desirable to get recordings in different spaces and with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work

In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even different people playing. This could introduce more variance in the drums sample dataset. Moreover, on June 9th 2021, a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 2021¹. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.

Regarding the classification model, it clearly needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.

1 https://www.2021.ieeeicassp.org
Also, in terms of the classes in the drumset, there is a long path to cover: there are no solutions that robustly transcribe a whole set, including the toms and different kinds of cymbals. In this direction, we think that a proper approach would be to work with professional musicians, who can help researchers better understand the instrument, and to create datasets covering different techniques.
Regarding the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises to give each student a mark. In this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.

About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility

In computational sciences, a work is reproducible if code and data are available and other researchers or students can execute them, getting the same results.

All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository², as well as the data used to test the system and the classification models.

The data created, i.e. the studio recordings, are available in a Zenodo repository³ and some samples in Freesound⁴. This is the 29k Samples Drums Dataset: not all the 40k samples used for training are our property, so we cannot share them under our full authorship; despite this, the other datasets used in this project are available individually.

2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions

This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although work remains in terms of robustness and reliability. A first approximation has been presented, as well as several paths of improvement proposed.

Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web version, GitHub. Also, at the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which has been a tough path to take. In a more advanced phase of the project, I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebook on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that results from the system, and I hope that more people get interested in this field of research so that we get better tools in the future.
List of Figures

1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff 41
24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1, starts at 60 bpm and adds 40 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables

1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography

[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).

[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).

[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).

[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).

[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).

[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).

[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).

[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).

[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).

[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).

[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).

[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).

[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).

[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).

[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).

[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).

[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).

[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).

[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).

[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).

[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).

[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).

[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings - 14th International Society for Music Information Retrieval Conference (2013).

[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).

[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).

[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
[Score rendered with LilyPond 2.18.2]
Figure 31 Recording routine 1
[Score rendered with LilyPond 2.18.2]
Figure 32 Recording routine 2
[Score rendered with LilyPond 2.18.2]
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
16 Chapter 3 The 40kSamples Drums Dataset
package xmltodict5 in the second part of Dataset_formatUnificationipynb the
xml files are loaded as a Python dictionary and converted to txt format
32 Created datasets
In order to expand the dataset with more variety of samples other methods to get
data have been explored On one hand with audio data that has partial annotations
or some representation that is not data-driven such as a music sheet that contains
a visual representation of the music but not a logic annotation as mentioned in
the previous section On the other hand generating simple annotations is an easy
task so drums samples can be recorded standalone to create data in a controlled
environment In the next two sections these methods are described
321 Music school
A music school has shared its docent material with the MTG for research purposes
ie audio demos books in pdf format music sheet in Sibelius format As we can
see in Figure 1 the annotations from the music school corpus are in Sibelius format
this is an encrypted representation of the music sheet that can only be opened with
the Sibelius software The MTG has shared an AVID license which includes the
Sibelius software so we were able to convert the sib files to musicxml MusicXML
is not encrypted and allows to open it and read so a parser has been developed to
convert the MusicXML files to a symbolic representation of the music sheet This
representation has been inspired by [24] which proposes a system to represent chords
MusicXML parser
As mentioned in section 23 MusicXML format is based on ordering the visual
information with tags creating a tree structure of nested dictionaries In the first cell
of XML_parseripynb6 two functions are defined ConvertXML2Annotation reads
the musicxml file and gets the general information of the song (ie tempo time
5httpspypiorgprojectxmltodict6httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterXML_parseripynb
32 Created datasets 17
measure title) then a for loops throughout all the bars of the music sheet checking
whereas the given bar is self-defined the repetition of the previous one or the begin
or end of a repetition in the song (see Figure 2) in the self-defined bar case the bar
indeed is passed to an auxiliar function which parses it getting the aforementioned
symbolic representation
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three
others have a symbol that means rsquorepetition of the previous barrsquo moreover the
bar lines at the beginning and the end represents that these four bars have to be
repeated therefore this line in the music score represents an interpretation of eight
bars repeating the first one
The symbolic representation that we propose is based in [24] defines each bar with
a string this string contains the representations of the events in the bar separated
with blank spaces Each of the events has two dots () to separate the figure (ie
quarter note half note whole note) from the note or notes of the event which
are separated by a dot () For instance the symbolic representation of the first bar
in Figure 2 is F4A44 F4A44 F4A44 F4A44
In addition to this conversion in parse_one_measure function from XML_parser
notebook each measure is checked to ensure that fully represents the bar This
means that the sum of the figures of the bar has to be equal to the defined in the
time measure the sum of the events in a 44 bar has to be equal to four quarter
notes
Symbolic notation to unified annotation format
As we can see in Figure 1 once the music scores are converted to the symbolic
representation the last step is to unify the annotations with the used in sections 31
18 Chapter 3 The 40kSamples Drums Dataset
This process is made in the last cells of Dataset_formatUnification7 notebook
A dictionary with the translation of the notes to drums instrument is defined so
the note is already converted Differently the timestamp of each event has to be
computed based on the tempo of the song and the figure of each event this process
is made with the function get_time_steps_from_annotations8 which reads the
interpretation in symbolic notation and accumulates the duration of each event
based on the figure and the tempo
322 Studio recordings
At this point of the dataset creation we realized that the already existing data
was so unbalanced in terms of instances per class some classes had around two
thousand samples while others had only ten This situation was the reason to
record a personalized dataset to balance the overall distribution of classes as well
as exercises with different accuracy when reading simulating students with different
skill levels
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants
Barcelona) the first day was intended to mount the drumset and the microphones
which are listed in Table 2 in Figure 3 the microphone setup is shown differently
to the standard setup in which each instrument of the set has its microphone this
distribution of the microphones was intended to record the whole drumset with
different frequency responses
The recording process was divide into two phases first creating samples to balance
the dataset used to train the drums event classifier (called train set) Then recording
the studentsrsquo assignment simulation to test the whole system (called test set)
7httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_formatUnificationipynb
8httpsgithubcomMaciACtfg_DrumsAssessmentblobe81be958101be005cda805146d3287eec1a2d5a4scriptsdrumspyL9
9httpswwwstereodosiscom
32 Created datasets 19
Microphone Transducer principleBeyerdynamic TG D70 Dynamic
Shure PG52 DynamicShure SM57 Dynamic
Sennheiser e945 DynamicAKG C314 CondenserAKG C414 CondenserShure PG81 CondenserSamson C03 Condenser
Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes we decided to take into account only the classes
that appear in the music school subset this decision was motivated by the idea of
assessing the songs from the books so only classes of the collection of songs were
needed to train the classifier In Figure 4 the distribution of the selected classes
before the recordings is shown note that is in logarithmic scale so there is a large
difference among classes
20 Chapter 3 The 40kSamples Drums Dataset
Figure 4 Number of samples before Train set recording
To organize the recording process we designed 3 different routines to record depend-
ing on the class and the number of samples already existing a different routine was
recorded These routines were designed trying to represent the different speeds dy-
namics and interactions between instruments of a real interpretation In Appendix
A the routines scores are shown to write a generic routine a two lines stave is used
the bottom line represents the class to be recorded and the top line an auxiliary
one The auxiliary classes are cymbals concretely crashes and rides whose sound
remains a long period of time and its tail is mixed with the subsequent sound events
bull Routine 1 (Fig 31) This routine is intended for the classes that do not include
a crash or ride cymbal and has a small number of classes (ie lt500)
bull Routine 2 (Fig 32) This routine does not include auxiliary events as it is
intended for classes that include crash or ride cymbal whose interaction with
itself is intrinsic
bull Routine 3 (Fig 33) This is a short version of routine 1 which only repeats
each bar two times instead of four is intended for classes which not include a
crash or ride cymbal and has a large number of classes (ie gt500)
32 Created datasets 21
Routines 1 and 3 were recorded only one time as we had only one instrument of each
of the classes differently routine 2 was recorded two times for each cymbal as we
was able to use more instances of them different cymbals configurations used can
be seen in Appendix A in Figures 34 35 and 36
After the Train set recording the number of samples was a little more balanced as
shown in Figure 5 all the classes have at least 1500 samples
0
1000
2000
3000
ht+kd
kd+m
t
ht
mt
ft+sd
ft+kd
+sd
cr+sd
ft
cr+kd
cr
ft+kd
hh+k
d+sd
kd+s
d
cy+s
d
cy
cy+k
d sd
kd
hh+s
d
hh+k
d
hh
recorded before record
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song
in the same drumset to do that we recorded each song of the music school Drums
Grade Initial and Grade 1 playing it correctly and then making mistakes in both
reading and rhythmic ways After testing with these recordings we realized that we
were not able to test the limits of the assessment system in terms of tempo or with
different rhythmic measures So we proposed two exercises of groove reading in 44
and in 128 to be performed with different tempos these recordings have been done
in my study room with my laptoprsquos microphone
22 Chapter 3 The 40kSamples Drums Dataset
33 Data augmentation
As described in section 212, data augmentation aims to introduce changes to the
signals to optimize the statistical representation of the dataset. To implement this
task, the aforementioned Python library audiomentations is used.
The Audiomentations library has a class called Compose which allows collecting
different processing functions, assigning a probability to each of them. The Compose
instance can then be called several times with the same audio file, and each time the
resulting audio will be processed differently because of those probabilities. In
data_augmentationipynb10 a possible implementation is shown, as well as some
plots of the original sample with different results of applying the created Compose
to the same sample; an example of the results can be listened to on Freesound11.
The processing functions introduced in the Compose are based on the ones proposed
in [13] and [14]; their parameters are described below, and a code sketch follows
the list:
• Add Gaussian noise, with 70% probability.
• Time stretch between 0.8 and 1.25, with 50% probability.
• Time shift forward a maximum of 25% of the duration, with 50% probability.
• Pitch shift of ±2 semitones, with 50% probability.
• Apply mp3 compression, with 50% probability.
10httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterdata_augmentationipynb
11httpsfreesoundorgpeopleMaciaACpacks32213
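As an illustration, the following is a minimal sketch of such a Compose, mirroring the probabilities listed above. The parameter names assume a recent audiomentations release and may differ slightly between versions; the input filename is hypothetical.

import soundfile as sf
from audiomentations import (AddGaussianNoise, Compose, Mp3Compression,
                             PitchShift, Shift, TimeStretch)

# One transform per item in the list above; p is the probability of applying it
augment = Compose([
    AddGaussianNoise(p=0.7),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    Shift(min_fraction=0.0, max_fraction=0.25, rollover=False, p=0.5),  # forward only
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    Mp3Compression(p=0.5),
])

samples, sample_rate = sf.read("hh_0001.wav")  # hypothetical drum event slice
# Each call draws the probabilities again, so repeated calls give different outputs
augmented = augment(samples=samples, sample_rate=sample_rate)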
34 Drums events trim
As will be explained in section 421, the dataset has to be trimmed into individual
files in order to analyze them and extract the low-level descriptors. In the
Dataset_featureExtractionipynb12 notebook this process has been implemented,
slicing all the audios with their annotations, each dataset separately, to sight-check
all the resulting samples and better detect which annotations were not correct.
35 Summary
To summarize, a drums samples dataset has been created; the one used in this
project will be called the 40k Samples Drums Dataset. Nonetheless, to share this
dataset we have to ensure that we fully own the data, which means that the samples
that come from the IDMT, MDBDrums and MusicSchool datasets cannot be shared
in another dataset. Alternatively, we will share the 29k Samples Drums Dataset,
formed only by the samples recorded in the studio. This dataset will be available
in Zenodo13, to download the whole dataset at once, and in Freesound, where some
selected samples are uploaded in a pack14 to show the differences among
microphones.
12httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_featureExtractionipynb
13httpszenodoorgrecord4958592YMmNXW4p5TZ
14httpsfreesoundorgpeopleMaciaACpacks32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment
pipeline are explained. In Figure 6 the proposed pipeline diagram is shown; it is
inspired by [2]. Each box of the diagram refers to a section in this chapter, so the
diagram might be helpful to get a general idea of the problem when explaining each
process.
The system is divided into two main processes. First, the top boxes correspond to
the training process of the model, using the dataset created in the previous chapter.
Secondly, the bottom row shows how a student submission is processed to generate
some feedback. This feedback is the output of the system and should give some
indications to the student on how they have performed and how they can improve.
41 Problem definition
To check if a student reads a music sheet correctly, we need some tool to tag which
instruments of the drumset are playing for each detected event. This leads us to
develop and train a drums event classifier; if this tool ensures a good accuracy
when classifying (i.e. >95%), we will be able to properly assess a student's recording.
If the classifier does not have enough accuracy, the system will not be useful, as we
will not be able to differentiate between errors from the student and errors from
the classifier.
[Pipeline diagram: music scores, students' performances and assessments provide the
annotations and audio recordings that form the dataset; feature extraction feeds the
drums event classifier training and the performance assessment training (top row);
a new student's recording goes through feature extraction and performance
assessment inference to produce a visualization and performance feedback
(bottom row).]
Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason, the project has been mainly focused on developing the aforementioned
drums event classifier and a proper dataset. Consequently, developing a properly
assessed dataset of drums interpretations has not been possible, nor has the
performance assessment training. Despite this, the feedback visualization has been
developed, as it is a nice way to close the pipeline and get some understandable
results; moreover, the performance feedback could be focused on deterministic
aspects, such as telling the student if they are rushing or dragging in relation to
a given tempo.
42 Drums event classifier
As already mentioned, this section has been the main workload of this project,
because a reliable assessment depends on a correct automatic transcription. The
process has been divided into three main parts: extracting the musical features,
training and validating the model in an iterative process, and finally validating
the model with totally new data.
421 Feature extraction
The feature extraction concept has been explained in Section 211 and has been
implemented using the MusicExtractor()1 method from the Essentia library.
The MusicExtractor() method has to be called passing as parameters the window
and hop sizes that will be used to perform the analysis, as well as the filename of
the event to be analyzed. The function extract_MusicalFeatures()2 has been
implemented to loop over a list of files and analyze each of them, adding the
extracted features to a csv file jointly with the class of each drum event. At this
point all the low-level features were extracted; both mean and standard deviation
were computed across all the frames of the given audio filename. The reason was
that we wanted to check which features were redundant or meaningful when training
the classifier.
As mentioned in section 34, the fact that the MusicExtractor() method has to be
called with a filename, not an audio stream, forced us to create another version of
the dataset, which had each event annotated in a different audio file with the
correspondent class label as filename. Once all the datasets were properly sliced
and sight-checked, the last cell of the notebook was executed with the correspondent
folder names (which contain all the sliced samples) and the features saved in
different csv files, one for each dataset3. Adding the number of instances in all the
csv files, we get 40228 instances with 84 features and 1 label.
1httpsessentiaupfedureferencestd_MusicExtractorhtml
2httpsgithubcomMaciACtfg_DrumsAssessmentblobe81be958101be005cda805146d3287eec1a2d5a4scriptsfeature_extractionpyL6
3httpsgithubcomMaciACtfg_DrumsAssessmenttreemasterdataslicesfeatures
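A minimal sketch of this extraction loop is shown below. It is a hypothetical re-implementation in the spirit of extract_MusicalFeatures(), assuming Essentia's Python bindings; the filenames, the float-only filter and the label-in-filename convention are assumptions.

import csv
import essentia.standard as es

# Window (frame) and hop sizes are passed when constructing the extractor
extractor = es.MusicExtractor(lowlevelFrameSize=2048, lowlevelHopSize=1024,
                              lowlevelStats=["mean", "stdev"])

def extract_features_to_csv(filenames, out_csv):
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        for filename in filenames:
            features, _ = extractor(filename)  # returns a Pool of descriptors
            names = sorted(d for d in features.descriptorNames()
                           if d.startswith("lowlevel")
                           and isinstance(features[d], float))
            # Class label assumed to be encoded in the filename, e.g. "hh+sd_0042.wav"
            label = filename.split("/")[-1].split("_")[0]
            writer.writerow([features[d] for d in names] + [label])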
422 Training and validating
As mentioned in section 22, some authors have proposed machine learning algorithms
such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) to do
sound event classification; other authors have developed more complex methods for
drums event classification. The complexity of these last methods made me choose
the generic ones, also to try whether they were a good way to approach the problem,
as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has
been the main reference when designing the 40k Drums Samples Dataset. The first
times we tried the models, we were working with the class distribution of Figure 4;
as commented, this was a very unbalanced dataset, and we were evaluating the
classification inference with the accuracy formula 4.1, which does not take into
account the unbalance in the dataset. The accuracy computation was around 92%,
but the correct predictions were mainly on the large classes; as shown in Figure 7,
some classes had very low accuracy (even 0%, as some classes have 10 samples,
7 used to train and 3 to validate, all of them badly predicted), but having a small
number of instances affects the accuracy computation less.
$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} 1(\hat{y}_i = y_i) \qquad (4.1)$$
Instead, the proper way to compute the accuracy on this kind of dataset is the
balanced accuracy: it computes the accuracy for each class and then averages the
accuracy over all the classes, as in formula 4.2, where w_i represents the weight
of each class in the dataset. This computation lowered the result to 79%, which
was not a good result.
$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \qquad (4.2)$$
$$\text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i$$
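The effect is easy to reproduce with scikit-learn's metrics. In this toy example (hypothetical labels), a classifier that always predicts the majority class looks good under plain accuracy but not under balanced accuracy:

from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = ["hh"] * 90 + ["mt"] * 10   # unbalanced ground truth
y_pred = ["hh"] * 100                # always predict the majority class
print(accuracy_score(y_true, y_pred))           # 0.9
print(balanced_accuracy_score(y_true, y_pred))  # 0.5, i.e. (1.0 + 0.0) / 2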
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score,
which combines the precision and the recall of the model in one measure, as in
formula 4.3. Precision is computed as the number of correct predictions divided
by the total number of predictions, and recall is the number of correct predictions
divided by the total number of predictions that should be correct for a given class.
$$F\text{-measure} = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (4.3)$$
These results led us to record a personalized dataset to extend the already existing
one (see section 322). With this new distribution the results improved, as shown
in Figure 8, as well as the balanced accuracy and f-score (both 89%). Until this
point we were using both KNN and SVM models to compare results, and the SVM
always performed at least 10% better, so we decided to focus on the SVM and its
hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine refers to the regularization; this
technique is intended to make a model less sensitive to the data noise and to the
outliers that may not represent the class properly. When increasing this value to
10, the results improved among all the classes, as shown in Figure 9, as well as
the accuracy and f-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare
drum class was somewhat of a problem, as it is one of the most used instruments in
the drumset, jointly with the hi-hat and the kick drum. So I tried the same process
with the classes that include only the three mentioned instruments (i.e. hh, kd, sd,
hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved
the overall accuracy and f-score to 97.7%, and concretely the sd accuracy to 96%,
as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the training and validating iterative process has been
developed in the Classifier_trainingipynb4 notebook. First, the csv files with
the features extracted in Dataset_featureExtractionipynb are loaded; then,
depending on which subset of classes will be used, the correspondent instances are
filtered, and to remove redundant features the ones with a very low standard
deviation are deleted (i.e. std_dev < 0.00001). As the SVM works better when data
is normalized, the standard scaler is used to move all the data distributions around
0, ensuring a standard deviation of 1.
In the next cells the dataset is split into train and validation sets, and the training
method from the SVM of sklearn is called to perform the training. When the models
are trained, the parameters are dumped to a file in order to load the model a
posteriori and apply the learned knowledge to new data. This process was very slow
on my computer, so we decided to upload the csv files to Google Drive and open the
notebook with Google Colaboratory, as it was faster; this is key to avoid long
waiting times during the iterative train-validate process. In the last cells the
inference is made with the validation set and the accuracy is computed, as well as
the confusion matrix plotted, to get an idea of which classes are performing better.
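A condensed sketch of this train-validate loop is given below. The csv path and column names are hypothetical; the pipeline bundles the scaler with the SVM so the same pre-processing is reused at inference time.

import joblib
import pandas as pd
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("features.csv")                    # hypothetical merged features file
X, y = df.drop(columns=["label"]), df["label"]
X = X.loc[:, X.std() > 0.00001]                     # drop near-constant features
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = make_pipeline(StandardScaler(), SVC(C=10))  # scaling + SVM in one object
model.fit(X_tr, y_tr)
y_pred = model.predict(X_val)
print(balanced_accuracy_score(y_val, y_pred))
print(f1_score(y_val, y_pred, average="macro"))
joblib.dump(model, "svm_drums.joblib")              # dump parameters for later inference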
423 Testing
Testing the model introduces the concept of onset detection: until now all the slices
have been created using the annotations, but to assess a new submission from a
student we need to detect the onsets and then slice the events. The function
SliceDrums_BeatDetection5 does both tasks. As explained in section 211, there
are many methods to do onset detection, and each of them is better suited for a
different application. In the case of drums we tested the 'complex' method, which
finds changes in the frequency domain in terms of energy and phase and works
pretty well; but when the tempo increases, some onsets are not correctly detected,
so we finally implemented the onset detection with the HFC method. This method
computes, for each window, the HFC as in equation 4.4; note that high-frequency
bins (index k) weigh more in the final value of the HFC.
4httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterClassifier_trainingipynb
5httpsgithubcomMaciACtfg_DrumsAssessmentblob9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869scriptsdrumspyL45
$$\mathrm{HFC}(n) = \sum_k |X_k[n]|^2 \cdot k \qquad (4.4)$$
Moreover, the function plots the audio waveform jointly with the detected onsets, to
check after each test whether it has worked correctly. In Figures 11 and 12 we can
see two examples of the same music sheet played at 60 and 220 bpm; in both cases
all the onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
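A sketch of HFC-based onset detection with Essentia is shown below, following the frame-wise recipe from Essentia's documentation; the input filename and the frame/hop sizes are assumptions.

import essentia
import essentia.standard as es

audio = es.MonoLoader(filename="submission.wav")()  # hypothetical student recording
windowing = es.Windowing(type="hann")
fft = es.FFT()
c2p = es.CartesianToPolar()
od_hfc = es.OnsetDetection(method="hfc")

# Build the frame-wise HFC detection curve
curve = []
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    magnitude, phase = c2p(fft(windowing(frame)))
    curve.append(od_hfc(magnitude, phase))

# Peak-pick the curve to get onset times in seconds
onsets = es.Onsets()(essentia.array([curve]), [1])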
With the onsets information, the audio can be trimmed into the different events;
the order is maintained through the file names, so when comparing with the expected
events they can be mapped easily. The audios are passed to the
extract_MusicalFeatures() function, which saves the musical features of each
slice in a csv file.
To predict which event each slice is, the models already trained are loaded in this
new environment and the data is pre-processed using the same pipeline as when
training. After that, the data is passed to the classifier method predict(), which
returns, for each row in the data, the predicted event. The described process is
implemented in the first part of Assessmentipynb6; the second part is intended to
execute the visualization functions described in the next section.
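Because the scaler and the SVM were saved together (see the training sketch above), the inference step reduces to a few lines; the filenames are hypothetical:

import joblib
import pandas as pd

model = joblib.load("svm_drums.joblib")            # pipeline: scaler + SVM
slices = pd.read_csv("submission_features.csv")    # one row per sliced onset
predicted_events = model.predict(slices)           # e.g. ["hh+kd", "hh", "hh+sd", ...]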
43 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving
visual feedback of the interpretation to the student. As the drums classifier has
taken so much time, the creation of a dataset with interpretations and their grades
has not been feasible. A first approximation was to record different interpretations
of the same music sheet, simulating different skill levels, but grading them and
doing all the process by ourselves was not easy; apart from that, we tended to play
the fragments either well or badly, and it was difficult to simulate intermediate
levels and be consistent with the proposed ones.
So the implemented solution generates an image that shows the student whether the
notes of the music sheet are correctly read and whether the onsets are aligned with
the expected ones.
431 Visualization
With the data gathered in the testing section, feedback on the interpretation has
to be returned. Having as a base implementation the solution of my colleague
Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is
done in the last cell of the notebook Assessmentipynb.
First the LilyPond file paths are defined. Then, for each of the submissions, the
audio is loaded to generate the waveform plot.
6httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterAssessmentipynb
7httpsgithubcomEduardVergesFranchU151202_VA_FinalProject
8httpsgithubcomseffkaForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected
and expected onsets, the waveform, and the start and end of the waveform (this
comes from the LilyPond file's macro). To properly plot the deviations, the code
assumes that the interpretation starts four beats after the beginning of the audio.
In Figures 13 and 14 the result of save_bar_plot() for two different submissions
is shown. The black lines at the bottom of the waveform are the detected onsets,
while the cyan lines in the middle are the expected onsets; when the difference
between the two values increases, the area between them is colored with a traffic
light code (green good to red bad).
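A simplified sketch of such a deviation plot is given below. It is not the project's save_bar_plot(), whose exact signature is in the repository; the function name, the amplitude ranges and the 0.25 s normalization are assumptions.

import matplotlib.pyplot as plt
import numpy as np

def plot_onset_deviations(waveform, sr, detected, expected, out_png):
    t = np.arange(len(waveform)) / sr
    fig, ax = plt.subplots(figsize=(12, 2))
    ax.plot(t, waveform, color="lightgray")
    cmap = plt.get_cmap("RdYlGn_r")                  # green (good) to red (bad)
    for d, e in zip(detected, expected):
        ax.vlines(d, -1.0, -0.5, color="black")      # detected onset, bottom
        ax.vlines(e, -0.25, 0.25, color="cyan")      # expected onset, middle
        deviation = min(abs(d - e) / 0.25, 1.0)      # assume 0.25 s = fully red
        ax.axvspan(min(d, e), max(d, e), color=cmap(deviation), alpha=0.4)
    fig.savefig(out_png)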
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
Once the waveform is created, it is embedded in a lambda function that is called
from the LilyPond render. But before calling LilyPond to render, the assessment of
the notes has to be done. In the function assess_notes()10, the expected and
predicted events are compared, creating a list with 1 at the indices where they
match and 0 where they differ; then the resulting list is iterated and the 0 indices
are checked, because most of the classification errors miss only one of the
instruments to be predicted (i.e. instead of hh+sd it predicts sd). These cases are
considered partially correct, as the system has to take its own errors into account:
at the indices where one of the instruments is correctly predicted and it is not a
hi-hat (we consider it more important to get the snare and kick reading right than
a hi-hat, which is present in all the events), the value is set to 0.75 (light green
in the color scale). In Figure 15 the different feedback options are shown: green
notes mean correct, light green means partially correct, and red means incorrect.
9httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsvisualizationpyL112
10httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsdrumspyL88
Figure 15 Example of coloured notes
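The scoring logic just described can be condensed as follows; this is a hypothetical re-implementation of assess_notes(), not the repository code:

def assess_notes(expected, predicted):
    # 1.0 = exact match (green), 0.75 = partial (light green), 0.0 = wrong (red)
    scores = []
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            scores.append(1.0)
        elif (set(exp.split("+")) & set(pred.split("+"))) - {"hh"}:
            # A shared snare, kick or tom counts; a shared hi-hat alone does not
            scores.append(0.75)
        else:
            scores.append(0.0)
    return scores

print(assess_notes(["hh+sd", "hh+kd"], ["sd", "hh"]))  # [0.75, 0.0]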
With the waveform, the notes assessed and the LilyPond template, the function
score_image()11 can be called. This function renders the LilyPond template jointly
with the previously created waveform; this is done with the LilyPond macros. On
the one hand, before each note on the staff, the keyword color() size() determines
that the color and size of the note depend on an external variable (the notes
assessed); on the other hand, after the first note of the staff, the keyword
eps(1150 16) indicates on which beat the waveform starts to be displayed and on
which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars); the other
number is the scale of the waveform and allows fitting the plot better to the staff.
432 Files used
The assessment process of an exercise needs several files. First, the annotations of
the expected events and their timesteps; these are found in the txt file already
mentioned in section 311. Then the LilyPond file; this is the template, written in
the LilyPond language, that defines the resulting music sheet, where the macros to
change color and size and to add the waveform are defined. When extracting the
musical features, each submission creates its csv file to store the information. And
finally we need, of course, the audio files with the recorded submission to be
assessed.
11httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsvisualizationpyL187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can
evaluate the results to check whether it works correctly and is useful for a student
to learn, and also to test the limits regarding audio signal quality and tempo.
The tests have been done with two different exercises, recorded with a computer
microphone and played at different tempos, starting at 60 bpm and adding 40 bpm
until 220 bpm. The recordings with good tempo and good reading have been
processed adding 6 dB until an accumulated +30 dB.
In this chapter and Appendix B all the resulting feedback visualizations are shown.
The audio files can be listened to in Freesound, where a pack1 has been created.
Some of them will be commented on and referenced in further sections; the rest are
extra results.
As the high frequency content method works perfectly, there are no limitations nor
errors in terms of onset detection: all the tests have an f-measure of 1, detecting
all the expected events without detecting any false positive.
1httpsfreesoundorgpeopleMaciaACpacks32350
51 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops
when the tempo increases. Taking as reference the figures that show a good reading,
in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20,
21 and 22), we can count how many are correct or partially correct to score each
case: a correct prediction weighs 1.0, a partially correct one weighs 0.5, and an
incorrect one 0; the total value is the mean of the weighted results of the
predictions.
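For example, the 60 bpm row of Table 3 (25 correct, 7 partially correct, 0 incorrect over 32 events) yields:

$$\text{total} = \frac{25 \cdot 1.0 + 7 \cdot 0.5 + 0 \cdot 0.0}{32} = \frac{28.5}{32} \approx 0.89$$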
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of
the classifier decreases. This may be because increasing the tempo decreases the
spacing between events, and consequently the duration of each event, which leads
to fewer values for calculating the mean and standard deviation when extracting
the timbre characteristics. As stated in the Law of Large Numbers [25], the larger
the sample, the closer the mean is to the total population mean; in this case, having
fewer values in the calculation creates more outliers in the distribution, which tends
to scatter.
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48
Table 3 Results of exercise 1 at different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster
than 100 bpm. But in 12/8 the equivalent tempo is 300 quarter notes per minute,
similar to 140 bpm in 4/4, whose quarter-note tempo is 280. The results in 12/8
(Table 4) are also better because there are more 'only hi-hat' events, which are
better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875
Table 4 Results of exercise 2 at different tempos
52 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening
to the submissions, the hi-hat events are recorded with less amplitude than the
snare and kick events; for this reason we think that the classifier starts to fail at
+18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the
previous section is applied to Figure 23 and Figure 24. The hi-hat is the last
waveform to saturate, and at this gain level the overall waveform is so clipped that
it leads to high-frequency content that is predicted as a hi-hat in all the cases.
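The amplification applied to the test recordings can be reproduced in a couple of lines; hard-clipping to [-1, 1] models the saturation (the function name is an assumption):

import numpy as np

def amplify_and_clip(samples, gain_db):
    # +6 dB steps were accumulated up to +30 dB in the experiments above
    gain = 10.0 ** (gain_db / 20.0)
    return np.clip(samples * gain, -1.0, 1.0)  # clipping models the saturation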
Level    Correct   Partially OK   Incorrect   Total
+0dB     25        7              0           0.89
+6dB     23        9              0           0.86
+12dB    23        9              0           0.86
+18dB    24        7              1           0.86
+24dB    18        5              9           0.64
+30dB    13        5              14          0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0dB     12        7              13          0.48
+6dB     13        10             9           0.56
+12dB    10        8              14          0.5
+18dB    9         2              21          0.31
+24dB    8         0              24          0.25
+30dB    9         0              23          0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier
accuracy, but we think it is also important to evaluate whether the system can
properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some
of the beats are not read, the system can still map the rest of the events to the
expected ones at the correspondent onset time step. This is due to a check done in
the assessment, which assumes that before the first beat there is a count-back of
one bar and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections, counting the
number of correct predictions, but now in terms of assessment. The analyzed results
will be the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized; it works as follows: we count a
correct assessment if the note is green or light green and the event is the one in the
music score, or if the note is red and the event is not the one in the music score.
The rest of the cases are counted as incorrect assessments. The total value is the
number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68
Table 7 Assessment result of a bad reading at different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9
Table 8 Assessment result of a bad reading at different tempos, 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs
the assessment based on the predictions pretty well. This can be helpful for a
student to know which parts of the music sheet are well read and which are not.
Also, the tempo visualization can help the student recognize whether they are
slowing down or rushing when reading the score: as can be seen in Figure 30, the
detected onsets (black lines in the bottom part of the waveform) are mostly behind
the correspondent expected onset.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed. In this chapter a discussion is developed about which objectives have
been accomplished and which have not. Also, a set of further improvements is given,
as well as a final thought on my work and my apprenticeship. The chapter ends
with an analysis of how reusable and reproducible my work is.
61 Discussion of results
Having in mind all the concepts explained throughout this document, we can now
list them, defining their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly
available and downloadable from Freesound and Zenodo. Apart from being used in
this project, this dataset might be useful to other researchers and students in their
projects. The dataset is indeed useful for balancing drums datasets based on real
interpretations, as the class distribution of such interpretations is very unbalanced,
as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been
proposed and trained with the aforementioned dataset. One of the reasons for using
this approach to predict the events was that there was no literature focused on
classifying drums events in this manner. As the results have shown, more complex
methods based on the context might be used, such as the ones proposed in [16]
and [17]. It is important to take into account that the task the model is trained to
do is very hard for a human being: differentiating drums events in an individual
drum sample, without any context, is almost impossible even for a trained ear such
as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well
as the development of a MusicXML parser. This part took around one month to
develop and, from my point of view, it was a great way to understand how these
file formats work and how they could be improved, as they are mostly focused on
the visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate
the functionality of the system, and tests of these exercises have been recorded in
a different environment than the 29k Samples Drums Dataset. It would be good to
get recordings in different spaces and with different drumsets and microphones in
order to test the system more exhaustively.
62 Further work
In terms of the dataset created, it could be larger. It could be expanded with
different drumsets, tuning each drumset differently, using different sticks to hit the
instruments, and even different people playing; this could introduce more variance
in the drums sample dataset. Moreover, on June 9th 2021 a paper about a large
drums dataset with MIDI data was presented [26] at ICASSP 20211. This new
dataset could be included in the training process, as the authors state that having
a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the
overall robustness of the system. It would be appropriate to introduce the
aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
1httpswww2021ieeeicassporg
Also, in terms of the classes in the drumset, there is a long path still to cover.
There are no solutions that robustly transcribe a whole set including the toms and
the different kinds of cymbals. Here, we think a proper approach would be to work
with professional musicians, who can help researchers better understand the
instrument and create datasets covering different techniques.
With respect to the assessment step, apart from the feedback visualization of the
tempo deviations and the reading accuracy, a regression model could be trained
with assessed drums exercises in order to give a mark to each student. On this
path, introducing an electronic drumset with MIDI output would make things a lot
easier, as the drums classifier step could be omitted.
About the implementation, a good contribution would be to introduce the models
and algorithms into the Pysimmusic workflow and develop a demo web app like
MusicCritic's. But better results and more robustness are needed before taking
this step.
63 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and
other researchers or students can execute them, getting the same results.
All the code has been developed in Python, a widely known general-purpose
programming language. It is available in my GitHub repository2, as well as the
data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository3
and some samples in Freesound4. This is the 29k Drums Samples Dataset; as not
all the 40k samples used for training are our property, we are not able to share
them under our full authorship. Despite this, the other datasets used in this project
are available individually.
2httpsgithubcomMaciACtfg_DrumsAssessment
3httpszenodoorgrecord4923588YMRgNm4p7ow
4httpsfreesoundorgpeopleMaciaACpacks32397
64 Conclusions
This project has been developed over one year. At this point, with the work
described, the goal of supporting drums learning has been accomplished, although
the work still falls short in terms of robustness and reliability; a first approximation
has been presented, and several paths of improvement have been proposed.
Moreover, some fields of engineering and computer science have been covered, such
as signal processing, music information retrieval and machine learning; not only in
terms of implementation, but also in investigating methods and gathering already
existing experiments and results.
About my relationship with computers, I have improved my fluency with git and
its web version, GitHub. At the beginning of the project I wanted to execute
everything on my local computer, having to install and compile libraries that could
not be installed on macOS via the pip command (i.e. Essentia), which was a tough
path to take and accomplish. In a more advanced phase of the project I realized
that the LilyPond tools could not be installed and used fluently on my local machine,
so I moved all the code to my Google Drive to execute the notebook on a
Colaboratory machine. Developing code in this environment also has its quirks,
which I have had to learn. In summary, I have spent a lot of time looking for the
ideal way to develop the project, and the process has indeed been fruitful in terms
of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my
Bachelor's degree, as I reviewed some of the concepts of most personal interest.
Being able to relate the project to music and drums helped me keep my motivation
and focus. I am quite satisfied with the feedback visualization that the system
produces, and I hope that more people get interested in this field of research so
that better tools arrive in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60 bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading at different tempos, 4/4 exercise 46
8 Assessment result of a bad reading at different tempos, 12/8 exercise 46
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEE/ACM Trans Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J & Serra X Performance assessment technologies for the support of musical instrument learning e-repository UPF (2020)
[3] Music Technology Group Pysimmusic httpsgithubcomMTGpysimmusic [private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedia of Music (2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M & Sachs C Instruments classification of Grove Encyclopedia of Music (2001)
[6] Mierswa I & Morik K Automatic feature extraction for classifying audio data Mach Learn 58 (2005)
[7] Vos J & Rasch R The perceptual onset of musical tones Perception & Psychophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Transactions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference OnsetDetection httpsessentiaupfedureferencestreaming_OnsetDetectionhtml (2021)
[10] Herrera P Peeters G & Dubnov S Automatic classification of musical instrument sounds Journal of New Music Research 32 (2003)
[11] Schedl M Gómez E & Urbano J Music information retrieval Recent developments and applications Foundations and Trends in Information Retrieval 8 (2014)
[12] van Dyk D A & Meng X-L The art of data augmentation Journal of Computational and Graphical Statistics 10 (2001)
[13] Nanni L Maguolo G & Paci M Data augmentation approaches for improving animal audio classification CoRR (2020)
[14] Ko T Peddinti V Povey D & Khudanpur S Audio augmentation for speech recognition INTERSPEECH (2015)
[15] Adavanne S Fayek H M & Tourbabin V Sound event classification and detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R & Hockman J Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S & Sandler M Drumkit transcription via convolutive NMF 15th International Conference on Digital Audio Effects DAFx 2012 Proceedings (2012)
[18] Miron M Davies M E P & Gouyon F An open-source drum transcription system for Pure Data and Max MSP 2013 IEEE International Conference on Acoustics Speech and Signal Processing (2013)
[19] Dittmar C & Gärtner D Real-time transcription and separation of drum recordings based on NMF decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A & Hockman J MDB Drums - an annotated subset of MedleyDB for automatic drum transcription ISMIR (2017)
[21] Gillet O & Richard G ENST-Drums an extensive audio-visual database for drum signals processing ISMIR (2006)
[22] Marxer R & Janer J Study of regularizations and constraints in NMF-based drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia an audio analysis library for music information retrieval Proceedings of the 14th International Society for Music Information Retrieval Conference (2013)
[24] Gómez E Harte C Sandler M & Abdallah S Symbolic representation of musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G & Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W & Su L Improving automatic drum transcription using large-scale audio-to-MIDI aligned data ICASSP 2021 - 2021 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
[Music score: recording routine 1, quarter note = 60; engraved with LilyPond 2.18.2]
Figure 31 Recording routine 1
[Music score: recording routine 2, quarter note = 60; engraved with LilyPond 2.18.2]
Figure 32 Recording routine 2
[Music score: recording routine 3, quarter note = 60; engraved with LilyPond 2.18.2]
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
32 Created datasets 17
measure title) then a for loops throughout all the bars of the music sheet checking
whereas the given bar is self-defined the repetition of the previous one or the begin
or end of a repetition in the song (see Figure 2) in the self-defined bar case the bar
indeed is passed to an auxiliar function which parses it getting the aforementioned
symbolic representation
Figure 2 Sample drums score from music school drums grade 1
In Figure 2 we can see a staff in which the first bar has been written and the three
others have a symbol that means rsquorepetition of the previous barrsquo moreover the
bar lines at the beginning and the end represents that these four bars have to be
repeated therefore this line in the music score represents an interpretation of eight
bars repeating the first one
The symbolic representation that we propose is based in [24] defines each bar with
a string this string contains the representations of the events in the bar separated
with blank spaces Each of the events has two dots () to separate the figure (ie
quarter note half note whole note) from the note or notes of the event which
are separated by a dot () For instance the symbolic representation of the first bar
in Figure 2 is F4A44 F4A44 F4A44 F4A44
In addition to this conversion in parse_one_measure function from XML_parser
notebook each measure is checked to ensure that fully represents the bar This
means that the sum of the figures of the bar has to be equal to the defined in the
time measure the sum of the events in a 44 bar has to be equal to four quarter
notes
Symbolic notation to unified annotation format
As we can see in Figure 1 once the music scores are converted to the symbolic
representation the last step is to unify the annotations with the used in sections 31
18 Chapter 3 The 40kSamples Drums Dataset
This process is made in the last cells of Dataset_formatUnification7 notebook
A dictionary with the translation of the notes to drums instrument is defined so
the note is already converted Differently the timestamp of each event has to be
computed based on the tempo of the song and the figure of each event this process
is made with the function get_time_steps_from_annotations8 which reads the
interpretation in symbolic notation and accumulates the duration of each event
based on the figure and the tempo
322 Studio recordings
At this point of the dataset creation we realized that the already existing data
was so unbalanced in terms of instances per class some classes had around two
thousand samples while others had only ten This situation was the reason to
record a personalized dataset to balance the overall distribution of classes as well
as exercises with different accuracy when reading simulating students with different
skill levels
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants
Barcelona) the first day was intended to mount the drumset and the microphones
which are listed in Table 2 in Figure 3 the microphone setup is shown differently
to the standard setup in which each instrument of the set has its microphone this
distribution of the microphones was intended to record the whole drumset with
different frequency responses
The recording process was divide into two phases first creating samples to balance
the dataset used to train the drums event classifier (called train set) Then recording
the studentsrsquo assignment simulation to test the whole system (called test set)
7httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_formatUnificationipynb
8httpsgithubcomMaciACtfg_DrumsAssessmentblobe81be958101be005cda805146d3287eec1a2d5a4scriptsdrumspyL9
9httpswwwstereodosiscom
32 Created datasets 19
Microphone Transducer principleBeyerdynamic TG D70 Dynamic
Shure PG52 DynamicShure SM57 Dynamic
Sennheiser e945 DynamicAKG C314 CondenserAKG C414 CondenserShure PG81 CondenserSamson C03 Condenser
Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes we decided to take into account only the classes
that appear in the music school subset this decision was motivated by the idea of
assessing the songs from the books so only classes of the collection of songs were
needed to train the classifier In Figure 4 the distribution of the selected classes
before the recordings is shown note that is in logarithmic scale so there is a large
difference among classes
20 Chapter 3 The 40kSamples Drums Dataset
Figure 4 Number of samples before Train set recording
To organize the recording process we designed 3 different routines to record depend-
ing on the class and the number of samples already existing a different routine was
recorded These routines were designed trying to represent the different speeds dy-
namics and interactions between instruments of a real interpretation In Appendix
A the routines scores are shown to write a generic routine a two lines stave is used
the bottom line represents the class to be recorded and the top line an auxiliary
one The auxiliary classes are cymbals concretely crashes and rides whose sound
remains a long period of time and its tail is mixed with the subsequent sound events
bull Routine 1 (Fig 31) This routine is intended for the classes that do not include
a crash or ride cymbal and has a small number of classes (ie lt500)
bull Routine 2 (Fig 32) This routine does not include auxiliary events as it is
intended for classes that include crash or ride cymbal whose interaction with
itself is intrinsic
bull Routine 3 (Fig 33) This is a short version of routine 1 which only repeats
each bar two times instead of four is intended for classes which not include a
crash or ride cymbal and has a large number of classes (ie gt500)
32 Created datasets 21
Routines 1 and 3 were recorded only one time as we had only one instrument of each
of the classes differently routine 2 was recorded two times for each cymbal as we
was able to use more instances of them different cymbals configurations used can
be seen in Appendix A in Figures 34 35 and 36
After the Train set recording the number of samples was a little more balanced as
shown in Figure 5 all the classes have at least 1500 samples
0
1000
2000
3000
ht+kd
kd+m
t
ht
mt
ft+sd
ft+kd
+sd
cr+sd
ft
cr+kd
cr
ft+kd
hh+k
d+sd
kd+s
d
cy+s
d
cy
cy+k
d sd
kd
hh+s
d
hh+k
d
hh
recorded before record
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song
in the same drumset to do that we recorded each song of the music school Drums
Grade Initial and Grade 1 playing it correctly and then making mistakes in both
reading and rhythmic ways After testing with these recordings we realized that we
were not able to test the limits of the assessment system in terms of tempo or with
different rhythmic measures So we proposed two exercises of groove reading in 44
and in 128 to be performed with different tempos these recordings have been done
in my study room with my laptoprsquos microphone
22 Chapter 3 The 40kSamples Drums Dataset
33 Data augmentation
As described in section 212 data augmentation aims to introduce changes to the
signals to optimize the statistical representation of the dataset To implement this
task the aforementioned Python library audiomentations is used
The library Audiomentations has a class called Compose which allows collecting
different processing functions assigning a probability to each of them Then the
Compose instance can be called several times with the same audio file and each time
the resulting audio will be processed differently because of the probabilities In
data_augmentationipynb10 a possible implementation is shown as well as some
plots of the original sample with different results of applying the created Compose
to the same sample an example of the results can be listened in Freesound11
The processing functions introduced in the Compose class are based in the proposed
in [13] and [14] its parameters are described
bull Add gaussian noise with 70 of probability
bull Time stretch between 08 and 125 with a 50 of probability
bull Time shift forward a maximum of 25 of the duration with 50 of probability
bull Pitch shift plusmn2 semitones with a 50 of probability
bull Apply mp3 compression with a 50 of probability
10httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterdata_augmentationipynb
11httpsfreesoundorgpeopleMaciaACpacks32213
34 Drums events trim 23
34 Drums events trim
As will be explained in section 421 the dataset has to be trimmed into individual
files to analyze them and extract the low-level descriptors In the Dataset_feature
Extractionipynb12 notebook this process has been implemented slicing all the
audios with its annotations each dataset separately to sight-check all the resultant
samples and detect better which annotations were not correct
35 Summary
To summarize a drums samples dataset has been created the one used in this
project will be called the 40k Samples Drums Dataset Nonetheless to share this
dataset we have to ensure that we are fully proprietary of the data which means
that the samples that come from IDMT MDBDrums and MusicSchool datasets
cannot be shared in another dataset Alternatively we will share the 29k Samples
Drums Dataset formed only by the samples recorded in the studio This dataset will
be available in Zenodo13 to download the whole dataset at once and in Freesound
some selected samples are uploaded in a pack14 to show the differences among mi-
crophones
12httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_featureExtractionipynb
13httpszenodoorgrecord4958592YMmNXW4p5TZ14httpsfreesoundorgpeopleMaciaACpacks32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment
pipeline are explained In Figure 6 the proposed pipeline diagram is shown it is
inspired by [2] Each box of the diagram refers to a section in this chapter so the
diagram might be helpful to get a general idea of the problem when explaining each
process
The system is divided into two main processes First the top boxes correspond to
the training process of the model using the dataset created in the previous chapter
Secondly the bottom row shows how a student submission is processed to generate
some feedback This feedback is the output of the system and should give some
indications to the student on how has performed and how can improve
41 Problem definition
To check if a student reads correctly a music sheet we need some tool to tag which
instruments of the drumset is playing for each detected event This leads us to
develop and train a Drums events classifier if this tool ensures a good accuracy
when classifying (ie lt95) we will be able to properly assess a studentrsquos recording
If the classifier has not enough accuracy the system will not be useful as we will not
be able to differentiate among errors from the student and errors from the classifier
24
42 Drums event classifier 25
Assessments
Music Scores
Studentsrsquo performances
Annotations
Audio recordings
Dataset
Feature extraction
Drums event classifier training
Performanceassessment
training
Feature extraction
Performanceassessment
inference
New studentrsquos recording
Visualization Performancefeedback
Figure 6 Proposed pipeline for a drums performance assessment system inspiredby [2]
For this reason the project has been mainly focused on developing the aforemen-
tioned drums event classifier and a proper dataset So developing a properly as-
sessed dataset of drums interpretations has not been possible nor the performance
assessment training Despite this the feedback visualization has been developed as
it is a nice way to close the pipeline and get some understandable results moreover
the performance feedback could be focused on deterministic aspects as telling the
student if is rushing or slowing in relation to a given tempo
42 Drums event classifier
As already mentioned this section has been the main load of work for this project
because of the dependence of a correct automatic transcription in order to do a
reliable assessment The process has been divided into 3 main parts extracting
26 Chapter 4 Methodology
the musical features training and validating the model in an iterative process and
finally validating the model with totally new data
421 Feature extraction
The feature extraction concept has been explained in Section 211 and has been
implemented using the MusicExtractor()1 method from Essentiarsquos library
MusicExtractor() method has to be called passing as a parameter the window and
hope sizes that will be used to perform the analysis as well as a filename of the event
to be analyzed The function extract_MusicalFeatures()2 has been implemented
to loop a list of files and analyze each of them to add the extracted features to a
csv file jointly with the class of each drum event At this point all the low-level
features were extracted both mean and standard deviation were computed across
all the frames of the given audio filename The reason was that we wanted to check
which features were redundant or meaningful when training the classifier
As mentioned in section 34 the fact that MusicExtractor() method has to be
called with a filename not an audio stream forced us to create another version of
the dataset which had each event annotated in a different audio file with the corre-
spondent class label as a filename Once all the datasets were properly sliced and
sight-checked the last cell of the notebook were executed with the correspondent
folder names (which contains all the sliced samples) and the features saved in differ-
ent csv one for each dataset3 Adding the number of instances in all the csv files
we get 40228 instances with 84 features and 1 label
1httpsessentiaupfedureferencestd_MusicExtractorhtml2httpsgithubcomMaciACtfg_DrumsAssessmentblobe81be958101be005cda805146d3287eec1a2d5a4
scriptsfeature_extractionpyL63httpsgithubcomMaciACtfg_DrumsAssessmenttreemasterdataslices
features
42 Drums event classifier 27
422 Training and validating
As mentioned in Section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) for sound event classification, while others have developed more complex methods specifically for drums event classification. The complexity of the latter methods made me choose the generic ones, also to test whether they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40k Samples Drums Dataset. The first times we tried the models we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula (4.1), which does not take the unbalance of the dataset into account. The accuracy computation was around 92%, but the correct predictions were mainly on the large classes. As shown in Figure 7, some classes had very low accuracy (even 0%, as some classes have only 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances affects the accuracy computation less.
\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}} - 1} 1(\hat{y}_i = y_i)    (4.1)
Instead, the proper way to compute the accuracy on this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages it over all the classes, as in Formula (4.2), where w_i represents the weight of each sample's class in the dataset. This computation lowered the result to 79%, which was not a good result.
\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j}

\mathrm{balanced\text{-}accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i    (4.2)
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used quality indicator for classification models is the F-score, which combines the precision and the recall of the model in one measure, as in Formula (4.3). For a given class, precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of instances that should have been predicted as that class.
F\text{-measure} = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}    (4.3)
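As a hedged illustration of the gap between these metrics, scikit-learn provides all three, and a toy imbalanced split (class names taken from the dataset, counts invented) makes the problem visible:

    from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                                 f1_score)

    # 90 hi-hat events vs 10 mid-tom events; a majority-class predictor
    y_true = ['hh'] * 90 + ['mt'] * 10
    y_pred = ['hh'] * 100

    print(accuracy_score(y_true, y_pred))             # 0.90, looks fine
    print(balanced_accuracy_score(y_true, y_pred))    # 0.50, exposes the bias
    print(f1_score(y_true, y_pred, average='macro'))  # ~0.47, per-class mean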
These results led us to record a personalized dataset to extend the already existing ones (see Section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as did the balanced accuracy and the F-score (both 89%). Up to this point we had been using both KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine controls the regularization: this technique is intended to make the model less sensitive to noise in the data and to outliers that may not represent the class properly (in scikit-learn, a larger C means weaker regularization, i.e. a closer fit to the training data). When increasing this value to 10, the results improved across all the classes, as shown in Figure 9, as did the accuracy and the F-score (both 95%).
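The C values above were compared manually; a more systematic sweep could be done as in the following sketch, where the grid values are illustrative and X_train, y_train are assumed to be the already scaled training split:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # cross-validated search over the regularization parameter, scored with
    # balanced accuracy to respect the class imbalance discussed above
    search = GridSearchCV(SVC(), {'C': [0.1, 1, 10, 100]},
                          scoring='balanced_accuracy', cv=5)
    search.fit(X_train, y_train)  # assumed: pre-scaled features and labels
    print(search.best_params_, search.best_score_)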
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was somewhat of a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with only the classes that involve these three instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and F-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the iterative training and validating process has been developed in the Classifier_training.ipynb⁴ notebook. First, the CSV files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered; and, to remove redundant features, the ones with a very low standard deviation are deleted (i.e. std_dev < 0.00001). As the SVM works better when the data is normalized, the standard scaler is used to center all the feature distributions around 0 and ensure a standard deviation of 1.
In the next cells the dataset is split into train and validation sets, and the training method of sklearn's SVM is called to perform the training; once the models are trained, their parameters are dumped to a file so that the model can be loaded a posteriori and the learned knowledge applied to new data. This process was very slow on my computer, so we decided to upload the CSV files to Google Drive and open the notebook with Google Colaboratory, which was faster; this is key to avoiding long waiting times during the iterative train-validate process. In the last cells the inference is made with the validation set, the accuracy is computed, and the confusion matrix is plotted to get an idea of which classes perform better.
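Those cells boil down to a short scikit-learn pipeline; a condensed sketch follows, assuming the merged features live in one CSV with a 'label' column and that the model is persisted with joblib (file names are illustrative):

    import joblib
    import pandas as pd
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    data = pd.read_csv('features.csv')
    X, y = data.drop(columns=['label']), data['label']
    X = X.loc[:, X.std() > 1e-5]     # drop near-constant, redundant features

    X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y,
                                                      random_state=0)
    scaler = StandardScaler().fit(X_train)   # zero mean, unit variance
    model = SVC(C=10).fit(scaler.transform(X_train), y_train)

    joblib.dump((scaler, model), 'svm_drums.joblib')  # reload a posteriori
    y_hat = model.predict(scaler.transform(X_val))
    print(balanced_accuracy_score(y_val, y_hat))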
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection()⁵ does both tasks. As explained in Section 2.1.1, there are many methods for onset detection, and each of them is better suited to a different application. In the case of drums we tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works pretty well, but when the tempo increases some onsets are not correctly detected; for this reason we finally implemented the onset detection with the HFC method.
⁴ https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
⁵ https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
This method computes, for each window, the HFC as in Equation (4.4); note that high-frequency bins (index k) weigh more in the final value of the HFC:

\mathrm{HFC}(n) = \sum_k |X_k[n]|^2 \cdot k    (4.4)
Moreover, the function plots the audio waveform jointly with the detected onsets, to check after each test whether it has worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
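A sketch of the HFC-based detection with Essentia follows, mirroring the library's standard onset-detection recipe; it covers only the detection half of SliceDrums_BeatDetection(), and the frame and hop sizes are assumptions:

    import essentia
    import essentia.standard as es

    def detect_onsets_hfc(audio_path, frame_size=1024, hop_size=512):
        audio = es.MonoLoader(filename=audio_path)()
        od_hfc = es.OnsetDetection(method='hfc')
        window = es.Windowing(type='hann')
        fft, c2p = es.FFT(), es.CartesianToPolar()
        pool = essentia.Pool()
        # accumulate the HFC detection function frame by frame
        for frame in es.FrameGenerator(audio, frameSize=frame_size,
                                       hopSize=hop_size):
            magnitude, phase = c2p(fft(window(frame)))
            pool.add('odf.hfc', od_hfc(magnitude, phase))
        # peak-pick the detection function into onset times, in seconds
        return es.Onsets()(essentia.array([pool['odf.hfc']]), [1])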
With the onset information the audio can be trimmed into the different events; the order is encoded in the file names, so the slices can easily be mapped to the expected events when comparing. The audios are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a CSV file.
To predict which event each slice contains, the already trained models are loaded in this new environment and the data is pre-processed using the same pipeline as in training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row of the data. The described process is implemented in the first part of Assessment.ipynb⁶; the second part executes the visualization functions described in the next section.
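As an illustrative sketch of this inference step (file names assumed, and the scaler/model pair matching the training sketch above):

    import joblib
    import pandas as pd

    # the same pre-processing pipeline as in training: scaler + model
    scaler, model = joblib.load('svm_drums.joblib')

    slices = pd.read_csv('submission_features.csv')  # one row per sliced event
    predicted_events = model.predict(scaler.transform(slices))
    print(list(predicted_events))  # e.g. ['hh', 'hh+sd', 'kd', ...]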
4.3 Music performance assessment
Finally, as already commented, the assessment part has focused on giving the student visual feedback on the interpretation. As the drums classifier has taken so much time, creating a dataset of interpretations with their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different skill levels, but grading them and doing the whole process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and stay consistent with the proposed ones.

So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Taking as a base the implementation of my colleague Eduard Vergés⁷, and thanks to the help of Vsevolod Eremenko⁸, the visualization is done in the last cell of the notebook Assessment.ipynb.

First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.
⁶ https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
⁷ https://github.com/EduardVergesFranch/U151202_VA_FinalProject
⁸ https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()⁹ is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (this comes from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio. In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (from green, good, to red, bad).
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
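A rough sketch of such a deviation plot is shown below; save_bar_plot() in the repository is more elaborate, and the function name, tolerance and color mapping here are assumptions:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_onset_deviation(waveform, sr, detected, expected, max_dev=0.1):
        t = np.arange(len(waveform)) / sr
        plt.plot(t, waveform, color='0.7')
        for d, e in zip(detected, expected):
            plt.axvline(d, ymax=0.2, color='black')           # detected onset
            plt.axvline(e, ymin=0.4, ymax=0.6, color='cyan')  # expected onset
            severity = min(abs(d - e) / max_dev, 1.0)         # 0 good .. 1 bad
            # traffic-light shading between the two onsets
            plt.axvspan(min(d, e), max(d, e), alpha=0.4,
                        color=(severity, 1.0 - severity, 0.0))
        plt.savefig('onset_deviation.png')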
Once the waveform is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. The function assess_notes()¹⁰ receives the expected and predicted events; from their comparison a list is created, with 1 at the indices where they match and 0 where they do not. The resulting list is then iterated and the 0 indices are checked, because most classification errors miss only one of the instruments to be predicted (e.g. instead of hh+sd the classifier predicts sd). These cases are considered partially correct, as the system has to take its own errors into account: at the indices where one of the instruments is correctly predicted and it is not a hi-hat (we consider it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct, and red means incorrect.
⁹ https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
¹⁰ https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
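A hedged reconstruction of that logic follows; the real assess_notes() lives in the repository, and the '+'-joined class encoding is an assumption taken from the class names used throughout this document:

    def assess_notes(expected, predicted):
        """Per event: 1.0 correct, 0.75 partially correct, 0.0 incorrect."""
        scores = []
        for exp, pred in zip(expected, predicted):
            exp_set, pred_set = set(exp.split('+')), set(pred.split('+'))
            if exp_set == pred_set:
                scores.append(1.0)
            # partially correct: a non-hi-hat instrument was still recovered
            elif (exp_set & pred_set) - {'hh'}:
                scores.append(0.75)
            else:
                scores.append(0.0)
        return scores

    print(assess_notes(['hh+sd', 'kd', 'hh'], ['sd', 'hh', 'hh']))
    # -> [0.75, 0.0, 1.0]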
With the waveform, the assessed notes and the LilyPond template, the function score_image()¹¹ can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the assessed notes); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars), while the other number is the scale of the waveform, which allows the plot to fit better with the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps; these are found in the txt file already mentioned in Section 3.1.1. Then, the LilyPond file; this is the template, written in the LilyPond language, that defines the resulting music sheet, and it also defines the macros to change color and size and to add the waveform. In addition, when extracting the musical features, each submission creates its own CSV file to store the information. And finally, of course, we need the audio files with the recorded submission to be assessed.
¹¹ https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student's learning, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until reaching 220 bpm. The recordings with good tempo and good reading have been processed in steps of 6 dB, up to an accumulated +30 dB.

In this chapter and in Appendix B all the resulting feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack¹ has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high-frequency content method works perfectly, there are no limitations or errors in terms of onset detection: all the tests have an F-measure of 1, detecting all the expected events without any false positives.
¹ https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops as the tempo increases. Taking as a reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one weighs 0.5 and an incorrect one 0; the total value is the mean of the weighted results of the predictions.
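For concreteness, this counting is just a weighted mean; the sketch below reproduces the first row of Table 3:

    def exercise_score(correct, partially_correct, incorrect):
        total = correct + partially_correct + incorrect
        return (1.0 * correct + 0.5 * partially_correct) / total

    print(round(exercise_score(25, 7, 0), 2))  # 0.89, Ex. 1 at 60 bpm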
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that increasing the tempo of exercise 1 decreases the accuracy of the classifier. This may be because increasing the tempo reduces the spacing between events, and consequently the duration of each event, which leaves fewer values for calculating the mean and standard deviation when extracting the timbre characteristics. As stated in the law of large numbers [25], the larger the sample, the closer its mean is to the population mean. In this case, having fewer values in the calculation produces more outliers in the distribution, which tends to scatter it.
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48
Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But in 12/8 at 100 bpm the subdivision rate is 300 eighth notes per minute, similar to 140 bpm in 4/4, whose rate is 280 eighth notes per minute. The results in 12/8 (Table 4) are also better, because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875
Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at that gain level the overall waveform is so clipped that the resulting high-frequency content is predicted as a hi-hat in all cases.
Level   Correct   Partially OK   Incorrect   Total
+0dB    25        7              0           0.89
+6dB    23        9              0           0.86
+12dB   23        9              0           0.86
+18dB   24        7              1           0.86
+24dB   18        5              9           0.64
+30dB   13        5              14          0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level   Correct   Partially OK   Incorrect   Total
+0dB    12        7              13          0.48
+6dB    13        10             9           0.56
+12dB   10        8              14          0.5
+18dB   9         2              21          0.31
+24dB   8         0              24          0.25
+30dB   9         0              23          0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of results has focused on the accuracy of the drums event classifier, but we think it is also important to evaluate whether the system can properly assess a student's submission.

As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar and that the rest of the beats must come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in the previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68
Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9
Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system performs the prediction-based assessment pretty well. This can help a student know which parts of the music sheet are well read and which are not. The tempo visualization can also help the student recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the correspondent expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. A set of further improvements is also given, along with a final reflection on my work and my learning. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained throughout this document, we can now go over them, assessing their completeness and our contributions.

Firstly, the 29k Samples Drums Dataset has been created and is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of such interpretations is very unbalanced, as explained with the IDMT and MDB Drums datasets.

Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex context-based methods might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained for is very hard for a human being: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the 29k Samples Drums Dataset. It would be good to get recordings in different rooms, with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even having different people play. This would introduce more variance into the drums sample dataset. Moreover, on June 9th, 2021, a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 2021¹. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.

Regarding the classification model, it clearly needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
¹ https://www.2021.ieeeicassp.org
Also, in terms of the classes in the drumset, there is a long path still to cover. There are no solutions that robustly transcribe a whole set, including the toms and the different kinds of cymbals. Here we think that a proper approach would be to work with professional musicians, who can help researchers better understand the instrument and create datasets with different techniques.

With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained on assessed drums exercises to give a mark to each student. On this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.

About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed to take this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if its code and data are available and other researchers or students can execute them and obtain the same results.

All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository², as well as the data used to test the system and the classification models.

The data created, i.e. the studio recordings, is available in a Zenodo repository³, and some samples are in Freesound⁴. This is the 29k Samples Drums Dataset, as not all the 40k samples used for training are our property and we cannot share them under our full authorship; despite this, the other datasets used in this project are available individually.
² https://github.com/MaciAC/tfg_DrumsAssessment
³ https://zenodo.org/record/4923588
⁴ https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although work remains in terms of robustness and reliability. Still, a first approximation has been presented, and several paths of improvement have been proposed.

Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also in researching methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web-based service GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which has been a tough path to take. In a more advanced phase of the project I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebook on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a lot of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that the system produces, and I hope that more people get interested in this field of research so that better tools appear in the future.
List of Figures

1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system, inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes 30
11 Onsets detected in a 60 bpm drums interpretation 32
12 Onsets detected in a 220 bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff 41
24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables

1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography

[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012, Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
Figure 31 Recording routine 1
Figure 32 Recording routine 2
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
18 Chapter 3 The 40kSamples Drums Dataset
This process is made in the last cells of Dataset_formatUnification7 notebook
A dictionary with the translation of the notes to drums instrument is defined so
the note is already converted Differently the timestamp of each event has to be
computed based on the tempo of the song and the figure of each event this process
is made with the function get_time_steps_from_annotations8 which reads the
interpretation in symbolic notation and accumulates the duration of each event
based on the figure and the tempo
322 Studio recordings
At this point of the dataset creation we realized that the already existing data
was so unbalanced in terms of instances per class some classes had around two
thousand samples while others had only ten This situation was the reason to
record a personalized dataset to balance the overall distribution of classes as well
as exercises with different accuracy when reading simulating students with different
skill levels
The recording process took place on April 16 and 17 at Stereodosis Estudio9 (Sants
Barcelona) the first day was intended to mount the drumset and the microphones
which are listed in Table 2 in Figure 3 the microphone setup is shown differently
to the standard setup in which each instrument of the set has its microphone this
distribution of the microphones was intended to record the whole drumset with
different frequency responses
The recording process was divide into two phases first creating samples to balance
the dataset used to train the drums event classifier (called train set) Then recording
the studentsrsquo assignment simulation to test the whole system (called test set)
7httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_formatUnificationipynb
8httpsgithubcomMaciACtfg_DrumsAssessmentblobe81be958101be005cda805146d3287eec1a2d5a4scriptsdrumspyL9
9httpswwwstereodosiscom
32 Created datasets 19
Microphone Transducer principleBeyerdynamic TG D70 Dynamic
Shure PG52 DynamicShure SM57 Dynamic
Sennheiser e945 DynamicAKG C314 CondenserAKG C414 CondenserShure PG81 CondenserSamson C03 Condenser
Table 2 Microphones used
Figure 3 Microphone setup for drums recording
Train set
To limit the number of classes we decided to take into account only the classes
that appear in the music school subset this decision was motivated by the idea of
assessing the songs from the books so only classes of the collection of songs were
needed to train the classifier In Figure 4 the distribution of the selected classes
before the recordings is shown note that is in logarithmic scale so there is a large
difference among classes
20 Chapter 3 The 40kSamples Drums Dataset
Figure 4 Number of samples before Train set recording
To organize the recording process we designed 3 different routines to record depend-
ing on the class and the number of samples already existing a different routine was
recorded These routines were designed trying to represent the different speeds dy-
namics and interactions between instruments of a real interpretation In Appendix
A the routines scores are shown to write a generic routine a two lines stave is used
the bottom line represents the class to be recorded and the top line an auxiliary
one The auxiliary classes are cymbals concretely crashes and rides whose sound
remains a long period of time and its tail is mixed with the subsequent sound events
bull Routine 1 (Fig 31) This routine is intended for the classes that do not include
a crash or ride cymbal and has a small number of classes (ie lt500)
bull Routine 2 (Fig 32) This routine does not include auxiliary events as it is
intended for classes that include crash or ride cymbal whose interaction with
itself is intrinsic
bull Routine 3 (Fig 33) This is a short version of routine 1 which only repeats
each bar two times instead of four is intended for classes which not include a
crash or ride cymbal and has a large number of classes (ie gt500)
32 Created datasets 21
Routines 1 and 3 were recorded only one time as we had only one instrument of each
of the classes differently routine 2 was recorded two times for each cymbal as we
was able to use more instances of them different cymbals configurations used can
be seen in Appendix A in Figures 34 35 and 36
After the Train set recording the number of samples was a little more balanced as
shown in Figure 5 all the classes have at least 1500 samples
0
1000
2000
3000
ht+kd
kd+m
t
ht
mt
ft+sd
ft+kd
+sd
cr+sd
ft
cr+kd
cr
ft+kd
hh+k
d+sd
kd+s
d
cy+s
d
cy
cy+k
d sd
kd
hh+s
d
hh+k
d
hh
recorded before record
Figure 5 Number of samples after Train set recording
Test set
The test set recording tried to simulate different students performing the same song
in the same drumset to do that we recorded each song of the music school Drums
Grade Initial and Grade 1 playing it correctly and then making mistakes in both
reading and rhythmic ways After testing with these recordings we realized that we
were not able to test the limits of the assessment system in terms of tempo or with
different rhythmic measures So we proposed two exercises of groove reading in 44
and in 128 to be performed with different tempos these recordings have been done
in my study room with my laptoprsquos microphone
22 Chapter 3 The 40kSamples Drums Dataset
33 Data augmentation
As described in section 212 data augmentation aims to introduce changes to the
signals to optimize the statistical representation of the dataset To implement this
task the aforementioned Python library audiomentations is used
The library Audiomentations has a class called Compose which allows collecting
different processing functions assigning a probability to each of them Then the
Compose instance can be called several times with the same audio file and each time
the resulting audio will be processed differently because of the probabilities In
data_augmentationipynb10 a possible implementation is shown as well as some
plots of the original sample with different results of applying the created Compose
to the same sample an example of the results can be listened in Freesound11
The processing functions introduced in the Compose class are based in the proposed
in [13] and [14] its parameters are described
bull Add gaussian noise with 70 of probability
bull Time stretch between 08 and 125 with a 50 of probability
bull Time shift forward a maximum of 25 of the duration with 50 of probability
bull Pitch shift plusmn2 semitones with a 50 of probability
bull Apply mp3 compression with a 50 of probability
10httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterdata_augmentationipynb
11httpsfreesoundorgpeopleMaciaACpacks32213
34 Drums events trim 23
34 Drums events trim
As will be explained in section 421 the dataset has to be trimmed into individual
files to analyze them and extract the low-level descriptors In the Dataset_feature
Extractionipynb12 notebook this process has been implemented slicing all the
audios with its annotations each dataset separately to sight-check all the resultant
samples and detect better which annotations were not correct
35 Summary
To summarize a drums samples dataset has been created the one used in this
project will be called the 40k Samples Drums Dataset Nonetheless to share this
dataset we have to ensure that we are fully proprietary of the data which means
that the samples that come from IDMT MDBDrums and MusicSchool datasets
cannot be shared in another dataset Alternatively we will share the 29k Samples
Drums Dataset formed only by the samples recorded in the studio This dataset will
be available in Zenodo13 to download the whole dataset at once and in Freesound
some selected samples are uploaded in a pack14 to show the differences among mi-
crophones
12httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_featureExtractionipynb
13httpszenodoorgrecord4958592YMmNXW4p5TZ14httpsfreesoundorgpeopleMaciaACpacks32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment
pipeline are explained In Figure 6 the proposed pipeline diagram is shown it is
inspired by [2] Each box of the diagram refers to a section in this chapter so the
diagram might be helpful to get a general idea of the problem when explaining each
process
The system is divided into two main processes First the top boxes correspond to
the training process of the model using the dataset created in the previous chapter
Secondly the bottom row shows how a student submission is processed to generate
some feedback This feedback is the output of the system and should give some
indications to the student on how has performed and how can improve
41 Problem definition
To check if a student reads correctly a music sheet we need some tool to tag which
instruments of the drumset is playing for each detected event This leads us to
develop and train a Drums events classifier if this tool ensures a good accuracy
when classifying (ie lt95) we will be able to properly assess a studentrsquos recording
If the classifier has not enough accuracy the system will not be useful as we will not
be able to differentiate among errors from the student and errors from the classifier
24
42 Drums event classifier 25
Assessments
Music Scores
Studentsrsquo performances
Annotations
Audio recordings
Dataset
Feature extraction
Drums event classifier training
Performanceassessment
training
Feature extraction
Performanceassessment
inference
New studentrsquos recording
Visualization Performancefeedback
Figure 6 Proposed pipeline for a drums performance assessment system inspiredby [2]
For this reason the project has been mainly focused on developing the aforemen-
tioned drums event classifier and a proper dataset So developing a properly as-
sessed dataset of drums interpretations has not been possible nor the performance
assessment training Despite this the feedback visualization has been developed as
it is a nice way to close the pipeline and get some understandable results moreover
the performance feedback could be focused on deterministic aspects as telling the
student if is rushing or slowing in relation to a given tempo
42 Drums event classifier
As already mentioned this section has been the main load of work for this project
because of the dependence of a correct automatic transcription in order to do a
reliable assessment The process has been divided into 3 main parts extracting
26 Chapter 4 Methodology
the musical features training and validating the model in an iterative process and
finally validating the model with totally new data
421 Feature extraction
The feature extraction concept has been explained in Section 211 and has been
implemented using the MusicExtractor()1 method from Essentiarsquos library
MusicExtractor() method has to be called passing as a parameter the window and
hope sizes that will be used to perform the analysis as well as a filename of the event
to be analyzed The function extract_MusicalFeatures()2 has been implemented
to loop a list of files and analyze each of them to add the extracted features to a
csv file jointly with the class of each drum event At this point all the low-level
features were extracted both mean and standard deviation were computed across
all the frames of the given audio filename The reason was that we wanted to check
which features were redundant or meaningful when training the classifier
As mentioned in section 34 the fact that MusicExtractor() method has to be
called with a filename not an audio stream forced us to create another version of
the dataset which had each event annotated in a different audio file with the corre-
spondent class label as a filename Once all the datasets were properly sliced and
sight-checked the last cell of the notebook were executed with the correspondent
folder names (which contains all the sliced samples) and the features saved in differ-
ent csv one for each dataset3 Adding the number of instances in all the csv files
we get 40228 instances with 84 features and 1 label
1httpsessentiaupfedureferencestd_MusicExtractorhtml2httpsgithubcomMaciACtfg_DrumsAssessmentblobe81be958101be005cda805146d3287eec1a2d5a4
scriptsfeature_extractionpyL63httpsgithubcomMaciACtfg_DrumsAssessmenttreemasterdataslices
features
42 Drums event classifier 27
422 Training and validating
As mentioned in section 22 some authors have proposed machine learning algo-
rithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN)
to do sound event classification also some authors have developed more complex
methods for drums event classification The complexity of these last methods made
me choose the generic ones also to try if it were a good way to approach the problem
as there is no literature concretely on drums event classification with SVM or KNN
The iterative process of training and validating the aforementioned methods has
been the main reference when designing the 40k Drums samples dataset the first
times we tried the models we were working with the classes distribution of Figure
4 as commented this was a very unbalanced dataset and we were evaluating the
classification inference with the accuracy formula 41 that did not take into account
the unbalance in the dataset The accuracy computation was around 92 but the
correct predictions were mainly on the large classes as shown in Figure 7 some
classes had very low accuracy (even 0 as some classes has 10 samples 7 used to
train an 3 to validate which are all bad predicted) but having a little number of
instances affects less to the accuracy computation
accuracy(y y) =1
nsamples
nsamplesminus1sumi=0
1(yi = yi) (41)
Otherwise the proper way to compute the accuracy in this kind of datasets is the
balanced accuracy it computes the accuracy for each class and then averages the
accuracy along with all the classes as in formula 42 where wi represents the weight
of each class in the dataset This computation lowered the result to 79 which was
not a good result
wi =wisum
j 1(yj = yi)wj
(42)
balanced-accuracy(y y w) =1sumwi
sumi
1(yi = yi)wi
28 Chapter 4 Methodology
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score which
combines the precision and the recall of the model in one measure as in formula
43 Precision is computed as the number of correct predictions divided by the total
number of predictions and recall is the number of correct predictions divided by the
total number of predictions that should be correct for a given class
F_measure =precisiontimes recallprecision+ recall
(43)
Having these results led us to the process of recording a personalized dataset to
extend the already existing (See section 322) With this new distribution the
results improved as shown in Figure 8 as well as better balanced accuracy and f-
score (both 89) Until this point we were using both KNN and SVM models to
compare results and the SVM performed always 10 better at least so we decided
to focus on the SVM and its hyper-parameter tunning
42 Drums event classifier 29
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameterC = 1
The C parameter in a support vector machine refers to the regularization this
technique is intended to make a model less sensitive to the data noise and the
outliers that may not represent the class properly When increasing this value to
10 the results improved among all the classes as shown in Figure 9 as well as the
accuracy and f-score (both 95)
At that point the accuracy of the model was pretty good but the 88 on the snare
drum class was somehow a problem as is one of the most used instruments in the
drumset jointly with the hi-hat and the kick drum So I tried the same process
with the classes that include only the three mentioned instruments (ie hh kd sd
hh+kd hh+sd kd+sd and hh+kd+sd) Reducing the number of classes improved
the overall accuracy and f-score to 977 and concretely the sd accuracy to 96 as
shown in Figure 10
30 Chapter 4 Methodology
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameterC = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and param-eter C = 10 but only hh sd and kd classes
42 Drums event classifier 31
The implementation of the training and validating iterative process has been de-
veloped in the Classifier_trainingipynb4 notebook First loading the csv files
with the features extracted in Dataset_featureExtractionipynb then depend-
ing on which subset of classes will be used the correspondent instances and filtered
and to remove redundant features the ones with a very low standard deviation are
deleted (ie std_dev lt 000001) As the SVM works better when data is normalized
the standard scaler is used to move all the data distributions around 0 and ensuring
a standard deviation of 1
In the next cells the dataset is split into train and validation sets and the training
method from the SVM of sklearn is called to perform the training when the models
are trained the parameters are dumped in a file to load the model a posteriori and
be able to apply the knowledge learned to new data This process was so slow on
my computer so we decided to upload the csv files to Google Drive and open the
notebook with Google Collaboratory as it was faster and is a key feature to avoid
long waiting times during the iterative train-validate process In the last cells the
inference is made with the validation set and the accuracy is computed as well as
the confusion matrix plotted to get an idea of which classes are performing better
423 Testing
Testing the model introduces the concept of onset detection until now all the slices
have been created using the annotations but to assess a new submission from
a student we need to detect the onsets and then slice the events The function
SliceDrums_BeatDetection5 does both tasks as explained in section 211 there
are many methods to do onset detection and each of them is better for a different
application In the case of drums we have tested the rsquocomplexrsquo method which finds
changes in the frequency domain in terms of energy and phase and works pretty
well but when the tempo increase there are some onsets that are not correctly de-
tected for this reason we finally implemented the onset detection with the HFC4httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterClassifier_
trainingipynb5httpsgithubcomMaciACtfg_DrumsAssessmentblob9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869
scriptsdrumspyL45
32 Chapter 4 Methodology
method This method computes for each window the HFC as in equation 44 note
that high-frequency bins (k index) weights more in the final value of the HFC
HFC(n) =sumk
|Xk[n]|2lowastk (44)
Moreover the function plots the audio waveform jointly with the onsets detected to
check if it has worked correctly after each test In Figures 11 and 12 we can see two
examples of the same music sheet played at 60 and 220 bpm in both cases all the
onsets are correctly detected and no false detection occurs
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
With the onsets information the audio can be trimmed in the different events the
order is maintained with the name of the file so when comparing with the expected
events can be mapped easily The audios are passed to the extract_MusicalFeatures()
function that saves the musical features of each slice in a csv
43 Music performance assessment 33
To predict which event is each slice the models already trained are loaded in this new
environment and the data is pre-processed using the same pipeline as when training
After that data is passed to the classifier method predict() which returns for each
row in the data the predicted event The described process is implemented in the first
part of Assessmentipynb6 the second part is intended to execute the visualization
functions described in the next section
43 Music performance assessment
Finally as already commented the assessment part has been focused on giving visual
feedback of the interpretation to the student As the drums classifier has taken so
much time the creation of a dataset with interpretations and its grades has not been
feasible A first approximation was to record different interpretations of the same
music sheet simulating different levels of skills but grading it and doing all the
process by ourselves was not easy apart from that we tended to play the fragments
good or bad it was difficult to simulate intermediate levels and be consistent with
the proposed ones
So the implemented solution generates an image that shows to the student if the
notes of the music sheet are correctly read and if the onsets are aligned with the
expected ones
431 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Having as a base implementation the solution of my companion Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb. First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.

6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (this comes from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio. In Figures 13 and 14, the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (green good to red bad).

9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
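A simplified sketch of this deviation plot follows; the geometry and the color mapping are assumptions meant to reproduce the traffic-light idea, not the exact save_bar_plot() code.

    import numpy as np
    import matplotlib.pyplot as plt

    def deviation_plot(audio, sr, detected, expected, max_dev=0.15):
        t = np.arange(len(audio)) / sr
        plt.plot(t, audio, color="lightgray")
        for d, e in zip(detected, expected):
            dev = min(abs(d - e) / max_dev, 1.0)       # 0 = on time, 1 = worst
            # RdYlGn goes red -> green, so invert: small deviation = green
            plt.axvspan(min(d, e), max(d, e), color=plt.cm.RdYlGn(1 - dev), alpha=0.6)
            plt.vlines(d, -1.0, -0.5, color="black")   # detected onset (bottom)
            plt.vlines(e, -0.25, 0.25, color="cyan")   # expected onset (middle)
        plt.savefig("deviation.png")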
Once the waveform is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes()10, the expected and predicted events are compared and a list of booleans is created, with 0 in the False indices and 1 in the True ones. The resulting list is then iterated and the 0 indices are checked, because most of the classification errors fail in one of the instruments to be predicted (i.e. instead of hh+sd the model predicts sd). These cases are considered partially correct, as the system has to take its own errors into account: in the indices where one of the instruments is correctly predicted and it is not a hi-hat (we consider it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct and red means incorrect.

10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
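The logic can be sketched as below, assuming each event label is represented as a set of instrument codes (e.g. {'hh', 'sd'} for hh+sd); the real assess_notes() may differ in representation.

    def assess_notes(expected, predicted):
        # 1.0 = correct (green), 0.75 = partially correct (light green), 0.0 = red
        marks = [1.0 if e == p else 0.0 for e, p in zip(expected, predicted)]
        for i, mark in enumerate(marks):
            if mark == 0.0:
                shared = expected[i] & predicted[i]    # instruments hit correctly
                if shared and shared != {"hh"}:        # more than just the hi-hat
                    marks[i] = 0.75
        return marks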
With the waveform, the notes assessed and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the notes assessed); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars), while the other number is the scale of the waveform and allows the plot to fit better with the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps, found in the txt file already mentioned in section 3.1.1. Then the LilyPond file: this is the template, written in the LilyPond language, that defines the resultant music sheet; the macros to change color and size and to add the waveform are defined in it. When extracting the musical features, each submission creates its csv file to store the information. And finally we need, of course, the audio files with the recorded submission to be assessed.

11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until reaching 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB at a time, up to an accumulated +30 dB.
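One plausible way to produce these amplified versions is sketched below, under the assumption that hard clipping to the [-1, 1] range models the recording saturation; file names are illustrative.

    import numpy as np
    import soundfile as sf

    audio, sr = sf.read("ex1_60bpm.wav")
    for db in range(6, 36, 6):                       # +6 dB steps up to +30 dB
        gained = np.clip(audio * 10 ** (db / 20), -1.0, 1.0)
        sf.write(f"ex1_60bpm_plus{db}dB.wav", gained, sr)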
In this chapter and in Appendix B all the resultant feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high frequency content method works perfectly, there are no limitations or errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positive.

1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops when the tempo increases. Having as a reference the Figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one weighs 0.5 and an incorrect one 0; the total value is the mean of the weighted results of the predictions.
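The counting scheme as a small sketch, reproducing for example the first row of Table 3:

    def total_score(n_correct, n_partial, n_incorrect):
        # Correct weighs 1.0, partially correct 0.5, incorrect 0
        n = n_correct + n_partial + n_incorrect
        return (1.0 * n_correct + 0.5 * n_partial) / n

    print(round(total_score(25, 7, 0), 2))   # 0.89, first row of Table 3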
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of the classifier decreases. This may be because increasing the tempo decreases the spacing between events, and consequently the duration of each event, which leads to fewer values to calculate the mean and standard deviation when extracting the timbre characteristics. As stated in the Law of Large Numbers [25], the larger the sample, the closer the mean is to the total population mean. In this case, having fewer values in the calculation creates more outliers in the distribution, which tends to scatter.
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48

Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But in 12/8 at 100 bpm the equivalent tempo is 300 quarter notes per minute, similar to 140 bpm in 4/4, whose quarter-note-per-minute tempo is 280. The results on 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875

Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figure 23 and Figure 24. The hi-hat is the last waveform to saturate, and at that gain level the overall waveform is so clipped that its high-frequency content makes it be predicted as a hi-hat in all the cases.
Level   Correct   Partially OK   Incorrect   Total
+0dB    25        7              0           0.89
+6dB    23        9              0           0.86
+12dB   23        9              0           0.86
+18dB   24        7              1           0.86
+24dB   18        5              9           0.64
+30dB   13        5              14          0.48

Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level   Correct   Partially OK   Incorrect   Total
+0dB    12        7              13          0.48
+6dB    13        10             9           0.56
+12dB   10        8              14          0.5
+18dB   9         2              21          0.31
+24dB   8         0              24          0.25
+30dB   9         0              23          0.28

Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier accuracy, but we think it is also important to evaluate whether the system can properly assess a student's submission. As shown in Figures 25 and 26, if the student does not play the first beat or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time step. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar and that the rest of the beats have to come after this interval.
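A sketch of this alignment assumption; the helper names are hypothetical and the real check lives in the assessment notebook.

    def expected_onset_times(score_beats, tempo, beats_per_bar=4):
        # One count-in bar before the first expected beat
        beat = 60.0 / tempo
        return [(beats_per_bar + b) * beat for b in score_beats]

    def match_onsets(detected, expected):
        # Map every expected onset to the closest detected one
        return [min(detected, key=lambda d: abs(d - e)) for e in expected]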
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
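The rule as a sketch, assuming the marks come from assess_notes() and a parallel list states whether each played event matches the score:

    def count_correct_assessments(marks, played_matches_score):
        correct = sum(
            1 for mark, ok in zip(marks, played_matches_score)
            # green/light green on a correct event, or red on a wrong one
            if (mark >= 0.75 and ok) or (mark == 0.0 and not ok)
        )
        return correct, correct / len(marks)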
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs the assessment based on the predictions pretty well. This can be helpful for a student to know which parts of the music sheet are well read and which are not. The tempo visualization can also help the student recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the correspondent expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. A set of further improvements is also given, together with a final thought on my work and my apprenticeship. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained along this document, we can now list them, assessing their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly available, downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of such interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on the context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained to do is very hard for a human: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on the visualization rather than on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment from the 30k Samples Drums Dataset. It would be good to get recordings in different spaces, with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even having different people play. This would introduce more variance in the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the overall system robustness. It would be appropriate to introduce the aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.

1 https://www.2021.ieeeicassp.org
Also, in terms of classes in the drumset, there is a long path still to cover. There are no solutions that robustly transcribe a whole set, including the toms and the different kinds of cymbals. In this sense, we think a proper approach would be to work with professional musicians, who can help researchers better understand the instrument and create datasets covering different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises to give each student a mark. In this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step would be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and other researchers and students can execute them, getting the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, are available in a Zenodo repository3, and some samples in Freesound4. This is the 29k Samples Drums Dataset: not all the 40k samples used for training are our property, so we cannot share them under our full authorship; despite this, the other datasets used in this project are available individually.

2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although work remains in terms of robustness and reliability; still, a first approximation has been presented and several paths of improvement have been proposed.
Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning, not only in terms of implementation but also by investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web version, GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which was a tough path to take and accomplish. In a more advanced phase of the project I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebook on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that the system produces, and I hope more people get interested in this field of research so that better tools appear in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018)
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020)
[3] MusicTechnologyGroup. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019)
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013)
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001)
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005)
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981)
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005)
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021)
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010)
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014)
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics - J. Comput. Graph. Stat. 10 (2012)
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020)
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2020)
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019)
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017)
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012)
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2012)
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014)
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017)
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006)
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013)
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings - 14th International Society for Music Information Retrieval Conference (2010)
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005)
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008)
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
Figure 31 Recording routine 1
Figure 32 Recording routine 2
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
long waiting times during the iterative train-validate process In the last cells the
inference is made with the validation set and the accuracy is computed as well as
the confusion matrix plotted to get an idea of which classes are performing better
423 Testing
Testing the model introduces the concept of onset detection until now all the slices
have been created using the annotations but to assess a new submission from
a student we need to detect the onsets and then slice the events The function
SliceDrums_BeatDetection5 does both tasks as explained in section 211 there
are many methods to do onset detection and each of them is better for a different
application In the case of drums we have tested the rsquocomplexrsquo method which finds
changes in the frequency domain in terms of energy and phase and works pretty
well but when the tempo increase there are some onsets that are not correctly de-
tected for this reason we finally implemented the onset detection with the HFC4httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterClassifier_
trainingipynb5httpsgithubcomMaciACtfg_DrumsAssessmentblob9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869
scriptsdrumspyL45
32 Chapter 4 Methodology
method This method computes for each window the HFC as in equation 44 note
that high-frequency bins (k index) weights more in the final value of the HFC
HFC(n) =sumk
|Xk[n]|2lowastk (44)
Moreover the function plots the audio waveform jointly with the onsets detected to
check if it has worked correctly after each test In Figures 11 and 12 we can see two
examples of the same music sheet played at 60 and 220 bpm in both cases all the
onsets are correctly detected and no false detection occurs
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
With the onsets information the audio can be trimmed in the different events the
order is maintained with the name of the file so when comparing with the expected
events can be mapped easily The audios are passed to the extract_MusicalFeatures()
function that saves the musical features of each slice in a csv
43 Music performance assessment 33
To predict which event is each slice the models already trained are loaded in this new
environment and the data is pre-processed using the same pipeline as when training
After that data is passed to the classifier method predict() which returns for each
row in the data the predicted event The described process is implemented in the first
part of Assessmentipynb6 the second part is intended to execute the visualization
functions described in the next section
43 Music performance assessment
Finally as already commented the assessment part has been focused on giving visual
feedback of the interpretation to the student As the drums classifier has taken so
much time the creation of a dataset with interpretations and its grades has not been
feasible A first approximation was to record different interpretations of the same
music sheet simulating different levels of skills but grading it and doing all the
process by ourselves was not easy apart from that we tended to play the fragments
good or bad it was difficult to simulate intermediate levels and be consistent with
the proposed ones
So the implemented solution generates an image that shows to the student if the
notes of the music sheet are correctly read and if the onsets are aligned with the
expected ones
431 Visualization
With the data gathered in the testing section feedback of the interpretation has
to be returned Having as a base implementation the solution of my companion
Eduard Vergeacutes7 and thanks to the help of Vsevolod Eremenko8 in the last cell of
the notebook Assessmentipynb the visualization is done
First the LilyPond file paths are defined Then for each of the submissions the
audio is loaded to generate the waveform plot
6httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterAssessmentipynb7httpsgithubcomEduardVergesFranchU151202_VA_FinalProject8httpsgithubcomseffkaForMacia
34 Chapter 4 Methodology
To do so the function save_bar_plot()9 is called passing the lists of detected and
expected onsets the waveform and the start and end of the waveform (this comes
from the lilypond filersquos macro) To properly plot the deviations in the code we are
assuming that the interpretation starts four beats after the beginning of the audio
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is
shown The black lines in the bottom of the waveform are the detected onsets while
the cyan lines in the middle are the expected onsets when the difference between
the two values increase the area between them is colored with a traffic light code
(green good to red bad)
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
Once the waveform is created it is embedded in a lambda function that is called from
the LillyPond render But before calling the LillyPond to render the assessment of
the notes has to be done In function assess_notes()10 the expected and predicted
events are passed with a comparison a list of booleans is created but with 0 in the
False and 1 in the True index then the resulting list is iterated and the 0 indices
are checked because most of the classification errors fail in one of the instruments
to be predicted (ie instead of hh+sd it is predicting sd) These cases are considered
partially correct as the system has to take into account its errors the indices in
which one of the instruments is correctly predicted and it is not a hi-hat (we are
9httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsvisualizationpyL112
10httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsdrumspyL88
43 Music performance assessment 35
considering it more important to get right the snare and kick reading than a hi-hat
which is present in all the events) the value is turned to 075 (light green in the color
scale) In Figure 15 the different feedback options are shown green notes mean
correct light green means partially correct and red means incorrect
Figure 15 Example of coloured notes
With the waveform the notes assessed and the LilyPond template the function
score_image()11 can be called This function renders the LilyPond template jointly
with the waveform previously created this is done with the LilyPond macros
On one hand before each note on the staff the keyword color() size()
determines that the color and size of the note depends on an external variable (the
notes assessed) and in the other hand after the first note of the staff the keyword
eps(1150 16) indicates on which beat starts to display the waveform and
on which ends in this case from 0 to 16 in a 44 rhythm is 4 bars and the other
number is the scale of the waveform and allows to fit better the plot with the staff
432 Files used
The assessment process of an exercise needs several files first the annotations of the
expected events and their timesteps this is found in the txt file that has been already
mentioned in the 311 section then the LilyPond file this one is the template write
in LilyPond language that defines the resultant music sheet the macros to change
color and size and to add the waveform are defined when extracting the musical
features each submission creates its csv file to store the information and finally
we need of course the audio files with the submission recorded to be assessed
11httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsvisualizationpyL187
Chapter 5
Results
At this point the system has been developed and the classifier trained so we can do
an evaluation of the results to check if it works correctly and is useful to a student to
learn also to test which are the limits regarding the audio signal quality and tempo
The tests have been done with two different exercises recorded with a computer
microphone and played at a different tempo starting at 60 bpm and adding 40
bpm until 220 bpm The recordings with good tempo and good reading have been
processed adding 6dB until an accumulate of +30 dB
In this chapter and Appendix B all the resultant feedback visualizations are shown
The audio files can be listened in Freesound a pack1 has been created Some of
them will be commented on and referenced in further sections the rest are extra
results
As the High frequency content method works perfectly there are no limitations nor
errors in terms of onset detection all the tests have an f-measure of 1 detecting all
the expected events without detecting any false positive
1httpsfreesoundorgpeopleMaciaACpacks32350
36
51 Tempo limitations 37
51 Tempo limitations
One of the limitations of the system is the tempo of the exercise the accuracy drops
when the tempo increases Having as a reference the Figures that show good reading
which all notes should be in green or light green (ie Figures 16 17 18 19 20
21 and 22) we can count how many are correct or partially correct to punctuate
each case a correct prediction weights 10 a partially correct weights 05 and an
incorrect 0 the total value is the mean of the weighted result of the predictions
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that by increasing the tempo of exercise 1 the accuracy of the
classifier decreases it may be because increasing the tempo decreases the spacing
between events and consequently the duration of each event which leads to fewer
38 Chapter 5 Results
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to calculate the mean and standard deviation when extracting the timbre
characteristics As stated in the Law of Large numbers [25] the larger the sample
the closer the mean is to the total population mean In this case having fewer values
in the calculation creates more outliers in the distribution which tends to scatter
Tempo Correct Partially OK Incorrect Total60 25 7 0 089100 24 8 0 0875140 24 7 1 086180 15 9 8 061220 12 7 13 048
Table 3 Results of exercise 1 with different tempos
51 Tempo limitations 39
Regarding the 128 exercise (Figures 21 and 22) we were not able to record faster
than 100 bpm But in 128 the quarter notes equivalent in tempo is 300 quarter
note per minute similarly to the 140 bpm on 44 which quarter note per minute
temporsquos is 280 The results on 128 (Table 4) are also better because there are more
rsquoonly hi hatrsquo events which are better predicted
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
40 Chapter 5 Results
Tempo Correct Partially OK Incorrect Total60 39 8 1 089100 37 10 1 0875
Table 4 Results of exercise 2 with different tempos
52 Saturation limitations
Another limitation to the system is the saturation of the signal submitted Hearing
the submissions the hi-hat events are recorded with less amplitude than the snare
and kick events for this reason we think that the classifier starts to fail at +18dB
As can be seen in Tables 5 and 6 the same counting scheme as in the previous section
is done with Figure 23 and Figure 24 The hi-hat is the last waveform to saturate
and at this gain level the overall waveform is so clipped leading to a high-frequency
content that is predicted as a hi-hat in all the cases
Level Correct Partially OK Incorrect Total+0dB 25 7 0 089+6dB 23 9 0 086+12dB 23 9 0 086+18dB 24 7 1 086+24dB 18 5 9 064+30dB 13 5 14 048
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level Correct Partially OK Incorrect Total+0dB 12 7 13 048+6dB 13 10 9 056+12dB 10 8 14 05+18dB 9 2 21 031+24dB 8 0 24 025+30dB 9 0 23 028
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
52 Saturation limitations 41
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB ateach new staff
42 Chapter 5 Results
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB ateach new staff
53 Evaluation of the assessment 43
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier
accuracy but we think that is also important to evaluate if the system can assess
properly a studentrsquos submission
As shown in Figures 25 and 26 if the student does not play the first beat or some of
the beats are not read the system can map the rest of the events to the expected in
the correspondent onset time step This is due to a checking done in the assessment
which assumes that before starting the first beat there is a count-back of one bar
and the rest of the beats have to be after this interval
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections counting the
number of correct predictions but now in terms of assessment The analyzed results
will be the rsquoBad reading good temporsquo ones shown in Figures 27 28 and 29
44 Chapter 5 Results
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs the assessment based on the predictions pretty well. This can be helpful for a student to know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help the student recognize if they are slowing down or rushing when reading the score: as can be seen in Figure 30, the onsets detected (black lines in the bottom part of the waveform) are mostly behind the corresponding expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. Also, a set of further improvements is given, together with a final thought on my work and my apprenticeship. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained throughout this document, we can now list them, defining their completeness and our contributions.

Firstly, the 29k Samples Drums Dataset has been created and is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful to balance drums datasets based on real interpretations, as the class distribution of these interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on the context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained to do is very hard for a human: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they could be improved, as they are mostly focused on the visualization rather than on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the 40k Samples Drums Dataset. It would be good to get recordings in different spaces and with different drumsets and microphones to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even with different people playing. This could introduce more variance in the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 2021¹. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it is clear that it needs improvements to ensure the overall system robustness. It would be appropriate to introduce the aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org
Also, in terms of classes in the drumset, there is still a long path to cover. There are no solutions that robustly transcribe a whole set including the toms and different kinds of cymbals. In this sense, we think that a proper approach would be to work with professional musicians, who could help researchers to better understand the instrument and create datasets covering different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises to give a mark to each student. In this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted; a sketch of this idea follows.
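To make the advantage concrete: a MIDI drumset reports the events symbolically, so no audio classification is needed. Below is a minimal sketch using the mido library, with General MIDI percussion numbers mapped to this project's class labels; the file name and the reduced label map are assumptions for the example.

import mido

# General MIDI percussion numbers for kick, snare and closed hi-hat (assumed map).
GM_TO_LABEL = {36: "kd", 38: "sd", 42: "hh"}

def midi_drum_events(path):
    # Collect (time_in_seconds, label) pairs from note_on messages;
    # iterating a MidiFile yields messages with delta times in seconds.
    events, now = [], 0.0
    for msg in mido.MidiFile(path):
        now += msg.time
        if msg.type == "note_on" and msg.velocity > 0 and msg.note in GM_TO_LABEL:
            events.append((now, GM_TO_LABEL[msg.note]))
    return events

print(midi_drum_events("submission.mid"))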
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like MusicCritic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and other researchers/students can execute them, obtaining the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository², as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository³ and some samples in Freesound⁴. This is the 29k Drums Samples Dataset: not all of the 40k samples used for training are our property, so we cannot share them under our full authorship. Despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although the work still falls short in terms of robustness and reliability; a first approximation has been presented, and several paths of improvement have been proposed.

Moreover, several fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also in investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web counterpart GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which has been a tough path to take and accomplish. In a more advanced phase of the project, I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a lot of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me. Being able to relate the project to music and drums helped me to keep my motivation and focus. I am quite satisfied with the feedback visualization that results from the system, and I hope that more people get interested in this field of research so that better tools appear in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 but only hh, sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff 41
24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M. J., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2020).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2012).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2010).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
[Score image: recording routine 1, quarter note = 60 (music engraving by LilyPond 2.18.2, www.lilypond.org)]
Figure 31 Recording routine 1
[Score image: recording routine 2, quarter note = 60 (music engraving by LilyPond 2.18.2, www.lilypond.org)]
Figure 32 Recording routine 2
[Score image: recording routine 3, quarter note = 60 (music engraving by LilyPond 2.18.2, www.lilypond.org)]
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
note per minute similarly to the 140 bpm on 44 which quarter note per minute
temporsquos is 280 The results on 128 (Table 4) are also better because there are more
rsquoonly hi hatrsquo events which are better predicted
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
40 Chapter 5 Results
Tempo Correct Partially OK Incorrect Total60 39 8 1 089100 37 10 1 0875
Table 4 Results of exercise 2 with different tempos
52 Saturation limitations
Another limitation to the system is the saturation of the signal submitted Hearing
the submissions the hi-hat events are recorded with less amplitude than the snare
and kick events for this reason we think that the classifier starts to fail at +18dB
As can be seen in Tables 5 and 6 the same counting scheme as in the previous section
is done with Figure 23 and Figure 24 The hi-hat is the last waveform to saturate
and at this gain level the overall waveform is so clipped leading to a high-frequency
content that is predicted as a hi-hat in all the cases
Level Correct Partially OK Incorrect Total+0dB 25 7 0 089+6dB 23 9 0 086+12dB 23 9 0 086+18dB 24 7 1 086+24dB 18 5 9 064+30dB 13 5 14 048
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level Correct Partially OK Incorrect Total+0dB 12 7 13 048+6dB 13 10 9 056+12dB 10 8 14 05+18dB 9 2 21 031+24dB 8 0 24 025+30dB 9 0 23 028
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
52 Saturation limitations 41
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB ateach new staff
42 Chapter 5 Results
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB ateach new staff
53 Evaluation of the assessment 43
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier
accuracy but we think that is also important to evaluate if the system can assess
properly a studentrsquos submission
As shown in Figures 25 and 26 if the student does not play the first beat or some of
the beats are not read the system can map the rest of the events to the expected in
the correspondent onset time step This is due to a checking done in the assessment
which assumes that before starting the first beat there is a count-back of one bar
and the rest of the beats have to be after this interval
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections counting the
number of correct predictions but now in terms of assessment The analyzed results
will be the rsquoBad reading good temporsquo ones shown in Figures 27 28 and 29
44 Chapter 5 Results
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset One of the reasons for using
this approach to predict the events was that there was no literature focused on
47
48 Chapter 6 Discussion and conclusions
classifying drums events in this manner As the results have shown more complex
methods based on the context might be used as the ones proposed in [16] and [17]
It is important to take into account that the task that the model is trained to do
is very hard for a human being able to differentiate drums events in an individual
drum sample without any context is almost impossible even for a trained ear as my
drums teacher or mine
Thirdly a review of the different music sheet technologies has been done as well as
the development of a MusicXML parser This part took around one month to be
developed and from my point of view it was a great way to understand how these
file formats work and how can be improved as they are majorly focused on the
visualization not the symbolic representation of events and timesteps
Finally two exercises in different time signatures have been proposed to demonstrate
the functionality of the system As well as tests of these exercises have been recorded
in a different environment than the 30k Samples Drums Dataset It would be fine
to get recordings in different spaces and with different drumsets and microphones
to test more exhaustively the system
62 Further work
In terms of the dataset created it could be larger It could be expanded with
different drumsets tuning differently each drumset using different sticks to hit the
instruments and even different people playing This could introduce more variance
in the drums sample dataset Moreover on June 9th 2021 a paper about a large
drums datasets with MIDI data was presented [26] in the ICASSP 20211 This new
dataset could be included in the training process as the authors state that having a
large-scale dataset improves the results of the existing models
Regarding the classification model it is clear that needs improvements to ensure the
overall system robustness It would be appropriate to introduce the aforementioned
methods in [16] [17] and [26] in the ADT part of the pipeline
1httpswww2021ieeeicassporg
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion developing this project has been a nice way to close my
Bachelorrsquos degree as I reviewed some of the concepts of more personal interest
And being able to relate the project with music and drums helped me to keep
my motivation and focus I am quite satisfied with the feedback visualization that
results of the system and I hope that more people get interested in this field of
research to get better tools in the future
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Transactions on Audio, Speech and Language Processing 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-Repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K. J., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Machine Learning 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2020).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max/MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums: an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (2021).
Appendix A
Studio recording media
[Drum score, quarter note = 60; engraved with LilyPond 2.18.2, www.lilypond.org]
Figure 31 Recording routine 1
[Drum score, quarter note = 60; engraved with LilyPond 2.18.2, www.lilypond.org]
Figure 32 Recording routine 2
[Drum score, quarter note = 60; engraved with LilyPond 2.18.2, www.lilypond.org]
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
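The notation of the three recording routines did not survive the transcription of this document, so only placeholders remain above. Purely as an illustration of how such a routine is written, and not the actual source of Figures 31 to 33, a one-bar groove in LilyPond's drum mode (using the same engraver version credited in the figures) could be sketched as:

    \version "2.18.2"
    \score {
      \drums {
        \tempo 4 = 60
        % straight eighth-note hi-hats, with kick on beats 1 and 3 and snare on 2 and 4
        <bd hh>8 hh <sn hh> hh <bd hh> hh <sn hh> hh |
      }
      \layout { }
    }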
Appendix B
Extra results
Figure 37 Good reading and bad tempo, Ex. 1, 60 bpm
Figure 38 Bad reading and bad tempo, Ex. 1, 60 bpm
Figure 39 Good reading and bad tempo, Ex. 1, 140 bpm
Figure 40 Bad reading and bad tempo, Ex. 1, 140 bpm
Figure 41 Good reading and bad tempo, Ex. 1, 180 bpm
Figure 42 Good reading and bad tempo, Ex. 1, 220 bpm
Figure 43 Bad reading and bad tempo, Ex. 1, 220 bpm
Figure 44 Good reading and bad tempo, Ex. 2, 60 bpm
Figure 45 Bad reading and bad tempo, Ex. 2, 60 bpm
Figure 46 Good reading and bad tempo, Ex. 2, 100 bpm
Figure 47 Bad reading and bad tempo, Ex. 2, 100 bpm
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
22 Chapter 3 The 40kSamples Drums Dataset
33 Data augmentation
As described in section 212 data augmentation aims to introduce changes to the
signals to optimize the statistical representation of the dataset To implement this
task the aforementioned Python library audiomentations is used
The library Audiomentations has a class called Compose which allows collecting
different processing functions assigning a probability to each of them Then the
Compose instance can be called several times with the same audio file and each time
the resulting audio will be processed differently because of the probabilities In
data_augmentationipynb10 a possible implementation is shown as well as some
plots of the original sample with different results of applying the created Compose
to the same sample an example of the results can be listened in Freesound11
The processing functions introduced in the Compose class are based in the proposed
in [13] and [14] its parameters are described
bull Add gaussian noise with 70 of probability
bull Time stretch between 08 and 125 with a 50 of probability
bull Time shift forward a maximum of 25 of the duration with 50 of probability
bull Pitch shift plusmn2 semitones with a 50 of probability
bull Apply mp3 compression with a 50 of probability
10httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterdata_augmentationipynb
11httpsfreesoundorgpeopleMaciaACpacks32213
34 Drums events trim 23
34 Drums events trim
As will be explained in section 421 the dataset has to be trimmed into individual
files to analyze them and extract the low-level descriptors In the Dataset_feature
Extractionipynb12 notebook this process has been implemented slicing all the
audios with its annotations each dataset separately to sight-check all the resultant
samples and detect better which annotations were not correct
35 Summary
To summarize a drums samples dataset has been created the one used in this
project will be called the 40k Samples Drums Dataset Nonetheless to share this
dataset we have to ensure that we are fully proprietary of the data which means
that the samples that come from IDMT MDBDrums and MusicSchool datasets
cannot be shared in another dataset Alternatively we will share the 29k Samples
Drums Dataset formed only by the samples recorded in the studio This dataset will
be available in Zenodo13 to download the whole dataset at once and in Freesound
some selected samples are uploaded in a pack14 to show the differences among mi-
crophones
12httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_featureExtractionipynb
13httpszenodoorgrecord4958592YMmNXW4p5TZ14httpsfreesoundorgpeopleMaciaACpacks32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment
pipeline are explained In Figure 6 the proposed pipeline diagram is shown it is
inspired by [2] Each box of the diagram refers to a section in this chapter so the
diagram might be helpful to get a general idea of the problem when explaining each
process
The system is divided into two main processes First the top boxes correspond to
the training process of the model using the dataset created in the previous chapter
Secondly the bottom row shows how a student submission is processed to generate
some feedback This feedback is the output of the system and should give some
indications to the student on how has performed and how can improve
41 Problem definition
To check if a student reads correctly a music sheet we need some tool to tag which
instruments of the drumset is playing for each detected event This leads us to
develop and train a Drums events classifier if this tool ensures a good accuracy
when classifying (ie lt95) we will be able to properly assess a studentrsquos recording
If the classifier has not enough accuracy the system will not be useful as we will not
be able to differentiate among errors from the student and errors from the classifier
24
42 Drums event classifier 25
Assessments
Music Scores
Studentsrsquo performances
Annotations
Audio recordings
Dataset
Feature extraction
Drums event classifier training
Performanceassessment
training
Feature extraction
Performanceassessment
inference
New studentrsquos recording
Visualization Performancefeedback
Figure 6 Proposed pipeline for a drums performance assessment system inspiredby [2]
For this reason the project has been mainly focused on developing the aforemen-
tioned drums event classifier and a proper dataset So developing a properly as-
sessed dataset of drums interpretations has not been possible nor the performance
assessment training Despite this the feedback visualization has been developed as
it is a nice way to close the pipeline and get some understandable results moreover
the performance feedback could be focused on deterministic aspects as telling the
student if is rushing or slowing in relation to a given tempo
42 Drums event classifier
As already mentioned this section has been the main load of work for this project
because of the dependence of a correct automatic transcription in order to do a
reliable assessment The process has been divided into 3 main parts extracting
26 Chapter 4 Methodology
the musical features training and validating the model in an iterative process and
finally validating the model with totally new data
4.2.1 Feature extraction
The feature extraction concept has been explained in Section 2.1.1 and has been implemented using the MusicExtractor()1 method from the Essentia library. MusicExtractor() has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()2 has been implemented to loop over a list of files and analyze each of them, adding the extracted features to a CSV file jointly with the class of each drum event. At this point all the low-level features were extracted; both the mean and the standard deviation were computed across all the frames of the given audio file. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.
As mentioned in Section 3.4, the fact that MusicExtractor() has to be called with a filename, not an audio stream, forced us to create another version of the dataset, with each annotated event in a different audio file and the correspondent class label in the filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples), and the features were saved in different CSV files, one per dataset3. Adding up the number of instances in all the CSV files, we get 40228 instances with 84 features and 1 label.
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices/features
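A minimal sketch of this loop with Essentia's standard-mode MusicExtractor is shown below; the function name, the CSV layout and the label-in-filename convention are simplifications of the real extract_MusicalFeatures(), not its exact code:

```python
import csv
import os
import essentia.standard as es

def extract_musical_features(filenames, output_csv):
    # Window and hop sizes are passed at construction time; values here are
    # illustrative. Both mean and stdev are aggregated across all frames.
    extractor = es.MusicExtractor(lowlevelFrameSize=2048, lowlevelHopSize=1024,
                                  lowlevelStats=['mean', 'stdev'])
    with open(output_csv, 'w', newline='') as f:
        writer = csv.writer(f)
        for filename in filenames:
            features, _ = extractor(filename)
            # Keep the scalar low-level descriptors (drop vector ones like MFCC)
            names = [n for n in features.descriptorNames()
                     if n.startswith('lowlevel') and isinstance(features[n], float)]
            # Assumed convention: class label encoded in the slice's filename,
            # e.g. 'hh+kd_0042.wav'
            label = os.path.basename(filename).split('_')[0]
            writer.writerow([features[n] for n in names] + [label])
```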
4.2.2 Training and validating
As mentioned in Section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) for sound event classification, and some authors have developed more complex methods for drums event classification. The complexity of these last methods made me choose the generic ones, also to test whether they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40k Samples Drums Dataset. The first times we tried the models, we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula 4.1, which does not take the imbalance of the dataset into account. The computed accuracy was around 92%, but the correct predictions were mainly on the large classes; as shown in Figure 7, some classes had very low accuracy (even 0%, as some classes have only 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances affects the accuracy computation less.
$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} 1(\hat{y}_i = y_i) \tag{4.1}$$
In contrast, the proper way to compute the accuracy on this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages the accuracy across all the classes, as in formula 4.2, where $w_i$ represents the weight of each class in the dataset. This computation lowered the result to 79%, which was not a good result.
$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \tag{4.2}$$
$$\text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i$$
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score, which combines the precision and the recall of the model in one measure, as in formula 4.3. Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of instances that actually belong to the given class.
$$F\text{-measure} = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{4.3}$$
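To make the three metrics concrete, here is a toy computation with scikit-learn (the labels are invented; note how plain accuracy is inflated by the dominant class):

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_recall_fscore_support)

y_true = ['hh', 'hh', 'hh', 'hh', 'sd', 'sd', 'kd', 'kd']
y_pred = ['hh', 'hh', 'hh', 'hh', 'sd', 'hh', 'kd', 'hh']

print(accuracy_score(y_true, y_pred))           # 0.75, dominated by 'hh'
print(balanced_accuracy_score(y_true, y_pred))  # 0.667: per-class (1 + 0.5 + 0.5) / 3
print(precision_recall_fscore_support(y_true, y_pred, average='macro'))
```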
These results led us to record a personalized dataset to extend the already existing ones (see Section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as did the balanced accuracy and f-score (both 89%). Until this point we were using both KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine controls the regularization; this technique is intended to make the model less sensitive to noise in the data and to outliers that may not represent the class properly. When increasing this value to 10, the results improved across all the classes, as shown in Figure 9, as did the accuracy and f-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was somewhat of a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with only the classes that involve the three mentioned instruments (i.e., hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and f-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the iterative training and validating process has been developed in the Classifier_training.ipynb4 notebook. First, the CSV files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered; and to remove redundant features, the ones with a very low standard deviation are deleted (i.e., std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to center all the feature distributions around 0 and ensure a standard deviation of 1.
In the next cells, the dataset is split into train and validation sets and the training method of sklearn's SVM is called to perform the training; when the models are trained, their parameters are dumped to a file so the model can be loaded a posteriori and the learned knowledge applied to new data. This process was very slow on my computer, so we decided to upload the CSV files to Google Drive and open the notebook with Google Colaboratory; it was faster, which is key to avoid long waiting times during the iterative train-validate process. In the last cells, the inference is made on the validation set, the accuracy is computed, and the confusion matrix is plotted to get an idea of which classes perform better.
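A condensed sketch of this train-validate step, assuming all features sit in one features.csv whose last column is the label (file names and the split are illustrative):

```python
import pandas as pd
from joblib import dump
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score

data = pd.read_csv('features.csv')
X, y = data.iloc[:, :-1], data.iloc[:, -1]
X = X.loc[:, X.std() > 0.00001]           # drop near-constant (redundant) features

X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)     # center around 0, unit variance

clf = SVC(C=10).fit(scaler.transform(X_train), y_train)
dump((scaler, clf), 'svm_drums.joblib')    # persist scaler + model for later inference

y_pred = clf.predict(scaler.transform(X_val))
print(balanced_accuracy_score(y_val, y_pred))
```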
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now, all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection5 does both tasks. As explained in Section 2.1.1, there are many methods for onset detection, and each of them is better for a different application. In the case of drums we have tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase, and works pretty well; but when the tempo increases, some onsets are not correctly detected. For this reason we finally implemented the onset detection with the HFC method. This method computes the HFC for each window, as in equation 4.4; note that high-frequency bins (index k) weigh more in the final value of the HFC.
4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
$$\mathrm{HFC}(n) = \sum_k \left|X_k[n]\right|^2 \cdot k \tag{4.4}$$
Moreover, the function plots the audio waveform jointly with the detected onsets, to check after each test whether it has worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
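A sketch of the HFC onset detection and the subsequent slicing, wired up in Essentia's standard mode following the pattern in the library's documentation (file name, frame size and hop size are assumptions):

```python
from essentia import array
from essentia.standard import (MonoLoader, Windowing, FFT, CartesianToPolar,
                               FrameGenerator, OnsetDetection, Onsets)

audio = MonoLoader(filename='submission.wav')()   # 44100 Hz mono by default
od_hfc = OnsetDetection(method='hfc')
w, fft, c2p = Windowing(type='hann'), FFT(), CartesianToPolar()

# Compute the HFC detection function frame by frame
detection = []
for frame in FrameGenerator(audio, frameSize=1024, hopSize=512):
    mag, phase = c2p(fft(w(frame)))
    detection.append(od_hfc(mag, phase))

onset_times = Onsets()(array([detection]), [1])   # onset positions in seconds

# Slice the audio between consecutive onsets, one segment per drum event
sr = 44100
slices = [audio[int(t0 * sr):int(t1 * sr)]
          for t0, t1 in zip(onset_times[:-1], onset_times[1:])]
```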
With the onset information, the audio can be trimmed into the different events; their order is kept in the file names, so when comparing with the expected events they can be mapped easily. The audio slices are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a CSV file.
To predict which event each slice contains, the already trained models are loaded in this new environment, and the data is pre-processed using the same pipeline as in training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row of the data. The described process is implemented in the first part of Assessment.ipynb6; the second part executes the visualization functions described in the next section.
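Continuing the training sketch from Section 4.2.2, the inference step could look like this (file names are again assumptions):

```python
import pandas as pd
from joblib import load

# The scaler and SVM were persisted together in the training sketch
scaler, clf = load('svm_drums.joblib')

# One row of features per sliced event; columns assumed to match training
slices = pd.read_csv('submission_features.csv')
predicted_events = clf.predict(scaler.transform(slices))
print(list(predicted_events))   # e.g. ['hh', 'hh+kd', 'sd', ...]
```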
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving the student visual feedback on the interpretation. As the drums classifier has taken so much time, the creation of a dataset of interpretations with their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different skill levels, but grading them and doing the whole process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed ones.
So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Taking as a base implementation the solution of my colleague Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb.
First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (these come from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio.
In Figures 13 and 14, the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected ones; when the difference between the two values increases, the area between them is colored with a traffic-light code (from green, good, to red, bad).
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
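A simplified stand-in for save_bar_plot() illustrating the traffic-light idea with matplotlib; the color mapping and the 0.2 s full-red scale are assumptions, not the thesis values:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_onset_deviations(waveform, sr, detected, expected):
    t = np.arange(len(waveform)) / sr
    fig, ax = plt.subplots(figsize=(12, 2))
    ax.plot(t, waveform, color='grey', linewidth=0.5)
    for d, e in zip(detected, expected):
        dev = abs(d - e)
        # Traffic-light code: green for small deviations, red for large ones
        color = plt.cm.RdYlGn_r(min(dev / 0.2, 1.0))
        ax.axvspan(min(d, e), max(d, e), color=color, alpha=0.6)
        ax.vlines(d, -1.0, -0.5, color='black')   # detected: black lines at the bottom
        ax.vlines(e, -0.25, 0.25, color='cyan')   # expected: cyan lines in the middle
    fig.savefig('deviation_plot.png', bbox_inches='tight')
```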
Once the waveform plot is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. The function assess_notes()10 receives the expected and predicted events; with a comparison, a list is created with 0 at the False indices and 1 at the True indices. The resulting list is then iterated and the 0 indices are checked, because most of the classification errors miss only one of the instruments to be predicted (i.e., instead of hh+sd it predicts sd). These cases are considered partially correct, as the system has to take its own errors into account: at the indices in which one of the instruments is correctly predicted and it is not a hi-hat (we consider getting the snare and kick reading right more important than the hi-hat, which is present in all the events), the value is turned into 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct, and red means incorrect.
9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
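The partial-correctness rule can be sketched as follows; this is an illustrative reconstruction, not the exact assess_notes() code, using the hh/sd/kd label naming from Section 4.2.2:

```python
def assess_notes(expected, predicted):
    """Return one score per note: 1 correct, 0.75 partially correct, 0 incorrect."""
    scores = [1.0 if e == p else 0.0 for e, p in zip(expected, predicted)]
    for i, score in enumerate(scores):
        if score == 0.0:
            exp = set(expected[i].split('+'))
            pred = set(predicted[i].split('+'))
            # Partially correct: a shared non-hi-hat instrument (e.g. sd vs hh+sd)
            if (exp & pred) - {'hh'}:
                scores[i] = 0.75
    return scores

print(assess_notes(['hh+sd', 'kd', 'hh'], ['sd', 'kd', 'sd']))  # [0.75, 1.0, 0.0]
```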
With the waveform, the notes assessed, and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On one hand, before each note on the staff, the keywords color() and size() determine that the color and size of the note depend on an external variable (the notes assessed); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beats the waveform starts and ends to be displayed, in this case from 0 to 16, which in a 4/4 rhythm is 4 bars; the other number is the scale of the waveform and allows fitting the plot better to the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps; these are found in the txt file already mentioned in Section 3.1.1. Then, the LilyPond file; this is the template, written in the LilyPond language, that defines the resulting music sheet, where the macros to change color and size and to add the waveform are defined. When extracting the musical features, each submission creates its own CSV file to store the information. And finally, of course, we need the audio files with the recorded submission to be assessed.
11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether it works correctly and is useful for a student to learn, and also to test the limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB until an accumulated +30 dB.
In this chapter and Appendix B, all the resulting feedback visualizations are shown. The audio files can be listened to in Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high frequency content method works perfectly, there are no limitations or errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positive.
1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops as the tempo increases. Taking as a reference the figures that show a good reading, in which all notes should be green or light green (i.e., Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one 0.5, and an incorrect one 0; the total value is the mean of the weighted predictions. For example, the 60 bpm row of Table 3 gives (25 + 0.5 · 7 + 0 · 0) / 32 ≈ 0.89.
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that increasing the tempo of exercise 1 decreases the accuracy of the classifier. This may be because increasing the tempo reduces the spacing between events, and consequently the duration of each event, which leaves fewer values for computing the mean and standard deviation when extracting the timbre characteristics. As stated in the Law of Large Numbers [25], the larger the sample, the closer the sample mean is to the population mean. In this case, having fewer values in the calculation produces more outliers in the distribution, which tends to scatter.
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48
Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But in 12/8 at 100 bpm the equivalent rate is 300 eighth notes per minute, similar to 140 bpm in 4/4, whose eighth-note rate is 280 per minute. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875
Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason, we think the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at that gain level the overall waveform is so clipped that the resulting high-frequency content is predicted as a hi-hat in all cases.
Level    Correct   Partially OK   Incorrect   Total
+0dB     25        7              0           0.89
+6dB     23        9              0           0.86
+12dB    23        9              0           0.86
+18dB    24        7              1           0.86
+24dB    18        5              9           0.64
+30dB    13        5              14          0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0dB     12        7              13          0.48
+6dB     13        10             9           0.56
+12dB    10        8              14          0.5
+18dB    9         2              21          0.31
+24dB    8         0              24          0.25
+30dB    9         0              23          0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
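For reference, one plausible way to produce such amplified versions is to apply +6 dB gain steps and hard-clip to the digital range; this is a sketch, not necessarily the exact processing used for the recordings:

```python
import numpy as np
import soundfile as sf

audio, sr = sf.read('good_reading_60bpm.wav')        # float samples in [-1, 1]
for gain_db in range(6, 36, 6):                      # +6 dB ... +30 dB
    amplified = audio * 10 ** (gain_db / 20)         # dB to linear gain
    clipped = np.clip(amplified, -1.0, 1.0)          # digital saturation
    sf.write(f'good_reading_60bpm_plus{gain_db}dB.wav', clipped, sr)
```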
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now, the evaluation of results has been focused on the accuracy of the drums event classifier, but we think it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time step. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar, and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment, we proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starts at 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68
Table 7 Assessment results of a bad reading at different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9
Table 8 Assessment results of a bad reading at different tempos, 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system performs the assessment based on the predictions pretty well. This can help a student know which parts of the music sheet are well read and which are not. The tempo visualization can also help the student recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (black lines at the bottom of the waveform) are mostly behind the correspondent expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point, all the work of the project has been done and the results have been analyzed. In this chapter, a discussion is developed about which objectives have been accomplished and which have not. Also, a set of further improvements is given, as well as a final thought on my work and my apprenticeship. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained throughout this document, we can now list them, defining their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset created is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of these interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained on is very hard for a human being: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on visualization rather than on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the 29k Samples Drums Dataset. It would be good to get recordings in different spaces, with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even having different people play. This would introduce more variance into the drums sample dataset. Moreover, on June 9th 2021, a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org
Also, in terms of classes in the drumset, there is a long path still to cover. There are no solutions that robustly transcribe a whole kit, including the toms and the different kinds of cymbals. In this regard, we think a proper approach would be to work with professional musicians, who can help researchers better understand the instrument and create datasets with different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained with assessed drums exercises to give each student a mark. On this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like MusicCritic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if its code and data are available and other researchers and students can execute them, obtaining the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.
The data created, i.e., the studio recordings, are available in a Zenodo repository3, and some samples in Freesound4. This is the 29k Samples Drums Dataset: as not all of the 40k samples used for training are our property, we cannot share them under our full authorship; despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although work remains in terms of robustness and reliability. A first approximation has been presented, and several paths of improvement have been proposed.
Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web counterpart, GitHub. At the beginning of the project, I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e., Essentia), which was a tough path to take. In a more advanced phase of the project, I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebook on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization the system produces, and I hope more people get interested in this field of research so that better tools appear in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system, inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment results of a bad reading at different tempos, 4/4 exercise 46
8 Assessment results of a bad reading at different tempos, 12/8 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech, and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
Figure 31 Recording routine 1
Figure 32 Recording routine 2
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
34 Drums events trim 23
34 Drums events trim
As will be explained in section 421 the dataset has to be trimmed into individual
files to analyze them and extract the low-level descriptors In the Dataset_feature
Extractionipynb12 notebook this process has been implemented slicing all the
audios with its annotations each dataset separately to sight-check all the resultant
samples and detect better which annotations were not correct
35 Summary
To summarize a drums samples dataset has been created the one used in this
project will be called the 40k Samples Drums Dataset Nonetheless to share this
dataset we have to ensure that we are fully proprietary of the data which means
that the samples that come from IDMT MDBDrums and MusicSchool datasets
cannot be shared in another dataset Alternatively we will share the 29k Samples
Drums Dataset formed only by the samples recorded in the studio This dataset will
be available in Zenodo13 to download the whole dataset at once and in Freesound
some selected samples are uploaded in a pack14 to show the differences among mi-
crophones
12httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterDataset_featureExtractionipynb
13httpszenodoorgrecord4958592YMmNXW4p5TZ14httpsfreesoundorgpeopleMaciaACpacks32397
Chapter 4
Methodology
In this chapter the methodologies followed in the development of the assessment
pipeline are explained In Figure 6 the proposed pipeline diagram is shown it is
inspired by [2] Each box of the diagram refers to a section in this chapter so the
diagram might be helpful to get a general idea of the problem when explaining each
process
The system is divided into two main processes First the top boxes correspond to
the training process of the model using the dataset created in the previous chapter
Secondly the bottom row shows how a student submission is processed to generate
some feedback This feedback is the output of the system and should give some
indications to the student on how has performed and how can improve
41 Problem definition
To check if a student reads correctly a music sheet we need some tool to tag which
instruments of the drumset is playing for each detected event This leads us to
develop and train a Drums events classifier if this tool ensures a good accuracy
when classifying (ie lt95) we will be able to properly assess a studentrsquos recording
If the classifier has not enough accuracy the system will not be useful as we will not
be able to differentiate among errors from the student and errors from the classifier
24
42 Drums event classifier 25
Assessments
Music Scores
Studentsrsquo performances
Annotations
Audio recordings
Dataset
Feature extraction
Drums event classifier training
Performanceassessment
training
Feature extraction
Performanceassessment
inference
New studentrsquos recording
Visualization Performancefeedback
Figure 6 Proposed pipeline for a drums performance assessment system inspiredby [2]
For this reason the project has been mainly focused on developing the aforemen-
tioned drums event classifier and a proper dataset So developing a properly as-
sessed dataset of drums interpretations has not been possible nor the performance
assessment training Despite this the feedback visualization has been developed as
it is a nice way to close the pipeline and get some understandable results moreover
the performance feedback could be focused on deterministic aspects as telling the
student if is rushing or slowing in relation to a given tempo
42 Drums event classifier
As already mentioned this section has been the main load of work for this project
because of the dependence of a correct automatic transcription in order to do a
reliable assessment The process has been divided into 3 main parts extracting
26 Chapter 4 Methodology
the musical features training and validating the model in an iterative process and
finally validating the model with totally new data
421 Feature extraction
The feature extraction concept has been explained in Section 211 and has been
implemented using the MusicExtractor()1 method from Essentiarsquos library
MusicExtractor() method has to be called passing as a parameter the window and
hope sizes that will be used to perform the analysis as well as a filename of the event
to be analyzed The function extract_MusicalFeatures()2 has been implemented
to loop a list of files and analyze each of them to add the extracted features to a
csv file jointly with the class of each drum event At this point all the low-level
features were extracted both mean and standard deviation were computed across
all the frames of the given audio filename The reason was that we wanted to check
which features were redundant or meaningful when training the classifier
As mentioned in section 34 the fact that MusicExtractor() method has to be
called with a filename not an audio stream forced us to create another version of
the dataset which had each event annotated in a different audio file with the corre-
spondent class label as a filename Once all the datasets were properly sliced and
sight-checked the last cell of the notebook were executed with the correspondent
folder names (which contains all the sliced samples) and the features saved in differ-
ent csv one for each dataset3 Adding the number of instances in all the csv files
we get 40228 instances with 84 features and 1 label
1httpsessentiaupfedureferencestd_MusicExtractorhtml2httpsgithubcomMaciACtfg_DrumsAssessmentblobe81be958101be005cda805146d3287eec1a2d5a4
scriptsfeature_extractionpyL63httpsgithubcomMaciACtfg_DrumsAssessmenttreemasterdataslices
features
42 Drums event classifier 27
422 Training and validating
As mentioned in section 22 some authors have proposed machine learning algo-
rithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN)
to do sound event classification also some authors have developed more complex
methods for drums event classification The complexity of these last methods made
me choose the generic ones also to try if it were a good way to approach the problem
as there is no literature concretely on drums event classification with SVM or KNN
The iterative process of training and validating the aforementioned methods has
been the main reference when designing the 40k Drums samples dataset the first
times we tried the models we were working with the classes distribution of Figure
4 as commented this was a very unbalanced dataset and we were evaluating the
classification inference with the accuracy formula 41 that did not take into account
the unbalance in the dataset The accuracy computation was around 92 but the
correct predictions were mainly on the large classes as shown in Figure 7 some
classes had very low accuracy (even 0 as some classes has 10 samples 7 used to
train an 3 to validate which are all bad predicted) but having a little number of
instances affects less to the accuracy computation
accuracy(y y) =1
nsamples
nsamplesminus1sumi=0
1(yi = yi) (41)
Otherwise the proper way to compute the accuracy in this kind of datasets is the
balanced accuracy it computes the accuracy for each class and then averages the
accuracy along with all the classes as in formula 42 where wi represents the weight
of each class in the dataset This computation lowered the result to 79 which was
not a good result
wi =wisum
j 1(yj = yi)wj
(42)
balanced-accuracy(y y w) =1sumwi
sumi
1(yi = yi)wi
28 Chapter 4 Methodology
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score which
combines the precision and the recall of the model in one measure as in formula
43 Precision is computed as the number of correct predictions divided by the total
number of predictions and recall is the number of correct predictions divided by the
total number of predictions that should be correct for a given class
F_measure =precisiontimes recallprecision+ recall
(43)
Having these results led us to the process of recording a personalized dataset to
extend the already existing (See section 322) With this new distribution the
results improved as shown in Figure 8 as well as better balanced accuracy and f-
score (both 89) Until this point we were using both KNN and SVM models to
compare results and the SVM performed always 10 better at least so we decided
to focus on the SVM and its hyper-parameter tunning
42 Drums event classifier 29
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameterC = 1
The C parameter in a support vector machine refers to the regularization this
technique is intended to make a model less sensitive to the data noise and the
outliers that may not represent the class properly When increasing this value to
10 the results improved among all the classes as shown in Figure 9 as well as the
accuracy and f-score (both 95)
At that point the accuracy of the model was pretty good but the 88 on the snare
drum class was somehow a problem as is one of the most used instruments in the
drumset jointly with the hi-hat and the kick drum So I tried the same process
with the classes that include only the three mentioned instruments (ie hh kd sd
hh+kd hh+sd kd+sd and hh+kd+sd) Reducing the number of classes improved
the overall accuracy and f-score to 977 and concretely the sd accuracy to 96 as
shown in Figure 10
30 Chapter 4 Methodology
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameterC = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and param-eter C = 10 but only hh sd and kd classes
42 Drums event classifier 31
The implementation of the training and validating iterative process has been de-
veloped in the Classifier_trainingipynb4 notebook First loading the csv files
with the features extracted in Dataset_featureExtractionipynb then depend-
ing on which subset of classes will be used the correspondent instances and filtered
and to remove redundant features the ones with a very low standard deviation are
deleted (ie std_dev lt 000001) As the SVM works better when data is normalized
the standard scaler is used to move all the data distributions around 0 and ensuring
a standard deviation of 1
In the next cells the dataset is split into train and validation sets and the training
method from the SVM of sklearn is called to perform the training when the models
are trained the parameters are dumped in a file to load the model a posteriori and
be able to apply the knowledge learned to new data This process was so slow on
my computer so we decided to upload the csv files to Google Drive and open the
notebook with Google Collaboratory as it was faster and is a key feature to avoid
long waiting times during the iterative train-validate process In the last cells the
inference is made with the validation set and the accuracy is computed as well as
the confusion matrix plotted to get an idea of which classes are performing better
423 Testing
Testing the model introduces the concept of onset detection until now all the slices
have been created using the annotations but to assess a new submission from
a student we need to detect the onsets and then slice the events The function
SliceDrums_BeatDetection5 does both tasks as explained in section 211 there
are many methods to do onset detection and each of them is better for a different
application In the case of drums we have tested the rsquocomplexrsquo method which finds
changes in the frequency domain in terms of energy and phase and works pretty
well but when the tempo increase there are some onsets that are not correctly de-
tected for this reason we finally implemented the onset detection with the HFC4httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterClassifier_
trainingipynb5httpsgithubcomMaciACtfg_DrumsAssessmentblob9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869
scriptsdrumspyL45
32 Chapter 4 Methodology
method This method computes for each window the HFC as in equation 44 note
that high-frequency bins (k index) weights more in the final value of the HFC
HFC(n) =sumk
|Xk[n]|2lowastk (44)
Moreover the function plots the audio waveform jointly with the onsets detected to
check if it has worked correctly after each test In Figures 11 and 12 we can see two
examples of the same music sheet played at 60 and 220 bpm in both cases all the
onsets are correctly detected and no false detection occurs
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
With the onsets information the audio can be trimmed in the different events the
order is maintained with the name of the file so when comparing with the expected
events can be mapped easily The audios are passed to the extract_MusicalFeatures()
function that saves the musical features of each slice in a csv
43 Music performance assessment 33
To predict which event is each slice the models already trained are loaded in this new
environment and the data is pre-processed using the same pipeline as when training
After that data is passed to the classifier method predict() which returns for each
row in the data the predicted event The described process is implemented in the first
part of Assessmentipynb6 the second part is intended to execute the visualization
functions described in the next section
43 Music performance assessment
Finally as already commented the assessment part has been focused on giving visual
feedback of the interpretation to the student As the drums classifier has taken so
much time the creation of a dataset with interpretations and its grades has not been
feasible A first approximation was to record different interpretations of the same
music sheet simulating different levels of skills but grading it and doing all the
process by ourselves was not easy apart from that we tended to play the fragments
good or bad it was difficult to simulate intermediate levels and be consistent with
the proposed ones
So the implemented solution generates an image that shows to the student if the
notes of the music sheet are correctly read and if the onsets are aligned with the
expected ones
431 Visualization
With the data gathered in the testing section feedback of the interpretation has
to be returned Having as a base implementation the solution of my companion
Eduard Vergeacutes7 and thanks to the help of Vsevolod Eremenko8 in the last cell of
the notebook Assessmentipynb the visualization is done
First the LilyPond file paths are defined Then for each of the submissions the
audio is loaded to generate the waveform plot
6httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterAssessmentipynb7httpsgithubcomEduardVergesFranchU151202_VA_FinalProject8httpsgithubcomseffkaForMacia
34 Chapter 4 Methodology
To do so the function save_bar_plot()9 is called passing the lists of detected and
expected onsets the waveform and the start and end of the waveform (this comes
from the lilypond filersquos macro) To properly plot the deviations in the code we are
assuming that the interpretation starts four beats after the beginning of the audio
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is
shown The black lines in the bottom of the waveform are the detected onsets while
the cyan lines in the middle are the expected onsets when the difference between
the two values increase the area between them is colored with a traffic light code
(green good to red bad)
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
Once the waveform is created it is embedded in a lambda function that is called from
the LillyPond render But before calling the LillyPond to render the assessment of
the notes has to be done In function assess_notes()10 the expected and predicted
events are passed with a comparison a list of booleans is created but with 0 in the
False and 1 in the True index then the resulting list is iterated and the 0 indices
are checked because most of the classification errors fail in one of the instruments
to be predicted (ie instead of hh+sd it is predicting sd) These cases are considered
partially correct as the system has to take into account its errors the indices in
which one of the instruments is correctly predicted and it is not a hi-hat (we are
9httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsvisualizationpyL112
10httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsdrumspyL88
43 Music performance assessment 35
considering it more important to get right the snare and kick reading than a hi-hat
which is present in all the events) the value is turned to 075 (light green in the color
scale) In Figure 15 the different feedback options are shown green notes mean
correct light green means partially correct and red means incorrect
Figure 15 Example of coloured notes
With the waveform the notes assessed and the LilyPond template the function
score_image()11 can be called This function renders the LilyPond template jointly
with the waveform previously created this is done with the LilyPond macros
On one hand before each note on the staff the keyword color() size()
determines that the color and size of the note depends on an external variable (the
notes assessed) and in the other hand after the first note of the staff the keyword
eps(1150 16) indicates on which beat starts to display the waveform and
on which ends in this case from 0 to 16 in a 44 rhythm is 4 bars and the other
number is the scale of the waveform and allows to fit better the plot with the staff
432 Files used
The assessment process of an exercise needs several files first the annotations of the
expected events and their timesteps this is found in the txt file that has been already
mentioned in the 311 section then the LilyPond file this one is the template write
in LilyPond language that defines the resultant music sheet the macros to change
color and size and to add the waveform are defined when extracting the musical
features each submission creates its csv file to store the information and finally
we need of course the audio files with the submission recorded to be assessed
11httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsvisualizationpyL187
Chapter 5
Results
At this point the system has been developed and the classifier trained so we can do
an evaluation of the results to check if it works correctly and is useful to a student to
learn also to test which are the limits regarding the audio signal quality and tempo
The tests have been done with two different exercises recorded with a computer
microphone and played at a different tempo starting at 60 bpm and adding 40
bpm until 220 bpm The recordings with good tempo and good reading have been
processed adding 6dB until an accumulate of +30 dB
In this chapter and Appendix B all the resultant feedback visualizations are shown
The audio files can be listened in Freesound a pack1 has been created Some of
them will be commented on and referenced in further sections the rest are extra
results
As the High frequency content method works perfectly there are no limitations nor
errors in terms of onset detection all the tests have an f-measure of 1 detecting all
the expected events without detecting any false positive
1httpsfreesoundorgpeopleMaciaACpacks32350
36
51 Tempo limitations 37
51 Tempo limitations
One of the limitations of the system is the tempo of the exercise the accuracy drops
when the tempo increases Having as a reference the Figures that show good reading
which all notes should be in green or light green (ie Figures 16 17 18 19 20
21 and 22) we can count how many are correct or partially correct to punctuate
each case a correct prediction weights 10 a partially correct weights 05 and an
incorrect 0 the total value is the mean of the weighted result of the predictions
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
In Table 3 we can see that increasing the tempo of exercise 1 decreases the accuracy
of the classifier. This may be because increasing the tempo reduces the spacing
between events, and consequently the duration of each event, which leaves fewer
values with which to calculate the mean and standard deviation when extracting
the timbre characteristics. As stated in the law of large numbers [25], the larger the
sample, the closer the sample mean is to the population mean; here, having fewer
values in the calculation produces more outliers in the distribution, which tends to scatter.
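In symbols, if the per-frame feature values X1, ..., Xn of one event are roughly identically distributed with mean μ, the law of large numbers says that the event-level mean

    X̄n = (1/n) · (X1 + X2 + ... + Xn)  →  μ   as n → ∞

so short events at fast tempos (small n) yield noisier per-event means and standard deviations.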
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48
Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster
than 100 bpm. But 12/8 at 100 bpm is comparable in event rate to 140 bpm in 4/4:
at 100 dotted-quarter beats per minute there are 100 × 3 = 300 eighth notes per
minute, while 4/4 at 140 bpm gives 140 × 2 = 280 eighth notes per minute. The
results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events,
which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875
Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening
to the submissions, the hi-hat events are recorded with less amplitude than the snare
and kick events; for this reason we think the classifier starts to fail at +18 dB.
As can be seen in Tables 5 and 6, the same counting scheme as in the previous section
is applied to Figure 23 and Figure 24. The hi-hat is the last waveform to saturate,
and at that gain level the overall waveform is so clipped that the resulting high-frequency
content is predicted as a hi-hat in every case.
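This behaviour is consistent with the high frequency content measure used for onset detection in Section 4.2.3, which weights each spectral bin by its index,

    HFC(n) = Σk k · |Xk[n]|²

clipping squares off the waveform and spreads energy into the high bins, so a heavily clipped event looks spectrally like a hi-hat.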
Level    Correct   Partially OK   Incorrect   Total
+0 dB    25        7              0           0.89
+6 dB    23        9              0           0.86
+12 dB   23        9              0           0.86
+18 dB   24        7              1           0.86
+24 dB   18        5              9           0.64
+30 dB   13        5              14          0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0 dB    12        7              13          0.48
+6 dB    13        10             9           0.56
+12 dB   10        8              14          0.5
+18 dB   9         2              21          0.31
+24 dB   8         0              24          0.25
+30 dB   9         0              23          0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of the results has focused on the accuracy of the drums
event classifier, but we think it is also important to evaluate whether the system can
properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of
the beats are not read, the system can still map the rest of the events to the expected
ones at the corresponding onset time steps. This is due to a check done in the
assessment, which assumes that before the first beat there is a count-in of one bar,
and that the rest of the beats must come after this interval.
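A sketch of this mapping (an illustration of the idea, not the repository code) lays out the expected onsets after a one-bar count-in and assigns each detected onset to the nearest expected one:

    def expected_onsets(bpm, beats, beats_per_bar=4):
        """Expected onset times (s), after a one-bar count-in."""
        beat = 60.0 / bpm
        return [beats_per_bar * beat + i * beat for i in range(beats)]

    def map_onsets(detected, expected):
        """Index of the closest expected onset for each detected onset time."""
        return [min(range(len(expected)), key=lambda i: abs(t - expected[i]))
                for t in detected]

    exp = expected_onsets(bpm=100, beats=8)       # count-in ends at 2.4 s
    det = [2.41, 3.02, 4.19, 4.81]                # some beats missing in between
    print(map_onsets(det, exp))                   # [0, 1, 3, 4]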
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in the previous sections, counting the
number of correct predictions, but now in terms of assessment. The analyzed results
will be the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 40 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a
correct assessment if the note is green or light green and the event is the one in the
music score, or if the note is red and the event is not the one in the music score.
The remaining cases are counted as incorrect assessments. The total value is
the number of correct assessments over the total number of events.
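The rule can be stated compactly in code (illustrative, with hypothetical event labels):

    GREENISH = {"green", "lightgreen"}

    def assessment_correct(note_color, played_event, scored_event):
        """True when the visual feedback agrees with what was actually played."""
        matches = played_event == scored_event
        return (note_color in GREENISH and matches) or \
               (note_color == "red" and not matches)

    notes = [("green", "hh+sd", "hh+sd"),   # correctly marked as right
             ("red", "hh", "hh+kd"),        # correctly marked as wrong
             ("red", "sd", "sd")]           # wrongly marked: it was played right
    print(sum(assessment_correct(*n) for n in notes) / len(notes))  # ~0.67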
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68
Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9
Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system performs
the assessment based on the predictions pretty well. This can help a student know
which parts of the music sheet are well read and which are not. The tempo
visualization can also help the student recognize whether they are slowing down or
rushing when reading the score: as can be seen in Figure 30, the detected onsets
(black lines in the bottom part of the waveform) are mostly behind the corresponding
expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed. In this chapter a discussion is developed about which objectives have been
accomplished and which have not. A set of further improvements is also given,
together with a final thought on my work and on what I have learned. The chapter
ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained throughout this document, we can now
go over them, stating how complete each one is and what our contributions are.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly available,
downloadable from Freesound and Zenodo. Apart from being used in this project,
this dataset might be useful to other researchers and students in their own projects.
The dataset is indeed useful for balancing drums datasets based on real
interpretations, since the class distribution of such interpretations is very unbalanced,
as explained for the IDMT and MDB drums datasets.
Secondly, a drums event classifier based on a machine learning approach has been
proposed and trained with the aforementioned dataset. One of the reasons for using
this approach to predict the events was that there was no literature focused on
classifying drums events in this manner. As the results have shown, more complex
methods based on context might be used, such as the ones proposed in [16] and [17].
It is important to take into account that the task the model is trained to do
is very hard for a human: differentiating drums events in an individual drum sample,
without any context, is almost impossible even for a trained ear such as my
drums teacher's, or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as
the development of a MusicXML parser. This part took around one month to
develop and, from my point of view, it was a great way to understand how these
file formats work and how they could be improved, as they are mostly focused on
visualization rather than on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate
the functionality of the system, and tests of these exercises have been recorded
in a different environment from that of the 40k Samples Drums Dataset. It would be
good to obtain recordings in different spaces and with different drumsets and
microphones in order to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with
different drumsets, tuning each drumset differently, using different sticks to hit the
instruments, and even having different people play; this would introduce more variance
into the drums sample dataset. Moreover, on June 9th 2021 a paper about a large
drums dataset with MIDI data was presented [26] at ICASSP 20211. This new
dataset could be included in the training process, as its authors state that having a
large-scale dataset improves the results of existing models.
Regarding the classification model, it clearly needs improvements to ensure the
robustness of the overall system. It would be appropriate to introduce the aforementioned
methods of [16], [17] and [26] into the ADT part of the pipeline.
1. https://www.2021.ieeeicassp.org/
Also, in terms of the classes in the drumset, there is still a long way to go: there
are no solutions that robustly transcribe a whole kit, including the toms
and the different kinds of cymbals. In this regard, we think a proper approach would
be to work with professional musicians, who can help researchers better understand
the instrument and create datasets covering different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo
deviations and the reading accuracy, a regression model could be trained on assessed
drums exercises to give a mark to each student. On this path, introducing an
electronic drumset with MIDI output would make things a lot easier, as the drums
classifier step could be omitted.
About the implementation, a good contribution would be to introduce the models
and algorithms into the Pysimmusic workflow and develop a demo web app like
Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if the code and data are available
and other researchers or students can execute them, obtaining the same results.
All the code has been developed in Python, a widely known general-purpose
programming language. It is available in my GitHub repository2, as are the data
used to test the system and the classification models.
The data created, i.e. the studio recordings, are available in a Zenodo repository3,
and some samples in Freesound4. This is the 29k Drums Samples Dataset: not all
of the 40k samples used for training are our property, so we cannot share them
under our full authorship; despite this, the other datasets used in this project are
available individually.
2. https://github.com/MaciAC/tfg_DrumsAssessment
3. https://zenodo.org/record/4923588#.YMRgNm4p7ow
4. https://freesound.org/people/MaciaAC/packs/32397/
6.4 Conclusions
This project has been developed over one year. At this point, with the work described,
the goal of supporting drums learning has been accomplished, although work remains
in terms of robustness and reliability. A first approximation has been presented,
and several paths of improvement have been proposed.
Moreover, several fields of engineering and computer science have been covered, such
as signal processing, music information retrieval and machine learning; not only
in terms of implementation, but also in researching methods and gathering already
existing experiments and results.
Regarding my relationship with computers, I have improved my fluency with git and
its web platform, GitHub. At the beginning of the project I wanted to execute
everything on my local computer, which meant installing and compiling libraries that
could not be installed on macOS via the pip command (i.e. Essentia); this was
a tough path to take. In a more advanced phase of the project,
I realized that the LilyPond tools could not be installed and used fluently on
my local machine, so I moved all the code to my Google Drive in order to execute the
notebooks on a Colaboratory machine. Developing code in this environment also
has its quirks, which I have had to learn. In summary, I have spent a good amount of time
looking for the ideal way to develop the project, and the process has indeed been
fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my
Bachelor's degree, as I reviewed some of the concepts of most personal interest to me.
Being able to relate the project to music and drums helped me keep
my motivation and focus. I am quite satisfied with the feedback visualization that
the system produces, and I hope more people get interested in this field of
research so that better tools appear in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 40 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings - 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
[Drum notation for recording routine 1, tempo = 60; music engraving by LilyPond 2.18.2, www.lilypond.org]
Figure 31 Recording routine 1
[Drum notation for recording routine 2, tempo = 60; music engraving by LilyPond 2.18.2, www.lilypond.org]
Figure 32 Recording routine 2
[Drum notation for recording routine 3, tempo = 60; music engraving by LilyPond 2.18.2, www.lilypond.org]
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
[Block diagram omitted: music scores, students' performances and their assessments form a dataset of annotations and audio recordings; feature extraction feeds both the drums event classifier training and the performance assessment training; a new student's recording goes through feature extraction and performance assessment inference, producing a visualization with performance feedback.]
Figure 6 Proposed pipeline for a drums performance assessment system, inspired by [2]
For this reason, the project has mainly focused on developing the aforementioned drums event classifier and a proper dataset. Consequently, building a properly assessed dataset of drums interpretations has not been possible, nor has the performance assessment training. Despite this, the feedback visualization has been developed, as it is a good way to close the pipeline and obtain understandable results; moreover, the performance feedback can focus on deterministic aspects, such as telling the student whether they are rushing or dragging relative to a given tempo.
4.2 Drums event classifier
As already mentioned, this section has been the main body of work for this project, because a reliable assessment depends on a correct automatic transcription. The process has been divided into three main parts: extracting the musical features, training and validating the model in an iterative process, and finally testing the model with totally new data.
4.2.1 Feature extraction
The feature extraction concept has been explained in Section 2.1.1 and has been implemented using the MusicExtractor()1 method from the Essentia library. The MusicExtractor() method has to be called passing as parameters the window and hop sizes that will be used to perform the analysis, as well as the filename of the event to be analyzed. The function extract_MusicalFeatures()2 has been implemented to loop over a list of files and analyze each of them, adding the extracted features to a .csv file jointly with the class of each drum event. At this point all the low-level features were extracted; both mean and standard deviation were computed across all the frames of the given audio file. The reason was that we wanted to check which features were redundant or meaningful when training the classifier.
As mentioned in Section 3.4, the fact that the MusicExtractor() method has to be called with a filename, not an audio stream, forced us to create another version of the dataset, which had each event annotated in a different audio file with the correspondent class label as the filename. Once all the datasets were properly sliced and sight-checked, the last cell of the notebook was executed with the correspondent folder names (which contain all the sliced samples) and the features were saved in different .csv files, one for each dataset3. Adding up the number of instances in all the .csv files, we get 40228 instances with 84 features and 1 label.
1 https://essentia.upf.edu/reference/std_MusicExtractor.html
2 https://github.com/MaciAC/tfg_DrumsAssessment/blob/e81be958101be005cda805146d3287eec1a2d5a4/scripts/feature_extraction.py#L6
3 https://github.com/MaciAC/tfg_DrumsAssessment/tree/master/data/slices/features
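As an illustration, the per-event extraction could look roughly like the following sketch; the frame and hop sizes and the slice filename are illustrative assumptions, not the exact values used in the project.

```python
import essentia.standard as es

# MusicExtractor computes low-level descriptors per frame and aggregates
# the requested statistics; frame/hop sizes here are illustrative.
extractor = es.MusicExtractor(lowlevelFrameSize=2048,
                              lowlevelHopSize=1024,
                              lowlevelStats=['mean', 'stdev'])

# The method takes a filename (not an audio stream), as noted above.
features, frames = extractor('slices/hh_0001.wav')  # hypothetical slice file

# Collect the aggregated low-level descriptors into one row for the csv
row = {name: features[name] for name in features.descriptorNames()
       if name.startswith('lowlevel')}
```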
4.2.2 Training and validating
As mentioned in Section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) for sound event classification, and some authors have developed more complex methods for drums event classification. The complexity of these latter methods led us to choose the generic ones, also to test whether they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40kSamples Drums Dataset. The first times we tried the models, we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula (4.1), which does not take into account the unbalance of the dataset. The accuracy computation was around 92%, but the correct predictions were mainly on the large classes, as shown in Figure 7. Some classes had very low accuracy (even 0%, as some classes have only 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances affects the accuracy computation less.

$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} 1(\hat{y}_i = y_i) \tag{4.1}$$
Instead, the proper way to compute the accuracy on this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages the accuracies across all the classes, as in formula (4.2), where each sample weight $w_i$ is rescaled by the size of its class. This computation lowered the result to 79%, which was not a good result.

$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \tag{4.2}$$

$$\mathrm{balanced\text{-}accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i$$
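A minimal sketch of the difference between the two metrics, on a toy imbalanced example; scikit-learn, one of the project's tools, implements both:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy imbalanced case: 9 hi-hat events and 1 snare; a classifier that
# always predicts hi-hat looks good only under plain accuracy.
y_true = ['hh'] * 9 + ['sd']
y_pred = ['hh'] * 10

print(accuracy_score(y_true, y_pred))           # 0.9
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 = mean(1.0 for hh, 0.0 for sd)
```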
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score, which combines the precision and the recall of the model in one measure, as in formula (4.3). Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of instances that belong to the given class.

$$F\text{-measure} = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{4.3}$$
These results led us to record a personalized dataset to extend the already existing ones (see Section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as did the balanced accuracy and f-score (both 89%). Until this point we were using both KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine refers to the regularization: this technique is intended to make the model less sensitive to noise in the data and to outliers that may not represent their class properly. When increasing this value to 10, the results improved across all the classes, as shown in Figure 9, as did the accuracy and f-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was somewhat of a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with the classes that include only the three mentioned instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and f-score to 97.7% and, concretely, the sd accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the training and validating iterative process has been developed in the Classifier_training.ipynb4 notebook. First, the .csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the correspondent instances are filtered and, to remove redundant features, the ones with a very low standard deviation are deleted (i.e. std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to center all the feature distributions around 0 and ensure a standard deviation of 1.
In the next cells the dataset is split into train and validation sets, and the training method of sklearn's SVM is called to perform the training; once the models are trained, their parameters are dumped to a file so that the model can be loaded a posteriori and the learned knowledge applied to new data. This process was very slow on my computer, so we decided to upload the .csv files to Google Drive and open the notebook with Google Colaboratory, which was faster, a key feature to avoid long waiting times during the iterative train-validate process. In the last cells the inference is made with the validation set and the accuracy is computed, as well as the confusion matrix plotted, to get an idea of which classes are performing better.
4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
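The core of that loop could be sketched as follows; the file path, split ratio and random seed are assumptions, while the scaler, the C value and the model dump follow the description above.

```python
import joblib
import pandas as pd
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv('features/all_datasets.csv')        # hypothetical merged csv
X, y = df.drop(columns=['label']), df['label']
X = X.loc[:, X.std() >= 1e-5]                        # drop near-constant features

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=0)
scaler = StandardScaler().fit(X_tr)                  # zero mean, unit variance
clf = SVC(C=10).fit(scaler.transform(X_tr), y_tr)    # the tuned regularization

y_pred = clf.predict(scaler.transform(X_val))
print(balanced_accuracy_score(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))

joblib.dump((scaler, clf), 'svm_C10.pkl')            # reload later for inference
```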
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection5 does both tasks. As explained in Section 2.1.1, there are many methods for onset detection, and each of them suits a different application. In the case of drums, we tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works pretty well; however, when the tempo increases, some onsets are not correctly detected. For this reason we finally implemented the onset detection with the HFC method.
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
This method computes, for each window, the HFC as in equation (4.4); note that higher-frequency bins (larger k index) weigh more in the final value of the HFC:

$$\mathrm{HFC}(n) = \sum_k \left|X_k[n]\right|^2 \cdot k \tag{4.4}$$
Moreover, the function plots the audio waveform jointly with the detected onsets, to check after each test whether the detection worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60 bpm drums interpretation
Figure 12 Onsets detected in a 220 bpm drums interpretation
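The HFC-based detection can be reproduced with Essentia's standard onset recipe, roughly as below; the filename and the frame and hop sizes are illustrative assumptions.

```python
import essentia
import essentia.standard as es

audio = es.MonoLoader(filename='submission.wav')()   # hypothetical recording

w = es.Windowing(type='hann')
fft = es.FFT()
c2p = es.CartesianToPolar()
od_hfc = es.OnsetDetection(method='hfc')             # the HFC of equation (4.4)

pool = essentia.Pool()
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    magnitude, phase = c2p(fft(w(frame)))
    pool.add('features.hfc', od_hfc(magnitude, phase))

# Onsets() turns the detection function into onset times, in seconds
onset_times = es.Onsets()(essentia.array([pool['features.hfc']]), [1])
```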
With the onset information, the audio can be trimmed into the different events; the order is kept in the file names, so when comparing with the expected events they can be mapped easily. The audio files are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a .csv file.
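The trimming step itself is straightforward once the onset times are known. A sketch, assuming the audio and onset_times from the previous snippet and using soundfile for writing (the project's actual I/O helper may differ):

```python
import soundfile as sf   # assumed I/O library, not necessarily the one used

sr = 44100               # MonoLoader's default sample rate
boundaries = list(onset_times) + [len(audio) / sr]

# one file per event; zero-padded names keep the original order
for i, (start, end) in enumerate(zip(boundaries[:-1], boundaries[1:])):
    sf.write(f'slices/event_{i:03d}.wav',
             audio[int(start * sr):int(end * sr)], sr)
```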
To predict which event each slice contains, the already trained models are loaded in this new environment and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns, for each row in the data, the predicted event. The described process is implemented in the first part of Assessment.ipynb6; the second part is intended to execute the visualization functions described in the next section.
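A sketch of that inference step, assuming the scaler and classifier were dumped together as in the training sketch above (the file names are hypothetical):

```python
import joblib
import pandas as pd

scaler, clf = joblib.load('svm_C10.pkl')             # dumped during training

features = pd.read_csv('submission_features.csv')    # one row per sliced event
X = features[scaler.feature_names_in_]               # same columns as in training
predicted_events = clf.predict(scaler.transform(X))  # e.g. ['hh', 'hh+kd', 'sd', ...]
```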
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving visual feedback on the interpretation to the student. As the drums classifier has taken so much time, the creation of a dataset with interpretations and their grades has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different levels of skill, but grading them and doing all the process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed ones.
So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Taking as a base implementation the solution of my companion Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb.
First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.
6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (this comes from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio.
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (green good, red bad).
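A simplified sketch of such a deviation plot with matplotlib; the thresholds, colours and function name are illustrative assumptions, not the exact ones in save_bar_plot():

```python
import numpy as np
import matplotlib.pyplot as plt

def deviation_plot(audio, sr, detected, expected, path):
    """Waveform with detected onsets (black), expected onsets (cyan) and a
    traffic-light area between each pair; 250 ms deviation maps to full red."""
    t = np.arange(len(audio)) / sr
    fig, ax = plt.subplots(figsize=(12, 3))
    ax.plot(t, audio, color='0.7', linewidth=0.5)
    for d, e in zip(detected, expected):
        severity = min(abs(d - e) / 0.25, 1.0)    # 0 = on time, 1 = far off
        ax.axvspan(min(d, e), max(d, e), color=plt.cm.RdYlGn_r(severity), alpha=0.6)
        ax.vlines(d, -1.0, -0.5, color='black')   # detected onset marker
        ax.vlines(e, -0.25, 0.25, color='cyan')   # expected onset marker
    fig.savefig(path, bbox_inches='tight')
    plt.close(fig)
```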
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
Once the waveform is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. In the function assess_notes()10 the expected and predicted events are compared: a list is created with 1 at the indices where they match and 0 where they do not. The resulting list is then iterated and the 0 indices are checked, because most of the classification errors miss just one of the instruments to be predicted (e.g. instead of hh+sd it predicts sd). These cases are considered partially correct, as the system has to take its own errors into account: at the indices where one of the instruments is correctly predicted and it is not a hi-hat (we consider it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct, and red means incorrect.
9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
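The scoring logic just described can be condensed into a few lines. This is a hedged re-implementation, not the original assess_notes() code, assuming event labels are '+'-joined abbreviations such as 'hh+sd':

```python
def assess_notes(expected, predicted):
    """1.0 for an exact match, 0.75 when a non-hi-hat instrument of the
    event is still correctly predicted, 0.0 otherwise."""
    scores = []
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            scores.append(1.0)
        else:
            common = set(exp.split('+')) & set(pred.split('+'))
            scores.append(0.75 if common - {'hh'} else 0.0)
    return scores

# e.g. expecting hh+sd but predicting sd: the snare is still caught -> 0.75
assert assess_notes(['hh+sd'], ['sd']) == [0.75]
```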
With the waveform, the assessed notes and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On the one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the assessed notes); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars), and the other number is the scale of the waveform, which allows fitting the plot to the staff.
4.3.2 Files used
The assessment process of an exercise needs several files: first, the annotations of the expected events and their timesteps, found in the .txt file already mentioned in Section 3.1.1; then the LilyPond file, the template written in the LilyPond language that defines the resulting music sheet, where the macros to change color and size and to add the waveform are defined; when extracting the musical features, each submission creates its .csv file to store the information; and finally we need, of course, the audio files with the recorded submission to be assessed.
11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results, to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until reaching 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB at a time, up to an accumulated +30 dB.
In this chapter and Appendix B all the resulting feedback visualizations are shown. The audio files can be listened to on Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high frequency content method works perfectly, there are no limitations or errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positive.
1 https://freesound.org/people/MaciaAC/packs/32350/
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops when the tempo increases. Taking as reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one 0.5 and an incorrect one 0; the total value is the mean of the weighted results of the predictions.
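For reference, the counting reduces to a one-line weighted mean; the first row of Table 3 below, with 25 correct and 7 partially correct out of 32 events, gives (25 + 0.5 · 7)/32 ≈ 0.89:

```python
def exercise_score(correct, partial, incorrect):
    """Weighted score used in Tables 3-6: 1.0 / 0.5 / 0.0 per prediction."""
    total = correct + partial + incorrect
    return (correct + 0.5 * partial) / total

print(exercise_score(25, 7, 0))   # 0.890625, first row of Table 3
```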
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that increasing the tempo of exercise 1 decreases the accuracy of the classifier. This may be because increasing the tempo reduces the spacing between events, and consequently the duration of each event, which leaves fewer values for computing the mean and standard deviation when extracting the timbre characteristics. As stated in the law of large numbers [25], the larger the sample, the closer its mean is to the population mean; in this case, having fewer values in the calculation produces more outliers in a distribution that tends to scatter.
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48
Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But at 100 bpm in 12/8 the subdivision runs at 300 eighth notes per minute, similar to 140 bpm in 4/4, whose subdivision rate is 280 eighth notes per minute. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875
Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason, we think, the classifier starts to fail at +18 dB, as can be seen in Tables 5 and 6, where the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that the resulting high-frequency content is predicted as a hi-hat in all cases.
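A sketch of how such gain staging can be simulated, assuming audio is a float waveform in [-1, 1]; the hard clip at ±1.0 models a full-scale digital recording:

```python
import numpy as np

def amplify(audio, gain_db):
    """Apply gain in dB and hard-clip at digital full scale (±1.0)."""
    factor = 10 ** (gain_db / 20)        # +6 dB doubles the amplitude
    return np.clip(audio * factor, -1.0, 1.0)

saturated = {db: amplify(audio, db) for db in range(0, 36, 6)}  # +0 .. +30 dB
```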
Level    Correct   Partially OK   Incorrect   Total
+0 dB    25        7              0           0.89
+6 dB    23        9              0           0.86
+12 dB   23        9              0           0.86
+18 dB   24        7              1           0.86
+24 dB   18        5              9           0.64
+30 dB   13        5              14          0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0 dB    12        7              13          0.48
+6 dB    13        10             9           0.56
+12 dB   10        8              14          0.5
+18 dB   9         2              21          0.31
+24 dB   8         0              24          0.25
+30 dB   9         0              23          0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6 dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6 dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of the results has focused on the accuracy of the drums event classifier, but we think it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, even if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the correspondent onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar, and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starting at 60 bpm and adding 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized; it works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68
Table 7 Assessment results of a bad reading with different tempos, 4/4 exercise

Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9
Table 8 Assessment results of a bad reading with different tempos, 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs the assessment based on the predictions quite well. This can help a student to know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help the student recognize whether they are slowing down or rushing while reading the score: as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the correspondent expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. Also, a set of further improvements is given, plus a final thought on my work and my learning process. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
With all the concepts explained throughout this document in mind, we can now list them, assessing their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly available, downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of such interpretations is very unbalanced, as explained for the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on
classifying drums events in this manner. As the results have shown, more complex, context-based methods might be used instead, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained to do is very hard for a human being: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on visualization rather than on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment from that of the 30k Samples Drums Dataset. It would be good to get recordings in different spaces and with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even having different people play. This would introduce more variance into the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org/
Also, in terms of the classes in the drumset, there is a long path still to cover: there are no solutions that robustly transcribe a whole set, including the toms and the different kinds of cymbals. Here we think that a proper approach would be to work with professional musicians, who can help researchers better understand the instrument and create datasets covering different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained on assessed drums exercises to give each student a mark. On this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like MusicCritic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if its code and data are available and other researchers or students can execute them, obtaining the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository3 and some samples in Freesound4. This is the 29kDrumsSamplesDataset: not all of the 40k samples used for training are our property, so we are not able to share them under our full authorship; despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397/
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although the work still falls short in terms of robustness and reliability. A first approximation has been presented, and several paths of improvement have been proposed.
Moreover, several fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also in investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web counterpart GitHub. Also, at the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which was a tough path to take. In a more advanced phase of the project I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebook on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that results from the system, and I hope that more people get interested in this field of research, so that better tools are available in the future.
List of Figures
1 Datasets pre-processing
2 Sample drums score from music school drums grade 1
3 Microphone setup for drums recording
4 Number of samples before Train set recording
5 Number of samples after Train set recording
6 Proposed pipeline for a drums performance assessment system, inspired by [2]
7 Confusion matrix after training with the dataset in Figure 4
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
11 Onsets detected in a 60 bpm drums interpretation
12 Onsets detected in a 220 bpm drums interpretation
13 Onset deviation plot of a good tempo submission
14 Onset deviation plot of a bad tempo submission
15 Example of coloured notes
16 Good reading and good tempo Ex 1 60 bpm
17 Good reading and good tempo Ex 1 100 bpm
18 Good reading and good tempo Ex 1 140 bpm
19 Good reading and good tempo Ex 1 180 bpm
20 Good reading and good tempo Ex 1 220 bpm
21 Good reading and good tempo Ex 2 60 bpm
22 Good reading and good tempo Ex 2 100 bpm
23 Good reading and good tempo Ex 1 60 bpm, accumulating +6 dB at each new staff
24 Good reading and good tempo Ex 1 220 bpm, accumulating +6 dB at each new staff
25 Bad reading and bad tempo Ex 1 100 bpm
26 Bad reading and bad tempo Ex 1 180 bpm
27 Bad reading and good tempo Ex 1, starting at 60 bpm and adding 60 bpm at each new staff
28 Bad reading and good tempo Ex 2 60 bpm
29 Bad reading and good tempo Ex 2 100 bpm
30 Good reading and bad tempo Ex 1 100 bpm
31 Recording routine 1
32 Recording routine 2
33 Recording routine 3
34 Drumset configuration 1
35 Drumset configuration 2
36 Drumset configuration 3
37 Good reading and bad tempo Ex 1 60 bpm
38 Bad reading and bad tempo Ex 1 60 bpm
39 Good reading and bad tempo Ex 1 140 bpm
40 Bad reading and bad tempo Ex 1 140 bpm
41 Good reading and bad tempo Ex 1 180 bpm
42 Good reading and bad tempo Ex 1 220 bpm
43 Bad reading and bad tempo Ex 1 220 bpm
44 Good reading and bad tempo Ex 2 60 bpm
45 Bad reading and bad tempo Ex 2 60 bpm
46 Good reading and bad tempo Ex 2 100 bpm
47 Bad reading and bad tempo Ex 2 100 bpm
List of Tables
1 Abbreviations' legend
2 Microphones used
3 Results of exercise 1 with different tempos
4 Results of exercise 2 with different tempos
5 Results of exercise 1 at 60 bpm with different amplification levels
6 Results of exercise 1 at 220 bpm with different amplification levels
7 Assessment results of a bad reading with different tempos, 4/4 exercise
8 Assessment results of a bad reading with different tempos, 12/8 exercise
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K. J., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2003).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2001).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., EP Davies, M. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
[Drum-notation score omitted; music engraving by LilyPond 2.18.2 — www.lilypond.org]
Figure 31 Recording routine 1
[Drum-notation score omitted; music engraving by LilyPond 2.18.2 — www.lilypond.org]
Figure 32 Recording routine 2
[Drum-notation score omitted; music engraving by LilyPond 2.18.2 — www.lilypond.org]
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
In Table 3 we can see that by increasing the tempo of exercise 1 the accuracy of the
classifier decreases it may be because increasing the tempo decreases the spacing
between events and consequently the duration of each event which leads to fewer
38 Chapter 5 Results
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to calculate the mean and standard deviation when extracting the timbre
characteristics As stated in the Law of Large numbers [25] the larger the sample
the closer the mean is to the total population mean In this case having fewer values
in the calculation creates more outliers in the distribution which tends to scatter
Tempo Correct Partially OK Incorrect Total60 25 7 0 089100 24 8 0 0875140 24 7 1 086180 15 9 8 061220 12 7 13 048
Table 3 Results of exercise 1 with different tempos
51 Tempo limitations 39
Regarding the 128 exercise (Figures 21 and 22) we were not able to record faster
than 100 bpm But in 128 the quarter notes equivalent in tempo is 300 quarter
note per minute similarly to the 140 bpm on 44 which quarter note per minute
temporsquos is 280 The results on 128 (Table 4) are also better because there are more
rsquoonly hi hatrsquo events which are better predicted
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
40 Chapter 5 Results
Tempo Correct Partially OK Incorrect Total60 39 8 1 089100 37 10 1 0875
Table 4 Results of exercise 2 with different tempos
52 Saturation limitations
Another limitation to the system is the saturation of the signal submitted Hearing
the submissions the hi-hat events are recorded with less amplitude than the snare
and kick events for this reason we think that the classifier starts to fail at +18dB
As can be seen in Tables 5 and 6 the same counting scheme as in the previous section
is done with Figure 23 and Figure 24 The hi-hat is the last waveform to saturate
and at this gain level the overall waveform is so clipped leading to a high-frequency
content that is predicted as a hi-hat in all the cases
Level Correct Partially OK Incorrect Total+0dB 25 7 0 089+6dB 23 9 0 086+12dB 23 9 0 086+18dB 24 7 1 086+24dB 18 5 9 064+30dB 13 5 14 048
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level Correct Partially OK Incorrect Total+0dB 12 7 13 048+6dB 13 10 9 056+12dB 10 8 14 05+18dB 9 2 21 031+24dB 8 0 24 025+30dB 9 0 23 028
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
52 Saturation limitations 41
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB ateach new staff
42 Chapter 5 Results
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB ateach new staff
53 Evaluation of the assessment 43
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier
accuracy but we think that is also important to evaluate if the system can assess
properly a studentrsquos submission
As shown in Figures 25 and 26 if the student does not play the first beat or some of
the beats are not read the system can map the rest of the events to the expected in
the correspondent onset time step This is due to a checking done in the assessment
which assumes that before starting the first beat there is a count-back of one bar
and the rest of the beats have to be after this interval
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections counting the
number of correct predictions but now in terms of assessment The analyzed results
will be the rsquoBad reading good temporsquo ones shown in Figures 27 28 and 29
44 Chapter 5 Results
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset One of the reasons for using
this approach to predict the events was that there was no literature focused on
47
48 Chapter 6 Discussion and conclusions
classifying drums events in this manner As the results have shown more complex
methods based on the context might be used as the ones proposed in [16] and [17]
It is important to take into account that the task that the model is trained to do
is very hard for a human being able to differentiate drums events in an individual
drum sample without any context is almost impossible even for a trained ear as my
drums teacher or mine
Thirdly a review of the different music sheet technologies has been done as well as
the development of a MusicXML parser This part took around one month to be
developed and from my point of view it was a great way to understand how these
file formats work and how can be improved as they are majorly focused on the
visualization not the symbolic representation of events and timesteps
Finally two exercises in different time signatures have been proposed to demonstrate
the functionality of the system As well as tests of these exercises have been recorded
in a different environment than the 30k Samples Drums Dataset It would be fine
to get recordings in different spaces and with different drumsets and microphones
to test more exhaustively the system
62 Further work
In terms of the dataset created it could be larger It could be expanded with
different drumsets tuning differently each drumset using different sticks to hit the
instruments and even different people playing This could introduce more variance
in the drums sample dataset Moreover on June 9th 2021 a paper about a large
drums datasets with MIDI data was presented [26] in the ICASSP 20211 This new
dataset could be included in the training process as the authors state that having a
large-scale dataset improves the results of the existing models
Regarding the classification model it is clear that needs improvements to ensure the
overall system robustness It would be appropriate to introduce the aforementioned
methods in [16] [17] and [26] in the ADT part of the pipeline
1httpswww2021ieeeicassporg
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion developing this project has been a nice way to close my
Bachelorrsquos degree as I reviewed some of the concepts of more personal interest
And being able to relate the project with music and drums helped me to keep
my motivation and focus I am quite satisfied with the feedback visualization that
results of the system and I hope that more people get interested in this field of
research to get better tools in the future
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
4.2.2 Training and validating
As mentioned in section 2.2, some authors have proposed machine learning algorithms such as Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) for sound event classification, while others have developed more complex methods specifically for drums event classification. The complexity of these last methods made me choose the generic ones, also to test whether they were a good way to approach the problem, as there is no literature concretely on drums event classification with SVM or KNN.
The iterative process of training and validating the aforementioned methods has been the main reference when designing the 40k Drums samples dataset. The first times we tried the models we were working with the class distribution of Figure 4; as commented, this was a very unbalanced dataset, and we were evaluating the classification inference with the accuracy formula 4.1, which does not take the imbalance of the dataset into account. The computed accuracy was around 92%, but the correct predictions were mainly on the large classes: as shown in Figure 7, some classes had very low accuracy (even 0%, as some classes had only 10 samples, 7 used to train and 3 to validate, all of them badly predicted), but having a small number of instances affects the accuracy computation less.
$$\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\mathrm{samples}}} \sum_{i=0}^{n_{\mathrm{samples}}-1} 1(\hat{y}_i = y_i) \qquad (4.1)$$
Instead, the proper way to compute accuracy on this kind of dataset is the balanced accuracy: it computes the accuracy for each class and then averages it across all the classes, as in formula 4.2, where $w_i$ represents the weight of each class in the dataset. This computation lowered the result to 79%, which was not a good result.
$$\hat{w}_i = \frac{w_i}{\sum_j 1(y_j = y_i)\, w_j} \qquad (4.2)$$
$$\text{balanced-accuracy}(y, \hat{y}, w) = \frac{1}{\sum_i \hat{w}_i} \sum_i 1(\hat{y}_i = y_i)\, \hat{w}_i$$
Figure 7 Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score, which combines the precision and the recall of the model into one measure, as in formula 4.3. Precision is computed as the number of correct predictions divided by the total number of predictions, and recall is the number of correct predictions divided by the total number of instances that should have been predicted as a given class.
$$F\text{-measure} = 2 \cdot \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (4.3)$$
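To make these metrics concrete, the following minimal sketch uses scikit-learn's implementations on a deliberately unbalanced toy labeling (the class names are illustrative only):

    from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

    # 9 hi-hat samples and 1 kick; a degenerate model that always predicts 'hh'
    y_true = ["hh"] * 9 + ["kd"]
    y_pred = ["hh"] * 10

    print(accuracy_score(y_true, y_pred))             # 0.9  -> deceptively good
    print(balanced_accuracy_score(y_true, y_pred))    # 0.5  -> exposes the imbalance
    print(f1_score(y_true, y_pred, average="macro"))  # ~0.47, per-class f-scores averaged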
These results led us to record a personalized dataset to extend the existing ones (see section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as did the balanced accuracy and f-score (both 89%). Until this point we were using both the KNN and SVM models to compare results, and the SVM always performed at least 10% better, so we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter of a support vector machine controls the regularization, a technique intended to make the model less sensitive to noise and to outliers that may not represent their class properly (in scikit-learn, a larger C means weaker regularization). When increasing this value to 10, the results improved across all the classes, as shown in Figure 9, as did the accuracy and f-score (both 95%).
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with only the classes that involve these three instruments (ie hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and f-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The training and validating iterative process has been implemented in the Classifier_training.ipynb4 notebook. First, the .csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the corresponding instances are filtered; and, to remove redundant features, the ones with a very low standard deviation are deleted (ie std_dev < 0.00001). As the SVM works better when data is normalized, the standard scaler is used to center all the feature distributions around 0 with a standard deviation of 1.
In the next cells the dataset is split into train and validation sets and the training method of the sklearn SVM is called to perform the training. When the models are trained, their parameters are dumped to a file, so the model can be loaded a posteriori and the learned knowledge applied to new data. This process was very slow on my computer, so we decided to upload the .csv files to Google Drive and open the notebook with Google Colaboratory, as it was faster, which is key to avoiding long waiting times during the iterative train-validate process. In the last cells the inference is made on the validation set, the accuracy is computed, and the confusion matrix is plotted to get an idea of which classes perform better.
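Condensed into a sketch, the train-validate loop described above could look as follows; the file name, the label column and the split ratio are assumptions, not the notebook's exact values:

    import joblib
    import pandas as pd
    from sklearn.metrics import balanced_accuracy_score, confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    df = pd.read_csv("features.csv")                 # hypothetical features file
    X, y = df.drop(columns=["label"]), df["label"]
    X = X.loc[:, X.std() > 1e-5]                     # drop near-constant features

    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y)

    scaler = StandardScaler().fit(X_tr)              # zero mean, unit variance
    clf = SVC(C=10).fit(scaler.transform(X_tr), y_tr)

    y_hat = clf.predict(scaler.transform(X_val))
    print(balanced_accuracy_score(y_val, y_hat))
    print(confusion_matrix(y_val, y_hat))

    # dump the fitted objects so they can be reloaded later for inference
    joblib.dump({"scaler": scaler, "model": clf}, "svm_drums.joblib")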
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now, all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection5 does both tasks. As explained in section 2.1.1, there are many methods for onset detection and each of them is better suited to a different application. In the case of drums we tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase; it works pretty well, but when the tempo increases some onsets are not correctly detected, so we finally implemented the onset detection with the HFC method.
4. https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5. https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
This method computes, for each window, the HFC as in equation 4.4; note that high-frequency bins (index $k$) weigh more in the final value of the HFC:
$$\mathrm{HFC}(n) = \sum_k |X_k[n]|^2 \cdot k \qquad (4.4)$$
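Written directly in NumPy, equation 4.4 for a single analysis window is just a frequency-weighted sum of the squared magnitude spectrum (a sketch, not the Essentia internals):

    import numpy as np

    def hfc(mag_spectrum):
        # HFC of one frame: sum_k |X_k|^2 * k (equation 4.4)
        k = np.arange(len(mag_spectrum))
        return float(np.sum(np.abs(mag_spectrum) ** 2 * k))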
Moreover, the function plots the audio waveform jointly with the detected onsets, to check after each test that the detection has worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
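A minimal version of this detection stage, following the standard Essentia recipe (the input file name and the frame sizes are assumptions):

    import numpy as np
    import essentia.standard as es

    audio = es.MonoLoader(filename="submission.wav")()   # hypothetical input file

    od = es.OnsetDetection(method="hfc")
    w, fft, c2p = es.Windowing(type="hann"), es.FFT(), es.CartesianToPolar()

    curve = []
    for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
        mag, phase = c2p(fft(w(frame)))
        curve.append(od(mag, phase))

    # peak-pick the detection curve to obtain onset times in seconds
    onsets = es.Onsets()(np.array([curve], dtype=np.float32), [1])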
With the onsets information, the audio can be trimmed into the different events; the order is maintained in the file names, so the events can be mapped easily when comparing with the expected ones. The audio slices are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a .csv file.
To predict which event each slice contains, the already trained models are loaded in this new environment and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row in the data. The described process is implemented in the first part of Assessment.ipynb6; the second part is intended to execute the visualization functions described in the next section.
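In code, this inference step reduces to reloading the dumped objects and reusing them; the file names continue the hypothetical ones from the training sketch, and the feature columns are assumed to match those used at training time:

    import joblib
    import pandas as pd

    bundle = joblib.load("svm_drums.joblib")
    slices = pd.read_csv("submission_features.csv")   # one row per detected event
    events = bundle["model"].predict(bundle["scaler"].transform(slices))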
4.3 Music performance assessment
Finally, as already commented, the assessment part has been focused on giving the student visual feedback on the interpretation. As the drums classifier took so much time, creating a dataset of interpretations with their grades was not feasible. A first approximation was to record different interpretations of the same music sheet, simulating different levels of skill, but grading them and doing the whole process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and be consistent with the proposed grades.
So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. Taking as a base implementation the solution of my colleague Eduard Vergés7, and thanks to the help of Vsevolod Eremenko8, the visualization is done in the last cell of the notebook Assessment.ipynb.
First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.
6. https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7. https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8. https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (these come from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio. In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected ones; when the difference between the two values increases, the area between them is colored with a traffic-light code (from green, good, to red, bad).
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
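A stripped-down re-creation of this plot with matplotlib could look as follows; the function mirrors save_bar_plot(), but its signature and the exact colour mapping are assumptions:

    import numpy as np
    import matplotlib.pyplot as plt

    def deviation_plot(wave, sr, detected, expected, path):
        t = np.arange(len(wave)) / sr
        fig, ax = plt.subplots(figsize=(12, 2))
        ax.plot(t, wave, color="0.8")
        for d, e in zip(detected, expected):
            ax.axvline(d, ymax=0.15, color="black")            # detected onset, bottom
            ax.axvline(e, ymin=0.45, ymax=0.55, color="cyan")  # expected onset, middle
            bad = min(1.0, abs(d - e) * 4)                     # 0 = on time, 1 = far off
            ax.axvspan(min(d, e), max(d, e), alpha=0.4, color=(bad, 1 - bad, 0))
        fig.savefig(path)
        plt.close(fig)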
Once the waveform plot is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond, the assessment of the notes has to be done. The function assess_notes()10 receives the expected and the predicted events; by comparing them, a list is created with 1 at the indices where they match and 0 where they do not. The resulting list is then iterated and the 0 indices are re-checked, because most of the classification errors miss only one of the instruments of the event (ie instead of hh+sd the model predicts sd). These cases are considered partially correct, as the system has to take its own errors into account: at the indices where one of the instruments is correctly predicted and it is not a hi-hat (we consider it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct, and red means incorrect.
9. https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10. https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15 Example of coloured notes
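The scoring logic just described fits in a few lines; this sketch assumes events are encoded as '+'-joined instrument labels, matching the class names used earlier (hh+sd, kd+sd, ...):

    def assess_notes_sketch(expected, predicted):
        # 1.0 = correct, 0.75 = a non-hi-hat instrument matches, 0.0 = incorrect
        scores = []
        for exp, pred in zip(expected, predicted):
            if pred == exp:
                scores.append(1.0)
            else:
                shared = set(exp.split("+")) & set(pred.split("+"))
                scores.append(0.75 if shared - {"hh"} else 0.0)
        return scores

    print(assess_notes_sketch(["hh+sd", "hh+kd", "hh"], ["sd", "hh", "hh"]))
    # [0.75, 0.0, 1.0]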
With the waveform, the notes assessed and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On the one hand, before each note on the staff, the keywords color() and size() determine that the color and size of the note depend on an external variable (the notes assessed); on the other hand, after the first note of the staff, the keyword eps(1/150, 16) indicates on which beat the waveform starts to be displayed and on which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars), while the other number is the scale of the waveform and allows the plot to fit better with the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps; these are found in the .txt file already mentioned in section 3.1.1. Then, the LilyPond file: this is the template, written in LilyPond language, that defines the resulting music sheet; the macros to change color and size and to add the waveform are defined in it. When extracting the musical features, each submission creates its own .csv file to store the information. And finally, we of course need the audio files with the recorded submission to be assessed.
11. https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn with, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and increasing in 40 bpm steps up to 220 bpm. The recordings with good tempo and good reading have been processed by repeatedly adding 6 dB of gain, up to an accumulated +30 dB.
In this chapter and Appendix B all the resulting feedback visualizations are shown. The audio files can be listened to on Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the High Frequency Content method works perfectly, there are no limitations or errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positives.
1. https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops as the tempo increases. Taking as a reference the figures that show a good reading, in which all notes should be green or light green (ie Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many notes are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one 0.5, and an incorrect one 0; the total value is the mean of the weighted predictions.
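For instance, the 60 bpm row of Table 3 works out as:

    correct, partial, wrong = 25, 7, 0          # note counts at 60 bpm
    total = (1.0 * correct + 0.5 * partial) / (correct + partial + wrong)
    print(round(total, 2))                      # 0.89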
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that increasing the tempo of exercise 1 decreases the accuracy of the classifier. This may be because increasing the tempo reduces the spacing between events, and consequently the duration of each event, which leaves fewer values for computing the mean and standard deviation when extracting the timbre characteristics. As stated by the Law of Large Numbers [25], the larger the sample, the closer its mean is to the population mean; in this case, having fewer values in the computation produces more outliers, and the distribution tends to scatter.
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48
Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But 100 bpm in 12/8 corresponds to 300 eighth notes per minute, similar to 140 bpm in 4/4, which corresponds to 280 eighth notes per minute. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875
Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figure 23 and Figure 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that the resulting high-frequency content is predicted as a hi-hat in all cases.
Level    Correct   Partially OK   Incorrect   Total
+0dB     25        7              0           0.89
+6dB     23        9              0           0.86
+12dB    23        9              0           0.86
+18dB    24        7              1           0.86
+24dB    18        5              9           0.64
+30dB    13        5              14          0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0dB     12        7              13          0.48
+6dB     13        10             9           0.56
+12dB    10        8              14          0.5
+18dB    9         2              21          0.31
+24dB    8         0              24          0.25
+30dB    9         0              23          0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
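The gain conditions themselves are easy to reproduce: a gain of d dB multiplies the signal by 10^(d/20), and anything beyond full scale clips. A sketch of that processing (not necessarily how the test files were generated):

    import numpy as np

    def add_gain_db(audio, db):
        # amplify by `db` decibels and hard-clip to [-1, 1]
        return np.clip(audio * 10 ** (db / 20.0), -1.0, 1.0)

    # five successive +6 dB passes reach the +30 dB condition of Tables 5 and 6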
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of results has been focused on the accuracy of the drums event classifier, but we think it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the corresponding onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
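One simple way to implement such a mapping is to pair each expected onset, shifted by the one-bar count-in, with the nearest detected onset inside a tolerance window; the tolerance value below is an assumption:

    def map_onsets_sketch(detected, expected, count_in, tol=0.1):
        # pair each expected onset (seconds) with the closest detected one
        # within tol; None marks a missed beat
        pairs = []
        for e in expected:
            e += count_in
            near = [d for d in detected if abs(d - e) <= tol]
            pairs.append((e, min(near, key=lambda d: abs(d - e)) if near else None))
        return pairs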
To evaluate the assessment we proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
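Again, each total is just the fraction of correct assessments; for example, the 180 bpm row of Table 7:

    print(round(25 / 32, 2))   # 0.78 -> 25 correct assessments out of 32 events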
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68
Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9
Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system performs the assessment based on the predictions pretty well. This can be helpful for a student to know which parts of the music sheet are well read and which are not. Also, the tempo visualization can help the student recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the corresponding expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. This chapter discusses which objectives have been accomplished and which have not. A set of further improvements is also given, together with a final thought on my work and what I have learned. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained throughout this document, we can now list them, assessing their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of these interpretations is very unbalanced, as explained for the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on the context, such as the ones proposed in [16] and [17], might be needed.
It is important to take into account that the task the model is trained to do is very hard for a human: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on visualization rather than on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the 40k Samples Drums Dataset. It would be good to get recordings in different spaces, with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even with different people playing; this would introduce more variance into the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
1. https://www.2021.ieeeicassp.org
Also, in terms of the classes in the drumset, there is still a long path to cover. There are no solutions that robustly transcribe a whole set, including the toms and the different kinds of cymbals. We think a proper approach would be to work with professional musicians, who can help researchers better understand the instrument and create datasets covering different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained on graded drums exercises to give a mark to each student. Along this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
Regarding the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like MusicCritic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if its code and data are available and other researchers and students can execute them, obtaining the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.
The data created, ie the studio recordings, are available in a Zenodo repository3, and some samples in Freesound4. This is the 29kDrumsSamplesDataset: not all the 40k samples used for training are our property, so we cannot share them under our full authorship; despite this, the other datasets used in this project are available individually.
2. https://github.com/MaciAC/tfg_DrumsAssessment
3. https://zenodo.org/record/4923588#.YMRgNm4p7ow
4. https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although the work still falls short in terms of robustness and reliability. A first approximation has been presented, and several paths of improvement have been proposed.
Moreover, several fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also in investigating methods and gathering already existing experiments and results.
Regarding my relationship with computers, I have improved my fluency with git and its web counterpart GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (ie Essentia), which was a tough path to take. In a more advanced phase of the project I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that results from the system, and I hope more people get interested in this field of research so that better tools appear in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system, inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018)
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020)
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019)
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013)
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001)
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005)
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981)
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005)
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021)
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010)
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014)
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012)
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020)
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2020)
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019)
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017)
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012)
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2012)
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014)
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums: an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017)
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings – 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
[Engraved drum score, quarter note = 60; music engraving by LilyPond 2.18.2, www.lilypond.org]
Figure 31: Recording routine 1
[Engraved drum score, quarter note = 60; music engraving by LilyPond 2.18.2, www.lilypond.org]
Figure 32: Recording routine 2
[Engraved drum score, quarter note = 60; music engraving by LilyPond 2.18.2, www.lilypond.org]
Figure 33: Recording routine 3
Figure 34: Drumset configuration 1
Figure 35: Drumset configuration 2
Figure 36: Drumset configuration 3
Appendix B
Extra results
Figure 37: Good reading and bad tempo, Ex. 1, 60 bpm
Figure 38: Bad reading and bad tempo, Ex. 1, 60 bpm
Figure 39: Good reading and bad tempo, Ex. 1, 140 bpm
Figure 40: Bad reading and bad tempo, Ex. 1, 140 bpm
Figure 41: Good reading and bad tempo, Ex. 1, 180 bpm
Figure 42: Good reading and bad tempo, Ex. 1, 220 bpm
Figure 43: Bad reading and bad tempo, Ex. 1, 220 bpm
Figure 44: Good reading and bad tempo, Ex. 2, 60 bpm
Figure 45: Bad reading and bad tempo, Ex. 2, 60 bpm
Figure 46: Good reading and bad tempo, Ex. 2, 100 bpm
Figure 47: Bad reading and bad tempo, Ex. 2, 100 bpm
Figure 7: Confusion matrix after training with the dataset in Figure 4
Another widely used accuracy indicator for classification models is the f-score, which combines the precision and the recall of the model in one measure, as in Formula 4.3. Precision is the number of correct predictions divided by the total number of predictions made for a class, and recall is the number of correct predictions divided by the number of instances that actually belong to that class.
F_measure = 2 × (precision × recall) / (precision + recall)    (4.3)
Having these results led us to record a personalized dataset to extend the already existing ones (see Section 3.2.2). With this new distribution the results improved, as shown in Figure 8, as did the balanced accuracy and f-score (both 89%). Up to this point we were using both KNN and SVM models to compare results; as the SVM always performed at least 10% better, we decided to focus on the SVM and its hyper-parameter tuning.
Figure 8: Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
The C parameter in a support vector machine controls the regularization: regularization is intended to make a model less sensitive to noise in the data and to outliers that may not represent their class properly. In sklearn, C acts as an inverse regularization strength, so larger values let the model fit the training data more closely. When increasing this value to 10, the results improved across all the classes, as shown in Figure 9, as did the accuracy and f-score (both 95%).
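A minimal sketch of this KNN vs SVM comparison with scikit-learn (the csv name and "label" column are assumptions, not the notebook's actual code):

    # Sketch: compare KNN against SVM with C = 1 and C = 10 on the extracted features.
    import pandas as pd
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    df = pd.read_csv("drums_features.csv")          # assumed layout: feature columns plus "label"
    X = StandardScaler().fit_transform(df.drop(columns=["label"]))
    X_tr, X_va, y_tr, y_va = train_test_split(X, df["label"], test_size=0.2,
                                              stratify=df["label"], random_state=0)
    for name, model in [("KNN", KNeighborsClassifier()),
                        ("SVM C=1", SVC(C=1)), ("SVM C=10", SVC(C=10))]:
        pred = model.fit(X_tr, y_tr).predict(X_va)
        print(name, round(f1_score(y_va, pred, average="macro"), 3))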
At that point the accuracy of the model was pretty good, but the 88% on the snare drum class was a problem, as it is one of the most used instruments in the drumset, jointly with the hi-hat and the kick drum. So I tried the same process with the classes that include only the three mentioned instruments (i.e. hh, kd, sd, hh+kd, hh+sd, kd+sd and hh+kd+sd). Reducing the number of classes improved the overall accuracy and f-score to 97.7%, and concretely the sd accuracy to 96%, as shown in Figure 10.
Figure 9: Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
Figure 10: Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
The implementation of the iterative training and validating process has been developed in the Classifier_training.ipynb4 notebook. First, the csv files with the features extracted in Dataset_featureExtraction.ipynb are loaded; then, depending on which subset of classes will be used, the corresponding instances are filtered; and, to remove redundant features, the ones with a very low standard deviation (i.e. std_dev < 0.00001) are deleted. As the SVM works better when data is normalized, the standard scaler is used to center all the feature distributions around 0 and ensure a standard deviation of 1.
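As a hedged illustration of that filtering step (column names and csv layout are assumed):

    # Sketch: keep only the chosen class subset and drop near-constant features.
    import pandas as pd

    df = pd.read_csv("features.csv")
    subset = ["hh", "kd", "sd", "hh+kd", "hh+sd", "kd+sd", "hh+kd+sd"]
    df = df[df["label"].isin(subset)]
    feats = df.drop(columns=["label"])
    redundant = [c for c in feats.columns if feats[c].std() < 0.00001]  # carry no information
    df = df.drop(columns=redundant)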
In the next cells the dataset is split into train and validation sets and the fit method of the sklearn SVM is called to perform the training; once the models are trained, the parameters are dumped to a file so the model can be loaded a posteriori and the knowledge learned applied to new data. This process was very slow on my computer, so we decided to upload the csv files to Google Drive and open the notebook with Google Colaboratory, which was faster; this is key to avoid long waiting times during the iterative train-validate process. In the last cells the inference is made with the validation set, the accuracy is computed and the confusion matrix is plotted to get an idea of which classes perform better.
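A compact sketch of those cells (joblib is assumed for the model dump; names are illustrative, not the notebook's code):

    # Sketch: train, persist, reload, and validate the SVM.
    import joblib
    import pandas as pd
    from sklearn.metrics import balanced_accuracy_score, confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    df = pd.read_csv("features.csv")                 # as filtered above
    y = df["label"]
    X = StandardScaler().fit_transform(df.drop(columns=["label"]))

    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = SVC(C=10).fit(X_tr, y_tr)
    joblib.dump(clf, "svm_drums.joblib")             # dumped so it can be loaded a posteriori

    clf = joblib.load("svm_drums.joblib")            # e.g. later, in the assessment notebook
    pred = clf.predict(X_va)
    print(balanced_accuracy_score(y_va, pred))
    print(confusion_matrix(y_va, pred))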
4.2.3 Testing
Testing the model introduces the concept of onset detection: until now all the slices have been created using the annotations, but to assess a new submission from a student we need to detect the onsets and then slice the events. The function SliceDrums_BeatDetection5 does both tasks. As explained in Section 2.1.1, there are many methods for onset detection and each of them is better suited to a different application. In the case of drums we tested the 'complex' method, which finds changes in the frequency domain in terms of energy and phase and works pretty well; but when the tempo increases some onsets are not correctly detected, so we finally implemented the onset detection with the HFC method.

4 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Classifier_training.ipynb
5 https://github.com/MaciAC/tfg_DrumsAssessment/blob/9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869/scripts/drums.py#L45
This method computes, for each window, the HFC as in Equation 4.4; note that the high-frequency bins (large k index) weigh more in the final value of the HFC:
HFC(n) = Σ_k |X_k[n]|² · k    (4.4)
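A condensed sketch of HFC onset detection with Essentia's standard mode, following the pattern from the library's documentation (not the literal SliceDrums_BeatDetection code):

    # Sketch: build the HFC detection function frame by frame, then peak-pick the onsets.
    import essentia
    import essentia.standard as es

    audio = es.MonoLoader(filename="submission.wav")()   # 44.1 kHz mono by default
    od_hfc = es.OnsetDetection(method="hfc")             # 'complex' is the energy/phase alternative
    w, fft, c2p = es.Windowing(type="hann"), es.FFT(), es.CartesianToPolar()

    pool = essentia.Pool()
    for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
        mag, phase = c2p(fft(w(frame)))
        pool.add("features.hfc", od_hfc(mag, phase))

    # Onset times in seconds, from the detection function and a weight of 1
    onsets = es.Onsets()(essentia.array([pool["features.hfc"]]), [1])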
Moreover, the function plots the audio waveform jointly with the detected onsets, to check after each test that the detection worked correctly. In Figures 11 and 12 we can see two examples of the same music sheet played at 60 and 220 bpm; in both cases all the onsets are correctly detected and no false detection occurs.
Figure 11: Onsets detected in a 60 bpm drums interpretation
Figure 12: Onsets detected in a 220 bpm drums interpretation
With the onsets information, the audio can be trimmed into the different events; the order is kept in the file names, so when comparing with the expected events they can be mapped easily. The audio slices are passed to the extract_MusicalFeatures() function, which saves the musical features of each slice in a csv file.
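The trimming itself reduces to slicing between consecutive onsets; a minimal sketch, assuming onset times in seconds and float audio:

    # Sketch: event i spans from its onset to the next onset (or the end of the take).
    import numpy as np

    def slice_events(audio, onset_times, sr=44100):
        starts = (np.asarray(onset_times) * sr).astype(int)
        bounds = np.append(starts, len(audio))
        return [audio[s:e] for s, e in zip(bounds[:-1], bounds[1:])]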
To predict which event each slice contains, the already trained models are loaded in this new environment and the data is pre-processed using the same pipeline as when training. After that, the data is passed to the classifier method predict(), which returns the predicted event for each row in the data. The described process is implemented in the first part of Assessment.ipynb6; the second part executes the visualization functions described in the next section.
4.3 Music performance assessment
Finally, as already mentioned, the assessment part has been focused on giving the student visual feedback on the interpretation. As the drums classifier has taken so much time, creating a dataset of graded interpretations has not been feasible. A first approximation was to record different interpretations of the same music sheet simulating different skill levels, but grading them and doing the whole process by ourselves was not easy; apart from that, we tended to play the fragments either well or badly, and it was difficult to simulate intermediate levels and stay consistent with the proposed grades.
So the implemented solution generates an image that shows the student whether the notes of the music sheet are correctly read and whether the onsets are aligned with the expected ones.
4.3.1 Visualization
With the data gathered in the testing section, feedback on the interpretation has to be returned. The visualization is done in the last cell of the notebook Assessment.ipynb, taking as a base implementation the solution of my colleague Eduard Vergés7 and with the help of Vsevolod Eremenko8.
First, the LilyPond file paths are defined. Then, for each of the submissions, the audio is loaded to generate the waveform plot.

6 https://github.com/MaciAC/tfg_DrumsAssessment/blob/master/Assessment.ipynb
7 https://github.com/EduardVergesFranch/U151202_VA_FinalProject
8 https://github.com/seffka/ForMacia
To do so, the function save_bar_plot()9 is called, passing the lists of detected and expected onsets, the waveform, and the start and end of the waveform (these come from the LilyPond file's macro). To properly plot the deviations, the code assumes that the interpretation starts four beats after the beginning of the audio.
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is shown. The black lines at the bottom of the waveform are the detected onsets, while the cyan lines in the middle are the expected onsets; when the difference between the two values increases, the area between them is colored with a traffic-light code (green good, red bad).
Figure 13: Onset deviation plot of a good tempo submission
Figure 14: Onset deviation plot of a bad tempo submission
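An illustrative sketch of such a deviation plot (simplified; the real save_bar_plot() also handles the LilyPond coordinates and styling, and the 0.25 s tolerance here is an assumption):

    # Sketch: waveform with detected/expected onsets and traffic-light deviation shading.
    import numpy as np
    import matplotlib.pyplot as plt

    def plot_deviations(audio, sr, detected, expected, tol=0.25):
        t = np.arange(len(audio)) / sr
        plt.plot(t, audio, color="0.7")
        plt.vlines(detected, -1.0, -0.6, color="black")   # detected onsets, bottom
        plt.vlines(expected, -0.2, 0.2, color="cyan")     # expected onsets, middle
        cmap = plt.get_cmap("RdYlGn_r")                   # 0 -> green, 1 -> red
        for d, e in zip(detected, expected):
            dev = min(abs(d - e) / tol, 1.0)
            plt.axvspan(min(d, e), max(d, e), color=cmap(dev), alpha=0.5)
        plt.show()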
Once the waveform is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. The function assess_notes()10 takes the expected and predicted events; from a comparison, a list is created with 1 at the indices where they match and 0 where they do not. The resulting list is then iterated and the 0 indices are checked, because most of the classification errors miss only one of the instruments to be predicted (i.e. instead of hh+sd the model predicts sd). These cases are considered partially correct, as the system has to take its own errors into account: at the indices where one of the instruments is correctly predicted and it is not a hi-hat (we are
considering it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is set to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct and red means incorrect.

9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
Figure 15: Example of coloured notes
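A sketch of this scoring logic, assuming events are encoded as '+'-joined labels (e.g. "hh+sd"); the real assess_notes() may differ:

    # Sketch: 1.0 exact match, 0.75 partial (a non-hi-hat instrument was right), 0.0 otherwise.
    def assess_notes(expected, predicted):
        scores = []
        for exp, pred in zip(expected, predicted):
            if pred == exp:
                scores.append(1.0)                         # green
            else:
                shared = set(exp.split("+")) & set(pred.split("+"))
                if shared - {"hh"}:                        # snare or kick correctly predicted
                    scores.append(0.75)                    # light green
                else:
                    scores.append(0.0)                     # red
        return scores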
With the waveform, the notes assessed and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On one hand, before each note on the staff, the keywords color() and size() determine that the color and size of the note depend on an external variable (the notes assessed); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform display starts and on which it ends, in this case from 0 to 16 (in a 4/4 rhythm, 4 bars), while the other number is the scale of the waveform and allows fitting the plot to the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps, found in the txt file already mentioned in Section 3.1.1. Then the LilyPond file: the template, written in the LilyPond language, that defines the resulting music sheet; the macros to change color and size and to add the waveform are defined there. When extracting the musical features, each submission creates its own csv file to store the information. And finally we need, of course, the audio files with the recorded submission to be assessed.

11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn with, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and increasing by 40 bpm up to 220 bpm. The recordings with good tempo and good reading have been processed adding 6 dB at a time, up to an accumulated +30 dB.
In this chapter and Appendix B all the resulting feedback visualizations are shown. The audio files can be listened to on Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high-frequency content method works flawlessly on these recordings, there are no limitations or errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positives.
1 https://freesound.org/people/MaciaAC/packs/32350
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops as the tempo increases. Taking as a reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many notes are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one weighs 0.5 and an incorrect one 0; the total value is the mean of the weighted predictions.
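As a worked example of this scoring, the 60 bpm row of Table 3 gives (25 × 1.0 + 7 × 0.5 + 0 × 0) / 32 = 28.5 / 32 ≈ 0.89.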
Figure 16: Good reading and good tempo, Ex. 1, 60 bpm
Figure 17: Good reading and good tempo, Ex. 1, 100 bpm
Figure 18: Good reading and good tempo, Ex. 1, 140 bpm
In Table 3 we can see that, by increasing the tempo of exercise 1, the accuracy of the classifier decreases. This may be because increasing the tempo reduces the spacing between events, and consequently the duration of each event, which leaves fewer values for calculating the mean and standard deviation when extracting the timbre characteristics. As stated in the law of large numbers [25], the larger the sample, the closer its mean is to the population mean; having fewer values in the calculation creates more outliers in the distribution, which tends to scatter.

Figure 19: Good reading and good tempo, Ex. 1, 180 bpm
Figure 20: Good reading and good tempo, Ex. 1, 220 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48

Table 3: Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But 100 bpm in 12/8 corresponds to 300 eighth notes per minute, similar to 140 bpm in 4/4, whose eighth-note rate is 280 per minute. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are predicted better.
Figure 21: Good reading and good tempo, Ex. 2, 60 bpm
Figure 22: Good reading and good tempo, Ex. 2, 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875

Table 4: Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at that gain level the overall waveform is so clipped that it produces high-frequency content which is predicted as a hi-hat in all cases.
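The gain staircase used in these tests can be reproduced in a couple of lines (a sketch, assuming float audio in [-1, 1]):

    # Sketch: add db of gain and clip to full scale, as a saturating recorder would.
    import numpy as np

    def add_gain_db(audio, db):
        return np.clip(audio * 10 ** (db / 20.0), -1.0, 1.0)  # clipping creates the HF content noted above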
Level   Correct   Partially OK   Incorrect   Total
+0dB    25        7              0           0.89
+6dB    23        9              0           0.86
+12dB   23        9              0           0.86
+18dB   24        7              1           0.86
+24dB   18        5              9           0.64
+30dB   13        5              14          0.48

Table 5: Results of exercise 1 at 60 bpm with different amplification levels
Level   Correct   Partially OK   Incorrect   Total
+0dB    12        7              13          0.48
+6dB    13        10             9           0.56
+12dB   10        8              14          0.5
+18dB   9         2              21          0.31
+24dB   8         0              24          0.25
+30dB   9         0              23          0.28

Table 6: Results of exercise 1 at 220 bpm with different amplification levels
Figure 23: Good reading and good tempo, Ex. 1, 60 bpm, accumulating +6 dB at each new staff
Figure 24: Good reading and good tempo, Ex. 1, 220 bpm, accumulating +6 dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of results has been focused on the accuracy of the drums event classifier, but we think it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, even if the student does not play the first beat or skips some of the beats, the system can still map the rest of the events to the expected ones at the corresponding onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a one-bar count-in and that the rest of the beats must come after this interval.
Figure 25: Bad reading and bad tempo, Ex. 1, 100 bpm
Figure 26: Bad reading and bad tempo, Ex. 1, 180 bpm
To evaluate the assessment we proceed as in the previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results will be the 'bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27: Bad reading and good tempo, Ex. 1, starting at 60 bpm and adding 60 bpm at each new staff
Figure 28: Bad reading and good tempo, Ex. 2, 60 bpm
Figure 29: Bad reading and good tempo, Ex. 2, 100 bpm
Tables 7 and 8 summarize the counting, which works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
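For instance, in the 180 bpm row of Table 7, 25 correct assessments out of 32 events give 25/32 ≈ 0.78.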
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7: Assessment results of a bad reading with different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8: Assessment results of a bad reading with different tempos, 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system performs the assessment based on the predictions quite well. This can help a student to know which parts of the music sheet are read well and which are not. The tempo visualization can also help students recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the corresponding expected onsets.
Figure 30: Good reading and bad tempo, Ex. 1, 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. This chapter discusses which objectives have been accomplished and which have not. A set of further improvements is also given, together with a final thought on my work and my apprenticeship. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
With all the concepts explained along this document in mind, we can now list them, stating their completeness and our contributions.
Firstly, the created 29k Samples Drums Dataset is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of such interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on
classifying drums events in this manner. As the results have shown, more complex methods based on the context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained on is very hard for a human: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they could be improved, as they are mostly focused on visualization rather than on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the 29k Samples Drums Dataset. It would be good to get recordings in different spaces, with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even having different people play. This would introduce more variance into the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org
Also, in terms of the classes in the drumset, there is still a long path to cover. There are no solutions that robustly transcribe a whole drum set, including the toms and the different kinds of cymbals. Here we think a proper approach would be to work with professional musicians, who can help researchers better understand the instrument and create datasets covering different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained on assessed drums exercises to give each student a mark. On this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like MusicCritic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if its code and data are available and other researchers or students can execute them, getting the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as are the data used to test the system and the classification models.
The data created, i.e. the studio recordings, is available in a Zenodo repository3, with some samples on Freesound4. This is the 29kDrumsSamplesDataset; not all the 40k samples used for training are our property, so we cannot share them under our full authorship, but the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although the work still falls short in terms of robustness and reliability. A first approximation has been presented, and several paths of improvement have been proposed.
Moreover, several fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also by investigating methods and gathering already existing experiments and results.
Regarding my relationship with computers, I have improved my fluency with git and its web counterpart, GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which was a tough path to take. In a more advanced phase of the project I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebook on a Colaboratory machine. Developing code in this environment has its own quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization the system produces, and I hope more people get interested in this field of research so that better tools arrive in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
42 Drums event classifier 29
Figure 8 Confusion matrix after training with the dataset in Figure 5 and parameterC = 1
The C parameter in a support vector machine refers to the regularization this
technique is intended to make a model less sensitive to the data noise and the
outliers that may not represent the class properly When increasing this value to
10 the results improved among all the classes as shown in Figure 9 as well as the
accuracy and f-score (both 95)
At that point the accuracy of the model was pretty good but the 88 on the snare
drum class was somehow a problem as is one of the most used instruments in the
drumset jointly with the hi-hat and the kick drum So I tried the same process
with the classes that include only the three mentioned instruments (ie hh kd sd
hh+kd hh+sd kd+sd and hh+kd+sd) Reducing the number of classes improved
the overall accuracy and f-score to 977 and concretely the sd accuracy to 96 as
shown in Figure 10
30 Chapter 4 Methodology
Figure 9 Confusion matrix after training with the dataset in Figure 5 and parameterC = 10
Figure 10 Confusion matrix after training with the dataset in Figure 5 and param-eter C = 10 but only hh sd and kd classes
42 Drums event classifier 31
The implementation of the training and validating iterative process has been de-
veloped in the Classifier_trainingipynb4 notebook First loading the csv files
with the features extracted in Dataset_featureExtractionipynb then depend-
ing on which subset of classes will be used the correspondent instances and filtered
and to remove redundant features the ones with a very low standard deviation are
deleted (ie std_dev lt 000001) As the SVM works better when data is normalized
the standard scaler is used to move all the data distributions around 0 and ensuring
a standard deviation of 1
In the next cells the dataset is split into train and validation sets and the training
method from the SVM of sklearn is called to perform the training when the models
are trained the parameters are dumped in a file to load the model a posteriori and
be able to apply the knowledge learned to new data This process was so slow on
my computer so we decided to upload the csv files to Google Drive and open the
notebook with Google Collaboratory as it was faster and is a key feature to avoid
long waiting times during the iterative train-validate process In the last cells the
inference is made with the validation set and the accuracy is computed as well as
the confusion matrix plotted to get an idea of which classes are performing better
423 Testing
Testing the model introduces the concept of onset detection until now all the slices
have been created using the annotations but to assess a new submission from
a student we need to detect the onsets and then slice the events The function
SliceDrums_BeatDetection5 does both tasks as explained in section 211 there
are many methods to do onset detection and each of them is better for a different
application In the case of drums we have tested the rsquocomplexrsquo method which finds
changes in the frequency domain in terms of energy and phase and works pretty
well but when the tempo increase there are some onsets that are not correctly de-
tected for this reason we finally implemented the onset detection with the HFC4httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterClassifier_
trainingipynb5httpsgithubcomMaciACtfg_DrumsAssessmentblob9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869
scriptsdrumspyL45
32 Chapter 4 Methodology
method This method computes for each window the HFC as in equation 44 note
that high-frequency bins (k index) weights more in the final value of the HFC
HFC(n) =sumk
|Xk[n]|2lowastk (44)
Moreover the function plots the audio waveform jointly with the onsets detected to
check if it has worked correctly after each test In Figures 11 and 12 we can see two
examples of the same music sheet played at 60 and 220 bpm in both cases all the
onsets are correctly detected and no false detection occurs
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
With the onsets information the audio can be trimmed in the different events the
order is maintained with the name of the file so when comparing with the expected
events can be mapped easily The audios are passed to the extract_MusicalFeatures()
function that saves the musical features of each slice in a csv
43 Music performance assessment 33
To predict which event is each slice the models already trained are loaded in this new
environment and the data is pre-processed using the same pipeline as when training
After that data is passed to the classifier method predict() which returns for each
row in the data the predicted event The described process is implemented in the first
part of Assessmentipynb6 the second part is intended to execute the visualization
functions described in the next section
43 Music performance assessment
Finally as already commented the assessment part has been focused on giving visual
feedback of the interpretation to the student As the drums classifier has taken so
much time the creation of a dataset with interpretations and its grades has not been
feasible A first approximation was to record different interpretations of the same
music sheet simulating different levels of skills but grading it and doing all the
process by ourselves was not easy apart from that we tended to play the fragments
good or bad it was difficult to simulate intermediate levels and be consistent with
the proposed ones
So the implemented solution generates an image that shows to the student if the
notes of the music sheet are correctly read and if the onsets are aligned with the
expected ones
431 Visualization
With the data gathered in the testing section feedback of the interpretation has
to be returned Having as a base implementation the solution of my companion
Eduard Vergeacutes7 and thanks to the help of Vsevolod Eremenko8 in the last cell of
the notebook Assessmentipynb the visualization is done
First the LilyPond file paths are defined Then for each of the submissions the
audio is loaded to generate the waveform plot
6httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterAssessmentipynb7httpsgithubcomEduardVergesFranchU151202_VA_FinalProject8httpsgithubcomseffkaForMacia
34 Chapter 4 Methodology
To do so the function save_bar_plot()9 is called passing the lists of detected and
expected onsets the waveform and the start and end of the waveform (this comes
from the lilypond filersquos macro) To properly plot the deviations in the code we are
assuming that the interpretation starts four beats after the beginning of the audio
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is
shown The black lines in the bottom of the waveform are the detected onsets while
the cyan lines in the middle are the expected onsets when the difference between
the two values increase the area between them is colored with a traffic light code
(green good to red bad)
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
Once the waveform is created it is embedded in a lambda function that is called from
the LillyPond render But before calling the LillyPond to render the assessment of
the notes has to be done In function assess_notes()10 the expected and predicted
events are passed with a comparison a list of booleans is created but with 0 in the
False and 1 in the True index then the resulting list is iterated and the 0 indices
are checked because most of the classification errors fail in one of the instruments
to be predicted (ie instead of hh+sd it is predicting sd) These cases are considered
partially correct as the system has to take into account its errors the indices in
which one of the instruments is correctly predicted and it is not a hi-hat (we are
9httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsvisualizationpyL112
10httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsdrumspyL88
43 Music performance assessment 35
considering it more important to get right the snare and kick reading than a hi-hat
which is present in all the events) the value is turned to 075 (light green in the color
scale) In Figure 15 the different feedback options are shown green notes mean
correct light green means partially correct and red means incorrect
Figure 15 Example of coloured notes
With the waveform the notes assessed and the LilyPond template the function
score_image()11 can be called This function renders the LilyPond template jointly
with the waveform previously created this is done with the LilyPond macros
On one hand before each note on the staff the keyword color() size()
determines that the color and size of the note depends on an external variable (the
notes assessed) and in the other hand after the first note of the staff the keyword
eps(1150 16) indicates on which beat starts to display the waveform and
on which ends in this case from 0 to 16 in a 44 rhythm is 4 bars and the other
number is the scale of the waveform and allows to fit better the plot with the staff
432 Files used
The assessment process of an exercise needs several files first the annotations of the
expected events and their timesteps this is found in the txt file that has been already
mentioned in the 311 section then the LilyPond file this one is the template write
in LilyPond language that defines the resultant music sheet the macros to change
color and size and to add the waveform are defined when extracting the musical
features each submission creates its csv file to store the information and finally
we need of course the audio files with the submission recorded to be assessed
11httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsvisualizationpyL187
Chapter 5
Results
At this point the system has been developed and the classifier trained so we can do
an evaluation of the results to check if it works correctly and is useful to a student to
learn also to test which are the limits regarding the audio signal quality and tempo
The tests have been done with two different exercises recorded with a computer
microphone and played at a different tempo starting at 60 bpm and adding 40
bpm until 220 bpm The recordings with good tempo and good reading have been
processed adding 6dB until an accumulate of +30 dB
In this chapter and Appendix B all the resultant feedback visualizations are shown
The audio files can be listened in Freesound a pack1 has been created Some of
them will be commented on and referenced in further sections the rest are extra
results
As the High frequency content method works perfectly there are no limitations nor
errors in terms of onset detection all the tests have an f-measure of 1 detecting all
the expected events without detecting any false positive
1httpsfreesoundorgpeopleMaciaACpacks32350
36
51 Tempo limitations 37
51 Tempo limitations
One of the limitations of the system is the tempo of the exercise the accuracy drops
when the tempo increases Having as a reference the Figures that show good reading
which all notes should be in green or light green (ie Figures 16 17 18 19 20
21 and 22) we can count how many are correct or partially correct to punctuate
each case a correct prediction weights 10 a partially correct weights 05 and an
incorrect 0 the total value is the mean of the weighted result of the predictions
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that by increasing the tempo of exercise 1 the accuracy of the
classifier decreases it may be because increasing the tempo decreases the spacing
between events and consequently the duration of each event which leads to fewer
38 Chapter 5 Results
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to calculate the mean and standard deviation when extracting the timbre
characteristics As stated in the Law of Large numbers [25] the larger the sample
the closer the mean is to the total population mean In this case having fewer values
in the calculation creates more outliers in the distribution which tends to scatter
Tempo Correct Partially OK Incorrect Total60 25 7 0 089100 24 8 0 0875140 24 7 1 086180 15 9 8 061220 12 7 13 048
Table 3 Results of exercise 1 with different tempos
51 Tempo limitations 39
Regarding the 128 exercise (Figures 21 and 22) we were not able to record faster
than 100 bpm But in 128 the quarter notes equivalent in tempo is 300 quarter
note per minute similarly to the 140 bpm on 44 which quarter note per minute
temporsquos is 280 The results on 128 (Table 4) are also better because there are more
rsquoonly hi hatrsquo events which are better predicted
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
40 Chapter 5 Results
Tempo Correct Partially OK Incorrect Total60 39 8 1 089100 37 10 1 0875
Table 4 Results of exercise 2 with different tempos
52 Saturation limitations
Another limitation to the system is the saturation of the signal submitted Hearing
the submissions the hi-hat events are recorded with less amplitude than the snare
and kick events for this reason we think that the classifier starts to fail at +18dB
As can be seen in Tables 5 and 6 the same counting scheme as in the previous section
is done with Figure 23 and Figure 24 The hi-hat is the last waveform to saturate
and at this gain level the overall waveform is so clipped leading to a high-frequency
content that is predicted as a hi-hat in all the cases
Level Correct Partially OK Incorrect Total+0dB 25 7 0 089+6dB 23 9 0 086+12dB 23 9 0 086+18dB 24 7 1 086+24dB 18 5 9 064+30dB 13 5 14 048
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level Correct Partially OK Incorrect Total+0dB 12 7 13 048+6dB 13 10 9 056+12dB 10 8 14 05+18dB 9 2 21 031+24dB 8 0 24 025+30dB 9 0 23 028
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
52 Saturation limitations 41
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB ateach new staff
42 Chapter 5 Results
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB ateach new staff
53 Evaluation of the assessment 43
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier
accuracy but we think that is also important to evaluate if the system can assess
properly a studentrsquos submission
As shown in Figures 25 and 26 if the student does not play the first beat or some of
the beats are not read the system can map the rest of the events to the expected in
the correspondent onset time step This is due to a checking done in the assessment
which assumes that before starting the first beat there is a count-back of one bar
and the rest of the beats have to be after this interval
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections counting the
number of correct predictions but now in terms of assessment The analyzed results
will be the rsquoBad reading good temporsquo ones shown in Figures 27 28 and 29
44 Chapter 5 Results
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained to do is very hard for a human: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they could be improved, as they are mostly focused on visualization rather than on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and test takes of these exercises have been recorded in a different environment from the one used for the created dataset. It would be desirable to get recordings in different spaces, with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even having different people play. This would introduce more variance into the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 2021¹. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.

Regarding the classification model, it is clear that it needs improvements to ensure overall system robustness. It would be appropriate to introduce the aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org
Also, in terms of drumset classes, there is still a long path to cover. There are no solutions that robustly transcribe a whole kit, including the toms and the different kinds of cymbals. We think that a proper approach would be to work with professional musicians, which would help researchers better understand the instrument and create datasets covering different techniques.
With respect to the assessment step, apart from the feedback visualization of tempo deviations and reading accuracy, a regression model could be trained on graded drums exercises in order to give a mark to each student; a sketch of this idea follows. On this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
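Assuming a set of graded submissions that does not exist yet, and hypothetical per-submission features, such a model could be as simple as:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    # One row per submission (features are hypothetical): mean absolute onset
    # deviation, fraction of correctly read events, fraction of partially correct.
    X = np.array([[0.02, 0.95, 0.04],
                  [0.10, 0.60, 0.25],
                  [0.05, 0.80, 0.15]])
    y = np.array([9.0, 4.5, 7.0])      # marks given by a teacher (toy values)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = Ridge(alpha=1.0).fit(X_train, y_train)
    print(model.predict(X_test))       # predicted mark for the held-out submission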
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if its code and data are available and other researchers and students can execute them, obtaining the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository², as well as the data used to test the system and the classification models.

The data created, i.e. the studio recordings, are available in a Zenodo repository³ and some samples in Freesound⁴. This is the 29kDrumsSamplesDataset: not all of the 40k samples used for training are our property, so we cannot share them under our full authorship; despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished. Nevertheless, work remains in terms of robustness and reliability; still, a first approximation has been presented, and several paths of improvement have been proposed.

Moreover, several fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning, not only in terms of implementation but also in researching methods and gathering already existing experiments and results.
Regarding my relationship with computers, I have improved my fluency with git and its web counterpart, GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (e.g. Essentia), which was a tough path to take. In a more advanced phase of the project I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a lot of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that the system produces, and I hope that more people get interested in this field of research so that better tools appear in the future.
List of Figures

1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes 30
11 Onsets detected in a 60 bpm drums interpretation 32
12 Onsets detected in a 220 bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm, accumulating +6 dB at each new staff 41
24 Good reading and good tempo Ex 1 220 bpm, accumulating +6 dB at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1, starts at 60 bpm and adds 60 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables

1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography

[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).

[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).

[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).

[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).

[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).

[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).

[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).

[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).

[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).

[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).

[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).

[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).

[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).

[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).

[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).

[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).

[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).

[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).

[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).

[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).

[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).

[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).

[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings – 14th International Society for Music Information Retrieval Conference (2013).

[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).

[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).

[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
(Drum score, quarter note = 60; music engraving by LilyPond 2.18.2, www.lilypond.org)
Figure 31 Recording routine 1
(Drum score, quarter note = 60; music engraving by LilyPond 2.18.2, www.lilypond.org)
Figure 32 Recording routine 2
(Drum score, quarter note = 60; music engraving by LilyPond 2.18.2, www.lilypond.org)
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
53 Evaluation of the assessment 43
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier
accuracy but we think that is also important to evaluate if the system can assess
properly a studentrsquos submission
As shown in Figures 25 and 26 if the student does not play the first beat or some of
the beats are not read the system can map the rest of the events to the expected in
the correspondent onset time step This is due to a checking done in the assessment
which assumes that before starting the first beat there is a count-back of one bar
and the rest of the beats have to be after this interval
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections counting the
number of correct predictions but now in terms of assessment The analyzed results
will be the rsquoBad reading good temporsquo ones shown in Figures 27 28 and 29
44 Chapter 5 Results
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset One of the reasons for using
this approach to predict the events was that there was no literature focused on
47
48 Chapter 6 Discussion and conclusions
classifying drums events in this manner As the results have shown more complex
methods based on the context might be used as the ones proposed in [16] and [17]
It is important to take into account that the task that the model is trained to do
is very hard for a human being able to differentiate drums events in an individual
drum sample without any context is almost impossible even for a trained ear as my
drums teacher or mine
Thirdly a review of the different music sheet technologies has been done as well as
the development of a MusicXML parser This part took around one month to be
developed and from my point of view it was a great way to understand how these
file formats work and how can be improved as they are majorly focused on the
visualization not the symbolic representation of events and timesteps
Finally two exercises in different time signatures have been proposed to demonstrate
the functionality of the system As well as tests of these exercises have been recorded
in a different environment than the 30k Samples Drums Dataset It would be fine
to get recordings in different spaces and with different drumsets and microphones
to test more exhaustively the system
62 Further work
In terms of the dataset created it could be larger It could be expanded with
different drumsets tuning differently each drumset using different sticks to hit the
instruments and even different people playing This could introduce more variance
in the drums sample dataset Moreover on June 9th 2021 a paper about a large
drums datasets with MIDI data was presented [26] in the ICASSP 20211 This new
dataset could be included in the training process as the authors state that having a
large-scale dataset improves the results of the existing models
Regarding the classification model it is clear that needs improvements to ensure the
overall system robustness It would be appropriate to introduce the aforementioned
methods in [16] [17] and [26] in the ADT part of the pipeline
1httpswww2021ieeeicassporg
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion developing this project has been a nice way to close my
Bachelorrsquos degree as I reviewed some of the concepts of more personal interest
And being able to relate the project with music and drums helped me to keep
my motivation and focus I am quite satisfied with the feedback visualization that
results of the system and I hope that more people get interested in this field of
research to get better tools in the future
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
42 Drums event classifier 31
The implementation of the training and validating iterative process has been de-
veloped in the Classifier_trainingipynb4 notebook First loading the csv files
with the features extracted in Dataset_featureExtractionipynb then depend-
ing on which subset of classes will be used the correspondent instances and filtered
and to remove redundant features the ones with a very low standard deviation are
deleted (ie std_dev lt 000001) As the SVM works better when data is normalized
the standard scaler is used to move all the data distributions around 0 and ensuring
a standard deviation of 1
In the next cells the dataset is split into train and validation sets and the training
method from the SVM of sklearn is called to perform the training when the models
are trained the parameters are dumped in a file to load the model a posteriori and
be able to apply the knowledge learned to new data This process was so slow on
my computer so we decided to upload the csv files to Google Drive and open the
notebook with Google Collaboratory as it was faster and is a key feature to avoid
long waiting times during the iterative train-validate process In the last cells the
inference is made with the validation set and the accuracy is computed as well as
the confusion matrix plotted to get an idea of which classes are performing better
423 Testing
Testing the model introduces the concept of onset detection until now all the slices
have been created using the annotations but to assess a new submission from
a student we need to detect the onsets and then slice the events The function
SliceDrums_BeatDetection5 does both tasks as explained in section 211 there
are many methods to do onset detection and each of them is better for a different
application In the case of drums we have tested the rsquocomplexrsquo method which finds
changes in the frequency domain in terms of energy and phase and works pretty
well but when the tempo increase there are some onsets that are not correctly de-
tected for this reason we finally implemented the onset detection with the HFC4httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterClassifier_
trainingipynb5httpsgithubcomMaciACtfg_DrumsAssessmentblob9422e71a998d3cd0a6c7f03e92a8b0c6f6dac869
scriptsdrumspyL45
32 Chapter 4 Methodology
method This method computes for each window the HFC as in equation 44 note
that high-frequency bins (k index) weights more in the final value of the HFC
HFC(n) =sumk
|Xk[n]|2lowastk (44)
Moreover the function plots the audio waveform jointly with the onsets detected to
check if it has worked correctly after each test In Figures 11 and 12 we can see two
examples of the same music sheet played at 60 and 220 bpm in both cases all the
onsets are correctly detected and no false detection occurs
Figure 11 Onsets detected in a 60bpm drums interpretation
Figure 12 Onsets detected in a 220bpm drums interpretation
With the onsets information the audio can be trimmed in the different events the
order is maintained with the name of the file so when comparing with the expected
events can be mapped easily The audios are passed to the extract_MusicalFeatures()
function that saves the musical features of each slice in a csv
43 Music performance assessment 33
To predict which event is each slice the models already trained are loaded in this new
environment and the data is pre-processed using the same pipeline as when training
After that data is passed to the classifier method predict() which returns for each
row in the data the predicted event The described process is implemented in the first
part of Assessmentipynb6 the second part is intended to execute the visualization
functions described in the next section
43 Music performance assessment
Finally as already commented the assessment part has been focused on giving visual
feedback of the interpretation to the student As the drums classifier has taken so
much time the creation of a dataset with interpretations and its grades has not been
feasible A first approximation was to record different interpretations of the same
music sheet simulating different levels of skills but grading it and doing all the
process by ourselves was not easy apart from that we tended to play the fragments
good or bad it was difficult to simulate intermediate levels and be consistent with
the proposed ones
So the implemented solution generates an image that shows to the student if the
notes of the music sheet are correctly read and if the onsets are aligned with the
expected ones
431 Visualization
With the data gathered in the testing section feedback of the interpretation has
to be returned Having as a base implementation the solution of my companion
Eduard Vergeacutes7 and thanks to the help of Vsevolod Eremenko8 in the last cell of
the notebook Assessmentipynb the visualization is done
First the LilyPond file paths are defined Then for each of the submissions the
audio is loaded to generate the waveform plot
6httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterAssessmentipynb7httpsgithubcomEduardVergesFranchU151202_VA_FinalProject8httpsgithubcomseffkaForMacia
34 Chapter 4 Methodology
To do so the function save_bar_plot()9 is called passing the lists of detected and
expected onsets the waveform and the start and end of the waveform (this comes
from the lilypond filersquos macro) To properly plot the deviations in the code we are
assuming that the interpretation starts four beats after the beginning of the audio
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is
shown The black lines in the bottom of the waveform are the detected onsets while
the cyan lines in the middle are the expected onsets when the difference between
the two values increase the area between them is colored with a traffic light code
(green good to red bad)
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
Once the waveform is created it is embedded in a lambda function that is called from
the LillyPond render But before calling the LillyPond to render the assessment of
the notes has to be done In function assess_notes()10 the expected and predicted
events are passed with a comparison a list of booleans is created but with 0 in the
False and 1 in the True index then the resulting list is iterated and the 0 indices
are checked because most of the classification errors fail in one of the instruments
to be predicted (ie instead of hh+sd it is predicting sd) These cases are considered
partially correct as the system has to take into account its errors the indices in
which one of the instruments is correctly predicted and it is not a hi-hat (we are
9httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsvisualizationpyL112
10httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsdrumspyL88
43 Music performance assessment 35
considering it more important to get right the snare and kick reading than a hi-hat
which is present in all the events) the value is turned to 075 (light green in the color
scale) In Figure 15 the different feedback options are shown green notes mean
correct light green means partially correct and red means incorrect
Figure 15 Example of coloured notes
With the waveform the notes assessed and the LilyPond template the function
score_image()11 can be called This function renders the LilyPond template jointly
with the waveform previously created this is done with the LilyPond macros
On one hand before each note on the staff the keyword color() size()
determines that the color and size of the note depends on an external variable (the
notes assessed) and in the other hand after the first note of the staff the keyword
eps(1150 16) indicates on which beat starts to display the waveform and
on which ends in this case from 0 to 16 in a 44 rhythm is 4 bars and the other
number is the scale of the waveform and allows to fit better the plot with the staff
432 Files used
The assessment process of an exercise needs several files first the annotations of the
expected events and their timesteps this is found in the txt file that has been already
mentioned in the 311 section then the LilyPond file this one is the template write
in LilyPond language that defines the resultant music sheet the macros to change
color and size and to add the waveform are defined when extracting the musical
features each submission creates its csv file to store the information and finally
we need of course the audio files with the submission recorded to be assessed
11httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsvisualizationpyL187
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until 220 bpm. The recordings with good tempo and good reading have been processed by repeatedly adding 6 dB, up to an accumulated +30 dB.
In this chapter and Appendix B all the resulting feedback visualizations are shown. The audio files can be listened to on Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high frequency content method works perfectly, there are no limitations or errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positive.
1 https://freesound.org/people/MaciaAC/packs/32350
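For reference, the onset f-measure can be computed as below; the 50 ms matching tolerance is an assumption (a common choice in onset evaluation), not necessarily the one used in these tests.

    def onset_f_measure(expected, detected, tol=0.05):
        """F-measure of onset detection: a detected onset is a true positive
        if it lies within +/- tol seconds of a still unmatched expected onset."""
        matched, used = 0, set()
        for det in sorted(detected):
            for i, exp in enumerate(sorted(expected)):
                if i not in used and abs(det - exp) <= tol:
                    matched += 1
                    used.add(i)
                    break
        precision = matched / len(detected) if detected else 0.0
        recall = matched / len(expected) if expected else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)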
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops as the tempo increases. Taking as a reference the figures that show good readings, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one 0.5 and an incorrect one 0; the total value is the mean of the weighted predictions.
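This weighting reduces to a one-line computation; as a sketch:

    def exercise_score(correct, partial, incorrect):
        """Weighted accuracy: correct = 1.0, partially correct = 0.5, incorrect = 0."""
        total = correct + partial + incorrect
        return (correct + 0.5 * partial) / total

    # e.g. exercise 1 at 60 bpm (Table 3): 25 correct, 7 partial, 0 incorrect
    print(round(exercise_score(25, 7, 0), 2))  # 0.89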
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that the accuracy of the classifier decreases as the tempo of exercise 1 increases. This may be because increasing the tempo decreases the spacing between events, and consequently the duration of each event (e.g. at 220 bpm consecutive eighth notes are only 60/220/2 ≈ 0.136 s apart), which leads to fewer values with which to calculate the mean and standard deviation when extracting the timbre characteristics. As stated in the law of large numbers [25], the larger the sample, the closer its mean is to the population mean; in this case, having fewer values in the calculation creates more outliers in the distribution, which tends to scatter.
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48
Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But in 12/8 at 100 bpm the equivalent pulse is 300 eighth notes per minute, similar to 140 bpm in 4/4, whose eighth-note rate is 280. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875
Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figure 23 and Figure 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that it produces a high-frequency content that is predicted as a hi-hat in all cases.
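The gain staging used for these tests can be simulated in a few lines; a sketch assuming the audio is normalized to [-1, 1]:

    import numpy as np

    def amplify_and_clip(x, gain_db):
        """Apply a gain in dB and hard-clip to [-1, 1], as when a recording saturates."""
        return np.clip(x * 10 ** (gain_db / 20.0), -1.0, 1.0)

    # e.g. the +6 dB to +30 dB versions of a recording, in 6 dB steps
    # versions = [amplify_and_clip(audio, g) for g in range(6, 31, 6)]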
Level    Correct   Partially OK   Incorrect   Total
+0dB     25        7              0           0.89
+6dB     23        9              0           0.86
+12dB    23        9              0           0.86
+18dB    24        7              1           0.86
+24dB    18        5              9           0.64
+30dB    13        5              14          0.48
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0dB     12        7              13          0.48
+6dB     13        10             9           0.56
+12dB    10        8              14          0.5
+18dB    9         2              21          0.31
+24dB    8         0              24          0.25
+30dB    9         0              23          0.28
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of results has been focused on the accuracy of the drums event classifier, but we think it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the corresponding onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
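A sketch of this kind of robust mapping, assuming each expected onset accepts the closest detection within a tolerance window (the actual check also uses the one-bar count-in to anchor the first beat):

    def map_onsets_to_expected(detected, expected, tol=0.25):
        """Assign each expected onset the closest detected onset within tol
        seconds; missed beats map to None instead of shifting later events."""
        mapping = []
        for exp in expected:
            near = [d for d in detected if abs(d - exp) <= tol]
            mapping.append(min(near, key=lambda d: abs(d - exp)) if near else None)
        return mapping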
To evaluate the assessment we proceed as in the previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a correct assessment if the note is green or light green and the event is the one in the music score, or if the note is red and the event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
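This rule can be expressed compactly; a sketch with assumed encodings (note colors as strings, plus a boolean per event telling whether the performed event matches the score):

    def assessment_accuracy(colors, matches_score):
        """Fraction of events where the feedback agrees with reality:
        green/light green should coincide with a correct reading, red with a wrong one."""
        correct = sum((c in ("green", "light_green")) == m
                      for c, m in zip(colors, matches_score))
        return correct / len(colors)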
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68
Table 7 Assessment result of a bad reading with different tempos 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9
Table 8 Assessment result of a bad reading with different tempos 12/8 exercise
We can see that, for a controlled environment and low tempos, the system performs the assessment based on the predictions quite well. This can help a student know which parts of the music sheet are well read and which are not. The tempo visualization can also help students recognize whether they are slowing down or rushing while reading the score: as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the corresponding expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. A set of further improvements is also given, together with a final thought on my work and what I have learned. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Having in mind all the concepts explained throughout this document, we can now list them, stating their completeness and our contributions.
Firstly, the 29kSamples Drums Dataset created in this project is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of these interpretations is very unbalanced, as explained for the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on the context might be needed, such as the ones proposed in [16] and [17]. It is important to take into account that the task that the model is trained to do is very hard for a human: differentiating drums events in an individual drum sample, without any context, is almost impossible even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they could be improved, as they are mostly focused on the visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment than the 40kSamples Drums Dataset. It would be desirable to get recordings in different spaces and with different drumsets and microphones to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments and even different people playing. This would introduce more variance in the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org
Also, in terms of the classes in the drumset, there is still a long way to go. There are no solutions that robustly transcribe a whole drum set, including the toms and the different kinds of cymbals. Here we think a proper approach would be to work with professional musicians, who can help researchers better understand the instrument and create datasets covering different playing techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained on assessed drums exercises to give each student a mark. On this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and other researchers and students can execute them, getting the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, are available in a Zenodo repository3 and some samples in Freesound4. This is the 29kDrumsSamplesDataset: not all the 40k samples used for training are our property, so we are not able to share them under our full authorship; despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although work remains in terms of robustness and reliability. A first approximation has been presented, and several paths of improvement have been proposed.
Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning, not only in terms of implementation but also by investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web counterpart GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which was a tough path to take. In a more advanced phase of the project I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebook on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that results from the system, and I hope that more people get interested in this field of research so that better tools appear in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 4/4 exercise 46
8 Assessment result of a bad reading with different tempos 12/8 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018)
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020)
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019)
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013)
[5] Wachsmann, K. J., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001)
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005)
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981)
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005)
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021)
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010)
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014)
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2001)
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020)
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015)
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019)
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017)
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012)
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max/MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013)
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014)
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums: an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017)
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006)
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013)
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2013)
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005)
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008)
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021, IEEE International Conference on Acoustics, Speech and Signal Processing (2021)
Appendix A
Studio recording media
Figure 31 Recording routine 1
Figure 32 Recording routine 2
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is
shown The black lines in the bottom of the waveform are the detected onsets while
the cyan lines in the middle are the expected onsets when the difference between
the two values increase the area between them is colored with a traffic light code
(green good to red bad)
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
Once the waveform is created it is embedded in a lambda function that is called from
the LillyPond render But before calling the LillyPond to render the assessment of
the notes has to be done In function assess_notes()10 the expected and predicted
events are passed with a comparison a list of booleans is created but with 0 in the
False and 1 in the True index then the resulting list is iterated and the 0 indices
are checked because most of the classification errors fail in one of the instruments
to be predicted (ie instead of hh+sd it is predicting sd) These cases are considered
partially correct as the system has to take into account its errors the indices in
which one of the instruments is correctly predicted and it is not a hi-hat (we are
9httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsvisualizationpyL112
10httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsdrumspyL88
43 Music performance assessment 35
considering it more important to get right the snare and kick reading than a hi-hat
which is present in all the events) the value is turned to 075 (light green in the color
scale) In Figure 15 the different feedback options are shown green notes mean
correct light green means partially correct and red means incorrect
Figure 15 Example of coloured notes
With the waveform the notes assessed and the LilyPond template the function
score_image()11 can be called This function renders the LilyPond template jointly
with the waveform previously created this is done with the LilyPond macros
On one hand before each note on the staff the keyword color() size()
determines that the color and size of the note depends on an external variable (the
notes assessed) and in the other hand after the first note of the staff the keyword
eps(1150 16) indicates on which beat starts to display the waveform and
on which ends in this case from 0 to 16 in a 44 rhythm is 4 bars and the other
number is the scale of the waveform and allows to fit better the plot with the staff
432 Files used
The assessment process of an exercise needs several files first the annotations of the
expected events and their timesteps this is found in the txt file that has been already
mentioned in the 311 section then the LilyPond file this one is the template write
in LilyPond language that defines the resultant music sheet the macros to change
color and size and to add the waveform are defined when extracting the musical
features each submission creates its csv file to store the information and finally
we need of course the audio files with the submission recorded to be assessed
11httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsvisualizationpyL187
Chapter 5
Results
At this point the system has been developed and the classifier trained so we can do
an evaluation of the results to check if it works correctly and is useful to a student to
learn also to test which are the limits regarding the audio signal quality and tempo
The tests have been done with two different exercises recorded with a computer
microphone and played at a different tempo starting at 60 bpm and adding 40
bpm until 220 bpm The recordings with good tempo and good reading have been
processed adding 6dB until an accumulate of +30 dB
In this chapter and Appendix B all the resultant feedback visualizations are shown
The audio files can be listened in Freesound a pack1 has been created Some of
them will be commented on and referenced in further sections the rest are extra
results
As the High frequency content method works perfectly there are no limitations nor
errors in terms of onset detection all the tests have an f-measure of 1 detecting all
the expected events without detecting any false positive
1httpsfreesoundorgpeopleMaciaACpacks32350
36
51 Tempo limitations 37
51 Tempo limitations
One of the limitations of the system is the tempo of the exercise the accuracy drops
when the tempo increases Having as a reference the Figures that show good reading
which all notes should be in green or light green (ie Figures 16 17 18 19 20
21 and 22) we can count how many are correct or partially correct to punctuate
each case a correct prediction weights 10 a partially correct weights 05 and an
incorrect 0 the total value is the mean of the weighted result of the predictions
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that by increasing the tempo of exercise 1 the accuracy of the
classifier decreases it may be because increasing the tempo decreases the spacing
between events and consequently the duration of each event which leads to fewer
38 Chapter 5 Results
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to calculate the mean and standard deviation when extracting the timbre
characteristics As stated in the Law of Large numbers [25] the larger the sample
the closer the mean is to the total population mean In this case having fewer values
in the calculation creates more outliers in the distribution which tends to scatter
Tempo Correct Partially OK Incorrect Total60 25 7 0 089100 24 8 0 0875140 24 7 1 086180 15 9 8 061220 12 7 13 048
Table 3 Results of exercise 1 with different tempos
51 Tempo limitations 39
Regarding the 128 exercise (Figures 21 and 22) we were not able to record faster
than 100 bpm But in 128 the quarter notes equivalent in tempo is 300 quarter
note per minute similarly to the 140 bpm on 44 which quarter note per minute
temporsquos is 280 The results on 128 (Table 4) are also better because there are more
rsquoonly hi hatrsquo events which are better predicted
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
40 Chapter 5 Results
Tempo Correct Partially OK Incorrect Total60 39 8 1 089100 37 10 1 0875
Table 4 Results of exercise 2 with different tempos
52 Saturation limitations
Another limitation to the system is the saturation of the signal submitted Hearing
the submissions the hi-hat events are recorded with less amplitude than the snare
and kick events for this reason we think that the classifier starts to fail at +18dB
As can be seen in Tables 5 and 6 the same counting scheme as in the previous section
is done with Figure 23 and Figure 24 The hi-hat is the last waveform to saturate
and at this gain level the overall waveform is so clipped leading to a high-frequency
content that is predicted as a hi-hat in all the cases
Level Correct Partially OK Incorrect Total+0dB 25 7 0 089+6dB 23 9 0 086+12dB 23 9 0 086+18dB 24 7 1 086+24dB 18 5 9 064+30dB 13 5 14 048
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level Correct Partially OK Incorrect Total+0dB 12 7 13 048+6dB 13 10 9 056+12dB 10 8 14 05+18dB 9 2 21 031+24dB 8 0 24 025+30dB 9 0 23 028
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
52 Saturation limitations 41
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB ateach new staff
42 Chapter 5 Results
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB ateach new staff
53 Evaluation of the assessment 43
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier
accuracy but we think that is also important to evaluate if the system can assess
properly a studentrsquos submission
As shown in Figures 25 and 26 if the student does not play the first beat or some of
the beats are not read the system can map the rest of the events to the expected in
the correspondent onset time step This is due to a checking done in the assessment
which assumes that before starting the first beat there is a count-back of one bar
and the rest of the beats have to be after this interval
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections counting the
number of correct predictions but now in terms of assessment The analyzed results
will be the rsquoBad reading good temporsquo ones shown in Figures 27 28 and 29
44 Chapter 5 Results
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset One of the reasons for using
this approach to predict the events was that there was no literature focused on
47
48 Chapter 6 Discussion and conclusions
classifying drums events in this manner As the results have shown more complex
methods based on the context might be used as the ones proposed in [16] and [17]
It is important to take into account that the task that the model is trained to do
is very hard for a human being able to differentiate drums events in an individual
drum sample without any context is almost impossible even for a trained ear as my
drums teacher or mine
Thirdly a review of the different music sheet technologies has been done as well as
the development of a MusicXML parser This part took around one month to be
developed and from my point of view it was a great way to understand how these
file formats work and how can be improved as they are majorly focused on the
visualization not the symbolic representation of events and timesteps
Finally two exercises in different time signatures have been proposed to demonstrate
the functionality of the system As well as tests of these exercises have been recorded
in a different environment than the 30k Samples Drums Dataset It would be fine
to get recordings in different spaces and with different drumsets and microphones
to test more exhaustively the system
62 Further work
In terms of the dataset created it could be larger It could be expanded with
different drumsets tuning differently each drumset using different sticks to hit the
instruments and even different people playing This could introduce more variance
in the drums sample dataset Moreover on June 9th 2021 a paper about a large
drums datasets with MIDI data was presented [26] in the ICASSP 20211 This new
dataset could be included in the training process as the authors state that having a
large-scale dataset improves the results of the existing models
Regarding the classification model it is clear that needs improvements to ensure the
overall system robustness It would be appropriate to introduce the aforementioned
methods in [16] [17] and [26] in the ADT part of the pipeline
1httpswww2021ieeeicassporg
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion developing this project has been a nice way to close my
Bachelorrsquos degree as I reviewed some of the concepts of more personal interest
And being able to relate the project with music and drums helped me to keep
my motivation and focus I am quite satisfied with the feedback visualization that
results of the system and I hope that more people get interested in this field of
research to get better tools in the future
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
43 Music performance assessment 33
To predict which event is each slice the models already trained are loaded in this new
environment and the data is pre-processed using the same pipeline as when training
After that data is passed to the classifier method predict() which returns for each
row in the data the predicted event The described process is implemented in the first
part of Assessmentipynb6 the second part is intended to execute the visualization
functions described in the next section
43 Music performance assessment
Finally as already commented the assessment part has been focused on giving visual
feedback of the interpretation to the student As the drums classifier has taken so
much time the creation of a dataset with interpretations and its grades has not been
feasible A first approximation was to record different interpretations of the same
music sheet simulating different levels of skills but grading it and doing all the
process by ourselves was not easy apart from that we tended to play the fragments
good or bad it was difficult to simulate intermediate levels and be consistent with
the proposed ones
So the implemented solution generates an image that shows to the student if the
notes of the music sheet are correctly read and if the onsets are aligned with the
expected ones
431 Visualization
With the data gathered in the testing section feedback of the interpretation has
to be returned Having as a base implementation the solution of my companion
Eduard Vergeacutes7 and thanks to the help of Vsevolod Eremenko8 in the last cell of
the notebook Assessmentipynb the visualization is done
First the LilyPond file paths are defined Then for each of the submissions the
audio is loaded to generate the waveform plot
6httpsgithubcomMaciACtfg_DrumsAssessmentblobmasterAssessmentipynb7httpsgithubcomEduardVergesFranchU151202_VA_FinalProject8httpsgithubcomseffkaForMacia
34 Chapter 4 Methodology
To do so the function save_bar_plot()9 is called passing the lists of detected and
expected onsets the waveform and the start and end of the waveform (this comes
from the lilypond filersquos macro) To properly plot the deviations in the code we are
assuming that the interpretation starts four beats after the beginning of the audio
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is
shown The black lines in the bottom of the waveform are the detected onsets while
the cyan lines in the middle are the expected onsets when the difference between
the two values increase the area between them is colored with a traffic light code
(green good to red bad)
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
Once the waveform is created, it is embedded in a lambda function that is called from the LilyPond render. But before calling LilyPond to render, the assessment of the notes has to be done. The function assess_notes()10 receives the expected and predicted events; comparing them yields a list with 1 at the indices where they match and 0 where they differ. The resulting list is then iterated and the 0 indices are checked, because most of the classification errors miss one of the instruments to be predicted (i.e. instead of hh+sd the system predicts sd). These cases are considered partially correct, as the system has to take its own errors into account: at the indices where one of the instruments is correctly predicted and it is not a hi-hat (we consider it more important to get the snare and kick reading right than a hi-hat, which is present in all the events), the value is turned to 0.75 (light green in the color scale). In Figure 15 the different feedback options are shown: green notes mean correct, light green means partially correct and red means incorrect.
9 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L112
10 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/drums.py#L88
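The scoring rule can be summarized in a few lines of Python. This is only a sketch of the logic just described, assuming events are encoded as strings such as 'hh', 'sd' or 'hh+sd'; the real assess_notes() in drums.py may differ in its details:

    def assess_notes(expected, predicted):
        """Score each event: 1.0 correct, 0.75 partially correct, 0.0 incorrect."""
        scores = []
        for exp, pred in zip(expected, predicted):
            if pred == exp:
                scores.append(1.0)  # exact match: green
            elif (set(exp.split("+")) & set(pred.split("+"))) - {"hh"}:
                # A non-hi-hat instrument (snare or kick) is still predicted
                # correctly: partially correct, shown in light green.
                scores.append(0.75)
            else:
                scores.append(0.0)  # incorrect: red
        return scores

    # Hypothetical example: the second event keeps the snare but misses the hi-hat.
    print(assess_notes(["hh", "hh+sd", "hh+kd"], ["hh", "sd", "hh"]))
    # [1.0, 0.75, 0.0]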
Figure 15 Example of coloured notes
With the waveform, the notes assessed and the LilyPond template, the function score_image()11 can be called. This function renders the LilyPond template jointly with the previously created waveform; this is done with the LilyPond macros. On the one hand, before each note on the staff, the keyword color() size() determines that the color and size of the note depend on an external variable (the notes assessed); on the other hand, after the first note of the staff, the keyword eps(1150 16) indicates on which beat the waveform starts to be displayed and on which it ends (in this case from 0 to 16, which in a 4/4 rhythm is 4 bars), while the other number is the scale of the waveform and allows the plot to fit better with the staff.
4.3.2 Files used
The assessment process of an exercise needs several files. First, the annotations of the expected events and their timesteps; these are found in the txt file already mentioned in Section 3.1.1. Then the LilyPond file: this is the template, written in the LilyPond language, that defines the resulting music sheet; the macros to change color and size and to add the waveform are defined in it. When extracting the musical features, each submission creates its csv file to store the information. And finally we need, of course, the audio files with the recorded submission to be assessed. A possible layout of these files is sketched below.
11 https://github.com/MaciAC/tfg_DrumsAssessment/blob/2aaf0dbdd1f026dfebfba65eaac9fcd24a8629af/scripts/visualization.py#L187
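For concreteness, a hypothetical on-disk layout of the files just listed could look as follows; the names are illustrative only, not the repository's actual structure:

    exercise_1/
        annotations.txt        # expected events and their timesteps (Section 3.1.1)
        template.ly            # LilyPond template with the color/size and eps macros
        submissions/
            student_01.wav     # recorded submission to be assessed
            student_01.csv     # musical features extracted from that submission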
Chapter 5
Results
At this point the system has been developed and the classifier trained, so we can evaluate the results to check whether the system works correctly and is useful for a student to learn, and also to test its limits regarding audio signal quality and tempo. The tests have been done with two different exercises, recorded with a computer microphone and played at different tempos, starting at 60 bpm and adding 40 bpm until reaching 220 bpm. The recordings with good tempo and good reading have also been processed by repeatedly adding 6 dB, up to an accumulated +30 dB.
In this chapter and Appendix B all the resulting feedback visualizations are shown. The audio files can be listened to on Freesound, where a pack1 has been created. Some of them will be commented on and referenced in further sections; the rest are extra results.
As the high frequency content method works perfectly, there are no limitations or errors in terms of onset detection: all the tests have an f-measure of 1, detecting all the expected events without any false positives.
1 https://freesound.org/people/MaciaAC/packs/32350/
5.1 Tempo limitations
One of the limitations of the system is the tempo of the exercise: the accuracy drops as the tempo increases. Taking as a reference the figures that show a good reading, in which all notes should be green or light green (i.e. Figures 16, 17, 18, 19, 20, 21 and 22), we can count how many are correct or partially correct to score each case: a correct prediction weighs 1.0, a partially correct one 0.5 and an incorrect one 0; the total value is the mean of the weighted results of the predictions, as computed in the sketch below.
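In code, this scoring reduces to a one-line mean; the function name is mine, not the repository's:

    def weighted_score(correct, partial, incorrect):
        """Mean of the weighted predictions: 1.0, 0.5 and 0 respectively."""
        total = correct + partial + incorrect
        return (1.0 * correct + 0.5 * partial) / total

    # Row '60 bpm' of Table 3: 25 correct, 7 partially correct, 0 incorrect.
    print(round(weighted_score(25, 7, 0), 2))  # 0.89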
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
In Table 3 we can see that increasing the tempo of exercise 1 decreases the accuracy of the classifier. This may be because a higher tempo reduces the spacing between events, and consequently the duration of each event, which leaves fewer values for calculating the mean and standard deviation when extracting the timbre characteristics. As stated in the law of large numbers [25], the larger the sample, the closer its mean is to the population mean. In this case, having fewer values in the calculation creates more outliers in the distribution, which tends to scatter.
Tempo   Correct   Partially OK   Incorrect   Total
60      25        7              0           0.89
100     24        8              0           0.875
140     24        7              1           0.86
180     15        9              8           0.61
220     12        7              13          0.48

Table 3 Results of exercise 1 with different tempos
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But in 12/8 at 100 bpm the subdivision rate is 300 eighth notes per minute, similar to 140 bpm in 4/4, whose eighth-note rate is 280. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875

Table 4 Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at this gain level the overall waveform is so clipped that the resulting high-frequency content is predicted as a hi-hat in all cases.
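The amplified test signals can be reproduced with a simple gain-plus-hard-clipping step; a minimal sketch with NumPy, assuming floating-point audio normalized to [-1, 1]:

    import numpy as np

    def add_gain_with_clipping(x, gain_db):
        """Amplify the signal by gain_db decibels and hard-clip it to [-1, 1],
        as happens when a normalized recording is boosted past full scale."""
        gain = 10.0 ** (gain_db / 20.0)
        return np.clip(x * gain, -1.0, 1.0)

    # Accumulating +6 dB five times reaches the +30 dB of Tables 5 and 6:
    # audio = add_gain_with_clipping(audio, 6.0)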
Level    Correct   Partially OK   Incorrect   Total
+0dB     25        7              0           0.89
+6dB     23        9              0           0.86
+12dB    23        9              0           0.86
+18dB    24        7              1           0.86
+24dB    18        5              9           0.64
+30dB    13        5              14          0.48

Table 5 Results of exercise 1 at 60 bpm with different amplification levels

Level    Correct   Partially OK   Incorrect   Total
+0dB     12        7              13          0.48
+6dB     13        10             9           0.56
+12dB    10        8              14          0.5
+18dB    9         2              21          0.31
+24dB    8         0              24          0.25
+30dB    9         0              23          0.28

Table 6 Results of exercise 1 at 220 bpm with different amplification levels
Figure 23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
Figure 24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
5.3 Evaluation of the assessment
Until now the evaluation of results has been focused on the accuracy of the drums event classifier, but we think it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the corresponding onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar and that the rest of the beats have to come after this interval.
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we proceed as in previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'Bad reading, good tempo' ones, shown in Figures 27, 28 and 29.
Figure 27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
In Tables 7 and 8 the counting is summarized. It works as follows: we count a correct assessment if the note is green or light green and the played event is the one in the music score, or if the note is red and the played event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events; the rule is restated in code below.
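The correctness rule can be stated compactly; the function below is an illustrative reformulation, not code from the repository:

    def is_correct_assessment(note_color, played_matches_score):
        """A green or light-green note must correspond to a correctly played
        event; a red note must correspond to an incorrectly played one."""
        if note_color in ("green", "light green"):
            return played_matches_score
        return not played_matches_score  # red note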
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68

Table 7 Assessment result of a bad reading with different tempos, 4/4 exercise

Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9

Table 8 Assessment result of a bad reading with different tempos, 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system performs the assessment based on the predictions quite well. This can be helpful for a student to know which parts of the music sheet are read well and which are not. Also, the tempo visualization can help the student recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the corresponding expected onsets.
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been analyzed. In this chapter a discussion is developed about which objectives have been accomplished and which have not. Also, a set of further improvements is given, together with a final thought on my work and my learning. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
With all the concepts explained throughout this document in mind, we can now list them, stating their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly available, downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of such interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on the context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained to do is very hard for a human: differentiating drums events in an individual drum sample without any context is almost impossible, even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they could be improved, as they are mostly focused on visualization rather than on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and tests of these exercises have been recorded in a different environment from the one used for the recorded drums dataset. It would be good to get recordings in different spaces and with different drumsets and microphones in order to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even different people playing. This would introduce more variance into the drums sample dataset. Moreover, on June 9th, 2021, a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 20211. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it clearly needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
1 https://2021.ieeeicassp.org/
Also, in terms of the classes in the drumset, there is a long path still to cover. There are no solutions that robustly transcribe a whole set, including the toms and the different kinds of cymbals. In this regard, we think a proper approach would be to work with professional musicians, who can help researchers better understand the instrument and create datasets covering different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained on assessed drums exercises to give each student a mark. In this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and to develop a demo web app like MusicCritic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if its code and data are available and other researchers and students can execute them, obtaining the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository2, as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, are available in a Zenodo repository3, and some samples in Freesound4. This is the 29kDrumsSamplesDataset: not all of the 40k samples used for training are our property, so we are not able to share them under our full authorship; despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397/
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although the work still falls short in terms of robustness and reliability. A first approximation has been presented, and several paths of improvement have been proposed.
Moreover, several fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also in investigating methods and gathering already existing experiments and results.
About my relationship with computers, I have improved my fluency with git and its web counterpart, GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which was a tough path to take. In a more advanced phase of the project I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebook on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that the system produces, and I hope more people get interested in this field of research so that better tools appear in the future.
List of Figures

1 Datasets pre-processing
2 Sample drums score from music school drums grade 1
3 Microphone setup for drums recording
4 Number of samples before Train set recording
5 Number of samples after Train set recording
6 Proposed pipeline for a drums performance assessment system, inspired by [2]
7 Confusion matrix after training with the dataset in Figure 4
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
11 Onsets detected in a 60 bpm drums interpretation
12 Onsets detected in a 220 bpm drums interpretation
13 Onset deviation plot of a good tempo submission
14 Onset deviation plot of a bad tempo submission
15 Example of coloured notes
16 Good reading and good tempo Ex 1 60 bpm
17 Good reading and good tempo Ex 1 100 bpm
18 Good reading and good tempo Ex 1 140 bpm
19 Good reading and good tempo Ex 1 180 bpm
20 Good reading and good tempo Ex 1 220 bpm
21 Good reading and good tempo Ex 2 60 bpm
22 Good reading and good tempo Ex 2 100 bpm
23 Good reading and good tempo Ex 1 60 bpm, accumulating +6dB at each new staff
24 Good reading and good tempo Ex 1 220 bpm, accumulating +6dB at each new staff
25 Bad reading and bad tempo Ex 1 100 bpm
26 Bad reading and bad tempo Ex 1 180 bpm
27 Bad reading and good tempo Ex 1, starts on 60 bpm and adds 60 bpm at each new staff
28 Bad reading and good tempo Ex 2 60 bpm
29 Bad reading and good tempo Ex 2 100 bpm
30 Good reading and bad tempo Ex 1 100 bpm
31 Recording routine 1
32 Recording routine 2
33 Recording routine 3
34 Drumset configuration 1
35 Drumset configuration 2
36 Drumset configuration 3
37 Good reading and bad tempo Ex 1 60 bpm
38 Bad reading and bad tempo Ex 1 60 bpm
39 Good reading and bad tempo Ex 1 140 bpm
40 Bad reading and bad tempo Ex 1 140 bpm
41 Good reading and bad tempo Ex 1 180 bpm
42 Good reading and bad tempo Ex 1 220 bpm
43 Bad reading and bad tempo Ex 1 220 bpm
44 Good reading and bad tempo Ex 2 60 bpm
45 Bad reading and bad tempo Ex 2 60 bpm
46 Good reading and bad tempo Ex 2 100 bpm
47 Bad reading and bad tempo Ex 2 100 bpm

List of Tables

1 Abbreviations' legend
2 Microphones used
3 Results of exercise 1 with different tempos
4 Results of exercise 2 with different tempos
5 Results of exercise 1 at 60 bpm with different amplification levels
6 Results of exercise 1 at 220 bpm with different amplification levels
7 Assessment result of a bad reading with different tempos, 4/4 exercise
8 Assessment result of a bad reading with different tempos, 12/8 exercise
Bibliography

[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2003).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2001).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max/MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings – 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
[Engraved drum score (image), tempo 60 bpm. Music engraving by LilyPond 2.18.2 (www.lilypond.org)]
Figure 31 Recording routine 1
[Engraved drum score (image), tempo 60 bpm. Music engraving by LilyPond 2.18.2 (www.lilypond.org)]
Figure 32 Recording routine 2
[Engraved drum score (image), tempo 60 bpm. Music engraving by LilyPond 2.18.2 (www.lilypond.org)]
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
34 Chapter 4 Methodology
To do so the function save_bar_plot()9 is called passing the lists of detected and
expected onsets the waveform and the start and end of the waveform (this comes
from the lilypond filersquos macro) To properly plot the deviations in the code we are
assuming that the interpretation starts four beats after the beginning of the audio
In Figures 13 and 14 the result of save_bar_plot() for two different submissions is
shown The black lines in the bottom of the waveform are the detected onsets while
the cyan lines in the middle are the expected onsets when the difference between
the two values increase the area between them is colored with a traffic light code
(green good to red bad)
Figure 13 Onset deviation plot of a good tempo submission
Figure 14 Onset deviation plot of a bad tempo submission
Once the waveform is created it is embedded in a lambda function that is called from
the LillyPond render But before calling the LillyPond to render the assessment of
the notes has to be done In function assess_notes()10 the expected and predicted
events are passed with a comparison a list of booleans is created but with 0 in the
False and 1 in the True index then the resulting list is iterated and the 0 indices
are checked because most of the classification errors fail in one of the instruments
to be predicted (ie instead of hh+sd it is predicting sd) These cases are considered
partially correct as the system has to take into account its errors the indices in
which one of the instruments is correctly predicted and it is not a hi-hat (we are
9httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsvisualizationpyL112
10httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsdrumspyL88
43 Music performance assessment 35
considering it more important to get right the snare and kick reading than a hi-hat
which is present in all the events) the value is turned to 075 (light green in the color
scale) In Figure 15 the different feedback options are shown green notes mean
correct light green means partially correct and red means incorrect
Figure 15 Example of coloured notes
With the waveform the notes assessed and the LilyPond template the function
score_image()11 can be called This function renders the LilyPond template jointly
with the waveform previously created this is done with the LilyPond macros
On one hand before each note on the staff the keyword color() size()
determines that the color and size of the note depends on an external variable (the
notes assessed) and in the other hand after the first note of the staff the keyword
eps(1150 16) indicates on which beat starts to display the waveform and
on which ends in this case from 0 to 16 in a 44 rhythm is 4 bars and the other
number is the scale of the waveform and allows to fit better the plot with the staff
432 Files used
The assessment process of an exercise needs several files first the annotations of the
expected events and their timesteps this is found in the txt file that has been already
mentioned in the 311 section then the LilyPond file this one is the template write
in LilyPond language that defines the resultant music sheet the macros to change
color and size and to add the waveform are defined when extracting the musical
features each submission creates its csv file to store the information and finally
we need of course the audio files with the submission recorded to be assessed
11httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsvisualizationpyL187
Chapter 5
Results
At this point the system has been developed and the classifier trained so we can do
an evaluation of the results to check if it works correctly and is useful to a student to
learn also to test which are the limits regarding the audio signal quality and tempo
The tests have been done with two different exercises recorded with a computer
microphone and played at a different tempo starting at 60 bpm and adding 40
bpm until 220 bpm The recordings with good tempo and good reading have been
processed adding 6dB until an accumulate of +30 dB
In this chapter and Appendix B all the resultant feedback visualizations are shown
The audio files can be listened in Freesound a pack1 has been created Some of
them will be commented on and referenced in further sections the rest are extra
results
As the High frequency content method works perfectly there are no limitations nor
errors in terms of onset detection all the tests have an f-measure of 1 detecting all
the expected events without detecting any false positive
1httpsfreesoundorgpeopleMaciaACpacks32350
36
51 Tempo limitations 37
51 Tempo limitations
One of the limitations of the system is the tempo of the exercise the accuracy drops
when the tempo increases Having as a reference the Figures that show good reading
which all notes should be in green or light green (ie Figures 16 17 18 19 20
21 and 22) we can count how many are correct or partially correct to punctuate
each case a correct prediction weights 10 a partially correct weights 05 and an
incorrect 0 the total value is the mean of the weighted result of the predictions
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that by increasing the tempo of exercise 1 the accuracy of the
classifier decreases it may be because increasing the tempo decreases the spacing
between events and consequently the duration of each event which leads to fewer
38 Chapter 5 Results
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to calculate the mean and standard deviation when extracting the timbre
characteristics As stated in the Law of Large numbers [25] the larger the sample
the closer the mean is to the total population mean In this case having fewer values
in the calculation creates more outliers in the distribution which tends to scatter
Tempo Correct Partially OK Incorrect Total60 25 7 0 089100 24 8 0 0875140 24 7 1 086180 15 9 8 061220 12 7 13 048
Table 3 Results of exercise 1 with different tempos
51 Tempo limitations 39
Regarding the 128 exercise (Figures 21 and 22) we were not able to record faster
than 100 bpm But in 128 the quarter notes equivalent in tempo is 300 quarter
note per minute similarly to the 140 bpm on 44 which quarter note per minute
temporsquos is 280 The results on 128 (Table 4) are also better because there are more
rsquoonly hi hatrsquo events which are better predicted
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
40 Chapter 5 Results
Tempo Correct Partially OK Incorrect Total60 39 8 1 089100 37 10 1 0875
Table 4 Results of exercise 2 with different tempos
52 Saturation limitations
Another limitation to the system is the saturation of the signal submitted Hearing
the submissions the hi-hat events are recorded with less amplitude than the snare
and kick events for this reason we think that the classifier starts to fail at +18dB
As can be seen in Tables 5 and 6 the same counting scheme as in the previous section
is done with Figure 23 and Figure 24 The hi-hat is the last waveform to saturate
and at this gain level the overall waveform is so clipped leading to a high-frequency
content that is predicted as a hi-hat in all the cases
Level Correct Partially OK Incorrect Total+0dB 25 7 0 089+6dB 23 9 0 086+12dB 23 9 0 086+18dB 24 7 1 086+24dB 18 5 9 064+30dB 13 5 14 048
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level Correct Partially OK Incorrect Total+0dB 12 7 13 048+6dB 13 10 9 056+12dB 10 8 14 05+18dB 9 2 21 031+24dB 8 0 24 025+30dB 9 0 23 028
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
52 Saturation limitations 41
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB ateach new staff
42 Chapter 5 Results
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB ateach new staff
53 Evaluation of the assessment 43
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier
accuracy but we think that is also important to evaluate if the system can assess
properly a studentrsquos submission
As shown in Figures 25 and 26 if the student does not play the first beat or some of
the beats are not read the system can map the rest of the events to the expected in
the correspondent onset time step This is due to a checking done in the assessment
which assumes that before starting the first beat there is a count-back of one bar
and the rest of the beats have to be after this interval
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections counting the
number of correct predictions but now in terms of assessment The analyzed results
will be the rsquoBad reading good temporsquo ones shown in Figures 27 28 and 29
44 Chapter 5 Results
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset One of the reasons for using
this approach to predict the events was that there was no literature focused on
47
48 Chapter 6 Discussion and conclusions
classifying drums events in this manner As the results have shown more complex
methods based on the context might be used as the ones proposed in [16] and [17]
It is important to take into account that the task that the model is trained to do
is very hard for a human being able to differentiate drums events in an individual
drum sample without any context is almost impossible even for a trained ear as my
drums teacher or mine
Thirdly a review of the different music sheet technologies has been done as well as
the development of a MusicXML parser This part took around one month to be
developed and from my point of view it was a great way to understand how these
file formats work and how can be improved as they are majorly focused on the
visualization not the symbolic representation of events and timesteps
Finally two exercises in different time signatures have been proposed to demonstrate
the functionality of the system As well as tests of these exercises have been recorded
in a different environment than the 30k Samples Drums Dataset It would be fine
to get recordings in different spaces and with different drumsets and microphones
to test more exhaustively the system
62 Further work
In terms of the dataset created it could be larger It could be expanded with
different drumsets tuning differently each drumset using different sticks to hit the
instruments and even different people playing This could introduce more variance
in the drums sample dataset Moreover on June 9th 2021 a paper about a large
drums datasets with MIDI data was presented [26] in the ICASSP 20211 This new
dataset could be included in the training process as the authors state that having a
large-scale dataset improves the results of the existing models
Regarding the classification model it is clear that needs improvements to ensure the
overall system robustness It would be appropriate to introduce the aforementioned
methods in [16] [17] and [26] in the ADT part of the pipeline
1httpswww2021ieeeicassporg
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion developing this project has been a nice way to close my
Bachelorrsquos degree as I reviewed some of the concepts of more personal interest
And being able to relate the project with music and drums helped me to keep
my motivation and focus I am quite satisfied with the feedback visualization that
results of the system and I hope that more people get interested in this field of
research to get better tools in the future
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
43 Music performance assessment 35
considering it more important to get right the snare and kick reading than a hi-hat
which is present in all the events) the value is turned to 075 (light green in the color
scale) In Figure 15 the different feedback options are shown green notes mean
correct light green means partially correct and red means incorrect
Figure 15 Example of coloured notes
With the waveform the notes assessed and the LilyPond template the function
score_image()11 can be called This function renders the LilyPond template jointly
with the waveform previously created this is done with the LilyPond macros
On one hand before each note on the staff the keyword color() size()
determines that the color and size of the note depends on an external variable (the
notes assessed) and in the other hand after the first note of the staff the keyword
eps(1150 16) indicates on which beat starts to display the waveform and
on which ends in this case from 0 to 16 in a 44 rhythm is 4 bars and the other
number is the scale of the waveform and allows to fit better the plot with the staff
432 Files used
The assessment process of an exercise needs several files first the annotations of the
expected events and their timesteps this is found in the txt file that has been already
mentioned in the 311 section then the LilyPond file this one is the template write
in LilyPond language that defines the resultant music sheet the macros to change
color and size and to add the waveform are defined when extracting the musical
features each submission creates its csv file to store the information and finally
we need of course the audio files with the submission recorded to be assessed
11httpsgithubcomMaciACtfg_DrumsAssessmentblob2aaf0dbdd1f026dfebfba65eaac9fcd24a8629afscriptsvisualizationpyL187
Chapter 5
Results
At this point the system has been developed and the classifier trained so we can do
an evaluation of the results to check if it works correctly and is useful to a student to
learn also to test which are the limits regarding the audio signal quality and tempo
The tests have been done with two different exercises recorded with a computer
microphone and played at a different tempo starting at 60 bpm and adding 40
bpm until 220 bpm The recordings with good tempo and good reading have been
processed adding 6dB until an accumulate of +30 dB
In this chapter and Appendix B all the resultant feedback visualizations are shown
The audio files can be listened in Freesound a pack1 has been created Some of
them will be commented on and referenced in further sections the rest are extra
results
As the High frequency content method works perfectly there are no limitations nor
errors in terms of onset detection all the tests have an f-measure of 1 detecting all
the expected events without detecting any false positive
1httpsfreesoundorgpeopleMaciaACpacks32350
36
51 Tempo limitations 37
51 Tempo limitations
One of the limitations of the system is the tempo of the exercise the accuracy drops
when the tempo increases Having as a reference the Figures that show good reading
which all notes should be in green or light green (ie Figures 16 17 18 19 20
21 and 22) we can count how many are correct or partially correct to punctuate
each case a correct prediction weights 10 a partially correct weights 05 and an
incorrect 0 the total value is the mean of the weighted result of the predictions
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that by increasing the tempo of exercise 1 the accuracy of the
classifier decreases it may be because increasing the tempo decreases the spacing
between events and consequently the duration of each event which leads to fewer
38 Chapter 5 Results
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to calculate the mean and standard deviation when extracting the timbre
characteristics As stated in the Law of Large numbers [25] the larger the sample
the closer the mean is to the total population mean In this case having fewer values
in the calculation creates more outliers in the distribution which tends to scatter
Tempo Correct Partially OK Incorrect Total60 25 7 0 089100 24 8 0 0875140 24 7 1 086180 15 9 8 061220 12 7 13 048
Table 3 Results of exercise 1 with different tempos
51 Tempo limitations 39
Regarding the 12/8 exercise (Figures 21 and 22), we were not able to record faster than 100 bpm. But in 12/8 at 100 bpm the equivalent subdivision rate is 300 eighth notes per minute, similar to 140 bpm in 4/4, whose eighth-note rate is 280 per minute. The results in 12/8 (Table 4) are also better because there are more 'only hi-hat' events, which are better predicted.
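The equivalence can be checked with plain arithmetic, assuming the beat in 12/8 is a dotted quarter and the beat in 4/4 is a quarter:

    # In 12/8 the beat is a dotted quarter (3 eighth notes per beat);
    # in 4/4 it is a quarter (2 eighth notes per beat).
    eighths_12_8 = 100 * 3  # 100 bpm in 12/8 -> 300 eighth notes per minute
    eighths_4_4 = 140 * 2   # 140 bpm in 4/4  -> 280 eighth notes per minute
    print(eighths_12_8, eighths_4_4)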
Figure 21: Good reading and good tempo, Ex. 2, 60 bpm
Figure 22: Good reading and good tempo, Ex. 2, 100 bpm
Tempo   Correct   Partially OK   Incorrect   Total
60      39        8              1           0.89
100     37        10             1           0.875
Table 4: Results of exercise 2 with different tempos
5.2 Saturation limitations
Another limitation of the system is the saturation of the submitted signal. Listening to the submissions, the hi-hat events are recorded with less amplitude than the snare and kick events; for this reason, we think that the classifier starts to fail at +18 dB. As can be seen in Tables 5 and 6, the same counting scheme as in the previous section is applied to Figures 23 and 24. The hi-hat is the last waveform to saturate, and at that gain level the overall waveform is so clipped that the resulting high-frequency content is predicted as a hi-hat in all cases.
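A sketch of how such saturated test signals can be produced is shown below, under the assumption that the amplification is applied in the digital domain as a gain followed by hard clipping:

    import numpy as np

    def amplify_and_clip(x, gain_db):
        # Apply a gain given in dB and hard-clip to the [-1, 1] full-scale range.
        gain = 10.0 ** (gain_db / 20.0)
        return np.clip(x * gain, -1.0, 1.0)

    # A quiet hi-hat-like noise burst barely clips at +18 dB but clips
    # heavily at +30 dB (counts are approximate, the signal is random).
    x = 0.03 * np.random.default_rng(0).standard_normal(1024)
    for db in (0, 18, 30):
        y = amplify_and_clip(x, db)
        print(f"+{db} dB -> clipped samples: {int(np.sum(np.abs(y) >= 1.0))}")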
Level    Correct   Partially OK   Incorrect   Total
+0 dB    25        7              0           0.89
+6 dB    23        9              0           0.86
+12 dB   23        9              0           0.86
+18 dB   24        7              1           0.86
+24 dB   18        5              9           0.64
+30 dB   13        5              14          0.48
Table 5: Results of exercise 1 at 60 bpm with different amplification levels
Level    Correct   Partially OK   Incorrect   Total
+0 dB    12        7              13          0.48
+6 dB    13        10             9           0.56
+12 dB   10        8              14          0.5
+18 dB   9         2              21          0.31
+24 dB   8         0              24          0.25
+30 dB   9         0              23          0.28
Table 6: Results of exercise 1 at 220 bpm with different amplification levels
Figure 23: Good reading and good tempo, Ex. 1, 60 bpm, accumulating +6 dB at each new staff
Figure 24: Good reading and good tempo, Ex. 1, 220 bpm, accumulating +6 dB at each new staff
5.3 Evaluation of the assessment
Until now, the evaluation of results has focused on the accuracy of the drums event classifier, but we think it is also important to evaluate whether the system can properly assess a student's submission.
As shown in Figures 25 and 26, if the student does not play the first beat, or some of the beats are not read, the system can still map the rest of the events to the expected ones at the corresponding onset time steps. This is due to a check done in the assessment, which assumes that before the first beat there is a count-in of one bar and that the rest of the beats must come after this interval.
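A hedged sketch of this alignment idea is shown below; the function and the tolerance value are assumptions for illustration, not the project's exact implementation:

    def map_onsets(expected, detected, tolerance=0.1):
        # For each expected onset (in seconds, already offset by the one-bar
        # count-in), return the index of the nearest detected onset within
        # `tolerance` seconds, or None if that beat was not played.
        mapping = []
        for t in expected:
            i = min(range(len(detected)), key=lambda j: abs(detected[j] - t))
            mapping.append(i if abs(detected[i] - t) <= tolerance else None)
        return mapping

    # Beat 0 is missing in the performance; beats 1-3 still map correctly.
    print(map_onsets([0.0, 0.5, 1.0, 1.5], [0.52, 1.01, 1.49]))
    # -> [None, 0, 1, 2]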
Figure 25: Bad reading and bad tempo, Ex. 1, 100 bpm
Figure 26: Bad reading and bad tempo, Ex. 1, 180 bpm
To evaluate the assessment, we proceed as in the previous sections, counting the number of correct predictions, but now in terms of assessment. The analyzed results are the 'Bad reading, good tempo' ones shown in Figures 27, 28 and 29.
Figure 27: Bad reading and good tempo, Ex. 1, starting at 60 bpm and adding 60 bpm at each new staff
Figure 28: Bad reading and good tempo, Ex. 2, 60 bpm
Figure 29: Bad reading and good tempo, Ex. 2, 100 bpm
In Tables 7 and 8 the counting is summarized; it works as follows: we count a correct assessment if the note is green or light green and the played event is the one in the music score, or if the note is red and the played event is not the one in the music score. The rest of the cases are counted as incorrect assessments. The total value is the number of correct assessments over the total number of events.
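This rule can be summarized in a few lines of Python (illustrative names, not the project code):

    def assessment_total(notes):
        # notes: (colour, event_matches_score) pairs, one per score event.
        correct = sum(1 for colour, matches in notes
                      if (colour in ("green", "light green") and matches)
                      or (colour == "red" and not matches))
        return correct / len(notes)

    # Three events assessed correctly plus one red note that was actually
    # played right -> 3/4 correct assessments.
    print(assessment_total([("green", True), ("light green", True),
                            ("red", False), ("red", True)]))  # -> 0.75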
Tempo   Correct assessment   Incorrect assessment   Total
60      32                   0                      1
100     32                   0                      1
140     32                   0                      1
180     25                   7                      0.78
220     22                   10                     0.68
Table 7: Assessment results of a bad reading at different tempos, 4/4 exercise
Tempo   Correct assessment   Incorrect assessment   Total
60      47                   1                      0.98
100     45                   3                      0.9
Table 8: Assessment results of a bad reading at different tempos, 12/8 exercise
We can see that, in a controlled environment and at low tempos, the system performs the assessment based on the predictions fairly well. This can help a student to know which parts of the music sheet are read well and which are not. Also, the tempo visualization can help the student recognize whether they are slowing down or rushing when reading the score: as can be seen in Figure 30, the detected onsets (black lines in the bottom part of the waveform) are mostly behind the corresponding expected onsets.
Figure 30: Good reading and bad tempo, Ex. 1, 100 bpm
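A minimal sketch of the deviation computation behind this visualization, assuming one detected onset per expected beat, could be:

    def onset_deviations(expected, detected):
        # Positive deviation: the played onset is late; negative: early.
        return [d - e for e, d in zip(expected, detected)]

    devs = onset_deviations([0.0, 0.6, 1.2, 1.8], [0.02, 0.65, 1.28, 1.91])
    trend = "slowing down" if devs[-1] > devs[0] else "rushing"
    print([round(d, 2) for d in devs], "->", trend)
    # -> [0.02, 0.05, 0.08, 0.11] -> slowing down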
Chapter 6
Discussion and conclusions
At this point, all the work of the project has been done and the results have been analyzed. In this chapter, a discussion is developed about which objectives have been accomplished and which have not. Also, a set of further improvements is given, together with a final thought on my work and my learning. The chapter ends with an analysis of how reusable and reproducible my work is.
6.1 Discussion of results
Bearing in mind all the concepts explained throughout this document, we can now list them, stating their completeness and our contributions.
Firstly, the 29kSamples Drums Dataset has been created and is now publicly available, downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their own projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, as the class distribution of such interpretations is very unbalanced, as explained for the IDMT and MDB drums datasets.
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on context might be needed, such as the ones proposed in [16] and [17]. It is important to take into account that the task the model is trained on is very hard for a human: differentiating drums events in an individual drum sample without any context is almost impossible, even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they could be improved, as they are mostly focused on visualization rather than on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and test recordings of these exercises have been made in a different environment from that of the 40kSamples Drums Dataset. It would be good to get recordings in different spaces and with different drumsets and microphones to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even different people playing. This would introduce more variance into the drums sample dataset. Moreover, on June 9th 2021, a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 2021¹. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of existing models.
Regarding the classification model, it is clear that it needs improvements to ensure the overall robustness of the system. It would be appropriate to introduce the aforementioned methods from [16], [17] and [26] in the ADT part of the pipeline.
1 https://www.2021.ieeeicassp.org/
Also, in terms of drumset classes, there is still a long way to go. There are no solutions that robustly transcribe a whole kit, including the toms and the different kinds of cymbals. We think a proper approach would be to work with professional musicians, who could help researchers better understand the instrument and create datasets covering different playing techniques.
Regarding the assessment step, apart from the feedback visualization of tempo deviations and reading accuracy, a regression model could be trained on assessed drums exercises to give each student a mark. Along this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
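As a speculative starting point, such a mark regressor could be prototyped with scikit-learn, the library already used in this project; the features and marks below are invented for illustration:

    import numpy as np
    from sklearn.linear_model import Ridge

    # Hypothetical per-submission features:
    # [reading accuracy, mean absolute onset deviation in seconds]
    X = np.array([[0.95, 0.02], [0.80, 0.06], [0.55, 0.12], [0.30, 0.20]])
    y = np.array([9.0, 7.5, 5.0, 2.5])  # marks a teacher might give

    model = Ridge(alpha=1.0).fit(X, y)
    print(round(float(model.predict([[0.70, 0.08]])[0]), 1))  # predicted mark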
Regarding the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and develop a demo web app like Music Critic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if the code and data are available and other researchers or students can execute them, obtaining the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository², as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, are available in a Zenodo repository³, and some samples on Freesound⁴. This is the 29kDrumsSamplesDataset: not all of the 40k samples used for training are our property, so we cannot share them under our full authorship; despite this, the other datasets used in this project are available individually.
2 https://github.com/MaciAC/tfg_DrumsAssessment
3 https://zenodo.org/record/4923588#.YMRgNm4p7ow
4 https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although work remains in terms of robustness and reliability. Still, a first approximation has been presented, and several paths of improvement have been proposed.
Moreover, several fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning, not only in terms of implementation but also in investigating methods and gathering already existing experiments and results.
Regarding my relationship with computers, I have improved my fluency with git and its web counterpart, GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (e.g. Essentia), which was a tough path to take. In a more advanced phase of the project, I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a good amount of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that the system produces, and I hope more people get interested in this field of research so that better tools appear in the future.
List of Figures
1 Datasets pre-processing
2 Sample drums score from music school drums grade 1
3 Microphone setup for drums recording
4 Number of samples before Train set recording
5 Number of samples after Train set recording
6 Proposed pipeline for a drums performance assessment system, inspired by [2]
7 Confusion matrix after training with the dataset in Figure 4
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes
11 Onsets detected in a 60 bpm drums interpretation
12 Onsets detected in a 220 bpm drums interpretation
13 Onset deviation plot of a good tempo submission
14 Onset deviation plot of a bad tempo submission
15 Example of coloured notes
16 Good reading and good tempo, Ex. 1, 60 bpm
17 Good reading and good tempo, Ex. 1, 100 bpm
18 Good reading and good tempo, Ex. 1, 140 bpm
19 Good reading and good tempo, Ex. 1, 180 bpm
20 Good reading and good tempo, Ex. 1, 220 bpm
21 Good reading and good tempo, Ex. 2, 60 bpm
22 Good reading and good tempo, Ex. 2, 100 bpm
23 Good reading and good tempo, Ex. 1, 60 bpm, accumulating +6 dB at each new staff
24 Good reading and good tempo, Ex. 1, 220 bpm, accumulating +6 dB at each new staff
25 Bad reading and bad tempo, Ex. 1, 100 bpm
26 Bad reading and bad tempo, Ex. 1, 180 bpm
27 Bad reading and good tempo, Ex. 1, starting at 60 bpm and adding 60 bpm at each new staff
28 Bad reading and good tempo, Ex. 2, 60 bpm
29 Bad reading and good tempo, Ex. 2, 100 bpm
30 Good reading and bad tempo, Ex. 1, 100 bpm
31 Recording routine 1
32 Recording routine 2
33 Recording routine 3
34 Drumset configuration 1
35 Drumset configuration 2
36 Drumset configuration 3
37 Good reading and bad tempo, Ex. 1, 60 bpm
38 Bad reading and bad tempo, Ex. 1, 60 bpm
39 Good reading and bad tempo, Ex. 1, 140 bpm
40 Bad reading and bad tempo, Ex. 1, 140 bpm
41 Good reading and bad tempo, Ex. 1, 180 bpm
42 Good reading and bad tempo, Ex. 1, 220 bpm
43 Bad reading and bad tempo, Ex. 1, 220 bpm
44 Good reading and bad tempo, Ex. 2, 60 bpm
45 Bad reading and bad tempo, Ex. 2, 60 bpm
46 Good reading and bad tempo, Ex. 2, 100 bpm
47 Bad reading and bad tempo, Ex. 2, 100 bpm
List of Tables
1 Abbreviations' legend
2 Microphones used
3 Results of exercise 1 with different tempos
4 Results of exercise 2 with different tempos
5 Results of exercise 1 at 60 bpm with different amplification levels
6 Results of exercise 1 at 220 bpm with different amplification levels
7 Assessment results of a bad reading at different tempos, 4/4 exercise
8 Assessment results of a bad reading at different tempos, 12/8 exercise
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M. J., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2003).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2001).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings - 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
[drums score engraved with LilyPond 2.18.2]
Figure 31: Recording routine 1
[drums score engraved with LilyPond 2.18.2]
Figure 32: Recording routine 2
[drums score engraved with LilyPond 2.18.2]
Figure 33: Recording routine 3
Figure 34: Drumset configuration 1
Figure 35: Drumset configuration 2
Figure 36: Drumset configuration 3
Appendix B
Extra results
Figure 37: Good reading and bad tempo, Ex. 1, 60 bpm
Figure 38: Bad reading and bad tempo, Ex. 1, 60 bpm
Figure 39: Good reading and bad tempo, Ex. 1, 140 bpm
Figure 40: Bad reading and bad tempo, Ex. 1, 140 bpm
Figure 41: Good reading and bad tempo, Ex. 1, 180 bpm
Figure 42: Good reading and bad tempo, Ex. 1, 220 bpm
Figure 43: Bad reading and bad tempo, Ex. 1, 220 bpm
Figure 44: Good reading and bad tempo, Ex. 2, 60 bpm
Figure 45: Bad reading and bad tempo, Ex. 2, 60 bpm
Figure 46: Good reading and bad tempo, Ex. 2, 100 bpm
Figure 47: Bad reading and bad tempo, Ex. 2, 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
Chapter 5
Results
At this point the system has been developed and the classifier trained so we can do
an evaluation of the results to check if it works correctly and is useful to a student to
learn also to test which are the limits regarding the audio signal quality and tempo
The tests have been done with two different exercises recorded with a computer
microphone and played at a different tempo starting at 60 bpm and adding 40
bpm until 220 bpm The recordings with good tempo and good reading have been
processed adding 6dB until an accumulate of +30 dB
In this chapter and Appendix B all the resultant feedback visualizations are shown
The audio files can be listened in Freesound a pack1 has been created Some of
them will be commented on and referenced in further sections the rest are extra
results
As the High frequency content method works perfectly there are no limitations nor
errors in terms of onset detection all the tests have an f-measure of 1 detecting all
the expected events without detecting any false positive
1httpsfreesoundorgpeopleMaciaACpacks32350
36
51 Tempo limitations 37
51 Tempo limitations
One of the limitations of the system is the tempo of the exercise the accuracy drops
when the tempo increases Having as a reference the Figures that show good reading
which all notes should be in green or light green (ie Figures 16 17 18 19 20
21 and 22) we can count how many are correct or partially correct to punctuate
each case a correct prediction weights 10 a partially correct weights 05 and an
incorrect 0 the total value is the mean of the weighted result of the predictions
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that by increasing the tempo of exercise 1 the accuracy of the
classifier decreases it may be because increasing the tempo decreases the spacing
between events and consequently the duration of each event which leads to fewer
38 Chapter 5 Results
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to calculate the mean and standard deviation when extracting the timbre
characteristics As stated in the Law of Large numbers [25] the larger the sample
the closer the mean is to the total population mean In this case having fewer values
in the calculation creates more outliers in the distribution which tends to scatter
Tempo Correct Partially OK Incorrect Total60 25 7 0 089100 24 8 0 0875140 24 7 1 086180 15 9 8 061220 12 7 13 048
Table 3 Results of exercise 1 with different tempos
51 Tempo limitations 39
Regarding the 128 exercise (Figures 21 and 22) we were not able to record faster
than 100 bpm But in 128 the quarter notes equivalent in tempo is 300 quarter
note per minute similarly to the 140 bpm on 44 which quarter note per minute
temporsquos is 280 The results on 128 (Table 4) are also better because there are more
rsquoonly hi hatrsquo events which are better predicted
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
40 Chapter 5 Results
Tempo Correct Partially OK Incorrect Total60 39 8 1 089100 37 10 1 0875
Table 4 Results of exercise 2 with different tempos
52 Saturation limitations
Another limitation to the system is the saturation of the signal submitted Hearing
the submissions the hi-hat events are recorded with less amplitude than the snare
and kick events for this reason we think that the classifier starts to fail at +18dB
As can be seen in Tables 5 and 6 the same counting scheme as in the previous section
is done with Figure 23 and Figure 24 The hi-hat is the last waveform to saturate
and at this gain level the overall waveform is so clipped leading to a high-frequency
content that is predicted as a hi-hat in all the cases
Level Correct Partially OK Incorrect Total+0dB 25 7 0 089+6dB 23 9 0 086+12dB 23 9 0 086+18dB 24 7 1 086+24dB 18 5 9 064+30dB 13 5 14 048
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level Correct Partially OK Incorrect Total+0dB 12 7 13 048+6dB 13 10 9 056+12dB 10 8 14 05+18dB 9 2 21 031+24dB 8 0 24 025+30dB 9 0 23 028
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
52 Saturation limitations 41
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB ateach new staff
42 Chapter 5 Results
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB ateach new staff
53 Evaluation of the assessment 43
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier
accuracy but we think that is also important to evaluate if the system can assess
properly a studentrsquos submission
As shown in Figures 25 and 26 if the student does not play the first beat or some of
the beats are not read the system can map the rest of the events to the expected in
the correspondent onset time step This is due to a checking done in the assessment
which assumes that before starting the first beat there is a count-back of one bar
and the rest of the beats have to be after this interval
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections counting the
number of correct predictions but now in terms of assessment The analyzed results
will be the rsquoBad reading good temporsquo ones shown in Figures 27 28 and 29
44 Chapter 5 Results
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset One of the reasons for using
this approach to predict the events was that there was no literature focused on
47
48 Chapter 6 Discussion and conclusions
classifying drums events in this manner As the results have shown more complex
methods based on the context might be used as the ones proposed in [16] and [17]
It is important to take into account that the task that the model is trained to do
is very hard for a human being able to differentiate drums events in an individual
drum sample without any context is almost impossible even for a trained ear as my
drums teacher or mine
Thirdly a review of the different music sheet technologies has been done as well as
the development of a MusicXML parser This part took around one month to be
developed and from my point of view it was a great way to understand how these
file formats work and how can be improved as they are majorly focused on the
visualization not the symbolic representation of events and timesteps
Finally two exercises in different time signatures have been proposed to demonstrate
the functionality of the system As well as tests of these exercises have been recorded
in a different environment than the 30k Samples Drums Dataset It would be fine
to get recordings in different spaces and with different drumsets and microphones
to test more exhaustively the system
62 Further work
In terms of the dataset created it could be larger It could be expanded with
different drumsets tuning differently each drumset using different sticks to hit the
instruments and even different people playing This could introduce more variance
in the drums sample dataset Moreover on June 9th 2021 a paper about a large
drums datasets with MIDI data was presented [26] in the ICASSP 20211 This new
dataset could be included in the training process as the authors state that having a
large-scale dataset improves the results of the existing models
Regarding the classification model it is clear that needs improvements to ensure the
overall system robustness It would be appropriate to introduce the aforementioned
methods in [16] [17] and [26] in the ADT part of the pipeline
1httpswww2021ieeeicassporg
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion developing this project has been a nice way to close my
Bachelorrsquos degree as I reviewed some of the concepts of more personal interest
And being able to relate the project with music and drums helped me to keep
my motivation and focus I am quite satisfied with the feedback visualization that
results of the system and I hope that more people get interested in this field of
research to get better tools in the future
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
51 Tempo limitations 37
51 Tempo limitations
One of the limitations of the system is the tempo of the exercise the accuracy drops
when the tempo increases Having as a reference the Figures that show good reading
which all notes should be in green or light green (ie Figures 16 17 18 19 20
21 and 22) we can count how many are correct or partially correct to punctuate
each case a correct prediction weights 10 a partially correct weights 05 and an
incorrect 0 the total value is the mean of the weighted result of the predictions
Figure 16 Good reading and good tempo Ex 1 60 bpm
Figure 17 Good reading and good tempo Ex 1 100 bpm
Figure 18 Good reading and good tempo Ex 1 140 bpm
In Table 3 we can see that by increasing the tempo of exercise 1 the accuracy of the
classifier decreases it may be because increasing the tempo decreases the spacing
between events and consequently the duration of each event which leads to fewer
38 Chapter 5 Results
Figure 19 Good reading and good tempo Ex 1 180 bpm
Figure 20 Good reading and good tempo Ex 1 220 bpm
values to calculate the mean and standard deviation when extracting the timbre
characteristics As stated in the Law of Large numbers [25] the larger the sample
the closer the mean is to the total population mean In this case having fewer values
in the calculation creates more outliers in the distribution which tends to scatter
Tempo Correct Partially OK Incorrect Total60 25 7 0 089100 24 8 0 0875140 24 7 1 086180 15 9 8 061220 12 7 13 048
Table 3 Results of exercise 1 with different tempos
51 Tempo limitations 39
Regarding the 128 exercise (Figures 21 and 22) we were not able to record faster
than 100 bpm But in 128 the quarter notes equivalent in tempo is 300 quarter
note per minute similarly to the 140 bpm on 44 which quarter note per minute
temporsquos is 280 The results on 128 (Table 4) are also better because there are more
rsquoonly hi hatrsquo events which are better predicted
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
40 Chapter 5 Results
Tempo Correct Partially OK Incorrect Total60 39 8 1 089100 37 10 1 0875
Table 4 Results of exercise 2 with different tempos
52 Saturation limitations
Another limitation to the system is the saturation of the signal submitted Hearing
the submissions the hi-hat events are recorded with less amplitude than the snare
and kick events for this reason we think that the classifier starts to fail at +18dB
As can be seen in Tables 5 and 6 the same counting scheme as in the previous section
is done with Figure 23 and Figure 24 The hi-hat is the last waveform to saturate
and at this gain level the overall waveform is so clipped leading to a high-frequency
content that is predicted as a hi-hat in all the cases
Level Correct Partially OK Incorrect Total+0dB 25 7 0 089+6dB 23 9 0 086+12dB 23 9 0 086+18dB 24 7 1 086+24dB 18 5 9 064+30dB 13 5 14 048
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level Correct Partially OK Incorrect Total+0dB 12 7 13 048+6dB 13 10 9 056+12dB 10 8 14 05+18dB 9 2 21 031+24dB 8 0 24 025+30dB 9 0 23 028
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
52 Saturation limitations 41
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB ateach new staff
42 Chapter 5 Results
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB ateach new staff
53 Evaluation of the assessment 43
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier
accuracy but we think that is also important to evaluate if the system can assess
properly a studentrsquos submission
As shown in Figures 25 and 26 if the student does not play the first beat or some of
the beats are not read the system can map the rest of the events to the expected in
the correspondent onset time step This is due to a checking done in the assessment
which assumes that before starting the first beat there is a count-back of one bar
and the rest of the beats have to be after this interval
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections counting the
number of correct predictions but now in terms of assessment The analyzed results
will be the rsquoBad reading good temporsquo ones shown in Figures 27 28 and 29
44 Chapter 5 Results
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset One of the reasons for using
this approach to predict the events was that there was no literature focused on
47
48 Chapter 6 Discussion and conclusions
classifying drums events in this manner As the results have shown more complex
methods based on the context might be used as the ones proposed in [16] and [17]
It is important to take into account that the task that the model is trained to do
is very hard for a human being able to differentiate drums events in an individual
drum sample without any context is almost impossible even for a trained ear as my
drums teacher or mine
Thirdly a review of the different music sheet technologies has been done as well as
the development of a MusicXML parser This part took around one month to be
developed and from my point of view it was a great way to understand how these
file formats work and how can be improved as they are majorly focused on the
visualization not the symbolic representation of events and timesteps
Finally two exercises in different time signatures have been proposed to demonstrate
the functionality of the system As well as tests of these exercises have been recorded
in a different environment than the 30k Samples Drums Dataset It would be fine
to get recordings in different spaces and with different drumsets and microphones
to test more exhaustively the system
62 Further work
In terms of the dataset created it could be larger It could be expanded with
different drumsets tuning differently each drumset using different sticks to hit the
instruments and even different people playing This could introduce more variance
in the drums sample dataset Moreover on June 9th 2021 a paper about a large
drums datasets with MIDI data was presented [26] in the ICASSP 20211 This new
dataset could be included in the training process as the authors state that having a
large-scale dataset improves the results of the existing models
Regarding the classification model it is clear that needs improvements to ensure the
overall system robustness It would be appropriate to introduce the aforementioned
methods in [16] [17] and [26] in the ADT part of the pipeline
1httpswww2021ieeeicassporg
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization the system produces, and I hope more people get interested in this field of research so that better tools emerge in the future.
List of Figures

1. Datasets pre-processing
2. Sample drums score from music school, drums grade 1
3. Microphone setup for drums recording
4. Number of samples before the Train set recording
5. Number of samples after the Train set recording
6. Proposed pipeline for a drums performance assessment system, inspired by [2]
7. Confusion matrix after training with the dataset in Figure 4
8. Confusion matrix after training with the dataset in Figure 5 and parameter C = 1
9. Confusion matrix after training with the dataset in Figure 5 and parameter C = 10
10. Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only the hh, sd and kd classes
11. Onsets detected in a 60 bpm drums interpretation
12. Onsets detected in a 220 bpm drums interpretation
13. Onset deviation plot of a good-tempo submission
14. Onset deviation plot of a bad-tempo submission
15. Example of coloured notes
16. Good reading and good tempo, Ex. 1, 60 bpm
17. Good reading and good tempo, Ex. 1, 100 bpm
18. Good reading and good tempo, Ex. 1, 140 bpm
19. Good reading and good tempo, Ex. 1, 180 bpm
20. Good reading and good tempo, Ex. 1, 220 bpm
21. Good reading and good tempo, Ex. 2, 60 bpm
22. Good reading and good tempo, Ex. 2, 100 bpm
23. Good reading and good tempo, Ex. 1, 60 bpm, accumulating +6 dB at each new staff
24. Good reading and good tempo, Ex. 1, 220 bpm, accumulating +6 dB at each new staff
25. Bad reading and bad tempo, Ex. 1, 100 bpm
26. Bad reading and bad tempo, Ex. 1, 180 bpm
27. Bad reading and good tempo, Ex. 1, starting at 60 bpm and adding 60 bpm at each new staff
28. Bad reading and good tempo, Ex. 2, 60 bpm
29. Bad reading and good tempo, Ex. 2, 100 bpm
30. Good reading and bad tempo, Ex. 1, 100 bpm
31. Recording routine 1
32. Recording routine 2
33. Recording routine 3
34. Drumset configuration 1
35. Drumset configuration 2
36. Drumset configuration 3
37. Good reading and bad tempo, Ex. 1, 60 bpm
38. Bad reading and bad tempo, Ex. 1, 60 bpm
39. Good reading and bad tempo, Ex. 1, 140 bpm
40. Bad reading and bad tempo, Ex. 1, 140 bpm
41. Good reading and bad tempo, Ex. 1, 180 bpm
42. Good reading and bad tempo, Ex. 1, 220 bpm
43. Bad reading and bad tempo, Ex. 1, 220 bpm
44. Good reading and bad tempo, Ex. 2, 60 bpm
45. Bad reading and bad tempo, Ex. 2, 60 bpm
46. Good reading and bad tempo, Ex. 2, 100 bpm
47. Bad reading and bad tempo, Ex. 2, 100 bpm
List of Tables

1. Abbreviations' legend
2. Microphones used
3. Results of exercise 1 at different tempos
4. Results of exercise 2 at different tempos
5. Results of exercise 1 at 60 bpm with different amplification levels
6. Results of exercise 1 at 220 bpm with different amplification levels
7. Assessment results of a bad reading at different tempos, 4/4 exercise
8. Assessment results of a bad reading at different tempos, 12/8 exercise
Bibliography

[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K. J., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2020).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums: an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A

Studio recording media

[Drum score engraved with LilyPond 2.18.2 (www.lilypond.org)]
Figure 31: Recording routine 1

[Drum score engraved with LilyPond 2.18.2 (www.lilypond.org)]
Figure 32: Recording routine 2

[Drum score engraved with LilyPond 2.18.2 (www.lilypond.org)]
Figure 33: Recording routine 3

Figure 34: Drumset configuration 1
Figure 35: Drumset configuration 2
Figure 36: Drumset configuration 3
Appendix B

Extra results

Figure 37: Good reading and bad tempo, Ex. 1, 60 bpm
Figure 38: Bad reading and bad tempo, Ex. 1, 60 bpm
Figure 39: Good reading and bad tempo, Ex. 1, 140 bpm
Figure 40: Bad reading and bad tempo, Ex. 1, 140 bpm
Figure 41: Good reading and bad tempo, Ex. 1, 180 bpm
Figure 42: Good reading and bad tempo, Ex. 1, 220 bpm
Figure 43: Bad reading and bad tempo, Ex. 1, 220 bpm
Figure 44: Good reading and bad tempo, Ex. 2, 60 bpm
Figure 45: Bad reading and bad tempo, Ex. 2, 60 bpm
Figure 46: Good reading and bad tempo, Ex. 2, 100 bpm
Figure 47: Bad reading and bad tempo, Ex. 2, 100 bpm
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion developing this project has been a nice way to close my
Bachelorrsquos degree as I reviewed some of the concepts of more personal interest
And being able to relate the project with music and drums helped me to keep
my motivation and focus I am quite satisfied with the feedback visualization that
results of the system and I hope that more people get interested in this field of
research to get better tools in the future
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
51 Tempo limitations 39
Regarding the 128 exercise (Figures 21 and 22) we were not able to record faster
than 100 bpm But in 128 the quarter notes equivalent in tempo is 300 quarter
note per minute similarly to the 140 bpm on 44 which quarter note per minute
temporsquos is 280 The results on 128 (Table 4) are also better because there are more
rsquoonly hi hatrsquo events which are better predicted
Figure 21 Good reading and good tempo Ex 2 60 bpm
Figure 22 Good reading and good tempo Ex 2 100 bpm
40 Chapter 5 Results
Tempo Correct Partially OK Incorrect Total60 39 8 1 089100 37 10 1 0875
Table 4 Results of exercise 2 with different tempos
52 Saturation limitations
Another limitation to the system is the saturation of the signal submitted Hearing
the submissions the hi-hat events are recorded with less amplitude than the snare
and kick events for this reason we think that the classifier starts to fail at +18dB
As can be seen in Tables 5 and 6 the same counting scheme as in the previous section
is done with Figure 23 and Figure 24 The hi-hat is the last waveform to saturate
and at this gain level the overall waveform is so clipped leading to a high-frequency
content that is predicted as a hi-hat in all the cases
Level Correct Partially OK Incorrect Total+0dB 25 7 0 089+6dB 23 9 0 086+12dB 23 9 0 086+18dB 24 7 1 086+24dB 18 5 9 064+30dB 13 5 14 048
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level Correct Partially OK Incorrect Total+0dB 12 7 13 048+6dB 13 10 9 056+12dB 10 8 14 05+18dB 9 2 21 031+24dB 8 0 24 025+30dB 9 0 23 028
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
52 Saturation limitations 41
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB ateach new staff
42 Chapter 5 Results
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB ateach new staff
53 Evaluation of the assessment 43
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier
accuracy but we think that is also important to evaluate if the system can assess
properly a studentrsquos submission
As shown in Figures 25 and 26 if the student does not play the first beat or some of
the beats are not read the system can map the rest of the events to the expected in
the correspondent onset time step This is due to a checking done in the assessment
which assumes that before starting the first beat there is a count-back of one bar
and the rest of the beats have to be after this interval
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections counting the
number of correct predictions but now in terms of assessment The analyzed results
will be the rsquoBad reading good temporsquo ones shown in Figures 27 28 and 29
44 Chapter 5 Results
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset One of the reasons for using
this approach to predict the events was that there was no literature focused on
47
48 Chapter 6 Discussion and conclusions
classifying drums events in this manner As the results have shown more complex
methods based on the context might be used as the ones proposed in [16] and [17]
It is important to take into account that the task that the model is trained to do
is very hard for a human being able to differentiate drums events in an individual
drum sample without any context is almost impossible even for a trained ear as my
drums teacher or mine
Thirdly a review of the different music sheet technologies has been done as well as
the development of a MusicXML parser This part took around one month to be
developed and from my point of view it was a great way to understand how these
file formats work and how can be improved as they are majorly focused on the
visualization not the symbolic representation of events and timesteps
Finally two exercises in different time signatures have been proposed to demonstrate
the functionality of the system As well as tests of these exercises have been recorded
in a different environment than the 30k Samples Drums Dataset It would be fine
to get recordings in different spaces and with different drumsets and microphones
to test more exhaustively the system
62 Further work
In terms of the dataset created it could be larger It could be expanded with
different drumsets tuning differently each drumset using different sticks to hit the
instruments and even different people playing This could introduce more variance
in the drums sample dataset Moreover on June 9th 2021 a paper about a large
drums datasets with MIDI data was presented [26] in the ICASSP 20211 This new
dataset could be included in the training process as the authors state that having a
large-scale dataset improves the results of the existing models
Regarding the classification model it is clear that needs improvements to ensure the
overall system robustness It would be appropriate to introduce the aforementioned
methods in [16] [17] and [26] in the ADT part of the pipeline
1httpswww2021ieeeicassporg
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion developing this project has been a nice way to close my
Bachelorrsquos degree as I reviewed some of the concepts of more personal interest
And being able to relate the project with music and drums helped me to keep
my motivation and focus I am quite satisfied with the feedback visualization that
results of the system and I hope that more people get interested in this field of
research to get better tools in the future
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
40 Chapter 5 Results
Tempo Correct Partially OK Incorrect Total60 39 8 1 089100 37 10 1 0875
Table 4 Results of exercise 2 with different tempos
52 Saturation limitations
Another limitation to the system is the saturation of the signal submitted Hearing
the submissions the hi-hat events are recorded with less amplitude than the snare
and kick events for this reason we think that the classifier starts to fail at +18dB
As can be seen in Tables 5 and 6 the same counting scheme as in the previous section
is done with Figure 23 and Figure 24 The hi-hat is the last waveform to saturate
and at this gain level the overall waveform is so clipped leading to a high-frequency
content that is predicted as a hi-hat in all the cases
Level Correct Partially OK Incorrect Total+0dB 25 7 0 089+6dB 23 9 0 086+12dB 23 9 0 086+18dB 24 7 1 086+24dB 18 5 9 064+30dB 13 5 14 048
Table 5 Results of exercise 1 at 60 bpm with different amplification levels
Level Correct Partially OK Incorrect Total+0dB 12 7 13 048+6dB 13 10 9 056+12dB 10 8 14 05+18dB 9 2 21 031+24dB 8 0 24 025+30dB 9 0 23 028
Table 6 Results of exercise 1 at 220 bpm with different amplification levels
52 Saturation limitations 41
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB ateach new staff
42 Chapter 5 Results
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB ateach new staff
53 Evaluation of the assessment 43
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier
accuracy but we think that is also important to evaluate if the system can assess
properly a studentrsquos submission
As shown in Figures 25 and 26 if the student does not play the first beat or some of
the beats are not read the system can map the rest of the events to the expected in
the correspondent onset time step This is due to a checking done in the assessment
which assumes that before starting the first beat there is a count-back of one bar
and the rest of the beats have to be after this interval
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections counting the
number of correct predictions but now in terms of assessment The analyzed results
will be the rsquoBad reading good temporsquo ones shown in Figures 27 28 and 29
44 Chapter 5 Results
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly, a drums event classifier with a machine learning approach has been
proposed and trained with the aforementioned dataset. One of the reasons for using
this approach to predict the events was that there was no literature focused on
classifying drums events in this manner. As the results have shown, more complex
methods based on context might be used, such as the ones proposed in [16] and [17].
It is important to take into account that the task the model is trained to do is
very hard for a human: differentiating drums events in an individual drum sample
without any context is almost impossible, even for a trained ear such as my drums
teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as
the development of a MusicXML parser. This part took around one month to
develop and, from my point of view, it was a great way to understand how these
file formats work and how they can be improved, as they are mainly focused on
visualization rather than on the symbolic representation of events and time steps.
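To illustrate the kind of information such a parser has to recover, here is a minimal sketch using Python's standard xml.etree module on an uncompressed MusicXML file. It only collects each drum note's unpitched display step and its duration in divisions; the real parser also needs tempo marks, time signatures and the mapping from display positions to drum instruments.

    import xml.etree.ElementTree as ET

    def parse_drum_events(path):
        # Walk every <note> element; skip rests and keep
        # (display-step, duration-in-divisions) for each drum hit.
        root = ET.parse(path).getroot()
        events = []
        for note in root.iter("note"):
            if note.find("rest") is not None:
                continue
            step = note.find("unpitched/display-step")
            duration = note.find("duration")
            if step is not None and duration is not None:
                events.append((step.text, int(duration.text)))
        return events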
Finally, two exercises in different time signatures have been proposed to demonstrate
the functionality of the system, and test takes of these exercises have been recorded
in a different environment than the 40k Samples Drums Dataset. It would be useful
to get recordings in different spaces, with different drumsets and microphones, to
test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with
different drumsets, tuning each drumset differently, using different sticks to hit the
instruments, and even different people playing. This would introduce more variance
into the drums sample dataset. Moreover, on June 9th 2021 a paper about a large
drums dataset with MIDI data was presented [26] at ICASSP 2021¹. This new
dataset could be included in the training process, as the authors state that having a
large-scale dataset improves the results of the existing models.
Regarding the classification model, it is clear that it needs improvements to ensure
the overall system robustness. It would be appropriate to introduce the
aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
1. https://www.2021.ieeeicassp.org
Also, in terms of classes in the drumset, there is still a long path to cover.
There are no solutions that robustly transcribe a whole set, including the toms
and different kinds of cymbals. In this regard, we think that a proper approach
would be to work with professional musicians, which would help researchers to better
understand the instrument and create datasets covering different techniques.
Regarding the assessment step, apart from the feedback visualization of the tempo
deviations and the reading accuracy, a regression model could be trained on assessed
drums exercises to give a mark to each student. Along this path, introducing an
electronic drumset with MIDI output would make things a lot easier, as the drums
classifier step would be omitted.
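As a hint of how such a regression could look with scikit-learn (already used elsewhere in the project), the sketch below summarises each exercise with two hypothetical features, the fraction of correctly read notes and the mean absolute onset deviation, and fits a ridge regressor to teacher-given marks. The numbers are invented for illustration.

    import numpy as np
    from sklearn.linear_model import Ridge

    # Hypothetical training data: one row per assessed exercise.
    # Columns: [fraction of correct notes, mean |onset deviation| (s)]
    X = np.array([[1.00, 0.02],
                  [0.90, 0.05],
                  [0.78, 0.08],
                  [0.68, 0.12]])
    y = np.array([10.0, 8.5, 7.0, 5.5])   # marks on a 0-10 scale

    model = Ridge(alpha=1.0).fit(X, y)
    print(model.predict([[0.95, 0.03]]))  # mark for a new performance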
About the implementation, a good contribution would be to introduce the models
and algorithms into the Pysimmusic workflow and to develop a demo web app like
MusicCritic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences, a work is reproducible if code and data are available and
other researchers or students can execute them, getting the same results.
All the code has been developed in Python, a widely known general-purpose
programming language. It is available in my GitHub repository², as well as the data
used to test the system and the classification models.
The data created, i.e. the studio recordings, are available in a Zenodo repository³
and some samples in Freesound⁴. This is the 29k Drums Samples Dataset: not all
of the 40k samples used for training are our property, so we cannot share them
under our full authorship. Despite this, the other datasets used in this project are
available individually.
2. https://github.com/MaciAC/tfg_DrumsAssessment
3. https://zenodo.org/record/4923588
4. https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work
described, the goal of supporting drums learning has been accomplished, although the
work still falls short in terms of robustness and reliability. A first approximation has
been presented, as well as several proposed paths for improvement.
Moreover, several fields of engineering and computer science have been covered, such
as signal processing, music information retrieval and machine learning, not only
in terms of implementation but also by investigating methods and gathering already
existing experiments and results.
About my relationship with computers, I have improved my fluency with git and
its web counterpart GitHub. At the beginning of the project I wanted to execute
everything on my local computer, which meant installing and compiling libraries that
could not be installed on macOS via the pip command (i.e. Essentia), a tough path
to take and complete. In a more advanced phase of the project I realized that the
LilyPond tools could not be installed and used fluently on my local machine, so I
moved all the code to my Google Drive to execute the notebooks on a Colaboratory
machine. Developing code in this environment also has its quirks, which I have had
to learn. In summary, I have spent a good amount of time looking for the ideal way
to develop the project, and the process has indeed been fruitful in terms of
knowledge gained.
In my personal opinion, developing this project has been a nice way to close my
Bachelor's degree, as I reviewed some of the concepts of most personal interest.
Being able to relate the project to music and drums helped me to keep my
motivation and focus. I am quite satisfied with the feedback visualization that the
system produces, and I hope that more people get interested in this field of research
so that better tools appear in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment results of a bad reading at different tempos, 4/4 exercise 46
8 Assessment results of a bad reading at different tempos, 12/8 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018).
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020).
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019).
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013).
[5] Wachsmann, K., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001).
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005).
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981).
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005).
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021).
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010).
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014).
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012).
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020).
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2020).
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019).
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017).
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012).
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max/MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013).
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014).
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017).
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006).
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013).
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings of the 14th International Society for Music Information Retrieval Conference (2013).
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005).
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008).
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
Appendix A
Studio recording media
Figure 31: Recording routine 1
Figure 32: Recording routine 2
Figure 33: Recording routine 3
Figure 34: Drumset configuration 1
Figure 35: Drumset configuration 2
Figure 36: Drumset configuration 3
Appendix B
Extra results
Figure 37: Good reading and bad tempo, Ex. 1, 60 bpm
Figure 38: Bad reading and bad tempo, Ex. 1, 60 bpm
Figure 39: Good reading and bad tempo, Ex. 1, 140 bpm
Figure 40: Bad reading and bad tempo, Ex. 1, 140 bpm
Figure 41: Good reading and bad tempo, Ex. 1, 180 bpm
Figure 42: Good reading and bad tempo, Ex. 1, 220 bpm
Figure 43: Bad reading and bad tempo, Ex. 1, 220 bpm
Figure 44: Good reading and bad tempo, Ex. 2, 60 bpm
Figure 45: Bad reading and bad tempo, Ex. 2, 60 bpm
Figure 46: Good reading and bad tempo, Ex. 2, 100 bpm
Figure 47: Bad reading and bad tempo, Ex. 2, 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
52 Saturation limitations 41
Figure 23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB ateach new staff
42 Chapter 5 Results
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB ateach new staff
53 Evaluation of the assessment 43
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier
accuracy but we think that is also important to evaluate if the system can assess
properly a studentrsquos submission
As shown in Figures 25 and 26 if the student does not play the first beat or some of
the beats are not read the system can map the rest of the events to the expected in
the correspondent onset time step This is due to a checking done in the assessment
which assumes that before starting the first beat there is a count-back of one bar
and the rest of the beats have to be after this interval
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections counting the
number of correct predictions but now in terms of assessment The analyzed results
will be the rsquoBad reading good temporsquo ones shown in Figures 27 28 and 29
44 Chapter 5 Results
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset One of the reasons for using
this approach to predict the events was that there was no literature focused on
47
48 Chapter 6 Discussion and conclusions
classifying drums events in this manner As the results have shown more complex
methods based on the context might be used as the ones proposed in [16] and [17]
It is important to take into account that the task that the model is trained to do
is very hard for a human being able to differentiate drums events in an individual
drum sample without any context is almost impossible even for a trained ear as my
drums teacher or mine
Thirdly a review of the different music sheet technologies has been done as well as
the development of a MusicXML parser This part took around one month to be
developed and from my point of view it was a great way to understand how these
file formats work and how can be improved as they are majorly focused on the
visualization not the symbolic representation of events and timesteps
Finally two exercises in different time signatures have been proposed to demonstrate
the functionality of the system As well as tests of these exercises have been recorded
in a different environment than the 30k Samples Drums Dataset It would be fine
to get recordings in different spaces and with different drumsets and microphones
to test more exhaustively the system
62 Further work
In terms of the dataset created it could be larger It could be expanded with
different drumsets tuning differently each drumset using different sticks to hit the
instruments and even different people playing This could introduce more variance
in the drums sample dataset Moreover on June 9th 2021 a paper about a large
drums datasets with MIDI data was presented [26] in the ICASSP 20211 This new
dataset could be included in the training process as the authors state that having a
large-scale dataset improves the results of the existing models
Regarding the classification model it is clear that needs improvements to ensure the
overall system robustness It would be appropriate to introduce the aforementioned
methods in [16] [17] and [26] in the ADT part of the pipeline
1httpswww2021ieeeicassporg
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion developing this project has been a nice way to close my
Bachelorrsquos degree as I reviewed some of the concepts of more personal interest
And being able to relate the project with music and drums helped me to keep
my motivation and focus I am quite satisfied with the feedback visualization that
results of the system and I hope that more people get interested in this field of
research to get better tools in the future
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
42 Chapter 5 Results
Figure 24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB ateach new staff
53 Evaluation of the assessment 43
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier
accuracy but we think that is also important to evaluate if the system can assess
properly a studentrsquos submission
As shown in Figures 25 and 26 if the student does not play the first beat or some of
the beats are not read the system can map the rest of the events to the expected in
the correspondent onset time step This is due to a checking done in the assessment
which assumes that before starting the first beat there is a count-back of one bar
and the rest of the beats have to be after this interval
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections counting the
number of correct predictions but now in terms of assessment The analyzed results
will be the rsquoBad reading good temporsquo ones shown in Figures 27 28 and 29
44 Chapter 5 Results
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset One of the reasons for using
this approach to predict the events was that there was no literature focused on
47
48 Chapter 6 Discussion and conclusions
classifying drums events in this manner As the results have shown more complex
methods based on the context might be used as the ones proposed in [16] and [17]
It is important to take into account that the task that the model is trained to do
is very hard for a human being able to differentiate drums events in an individual
drum sample without any context is almost impossible even for a trained ear as my
drums teacher or mine
Thirdly a review of the different music sheet technologies has been done as well as
the development of a MusicXML parser This part took around one month to be
developed and from my point of view it was a great way to understand how these
file formats work and how can be improved as they are majorly focused on the
visualization not the symbolic representation of events and timesteps
Finally two exercises in different time signatures have been proposed to demonstrate
the functionality of the system As well as tests of these exercises have been recorded
in a different environment than the 30k Samples Drums Dataset It would be fine
to get recordings in different spaces and with different drumsets and microphones
to test more exhaustively the system
62 Further work
In terms of the dataset created it could be larger It could be expanded with
different drumsets tuning differently each drumset using different sticks to hit the
instruments and even different people playing This could introduce more variance
in the drums sample dataset Moreover on June 9th 2021 a paper about a large
drums datasets with MIDI data was presented [26] in the ICASSP 20211 This new
dataset could be included in the training process as the authors state that having a
large-scale dataset improves the results of the existing models
Regarding the classification model it is clear that needs improvements to ensure the
overall system robustness It would be appropriate to introduce the aforementioned
methods in [16] [17] and [26] in the ADT part of the pipeline
1httpswww2021ieeeicassporg
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion developing this project has been a nice way to close my
Bachelorrsquos degree as I reviewed some of the concepts of more personal interest
And being able to relate the project with music and drums helped me to keep
my motivation and focus I am quite satisfied with the feedback visualization that
results of the system and I hope that more people get interested in this field of
research to get better tools in the future
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
53 Evaluation of the assessment 43
53 Evaluation of the assessment
Until now the evaluation of results has been focused on the drums event classifier
accuracy but we think that is also important to evaluate if the system can assess
properly a studentrsquos submission
As shown in Figures 25 and 26 if the student does not play the first beat or some of
the beats are not read the system can map the rest of the events to the expected in
the correspondent onset time step This is due to a checking done in the assessment
which assumes that before starting the first beat there is a count-back of one bar
and the rest of the beats have to be after this interval
Figure 25 Bad reading and bad tempo Ex 1 100 bpm
Figure 26 Bad reading and bad tempo Ex 1 180 bpm
To evaluate the assessment we will proceed as in previous sections counting the
number of correct predictions but now in terms of assessment The analyzed results
will be the rsquoBad reading good temporsquo ones shown in Figures 27 28 and 29
44 Chapter 5 Results
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
With all the concepts explained throughout this document in mind, we can now list them, stating their completeness and our contributions.
Firstly, the 29k Samples Drums Dataset has been created and is now publicly available and downloadable from Freesound and Zenodo. Apart from being used in this project, this dataset might be useful to other researchers and students in their projects. The dataset is indeed useful for balancing drums datasets based on real interpretations, since the class distribution of such interpretations is very unbalanced, as explained with the IDMT and MDB drums datasets.
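As a sketch of that balancing use, assuming a hypothetical list of labelled events from a real-interpretation dataset and a pool of isolated samples per class taken from the created dataset:

import random
from collections import Counter

def balance_with_pool(events, sample_pool):
    # events: list of (audio, label) pairs from an unbalanced dataset
    # sample_pool: dict mapping each label to isolated drum samples
    counts = Counter(label for _, label in events)
    target = max(counts.values())
    balanced = list(events)
    for label, n in counts.items():
        # Top up each under-represented class with randomly drawn samples
        extra = random.choices(sample_pool[label], k=target - n)
        balanced.extend((audio, label) for audio in extra)
    return balanced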
Secondly, a drums event classifier with a machine learning approach has been proposed and trained with the aforementioned dataset. One of the reasons for using this approach to predict the events was that there was no literature focused on classifying drums events in this manner. As the results have shown, more complex methods based on the context might be used, such as the ones proposed in [16] and [17]. It is important to take into account that the task that the model is trained to do is very hard for a human: being able to differentiate drums events in an individual drum sample without any context is almost impossible, even for a trained ear such as my drums teacher's or mine.
Thirdly, a review of the different music sheet technologies has been done, as well as the development of a MusicXML parser. This part took around one month to develop and, from my point of view, it was a great way to understand how these file formats work and how they can be improved, as they are mostly focused on the visualization, not on the symbolic representation of events and timesteps.
Finally, two exercises in different time signatures have been proposed to demonstrate the functionality of the system, and test takes of these exercises have been recorded in a different environment than the 40kSamples Drums Dataset. It would be desirable to get recordings in different spaces and with different drumsets and microphones, to test the system more exhaustively.
6.2 Further work
In terms of the dataset created, it could be larger. It could be expanded with different drumsets, tuning each drumset differently, using different sticks to hit the instruments, and even with different people playing. This would introduce more variance in the drums sample dataset. Moreover, on June 9th 2021 a paper about a large drums dataset with MIDI data was presented [26] at ICASSP 2021¹. This new dataset could be included in the training process, as the authors state that having a large-scale dataset improves the results of the existing models.
Regarding the classification model, it is clear that it needs improvements to ensure the overall system robustness. It would be appropriate to introduce the aforementioned methods of [16], [17] and [26] in the ADT part of the pipeline.
¹ https://www.2021.ieeeicassp.org
Also, in terms of the classes in the drumset, there is still a long path to cover. There are no solutions that robustly transcribe a whole set, including the toms and the different kinds of cymbals. In this sense, we think that a proper approach would be to work with professional musicians, who can help researchers to better understand the instrument and create datasets covering different techniques.
With respect to the assessment step, apart from the feedback visualization of the tempo deviations and the reading accuracy, a regression model could be trained on assessed drums exercises to give a mark to each student. On this path, introducing an electronic drumset with MIDI output would make things a lot easier, as the drums classifier step could be omitted.
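A minimal sketch of such a model with Scikit-learn; the per-exercise features (tempo-deviation statistics and reading accuracy) and the marks are hypothetical placeholders, not data from this project:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical training data: one row per assessed exercise with
# [mean onset deviation, std of deviations, fraction of correct events]
X = np.array([[0.01, 0.02, 0.97],
              [0.08, 0.05, 0.70],
              [0.03, 0.03, 0.88]])
y = np.array([9.0, 5.5, 7.5])  # marks given by a teacher, 0-10 scale

model = make_pipeline(StandardScaler(), SVR())
model.fit(X, y)
print(model.predict([[0.02, 0.02, 0.92]]))  # estimated mark for a new take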
About the implementation, a good contribution would be to introduce the models and algorithms into the Pysimmusic workflow and to develop a demo web app like MusicCritic's. But better results and more robustness are needed before taking this step.
6.3 Work reproducibility
In computational sciences a work is reproducible if code and data are available and other researchers and students can execute them, getting the same results.
All the code has been developed in Python, a widely known general-purpose programming language. It is available in my GitHub repository², as well as the data used to test the system and the classification models.
The data created, i.e. the studio recordings, are available in a Zenodo repository³ and some samples in Freesound⁴. This is the 29kDrumsSamplesDataset: not all the 40k samples used for training are our property, so we are not able to share them under our full authorship; despite this, the other datasets used in this project are available individually.
² https://github.com/MaciAC/tfg_DrumsAssessment
³ https://zenodo.org/record/4923588#.YMRgNm4p7ow
⁴ https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although the work still falls short in terms of robustness and reliability. Nevertheless, a first approximation has been presented, and several paths of improvement have been proposed.
Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning, not only in terms of implementation but also by investigating methods and gathering already existing experiments and results.
Regarding my relationship with computers, I have improved my fluency with git and its web version, GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which was a tough path to take. In a more advanced phase of the project I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebook on a Colaboratory machine. Developing code in this environment has its own quirks, which I have had to learn. In summary, I have spent a lot of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me. Being able to relate the project to music and drums helped me keep my motivation and focus. I am quite satisfied with the feedback visualization that results from the system, and I hope that more people get interested in this field of research so that better tools appear in the future.
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60 bpm at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography
[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018)
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020)
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019)
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013)
[5] Wachsmann, K. J., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001)
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005)
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981)
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005)
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021)
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2003)
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014)
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2001)
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020)
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2015)
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019)
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017)
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012)
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013)
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014)
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017)
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006)
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013)
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings – 14th International Society for Music Information Retrieval Conference (2013)
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005)
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008)
[26] Wei, I-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-midi aligned data. ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
[Drum score engraved with LilyPond 2.18.2, www.lilypond.org]
Figure 31 Recording routine 1

[Drum score engraved with LilyPond 2.18.2, www.lilypond.org]
Figure 32 Recording routine 2

[Drum score engraved with LilyPond 2.18.2, www.lilypond.org]
Figure 33 Recording routine 3

Figure 34 Drumset configuration 1
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
44 Chapter 5 Results
Figure 27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpmat each new staff
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset One of the reasons for using
this approach to predict the events was that there was no literature focused on
47
48 Chapter 6 Discussion and conclusions
classifying drums events in this manner As the results have shown more complex
methods based on the context might be used as the ones proposed in [16] and [17]
It is important to take into account that the task that the model is trained to do
is very hard for a human being able to differentiate drums events in an individual
drum sample without any context is almost impossible even for a trained ear as my
drums teacher or mine
Thirdly a review of the different music sheet technologies has been done as well as
the development of a MusicXML parser This part took around one month to be
developed and from my point of view it was a great way to understand how these
file formats work and how can be improved as they are majorly focused on the
visualization not the symbolic representation of events and timesteps
Finally two exercises in different time signatures have been proposed to demonstrate
the functionality of the system As well as tests of these exercises have been recorded
in a different environment than the 30k Samples Drums Dataset It would be fine
to get recordings in different spaces and with different drumsets and microphones
to test more exhaustively the system
62 Further work
In terms of the dataset created it could be larger It could be expanded with
different drumsets tuning differently each drumset using different sticks to hit the
instruments and even different people playing This could introduce more variance
in the drums sample dataset Moreover on June 9th 2021 a paper about a large
drums datasets with MIDI data was presented [26] in the ICASSP 20211 This new
dataset could be included in the training process as the authors state that having a
large-scale dataset improves the results of the existing models
Regarding the classification model it is clear that needs improvements to ensure the
overall system robustness It would be appropriate to introduce the aforementioned
methods in [16] [17] and [26] in the ADT part of the pipeline
1httpswww2021ieeeicassporg
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion developing this project has been a nice way to close my
Bachelorrsquos degree as I reviewed some of the concepts of more personal interest
And being able to relate the project with music and drums helped me to keep
my motivation and focus I am quite satisfied with the feedback visualization that
results of the system and I hope that more people get interested in this field of
research to get better tools in the future
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
53 Evaluation of the assessment 45
Figure 28 Bad reading and good tempo Ex 2 60 bpm
Figure 29 Bad reading and good tempo Ex 2 100 bpm
On Tables 7 and 8 the counting is summarized and works as follows we count a
correct assessment if the note is green or light green and the event is the one in the
music score or if the note is red and the event is not the one in the music score
The rest of the cases will be counted as incorrect assessments The total value is
the number of correct assessments over the total number of events
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset One of the reasons for using
this approach to predict the events was that there was no literature focused on
47
48 Chapter 6 Discussion and conclusions
classifying drums events in this manner As the results have shown more complex
methods based on the context might be used as the ones proposed in [16] and [17]
It is important to take into account that the task that the model is trained to do
is very hard for a human being able to differentiate drums events in an individual
drum sample without any context is almost impossible even for a trained ear as my
drums teacher or mine
Thirdly a review of the different music sheet technologies has been done as well as
the development of a MusicXML parser This part took around one month to be
developed and from my point of view it was a great way to understand how these
file formats work and how can be improved as they are majorly focused on the
visualization not the symbolic representation of events and timesteps
Finally two exercises in different time signatures have been proposed to demonstrate
the functionality of the system As well as tests of these exercises have been recorded
in a different environment than the 30k Samples Drums Dataset It would be fine
to get recordings in different spaces and with different drumsets and microphones
to test more exhaustively the system
62 Further work
In terms of the dataset created it could be larger It could be expanded with
different drumsets tuning differently each drumset using different sticks to hit the
instruments and even different people playing This could introduce more variance
in the drums sample dataset Moreover on June 9th 2021 a paper about a large
drums datasets with MIDI data was presented [26] in the ICASSP 20211 This new
dataset could be included in the training process as the authors state that having a
large-scale dataset improves the results of the existing models
Regarding the classification model it is clear that needs improvements to ensure the
overall system robustness It would be appropriate to introduce the aforementioned
methods in [16] [17] and [26] in the ADT part of the pipeline
1httpswww2021ieeeicassporg
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion developing this project has been a nice way to close my
Bachelorrsquos degree as I reviewed some of the concepts of more personal interest
And being able to relate the project with music and drums helped me to keep
my motivation and focus I am quite satisfied with the feedback visualization that
results of the system and I hope that more people get interested in this field of
research to get better tools in the future
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
46 Chapter 5 Results
Tempo Correct assessment Incorrect assessment Total60 32 0 1100 32 0 1140 32 0 1180 25 7 078220 22 10 068
Table 7 Assessment result of a bad reading with different tempos 44 exercise
Tempo Correct assessment Incorrect assessment Total60 47 1 098100 45 3 09
Table 8 Assessment result of a bad reading with different tempos 128 exercise
We can see that for a controlled environment and low tempos the system performs
pretty well the assessment based on the predictions This can be helpful for a student
to know which parts of the music sheet are well ridden and which not Also the
tempo visualization can help the student to recognize if is slowing down or rushing
when reading the score as can be seen in Figure 30 the onsets detected (black lines
in the bottom part of the waveform) are mostly behind the correspondent expected
onset
Figure 30 Good reading and bad tempo Ex 1 100 bpm
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset One of the reasons for using
this approach to predict the events was that there was no literature focused on
47
48 Chapter 6 Discussion and conclusions
classifying drums events in this manner As the results have shown more complex
methods based on the context might be used as the ones proposed in [16] and [17]
It is important to take into account that the task that the model is trained to do
is very hard for a human being able to differentiate drums events in an individual
drum sample without any context is almost impossible even for a trained ear as my
drums teacher or mine
Thirdly a review of the different music sheet technologies has been done as well as
the development of a MusicXML parser This part took around one month to be
developed and from my point of view it was a great way to understand how these
file formats work and how can be improved as they are majorly focused on the
visualization not the symbolic representation of events and timesteps
Finally two exercises in different time signatures have been proposed to demonstrate
the functionality of the system As well as tests of these exercises have been recorded
in a different environment than the 30k Samples Drums Dataset It would be fine
to get recordings in different spaces and with different drumsets and microphones
to test more exhaustively the system
62 Further work
In terms of the dataset created it could be larger It could be expanded with
different drumsets tuning differently each drumset using different sticks to hit the
instruments and even different people playing This could introduce more variance
in the drums sample dataset Moreover on June 9th 2021 a paper about a large
drums datasets with MIDI data was presented [26] in the ICASSP 20211 This new
dataset could be included in the training process as the authors state that having a
large-scale dataset improves the results of the existing models
Regarding the classification model it is clear that needs improvements to ensure the
overall system robustness It would be appropriate to introduce the aforementioned
methods in [16] [17] and [26] in the ADT part of the pipeline
1httpswww2021ieeeicassporg
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created, i.e. the studio recordings, is available in a Zenodo repository³ and some samples are in Freesound⁴. This constitutes the 29kDrumsSamplesDataset: not all the 40k samples used for training are our property, so we cannot share them under our full authorship; despite this, the other datasets used in this project are available individually.
² https://github.com/MaciAC/tfg_DrumsAssessment
³ https://zenodo.org/record/4923588
⁴ https://freesound.org/people/MaciaAC/packs/32397
6.4 Conclusions
This project has been developed over one year. At this point, with the work described, the goal of supporting drums learning has been accomplished, although the work still falls short in terms of robustness and reliability. A first approximation has been presented, and several paths of improvement have been proposed.
Moreover, some fields of engineering and computer science have been covered, such as signal processing, music information retrieval and machine learning; not only in terms of implementation, but also by investigating methods and gathering already existing experiments and results.
Regarding my relationship with computers, I have improved my fluency with git and its web-based hosting service, GitHub. At the beginning of the project I wanted to execute everything on my local computer, having to install and compile libraries that could not be installed on macOS via the pip command (i.e. Essentia), which has been a tough path to take. In a more advanced phase of the project I realized that the LilyPond tools could not be installed and used fluently on my local machine, so I moved all the code to my Google Drive to execute the notebooks on a Colaboratory machine. Developing code in this environment also has its quirks, which I have had to learn. In summary, I have spent a lot of time looking for the ideal way to develop the project, and the process has indeed been fruitful in terms of knowledge gained.
In my personal opinion, developing this project has been a nice way to close my Bachelor's degree, as I reviewed some of the concepts of most personal interest to me. Being able to relate the project to music and drums helped me to keep my motivation and focus. I am quite satisfied with the feedback visualization that the system produces, and I hope that more people get interested in this field of research so that better tools appear in the future.
List of Figures

1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system, inspired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and parameter C = 10, but only hh, sd and kd classes 30
11 Onsets detected in a 60 bpm drums interpretation 32
12 Onsets detected in a 220 bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo, Ex. 1, 60 bpm 37
17 Good reading and good tempo, Ex. 1, 100 bpm 37
18 Good reading and good tempo, Ex. 1, 140 bpm 37
19 Good reading and good tempo, Ex. 1, 180 bpm 38
20 Good reading and good tempo, Ex. 1, 220 bpm 38
21 Good reading and good tempo, Ex. 2, 60 bpm 39
22 Good reading and good tempo, Ex. 2, 100 bpm 39
23 Good reading and good tempo, Ex. 1, 60 bpm, accumulating +6dB at each new staff 41
24 Good reading and good tempo, Ex. 1, 220 bpm, accumulating +6dB at each new staff 42
25 Bad reading and bad tempo, Ex. 1, 100 bpm 43
26 Bad reading and bad tempo, Ex. 1, 180 bpm 43
27 Bad reading and good tempo, Ex. 1, starts at 60 bpm and adds 60 bpm at each new staff 44
28 Bad reading and good tempo, Ex. 2, 60 bpm 45
29 Bad reading and good tempo, Ex. 2, 100 bpm 45
30 Good reading and bad tempo, Ex. 1, 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo, Ex. 1, 60 bpm 60
38 Bad reading and bad tempo, Ex. 1, 60 bpm 60
39 Good reading and bad tempo, Ex. 1, 140 bpm 61
40 Bad reading and bad tempo, Ex. 1, 140 bpm 61
41 Good reading and bad tempo, Ex. 1, 180 bpm 61
42 Good reading and bad tempo, Ex. 1, 220 bpm 62
43 Bad reading and bad tempo, Ex. 1, 220 bpm 62
44 Good reading and bad tempo, Ex. 2, 60 bpm 62
45 Bad reading and bad tempo, Ex. 2, 60 bpm 63
46 Good reading and bad tempo, Ex. 2, 100 bpm 63
47 Bad reading and bad tempo, Ex. 2, 100 bpm 64
List of Tables

1 Abbreviations' legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos, 4/4 exercise 46
8 Assessment result of a bad reading with different tempos, 12/8 exercise 46
Bibliography

[1] Wu, C.-W. et al. A review of automatic drum transcription. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 26 (2018)
[2] Eremenko, V., Morsi, A., Narang, J. & Serra, X. Performance assessment technologies for the support of musical instrument learning. e-repository UPF (2020)
[3] Music Technology Group. Pysimmusic. https://github.com/MTG/pysimmusic [private] (2019)
[4] Kernan, T. J. Drum set [drum kit, trap set]. Grove Encyclopedia of Music (2013)
[5] Wachsmann, K. J., Kartomi, M., von Hornbostel, E. M. & Sachs, C. Instruments, classification of. Grove Encyclopedia of Music (2001)
[6] Mierswa, I. & Morik, K. Automatic feature extraction for classifying audio data. Mach. Learn. 58 (2005)
[7] Vos, J. & Rasch, R. The perceptual onset of musical tones. Perception & Psychophysics 29 (1981)
[8] Bello, J. P. et al. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing (2005)
[9] Essentia. Algorithm reference: OnsetDetection. https://essentia.upf.edu/reference/streaming_OnsetDetection.html (2021)
[10] Herrera, P., Peeters, G. & Dubnov, S. Automatic classification of musical instrument sounds. Journal of New Music Research 32 (2010)
[11] Schedl, M., Gómez, E. & Urbano, J. Music information retrieval: Recent developments and applications. Foundations and Trends in Information Retrieval 8 (2014)
[12] van Dyk, D. A. & Meng, X.-L. The art of data augmentation. Journal of Computational and Graphical Statistics 10 (2012)
[13] Nanni, L., Maguolo, G. & Paci, M. Data augmentation approaches for improving animal audio classification. CoRR (2020)
[14] Ko, T., Peddinti, V., Povey, D. & Khudanpur, S. Audio augmentation for speech recognition. INTERSPEECH (2020)
[15] Adavanne, S., Fayek, H. M. & Tourbabin, V. Sound event classification and detection with weakly labeled data. DCASE 2019 (2019)
[16] Southall, C., Stables, R. & Hockman, J. Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks. ISMIR (2017)
[17] Lindsay-Smith, H., McDonald, S. & Sandler, M. Drumkit transcription via convolutive NMF. 15th International Conference on Digital Audio Effects, DAFx 2012 Proceedings (2012)
[18] Miron, M., Davies, M. E. P. & Gouyon, F. An open-source drum transcription system for Pure Data and Max MSP. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013)
[19] Dittmar, C. & Gärtner, D. Real-time transcription and separation of drum recordings based on NMF decomposition. DAFx (2014)
[20] Southall, C., Wu, C.-W., Lerch, A. & Hockman, J. MDB Drums – an annotated subset of MedleyDB for automatic drum transcription. ISMIR (2017)
[21] Gillet, O. & Richard, G. ENST-Drums: an extensive audio-visual database for drum signals processing. ISMIR (2006)
[22] Marxer, R. & Janer, J. Study of regularizations and constraints in NMF-based drums monaural separation. DAFx (2013)
[23] Bogdanov, D. et al. Essentia: An audio analysis library for music information retrieval. Proceedings - 14th International Society for Music Information Retrieval Conference (2013)
[24] Gómez, E., Harte, C., Sandler, M. & Abdallah, S. Symbolic representation of musical chords: A proposed syntax for text annotations. ISMIR (2005)
[25] Upton, G. & Cook, I. Laws of large numbers. A Dictionary of Statistics (2008)
[26] Wei, I.-C., Wu, C.-W. & Su, L. Improving automatic drum transcription using large-scale audio-to-MIDI aligned data. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021)
Appendix A

Studio recording media

Figure 31: Recording routine 1, ♩ = 60 (score engraved with LilyPond 2.18.2)

Figure 32: Recording routine 2, ♩ = 60 (score engraved with LilyPond 2.18.2)

Figure 33: Recording routine 3, ♩ = 60 (score engraved with LilyPond 2.18.2)

Figure 34: Drumset configuration 1

Figure 35: Drumset configuration 2

Figure 36: Drumset configuration 3
Appendix B

Extra results

Figure 37: Good reading and bad tempo, Ex. 1, 60 bpm

Figure 38: Bad reading and bad tempo, Ex. 1, 60 bpm

Figure 39: Good reading and bad tempo, Ex. 1, 140 bpm

Figure 40: Bad reading and bad tempo, Ex. 1, 140 bpm

Figure 41: Good reading and bad tempo, Ex. 1, 180 bpm

Figure 42: Good reading and bad tempo, Ex. 1, 220 bpm

Figure 43: Bad reading and bad tempo, Ex. 1, 220 bpm

Figure 44: Good reading and bad tempo, Ex. 2, 60 bpm

Figure 45: Bad reading and bad tempo, Ex. 2, 60 bpm

Figure 46: Good reading and bad tempo, Ex. 2, 100 bpm

Figure 47: Bad reading and bad tempo, Ex. 2, 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
Chapter 6
Discussion and conclusions
At this point all the work of the project has been done and the results have been
analyzed In this chapter a discussion is developed about which objectives have been
accomplished and which not Also a set of further improvements is given and a final
thought on my work and my apprenticeship The chapter ends with an analysis of
how reusable and reproducible is my work
61 Discussion of results
Having in mind all the concepts explained along with this document we can now list
them defining the completeness and our contributions
Firstly the creation of the 29k Samples Drums Dataset is now publicly available and
downloadable from Freesound and Zenodo Apart from being used in this project
this dataset might be useful to other researchers and students in their projects
The dataset indeed is useful in order to balance datasets of drums based on real
interpretations as the class distribution of these interpretations is very unbalanced
as explained with the IDMT and MDB drums datasets
Secondly a drums event classifier with a machine learning approach has been pro-
posed and trained with the aforementioned dataset One of the reasons for using
this approach to predict the events was that there was no literature focused on
47
48 Chapter 6 Discussion and conclusions
classifying drums events in this manner As the results have shown more complex
methods based on the context might be used as the ones proposed in [16] and [17]
It is important to take into account that the task that the model is trained to do
is very hard for a human being able to differentiate drums events in an individual
drum sample without any context is almost impossible even for a trained ear as my
drums teacher or mine
Thirdly a review of the different music sheet technologies has been done as well as
the development of a MusicXML parser This part took around one month to be
developed and from my point of view it was a great way to understand how these
file formats work and how can be improved as they are majorly focused on the
visualization not the symbolic representation of events and timesteps
Finally two exercises in different time signatures have been proposed to demonstrate
the functionality of the system As well as tests of these exercises have been recorded
in a different environment than the 30k Samples Drums Dataset It would be fine
to get recordings in different spaces and with different drumsets and microphones
to test more exhaustively the system
62 Further work
In terms of the dataset created it could be larger It could be expanded with
different drumsets tuning differently each drumset using different sticks to hit the
instruments and even different people playing This could introduce more variance
in the drums sample dataset Moreover on June 9th 2021 a paper about a large
drums datasets with MIDI data was presented [26] in the ICASSP 20211 This new
dataset could be included in the training process as the authors state that having a
large-scale dataset improves the results of the existing models
Regarding the classification model it is clear that needs improvements to ensure the
overall system robustness It would be appropriate to introduce the aforementioned
methods in [16] [17] and [26] in the ADT part of the pipeline
1httpswww2021ieeeicassporg
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion developing this project has been a nice way to close my
Bachelorrsquos degree as I reviewed some of the concepts of more personal interest
And being able to relate the project with music and drums helped me to keep
my motivation and focus I am quite satisfied with the feedback visualization that
results of the system and I hope that more people get interested in this field of
research to get better tools in the future
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
48 Chapter 6 Discussion and conclusions
classifying drums events in this manner As the results have shown more complex
methods based on the context might be used as the ones proposed in [16] and [17]
It is important to take into account that the task that the model is trained to do
is very hard for a human being able to differentiate drums events in an individual
drum sample without any context is almost impossible even for a trained ear as my
drums teacher or mine
Thirdly a review of the different music sheet technologies has been done as well as
the development of a MusicXML parser This part took around one month to be
developed and from my point of view it was a great way to understand how these
file formats work and how can be improved as they are majorly focused on the
visualization not the symbolic representation of events and timesteps
Finally two exercises in different time signatures have been proposed to demonstrate
the functionality of the system As well as tests of these exercises have been recorded
in a different environment than the 30k Samples Drums Dataset It would be fine
to get recordings in different spaces and with different drumsets and microphones
to test more exhaustively the system
62 Further work
In terms of the dataset created it could be larger It could be expanded with
different drumsets tuning differently each drumset using different sticks to hit the
instruments and even different people playing This could introduce more variance
in the drums sample dataset Moreover on June 9th 2021 a paper about a large
drums datasets with MIDI data was presented [26] in the ICASSP 20211 This new
dataset could be included in the training process as the authors state that having a
large-scale dataset improves the results of the existing models
Regarding the classification model it is clear that needs improvements to ensure the
overall system robustness It would be appropriate to introduce the aforementioned
methods in [16] [17] and [26] in the ADT part of the pipeline
1httpswww2021ieeeicassporg
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion developing this project has been a nice way to close my
Bachelorrsquos degree as I reviewed some of the concepts of more personal interest
And being able to relate the project with music and drums helped me to keep
my motivation and focus I am quite satisfied with the feedback visualization that
results of the system and I hope that more people get interested in this field of
research to get better tools in the future
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
63 Work reproducibility 49
Also in terms of classes in the drumset there is a large path to cover in this way
There are no solutions that transcribe in a robust way a whole set including the toms
and different kinds of cymbals In this way we think that a proper approach would
be to work with professional musicians which helps researchers to better understand
the instrument and create datasets with different techniques
In respect of the assessment step apart from the feedback visualization of the tempo
deviations and the reading accuracy a regression model could be trained with drums
exercises assessed and give a mark to each student In this path introducing an
electronic drumset with MIDI output would make things a lot easier as the drums
classifier step would be omitted
About the implementation a good contribution would be to introduce the models
and algorithms to the Pysimmusic workflow and develop a demo web app like the
MusicCriticrsquos But better results and more robustness are needed to do this step
63 Work reproducibility
In computational sciences a work is reproducible if code and data are available and
other researchersstudents can execute them getting the same results
All the code has been developed in Python a widely known general-purpose pro-
gramming language It is available in my GitHub repository2 as well as the data
used to test the system and the classification models
The data created ie the studio recordings are available in a Zenodo repository3
and some samples in Freesound4 This is the 29kDrumsSamplesDataset as not all
the 40k samples used to train are of our property and we are not able to share them
under our full authorship despite this the other datasets used in this project are
available individually
2httpsgithubcomMaciACtfg_DrumsAssessment3httpszenodoorgrecord4923588YMRgNm4p7ow4httpsfreesoundorgpeopleMaciaACpacks32397
50 Chapter 6 Discussion and conclusions
64 Conclusions
This project has been developed over one year At this point with the work de-
scribed the goal of supporting drums learning has been accomplished Besides this
work rests in terms of robustness and reliability But a first approximation has been
presented as well as several paths of improvement proposed
Moreover some fields of engineering and computer science have been covered such
as signal processing music information retrieval and machine learning Not only
in terms of implementation but investigating for methods and gathering already
existing experiments and results
About my relationship with computers I have improved my fluency with git and
its web version GitHub Also at the beginning of the project I wanted to execute
everything on my local computer having to install and compile libraries that were
not able to install in macOS via the pip command (ie Essentia) which has been
a tough path to take and accomplish In a more advanced phase of the project
I realized that the LilyPond tools were not possible to install and use fluently in
my local machine so I have moved all the code to my Google Drive to execute the
notebook on a Collaboratory machine Developing code in this environment has
also its clues which I have had to learn In summary I have spent a bunch of time
looking for the ideal way to develop the project and the process indeed has been
fruitful in terms of knowledge gained
In my personal opinion developing this project has been a nice way to close my
Bachelorrsquos degree as I reviewed some of the concepts of more personal interest
And being able to relate the project with music and drums helped me to keep
my motivation and focus I am quite satisfied with the feedback visualization that
results of the system and I hope that more people get interested in this field of
research to get better tools in the future
List of Figures
1 Datasets pre-processing 15
2 Sample drums score from music school drums grade 1 17
3 Microphone setup for drums recording 19
4 Number of samples before Train set recording 20
5 Number of samples after Train set recording 21
6 Proposed pipeline for a drums performance assessment system in-
spired by [2] 25
7 Confusion matrix after training with the dataset in Figure 4 28
8 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 1 29
9 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 30
10 Confusion matrix after training with the dataset in Figure 5 and
parameter C = 10 but only hh sd and kd classes 30
11 Onsets detected in a 60bpm drums interpretation 32
12 Onsets detected in a 220bpm drums interpretation 32
13 Onset deviation plot of a good tempo submission 34
14 Onset deviation plot of a bad tempo submission 34
15 Example of coloured notes 35
16 Good reading and good tempo Ex 1 60 bpm 37
17 Good reading and good tempo Ex 1 100 bpm 37
18 Good reading and good tempo Ex 1 140 bpm 37
19 Good reading and good tempo Ex 1 180 bpm 38
20 Good reading and good tempo Ex 1 220 bpm 38
51
52 LIST OF FIGURES
21 Good reading and good tempo Ex 2 60 bpm 39
22 Good reading and good tempo Ex 2 100 bpm 39
23 Good reading and good tempo Ex 1 60 bpm accumulating +6dB at
each new staff 41
24 Good reading and good tempo Ex 1 220 bpm accumulating +6dB
at each new staff 42
25 Bad reading and bad tempo Ex 1 100 bpm 43
26 Bad reading and bad tempo Ex 1 180 bpm 43
27 Bad reading and good tempo Ex 1 starts on 60 bpm and adds 60bpm
at each new staff 44
28 Bad reading and good tempo Ex 2 60 bpm 45
29 Bad reading and good tempo Ex 2 100 bpm 45
30 Good reading and bad tempo Ex 1 100 bpm 46
31 Recording routine 1 57
32 Recording routine 2 58
33 Recording routine 3 58
34 Drumset configuration 1 58
35 Drumset configuration 2 59
36 Drumset configuration 3 59
37 Good reading and bad tempo Ex 1 60 bpm 60
38 Bad reading and bad tempo Ex 1 60 bpm 60
39 Good reading and bad tempo Ex 1 140 bpm 61
40 Bad reading and bad tempo Ex 1 140 bpm 61
41 Good reading and bad tempo Ex 1 180 bpm 61
42 Good reading and bad tempo Ex 1 220 bpm 62
43 Bad reading and bad tempo Ex 1 220 bpm 62
44 Good reading and bad tempo Ex 2 60 bpm 62
45 Bad reading and bad tempo Ex 2 60 bpm 63
46 Good reading and bad tempo Ex 2 100 bpm 63
47 Bad reading and bad tempo Ex 2 100 bpm 64
List of Tables
1 Abbreviationsrsquo legend 4
2 Microphones used 19
3 Results of exercise 1 with different tempos 38
4 Results of exercise 2 with different tempos 40
5 Results of exercise 1 at 60 bpm with different amplification levels 40
6 Results of exercise 1 at 220 bpm with different amplification levels 40
7 Assessment result of a bad reading with different tempos 44 exercise 46
8 Assessment result of a bad reading with different tempos 128 exercise 46
53
Bibliography
[1] Wu C-W et al A review of automatic drum transcription IEEEACM Trans
Audio Speech and Lang Proc 26 (2018)
[2] Eremenko V Morsi A Narang J amp Serra X Performance assessment
technologies for the support of musical instrument learning e-repository UPF
(2020)
[3] MusicTechnologyGroup Pysimmusic httpsgithubcomMTGpysimmusic
[private] (2019)
[4] Kernan T J Drum set [drum kit trap set] Grove Encyclopedy of Music
(2013)
[5] Wachsmann K J Kartomi M von Hornbostel E M amp Sachs C Instru-
ments classification of Grove Encyclopedy of Music (2001)
[6] Mierswa I amp Morik K Automatic feature extraction for classifyng audio data
Mach Learn 58 (2005)
[7] Vos J amp Rasch R The perceptual onset of musical tones Perception Psy-
chophysics 29 (1981)
[8] Bello J P et al A tutorial on onset detection in music signals IEEE Trans-
actions on Speech and Audio Processing (2005)
[9] Essentia Algorithm reference Onsetdetection httpsessentiaupfedu
referencestreaming_OnsetDetectionhtml (2021)
54
BIBLIOGRAPHY 55
[10] Herrera P Peeters G amp Dubnov S Automatic classification of musical
instrument sound Journal of New Music Research 32 (2010)
[11] Schedl M Goacutemez E amp Urbano J Music information retrieval Recent de-
velopments and applications Foundations and Trends in Information Retrieval
8 (2014)
[12] A van Dyk D amp Meng X-L The art of data augmentation Journal of Com-
putational and Graphical Statistics - J COMPUT GRAPH STAT 10 (2012)
[13] Nanni L Maguoloa G amp Paci M Data augmentation approaches for im-
proving animal audio classification CoRR (2020)
[14] Kol T Peddinti V Povey D amp Khudanpur S Audio augmentation for
speech recognition INTERSPEECH (2020)
[15] Adavanne S M Fayek H amp Tourbabin V Sound event classification and
detection with weakly labeled data DCASE 2019 (2019)
[16] Southall C Stables R amp Hockman J Automatic drum transcription for
polyphonicrecordings using soft attention mechanisms andconvolutional neural
networks ISMIR (2017)
[17] Lindsay-Smith H McDonald S amp Sandler M Drumkit transcription via
convolutive nmf 15th International Conference on Digital Audio Effects DAFx
2012 Proceedings (2012)
[18] Miron M EP Davies M amp Gouyon F An open-source drum transcription
system for pure data and max msp 2013 IEEE International Conference on
Acoustics Speech and Signal Processing (2012)
[19] Dittmar C amp Gaumlrtner D Real-time transcription and separation of drum
recordings based onnmf decomposition DAFx (2014)
[20] Southall C Wu C-W Lerch A amp Hockman J Mdb drums ndash an annotated
subset of medleydb forautomatic drum transcription ISMIR (2017)
56 BIBLIOGRAPHY
[21] Gillet O amp Richard G Enst-drums an extensive audio-visual database for
drum signals processing ISMIR (2006)
[22] Marxer R amp Janer J Study of regularizations and constraints in nmf-based
drums monaural separation DAFx (2013)
[23] Bogdanov D et al Essentia An audio analysis library for musicinformation
retrieval Proceedings - 14th International Society for Music Information Re-
trieval Conference (2010)
[24] Goacutemez E Harte C Sandler M amp Abdallah S Symbolic representation of
musical chords A proposed syntax for text annotations ISMIR (2005)
[25] Upton G amp Cook I Laws of large numbers A dictionary of statistics (2008)
[26] Wei I-C Wu C-W amp Su L Improving automatic drum transcription using
large-scale audio-to-midi aligned data ICASSP 2021 - 2021 IEEE International
Conference on Acoustics Speech and Signal Processing (ICASSP) (2021)
Appendix A
Studio recording media
pound poundpound pound pound pound pound pound = 60
pound
poundpound poundpoundpound9 pound poundpound
poundpound poundpoundpound13pound poundpound
pound pound poundpound pound pound pound pound pound 17 pound pound pound pound poundpound pound
pound pound poundpound pound pound pound pound pound 21 pound pound pound pound poundpound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 31 Recording routine 1
57
58 Appendix A Studio recording media
frac34frac34frac34frac34pound = 60 frac34frac34
frac34frac34 frac34frac345 frac34frac34
frac34frac34frac34frac34 frac34frac349
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 32 Recording routine 2
poundpoundpound poundpoundpound = 60 pound poundpound
pound pound poundpound pound pound pound poundpound pound pound pound pound poundpound5
pound
poundpoundpound poundpoundpound pound pound pound pound poundpoundpound poundpoundpoundpound pound poundpoundpound 9 pound poundpoundpound pound pound pound poundpound pound pound
Music engraving by LilyPond 2182mdashwwwlilypondorg
Figure 33 Recording routine 3
Figure 34 Drumset configuration 1
59
Figure 35 Drumset configuration 2
Figure 36 Drumset configuration 3
Appendix B
Extra results
Figure 37 Good reading and bad tempo Ex 1 60 bpm
Figure 38 Bad reading and bad tempo Ex 1 60 bpm
60
61
Figure 39 Good reading and bad tempo Ex 1 140 bpm
Figure 40 Bad reading and bad tempo Ex 1 140 bpm
Figure 41 Good reading and bad tempo Ex 1 180 bpm
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
62 Appendix B Extra results
Figure 42 Good reading and bad tempo Ex 1 220 bpm
Figure 43 Bad reading and bad tempo Ex 1 220 bpm
Figure 44 Good reading and bad tempo Ex 2 60 bpm
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
63
Figure 45 Bad reading and bad tempo Ex 2 60 bpm
Figure 46 Good reading and bad tempo Ex 2 100 bpm
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-
64 Appendix B Extra results
Figure 47 Bad reading and bad tempo Ex 2 100 bpm
- Introduction
-
- Motivation
- Existing solutions
- Identified challenges
-
- Guitar vs drums
- Dataset creation
- Signal quality
-
- Objectives
- Project overview
-
- State of the art
-
- Signal processing
-
- Feature extraction
- Data augmentation
-
- Sound event classification
-
- Drums event classification
-
- Digital sheet music
- Software tools
-
- Essentia
- Scikit-learn
- Lilypond
- Pysimmusic
- Music Critic
-
- Summary
-
- The 40kSamples Drums Dataset
-
- Existing datasets
-
- MDB Drums
- IDMT Drums
-
- Created datasets
-
- Music school
- Studio recordings
-
- Data augmentation
- Drums events trim
- Summary
-
- Methodology
-
- Problem definition
- Drums event classifier
-
- Feature extraction
- Training and validating
- Testing
-
- Music performance assessment
-
- Visualization
- Files used
-
- Results
-
- Tempo limitations
- Saturation limitations
- Evaluation of the assessment
-
- Discussion and conclusions
-
- Discussion of results
- Further work
- Work reproducibility
- Conclusions
-
- List of Figures
- List of Tables
- Bibliography
- Studio recording media
-
- Extra results
-