TRANSCRIPT
MUSICAL INSTRUMENT RECOGNITION
BY
Hrishikesh P. Kanjalkar
DEPARTMENT OF ELECTRONICS AND TELECOMMUNICATION
Index
Chapter 1 Introduction
1.1 Motivation for the work
1.2 Defining the problem
1.3 Theory of music
1.4 Musical notes and scales
Chapter 2 Literature review
2.1 Musical instruments recognition system
2.2 Comparison between artificial systems and human abilities
2.3 Physical properties of musical instruments
Chapter 3 Feature Extraction
3.1 Temporal Features
3.2 Spectral Features
Chapter 4 Classifier
Chapter 5 Conclusion
Chapter 6 Bibliography
Chapter 1
Introduction
1.1 Motivation for the work.
Motivation relates to the generic problem of sound source recognition and analysis of auditory
scenes. The idea is to compile a toolbox of generic feature extractors and classification methods
that can be applied to a variety of audio-related analysis and understanding problems. In fact,
some of the methods implemented for this study, and the knowledge gained, have already been
used in [2].
Secondly, there has been a great deal of research concerning the automatic annotation of
music files. Musical instrument recognition can be a great help to musicians in identifying
the particular notes being played. Just as Google is used for searching text, we could have a
music search engine where we play a note and also get basic information about it. This would
give musicians a better platform for their learning process.
1.2 Defining the problem.
This project aims to accurately detect the instrument family and instruments of a signal.
To accomplish this, we intend to record and analyze the entire range of a few instruments, and
then use this analysis to decompose monophonic, or one instrument, signals into their component
instruments. This project basically deals with recognizing a musical instrument from a played
note. Software is developed that is capable of listening to a recording and classifying the
instrument. Musical (audio) data is redundant in nature: one second of signal contains 44,100
samples (at a sampling frequency of 44.1 kHz). The input signal therefore requires feature
extraction, which helps in recognizing the instrument.
1.3 Theory of Music.
For those unfamiliar with music, we offer a (very) brief introduction to the technical
aspects of music.
The sounds you hear over the airwaves and in all manner of places may be grouped into
12 superficially disparate categories. Each category is labeled a "note" and given an
alphabetic symbolic representation. That is, the letters A through G represent seven of the
notes, and the other five are represented by appending either a pound sign (#, or sharp) or
a symbol that looks remarkably similar to a lower-case b (also called a flat).
Although these notes were conjured in an age where the modern theory of waves and
optics was not dreamt of even by the greatest of thinkers, they share some remarkable
characteristics. Namely, every note that shares its name with another (notes occupying
separate "octaves," with one sounding higher or lower than the other) has a frequency that is
a power-of-two multiple of the frequency of the notes with which it shares a name. More
simply, an A in one octave has a frequency twice that of the A one octave below.
As it turns out, every note is related to every other note by a common multiplicative
factor. To run the full gamut, one need only multiply a given note by the 12th root of two n
times to find the nth note "above" it (i.e. going up in frequency).
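In equal temperament this relationship can be written as f(n) = f0 × 2^(n/12), where f0 is a
reference frequency and n counts semitones. A minimal MATLAB sketch (the choice of
A4 = 440 Hz as the reference is our assumption, not fixed by the text):

    f0 = 440;             % reference note A4, in Hz (assumed reference)
    n  = 3;               % number of semitones above the reference
    fn = f0 * 2^(n/12);   % frequency n semitones up: ~523.25 Hz, the note C5
    % n = 12 gives one octave up (880 Hz); n = -12 gives one octave down (220 Hz).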
1.4 Musical notes, intervals and scales
Musical notes are the symbols or signs that represent the frequencies, durations and timings
of the elementary musical sounds. It can be said that musical notes play a similar role to the
alphabet of a language; they allow music compositions and scores to be recorded in symbolic
form and read and played by musicians. The systems of musical notes also allow standardization
of musical instruments and their tuning frequencies.
Table 1. The main musical note frequencies and symbols
Note that on a piano the sharp (or flat) notes are signified by black keys. In Table 1 the notes
are ordered in the C-major scale (C, D, E, F, G, A, B).
The Western musical note system, as shown in the table, is based on seven basic notes, also
known as the natural notes; these are:
[C_D_E_F_G_A_B]
There are also five ‘sharp’ notes:
[C#_D#_ F#_G#_A#]
and five ‘flat’ notes:
[Db_ Eb_Gb_Ab_Bb]
The hash sign # denotes a sharp note and the sign 'b' denotes a flat note. The sharp version of a
note is a semitone higher than that note, e.g. C# = 2^(1/12) × C, whereas the flat version of a
note is a semitone lower than that note, e.g. Db = D / 2^(1/12).
Musical Scales
In music theory, a musical scale is a specific pattern of the pitch ratios of successive notes.
The pitch difference between successive notes is known as a scale step. The scale step may be
constant or it may vary. Musical scales are usually known by the type of interval and scale step
that they contain, and are typically ordered in terms of the pitch or frequencies of notes. Some
of the most important examples of musical scales are the chromatic scale, diatonic scale and
Pythagorean scale. Musical material is written in terms of a musical scale.
Fig 1. The frequencies of the keys on a piano
Note that the piano keys are arranged in groups of 12. Each set of 12 keys spans an octave, which is a doubling of frequency. The frequency of A_N is 2^N × A0, i.e. N octaves higher than A0; for example A7 = 2^7 × 27.5 Hz = 3520 Hz. Black keys correspond to sharp notes.
Chapter 2
Literature Survey
2.1 Musical instruments recognition system.
Various attempts have been made to construct automatic musical instrument recognition systems.
Researchers have used different approaches and scopes, achieving different performances. Most
systems have operated on isolated notes, often taken from the same, single source, and having
notes over a very small pitch range. The most recent systems have operated on solo music taken
from commercial recordings. The studies using isolated tones and monophonic phrases are the
most relevant to our scope.
Recognition of single tones
These studies have used isolated notes as test material, with varying numbers of instruments and
pitches.
Studies using one example of each instrument
Kaminskyj and Materka used features derived from a root-mean-square (RMS) energy
envelope via PCA and used a neural network or a k-nearest neighbor (k-NN) classifier to classify
guitar, piano, marimba and accordion tones over a one-octave band [1]. Both classifiers achieved
a good performance, approximately 98 %. However, strong conclusions cannot be made since
the instruments were very different, there was only one example of each instrument, the note
range was small, and the training and test data were from the same recording session. More
recently, Kaminskyj [1] has extended the system to recognize 19 instruments over a three-octave
pitch range from the McGill collection [2]. Using features derived from the RMS-energy
envelope and the constant-Q transform [3], an accuracy of 82 % was reported using a classifier
combination scheme.
Table 2: Summary of recognition percentages of isolated-note recognition systems using only one example of each instrument.

Study          Percentage correct   Number of instruments
[Kaminskyj95]  98                   4 (guitar, piano, marimba and accordion)
[Kaminskyj00]  82                   19
[Fujinaga98]   50                   23
[Fraser99]     64                   23
[Fujinaga00]   68                   23
[Martin98]     72 (93)              14
[Kostek99]     97 / 81              4 (bass trombone, trombone, English horn and contra bassoon) / 20
[Kostek01]     93 / 90              4 (oboe, trumpet, violin, cello) / 18
Fujinaga and Fraser trained a k-NN with features extracted from 1338 spectral slices of 23
instruments playing a range of pitches [4]. Using leave-one-out cross validation and a genetic
algorithm for finding good feature combinations, a recognition accuracy of 50 % was obtained
with 23 instruments. When the authors added features relating to the dynamically changing
spectral envelope, and velocity of spectral centroid and its variance, the accuracy increased to 64
% [4]. Finally, after small refinements and adding spectral irregularity and tristimulus features,
an accuracy of 68% was reported [4].
Martin and Kim reported a system operating on full pitch ranges of 14 instruments [8]. The
samples were a subset of the isolated notes on the McGill collection [2]. The best classifier was
the k-NN, enhanced with the Fisher discriminant analysis to reduce the dimensions of the data,
and a hierarchical classification architecture for first recognizing the instrument families. Using
70 % / 30 % splits between the training and test data, they obtained a recognition rate of 72 %
for individual instruments and, after finding a 10-feature set giving the best average
performance, an accuracy of 93 % in classification between five instrument families.
Kostek has calculated several different features relating to the spectral shape and onset
characteristics of tones taken from chromatic scales with different articulation styles [7]. A two-
layer feed-forward neural network was used as a classifier. The author reports excellent
recognition percentages with four instruments: the bass trombone, trombone, English horn and
contra bassoon. However, the pitch of the note was provided for the system, and the training and
test material were from different channels of the same stereo recording setup.
Kostek and Czyzewski also tried using wavelet-analysis based features for musical instrument
recognition, but their preliminary results were worse than with the earlier features [11].
In the most recent paper, the same authors expanded their feature set to include 34 FFT-based
features, and 23 wavelet features [12]. A promising percentage of 90 % with 18 classes is
reported; however, the leave-one-out cross-validation scheme probably increases the recognition
rate. The results obtained with the wavelet features were almost as good as with the other
features. Table 2 summarizes the recognition percentages reported in isolated-note studies. The
most severe limitation of all these studies is that they all used only one example of each
instrument. This significantly decreases the generalizability of the results, as we will demonstrate
with our system later in this report.
2.2 Comparison between artificial systems and human abilities.
The current state-of-the-art in artificial sound source recognition is still very limited in its
practical applicability. Under laboratory conditions, the systems are able to successfully
recognize a wider set of sound sources. However, if the conditions become more realistic, i.e. the
material is noisy, recorded in different locations with different setups, or there are interfering
sounds, the systems are able to successfully handle only a small number of sound sources. The
main challenge for the future is to build systems that can recognize wider sets of sound sources
with increased generality and in realistic conditions [13].
In general, humans are superior with regard to all the evaluation criteria [13]. They are able
to generalize between different instances of instruments, and to recognize more abstract classes
such as bowed string instruments. People are robust recognizers: they are able to focus on the
sound of a single instrument in a concert, or on a single voice within a babble. In
addition, they are able to learn new sound sources easily, and learn to become experts in
recognizing, for example, orchestral instruments. The recognition accuracy of human subjects
gradually worsens as the level of background noise, and interfering sound sources increases.
Only in limited contexts, such as discriminating between four woodwind instruments, have
computer systems performed comparably to human subjects [14]. For more general tasks, a lot
of work remains to be done.
2.3 Physical properties of musical instrument.
Traditionally, the instruments are divided into four classes: the strings, the brass, the keyboards
and the woodwinds. The sounds of the instruments within each family are similar, and humans
often make confusions within, but not easily between, these families. Examples include
confusing the violin and viola, the oboe and English horn, or the trombone and French horn [13].
In the following, we briefly present the different members of each family and their physical
build.
The strings
The members of the string family include the violin, viola, cello and double bass, in order of
increasing size, together with the guitar. These five form a tight perceptual family, and human
subjects consistently make confusions within this family [13].
The string instruments consist of a wooden body with a top and back plate and sides, and an
extended neck. The strings are stretched along the neck and over a fingerboard. At one end the
strings are attached to the bridge, and at the other end to the tuning pegs, which control the
string tension. The strings can be excited by plucking them with the fingers, drawing a bow over
them or hitting them with the bow (the martelé style of playing). The strings themselves move
very little air; the sound is produced by the vibration of the body and the air within it [9]. These
are set into motion by the string vibration, which is transmitted to the body via the coupling
through the bridge. The motion of the top plate is the source of most of the sound, and is a
result of the interaction between the driving force from the bridge and the resonances of the
instrument body [9].
The brass
The members of the brass family considered in this project include the trumpet, French horn, and
tuba. The brass instruments have the simplest acoustic structure among the three families. They
consist of a long, hard walled tube with a flaring bell attached at one end. The sound is produced
by blowing at the other end of the tube, and the pitch of the instrument can be varied by changing
the lip tension. The player can use mutes to alter the sound or, in the case of the French horn,
insert a hand into the bell.
The woodwind
The woodwind family is more heterogeneous than the string and brass families, and there exist
several acoustically and perceptually distinct subgroups [13]. The subgroups are the single-reed
clarinets, the double reeds, the flutes with an air reed, and the single-reed saxophones. In
wind instruments, the single or double reed operates in a similar way as the player’s lips in brass
instruments, allowing puffs of air into a conical tube where standing waves are then created. The
effective length of the tube is varied by opening and closing tone holes, changing the pitch of the
played note. [10]
Flutes
The members of the flute or air reed family include the piccolo, flute, alto flute and bass flute in
the order of increasing size. They consist of a more or less cylindrical pipe, which has finger
holes along its length. The pipe is stopped at one end, and has a blowing hole near the stopped
end [10].
In this chapter I have discussed papers that identify a few dominant features for the
classification of musical instruments. From these papers I have studied the temporal, spectral
and cepstral features that are helpful in musical instrument recognition, and have also
considered various classification methods. The features I have considered are:
1. Temporal features
I. Decay time
II. Energy
III. Zero crossing
2. Spectral features
I. Spectral centroid
II. Spectral roll-off
III. Spectral flux
k-NN is used as the classifier.
Chapter 3
Methodology
3.1. System Block Diagram:
Fig 3.1 System Block Diagram
Database Creation: For this work we require a database consisting of 5 different families of instruments. The instruments being considered are listed below.
Table 3.2: Instruments and their respective families
The notes are 16 bit, mono channel with sampling frequency of 44.1 kHz. All the audio samples
were recorded in ‘.wav’ format. The platform used for this work is MATLAB (R2009b).
(Recording was done by a professional in a studio.)
(Fig 3.1 shows the processing chain: Music Samples → Feature Calculation → Classification → Trained Model → Decision Box.)
Families   Instruments
String     Guitar, Violin, Harp, Santoor, Sitar, Sarod, Banjo
Keyboard   Piano, Accordion
Woodwind   Flute, Oboe, Clarinet
Brass      French Horn, Trumpet, Tuba
Block diagram of Training Model:
Fig 3.3
Feature extraction:
The feature extraction stage, also called the front-end processor, generates training vectors.
Our main intention is to investigate the performance of different feature schemes and to
find a good feature combination for a robust instrument classifier. Two categories of features are
being considered, temporal and spectral. Features studied for classification are mentioned in the
table below.
Table 3.4: Features studied

Temporal Features: energy, ZCR, ZCRM, ZCRMD, long attack time, ADSR, amplitude envelope.
Spectral Features: spectral distribution, harmonicity, spectral centroid, roll-off, skewness, flux,
flatness, crest, spectral mean deviation, fundamental frequency, timbre, spectral range,
bandwidth, MFCC, LPC, relative power.
Features in time domain:
To extract temporal features, the music sound samples were segmented into 10-ms frames.
The following are some important time domain features used in this project.
I. Energy of the signal
II. Zero crossing
III. Attack time
IV. Decay time
1. Energy
The energy of each frame is calculated as the sum of the squared samples:

E = Σₙ x(n)², where the sum runs over the N samples of the frame.

If the energy in the analysis window is high, the implication is that the frame is voiced
(vowel/diphthong/semivowel/voiced consonant); if the energy is low, the frame is unvoiced
(unvoiced consonant/silence). Thus the energy analysis helps to detect:
– voiced and silent regions,
– silence and non-silence parts of the signal.
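A minimal MATLAB sketch of the framing and per-frame energy computation (the file name
'note.wav' is our assumption; the 10-ms frame length follows the text):

    [x, fs] = wavread('note.wav');   % read the 16-bit mono recording
    N  = round(0.010 * fs);          % 10-ms frame = 441 samples at 44.1 kHz
    nF = floor(length(x) / N);       % number of complete frames
    E  = zeros(nF, 1);
    for i = 1:nF
        frame = x((i-1)*N + 1 : i*N);    % i-th 10-ms frame
        E(i)  = sum(frame .^ 2);         % energy = sum of squared samples
    end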
2. Zero crossing:
A zero crossing is said to occur between samples x(t) and x(t+1) if
sign(x(t)) ≠ sign(x(t+1)).
The zero-crossing rate (ZCR) is the number of such sign changes in an analysis frame.
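A sketch of the per-frame zero-crossing count, reusing x, N and nF from the energy sketch
above:

    Z = zeros(nF, 1);
    for i = 1:nF
        frame = x((i-1)*N + 1 : i*N);
        % a sign change between adjacent samples means their product is negative
        Z(i) = sum(frame(1:end-1) .* frame(2:end) < 0);
    end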
3. Attack time:
The rise time is taken as the time difference between the time at the end of attack and the
backtracked position where the magnitude is 25% of the magnitude at the end of attack.
Ta = t1 – t2
where
Ta is the attack time,
t1 is the time where the amplitude of the signal is at its maximum (the end of attack),
t2 is the backtracked time where the amplitude is 25% of that maximum.
4. Decay time:
The decay time is obtained as the time difference between the end of attack and the
forward position where the magnitude is 25% of the magnitude at the end of attack.
Td = t3 – t4
where
Td is the decay time,
t3 is the time when the amplitude of the signal has fallen to 25% of the maximum, after the
maximum amplitude has occurred,
t4 is the time when the signal has its maximum amplitude.
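A hedged sketch of both measurements from a smoothed amplitude envelope of the whole signal
x (the 256-sample smoothing window is our assumption, not a value from the text; the find calls
assume the envelope actually crosses the 25% level on both sides of the peak):

    env = filter(ones(256,1)/256, 1, abs(x));       % crude amplitude envelope
    [pk, iMax] = max(env);                          % end of attack = envelope peak
    iA = find(env(1:iMax) <= 0.25*pk, 1, 'last');   % backtrack to 25% of the peak
    iD = iMax - 1 + find(env(iMax:end) <= 0.25*pk, 1, 'first');  % forward to 25%
    Ta = (iMax - iA) / fs;                          % attack (rise) time, seconds
    Td = (iD - iMax) / fs;                          % decay time, seconds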
3.2 Spectral features
While some instruments generate sounds which have energy concentrated in the lower
frequency bands, there are other instruments which produce sounds with energy almost
evenly distributed among lower, mid, and higher frequency bands.
The spectral features which have been considered in the project are
I. Spectral centroid
II. Spectral roll-off
III. Spectral flux
1. Spectral centroid:
The spectral centroid is a measure used in digital signal processing to characterise a
spectrum. It indicates where the "center of mass" of the spectrum is. Perceptually, it has a robust
connection with the impression of "brightness" of a sound. It is calculated as the weighted mean
of the frequencies present in the signal, determined using a Fourier transform, with their
magnitudes as the weights:

Centroid = Σₙ f(n) x(n) / Σₙ x(n)

where x(n) represents the weighted frequency value, or magnitude, of bin number n, and f(n)
represents the centre frequency of that bin.
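A sketch for one frame, reusing frame, N and fs from the temporal-feature sketches above:

    X = abs(fft(frame));             % magnitude spectrum of the frame
    X = X(1:floor(N/2));             % keep the positive-frequency bins
    f = (0:floor(N/2)-1)' * fs / N;  % centre frequency of each bin, in Hz
    centroid = sum(f .* X) / sum(X); % magnitude-weighted mean frequency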
2. Spectral roll-off:
The roll-off is another measure of spectral shape. It is the frequency below which a
given percentage (usually 85%) of the total power spectrum resides.
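A sketch, reusing X and f from the centroid sketch (the 85% threshold follows the text):

    P = X .^ 2;                                  % power spectrum of the frame
    c = cumsum(P);                               % cumulative power over frequency
    iR = find(c >= 0.85 * c(end), 1, 'first');   % first bin reaching 85% of total
    rolloff = f(iR);                             % roll-off frequency, in Hz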
3. Spectral flux:
This feature measures the frame-to-frame spectral difference; in short, it captures changes in
the spectral shape. It is defined as the squared difference between the normalized magnitude
spectra of successive frames, summed over frequency bins.
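A sketch for frames i−1 and i (so it assumes i ≥ 2), reusing x and N from above; normalizing
each spectrum by its sum is one common convention, an assumption here:

    A = abs(fft(x((i-2)*N + 1 : (i-1)*N)));  % previous frame's magnitude spectrum
    B = abs(fft(x((i-1)*N + 1 : i*N)));      % current frame's magnitude spectrum
    A = A(1:floor(N/2)); A = A / sum(A);     % normalized positive-frequency magnitudes
    B = B(1:floor(N/2)); B = B / sum(B);
    flux = sum((B - A) .^ 2);                % squared difference, summed over bins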
Chapter 4
Classifier
Various classifiers exist, among them:
SVM (Support Vector Machine)
HMM (Hidden Markov Model)
k-NN (k-Nearest Neighbour)
What is a kNN classifier?
Instance-based classifiers such as the kNN classifier operate on the premise that the
classification of unknown instances can be done by relating the unknown to the known according
to some distance/similarity function. The intuition is that two instances far apart in the instance
space, as defined by the appropriate distance function, are less likely to belong to the same
class than two closely situated instances.
1. The learning process
Unlike many artificial learners, instance-based learners do not abstract any information from the
training data during the learning phase. Learning is merely a question of encapsulating the
training data. The process of generalization is postponed until it is absolutely unavoidable, that
is, at the time of classification. This property has led to instance-based learners being referred
to as lazy learners, whereas classifiers such as feed-forward neural networks, where proper
abstraction is done during the learning phase, are often called eager learners.
2. Classification
Classification (generalization) using an instance-based classifier can be a simple matter of
locating the nearest neighbour in instance space and labelling the unknown instance with the
same class label as that of the located (known) neighbour. This approach is often referred to as
a nearest neighbour classifier. The downside of this simple approach is the lack of robustness
that characterizes the resulting classifiers. The high degree of local sensitivity makes nearest
neighbour classifiers highly susceptible to noise in the training data.
More robust models can be achieved by locating the k nearest neighbours, where k > 1, and
letting the majority vote decide the outcome of the class labelling. A higher value of k results in
a smoother,
less locally sensitive, function. The nearest neighbour classifier can be regarded as a special case
of the more general k-nearest neighbours classifier, hereafter referred to as a kNN classifier. The
drawback of increasing the value of k is of course that as k approaches n, where n is the size of
the instance base, the performance of the classifier will approach that of the most
straightforward statistical baseline, the assumption that all unknown instances belong to the class
most frequently represented in the training data.
Fig 4. Example of kNN
3. Example of k-NN classification
The test sample (green circle) should be classified either to the first class of blue squares or to the
second class of red triangles. If k = 3 it is assigned to the second class because there are 2
triangles and only 1 square inside the inner circle. If k = 5 it is assigned to the first class (3
squares vs. 2 triangles inside the outer circle).
4. Algorithm
The training examples are vectors in a multidimensional feature space, each with a class label.
The training phase of the algorithm consists only of storing the feature vectors and class labels of
the training samples.
In the classification phase, k is a user-defined constant, and an unlabelled vector (a query or test
point) is classified by assigning the label which is most frequent among the k training samples
nearest to that query point.
Usually the Euclidean distance is used as the distance metric. Often, the classification accuracy
of k-NN can be improved significantly if the distance metric is learned with specialized
algorithms such as Large Margin Nearest Neighbour or Neighbourhood Components Analysis.
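A minimal MATLAB sketch of this procedure, assuming numeric class labels; the function name
and variable names are our own (train holds one feature vector per row, q is a row query
vector):

    function label = knn_classify(train, labels, q, k)
    % Majority-vote kNN: label a query by its k nearest training vectors.
    % train:  m-by-d matrix, one feature vector per row
    % labels: m-by-1 vector of numeric class labels
    % q:      1-by-d query feature vector
        m = size(train, 1);                              % number of training samples
        d = sqrt(sum((train - repmat(q, m, 1)).^2, 2));  % Euclidean distances
        [dSorted, idx] = sort(d);                        % order samples by distance
        label = mode(labels(idx(1:k)));                  % majority vote among k nearest
    end

For example, label = knn_classify(F, y, fq, 5); would classify the feature vector fq against
training features F with labels y using k = 5 (all names hypothetical).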
A drawback of the basic "majority voting" classification is that classes with more frequent
training examples tend to dominate the prediction of a new vector, as they tend to come up
among the k nearest neighbours simply because of their large number. One way to overcome this
problem is to weight the classification by taking into account the distance from the test point to
each of its k nearest neighbours, as sketched below.
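A sketch of the distance-weighted variant, reusing dSorted, idx and labels from the function
above; inverse-distance weighting is one common choice (our assumption):

    kLab  = labels(idx(1:k));            % labels of the k nearest neighbours
    w     = 1 ./ (dSorted(1:k) + eps);   % inverse-distance weights (eps avoids 1/0)
    cls   = unique(kLab);                % candidate classes among the k nearest
    score = zeros(size(cls));
    for j = 1:numel(cls)
        score(j) = sum(w(kLab == cls(j)));   % total weight voting for class j
    end
    [wMax, jBest] = max(score);
    label = cls(jBest);                  % class with the largest weighted vote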
Conclusion:
I have described a system that can listen to a musical instrument and recognize it. The
work started by reviewing human perception: how well humans can recognize different
instruments, and what underlying phenomena take place in the auditory system. Then I studied
the qualities of musical sounds that make them distinguishable from each other, as well as the
acoustics of musical instruments. The physical properties of instrument families were studied. The
knowledge of the perceptually salient acoustic cues possibly used by human subjects in
recognition was the basis for the development of feature extraction algorithms.
In the first evaluation, temporal features were demonstrated. Using the hierarchic
classifier architecture did not bring an improvement in the recognition accuracy. However, it was
concluded that the recognition rates in this experiment were highly optimistic because of
insufficient testing material. The next experiment addressed this problem by introducing a wide
data set including several examples of each instrument.
The next phase used spectral features, which showed more discriminative results. To get
more accurate results, the standard deviations of all features were taken into consideration. The
spectral features include the spectral centroid, spectral roll-off and spectral flux.
The within-instrument-family confusions made by the system were similar to those made by
human subjects, although the system made more confusions both inside and outside the families.
In the final experiment, techniques commonly used in speaker recognition were applied to
musical instrument recognition. The benefit of the approach is that it is directly applicable to
solo phrases.
In order to make truly realistic evaluations, more acoustic data would be needed,
including monophonic material. The environment and the differences between instrument
instances proved to have a more significant effect on the difficulty of the problem than was
expected at the beginning. In general, the task of reliably recognizing a wide set of instruments
from realistic monophonic recordings is not a trivial one; it is difficult for humans and especially
for computers. It becomes easier as longer segments of music are used and the recognition is
performed at the level of instrument families.
BIBLIOGRAPHY:
[1] Kaminskyj, Materka. (1995). “Automatic Source Identification of Monophonic Musical Instrument Sounds”. Proceedings of the IEEE Int. Conf. on Neural Networks, 1995.
[2] Opolko, F. & Wapnick, J. “McGill University Master Samples” (compact disk). McGill University,1987.
[3] Brown, Puckette. (1992). “An Efficient Algorithm for the Calculation of a Constant Q Transform”. J. Acoust. Soc. Am. 92, pp. 2698-2701.
[4] Fraser, Fujinaga. (1999). “Towards real-time recognition of acoustic musical instruments”. Proceedings of the International Computer Music Conference, 1999.
[5] Fujinaga. (1998). “Machine recognition of timbre using steady-state tone of acoustic musical instruments”. Proceedings of the International Computer Music Conference, 1998.
[6] Fujinaga. (2000). “Realtime recognition of orchestral instruments”. Proceedings of the International Computer Music Conference, 2000.
[7] Kostek. (1999). “Soft Computing in Acoustics: Applications of Neural Networks, Fuzzy Logic and Rough Sets to Musical Acoustics”. Physica-Verlag, 1999.
[8] Martin. (1998). “Musical instrument identification: A pattern-recognition approach“. Presented at the 136th meeting of the Acoustical Society of America, October 13, 1998.
[9] Rossing. (1990). “The Science of Sound“. Second edition, Addison-Wesley Publishing Co.
[10] Fletcher, Rossing. (1998). “The Physics of Musical Instruments“. Springer-Verlag New York, Inc.
[11] Kostek, Czyzewski. (2000). “Automatic Classification of Musical Sounds”. In Proc. 108th Audio Eng. Soc. Convention.
[12] Kostek, Czyzewski. (2001). “Automatic Recognition of Musical Instrument Sounds - Further Developments”. In Proc. 110th Audio Eng. Soc. Convention, Amsterdam, Netherlands, May 2001.
[13] Martin. (1999). “Sound-Source Recognition: A Theory and Computational Model”. Ph.D. thesis, MIT.
[14] Brown. (2001). “Feature dependence in the automatic identification of musical woodwind instruments”. J. Acoust. Soc. Am. 109(3), March 2001.