TRANSCRIPT
MUSICAL INSTRUMENT RECOGNITION
BY
Hrishikesh P. Kanjalkar
DEPARTMENT OF ELECTRONICS AND TELECOMMUNICATION
Index
Chapter 1 Introduction
1.1 Motivation for the work
1.2 Defining the problem
1.3 Theory of music
1.4 Musical notes and scales
Chapter 2 Literature review
2.1 Musical instruments recognition system
2.2 Comparison between artificial systems and human abilities
2.3 Physical properties of musical instruments
Chapter 3 Feature Extraction
3.1 Temporal Features
3.2 Spectral Features
Chapter 4 Classifier
Chapter 5 Conclusion
Chapter 6 Bibliography
Chapter 1
Introduction
1.1 Motivation for the work.
Motivation relates to the generic problem of sound source recognition and analysis of auditory
scenes. The idea is to compile a toolbox of generic feature extractors and classification methods
that can be applied to a variety of audio-related analysis and understanding problems. In fact,
some of the methods implemented for this study, and the knowledge gained, have already been
used in [2].
Secondly, there has been a great deal of research concerning the automatic annotation of
music files. Musical instrument recognition can be a great help to musicians in identifying
the particular notes being played. Just as Google is used for searching text, we could have a
music search engine where we play a note and also get basic information about it. This would
give musicians a better platform for their learning process.
1.2 Defining the problem.
This project aims to accurately detect the instrument family and instruments of a signal.
To accomplish this, we intend to record and analyze the entire range of a few instruments, and
then use this analysis to decompose monophonic, or one instrument, signals into their component
instruments. This project basically deals with recognizing a musical instrument from a played
note. Software is developed that is capable of listening to a recording and classifying the
instrument. Musical (audio) data is redundant in nature: one second of signal contains 44,100
samples (at a sampling frequency of 44.1 kHz). The input signal therefore requires feature
extraction, which helps in recognizing the instrument.
1.3 Theory of Music.
For those unfamiliar with music, we offer a (very) brief introduction to the technical
aspects of music.
The sounds you hear over the airwaves and in all manner of places may be grouped into
12 superficially disparate categories. Each category is labeled a "note" and given an
alphabetic symbolic representation. That is, the letters A through G represent seven of the
notes, and the other five are represented by appending either a pound sign (#, or sharp) or
a symbol that looks remarkably similar to a lower-case b (also called a flat).
Although these notes were conjured in an age where the modern theory of waves and
optics was not dreamt of even by the greatest of thinkers, they share some remarkable
characteristics. Namely, every note that shares its name with another (notes occupying
separate "octaves," with one sounding higher or lower than the other) has a frequency that is
a power-of-two multiple of the frequency of the notes with which it shares a name. More
simply, an A in one octave has a frequency twice that of the A one octave below.
As it turns out, every note is related to every other note by a common multiplicative
factor. To run the full gamut, one need only multiply a given note by the 12th root of two n
times to find the nth note "above" it (i.e. going up in frequency).
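In equal temperament this relationship can be written as f(n) = f0 × 2^(n/12), where f0 is a
reference frequency and n counts semitones. A minimal MATLAB sketch (the choice of
A4 = 440 Hz as the reference is our assumption, not fixed by the text):

    f0 = 440;             % reference note A4, in Hz (assumed reference)
    n  = 3;               % number of semitones above the reference
    fn = f0 * 2^(n/12);   % frequency n semitones up: ~523.25 Hz, the note C5
    % n = 12 gives one octave up (880 Hz); n = -12 gives one octave down (220 Hz).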
1.4 Musical notes, intervals and scales
Musical notes are the symbols or signs that represent the frequencies, durations and timings
of the elementary musical sounds. It can be said that musical notes play a similar role to the
alphabet of a language; they allow music compositions and scores to be recorded in symbolic
form and read and played by musicians. The systems of musical notes also allow standardization
of musical instruments and their tuning frequencies.
Table 1. The main musical note frequencies and symbols
Note that on a piano the sharp (or flat) notes are signified by black keys. In Table 1 the notes
are ordered in the C-major scale (C, D, E, F, G, A, B).
The Western musical note system, as shown in the table, is based on seven basic notes, also
known as the natural notes; these are:
[C_D_E_F_G_A_B]
There are also five ‘sharp’ notes:
[C#_D#_ F#_G#_A#]
and five ‘flat’ notes:
[Db_ Eb_Gb_Ab_Bb]
The hash sign # denotes a sharp note and the sign 'b' denotes a flat note. The sharp version of a
note is a semitone higher than that note, e.g. C# = 2^(1/12) × C, whereas the flat version of a
note is a semitone lower than that note, e.g. Db = D / 2^(1/12).
Musical Scales
In music theory, a musical scale is a specific pattern of the pitch ratios of successive notes.
The pitch difference between successive notes is known as a scale step. The scale step may be
constant or it may vary. Musical scales are usually known by the type of interval and scale step
that they contain, and are typically ordered in terms of the pitch or frequencies of notes. Some
of the most important examples of musical scales are the chromatic scale, diatonic scale and
Pythagorean scale. Musical material is written in terms of a musical scale.
Fig 1. The frequencies of the keys on a piano
Note that the piano keys are arranged in groups of 12. Each set of 12 keys spans an octave, which is a doubling of frequency. The frequency of A_N is 2^N × A0, i.e. N octaves higher than A0; for example A7 = 2^7 × 27.5 Hz = 3520 Hz. Black keys correspond to sharp notes.
Chapter 2
Literature Survey
2.1 Musical instruments recognition system.
Various attempts have been made to construct automatic musical instrument recognition systems.
Researchers have used different approaches and scopes, achieving different performances. Most
systems have operated on isolated notes, often taken from the same, single source, and having
notes over a very small pitch range. The most recent systems have operated on solo music taken
from commercial recordings. The studies using isolated tones and monophonic phrases are the
most relevant to our scope.
Recognition of single tones
These studies have used isolated notes as test material, with varying numbers of instruments and
pitches.
Studies using one example of each instrument
Kaminskyj and Materka used features derived from a root-mean-square (RMS) energy
envelope via PCA and used a neural network or a k-nearest neighbor (k-NN) classifier to classify
guitar, piano, marimba and accordion tones over a one-octave band [1]. Both classifiers achieved
a good performance, approximately 98 %. However, strong conclusions cannot be made since
the instruments were very different, there was only one example of each instrument, the note
range was small, and the training and test data were from the same recording session. More
recently, Kaminskyj [1] has extended the system to recognize 19 instruments over a three-octave
pitch range from the McGill collection [2]. Using features derived from the RMS-energy
envelope and the constant-Q transform [3], an accuracy of 82 % was reported using a classifier
combination scheme.
Table 2: Summary of recognition percentages of isolated-note recognition systems using only one example of each instrument.

Study          Percentage correct   Number of instruments
[Kaminskyj95]  98                   4 (guitar, piano, marimba and accordion)
[Kaminskyj00]  82                   19
[Fujinaga98]   50                   23
[Fraser99]     64                   23
[Fujinaga00]   68                   23
[Martin98]     72 (93)              14
[Kostek99]     97 / 81              4 (bass trombone, trombone, English horn and contra bassoon) / 20
[Kostek01]     93 / 90              4 (oboe, trumpet, violin, cello) / 18
Fujinaga and Fraser trained a k-NN with features extracted from 1338 spectral slices of 23
instruments playing a range of pitches [4]. Using leave-one-out cross validation and a genetic
algorithm for finding good feature combinations, a recognition accuracy of 50 % was obtained
with 23 instruments. When the authors added features relating to the dynamically changing
spectral envelope, and velocity of spectral centroid and its variance, the accuracy increased to 64
% [4]. Finally, after small refinements and adding spectral irregularity and tristimulus features,
an accuracy of 68% was reported [4].
Martin and Kim reported a system operating on full pitch ranges of 14 instruments [8]. The
samples were a subset of the isolated notes on the McGill collection [2]. The best classifier was
the k-NN, enhanced with the Fisher discriminant analysis to reduce the dimensions of the data,
and a hierarchical classification architecture for first recognizing the instrument families. Using
70 % / 30 % splits between the training and test data, they obtained a recognition rate of 72 %
for individual instruments and, after finding a 10-feature set giving the best average
performance, an accuracy of 93 % in classification between five instrument families.
Kostek has calculated several different features relating to the spectral shape and onset
characteristics of tones taken from chromatic scales with different articulation styles [7]. A two-
layer feed-forward neural network was used as a classifier. The author reports excellent
recognition percentages with four instruments: the bass trombone, trombone, English horn and
contra bassoon. However, the pitch of the note was provided for the system, and the training and
test material were from different channels of the same stereo recording setup.
Kostek and Czyzewski also tried using wavelet-analysis based features for musical instrument
recognition, but their preliminary results were worse than with the earlier features [11].
In the most recent paper, the same authors expanded their feature set to include 34 FFT-based
features, and 23 wavelet features [12]. A promising percentage of 90 % with 18 classes is
reported; however, the leave-one-out cross-validation scheme probably increases the recognition
rate. The results obtained with the wavelet features were almost as good as with the other
features. Table 2 summarizes the recognition percentages reported in isolated-note studies. The
most severe limitation of all these studies is that they all used only one example of each
instrument. This significantly decreases the generalizability of the results, as we will demonstrate
with our system later in this report.
2.2 Comparison between artificial systems and human abilities.
The current state-of-the-art in artificial sound source recognition is still very limited in its
practical applicability. Under laboratory conditions, the systems are able to successfully
recognize a wider set of sound sources. However, if the conditions become more realistic, i.e. the
material is noisy, recorded in different locations with different setups, or there are interfering
sounds, the systems are able to successfully handle only a small number of sound sources. The
main challenge for the future is to build systems that can recognize wider sets of sound sources
with increased generality and in realistic conditions [13].
In general, humans are superior with regard to all the evaluation criteria [13]. They are able
to generalize between different instances of instruments, and to recognize more abstract classes
such as bowed string instruments. People are robust recognizers: they are able to focus on the
sound of a single instrument in a concert, or on a single voice within a babble. In
addition, they are able to learn new sound sources easily, and learn to become experts in
recognizing, for example, orchestral instruments. The recognition accuracy of human subjects
gradually worsens as the level of background noise, and interfering sound sources increases.
Only in limited contexts, such as discriminating between four woodwind instruments, have
computer systems performed comparably to human subjects [14]. For more general tasks, a lot
of work remains to be done.
2.3 Physical properties of musical instrument.
Traditionally, the instruments are divided into four classes: the strings, the brass, the keyboards
and the woodwinds. The sounds of the instruments within each family are similar, and humans
often make confusions within, but not easily between, these families. Examples include
confusing the violin and viola, the oboe and English horn, or the trombone and French horn [13].
In the following, we briefly present the different members of each family and their physical
build.
The strings
The members of the string family include the violin, viola, cello and double bass, in order of
increasing size, together with the guitar. These five form a tight perceptual family, and human
subjects consistently make confusions within this family [13].
The string instruments consist of a wooden body with a top and back plate and sides, and an
extended neck. The strings are stretched along the neck and over a fingerboard. At one end the
strings are attached to the bridge, and at the other end to the tuning pegs, which control the
string tension. The strings can be excited by plucking them with the fingers, drawing a bow over
them or hitting them with the bow (the martelé style of playing). The strings themselves move
very little air; the sound is produced by the vibration of the body and the air within it [9]. These
are set into motion by the string vibration, which is transmitted to the body via the coupling
through the bridge. The motion of the top plate is the source of most of the sound, and is a
result of the interaction between the driving force from the bridge and the resonances of the
instrument body [9].
The brass
The members of the brass family considered in this project include the trumpet, French horn, and
tuba. The brass instruments have the simplest acoustic structure among the three families. They
consist of a long, hard walled tube with a flaring bell attached at one end. The sound is produced
by blowing at the other end of the tube, and the pitch of the instrument can be varied by changing
the lip tension. The player can use mutes to alter the sound or, in the case of the French horn,
insert a hand into the bell.
The woodwind
The woodwind family is more heterogeneous than the string and brass families, and there exist
several acoustically and perceptually distinct subgroups [13]. The subgroups are the single-reed
clarinets, the double reeds, the flutes with an air reed, and the single-reed saxophones. In
wind instruments, the single or double reed operates in a similar way as the player’s lips in brass
instruments, allowing puffs of air into a conical tube where standing waves are then created. The
effective length of the tube is varied by opening and closing tone holes, changing the pitch of the
played note. [10]
Flutes
The members of the flute or air reed family include the piccolo, flute, alto flute and bass flute in
the order of increasing size. They consist of a more or less cylindrical pipe, which has finger
holes along its length. The pipe is stopped at one end, and has a blowing hole near the stopped
end [10].
In this chapter I have discussed papers that identify a few dominant features for the
classification of musical instruments. From these papers I have studied the temporal, spectral
and cepstral features that are helpful in musical instrument recognition, and have also
considered various classification methods. The features I have considered are:
1. Temporal features
I. Decay time
II. Energy
III. Zero crossing
2. Spectral features
I. Spectral centroid
II. Spectral roll-off
III. Spectral flux
k-NN is used as the classifier.
Chapter 3
Methodology
3.1. System Block Diagram:
Fig 3.1 System Block Diagram
Database Creation: For this work we require a database consisting of 5 different families of instruments. The instruments being considered are listed below.
Table 3.2: Instruments and their respective families
The notes are 16 bit, mono channel with sampling frequency of 44.1 kHz. All the audio samples
were recorded in ‘.wav’ format. The platform used for this work is MATLAB (R2009b).
(Recording was done by a professional in a studio.)
(Fig 3.1 shows the processing chain: Music Samples → Feature Calculation → Classification → Trained Model → Decision Box.)
Families   Instruments
String     Guitar, Violin, Harp, Santoor, Sitar, Sarod, Banjo
Keyboard   Piano, Accordion
Woodwind   Flute, Oboe, Clarinet
Brass      French Horn, Trumpet, Tuba
Block diagram of Training Model:
Fig 3.3
Feature extraction:
The feature extraction stage, also called the front-end processor, generates training vectors.
Our main intention is to investigate the performance of different feature schemes and to
find a good feature combination for a robust instrument classifier. Two categories of features are
being considered, temporal and spectral. Features studied for classification are mentioned in the
table below.
Table 3.4: Features studied

Temporal Features: energy, ZCR, ZCRM, ZCRMD, long attack time, ADSR, amplitude envelope.
Spectral Features: spectral distribution, harmonicity, spectral centroid, roll-off, skewness, flux,
flatness, crest, spectral mean deviation, fundamental frequency, timbre, spectral range,
bandwidth, MFCC, LPC, relative power.
Features in time domain:
To extract temporal features, the music sound samples were segmented into 10-ms frames.
The following are some important time domain features used in this project.
I. Energy of the signal
II. Zero crossing
III. Attack time
IV. Decay time
1. Energy
The energy of each frame is calculated as the sum of the squared samples:

E = Σₙ x(n)², where the sum runs over the N samples of the frame.

If the energy in the analysis window is high, the implication is that the frame is voiced
(vowel/diphthong/semivowel/voiced consonant); if the energy is low, the frame is unvoiced
(unvoiced consonant/silence). Thus the energy analysis helps to detect:
– voiced and silent regions,
– silence and non-silence parts of the signal.
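A minimal MATLAB sketch of the framing and per-frame energy computation (the file name
'note.wav' is our assumption; the 10-ms frame length follows the text):

    [x, fs] = wavread('note.wav');   % read the 16-bit mono recording
    N  = round(0.010 * fs);          % 10-ms frame = 441 samples at 44.1 kHz
    nF = floor(length(x) / N);       % number of complete frames
    E  = zeros(nF, 1);
    for i = 1:nF
        frame = x((i-1)*N + 1 : i*N);    % i-th 10-ms frame
        E(i)  = sum(frame .^ 2);         % energy = sum of squared samples
    end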
2. Zero crossing:
A zero crossing is said to occur between samples x(t) and x(t+1) if
sign(x(t)) ≠ sign(x(t+1)).
The zero-crossing rate (ZCR) is the number of such sign changes in an analysis frame.
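A sketch of the per-frame zero-crossing count, reusing x, N and nF from the energy sketch
above:

    Z = zeros(nF, 1);
    for i = 1:nF
        frame = x((i-1)*N + 1 : i*N);
        % a sign change between adjacent samples means their product is negative
        Z(i) = sum(frame(1:end-1) .* frame(2:end) < 0);
    end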
3. Attack time:
The rise time is taken as the time difference between the time at the end of attack and the
backtracked position where the magnitude is 25% of the magnitude at the end of attack.
Ta = t1 – t2
where
Ta is the attack time,
t1 is the time where the amplitude of the signal is at its maximum (the end of attack),
t2 is the backtracked time where the amplitude is 25% of that maximum.
4. Decay time:
The decay time is obtained as the time difference between the end of attack and the
forward position where the magnitude is 25% of the magnitude at the end of attack.
Td = t3 – t4
where
Td is the decay time,
t3 is the time when the amplitude of the signal has fallen to 25% of the maximum, after the
maximum amplitude has occurred,
t4 is the time when the signal has its maximum amplitude.
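A hedged sketch of both measurements from a smoothed amplitude envelope of the whole signal
x (the 256-sample smoothing window is our assumption, not a value from the text; the find calls
assume the envelope actually crosses the 25% level on both sides of the peak):

    env = filter(ones(256,1)/256, 1, abs(x));       % crude amplitude envelope
    [pk, iMax] = max(env);                          % end of attack = envelope peak
    iA = find(env(1:iMax) <= 0.25*pk, 1, 'last');   % backtrack to 25% of the peak
    iD = iMax - 1 + find(env(iMax:end) <= 0.25*pk, 1, 'first');  % forward to 25%
    Ta = (iMax - iA) / fs;                          % attack (rise) time, seconds
    Td = (iD - iMax) / fs;                          % decay time, seconds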
3.2 Spectral features
While some instruments generate sounds which have energy concentrated in the lower
frequency bands, there are other instruments which produce sounds with energy almost
evenly distributed among lower, mid, and higher frequency bands.
The spectral features which have been considered in the project are
I. Spectral centroid
II. Spectral roll-off
III. Spectral flux
1. Spectral centroid:
The spectral centroid is a measure used in digital signal processing to characterise a
spectrum. It indicates where the "center of mass" of the spectrum is. Perceptually, it has a robust
connection with the impression of "brightness" of a sound. It is calculated as the weighted mean
of the frequencies present in the signal, determined using a Fourier transform, with their
magnitudes as the weights:

Centroid = Σₙ f(n) x(n) / Σₙ x(n)

where x(n) represents the weighted frequency value, or magnitude, of bin number n, and f(n)
represents the centre frequency of that bin.
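A sketch for one frame, reusing frame, N and fs from the temporal-feature sketches above:

    X = abs(fft(frame));             % magnitude spectrum of the frame
    X = X(1:floor(N/2));             % keep the positive-frequency bins
    f = (0:floor(N/2)-1)' * fs / N;  % centre frequency of each bin, in Hz
    centroid = sum(f .* X) / sum(X); % magnitude-weighted mean frequency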
2. Spectral roll-off:
The roll-off is another measure of spectral shape. It is the frequency below which a
given percentage (usually 85%) of the total power spectrum resides.
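A sketch, reusing X and f from the centroid sketch (the 85% threshold follows the text):

    P = X .^ 2;                                  % power spectrum of the frame
    c = cumsum(P);                               % cumulative power over frequency
    iR = find(c >= 0.85 * c(end), 1, 'first');   % first bin reaching 85% of total
    rolloff = f(iR);                             % roll-off frequency, in Hz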
3. Spectral flux:
This feature measures the frame-to-frame spectral difference; in short, it captures changes in
the spectral shape. It is defined as the squared difference between the normalized magnitude
spectra of successive frames, summed over frequency bins.
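A sketch for frames i−1 and i (so it assumes i ≥ 2), reusing x and N from above; normalizing
each spectrum by its sum is one common convention, an assumption here:

    A = abs(fft(x((i-2)*N + 1 : (i-1)*N)));  % previous frame's magnitude spectrum
    B = abs(fft(x((i-1)*N + 1 : i*N)));      % current frame's magnitude spectrum
    A = A(1:floor(N/2)); A = A / sum(A);     % normalized positive-frequency magnitudes
    B = B(1:floor(N/2)); B = B / sum(B);
    flux = sum((B - A) .^ 2);                % squared difference, summed over bins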
Chapter 4
Classifier
Various classifiers exist, among them:
SVM (Support Vector Machine)
HMM (Hidden Markov Model)
k-NN (k-Nearest Neighbour)
What is a kNN classifier?
Instance-based classifiers such as the kNN classifier operate on the premise that the
classification of unknown instances can be done by relating the unknown to the known according
to some distance/similarity function. The intuition is that two instances far apart in the instance
space, as defined by the appropriate distance function, are less likely to belong to the same
class than two closely situated instances.
1. The learning process
Unlike many artificial learners, instance-based learners do not abstract any information from the
training data during the learning phase. Learning is merely a question of encapsulating the
training data. The process of generalization is postponed until it is absolutely unavoidable, that
is, at the time of classification. This property has led to instance-based learners being referred
to as lazy learners, whereas classifiers such as feed-forward neural networks, where proper
abstraction is done during the learning phase, are often called eager learners.
2. Classification
Classification (generalization) using an instance-based classifier can be a simple matter of
locating the nearest neighbour in instance space and labelling the unknown instance with the
same class label as that of the located (known) neighbour. This approach is often referred to as
a nearest neighbour classifier. The downside of this simple approach is the lack of robustness
that characterizes the resulting classifiers. The high degree of local sensitivity makes nearest
neighbour classifiers highly susceptible to noise in the training data.
More robust models can be achieved by locating the k nearest neighbours, where k > 1, and
letting the majority vote decide the outcome of the class labelling. A higher value of k results in
a smoother,
less locally sensitive, function. The nearest neighbour classifier can be regarded as a special case
of the more general k-nearest neighbours classifier, hereafter referred to as a kNN classifier. The
drawback of increasing the value of k is of course that as k approaches n, where n is the size of
the instance base, the performance of the classifier will approach that of the most
straightforward statistical baseline, the assumption that all unknown instances belong to the class
most frequently represented in the training data.
Fig 4. Example of kNN
3. Example of k-NN classification
The test sample (green circle) should be classified either to the first class of blue squares or to the
second class of red triangles. If k = 3 it is assigned to the second class because there are 2
triangles and only 1 square inside the inner circle. If k = 5 it is assigned to the first class (3
squares vs. 2 triangles inside the outer circle).
4. Algorithm
The training examples are vectors in a multidimensional feature space, each with a class label.
The training phase of the algorithm consists only of storing the feature vectors and class labels of
the training samples.
In the classification phase, k is a user-defined constant, and an unlabelled vector (a query or test
point) is classified by assigning the label which is most frequent among the k training samples
nearest to that query point.
Usually the Euclidean distance is used as the distance metric. Often, the classification accuracy
of k-NN can be improved significantly if the distance metric is learned with specialized
algorithms such as Large Margin Nearest Neighbour or Neighbourhood Components Analysis.
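A minimal MATLAB sketch of this procedure, assuming numeric class labels; the function name
and variable names are our own (train holds one feature vector per row, q is a row query
vector):

    function label = knn_classify(train, labels, q, k)
    % Majority-vote kNN: label a query by its k nearest training vectors.
    % train:  m-by-d matrix, one feature vector per row
    % labels: m-by-1 vector of numeric class labels
    % q:      1-by-d query feature vector
        m = size(train, 1);                              % number of training samples
        d = sqrt(sum((train - repmat(q, m, 1)).^2, 2));  % Euclidean distances
        [dSorted, idx] = sort(d);                        % order samples by distance
        label = mode(labels(idx(1:k)));                  % majority vote among k nearest
    end

For example, label = knn_classify(F, y, fq, 5); would classify the feature vector fq against
training features F with labels y using k = 5 (all names hypothetical).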
A drawback of the basic "majority voting" classification is that classes with more frequent
training examples tend to dominate the prediction of a new vector, as they tend to come up
among the k nearest neighbours simply because of their large number. One way to overcome this
problem is to weight the classification by taking into account the distance from the test point to
each of its k nearest neighbours, as sketched below.
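A sketch of the distance-weighted variant, reusing dSorted, idx and labels from the function
above; inverse-distance weighting is one common choice (our assumption):

    kLab  = labels(idx(1:k));            % labels of the k nearest neighbours
    w     = 1 ./ (dSorted(1:k) + eps);   % inverse-distance weights (eps avoids 1/0)
    cls   = unique(kLab);                % candidate classes among the k nearest
    score = zeros(size(cls));
    for j = 1:numel(cls)
        score(j) = sum(w(kLab == cls(j)));   % total weight voting for class j
    end
    [wMax, jBest] = max(score);
    label = cls(jBest);                  % class with the largest weighted vote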
Conclusion:
I have described a system that can listen to a musical instrument and recognize it. The
work started by reviewing human perception: how well humans can recognize different
instruments, and what underlying phenomena take place in the auditory system. Then I studied
the qualities of musical sounds that make them distinguishable from each other, as well as the
acoustics of musical instruments. The physical properties of instrument families were studied. The
knowledge of the perceptually salient acoustic cues possibly used by human subjects in
recognition was the basis for the development of feature extraction algorithms.
In the first evaluation, temporal features were demonstrated. Using the hierarchic
classifier architecture did not bring an improvement in the recognition accuracy. However, it was
concluded that the recognition rates in this experiment were highly optimistic because of
insufficient testing material. The next experiment addressed this problem by introducing a wide
data set including several examples of each instrument.
The next phase used spectral features, which showed more discriminative results. To get
more accurate results, the standard deviations of all features were taken into consideration. The
spectral features include the spectral centroid, spectral roll-off and spectral flux.
The within-instrument-family confusions made by the system were similar to those made by
human subjects, although the system made more confusions both inside and outside the families.
In the final experiment, techniques commonly used in speaker recognition were applied to
musical instrument recognition. The benefit of the approach is that it is directly applicable to
solo phrases.
In order to make truly realistic evaluations, more acoustic data would be needed,
including monophonic material. The environment and the differences between instrument
instances proved to have a more significant effect on the difficulty of the problem than was
expected at the beginning. In general, the task of reliably recognizing a wide set of instruments
from realistic monophonic recordings is not a trivial one; it is difficult for humans and especially
for computers. It becomes easier as longer segments of music are used and the recognition is
performed at the level of instrument families.
BIBLIOGRAPHY:
[1] Kaminskyj, Materka. (1995). “Automatic Source Identification of Monophonic Musical Instrument Sounds”. Proceedings of the IEEE Int. Conf. on Neural Networks, 1995.
[2] Opolko, F. & Wapnick, J. “McGill University Master Samples” (compact disk). McGill University,1987.
[3] Brown, Puckette. (1992). “An Efficient Algorithm for the Calculation of a Constant Q Transform”. J. Acoust. Soc. Am. 92, pp. 2698-2701.
[4] Fraser, Fujinaga. (1999). “Towards real-time recognition of acoustic musical instruments”. Proceedings of the International Computer Music Conference, 1999.
[5] Fujinaga. (1998). “Machine recognition of timbre using steady-state tone of acoustic musical instruments”. Proceedings of the International Computer Music Conference, 1998.
[6] Fujinaga. (2000). “Realtime recognition of orchestral instruments”. Proceedings of the International Computer Music Conference, 2000.
[7] Kostek. (1999). “Soft Computing in Acoustics: Applications of Neural Networks, Fuzzy Logic and Rough Sets to Musical Acoustics”. Physica-Verlag, 1999.
[8] Martin. (1998). “Musical instrument identification: A pattern-recognition approach“. Presented at the 136th meeting of the Acoustical Society of America, October 13, 1998.
[9] Rossing. (1990). “The Science of Sound“. Second edition, Addison-Wesley Publishing Co.
[10] Fletcher, Rossing. (1998). “The Physics of Musical Instruments“. Springer-Verlag New York, Inc.
[11] Kostek, Czyzewski. (2000). “Automatic Classification of Musical Sounds”. In Proc. 108th Audio Eng. Soc. Convention.
[12] Kostek, Czyzewski. (2001). “Automatic Recognition of Musical Instrument Sounds - Further Developments”. In Proc. 110th Audio Eng. Soc. Convention, Amsterdam, Netherlands, May 2001.
[13] Martin. (1999). “Sound-Source Recognition: A Theory and Computational Model”. Ph.D. thesis, MIT.
[14] Brown. (2001). “Feature dependence in the automatic identification of musical woodwind instruments”. J. Acoust. Soc. Am. 109(3), March 2001.