
Music Classification using Neural Networks

    Paul Scott [email protected]

    EE 373B Project

    Prof. Bernard Widrow

    Spring 2001


1. Introduction

Neural networks have found profound success in the area of pattern recognition. By repeatedly showing a neural network inputs classified into groups, the network can be trained to discern the criteria used for classification, and it can do so in a generalized manner, allowing successful classification of new inputs not used during training. With the explosion of digital music in recent years due to Napster and the Internet, the application of pattern recognition technology to digital audio has become increasingly interesting.

On the user end, many people have downloaded large collections of music files (e.g. MP3s and WAVs) that are often stored in directory structures classified by genre or artist. Thus one can imagine the usefulness of a program that would automatically classify and store newly downloaded music using the existing classification system set by the user. A second useful program would be one that searches through a collection of files and extracts only those with characteristics chosen by the user. For instance, a user may want to search through a library of files stored on a computer in Austria for those of the classical music genre, but due to a language difference and the Austrian user's own file-naming preferences, determining the genre of each file from file names alone may be very difficult. A program that makes classifications based on music content would be much more appropriate and useful.

On Napster's and the recording industry's ends, classification of music based on content is necessary for ensuring that copyrighted music is not freely distributed across the Internet. Filters based on file names have been found to be very ineffective, for clever users simply alter the names of the files to circumvent such filters. What is needed is a classification system that looks only at the content of the file to make its classification decisions; such a system would be much more effective, since altering the content of the file is not an appealing option for users.

The purpose of this project is to study the feasibility of a music classification system based on music content using a neural network. Figure 1 below is a block diagram of the classification system. A 1.5 second audio file stored in WAV format is passed to a feature extraction function. The WAV format for digital audio is simply the left and right stereo signal samples. The feature extraction function calculates 124 numerical features that characterize the sample. When training the system, this feature extraction process is performed on many different input WAV files to create a matrix of column feature vectors. This matrix is then preprocessed to reduce the number of inputs to the neural network and then sent to the neural network for training. After training, single column vectors can be fed to the preprocessing block, which processes them in the same manner as the training vectors, and then classified by the neural network. The following sections discuss each of these blocks in more detail.


Figure 1. Block diagram of the digital audio classification system.
[1.5 second WAV file -> Feature Extraction -> feature vector -> Preprocessor -> preprocessed feature vector -> Neural Network -> Classification Vector]

2. System Setup

This section describes the setup of the digital audio classification system. The system is composed primarily of the blocks above and was developed in the Matlab environment. Matlab code can be provided upon request.

2.1 Input Files

Data for training and testing the system was taken from ten compact discs: four classified as rock (labeled R01-R04), two classified as classical (C01 and C02), two classified as soul or R&B (S01 and S02), and two classified as country and western (W01 and W02). The four rock CDs are recorded by four different artists. A complete source listing for these CDs can be found in Appendix A. The tracks on each of these CDs were extracted, converted to WAV format, and then divided into segments of length 2^18 samples, or roughly six seconds. To avoid periods within the music not characteristic of the whole song, the segments were all taken from the middle of each track. From this procedure I was able to produce 2,781 segments of music. The segments were then further divided into two sub-segments by extracting the first 2^16 samples (1.5 seconds) and the third 2^16 samples. Thus, in total, I generated 5,562 sub-segments of music to use for training and testing the system. For classification by genre, CDs R01, R02, C01, C02, S01, S02, W01, and W02 were used. For classification by artist, the four rock CDs were used.
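As a concrete illustration, here is a minimal Matlab sketch of this segmentation procedure. The file name is hypothetical, wavread was the standard WAV reader in Matlab at the time, and the handling of the stereo channels is an assumption on my part:

% Carve a track into a middle segment and 1.5 second sub-segments (sketch)
[y, Fs] = wavread('track01.wav');            % stereo samples, Fs = 44100
x = mean(y, 2);                              % collapse to mono (assumption)
segLen = 2^18;                               % one six-second segment
mid = floor(length(x) / 2);                  % segment taken from the track middle
seg = x(mid - segLen/2 + 1 : mid + segLen/2);
sub1 = seg(1 : 2^16);                        % first 2^16 samples (1.5 seconds)
sub3 = seg(2*2^16 + 1 : 3*2^16);             % third 2^16 samples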



2.2 Feature Extraction

Ideally, all the samples in the WAV file would be passed to the neural network, and the neural network would determine the best way to process the data to arrive at a classification of the file. However, at a sampling rate of 44.1 kHz, even a one second sample of audio would result in a prohibitive amount of information for the neural network and Matlab. Therefore, a feature extraction function is needed to reduce the amount of data passed to the neural network. Extracting useful features from a digital audio sample is an evolving science and remains a popular research field. From the infinite number of calculations that could be performed, this system uses only 124. These features fall into six categories, described below. Table 1 outlines the format of the feature vector.

Table 1. Feature vector format.

Feature Numbers    Feature Type
1-32               LPC Taps
33-64              DFT Amplitude Values
65-96              Log of DFT Amplitude Values
97-108             IDFT of Log of DFT Amplitude Values
109-123            Mel-Frequency Cepstral Coefficients
124                Volume
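In code, a feature vector laid out as in Table 1 might be assembled as follows. The helper function names are hypothetical stand-ins for the calculations described in sections 2.2.1 through 2.2.4 below:

% Assemble the 124-element feature vector of Table 1 (helper names hypothetical)
featVec = [ lpcTaps(x)      ;    % features 1-32:    LPC taps (Section 2.2.1)
            dftBins(x)      ;    % features 33-64:   binned DFT amplitudes (2.2.2)
            log(dftBins(x)) ;    % features 65-96:   log of DFT amplitudes (2.2.2)
            idftLogDft(x)   ;    % features 97-108:  IDFT of log DFT amplitudes (2.2.2)
            mfccFeats(x)    ;    % features 109-123: 15 MFCCs (Section 2.2.3)
            var(x)          ];   % feature 124:      volume as sample variance (2.2.4)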

2.2.1 Linear Predictive Coding Taps

In linear predictive coding (LPC), a signal is modeled by the following equation:

y[n+1] = w0*y[n] + w1*y[n-1] + w2*y[n-2] + ... + w(L-1)*y[n-L+1] + e[n+1]

The goal of this model is to predict the next sample of the signal by linearly combining the L most recent samples while minimizing the mean squared error over the entire signal. The weights (the w_i's) are determined by using an adaptive filter and the LMS algorithm. For this system, the music segments were modeled using 32 taps (L = 32). A block diagram of the adaptive filter used is shown below in Figure 2. To speed up the execution time required to calculate the LPC taps, the code was written in C and compiled using the Matlab MEX compiler, which resulted in a very significant decrease in execution time.

2.2.2 Frequency Content

Frequency content was found to be an important feature for classifying music. Three different frequency content calculations were performed and included in the feature vectors. The first frequency content features calculated were the amplitude values of the discrete Fourier transform (DFT) of the signal. Because the sampling rate for the WAV files was 44.1 kHz, the DFT of the audio sample shows only the frequency content up to 22 kHz. Initial analysis of the audio signals being tested revealed that the vast majority of the frequency power lies in the lower portion of this spectrum; therefore, the signals were sampled at T = 2 before taking the DFT to effectively zoom in on the lower half of the spectrum. The positive values of the DFT spectrum were then grouped into 32 evenly spaced bins, and the average spectral energies in each of the bins were reported as 32 features. The second calculation was to take the natural logarithm of the 32 DFT amplitude values and report these as 32 additional features. These features emphasize the differences in the values at frequencies with very small DFT amplitudes, which are mostly the higher frequencies, and are provided to distinguish different samples by their higher frequency content. The final calculation was to take the inverse DFT of the logarithm of the amplitude values of the DFT. The lower 12 values of this calculation were reported as 12 more features and were included to further emphasize the higher frequencies of the samples.
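A Matlab sketch of these three calculations, under my reading of the text (the exact binning and averaging details are assumptions):

% Frequency-content features from a 2^16-sample sub-segment x (sketch)
x2 = x(1:2:end);                       % sample at T = 2 to zoom in on the lower half
A  = abs(fft(x2));
A  = A(1:floor(length(A)/2));          % keep the positive-frequency amplitudes
w  = floor(length(A) / 32);            % width of each of the 32 evenly spaced bins
dftFeat = zeros(32, 1);
for k = 1:32
    dftFeat(k) = mean(A((k-1)*w+1 : k*w));   % average amplitude in bin k
end
logFeat  = log(dftFeat);               % 32 log-amplitude features
c        = real(ifft(log(A)));         % IDFT of the log of the DFT amplitudes
idftFeat = c(1:12);                    % lower 12 values as features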

2.2.3 Mel-Frequency Cepstral Coefficients

Mel-Frequency Cepstral Coefficients (MFCCs) have been used very successfully in the field of speech recognition as classification features for speech audio signals. The processing sequence for finding the MFCCs of an audio signal is the following:

1. Window the data with a Hamming window.
2. Find the amplitude values of the DFT of the data.
3. Convert the amplitude values to filter bank outputs.
4. Calculate the log base 10.
5. Find the cosine transform.
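These steps are packaged in the Auditory Toolbox's mfcc routine, which this project used [3]. The call below is a sketch; the argument list and the reduction of the per-frame output to the fifteen reported values are assumptions on my part:

% MFCCs via Malcolm Slaney's Auditory Toolbox (sketch; see [3])
ceps = mfcc(x, 44100);          % cepstral coefficients, one column per frame
mfccFeat = mean(ceps, 2);       % collapse frames to one coefficient vector (assumption)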

Figure 2. Adaptive filter for calculating LPC taps.
[Tapped delay line of z^-1 elements with weights w0, w1, w2, ..., w(L-1), summed to form the prediction of y[n+1]; the difference between y[n+1] and the prediction is the error e[n+1]]
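The project computed the taps in C through the MEX interface for speed; the pure-Matlab sketch below shows the LMS recursion of Figure 2, with the step size chosen purely for illustration:

% LMS adaptive filter for the 32 LPC taps of Section 2.2.1 (sketch)
L  = 32;                          % number of taps
w  = zeros(L, 1);                 % adaptive weights
mu = 1e-3;                        % LMS step size (illustrative assumption)
for n = L : length(x) - 1
    u    = x(n:-1:n-L+1);         % the L most recent samples
    yhat = w' * u;                % prediction of the next sample
    e    = x(n+1) - yhat;         % prediction error e[n+1]
    w    = w + 2 * mu * e * u;    % Widrow-Hoff weight update
end
lpcFeat = w;                      % the converged taps are features 1-32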


Returning to the MFCC calculation, the filter bank consists of 40 triangle filters, with 13 spaced linearly by 133.33 Hz and 27 spaced logarithmically by a factor of 1.0711703 in frequency. The DFT amplitude values are combined using these triangle filters to form the filter bank outputs. Code developed by Malcolm Slaney as part of his Auditory Toolbox was used to calculate the MFCC values. Fifteen MFCC values were reported as features and included in the feature vector [3].

2.2.4 Volume

The volume of a musical piece is easily calculated as the variance of the samples.

2.3 Data Preprocessing

The feature vectors returned by the feature extraction block were preprocessed before being input to the neural network. Two types of preprocessing were performed: one to scale the data to fall within the range of -1 to 1, and one to reduce the length of the input vector. The data was divided into three sets: one for training, one for validation, and one for testing. The preprocessing parameters were determined using the matrix containing all feature vectors used for training and validation. For testing, these same parameters were used to preprocess test feature vectors before passing them to the trained neural network.

The first preprocessing function used was premnmx, which preprocesses the data so that the minimum and maximum of each feature across all training and validation feature vectors are -1 and 1. Premnmx returns two parameters, minp and maxp, which were used with the function tramnmx for preprocessing the test feature vectors. The second preprocessing function used was prepca, which performs principal component analysis on the training and validation feature vectors. Principal component analysis is used to reduce the dimensionality of the feature vectors from a length of 124 to a length more manageable by the neural network. It does this by orthogonalizing the features across all feature vectors, ordering the features so that those with the most variation come first, and then removing those that contribute least to the variation [4]. Prepca was used with a value of 0.001 so that only those features that contribute to 99.9% of the variation were kept. This procedure reduced the length of the feature vectors by one half. Prepca returns the matrix transMat, which is used with the function trapca to perform the same principal component analysis procedure on the test feature vectors as was performed on the training and validation feature vectors. This was done before passing the test feature vectors to the trained neural network.
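Putting Section 2.3 in Neural Network Toolbox terms, with P the 124-by-N matrix of training and validation feature vectors and Ptest the test vectors (the variable names are mine):

% Scale to [-1, 1] and reduce dimensionality by PCA (Section 2.3)
[Pn, minp, maxp]   = premnmx(P);           % per-feature scaling to [-1, 1]
[Ptrans, transMat] = prepca(Pn, 0.001);    % keep 99.9% of the variation
% Apply the identical transforms to the test feature vectors
PtestN     = tramnmx(Ptest, minp, maxp);
PtestTrans = trapca(PtestN, transMat);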


2.4 Neural Network

A three-layer feedforward backpropagation neural network, shown in Figure 3, was used for classifying the feature vectors. By trial and error, an architecture consisting of 20 adalines in the input layer, 10 adalines in the middle layer, and 3 adalines in the output layer was found to provide good performance. The transfer function used for all adalines was a tangent sigmoid, tansig. The Levenberg-Marquardt backpropagation algorithm, trainlm, was used to train the neural network.

2.5 Classification Vectors

Two music classification systems were implemented and tested, one to classify by genre and one to classify by artist. Figure 4 shows the constellations used for each of these classification systems, and Table 2 lists the specific coordinates of the constellation for each classification scheme. The constellations were chosen so that all points were equidistant from each other, all coordinates were within the -1 to 1 range, and the distance between points was maximized. Originally, a two-dimensional constellation was used.

Figure 3. Neural network used for classification.
[Input feature vector feeding a layer of 20 adalines, then 10 adalines, then 3 adalines producing the classification vector]


However, the increased distance between points gained by moving to three dimensions provided a significant performance increase. Constellations of dimension greater than three did not provide a significant enough performance increase to justify the added computational complexity.

    Figure 4. Classification constellation.

Table 2. Constellation coordinates.

Classification by Genre     Coordinate
Rock                        ( 1,  1,  1)
Classical                   (-1, -1,  1)
Soul/R&B                    ( 1, -1, -1)
Country & Western           (-1,  1, -1)

Classification by Artist    Coordinate
R01                         ( 1,  1,  1)
R02                         (-1, -1,  1)
R03                         ( 1, -1, -1)
R04                         (-1,  1, -1)
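A sketch of the network construction and training described in sections 2.4 and 3.1: newff, train, and sim are the Neural Network Toolbox calls of that era, the structure VV is how that toolbox supplied a validation set for early stopping, and the variable names are mine:

% Build and train the 20-10-3 tansig network (sketch)
net = newff(minmax(Ptrain), [20 10 3], {'tansig','tansig','tansig'}, 'trainlm');
net.trainParam.mu     = 1;       % trainlm settings reported in Section 3.1
net.trainParam.mu_dec = 0.8;     % slow convergence to limit overfitting
net.trainParam.mu_inc = 1.5;
VV.P = Pval;  VV.T = Tval;       % validation set enables the validation stop
[net, tr] = train(net, Ptrain, Ttrain, [], [], VV);
out = sim(net, PtestTrans);      % 3-by-N classification vectors for the test data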

3. Results

This section discusses the results of training and testing the classification system. Two separate sets of results are presented, one for classification by genre and one for classification by artist.

3.1 Classification by Genre

To test the performance of the music classification system, the system was first configured to classify music by genre. The four genres used were rock, classical, soul/R&B, and country and western. The first step in performing this test was to generate the data set. As discussed above, the data set was taken from eight CDs, two per genre, and consisted of 4,425 feature vectors. Of these 4,425 feature vectors, 2,213 were used for training and the other 2,212 were reserved for testing.


Before training, data preprocessing was performed on the training data, as discussed above. After preprocessing, the training data was divided further into two groups, one for training and one for validation. A validation data set was needed to ensure that the neural network did not overfit the data.

The next step was to create the neural network discussed above in the system setup section. The training function used was the Levenberg-Marquardt backpropagation algorithm, trainlm. The parameters mu, mu_dec, and mu_inc of trainlm were set to 1, 0.8, and 1.5 in order to ensure that the algorithm did not converge too quickly, which helped to limit the amount of overfitting that occurred before a validation stop of the training. Figure 5 below shows the MSE versus training epoch plot; both the training data MSE and validation data MSE curves are shown. The MSE reached 0.0228 before a validation stop occurred.

After training, the system was tested using the data set reserved for testing. Before passing the test feature vectors to the trained neural network, data preprocessing was performed using the saved parameters from the preprocessing of the training data. The results are summarized in Tables 3 and 4. Figure 6 shows a three-dimensional plot of the output vectors of the neural network for each of the test input vectors. The decision rule used for classifying the output of the neural network was a minimum distance rule: a decision was made by first calculating the distance from the output of the neural network to each of the constellation points and then choosing the constellation point that produced the minimum distance.

Figure 5. MSE versus training epoch, classification by genre. (Training data: solid line; validation data: dashed line)
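The minimum distance rule is a few lines of Matlab. In this sketch, C holds the genre constellation points of Table 2 as columns, and out is a single 3-by-1 network output:

% Minimum-distance decision over the constellation points (sketch)
C = [ 1  -1   1  -1 ;             % Rock, Classical, Soul/R&B, C&W (Table 2)
      1  -1  -1   1 ;
      1   1  -1  -1 ];
d = sum((C - out * ones(1, 4)).^2, 1);   % squared distance to each point
[dmin, class] = min(d);                  % the nearest constellation point wins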


Genre classification was performed at an overall success rate of 94.8%, with classical music classified the most successfully, at 96.7%, and country and western, soul/R&B, and rock music classified less successfully, at success rates of 91.0%, 93.1%, and 93.3%, respectively. The separation in success rates between classical music and the other three genres was expected, since the four genres are not equally distinct in style.

Table 3. System output classifications, classification by genre.

                          System Output Classification
Correct Classification    Rock   Classical   Soul/R&B   C&W
Rock                       482       1           5        29
Classical                    3     645           0        19
Soul/R&B                    11       4         501        22
C&W                         22      13           9       447

Table 4. Error rates, classification by genre.

Rock                 .0677
Classical            .0330
Soul/R&B             .0688
Country & Western    .0896
Total                .0624

Figure 6. Outputs of the trained neural network, classification by genre.


Classical music is definitely the genre that stands out as the most distinct among the four, while country and western, rock, and soul/R&B can be grouped as musical genres of a somewhat similar style. Country and western, rock, and soul/R&B have each influenced one another throughout their growth into separate musical genres, and thus one would expect several features of each genre to be mimicked in the other two. Furthermore, of the three non-classical genres, country and western music was the genre most often classified incorrectly as classical music. This was also an expected result, since country and western music features instruments that are the most similar to those used in classical music (i.e. stringed instruments such as the acoustic guitar and violin).

3.2 Classification by Artist

To further test the music classification system, the system was configured to classify music by artist. Four rock artists were used, which I will call R01, R02, R03, and R04. Data for this test was taken from the four rock CDs listed in Appendix A. The training and testing of this system were performed identically to the training and testing of the system for classifying by genre. From the four CDs, 2,187 feature vectors were extracted and split into two equal groups, one for training and one for testing. The training data set was then further divided to form the training and validation data sets. Training was performed using the same preprocessing, training function, and parameters as described in the classification by genre section. Figure 7 below shows the MSE versus training epoch plot; both the training data MSE and validation data MSE curves are shown. The MSE reached 1.81e-5 before a validation stop occurred. Comparing Figures 5 and 7, it is evident that more overfitting occurred when training the system to classify by artist, which is discussed further below.

Figure 7. MSE versus training epoch, classification by artist. (Training data: solid line; validation data: dashed line)


After training, the system was tested using the feature vectors reserved for testing. The results are summarized in Tables 5 and 6, and Figure 8 shows a three-dimensional plot of the output vectors of the neural network for each of the test input vectors.

Table 5. System output classifications, classification by artist.

                          System Output Classification
Correct Classification    R01    R02    R03    R04
R01                       182     12     14      2
R02                         4    283      6      9
R03                        18      3    242      4
R04                         4      7      0    305

Table 6. Error rates, classification by artist.

R01      .1333
R02      .0629
R03      .0936
R04      .0348
Total    .0758

Figure 8. Outputs of the trained neural network, classification by artist.


Classifying music by artists within the same genre is definitely a much more difficult task than classifying by genre. The reason is that one has to extract features that distinguish subtleties in style among music samples that are primarily performed using the same instruments, tempo, and volume. However, it was exciting to discover that the system was still able to perform at an overall success rate of 92.4%. Further research into feature extraction would improve this rate even more.

From Tables 5 and 6, it is evident that some overfitting still occurred despite attempts to minimize it, for the ordering of success rates is the same as the ordering of the amounts of extracted data, and thus there is a somewhat large range among the success rates. However, this range is not solely caused by overfitting but, as discussed in the classification by genre section, is also a result of the four artists not being equally distinct in style.

As a quick experiment, the trained system was tested using data from a fifth CD recorded by one of the four artists. The system did not perform well under this test. The reason is that artists' styles evolve with time, and thus the features in the extracted feature vectors also change with time, or in this case, from CD to CD. This highlights the problem of finding features that distinguish an artist's style over a career, or several CDs, which is much more difficult than finding features that distinguish an artist's style on a single CD. Training the neural network with data from the fifth CD as well as the other four would have increased the success rate, but the problem of style change with time would still most likely have led to much worse success rates than those listed in Table 6.

4. Future Work

The music classification system was developed as a feasibility study for the development of a generic system that would classify music files, very successfully, according to a classification scheme set by the user. This section discusses future work that needs to be done in order to further study this problem and develop such a system.

4.1 More Advanced Feature Extraction

The field of music feature extraction is a rich research area, for improving feature extraction will most likely have the largest impact on the performance of a music classification system. For the system detailed in this paper, feature vectors were extracted from 1.5 second music samples, and although the system performed well, 1.5 seconds does not capture all the characteristics of an entire song. What is needed is a feature extraction method that looks at more of the song in an attempt to capture not only short-time features but also long-time features that describe how the song evolves over time. One way to implement this is to simply use entire songs as the input to the feature extractor, but at high sampling rates this leads to a prohibitively large amount of data for the feature extractor to process. A second approach would be to send several small samples, such as 1.5 second samples, equally spaced throughout a song to the feature extractor.


The feature extractor could then extract short-time features from each of the samples and produce long-time features by examining how the extracted short-time features evolve with time. A feature extractor that considers an entire song would be a start towards developing a more advanced feature extractor, but even more needs to be done. Probably the toughest problem to be solved is how to extract features that describe the very personal performance style of a musical piece. These are the features that will be necessary for making correct decisions when the differences between pieces of music are very subtle, as occurs when classifying music by artists within the same genre.

4.2 More Advanced Decision Rule

The decision rule used in this system assumes that the noise that drives the output of the neural network away from the constellation points is equal among all classification categories. From the results of the classification by genre section, Figure 6, this assumption is obviously not accurate. A more advanced decision rule that partitions the output space into classification regions in a more clever manner would definitely lead to better results. The main approach to implementing this is to observe the outputs while training and to assign larger regions to the classification groups experiencing the most noise or deviation. By providing more room for error for the noisier classification groups, the error rate will be driven closer to zero and better balanced among all groups.

4.3 MP3 Files instead of WAV Files

Given the popularity of the MP3 format for digital audio, a system that would take MP3 files as input instead of WAV files is desired. The system presented in this paper can easily be converted to take MP3 files as input by prepending an MP3-to-WAV converter to Figure 1. This approach is valid and may be the best choice, but currently converting MP3 files to WAV files is a computationally intense procedure that requires a somewhat significant amount of execution time. However, as computer performance continues to advance, this problem will become negligible. An alternate approach is to design a system that works exclusively with MP3 files, that is, extracts features directly from files in the MP3 format. The drawback of this approach is that new methods for extracting features from highly compressed data would have to be researched, and most of the current feature extraction research would become irrelevant. However, highly compressed data may contain valuable features not obvious in the uncompressed data, making such research worthwhile. This leads to the idea of creating a hybrid system that extracts features from both the WAV and MP3 versions of a file, thus using the best of both worlds.


4.4 More Classification Categories

To make a music classification tool useful, the number of classification categories needs to be increased to more than four. Implementing this improvement would involve work in several different areas. For instance, one would need to find a way of determining the constellation dimensionality needed to provide enough distance between points for acceptable system performance. Another area that will need work is feature extraction, for more advanced feature extraction may be necessary to provide a sufficient set of features for the neural network to have enough information to discern more than four classifications. Also, a more advanced decision rule will be needed to provide a clever partitioning strategy for the output space, so that categories experiencing more noise are given more room for error.

One alternate approach to increasing the number of categories is to set up a categorization system in the form of a tree structure. If each node in the tree has a maximum of four children, then the four-category classification system presented in this paper could be used to move down the tree from node to node until a category at the bottom of the tree is reached, as sketched below. Such a system would require a separate trained neural network for each node in the tree, but it would avoid many of the issues discussed above involved in implementing a flat categorization system.
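A sketch of the tree idea, assuming one trained network per node; the node structure and its field names are hypothetical:

% Descend the category tree, one trained network per node (hypothetical fields)
node = root;
while ~isempty(node.children)
    out = sim(node.net, featVec);                 % network output at this node
    d   = sum((node.points - out * ones(1, size(node.points, 2))).^2, 1);
    [dmin, branch] = min(d);                      % nearest constellation point
    node = node.children{branch};                 % follow that branch
end
category = node.label;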

5. Sources

1. B. Widrow and S. Stearns, Adaptive Signal Processing, Prentice Hall, 1985.
2. B. Widrow and M. Lehr, "30 Years of Adaptive Neural Networks," Proceedings of the IEEE, Vol. 78, No. 9, September 1990.
3. M. Slaney, Auditory Toolbox for Matlab, Version 2, Interval Research Corporation.
4. H. Demuth and M. Beale, Neural Network Toolbox User's Guide, Version 4, The MathWorks, 2001.
5. S. Haykin, Neural Networks, 2nd Edition, Prentice Hall, 1998.
6. T. Zhang and J. Kuo, "Content Based Classification and Retrieval of Audio," SPIE's 43rd Annual Meeting, Conference on Advanced Signal Processing Algorithms, Architectures, and Implementations VIII, SPIE Vol. 3461, pp. 432-443, July 1998.


APPENDIX A

Data for training and testing the music classification system were taken from the CDs listed below. CDs R01, R02, C01, C02, S01, S02, W01, and W02 were used for classification by genre, and CDs R01, R02, R03, and R04 were used for classification by artist.

R01  Nirvana, Nevermind, The David Geffen Company, 1991.
R02  Smashing Pumpkins, Siamese Dream, Virgin Records, 1993.
R03  Pearl Jam, Ten, Sony Music Entertainment Inc., 1991.
R04  Metallica, Metallica, Elektra Entertainment, 1991.
C01  Pachelbel Kanon and the Greatest Hits of the Baroque, Intersound, 1990.
C02  20 Classical Favorites, Madacy Records, No. 201, 1994.
S01  BET: Best of Planet Groove, Virgin Records, 1999.
S02  The Platinum Collection, Arista Records, 2000.
W01  Country Road Songs, Disc 1, Harley-Davidson, 1996.
W02  Country Road Songs, Disc 2, Harley-Davidson, 1996.