Music Information Retrieval based on multi-label cascade classification system


TRANSCRIPT

  • Music Information Retrieval based on multi-label cascade classification system

    presented by Zbigniew W. Ras

    www.kdd.uncc.edu

    http://www.mir.uncc.edu
    CCI, UNC-Charlotte
    Research sponsored by NSF IIS-0414815, IIS-0968647

  • Collaborators:

    Alicja Wieczorkowska (Polish-Japanese Institute of IT, Warsaw, Poland)
    Krzysztof Marasek (Polish-Japanese Institute of IT, Warsaw, Poland)

    Former PhD students:
    Elzbieta Kubera (Maria Curie-Sklodowska University, Lublin, Poland)
    Rory Lewis (University of Colorado at Colorado Springs, USA)
    Wenxin Jiang (Fred Hutchinson Cancer Research Center in Seattle, USA)
    Xin Zhang (University of North Carolina, Pembroke, USA)
    Jacek Grekow (Bialystok University of Technology, Poland)

    Current PhD student:
    Amanda Cohen-Mostafavi (University of North Carolina, Charlotte, USA)

  • Goal: Design and implement a system for automatic indexing of music by instruments (an objective task) and emotions (a subjective task).
    MIRAI - Musical Database (mostly MUMS) [music pieces played by 57 different music instruments].
    Outcome: a musical database represented as an FS-tree, guaranteeing efficient storage and retrieval [music pieces indexed by instruments and emotions].

  • MIRAI - Musical Database [music pieces played by 57+ different music instruments (see below) and described by over 910 attributes]:

    Alto Flute, Bach-trumpet, bass-clarinet, bassoon, bass-trombone, Bb trumpet, b-flat clarinet, cello, cello-bowed, cello-martele, cello-muted, cello-pizzicato, contrabassclarinet, contrabassoon, crotales, c-trumpet, ctrumpet-harmonStemOut, doublebass-bowed, doublebass-martele, doublebass-muted, doublebass-pizzicato, eflatclarinet, electric-bass, electric-guitar, englishhorn, flute, frenchhorn, frenchHorn-muted, glockenspiel, marimba-crescendo, marimba-singlestroke, oboe, piano-9ft, piano-hamburg, piccolo, piccolo-flutter, saxophone-soprano, saxophone-tenor, steeldrums, symphonic, tenor-trombone, tenor-trombone-muted, tuba, tubular-bells, vibraphone-bowed, vibraphone-hardmallet, viola-bowed, viola-martele, viola-muted, viola-natural, viola-pizzicato, violin-artificial, violin-bowed, violin-ensemble, violin-muted, violin-natural-harmonics, xylophone.

  • Automatic Indexing of Music

    What is needed? A database of monophonic and polyphonic music signals and their descriptions in terms of new features (including temporal ones), in addition to the standard MPEG-7 features. These signals are labeled by instruments and emotions, forming additional features called decision features.

    Why is it needed? To build classifiers for automatic indexing of musical sound by instruments and emotions.

  • MIRAI - Cooperative Music Information Retrieval System based on Automatic Indexing
    [System flowchart labels: User, Query, Query Adapter, Indexed Audio Database, Instruments, Durations, Music Objects, Empty Answer?]

  • Challenges to applying KDD in MIR: the nature and types of raw data

    Data source      | Organization | Volume     | Type                  | Quality
    Traditional data | Structured   | Modest     | Discrete, categorical | Clean
    Audio data       | Unstructured | Very large | Continuous, numeric   | Noisy

  • Traditional pattern recognition pipeline: lower-level raw signal data -> sampling (0.12 s frame size, 0.04 s hop size) -> feature extraction (MATLAB) -> manageable feature database -> higher-level representations used for classification, clustering, and regression.
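    The sampling step above can be illustrated with a short sketch (Python/NumPy purely for illustration; the slides mention MATLAB). The signal, sample rate, and function name are hypothetical; only the 0.12 s frame size and 0.04 s hop size come from the slide.

```python
import numpy as np

def frame_signal(signal, sr, frame_size=0.12, hop_size=0.04):
    """Slice a mono signal into overlapping analysis frames.

    frame_size and hop_size are in seconds (0.12 s / 0.04 s as on the slide).
    """
    frame_len = int(round(frame_size * sr))
    hop_len = int(round(hop_size * sr))
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frames.append(signal[start:start + frame_len])
    return np.array(frames)

# Example: one second of a 440 Hz sine sampled at 44.1 kHz -> 23 overlapping frames
sr = 44100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
print(frame_signal(x, sr).shape)
```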

  • MPEG-7 feature extraction [flow: signal -> Hamming window -> STFT (NFFT FFT points) -> power spectrum, signal envelope, fundamental frequency, harmonic peaks detection]: Instantaneous Harmonic Spectral Centroid, Instantaneous Harmonic Spectral Deviation, Instantaneous Harmonic Spectral Spread, Instantaneous Harmonic Spectral Variation, Spectral Centroid, Temporal Centroid, Log Attack Time.

  • Derived Database: MPEG-7 features, non-MPEG-7 features & new temporal features

    Non-MPEG-7 and temporal features: Roll-Off, Flux, Mel frequency cepstral coefficients (MFCC), Tristimulus and similar parameters (contents of odd and even partials - Od, Ev), mean frequency deviation for low partials, changing ratios of spectral spread, changing ratios of spectral centroid.

    MPEG-7 features: Spectrum Centroid, Spectrum Spread, Spectrum Flatness, Spectrum Basis Functions, Spectrum Projection Functions, Log Attack Time, Harmonic Peaks, ...

  • New Temporal Features: S'(i), C'(i), S''(i), C''(i)

    S'(i) = [S(i+1) - S(i)] / S(i);   C'(i) = [C(i+1) - C(i)] / C(i)

    where S(i+1), S(i) and C(i+1), C(i) are the spectral spread and spectral centroid of two consecutive frames: frame i+1 and frame i.

    The changing ratios of spectral spread and spectral centroid for two consecutive frames are treated as the first derivatives of the spectral spread and spectral centroid.

    Following the same method we calculate the second derivatives:

    S''(i) = [S'(i+1) - S'(i)] / S'(i);   C''(i) = [C'(i+1) - C'(i)] / C'(i)

    Remark: the sequence [S(i), S(i+1), S(i+2), ..., S(i+k)] can be approximated by a polynomial p(x) = a0 + a1*x + a2*x^2 + a3*x^3 + ..., giving new features a0, a1, a2, a3, ...
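    A minimal sketch of these temporal features, assuming per-frame spread and centroid values are already available (the values and function name below are hypothetical):

```python
import numpy as np

def changing_ratios(values):
    """Changing-ratio feature: [v(i+1) - v(i)] / v(i) over consecutive frames."""
    v = np.asarray(values, dtype=float)
    return (v[1:] - v[:-1]) / v[:-1]

# Per-frame spectral spread S(i) and centroid C(i) (hypothetical values)
S = np.array([1200.0, 1250.0, 1180.0, 1300.0, 1350.0, 1290.0])
C = np.array([2100.0, 2150.0, 2080.0, 2200.0, 2260.0, 2210.0])

dS, dC = changing_ratios(S), changing_ratios(C)      # first derivatives S', C'
ddS, ddC = changing_ratios(dS), changing_ratios(dC)  # second derivatives S'', C''

# Polynomial approximation of the spread sequence: p(x) = a0 + a1*x + a2*x^2 + a3*x^3
a3, a2, a1, a0 = np.polyfit(np.arange(len(S)), S, deg=3)
print(dS, ddS, [a0, a1, a2, a3])
```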

  • Classification confidence with temporal features

    Experiment with WEKA: 19 instruments [flute, piano, violin, saxophone, vibraphone, trumpet, marimba, french-horn, viola, bassoon, clarinet, cello, trombone, accordion, guitar, tuba, english-horn, oboe, double-bass]; J48 with 0.25 confidence factor for tree pruning and a minimum of 10 instances per leaf; KNN with 3 neighbors and Euclidean distance as the similarity function.

    Experiment | Features               | Classifier    | Confidence
    1          | S, C                   | Decision Tree | 80.47%
    2          | S, C, S', C'           | Decision Tree | 83.68%
    3          | S, C, S', C', S'', C'' | Decision Tree | 84.76%
    4          | S, C                   | KNN           | 80.31%
    5          | S, C, S', C'           | KNN           | 84.07%
    6          | S, C, S', C', S'', C'' | KNN           | 85.51%
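    For readers without WEKA, a rough scikit-learn analogue of the setup above is sketched below. This is an assumption-laden translation: J48's 0.25 confidence-factor pruning has no direct scikit-learn counterpart, and cross-validated accuracy is used here only as a stand-in for the reported confidence.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# min_samples_leaf=10 mirrors "minimum number of instances per leaf 10";
# n_neighbors=3 with Euclidean distance mirrors the KNN setup on the slide.
tree = DecisionTreeClassifier(min_samples_leaf=10)
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")

def confidence(clf, X, y):
    """Mean 10-fold cross-validated accuracy (a stand-in for the slide's 'confidence')."""
    return cross_val_score(clf, X, y, cv=10).mean()

# Tiny synthetic demo; the real feature vectors would be S, C and their derivatives.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
print(confidence(tree, X, y), confidence(knn, X, y))
```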

  • [Confusion matrices: the left one is from Experiment 1, the right one from Experiment 3. Correctly classified instances are highlighted in green, incorrectly classified instances in yellow.]

  • [Charts: precision, recall, and F-score of the decision tree for each instrument.]

  • Polyphonic sounds - how to handle them?
    Two approaches: single-label classification based on sound separation, and multi-labeled classifiers.

    Sound Separation Flowchart: polyphonic sound -> get frame -> segmentation -> feature extraction -> classifier -> get instrument.

    Problem: information loss during the signal subtraction.

  • Timbre estimation in polyphonic sounds and designing multi-labeled classifiers

    Timbre-relevant descriptors: Spectrum Centroid, Spectrum Spread, Spectrum Flatness Band Coefficients, Harmonic Peaks, Mel frequency cepstral coefficients (MFCC), Tristimulus.

  • Timbre estimation based on a multi-label classifier: feature extraction -> timbre descriptors -> classifier -> ranked list of instrument candidates with confidences, e.g.:

    Instrument  | Confidence
    Candidate 1 | 70%
    Candidate 2 | 50%
    ...         | ...
    Candidate N | 10%

  • Timbre Estimation Results based on different methods
    [Instruments - 45; Training Data (TD) - 2917 single-instrument sounds from MUMS; testing on 308 mixed sounds randomly chosen from TD; window size 1 s, frame size 120 ms, hop size 40 ms (~25 frames per window); Mel-frequency cepstral coefficients (MFCC) extracted from each frame. A threshold of 0.4 controls the total number of estimations for each index window.]

    Experiment | Pitch based | Sound Separation | N(Labels) max | Recall | Precision | F-score
    1          | Yes         | Yes/No           | 1             | 54.55% | 39.2%     | 45.60%
    2          | Yes         | Yes              | 2             | 61.20% | 38.1%     | 46.96%
    3          | Yes         | No               | 2             | 64.28% | 44.8%     | 52.81%
    4          | Yes         | No               | 4             | 67.69% | 37.9%     | 48.60%
    5          | Yes         | No               | 8             | 68.3%  | 36.9%     | 47.91%
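    A minimal sketch of the multi-label frame estimation described above, assuming a classifier that returns per-instrument confidences; the 0.4 threshold comes from the slide, while the label cap and instrument names are hypothetical:

```python
import numpy as np

def multilabel_estimate(proba, class_names, threshold=0.4, max_labels=4):
    """Return up to max_labels instrument candidates whose confidence exceeds threshold.

    proba: 1-D array of per-instrument confidences for one index window.
    """
    order = np.argsort(proba)[::-1][:max_labels]
    return [(class_names[i], float(proba[i])) for i in order if proba[i] >= threshold]

# Hypothetical confidences for one window
names = ["flute", "trombone", "violin", "piano"]
print(multilabel_estimate(np.array([0.70, 0.50, 0.35, 0.10]), names))
```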

    [Embedded chart data: "Single Label vs Multiple Label" and "Separation vs Non-Separation" recall charts, plotting the recall values from the table above plus one additional configuration (not pitch based, no separation, 4 labels: recall 70.13%), and a recognition-rate comparison:]

    Experiment | Description                                                     | Recognition Rate
    1          | Feature-based and separation + Decision Tree (n=1)              | 36.49%
    2          | Feature-based and separation + Decision Tree (n=2)              | 48.65%
    3          | Spectrum Match + KNN (k=1; n=2)                                 | 79.41%
    4          | Spectrum Match + KNN (k=5; n=2)                                 | 82.43%
    5          | Spectrum Match + KNN (k=5; n=2), without percussion instruments | 87.10%

  • Polyphonic Sounds

    Multi-label flow: polyphonic sound (window) -> get frame -> feature extraction -> classifiers -> multiple labels.

    The features are compressed representations of the signal: Harmonic Peaks, Mel Frequency Cepstral Coefficients (MFCC), Spectral Flatness, etc. Irrelevant information (inharmonic frequencies or partials) is removed.

    Violin and viola have similar MFCC patterns, and so do double bass and guitar; it is difficult to distinguish them in polyphonic sounds. More information from the raw signal is needed.

  • Short-Term Power Spectrum - a low-level representation of the signal (calculated by STFT).
    [Figure: power spectrum patterns of flute & trombone can still be seen in the mixture; each spectrum slice is 0.12 seconds long.]

  • Experiment:

    Middle C instrument sounds (pitch equal to C4 in MIDI notation, frequency 261.6 Hz).

    Training set: power spectra from 3323 frames, extracted by STFT from 26 single-instrument sounds: electric guitar, bassoon, oboe, B-flat clarinet, marimba, C trumpet, E-flat clarinet, tenor trombone, French horn, flute, viola, violin, English horn, vibraphone, accordion, electric bass, cello, tenor saxophone, B-flat trumpet, bass flute, double bass, alto flute, piano, Bach trumpet, tuba, and bass clarinet.

    Testing set: fifty-two audio files, each mixed (using Sound Forge) from two of these 26 single-instrument sounds.

    Classifiers: (1) KNN with Euclidean distance (spectrum-match based classification); (2) Decision Tree (multi-label classification based on previously extracted features).
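    A minimal sketch of the spectrum-match idea, assuming the per-frame power spectra are already extracted: KNN with Euclidean distance over raw power-spectrum vectors, returning the top-n instrument labels (data and names below are hypothetical):

```python
import numpy as np
from collections import Counter

def spectrum_match(test_spectrum, train_spectra, train_labels, k=5, n=2):
    """KNN on raw power-spectrum vectors (Euclidean distance), top-n instrument labels."""
    dists = np.linalg.norm(train_spectra - test_spectrum, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return [label for label, _ in votes.most_common(n)]

# Hypothetical training spectra (rows) and labels
train_spectra = np.random.rand(100, 2048)
train_labels = np.random.choice(["flute", "trombone", "violin"], size=100)
print(spectrum_match(np.random.rand(2048), train_spectra, train_labels))
```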

  • Timbre Pattern Match Based on Power Spectrum (n - number of labels assigned to each frame; k - parameter for KNN)

    Experiment | Description                                                     | Recall | Precision | F-score
    1          | Feature-based + Decision Tree (n=2)                             | 64.28% | 44.8%     | 52.81%
    2          | Spectrum Match + KNN (k=1; n=2)                                 | 79.41% | 50.8%     | 61.96%
    3          | Spectrum Match + KNN (k=5; n=2)                                 | 82.43% | 45.8%     | 58.88%
    4          | Spectrum Match + KNN (k=5; n=2), without percussion instruments | 87.1%  |           |


  • Schema I - Hornbostel-Sachs
    [Hierarchy labels: Aerophone, Chordophone, Membranophone, Idiophone; Free, Single Reed, Side, Lip Vibration, Whip; example instruments: Alto Flute, Flute, C Trumpet, French Horn, Tuba, Oboe, Bassoon.]

  • Schema II - Play Methods
    [Hierarchy labels: Muted, Pizzicato, Bowed, Picked, Shaken, Blown; example instruments: Piccolo, Flute, Bassoon, Alto Flute.]

  • Decision Table (Xin Cynthia Zhang)

    Obj | Classification Attributes (CA1 ... CAn) | Hornbostel-Sachs                 | Play Method
    1   | 0.22 ... 0.28                           | [Aerophone, Side, Alto Flute]    | [Blown, Alto Flute]
    2   | 0.31 ... 0.77                           | [Idiophone, Concussion, Bell]    | [Concussive, Bell]
    3   | 0.05 ... 0.21                           | [Chordophone, Composite, Cello]  | [Bowed, Cello]
    4   | 0.12 ... 0.11                           | [Chordophone, Composite, Violin] | [Martele, Violin]

  • Example: hierarchical classification and decision attributes. Classification attribute values form a tree (Level I: C[1], C[2]; Level II: C[2,1], C[2,2]); decision attribute values form a tree as well (Level I: d[1], d[2], d[3]; Level II: d[3,1], d[3,2]).

    X  | a    | b    | c      | d
    x1 | a[1] | b[2] | c[1]   | d[3]
    x2 | a[1] | b[1] | c[1]   | d[3,1]
    x3 | a[1] | b[2] | c[2,2] | d[1]
    x4 | a[2] | b[2] | c[2]   | d[1]

  • Instrument granularity - classifiers are trained at each level of the hierarchical tree (Hornbostel-Sachs). We do not include membranophones, because instruments in this family usually do not produce harmonic sound and therefore need special techniques to be identified.

  • Modules of the cascade classifier for single instrument estimation - Hornbostel-Sachs, pitch 3B.
    The confidence of the flat (one-step) classifier is 91.80%; composing the level-one classifier (96.02%) with the level-two classifier (98.94%) gives 96.02% * 98.94% = 95.00%.
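    The cascade confidence is simply the product of the per-level classifier confidences; a one-line check with the numbers from this slide:

```python
# Cascade confidence = product of per-level classifier confidences
# (numbers from the pitch-3B example on this slide).
level_confidences = [0.9602, 0.9894]   # class(S, d, 2), then class(S, d[1,1], 3)
cascade = 1.0
for c in level_confidences:
    cascade *= c
print(f"{cascade:.4f}")                # 0.9500, vs. 0.9180 for the flat classifier
```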

  • New Experiment:

    Middle C instrument sounds (pitch equal to C4 in MIDI notation, frequency 261.6 Hz).

    Training set: 2762 frames extracted from the following instrument sounds: electric guitar, bassoon, oboe, B-flat clarinet, marimba, C trumpet, E-flat clarinet, tenor trombone, French horn, flute, viola, violin, English horn, vibraphone, accordion, electric bass, cello, tenor saxophone, B-flat trumpet, bass flute, double bass, alto flute, piano, Bach trumpet, tuba, and bass clarinet.

    Classifiers (WEKA): (1) KNN with Euclidean distance (spectrum-match based classification); (2) Decision Tree (classification based on previously extracted features).

    Confidence - the ratio of correctly classified instances over the total number of instances.

  • Classification on different Feature Groups

    Group | Feature description                                        | KNN confidence | Decision Tree confidence
    A     | 33 Spectrum Flatness Band Coefficients                     | 99.23%         | 94.69%
    B     | 13 MFCC coefficients                                       | 98.19%         | 93.57%
    C     | 28 Harmonic Peaks                                          | 86.60%         | 91.29%
    D     | 38 Spectrum projection coefficients                        | 47.45%         | 31.81%
    E     | Log spectral centroid, spread, flux, rolloff, zerocrossing | 99.34%         | 99.77%

  • Feature and classifier selection at each level of the cascade system (root level: KNN + Band Coefficients)

    Node (level 1) | Feature           | Classifier
    chordophone    | Band Coefficients | KNN
    aerophone      | MFCC coefficients | KNN
    idiophone      | Band Coefficients | KNN

    Node (level 2)    | Feature           | Classifier
    chrd_composite    | Band Coefficients | KNN
    aero_double-reed  | MFCC coefficients | KNN
    aero_lip-vibrated | MFCC coefficients | KNN
    aero_side         | MFCC coefficients | KNN
    aero_single-reed  | Band Coefficients | Decision Tree
    idio_struck       | Band Coefficients | KNN

  • Classification on the combination of different feature groups
    [Charts: classification based on KNN; classification based on Decision Tree.]

  • From these two experiments, we see that:

    The KNN classifier works better with feature vectors such as spectral flatness coefficients, projection coefficients, and MFCC; the decision tree works better with harmonic peaks and statistical features.

    Simply adding more features together does not improve the classifiers and sometimes even worsens classification results (for example, adding harmonic peaks to other feature groups).

  • HIERARCHICAL STRUCTURE BUILT BY CLUSTERING ANALYSIS

    Seven common methods to calculate the distance or similarity between clusters: single linkage (nearest neighbor), complete linkage (furthest neighbor), unweighted pair-group method using arithmetic averages (UPGMA), weighted pair-group method using arithmetic averages (WPGMA), unweighted pair-group method using the centroid average (UPGMC), weighted pair-group method using the centroid average (WPGMC), and Ward's method.

    Six most common distance functions: Euclidean; Manhattan; Canberra (examines the sum of a series of fractional differences between the coordinates of a pair of objects); Pearson correlation coefficient (PCC), which measures the degree of association between objects; Spearman's rank correlation coefficient; and Kendall's tau (counts the number of pairwise disagreements between two lists).

    Clustering algorithm: HCLUST (agglomerative hierarchical clustering), R package.
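    The experiments use R's hclust; a rough Python analogue with SciPy is sketched below as an illustration (only some of the listed linkage methods and distance functions map directly, e.g. ward/complete/average linkage and euclidean/cityblock/canberra/correlation distances). The feature matrix and cluster count are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_frames(features, method="ward", metric="euclidean", n_clusters=37):
    """Agglomerative clustering of frame feature vectors; returns a cluster ID per frame."""
    dists = pdist(features, metric=metric)   # condensed pairwise distance matrix
    Z = linkage(dists, method=method)        # hierarchical (agglomerative) clustering
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Hypothetical frame features (e.g. 33 flatness coefficients for 200 frames)
X = np.random.rand(200, 33)
ids = cluster_frames(X)
print(len(set(ids)))
```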

  • Testing Datasets (MFCC, flatness coefficients, harmonic peaks):

    The middle C pitch group, which contains 46 different musical sound objects. Each sound object is segmented into multiple 0.12 s frames, and each frame is stored as an instance in the testing dataset: 2884 frames in total.

    This dataset is represented by 3 different sets of features (MFCC, flatness coefficients, and harmonic peaks).

    Total number of experiments = 3 feature sets x 7 linkage methods x 6 distance functions = 126.

    Clustering: when the algorithm finishes the clustering process, a particular cluster ID is assigned to each single frame.

  • Contingency Table derived from the clustering result

                 | Cluster 1 | ... | Cluster j | ... | Cluster n
    Instrument 1 | X11       | ... | X1j       | ... | X1n
    Instrument i | Xi1       | ... | Xij       | ... | Xin
    Instrument n | Xn1       | ... | Xnj       | ... | Xnn

  • Evaluation results of the Hclust algorithm (the 14 results with the highest score among the 126 experiments); w - number of clusters, accuracy - average clustering accuracy over all instruments, score = accuracy * w.

    Feature               | Method   | Metric    | Accuracy | w  | Score
    Flatness Coefficients | ward     | pearson   | 87.3%    | 37 | 32.30
    Flatness Coefficients | ward     | euclidean | 85.8%    | 37 | 31.74
    Flatness Coefficients | ward     | manhattan | 85.6%    | 36 | 30.83
    mfcc                  | ward     | kendall   | 81.0%    | 36 | 29.18
    mfcc                  | ward     | pearson   | 83.0%    | 35 | 29.05
    Flatness Coefficients | ward     | kendall   | 82.9%    | 35 | 29.03
    mfcc                  | ward     | euclidean | 80.5%    | 35 | 28.17
    mfcc                  | ward     | manhattan | 80.1%    | 35 | 28.04
    mfcc                  | ward     | spearman  | 81.3%    | 34 | 27.63
    Flatness Coefficients | ward     | spearman  | 83.7%    | 33 | 27.62
    Flatness Coefficients | ward     | maximum   | 86.1%    | 32 | 27.56
    mfcc                  | ward     | maximum   | 79.8%    | 34 | 27.12
    Flatness Coefficients | mcquitty | euclidean | 88.9%    | 30 | 26.67
    mfcc                  | average  | manhattan | 87.3%    | 30 | 26.20
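    A small sketch of the scoring described above (score = average accuracy * number of clusters), computed from the contingency table. The per-instrument accuracy is assumed here to be the fraction of an instrument's frames falling into its single best cluster, which is one plausible reading of the slide, not a confirmed definition.

```python
import numpy as np

def clustering_score(contingency):
    """Score = (average per-instrument accuracy) * (number of clusters w).

    contingency[i, j] = number of frames of instrument i assigned to cluster j.
    Per-instrument accuracy is assumed to be the share of frames in the best cluster.
    """
    per_instrument_acc = contingency.max(axis=1) / contingency.sum(axis=1)
    w = contingency.shape[1]
    return per_instrument_acc.mean(), w, per_instrument_acc.mean() * w

# Toy 3-instrument x 4-cluster contingency table
table = np.array([[40, 5, 3, 2], [1, 30, 9, 0], [0, 2, 4, 44]])
print(clustering_score(table))
```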

  • Clustering result from the Hclust algorithm with the Ward linkage method and the Pearson distance measure; flatness coefficients are used as the selected feature. ctrumpet and bachtrumpet are clustered into the same group. ctrumpet_harmonStemOut is clustered into a single group instead of merging with ctrumpet. Bassoon is considered a sibling of the regular French horn. The muted French horn is clustered into a different group, together with English horn and oboe.

  • Looking for the optimal [classification method, data representation] pair in polyphonic music [middle C pitch group - 46 different musical sound objects]

    Testing data: 49 polyphonic sounds, created by selecting three different single-instrument sounds from the training database and mixing them together.

    KNN (k=3) is used as the classifier for each experiment.

    Exp # | Classifier                | Method                                    | Recall | Precision | F-Score
    1     | Non-Cascade               | Single-label, based on sound separation   | 31.48% | 43.06%    | 36.37%
    2     | Non-Cascade               | Feature-based multi-label classification  | 69.44% | 58.64%    | 63.59%
    3     | Non-Cascade               | Spectrum-Match multi-label classification | 85.51% | 55.04%    | 66.97%
    4     | Cascade (Hornbostel)      | Multi-label classification                | 64.49% | 63.10%    | 63.79%
    5     | Cascade (play method)     | Multi-label classification                | 66.67% | 55.25%    | 60.43%
    6     | Cascade (machine learned) | Multi-label classification                | 63.77% | 69.67%    | 66.59%

  • Auto-indexing system for musical instruments

    Intelligent query answering system for music instruments

    WWW.MIR.UNCC.EDU

  • User entering a query. If the user is not satisfied, he enters a new query - Action Rules System.

  • Action Rule

    An action rule is defined as a term [(ω) ∧ (α → β)] → (φ → ψ), where ω is a conjunction of fixed condition features shared by both groups, (α → β) represents the proposed changes in values of flexible features, and (φ → ψ) is the desired effect of the action.

    Information System:

    A  | B  | D
    a1 | b2 | d1
    a2 | b2 |
    a2 | b2 | d2

  • Action Rules Discovery - meta-actions based decision system S(d) = (X, A ∪ {d}, V), with A = {A1, A2, ..., Am}, and an influence matrix.

    Candidate action rule: r = [(A1, a1 → a1') ∧ (A2, a2 → a2') ∧ (A4, a4 → a4')] → (d, d1 → d1').
    If E32 = [a2 → a2'], E31 = [a1 → a1'], and E34 = [a4 → a4'], then rule r is supported and covered by meta-action M3.

    Influence Matrix:

        | A1  | A2  | A3  | A4  | ... | Am
    M1  | E11 | E12 | E13 | E14 | ... | E1m
    M2  | E21 | E22 | E23 | E24 | ... | E2m
    M3  | E31 | E32 | E33 | E34 | ... | E3m
    M4  | E41 | E42 | E43 | E44 | ... | E4m
    ... |     |     |     |     |     |
    Mn  | En1 | En2 | En3 | En4 | ... | Enm
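    A minimal sketch of checking which meta-actions support a candidate action rule against the influence matrix; the data structures and the support test below are illustrative assumptions, not the authors' implementation.

```python
# Influence matrix: each meta-action maps to the attribute changes it triggers
# (names and values below are hypothetical).
influence_matrix = {
    "M3": {"A1": ("a1", "a1'"), "A2": ("a2", "a2'"), "A4": ("a4", "a4'"), "d": ("d1", "d1'")},
}

candidate_rule = {
    "premise": {"A2": ("a2", "a2'")},
    "conclusions": {"A1": ("a1", "a1'"), "A4": ("a4", "a4'")},
    "decision": {"d": ("d1", "d1'")},
}

def supports(meta_changes, rule):
    """A meta-action supports the rule if it triggers every atomic change the rule mentions."""
    required = {**rule["premise"], **rule["conclusions"], **rule["decision"]}
    return all(meta_changes.get(attr) == change for attr, change in required.items())

print([m for m, changes in influence_matrix.items() if supports(changes, candidate_rule)])
```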

  • "Action Rules Discovery without pre-existing classification rules", Z.W. Ras, A. Dardzinska, Proceedings of RSCTC 2008 Conference, in Akron, Ohio, LNAI 5306, Springer, 2008, 181-190 http://www.cs.uncc.edu/~ras/Papers/Ras-Aga-AKRON.pdf

  • Since the window diminishes the signal at both edges, it leads to information loss due to the narrowing of the frequency spectrum. In order to preserve this information, consecutive analysis frames overlap in time. Empirical experiments show the best overlap is two thirds of the window size.

  • Windowing: a Hamming window is applied to each frame to reduce spectral leakage.
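    A short sketch of this windowing step: each frame is multiplied by a Hamming window before the FFT, with the hop set to one third of the frame length (the two-thirds overlap recommended above). Except for the 0.12 s frame length, the parameter values and function name are illustrative assumptions.

```python
import numpy as np

def power_spectrum_frames(signal, frame_len=5292, nfft=8192):
    """Hamming-windowed short-term power spectrum with 2/3 overlap between frames."""
    hop = frame_len // 3                      # 2/3 overlap, as recommended on the slide
    window = np.hamming(frame_len)            # tapers the edges, reducing spectral leakage
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame, n=nfft)) ** 2
        spectra.append(spectrum)
    return np.array(spectra)

# 440 Hz tone at 44.1 kHz; frame_len = 5292 samples corresponds to 0.12 s
sr = 44100
x = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
print(power_spectrum_frames(x).shape)
```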

    Mining music data involves complexity not found in the traditional business world. Traditional source data is typically structured and modest in size, and its content is quite homogeneous. In addition, the data itself is usually discrete (categorical), and there is not a significant amount of processing involved in removing noise from the raw data. In the digital audio domain, however, we deal with unstructured, inhomogeneous datasets of extremely large size. On top of that, the values we are trying to mine are floating-point. Higher-level representations are abstracted from this lower-level data before we detect patterns or other useful information at the higher level.

    The tasks at the lower level would typically include object detection, tracking of objects, as well as feature measurements/extraction.

    At the higher level, there are more traditional pattern recognition tasks of classification, clustering, regression, interactive retrieval, novelty detection, verification and validation etc.

    In practice, however, a mixture of sounds is difficult to separate without distortion, and it is difficult to extract clear features from the original sound when there is no dominant instrument. The main difficulty in identifying instruments in polyphonic music is that the acoustical features of each instrument cannot be extracted without blurring, because of the overlapping harmonic content. After each sound separation step, the timbre information of the remaining instruments can be partially lost due to the overlap of multiple timbre signals, which makes it difficult to further analyze the remnant of the sound signal.

    Spectrum Centroid describes the gravity center of the spectrum.
    Spectrum Spread describes the deviation of the power spectrum with respect to the gravity center in a frame; like Spectrum Centroid, it is an economical way to describe the shape of the power spectrum.
    Spectrum Flatness Band Coefficients describe the flatness property of the power spectrum within a frequency bin.
    Projection coefficients project the spectrum from the high-dimensional spectrum space to a low-dimensional space with compact, salient statistical information.
    Harmonic Peaks is a sequence of local peaks of harmonics in each frame.
    Mel frequency cepstral coefficients (MFCC) describe the spectrum according to the human perception system on the mel scale; they are computed by grouping the STFT points of each frame into a set of coefficients.
    Tristimulus: the concept of tristimulus originates in the world of colour, describing the way three primary colours can be mixed together to create a given colour. By analogy, the musical tristimulus measures the mixture of harmonics in a given sound, grouped into three sections. The parameters describe the ratio of the energy of three groups of harmonic partials to the total energy of harmonic partials; the groups used are the fundamental, the medium partials (2, 3, and 4), and the higher partials. The first tristimulus measures the relative weight of the first harmonic; the second, the relative weight of the 2nd, 3rd, and 4th harmonics taken together; and the third, the relative weight of all the remaining harmonics.

    Training data source: 2917 single-instrument sound files [45 different instruments]; feature: harmonic contents. Test data source: 308 mixed sounds, each synthesized from 2 single-instrument sounds of different pitch taken from the 2917 training files. "Pitch based" indicates whether the classification model is trained on a specific pitch group. "Sound Separation" indicates whether the sound separation process is involved in the whole indexing procedure. N indicates the number of classes labeled by the classifier during the frame estimation process. Recall = [positive responses] / [original instruments].

    Conclusions: (1) the multi-label classifier is better than the single-label classifier for multi-timbre estimation in polyphonic sound; (2) the multi-label classifier yields even better results when global timbre information (music context) is taken into account; (3) multi-label classification is a pitch-independent algorithm.

    As the figure shows, the power spectrum patterns of a single flute and a single trombone can still be identified in the mixture spectrum without blurring into each other (as marked in the figure). Therefore, we get a clear picture of the distinct pattern of each single instrument when we observe each spectrum slice of the polyphonic sound wave. This explains why the human hearing system can still accurately recognize the two different instruments in the mixture instead of misclassifying them as some other instruments. However, those distinct timbre-relevant characteristics of each instrument, though preserved in the signal, cannot be observed in the previous feature space.

    From the results shown in the table, we draw the following conclusions: (1) using the multi-label classifier for each frame yields better results than using a single-label classifier; (2) spectrum-based KNN classification improves the recognition rate of polyphonic sounds significantly; (3) some percussion instruments (such as vibraphone and marimba) are not suitable for spectrum-based classification, but most instruments generating harmonic sounds work well with this new method.

    The testing was done for music instrument sounds of pitch 3B. The results are shown in Table 3 and Table 4. The confidence of a standard classifier class(S, d, 3) for Hornbostel-Sachs classification of instruments is 91.80%. However, we can get much better results by following the cascade approach. For instance, if we use the classifier class(S, d, 2) followed by the classifier class(S, d[1,1], 3), then its precision in recognizing musical instruments in the aero double reed class is equal to 96.02% * 98.94% = 95.00%. Also, its precision in recognizing instruments in the aero single reed class is equal to 96.02% * 99.54% = 95.57%. It has to be noted that this improvement in confidence is obtained without increasing the number of attributes in the subsystems of S used to build the cascade classifier replacing S. Clearly, if we increase the number of attributes in these subsystems, then the classifiers forming the cascade classifier may easily have higher confidence, and the confidence of the cascade classifier will increase accordingly.

    Energy describes the total energy of harmonic partials.

    According to the previous discussion and conclusions, in order to get the highest accuracy for the final estimation at the bottom level of the hierarchical tree, the cascade system must be able to pick the feature-classifier pair from the available feature pool and classifier pool in such a way that the system achieves the best estimation at each level of the cascade classification. To get this information, we need to deduce the knowledge from the current training database by combining each feature from the feature pool (A, B, C, D) with each classifier from the classifier pool (NaiveBayes, KNN, Decision Tree), and running the classification experiments in WEKA on the subset which corresponds to each node in the hierarchical tree used by the cascade classification system.

    Because we measure the signal over a short period, there is no way to know where exactly the periodic signal starts and ends. If the period does not fit the measurement time, the frequency spectrum is not correct. Since we cannot assume anything about the signal, we need a way to make any signal's ends connect smoothly to each other when repeated. One way to do this is to multiply the signal by a "window" function. There are many window functions to choose from; in our research we use the Hamming window to window the short-time signal.
