
Page 1

1. Learning Music
2. Melody Extraction
3. Music Similarity

Extracting Information from Music Audio

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio

Dept. Electrical Engineering, Columbia University, NY USA

http://labrosa.ee.columbia.edu/

Page 2

1. Learning from Music

• A lot of music data available
  e.g. 60 GB of MP3 ≈ 1000 hr of audio, 15k tracks (see the quick check after this list)

• What can we do with it?
  implicit definition of ‘music’

• Quality vs. quantity
  Speech recognition lesson: 10x data, 1/10th annotation, twice as useful

• Motivating Applications
  music similarity / classification
  computer (assisted) music generation
  insight into music
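A quick sanity check of those figures, assuming a typical 128 kbps MP3 bit rate and roughly 4-minute tracks: 60 GB ≈ 4.8 × 10^11 bits; 4.8 × 10^11 bits / 128,000 bits/s ≈ 3.75 × 10^6 s ≈ 1040 hours; 1040 hours at ~4 min per track ≈ 15,600 tracks, consistent with the numbers above.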

Page 3

Ground Truth Data

• A lot of unlabeled music data available
  manual annotation is much rarer

• Unsupervised structure discovery possible
  .. but labels help to indicate what you want

• Weak annotation sources
  artist-level descriptions
  symbol sequences without timing (MIDI)
  errorful transcripts

• Evaluation requires ground truth
  limiting factor in Music IR evaluations?

[Figure: spectrogram of /Users/dpwe/projects/aclass/aimee.wav, 500-7000 Hz over 0:02-0:28, with hand-marked ‘mus’ and ‘vox’ segments]

Page 4

Talk Roadmap

[Diagram: talk roadmap. Music audio connects to drums extraction / eigen-rhythms, event extraction, melody extraction / fragment clustering, and anchor models / similarity-recommendation, with synthesis/generation, semantic bases, and an open ‘?’ downstream; markers 1-4 locate the talk sections]

Page 5

2. Melody Transcription

• Audio → Score very desirable
  for data compression, searching, learning

• Full solution is elusive
  signal separation of overlapping voices
  music constructed to frustrate!

• Simplified problem:
  “Dominant Melody” at each time frame


with Graham Poliner

[Figure: example spectrogram, 0-4 kHz over ~5 s]

Page 6

Conventional Transcription

• Pitched notes have harmonic spectra
  → transcribe by searching for harmonics
  e.g. sinusoid modeling + grouping

• Explicit expert-derived knowledge


[Inset slide: E6820 SAPR, Dan Ellis, L10 Music Analysis, 2005-04-06]
Spectrogram Modeling
• Sinusoid model
  as with synthesis, but signal is more complex
• Break tracks
  need to detect new ‘onset’ at single frequencies
• Group by onset & common harmonicity
  find sets of tracks that start around the same time
  + stable harmonic pattern
• Pass on to constraint-based filtering...
[Figure: spectrogram, 0-3 kHz over ~4 s, with sinusoid tracks, plus a short waveform excerpt]

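As a rough illustration of the sinusoid-model pipeline the inset slide describes (not the original E6820 code), the Python sketch below picks spectral peaks in each STFT frame and links them into tracks by frequency proximity; the thresholds and the simple nearest-peak linking rule are assumptions.

```python
import numpy as np
from scipy.signal import stft, find_peaks

def sinusoid_tracks(y, sr, db_floor=-60.0, max_jump_hz=30.0):
    """Crude sinusoid modeling: per-frame peak picking + frequency-proximity linking."""
    f, t, Z = stft(y, fs=sr, nperseg=1024, noverlap=768)
    mag_db = 20 * np.log10(np.abs(Z) + 1e-10)
    tracks = []            # each track: list of (frame_idx, freq_hz, level_db)
    active = []            # tracks still being extended
    for j in range(mag_db.shape[1]):
        peaks, _ = find_peaks(mag_db[:, j], height=mag_db.max() + db_floor)
        freqs = f[peaks]
        used = set()
        still_active = []
        for tr in active:  # extend each active track to its nearest unclaimed peak
            last_f = tr[-1][1]
            if len(freqs):
                k = int(np.argmin(np.abs(freqs - last_f)))
                if abs(freqs[k] - last_f) < max_jump_hz and k not in used:
                    tr.append((j, freqs[k], mag_db[peaks[k], j]))
                    used.add(k)
                    still_active.append(tr)
                    continue
            tracks.append(tr)   # no continuation found: close ("break") the track
        active = still_active
        for k in range(len(freqs)):  # unclaimed peaks start new tracks (onsets)
            if k not in used:
                active.append([(j, freqs[k], mag_db[peaks[k], j])])
    return tracks + active
```

Grouping the resulting tracks by shared onset time and a stable harmonic pattern, then filtering by constraints, would be the next stages the slide lists.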

Page 7

Transcription as Classification

• Signal models typically used for transcription
  harmonic spectrum, superposition

• But ... trade domain knowledge for data
  transcription as pure classification problem:
  single N-way discrimination for “melody”
  per-note classifiers for polyphonic transcription

[Diagram: audio → trained classifier → per-note posteriors p("C0"|Audio), p("C#0"|Audio), p("D0"|Audio), p("D#0"|Audio), p("E0"|Audio), p("F0"|Audio), ...]
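A minimal sketch of that classification framing, with a generic scikit-learn classifier standing in for the specific system (the .npy file names and the use of SVC here are illustrative assumptions): each spectrogram frame is a feature vector, and one N-way classifier over note labels produces per-frame note posteriors.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical prepared arrays (see the "Training Data" slide):
# frames: (n_frames, n_freq_bins) spectrogram features from labeled audio
# note_labels: (n_frames,) MIDI note number of the melody in each frame
frames = np.load("train_frames.npy")
note_labels = np.load("train_notes.npy")

clf = SVC(kernel="rbf", probability=True)    # one N-way "melody" discriminator
clf.fit(frames, note_labels)

test_frames = np.load("test_frames.npy")     # hypothetical test data
posteriors = clf.predict_proba(test_frames)  # p(note | audio frame), per frame
melody = clf.classes_[posteriors.argmax(axis=1)]
```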

Page 8

Melody Transcription Features

• Short-time Fourier Transform Magnitude (Spectrogram)

• Standardize over 50-point frequency window
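One plausible reading of "standardize over a 50-point frequency window" is a running z-score along the frequency axis of the (log-)magnitude spectrogram; the sketch below implements that reading, but the exact windowing and whether a log is taken first are assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def local_standardize(spec_mag, win=50, eps=1e-8):
    """Z-score each STFT magnitude bin against a 50-bin local window along frequency (axis 0)."""
    logspec = np.log(spec_mag + eps)                       # log compression (assumed)
    local_mean = uniform_filter1d(logspec, size=win, axis=0)
    local_sq = uniform_filter1d(logspec ** 2, size=win, axis=0)
    local_std = np.sqrt(np.maximum(local_sq - local_mean ** 2, eps))
    return (logspec - local_mean) / local_std
```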

Page 9

Training Data

• Need {data, label} pairs for classifier training
• Sources:
  pre-mixing multitrack recordings + hand-labeling?
  synthetic music (MIDI) + forced-alignment?


[Figure: two spectrogram panels, 0-2 kHz over ~3.5 s, illustrating the training data sources]

Page 10

Melody Transcription Results
• Trained on 17 examples
  .. plus transpositions out to +/- 6 semitones (see the sketch below)
  SMO SVM (Weka)
• Tested on ISMIR MIREX 2005 set
  includes foreground/background detection

Example...
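The transposition augmentation can be approximated by pitch-shifting each training clip and shifting its note labels by the same number of semitones; this sketch uses librosa's standard pitch_shift, which is my choice of tooling rather than necessarily what was used originally.

```python
import numpy as np
import librosa

def transpose_examples(y, sr, note_labels, semitone_range=range(-6, 7)):
    """Yield (audio, labels) pairs transposed by each step in semitone_range.

    note_labels: np.ndarray of per-frame MIDI note numbers for the melody.
    """
    for n in semitone_range:
        if n == 0:
            yield y, note_labels
        else:
            y_shift = librosa.effects.pitch_shift(y, sr=sr, n_steps=n)
            yield y_shift, note_labels + n   # melody labels move with the audio
```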

Page 11

Melody Clustering

• Goal: Find ‘fragments’ that recur in melodies
  .. across large music database
  .. trade data for model sophistication

• Data sources
  pitch tracker, or MIDI training data

• Melody fragment representation
  DCT(1:20) - removes average, smooths detail (see the sketch after the pipeline)

[Pipeline: training data → melody extraction → 5-second fragments → VQ clustering → top clusters]
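A small sketch of that fragment pipeline under stated assumptions: cut a per-frame melody contour into 5-second fragments, keep DCT coefficients 1..20 (dropping coefficient 0 removes the average pitch, and truncation smooths detail), and vector-quantize with k-means standing in for the VQ step; the frame rate, codebook size, and file names are illustrative.

```python
import numpy as np
from scipy.fftpack import dct
from sklearn.cluster import KMeans

def fragment_codes(pitch_track, frame_rate=100, frag_sec=5, n_coef=20):
    """Chop a pitch contour (semitones per frame) into fragments and take DCT(1:20)."""
    frag_len = frame_rate * frag_sec
    n_frag = len(pitch_track) // frag_len
    frags = pitch_track[: n_frag * frag_len].reshape(n_frag, frag_len)
    coefs = dct(frags, type=2, norm="ortho", axis=1)
    return coefs[:, 1 : n_coef + 1]   # drop the DC term: removes average pitch

# pitch_tracks: per-song melody contours from the pitch tracker or MIDI (hypothetical files)
pitch_tracks = [np.load(p) for p in ["song1_pitch.npy", "song2_pitch.npy"]]
X = np.vstack([fragment_codes(p) for p in pitch_tracks])

km = KMeans(n_clusters=256, n_init=10).fit(X)           # VQ codebook over fragments
top_clusters = np.bincount(km.labels_).argsort()[::-1][:10]  # most common contour shapes
```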

Page 12

Melody clustering results

• Clusters match underlying contour:

• Some interesting matches:
  e.g. Pink + Nsync

Page 13

3. Music Similarity

• Can we predict which songs “sound alike” to a listener?
  .. based on the audio waveforms?
  many aspects to subjective similarity

• Applications
  query-by-example
  automatic playlist generation
  discovering new music

• Problems
  the right representation
  modeling individual similarity

with Mike Mandel and Adam Berenzweig

Page 14

Music Similarity Features

• Need “timbral” features: Mel-Frequency Cepstral Coeffs (MFCCs)
  auditory-like frequency warping
  log-domain
  discrete cosine transform orthogonalization
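These are exactly the steps librosa's MFCC helper performs (mel warping, log, DCT), so a compact way to get such features today looks like the following; the file name and parameter values are illustrative, not those of the original experiments.

```python
import librosa

# Hypothetical input file; parameters are illustrative
y, sr = librosa.load("song.mp3", sr=22050, mono=True)

# Mel filterbank -> log energies -> DCT: the chain described on the slide
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=2048, hop_length=512)
print(mfcc.shape)  # (20, n_frames): one timbre vector per frame
```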

Page 15

Timbral Music Similarity
• Measure similarity of feature distribution
  i.e. collapse across time to get density p(x_i)
  compare by e.g. KL divergence
• e.g. Artist Identification
  learn artist model p(x_i | artist X) (e.g. as GMM)
  classify unknown song to closest model

[Diagram: training songs for Artist 1 and Artist 2 → MFCCs → per-artist GMMs; the test song's MFCCs are scored by KL divergence against each GMM and assigned to the minimum-divergence artist]
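A minimal version of this artist-ID recipe using scikit-learn mixtures: fit one GMM per artist on that artist's MFCC frames, then score a test song under every model and take the best one. Scoring by average log-likelihood stands in for the KL-divergence comparison on the slide, and the mixture size is an assumption.

```python
from sklearn.mixture import GaussianMixture

def train_artist_models(artist_frames, n_components=16):
    """artist_frames: dict artist -> (n_frames, n_mfcc) array of training MFCCs."""
    models = {}
    for artist, X in artist_frames.items():
        models[artist] = GaussianMixture(n_components=n_components,
                                         covariance_type="diag").fit(X)
    return models

def classify_song(models, song_frames):
    """Assign a test song's MFCC frames to the artist model that fits them best."""
    scores = {a: gmm.score(song_frames) for a, gmm in models.items()}  # mean log-likelihood
    return max(scores, key=scores.get)
```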

Page 16

“Anchor Space”
• Acoustic features describe each song
  .. but from a signal, not a perceptual, perspective
  .. and not the differences between songs
• Use genre classifiers to define new space
  prototype genres are “anchors”

[Diagram: audio input (class i and class j) → GMM modeling → conversion to anchor space, giving an n-dimensional vector of anchor posteriors p(a1|x), p(a2|x), ..., p(an|x) for each input → similarity computation (KL divergence, EMD, etc.)]
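A sketch of the anchor-space conversion under the same GMM assumption: train one model per anchor genre, then map a song's frames to the vector of posteriors p(a_i | x) over the anchors. The uniform prior and the per-song averaging of frame posteriors are simplifying assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_anchor_models(genre_frames, n_components=16):
    """genre_frames: dict anchor_genre -> (n_frames, n_mfcc) MFCC training data."""
    return {g: GaussianMixture(n_components=n_components,
                               covariance_type="diag").fit(X)
            for g, X in genre_frames.items()}

def to_anchor_space(anchors, song_frames):
    """Map a song's MFCC frames to one n-dimensional vector of anchor posteriors."""
    names = sorted(anchors)
    loglik = np.stack([anchors[g].score_samples(song_frames) for g in names])
    loglik -= loglik.max(axis=0, keepdims=True)                # stabilize the softmax
    post = np.exp(loglik) / np.exp(loglik).sum(axis=0, keepdims=True)  # uniform prior assumed
    return names, post.mean(axis=1)                            # average posterior over frames
```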

Page 17

Anchor Space

• Frame-by-frame high-level categorizations
  compare to raw features?
  properties in distributions? dynamics?
[Figure: scatter plots of madonna vs. bowie frames; left: cepstral features (third vs. fifth cepstral coefficient), right: anchor-space features (Country vs. Electronica)]

Page 18

‘Playola’ Similarity Browser

Page 19

Ground-truth data

• Hard to evaluate Playola’s ‘accuracy’
  user tests... ground truth?

• “Musicseer” online survey:
  ran for 9 months in 2002
  > 1,000 users, > 20k judgments
  http://labrosa.ee.columbia.edu/projects/musicsim/

Page 20

Evaluation
• Compare classifier measures against Musicseer subjective results
  “triplet” agreement percentage
  Top-N ranking agreement score:
    s_i = Σ_{r=1}^{N} (α_r)^r (α_c)^{k_r},  with α_r = (1/2)^{1/3},  α_c = α_r^2
  First-place agreement percentage
    - simple significance test
[Figure: bar chart of top-rank agreement (%), 0-80, for measures cei, cmb, erd, e3d, opn, kn2, rnd, ANK over four ground-truth sets: SrvKnw (4789 x 3.58), SrvAll (6178 x 8.93), GamKnw (7410 x 3.96), GamAll (7421 x 8.92)]
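A direct implementation of the Top-N ranking agreement score as reconstructed above; reference and candidate are artist lists ordered most-similar-first, and k_r is the 1-based rank in the candidate list of the reference's r-th artist. Skipping artists missing from the candidate list, and leaving the score unnormalized, are assumptions.

```python
def ranking_agreement(reference, candidate, N=None):
    """s_i = sum_r alpha_r**r * alpha_c**k_r, with alpha_r = 0.5**(1/3), alpha_c = alpha_r**2."""
    alpha_r = 0.5 ** (1.0 / 3.0)     # weight halves every 3 places down the reference list
    alpha_c = alpha_r ** 2
    N = N or len(reference)
    s = 0.0
    for r, artist in enumerate(reference[:N], start=1):
        if artist in candidate:
            k_r = candidate.index(artist) + 1   # 1-based rank in the candidate ordering
            s += (alpha_r ** r) * (alpha_c ** k_r)
    return s

# Example: identical lists give the maximum achievable score for N = 3
ref = ["artist_a", "artist_b", "artist_c"]
print(ranking_agreement(ref, ref))
```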

Page 21

Using SVMs for Artist ID

• Support Vector Machines (SVMs) find hyperplanes in a high-dimensional space
  relies only on matrix of distances between points
  much ‘smarter’ than nearest-neighbor / overlap
  want diversity of reference vectors...


[Figure: standard SVM margin diagram in (x1, x2): separating hyperplane (w·x) + b = 0 with margin hyperplanes (w·x) + b = ±1 between classes y_i = +1 and y_i = -1]
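Since the slide stresses that the SVM needs only a matrix of distances between points, one common way to use such a matrix is to turn it into a kernel, e.g. K = exp(-gamma * D), and train an SVM on the precomputed kernel; the exponential form and the gamma value are assumptions, not necessarily the kernel used in the original system.

```python
import numpy as np
from sklearn.svm import SVC

def svm_from_distances(D_train, y_train, D_test, gamma=0.01):
    """D_train: (n, n) pairwise distances among training songs;
    D_test: (m, n) distances from each test song to every training song."""
    K_train = np.exp(-gamma * D_train)      # distance matrix -> kernel matrix
    K_test = np.exp(-gamma * D_test)
    clf = SVC(kernel="precomputed").fit(K_train, y_train)
    return clf.predict(K_test)
```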

Page 22

Song-Level SVM Artist ID

• Instead of one model per artist/genre, use every training song as an ‘anchor’
  then SVM finds best support for each artist


[Diagram: training songs (Artist 1, Artist 2) and the test song → MFCCs → song features → distances (D) → DAG SVM → artist decision]
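A sketch of the song-level framing: summarize each song by statistics of its MFCC frames (mean plus covariance entries, as a stand-in for the exact song features), and train a multiclass SVM over artists. scikit-learn's SVC uses one-vs-one multiclass voting, which is related to, but not the same as, the DAG SVM in the diagram; the `songs` list is assumed to be prepared elsewhere.

```python
import numpy as np
from sklearn.svm import SVC

def song_feature(mfcc_frames):
    """Collapse a song's (n_frames, n_mfcc) MFCCs to one fixed-length vector."""
    mu = mfcc_frames.mean(axis=0)
    cov = np.cov(mfcc_frames, rowvar=False)
    iu = np.triu_indices(cov.shape[0])            # upper triangle of the covariance
    return np.concatenate([mu, cov[iu]])

def train_artist_svm(songs):
    """songs: list of (mfcc_frames, artist_name) pairs prepared elsewhere (assumed)."""
    X = np.array([song_feature(m) for m, _ in songs])
    y = np.array([artist for _, artist in songs])
    return SVC(kernel="rbf").fit(X, y)            # multiclass via one-vs-one voting
```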

Page 23

Artist ID Results

• ISMIR/MIREX 2005 also evaluated Artist ID
• 148 artists, 1800 files (split train/test) from ‘uspop2002’
• Song-level SVM clearly dominates
  using only MFCCs!

Table 4: Results of the formal MIREX 2005 Audio Artist ID evaluation (USPOP2002), from http://www.music-ir.org/evaluation/mirex-results/audio-artist/

Rank | Participant | Raw Accuracy | Normalized Accuracy | Runtime / s
1    | Mandel      | 68.3%        | 68.0%               | 10240
2    | Bergstra    | 59.9%        | 60.9%               | 86400
3    | Pampalk     | 56.2%        | 56.0%               | 4321
4    | West        | 41.0%        | 41.0%               | 26871
5    | Tzanetakis  | 28.6%        | 28.5%               | 2443
6    | Logan       | 14.8%        | 14.8%               | ?
7    | Lidy        | Did not complete



Page 24

Playlist Generation

• SVMs are well suited to “active learning”
  solicit labels on items closest to current boundary

• Automatic player with “skip”
  = ground truth data collection
  active-SVM automatic playlist generation (see the sketch below)
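A minimal margin-based active-learning loop of the kind the slide suggests: repeatedly play the song the current SVM is least certain about, treat a skip as a negative label and a full listen as positive, and refit. The feedback function, features, and seeding are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def active_playlist(X, get_feedback, seed_pos, seed_neg, rounds=20):
    """X: (n_songs, n_features); get_feedback(i) -> +1 (listened) or -1 (skipped)."""
    labeled = {seed_pos: 1, seed_neg: -1}
    for _ in range(rounds):
        idx = np.array(list(labeled))
        clf = SVC(kernel="rbf").fit(X[idx], [labeled[i] for i in idx])
        margins = np.abs(clf.decision_function(X))
        margins[idx] = np.inf                 # don't re-query already-labeled songs
        query = int(margins.argmin())         # song closest to the current boundary
        labeled[query] = get_feedback(query)  # user plays or skips it
    return clf, labeled
```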

Page 25

Conclusions

• Lots of data + noisy transcription + weak clustering ⇒ musical insights?

[Diagram: the talk-roadmap block diagram from Page 4, repeated]