labrosa research overview

LabROSA Overview - Dan Ellis 2010-09-10 /191

1. Real-World Sound2. Speech Separation3. Environmental Audio Classification4. Music Audio Analysis

LabROSA Research Overview

Dan EllisLaboratory for Recognition and Organization of Speech and Audio

Dept. Electrical Eng., Columbia Univ., NY USA

[email protected] http://labrosa.ee.columbia.edu/

mailto:[email protected]

mailto:[email protected]

http://labrosa.ee.columbia.edu

http://labrosa.ee.columbia.edu


LabROSA Overview

2

InformationExtraction

MachineLearning

SignalProcessing

Speech

Music EnvironmentRecognition

Retrieval

Separation


1. Real-World Sound

3

• Sounds rarely occur in isolation.. so analyzing mixtures (“scenes”) is a problem.. for humans and machines

02_m+s-15-evil-goodvoice-fade

0 2 4 6 8 10 12 time/s

frq/Hz

0

2000

1000

3000

4000

Voice (evil)Stab

Rumble StringsChoir

Voice (pleasant)

Analysis

level / dB-60

-40

-20

0


Auditory Scene Analysis

“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregmanʼ90)

• Received waveform is a mixture2 sensors, N sources - underconstrained

• Use prior knowledge (models) to constrain

4


2. Speech Separation

• Given models for sources, find “best” (most likely) states for spectra:

can include sequential constraints...

• E.g. stationary noise:

5

{i1(t), i2(t)} = argmaxi1,i2p(x(t)|i1, i2)p(x|i1, i2) = N (x;ci1+ ci2,Σ) combination

model

inference ofsource state

time / s

freq

/ mel

bin

Original speech

0 1 2

20

40

60

80

In speech-shaped noise (mel magsnr = 2.41 dB)

0 1 2

20

40

60

80

VQ inferred states (mel magsnr = 3.6 dB)

0 1 2

20

40

60

80

Roweis ’01, ’03Kristjannson ’04, ’06


Eigenvoices• Idea: Find speaker model

parameter space

generalize without losing detail?

• Eigenvoice model:280 states x 320 bins= 89,600 dimensions10-30 dimensions

6

Weiss & Ellis ’09, ’10

Speaker modelsSpeaker subspace bases

µ = µ̄ + U w + B hadapted mean eigenvoice weights channel channelmodel voice bases bases weights

Freq

uenc

y (kH

z)

Mean Voice

b d g p t k jh ch s z f th v dh m n l r w y iy ih eh ey ae aaaw ay ah aoowuw ax

2

4

6

8

Freq

uenc

y (kH

z)

Eigenvoice dimension 1


2

4

6

8

Freq

uenc

y (kH

z)Eigenvoice dimension 2


2

4

6

8

Freq

uenc

y (kH

z)

Eigenvoice dimension 3


2

4

6

8

50

40

30

20

10

0

2

4

6

8

0

2

4

6

8

0

2

4

6

8


Speaker-Adapted Separation

7


Speaker-Adapted Separation

• Eigenvoices for Speech Separation taskspeaker adapted (SA) performs midway between speaker-dependent (SD) & speaker-indep (SI)

8

SI

SA


3. Soundtrack Classification

• Short video clips as the evolution of snapshots10-100 sec, one location, no editingbrowsing?

• Need information for indexing...video + audioforeground + background

9


MFCC Covariance Representation

• Each clip/segment → fixed-size statisticssimilar to speaker ID and music genre classification

• Full Covariance matrix of MFCCs maps the kinds of spectral shapes present

• Clip-to-clip distances for SVM classifierby KL or 2nd Gaussian model

10

VTS_04_0001 - Spectrogram

freq

/ kH

z

1 2 3 4 5 6 7 8 9012345678

-20

-10

0

10

20

30

time / sec

time / sec

level / dB

value

MFC

C b

in

1 2 3 4 5 6 7 8 92468

101214161820

-20-15-10-505101520

MFCC dimension

MFC

C d

imen

sion

MFCC covariance

5 10 15 20

2

4

6

8

10

12

14

16

18

20

-50

0

50

Video Soundtrack

MFCCfeatures

MFCCCovariance

Matrix


Classification Results

• Wide range:

audio (music, ski) vs. non-audio (group, night)large AP uncertainty on infrequent classes

11

Animal

BabyBeach

Birthday

Boat

Crowd

Group of 3

+

Group of 2

Museum

Night

One person

ParkPicn

ic

Playground

ShowSports

Sunset

Wedding

Dancing

Parade

SingingCheer

Music

Concepts

Ski0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

AP

1G + KL1G + MahGMM Hist. + pLSAGuess

Chang, Ellis et al. ’07Lee & Ellis ’10


Matching Videos via Fingerprints

• Landmark pairs are a noise-robust fingerprint

• Use to match distinct videos with same sound ambience

12

Cotton & Ellis ’10

VIdeo IMpLQaiHWbE at 195s

VIdeo Yi1hkNkqHBc at 218 s

195.5 196 196.5 197 197.5 198 198.5 199

218.5 219 219.5 220 220.5 221 221.5 2220

1

2

3

4fre

q / k

Hz

0

1

2

3

4

freq

/ kHz

time / sec

time / sec


4. Music Audio Analysis

• ... at all levels from notes to genres

13

freq

/ kH

z

0

2

4

162 164 166 168 170 172 174 time / s

time / beats

level / dB

C4C5

C2

C2

C3

C4C5

Signal

Onsets& Beats

Per-framechroma

Per-beatnormalized

chroma

Melody

Piano

C3

-20

0

20

intensity0

0.50.25

0.751

Let it Be (final verse)

390 395 400 405 410 415

ACDEG

ACDEG


High-Resolution Transcription

• Extract notes from audiodespite overlapping voiceswith precise tuning, timing

• Use musical scoretime alignment & approximate match

• Find most likely parametersfundamental, amplitudes of all harmonics

14

Smit & Ellis ’09

Freq

uenc

y (H

z)

Spectrogram of triplum track: YIN (red)

5 6 7 8 9 10200

300

400

500

600

700

dB

10

20

30

40

Time (s)

Freq

uenc

y (H

z)

Spectrogram of combined trackes: YIN (red), Prob (blue)

5 6 7 8 9 10200

300

400

500

600

700

dB

10

20

30

40


Polyphonic Transcription

• Apply the Eigenvoice idea to musiceigeninstruments? • Subspace NMF

15

Grindlay & Ellis ’09


Chord Recognition

• StructSVM vs. HMM: better results

16

Weller, Ellis, Jebara ’09


Melodic-Harmonic Mining

• 100,000 tracks from morecowbell.djas Echo Nest Analyze

• Frequent clusters of 12 x 8 binarized event-chroma

17

#1 (3491) #2 (2775) #3 (2255) #4 (1241) #5 (1224) #6 (1218) #7 (1092) #8 (1084) #9 (1080) #10 (1035)

#11 (1021)

#1 (3491)

#12 (1005)

)#2 (2775)

#13 (974)

#3 (2255)

#14 (942)

)#4 (1241)

#15 (936)

)#5 (1224)

#16 (924)

)#6 (1218)

#17 (920)

)#7 (1092)

#18 (913)

)#8 (1084)

#19 (901)

)) #10 (1035)#9 (1080)

#20 (897)#

#21 (887)

#11 (1021) #

#21 (887)#21 (887) #22 (882)

)#12 (1005)

))#22 (882)#22 (882) #23 (881)

#13 (974)

#23 (881)#23 (881) #24 (881)

)#14 (942)

))#24 (881)#24 (881) #25 (879)

)#15 (936)

))#25 (879)#25 (879) #26 (875)

)#16 (924)

))#26 (875)#26 (875) #27 (875)

)#17 (920)

))#27 (875)#27 (875) #28 (874)

)#18 (913)

))#28 (874)#28 (874) #29 (868)

) #20 (897)#19 (901)

))#29 (868)#29 (868) #30 (844)

#31 (839) #32 (839) #33 (794) #34 (786) #35 (785) #36 (747) #37 (731) #38 (714) #39 (706) #40 (698)

#41 (682)

#31 (839)#31 (839)

#42 (678)

))#32 (839)#32 (839)

#43 (675)

#33 (794)#33 (794)

#44 (657)

))#34 (786)#34 (786)

#45 (656)

))#35 (785)#35 (785)

#46 (651)

))#36 (747)#36 (747)

#47 (647)

))#37 (731)#37 (731)

#48 (638)

))#38 (714)#38 (714)

#49 (610)

)) #40 (698)#40 (698)#39 (706)#39 (706)

#50 (593)

#51 (592)

#41 (682)#41 (682)

#52 (591)

))#42 (678)#42 (678)

#53 (589)

#43 (675)#43 (675)

#54 (572)

))#44 (657)#44 (657)

#55 (571)

))#45 (656)#45 (656)

#56 (550)

))#46 (651)#46 (651)

#57 (549)

))#47 (647)#47 (647)

#58 (534)

))#48 (638)#48 (638)

#59 (534)

)) #50 (593)#50 (593)#49 (610)#49 (610)

#60 (531)

#61 (528)

#51 (592)

#62 (525)

)#52 (591)

#63 (522)

#53 (589)

#64 (514)

)#54 (572)

#65 (510)

)#55 (571)

#66 (507)

)#56 (550)

#67 (500)

)#57 (549)

#68 (497)

)#58 (534)

#69 (486)

) #60 (531)#59 (534)

#70 (479)

#71 (476)

#61 (528)#61 (528)

#72 (468)

))#62 (525)#62 (525)

#73 (468)

#63 (522)#63 (522)

#74 (466)

))#64 (514)#64 (514)

#75 (463)

))#65 (510)#65 (510)

#76 (454)

))#66 (507)#66 (507)

#77 (453)

))#67 (500)#67 (500)

#78 (448)

))#68 (497)#68 (497)

#79 (441)

)) #70 (479)#70 (479)#69 (486)#69 (486)

#80 (440)

#81 (435)

#71 (476)#71 (476)

#82 (430)

))#72 (468)#72 (468)

#83 (430)

#73 (468)#73 (468)

#84 (425)

))#74 (466)#74 (466)

#85 (425)

))#75 (463)#75 (463)

#86 (419)

))#76 (454)#76 (454)

#87 (419)

))#77 (453)#77 (453)

#88 (417)

))#78 (448)#78 (448)

#89 (416)

)) #80 (440)#80 (440)#79 (441)#79 (441)

#90 (414)

#91 (411)

#81 (435)#81 (435)

#91 (411)#91 (411) #92 (410)

))#82 (430)#82 (430)

))#92 (410)#92 (410) #93 (408)

#83 (430)#83 (430)

#93 (408)#93 (408) #94 (406)

))#84 (425)#84 (425)

))#94 (406)#94 (406) #95 (401)

))#85 (425)#85 (425)

))#95 (401)#95 (401) #96 (398)

))#86 (419)#86 (419)

))#96 (398)#96 (398) #97 (397)

))#87 (419)#87 (419)

))#97 (397)#97 (397) #98 (396)

))#88 (417)#88 (417)

))#98 (396)#98 (396) #99 (396)

)) #90 (414)#90 (414)#89 (416)#89 (416)

))#99 (396)#99 (396) #100 (395)

Musicaudio

LocalitySensitive

Hash Table

Beattracking

Chromafeatures

Keynormalization

Landmarkidentification

Bertin-Mahieux et al. ’10


Results - Beatles• Over 86 Beatles tracks

• All beat offsets = 41,705 patchesLSH takes 300 sec - approx NlogN in patches?

• High-pass along time to avoid sustainednotes

• Song filterremove hitsin same track

18

chro

ma bi

n

02-I Should Have Known Better 92.4-97.7s

2

4

6

8

10

12

chro

ma bi

n

05-Here There And Everywhere 12.1-20.5s

2

4

6

8

10

12

chro

ma bi

n

beat

09-Martha My Dear 90.9-98.6s

5 10 15 20

2

4

6

8

10

12

chro

ma bi

nbeat

12-Piggies 22.0-29.6s

5 10 15 20

2

4

6

8

10

12


Summary

• LabROSA : getting information from sound

• Speechmonaural separation using eigenvoices

binaural + reverb using MESSL

• Environmentalclassification of consumer video

landmark-based events and matching

• Musictranscription of notes, chords, ...

large corpus mining19

labrosa research overview

Documents