projects in speech, music & audio recognition &...

LabROSA

ROSA overview - Dan Ellis 2001-09-28 - 1

Recognition & Organization of Speech & Audio

Dan Ellis

http://labrosa.ee.columbia.edu/

Outline

1. Introducing LabROSA

2. Projects in speech, music & audio

3. Summary

Sound organization

• Central operation:

- continuous sound mixture → distinct objects & events

• Perceptual impression is very strong

- but hard to ‘see’ in signal

[Figure: spectrogram of a sound mixture with its analysis; time/s (0 to 12) vs. frq/Hz (0 to 4000); labeled events: Voice (evil), Stab, Rumble, Strings, Choir, Voice (pleasant)]

Bregman’s lake

“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?”

(after Bregman’90)

• Received waveform is a mixture

- two sensors, N signals ...

• Disentangling mixtures as primary goal

- perfect solution is not possible
- need knowledge-based constraints
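The "two sensors, N signals" point can be illustrated numerically. The following is a minimal sketch, assuming an instantaneous linear mixture with N = 3 sources; the mixing gains and source values are hypothetical and chosen only for illustration. It shows that several different source vectors explain the same two sensor readings, which is why extra constraints are needed.

```python
import numpy as np

# Hypothetical mixing gains: 2 sensors (rows) observing N = 3 sources (columns).
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.8]])

s_true = np.array([1.0, -2.0, 0.5])   # one instant of the three source signals
x = A @ s_true                        # what the two sensors receive

# Minimum-norm solution: consistent with the sensors, but not the true sources.
s_min_norm = np.linalg.pinv(A) @ x

# Any multiple of a null-space vector of A can be added without changing x.
_, _, vt = np.linalg.svd(A)
null_vec = vt[-1]                     # A @ null_vec is numerically zero
s_other = s_min_norm + 3.0 * null_vec


def explains_observation(s):
    """True if source vector s reproduces the observed sensor values."""
    return bool(np.allclose(A @ s, x))


print(explains_observation(s_true), explains_observation(s_min_norm),
      explains_observation(s_other))
```

All three candidates reproduce the sensor readings, so the observation alone cannot pick out the true sources.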

The information in sound

• A sense of hearing is evolutionarily useful

- gives organisms ‘relevant’ information

• Auditory perception is ecologically grounded

- scene analysis is preconscious (→ illusions)
- special-purpose processing reflects ‘natural scene’ properties
- subjective not canonical (ambiguity)

[Figure: two spectrogram panels, "Steps 1" and "Steps 2"; time / s (0 to 4) vs. freq / Hz (0 to 4000)]

Key themes for LabROSA

• Sound organization: construct hierarchy

- at an instant (sources)
- along time (segmentation)

• Scene analysis

- find attributes according to objects
- use attributes to form objects
- ... plus constraints of knowledge

• Exploiting large data sets (the ASR lesson)

- supervised/labeled: pattern recognition
- unsupervised: structure discovery, clustering

• Special cases:

- speech recognition
- other source-specific recognizers

• ... within a ‘complete explanation’

Outline

1. Introducing LabROSA

2. Projects in speech, music & audio

- Tandem speech recognition
- ‘Meeting recorder’ speech analysis
- Musical information extraction
- Alarm sound detection

3. Summary

Automatic Speech Recognition (ASR)

• Standard speech recognition structure:

• ‘State of the art’ word-error rates (WERs):

- 2% (dictation)
- 30% (telephone conversations)

• Can use multiple streams...
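For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions + deletions + insertions) between the recognizer output and a reference transcript, divided by the number of reference words. A minimal sketch of that calculation follows; the function name and the example sentences are illustrative and not taken from the slides.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,                 # deletion
                             dist[i][j - 1] + 1,                 # insertion
                             dist[i - 1][j - 1] + substitution)  # match/substitution
    return 100.0 * dist[-1][-1] / len(ref)


# One substitution (sat -> saw) and one deletion (on) against 6 reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat saw the mat")
print(round(example_wer, 1))
```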

[Figure: standard ASR block diagram: sound → Feature calculation → feature vectors → Acoustic classifier (with Acoustic model parameters) → phone probabilities → HMM decoder (with Word models, e.g. "s ah t", and Language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone / word sequence → Understanding/application...; a block labeled DATA also appears]
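The language-model terms p("sat"|"the","cat") and p("saw"|"the","cat") are trigram probabilities: the chance of a word given the two words before it. A minimal sketch of a maximum-likelihood estimate from counts follows; the toy corpus is hypothetical, and practical systems would add smoothing for unseen word sequences, which is omitted here.

```python
from collections import Counter


def trigram_probability(corpus, w1, w2, w3):
    """Maximum-likelihood estimate of p(w3 | w1, w2) from a list of sentences."""
    trigrams = Counter()
    contexts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for a, b, c in zip(words, words[1:], words[2:]):
            trigrams[(a, b, c)] += 1
            contexts[(a, b)] += 1          # count of the two-word history
    if contexts[(w1, w2)] == 0:
        return 0.0                          # history never seen (no smoothing)
    return trigrams[(w1, w2, w3)] / contexts[(w1, w2)]


# Hypothetical training text, for illustration only.
toy_corpus = [
    "the cat sat on the mat",
    "the cat sat down",
    "the cat saw the dog",
    "the cat sat still",
]

p_sat = trigram_probability(toy_corpus, "the", "cat", "sat")
p_saw = trigram_probability(toy_corpus, "the", "cat", "saw")
print(p_sat, p_saw)
```

In this toy corpus "the cat" is followed by "sat" three times out of four, so the decoder would prefer "sat" when the acoustics are ambiguous.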

Tandem speech recognition

(with Manuel Reyes, ICSI, OGI, CMU)

• Neural net estimates phone posteriors; but Gaussian mixtures model finer detail

• Combine them!

• Train net, then train GMM on net output

- GMM is ignorant of net output ‘meaning’
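Because the GMM treats the net output as just another feature vector, some conditioning of that output before Gaussian modeling is plausible. The following is a minimal sketch under stated assumptions: log compression followed by PCA decorrelation are choices made here for illustration (the results slide names a "Tandem + PC" variant, and reading PC as principal components is an assumption), and the random posteriors merely stand in for real net outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for neural-net outputs: 200 frames x 5 phone classes (hypothetical sizes).
logits = rng.normal(size=(200, 5))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)


def tandem_features(posteriors, eps=1e-10):
    """Log-compress posteriors, then decorrelate them with PCA (assumed steps)."""
    log_post = np.log(posteriors + eps)
    centered = log_post - log_post.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]       # largest-variance direction first
    return centered @ eigvecs[:, order]


features = tandem_features(posteriors)
feature_cov = np.cov(features, rowvar=False)
print(features.shape)
```

The resulting features have a diagonal covariance, which suits the diagonal-covariance Gaussians commonly used in GMM-based recognizers; the GMM training itself is not shown.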

[Figure: three system block diagrams]

Hybrid Connectionist-HMM ASR: Input sound → Feature calculation → Speech features → Neural net classifier → Phone probabilities → Noway decoder → Words

Conventional ASR (HTK): Input sound → Feature calculation → Speech features → Gauss mix models → Subword likelihoods → HTK decoder → Words

Tandem modeling: Input sound → Feature calculation → Speech features → Neural net classifier → Phone probabilities → Gauss mix models → Subword likelihoods → HTK decoder → Words

Tandem system results: Aurora digits

[Figure: WER as a function of SNR for various Aurora99 systems; x-axis SNR / dB (averaged over 4 noises): clean, 20, 15, 10, 5, 0, -5; y-axis WER / % (log scale), 1 to 100]

Average WER ratio to baseline:

- HTK GMM baseline: 100%
- Hybrid connectionist: 84.6%
- Tandem: 64.5%
- Tandem + PC: 47.2%

Aurora 2 Eurospeech 2001 Evaluation

[Figure: bar chart of avg. rel. improvement % (multicondition), axis -10 to 60, by site: Columbia, Philips, UPC Barcelona, Bell Labs, IBM, Motorola 1, Motorola 2, Nijmegen, ICSI/OGI/Qualcomm, ATR/Griffith, AT&T, Alcatel, Siemens, UCLA, Microsoft, Slovenia, Granada]

The Meeting Recorder project

(with ICSI, UW, SRI, IBM)

• Microphones in conventional meetings

- for summarization/retrieval/behavior analysis
- informal, overlapped speech

• Data collection (ICSI, UW, ...):

- 100 hours collected, ongoing transcription
- headsets + tabletop + ‘PDA’

Crosstalk cancellation

• Baseline speaker activity detection is hard:

[Figure: per-channel level/dB (0 to 40) and speaker-active tracks for Spkr A to Spkr E and Tabletop; time / secs (120 to 155); annotations: breath noise, crosstalk, interruptions, speaker B cedes floor]

• Noisy crosstalk model: $m = C \cdot s + n$

• Estimate subband $C_{Aa}$ from A’s peak energy

- ... including pure delay (10 ms frames)
- ... then linear inversion
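The "linear inversion" step for the model m = C·s + n can be sketched for a single subband. This is a minimal illustration, assuming the coupling matrix C is already known; the three-speaker setup, the matrix values, and the noise level are hypothetical, and the estimation of C from peak energy and the pure-delay handling are not shown.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical coupling matrix for one subband: entry [i, j] is the gain from
# speaker j into microphone i. The diagonal is each speaker's own headset mic.
C = np.array([[1.0, 0.2, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.25, 1.0]])

s = np.abs(rng.normal(size=(3, 50)))       # true per-speaker levels, 50 frames
n = 0.001 * rng.normal(size=(3, 50))       # small additive noise
m = C @ s + n                              # what the microphones record

# Linear inversion: solve C @ s_hat = m for every frame at once.
s_hat = np.linalg.solve(C, m)

error_before = np.abs(m - s).max()         # crosstalk left in the raw mic levels
error_after = np.abs(s_hat - s).max()      # residual after inversion
print(error_before > error_after)
```

With noise present the inversion is only approximate, and a poorly conditioned C would amplify that noise.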

PDA-based speaker change detection

• Goal: small conference-tabletop device

• Speaker turns from PDA mock-up signals?

• SCD algo on spectral + interaural features

- average spectral + per-channel ITD, ∆φ
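One common way to obtain an interaural time difference (ITD) is to find the lag at which the cross-correlation between the two channels peaks. A minimal sketch follows, assuming a 512-point frame (the frame length mentioned in the figure caption); the synthetic noise signal, the 5-sample delay, and the 16 kHz sample rate are hypothetical.

```python
import numpy as np


def estimate_lag(left, right):
    """Return the lag in samples by which `right` trails `left`."""
    # np.correlate(a, v, "full")[i] corresponds to lag k = i - (len(v) - 1),
    # with c_k = sum_n a[n + k] * v[n].
    xcorr = np.correlate(right, left, mode="full")
    lags = np.arange(-(len(left) - 1), len(right))
    return int(lags[np.argmax(xcorr)])


rng = np.random.default_rng(0)
frame = rng.normal(size=512)                       # one 512-point frame
delay = 5                                          # hypothetical delay in samples
delayed = np.concatenate([np.zeros(delay), frame[:-delay]])

lag = estimate_lag(frame, delayed)
sample_rate = 16000                                # hypothetical sample rate, Hz
itd_ms = 1000.0 * lag / sample_rate
print(lag, itd_ms)
```

A change in the estimated lag from one frame to the next suggests that the active talker has moved to a different position around the device.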

[Figure: pda.aif excerpt with 512-pt xcorr, 80% max thresh; three panels over time/s (31 to 40): one with axis 0 to 8000, one showing lag/ms (-1.5 to 1.5), one with axis -0.5 to 0.5; speaker turns labeled Morgan, Adam, Dan, with the transcript below]

what's this one here this this one has a couple cheesy electrets oh that's connected too or that's connected two- both

yep yeah both channels

yeah that's that's the dummy dummy P.D.A. yeah it is

Music analysis: Lyrics extraction

(with Adam Berenzweig)

• Vocal content is highly salient, useful for retrieval

• Can we find the singing? Use an ASR classifier:

• Frame error rate ~20% for segmentation based on posterior-feature statistics

• Lyric segmentation + transcribed lyrics → training data for lyrics ASR...

[Figure: spectrogram (freq / kHz, 0 to 4) and posteriogram (phone class) panels over time / sec (0 to 2) for three examples: speech (trnset #58), music (no vocals #1), singing (vocals #17 + 10.5s)]

Music analysis: Structure recovery

(with Rob Turetsky)

• Structure recovery by similarity matrices (after Foote)

- similarity distance measure?
- segmentation & repetition structure
- interpretation at different scales: notes, phrases, movements
- incorporating musical knowledge: ‘theme similarity’
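A similarity matrix compares every frame of a recording with every other frame, so repeated sections show up as bright off-diagonal blocks. The sketch below is a minimal illustration, assuming cosine similarity as the measure (the slide leaves the choice of measure open) and using hypothetical three-dimensional feature vectors for a toy piece in A-B-A form.

```python
import numpy as np


def self_similarity(features):
    """Cosine similarity between every pair of feature frames (rows)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)   # avoid division by zero
    return unit @ unit.T


# Toy "song" in A-B-A form: 4 frames per section, hypothetical feature values.
section_a = np.tile([1.0, 0.0, 0.2], (4, 1))
section_b = np.tile([0.0, 1.0, 0.2], (4, 1))
song = np.vstack([section_a, section_b, section_a])

S = self_similarity(song)
print(S.shape)
```

Here `S[0, 8]` is close to 1 because frame 8 belongs to the repeat of section A, while `S[0, 4]` is near 0 because frame 4 belongs to section B.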

Alarm sound detection

• Alarm sounds have particular structure

- people ‘know them when they hear them’

• Isolate alarms in sound mixtures

- representation of energy in time-frequency
- formation of atomic elements
- grouping by common properties (onset &c.)
- classify by attributes...

• Key: recognize despite background

[Figure: three time-frequency panels; time / sec (1 to 2.5) vs. freq / Hz (0 to 5000)]

The ‘Machine listener’

• Goal: An auditory system for machines

- use same environmental information as people

• Aspects:

- recognize spoken commands (but not others)
- track ‘acoustic channel’ quality (for responses)
- categorize environment (conversation, crowd...)

• Scenarios

- personal listener → summary of your day
- autonomous robots: need awareness

Outline

1. Introducing LabROSA

2. Projects in speech, music & audio

3. Summary

LabROSA Summary

DOMAINS

• Broadcast
• Movies
• Lectures
• Meetings
• Personal recordings
• Location monitoring

ROSA

• Speech recognition
• Speech characterization
• Nonspeech recognition
• Object-based structure discovery & learning
• Scene analysis
• Audio-visual integration
• Music analysis

APPLICATIONS

• Structuring
• Search
• Summarization
• Awareness
• Understanding
