dpwe/talks/jhu-2002-03.pdf
TRANSCRIPT
LabROSA
AIE @ JHU - Dan Ellis 2002-03-28 - 1
Audio Information Extraction
Dan Ellis <[email protected]>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/
Outline
1. Sound organization
2. Speech, music, and other
3. General sound organization
4. Future work & summary
Sound Organization
• Analyzing and describing complex sounds:
- continuous sound mixture → distinct objects & events
• Human listeners as the prototype
- strong subjective impression when listening
- ...but hard to ‘see’ in the signal
[Figure: spectrogram (0-12 s, 0-4000 Hz) of a complex sound mixture, with labeled events: Voice (evil), Stab, Rumble, Strings, Choir, Voice (pleasant), feeding an Analysis stage]
Bregman’s lake
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?”
(after Bregman’90)
• Received waveform is a mixture
- two sensors, N signals ...
• Disentangling mixtures as primary goal
- perfect solution is not possible
- need knowledge-based constraints
The information in sound
• Hearing confers evolutionary advantage
- optimized to get ‘useful’ information from sound
• Auditory perception is ecologically grounded
- scene analysis is preconscious (→ illusions)
- special-purpose processing reflects ‘natural scene’ properties
- subjective, not canonical (ambiguity)
[Figure: spectrograms (0-4 s, 0-4000 Hz) labeled Steps 1 and Steps 2]
Audio Information Extraction
• Domain
- text ... speech ... music ... general audio
• Operation
- recognize ... index/retrieve ... organize
[Diagram: Audio Information Extraction (AIE) amid a cloud of related fields:]
- Computational Auditory Scene Analysis (CASA); Unsupervised Clustering
- Blind Source Separation (BSS); Independent Component Analysis (ICA)
- Audio Content-based Retrieval (Audio CBR)
- Multimedia Information Retrieval (MMIR)
- Music Information Retrieval (Music IR); Audio Fingerprinting
- Spoken Document Retrieval (SDR); Automatic Speech Recognition (ASR)
- Text Information Retrieval (IR); Text Information Extraction (IE)
AIE Applications
• Multimedia access
- sound as a complementary dimension
- need all modalities for complete information
• Personal audio
- continuous sound capture quite practical
- a different kind of indexing problem
• Machine perception
- intelligence requires awareness
- necessary for communication
• Music retrieval
- an area of hot activity
- specific economic factors
Outline
1. Sound organization
2. Speech, music, and other
   - Speech recognition
   - Multi-speaker processing
   - Music classification
   - Other sounds
3. General sound organization
4. Future work & summary
Automatic Speech Recognition (ASR)
• Standard speech recognition structure:
• ‘State of the art’ word-error rates (WERs):
- 2% (dictation)
- 30% (telephone conversations)
• Can use multiple streams...
[Diagram: sound → feature calculation → feature vectors → acoustic classifier (acoustic model parameters) → phone probabilities → HMM decoder (word models, e.g. “s ah t”; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone/word sequence → understanding/application]
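The HMM decoder stage of this pipeline can be illustrated with a toy Viterbi search; the 3-state left-to-right word model and all probabilities below are invented for illustration, not from the talk:

```python
import numpy as np

# Left-to-right word model with 3 states (think "s", "ah", "t");
# transitions only stay or advance.
log_trans = np.log(np.array([[0.7, 0.3, 0.0],
                             [0.0, 0.6, 0.4],
                             [0.0, 0.0, 1.0]]) + 1e-12)
# Frame-by-frame acoustic probabilities from the classifier (made up).
log_obs = np.log(np.array([[0.8, 0.1, 0.1],
                           [0.7, 0.2, 0.1],
                           [0.1, 0.8, 0.1],
                           [0.1, 0.7, 0.2],
                           [0.1, 0.1, 0.8]]))

n_frames, n_states = log_obs.shape
delta = np.full((n_frames, n_states), -np.inf)   # best-path log scores
back = np.zeros((n_frames, n_states), dtype=int) # backpointers
delta[0] = log_obs[0] + np.log([1.0, 1e-12, 1e-12])  # must start in state 0
for t in range(1, n_frames):
    for j in range(n_states):
        scores = delta[t - 1] + log_trans[:, j]
        back[t, j] = np.argmax(scores)
        delta[t, j] = scores[back[t, j]] + log_obs[t, j]

# Trace back the best state sequence.
path = [int(np.argmax(delta[-1]))]
for t in range(n_frames - 1, 0, -1):
    path.append(back[t, path[-1]])
path.reverse()
print(path)  # → [0, 0, 1, 1, 2]
```

A real decoder searches over word sequences under the language model as well; this shows only the state-level dynamic programming.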
Tandem speech recognition
(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU, ICSI)
• Neural net estimates phone posteriors, but Gaussian mixtures model finer detail
• Combine them!
• Train net, then train GMM on net output
- GMM is ignorant of net output ‘meaning’
[Diagrams of three architectures:
- Hybrid connectionist-HMM ASR: input sound → feature calculation → neural net classifier → phone probabilities → Noway decoder → words
- Conventional ASR (HTK): input sound → feature calculation → Gaussian mixture models → subword likelihoods → HTK decoder → words
- Tandem modeling: input sound → feature calculation → neural net classifier → phone probabilities → Gaussian mixture models → subword likelihoods → HTK decoder → words]
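A minimal NumPy sketch of the tandem idea: take the net's (log-domain) phone posteriors, decorrelate them with PCA/KLT so diagonal-covariance Gaussians fit well, and hand the result to the conventional GMM back end. The random "net" weights and dimensions here are placeholders, not the real Aurora configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 200 frames of 12-dim speech features, 8 phone classes.
n_frames, n_feat, n_phones = 200, 12, 8
feats = rng.normal(size=(n_frames, n_feat))
W = rng.normal(size=(n_feat, n_phones))  # stand-in for trained MLP weights

def net_posteriors(x):
    """Pretend neural net: softmax over phone classes."""
    z = x @ W
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

post = net_posteriors(feats)

# Tandem trick: log posteriors, mean-normalize, then PCA/KLT rotation.
logp = np.log(post + 1e-10)
logp -= logp.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(logp, rowvar=False))
order = np.argsort(evals)[::-1]
tandem_feats = logp @ evecs[:, order]   # decorrelated tandem features

# These would now be modeled by the GMM-HMM (HTK) back end.
print(tandem_feats.shape)  # → (200, 8)
```

The rotation makes the feature covariance diagonal, which is exactly what a diagonal-covariance GMM assumes.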
Tandem system results
• It works very well (‘Aurora’ noisy digits):

System / features     Avg. WER 20-0 dB   Baseline WER ratio
HTK / mfcc            13.7%              100%
Neural net / mfcc     9.3%               84.5%
Tandem / mfcc         7.4%               64.5%
Tandem / msg+plp      6.4%               47.2%

[Chart: WER (%, log scale) vs. SNR (clean down to -5 dB, averaged over 4 noises) for the Aurora99 systems; average WER ratio to baseline: HTK GMM baseline 100%, Hybrid connectionist 84.6%, Tandem 64.5%, Tandem + PC 47.2%]
Relative contributions
• Approximate relative impact on baseline WER ratio for the different components:

[Diagram: plp and msg feature streams each feed a neural net classifier; the two nets' outputs are combined, PCA-orthogonalized, and passed to Gaussian mixture models and the HTK decoder]

- Tandem combo over HTK mfcc baseline: +53%
- Combo-into-HTK over combo-into-Noway: +15%
- Combo over msg: +20%
- NN over HTK: +15%
- Combo over mfcc: +25%
- Tandem over hybrid: +25%
- Tandem over HTK: +35%
- Combo over plp: +20%
- KLT over direct: +8%
- Pre-nonlinearity over posteriors: +12%
Inside Tandem systems: What's going on?
• Visualizations of the net outputs
• Neural net normalizes away noise

[Figure: “one eight three” (MFP_183A), clean vs. 5 dB SNR in ‘Hall’ noise; panels show the spectrogram, cepstral-smoothed mel spectrum, hidden layer linear outputs, and phone posterior estimates]
Tandem for large vocabulary recognition
• CI Tandem front end + CD LVCSR back end
• Tandem benefits reduced:

[Diagram: tandem feature calculation (PLP and MSG feature streams → neural net classifiers → pre-nonlinearity outputs → PCA decorrelation) feeding the SPHINX recognizer (GMM classifier → subword likelihoods → HMM decoder → MLLR adaptation → words)]

[Chart: WER (30-75%) for mfc, plp, tandem1, and tandem2 features under Context-Independent, Context-Dependent, and Context-Dependent + MLLR back ends]
‘Tandem-domain’ processing
(with Manuel Reyes)
• Can we improve the ‘tandem’ features with conventional processing (deltas, normalization)?
• Somewhat...

Processing                              Avg. WER 20-0 dB   Baseline WER ratio
Tandem PLP mismatch baseline (24 els)   11.1%              70.3%
Rank reduce @ 18 els                    11.8%              77.1%
Delta → PCA                             9.7%               60.8%
PCA → Norm                              9.0%               58.8%
Delta → PCA → Norm                      8.3%               53.6%

[Diagram: neural net outputs → (norm/deltas?) → PCA decorrelation → (rank reduce?) → (norm/deltas?) → GMM classifier]
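The conventional post-processing tried here (append delta features, then mean/variance normalization) can be sketched as follows; the window width and toy feature dimensions are illustrative, and the `np.roll` edge wrap is a simplification of proper boundary handling:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy tandem features: 100 frames x 24 dims, with nonzero mean/scale.
feats = rng.normal(loc=3.0, scale=2.0, size=(100, 24))

def deltas(x, w=2):
    """Regression-style delta (slope) features over a +/-w frame window.
    Edges wrap via np.roll, a simplification for this sketch."""
    num = sum(k * (np.roll(x, -k, axis=0) - np.roll(x, k, axis=0))
              for k in range(1, w + 1))
    return num / (2 * sum(k * k for k in range(1, w + 1)))

aug = np.hstack([feats, deltas(feats)])   # 24 -> 48 dims
# Per-utterance mean/variance normalization before the GMM back end.
norm = (aug - aug.mean(axis=0)) / (aug.std(axis=0) + 1e-8)
print(norm.shape)  # → (100, 48)
```

This mirrors the Delta → Norm steps in the table; the PCA decorrelation step would sit between them.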
Tandem vs. other approaches
• 50% of word errors corrected over baseline
• Beat ‘bells and whistles’ systems using large-vocabulary techniques

[Chart: Aurora 2 Eurospeech 2001 evaluation, showing average relative improvement (%) over the multicondition baseline, by site: Columbia, Philips, UPC Barcelona, Bell Labs, IBM, Motorola 1, Motorola 2, Nijmegen, ICSI/OGI/Qualcomm, ATR/Griffith, AT&T, Alcatel, Siemens, UCLA, Microsoft, Slovenia, Granada]
Missing data recognition
(Cooke, Green, Barker @ Sheffield)
• Energy overlaps in time-freq. hide features
- some observations are effectively missing
• Use missing feature theory...
- integrate over missing data x_m under model M:
  p(x | M) = ∫ p(x_p | x_m, M) p(x_m | M) dx_m
• Effective in speech recognition
- the trick is finding the good/bad data mask
[Chart: WER vs. SNR (-5 to 20 dB and clean) on AURORA 2000 Test Set A for MD Discrete SNR, MD Soft SNR, HTK clean training, and HTK multi-condition systems]
[Diagram: "1754" + noise → mask based on stationary noise estimate → missing-data recognition → "1754"]
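For diagonal-Gaussian state models the marginalization above has a closed form: integrating a Gaussian over its missing dimensions simply drops them, so the likelihood is evaluated on the reliable dimensions only. A minimal sketch (the model parameters and mask are invented):

```python
import numpy as np

# Hypothetical diagonal-Gaussian state model over 4 feature dims.
mu = np.array([1.0, -2.0, 0.5, 3.0])
var = np.array([0.5, 1.0, 2.0, 0.25])

def log_lik_missing(x, mask, mu, var):
    """Gaussian log-likelihood using only dims where mask is True
    (the 'reliable' data); masked dims are marginalized out."""
    m, v, xp = mu[mask], var[mask], x[mask]
    return -0.5 * np.sum(np.log(2 * np.pi * v) + (xp - m) ** 2 / v)

x = np.array([1.1, -1.8, 9.9, 2.9])        # dim 2 corrupted by noise
mask_all = np.ones(4, bool)                 # pretend all data is reliable
mask_md = np.array([True, True, False, True])  # missing-data mask

ll_md = log_lik_missing(x, mask_md, mu, var)
ll_all = log_lik_missing(x, mask_all, mu, var)
print(ll_md, ll_all)   # masked score is far higher: corrupted band ignored
```

This is why the mask matters so much: with a good mask the clean-speech model still scores the noisy observation well.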
Multi-source decoding
(Jon Barker @ Sheffield)
• Search of sound-fragment interpretations
- also search over the data mask K:
  p(M, K | x) ∝ p(x | K, M) p(K | M) p(M)
• CASA for masks → larger fragments → faster search

[Diagram: "1754" + noise → mask based on stationary noise estimate, split into subbands; ‘grouping’ applied within bands (common onset/offset, spectro-temporal proximity) → multisource decoder → "1754"]
The Meeting Recorder project
(with ICSI, UW, SRI, IBM)
• Microphones in conventional meetings
- for summarization/retrieval/behavior analysis
- informal, overlapped speech
• Data collection (ICSI, UW, ...):
- 100 hours collected, transcription ongoing

[Figure: speaker-activity timeline for participants C, A, O, M, J, L over 60 minutes (mr-2000-11-02)]
Crosstalk cancellation
(with Sam Keene)
• Baseline speaker activity detection is hard:

[Figure: per-channel level (dB) and speaker-activity traces for speakers A-E and a tabletop mic, 120-155 s of mr-2000-06-30-1600, with annotated events: breath noise, crosstalk, backchannel (signals desire to regain floor?), floor seizure, interruptions, speaker B cedes floor]

• Noisy crosstalk model:
  m = C · s + n
• Estimate subband coupling C_Aa from A’s peak energy
- ... then linear inversion
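Given an estimated coupling matrix C, the "linear inversion" step is a single solve of m = C·s + n for s. A toy sketch with an assumed 3-speaker, 3-mic coupling matrix (the real system estimates C per subband from each speaker's peak-energy frames):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: each mic picks up its own speaker at unit gain
# plus attenuated crosstalk from the others: m = C @ s + n.
n = 1000
s = rng.normal(size=(3, n))              # true speaker signals
C = np.array([[1.0, 0.3, 0.2],           # assumed coupling matrix
              [0.25, 1.0, 0.35],
              [0.1, 0.4, 1.0]])
noise = 0.01 * rng.normal(size=(3, n))   # sensor noise
m = C @ s + noise                        # observed mic signals

# With C known/estimated, recovery is linear inversion:
s_hat = np.linalg.solve(C, m)
print(np.max(np.abs(s_hat - s)))         # small residual from the noise term
```

The residual is C⁻¹·n, so the inversion is only as good as the coupling estimate and the noise level allow.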
Speaker localization
(with Huan Wei Hee)
• Tabletop mics form an array; time differences locate speakers
• Ambiguity:
- mic positions not fixed
- speaker motions
• Detect speaker activity, overlap

[Figures: inferred 3-D talker positions (x = mic) and PZM cross-correlation lags (lag 1-2 vs. lag 3-4, in ms) for mr-2000-11-02-1440]
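The time-difference idea can be sketched by cross-correlating a mic pair and reading off the peak lag; the sample rate, mic spacing, and far-field angle formula here are illustrative assumptions, not the meeting-room geometry:

```python
import numpy as np

fs = 16000          # assumed sample rate
c = 343.0           # speed of sound, m/s
mic_sep = 0.5       # assumed mic spacing on the table, m
rng = np.random.default_rng(2)

# Simulate one source arriving at mic 2 eight samples after mic 1.
sig = rng.normal(size=4000)
true_delay = 8
m1 = sig
m2 = np.concatenate([np.zeros(true_delay), sig[:-true_delay]])

# Cross-correlate and find the lag of the peak (the TDOA in samples).
corr = np.correlate(m2, m1, mode='full')
lag = np.argmax(corr) - (len(m1) - 1)
tdoa = lag / fs

# Far-field approximation: TDOA constrains the bearing; several mic
# pairs intersect to give the talker position.
angle = np.degrees(np.arcsin(np.clip(tdoa * c / mic_sep, -1, 1)))
print(lag, round(angle, 1))
```

Each mic pair gives one lag, hence one hyperbola of possible positions; the scatter of lag pairs in the figure is exactly this raw measurement.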
Alarm sound detection
• Alarm sounds have particular structure
- people ‘know them when they hear them’
• Isolate alarms in sound mixtures
- sinusoid peaks have invariant properties
- cepstral coefficients are easy to model
• Speech + alarm

[Figures: spectrograms (0-5000 Hz) of a speech + alarm mixture, and Pr(alarm) over time for a 9-s example]
Music analysis: Structure recovery
(with Rob Turetsky)
• Structure recovery by similarity matrices (after Foote)
- similarity distance measure?
- segmentation & repetition structure
- interpretation at different scales: notes, phrases, movements
- incorporating musical knowledge: ‘theme similarity’
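A Foote-style self-similarity matrix is just the pairwise cosine similarity of feature frames; repeated sections appear as high-similarity stripes parallel to the diagonal. A toy sketch with an invented A-B-A "song" of random feature frames:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "song": section A (40 frames of 12-dim features), section B (40),
# then A repeated, giving an A-B-A structure.
A = rng.normal(size=(40, 12))
B = rng.normal(size=(40, 12))
frames = np.vstack([A, B, A])

# Cosine similarity between every pair of frames.
unit = frames / np.linalg.norm(frames, axis=1, keepdims=True)
S = unit @ unit.T          # S[i, j] in [-1, 1], S[i, i] == 1

# The repeat of section A shows up as a stripe at offset 80 frames:
stripe = np.mean([S[i, i + 80] for i in range(40)])
print(round(stripe, 3))    # → 1.0 (exact repeat in this toy example)
```

Segmentation then looks for blocks along the diagonal, and repetition structure for such off-diagonal stripes; real audio gives stripes well below 1.0.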
Music analysis: Lyrics extraction
(with Adam Berenzweig)
• Vocal content is highly salient, useful for retrieval
• Can we find the singing? Use an ASR classifier:

[Figure: spectrograms and phone-class posteriorgrams (0-2 s) for speech (trnset #58), music (no vocals #1), and singing (vocals #17 + 10.5s)]

• Frame error rate ~20% for segmentation based on posterior-feature statistics
• Lyric segmentation + transcribed lyrics → training data for lyrics ASR...
Artist similarity
• Train a network to discriminate specific artists:

[Figure: artist-class posteriors over time for tracks by XTC, Tori, Beck, and Mouse (w60o40 stats based on LE plp12, 2001-12-28)]

• Focus on vocal segments for consistency
- accuracy 57% → 65%
• ... then clustering for recommendation
Outline
1. Sound organization
2. Speech, music, and other
3. General sound organization
   - Computational Auditory Scene Analysis
   - Audio Information Retrieval
4. Future work & summary
Computational Auditory Scene Analysis (CASA)
• Goal: automatic sound organization; systems to ‘pick out’ sounds in a mixture
- ... like people do
• E.g. voice against a noisy background
- to improve speech recognition
• Approach:
- psychoacoustics describes grouping ‘rules’
- ... can we implement them?

[Diagram: mixture → CASA → object 1 / object 2 / object 3 descriptions ...]
CASA front-end processing
• Correlogram: loosely based on known/possible physiology
- linear filterbank cochlear approximation
- static nonlinearity
- zero-delay slice is like a spectrogram
- periodicity from delay-and-multiply detectors

[Diagram: sound → cochlea filterbank → envelope follower → delay line → short-time autocorrelation in each frequency channel → correlogram slice (frequency × lag, evolving in time)]
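The periodicity-detection core of the correlogram (short-time autocorrelation per channel) can be sketched for a single channel; the pulse-train input and parameters below are illustrative, with the filterbank and envelope stages omitted:

```python
import numpy as np

fs = 8000
t = np.arange(2048) / fs
f0 = 200.0                                # 200 Hz periodic source
x = np.sign(np.sin(2 * np.pi * f0 * t))   # pulse-like periodic signal

def short_time_autocorr(sig, max_lag):
    """Normalized autocorrelation over lags 0..max_lag-1, as a
    delay-and-multiply periodicity detector would compute."""
    ac = np.array([np.sum(sig[:-lag] * sig[lag:]) if lag
                   else np.sum(sig * sig)
                   for lag in range(max_lag)])
    return ac / ac[0]

ac = short_time_autocorr(x, max_lag=100)
period_lag = np.argmax(ac[20:]) + 20      # skip the zero-lag peak region
print(period_lag, fs / period_lag)        # → 40, 200.0 (recovers the pitch)
```

In the full correlogram this is done in every filterbank channel, so a common pitch shows up as a vertical ridge at one lag across channels.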
Adding top-down cues
• Perception is not direct, but a search for plausible hypotheses
• Data-driven (bottom-up) vs. prediction-driven (top-down) (PDCASA)

[Diagrams: data-driven: input mixture → front end (signal features) → object formation (discrete objects) → grouping rules → source groups. Prediction-driven: input mixture → front end → compare & reconcile (prediction errors) → hypothesis management (hypotheses) → predict & combine periodic and noise components (predicted features)]

• Motivations
- detect non-tonal events (noise & click elements)
- support ‘restoration illusions’...
• Machine learning for sound models
- a corpus of isolated sounds?
PDCASA and complex scenes

[Figure: 10-s ‘City’ sound analyzed by PDCASA into noise beds (Noise1, Noise2), a click (Click1), and periodic wefts (Wefts 1-12); labeled events with scores: Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10)]
Audio Information Retrieval
(with Manuel Reyes)
• Searching in a database of audio
- speech: use ASR
- text annotations: search them
- sound effects library?
• E.g. Muscle Fish “SoundFisher” browser
- define multiple ‘perceptual’ feature dimensions
- search by proximity in (weighted) feature space
- features are ‘global’ for each soundfile; no attempt to separate mixtures

[Diagram: sound segment database and query example each pass through segment feature analysis; the resulting feature vectors are compared in search/comparison to produce results]
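The proximity search can be sketched as a weighted Euclidean distance over one global feature vector per soundfile; the feature names, weights, and data below are invented for illustration, not Muscle Fish's actual set:

```python
import numpy as np

rng = np.random.default_rng(5)

# Pretend database: 50 soundfiles, each reduced to one global vector of
# [mfcc-mean, brightness, bandwidth, pitch] (hypothetical dimensions).
db = rng.normal(size=(50, 4))
weights = np.array([1.0, 2.0, 0.5, 1.0])  # user-tunable per-dim weights

def retrieve(query, db, weights, k=3):
    """Return indices of the k nearest files by weighted Euclidean
    distance in feature space."""
    d = np.sqrt(((db - query) ** 2 * weights).sum(axis=1))
    return np.argsort(d)[:k]

query = db[7] + 0.01 * rng.normal(size=4)  # near-duplicate of file 7
hits = retrieve(query, db, weights)
print(hits[0])  # → 7 (the near-duplicate ranks first)
```

Because each file is one point, a file containing a mixture gets a single blurred location in feature space, which is exactly the limitation the element-based approach on the later slide addresses.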
Audio Retrieval: Results
• Musclefish corpus
- most commonly reported set
• Features
- mfcc, brightness, bandwidth, pitch ...
- no temporal sequence structure
• Results: 208 examples, 16 classes
- global features: 41% correct; HMM models: 81% correct

[Table: confusion matrices over the grouped classes Musical, Speech, Environment, Animals, Mechanical for the global-feature and HMM systems]
What are the HMM states?
• No sub-units defined for nonspeech sounds
• Final states depend on EM initialization
- labels
- clusters
- transition matrix
• Have ideas of what we’d like to get
- investigate features/initialization to get there

[Figure: spectrogram of dogBarks2 (1.15-2.25 s) with the decoded HMM state sequence s4 s7 s3 s4 s3 s4 s2 s3 s2 s5 s3 s2 s7 s3 s2 s4 s2 s3 s7]
CASA for audio retrieval
• When audio material contains mixtures, global features are insufficient
• Retrieval based on element/object analysis:
- features are calculated over grouped subsets

[Diagram: continuous audio archive and query example each pass through generic element analysis and (object formation); element representations and objects + properties feed search/comparison to produce results; a symbolic query (“rushing water”) enters via word-to-class mapping, matching on properties alone]
Outline
1. Sound organization
2. Speech, music, and other
3. General sound organization
4. Future work & summary
Automatic audio-video analysis
(with Prof. Shih-Fu Chang, Prof. Kathy McKeown)
• Documentary archive management
- huge ratio of raw-to-finished material
- costly manual logging
- missed opportunities for cross-fertilization
• Problem: term ↔ signal mapping
- training corpus of past annotations
- interactive semi-automatic learning
- need object-related features

[Diagram: A/V data undergoes A/V segmentation and feature extraction (A/V features, generic detectors) while annotations undergo text processing (terms, concept discovery); unsupervised clustering of A/V features feeds a multimedia concept network supporting multimedia fusion (mining concepts & relationships), complex spatio-temporal classification, and question answering]
The ‘Machine listener’
• Goal: an auditory system for machines
- use the same environmental information as people
• Signal understanding
- monitor for particular sounds
- real-time description
• Scenarios
- personal listener → summary of your day
- future prosthetic hearing device
- autonomous robots
LabROSA Summary
DO
MA
INS
AP
PLI
CA
TIO
NS
ROSA
• Broadcast• Movies• Lectures
• Meetings• Personal recordings• Location monitoring
• Speech recognition• Speech characterization• Nonspeech recognition
• Object-based structure discovery & learning
• Scene analysis• Audio-visual integration• Music analysis
• Structuring• Search• Summarization• Awareness• Understanding
Summary: Audio Information Extraction
• Sound carries information
- useful and detailed
- often tangled in mixtures
• Various important general classes
- Speech: activity, recognition
- Music: segmentation, clustering
- Other: detection, description
• General processing framework
- Computational Auditory Scene Analysis
- Audio Information Retrieval
• Future applications
- ubiquitous intelligent indexing
- intelligent monitoring & description