dpwe/talks/jhu-2002-03.pdf
TRANSCRIPT
LabROSA
AIE @ JHU - Dan Ellis 2002-03-28 - 1
Audio Information Extraction
Dan Ellis <[email protected]>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/
Outline
1. Sound organization
2. Speech, music, and other
3. General sound organization
4. Future work & summary
Sound Organization
• Analyzing and describing complex sounds:
- continuous sound mixture → distinct objects & events
• Human listeners as the prototype
- strong subjective impression when listening
- ...but hard to ‘see’ in the signal
[Figure: spectrogram (0-12 s, 0-4000 Hz) of a complex sound mixture, with labeled events: Voice (evil), Stab, Rumble, Strings, Choir, Voice (pleasant), feeding an Analysis stage]
Bregman’s lake
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?”
(after Bregman’90)
• Received waveform is a mixture
- two sensors, N signals ...
• Disentangling mixtures as primary goal
- perfect solution is not possible
- need knowledge-based constraints
The information in sound
• Hearing confers evolutionary advantage
- optimized to get ‘useful’ information from sound
• Auditory perception is ecologically grounded
- scene analysis is preconscious (→ illusions)
- special-purpose processing reflects ‘natural scene’ properties
- subjective, not canonical (ambiguity)
[Figure: spectrograms (0-4 s, 0-4000 Hz) labeled Steps 1 and Steps 2]
Audio Information Extraction
• Domain
- text ... speech ... music ... general audio
• Operation
- recognize ... index/retrieve ... organize
[Diagram: Audio Information Extraction (AIE) amid a cloud of related fields:]
- Computational Auditory Scene Analysis (CASA); Unsupervised Clustering
- Blind Source Separation (BSS); Independent Component Analysis (ICA)
- Audio Content-based Retrieval (Audio CBR)
- Multimedia Information Retrieval (MMIR)
- Music Information Retrieval (Music IR); Audio Fingerprinting
- Spoken Document Retrieval (SDR); Automatic Speech Recognition (ASR)
- Text Information Retrieval (IR); Text Information Extraction (IE)
AIE Applications
• Multimedia access
- sound as a complementary dimension
- need all modalities for complete information
• Personal audio
- continuous sound capture quite practical
- a different kind of indexing problem
• Machine perception
- intelligence requires awareness
- necessary for communication
• Music retrieval
- an area of hot activity
- specific economic factors
Outline
1. Sound organization
2. Speech, music, and other
   - Speech recognition
   - Multi-speaker processing
   - Music classification
   - Other sounds
3. General sound organization
4. Future work & summary
Automatic Speech Recognition (ASR)
• Standard speech recognition structure:
• ‘State of the art’ word-error rates (WERs):
- 2% (dictation)
- 30% (telephone conversations)
• Can use multiple streams...
[Diagram: sound → feature calculation → feature vectors → acoustic classifier (acoustic model parameters) → phone probabilities → HMM decoder (word models, e.g. “s ah t”; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone/word sequence → understanding/application]
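The HMM decoder stage of this pipeline can be illustrated with a toy Viterbi search; the 3-state left-to-right word model and all probabilities below are invented for illustration, not from the talk:

```python
import numpy as np

# Left-to-right word model with 3 states (think "s", "ah", "t");
# transitions only stay or advance.
log_trans = np.log(np.array([[0.7, 0.3, 0.0],
                             [0.0, 0.6, 0.4],
                             [0.0, 0.0, 1.0]]) + 1e-12)
# Frame-by-frame acoustic probabilities from the classifier (made up).
log_obs = np.log(np.array([[0.8, 0.1, 0.1],
                           [0.7, 0.2, 0.1],
                           [0.1, 0.8, 0.1],
                           [0.1, 0.7, 0.2],
                           [0.1, 0.1, 0.8]]))

n_frames, n_states = log_obs.shape
delta = np.full((n_frames, n_states), -np.inf)   # best-path log scores
back = np.zeros((n_frames, n_states), dtype=int) # backpointers
delta[0] = log_obs[0] + np.log([1.0, 1e-12, 1e-12])  # must start in state 0
for t in range(1, n_frames):
    for j in range(n_states):
        scores = delta[t - 1] + log_trans[:, j]
        back[t, j] = np.argmax(scores)
        delta[t, j] = scores[back[t, j]] + log_obs[t, j]

# Trace back the best state sequence.
path = [int(np.argmax(delta[-1]))]
for t in range(n_frames - 1, 0, -1):
    path.append(back[t, path[-1]])
path.reverse()
print(path)  # → [0, 0, 1, 1, 2]
```

A real decoder searches over word sequences under the language model as well; this shows only the state-level dynamic programming.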
Tandem speech recognition
(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU, ICSI)
• Neural net estimates phone posteriors, but Gaussian mixtures model finer detail
• Combine them!
• Train net, then train GMM on net output
- GMM is ignorant of net output ‘meaning’
[Diagrams of three architectures:
- Hybrid connectionist-HMM ASR: input sound → feature calculation → neural net classifier → phone probabilities → Noway decoder → words
- Conventional ASR (HTK): input sound → feature calculation → Gaussian mixture models → subword likelihoods → HTK decoder → words
- Tandem modeling: input sound → feature calculation → neural net classifier → phone probabilities → Gaussian mixture models → subword likelihoods → HTK decoder → words]
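A minimal NumPy sketch of the tandem idea: take the net's (log-domain) phone posteriors, decorrelate them with PCA/KLT so diagonal-covariance Gaussians fit well, and hand the result to the conventional GMM back end. The random "net" weights and dimensions here are placeholders, not the real Aurora configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 200 frames of 12-dim speech features, 8 phone classes.
n_frames, n_feat, n_phones = 200, 12, 8
feats = rng.normal(size=(n_frames, n_feat))
W = rng.normal(size=(n_feat, n_phones))  # stand-in for trained MLP weights

def net_posteriors(x):
    """Pretend neural net: softmax over phone classes."""
    z = x @ W
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

post = net_posteriors(feats)

# Tandem trick: log posteriors, mean-normalize, then PCA/KLT rotation.
logp = np.log(post + 1e-10)
logp -= logp.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(logp, rowvar=False))
order = np.argsort(evals)[::-1]
tandem_feats = logp @ evecs[:, order]   # decorrelated tandem features

# These would now be modeled by the GMM-HMM (HTK) back end.
print(tandem_feats.shape)  # → (200, 8)
```

The rotation makes the feature covariance diagonal, which is exactly what a diagonal-covariance GMM assumes.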
Tandem system results
• It works very well (‘Aurora’ noisy digits):

System / features     Avg. WER 20-0 dB   Baseline WER ratio
HTK / mfcc            13.7%              100%
Neural net / mfcc     9.3%               84.5%
Tandem / mfcc         7.4%               64.5%
Tandem / msg+plp      6.4%               47.2%

[Chart: WER (%, log scale) vs. SNR (clean down to -5 dB, averaged over 4 noises) for the Aurora99 systems; average WER ratio to baseline: HTK GMM baseline 100%, Hybrid connectionist 84.6%, Tandem 64.5%, Tandem + PC 47.2%]
Relative contributions
• Approximate relative impact on baseline WER ratio for the different components:

[Diagram: plp and msg feature streams each feed a neural net classifier; the two nets' outputs are combined, PCA-orthogonalized, and passed to Gaussian mixture models and the HTK decoder]

- Tandem combo over HTK mfcc baseline: +53%
- Combo-into-HTK over combo-into-Noway: +15%
- Combo over msg: +20%
- NN over HTK: +15%
- Combo over mfcc: +25%
- Tandem over hybrid: +25%
- Tandem over HTK: +35%
- Combo over plp: +20%
- KLT over direct: +8%
- Pre-nonlinearity over posteriors: +12%
Inside Tandem systems: What's going on?
• Visualizations of the net outputs
• Neural net normalizes away noise

[Figure: “one eight three” (MFP_183A), clean vs. 5 dB SNR in ‘Hall’ noise; panels show the spectrogram, cepstral-smoothed mel spectrum, hidden layer linear outputs, and phone posterior estimates]
Tandem for large vocabulary recognition
• CI Tandem front end + CD LVCSR back end
• Tandem benefits reduced:

[Diagram: tandem feature calculation (PLP and MSG feature streams → neural net classifiers → pre-nonlinearity outputs → PCA decorrelation) feeding the SPHINX recognizer (GMM classifier → subword likelihoods → HMM decoder → MLLR adaptation → words)]

[Chart: WER (30-75%) for mfc, plp, tandem1, and tandem2 features under Context-Independent, Context-Dependent, and Context-Dependent + MLLR back ends]
‘Tandem-domain’ processing
(with Manuel Reyes)
• Can we improve the ‘tandem’ features with conventional processing (deltas, normalization)?
• Somewhat...

Processing                              Avg. WER 20-0 dB   Baseline WER ratio
Tandem PLP mismatch baseline (24 els)   11.1%              70.3%
Rank reduce @ 18 els                    11.8%              77.1%
Delta → PCA                             9.7%               60.8%
PCA → Norm                              9.0%               58.8%
Delta → PCA → Norm                      8.3%               53.6%

[Diagram: neural net outputs → (norm/deltas?) → PCA decorrelation → (rank reduce?) → (norm/deltas?) → GMM classifier]
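The conventional post-processing tried here (append delta features, then mean/variance normalization) can be sketched as follows; the window width and toy feature dimensions are illustrative, and the `np.roll` edge wrap is a simplification of proper boundary handling:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy tandem features: 100 frames x 24 dims, with nonzero mean/scale.
feats = rng.normal(loc=3.0, scale=2.0, size=(100, 24))

def deltas(x, w=2):
    """Regression-style delta (slope) features over a +/-w frame window.
    Edges wrap via np.roll, a simplification for this sketch."""
    num = sum(k * (np.roll(x, -k, axis=0) - np.roll(x, k, axis=0))
              for k in range(1, w + 1))
    return num / (2 * sum(k * k for k in range(1, w + 1)))

aug = np.hstack([feats, deltas(feats)])   # 24 -> 48 dims
# Per-utterance mean/variance normalization before the GMM back end.
norm = (aug - aug.mean(axis=0)) / (aug.std(axis=0) + 1e-8)
print(norm.shape)  # → (100, 48)
```

This mirrors the Delta → Norm steps in the table; the PCA decorrelation step would sit between them.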
Tandem vs. other approaches
• 50% of word errors corrected over baseline
• Beat ‘bells and whistles’ systems using large-vocabulary techniques

[Chart: Aurora 2 Eurospeech 2001 evaluation, showing average relative improvement (%) over the multicondition baseline, by site: Columbia, Philips, UPC Barcelona, Bell Labs, IBM, Motorola 1, Motorola 2, Nijmegen, ICSI/OGI/Qualcomm, ATR/Griffith, AT&T, Alcatel, Siemens, UCLA, Microsoft, Slovenia, Granada]
Missing data recognition
(Cooke, Green, Barker @ Sheffield)
• Energy overlaps in time-freq. hide features
- some observations are effectively missing
• Use missing feature theory...
- integrate over missing data x_m under model M:
  p(x | M) = ∫ p(x_p | x_m, M) p(x_m | M) dx_m
• Effective in speech recognition
- the trick is finding the good/bad data mask
[Chart: WER vs. SNR (-5 to 20 dB and clean) on AURORA 2000 Test Set A for MD Discrete SNR, MD Soft SNR, HTK clean training, and HTK multi-condition systems]
[Diagram: "1754" + noise → mask based on stationary noise estimate → missing-data recognition → "1754"]
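For diagonal-Gaussian state models the marginalization above has a closed form: integrating a Gaussian over its missing dimensions simply drops them, so the likelihood is evaluated on the reliable dimensions only. A minimal sketch (the model parameters and mask are invented):

```python
import numpy as np

# Hypothetical diagonal-Gaussian state model over 4 feature dims.
mu = np.array([1.0, -2.0, 0.5, 3.0])
var = np.array([0.5, 1.0, 2.0, 0.25])

def log_lik_missing(x, mask, mu, var):
    """Gaussian log-likelihood using only dims where mask is True
    (the 'reliable' data); masked dims are marginalized out."""
    m, v, xp = mu[mask], var[mask], x[mask]
    return -0.5 * np.sum(np.log(2 * np.pi * v) + (xp - m) ** 2 / v)

x = np.array([1.1, -1.8, 9.9, 2.9])        # dim 2 corrupted by noise
mask_all = np.ones(4, bool)                 # pretend all data is reliable
mask_md = np.array([True, True, False, True])  # missing-data mask

ll_md = log_lik_missing(x, mask_md, mu, var)
ll_all = log_lik_missing(x, mask_all, mu, var)
print(ll_md, ll_all)   # masked score is far higher: corrupted band ignored
```

This is why the mask matters so much: with a good mask the clean-speech model still scores the noisy observation well.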
Multi-source decoding
(Jon Barker @ Sheffield)
• Search of sound-fragment interpretations
- also search over the data mask K:
  p(M, K | x) ∝ p(x | K, M) p(K | M) p(M)
• CASA for masks → larger fragments → faster search

[Diagram: "1754" + noise → mask based on stationary noise estimate, split into subbands; ‘grouping’ applied within bands (common onset/offset, spectro-temporal proximity) → multisource decoder → "1754"]
The Meeting Recorder project
(with ICSI, UW, SRI, IBM)
• Microphones in conventional meetings
- for summarization/retrieval/behavior analysis
- informal, overlapped speech
• Data collection (ICSI, UW, ...):
- 100 hours collected, transcription ongoing

[Figure: speaker-activity timeline for participants C, A, O, M, J, L over 60 minutes (mr-2000-11-02)]
Crosstalk cancellation
(with Sam Keene)
• Baseline speaker activity detection is hard:

[Figure: per-channel level (dB) and speaker-activity traces for speakers A-E and a tabletop mic, 120-155 s of mr-2000-06-30-1600, with annotated events: breath noise, crosstalk, backchannel (signals desire to regain floor?), floor seizure, interruptions, speaker B cedes floor]

• Noisy crosstalk model:
  m = C · s + n
• Estimate subband coupling C_Aa from A’s peak energy
- ... then linear inversion
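Given an estimated coupling matrix C, the "linear inversion" step is a single solve of m = C·s + n for s. A toy sketch with an assumed 3-speaker, 3-mic coupling matrix (the real system estimates C per subband from each speaker's peak-energy frames):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: each mic picks up its own speaker at unit gain
# plus attenuated crosstalk from the others: m = C @ s + n.
n = 1000
s = rng.normal(size=(3, n))              # true speaker signals
C = np.array([[1.0, 0.3, 0.2],           # assumed coupling matrix
              [0.25, 1.0, 0.35],
              [0.1, 0.4, 1.0]])
noise = 0.01 * rng.normal(size=(3, n))   # sensor noise
m = C @ s + noise                        # observed mic signals

# With C known/estimated, recovery is linear inversion:
s_hat = np.linalg.solve(C, m)
print(np.max(np.abs(s_hat - s)))         # small residual from the noise term
```

The residual is C⁻¹·n, so the inversion is only as good as the coupling estimate and the noise level allow.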
Speaker localization
(with Huan Wei Hee)
• Tabletop mics form an array; time differences locate speakers
• Ambiguity:
- mic positions not fixed
- speaker motions
• Detect speaker activity, overlap

[Figures: inferred 3-D talker positions (x = mic) and PZM cross-correlation lags (lag 1-2 vs. lag 3-4, in ms) for mr-2000-11-02-1440]
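The time-difference idea can be sketched by cross-correlating a mic pair and reading off the peak lag; the sample rate, mic spacing, and far-field angle formula here are illustrative assumptions, not the meeting-room geometry:

```python
import numpy as np

fs = 16000          # assumed sample rate
c = 343.0           # speed of sound, m/s
mic_sep = 0.5       # assumed mic spacing on the table, m
rng = np.random.default_rng(2)

# Simulate one source arriving at mic 2 eight samples after mic 1.
sig = rng.normal(size=4000)
true_delay = 8
m1 = sig
m2 = np.concatenate([np.zeros(true_delay), sig[:-true_delay]])

# Cross-correlate and find the lag of the peak (the TDOA in samples).
corr = np.correlate(m2, m1, mode='full')
lag = np.argmax(corr) - (len(m1) - 1)
tdoa = lag / fs

# Far-field approximation: TDOA constrains the bearing; several mic
# pairs intersect to give the talker position.
angle = np.degrees(np.arcsin(np.clip(tdoa * c / mic_sep, -1, 1)))
print(lag, round(angle, 1))
```

Each mic pair gives one lag, hence one hyperbola of possible positions; the scatter of lag pairs in the figure is exactly this raw measurement.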
Alarm sound detection
• Alarm sounds have particular structure
- people ‘know them when they hear them’
• Isolate alarms in sound mixtures
- sinusoid peaks have invariant properties
- cepstral coefficients are easy to model
• Speech + alarm

[Figures: spectrograms (0-5000 Hz) of a speech + alarm mixture, and Pr(alarm) over time for a 9-s example]
Music analysis: Structure recovery
(with Rob Turetsky)
• Structure recovery by similarity matrices (after Foote)
- similarity distance measure?
- segmentation & repetition structure
- interpretation at different scales: notes, phrases, movements
- incorporating musical knowledge: ‘theme similarity’
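A Foote-style self-similarity matrix is just the pairwise cosine similarity of feature frames; repeated sections appear as high-similarity stripes parallel to the diagonal. A toy sketch with an invented A-B-A "song" of random feature frames:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "song": section A (40 frames of 12-dim features), section B (40),
# then A repeated, giving an A-B-A structure.
A = rng.normal(size=(40, 12))
B = rng.normal(size=(40, 12))
frames = np.vstack([A, B, A])

# Cosine similarity between every pair of frames.
unit = frames / np.linalg.norm(frames, axis=1, keepdims=True)
S = unit @ unit.T          # S[i, j] in [-1, 1], S[i, i] == 1

# The repeat of section A shows up as a stripe at offset 80 frames:
stripe = np.mean([S[i, i + 80] for i in range(40)])
print(round(stripe, 3))    # → 1.0 (exact repeat in this toy example)
```

Segmentation then looks for blocks along the diagonal, and repetition structure for such off-diagonal stripes; real audio gives stripes well below 1.0.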
Music analysis: Lyrics extraction
(with Adam Berenzweig)
• Vocal content is highly salient, useful for retrieval
• Can we find the singing? Use an ASR classifier:

[Figure: spectrograms and phone-class posteriorgrams (0-2 s) for speech (trnset #58), music (no vocals #1), and singing (vocals #17 + 10.5s)]

• Frame error rate ~20% for segmentation based on posterior-feature statistics
• Lyric segmentation + transcribed lyrics → training data for lyrics ASR...
Artist similarity
• Train a network to discriminate specific artists:

[Figure: artist-class posteriors over time for tracks by XTC, Tori, Beck, and Mouse (w60o40 stats based on LE plp12, 2001-12-28)]

• Focus on vocal segments for consistency
- accuracy 57% → 65%
• ... then clustering for recommendation
Outline
1. Sound organization
2. Speech, music, and other
3. General sound organization
   - Computational Auditory Scene Analysis
   - Audio Information Retrieval
4. Future work & summary
Computational Auditory Scene Analysis (CASA)
• Goal: automatic sound organization; systems to ‘pick out’ sounds in a mixture
- ... like people do
• E.g. voice against a noisy background
- to improve speech recognition
• Approach:
- psychoacoustics describes grouping ‘rules’
- ... can we implement them?

[Diagram: mixture → CASA → object 1 / object 2 / object 3 descriptions ...]
CASA front-end processing
• Correlogram: loosely based on known/possible physiology
- linear filterbank cochlear approximation
- static nonlinearity
- zero-delay slice is like a spectrogram
- periodicity from delay-and-multiply detectors

[Diagram: sound → cochlea filterbank → envelope follower → delay line → short-time autocorrelation in each frequency channel → correlogram slice (frequency × lag, evolving in time)]
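The periodicity-detection core of the correlogram (short-time autocorrelation per channel) can be sketched for a single channel; the pulse-train input and parameters below are illustrative, with the filterbank and envelope stages omitted:

```python
import numpy as np

fs = 8000
t = np.arange(2048) / fs
f0 = 200.0                                # 200 Hz periodic source
x = np.sign(np.sin(2 * np.pi * f0 * t))   # pulse-like periodic signal

def short_time_autocorr(sig, max_lag):
    """Normalized autocorrelation over lags 0..max_lag-1, as a
    delay-and-multiply periodicity detector would compute."""
    ac = np.array([np.sum(sig[:-lag] * sig[lag:]) if lag
                   else np.sum(sig * sig)
                   for lag in range(max_lag)])
    return ac / ac[0]

ac = short_time_autocorr(x, max_lag=100)
period_lag = np.argmax(ac[20:]) + 20      # skip the zero-lag peak region
print(period_lag, fs / period_lag)        # → 40, 200.0 (recovers the pitch)
```

In the full correlogram this is done in every filterbank channel, so a common pitch shows up as a vertical ridge at one lag across channels.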
Adding top-down cues
• Perception is not direct, but a search for plausible hypotheses
• Data-driven (bottom-up) vs. prediction-driven (top-down) (PDCASA)

[Diagrams: data-driven: input mixture → front end (signal features) → object formation (discrete objects) → grouping rules → source groups. Prediction-driven: input mixture → front end → compare & reconcile (prediction errors) → hypothesis management (hypotheses) → predict & combine periodic and noise components (predicted features)]

• Motivations
- detect non-tonal events (noise & click elements)
- support ‘restoration illusions’...
• Machine learning for sound models
- a corpus of isolated sounds?
PDCASA and complex scenes

[Figure: 10-s ‘City’ sound analyzed by PDCASA into noise beds (Noise1, Noise2), a click (Click1), and periodic wefts (Wefts 1-12); labeled events with scores: Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10)]
Audio Information Retrieval
(with Manuel Reyes)
• Searching in a database of audio
- speech: use ASR
- text annotations: search them
- sound effects library?
• E.g. Muscle Fish “SoundFisher” browser
- define multiple ‘perceptual’ feature dimensions
- search by proximity in (weighted) feature space
- features are ‘global’ for each soundfile; no attempt to separate mixtures

[Diagram: sound segment database and query example each pass through segment feature analysis; the resulting feature vectors are compared in search/comparison to produce results]
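The proximity search can be sketched as a weighted Euclidean distance over one global feature vector per soundfile; the feature names, weights, and data below are invented for illustration, not Muscle Fish's actual set:

```python
import numpy as np

rng = np.random.default_rng(5)

# Pretend database: 50 soundfiles, each reduced to one global vector of
# [mfcc-mean, brightness, bandwidth, pitch] (hypothetical dimensions).
db = rng.normal(size=(50, 4))
weights = np.array([1.0, 2.0, 0.5, 1.0])  # user-tunable per-dim weights

def retrieve(query, db, weights, k=3):
    """Return indices of the k nearest files by weighted Euclidean
    distance in feature space."""
    d = np.sqrt(((db - query) ** 2 * weights).sum(axis=1))
    return np.argsort(d)[:k]

query = db[7] + 0.01 * rng.normal(size=4)  # near-duplicate of file 7
hits = retrieve(query, db, weights)
print(hits[0])  # → 7 (the near-duplicate ranks first)
```

Because each file is one point, a file containing a mixture gets a single blurred location in feature space, which is exactly the limitation the element-based approach on the later slide addresses.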
Audio Retrieval: Results
• Musclefish corpus
- most commonly reported set
• Features
- mfcc, brightness, bandwidth, pitch ...
- no temporal sequence structure
• Results: 208 examples, 16 classes
- global features: 41% correct; HMM models: 81% correct

[Table: confusion matrices over the grouped classes Musical, Speech, Environment, Animals, Mechanical for the global-feature and HMM systems]
What are the HMM states?
• No sub-units defined for nonspeech sounds
• Final states depend on EM initialization
- labels
- clusters
- transition matrix
• Have ideas of what we’d like to get
- investigate features/initialization to get there

[Figure: spectrogram of dogBarks2 (1.15-2.25 s) with the decoded HMM state sequence s4 s7 s3 s4 s3 s4 s2 s3 s2 s5 s3 s2 s7 s3 s2 s4 s2 s3 s7]
CASA for audio retrieval
• When audio material contains mixtures, global features are insufficient
• Retrieval based on element/object analysis:
- features are calculated over grouped subsets

[Diagram: continuous audio archive and query example each pass through generic element analysis and (object formation); element representations and objects + properties feed search/comparison to produce results; a symbolic query (“rushing water”) enters via word-to-class mapping, matching on properties alone]
Outline
1. Sound organization
2. Speech, music, and other
3. General sound organization
4. Future work & summary
Automatic audio-video analysis
(with Prof. Shih-Fu Chang, Prof. Kathy McKeown)
• Documentary archive management
- huge ratio of raw-to-finished material
- costly manual logging
- missed opportunities for cross-fertilization
• Problem: term ↔ signal mapping
- training corpus of past annotations
- interactive semi-automatic learning
- need object-related features

[Diagram: A/V data undergoes A/V segmentation and feature extraction (A/V features, generic detectors) while annotations undergo text processing (terms, concept discovery); unsupervised clustering of A/V features feeds a multimedia concept network supporting multimedia fusion (mining concepts & relationships), complex spatio-temporal classification, and question answering]
The ‘Machine listener’
• Goal: an auditory system for machines
- use the same environmental information as people
• Signal understanding
- monitor for particular sounds
- real-time description
• Scenarios
- personal listener → summary of your day
- future prosthetic hearing device
- autonomous robots
LabROSA Summary
DO
MA
INS
AP
PLI
CA
TIO
NS
ROSA
• Broadcast• Movies• Lectures
• Meetings• Personal recordings• Location monitoring
• Speech recognition• Speech characterization• Nonspeech recognition
• Object-based structure discovery & learning
• Scene analysis• Audio-visual integration• Music analysis
• Structuring• Search• Summarization• Awareness• Understanding
Summary: Audio Information Extraction
• Sound carries information
- useful and detailed
- often tangled in mixtures
• Various important general classes
- Speech: activity, recognition
- Music: segmentation, clustering
- Other: detection, description
• General processing framework
- Computational Auditory Scene Analysis
- Audio Information Retrieval
• Future applications
- ubiquitous intelligent indexing
- intelligent monitoring & description