Audio and (Multimedia) Methods for Video Analysis
Dr. Gerald Friedland, fractor@icsi.berkeley.edu
CS-298 Seminar, Fall 2011

More on Sound
• Responses to Email Questions
• Useful Filters
• More Features
• Typical Machine Learning on Audio
• More on Tools

Q&A
Why didn’t you talk about Loudspeakers?

Loudspeakers vs Microphones
Source: http://www.mediacollege.com/audio/microphones/dynamic.html
Source: http://www.physics.org

Loudspeaker Characteristics
Frequency dependent! (As are the response patterns of Microphones!)

Q&A
- Speakers don’t really help with video analysis...
- Exception: Multichannel!

Q&A
How is the directionality graph for Microphones generated?
- According to IEC 60268-4

Q&A
Does lossy compression influence audio analysis?
- Yes, but the degree is unknown.
- Common sense: it is better to train your models on data that went through the same compression algorithm as the data you analyze.

Q&A
Why is the CD sampling rate 44100Hz and not just 44000?
- Because it is divisible by both 25 video frames per second (PAL) and 30 fps (NTSC), so every video frame holds a whole number of samples. 44000 is not divisible by 30.
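
The divisibility argument is easy to verify. The snippet below is only an illustrative arithmetic check; the function name `samples_per_frame` is made up for this example.

```python
def samples_per_frame(sample_rate_hz, frames_per_second):
    """Return (whole samples per video frame, samples left over per second)."""
    return divmod(sample_rate_hz, frames_per_second)

for rate in (44100, 44000):
    for fps in (25, 30):
        whole, leftover = samples_per_frame(rate, fps)
        print(f"{rate} Hz at {fps} fps: {whole} samples/frame, remainder {leftover}")
```

Only 44100 Hz leaves no remainder at both frame rates.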

Useful Filters
• Low-, High-, Band-pass, Graphic Equalizer
• Dynamic Range Compression
• Filter by Example: Spectral Subtraction, Wiener Filter
• Handling Multiple Channels

Remember: Fourier Transform
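DFT: $X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}$
iDFT: $x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{2\pi i k n / N}$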
More Information: “Introduction to Multimedia Computing”, Draft Chapter 8
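
As a minimal sketch of the transform pair, the following naive O(N²) implementation assumes the common convention in which the forward transform is unnormalized and the inverse carries the 1/N factor; the slides may use a different normalization. The function names are chosen for this example.

```python
import cmath

def dft(x):
    """Naive O(N^2) discrete Fourier transform of the sequence x."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT; the 1/N factor is applied here (one common convention)."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

signal = [1.0, 2.0, 0.0, -1.0]
spectrum = dft(signal)      # spectrum[0] is the sum of all samples (DC coefficient)
restored = idft(spectrum)   # equals the input up to rounding error
```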

Remember: Fast Fourier Transform
More Information: “Introduction to Multimedia Computing”, Draft Chapter 8
// Input: An array X of N samples. N must be an integer power of two.
// s is the stride.
// Output: The Discrete Fourier Transform of X in the array Y.
FFT(X, N, s)
  IF (N==1) THEN
    Y[0] := X[0]                                     // Trivial size-1 case: sample = DC coefficient
  ELSE
    Y[0,...,N/2-1] := FFT(X, N/2, 2*s)               // DFT of (X[0], X[2s], X[4s], ...)
    Y[N/2,...,N-1] := FFT(X[s,...,N-1], N/2, 2*s)    // DFT of (X[s], X[s+2s], X[s+4s], ...)
    FOR k := 0 TO N/2-1                              // Combine the two halves into one
      t := Y[k]
      Y[k]     := t + exp(-2*pi*i*k/N) * Y[k+N/2]
      Y[k+N/2] := t - exp(-2*pi*i*k/N) * Y[k+N/2]
  FFT := Y
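
A runnable counterpart to the pseudocode, as a sketch rather than the course's reference implementation: it uses Python list slicing in place of the explicit stride parameter s, which is a simplification introduced here.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT. len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]              # size-1 case: sample = DC coefficient
    even = fft(x[0::2])                     # DFT of x[0], x[2], x[4], ...
    odd = fft(x[1::2])                      # DFT of x[1], x[3], x[5], ...
    y = [0j] * n
    for k in range(n // 2):                 # combine the two halves
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        y[k] = even[k] + twiddle
        y[k + n // 2] = even[k] - twiddle
    return y

impulse_spectrum = fft([1, 0, 0, 0])        # a unit impulse has a flat spectrum
constant_spectrum = fft([1, 1, 1, 1])       # a constant signal has only a DC term
```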

Remember: Convolution Theorem
More Information: “Introduction to Multimedia Computing”, Draft Chapter 8
Let f and g be two functions and f * g their convolution. Then: $\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}$
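
The theorem can be checked numerically. For finite sampled signals and the DFT it holds for circular convolution, which is what this sketch uses; the helper names and test sequences are arbitrary choices for the example.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def circular_convolve(f, g):
    """Circular convolution: (f * g)[t] = sum over m of f[m] * g[(t - m) mod N]."""
    n = len(f)
    return [sum(f[m] * g[(t - m) % n] for m in range(n)) for t in range(n)]

f = [1.0, 2.0, 3.0, 4.0]
g = [0.5, 0.0, -1.0, 0.25]

lhs = dft(circular_convolve(f, g))              # transform of the convolution
rhs = [a * b for a, b in zip(dft(f), dft(g))]   # product of the transforms
max_diff = max(abs(a - b) for a, b in zip(lhs, rhs))
```

This is why filtering is often done by multiplying spectra instead of convolving in the time domain.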

Bandpass Filters

Dynamic Range Compression
Dark blue: Original, Light Blue: Dynamic Range Compressed
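
To illustrate the idea, here is a deliberately simplified static compressor. The threshold and ratio values are assumptions for this example; practical compressors work on a smoothed level estimate with attack and release times, and the plot on the slide may have been produced differently.

```python
import math

def compress(samples, threshold=0.5, ratio=4.0):
    """Static compressor: shrink the part of |x| above the threshold by `ratio`."""
    out = []
    for x in samples:
        level = abs(x)
        if level > threshold:
            level = threshold + (level - threshold) / ratio
        out.append(math.copysign(level, x))
    return out

loud_and_quiet = [0.1, -0.3, 0.9, -1.0, 0.5]
compressed = compress(loud_and_quiet)   # quiet samples unchanged, loud ones reduced
```

Loud passages are pulled toward the threshold while quiet passages stay as they are, so the gap between the loudest and quietest parts shrinks.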

Filter by Example: Spectral Subtraction
Dark blue: Original with humming noise, Light blue: Corrected using spectral subtraction of humming.
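
A single-frame sketch of spectral subtraction, under strong simplifying assumptions: the noise example (the hum) is known exactly and covers the whole frame, and a naive DFT is used. Real systems process overlapping windowed frames and estimate the noise spectrum from a noise-only stretch. All names and signal parameters below are invented for this example.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def spectral_subtract(noisy, noise_example):
    """Subtract the noise magnitude spectrum per bin, keeping the noisy phase."""
    noise_mag = [abs(c) for c in dft(noise_example)]
    cleaned = []
    for c, n_mag in zip(dft(noisy), noise_mag):
        mag = max(abs(c) - n_mag, 0.0)              # magnitudes cannot go below zero
        cleaned.append(cmath.rect(mag, cmath.phase(c)))
    return [v.real for v in idft(cleaned)]

n = 64
hum = [0.5 * math.sin(2 * math.pi * 4 * t / n) for t in range(n)]   # noise example
tone = [math.sin(2 * math.pi * 10 * t / n) for t in range(n)]       # wanted signal
noisy = [a + b for a, b in zip(tone, hum)]

cleaned = spectral_subtract(noisy, hum)
residual = max(abs(c - s) for c, s in zip(cleaned, tone))           # close to zero
```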

Filter by Example: Wiener Filter
Assume:
where:
• s(t) is the original signal
• n(t) is the noise footprint
• ŝ(t) is the estimated signal
• g(t) is the Wiener filter
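
In the standard formulation of the Wiener filter, which the symbol list above matches, the assumption is that the estimate is obtained by filtering the noisy observation. The following is a reconstruction under that assumption, with $*$ denoting convolution:

```latex
\hat{s}(t) = g(t) * \bigl( s(t) + n(t) \bigr)
```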
Define Error:
with α being the delay. Thus:
(squared error)
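
In the standard derivation, which these slides appear to follow, the error compares the estimate with the original signal shifted by the delay α, and the squared error follows by expanding the square. This is a reconstruction, not a transcription of the slide:

```latex
e(t) = s(t + \alpha) - \hat{s}(t)

e^2(t) = s^2(t + \alpha) - 2\, s(t + \alpha)\, \hat{s}(t) + \hat{s}^2(t)
```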
Results in:
where:
• X(s(t)) is the power spectrum of the signal s(t),
• N(s(t)) is the power spectrum of the noise, and
• G(s(t)) is the frequency-space Wiener filter.
Problem: Do we know N(s(t))?
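
For reference, the familiar frequency-domain form of the Wiener filter, written in the notation of this slide and assuming signal and noise are uncorrelated, is shown below. It is the textbook result and may differ in detail from the formula on the original slide:

```latex
G(s(t)) = \frac{X(s(t))}{X(s(t)) + N(s(t))}
```

Each frequency is passed almost unchanged where the signal dominates and is attenuated toward zero where the noise dominates, which is why an estimate of N(s(t)) is needed.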

Multichannel Audio
• Microphone array recordings are popular with the speech community
• Stereo has been popular since the 60s
• Most cameras have stereo microphones
• Most DVDs are encoded in Dolby Digital 5.1 and in Dolby Surround on the stereo track

Dolby Digital
Source: http://www.aspecttv.co.uk/surround-sound-setup/

Dolby Surround

How to Cope with Multiple Channels?
• Treat them individually (different channel = different content)
• Combine them, ideally while reducing noise
• Cross-infer between channels
• Any combination of the above

Combining Channels
• Beamforming, two major techniques:
- Delay-Sum
- Frequency-Domain Beamformer

Delay-Sum Beamforming
More Information: http://www.xavieranguera.com/phdthesis/
Output:
- Sum of all channels (lower noise)
- Delay between microphones (directional signal)
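
A toy sketch of delay-and-sum beamforming that produces both outputs listed above. It assumes integer sample delays found by a brute-force cross-correlation search against the first channel; production beamformers typically use more robust delay estimation and channel weighting. The function names and the `max_lag` parameter are made up for this example.

```python
def estimate_delay(reference, channel, max_lag):
    """Lag (in samples) that maximizes the cross-correlation with the reference."""
    def correlation(lag):
        return sum(reference[t] * channel[t + lag]
                   for t in range(len(reference))
                   if 0 <= t + lag < len(channel))
    return max(range(-max_lag, max_lag + 1), key=correlation)

def delay_and_sum(channels, max_lag=8):
    """Align every channel to the first one, then average the aligned samples."""
    reference = channels[0]
    delays = [estimate_delay(reference, ch, max_lag) for ch in channels]
    output = []
    for t in range(len(reference)):
        aligned = [ch[t + d] for ch, d in zip(channels, delays)
                   if 0 <= t + d < len(ch)]
        output.append(sum(aligned) / len(aligned))
    return output, delays

pulse = [0.0] * 16
pulse[4:7] = [1.0, 2.0, 1.0]
mic1 = pulse
mic2 = [0.0] * 3 + pulse[:-3]          # same source, arriving 3 samples later

summed, delays = delay_and_sum([mic1, mic2])
```

The per-channel delays carry the directional information, while averaging the aligned channels reinforces the common source and averages out uncorrelated noise.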

More on Features
• Mel-Frequency Cepstral Coefficients (MFCC)
Other (not explained here):
• LPC (Linear Prediction Coefficients)
• PLP (Perceptual Linear Predictive) Features
• RASTA (see Morgan et al.)
• MSG (Modulation Spectrogram)

MFCC: Idea
Power cepstrum of the signal.
Pipeline: Audio Signal → Pre-emphasis → Windowing → FFT → Mel-Scale Filterbank → Log-Scale → DCT → MFCC
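
The pipeline above can be sketched for a single frame as follows. Every concrete choice here is an assumption rather than something stated on the slides: the pre-emphasis coefficient 0.97, the Hamming window, 20 triangular filters, the widely used mel formula 2595·log10(1 + f/700), and dropping coefficient 0. A naive DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Power of DFT bins 0..N/2 (naive DFT; an FFT would be used in practice)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [hz * n_fft / sample_rate for hz in edges_hz]    # fractional bin positions
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))           # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))           # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(frame, sample_rate, num_filters=20, num_coeffs=12, pre_emphasis=0.97):
    n = len(frame)
    # Pre-emphasis: boost high frequencies.
    emphasized = [frame[0]] + [frame[t] - pre_emphasis * frame[t - 1] for t in range(1, n)]
    # Windowing (Hamming).
    windowed = [emphasized[t] * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
                for t in range(n)]
    # Power spectrum, mel filterbank energies, log scale.
    power = power_spectrum(windowed)
    log_energies = [math.log(sum(w * p for w, p in zip(row, power)) + 1e-10)
                    for row in mel_filterbank(num_filters, n, sample_rate)]
    # DCT-II of the log energies; coefficient 0 (overall level) is skipped.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energies))
            for c in range(1, num_coeffs + 1)]

sample_rate = 8000
frame = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(200)]  # 25 ms
coefficients = mfcc(frame, sample_rate)      # 12 values for this frame
```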

MFCC: Mel Scale

MFCC: Result

MFCC Variants and Derivatives
Derivatives:
• LFCC (no Mel scale)
• AMFCC (anti Mel scale)
Parameters:
• MFCC12 (often used for ASR)
• MFCC19 (often used in speaker ID, diarization)
• “delta”: coefficients of neighboring frames subtracted (“first derivative”)
• “deltadelta”: “second derivative”
• Short term: usually calculated on a 10-50 ms window
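
A minimal sketch of the “delta” and “deltadelta” features as plain differences between consecutive frames, which is the literal reading of “coefficients subtracted”. Toolkits often compute deltas with a regression over several neighboring frames instead, so treat this as an illustration only; the sample values are invented.

```python
def delta(frames):
    """Difference between each feature vector and the one before it."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(frames, frames[1:])]

mfcc_frames = [[1.0, 2.0], [1.5, 1.0], [3.0, 1.0]]   # three frames, two coefficients
d = delta(mfcc_frames)     # "delta":      [[0.5, -1.0], [1.5, 0.0]]
dd = delta(d)              # "deltadelta": [[1.0, 1.0]]
```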

Typical Machine Learning for Audio Analysis
• Gaussian Mixture Models (today)
• HMMs
• Bayesian Information Criterion
• Supervector Approaches

Reminder: K-Means
Algorithm Outline (Expectation Maximization):
Choose k initial means µi at random
loop
  for all samples xj:
    assign membership of each element to a mean (closest mean)
  for all means µi:
    calculate a new µi by averaging all values xj that were assigned as members
until means µi are not updated significantly anymore
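
The outline translates almost line by line into code. This sketch assumes one-dimensional samples to stay short; the `tolerance` and `seed` parameters are additions for the example, and an empty cluster simply keeps its previous mean.

```python
import random

def k_means(samples, k, tolerance=1e-9, seed=0):
    rng = random.Random(seed)
    means = rng.sample(samples, k)                  # k initial means chosen at random
    while True:
        # Assign each sample to its closest mean.
        clusters = [[] for _ in range(k)]
        for x in samples:
            closest = min(range(k), key=lambda i: (x - means[i]) ** 2)
            clusters[closest].append(x)
        # Recompute each mean as the average of its members.
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        # Stop once the means no longer change significantly.
        if max(abs(a - b) for a, b in zip(new_means, means)) < tolerance:
            return sorted(new_means)
        means = new_means

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers = k_means(data, k=2)        # approximately [1.0, 5.0]
```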

Reminder: Gaussian Mixtures

Reminder: Training of Mixture Models
Goal: Find ai for
Expectation:
Maximization:
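
For orientation, the textbook EM updates for a Gaussian mixture are given below. The symbols $\gamma_{ij}$ (responsibility of component $i$ for sample $x_j$), $n$ (number of samples), and $k$ (number of components) are introduced here and are not taken from the slides, whose own notation may differ. The first line is the mixture density, the second the expectation step, and the rest the maximization step.

```latex
p(x) = \sum_{i=1}^{k} a_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{k} a_i = 1

\gamma_{ij} = \frac{a_i \, \mathcal{N}(x_j \mid \mu_i, \Sigma_i)}
                   {\sum_{l=1}^{k} a_l \, \mathcal{N}(x_j \mid \mu_l, \Sigma_l)}

a_i = \frac{1}{n} \sum_{j=1}^{n} \gamma_{ij},
\qquad
\mu_i = \frac{\sum_{j} \gamma_{ij} \, x_j}{\sum_{j} \gamma_{ij}},
\qquad
\Sigma_i = \frac{\sum_{j} \gamma_{ij} \, (x_j - \mu_i)(x_j - \mu_i)^{\top}}{\sum_{j} \gamma_{ij}}
```

The two steps mirror k-means: the expectation step is a soft version of the membership assignment, and the maximization step is a weighted version of the averaging.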

Magic Duo
~ 90% of audio papers use the combination of MFCCs and Gaussian Mixture Models to model audio signals!

More Tools: HTK
Open Source Speech Recognition Toolkit: http://htk.eng.cam.ac.uk/

More Tools: A nice collection of them
Princeton SoundLab: http://soundlab.cs.princeton.edu/software/

More Tools: BeamformIT
Open Source Beamformer: http://beamformit.sourceforge.net/

Next Week
• High-Level Design of Typical Audio Analytics
• Measuring Error
• Multimedia

Organizing the Presentations
1) Shaunack Chatterjee
2) David Sun
3) David Sheffield
4) Armin Samii
5) Jonathan Kolker
6) Venkye Enkambaram
7) Tim Althoff
8) Giulia Fanti
9) Shuo-Yiin Chang
10) Ekatarina Gonina
11) Jaeyoung Choi