music processing applications of music processing · 2017. 2. 14. · gabor wavelet spectrogram...
TRANSCRIPT
Music Processing
Christian Dittmar
Lecture
Applications of Music Processing
International Audio Laboratories [email protected]
Singing Voice Detection
Important pre-requisite for: Music segmentation Music thumbnailing (preview version) Singing voice transcription Singing voice separation Lyrics alignment Lyrics recognition
Singing Voice Detection
Detect singing voice activity during course of a recording Assumptions: Real-world, polyphonic music recordings are
analyzed Singing voice performs dominant melody above
accompaniment
10 15 20 25 30 35 40 45
Time in seconds
Singing Voice Detection
Challenges: Complex characteristics of singing voice Large diversity of accompaniment music Accompaniment may play same melody as singing Pitch-fluctuating instruments my be similar to singing
Stable pitch Fluctuating pitch
Singing Voice Detection
Common approach: Frame-wise extraction of audio features Classification via machine learning
10 15 20 25 30 35 40 45
Time in seconds
Audio Feature Extraction
Frame-wise processing: Hopsize Q Blocksize K Window function w(n) Signal frame x(n)
Compute for eachanalysis frame: Time-domain features Spectral features Cepstral feature others …
Audio Feature Extraction
Time-domain features: Zero Crossing Rate (ZCR) High-pitched vs. Low-pitched
Linear Prediction Coeff. (LPC) Encodes spectral envelope
Audio Feature Extraction
Spectral features: Spectrogram, linear vs. logarithmic frequency spacing Spectral Flatness (SF), Spectral Centroid (SC), and
many others …STFT Spectrogram [dB]
Time [Sec]
Freq
uenc
y [H
z]
0.5 1 1.5 2
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
Gabor Wavelet Spectrogram [dB]
Time [Sec]
Freq
uenc
y [H
z]
0.5 1 1.5 2
277
407
599
880
1293
1901
2794
4106
6035
8870
Audio Feature Extraction
Cepstral features: Singing voice as an example
Convolutive: excitation * filter Excitation: vibration of vocal folds Filter: resonance of the vocal tract
Magnitude spectrum Multiplicative: excitation · filter
Log-magnitude spectrum Additive: excitation + filter
“Liftering” Separation into smooth spectral
envelope and fine-structured excitation
0 0.5 1 1.5 2 2.5
x 104
Mag
nitu
de s
pect
rum
Extraction of spectral envelope via cepstral liftering
0 0.5 1 1.5 2 2.5
x 104Frequency (Hz)
Loga
rithm
ic m
agni
tude
Observed SpectrumSpectral EnvelopeExcitation Spectrum
Machine Learning
Application to audio signals: Speech recognition Speaker recognition Singing voice detection Genre classification Instrument recognition Chord recognition etc …
Machine Learning
Learning principles: Unsupervised learning
Find structures in data
Supervised learning Human observer provides „ground truth“
Semi-supervised learning Combination of above principles
Reinforcement learning Feedback of „confident“ classifications to
the training
The Feature Space
Geometric and algebraic interpretation of ML problems Features contain numerical values
Concatenation of several features Dimensionality M
The data set contains N observations Cardinality N
Illustrative Example SFM & SCF of 6 complex tones
1
0
1
0SC K
k
K
k
ks
kskf
1
0
1
0
1SFK
k
KK
k
ksK
ks
The Feature Space
Each feature has one value M=2
Number of observations N=6
258.62 0.59
512.73 0.99
550.13 0.92
146.50 0.27
47.93 0.01
43.95 0.01
SpectralCentroid
SpectralFlatness
M
N
lpNoiseTone.wav
noiseTone.wav
hpNoiseTone.wav
harmonicNoise.wav
pianoTone.wav
harmonicTone.wav
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10
100
200
300
400
500
600
Spectral Flatness
Spe
ctra
l Cen
troid
Scatter plot of Spectral Flatness vs. Spectral Centroid
lpNoiseTone.wavnoiseTone.wavhpNoiseTone.wavharmonicNoiseTone.wavpianoTone.wavharmonicTone.wav
The Feature Space
Each feature has one value M=2
Number of observations N=6
Mapping of features SC to y-axis SF to x-axis Scatter plot with
unnormalized axes
The Feature Space
Each feature has one value M=2
Number of observations N=6
Mapping of features SC to y-axis SF to x-axis Scatter plot with
unnormalized axes
Target class labels Provided by manual
annotation
258.62 0.59
512.73 0.99
550.13 0.92
146.50 0.27
47.93 0.01
43.95 0.01
SpectralCentroid
SpectralFlatness
0
0
0
1
1
1
TargetLabels
⋮ ⋮
Classification methods
k-Nearest Neighbours (kNN)
Singing VoiceAccompanimentUnknown data
Singing VoiceAccompanimentUnknown data
Classification methods
k-Nearest Neighbours (kNN)
L1-Dist. (Manhattan)
M
mmm yxd
11
L2-Dist. (Euclidean)2
12
M
mmm yxd
L∞-Dist. (Maximum)
MM yxyxd
,,max 11
Singing VoiceAccompanimentUnknown data
Classification methods
Decision Trees (DT)
Singing VoiceAccompanimentUnknown data
Classification methods
Random Forests (RF)
Classification methods
Gaussian Mixture Models (GMM)
Singing VoiceAccompanimentUnknown data
∙ Σ
Classification methods
Gaussian Mixture Models (GMM)
Singing VoiceAccompanimentUnknown data
Gauss components
Classification methods
Support Vector Machines (SVM)
Singing VoiceAccompanimentUnknown data
sgn ,
Classification methods
Deep Neural Networks (DNN)
Singing VoiceAccompanimentUnknown data
⋯ , ⋯ ,
Classification methods
Deep Neural Networks (DNN)
Singing VoiceAccompanimentUnknown data
Loss function
Classification methods
Further methods: Hidden Markov Models
Transition probabilities between GMMs Sparse Representation Classifier
Sparse linear combination of training data Boosting
Combine many weak classifiers Convolutional Neural Networks Recurrent Neural Networks Multiple Kernel Learning others …
25
Mel-scale Frequency Cepstral Coefficients
Filter BankFrame
txGaussian Mixture Model (GMM)
x
11,Σ22 ,Σ ... GG Σ,
+w
1w
2w
G
)|( xp
N () N ()
V V V V V N N N N N N N NV V NNN N
Segment-by-Segment Classification
1
0
1
0)|(log)|(log
W
iMitW
W
iSitW pp xx
Singing
Accompaniment
Singing Voice Detection
Audio MosaicingSource signal: BeesTarget signal: Beatles–Let it be
Mosaic signal: Let it Bee
NMF-Inspired Audio Mosaicing
≈
. =
Non-negative matrix factorization (NMF)
Proposed audio mosaicing approach
≈
.
Non-negative matrix Components Activations
Target’s spectrogram Source’s spectrogram Activations Mosaic’s spectrogram
fixed
learnedfixed
learned
fixed
learned
[Driedger et al. ISMIR 2015]
=
Time source
Freq
uenc
y
Tim
e so
urce
Time targetTime target
Freq
uenc
y
Basic NMF-Inspired Audio Mosaicing
Time target
Freq
uenc
y
Time source
Freq
uenc
y
Freq
uenc
y
Tim
e so
urce
Time targetTime target
. =≈
Spectrogram target
Spectrogram source
SpectrogrammosaicActivation matrix
Basic NMF-Inspired Audio Mosaicing
Time target
Freq
uenc
y
Time source
Freq
uenc
y
Freq
uenc
y
Tim
e so
urce
Time targetTime target
. =≈
Spectrogram target
Spectrogram source
SpectrogrammosaicActivation matrix
Core idea: support the development of sparse diagonal activation structures
Activation matrix
Iterative updates
Preserve temporal context
Time target
Freq
uenc
y
Time source
Freq
uenc
y
Freq
uenc
y
Tim
e so
urce
Time targetTime target
. =≈
Spectrogram target
Spectrogram source
SpectrogrammosaicActivation matrix
Basic NMF-Inspired Audio Mosaicing
Time target
Freq
uenc
y
Time source
Freq
uenc
y
Freq
uenc
y
Tim
e so
urce
Time targetTime target
. =≈
Spectrogram target
Spectrogram source
SpectrogrammosaicActivation matrix
Basic NMF-Inspired Audio Mosaicing
Audio MosaicingSource signal: WhalesTarget signal: Chic–Good times
Mosaic signal
https://www.audiolabs-erlangen.de/resources/MIR/2015-ISMIR-LetItBee
Audio MosaicingSource signal: Race carTarget signal: Adele–Rolling in the Deep
Mosaic signal
https://www.audiolabs-erlangen.de/resources/MIR/2015-ISMIR-LetItBee
Drum Source Separation
2 2.2 2.4 2.6 2.8 3 3.2 3.4 3.6 3.82 2.2 2.4 2.6 2.8 3 3.2 3.4 3.6 3.8
Time (seconds)
Rel
ativ
e am
plitu
de
Log-
frequ
ency
V V V V
V
V
V
STFT
iSTFT
Time (seconds)
Drum Source Separation Signal Model
Drum Sound Separation Decomposition via NMFD
Row
s ofH
Time (seconds)Lateral slices from W
UU U
W
Log-
frequ
ency
Score-based information(drum notation)
Audio-based information(training drum sounds)
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5
Log-
frequ
ency
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5
Drum Sound Separation
https://www.audiolabs-erlangen.de/resources/MIR/2016-IEEE-TASLP-DrumSeparation
Time (seconds)
Rel
ativ
e am
plitu
de