preetirao - pompeu fabra university
Preeti Rao
2nd CompMusic Workshop, Istanbul 2012
o Music signal characteristics
o Perceptual attributes and acoustic properties
o Signal representations for pitch detection
o STFT
o Sinusoidal model
o Pitch detection algorithms
o Polyphonic context and predominant pitch tracking
o Applications in MIR
WiSSAP 2007
*The Physics Classroom: http://www.glenbrook.k12.il.us/gbssci/phys/Class/sound/u11l2a.html
Digital audio format: PCM
•Sampling rate: 44.1 kHz, 22.05 kHz
•Amplitude resolution: 16 bits/sample
Department of Electrical Engineering, IIT Bombay
Interesting sounds are typically coded in the form of a
temporal sequence of “atomic sound events”.
E.g. speech -> a sequence of phones
music -> an evolving pattern of notes
An atomic sound event, or a single gestalt, can be a
complex acoustical signal described by a set of
temporal and spectral properties => an evoked
sensation.
A sound of given frequency components and sound
pressure levels leads to perceived sensations that
can be distinguished in terms of:
o loudness <-- intensity
o pitch <-- fundamental frequency
o timbre (“quality” or “colour”) <-- other spectro-temporal properties
[Figure: air pressure variation against time for two tones. Low pitch tone: frequency = 100 Hz, T0 = 10 msec. High pitch tone: frequency = 300 Hz, T0 = 3.3 msec.]
1 Hertz = 1 vibration/sec
Musical pitch scale
[Figure: pitch axis running from low pitch to high pitch]
semitone = frequency ratio of 2^(1/12)
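As a small worked example of the semitone ratio 2^(1/12), the sketch below raises a frequency by a number of semitones and measures an interval in cents. The 100-cents-per-semitone convention is standard; the function names and the example frequencies are illustrative assumptions, not taken from the slides.

```python
import math

SEMITONE = 2 ** (1 / 12)  # frequency ratio of one equal-tempered semitone


def shift_semitones(freq_hz, n):
    """Return freq_hz moved by n semitones (n may be negative)."""
    return freq_hz * SEMITONE ** n


def interval_cents(f1_hz, f2_hz):
    """Interval from f1 to f2 in cents (100 cents = 1 semitone)."""
    return 1200 * math.log2(f2_hz / f1_hz)


octave_up = shift_semitones(220.0, 12)  # 12 semitones double the frequency
fifth = interval_cents(200.0, 300.0)    # the 3:2 ratio, close to 7 semitones
```

The 3:2 ratio comes out near 702 cents, which is why the equal-tempered fifth (700 cents) is a close approximation of the small-whole-number interval.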
o The construction of a musical scale is based on two
assumptions about the human hearing process:
o The ear is sensitive to ratios of fundamental frequencies (pitches),
not so much to absolute pitch.
o The preferred “musical intervals”, i.e. those perceived to be most
consonant, are the ratios of small whole numbers.
o A musical sound is typically composed of several frequencies.
The frequencies are evident if we observe the “spectrum” of
the sound.
[Figure panels: 300 Hz; 600 Hz; 900 Hz; 300 Hz + 600 Hz; 300 Hz + 600 Hz + 900 Hz]
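To illustrate how a sound built from several frequencies shows them in its spectrum, here is a minimal NumPy sketch that sums the 300, 600 and 900 Hz tones named above. The 44.1 kHz sampling rate and the 50 ms duration are assumptions chosen so that each tone falls exactly on a DFT bin.

```python
import numpy as np

fs = 44100                       # sampling rate in Hz (assumed)
n_samples = 2205                 # 50 ms at 44.1 kHz, so DFT bins are 20 Hz apart
t = np.arange(n_samples) / fs
partials = [300.0, 600.0, 900.0]

# Complex tone: sum of three harmonically related sinusoids
x = sum(np.sin(2 * np.pi * f * t) for f in partials)

# Magnitude spectrum and the three strongest frequency bins
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(n_samples, d=1 / fs)
strongest = np.sort(freqs[np.argsort(spectrum)[-3:]])

# The sum still repeats every 1/300 s, i.e. every 147 samples at 44.1 kHz
period = fs // 300
```

The waveform of the sum looks more complicated than any single tone, yet it keeps the period of the 300 Hz component, which is the fundamental.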
[Figures: sound “atoms”, each shown as a waveform x(t) against t (ms) and a spectrum X(f) against f (Hz): x1(t), single tone signal; x2(t), non-tonal signal; x3(t), complex tone signal; x4(t), bandpass noise signal.]
[Figure: a flute note, waveform x1(t) against t (ms) and spectrum X1(f) in dB against f (kHz)]
o We see that the distinctive signal characteristics are
more evident in the frequency domain.
o The ear is a frequency analyzer. It represents a unique
combination of analysis and synthesis => we do not
perceive spectral components but rather the composite
sounds.
o We observe that a single “note” is perceived as one
entity of well-defined subjective sensations. This is due
to the spatial pattern recognition process achieved by
the central auditory system.
Major dimensions of music for retrieval are melody,
rhythm, harmony and timbre.
o Melody, harmony -> based on pitch content
o Rhythm -> based on timing information
o Timbre -> relates to instrumentation, texture
A representation of these high-level attributes can be
obtained from pitch, timing and spectro-temporal
information extracted by audio signal analysis.
Representations are then compared via a similarity
measure to achieve retrieval.
o The temporal pattern of frame-level features can offer
important cues to signal identity
Feature Extraction
[Block diagram: audio signal -> analysis windows (duration: 50 – 100 ms) -> frame-level features -> texture windows (duration: 0.5 – 1.0 s) -> feature summary -> feature vector]
M. F. McKinney and J. Breebaart, "Features for Audio and Music Classification," in Proc. ISMIR, 2003.
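The feature extraction chain can be sketched as follows: frame-level features are computed on short analysis windows and then summarised over longer texture windows. The choice of features (RMS energy and spectral centroid), the mean/standard-deviation summary, the non-overlapping frames, and the pairing of 50 ms analysis windows with 1 s texture windows are all assumptions made for illustration.

```python
import numpy as np


def frame_features(x, fs, frame_s=0.05):
    """RMS energy and spectral centroid for each non-overlapping analysis window."""
    n = int(round(frame_s * fs))
    n_frames = len(x) // n
    frames = x[: n_frames * n].reshape(n_frames, n)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    mag = np.abs(np.fft.rfft(frames * np.hanning(n), axis=1))
    freqs = np.fft.rfftfreq(n, d=1 / fs)
    centroid = (mag @ freqs) / (mag.sum(axis=1) + 1e-12)
    return np.column_stack([rms, centroid])


def texture_summary(features, frames_per_texture=20):
    """Mean and standard deviation of the frame features in each texture window."""
    n_tex = len(features) // frames_per_texture
    blocks = features[: n_tex * frames_per_texture]
    blocks = blocks.reshape(n_tex, frames_per_texture, -1)
    return np.hstack([blocks.mean(axis=1), blocks.std(axis=1)])


fs = 44100
t = np.arange(2 * fs) / fs
x = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # 2 s test tone
feats = frame_features(x, fs)             # one row per 50 ms frame
vectors = texture_summary(feats)          # one row per 1 s texture window
```

Each row of `vectors` is one feature vector; its standard-deviation columns carry the temporal pattern of the frame-level features within the texture window.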
Melody: pitch related feature
[Figure: melody shown as frequency/note against time]
Melody is the temporal sequence of notes usually played
by a single instrument (fixed timbre). The discrete notes
(pitches) are typically selected from a musical scale.
o Typical implementation:
o Pitch detection is carried out on the audio signal at uniformly spaced intervals
o The pitch sequence is segmented into notes (regions of relatively steady pitch)
o Notes are labeled
o Note patterns are matched to determine melodic similarity
o Challenges:
o Note segmentation can be a difficult task
o Pitch detection in polyphonic music is tough
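The segmentation step of the typical implementation above can be sketched as follows: a pitch sequence sampled at uniform intervals is split into regions of relatively steady pitch, and each region is labeled with the nearest semitone. The half-semitone tolerance, the minimum note length, the MIDI numbering (69 = 440 Hz) and the convention that 0 Hz marks an unvoiced frame are assumptions for illustration.

```python
import numpy as np


def segment_notes(pitch_hz, hop_s, tol_semitones=0.5, min_frames=3):
    """Group steady-pitch frames into (onset_s, duration_s, midi_note) tuples."""
    pitch_hz = np.asarray(pitch_hz, dtype=float)
    semis = np.full(pitch_hz.shape, np.nan)      # NaN marks unvoiced frames
    voiced = pitch_hz > 0
    semis[voiced] = 69 + 12 * np.log2(pitch_hz[voiced] / 440.0)

    notes, start = [], None
    for j in range(len(semis) + 1):
        cur = semis[j] if j < len(semis) else np.nan   # sentinel closes last note
        if start is not None:
            ref = np.median(semis[start:j])            # running note pitch
            if np.isnan(cur) or abs(cur - ref) > tol_semitones:
                if j - start >= min_frames:            # drop very short regions
                    notes.append((start * hop_s, (j - start) * hop_s,
                                  int(round(ref))))
                start = None
        if start is None and not np.isnan(cur):
            start = j                                  # a new region begins
    return notes


contour = [0, 220, 221, 219, 220, 0, 0, 330, 331, 329, 330, 330]  # Hz
notes = segment_notes(contour, hop_s=0.01)
```

On continuous pitch movements such as glides and ornaments, a fixed tolerance like this one breaks down quickly, which is one reason note segmentation is listed as a challenge.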
Monophonic Signal: cues to perceived pitch
[Figure panels: waveform; spectrum; “Schroeder histogram” PDA]
A. de Cheveigne. Multiple F0 estimation. In D.-L. Wang and G.J. Brown, editors, Computational Auditory Scene Analysis: Principles, Algorithms and Applications, IEEE Press / Wiley, 2006.
o Time (Lag) domain: maximise autocorrelation value
o Frequency domain: minimise error between
estimated and predicted harmonic structures
o Other
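A minimal sketch of the time (lag) domain approach: the pitch of one frame is taken from the lag that maximises the autocorrelation within an allowed range. The search range of 80 to 800 Hz, the frame length and the test signal are assumptions; practical detectors add normalisation and peak interpolation.

```python
import numpy as np


def acf_pitch(frame, fs, fmin=80.0, fmax=800.0):
    """Estimate F0 of one frame as fs / (lag of the autocorrelation maximum)."""
    frame = frame - np.mean(frame)
    # Keep non-negative lags 0 .. N-1 of the full autocorrelation
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)
    best_lag = lag_min + np.argmax(acf[lag_min:lag_max + 1])
    return fs / best_lag


fs = 8000
t = np.arange(320) / fs  # 40 ms frame
tone = (np.sin(2 * np.pi * 200 * t)
        + 0.5 * np.sin(2 * np.pi * 400 * t)
        + 0.3 * np.sin(2 * np.pi * 600 * t))
f0 = acf_pitch(tone, fs)
```

Because the unnormalised autocorrelation shrinks as the lag grows, the peak at one period (40 samples) wins over the peak at two periods, which helps avoid sub-octave errors in this simple case.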
Music and speech signals are typically time-varying in nature =>
a time-frequency representation is required to visualize signal
characteristics.
The short-time Fourier transform (STFT) affords such a
representation based on an assumption of signal quasi-stationarity.
The window shape dictates the time and frequency resolution trade-off.
$$X(n, \omega) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\, e^{-j\omega m}$$
[Figure: the signal x(m) is multiplied by the shifted window w(n-m); the DFT of x(m)w(n-m) gives X(n, ω) over 0 ≤ ω ≤ π]
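A minimal NumPy sketch of the STFT as a sequence of windowed DFTs. The Hann window, the 40 ms window length and the 10 ms hop are assumptions; each frame's phase is referenced to the frame start rather than to absolute time, so only the magnitudes correspond directly to |X(n, ω)|.

```python
import numpy as np


def stft(x, win_len, hop, n_fft=None):
    """Return X[frame, bin]: DFT of Hann-windowed frames, bins from 0 to pi."""
    n_fft = n_fft or win_len
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop: i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, n=n_fft, axis=1)


fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)              # 1 s, 1 kHz test tone
X = stft(x, win_len=640, hop=160)             # 40 ms window, 10 ms hop
peak_hz = np.argmax(np.abs(X[0])) * fs / 640  # strongest bin of frame 0, in Hz
```

Lengthening `win_len` narrows the spectral peaks (better frequency resolution) while averaging over a longer stretch of signal (poorer time resolution), which is the trade-off the window dictates.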
$$\hat{x}[t] = \sum_{i=1}^{I[t]} a_i[t] \cos(\Phi_i[t]) + e[t]$$
a_i[t] - amplitude variation of ith sinusoidal component (“partial”)
Φ_i[t] - total phase (represents both frequency and phase variation)
I[t] - number of partials, can vary with time
$$\Phi_i[t] = \omega_i[t]\, t + \varphi_i[t]$$
Model parameters to be estimated: $\{\omega_i, a_i, \varphi_i\}_l$
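The tonal part of the sinusoidal model can be sketched as additive synthesis from per-sample amplitude and frequency tracks. Accumulating the phase from the instantaneous frequency is a common discrete-time realisation and is an assumption here, as are the function name and the array layout; the residual e[t] is not modelled.

```python
import numpy as np


def synthesize(amps, freqs_hz, fs, phases0=None):
    """Sum of a_i[t] * cos(Phi_i[t]) over partials i.

    amps, freqs_hz: arrays of shape (I, T), one row per partial.
    Phi_i[t] is the running sum of the instantaneous frequency plus phases0[i].
    """
    amps = np.asarray(amps, dtype=float)
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    if phases0 is None:
        phases0 = np.zeros(amps.shape[0])
    phase = 2 * np.pi * np.cumsum(freqs_hz, axis=1) / fs + phases0[:, None]
    return np.sum(amps * np.cos(phase), axis=0)


fs, T = 8000, 800
amps = np.vstack([np.ones(T), 0.5 * np.ones(T)])          # two partials
freqs = np.vstack([200.0 * np.ones(T), 400.0 * np.ones(T)])
x_hat = synthesize(amps, freqs, fs)
```

Subtracting such a synthesised tonal component from the input leaves the residual.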
[Block diagram: audio signal -> window -> DFT -> peak detection -> peak tracking -> sinusoid parameters $\{\omega_i, a_i, \varphi_i\}_l$ -> additive synthesis -> tonal component; residual = audio signal - tonal component]
For the smooth evolution of the signal, sine components are detected in
each frame and linked to tracks from the previous frame based on
frequency proximity.
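Linking by frequency proximity can be sketched with a greedy nearest-peak rule: each live track takes the closest unused peak of the new frame if it lies within a maximum jump, otherwise the track ends, and leftover peaks start new tracks. The greedy order, the 30 Hz limit and the data layout are assumptions made for illustration.

```python
def link_peaks(tracks, peaks_hz, max_jump_hz=30.0):
    """Extend live tracks with the peaks of one new frame.

    tracks: list of lists of frequencies, one list per live track.
    Returns (continued_tracks, dead_tracks).
    """
    unused = list(peaks_hz)
    continued, dead = [], []
    for track in tracks:
        if unused:
            nearest = min(unused, key=lambda f: abs(f - track[-1]))
            if abs(nearest - track[-1]) <= max_jump_hz:
                unused.remove(nearest)
                continued.append(track + [nearest])
                continue
        dead.append(track)                  # no close peak: the track dies
    continued.extend([f] for f in unused)   # unmatched peaks: tracks are born
    return continued, dead


frames = [[200.0, 400.0], [203.0, 398.0, 610.0], [205.0, 612.0]]
live, finished = [], []
for peaks in frames:
    live, dead = link_peaks(live, peaks)
    finished.extend(dead)
```

In this toy input the track near 400 Hz dies in the third frame, and a track near 610 Hz is born in the second.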
[Figure: spectral magnitude (dB) against frequency (Hz, 0 to 3000), showing the fixed threshold (MaxPeak - 40 dB) and the final peaks picked]
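Peak picking with a fixed threshold can be sketched as: find the local maxima of the dB spectrum and keep those within 40 dB of the largest peak. The local-maximum rule and the toy spectrum are assumptions for illustration.

```python
import numpy as np


def pick_peaks(mag_db, freqs_hz, drop_db=40.0):
    """Local maxima of a dB spectrum lying within drop_db of the largest peak."""
    mag_db = np.asarray(mag_db, dtype=float)
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    interior = mag_db[1:-1]
    is_peak = (interior > mag_db[:-2]) & (interior >= mag_db[2:])
    idx = np.flatnonzero(is_peak) + 1
    if idx.size == 0:
        return freqs_hz[idx], mag_db[idx]          # no peaks found
    keep = idx[mag_db[idx] >= mag_db[idx].max() - drop_db]
    return freqs_hz[keep], mag_db[keep]


mag_db = [-50, 30, -50, -5, -50, -20, -50]         # toy spectrum in dB
freqs_hz = np.arange(7) * 100.0
peak_freqs, peak_mags = pick_peaks(mag_db, freqs_hz)
```

Here the maximum is 30 dB, so the threshold sits at -10 dB: the peaks at 100 Hz and 300 Hz are kept and the -20 dB peak at 500 Hz is rejected.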
[Figure: spectral magnitude (dB) against frequency (Hz, 0 to 3000), with thresholds at envelope - 20 dB, envelope - 25 dB and envelope - 30 dB]
Match spectrum around peak with that of
ideal sinusoid. Apply threshold to the error.
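Peak tracking
[Figure: sine peaks plotted as frequency against time (frames 0 to 4) and linked into tracks A, B, C and D, with the points where a track is born and where a track dies marked]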
[Figure: frequency (Hz, 0 to 2000) against time (sec, 0 to 20) for polyphonic audio, with the sources labeled: singer (main melody), harmonium (secondary melody), tanpura (drone), tabla (percussion; strokes Ghe, Na, Tun)]
o Input : magnitudes + locations of sinusoids
o For a range of trial fundamentals, generate predicted harmonics
o Minimise TWM error w.r.t. trial fundamentals
$$Err_{total} = \frac{Err_{p \to m}}{N} + \rho\, \frac{Err_{m \to p}}{K}$$
[Figure: nearest neighbour matching, panels a and b. Predicted components: 100, 200, 300, 400, 500, 600, 700, 800. Measured components: 100, 200, 375, 420, 700, 800.]
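[Figure: trellis of pitch candidates p against frame j; the local costs E(p,j) and E(p',j+1) are linked by the transition cost W(p,p')]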
p → Pitch candidates, j → Frame (time instant)
E → Measurement cost (local), W → Smoothness cost
Minimize the Global transition cost over the singing spurt
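The minimisation can be sketched as a Viterbi-style dynamic programme over the candidate trellis. The form of the smoothness cost, W(p,p') proportional to the pitch jump in octaves, the shared candidate list for all frames, and the toy costs are assumptions for illustration.

```python
import numpy as np


def dp_track(cands_hz, E, smooth_weight=1.0):
    """Choose one candidate per frame minimising the sum of E and W.

    cands_hz: (P,) candidate frequencies, the same for every frame
    E: (J, P) measurement cost of candidate p in frame j
    W(p, p') = smooth_weight * |log2(f_p' / f_p)|   (assumed smoothness cost)
    """
    cands_hz = np.asarray(cands_hz, dtype=float)
    E = np.asarray(E, dtype=float)
    logf = np.log2(cands_hz)
    W = smooth_weight * np.abs(logf[:, None] - logf[None, :])
    J, P = E.shape
    cost = E[0].copy()                       # best cost ending at each candidate
    back = np.zeros((J, P), dtype=int)       # best predecessor per frame
    for j in range(1, J):
        total = cost[:, None] + W            # total[p, p']: move from p to p'
        back[j] = np.argmin(total, axis=0)
        cost = total[back[j], np.arange(P)] + E[j]
    path = [int(np.argmin(cost))]
    for j in range(J - 1, 0, -1):            # backtrack from the last frame
        path.append(int(back[j][path[-1]]))
    return cands_hz[path[::-1]]


cands = [100.0, 200.0]
E = [[1.0, 0.0], [0.2, 0.5], [1.0, 0.0]]     # frame 1 locally prefers 100 Hz
track = dp_track(cands, E)
```

Picking the locally cheapest candidate would give 200, 100, 200 Hz; the smoothness cost makes the octave jumps more expensive than staying at 200 Hz throughout.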
[Block diagram: polyphonic audio signal -> signal representation -> multi-F0 analysis -> predominant-F0 trajectory extraction -> singing voice detection -> voice F0 contour]
“Pitch class profile”
o Pitch histogram
o Similarity measure involves match between histograms
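A pitch class profile can be sketched as a histogram of pitch values folded into one octave relative to a tonic, and two recordings can then be compared through their histograms. The 12-bin resolution, the use of a known tonic, and cosine similarity as the match measure are assumptions for illustration.

```python
import numpy as np


def pitch_class_histogram(pitch_hz, tonic_hz, n_bins=12):
    """Normalised histogram of pitch folded into one octave above the tonic."""
    pitch_hz = np.asarray(pitch_hz, dtype=float)
    voiced = pitch_hz[pitch_hz > 0]                  # 0 Hz marks unvoiced frames
    semis = 12 * np.log2(voiced / tonic_hz)
    classes = np.round(semis * n_bins / 12).astype(int) % n_bins
    hist = np.bincount(classes, minlength=n_bins).astype(float)
    return hist / max(hist.sum(), 1.0)


def histogram_similarity(h1, h2):
    """Cosine similarity between two histograms (one possible match measure)."""
    denom = np.linalg.norm(h1) * np.linalg.norm(h2) + 1e-12
    return float(np.dot(h1, h2) / denom)


a = pitch_class_histogram([220, 220, 247, 277, 440, 0], tonic_hz=220.0)
b = pitch_class_histogram([110, 123.5, 138.6, 110], tonic_hz=110.0)
sim = histogram_similarity(a, b)
```

Both toy contours use the same three pitch classes relative to their own tonic, so the similarity is high even though they lie an octave apart.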
Detects phrases melodically similar to ‘Guru Bina’ pitch contour
[Figure: pitch contours of positive phrases and a negative phrase; annotations: emphatic beat (sam); swaras: S S N R]
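• Predicted to measured error
$$Err_{p \to m} = \sum_{n=1}^{N} \left\{ \Delta f_n \cdot (f_n)^{-p} + \frac{a_n}{A_{max}} \times \left[ q\, \Delta f_n \cdot (f_n)^{-p} - r \right] \right\}$$
• Significant term : Δf / (f)^p
o Δf = frequency mismatch error
o f = partial frequency
• Measured to predicted error
$$Err_{m \to p} = \sum_{k=1}^{K} \left\{ \Delta f_k \cdot (f_k)^{-p} + \frac{a_k}{A_{max}} \times \left[ q\, \Delta f_k \cdot (f_k)^{-p} - r \right] \right\}$$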
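The two mismatch terms can be combined into a pitch estimator, sketched here with Err_total = Err_p→m / N + ρ · Err_m→p / K evaluated on a grid of trial fundamentals. The parameter values p = 0.5, q = 1.4, r = 0.5 and ρ = 0.33 are commonly quoted TWM defaults and are not given in these slides; the rule for the number of predicted harmonics, the function names and the test peaks are also assumptions.

```python
import numpy as np


def twm_error(f0, peak_freqs, peak_amps, p=0.5, q=1.4, r=0.5, rho=0.33):
    """Two-way mismatch error of trial fundamental f0 against measured peaks."""
    peak_freqs = np.asarray(peak_freqs, dtype=float)
    peak_amps = np.asarray(peak_amps, dtype=float)
    a_max = peak_amps.max()
    # Predicted harmonics up to the highest measured peak (assumed rule)
    n_harm = max(1, int(peak_freqs.max() // f0))
    predicted = f0 * np.arange(1, n_harm + 1)

    # Predicted -> measured: each harmonic is matched to its nearest peak
    near = np.argmin(np.abs(predicted[:, None] - peak_freqs[None, :]), axis=1)
    w = np.abs(predicted - peak_freqs[near]) * predicted ** (-p)
    err_pm = np.sum(w + (peak_amps[near] / a_max) * (q * w - r))

    # Measured -> predicted: each peak is matched to its nearest harmonic
    near = np.argmin(np.abs(peak_freqs[:, None] - predicted[None, :]), axis=1)
    w = np.abs(peak_freqs - predicted[near]) * peak_freqs ** (-p)
    err_mp = np.sum(w + (peak_amps / a_max) * (q * w - r))

    return err_pm / n_harm + rho * err_mp / len(peak_freqs)


def twm_f0(peak_freqs, peak_amps, f0_grid):
    """Trial fundamental with the smallest total TWM error."""
    errors = [twm_error(f0, peak_freqs, peak_amps) for f0 in f0_grid]
    return float(f0_grid[int(np.argmin(errors))])


peaks = np.array([200.0, 400.0, 600.0, 800.0, 1000.0])
amps = np.ones(5)
best_f0 = twm_f0(peaks, amps, np.arange(80.0, 401.0, 1.0))
```

For this clean harmonic series, 100 Hz is penalised because half of its predicted harmonics have no nearby measured peak, and 400 Hz is penalised because several measured peaks have no nearby harmonic; using both directions is what resolves the octave ambiguity.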
Melody detection system [1]
o F0 search range (male/female)
o p, q, r
o ρ (male/female)
o Window length (pitch range and rate of variation)
o Smoothness cost parameter (rate of pitch variation)
o Voicing threshold
o Window length is an analysis parameter that
influences the accuracy of sinusoidal modeling of
the signal
o Closely-spaced components in the polyphony =>
need for higher frequency resolution = longer
windows
o Pitch variation with time can be rapid in
ornamented regions => need for better time
resolution = shorter windows
o Easily computable measures for adapting window length
o Signal sparsity : a sparse spectrum is more “concentrated” =>
better represented sinusoidal components
o Window length selection (20, 30, 40 ms) based on maximizing
signal sparsity
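Window length selection by sparsity can be sketched as: compute the magnitude spectrum with each candidate window (20, 30, 40 ms) around the same instant and keep the length whose spectrum is sparsest. The slides do not say which sparsity measure is used; the Gini index below, the Hann window, the FFT size and the test signal are assumptions for illustration.

```python
import numpy as np


def gini_sparsity(c):
    """Gini index of a non-negative vector: 0 if flat, near 1 if very sparse."""
    c = np.sort(np.abs(np.asarray(c, dtype=float)))
    n, total = len(c), c.sum()
    if total == 0:
        return 0.0
    k = np.arange(1, n + 1)
    return 1.0 - 2.0 * np.sum((c / total) * (n - k + 0.5) / n)


def pick_window_ms(x, fs, center, candidates_ms=(20, 30, 40), n_fft=4096):
    """Candidate window length (ms) giving the sparsest magnitude spectrum."""
    best_ms, best_score = None, -np.inf
    for ms in candidates_ms:
        n = int(round(ms * 1e-3 * fs))
        start = max(0, center - n // 2)
        seg = x[start:start + n]                     # frame centred on `center`
        mag = np.abs(np.fft.rfft(seg * np.hanning(len(seg)), n=n_fft))
        score = gini_sparsity(mag)
        if score > best_score:
            best_ms, best_score = ms, score
    return best_ms


fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 466 * t)  # close partials
chosen = pick_window_ms(x, fs, center=len(x) // 2)
```

Repeating the choice for every analysis instant lets the window follow the signal: longer where closely spaced components need frequency resolution, shorter where the pitch moves quickly.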
1. V. Rao and P. Rao, “Vocal melody extraction in the presence of pitched accompaniment in polyphonic music,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2145–2154, Nov. 2010.
2. V. Rao, P. Gaddipati and P. Rao, “Signal-driven window adaptation for sinusoid identification in polyphonic music,” IEEE Transactions on Audio, Speech, and Language Processing, Jan. 2012.