preetirao - pompeu fabra university


Page 1: PreetiRao - Pompeu Fabra University

Preeti Rao

2nd CompMusic Workshop, Istanbul 2012

Page 2: PreetiRao - Pompeu Fabra University

o Music signal characteristics

o Perceptual attributes and acoustic properties

o Signal representations for pitch detection

o STFT

o Sinusoidal model

o Pitch detection algorithms

o Polyphonic context and predominant pitch tracking

o Applications in MIR


Page 3: PreetiRao - Pompeu Fabra University

WiSSAP 2007

*The Physics Classroom: http://www.glenbrook.k12.il.us/gbssci/phys/Class/sound/u11l2a.html

Digital audio format: PCM

•Sampling rate: 44.1 kHz, 22.05 kHz

•Amplitude resolution: 16 bits/sample

Page 4: PreetiRao - Pompeu Fabra University

Department of Electrical Engineering, IIT Bombay

Interesting sounds are typically coded in the form of a

temporal sequence of “atomic sound events”.

E.g. speech -> a sequence of phones

music -> an evolving pattern of notes

An atomic sound event, or a single gestalt, can be a

complex acoustical signal described by a set of

temporal and spectral properties => an evoked

sensation.

Page 5: PreetiRao - Pompeu Fabra University


A sound of given frequency components and sound

pressure levels leads to perceived sensations that

can be distinguished in terms of:

o loudness <-- intensity

o pitch <-- fundamental frequency

o timbre (“quality” or “colour”)

<-- other spectro-temporal properties

Page 6: PreetiRao - Pompeu Fabra University


[Figure: air pressure variation vs. time for a low pitch tone (Frequency = 100 Hz, T0 = 10 msec) and a high pitch tone (Frequency = 300 Hz, T0 = 3.3 msec)]

1 Hertz = 1 vibration/sec

Page 7: PreetiRao - Pompeu Fabra University


Musical pitch scale

low pitch high pitch

semitone = 2^(1/12)
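A minimal sketch of the semitone relation, assuming an equal-tempered scale with 12 semitones per octave. The helper names `semitones_between` and `shift_by_semitones` are illustrative and not from the slides.

```python
import math

SEMITONE_RATIO = 2 ** (1 / 12)  # frequency ratio between adjacent semitones

def semitones_between(f_ref_hz: float, f_hz: float) -> float:
    """Signed distance from f_ref_hz to f_hz in equal-tempered semitones."""
    return 12 * math.log2(f_hz / f_ref_hz)

def shift_by_semitones(f_hz: float, n: float) -> float:
    """Frequency obtained by moving n semitones up (n > 0) or down (n < 0)."""
    return f_hz * SEMITONE_RATIO ** n

octave = semitones_between(220.0, 440.0)     # a 2:1 ratio spans 12 semitones
one_up = shift_by_semitones(440.0, 1)        # one semitone above 440 Hz
print(octave, one_up)
```

Because the scale is defined by ratios, the same interval corresponds to a larger step in Hz at higher pitches.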

Page 8: PreetiRao - Pompeu Fabra University


o The construction of a musical scale is based on two

assumptions about the human hearing process:

o The ear is sensitive to ratios of fundamental frequencies (pitches),

not so much to absolute pitch.

o The preferred “musical intervals”, i.e. those perceived to be most

consonant, are the ratios of small whole numbers.

o A musical sound typically comprises several frequencies. These frequencies are evident if we observe the “spectrum” of the sound.

Page 9: PreetiRao - Pompeu Fabra University


[Figure: waveforms of 300 Hz, 600 Hz and 900 Hz sinusoids, and of the sums 300 Hz + 600 Hz and 300 Hz + 600 Hz + 900 Hz]
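As a rough illustration of the 300, 600 and 900 Hz components, the sketch below sums three sinusoids and checks two things: the sum repeats with the period of the 300 Hz component, and its spectrum shows the three components. Equal amplitudes and a one-second duration are assumptions made for the example.

```python
import numpy as np

fs = 44100                        # sampling rate (Hz), as on the PCM slide
t = np.arange(fs) / fs            # one second of time stamps
partials_hz = [300.0, 600.0, 900.0]

# Sum of three harmonically related sinusoids (equal amplitudes assumed).
x = sum(np.sin(2 * np.pi * f * t) for f in partials_hz)

# The sum repeats with the period of the lowest component: 44100 / 300 = 147 samples.
period_samples = int(round(fs / partials_hz[0]))
is_periodic = np.allclose(x[:-period_samples], x[period_samples:], atol=1e-8)

# The three strongest spectral bins sit at the component frequencies.
spectrum = np.abs(np.fft.rfft(x))
peaks_hz = np.sort(np.argsort(spectrum)[-3:]) * fs / len(x)
print(is_periodic, peaks_hz)
```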

Page 10: PreetiRao - Pompeu Fabra University

Sound “atoms”: single tone signal

[Figure: waveform x1(t) vs. t (ms) and spectrum X1(f) vs. f (Hz)]

Page 11: PreetiRao - Pompeu Fabra University

Non-tonal signal

[Figure: waveform x2(t) vs. t (ms) and spectrum X2(f) vs. f (Hz)]

Page 12: PreetiRao - Pompeu Fabra University

Complex tone signal

[Figure: waveform x3(t) vs. t (ms) and spectrum X3(f) vs. f (Hz)]

Page 13: PreetiRao - Pompeu Fabra University

Bandpass noise signal

[Figure: waveform x4(t) vs. t (ms) and spectrum X4(f) vs. f (Hz)]

Page 14: PreetiRao - Pompeu Fabra University

A flute note

[Figure: waveform x1(t) vs. t (ms) and spectrum X1(f) in dB vs. f (kHz)]

Page 15: PreetiRao - Pompeu Fabra University

o We see that the distinctive signal characteristics are

more evident in the frequency domain.

o The ear is a frequency analyzer. It represents a unique

combination of analysis and synthesis => we do not

perceive spectral components but rather the composite

sounds.

o We observe that a single “note” is perceived as one

entity of well-defined subjective sensations. This is due

to the spatial pattern recognition process achieved by

the central auditory system.


Page 16: PreetiRao - Pompeu Fabra University

Major dimensions of music for retrieval are melody,

rhythm, harmony and timbre.

o Melody, harmony -> based on pitch content

o Rhythm -> based on timing information

o Timbre -> relates to instrumentation, texture

A representation of these high-level attributes can be

obtained from pitch, timing and spectro-temporal

information extracted by audio signal analysis.

Representations are then compared via a similarity

measure to achieve retrieval.


Page 17: PreetiRao - Pompeu Fabra University

o The temporal pattern of frame-level features can offer

important cues to signal identity


[Block diagram: Audio signal -> Analysis windows (duration: 50 – 100 ms) -> Feature extraction -> Frame-level features -> Texture windows (duration: 0.5 – 1.0 s) -> Feature summary -> Feature vector]

M. F. McKinney and J. Breebaart, "Features for Audio and Music Classification," in Proc. ISMIR, 2003.
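The two-level windowing can be sketched as follows: frame-level features are computed on short analysis windows, then summarised over longer texture windows. The choice of features (RMS energy and spectral centroid) and of summary statistics (mean and standard deviation) is an assumption for illustration, not necessarily what the cited work uses.

```python
import numpy as np

def frame_features(x, fs, frame_s=0.05):
    """Frame-level features per analysis window: [RMS energy, spectral centroid in Hz]."""
    n = int(round(frame_s * fs))
    n_frames = len(x) // n
    freqs = np.fft.rfftfreq(n, d=1 / fs)
    win = np.hanning(n)
    feats = np.empty((n_frames, 2))
    for i in range(n_frames):
        frame = x[i * n:(i + 1) * n]
        mag = np.abs(np.fft.rfft(frame * win))
        feats[i, 0] = np.sqrt(np.mean(frame ** 2))
        feats[i, 1] = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
    return feats

def texture_summary(feats, frames_per_texture):
    """One feature vector per texture window: means followed by standard deviations."""
    n_tex = len(feats) // frames_per_texture
    out = []
    for j in range(n_tex):
        block = feats[j * frames_per_texture:(j + 1) * frames_per_texture]
        out.append(np.concatenate([block.mean(axis=0), block.std(axis=0)]))
    return np.array(out)

fs = 8000
t = np.arange(2 * fs) / fs
x = 0.5 * np.sin(2 * np.pi * 440 * t)      # 2 s test tone
feats = frame_features(x, fs)              # 50 ms analysis windows
vectors = texture_summary(feats, 20)       # 20 frames = 1 s texture windows
print(feats.shape, vectors.shape)
```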

Page 18: PreetiRao - Pompeu Fabra University

Melody: pitch related feature

[Figure: melody shown as frequency/note vs. time]

Melody is the temporal sequence of notes usually played

by a single instrument (fixed timbre). The discrete notes

(pitches) are typically selected from a musical scale.

Page 19: PreetiRao - Pompeu Fabra University


o Typical implementation:

o Pitch detection is carried out on the audio signal at uniformly spaced intervals

o The pitch sequence is segmented into notes (regions of relatively steady pitch)

o Notes are labeled

o Note patterns are matched to determine melodic similarity

o Challenges:

o Note segmentation can be a difficult task

o Pitch detection in polyphonic music is tough
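The segmentation and labelling steps above could look roughly like the sketch below. It assumes a pitch sequence in Hz sampled at a fixed hop, with 0 marking unvoiced frames. The rounding-to-nearest-semitone rule, the MIDI note numbering and the `min_frames` parameter are simplifying assumptions, and real note segmentation is harder than this, as the slide notes.

```python
import numpy as np

def hz_to_midi(f_hz):
    """Map frequency in Hz to a (fractional) MIDI note number, with A4 = 440 Hz = 69."""
    return 69 + 12 * np.log2(np.asarray(f_hz, dtype=float) / 440.0)

def segment_notes(pitch_hz, hop_s, min_frames=3):
    """Group consecutive frames that round to the same semitone into notes.

    Returns a list of (midi_note, onset_s, duration_s); runs shorter than
    min_frames and unvoiced frames (pitch <= 0) are discarded.
    """
    labels = [int(round(float(hz_to_midi(f)))) if f > 0 else None for f in pitch_hz]
    labels.append("end")  # sentinel that flushes the last run
    notes, start, current = [], 0, None
    for i, lab in enumerate(labels):
        if lab != current:
            if current is not None and i - start >= min_frames:
                notes.append((current, start * hop_s, (i - start) * hop_s))
            start, current = i, lab
    return notes

pitch = [220.0] * 5 + [0.0] * 2 + [246.94] * 4 + [261.63] * 2
notes = segment_notes(pitch, hop_s=0.01)
print(notes)
```

The resulting note labels can then be compared as symbol sequences to measure melodic similarity.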

Page 20: PreetiRao - Pompeu Fabra University


Monophonic signal: cues to perceived pitch

[Figure: waveform, spectrum and “Schroeder histogram” PDA views of a monophonic signal]

A. de Cheveigné, “Multiple F0 estimation,” in D.-L. Wang and G. J. Brown, editors, Computational Auditory Scene Analysis: Principles, Algorithms and Applications, IEEE Press / Wiley, 2006.

Page 21: PreetiRao - Pompeu Fabra University

o Time (Lag) domain: maximise autocorrelation

value

o Frequency domain: minimise error between

estimated and predicted harmonic structures

o Other
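A minimal sketch of the time (lag) domain approach, assuming a single voiced frame and a known F0 search range. It picks the lag that maximises the autocorrelation. The search limits and the plain (unnormalised) autocorrelation are illustrative choices.

```python
import numpy as np

def acf_pitch(frame, fs, f_min=80.0, f_max=800.0):
    """Estimate F0 (Hz) as fs / lag, where lag maximises the autocorrelation."""
    frame = frame - np.mean(frame)
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0..N-1
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return fs / lag

fs = 8000
t = np.arange(320) / fs                     # one 40 ms frame
frame = sum(np.sin(2 * np.pi * 200 * h * t) / h for h in (1, 2, 3))
f0 = acf_pitch(frame, fs)
print(f0)
```

The lag resolution is one sample, so practical detectors usually interpolate around the peak.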


Page 22: PreetiRao - Pompeu Fabra University


Page 23: PreetiRao - Pompeu Fabra University


Music and speech signals are typically time-varying in nature =>

a time-frequency representation is required to visualize signal

characteristics.

The short-time Fourier transform (STFT) affords such a

representation based on an assumption of signal quasi-

stationarity. The window shape dictates the time and frequency

resolution trade-off.

$$X(n,\omega) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\, e^{-j\omega m}$$
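The STFT can be sketched with a sliding window followed by a DFT per frame. This version assumes a Hann window and takes each DFT relative to the start of its frame, so its phase differs from the definition above by a linear term while the magnitudes agree. The window length and hop are illustrative.

```python
import numpy as np

def stft(x, win_len, hop):
    """Short-time Fourier transform; rows are frames, columns are bins from 0 to pi."""
    w = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * w for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

fs = 8000
t = np.arange(fs) / fs
# Time-varying test signal: 500 Hz in the first half, 1500 Hz in the second half.
x = np.where(t < 0.5, np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 1500 * t))
X = stft(x, win_len=256, hop=128)
first_peak_hz = np.argmax(np.abs(X[0])) * fs / 256
last_peak_hz = np.argmax(np.abs(X[-1])) * fs / 256
print(X.shape, first_peak_hz, last_peak_hz)
```

A longer `win_len` narrows each spectral peak but blurs the instant at which the frequency changes, which is the resolution trade-off mentioned above.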

Page 24: PreetiRao - Pompeu Fabra University

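[Figure: STFT as a windowed DFT. The signal x(m) is multiplied by the sliding window w(n-m), and the DFT of x(m)w(n-m) gives X(n, ω) for ω between 0 and π]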

Page 25: PreetiRao - Pompeu Fabra University
Page 26: PreetiRao - Pompeu Fabra University

$$\hat{x}[t] = \sum_{i=1}^{I[t]} a_i[t]\cos\left(\Phi_i[t]\right) + e[t]$$

$a_i[t]$ - amplitude variation of ith sinusoidal component (“partial”)

$\Phi_i[t]$ - total phase (represents both frequency and phase variation)

$I[t]$ - number of partials, can vary with time

$$\Phi_i[t] = \omega_i[t]\,t + \varphi_i[t]$$

Model parameters to be estimated: $\{a_i, \omega_i, \varphi_i\}_l$
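A rough sketch of synthesising the tonal part of the sinusoidal model from per-sample amplitude and frequency tracks. As an assumption, the total phase is accumulated as a running sum of the instantaneous frequency, which reduces to frequency times time plus initial phase when the frequency is constant. The residual e[t] is not modelled here.

```python
import numpy as np

def synthesize(amps, freqs_hz, fs, phases0=None):
    """Sum of partials; amps and freqs_hz have shape (n_partials, n_samples)."""
    amps = np.asarray(amps, dtype=float)
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    if phases0 is None:
        phases0 = np.zeros(amps.shape[0])
    # Accumulated frequency, shifted so that each partial starts at its initial phase.
    steps = np.cumsum(freqs_hz, axis=1) - freqs_hz[:, :1]
    phase = 2 * np.pi * steps / fs + np.asarray(phases0, dtype=float)[:, None]
    return np.sum(amps * np.cos(phase), axis=0)

fs, n = 8000, 800
amps = np.full((2, n), 0.5)
freqs = np.vstack([np.full(n, 200.0), np.full(n, 400.0)])
x = synthesize(amps, freqs, fs)

# For constant tracks this matches the closed-form sum of two cosines.
k = np.arange(n)
expected = 0.5 * np.cos(2 * np.pi * 200 * k / fs) + 0.5 * np.cos(2 * np.pi * 400 * k / fs)
print(np.allclose(x, expected))
```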

Page 27: PreetiRao - Pompeu Fabra University

[Block diagram: Audio signal -> Window -> DFT -> Peak detection -> Peak tracking -> Sinusoid parameters $\{a_i, \omega_i, \varphi_i\}_l$ -> Additive synthesis -> Tonal component; Residual = Audio signal - Tonal component]

For the smooth evolution of the signal, sine components are detected in each frame and linked to tracks from the previous frame based on frequency proximity.

Page 28: PreetiRao - Pompeu Fabra University

[Figure: two plots of spectral magnitude, Magnitude (dB) vs. Frequency (Hz), 0 to 3000 Hz. First plot: spectral magnitude, fixed threshold (MaxPeak - 40 dB), final peaks picked. Second plot: spectral magnitude with thresholds at Envelope - 20 dB, Envelope - 25 dB and Envelope - 30 dB]
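The fixed-threshold variant (MaxPeak - 40 dB) can be sketched as below: local maxima of the dB spectrum are kept only if they lie within the threshold of the strongest peak. The test signal and the function name `pick_peaks` are illustrative, and the envelope-relative thresholds are not implemented here.

```python
import numpy as np

def pick_peaks(mag_db, freqs_hz, rel_threshold_db=40.0):
    """Return (frequencies, magnitudes) of local maxima within rel_threshold_db of the top peak."""
    mag_db = np.asarray(mag_db, dtype=float)
    is_peak = (mag_db[1:-1] > mag_db[:-2]) & (mag_db[1:-1] >= mag_db[2:])
    idx = np.where(is_peak)[0] + 1
    if idx.size == 0:
        return np.array([]), np.array([])
    keep = idx[mag_db[idx] >= mag_db[idx].max() - rel_threshold_db]
    return freqs_hz[keep], mag_db[keep]

fs, n = 8000, 1024
t = np.arange(n) / fs
# Three components at 0 dB, -20 dB and about -66 dB relative level.
x = (np.sin(2 * np.pi * 500 * t)
     + 0.1 * np.sin(2 * np.pi * 1500 * t)
     + 0.0005 * np.sin(2 * np.pi * 2500 * t))
mag_db = 20 * np.log10(np.abs(np.fft.rfft(x * np.hanning(n))) + 1e-12)
freqs_hz = np.fft.rfftfreq(n, d=1 / fs)
peak_freqs, peak_mags = pick_peaks(mag_db, freqs_hz)
print(peak_freqs)
```

A fixed threshold discards weak partials, such as the weakest component here, in the same way as window sidelobes.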

Page 29: PreetiRao - Pompeu Fabra University


Match spectrum around peak with that of

ideal sinusoid. Apply threshold to the error.

Page 30: PreetiRao - Pompeu Fabra University

Peak tracking

[Figure: frequency vs. time (frames 0 to 4) showing sine peaks linked into tracks A, B, C and D, with the points where a track is born and where a track dies]
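A simplified sketch of linking sine peaks into tracks by frequency proximity. The greedy nearest-peak matching and the `max_jump_hz` limit are assumptions for illustration; practical trackers resolve competing matches more carefully.

```python
def track_peaks(frames, max_jump_hz=30.0):
    """Link per-frame peak frequencies into tracks.

    frames: list of lists of peak frequencies (Hz), one list per frame.
    Returns a list of tracks, each a list of (frame_index, frequency).
    """
    finished, active = [], []
    for j, peaks in enumerate(frames):
        unused = sorted(peaks)
        still_active = []
        for track in active:
            last = track[-1][1]
            best = min(unused, key=lambda f: abs(f - last), default=None)
            if best is not None and abs(best - last) <= max_jump_hz:
                track.append((j, best))          # track continues
                unused.remove(best)
                still_active.append(track)
            else:
                finished.append(track)           # track dies
        for f in unused:
            still_active.append([(j, f)])        # track is born
        active = still_active
    return finished + active

frames = [[200, 400], [205, 398], [210], [212, 600]]
tracks = track_peaks(frames)
for tr in tracks:
    print(tr)
```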

Page 31: PreetiRao - Pompeu Fabra University

[Figure: spectrogram, Frequency (Hz, 0 to 2000) vs. Time (sec, 0 to 20), with labelled components: Tabla (percussion) strokes Ghe, Na, Tun; Tanpura (drone); Singer (main melody); Harmonium (secondary melody)]

Page 32: PreetiRao - Pompeu Fabra University


o Input : magnitudes + locations of sinusoids

o For a range of trial fundamentals, generate predicted harmonics

o Minimise TWM error w.r.t. trial fundamentals

$$Err_{total} = \frac{Err_{p \to m}}{N} + \rho\,\frac{Err_{m \to p}}{K}$$

[Figure: nearest neighbour matching between predicted components (100, 200, 300, 400, 500, 600, 700, 800) and measured components (100, 200, 375, 420, 700, 800)]

Page 33: PreetiRao - Pompeu Fabra University


Page 34: PreetiRao - Pompeu Fabra University


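[Figure: trellis of pitch candidates p against frame index j, with measurement cost E(p,j) at each node and smoothness cost W(p,p') on the transition to the node with cost E(p',j+1)]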

p → Pitch candidates, j → Frame (time instant)

E → Measurement cost (local), W → Smoothness cost

Minimize the Global transition cost over the singing spurt
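The global minimisation can be sketched as a dynamic programming (Viterbi-style) search. Two assumptions are made for illustration: the pitch candidates form a fixed grid shared by all frames, and the smoothness cost W(p,p') is proportional to the pitch jump in octaves.

```python
import numpy as np

def dp_track(E, candidates_hz, smooth_weight=1.0):
    """Return the candidate index per frame that minimises total local + smoothness cost.

    E: array (n_frames, n_candidates) of measurement costs E(p, j).
    """
    E = np.asarray(E, dtype=float)
    logf = np.log2(np.asarray(candidates_hz, dtype=float))
    W = smooth_weight * np.abs(logf[:, None] - logf[None, :])   # W[p, p']
    n_frames, n_cand = E.shape
    cost = E[0].copy()
    back = np.zeros((n_frames, n_cand), dtype=int)
    for j in range(1, n_frames):
        total = cost[:, None] + W                 # total[p, p']: arrive at p' from p
        back[j] = np.argmin(total, axis=0)
        cost = total[back[j], np.arange(n_cand)] + E[j]
    path = [int(np.argmin(cost))]
    for j in range(n_frames - 1, 0, -1):          # backtrack from the last frame
        path.append(int(back[j, path[-1]]))
    return path[::-1]

candidates = [100.0, 200.0, 400.0]
E = [[0.5, 0.10, 0.9],
     [0.4, 0.20, 0.9],
     [0.1, 0.15, 0.9],    # locally the octave-low candidate looks best
     [0.5, 0.10, 0.9]]
path = dp_track(E, candidates, smooth_weight=1.0)
framewise = dp_track(E, candidates, smooth_weight=0.0)
print(path, framewise)
```

With the smoothness cost switched off, the third frame falls to the octave-low candidate. With it enabled, the contour stays on 200 Hz.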

Page 35: PreetiRao - Pompeu Fabra University


Page 36: PreetiRao - Pompeu Fabra University

[Block diagram: Polyphonic audio signal -> Signal representation -> Multi-F0 analysis -> Predominant-F0 trajectory extraction -> Singing voice detection -> Voice F0 contour]

Page 37: PreetiRao - Pompeu Fabra University


Page 38: PreetiRao - Pompeu Fabra University


“Pitch class profile”

o Pitch histogram

o Similarity measure involves match between histograms
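A minimal sketch of a pitch histogram folded into one octave, together with one possible similarity measure. The 12-bin resolution, the known tonic and the use of histogram intersection are assumptions made for the example.

```python
import numpy as np

def pitch_class_histogram(pitch_hz, tonic_hz, n_bins=12):
    """Normalised histogram of voiced pitch values folded into one octave above the tonic."""
    pitch_hz = np.asarray(pitch_hz, dtype=float)
    voiced = pitch_hz[pitch_hz > 0]                      # 0 marks unvoiced frames
    semitones = 12 * np.log2(voiced / tonic_hz)
    bins = np.round(semitones * n_bins / 12).astype(int) % n_bins
    hist = np.bincount(bins, minlength=n_bins).astype(float)
    return hist / max(hist.sum(), 1.0)

def histogram_similarity(h1, h2):
    """Histogram intersection: 1.0 for identical normalised histograms, 0.0 for disjoint ones."""
    return float(np.minimum(h1, h2).sum())

hist = pitch_class_histogram([220.0, 440.0, 330.0, 0.0, 110.0], tonic_hz=220.0)
print(hist, histogram_similarity(hist, hist))
```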

Page 39: PreetiRao - Pompeu Fabra University
Page 40: PreetiRao - Pompeu Fabra University

Positive phrases

Negative phrase

Page 41: PreetiRao - Pompeu Fabra University
Page 42: PreetiRao - Pompeu Fabra University

Positive phrases

Negative phrase

Detects phrases melodically similar to ‘Guru Bina’ pitch contour

Emphatic beat

sam

Swaras: S S N R

Page 43: PreetiRao - Pompeu Fabra University


Page 44: PreetiRao - Pompeu Fabra University

[Block diagram: Polyphonic audio signal -> Signal representation -> Multi-F0 analysis -> Predominant-F0 trajectory extraction -> Singing voice detection -> Voice F0 contour]

Page 45: PreetiRao - Pompeu Fabra University


o Input : magnitudes + locations of sinusoids

o For a range of trial fundamentals, generate predicted harmonics

o Minimise TWM error w.r.t. trial fundamentals

$$Err_{total} = \frac{Err_{p \to m}}{N} + \rho\,\frac{Err_{m \to p}}{K}$$

[Figure: nearest neighbour matching between predicted components (100, 200, 300, 400, 500, 600, 700, 800) and measured components (100, 200, 375, 420, 700, 800)]

Page 46: PreetiRao - Pompeu Fabra University


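• Predicted to measured error

$$Err_{p \to m} = \sum_{n=1}^{N} \left[ \Delta f_n \cdot (f_n)^{-p} + \left(\frac{a_n}{A_{max}}\right) \times \left[ q\,\Delta f_n \cdot (f_n)^{-p} - r \right] \right]$$

• Significant term : Δf / (f)^p

o Δf = frequency mismatch error

o f = partial frequency

• Measured to predicted error

$$Err_{m \to p} = \sum_{k=1}^{K} \left[ \Delta f_k \cdot (f_k)^{-p} + \left(\frac{a_k}{A_{max}}\right) \times \left[ q\,\Delta f_k \cdot (f_k)^{-p} - r \right] \right]$$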
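A sketch of the two-way mismatch (TWM) error for one trial fundamental, combining the two error terms as Err_p→m / N + ρ · Err_m→p / K. The values p = 0.5, q = 1.4, r = 0.5 and ρ = 0.33 are assumed defaults that the slides do not state. Using the amplitude of the matched measured partial in the predicted-to-measured term, and the `f_max` limit on predicted harmonics, are also assumptions.

```python
import numpy as np

def twm_error(f0, peak_freqs, peak_amps, f_max, p=0.5, q=1.4, r=0.5, rho=0.33):
    """Total TWM error of trial fundamental f0 against measured sinusoidal peaks."""
    peak_freqs = np.asarray(peak_freqs, dtype=float)
    peak_amps = np.asarray(peak_amps, dtype=float)
    a_max = peak_amps.max()
    harmonics = f0 * np.arange(1, int(f_max // f0) + 1)          # predicted components
    diff = np.abs(harmonics[:, None] - peak_freqs[None, :])      # shape (N, K)

    # Predicted to measured: each harmonic is matched to its nearest measured peak.
    nearest = diff.argmin(axis=1)
    term = diff[np.arange(len(harmonics)), nearest] * harmonics ** (-p)
    err_pm = np.sum(term + (peak_amps[nearest] / a_max) * (q * term - r))

    # Measured to predicted: each measured peak is matched to its nearest harmonic.
    term_k = diff.min(axis=0) * peak_freqs ** (-p)
    err_mp = np.sum(term_k + (peak_amps / a_max) * (q * term_k - r))

    return err_pm / len(harmonics) + rho * err_mp / len(peak_freqs)

peak_freqs = [200.0, 400.0, 600.0, 800.0]
peak_amps = [1.0, 0.8, 0.6, 0.4]
trial_f0s = [100.0, 200.0, 400.0]
errors = {f0: twm_error(f0, peak_freqs, peak_amps, f_max=850.0) for f0 in trial_f0s}
best_f0 = min(errors, key=errors.get)
print(errors, best_f0)
```

The predicted-to-measured term penalises the sub-octave trial (100 Hz) for harmonics with no measured peak nearby. The measured-to-predicted term penalises the octave-high trial (400 Hz) for leaving measured peaks unexplained.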

Page 47: PreetiRao - Pompeu Fabra University

Melody detection system [1]

Page 48: PreetiRao - Pompeu Fabra University


o F0 search range (male/female)

o p, q, r

o ρ (male/female)

o Window length (pitch range and rate of variation)

o Smoothness cost parameter (rate of pitch variation)

o Voicing threshold

Page 49: PreetiRao - Pompeu Fabra University


o Window length is an analysis parameter that

influences the accuracy of sinusoidal modeling of

the signal

o Closely-spaced components in the polyphony =>

need for higher frequency resolution = longer

windows

o Pitch variation with time can be rapid in

ornamented regions => need for better time

resolution = shorter windows

Page 50: PreetiRao - Pompeu Fabra University

o Easily computable measures for adapting window length

o Signal sparsity : a sparse spectrum is more “concentrated” =>

better represented sinusoidal components

o Window length selection (20, 30, 40 ms) based on maximizing

signal sparsity
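A rough sketch of sparsity-driven window selection among 20, 30 and 40 ms. The sparsity measure used here, kurtosis of the zero-padded magnitude spectrum, is an assumption and not necessarily the measure used by the authors. The sweep rate of the second test signal is deliberately exaggerated to make the effect clear.

```python
import numpy as np

N_FFT = 4096  # common zero-padded FFT size so that all window lengths are comparable

def spectral_sparsity(frame):
    """Kurtosis of the magnitude spectrum (an assumed sparsity measure)."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=N_FFT))
    dev = mag - mag.mean()
    return float(np.mean(dev ** 4) / (np.mean(dev ** 2) ** 2 + 1e-20))

def choose_window_ms(x, centre, fs, lengths_ms=(20, 30, 40)):
    """Pick the window length (ms) whose spectrum, centred on sample `centre`, is sparsest."""
    scores = {}
    for ms in lengths_ms:
        n = int(round(ms * 1e-3 * fs))
        start = centre - n // 2
        scores[ms] = spectral_sparsity(x[start:start + n])
    return max(scores, key=scores.get)

fs = 16000
t = np.arange(int(0.1 * fs)) / fs
steady = np.cos(2 * np.pi * 1000.0 * t)                             # constant pitch
sweep = np.cos(2 * np.pi * (1000.0 * t + 0.5 * 50000.0 * t ** 2))   # rapidly gliding pitch
centre = len(t) // 2
steady_choice = choose_window_ms(steady, centre, fs)
sweep_choice = choose_window_ms(sweep, centre, fs)
print(steady_choice, sweep_choice)
```

For a steady tone a longer window narrows the spectral peak, whereas for a fast glide a longer window smears it, so the sparsest choice moves toward shorter windows.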

Page 51: PreetiRao - Pompeu Fabra University

1. V. Rao and P. Rao, “Vocal melody extraction in the presence of pitched accompaniment in polyphonic music,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2145–2154, Nov. 2010.

2. V. Rao, P. Gaddipati and P. Rao, “Signal-driven window adaptation for sinusoid identification in polyphonic music,” IEEE Transactions on Audio, Speech, and Language Processing, Jan. 2012.
