gct535-sound technology for multimediamac.kaist.ac.kr/~juhan/gct535/slides/10-pitch analysis.pdf ·...

GCT535- Sound Technology for Multimedia Pitch Analysis Graduate School of Culture Technology KAIST Juhan Nam 1


Post on 21-Mar-2020


Page 1: GCT535-Sound Technology for Multimediamac.kaist.ac.kr/~juhan/gct535/slides/10-pitch analysis.pdf · 2018-09-14 · 0 500 1000 1500 2000 2500 3000 3500 4000 −20 0 20 40 60 80 100

GCT535- Sound Technology for Multimedia: Pitch Analysis

Graduate School of Culture Technology, KAIST

Juhan Nam


Outline

§ Introduction
– Definition of Pitch
– Information in Pitch

§ Monophonic Pitch Detection Algorithms
– Time-Domain Approaches
– Frequency-Domain Approaches
– Psychoacoustic Model Approaches

§ Pitch Tracking

§ Applications


Definition of Pitch

§ Pitch
– Defined as the auditory attribute of sound according to which sounds can be ordered on a scale from low to high (ANSI, 1994)
– One way of measuring pitch is to find the frequency of a sine wave that is matched to the target sound in a psychophysical experiment
– Pitch is therefore subjective and varies between listeners (e.g., tone-deaf listeners)

§ Fundamental Frequency
– Physical attribute of sounds, measured from periodicity
– Often called F0

§ Pitch should be distinguished from F0
– However, in practice, the two terms are used interchangeably.


Information in pitch

§ Music
– Notes or melody
– Tonality (in polyphony)
– Size (or register) of musical instruments: bass, cello, violin

§ Speech
– Context (prosody): question, mood, attitude
– Speaker: gender, age, identity
– Meaning: tonal languages, e.g., Chinese (Mandarin)

§ Others
– Vocalization of animals (e.g., bird chirps, whale calls): size and type, communication


Pitch and Musical Instruments

§ Pitch is determined by the spectral characteristics of musical instruments
– Not all musical instruments have pitch

§ Types of musical instruments by harmonicity
– Harmonic and steady: guitar, flute
– Harmonic and dynamic: violin, organ, singing voice (vowel)
– Inharmonic: piano, vibraphone
– Non-harmonic: drum, percussion, singing voice (consonant)

[Figure: inharmonicity in piano and vibraphone. From Klapuri’s slides]


Pitch Detection Algorithms

§ Time-Domain Approaches
– Periodicity in time

§ Frequency-Domain Approaches
– Periodicity in frequency

§ Psychoacoustic Model Approaches
– Both time and frequency

[Figure: waveform (amplitude vs. time in ms) and spectrum (magnitude in dB vs. frequency in Hz)]


Time-Domain Approach

§ Basic Ideas
– Periodicity: x(t) = x(t+T)
– Measure similarity (or distance) between two adjacent segments
– Find the period (T) that gives the highest similarity (smallest distance)

§ Two main approaches
– Auto-correlation function (ACF): similarity by inner product
– Average magnitude difference function (AMDF): distance by difference (e.g., L1 or L2 norm)


Auto-Correlation Function (ACF)

§ Measuring self-similarity by

$$ r_t(l) = \sum_{n=0}^{N-1-l} x_t(n) \, x_t(n+l), \quad l = 0, 1, 2, \ldots, L-1 $$

(Sondhi, 1968)

[Figure: singing voice. Top: waveform (amplitude vs. time in samples). Bottom: auto-correlation vs. lag in samples]


Auto-Correlation Function (ACF)

§ Biased auto-correlation

$$ r_{\mathrm{biased},t}(l) = \sum_{n=0}^{N-1-l} x_t(n) \, x_t(n+l), \quad l = 0, 1, 2, \ldots, L-1 $$

§ Unbiased auto-correlation

$$ r_{\mathrm{unbiased},t}(l) = \frac{1}{N-l} \sum_{n=0}^{N-1-l} x_t(n) \, x_t(n+l), \quad l = 0, 1, 2, \ldots, L-1 $$

[Figure: auto-correlation vs. lag in samples]
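The two definitions above can be tried on a synthetic frame. The following is a minimal sketch, not the lecture's own code: the function names, the search range `fmin`/`fmax`, and the test signal (200 Hz plus its second harmonic at an 8 kHz sampling rate) are all assumptions made for illustration. The pitch picker uses the biased ACF and only searches lags inside the allowed range, which sidesteps the large peak at zero lag.

```python
import math

def acf(frame, max_lag, unbiased=False):
    """r(l) = sum_n x(n) x(n+l) for l = 0..max_lag-1, optionally divided by N - l."""
    n = len(frame)
    out = []
    for lag in range(max_lag):
        s = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        out.append(s / (n - lag) if unbiased else s)
    return out

def acf_pitch(frame, sr, fmin=50.0, fmax=500.0):
    """Return sr / lag for the lag with the largest biased ACF in [sr/fmax, sr/fmin]."""
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    r = acf(frame, lag_max + 1)
    best = max(range(lag_min, lag_max + 1), key=lambda lag: r[lag])
    return sr / best

# Hypothetical test signal: 200 Hz fundamental plus a weaker second harmonic.
sr = 8000
frame = [math.sin(2 * math.pi * 200 * i / sr)
         + 0.5 * math.sin(2 * math.pi * 400 * i / sr)
         for i in range(1024)]
estimate = acf_pitch(frame, sr)
print(estimate)
```

The biased ACF tapers toward large lags, so the first period wins over its multiples here; with the unbiased version, lags of 40, 80, and 120 samples would score almost identically and the choice between them becomes fragile.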


Pitch Detection by ACF

[Figure: spectrogram (tracking max values) and ACF (tracking max values)]


Interpretation of ACF in Frequency Domain

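§ By the convolution theorem, the auto-correlation can be computed in the frequency domain, and efficiently so using the FFT

$$ X(k) = \mathrm{FFT}(x(n)) $$

$$ \sum_{n=0}^{N-1-l} x(n) \, x(n+l) = \mathrm{FFT}^{-1}\left( X(k) X^*(k) \right) = \mathrm{FFT}^{-1}\left( |X(k)|^2 \right) $$

§ Thus, the ACF can be computed as

$$ r(l) = \frac{1}{N-l} \, \mathrm{real}\left( \mathrm{FFT}^{-1}\left( |X(k)|^2 \right) \right) $$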
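A small check of this identity, as a sketch under stated assumptions: the recursive radix-2 `fft` helper is written here only to keep the example free of external libraries, and the frame is zero-padded to at least twice its length, a detail the slide leaves implicit. Without the padding the inverse transform returns a circular correlation rather than the linear one in the sum.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def ifft(spec):
    """Inverse FFT via conjugation: conj(FFT(conj(X))) / K."""
    k = len(spec)
    return [v.conjugate() / k for v in fft([c.conjugate() for c in spec])]

def acf_fft(x):
    """Biased ACF r(l) = sum_n x(n) x(n+l), computed from the power spectrum."""
    n = len(x)
    size = 1
    while size < 2 * n:          # zero-pad so circular correlation equals linear
        size *= 2
    spec = fft(list(x) + [0.0] * (size - n))
    power = [abs(c) ** 2 for c in spec]
    return [v.real for v in ifft(power)][:n]

def acf_direct(x):
    """Reference implementation straight from the time-domain sum."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(n)]

x = [1.0, 2.0, 0.0, -1.0, 3.0]
print(acf_fft(x))
print(acf_direct(x))
```

Dividing each value by N − l afterwards gives the unbiased form shown on the slide.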


Interpretation of ACF in Frequency Domain

§ This is equivalent (up to the constant factor 1/K of the inverse FFT) to

$$ r(l) = \frac{1}{N-l} \sum_{k=0}^{K-1} \cos\left( \frac{2\pi l k}{K} \right) |X(k)|^2 $$

§ ACF is a simple template-based approach in the frequency domain
– Positive weights for (harmonic) peaks and negative weights for valleys

[Figure: power spectrogram and cosine weight vs. frequency (bin)]
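The template view can be checked numerically. In this sketch (helper names and the five-sample test vector are illustrative assumptions) each power-spectrum bin is weighted by cos(2πlk/K) and summed; the 1/K factor of the inverse transform is included explicitly, and the 1/(N − l) normalization is left out so the result can be compared with the plain time-domain sum.

```python
import cmath
import math

def power_spectrum(x, n_fft):
    """|X(k)|^2 from a direct (slow) DFT of x zero-padded to n_fft points."""
    return [abs(sum(v * cmath.exp(-2j * cmath.pi * k * n / n_fft)
                    for n, v in enumerate(x))) ** 2
            for k in range(n_fft)]

def acf_from_template(power, lag):
    """Weight every bin by cos(2 pi lag k / K), sum, and scale by 1/K."""
    n_fft = len(power)
    return sum(math.cos(2 * math.pi * lag * k / n_fft) * p
               for k, p in enumerate(power)) / n_fft

x = [1.0, 2.0, 0.0, -1.0, 3.0]
power = power_spectrum(x, 16)    # K = 16 >= 2N - 1 avoids circular wrap-around
template_acf = [acf_from_template(power, lag) for lag in range(len(x))]
direct_acf = [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
              for lag in range(len(x))]
print(template_acf)
print(direct_acf)
```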


Problems in ACF

§ Bias toward the large peak around zero lag

§ Prone to octave errors, particularly toward lower octaves
– ACF is sensitive to amplitude changes

§ Equal weights for all harmonic partials
– In general, low-numbered harmonic partials are more important in determining pitch


Average Magnitude Difference Function (AMDF)

§ Measuring self-similarity by

$$ d_t(l) = \sum_{n=0}^{N-1-l} \left| x_t(n) - x_t(n+l) \right|^p, \quad l = 0, 1, 2, \ldots, L-1 $$

§ In YIN, p is set to 2

$$ d_t(l) = \sum_{n=0}^{N-1-l} \left( x_t(n) - x_t(n+l) \right)^2 = \sum_{n=0}^{N-1-l} \left( x_t(n)^2 - 2 x_t(n) x_t(n+l) + x_t(n+l)^2 \right) = r_t(0) - 2 r_t(l) + r_{t+l}(0) $$

– Minimize the negative ACF plus a lag-dependent term

§ And the AMDF is normalized as

$$ \hat{d}(l) = \begin{cases} 1 & l = 0 \\ d(l) \Big/ \left[ \dfrac{1}{l} \sum_{u=1}^{l} d(u) \right] & \text{otherwise} \end{cases} $$

(de Cheveigné & Kawahara, 2002)
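The squared-difference function and its cumulative-mean normalization can be sketched as follows. This is a simplified illustration rather than the full YIN algorithm: the threshold of 0.1, the function names, and the test signal are assumptions, and steps of the published method such as parabolic interpolation are omitted.

```python
import math

def difference(frame, max_lag):
    """d(l) = sum_n (x(n) - x(n+l))^2 for l = 0..max_lag-1."""
    n = len(frame)
    return [sum((frame[i] - frame[i + lag]) ** 2 for i in range(n - lag))
            for lag in range(max_lag)]

def normalize(d):
    """d_hat(0) = 1; d_hat(l) = d(l) divided by the mean of d(1..l)."""
    out = [1.0]
    running = 0.0
    for lag in range(1, len(d)):
        running += d[lag]
        out.append(d[lag] * lag / running if running > 0 else 1.0)
    return out

def yin_pitch(frame, sr, max_lag, threshold=0.1):
    d_hat = normalize(difference(frame, max_lag))
    for lag in range(1, max_lag - 1):
        if d_hat[lag] < threshold:
            # First dip under the threshold: walk down to its local minimum.
            while lag + 1 < max_lag and d_hat[lag + 1] < d_hat[lag]:
                lag += 1
            return sr / lag
    # No dip under the threshold: fall back to the global minimum.
    best = min(range(1, max_lag), key=lambda lag: d_hat[lag])
    return sr / best

# Hypothetical test signal: 200 Hz fundamental plus a weaker second harmonic.
sr = 8000
frame = [math.sin(2 * math.pi * 200 * i / sr)
         + 0.5 * math.sin(2 * math.pi * 400 * i / sr)
         for i in range(1024)]
estimate = yin_pitch(frame, sr, max_lag=200)
print(estimate)
```

Because the normalized function starts at 1 and only drops near a true period, a fixed threshold can be applied to it, and taking the first dip rather than the deepest one favors the shortest period.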


Average Magnitude Difference Function (AMDF)

[Figure: AMDF and normalized AMDF]


Why YIN (AMDF) works better

§ Robust to changes in amplitude
– The difference (instead of correlation) takes care of amplitude changes.
– This reduces octave errors.

§ Zero-lag bias is avoided by the normalized AMDF

§ The normalized AMDF allows using a fixed threshold
– Can choose multiple candidates and refine peaks


Example of AMDF (YIN)


Frequency-Domain Approach

§ Basic Ideas
– Periodic in time domain → harmonic in frequency domain
– Measure how harmonic the spectrum is
– Find the F0 that best explains the harmonic patterns (harmonic partials)

§ Algorithms
– Pattern Matching
– Cepstrum
– Harmonic Product Spectrum (HPS)


Pattern Matching: Comb-filtering

§ Using sharp harmonic sieves to take harmonic peak regions only
– Compute pitch saliency for F0 candidates

(Puckette et al., 1998)
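A harmonic sieve can be sketched on a toy magnitude spectrum. This is not the method of Puckette et al.; it is a simplified illustration in which the sieve opening (`half_width`), the number of harmonics, and the 1/h weighting that favors low-numbered partials are all assumptions.

```python
def sieve_saliency(mag, f0_bin, n_harmonics=5, half_width=1):
    """Sum the strongest bin inside each sieve opening at h * f0_bin."""
    total = 0.0
    for h in range(1, n_harmonics + 1):
        center = h * f0_bin
        lo, hi = center - half_width, center + half_width + 1
        if lo >= len(mag):
            break
        # The 1/h weight (an assumption) penalizes sub-octave candidates,
        # which only catch the partials with their higher-numbered openings.
        total += max(mag[max(lo, 0):hi]) / h
    return total

def sieve_pitch(mag, candidates, bin_hz):
    """Return the candidate F0 (in Hz) with the highest saliency."""
    best = max(candidates, key=lambda k: sieve_saliency(mag, k))
    return best * bin_hz

# Toy spectrum: five equal partials at multiples of bin 20, 10 Hz per bin.
mag = [0.0] * 256
for h in range(1, 6):
    mag[h * 20] = 1.0
estimate = sieve_pitch(mag, range(5, 60), bin_hz=10.0)
print(estimate)
```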


Pattern Matching: Cross-correlation

§ Cross-correlation with an ideal template on a log-scale spectrogram

[From Ellis’s E4896 course slides]


Cepstrum

§ Real cepstrum is defined as

$$ c_x(l) = \mathrm{real}\left( \mathrm{FFT}^{-1}\left( \log |\mathrm{FFT}(x)| \right) \right) $$

(Noll, 1967)

§ Basic ideas
– Harmonic partials are periodic in the frequency domain
– The (inverse) FFT finds the periodicity

[Figure: magnitude spectrum (dB vs. frequency in Hz) and cepstrum (vs. quefrency), with liftering]
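The definition above can be exercised on a signal whose cepstrum is easy to predict. This is a sketch only: the radix-2 `fft` helper, the small floor inside the logarithm, the quefrency search range, and the decaying pulse train used as input (one pulse every 40 samples, each 0.8 times the previous) are all assumptions made for the example.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def real_cepstrum(frame):
    """c(l) = real(IFFT(log|FFT(x)|)); len(frame) must be a power of two."""
    log_mag = [math.log(max(abs(c), 1e-12)) for c in fft(frame)]
    k = len(log_mag)
    # log_mag is real, so the real part of its IFFT equals real(FFT) / K.
    return [c.real / k for c in fft(log_mag)]

def cepstrum_pitch(frame, sr, fmin=50.0, fmax=500.0):
    """Pick the quefrency with the largest cepstral value in [sr/fmax, sr/fmin]."""
    c = real_cepstrum(frame)
    lo, hi = int(sr / fmax), int(sr / fmin)
    best = max(range(lo, hi + 1), key=lambda q: c[q])
    return sr / best

# Hypothetical input: decaying pulse train with a 40-sample period at 8 kHz.
sr = 8000
frame = [0.0] * 1024
for m in range(26):
    frame[40 * m] = 0.8 ** m
estimate = cepstrum_pitch(frame, sr)
print(estimate)
```

Restricting the search to a quefrency range plays the role of the liftering step: low quefrencies, which describe the spectral envelope, are excluded before the peak is picked.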


Harmonic Product Spectrum (HPS)

§ The Harmonic Product Spectrum (HPS) is obtained by multiplying the original magnitude spectrum by copies of itself decimated by integer factors

$$ \mathrm{HPS}(k) = \prod_{m=1}^{M} |X(mk)| $$

(Noll, 1969)
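The product can be sketched on a toy magnitude spectrum in which the second partial is stronger than the fundamental, so that simply picking the largest peak would give the wrong octave. The spectrum values, the bin spacing, and the function names are illustrative assumptions.

```python
def hps(mag, n_harmonics=4):
    """HPS(k) = prod_{m=1..M} |X(m k)| for every k where M * k is still in range."""
    n_bins = (len(mag) - 1) // n_harmonics + 1
    out = []
    for k in range(n_bins):
        product = 1.0
        for m in range(1, n_harmonics + 1):
            product *= mag[m * k]
        out.append(product)
    return out

def hps_pitch(mag, bin_hz, n_harmonics=4):
    """Return the frequency (Hz) of the largest HPS value, skipping the DC bin."""
    spectrum = hps(mag, n_harmonics)
    best = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
    return best * bin_hz

# Toy spectrum: noise floor of 0.1, partials at multiples of bin 20 (10 Hz per bin).
mag = [0.1] * 256
for h, amplitude in enumerate([0.5, 1.0, 0.8, 0.6, 0.4, 0.3], start=1):
    mag[h * 20] = amplitude
estimate = hps_pitch(mag, bin_hz=10.0)
print(estimate)
```

Only at the true F0 bin do all decimated copies line up on a partial, so the product there dominates even though bin 40 holds the single largest magnitude.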


Auditory Filter bank

§ A bank of filters that imitates the magnitude and delay of traveling waves on the basilar membrane in the cochlea

§ Correlogram
– Formed by concatenating the ACFs of the individual hair-cell (HC) outputs
– 3-D representation (time-channel-lag), or “auditory image”

[Figure: auditory model block diagram. Input → cochlear filter bank (oval window; high-frequency to low-frequency channels) → hair cells (HC) → auto-correlation functions (ACF) → stabilize and combine → correlogram and summary ACF]


Types of Auditory Filter Banks

§ Gamma-tone filter banks
– Gamma-tone:

$$ g(t) = a \, t^{n-1} e^{-2\pi b t} \cos(2\pi f t + \phi) \, u(t) $$

– Used in Patterson’s auditory filter banks based on ERB

§ Pole-Zero Filter Cascade (Lyon)
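The gamma-tone impulse response can be sampled directly from the formula. This is a minimal sketch: the slide does not say how b is chosen, so the ERB approximation of Glasberg and Moore and the commonly used factor b = 1.019 · ERB are assumptions here, as are the filter order of 4, the 1 kHz center frequency, and the function names.

```python
import math

def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, sr, duration=0.05, order=4, amp=1.0, phase=0.0):
    """Samples of g(t) = a t^(n-1) exp(-2 pi b t) cos(2 pi fc t + phase) for t >= 0."""
    b = 1.019 * erb(fc)      # bandwidth parameter (assumed convention)
    out = []
    for i in range(int(duration * sr)):
        t = i / sr
        envelope = amp * t ** (order - 1) * math.exp(-2 * math.pi * b * t)
        out.append(envelope * math.cos(2 * math.pi * fc * t + phase))
    return out

sr = 16000
ir = gammatone_ir(1000.0, sr)
peak_index = max(range(len(ir)), key=lambda i: abs(ir[i]))
print(peak_index / sr)       # time of the largest sample, in seconds
```

The envelope t^(n-1) e^(-2πbt) peaks at t = (n − 1) / (2πb), so a wider bandwidth b gives a shorter, earlier response; this is why high-frequency channels respond faster than low-frequency ones.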


Hair-Cell

§ (Inner) hair cell
– Transforms mechanical movement into neural spikes

§ Modeled as a cascade of
– Half-wave rectification
– Compression
– Low-pass filtering

§ This is a non-linear process
– Generates new harmonic partials
– Associated with the perception of missing fundamentals
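The three-stage cascade can be sketched and used to show the missing-fundamental effect. The specific choices below are assumptions, not the lecture's model: square-root compression, a one-pole low-pass with a 1 kHz cutoff, and a two-tone input at 800 Hz and 1000 Hz (the 4th and 5th harmonics of an absent 200 Hz fundamental).

```python
import cmath
import math

def hair_cell(signal, sr, cutoff_hz=1000.0, exponent=0.5):
    """Half-wave rectify, compress with a power law, then one-pole low-pass."""
    alpha = math.exp(-2 * math.pi * cutoff_hz / sr)   # smoothing coefficient
    out, state = [], 0.0
    for x in signal:
        rectified = max(x, 0.0)
        compressed = rectified ** exponent
        state = (1 - alpha) * compressed + alpha * state
        out.append(state)
    return out

def tone_level(signal, freq, sr):
    """Amplitude of the component at `freq`, from a single DFT bin."""
    acc = sum(s * cmath.exp(-2j * cmath.pi * freq * i / sr)
              for i, s in enumerate(signal))
    return 2 * abs(acc) / len(signal)

sr = 8000
two_tone = [math.sin(2 * math.pi * 800 * i / sr)
            + math.sin(2 * math.pi * 1000 * i / sr)
            for i in range(400)]
neural = hair_cell(two_tone, sr)
print(tone_level(two_tone, 200, sr))   # essentially zero: no 200 Hz in the input
print(tone_level(neural, 200, sr))     # clearly non-zero after the non-linearity
```

The rectifier follows the 200 Hz beating of the two partials, which is how a periodicity analysis of the hair-cell output can recover a fundamental that is absent from the input spectrum.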


Pitch Analysis Using Auditory Model

[Figure: summary ACF]

§ Summary ACF is computed by summing the ACF across all channels
– The peaks in the ACF represent periodicity features
– This is known to be robust to band-limited noise


Pitch Tracking

§ Pitch is usually continuous over time
– Once a pitch with strong harmonicity is detected in a frame, the following frames form a smooth pitch contour

§ Pitch tracking methods
– Post-processing: first detect pitch frame by frame, then find a continuous path by smoothing
• Median filtering
• Dynamic programming (Talkin, 1995)
– Probabilistic approach: detect multiple pitch candidates in every frame and find the best path
• Viterbi decoding: probabilistic YIN (Mauch & Dixon, 2014)
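Both families of methods can be illustrated on toy data. The sketch below is neither RAPT nor pYIN: the median filter width, the candidate lists with their local costs, and the transition cost (absolute pitch jump in octaves times a `jump_penalty`) are assumptions chosen only to show the idea.

```python
import math
from statistics import median

def median_smooth(track, width=5):
    """Replace each frame by the median of its neighborhood (shorter at the edges)."""
    half = width // 2
    return [median(track[max(i - half, 0):i + half + 1]) for i in range(len(track))]

def viterbi_track(candidates, jump_penalty=1.0):
    """candidates[t] is a list of (f0_hz, local_cost); return the cheapest F0 path."""
    cost = [c for _, c in candidates[0]]
    back = []
    for t in range(1, len(candidates)):
        new_cost, pointers = [], []
        for f, local in candidates[t]:
            trans = [cost[j] + jump_penalty * abs(math.log2(f / fp))
                     for j, (fp, _) in enumerate(candidates[t - 1])]
            j_best = min(range(len(trans)), key=lambda j: trans[j])
            new_cost.append(trans[j_best] + local)
            pointers.append(j_best)
        cost = new_cost
        back.append(pointers)
    # Backtrack from the cheapest final state.
    j = min(range(len(cost)), key=lambda j: cost[j])
    path = [candidates[-1][j][0]]
    for t in range(len(back) - 1, -1, -1):
        j = back[t][j]
        path.append(candidates[t][j][0])
    return path[::-1]

# Frame-wise estimates with two octave errors (400 and 100 Hz).
raw = [200, 201, 202, 400, 203, 204, 100, 205, 206]
smoothed = median_smooth(raw)

# Per-frame candidates: the cheapest candidate in frame 1 is an octave too low.
frames = [[(200.0, 0.1), (400.0, 0.3)],
          [(100.0, 0.1), (202.0, 0.2)],
          [(204.0, 0.1), (408.0, 0.4)]]
path = viterbi_track(frames)
print(smoothed)
print(path)
```

The median filter removes isolated outliers after the fact, while the path search avoids them in the first place by trading a slightly worse local cost for a smoother contour.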


Applications

§ Sound Modification
– Time-stretching using PSOLA
– Auto-tune: pitch correction or the T-Pain effect

§ Music Performance
– Tuning musical instruments
– Pitch-based sound control
– Score-following and auto-accompaniment

§ Query-by-humming
– Relative pitch change might be more important than absolute pitch

§ Singing evaluation (e.g., karaoke) and visualization


References

§ A. de Cheveigné and H. Kawahara, “YIN, a Fundamental Frequency Estimator for Speech and Music,” 2002.

§ A. Noll, “Cepstrum Pitch Determination,” 1967.

§ A. Noll, “Pitch Determination of Human Speech by the Harmonic Product Spectrum, the Harmonic Sum Spectrum and a Maximum Likelihood Estimate,” 1969.

§ M. Puckette, T. Apel and D. Zicarelli, “Real-time Audio Analysis Tools for Pd and MSP,” 1998.

§ M. Sondhi, “New Methods of Pitch Extraction,” 1968.

§ D. Talkin, “A Robust Algorithm for Pitch Tracking (RAPT),” 1995.

§ M. Mauch and S. Dixon, “PYIN: A Fundamental Frequency Estimator Using Probabilistic Threshold Distributions,” 2014.
