gct535-sound technology for multimediamac.kaist.ac.kr/~juhan/gct535/slides/10-pitch analysis.pdf ·...
![Page 1: GCT535-Sound Technology for Multimediamac.kaist.ac.kr/~juhan/gct535/slides/10-pitch analysis.pdf · 2018-09-14 · 0 500 1000 1500 2000 2500 3000 3500 4000 −20 0 20 40 60 80 100](https://reader034.vdocuments.mx/reader034/viewer/2022042102/5e7fb99af4a1485e24741423/html5/thumbnails/1.jpg)
GCT535 - Sound Technology for Multimedia
Pitch Analysis
Graduate School of Culture Technology, KAIST
Juhan Nam
Outline
§ Introduction
  – Definition of Pitch
  – Information in Pitch
§ Monophonic Pitch Detection Algorithms
  – Time-Domain Approaches
  – Frequency-Domain Approaches
  – Psychoacoustic Model Approaches
§ Pitch Tracking
§ Applications
Definition of Pitch
§ Pitch
  – Defined as the auditory attribute of sound according to which sounds can be ordered on a scale from low to high (ANSI, 1994)
  – One way of measuring pitch is to find the frequency of a sine wave that is matched to the target sound in a psychophysical experiment
  – Thus, pitch is subjective and varies between listeners (e.g., tone-deaf listeners)
§ Fundamental Frequency
  – Physical attribute of sounds measured from periodicity
  – Often called F0
§ Pitch should be distinguished from F0
  – However, in practice, the two terms are used interchangeably
Information in pitch
§ Music
  – Notes or melody
  – Tonality (in polyphony)
  – Size (or register) of musical instruments: bass, cello, violin
§ Speech
  – Context (prosody): question, mood, attitude
  – Speaker: gender, age, identity
  – Meaning: e.g., tones in Mandarin Chinese
§ Others
  – Vocalization of animals (e.g., bird chirps, whale calls): size and type, communication
Pitch and Musical Instruments
§ Pitch is determined by the spectral characteristics of musical instruments
  – Not all musical instruments have pitch
§ Types of musical instruments by harmonicity
  – Harmonic and steady: guitar, flute
  – Harmonic and dynamic: violin, organ, singing voice (vowel)
  – Inharmonic: piano, vibraphone
  – Non-harmonic: drum, percussion, singing voice (consonant)
* Inharmonicity in piano and vibraphone [from Klapuri’s slides]
Pitch Detection Algorithms
§ Time-Domain Approaches
  – Periodicity in time
§ Frequency-Domain Approaches
  – Periodicity in frequency
§ Psychoacoustic Model Approaches
  – Both time and frequency
[Figure: waveform (amplitude vs. time [ms]) and spectrum (magnitude [dB] vs. frequency [Hz])]
Time-Domain Approach
§ Basic Ideas
  – Periodicity: x(t) = x(t+T)
  – Measure similarity (or distance) between two adjacent segments
  – Find the period T that gives the highest similarity (or the smallest distance)
§ Two main approaches
  – Auto-correlation function (ACF): similarity by inner product
  – Average magnitude difference function (AMDF): distance by difference (e.g., L1 or L2 norm)
Auto-Correlation Function (ACF)
§ Measuring self-similarity by (Sondhi, 1968)

$$ r_t(l) = \sum_{n=0}^{N-1-l} x_t(n)\, x_t(n+l), \quad l = 0, 1, 2, \ldots, L-1 $$

[Figure: singing voice example: waveform (vs. time [sample]) and auto-correlation (vs. lag [sample])]
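The ACF definition can be turned into a frame-level pitch estimate by picking the lag with the largest value inside a plausible period range. The following is a minimal NumPy sketch, not the implementation behind the slide's figure: the function names, the search range (`fmin`, `fmax`) and the synthetic 200 Hz test tone are all assumptions made for illustration.

```python
import numpy as np

def acf(frame, max_lag):
    """r(l) = sum_{n=0}^{N-1-l} x(n) x(n+l) for l = 0 .. max_lag-1."""
    n = len(frame)
    return np.array([np.dot(frame[:n - l], frame[l:]) for l in range(max_lag)])

def acf_pitch(frame, sr, fmin=80.0, fmax=1000.0):
    """Return sr / lag for the lag with the largest ACF value.

    The search is restricted to lags in [sr/fmax, sr/fmin] (assumed range)
    so that the large peak around zero lag is skipped.
    """
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    r = acf(frame, lag_max + 1)
    lag = lag_min + np.argmax(r[lag_min:lag_max + 1])
    return sr / lag

# Synthetic harmonic tone (hypothetical stand-in for a singing-voice frame).
sr = 16000
t = np.arange(1024) / sr
x = sum(np.sin(2 * np.pi * 200.0 * h * t) / h for h in (1, 2, 3))
f0 = acf_pitch(x, sr)
print(f"estimated F0: {f0:.1f} Hz")
```

Restricting the lag range is one simple way to sidestep the zero-lag peak; it does not by itself prevent octave errors on real recordings.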
Auto-Correlation Function (ACF)
§ Biased auto-correlation

$$ r_{\mathrm{biased},t}(l) = \sum_{n=0}^{N-1-l} x_t(n)\, x_t(n+l), \quad l = 0, 1, 2, \ldots, L-1 $$

§ Unbiased auto-correlation

$$ r_{\mathrm{unbiased},t}(l) = \frac{1}{N-l} \sum_{n=0}^{N-1-l} x_t(n)\, x_t(n+l), \quad l = 0, 1, 2, \ldots, L-1 $$

[Figure: auto-correlation vs. lag [sample]]
Pitch Detection by ACF
[Figure: pitch detection by tracking maximum values in the spectrogram and in the ACF]
Interpretation of ACF in Frequency Domain
§ By the convolution theorem, the auto-correlation can be computed in the frequency domain, and efficiently so using the FFT

$$ \sum_{n=0}^{N-1-l} x(n)\, x(n+l) = \mathrm{FFT}^{-1}\big(X(k)\, X^*(k)\big) = \mathrm{FFT}^{-1}\big(|X(k)|^2\big), \quad X(k) = \mathrm{FFT}(x(n)) $$

§ Thus, the ACF can be computed as

$$ r(l) = \frac{1}{N-l}\, \mathrm{real}\big(\mathrm{FFT}^{-1}(|X(k)|^2)\big) $$
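A short NumPy sketch of the FFT route follows, checked against the direct sum. One detail is an assumption not stated on the slide: an inverse FFT of |X(k)|² gives a circular auto-correlation, so the frame is zero-padded to at least 2N-1 samples here to make it match the linear sum. Function names are illustrative.

```python
import numpy as np

def acf_direct(x, max_lag):
    """Direct evaluation of sum_{n=0}^{N-1-l} x(n) x(n+l)."""
    n = len(x)
    return np.array([np.dot(x[:n - l], x[l:]) for l in range(max_lag)])

def acf_fft(x, max_lag):
    """Same quantity via FFT^-1(|X(k)|^2), with zero-padding (assumed)."""
    n = len(x)
    nfft = 1 << (2 * n - 1).bit_length()   # power of two, at least 2N-1
    X = np.fft.rfft(x, nfft)
    r = np.fft.irfft(np.abs(X) ** 2, nfft)  # real-valued by construction
    return r[:max_lag]

def acf_unbiased(x, max_lag):
    """Divide each lag by the number of terms in its sum, N - l."""
    n = len(x)
    return acf_fft(x, max_lag) / (n - np.arange(max_lag))

rng = np.random.default_rng(0)
x = rng.standard_normal(512)
r_dir = acf_direct(x, 100)
r_fft = acf_fft(x, 100)
print(np.max(np.abs(r_dir - r_fft)))  # difference at round-off level
```

The direct sum costs on the order of N·L multiplications, while the FFT version costs on the order of N log N regardless of how many lags are kept.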
Interpretation of ACF in Frequency Domain
§ This is equivalent to

$$ r(l) = \frac{1}{N-l} \sum_{k=0}^{K-1} \cos\!\left(\frac{2\pi l k}{K}\right) |X(k)|^2 $$

§ ACF is a simple template-based approach in the frequency domain
  – Positive weights for (harmonic) peaks and negative weights for valleys
[Figure: power spectrum and cosine weight vs. frequency [bin]]
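The step from the inverse FFT to the cosine-weighted sum can be spelled out briefly. This sketch assumes the standard inverse DFT definition; its constant factor 1/K is not shown in the slide's formula and only rescales the result.

$$ \mathrm{FFT}^{-1}\big(|X(k)|^2\big)(l) = \frac{1}{K} \sum_{k=0}^{K-1} |X(k)|^2\, e^{j 2\pi l k / K} $$

$$ \mathrm{real}\Big(\mathrm{FFT}^{-1}\big(|X(k)|^2\big)(l)\Big) = \frac{1}{K} \sum_{k=0}^{K-1} |X(k)|^2 \cos\!\left(\frac{2\pi l k}{K}\right) $$

Because |X(k)|² is real, taking the real part simply replaces the complex exponential with a cosine. For a given lag l, the cosine has a period of K/l bins: it equals +1 at bins that are integer multiples of K/l and -1 halfway between them. When l matches the period of the signal, the +1 positions coincide with the harmonic peaks, which is why the ACF acts as a harmonic template with equal weight on every partial.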
Problems in ACF
§ Bias toward the large peak around zero lag
§ Prone to octave errors, particularly toward lower octaves
  – ACF is sensitive to amplitude changes
§ Equal weights for all harmonic partials
  – In general, low-numbered harmonic partials are more important in determining pitch
Average Magnitude Difference Function (AMDF)
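§ Measuring self-similarity by

$$ d_t(l) = \sum_{n=0}^{N-1-l} \big| x_t(n) - x_t(n+l) \big|^p, \quad l = 0, 1, 2, \ldots, L-1 $$

§ In YIN, p is set to 2 (de Cheveigné & Kawahara, 2002)

$$ \begin{aligned} d_t(l) &= \sum_{n=0}^{N-1-l} \big( x_t(n) - x_t(n+l) \big)^2 \\ &= \sum_{n=0}^{N-1-l} \Big[ x_t(n)^2 - 2\, x_t(n)\, x_t(n+l) + x_t(n+l)^2 \Big] \\ &= r_t(0) - 2\, r_t(l) + r_{t+l}(0) \end{aligned} $$

  – Minimizing d_t(l) amounts to minimizing the negative ACF plus a lag-dependent term

§ And the AMDF is normalized as

$$ \hat{d}_t(l) = \begin{cases} 1 & l = 0 \\ d_t(l) \Big/ \left[ \dfrac{1}{l} \sum_{u=1}^{l} d_t(u) \right] & \text{otherwise} \end{cases} $$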
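The squared-difference function and its normalization can be sketched in a few lines of NumPy. This is a simplified illustration in the spirit of YIN, not the full published algorithm: it omits parabolic interpolation and the best-local-estimate step, and the threshold value, search range and function names are assumptions.

```python
import numpy as np

def difference_function(x, max_lag):
    """d(l) = sum_{n=0}^{N-1-l} (x(n) - x(n+l))^2, with d(0) = 0."""
    n = len(x)
    d = np.zeros(max_lag)
    for l in range(1, max_lag):
        diff = x[:n - l] - x[l:]
        d[l] = np.dot(diff, diff)
    return d

def normalized_difference(d):
    """d_hat(0) = 1; d_hat(l) = d(l) / ((1/l) * sum_{u=1}^{l} d(u))."""
    d_hat = np.ones_like(d)
    running_sum = np.cumsum(d[1:])
    lags = np.arange(1, len(d))
    d_hat[1:] = d[1:] * lags / np.maximum(running_sum, 1e-12)
    return d_hat

def yin_pitch(x, sr, fmin=80.0, threshold=0.1):
    """First dip of d_hat below a fixed threshold (assumed value 0.1)."""
    max_lag = int(sr / fmin) + 1
    d_hat = normalized_difference(difference_function(x, max_lag))
    below = np.where(d_hat[1:] < threshold)[0] + 1
    if len(below) == 0:
        lag = 1 + np.argmin(d_hat[1:])       # fall back to the global minimum
    else:
        lag = below[0]
        # walk down to the bottom of this dip
        while lag + 1 < max_lag and d_hat[lag + 1] < d_hat[lag]:
            lag += 1
    return sr / lag

sr = 16000
t = np.arange(1024) / sr
x = sum(np.sin(2 * np.pi * 200.0 * h * t) / h for h in (1, 2, 3))
f0 = yin_pitch(x, sr)
print(f"estimated F0: {f0:.1f} Hz")
```

Because the normalized function starts at 1 and stays near or above 1 for short lags, taking the first dip below the threshold avoids the zero-lag region without any explicit lower bound on the lag.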
Average Magnitude Difference Function (AMDF)
[Figure: AMDF and normalized AMDF]
Why YIN (AMDF) works better
§ Robust to changes in amplitude
  – The difference (instead of correlation) takes care of amplitude changes
  – This reduces octave errors
§ Zero-lag bias is avoided by the normalized AMDF
§ The normalized AMDF allows using a fixed threshold
  – Can choose multiple candidates and refine peaks
Example of AMDF (YIN)
Frequency-Domain Approach
§ Basic Ideas
  – Periodic in time domain → Harmonic in frequency domain
  – Measure how harmonic the spectrum is
  – Find the F0 that best explains the harmonic patterns (harmonic partials)
§ Algorithms
  – Pattern Matching
  – Cepstrum
  – Harmonic Product Spectrum (HPS)
Pattern Matching: Comb-filtering
§ Using sharp harmonic sieves to take harmonic peak regions only (Puckette et al., 1998)
  – Compute pitch saliency for F0 candidates
Pattern Matching: Cross-correlation
§ Cross-correlation with an ideal template on a log-scale spectrogram
[Figure from Ellis’s E4896 course slides]
Cepstrum
§ Real cepstrum is defined as (Noll, 1967)

$$ c_x(l) = \mathrm{real}\Big(\mathrm{FFT}^{-1}\big(\log |\mathrm{FFT}(x)|\big)\Big) $$

§ Basic ideas
  – Harmonic partials are periodic in the frequency domain
  – The (inverse) FFT finds this periodicity
[Figure: magnitude spectrum (magnitude [dB] vs. frequency [Hz]) and cepstrum (vs. quefrency), with liftering indicated]
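The real cepstrum and a peak-picking pitch estimate can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions: the Hann window, the quefrency search range, the small constant added before the logarithm, and the synthetic pulse-train test signal are not from the slides.

```python
import numpy as np

def real_cepstrum(x, nfft=None):
    """c(l) = real(FFT^-1(log|FFT(x)|))."""
    nfft = nfft or len(x)
    spectrum = np.fft.fft(x, nfft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small floor avoids log(0)
    return np.real(np.fft.ifft(log_mag))

def cepstrum_pitch(x, sr, fmin=80.0, fmax=1000.0):
    """Pick the largest cepstral peak with quefrency in [sr/fmax, sr/fmin]."""
    c = real_cepstrum(x * np.hanning(len(x)))
    q_min = int(sr / fmax)
    q_max = int(sr / fmin)
    q = q_min + np.argmax(c[q_min:q_max + 1])
    return sr / q

# Hypothetical test signal: a 200 Hz pulse train with exponentially
# decaying pulses, which gives harmonics across the whole band.
sr = 16000
pulses = np.zeros(2048)
pulses[::80] = 1.0                      # one pulse every 80 samples
decay = 0.9 ** np.arange(64)
x = np.convolve(pulses, decay)[:2048]
f0 = cepstrum_pitch(x, sr)
print(f"estimated F0: {f0:.1f} Hz")
```

Discarding the low quefrencies below `q_min` is a crude form of liftering: the slowly varying spectral envelope lives there, while the harmonic spacing shows up as a peak at the pitch period.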
Harmonic Product Spectrum (HPS)
§ The Harmonic Product Spectrum (HPS) is obtained by multiplying the original magnitude spectrum by copies of itself decimated by integer factors (Noll, 1969)

$$ \mathrm{HPS}(k) = \prod_{m=1}^{M} |X(mk)| $$
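The product over decimated spectra maps directly onto array slicing. Below is a small NumPy sketch; the FFT size, Hann window, number of harmonics M and lower frequency bound are assumed values chosen for the synthetic example, not settings from the slides.

```python
import numpy as np

def harmonic_product_spectrum(mag, num_harmonics=4):
    """HPS(k) = prod_{m=1}^{M} |X(m k)|, for k where all terms exist."""
    k_max = len(mag) // num_harmonics
    hps = np.ones(k_max)
    for m in range(1, num_harmonics + 1):
        hps *= mag[::m][:k_max]          # mag[::m][k] is |X(m k)|
    return hps

def hps_pitch(x, sr, nfft=8192, num_harmonics=4, fmin=50.0):
    """Frequency of the largest HPS bin above fmin (assumed bound)."""
    mag = np.abs(np.fft.rfft(x * np.hanning(len(x)), nfft))
    hps = harmonic_product_spectrum(mag, num_harmonics)
    k_min = int(np.ceil(fmin * nfft / sr))
    k = k_min + np.argmax(hps[k_min:])
    return k * sr / nfft

# Synthetic tone with five harmonics of 220 Hz (hypothetical example).
sr = 16000
t = np.arange(2048) / sr
x = sum(np.sin(2 * np.pi * 220.0 * h * t) / h for h in range(1, 6))
f0 = hps_pitch(x, sr)
print(f"estimated F0: {f0:.1f} Hz")
```

The estimate is quantized to the FFT bin spacing (sr / nfft), so zero-padding or interpolation around the peak would be needed for finer resolution.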
Auditory Filter bank
§ A bank of filters that imitates the magnitude and delay of traveling waves on the basilar membrane in the cochlea
§ Correlogram
  – Formed by concatenating the ACFs of the individual hair-cell (HC) outputs
  – 3-D representation (time-channel-lag), or “auditory images”
[Figure: block diagram with labels: input, cochlear filter banks (oval window, high freq. to low freq.), hair cells (HC), auto-correlation functions (ACF), stabilize & combine, correlogram, summary ACF]
Types of Auditory Filter Banks
§ Gamma-tone filter banks
  – Gamma-tone:

$$ g(t) = a\, t^{n-1} e^{-2\pi b t} \cos(2\pi f t + \phi)\, u(t) $$

  – Used in Patterson’s auditory filter banks based on ERB
§ Pole-Zero Filter Cascade (Lyon)
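The gamma-tone impulse response can be sampled directly from its formula. In this sketch the bandwidth parameter b is tied to the ERB using the commonly cited Glasberg and Moore approximation and the factor 1.019; both of these, along with the filter order of 4 and the function names, are assumptions that the slide does not specify.

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore form, assumed)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, sr, duration=0.05, order=4, amp=1.0, phase=0.0):
    """g(t) = a t^(n-1) exp(-2 pi b t) cos(2 pi f t + phi), sampled for t >= 0."""
    t = np.arange(int(duration * sr)) / sr   # t >= 0 plays the role of u(t)
    b = 1.019 * erb(fc)                      # assumed bandwidth scaling
    envelope = amp * t ** (order - 1) * np.exp(-2 * np.pi * b * t)
    return envelope * np.cos(2 * np.pi * fc * t + phase)

sr = 16000
ir = gammatone_ir(1000.0, sr)
mag = np.abs(np.fft.rfft(ir, 8192))
peak_freq = np.argmax(mag) * sr / 8192
print(f"response peaks near {peak_freq:.0f} Hz")
```

A filter bank would repeat this for a set of center frequencies spaced along the ERB scale and convolve the input with each impulse response.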
Hair-Cell
§ (Inner) hair cell
  – Transforms mechanical movement into neural spikes
§ Modeled as a cascade of
  – Half-wave rectification
  – Compression
  – Low-pass filtering
§ This is a non-linear process
  – Generates new harmonic partials
  – Associated with missing fundamentals
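The three-stage cascade can be sketched to show how the non-linearity creates a component at a missing fundamental. This is a deliberately crude illustration, not a physiological model: the square-root compression, the one-pole low-pass with a 1 kHz cutoff, and the test signal made of the 3rd to 5th harmonics of 200 Hz are all assumptions.

```python
import numpy as np

def hair_cell(x, sr, cutoff=1000.0, exponent=0.5):
    """Half-wave rectification -> power-law compression -> one-pole low-pass."""
    rectified = np.maximum(x, 0.0)
    compressed = rectified ** exponent
    a = np.exp(-2 * np.pi * cutoff / sr)     # one-pole coefficient
    y = np.empty_like(compressed)
    state = 0.0
    for i, v in enumerate(compressed):
        state = (1 - a) * v + a * state      # y[n] = (1-a) x[n] + a y[n-1]
        y[i] = state
    return y

# Harmonics 3, 4, 5 of 200 Hz: the fundamental itself is absent.
sr = 16000
t = np.arange(1600) / sr                     # 0.1 s, so bin spacing is 10 Hz
x = sum(np.sin(2 * np.pi * f * t) for f in (600.0, 800.0, 1000.0))
y = hair_cell(x, sr)

bin_200 = 20                                 # 200 Hz / 10 Hz per bin
in_mag = np.abs(np.fft.rfft(x))[bin_200]
out_mag = np.abs(np.fft.rfft(y))[bin_200]
print(f"200 Hz magnitude before: {in_mag:.2e}, after: {out_mag:.2e}")
```

The input has essentially no energy at 200 Hz, while the output does, because rectification follows the 200 Hz beating of the envelope. An ACF of the output therefore shows a peak at the 5 ms period.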
Pitch Analysis Using Auditory Model
[Figure: summary ACF]
§ The summary ACF is computed by summing the ACF across all channels
  – The peaks in the ACF represent periodicity features
  – This is known to be robust to band-limited noise
Pitch Tracking
§ Pitch is usually continuous over time
  – Once a pitch with strong harmonicity is detected in a frame, the following frames form a smooth pitch contour
§ Pitch tracking methods
  – Post-processing: first detect pitch frame by frame, then find a continuous path by smoothing
    • Median filtering
    • Dynamic programming (Talkin, 1995)
  – Probabilistic approach: detect multiple pitch candidates in every frame and find the best path
    • Viterbi decoding: Probabilistic YIN (Mauch & Dixon, 2014)
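Of the post-processing options listed, median filtering is the simplest to sketch. The example below smooths a frame-wise F0 track in NumPy; the window width of 5 frames, the edge padding and the function name are assumptions for illustration, and the dynamic-programming and Viterbi approaches are not covered here.

```python
import numpy as np

def median_smooth(f0, width=5):
    """Median-filter a frame-wise F0 track with an odd window width."""
    half = width // 2
    padded = np.pad(f0, half, mode="edge")   # repeat the edge values
    return np.array([np.median(padded[i:i + width]) for i in range(len(f0))])

# A steady 220 Hz track with a single-frame octave error (hypothetical data).
f0_track = np.array([220.0] * 4 + [440.0] + [220.0] * 4)
smoothed = median_smooth(f0_track)
print(smoothed)
```

A median filter removes isolated outliers such as one-frame octave jumps without blurring genuine note transitions as much as a moving average would, but it cannot fix errors that persist for more than half the window.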
Applications
§ Sound Modification
  – Time-stretching using PSOLA
  – Auto-tune: pitch correction or the T-Pain effect
§ Music Performance
  – Tuning musical instruments
  – Pitch-based sound control
  – Score following and auto-accompaniment
§ Query-by-humming
  – Relative pitch change might be more important
§ Singing evaluation (e.g., karaoke) and visualization
References
§ A. de Cheveigné and H. Kawahara, “YIN, a Fundamental Frequency Estimator for Speech and Music,” 2002.
§ A. Noll, “Cepstrum Pitch Determination,” 1967.
§ A. Noll, “Pitch Determination of Human Speech by the Harmonic Product Spectrum, the Harmonic Sum Spectrum, and a Maximum Likelihood Estimate,” 1969.
§ M. Puckette, T. Apel and D. Zicarelli, “Real-time Audio Analysis Tools for Pd and MSP,” 1998.
§ M. Sondhi, “New Methods of Pitch Extraction,” 1968.
§ D. Talkin, “A Robust Algorithm for Pitch Tracking (RAPT),” 1995.
§ M. Mauch and S. Dixon, “PYIN: A Fundamental Frequency Estimator Using Probabilistic Threshold Distributions,” 2014.