Pitch Analysis for Active Music Discovery
1 Music and Audio Research Laboratory, 2 Center for Urban Science and Progress
New York University, USA
Thursday June 23rd 2016
Justin Salamon1,2
@justin_salamon www.justinsalamon.com
Collaborators: Juan Pablo Bello, Rachel Bittner, Jordi Bonada, Juan J. Bosch, Jose Miguel Diaz-Bañez, Chris Cannam, Francisco Escobar, Slim Essid, Emilia Gómez, Paco Gómez, Sankalp Gulati, Helga Jiang, Edith Law, Matthias Mauch, Joaquin Mora, Sergio Oramas, Geoffroy Peeters, Aggelos Pikrakis, Axel Röbel, Bruno Rocha, Joan Serrà, Xavier Serra, Tim Tse, Alex Williams
In A Nutshell
‣ GOAL: active music discovery
‣ MEANS: pitch content analysis
‣ CHALLENGE: data scarcity (and solution!)
Active Search & Discovery
Song A
Song B
Versions?
Yes! Similarity = 0.92
COVER SONG ID
(Salamon, Serrà & Gómez, 2013)
Query-by-humming
(Salamon, Serrà & Gómez, 2013)
melodic pattern discovery
(Pikrakis et al., 2012)
[Figure: Multipitch Histogram. x-axis: Frequency bins (1 bin = 10 cents), Ref: 55 Hz; y-axis: Normalized salience; labeled peaks: ρ2, ρ3, ρ4, ρ5]
Tonic ID
(Salamon, Gulati & Serra, 2012)
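The histogram axis is given in 10-cent bins relative to 55 Hz. A minimal sketch of that mapping, and of accumulating a normalized salience histogram over it, might look like the following. The function names, the zero-based flooring convention, and the 600-bin range are assumptions made for illustration, not details taken from the slides.

```python
import math

REF_HZ = 55.0         # reference frequency, from the axis label
CENTS_PER_BIN = 10.0  # bin width, from the axis label


def hz_to_bin(freq_hz):
    """Map a frequency in Hz to a 10-cent bin index relative to 55 Hz.

    Zero-based indexing with flooring is an assumption of this sketch.
    """
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    cents = 1200.0 * math.log2(freq_hz / REF_HZ)
    return int(math.floor(cents / CENTS_PER_BIN))


def bin_to_hz(bin_index):
    """Inverse mapping: lower edge of a bin, in Hz."""
    return REF_HZ * 2.0 ** (bin_index * CENTS_PER_BIN / 1200.0)


def normalized_histogram(freqs_hz, saliences, n_bins=600):
    """Sum salience per bin, then scale so the tallest bin equals 1."""
    hist = [0.0] * n_bins
    for freq, sal in zip(freqs_hz, saliences):
        b = hz_to_bin(freq)
        if 0 <= b < n_bins:  # silently skip pitches outside the range
            hist[b] += sal
    peak = max(hist)
    return [h / peak for h in hist] if peak > 0 else hist
```

One octave spans 120 bins under this mapping, so 110 Hz falls in bin 120 and 220 Hz in bin 240.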
[Figure: Mean vibrato coverage (vc:mean*) vs. mean vibrato rate (vr:mean*); classes: Flamenco, Inst. jazz, Opera, Pop, Vocal jazz]
SINGING STYLE classification
(Salamon, Rocha & Gómez, 2012)
Melodia (Salamon & Gómez, 2012)
[Block diagram: Audio signal → Melody f0 sequence]
Contour Extraction
‣ Sinusoid extraction: Equal loudness filter, Spectral transform, Frequency/amplitude correction → Spectral peaks
‣ Salience function: Bin salience mapping with harmonic weighting → Time-frequency salience
‣ Pitch contour creation: Peak filtering, Peak streaming, Contour characterisation → Pitch contours
Contour Selection
‣ Melody selection: Voicing detection, Melody pitch mean, Octave error removal, Melody pitch mean, Pitch outlier removal (iterative), Melody peak selection → Melody f0 sequence
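The "bin salience mapping with harmonic weighting" stage can be illustrated with a plain harmonic summation: each spectral peak votes for the candidate f0s of which it could be a harmonic, with a weight that decays with harmonic number. This is a deliberately simplified sketch, not the Melodia implementation. The parameter values (`alpha`, `n_harmonics`, `n_bins`) and the nearest-bin rounding are assumptions, and refinements such as spreading each vote over neighbouring bins are left out.

```python
import math

REF_HZ = 55.0  # lowest candidate f0 considered in this sketch


def harmonic_salience(peaks, n_bins=600, n_harmonics=8, alpha=0.8):
    """Simplified harmonic-summation salience over 10-cent bins.

    peaks: list of (frequency_hz, linear_amplitude) spectral peaks.
    A peak at frequency f adds amplitude * alpha**(h - 1) to the bin of
    f / h, i.e. to every f0 of which it could be the h-th harmonic.
    """
    salience = [0.0] * n_bins
    for freq_hz, amp in peaks:
        for h in range(1, n_harmonics + 1):
            f0 = freq_hz / h
            if f0 < REF_HZ:
                break  # lower candidates fall outside the bin range
            b = int(round(1200.0 * math.log2(f0 / REF_HZ) / 10.0))
            if b < n_bins:
                salience[b] += amp * alpha ** (h - 1)
    return salience
```

For a toy harmonic tone with partials at 220, 440 and 660 Hz, the three partials all vote for the 220 Hz bin, so that bin collects the largest salience even though 110 Hz is also a valid sub-harmonic candidate.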
[Figure: Contour feature distributions. (a) Pitch mean (cents); (b) Pitch standard deviation (cents); (c) Contour mean salience; (d) Contour salience standard deviation; (e) Contour total salience; (f) Contour length (seconds)]
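The six quantities in panels (a) to (f) are simple per-contour statistics. A minimal sketch of computing them is shown below, assuming a contour is stored as two equal-length, non-empty lists (pitch in cents and salience, one value per frame) plus a frame hop in seconds. That representation, the dictionary keys, and the use of the population standard deviation are assumptions of this sketch.

```python
from statistics import mean, pstdev


def contour_features(pitches_cents, saliences, hop_seconds):
    """Per-contour statistics matching panels (a) to (f).

    pitches_cents, saliences: one value per frame of the contour.
    hop_seconds: time between consecutive frames (hypothetical input).
    """
    if len(pitches_cents) != len(saliences):
        raise ValueError("pitch and salience sequences must align")
    return {
        "pitch_mean": mean(pitches_cents),               # (a)
        "pitch_std": pstdev(pitches_cents),              # (b)
        "salience_mean": mean(saliences),                # (c)
        "salience_std": pstdev(saliences),               # (d)
        "salience_total": sum(saliences),                # (e)
        "length_seconds": len(pitches_cents) * hop_seconds,  # (f)
    }
```

A call such as `contour_features([2400, 2410, 2390, 2400], [1.0, 2.0, 3.0, 2.0], 0.01)` returns one feature dictionary per contour, which is the kind of input a contour-level selection or classification step would consume.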
Melodia (Salamon & Gómez, 2012)
MIREX: 75%
MedleyDB: 57%
Contour Classification (Salamon, Peeters & Röbel, 2012)
Melody contour features
Accompaniment contour features
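Separating melody contours from accompaniment contours by their features can be illustrated with a small generative classifier: fit one distribution per class and label a new contour by whichever class gives it the higher likelihood. The sketch below swaps in independent per-feature Gaussians for simplicity, so it should not be read as the model of the cited papers. The feature vectors in the usage note are made-up toy values.

```python
import math
from statistics import mean, pstdev


def fit_gaussians(rows):
    """Fit an independent Gaussian (mean, std) to each feature column."""
    columns = list(zip(*rows))
    # Floor the std so a constant feature cannot cause division by zero.
    return [(mean(col), max(pstdev(col), 1e-6)) for col in columns]


def log_likelihood(x, params):
    """Log-likelihood of feature vector x under independent Gaussians."""
    total = 0.0
    for value, (mu, sigma) in zip(x, params):
        var = sigma * sigma
        total += -0.5 * math.log(2.0 * math.pi * var)
        total -= (value - mu) ** 2 / (2.0 * var)
    return total


def is_melody(x, melody_params, accomp_params):
    """True if x is more likely under the melody model."""
    return log_likelihood(x, melody_params) > log_likelihood(x, accomp_params)
```

With toy (pitch mean in cents, mean salience) vectors, fitting `fit_gaussians([(3000, 2.0), (3100, 2.2), (2900, 1.8)])` for melody and `fit_gaussians([(1500, 0.5), (1600, 0.7), (1400, 0.6)])` for accompaniment labels a contour at `(3050, 2.1)` as melody and one at `(1450, 0.55)` as accompaniment.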
Contour Classification (Bittner et al., 2015)
Datasets for Melody Extraction Eval
‣ MIREX
  ‣ ADC2004: 6 minutes (20 excerpts)
  ‣ MIREX05: 12 minutes (25 excerpts)
‣ MIR1K: 2.2 hr (1000 excerpts), C-Pop
‣ RWC-Pop: 6.8 hr (100 songs), J-Pop
‣ MedleyDB (Bittner et al., 2014):
  ‣ 7.5 hr (108 songs), varied
‣ MedleyDBv2 (coming soon):
  ‣ ~17 hr (~240 songs), varied
108 songs: ~50 annotator-hours
20 million songs: ~1057 annotator-years
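The annotation-cost figures above follow from linear scaling. The short sketch below reproduces the arithmetic under the assumption that an annotator-year means 24 hours a day for 365 days, which is the convention that appears to match the ~1057 figure. The function name and defaults are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: round-the-clock annotation


def annotator_years(n_songs, ref_hours=50.0, ref_songs=108):
    """Scale the ~50 annotator-hours spent on 108 songs to n_songs."""
    hours = n_songs * ref_hours / ref_songs
    return hours / HOURS_PER_YEAR


years = annotator_years(20_000_000)
print(round(years))  # 1057
```

The per-song cost is about 0.46 annotator-hours, so 20 million songs need roughly 9.26 million annotator-hours.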
Data Augmentation: f0 Annotation-by-Synthesis
‣ Monophonic Pitch Tracker
‣ Cleaning + Smoothing
‣ Sinusoidal Modelling
‣ Synthesis
‣ Mixing
Algorithm 1, Algorithm 2, Algorithm 3
Metrics: A B C D E
‣ Original data + manual annotations
‣ Synthesized data + automatic annotations
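The idea behind annotation-by-synthesis is that a track resynthesized from an f0 curve has that curve as its exact annotation by construction. The sketch below illustrates only the synthesis and mixing steps, using fixed 1/h harmonic amplitudes in place of the amplitudes a sinusoidal model would estimate from the original recording. The function names, sample rate, and harmonic count are assumptions of this sketch.

```python
import math


def synthesize(f0_hz, hop_seconds, sample_rate=8000, n_harmonics=3):
    """Render a harmonic tone that follows a frame-wise f0 curve.

    f0_hz: one f0 value per frame; 0 marks an unvoiced (silent) frame.
    Phase is accumulated across frames so pitch changes do not click.
    Keep n_harmonics * max(f0_hz) below sample_rate / 2 to avoid aliasing.
    """
    samples_per_hop = int(round(hop_seconds * sample_rate))
    phase = 0.0
    audio = []
    for f0 in f0_hz:
        for _ in range(samples_per_hop):
            if f0 > 0:
                phase += 2.0 * math.pi * f0 / sample_rate
                audio.append(sum(math.sin(h * phase) / h
                                 for h in range(1, n_harmonics + 1)))
            else:
                audio.append(0.0)
    return audio


def mix(melody, backing, backing_gain=0.5):
    """Sample-wise sum of two tracks, truncated to the shorter one."""
    return [m + backing_gain * b for m, b in zip(melody, backing)]
```

Calling `synthesize([220.0] * 10 + [0.0] * 5, hop_seconds=0.01)` yields 1200 samples whose ground-truth f0 is the input list itself. Mixing that with other tracks gives a polyphonic example with a known melody annotation.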
Summary
‣ Active music discovery: the user plays an active role in retrieval
  ‣ Examples: QBH, search-by-singing-style, computational (ethno)musicology…
‣ Requires (automatic) extraction of pitch content: melody extraction
‣ Melody extraction: classification at contour level
‣ Data scarcity: can’t explore high-capacity (and data-hungry) models
‣ Solutions:
  ‣ Crowdsourcing
  ‣ Data augmentation: annotation-by-synthesis
‣ Thanks!