Onset detection and attack phase descriptors
Onset detection vs. attack phase description
- MIREX competition:
  - Detect the approximate temporal location of new onsets in an audio file.
  - Algorithms are compared against manual expert annotation (which is inherently imprecise).
  - False positives and false negatives are penalized.
- Attack phase description:
  - What are the salient time points in the beginning of this sound event?
  - What are the relations between these time points?
- Paper:
  - K. Nymoen, A. Danielsen and J. London: Attack Phase Descriptor Estimation in Matlab toolboxes. Submitted for SMC2017, Helsinki.
  - Comparing the MIRtoolbox (Lartillot) and the Timbre Toolbox (Peeters)
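As a rough illustration of how detections can be scored against annotations, here is a minimal Python sketch. The function name `evaluate_onsets`, the greedy matching rule and the 50 ms tolerance window are assumptions made for this example; the slides do not specify the evaluation procedure.

```python
def evaluate_onsets(detected, annotated, tolerance=0.05):
    """Match detected onsets to annotated onsets within +/- tolerance seconds.

    Each annotation is matched by at most one detection. Unmatched
    detections are false positives, unmatched annotations false negatives.
    """
    detected = sorted(detected)
    annotated = sorted(annotated)
    true_pos = 0
    i = j = 0
    while i < len(detected) and j < len(annotated):
        diff = detected[i] - annotated[j]
        if abs(diff) <= tolerance:
            true_pos += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # detection is too early for this and all later annotations
        else:
            j += 1  # no remaining detection is close enough to this annotation
    false_pos = len(detected) - true_pos
    false_neg = len(annotated) - true_pos
    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / len(annotated) if annotated else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"true_pos": true_pos, "false_pos": false_pos,
            "false_neg": false_neg, "f_measure": f_measure}


# Made-up onset times in seconds, for illustration only.
annotated = [0.50, 1.00, 1.50, 2.00]
detected = [0.52, 0.98, 1.30, 2.03, 2.60]
result = evaluate_onsets(detected, annotated)
print(result)
```

A score like this says how many onsets were found, but nothing about where within each attack the detection landed, which is the question the attack phase descriptors address.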
Onset detection
[Figure: audio waveform, amplitude vs. time (s)]
[Figure: onset curve (envelope), amplitude vs. time (s)]
Are these really onsets?
[Figure: onset curve (envelope), amplitude vs. time (s)]
- What would our research question typically be when using this function in the MIRtoolbox?
  - Segmentation
  - Melody
  - Rhythm(?)
  - Microrhythm
- Are we interested in onsets, or rather perceived moments of metrical alignment?
Salient time points in the initial phase of a sonic event
[Figure: energy envelope with four marked time points: physical onset, perceptual onset, perceptual attack, energy peak]
Schaeffer (196x), Gordon (1987), Collins (2006), Wright (2008), Villing (2010)
Attack phase descriptors
[Figure: energy envelope annotated with physical onset, perceptual onset, perceptual attack, energy peak, temporal centroid, rise time, attack time, attack slope and attack leap; Log-Attack Time = log(Attack time)]
- Time points
- Time spans
- Energy spans
- (Energy points)
Attack phase descriptors (our definitions)

Name               Type  Description
Physical onset     phTP  Time point where the sound energy first rises from 0.
Perceptual onset   peTP  Time point when the sound event becomes audible.
Perceptual attack  peTP  Time point perceived as the rhythmic emphasis of the sound.
Energy peak        phTP  Time point when the energy envelope reaches its maximum value.
Rise time          phTS  Time span between physical onset and energy peak.
Attack time        peTS  Time span between perceptual onset and perceptual attack.
Log-Attack Time    phTS  The base 10 logarithm of attack time.
Attack slope       peES  Weighted average of the energy envelope slope in the attack phase.
Attack leap        peES  The difference between energy level at perceptual attack and perceptual onset.
Temporal centroid  phTP  The temporal barycentre of the sound event’s energy envelope.

Type codes: ph = physical, pe = perceptual; TP = time point, TS = time span, ES = energy span.

Log-Attack Time is a commonly used descriptor in the MIR community. No consensus: some use physical descriptors, some use perceptual, and some use a combination to estimate it.
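To make the definitions concrete, the following Python sketch derives the span descriptors from an energy envelope and three time points that are assumed to be known already. The function name `attack_descriptors`, the nearest-sample lookup and the example data are illustrative assumptions. Attack slope is left out because the weighting it uses is not specified here.

```python
import math


def attack_descriptors(times, envelope, physical_onset, perceptual_onset,
                       perceptual_attack):
    """Derive descriptors from an envelope and three given time points.

    times: ascending sample times in seconds; envelope: energy values.
    """
    peak_index = max(range(len(envelope)), key=envelope.__getitem__)
    energy_peak = times[peak_index]

    def level_at(t):
        # Envelope value at the sample closest to time t.
        k = min(range(len(times)), key=lambda n: abs(times[n] - t))
        return envelope[k]

    attack_time = perceptual_attack - perceptual_onset
    weighted_time = sum(t * e for t, e in zip(times, envelope))
    return {
        "energy_peak": energy_peak,
        "rise_time": energy_peak - physical_onset,
        "attack_time": attack_time,
        "log_attack_time": math.log10(attack_time),
        "attack_leap": level_at(perceptual_attack) - level_at(perceptual_onset),
        # Barycentre: energy-weighted mean of the sample times.
        "temporal_centroid": weighted_time / sum(envelope),
    }


# Made-up envelope sampled every 10 ms, for illustration only.
times = [n / 100 for n in range(8)]
envelope = [0.0, 0.1, 0.4, 0.8, 1.0, 0.7, 0.3, 0.1]
descriptors = attack_descriptors(times, envelope, physical_onset=0.0,
                                 perceptual_onset=0.01, perceptual_attack=0.03)
print(descriptors)
```

Swapping the perceptual time points for physical ones in the `attack_time` line gives a different Log-Attack Time for the same sound, which is the lack of consensus noted above.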
Attack phase descriptors step 1: Envelope extraction
Timbre Toolbox
- Apply Hilbert transform to the audio signal,
- followed by a 3rd-order Butterworth lowpass filter with cutoff frequency at 5 Hz.
- No compensation for filter group delay
MIRtoolbox
- Spectrogram, Hanning window, 100 ms frame, 10% hop
- Envelope = sum of columns in spectrogram
[Figure: audio waveform and energy envelopes vs. time (seconds, 0 to 0.25). Legend: MIRtoolbox (D), Timbre Toolbox (D), Timbre Toolbox (A), MIRtoolbox (A)]
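The frame-based approach can be sketched in plain Python as follows. This is not MIRtoolbox code: the function name `spectrogram_envelope`, the use of magnitudes rather than power, and the test signal are assumptions for illustration, and the naive DFT is only suitable for short examples.

```python
import cmath
import math


def spectrogram_envelope(signal, sample_rate, frame_s=0.1, hop_fraction=0.1):
    """Envelope as the per-frame sum of a magnitude spectrogram.

    Returns (frame_centre_times, envelope), using a Hann window and a
    naive DFT so that the sketch has no dependencies.
    """
    frame_len = int(round(frame_s * sample_rate))
    hop = max(1, int(round(frame_len * hop_fraction)))
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    # DFT basis for the non-negative frequency bins.
    basis = [[cmath.exp(-2j * math.pi * k * n / frame_len)
              for n in range(frame_len)]
             for k in range(frame_len // 2 + 1)]
    times, envelope = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        magnitudes = [abs(sum(x * b for x, b in zip(frame, row)))
                      for row in basis]
        times.append((start + frame_len / 2) / sample_rate)
        envelope.append(sum(magnitudes))  # sum of one spectrogram column
    return times, envelope


# Test signal: silence, then a decaying 100 Hz tone starting at 0.25 s.
sample_rate = 1000
onset_s = 0.25
signal = []
for n in range(500):
    t = n / sample_rate
    if t < onset_s:
        signal.append(0.0)
    else:
        signal.append(math.exp(-8 * (t - onset_s))
                      * math.sin(2 * math.pi * 100 * (t - onset_s)))


def first_rise(frame_s):
    """Centre time of the first frame with non-zero envelope."""
    frame_times, env = spectrogram_envelope(signal, sample_rate, frame_s)
    return next(t for t, e in zip(frame_times, env) if e > 1e-9)


rise_long = first_rise(0.1)    # 100 ms frames
rise_short = first_rise(0.03)  # 30 ms frames
print(rise_long, rise_short)
```

With these settings the envelope starts rising before the tone actually begins, because a frame centred before the onset already overlaps it. A shorter frame reduces this smearing, which is one reason the frame size matters for attack phase estimates.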
Attack phase descriptors step 2: Salient time steps
- Both the MIRtoolbox and the Timbre Toolbox provide equivalents to “beginning of attack” and “end of attack”.
[Figure: Timbre Toolbox attack phase estimation; energy envelope with thresholds θ1, θ2, ..., θ10, effort function, mean effort, and attack start and end; time (seconds)]
[Figure: MIRtoolbox attack phase estimation; energy envelope, time derivative with peak position and threshold, and attack start and end; time (seconds)]
- But are these supposed to reflect physical or perceptual features?
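The derivative-based idea can be sketched as follows. The rule used here (walk outwards from the derivative peak until the derivative falls below a fraction of that peak) is an assumption made for illustration and is not taken from the MIRtoolbox source; the name `attack_from_derivative` and the data are hypothetical.

```python
def attack_from_derivative(times, envelope, fraction=0.2):
    """Estimate attack start and end from the envelope's time derivative.

    Assumed rule: locate the derivative peak, then extend the attack in
    both directions while the derivative stays at or above
    fraction * peak derivative.
    """
    deriv = [(envelope[n + 1] - envelope[n]) / (times[n + 1] - times[n])
             for n in range(len(envelope) - 1)]
    peak = max(range(len(deriv)), key=deriv.__getitem__)
    threshold = fraction * deriv[peak]
    start = peak
    while start > 0 and deriv[start - 1] >= threshold:
        start -= 1
    end = peak
    while end < len(deriv) - 1 and deriv[end + 1] >= threshold:
        end += 1
    # deriv[n] describes the step from sample n to sample n + 1.
    return times[start], times[end + 1]


# Made-up envelope sampled every 10 ms, for illustration only.
times = [n / 100 for n in range(9)]
envelope = [0.0, 0.01, 0.1, 0.4, 0.8, 0.95, 1.0, 0.9, 0.7]
default_span = attack_from_derivative(times, envelope, fraction=0.2)
lower_span = attack_from_derivative(times, envelope, fraction=0.075)
print(default_span, lower_span)
```

Lowering the fraction can only widen the estimated attack span, so the choice of threshold directly shifts both time points. Nothing in the rule itself says whether the result should be read as a physical or a perceptual feature.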
Attack phase descriptors step 3: ...
- Decide on the definitions of attack phase descriptors, and the methods for extracting salient time points
Into the nitty-gritty
Timbre Toolbox
- NB! Make sure that you download the latest version from GitHub... (Don’t trust the CIRMMT link, the 1st hit on Google, which gives you version 1.2 from 2003)
- My impression: Best used if you need a large range of audio descriptors for a large audio set, and don’t want to fiddle with choosing parameters for your functions
- Need to dig deep into the code to change the parameters (hard-coded):
  - Lowpass filter cutoff frequency
  - α value
  - Fix group delay problem
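To show what exposing such a parameter could look like, here is a Python sketch loosely based on the effort idea from the Timbre Toolbox figure: thresholds at equal fractions of the envelope peak, an effort for each step between thresholds, and a multiplier on the mean effort (called `alpha` in the code). It is a simplified reading, not the toolbox's implementation; the function name, the start and end rules and the data are assumptions.

```python
def weakest_effort_attack(times, envelope, alpha=3.0, n_thresholds=10):
    """Estimate attack start and end from threshold-crossing efforts.

    Assumed rule: the attack starts at the first threshold whose effort is
    below alpha * mean effort, and ends at the first later threshold whose
    effort exceeds that limit.
    """
    peak = max(envelope)
    thresholds = [min(peak, peak * (i + 1) / n_thresholds)
                  for i in range(n_thresholds)]
    # Time at which the envelope first reaches each threshold.
    crossing = [next(t for t, e in zip(times, envelope) if e >= th)
                for th in thresholds]
    # Effort: time needed to get from one threshold to the next.
    effort = [crossing[i + 1] - crossing[i] for i in range(n_thresholds - 1)]
    limit = alpha * sum(effort) / len(effort)
    start = next((i for i, w in enumerate(effort) if w < limit), 0)
    end = next((i for i in range(start + 1, len(effort)) if effort[i] > limit),
               n_thresholds - 1)
    return crossing[start], crossing[end]


# Made-up envelope sampled every 10 ms, for illustration only.
times = [n / 100 for n in range(16)]
envelope = [0.0, 0.02, 0.04, 0.06, 0.08, 0.12, 0.55, 0.85,
            0.87, 0.89, 0.91, 0.93, 0.95, 0.97, 0.99, 1.0]
span_alpha_3 = weakest_effort_attack(times, envelope, alpha=3.0)
span_alpha_2 = weakest_effort_attack(times, envelope, alpha=2.0)
print(span_alpha_3, span_alpha_2)
```

With this data, the smaller multiplier ends the attack earlier, so a value that is buried in the code has a visible effect on the descriptors that come out.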
Into the nitty-gritty
MIRtoolbox
- Quite user-friendly: well documented, easy to access most parameters
- mironsets() function - 'attack' option
  - threshold value is hard-coded
  - mirgetdata problem:
    - uncell(get(A,'AttackPosUnit'))
    - uncell(get(A,'PeakPosUnit'))
Perceptual experiment
- Task: align a repeated musical sound to a click track.
- 17 participants
- 9 sound stimuli (8 musical instruments + click)
- Inter-stimulus interval of 600 ms
- Click track and stimuli started with a random offset.
- Controlling sync using a keyboard and/or a slider on the screen.
Parameter optimisation and perceptual results
[Figure: time relative to physical onset (in milliseconds, -80 to 80) for each stimulus: BrightPiano, SnareDrum, DarkPiano, KickDrum, Fiddle, Shaker, SynthBass, ArcoBass, Click. Legend: MIRtoolbox (D), Timbre Toolbox (D), Timbre Toolbox (O), MIRtoolbox (O)]
Perceptual results
[Figure: Jaccard index for MIRtoolbox (mean for all sounds) as a function of threshold (fraction of e′peak, 0.05 to 0.75) and frame size (seconds, 0.01 to 0.15); colour scale: Jaccard index, 0.1 to 0.55]
Toolbox         Parameter                                 Default  Optimised
Timbre Toolbox  Envelope: LP filter cutoff frequency      5 Hz     37 Hz
Timbre Toolbox  Threshold: α                              3        3.75
MIRtoolbox      Envelope: frame size                      0.1 s    0.03 s
MIRtoolbox      Threshold: fraction of e′peak             20%      7.5%
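The slides do not say how the Jaccard index was computed. A minimal Python sketch, assuming it compares an estimated attack span with a perceptually derived span for each sound and averages over sounds, could look like this. The function names and the example spans are hypothetical.

```python
def interval_jaccard(a, b):
    """Jaccard index of two time intervals given as (start, end) in seconds."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_jaccard(estimated, reference):
    """Mean Jaccard index over all sounds present in the reference dict."""
    scores = [interval_jaccard(estimated[name], reference[name])
              for name in reference]
    return sum(scores) / len(scores)


# Made-up spans in seconds, for illustration only.
reference = {"sound_a": (0.020, 0.060), "sound_b": (0.000, 0.010)}
estimated = {"sound_a": (0.010, 0.050), "sound_b": (0.020, 0.030)}
score_a = interval_jaccard(estimated["sound_a"], reference["sound_a"])
score_b = interval_jaccard(estimated["sound_b"], reference["sound_b"])
overall = mean_jaccard(estimated, reference)
print(score_a, score_b, overall)
```

A parameter search would repeat this for every combination of frame size and threshold and keep the combination with the highest mean.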