Onset detection and Attack Phase Descriptors

IMV Signal Processing Meetup, 16 March 2017


- Onset detection vs. attack phase description
  - MIREX competition:
    - Detect the approximate temporal location of new onsets in an audio file.
    - Algorithms are compared against manual expert annotation (which is inherently imprecise).
    - False positives and false negatives are penalized.
  - Attack phase description:
    - What are the salient time points in the beginning of this sound event?
    - What are the relations between these time points?
- Paper:
  - K. Nymoen, A. Danielsen and J. London: Attack Phase Descriptor Estimation in Matlab toolboxes. Submitted for SMC2017, Helsinki.
  - Comparing the MIRtoolbox (Lartillot) and the Timbre Toolbox (Peeters)
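The MIREX-style comparison against annotations can be sketched as follows. This is a minimal illustration, not the official evaluation code: the function name `evaluate_onsets`, the greedy one-to-one matching and the 50 ms tolerance are all assumptions made for the example.

```python
def evaluate_onsets(detected, annotated, tolerance=0.05):
    """Score detected onset times against annotated ones (all in seconds).

    Assumption: a detection counts as correct when it lies within
    `tolerance` seconds of a not-yet-matched annotation.
    """
    detected = sorted(detected)
    annotated = sorted(annotated)
    i = j = true_pos = 0
    while i < len(detected) and j < len(annotated):
        diff = detected[i] - annotated[j]
        if abs(diff) <= tolerance:
            true_pos += 1          # matched pair
            i += 1
            j += 1
        elif diff < 0:
            i += 1                 # detection with no annotation: false positive
        else:
            j += 1                 # annotation with no detection: false negative
    false_pos = len(detected) - true_pos
    false_neg = len(annotated) - true_pos
    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / len(annotated) if annotated else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return {"true_pos": true_pos, "false_pos": false_pos,
            "false_neg": false_neg, "precision": precision,
            "recall": recall, "f_measure": f_measure}


result = evaluate_onsets([0.51, 1.20, 2.02, 3.50], [0.50, 1.00, 2.00])
print(result)
```

Both error types lower the score: false positives reduce precision, false negatives reduce recall. Because the annotations are themselves imprecise, the tolerance window has to be fairly generous, which is one reason this kind of evaluation says little about the fine timing of the attack phase.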

Onset detection

[Figure: two panels. Audio waveform: amplitude (-0.8 to 0.8) against time (s), 0 to 7. Onset curve (Envelope): amplitude (0.1 to 1) against time (s), 1 to 7.]

Are these really onsets?

[Figure: Onset curve (Envelope): amplitude (0.1 to 1) against time (s), 1 to 7.]

- What would our research question typically be when using this function in the MIRtoolbox?
  - Segmentation
  - Melody
  - Rhythm(?)
  - Microrhythm
- Are we interested in onsets, or rather perceived moments of metrical alignment?


Salient time points in the initial phase of a sonic event

[Figure labels: Physical onset, Perceptual onset, Perceptual attack, Energy peak]

Schaeffer (196x), Gordon (1987), Collins (2006), Wright (2008), Villing (2010)


Attack phase descriptors

[Figure labels: Physical onset, Perceptual onset, Perceptual attack, Energy peak, Temporal centroid, Rise time, Attack time, Log-Attack Time = log(Attack time), Attack slope, Attack leap]

- Time points
- Time spans
- Energy spans
- (Energy points)

Attack phase descriptors (our definitions)

Name               Type  Description
Physical onset     phTP  Time point where the sound energy first rises from 0.
Perceptual onset   peTP  Time point when the sound event becomes audible.
Perceptual attack  peTP  Time point perceived as the rhythmic emphasis of the sound.
Energy peak        phTP  Time point when the energy envelope reaches its maximum value.
Rise time          phTS  Time span between physical onset and energy peak.
Attack time        peTS  Time span between perceptual onset and perceptual attack.
Log-Attack Time    phTS  The base 10 logarithm of attack time.
Attack slope       peES  Weighted average of the energy envelope slope in the attack phase.
Attack leap        peES  The difference between energy level at perceptual attack and perceptual onset.
Temporal centroid  phTP  The temporal barycentre of the sound event's energy envelope.

Type codes: ph = physical, pe = perceptual; TP = time point, TS = time span, ES = energy span.

Log-Attack Time is a commonly used descriptor in the MIR community. No consensus: some use physical descriptors, some use perceptual, and some use a combination to estimate it.
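As a rough illustration of how the definitions above relate to each other, the sketch below computes most of them from a sampled energy envelope. It is a sketch under stated assumptions, not code from either toolbox: the function name `attack_descriptors` is hypothetical, the physical onset is taken as the first sample above zero, the two perceptual time points are passed in as inputs, and attack slope is left out because the weighting is not specified here.

```python
import math


def attack_descriptors(envelope, rate, perceptual_onset, perceptual_attack):
    """Descriptors from an energy envelope sampled at `rate` Hz.

    The two perceptual time points (seconds) are inputs: they come from
    listeners or a perceptual model, not from the envelope alone.
    """
    times = [n / rate for n in range(len(envelope))]
    # Physical onset (assumption): first sample where the energy is above zero.
    onset_idx = next(n for n, e in enumerate(envelope) if e > 0)
    # Energy peak: sample where the envelope reaches its maximum.
    peak_idx = max(range(len(envelope)), key=envelope.__getitem__)
    attack_time = perceptual_attack - perceptual_onset

    def level_at(t):
        # Envelope value at the sample nearest to time t.
        idx = min(len(envelope) - 1, max(0, round(t * rate)))
        return envelope[idx]

    return {
        "physical_onset": times[onset_idx],
        "energy_peak": times[peak_idx],
        "rise_time": times[peak_idx] - times[onset_idx],
        "attack_time": attack_time,
        "log_attack_time": math.log10(attack_time),
        "attack_leap": level_at(perceptual_attack) - level_at(perceptual_onset),
        "temporal_centroid": sum(t * e for t, e in zip(times, envelope))
                             / sum(envelope),
    }


# Toy envelope, one sample every 10 ms.
env = [0.0, 0.0, 0.2, 0.6, 1.0, 0.8, 0.4, 0.0]
d = attack_descriptors(env, rate=100, perceptual_onset=0.02, perceptual_attack=0.03)
print(d)
```

Swapping the perceptual inputs for the physical onset and energy peak gives a purely physical Log-Attack Time, which is exactly the ambiguity noted above.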


Attack phase descriptors step 1: Envelope extraction

Timbre Toolbox

- Apply Hilbert transform to the audio signal,
- followed by a 3rd-order Butterworth lowpass filter with cutoff frequency at 5 Hz.
- No compensation for filter group delay

MIRtoolbox

- Spectrogram, Hanning window, 100 ms frame, 10% hop
- Envelope = sum of columns in spectrogram

[Figure: Audio waveform (-0.5 to 0.5) and energy envelope (0 to 1) against time (seconds), 0 to 0.25. Envelope curves: MIRtoolbox (D), Timbre Toolbox (D), Timbre Toolbox (A), MIRtoolbox (A).]
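The frame-based recipe listed for the MIRtoolbox can be sketched in plain Python. This is a minimal sketch and not the MIRtoolbox implementation: the function name `spectral_envelope` is hypothetical, the "sum of columns" is assumed to mean the sum of magnitude bins per frame, and a naive DFT is used so the example needs no external libraries. The Timbre Toolbox route (Hilbert transform plus Butterworth filter) is not shown.

```python
import cmath
import math


def spectral_envelope(signal, rate, frame_s=0.1, hop_fraction=0.1):
    """Envelope as the per-frame sum of DFT magnitudes (Hann window)."""
    frame = max(2, round(frame_s * rate))          # frame length in samples
    hop = max(1, round(frame * hop_fraction))      # hop = 10% of the frame
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame - 1))
              for n in range(frame)]
    times, envelope = [], []
    for start in range(0, len(signal) - frame + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame)]
        total = 0.0
        for k in range(frame // 2 + 1):            # non-negative frequencies
            coeff = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame)
                        for n in range(frame))
            total += abs(coeff)
        envelope.append(total)
        times.append((start + frame / 2) / rate)   # frame centre in seconds
    return times, envelope


# Toy signal at 200 Hz: 0.5 s of silence, then a 20 Hz sine burst.
rate = 200
signal = [0.0] * 100 + [math.sin(2 * math.pi * 20 * n / rate) for n in range(100)]
times, env = spectral_envelope(signal, rate)
print(len(env), times[0], times[-1])
```

With a 100 ms frame, each envelope value averages over a stretch that is long compared with many attacks, so the onset is smeared in time. A shorter `frame_s` gives a sharper but noisier envelope.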

Attack phase descriptors step 2: Salient time steps

- Both the MIRtoolbox and the Timbre Toolbox provide equivalents to "beginning of attack" and "end of attack".

[Figure: Timbre Toolbox attack phase estimation, against time (seconds), 0 to 1. Curves: effort function, mean effort, energy envelope, attack start to end. Threshold levels θ1, θ2, ..., θ10 are marked.]

[Figure: MIRtoolbox attack phase estimation, against time (seconds), 0 to 1. Curves: time derivative (0 to 0.02), peak position, threshold, energy envelope (0 to 1), attack start to end.]

- But are these supposed to reflect physical or perceptual features?
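One plausible reading of the MIRtoolbox panel (time derivative, peak position, threshold) is sketched below. It is an assumption-laden illustration, not the toolbox's code: the function name `attack_bounds` is hypothetical, and the rule used here is that the attack spans the contiguous region around the steepest point where the envelope derivative stays at or above a fraction of its peak value.

```python
def attack_bounds(envelope, rate, threshold=0.2):
    """Estimate attack start and end times (seconds) from an envelope.

    Assumed rule: walk outwards from the derivative peak while the
    derivative stays at or above `threshold` times its peak value.
    """
    if len(envelope) < 2:
        raise ValueError("need at least two envelope samples")
    # deriv[n] is the slope between samples n and n + 1.
    deriv = [(envelope[n + 1] - envelope[n]) * rate
             for n in range(len(envelope) - 1)]
    peak = max(range(len(deriv)), key=deriv.__getitem__)
    limit = threshold * deriv[peak]
    start = peak
    while start > 0 and deriv[start - 1] >= limit:
        start -= 1
    end = peak
    while end < len(deriv) - 1 and deriv[end + 1] >= limit:
        end += 1
    # The last steep segment ends at sample end + 1.
    return start / rate, (end + 1) / rate


# Toy envelope, one sample every 10 ms.
env = [0.0, 0.01, 0.02, 0.3, 0.7, 0.95, 1.0, 0.9]
print(attack_bounds(env, rate=100, threshold=0.2))
print(attack_bounds(env, rate=100, threshold=0.075))
```

Lowering the threshold widens the estimated attack, since shallower parts of the rise are included. Whether the resulting bounds are closer to physical or perceptual time points depends on that choice, which is the question raised above.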


Attack phase descriptors step 3: ...

- Decide on the definitions of attack phase descriptors, and the methods for extracting salient time points


Into the nitty-gritty

Timbre Toolbox

- NB! Make sure that you download the latest version from GitHub... (Don't trust the CIRMMT link, the 1st hit on Google, which gives you version 1.2 from 2003)
- My impression: Best used if you need a large range of audio descriptors for a large audio set, and don't want to fiddle with choosing parameters for your functions
- Need to dig deep into the code to change the parameters (hard-coded):
  - Lowpass filter cutoff frequency
  - α value
  - Fix group delay problem

Into the nitty-gritty

MIRtoolbox

- Quite user-friendly: well documented, easy to access most parameters
- mironsets() function - 'attack' option
  - threshold value is hard-coded
  - mirgetdata-problem:
    - uncell(get(A,'AttackPosUnit'))
    - uncell(get(A,'PeakPosUnit'))

Perceptual experiment

- Task: align a repeated musical sound to a click track.
- 17 participants
- 9 sound stimuli (8 musical instruments + click)
- Inter-stimulus interval of 600 ms
- Click track and stimuli started with a random offset.
- Participants controlled the sync using a keyboard and/or a slider on the screen.

Parameter optimisation and perceptual results

[Figure: Time relative to physical onset (in milliseconds, -80 to 80) for each stimulus: Bright Piano, Snare Drum, Dark Piano, Kick Drum, Fiddle, Shaker, Synth Bass, Arco Bass, Click. Series: MIRtoolbox (D), Timbre Toolbox (D), Timbre Toolbox (O), MIRtoolbox (O).]

Perceptual results

[Figure: Jaccard index for MIRtoolbox (mean for all sounds), plotted against threshold (fraction of e' peak, 0.05 to 0.75) and frame size (seconds, 0.01 to 0.15). Colour scale: Jaccard index, 0.1 to 0.55.]

Toolbox         Envelope parameter                                        Threshold parameter
Timbre Toolbox  LP filter cutoff frequency. Default: 5 Hz. Optimised: 37 Hz   Default: 3. Optimised: 3.75
MIRtoolbox      Frame size. Default: 0.1 s. Optimised: 0.03 s                 Fraction of e' peak. Default: 20%. Optimised: 7.5%
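For reference, the Jaccard index of two time intervals is the length of their overlap divided by the length of their union. The sketch below shows only that generic definition. Which intervals were actually compared in the optimisation is not stated on the slides, so pairing a toolbox estimate with a span derived from participant responses is an assumption, and `interval_jaccard` is a hypothetical helper name.

```python
def interval_jaccard(a, b):
    """Jaccard index of two closed time intervals given as (start, end)."""
    a0, a1 = sorted(a)
    b0, b1 = sorted(b)
    intersection = max(0.0, min(a1, b1) - max(a0, b0))
    union = (a1 - a0) + (b1 - b0) - intersection
    return intersection / union if union > 0 else 0.0


estimated = (0.00, 0.04)    # e.g. attack start to end from a toolbox (seconds)
perceived = (0.02, 0.06)    # e.g. span derived from listener responses
print(interval_jaccard(estimated, perceived))
```

The index is 1 for identical intervals and 0 for disjoint ones, so a parameter sweep such as frame size against threshold can be scored by averaging it over all sounds.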