
Music transcription, SGN-24006 / A.K.

Automatic music transcription

Sources:
* Klapuri, Introduction to music transcription, 2006. www.cs.tut.fi/sgn/arg/klap/amt-intro.pdf
* Klapuri, Eronen, Astola, Analysis of the Meter of Acoustic Musical Signals, IEEE TASLP, 2006.
* Klapuri, Multiple fundamental frequency estimation by summing harmonic amplitudes, ISMIR 2006.
* Ryynänen, Klapuri, Automatic transcription of melody, bass line, and chords in polyphonic music, Computer Music Journal, 2008.

Contents:
* Introduction to music transcription
* Rhythm analysis
* Multiple-F0 analysis
* Acoustic and musicological models
* Vocals separation and lyrics
* Application to music retrieval


1 Introduction to music transcription



Excerpt from Song #034 in the RWC Popular Music database.

Figures, top to bottom:
1. time-domain signal
2. spectrogram
3. musical notation
4. piano roll

Anything missing?


Music transcription

Complete vs. partial transcription
* complete transcription is sometimes impossible or irrelevant
* partial transcription: for example melody, bass line, percussion, chords, etc.

Applications and related areas
* music retrieval
* structured audio coding
* intelligent processing / effects
* equipment for stage lighting, automatic accompaniment, etc.
* computer games
* music perception


Perspectives on music transcription

* Music transcription is a wide topic
* It is useful to structure the problem by decomposing it into smaller and more tractable subproblems

Perspectives on music transcription

Acoustic and musicological models
* Speech recognition systems depend on language models
  * e.g. probabilities of different word sequences (N-gram models)
* Musicological information is equally important for transcription
  * e.g. probabilities of tone sequences or combinations
  * instrument models

Figure: acoustic signal -> analysis (using internal models) -> result.


2 Onset detection and meter analysis


Time structure analysis

Onset detection = detection of the beginnings of sounds in an acoustic signal

Meter analysis
* for example tapping foot to music (beat tracking)
* may include several time scales
* detect moments of musical stress in an audio signal and discover underlying periodicities in them

Applications
* beat-synchronous feature extraction
* temporal framework for audio editing
* synchronization of audio/audio or audio/video


Meter analysis

* Characterizes the temporal regularity of the moments of stress
* Basic idea is to analyse the periodicity of the change signal

Figure: musical meter is a hierarchical structure
* pulse sensations at different time scales
* tactus level is the most prominent (foot-tapping rate)
* tatum: time quantum (fastest pulse)
* measure pulse: related to harmonic change rate


Measuring degree of change in music

Moments of change are important for onset detection and meter analysis
* change in the intensity, pitch or timbre of a sound
* moments of musical stress (accents) are caused by the beginnings of sound events, sudden changes in loudness or timbre, and harmonic changes

Perceptual change should be estimated
* to detect what humans detect and to ignore what humans ignore
* musically meaningful rhythmic parsing



Time-domain signal: some data reduction is needed

But: the power envelope of a signal is not sufficient

Frequency selectivity of hearing: audibility of a change at each critical band is only affected by the spectral components within the same band
* components within a single critical band may mask each other
* but this does not happen if the frequency separation is sufficiently large

Measure change independently at critical bands, and then combine the results



Scheirer: the perceived rhythmic content of many music types remains the same if only the power envelopes of a few subbands are preserved and then used to modulate a white noise signal
* one band is not enough
* applies to music with a strong beat


Measuring degree of change in music: In practice

Filterbank:
* Fourier transforms in successive ~20 ms time frames (50% overlap)
* in each frame n, measure the power x_b(n) within b = 1, 2, ..., 36 triangular-response bandpass filters that are uniformly distributed on the Mel-frequency scale (50 Hz to 20 kHz)

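Figure: music signal -> filterbank -> perceived change at each subband -> combine results -> output.

Perceived change at subband b:

$$\frac{\frac{d}{dt} x_b(n)}{x_b(n)} = \frac{d}{dt} \ln x_b(n)$$

Mel-frequency scale:

$$f_{\mathrm{Mel}} = 2595 \, \log_{10}\left(1 + \frac{f_{\mathrm{Hz}}}{700}\right)$$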
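The Mel mapping used by the filterbank can be sketched as follows, assuming NumPy. The function names (`hz_to_mel`, `mel_to_hz`, `mel_band_edges`) are illustrative and not from the course material. The band count and frequency range follow the filterbank description, and the 36 + 2 edge points are one common way to lay out triangular filters, assumed here.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz: 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(f_mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(f_mel, dtype=float) / 2595.0) - 1.0)

def mel_band_edges(n_bands=36, f_lo=50.0, f_hi=20000.0):
    """Edge frequencies (Hz) spaced uniformly on the Mel scale.

    n_bands triangular filters need n_bands + 2 points: filter b
    rises from edge b to edge b + 1 and falls to edge b + 2
    (an assumed, commonly used layout).
    """
    edges_mel = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_bands + 2)
    return mel_to_hz(edges_mel)

edges_hz = mel_band_edges()
```

The edges are dense at low frequencies and sparse at high frequencies, which is the point of measuring change on a perceptual frequency scale.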

Measuring degree of change in music

Degree of change at each band

Denote by x_b(n) the power at critical band b = 1, ..., 36 as a function of time (frame index) n.

How to measure the degree of change at subbands? Differential?
* For humans, the smallest detectable change in intensity, ΔI, is approximately proportional to the intensity I of the signal, the same amount of increase being more prominent in a quiet signal.
* The audible ratio ΔI / I is approximately constant.

Thus it is reasonable to normalize the differential of power with power:

$$\frac{\frac{d}{dt} x_b(n)}{x_b(n)} = \frac{d}{dt} \ln x_b(n)$$

Figure (piano onset):
* dashed line: (d/dt) x_b(n)
* solid line: (d/dt) ln[x_b(n)]


Degree of change at each band

A numerically robust way of calculating the logarithm is the µ-law compression,

$$y_b(n) = \frac{\ln\left(1 + \mu \, x_b(n)\right)}{\ln(1 + \mu)}$$

where the constant µ determines the degree of compression for x_b(n) (µ = 10...10^4 / x).

Differentiate, and retain only positive changes (HWR(x) = max(x, 0)):

$$y'_b(n) = \mathrm{HWR}\{ y_b(n) - y_b(n-1) \}$$


Summary

Finally: sum across channels to estimate the overall change

$$v(n) = \sum_{b=1}^{36} y'_b(n)$$

Figure: music signal -> filterbank -> per-band processing (power envelope x_b(n) -> µ-law compression -> d/dt, rectify) -> combine results -> output v(n).
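The chain from band powers to the overall change signal (compress, differentiate, rectify, sum) can be sketched as below, assuming NumPy and an input array of band powers that has already been computed. The function name `accent_signal` and the default `mu=100` are assumptions for illustration; the slides only give a range for µ.

```python
import numpy as np

def accent_signal(x, mu=100.0):
    """Overall change signal v(n) from band powers.

    x  : array of shape (n_bands, n_frames) holding x_b(n) >= 0
    mu : compression constant (assumed value, not from the slides)
    """
    x = np.asarray(x, dtype=float)
    # mu-law compression: y_b(n) = ln(1 + mu * x_b(n)) / ln(1 + mu)
    y = np.log1p(mu * x) / np.log1p(mu)
    # first difference y_b(n) - y_b(n - 1); the first frame gets 0
    dy = np.diff(y, axis=1, prepend=y[:, :1])
    # half-wave rectification keeps only increases
    dy = np.maximum(dy, 0.0)
    # sum across bands
    return dy.sum(axis=0)

# Two bands that switch on at frame 3 and off at frame 5
x_demo = np.zeros((2, 6))
x_demo[:, 3:5] = 1.0
v_demo = accent_signal(x_demo)
```

In the demo only the onset at frame 3 produces a non-zero value; the offset at frame 5 is removed by the rectification.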


Measured change signals

Figure: three panels showing measured change signals v(n).

Signal level adaptation would be needed.

Meter analysis

Degree of change ("accent")

Accent signals (degree of change)
* degree of accent as a function of time
* as described above


Pulse strengths ("saliences")

Metrical pulse saliences
* strengths of different metrical pulses at time n (resonator energies)
* use comb filters for period analysis


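Bank of comb filters
* Use a bank of comb filters for periodicity analysis
* Each filter has a delay of k = τ samples and a feedback gain a: y(n) = a y(n - k) + (1 - a) x(n)
* We used a = 0.5^(τ/T), where T is the half-time in samples (3 s)

Figure: comb filter block diagram (input x(n), delay z^-k, gains a and 1 - a, output y(n)), with its magnitude response and impulse response for a = 0.9, k = 7.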


Bank of comb filters
* Time-varying energies of each comb filter in the filterbank, where y(τ, i) is the output of the comb filter with delay τ:

$$r(\tau, n) = \frac{1}{\tau} \sum_{i=n-\tau+1}^{n} \left[ y(\tau, i) \right]^2$$

* r(τ, n) can be further normalized to get rid of the trend (details are beyond the scope of this course)

Figure: r(τ, n), τ = 1, 2, ..., 100, for an impulse train (period 24 samples) and for white noise. Panels: r(τ, n), input impulse train; r(τ, n), input white noise.
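A bank of feedback comb filters and their running energies can be sketched as follows, assuming NumPy. The function name `comb_filter_energies` is illustrative. The half-time of 300 samples is an assumption: it corresponds to 3 s only if the accent signal has 100 frames per second. The normalization that removes the trend is not included.

```python
import numpy as np

def comb_filter_energies(v, max_period=100, half_time=300.0):
    """Energies r(tau, n) of comb filters with delays 1..max_period.

    v         : accent signal, one value per frame
    half_time : T in samples (assumed: 3 s at 100 frames per second)
    Returns an array of shape (max_period, len(v)); row tau - 1
    holds r(tau, n).
    """
    v = np.asarray(v, dtype=float)
    n_frames = len(v)
    idx = np.arange(n_frames)
    r = np.zeros((max_period, n_frames))
    for tau in range(1, max_period + 1):
        a = 0.5 ** (tau / half_time)  # feedback gain for this delay
        y = np.zeros(n_frames)
        for n in range(n_frames):
            feedback = y[n - tau] if n >= tau else 0.0
            y[n] = a * feedback + (1.0 - a) * v[n]
        # mean of y^2 over the last tau samples, via a cumulative sum
        csum = np.concatenate(([0.0], np.cumsum(y ** 2)))
        start = np.maximum(idx - tau + 1, 0)
        r[tau - 1] = (csum[idx + 1] - csum[start]) / tau
    return r

# Impulse train with a period of 24 samples
v_demo = np.zeros(2000)
v_demo[::24] = 1.0
r_demo = comb_filter_energies(v_demo)
```

For this input the energy at delay 24 is clearly larger than at the neighbouring delays 23 and 25. Integer multiples of the period (48, 72, 96) resonate as well, which is one reason higher-level modeling is needed to choose among metrical levels.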


Higher-level modeling

Figure: meter at three levels (tatum, tactus, measure).


Higher-level modeling
* Observed: (normalized) comb filter energies r(τ, n)
* Prior probabilities (typical tempo values): log-normal distribution
* Temporal continuity constraints: p(next tempo | previous tempo)


Demonstrations

http://www.cs.tut.fi/~klap/iiro/meter/


3 Polyphonic pitch analysis


Introduction

Pitch information is an essential part of almost all Western music

Extracting pitch information from recorded audio is hard
* a spectrogram can be calculated straightforwardly
* a piano roll is more tricky

Multiple-F0 estimation = F0 estimation in polyphonic signals
* music: variety of sources, wide pitch range, presence of drums

A number of completely different approaches have been proposed in the literature


Musical sounds

Most Western instruments produce harmonic sounds

Figure: trumpet sound (260 Hz) in time and frequency domains (frequency axis in Hz)
* period in time domain: 1/F0
* period in frequency domain: F0


How about just autocorrelation function (ACF)?

Autocorrelation function (ACF) based algorithms are among the most frequently used single-pitch estimators
* Usually the lag of the maximum value in the ACF (excluding lag zero) is taken as the period 1/F0
* Short-time ACF r(τ) for a discrete time-domain signal x(n):

$$r(\tau) = \frac{1}{N} \sum_{n=0}^{N-\tau-1} x(n) \, x(n+\tau)$$

Figure: signal x(n) (vowel [ae]) and its ACF.


Autocorrelation function

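Short-time ACF within a time frame of length N (here without the 1/N normalization):

$$r(\tau) = \sum_{n=0}^{N-\tau-1} x(n) \, x(n+\tau)$$

Short-time ACF for real-valued signals can be computed via FFT as

$$r(\tau) = \mathrm{IDFT}\left\{ |X(k)|^2 \right\} = \frac{2}{K} \sum_{k=0}^{K/2-1} |X(k)|^2 \cos\left( \frac{2\pi \tau k}{K} \right)$$

where IDFT is the inverse discrete Fourier transform and X(k) is the K-point DFT of x(n) (padding zeros so that the FFT length is twice the length of x). The latter identity is true only for real-valued (audio) signals, and in the form given it treats the DC and Nyquist bins approximately.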
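The equivalence between the time-domain sum and the FFT route can be checked with a short sketch, assuming NumPy. The function names `acf_direct` and `acf_fft` are illustrative. Both compute the ACF without the 1/N normalization.

```python
import numpy as np

def acf_direct(x):
    """r(tau) = sum over n of x(n) * x(n + tau), for tau = 0..N-1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.array([np.dot(x[:n - tau], x[tau:]) for tau in range(n)])

def acf_fft(x):
    """Same ACF as the inverse transform of the power spectrum.

    Zero-padding to twice the frame length avoids the circular
    wrap-around that a plain N-point transform would introduce.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.rfft(x, 2 * n)
    return np.fft.irfft(np.abs(spectrum) ** 2, 2 * n)[:n]

# Single-pitch example: a 200 Hz tone sampled at 8 kHz (period 40 samples)
fs = 8000
tone = np.sin(2 * np.pi * 200 * np.arange(1024) / fs)
r_tone = acf_fft(tone)
period = 20 + int(np.argmax(r_tone[20:60]))  # search lags 20..59 only
```

Restricting the search range matters: lag zero always holds the global maximum, so the pitch period has to be picked among plausible non-zero lags.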


Autocorrelation function

From the frequency-domain interpretation, we see at least three properties of the ACF that make it non-robust for the period analysis of polyphonic audio
* the entire spectrum is used (weighting with values between -1 and 1)
* all integer multiples of fs/τ are given the same (unity) weight
* squaring the spectrum emphasizes timbral properties (formants etc.)

In the following, we propose a method which makes three basic modifications to the ACF to enhance its robustness

1. sharper peaks (cf. comb filter); 2. weight harmonics (1 -> g(τ,m))


More reliable method*

Starting point is conceptually very simple
1. Input signal is first spectrally flattened ("whitened") to suppress timbral information
2. The salience (strength) of an F0 candidate is calculated as a weighted sum of the amplitudes of its harmonic partials

$$s(\tau) = \sum_{m=1}^{M} g(\tau, m) \, \left| Y(f_{\tau,m}) \right|$$

where f_{τ,m} = m fs / τ is the frequency of the m:th harmonic partial of an F0 candidate fs / τ, fs is the sampling rate, and the function g(τ,m) defines the weight of partial m of period τ in the sum. Y(f) is the short-time Fourier transform of the whitened time-domain signal.

* Klapuri, A., "Multiple fundamental frequency estimation by summing harmonic amplitudes," 7th International Conference on Music Information Retrieval, Victoria, Canada, Oct. 2006.

Proposed method

Summing harmonic amplitudes

The basic idea of harmonic summation is intuitively appealing:
* pitch perception is closely related to the time-domain periodicity of sounds
* the Fourier theorem states that a periodic signal can be represented with spectral components at integer multiples of the inverse of the period

The question of an optimal mapping of the Fourier spectrum to a pitch spectrum (or a piano roll) is closely related to these methods
* here, the function g(τ,m) is learned by brute-force optimization (300 Hz):

$$s(\tau) = \sum_{m=1}^{M} g(\tau, m) \, \left| Y(f_{\tau,m}) \right|, \qquad g(\tau, m) = \frac{f_s/\tau + g_1}{m f_s/\tau + g_2}$$


Spectral whitening

* One of the big challenges in F0 estimation is to make systems robust for different sound sources
* A way to achieve this is to try to suppress timbral information prior to the actual F0 estimation

Whitening
1. Calculate the DFT X(k) of the input signal x(n)
2. Calculate standard deviations σ_b (= sqrt(power)) within subbands in the frequency domain (square and sum frequency bins within bands, then sqrt)
3. Calculate bandwise compression coefficients γ_b = σ_b^ν / σ_b, where ν = 0.3 is a parameter determining the amount of spectral whitening
4. The whitened spectrum Y(k) is obtained by weighting each subband with its compression coefficient and then recombining the subbands
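The four whitening steps can be sketched as below, assuming NumPy. This is a simplified illustration, not the method of the cited paper: the subbands here are rectangular, non-overlapping and 100 Hz wide, and a Hann window is applied before the DFT. The function name `whiten_spectrum` and the parameter `band_hz` are assumptions.

```python
import numpy as np

def whiten_spectrum(x, fs, band_hz=100.0, nu=0.3):
    """Return the whitened spectrum Y(k) and the bin frequencies.

    Subbands are rectangular and band_hz wide (an assumption made
    to keep the sketch short).
    """
    x = np.asarray(x, dtype=float)
    # Step 1: DFT of the (windowed) frame
    X = np.fft.rfft(x * np.hanning(len(x)))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band_index = (freqs // band_hz).astype(int)
    Y = np.zeros_like(X)
    for b in np.unique(band_index):
        in_band = band_index == b
        # Step 2: square and sum the bins in the band, then sqrt
        sigma = np.sqrt(np.sum(np.abs(X[in_band]) ** 2))
        # Step 3: compression coefficient sigma^nu / sigma
        gamma = sigma ** (nu - 1.0) if sigma > 0 else 0.0
        # Step 4: weight the band and put it back in place
        Y[in_band] = gamma * X[in_band]
    return Y, freqs

# A strong partial at 450 Hz and a weak one (1/100) at 2050 Hz
fs = 8000
n = np.arange(4096)
x_demo = np.sin(2 * np.pi * 450 * n / fs) + 0.01 * np.sin(2 * np.pi * 2050 * n / fs)
Y_demo, freqs_demo = whiten_spectrum(x_demo, fs)
```

With ν = 0.3 a band that is 100 times stronger before whitening ends up only about 100^0.3, or roughly 4 times, stronger afterwards, so the spectral envelope is flattened without being removed completely.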


Calculation of the F0 salience function

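Calculated as

$$s(\tau) = \sum_{m=1}^{M} g(\tau, m) \max_{k \in \kappa_{\tau,m}} \left| Y(k) \right|$$

where the set κ_{τ,m} defines a range of frequency bins in the vicinity of the m:th overtone of the F0 candidate fs / τ:

$$\kappa_{\tau,m} = \left[ \langle mK / (\tau + \Delta\tau/2) \rangle, \ldots, \langle mK / (\tau - \Delta\tau/2) \rangle \right]$$

where ⟨·⟩ denotes rounding, K is the length of the Fourier transform, and Δτ denotes the spacing between fundamental period candidates (Δτ = 1 or 0.5).

Weight function was found by optimisation (300 Hz):

$$g(\tau, m) = \frac{f_s/\tau + g_1}{m f_s/\tau + g_2}$$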
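The salience calculation can be sketched as follows, assuming NumPy and a magnitude spectrum that has already been whitened. The function name `salience` is illustrative, and the values `g1=50` and `g2=320` are placeholders chosen for the example: the slides do not state the optimised constants.

```python
import numpy as np

def salience(Y_mag, fs, K, taus, n_partials=20, g1=50.0, g2=320.0, dtau=1.0):
    """Salience s(tau) for each candidate period in taus (samples).

    Y_mag  : magnitudes |Y(k)| for bins 0..K/2
    K      : transform length
    g1, g2 : weight constants (placeholder values)
    dtau   : spacing between period candidates
    """
    s = np.zeros(len(taus))
    for i, tau in enumerate(taus):
        f0 = fs / tau
        for m in range(1, n_partials + 1):
            # bins in the vicinity of the m:th partial
            lo = int(round(m * K / (tau + dtau / 2)))
            hi = int(round(m * K / (tau - dtau / 2)))
            if lo >= len(Y_mag):
                break  # partial lies above the Nyquist bin
            hi = min(hi, len(Y_mag) - 1)
            g = (f0 + g1) / (m * f0 + g2)
            s[i] += g * Y_mag[lo:hi + 1].max()
    return s

# Synthetic spectrum: ten unit-amplitude partials of a 200 Hz sound
fs, K = 8000, 4096
Y_demo = np.zeros(K // 2 + 1)
for m in range(1, 11):
    Y_demo[int(round(m * K / 40))] = 1.0
taus_demo = np.arange(20, 101)
s_demo = salience(Y_demo, fs, K, taus_demo)
best_tau = taus_demo[int(np.argmax(s_demo))]
```

In this example the period of 40 samples (200 Hz) gets the highest salience. Periods 20 and 80 also score noticeably because they share partials with the true sound, which is the half/double ambiguity that the cancellation scheme is meant to reduce.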


Predominant F0 estimation

The maximum of the salience function s(τ) is a quite robust indicator of one of the correct F0s in a polyphonic audio signal
* predominant F0 estimation: find one (any) of the correct F0s

But the second- or third-highest peak is often due to the same sound and located at a τ that is half or twice the position of the highest peak

Multiple-F0 estimation accuracy can be improved by an iterative estimation and cancellation scheme where each detected sound is cancelled from the mixture and s(τ) is updated accordingly before deciding the next F0

Iterative estimation and cancellation

Step 1: The residual spectrum Y_R(k) is initialized to Y(k). A spectrum of detected sounds, Y_D(k), is initialized to zero.
Step 2: A fundamental period τ_0 is estimated using Y_R(k) to compute s(τ). The maximum of s(τ) determines τ_0.
Step 3: Harmonic partials of τ_0 are located at bins mK / τ_0, m = 1, 2, ..., M. The spectrum of the time-domain window function is translated to those frequencies, weighted by g(τ_0, m) and added to Y_D(k).
Step 4: The residual spectrum is updated as Y_R(k) <- max(0, Y_R(k) - d Y_D(k)), where d = 0.2 is a free parameter.
Step 5: Return to Step 2.
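The loop structure of Steps 1 to 5 can be sketched as below, assuming NumPy. Several simplifications are assumptions of this sketch and not part of the method as described: each partial is treated as a single bin instead of a translated window spectrum, its amplitude is read from the residual at that bin, the salience uses the nearest bin only, the loop stops after a fixed number of sounds, and `g1`, `g2` are placeholder values. All function names are illustrative.

```python
import numpy as np

def weight(tau, m, fs, g1=50.0, g2=320.0):
    """Partial weight g(tau, m); g1 and g2 are placeholder values."""
    f0 = fs / tau
    return (f0 + g1) / (m * f0 + g2)

def salience(Y_mag, fs, K, taus, n_partials=10):
    """Simplified salience that reads the nearest bin of each partial."""
    s = np.zeros(len(taus))
    for i, tau in enumerate(taus):
        for m in range(1, n_partials + 1):
            k = int(round(m * K / tau))
            if k >= len(Y_mag):
                break
            s[i] += weight(tau, m, fs) * Y_mag[k]
    return s

def estimate_and_cancel(Y_mag, fs, K, taus, n_sounds=2, d=0.2, n_partials=10):
    """Return the detected periods and the final residual spectrum."""
    Y_R = np.array(Y_mag, dtype=float)  # Step 1: residual = input
    Y_D = np.zeros_like(Y_R)            # Step 1: detected sounds = 0
    periods = []
    for _ in range(n_sounds):           # Step 5: repeat (fixed count here)
        s = salience(Y_R, fs, K, taus, n_partials)  # Step 2
        tau_hat = taus[int(np.argmax(s))]
        periods.append(tau_hat)
        for m in range(1, n_partials + 1):          # Step 3
            k = int(round(m * K / tau_hat))
            if k >= len(Y_R):
                break
            Y_D[k] += weight(tau_hat, m, fs) * Y_R[k]
        Y_R = np.maximum(0.0, Y_R - d * Y_D)        # Step 4
    return periods, Y_R

# Synthetic spectrum: ten unit-amplitude partials of a 200 Hz sound
fs, K = 8000, 4096
Y_demo = np.zeros(K // 2 + 1)
for m in range(1, 11):
    Y_demo[int(round(m * K / 40))] = 1.0
periods_demo, residual_demo = estimate_and_cancel(Y_demo, fs, K, np.arange(20, 101))
```

Because the weights g(τ_0, m) are below one and decrease with m, the higher partials of a detected sound are only partly removed. A practical system also needs a stopping rule for the number of sounds, which this sketch replaces with a fixed count.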


Iterative estimation and cancellation

Figure: first, second, third, and fourth iteration.

"F0-gram": piano roll with confidence levels


"F0-gram": piano roll with salience (RWC-P #25)

"F0-gram": piano roll with salience (RWC-P #95)


Remarks

* The principle of summing harmonic amplitudes is very simple, yet it suffices for predominant-F0 estimation in polyphonic signals, provided that the weights g(τ,m) are appropriate
* Iterative detection and cancellation helps to remove harmonics and subharmonics of already detected sounds and to reveal remaining sounds behind the most prominent ones
* Reasonably accurate for a wide range of instruments and F0s


4 Acoustic and musicological modeling


Why acoustic modeling of notes?

Frame-wise F0 strengths must be processed to get discrete notes (MIDI, score)
* pitch quantization, onsets, offsets
* clean up frame-wise errors

Examples in the following
* Ryynänen, M. and Klapuri, A., Automatic transcription of melody, bass line, and chords in polyphonic music, Computer Music Journal, 32(3), Fall 2008.
* Ryynänen, Klapuri, WASPAA 2005.


Acoustic modeling of notes

1. Extract frame-wise F0 salience (strength) and its differential (here not doing peak-picking or iterative cancellation)
2. Use training data (RWC Popular Music database) to learn acoustic models for note events (100 pieces with audio + time-aligned MIDI)


Music transcription system

Figure:
* Acoustic model
* Musicological model:
  * musical key estimation
  * N-gram models for note sequences


Music transcription system

Combination of an acoustic model and a musicological model (HMMs)


Transcription examples

Complete polyphonic transcription:
http://www.cs.tut.fi/sgn/arg/matti/demos/polytrans.html

Transcription of melody, bass, and chords:
http://www.cs.tut.fi/sgn/arg/matti/demos/mbctrans/


Case study: Singing transcription

Ryynänen, Klapuri, Modeling of note events for singing transcription, SAPA Workshop, 2004.

Estimated pitch track has to be post-processed to get notes

Figure: acoustic signal -> feature extraction (pitch, voicing, accent, meter) -> probabilistic models -> discrete note sequence.


Case study: Singing transcription

Brother can you spare me a dime

Pieni tytön tylleröinen
