berenzweig and ellis - waspaa 011 locating singing voice segments within music signals adam...

Berenzweig and Ellis - WASPAA 01

Locating Singing Voice Segments Within Music Signals

Adam Berenzweig and Daniel P.W. Ellis

LabROSA, Columbia University


LabROSA

• What

• Where

• Who

• Why you love us


The Future as We Hear It

• Online Digital Music Libraries

• The Coming Age of Streaming Music Services

• Information Retrieval: How do we find what we want?

• Recommendation: How do we know what we want to find?

– Collaborative Filtering vs. Content-Based

– What is Quality?


Motivation

• Lyrics Recognition: Baby Steps

– Segmentation

– Forced Alignment

– A Corpus

• Song structure through singing structure?

– Fingerprinting

– Retrieval

– Feature for similarity measures


Lyrics Recognition: Can YOU do it?

• Notoriously hard, even for humans.

– amIright.com, kissThisGuy.com

• Why so hard?

– Noise, music, whatever.

– Singing is not speech: voice transformations

– Strange word sequences (“poetry”)

• Need a corpus


History of the Problem

• Segmentation for Speech Recognition: Music/Speech

– Scheirer & Slaney

• Forced Alignment - Karaoke

– Cano et al. [REF NEEDED]

• Acoustic feature design: Custom job or Kitchen Sink?

• Idea! Use a speech recognizer: PPF (Posterior Probability Features)

– Williams & Ellis

• Ultimately: Source separation, CASA


A Peek at the End


Architecture Overview

[Figure: block diagram. Blocks: Audio, PLP, Speech Recognizer (Neural Net), Feature Calculation, Time-averaging, two Gaussian Models, Segmentation (HMM). Signal labels: cepstra, posteriogram. Calculated features: Entropy H, H/h#, Dynamism D, P(h#).]


Architecture Overview

[Figure: block diagram. Blocks: Audio, PLP, Speech Recognizer (Neural Net), two Neural Nets, Segmentation (HMM). Signal labels: cepstra, posteriogram.]


“So how’s that working out for you, being clever?”

• Entropy

• Entropy excluding background

• Dynamism

• Background probability

• Distribution Match: Likelihoods under single Gaussian model

– Cepstra

– PPF
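The posterior-based features listed above can be sketched in a few lines. This is only an illustration, not the authors' implementation: it assumes the recognizer output is a `(T, K)` array of per-frame phone posteriors whose rows sum to 1, and that the background class `h#` sits in column `bg_index`. The function names, the exact dynamism formula (sum of squared frame-to-frame differences), and the moving-average window are all assumptions made for the sketch.

```python
import numpy as np

def posterior_features(post, bg_index=0, eps=1e-12):
    """Per-frame features from a posteriogram.

    post     : (T, K) array, one row of phone posteriors per frame.
    bg_index : column of the background class 'h#' (assumed position).
    Returns a (T, 4) array: entropy, entropy excluding background,
    dynamism, background probability.
    """
    post = np.asarray(post, dtype=float)

    # Entropy H of the full posterior distribution in each frame.
    entropy = -np.sum(post * np.log(post + eps), axis=1)

    # Entropy excluding background (H/h#): drop the h# column and
    # renormalize the remaining classes before taking the entropy.
    rest = np.delete(post, bg_index, axis=1)
    rest = rest / (rest.sum(axis=1, keepdims=True) + eps)
    entropy_no_bg = -np.sum(rest * np.log(rest + eps), axis=1)

    # Dynamism D: squared change between successive frames, summed over
    # classes. The first frame has no predecessor, so it is set to 0.
    diff = np.diff(post, axis=0)
    dynamism = np.concatenate([[0.0], np.sum(diff ** 2, axis=1)])

    # Background probability P(h#).
    bg_prob = post[:, bg_index]

    return np.stack([entropy, entropy_no_bg, dynamism, bg_prob], axis=1)

def time_average(x, win):
    """Moving average over `win` frames along the time axis (win <= T)."""
    kernel = np.ones(win) / win
    smooth = lambda col: np.convolve(col, kernel, mode="same")
    return np.apply_along_axis(smooth, 0, np.asarray(x, dtype=float))
```

A typical use would be `time_average(posterior_features(post), win)` before fitting any model; the window length is a tuning choice that the slides do not specify.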


Recovering context with the HMM

• Transition probabilities

– Inverse average segment duration

• Emission probabilities

– Gaussian fit to time-averaged distribution

• Segmentation: the Viterbi path

• Evaluation

– Frame error rate (no boundary consideration)
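A minimal sketch of the segmentation steps above, assuming two or more states (for example singing and not singing), diagonal-covariance Gaussian emissions, and average segment durations measured in frames. All function names and the toy numbers in the usage note are hypothetical; the slides do not give the actual model parameters.

```python
import numpy as np

def gaussian_loglik(x, mean, var):
    """Log-likelihood of each row of x (T, D) under a diagonal Gaussian."""
    x = np.asarray(x, dtype=float)
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=1)

def transition_matrix(avg_durations):
    """Leave each state with probability 1 / (average duration in frames)."""
    avg = np.asarray(avg_durations, dtype=float)
    n = len(avg)
    leave = 1.0 / avg
    A = np.zeros((n, n))
    for i in range(n):
        A[i, :] = leave[i] / (n - 1)   # split the exit mass over other states
        A[i, i] = 1.0 - leave[i]       # self-transition
    return A

def viterbi(loglik, A, prior=None):
    """Most likely state path given (T, S) emission log-likelihoods."""
    T, S = loglik.shape
    logA = np.log(A)
    logp = np.log(np.full(S, 1.0 / S) if prior is None else np.asarray(prior))
    delta = logp + loglik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA          # scores[i, j]: state i -> j
        back[t] = np.argmax(scores, axis=0)     # best predecessor of each j
        delta = scores[back[t], np.arange(S)] + loglik[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):               # trace back pointers
        path[t - 1] = back[t, path[t]]
    return path

def frame_error_rate(pred, truth):
    """Fraction of frames labelled incorrectly (boundaries are not scored)."""
    return float(np.mean(np.asarray(pred) != np.asarray(truth)))
```

As a toy usage example, take a one-dimensional feature with state means 0 and 5, unit variance, and an average segment length of 50 frames. A single ambiguous frame is absorbed by the transition penalty instead of producing a spurious segment:

```python
obs = np.array([0, 0, 0, 2.6, 0, 0, 5, 5, 5, 5], dtype=float)[:, None]
ll = np.stack([gaussian_loglik(obs, 0.0, 1.0),
               gaussian_loglik(obs, 5.0, 1.0)], axis=1)
path = viterbi(ll, transition_matrix([50, 50]))
```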


Results

• [Table, figures]

• Listen!

– Good, bad

– trigger & stick

– genre effects?


Results


• E = .075

• P(h#) in effect


• E = .68

• P(h#) gone bad


• E = .61

• Strong phones trigger, but can’t hold it

• Production quality effect?

[Figure labels: phones ‘ey’, ‘uw’, ‘m’, ‘n’]


• E = .25

• “Trigger and Stick”

[Figure label: phone ‘s’]


• E = .54

• False phones

[Figure labels: phones ‘bcl’, ‘dcl’, ‘b’, ‘d’, ‘l’, ‘r’]


• E = .20

• Genre effect?


Discussion

• The Moral of the Story: Just give it the data

• PPF is better than cepstra. Speech Recognizer is pretty powerful.

• Why does the extra Gaussian model help PPF but not cepstra?

• Time averaging helps PPF: proves that it’s using the overall distribution, not short-time detail (at least, when modelled by single Gaussians)
