Berenzweig and Ellis - WASPAA 01 1 Locating Singing Voice Segments Within Music Signals Adam Berenzweig and Daniel P.W. Ellis LabROSA, Columbia University [email protected], [email protected]


Berenzweig and Ellis - WASPAA 01 1

Locating Singing Voice Segments Within Music Signals

Adam Berenzweig and Daniel P.W. Ellis

LabROSA, Columbia University

[email protected], [email protected]


LabROSA

• What

• Where

• Who

• Why you love us


The Future as We Hear It

• Online Digital Music Libraries

• The Coming Age of Streaming Music Services

• Information Retrieval: How do we find what we want?

• Recommendation: How do we know what we want to find?

– Collaborative Filtering vs. Content-Based

– What is Quality?


Motivation

• Lyrics Recognition: Baby Steps

– Segmentation

– Forced Alignment

– A Corpus

• Song structure through singing structure?

– Fingerprinting

– Retrieval

– Feature for similarity measures


Lyrics Recognition: Can YOU do it?

• Notoriously hard, even for humans.

– amIright.com, kissThisGuy.com

• Why so hard?

– Noise, music, whatever.

– Singing is not speech: voice transformations

– Strange word sequences (“poetry”)

• Need a corpus


History of the Problem

• Segmentation for Speech Recognition: Music/Speech

– Scheirer & Slaney

• Forced Alignment - Karaoke

– Cano et al. [REF NEEDED]

• Acoustic feature design: Custom job or Kitchen Sink?

• Idea! Use a speech recognizer: PPF (Posterior Probability Features)

– Williams & Ellis

• Ultimately: Source separation, CASA


A Peek at the End


Architecture Overview

[Block diagram: Audio → PLP → Speech Recognizer (Neural Net) → Feature Calculation → Time-averaging → Gaussian Model (two blocks) → Segmentation (HMM)]

• Signal labels: cepstra, posteriogram

• Features: Entropy H, H/h#, Dynamism D, P(h#)


Architecture Overview

[Block diagram: Audio → PLP → Speech Recognizer (Neural Net) → Neural Net (two blocks) → Segmentation (HMM)]

• Signal labels: cepstra, posteriogram


“So how’s that working out for you, being clever?”

• Entropy

• Entropy excluding background

• Dynamism

• Background probability

• Distribution Match: Likelihoods under single Gaussian model

– Cepstra

– PPF
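A minimal sketch of how the four posterior-based features listed above might be computed from a posteriogram, assuming NumPy. The function name `ppf_features`, the `bg_index` parameter, and the exact definitions used here (natural-log entropy, renormalising after dropping the background class, dynamism as the mean squared frame-to-frame change) are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

def ppf_features(post, bg_index=0, eps=1e-12):
    """Per-frame features from a posteriogram.

    post: (T, K) array of phone-class posteriors, one row per frame,
          each row summing to 1. bg_index is the (assumed) column of
          the background class h#.
    Returns a (T, 4) array: entropy, entropy excluding background,
    dynamism, background probability.
    """
    post = np.asarray(post, dtype=float)

    # Entropy H of the posterior distribution in each frame.
    entropy = -np.sum(post * np.log(post + eps), axis=1)

    # Entropy excluding background: drop h#, renormalise, recompute.
    rest = np.delete(post, bg_index, axis=1)
    rest = rest / (rest.sum(axis=1, keepdims=True) + eps)
    entropy_no_bg = -np.sum(rest * np.log(rest + eps), axis=1)

    # Dynamism D: mean squared change between successive frames.
    # The first frame has no predecessor, so it is padded with 0.
    diff = np.diff(post, axis=0)
    dynamism = np.concatenate([[0.0], np.mean(diff ** 2, axis=1)])

    # Background probability P(h#).
    p_bg = post[:, bg_index]

    return np.column_stack([entropy, entropy_no_bg, dynamism, p_bg])
```

Each row of the result is one frame's feature vector; in this sketch the features are left unsmoothed, so any time-averaging would be applied afterwards.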


Recovering context with the HMM

• Transition probabilities

– Inverse average segment duration

• Emission probabilities

– Gaussian fit to time-averaged distribution

• Segmentation: the Viterbi path

• Evaluation

– Frame error rate (no boundary consideration)
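The HMM steps above can be sketched as follows, assuming NumPy and a small number of states (for example singing and non-singing). The function names, the uniform default prior, and the even split of the leaving probability among the other states are assumptions made for illustration. Average durations must be greater than one frame so that every transition probability is positive.

```python
import numpy as np

def transition_matrix(avg_durations):
    """Transition matrix where the per-frame probability of leaving a
    state is the inverse of its average segment duration (in frames)."""
    d = np.asarray(avg_durations, dtype=float)
    n = len(d)
    leave = 1.0 / d
    A = np.zeros((n, n))
    for i in range(n):
        A[i, :] = leave[i] / (n - 1)   # split evenly among other states
        A[i, i] = 1.0 - leave[i]       # probability of staying
    return A

def viterbi(log_emis, A, prior=None):
    """Most likely state path given (T, n) log emission scores."""
    T, n = log_emis.shape
    log_A = np.log(A)
    if prior is None:
        prior = np.full(n, 1.0 / n)    # assumed uniform start
    delta = np.log(prior) + log_emis[0]
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # scores[i, j]: i -> j
        back[t] = np.argmax(scores, axis=0)      # best predecessor of j
        delta = np.max(scores, axis=0) + log_emis[t]
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(delta)
    for t in range(T - 1, 0, -1):                # trace back
        path[t - 1] = back[t, path[t]]
    return path

def frame_error_rate(pred, truth):
    """Fraction of frames labelled incorrectly; boundaries are ignored."""
    return float(np.mean(np.asarray(pred) != np.asarray(truth)))
```

In this sketch the emission scores would come from per-class Gaussian log-likelihoods of the time-averaged features; long average durations make the path reluctant to switch state on a single noisy frame.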


Results

• [Table, figures]

• Listen!

– Good, bad

– trigger & stick

– genre effects?


Results


• E = .075

• P(h#) in effect


• E = .68

• P(h#) gone bad


• E = .61

• Strong phones trigger, but can’t hold it

• Production quality effect?

[Figure labels: phones ‘ey’, ‘uw’, ‘m’, ‘n’]


• E = .25

• “Trigger and Stick”

[Figure label: phone ‘s’]


• E = .54

• False phones

[Figure labels: phones ‘bcl’, ‘dcl’, ‘b’, ‘d’; ‘l’, ‘r’]


• E = .20

• Genre effect?


Discussion

• The Moral of the Story: Just give it the data

• PPF is better than cepstra. Speech Recognizer is pretty powerful.

• Why does the extra Gaussian model help PPF but not cepstra?

• Time averaging helps PPF: evidence that the classifier is using the overall distribution, not short-time detail (at least when modelled by single Gaussians)
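To make the time-averaging and single-Gaussian points concrete, here is a rough sketch assuming NumPy. The names `time_average`, `fit_gaussian` and `gaussian_loglik`, the moving-average window with edge padding, and the small diagonal term added to the covariance are all assumptions for illustration; the slides do not specify these details.

```python
import numpy as np

def time_average(feats, win):
    """Moving average over `win` frames along time, output length T.
    Edges are handled by repeating the first/last frame (an assumption)."""
    feats = np.asarray(feats, dtype=float)
    left = win // 2
    right = win - 1 - left
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    kernel = np.ones(win) / win
    cols = [np.convolve(padded[:, k], kernel, mode="valid")
            for k in range(feats.shape[1])]
    return np.column_stack(cols)

def fit_gaussian(x, reg=1e-6):
    """Fit a single full-covariance Gaussian to (T, K) training frames."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False)) + reg * np.eye(x.shape[1])
    return mean, cov

def gaussian_loglik(x, mean, cov):
    """Per-frame log-likelihood of (T, K) frames under one Gaussian."""
    x = np.asarray(x, dtype=float)
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ti,ij,tj->t", d, np.linalg.inv(cov), d)
    return -0.5 * (maha + logdet + x.shape[1] * np.log(2 * np.pi))
```

One Gaussian would be fitted per class (singing, non-singing) on smoothed training features, and the per-frame log-likelihoods under each then serve as classification or HMM emission scores.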