Berenzweig and Ellis - WASPAA 01 1
Locating Singing Voice Segments Within Music Signals
Adam Berenzweig and Daniel P.W. Ellis
LabROSA, Columbia University
LabROSA
• What
• Where
• Who
• Why you love us
The Future as We Hear It
• Online Digital Music Libraries
• The Coming Age of Streaming Music Services
• Information Retrieval: How do we find what we want?
• Recommendation: How do we know what we want to find?
– Collaborative Filtering vs. Content-Based
– What is Quality?
Motivation
• Lyrics Recognition: Baby Steps
– Segmentation
– Forced Alignment
– A Corpus
• Song structure through singing structure?
– Fingerprinting
– Retrieval
– Feature for similarity measures
Lyrics Recognition: Can YOU do it?
• Notoriously hard, even for humans.
– amIright.com, kissThisGuy.com
• Why so hard?
– Noise, music, whatever.
– Singing is not speech: voice transformations
– Strange word sequences (“poetry”)
• Need a corpus
History of the Problem
• Segmentation for Speech Recognition: Music/Speech
– Scheirer & Slaney
• Forced Alignment - Karaoke
– Cano et al. [REF NEEDED]
• Acoustic feature design: Custom job or Kitchen Sink?
• Idea! Use a speech recognizer: PPF (Posterior Probability Features)
– Williams & Ellis
• Ultimately: Source separation, CASA
A Peek at the End
Architecture Overview
[Block diagram. Blocks: Audio; PLP; Speech Recognizer (Neural Net); posteriogram; cepstra; Feature Calculation (Entropy H, H/h#, Dynamism D, P(h#)); Time-averaging; Gaussian Model (×2); Segmentation (HMM)]
Architecture Overview
[Block diagram. Blocks: Audio; PLP; Speech Recognizer (Neural Net); posteriogram; cepstra; Neural Net (×2); Segmentation (HMM)]
“So how’s that working out for you, being clever?”
• Entropy
• Entropy excluding background
• Dynamism
• Background probability
• Distribution Match: Likelihoods under single Gaussian model
– Cepstra
– PPF
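A minimal Python sketch of the four posteriogram statistics listed above. It assumes the posteriogram is a frames × classes array of phone posteriors and that the background class h# sits in a known column (`bg_index` is a hypothetical parameter). The slide does not give exact definitions, so the formulas here (per-frame entropy, entropy over the renormalised non-background classes, mean squared frame-to-frame change) are common choices and should be read as assumptions, not as the authors' implementation.

```python
import numpy as np

def posteriogram_features(post, bg_index=0, eps=1e-12):
    """Per-frame statistics of a posteriogram (frames x classes).

    bg_index is the assumed column of the background class h#.
    """
    post = np.asarray(post, dtype=float)

    # Entropy H: high when the recognizer is unsure which phone it hears.
    entropy = -np.sum(post * np.log(post + eps), axis=1)

    # Entropy excluding background (H/h#): drop h#, renormalise, recompute.
    rest = np.delete(post, bg_index, axis=1)
    rest = rest / (rest.sum(axis=1, keepdims=True) + eps)
    entropy_no_bg = -np.sum(rest * np.log(rest + eps), axis=1)

    # Dynamism D: mean squared change in posteriors between successive frames.
    # The first frame has no predecessor, so it is assigned 0.
    diff = np.diff(post, axis=0)
    dynamism = np.concatenate([[0.0], np.mean(diff ** 2, axis=1)])

    # Background probability P(h#): posterior of the background class.
    background = post[:, bg_index]

    return {
        "entropy": entropy,
        "entropy_no_bg": entropy_no_bg,
        "dynamism": dynamism,
        "background": background,
    }
```

Each returned array has one value per frame, so the statistics can be stacked into a small feature vector before any time-averaging or modelling.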
Recovering context with the HMM
• Transition probabilities
– Inverse average segment duration
• Emission probabilities
– Gaussian fit to time-averaged distribution
• Segmentation: the Viterbi path
• Evaluation
– Frame error rate (no boundary consideration)
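The steps on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: a state's exit probability is taken as one over its average segment duration in frames, the per-frame emission log-likelihoods are supplied by the caller (for example from the Gaussian fits), and the function names are hypothetical.

```python
import numpy as np

def transition_matrix(avg_durations):
    """Build transitions from average segment durations (in frames, > 1).

    Assumption: the probability of leaving a state is 1 / average duration,
    split evenly among the other states.
    """
    avg = np.asarray(avg_durations, dtype=float)
    n = len(avg)
    exit_p = 1.0 / avg
    A = np.zeros((n, n))
    for i in range(n):
        A[i, :] = exit_p[i] / (n - 1)
        A[i, i] = 1.0 - exit_p[i]
    return A

def viterbi(log_emis, A, prior=None):
    """Most likely state path given frame log-likelihoods (frames x states)."""
    T, n = log_emis.shape
    log_A = np.log(A)
    prior = np.full(n, 1.0 / n) if prior is None else np.asarray(prior, float)
    delta = np.log(prior) + log_emis[0]
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # scores[i, j]: from i to j
        back[t] = np.argmax(scores, axis=0)      # best predecessor of each j
        delta = scores[back[t], np.arange(n)] + log_emis[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):                # trace the best path backwards
        path[t - 1] = back[t, path[t]]
    return path

def frame_error_rate(pred, truth):
    """Fraction of frames labelled wrongly; boundaries get no special weight."""
    return float(np.mean(np.asarray(pred) != np.asarray(truth)))
```

With long average durations the self-transition probability is close to 1, so an isolated frame whose emission score favours the other class is usually absorbed into the surrounding segment instead of producing a spurious boundary.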
Results
• [Table, figures]
• Listen!
– Good, bad
– trigger & stick
– genre effects?
Results
• E = .075
• P(h#) in effect
• E = .68
• P(h#) gone bad
• E = .61
• Strong phones trigger, but can’t hold it
• Production quality effect?
[Figure annotations: ‘ey’, ‘uw’, ‘m’, ‘n’]
• E = .25
• “Trigger and Stick”
[Figure annotation: ‘s’]
• E = .54
• False phones
[Figure annotations: ‘bcl’, ‘dcl’, ‘b’, ‘d’, ‘l’, ‘r’]
• E = .20
• Genre effect?
Discussion
• The Moral of the Story: Just give it the data
• PPF is better than cepstra. Speech Recognizer is pretty powerful.
• Why does the extra Gaussian model help PPF but not cepstra?
• Time averaging helps PPF: proves that it’s using the overall distribution, not short-time detail (at least, when modelled by single Gaussians)
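To make the last point concrete, here is a hedged Python sketch of scoring frames under a single Gaussian and then time-averaging the scores. It assumes one full-covariance Gaussian per class and a moving average over per-frame log-likelihoods. The slides do not say exactly what was averaged or over what window, so the function names, the window width, and the small covariance regulariser are all illustrative.

```python
import numpy as np

def fit_gaussian(X):
    """Fit one full-covariance Gaussian to feature frames (frames x dims)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Small diagonal term keeps the covariance invertible (an assumption).
    cov = np.atleast_2d(np.cov(X, rowvar=False)) + 1e-6 * np.eye(X.shape[1])
    return mean, cov

def log_likelihood(X, mean, cov):
    """Per-frame log-likelihood of X under the Gaussian N(mean, cov)."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def time_average(x, width):
    """Moving average with an odd window; edges are padded by repetition."""
    if width % 2 == 0:
        raise ValueError("width must be odd")
    padded = np.pad(np.asarray(x, dtype=float), width // 2, mode="edge")
    return np.convolve(padded, np.ones(width) / width, mode="valid")

def averaged_log_ratio(X, vocal_model, music_model, width):
    """Time-averaged log-likelihood ratio: positive values favour vocals."""
    ratio = log_likelihood(X, *vocal_model) - log_likelihood(X, *music_model)
    return time_average(ratio, width)
```

Averaging discards frame-to-frame detail and keeps only how well a stretch of frames matches each class's overall distribution, which is the behaviour the bullet above attributes to PPF under single-Gaussian models.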