Multi-talker Speech Separation and Tracing at AI NEXT Conference
TRANSCRIPT
Dong Yu, Distinguished Scientist and Vice General Manager
Tencent AI Lab (work was done while at Microsoft Research)
Joint work with Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen
Multi-talker Speech Separation and Tracing with
Permutation Invariant Training
Outline
• Motivation
• Problem Setup and Prior Arts
• Multi-talker Speech Separation
• Experiments
• Conclusion
3/27/17 Dong Yu: Multi-talker Speech Separation and Tracing with Permutation Invariant Training
Frontier Shift
• Driven by demand from users to interact with devices without wearing or carrying a close-talk microphone.
• Many difficulties hidden by close-talk microphones now surface:
  • The energy of the speech signal is very low when it reaches the microphones.
  • The interfering signals, such as background noise, reverberation, and speech from other talkers, become so distinct that they can no longer be ignored.
ASR in Real World Scenarios
[Figure: close-talk microphone vs. far-field microphone. Labels: source; reverberation from surface reflections; additive noise from other sound sources; channel distortion.]
Cocktail Party Problem
• Term coined by Cherry
  • “One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem’ …” (Cherry ’57)
• Human performance is superior to machine performance
  • “For ‘cocktail party’-like situations … when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp ’92)
• Speech separation problem
  • Separate and trace audio streams
  • Sometimes called speech enhancement when dealing with non-speech interference
Is Speech Separation Work Needed?
• Is an end-to-end ASR system sufficient?
  • Current ASR techniques require a huge amount of training data covering various conditions to train well
  • Speech separation can be used as an advanced front-end
  • The speech separation criterion can be used as regularization to aid and speed up training of ASR systems
  • More applications than ASR
    • Hearing aids
    • Cochlear implants
    • Noise reduction for mobile communication
    • Audio information retrieval
• Is using a microphone array sufficient?
  • A mic array alone is not sufficient, e.g., when the sources are in the same direction
  • Many recordings are still collected with a single microphone
Problem Definition
• Source speech streams
• Mixed speech
• STFT domain
• Estimate mask
• Reconstruct with mask
• Ill-posed problem (#constraints < #free params):
  • There are an infinite number of possible $X_s(t, f)$ combinations that lead to the same $Y(t, f)$
• Solution:
  • Learn from the training set to look for hidden regularities (complicated soft constraints)
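The setup listed on this slide can be written out as follows. This is a sketch of the usual single-channel formulation; the time-domain symbols $x_s$ and $y$, the estimated mask $\hat{M}_s$, and the source count $S$ are assumed notation rather than symbols taken from the slide.

```latex
\begin{align}
y(t) &= \sum_{s=1}^{S} x_s(t) && \text{(mixed speech)} \\
Y(t,f) &= \sum_{s=1}^{S} X_s(t,f) && \text{(STFT domain)} \\
|\hat{X}_s(t,f)| &= \hat{M}_s(t,f)\,|Y(t,f)| && \text{(reconstruct with mask)}
\end{align}
```

Under this formulation each time-frequency bin supplies one observation $Y(t,f)$ for $S$ unknowns $X_s(t,f)$, which is why the problem is ill-posed without learned constraints.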
Prior Arts Before Deep Learning Era
• Computational auditory scene analysis (CASA)
  • Use perceptual grouping cues to estimate time-frequency masks
• Non-negative matrix factorization (NMF)
  • Learn a set of non-negative bases during training
  • Estimate mixing factors during evaluation
• Model-based approaches such as factorial GMM-HMM
  • Model the interaction between the target and competing speech signals and their temporal dynamics
• Spatial filtering with a microphone array
  • Beamforming: extract target sound from a specific spatial direction
  • Independent component analysis: find a demixing matrix from multiple mixtures of sound sources
Training Criteria for Deep Learning
• Ideal amplitude mask (IAM): $M_s(t, f) = \frac{|X_s(t, f)|}{|Y(t, f)|}$
• Minimize mask estimation error (two problems)
  • In silence segments $|X_s(t, f)| = 0$ and $|Y(t, f)| = 0$ → $M_s(t, f)$ is not well defined
  • A smaller error on masks may not lead to a smaller error on magnitude (which is what we care about)
• Minimize magnitude estimation error (used in this study)
  • Magnitude is still estimated through masks: this often leads to better performance, esp. when the training set is small
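The two criteria on this slide can be compared on a toy example. The sketch below is illustrative only: the function names, the `eps` guard for silent bins, and the two-bin data are assumptions, not part of the talk. It shows a case where the estimate with the smaller mask error has the larger magnitude error.

```python
import numpy as np

def ideal_amplitude_mask(src_mag, mix_mag, eps=1e-8):
    # eps is an assumption added here so silent bins (0 / 0) give 0, not NaN.
    return src_mag / (mix_mag + eps)

def mask_error(est_mask, src_mag, mix_mag):
    # Mean squared error measured directly on the mask.
    return np.mean((est_mask - ideal_amplitude_mask(src_mag, mix_mag)) ** 2)

def magnitude_error(est_mask, src_mag, mix_mag):
    # Mean squared error on the magnitude reconstructed through the mask.
    return np.mean((est_mask * mix_mag - src_mag) ** 2)

# Toy data: one loud bin and one quiet bin; the true mask is 0.5 in both.
mix_mag = np.array([10.0, 0.1])
src_mag = np.array([5.0, 0.05])
mask_a = np.array([0.6, 0.5])  # small mask error, located on the loud bin
mask_b = np.array([0.5, 0.9])  # large mask error, located on the quiet bin

for name, mask in [("A", mask_a), ("B", mask_b)]:
    print(name,
          mask_error(mask, src_mag, mix_mag),
          magnitude_error(mask, src_mag, mix_mag))
```

Estimate A wins on mask error while estimate B wins on magnitude error, because a mask error on a high-energy bin is amplified by $|Y(t, f)|$.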
Prior Arts with DL: Speech + Others (many works, OSU, MERL, CUST, etc.)
• Basic architecture: mix of different types of signals
[Figure labels: Noise/Music/Other Speakers; Est. Noise/Music/Other Speakers]
Prior Arts with DL: Focus on Speech (many works, OSU, MERL, CUST, etc.)
• Basic architecture: mix of different types of signals
[Figure labels: Noise/Music/Other Speakers; Est. Noise/Music/Other Speakers]
• Speech + noise
• Speech + music
• Specific speaker + other speakers
Multi-Talker Speech Separation
• Label Ambiguity / Label Permutation Problem
Speaker 1 → output 1? Speaker 1 → output 2?
Solution 1: Deep Clustering (Hershey, Chen, Roux, Watanabe, 2016)
• Learn a unit-size embedding for each time-frequency bin
  • If two bins belong to the same speaker, they are close in the embedding space, and farther away otherwise
• Trained on a large window of frames
• Separation is done by clustering embedding space representations (i.e., segmenting the bins)
• Shortcomings
  • Pipeline is complicated
  • Each bin is assumed to belong to one and only one speaker → limits its ability to combine with other techniques
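The clustering step described on this slide can be sketched as follows, assuming the per-bin embeddings have already been produced by a trained network. The plain k-means with farthest-point initialization, the function name `kmeans_masks`, and the synthetic embeddings are illustrative assumptions; the talk does not specify the clustering procedure.

```python
import numpy as np

def kmeans_masks(embeddings, num_speakers, num_iters=20):
    """Cluster per-bin embeddings; return one binary mask per cluster.

    embeddings: (num_bins, dim) unit-norm vectors, one per time-frequency bin.
    Returns: (num_speakers, num_bins) array of 0/1 masks.
    """
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centers = [embeddings[0]]
    for _ in range(num_speakers - 1):
        dist = np.min([((embeddings - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(embeddings[dist.argmax()])
    centers = np.stack(centers)

    for _ in range(num_iters):
        # Assign every bin to its nearest center.
        dists = ((embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its bins (skip empty clusters).
        for k in range(num_speakers):
            if np.any(labels == k):
                centers[k] = embeddings[labels == k].mean(axis=0)

    # Each bin belongs to exactly one speaker, as the slide notes.
    return (labels[None, :] == np.arange(num_speakers)[:, None]).astype(float)

# Synthetic embeddings: bins of each speaker sit near one of two anchor vectors.
rng = np.random.default_rng(0)
true_labels = rng.integers(0, 2, size=200)
anchors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
emb = anchors[true_labels] + 0.05 * rng.standard_normal((200, 3))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

masks = kmeans_masks(emb, num_speakers=2)
```

The hard 0/1 masks make the second shortcoming concrete: a bin where both talkers have energy still has to be given entirely to one of them.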
Solution 2: Use Manually Defined Rules (Weng, Yu, Seltzer, Droppo, 14, 15)
• Use instantaneous energy instead of speaker ID to assign labels: manually designed, limited cues
[Figure labels: low-energy speech; high-energy speech]
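One way to read the energy-based rule is sketched below: in every frame, the training targets are reordered so that the first target is whichever talker currently has more energy. The per-frame granularity, the function name `energy_ordered_targets`, and the toy spectra are assumptions; the transcript does not give the exact rule used in the cited work.

```python
import numpy as np

def energy_ordered_targets(src_mag):
    """Reorder sources per frame so index 0 is the highest-energy talker.

    src_mag: (num_speakers, num_frames, num_bins) magnitude spectra.
    Returns an array of the same shape: [0] is the high-energy target,
    [-1] the low-energy target.
    """
    frame_energy = (src_mag ** 2).sum(axis=-1)          # (speakers, frames)
    order = np.argsort(-frame_energy, axis=0)           # descending energy
    return np.take_along_axis(src_mag, order[:, :, None], axis=0)

# Toy spectra: speaker 0 is louder in frames 0 and 2, speaker 1 in frame 1.
src_mag = np.array([
    [[3.0, 3.0], [1.0, 1.0], [2.0, 2.0]],
    [[1.0, 1.0], [4.0, 4.0], [0.5, 0.5]],
])
targets = energy_ordered_targets(src_mag)
```

With this labeling the high-energy output switches talkers whenever their relative energy flips, so the outputs are not tied to a speaker identity.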
Our Solution: Permutation Invariant Training (Yu, Kolbæk, Tan, Jensen, 16, 17)
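Simple to implement
Can be easily extended to 3 speakers
[Figure: PIT training diagram. Errors for the two possible output-to-speaker assignments:]
$\|X_1 - \hat{X}_1\|^2 + \|X_2 - \hat{X}_2\|^2$
$\|X_2 - \hat{X}_1\|^2 + \|X_1 - \hat{X}_2\|^2$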
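The training criterion suggested by the two assignment errors can be sketched as: compute the error for every output-to-speaker assignment and train on the smallest one. The code below is a minimal NumPy illustration under that reading, not the CNTK implementation from the talk; `pit_mse` and the toy data are hypothetical.

```python
import itertools
import numpy as np

def pit_mse(est_mag, src_mag):
    """Permutation invariant mean squared error over one segment.

    est_mag, src_mag: (num_speakers, num_frames, num_bins).
    Returns (min_loss, best_perm); best_perm[i] is the source index
    assigned to network output i.
    """
    num_speakers = est_mag.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(num_speakers)):
        # Error if output i is compared against source perm[i].
        loss = np.mean((est_mag - src_mag[list(perm)]) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Toy check: the "network" outputs the sources in a shuffled order plus noise.
rng = np.random.default_rng(0)
src = rng.random((3, 10, 257))
est = src[[2, 0, 1]] + 0.01 * rng.standard_normal(src.shape)
loss, perm = pit_mse(est, src)
print(perm, loss)
```

Enumerating permutations costs S! evaluations, which is cheap for the 2- and 3-speaker cases covered here.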
Testing
• Default assignment: concatenate each output's frames to form streams
• Optimal assignment: the output of each frame is correctly assigned to speakers; concatenate the frames belonging to each speaker to form streams
• The gap between them indicates the gain from additional speaker tracing
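The two test-time assignments can be contrasted with a short sketch. The oracle below uses the reference spectra to pick the best permutation in every frame, which is only possible in evaluation; the function name `optimal_assignment` and the toy data are assumptions.

```python
import itertools
import numpy as np

def optimal_assignment(est_mag, src_mag):
    """Oracle per-frame reassignment of outputs to speakers.

    est_mag, src_mag: (num_speakers, num_frames, num_bins).
    Returns streams of the same shape; in each frame the outputs are
    permuted to best match the references.
    """
    num_speakers, num_frames, _ = est_mag.shape
    perms = [list(p) for p in itertools.permutations(range(num_speakers))]
    streams = np.empty_like(est_mag)
    for t in range(num_frames):
        errors = [np.sum((est_mag[p, t] - src_mag[:, t]) ** 2) for p in perms]
        streams[:, t] = est_mag[perms[int(np.argmin(errors))], t]
    return streams

# Toy case: perfect separation, but the outputs swap speakers from frame 3 on.
rng = np.random.default_rng(0)
src = rng.random((2, 6, 4))
est = src.copy()
est[:, 3:] = src[::-1, 3:]

default_error = np.mean((est - src) ** 2)            # default assignment
optimal_error = np.mean((optimal_assignment(est, src) - src) ** 2)
print(default_error, optimal_error)
```

Here the default assignment is penalized only for the swap, so the difference between the two errors is exactly what better speaker tracing could recover.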
Experiment Setup: Datasets
• WSJ0-2mix and 3-mix
  • Derived from the WSJ0 corpus
  • 2- and 3-speaker mixtures (artificially generated)
  • 30 h training set, 10 h validation set, 5 h test set
  • Mixed at SIRs between 0 dB and 5 dB
• Danish-2mix and 3-mix
  • Derived from a Danish corpus
  • 2- or 3-speaker mixtures (artificially generated)
  • 10k, 1k, 1k+1k utterances in training, validation, and test sets
  • Mixed at 0 dB
• WSJ0-2mix-other
  • Same as WSJ0-2mix but mixed at 0 dB
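Mixing two utterances at a chosen signal-to-interference ratio amounts to rescaling one of them before adding. The sketch below is a generic illustration, not the script used to build these datasets; `mix_at_sir` and the random stand-in signals are hypothetical.

```python
import numpy as np

def mix_at_sir(target, interferer, sir_db):
    """Scale `interferer` so the target-to-interferer power ratio equals
    sir_db (in dB), then add it to `target`. Returns (mixture, gain)."""
    target_power = np.mean(target ** 2)
    interferer_power = np.mean(interferer ** 2)
    gain = np.sqrt(target_power / (interferer_power * 10 ** (sir_db / 10)))
    return target + gain * interferer, gain

# Stand-in signals; real use would load two equal-length utterances.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interferer = 3.0 * rng.standard_normal(16000)

mixture, gain = mix_at_sir(target, interferer, sir_db=5.0)
realized_sir = 10 * np.log10(np.mean(target ** 2) / np.mean((gain * interferer) ** 2))
print(realized_sir)
```

Setting `sir_db=0.0` gives the equal-power condition used for the Danish and WSJ0-2mix-other sets.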
Models
• Implemented using the Microsoft Cognitive Toolkit (CNTK)
• Input: 257-dim STFT; Output: 257 x S streams
• Segment-based (PIT-S): each segment is independent, no tracing
  • DNN: 3 hidden layers each with 1024 ReLU units
• PIT with tracing (PIT-T): force all frames from the same output layer to belong to the same speaker
  • LSTM: 3 LSTM layers each with 1792 units
  • BLSTM: 3 BLSTM layers each with 896 units
• Test conditions
  • Closed condition (CC): seen speakers
  • Open condition (OC): unseen speakers
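A 257-dimensional STFT frame is what a 512-point FFT of a real signal yields (512 / 2 + 1 bins). The sketch below shows that relationship; the 512-sample frame, the 256-sample hop, and the Hann window are assumptions, since the slide only states the feature dimension.

```python
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=256):
    """Magnitude STFT with a Hann window.

    Returns an array of shape (num_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(num_frames)])
    return np.abs(np.fft.rfft(frames * window, axis=-1))

# Stand-in signal; a model for S talkers would emit S arrays of this shape.
mag = stft_magnitude(np.random.default_rng(0).standard_normal(8000))
print(mag.shape)
```

The "257 x S streams" output then corresponds to one 257-bin mask or magnitude estimate per talker for each input frame.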
PIT-S Training Behavior: WSJ0-2mix
PIT-S: SDR Gain (dB) on WSJ0-2mix
PIT-T Training Behavior: WSJ0-2mix
PIT-T: SDR Gain (dB) on WSJ0-2mix
SDR (dB) and PESQ Gain Comparison
Cross Language Behavior on 2-talker Mix
PIT-T on WSJ0-3mix
PIT-T Trained with Both 2- and 3-mix
Examples: 2-talker Mix
• Male + Female: Mix, S1, S2
• Female + Male: Mix, S1, S2
• Female + Female: Mix, S1, S2
• Male + Male: Mix, S1, S2
Examples: 3-talker Mix
• Male + 2 Female: Mix, S1, S2, S3
• Female + 2 Male: Mix, S1, S2, S3
Example: Trained on 3-Mix, Test on 2-Mix
• Diff Gender: Mix, S1, S2, S3
• Same Gender: Mix, S1, S2, S3
Example: Trained on 2- and 3-Mix, Test on 2-Mix
• Diff Gender: Mix, S1, S2, S3
• Same Gender: Mix, S1, S2, S3
Conclusion
• PIT can solve the label permutation problem
• PIT is effective in speech separation without knowing the number of speakers
• PIT-trained models generalize well to unseen speakers and languages
• PIT is simple to implement
• PIT has great potential since it can be easily integrated and combined with other techniques
Classification view (supervised approach)
Segmentation view (deep clustering)
Separation view (PIT)
PIT is an important ingredient in the final solution to the cocktail party problem