![Page 1: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/1.jpg)
Dong Yu, Distinguished Scientist and Vice General Manager
Tencent AI Lab (work was done while at Microsoft Research)
Joint work with Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen
Multi-talker Speech Separation and Tracing with Permutation Invariant Training
![Page 2: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/2.jpg)
Outline
• Motivation
• Problem Setup and Prior Arts
• Multi-talker Speech Separation
• Experiments
• Conclusion
3/27/17 Dong Yu: Multi-talker Speech Separation and Tracing with Permutation Invariant Training
![Page 3: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/3.jpg)
![Page 4: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/4.jpg)
Frontier Shift
• Driven by demand from users to interact with devices without wearing or carrying a close-talk microphone.
• Many difficulties hidden by close-talk microphones now surface:
  • The energy of the speech signal is very low when it reaches the microphones.
  • The interfering signals, such as background noise, reverberation, and speech from other talkers, become so distinct that they can no longer be ignored.
[Figure: close-talk microphone vs. far-field microphone]
![Page 5: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/5.jpg)
ASR in Real World Scenarios
[Figure labels: source; channel distortion; additive noise from other sound sources; reverberation from surface reflections]
![Page 6: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/6.jpg)
Cocktail Party Problem
• Term coined by Cherry
  • “One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem’…” (Cherry ’57)
• Human performance is superior to machine performance
  • “For ‘cocktail party’-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp ’92)
• Speech separation problem
  • Separate and trace audio streams
  • Sometimes called speech enhancement when dealing with non-speech interference
![Page 7: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/7.jpg)
Is Speech Separation Work Needed?
• Is an end-to-end ASR system sufficient?
  • Current ASR techniques require a huge amount of training data that covers various conditions to train well
  • Speech separation can be used as an advanced front-end
  • The speech separation criterion can be used as regularization to aid and speed up training of ASR systems
  • More applications than ASR
    • Hearing aids
    • Cochlear implants
    • Noise reduction for mobile communication
    • Audio information retrieval
• Is using a microphone array sufficient?
  • A mic-array alone is not sufficient, e.g., when the sources are in the same direction
  • Many recordings are still collected with a single microphone
![Page 8: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/8.jpg)
![Page 9: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/9.jpg)
Problem Definition
• Source speech streams
• Mixed speech
• STFT domain
• Estimate mask
• Reconstruct with mask
• Ill-posed problem (#constraints < #free params):
  • There are an infinite number of possible $X_s(t,f)$ combinations that lead to the same $Y(t,f)$
• Solution:
  • Learn from the training set to look for hidden regularities (complicated soft constraints)
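The equations that accompanied the five steps on this slide did not survive extraction. A plausible reconstruction, assuming the usual single-channel setup with $S$ sources and reusing the symbols $X_s$, $Y$, and $M_s$ that appear elsewhere in the deck (the lowercase time-domain symbols and the hat notation are assumptions), is:

```latex
\begin{align}
&\text{Source speech streams:} && x_s(t), \quad s = 1, \dots, S \\
&\text{Mixed speech:} && y(t) = \sum_{s=1}^{S} x_s(t) \\
&\text{STFT domain:} && Y(t,f) = \sum_{s=1}^{S} X_s(t,f) \\
&\text{Estimate mask:} && \hat{M}_s(t,f) \\
&\text{Reconstruct with mask:} && |\hat{X}_s(t,f)| = \hat{M}_s(t,f)\,|Y(t,f)|
\end{align}
```

Under this reading, each time-frequency bin contributes one observed value $Y(t,f)$ but $S$ unknowns $X_s(t,f)$, which is why the problem is ill-posed without learned regularities.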
![Page 10: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/10.jpg)
Prior Arts Before Deep Learning Era
• Computational auditory scene analysis (CASA)
  • Use perceptual grouping cues to estimate time-frequency masks
• Non-negative matrix factorization (NMF)
  • Learn a set of non-negative bases during training
  • Estimate mixing factors during evaluation
• Model-based approach such as factorial GMM-HMM
  • Models the interaction between the target and competing speech signals and their temporal dynamics
• Spatial filtering with a microphone array
  • Beamforming: extract target sound from a specific spatial direction
  • Independent component analysis: find a demixing matrix from multiple mixtures of sound sources
![Page 11: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/11.jpg)
Training Criteria for Deep Learning
• Ideal amplitude mask (IAM): $M_s(t,f) = \frac{|X_s(t,f)|}{|Y(t,f)|}$
• Minimize mask estimation error (two problems)
  • In silence segments $X_s(t,f) = 0$ and $Y(t,f) = 0$ → $M_s(t,f)$ is not well defined
  • Smaller error on masks may not lead to a smaller error on magnitude (which is what we care about)
• Minimize magnitude estimation error (used in this study)
  • Magnitude is still estimated through masks: this often leads to better performance, esp. when the training set is small
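The difference between the two criteria can be illustrated with a small pure-Python sketch. It is not the implementation used in the study: the function names, the `eps` guard for silent bins, and the toy numbers are all assumptions chosen for illustration. Magnitudes are nested lists indexed as `[t][f]`.

```python
def ideal_amplitude_mask(source_mag, mix_mag, eps=1e-8):
    """IAM per bin: |X_s(t,f)| / |Y(t,f)|. eps (an assumption) avoids 0/0 in silence."""
    return [[x / (y + eps) for x, y in zip(x_row, y_row)]
            for x_row, y_row in zip(source_mag, mix_mag)]


def mask_loss(est_mask, ideal_mask):
    """Mean squared error measured directly on the masks."""
    errs = [(m - i) ** 2
            for m_row, i_row in zip(est_mask, ideal_mask)
            for m, i in zip(m_row, i_row)]
    return sum(errs) / len(errs)


def magnitude_loss(est_mask, mix_mag, source_mag):
    """Mean squared error on the reconstructed magnitude, est_mask * |Y| vs |X_s|."""
    errs = [(m * y - x) ** 2
            for m_row, y_row, x_row in zip(est_mask, mix_mag, source_mag)
            for m, y, x in zip(m_row, y_row, x_row)]
    return sum(errs) / len(errs)


# Toy example: one frame, one loud bin and one quiet bin.
mix = [[10.0, 0.1]]
src = [[5.0, 0.05]]
iam = ideal_amplitude_mask(src, mix)

mask_a = [[0.6, 0.5]]  # small mask error, but located on the loud bin
mask_b = [[0.5, 0.9]]  # larger mask error, but located on the quiet bin

print(mask_loss(mask_a, iam), magnitude_loss(mask_a, mix, src))
print(mask_loss(mask_b, iam), magnitude_loss(mask_b, mix, src))
```

In this toy case `mask_a` has the smaller mask error but the larger magnitude error, which is the second problem listed under the mask criterion.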
![Page 12: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/12.jpg)
Prior Arts with DL: Speech + Others (many works, OSU, MERL, CUST, etc.)
• Basic architecture: mix of different types of signals
[Figure labels: Noise/Music/Other Speakers; Est. Noise/Music/Other Speakers]
![Page 13: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/13.jpg)
Prior Arts with DL: Focus on Speech (many works, OSU, MERL, CUST, etc.)
• Basic architecture: mix of different types of signals
  • Speech + noise
  • Speech + music
  • Specific speaker + other speakers
[Figure labels: Noise/Music/Other Speakers; Est. Noise/Music/Other Speakers]
![Page 14: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/14.jpg)
![Page 15: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/15.jpg)
Multi-Talker Speech Separation
• Label Ambiguity / Label Permutation Problem
  • Speaker 1 → output 1?
  • Speaker 1 → output 2?
![Page 16: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/16.jpg)
Solution 1: Deep Clustering (Hershey, Chen, Le Roux, Watanabe, 2016)
• Learn a unit-size embedding for each time-frequency bin
  • If two bins belong to the same speaker they are close in the embedding space, and farther away otherwise.
• Trained on a large window of frames
• Separation is done by clustering embedding-space representations (i.e., segmenting the bins)
• Shortcomings
  • Pipeline is complicated
  • Each bin is assumed to belong to one and only one speaker → limits its ability to combine with other techniques
![Page 17: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/17.jpg)
Solution 2: Use Manually Defined Rules (Weng, Yu, Seltzer, Droppo, 14, 15)
• Use instantaneous energy instead of speaker ID to assign labels: manually designed, limited cues
[Figure labels: Low-energy speech; High-energy speech]
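One way to read the energy rule is sketched below. This is an illustration of the idea only, not the procedure from the cited work: the function names and the per-frame comparison are assumptions. Each frame is a list of samples or magnitudes, and in every frame the louder source is labeled as the high-energy target.

```python
def frame_energy(frame):
    """Instantaneous energy of one frame: sum of squared values."""
    return sum(v * v for v in frame)


def assign_by_energy(frames_a, frames_b):
    """Build (high_energy, low_energy) training targets frame by frame.

    The label depends only on which source is louder in that frame,
    not on who is speaking (assumed reading of the rule on this slide).
    """
    high, low = [], []
    for fa, fb in zip(frames_a, frames_b):
        if frame_energy(fa) >= frame_energy(fb):
            high.append(fa)
            low.append(fb)
        else:
            high.append(fb)
            low.append(fa)
    return high, low


# Speaker A is louder in frame 0, speaker B in frame 1.
speaker_a = [[1.0, 1.0], [0.1, 0.1]]
speaker_b = [[0.5, 0.5], [2.0, 2.0]]
high, low = assign_by_energy(speaker_a, speaker_b)
print(high)
print(low)
```

The example also shows the limitation of such a cue: the "high-energy" stream switches speakers between frames, so the rule fixes the label ambiguity without following a talker.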
![Page 18: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/18.jpg)
Our Solution: Permutation Invariant Training (Yu, Kolbæk, Tan, Jensen, 16, 17)
• Simple to implement
• Can be easily extended to 3 speakers
[Figure: PIT architecture; the error is computed for both output-to-speaker assignments:]
$|X_1 - \hat{X}_1|^2 + |X_2 - \hat{X}_2|^2$
$|X_2 - \hat{X}_1|^2 + |X_1 - \hat{X}_2|^2$
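The two error terms on this slide correspond to the two possible speaker-to-output assignments; permutation invariant training scores every assignment and trains on the one with the smallest error. A minimal pure-Python sketch of that selection follows. It is not the CNTK implementation: the names `mse` and `pit_loss`, the averaging over outputs, and the nested-list `[t][f]` layout are assumptions.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equally shaped [t][f] magnitude arrays."""
    errs = [(x - y) ** 2
            for row_a, row_b in zip(a, b)
            for x, y in zip(row_a, row_b)]
    return sum(errs) / len(errs)


def pit_loss(targets, estimates):
    """Return (loss, perm) for the best assignment.

    perm[i] is the index of the target paired with estimate i.
    All S! permutations are scored, so 3 speakers means 6 candidates.
    """
    n = len(targets)
    best_loss, best_perm = None, None
    for perm in permutations(range(n)):
        loss = sum(mse(targets[p], estimates[i]) for i, p in enumerate(perm)) / n
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Two sources, and network outputs that come out in swapped order.
x1 = [[1.0, 0.0], [1.0, 0.0]]
x2 = [[0.0, 2.0], [0.0, 2.0]]
est = [
    [[0.1, 1.9], [0.0, 2.0]],  # output 0 resembles x2
    [[0.9, 0.0], [1.0, 0.1]],  # output 1 resembles x1
]
loss, perm = pit_loss([x1, x2], est)
print(loss, perm)
```

Because the minimum is taken inside the loss, the network is never penalized for emitting the speakers in a different order than the labels, which is what removes the label permutation problem.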
![Page 19: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/19.jpg)
Testing
• Default assignment: concatenate each output's frames to form streams
• Optimal assignment: the output of each frame is correctly assigned to speakers; concatenate the frames belonging to each speaker to form streams
• The gap between them indicates the gain from additional speaker tracing
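The two assignment schemes can be sketched as follows. This is an illustrative reading, not the evaluation code behind the reported results: the function names are assumptions, and the optimal assignment is treated as an oracle that sees the reference frames. `outputs[i][t]` is frame `t` from output layer `i`, and `references[s][t]` is the true frame of speaker `s`.

```python
from itertools import permutations


def frame_error(a, b):
    """Squared error between two frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def default_assignment(outputs):
    """Stream i is simply every frame of output layer i, in order."""
    return [list(frames) for frames in outputs]


def optimal_assignment(outputs, references):
    """Oracle: in each frame, give every speaker the output closest to its reference."""
    n = len(outputs)
    streams = [[] for _ in range(n)]
    for t in range(len(outputs[0])):
        best = min(
            permutations(range(n)),
            key=lambda perm: sum(
                frame_error(outputs[perm[s]][t], references[s][t]) for s in range(n)
            ),
        )
        for s in range(n):
            streams[s].append(outputs[best[s]][t])
    return streams


# The outputs swap speakers in the second frame.
refs = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
outs = [[[1.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [1.0, 1.0]]]
print(default_assignment(outs))
print(optimal_assignment(outs, refs))
```

In the toy example the default streams mix the two speakers, while the oracle streams match the references; the difference in quality between the two is the gap that speaker tracing could recover.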
![Page 20: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/20.jpg)
![Page 21: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/21.jpg)
Experiment Setup: Datasets
• WSJ0-2mix and 3-mix
  • Derived from WSJ0 corpus
  • 2- and 3-speaker mixtures (artificially generated)
  • 30 h training set, 10 h validation set, 5 h test set
  • Mixed at SIRs between 0 dB and 5 dB
• Danish-2mix and 3-mix
  • Derived from a Danish corpus
  • 2- or 3-speaker mixtures (artificially generated)
  • 10k, 1k, 1k+1k utterances in training, validation, and test sets
  • Mixed at 0 dB
• WSJ0-2mix-other
  • Same as WSJ0-2mix but mixed at 0 dB
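Mixing two utterances at a chosen signal-to-interference ratio can be sketched as below. The slides do not describe the exact mixing procedure for these datasets, so this is only the textbook definition, with SIR taken as $10\log_{10}(P_{\text{target}}/P_{\text{interferer}})$; the function names are assumptions, and signals are plain lists of samples of equal length.

```python
import math


def power(signal):
    """Average power of a signal."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_sir(target, interferer, sir_db):
    """Scale the interferer so the target-to-interferer power ratio equals sir_db.

    Returns (mixture, scaled_interferer).
    """
    gain = math.sqrt(power(target) / (power(interferer) * 10 ** (sir_db / 10)))
    scaled = [gain * v for v in interferer]
    mixture = [t + i for t, i in zip(target, scaled)]
    return mixture, scaled


target = [1.0, -1.0, 1.0, -1.0]
interferer = [0.5, 0.5, -0.5, -0.5]
mixture, scaled = mix_at_sir(target, interferer, sir_db=5.0)
measured = 10 * math.log10(power(target) / power(scaled))
print(mixture, measured)
```

At 0 dB the two talkers have equal power, which is the condition used for the Danish sets and WSJ0-2mix-other.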
![Page 22: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/22.jpg)
Models
• Implemented using the Microsoft Cognitive Toolkit (CNTK)
• Input: 257-dim STFT; Output: 257 × S streams
• Segment-based (PIT-S): each segment is independent, no tracing
  • DNN: 3 hidden layers each with 1024 ReLU units
• PIT with tracing (PIT-T): force all frames from the same output layer to belong to the same speaker
  • LSTM: 3 LSTM layers each with 1792 units
  • BLSTM: 3 BLSTM layers each with 896 units
• Test Conditions
  • Closed condition (CC): seen speakers
  • Open condition (OC): unseen speakers
![Page 23: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/23.jpg)
PIT-S Training Behavior: WSJ0-2mix
![Page 24: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/24.jpg)
PIT-S: SDR Gain (dB) on WSJ0-2mix
![Page 25: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/25.jpg)
PIT-T Training Behavior: WSJ0-2mix
![Page 26: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/26.jpg)
PIT-T: SDR Gain (dB) on WSJ0-2mix
![Page 27: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/27.jpg)
SDR (dB) and PESQ Gain Comparison
![Page 28: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/28.jpg)
Cross Language Behavior on 2-talker Mix
![Page 29: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/29.jpg)
PIT-T on WSJ0-3mix
![Page 30: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/30.jpg)
PIT-T Trained with Both 2- and 3-mix
![Page 31: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/31.jpg)
Examples: 2-talker Mix (audio clips: Mix, S1, S2)
• Male + Female
• Female + Male
• Female + Female
• Male + Male
![Page 32: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/32.jpg)
Examples: 3-talker Mix (audio clips: Mix, S1, S2, S3)
• Male + 2 Female
• Female + 2 Male
![Page 33: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/33.jpg)
Example: Trained on 3-Mix, Test on 2-Mix (audio clips: Mix, S1, S2, S3)
• Different gender
• Same gender
![Page 34: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/34.jpg)
Example: Trained on 2- and 3-Mix, Test on 2-Mix (audio clips: Mix, S1, S2, S3)
• Different gender
• Same gender
![Page 35: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/35.jpg)
![Page 36: Multi-talker Speech Separation and Tracing at AI NEXT Conference](https://reader031.vdocuments.mx/reader031/viewer/2022021922/58ecf01e1a28ab2b378b4659/html5/thumbnails/36.jpg)
Conclusion
• PIT can solve the label permutation problem
• PIT is effective in speech separation without knowing the number of speakers
• PIT-trained models generalize well to unseen speakers and languages
• PIT is simple to implement
• PIT has great potential since it can be easily integrated and combined with other techniques
[Figure labels: Classification view (supervised approach); Segmentation view (deep clustering); Separation view (PIT)]
PIT is an important ingredient in the final solution to the cocktail party problem