multi-talker speech separation and tracing at ai next conference

Dong Yu, Distinguished Scientist and Vice General Manager
Tencent AI Lab (work was done while at Microsoft Research)
Joint work with Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen

Multi-talker Speech Separation and Tracing with Permutation Invariant Training

Outline
• Motivation
• Problem Setup and Prior Arts
• Multi-talker Speech Separation
• Experiments
• Conclusion

3/27/17 Dong Yu: Multi-talker Speech Separation and Tracing with Permutation Invariant Training

Frontier Shift
• Driven by demand from users to interact with devices without wearing or carrying a close-talk microphone.
• Many difficulties hidden by close-talk microphones now surface:
  • The energy of the speech signal is very low when it reaches the microphones.
  • The interfering signals, such as background noise, reverberation, and speech from other talkers, become so distinct that they can no longer be ignored.


ASR in Real World Scenarios
[Figure: close-talk microphone vs. far-field microphone. Labels: source; reverberation from surface reflections; additive noise from other sound sources; channel distortion]


Cocktail Party Problem
• Term coined by Cherry
  • “One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem’…” (Cherry ’57)
• Human performance is superior to machine performance
  • “For ‘cocktail party’-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp ’92)
• Speech separation problem
  • Separate and trace audio streams
  • Sometimes called speech enhancement when dealing with non-speech interference


Is Speech Separation Work Needed?
• Is an end-to-end ASR system sufficient?
  • Current ASR techniques require a huge amount of training data that covers various conditions to train well
  • Speech separation can be used as an advanced front-end
  • The speech separation criterion can be used as regularization to aid and speed up training of ASR systems
• More applications than ASR
  • Hearing aids
  • Cochlear implants
  • Noise reduction for mobile communication
  • Audio information retrieval
• Is using a microphone array sufficient?
  • A mic-array alone is not sufficient, e.g., when the talkers are in the same direction
  • Many recordings are still collected with a single microphone


Problem Definition
• Source speech streams
• Mixed speech
• STFT domain
• Estimate Mask
• Reconstruct with Mask

• Ill-posed problem (# constraints < # free params):
  • There are an infinite number of possible $X_s(t, f)$ combinations that lead to the same $Y(t, f)$
• Solution:
  • Learn from training set to look for hidden regularities (complicated soft constraints)
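As a rough illustration of the setup above, the sketch below builds a toy two-source mixture in the STFT domain, shows that a different pair of sources yields the same mixture (the ill-posedness), and reconstructs a source magnitude with a mask. The toy values, the helper names `mix` and `apply_mask`, and the use of hand-written complex numbers in place of a real STFT are assumptions made for illustration only.

```python
# Toy STFT-domain "spectrograms": a list of frames, each a list of complex bins.
# A real system would compute these with an STFT; these values are made up.
X1 = [[1 + 1j, 0.5 + 0j], [0 + 2j, 1 - 1j]]
X2 = [[2 - 1j, 0.5 + 1j], [1 + 0j, 0 + 1j]]


def mix(sources):
    """Y(t, f) = sum over s of X_s(t, f): linear mixing in the STFT domain."""
    T, F = len(sources[0]), len(sources[0][0])
    return [[sum(src[t][f] for src in sources) for f in range(F)]
            for t in range(T)]


def apply_mask(mask, Y):
    """Estimated magnitude of a source: M_s(t, f) * |Y(t, f)|."""
    return [[m * abs(y) for m, y in zip(mrow, yrow)]
            for mrow, yrow in zip(mask, Y)]


Y = mix([X1, X2])

# Ill-posedness: move any amount D from one source to the other and the
# mixture is unchanged, so Y alone cannot identify the sources.
D = 0.3 - 0.7j
X1_alt = [[x + D for x in row] for row in X1]
X2_alt = [[x - D for x in row] for row in X2]
Y_alt = mix([X1_alt, X2_alt])
same_mixture = all(abs(a - b) < 1e-12
                   for ra, rb in zip(Y, Y_alt) for a, b in zip(ra, rb))

# An oracle amplitude mask for source 1 recovers |X1| from |Y|.
mask1 = [[abs(x) / abs(y) for x, y in zip(xr, yr)] for xr, yr in zip(X1, Y)]
X1_mag_hat = apply_mask(mask1, Y)
```

In a trained system the mask would come from a model rather than from the true source; the oracle mask here only shows what the reconstruction step does.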

Prior Arts Before Deep Learning Era
• Computational auditory scene analysis (CASA)
  • Use perceptual grouping cues to estimate time-frequency masks
• Non-negative matrix factorization (NMF)
  • Learn a set of non-negative bases during training
  • Estimate mixing factors during evaluation
• Model-based approach such as factorial GMM-HMM
  • Models the interaction between the target and competing speech signals and their temporal dynamics
• Spatial filtering with a microphone array
  • Beamforming: extract target sound from a specific spatial direction
  • Independent component analysis: find a demixing matrix from multiple mixtures of sound sources


Training Criteria for Deep Learning
• Ideal amplitude mask (IAM): $M_s(t, f) = \frac{|X_s(t, f)|}{|Y(t, f)|}$
• Minimize mask estimation error (two problems)
  • In silence segments $X_s(t, f) = 0$ and $Y(t, f) = 0$, so $M_s(t, f)$ is not well defined
  • Smaller error on masks may not lead to a smaller error on magnitude (which is what we care about)
• Minimize magnitude estimation error (used in this study)
  • Magnitude is still estimated through masks: this often leads to better performance, esp. when the training set is small
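The difference between the two criteria above can be sketched with a toy example. The function names, the `eps` guard for silent bins, the mean-squared form of both errors, and the numbers are all assumptions chosen for illustration; the slides do not give the exact normalization.

```python
def ideal_amplitude_mask(X_mag, Y_mag, eps=1e-8):
    """IAM: M_s(t, f) = |X_s(t, f)| / |Y(t, f)|.

    eps is an assumed guard: in silence both magnitudes are 0 and the
    ratio is undefined, which is the first problem listed above.
    """
    return [[x / (y + eps) for x, y in zip(xr, yr)]
            for xr, yr in zip(X_mag, Y_mag)]


def mask_error(M_hat, M):
    """Mean squared error between estimated and ideal masks."""
    n = sum(len(row) for row in M)
    return sum((a - b) ** 2
               for ra, rb in zip(M_hat, M) for a, b in zip(ra, rb)) / n


def magnitude_error(M_hat, Y_mag, X_mag):
    """Mean squared error between M_hat * |Y| and the true magnitude |X_s|."""
    n = sum(len(row) for row in X_mag)
    return sum((m * y - x) ** 2
               for mr, yr, xr in zip(M_hat, Y_mag, X_mag)
               for m, y, x in zip(mr, yr, xr)) / n


# One frame, two bins: a loud bin and a quiet bin (made-up values).
X_mag = [[8.0, 0.1]]
Y_mag = [[10.0, 0.2]]
M = ideal_amplitude_mask(X_mag, Y_mag)

# Estimate A: small mask error (0.05), but in the loud bin.
M_hat_a = [[M[0][0] + 0.05, M[0][1]]]
# Estimate B: larger mask error (0.1), but in the quiet bin.
M_hat_b = [[M[0][0], M[0][1] + 0.1]]

mask_err_a, mask_err_b = mask_error(M_hat_a, M), mask_error(M_hat_b, M)
mag_err_a = magnitude_error(M_hat_a, Y_mag, X_mag)
mag_err_b = magnitude_error(M_hat_b, Y_mag, X_mag)
```

Estimate A has the smaller mask error yet the larger magnitude error, because the same mask deviation is weighted by the mixture magnitude of the bin in which it occurs.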


Prior Arts with DL: Speech + Others (many works, OSU, MERL, CUST, etc.)
• Basic Architecture: mix of different types of signals
[Figure: basic architecture. Labels: Noise / Music / Other Speakers; Est. Noise / Music / Other Speakers]

Prior Arts with DL: Focus on Speech (many works, OSU, MERL, CUST, etc.)
• Basic Architecture: mix of different types of signals
[Figure: basic architecture. Labels: Noise / Music / Other Speakers; Est. Noise / Music / Other Speakers]
• Speech + noise
• Speech + music
• Specific speaker + other speakers

Multi-Talker Speech Separation
• Label Ambiguity / Label Permutation Problem
[Figure: Speaker 1 → output 1? Speaker 1 → output 2?]

Solution 1: Deep Clustering (Hershey, Chen, Le Roux, Watanabe, 2016)
• Learn a unit-norm embedding for each time-frequency bin
  • If two bins belong to the same speaker, they are close in the embedding space, and farther away otherwise.
• Trained on a large window of frames
• Separation is done by clustering embedding space representations (i.e., segmenting the bins)
• Shortcomings
  • Pipeline is complicated
  • Each bin is assumed to belong to one and only one speaker → limits its ability to be combined with other techniques
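The embedding idea above can be sketched as an affinity-matching objective. The slides do not state the loss; the Frobenius-norm form $\|VV^\top - YY^\top\|_F^2$ used here is assumed from the deep clustering paper, and the function names and toy embeddings are hypothetical.

```python
def affinity(A):
    """A times A-transpose, for a matrix given as a list of row vectors."""
    return [[sum(x * y for x, y in zip(ri, rj)) for rj in A] for ri in A]


def deep_clustering_loss(V, Y):
    """Squared Frobenius norm of (V V^T - Y Y^T) over all pairs of bins.

    V: one unit-norm embedding per time-frequency bin.
    Y: one-hot dominant-speaker label per bin.
    """
    VV, YY = affinity(V), affinity(Y)
    return sum((a - b) ** 2 for ra, rb in zip(VV, YY) for a, b in zip(ra, rb))


# Four bins: bins 0-1 are dominated by speaker A, bins 2-3 by speaker B.
labels = [[1, 0], [1, 0], [0, 1], [0, 1]]

# Embeddings that agree with the labels (same speaker, same direction).
good = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Embeddings that place bin 1 with the wrong speaker.
bad = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

loss_good = deep_clustering_loss(good, labels)
loss_bad = deep_clustering_loss(bad, labels)
```

Note that the loss depends only on which bins are grouped together, not on which cluster is called speaker A, which is how this approach sidesteps the label permutation problem; the separate clustering step at test time is what makes the pipeline more complicated.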


Solution 2: Use Manually Defined Rules (Weng, Yu, Seltzer, Droppo, 14, 15)
• Use instantaneous energy instead of speaker ID to assign labels: manually designed, limited cues
[Figure labels: Low-energy speech; High-energy speech]
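One way to picture the energy-based rule above is the following sketch, which routes the higher-energy source in each frame to a "high" training target and the other to a "low" target. The per-frame granularity, the function names, and the toy magnitudes are assumptions; the slides do not specify how the cited work computes or applies the energy cue.

```python
def frame_energy(frame):
    """Instantaneous energy of one frame of magnitudes."""
    return sum(m * m for m in frame)


def assign_by_energy(X1_mag, X2_mag):
    """Per frame, send the higher-energy source to the 'high' target and
    the other source to the 'low' target, regardless of speaker identity."""
    high, low = [], []
    for f1, f2 in zip(X1_mag, X2_mag):
        if frame_energy(f1) >= frame_energy(f2):
            high.append(f1)
            low.append(f2)
        else:
            high.append(f2)
            low.append(f1)
    return high, low


# Two frames, two bins per frame (made-up magnitudes).
spk1 = [[3.0, 1.0], [0.5, 0.5]]   # loud in frame 0, quiet in frame 1
spk2 = [[1.0, 1.0], [2.0, 2.0]]   # quiet in frame 0, loud in frame 1

high, low = assign_by_energy(spk1, spk2)
```

In this toy case the "high" target holds speaker 1 in the first frame and speaker 2 in the second, so each target stream mixes speakers over time. That is one reason a hand-designed cue like this is limited.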

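Our Solution: Permutation Invariant Training (Yu, Kolbæk, Tan, Jensen, 16, 17)
• Simple to implement
• Can be easily extended to 3 speakers
[Figure: errors for the two possible output-to-speaker assignments]
Assignment 1: $\|X_1 - \hat{X}_1\|^2 + \|X_2 - \hat{X}_2\|^2$
Assignment 2: $\|X_2 - \hat{X}_1\|^2 + \|X_1 - \hat{X}_2\|^2$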
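The two assignment errors above generalize to any number of speakers by scoring every permutation of targets against the outputs and training on the smallest error. The sketch below is a minimal plain-Python version under assumptions: the names `sq_error` and `pit_loss` are hypothetical, the error is an unnormalized sum of squares, and the spectrogram values are made up.

```python
from itertools import permutations


def sq_error(A, B):
    """Sum of squared differences between two magnitude spectrograms."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def pit_loss(estimates, targets):
    """Return (minimum error, best permutation).

    best_perm[i] is the index of the target assigned to output i.
    All S! assignments are scored; the smallest total error is kept.
    """
    best_err, best_perm = None, None
    for perm in permutations(range(len(targets))):
        err = sum(sq_error(est, targets[p]) for est, p in zip(estimates, perm))
        if best_err is None or err < best_err:
            best_err, best_perm = err, perm
    return best_err, best_perm


# Two speakers, two frames, two bins (made-up magnitudes).
X1 = [[1.0, 0.0], [1.0, 0.0]]
X2 = [[0.0, 2.0], [0.0, 2.0]]

# The network happened to put speaker 2 on output 1 and speaker 1 on output 2.
out1 = [[0.1, 1.9], [0.0, 2.1]]
out2 = [[0.9, 0.0], [1.1, 0.1]]

best_err, best_perm = pit_loss([out1, out2], [X1, X2])
```

Because the minimum is taken over assignments, the swapped output order is not penalized. With three speakers the same function scores six permutations, which is consistent with the claim that the extension is straightforward.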

Testing
• Default assignment: concatenate each output's frames to form streams
• Optimal assignment: the output of each frame is correctly assigned to speakers; concatenate the frames belonging to each speaker to form streams
• The gap between them indicates the gain from additional speaker tracing
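The two test-time assignments above can be compared on a toy example. This is a sketch under assumptions: two speakers only, squared error as the per-frame score, reference signals used as an oracle for the optimal assignment, and hypothetical function names.

```python
def frame_err(a, b):
    """Squared error between two frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def stream_err(stream, ref):
    """Total squared error of a stream against its reference."""
    return sum(frame_err(a, b) for a, b in zip(stream, ref))


def default_assignment(out1, out2):
    """Output i is taken as stream i for every frame (no re-ordering)."""
    return out1, out2


def optimal_assignment(out1, out2, ref1, ref2):
    """Oracle: per frame, choose the output order with the smaller error
    against the references, then concatenate frames per speaker."""
    s1, s2 = [], []
    for o1, o2, r1, r2 in zip(out1, out2, ref1, ref2):
        keep = frame_err(o1, r1) + frame_err(o2, r2)
        swap = frame_err(o2, r1) + frame_err(o1, r2)
        if keep <= swap:
            s1.append(o1)
            s2.append(o2)
        else:
            s1.append(o2)
            s2.append(o1)
    return s1, s2


ref1 = [[1, 0], [1, 0], [1, 0]]
ref2 = [[0, 1], [0, 1], [0, 1]]
# Perfect separation in every frame, but the outputs swap in the middle frame.
out1 = [[1, 0], [0, 1], [1, 0]]
out2 = [[0, 1], [1, 0], [0, 1]]

d1, d2 = default_assignment(out1, out2)
o1, o2 = optimal_assignment(out1, out2, ref1, ref2)
default_total = stream_err(d1, ref1) + stream_err(d2, ref2)
optimal_total = stream_err(o1, ref1) + stream_err(o2, ref2)
```

Here the whole default-assignment error comes from the single swapped frame, so the gap between the two totals is exactly what better speaker tracing could recover.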

Experiment Setup: Datasets
• WSJ0-2mix and 3-mix
  • Derived from WSJ0 corpus
  • 2- and 3-speaker mixtures (artificially generated)
  • 30 h training set, 10 h validation set, 5 h test set
  • Mixed at SIRs between 0 dB and 5 dB
• Danish-2mix and 3-mix
  • Derived from a Danish corpus
  • 2- or 3-speaker mixtures (artificially generated)
  • 10k, 1k, 1k + 1k utterances in training, validation, and test sets
  • Mixed at 0 dB
• WSJ0-2mix-other
  • Same as WSJ0-2mix but mixed at 0 dB
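Mixing two utterances at a chosen signal-to-interference ratio (SIR) can be sketched as follows. This is not the script used to build the datasets above; the function names, the whole-signal average-power definition of SIR, and the sample values are assumptions for illustration.

```python
import math


def power(x):
    """Average power of a signal given as a list of samples."""
    return sum(v * v for v in x) / len(x)


def measure_sir_db(target, interferer):
    """SIR in dB: 10 * log10(P_target / P_interferer)."""
    return 10 * math.log10(power(target) / power(interferer))


def mix_at_sir(target, interferer, target_sir_db):
    """Scale the interferer so the SIR equals target_sir_db, then add the
    two signals sample by sample. Returns (mixture, scaled interferer)."""
    ratio = 10 ** (target_sir_db / 10)
    gain = math.sqrt(power(target) / (power(interferer) * ratio))
    scaled = [gain * v for v in interferer]
    mixture = [t + i for t, i in zip(target, scaled)]
    return mixture, scaled


# Made-up sample values standing in for two utterances of equal length.
target = [0.5, -0.5, 0.25, -0.25]
interferer = [0.1, 0.2, -0.1, -0.2]

mixture_5db, scaled_5db = mix_at_sir(target, interferer, 5.0)
mixture_0db, scaled_0db = mix_at_sir(target, interferer, 0.0)
```

At 0 dB the two talkers have equal power, so energy alone cannot tell them apart.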


Models
• Implemented using the Microsoft Cognitive Toolkit (CNTK)
• Input: 257-dim STFT; Output: 257 × S streams
• Segment-based (PIT-S): each segment is independent, no tracing
  • DNN: 3 hidden layers, each with 1024 ReLU units
• PIT with tracing (PIT-T): force all frames from the same output layer to belong to the same speaker
  • LSTM: 3 LSTM layers, each with 1792 units
  • BLSTM: 3 BLSTM layers, each with 896 units
• Test Conditions
  • Closed condition (CC): seen speakers
  • Open condition (OC): unseen speakers


PIT-S Training Behavior: WSJ0-2mix

PIT-S: SDR Gain (dB) on WSJ0-2mix

PIT-T Training Behavior: WSJ0-2mix

PIT-T: SDR Gain (dB) on WSJ0-2mix

SDR (dB) and PESQ Gain Comparison

Cross Language Behavior on 2-talker Mix

PIT-T on WSJ0-3mix

PIT-T Trained with Both 2- and 3-mix


Examples: 2-talker Mix
• Male + Female: [audio: Mix, S1, S2]
• Female + Male: [audio: Mix, S1, S2]
• Female + Female: [audio: Mix, S1, S2]
• Male + Male: [audio: Mix, S1, S2]

Examples: 3-talker Mix
• Male + 2 Female: [audio: Mix, S1, S2, S3]
• Female + 2 Male: [audio: Mix, S1, S2, S3]

Example: Trained on 3-Mix, Test on 2-Mix
• Different gender: [audio: Mix, S1, S2, S3]
• Same gender: [audio: Mix, S1, S2, S3]

Example: Trained on 2- and 3-Mix, Test on 2-Mix
• Different gender: [audio: Mix, S1, S2, S3]
• Same gender: [audio: Mix, S1, S2, S3]

Conclusion
• PIT can solve the label permutation problem
• PIT is effective in speech separation without knowing the number of speakers
• PIT-trained models generalize well to unseen speakers and languages
• PIT is simple to implement
• PIT has great potential since it can be easily integrated and combined with other techniques

Classification View (supervised approach)
Segmentation View (deep clustering)
Separation View (PIT)

PIT is an important ingredient in the final solution to the cocktail party problem
