Conversational Speech Translation: Challenges and Techniques, by Chris Wendt, Microsoft


Conversational Speech Translation: Challenges and Techniques
[email protected] @Tian500
with Will Lewis

TAUS Forum Beijing, April 26, 2016

As we all know, the idea of being able to speak naturally with someone who doesn't understand your language has been a long-held dream, whether we're talking about the biblical story of the Tower of Babel or 20th-century sci-fi such as the Star Trek universal translator or Douglas Adams' Babel fish.


Why now? Confluence of factors:
- Steady progress in MT quality over the last few years
- Using huge amounts of data

Technological leap in ASR:
- Deep Learning (DNNs): 33+% WER reduction over GMMs (Seide et al. 2011)
- Now above 42%
- More robust to noise, speaker variation, accents

Skype: a global platform to put speech translation in the hands of hundreds of millions of users

About this time last year, we set ourselves a goal of trying to turn this age-old dream into reality. We realized that there was a confluence of factors that, taken together, gave us the opportunity to make this happen. We ourselves in the MT field have been making steady progress on MT quality, both with better algorithms and by applying ever greater amounts of data, such that our best MT systems today are really quite good. At the same time, the ASR field has seen a technological leap over the last few years, with the use of DNNs leading to dramatically lower error rates. And finally, we at Microsoft Research felt we had a golden opportunity not just to do a great technology demo, but to actually put this in the hands of hundreds of millions of users through Skype. In order to achieve this we faced a number of challenges, which is what I will be talking about for the next hour.

Skype Translator: Goal
To support open-domain conversations between Skype users in different parts of the world, speaking different languages.

Speech-to-Speech (S2S): Goal
To support open-domain conversations between any individuals speaking different languages, anywhere.

Skype Translator: What is it?

Current state of the art in Speech Recognition and Machine Translation embedded in a VoIP client: Skype


Skype Translator = Universal Translator?


(Image: Copyright Paramount Studios)
English: Your ship runs like a garbage scow!
Klingon:

Voice: Currently supports Arabic, Chinese (Mandarin), English, French, German, Italian, Portuguese, Spanish (any pairing)
Text (IM): Currently supports IM conversations between 50+ languages (any pairing)

Microsoft Speech to Speech (S2S): What is it?

Current state of the art in Speech Recognition and Machine Translation

1. High-quality speech recognition
2. Disfluency processing: TrueText
3. High-quality conversational Machine Translation (MT)
Not only in Skype, but also in the Microsoft Translator API

The Challenges

The gulf between speech and text: it's not enough to just chain a really good ASR system with a really good MT system; how people talk to each other is not how they write.

Building really good conversational ASR and MT systems: significant changes in the data we use to train the ASR and MT systems.

The gap between technology demo and consumer product: producing models with shippable latency; interesting problems one encounters with real consumers.

There are three main challenges we need to work through to build a realistic S2S system:

The gulf between speech and text: it's not enough to just chain a really good ASR system with a really good MT system; how people talk to each other is not how they write.

Building really good conversational ASR and MT systems: significant changes in the data we use to train the ASR and MT systems.

The gap between technology demo and consumer product: plugging into Skype; interesting problems one encounters with real consumers.


How people really speak

What the person thought they said: "Yeah. I guess it was worth it." / "Ja. Ich denke, es hat sich gelohnt."
What they actually said: "Yeah, but um, but it was you know, it was, I guess, it was worth it." / "Ja, aber hm, aber es war, weißt du, es war, ich denke, es hat sich gelohnt."

Disfluency removal: more than just removing "um" and "ah".

[READ SLIDE] So, if we take the raw ASR output and just throw it at MT, it doesn't work so well. We need components to process the ASR output, remove disfluencies, etc., and make it more palatable to MT. Likewise, we need to adapt MT to handle this kind of input.

Disfluencies in Conversational Speech

um no i mean yes but you know i am i've never done it myself have you done that uh yes

Disfluency types:
- Filler pauses
- Discourse markers
- Repetition
- Corrections (speech repairs)

um no i mean yes but you know i am i've never done it myself have you done that uh yes
→
Yes.
But, I've never done it myself.
Have you done that?
Yes?

So lets take a closer look at the different types of disfluencies first you have uh, your, um, fillers, then, you know, I mean, your discourse markers and and and repetition, and finally correct--, I mean, speech repairs, where people go back and repeat, I mean, change what they said.

In this example, the speaker changed "no" to "yes", and "I am" to "I have".


Disfluencies in Conversational Speech

um no i mean yes but you know i am i've never done it myself have you done that uh yes
→
Yes.
But, I've never done it myself.
Have you done that?
Yes?

Need to:
- Segment
- Remove disfluencies
- Punctuate
- Add case
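To make the "remove disfluencies" step concrete, here is a toy rule-based pass in Python. It is only a sketch under the assumption of a fixed filler and discourse-marker list; the production system (described later) uses trained CRF models rather than rules.

    import re

    # Toy "simple disfluency" cleanup: fixed filler and discourse-marker
    # lists, plus collapsing of immediate word repetitions. Illustrative
    # only; the real pipeline uses CRF sequence taggers.
    FILLERS = r"\b(um|uh|ah|er)\b"
    DISCOURSE_MARKERS = r"\b(you know|i mean|well)\b"

    def simple_cleanup(text):
        text = re.sub(FILLERS, " ", text)
        text = re.sub(DISCOURSE_MARKERS, " ", text)
        # collapse immediate repetitions: "but but it was" -> "but it was"
        text = re.sub(r"\b(\w+)( \1\b)+", r"\1", text)
        return re.sub(r"\s+", " ", text).strip()

    print(simple_cleanup("um no i mean yes but you know i am i've never done it myself"))
    # -> "no yes but i am i've never done it myself"

Note what rules cannot do: they leave the speech repair "no ... yes" and "i am / i've" untouched, which is exactly why the later slides treat complex disfluencies separately.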


No ASR post-processing for MT:
translate "um no i mean yes but you know i am i've never done it myself have you done that uh yes"

Post-processing of ASR for MT:
translate "Yes. But, I've never done it myself. Have you done that? Yes?"

Missing punctuation → changes in meaning

Questions:
¿vas ahora? → are you going now?
vas ahora → go now

Negation:
no es mi segundo → it is not my second
no. es mi segundo → no. it's my second

Seriously embarrassing:
tienes una hija ¿no? es muy preciosa → you have a daughter right? is very beautiful
tienes una hija no es muy preciosa → you have a daughter is not very beautiful

Another thing that is missing is punctuation. If we don't get punctuation right, we are risking a lot more than just word salad. You may know the example "Let's eat grandma", where a missing comma could lead to a tragic outcome. When translating, the problem gets worse.

Accents/wrong characters → changes in meaning

Accented words (sound-alikes): written with different forms, different meanings, but pronounced the same.

Si los vinos mendocinos son muy famosos → If the wines from Mendoza are very famous
Sí los vinos mendocinos son muy famosos → Yes the wines from Mendoza are very famous

Misrecognized words/characters (sound-alikes):
Do you often fall asleep without listening to it? → You often fall asleep without listening to it.

In the case of Spanish (and possibly other languages), we have an additional problem with accented words. [READ SLIDE]

What people talk about

Here's what we need to recognize and translate:
- He ain't my choice. But, hey, we hated the last guy.
- We're going to hit it and quit it.
- Boy, that story gets better every time you hear it.
- I swear to God I am done with guys like that.

Unfortunately, a lot of our MT training data looks like this:
- Mr President, Commissioner, Mr Sacconi, ladies and gentlemen, as the PPE-DE's coordinator for regional policy, I want to stress that some very important points are made in this resolution.
- I am therefore calling for integrated policies, all-encompassing policies that we can adapt to society, which must listen to our recommendations and comply with them.

In addition to *how* people speak (genre), there's also a big difference in *what* they talk about (domain).

Data mismatch & scarcity

Training data mismatch:
- MT training data is clearly mismatched
- ASR training data is a mixed bag

Data scarcity:
- Traditional data sources (government, news, web) are not well matched
- Not a lot of parallel conversational data (for MT)
- Not a lot of transcribed conversational data (for ASR)

ASR: word errors, missing vocabulary

ASR vocabulary issues, e.g. names:
Hi Arul → Hi Aaron
I went skiing at Snoqualmie Pass → I went skiing at snow call me pass

ASR errors: how do we minimize the impact of misrecognized words?

The S2S API
Public API for speech translation: www.microsoft.com/translator

API Documentation: http://docs.microsofttranslator.com/


Sample code using the API: https://github.com/MicrosoftTranslator

Demo of S2S API: iPad app demo

Demo of S2S API: cmdline code demo

Cmdline parameters (example of API usage)
Usage: CmdLineSpeechTranslate.exe ClientId ClientSecret FilePath SrcLanguage TargetLanguage
Example: CmdLineSpeechTranslate.exe ClientId ClientSecret helloworld.wav en-us es-es

Source: 1 of 8 spoken languages
Target: 1 of 50+ spoken languages
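For orientation, here is a hypothetical Python sketch of the same command-line flow. SpeechTranslateClient is a stand-in name introduced only for illustration; the real GitHub sample ships its own client code for calling the service.

    import argparse

    # Hypothetical stand-in for a client wrapping the speech translation
    # service; CmdLineSpeechTranslate on GitHub provides the real client.
    class SpeechTranslateClient:
        def __init__(self, client_id, client_secret):
            self.client_id = client_id
            self.client_secret = client_secret

        def translate_wav(self, path, src, tgt):
            raise NotImplementedError("wire this to the real service")

    def main():
        p = argparse.ArgumentParser(description="Speech-to-speech translation demo")
        p.add_argument("client_id")
        p.add_argument("client_secret")
        p.add_argument("file_path")        # e.g. helloworld.wav
        p.add_argument("src_language")     # 1 of 8 spoken source languages, e.g. en-us
        p.add_argument("target_language")  # 1 of 50+ target languages, e.g. es-es
        args = p.parse_args()
        client = SpeechTranslateClient(args.client_id, args.client_secret)
        print(client.translate_wav(args.file_path, args.src_language, args.target_language))

    if __name__ == "__main__":
        main()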


Speech Recognition


The Challenges
- Probably the hardest ASR task: conversational speaking style, open domain
- Key enabler: dramatic ASR improvements from using Deep Neural Networks
- Where to get training data? For US English, DARPA Switchboard (2000h) is a great start, but there is no comparable corpus for other languages
- Use found captioned speech: many thousands of hours of speech used for the English system


Training Data: Audio with Fluent Transcripts

Disfluent (what we want): Well I uh started this this project while I was a student uh grad student at uh Stan- Stanford
Fluent (what we get): I started this project while I was a grad student at Stanford

Machine Translation > ASR
Adapting MT to the Conversational Domain


ASR/MT Mismatch

Significant data mismatch between ASR output (even when cleaned) and MT training data:

He ain't my choice. But, hey, we hated the last guy. / We're going to hit it and quit it.
vs.
Mr President, Commissioner, Mr Sacconi, ladies and gentlemen, as the PPE-DE's coordinator for regional policy, I want to stress that some very important points are made in this resolution.

But where do we get parallel conversational data?
Simple experiment: train on movie subtitle content.

[READ SLIDE] Adapting MT to ASR starts with building a good baseline, conversationally-oriented MT system.

Data Selection
- Sample in-domain (in-register?) data from our en-fr parallel data store
- Leverage the fact that the data pool does not match the target domain
- Use monolingual conversational data as seed (in-domain): CallHome, SWBD
- Use the Cross-Entropy Difference method (Moore-Lewis 2010) against a very large parallel corpus (for ENU-FRA, hundreds of millions of sentences); a sketch of the scoring idea follows
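The following is a minimal sketch of the Moore-Lewis cross-entropy difference idea, using add-one smoothed unigram language models purely for brevity; the actual method uses proper n-gram LMs, and the corpus variables here are illustrative.

    import math
    from collections import Counter

    def unigram_lm(sentences):
        # Add-one smoothed unigram LM; real systems use higher-order n-grams.
        counts = Counter(w for s in sentences for w in s.split())
        total = sum(counts.values())
        vocab = len(counts) + 1
        return lambda w: math.log((counts[w] + 1) / (total + vocab))

    def cross_entropy(lm, sentence):
        words = sentence.split()
        return -sum(lm(w) for w in words) / max(len(words), 1)

    def moore_lewis_rank(pool, in_domain_seed, general_sample):
        lm_in = unigram_lm(in_domain_seed)    # e.g. CallHome/SWBD transcripts
        lm_out = unigram_lm(general_sample)   # sample of the big parallel pool
        # Lower H_in(s) - H_out(s) means s looks in-domain relative to the pool.
        return sorted(pool, key=lambda s: cross_entropy(lm_in, s) - cross_entropy(lm_out, s))

Selecting the top of this ranking yields conversational-looking training sentences from an otherwise mismatched pool.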

Train on combination of subtitle and DA data

Speech Correction (ASR → MT)
Bridging the gap between ASR and MT

Now let's take a closer look at the Speech Correction component, which helps bridge the gap between spoken and written text.

Speech Recognition → Speech Correction → Translation → Text to Speech

Raw ASR output:
um no i mean yes but i am i've never done it myself did users before uh I will ask go deep to help me
(Translated raw: Euh non je veux dire oui mais je suis je l'ai jamais fait moi-même fait utilisateurs avant euh je vais demander aller profond pour m'aider)

Customization and Personalization:
um no i mean yes but i am i've never done it myself did users before uh I will ask gurdeep to help me

Lattice Rescoring:
um no i mean yes but i am i've never done it myself did you use yours before uh I will ask gurdeep to help me

Disfluency Removal:
no i mean yes but I am i've never done it myself did you use yours before uh I will ask gurdeep to help me

Segmentation, Punctuation and True Casing:
Yes.
But I've never done it myself.
Did you use yours before?
I will ask Gurdeep to help me.

Translation:
Oui.
Mais je ne l'ai jamais fait moi-même.
Avez-vous utilisé le vôtre avant ?
Gurdeep va demander de l'aide.

The Speech Correction component helps bridge the gap between spoken and written text.

Raw ASR Output:
um no i mean yes but i am i've never done it myself did users before uh I will ask go deep to help me

Let's listen to what the user said: very good, fluent English. But if we read the English transcript, it is hard to understand, not to mention the translation.


Missing: sentence boundaries, punctuation, casing, and disfluency removal.


Customization and Personalization:
um no i mean yes but i am i've never done it myself did users before uh I will ask gurdeep to help me

First, we know who is talking to whom, so we can have user profiles for both of them, which enables us to do a better job in both recognition and translation. Personalization and customization play a crucial role in open-domain S2S, since people talk about broadly different things. For example, here we can recognize the person's name "Gurdeep" rather than "go deep".

Good customization and personalization is crucial for open-domain Skype Translator: people may talk about their planned vacation to Colombia with a Spanish speaker, which needs different vocabulary than talking to a Chinese supplier the next day about product plans.


Personalization and Customization

(Architecture diagram: Client ↔ Skype Translator Service, backed by User Profiles and Object Stores in cloud storage, which feed Customized Language Models (CLM) for Speech Recognition and Customized Models for Machine Translation.)

Open-domain conversational S2S can benefit from customizing and personalizing the models according to the users' profiles. We can use a user's profile to create customized models that fit their topics and vocabulary. [Describe the diagram above.] Currently we use this infrastructure for contact-name recognition.


Personalized Names Handling
- Name recognition is a well-known problem in large-vocabulary ASR
- Supporting high-recall name recognition usually compromises WER
- We deploy a high-precision approach to support contact-name recognition using personalized name lists
- Personalized names can be recognized in any context
Examples:
- Hello Ignacio, how are you doing today?
- I will meet Arul Menezes for lunch tomorrow.

(Diagram: Client → Speech Recognition with a generic LM, run in parallel with a customized LM of contact names.)

One of the issues with ASR is that the vocabulary cannot include every possible person name (or place name, etc.). Expanding the vocabulary drastically to include millions of names can compromise WER, because names may be misrecognized in place of regular words.

However, in Skype Translator we found that when the system didn't recognize the caller's or callee's name at the start of a call, it often derailed the entire conversation.

So we opted for a surgical fix for now, while we investigate more broad-based options. What we've done is add a very small restrictive grammar comprised of common greetings etc., but with placeholders for names. At the start of a call we dynamically compile the contact names for the current caller and callee into this grammar, and our ASR engine can use this grammar in parallel with its broad-based regular LM.
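The shipped mechanism compiles names into a grammar used by the recognizer itself; purely as a post-hoc illustration of the same idea, here is a sketch that fuzzy-matches ASR word spans against a contact list to repair misrecognitions like "go deep" → "Gurdeep". The threshold and span length are made-up parameters.

    from difflib import SequenceMatcher

    # Illustrative only: repair name misrecognitions after ASR by matching
    # short word spans against the caller's contact list.
    def best_name_match(span, contacts, threshold=0.75):
        span_key = span.replace(" ", "").lower()
        best, best_score = None, threshold
        for name in contacts:
            score = SequenceMatcher(None, span_key, name.lower()).ratio()
            if score > best_score:
                best, best_score = name, score
        return best

    def personalize(asr_words, contacts, max_span=2):
        out, i = [], 0
        while i < len(asr_words):
            repaired = False
            for n in range(max_span, 0, -1):  # try longer spans first
                match = best_name_match(" ".join(asr_words[i:i+n]), contacts)
                if match:
                    out.append(match)
                    i += n
                    repaired = True
                    break
            if not repaired:
                out.append(asr_words[i])
                i += 1
        return out

    print(personalize("I will ask go deep to help me".split(), ["Gurdeep", "Arul"]))
    # -> ['I', 'will', 'ask', 'Gurdeep', 'to', 'help', 'me']

The high threshold mirrors the high-precision goal on the slide: it is better to leave a span alone than to hallucinate a contact name into ordinary words.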

Lattice Rescoring:
um no i mean yes but i am i've never done it myself did you use yours before uh I will ask gurdeep to help me

Instead of using the one-best output from ASR, we can use the n-best in lattice form. Lattice rescoring lets us get many possible alternatives from the recognizer and then use a very large LM to score them.


Some early experiments: the error cascade (Speech Recognition → Translation Engine)

1-best ASR. Proposed solutions:
- Feed the n-best list of ASR output to MT
- Use the speech lattice directly as input to MT (e.g. Matusov et al. 2005, Lavie et al. 2004, Dyer et al. 2008)
- Confusion network decoding (e.g. Bertoldi et al. 2007; Bertoldi and Federico 2005)

In the early days we were fixated on WER and its effect on BLEU, and on the error cascade: if we pipeline multiple error-prone components, the errors multiply. This is a problem other researchers have studied for a number of years.

One approach that's been tried is to take the ASR lattice directly as input to the MT decoder. This has been studied by many groups and is conceptually elegant, but the implementation is quite complex. A lattice representation allows an MT system to arbitrate between multiple ambiguous hypotheses from upstream processing so that the best translation can be produced.

A simplification is decoding over a confusion network, where the ASR confusables are compactly encoded as a word "sausage". This is very easy to decode in MT because it affects mostly just the phrase-lookup portion, leaving the rest of the decoding untouched except for some extra features. We found that the MT portion of this worked well. However, collapsing an ASR lattice into a confusion network is an ill-defined operation, which can result in some nasty artifacts in the confusion network, such as a proliferation of epsilon arcs.

After some experimentation with N-best rescoring and confusion networks, we decided to try a couple of different things.

Confusion network: a confusion network (CN), also known as a sausage, is a weighted directed graph with the peculiarity that each path from the start node to the end node goes through all the other nodes. Speech lattice: a weighted graph that compactly maintains the recognizer's larger set of candidate hypotheses.
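As a concrete illustration of the "sausage" structure, here is a minimal confusion network and its best path; the posteriors are made up, and an empty string plays the role of an epsilon arc (word absent).

    # A confusion network as a list of slots; every path visits every slot.
    # Weights are made-up posteriors; "" is an epsilon arc.
    sausage = [
        [("um", 0.6), ("", 0.4)],
        [("no", 0.7), ("know", 0.3)],
        [("i", 0.9), ("", 0.1)],
    ]

    def best_path(cn):
        # Independent per-slot argmax is exact here because slots don't interact.
        return [max(slot, key=lambda arc: arc[1])[0] for slot in cn]

    print([w for w in best_path(sausage) if w])  # -> ['um', 'no', 'i']

Epsilon arcs like the one above are exactly the artifact that can proliferate when an arbitrary lattice is forced into this shape.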


Lattice rescoring

Rescoring the ASR lattice with:
- A much bigger LM (100x larger than first pass)
- MT-specific features
- Tuned weights

Results: WER reduction of 1-2% absolute; BLEU improvement of 1-2% absolute.

Cherry-picked examples:
Ref: what do you use yours for mostly
ASR: do users for mostly
Rescored: do you use yours for mostly

Ref: but we're in a subdivision
ASR: but where in a subdivision
Rescored: but we're in a subdivision
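A simplified picture of the rescoring step, shown over an n-best list rather than a full lattice; the feature set and weight names are placeholders, and in the real system the weights were tuned with end-to-end translation quality in mind.

    from dataclasses import dataclass

    @dataclass
    class Hypothesis:
        words: list
        acoustic_score: float  # first-pass recognizer score (log domain)
        firstpass_lm: float    # small LM score used during decoding

    def rescore(nbest, big_lm_logprob, weights):
        # Linear combination of first-pass scores with a much bigger LM;
        # extra MT-friendly features would enter as further weighted terms.
        def total(h):
            return (weights["ac"] * h.acoustic_score
                    + weights["lm1"] * h.firstpass_lm
                    + weights["lm2"] * big_lm_logprob(h.words))
        return max(nbest, key=total)

Rescoring a full lattice follows the same scoring idea, just applied over lattice paths instead of an explicit n-best list.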

We decided that before we plunged into full-fledged MT decoding over lattices, we would first try simply rescoring the ASR lattice with a much bigger LM, adding some extra MT-friendly features, and tuning the model weights.

We discovered we could get very good ASR and end-to-end BLEU gains, at which point we decided not to bother with decoding over lattices in the MT decoder itself.

Disfluency Removal:
no i mean yes but I am i've never done it myself did you use yours before uh I will ask gurdeep to help me

Now we have an almost perfect transcription that matches what the user said, improved and customized as well. Are we ready to send this to MT? Not yet. Disfluency handling should be done first.


Segmentation, Punctuation and True Casing:
Yes.
But I've never done it myself.
Did you use yours before?
I will ask Gurdeep to help me.

Finally, we come to segmentation into sentence units, punctuation, and casing, after which the text is ready for a state-of-the-art MT system to produce a reliable translation.


Important missing information: sentence boundaries, punctuation, case
- Important for translation, missing in ASR output
- Boundaries are not apparent in speech; for example, pauses are not good indicators of sentence boundaries
- Can result in word salad

claro también es verdad sí eso es cierto → also clear is true yes that is true
Claro. También, es verdad. Sí. Eso es cierto. → Of course. Also, it is true. Yes. That is true.

Disfluency removal: more than just removing "um" and "ah".

To understand how crucial segmentation, punctuation, and disfluency handling are for MT, let's take a look at some examples.



Segmentation and disfluency removal interact with each other.

um no i mean yes but i am i've never done it myself have you done that uh yes

Segmentation first, then simple disfluency removal:
um no | i mean yes but i am | i've never done it myself | have you done that | uh yes
→ No. I mean yes. I am. I've never done it myself. Have you done that? Yes?

Simple disfluency removal first, then segmentation, then complex disfluency removal:
no, yes, but, i am, i've never done it myself, have you done that, yes
→ Yes. I've never done it myself. Have you done that? Yes?

Traditionally, this problem has been solved by first doing segmentation and then disfluency removal.

But there is an interaction between segmentation and disfluency handling.

If segmentation is done first, disfluency removal loses the chance to make a better correction, and we'll be left with numerous disfluent fragments.

On the other hand, complex disfluency removal (speech repairs) needs sentence boundaries, so you can't do that first either.

So we split the difference: we do simple disfluency removal first, then segmentation, then complex disfluency removal, as sketched below.
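Here is a deliberately naive, runnable illustration of that ordering. Every stage is a stand-in (regex rules instead of the real CRF taggers and parser), but the data flow matches the split just described.

    import re

    # Stage 1: simple disfluencies (fillers) before segmentation.
    def simple_stage(text):
        return re.sub(r"\b(um|uh)\b\s*", "", text)

    # Stage 2: pretend segmenter; the real one is a CRF boundary tagger.
    def segment_stage(text):
        return [s.strip() for s in re.split(r"(?=\bhave you\b)", text) if s.strip()]

    # Stage 3: pretend speech repair; the real one is parser-based.
    def complex_stage(unit):
        return re.sub(r"\bno i mean\b\s*", "", unit)

    raw = "um no i mean yes but i am i've never done it myself have you done that uh yes"
    print([complex_stage(u) for u in segment_stage(simple_stage(raw))])
    # -> ["yes but i am i've never done it myself", 'have you done that yes']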


CRF-based classifiers for annotation

Simple Disfluency → Segmentation and Punctuation → Complex Disfluency

Segmentation and Disfluency Removal for Conversational Speech Translation. Hany Hassan, Lee Schwartz, Dilek Hakkani-Tür, and Gokhan Tur. INTERSPEECH 2014.

Conditional random fields (CRFs) are a class of statistical modelling method often applied in pattern recognition and machine learning, where they are used for structured prediction. Whereas an ordinary classifier predicts a label for a single sample without regard to "neighboring" samples, a CRF can take context into account; e.g., the linear chain CRF popular in natural language processing predicts sequences of labels for sequences of input samples.

First remove simple disfluencies, then segment, then remove complex disfluencies.
- The first two stages are CRF sequence taggers
- Complex disfluency handling: uses metadata annotated by the previous stages; uses iterative parsing (NLPwin parser); needs sentence units


Sentence Unit Boundary Detection
CRF classifier: L2 regularization, feature cut-off = 2
Features (over a window of two words on each side):
- Lexical features
- Brown clusters
- POS tags trained on conversational data (another CRF classifier)
- Speech pause-based duration
- Phrase-translation table n-grams
(A toy tagger in this spirit is sketched below, after the note on Brown clusters.)

Brown clustering is a hard hierarchical agglomerative clustering problem based on distributional information. It is typically applied to text, grouping words into clusters that are assumed to be semantically related by virtue of their having been embedded in similar contexts.
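Below is a toy sentence-unit tagger along these lines, using the third-party sklearn-crfsuite package as a stand-in for the internal CRF toolkit. The feature set (word window plus a bucketed pause duration) and the tiny training example are illustrative only; Brown-cluster and POS features would be added the same way.

    import sklearn_crfsuite  # third-party CRF package, standing in for the internal toolkit

    def word_features(words, pauses, i):
        # Current word plus a bucketed pause duration after it.
        feats = {"w0": words[i], "pause": str(round(pauses[i], 1))}
        # Lexical features over a window of two words on each side.
        for off in (-2, -1, 1, 2):
            j = i + off
            feats["w%+d" % off] = words[j] if 0 <= j < len(words) else "<pad>"
        return feats

    def featurize(words, pauses):
        return [word_features(words, pauses, i) for i in range(len(words))]

    # Toy training data: "SU" marks a word that ends a sentence unit.
    words = "no i mean yes i've never done it myself have you done that".split()
    pauses = [0.1, 0.1, 0.1, 0.6, 0.1, 0.1, 0.1, 0.1, 0.6, 0.1, 0.1, 0.1, 0.6]
    labels = ["O", "O", "O", "SU", "O", "O", "O", "O", "SU", "O", "O", "O", "SU"]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c2=1.0)  # c2: L2 regularization
    crf.fit([featurize(words, pauses)], [labels])
    print(crf.predict([featurize(words, pauses)])[0])

Because a CRF scores whole label sequences, it can learn that a boundary is more likely after "yes" followed by a long pause than in the middle of a phrase, which a per-word classifier would miss.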


Disfluency Removal, Punctuation Insertion and TrueCaser
CRF classifiers: L2 regularization, feature cut-off = 2
Features (over a window of two words on each side):
- Lexical features
- Brown clusters
- POS tags trained on conversational data (another CRF classifier)


Example of Complex Disfluency Removal

but, i'm, i've never done that before.

In complex disfluency removal we take advantage of the NLPwin parser, which was built for the MS Word grammar checker and so is robust to ungrammatical input. For example, here we have a repeated subtree that is linguistically similar, so the first subtree is removed. We also look for constituents that appear to be disconnected from other parts of the tree. When we spot a disfluency, we remove it and reparse the resulting sentence, because removing the disfluency could change the entire parse. We remove errors one by one and stop when there are no more edits or we hit a limit on the number of parses.

Example of complex disfluency removal (result):
But I've never done that before.


Segmentation and Disfluency Removal Effect (English→Spanish)

Segmentation                     | Disfluency Handling        | BLEU on Transcripts | BLEU on ASR
No segmentation (full utterance) | None                       | 22.13               | 19.13
No segmentation (full utterance) | 1-stage                    | 23.46               | 20.49
Segment on pauses                | None                       | 20.32               | 18.78
Segment on pauses                | 1-stage                    | 22.53               | 19.32
CRF segmenter                    | 1-stage after segmenter    | 25.11               | 21.24
CRF segmenter                    | 1-stage before segmenter   | 24.79               | 20.95
CRF segmenter                    | 2-stage (before & after)   | 25.65 (+16%)        | 21.76 (+13.7%)

Disfluency handling:
- None: no disfluency handling applied
- After: applied after segmentation
- Split: simple disfluency removal applied before segmentation, complex disfluency removal applied after segmentation
One point of BLEU improvement is roughly equivalent to a 1% absolute improvement in accuracy.

Lots of numbers here. Looking at the last column, which is what we care about, here are the takeaways:
- Sentence breaking based on speaker pauses is a bad idea (you lose 0.5 BLEU points)
- CRF sentence breaking by itself adds about 0.5 BLEU (vs. no breaking at all, i.e. translating the full utterance)
- Disfluency removal by itself adds about 1.3 BLEU
- Doing both gives you about 2 BLEU points, and doing the split before/after adds another 0.5

S2S in the Schools
- Bilingual Mystery Skype
- Deaf/Hard of Hearing Students

Seattle and Beijing, China

This is Vinny, who participated in the Mystery Skype session we had with the schools in Beijing. Vinny's deaf, so it was wonderful for him to participate in these calls with his classmates. Even when he was unable to hear the response back from the students, he could read the translation of what they were saying.

Deaf and Hard of Hearing Students
In Seattle Public Schools, Jean Rogers (Chief Audiologist) and Liz Hayden (Teacher of the Deaf) had an idea: use Skype Translator with the mainstreamed deaf and hard-of-hearing kids.


Although the use here demonstrates the technology with deaf or hard-of-hearing students, it's not much of a stretch to adapt it, since the components already exist, to hearing students who speak other languages. In fact, it could be used in that manner now. We haven't tested it in this scenario, yet.

S2S in the Classroom
https://www.microsoft.com/en-us/design/inclusive#inclusive-skype_video