
CS 479, section 1: Natural Language Processing
Lecture #16: Speech Recognition Overview (cont.)

This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License. Thanks to Alex Acero (Microsoft Research), Jeff Adams (Nuance), Simon Arnfield (Sheffield), Dan Klein (UC Berkeley), and Mazin Rahim (AT&T Research) for many of the materials used in this lecture.
http://www.psyc.leeds.ac.uk/research/cogn/speech/tutorial/

Announcements
Reading Report #6 on Young's overview
Due: now

Reading Report #7 on M&S 7
Due: Friday

Review Questions
Typed list of 5 questions for mid-term exam review
Due next Wednesday

Objectives
Continue our overview of an approach to speech recognition, picking up at acoustic modeling

See other examples of the source / channel (noisy channel) paradigm for modeling interesting processes

Apply language models

Recall: Front End

[Diagram: noisy-channel view of ASR. A Source emits Text, the Noisy Channel turns it into Speech, and the recognizer (ASR) recovers Text; the Front End (FE) converts Speech into Features.]
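The diagram above is the source/channel view that the rest of the lecture builds on. As standard background (the equation itself is not on this slide), the recognition problem it implies is usually written as a Bayes-rule argmax over word sequences W given acoustic features A:

```latex
% Standard noisy-channel objective for ASR (background; W = word sequence, A = acoustic features)
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\,P(W)}{P(A)}
        = \arg\max_{W} \underbrace{P(A \mid W)}_{\text{acoustic model}}\;\underbrace{P(W)}_{\text{language model}}
```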

Acoustic Modeling
Goal: Map acoustic feature vectors into distinct linguistic units
Such as phones, syllables, words, etc.
[Architecture diagram: Feature Extraction; Decoder/Search; Language Model; Acoustic Model; Word Lexicon]

Acoustic Trajectories

[Scatter plot: phoneme labels (EE, EY, IH, AY, AW, OY, etc.) plotted as trajectories in the acoustic feature space; 39 dimensions reduced to 2 for plotting]

Acoustic Models: Neighborhoods are not Points
How do we describe which points in our feature space are likely to come from a given phoneme? It's clearly more complicated than just identifying a single point.
Also, the boundaries are not clean.
Use the normal distribution:
Points are likely to lie near the center.
We describe the distribution with the mean & variance.
Easy to compute with

Acoustic Models: Neighborhoods are not Points (2)
Normal distributions in M dimensions are analogous
A.k.a. Gaussians
Specify the mean point in M dimensions
Like an M-dimensional hill centered around the mean point
Specify the variances (as a covariance matrix)
The diagonal gives the widths of the distribution in each direction
Off-diagonal values describe the orientation
Full covariance: possibly tilted
Diagonal covariance: not tilted

AMs: Gaussians don't really cut it
Consider the AY frames in our example. How can we describe these with an (elliptical) Gaussian?
A single (diagonal) Gaussian is too big to be helpful.
Full-covariance Gaussians are hard to train.
We often use multiple Gaussians (a.k.a. Gaussian mixture models).

(1-dimensional) Gaussian Mixture Models

From www.igi.tugraz.at/lehre/CI/tutorials/MixtGaussian/MixtGaussian.pdf, "Mixtures of Gaussians: A Tutorial for the Course Computational Intelligence"
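To make the mixture idea concrete, here is a minimal sketch (not from the slides) of evaluating a diagonal-covariance Gaussian mixture density over feature vectors; the component weights, means, and variances are invented illustrative values.

```python
import numpy as np

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * float(np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

def log_gmm(x, weights, means, variances):
    """Log density of a Gaussian mixture: log sum_k w_k * N(x; mean_k, var_k)."""
    comps = [np.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(comps)  # log-sum-exp trick for numerical stability
    return top + np.log(sum(np.exp(c - top) for c in comps))

# Toy 2-component mixture over 3-dimensional "features" (all values invented).
weights   = [0.6, 0.4]
means     = [np.array([0.0, 1.0, -1.0]), np.array([2.0, -0.5, 0.5])]
variances = [np.array([1.0, 0.5, 2.0]),  np.array([0.8, 1.5, 1.0])]

x = np.array([0.3, 0.7, -0.2])
print(log_gmm(x, weights, means, variances))
```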

[Diagram: a phoneme modeled as a path through STATE 1, STATE 2, STATE 3, STATE 4]

AMs: Phonemes are a path, not a destination
Phonemes, like stories, have beginnings, middles, and ends.
This might be clear if you think of how the AY sound moves from a sort of EH to an EE.
Even non-diphthongs show these properties.
We often represent a phoneme with multiple states.
E.g. in our AY model, we might have 4 states.
And each of these states is modeled by a mixture of Gaussians.

AMs: Whence & Whither
It matters where you come from (whence) and where you are going (whither).
Phonetic contextual effects
A way to model this is to use triphones
I.e. depend on the previous & following phonemes
E.g. our AY model should really be a silence-AY-S model
(or pentaphones: use 2 phonemes before & after)
So what we really need for our AY model is a:
Mixture of Gaussians
For each of multiple states
For each possible set of predecessor & successor phonemes

Hidden Markov Model (HMM)
Captures:
Transitions between hidden states
Feature emissions as mixtures of Gaussians
Spectral properties modeled by a parametric random process
I.e., a directed graphical model!
Advantages:
Powerful statistical method for a wide range of data and conditions
Highly reliable for recognizing speech
A collection of HMMs for each:
sub-word unit type
extraneous event: cough, um, sneeze, ...
More on HMMs coming up in the course after classification!
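As a concrete illustration of the structure just described, a phone HMM can be sketched as a set of states, a transition matrix, and per-state emission scorers. Everything below (class name, state count, transition values, stub emissions) is a hypothetical sketch, not the model of any actual recognizer; the emission stubs stand in for the per-state Gaussian mixtures discussed above.

```python
import numpy as np

class PhoneHMM:
    """Toy left-to-right phone HMM; each hidden state has its own emission scorer.

    Illustrative only: real acoustic models tie trained GMM emission
    distributions to context-dependent (e.g. triphone) states."""

    def __init__(self, name, transitions, emission_scorers):
        self.name = name
        # transitions[i][j] = P(state j at t+1 | state i at t); rows need not sum
        # to 1 here because leftover mass is the probability of exiting the model.
        self.transitions = np.asarray(transitions)
        # emission_scorers[i](x) returns log P(x | state i), e.g. a GMM log density.
        self.emission_scorers = emission_scorers

def stub_emission(center):
    """Placeholder emission scorer standing in for a trained Gaussian mixture."""
    return lambda x: -0.5 * float(np.sum((np.asarray(x) - center) ** 2))

# A hypothetical 3-state model for /AY/ preceded by silence and followed by /S/.
# The transition probabilities below are invented for illustration.
ay_model = PhoneHMM(
    name="sil-AY+S",
    transitions=[[0.5, 0.5, 0.0],
                 [0.0, 0.7, 0.3],
                 [0.0, 0.0, 0.8]],
    emission_scorers=[stub_emission(c) for c in (0.0, 1.0, 2.0)],
)

frame = np.zeros(39)  # one 39-dimensional acoustic feature vector
print(ay_model.name, ay_model.emission_scorers[0](frame))
```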

The basic theory of HMMs was published in a series of papers by Baum and colleagues in the 60s, and was incorporated into speech by Baker at CMU in the 70s and by Jelinek at IBM in the 70s. It was popularized by Rabiner and colleagues in the 80s.

Anatomy of an HMM
HMM for /AY/ in the context of preceding silence, followed by /S/

[State diagram: left-to-right HMM with states sil-AY+S[1], sil-AY+S[2], sil-AY+S[3]; the arcs carry self-loop and forward transition probabilities (0.5, 0.2, 0.3, 0.2, 0.8, 0.7, 0.8)]

From CSE 552/652, Hidden Markov Models for Speech Recognition

Spring 2005, Oregon Health & Science University, OGI School of Science & Engineering

John-Paul Hosom

Lecture Notes for April 13: HMMs for speech; review anatomy/framework of HMM
http://www.cse.ogi.edu/class/cse552/, Slide 20

HMMs as Phone Models
[State diagram: the same three-state sil-AY+S HMM, repeated]

Words and Phones

How do we know how to segment words into phones?

Word Lexicon
Goal: Map sub-word units into words
Usual sub-word units are phone(me)s

Lexicon: (CMUDict, ARPABET)
Phoneme   Example   Translation
AA        odd       AA D
AE        at        AE T
AH        hut       HH AH T
AO        ought     AO T
AW        cow       K AW
AY        hide      HH AY D
B         be        B IY
CH        cheese    CH IY Z

Properties:
Simple
Typically knowledge-engineered (not learned, shock!)
[Architecture diagram: Feature Extraction; Decoder/Search; Language Model; Acoustic Model; Word Lexicon]
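As a small illustration of what the word lexicon provides, the mapping can be treated as a dictionary from words to ARPABET phone sequences. The entries below are copied from the table above; the lookup helper itself is invented for illustration and is not CMUDict's actual interface.

```python
# Toy word lexicon mapping words to ARPABET phone sequences
# (entries copied from the slide's CMUDict excerpt; the lookup logic is illustrative).
LEXICON = {
    "odd":    ["AA", "D"],
    "at":     ["AE", "T"],
    "hut":    ["HH", "AH", "T"],
    "ought":  ["AO", "T"],
    "cow":    ["K", "AW"],
    "hide":   ["HH", "AY", "D"],
    "be":     ["B", "IY"],
    "cheese": ["CH", "IY", "Z"],
}

def phones_for(word):
    """Return the phone sequence for a word, or None if it is out of vocabulary."""
    return LEXICON.get(word.lower())

print(phones_for("hide"))    # ['HH', 'AY', 'D']
print(phones_for("cheese"))  # ['CH', 'IY', 'Z']
```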

Decoder

[Diagram, repeated: a Source emits Text, the Noisy Channel turns it into Speech, ASR recovers Text; the Front End (FE) converts Speech into Features]

Decoding: as State-Space Search
[Architecture diagram: Feature Extraction; Pattern Classification; Language Model; Acoustic Model; Word Lexicon]
Bird's Eye View

Decoding as Search
Viterbi dynamic programming
Multi-pass
A* (stack decoding)
N-best

Viterbi: DP

Noisy Channel Applications
Speech recognition (dictation, commands, etc.): text -> neurons, acoustic signal, transmission -> acoustic waveforms -> text
OCR: text -> print, smudge, scan -> image -> text
Handwriting recognition: text -> neurons, muscles, ink, smudge, scan -> image -> text
Spelling correction: text -> your spelling -> mis-spelled text -> text
Machine Translation (?): text in target language -> translation in head -> text in source language -> text in target language
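Since the slide only names Viterbi dynamic programming, here is a minimal sketch of the algorithm for a generic HMM in log space. The matrix representation, function name, and the tiny 2-state example are illustrative assumptions, not the decoder organization of a real recognizer (which also searches over lexicon and language-model constraints).

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely hidden-state path for one observation sequence.

    log_init:  (S,)   log P(state at t=0)
    log_trans: (S, S) log P(state j at t+1 | state i at t)
    log_emit:  (T, S) log P(observation at t | state), emissions already scored
    Returns the best state sequence as a list of state indices.
    """
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)    # best log score ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans        # (prev state, next state)
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(S)] + log_emit[t]
    # Trace back the best path from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Tiny made-up example: 2 states, 4 frames.
log_init = np.log([0.6, 0.4])
log_trans = np.log([[0.7, 0.3], [0.4, 0.6]])
log_emit = np.log([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]])
print(viterbi(log_init, log_trans, log_emit))  # [0, 0, 1, 1]
```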

Noisy-Channel Models
OCR

Handwriting recognition

Spelling Correction

Translation?

What's Next
Upcoming lectures:
Classification / categorization
Naive-Bayes models
Class-conditional language models

Extra

Milestones in Speech Recognition
[Timeline figure, years 1962-2003, showing five eras:
Small Vocabulary, Acoustic Phonetics-based (Isolated Words): filter-bank analysis, time-normalization, dynamic programming.
Medium Vocabulary, Template-based (Isolated Words, Connected Digits, Continuous Speech): pattern recognition, LPC analysis, clustering algorithms, level building.
Large Vocabulary; Syntax, Semantics (Continuous Speech, Speech Understanding): stochastic language understanding, finite-state machines, statistical learning.
Large Vocabulary, Statistical-based (Connected Words, Continuous Speech): hidden Markov models, stochastic language modeling.
Very Large Vocabulary; Semantics, Multimodal Dialog (Spoken dialog, multiple modalities): concatenative synthesis, machine learning, mixed-initiative dialog.]

1950s: Various research labs tried to exploit the fundamentals of acoustics/phonetics for ASR. Isolated-digit recognition experiments were reported by Bell Labs for a single speaker, relying heavily on measuring spectral resonances during the vowel regions of digits.
1960s: Three key research developments in ASR: (1) RCA Labs developed solutions for dealing with the non-uniformity of time scales in speech events; (2) Vintsyuk (Russia) proposed dynamic programming (it only became popular in the West in the 70s and 80s); (3) dynamic tracking of phonemes at CMU.
1970s: The use of pattern recognition for ASR was developed. Itakura applied the idea of LPC to ASR (already established in coding). IBM demonstrated simple dictation systems (Tangora). AT&T began a series of efforts toward making ASR truly speaker independent.
1980s: Connected word recognition was further developed with the creation of level-building dynamic programming methods. This decade was characterized by a shift from template-based methods to statistical modeling methods, essentially the use of HMMs. DARPA invested in LVASR (e.g., SPHINX/CMU, BYBLOS/BBN).
1990s: Emphasis on natural language, with ASR being increasingly used in telephone networks and enhanced operator services.

VRCP - automation of 0+ calls (AT&T Labs invented a new word-spotting technique, which made this service viable). Saves AT&T $200M-$300M. (Deployed in 5ESS/OSPS.)
800 Voice Recognition - automation of AT&T's Call Prompter service using voice commands (e.g., "Press or say 1 for information"). Deployed in 4ESS/Isaic.
DA Call Completion - allows AT&T to complete a call made to AT&T's DA. An ASR system listens to the digits played out by RBOC operators.
VRCP 2.0 - added connected digit recognition to allow for more automation of calling card and third-number billed phone calls.
AT&T VoiceLine - gave customers voice dialing by name and number. Speaker verification was also deployed to give added security but it was not turned on. Only a couple of thousand users subscribed.
Universal Card - connected digit recognition is used by customers to enter their account number in order to access UCS's customer care line.
Voice Touch - deployed in the AWS network pre-AT&T/McCaw merger. It provides voice dialing from cell phones (uses non-AT&T speech technology).
AT&T Direct - ASR is used to automate international calls made back to the US through the AT&T Direct service.
00 automation - when AT&T customers dial 00 to get to an operator, they hit a speech-controlled user interface.
SDN/NRA - connected digit recognition is used to allow SDN/NRA subscribers to do voice dialing.
Infoworx - AT&T Solutions uses Infoworx to deploy IVR and voice-enabled services for its customers.
CALL ATT R2.0 - provides voice control of the CALL AT&T service. This was developed and deployment started; then it was halted.
ATPS - this was a BMD/Infoworx play. Advanced voice-enabled services such as hotel or car rental services could now be developed. It was fully developed, but not deployed.
Universal VoiceLine - provided voice dialing to consumers. It was accessed via an 8YY number, with auto login from AT&T cell phones.
HMIHY - provides very advanced automation of 00- customer care using a sophisticated voice interface. This uses large-vocabulary ASR, language understanding, and dialog management.
OneReach - used TTS to provide customers with information over the phone.
VoiceTone - see earlier viewgraphs.
AT&T Mail Center - initially will use TTS to read e-mail messages over the phone. In 2000, Natoce asked for adding ASR to provide full voice control of messaging.
800 Gold - provided simple voice dialing of 3 words (home, office, voicemail). Developed; deployment on hold. (Uses non-AT&T speech technology.)

Dragon Dictate Progress
WERR* from Dragon NaturallySpeaking version 7 to version 8 to version 9:

Domain        7->8    8->9
US English    27%     23%
UK English    21%     10%
German        16%     10%
French        24%     14%
Dutch         27%     18%
Italian       22%     14%
Spanish       26%     17%

* WERR means relative word error rate reduction on an in-house evaluation set.
Results from Jeff Adams, ca. 2006
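For concreteness (the numbers here are invented, not taken from the table), a relative word error rate reduction compares the WER before and after:

```latex
% Illustrative only: the 10% and 7.3% WER figures are made up to show the arithmetic.
\text{WERR} = \frac{\text{WER}_{\text{old}} - \text{WER}_{\text{new}}}{\text{WER}_{\text{old}}}
            = \frac{10\% - 7.3\%}{10\%} = 27\%
```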

Crazy Speech Marketplace
[Diagram: consolidation of the speech marketplace from ca. 1980 to ca. 2004; companies such as IBM, Speechworks, Philips, Dictaphone, Inso, L&H, ScanSoft, Dragon, Kurzweil, Nuance, MedRemote, Articulate, etc. converge into Dictaphone, Nuance, VoiceSignal, Tegic, etc.]

Speech vs. text: tokens vs. characters
Speech recognition recognizes a sequence of tokens taken from a discrete & finite set, called the lexicon.
Informally, tokens correspond to words, but the correspondence is inexact. In dictation applications, where we have to worry about converting between speech & text, we need to sort out a token philosophy:
Do we recognize "forty-two" or "forty two" or "42" or "40 2"?
Do we recognize "millimeters" or "mms" or "mm"?
What about common words which can also be names, e.g. "Brown" and "brown"?
What about capitalized phrases like "Nuance Communications" or "The White House" or "Main Street"?
What multi-word tokens should be in the lexicon, like "of_the"?
What do we do with complex morphologies or compounding?

Converting between tokens & text
TEXT: Profits rose to $28 million. See fig. 1a on p. 124.
TOKENS: profits rose to twenty eight million dollars .\period see figure one a\a on page one twenty four .\period
(TEXT -> TOKENS is tokenization; TOKENS -> TEXT is ITN; both are governed by the token philosophy and the lexicon. A toy sketch of the tokens-to-text direction appears after these slides.)

Three examples (Tokenization)
TEXT: P.J. O'Rourke said, "Giving money and power to government is like giving whiskey and car keys to teenage boys."
TOKENS: PJ O'Rourke said ,\comma "\open-quotes giving money and power to government is like giving whiskey and car keys to teenage boys .\period "\close-quotes
TEXT: The 18-speed I bought sold on www.eBay.com for $611.00, including 8.5% sales tax.
TOKENS: the eighteen speed I bought sold on www.\WWW_dot eBay .com\dot_com for six hundred and eleven dollars zero cents ,\comma including eight .\point five percent sales tax .\period
TEXT: From 1832 until August 15, 1838 they lived at No. 235 Main Street, "opposite the Academy," and from there they could see it all.
TOKENS: from one eight three two until the fifteenth of August eighteen thirty eight they lived at number two thirty five Main_Street ,\comma "\open-quotes opposite the Academy ,\comma "\close-quotes and from there they could see it all .\period

Missing from speech: punctuation
Unpunctuated: When people speak they dont explicitly indicate phrase and section boundaries instead listeners rely on prosody and syntax to know where these boundaries belong in dictation applications we normally rely on speakers to speak punctuation explicitly how can we remove that requirement
Punctuated:
When people speak, they don't explicitly indicate phrase and section boundaries.
Instead, listeners rely on prosody and syntax to know where these boundaries belong.
In dictation applications, we normally rely on speakers to speak punctuation explicitly.
How can we remove that requirement?
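As referenced above, here is a toy sketch of the tokens-to-text (ITN) direction. The backslash token notation follows the slides, but the rule table, function name, and join heuristic are invented for illustration and are far simpler than a real inverse-text-normalization system.

```python
# Toy inverse text normalization (ITN): spoken-form tokens -> written text.
# Handles only the "\period"-style punctuation tokens shown on the slides;
# a real ITN system also rewrites numbers, dates, currency, etc. (rules or FSTs).
PUNCT = {"period": ".", "comma": ",", "open-quotes": '"', "close-quotes": '"'}

def itn(tokens):
    words = []
    for tok in tokens:
        if "\\" in tok:
            spoken, written_tag = tok.split("\\", 1)   # e.g. ".\period" -> (".", "period")
            words.append(PUNCT.get(written_tag, spoken))
        else:
            words.append(tok)
    text = " ".join(words)
    # Attach punctuation to the preceding word: "boys ." -> "boys."
    for p in [" .", " ,"]:
        text = text.replace(p, p.strip())
    return text

tokens = ('giving money and power to government is like giving whiskey '
          'and car keys to teenage boys .\\period').split()
print(itn(tokens))  # ... whiskey and car keys to teenage boys.
```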

Punctuation Guessing Example

Punctuation Guessing
As currently shipping in Dragon
Targeted towards free, unpunctuated speech

My personal experience with camping has been rather limited. Having lived overseas in a very urban situation in which camping in the wilderness is not really possible. My only chances at camping came when I returned to the United States. My most memory, I had two most memorable camping trips both with my father. My first one was when I was a preteen, and we went hiking on Bigalow mountain in Maine, central western Maine. We went hiking for a day took a trail that leads off of the Appalachian Trail and goes down to the village of Stratton in the township of Eustis, just north and west of Sugarloaf U.S.A., the ski area.