Overview of Automatic Speech Recognition (ASR) for embedded devices: large-vocabulary, continuous speech recognition.
• Technical overview
• Potential applications
• Upcoming alternatives to embedded engines
Presented at DevDays, Belfast, UK, 24 April 2009, by Louis-Marie Aubert, ECIT, Queen's University Belfast.
Speech Recognition on embedded devices
Louis-Marie Aubert
ECIT – Queen’s University Belfast
DevDays – Belfast – April 24, 2009
What should we expect from speech recognition?
Speech Recognition success?
• Natural continuous speech
• Real-time
• Large vocabulary (up to 100,000 words)
• No training (speaker independent)
• Adaptive to speaker accent
• Robust against
– Background noise
– Audio frontend imperfections
• N-best hypotheses with confidence value
What are the solutions on the market?
Existing solutions
• Server-based
– Telephony, IVR
– Dictation (health care industry)
– Audio indexing
Either offline or with significant delays
Existing solutions
• Desktop-based
– Real-time dictation
– Language learning
Requires a good setup, powerful computer, quiet environment
Very good accuracy, no training required
Existing solutions
• Embedded applications
– Simple voice commands (‘Call-mum’ type commands)
– Disconnected word recognition
Small vocabulary and lack of naturalness restrict the range of applications
Is it so difficult?
Technical challenge
Speech waveform → Speech Recognizer → Transcription (‘Hello world’)
Technical challenge
Speech waveform → Spectral Analyser → Acoustic feature vectors (~40 coeff. every 10 ms)
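As a rough illustration of the data rate this front end implies, the sketch below only cuts a waveform into frames spaced 10 ms apart and counts the resulting feature values; it does not perform the spectral analysis itself. The 16 kHz sample rate, the 25 ms window and the name `frame_starts` are assumptions for the example, not figures from the slides.

```python
def frame_starts(num_samples, sample_rate=16000, hop_ms=10, window_ms=25):
    """Return the start index of each analysis frame.

    sample_rate and window_ms are assumed values; only the 10 ms
    frame spacing comes from the slide.
    """
    hop = sample_rate * hop_ms // 1000        # samples between frame starts
    window = sample_rate * window_ms // 1000  # samples covered by one frame
    if num_samples < window:
        return []
    return list(range(0, num_samples - window + 1, hop))


COEFFS_PER_FRAME = 40  # "~40 coeff." on the slide

one_second = frame_starts(16000)  # one second of audio at the assumed rate
values_per_second = len(one_second) * COEFFS_PER_FRAME
print(len(one_second), "frames,", values_per_second, "feature values per second")
```

Under these assumptions the recognizer receives close to 100 feature vectors per second, which is the rate every later stage has to keep up with.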
Technical challenge
Acoustic feature vectors → Recognizer → Transcription (‘Hello world’)
Recognizer components: Senone calculation, Acoustic Models, Phoneme Lexicon, Word Lexicon, Statistical Language Model, Viterbi decoding
Acoustic Models
• Senone calculation: multi-dimensional Gaussian mixture calculation
• 4000 acoustic models
• Sub-acoustic unit
• Functions that score 10 ms of speech
• Each model is a mixture of 16 Gaussians, each defined by a 40-long mean vector and a 40-long variance vector
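A minimal sketch of how one such acoustic model could score a single 10 ms feature vector, assuming diagonal covariances (which is what separate mean and variance vectors suggest) and explicit mixture weights. The function names are hypothetical and the slides do not state how the weights are stored.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def senone_log_score(x, weights, means, variances):
    """Log-likelihood of one feature vector under a Gaussian mixture."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)  # log-sum-exp keeps the sum numerically stable
    return peak + math.log(sum(math.exp(lp - peak) for lp in logs))


# Parameter count implied by the slide's figures (mixture weights excluded)
DIM, MIXTURES, MODELS = 40, 16, 4000
params = MODELS * MIXTURES * 2 * DIM  # one mean and one variance per dimension
print(params)  # 5120000
```

With the slide's figures this comes to about 5 million mean and variance values, and in the worst case every model is evaluated for every 10 ms frame.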
Technical challenge
Acoustic feature vectors → Recognizer → Transcription
Recognizer components: Multi-dimensional Gaussian mixture calculation, Senone Lexicon, Phoneme Lexicon, Word Lexicon, Statistical Language Model, Viterbi decoding
Phoneme
• 50 in English
• Distinguishable sounds
• Each is a sequence of senones, modelled by an HMM (Hidden Markov Model)
‘ah’: ah1 ah2 ah3
‘l’: l1 l2 l3
Triphone
• 2500 in English
• Distinguishable sounds in their context (continuous speech)
‘hh-ah+l’: ah1 ah2 ah3
‘ah-l+ow’: l1 l2 l3
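The left-centre+right notation used above can be generated mechanically from a phoneme sequence. The sketch below is one way to do it; the `sil` (silence) padding at the word edges and the name `to_triphones` are assumptions for the example, not something the slides specify.

```python
def to_triphones(phonemes, boundary="sil"):
    """Expand a phoneme sequence into 'left-centre+right' triphone labels.

    The boundary symbol used at both ends is an assumption for this sketch.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


hello = to_triphones(["hh", "ah", "l", "ow"])
print(hello)  # ['sil-hh+ah', 'hh-ah+l', 'ah-l+ow', 'l-ow+sil']
```

The two middle labels match the examples on the slide. In continuous speech the boundary context would come from the neighbouring words rather than from silence.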
Technical challenge
Acoustic feature vectors → Recognizer → Transcription
Recognizer components: Senone calculation, Acoustic Models, Triphone Lexicon, Word Lexicon, Statistical Language Model, Viterbi decoding
Word
• Large vocabulary: 64,000 words
• Each word is a sequence of phonemes/triphones
‘hello’: hh ah l ow
‘world’: w er l d
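To show how the lexicon levels stack, the sketch below expands a word into phonemes and then into HMM state names. It assumes three states per phoneme, as drawn for ‘ah’ and ‘l’ earlier in the deck. `LEXICON` holds only the two words from the slide, and `word_to_states` is a hypothetical helper.

```python
# Two-entry lexicon copied from the slide; a real one would hold ~64,000 words.
LEXICON = {
    "hello": ["hh", "ah", "l", "ow"],
    "world": ["w", "er", "l", "d"],
}


def word_to_states(word, states_per_phoneme=3):
    """Expand a word into its phoneme-level HMM state names (e.g. 'ah1')."""
    return [
        f"{phoneme}{k}"
        for phoneme in LEXICON[word]
        for k in range(1, states_per_phoneme + 1)
    ]


print(word_to_states("hello"))
```

A four-phoneme word therefore occupies twelve states in the search network, before any context-dependent (triphone) expansion.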
Statistical language model
• Bi-gram / Tri-gram
• Gives the probability of a sequence of 2 or 3 words
• A 64,000-word vocabulary leads to roughly 10 million states / 50 million arcs
Example word graph: hello, world, mum, dad, with arc probabilities 0.2, 0.05 and 0.3
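A bi-gram model can be sketched as a table of word-pair probabilities that are multiplied along the sentence (added, in the log domain). The numbers below reuse the values from the slide, but which value belongs to which arc is an assumption. The `<s>` start symbol, its 0.1 probability and the `floor` for unseen pairs are also invented for the example.

```python
import math

# Hypothetical bi-gram table: P(next word | previous word)
BIGRAMS = {
    ("<s>", "hello"): 0.1,
    ("hello", "world"): 0.2,
    ("hello", "mum"): 0.3,
    ("hello", "dad"): 0.05,
}


def sentence_log_prob(words, bigrams, floor=1e-6):
    """Sum of log bi-gram probabilities; unseen pairs get a small floor."""
    total = 0.0
    prev = "<s>"
    for word in words:
        total += math.log(bigrams.get((prev, word), floor))
        prev = word
    return total


for sentence in (["hello", "world"], ["hello", "mum"], ["hello", "dad"]):
    print(" ".join(sentence), round(sentence_log_prob(sentence, BIGRAMS), 3))
```

During decoding these scores are combined with the acoustic scores, so an acoustically ambiguous word is resolved in favour of the more probable word sequence.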
Technical challenge
Recognizer search network: ~25 million states / 250 million arcs
Viterbi decoding
• Token passing algorithm
• 5,000 to 10,000 tokens to propagate every 10 ms
• Select the most promising tokens and output the associated sequence of: senones → triphones → words → sentence
Example HMM state network: l1 l2 l3, s1 s2 s3, ow1 ow2 ow3, ey1 ey2 ey3, d1 d2 d3, v1 v2 v3
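The token passing idea can be sketched on a toy network: each token carries a score and a history, is copied along every outgoing arc at each frame, and only the best tokens survive. This is a simplified illustration, not the presenter's engine. The arc table, the per-frame acoustic scores and the `max_tokens` pruning rule are all assumptions made for the example.

```python
import math


def token_passing(frame_scores, arcs, start, max_tokens=5000):
    """Viterbi search by token passing with a simple top-N pruning.

    frame_scores: one dict per frame, state -> log acoustic score.
    arcs: state -> list of (next_state, log transition probability).
    Returns (best log score, state path), or None if no token survives.
    """
    tokens = {start: (0.0, [])}  # state -> (log score, path so far)
    for scores in frame_scores:
        new_tokens = {}
        for state, (score, path) in tokens.items():
            for nxt, log_p in arcs.get(state, []):
                if nxt not in scores:
                    continue
                cand = score + log_p + scores[nxt]
                # Keep only the best token arriving in each state
                if nxt not in new_tokens or cand > new_tokens[nxt][0]:
                    new_tokens[nxt] = (cand, path + [nxt])
        ranked = sorted(new_tokens.items(), key=lambda kv: kv[1][0], reverse=True)
        tokens = dict(ranked[:max_tokens])  # prune to the most promising tokens
    if not tokens:
        return None
    _, (score, path) = max(tokens.items(), key=lambda kv: kv[1][0])
    return score, path


HALF = math.log(0.5)
# Toy left-to-right HMM for 'l' with self-loops (illustrative values)
ARCS = {
    "start": [("l1", 0.0)],
    "l1": [("l1", HALF), ("l2", HALF)],
    "l2": [("l2", HALF), ("l3", HALF)],
    "l3": [("l3", 0.0)],
}
FRAMES = [
    {"l1": -1.0, "l2": -5.0, "l3": -5.0},
    {"l1": -1.0, "l2": -5.0, "l3": -5.0},
    {"l1": -5.0, "l2": -1.0, "l3": -5.0},
    {"l1": -5.0, "l2": -5.0, "l3": -1.0},
]
result = token_passing(FRAMES, ARCS, "start", max_tokens=2)
print(result)
```

Here `max_tokens=2` plays the role of the 5,000 to 10,000 token budget mentioned above. A production decoder would also store word boundaries in each token so that the word sequence can be read back from the winning token.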
Challenges in embedded systems
• Low computational resources
• Power consumption constraints
• Noisy environment, poor audio quality
For a truly embedded speech recognition engine that works, we must move away from the pure software approach:
• Make the best of all hardware acceleration available
• Dedicated chip (accelerator) to offload the CPU and relax memory constraints
Why do we want speech recognition on embedded devices anyway?
Applications on mobiles
• Complement touch screen interface with speech interface
• Speech enable existing mobile applications
– Browse complex menus
– Easily find items in large libraries, local or online (contacts, music…)
– Browse Web and search maps
– Games
– Compose text-messages, emails…
Applications on mobiles
• Speech enable mobile applications
Rubicon, "The Apple iPhone: Successes and Challenges for the Mobile Industry", 31 March 2008
Applications on mobiles
• Key to safety when driving
– Text-messaging
– Satellite-Navigation function
• Voice Memo
– Shopping list
– Activity scheduler
• Market of Speech technology in embedded devices
– $125 million in 2006
– $500 million in 2010 (forecast)
Opus Research report, March 2007
Other markets
• Developing countries
– Access to information technology for illiterate people
  • Administrative tasks
  • Education
  • Social integration
• Health-care at home (self-manage diseases)
– Exploding market
  • Chronic diseases
  • Elderly people (Baby Boomers reach retirement age)
  • Market for home health care products is valued at $4.3 billion today
– Place for Speech recognition
  • Inexperience of patients with electronic interfaces
  • Poor physical condition (e.g. low vision)
  • Illiteracy
Medical device today, March 2009
Other applications
• Speech translation
– IraqCom
Okay, I can’t wait! Is there anything I can use now?
Upcoming solutions
• Voicemail accessible via text-message, email or dedicated application
– Server-based
– Requires agreement and implementation by the carriers
Upcoming solutions
• Nuance Voice Control 2
– Online search
– Text-messaging
• Embedded software for simple voice command
• Server-based engine for large vocabulary speech recognition
• Speech Recognition API on Android 1.5
So?
Conclusion
• A truly embedded speech recognition system
– A range of exciting applications
  • Real-time dictation with no perceived delay
  • Natural language interface (ASR + TTS)
  • Applications independent of the carrier
– But… not available yet!
• New speech recognition APIs are arriving soon
– Rely on network/server availability
– Can still lead to innovative applications
Conclusion
• Keys to success
– Robustness, accuracy
– Fast to load and execute
– Well designed interface
  • Speech cannot be used on its own
  • Should be cleverly combined with other interfaces
    – Graphical
    – Touch
    – …
– Don’t put customers off with clumsy speech recognition widgets, again!
Questions?