dev days, speech recognition, lm aubert

Speech Recognition on embedded devices

Louis-Marie Aubert, ECIT – Queen’s University Belfast

DevDays – Belfast – April 24, 2009

What should we expect from speech recognition?

Speech Recognition success?

• Natural continuous speech
• Real-time
• Large vocabulary (up to 100,000 words)
• No training (speaker independent)
• Adaptive to speaker accent
• Robust against
– Background noise
– Audio frontend imperfections
• N-best hypotheses with confidence value

What are the solutions on the market?

Existing solutions

• Server-based
– Telephony, IVR
– Dictation (health care industry)
– Audio indexing

Either offline or with significant delays

Existing solutions

• Desktop-based
– Real-time dictation
– Language learning

Requires a good setup, powerful computer, quiet environment
Very good accuracy, no training required

Existing solutions

• Embedded applications
– Simple voice commands (‘Call-mum’ type command)
– Disconnected word recognition

Small vocabulary and lack of naturalness restrict the range of applications

Is it so difficult?

Technical challenge

Speech waveform → Speech Recognizer → Transcription (‘Hello world’)

Technical challenge

Speech waveform → Spectral Analyser → Acoustic feature vectors (~40 coeff. every 10 ms)
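As a rough illustration of the spectral analysis step, the sketch below cuts a waveform into frames spaced 10 ms apart and turns each frame into 40 coefficients. The 16 kHz sample rate, the 25 ms Hamming window and the use of plain log energies in 40 equal-width frequency bands are assumptions made for the example; the slides do not say which features the analyser actually computes.

```python
import cmath
import math

SAMPLE_RATE = 16000            # assumption: 16 kHz audio
HOP = SAMPLE_RATE // 100       # 10 ms between frames (from the slide)
WINDOW = 400                   # assumption: 25 ms analysis window
N_COEFF = 40                   # ~40 coefficients per frame (from the slide)


def frame_features(samples):
    """Return one list of N_COEFF log band energies per 10 ms frame."""
    features = []
    half = WINDOW // 2
    bins_per_band = half // N_COEFF
    for start in range(0, len(samples) - WINDOW + 1, HOP):
        frame = samples[start:start + WINDOW]
        # Hamming window to reduce spectral leakage
        windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (WINDOW - 1)))
                    for n, s in enumerate(frame)]
        # Naive DFT power spectrum (slow, but dependency-free)
        power = []
        for k in range(half):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / WINDOW)
                      for n, x in enumerate(windowed))
            power.append(abs(acc) ** 2)
        # Sum the power in each band and take the log
        features.append([
            math.log(sum(power[b * bins_per_band:(b + 1) * bins_per_band]) + 1e-10)
            for b in range(N_COEFF)
        ])
    return features
```

Each returned vector plays the role of one acoustic feature vector handed to the recognizer every 10 ms.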

Technical challenge

Acoustic feature vectors → Recognizer → Transcription (‘Hello world’)

Recognizer blocks: Senome calculation, Acoustic Models, Word Lexicon, Phoneme Lexicon, Statistical Language Model, Viterbi decoding

Technical challenge


Senome calculation: multi-dimensional Gaussian mixture calculation

Acoustic Models

• 4000 acoustic models

• Sub-acoustic unit

• Functions that score 10 ms of speech

• Each model: a mixture of 16 Gaussians, stored as 40-long mean and variance vectors
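To make the scoring function concrete, here is a minimal sketch of how one acoustic model could score one 10 ms feature vector. It assumes diagonal covariances (suggested by the slide's mean and variance vectors) and per-Gaussian mixture weights, which the slide does not mention; the function names are hypothetical. The units called senomes here are usually spelled "senones" in the literature.

```python
import math

DIM = 40      # feature vector length (from the slide)
N_MIX = 16    # Gaussians per mixture (from the slide)


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def senome_score(x, weights, means, variances):
    """Log-likelihood of one feature vector under one Gaussian mixture."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    # log-sum-exp keeps the sum numerically stable
    return top + math.log(sum(math.exp(value - top) for value in logs))
```

Working in the log domain avoids underflow: a 40-dimensional density is often far too small to represent directly.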


Technical challenge




Phoneme

• 50 in English

• Distinguishable sounds

• Each is modelled as a sequence of senomes: an HMM (Hidden Markov Model)

‘ah’: ah1 ah2 ah3

‘l’: l1 l2 l3
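The three-state chains above can be sketched as a small data structure. The left-to-right topology (each state either repeats or moves on) is the usual choice for phoneme HMMs and is assumed here; the slide only shows the three states. The `<exit>` marker and the function name are hypothetical.

```python
def phoneme_hmm(phoneme, n_states=3):
    """Build a left-to-right HMM: every state can loop on itself or advance."""
    states = [f"{phoneme}{i}" for i in range(1, n_states + 1)]
    transitions = {}
    for i, state in enumerate(states):
        following = states[i + 1] if i + 1 < n_states else "<exit>"
        transitions[state] = [state, following]
    return states, transitions


states, transitions = phoneme_hmm("ah")
print(states)        # ['ah1', 'ah2', 'ah3']
print(transitions)
```

The self-loops are what let a single phoneme stretch over a variable number of 10 ms frames.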

Technical challenge

Triphone

• 2500 in English

• Distinguishable sounds in their context → continuous speech

‘hh-ah+l’: ah1 ah2 ah3

‘ah-l+ow’: l1 l2 l3
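The `left-centre+right` notation can be generated mechanically from a phoneme sequence. The sketch below does so; padding the ends with a silence symbol `sil` is an assumption, since the slide does not show how word boundaries are handled.

```python
def to_triphones(phonemes, boundary="sil"):
    """Label each phoneme with its left and right neighbours."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]


print(to_triphones(["hh", "ah", "l", "ow"]))
# ['sil-hh+ah', 'hh-ah+l', 'ah-l+ow', 'l-ow+sil']
```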


Technical challenge




Word

• Large vocabulary: 64,000 words

• Each word represents a sequence of phonemes/triphones

‘hello’: hh ah l ow

‘world’: w er l d
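A word lexicon of this kind is essentially a lookup table. The sketch below holds only the two entries shown on the slide; the `pronounce` helper and its out-of-vocabulary error are hypothetical additions for illustration.

```python
LEXICON = {
    "hello": ["hh", "ah", "l", "ow"],
    "world": ["w", "er", "l", "d"],
}


def pronounce(sentence):
    """Expand a sentence into its phoneme sequence using the lexicon."""
    phonemes = []
    for word in sentence.lower().split():
        if word not in LEXICON:
            raise KeyError(f"out-of-vocabulary word: {word}")
        phonemes.extend(LEXICON[word])
    return phonemes


print(pronounce("Hello world"))
# ['hh', 'ah', 'l', 'ow', 'w', 'er', 'l', 'd']
```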


Technical challenge




Statistical language model

• Bi-gram / Tri-gram

• Gives the probability of a sequence of 2 or 3 words

• 64,000 words lead to roughly 10 million states / 50 million arcs

Example graph: words (hello, world, mum, dad) linked by arcs weighted with probabilities (0.2, 0.05, 0.3)
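A bi-gram model can be estimated by counting word pairs. The sketch below uses a toy corpus invented for the example and plain relative frequencies with no smoothing; practical language models add smoothing or back-off so that unseen pairs do not get zero probability.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Estimate P(next | previous) as count(previous, next) / count(previous)."""
    histories, pairs = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(words[:-1])
        pairs.update(zip(words[:-1], words[1:]))
    return {pair: count / histories[pair[0]] for pair, count in pairs.items()}


def sentence_logprob(model, sentence):
    """Sum the log probabilities of every consecutive word pair."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for pair in zip(words[:-1], words[1:]):
        p = model.get(pair, 0.0)
        if p == 0.0:
            return float("-inf")   # unseen pair: impossible without smoothing
        total += math.log(p)
    return total


# Toy corpus (made up for this example)
model = train_bigram(["hello world", "hello mum", "hello dad", "hello world"])
print(model[("hello", "world")])   # 0.5
```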


Technical challenge


Whole recognizer: ~25 million states / 250 million arcs

Technical challenge





Viterbi decoding

• Token passing algorithm

• 5,000 to 10,000 tokens to propagate every 10 ms

• Select the most promising tokens and output the associated sequence of: senomes → triphones → words → sentence

Example senome chains in the search network: l1 l2 l3, s1 s2 s3, ow1 ow2 ow3, ey1 ey2 ey3, d1 d2 d3, v1 v2 v3
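The token passing idea can be sketched on a tiny network. In the version below each active state holds one token (a score plus the path that led there); every frame, tokens are copied along the arcs, the best token per state is kept, and only the `max_tokens` best survive. The data layout, the function name and the toy scores are assumptions for illustration, not the decoder described in the talk.

```python
import math


def token_passing(frames, arcs, start_states, final_states, max_tokens=5000):
    """frames: list of {state: log score}; arcs: {state: [(next_state, log_prob)]}."""
    # One token per active state: state -> (log score, path so far)
    tokens = {s: (frames[0].get(s, -math.inf), [s]) for s in start_states}
    for frame in frames[1:]:
        new_tokens = {}
        for state, (score, path) in tokens.items():
            for nxt, log_p in arcs.get(state, []):
                candidate = score + log_p + frame.get(nxt, -math.inf)
                # Viterbi rule: keep only the best token entering each state
                if nxt not in new_tokens or candidate > new_tokens[nxt][0]:
                    new_tokens[nxt] = (candidate, path + [nxt])
        # Beam pruning: keep the most promising tokens only
        ranked = sorted(new_tokens.items(), key=lambda kv: kv[1][0], reverse=True)
        tokens = dict(ranked[:max_tokens])
    finals = [tok for state, tok in tokens.items() if state in final_states]
    return max(finals, key=lambda tok: tok[0]) if finals else None


# Toy example: one three-state chain with self-loops, four frames
half = math.log(0.5)
arcs = {"l1": [("l1", half), ("l2", half)],
        "l2": [("l2", half), ("l3", half)],
        "l3": [("l3", half)]}
frames = [{"l1": 0.0, "l2": -5.0, "l3": -5.0},
          {"l1": 0.0, "l2": -5.0, "l3": -5.0},
          {"l1": -5.0, "l2": 0.0, "l3": -5.0},
          {"l1": -5.0, "l2": -5.0, "l3": 0.0}]
best = token_passing(frames, arcs, ["l1"], {"l3"})
print(best[1])   # ['l1', 'l1', 'l2', 'l3']
```

A small `max_tokens` saves work but can discard the token that would have won later, which is the usual speed versus accuracy trade-off of beam search.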



Challenges in embedded systems

• Low computational resources
• Power consumption constraints
• Noisy environment, poor audio quality

For a truly embedded speech recognition engine that works, we must move away from the pure software approach:

• Make the best of all hardware acceleration available
• Dedicated chip (accelerator) to unload CPU and relax memory constraints
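A back-of-the-envelope budget shows why these constraints bite. The sketch below combines figures quoted earlier in the talk (4000 acoustic models, 16 Gaussians each, 40-long vectors, one frame every 10 ms) with two assumptions: values are stored as 4-byte floats, and each Gaussian carries one mixture weight. The compute figure is an upper bound that assumes every model is scored on every frame.

```python
N_MODELS = 4000        # acoustic models (from the slides)
N_MIX = 16             # Gaussians per model (from the slides)
DIM = 40               # vector length (from the slides)
FRAMES_PER_SEC = 100   # one frame every 10 ms (from the slides)
BYTES_PER_VALUE = 4    # assumption: 32-bit floats


def acoustic_model_bytes():
    """Storage for means, variances and one weight per Gaussian."""
    values_per_gaussian = 2 * DIM + 1
    return N_MODELS * N_MIX * values_per_gaussian * BYTES_PER_VALUE


def gaussian_terms_per_second():
    """Per-dimension terms to evaluate if all models are scored each frame."""
    return N_MODELS * N_MIX * DIM * FRAMES_PER_SEC


print(acoustic_model_bytes())        # 20736000 bytes, about 20.7 MB
print(gaussian_terms_per_second())   # 256000000 terms per second
```

Under these assumptions the acoustic models alone need about 20 MB, before counting the language model and the search network.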

Why do we want speech recognition on embedded devices anyway?

Applications on mobiles

• Complement touch screen interface with speech interface
• Speech enable existing mobile applications
– Browse complex menus
– Easily find items in large libraries, local or online (contacts, music…)
– Browse Web and search maps
– Games
– Compose text-messages, emails…

Applications on mobiles

• Speech enable mobile applications

Rubicon, "The Apple iPhone: Successes and Challenges for the Mobile Industry", 31 March 2008

Applications on mobiles

• Key to safety when driving
– Text-messaging
– Satellite-Navigation function
• Voice Memo
– Shopping list
– Activity scheduler
• Market of Speech technology in embedded devices
– $125 million in 2006
– $500 million in 2010

Opus Research report, March 2007

Other markets

• Developing countries
– Access to information technology for illiterate people
  • Administrative tasks
  • Education
  • Social integration
• Health-care at home (self-manage diseases)
– Exploding market
  • Chronic diseases
  • Elderly people (Baby Boomers reach retirement age)
  • Market for home health care products is evaluated at $4.3 billion today
– Place for Speech recognition
  • Inexperience of patients with electronic interfaces
  • Poor physical condition (e.g. low vision)
  • Illiteracy

Medical device today, March 2009

Other applications

• Speech translation
– IraqCom

Okay, I can’t wait! Is there anything I can use now?

Upcoming solutions

• Voicemail accessible via text-message, email or dedicated application
– Server-based
– Requires agreement and implementation by the carriers

Upcoming solutions

• Nuance Voice Control 2
– Online search
– Text-messaging

• Embedded software for simple voice command

• Server-based engine for large vocabulary speech recognition

• Speech Recognition API on Android 1.5

Conclusion

• A truly embedded speech recognition system
– A range of exciting applications
  • Real-time dictation with no perceived delay
  • Natural language interface (ASR + TTS)
  • Applications independent of the carrier
– But… not available yet!
• New speech recognition APIs are arriving soon
– Rely on network/server availability
– Can still lead to innovative applications

Conclusion

• Keys to success
– Robustness, accuracy
– Fast to load and execute
– Well-designed interface
  • Speech cannot be used on its own
  • Should be cleverly combined with other interfaces
    – Graphical
    – Touch
    – …
– Don’t put customers off with clumsy speech recognition widgets again!

Questions?
