dev days, speech recognition, lm aubert

Speech Recognition on embedded devices Louis-Marie Aubert ECIT – Queen’s University Belfast DevDays – Belfast – April 24, 2009

Upload: aubertlm

Post on 12-Jan-2015

DESCRIPTION

Overview of Automatic Speech Recognition (ASR) for embedded devices

- Large vocabulary, continuous speech recognition

- Technical overview

- Potential applications

- Upcoming alternatives to embedded engines

Presented at DevDays, Belfast, UK, 24 April 2009. Louis-Marie Aubert, ECIT, Queen's University Belfast

TRANSCRIPT

Page 1: Dev Days, Speech Recognition, LM Aubert

Speech Recognition on embedded devices

Louis-Marie Aubert, ECIT – Queen’s University Belfast

DevDays – Belfast – April 24, 2009

Page 2: Dev Days, Speech Recognition, LM Aubert

What should we expect from speech recognition?

Page 3: Dev Days, Speech Recognition, LM Aubert

Speech Recognition success?

• Natural continuous speech

• Real-time

• Large vocabulary (up to 100,000 words)

• No training (speaker independent)

• Adaptive to speaker accent

• Robust against

– Background noise

– Audio frontend imperfections

• N-best hypotheses with confidence value

Page 4: Dev Days, Speech Recognition, LM Aubert

What are the solutions on the market?

Page 5: Dev Days, Speech Recognition, LM Aubert

Existing solutions

• Server-based

– Telephony, IVR

– Dictation (health care industry)

– Audio indexing

Either offline or with significant delays

Page 6: Dev Days, Speech Recognition, LM Aubert

Existing solutions

• Desktop-based

– Real-time dictation

– Language learning

Requires a good setup, powerful computer, quiet environment

Very good accuracy, no training required

Page 7: Dev Days, Speech Recognition, LM Aubert

Existing solutions

• Embedded applications

– Simple voice commands (‘Call mum’ type commands)

– Disconnected word recognition

Small vocabulary and lack of naturalness restrict the range of applications

Page 8: Dev Days, Speech Recognition, LM Aubert

Is it so difficult?

Page 9: Dev Days, Speech Recognition, LM Aubert

Technical challenge

[Block diagram: Speech waveform → Speech Recognizer → Transcription (‘Hello world’)]

Page 10: Dev Days, Speech Recognition, LM Aubert

Technical challenge

[Block diagram: Speech waveform → Spectral Analyser → Acoustic feature vectors (~40 coefficients every 10 ms)]
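As a rough illustration of the 10 ms framing, the sketch below slices a waveform into consecutive 10 ms frames. The 16 kHz sample rate and the name `split_into_frames` are assumptions, not taken from the slides, and the spectral analysis that would turn each frame into ~40 coefficients is left out. Real front ends typically use overlapping windows; non-overlapping frames are used here only to keep the sketch short.

```python
def split_into_frames(samples, sample_rate_hz=16000, frame_ms=10):
    """Split audio samples into consecutive, non-overlapping frames.

    sample_rate_hz=16000 is an assumed value; the slides only give the
    10 ms frame period.
    """
    frame_len = sample_rate_hz * frame_ms // 1000   # samples per frame
    n_frames = len(samples) // frame_len            # drop a trailing partial frame
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


one_second = [0.0] * 16000          # one second of silence at the assumed rate
frames = split_into_frames(one_second)
print(len(frames), len(frames[0]))  # number of frames, samples per frame
```

Under these assumptions one second of audio yields 100 frames, so the recognizer has to finish all per-frame work 100 times per second to stay real-time.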

Page 11: Dev Days, Speech Recognition, LM Aubert

Technical challenge

[Block diagram: Acoustic feature vectors → Recognizer → Transcription (‘Hello world’). Recognizer blocks: Senome calculation, Acoustic Models, Phoneme Lexicon, Word Lexicon, Statistical Language Model, Viterbi decoding]

Page 13: Dev Days, Speech Recognition, LM Aubert

Technical challenge

[Block diagram: Acoustic feature vectors → Recognizer → Transcription (‘Hello world’). Recognizer blocks: Multi-dimensional Gaussian mixture calculation, Acoustic Models, Phoneme Lexicon, Word Lexicon, Statistical Language Model, Viterbi decoding]

Acoustic Models

• 4000 acoustic models

• Each models a sub-phonetic acoustic unit (a senome)

• Functions that score 10 ms of speech

• Each is a mixture of 16 Gaussians, defined by 40-long mean and variance vectors
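To make "functions that score 10 ms of speech" concrete, here is a minimal sketch of scoring one 40-coefficient feature vector against one 16-component Gaussian mixture. It assumes diagonal covariances (suggested by the "mean and variance vectors" on the slide) and uses made-up parameter values; the function names are illustrative, not from any particular engine.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        total += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total


def score_frame(x, weights, means, variances):
    """Log-likelihood of one frame under a Gaussian mixture (log-sum-exp)."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)  # subtract the peak for numerical stability
    return peak + math.log(sum(math.exp(l - peak) for l in logs))


DIM, MIX = 40, 16                                  # sizes quoted on the slide
weights = [1.0 / MIX] * MIX                        # hypothetical equal weights
means = [[0.1 * k] * DIM for k in range(MIX)]      # hypothetical means
variances = [[1.0] * DIM for _ in range(MIX)]      # hypothetical variances
frame = [0.0] * DIM                                # one 10 ms feature vector
print(score_frame(frame, weights, means, variances))
```

A recognizer would evaluate a function like `score_frame` for each active acoustic model on every frame, which is why this stage dominates the arithmetic load.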

Page 16: Dev Days, Speech Recognition, LM Aubert

Technical challenge

[Block diagram: Acoustic feature vectors → Recognizer → Transcription (‘Hello world’). Recognizer blocks: Multi-dimensional Gaussian mixture calculation, Senome Lexicon, Phoneme Lexicon, Word Lexicon, Statistical Language Model, Viterbi decoding]

Phoneme

• 50 in English

• Distinguishable sounds

• Each represents a sequence of senomes, modelled as an HMM (Hidden Markov Model)

‘ah’: ah1 ah2 ah3

‘l’: l1 l2 l3
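The phoneme-to-senome mapping can be sketched as a small left-to-right HMM. The three states per phoneme follow the ‘ah’: ah1 ah2 ah3 example; the left-to-right topology with self-loops, the 0.5 self-loop probability, and the `<exit>` marker are assumptions made for illustration.

```python
def phoneme_hmm(phoneme, n_states=3, self_loop_prob=0.5):
    """Build a left-to-right HMM for one phoneme.

    Each state stands for one senome (e.g. 'ah1'). A state either repeats
    (self-loop) or moves to the next state; the last state moves to '<exit>'.
    self_loop_prob is a hypothetical value, not taken from the slides.
    """
    states = [f"{phoneme}{i}" for i in range(1, n_states + 1)]
    arcs = {}
    for i, state in enumerate(states):
        nxt = states[i + 1] if i + 1 < n_states else "<exit>"
        arcs[state] = {state: self_loop_prob, nxt: 1.0 - self_loop_prob}
    return states, arcs


states, arcs = phoneme_hmm("ah")
print(states)
print(arcs["ah3"])
```

The self-loops let one senome absorb several consecutive 10 ms frames, which is how the same model copes with fast and slow speakers.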

Page 17: Dev Days, Speech Recognition, LM Aubert

Technical challenge

[Block diagram: Acoustic feature vectors → Recognizer → Transcription (‘Hello world’). Recognizer blocks: Multi-dimensional Gaussian mixture calculation, Senome Lexicon, Phoneme Lexicon, Word Lexicon, Statistical Language Model, Viterbi decoding]

Triphone

• 2500 in English

• Distinguishable sounds in their context → continuous speech

‘hh-ah+l’: ah1 ah2 ah3

‘ah-l+ow’: l1 l2 l3
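The ‘hh-ah+l’ notation reads as "ah, preceded by hh and followed by l". A minimal sketch of expanding a phoneme sequence into such triphones follows; the `sil` boundary symbol used to pad the word edges is an assumption, since the slides do not show how edges are handled.

```python
def to_triphones(phonemes, boundary="sil"):
    """Expand phonemes into 'left-centre+right' triphone labels.

    boundary is a hypothetical silence symbol used as context at both ends.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]


print(to_triphones(["hh", "ah", "l", "ow"]))
# ['sil-hh+ah', 'hh-ah+l', 'ah-l+ow', 'l-ow+sil']
```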

Page 18: Dev Days, Speech Recognition, LM Aubert

Technical challenge

[Block diagram: Acoustic feature vectors → Recognizer → Transcription (‘Hello world’). Recognizer blocks: Senome calculation, Acoustic Models, Triphone Lexicon, Word Lexicon, Statistical Language Model, Viterbi decoding]

Page 20: Dev Days, Speech Recognition, LM Aubert

Technical challenge

[Block diagram: Acoustic feature vectors → Recognizer → Transcription (‘Hello world’). Recognizer blocks: Multi-dimensional Gaussian mixture calculation, Senome Lexicon, Phoneme Lexicon, Word Lexicon, Statistical Language Model, Viterbi decoding]

Word

• Large vocabulary: 64,000 words

• Each represents a sequence of phonemes/triphones

‘hello’: hh ah l ow

‘world’: w er l d
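The word lexicon is essentially a pronunciation dictionary. A minimal sketch using the two entries shown on the slide; the names `LEXICON` and `words_to_phonemes` are illustrative, and a real 64,000-word lexicon would be loaded from a file rather than written inline.

```python
# Pronunciations copied from the slide; a real lexicon holds ~64,000 entries.
LEXICON = {
    "hello": ["hh", "ah", "l", "ow"],
    "world": ["w", "er", "l", "d"],
}


def words_to_phonemes(words, lexicon=LEXICON):
    """Concatenate the phoneme sequences of the given words."""
    phonemes = []
    for word in words:
        key = word.lower()
        if key not in lexicon:
            raise KeyError(f"out-of-vocabulary word: {word!r}")
        phonemes.extend(lexicon[key])
    return phonemes


print(words_to_phonemes(["Hello", "world"]))
# ['hh', 'ah', 'l', 'ow', 'w', 'er', 'l', 'd']
```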

Page 23: Dev Days, Speech Recognition, LM Aubert

Technical challenge

[Block diagram: Acoustic feature vectors → Recognizer → Transcription (‘Hello world’). Recognizer blocks: Multi-dimensional Gaussian mixture calculation, Senome Lexicon, Phoneme Lexicon, Word Lexicon, Statistical Language Model, Viterbi decoding]

Statistical language model

• Bi-gram / Tri-gram

• Gives the probability of a sequence of 2 or 3 words

• 64,000 words lead to roughly 10 million states / 50 million arcs

[Graph: word nodes ‘hello’, ‘world’, ‘mum’, ‘dad’ joined by arcs labelled 0.2, 0.05 and 0.3]
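A bi-gram model can be sketched as a table of word-pair probabilities. The words and the three probability values come from the slide's graph, but which value belongs to which arc is not recoverable from the transcript, so the assignment below is purely illustrative. The `floor` value for unseen pairs is a crude stand-in for proper smoothing or back-off.

```python
import math

# Hypothetical assignment of the slide's values to word pairs.
BIGRAMS = {
    ("hello", "world"): 0.2,
    ("hello", "mum"): 0.3,
    ("hello", "dad"): 0.05,
}


def sentence_log_prob(words, bigrams=BIGRAMS, floor=1e-6):
    """Sum of log bi-gram probabilities over consecutive word pairs."""
    logp = 0.0
    for prev, cur in zip(words, words[1:]):
        logp += math.log(bigrams.get((prev, cur), floor))
    return logp


for candidate in (["hello", "world"], ["hello", "mum"], ["hello", "dad"]):
    print(candidate, sentence_log_prob(candidate))
```

During decoding these log probabilities are added to the acoustic scores, so a likely word sequence can win over an acoustically similar but improbable one.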

Page 25: Dev Days, Speech Recognition, LM Aubert

Technical challenge

[Block diagram: Acoustic feature vectors → Recognizer → Transcription (‘Hello world’). Recognizer blocks: Senome calculation, Acoustic Models, Triphone Lexicon, Word Lexicon, Statistical Language Model, Viterbi decoding]

~ 25 million states / 250 million arcs

Page 27: Dev Days, Speech Recognition, LM Aubert

Technical challenge

[Block diagram: Acoustic feature vectors → Recognizer → Transcription (‘Hello world’). Recognizer blocks: Multi-dimensional Gaussian mixture calculation, Senome Lexicon, Triphone Lexicon, Word Lexicon, Statistical Language Model, Viterbi decoding]

~ 25 million states / 250 million arcs

Viterbi decoding

• Token passing algorithm

• 5,000 to 10,000 tokens to propagate every 10 ms

• Select the most promising tokens and output the associated sequence of: senomes → triphones → words → sentence

[Graph: network of HMM state chains l1 l2 l3, s1 s2 s3, ow1 ow2 ow3, ey1 ey2 ey3, d1 d2 d3, v1 v2 v3]
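Token passing can be sketched on a toy network as follows. Each token carries a score and the state path it has followed; on every frame all tokens are propagated along the arcs, only the best token arriving in each state is kept, and the beam keeps just the most promising ones. The three-state network, the scores, and the beam width of 2 are made-up values; a real decoder would propagate thousands of tokens per frame.

```python
def token_passing(arcs, start, frame_scores, beam=2):
    """Viterbi decoding by token passing with beam pruning.

    arcs: state -> list of (next_state, transition_log_prob)
    frame_scores: one dict per frame, state -> acoustic log score
    Returns (best_log_score, best_state_path).
    """
    tokens = {start: (0.0, [])}  # at most one token per state
    for scores in frame_scores:
        new_tokens = {}
        for state, (score, path) in tokens.items():
            for nxt, trans in arcs.get(state, []):
                if nxt not in scores:
                    continue
                cand = score + trans + scores[nxt]
                # Keep only the best token arriving in each state.
                if nxt not in new_tokens or cand > new_tokens[nxt][0]:
                    new_tokens[nxt] = (cand, path + [nxt])
        # Beam pruning: keep only the highest-scoring tokens.
        ranked = sorted(new_tokens.items(), key=lambda kv: kv[1][0], reverse=True)
        tokens = dict(ranked[:beam])
    return max(tokens.values(), key=lambda t: t[0])


# Hypothetical left-to-right chain for the phoneme 'l', all transitions free.
ARCS = {
    "<s>": [("l1", 0.0)],
    "l1": [("l1", 0.0), ("l2", 0.0)],
    "l2": [("l2", 0.0), ("l3", 0.0)],
    "l3": [("l3", 0.0)],
}
FRAME_SCORES = [
    {"l1": -1.0, "l2": -5.0, "l3": -9.0},
    {"l1": -4.0, "l2": -1.0, "l3": -6.0},
    {"l1": -7.0, "l2": -3.0, "l3": -1.0},
]
best_score, best_path = token_passing(ARCS, "<s>", FRAME_SCORES)
print(best_score, best_path)
```

The beam width trades accuracy for speed: a narrower beam discards more tokens per frame and risks pruning the path that would have won later.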

Page 29: Dev Days, Speech Recognition, LM Aubert

Challenges in embedded systems

• Low computational resources

• Power consumption constraints

• Noisy environment, poor audio quality

For a truly embedded speech recognition engine that works, we must move away from the pure software approach:

• Make the best of all hardware acceleration available

• Dedicated chip (accelerator) to offload the CPU and relax memory constraints
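A back-of-the-envelope calculation with the figures quoted earlier in the talk shows why a pure software approach is hard on an embedded CPU. The model, mixture, vector, frame and arc counts come from the slides; the 8 bytes per arc is an assumption, and the compute figure is a worst case in which every acoustic model is scored on every frame (practical decoders score only the active ones).

```python
MODELS = 4000            # acoustic models (from the slides)
MIXTURES = 16            # Gaussians per model (from the slides)
DIMS = 40                # coefficients per feature vector (from the slides)
FRAMES_PER_SECOND = 1000 // 10   # one frame every 10 ms
ARCS = 250_000_000       # arcs in the decoding network (from the slides)
BYTES_PER_ARC = 8        # assumption: storage needed per arc

# Per-dimension Gaussian terms to evaluate each second, worst case.
gaussian_terms_per_second = MODELS * MIXTURES * DIMS * FRAMES_PER_SECOND
# Storage for the arcs alone, under the assumed arc size.
network_bytes = ARCS * BYTES_PER_ARC

print(gaussian_terms_per_second)   # 256000000
print(network_bytes / 1e9, "GB")   # 2.0 GB
```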

Page 30: Dev Days, Speech Recognition, LM Aubert

Why do we want speech recognition on embedded devices anyway?

Page 31: Dev Days, Speech Recognition, LM Aubert

Applications on mobiles

• Complement touch screen interface with speech interface

• Speech enable existing mobile applications

– Browse complex menus

– Easily find items in large libraries, local or online (contacts, music…)

– Browse Web and search maps

– Games

– Compose text-messages, emails…

Page 32: Dev Days, Speech Recognition, LM Aubert

Applications on mobiles

• Speech enable mobile applications

Rubicon, "The Apple iPhone: Successes and Challenges for the Mobile Industry", 31 March 2008

Page 33: Dev Days, Speech Recognition, LM Aubert

Applications on mobiles

• Key to safety when driving

– Text-messaging

– Satellite-Navigation function

• Voice Memo

– Shopping list

– Activity scheduler

• Market of Speech technology in embedded devices

– $125 million in 2006

– $500 million in 2010 (projected)

Opus Research report, March 2007

Page 34: Dev Days, Speech Recognition, LM Aubert

Other markets

• Developing countries

– Access to information technology for illiterate people

  • Administrative tasks

  • Education

  • Social integration

• Health-care at home (self-manage diseases)

– Exploding market

  • Chronic diseases

  • Elderly people (Baby Boomers reach retirement age)

  • Market for home health care products is valued at $4.3 billion today

– Place for Speech recognition

  • Inexperience of patients with electronic interfaces

  • Poor physical condition (e.g. low vision)

  • Illiteracy

Medical Device Today, March 2009

Page 35: Dev Days, Speech Recognition, LM Aubert

Other applications

• Speech translation

– IraqCom

Page 36: Dev Days, Speech Recognition, LM Aubert

Okay, I can’t wait! Is there anything I can use now?

Page 37: Dev Days, Speech Recognition, LM Aubert

Upcoming solutions

• Voicemail accessible via text-message, email or dedicated application

– Server-based

– Require agreement and implementation by the carriers

Page 38: Dev Days, Speech Recognition, LM Aubert

Upcoming solutions

• Nuance Voice Control 2

– Online search

– Text-messaging

• Embedded software for simple voice command

• Server-based engine for large vocabulary speech recognition

• Speech Recognition API on Android 1.5

Page 39: Dev Days, Speech Recognition, LM Aubert

So?

Page 40: Dev Days, Speech Recognition, LM Aubert

Conclusion

• A truly embedded speech recognition system

– A range of exciting applications

  • Real-time dictation with no perceived delay

  • Natural language interface (ASR + TTS)

  • Applications independent of the carrier

– But… not available yet!

• New speech recognition APIs are arriving soon

– Rely on network/server availability

– Can still lead to innovative applications

Page 41: Dev Days, Speech Recognition, LM Aubert

Conclusion

• Keys to success

– Robustness, accuracy

– Fast to load and execute

– Well designed interface

• Speech cannot be used on its own

• Should be cleverly combined with other interfaces

– Graphical

– Touch

– …

– Don’t put customers off with clumsy speech recognition widgets, again!

Page 42: Dev Days, Speech Recognition, LM Aubert

Questions?