Speech Interfaces
User Interfaces Spring 1998
Drew Roselli
Motivation: Mechanical
• Smaller devices => difficult I/O
• Speed, > 90 wpm (?)
• “Virtually unlimited” set of commands
• Freedom for other body parts
Motivation: User
• Natural
• Easy to remember
• Evolutionarily selected for
– reading and writing are not
– neither is typing
Speech Background
• Speech is faster than the vocal apparatus, so sounds overlap
» nasals spread
• Phonetic rules provide redundancy
» taboo combinations, SR in Srini
» contextual pronunciation:
/t/ -> aspirated, flap, unreleased
Speech Recognition
• Often misunderstood by people
» continuous feedback
• Longer words are easier
• Maximally different vowels: a, i, u
• Individual training
» gender-based
» “meaningless” conversation openers
Speech Production
• Three formants visible on oscilloscope
• Harmonics from larynx, throat, mouth
• Two needed for recognition but “tinny”
• 1989 demo
– http://cahn.www.media.mit.edu/people/cahn/emot-speech.html
More Gratuitous Opinions
(I’m really talking out of my butt here.)
• Recently a visual culture
• TV generation requires illustrated textbooks
• Notes mean “I’ll learn it later”
• Oral tradition has strong history
– http://www.missouri.edu/~csottime/index.html
Could we go verbal?
Recognition Problems
• Poor recognition
– humans < 1% error rate on dictation
– Janus 7% error rate (how much context?)
– Janus 20% in real time
• Background noise
• Slow – (simple matter of hardware)
• Homonym-rich languages (Cantonese)
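
The slides do not say how the error rates quoted above were measured. One common convention is word error rate: the word-level edit distance between a reference transcript and the recognizer output, divided by the reference length. A minimal sketch under that assumption (the function name and sample sentences are illustrative, not from the talk):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = cur[j - 1] + 1  # extra word in the output
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, this measure can exceed 100% when the recognizer emits many more words than were spoken.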
More Recognition Problems
• Isolated, short words difficult
– common words become short
• Segmentation
– silly versus sill lea
• No semantic help
• Spelling
– interface with printer, mail
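
The segmentation bullet above can be illustrated with a toy sketch that lists every way a phoneme sequence splits into lexicon words. Everything here is an assumption made for illustration: the phoneme symbols, the three-word lexicon, and the rule that two identical sounds meeting at a word boundary are heard as one.

```python
from typing import Dict, List, Sequence, Tuple

Phonemes = Tuple[str, ...]

# Hypothetical toy lexicon with made-up phoneme symbols.
LEXICON: Dict[str, Phonemes] = {
    "silly": ("s", "ih", "l", "iy"),
    "sill": ("s", "ih", "l"),
    "lea": ("l", "iy"),
}


def segmentations(sounds: Sequence[str],
                  lexicon: Dict[str, Phonemes] = LEXICON) -> List[List[str]]:
    """Return every word sequence that could have produced `sounds`."""
    phonemes = tuple(sounds)
    results: List[List[str]] = []

    def extend(pos: int, words: List[str]) -> None:
        if pos == len(phonemes):
            results.append(words)
            return
        for word, pron in lexicon.items():
            # Plain match: the word starts at the next unconsumed phoneme.
            if phonemes[pos:pos + len(pron)] == pron:
                extend(pos + len(pron), words + [word])
            # Shared-boundary match (assumption): the word reuses the last
            # phoneme of the previous word, as in "sill" + "lea".
            start = pos - 1
            if pos > 0 and len(pron) >= 2 and phonemes[start:start + len(pron)] == pron:
                extend(start + len(pron), words + [word])

    extend(0, [])
    return results


if __name__ == "__main__":
    print(segmentations(("s", "ih", "l", "iy")))
```

With no semantic help, a recognizer has no acoustic basis for preferring one of these parses over the other.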
UI Problems: Navigation
• Aural no-nos
– modes
– deep hierarchies
• Speech analog
• Grammar = how to re-structure linear sequence of words
Is there a UI equivalent?
UI Problems: Feedback
• Verbose feedback wastes time/patience
– only confirm consequential things
– use meaningful, short cues
• Interruption
– half-duplex communication
– real-time scheduling
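
The feedback guidelines above could be sketched as a small policy: ask for confirmation only when an action is consequential, and answer everything else with a short cue. The action names and the set of "consequential" actions are hypothetical; a real interface would choose its own.

```python
from typing import Tuple

# Hypothetical set of actions that are hard to undo.
CONSEQUENTIAL = {"delete_message", "send_reply"}


def feedback(action: str, target: str) -> Tuple[bool, str]:
    """Return (needs_confirmation, spoken_text) for a recognized command."""
    if action in CONSEQUENTIAL:
        # Consequential: spend the user's time on an explicit question.
        spoken = action.replace("_", " ").capitalize()
        return True, f"{spoken} {target}?"
    # Everything else gets a short cue instead of a full sentence.
    return False, "OK"


if __name__ == "__main__":
    print(feedback("delete_message", "from Drew"))
    print(feedback("next_message", ""))
```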
UI Problems: Meaning
• “Do what I mean not what I say”
• Silence means “Do the right thing”
VoiceNotes
• Voice-based file system
• Replacement for tapes
• “Hierarchical” access to voice data
• Thorough documentation of problems
SpeechActs
• Speech interface to computer tools
– email, calendar, weather, stock quotes
• Conversions to canonical form
– keyword based? confused by negations?
• Inconsistent recognition
– misunderstand system
– progressive assistance
– implicit confirmation
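
The slides only ask whether the conversion to canonical form is keyword based. Purely as an illustration of why that question matters, here is a sketch of a naive keyword matcher that a negation confuses. The keyword table and command names are hypothetical and are not taken from SpeechActs.

```python
import re
from typing import Optional

# Hypothetical keyword-to-command table.
KEYWORDS = {
    "delete": "DELETE_MESSAGE",
    "read": "READ_MESSAGE",
    "reply": "REPLY_TO_MESSAGE",
}


def canonical_form(utterance: str) -> Optional[str]:
    """Map an utterance to the command of the first keyword it contains."""
    for word in re.findall(r"[a-z']+", utterance.lower()):
        if word in KEYWORDS:
            return KEYWORDS[word]
    return None


if __name__ == "__main__":
    print(canonical_form("Delete this message"))
    # The negation is ignored, so this maps to the same command.
    print(canonical_form("Don't delete this message"))
```

Both utterances map to the same command, which is the failure the "confused by negations?" question points at.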
Multimodal Error Correction
• Dictation error correction study
• Results very unclear
• Recognizer got it wrong the first time => will get it wrong the second time
– hyperarticulating aggravates the problem
• Correct dictation errors with vocal spelling, writing, typing, etc.