Creating User Interfaces
[Continue presentations as needed] Speech recognition. Speech synthesis
Homework: Report on current products. Register on Tellme Studio.
Study VoiceXML
Speech recognition
• User speaks. System 'understands', at least enough to perform some action.
• Related to (but not the same as)
  – Natural language understanding
  – Voice print identification
  – Record information to be re-played to human in compressed form for later interaction
  – Speech synthesis (other direction): words to speech
  – ?
Natural language understanding
• Skip speech altogether, but type in statements or phrases in normal language
  – What is normal? We tend not to speak that grammatically
  – Many 'natural language systems' actually use keywords
• History
• Moon rocks example
• Combine speech to natural language …
Continuous versus discrete
• Speaker speaks 'naturally' versus
• Speaker separates words
Examples
• Dictation: no understanding as such, produce words/sentences in a program
• (Telephone) Help desk / Information: generally restricted or directed speech, choosing from alternatives (may or may not be given). Advances the process
• [Restricted] commands: actually carrying out operations
  – Factory example: start and stop
  – Car: radio, heat/AC
  – Phone: call specific number
Training
• Dictation application: user takes time to read a specific text to train the system
  – Note: some systems also adapt with use. If and when the user corrects the results, the system may do better next time.
• Phone lookup: user records names. No 'understanding', just record for matching.
Audience & content
• Some systems may allow adapting to audiences, for example, male versus female
• Some systems have restrictions on types of content
  – Historical note: IBM system in 1980s & 1990s was restricted to male, American-born speakers (no speech impediments) and legal text.
Speech recognition concepts
• Air pressure → diaphragm in phone → electrical signal → (Fourier Transform) → wave pattern, matched against
• sets of canonical patterns (native speaker of English, perhaps male/female & young/old alternatives)
• generated for the specified grammar (using a segmentation = dividing up of the parts)
Note: interplay of grammar and statistics distinguishes different approaches
Fourier Transform (Discrete Fourier Transform, typically computed with the Fast Fourier Transform, FFT)
• Takes data representing a signal
• And produces numbers representing the combination of sine and cosine waves that make up the signal
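For reference, the standard definition of the Discrete Fourier Transform can be written out as below. The symbols are introduced here for illustration and are not from the slides: x_0 … x_{N-1} are the N signal samples, and X_k is the (complex) number describing how much of the wave with k cycles per N samples is present.

```latex
X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}
    = \sum_{n=0}^{N-1} x_n \left[ \cos\!\left(\frac{2\pi k n}{N}\right) - i \sin\!\left(\frac{2\pi k n}{N}\right) \right],
\qquad k = 0, 1, \dots, N-1
```

The cosine and sine terms are the "combination of sine and cosine waves"; the FFT is a fast way of computing these same N numbers.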
Speech recognition
• Works on the product of the FFT
• Uses (in most cases)
  – Segmentation: attempt to break up into pieces, perhaps syllables or words
  – Grammar: definition of what is to be expected
  – Probabilities: if first part matched X, then greater probability that the next would match to Y
Current State of the Art
• General, no restrictions, speech reco, good enough to act on the speech? always about to happen?
• dictation / substitute for keyboard
  + exists and satisfies many
  – Is this most important application for most users?
  – May not be killer app, but may be good for motivating research
Homework: prepare brief report on [a] current product or application. Can be one you use yourself.
Speech synthesis
• aka TTS (text to speech)
• Application determines that the computer needs to say certain words
• lexical units (syllables of words) → phonemes → pre-recorded (wav) files of phonemes
Speech synthesis
• This is again a segmentation process: need to divide up the words and then put together so speech sounds 'natural'.
  – particular phoneme may [need to] sound different in different context.
  – also need to deal with abbreviations & local accents
  – Place names (important in travel & weather applications)
    • Special case: detect and use wav file for each name.
• Older methods were all synthesized
  – similar distinction between all synthesized and samples of music
Speech synthesis
is essentially ‘the computer’ reading ‘out loud’.
Easy to do most things
Increasingly difficult to do the complete job
Different languages may be easier than English.
People who are not monolingual please comment!
Restricted / directed speech applications
• We will use the tellme studio engine to create directed speech applications.
• These make use of
  – Grammars
  – Options to use numbers (buttons)
  – Recorded (.wav) sounds
  – Text to speech
studio.tellme.com
• Company that provides ‘engine’ for applications
• Provides developing environment
  – We are doing the Tellme version of VoiceXML, but it appears to be standard.
• Register as a developer:
  – Provide your own id; assigned a PIN
  – Put VoiceXML in ScratchPad place (no audio files)
• 1-800-555-VXML (8965)
  – SAY id and then PIN or can give phone number. Tellme runs either
    • program in ScratchPad OR
    • program at Application URL for projects with multiple files
• To look at someone else's project, you change your Application URL
  – called pointing your account to a new source.
XML
• Generalization of HTML-style markup (strictly, XML is a simplified form of SGML, the family HTML came from)
• XML documents have markup.
  – Opening tag indicating type of element, possibly with attributes, then content, then tag closer.
• Document must be well-formed.
• Developers decide on element types.
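As a small illustration of these points, here is a sketch of a well-formed XML document. The element and attribute names (homework, task, course, kind) are invented for this example; that is the sense in which developers decide on element types.

```xml
<?xml version="1.0"?>
<!-- element and attribute names here are invented for illustration -->
<homework course="Creating User Interfaces">
  <task kind="report">Report on a current speech product</task>
  <task kind="practice">Start learning VoiceXML</task>
</homework>
```

Every opening tag has a matching closer, elements nest properly, and attribute values are quoted. Dropping any one of those would make the document not well-formed.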
VoiceXML
• XML document (VXML header)
  – This means proper nesting of elements, quotation marks on attributes
• VoiceXML has tags for flow-of-control and calculations.
  – Also can use <script> for JavaScript
• Grammars come in different varieties. We will use the Tellme way.
  – Grammars are included in CDATA sections to prevent XML interpretation.
  – Many grammars constructed for you.
• <field name="answer" type="boolean" > … will listen for yes or no.
• <field name="price" type="currency" > … will listen for currency.
  – <menu > <choice > <choice> for list
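A minimal sketch of a field that relies on the built-in boolean grammar, assuming a standard VoiceXML 2.0 interpreter. The form id and the prompt wording are made up for illustration, and Tellme's behavior may differ in details.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="confirm">
    <!-- type="boolean" selects the built-in yes/no grammar, so no <grammar> element is needed -->
    <field name="answer" type="boolean">
      <prompt>Do you want to hear the weather?</prompt>
      <filled>
        <!-- the field variable "answer" holds true for yes, false for no -->
        <if cond="answer">
          <prompt>Here is the weather.</prompt>
        <else/>
          <prompt>Okay, no weather.</prompt>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

The field name doubles as the variable name, which is why the <if> can test answer directly.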
Very brief overview
• <vxml> document contains <form> and/or <menu> elements.
  – <form> can contain <block>, <field>
    • <block> can contain <audio> or do its own audio
    • <field> can contain <prompt>, <grammar>, <noinput>, etc.
      – NOTE: certain types of <field> elements use built-in grammars, for example, boolean
      – Can have a child node <filled> that indicates what to do if there is a match
  – <menu> is a compressed way to use a simple grammar
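To show what "compressed" means for <menu>, here is a sketch assuming a standard VoiceXML 2.0 interpreter. The ids and the spoken choices (sports, weather) are hypothetical. The text inside each <choice> serves as the grammar, so no separate <grammar> element is written.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu id="main">
    <!-- <enumerate/> reads the choices back to the caller -->
    <prompt>Say one of: <enumerate/></prompt>
    <choice next="#sports">sports</choice>
    <choice next="#weather">weather</choice>
    <noinput>Please say one of: <enumerate/></noinput>
  </menu>
  <form id="sports">
    <block>Here are the sports scores.</block>
  </form>
  <form id="weather">
    <block>Here is the weather report.</block>
  </form>
</vxml>
```

Saying "weather" transfers control to the form whose id matches the next attribute of that choice.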
Very brief, cont.
• Logic can be done using a <script> element that contains a variant of JavaScript and/or
• vxml logic elements, including
  – <var>
  – <if>, <else>, <elseif>
  – other
• These may be part of a <filled> element
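A sketch of the two kinds of logic working together, assuming a standard VoiceXML 2.0 interpreter. The function creditsLeft, the 120-credit total, and the prompt wording are all hypothetical and only there to give the logic something to do.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- hypothetical helper written in JavaScript; CDATA keeps < from being read as markup -->
  <script><![CDATA[
    function creditsLeft(earned) {
      return earned < 120 ? 120 - earned : 0;
    }
  ]]></script>
  <form id="progress">
    <field name="earned" type="number">
      <prompt>How many credits have you earned?</prompt>
      <filled>
        <!-- <var> declares a variable; expr may call the script function -->
        <var name="left" expr="creditsLeft(earned)"/>
        <if cond="left == 0">
          <prompt>You have enough credits to graduate.</prompt>
        <elseif cond="left &lt;= 30"/>
          <prompt>You are on the home stretch.</prompt>
        <else/>
          <prompt>You still need <value expr="left"/> credits.</prompt>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

Inside the CDATA section the < is written plainly; inside the cond attribute it has to be written as &lt; because attribute values are still parsed as XML.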
Audio
• Tellme studio provides a way to record [your] speech as a wav file to upload to a website. It sends the file to your email address
• You upload your VoiceXML file plus any wav files (and anything else)
<audio src="mygreeting.wav">Welcome to my site </audio>
If Tellme can't find the mygreeting.wav file, it uses its Text to Speech on the string "Welcome to my site".
Note: you also can use a full URL: http://....
• You enter the URL of the VoiceXML file in your Tellme studio account. This is called pointing to the URL.
• TEST
VoiceXML basics, continued
• <form> element can contain
  – <block> elements, which can contain <audio>, <goto>, other
  – <field> which can contain
    • <prompt>
    • <grammar> (if not one of built-in grammars)
    • <filled>
• <var> tags can be at different levels (for example, document, block, or higher levels)
• <if> <elseif> <else> tags
• <script> elements for JavaScript (which can also appear in expressions)
VoiceXML basics: typical case
• a form element
  – <field>
    • <prompt>, made up of <audio>, with reference to recorded wav file and backup text
    • <grammar>, if NOT using built-in grammars designated by type attribute of field. This is a CDATA section.
    • <filled> with (follow-on) code using field
    • <catch> for nomatch, noinput cases
Caution
A form contains various elements, including a field.
If a field has a grammar and the grammar is satisfied, control goes to a filled tag.
obligatory…
<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <block>
      <audio src="prompt1.wav">Hello, world</audio>
    </block>
  </form>
</vxml>
prompt1.wav: recorded using Tellme studio
"Hello, world" text: backup using TTS, just in case the src file is missing
example
• Asks for number of credits and calculates when you/caller can register
• uses built-in grammar for number
• No error recovery. You need to do better than this in your project.
• Unfortunate situation: there is an element type filled and an element type field.
• The < symbols are represented using &lt;
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="credit">
    <var name="rest" expr="1000"/>
    <field name="bcount" type="number">
      <prompt>
        <audio src="howmanycredits.wav">Hello there. How many credits have you earned?</audio>
      </prompt>
      <grammar type="application/x-gsl" mode="voice"><![CDATA[ NATURAL_NUMBER_THRU_999]]></grammar>
      <catch event="noinput nomatch">
        <audio src="sorry.wav">Sorry. I didn't get that.</audio>
        <exit/>
      </catch>
      <filled>
        <assign name="rest" expr="bcount"/>
        <audio><value expr="rest"/></audio>
        <if cond="rest &lt; 30">
          <audio src="homestretch.wav">You can register on the third day</audio>
        <elseif cond="rest &lt; 60"/>
          <audio src="morethanhalf.wav">You can register on the second day</audio>
        <elseif cond="rest &lt; 90"/>
          <audio src="goodstart.wav">You can register on the first day</audio>
        <else/>
          <audio>You can register on the fourth day</audio>
        </if>
        <audio src="goodbye.wav">Good bye.</audio>
      </filled>
    </field>
  </form>
</vxml>
Homework
• Do research / think about your own experiences and come prepared to report on a speech recognition / speech synthesis application
• Start learning VoiceXML