![Page 1: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/1.jpg)
Speech Technologies and VoiceXML
Chun-Feng Liao
NCCU Department of Computer Science
Intelligent Media Lab
[email protected]
![Page 2: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/2.jpg)
Presentation Agenda
Voice technologies background
• ASR/TTS
Voice browsing with VoiceXML
VoiceXML architecture
VoiceXML programming
Future of VoiceXML
Summary
![Page 5: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/5.jpg)
Voice Technologies
In the mid- to late 1990s, personal computers started to become powerful enough to support automatic speech recognition (ASR).
The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).
![Page 6: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/6.jpg)
Speech Recognition
Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)
![Page 7: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/7.jpg)
Speech Synthesis
Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)
![Page 8: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/8.jpg)
Pervasive Computing Model
E-business has shifted from a client-server model to a web-centric model.
Once connected to the Internet, one can get any information one wants, but people want more convenient ways to connect to the Internet.
Lou Gerstner, CEO of IBM: the pervasive computing model is a billion people interacting with a million e-businesses, with a trillion devices interconnected.
![Page 9: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/9.jpg)
![Page 10: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/10.jpg)
Voice Browsing
VoiceXML instead of HTML
A voice browser instead of an ordinary web browser
Phone instead of PC
![Page 11: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/11.jpg)
VoiceXML Key Design Issues
Speech Input: speech recognition and DTMF
Speech Output: pre-recorded audio and synthesized speech
Internet: XML, IP, HTTP, SSL, JavaScript
Telephony: call transfer, data passing
![Page 12: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/12.jpg)
W3C Voice Browser Working Group
Founded May 1999
60 member companies
Mission: a standards group to prepare and review markup languages that enable Internet-based speech applications
http://www.w3.org/Voice
![Page 13: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/13.jpg)
VoiceXML Forum
Industry group to promote VoiceXML
550+ member companies
Submitted VoiceXML 1.0 to W3C in May 2000
http://www.voicexml.org
![Page 14: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/14.jpg)
VoiceXML v1.0 (May 2000)
• VoiceXML Forum
• Specification submitted to the W3C
VoiceXML v2.0
• W3C Voice Browser Working Group
• 50+ members collaborating
• Addressed 400+ change requests
![Page 15: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/15.jpg)
VoiceXML Overview
A language for specifying voice dialogs.
Voice dialogs use audio prompts and text-to-speech (TTS) for output; touch-tone keys (DTMF) and automatic speech recognition (ASR) for input.
The main input/output device (initially) is the phone.
Leverages the Internet for application development and delivery.
A standard language enables portability. (VoiceXML unifies the dialog description language.)
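To make the overview concrete, here is a minimal sketch of a complete VoiceXML 2.0 document with one spoken prompt and one input. The form id, the field name `wants_news`, and the prompt wording are invented for illustration. The built-in `boolean` field type is assumed to be available on the platform (the specification treats built-in types as an optional feature); where it is supported, the caller can say "yes" or "no", or press 1 or 2.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="welcome">
    <!-- Input: speech (yes/no) or DTMF (1/2) via the built-in boolean type -->
    <field name="wants_news" type="boolean">
      <!-- Output: rendered by TTS unless recorded audio is supplied -->
      <prompt>Would you like to hear today's news?</prompt>
      <filled>
        <if cond="wants_news">
          <prompt>Here is the news.</prompt>
        <else/>
          <prompt>Goodbye.</prompt>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

Because the document is plain XML served over HTTP, the same file should run on any conforming voice browser, which is the portability point made above.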
![Page 16: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/16.jpg)
VoiceXML Platform Architecture
![Page 17: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/17.jpg)
VoiceXML Platform Architecture-1
Telephone and telephone network
• Connects the caller's telephone with the telephony server
VoiceXML Gateway
• Voice browser
• Audio input: speech recognition (ASR), touch-tone (DTMF), audio recording
• Audio output: audio playback, speech synthesis (TTS)
• Interface, call controls
![Page 18: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/18.jpg)
VoiceXML Platform Architecture-2
VoiceXML documents
• Dialog and flow control
• Client-side scripting (ECMAScript)
• Speech recognition grammar
• Speech synthesis pronunciation control
Document servers (web servers)
• Serve static VoiceXML documents or audio files
Application servers
• Generate VoiceXML documents dynamically
• Server-side application logic
• Connect to a database or a database interface
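As a rough illustration of the client-side scripting item listed above, the sketch below defines an ECMAScript helper in a `<script>` element and calls it from a prompt. The function name `plural` and the variable `party` are hypothetical and exist only for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- ECMAScript evaluated by the voice browser, not by the server -->
  <script>
    <![CDATA[
      function plural(n) {
        return n == 1 ? 'traveller' : 'travellers';
      }
    ]]>
  </script>
  <form id="summary">
    <var name="party" expr="2"/>
    <block>
      <prompt>
        You are booking for <value expr="party + ' ' + plural(party)"/>.
      </prompt>
    </block>
  </form>
</vxml>
```

Logic that needs a database or other back-end data belongs on the application server instead, which generates the VoiceXML document dynamically.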
![Page 19: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/19.jpg)
Example
weather.jsp - VoiceXML and JSP (runs on the web server + Servlet/JSP engine, backed by the DB)
<% user.storePreference("try"); %>
<form>
  <block>Today's temperature is <%= weather.getTemp() %> degrees.</block>
</form>
Output received by the VoiceXML browser
<form>
  <block>Today's temperature is 25 degrees.</block>
</form>
![Page 20: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/20.jpg)
Voice Gateway
![Page 21: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/21.jpg)
Implementations of VoiceXML Gateways
In Taiwan:
• Yes Mobile
• Chunghwa Telecom Laboratories (second-generation voice platform)
• eWings Technologies, Inc.
Free
• IBM VoiceServerSDK
Open source
• CMU: OpenVXI
![Page 22: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/22.jpg)
[DEMO] A Simple VoiceXML Application
![Page 23: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/23.jpg)
DEMO
A simple VoiceXML application that introduces the Department of Computer Science.
Experience shows that building a corresponding HTML version first is helpful.
![Page 24: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/24.jpg)
Document
A VoiceXML document defines one or more dialogs
The user is always in one dialog at any time
Each dialog specifies the next dialog to transition to using a URL
Diagram: doc1.vxml contains Dialog 1 and Dialog 2, with one transition to #dialog2 (same document) and one to http://xyz.com/doc2.vxml (another document).
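The two kinds of transition can be sketched as a possible doc1.vxml. The dialog ids `dialog1` and `dialog2` and the prompt texts are assumptions; only the two target URLs come from the slide.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- doc1.vxml: two dialogs in one document -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="dialog1">
    <block>
      <prompt>This is dialog one.</prompt>
      <!-- A fragment-only URL stays inside the current document -->
      <goto next="#dialog2"/>
    </block>
  </form>
  <form id="dialog2">
    <block>
      <prompt>This is dialog two.</prompt>
      <!-- An absolute URL makes the browser fetch another document -->
      <goto next="http://xyz.com/doc2.vxml"/>
    </block>
  </form>
</vxml>
```

If a dialog does not specify a successor, the interpreter has nowhere to go and the session ends.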
![Page 25: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/25.jpg)
Dialog
A Dialog describes an interaction between a user and the system
Two kinds of dialogs: form and menu
![Page 26: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/26.jpg)
VoiceXML Document Structure.
![Page 27: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/27.jpg)
Form
A form keeps collecting information for its fields, as defined by their grammars.
<form>
  <field name="travellers">
    <grammar mode="voice" src="./number.grxml"/>  <!-- input -->
    <prompt>How many are travelling?</prompt>     <!-- output -->
    <filled>                                      <!-- eval -->
      <submit next="http://travel.com/order"/>
    </filled>
  </field>
</form>
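The form above refers to an external grammar file, number.grxml, whose contents are not shown in the slides. A minimal sketch of what such a file might look like in the W3C XML grammar format (SRGS) follows; the rule id `number` and the one-to-five range are assumptions made only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- number.grxml: hypothetical contents, accepts the words one to five -->
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="number">
  <rule id="number" scope="public">
    <one-of>
      <item>one</item>
      <item>two</item>
      <item>three</item>
      <item>four</item>
      <item>five</item>
    </one-of>
  </rule>
</grammar>
```

Anything the caller says outside this grammar raises a nomatch event, so the grammar effectively defines which answers the field can be filled with. A production grammar would also attach semantic tags so that the field receives a number rather than the recognized word.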
![Page 28: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/28.jpg)
Menu
<menu id="commands">
  <prompt>What service would you like?</prompt>
  <choice next="/cars">Car hire</choice>
  <choice next="/hotels">Hotel reservations</choice>
  <choice next="/news">Today's news</choice>
</menu>
A menu is essentially a form without fields.
A menu is a flow-control mechanism: depending on the user's choice, it transfers to a different URL.
![Page 29: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/29.jpg)
Submit
Typically used to send results from client to server
Syntax: <submit next="URI" namelist="var1 var2 ..."/>
namelist: specifies the fields to pass to the next page.
![Page 30: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/30.jpg)
Submit, Example
<form>
  <field name="dest_city">
    <prompt>Where do you want to go to?</prompt>
    <grammar mode="voice" src="./cities.grxml"/>
  </field>
  <field name="travellers">
    <prompt>How many are travelling to <value expr="dest_city"/>?</prompt>
    <grammar mode="voice" src="./number.grxml"/>
  </field>
  <filled>
    Thank you. Your order is now being processed.
    <submit next="http://travel.com/order" namelist="dest_city travellers"/>
  </filled>
</form>
![Page 31: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/31.jpg)
Variables
Variables can be manipulated and referenced
• Declare: <field name="user2">
• Assign: <assign name="user1" expr="'peter'"/>
• Clear: <clear namelist="user1 user2"/>
• Reference: How many are travelling to <value expr="dest_city"/>? (no $ prefix is needed when referencing)
![Page 32: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/32.jpg)
Variable Scope
Scopes, from outermost to innermost: session, application, document, dialog
Session variables are "read-only" variables provided by the interpreter context
Below the dialog scope, a further scope is defined by each element containing executable content (<block>, <filled> or event handler)
A variable name is searched for from the innermost scope outwards
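A small sketch of how these scopes appear in a document is given below. The variable names `greeting`, `count`, and `note` are made up for the example; the `document.` and `dialog.` prefixes are the standard way to name a scope explicitly.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document scope: visible to every dialog in this document -->
  <var name="greeting" expr="'Welcome'"/>
  <form id="scopes">
    <!-- Dialog scope: visible only inside this form -->
    <var name="count" expr="3"/>
    <block>
      <!-- Scope of this <block>: gone once the block finishes -->
      <var name="note" expr="'local'"/>
      <prompt>
        <value expr="document.greeting"/>.
        The dialog counter is <value expr="dialog.count"/>.
        The note is <value expr="note"/>.
      </prompt>
    </block>
  </form>
</vxml>
```

The unqualified name `note` is resolved by searching the block's own scope first, then the dialog, document, application, and session scopes in that order.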
![Page 33: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/33.jpg)
Error Handling: Events
Events are used to signal "unexpected" situations
Events are caught by a <catch> event handler
• <catch event="com.acme.mailreader">...</catch>
• <catch event="nomatch noinput">...</catch>
• Shortcut: <nomatch> is equivalent to <catch event="nomatch">
• Other shortcuts: <noinput>, <error>
![Page 34: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/34.jpg)
Events, Example
<field name="dest_city">
  <prompt>Where do you want to go to?</prompt>
  <grammar mode="voice" src="./cities.grxml"/>
  <nomatch>Please say the city you want to fly to.</nomatch>
</field>
![Page 35: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/35.jpg)
Multimodal Web Browsing
XHTML + VoiceXML
SALT
![Page 36: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/36.jpg)
[DEMO]Multimodal Browsing
![Page 37: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/37.jpg)
Future of the “Voice” web and VoiceXML
VoiceXML 1.0: VoiceXML Forum (2000)
VoiceXML 2.0: W3C (2003, in CR)
Related W3C work: speech synthesis (SSML), speech recognition grammar, NLP, speech semantics, pronunciation lexicon [early], call control [early], voice browser interoperation [early]
SALT (Speech Application Language Tags): Microsoft-led (2002)
JSML, JSGF: Sun/SpeechWorks (1999)
VoiceXML 3?
![Page 38: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/38.jpg)
Conclusion
Speech is the most natural way for humans to communicate, so it will become an important modality in HCI.
VoiceXML has changed how speech recognition and telephony applications are developed and deployed.
![Page 39: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/39.jpg)
Q & A
![Page 40: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/40.jpg)
Backup
![Page 41: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/41.jpg)
History of VoiceXML
Source: VoiceXML Forum (http://www.voicexml.org)
![Page 43: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/43.jpg)
Classification of Voice Application
Basic interactive voice response (IVR)
• Computer: "For stock quotes, press 1. For trading, press 2. …"
• Human: (presses DTMF "1")
Basic speech ASR
• C: "Say the stock name for a price quote."
• H: "Lucent Technologies"
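The basic IVR exchange above maps fairly directly onto a VoiceXML menu. The following is a sketch only: the dialog ids `quotes` and `trading` and their placeholder prompts are invented, and the stock-name dialog from the second exchange is left out.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu id="main">
    <prompt>For stock quotes, press 1. For trading, press 2.</prompt>
    <!-- dtmf binds a key press; the element text doubles as a spoken choice -->
    <choice dtmf="1" next="#quotes">stock quotes</choice>
    <choice dtmf="2" next="#trading">trading</choice>
    <noinput>Please press 1 or 2. <reprompt/></noinput>
  </menu>
  <form id="quotes">
    <block><prompt>Stock quotes selected.</prompt></block>
  </form>
  <form id="trading">
    <block><prompt>Trading selected.</prompt></block>
  </form>
</vxml>
```

Since each choice also contributes its text to the menu's speech grammar, the same document lets a caller say "stock quotes" instead of pressing 1, which is a small step toward the basic speech ASR category.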
![Page 44: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/44.jpg)
Classification of Voice Application
Advanced speech ASR
• C: "Stock Services, how may I help you?"
• H: "Uh, what's Lucent trading at?"
"Near-natural language" ASR
• C: "How may I help you?"
• H: "Um, yeah, I'd like to get the current price of Lucent Technologies"
• C: "Lucent is up two at sixty eight and a half."
• H: "OK. I want to buy one hundred shares at market price."
• C: "…"
![Page 45: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/45.jpg)
Speech Recognition
Capturing speech (analog) signals
Digitizing the sound waves and converting them to basic language units, or phonemes
Constructing words from phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike (such as "write" and "right")
![Page 46: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/46.jpg)
Speech Synthesis
Speech synthesis, or text-to-speech, is the process of converting text into spoken language.
• Breaking down the words into phonemes
• Analyzing the text for special handling of items such as numbers and currency amounts
• Generating the digital audio for playback
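In VoiceXML 2.0 the author can guide these synthesis steps with SSML elements inside a prompt. The sketch below is illustrative: the prompt text is invented, and the `interpret-as` value `currency` is an assumption, since the set of supported values is platform-dependent rather than fixed by the SSML specification.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="balance">
    <block>
      <prompt>
        <!-- Hint for special handling of a currency amount -->
        Your balance is <say-as interpret-as="currency">$25.50</say-as>.
        <!-- Explicit pause and stress to control phrasing -->
        <break time="500ms"/>
        <emphasis>Thank you</emphasis> for calling.
      </prompt>
    </block>
  </form>
</vxml>
```

Without such hints the synthesizer falls back on its own text analysis, which may or may not read "$25.50" as an amount of money.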
![Page 47: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/47.jpg)
VoiceXML Gateway(detail)
![Page 48: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/48.jpg)
Programming VoiceXML
Writing a VoiceXML application is programming.
Control constructs are procedural (if-else etc.)
VoiceXML platform iterates through a <form> until values for all field items have been collected
![Page 49: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/49.jpg)
VoiceXML System Components
VoiceXML servers serve as integrators of various hardware and software:
• Telecom boards
• PBX
• CT integration
• Speech synthesis (TTS)
• Speech recognition (SR)
• Speech grammars
• Voice biometrics
• Software utilities
• Call centre
![Page 50: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/50.jpg)
FIA - Form Interpretation Algorithm
The FIA has a main loop that repeatedly selects a form item and then visits it
The first form item (in document order) whose form item variable is undefined is selected
As a result, the user is prompted for each field item in turn
![Page 51: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/51.jpg)
FIA – Form Example
<form>
  <block>
    <prompt>Where do you want to go to and how many are travelling?</prompt>
  </block>
  <!-- Field item 1 -->
  <field name="dest_city">
    <prompt>Where do you want to go to?</prompt>
    <grammar mode="voice" src="./cities.grxml"/>
  </field>
  <!-- Field item 2 -->
  <field name="travellers">
    <prompt>How many are travelling to your destination?</prompt>
    <grammar mode="voice" src="./number.grxml"/>
  </field>
  <!-- other fields -->
</form>
![Page 52: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/52.jpg)
if, else and elseif
<form>
  ...
  <filled>
    <if cond="travellers &gt; 10">
      Sorry, we cannot handle groups larger than 10 persons.
      <clear namelist="travellers"/>
    <elseif cond="travellers &gt; 5 &amp;&amp; dest_city == 'London'"/>
      Sorry, we cannot handle groups larger than 5 persons travelling to London.
      <clear namelist="dest_city travellers"/>
    <else/>
      <submit next="http://travel.com/order"/>
    </if>
  </filled>
</form>
![Page 53: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/53.jpg)
JSML - JSpeech Markup Language
Developed by Sun and SpeechWorks as a markup language for text-to-speech dialogs.
Based on the Java Speech API Markup Language
http://java.sun.com/products/java-media/speech/
Text annotation to provide hints to speech synthesizers
• Aimed at making TTS speech more natural and more understandable
Feature set:
• Hints to word pronunciation
• Hints to phrasing, emphasis, pitch and speaking rate
• "Marker" elements: notifications from the speech synthesizer to applications when a marker is reached
![Page 54: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/54.jpg)
JSGF - JSpeech Grammar Format
Developed by Sun and SpeechWorks as a syntax for expressing speech grammars
Based on the Java Speech API Grammar Format
http://java.sun.com/products/java-media/speech/
![Page 55: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/55.jpg)
Microsoft's SALT
Speech Application Language Tags
• Microsoft, Cisco, Intel, Comverse, SpeechWorks, Philips
A "lightweight" set of tags designed to be used with HTML and XHTML, so that telephony applications can be driven from regular Web documents.
Targeted at supporting multimodal access
![Page 56: Speech Technologies and VoiceXML](https://reader036.vdocuments.mx/reader036/viewer/2022070401/56813702550346895d9e8e66/html5/thumbnails/56.jpg)