Human-Machine Dialogue: Hope and Reality

Dr. Zhang Sen
[email protected]
Chinese Academy of Sciences, Beijing, CHINA
2014/8/15

TRANSCRIPT

Page 1: Human-Machine Dialogue: Hope and Reality

Human-Machine Dialogue: Hope and Reality

Dr. Zhang Sen

[email protected]

Chinese Academy of Sciences, Beijing, CHINA

2014/8/15

Page 2: Human-Machine Dialogue: Hope and Reality

OUTLINE

• Overview

• Core Technologies
  – Speech-to-Text
  – Text-to-Speech
  – Natural Language Processing
  – Dialogue Management
  – Middlewares & Protocols

• Conclusion

Page 3: Human-Machine Dialogue: Hope and Reality

Overview

• Motivation and Goal

• State of the art

• Why so difficult?

• Application Areas

• My work

Page 4: Human-Machine Dialogue: Hope and Reality

Motivation and Goal

• Machines are tools invented by humans
  – the Industrial Revolution freed humans from manual labor
  – will the Information Revolution free humans from mental labor?
  – fundamental functions are required

• Hope and Goal (Bill Gates)
  – talk with machines freely via speech/natural language
  – machines can understand/imitate human activities

• Machine intelligence
  – Turing test, classical and extended

Page 5: Human-Machine Dialogue: Hope and Reality

Turing’s Question

• Alan M. Turing
• "Computing Machinery and Intelligence"
  (Mind, 1950, Vol. 59, No. 236, pp. 433-460)

– I propose to consider the question, “Can machines think?” This should begin with definitions of the meaning of the terms “machine” and “think”.

• To answer this question, Turing proposed the "Imitation Game", later named the "Turing Test"

Page 6: Human-Machine Dialogue: Hope and Reality

Turing Test

[Figure: Turing Test setup. An Observer converses with Subject #1 and Subject #2 and must decide which subject is the machine.]

Simple, operational, objective, and convincing.

Page 7: Human-Machine Dialogue: Hope and Reality

Conditions and Answers

• Conditions
  – classical Turing test: communication assumed to be via typed text (keyboard)
  – extended Turing test: communication assumed to be via speech input/output
  – communication assumed to be unrestricted (as to subject, etc.)

• The ability to communicate is equated with "thinking" and "intelligence" (Turing)

Page 8: Human-Machine Dialogue: Hope and Reality

Turing Test - Today

• Today, despite great advances in hardware and software (a computer can even defeat the world's best chess player), machines are still unable to fool an interrogator on unrestricted subjects.

• Turing predicted the classical test would be passed within 50 years; strictly speaking, it has neither been passed nor failed yet. The extended Turing test is harder still and has a long way to go. The same holds for some AI experts' predictions from the 1950s and 1960s.

• Human-machine dialogue has become possible and can provide useful functions
  – travel reservations, stock brokerages, banking, etc.

Page 9: Human-Machine Dialogue: Hope and Reality

Impact and Influence

• Though the Turing test has not been passed, it has promoted and boosted great advances in many areas:
  – Computer Science

– AI

– Cognitive Science

– Natural Language Processing (NLU, NLG, ...)

– MT

– Robot

– Speech-to-Text

– Text-to-Speech

– Computer Vision

– etc

Page 10: Human-Machine Dialogue: Hope and Reality

Projects

• DARPA projects
  – two rounds, in the 80s and 90s: ATIS (996 words, connected speech)
  – Communicator (>5000 words, continuous speech)

• MIT projects, Galaxy

• CMU, OGI, JANUS project

• Bell Lab, IBM, Microsoft, VUI, VoiceXML, SALT

• Verbmobil, DFKI (Germany), SUNDIAL

• Grenoble, INRIA (France), MIAMM, OZONE

• ATR, JSPS projects (Japan)

• CSTAR-I, II, III, S2S project

• etc

Page 11: Human-Machine Dialogue: Hope and Reality

State of the Art

• Subject-restricted, small-vocabulary dialogue is possible, but far from satisfactory

• Metrics for the evaluation of H-M dialogue systems (CU Communicator 2002; the values are means)
  – task completion (70%), time to completion (260 s)
  – total turns to completion (37), response latency (2 s)
  – user words to task end (39), system words to task end (332)
  – number of reprompts (3)
  – WER (22%-30%)
  – the DARPA Communicator project proposed a set of metrics including more than 18 items

Page 12: Human-Machine Dialogue: Hope and Reality

Overview Architecture

[Figure: overall architecture. Speech I/O, NLU, and DM are connected through middleware layers; the DM accesses the Applications and a DB/KB.]

Page 13: Human-Machine Dialogue: Hope and Reality

Ozone’s Architecture

[Figure: Ozone platform architecture. Four layers (Applications/Top layer; Service Enabling layer, WP2; Software Environment layer, WP3; Platform/Device Architecture layer, WP4) are connected by Ozone-compliant middleware and device-platform interfaces. The middleware covers service infrastructure (discovery/lookup, naming, control, composition, migration, eventing and transactions), platform infrastructure (device discovery, booting, device/resource/network/power management, network mobility), context awareness (context model, community/preference/profile management), multi-modal user and environment interaction (screen/speaker, camera/microphone, keyboard, pointer, sensors, actuators; UI management, smart agents, multi-modal widgets, perception QoS), security and identity (user/key/access-rights/digital-rights management), data processing (compression/decompression, encryption/decryption, rendering/composition, scaling), storage (file system, DBMS, content and knowledge stores), external and application-related services (web services, video-on-demand, video-conferencing), networking (streaming protocols, monitoring, control), and a compute infrastructure with a standardized Ozone run-time environment for portable code. Goals: seamless operation, interoperability, extendibility, high performance, adaptability, reconfigurability.]

Page 14: Human-Machine Dialogue: Hope and Reality

Ozone’s Architecture (WP2)

[Figure: Ozone Service Enabling layer (WP2). Ozone applications and services sit on top of the Software Environment layer and use: interaction services (speech recognition, gesture recognition, animated agent, video browser, user-interaction module), dialog management with multi-modal widgets, a smart agent, user-interface management and perception QoS, context awareness (user context, Ozone context), and security services (authentication, content-access protection, encryption).]

Page 15: Human-Machine Dialogue: Hope and Reality

Galaxy Hub Architecture (MIT, CU)

[Figure: MIT Galaxy hub architecture with the CU Communicator. A central Hub routes messages among the ASR, TTS, audio server, database, NL parser, NL generator, confidence server, DM, and the WWW.]

Page 16: Human-Machine Dialogue: Hope and Reality

Why So Difficult?

• Natural language variation
  – ambiguity at the word and sentence levels
  – NL as an open, changing set; can it be treated numerically?

• Speech variation and communication channel distortion
  – non-stationarity, rate, power, timbre, …
  – what is the fundamental feature of speech?

• Computing power limitations
  – requirements of optimal search algorithms

• Current computer architecture limitations
  – weak at dealing with analog, fuzzy values

• Limited knowledge of human intelligence
  – the learning mechanism of human beings

Page 17: Human-Machine Dialogue: Hope and Reality

Open Issues

• Can ASR hear everything?

• Can NLP understand everything heard?

• Can DM deal with multiple strands?

• Does TTS sound natural?

• In my opinion, problems such as ASR, NLP, TTS, MT, etc. share some common characteristics; if one is solved, the others will follow.

Page 18: Human-Machine Dialogue: Hope and Reality

Main Methodologies

• Statistical approach
  – training problems, false-sample problems

• Rule-based approach
  – rule selection and conflicts

• DP-based search algorithms
  – Viterbi, forward-backward search, beam search

• Mathematical modeling
  – time-series finite-state transition models

Page 19: Human-Machine Dialogue: Hope and Reality

Application Areas

• Improving existing applications
  – scheduling: airlines, hotels
  – financial: banks, brokerages

• Enabling new applications
  – complex travel planning
  – voice Web search and browsing
  – speech-to-speech MT
  – catalogue ordering

• Many applications require Text-to-Speech
  – role-playing games

– speaking toys

Page 20: Human-Machine Dialogue: Hope and Reality

Work at Waseda

• Project "Research on human-machine dialog through spoken language", JSPS sponsored, 1998-2000
• "Improved DTW approach with regard to prominent acoustic features", Proceedings of the ASJ, 1999
• "Re-estimation of LP coefficients in the sense of the L∞ criterion", IEEE ICSLP 2000, Beijing, China
• "Visual approach for Automatic Pitch Period Estimation", IEEE ICASSP 2000, Istanbul, Turkey
• "Automatic Labeling of Initials and Finals in a Chinese Speech Corpus", IEEE ICSLP 2000, Beijing, China
• "A speech coding approach based on a human hearing model", Proceedings of the ASJ, 2000

Page 21: Human-Machine Dialogue: Hope and Reality

Work at CSLR, CU

• Project "CU Communicator", DARPA sponsored and NSF supported, 2000-2001

• N-gram LM smoothing based on word-class information

• Dynamic pronunciation modeling for ASR adaptation
  – Amdahl's law: the 50 most common words
• "What kind of pronunciation variations are hard for tri-phones to model?", IEEE ICASSP 2001, Salt Lake City, USA

Page 22: Human-Machine Dialogue: Hope and Reality

Work at INRIA-LORIA

• Project "Multidimensional Information Access using Multiple Modalities", EU IST sponsored, 2002-2003
• Middleware between the ASR engine and the DM, in XML
• Domain-specific N-gram LM generation based on a set of French language rules, in Perl
• HMM-based acoustic modeling improvement
• "Some issues on speech signal re-sampling at arbitrary rate", IEEE ISSPA 2003, Paris, France
• "An Effective Combination of Different Order N-Grams", The 17th Pacific Asia Conference on Language, Information and Computation, 2003, Singapore
• "Comparison of speech signal resampling approaches", Proc. of the ASJ, 2003, Tokyo, Japan
• "Text-to-Pinyin conversion based on context knowledge and d-tree for Mandarin", IEEE NLP-KE, 2003, Beijing, China

Page 23: Human-Machine Dialogue: Hope and Reality

Spoken Language Toolkit

• Finished in 2003; the speech signal analysis module was integrated into Snorri at LORIA

• Functions:
  – speech signal analysis
  – speech-to-text
  – text-to-speech
  – text-to-grapheme

Page 24: Human-Machine Dialogue: Hope and Reality

Snapshot of Toolkit (1)

Page 25: Human-Machine Dialogue: Hope and Reality

Snapshot of Toolkit (2)

Page 26: Human-Machine Dialogue: Hope and Reality

Snapshot of Toolkit (3)

Page 27: Human-Machine Dialogue: Hope and Reality

Core Technologies

Based on a requirements analysis of human-machine communication, at least the following technologies should be included:

• Speech-to-Text

• Text-to-Speech

• Natural Language Processing

• Dialogue Management

• Middlewares & Protocols

Page 28: Human-Machine Dialogue: Hope and Reality

Speech-To-Text

Page 29: Human-Machine Dialogue: Hope and Reality

The Speech-to-Text Problem

Find the most likely word sequence Ŵ among all possible sequences, given the acoustic evidence A. A tractable reformulation of the problem is:

    Ŵ = argmax_W P(W | A) = argmax_W P(A | W) P(W)

where P(A | W) is the acoustic model and P(W) is the language model. Evaluating this argmax over all word sequences is a daunting search task.

Page 30: Human-Machine Dialogue: Hope and Reality

Speech Recognition Architecture

[Figure: speech recognition architecture. A front end converts the analog speech into an observation sequence O1 O2 … OT; the decoder, using the acoustic model, dictionary, and language model, produces the best word sequence W1 W2 … WT.]

Page 31: Human-Machine Dialogue: Hope and Reality

Front-End Processing: Feature Extraction

[Figure: feature extraction pipeline, including dynamic features; after K.F. Lee.]

Page 32: Human-Machine Dialogue: Hope and Reality

Overlapping Sample Windows

The speech signal is non-stationary; under a short-term approximation, each windowed segment can be viewed as stationary.
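As an illustration, a minimal NumPy sketch of overlapping windowing; the frame length, hop size, and Hamming window are common textbook choices, not values from the slides:

import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice x into overlapping Hamming-windowed frames.
    Defaults assume 16 kHz audio: 25 ms windows with a 10 ms hop."""
    n_frames = 1 + (len(x) - frame_len) // hop   # assumes len(x) >= frame_len
    window = np.hamming(frame_len)               # taper the edges of each frame
    return np.stack([x[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])  # shape (n_frames, frame_len)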

Page 33: Human-Machine Dialogue: Hope and Reality

Cepstrum Computation

• Cepstrum is the inverse Fourier transform of the log spectrum

    c(n) = (1 / 2π) ∫_{-π}^{π} log |S(e^{jω})| e^{jωn} dω,   n = 0, 1, …, L-1

In computation, the IDFT takes the form of a weighted DCT (see HTK).
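A minimal sketch of this computation in NumPy; the FFT size and coefficient count are illustrative assumptions:

import numpy as np

def real_cepstrum(frame, n_fft=512, n_ceps=13):
    """Cepstrum = inverse Fourier transform of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # small floor avoids log(0)
    cepstrum = np.fft.irfft(log_mag, n_fft)
    return cepstrum[:n_ceps]                      # keep the first L coefficients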

Page 34: Human-Machine Dialogue: Hope and Reality

Mel Cepstral Coefficients

• Construct mel-frequency domain using a triangularly-shaped weighting function applied to mel-transformed log-magnitude spectral samples:

• Filter bank: linear spacing below 1 kHz, logarithmic above 1 kHz
• Motivated by human auditory response characteristics
• The most common feature set for recognizers
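To make the pipeline concrete, a hedged NumPy sketch of MFCC extraction for one frame; the filter count, FFT size, and plain DCT-II are common textbook choices, not values from the slides:

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=16000, n_fft=512, n_filters=26, n_ceps=13):
    """Power spectrum -> triangular mel filter bank -> log -> DCT."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    # Filter edge frequencies, equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[i, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)
    log_energy = np.log(fbank @ power + 1e-10)
    # DCT-II of the log filter-bank energies yields the cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ log_energy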

Page 35: Human-Machine Dialogue: Hope and Reality

Cepstrum as Vector Space Features

[Figure: overlapping analysis frames mapped to a sequence of cepstral feature vectors.]

Page 36: Human-Machine Dialogue: Hope and Reality

Features Used in ASR

• LPC
  – Linear Predictive Coefficients

• PLP
  – Perceptual Linear Prediction

• Though MFCC has been used successfully, what is the truly robust speech feature?

Page 37: Human-Machine Dialogue: Hope and Reality

Acoustic Models

• Template-based AMs, used in DTW; now obsolete

• Acoustic states represented by Hidden Markov Models (HMMs)

– Probabilistic State Machines - state sequence unknown, only feature vector outputs observed

– Each state has output symbol distribution

– Each state has transition probability distribution

– Issues: what topology is proper? How many states per model? How many mixtures per state?

[Figure: example HMM topologies: normal, silence, and connected models.]
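To make these ingredients concrete, a toy sketch of a left-to-right HMM; the state count, topology, and single diagonal Gaussian per state are illustrative assumptions, not details from the slides:

import numpy as np

class GaussianHMM:
    """Toy left-to-right HMM with one diagonal Gaussian output per state."""
    def __init__(self, n_states, dim):
        self.n_states = n_states
        # Left-to-right topology: either stay in a state or move to the next one
        self.trans = np.full((n_states, n_states), -np.inf)
        for s in range(n_states):
            self.trans[s, s] = np.log(0.5)
            if s + 1 < n_states:
                self.trans[s, s + 1] = np.log(0.5)
        self.means = np.zeros((n_states, dim))
        self.vars = np.ones((n_states, dim))

    def log_output(self, s, o):
        """Log-likelihood of observation vector o under state s's Gaussian."""
        d = o - self.means[s]
        return -0.5 * np.sum(d * d / self.vars[s]
                             + np.log(2 * np.pi * self.vars[s]))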

Page 38: Human-Machine Dialogue: Hope and Reality

Limitations of HMM

• HMMs assume state durations follow an exponential (geometric) distribution
• The transition probability depends only on the origin and destination states
• All observation frames are assumed to depend only on the state that generated them, not on neighboring observation frames (the observation independence assumption)

Paper: "Transition control in acoustic modeling and Viterbi search"

Page 39: Human-Machine Dialogue: Hope and Reality

Basic Speech Unit Models

• Create a set of HMMs representing the basic sounds (phones) of a language
  – English has about 40 distinct phonemes
  – Chinese has about 22 Initials + 37 Finals
  – a lexicon is needed for pronunciations
  – letter-to-sound rules for unusual words
  – co-articulation effects must be modeled

• Tri-phones: each phone modified by its onset and trailing context phones (1k-2k used in English)
  – e.g. pl-c+pr

Page 40: Human-Machine Dialogue: Hope and Reality

Language Models

• What is a language model?
  – a quantitative ordering of the likelihood of word sequences (statistical viewpoint)
  – a set of rules specifying how to create word sequences or sentences (grammar viewpoint)

• Why use language models?
  – not all word sequences are equally likely
  – search space optimization (*)
  – improved accuracy (multiple passes)
  – word-lattice to n-best conversion

Page 41: Human-Machine Dialogue: Hope and Reality

Finite-State Language Model

• Write a grammar of possible sentence patterns

• Advantages:
  – long history/context
  – no need for a large text database (rapid prototyping)
  – integrated syntactic parsing

• Problems:
  – effort to write the grammars
  – word sequences the grammar does not enable do not exist
  – used in small-vocabulary ASR, not for LVCASR

Example finite-state grammar, as a word graph (expanded in the sketch below):

  (show me | display) -> (any | the next | the last) -> (page | picture | text file)
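As an illustration of how small such a grammar's language is, a sketch that enumerates every sentence the word graph above accepts; the GRAMMAR structure is a hypothetical encoding, not from the slides:

from itertools import product

# Each tuple is one position in the word graph; any path through it is a sentence.
GRAMMAR = [("show me", "display"),
           ("any", "the next", "the last"),
           ("page", "picture", "text file")]

def expand(grammar):
    return [" ".join(words) for words in product(*grammar)]

for sentence in expand(GRAMMAR):
    print(sentence)   # 2 * 3 * 3 = 18 sentences in total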

Page 42: Human-Machine Dialogue: Hope and Reality

Statistical Language Models

• Predict the next word based on the current word and history
• The probability of the next word is given by
  – Trigram: P(w_i | w_{i-1}, w_{i-2})
  – Bigram: P(w_i | w_{i-1})
  – Unigram: P(w_i)

• Advantages:
  – trainable on large text databases
  – 'soft' prediction (probabilities)
  – can be combined directly with the AM in decoding

• Problems:
  – need a large text database for each domain
  – sparseness problems, addressed by smoothing approaches
    • backoff approach
    • word-class approach

• Used in LVCASR
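A minimal sketch of bigram estimation with add-one (Laplace) smoothing, one of the simpler alternatives to the backoff and word-class smoothing mentioned above; all names and the tiny corpus are illustrative:

from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams):
    """P(w | w_prev) with add-one smoothing over the observed vocabulary."""
    v = len(unigrams)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + v)

corpus = [["show", "me", "the", "next", "page"],
          ["display", "the", "last", "picture"]]
uni, bi = train_bigram(corpus)
print(bigram_prob("the", "next", uni, bi))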

Page 43: Human-Machine Dialogue: Hope and Reality

Statistical LM Performance

Page 44: Human-Machine Dialogue: Hope and Reality

ASR Decoding Levels

[Figure: decoding levels. Acoustic models map states to phonemes (e.g. /w/ /ah/ /ts/, /th/ /ax/); the dictionary maps phonemes to words ("what's the"); the language model maps words to sentences (e.g. "display kirk's / willamette's / sterett's location / longitude / latitude").]

Page 45: Human-Machine Dialogue: Hope and Reality

Decoding Algorithms

• Given observations, how to determine the most probable utterance/word sequence? (DTW in template-based match)

• The Dynamic Programming (DP) algorithm was proposed by Bellman in the 1950s for multistep decision processes; its "principle of optimality" is a form of divide and conquer.

• DP-based search algorithms are used in speech recognition decoders to return the n-best paths or a word lattice through the acoustic model and the language model.

• A complete search is usually impossible since the search space is too large, so beam search is required to prune less probable paths and save computation.

• Issues: numerical underflow, balancing the LM and AM scores.

Page 46: Human-Machine Dialogue: Hope and Reality

Viterbi Search

• Uses Viterbi decoding
  – takes MAX, not SUM (Viterbi vs. forward algorithm)
  – finds the optimal state sequence, not the optimal word sequence
  – computation load: O(T·N²)

• Time synchronous
  – extends all paths at each time step
  – all paths have the same length (no need to normalize to compare scores, unlike A* decoding)

Page 47: Human-Machine Dialogue: Hope and Reality

Viterbi Search Algorithm

A runnable Python version of the slide's pseudocode; the transition matrix trans[s, s'] and output function emit(s, o_t) are assumed given:

import numpy as np

def viterbi(obs, trans, emit, n_states):
    """obs: T observations; trans[s, s2]: transition prob; emit(s, o): output prob."""
    T = len(obs)
    score = np.zeros((n_states, T))
    backptr = np.zeros((n_states, T), dtype=int)
    score[0, 0] = emit(0, obs[0])               # start in state 0
    for t in range(1, T):
        for s2 in range(n_states):
            cand = score[:, t - 1] * trans[:, s2] * emit(s2, obs[t])
            backptr[s2, t] = np.argmax(cand)    # remember the best predecessor
            score[s2, t] = cand[backptr[s2, t]]
    # Backtrace from the highest-probability state in the final column
    path = [int(np.argmax(score[:, T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[path[-1], t]))
    return path[::-1]

Page 48: Human-Machine Dialogue: Hope and Reality

Viterbi Search Trellis

[Figure: Viterbi search trellis over time steps t = 0, 1, 2, 3 for words W1 and W2.]

Page 49: Human-Machine Dialogue: Hope and Reality

Viterbi Search Insight

[Figure: within-word transitions among states S1, S2, S3 from time t to t+1 score as OldProb(S1) · OutProb · TransProb; cross-word transitions from Word 1 to Word 2 score as OldProb(S3) · P(W2 | W1). Each trellis cell stores a score, a back-pointer, and a parameter pointer.]

Page 50: Human-Machine Dialogue: Hope and Reality

Backtracking

• Find the best association between words and the signal
• Compose words from phones using the dictionary
• Backtracking finds the best state sequence

[Figure: alignment of phones /th/ and /e/ against signal frames t1 … tn.]

Page 51: Human-Machine Dialogue: Hope and Reality

N-Best Speech Results

• Use a grammar to guide recognition
• Post-processing based on the grammar/LM
• Word-lattice to n-best conversion

[Figure: ASR converts the speech waveform, guided by a grammar, into an n-best list: N=1 "Get me two movie tickets…", N=2 "I want to movie trips…", N=3 "My car's too groovy".]

Page 52: Human-Machine Dialogue: Hope and Reality

Complexity of Search

• Lexicon: contains all the words in the system's vocabulary along with their pronunciations (often multiple pronunciations per word; complexity grows with the number of items in the lexicon)

• Acoustic Models: HMMs that represent the basic sound units the system is capable of recognizing (number of models, states per model, mixtures per state)

• Language Model: determines the possible word sequences allowed by the system (fan-out, perplexity, entropy)

Page 53: Human-Machine Dialogue: Hope and Reality

ASR As Modern AI

• Draws on a wide range of AI techniques
  – knowledge representation & manipulation
    • AM and LM, lexicon, observation vectors
  – machine learning
    • Baum-Welch for HMMs
    • nearest-neighbor & k-means clustering for signal identification
  – 'soft' probabilistic reasoning / Bayes rule
    • manages the uncertain mapping from signal to phones to words

• ASR as an expert system

Page 54: Human-Machine Dialogue: Hope and Reality

ASR Summary

• Performance criterion is WER (word error rate)

• Three main knowledge sources
  – Acoustic Model (Gaussian mixture models)
  – Language Model (N-grams, finite-state grammars)
  – Dictionary (context-dependent sub-phonetic units)

• Decoding
  – Viterbi decoder
  – time-synchronous
  – A* decoding (stack decoding; IBM, X.D. Huang)

Page 55: Human-Machine Dialogue: Hope and Reality

Text-to-Speech

Page 56: Human-Machine Dialogue: Hope and Reality

Text-to-Speech

• What is Text-to-Speech?
  – producing spoken language from input text and high-level prosodic parameters

• Main approaches
  – concatenative synthesis
    • glue waveforms together (Festival, MBROLA)
  – parameter-based synthesis
    • Klatt's formant synthesis (MITalk, some clones)
  – articulatory synthesis (still under R&D)

• Basic unit selection
  – di-(tri-)phone models: mid-point to mid-point
  – syllable, sub-syllable (Initials, Finals)

Page 57: Human-Machine Dialogue: Hope and Reality

Text-to-Speech Status

• State-of-the-art Text-to-Speech
  – intelligible, but
  – needs to sound more natural
  – needs better prosody and a sense of personality
  – handling of proper nouns and special names is incomplete
  – handling of times and digits is incomplete
  – handling of abbreviations is incomplete

• Some TTS systems
  – Festival, MBROLA, Jin-Sheng-Yu-Zheng, CTTS

Page 58: Human-Machine Dialogue: Hope and Reality

Human Speech Production Levels

• World knowledge (text normalization)
• Semantics (concept, thought, meaning)
• Syntax (grammar)
• Word (word pronunciation)
• Phonology (intonation assignment)
• Articulation (articulator movements, F0, amplitude, duration)
• Acoustics (synthesis)

Page 59: Human-Machine Dialogue: Hope and Reality

Concatenative Synthesis

• Pre-recorded human speech
  – cut into units, coded, stored (indexed)
  – diphones, triphones

• Given a phonemic transcription
  – rules to select the unit sequence
  – rules to concatenate units based on selection criteria
  – rules to modify duration, amplitude, and pitch, and to smooth the spectrum across junctures

Page 60: Human-Machine Dialogue: Hope and Reality

Concatenative Synthesis Issues

• Speech quality varies based on
  – the size and number of units (coverage)
  – the rules for selection and concatenation
  – the speech coding method used to decompose the acoustic signal into spectral, F0, and amplitude parameters
  – how the original signal is modified to produce output matching the target pattern

Page 61: Human-Machine Dialogue: Hope and Reality

Formant Synthesis

• Parameters of the acoustic model:
  – formant frequencies, bandwidths, amplitudes, etc.

• Phonemes have target values for the parameters

• Given a phonemic transcription of the input:
  – rules to select the sequence of targets
  – rules to determine the duration of target values

• Speech quality is not natural
  – the acoustic model is incomplete
  – human knowledge of the linguistic and acoustic control rules is incomplete (parameter acquisition by short-term analysis)

Page 62: Human-Machine Dialogue: Hope and Reality

Articulatory Synthesis

• Models the articulators: tongue body, tip, jaw, lips, velum, vocal folds, etc. (via 3D X-ray data)

• Rules control the timing of each articulator's movements

• Coarticulation is easy to model, since the articulators are modeled separately

• But it sounds very unnatural
  – the mapping from vocal tract to acoustics is not well understood
  – knowledge of the articulator control rules is incomplete
  – model parameter acquisition is an issue

Page 63: Human-Machine Dialogue: Hope and Reality

TTS Front End

• Segmentation and combination

• Plain or tagged text; tag analysis

• Word-to-phoneme-sequence conversion
  – English: pronunciation model and rules
  – Chinese: lexicon and d-tree

• Text analysis tools: POS tagger, morphological analyzer, light parsing

Page 64: Human-Machine Dialogue: Hope and Reality

Text Normalization

• Context-independent:
  – Mr., 22, $n, USA, VISA

• Context-dependent:
  – Dr., St., 1997, 3/16

• How to resolve abbreviation ambiguities?
  – Dr. (doctor or drive?), PM (? or ?)
  – application restrictions
  – rule- or corpus-based decision procedures
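A hedged sketch of a rule-based disambiguation step of the kind described here; the rules and word lists are invented for illustration:

# Context-independent expansions
FIXED = {"Mr.": "mister", "USA": "U S A"}

def expand_dr(next_word):
    """Toy rule: 'Dr.' reads as 'doctor' before a capitalized name, else 'drive'."""
    if next_word and next_word[0].isupper():
        return "doctor"
    return "drive"

def normalize(text):
    words = text.split()
    out = []
    for i, w in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        if w == "Dr.":
            out.append(expand_dr(nxt))       # context-dependent
        else:
            out.append(FIXED.get(w, w))      # context-independent or verbatim
    return " ".join(out)

print(normalize("Dr. Smith lives at 22 Hill Dr."))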

Page 65: Human-Machine Dialogue: Hope and Reality

Duration Modeling

• How long should each phoneme be? It depends on:
  – the context phonemes
  – position within the syllable and word
  – number of syllables
  – phrasing
  – stress
  – speaking rate
  – speaking style

Page 66: Human-Machine Dialogue: Hope and Reality

Pitch Modeling

• How to create the F0 contour from the accent/phrasing assignment plus duration assignment and phonemes?
  – contour or target models for accents and phrase boundaries (Fujisaki model, statistical models)
  – rules to align with the phoneme string and smooth
  – how does F0 align with different phonemes?

Page 67: Human-Machine Dialogue: Hope and Reality

Prosody Factors

Page 68: Human-Machine Dialogue: Hope and Reality

Can Prosody be Modified?

• Model duration and pitch variation
  – the pitch contour could be extracted directly
    • time domain: autocorrelation, peak detection
    • frequency domain: FFT, wavelet transform
    • still under research (recent ICASSP papers)
  – common approach: TD-PSOLA
    • Time-Domain Pitch-Synchronous Overlap and Add
    • center frames around pitchmarks out to the next pitch period
    • adjust prosody by recombining frames at pitchmarks for the desired pitch and duration
    • increase pitch by shrinking the distance between pitchmarks
    • can sound squeaky
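A much-simplified sketch of the TD-PSOLA idea, assuming the pitchmarks are already known; real systems handle windowing, mark estimation, and duration control far more carefully:

import numpy as np

def psola(x, marks, factor):
    """Raise (factor > 1) or lower pitch by re-spacing Hann-windowed grains
    centered on known pitchmarks (a toy version of TD-PSOLA)."""
    out = np.zeros(len(x))
    for i in range(1, len(marks) - 1):
        period = marks[i + 1] - marks[i]
        start = marks[i] - period
        if start < 0 or marks[i] + period > len(x):
            continue                              # skip grains near the edges
        grain = x[start : marks[i] + period] * np.hanning(2 * period)
        center = int(marks[i] / factor)           # shrink mark spacing => higher pitch
        lo = max(center - period, 0)
        hi = min(lo + 2 * period, len(out))
        out[lo:hi] += grain[: hi - lo]
    return out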

Page 69: Human-Machine Dialogue: Hope and Reality

TD-PSOLA

Page 70: Human-Machine Dialogue: Hope and Reality

Text-to-Speech Architecture

[Figure: TTS architecture. Input text passes through Text Analysis (word dictionary), Prosody Generation (prosody model and prosody templates), and Unit Concatenation (unit inventory) to produce the speech output.]

Page 71: Human-Machine Dialogue: Hope and Reality

Text-to-Speech Summary

• Intelligible, but not very natural

• Many TTS applications now (e.g., e-dict)

• How to model prosody to meet the target?

• Some other approaches
  – large corpus-based concatenative synthesis
  – synthesis-by-analysis

Page 72: Human-Machine Dialogue: Hope and Reality

Natural Language Understanding in Human-Machine Dialog

Page 73: Human-Machine Dialogue: Hope and Reality

NLU in H-M Dialog

• The task of NLU in human-machine dialog is to perform discourse analysis and then send a message to the DM to act or respond.

• The goal is not to "fully" understand the meaning of the discourse or sentence, but only to parse it into classes and determine their attributes and relationships.

Page 74: Human-Machine Dialogue: Hope and Reality

Knowledge Issues for NLU

• Knowledge representation:
  – how to organize and describe knowledge

• Knowledge control:
  – how to apply knowledge

• Knowledge integration:
  – how to use the various knowledge sources

• Knowledge acquisition:
  – how to acquire the required knowledge and maintain consistency of the knowledge base

Page 75: Human-Machine Dialogue: Hope and Reality

Why Parsing Needed?

• Allows dialogue to be more flexible and open

• Allows "wild card" descriptions:
  • I want to fly from "$X" to "$Y"

• Allows out-of-sequence phrases:
  • I want to go from Chicago to Dallas today
  • Today I want to go to Dallas from Chicago

• Extracts the needed information:
  • fills slots and frames
  • sets values of global and local variables

Page 76: Human-Machine Dialogue: Hope and Reality

Parsing in Dialog

• Knowledge for dialog parsing
  – lexicon
  – parsing rules (grammar)
  – ontology

• Linguistic analysis of discourse
  – syntactic analysis (shallow parsing)
  – semantic analysis (deep parsing)

• Parsing methods
  – whole matching: driven by a FSG
  – partial matching: driven by a SLM

Page 77: Human-Machine Dialogue: Hope and Reality

Lexicon Structure

• Lexicon: a list of words and their syntactic and semantic attributes

• Root or stem word form
  – fox, run, Boston

• Optional forms: plurals, tenses
  – fox, foxes
  – run, ran, running

• Part of speech
  – fox: noun
  – run: verb
  – Boston: proper noun

• Link to ontology
  – fox: animal, brown, furry
  – run: action, move fast
  – Boston: city

Page 78: Human-Machine Dialogue: Hope and Reality

Structural Parsing

[Figure: structural parse of "Which is the biggest American city": POS tags WP VBD DT JJ NNP NN, grouped into an NP and VP under S, with "city" mapped to the semantic class PLACE and the modifiers "biggest" and "American" attached to it.]

Page 79: Human-Machine Dialogue: Hope and Reality

Classes-Relationships Parsing

• Semantic classes: PERSON, LOCATION, DATE, TIME, PRODUCT, NUMERICAL VALUE, MONEY, ORGANIZATION, MANNER, DEGREE, DIMENSION, RATE, DURATION, PERCENTAGE, COUNT

[Figure: ontology fragments, e.g. time of day (midnight, prime time, clock time); team/squad (hockey team); institution/establishment (financial institution, educational institution); numerosity/multiplicity and integer/whole number (population, denominator); distance/length (thickness, width/breadth, altitude, wingspan).]

• Slot-filling method

Page 80: Human-Machine Dialogue: Hope and Reality

Phoenix Parser

• Designed by W. Ward at CMU in the 1990s

• Used in the DARPA Communicator

• Parses a sentence into a sequence of semantic frames

• Parsing is pattern matching and slot filling
  – concept: a set of organized frames
  – frame: a set of organized slots
  – slot: patterns, attributes, a context-free grammar
  – pattern: a set of constraints

Page 81: Human-Machine Dialogue: Hope and Reality

Phoenix Parser Example

• “I want to go from Boston to Denver Tuesday morning”

• Phoenix parsing result:
  – Flight_Constraint: Depart_Location.City.Boston
  – Flight_Constraint: Arrive_Location.City.Denver
  – Flight_Constraint: [Date_Time].[Date].[Day_Name].tuesday
    [Time_Range].[Period_Of_Day].morning
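A toy sketch of this kind of pattern-matching slot filling; the patterns and frame layout are invented for illustration, whereas the real Phoenix parser uses compiled context-free grammars:

import re

# Hypothetical slot patterns in the spirit of a travel-domain frame
SLOT_PATTERNS = {
    "Depart_Location.City": re.compile(r"\bfrom ([A-Z]\w+)"),
    "Arrive_Location.City": re.compile(r"\bto ([A-Z]\w+)"),
    "Date.Day_Name": re.compile(
        r"\b(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)\b", re.I),
    "Time_Range.Period_Of_Day": re.compile(r"\b(morning|afternoon|evening)\b", re.I),
}

def parse(utterance):
    """Fill whichever slots match; unmatched slots simply stay empty."""
    frame = {}
    for slot, pattern in SLOT_PATTERNS.items():
        m = pattern.search(utterance)
        if m:
            frame[slot] = m.group(1)
    return frame

print(parse("I want to go from Boston to Denver Tuesday morning"))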

Page 82: Human-Machine Dialogue: Hope and Reality

NLU Summary

• Semantic analysis is still very difficult in NLU due to the ambiguity of natural language

• Today, slot filling is used as a practical parsing technique in H-M dialogue systems and NLU

• The knowledge base and its organization have a heavy influence on parsing performance

Page 83: Human-Machine Dialogue: Hope and Reality

Dialog Manager

Page 84: Human-Machine Dialogue: Hope and Reality

Tasks of DM

• The DM is the hub of the H-M dialog system; it performs the following functions:
  – controls the interaction between the user and the system
  – decides and plans the system's action at each step
  – resolves ambiguities in the interpretation from the NLU
  – estimates confidence in the extracted information
  – integrates new input with the dialog context/history
  – prompts the user for missing information
  – sends information to the NLG for presentation to the user

Page 85: Human-Machine Dialogue: Hope and Reality

State-of-the-Art DM

• The DM as a decision-support system (DSS)
  – decision making based on input, rules, and context
  – decision-making method: d-tree

• Dialog modes
  – directed dialog: current practice
  – free dialog: future

• DM design
  – event-driven (DARPA Communicator)

Page 86: Human-Machine Dialogue: Hope and Reality

Directed Dialogue

• The computer asks all the questions
  – usually presented as a menu or a set of choices
  – "Do you want your account balance, cleared checks, or deposits?"

• The computer always has the initiative
  – the user just answers questions and never gets to ask any

• The DM avoids asking open-ended questions
  – "What can I do for you?"

• The answers to the questions can be explicitly predicted
  – "Do you want to buy or sell stocks?"

• All possible answers must be pre-defined by the application developer (grammars)

• The job gets done, but it may be tedious and tiresome

Page 87: Human-Machine Dialogue: Hope and Reality

CU Communicator DM

• Context
  – a set of frames and a set of global variables

• Event-driven
  – an incoming parse triggers a set of actions
  – and modifies the current context

• The DM attempts the following actions in order:
  – clarify if necessary
  – sign off if all jobs are done
  – retrieve data and present it to the user
  – prompt the user for required information

Page 88: Human-Machine Dialogue: Hope and Reality

Rules to Prompt

• The rules for deciding what to prompt for next are based on the frame in focus or the last system prompt (see the sketch below):
  – if there are unfilled slots in the focus frame, prompt for the highest-priority unfilled slot in the frame
  – if there are no unfilled slots in the focus frame, prompt for the highest-priority missing piece of information in the context
  – the system prompts for whatever information is missing until the frame is complete
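A minimal sketch of that prompting rule; the slot names, priorities, and prompt texts are illustrative, not from the CU Communicator:

# Slots listed in priority order; a value of None means unfilled
FOCUS_FRAME = {"Depart_Location": "Boston", "Arrive_Location": None, "Date": None}
CONTEXT = {"Name": None}

PROMPTS = {
    "Arrive_Location": "Where are you flying to?",
    "Date": "What day do you want to travel?",
    "Name": "What name is the reservation under?",
}

def next_prompt(frame, context):
    """First unfilled slot in the focus frame wins; otherwise fall back to context."""
    for slot, value in frame.items():
        if value is None:
            return PROMPTS[slot]
    for slot, value in context.items():
        if value is None:
            return PROMPTS[slot]
    return None   # frame complete: nothing left to prompt for

print(next_prompt(FOCUS_FRAME, CONTEXT))   # "Where are you flying to?"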

Page 89: Human-Machine Dialogue: Hope and Reality

Task Frame Example

Frame: Air
  [Depart_Loc]+
    Prompt: "where are you departing from?"
    [City_Name]*
      Confirm: "you are departing from $([City_Name]), is that correct?"
      SQL: "dep_$[leg_num] in (select airport_code from airport_codes
            where city is like '!%' $(and state_province like
            '[Depart_Loc].[State]'))"
    [Airport_Code]*

Page 90: Human-Machine Dialogue: Hope and Reality

Issues of DM

• Though the job can be done, the dialog process may be quite long and tiring

• The dialog process is controlled by the system; users don't have the initiative

• Handling exceptions/strands when unexpected information arrives

• The work required to create the frames and grammars

Page 91: Human-Machine Dialogue: Hope and Reality

Middlewares & Protocols

Page 92: Human-Machine Dialogue: Hope and Reality

Middlewares

• Middleware sits between sub-systems or layers (e.g. MS ODBC sits between VB applications and a DBMS)

• Middleware is responsible for the communication between sub-systems and helps them work as an integrated system

• Middleware design
  – the input/output of the related sub-systems
  – protocols for formatting information for communication
  – format conversion
  – example: middleware between ASR and NLU (a sketch follows below)
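A hedged sketch of such a middleware message: wrapping an ASR n-best list in XML for the NLU. The element names are invented for illustration; any real middleware would define its own schema:

import xml.etree.ElementTree as ET

def nbest_to_xml(hypotheses):
    """Wrap (text, score) ASR hypotheses in an XML message for the NLU."""
    root = ET.Element("asr_result")
    for rank, (text, score) in enumerate(hypotheses, start=1):
        hyp = ET.SubElement(root, "hypothesis", rank=str(rank), score=f"{score:.2f}")
        hyp.text = text
    return ET.tostring(root, encoding="unicode")

print(nbest_to_xml([("get me two movie tickets", 0.71),
                    ("i want to movie trips", 0.18)]))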

Page 93: Human-Machine Dialogue: Hope and Reality

VoiceXML

• VoiceXML is a web-oriented voice-application markup language approved by W3C as a standard
  – some dialog system developers have adopted VXML in their project design and development

• Assumes a telephone as the user's input/output device
• Assumes voice or keypad input
• Pre-recorded audio or TTS for output

Page 94: Human-Machine Dialogue: Hope and Reality

Web applications using VoiceXML

• VoiceXML uses a voice browser (on a voice gateway) for audio input and output

• Users use a regular phone to access a VoiceXML-based application

Page 95: Human-Machine Dialogue: Hope and Reality

VoiceXML Evolution

• Roots (1995 onward): AT&T Bell Labs PML/PhoneWeb (1996-), Lucent PML, IBM SpeechML, Motorola VoxML

• VoiceXML Forum (with 380+ other companies)
  – 8/1999: VoiceXML 0.9
  – 3/2000: VoiceXML 1.0
  – 4/2002: VoiceXML 2.0 (working draft)

• W3C standardization (by 2003)

Page 96: Human-Machine Dialogue: Hope and Reality

[Figure: W3C speech interface framework. Telephone audio flows through the ASR (driven by the speech recognition grammar ML and N-gram grammar ML, with a lexicon) and the DTMF recognizer into language understanding (natural language semantics ML) and context interpretation; the dialog manager (VoiceXML, call control) connects to the WWW and drives media planning, language generation, pre-recorded audio, and TTS (speech synthesis ML, with a lexicon).]

Page 97: Human-Machine Dialogue: Hope and Reality

Relations with other MLs

• SGML is the meta-language from which HTML and XML derive; XHTML, VoiceXML, WML, and SALT are XML applications

• Versions: SGML [ISO 8879], XML 1.0, HTML 4.0, XHTML 1.0, VoiceXML 2.0, WML 1.0

• VXML := VoiceXML

Page 98: Human-Machine Dialogue: Hope and Reality

VoiceXML Architecture

• The application (on the voice server) processes requests received from the VoiceXML interpreter and responds with VoiceXML documents

• The VoiceXML browser interprets the VoiceXML documents it receives from the document server and generates events in response to user actions and system events

• The browser is backed by an ASR engine, a TTS engine, and DTMF detection

Page 99: Human-Machine Dialogue: Hope and Reality

VoiceXML Example

<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Example 1 for VoiceXML Review -->
  <form>
    <block> Hello, World! </block>
  </form>
</vxml>

Page 100: Human-Machine Dialogue: Hope and Reality

VoiceXML Applications

• Query applications
  – information retrieval
    • news, sports, hotels, stock quotes, traffic
  – telephone services
    • voice routing, voice dialing

• Transaction applications
  – e-transactions (e-commerce, e-tailing, etc.)
    • call centers, account status, stock trading
  – intranet
    • inventory, ordering

Page 101: Human-Machine Dialogue: Hope and Reality

SALT

• SALT: Speech Application Language Tags

• SALT targets speech applications across a whole spectrum of devices, including telephones, PDAs, tablet computers, and desktop PCs

• SALT supports multi-modal systems

• Input is assumed to come from speech recognition, keyboard or keypad, or mouse

• Output goes to a screen or a speaker (speech)

• Both VoiceXML and SALT are markup languages that describe a speech interface; the main difference is the assumed device

Page 102: Human-Machine Dialogue: Hope and Reality

SALT Code

<!-- Speech Application Language Tags -->
<salt:prompt id="askOriginCity"> Where would you like to leave from? </salt:prompt>
<salt:prompt id="askDestCity"> Where would you like to go to? </salt:prompt>
<salt:prompt id="sayDidntUnderstand" onComplete="runAsk()">
  Sorry, I didn't understand. </salt:prompt>

<salt:listen id="recoOriginCity"
    onReco="procOriginCity()" onNoReco="sayDidntUnderstand.Start()">
  <salt:grammar src="city.xml" />
</salt:listen>

<salt:listen id="recoDestCity"
    onReco="procDestCity()" onNoReco="sayDidntUnderstand.Start()">
  <salt:grammar src="city.xml" />
</salt:listen>

Page 103: Human-Machine Dialogue: Hope and Reality

Conclusions

Page 104: Human-Machine Dialogue: Hope and Reality

• Human-Machine dialogue is now possible on some restricted subjects, such as stock brokerages and travel agencies, but it is far from convenient and satisfactory.

• Artificial Intelligence and Natural Language technology have made rapid advances and have promoted Human-Machine dialogue R&D and many conversational applications.

• Machine intelligence goes beyond Human-Machine dialogue.

• Research on Human-Machine dialogue will benefit and enrich computer science.

Page 105: Human-Machine Dialogue: Hope and Reality

References

• Speech & Language Processing, D. Jurafsky & J. H. Martin, Prentice Hall, 2000
• Spoken Language Processing, X. D. Huang et al., Prentice Hall, 2000
• Statistical Methods for Speech Recognition, F. Jelinek, MIT Press, 1999
• Foundations of Statistical Natural Language Processing, C. Manning & H. Schütze, MIT Press, 1999
• Fundamentals of Speech Recognition, L. R. Rabiner and B. H. Juang, Prentice Hall, 1993
• Dr. J. Picone's speech website: www.isip.msstate.edu

Page 106: Human-Machine Dialogue: Hope and Reality

Thanks