
The Uninvited Guest: Information’s Role in Spoken Language

Steven Greenberg
The Speech Institute
http://www.icsi.berkeley.edu/~steveng
[email protected]

For Further Information

Consult the web site:

www.icsi.berkeley.edu/~steveng

I have a dream …

A Vision of the Future

That one day a true science of spoken language technology will emerge

A Vision of the Future

A scientific approach to speech technology based on rigorous quantitative models of spoken language
Spanning multiple tiers of linguistic organization

A Vision of the Future

And one that is capable of supporting superb technology applications in synthesis, recognition and coding of spoken language

A Vision of the Future

On what will this technology be based? And how can the process be jump-started?
These questions form the focus of my presentation

A Vision of the Future

ALL technology is ultimately based on scientific insight and understanding – without this, the technology will be inherently flawed and unreliable

Importance of a Scientific Framework

An example will suffice to illustrate the point …
Atomic weapons (and power plants) would not be possible without a thorough understanding of nuclear fission and fusion
The science came first, the technology second
This is a horrific example, but one that (I hope) makes the point

Importance of a Scientific Framework

Unfortunately, we currently lack a deep understanding of spoken language
Until a scientifically grounded framework is developed, speech technology will be neither truly effective nor efficient to produce
What is required to accomplish this objective?

Dawn of a New Age

A theoretical framework is required, one that is based on large amounts of empirical data
No single scientific perspective is likely to provide the detail and depth required to encapsulate the major properties of spoken language

Theoretical Framework for Spoken Language

An example of how much we have to learn about spoken language …
According to the conventional wisdom, the phoneme is the key phonetic unit
In my view, the syllable - not the phoneme - is the key unit, serving as the interface between sound and meaning
It binds the higher and lower tiers of linguistic organization
There is a lot of evidence in support of this position

A Syllable-Centric Perspective

But we currently lack the tools to definitively address such issues
Instead of elaborating on this particular topic, I would like to focus on what is required to develop flawless speech technology, and how to go about accomplishing this objective
Our current technology, flawed though it may be, offers a means to develop a firm scientific basis for spoken language research
And thereby dramatically improve the quality of speech technology itself

The Soul of a New (Speech) Machine

We already possess the seeds of the requisite technology
It is mainly a matter of improving certain components and then using the technology to perform the requisite research
And using the results of the research to improve the technology

Seeds of a New Technology

In a previous presentation (Eurospeech 2001), the following figure was used to illustrate the idea
I would like to extend the process encapsulated in this illustration to incorporate a separate (but related) technology – SYNTHESIS

Seeds of a New Technology

The problem with the diagram below is that it doesn’t provide sufficient data with which to fine-tune synthesis and recognition systems

Or to fully understand the underlying bases of spoken language organization and its realization in the acoustic signal

Importance of Controlled Synthesis

One could annotate hundreds of hours of speech material and STILL lack sufficient data with which to rigorously test various models of spoken language comprehension (by human or machine)

Instead, one requires MODELS based on statistical characterization of real speech using a realistic, highly controllable synthesis technology

Importance of Controlled Synthesis

The best candidate (in my opinion) for this controlled synthesis is ….

STRAIGHT
It offers exquisite control over virtually all aspects of the acoustic signal, including:

Duration, amplitude, formant trajectories and fundamental frequency
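To make this parameter-level control concrete, here is a minimal sketch (in Python, using only numpy) of the analysis-modification-synthesis workflow that a STRAIGHT-style vocoder enables. The class and function names are invented for illustration; they are not STRAIGHT’s actual interface, and real formant control would operate on the spectral envelope rather than the toy operations shown here.

    # Hypothetical sketch: frame-based vocoder parameters that can be
    # manipulated independently before resynthesis. Not STRAIGHT's API.
    import numpy as np

    class VocoderParams:
        """One F0 value and one spectral-envelope slice per 5 ms frame
        (a common vocoder convention)."""
        def __init__(self, f0, envelope, frame_ms=5.0):
            self.f0 = np.asarray(f0, dtype=float)        # Hz per frame, 0 = unvoiced
            self.envelope = np.asarray(envelope, float)  # frames x freq-bins
            self.frame_ms = frame_ms

    def scale_duration(params, factor):
        """Uniformly time-stretch by resampling the frame sequence."""
        n_in = len(params.f0)
        idx = np.linspace(0, n_in - 1, int(round(n_in * factor)))
        f0 = np.interp(idx, np.arange(n_in), params.f0)
        env = np.stack([np.interp(idx, np.arange(n_in), col)
                        for col in params.envelope.T], axis=1)
        return VocoderParams(f0, env, params.frame_ms)

    def scale_pitch(params, factor):
        """Shift the pitch contour without touching duration or envelope."""
        return VocoderParams(params.f0 * factor, params.envelope, params.frame_ms)

    # Usage: slow an utterance by 20% and raise its pitch 10%, independently
    params = VocoderParams(f0=np.full(200, 120.0), envelope=np.ones((200, 513)))
    modified = scale_pitch(scale_duration(params, 1.2), 1.1)

The point is that duration and pitch can be manipulated independently because the parameterization separates them, which is precisely what makes controlled perceptual experiments possible.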

STRAIGHT’s progenitor, Hideki Kawahara, will undoubtedly provide further information during this workshop about its current capabilities

I would like to focus on what additional capabilities could be incorporated into STRAIGHT in order for it to reach its true potential as a tool for performing basic and applied speech research

Going STRAIGHT

What follows is a “wish list” …
STRAIGHT currently lacks some important features
Such as:
(1) a linguistic interface
(2) a prosody engine
(3) pronunciation models
(4) duration models

So … I would like to focus on what STRAIGHT needs in order to be truly useful for the sort of spoken language research and technology described earlier
And then describe how this new application – “Super STRAIGHT” – could be used to dramatically improve both our scientific understanding and the quality of speech technology (a hypothetical sketch of such a linguistic front-end appears below)
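As a thought experiment, the four wish-list items could meet in a single input specification: a pronunciation model populates a linguistic tier with syllables, and prosody and duration models then annotate it before synthesis. The sketch below is hypothetical Python; every field name is invented for illustration and reflects the syllable-centric framework argued for earlier, not any existing STRAIGHT interface.

    # Hypothetical linguistic input for a "Super STRAIGHT" front-end
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Syllable:
        onset: List[str]                 # e.g. ["s", "t"]
        nucleus: str                     # e.g. "ey"
        coda: List[str]                  # e.g. ["t"]
        stressed: bool = False
        duration_ms: Optional[float] = None        # filled in by a duration model
        f0_targets: List[float] = field(default_factory=list)  # by a prosody engine

    @dataclass
    class Utterance:
        words: List[str]                 # orthographic input
        syllables: List[Syllable]        # output of a pronunciation model
        style: str = "neutral"           # hook for emotion / speaking style

    # "state" as one stressed syllable; downstream models would assign
    # timing and F0 targets before handing the result to the synthesizer
    utt = Utterance(words=["state"],
                    syllables=[Syllable(["s", "t"], "ey", ["t"], stressed=True)])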

Straightening Out STRAIGHT

Most synthesizers currently on the market provide a rudimentary (but effective) linguistic interface
Type in a sequence of words and out comes synthetic speech – Voilà!
“Now is the time for all good men …” and so on

The problems with this approach are several:
(1) Either the output sounds mechanical and not human – LPC or formant-based synthesis
(2) Or the synthesis sounds relatively natural, but is based entirely on human voice recordings and can’t be changed in many ways - concatenative synthesis, re-synthesis or PSOLA
(3) In either case, we understand relatively little about why (or why not) the synthesized voice sounds intelligible and/or natural

In my view, Problem Number (3) is the most serious of all, as it limits what can be done to improve the technology
Moreover, with such ignorance-based technology, developing new applications is both time-consuming and expensive, which limits the commercial potential

The Problem(s) with Conventional Synthesis

Utterances are composed of more than merely words
And words contain more than just phonemes
The specific units of organization used have a dramatic impact on the results of both science and technology (see below)
Therefore, it is important (essential, really) to specify the appropriate units of linguistic organization – ones that not only describe what is heard and spoken, but also enable synthesis and recognition systems to perform well
What are the essential units of spoken language?
Articulatory features? Phones? Syllables? Words? Lexemes? Phrases?
All of the above? None of the above? Something else?
An example to illustrate the point:
Synthesis based on DIPHONES (a trans-segmental unit) sounds better than synthesis based solely on SEGMENTS
However, synthesis based on a sophisticated and flexible UNIT-SELECTION system (used in concatenative approaches) sounds even better!
Yet … we do not fully UNDERSTAND why this should be so (a toy version of the unit-selection search is sketched below)
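To make the unit-selection point concrete, here is a toy Python version of the search such systems perform: each position in the target has a set of candidate units (which could be phones, diphones or whole syllables, hence the mix of time scales), and a dynamic-programming pass minimizes the sum of target and concatenation costs. The cost functions below are placeholders, not those of any production system.

    def select_units(candidates, target_cost, join_cost):
        """candidates[i] is the list of database units available at position i.
        Returns the unit sequence with minimal total (target + join) cost."""
        # best holds (cumulative cost, path) pairs, one per candidate so far
        best = [(target_cost(u, 0), [u]) for u in candidates[0]]
        for i in range(1, len(candidates)):
            new_best = []
            for u in candidates[i]:
                cost, path = min(((c + join_cost(p[-1], u), p) for c, p in best),
                                 key=lambda t: t[0])
                new_best.append((cost + target_cost(u, i), path + [u]))
            best = new_best
        return min(best, key=lambda t: t[0])[1]

    # Toy usage: units are (label, duration_ms); prefer durations near 80 ms
    # and penalize large duration jumps at the joins
    candidates = [[("d-ay", 70), ("d-ay", 120)], [("ay-f", 85), ("ay-f", 60)]]
    target_cost = lambda u, i: abs(u[1] - 80)
    join_cost = lambda a, b: abs(a[1] - b[1]) / 10
    print(select_units(candidates, target_cost, join_cost))
    # -> [('d-ay', 70), ('ay-f', 85)]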

The Essential Units of Spoken Language

The problem of linguistic units is even more challenging for recognition
Virtually all current systems use the phone(me) as the basic unit (unless whole-word or phrase recognition is being performed, as in small-vocabulary tasks)
The primacy of the phoneme in automatic speech recognition means that lexical representations and pronunciation models must also take the form of phoneme sequences
This is a problem when a speaker refuses to pronounce words as the dictionary states he/she should (see the illustration below)
Or when emotion, sarcasm and other “paralinguistic” phenomena intrude
Currently, the only effective way for ASR systems to deal with this sort of variation is through “training”
But reliance on training is exceedingly expensive – in terms of time, energy and funds
If the ASR application is only as good as the training material, then extensive data collection and annotation are required for the technology to work well
If the training material is NOT representative of the task, then the application will perform poorly or fail altogether
This is ignorance-based technology at its worst
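A small illustration of the dictionary problem just described: a phoneme-based lexicon can only match the pronunciations it enumerates. The word and its reduction are real enough, but the phone strings below (rough ARPAbet) are illustrative.

    LEXICON = {
        "probably": [
            ["p", "r", "aa", "b", "ax", "b", "l", "iy"],  # canonical
            ["p", "r", "aa", "b", "l", "iy"],             # common reduction
        ],
    }

    def matches(word, observed_phones):
        return observed_phones in LEXICON.get(word, [])

    print(matches("probably", ["p", "r", "aa", "b", "l", "iy"]))  # True
    print(matches("probably", ["p", "r", "aa", "l", "iy"]))       # False: a
    # casual variant ("prolly") missing from the dictionary simply goes
    # unrecognized without retraining or lexicon expansion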

The Essential Units of Spoken Language - ASR

So …
The problems with speech technology are several:

(1) The optimal units of spoken language are essentially unknown
We know from unit-selection-based concatenative synthesis that a MIX of time scales is useful, but we don’t know precisely why this is so

(2) The manner in which these units are combined and organized is not well understood
This is even more of a problem than not knowing the units themselves
It is unlikely that the units are organized on a single level and a uniform time scale
But how many different levels are there? Moreover, the time scales associated with these levels are unknown

(3) How emotion and meaning affect the phonetic character of the utterance is also poorly understood
This is perhaps the greatest challenge of all
How can emotion and meaning be quantified for computational ends?
And what is actually meant by the terms “emotion” and “meaning”?

The Crux of the Problem(s)

So far, I’ve been describing what is essentially a doomsday scenario – we don’t know much about spoken language, and without knowing a lot more, speech technology is effectively VERDAMMT (German for “damned”)
However … the situation is far from hopeless
A lot IS actually known about speech; it’s just that the current state of knowledge is somewhat disorganized and not terribly useful for technology
This is where STRAIGHT comes in … [trumpets blaring]
Because STRAIGHT is capable of fine manipulation of the speech signal that far exceeds the capabilities of other synthesis approaches …
It is possible (in principle) to use STRAIGHT as a tool for evaluating synthesis and recognition applications, as well as for learning more about how human listeners process and decode the speech signal
Now that we’ve all (hopefully) agreed that STRAIGHT can serve an extremely useful role in advancing the state of speech technology and research, the only issue remaining is …
HOW? HOW? HOW?

STRAIGHT to the Rescue

We could spend the entire workshop discussing the specific ways in which STRAIGHT could be used to advance scientific insight and technology (perhaps at the NEXT workshop …)
However, I’ll dwell on just a couple of possibilities to give the basic gist of the approach
Because many of the other presentations in this workshop focus on (human) perception of speech (this was the original motivation for Hideki’s development of STRAIGHT, I believe)
I’ll focus instead on some of the TECHNOLOGICAL applications that could be served by STRAIGHT
In particular, I’ll discuss how STRAIGHT could be used to transcend the limitations of current synthesis technology
As well as how it could be used to correct problems with the acoustic models of current-generation speech recognition systems

The Many Uses of STRAIGHT

Even the best current-generation synthesizers lack the capability of speaking in a wide range of styles and emotions without recourse to relevant recorded material
Because STRAIGHT is derived from human recordings, it can sound extremely natural
The problem, currently, is that the effort required to modify the original recording is time-consuming and tedious
A more efficient means is required to give STRAIGHT a linguistic SOUL

The Soul of a New Machine

Acoustic models for a variety of linguistic units are essential
How can these be developed? There are several possible approaches:
(1) Analyze current unit-selection algorithms to infer the appropriate units

This method is relatively efficient, but relies on the efficacy of the unit-selection method, whose deficiencies may be incorporated into the models. Still, it may be an appropriate starting point
(2) Perform systematic recombination experiments to deduce the appropriate units through perceptual studies evaluating both intelligibility and quality

Ghitza (in his “tiling” studies) and others have done this to a limited extent, but much more remains to be done. It is important to learn the time-frequency coordinates associated with the various units. However, this approach is somewhat inefficient in that it requires dozens (if not hundreds) of listening experiments using trained human observers
(3) Develop machine-learning and pattern-recognition algorithms that efficiently evaluate the synthesizer’s output and reformulate the input parameters to improve intelligibility and naturalness

This is the way to go with respect to efficiency and elegance. However, we currently lack the knowledge and sophistication to do so

A Linguistic Interface for STRAIGHT

Let’s ponder the third alternative a little further, as it provides a convenient starting point for the subsequent slides
(3) Develop machine-learning and pattern-recognition algorithms that efficiently evaluate the synthesizer’s output and reformulate the input parameters to improve intelligibility and naturalness
As mentioned, this would be the optimal approach if our knowledge and computational sophistication were not limiting factors
If we COULD do this, how would we proceed? (One possible loop is sketched below)
First, we would need to develop acoustic models for various units of speech representative of the material synthesized
These acoustic models could be derived from material known to be natural-sounding and intelligible
Or they could be based on knowledge of the relevant acoustic patterns
Or on a combination of knowledge and statistical methods
The key is to develop a method of supervised learning that is both efficient and meaningful – easier said than done!
But now that we’ve opened the door to using speech-recognition methods for synthesis, let’s ponder the utility of using STRAIGHT for recognition-system development
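If we could proceed, the loop might look something like the following sketch. Both synthesize and score are assumed stand-ins: score would be an automatic model of intelligibility and naturalness (trained, say, on material known to be good), which is exactly the hard part acknowledged above.

    import random

    def refine(params, synthesize, score, n_iter=100, step=0.05):
        """Hill-climb on synthesis parameters, keeping only changes the
        automatic evaluator prefers. A crude stand-in for real supervised
        learning, but it shows the shape of the loop."""
        best = dict(params)
        best_score = score(synthesize(best))
        for _ in range(n_iter):
            trial = dict(best)
            key = random.choice(list(trial))              # perturb one parameter
            trial[key] *= 1.0 + random.uniform(-step, step)
            trial_score = score(synthesize(trial))
            if trial_score > best_score:                  # keep improvements only
                best, best_score = trial, trial_score
        return best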

Using Recognition Methods for Synthesis

STRAIGHT’s potential for improving automatic recognition performance is even greater than recognition’s potential for improving synthesis
This is because we currently lack ANY systematic method for evaluating the efficacy of recognition models other than word-recognition accuracy
The approach currently used is truly “trial and error” (almost literally)
There is no real insight as to precisely why certain acoustic representations work and others don’t
Dozens of engineers and scientists spend years trying different acoustic models based on a variety of front-end features
This is no way to advance the state of the art (or science)
Instead, STRAIGHT could be used to introduce a variety of systematic changes in the signal over time and frequency
And the impact on both phonetic-feature and word recognition could be ascertained
This would enable the relation between the acoustic details of the waveform and the efficacy of the acoustic models to be determined with precision
All sorts of recognition experiments could be performed to provide insight into the specific factors enhancing or degrading recognition performance (one such experimental grid is sketched below)
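A sketch of such a diagnostic experiment, under the assumption of a STRAIGHT-like manipulate step and an existing recognize function (both hypothetical names):

    import itertools

    def probe_recognizer(test_set, manipulate, recognize):
        """Measure recognition accuracy as a function of two controlled
        acoustic factors (uniform duration scaling and F0 scaling)."""
        results = {}
        for dur, f0 in itertools.product([0.8, 1.0, 1.25], [0.8, 1.0, 1.25]):
            correct = sum(recognize(manipulate(wav, duration=dur, f0=f0)) == ref
                          for wav, ref in test_set)
            results[(dur, f0)] = correct / len(test_set)
        return results  # a precise map from acoustic factor to model fragility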

Using Synthesis to Improve Recognition

STRAIGHT’s potential for improving automatic speech recognition is not confined to acoustic models
Because STRAIGHT enables fine control over the durational properties of the waveform, it is also possible to develop far more sophisticated durational models for recognition than are currently used
Because STRAIGHT also allows control over the signal’s fundamental frequency, it should be possible to develop ASR algorithms that specifically factor pitch contours into the recognition process
STRAIGHT also provides the potential for artificially introducing other sources of variation, such as speaking rate and voice quality, as a means of developing and testing ASR algorithms’ robustness to such commonly encountered variability
Currently, the only way to do this is to perform time-consuming and tedious annotation of spoken corpora and hope that the material encompasses sufficient variation for training recognition systems (and that the annotation is accurate)
Instead, STRAIGHT could be used to develop MODELS of speech variation that are then used to train recognition systems (a minimal version of this idea is sketched below)
This would save a great deal of time and money – and probably be more effective as well
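A minimal sketch of the model-based variation idea, standing in off-the-shelf signal operations (librosa’s time-stretch and pitch-shift) for STRAIGHT, which would permit far finer, speech-specific control:

    import librosa

    def rate_and_pitch_variants(path, rates=(0.9, 1.0, 1.1), shifts=(-2, 0, 2)):
        """Generate speaking-rate and pitch variants of one recording for
        use as additional ASR training material."""
        y, sr = librosa.load(path, sr=16000)
        variants = []
        for rate in rates:
            stretched = librosa.effects.time_stretch(y, rate=rate)
            for steps in shifts:  # pitch shift in semitones
                variants.append(
                    librosa.effects.pitch_shift(stretched, sr=sr, n_steps=steps))
        return variants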

Using Synthesis to Improve ASR Systems

But clever, sophisticated machine-learning approaches are incapable of solving the speech-technology puzzle by themselves
Deep insight and penetrating knowledge are also required
STRAIGHT can help in this sector as well, providing the tools for systematic exploration of the relation between the acoustic signal and the perceived quality and intelligibility of the speech waveform
What is REALLY required to advance the state of the art is a theoretical framework in which to embed our scattered observations
The most effective way to develop this theoretical framework is through rigorous, systematic experiments – hence the crucial role of STRAIGHT
I personally believe that the syllable serves as a primary coordinating unit, one that carries important information about prosody and hence meaning
Articulatory features interact in certain ways that phoneticians conventionally label as segments, but which are in reality reflections of something much deeper
This is not the appropriate forum to discuss this theoretical framework – I mention it only as one specific approach that could be empirically tested using STRAIGHT
It would take years to fully test the theory without STRAIGHT!

What Else is Required?

What is REALLY required to advance the state of the art is a theoretical framework in which to embed our scattered observations
STRAIGHT provides the technical basis with which to begin developing a true science of spoken language
LET US REASON TOGETHER to develop STRAIGHT into an effective scientific tool for building the speech technology of the future

Tomorrow’s Child

That’s All

Many Thanks for Your Time and Attention