gujarati text-to-speech presentation

09:39 09:39

Text-to-Speech System for Gujarati

Project Presentation by Samyak Bhuta


* PROJECT PROFILE *

Objective: Developing a Text-to-Speech System for Gujarati


* PROJECT PROFILE *

Under the guidance of

Prof. Ram Mohan

Shri Jignesh Dholakia


* PROJECT PROFILE *

At the Resource Centre for Indian Language Technology Solutions in Gujarati,

Faculty of Arts, The M. S. University of Baroda, BARODA.


Next 25 minutes …

> Sound and Speech Sound
> ABC of TTS Systems
> Pilot Project
> GTTS from scratch
> Speech, Syllable and Partneme
> Speech Sounds in detail
> Core Engine
> Language Dependent Components


Sound : a flow of air

[Diagram: air flows from the Source to the Ear and is heard as Sound ♫♪♫]


What makes sounds different ?

The factors responsible for the perceptual difference between one kind of sound and another are:

Amplitude (or volume), which tells how much power the air-flow carries

Frequency (or pitch), which tells at what rate the air-flow pattern repeats itself
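
As a rough illustration of these two factors, the sketch below builds a pure tone in Python. The function name `tone`, the 8000 Hz sample rate and the example values are assumptions made for illustration only; they are not part of the project.

```python
import math

def tone(amplitude, frequency_hz, duration_s=0.01, sample_rate=8000):
    """Return the samples of a pure sine tone.

    amplitude sets how far the air pressure swings (heard as loudness);
    frequency_hz sets how often the pattern repeats per second (heard as pitch).
    """
    n = round(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * frequency_hz * i / sample_rate)
            for i in range(n)]

# Two tones that differ in both factors.
quiet_low = tone(amplitude=0.2, frequency_hz=200)
loud_high = tone(amplitude=0.8, frequency_hz=400)
```

Changing only `amplitude` scales the samples; changing only `frequency_hz` changes how quickly the same shape repeats.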


The “Source” doesn’t matter

An air-flow of kind A will sound the same whether it is generated by source X or by source Y.


Speech Sound

A kind of sound whose source is the human vocal apparatus and which finds its place in human speech.

e.g. ક્ , સ્ , અ , ઈ

A standard called the International Phonetic Alphabet (IPA) is used to depict such sounds.


IPA

The IPA covers almost all the speech sounds of all the languages in the world.

Speech sounds are more formally known as phones.

The IPA uses a set of symbols to represent them, e.g. k , s , ə , i , ʤ

IPA Chart …


IPA Chart


Synthesized Speech Sound

If we can produce the same pattern of air-flow that the human vocal apparatus produces for a speech sound, we can say that we have synthesized that speech sound.


Speech Synthesizer

A mechanism capable of producing synthesized speech sounds in a controlled manner.


Text-to-Speech Systems

A speech synthesizer that is smart enough to produce speech output equivalent to the given text.

The smartness lies in making the output as natural and intelligible as possible.


Text-to-Speech Systems

Usually, a TTS system is specific to one human language and takes input text only in that language.


Basic structure of TTS Systems

The function of any TTS system is generally divided into three subtasks or phases:

I. Preprocessing
II. Phonetic-Prosodic Translation
III. Speech Production

The input text travels through these phases one by one and eventually ends up as speech.
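
The three-phase flow can be pictured as a chain of functions. The sketch below is only a skeleton: all function names are hypothetical and each phase body is a placeholder, so it shows the order in which the text travels, not how any phase is actually implemented.

```python
def preprocess(raw_text):
    # Phase I (placeholder): would expand abbreviations, numbers, dates, etc.
    return raw_text.strip()

def phonetic_prosodic_translation(word_text):
    # Phase II (placeholder): would map letters to phones and attach prosody.
    # Here every non-space character simply stands in for one phone.
    return [ch for ch in word_text if not ch.isspace()]

def speech_production(phones):
    # Phase III (placeholder): would produce audio; here it only describes it.
    return "<speech: " + " ".join(phones) + ">"

def text_to_speech(raw_text):
    # The text travels through the three phases one by one.
    word_text = preprocess(raw_text)
    phones = phonetic_prosodic_translation(word_text)
    return speech_production(phones)
```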


Preprocessing

“Dr. Ajay Shah will come to clinic on 23 ,Jan.”

We read it as …

“DOCTOR Ajay Shah will come to clinic on TWENTY THIRD OF JANUARY”.

Preprocessing converts the input text from its raw form into pronounceable word text.
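
A minimal sketch of such a preprocessing step, reproducing only the example above. The tables `ABBREVIATIONS`, `MONTHS` and `ORDINALS` and the function names are assumptions for illustration; a real preprocessor would need far larger tables and many more rules.

```python
import re

# Tiny illustrative tables; real ones would be much larger.
ABBREVIATIONS = {"Dr.": "DOCTOR"}
MONTHS = {"Jan": "JANUARY", "Feb": "FEBRUARY"}
ORDINALS = {1: "FIRST", 2: "SECOND", 3: "THIRD", 23: "TWENTY THIRD"}

# A day number, optional spaces and comma, a month abbreviation, optional full stop.
DATE_PATTERN = re.compile(r"\b(\d{1,2})\s*,?\s*(" + "|".join(MONTHS) + r")\b\.?")

def expand_date(match):
    day, month = int(match.group(1)), match.group(2)
    if day not in ORDINALS:
        return match.group(0)  # leave days missing from the table untouched
    return f"{ORDINALS[day]} OF {MONTHS[month]}"

def preprocess(text):
    # Expand known abbreviations first, then rewrite dates as words.
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return DATE_PATTERN.sub(expand_date, text)
```

Calling `preprocess("Dr. Ajay Shah will come to clinic on 23 ,Jan.")` yields the spoken form quoted on the slide.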


Phonetic-Prosodic Translation

This phase can be logically divided into two different phases:

• Phonetic Translation
• Prosodic Translation

Real TTS systems may implement these phases separately or as a unit, but together they provide the data for the next phase of TTS.


Phonetic Translation

In human languages, the script in use does not necessarily have a one-to-one mapping with speech.

e.g. enough is pronounced as INAF / inəf (IPA)

છોકરો is pronounced as છોક્રો / ʧokɾo (IPA)


Phonetic Translation

Phonetic translation tells the next phase exactly which speech sounds (phones) are to be produced for the given text.

Phonetic translation is also referred to as letter-to-sound rules.
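
To make the idea of letter-to-sound rules concrete, here is a deliberately naive sketch for a handful of Gujarati letters. The mapping tables and the function name `to_ipa` are assumptions for illustration, and the only rule implemented is "a bare consonant carries a schwa, except at the end of the word". It does not handle schwa deletion inside a word, so it would not produce ʧokɾo for છોકરો.

```python
VIRAMA = "\u0acd"  # Gujarati sign that suppresses the inherent vowel
CONSONANTS = {"ક": "k", "ગ": "g", "ત": "t", "ન": "n", "ણ": "ɳ", "પ": "p",
              "ભ": "bʰ", "મ": "m", "ર": "ɾ", "લ": "l", "શ": "ʃ"}
VOWEL_SIGNS = {"\u0abe": "a", "\u0acb": "o"}  # dependent vowel signs aa and o

def to_ipa(word):
    phones = []
    for i, ch in enumerate(word):
        if ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
            nxt = word[i + 1] if i + 1 < len(word) else None
            # A bare consonant carries an inherent schwa, except word-finally.
            if nxt is not None and nxt != VIRAMA and nxt not in VOWEL_SIGNS:
                phones.append("ə")
        elif ch in VOWEL_SIGNS:
            phones.append(VOWEL_SIGNS[ch])
        elif ch == VIRAMA:
            continue  # the virama itself produces no sound
        else:
            raise ValueError(f"letter not covered by this sketch: {ch!r}")
    return "".join(phones)
```

With these tables, `to_ipa("રામ")` gives `ɾam` and `to_ipa("લશ્કર")` gives `ləʃkəɾ`.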


Prosodic Translation

Mapping by letter-to-sound rules only provides information about the kind of speech sound to be generated. To convey the emotions and expressions residing in the input text, prosody needs to be applied.

By prosody we mean:

Amplitude + Pitch + Duration
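
One way to picture the output of this phase is a phone annotated with the three prosodic values. The record layout, the field names and the neutral default values below are assumptions for illustration; the presentation does not specify a data format.

```python
from dataclasses import dataclass

@dataclass
class ProsodicPhone:
    phone: str         # IPA symbol produced by phonetic translation
    amplitude: float   # relative loudness, 1.0 = neutral (assumed scale)
    pitch_hz: float    # target pitch in hertz
    duration_ms: int   # how long the phone is held

def apply_flat_prosody(phones, pitch_hz=120.0, duration_ms=80):
    """Attach the same neutral prosody to every phone (a placeholder rule)."""
    return [ProsodicPhone(p, 1.0, pitch_hz, duration_ms) for p in phones]
```

A real prosodic translation would vary these values from phone to phone to convey expression.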


Speech Production

This phase is responsible for the actual output of the speech.

It uses the phonetic and prosodic information provided by the previous phase.

Various approaches exist for the production of speech.


Different ways for Speech Production

Three widely used approaches for speech production are:

• Articulatory Synthesis
• Source-Filter Synthesis
• Concatenative Synthesis

The speech production part of a TTS system is generally referred to as the speech engine.


Use cases

Once we understood the structure of TTS systems, we realized that all three phases are required in order to develop a complete TTS for Gujarati.

At the topmost abstraction level, a use case can be conceived for fulfilling the requirement of having a TTS system for Gujarati.


Use cases

The topmost use case can then be divided into three further use cases, each fulfilling the requirement of one of the three phases.

During the project we tried to realize each use case one by one.


Pilot Project

As we approached the various requirements and use cases to be realized, we found that developing a Preprocessor is not as significant as developing the other two phases, so we decided to develop it later.

We decided to develop the Phonetic-Prosodic Translation phase first, as it can easily be plugged into any already built … speech


Pilot Project

… speech engine that takes its input in terms of IPA.

FreeTTS, IBMJS, Dhvani and Narad were studied.

We used the Java Speech API along with IBMJS as the speech engine.

The input to the engine was provided through the Java Speech Markup Language (JSML).


Pilot Project : Objective

To develop a TTS system using an already available speech engine, supplying the engine with the IPA transcription (equivalent) of the target Gujarati Unicode text.


Pilot Project : S/W Requirement

A speech engine component which takes IPA and speaks it out.


Pilot Project : Design

A number of use cases were conceived, and their implementations were provided as different Java classes.


Pilot Project : Conclusion

We cannot continue developing a TTS system with an “outsider” speech engine, as the accent and other characteristics need to be Gujarati in nature.


Starting of GTTS from Scratch

From the result of the Pilot Project we concluded that the speech engine has to be developed with Gujarati in mind.

The concatenative approach was chosen since it provides naturalness and has a proven track record.


Concatenation

In the concatenative approach, already stored segments of sound are joined together to produce the complete speech.

Such segments are known as concatenation units.

We used partnemes as our concatenation unit.


Partnemes

A partneme is a very small segment of sound whose typical length ranges from 8 ms to 100 ms. We get the partnemes by cutting up recorded speech.

But before understanding what a partneme is, we have to understand human speech in greater detail, especially the relation between speech and syllable.


How we speak ?

At the time of normal breathing, the period we devote to breathing in is longer than that of breathing out in a complete breath cycle.

But when we start speaking, the breath-in period becomes shorter, paving the way for a longer breath-out period.

This is because to speak out anything we need some air-flow. We use the air-flow …


How we speak ? : Human Vocal Tract

… powered by the lungs, during breath-out.

This air-flow is modified at various points of the human vocal tract, ending up as one or another kind of speech sound (phone).

The human vocal tract comprises various organs which, in one way or another, change the air-flow.

Human Vocal Tract …


[Figure: Human Vocal Tract]


How we speak ? : Syllable and Speech

During one complete breath cycle we can speak out more than one phone.

All the phones spoken out in just one breath cycle constitute a syllable.

A sequence of such syllables in continuity forms speech.


How we speak ? : Syllable Structure

It is important to know the structure of the syllable in order to understand partnemes.

Typically a syllable is made up of a vowel as the nucleus with consonants around it.

Gujarati employs the following syllable structure.

< C + C + C + V + V̯ + C + C >


How we speak ? : Syllable Structure

< C + C + C + V + V̯ + C + C >

where C - consonant

V - vowel

V̯ - non-syllabic vowel

An utterance (spoken word) is made up of a series of such syllables.


How we speak ? : Syllable Structure

રામ - ɾam is made up of a single syllable. Here the structure becomes < ɾ(C) + a(V) + m(C) >.

પત્ર - pətɾ is also made up of a single syllable. Here the structure becomes < p(C) + ə(V) + t(C) + ɾ(C) >.

લશ્કર - ləʃkəɾ is made up of two syllables. Here the structure becomes < l(C) + ə(V) + ʃ(C) > < k(C) + ə(V) + ɾ(C) >.
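
The template < C + C + C + V + V̯ + C + C > can be checked mechanically. The sketch below assumes that every slot except the vowel nucleus is optional, which is an interpretation consistent with the three examples above rather than something the slides state. The letter `N` standing for the non-syllabic vowel and the function name are likewise illustrative choices.

```python
import re

# Up to three onset consonants, one vowel nucleus, an optional
# non-syllabic vowel (written N here), and up to two coda consonants.
SYLLABLE_TEMPLATE = re.compile(r"C{0,3}VN?C{0,2}")

def fits_template(classes):
    """classes is a string such as 'CVC', one letter per phone."""
    return SYLLABLE_TEMPLATE.fullmatch(classes) is not None
```

Under this reading, `fits_template("CVC")` (ɾam) and `fits_template("CVCC")` (pətɾ) hold, while `"CVCCVC"` (ləʃkəɾ) is rejected as a single syllable and has to be split in two.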


How we speak ? : Consonants and Vowels

Consonants and vowels are two different kinds of speech sound with different acoustic parameters.

To know the exact difference between consonants and vowels, we have to understand how a single vocal tract is capable of producing so many different sounds.


How we speak ? : Articulation

Modification of the air-flow is achieved by the articulation of various speech organs of the vocal tract.

The exact nature of the speech sound that comes out during breath-out is determined by

1 Place of Articulation

2 Manner of Articulation


How we speak ? : Place of articulation

Place of articulation refers to the exact point in the human vocal tract where the articulation happens.

e.g. [p] - two lips

[k] - back of tongue with velum

[ɾ] - tip of tongue with alveolar ridge


How we speak ? : Manner of articulation

Manner of articulation refers to the degree of constriction made during the articulation.

e.g. [p] - stop or plosive

[ʧ] - affricate

[ɾ] - tapped

[ j ] - glide

[ o ] - vowel ( no constriction )


How we speak ? : Voicedness

If the vocal cords are vibrating (and thus changing the air-flow) while the air-flow travels through the glottis, we get a voiced sound.

e.g. [g] - voiced

[k] - unvoiced


How we speak ? : Aspiration

Aspiration refers to the state of the vocal cords during the final stage of the process of speaking out a phone. When we speak out an aspirated phone, the vocal cords approach the vibrating state gradually as time goes on (irrespective of the phone's voicedness).

e.g. [kʰ ] - aspirated

[ k ] - unaspirated
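
The four properties discussed so far (place, manner, voicedness, aspiration) can be collected into a small feature table. The sketch below covers only phones used as examples on these slides; the class and function names are illustrative, and the place and manner labels use standard phonetic terms (bilabial for "two lips", velar for "back of tongue with velum").

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Phone:
    symbol: str
    place: str
    manner: str
    voiced: bool
    aspirated: bool

PHONES = {
    p.symbol: p for p in [
        Phone("p", "bilabial", "plosive", voiced=False, aspirated=False),
        Phone("k", "velar", "plosive", voiced=False, aspirated=False),
        Phone("kʰ", "velar", "plosive", voiced=False, aspirated=True),
        Phone("g", "velar", "plosive", voiced=True, aspirated=False),
        Phone("ɾ", "alveolar", "tap", voiced=True, aspirated=False),
    ]
}

def differing_features(a, b):
    """List the feature names in which two phones differ."""
    fa, fb = PHONES[a], PHONES[b]
    return [name for name in ("place", "manner", "voiced", "aspirated")
            if getattr(fa, name) != getattr(fb, name)]
```

For instance, [k] and [g] differ only in voicedness, and [k] and [kʰ] only in aspiration.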


Segmentation and Partneme

Segmentation into partnemes is achieved by cutting up the recorded syllable.

Given is the sound waveform for ગમન built with partnemes. Red lines mark the separation.


Partnemes

As shown, a syllable is logically divided into:

• null sound to consonant transition
• core consonant
• consonant to vowel transition
• core vowel
• vowel to consonant transition
• core consonant
• consonant to null sound transition


Partnemes

If we can provide the partnemes for each vowel and consonant, we can join them accordingly to produce any complete syllable and hence any utterance.

e.g.

કરણ - kəɾəɳ

0_k;k;k_ə;ə;ə_ɾ;ɾ;ɾ_ə;ə;ə_ɳ;ɳ;ɳ_0


ભારત - bʰaɾət

0_bʰ;bʰ;bʰ_a;a;a_ɾ;ɾ;ɾ_ə;ə;ə_t;t;t_0
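
The partneme sequences shown in these examples follow a regular pattern: a transition into each phone, the core of the phone, and a closing transition to the null sound `0`. The sketch below reproduces that pattern for a list of phones. The function name is hypothetical, and treating the whole word as one unbroken span (no null sound between syllables) is inferred from the two examples rather than stated in the slides.

```python
def partneme_sequence(phones):
    """Build the ';'-separated partneme names for a list of IPA phones."""
    if not phones:
        return ""
    units = []
    previous = "0"  # the utterance starts from the null sound
    for phone in phones:
        units.append(f"{previous}_{phone}")  # transition into the phone
        units.append(phone)                  # core of the phone
        previous = phone
    units.append(f"{previous}_0")            # transition back to the null sound
    return ";".join(units)
```

For example, `partneme_sequence(["k", "ə", "ɾ", "ə", "ɳ"])` gives the sequence listed for kəɾəɳ.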


Core Engine

The speech engine we developed to concatenate such partneme sequences, based on the given IPA, uses a pair of files.

One, called the Voice File, contains the audio data of all the partnemes.

The other serves as an index into the Voice File and is called the Voice Info File. It contains the position and length of each partneme in the Voice File.
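
The relationship between the two files can be sketched with in-memory stand-ins: a byte string playing the role of the Voice File and a dictionary of (position, length) pairs playing the role of the Voice Info File. The actual file formats are not described in the presentation, so the layout and function names below are assumptions for illustration only.

```python
def build_voice(partneme_audio):
    """Pack {name: audio bytes} into one blob plus an index of (offset, length)."""
    voice = bytearray()
    info = {}
    for name, audio in partneme_audio.items():
        info[name] = (len(voice), len(audio))  # where this partneme starts, how long it is
        voice += audio
    return bytes(voice), info

def concatenate(sequence, voice, info):
    """Join the audio of the named partnemes in the order given."""
    out = bytearray()
    for name in sequence.split(";"):
        offset, length = info[name]
        out += voice[offset:offset + length]
    return bytes(out)
```

Keeping the index separate means a partneme that occurs many times in an utterance is stored only once in the audio data.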


Core Engine

The Core Engine realizes the use case of having a speech engine.


Language Dependent Components

Since the Core Engine only understands an IPA sequence, we have to provide a component which translates Gujarati text into an IPA sequence.

The preprocessing capabilities also need to be developed for a complete TTS system.

Unlike the Core Engine, both aforementioned components would be specific to a particular language and …


Language Dependent Components

… are therefore kept aside as language dependent components.

Preprocessor:

As preprocessing should be highly customizable by the end user, we have provided a text file which can be edited to control the functionality of the preprocessor.


IPATranscriptor: This component currently provides only the phonetic translation of the given Gujarati text, as complete rules for prosodic translation are not available.


Thanks

Prof. Bhartiben Modi

Mr. Ajay Sarvaiya

Mr. Irshad Shaikh

Mr. Mihir Trivedi


Sloka

Having grasped the meanings through the intellect, the soul joins the mind with the desire to speak. The mind kindles the bodily fire, and that (bodily fire) impels the vital breath. That impelled air, striking the head (the crown), reaching the mouth and passing through the respective places, gives rise to speech sounds of five kinds through tone, duration, place, and the contribution of external and internal efforts.

- Pāṇinīya Śikṣā, tenth chapter, kārikā 6, 9.
