Gujarati Text-to-Speech Presentation
DESCRIPTION
Presentation on the development of a text-to-speech system for Gujarati. The input is arbitrary Gujarati Unicode text and the output is the equivalent speech sound.
TRANSCRIPT
Text-to-Speech System for Gujarati
Project Presentation by Samyak Bhuta
* PROJECT PROFILE *
Objective : Developing a Text-to-Speech System for Gujarati
* PROJECT PROFILE *
Under the guidance of
Prof. Ram Mohan
Shri Jignesh Dholakia
* PROJECT PROFILE *
At the Resource Centre for Indian Language Technology Solutions in Gujarati,
Faculty of Arts, The M. S. University of Baroda, BARODA.
Next 25 minutes …
> Sound and Speech Sound
> ABC of TTS Systems
> Pilot Project
> GTTS from scratch
> Speech, Syllable and Partneme
> Speech Sounds in detail
> Core Engine
> Language Dependent Components
Sound : a flow of air
[Figure: air flows from the Source to the Ear and is heard as sound]
What makes different sounds ?
The factors responsible for the perceptual difference between one kind of sound and another are:
Amplitude (or volume), which tells how much power the air-flow holds.
Frequency (or pitch), which tells at what rate the air-flow repeats itself.
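As a rough illustration of these two factors, the sketch below generates the samples of a pure tone from an amplitude and a frequency. The function name `sine_wave` and the 8000 Hz sample rate are assumptions made for this example, not part of the project.

```python
import math


def sine_wave(amplitude, frequency_hz, duration_s, sample_rate=8000):
    """Return the samples of a pure tone as a list of floats.

    amplitude    -- peak value of the wave (heard as volume)
    frequency_hz -- repetitions per second (heard as pitch)
    """
    sample_count = round(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        for n in range(sample_count)
    ]


# Same pitch, different volume: only the amplitude argument changes.
quiet = sine_wave(0.2, 440.0, 0.01)
loud = sine_wave(0.8, 440.0, 0.01)
```

Changing only `frequency_hz` would give a tone of the same volume but a different pitch.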
The “Source” doesn’t matter
An air-flow of kind A will sound the same
whether it was generated by source X
or source Y.
Speech Sound
A kind of sound whose source is the human vocal organs and which finds its place in human speech, e.g. ક્, સ્, અ, ઈ.
A standard called the International Phonetic Alphabet (IPA) is used to depict such sounds.
IPA
IPA covers almost all the speech sounds of all the languages in the world.
Speech sounds are more formally known as Phones.
IPA uses a set of symbols to represent them, e.g. k, s, ə, i, ʤ.
IPA Chart …
IPA Chart
Synthesized Speech Sound
If we can produce the same pattern of air-flow that the human vocal organs produce for a speech sound, we can say that we have synthesized that speech sound.
Speech Synthesizer
A mechanism which is capable of producing synthesized speech sound in a controlled manner.
Text-to-Speech Systems
A Speech Synthesizer which is smart enough to produce the equivalent speech output for a given text.
The smartness lies in making the output as natural and intelligible as possible.
Text-to-Speech Systems
Usually, a TTS System is specific to only one human language and takes input text from only that language.
Basic structure of TTS Systems
The function of any TTS System is generally divided into three subtasks or phases:
I. Preprocessing
II. Phonetic-Prosodic Translation
III. Speech Production
The text input travels through these phases, one by one, and eventually ends up as speech.
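The three phases can be pictured as a simple chain of functions. The sketch below is only an illustration of that structure: `run_tts` and the three stand-in stages passed to it are hypothetical names, and real phases would do far more work.

```python
def run_tts(text, preprocess, translate, produce):
    """Pass the text through the three phases in order."""
    words = preprocess(text)      # Phase I: raw text -> pronounceable words
    targets = translate(words)    # Phase II: words -> phones (plus prosody)
    return produce(targets)       # Phase III: phones -> speech output


# Toy stand-ins for each phase, just to show the data flowing through.
speech = run_tts(
    "Dr. Shah",
    preprocess=lambda t: t.replace("Dr.", "Doctor"),
    translate=lambda w: w.lower().split(),
    produce=lambda p: "|".join(p),
)
```

Keeping each phase behind its own function makes it possible to replace one phase without touching the others.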
Preprocessing
“Dr. Ajay Shah will come to clinic on 23 ,Jan.”
We read it as …
“DOCTOR Ajay Shah will come to clinic on TWENTY THIRD OF JANUARY”.
Preprocessing is meant to convert the input text from its raw condition to pronounceable word text.
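A minimal sketch of such a preprocessing step for the English example above, assuming a small hand-written abbreviation table and a "day, month" date pattern. The tables, the regular expression and the function names are illustrative assumptions, not the project's preprocessor.

```python
import re

# Hypothetical expansion tables for this example only.
ABBREVIATIONS = {"Dr.": "Doctor", "Prof.": "Professor"}
MONTHS = {
    "Jan": "January", "Feb": "February", "Mar": "March", "Apr": "April",
    "May": "May", "Jun": "June", "Jul": "July", "Aug": "August",
    "Sep": "September", "Oct": "October", "Nov": "November", "Dec": "December",
}
_LOW = [
    "", "first", "second", "third", "fourth", "fifth", "sixth", "seventh",
    "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth",
    "fourteenth", "fifteenth", "sixteenth", "seventeenth", "eighteenth",
    "nineteenth",
]
# A day number, a comma, then an abbreviated month with an optional dot.
DATE = re.compile(r"\b(\d{1,2})\s*,\s*(" + "|".join(MONTHS) + r")\b\.?")


def ordinal(day):
    """Spell out a day of the month (1 to 31) as an ordinal."""
    if not 1 <= day <= 31:
        raise ValueError("day must be between 1 and 31")
    if day < 20:
        return _LOW[day]
    if day == 20:
        return "twentieth"
    if day < 30:
        return "twenty " + _LOW[day - 20]
    return "thirtieth" if day == 30 else "thirty first"


def _expand_date(match):
    day = int(match.group(1))
    if not 1 <= day <= 31:
        return match.group(0)  # not a valid day: leave the text untouched
    return ordinal(day) + " of " + MONTHS[match.group(2)]


def preprocess(text):
    """Expand known abbreviations and 'day, month' dates into words."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return DATE.sub(_expand_date, text)


spoken = preprocess("Dr. Ajay Shah will come to clinic on 23 ,Jan.")
```

The optional dot after the month is consumed by the date pattern, so the sentence-final full stop of the example disappears along with the abbreviation.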
Phonetic-Prosodic Translation
This phase can be logically divided into two different phases:
• Phonetic Translation
• Prosodic Translation
Real TTS Systems may implement these phases separately or as a unit, but together they provide the data for the next phase of TTS.
Phonetic Translation
In human languages, the script in use does not necessarily possess a one-to-one mapping with speech.
e.g. enough is pronounced as INAF / inəf (IPA)
છોકરો is pronounced as ʧokɾo (IPA)
Phonetic Translation
Phonetic Translation provides the next phase with information about exactly what kind of speech sounds (phones) are to be produced for the given text.
Phonetic Translation is also referred to as Letter-to-Sound rules.
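To make the idea of letter-to-sound rules concrete, here is a deliberately tiny sketch for a handful of Gujarati letters. The table, the function `to_phones` and its single rule (a consonant carries an inherent ə unless a vowel sign or virama follows, or it ends the word) are assumptions for illustration, not the project's actual rules. The rule is too naive for real text: for છોકરો it would yield ʧ o k ə ɾ o rather than ʧokɾo.

```python
# Hypothetical mini table; code points are written out to avoid ambiguity.
CONSONANTS = {
    "\u0a95": "k",    # ક
    "\u0a9b": "ʧ",    # છ
    "\u0aa3": "ɳ",    # ણ
    "\u0aa4": "t",    # ત
    "\u0aaa": "p",    # પ
    "\u0aad": "bʰ",   # ભ
    "\u0aae": "m",    # મ
    "\u0ab0": "ɾ",    # ર
    "\u0ab2": "l",    # લ
    "\u0ab6": "ʃ",    # શ
}
VOWEL_SIGNS = {"\u0abe": "a", "\u0acb": "o"}  # dependent signs aa and o
VIRAMA = "\u0acd"                             # suppresses the inherent vowel
SCHWA = "ə"


def to_phones(word):
    """Translate a Gujarati word (letters from the table only) to phones."""
    phones = []
    i = 0
    while i < len(word):
        letter = word[i]
        if letter not in CONSONANTS:
            raise ValueError(f"letter not in the demo table: {letter!r}")
        phones.append(CONSONANTS[letter])
        following = word[i + 1] if i + 1 < len(word) else None
        if following in VOWEL_SIGNS:
            phones.append(VOWEL_SIGNS[following])  # explicit vowel sign
            i += 2
        elif following == VIRAMA:
            i += 2                                 # no vowel at all
        else:
            if following is not None:
                phones.append(SCHWA)               # inherent vowel, non-final
            i += 1
    return phones


phones = to_phones("\u0aad\u0abe\u0ab0\u0aa4")  # ભારત
```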
Prosodic Translation
Mapping by letter-to-sound rules only provides information about the kind of speech sound to be generated. To convey the emotions and expressions residing in the input text, Prosody needs to be applied.
By Prosody we mean
Amplitude + Pitch + Duration
Speech Production
This phase is responsible for the actual output of the speech.
The phase uses the phonetic and prosodic information provided by the previous phase.
Various approaches exist for the production of speech.
Different ways for Speech Production
Three widely used approaches for speech production are:
• Articulatory Synthesis
• Source-Filter Synthesis
• Concatenative Synthesis
The speech production part of a TTS System is generally referred to as the speech engine.
Usecases
Having understood the structure of TTS Systems, we realized that all three phases are required in order to develop a complete TTS for Gujarati.
At the topmost abstraction level, a use case can be conceived for fulfilling the requirement of having a TTS System for Gujarati.
Usecases
The topmost use case can then be divided into three further use cases, each fulfilling the requirement of one of the three phases.
During the project we tried to realize each use case one by one.
Pilot Project
As we approached the various requirements and use cases to be realized, we found that developing a Preprocessor was not as significant as developing the other two phases, so we decided to develop it later on.
We decided to develop the Phonetic-Prosodic Translation phase first, as it can be easily plugged into any already built … speech
Pilot Project
… speech engine which takes its input in terms of IPA.
FreeTTS, IBMJS, Dhvani and Narad were studied.
We used the Java Speech API along with IBMJS as the speech engine.
The input to the engine was provided through Java Speech Markup Language (JSML).
Pilot Project : Objective
To develop a TTS System using an already available Speech Engine, supplying the engine with the transcribed (equivalent) IPA text of the target Gujarati Unicode text.
Pilot Project : S/W Requirement
A Speech Engine component which takes IPA and speaks it out.
Pilot Project : Design
A number of use cases were conceived and their implementation was provided as different Java classes.
Pilot Project : Conclusion
We cannot continue developing a TTS System with an “outsider” speech engine, as the accent and other characteristics need to be Gujarati in nature.
Starting of GTTS from Scratch
From the result of the Pilot Project we concluded that the Speech Engine has to be developed with Gujarati in mind.
The Concatenative approach was to be used, since it provides naturalness and has a proven track record.
Concatenation
In the Concatenative approach, already stored segments of sound are joined together to produce the complete speech.
Such segments are known as concatenation units.
We used Partnemes as our concatenation unit.
Partnemes
A Partneme is a very small segment of sound whose typical length ranges from 8 ms to 100 ms. We get the partnemes by cutting up recorded speech.
But before understanding what a partneme is, we have to understand human speech in greater detail, especially the relation between speech and syllable.
How we speak ?
During normal breathing, the period we devote to breathing in is longer than the breath-out period of a complete breath cycle.
But when we start speaking, the breath-in period becomes shorter, paving the way for a longer breath-out period.
This is because to speak out (anything) we need some air-flow. We use the air-flow …
How we speak ? : Human Vocal Tract
… powered by the lungs during breath-out.
This air-flow is modified at various points of the Human Vocal Tract, ending up as one or another kind of speech sound (phone).
The Human Vocal Tract comprises various organs which, in one way or another, change the air-flow.
Human Vocal Tract …
[Figure: Human Vocal Tract]
How we speak ? : Syllable and Speech
During one complete breath cycle we can speak out more than one phone.
All the phones spoken out in just one breath cycle constitute a syllable.
A sequence of such syllables in continuity forms speech.
How we speak ? : Syllable Structure
It is important to know the structure of the syllable in order to understand partnemes.
Typically a syllable is made up of a vowel as the nucleus with consonants around it.
Gujarati employs the following syllable structure.
< C + C + C + V + V̯ + C + C >
How we speak ? : Syllable Structure
< C + C + C + V + V̯ + C + C >
where C - consonant
V - vowel
V̯ - non-syllabic vowel
An utterance (spoken word) is made up of a series of such syllables.
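The template can be checked mechanically. The sketch below assumes that the template is a maximum, so that only V is mandatory and a syllable may have up to three consonants before the vowel, an optional non-syllabic vowel after it, and up to two consonants at the end. The small vowel inventory and the function names are also assumptions for this example.

```python
import re

VOWELS = {"a", "ə", "i", "u", "e", "o"}   # assumed, incomplete inventory
NON_SYLLABIC_MARK = "\u032f"               # combining mark used in V̯

# C{0,3} V U? C{0,2}, where U stands for the non-syllabic vowel.
SYLLABLE = re.compile(r"C{0,3}VU?C{0,2}")


def shape(phones):
    """Map a list of phones to a string over the letters C, V and U."""
    letters = []
    for phone in phones:
        if phone.endswith(NON_SYLLABIC_MARK):
            letters.append("U")
        elif phone in VOWELS:
            letters.append("V")
        else:
            letters.append("C")
    return "".join(letters)


def is_single_syllable(phones):
    """True if the phones fit the syllable template as one syllable."""
    return SYLLABLE.fullmatch(shape(phones)) is not None


one = is_single_syllable(["ɾ", "a", "m"])                 # ɾam
two = is_single_syllable(["l", "ə", "ʃ", "k", "ə", "ɾ"])  # ləʃkəɾ
```

A phone sequence with two vowels separated by consonants, such as ləʃkəɾ, does not fit the template and has to be split into more than one syllable.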
How we speak ? : Syllable Structure
રામ - ɾam is made up of a single syllable. Here the structure becomes < ɾC + aV + mC >.
પત્ર - pətɾ is also made up of a single syllable. Here the structure becomes < pC + əV + tC + ɾC >.
લશ્કર - ləʃkəɾ is made up of two syllables. Here the structure becomes < lC + əV + ʃC > < kC + əV + ɾC >.
How we speak ? : Consonants and Vowels
Consonants and vowels are two different kinds of speech sound with different acoustic parameters.
To know the exact difference between consonants and vowels we have to understand how a single vocal tract is capable of producing so many different sounds.
How we speak ? : Articulation
Modification of the air-flow is achieved by articulation of the various speech organs of the vocal tract.
The exact nature of the speech sound that comes out during breath-out is determined by
1 Place of Articulation
2 Manner of Articulation
How we speak ? : Place of articulation
Place of articulation refers to the exact point in the human vocal tract where the articulation happens.
e.g. [p] - two lips
[k] - back of tongue with velum
[ɾ] - tip of tongue with alveolar ridge
How we speak ? : Manner of articulation
Manner of articulation refers to the degree of constriction made during the articulation.
e.g. [p] - stop or plosive
[ʧ] - affricate
[ɾ] - tap
[j] - glide
[o] - vowel (no constriction)
How we speak ? : Voicedness
If the vocal cords are vibrating (and thus changing the air-flow) while the air-flow travels through the glottis, we get a voiced sound.
e.g. [g] - voiced
[k] - unvoiced
How we speak ? : Aspiration
Aspiration refers to the state of the vocal cords during the final stage of speaking out a phone. When we speak out aspirated phones, the vocal cords approach the vibrating state as time goes on (irrespective of their voicedness).
e.g. [kʰ] - aspirated
[k] - unaspirated
Segmentation and Partneme
Segmentation into partnemes is achieved by cutting up the recorded syllable.
Shown is the sound waveform for ગમન built from partnemes. Red lines mark the separation.
Partnemes
As shown, a syllable is logically divided into
• null sound to consonant transition
• core consonant
• consonant to vowel transition
• core vowel
• vowel to consonant transition
• core consonant
• consonant to null sound transition
Partnemes
If we can provide the partnemes for each vowel and consonant, we can join them accordingly to produce any complete syllable and hence any utterance.
e.g.
કરણ - kəɾəɳ
0_k;k;k_ə;ə;ə_ɾ;ɾ;ɾ_ə;ə;ə_ɳ;ɳ;ɳ_0
ભારત - bʰaɾət
0_bʰ;bʰ;bʰ_a;a;a_ɾ;ɾ;ɾ_ə;ə;ə_t;t;t_0
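The partneme sequences in these examples follow a regular pattern: a transition from the null sound, then each core phone followed by the transition to its successor, and finally a transition back to the null sound. A minimal sketch of that expansion is given below; the function name `partneme_sequence` is an assumption, while the `0`, `_` and `;` notation is taken from the examples.

```python
def partneme_sequence(phones):
    """Expand a list of IPA phones into a ';'-separated partneme sequence."""
    if not phones:
        raise ValueError("at least one phone is required")
    units = ["0_" + phones[0]]                 # null sound to first phone
    for current, following in zip(phones, phones[1:]):
        units.append(current)                  # core of the current phone
        units.append(current + "_" + following)  # transition to the next one
    units.append(phones[-1])                   # core of the last phone
    units.append(phones[-1] + "_0")            # last phone to null sound
    return ";".join(units)


sequence = partneme_sequence(["k", "ə", "ɾ", "ə", "ɳ"])
```

For n phones the sequence always contains 2n + 1 partnemes: n cores and n + 1 transitions.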
Core Engine
The speech engine we developed to concatenate such a partneme sequence, based on the given IPA, uses a pair of files.
One, called the Voice File, contains the audio data of all the partnemes.
The other serves as a reference to the Voice File and is called the Voice Info File. It contains the position and length of each partneme in the Voice File.
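The lookup-and-join step can be sketched as follows, assuming the Voice File is held in memory as raw bytes and the Voice Info File has been parsed into a dictionary from partneme name to (offset, length). The actual file formats are not described in the slides, so both representations and the function name `concatenate` are assumptions.

```python
def concatenate(sequence, voice_data, voice_info):
    """Join the audio of each partneme named in a ';'-separated sequence.

    voice_data -- bytes of the whole Voice File (assumed raw audio)
    voice_info -- dict: partneme name -> (offset, length) in voice_data
    """
    audio = bytearray()
    for name in sequence.split(";"):
        if name not in voice_info:
            raise KeyError(f"no partneme recorded for {name!r}")
        offset, length = voice_info[name]
        audio += voice_data[offset:offset + length]
    return bytes(audio)


# Toy data: three 'partnemes' stored back to back in one byte string.
toy_voice = b"AAABBCCCC"
toy_info = {"0_k": (0, 3), "k": (3, 2), "k_0": (5, 4)}
joined = concatenate("0_k;k;k_0", toy_voice, toy_info)
```

Storing every partneme once in a single file and addressing it by position means the engine needs only one sequential read per unit, whatever the language of the recorded voice.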
Core Engine
The Core Engine realizes the use case for having a speech engine.
Language Dependent Components
Since the Core Engine only understands an IPA sequence, we have to provide a component which translates Gujarati text to an IPA sequence.
The Preprocessing capabilities also need to be developed for a complete TTS System.
Unlike the Core Engine, both of the aforementioned components would be specific to a particular language and …
Language Dependent Components
… are therefore kept aside as language dependent components.
Preprocessor :
As preprocessing should be highly customizable by the end user, we have provided a text file which can be edited to control the functionality of the preprocessor.
IPATranscriptor :
This component currently provides only phonetic translation of the given Gujarati text, as complete rules for prosodic translation are not available.
Thanks
Prof. Bhartiben Modi
Mr. Ajay Sarvaiya
Mr. Irshad Shaikh
Mr. Mihir Trivedi
Sloka
Having grasped the meanings through the intellect, the soul joins the mind with the desire to utter. The mind kindles the bodily fire, and that (bodily fire) impels the vital air. That impelled air, striking against the head, reaching the mouth and passing through the respective places, brings forth speech sounds of five kinds according to accent, duration, place, and the contribution of external and internal effort.
- Paniniya Shiksha, tenth chapter, karika 6, 9.