Sub-Project I: Prosody, Tones and Text-to-Speech Synthesis

Sin-Horng Chen (PI), Chiu-yu Tseng (Co-PI), Yih-Ru Wang (Co-PI), Yuan-Fu Liao (Co-PI), Lin-shan Lee, Hsin-min Wang

Outline

• Members
• Theme of Sub-Project I
• Research Roadmap
• Current Achievements
• Research Infrastructure
• Future Direction

Members

Sin-Horng Chen, Professor (PI), NCTU

Yih-Ru Wang, Associate Professor (Co-PI), NCTU

Lin-shan Lee, Professor, NTU

Chiu-yu Tseng, Professor & Research Fellow (Co-PI), Academia Sinica

Yuan-Fu Liao, Assistant Professor (Co-PI), NTUT

Hsin-min Wang, Associate Research Fellow, Academia Sinica


Theme of Sub-Project I

• Prosody Analysis and Modeling
• Tone Behavior and Modeling
• Applications in Speech/Speaker Recognition
• Applications in Text-to-Speech Synthesis

Topics shown in the theme diagram:
– Latent factor-based pitch contour model (mean model and shape model)
– Tone sandhi
– Hierarchical modeling of fluent prosody
– High performance TTS
– Speaker recognition
– Prosodic model-based tone recognizer

Research Focus

How to analyze and model fluent speech prosody
– Approach 1: Hierarchical modeling of fluent speech prosody
  • Develop a hierarchical prosody framework of fluent speech
  • Construct modular acoustic models for: (1) F0 contours, (2) duration patterns, (3) intensity distribution and (4) boundary breaks
– Approach 2: Latent factor analysis-based modeling
  • Assume there are some latent affecting factors
  • Latent factor analysis for syllable duration, pitch contour, energy and inter-syllable coarticulation
  • Explore the relation between latent factors and syntactic information

How to integrate these two approaches and apply them to
– Text-to-speech synthesis
– Speech/tone/speaker recognition

Research Roadmap

[Diagram: roadmap from Current Achievements to Future Direction, with the following items]

• Automatic prosodic labeling
• Prosodic phrase analysis
• High performance TTS: Mandarin, Min-Nan, Hakka
• Eigen prosody analysis-based speaker recognition
• RNN/VQ-based prosodic modeling
• COSPRO corpus/Toolkits
• Hierarchical modeling of fluent speech prosody
• Corpus-based TTS
• Model-based TTS
• Language model + pause, PM
• Tone modeling and recognition, MLP/RNN
• HMM
• Model-based tone recognizer
• Prosodic model-based speaker recognition
• Prosodic cues-dependent LM
• Latent factor analysis: duration, pitch mean, shape, inter-syllable coarticulation
• Investigation in relation to prosody organization: F0 range and reset, naturalness and measurement, voice quality

Hierarchical Prosody Framework of Fluent Speech (1/4)

Hierarchical framework of fluent speech prosody for multi-phrase speech paragraphs
– Hierarchical cross-phrase patterns and contributions are found in all 4 acoustic dimensions.
– Acoustic templates are derived for each prosody level
  • F0 template
  • Syllable duration templates and temporal allocation patterns
  • Intensity distribution patterns
  • Boundary break patterns

[Figure: The Prosody Hierarchy with Prosodic Boundaries. Levels shown: Prosodic Group; Breath Group; initial, middle and final prosodic phrases (PP); prosodic words (PW) separated by B2 boundaries.]

F0 cadence of multi-phrase PG (Prosodic Phrase Group): Tide over Wave and Ripple

[Figure: Syllable duration cadence of multi-phrase PG at the PW level and the PPh level, with panels for PG-initial, PG-medial and PG-final PPh; x-axis: syllable position.]

Duration Re-synthesis (F054C) and F0 Re-synthesis (F054C)

Cross-speaker synthesis: manipulate Speaker A's duration parameters with Speaker B's

[Figure: original and modified duration patterns for initial, medial and final positions.]

[Figure: original and modified F0 patterns for initial, middle and final positions.]

[Table: Standard, Average and Speech Rate for the original and modified speech. Values in source order: 65.3287, 199.565, 236.18, 72.3278, 289.832, 362.346.]

Syllable Duration Model
– Multiplicative model
– Additive model

$Z_n = X_n \, \gamma_{t_n} \, \gamma_{y_n} \, \gamma_{j_n} \, \gamma_{l_n} \, \gamma_{s_n}$

              Z_n            X_n
mean:         42.3 frames    43.9 frames
variance:     180 frame^2    2.52 frame^2

RMSE: 1.93 frames (5 ms/frame)

Relations between Prosodic State CFs of Initial/Final and Syllable Duration Models

[Figure: x-axis: companding factor of syllable; y-axis: companding factor of initial and final; series: final, initial.]
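The multiplicative form lends itself to a small numerical illustration. The sketch below assumes that an observed syllable duration equals a normalized duration times one companding factor (CF) per affecting factor, so dividing by the product of CFs recovers the normalized duration, and taking logarithms turns the product into a sum. The factor names, the CF tables and the 40-frame value are hypothetical placeholders, not values from the project.

```python
import math

# Hypothetical companding-factor (CF) tables; the values are invented for
# illustration and are not taken from the slides.
CF_TABLES = {
    "tone": {1: 1.05, 2: 1.00, 3: 0.95, 4: 0.98, 5: 0.80},
    "prosodic_state": {0: 0.70, 1: 1.00, 2: 1.40},
}


def combined_cf(labels):
    """Multiply the CFs selected by one syllable's factor labels."""
    cf = 1.0
    for factor, label in labels.items():
        cf *= CF_TABLES[factor][label]
    return cf


def normalize_duration(observed_frames, labels):
    """Multiplicative model: observed = normalized * product of CFs."""
    return observed_frames / combined_cf(labels)


def log_duration_terms(normalized_frames, labels):
    """In the log domain the same model becomes a sum of terms."""
    terms = [math.log(normalized_frames)]
    terms += [math.log(CF_TABLES[factor][label]) for factor, label in labels.items()]
    return terms


labels = {"tone": 3, "prosodic_state": 2}
observed = 40.0 * combined_cf(labels)  # synthesize an observation from the model
print(normalize_duration(observed, labels))
```

Working in the log domain is a common design choice for multiplicative models because the factor contributions can then be compared and estimated as additive terms.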

Latent Factor Analysis-based Prosody Modeling (1/3)

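Syllable Pitch Contour Model
– Mean model:
  $Y_n = X_n + \beta_{t_n} + \beta_{pt_n} + \beta_{ft_n} + \beta_{i_n} + \beta_{f_n} + \beta_{p_n}$
  $Z_n = (Y_n + \beta_{s_n}) \, \gamma_{s_n}$
– Shape model:
  $\mathbf{Z}_n = \mathbf{X}_n + \mathbf{b}_{tc_n} + \mathbf{b}_{q_n} + \mathbf{b}_{s_n} + \mathbf{b}_{i_n} + \mathbf{b}_{f_n}$

[Figure: The patterns of x-3-3. Curve labels: 033, 133, 233, 333, 433, 533, 020, 030; y-axis: pitch period.]

[Figure: Reconstructed pitch mean.]

Inter-syllable coarticulation pitch contour model

[Figure: The relationship of syllable pitch contours and affecting factors.]

[Figure: Reconstructed pitch contour.]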
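To make the idea of additive affecting factors concrete, here is a minimal sketch that assumes the pitch mean of a syllable is a global mean plus a per-tone offset. It estimates each offset by plain averaging over a toy data set, which is a simplification chosen for brevity and not the estimation procedure used in the project. The data values, the single tone factor and all function names are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean

# Toy (log-pitch mean, tone) pairs; the numbers are invented for illustration.
SAMPLES = [(5.2, 1), (5.4, 1), (4.8, 3), (4.6, 3)]


def fit_tone_offsets(samples):
    """Return the global mean and, per tone, its mean deviation from it."""
    global_mean = mean(value for value, _ in samples)
    by_tone = defaultdict(list)
    for value, tone in samples:
        by_tone[tone].append(value)
    offsets = {tone: mean(vals) - global_mean for tone, vals in by_tone.items()}
    return global_mean, offsets


def reconstruct(global_mean, offsets, tone):
    """Additive reconstruction: global mean plus the tone's offset."""
    return global_mean + offsets[tone]


def rmse(samples, global_mean, offsets):
    """Root mean squared error between observed and reconstructed values."""
    errors = [(v - reconstruct(global_mean, offsets, t)) ** 2 for v, t in samples]
    return math.sqrt(mean(errors))


global_mean, offsets = fit_tone_offsets(SAMPLES)
print(offsets, rmse(SAMPLES, global_mean, offsets))
```

With several factors (previous tone, following tone, prosodic state and so on) the offsets interact, so a joint or iterative estimate would be needed instead of the one-pass averaging shown here.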

Mandarin/Taiwanese TTS

Block diagram of TTS system
– Text Analyzer: input Min-Nan or Chinese text → base-syllable sequence, linguistic features
– Acoustic Inventory: base-syllable sequence → waveform sequence
– RNN-based Prosody Generator: linguistic features → prosodic parameters
– PSOLA Speech Synthesizer: waveform sequence, prosodic parameters → synthetic speech

TTS samples
[Audio samples in the original slides: Model-based TTS and Corpus-based TTS, voices female 1 to female 5; Taiwanese.]

Tone Behavior Modeling and Recognition with Inter-Syllabic Features

Gabor-IFAS-based pitch detection

Four inter-syllabic features

– Ratio of duration of adjacent syllables

– Averaged pitch value over a syllable

– Maximum pitch difference within a syllable

– Averaged slope of the pitch contour over a syllable

Context-dependent tone behavior modeling
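The four inter-syllabic features listed above can be sketched as follows. This is only an illustration under stated assumptions: each syllable is given as a duration in seconds plus a list of frame-level pitch values in Hz, the 10 ms frame shift is an arbitrary choice, "maximum pitch difference" is taken as maximum minus minimum, and the average slope is the mean frame-to-frame difference per second. The `Syllable` type and the function name are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Syllable:
    duration_s: float      # syllable duration in seconds
    pitch_hz: List[float]  # frame-level pitch values of the voiced part


def tone_features(prev: Syllable, cur: Syllable, frame_s: float = 0.01) -> Dict[str, float]:
    """Compute the four features for `cur`, using `prev` for the duration ratio."""
    steps = [b - a for a, b in zip(cur.pitch_hz, cur.pitch_hz[1:])]
    return {
        # 1. Ratio of duration of adjacent syllables (current / previous).
        "duration_ratio": cur.duration_s / prev.duration_s,
        # 2. Averaged pitch value over the syllable.
        "mean_pitch": mean(cur.pitch_hz),
        # 3. Maximum pitch difference within the syllable.
        "max_pitch_diff": max(cur.pitch_hz) - min(cur.pitch_hz),
        # 4. Averaged slope of the pitch contour, in Hz per second.
        "mean_slope": mean(steps) / frame_s if steps else 0.0,
    }


prev = Syllable(0.20, [180.0, 182.0, 181.0])
cur = Syllable(0.25, [200.0, 210.0, 220.0])
features = tone_features(prev, cur)
print(features)
```

A context-dependent tone model could then take such feature vectors, together with the tones of neighboring syllables, as its input.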

[Figure: results for unseen handsets, comparing MAP-GMM/CMS, +GPD_S, +ML-AKI and +EPA. Plotted values in source order: 60.2, 61.9, 58.3, 60.5.]

Eigen-Prosody Analysis-based Robust Speaker Recognition

Use latent semantic analysis (LSA) to efficiently extract speaker cues that resist handset mismatch, even from limited training/test data

– Step 1: Automatic prosodic state labeling and speaker-keyword statistics

– Step 2: Eigen-prosody space construction using Latent semantic analysis
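The two steps above can be sketched in a few lines. The example below is a simplified illustration, not the project's implementation: it assumes that prosody keywords are bigrams of prosodic-state labels (written like `8-4`), builds a keyword-by-speaker co-occurrence matrix from toy state sequences, and finds only the leading singular triplet by power iteration so that the code needs no external library. A full LSA would keep several singular vectors, for example from an SVD routine such as `numpy.linalg.svd`. All names and data are hypothetical.

```python
import math
from collections import Counter


def prosody_keywords(states, n=2):
    """Turn a prosodic-state sequence into overlapping n-gram keywords."""
    return ["-".join(map(str, states[i:i + n])) for i in range(len(states) - n + 1)]


def cooccurrence(speaker_states, n=2):
    """Rows = keywords, columns = speakers, entries = keyword counts."""
    counts = {spk: Counter(prosody_keywords(seq, n)) for spk, seq in speaker_states.items()}
    vocab = sorted({kw for c in counts.values() for kw in c})
    speakers = sorted(counts)
    matrix = [[float(counts[spk][kw]) for spk in speakers] for kw in vocab]
    return vocab, speakers, matrix


def leading_singular_triplet(a, iters=100):
    """Power iteration on A^T A; returns (sigma, u, v) for the top singular value."""
    rows, cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # A v
        w = [sum(a[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v


# Toy prosodic-state sequences for two hypothetical speakers.
vocab, speakers, matrix = cooccurrence({"A": [8, 4, 8, 4], "B": [7, 5, 7]})
sigma, u, v = leading_singular_triplet(matrix)
print(vocab, speakers, round(sigma, 3))
```

Here `u` places each keyword and `v` places each speaker along the first dimension of the reduced space, which is the kind of joint keyword/speaker view that an eigen-prosody plot provides.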

[Figure: Eigen-prosody analysis pipeline. Prosodic features → VQ-based prosody modeling and prosody state labeling → sequences of prosody states → prosody keyword parsing (with a dictionary) → prosody keywords → co-occurrence matrix of speakers and keywords → eigen-prosody analysis (SVD) → mapping from the high-dimensional prosody space to the eigen-prosody space.]

[Figure: Eigen-prosody space. x-axis: Dimension 1. Plotted points: prosody keywords (8-4, 8-1, 8-5, 8-7, 8-6, 7-5, 7-7, 7-4, 7-8, 5-3, 5-8, 6-5, 6-7) and speakers. Annotated regions: fast speakers, slow speakers, more breaks, fewer breaks.]

Experimental results on HTIMIT corpus

– Ten different handsets

– 302 speakers

– 7 utterances for training and 3 for test

Research Infrastructure (1/2)

Sinica COSPRO and Toolkits: http://www.myet.com/COSPRO/
– 9 sets of Mandarin Chinese fluent speech corpora collected
– Platform developed
– Each corpus was designed to bring out different prosody features involved in fluent speech.
– Annotation processes include labeling and tagging perceived units and boundaries in fluent speech, especially the ultimate unit, the multi-phrase speech paragraph.
– Framework constructed to bring out speech paragraphs and cross-phrase prosodic relationships characteristic of narrative or discourse organization.

COSPRO Toolkit

– Performing Acoustic Analysis Function: PointProcess, Intensity, Formant, Spectrogram, Wav
– Re-synthesizing Speech Signals Function: F0 Block (PitchTier), Duration Block (DurationTier)
– Labeling Continuous Fluent Speech Function: Adjustment, Syllable, User-defined Tag

Research Infrastructure (2/2)

Tree-Bank Speech Database
– Uttered by a single female speaker
– Short paragraphs, 110,000 syllables
– Sentence-based syntactic tree annotated manually
– Pitch contour and syllable segmentation corrected manually

Future Direction (1/5)

• Automatic prosodic labeling of Mandarin speech corpus
• Analysis of prosodic phrase structure
• Model-based tone recognition
• High performance TTS
• Speech recognition/language modeling using prosodic cues
• Prosodic modeling-based robust speaker recognition

Future Direction (2/5)

Automatic prosodic labeling of Mandarin speech corpus
– Goal: To construct a prosody-syntax model by exploiting the relationship between prosodic features and linguistic features, and to use it for automatic labeling of various acoustic cues:
  • Prosodic phrase boundary detection
  • Inter-syllable/inter-word coarticulation classification
  • Full/half/sandhi tone labeling for Tone 3
  • Syllable pronunciation clustering
  • Homograph determination
  • The grouping of monosyllabic words with their neighboring words

Future Direction (3/5)

Analysis of prosodic phrase structure
– 4-level prosody hierarchy: PW, PPh, BG, PG
– Issues to be studied
  • Detection and classification of prosodic phrases
  • Relation between syntactic phrase structure and prosodic phrase structure
  • Other affecting factors: speaking rate, speaking style, emotion type, spontaneity of speech

Future Direction (4/5)

Model-based tone recognition
– Current approach
  • Acoustic feature normalization
  • Context-dependent tone modeling
– Main idea: Use the above statistics-based prosody models to compensate for the effects of various affecting factors on syllable pitch contour, duration, and energy contour

High performance TTS
– Applying the sophisticated prosody models
  • Modular model of fluent speech prosody
  • Latent factor analysis-based modeling
– Main idea: with important prosodic cues properly labeled, the search for an optimal synthesis unit sequence in a large database can be more efficient.
  • Consider both linguistic information and acoustic cues
  • Give special treatment to monosyllabic words
– Use the above prosody-syntax models to assist in the generation of prosodic information

Future Direction (5/5)

Speech recognition/language modeling using prosodic cues
– Automatic prosodic state labeling
– Prosodic state-dependent acoustic modeling
– Prosodic state-dependent language modeling

Prosodic modeling-based robust speaker recognition
– Automatic prosodic cue labeling
– N-gram language model to learn the prosodic behavior of speakers
– Applying principal component analysis (PCA) to the N-grams to find a compact prosodic speaker space
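As one way to picture the N-gram step, the sketch below trains an additively smoothed bigram model over each speaker's prosodic-state sequence and scores a test sequence by log-likelihood. The PCA step is not shown. The state inventory, the toy sequences and the function names are hypothetical, and the smoothing choice is an assumption made for the example.

```python
import math
from collections import Counter
from typing import Callable, List


def train_bigram(states: List[int], num_states: int, alpha: float = 1.0) -> Callable[[int, int], float]:
    """Return P(cur | prev) estimated from one speaker's state sequence."""
    pair_counts = Counter(zip(states, states[1:]))
    prev_counts = Counter(states[:-1])

    def prob(prev: int, cur: int) -> float:
        # Additive (add-alpha) smoothing keeps unseen transitions non-zero.
        return (pair_counts[(prev, cur)] + alpha) / (prev_counts[prev] + alpha * num_states)

    return prob


def log_likelihood(prob: Callable[[int, int], float], states: List[int]) -> float:
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(prob(p, c)) for p, c in zip(states, states[1:]))


# Toy training sequences: speaker_A alternates states, speaker_B dwells in them.
models = {
    "speaker_A": train_bigram([0, 1, 0, 1, 0, 1], num_states=2),
    "speaker_B": train_bigram([0, 0, 0, 1, 1, 1], num_states=2),
}
test_sequence = [0, 1, 0, 1]
scores = {name: log_likelihood(model, test_sequence) for name, model in models.items()}
best = max(scores, key=scores.get)
print(best)
```

In a PCA-based variant, each speaker's bigram probabilities would be stacked into one vector and the vectors of all speakers projected onto a few principal components.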
