Sub-Project I Prosody, Tones and Text-To-Speech Synthesis. Sin-Horng Chen (PI), Chiu-yu Tseng (Co-PI), Yih-Ru Wang (Co-PI), Yuan-Fu Liao (Co-PI), Lin-shan Lee, Hsin-min Wang

Post on 14-Jan-2016

Page 1: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Sub-Project I
Prosody, Tones and Text-To-Speech Synthesis

Sin-Horng Chen (PI), Chiu-yu Tseng (Co-PI), Yih-Ru Wang (Co-PI), Yuan-Fu Liao (Co-PI), Lin-shan Lee, Hsin-min Wang

Page 2: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Outline

• Members
• Theme of Sub-project I
• Research Roadmap
• Current Achievements
• Research Infrastructure
• Future Direction

Page 3: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Members

• Sin-Horng Chen, Professor (PI), NCTU
• Yih-Ru Wang, Associate Professor (Co-PI), NCTU
• Lin-shan Lee, Professor, NTU
• Chiu-yu Tseng, Professor & Research Fellow (Co-PI), Academia Sinica
• Yuan-Fu Liao, Assistant Professor (Co-PI), NTUT
• Hsin-min Wang, Associate Research Fellow, Academia Sinica

Page 4: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Theme of Sub-Project I

• Prosody Analysis and Modeling
• Tone Behavior and Modeling
• Applications in Speech/Speaker Recognition
• Applications in Text-to-Speech Synthesis

Topics shown on the slide:
– Hierarchical modeling of fluent prosody
– Latent factor-based pitch contour model (mean model and shape model)
– Tone sandhi
– Prosodic model-based tone recognizer
– Speaker recognition
– High performance TTS

[Figure: eigen-prosody space, Dimension 1 (0 to 0.12) against Dimension 2 (-0.2 to 0.2), with keyword points and speaker points; regions labeled fast speakers, slow speakers, more breaks and less breaks]

Page 5: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Research Focus

How to analyze and model fluent speech prosody
– Approach 1: Hierarchical modeling of fluent speech prosody
• Develop a hierarchical prosody framework of fluent speech
• Construct modular acoustic models for: (1) F0 contours, (2) duration patterns, (3) intensity distribution and (4) boundary breaks
– Approach 2: Latent factor analysis-based modeling
• Assume there are some latent affecting factors
• Latent factor analysis for syllable duration, pitch contour, energy and inter-syllable coarticulation
• Explore the relation between latent factors and syntactic information

How to integrate these two approaches and apply them to
– Text-to-speech synthesis
– Speech/tone/speaker recognition

Page 6: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Research Roadmap

Current Achievements / Future Direction

• Automatic prosodic labeling
• Prosodic phrase analysis
• High performance TTS: Mandarin, Min-south, Hakka
• Eigen prosody analysis-based speaker recognition
• RNN/VQ-based prosodic modeling
• COSPRO corpus/Toolkits
• Hierarchical modeling of fluent speech prosody
• Corpus-based TTS
• Model-based TTS
• Language model + pause, PM
• Tone modeling and recognition, MLP/RNN
• HMM
• Model-based tone recognizer
• Prosodic model-based speaker recognition
• Prosodic cues-dependent LM
• Latent factor analysis: duration, pitch mean, shape, inter-syllable coarticulation
• Investigation in relation to prosody organization: F0 range and reset, naturalness and measurement, voice quality

Page 7: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Hierarchical Prosody Framework of Fluent Speech (1/4)

Hierarchical framework of fluent speech prosody for multi-phrase speech paragraphs
– Hierarchical cross-phrase patterns and contributions are found in all 4 acoustic dimensions.
– Acoustic templates are derived for each prosody level
• F0 template
• Syllable duration templates and temporal allocation patterns
• Intensity distribution patterns
• Boundary break patterns

Page 8: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Hierarchical Prosody Framework of Fluent Speech (2/4)

The Prosody Hierarchy with Prosodic Boundaries

[Figure: hierarchy diagram with the levels Prosodic Group, Breath Group, prosodic phrase (Initial PP, Middle Prosodic Phrase, Final PP) and prosodic word (PW); boundary breaks are labeled B2 (a row of thirteen), B3, B4 and B5]
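One way to picture a hierarchy defined by boundary breaks is to group a labeled syllable sequence at increasing break strengths. The sketch below is only an illustration under stated assumptions: it assumes that a break of index k closes every unit at or below level k, and that B2, B3 and B4 close prosodic words, prosodic phrases and breath groups respectively. That mapping is a reading of the diagram labels, not something the slide states, and the syllable names and break values are hypothetical.

```python
def group_by_break(items, breaks, level):
    """Split items into chunks, closing a chunk after any break index >= level."""
    chunks, current = [], []
    for item, brk in zip(items, breaks):
        current.append(item)
        if brk >= level:
            chunks.append(current)
            current = []
    if current:  # trailing items with no closing boundary
        chunks.append(current)
    return chunks


# Hypothetical syllables and the break index that follows each one
# (1 is used here to mean "no prosodic boundary after this syllable").
syllables = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
breaks = [1, 2, 1, 3, 2, 1, 2, 4]

prosodic_words = group_by_break(syllables, breaks, level=2)    # assumed B2
prosodic_phrases = group_by_break(syllables, breaks, level=3)  # assumed B3
breath_groups = group_by_break(syllables, breaks, level=4)     # assumed B4
```

Because a stronger break also closes all weaker units, each prosodic phrase produced this way is a concatenation of whole prosodic words, which is the nesting the diagram depicts.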

Page 9: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Hierarchical Prosody Framework of Fluent Speech (3/4)

F0 cadence of multi-phrase PG (Prosodic Phrase Group)

Tide over Wave and Ripple

Syllable duration cadence of multi-phrase PG

[Figure: cadence plots with panels for the PW level, the PPh level, PG-initial PPh, PG-medial PPh and PG-final PPh; horizontal axes run over positions 1 to 11 (1 to 4 in one panel), vertical axes from -1.2 to 1.6]

Page 10: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Hierarchical Prosody Framework of Fluent Speech (4/4)

Duration Re-synthesis, F054C
F0 Re-synthesis, F054C

Cross-speaker synthesis: manipulating Speaker A’s duration parameters with Speaker B’s

[Figure: two sets of eight panels over Initial, Medial (Middle) and Final positions, vertical axes from 0 to 350, the second set labeled F0 (Hz); traces labeled Original and Modified]

Speaker   Standard   Average   Speech Rate
F051P     65.3287    199.565   236.18
F01S      72.3278    289.832   362.346

Page 11: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Latent Factor Analysis-based Prosody Modeling (1/3)

Syllable Duration Model
– Multiplicative model: $Z_n = X_n \, \gamma_{t_n} \gamma_{y_n} \gamma_{j_n} \gamma_{l_n} \gamma_{s_n}$
– Additive model: $Z_n = X_n + \gamma_{t_n} + \gamma_{y_n} + \gamma_{j_n} + \gamma_{l_n} + \gamma_{s_n}$

mean: 42.3 frames, 43.9 frames
variance: 180 frame², 2.52 frame²
RMSE: 1.93 frames (5 ms/frame)

Relations between Prosodic State CFs of Initial/Final and Syllable Duration Models

[Figure: scatter plot of the companding factor of initial and final (vertical axis, 0.4 to 2) against the companding factor of syllable (horizontal axis, 0.4 to 2); series: final, initial]
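As a rough illustration of how the two syllable duration models differ, the sketch below assumes that each model combines a normalized duration with one companding factor (CF) per affecting factor, by product in the multiplicative model and by sum in the additive model. The factor names and all numeric CF values are hypothetical and chosen only for illustration; only the 5 ms/frame rate and the 1.93-frame RMSE come from the slide.

```python
from math import prod

FRAME_MS = 5.0  # frame length stated on the slide (5 ms/frame)


def multiplicative_duration(x, factors):
    """Scale a normalized duration x (in frames) by every companding factor."""
    return x * prod(factors.values())


def additive_duration(x, factors):
    """Shift a normalized duration x (in frames) by every companding factor."""
    return x + sum(factors.values())


def frames_to_ms(frames):
    """Convert a duration in frames to milliseconds."""
    return frames * FRAME_MS


# Hypothetical CFs for one syllable (illustrative values, not from the slide).
mult_cfs = {"tone": 1.10, "prosodic_state": 0.80, "speaker": 1.25}
add_cfs = {"tone": 4.0, "prosodic_state": -9.0, "speaker": 3.0}

z_mult = multiplicative_duration(40.0, mult_cfs)  # 40 * 1.10 * 0.80 * 1.25
z_add = additive_duration(40.0, add_cfs)          # 40 + 4 - 9 + 3
rmse_ms = frames_to_ms(1.93)                      # the slide's RMSE in ms
```

In the multiplicative form a CF of 1 leaves the duration unchanged, while in the additive form the neutral value is 0. At 5 ms per frame, the reported RMSE of 1.93 frames corresponds to a little under 10 ms.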

Page 12: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

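Latent Factor Analysis-based Prosody Modeling (2/3)

Syllable Pitch Contour Model
– Mean model: $Z_n = (Y_n + \beta_{s_n})\,\gamma_{s_n}$, with $Y_n = X_n + \beta_{t_n} + \beta_{pt_n} + \beta_{ft_n} + \beta_{i_n} + \beta_{f_n} + \beta_{p_n}$
– Shape model: $\mathbf{Z}_n = \mathbf{X}_n + \mathbf{b}_{tc_n} + \mathbf{b}_{q_n} + \mathbf{b}_{s_n} + \mathbf{b}_{i_n} + \mathbf{b}_{f_n}$

The patterns of x-3-3

[Figure: pitch period (ms, 6 to 9.5) against frame (0 to 20); legend entries 033, 133, 233, 333, 433, 533, 020, 030]

Reconstructed pitch mean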

Page 13: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Latent Factor Analysis-based Prosody Modeling (3/3)

Inter-syllable coarticulation pitch contour model

The relationship of syllable pitch contours and affecting factors

Reconstructed pitch contour

Page 14: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Mandarin/Taiwanese TTS

Block diagram of TTS system

[Figure: input Min-Nan or Chinese text → Text Analyzer; the Text Analyzer passes a base-syllable sequence to the Acoustic Inventory and linguistic features to the RNN-based Prosody Generator; the resulting waveform sequence and prosodic parameters feed the PSOLA Speech Synthesizer, which outputs synthetic speech]

TTS samples
– Model-based TTS: female 1, female 2, female 3, female 4, female 5
– Corpus-based TTS: female 1, female 2, female 3, female 4, female 5
– Taiwanese

Page 15: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Tone Behavior Modeling and Recognition with Inter-Syllabic Features

Gabor-IFAS-based pitch detection

Four inter-syllabic features
– Ratio of duration of adjacent syllables
– Averaged pitch value over a syllable
– Maximum pitch difference within a syllable
– Averaged slope of the pitch contour over a syllable

Context-dependent tone behavior modeling
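The four features listed on this slide could be computed along the following lines. This is a minimal sketch under assumptions the slide does not spell out: the pitch contour is taken as a list of per-frame values within one syllable, the duration ratio is taken as current over preceding syllable, and the averaged slope is taken as the mean frame-to-frame pitch difference. The function name and sample values are hypothetical.

```python
from statistics import mean


def syllable_features(pitch, duration, prev_duration):
    """Compute four per-syllable features from a pitch contour and durations.

    pitch: per-frame pitch values within the syllable (e.g. in Hz)
    duration, prev_duration: durations of this syllable and the preceding one
    """
    # Frame-to-frame differences; their mean is the average slope per frame.
    diffs = [b - a for a, b in zip(pitch, pitch[1:])]
    return {
        "duration_ratio": duration / prev_duration,
        "mean_pitch": mean(pitch),
        "max_pitch_diff": max(pitch) - min(pitch),
        "mean_slope": mean(diffs) if diffs else 0.0,
    }


# Hypothetical rising contour, 0.20 s long, following a 0.25 s syllable.
feats = syllable_features([200.0, 210.0, 220.0, 230.0],
                          duration=0.20, prev_duration=0.25)
```

The mean of the frame-to-frame differences equals the last value minus the first, divided by the number of frame steps, so it captures the overall rise or fall but not the curvature of the contour.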

Page 16: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Eigen-Prosody Analysis-based Robust Speaker Recognition

Use latent semantic analysis (LSA) to efficiently extract speaker cues that resist handset mismatch, using only a small amount of training/test data
– Step 1: Automatic prosodic state labeling and speaker-keyword statistics
– Step 2: Eigen-prosody space construction using latent semantic analysis

[Figure: processing flow. Prosodic features → VQ-based prosody modeling (prosody state labeling) → sequences of prosody states → prosody keyword parsing with a dictionary → prosody keywords → speaker-keyword co-occurrence matrix A (high-dimensional prosody space) → eigen-prosody analysis (SVD), A = U S V^T → eigen-prosody space]

[Figure: eigen-prosody space, Dimension 1 (0 to 0.12) against Dimension 2 (-0.2 to 0.2), with keyword points and speaker points; regions labeled fast speakers, slow speakers, more breaks and less breaks]

Experimental results on HTIMIT corpus
– Ten different handsets
– 302 speakers
– 7/3 utterances for training/test respectively

Average speaker recognition rate (%)
Method         avg    unseen handset
MAP-GMM/CMS    60.2   58.3
+GPD_S         61.9   60.5
+ML-AKI        74.9   69.4
+EPA           79.3   74.6
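The eigen-prosody space construction in Step 2 can be sketched with a plain singular value decomposition of a keyword-by-speaker co-occurrence matrix. The example below is an illustration only: the counts are made up, the row/column orientation and the choice of two retained dimensions are assumptions, and the fold-in projection is the standard LSA technique rather than a confirmed detail of the authors' system.

```python
import numpy as np

# Hypothetical co-occurrence counts: rows are prosody keywords, columns are speakers.
A = np.array([
    [4.0, 3.0, 0.0, 1.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 3.0, 4.0],
    [0.0, 0.0, 2.0, 1.0],
])

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of retained dimensions (an arbitrary choice for this sketch)
speaker_coords = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional point per speaker
keyword_coords = U[:, :k] * s[:k]             # one k-dimensional point per keyword


def project_speaker(counts):
    """Fold a keyword-count vector for a speaker into the k-dimensional space."""
    return np.asarray(counts, dtype=float) @ U[:, :k]
```

Projecting a training speaker's own column reproduces that speaker's coordinates, so a test utterance can be folded in the same way and compared with the enrolled speakers by distance or cosine similarity in the reduced space.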

Page 17: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Research Infrastructure (1/2)

Sinica COSPRO and Toolkits: http://www.myet.com/COSPRO/
– 9 sets of Mandarin Chinese fluent speech corpora collected
– Platform developed
– Each corpus was designed to bring out different prosody features involved in fluent speech.
– Annotation processes include labeling and tagging perceived units and boundaries in fluent speech, especially the ultimate unit, the multiple-phrase speech paragraph.
– Framework constructed to bring out speech paragraphs and cross-phrase prosodic relationships characteristic of narrative or discourse organization.

COSPRO Toolkit
– Performing Acoustic Analysis Function: PointProcess, Pitch, Intensity, Formant, Spectrogram
– Re-synthesizing Speech Signals Function: Wav, F0 Block, PitchTier, DurationTier, Duration Block
– Labeling Continuous Fluent Speech Function: Adjustment, Break, Syllable, User-defined Tag, PAC

Page 18: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Research Infrastructure (2/2)

Tree-Bank Speech Database
– Uttered by a single female speaker
– Short paragraphs, 110,000 syllables
– Sentence-based syntactic tree annotated manually
– Pitch contour and syllable segmentation corrected manually

Page 19: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Future Direction (1/5)

• Automatic prosodic labeling of Mandarin speech corpus
• Analysis of prosodic phrase structure
• Model-based tone recognition
• High performance TTS
• Speech recognition/language modeling using prosodic cues
• Prosodic modeling-based robust speaker recognition

Page 20: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Future Direction (2/5)

Automatic prosodic labeling of Mandarin speech corpus
– Goal: To construct a prosody-syntax model by exploiting the relationship between prosodic features and linguistic features, and to use it for automatic labeling of various acoustic cues:
• Prosodic phrase boundary detection
• Inter-syllable/inter-word coarticulation classification
• Full/half/sandhi tone labeling for Tone 3
• Syllable pronunciation clustering
• Homograph determination
• The grouping of monosyllabic words with their neighboring words

Page 21: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Future Direction (3/5)

Analysis of prosodic phrase structure
– 4-level prosody hierarchy: PW, PPh, BG, PG
– Issues to be studied
• Detection and classification of prosodic phrases
• Relation between syntactic phrase structure and prosodic phrase structure
• Other affecting factors: speaking rate, speaking style, emotion type, spontaneity of speech

Model-based tone recognition
– Current approach
• Acoustic feature normalization
• Context-dependent tone modeling
– Main idea: Use the above statistics-based prosody models to compensate for the effects of various affecting factors on syllable pitch contour, duration, and energy contour

Page 22: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Future Direction (4/5)

High performance TTS
– Applying the sophisticated prosody models
• Modular model of fluent speech prosody
• Latent factor analysis-based modeling
– Main idea: with important prosodic cues properly labeled, the search for an optimal synthesis unit sequence in a large database can be more efficient.
• Consider both linguistic information and acoustic cues
• Give special treatment to monosyllabic words
– Use the above prosody-syntax models to assist in the generation of prosodic information

Page 23: Sub-Project I Prosody, Tones and Text-To-Speech Synthesis

Future Direction (5/5)

Speech recognition/language modeling using prosodic cues
– Automatic prosodic state labeling
– Prosodic state-dependent acoustic modeling
– Prosodic state-dependent language modeling

Prosodic modeling-based robust speaker recognition
– Automatic prosodic cue labeling
– N-gram language model to learn the prosodic behavior of speakers
– Applying principal component analysis (PCA) to N-gram to find a compact prosodic speaker space
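To make the N-gram step concrete, the sketch below estimates a bigram model over one speaker's prosodic state sequence and flattens it into a fixed-length vector. Vectors of this kind, stacked across speakers, are the sort of input a PCA step could then compress. Everything here is an assumption for illustration: the choice of bigrams, the add-alpha smoothing, the function names and the state labels are hypothetical, and the PCA step itself is not shown.

```python
from collections import Counter


def bigram_probs(states, n_states, alpha=1.0):
    """Estimate P(next | current) from one speaker's prosodic state sequence.

    Add-alpha smoothing keeps unseen transitions at a small nonzero probability.
    """
    pair_counts = Counter(zip(states, states[1:]))
    context_counts = Counter(states[:-1])
    return {
        (a, b): (pair_counts[(a, b)] + alpha) / (context_counts[a] + alpha * n_states)
        for a in range(n_states)
        for b in range(n_states)
    }


def speaker_vector(states, n_states):
    """Flatten the bigram probabilities into a fixed-length feature vector."""
    probs = bigram_probs(states, n_states)
    return [probs[(a, b)] for a in range(n_states) for b in range(n_states)]


# Hypothetical prosodic state labels (0..2) for one speaker.
vec = speaker_vector([0, 1, 2, 0, 1, 1, 2, 0], n_states=3)
```

Each block of `n_states` consecutive entries in the vector is one conditional distribution and sums to 1, so every speaker is represented in the same `n_states * n_states` dimensional space regardless of how much speech was observed.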