
12/1/2016 1

Voice Transformation: Speech Morphing by PWI

Submitted by: Gidon Porat
Supervisor: Dr. Yizhar Lavner

Signal and Image Processing Lab


Objective

Gradually change a source speaker’s voice so that it sounds like the voice of a target speaker.

Inputs: two reference voice signals, one from the source speaker and one from the target speaker.


Applications

● Multimedia and video entertainment: while a face gradually changes from one person’s to another’s (as is often done in video clips), the voice can be heard changing simultaneously.

● Forensic voice identification by synthesis: identifying a suspect’s voice by creating a voice bank.


The Challenges

Produce natural-sounding speech by computer.

The “naive solution”, direct interpolation of the two waveforms, gives poor results.

The source and target voice signals will never be of the same length.

[Figure: waveforms of Speaker A, Speaker B, and their interpolation]
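As a point of comparison, the “naive solution” could be sketched as follows. This is a minimal illustration only: it assumes the naive approach means stretching both waveforms to a common length and mixing them sample by sample, and the function name `naive_morph` is hypothetical.

```python
import numpy as np

def naive_morph(source, target, alpha):
    """Stretch both signals to a common length, then mix them sample by sample.

    alpha = 1 returns the source, alpha = 0 returns the target.
    Assumption: plain linear resampling is used for the stretch.
    """
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    # The common length is interpolated between the two input lengths.
    n = int(round(alpha * len(source) + (1 - alpha) * len(target)))
    grid = np.linspace(0.0, 1.0, n)
    src = np.interp(grid, np.linspace(0.0, 1.0, len(source)), source)
    tgt = np.interp(grid, np.linspace(0.0, 1.0, len(target)), target)
    return alpha * src + (1 - alpha) * tgt
```

Mixing waveforms this way superimposes two voices instead of producing one intermediate voice, which is consistent with the poor results reported above.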


Speech Modeling - Synthesis

[Block diagram of the speech production system. Blocks: impulse train generator, glottal pulse model G(z), random noise generator, voiced/unvoiced switch, vocal tract model V(z), radiation model R(z), pitch detector, linear predictor.]

The basic synthesis of digital speech is done by the discrete-time system model.


Speech Modeling

[Figure: concatenated tubes with cross-sectional areas A_1, A_2, A_3, ..., A_k]

Sound transmission in the vocal tract can be modeled as sound passing through concatenated lossless acoustic tubes.

A known mathematical relation between the areas of these tubes and the vocal tract filter helps in implementing our algorithm.

$$ r_k = \frac{A_{k+1} - A_k}{A_{k+1} + A_k} $$

$$ D_0(z) = 1 $$

$$ D_k(z) = D_{k-1}(z) + r_k \cdot z^{-k} \cdot D_{k-1}(z^{-1}) $$

$$ \text{e.g.: } D_1(z) = 1 + r_1 \cdot z^{-1} \cdot (1) $$

$$ V(z) = \frac{G}{D(z)} $$
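The relation between tube areas and the vocal tract filter on this slide can be sketched in a few lines. This is an illustrative sketch, not the project's code: the function names are hypothetical, and D(z) is assumed to be stored as a coefficient array in increasing powers of z^-1.

```python
import numpy as np

def reflection_coefficients(areas):
    """r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k) for adjacent tube sections."""
    a = np.asarray(areas, dtype=float)
    return (a[1:] - a[:-1]) / (a[1:] + a[:-1])

def denominator_polynomial(r):
    """Coefficients of D(z) in powers of z^-1.

    Implements D_k(z) = D_{k-1}(z) + r_k * z^-k * D_{k-1}(z^-1),
    starting from D_0(z) = 1.
    """
    d = np.array([1.0])
    for rk in r:
        padded = np.append(d, 0.0)      # D_{k-1}(z) extended to degree k
        # z^-k * D_{k-1}(z^-1) is the same polynomial with reversed coefficients.
        d = padded + rk * padded[::-1]
    return d

# Example with three (made-up) tube areas
r = reflection_coefficients([1.0, 3.0, 2.0])
d = denominator_polynomial(r)
```

If SciPy is available, `scipy.signal.lfilter([G], d, excitation)` would then apply V(z) = G / D(z) to an excitation signal.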


Prototype Waveform Interpolation

PWI is a speech signal representation method in which a speech signal, or its residual error function, is represented by a 3-D surface.

This method allows coding how the structure of a prototype waveform evolves in time. The surface has a time axis t and a phase axis φ.

[Figures: waveform plots with time (t) and phase (φ) axes]


The Solution

Use 3-D surfaces that capture the characteristics of each voiced phoneme’s residual error signal, and interpolate between the surfaces of the two speakers.

Unvoiced phonemes are not processed, because they carry less information about the speaker.


Algorithm – Block Diagram

[Block diagram. For each of Speaker A and Speaker B: phoneme boundary demarcation, pitch detector, LPC analysis, PWI creation. Shared blocks: interpolator, e(n) reconstruction, new LPC calculator, synthesizer.]


PWI Surface Interpolation

Once the surfaces for both the source and the target speaker are created (for each phoneme), an interpolated surface is created:

$$ u_{new}(t, \phi) = \alpha \cdot u_s(t, \phi) + (1 - \alpha) \cdot u_t(t, \phi) $$

The new error function, reconstructed from that surface, is then the input of a new vocal tract filter.

[Figure: Speaker A surface + Speaker B surface = intermediate surface]
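A minimal sketch of the surface interpolation, assuming both surfaces have already been sampled on the same (t, φ) grid and stored as 2-D arrays with one row per time frame. The function name is hypothetical.

```python
import numpy as np

def interpolate_surfaces(u_s, u_t, alpha):
    """u_new = alpha * u_s + (1 - alpha) * u_t on a shared (t, phi) grid.

    alpha may be a scalar (intermediate voice) or a column vector with
    one value per time frame (gradually changing voice).
    """
    u_s = np.asarray(u_s, dtype=float)
    u_t = np.asarray(u_t, dtype=float)
    if u_s.shape != u_t.shape:
        raise ValueError("surfaces must be sampled on the same (t, phi) grid")
    return alpha * u_s + (1.0 - alpha) * u_t
```

With `alpha = 1` the source surface is returned unchanged, and with `alpha = 0` the target surface.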


The New Excitation - Reconstruction

The new, intermediate error signal can be evaluated from the new surface by the equation:

$$ e_{new}(t) = u_{new}(t, \phi_{new}(t)), \quad \forall t \in [0, T_{new}] $$

where the phase $\phi_{new}(t)$ is

$$ \phi_{new}(t) = \int_{t_0}^{t} \frac{2\pi}{p_{new}(t')} \, dt' $$

Assuming the pitch cycle changes slowly in time:

$$ p_{new}(t) = \alpha \cdot p_s(t) + (1 - \alpha) \cdot p_t(t) $$
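The reconstruction can be sketched as sampling the surface along the phase track φ_new(t). This sketch makes several assumptions that are not stated on the slide: the surface is a 2-D array whose rows span the phoneme duration uniformly and whose columns span [0, 2π), the pitch period is given in seconds per output sample, the integral is approximated by a cumulative sum, and bilinear interpolation is used to read the surface.

```python
import numpy as np

def reconstruct_excitation(surface, pitch_period, fs):
    """Evaluate e_new(t) = u_new(t, phi_new(t)) on a sampled surface.

    surface:      array (n_frames, n_phase); rows cover the phoneme duration,
                  columns cover one pitch cycle, phase in [0, 2*pi)
    pitch_period: p_new(t) in seconds, one value per output sample
    fs:           sampling rate in Hz
    """
    surface = np.asarray(surface, dtype=float)
    p = np.asarray(pitch_period, dtype=float)
    n_frames, n_phase = surface.shape
    n = len(p)
    # phi_new(t): running integral of 2*pi / p_new(t'), starting at phase 0.
    dphi = 2.0 * np.pi / (p * fs)
    phase = np.concatenate(([0.0], np.cumsum(dphi[:-1])))
    # Fractional row (time) and column (phase) positions on the surface.
    row = np.linspace(0.0, n_frames - 1, n)
    col = (phase % (2.0 * np.pi)) / (2.0 * np.pi) * n_phase
    r0 = np.floor(row).astype(int)
    r1 = np.minimum(r0 + 1, n_frames - 1)
    wr = row - r0
    c_floor = np.floor(col)
    c0 = c_floor.astype(int) % n_phase
    c1 = (c0 + 1) % n_phase              # the phase axis wraps around
    wc = col - c_floor
    top = (1 - wc) * surface[r0, c0] + wc * surface[r0, c1]
    bottom = (1 - wc) * surface[r1, c0] + wc * surface[r1, c1]
    return (1 - wr) * top + wr * bottom
```

For a surface whose every row is one cosine cycle and a constant pitch period, the output is a cosine at the corresponding pitch frequency, which is a quick sanity check for the phase integration.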


New Lossless Tube Model

[Figure: source tube areas A_1^s ... A_6^s and new tube areas A_1^n ... A_6^n]

The areas of the new tube model are an interpolation between the source’s and the target’s:

$$ A_i^{new} = \alpha \cdot A_i^s + (1 - \alpha) \cdot A_i^t, \quad \forall i \in [1, 2, \ldots, N] $$

Once the new areas are computed, the LPC parameters $(a_1, a_2, a_3, a_4, a_5, a_6, a_7)$ and V(z) can be calculated, and the signal can be synthesized.
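To interpolate areas, each speaker's filter first has to be turned back into tube areas. The sketch below inverts the recursion D_k(z) = D_{k-1}(z) + r_k z^-k D_{k-1}(z^-1) from the speech modeling slide. It rests on assumptions: the filter denominator is given as coefficients of D(z) in powers of z^-1, the first area is fixed to 1 because the areas are only determined up to a common scale, and all function names are hypothetical.

```python
import numpy as np

def polynomial_to_reflection(d):
    """Recover r_1..r_N from the coefficients of D(z) (step-down recursion)."""
    d = np.asarray(d, dtype=float)
    d = d / d[0]
    r = []
    while len(d) > 1:
        rk = d[-1]                       # highest-order coefficient equals r_k
        if abs(rk) >= 1.0:
            raise ValueError("|r_k| >= 1: not a valid lossless tube model")
        r.append(rk)
        # D_{k-1} = (D_k - r_k * reversed(D_k)) / (1 - r_k^2), top term dropped
        d = ((d - rk * d[::-1]) / (1.0 - rk * rk))[:-1]
    return np.array(r[::-1])

def reflection_to_areas(r, first_area=1.0):
    """Invert r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k), starting from A_1."""
    areas = [first_area]
    for rk in r:
        areas.append(areas[-1] * (1.0 + rk) / (1.0 - rk))
    return np.array(areas)

def interpolate_areas(areas_s, areas_t, alpha):
    """A_new = alpha * A_s + (1 - alpha) * A_t, section by section."""
    areas_s = np.asarray(areas_s, dtype=float)
    areas_t = np.asarray(areas_t, dtype=float)
    return alpha * areas_s + (1.0 - alpha) * areas_t
```

Interpolating areas instead of LPC coefficients keeps every intermediate reflection coefficient inside (-1, 1) as long as all areas are positive, so the intermediate filter stays stable.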


The morphing factor

Variant factor (from t=0 to t=T) => gradually changing voice.

Invariant factor => intermediate voice.

For a listener to hear a “linear” change between the source’s and the target’s voices, the morphing factor α (the relative part of u_s(t, φ)) has to vary nonlinearly in time; in this algorithm it varies logarithmically.
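The slide does not give the exact curve, so the following is only one hypothetical logarithmic schedule for a time-varying α. Both the formula and the shape parameter `c` are assumptions made for illustration; they are not taken from the project.

```python
import numpy as np

def log_morphing_factor(n_frames, c=9.0):
    """Hypothetical logarithmic schedule for alpha.

    Returns n_frames values that fall from 1 (pure source) at t = 0
    to 0 (pure target) at t = T. c > 0 controls the curvature.
    """
    t = np.linspace(0.0, 1.0, n_frames)          # normalized time t / T
    return 1.0 - np.log1p(c * t) / np.log1p(c)

alpha = log_morphing_factor(5)
```

A constant α (the invariant factor) corresponds to using a single value for all frames.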


Concatenating the new phonemes

The final morphed speech signal is created by concatenating the new voiced phonemes sequentially, along with the source’s or target’s unvoiced phonemes and silence periods.

[Figure: morphed waveform divided into segments: 1 silence, 2 voiced, 3 unvoiced, 4 voiced]


Demonstrations

Speaker A.

Speaker B.

Superposition intermediate signal.

Algorithm’s intermediate signal.

Superposition gradually changing signal.

Algorithm’s gradually changing signal.


Conclusion

Advantages:

The utterances produced were shown to consist of intermediate features of the two speakers, as opposed to those of the “naive solution”.

This field of study is relatively short of publicly available algorithms.

Limitations:

The unvoiced parts are not processed, and an “intermediate sound” needs to be better defined.


Surface Construction Algorithm

[Block diagram. Inputs: LPC parameters, pitch detection marks. Blocks: error signal calculation, short-time pitch function calculation, prototype waveform sampler, align between 0 and 2π, align with previous waveform, interpolate along time axis, LPF. Output: the error waveform surface.]
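The blocks listed for this slide can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the residual (error) signal and the pitch marks are already available, each pitch cycle is resampled linearly onto a fixed phase grid, alignment with the previous waveform uses circular cross-correlation, and the final low-pass filter along the time axis is omitted. The function name and its parameters are hypothetical.

```python
import numpy as np

def build_surface(residual, pitch_marks, n_phase=64, n_frames=None):
    """Build a (time, phase) surface from a residual signal and pitch marks."""
    residual = np.asarray(residual, dtype=float)
    marks = np.asarray(pitch_marks, dtype=int)
    phase_grid = np.arange(n_phase) / n_phase    # fraction of a cycle, [0, 1)
    cycles, centers = [], []
    for start, end in zip(marks[:-1], marks[1:]):
        cycle = residual[start:end]
        # Prototype sampler: map one pitch cycle onto phase 0..2*pi.
        src = np.arange(len(cycle)) / len(cycle)
        proto = np.interp(phase_grid, src, cycle, period=1.0)
        if cycles:
            # Align with the previous waveform: circular cross-correlation
            # via FFT, then rotate by the lag with the highest correlation.
            spec = np.fft.rfft(cycles[-1]) * np.conj(np.fft.rfft(proto))
            corr = np.fft.irfft(spec, n=n_phase)
            proto = np.roll(proto, int(np.argmax(corr)))
        cycles.append(proto)
        centers.append(0.5 * (start + end))
    cycles = np.array(cycles)
    if n_frames is None:
        n_frames = len(cycles)
    # Interpolate along the time axis, one phase column at a time.
    t_out = np.linspace(centers[0], centers[-1], n_frames)
    surface = np.column_stack(
        [np.interp(t_out, centers, cycles[:, j]) for j in range(n_phase)]
    )
    return t_out, surface
```

The alignment step keeps consecutive prototype waveforms in phase, so that interpolating along the time axis blends similar waveform features instead of smearing them.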


New Voiced Phoneme Creation

[Block diagram. Inputs: source and target error surfaces, source and target short-time pitch functions, source and target LPC parameters, modification parameter (α). Blocks: evaluate new p(t), calculate new φ(t), create new error surface, recover new error function, calculate new tube areas, calculate new LPC parameters, synthesize new speech signal. Output: new phoneme.]


Appendix - Equations

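$$ u_{new}(t, \phi) = \alpha \cdot u_s(\beta \cdot t, \phi) + (1 - \alpha) \cdot u_{t\text{-}aligned}(\gamma \cdot t, \phi) $$

$$ u_{t\text{-}aligned}(t, \phi) = u_t(t, \phi + \phi_n) $$

$$ \phi_n = \arg\max_{\phi_e} \left\{ \frac{\int_{\phi=0}^{2\pi} \int_{t=0}^{T_{new}} u_s(t, \phi) \cdot u_t(t, \phi + \phi_e) \, dt \, d\phi}{\lVert u_s \rVert \cdot \lVert u_t(t, \phi + \phi_e) \rVert} \right\} $$

$$ p_{new}(t) = \alpha \cdot p_s(\beta \cdot t) + (1 - \alpha) \cdot p_t(\gamma \cdot t) $$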
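The phase alignment between the source and target surfaces can be sketched as a search over circular shifts. This sketch assumes both surfaces are already sampled on the same (t, φ) grid, with one column per phase step, and tries every shift exhaustively. The function name is hypothetical.

```python
import numpy as np

def align_target(u_s, u_t):
    """Find the phase shift phi_n that best matches u_t to u_s.

    Returns the shifted target surface u_t(t, phi + phi_n) and phi_n
    in radians. Surfaces are arrays of shape (n_frames, n_phase).
    """
    u_s = np.asarray(u_s, dtype=float)
    u_t = np.asarray(u_t, dtype=float)
    n_phase = u_s.shape[1]
    # Correlation for each candidate shift. A circular shift does not change
    # the norm of u_t, so the normalizing denominator is the same for every
    # candidate and can be left out of the argmax.
    scores = [np.sum(u_s * np.roll(u_t, -k, axis=1)) for k in range(n_phase)]
    k_best = int(np.argmax(scores))
    phi_n = 2.0 * np.pi * k_best / n_phase
    return np.roll(u_t, -k_best, axis=1), phi_n
```

Aligning before interpolation matters because a weighted sum of two misaligned pitch cycles would partially cancel their main pulses.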


Appendix – transfer function.

[Plot: transfer function; both axes run from 0 to 1]


The Challenges cont.

Here is the same word (“shade”) as spoken by the source and by the target speaker.

The target speaker’s word lasts longer than the source speaker’s, and its “shape” is quite different.


Intermediate surface construction

Interpolation between source and target surfaces is carried out.

$$ u_{new}(t, \phi) = \alpha \cdot u_s(\beta \cdot t, \phi) + (1 - \alpha) \cdot u_t(\gamma \cdot t, \phi) $$

$$ \beta = T_{new} / T_s ; \quad \gamma = T_{new} / T_t $$

$$ T_{new} = \alpha \cdot T_s + (1 - \alpha) \cdot T_t $$
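A sketch of how surfaces of different durations could be brought to the common duration T_new before mixing. It assumes β and γ act as linear time-scale factors, and it works on a normalized time axis with durations counted in frames. Both function names are hypothetical.

```python
import numpy as np

def resample_time(surface, n_frames):
    """Linearly stretch or compress a surface along its time axis."""
    surface = np.asarray(surface, dtype=float)
    old = np.linspace(0.0, 1.0, surface.shape[0])   # normalized time, old grid
    new = np.linspace(0.0, 1.0, n_frames)           # normalized time, new grid
    return np.column_stack(
        [np.interp(new, old, surface[:, j]) for j in range(surface.shape[1])]
    )

def intermediate_surface(u_s, u_t, alpha):
    """Mix two surfaces of different durations on a common time axis.

    The new duration follows T_new = alpha * T_s + (1 - alpha) * T_t,
    measured here in frames.
    """
    u_s = np.asarray(u_s, dtype=float)
    u_t = np.asarray(u_t, dtype=float)
    n_new = int(round(alpha * u_s.shape[0] + (1 - alpha) * u_t.shape[0]))
    return alpha * resample_time(u_s, n_new) + (1 - alpha) * resample_time(u_t, n_new)
```

Unlike stretching the raw waveforms, stretching the surface along its time axis changes the duration without changing the shape of the individual pitch cycles.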
