area efficient audio morphing processor based on...

5
IJDACR ISSN: 2319-4863 International Journal of Digital Application & Contemporary research Website: www.ijdacr.com (Volume 3, Issue 1, August 2014) Area Efficient Audio MORPHING Processor based on FFT Navneet Kour Gandhi [email protected] Prof. Preet Jain [email protected] Abstract Voice morphing is a name given to procedures which take speech as input from one speaker and attempt to generate speech that sounds like it came from another speaker. One compelling argument for good voice morphing is that it lessens the trouble in creating additional synthetic voices with new characters and styles once an existing voice has been created based on a full-sized corpus. There are further voice transformation applications for security, privacy, and assistive technologies. Although current voice transformation techniques perform well in the sense that humans typically judge transformed speech to sound more like the target speaker than the source speaker, there is still room for improvement. We investigate the use of FFT to extract the feature difference and store it in RAM, then for morphing, FFT source voice is applied to it and further feature difference is applied on it. Keywords FFT, RAM, Voice Morphing. I. INTRODUCTION Among all the mechanisms that allow humans communicating and interacting with each other, speech is the most natural and precise one. Nowadays, the scientific community tries to face the challenge of designing speech-based human- computer interfaces, extending the role of speech to certain real-life situations in which more primitive ways of interaction (keyboard, mouse, joystick, graphic user interfaces, commands, buttons, etc.) are used to present. In other words, it is intended to make machines recognize well what human speakers say, and answer by generating output utterances that the listeners are capable of understanding, trying to imitate the human way of communicating with similar naturalness and precision. The development of speech technologies has led to a wide variety of research areas related to different tasks involved in making computers interact orally with humans: modelling of speech production and perception, prosody analysis and generation, speech and audio processing, enhancement, coding and transmission, speech synthesis, analysis and synthesis of emotions in expressive speech, speech and speaker recognition, speech understanding, accent and language identification, cross and multi-lingual processing, multimodal signal processing, dialogue systems, information retrieval, translation, applications for handicapped persons, etc. The voice morphing model should be capable of performing two main tasks: By using a certain amount of training information logged from a particular source and target speakers, the model has to conclude the optimal morphing information for converting one voice into the other one [1]. The model has to apply this optimal morphing information to convert new input utterances of the source speaker [2]. In this form of framework, speech synthesis can be considered as the artificial construction of human speech. The central topic of this paper, voice morphing, can be considered a part of the speech synthesis area. The objective of voice conversion models is to transform the voice created by a specific speaker, referred to as the source speaker, into a different particular speaker, named as target speaker [3]. Therefore, the features of the source speaker have to be recognized by the model and swapped with those of the target speaker, without dropping any information or enhancing the message that is being transmitted. Figure 1: Basic flow of voice morphing II. MOTIVATION AND OBJECTIVE Voice morphing is a very useful tool, which has wide area of applications like customizing voices for text to speech model, to produce the sounds of a well-known celebrity for adverts and films, and enhancing the speech of impaired speakers such as laryngectomies. Two basic necessities of many of these uses are that primarily they should not rely on large amounts of Source Speaker Feature Extraction Feature Extraction Source Speaker Target Speaker Morphing

Upload: others

Post on 22-Apr-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Area Efficient Audio MORPHING Processor based on FFTijdacr.com/sites/default/files/volume3/aug14/NK.pdf · 2016-09-06 · Voice morphing is a very useful tool, which has wide area

IJDACR

ISSN: 2319-4863

International Journal of Digital Application & Contemporary research

Website: www.ijdacr.com (Volume 3, Issue 1, August 2014)

Area Efficient Audio MORPHING Processor based on

FFT

Navneet Kour Gandhi

[email protected]

Prof. Preet Jain

[email protected]

Abstract – Voice morphing is a name given to

procedures which take speech as input from one

speaker and attempt to generate speech that sounds

like it came from another speaker. One compelling

argument for good voice morphing is that it lessens the

trouble in creating additional synthetic voices with new

characters and styles once an existing voice has been

created based on a full-sized corpus. There are further

voice transformation applications for security,

privacy, and assistive technologies. Although current

voice transformation techniques perform well in the

sense that humans typically judge transformed speech

to sound more like the target speaker than the source

speaker, there is still room for improvement. We

investigate the use of FFT to extract the feature

difference and store it in RAM, then for morphing,

FFT source voice is applied to it and further feature

difference is applied on it.

Keywords – FFT, RAM, Voice Morphing.

I. INTRODUCTION

Among all the mechanisms that allow humans

communicating and interacting with each other,

speech is the most natural and precise one.

Nowadays, the scientific community tries to face the

challenge of designing speech-based human-

computer interfaces, extending the role of speech to

certain real-life situations in which more primitive

ways of interaction (keyboard, mouse, joystick,

graphic user interfaces, commands, buttons, etc.) are

used to present. In other words, it is intended to

make machines recognize well what human speakers

say, and answer by generating output utterances that

the listeners are capable of understanding, trying to

imitate the human way of communicating with

similar naturalness and precision. The development

of speech technologies has led to a wide variety of

research areas related to different tasks involved in

making computers interact orally with humans:

modelling of speech production and perception,

prosody analysis and generation, speech and audio

processing, enhancement, coding and transmission,

speech synthesis, analysis and synthesis of emotions

in expressive speech, speech and speaker

recognition, speech understanding, accent and

language identification, cross and multi-lingual

processing, multimodal signal processing, dialogue

systems, information retrieval, translation,

applications for handicapped persons, etc.

The voice morphing model should be capable of

performing two main tasks:

By using a certain amount of training

information logged from a particular source

and target speakers, the model has to conclude

the optimal morphing information for

converting one voice into the other one [1].

The model has to apply this optimal morphing

information to convert new input utterances of

the source speaker [2].

In this form of framework, speech synthesis can be

considered as the artificial construction of human

speech. The central topic of this paper, voice

morphing, can be considered a part of the speech

synthesis area. The objective of voice conversion

models is to transform the voice created by a specific

speaker, referred to as the source speaker, into a

different particular speaker, named as target speaker

[3]. Therefore, the features of the source speaker

have to be recognized by the model and swapped

with those of the target speaker, without dropping

any information or enhancing the message that is

being transmitted.

Figure 1: Basic flow of voice morphing

II. MOTIVATION AND OBJECTIVE

Voice morphing is a very useful tool, which has wide

area of applications like customizing voices for text to

speech model, to produce the sounds of a well-known

celebrity for adverts and films, and enhancing the

speech of impaired speakers such as laryngectomies.

Two basic necessities of many of these uses are that

primarily they should not rely on large amounts of

Source

Speaker

Feature

Extraction

Feature

Extraction

Source

Speaker

Target

Speaker

Morphing

Page 2: Area Efficient Audio MORPHING Processor based on FFTijdacr.com/sites/default/files/volume3/aug14/NK.pdf · 2016-09-06 · Voice morphing is a very useful tool, which has wide area

IJDACR

ISSN: 2319-4863

International Journal of Digital Application & Contemporary research

Website: www.ijdacr.com (Volume 3, Issue 1, August 2014)

parallel training information where both speakers recite

identical texts, and subsequently, the high audio

excellence of the source must be conserved in the

transformed speech. The fundamental procedure in a

voice morphing model is the alteration of the spectral

envelope of the source speaker to resemble that of the

target speaker and various methods have been

projected for doing this. Yet other possible applications

could be speech-to-speech translation and dubbing of

television programs

There are some issues that must be discussed

before constructing a voice morphing model. Initially,

an accurate model must be selected which permits the

speech signal to be regenerated and manipulated with

least alteration. Earlier study recommends that the

sinusoidal system is a good candidate, in principle at

least, this kind of system can support changes to both

the prosody and the spectral features of the source

signal without the use of significant artifacts. But, in

run time, the conversion excellence is always

negotiated by phase incoherency in the redeveloped

signal, in order to diminish this problem, we use the

phase vocoder approach.

Second, the acoustic information which allow

humans to recognize speakers must be mined and

coded. These information should not dependent on the

message and the environment so that wherever the

source speaker speaks, his/her voice features can be

positively transformed to sound like the target speaker.

III. PROPOSED METHODOLOGY

Figure 2: Block diagram of proposed methodology

The analysis enables two pieces of information to be

obtained: pitch information and the overall envelope

of the sound. A key element in the morphing is the

manipulation of the pitch information. If two signals

with different pitches were simply cross-faded it is

highly likely that two separate sounds will be heard.

This occurs because the signal will have two distinct

Morphing

Extraction

Source

RAM

Target

RAM

FFT

Storage

FFT

Morphing

Block

O/P

RAM

Feature

Extraction

Target

Signal

Feature

Extraction

Source

Page 3: Area Efficient Audio MORPHING Processor based on FFTijdacr.com/sites/default/files/volume3/aug14/NK.pdf · 2016-09-06 · Voice morphing is a very useful tool, which has wide area

IJDACR

ISSN: 2319-4863

International Journal of Digital Application & Contemporary research

Website: www.ijdacr.com (Volume 3, Issue 1, August 2014)

pitches causing the auditory system to perceive two

different objects.

A successful morph must exhibit a

smoothly changing pitch throughout. The pitch

information of each sound is compared to provide

the best match between the two signals' pitches. To

do this match, the signals are stretched and

compressed so that important sections of each signal

match in time. The interpolation of the two sounds

can then be performed which creates the

intermediate sounds in the morph. At last stage

convert this frames back into a normal waveform.

However, after the morphing has been performed,

the legacy of the earlier analysis becomes apparent.

The conversion of the sound to a representation in

which the pitch and spectral envelope can be

separated loses some information. Therefore, this

information has to be re-estimated for the morphed

sound. This process obtains an acoustic waveform,

which can then be stored or listened to.

So as per our work, we first take samples of

source and destination voice and converted both to

the frequency domain using FFT and extracted

features of both. Then the difference between both

is calculated, it gives info that how the source is

different from the target. Then the full source voice

is varied by that difference thus we get morphed

voice to target.

Definition of Fourier Transform

The Fourier transform (FT) of the function f (x) is

the function 𝐹 (𝜔), where:

𝐹(𝜔) = ∫ 𝑓(𝑥)𝑒−𝑖𝜔𝑥𝑑𝑥∞

−∞

𝑓(𝑥) = 1/2𝜋 ∫ 𝐹(𝜔)𝑒𝑖𝜔𝑥𝑑𝜔∞

−∞

(1)

Where 𝑖 = √−1 and 𝑒𝑖𝜃 = cos 𝜃 + 𝑖 sin 𝜃

Think of it as a transformation into a different set of

basic functions. The Fourier transform uses complex

exponentials (sinusoids) of various frequencies as its

basis functions.

A Fourier transform pair is often written 𝑓(𝑥) ↔

𝐹(𝜔), 𝑜𝑟 𝐹(𝑓(𝑥)) = 𝐹(𝜔)Where F is the Fourier

transform operator.

If 𝑓 (𝑥)is thought of as a signal (i.e. input data) then

we call 𝐹(𝜔 ), the signal’s spectrum. If f is thought

as the impulse response of a filter (which operates

on input data to produce Output data) then we call F

the filter’s frequency response.

Fast Fourier Transform (FFT) Algorithm

The FFT is a fastest algorithm for calculating the

DFT of the signal. If we take the 2-point DFT and 4-

point DFT and generalize them to 8-point, 16-

point... 2r-point, we get the FFT algorithm.

There are many variants of the FFT

algorithm. We will debate one of them, the

“decimation-in-time” FFT procedure for orders

whose span is a power of two (N = 2r for some

integer r).

Figure 3 shows diagram of an 8-point FFT,

𝑊 = 𝑊8 = 𝑒𝑖𝜋/4 = (1 − 𝑖)/√2 (2)

Figure 3: Butterfly diagram for 8-point FFT

𝑎0

𝑎1

𝑎2

𝑎3

𝑎4

𝑎5

𝑎6

𝑎7

−1

−1

−1

1

1

1

𝑊0

𝑊2

𝑊6

𝑊4

𝑊0

𝑊2

𝑊6

𝑊4

𝑊0

𝑊1

𝑊3

𝑊2

𝑊4

𝑊5

𝑊7

𝑊6

𝐴0

𝐴1

𝐴3

𝐴2

𝐴4

𝐴5

𝐴7

𝐴6

1

−1

Page 4: Area Efficient Audio MORPHING Processor based on FFTijdacr.com/sites/default/files/volume3/aug14/NK.pdf · 2016-09-06 · Voice morphing is a very useful tool, which has wide area

IJDACR

ISSN: 2319-4863

International Journal of Digital Application & Contemporary research

Website: www.ijdacr.com (Volume 3, Issue 1, August 2014)

IV. SIMULATION AND RESULTS

Figure 4: Simulation waveform for Test

Figure 5: Simulation waveform for U2

Figure 6: Simulation waveform for U3

Figure 7: Simulation waveform for U4

Figure 8: Simulation waveform for U5

Figure 9: Simulation waveform for U55

V. CONCLUSION

The speech model, voice source and transformation

technique are proposed in this work to make our

system more robust to extreme applications.

Proposed phase vocoder shows promising

performance. However, there is still considerable

room for further improvement mostly regarding our

aimed approach. The areas which would greatly

benefit from future work are briefly discussed. Some

of the altered features are still not accurate enough,

and this is the most vital feature to expand in order

to carry on with this working field.

Another idea in order to improve the

features is to transform the very end of the feature in

a different way than the rest of the feature. It has

been observed that, in the very end of some features,

some formants change place, so that the vowel and

the degree of the feature vary. This would be also an

applicable feature to work on, and it could enhance

the results dramatically.

REFERENCE [1] Mireia Farrus, Daniel Erro, and Javier Hernando,

“Speaker Recognition Robustness to Voice

Conversion”, Jornadas de Reconocimiento Biométrico de Personas, PP. 73-82, 2008.

Page 5: Area Efficient Audio MORPHING Processor based on FFTijdacr.com/sites/default/files/volume3/aug14/NK.pdf · 2016-09-06 · Voice morphing is a very useful tool, which has wide area

IJDACR

ISSN: 2319-4863

International Journal of Digital Application & Contemporary research

Website: www.ijdacr.com (Volume 3, Issue 1, August 2014)

[2] Mireia Farrús, Michael Wagner, Daniel Erro and

Javier Hernando, “Automatic speaker recognition as a measurement of voice imitation and conversion”,

International Journal of Speech Language and the

Law, Vol. 17, No. 1, 2010. [3] Helenca Duxans i Barrobes, “Voice Conversion

applied to Text-to-Speech systems”, Barcelona, May

2006. [4] Dharavathu Vijaya Babu, R. V. Krishnaiah “Voice

Morphing System for People Suffering from

Laryngectomy”, International Journal of Science and

Research, ISSN: 2319‐7064, 2013. [5] Shaik Shafee, B. Anuradha “Voice Conversion Using

Different Pitch Shifting Approach over TD-PSOLA

Algorithm”, International Journal of Advanced Research in Computer and Communication

Engineering, 2013.

[6] Parmod Kumar and Anuj “VHDL Implementation of Phase Vocoder for Voice Morphing”, International

Journal of New Trends in Electronics and

Communication, 2013. [7] J.H.Nirmal, Suparva Patnaik, and Mukesh A.Zaveri

“Line Spectral Pairs Based Voice Conversion using

Radial Basis Function”, ACEEE Int. J. on Signal & Image Processing, Vol. 4, No. 2, May 2013.

[8] Palak Chawda, Prof. Jaikaran Singh, Prof. Mukesh

Tiwari, “High Speed FFT Based Audio MORPHING Processor Using VHDL”International Journal Of

Engineering Sciences & Research Technology, ISSN

2277-9655. [9] Radhika Karthikeyan and G.Ayyappan, “Cepstral

approach in voice morphing”, Computer Science and

Engineering, 2013. [10] Jagannath H Nirmal, Mukesh A Zaveri, Suprava

Patnaik and Pramod H Kachare “A novel voice

conversion approach using admissible wavelet packet decomposition”, EURASIP Journal on Audio,

Speech, andMusic Processing 2013.

[11] Yogesh Ganvit, M.A.Lokhandwala, Ninad S. Bhatt “Implementation & Overall Performance Evaluation

Of Voice Morphing Based On PSOLA”, International

Journal of Advanced Engineering Technology, E-ISSN 0976-3945, 2012.

[12] Aman Chadha, Bharatraaj Savardekar, Jay Padhya

“Analysis of a Modern Voice Morphing Approach using Gaussian Mixture Models for Laryngectomees”,

International Journal of Computer Applications (0975

– 8887) Volume 49– No.21, July 2012. [13] Allam Mousa, “Speech Segmentation in Synthesized

Speech Morphing Using Pitch Shifting”, The

International Arab Journal of Information Technology, Vol. 8, No. 2, April 2011.